Architectures and precision analysis for modelling atmospheric variables with chaotic behaviour

Francis P. Russell*, Peter D. Düben†, Xinyu Niu‡, Wayne Luk§, T. N. Palmer¶
Department of Computing, Imperial College London, London, SW7 2AZ, UK
{francis.russell02*,niu,xinyu10‡,w.luk§}@imperial.ac.uk
Atmospheric Oceanic and Planetary Physics, University of Oxford, Oxford, OX1 3PU, UK
{dueben†,t.n.palmer¶}@atm.ox.ac.uk

Abstract—The computationally intensive nature of atmospheric modelling is an ideal target for hardware acceleration. Performance of hardware designs can be improved through the use of reduced precision arithmetic, but maintaining appropriate accuracy is essential. We explore reduced precision optimisation for simulating chaotic systems, targeting atmospheric modelling in which even minor changes in arithmetic behaviour can have a significant impact on system behaviour. Hence, standard techniques for comparing numerical accuracy are inappropriate. We use the Hellinger distance to compare statistical behaviour between reduced-precision CPU implementations to guide FPGA designs of a chaotic system, and analyse accuracy, performance and power efficiency of the resulting implementations. Our results show that with only a limited loss in accuracy corresponding to less than 10% uncertainty in input parameters, a single Xilinx Virtex 6 SXT475 FPGA can be 13 times faster and 23 times more power efficient than a 6-core Intel Xeon X5650 processor.

I. INTRODUCTION

Climate and weather prediction are computationally intensive; even with high-performance computing resources, it is typically impossible to resolve important convective cloud systems in global models [1]. Numerical models of weather and climate show significant model error due to limited resolution and complexity, necessitating a need for even more resource-intensive models. Performance and power requirements for running such models with hard time constraints have led to the exploration of hardware accelerators to obtain greater throughput and power efficiency.

A common approach to enhance throughput of hardware designs is to reduce precision of computations so that additional hardware resources can be employed to increase parallelism. Excessive precision reduction however, reduces calculation quality and usefulness, so a trade-off must be made between performance and accuracy.

Lorenz showed that weather forecasting involves the prediction of a chaotic system [2]. This implies an exponential growth of errors in initial conditions; significant divergence between a CPU and hardware implementation is expected simply due to implementation differences, and is not in itself an indicator of error. Diagnostics that rely on solution convergence between implementations for validation are inappropriate here.

Short-term forecasts may have uncertainty in initial conditions due to the nature of measurement. Also, the response to a forcing (e.g. a change in CO₂ concentration) is difficult to predict and must be tested with numerical simulation. Given the chaotic nature of weather and initial condition uncertainty, one may consider a reduction in precision appropriate so long as the behaviour change introduced is acceptable compared to those introduced by other factors.

We investigate the reduction of precision in chaotic systems using the Lorenz 1996 model (a.k.a. Lorenz 1995), designed by Lorenz [3] to study interactions of atmospheric processes with non-linear, chaotic dynamics. Our analysis has applicability beyond this model – scale interactions within the Lorenz ’96 model resemble scale interactive behaviour of various parts of numerical atmosphere models that is typically difficult to capture in idealised systems. Similar scale interactions are also important for turbulent energy cascades relevant to most applications in CFD (computational fluid dynamics). This paper contributes and presents:

- the hardware architecture of a two-scale Lorenz ’96 simulation using Runge-Kutta time-stepping;
- a demonstration of how it is possible to make trade-offs between precision and throughput for a system where numerical divergence is expected;
- an analysis of the impact of varying the precision of variables at different scales in the Lorenz ’96 implementation using error metrics appropriate to chaotic systems (the Hellinger distance);
- performance, precision and power consumption comparisons of reduced-precision FPGA implementations to an optimised CPU implementation.

II. BACKGROUND

A. Lorenz ’96

The two-scale Lorenz ’96 model (hereafter called L96) was designed by Lorenz as a simple model to study predictability in a chaotic system. Despite its simplicity, L96 shares important properties of numerical atmosphere models such as scale interactions and chaotic behaviour. Also, its numerical
is the time-scale ratio. Often $c = 1$ and $b = \frac{1}{3}$.

Equation (2) is a forcing term, $K_{c,b} = h_c X_{c,b}$ to hundreds of thousands. We look at how to scale $K$ through a summation so the spatial-scale ratio and $c$ is the time-scale ratio. Often $h = 1$, $F = 20$, $b = 10$ and $c = 4$ or $c = 10$.

Global and local-scale variables are arranged circularly (Fig. 1) such that $X_k = X_{k+K}$, $Y_{j,k} = Y_{j,k+1}$ and $Y_{j,k} = Y_{j-k+1}$. For clarity we show a small system; in practice we look at how to scale $J$ and $K$ to hundreds of thousands. We explore how to scale both $J$ and $K$, but only run simulations with relatively small values of $J$ (144) since local-scale values only influence global-scale values through a summation so larger values of $J$ result in statistically similar behaviour.

For all simulations, we temporally discretise using Runge-Kutta. This has a significant impact on our accelerator design due to the number of intermediate terms required during each time-step.

B. Related Work

While FPGAs have been used in implementing chaotic systems, previous work has focused on security and digital communication [4]. This paper provides the first FPGA architecture and precision analysis for atmospheric modelling with chaotic behaviour.

Oriato et al. have accelerated the dynamical core of a Limited Area Model derived from the BOLAM model [5]. The model includes three-dimensional non-linear equations covering conservation of momentum, continuity and thermal energy. An explicit finite-difference scheme is used to solve these equations over a staggered C-grid. Parts of the computation are converted to use fixed point arithmetic.

Gan et al. have investigated a hybrid CPU-FPGA approach for computing the upwind stencil of the global shallow water equations [6] using a finite-volume and TVD Runge-Kutta discretisation on a cubed-sphere mesh. Range analysis is applied to kernels, and used to transform values to fixed point and reduced-precision floating point. Despite the stencil’s small size, the high computational intensity leads to significant speedups in the FPGA implementation.

A major difference from our work is that the two studies above consider short-term simulations that show only limited propagation of model errors due to the chaotic dynamics of the atmosphere. We consider the influence of reduced precision on a long-term diagnostic for a system with strong chaotic behaviour.

Düben et al. have explored simulated stochastic processors and low-precision arithmetic in the context of atmospheric modelling for short [7] and long-term simulations [8]. Their two-level L96 model is run with bits in the mantissa randomly flipped after floating point operations, emulating a stochastic processor. Additionally, the dynamic core of the Intermediate Global Climate Model is run with both reduced precision, and with stochastic processor emulation for global simulations. Results suggest that imprecise computing strategies can be aggressively applied to small-scale dynamics without significantly altering those at larger scales.

III. ARCHITECTURE

A. System Overview

The entire L96 Runge-Kutta update is moved to a hardware accelerator. The system state is copied from host memory to the accelerator’s off-chip memory prior to computation. Arbitrary numbers of Runge-Kutta updates can be performed with the system state sent back to the host when needed.

All computations are moved to hardware, but host involvement remains during updates. Memory controller commands are sent by the host to the accelerator during computation since data access patterns for multiple iterations (Sec. III-G) are expensive to compute in hardware. The data-volume is small, so transfer time is not a limiting factor.

The structure of our design is shown in Fig. 2. The padding/striping steps add/remove padding needed to satisfy...
The matrix is flattened to memory in column-major order so that all values associated with the same type as the Y values. However, since we couple elements associated with the X entry is sent/received by kernels simultaneously with the first element of the associated Y-values.

\[
\begin{pmatrix}
X_1 & X_2 & \cdots & X_K \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
Y_{i,1} & Y_{i,2} & \cdots & Y_{i,K} \\
Y_{2,1} & Y_{2,2} & \cdots & Y_{2,K} \\
& \vdots & \ddots & \vdots \\
Y_{J,1} & Y_{J,2} & \cdots & Y_{J,K}
\end{pmatrix}
\]

Figure 3. State of the L96 system arranged as a matrix. Each column contains a single X element and all Y elements associated with the X element. Each column may contain zero or more padding elements after the X element due to vectorisation (Sec. III-F).

of \(X^{n+1}\) and \(Y^{n+1}\). Each step is specialised to avoid unnecessary computation and communication. X and Y remain deinterleaved during processing, but are synchronised so that each X entry is sent/received by kernels simultaneously with the first element of the associated Y-values.

D. Precision

We analyse the effect of precision reduction at the level of global and local-scale values of L96. In weather and climate modelling, it is important to represent large-scale patterns (analogous to global-scale quantities in L96) accurately, but of lesser importance to represent small-scale dynamics (analogous to local-scale quantities) in high accuracy. Small-scale dynamics are inherently uncertain due to the strong influence of parametrisation schemes and viscosity in atmosphere models and initial value uncertainty. Therefore, precision of local-scale quantities should be chosen to be equal to or less than the precision of global-scale quantities [8].

In our design, all values are stored in off-chip memory as IEEE 754 single-precision floating-point values. Global and local-scale quantities are converted to and from lower precision in the “deinterleave” and “interleave” kernels, respectively (Fig. 2).

All intermediate values such as derivatives and slope estimates are maintained in the same precision as the associated X or Y state. For the summation of Y values needed by the X-derivative, we use an accumulator of the same type as the X elements for improved precision.

E. Boundary Conditions

Calculation of X and Y derivatives both involve stencil operations with circular boundary conditions. Both stencils are small, however:

- Data dependencies force output to be delayed and reordered with respect to the stencil input. Since we pass the original input and Runge-Kutta intermediate values through the pipeline, these must be reordered and delayed in an identical manner.
- Since we couple X and Y, any buffering of X elements requires buffering of entire columns of Y. We wish to avoid this as much as possible since the required buffer size will increase with \(K\).
we could have replicated the Runge-Kutta pipeline, but the vectorisation approach has a number of benefits.

Our design is constructed such that we can choose an arbitrary vector length \( v_n \) representing the number of elements read from a \( Y \)-related stream in a single cycle. We enforce the requirement that \( J \) is a multiple of \( v_n \).

The arithmetic required to implement the \( Y \)-derivative calculation is duplicated \( v_n \) times. Since we do not duplicate the pipeline, we generate no additional counters nor communication channels between kernels. Most importantly, the sizes of the buffers required to store and delay \( Y \)-related values are nearly independent of \( v_n \) so increasing \( v_n \) has minimal effect on on-chip memory usage.

Calculation of the \( X \)-derivative involves a summation over each column of \( Y \). We implement an accumulator that accepts one element per cycle, which is relatively costly in terms of adders. With vectorisation we avoid duplicating the accumulator; instead we use another \( v_n - 1 \) adders to sum the vector elements for input to the accumulator.

One overhead incurred is that each \( X \)-state element must be padded to \( v_n \) elements. For an arbitrary system, this means that the fraction of padding of an in-memory representation of the system will be \( \frac{v_n - 1}{v_n} \). For a typical value of \( J = 128 \), a system where \( v_n = 8 \) will consist of 5.4% padding and where \( v_n = 16 \), it will have 11.6% padding.

G. Multiple Iterations

We can perform multiple L96 time-steps with only a single invocation from the host machine. As described in Section III-E, a “pre-read” region is read at both the start and end of a time-step update. Therefore, the pre-read region cannot be overwritten until it has been read both times.

The memory controller requires all reads and writes to have sizes and occur at addresses which are multiples of a burst size (384 bytes). During a time-step update, the new state is written starting at the first burst that does not contain any pre-read data, causing the start of the L96 state to move forward in memory each time-step.

The L96 state is padded to the memory controller’s burst size and circular addressing is used such that the flattened representation of the system state (including padding) “rotates” in memory by multiples of the burst size; rotation is undone by the controller on state transfer to the host. Fig. 5 illustrates the memory layout transformation caused by the first time-step.

IV. PRECISION ANALYSIS

A. Metrics

L96 exhibits complex wave-like and chaotic behaviour [9]. Specifically, small perturbations in the initial conditions of the system will result in significant differences in its evolution. Consequently, we cannot use a direct comparison of state between hardware and CPU implementations after a large number of time steps due to the expected divergence.
variables, averaged \( (9) \) = 10 and \( = 2000 \) (see for example \( - K = 64 \) = 40 \( J \)).

The climatology of L96 is typically defined as the probability density function (PDF) of the \( \) variables, averaged over a long run: 10,000 model time units, equivalent to more than 100 “atmospheric years” \[3\]. For assessing simulation accuracy, we compare the PDFs of both local and global-scale variables.

A metric often used to compare PDFs is the **Hellinger distance** (see for example \[10\]). For two probability distributions \( p \) and \( q \), the Hellinger distance \( H \) is defined as:

\[
H(p, q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 \, dx} \tag{9}
\]

Taking the Hellinger distance between a high-precision CPU implementation and reduced-precision implementations enables us to determine the quality of the simulation. As a baseline, we calculate the Hellinger distance between two high-precision CPU implementations with different starting states. Since the Hellinger distance is a measure of long-term dynamics, a change in initial conditions should not change the PDF and the Hellinger distance should be close to zero. The Hellinger distance between simulations that only differ in the initial conditions can serve as an estimate of the uncertainty in the measurement of the Hellinger distance for a given model setup.

### B. Precision Reduction

Reducing calculation precision enables the reduction of hardware resource utilisation, permitting instantiation of additional functional units for increasing throughput. For the L96 simulations, we wish to determine the extent to which we can increase throughput while maintaining an acceptable level of accuracy.

We only consider reduced precision floating point and do not attempt to use fixed point. We do not attempt to optimise each intermediate value in our pipeline; rather, we investigate the effects of precision reduction on terms associated with the global and local-scale quantities of L96. Düben et al. have shown for an atmosphere model that the precision to calculate small-scale dynamics can be reduced much further than the precision for large-scale dynamics with a smaller influence on model results and forecast quality \[8\].

![Figure 5. Rotation of the L96 system state caused by a single time-step update](image)

![Figure 6. Hellinger distances (smaller is better) for X and Y state values between a double-precision CPU implementation and the CPU implementation using truncated precision for local and global-related values. \( F = 40, b = 10, c = 10, J = 64, K = 2000 \) and the simulation was run for 3,750,000 time-steps with the state sampled every 1000 time-steps.](image)
Table I
HELLINGER DISTANCES (4 S.F.) BETWEEN A DOUBLE-PRECISION SIMULATION AND THE SAME WITH DIFFERENT INITIAL CONDITIONS (AVERAGED FROM FIVE RUNS) AND ALTERED PARAMETERS.

<table>
<thead>
<tr>
<th>Run</th>
<th>H (global)</th>
<th>H (local)</th>
<th>Est. min. mantissa (global)</th>
<th>Est. min. mantissa (local)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Changed initial conditions</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>c × 0.99, F × 0.99</td>
<td>2.788e-3</td>
<td>1.939e-4</td>
<td>13</td>
<td>12</td>
</tr>
<tr>
<td>c × 1.01, F × 1.01</td>
<td>5.042e-3</td>
<td>1.719e-3</td>
<td>13</td>
<td>10</td>
</tr>
<tr>
<td>c × 0.9, F × 0.9</td>
<td>4.047e-2</td>
<td>1.625e-2</td>
<td>11</td>
<td>9</td>
</tr>
<tr>
<td>c × 1.1, F × 1.1</td>
<td>3.620e-2</td>
<td>1.415e-2</td>
<td>11</td>
<td>9</td>
</tr>
</tbody>
</table>

We estimate minimum mantissa widths required to produce a simulation with smaller global and local-scale Hellinger distances based on the results from CPU simulation. Where multiple combinations of global and local-scale mantissa lengths satisfy the requirements, we choose the configuration with the smallest combined mantissa length.

Figure 7. Exponent (base-2) distributions for global and local-scale values derived from CPU simulation with $F = 40$, $b = 10$ and $c = 10$. The profile includes all intermediate values calculated in the specified precision. 

c and F altered by 1% and 10% (Table I). Model parameters always have a degree of uncertainty, and we consider 1% to be realistic for most numerical models in Computational Fluid Dynamics and 10% to be reasonable within atmospheric modelling [11].

Runtime profiling of floating-point exponent values in the CPU simulation was performed for 8000 time-steps (Fig. 7). Results suggest that exponent values of up to 10 and 11 are required to capture the largest global and local-scale values, respectively. Exponent widths of 5-bits for both global and local-scale values can represent the majority ($\geq 99.999\%$) of values generated during a simulation.

V. RESULTS

Hardware simulations were performed on a Maxeler MAX3A Vectis Dataflow Engine (DFE) which contains a Xilinx Virtex 6 SX1475 FPGA and 24GB of DDR3 RAM. Connection to the host was via PCI express. All designs were compiled to run at 150MHz.

To compare performance between DFE and CPU, we wrote an optimised L96 simulation in C++. OpenMP was used for intra-node parallelisation and NUMA functionality employed to prevent thread migration between physical cores. The code was compiled with the Intel C++ Compiler 12.1.4. CPU simulations were run on the DFE host machine which contained two 6-core Intel Xeon X5660s running at 2.67GHz, each having 12MB L3 cache (System 2 in Sec. V-C). Hyper-threading was found to be beneficial so two threads were assigned to each physical core. The host was running Centos 6.3.

Four DFE designs were built for benchmarking and precision analysis: one at single precision, one at the precision estimated necessary for similar accuracy to single-precision and two at the precision level estimated to have errors of the same order of magnitude as those caused by 1% uncertainty in the input parameters. To enable comparison with the Hellinger distances in Table I, design DFE3 was built with a reduced vector width. DFE3 can be run with $J = 64$ permitting comparison with the CPU simulations but DFE4 cannot since $J$ is not a multiple of the vector width.

All designs (except DFE3) were compiled with the maximum supported vector width (Table II). The notation float(e, m) denotes a floating point type with an e-bit exponent and m-bit mantissa, including the implicit bit. The maximum value of $J$ each design could support was reduced to 512 since larger values are of minimal interest to climate scientists and this improves the chances of routing. DFE2 was compiled with a 6-bit exponent for global-scale values due to restrictions on mantissa-exponent size combinations imposed by the Maxeler tools. We compiled a design using float(5,13) for both global and local-scale variables with a vector width of 16 and a maximum value of $J$ of 128000, demonstrating we can scale $J$ much higher if necessary.

A. Precision

We calculate the Hellinger distances for the local and global-scale state variables between each DFE implementation and a double-precision CPU implementation (Table III).

For $J = 64$, DFE1 and DFE2 have Hellinger distances similar to those seen between double-precision simulations.
We also subtract the static power requirements of unused Maxeler cards from our reported figures (the average reported static power requirement per card is 19.6W). CPU implementations were run with two threads per hyper-threaded physical core. Relative efficiency uses the most power efficient CPU execution as the baseline.

Our CPU implementation scales well from 6 to 12 cores. To achieve this, we use a “first-touch” policy for initialising the L96 state so memory is allocated in the NUMA domain of each thread. Prior to implementing this, performance improvement when moving from 6 to 12 cores was minimal, also indicating the implementation is bandwidth limited for large sizes.

C. Power Consumption

We measure the power requirements of the CPU and FPGA implementations on two systems:

1) A hyper-threaded four-core Intel Core i7 870 running at 2.93 GHz with 8MB L3 cache, 16GB RAM and a single MAX3A Vectis card. Total idle power consumption of 93W.

2) A dual-socket machine with each socket containing a hyper-threaded six-core Intel Xeon X5650 running at 2.67 GHz with 12MB L3 cache and four MAX3A Vectis cards. The system had 48GB of RAM with 24GB in each socket’s NUMA domain. Total idle power consumption of 373W.

Total workstation power consumption was measured when running each implementation in different configurations (Table V). We subtract the static power requirements of the unused Maxeler cards from our reported figures (the average reported static power requirement per card is 19.6W).

In terms of computation per Watt, the reduced precision designs achieve power efficiency of up to 23 times higher than the most efficient CPU-only system. Assuming perfect scaling (not unreasonable considering the need in weather forecasting to calculate ensembles) system 1 could reach manual vectorisation would likely increase performance for these sizes. At $K = 6400$, $J = 144$ the size of a double precision L96 system state is just over 7MB which can reside within the 12MB L3 cache of a single socket. Larger systems no longer fit into cache and the single-precision implementation achieves $\approx 2$ times the performance of the double-precision one for these sizes, suggesting they are bandwidth limited.

Our CPU implementation scales well from 6 to 12 cores. To achieve this, we use a “first-touch” policy for initialising the L96 state so memory is allocated in the NUMA domain of each thread. Prior to implementing this, performance improvement when moving from 6 to 12 cores was minimal, also indicating the implementation is bandwidth limited for large sizes.

Table IV

<table>
<thead>
<tr>
<th>Build</th>
<th>Speedup (3 s.f.)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU (single, 1 core)</td>
<td>1.00</td>
</tr>
<tr>
<td>CPU (single, 6 cores)</td>
<td>0.9594</td>
</tr>
<tr>
<td>CPU (single, 12 cores)</td>
<td>0.906</td>
</tr>
<tr>
<td>CPU (double, 1 core)</td>
<td>0.921</td>
</tr>
<tr>
<td>CPU (double, 6 cores)</td>
<td>0.258</td>
</tr>
<tr>
<td>CPU (double, 12 cores)</td>
<td>0.514</td>
</tr>
<tr>
<td>DFE1</td>
<td>29.6</td>
</tr>
<tr>
<td>DFE2</td>
<td>56.3</td>
</tr>
<tr>
<td>DFE4</td>
<td>72.7</td>
</tr>
</tbody>
</table>

Factors are relative to a single-precision CPU implementation running with a single core and 12 cores (two threads per hyper-threaded physical core). $J = 144$ and $K = 819200$.

with changes to the initial conditions. DFE3, which has the lowest precision exhibits a Hellinger distance comparable to that caused by a 1% variation in $c$ and $F$ (Table I). We also show Hellinger distances for $J = 144$ (the least common multiple of vector widths used that is $\geq 128$). For DFE4 (with $J = 144$), our Hellinger distances are outside those expected by varying input parameters by 1% (we expect larger distances due to having optimised for a different $J$) but within those expected by 10% variation.

B. Performance

We compare performance between the C++ implementation running on different numbers of cores in single and double-precision and different DFE builds (Fig. 8) with speedup factors shown in Table IV. DFE3 is excluded since it has the same precision as DFE4, but lower throughput due to the smaller vector width.

Our results show that a single DFE running with reduced-precision can outperform a 6-core Xeon X5660 running a single-precision implementation by over 13 times, and over 6 times when using two sockets.

For problems where $K \leq 6400$, the single and double-precision CPU implementations have near-identical performance. These computations appear arithmetic-bound and
Table VI
READS AND WRITES PERFORMED ON INPUT, OUTPUT AND INTERMEDIATE STATE BUFFERS BY EACH RUNGE-KUTTA STEP IN THE CPU IMPLEMENTATION.

<table>
<thead>
<tr>
<th>Runge-Kutta Step</th>
<th>State Value</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Input</td>
</tr>
<tr>
<td></td>
<td>Read</td>
</tr>
<tr>
<td>1</td>
<td>✓</td>
</tr>
<tr>
<td>2</td>
<td>✓</td>
</tr>
<tr>
<td>3</td>
<td>✓</td>
</tr>
<tr>
<td>4</td>
<td></td>
</tr>
</tbody>
</table>

a relative efficiency of 41 times if it contained three fully utilised MAX3A cards.

D. Discussion

Excluding data transfer due to pre-reading and padding, the amount of data read from and written to off-chip memory for a complete time-step is close to the L96 state size (a theoretical minimum). In contrast, the CPU implementation reads 9 times and writes 7 times the state size, respectively. The CPU implementation has a working set of approximately 3 times the state size consisting of storage for the input and output states, and a Runge-Kutta intermediate value. The buffers each Runge-Kutta step accesses are shown in Table VI.

Each buffer access in Table VI can be mapped to the corresponding inter-stage links in the hardware implementation (Fig. 2). However, the pipelined nature of the implementation means that the state passed between Runge-Kutta steps is never stored completely. Parallelisation occurs across the entire Runge-Kutta update as kernels operate on all four steps simultaneously.

In a CPU implementation, the natural approach involves domain decomposition to reduce inter-core communication. Parallelisation occurs across each Runge-Kutta step, with each thread updating partitions of the state. However, all threads work on the same Runge-Kutta step simultaneously. Consequently, communication between each Runge-Kutta step must occur via main memory (for large systems).

We consider the reduced memory bandwidth required for our DFE implementations to be a natural consequence of being able to express the Runge-Kutta update as a pipeline. Reducing the bandwidth requirements of the CPU implementation would most likely require some sort of iteration space partitioning beyond the scope of current compilers, or cause a significant increase in code complexity if implemented by hand.

VI. CONCLUSION

We designed a hardware architecture for the chaotic two-scale L96 system. We built FPGA implementations at different precision levels, guided by our CPU-based analysis of Hellinger distances and showed that acceptable behaviour can be achieved at precision significantly lower than single-precision. We demonstrated how to reduce precision of a chaotic system to improve throughput of a hardware implementation.

The performance improvements achieved have enabled climate scientists to substantially increase the scales of analyses involving the L96 system with and without reduced-precision. Previous work in this area was limited to using simulations to explore the latter [8].

Current and future work includes extending our research to cover more realistic models, supporting systems with multiple accelerators, and exploring techniques and tools to automate the architecture optimisation and precision analysis steps in our approach.

VII. ACKNOWLEDGEMENT

This work is supported in part by the European Union Seventh Framework Programme under grant agreement numbers 257906, 287804, 291406 and 318521, by the UK EPSRC, by the Maxeler University Programme, by the HiPEAC NoE, by Altera, and by Xilinx.

REFERENCES