Posted on Jul 01, 2014

EXPRESS: Exploiting Runtime Resources in Reconfigurable Systems

Why do we need runtime reconfiguration?

Runtime reconfiguration is a technique exclusive to reconfigurable devices (such as FPGAs), to change circuits on the fly. Theoretically, this gives reconfigurable devices unlimited potential. I always like to use the example of the evolution from woodblock printing to movable printing to express the difference between customised circuits with and without reconfigurability.

Customised circuits are like the woodblock printing, where a wood block is prepared for a specific content. The customised circuits, whether implemented with ASIC chip or FPGAs, shows the maximum efficiency when the same circuit is used to process a large volume of data (similar to printing a significant amount of papers with the same content). Every time the application operations (printing content) change, a new customised circuit (woodblock) needs to be prepared.

With reconfigurability, customised circuits can reconfigure redundant circuits into active circuits, to support applications with varying operations to execute. Similar to the movable printing, small modules are used to construct the customised circuits, and the modules no longer needed are swapped with circuits for new operations to support. In other words, with reconfigurable devices, we can construct swap in and out new circuit modules to support various application operations, without changing the circuits from scratch.

The benefits for runtime reconfiguration can be demonstrated as an example below. In a static design (i.e. a design without runtime reconfiguration), all functions are mapped into reconfigurable fabrics and replicated as much as possible to optimise concurrency. However, limited by data dependency and mapping strategies, some computational resources can be left idle from time to time. This situation is shown in Figure~\ref{fig:mot}(b): there are four function units, each implementing respectively the function A, B, C and D in the dataflow graph in Figure 1(a). Given that each function takes n cycles, the entire computation would take 4n cycles. It is assumed that the application RDFG indicates each function consumes 1 resource unit, and computation within functions starts once the last output datum of the leading functions becomes available. For t=0..4n-1, several function units would become idle. How could run-time reconfiguration be used to reduce the number of cycles required for this computation?

CityChip

One possibility involves reconfiguration of the idle function units to perform useful work. Let us assume that there is sufficient data independence in each function to enable linear speedup with additional function units: for k function units, the function takes n/k cycles to complete. So for k=1, it takes n cycles to complete the function as described before, and if k=n, it could potentially only take one cycle, although in practice, k is likely to be smaller than n.

With this assumption, Figure 1(c) shows a design which speeds up computing the functions A and B in the second level of the data flow graph in Figure 1(a) by reconfiguring the two idle function units C and D to A and B. This increase in parallelism means that these functions can be completed in n/2 cycles, during t=n..3n/2-1. For the functions in the third level of the data flow graph, B and C are reconfigured as A and D, finishing computation in A and D in n/2 cycles, during t=3n/2..2n-1. Then the same can be done in computing the last function C in the dataflow graph: this time all four function units are configured to compute C so that it can be completed in n/4 cycles, during t=2n..9n/4-1. The total number of cycles is thus 9n/4, reduced from the 4n cycles for the static design in Figure 1(b). The speedup stems from reconfiguring the resource occupied by the idle functions to generate multiple replications of the active functions, leading to increased parallelism.

The EXPRESS Approach

Like every new technology, there is a ‘but’ for runtime reconfiguration. The previous example we have neglect the reconfiguration time, i.e., the time to update the circuits. Large FPGAs can take 1-3 seconds to reconfigure the entire chip. Therefore, runtime reconfiguration would make performance improvement if the reconfiguration time outweighs the reduction in execution time. In order to effectively exploit runtime reconfiguration technologies, the challenges include: * How to identify reconfiguration opportunities, i.e., idle functions. * How to estimate and utilise the benefits gained from reconfiguring idle functions. * How to generate a run-time reconfigurable design that ensures functional correctness, while improving system performance.

In correspondence to these challenges, I developed the EXPRESS approach to automatically generate hardware designs that exploit runtime reconfiguration. As shown in the figure below, the EXPRESS approach starts from an application represented as an data-flow graph. The approach contains three compile-time steps and one run-time step. The compile-time steps generate various reconfigurable designs for the target applications. Each reconfigurable design is associated with a specific run-time reconfiguration strategy. The run-time step evaluates the generated reconfigurable designs, to select the design with maximum throughput.

CityChip

The EXPRESS approach is published in this paper:

@article{EXPRESS,
    author    = {Xinyu Niu and
    Thomas C. P. Chau and
    Qiwei Jin and
    Wayne Luk and
    Qiang Liu and
    Oliver Pell},
    title     = {Automating Elimination of Idle Functions by Runtime Reconfiguration},
    journal   = ,
    volume    = {8},
    number    = {3},
    pages     = {15},
    year      = {2015}
}

Express tool

In collaboration with my colleague Paul Grigoras, we develop the EXPRESS tool that takes C input, and generate optimal runtime reconfigurable designs. Paul developed the front-end to support source-to-source translation, and I developed back-end optimisation passes to automate the EXPRESS approach. The hardware design run on the Maxeler Platforms.

If you are interested, try our tool here!

Recent update

Lately I extend the EXPRESS approach to the cluster-scale application. The basic idea is that in a data centre with various FPGAs, applications start and finish from time to time, leaving FPGAs that are busy when an application starts, but becomes available in the middle of the application execution. I develop the approach to dynamic scale cluster-scale application with runtime reconfiguration: whenever new resources become available, if a runtime evaluator determines the resources can improve application performance, the hardware design will reconfigure to include the new resources. This approach is published in this paper. We are currently integrating the new features into our tool.


comments powered by Disqus