Here we introduce a computer architecture that can reconfigure circuits within a nanosecond, and explain how this makes the architecture a perfect ‘city’ for applications.
We explained how runtime reconfiguration can eliminate idle units in the last project. Here we explain the motivation for more fine-grained runtime reconfiguration with an example.
Modern chips are similar to cities in many ways. The figure above shows New York City at night, and the layout of a silicon chip. If we ignore the fancy lighting, we find that the building blocks are just like silicon modules in a chip, connected with roads or metal wires (does this mean, in a higher dimension, we are just electrons in a chip? :D). The purpose of a chip is to compute as fast as possible. More specifically, the chip transfers information, represented with ‘0’ and ‘1’ bits, into data-processing modules. Modern CPUs are built for general-purpose computing. All applications can be supported on such chips, at the cost of reduced computing performance for each task. Similarly, big cities like London can provide whatever people need, but at higher costs (price, transport time, etc.).
Reconfigurable computing takes a different approach to executing applications. Supported by FPGAs, customised circuits can be implemented for each application. The description of a customised circuit is synthesised into a configuration bitstream, which is downloaded into the FPGA to define the implemented circuit. The reconfigurability of FPGAs enables different circuits to be mapped onto the chip without the time-consuming chip fabrication process. The chip download process, in which the synthesised bitstream is downloaded, takes a few seconds for large-scale FPGAs. This configuration time is negligible given sufficient data to process. This is like a city constructed for a specific purpose. For example, if a city is built to minimise the time its citizens spend on transport, the city can be customised to support the best transport system in the world, which will provide a significant reduction in transport time. However, in reality, even for a specific application, developing a customised design can be challenging.
The major drawback of a customised circuit is that it can only be used for a specific operation.
This works well if we know what operations in an application will be executed before we develop the customised design.
We name such designs “static” designs, since the operations to be executed will not change during runtime.
There are various design approaches to derive the optimal design under this circumstance.
As you can imagine, real-world applications will not always be this simple.
There are applications where different operations are executed under different runtime conditions. For example, people will plan their journey depending on weather, transport system status, distance, etc., and thus we cannot customise the transportation system in advance, because information such as future weather is not available. We name such designs “dynamic” designs, since the operations to execute change from time to time, and cannot be determined when we develop the designs.
As discussed in EXPRESS, we can use runtime reconfiguration to implement efficient dynamic designs, by reconfiguring the underlying circuit based on runtime conditions. However, with current reconfigurable chips, the reconfiguration process takes 1 microsecond to 1 second, which corresponds to hundreds to hundreds of millions of clock cycles, while most applications require reconfiguration every clock cycle. This is as if the city not only had to be optimised for transportation, but also had to build a new road the second a citizen needs it.
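As a rough check of these figures, the sketch below converts a reconfiguration time into clock cycles. The 200 MHz clock frequency and the function name are assumptions chosen for illustration only; they are not taken from the EURECA work.

```python
def reconfig_cycles(reconfig_time_s: float, clock_hz: float) -> float:
    """Number of clock cycles that elapse during one reconfiguration."""
    return reconfig_time_s * clock_hz


CLOCK_HZ = 200e6  # assumed clock frequency (200 MHz), for illustration only

for label, seconds in [("1 microsecond", 1e-6),
                       ("1 second", 1.0),
                       ("1 nanosecond", 1e-9)]:
    cycles = reconfig_cycles(seconds, CLOCK_HZ)
    print(f"{label}: {cycles:,.1f} cycles")
```

Under this assumed clock, a microsecond costs a few hundred cycles and a second costs hundreds of millions, whereas a nanosecond is only a fraction of one cycle.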
I developed the EURECA technology to resolve this issue. The minimum reconfiguration time is reduced from microseconds to a nanosecond. The basic idea is demonstrated in the following figure.
(a) demonstrates a “static” application, where all the operations can be determined before execution. Developing a customised hardware design is relatively straightforward.
(b) shows a “dynamic” application supported with static circuits. In order to support the unknown operations, this design implements customised operators for all the possible operations, and only enables the operators that are required in the current cycle. The hardware efficiency is very low.
(c) uses conventional runtime reconfiguration techniques to swap different operators in and out. The hardware design prepares multiple design configurations before execution, and replaces them during runtime. The major drawback is that only one of the prepared configurations is active at any time. This brings a large chip area overhead to store the configurations (if the prepared configurations are stored on-chip) or a long reconfiguration time (if they are stored off-chip).
(d) shows the EURECA flow, where only one configuration memory is used to store the active configuration. Instead of preparing configuration files beforehand, the EURECA architecture implements architectural support such that configurations can be generated and updated on-chip. At each clock cycle, as new runtime conditions become known, the configurations are directly generated by Configuration Generators (CGs), and applied to the configuration memory through on-chip connections.
This brings three major improvements over existing architectures.
quick reconfiguration. By directly connecting CG outputs to circuit configurations, EURECA circuits can be defined within a nanosecond. This means that, within the same clock cycle, the circuit can first be reconfigured based on the current runtime condition, and then execute the reconfigured operations.
low overhead. As only one reconfiguration memory is required, the overhead to store and transfer multiple stored configurations is eliminated.
customisable. The CGs are implemented with user logic, and thus can be customised based on application requirements. For “static” applications, no CGs will be implemented.
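To make the per-cycle flow more concrete, here is a minimal software sketch of the idea: a toy Configuration Generator maps the current runtime condition to a configuration, and a single configurable operator is then reconfigured and executed in the same cycle. The operator set, the function names, and the modulo-based CG logic are all hypothetical and chosen for illustration; this models the behaviour only, not the actual EURECA hardware. The utilisation figure is a crude proxy for the low hardware efficiency of design (b).

```python
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical operator set; a real design would choose these per application.
OPERATORS: Dict[str, Callable[[int, int], int]] = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}


def configuration_generator(condition: int) -> str:
    """Toy CG: map a runtime condition to a configuration (an operator name)."""
    names = sorted(OPERATORS)  # ["add", "mul", "sub"]
    return names[condition % len(names)]


def run_eureca_style(stream: List[Tuple[int, int, int]]) -> List[int]:
    """One configurable operator: reconfigure, then execute, every cycle."""
    results = []
    for condition, a, b in stream:
        config = configuration_generator(condition)  # reconfigure from the condition
        results.append(OPERATORS[config](a, b))      # execute in the same cycle
    return results


def static_utilisation() -> float:
    """Design (b): every operator is instantiated, only one is enabled per cycle."""
    return 1 / len(OPERATORS)


# Each entry is (runtime condition, operand a, operand b).
stream = [(0, 6, 3), (1, 6, 3), (2, 6, 3)]
print(run_eureca_style(stream))           # [9, 18, 3]
print(f"{static_utilisation():.2f}")      # 0.33
```

In this toy model the single configurable operator is busy every cycle, while the static design in (b) keeps two of its three operators idle at any time.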
The initial architecture has been developed, resulting in up to 32 times improvement in the following applications. These results are published in a recent paper.
We have developed the first EURECA chip in collaboration with Fudan University. Besides the chip prototype, this design brings a few improvements, including customised runtime reconfiguration strategies, reduced CG area, and reduced routing overhead. We will fabricate the prototype chip in early 2016.
Further EURECA support: Only a small subset of dynamic operations has been explored for EURECA so far.
What would be the next improvement: large-scale communication, or computation?
EURECA simulator: To rapidly develop and evaluate new EURECA features, we need a simulator in which we can quickly describe the architecture and run applications without going through the chip fabrication process.
EURECA compiler: To enable people from different fields to use this new architecture, we need a compiler to handle most of the low-level optimisation and circuit generation.