<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Sangjin Han</title>
    <link>http://www.eecs.berkeley.edu/~sangjin/</link>
    <atom:link href="http://www.eecs.berkeley.edu/~sangjin/rss.xml" rel="self" type="application/rss+xml" />
    <description>Sangjin Han's Blog</description>
    <language>en-us</language>
    <pubDate>Sat, 28 Nov 2015 18:57:55 +0000</pubDate>
    <lastBuildDate>Sat, 28 Nov 2015 18:57:55 +0000</lastBuildDate>
 
    
    <item>
      <title>EURECA: a Perfect City</title>
      <link>http://www.eecs.berkeley.edu/~sangjin/2015/09/01/eureca.html</link>
      <pubDate>Tue, 01 Sep 2015 00:00:00 +0100</pubDate>
      <author>Sangjin Han</author>
      <guid>http://www.eecs.berkeley.edu/~sangjin/2015/09/01/eureca</guid>
      <description>&lt;h1 id=&quot;eureca-a-perfect-city&quot;&gt;EURECA: a Perfect City&lt;/h1&gt;

&lt;p&gt;Here we introduce a computer architecture that can reconfigure circuits within a nanosecond, and explain how 
this makes the architecture a perfect ‘city’ for applications.&lt;/p&gt;

&lt;h2 id=&quot;why-we-need-quick-runtime-reconfiguration&quot;&gt;Why do we need quick runtime reconfiguration?&lt;/h2&gt;

&lt;p&gt;We explained &lt;a href=&quot;http://www.doc.ic.ac.uk/~nx210/2014/07/01/express.html&quot;&gt;how runtime reconfiguration can eliminate idle units&lt;/a&gt; in the last project.
Here we explain the motivation for more fine-grained runtime reconfiguration with an example.&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;static/project_images/2015-09-01/citychip.png&quot; alt=&quot;CityChip&quot; style=&quot;width: 100%&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Modern chips are similar to cities in many ways. The figure above shows New York City at night next to the layout of a silicon chip. 
If we ignore the fancy lighting, we find that the building blocks of a city are just like the silicon modules in a chip, connected by roads instead of metal wires (does this mean that, in a higher dimension, we are just electrons in a chip? :D). 
The purpose of a chip is to compute as fast as possible. 
More specifically, the chip moves information, represented as ‘0’ and ‘1’ bits, into data-processing modules.
Modern CPUs are built for general-purpose computing.
Such chips can support any application, at the cost of reduced computing performance for each task.
Similarly, big cities like London can provide whatever people need, but at higher cost (price, transport time, etc.).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Reconfigurable_computing&quot;&gt;Reconfigurable computing&lt;/a&gt; takes a different approach to executing applications.
Supported by &lt;a href=&quot;https://en.wikipedia.org/wiki/Field-programmable_gate_array&quot;&gt;FPGAs&lt;/a&gt;, customised circuits can be implemented for each application.
The description of a customised circuit is synthesised into a configuration bitstream, 
which is downloaded into the FPGA to define the implemented circuit.
The reconfigurability of FPGAs enables different circuits to be mapped onto the chip without the time-consuming chip fabrication process.
Downloading the synthesised bitstream takes a few seconds for large-scale FPGAs.
This configuration time is negligible given sufficient data to process. 
This is like a city constructed for a specific purpose.
For example, if a city is built to minimise the time its citizens spend on transport, the city can be customised 
to support the best transport system in the world, which will provide a significant reduction in transport time.
However, in reality, even for a specific application, developing a customised design can be challenging.&lt;/p&gt;

&lt;p&gt;The major drawback of a customised circuit is that it can only be used for a specific operation.
This works well if we know which operations an application will execute before we develop the customised design. 
We name such designs “static” designs, since the operations to be executed do not change during runtime. 
There are various design approaches to derive the optimal design under this circumstance.
As you can imagine, real-world applications are not always this simple.&lt;br /&gt;
There are applications where different operations are executed under different runtime conditions.
For example, people plan their journeys depending on weather, transport system status, distance, etc.,
and thus we cannot customise the transport system in advance, because information such as future weather is not available.
We name such designs “dynamic” designs, since the operations to execute change from time to time, and cannot be determined when 
we develop the designs.&lt;/p&gt;

&lt;p&gt;As discussed in &lt;a href=&quot;http://www.doc.ic.ac.uk/~nx210/2014/07/01/express.html&quot;&gt;EXPRESS&lt;/a&gt;, we can use runtime reconfiguration to 
implement efficient dynamic designs, by reconfiguring the underlying circuit based on runtime conditions.
However, with current reconfigurable chips, the reconfiguration process takes 1 microsecond to 1 second, 
which corresponds to hundreds to hundreds of millions of clock cycles, while most dynamic applications require reconfiguration every clock cycle.
This is as if a city not only had to be optimised for transport, but also had to build a new road 
the second a citizen needed it.&lt;/p&gt;
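
&lt;p&gt;To make the gap concrete, the sketch below converts reconfiguration times into clock cycles. The 200 MHz clock and the helper name &lt;code&gt;cycles&lt;/code&gt; are assumptions chosen for illustration, not figures for a specific device.&lt;/p&gt;

```python
from fractions import Fraction

def cycles(time_ns, clock_mhz):
    # cycles = time * frequency; a product in ns * MHz is 1000 times the cycle count.
    return Fraction(time_ns * clock_mhz, 1000)

CLOCK_MHZ = 200  # assumed clock frequency, for illustration only

print(cycles(1_000, CLOCK_MHZ))          # 1 microsecond: 200
print(cycles(1_000_000_000, CLOCK_MHZ))  # 1 second: 200000000
print(cycles(1, CLOCK_MHZ))              # 1 nanosecond: 1/5
```

&lt;p&gt;Under this assumed clock, only a reconfiguration time of about a nanosecond fits inside a single cycle.&lt;/p&gt;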

&lt;h2 id=&quot;eureca-a-new-approach&quot;&gt;EURECA, a new approach&lt;/h2&gt;

&lt;p&gt;I developed the EURECA technology to resolve this issue. The minimum reconfiguration time is reduced from microseconds to 
a nanosecond. 
The basic idea is illustrated in the following figure.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;(a) shows a “static” application, where all the operations can be determined before execution.
Developing a customised hardware design is relatively straightforward.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;(b) shows a “dynamic” application supported with static circuits. In order to support the unknown operations, 
this design implements customised operators for all the possible operations, and only enables the operators that are 
required in the current cycle. Since most operators are idle in any given cycle, the hardware efficiency is very low.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;(c) uses conventional runtime reconfiguration techniques to swap different operators in and out. The hardware design 
prepares multiple design configurations before execution, and replaces them during runtime.
The major drawback is that only one of the prepared configurations is active at any time. This brings a large chip area overhead 
to store the configurations (if the prepared configurations are stored on-chip) or a long reconfiguration time (if the prepared configurations are 
stored off-chip).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;(d) shows the EURECA flow, where only one configuration memory is used to store the active configuration. Instead of preparing configuration files 
beforehand, the EURECA architecture implements architectural support such that configurations can be generated and updated on-chip. At each clock cycle, as
new runtime conditions become known, the configurations are generated directly by Configuration Generators (CGs), and applied to the configuration memory 
through on-chip connections.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This brings three major improvements over existing architectures.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Quick reconfiguration. By directly connecting CG outputs to circuit configurations, EURECA circuits can be defined within a nanosecond. 
This means that within the same clock cycle, the circuit can first be reconfigured based on the current runtime conditions, and then execute the reconfigured operations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Low overhead. As only one configuration memory is required, the overhead of storing and transferring multiple prepared configurations is eliminated.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Customisable. The CGs are implemented with user logic, and thus can be customised based on application requirements. For “static” applications, 
no CGs will be implemented.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;static/project_images/2015-09-01/flow.png&quot; alt=&quot;flow&quot; style=&quot;width: 100%&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;what-has-been-done&quot;&gt;What has been done?&lt;/h2&gt;

&lt;p&gt;The initial architecture has been developed, resulting in improvements of up to 32 times in the following applications.
These results are published in &lt;a href=&quot;&quot;&gt;a recent paper&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Memcached&quot;&gt;Memcached&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Sparse_matrix-vector_multiplication&quot;&gt;Sparse Matrix-Vector Multiplication (SpMV)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Merge_sort&quot;&gt;Large-scale sorting&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have developed the first EURECA chip in collaboration with &lt;a href=&quot;http://www.fudan.edu.cn/en/&quot;&gt;Fudan University&lt;/a&gt;. 
Besides the chip prototype, this design brings a few improvements, including customised runtime reconfiguration strategies, 
reduced CG area, and reduced routing overhead.
We will fabricate the prototype chip in early 2016.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;static/project_images/2015-09-01/chip.png&quot; alt=&quot;chip&quot; style=&quot;width: 100%&quot; /&gt;
&lt;/p&gt;

&lt;h2 id=&quot;what-is-going-on-the-following-links-are-under-construction-i-will-release-them-when-the-work-is-ready&quot;&gt;What is going on? (The following links are under construction; I will release them when the work is ready.)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;&quot;&gt;Further EURECA support&lt;/a&gt;: Only a small subset of dynamic operations has been explored for EURECA so far.&lt;br /&gt;
What should the next improvement target: 
large-scale communication, or computation?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;&quot;&gt;EURECA simulator&lt;/a&gt;: To rapidly develop and evaluate new EURECA features, we need a simulator in which we can quickly describe 
the architecture and run applications without going through the chip fabrication process.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;&quot;&gt;EURECA compiler&lt;/a&gt;: To enable people from different fields to use this new architecture, we need a compiler to handle 
most of the low-level optimisation and circuit generation.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>EXPRESS: Exploiting Runtime Resources in Reconfigurable Systems</title>
      <link>http://www.eecs.berkeley.edu/~sangjin/2014/07/01/express.html</link>
      <pubDate>Tue, 01 Jul 2014 00:00:00 +0100</pubDate>
      <author>Sangjin Han</author>
      <guid>http://www.eecs.berkeley.edu/~sangjin/2014/07/01/express</guid>
      <description>&lt;h1 id=&quot;express-exploiting-runtime-resources-in-reconfigurable-systems&quot;&gt;EXPRESS: Exploiting Runtime Resources in Reconfigurable Systems&lt;/h1&gt;

&lt;h2 id=&quot;why-do-we-need-runtime-reconfiguration&quot;&gt;Why do we need runtime reconfiguration?&lt;/h2&gt;

&lt;p&gt;Runtime reconfiguration is a technique, exclusive to reconfigurable devices (such as FPGAs), 
for changing circuits on the fly.
Theoretically, this gives reconfigurable devices unlimited potential. 
I always like to use the evolution from &lt;a href=&quot;https://en.wikipedia.org/wiki/Woodblock_printing&quot;&gt;woodblock printing&lt;/a&gt;
to &lt;a href=&quot;https://en.wikipedia.org/wiki/Movable_type&quot;&gt;movable type&lt;/a&gt; to illustrate the difference between customised circuits with and without reconfigurability.&lt;/p&gt;

&lt;p&gt;Customised circuits are like woodblock printing, where a woodblock is prepared for a specific content. 
Customised circuits, whether implemented as &lt;a href=&quot;https://en.wikipedia.org/wiki/Application-specific_integrated_circuit&quot;&gt;ASIC&lt;/a&gt; chips or on FPGAs, 
show maximum efficiency when the same circuit is used to process a large volume of data (similar to printing a large number of pages with the same content).
Every time the application operations (printing content) change, a new customised circuit (woodblock) needs to be prepared.&lt;/p&gt;

&lt;p&gt;With reconfigurability, customised circuits can reconfigure redundant circuits into active circuits, 
to support applications whose operations vary during execution.
As with movable type, small modules are used to construct the customised circuits, 
and the modules that are no longer needed are swapped for circuits that support new operations.
In other words, with reconfigurable devices, we can swap circuit modules in and out 
to support various application operations, without rebuilding the circuits from scratch.&lt;/p&gt;

&lt;p&gt;The benefits of runtime reconfiguration can be illustrated with the example below.
In a static design (i.e. a design without runtime reconfiguration), all functions are mapped into reconfigurable
fabrics and replicated as much as possible to optimise concurrency.
However, limited by data dependency and mapping strategies, some
computational resources can be left idle from time to time. This
situation is shown in Figure 1(b): there are four function units,
implementing respectively the functions A, B, C and D in the dataflow
graph in Figure 1(a). Given that each function takes n cycles, the
entire computation would take 4n cycles.
It is assumed that the application RDFG indicates each function consumes 1 resource unit,
and computation within a function starts once the last output datum of the preceding functions becomes available.
Over t=0..4n-1, several function units are idle at various times. How could
run-time reconfiguration be used to reduce the number of cycles
required for this computation?&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;static/project_images/2014-07-01/motivation9.png&quot; alt=&quot;CityChip&quot; style=&quot;width: 100%&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;One possibility involves reconfiguration of the idle function units to
perform useful work. Let us assume that there is sufficient data
independence in each function to enable linear speedup with additional
function units: for k function units, the function takes n/k cycles to
complete. So for k=1, it takes n cycles to complete the function as
described before, and if k=n, it could potentially only take one cycle,
although in practice, k is likely to be smaller than n.&lt;/p&gt;

&lt;p&gt;With this assumption, Figure 1(c) shows a design which speeds up
computing the functions A and B in the second level of the data flow graph in Figure 1(a)
by reconfiguring the two idle function units C and D to A and B. This
increase in parallelism means that these functions can be completed in
n/2 cycles, during t=n..3n/2-1.
For the functions in the third level of the data flow graph, the units for B and C are reconfigured as A and D,
completing the computation of A and D in n/2 cycles, during t=3n/2..2n-1.
Then the same can be done in computing the last function C in the dataflow graph: this time all
four function units are configured to compute C so that it can be
completed in n/4 cycles, during  t=2n..9n/4-1. The total number of
cycles is thus 9n/4, reduced from the 4n cycles for the static design in Figure 1(b).
The speedup stems from reconfiguring the resources occupied by the idle functions
into multiple replicas of the active functions, leading to increased parallelism.&lt;/p&gt;
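
&lt;p&gt;The cycle counts above can be checked with a small sketch. It assumes that the first level of the graph keeps all four function units busy, which is why that level takes n cycles in both designs, and it applies the linear speedup assumption of n/k cycles for k units. The function names and the list &lt;code&gt;levels&lt;/code&gt; are hypothetical and only serve this illustration.&lt;/p&gt;

```python
from fractions import Fraction

def static_cycles(levels, n):
    # Static design: every function owns exactly one unit, so each level takes n cycles.
    return len(levels) * n

def reconfigured_cycles(levels, units, n):
    # Reconfigured design: idle units are turned into extra copies of the active functions.
    total = Fraction(0)
    for active in levels:
        copies = units // active      # units available to each active function
        total += Fraction(n, copies)  # linear speedup: n / k cycles with k units
    return total

levels = [4, 2, 2, 1]  # active functions per level (the first entry is an assumption)
n = 1000

print(static_cycles(levels, n))           # 4000, i.e. 4n
print(reconfigured_cycles(levels, 4, n))  # 2250, i.e. 9n/4
```

&lt;p&gt;The sketch ignores the reconfiguration time between levels, as the example does.&lt;/p&gt;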

&lt;h2 id=&quot;the-express-approach&quot;&gt;The EXPRESS Approach&lt;/h2&gt;

&lt;p&gt;Like every new technology, there is a ‘but’ for runtime reconfiguration.
In the previous example we neglected the reconfiguration time, i.e., the time to update the circuits.
Large FPGAs can take 1-3 seconds to reconfigure the entire chip. 
Therefore, runtime reconfiguration improves performance only if the reduction in execution time outweighs 
the reconfiguration time.
In order to effectively exploit runtime reconfiguration technologies, the challenges include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;How to identify reconfiguration opportunities, i.e., idle functions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How to estimate and utilise the benefits gained from reconfiguring idle functions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How to generate a run-time reconfigurable design that ensures functional correctness,
while improving system performance.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
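
&lt;p&gt;The trade-off between reconfiguration time and saved execution time can be written as a simple break-even check. This is a minimal sketch with made-up numbers, not the evaluation model used by EXPRESS; &lt;code&gt;net_gain&lt;/code&gt; and its parameters are hypothetical names.&lt;/p&gt;

```python
def net_gain(static_time, reconfigured_time, reconfig_time, reconfig_count):
    # Time saved by the faster schedule, minus the time spent reconfiguring.
    saved = static_time - reconfigured_time
    overhead = reconfig_time * reconfig_count
    return saved - overhead  # positive: reconfiguration pays off

# Illustrative numbers only: a 4.0 s static run that drops to 2.25 s
# with three reconfigurations along the way.
print(net_gain(4.0, 2.25, 0.5, 3))  # 0.25  (worthwhile)
print(net_gain(4.0, 2.25, 1.0, 3))  # -1.25 (reconfiguration costs more than it saves)
```

&lt;p&gt;With reconfiguration times of seconds, the saved execution time has to be correspondingly large before reconfiguring idle functions pays off.&lt;/p&gt;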

&lt;p&gt;To address these challenges, I developed the EXPRESS approach to automatically generate hardware designs 
that exploit runtime reconfiguration.
As shown in the figure below, the EXPRESS approach starts from an application represented as a data-flow graph.
The approach contains three compile-time steps and one run-time step.
The compile-time steps generate various reconfigurable designs for the
target applications.
Each reconfigurable design is associated with a specific run-time reconfiguration strategy.
The run-time step evaluates the generated reconfigurable designs, to select the design with
maximum throughput.&lt;/p&gt;

&lt;p style=&quot;text-align: center&quot;&gt;
&lt;img src=&quot;static/project_images/2014-07-01/flow2.png&quot; alt=&quot;CityChip&quot; style=&quot;width: 100%&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;The EXPRESS approach is published in &lt;a href=&quot;http://www.doc.ic.ac.uk/~nx210/static/pub/J4.pdf&quot;&gt;this paper&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@article{EXPRESS,
    author    = {Xinyu Niu and
    Thomas C. P. Chau and
    Qiwei Jin and
    Wayne Luk and
    Qiang Liu and
    Oliver Pell},
    title     = {Automating Elimination of Idle Functions by Runtime Reconfiguration},
    journal   = ,
    volume    = {8},
    number    = {3},
    pages     = {15},
    year      = {2015}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h1 id=&quot;express-tool&quot;&gt;EXPRESS tool&lt;/h1&gt;

&lt;p&gt;In collaboration with my colleague &lt;a href=&quot;http://www.doc.ic.ac.uk/~pg1709/&quot;&gt;Paul Grigoras&lt;/a&gt;, I 
developed the EXPRESS tool, which takes C input and generates optimal runtime reconfigurable designs. 
Paul developed the &lt;a href=&quot;http://www.doc.ic.ac.uk/~pg1709/pgasap2013.pdf&quot;&gt;front-end&lt;/a&gt; to support source-to-source translation, 
and I developed back-end optimisation passes to automate the EXPRESS approach.
The hardware designs run on the &lt;a href=&quot;https://www.maxeler.com/products/mpc-cseries/&quot;&gt;Maxeler Platforms&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are interested, try our tool &lt;a href=&quot;https://github.com/custom-computing-ic/fastc&quot;&gt;here&lt;/a&gt;!&lt;/p&gt;

&lt;h1 id=&quot;recent-update&quot;&gt;Recent update&lt;/h1&gt;

&lt;p&gt;Recently I extended the EXPRESS approach to cluster-scale applications. 
The basic idea is that in a data centre with various FPGAs, applications start and 
finish from time to time, so FPGAs that are busy when an application starts 
may become available in the middle of its execution. 
I developed an approach to dynamically scale cluster-scale applications with runtime reconfiguration: 
whenever new resources become available, if a runtime evaluator determines that the resources 
can improve application performance, the hardware design is reconfigured to include the new resources.
This approach is published in &lt;a href=&quot;http://www.doc.ic.ac.uk/~nx210/static/pub/C13.pdf&quot;&gt;this paper&lt;/a&gt;.
We are currently integrating the new features into our tool.&lt;/p&gt;
</description>
    </item>
    
 
  </channel>
</rss>
