

## Introduction to Custom Computing

Oskar Mencer

oskar@doc.ic.ac.uk



- What is Custom Computing, and Why
- Compilation issues
- Run-time issues
- Systems: putting it all together

## Overview of Custom Computing

- **1. FPGAs: what and why**
  - FPGA, the big picture
  - FPGA devices
- **2. Compiling to FPGAs**
  - architecture generation
  - compiler passes
- **3. Run-time reconfiguration**
  - generating configurations: at compile time, at run time
  - scheduling reconfigurations: control driven, data driven
- **4. Putting it all together**
  - system view
  - PAM: a PC-size system
  - SONIC: a PCI card

## Some History



## FPGA versus Microprocessor



## ... and the price per gate goes down ...



## FPGA versus ASIC Accelerator Card



## The Speedup Argument



DIMACS Benchmarks, Xilinx FPGAs vs. GRASP Software

## Speedup: FPGA vs MIPS Processor



Source: J. Babb PhD thesis, MIT, 2001

## Energy Reduction: FPGA vs MIPS Core



Source: J. Babb PhD thesis, MIT, 2001


## Look-Up Table (LUT): A Universal Gate

- combinatorial logic: stored in Look-Up Tables
- capacity limited by the number of inputs, not by the complexity of the function
- LUT delay: independent of the logic function implemented
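The points above can be illustrated with a small software model. This is a minimal sketch, not a description of any particular device: the class name `LUT` and the example functions `and4` and `parity4` are made up for illustration. A k-input LUT is modelled as a table of 2^k stored bits, so any Boolean function of k inputs fits and evaluation is always one indexed read.

```python
from itertools import product


class LUT:
    """Software model of a k-input look-up table (hypothetical helper)."""

    def __init__(self, k, func):
        self.k = k
        # "Configure" the LUT: evaluate func once per input combination
        # and store the 2**k resulting output bits.
        self.table = [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

    def __call__(self, *bits):
        # Evaluation is a single indexed read, whatever func was.
        index = 0
        for b in bits:
            index = (index << 1) | (b & 1)
        return self.table[index]


# Two very different 4-input functions occupy the same 16-bit table.
and4 = LUT(4, lambda a, b, c, d: a & b & c & d)
parity4 = LUT(4, lambda a, b, c, d: a ^ b ^ c ^ d)

print(and4(1, 1, 1, 1), parity4(1, 0, 1, 1))
```

Both functions cost the same storage and the same lookup, which is the sense in which capacity depends on the number of inputs rather than on the complexity of the logic.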



## Xilinx FPGAs: XC4000, Virtex, Virtex II



## Simplified CLB Structure of Xilinx FPGAs



## The VLSI CAD Productivity Gap



Source: SEMATECH


## Programming FPGAs: Toolflow



## The "Compiler" for FPGAs



## Languages for Architecture Generation

- generate particular instances of a generic architecture, such as a signal processor, a stream architecture, a neural network, a cellular automaton, etc.
- HW Description Languages: VHDL, Verilog
- SW languages used for Custom Computing: C/C++, Java, Ruby, etc.
- C++ example: ASC, A Stream Compiler

## ASC - A Stream Compiler for FPGAs

#### **Bell Labs**





#### features

- distributed registers
- distributed delay FIFOs
- local memory blocks
- external memory
- PAM-Blox II modules (pre-pipelined)

ASC generates Stream Architectures based on C++ input.

## ASC Example: IDEA Encryption




## The "Compiler" for FPGAs



## Bitwidth Analysis (Type Size Inference)

- for programs written in a high-level language
- minimum bitwidth required for
  - each variable at every static location of the program
  - each static operation of the program
- reduces memory bandwidth dependency, MBD (slide 15, above)
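One simple way to picture this is value-range propagation. The sketch below is an assumption for illustration, not necessarily the analysis used in the work these slides refer to: it tracks unsigned `(min, max)` ranges through additions and multiplications and derives the minimum bitwidth from the maximum. The helper names and the example ranges are hypothetical.

```python
def bits_needed(max_value):
    """Minimum number of bits for an unsigned value in [0, max_value]."""
    return max(1, max_value.bit_length())


def add_range(a, b):
    """Range of x + y for x in a, y in b."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a, b):
    """Range of x * y; valid only for non-negative ranges."""
    return (a[0] * b[0], a[1] * b[1])


# Hypothetical example: an 8-bit pixel times a 4-bit coefficient,
# then two such products accumulated.
pixel = (0, 255)
coeff = (0, 15)
prod = mul_range(pixel, coeff)
acc = add_range(prod, prod)

for name, rng in (("pixel", pixel), ("prod", prod), ("acc", acc)):
    print(name, rng, bits_needed(rng[1]), "bits")
```

A hardware datapath sized from these ranges avoids the 32-bit default a C `int` would imply, which is where the savings come from.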



## HAGAR: Hardware Graph Accelerators

Memory-latency-dependent applications, e.g. data structures and pointers → implement data structure + algorithm in hardware

- operations: insert, delete, reachability, ...
- reachability speedup over PC: 10× to 1000× for 1K nodes
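For reference, the reachability operation can be written in software as a breadth-first search. This is only a sketch of what the operation computes on a PC, under the assumption of an adjacency-list graph. It is not the HAGAR hardware design, and the names `reachable` and `graph` are illustrative.

```python
from collections import deque


def reachable(adj, src, dst):
    """Return True if dst can be reached from src by following edges in adj."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        # Each neighbour lookup is a dependent memory access, which is
        # why this kind of code is bound by memory latency.
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


graph = {0: [1, 2], 1: [3], 2: [], 3: []}
print(reachable(graph, 0, 3), reachable(graph, 2, 3))
```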

## Speedup of Reachability




## Run-time Reconfiguration

FPGAs are reconfigurable within milliseconds!

#### 1. generating configurations

- at compile time
- at run time

#### 2. scheduling reconfigurations

- at compile time: control driven
- at run time: data driven

## Generating & Scheduling Reconfigurations

Application source or executable

FPGA configurations



=> insert reconfiguration strategy (algorithm) into application

## Generating & Scheduling Reconfigurations

| scheduling     | configurations generated @compile time          | configurations generated @run time        |
|----------------|-------------------------------------------------|-------------------------------------------|
| control driven | network intrusion detection<br>image processing | Rijndael AES encryption                   |
| data driven    | Viterbi decoder: adaptive error correction      | boolean satisfiability<br>media processing |

## Media Processing Example

- shape-adaptive template matching (SATM)
- MPEG4, MPEG7: find arbitrarily shaped object in video



- template contains object of interest
- Sum of Absolute Differences (SAD) between template and video
- adapt design to template and search frame
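The matching step above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the FPGA design: the template shape is represented by a 0/1 `mask` (an assumption), and the function names `sad` and `best_match` and the tiny example data are hypothetical.

```python
def sad(frame, template, mask, x, y):
    """Sum of absolute differences at offset (x, y), over masked pixels only."""
    total = 0
    for j, row in enumerate(template):
        for i, t in enumerate(row):
            if mask[j][i]:  # skip pixels outside the object's shape
                total += abs(frame[y + j][x + i] - t)
    return total


def best_match(frame, template, mask):
    """Return the (x, y) offset in the frame with the smallest SAD."""
    h, w = len(template), len(template[0])
    positions = [
        (x, y)
        for y in range(len(frame) - h + 1)
        for x in range(len(frame[0]) - w + 1)
    ]
    return min(positions, key=lambda p: sad(frame, template, mask, p[0], p[1]))


# Tiny hypothetical example: an L-shaped object inside a 2x2 template.
frame = [[0, 0, 0, 0],
         [0, 9, 8, 0],
         [0, 7, 0, 0]]
template = [[9, 8],
            [7, 5]]
mask = [[1, 1],
        [1, 0]]

print(best_match(frame, template, mask))
```

The inner loops depend only on the template and mask, which is what makes it attractive to specialise the hardware to a given template.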

- HDTV frames: 1920 by 1080 pixels
- for 300 HDTV frames, use PC time as reference
  - $T_{PC}$: execution time on a PC with a 1.4 GHz Pentium 4
  - $S_D$: speedup of dynamic design
  - $S_{PD}$: speedup of partially dynamic design
  - $S_S$: speedup of static design

| template size (w×h) | $T_{PC}$ (sec) | $S_D$ | $S_{PD}$ | $S_S$ |
|---------------------|----------------|-------|----------|-------|
| 10×10               | 1,015          | 108   | 57       | 36    |
| 100×100             | 88,609         | 6,970 | 5,001    | 3,180 |
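Assuming each speedup is defined in the usual way as the PC time divided by the design's time, the table also implies an execution time for each FPGA design. The short script below does that division. The dictionary layout and the name `design_times` are illustrative, and the only data used are the table values.

```python
# Values copied from the table above (times in seconds for 300 frames).
results = {
    "10x10":   {"t_pc": 1015.0,  "S_D": 108,  "S_PD": 57,   "S_S": 36},
    "100x100": {"t_pc": 88609.0, "S_D": 6970, "S_PD": 5001, "S_S": 3180},
}


def design_times(row):
    """Implied design times, assuming speedup = t_pc / design time."""
    return {
        name.replace("S_", "T_"): row["t_pc"] / speedup
        for name, speedup in row.items()
        if name.startswith("S_")
    }


for size, row in results.items():
    times = design_times(row)
    print(size, {name: round(t, 1) for name, t in times.items()})
```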

## SATM: Performance Results

The dynamic design is superior when:

- a large number of consecutive frames is processed
- templates do not change often
- templates are small

Notation:

- $T_D$: time for dynamic design
- $T_S$: time for static design
- $T_{PD}$: time for partially dynamic design



## Overview of Custom Computing

- **1. FPGAs: what and why**
  - FPGA, the big picture
  - FPGA devices
  - programming FPGAs: module generation
- **2. Compiling to FPGAs**
  - architecture generation
  - compiler passes: loop transformations, bitwidth analysis
- **3. Run-time reconfiguration**
  - generating configurations: at compile time, at run time
  - scheduling reconfigurations: control driven, data driven
- **4. Putting it all together**
  - system view
  - PAM: a PC-size system
  - SONIC: a PCI card

## System View of Custom Accelerators



## Xilinx Virtex-II Pro



#### **Virtex-II Base Architecture**

#### **IBM PowerPC CPU**

- RocketIO™ MGT
  - Multi-Rate: 3.125, 2.5, 2.0, 1.2, 1.0 Gb/sec
  - Multi-Protocol
    - Gb Ethernet
    - 10Gb Ethernet (XAUI)
    - Intel 3GIO
    - Serial ATA
    - InfiniBand
    - Fibre Channel
    - RapidIO
    - Serial Backplanes



1. PAM project at DIGITAL PRL (Compaq): various applications on a multi-FPGA platform
2. SONIC project at Imperial College and Sony: broadcast-quality video processing PCI card

## Case Study 1: The PAM Project

#### Programmable Active Memories (source: Mark Shand)

**People:** Jean Vuillemin, Philippe Boucard, Patrice Bertin, Harry Printz, Didier Roncin, Mark Shand, Herve Touati

**Goals:**

- Maximum performance
- Rapid turnaround
- Exploring how to build and program PAM
- Exploring the application space
- (Non-goals: high-level synthesis, platform independence, improving FPGAs)

**4 Platforms:**

- Perle-0 (1989): 50k gate reconfigurable VME board
- DECPeRLe-1 (1992): 200k gates, 20ms turnaround
- TURBOchannel Pamette (1994)
- PCI Pamette (1996): PCI card with 40K - 112K gates

**Applications:**

- Long Int. Multiplication
- Stereo Vision
- RSA Cryptography
- Hough Transform
- Dynamic Programming
- High Energy Physics
- Laplace Heat Equation
- Image Acquisition
- Viterbi Decoder
- Wireless LAN testbed
- Sound Synthesis

All applications are implemented by hand at the gate level using PamDC.

## DECPeRLe-1



## Case Study 2: SONIC

professional video processing at SONY and Imperial College



## SONIC architecture





## SONIC Processing an Image



## SONIC with Multiple Plug-ins




## SONIC versus Pentium



## Current Generation: UltraSONIC

- PE, PR on one FPGA
- 8MB SRAM on PIPE
- up to 16 PIPEs
- 64-bit, 66MHz PCI
- real-time image registration





## Pointers to Conferences and Journals

#### FPGA Conferences

FPGA Conference, FCCM, FPL, FPT, ERSA

#### Hardware Design

ICCAD, DAC

#### Computer Architecture

ISCA, ISC, HPCA, ASPLOS, MICRO

#### Journals

- IEEE Transactions on Computers
- IEEE Transactions on VLSI
- IEEE Transactions on CAD
- Kluwer Journal on VLSI and Signal Processing
- ACM Transactions on Architecture and Code Optimization