

#### Towards Verified Hardware Compilation

John Wickerson Imperial College London

FMATS Workshop, Microsoft Research Cambridge, 24 Sep 2018

#### Collaborators



#### Nadesh Ramanathan



#### George Constantinides

#### 3

#### Hardware Compilation?

• Also called "high-level synthesis".

- Also called "high-level synthesis".
- Basic idea: translate C (or OpenCL, or ...) to Verilog.

- Also called "high-level synthesis".
- Basic idea: translate C (or OpenCL, or ...) to Verilog.
- Custom hardware can be 10x faster and 10x more powerefficient than running software on a processor.



(1) S.O. Settle, "High-performance Dynamic Programming on FPGAs with OpenCL", in *High Performance Extreme Computing (HPEC)*, 2013.

• Use of hardware compilers has grown ~20x since 2011.<sup>(2)</sup>

(2) S. Raje, "Extending the power of FPGAs to software developers", in *Field-Programmable Logic and Applications (FPL)*, 2015. Keynote.

- Use of hardware compilers has grown ~20x since 2011.<sup>(2)</sup>
- There are ~19x more software engineers than hardware engineers.<sup>(3)</sup>

- (2) S. Raje, "Extending the power of FPGAs to software developers", in *Field-Programmable Logic and Applications (FPL)*, 2015. Keynote.
- (3) United States Bureau of Labor Statistics, "Occupational Outlook Handbook, 2016–17 Edition", 2015.

- Use of hardware compilers has grown ~20x since 2011.<sup>(2)</sup>
- There are ~19x more software engineers than hardware engineers.<sup>(3)</sup>
- A user survey found "Lack of C-to-RTL formal verification" to be the biggest problem with hardware compilation.<sup>(4)</sup>

- (2) S. Raje, "Extending the power of FPGAs to software developers", in *Field-Programmable Logic and Applications (FPL)*, 2015. Keynote.
- (3) United States Bureau of Labor Statistics, "Occupational Outlook Handbook, 2016–17 Edition", 2015.
- (4) Deep Chip, "Survey on HLS verification issues and power reduction", 2014. http://www.deepchip.com/items/0544-03.html

#### Hardware Compilation of Concurrency



• Atomics must appear to execute **instantaneously** to other threads

- Atomics must appear to execute instantaneously to other threads
- Atomics provide a variety of **ordering** guarantees

- Atomics must appear to execute **instantaneously** to other threads
- Atomics provide a variety of **ordering** guarantees

```
x = 1;
atomic_store(&y, 1,
memory_order_release); r = atomic_load(&y,
memory_order_acquire);
if (r==1) { print(x); }
```

- Atomics must appear to execute instantaneously to other threads
- Atomics provide a variety of **ordering** guarantees

```
x = 1;
atomic_store(&y, 1,
memory_order_release); r = atomic_load(&y,
memory_order_acquire);
if (r==1) { print(x); }
```

```
r1 = atomic_load(&x,
    memory_order_relaxed);
r2 = atomic_load(&x,
    memory_order_relaxed);
```

```
atomic_store(&x, 1,
    memory_order_relaxed);
```

• x86 proved tricky to formalise correctly.<sup>(5,6)</sup>

- (5) Sarkar et al., *POPL*, 2009.
- (6) Owens et al., *TPHOLs*, 2009.

- x86 proved tricky to formalise correctly.<sup>(5,6)</sup>
- Bug found in deployed "IBM Power 5" processors.<sup>(7)</sup>

- (5) Sarkar et al., *POPL*, 2009.
- (6) Owens et al., *TPHOLs*, 2009.
- (7) Alglave et al., *CAV*, 2010.

- x86 proved tricky to formalise correctly.<sup>(5,6)</sup>
- Bug found in deployed "IBM Power 5" processors.<sup>(7)</sup>
- C++ specification did not guarantee its own key property.<sup>(8)</sup>

- (5) Sarkar et al., *POPL*, 2009.
- (6) Owens et al., *TPHOLs*, 2009.
- (7) Alglave et al., *CAV*, 2010.
- (8) Batty et al., *POPL*, 2011.

- x86 proved tricky to formalise correctly.<sup>(5,6)</sup>
- Bug found in deployed "IBM Power 5" processors.<sup>(7)</sup>
- C++ specification did not guarantee its own key property.<sup>(8)</sup>
- Behaviour of NVIDIA's graphics processors contradicted their own programming guide.<sup>(9)</sup>
- (5) Sarkar et al., *POPL*, 2009.
- (6) Owens et al., *TPHOLs*, 2009.
- (7) Alglave et al., *CAV*, 2010.
- (8) Batty et al., *POPL*, 2011.
- (9) Alglave et al., *ASPLOS*, 2015.







r = atomic\_load(&y, memory\_order\_acquire);









\*\*not supported\*\*



r = atomic\_load(&y, memory\_order\_acquire);









lock(); r = y;unlock();

. . . . . . . . . .

(intel)

**E** XILINX

**FPGA** 



r = atomic\_load(&y, memory\_order\_acquire);





r = atomic\_load(&y, memory\_order\_acquire);

r = y;



- Atomics must appear to execute instantaneously to other threads
- Atomics provide a variety of **ordering** guarantees

- Atomics must appear to execute **instantaneously** to other threads
- Atomics provide a variety of **ordering** guarantees

- Atomics must appear to execute **instantaneously** to other threads
- Atomics provide a variety of **ordering** guarantees

r1 = atomic\_load(&x, memory\_order\_relaxed); r2 = atomic\_load(&x, memory\_order\_relaxed); atomic\_store(&x, 1, memory\_order\_relaxed);

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r1 = x;<br>r2 = x;                                                                                                     | x = 1;                                                        |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed)</pre> |  |
|------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|--|
| r1 = x;<br>r2 = x;                                                                                                     | x = 1;                                                       |  |

|         | 1      | 2 | 3      | 4 |
|---------|--------|---|--------|---|
| r1 = x; | load x |   |        |   |
| r2 = x; |        |   | load x |   |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r1 = x;<br>r2 = x;                                                                                                     | x = 1;                                                        |

|         | 1   | 2   | 3   | 4   |
|---------|-----|-----|-----|-----|
| r1 = x; | loa | d x |     |     |
| r2 = x; |     |     | loa | d x |

|        | 1       |
|--------|---------|
| x = 1; | store x |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r1 = x;<br>r2 = x;                                                                                                     | x = 1;                                                        |
| 1 2                                                                                                                    | 1                                                             |

store x

x = 1;

|         | 1   | 2   |
|---------|-----|-----|
| r1 = x; | loa | d x |
| r2 = x; | loa | d x |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r1 = x;<br>r2 = x;                                                                                                     | x = 1;                                                        |

|        | 1       |
|--------|---------|
| x = 1; | store x |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r1 = x;<br>r2 = x/a;                                                                                                   | x = 1;                                                        |

|        | 1       |
|--------|---------|
| x = 1; | store x |

| <pre>r1 = atomic_load(&amp;x,<br/>memory_order_relaxed);<br/>r2 = atomic_load(&amp;x,<br/>memory_order_relaxed);</pre> | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |
|------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
| r0 = y+y+y+y+y;<br>r1 = x;<br>r2 = x/a;                                                                                | x = 1;                                                        |

|        | 1       |
|--------|---------|
| x = 1; | store x |

|        | <pre>r1 = atomic_load(&amp;x,     memory_order_relaxed); r2 = atomic_load(&amp;x,     memory_order_relaxed);</pre> |            |      |          |        |      |        |    |  | <pre>atomic_store(&amp;x, 1,<br/>memory_order_relaxed);</pre> |      |  |         |
|--------|--------------------------------------------------------------------------------------------------------------------|------------|------|----------|--------|------|--------|----|--|---------------------------------------------------------------|------|--|---------|
|        | r1                                                                                                                 | = 7<br>= 7 | ۲;   | y+y+y+y; |        |      | x = 1; |    |  |                                                               |      |  |         |
|        |                                                                                                                    |            |      |          |        |      |        |    |  | _                                                             |      |  |         |
|        |                                                                                                                    | 1          | 2    | 3        | 4      | 5    | •••    | 36 |  |                                                               |      |  | 1       |
|        |                                                                                                                    | loa        | d y  |          |        |      |        |    |  | x :                                                           | = 1; |  | store x |
|        |                                                                                                                    | load       |      |          |        |      |        |    |  |                                                               |      |  |         |
| r0 =   | └┭┮⊥                                                                                                               |            | loa  | d y      |        |      |        |    |  |                                                               |      |  |         |
|        | y+y+y+<br>y+y+y;                                                                                                   |            |      | load y   |        |      |        |    |  |                                                               |      |  |         |
|        |                                                                                                                    |            |      | loa      | d y    |      |        |    |  |                                                               |      |  |         |
|        |                                                                                                                    |            |      |          | loa    | d y  |        |    |  |                                                               |      |  |         |
| r1 = > | ζ;                                                                                                                 |            |      |          | loa    | .d x |        |    |  |                                                               |      |  |         |
|        | ./                                                                                                                 | loa        | .d x |          |        |      |        |    |  |                                                               |      |  |         |
| r2 = > | :/a;                                                                                                               |            |      |          | divide |      |        |    |  |                                                               |      |  |         |
|        |                                                                                                                    |            |      |          |        |      |        |    |  |                                                               |      |  |         |

• Two atomic accesses to the same location cannot be reordered.

- Two atomic accesses to the same location cannot be reordered.
- An atomic acquire load cannot be reordered with accesses that come later in program order

- Two atomic accesses to the same location cannot be reordered.
- An atomic acquire load cannot be reordered with accesses that come later in program order
- An atomic release store cannot be reordered with accesses that come earlier in program order

- Two atomic accesses to the same location cannot be reordered.
- An atomic acquire load cannot be reordered with accesses that come later in program order
- An atomic release store cannot be reordered with accesses that come earlier in program order
- An atomic SC access cannot be reordered with any other access.

#### Results



(10) Ramanathan et al., "Hardware Synthesis of Weakly Consistent C Concurrency", FPGA, 2017

• Ask Memalloy<sup>(11)</sup> for an execution that is **forbidden** according to the C++ standard but is **allowed** by our scheduling constraints.

• Ask Memalloy<sup>(11)</sup> for an execution that is **forbidden** according to the C++ standard but is **allowed** by our scheduling constraints.



(11) Wickerson et al., "Automatically Comparing Memory Consistency Models", POPL, 2017

• Ask Memalloy<sup>(11)</sup> for an execution that is **forbidden** according to the C++ standard but is **allowed** by our scheduling constraints.



• Memalloy uses the Alloy model checker, which in turn uses a SAT-solving backend.

(11) Wickerson et al., "Automatically Comparing Memory Consistency Models", POPL, 2017

#### Can we do better?



atomic\_store(&x, 1,
 memory\_order\_relaxed);

|         | 1      | 2   |
|---------|--------|-----|
| r1 = x; | load x |     |
| r2 = x; | loa    | d x |

|        | 1       |
|--------|---------|
| x = 1; | store x |



|         | 1      | 2 |
|---------|--------|---|
| r1 = x; | load x |   |
| r2 = x; | load x |   |















## Longer paths too





(12) Ramanathan et al., "Concurrency-Aware Thread Scheduling for High-Level Synthesis", FCCM, 2018



x = 1; atomic\_store(&y, 1, memory\_order\_release); atomic\_store(&y, 1, memory\_order\_release); atomic\_store(&y, 1, memory\_order\_release); } r = atomic\_load(&y, memory\_order\_acquire); if (r==1) { atomic\_store(&z, 1, memory\_order\_release); atomic\_store(&z, 1, memory\_order\_release); } s = atomic\_load(&z, memory\_order\_acquire); if (r==1) { atomic\_store(&z, 1, memory\_order\_release); }

































• Our solution: enumerate only the "primary" paths.



```
x = 1;
atomic store (&y, 1,
  memory order release);
atomic_store(&y, 1,
  memory order release);
```

```
r = atomic_load(&y,
    memory_order_acquire);
r = atomic_load(&y,
    memory_order_acquire);
memory_order_acquire);
s = atomic_load(&z,
    memory_order_acquire);
```

```
if (r==1) {
 atomic store(&z, 1,
  memory_order_release);
 atomic_store(&z, 1,
memory_order_release);
```

```
if (s==1) { print(x); }
```

```
x = 1;
atomic store (&y, 1,
  memory order release);
atomic_store(&y, 1,
  memory order release);
```

```
r = atomic_load(&y,
    memory_order_acquire);
r = atomic_load(&y,
    memory_order_acquire);
memory_order_acquire);
s = atomic_load(&z,
    memory_order_acquire);
```

```
if (r==1) {
 atomic store(&z, 1,
  memory_order_release);
 atomic_store(&z, 1,
    memory_order_release);
```

```
if (s==1) { print(x); }
```

x = 1;

```
atomic_store(&y, 1,
    memory_order_release);
atomic_store(&y, 1,
    memory_order_release);
```

```
r = atomic_load(&y,
    memory_order_acquire);
r = atomic_load(&y,
    memory_order_acquire);
```

```
if (r==1) {
  atomic_store(&z, 1,
    memory_order_release);
  atomic_store(&z, 1,
    memory_order_release);
```

s = atomic\_load(&z, memory\_order\_acquire); s = atomic\_load(&z, memory\_order\_acquire);

```
if (s==1) { print(x); }
```

x = 1;

```
atomic_store(&y, 1,
    memory_order_release);
atomic_store(&y, 1,
    memory_order_release);
```

 $r = atomic_load(&y)$ memory\_order\_acquire);
r = atomic\_load(&y, memory\_order\_acquire); if (r==1) { atomic\_store(&z, / memory\_order\_release); atomic\_store(&z, 1, memory\_order\_release);

s = atomic\_load(&z, memory\_order\_acquire); s = atomic\_load(&z, memory\_order\_acquire);

```
if (s==1) { print(x); }
```

x = 1;

```
atomic store(&y, 1,
 memory_order_release); if (r==1) {
atomic_store(&y, 1,
 memory order release);
```

r = atomic\_load(&y, memory\_order\_acquire); r = atomic\_load(&y, memory\_order\_acquire); memory\_order\_acquire); s = atomic\_load(&z, memory\_order\_acquire);

```
atomic store(&z, 1,
memory_order_release);
atomic_store(&z, 1,
   memory_order_release);
```

memory order acquire); memory\_order\_acquire);

```
if (s==1) { print(x); }
```

x = 1;

```
atomic store(&y, 1,
 memory order release); if (r==1) {
atomic_store(&y, 1,
 memory order release);
```

r = atomic\_load(&y, memory\_order\_acquire); r = atomic\_load(&y, memory\_order\_acquire); memory\_order\_acquire);s = atomic\_load(&z, memory\_order\_acquire);

```
atomic store(&z, 1,
memory_order_release);
atomic_store(&z, 1,
   memory_order_release);
```

memory order acquire); memory\_order\_acquire); if (s==1) { print(x); }



#### Checking correctness

• As before, we use Memalloy to check that our constraints are strong enough to guarantee C++ semantics.











• Heavyweight: a fully verified hardware compiler (e.g. a Verilog backend for CompCert).





- Heavyweight: a fully verified hardware compiler (e.g. a Verilog backend for CompCert).
- Lightweight: automatically generate and verify SystemVerilog assertions, *à la* RTLCheck.<sup>(13)</sup>

(13) Manerkar et al., "RTLCheck: Verifying the Memory Consistency of RTL Designs", MICRO, 2017.

#### Towards Verified Hardware Compilation

John Wickerson Imperial College London

FMATS Workshop, Microsoft Research Cambridge, 24 Sep 2018