

# COLLABORATORS



Mark Batty



George Constantinides



Nadesh Ramanathan



Alastair Donaldson



Tyler Sorensen



Brad Beckmann

#### THIS TALK



• Weak memory



- Weak memory
- Formalising the OpenCL memory model



- Weak memory
- Formalising the OpenCL memory model
- Implementing the OpenCL memory model on GPUs



- Weak memory
- Formalising the OpenCL memory model
- Implementing the OpenCL memory model on GPUs
- Implementing the OpenCL memory model on FPGAs

r0=1 r1=1

r0=1 r0=0 r1=1 r1=1

| r0=1 | r0=0 | r0=1 |
|------|------|------|
| r1=1 | r1=1 | r1=0 |





$$MOV [x] 1 MOV [y] 1$$

$$MOV r0 [y] MOV r1 [x]$$

| r0=1 | r0=0 | r0=1 | r0=0 |
|------|------|------|------|
| r1=1 | r1=1 | r1=0 | r1=0 |















#### WEAK MEMORY IS HARD!

- x86 proved tricky to formalise correctly [Sarkar et al., POPL'09; Owens et al., TPHOLs'09]
- Bug found in deployed "Power 5" processors [Alglave et al., CAV'10]
- C++ specification did not guarantee its own key property [Batty et al., POPL'11]
- Routine compiler optimisations are invalid under Java and C++ memory models [Sevcik, PLDI'11; Vafeiadis et al. POPL'15]
- Behaviour of NVIDIA graphics processors contradicted
   NVIDIA's programming guide [Alglave et al., ASPLOS'15]





- Formalising the OpenCL memory model
- Implementing the OpenCL memory model on GPUs
- Implementing the OpenCL memory model on FPGAs













For an atomic operation **B** that reads the value of an atomic object **M**, if there is a *memory\_order\_seq\_cst* fence **X** sequenced-before **B**, then **B** observes either the last *memory\_order\_seq\_cst* modification of **M** preceding **X** in the total order **S** or a later modification of **M** in its modification order. [OpenCL 2.0 standard, 2015]





For an atomic operation **B** that reads the value of an atomic object **M**, if there is a *memory\_order\_seq\_cst* fence **X** sequenced-before **B**, then **B** observes either the last *memory\_order\_seq\_cst* modification of **M** preceding **X** in the total order **S** or a later modification of **M** in its modification order. [OpenCL 2.0 standard, 2015]

#### "OpenCL"

withoutsc

let mo = co & ((!nonatomicloc)^2) let sb = po let rb = (rf^-1; mo) \ id

(\* Access modes \*) let mo\_acq = memory\_order\_acquire let mo rel = memory\_order\_release let mo\_acq\_rel = memory\_order\_acq\_rel let mo\_rlx = memory\_order\_relaxed let mo\_sc = memory\_order\_sea\_cst

(\* Scope annotations \*) let s\_wi = memory\_scope\_work\_item let s wg = memory\_scope\_work\_group let s\_dev = memory\_scope\_device let s\_all = memory\_scope\_all\_svm\_devices

(\* Synchronisation \*) 

let acq = (mo\_acq | mo\_acq\_rel | mo\_sc) & (R | F) let rel = (mo\_rel | mo\_acq\_rel | mo\_sc) & (W | F)

```
(* Fences sequenced before or after *)
let Fsb = [F]; sb
let sbF = sb; [F]
```

(\* Release sequence \*) let rs' = wi | (unv; [R & W])

(\* Inclusive scopes, less conservative

let incl1 = ([s\_wg];wg) | ([s\_dev];dev)

| ([s\_all];unv)

(\* Release-acquire synchronisation \*)

[[r & rel]; Fsb?; [W \ s\_wi]; rs?; [r]; rf;

[R \ s\_wi]; sbF?; [acq & r]) & incl & -wi

(unused) version \*)

let ra sw(r) =

let incl' = incl1 & incl1^-1

(\* Inclusive scopes \*)

let incl = wg & s\_wg^2 | dev & s\_dev^2 | s\_all^2

let rs = mo & rs' & ~((mo & ~rs') ; mo)

let coh(hb) = (rf^-1)?; mo; rf?; hb irreflexive coh(ghb) as O-CohG irreflexive coh(lhb) as 0-CohL

(\* Consistency of reads \*)

(\* Visible side effects \*)

that is visible. \*)

(\* Consistency of RMWs \*)

irreflexive rf; (ghb | lhb) as 0-Rf

let vis(hb) = (W \* R) & hb & loc &

happened. \*)

(\* Coherence \*) [\*\*\*\*\*\*\*\*\*

(\* Global and local happens-before \*) let ghb = (((G^2) & (sb | (I \* |I))) | gsw)+ let lhb = (((L^2) & (sb | (I \* !I))) | lsw)+ show ghb show lhb irreflexive ghb as 0-HbG irreflexive lhb as 0-HbL

(\* A load can only read from a store that already

-((hb & loc); [W]; hb)

(\* A non-atomic load can only read from a store

empty (rf;[G & nonatomicloc])\vis(ghb) as O-NaRfG empty (rf;[L & nonatomicloc])\vis(lhb) as O-NaRfL

irreflexive rf | (mo;mo;rf^-1) | (mo;rf) as 0-Rmw

(\* Happens-before \*) 

(\* Barrier synchronisation \*)

(\* Global and local synchronises-with \*) let gsw = ra\_sw(G) | bar\_sw(G) | (scf & ra\_sw(L)) let lsw = ra\_sw(L) | bar\_sw(L) | (scf & ra\_sw(G))

(\* Allowed to synchronise on the other region \*) let scf = mo\_sc^2 | (G & L & F)^2

let bar sw(r) = (entry fence \* exit fence) & same\_B & ~wi & wg & r^2

F at fe la pr

of

#### "OpenCL"

withoutsc

let mo = co & ((!nonatomicloc)^2) let sb = po let rb = (rf^-1; mo) \ id

(\* Access modes \*) let mo\_acq = memory\_order\_acquire let mo\_rel = memory\_order\_release let mo\_acq\_rel = memory\_order\_acq\_rel let mo\_rlx = memory\_order\_relaxed let mo\_sc = memory\_order\_seq\_cst

(\* Scope annotations \*) let s\_wi = memory\_scope\_work\_item let s wg = memory\_scope\_work\_group let s\_dev = memory\_scope\_device let s\_all = memory\_scope\_all\_svm\_devices

(\* Synchronisation \*) 

let acq = (mo\_acq | mo\_acq\_rel | mo\_sc) & (R | F) let rel = (mo\_rel | mo\_acq\_rel | mo\_sc) & (W | F)

```
(* Fences sequenced before or after *)
let Fsb = [F]; sb
let sbF = sb; [F]
```

(\* Release sequence \*) let rs' = wi | (unv; [R & W]) let rs = mo & rs' & ~{(mo & ~rs') ; mo)

(\* Inclusive scopes \*) let incl = wg & s\_wg^2 | dev & s\_dev^2 | s\_all^2

(\* Inclusive scopes, less conservative (unused) version \*) let incl1 = ([s\_wg];wg) | ([s\_dev];dev)

| ([s\_all];unv) let incl' = incll & incll^-1

(\* Release-acquire synchronisation \*) let ra\_sw(r) = ([r & rel]; Fsb?; [W \ s\_wi]; rs?; [r]; rf;

```
-((hb & loc); [W]; hb)
(* A non-atomic load can only read from a store
  that is visible. *)
empty (rf;[G & nonatomicloc])\vis(ghb) as O-NaRfG
empty (rf;[L & nonatomicloc])\vis(lhb) as O-NaRfL
```

(\* A load can only read from a store that already

(\* Consistency of RMWs \*) mourney rf^\_1) | (mo:rf) as 0-Rmw

Key publication: ACM POPL 2016 (with Batty & Donaldson)

(\* Barrier synchronisation \*) let bar sw(r) = (entry fence \* exit fence) & same\_B & ~wi & wg & r^2

let gsw = ra\_sw(G) | bar\_sw(G) | (scf & ra\_sw(L)) let lsw = ra\_sw(L) | bar\_sw(L) | (scf & ra\_sw(G))

(\* Global and local synchronises-with \*)

let scf = mo\_sc^2 | (G & L & F)^2

(\* Global and local happens-before \*)

let coh(hb) = (rf^-1)?; mo; rf?; hb

irreflexive rf; (ghb | lhb) as 0-Rf

let vis(hb) = (W \* R) & hb & loc &

irreflexive coh(ghb) as O-CohG

irreflexive coh(lhb) as 0-CohL

(\* Consistency of reads \*)

(\* Visible side effects \*)

happened. \*)

let ghb = (((G^2) & (sb | (I \* |I))) | gsw)+

let lhb = (((L^2) & (sb | (I \* !I))) | lsw)+

(\* Happens-before \*)

irreflexive ghb as 0-HbG irreflexive lhb as 0-HbL

show ghb

show lhb

(\* Coherence \*)

[\*\*\*\*\*\*\*\*\*





For an atomic operation **B** that reads the value of an atomic object **M**, if there is a *memory\_order\_seq\_cst* fence **X** sequenced-before **B**, then **B** observes either the last *memory\_order\_seq\_cst* modification of **M** preceding **X** in the total order **S** or a later modification of **M** in its modification order. [OpenCL 2.0 standard, 2015]





Formalising the OpenCL memory model

- Implementing the OpenCL memory model on GPUs
- Implementing the OpenCL memory model on FPGAs












|                | na or WG              | DV (not remote)                                                | DV (remote)                                                                                                         |
|----------------|-----------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LD r x                | INV <sub>L1</sub> WG<br>LD r x                                 | $FLU_{L1} DV$ $INV_{L1} WG$ $LK x$ $LD r x$                                                                         |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                 | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | $FLU_{L1}$ WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV                       |



|                | na or WG              | DV (not remote)                                                | DV (remote)                                                                                                         |
|----------------|-----------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LD r x                | INV <sub>L1</sub> WG<br>LD r x                                 | $FLU_{L1} DV$ $INV_{L1} WG$ $LK x$ $LD r x$                                                                         |
| store(x,r)     | ST r x                | FLU <sub>L1</sub> WG<br>ST r x                                 | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x) | INC <sub>L1</sub> r x | $FLU_{L1}$ WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV                       |

|                             | na or WG                | DV (not remote)                                                | DV (remote)                                                                                                         |
|-----------------------------|-------------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)<br><b>message</b> | LD r x<br>passing error | INV <sub>L1</sub> WG<br>LD r x                                 | $FLU_{L1} DV$ $INV_{L1} WG$ $LK x$ $LD r x$                                                                         |
| store(x,r)                  | ST r x                  | FLU <sub>L1</sub> WG<br>ST r x                                 | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x)              | $INC_{L1}$ rx           | $FLU_{L1}$ WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> DV                       |

|                             | na or WG                | DV (not remote)                                                | DV (remote)                                                                                                         |
|-----------------------------|-------------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)<br><b>message</b> | LD r x<br>passing error | INV <sub>L1</sub> WG<br>LD r x                                 | $ \left. \begin{array}{c} FLU_{L1} & DV \\ INV_{L1} & WG \\ LD & r & x \end{array} \right\} LK & x $                |
| store(x,r)                  | ST r x<br>RMW ator      | FLU <sub>L1</sub> WG<br>ST r x<br><b>micity error</b>          | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x)              | INC <sub>L1</sub> r x   | $FLU_{L1}$ WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x K <sub>rmw</sub><br>INV <sub>L1</sub> DV      |

|                             | na or WG                | DV (not remote)                                                | DV (remote)                                                                                                         |
|-----------------------------|-------------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| r=load(x)<br><b>message</b> | LD r x<br>passing error | Unneces<br>INVL1 WG<br>LD r x                                  | Sary locking<br>FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>LD r x                                              |
| store(x,r)                  | ST r x<br>RMW atoi      | FLU <sub>L1</sub> WG<br>ST r x<br><b>micity error</b>          | $ \left. \begin{array}{c} FLU_{L1} & WG \\ ST & r & x \\ INV_{L1} & DV \end{array} \right\} LK & x \\ \end{array} $ |
| r=fetch_inc(x)              | INCL1 r x               | $FLU_{L1}$ WG<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x | FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG<br>INC <sub>L2</sub> r x LK <sub>rmw</sub><br>INV <sub>L1</sub> DV     |

|                | na or WG      | DV (not remote)                                                       | DV (remote)                                                                                                           |
|----------------|---------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| r=load(x)      | LD r x        | LD r x<br>INV <sub>L1</sub> WG                                        | LD r x<br>FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG                                                                |
| store(x,r)     | ST r x        | FLU <sub>L1</sub> WG<br>ST r x                                        | $ \left. \begin{array}{c} FLU_{L1} & WG \\ INV_{L1} & DV \\ ST & r & x \end{array} \right\} LK_{rmw} $                |
| r=fetch_inc(x) | $INC_{L1}$ rx | FLU <sub>L1</sub> WG<br>INC <sub>L2</sub> r x<br>INV <sub>L1</sub> WG | FLU <sub>L1</sub> WG<br>INV <sub>L1</sub> DV<br>INC <sub>L2</sub> r x<br>FLU <sub>L1</sub> DV<br>INV <sub>L1</sub> WG |





"Only after the work by John and others in the programming languages community, have programmers gained confidence in using fine-grain inter-thread communication and synchronization on GPUs." Dr Brad Beckmann, Principal Researcher, AMD



"Only after the work by John and others in the programming languages community, have programmers gained confidence in using fine-grain inter-thread communication and synchronization on GPUs." Dr Brad Beckmann, Principal Researcher, AMD

Key publication: ACM OOPSLA '15 (with Batty, Beckmann, & Donaldson)

























Key publication: ACM POPL '17 (with Batty, Sorensen, & Constantinides)











Key publication: ACM PLDI '18 (with Chong & Sorensen)





Formalising the OpenCL memory model

Implementing the OpenCL memory model on GPUs

• Implementing the OpenCL memory model on FPGAs











r = atomic\_load(&y, memory\_order\_acquire);



\*\*not supported\*\*

r = atomic\_load(&y, memory\_order\_acquire);


r = atomic\_load(&y, lock();
memory\_order\_acquire); r = y;
unlock();

| Clock cycle:                     | 1   | 2   | 3      | 4 | 5   | 6   |
|----------------------------------|-----|-----|--------|---|-----|-----|
| r1 = x                           | loa | d x |        |   |     |     |
| r2 = atomic_load(&y,<br>acquire) |     |     | load y |   |     |     |
| r3 = z                           |     |     |        |   | loa | d z |

| Clock cycle:                     | 1   | 2   | 3      | 4 | 5      | 6 |
|----------------------------------|-----|-----|--------|---|--------|---|
| r1 = x                           | loa | d x |        |   |        |   |
| r2 = atomic_load(&y,<br>acquire) |     |     | load y |   |        |   |
| r3 = z                           |     |     |        |   | load z |   |

"Too conservative!"

| Clock cycle:                     | 1      | 2 | 3 | 4 | 5 | 6 |
|----------------------------------|--------|---|---|---|---|---|
| r1 = x                           | load x |   |   |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |   |   |   |   |
| r3 = z                           | load z |   |   |   |   |   |

| Clock cycle:                     | 1      | 2 | 3 | 4 | 5 | 6 |
|----------------------------------|--------|---|---|---|---|---|
| r1 = x                           | load x |   |   |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |   |   |   |   |
| r3 = z                           | load z |   |   |   |   |   |

"Too aggressive!"

| Clock cycle:                     | 1      | 2 | 3      | 4 | 5 | 6 |
|----------------------------------|--------|---|--------|---|---|---|
| r1 = x                           | load x |   |        |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |        |   |   |   |
| r3 = z                           |        |   | load z |   |   |   |

| Clock cycle:                     | 1      | 2 | 3      | 4 | 5 | 6 |
|----------------------------------|--------|---|--------|---|---|---|
| r1 = x                           | load x |   |        |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |        |   |   |   |
| r3 = z                           |        |   | load z |   |   |   |

"Just right!"

| Clock cycle:                     | 1      | 2 | 3 | 4 | 5 | 6 |
|----------------------------------|--------|---|---|---|---|---|
| r1 = x                           | load x |   |   |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |   |   |   |   |
| r3 = z                           | load z |   |   |   |   |   |

"Too aggressive!"

| Clock cycle:                     | 1      | 2 | 3 | 4 | 5 | 6 |
|----------------------------------|--------|---|---|---|---|---|
| r1 = x                           | load x |   |   |   |   |   |
| r2 = atomic_load(&y,<br>acquire) | load y |   |   |   |   |   |
| r3 = z                           | load z |   |   |   |   |   |

"Too aggressive!" ... or is it?





Key publications: ACM FPGA '17, IEEE Trans. on Computers '17, IEEE FCCM '18 (with Ramanathan, Fleming & Constantinides)

# THE FUTURE?



# THE FUTURE?

