## Background reading

332 Advanced Computer Architecture Chapter 4

Compiler issues: dependence analysis, vectorisation, automatic parallelisation

February 2007 Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach ( $3^{d}$ and  $4^{th}$  eds), and on the lecture slides of David Patterson and John Kubiatowicz's Berkeley course

Advanced Computer Architecture Chapter 4.1

The material for this part of the course is introduced only very briefly in Hennessy and Patterson (section 4.4 pp319). A good textbook which covers it properly is

Michael Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley, 1996.

Much of the presentation is taken from the following research paper:

U. Banerjee. Unimodular transformations of double loops. In Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computing, Irvine, CA. Pitman/MIT Press, 1990.

Banerjee's paper gives a simplified account of the theory in the context only of perfect doubly-nested loops with well-known dependences.

#### Advanced Computer Architecture Chapter 4.2

## Introduction

- In this segment of the course we consider compilation issues for loops involving arrays:
- Mow execution order of a loop is constrained,
- How a compiler can extract dependence information, and
- Mow this can be used to optimise a program.

# Understanding and transforming execution order can help exploit architectural features:

- ▶ Pipelined, superscalar and VLIW processors
- Systems which rely heavily on caches
- Processors with special instructions for vectors (SSE, AltiVec)
- Multiprocessors, multicore, and co-processors/accelerators

#### Idvanced Computer Architecture Chapter 4.3

## Restructuring

- Here we consider a special kind of optimisation, which is currently performed only by specialist compilers -"restructuring compilers".
- M Conventional optimisations must also be performed
- ▶ The difference is this:
  - Conventional optimisations reduce the amount of work the computer has to do at run-time
  - Restructuring aims to do the work in an order which suits the target architecture better





Inside: Pentium 4 processor

Let's experiment with a perfectly ordinary laptop:

- The experiments were performed on a Toshiba Satellite Pro 6100 laptop
- This machine has a 1.6GHz Intel Pentium 4 Mobile processor (we'll look at some other processors shortly)

Memory system



- L1 Instruction ("trace") cache: 12K microinstructions
- L1 Data cache: 8 KB, 4-way, 64 bytes/line, non-blocking, dual-ported, write-through, pseudo-LRU
- L2 unified cache: 512 KB, 2-Way, 64 Byte/Line, non-blocking

#### Suppose we're interested in quite big matrices, N=1088

- ▶ The matrix occupies 1088<sup>2</sup>×8 = 9.5MBytes
- ▶ Each row of the matrix occupies 1088×8 = 8.5KBytes.

dvanced Computer Architecture Chapter 4.7

## Performance

- ▶ For N=1088, the initial version runs in 130 seconds.
- The matrix multiplication takes 1088<sup>3</sup> steps, each involving two floating-point operations, an add and a multiply, i.e. 2.6×10<sup>9</sup> "FLOPs"
- This loop achieves a computation rate of 2600/130=19.8 MFLOPs.
- That is, one floating-point operation completed every 80 clock cycles (the chip runs at 1.6GHz)
- In How are we going to get value for money?

## Interchange loops



- Does it still give the right output?
- Ir Can we do better still?



IJK variant computes each element of result matrix C one at a time, as inner product of row of A and column of B

Traverses A in row-major order, B in column-major



 IKJ variant accumulates partial inner product into a row of result matrix C, using element of A and row of B

• Traverses C and B in row-major order

Advanced Computer Architecture Chapter 4.10



## The price of naivety

 Relative speedup of IKJ version over IJK version (per machine, per problem size)

On large problems, the IKJ variant is 2-10 times faster

# Blocking (a.k.a. "tiling")

- Idea: reorder execution of loop nest so data isn't evicted from cache before it's needed again.
- Blocking is a combination of two transformations: "strip mining", followed by interchange; we start with



## Blocking/tiling - stripmine then interchange

```
Now interchange so blocked loops are outermost:
for (kk = 0; kk < N; kk += 5)
for (jj = 0; jj < N; jj += 5)
for (i = 0; i < N; i++)
for (k = kk; k < min(kk+S,N); k++){
    r = A[i][k];
    for (j = jj; j < min(jj+S, N); j++)
        C[i][j] += r * B[k][j];
}
The inner i,k,j loops perform a multiplication of a pair of
partial matrices.
S is chosen so that a S x S submatrix of B and a row of
length S of C can fit in the cache.</pre>
```

What is the right value for S?

## Blocking/tiling - stripmine then interchange



## Blocking/tiling - stripmine then interchange



## Blocking/tiling - stripmine then interchange



## Blocking/tiling - stripmine then interchange



## Blocking/tiling - stripmine then interchange



## Blocking/tiling - stripmine then interchange



for (jj=0:N:S) C[0:N][jj:jj+S] += A[0:N][kk:kk+S] \* B[kk:kk+S][jj:jj+S] for last kk

Performance of blocked version: 1.6GHz Pentium 4M (N=1088)





#### Performance of blocked version: Thinkpad T60 (N=1003)

1.8 GHz Intel Core Duo (Lenovo Thinkpad T60) (gcc3.4.4)

Optimum blocking factor is 48, where we reach 866 MFLOPs Avoiding "min" operator doesn't help.

On battery power, clock rate drops to 987MHz, so only 469 MFLOPS (48 still best). In direct proportion to clock rate reduction.

#### Performance of blocked version: Pentium 3 (N=512)



Thinkpad T21 800MHz Pentium III (VS6.0)



Optimum blocking factor is 64, where we reach 692.4 MFLOPs Since 64 divides 1088 exactly, we can avoid "min" operator, giving 833.6 MFLOPs Using Intel compiler (-WI,-melf\_i386) this reaches 998 MFLOPs Using AMD's AMCL library this machine can reach ~4GFLOPS... there is a lot more you can do to improve matrix multiply performance!

## Impact....

- On Toshiba Satellite Pro 6100 laptop (1.6GHz Pentium 4M):
- Poriginal version: 130 seconds (19.8 MFLOP/s)
- Blocked version: 7.55 seconds (341 MFLOP/s)
  - We started with a "good" optimising compiler!
  - ➡Factor of 17 performance improvement.
  - No reduction in amount of arithmetic performed.
- (Using the Intel library or the ATLAS library does even better)

Blocksize 32: 2.013 seconds, 133.4 MFLOP/s

#### Dependence

#### How?

#### 🕨 Define:

- ➡IN(S): set of memory locns which might be read by some execn of statement S
- OUT(S): set of memory locns which might be written by some execn of statement S

#### Reordering is constrained by dependences;

#### ▶ There are four types:

| Data ("true") dependence: S1 δ S2 ●OUT(S1) ∩ IN(S2)                        | ("S1 must write something<br>before S2 can read it")                          |
|----------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Anti dependence: S1 <sup>8</sup> S2 IN(S1) ∩ OUT(S2)                       | ("S1 must read something before S2 overwrites it")                            |
| <ul> <li>Output dependence: S1 δ° S2</li> <li>OUT(S1) ∩ OUT(S2)</li> </ul> | ("If S1 and S2 might both<br>write to a location, S2 must<br>write after S1") |
| ➡Control dependence: S1 δ° S2                                              |                                                                               |

#### Dependence

#### 🕨 Define:

- ♦ IN(5): set of memory locns which might be read by some execn of statement 5
- OUT(5): set of memory locns which might be written by some execn of statement S
- In Reordering is constrained by dependences;

#### In There are four types:

| ▶Data ("true") dependence: S1 δ S          |                              |
|--------------------------------------------|------------------------------|
| ●OUT(S1) ∩ IN(S2)                          | before S2 can read it")      |
| ♦Anti dependence: S1_ S2                   | ("S1 must read something     |
| IN(S1) ∩ OUT(S2)                           | before S2 overwrites it")    |
| Output dependence: S1 δ° S2                | ("If S1 and S2 might both    |
| ●OUT(S1) ∩ OUT(S2)                         | write to a location, S2 must |
| ♦ Control dependence: S1 δ <sup>c</sup> S2 | write after S1")             |
|                                            |                              |

These are static analogues of the dynamic RAW, WAR, WAW and control hazards which have to be considered in processor architecture

Advanced Computer Architecture Chapter 4.26

How?

#### Loop-carried dependences le Consider: **S1**: A[0] := 0 for I = 1 to 8 **S2**: A[I] := A[I-1] + B[I]What does this loop do? B: 1 1 1 1 1 1 1 1 0 **A**:





# What is a loop-carried dependence?

- Consider two iterations I<sup>1</sup> and I<sup>2</sup>
- →A dependence occurs between two statements S<sub>p</sub> and S<sub>q</sub> (not necessarily distinct), when an assignment in S<sub>p</sub><sup>II</sup> refers to the same location as a use in S<sub>q</sub><sup>I2</sup>

#### In the example,

 $S_1$ : A[0] := 0 for I = 1 to 5  $S_2$ : A[I] := A[I-1] + B[I]

- ▶ The assignment is "A[I1] := ..."
- ▶ The use is "... := A[I<sub>2</sub>-1] ..."
- These refer to the same location when I<sup>1</sup> = I<sup>2</sup>-1
- In Thus I1 < I2, ie the assignment is in an earlier iteration

**Involution:**  $S_2 \delta_{c} S_2$ 

Advanced Computer Architecture Chapter 4.30

## Definition: The dependence equation

- Magendence occurs
  - **\bullet** between two statements  $S_p$  and  $S_q$  (not necessarily distinct),
  - when there exists a pair of loop iterations I<sup>1</sup> and I<sup>2</sup>,
  - such that a memory reference in S<sub>p</sub> in I<sup>1</sup> may refer to the same location as a memory reference in S<sub>p</sub> in I<sup>2</sup>.
- In This might occur if  $S_p$  and  $S_q$  refer to some common array A
  - (\$\$\vec{\overline{\mu}\_p(I)} is some subscript expression involving I)
- Suppose S<sub>p</sub> refers to A[φ<sub>p</sub>(I)]
   Suppose S<sub>q</sub> refers to A[φ<sub>q</sub>(I)]
- A dependence of some kind occurs between S<sub>p</sub> and S<sub>q</sub> if there exists a solution to the equation
  - $\phi_p(I^1) = \phi_q(I^2)$
- ▶ for integer values of I<sup>1</sup> and I<sup>2</sup> lying within the loop bounds

# Types of dependence

- If a solution to the dependence equation exists, a dependence of some kind occurs
- In the dependence type depends on what solutions exist
- ▶ The solutions consist of a set of pairs (I<sup>1</sup>,I<sup>2</sup>)
- In We would appear to have a *data* dependence if

 $\begin{array}{l} A[\phi_p(I)] \in OUT(S_p) \\ \text{and} \end{array}$ 

 $A[\phi_q(I)] \in IN(S_q)$ 

 ${\bf le}$  But we only really have a data dependence if the assignments precede the uses, ie

⇒S<sub>p</sub>δ, S<sub>q</sub>
 ⇒if, for each solution pair (I<sup>1</sup>, I<sup>2</sup>), I<sup>1</sup> < I<sup>2</sup>

#### Dependence versus anti-dependence

If the uses precede the assignments, we actually have an anti-dependence, ie

$$S_p \ \overline{\delta}_{\leq} \ S_q$$

- if, for each solution pair  $(I^1, I^2)$ ,  $I^1 > I^2$
- ▶ If there are some solution pairs (I<sup>1</sup>, I<sup>2</sup>) with I<sup>1</sup> < I<sup>2</sup> and some with I<sup>1</sup> > I<sup>2</sup>, we write

$$S_p \ \delta_* \ S_q$$

If, for all solution pairs (I<sup>1</sup>, I<sup>2</sup>), I<sup>1</sup> = I<sup>2</sup>, there are dependences within an iteration of the loop, but there are no loop-carried dependences:

$$S_p \delta_{=} S_q$$

## Dependence distance

- In many common examples, the set of solution pairs is characterised easily:
- Im Definition: dependence distance
  - ▶If, for all solution pairs (I<sup>1</sup>, I<sup>2</sup>), I<sup>1</sup> = I<sup>2</sup> - k then the dependence distance is k
- IF For example in the loop we considered earlier,

S<sub>1</sub>: A[0] := 0 for I = 1 to 5 S<sub>2</sub>: A[I] := A[I-1] + B[I]

We find that  $S_2 \delta_c S_2$  with dependence distance 1.

(of course there are many cases where the difference is not constant and so the dependence cannot be summarised this way)).

# Reuse distance

When optimising for cache performance, it is sometimes useful to consider the re-use relationship,

▶ $IN(S_1) \cap IN(S_2)$ 

- Here there is no dependence it doesn't matter which read occurs first
- Nonetheless, cache performance can be improved by minimising the reuse distance
- The reuse distance is calculated essentially the same way

for I = 5 to 100 S1: B[I] := A[I] \* 2 S2: C[I] := A[I-5] \* 10

Mere we have a loop-carried reuse with distance 5

Advanced Computer Architecture Chapter 4.35

## Nested loops

- ▶Up to now we have looked at single loops
- Now let's generalise to loop "nests"
- We begin by considering a very common dependence pattern, called the "wavefront":

for 
$$I_1 = 0$$
 to 3 do  
for  $I_2 = 0$  to 3 do  
 $S: A[I_1, I_2] := A[I_1 - 1, I_2] + A[I_1, I_2 - 1]$ 

Dependence structure?

#### System of dependence equations

Consider the dependence equations for this loop nest:

for 
$$I_1 = 0$$
 to 3 do  
for  $I_2 = 0$  to 3 do  
 $S: A[I_1, I_2] := A[I_1 - 1, I_2] + A[I_1, I_2 - 1]$ 

There are two potential dependences arising from the three references to A, so two systems of dependence equations to solve:

1. Between A[
$$I_1^1, I_2^1$$
] and A[ $I_1^2 - 1, I_2^2$ ]:  

$$\begin{cases}
I_1^1 = I_1^2 - 1 \\
I_2^1 = I_2^2
\end{cases}$$

2. Between A[
$$I_1^1$$
,  $I_2^1$ ] and A[ $I_1^2$ ,  $I_2^2 - 1$ ]:  

$$\begin{cases}
I_1^1 = I_1^2 \\
I_2^1 = I_2^2 - 1
\end{cases}$$

# The inner loop is not vectorisable since there is a dependence chain linking successive iterations.

- (to use a vector instruction, need to be able to operate on each element of the vector in parallel)
- Similarly, the outer loop is not parallel
- This loop is interchangeable: the top-to-bottom, left-toright execution order is also valid since all dependence constraints (as shown by the arrows) are still satisfied.
- Interchanging the loop does not improve vectorisability or parallelisability

Advanced Computer Architecture Chapter 4.39

The same loop:  
for 
$$I_1 = 0$$
 to 3 do  
for  $I_2 = 0$  to 3 do  
 $S : A[I_1, I_2] := A[I_1 - 1, I_2] + A[I_1, I_2 - 1]$   
For humans the easy way to understand this loop nest is

For humans the easy way to understand this loop nest is to draw the *iteration space graph* showing the iteration-toiteration dependences:



- $\begin{array}{l} \mbox{for } I_1 = 0 \mbox{ to 3 do} \\ \mbox{for } I_2 = 0 \mbox{ to 3 do} \\ S: \ \mbox{A}[I_1, I_2] := \mbox{A}[I_1 1, I_2] + \mbox{A}[I_1, I_2 1] \end{array}$
- The inner loop is not vectorisable since there is a dependence chain linking successive iterations.
  - (to use a vector instruction, need to be able to operate on each element of the vector in parallel)
- Similarly, the outer loop is not parallel
- This loop is interchangeable: the top-to-bottom, left-toright execution order is also valid since all dependence constraints (as shown by the arrows) are still satisfied.
- Interchanging the loop does not improve vectorisability or parallelisability



- The inner loop is not vectorisable since there is a dependence chain linking successive iterations.
  - (to use a vector instruction, need to be able to operate on each element of the vector in parallel)
- Similarly, the outer loop is not parallel
- This loop is interchangeable: the top-to-bottom, left-toright execution order is also valid since all dependence constraints (as shown by the arrows) are still satisfied.
- Interchanging the loop does not improve vectorisability or parallelisability

## Interchange: counter-example

for 
$$I_1 = 0$$
 to 3 do  
for  $I_2 = 0$  to 3 do  
 $S: A[I_1, I_2] := A[I_1 + 1, I_2 - 1] + B[I_1, I_2]$ 

Advanced Computer Architecture Chapter 4,42



#### Interchange: counter-example

#### Interchange: counter-example



#### Interchange: counter-example



#### Consider this variation on the wavefront loop: Skewing

for  $k_1 := 0$  to 3 do

for 
$$k_2 := k_1$$
 to  $k_1+3$  do

- $S: A[k_1,k_2-k_1] := A[k_1-1,k_2-k_1]+A[k_1,k_2-k_1-1]$
- The inner loop's control variable runs from k<sub>1</sub> to k<sub>1</sub>+3.
- The iteration space of this loop has 4<sup>2</sup> iterations just like the original loop.
- If we draw the iteration space with each iteration S<sup>K1,K2</sup> at coordinate position (K₁,K₂), it is skewed to form a lozenge shape:

| $S^{00}$                                         | $S^{01}$ | $S^{02}$             | $S^{03}$             |                      |                      |          |
|--------------------------------------------------|----------|----------------------|----------------------|----------------------|----------------------|----------|
| This loop                                        | $S^{11}$ | $S^{12}$<br>$S^{22}$ | $S^{13}$<br>$S^{23}$ | $S^{14}$<br>$S^{24}$ | <b>C</b> 25          |          |
| performs the<br>same computat<br>as the original |          | S <sup>22</sup>      | $S^{23}$<br>$S^{33}$ | $S^{24}$<br>$S^{34}$ | $S^{23}$<br>$S^{35}$ | $S^{36}$ |

vanced Computer Architecture Chapter 4.47

#### Interchange: condition

- A loop is *interchangeable* if all dependence constraints (as shown by the arrows) are still satisfied by the top-tobottom, left-to-right execution order
- How can you tell whether a loop can be interchanged?
- Look at it's dependence direction vectors:
  - ◆Is there a dependence direction vector with the form (<,>)?
  - +ie there is a dependence distance vector (k<sub>1</sub>,k<sub>2</sub>) with k<sub>1</sub>>0 and k<sub>2</sub><0 ?
  - If so, interchange would be invalid
  - Because the arrows would be traversed backwards
  - All other dependence directions are OK.

Advanced Computer Architecture Chapter 4.46

#### Skewing preserves semantics

- To see that this loop performs the same computation, lets work out its dependence structure.
- First label each iteration with the element of A to which it assigns:

| S <sup>00</sup> | S <sup>01</sup> | S <sup>02</sup> | S <sup>03</sup> |                 |                 |                 |
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| A <sub>OO</sub> | A <sub>O1</sub> | A <sub>02</sub> | A <sub>03</sub> |                 |                 |                 |
|                 | $S^{11}$        | S <sup>12</sup> | $S^{13}$        | $S^{14}$        |                 |                 |
|                 | A <sub>10</sub> | $A_{11}$        | A <sub>12</sub> | A <sub>13</sub> |                 |                 |
|                 |                 | S <sup>22</sup> | $S^{23}$        | S <sup>24</sup> | $S^{25}$        |                 |
|                 |                 | A <sub>20</sub> | A <sub>21</sub> | A <sub>22</sub> | A <sub>23</sub> |                 |
|                 |                 |                 | S <sup>33</sup> | S <sup>34</sup> | S <sup>35</sup> | S <sup>36</sup> |
|                 |                 |                 | A <sub>30</sub> | A <sub>31</sub> | A <sub>32</sub> | A33             |

In The loop body is

#### $A[k_1,k_2-k_1] := A[k_1-1,k_2-k_1]+A[k_1,k_2-k_1-1]$

▶ E.g. iteration S<sub>23</sub> does: A[2,1] := A[1,1]+A[2,0]

In Thus the dependence structure of the skewed loop is shown by marking the iteration space with all the dependences:

$$S^{00} \xrightarrow{\delta} S^{01} \xrightarrow{\delta} S^{02} \xrightarrow{\delta} S^{03}$$

$$\searrow^{\delta} S^{11} \xrightarrow{\delta} S^{12} \xrightarrow{\delta} S^{13} \xrightarrow{\delta} S^{14}$$

$$\searrow^{\delta} S^{22} \xrightarrow{\delta} S^{23} \xrightarrow{\delta} S^{24} \xrightarrow{\delta} S^{25}$$

$$\searrow^{\delta} S^{33} \xrightarrow{\delta} S^{34} \xrightarrow{\delta} S^{35} \xrightarrow{\delta} S^{36}$$

Can this loop nest be vectorised?

Can this loop nest be interchanged?

#### Skewing changes effect of interchange





- ▶ You can think of loop interchange as changing the way the iteration space is traversed
- 🕨 Alternatively, you can think of it as a change to the way the runtime code instances are mapped onto the iteration space
- Traversal is always lexicographic - ie leftto-right, top-down



- Inter Imper loop is now vectorisable, since it has no loop-carried dependence
- The skewed iteration space has N rows and 2N-1 columns, but still only N<sup>2</sup> actual statement instances.



# Skewing and interchange: summary $S_{S_{00}}^{(0)}$

 $\delta \downarrow \checkmark^{\delta}_{S^{01}} S^{11}$ 

 $\delta \downarrow \ \delta \delta \downarrow \ S^{02} S^{12}$ 

 $\begin{smallmatrix} \delta \\ S^{03} \end{smallmatrix} \begin{smallmatrix} \delta \delta \\ S^{13} \end{smallmatrix} \begin{smallmatrix} \delta \delta \\ S^{2} \end{smallmatrix}$ 

 $\begin{array}{c} & \delta \\ & S^{14} \end{array} \begin{array}{c} & \delta \\ & S^{24} \end{array}$ 

| $S^{00}$                       | $\xrightarrow{\delta}$    | $S^{01}$                       | $\xrightarrow{\delta}$    | $S^{02}$               | $\overrightarrow{\delta}$ | $S^{03}$                 |  |
|--------------------------------|---------------------------|--------------------------------|---------------------------|------------------------|---------------------------|--------------------------|--|
| $\stackrel{\delta}{_{S^{10}}}$ |                           | $\stackrel{\delta}{_{S^{11}}}$ | $\xrightarrow{\epsilon}$  | ${}^{\delta}_{S^{12}}$ | $\xrightarrow{\circ}$     | $s \downarrow \\ S^{13}$ |  |
| δ                              | ~                         | δ                              | 0                         | $\delta \downarrow$    | 0                         | δ                        |  |
| $S^{20}$<br>$\delta \parallel$ | δ                         | $S^{21}$                       | δ                         | $S^{22}$               | δ                         | $S^{23}$                 |  |
| $S^{30}$                       | $\overrightarrow{\delta}$ | $S^{31}$                       | $\overrightarrow{\delta}$ | $S^{32}$               | $\overrightarrow{\delta}$ | S <sup>33</sup>          |  |
|                                | for $I_1 =$               | 0 to 3                         | do                        |                        |                           |                          |  |
|                                | for $I_2$                 | = 0 to                         | 3 do                      |                        |                           |                          |  |
|                                | S : A[i]                  | [1,I2]                         | = A[I1 ·                  | $-1, I_2$ ]            | + A[I1,                   | $I_2 - 1$ ]              |  |

- In Original loop interchangeable but not vectorisable.
- We skewed inner loop by outer loop by factor 1.
- ▶ Still not vectorisable, but interchangeable.
- Interchanged, skewed loop is vectorisable.
- Bounds of new loop not simple!
- for  $k_2 := 0$  to  $2N_2 2$  do for  $k_1 := \max(0, K_2 - N_2 + 2)$  to  $\min(K_2, N_1)$  do S: A[k1,k2-k1] := A[k1-1,k2-k1]+A[k1,k2-k1-1]

 $S^{22}$ 

 $S^{23}$ 

- Is skewing ever invalid?
- Does skewing affect interchangeability?
- Does skewing affect dependence distances?
- Can you predict value of skewing?



## Summary: dependence

- Dependence equation for single loop:
  - Suppose S<sub>n</sub> refers to  $A[\phi_n(I)]$
  - Suppose  $S_a$  refers to  $A[\phi_a(I)]$
  - $\Rightarrow$  A dependence of some kind occurs between  $S_{p}$  and  $S_{d}$  if there exists a solution to the equation

 $\phi_n(\mathbf{I}^1) = \phi_n(\mathbf{I}^2)$ 

- ♦ for integer values of I<sup>1</sup> and I<sup>2</sup> lying within the loop bounds
- For doubly-nested loops over multidimensional arrays generalise to system of simultaneous dependence equations for two iterations,  $(I_1^1, I_2^1)$  and  $(I_1^2, I_2^2)$
- **Iteration space graph**, lexicographic schedule of execution
- Arrows in graph show solutions to dependence equation
- Dependence distance vectors characterise families of congruent arrows

## Summary: transformations

- A loop can be executed in parallel if it has no loopcarried dependence
- A loop nest can be interchanged if the transposed dependence distance vectors are lexicographically forward
- **Strip-mining** is always valid
- **Tiling** = strip-mining + interchange
- In Skewing is always valid
- Skewing can expose parallelism by aligning parallel iterations with one of the loops
- Skewing can make interchange (and therefore tiling) valid

#### Matrix representation of loop transformations

To skew the inner loop by the outer loop by factor 1 we adjust the loop bounds, and replace I<sub>1</sub> by K<sub>1</sub>, and I<sub>2</sub> by K<sub>2</sub>-K<sub>1</sub>. That is,

$$(K_1, K_2) = (I_1, I_2) \cdot U = (I_1, I_2 + I_1)$$

▶ The inverse gets us back again: (I<sub>1</sub>,I<sub>2</sub>) = (K<sub>1</sub>,K<sub>2</sub>). U<sup>-1</sup> = (K<sub>1</sub>,K<sub>2</sub>-K<sub>1</sub>)

Advanced Computer Architecture Chapter 4.58

Advanced Computer Architecture Chapter 4,60

Matrix U maps each statement instance S<sup>I</sup><sup>1</sup><sup>2</sup> to its position in the new iteration space, S<sup>K</sup><sup>1</sup><sup>K</sup><sup>2</sup>:

#### Original iteration space:

|       | $I_2$ :      |          |          |          |
|-------|--------------|----------|----------|----------|
| $I_1$ | 0            | 1        | 2        | 3        |
| 0     | $S^{00}$     | $S^{01}$ | $S^{02}$ | $S^{03}$ |
| 1     | $S^{10}$     | $S^{11}$ | $S^{12}$ | $S^{13}$ |
| 2     | $S^{20}$     | $S^{21}$ | $S^{22}$ | $S^{23}$ |
| 3     | $S^{\rm 30}$ | $S^{31}$ | $S^{32}$ | $S^{33}$ |

#### In Transformed iteration space:

| V     | $K_2$ :  | 1        | C        | 2        | Λ        | 5        | 6        | The             |
|-------|----------|----------|----------|----------|----------|----------|----------|-----------------|
| $n_1$ | 0        | Ŧ        | 2        | 3        | 4        | 5        | 0        |                 |
| 0     | $S^{00}$ | $S^{01}$ | $S^{02}$ | $S^{03}$ |          |          |          | dependences     |
| 1     |          | $S^{11}$ | $S^{12}$ | $S^{13}$ | $S^{14}$ |          |          | are subject to  |
| 2     |          |          | $S^{22}$ | $S^{23}$ | $S^{24}$ | $S^{25}$ |          | the same        |
| 3     |          |          |          | $S^{33}$ | $S^{34}$ | $S^{35}$ | $S^{36}$ | transformation. |

#### Using matrices to reason about dependence

#### Recall that:

- There is a dependence between two iterations  $(I_1^1, I_2^1)$  and  $(I_1^2, I_2^2)$  if there is a memory location which is assigned to in iteration  $(I_1^1, I_2^1)$ , and read in iteration  $(I_1^2, I_2^2)$ . ((unless there is an intervening assignment))
- If  $(I_1^1, I_2^1)$  precedes  $(I_1^2, I_2^2)$  it is a *data*-dependence.
- If  $(I_1^2, I_2^2)$  precedes  $(I_1^1, I_2^1)$  it is a *anti*-dependence.
- If the location is assigned to in both iterations, it is an output-dependence.
- In the dependence distance vector  $(D_1, D_2)$  is  $(I_1^2 I_1^1, I_2^2 I_2^1)$ .

## Transforming dependence vectors

- Iterations (I<sub>1</sub><sup>1</sup>, I<sub>2</sub><sup>1</sup>). U and (I<sub>1</sub><sup>2</sup>, I<sub>2</sub><sup>2</sup>). U will also read and write the same location.
- The transformation U is valid iff (I<sub>1</sub><sup>1</sup>, I<sub>2</sub><sup>1</sup>). U precedes (I<sub>1</sub><sup>2</sup>, I<sub>2</sub><sup>2</sup>). U whenever there is a dependence between (I<sub>1</sub><sup>1</sup>, I<sub>2</sub><sup>1</sup>) and (I<sub>1</sub><sup>2</sup>, I<sub>2</sub><sup>2</sup>).

In the transformed loop the dependence distance vector

is also transformed, to  $(D_1, D_2)$  . U

 $\begin{array}{c} \textbf{Definition: Lexicographic ordering:}\\ (I_1^{1}, I_2^{1}) \ \textbf{precedes} \ (I_1^{2}, I_2^{2})\\ \hline \textbf{If} \ I_1^{1} < I_1^{2}, \ \textbf{or} \ I_1^{1} = I_1^{2} \ \textbf{and} \ I_2^{1} < I_2^{2}\\ \textbf{("Lexicographic" is dictionary order - both "baz" and "can" precede "cat")}\\ \end{array}$ 

## Example: loop given earlier

Before transformation we had two dependences:

- 1. Distance: (1,0), direction: (<,.)
- 2. Distance: (0,1), direction: (.,<)
- Matter transformation by matrix

$$\mathbf{U} = \left[ \begin{array}{cc} 1 & 1 \\ 0 & 1 \end{array} \right]$$

- ▶ (i.e. skewing of inner loop by outer) we get:
- 1. Distance: (1,1), direction: (<,<)
- 2. Distance: (0,1), direction: (.,<)

```
Advanced Computer Architecture Chapter 4.62
```

- We can also represent loop interchange by a matrix transformation.
- MAfter transforming the skewed loop by matrix

$$\mathbf{V} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

▶ (i.e. loop interchange) we get:

- 1. Distance: (1,1), direction: (<,<)
- 2.Distance: (1,0), direction: (<,.)
- The transformed iteration space is the transpose of the skewed iteration space:

 Summary

Advanced Computer Architecture Chapter 4.64

- (I<sub>1</sub>, I<sub>2</sub>). U maps each statement instance (I<sub>1</sub>, I<sub>2</sub>) to its new position (K<sub>1</sub>, K<sub>2</sub>) in the transformed loop's execution sequence
- (D<sub>1</sub>, D<sub>2</sub>) . U gives new dependence distance vector, giving test for validity
- Maptures skewing, interchange and reversal
- Mc Compose transformations by matrix multiplication

 $U_1 \cdot U_2$ 

- Resulting loop's bounds may be a little tricky
  - Efficient algorithms exist [Banerjee90] to maximise parallelism by skewing and loop interchanging
  - Efficient algorithms exist to optimise cache performance by finding the combination of blocking, block size, interchange and skewing which leads to the best reuse [Wolf91]

## References

- Mennessy and Patterson: Section 4.4 (pp.319)
- Background: "conventional" compiler techniques
  - A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986.
  - Andrew Appel and Jens Palsberg, Modern Compiler Implementation. Cambridge University Press, 2002.
  - Cooper and Torczon, Engineering a Compiler. Morgan Kaufmann 2004.
  - Morgan, Building an Optimizing Compiler
- In Textbooks covering restructuring compilers
  - Michael Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley, 1996.
  - Steven Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
  - Ken Kennedy and Randy Allen, Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.

READ Research papers:

THIS

- D. F. Bacon and S. L. Graham and O. J. Sharp, "Compiler Transformations for High-Performance Computing". ACM Computing Surveys V26 N4 Dec 1994 <u>http://doi.acm.org/10.1145/197405.197406</u>
- U. Banerjee, Unimodular transformations of double loops. In Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computing, Irvine, CA. Pitman/MIT Press, 1990.
- M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, volume 26, pages 30-44, Toronto, Ontario, Canada, June 1991.

## Additional material for background

Advanced Computer Architecture Chapter 4.66

#### A little history...early days at Bell Labs

- 1940: Russell Ohl develops PN junction (accidentally...)
- 1945: Shockley's lab established
- 1947: Bardeen and Brattain create point-contact transistor with two PN junctions, gain=18
- 1951: Shockley develops junction transistor which can be manufactured in quantity
- 1952: British radar expert GWA Dummer forecasts "solid block [with] layers of insulating, conducting and amplifying materials
- 1954: first transistor radio. Also Texas Instruments makes first silicon transistor (price \$2.50) This





2.50) This background section is not covered in the lectures Liks mit du/Fall961ectures/L1/P005.html ; See also http://www.maxmon.com/1952ad.htm Advanced Computer Architecture Chapter 4.67

## Pre-historic integrated circuits

1958: The first monolithic integrated circuit, about the size of a finger tip, developed at Texas Instruments by Jack Kilby. The IC was a chip of a single Germanium crystal containing one transistor, one capacitor, and one resistor (Source: Texas Instruments)



Source: http://kasap3.usask.ca/server/kasap/photo1.html



Advanced Computer Architecture Chapter 4.69





- Chips are made from slices of a singlecrystal silicon ingot
- Each slice is about 30cm in diameter, and 250-600 microns thick
- Transistors and wiring are constructed by photolithography
- Essentially a printing/etching process
- With lines ca. 0.18μm wide







Highly magnified scanning electron microscope (SEM) view of IBM's six-level copper interconnect technology in an integrated circuit chip. The aluminum in transistor interconnections in a silicon chip has been replaced by copper that has a higher conductivity (by nearly 40%) and also a better ability to carry higher current densities without electromigration. Lower copper interconnect resistance means higher speeds and lower RC constants (Photograph courtesy of IBM Corporation, 1997.)



Advanced computer Architecture Chapter 4.75

process



Advanced Computer Architecture Chapter 4,74

#### Integrated circuit fabrication is a printing

- 1. Grow pure silicon crystal
- 2. Slice into wafers and polish
- 3. Grow surface layer of silicon dioxide (ie glass), either using high-temperature oxygen or chemical vapour deposition
- 4. Coat surface with photoresist layer, then use mask to selectively expose photoresist to ultraviolet light
- 5. Etch away silicon dioxide regions not covered by hardened photoresist
- 6. Further photolithography steps build up additional layers, such as polysilicon
- 7. Exposed silicon is doped with small quantities of chemicals which alter its semiconductor behaviour to create transistors
- 8. Further photolithography steps build layers of metal for wiring
- 9. Die are tested, diced, tested and packaged





Close up of the wafer as it spins during a testing procedure



Intel technicians monitor wafers in an automated wet etch tool. The process cleans the wafers of any excess process chemicals or contamination.



Checking wafers processing in a vertical diffusion



## The future

- More transistors
- Higher clock rates
- 🕨 Lower power
- System-on-a-chip
- Field-programmable gate arrays
- "Compiling to silicon"
- Optical interconnect
- Quantum computing?



Intel x86/Pentium Family

| CPU         | Year | Data<br>Bus | Max.<br>Mem. | Transistors    | Clock MHz             | Av. MIPS | Level-1 Caches              |
|-------------|------|-------------|--------------|----------------|-----------------------|----------|-----------------------------|
| 8086        | 1978 | 16          | 1MB          | 29K            | 5-10                  | 0.8      |                             |
| 80286       | 1982 | 16          | 16MB         | 134K           | 8-12                  | 2.7      |                             |
| 80386       | 1985 | 32          | 4GB          | 275K           | 16-33                 | 6        |                             |
| 80486       | 1989 | 32          | 4GB          | 1.2M           | 25-100                | 20       | 8Kb                         |
| Pentium     | 1993 | 64          | 4GB          | 3.1M           | 60-233                | 100      | 8K Instr + 8K Data          |
| Pentium Pro | 1995 | 64          | 64GB         | 5.5M<br>+15.5M | 150-200               | 440      | 8K + 8K <sub>+</sub> Level2 |
| Pentium II  | 1997 | 64          | 64GB         | 7M             | 266- <mark>450</mark> | 466-     | 16K + 16K + L2              |
| Pentium III | 1999 | 64          | 64GB         | 8.2M           | 500-1000              | 1000-    | 16K+16K + L2                |
| Pentium IV  | 2001 | 64          | 64GB         | 42M            | 1300-2000             |          | 8K + L2                     |

On-line manuals: http://x86.ddj.com/intel.doc/386manuals.htm

On-line details: http://www.sandpile.org/ia32/index.htm

**Real World Examples** 

# **Integrated** Circuits Costs

| IC cost =<br><u> Die cost + Testing cost + Packaging cost</u><br>Final test yield                                                           |
|---------------------------------------------------------------------------------------------------------------------------------------------|
| Die cost =<br><u>     Wafer cost</u><br><u>     Dies per Wafer × Die yield</u>                                                              |
| Dies per wafer = $\frac{\pi (Wafer_dia m/2)^2}{Die_Area} - \frac{\pi \times Wafer_diam}{\sqrt{2} \cdot Die_Area}$ - Test_Die                |
|                                                                                                                                             |
| Die Yield = Wafer_yield $\times \left\{ 1 - \left( \frac{\text{Defect_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha} \right\}$ |

Die Cost goes roughly with die area<sup>4</sup>

Chip Metal Line Wafer Defect Area Dies/ Yield Die Cost layers width cost /cm<sup>2</sup> mm<sup>2</sup> wafer 386DX 2 0.90 \$900 1.0 43 360 71% \$4 486DX2 3 0.80 \$1200 1.0 81 181 54% \$12 PowerPC 601 4 0.80 \$1700 115 28% \$53 1.3 121 HP PA 7100 3 0.80 \$1300 \$73 1.0 196 66 27% DEC Alpha 3 0.70 \$1500 1.2 234 53 19% \$149 SuperSPARC 3 0.70 \$1700 1.6 256 48 13% \$272 Pentium 3 0.80 \$1500 1.5 296 \$417 40 9%

From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

ed Computer Architecture Chapter 4,79





Cramming more components onto integrated circuits By Gordon E. Moore

Electronics, Volume 38, Number 8, April 19, 1965 (See

http://www.intel.com/research/silicon/mooresl aw.htm)

"With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip"



1968 with Robert Noyce and Andy Grove,

# Technology Trends: Microprocessor

