# 332 Advanced Computer Architecture Chapter 4

Compiler issues: dependence analysis, vectorisation, automatic parallelisation

February 2006 Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (3rd ed), and on the lecture slides of David Patterson and John Kubiatowicz's Berkeley course

Advanced Computer Architecture Chapter 4.1

# Background reading

The material for this part of the course is introduced only very briefly in Hennessy and Patterson (section 4.4 pp319). A good textbook which covers it properly is

Michael Wolfe, High Performance Compilers for Parallel Computing. Addison Wesley, 1996.

Much of the presentation is taken from the following research paper:

▶U. Banerjee. Unimodular transformations of double loops. In Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computing, Irvine, CA. Pitman/MIT Press, 1990.

Banerjee's paper gives a simplified account of the theory in the context only of perfect doubly-nested loops with well-known dependences.

Advanced Computer Architecture Chapter 4.3

#### Introduction

# In this segment of the course we consider compilation issues for loops involving arrays:

- Me How execution order of a loop is constrained,
- Me How a compiler can extract dependence information, and
- Me How this can be used to optimise a program.

# Understanding and transforming execn order can exploit architectural features:

- Pipelined, superscalar and VLIWprocessors
- > Systems which rely heavily on caches.
- MCPUs with special instructions for vectors.
- Multiprocessors.

Advanced Computer Architecture Chapter 4.

### Background reading continued

- Monica Lam's group at Stanford have used these ideas to build a public-domain prototype compiler system "SUIF". The following paper describes how SUIF uses unimodular transformations to optimise locality:
  - →M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, volume 26, pages 30-44, Toronto, Ontario, Canada, June 1991.
- The Stanford work overcomes many of the restrictions of Banerjee's paper, but lies beyond the scope of this course.

#### Restructuring

- Here we consider a special kind of optimisation, which is currently performed only by specialist compilers -"restructuring compilers".
- Conventional optimisations (see the Dragon [1] book) must also be performed.
- The difference is this:
  - Conventional optimisations reduce the amount of work the computer has to do at run-time.
  - Restructuring aims to do the work in an order which suits the target architecture better.

Advanced Computer Architecture Chapter 4.

```
/*

* mm: Multiply A by B leaving the

* result in C.

* The result matrix is assumed

* to be initialised to zero.

*/

void mm1(A,B,C)

double A[512][512],

B[512][512],

C[512][512];

{

int i, j, k;

for (i = 0; i < 512; i++)

for (j = 0; j < 512; j++)

for (k = 0; k < 512; k++)

C[i][j] += A[i][k] * B[k][j];

}
```

#### Motivation: an example

- We will begin by looking at matrix multiplication.
- We will study the performance of a 512x512 doubleprecision floating point matrix multiply.
- We will investigate the performance of various versions in order to determine what transformations a compiler should be applying.

Advanced Computer Architecture Chapter 4

#### Architectural details

- The experiments were performed on an IBM Thinkpad T21 laptop with 128MB RAM:
- ► This machine has an 800MHz Intel Pentium III processor pipelined RISC CPU.
  - ⇒20 Entry RS, 40 Entry ROB
  - →Pipeline Depth: 12 (in-order) plus 2 (out-of-order) stages
  - ◆"up to 5 OPs/Cycle"
  - ◆The compiler used was Microsoft Visual Studio 6.0

#### Memory system

- L1 Instruction cache: 16 KB, 4-Way, 32 Byte/Line, LRU
- L1 Data cache: 16 KB, 4-way, 32 byte/line, non-blocking, dual-ported, write allocate, LRU
- L2 unified cache: 256 KB, 8-Way, 32 Byte/Line, nonblocking
- Note that the matrix occupies  $512^2 \times 8 = 2MBytes$
- Each row of the matrix occupies 512x8 = 4KBytes.

# Interchange loops

```
for (i = 0; i < 512; i++)
  for (k = 0; k < 512; k++){
   r = A[i][k];
   for (j = 0; j < 512; j++)
     C[i][i] += r * B[k][i];
▶ 5.2 seconds (51.6 MFLOPS).
```

- Why is this such a good idea?
- How might a compiler perform this transformation?
- Mr. Can we do better still?

#### Performance

- The initial version runs in 10.3 seconds.
- The matrix multiplication takes 5123 steps, each involving two floating-point operations, an add and a multiply, i.e. 268x106.
- ▶ This loop achieves a computation rate of 268/10.3=26.1 MFLOPs.
- In That is, one floating-point operation completed every 30 clock cycles
- Me How are we going to get value for money?

Advanced Computer Architecture Chapter 4.10

#### What was going on?



- ▶ IJK variant computes each element of result matrix C one at a time, as inner product of row of A and column of B
  - Traverses A in row-major order, B in column-major



- IKJ variant accumulates partial inner product into a row of result matrix C, using element of A and row of B
  - Traverses C and B in row-major order

#### Blocking (a.k.a. "tiling")

- Idea: reorder execn of loop nest so data isn't evicted from cache before it's needed again.
- Blocking is a combination of two transformations: "strip mining", followed by interchange; we start with

```
for (i = 0; i < 512; i++)
  for (k = 0; k < 512; k++){
    r = A[i][k];
    for (j = 0; j < 512; j++)
        C[i][j] += r * B[k][j]; }

w Strip mine the k and j loops:
  for (i = 0; i < 512; i++)
    for (kk = 0; kk < 512; kk += BLKSZ)
    for (k = kk; k < min(kk+BLKSZ,512); k++){
        r = A[i][k];
        for (j = 0; jj < 512; jj += BLKSZ)
        for (j = jj; j < min(jj+BLKSZ, 512); j++)
        C[i][j] += r * B[k][j];
}</pre>
```

#### Performance of blocked version

| Blocking<br>factor | Execution time | MFLOPS |
|--------------------|----------------|--------|
| 8                  | 3.815          | 70.4   |
| 16                 | 2.784          | 96.4   |
| 32                 | 2.283          | 117.6  |
| 40                 | 2.193          | 122.4  |
| 48                 | 2.253          | 119.1  |
| 56                 | 2.473          | 108.5  |
| 64                 | 3.404          | 78.9   |
| 72                 | 5.608          | 47.9   |
| 80                 | 5.578          | 48.1   |
| 88                 | 5.808          | 46.2   |
| 96                 | 5.928          | 45.3   |
| 104                | 6.309          | 42.5   |
| 112                | 5.778          | 46.5   |



#### Blocking/tiling - stripmine then interchange

- The inner i,k,j loops perform a multiplication of a pair of BLKSZxBLKSZ partial matrices.
- ▶ BLKSZ is chosen so that a BLKSZ x BLKSZ submatrix of B and a row of length BLKSZ of C can fit in the cache.
- What is the right value for BLKSZ?

Advanced Computer Architecture Chapter 4,15

### Reducing overheads of blocking...

- The min operators in the inner loop bounds are likely to be a performance hit; if we choose a good blocking factor which divides the problem size exactly...
- Blocksize 32:
  ⇒2.013 seconds, 133.4 MFLOP/s
  Blocksize 64:
  ⇒2.915 seconds, 92.1 MFLOP/s

Impact....

- ▶ Original version: 10.3 seconds (26.1 MFLOP/s)
- ▶ Blocked version: 2.013 seconds (133.4 MFLOP/s)
  - ◆That was using a "good" optimising compiler!
  - →Factor of five performance improvement.
  - ⇒No reduction in amount of arithmetic performed.
- (The CodePlay VectorC compiler generates slightly better code for this inner, blocked loop, giving 155.0 MFLOPS)
- (the ATLAS library does even better)

Advanced Computer Architecture Chapter 4,18

Mark Consider:

Loop-carried dependences

▶ What does this loop do?

| В:         |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|------------|---|---|---|---|---|---|---|---|---|
| <b>A</b> : | 0 |   |   |   |   |   |   |   |   |

Dependence

- ▶ Define:
  - ▶IN(S): set of memory locus which might be read by some execn of statement S
  - OUT(S): set of memory locns which might be written by some execn of statement S
- Reordering is constrained by dependences;
- ▶ There are four types:
  - Data ("true") dependence: S1 δ S2
    - OUT(S1) ∩ IN(S2)
  - Anti dependence: S1 ₹ S2
    - IN(S1) ∩ OUT(S2)
  - lacktriangle Output dependence: S1  $\delta^{\circ}$  S2
    - OUT(S1) ∩ OUT(S2)
  - Control dependence: S1 δ<sup>c</sup> S2
- These are static analogues of dynamic RAW, WAR, WAW and control hazards.

Advanced Computer Architecture Chapter 4.19

How?

Mr Consider:

#### Loop-carried dependences

S1: A[0] := 0 for I = 1 to 8

S2 : A[I] := A[I-1] + B[I]

▶ What does this loop do?



- ▶ In this case, there is a data dependence
  - ◆This is a loop-carried dependence the dependence spans a loop iteration
  - →This loop is inherently sequential

Advanced Computer Architecture Chapter 4 21

B[6] B[7]

B[8]

# What is a loop-carried dependence?

- Consider two iterations I¹ and I²
- ▶A dependence occurs between two statements  $S_p$  and  $S_q$  (not necessarily distinct), when an assignment in  $S_p^{II}$  refers to the same location as a use in  $S_a^{I2}$
- ◆In the example,

$$S_1$$
: A[0] := 0  
for I = 1 to 5  
 $S_2$ : A[I] := A[I-1] + B[I]

- ► The assignment is ``A[I<sub>1</sub>] := ...''
- The use is `` ... := A[I2-1] ...'
- ▶ These refer to the same location when I¹ = I²-1
- ▶ Thus I¹ < I², ie the assignment is in an earlier iteration

**№** Notation:  $S_2 \delta_{\zeta} S_2$ 

Advanced Computer Architecture Chapter 4.

# Types of dependence

- If a solution to the dependence equation exists, a dependence of some kind occurs
- From The dependence type depends on what solutions exist
- $\blacktriangleright$  The solutions consist of a set of pairs (I1,I2)
- ▶ We would appear to have a data dependence if

$$A[\phi_p(\mathbf{I})] \in OUT(S_p)$$
 and

$$A[\phi_q(I)] \in IN(S_q)$$

- But we only really have a data dependence if the assignments precede the uses, ie
  - $\phi S_p \delta_c S_q$
  - ▶if, for each solution pair  $(I^1,I^2)$ ,  $I^1 < I^2$

Advanced Computer Architecture Chapter 4,24

### Definition: The dependence equation

- In A dependence occurs between two statements  $S_p$  and  $S_q$  (not necessarily distinct), when there exists a pair of loop iterations  $\mathbf{I}^1$  and  $\mathbf{I}^2$ , such that a memory reference in  $\mathbf{S}_q^{\,\mathrm{II}}$  may refer to the same location as memory reference in  $\mathbf{S}_q^{\,\mathrm{II}}$ .
- This might occur if S<sub>D</sub> and S<sub>d</sub> refer to some common array A
- ▶ Suppose  $S_p$  refers to  $A[\phi_p(I)]$
- Suppose  $S_q$  refers to  $A[\phi_a(I)]$

 $(\phi_p(\mathbf{I}) \text{ is some subscript} \\ \text{expression involving } \mathbf{I})$ 

A dependence of some kind occurs between S<sub>p</sub> and S<sub>q</sub> if there exists a solution to the equation

$$\phi_p(\mathbf{I}^1) = \phi_q(\mathbf{I}^2)$$

▶ for integer values of I¹ and I² lying within the loop bounds

Advanced Computer Architecture Chapter 4,23

#### Dependence versus anti-dependence

▶ If the uses precede the assignments, we actually have an anti-dependence, ie

$$S_p \ \overline{\delta}_{<} \ S_q$$

if, for each solution pair ( $I^1, I^2$ ),  $I^1 > I^2$ 

If there are some solution pairs (I¹,I²) with I¹ < I² and some with I¹ > I₂, we write

$$S_p \delta_* S_q$$

▶ If, for all solution pairs (I¹,I²), I¹ = I², there are dependences within an iteration of the loop, but there are no loop-carried dependences:

$$S_p \delta_{\bar{z}} S_q$$

### Dependence distance

In many common examples, the set of solution pairs is characterised easily:

- Definition: dependence distance
  - ➡If, for all solution pairs (I¹,I²),

$$I^1 = I^2 - k$$

then the dependence distance is k

For example in the loop we considered earlier,

$$S_1$$
: A[0] := 0  
for I = 1 to 5  
 $S_2$ : A[I] := A[I-1] + B[I]

We find that  $S_2$   $\delta$ ,  $S_2$  with dependence distance 1.

((of course there are many cases where the difference is not constant and so the dependence cannot be summarised this way)).

Advanced Computer Architecture Chapter 4,2

# Nested loops

- **▶**Up to now we have looked at single loops
- Mow let's generalise to loop "nests"
- ▶ We begin by considering a very simple loop with a very common dependence pattern. This example is also used in Banerjee's paper:

for 
$$I_1$$
 = 0 to 3 do  
for  $I_2$  = 0 to 3 do  
 $S: A[I_1,I_2] := A[I_1-1,I_2] + A[I_1,I_2-1]$ 

**№** Dependence structure?

Advanced Computer Architecture Chapter 4,28

#### Reuse distance

- When optimising for cache performance, it is sometimes useful to consider the re-use relationship,
  - $\Rightarrow$ IN(S<sub>1</sub>)  $\cap$  IN(S<sub>2</sub>)
- Here there is no dependence it doesn't matter which read occurs first
- Nonetheless, cache performance can be improved by minimising the reuse distance
- In the reuse distance is calculated essentially the same way
- 🌬 Eg

▶ Here we have a loop-carried reuse with distance 5

Advanced Computer Architecture Chapter 4,27

#### System of dependence equations

▶ Consider the dependence equations for this loop nest:

for 
$$I_1$$
 = 0 to 3 do  
for  $I_2$  = 0 to 3 do

$$S: A[I_1,I_2] := A[I_1-1,I_2] + A[I_1,I_2-1]$$

- ▶ There are two potential dependences arising from the three references to A, so two systems of dependence equations to solve:
- 1. Between  $A[I_1^1, I_2^1]$  and  $A[I_1^2 1, I_2^2]$ :

$$\begin{cases}
I_1^1 = I_1^2 - 1 \\
I_2^1 = I_2^2
\end{cases}$$

2. Between  $A[I_1^1, I_2^1]$  and  $A[I_1^2, I_2^2 - 1]$ :

$$\begin{cases} I_1^1 = I_1^2 \\ I_2^1 = I_2^2 - 1 \end{cases}$$

((strictly we should also consider output dependences between  $A[\mathbf{I}_1^1,\mathbf{I}_2^1]$  and  $A[\mathbf{I}_1^2,\mathbf{I}_2^2]$ , but this is obviously absent)), puter Architecture Chapter 4.29

#### ▶ The same loop:

#### Iteration space graph

for 
$$I_1$$
 = 0 to 3 do  
for  $I_2$  = 0 to 3 do  
 $S: A[I_1,I_2] := A[I_1-1,I_2] + A[I_1,I_2-1]$ 

For humans the easy way to understand this loop nest is to draw the *iteration space graph* showing the iteration-to-iteration dependences:

The diagram shows an arrow for each solution of each dependence equation.

Advanced Computer Architecture Chapter 4.30

### Interchange: counter-example

for 
$$I_1$$
 = 0 to 3 do  
for  $I_2$  = 0 to 3 do  
 $S$ :  $A[I_1, I_2]$  :=  $A[I_1 + 1, I_2 - 1] + B[I_1, I_2]$ 

- ▶ This loop is interchangeable: the top-to-bottom, left-to-right execution order is also valid since all dependence constraints (as shown by the arrows) are still satisfied.
- The inner loop is not vectorisable since there is a dependence chain linking successive iterations.
  - (to use a vector instruction, need to be able to operate on each element of the vector in parallel)
- ▶ Similarly, the outer loop is not parallel
- Interchanging the loop does not improve vectorisability or parallelisability

Advanced Computer Architecture Chapter 4,31

#### Interchange: counter-example

for 
$$I_1$$
 = 0 to 3 do

for  $I_2$  = 0 to 3 do

 $S: A[I_1,I_2] := A[I_1+1,I_2-1] + B[I_1,I_2]$ 

$$S^{00} S^{01} S^{02} S^{03}$$

$$S^{10} S^{11} S^{12} S^{13}$$

$$S^{20} S^{21} S^{22} S^{23}$$

$$S^{30} S^{31} S^{32} S^{33}$$

dyanced Computer Architecture Chapter 4 33

#### Interchange: counter-example

for 
$$I_1$$
 = 0 to 3 do  
for  $I_2$  = 0 to 3 do  
 $S$  : A[ $I_1, I_2$ ] := A[ $I_1 + 1, I_2 - 1$ ] + B[ $I_1, I_2$ ]



# Interchange: condition

- A loop is interchangeable if all dependence constraints (as shown by the arrows) are still satisfied by the topto-bottom, left-to-right execution order
- ▶ How can you tell whether a loop can be interchanged?
- Look at it's dependence direction vectors:
  - ◆Is there a dependence direction vector with the form (<,>)?
  - $\bullet ie$  there is a dependence distance vector  $(k_1,k_2)$  with  $k_1\!\!>\!\!0$  and  $k_2\!\!<\!\!0$  ?
  - ▶If so, interchange would be invalid
  - Because the arrows would be traversed backwards
  - ▶All other dependence directions are OK.

Advanced Computer Architecture Chapter 4.36

#### Interchange: counter-example

#### ▶ Consider this variation:

# Skewing

for 
$$k_1 := 0$$
 to 3 do  
for  $k_2 := k_1$  to  $k_1+3$  do

$$S: A[k_1,k_2-k_1] := A[k_1-1,k_2-k_1]+A[k_1,k_2-k_1-1]$$

- In The inner loop's control variable runs from  $k_1$  to  $k_1+3$ . In The iteration space of this loop has  $4^2$  iterations just
- like the original loop.
- ▶ If we draw the iteration space with each iteration S<sup>K1,K2</sup> at coordinate position (K<sub>1</sub>,K<sub>2</sub>), it is skewed to form a lozenge shape:

### Skewing preserves semantics

- To see that this loop performs the same computation, lets work out its dependence structure.
- First label each iteration with the element of A to which it assians

| 3 | gns.     |                 |                 |                 |                 |                 |          |
|---|----------|-----------------|-----------------|-----------------|-----------------|-----------------|----------|
|   | $S^{00}$ | $S^{01}$        | $S^{02}$        | $S^{03}$        |                 |                 |          |
|   | $A_{OO}$ | A <sub>O1</sub> | A <sub>02</sub> | A <sub>03</sub> |                 |                 |          |
|   |          | $S^{11}$        | $S^{12}$        | $S^{13}$        | $S^{14}$        |                 |          |
|   |          | A <sub>10</sub> | A <sub>11</sub> | A <sub>12</sub> | A <sub>13</sub> |                 |          |
|   |          |                 | $S^{22}$        | $S^{23}$        | $S^{24}$        | $S^{25}$        |          |
|   |          |                 | A <sub>20</sub> | A <sub>21</sub> | A <sub>22</sub> | A <sub>23</sub> |          |
|   |          |                 |                 | $S^{33}$        | $S^{34}$        | $S^{35}$        | $S^{36}$ |
|   |          |                 |                 | A <sub>30</sub> | A <sub>31</sub> | A32             | A33      |

The loop body is

$$A[k_1,k_2-k_1] := A[k_1-1,k_2-k_1]+A[k_1,k_2-k_1-1]$$

▶ E.g. iteration S<sub>23</sub> does:

$$A[2,1] := A[1,1] + A[2,0]$$



Thus the dependence structure of the skewed loop is shown by marking the iteration space with all the dependences: 
$$S^{00} \xrightarrow{\delta} S^{01} \xrightarrow{\delta} S^{02} \xrightarrow{\delta} S^{03}$$

$$S^{11} \xrightarrow{\delta} S^{12} \xrightarrow{\delta} S^{13} \xrightarrow{\delta} S^{14}$$

$$S^{22} \xrightarrow{\delta} S^{23} \xrightarrow{\delta} S^{24} \xrightarrow{\delta} S^{25}$$

$$S^{33} \xrightarrow{\delta} S^{34} \xrightarrow{\delta} S^{35} \xrightarrow{\delta} S^{36}$$
We Can this loop nest be vectorised?
We Can this loop nest be interchanged?



- You can think of loop interchange as changing the way the iteration space is traversed
- Alternatively, you can think of it as a change to the way the runtime code instances are mapped onto the iteration space
- ▶ Traversal is always lexicographic - ie leftto-right, top-down

 $S^{00}$ 

- The inner loop is now vectorisable.
- The skewed iteration space has N rows and 2N-1 columns, but still only N<sup>2</sup> actual statement instances.
- What are the appropriate loop bounds?

for 
$$k_2 := 0$$
 to  $2N_2 - 2$  do  
for  $k_1 := \max(0, K_2 - N_2 + 2)$  to  $\min(K_2, N_1)$  do  
 $S : A[k_1, k_2 - k_1] := A[k_1 - 1, k_2 - k_1] + A[k_1, k_2 - k_1 - 1]$ 

- The inner loop is now vectorisable.
- The skewed iteration space has N rows and 2N-1 columns, but still only N<sup>2</sup> actual statement instances.
- What are the appropriate loop bounds?

for 
$$k_2$$
 := ? to ? do  
for  $k_1$  := ? to ? do  
 $S: A[k_1,k_2-k_1] := A[k_1-1,k_2-k_1]+A[k_1,k_2-k_1-1]$ 

$$S^{00}$$

$$\delta \downarrow \delta$$

$$S^{01} S^{11}$$

$$\delta \downarrow \delta \delta \downarrow \delta$$

$$S^{02} S^{12} S^{22}$$

$$\delta \downarrow \delta \delta \downarrow \delta \downarrow \delta$$

$$S^{03} S^{13} S^{23} S^{33}$$

$$\delta \delta \downarrow \delta \delta \downarrow \delta \delta \downarrow$$

$$S^{14} S^{24} S^{34}$$

$$\delta \delta \downarrow \delta \delta \downarrow$$

$$\delta \delta \downarrow \delta \delta \downarrow$$

$$S^{14} S^{24} S^{35}$$

$$\delta \delta \downarrow \delta \delta \downarrow$$

$$\delta \delta \delta \downarrow$$

$$\delta \delta \downarrow$$

$$\delta \delta \delta \downarrow$$

Advanced Computer Architecture Chapter 4.43

# Skewing and interchange: summary

- Original loop interchangeable but not vectorisable.
- We skewed inner loop by outer loop by factor 1.
- Still not vectorisable, but interchangeable.
- Interchanged, skewed loop is vectorisable.
- Bounds of new loop not simple!



for  $k_2 := 0$  to  $2N_2 - 2$  do for  $k_1 := \max(0, K_2 - N_2 + 2)$  to  $\min(K_2, N_1)$  do  $S : A[k_1, k_2 - k_1] := A[k_1 - 1, k_2 - k_1] + A[k_1, k_2 - k_1 - 1]$ 

- → Is skewing ever invalid?
- Does skewing affect interchangeability?
- Does skewing affect dependence distances?
- Can you predict value of skewing?

#### Matrix representation of loop transformations

To skew the inner loop by the outer loop by factor 1 we adjust the loop bounds, and replace I₁ by K₁, and I₂ by K₂-K₁. That is,

$$(K_1,K_2) = (I_1,I_2) . U$$

where U is a 2 x 2 matrix

$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

That is,

$$(K_1,K_2) = (I_1,I_2) \cdot U = (I_1,I_2+I_1)$$

▶ The inverse gets us back again:

$$(I_1,I_2) = (K_1,K_2) \cdot U^{-1} = (K_1,K_2-K_1)$$

Advanced Computer Architecture Chapter 4,46

## Using matrices to reason about dependence

#### Recall that:

▶ There is a dependence between two iterations  $(I_1^1, I_2^1)$  and  $(I_1^2, I_2^2)$  if there is a memory location which is assigned to in iteration  $(I_1^1, I_2^1)$ , and read in iteration  $(I_1^2, I_2^2)$ .

((unless there is an intervening assignment))

- If  $(I_1^1, I_2^1)$  precedes  $(I_1^2, I_2^2)$  it is a *data*-dependence.
- ▶ If  $(I_1^2, I_2^2)$  precedes  $(I_1^1, I_2^1)$  it is a *anti-*dependence.
- ▶ If the location is assigned to in both iterations, it is an output-dependence.
- The dependence distance vector  $(D_1,D_2)$  is  $(I_1^2-I_1^1,I_2^2-I_2^1)$ .

Advanced Computer Architecture Chapter 4.48

- Matrix U maps each statement instance S<sup>III2</sup> to its position in the new iteration space, S<sup>K1K2</sup>:
- ▶ Original iteration space:

|       | $I_2$ :  |          |          |          |
|-------|----------|----------|----------|----------|
| $I_1$ | 0        | 1        | 2        | 3        |
| 0     | $S^{00}$ | $S^{01}$ | $S^{02}$ | $S^{03}$ |
| 1     | $S^{10}$ | $S^{11}$ | $S^{12}$ | $S^{13}$ |
| 2     | $S^{20}$ | $S^{21}$ | $S^{22}$ | $S^{23}$ |
| 3     | $S^{30}$ | $S^{31}$ | $S^{32}$ | $S^{33}$ |

▶ Transformed iteration space:

|       | $K_2$ :  |          |          |          |          |          |          |                 |
|-------|----------|----------|----------|----------|----------|----------|----------|-----------------|
| $K_1$ | 0        | 1        | 2        | 3        | 4        | 5        | 6        | The             |
| 0     | $S^{00}$ | $S^{01}$ | $S^{02}$ | $S^{03}$ |          |          |          | dependences     |
| 1     |          | $S^{11}$ | $S^{12}$ | $S^{13}$ | $S^{14}$ |          |          | are subject to  |
| 2     |          |          | $S^{22}$ | $S^{23}$ | $S^{24}$ | $S^{25}$ |          | the same        |
| 3     |          |          |          | $S^{33}$ | $S^{34}$ | $S^{35}$ | $S^{36}$ | transformation. |

Advanced Computer Architecture Chapter 4.47

## Transforming dependence vectors

- Iterations  $(I_1^1,I_2^1)$ . U and  $(I_1^2,I_2^2)$ . U will also read and write the same location.
- The transformation U is valid iff (I₁¹,I₂¹). U precedes (I₁²,I₂²). U whenever there is a dependence between (I₁¹, I₂¹) and (I₁², I₂²).
- ▶ In the transformed loop the dependence distance vector is also transformed, to

$$(D_1,D_2)$$
. U

Definition: Lexicographic ordering:  $(I_1^1,I_2^1) \text{ precedes } (I_1^2,I_2^2)$   $\text{If } I_1^1 < I_1^2, \text{ or } I_1^1 = I_1^2 \text{ and } I_2^1 < I_2^2$  ("Lexicographic" is dictionary order - both "baz" and "can" precede "cat")

# Example: loop given earlier

Before transformation we had two dependences:

1. Distance: (1,0), direction: (<,.)

2. Distance: (0,1), direction: (.,<)

After transformation by matrix

$$\mathbf{U} = \left[ \begin{array}{cc} 1 & 1 \\ 0 & 1 \end{array} \right]$$

(i.e. skewing of inner loop by outer) we get:

1. Distance: (1,1), direction: (<,<)

2. Distance: (0,1), direction: (.,<)

Advanced Computer Architecture Chapter 4.5

### Summary

- $\mathbb{I}(I_1,I_2)$ . U maps each statement instance  $(I_1,I_2)$  to its new position  $(K_1,K_2)$  in the transformed loop's execution sequence
- (D<sub>1</sub>,D<sub>2</sub>). U gives new dependence distance vector, giving test for validity
- Captures skewing, interchange and reversal
- $\blacktriangleright$  Compose transformations by matrix multiplication  $U_1$  .  $U_2$
- Resulting loop's bounds may be a little tricky
  - →Efficient algorithms exist [Banerjee90] to maximise parallelism by skewing and loop interchanging
  - →Efficient algorithms exist to optimise cache performance by finding the combination of blocking, block size, interchange and skewing which leads to the best reuse [Wolf91]

Advanced Computer Architecture Chapter 4,52

- We can also represent loop interchange by a matrix transformation.
- Market After transforming the skewed loop by matrix

$$\mathbf{V} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

- ▶ (i.e. loop interchange) we get:
- 1. Distance: (1,1), direction: (<,<)
- 2. Distance: (1,0), direction: (<,.)
- ▶ The transformed iteration space is the transpose of the skewed iteration space:

Advanced Computer Architecture Chapter 4.51

#### References

- ► Hennessy and Patterson: Section 4.4 (pp.319)
- Background: "conventional" compiler techniques
  - A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986.
  - Andrew Appel and Jens Palsberg, Modern Compiler Implementation. Cambridge University Press, 2002.
  - Cooper and Torczon, Engineering a Compiler. Morgan Kaufmann 2004.
  - Morgan, Building an Optimizing Compiler
- Textbooks covering restructuring compilers
  - Michael Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley 1996
  - Steven Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
  - Ken Kennedy and Randy Allen, Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.

READ Research papers:

- THIS

  D. F. Bacon and S. L. Graham and O. J. Sharp, "Compiler Transformations for High-Performance Computing". ACM Computing Surveys V26 N4 Dec 1994

  http://doi.acm.org/10.1145/197405.197406
  - U. Banerjee. Unimodular transformations of double loops. In Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computing, Irvine, CA. Pitman/MIT Press, 1990.
  - M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, volume 26, pages 30-44, Toronto, Ontario, Canada, June 1991.

#### A little history...early days at Bell Labs

- ▶ 1940: Russell Ohl develops PN junction (accidentally...)
- ▶ 1945: Shockley's lab established
- ▶ 1947: Bardeen and Brattain create point-contact transistor with two PN junctions, gain=18
- ▶ 1951: Shockley develops junction transistor which can be manufactured in quantity
- 1952: British radar expert GWA Dummer forecasts "solid block [with] layers of insulating, conducting and amplifying materials
- № 1954: first transistor radio. Also Texas Instruments makes first silicon transistor (price \$2.50)

First pointcontact transistor invented at Bell Labs. (Source: Bell Labs.)





The three inventors of the transistor: William Shockley. seated), John Bardeen (left) and Walter Brattain right) in 1948; the hree inventors hared the Nobel orize in 1956.

This background section is not covered in the lectures

- № 1970: Intel starts selling a 1K bit RAM
- № 1971: Intel introduces first microproces sor, the 4004
  - 4-bit buses
  - Clock rate 108 KHz
  - **2300** transistors
  - 📫 10μm process



#### Pre-historic integrated circuits

№ 1958: The first monolithic integrated circuit, about the size of a finger tip, developed at Texas Instruments by Jack Kilby. The IC was a chip of a single Germanium crystal containing one transistor, one capacitor, and one resistor (Source: Texas Instruments)



Source: http://kasap3.usask.ca/server/kasap/photo1.html

- ▶ IBM Power3 microprocessor
- **№ 15M** transistors
- № 0.18µm copper/SOI process
- About 270mm<sup>2</sup>



- Chips are made from slices of a singlecrystal silicon ingot
- ► Each slice is about 30cm in diameter, and 250-600 microns thick
- Transistors and wiring are constructed by photolithography
- Essentially a printing/etching process
- With lines ca. 0.18μm wide



#### Intel Pentium 4

- ► 42 M transistors
- 0.13mm copper/SOI process
- Clock speeds: 2200, 2000MHz
- Die size 146 square mm
- Power consumption 55.1W (2200), 52.4W (2000)
- Price (\$ per chip, in 1,000-chip units, Jan 2002):

US\$562 (2200) US\$364 (2000)



Advanced Computer Architecture Chapter 4 60





Highly magnified scanning electron microscope (SEM) view of IBM's six-level copper interconnect technology in an integrated circuit chip. The aluminum in transistor interconnections in a silicon chip has been replaced by copper that has a higher conductivity (by nearly 40%) and also a better ability to carry higher current densities without electromigration. Lower copper interconnect resistance means higher speeds and lower RC constants (Photograph courtesy of IBM Corporation, 1997.)



# Integrated circuit fabrication is a printing

- 1. Grow pure silicon crystal
- Slice into wafers and polish
- 3. Grow surface layer of silicon dioxide (ie glass), either using high-temperature oxygen or chemical vapour deposition
- 4. Coat surface with photoresist layer, then use mask to selectively expose photoresist to ultraviolet light
- 5. Etch away silicon dioxide regions not covered by hardened photoresist
- . Further photolithography steps build up additional layers, such as polysilicon
- 7. Exposed silicon is doped with small quantities of chemicals which alter its semiconductor behaviour to create transistors
- 8. Further photolithography steps build layers of metal for wiring
- 9. Die are tested, diced, tested and packaged



#### The future

- More transistors
- Migher clock rates
- Lower power
- System-on-a-chip
- Field-programmable gate arrays
- "Compiling to silicon"
- M Optical interconnect
- Quantum computing?



AVP-111 Video Codec from Lucent Technologies



Close up of the wafer as it spins during a testing procedure



Checking wafers processing in a vertical diffusion



Intel technicians monitor wafers in an automated wet et tool. The process cleans the wafers of any excess process chemicals or contamination.



Advanced Computer Architecture Chapter 4 63

# Intel ×86/Pentium Family

| CPU         | Year | Data<br>Bus | Max.<br>Mem. | Transistors    | Clock MHz | Av. MIPS | Level-1 Caches              |
|-------------|------|-------------|--------------|----------------|-----------|----------|-----------------------------|
| 8086        | 1978 | 16          | 1MB          | 29K            | 5-10      | 0.8      |                             |
| 80286       | 1982 | 16          | 16MB         | 134K           | 8-12      | 2.7      |                             |
| 80386       | 1985 | 32          | 4GB          | 275K           | 16-33     | 6        |                             |
| 80486       | 1989 | 32          | 4GB          | 1.2M           | 25-100    | 20       | 8Kb                         |
| Pentium     | 1993 | 64          | 4GB          | 3.1M           | 60-233    | 100      | 8K Instr + 8K Data          |
| Pentium Pro | 1995 | 64          | 64GB         | 5.5M<br>+15.5M | 150-200   | 440      | 8K + 8K <sub>+</sub> Level2 |
| Pentium II  | 1997 | 64          | 64GB         | 7M             | 266-450   | 466-     | 16K+16K + L2                |
| Pentium III | 1999 | 64          | 64GB         | 8.2M           | 500-1000  | 1000-    | 16K+16K + L2                |
| Pentium IV  | 2001 | 64          | 64GB         | 42M            | 1300-2000 |          | 8K + L2                     |

On-line manuals: http://x86.ddj.com/intel.doc/386manuals.htm

On-line details: http://www.sandpile.org/ia32/index.htm

## Integrated Circuits Costs

$$IC \ cost = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per Wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \left(\text{Wafer\_dia m/2}\right)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2} \cdot \text{Die\_Area}} - \text{Test\_Die}$$

$$\text{Die Yield} = \text{Wafer\_yield} \times \left\{1 - \left(\frac{\text{Defect\_Density} \times \text{Die\_area}}{\sqrt{2} \cdot \text{Die\_Area}}\right)^{-\alpha}\right\}$$

#### Die Cost goes roughly with die area4

Advanced Computer Architecture Chapter 4.6

# Moore's "Law"



Cramming more components onto integrated circuits

By Gordon E. Moore

Electronics, Volume 38, Number 8, April 19, 1965

(See

http://www.intel.com/research/silicon/mooreslaw.htm)

"With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip"

(U)

Gordon Moore left Fairchild to found Intel in 1968 with Robert Noyce and Andy Grove,

Advanced Computer Architecture Chapter 4,68

# Real World Examples

|                  |      |      |        | Defect<br>/cm² |     |     |     | Die Cost |  |
|------------------|------|------|--------|----------------|-----|-----|-----|----------|--|
| 386DX            | 2    | 0.90 | \$900  | 1.0            | 43  | 360 | 71% | \$4      |  |
| 486DX2           | 3    | 0.80 | \$1200 | 1.0            | 81  | 181 | 54% | \$12     |  |
| PowerPC 6        | 01 4 | 0.80 | \$1700 | 1.3            | 121 | 115 | 28% | \$53     |  |
| HP PA 7100       | 3    | 0.80 | \$1300 | 1.0            | 196 | 66  | 27% | \$73     |  |
| <b>DEC Alpha</b> | 3    | 0.70 | \$1500 | 1.2            | 234 | 53  | 19% | \$149    |  |
| SuperSPAF        | RC 3 | 0.70 | \$1700 | 1.6            | 256 | 48  | 13% | \$272    |  |
| Pentium          | 3    | 0.80 | \$1500 | 1.5            | 296 | 40  | 9%  | \$417    |  |

From "Estimating IC Manufacturing Costs," by Linley Gwennap, *Microprocessor Report*, August 2, 1993, p. 15

Advanced Computer Architecture Chapter 4,67

# Technology Trends: Microprocessor Capacity

