332 Advanced Computer Architecture Chapter 1

Introduction and review of Pipelines, Performance, Caches, and Virtual Memory

January 2004 Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture, a quantitative approach (3rd ed), and on the lecture slides of David Patterson's Berkeley course (CS252)

Course materials online at

http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html username aca, password kelly

Advanced Computer Architecture 2004 Chapter 1.

# This is a textbook-based course



 Computer Architecture: A Quantitative Approach (3rd Edition)

John L. Hennessy, David A. Patterson

- 1128 pages. Morgan Kaufmann (29 May, 2002); ISBN: 1558607242
- http://www.elsevierinternational.com/catalogue/title.cfm?IS BN=1558607242
- Price: £ 33.95 (Amazon.co.uk, Jan 2004)

Advanced Computer Architecture 2004 Chapter 1. p

## Pre-requisites

- This a third-level computer architecture course
- The usual path would be to take this course after following a course based on a textbook like "Computer Organization and Design" (Patterson and Hennessy, Morgan Kaufmann)
- This course is based on the more advanced book by the same authors (see next slide)
- You can take this course provided you're prepared to catch up if necessary
  - Read chapters 1 to 8 of "Computer Organization and Design" (COD) if this material is new to you
  - If you have studied computer architecture before, make sure COD Chapters 2, 6, 7 are familiar
  - See also "Appendix A Pipelining: Basic and Intermediate Concepts" of course textbook
- FAST review today of Pipelining, Performance, Caches, and Virtual Memory

Advanced Computer Architecture 2004 Chapter 1. p

# Who are these guys anyway and why should I read their book?

## John Hennessy:

- Founder, MIPS Computer Systems
- President, Stanford University

(previous president: Condoleezza Rice)



## David Patterson

- Leader, Berkeley RISC project (led to Sun's SPARC)
- RAID (redundant arrays of inexpensive disks)
- Professor, University of California, Berkeley



RAID-I (1989) consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dualstrong SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software.



RISC-I (1982) Contains 44,420 transistors, fabbed in 5 micron NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI RISC.

## Administrivia

- Course web site:
  - http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html
  - See web site for link to course news group:
    - mews:icdoc.courses.332aca.
  - Course textbook: H&P 3rd ed Read Appendix A right away

Advanced Computer Architecture 2004 Chapter 1.

# A "Typical" RISC

- 32-bit fixed format instruction (3 formats, see next slide)
- 32 32-bit general-purpose registers
  - (RO contains zero, double-precision/long operands occupy a pair)
- Memory access only via load/store instructions
  - No instruction both accesses memory and does arithmetic
  - All arithmetic is done on registers
- 3-address, reg-reg arithmetic instruction
  - Subw r1,r2,r3 means r1 := r2-r3
  - registers identifiers always occupy same bits of instruction encoding
- Single addressing mode for load/store:
- base + displacement
- ie register contents are added to constant from instruction word, and used as address, eq "lw R2,100(r1)" means "r2 := Mem[100+r1]"
- no indirection
- Simple branch conditions
- Delayed branch
- SPARC, MIPS, ARM, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1,
- Cray-2, Cray-3 Intel IA-32, IA-64 (?), Not:
  - Motorola 68000.
  - DEC VAX, PDP-11, IBM 360/370
- Eg: VAX matche instruction!

### Lecturers:

- Course organisation ■ Paul Kelly - first few weeks
- Jeremy Bradley last couple of weeks
- Tutorial helper:
  - Jeyarajan Thiyagalingam (Jeyan)
- 3 hours per week
- Nominally two hours of lectures, one hour of classroom tutorials
- We will use the time more flexibly
- Assessment:
  - Exam
    - For CS M.Eng. Class, exam will take place on last Monday of term
    - For everyone else, exam will take place early in the summer term
    - Im The goal of the course is to teach you how to think about computer architecture
  - ▶ The exam usually includes some architectural ideas not presented in the lectures
  - Coursework
    - ▶ You will be assigned a substantial, laboratory-based exercise
    - ▶ You will learn about performance tuning for computationally-intensive kernels
    - Mr You will learn about using simulators, and experimentally evaluating hypotheses to understand system performance
    - In Use the machines in Studio A to get started and get help during tutorials
- ◆ Please do not use the computers for anything else during classes

Advanced Computer Architecture 2004 Chapter 1, pi

#### Example: MIPS (Note register location) Register-Register 21 20 16 15 1110 6 5 Rs1 Rs2 Rd Register-Immediate 26 25 21 20 16 15

#### immediate Op Rs1 Rd Branch 26 25 21 20 16 15





- Q: What is the largest signed immediate operand for "subw r1,r2,X"?
- Q: To what range of addresses can a conditional branch jump to?

















# It's Not That Easy for Computers Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).









# Three Generic Data Hazards

Write After Read (WAR)
 Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

- Called an "anti-dependence" by compiler writers.
   This results from reuse of the name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Advanced Computer Architecture 2004 Chapter 1. p23

## Three Generic Data Hazards

Read After Write (RAW)
 Instr<sub>I</sub> tries to read operand before Instr<sub>I</sub> writes it

I: add r1,r2,r3 J: sub r4,r1,r3

Caused by a "Dependence" (in compiler nomenclature).
 This hazard results from an actual need for communication.

Advanced Computer Architecture 2004 Chapter 1, p22

# Three Generic Data Hazards

Write After Write (WAW)
 Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7

- Called an "output dependence" by compiler writers This also results from the reuse of name "r1".
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- ◆ Will see WAR and WAW in later more complicated pipes

















## Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

### #2: Predict Branch Not Taken

- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction

### #3: Predict Branch Taken

- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  - MIPS still incurs 1 cycle branch penalty
  - № Other machines: branch target known before outcome

Advanced Computer Architecture 2004 Chapter 1, p3

# Delayed Branch

L1: target

Blt R1,L1

- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% × 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
- Canceling branches
  - Branch delay slot instruction is executed but write-back is disabled if it is not supposed to be executed
  - Two variants: branch "likely taken", branch "likely not-taken"
  - allows more slots to be filled

Advanced Computer Architecture 2004 Chapter 1, p35

## Four Branch Hazard Alternatives

## #4: Delayed Branch

■ Define branch to take place AFTER a following instruction

branch instruction
sequential successor<sub>1</sub>
sequential successor<sub>n</sub>
sequential successor<sub>n</sub>
branch target if taken

Branch delay of length n

■ 1 slot delay allows proper decision and branch target address in 5 stage pipeline

■ MIPS uses this

Advanced Computer Architecture 2004 Chapter 1, p34

# Now, review basic performance issues in processor design

## Which is faster?

| Plane               | Washington<br>DC to Paris | Speed       | Passengers | Throughput (pmph) | First flew in February,<br>1969 |
|---------------------|---------------------------|-------------|------------|-------------------|---------------------------------|
| Boeing<br>747       | 6.5 hours                 | 610<br>mph  | 470        | 286,700           |                                 |
| BAC/Sud<br>Concorde | 3 hours                   | 1350<br>mph | 132        | 178,200           | First flavo                     |

- Time to run the task (ExTime)
  - Execution time, response time, latency
- · Tasks per day, hour, week, sec, ns ... (Performance)
  - Throughput, bandwidth

# Definitions

- ◆Performance is in units of things per sec
   bigger is better
- ◆If we are primarily concerned with response time

" X is n times faster than Y" means

Advanced Computer Architecture 2004 Chapter 1, p3

# Cycles Per Instruction (Throughput)

"Average Cycles per Instruction"

CPU time = Cycle Time 
$$\times \sum_{j=1}^{n} CPI_{j} \times I_{j}$$

"Instruction Frequency"

$$CPI = \sum_{j=1}^{n} CPI_{j} \times F_{j} \quad \text{where } F_{j} = \frac{I_{j}}{Instructio \ n \ Count}$$

Advanced Computer Architecture 2004 Chapter 1.

# Aspects of CPU Performance (CPU Law)

| CPU time | = Seconds | = Instructions x | Cycles x    | Seconds |  |
|----------|-----------|------------------|-------------|---------|--|
|          | Program   | Program          | Instruction | Cycle   |  |

|              | Inst Count | CPI | Clock Rate |
|--------------|------------|-----|------------|
| Program      | X          |     |            |
| Compiler     | X          | (X) |            |
| Inst. Set.   | Х          | Х   |            |
| Organization |            | Х   | X          |
| Technology   |            |     | Х          |

Advanced Computer Architecture 2004 Chapter 1. p3

# Example: Calculating CPI

Base Machine (Reg / Reg)

|        | (        | 5,     |        |          |
|--------|----------|--------|--------|----------|
| Оp     | Freq     | Cycles | CPI(i) | (% Time) |
| ALU    | 50%      | 1      | .5     | (33%)    |
| Load   | 20%      | 2      | .4     | (27%)    |
| Store  | 10%      | 2      | .2     | (13%)    |
| Branch | 20%      | 2      | .4     | (27%)    |
|        | <u> </u> | l      | 1 5    |          |

Typical Mix of instruction types in program

# Example: Branch Stall Impact

- Assume CPI = 1.0 ignoring branches
- Assume solution was stalling for 3 cycles
- ◆ If 30% branch, Stall 3 cycles
- Op
   Freq Cycles CPI(i) (% Time)
   Other
   Tother
   <
- ♦ => new CPI = 1.9, or almost 2 times slower

Advanced Computer Architecture 2004 Chapter 1, p4

# Example 3: Evaluating Branch Alternatives (for 1 program)

Pipeline speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Branch frequency}} \times \frac{\text{Branch penalty}}{\text{Branch penalty}}$$

| Scheduling<br>scheme | Branch<br>penalty | CPI  | speedup v.<br>stall |
|----------------------|-------------------|------|---------------------|
| Stall pipeline       |                   | 1.42 | 1.0                 |
| Predict taken        | 1                 | 1.14 | 1.26                |
| Predict not tal      | ken 1             | 1.09 | 1.29                |
| Delayed branc        | h 0.5             | 1.07 | 1.31                |

Assuming Conditional & Unconditional branches make up 14% of the total instruction count, and 65% of them change the PC

Advanced Computer Architecture 2004 Chapter 1. p

# Example 2: Speed Up Equation for Pipelining

$$Speedup = \frac{Ideal\ \textit{CPI} \times Pipeline\ depth}{Ideal\ \textit{CPI} + Pipeline\ stall\ \textit{CPI}} \times \frac{\textit{Cycle}\ Time_{unpipelined}}{\textit{Cycle}\ Time_{pipelined}}$$

For simple RISC pipeline, Ideal CPI = 1:

$$Speedup = \frac{Pipeline \ depth}{1 + Pipeline \ stall \ CPI} \times \frac{Cycle \ Time_{unpipelined}}{Cycle \ Time_{pipelined}}$$

Advanced Computer Architecture 2004 Chapter 1, p4

# Example 4: Dual-port vs. Single-port

- ◆ Machine A: Dual ported memory ("Harvard Architecture")
- Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- ◆ Ideal CPI = 1 for both
- Loads are 40% of instructions executed

SpeedUp<sub>A</sub> = Pipeline Depth/(1 + 0) x (clock<sub>unpipe</sub>/clock<sub>pipe</sub>)
= Pipeline Depth

SpeedUp<sub>B</sub> = Pipeline Depth/(1 + 0.4 x 1) x (clock<sub>unpipe</sub>/(clock<sub>unpipe</sub>/ 1.05)

= (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth

SpeedUp<sub>A</sub> / SpeedUp<sub>B</sub> = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

◆ Machine A is 1.33 times faster













# Cache Measures Hit rate: fraction found in that level So high that usually talk about Miss rate Miss rate fallacy: as MIPS to CPU performance, miss rate to average memory access time in memory Average memory-access time Hit time + Miss rate × Miss penalty (ns or clocks) Miss penalty: time to replace a block from lower level, including time to replace in CPU access time: time to lower level f(latency to lower level) transfer time: time to transfer block f(BW between upper & lower levels)













# Q3: Which block should be replaced on a miss?

- ◆ Easy for Direct Mapped
- **◆ Set Associative or Fully Associative:** 
  - Random
  - **LRU (Least Recently Used)**

| Assoc: 2-w |        | ay 4-wa |       | у     | 8-wa  | B-way |       |
|------------|--------|---------|-------|-------|-------|-------|-------|
|            | Size   | LRU     | Ran   | LRU   | Ran   | LRU   | Ran   |
|            | 16 KB  | 5.2%    | 5.7%  | 4.7%  | 5.3%  | 4.4%  | 5.0%  |
|            | 64 KB  | 1.9%    | 2.0%  | 1.5%  | 1.7%  | 1.4%  | 1.5%  |
|            | 256 KB | 1.15%   | 1.17% | 1.13% | 1.13% | 1.12% | 1.12% |

Advanced Computer Architecture 2004 Chapter 1. p



# Q4: What happens on a write?

- Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
- Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - is block clean or dirty?
- Pros and Cons of each?
  - WT: read misses cannot result in writes
  - WB: no repeated writes to same location
- WT always combined with write buffers so that don't wait for lower level memory

# Write Buffer for Write Through



- A Write Buffer is needed between the Cache and Memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: write contents of the buffer to memory
- Write buffer is just a FIFO:
  - Typical number of entries: 4
  - Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare:
  - Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  - Write buffer saturation

Advanced Computer Architecture 2004 Chapter 1. pc

# Summary #1/4: Pipelining & Performance

- ◆ Just overlap tasks; easy if tasks are independent
- ◆ Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

Speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Time is measure of performance: latency or throughput
- \* CPI Law:



Advanced Computer Architecture 2004 Chapter 1. pc

# A Modern Memory Hierarchy

- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.



# Summary #2/4: Caches

- ◆ The Principle of Locality:
  - Program access a relatively small portion of the address space at any instant of
    - **№ Temporal Locality: Locality in Time**
    - M Spatial Locality: Locality in Space
- Three Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Capacity Misses: increase cache size
  - Conflict Misses: increase cache size and/or associativity.
- Write Policy:
  - Write Through: needs a write buffer.
  - Write Back: control can be complex
- Today CPU time is often dominated by memory access time, not just computational work. What does this mean to Compilers, Data structures, Algorithms?