

### Average memory access time:

AMAT = HitTime + MissRate × MissPenalt y

### There are three ways to improve cache performance:

- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.













### 2. Reduce Misses via Higher Associativity

### 2:1 Cache Rule of thumb:

The Miss Rate of a direct-mapped cache of size N
 Is the same as the Miss Rate of a 2-way set-associative cache size of size N/2

on average, over a large suite of benchmarks

### Beware: Execution time is only final measure!

Advanced Computer Architecture Chapter 3,10

Will Clock Cycle time increase?

Hill [1988] suggested hit time for 2-way vs. 1-way external cache +10%, internal + 2%

| E×     | amp                                                                                            | le: A | Avg.             | Mer  | nory  | Access Time vs.<br>Miss Rate                |  |  |  |  |
|--------|------------------------------------------------------------------------------------------------|-------|------------------|------|-------|---------------------------------------------|--|--|--|--|
| • Exc  | Example: assume CCT = 1.10 for 2-way, 1.12 for 4-<br>way, 1.14 for 8-way vs. CCT direct mapped |       |                  |      |       |                                             |  |  |  |  |
| way    |                                                                                                |       |                  |      |       |                                             |  |  |  |  |
|        | Cache<br>(KB)                                                                                  |       | Associa<br>2-way | •    | 8-way |                                             |  |  |  |  |
|        | 1                                                                                              |       | 2.15             |      |       |                                             |  |  |  |  |
|        | 2                                                                                              |       | 1.86             |      |       |                                             |  |  |  |  |
|        | 4                                                                                              | 1.72  | 1.67             | 1.61 | 1.53  |                                             |  |  |  |  |
|        | 8                                                                                              | 1.46  | 1.48             | 1.47 | 1.43  |                                             |  |  |  |  |
|        | 16                                                                                             | 1.29  | 1.32             | 1.32 |       |                                             |  |  |  |  |
|        | 32                                                                                             |       | 1.24             | 1.25 |       |                                             |  |  |  |  |
|        | 64                                                                                             | 1.14  | 1.20             | 1.21 |       |                                             |  |  |  |  |
|        | 128                                                                                            | 1.10  | 1.17             | 1.18 | 1.20  |                                             |  |  |  |  |
|        |                                                                                                |       |                  |      |       |                                             |  |  |  |  |
| (Red m | (Red means A.M.A.T. not improved by more associativity)                                        |       |                  |      |       |                                             |  |  |  |  |
|        |                                                                                                |       | <u></u> pre      |      |       |                                             |  |  |  |  |
|        |                                                                                                |       |                  |      |       | Advanced Computer Architecture Chapter 3.11 |  |  |  |  |





Advanced Computer Architecture Chapter 3,13

Advanced Computer Architecture Chapter 3.15

How to combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache?

Divide cache: on a miss, check other half of cache to see if there, if so have a <u>pseudo-hit</u> (slow hit)

| Pseudo Hit Time | Miss Penalty |  |  |  |  |  |  |
|-----------------|--------------|--|--|--|--|--|--|
| <b>&gt;</b> 4   |              |  |  |  |  |  |  |
|                 |              |  |  |  |  |  |  |
|                 |              |  |  |  |  |  |  |

Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
 Better for caches not tied directly to processor (L2)
 Used in MIPS R10000 L2 cache, similar in UltraSPARC

### 5. Reducing Misses by <u>Hardware</u> Prefetching of Instructions & Data

E.g., Instruction Prefetching

- Alpha 21064 fetches 2 blocks on a miss
- Extra block placed in "stream buffer"
- On miss check stream buffer

Works with data blocks too:

- Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 streams got 43%
- Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty

### 6. Reducing Misses by Software Prefetching Data Data Prefetch Load data into register (HP PA-RISC loads) Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) Special prefetching instructions cannot cause faults; a form of speculative execution Prefetching comes in two flavors: Binding prefetch: Requests load directly into register. Must be correct address and register! Non-Binding prefetch: Load into cache. Can be incorrect. Frees HW/SW to guess! Issuing Prefetch issues < savings in reduced misses?</li> Higher superscalar reduces difficulty of issue bandwidth

### 7. Reducing Misses by Compiler Optimizations

Advanced Computer Architecture Chapter 3,14

Advanced Computer Architecture Chapter 3,16

- McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4 byte blocks in software
- Instructions

Reorder procedures in memory so as to reduce conflict misses
 Profiling to look at conflicts(using tools they developed)

### 👁 Data

- Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
- Loop Interchange: change nesting of loops to access data in order stored in memory
- Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
- Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

## /\* Before: 2 sequential arrays \*/ int val[SIZE]; int key[SIZE]; /\* After: 1 array of stuctures \*/ struct merge { int val; int key; }; struct merge merged\_array[SIZE];

Reducing conflicts between val & key; improve spatial locality

### Loop Interchange Example

Advanced Computer Architecture Chapter 3,18



















### 3. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses

- <u>Non-blocking cache</u> or <u>lockup-free cache</u> allows data cache to continue to supply cache hits during a miss
   requires full/empty bits on registers or out-of-order execution
   requires multi-bank memories
- "<u>hit under miss</u>" reduces the effective miss penalty by working during miss instead of ignoring CPU requests
- "<u>hit under multiple miss</u>" or "<u>miss under miss</u>" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses

- Requires multiple memory banks (otherwise cannot support)
- Pentium Pro allows 4 outstanding memory misses









### Average memory access time:

AMAT = HitTime + MissRate × MissPenalt y

### There are three ways to improve cache performance:

- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.

### Reducing the time to hit in the cache Why does the Alpha 21164 have 8KB Instruction and 8KB data cache + 96KB second level cache, all onchip?

- Keep the cache small and simple
   Keep address translation off the critical path
- 3. Pipeline the cache access

### 2. Fast hits by Avoiding Address Translation CPU CPU CPU VA VA VA VA PA \$ ΤВ ΤВ Tags Tags PA VA ΤВ \$ PA MEM PA MEM MEM Overlap \$ access with VA translation: Conventional Virtually Addressed Cache requires \$ index to Organization Translate only on miss remain invariant Synonym Problem across translation

### 2. Fast hits by Avoiding Address Translation

- Send virtual address to cache? Called <u>Virtually Addressed Cache</u> or just <u>Virtual Cache</u> vs. <u>Physical Cache</u>
   Every time process is switched logically must flush the cache; otherwise get false hits
  - Every time process is switched logically must flush the cache; otherwise get false hits
     Cost is time to flush + "compulsory" misses from empty cache
  - Dealing with <u>aliases</u> (sometimes called <u>synonyms/homonyms</u>): Two different virtual addresses map to same physical address, Two different physical addresses mapped to by the same virtual address in different contexts
  - I/O must interact with cache, so need virtual address
- Solution to aliases
  - HW guaranteess covers index field & direct mapped, they must be unique; called <u>page coloring</u>
- Solution to cache flush
  - Add <u>process identifier tag</u> that identifies process as well as address within process: can't get a hit if wrong process

Advanced Computer Architecture Chapter 3.41

Advanced Computer Architecture Chapter 3.3



### 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

 If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag

| 31 Page Address             | IC II Page Off  |              |
|-----------------------------|-----------------|--------------|
| Page address<br>Rddress (ag | Page (<br>Index | Block offset |
| Address Tag                 | Index           | Block Offset |

Advanced Computer Architecture Chapter 3.43

Limits cache to page size: what if want bigger caches and uses same trick?

Higher associativity moves barrier to right

Page coloring

### 3: Fast Hits by pipelining Cache Case Study: MIPS R4000

### 8 Stage Pipeline:

- IF-first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS-second half of access to instruction cache.
- RF-instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX-execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF-data fetch, first half of access to data cache.
- DS-second half of access to data cache.
- TC-tag check, determine whether the data cache access hit.
- WB-write back for loads and register-register operations.

### What is impact on Load delay?

Need 2 instructions between a load and its use!

| TWO Cycle<br>Load Latency                                                                                                        | IF | IS<br>IF            | RF<br>IS<br>IF           | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS       | DS<br>DF<br>EX<br>RF             | TC<br>DS<br>DF                         | WB<br>TC<br>DS                               |
|----------------------------------------------------------------------------------------------------------------------------------|----|---------------------|--------------------------|----------------------|----------------------------|----------------------------------|----------------------------------------|----------------------------------------------|
|                                                                                                                                  |    |                     |                          |                      | IF                         | IS<br>IF                         | EX<br>RF<br>IS<br>IF                   | DF<br>EX<br>RF<br>IS<br>IF                   |
| THREE Cycle<br>Branch Latency<br>(conditions evaluated<br>during EX phase)<br>Delay slot plus two sta<br>Branch likely cancels d |    | IS<br>IF<br>slot if | RF<br>IS<br>IF<br>not to | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS<br>IF | DS<br>DF<br>EX<br>RF<br>IS<br>IF | TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF | WB<br>TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF |











|              | Cache C                                                                                                                                                                                           | Opti                                       | miz         | ati     | ion Summary                                          |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|-------------|---------|------------------------------------------------------|
| miss rate    | Technique<br>Larger Block Size<br>Higher Associativity<br>Victim Caches<br>Pseudo-Associative Caches<br>HW Prefetching of Instr/Data<br>Compiler Controlled Prefetching<br>Compiler Reduce Misses | MR<br>+<br>+<br>+<br>+<br>+<br>+<br>+<br>+ | МР<br>-     | НТ<br>- | <i>Complexity</i><br>0<br>1<br>2<br>2<br>2<br>3<br>0 |
| miss penalty | Priority to Read Misses<br>Early Restart & Critical Word 1st<br>Non-Blocking Caches<br>Second Level Caches                                                                                        |                                            | +<br>+<br>+ | ,       | 1<br>2<br>3<br>2                                     |

# <section-header> Practical exercise: explore memory bierocharachy on your favourite computed computed for the provided of the provide















Junk

Data Out









Advanced Computer Architecture Chapter 3.67

**Avoiding Bank Conflicts** Lots of banks int x[256][512]; for (j = 0; j < 512; j = j+1)for (i = 0; i < 256; i = i+1)x[i][j] = 2 \* x[i][j];Conflicts occur even with 128 banks, since 512 is multiple of 128, conflict on word accesses SW: loop interchange or declaring array not power of 2 ("array padding") HW: Prime number of banks bank number = address mod number of banks address within bank = address / number of words in bank modulo & divide per memory access with prime no. banks? address within bank = address mod number words in bank bank number? easy if 2<sup>N</sup> words per bank Advanced Computer Architecture Chapter 3.68











