### **Review:** Cache performance

332 Advanced Computer Architecture Chapter 2

Caches and Memory Systems

January 2006

Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's *Computer Architecture, a quantitative approach (3<sup>rd</sup> ed),* and on the lecture slides of David Patterson and John Kubiatowicz's Berkeley course

Advanced Computer Architecture Chapter 2.1

Advanced Computer Architecture Chapter 2.3

Miss-oriented Approach to Memory Access:

 $CPUtime = IC \times \left(CPI_{Execution} + \frac{MemAccess}{Inst} \times MissRate \times MissPenalt y\right) \times CycleTime$   $CPUtime = IC \times \left(CPI_{Execution} + \frac{MemMisses}{Inst} \times MissPenalt y\right) \times CycleTime$   $\bigcirc CPI_{Execution} \text{ includes ALU and Memory instructions}$ 

Separating out Memory component entirely

AMAT = Average Memory Access Time
 CPI<sub>ALUOps</sub> does not include memory instructions

 $CPU time = IC \times \left(\frac{AluOps}{Inst} \times CPI_{Aluops} + \frac{MemAccess}{Inst} \times AMAT\right) \times CycleTime$ 

 $AMAT = HitTime + MissRate \times MissPenalt y$  $= (HitTime_{Inst} + MissRate_{Inst} \times MissPenalt y_{Inst}) +$ 

(HitTime  $_{Data}$  + MissRate  $_{Data}$  × MissPenalt y  $_{Data}$ )

Advanced Computer Architecture Chapter 2.2

#### Average memory access time:

AMAT = HitTime + MissRate × MissPenalt y

## There are three ways to improve cache performance:

- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.

## **Reducing Misses**

- Classifying Misses: 3 Cs
  - Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.
    (Misses in even an Infinite Cache)
  - Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
  - Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)

More recent, 4th "C":

**Coherence** - Misses caused by cache coherence.



3Cs Absolute Miss Rate (SPEC92)

### 2:1 Cache Rule (of thumb!)

miss rate 1-way associative cache size X = miss rate 2-way associative cache size X/2



**3Cs Relative Miss Rate** 



#### How We Can Reduce Misses?

- 3 Cs: Compulsory, Capacity, Conflict
- In all cases, assume total cache size not changed:
- What happens if:
- 1) Change Block Size: Which of 3Cs is obviously affected?
- 2) Change Associativity: Which of 3Cs is obviously affected?
- 3) Change Compiler: Which of 3Cs is obviously affected?

#### 1. Reduce Misses via Larger Block Size



#### 2. Reduce Misses via Higher Associativity

#### 2:1 Cache Rule of thumb:

The Miss Rate of a direct-mapped cache of size N
 Is the same as the Miss Rate of a 2-way set-associative cache size of size N/2

• on average, over a large suite of benchmarks

#### Beware: Execution time is only final measure!

Will Clock Cycle time increase?

Hill [1988] suggested hit time for 2-way vs. 1-way external cache +10%, internal + 2%

Advanced Computer Architecture Chapter 2,10

#### Example: Avg. Memory Access Time vs. Miss Rate

| Example: assume CCT = 1.10 for 2-way, 1.12 for | 4- |
|------------------------------------------------|----|
| way, 1.14 for 8-way vs. CCT direct mapped      |    |

| Cache Size |       | Associo | Associativity |       |  |  |
|------------|-------|---------|---------------|-------|--|--|
| (KB)       | 1-way | 2-way   | 4-way         | 8-way |  |  |
| 1          | 2.33  | 2.15    | 2.07          | 2.01  |  |  |
| 2          | 1.98  | 1.86    | 1.76          | 1.68  |  |  |
| 4          | 1.72  | 1.67    | 1.61          | 1.53  |  |  |
| 8          | 1.46  | 1.48    | 1.47          | 1.43  |  |  |
| 16         | 1.29  | 1.32    | 1.32          | 1.32  |  |  |
| 32         | 1.20  | 1,24    | 1.25          | 1.27  |  |  |
| 64         | 1.14  | 1,20    | 1,21          | 1.23  |  |  |
| 128        | 1.10  | 1.17    | 1.18          | 1.20  |  |  |

(Red means A.M.A.T. not improved by more associativity)

Advanced Computer Architecture Chapter 2.11

#### 3. Reducing Misses via a "Victim Cache"



### 4. Reducing Misses via "Pseudo-Associativity"

•How to combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache?

Divide cache: on a miss, check other half of cache to see if there, if so have a <u>pseudo-hit</u> (slow hit)

|                 | ··· - ··     |
|-----------------|--------------|
| Pseudo Hit Time | Miss Penalty |

Time

Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
 Better for caches not tied directly to processor (L2)
 Used in MIPS R10000 L2 cache, similar in UltraSPARC

Advanced Computer Architecture Chapter 2.13

#### 5. Reducing Misses by <u>Hardware</u> Prefetching of Instructions & Data

- E.g., Instruction Prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - Extra block placed in "stream buffer"
  - On miss check stream buffer
- Works with data blocks too:
  - Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 streams got 43%
  - Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty

Advanced Computer Architecture Chapter 2,14

## 6. Reducing Misses by <u>Software</u> Prefetching Data

- Data Prefetch
  - Load data into register (HP PA-RISC loads)
  - Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
  - Special prefetching instructions cannot cause faults; a form of speculative execution

#### Prefetching comes in two flavors:

- Binding prefetch: Requests load directly into register.
   Must be correct address and register!
- Non-Binding prefetch: Load into cache.
- Can be incorrect. Frees HW/SW to guess!
- Issuing Prefetch Instructions takes time

Is cost of prefetch issues < savings in reduced misses?</li>
 Higher superscalar reduces difficulty of issue bandwidth

### 7. Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4 byte blocks in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts(using tools they developed)

#### 🔵 Data

- Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
- Loop Interchange: change nesting of loops to access data in order stored in memory
- Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
- Blocking: Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

Advanced Computer Architecture Chapter 2,15

### Merging Arrays Example

/\* Before: 2 sequential arrays \*/
int val[SIZE];
int key[SIZE];

```
/* After: 1 array of stuctures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```

#### Reducing conflicts between val & key; improve spatial locality

Advanced Computer Architecture Chapter 2,17

#### Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

Advanced Computer Architecture Chapter 2,18

#### Loop Fusion Example



### Blocking Example

```
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
   for (j = jj; j < min(jj+B-1,N); j = j+1)
        {r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1) {
            r = r + y[i][k]*z[k][j];};
        x[i][j] = x[i][j] + r;
        };
</pre>
```

# B called *Blocking Factor* Capacity Misses from 2N<sup>3</sup> + N<sup>2</sup> to N<sup>3</sup>/B+2N<sup>2</sup> Conflict Misses Too?

(We return to this example and this technique in Chapter 5)

### **Reducing Conflict Misses by Blocking**



 Conflict misses in caches not FA vs. Blocking size
 Lam et al [1991] a blocking factor of 24 had a fifth the misses vs. 48 despite both fit in cache

Advanced Computer Architecture Chapter 2.22



### Summary: Miss Rate Reduction



#### **Review: Improving Cache Performance**

- 1. Reduce the miss rate.
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.

### Write Policy: Write-Through vs Write-Back

- Write-through: all writes update cache and underlying memory/cache
  - Can always discard cached data most up-to-date data is in memory
  - Cache control bit: only a valid bit
- Write-back: all writes simply update cache
  - Can't just discard cached data may have to write it back to memory
  - Cache control bits: both valid and dirty bits
- Other Advantages:
- Write-through:
  - memory (or other processors) always have latest data
  - Simpler management of cache
- Write-back:
  - much lower bandwidth, since data often overwritten multiple times
  - Better tolerance to long-latency memory?

Advanced Computer Architecture Chapter 2.27

Advanced Computer Architecture Chapter 2.26

#### Write Policy 2: Write Allocate vs Non-Allocate (What happens on write-miss)

- Write allocate: allocate new cache line in cache
  - Usually means that you have to do a "read miss" to fill in rest of the cache-line!
  - Alternative: per/word valid bits
- Write non-allocate (or "write-around"):
  - Simply send write data through to underlying memory/cache - don't allocate new cache line!

#### 1. Reducing Miss Penalty: **Read Priority over Write on Miss**



Could simply wait for write buffer to empty, before allowing read

RAW conflicts with main memory reads on cache

- Risks serious increase in read miss penalty (old MIPS 1000 by 50% )
- Solution:

misses

- · Check write buffer contents before read; if no conflicts, let the memory access continue
- Write-back also needs buffer to hold displaced blocks
  - Read miss replacing dirty block
  - Normal: Write dirty block to memory, and then do the read
  - Instead copy the dirty block to a write buffer, then do the read, and then do the write
  - CPU stall less since restarts as soon as do read

#### 2. Reduce Miss Penalty: Early Restart and Critical Word First

- Don't wait for full block to be loaded before restarting CPU
  - <u>Early restart</u>—As soon as the requested word of the block ar rives, send it to the CPU and let the CPU continue execution
  - <u>Critical Word First</u>—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Generally useful only in large blocks,
- (Access to contiguous sequential words is very common but doesn't benefit from either scheme – are they worthwhile?)



Advanced Computer Architecture Chapter 2,29

#### 3. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses

- <u>Non-blocking cache</u> or <u>lockup-free cache</u> allows data cache to continue to supply cache hits during a miss
   requires full/empty bits on registers or out-of-order execution
   requires multi-bank memories
- "<u>hit under miss</u>" reduces the effective miss penalty by working during miss instead of ignoring CPU requests
- "<u>hit under multiple miss</u>" or "<u>miss under miss</u>" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise cannot support)
  - Pentium Pro allows 4 outstanding memory misses

#### Compare:

prefetching: overlap memory access with pre-miss instructions, Non-blocking cache: overlap memory access with post-miss instructions

Advanced Computer Architecture Chapter 2,30

#### What happens on a Cache miss?

#### For in-order pipeline, 2 options:

Freeze pipeline in Mem stage (popular early on: Sparc, R4000)

- IF ID EX Mem stall stall ... stall Mem Wr IF ID EX stall stall stall ... stall stall Ex Wr
- Use Full/Empty bits in registers + MSHR queue
  - MSHR = "Miss Status/Handler Registers" (Kroft)
    - Each entry in this queue keeps track of status of outstanding memory requests to one complete memory line.
    - Per cache-line: keep info about memory address.
    - For each word: register (if any) that is waiting for result.
    - Used to "merge" multiple requests to one memory line
  - New load creates MSHR entry and sets destination register to "Empty". Load is "released" from pipeline.
  - Attempt to use register before result returns causes instruction to block in decode stage.
  - Limited "out-of-order" execution with respect to loads.
     Popular with in-order superscalar architectures.
- Out-of-order pipelines already have this functionality built in... (load queues, etc).

vanced Computer Architecture Chapter 2.31

## Value of Hit Under Miss for SPEC



FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26

- Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

#### 4: Add a second-level cache

L2 Equations

AMAT = Hit Time<sub>L1</sub> + Miss Rate<sub>L1</sub> × Miss Penalty<sub>L1</sub>

Miss Penalty<sub>L1</sub> = Hit Time<sub>L2</sub> + Miss Rate<sub>L2</sub> × Miss Penalty<sub>L2</sub>

#### AMAT = Hit Time $_{L1}$ +

Miss Rate, x (Hit Time, + Miss Rate, + Miss Penalty,)

#### Definitions:

- Local miss rate— misses in this cache divided by the total number of memory accesses to this cache (Miss rate<sub>12</sub>)
- Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate<sub>1.1</sub> × Miss Rate<sub>1.2</sub>)

Global Miss Rate is what matters

dvanced Computer Architecture Chapter 2.33

#### **Comparing Local and Global Miss Rates**

- 32 KByte 1st level cache; Increasing 2nd level cache
- Global miss rate close to single level cache rate provided L2 >> L1

027

805

707

625

40%

36%

20%

- Don't use local miss rate
- L2 not tied to CPU clock cycle!
- Cost & A.M.A.T.
- Generally Fast Hit Times and fewer misses
- Since hits are few, target miss reduction

Fig 5.10 pg416

Advanced Computer Architecture Chapter 2.34

256 512

128

Carine size (KRI)

-+- Local miss 1889

- Ciskal miss rais

🛧 – Sincle catho miss sai

#### Reducing Misses: Which apply to L2 Cache?

#### Reducing Miss Rate

- 1. Reduce Misses via Larger Block Size
- 2. Reduce Conflict Misses via Higher Associativity
- 3. Reducing Conflict Misses via Victim Cache
- 4. Reducing Conflict Misses via Pseudo-Associativity
- 5. Reducing Misses by HW Prefetching Instr, Data
- 6. Reducing Misses by SW Prefetching Data
- 7. Reducing Capacity/Conf. Misses by Compiler Optimizations

### L2 cache block size & A.M.A.T.



Advanced Computer Architecture Chapter 2,35

### Reducing Miss Penalty Summary

 $CPU time = IC \times \left( CPI_{\text{Excession}} + \frac{Memory \ accesses}{Instruction} \times Miss \ rate \ Miss \ penalty \ \times Clock \ cycle \ time$ 

#### Four techniques

Read priority over write on miss
 Early Restart and Critical Word First on miss
 Non-blocking Caches (Hit under Miss, Miss under Miss)
 Second Level Cache

- Can be applied recursively to Multilevel Caches
  - Danger is that time to DRAM will grow with multiple levels in between
  - First attempts at L2 caches can make things worse, since increased worst case is worse

#### Average memory access time:

AMAT = HitTime + MissRate × MissPenalt y

There are three ways to improve cache performance:

- 1. Reduce the miss rate,
- 2. Reduce the miss penalty, or
- 3. Reduce the time to hit in the cache.

Advanced Computer Architecture Chapter 2.37

Advanced Computer Architecture Chapter 2.38

#### Reducing the time to hit in the cache

- Why does the Alpha 21164 have 8KB Instruction and 8KB data cache + 96KB second level cache, all onchip?
  - 1. Keep the cache small and simple
  - 2. Keep address translation off the critical path
  - 3. Pipeline the cache access

### 2. Fast hits by Avoiding Address Translation





Virtual address space is divided into *pages* of equal size. Main Memory is divided into *page frames* the same size.



for students lacking CS background) Advanced Computer Architecture Chapter 2.41

#### Paging - Address Mapping



#### Paging - Address Mapping



Note: The Process Page Table (PPT) itself can be paged

(Review introductory operating systems material for students lacking CS background) Advanced Computer Architecture Chapter 2.42

#### Paging - Page Transfer

What happens when we access a page which is currently not in main memory (i.e. the page table entry is empty)?



→ Suspend running process

- $\rightarrow$  Get page from disk
- → Update page table
- → Resume process (re-execute instruction)

? Can one instruction cause more than one page fault?

The location of a page on disk can be recorded in a separate table or in the page table itself using a *presence* bit.





#### Synonyms and homonyms in address translation

- Homonyms (same sound different meaning)
  - same virtual address points to two different physical addresses in different processes
  - If you have a virtually-indexed cache, flush it between context switches - or include PID in cache tag
- Synonyms (different sound same meaning)

• different virtual addresses (from the same or different processes) point to the same physical address

- in a virtually addressed cache
  - a virtual address could be cached twice under different physical addresses
  - updates to one cached copy would not be reflected in the other cached copy
  - solution: make sure synonyms can't co-exist in the cache, e.g., OS can forces synonyms to have the same index bits in a direct mapped cache (sometimes called page colouring)

(a nice explanation in more detail can be found at http://www.ece.cmu.edu/~jhoe/course/ece447/handouts/L22.pdf)

#### 2. Fast Cache Hits by Avoiding Translation: Process ID impact



#### 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

- If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag
- Limits cache to page size: what if want bigger caches and uses same trick?

Higher associativity

#### Page coloring

- A cache conflict occurs if two cache blocks that have the same tag (physical address) are mapped to two different virtual addresses
- Make sure OS never creates a page table mapping with this property



### 3: Fast Hits by pipelining Cache Case Study: MIPS R4000

#### 8 Stage Pipeline:

IF-first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.

#### IS-second half of access to instruction cache.

- RF-instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX-execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF-data fetch, first half of access to data cache.

#### DS-second half of access to data cache.

#### TC-tag check, determine whether the data cache access hit.

WB-write back for loads and register-register operations.

#### What is impact on Load delay?

Need 2 instructions between a load and its use!

Advanced Computer Architecture Chapter 2.49

| TWO Cycle<br>Load Latency                                                                                                     | IF | IS<br>IF            | RF<br>IS<br>IF           | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS<br>IF | DS<br>DF<br>EX<br>RF<br>IS<br>IF | TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF | WB<br>TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF |  |
|-------------------------------------------------------------------------------------------------------------------------------|----|---------------------|--------------------------|----------------------|----------------------------|----------------------------------|----------------------------------------|----------------------------------------------|--|
| THREE Cycle<br>Branch Latency<br>(conditions evaluated<br>during EX phase)<br>Delay slot plus two st<br>Branch likely cancels |    | IS<br>IF<br>slot if | RF<br>IS<br>IF<br>not to | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS<br>IF | DS<br>DF<br>EX<br>RF<br>IS<br>IF | TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF | WB<br>TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF |  |

**R4000** Performance



#### What is the Impact of What You've Learned About Caches?



Superscalar, Out-of-Order machines hide L1 data cache miss (-5 clocks) but not L2 cache miss (-50 clocks)?

Advanced Computer Architecture Chapter 2.52

Advanced Computer Architecture Chapter 2.50

### Case Study: MIPS R4000



#### Alpha Memory Performance: Miss Rates of SPEC92



### Cache Optimization Summary



### Alpha CPI Components

Instruction stall: branch mispredict (green);

 Data cache (blue); Instruction cache (yellow); L2\$ (pink) Other: compute + reg conflicts, structural conflicts



#### Practical exercise: explore memory hierarchy on your favourite computer

Download Stefan Manegold's "cache and TLB calibrator":

http://www.cwi.nl/~manegold/Calibrator/calibrator.shtml

(or find installed copy in ~phjk/ToyPrograms/C/ManegoldCalibrator)

 This program consists of a loop which runs over an array repeatedly

The size of the array is varied to evaluate cache size

The stride is varied to explore block size



#### Memory hierarchy of a 2.2GHz Intel Pentium 4 Xeon

- Memory access latency is close to 1ns when loop reuses array smaller than 8KB level-1 cache
- While array is smaller than 512KB, access time is 2-8ns, depending on stride
- When array exceeds 512KB, accesses miss both level-1 and level-2 caches
- Worst case (large stride) suffers 158ns access latency
  - How many instructions could be executed in 158ns?
  - what is the level-1 cache block size?
  - What is the level-2 cache block size?

dvanced Computer Architecture Chapter 2,57

#### Instructions for running the Manegold calibrator

#### Get a copy:

ep /homes/phjk/ToyPrograms/C/ManegoldCalibrator/calibrator.c ./

Compile it:

gcc -O3 -o calibrator calibrator.s

- Find out CPU MHz
- cat /proc/cpuinfo

Run it; ./calibrator <CPUMHz> <size> <filename>

- Eg on media03:
  - ./calibrator 3000 64M media03
  - Output is delivered to a set of files "media03.\*"
- Plot postscript graphs using generated gnuplot scripts:
  - gnuplot media03.cache-miss-latency.gp
     gnuplot media03.cache-replace-time.gp
  - gnuplot media03.tLB-miss-latency.gp
- View the generated postscript files:
  - gv media03.cache-miss-latency.ps &

Advanced Computer Architecture Chapter 2,59

### Main Memory Background

#### Performance of Main Memory:

- Latency: Cache Miss Penalty
  - Access Time: time between request and word arrives
  - Cycle Time: time between requests
- Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory
  - Dynamic since needs to be refreshed periodically (8 ms, 1% time)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor Size: DRAM/SRAM - 4-8, Cost/Cycle time: SRAM/DRAM - 8-16



- technology was based on magnetic "cores" - tiny ferrite rings threaded with copper wires
- That's why people talk about "Out-of-Core", "In-Core," "Core Dump"
- Non-volatile, magnetic
- Lost out when 4 Kbit DRAM became available
- Access time 750 ns, cycle time 1500-3000 ns



http://www.faqs.org/docs/electric/Digital/DIGI\_15.html http://www.psych.usyd.edu.au/pdp-11/core.html



The first magnetic core memory, from the IBM 405 Alphabetica Accounting Machine. The photo shows the single drive lines through the cores in the long direction and fifty turns in the short direction. The cores are 150 mil inside diameter, 240 mil outside, 45 mil high. This experimental system was tested successfully in April 1952.



Figure 10. IBM 2361 Core Storage

524,000 36-bit words and a total cycle time of eight microseconds in each memory (1964 - for the IBM7094)



and the capacitor to exchange charge. In this example, a data state Single transistor of either a "0" (0 V) or a "1" ( $V_{\text{BLH}}$ ) is written from the bitline to **@ Capacitor stores charge** the storage capacitor. VBB is the electrical bias applied to the p-well.

http://www.research.ibm.com/journal/rd/462/mandelman.html





### DRAM array design

- Square array of cells
- Address split into Row address and Column Address bits
- Row address selects row of cells to be activated
- Cells discharge
- Cell state latched by percolumn sense amplifiers
- Column address selects data for output
- Data must be written back to selected row

Page 16

#### **4 Key DRAM Timing Parameters**

- t<sub>RAC</sub>: minimum time from RAS line falling to the valid data output.
  - Quoted as the speed of a DRAM when buy
  - A typical 4Mb DRAM t<sub>RAC</sub> = 60 ns
  - Speed of DRAM since on purchase sheet?
- t<sub>RC</sub>: minimum time from the start of one row access to the start of the next.
  - $t_{RC}$  = 110 ns for a 4Mbit DRAM with a  $t_{RAC}$  of 60 ns
- t<sub>CAC</sub>: minimum time from CAS line falling to valid data output.
  - 15 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns
- t<sub>PC</sub>: minimum time from the start of one column access to the start of the next.
   35 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns

Advanced Computer Architecture Chapter 2.65

### Every DRAM access begins

#### DRAM Read Timing



Advanced Computer Architecture Chapter 2.66

### **DRAM** Performance

- A 60 ns (t<sub>RAC</sub>) DRAM can
  - perform a row access only every 110 ns (†<sub>RC</sub>)
  - perform column access (t<sub>CAC</sub>) in 15 ns, but time between column accesses is at least 35 ns (t<sub>PC</sub>).
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead!

### **DRAM** History

- DRAMs: capacity +60%/yr, cost -30%/yr
   2.5X cells/area, 1.5X die size in -3 years
- 2007 DRAM fab line costs \$4.6B (2004 prices)
   DRAM only: density, leakage v. speed
- Rely on increasing no. of computers & memory per computer (60% market)
  - SIMM or DIMM is replaceable unit => computers use any generation DRAM
- Commodity, second source industry
   > high volume, low profit, conservative
   Little organization innovation in 20 years
- Order of importance: 1) Cost/bit 2) Capacity
   First RAMBUS: 10X BW, +30% cost => little impact

"Elpida to Build \$4.6B DRAM Fab in Japan" (*Electronic News*, 6/9/2004) http://www.reed-electronics.com/electronicnews/article/CA424812.html Advanced Computer Architecture Chapter 2, 68



#### Fast Memory Systems: DRAM specific Multiple CAS accesses: several names (page mode) *Extended Data Out (EDO)*: 30% faster in page mode New DRAMs to address gap; what will they cost, will they survive?

- RAMBUS: "reinvent DRAM interface"
   Each Chip a module vs. slice of memory
  - Short bus between CPU and chips
  - Short bus between CPU
     Does own refresh
  - Does own retresh
  - Variable amount of data returned
  - Originally 1 byte / 2 ns (500 MB/s per chip)
  - Direct Rambus DRAM (DRDRAM) 16 bits at 400MHz, with a transfer on both clock edges, leading to 1.66B/s
  - 20% increase in DRAM area
- Synchronous DRAM: 2 banks on chip\_a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz). "Double Data Rate" DDR SDRAM also transfers on both clock edges
- Intel claims RAMBUS Direct (16 b wide) is future PC memory?
- Niche memory or main memory?
  - e.g., Video RAM for frame buffers, DRAM + fast serial output

Advanced Computer Architecture Chapter 2.70



#### Main Memory Performance



#### **Independent Memory Banks**



### **Independent Memory Banks**

How many banks?

 number banks ≤ number clocks to access word in bank
 For sequential accesses, otherwise will return to original bank before it has next word ready
 (like in vector case)

 Increasing DRAM => fewer chips => harder to have banks

Advanced Computer Architecture Chapter 2,74

#### **Avoiding Bank Conflicts**

Lots of banks

#### int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 \* x[i][j];

- Conflicts occur even with 128 banks, since 512 is multiple of 128, conflict on word accesses
- SW: loop interchange or declaring array not power of 2 ("array padding")
- HW: Prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of words in bank
  - modulo & divide per memory access with prime no. banks?
  - address within bank = address mod number words in bank
  - bank number? easy if 2<sup>N</sup> words per bank

### **Fast Bank Number**

#### Chinese Remainder Theorem As long as two sets of integers ai and bi follow these rules

 $bi = x \mod ai, 0 \le bi < ai, 0 \le x < a0 \times a1 \times a2 \times ...$ and that ai and aj are co-prime if  $i \ne j$ , then the integer x has only one solution (unambiguous mapping):

- **Solution** bank number =  $b_0$ , number of banks =  $a_0$  (= 3 in example)
- address within bank = b<sub>1</sub>, number of words in bank = a<sub>1</sub> (= 8 in example)
- N word address 0 to N-1, prime no. banks, words power of 2

|                 |                | Seq. Interleaved |    |     | Modulo Interleaved |    |             |
|-----------------|----------------|------------------|----|-----|--------------------|----|-------------|
| Bank Numl<br>Ad | ber:<br>Idress | 0                | 1  | 2   | 0                  | 1  | 2<br>within |
| Bank:           | 0              | 0                | 1  | 2   | 0                  | 16 | 8           |
|                 | 1              | 3                | 4  | (5) | 9                  | 1  | 17          |
|                 | 2              | 6                | 7  | 8   | 18                 | 10 | 2           |
|                 | 3              | 9                | 10 | 11  | 3                  | 19 | 11          |
|                 | 4              | 12               | 13 | 14  | 12                 | 4  | 20          |
|                 | 5              | 15               | 16 | 17  | 21                 | 13 | (5)         |
|                 | 6              | 18               | 19 | 20  | 6                  | 22 | 14          |
|                 | 7              | 21               | 22 | 23  | 15                 | 7  | 23          |

### Need for Frror Correction!



#### DRAMs per PC over Time



### Architecture in practice



Figure 1. PlayStation 2000 employs an unprecedented level of parallelism to achieve workstation-class 3D performance

(as reported in Microprocessor Report, Vol 13, No. 5)

- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
- Graphics Synthesizer: 2.4 Billion pixels per second
- Claim: Toy Story realism brought to games!



#### More esoteric Storage Technologies?



FLASH



#### Main Memory Summary

- Wider Memory
- Interleaved Memory: for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM specific optimizations: page mode & Specialty DRAM
- Need Error correction