332 Advanced Computer Architecture

Unassessed tutorial exercise 5

Caches

A new computer is to be built using a DLX CPU. Trace-driven simulation of a set of programs gave the following results:

 
Table 1: Miss-rates for unified caches

Cache size Assoc. Tot miss rate Compulsory   Capacity   Conflict  
32 KB 2-way 0.041 0.009 22% 0.028 68% 0.004 11%
64 KB 4-way 0.028 0.009 32% 0.019 68% 0.000 0%
128 KB 1-way 0.026 0.009 34% 0.004 16% 0.013 50%



 
Table 2: Miss-rates for instruction-only, data-only and unified caches (2-way set-associativity)

Cache size Instrn. only. Data only Unified
32 KB 2.2% 4.0% 4.1%
64 KB 1.4% 2.8% 2.9%
128 KB 1.0% 2.1% 1.9%


Assume the following: You are considering three options:
A:
A 4-way set-associative unified cache of 64 KB
B:
Two 2-way set-associative caches of 32 KB each, one for instructions and one for data
C:
A direct-mapped unified cache of 128 KB. Assume that the clock rate is 10% faster in this case since the mapping is direct and the CPU address does not need to drive two caches, nor does the data bus need to be multiplexed. This faster clock rate increases the miss penalty to 13 clock cycles

1.
What is the average CPI for each of the three cache organisations?
2.
What cache organisation gives best average performance?
3.
Sketch a program for which each scheme gives the worst performance
4.
Suppose that the base CPI (assuming a perfect memory system) is 5. How well do options A and C perform?


(These exercises are based on Hennessy and Patterson, first edition, Exercise 8.5).


Paul Kelly and Andrew Bennett, Imperial College, 1999.

332 Advanced Computer Architecture: Notes on the solutions to Exercise 5

Q1: What is the average CPI for each of the three caches?

It is convenient to divide the execution time of the program into several components in order to calculate the total CPI. The total CPI is the sum of these four items:

1.
Base CPI - cycles used by the program operating on an ideal memory system
2.
Extra cycles required for loads and stores with a unified cache - each load and store requires one extra cycle (see assumptions above)
3.
Extra cycles for cache misses due to instruction fetches
4.
Extra cycles for cache misses due to data accesses

Components 3 and 4 above are dependent on the miss rate and penalty:

\begin{displaymath}CPI_{miss} = \frac{misses}{instruction} * miss\ penalty\end{displaymath}

i.e.

\begin{displaymath}CPI_{miss} = \frac{memory\ accesses}{instruction}*miss\ rate*miss\ penalty\end{displaymath}

Cache configuration A:

Cache configuration B:

Cache configuration C:

Cache configuration B gives the best CPI.

Q2: What cache organisation gives best average performance?

The cycle time, which varies between cache configuration, must be taken into account to determine which has the best performance. Multiplying the total CPIs derived above by cycle time for each configuration (20ns, 20ns and 18ns) gives A:43.7ns/instruction, B: 37.8ns/instruction, C: 39.3ns/instruction

Configuration B gives best performance. C is worse, despite being much larger in size: the extra overhead of the unified cache, the increased miss penalty, and limited associativity all outweigh the advantage of using a larger cache.

Q3: Sketch a program for which each scheme gives the worst performance

The differing sizes and associativities of the caches can be used to construct programs which perform badly in each case. A tight loop which accesses several arrays is used. The arrays are allocated so that each is cached into the same cache blocks, and references are made to each array in turn, causing conflicts. Similarly, several procedures are required which also map to the same cache blocks.

Configuration C has the lowest associativity, and is therefore easiest to deal with: it can be made to perform badly by using two arrays which map to the same blocks. Since it is direct mapped, accesses to the arrays will be misses due to replacements. The other configurations will perform better due their greater associativity.

The small size of configuration B can be used to make it the slowest. Three arrays are required since it is 2-way set associative - they must be allocated so that they map to the same cache blocks only on configuration B, so that no conflicts occur on the other, larger caches.

To make A the slowest requires two procedures which also map onto the same cache blocks as the arrays.

Q4: Suppose that the base CPI (assuming a perfect memory system) is 5. How well do options A and C perform?

Replacing base CPI in the expressions in answer 1 gives:

Multiplying by the appropriate cycle time gives:

C is the now the fastest configuration: the improved cycle time leads to the best performance when the CPI is so high.

Paul Kelly and Andrew Bennett, Imperial College, 1999.


next up previous