This question is intended to encourage you to think about the cache design space. Talk about it with your friends and with the tutorial assistants:
A: Provided the cache miss penalty is unchanged, the miss rate should improve thanks to spatial locality - how much depends on the application. However, a larger block size also increases the amount of data loaded into the cache but never actually used, so DRAM-cache bandwidth is wasted. If this bandwidth is fixed, increasing the block size increases the miss penalty, and there is a difficult application-dependent tradeoff.
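To see the shape of this tradeoff, here is a hedged numeric sketch (every parameter below is an assumption for illustration, not from the question): with fixed bandwidth the miss penalty grows linearly with block size, while the miss rate falls more and more slowly.

    #include <stdio.h>

    /* Illustrative only: all parameters are assumptions. */
    int main(void) {
        double hit_time = 1.0;     /* cycles */
        double latency  = 20.0;    /* fixed part of the miss penalty, cycles */
        double per_byte = 0.25;    /* cycles per byte at fixed bandwidth */
        int    block[]     = { 16, 32, 64, 128 };
        double miss_rate[] = { 0.100, 0.070, 0.050, 0.045 };  /* assumed */

        for (int i = 0; i < 4; i++) {
            double penalty = latency + per_byte * block[i];
            double amat    = hit_time + miss_rate[i] * penalty;
            printf("block %4dB: penalty %5.1f cycles, AMAT %5.2f cycles\n",
                   block[i], penalty, amat);
        }
        return 0;
    }

With these (made-up) numbers the average memory access time bottoms out at 64-byte blocks and rises again at 128 bytes: the extra transfer time outweighs the shrinking miss rate.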
A: Generally speaking, programs with high spatial locality should benefit, but programs whose high temporal locality is not also spatially localised will suffer. However, if the working set is small enough to fit easily within the cache (or the cache is big), we should always see an improvement.
A: We don't have to wait for the whole cache block to arrive - the load can proceed as soon as the bytes it requested arrive (``early restart''). In fact, we could transfer those bytes first, even if the access is to the middle of the block (``critical word first'').
A: It's faster to pass the virtual address directly to the cache, without having to translate it first. However, we get two consistency problems: (1) different processes commonly map the same virtual address to different physical addresses, and (2) two different processes may map the same physical page at different virtual addresses (eg using mmap).
A: If both the index and the tag are virtual, we have a problem when switching from one process to another, due to (1) above. One solution is to flush the cache on context switches. Another is to augment the tag with an ``address space identifier'', essentially a process id. Alternatively, even if the index is virtual, we can do the translation in parallel with the cache access, store the physical address in the tag, and check against that. This solves problem (1). It does not solve problem (2).
A popular solution is to notice that if each bank of the cache is at most one page in size, address translation cannot change the index - it only changes higher-order bits. So with 4-KByte pages, the maximum size for a direct-mapped cache would be only 4 KBytes - but increasing the associativity allows a larger cache. Thus it is common for level-1 caches to be more associative than simple benchmark studies would justify.
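A small sketch of the arithmetic (the 8-way, 32-KByte configuration and 64-byte blocks are assumptions for illustration): with these numbers the block offset and set index together occupy exactly the 12-bit page offset, so translation can never change the set index.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE  4096u                  /* 4-KByte pages                 */
    #define BLOCK_SIZE 64u                    /* block size (assumption)       */
    #define WAYS       8u                     /* associativity (assumption)    */
    #define CACHE_SIZE (WAYS * PAGE_SIZE)     /* 32 KBytes: each bank = 1 page */
    #define SETS       (CACHE_SIZE / (WAYS * BLOCK_SIZE))   /* 64 sets         */

    int main(void) {
        uint32_t va  = 0x12345678u;
        uint32_t set = (va / BLOCK_SIZE) % SETS;   /* index bits: va[11:6]     */
        /* 6 offset bits + 6 index bits = 12 bits = the page offset, so the
           physical address always yields the same set index as the virtual
           address, and the cache can be indexed before translation finishes. */
        printf("set index of 0x%08x = %u\n", va, set);
        return 0;
    }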
A: The main problems arise from writes, especially since the L1 block size is normally smaller than the L2 block size. Without multi-level inclusion (MLI), we would need a separate path from the L1 cache to main memory. MLI ensures that an L1 write-back or write-through always hits in the L2 cache, so cache-memory transfers occur only between L2 and memory, and only at L2's block size.
for (i = 0; i < 500; i++)
    for (j = 0; j < 500; j++)
        for (k = 0; k < 500; k++)
            C[i][j] += A[i][k] * B[k][j];
A: See Chapter 5 of the lecture notes.
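The standard transformation covered there is loop blocking (tiling). A minimal sketch, not the lecture notes' exact code - the tile size T is an assumption, chosen so that three T x T tiles of doubles (about 15 KBytes) fit in a 16-KByte data cache:

    #define N 500
    #define T 25   /* tile size: 3 * 25 * 25 * 8 bytes ~ 15 KBytes (assumption) */

    void matmul_blocked(double C[N][N], double A[N][N], double B[N][N]) {
        /* N is divisible by T, so no clean-up loops are needed */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int kk = 0; kk < N; kk += T)
                    /* multiply one tile pair; all three tiles stay cache-resident */
                    for (int i = ii; i < ii + T; i++)
                        for (int j = jj; j < jj + T; j++)
                            for (int k = kk; k < kk + T; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

Each element of the three tiles is reused T times before being evicted, instead of B's column being reloaded on every outer iteration.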
A: We need a loop which accesses three different locations that share the same low-order address bits, so that they map to the same line of a direct-mapped cache. E.g.

for (i = 0; i < 500; i++)
    A[i] = A[i] + A[i+32768] + A[i+65536];

If A holds 4-byte elements and the direct-mapped cache is 128 KBytes, the three addresses differ by exact multiples of the cache size, so on every iteration each access evicts the block needed by the next one.
The performance differences for write-invalidate and write-update schemes can arise from both bandwidth consumption and latency. Assume a memory system with 64-byte cache blocks.
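Before turning to the fragment below, a back-of-envelope comparison of the two schemes' bus traffic may help (message sizes are assumptions): suppose one processor writes a shared 64-byte block several times before another processor reads it.

    #include <stdio.h>

    int main(void) {
        int block      = 64;  /* cache block size, from the question            */
        int update_msg = 12;  /* assumed: 8 bytes of data + 4 bytes of address  */
        int inval_msg  = 4;   /* assumed: address-only invalidation             */

        for (int writes = 1; writes <= 8; writes *= 2) {
            int update_traffic = writes * update_msg;   /* one update per write */
            int inval_traffic  = inval_msg + block;     /* invalidate once; the
                                                           reader later misses
                                                           and refetches the
                                                           whole block          */
            printf("%d writes: update %3dB, invalidate %3dB\n",
                   writes, update_traffic, inval_traffic);
        }
        return 0;
    }

With these (assumed) message sizes, write-update wins on bandwidth for a few writes but loses as writes accumulate; on the other hand the updated reader hits immediately, while the invalidated reader pays a full miss latency.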
Consider the following program fragment:
/* processor 1 */            /* processor 2 */
A = 0;                       B = 0;
A = 1;                       B = 1;
if (B == 0) {                if (A == 0) {
    P();                         P();
}                            }
Can both processors execute P simultaneously? Explain your answer carefully.
A: See textbook
Can both processors execute P simultaneously? Explain your answer carefully.
A: This can happen if the processors proceed with the test before waiting for an acknowledgement that the invalidation or update has finished.
What precautions are needed to preserve the expected behaviour?
A: The simple answer is to block the processor's subsequent memory operations until the store's acknowledgement has been received. Another approach with some followers, called ``weak consistency'', is to specify the memory system's semantics in the programmer's manual: consistency is not guaranteed until specific synchronisation instructions are executed.
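As a modern illustration of the first precaution (C11 atomics postdate this sheet, so this is a sketch rather than part of the original answer): sequentially-consistent atomic operations block later accesses until the store is globally visible, so at most one processor can still see the other's flag at 0.

    #include <stdatomic.h>

    atomic_int A, B;       /* both initially 0, as in the fragment */

    void P(void);          /* the guarded action from the question */

    void processor1(void) {
        atomic_store(&A, 1);        /* seq_cst: completes before the load below */
        if (atomic_load(&B) == 0)
            P();
    }

    void processor2(void) {
        atomic_store(&B, 1);
        if (atomic_load(&A) == 0)
            P();
    }

With memory_order_relaxed instead, each load may be performed before the store's invalidations are acknowledged, and both processors can call P - exactly the failure described above.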
Can both processors execute P simultaneously? Explain your answer carefully.
What precautions are needed to preserve the expected behaviour?
A: This is a bit of a trick question - the answer is the same as above. The attraction of the weak-consistency approach is that computation (including memory accesses) can be overlapped with invalidation and acknowledgement messages.
(Paul Kelly, Imperial College, November 1999)