Advanced Computer Architecture
Chapter 7: Shared memory and cache consistency

Paul Kelly
Dept of Computing
Imperial College
p.kelly@ic.ac.uk
http://www.doc.ic.ac.uk/~phjk
Overview

Why add another processor?
How should they be connected – I/O, memory bus, L2 cache, registers?
Cache coherency – the problem
Coherency – what are the rules?
Coherency using broadcast – update/invalidate
Coherency using multicast – SMP vs ccNUMA
Distributed directories, home nodes, ownership; the ccNUMA design space
Beyond ccNUMA: COMA and Simple COMA
Hybrids and clusters
Why add another processor?

Increasing the complexity of a single CPU leads to diminishing returns:
Due to lack of instruction-level parallelism
Too many simultaneous accesses to one register file
Forwarding wires between functional units too long – inter-cluster communication takes more than one cycle
Architectural effectiveness of Intel processors (figure slides)
How to add another processor?

Idea: instead of trying to exploit more instruction-level parallelism by building a bigger CPU, build two – or more
This only makes sense if the application parallelism exists…
Why might it be better?
No need for a multiported register file
No need for long-range forwarding
CPUs can take independent control paths
Still need to synchronise and communicate
Program has to be structured appropriately…
How to add another processor?

How should the CPUs be connected?
Idea: systems linked by a network connected via the I/O bus
  Eg Fujitsu AP3000, Myrinet, Quadrics
Idea: CPU/memory packages linked by a network connecting the main memory units
  Eg SGI Origin
Idea: CPUs share main memory
  Eg Intel Xeon SMP
Idea: CPUs share L2/L3 cache
  Eg IBM Power4
Idea: CPUs share L1 cache
Idea: CPUs share registers and functional units
  Eg Cray/Tera MTA (multithreaded architecture)
How to connect processors…

Tradeoffs:
Close coupling to minimise delays incurred when processors interact
Separation to avoid contention for shared resources
Result:
A spectrum of alternative approaches based on application requirements, cost, and packaging/integration issues
Currently:
Just possible to integrate two full-scale CPUs on one chip together with a large shared L2 cache
Common to link multiple CPUs on the same motherboard with a shared bus connecting to main memory
More aggressive designs use a richer interconnection network, perhaps with cache-to-cache transfer capability
Multiple caches… and trouble

Suppose processor 0 loads memory location x
x is fetched from main memory and allocated into processor 0’s cache(s)

Multiple caches… and trouble

Suppose processor 1 loads memory location x
x is fetched from main memory and allocated into processor 1’s cache(s) as well

Multiple caches… and trouble

Suppose processor 0 stores to memory location x
Processor 0’s cached copy of x is updated
Processor 1 continues to use the old value of x

Multiple caches… and trouble

Suppose processor 2 loads memory location x
How does it know whether to get x from main memory, processor 0, or processor 1?
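As an aside, the same scenario can be written as a two-thread C++ sketch (the roles of processor 0 and processor 1 are played by threads; the names below are illustrative). Processor 0 stores to x while processor 1 keeps loading it; hardware cache coherence is what guarantees the spin loop below eventually observes the new value rather than reading a stale cached copy forever.

```cpp
// Minimal illustration of the sharing pattern in the slides (illustrative only).
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};   // the shared memory location "x"

void processor0() {
    x.store(42, std::memory_order_release);   // processor 0 stores to x
}

void processor1() {
    int v;
    // Processor 1 re-reads its (possibly cached) copy of x. Hardware cache
    // coherence (plus the atomic load) is what guarantees this loop
    // eventually sees 42 instead of spinning on a stale value indefinitely.
    while ((v = x.load(std::memory_order_acquire)) == 0) { /* spin */ }
    std::printf("processor 1 finally sees x = %d\n", v);
}

int main() {
    std::thread p1(processor1);
    std::thread p0(processor0);
    p0.join();
    p1.join();
    return 0;
}
```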
Implementing distributed, shared memory

Two issues:
How do you know where to find the latest version of the cache line?
How do you know when you can use your cached copy – and when you have to look for a more up-to-date version?

We will find answers to these after first thinking about what a distributed shared memory implementation is supposed to do…
Cache consistency (aka cache coherency)

Goal (?):
“Processors should not continue to use out-of-date data indefinitely”
Goal (?):
“Every load instruction should yield the result of the most recent store to that address”
Goal (?) (definition: sequential consistency):
“the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program”
(Leslie Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs”, IEEE Trans. Computers, Vol. C-28(9), Sept 1979)
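A standard litmus test for sequential consistency, as a hedged C++ sketch (the names r0 and r1 are illustrative): each thread stores to one variable and then loads the other. Under Lamport’s definition the outcome r0 == 0 and r1 == 0 is forbidden, because in any single interleaving one of the stores must come before both loads; std::memory_order_seq_cst is used here to model an SC memory.

```cpp
// Store-buffering litmus test: under sequential consistency,
// (r0 == 0 && r1 == 0) can never be observed.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0 = 0, r1 = 0;

int main() {
    std::thread p0([] { x.store(1, std::memory_order_seq_cst);
                        r0 = y.load(std::memory_order_seq_cst); });
    std::thread p1([] { y.store(1, std::memory_order_seq_cst);
                        r1 = x.load(std::memory_order_seq_cst); });
    p0.join();
    p1.join();
    std::printf("r0=%d r1=%d  (r0=0 and r1=0 is forbidden under SC)\n", r0, r1);
    return 0;
}
```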
Implementing Strong Consistency: update

Idea #1: when a store to address x occurs, update all the remote cached copies
To do this we need either:
To broadcast every store to every remote cache
Or to keep a list of which remote caches hold the cache line
Or at least to keep a note of whether there are any remote cached copies of this line
But first… how well does this update idea work?
Implementing Strong Consistency: update…

Problems with update:
What if the cache line is several words long?
Each update to each word in the line leads to a broadcast
What about old data which other processors are no longer interested in?
We’ll keep broadcasting updates indefinitely…
Do we really have to broadcast every store?
It would be nice to know that we have exclusive access to the cache line so we don’t have to broadcast updates…
A more cunning plan… invalidation

Suppose that instead of updating remote cache lines, we invalidate them all when a store occurs?
After the first write to a cache line we know there are no remote copies – so subsequent writes don’t lead to communication
Is invalidate always better than update?
Often
But not if the other processors really need the new data as soon as possible
Note that to exploit this, we need a bit on each cache line indicating its sharing state
(analogous to write-back vs write-through)
Update vs invalidate

Update:
May reduce latency of subsequent remote reads
if any are made
Invalidate:
May reduce network traffic
e.g. if the same CPU writes to the line before remote nodes take copies of it
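A toy way to see the tradeoff is to count bus messages for a single cache line under each policy on a short access trace. This is only a counting sketch under simplified assumptions (one message per update broadcast, per invalidation broadcast, and per fetch of a missing copy); the function names and the trace are invented for illustration.

```cpp
// Toy comparison of write-update vs write-invalidate on one cache line.
#include <cstdio>
#include <set>
#include <vector>

struct Access { int cpu; bool is_write; };

// Write-update: every write broadcasts the new value to all other sharers.
int messages_update(const std::vector<Access>& trace) {
    std::set<int> sharers;            // CPUs currently holding a copy
    int msgs = 0;
    for (const Access& a : trace) {
        if (!sharers.count(a.cpu)) { sharers.insert(a.cpu); ++msgs; }  // fetch a copy
        if (a.is_write && sharers.size() > 1) ++msgs;                  // broadcast update
    }
    return msgs;
}

// Write-invalidate: the first write broadcasts an invalidation; later writes
// by the same CPU are silent until some other CPU takes a copy again.
int messages_invalidate(const std::vector<Access>& trace) {
    std::set<int> sharers;
    int exclusive_owner = -1;         // CPU holding the only copy, or -1
    int msgs = 0;
    for (const Access& a : trace) {
        if (!sharers.count(a.cpu)) {
            sharers.insert(a.cpu); ++msgs;                             // fetch a copy
            exclusive_owner = -1;
        }
        if (a.is_write && exclusive_owner != a.cpu) {
            if (sharers.size() > 1) ++msgs;                            // broadcast invalidation
            sharers = {a.cpu};                                         // others drop their copies
            exclusive_owner = a.cpu;
        }
    }
    return msgs;
}

int main() {
    // CPU 0 writes the line repeatedly; CPU 1 reads it occasionally.
    std::vector<Access> trace = {{0,true},{1,false},{0,true},{0,true},{0,true},{1,false}};
    std::printf("update: %d bus messages, invalidate: %d bus messages\n",
                messages_update(trace), messages_invalidate(trace));
}
```

On this trace invalidate sends fewer messages; a trace where remote CPUs read after every write would favour update, which is exactly the tradeoff above.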
The “Berkeley” Protocol

Four cache line states (listed on the next slide)

Broadcast invalidations on the bus unless the cache line is exclusively “owned” (DIRTY)
Berkeley cache coherence protocol: state transition diagram

1. INVALID
2. VALID: clean, potentially shared, unowned
3. SHARED-DIRTY: modified, possibly shared, owned
4. DIRTY: modified, only copy, owned
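A minimal sketch of the four states and of one transition, a CPU write hit, assuming the rule stated above (broadcast an invalidation unless the line is already DIRTY). The enum and function names are illustrative, not an actual controller interface.

```cpp
// Berkeley line states and the CPU write-hit rule, as a sketch.
#include <cstdio>

enum class LineState { Invalid, Valid, SharedDirty, Dirty };

struct CacheLine { LineState state; };

// CPU write hit on a line held in this cache: broadcast an invalidation
// unless the line is already exclusively owned (DIRTY); end up in DIRTY.
bool write_hit(CacheLine& line) {
    bool broadcast_invalidate = (line.state != LineState::Dirty);
    line.state = LineState::Dirty;
    return broadcast_invalidate;
}

int main() {
    CacheLine line;
    line.state = LineState::Valid;   // clean, possibly shared
    bool bcast = write_hit(line);
    std::printf("write hit on VALID line: broadcast invalidate = %s, new state = DIRTY\n",
                bcast ? "yes" : "no");
}
```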
The job of the cache controller - snooping

The protocol state transitions are implemented by the cache controller – which “snoops” all the bus traffic
Transitions are triggered either by
the bus (Bus invalidate, Bus write miss, Bus read miss)
or the CPU (Read hit, Read miss, Write hit, Write miss)

For every bus transaction, the controller looks up the directory (cache line state) information for the specified address
If this processor holds the only valid data (DIRTY), it responds to a “Bus read miss” by providing the data to the requesting CPU
If the memory copy is out of date, one of the CPUs will have the cache line in the SHARED-DIRTY state (because it updated it last) – so it must provide the data to the requesting CPU
The state transition diagram doesn’t show what happens when a cache line is displaced…
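The bus-side half of the controller can be sketched the same way: on observing a Bus read miss, a DIRTY or SHARED-DIRTY holder supplies the data (and a DIRTY holder drops to SHARED-DIRTY, since the line may now be shared). This is a hedged illustration of the behaviour described above, not the literal hardware logic.

```cpp
// Reaction of a snooping controller that sees a "Bus read miss" for a line it holds.
#include <cstdio>

enum class LineState { Invalid, Valid, SharedDirty, Dirty };

// Returns true if this cache must supply the data instead of main memory.
bool snoop_bus_read_miss(LineState& state) {
    switch (state) {
        case LineState::Dirty:                // only up-to-date copy anywhere:
            state = LineState::SharedDirty;   // supply it; the line is now possibly shared
            return true;
        case LineState::SharedDirty:          // memory is stale; the owner supplies the data
            return true;
        case LineState::Valid:                // clean copy: let main memory answer
        case LineState::Invalid:
        default:
            return false;
    }
}

int main() {
    LineState s = LineState::Dirty;
    bool supplies = snoop_bus_read_miss(s);
    std::printf("DIRTY holder supplies data: %s; its new state is SHARED-DIRTY\n",
                supplies ? "yes" : "no");
}
```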
Berkeley protocol - summary

Invalidate is usually better than update
The cache line state (“DIRTY”) records whether remote copies exist
If remote copies exist, they are invalidated by broadcasting a message on the bus – cache controllers snoop all traffic
Where to get the up-to-date data from?
Broadcast a read miss request on the bus
If this CPU’s copy is DIRTY, it responds
If no cache copies exist, main memory responds
If several copies exist, the CPU which holds the line in the SHARED-DIRTY state responds
If a SHARED-DIRTY cache line is displaced, … we need a plan
How well does it work?
See the extensive analysis in Hennessy and Patterson
Large-Scale Shared-Memory Multiprocessors

The bus inevitably becomes a bottleneck when many processors are used
Use a more general interconnection network
So snooping does not work
DRAM memory is also distributed
Each node allocates space from its local DRAM
Copies of remote data are made in cache
Major design issues:
How to find and represent the “directory” of each line?
How to find a copy of a line?
As a case study, we will look at S3.MP (Sun’s Scalable Shared memory Multi-Processor), a CC-NUMA (cache-coherent non-uniform memory access) architecture
Case study: Sun’s S3MP

Protocol basics:
S3.MP uses distributed singly-linked sharing lists, with static homes
Each line has a “home” node, which stores the root of the directory
Requests are sent to the home node
The home either has a copy of the line, or knows a node which does
S3MP: Read Requests

Simple case: initially only the home has the data
S3MP: Read Requests - remote

More interesting case: some other processor has the data

The home passes the request to the first processor in the chain, adding the requester into the sharing list
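A sketch of what the home node might do on a remote read miss, assuming a singly-linked sharing list rooted at the home as described above. The structure and function names (SharerNode, handle_read_miss) are illustrative; the real S3.MP protocol also exchanges messages and updates cache states, which are omitted here.

```cpp
// Home-node handling of a read miss with a singly-linked sharing list (sketch).
#include <cstdio>

struct SharerNode {
    int node_id;        // which node holds a copy of the line
    SharerNode* next;   // next node in the singly-linked sharing chain
};

struct DirectoryEntry {
    SharerNode* head = nullptr;   // root of the sharing list, kept at the home node
};

// Insert the requester at the head of the chain and return who should supply
// the data: -1 means the home's own DRAM, otherwise the request is forwarded
// to the first node already in the chain.
int handle_read_miss(DirectoryEntry& dir, int requester) {
    int supplier = dir.head ? dir.head->node_id : -1;
    dir.head = new SharerNode{requester, dir.head};
    return supplier;
}

int main() {
    DirectoryEntry line;                  // initially only the home has the data
    int s1 = handle_read_miss(line, 3);   // node 3 reads: supplied by the home (-1)
    int s2 = handle_read_miss(line, 7);   // node 7 reads: forwarded to node 3
    std::printf("first read supplied by %d, second read supplied by %d\n", s1, s2);
    // the sharing chain is now 7 -> 3
}
```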
S3MP - Writes

If the line is exclusive (i.e. the dirty bit is set), no message is required
Otherwise send a write request to the home
The home sends an invalidation message down the chain
Each copy is invalidated (other than that of the requester)
The final node in the chain acknowledges the requester and the home
The chain is locked for the duration of the invalidation
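The write path can be sketched similarly: the home walks the chain and invalidates every copy except the requester’s, leaving the requester as the sole (now exclusive) holder. Chain locking and the final acknowledgement are reduced to comments; the names are again illustrative.

```cpp
// Home-node handling of a write request: invalidate all other copies (sketch).
#include <cstdio>

struct SharerNode {
    int node_id;
    SharerNode* next;
};

// Returns the number of invalidation messages sent down the chain.
int handle_write_request(SharerNode*& head, int requester) {
    // (In the real protocol the chain is locked for the duration of this walk.)
    int invalidations = 0;
    SharerNode* survivor = nullptr;
    for (SharerNode* n = head; n != nullptr; ) {
        SharerNode* next = n->next;
        if (n->node_id == requester) {
            survivor = n;          // the requester keeps its copy, now exclusive
        } else {
            ++invalidations;       // "invalidate" message sent to node n->node_id
            delete n;
        }
        n = next;
    }
    if (survivor) survivor->next = nullptr;
    head = survivor;
    // (The last node in the chain would acknowledge the requester and the home.)
    return invalidations;
}

int main() {
    // Sharing chain 7 -> 3 -> 5; node 7 now writes to the line.
    SharerNode* head = new SharerNode{7, new SharerNode{3, new SharerNode{5, nullptr}}};
    int sent = handle_write_request(head, 7);
    std::printf("invalidations sent: %d; chain now holds only node %d\n",
                sent, head->node_id);
}
```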
S3MP - Replacements

When a read or write requires a line to be copied into the cache from another node, an existing line may need to be replaced
Must remove it from the sharing list
Must not lose the last copy of the line
Finding your data

How does a CPU find a valid copy of a specified address’s data?
Translate the virtual address to a physical address
The physical address includes bits which identify the “home” node
The home node is where the DRAM for this address resides
But the current valid copy may not be there – it may be in another CPU’s cache
The home node holds a pointer to the sharing chain, so it always knows where a valid copy can be found
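For illustration, extracting the home node from a physical address might look like the sketch below; the bit positions and node count are invented here, since they depend entirely on the machine’s addressing scheme.

```cpp
// "Physical address includes bits which identify the home node" (sketch).
#include <cstdint>
#include <cstdio>

// Assumed layout, purely for illustration: a 4-bit node id in bits 36..39.
constexpr unsigned kNodeBits  = 4;
constexpr unsigned kNodeShift = 36;

unsigned home_node(std::uint64_t phys_addr) {
    return static_cast<unsigned>((phys_addr >> kNodeShift) & ((1u << kNodeBits) - 1));
}

int main() {
    std::uint64_t addr = (std::uint64_t{5} << kNodeShift) | 0x1234;  // DRAM on node 5
    std::printf("address 0x%llx lives on home node %u\n",
                static_cast<unsigned long long>(addr), home_node(addr));
}
```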
ccNUMA summary

S3MP’s cache coherency protocol implements strong consistency
Many recent designs implement a weaker consistency model…
S3MP uses a singly-linked sharing chain
Widely-shared data – long chains – long invalidations, nasty replacements
“Widely shared data is rare”
In real life:
IEEE Scalable Coherent Interconnect (SCI): doubly-linked sharing list
SGI Origin 2000: bit-vector sharing list
Real Origin 2000 systems in service with 256 CPUs
Sun E10000: hybrid – multiple buses for invalidations, separate switched network for data transfers
Many E10000s in service, often with 64 CPUs
Beyond ccNUMA

COMA: cache-only memory architecture
Data migrates into the local DRAM of the CPUs where it is being used
Handles very large working sets cleanly
Replacement from DRAM is messy: have to make sure someone still has a copy
Scope for an interesting OS/architecture hybrid
The system slows down if the total memory requirement approaches the RAM available, since the space for replication is reduced
Examples: DDM, KSR-1/2, rumours from IBM…
Clustered architectures

Idea: systems linked by a network connected via the I/O bus
  Eg Fujitsu AP3000, Myrinet, Quadrics
Idea: CPU/memory packages linked by a network connecting the main memory units
  Eg SGI Origin
Idea: CPUs share main memory
  Eg Intel Xeon SMP (and many others)
Idea: CPUs share L2/L3 cache
  Eg IBM Power4
Idea: CPUs share L1 cache
Idea: CPUs share registers and functional units
  Eg Alpha 21464/EV8, Cray/Tera MTA (multithreaded architecture)
Which cache should the cache controller control?

The L1 cache is already very busy with CPU traffic
The L2 cache is also very busy…
The L3 cache doesn’t always have the current value for a cache line
Although the L1 cache is normally write-through, L2 is normally write-back
Some data may bypass the L3 (and perhaps L2) cache (e.g. when stream-prefetched)
In the Power4, the cache controller manages the L2 cache – all external invalidations/requests go through it
The L3 cache improves access to DRAM for accesses both from the CPU and from the network
Summary and Conclusions

Caches are essential to get the maximum performance from modern microprocessors
A cache offers performance close to that of SRAM at a cost close to that of DRAM
Caches can be used to form the basis of a parallel computer
Bus-based multiprocessors do not scale well: typically fewer than 10 nodes
Larger-scale shared-memory multiprocessors require more complicated networks and protocols
CC-NUMA is becoming popular since systems can be built from commodity components (chips, boards, OSs) and use existing software
e.g. HP/Convex, Sequent, Data General, SGI, Sun, IBM