Scalable Shared-Memory Implementations
Erik Hagersten, Uppsala University, Sweden
Cache-to-cache in snoop-based
[Figure: three threads with private caches on a snooping bus. One thread reads and then writes A, leaving the only valid copy in its cache; B lives in another cache. When a third thread later issues Read A, its BusRTS (read-to-share) is snooped by all caches: the requester sees "my RTS → wait for data", while the owning cache sees a foreign RTS for a line it owns ("gotta answer") and supplies the data cache-to-cache.]
”Upgrade” in dir-based
[Figure: a thread holding A in shared state writes A. In a directory protocol the request goes to A's home, which looks up "who has a copy" in the directory, sends INV messages to all other sharers, and collects their ACKs before the writer is granted exclusive (upgraded) access.]
Cache-to-cache in dir-based
[Figure: cache-to-cache transfer in a directory protocol. A thread's Read A reaches A's home node, whose directory shows that another cache owns a dirty copy ("who has a copy"); the home forwards the read request (ReadDemand) to the owner, the owner supplies the data directly to the requester, and an Ack closes the transaction.]
Directory-based coherence:
per-cache-line info in the memory
[Figure: threads with private caches, each keeping per-line cache state, in front of a memory holding A and B; the memory also keeps a directory state entry per cache line, maintained by the directory protocol engine.]
Directory-based snooping: NUMA.
Per-cache-line info in the home node
[Figure: NUMA organization: each node has threads with caches, a slice of the memory (A homed in one node, B in the other), and a directory protocol engine; the directory state for a node's memory is kept in that home node, and the nodes are connected by an interconnect.]
Why directory-based
P2P messages → high bandwidth
Suits out-of-the-box coherence
Much more scalable!
Note:
Dir-based can be used to build a uniform-memory architecture (UMA)
Bandwidth will be great!!
Memory latency will be OK
Cache-to-cache latency will not!
Memory overhead can be high (storing directory...)
Fully mapped directory
• k nodes
• Each node is the "home" for 1/k of the memory
• Directory entry per cache line in home memory: k presence bits + 1 dirty bit
• Requests are first sent to the home node's CA
[Figure: the home memory with one directory entry per cache line: a presence bit per node plus a dirty bit.]
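Why the fully mapped scheme gets expensive at scale: a minimal calculation sketch in C (the 64-byte line size and the node counts are assumptions chosen for the example, not figures from the slide):

#include <stdio.h>

int main(void)
{
    /* One directory entry per cache line in the home memory:
     * k presence bits + 1 dirty bit, compared with the line's data bits. */
    const int line_bits = 64 * 8;                 /* assumed 64-byte lines */
    for (int k = 4; k <= 1024; k *= 4) {
        double overhead = (double)(k + 1) / line_bits;
        printf("k = %4d nodes: %5.1f%% directory overhead\n",
               k, 100.0 * overhead);
    }
    return 0;
}

Already at 64 nodes the entry costs roughly 13% of the memory it describes, and the cost grows linearly with k, which is what the SCI and limited-pointer schemes on the following slides attack.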
Reducing the Memory Overhead: SCI
--- Scalable Coherent Interface (SCI)
• home only holds a pointer to the rest of the directory info [log(N) bits]
• distributed linked list of copies, weaves through the caches
• each cache tag has a pointer to the next cache with a copy
• on a read, add yourself to the head of the list (communication needed)
• on a write, propagate a chain of invalidations down the list
• on replacement, remove yourself from the list (a sketch of these list operations follows the figure below)
[Figure: three nodes (P + cache) and the home main memory; the home's directory holds only the data and a log N-bit pointer to the first cache on the sharing list, and each cache tag points to the next cache holding a copy.]
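A rough C sketch of the list operations just listed. It is a simplification: real SCI keeps a doubly linked list (forward and backward pointers) so a cache can unlink itself on replacement, while this sketch is singly linked and only shows head insertion and invalidation down the chain:

#include <stdio.h>

#define NIL      (-1)
#define MAX_NODE 64

static int head = NIL;             /* kept at the home: a log(N)-bit pointer */
static int next_ptr[MAX_NODE];     /* kept in each node's cache tag          */

/* A new reader links itself in at the head (one message to the home). */
void sci_add_reader(int node)
{
    next_ptr[node] = head;         /* point at the previous head sharer      */
    head = node;                   /* home's pointer now names this node     */
}

/* A writer walks the list and invalidates every other copy. */
void sci_write(int writer)
{
    for (int n = head; n != NIL; n = next_ptr[n])
        if (n != writer)
            printf("invalidate the copy in node %d\n", n);
    head = writer;                 /* the writer is now the only sharer      */
    next_ptr[writer] = NIL;
}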
Cache Invalidation Patterns
[Figure: invalidation-pattern histograms for Barnes-Hut and Radiosity: percentage of shared writes versus the number of copies invalidated (0 up to 60-63). Both distributions are heavily skewed toward very small counts (the largest bars, about 48% and 58%, fall at zero or one invalidation), and writes that invalidate many copies are rare.]
Overflow Schemes for Limited Pointers
Broadcast (Dir_i B)
  broadcast bit turned on upon overflow
  bad for widely shared data that is later invalidated
No-broadcast (Dir_i NB)
  on overflow, a new sharer replaces (invalidates) one of the old ones
  bad for widely read data
Coarse vector (Dir_i CV)
  change representation to a coarse vector, 1 bit per k nodes
  on a write, invalidate all nodes that a set bit corresponds to (see the sketch after the figure below)
Dir in memory:
[Figure: a directory entry for 16 nodes P0-P15. (a) No overflow: overflow bit = 0 and two limited pointers identify the sharers. (b) Overflow: overflow bit = 1 and the same field is reinterpreted as an 8-bit coarse vector, one bit per group of two nodes.]
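A C sketch of the Dir_i CV idea with the figure's parameters (i = 2 pointers, 16 nodes, one coarse bit per two nodes); the field widths are only illustrative:

#include <stdint.h>
#include <stdbool.h>

#define NODES        16
#define PTRS          2            /* i = 2 limited pointers              */
#define COARSE_GROUP  2            /* one coarse-vector bit per 2 nodes   */

typedef struct {
    bool    overflow;              /* 0: pointers valid, 1: coarse vector */
    uint8_t ptr[PTRS];             /* sharer node ids                     */
    uint8_t nptr;                  /* pointers in use                     */
    uint8_t coarse;                /* 8-bit coarse vector                 */
} dircv_t;

void dircv_add_sharer(dircv_t *d, int node)
{
    if (!d->overflow && d->nptr < PTRS) {
        d->ptr[d->nptr++] = (uint8_t)node;      /* still room for a pointer */
        return;
    }
    if (!d->overflow) {                         /* switch representation    */
        d->overflow = true;
        d->coarse = 0;
        for (int i = 0; i < d->nptr; i++)
            d->coarse |= 1u << (d->ptr[i] / COARSE_GROUP);
    }
    d->coarse |= 1u << (node / COARSE_GROUP);
}

/* On a write, every node covered by a set coarse bit must be invalidated. */
int dircv_invalidate_targets(const dircv_t *d, int targets[NODES])
{
    int n = 0;
    if (!d->overflow) {
        for (int i = 0; i < d->nptr; i++)
            targets[n++] = d->ptr[i];
    } else {
        for (int node = 0; node < NODES; node++)
            if (d->coarse & (1u << (node / COARSE_GROUP)))
                targets[n++] = node;
    }
    return n;
}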
Directory cache
[Figure: the two-node NUMA picture again, but each node's directory protocol engine now sits behind a directory cache (Dir$) that caches the in-memory directory state.]
cc-NUMA issues
Memory placement is key!
Gotta migrate data to where it is being used
Gotta have cache affinity
Long time between process switches in the OS
Reschedule a process on the CPU it last ran on
SGI Origin 2000's migration was always turned off
[Figure: cc-NUMA: nodes, each with a processor, cache, and local memory, connected by an interconnect.]
Three options for shared memory
[Figure: three ways to organize shared memory, each with processors, caches, and memory on an interconnect:
COMA -- cache-only memory architecture, where the node memory is an attraction memory (researched at SICS)
NUMA -- non-uniform memory access
UMA -- uniform memory access (a.k.a. SMP)]
Sun’s E6000 Server Family
Erik Hagersten, Uppsala University, Sweden
What Approach to Shared Memory
[Figure: four approaches: (a) shared cache -- processors share a first-level cache through a switch in front of interleaved main memory; (b) bus-based shared memory -- processors with private caches and the memory on one bus (UMA); (c) dancehall -- processors with caches on one side of an interconnection network and the memories on the other (UMA); (d) distributed memory -- each node has a processor, cache, and local memory on the network (NUMA).]
Looks like a NUMA but drives like a UMA
Logical view: [Figure: processors with caches on one side of an interconnection network, memories on the other (dancehall).]
Physical view: [Figure: each board carries both a processor with its cache and a memory bank, but addresses still go out over the interconnection network, so there is no shortcut to the local memory.]
•Memory bandwidth scales with the processor count
•One “interconnect load” per (2xCPU + 2xMem)
•Optimize for the dancehall case (no memory shortcut)
SUN Enterprise Overview
16 bus slots, each holding either a CPU/memory board or an I/O board
Up to 30 UltraSPARC processors (peak 9 GFLOPS); up to 30 GB of memory
Gigaplane(TM) bus (256 data, 41 address+ctrl signals, 83-100 MHz) with a peak bandwidth of 2.67 GB/s
[Figure: CPU/memory boards (two CPUs with L2 caches, memory and memory controller, bus interface/switch) and I/O boards plugged into the Gigaplane bus.]
Enterprise Server E6000
[Figure: the E6000 as 16 boards, each with two processors + caches and memory, plus I/O, all connected by the interconnect.]
An E6000 Proc Board
[Figure: an E6000 processor board: two processors with external caches and the board memory; an Address Controller holding processor tags and duplicate snoop tags, and a data switch. The board connects to 80 address/uid/arbitration signals and 288 data signals (256 data + ECC).]
An I/O Board
[Figure: an E6000 I/O board: the same Address Controller and data path (80 address/uid/arbitration signals, 288 data + ECC signals) in front of two I/O bridges with caches.]
Split-Transaction Bus
[Figure: a split transaction: arbitration, then the address/command phase, then -- after the memory access delay -- a separate data phase; further address/command phases can be issued while earlier data responses are still outstanding.]
Split each bus transaction into request and response sub-transactions
Separate arbitration for each phase
Other transactions may intervene
Improves bandwidth dramatically
The response is matched to its request
Buffering between the bus and the cache controllers
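How a response finds its request again is the key mechanism. A sketch of the bookkeeping, assuming each address-phase packet carries a small unique id (uid) that the data phase repeats (table size and field widths are made up for the example):

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OUTSTANDING 8

typedef struct {
    bool     valid;
    uint8_t  uid;          /* id issued in the address phase   */
    uint64_t addr;         /* line address the request was for */
} pending_t;

static pending_t pending[MAX_OUTSTANDING];

/* Address phase: remember the request under its uid. */
void addr_phase(uint8_t uid, uint64_t addr)
{
    for (size_t i = 0; i < MAX_OUTSTANDING; i++)
        if (!pending[i].valid) {
            pending[i] = (pending_t){ true, uid, addr };
            return;
        }
    /* A real controller would stall arbitration when the table is full. */
}

/* Data phase: match the response to the earlier request by uid. */
bool data_phase(uint8_t uid, uint64_t *addr_out)
{
    for (size_t i = 0; i < MAX_OUTSTANDING; i++)
        if (pending[i].valid && pending[i].uid == uid) {
            *addr_out = pending[i].addr;
            pending[i].valid = false;
            return true;      /* deliver data to the waiting cache/CPU */
        }
    return false;             /* unmatched response (error)            */
}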
Gigaplane Bus Timing
[Figure: Gigaplane bus timing over cycles 0-14: arbitration, address, state (share/~own), tag status, and data phases are pipelined. A Read A (uid1) is arbitrated and its address driven; a few fixed cycles later its snoop state, tag status, and possible cancel are resolved, and the two data beats D0/D1 follow, while a Read B (uid2) is already in flight on the same wires.]
Electrical Characteristics of the Bus
At most 16 electrical loads per signal
8 boards from each side of the centerplane (e.g., 15 CPU boards + 1 I/O board)
20.5-inch "centerplane"
Well-controlled impedance
~350-400 signals
Runs at 90/100 MHz
Address Controller
[Figure: the Address Controller: an output queue (OQ) toward the bus and an input queue (IQ) from the bus; the coherence protocol engine consults the duplicate processor tags and snoop tags and steers the data path (address, arbitration, aid/did signals).]
Dual State Tags
[Figure: the same Address Controller, highlighting that the duplicate tags hold two kinds of state per line: the access-right state (what the processor may do) and the obligation state (what the board owes the bus, e.g., supplying data).]
Timing of a single read transaction
Board 1 reading from memory on board 2
[Figure: the Gigaplane timing of that read: arbitration, address, snoop state (share/~own), tag status, and the data beats D0/D1, with the fixed-latency part of the pipeline and the shortest possible latency marked.]
Protocol tuned for timing
[Figure: the same timing annotated for the protocol: the 11 bus cycles (~110 ns) from address to data hide the address decode, the SRAM snoop-tag lookup, and the DRAM access, which all run overlapped with the bus pipeline.]
State change on the address packet: foreign and own transactions queue in the IQ
• Data "A" initially resides in CPU7's cache
• CPU1 issues a store request to "A":
• CPU1: Read-To-Write (RTW) request, ID=d (i.e., a "write request")
• CPU13: LD "A" -> Read-To-Share (RTS) request, ID=e
• CPU15: ST "A" -> RTW request, ID=f
The own request is stored as mRTW in CPU1's IQ ("m" = my own transaction, "f" = foreign)
The own transaction is retired from the IQ when its data arrives
Later requests for A queue up in CPU1's IQ behind the mRTW; CPU1's IQ will eventually hold <mRTW_IDd, fRTS_IDe, fRTW_IDf>
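A sketch of that IQ discipline in C (illustrative, not Sun's implementation): every snooped address packet is queued; an own transaction at the head waits for its data, and the foreign transactions behind it are then serviced strictly in bus order:

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char cmd[8];      /* "RTW", "RTS", ...                              */
    int  id;          /* bus-wide unique id (uid)                       */
    bool own;         /* true: issued by this board                     */
    bool data_ready;  /* set when the data phase with this id arrives   */
} iq_entry_t;

/* Service the head of the IQ; return false if it must keep waiting. */
bool iq_service_head(iq_entry_t *head)
{
    if (head->own && !head->data_ready)
        return false;                      /* wait for our own data     */
    if (head->own)
        printf("retire own %s id=%d, install line\n", head->cmd, head->id);
    else
        printf("apply foreign %s id=%d to snoop tags, supply data if owed\n",
               head->cmd, head->id);
    return true;                           /* pop and look at next entry */
}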
[Figure sequence: the Address Controller picture stepped through the example above (processor tags vs. snoop tags, IQ contents, and bus packets):
1. CPU1 executes Store A; both its processor tags and its snoop tags hold A in state I.
2. CPU1's address controller issues the RTW on the bus.
3. CPU1 snoops its own mRTW (ID=d): the snoop tags for A go to M and mRTW_IDd is placed in the IQ.
4. The foreign fRTS (ID=e) is snooped: the snoop tags go to O and fRTS_IDe queues behind mRTW_IDd.
5. The foreign fRTW (ID=f) is snooped: the snoop tags go to I and fRTW_IDf queues as well.
6. Data with ID=d arrives: the line is installed, the processor tags go to M, and the mRTW is retired from the IQ.
7. The queued fRTS is now serviced: the processor tags go to O and CPU1 supplies Data, ID=e on the bus.
8. The queued fRTW is serviced: the processor tags go to I and CPU1 supplies Data, ID=f; the IQ is empty again.]
A cascade of "write requests"
A initially resides in CPU7's cache. On the bus:
• CPU1: RTW, ID=a
• CPU2: RTW, ID=b
• …
• CPU5: RTW, ID=f
IQ1 = <mRTW_IDa, fRTW_IDb>
IQ2 = <mRTW_IDb, fRTW_IDc>
...
IQ5 = <mRTW_IDf>
…
IQ7 = <fRTW_IDa>
[Figure: the resulting CPU-tag and snoop-tag states on the boards involved: each board's IQ holds its own request followed by the next CPU's foreign request, so ownership of A will be handed along the chain as each board's data arrives.]
Implementing Sun’s SunFire 6800
Erik Hagersten, Uppsala University, Sweden
FirePlane, 24 CPUs
[Figure sequence: a SunFire 6800 CPU board with 1+ GHz UltraSPARC III processors (8 MB L2 caches, snoop tags on-chip) and 4+ GB of memory per CPU, connected through a data switch and an address repeater. An address packet <CMD, Addr, ID>, with ID = <CPU#, Uid>, travels up through the board's address repeater and is broadcast back down to all boards, so every board snoops it; the board holding the data ("Here it is!!") responds, and the data packet <Data, ID> is then routed through the data repeaters (DR) to the requester rather than broadcast.]
Sun’s WildFire System
Erik Hagersten, Uppsala University, Sweden
Sun’s WildFire System
Runs unmodified SMP apps in a more scalable way than E6000
Minor modifications to E6000 snooping required
CPUs generate local addresses OR global addresses
Global address --> no replication (NUMA)
Coherent Memory Replication (~ Simple COMA @ SICS)
Hardware support for detecting migration/replication pages
Directory cache + address translation cache backed by memory
Deterministic directory implementation (easy to verify)
WildFire:
One Solaris spanning four nodes
COMA: self-optimizing DSM
[Figure: ccNUMA -- each node has processors, caches, and ordinary memory behind a switch -- versus COMA, where the node memory itself acts as a huge cache holding whatever data the node uses.]
Self-optimizing architecture
Problems at high memory pressure
Complex hardware and coherence protocol
Adaptive S-COMA of Large SMPs
A page may have space allocated in many nodes
HW maintains memory coherence per cache line
Replication under SW control --> simple HW (S-COMA)
Adaptive replication algorithm in OS (R-NUMA)
Coherent Memory Replication (CMR)
Hierarchical affinity scheduler (HAS)
Few large nodes -> simple interconnect and coherence protocol
[Figure: two large SMP nodes, each with many processors, caches, and memory behind a switch, connected by an interconnect.]
A WildFire Node
16 slots with either CPU/memory boards, I/O boards, or the WildFire extension board
Up to 28 UltraSPARC processors per node
Gigaplane(TM) bus (256 data, 41 address+ctrl signals, 83-100 MHz); peak bandwidth 2.67 GB/s
Local access time of 330 ns (lmbench)
[Figure: an E6000-style node with CPU/memory and I/O boards on the Gigaplane bus, plus one WildFire interface card with its own bus interface.]
Sun WildFire Interface Board
[Figure: the WildFire interface board: an address controller with SRAM, data buffers, and three links to the other nodes.]
Sun WildFire Interface Board
[Photo: the board, with the data bus, address bus, and link connectors marked.]
WildFire as a vanilla "NUMA"
[Figure: two WildFire nodes on an interconnect. Each node's memory interface keeps a directory cache (Dir$, 8 bits/line) and memory tags (Mtag, 2 bits/line) alongside the memory, in front of the node's processors and caches.]
NUMA -- local memory access
[Figure: a local memory access checks the line's Mtag -- "access right OK?" -- and, if so, is satisfied from the local memory without any global traffic.]
NUMA -- remote memory access
[Figure: a remote memory access goes over the interconnect to the home node, whose directory cache (Dir$, 8 bits/line) answers "who has the data" and forwards the request accordingly; the Mtag (2 bits/line) sits next to the memory as before.]
SRAM overhead = 10/512 ≈ 2% (8 Dir$ bits + 2 Mtag bits per 512-bit = 64-byte cache line); lower bound 2/512 ≈ 0.4% with only the Mtag bits
Global Cache Coherence Protocol
[Figure: a global coherence transaction: the requesting node's access right (Mtag) changes, the home node modifies its directory entry, and a reply carrying the data is returned to the requester.]
NUMA -- local memory access
[Figure: the same local-access check as before, but this time the Mtag says the access right is not OK ("NO!!"), so the access has to become a global coherence transaction.]
Gigaplane Bus Timing
[Figure: ordinary Gigaplane timing again: Read A (uidA), Read B (uidB), and Read C (uidC) pipelined through arbitration, address, state (share/own), tag-status/cancel, and data phases.]
WildFire Bus Extensions
[Figure: the same timing, except that the WildFire board asserts Ignore in the state phase of Rd A: the transaction is squashed on all boards, and the WildFire board later re-issues Rd A (uidA) on the bus itself ("resent by WildFire" / "asserted by WildFire").]
An Ignore transaction squashes an ongoing transaction => it is not put in any IQ
WildFire eventually re-issues the same transaction
RTSF -- a new transaction that sends data both to the CPU and to memory
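A C sketch of the Ignore mechanism (illustrative only; the hooks into the rest of the protocol are stubs, not Sun's logic):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t cmd; uint64_t addr; uint16_t uid; } bus_tx_t;

/* stand-ins for the rest of the model */
static bool needs_global_coherence(const bus_tx_t *tx) { (void)tx; return true; }
static void enqueue_in_iq(const bus_tx_t *tx)          { (void)tx; }
static void remember_for_reissue(const bus_tx_t *tx)   { (void)tx; }

/* Run on the WildFire board during the snoop/state phase. */
bool wildfire_snoop(const bus_tx_t *tx)
{
    if (needs_global_coherence(tx)) {
        remember_for_reissue(tx);  /* the same transaction is re-issued later */
        return true;               /* assert Ignore on the bus                */
    }
    return false;
}

/* Run on every board for each snooped address packet. */
void board_snoop(const bus_tx_t *tx, bool ignore_asserted)
{
    if (ignore_asserted)
        return;                    /* squashed: nobody puts it in an IQ       */
    enqueue_in_iq(tx);
}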
WildFire Directory -- only 4 nodes!!
• k nodes (with one or more processors)
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
[Figure: processors with caches on an interconnection network, and the memory with its per-line directory entry of presence bits + dirty bit.]
• ReadRequest from main memory by processor i:
  if the dirty bit is OFF: read from main memory; turn p[i] ON
  if the dirty bit is ON: recall the line from the dirty processor (its cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i
….
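The ReadRequest handling above, written out as C -- a sketch of the protocol logic only, with the node-to-node messages reduced to stub functions:

#include <stdint.h>
#include <stdbool.h>

#define K_NODES 4                      /* WildFire: only 4 nodes */

typedef struct {
    bool p[K_NODES];                   /* presence bit per node  */
    bool dirty;
} dir_entry_t;

/* stand-ins for the real node-to-node messages */
static void recall_to_shared(int owner)      { (void)owner; }
static void update_memory_from(int owner)    { (void)owner; }
static void supply_data_from_memory(int to)  { (void)to;    }

void read_request(dir_entry_t *e, int i)
{
    if (!e->dirty) {                   /* clean: memory has the data     */
        e->p[i] = true;
        supply_data_from_memory(i);
        return;
    }
    for (int owner = 0; owner < K_NODES; owner++)
        if (e->p[owner]) {             /* dirty: exactly one owner       */
            recall_to_shared(owner);   /* owner's cache state -> shared  */
            update_memory_from(owner);
            break;
        }
    e->dirty = false;
    e->p[i] = true;
    supply_data_from_memory(i);        /* recalled data forwarded to i   */
}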
NUMA "detecting excess misses"
[Figure: detecting excess misses: a node misses remotely on data the directory believes it already has ("I thought you had the data!!") -- i.e., the line has been fetched by this node before and has since been replaced, so this is an excess miss.]
Detecting a page for replication
[Figure: a small set of associative counters, indexed by page address, counts excess misses per page; the data reply carries an excess-miss bit that triggers the counting. Pages that accumulate many excess misses become candidates for replication.]
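A sketch of that detection logic in C (the table size and threshold are assumptions, not WildFire's values): a small associative table of per-page counters is bumped every time a reply arrives with the excess-miss bit set, and crossing the threshold makes the page a replication candidate for the OS:

#include <stdint.h>

#define TABLE_SIZE 16              /* assumed number of counters    */
#define THRESHOLD  64              /* assumed replication threshold */

typedef struct { uint64_t page; uint32_t count; } counter_t;
static counter_t table[TABLE_SIZE];

/* stand-in for the interrupt/upcall that tells the OS about the page */
static void notify_os_replicate(uint64_t page) { (void)page; }

/* Called whenever a data reply comes back with the excess-miss bit set. */
void excess_miss(uint64_t page)
{
    int victim = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (table[i].page == page) {               /* page already tracked   */
            if (++table[i].count == THRESHOLD)
                notify_os_replicate(page);         /* CMR candidate          */
            return;
        }
        if (table[i].count < table[victim].count)
            victim = i;                            /* remember coldest entry */
    }
    table[victim].page  = page;                    /* start tracking this page */
    table[victim].count = 1;
}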
OS initializes a CMR page
[Figure: the node's interface contains an address-translation table (AT) with an AT cache (AT$) that maps the global (home) address of a replicated page to a locally allocated shadow page (a VA->PA-style translation). When the OS initializes a CMR page it creates this new mapping in the node and sets the access right (Mtag) of every cache line in the page to invalid.]
An access to a CMR page
[Figure: an access to a CMR page: the address is translated through the AT/AT$ to the local shadow page and the line's Mtag is checked -- "access right OK?"]
An access to a CMR page (miss)
[Figure: the Mtag check fails ("NO!!"), so the access is turned into a global coherence transaction to the line's home node.]
Address Translation (AT) overhead = 8B/8kB = 0.1%
No extra latency added
An access to a CMR page (miss)
[Figure: when the global transaction completes and the line is installed in the local shadow page, its Mtag is changed to "shared".]
An access to a CMR page (hit)
[Figure: later accesses to the line hit locally: the AT$ translation and the Mtag check ("access right OK?" -- "YES!") both succeed, so the replicated page is used at local-memory speed.]
Deterministic Directory
MOSI protocol, fully mapped directory (one presence bit per node)
Directory blocking: one outstanding transaction per cache line (sketched after the figure below)
The directory blocks new requests until a completion message is received
The directory state and the cache states are always in agreement (except for silent replacement ...)
[Figure: message flows between the local (requesting) node L, the home H, and remote node(s) R:
(a) write to a remote-shared line -- request to H, invalidations to the sharers, acks, data reply, completion;
(b) read of a remote-dirty line -- request to H, demand to the owner, data reply to L, completion back to H;
(c) writeback -- request to H, ack/nack, completion with data.
In each case the home keeps the line blocked until the completion arrives.]
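A C sketch of the blocking discipline for case (b) (illustrative only; messaging is reduced to stubs, and presence-bit updates are omitted):

#include <stdbool.h>
#include <stdint.h>

typedef enum { DIR_IDLE, DIR_BUSY } dir_busy_t;

typedef struct {
    dir_busy_t busy;        /* blocking state for this line */
    bool       dirty;
    uint8_t    owner;       /* valid when dirty             */
} dir_line_t;

static void send_demand(uint8_t owner)  { (void)owner; }  /* stand-in */
static void send_nack(uint8_t node)     { (void)node;  }  /* stand-in */

/* Read request for a remote-dirty line (case (b) in the figure). */
void dir_read_req(dir_line_t *l, uint8_t requester)
{
    if (l->busy == DIR_BUSY) {      /* another transaction in flight         */
        send_nack(requester);       /* requester retries later               */
        return;
    }
    l->busy = DIR_BUSY;             /* block the line                        */
    if (l->dirty)
        send_demand(l->owner);      /* owner replies with data to requester  */
}

/* Completion from the requester: only now may the line accept new requests. */
void dir_completion(dir_line_t *l, uint8_t new_sharer)
{
    (void)new_sharer;               /* presence bits would be updated here   */
    l->dirty = false;               /* after (b) owner and requester share   */
    l->busy  = DIR_IDLE;
}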
Replication Issues Revisited
[Figure: amount of replication versus total "physical" memory consumed (mem_tot): ccNUMA replicates nothing, COMA replicates everything it touches, and WildFire sits in between, with mem_tot controlled by the OS.]
Only "promising" pages are replicated
The OS dynamically limits the amount of replication
The Solaris CMR changes are confined to the hat layer (i.e., a port)
Advantages of Multiprocessor Nodes
Pros:
amortization of fixed node costs over multiple processors
can use commodity SMPs
fewer nodes to keep track of in the directory
much communication may stay within node (NUCA)
can share “node caches” (WildFire: Coherent Memory Replication)
Cons:
bandwidth shared among processors and interface
bus may increase latency to local memory
snoopy bus at remote node increases delays there too
Memory cost of replication
Example: replicate 10% of the data in all nodes
50 nodes, each with 2 CPUs ==> 490% memory overhead
4 nodes, each with 25 CPUs ==> 30% memory overhead
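Those numbers follow from a one-line calculation (the 10% replicated fraction is the slide's example figure): if a fraction r of the data is replicated in all k nodes, the k-1 extra copies add

\[
\text{overhead} = (k-1)\,r
\qquad\Rightarrow\qquad
49 \times 0.10 = 490\% \quad\text{and}\quad 3 \times 0.10 = 30\%.
\]

Many small nodes therefore make aggressive replication prohibitively expensive; a few large nodes keep it cheap.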
Does migration/replication help?
NAS Parallel Benchmark study, execution time in seconds [M. Bull, EPCC 2002].
Columns: without/with page migration, under no initial placement vs. initial placement; rows: without/with replication.

                       No Initial Placement    Initial Placement
Benchmark   Repl.      No migr     Migr        No migr     Migr
Shallow     No Repl       26         5.9          6.1        6.1
Shallow     Repl            7.2      6.2          6.1        6.1
BT          No Repl      960       610          620        600
BT          Repl         590       580          580        580
FT          No Repl      520       330          380        260
FT          Repl         250       260          190        200
SP          No Repl     1540       780          760        780
SP          Repl         670       680          670        670
MG          No Repl      230       230          240        230
MG          Repl         220       220          220        220
CG          No Repl     1060       700          940        700
CG          Repl         300       280          300        290

(The slide marks three configurations: unoptimized, HW-optimized (migration/replication), and SW-optimized (initial placement).)
WildFire’s Technology Limits
[Figure: the WildFire node picture annotated with its technology limits:
SRAM size = DRAM size / 256, and the SRAM must keep up with the snoop frequency
Dir$ reach must be >> the sum of the cache sizes
Slow interconnect
Hard to make buses faster]
Sun’s SunFire 15k/25k
Erik Hagersten, Uppsala University, Sweden
StarCat
Sun Fire 15k/25k
(used in Lab 2)
Front Side
Back Side
StarCat, 72 CPUs
[Figure: StarCat: up to 18 slots in an active backplane with 18x18 address and data crossbars. Each slot holds a CPU board (CPUs, caches, memory, data repeater, address repeater) paired with an expander board that runs the global coherence protocol and holds a Dir$. Point-to-point directory coherence is used across the backplane; repeater-based snoop coherence is used within a board.]
StarCat Coherence Mechanism
[Figure: the StarCat coherence mechanism: an address is snooped locally while the remote request is sent toward the home in parallel. Data moves as 576-bit packets (data + MTAG + ECC); the MTAG carried with the data is checked ("MTAG check!") to decide whether the speculatively used data was actually valid.]
StarCat, 72 CPUs
[Figure: the same StarCat organization as above.]
Allocate a Dir$ entry only for write requests; speculate on clean data on a Dir$ miss
WildCat coherence without CMR and with a faster interconnect
Directory cache, but no directory (broadcast on Dir$ miss)
[Figure: two nodes as in the earlier directory pictures, but each node has only a directory cache (Dir$) and no backing directory state in memory; on a Dir$ miss the request is broadcast to all nodes.]