AVDARK 2010

Scalable Shared-Memory Implementations

Erik Hagersten Uppsala University

Sweden


Cache-to-cache in snoop-based

Figure: three threads, each with its own cache ($), on a snooping bus. One thread reads and then writes A, so its cache holds the only valid copy; when another thread later issues Read A, its BusRTS appears on the bus ("My RTS -> wait for data") and the owning cache must answer ("Gotta answer"), supplying the data cache-to-cache.

"Upgrade" in dir-based

Figure: three threads with caches. A is shared by several caches when one of them writes A; the write goes to A's home directory, which knows "who has a copy" and sends INV messages to every sharer, and each sharer answers with an ACK before the write can complete.

Cache-to-cache in dir-based

Figure: A is dirty in one cache when another thread issues Read A. The request goes to A's home directory, which knows "who has a copy" and forwards the read request (ReadDemand) to the owning cache; the owner supplies the data and an Ack is returned to the directory.

Directory-based coherence:

per-cacheline info in the memory

Figure: several threads, each with a cache holding per-line state, in front of a single memory. The memory keeps a directory state entry per cacheline, and a directory protocol connects the cache accesses to that directory state.

Directory-based snooping: NUMA.

Per-cacheline info in the home node

Figure: two NUMA nodes, each with threads, caches, a slice of the memory, its own directory state and a directory protocol engine, connected by an interconnect. The directory information for a cacheline lives in that line's home node.

Why directory-based

• P2P messages --> high bandwidth
• Suits out-of-the-box coherence
• Much more scalable!
• Note:
  - Dir-based can be used to build a uniform-memory architecture (UMA)
  - Bandwidth will be great!!
  - Memory latency will be OK
  - Cache-to-cache latency will not!
  - Memory overhead can be high (storing directory...)

Cache-to-cache in snoop-based

Figure: the snoop-based cache-to-cache transfer again: the requester's BusRTS makes it wait for data, while the cache owning the dirty copy of A has to answer with the data.

Cache-to-cache in dir-based

Figure: the directory-based cache-to-cache transfer again: the home directory forwards the read request (ReadDemand) to the owning cache, which supplies the data, and an Ack goes back to the directory.

Fully mapped directory

• k nodes
• Each node is the "home" for 1/k of the memory
• Dir entry per cacheline in home memory: k presence bits + 1 dirty bit
• Requests are first sent to the home node's CA

Figure: home memory with a directory entry (presence bits + dirty bit) stored beside every cacheline.
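A minimal sketch of what such a per-cacheline entry and the home node's handling of a write request could look like. This is my illustration, not the slides' implementation; dir_entry, handle_write and the callback names are made up, and k <= 64 is assumed so the presence bits fit in one word.

```c
#include <stdint.h>
#include <stdbool.h>

#define K_NODES 64                        /* k nodes: one presence bit each */

/* One directory entry per cacheline in its home memory. */
typedef struct {
    uint64_t presence;                    /* bit i set => node i has a copy */
    bool     dirty;                       /* set => a single node owns a dirty copy */
} dir_entry;

/* Home node handling a write request from node i (cf. the "Upgrade" slide):
   every other sharer named by the presence bits is invalidated and must ACK
   before node i is recorded as the single, dirty owner. */
void handle_write(dir_entry *e, int i,
                  void (*send_inv)(int node), void (*wait_for_acks)(int count))
{
    int invs = 0;
    for (int n = 0; n < K_NODES; n++) {
        if (n != i && (e->presence & (1ull << n))) {
            send_inv(n);                  /* point-to-point invalidation */
            invs++;
        }
    }
    wait_for_acks(invs);                  /* the write completes only after all ACKs */
    e->presence = 1ull << i;              /* node i now holds the only copy */
    e->dirty    = true;
}
```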

Reducing the Memory Overhead: SCI

--- Scalable Coherence Interface (SCI)

• home only holds pointer to rest of the directory info [log(N) bits]

• distributed linked list of copies, weaves through caches

• cache tag has pointer, points to next cache with a copy

• on read, add yourself to head of the list (comm. needed)

• on write, propagate chain of invalidations down the list

• on replacement: remove yourself from the list

Figure: main memory (home) holds only a log N-bit pointer per line, next to the data; the caches in Node 0, Node 1 and Node 2 that share the line are chained together, each cache tag pointing to the next cache with a copy.
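A rough sketch of the SCI idea as the slide describes it (a singly linked sharing list; real SCI keeps both forward and backward pointers). The home holds only the head pointer, each cache tag holds a next pointer, reads insert at the head, and writes walk the chain invalidating. All names are illustrative.

```c
#define N_NODES 64
#define NIL     (-1)

typedef struct { int head; } sci_home;    /* home keeps only a log2(N)-bit pointer */
typedef struct { int next; } sci_tag;     /* cache tag points to next cache with a copy */

static sci_tag tag[N_NODES];              /* one tag per node, for one cache line */

/* Read by node i: node i adds itself to the head of the sharing list
   (communication with the home is needed to swap the head pointer). */
void sci_read(sci_home *h, int i)
{
    tag[i].next = h->head;                /* old head becomes my successor */
    h->head     = i;                      /* home now points at me */
}

/* Write by node i: walk the list and invalidate every other copy. */
void sci_write(sci_home *h, int i, void (*invalidate)(int node))
{
    for (int n = h->head; n != NIL; n = tag[n].next)
        if (n != i)
            invalidate(n);                /* chain of invalidations down the list */
    h->head     = i;                      /* the writer is now the only copy */
    tag[i].next = NIL;
}
```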

Cache Invalidation Patterns

Figure: Barnes-Hut invalidation patterns, a histogram of the percentage of shared writes against the number of invalidations each write causes (bins 0, 1, 2, ..., 7, then 8-11 up to 60-63). The low bins dominate: roughly 48% of the writes cause one invalidation, 23% two and 11% three.

Figure: Radiosity invalidation patterns, same axes; again almost all writes invalidate only a few copies (about 58% cause a single invalidation, 12% two).

Overflow Schemes for Limited Pointers

• Broadcast (Dir i B)
  - broadcast bit turned on upon overflow
  - bad for widely-shared invalidated data
• No-broadcast (Dir i NB)
  - on overflow, new sharer replaces one of the old ones (invalidated)
  - bad for widely read data
• Coarse vector (Dir i CV)
  - change representation to a coarse vector, 1 bit per k nodes
  - on a write, invalidate all nodes that a bit corresponds to

Dir in memory:

Figure (16 nodes, P0-P15): with no overflow the overflow bit is 0 and the directory entry holds 2 pointers naming the sharing nodes; on overflow the overflow bit is set to 1 and the same bits are reinterpreted as an 8-bit coarse vector, one bit per pair of nodes.
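A sketch of a Dir-i-CV style entry for 16 nodes with i = 2 pointers and an 8-bit coarse vector (one bit per pair of nodes). The struct layout and function names are mine, chosen only to illustrate the switch to the coarse representation on overflow.

```c
#include <stdint.h>
#include <stdbool.h>

#define N_NODES   16
#define N_PTRS    2                       /* i pointers in a Dir-i scheme */
#define CV_BITS   8                       /* coarse vector: 1 bit per N_NODES/CV_BITS nodes */

typedef struct {
    bool    overflow;                     /* 0: pointer form, 1: coarse-vector form */
    union {
        uint8_t ptr[N_PTRS];              /* node numbers of the sharers */
        uint8_t cv;                       /* 8-bit coarse vector, 1 bit per 2 nodes */
    } u;
    uint8_t nptr;                         /* pointers in use (pointer form only) */
} limptr_entry;

/* Record a new sharer (assumed not already recorded), switching to the
   coarse vector when the pointers overflow. */
void add_sharer(limptr_entry *e, int node)
{
    if (!e->overflow && e->nptr < N_PTRS) {
        e->u.ptr[e->nptr++] = (uint8_t)node;
        return;
    }
    if (!e->overflow) {                   /* overflow: re-encode the old pointers coarsely */
        uint8_t cv = 0;
        for (int i = 0; i < e->nptr; i++)
            cv |= 1u << (e->u.ptr[i] / (N_NODES / CV_BITS));
        e->u.cv = cv;
        e->overflow = true;
    }
    e->u.cv |= 1u << (node / (N_NODES / CV_BITS));
}

/* On a write, invalidate every node a set coarse-vector bit corresponds to. */
void invalidate_sharers(const limptr_entry *e, void (*inv)(int node))
{
    if (!e->overflow) {
        for (int i = 0; i < e->nptr; i++) inv(e->u.ptr[i]);
    } else {
        for (int n = 0; n < N_NODES; n++)
            if (e->u.cv & (1u << (n / (N_NODES / CV_BITS)))) inv(n);
    }
}
```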

Directory cache

Figure: the two-node NUMA picture again, but each node's directory protocol now sits behind a directory cache (Dir$); recently used directory entries are served from the Dir$ instead of the in-memory directory state.

cc-NUMA issues

• Memory placement is key!
• Gotta' migrate data to where it's being used
• Gotta' have cache affinity
  - Long time between process switches in the OS
  - Reschedule a process on the CPU it last ran on
• SGI Origin 2000's migration always turned off

Figure: processor/cache pairs, each with local memory (M), on an interconnect.

Three options for shared memory

Figure: three organizations, each with processor/cache pairs on an interconnect:
• COMA, cache-only (@SICS): each node's memory is an attraction memory (AM)
• NUMA, non-uniform: each node has its own slice of main memory (M)
• UMA, uniform (a.k.a. SMP): the memories sit symmetrically on the far side of the interconnect

Sun’s E6000 Server Family

Erik Hagersten Uppsala University

Sweden


What Approach to Shared Memory

Figure (four classic organizations):
(a) Shared cache: processors P1..Pn share a first-level cache through a switch, in front of interleaved main memory (UMA)
(b) Bus-based shared memory: per-processor caches on a bus to main memory and I/O devices (UMA)
(c) Dancehall: per-processor caches connected through an interconnection network to memories on the other side (UMA)
(d) Distributed memory: a memory attached to each processor/cache node, all reached through the interconnection network (NUMA)

Looks like a NUMA but drives like a UMA

Figure: logically the machine is a dancehall, with processors and caches on one side of the interconnection network and the memories on the other; physically, each interconnect load carries two CPUs and two memory banks side by side.

• Memory bandwidth scales with the processor count
• One "interconnect load" per (2xCPU + 2xMem)
• Optimize for the dancehall case (no memory shortcut)

SUN Enterprise Overview

• 16 slots with either CPUs or I/O
• Up to 30 UltraSPARC processors (peak 9 GFLOPS)
• Gigaplane™ bus has peak bw 2.67 GB/s; up to 30 GB memory
• 16 bus slots, for processing or I/O boards

Figure: the Gigaplane™ bus (256 data, 41 address+ctrl, 83-100 MHz) connecting CPU/Mem cards (two CPUs with external caches and a memory controller behind a bus interface/switch) and I/O cards.

Enterprise Server E6000

Figure: 16 boards on the interconnect; each board carries two processors with caches, memory, and an interface to the I/O interconnect.

An E6000 Proc Board

Figure: an E6000 processor board: two CPUs with external caches, memory, a data switch, and an Address Controller holding duplicate tags (proc tags and snoop tags). The board meets the centerplane through 80 address-side signals (addr, uid, arb, ...) and 288 data-side signals (256 data + ECC).

An I/O Board

Figure: an I/O board: two I/O interfaces, each with a cache, behind an Address Controller and data path; the same 80 address-side and 288 data-side (256 data + ECC) signals connect it to the centerplane.

Split-Transaction Bus

Figure: a bus transaction split in two: the address/CMD phase is followed, after the memory-access delay, by the data phase; each phase arbitrates for the bus separately, so other address/CMD phases can slip in between.

• Split bus transaction into request and response sub-transactions
  - Separate arbitration for each phase
• Other transactions may intervene
  - Improves bandwidth dramatically
  - Response is matched to request
  - Buffering between bus and cache controllers

Figure: over time the address signals carry a stream of address packets (A) while the data signals independently carry the matching data packets (D).
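A sketch of the request/response matching a split-transaction bus needs: each board tags its request with a small id and remembers it, so the data phase, which may arrive many cycles later and interleaved with other traffic, can be paired with its request. The table size, names and return conventions are assumptions, not the Gigaplane design.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_OUTSTANDING 8                 /* outstanding split transactions per board */

typedef struct {
    bool     valid;
    uint64_t addr;                        /* request this slot is waiting on */
} pending_req;

static pending_req pending[MAX_OUTSTANDING];

/* Request phase: arbitrate for the address bus, then remember the transaction
   under a small id so the later data phase can be matched to it. */
int issue_request(uint64_t addr)
{
    for (int id = 0; id < MAX_OUTSTANDING; id++) {
        if (!pending[id].valid) {
            pending[id].valid = true;
            pending[id].addr  = addr;
            /* drive addr + id on the address/command bus here */
            return id;
        }
    }
    return -1;                            /* all ids busy: stall the requester */
}

/* Response phase: the data packet arrives later, tagged with the same id;
   other transactions may have used the bus in between. */
void data_arrived(int id, const void *data)
{
    (void)data;                           /* hand the data to the cache controller */
    pending[id].valid = false;            /* the id can now be reused */
}
```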

Gigaplane Bus Timing

Figure: Gigaplane timing for overlapped reads (Rd A, Rd B). Every bus cycle offers an address slot (A) and a data slot (D); the arbitration, address, state, tag-status and data phases of one transaction are pipelined with those of the next, and snoop results (Share, Own, Cancel) plus unique ids (uid1, uid2) tie a transaction's phases together.

Electrical Characteristics of the Bus

• At most 16 electrical loads per signal
• 8 boards from each side (e.g. 15 CPU + 1 I/O)
• 20.5 inches "centerplane"
• Well controlled impedance
• ~350-400 signals
• Runs at 90/100 MHz


Address Controller

Figure: the Address Controller sits between the board's two processor/cache pairs and the bus. It holds an output queue (OQ) for the board's own requests, an input queue (IQ) for address packets snooped off the bus, duplicate tags (proc tags and snoop tags), the coherence protocol engine, and the addr/arb/aid/did bus connections; the data path is controlled separately.

Dual State Tags

Figure: the same Address Controller, highlighting the dual state tags: the processor-side tags record the access-right state the processor may act on, while the snoop-side tags record the obligation state established by the bus order (for example, the duty to supply data as owner).

Timing of a single read transaction

Board 1 reading from memory on board 2

Figure: the Gigaplane timing diagram for this read, with a legend marking which phases have fixed latency and where the shortest possible latency path lies.

Protocol tuned for timing

Figure: the same read timing annotated with what each stage is doing, i.e. address decode, the SRAM (snoop-tag) lookup and the DRAM access, showing how the protocol is tuned so the fixed pipeline delivers data after about 11 cycles ≈ 110 ns.

State change on address packet: foreign and own transactions queue in the IQ

• Data "A" initially resides in CPU7's cache
• CPU1: issues a store request to "A"
• CPU1: Read-To-Write req, ID=d (i.e., a "write request")
• CPU13: LD "A" -> Read-To-Shared req, ID=e
• CPU15: ST "A" -> RTW req, ID=f

• The own RTW (mRTW) is stored in the IQ of CPU1 (m = my/own, f = foreign)
• The own read IQ transaction is retired when its data arrives
• Later requests for A queue up in CPU1's IQ behind the mRTW; the IQ of CPU1 will eventually hold: <mRTW ID d, fRTS ID e, fRTW ID f>
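A sketch of the dual-tag idea as I read these slides, not Sun's actual logic: the snoop tag takes on its new obligation state immediately, in bus order, when the address packet is snooped and queued, while the processor-side access right changes only when the queued entry is retired together with its data. The enum encoding and function names are assumptions.

```c
#include <stdbool.h>

typedef enum { TAG_I, TAG_S, TAG_O, TAG_M } mosi_t;   /* MOSI states */

typedef struct {
    mosi_t snoop_tag;                     /* obligation state, updated in bus order */
    mosi_t proc_tag;                      /* access right seen by the processor */
} dual_tags;

typedef struct { bool own; bool is_write; int id; } addr_pkt;   /* one IQ entry */

/* Snoop an address packet: the obligation (snoop tag) changes at once and the
   packet is queued in the IQ for later processing. */
void on_address_packet(dual_tags *t, addr_pkt p, void (*enqueue)(addr_pkt))
{
    if (p.own && p.is_write)              /* my own RTW wins ownership in bus order */
        t->snoop_tag = TAG_M;
    else if (!p.own && p.is_write)        /* foreign RTW: I am obliged to give the line up */
        t->snoop_tag = TAG_I;
    else if (!p.own && t->snoop_tag == TAG_M)
        t->snoop_tag = TAG_O;             /* foreign RTS: I become owner of a shared line */
    enqueue(p);                           /* the IQ keeps packets in bus order */
}

/* Retire the IQ head once its data is available: only now does the
   processor's access right (proc tag) change. */
void retire_iq_head(dual_tags *t, addr_pkt p)
{
    if (p.own && p.is_write)        t->proc_tag = TAG_M;   /* data arrived: the store may complete */
    else if (!p.own && p.is_write)  t->proc_tag = TAG_I;   /* data supplied to the foreign writer */
    else if (!p.own && t->proc_tag == TAG_M)
        t->proc_tag = TAG_O;                               /* data supplied, still owner */
}
```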

Figure sequence (the scenario above, snapshot by snapshot; each snapshot shows the Address Controller with its OQ, IQ, proc tags and snoop tags):
1. CPU1 issues Store A; proc tags = I, snoop tags = I.
2. The Address Controller turns the store into an RTW and places it in the OQ.
3. The own RTW (mRTW, ID d) wins the bus: snoop tags => M and the mRTW is queued in the IQ.
4. CPU13's foreign RTS (fRTS, ID e) is snooped: snoop tags => O and the fRTS is queued behind the mRTW.
5. CPU15's foreign RTW (fRTW, ID f) is snooped: snoop tags => I; the IQ now holds <mRTW ID d, fRTS ID e, fRTW ID f>.
6. Data tagged ID d arrives (from CPU7's cache).
7. The mRTW at the head of the IQ is retired: proc tags => M and CPU1's store can complete.
8. The fRTS is retired: the data is supplied with ID e and proc tags => O.
9. The fRTW is retired: the data is supplied with ID f and proc tags => I.

A cascade of "write requests"

A initially resides in CPU7's cache. On the bus:
• CPU1: RTW, ID=a
• CPU2: RTW, ID=b
• ...
• CPU5: RTW, ID=f

IQ1 = <mRTW ID a, fRTW ID b>
IQ2 = <mRTW ID b, fRTW ID c>
...
IQ5 = <mRTW ID f>
IQ7 = <fRTW ID a>

Figure: the resulting CPU-tag and snoop-tag states of the boards (only CPU7's board starts out with a valid copy of A).

Implementing Sun’s SunFire 6800

Erik Hagersten Uppsala University

Sweden

FirePlane, 24 CPUs

Figure: SunFire 6800 CPU boards, each with CPUs (1+ GHz UltraSPARC III, L2 cache = 8 MB, snoop tags on-chip), Mem = 4+ GB/CPU and data switches, connected through address repeaters (Addr. Rep.).

Figure sequence: an address transaction (CMD, Addr, ID with ID = <CPU#, Uid>) is broadcast from the requesting CPU board up through its address repeater and down to every board. All boards snoop it; the board that has the data ("Here it is!!") responds, and the data packet (Data, ID with the same <CPU#, Uid>) is routed back to the requester through the data repeaters (DR).

Sun’s WildFire System

Erik Hagersten Uppsala University

Sweden


Sun’s WildFire System

• Runs unmodified SMP apps in a more scalable way than E6000
• Minor modifications to E6000 snooping required
• CPUs generate local address OR global address
• Global address --> no replication (NUMA)
• Coherent Memory Replication (~Simple COMA @ SICS)
• Hardware support for detecting migration/replication pages
• Directory cache + address translation cache backed by memory
• Deterministic directory implementation (easy to verify)


WildFire:

One Solaris spanning four nodes


COMA: self-optimizing DSM

Figure: a ccNUMA node (processors and caches behind switches, with ordinary home memory) next to a COMA node, where the memory has been turned into a large cache (attraction memory).

COMA:
• Self-optimizing architecture
• Problem at high memory pressure
• Complex hardware and coherence protocol

Adaptive S-COMA of Large SMPs

• A page may have space allocated in many nodes
• HW maintains memory coherence per cache line
• Replication under SW control --> simple HW (S-COMA)
• Adaptive replication algorithm in OS (R-NUMA)
• Coherent Memory Replication (CMR)
• Hierarchical affinity scheduler (HAS)
• Few large nodes --> simple interconnect and coherence protocol

Figure: WildFire nodes are large SMPs (processors, caches and memory behind switches, up to 28 CPUs each) joined by a small interconnect: 28x scaling inside a node combined with only 4x across nodes.

A WildFire Node

• 16 slots with either CPUs, I/O or a WildFire board
• Up to 28 UltraSPARC processors
• Gigaplane™ bus has peak bw 2.67 GB/s
• Local access time of 330 ns (lmbench)

Figure: an E6000-style node, the Gigaplane™ bus (256 data, 41 address+ctrl, 83-100 MHz) with CPU/Mem cards and I/O cards, plus the WildFire extension card attached through its own bus interface.

Sun WildFire Interface Board

Figure: the WildFire interface board: an address controller, SRAM, data buffers, and three links to the other nodes (plus some spare area, "This space for rent").

Sun WildFire Interface Board

Figure: photo of the board, with the data-bus, address-bus and link connectors marked.

WildFire as a vanilla "NUMA"

Figure: two nodes on the interconnect; each node's interface (I/F) keeps a directory cache (Dir$, 8 b/line) and Mtags (2 b/line) next to the node's memory, in front of its processors and caches.

NUMA -- local memory access

Figure: a processor accesses memory whose home is in its own node; in parallel with the data read, the line's Mtag is checked ("Access right OK?").
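A small sketch of the Mtag check, illustrative only: the 2 bits stored with each local memory line encode what the local copy may be used for, so a local access can be satisfied or handed to the global protocol without consulting the directory. The encoding and names are assumptions.

```c
#include <stdbool.h>

typedef enum { MTAG_INVALID, MTAG_SHARED, MTAG_MODIFIED } mtag_t;  /* 2 bits/line in memory */

/* Local memory access: data and Mtag are read together; only if the Mtag
   grants a sufficient access right can the local copy be used directly. */
bool local_access_ok(mtag_t mtag, bool is_write)
{
    if (is_write)
        return mtag == MTAG_MODIFIED;     /* writes need exclusive rights */
    return mtag != MTAG_INVALID;          /* reads are fine on shared or modified */
}

/* Otherwise the access becomes a request in the global (directory)
   coherence protocol, as on the following slides. */
```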

NUMA -- remote memory access

Figure: an access to memory whose home is in the other node is sent over the interconnect to the home node's interface, which consults its Dir$ to find who has the data.

SRAM overhead = 10/512 = 2% (8 Dir$ bits + 2 Mtag bits per 64-byte = 512-bit cache line); the 2 Mtag bits alone give the lower bound 2/512 = 0.4%.

Global Cache Coherence Protocol

Figure: the home node's interface modifies the directory entry, the access right of the remote copy is changed, and a Reply(Data) is sent back to the requester.

NUMA -- local memory access

Figure: the same local access, but this time the Mtag check fails ("Access right OK? NO!!"): the node's copy does not carry a sufficient access right, so the access must be handled by the global coherence protocol instead.

Gigaplane Bus Timing

Figure: the Gigaplane pipeline with several outstanding reads (Rd A uidA, Rd B uidB, Rd C uidC) sharing the address and data slots.

Figure: the same timing with WildFire present: for an address it must handle remotely, the WildFire board asserts Ignore in the snoop phase, squashing the transaction; once the remote data is available, the same transaction (Rd A uidA) is resent by WildFire.

WildFire Bus Extensions

• Ignore transaction squashes an ongoing transaction => not put in IQ
• WildFire eventually reissues the same transaction
• RTSF -- a new transaction sends data to CPU and memory


WildFire Directory -- only 4 nodes!!

• k nodes (with one or more procs)
• With each cache-block in memory: k presence bits, 1 dirty bit
• With each cache-block in cache: 1 valid bit and 1 dirty (owner) bit

Figure: processors with caches on an interconnection network; the memory keeps presence bits and a dirty bit with every cache block.

• Read request from main memory by processor i:
  - If dirty bit OFF then { read from main memory; turn p[i] ON }
  - If dirty bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty bit OFF; turn p[i] ON; supply recalled data to i }
• ...
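The same pseudocode written out as a C sketch; the struct and helper names are mine, and find_dirty and recall_from stand in for the real directory lookup and recall messages.

```c
#include <stdbool.h>

#define K_NODES 4                         /* WildFire: only 4 nodes */

typedef struct {
    bool p[K_NODES];                      /* presence bits */
    bool dirty;                           /* dirty bit */
} wf_dir_entry;

/* Read request to main memory by processor (node) i, following the slide. */
void read_request(wf_dir_entry *e, int i,
                  int (*find_dirty)(const wf_dir_entry *),
                  void (*recall_from)(int node))
{
    if (!e->dirty) {
        e->p[i] = true;                   /* dirty bit OFF: read from main memory, turn p[i] ON */
    } else {
        /* dirty bit ON: recall the line from the dirty proc (its cache state goes
           to shared), update memory, clear the dirty bit, turn p[i] ON, and
           supply the recalled data to i. */
        recall_from(find_dirty(e));
        e->dirty = false;
        e->p[i]  = true;
    }
}
```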


NUMA "detecting excess misses"

Figure: a remote request arrives at the home node for a line the requesting node has fetched before ("I thought you had the data!!"): the node keeps re-fetching the same remote data, an excess miss, which is the hint that the page may be worth replicating locally.

Detecting a page for replication

Figure: replies carrying the E-miss bit are matched (= addr) against a small set of associative per-page counters in the node interface; pages that keep accumulating excess misses become candidates for replication.
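A sketch of how such associative excess-miss counters might behave; the number of counters, the threshold and the notify_os hook are assumptions chosen for illustration, not WildFire's actual parameters.

```c
#include <stdint.h>

#define N_COUNTERS 16                     /* small associative array of per-page counters */
#define THRESHOLD  64                     /* excess misses before the OS is told (assumed) */

typedef struct { uint64_t page; uint32_t count; } emiss_counter;
static emiss_counter ctr[N_COUNTERS];

/* Called when a reply comes back with the excess-miss bit set. */
void excess_miss(uint64_t addr, void (*notify_os)(uint64_t page))
{
    uint64_t page = addr >> 13;           /* 8 kB pages */
    int victim = 0;

    for (int i = 0; i < N_COUNTERS; i++) {
        if (ctr[i].count && ctr[i].page == page) {          /* associative match */
            if (++ctr[i].count >= THRESHOLD) {
                notify_os(page);          /* OS may replicate/migrate the page (CMR) */
                ctr[i].count = 0;
            }
            return;
        }
        if (ctr[i].count < ctr[victim].count)
            victim = i;                   /* remember the least-used entry */
    }
    ctr[victim].page  = page;             /* no match: take over the least-used counter */
    ctr[victim].count = 1;
}
```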

OS Initializes a CMR page

Figure: the OS sets up a local page for replication: a new V->P mapping in this node is entered in the address translation table (AT, cached in the interface's AT$, mapping the local page to its global address), and the access right of every cache line on the page is initialized to INV.

An access to a CMR page

Figure: a processor access to a CMR page is directed at the local replica; the AT$ translates the address and the per-line access right is checked ("Access right OK?").

An access to a CMR page (miss)

Figure: the access-right check on the local replica fails ("Access right OK? NO!!"), so the interface fetches the line from its home node through the global protocol.

Address Translation (AT) overhead = 8 B / 8 kB = 0.1%; no extra latency added.

An access to a CMR page (miss)

Figure: when the data arrives, the line's Mtag in the replica is changed to "shared", so later accesses to the line can be satisfied locally.

An access to a CMR page (hit)

Figure: a later access to the same CMR page hits: the access-right check succeeds ("Access right OK? YES!") and the access is served entirely from the local replica.

Deterministic Directory

• MOSI protocol, fully mapped directory (one bit/node)
• Directory blocking: one outstanding transaction per cache line
• Directory blocks new requests until a completion is received
• The directory state and cache state are always in agreement (except for silent replacement...)

Figure: message flows between the local/requesting node (L), the home (H) and remote nodes (R):
(a) Write, remote shared: 1: req L->H; 2a: demand H->R and 2b: inv to the other sharers; 3a: reply(D) to L and 3b: ack from the invalidated sharers; 4: compl. L->H.
(b) Read, remote dirty: 1: req L->H; 2: demand H->R; 3: reply(D) R->L; 4: compl. L->H.
(c) Writeback: 1: req L->H; 2: ack/nack H->L; 3: comp(D) L->H.
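A sketch of the "deterministic" part: the home keeps a busy indication per line and admits only one transaction at a time, reopening the line when the requester's completion arrives. The names and the retry-instead-of-queue choice are assumptions for illustration.

```c
#include <stdbool.h>

#define N_NODES 4

typedef struct {
    bool presence[N_NODES];               /* fully mapped: one bit per node */
    bool dirty;
    bool busy;                            /* directory blocking: one outstanding
                                             transaction per cache line */
} dir_line;

/* A new request is accepted only if the line is not already in a transaction;
   otherwise the requester must retry (it could also be queued). */
bool dir_accept_request(dir_line *l)
{
    if (l->busy)
        return false;                     /* blocked until a completion is received */
    l->busy = true;
    return true;
}

/* The requester sends a completion once its part is done (cases (a) and (b)
   above); only then may the home start the next transaction, which keeps the
   directory state and the cache states in agreement. */
void dir_completion(dir_line *l)
{
    l->busy = false;
}
```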

Replication Issues Revisited

Figure: replication versus "physical" memory consumed (mem_tot): ccNUMA replicates nothing, COMA may replicate everything, and WildFire sits in between, with the total amount controlled by the OS.

• Only "promising" pages are replicated
• OS dynamically limits the amount of replication
• Solaris CMR changes in the hat_layer (= a port)

Advantages of Multiprocessor Nodes

• Pros:
  - amortization of fixed node costs over multiple processors
  - can use commodity SMPs
  - fewer nodes to keep track of in the directory
  - much communication may stay within the node (NUCA)
  - can share "node caches" (WildFire: Coherent Memory Replication)
• Cons:
  - bandwidth shared among processors and interface
  - bus may increase latency to local memory
  - snoopy bus at remote node increases delays there too

Memory cost of replication

• Example: replicate 10% of the data in all k nodes; that means k-1 extra copies of that data, i.e. (k-1) x 10% memory overhead
  - 50 nodes, each with 2 CPUs ==> 49 x 10% = 490% overhead
  - 4 nodes, each with 25 CPUs ==> 3 x 10% = 30% overhead

Does migration/replication help?

NAS parallel Benchmark Study (Execution time in seconds) [M. Bull, EPCC 2002]

Figure: six result tables, one per benchmark (Shallow, BT, FT, SP, MG, CG), each giving execution time for No Repl/Repl rows against No migr/Migr columns, with and without initial placement (runs labelled Unopt., HWopt., SWopt.). Migration and replication cut execution time dramatically when there is no initial placement (one table drops from 26 s to about 6 s), while with initial placement most benchmarks see little further change; replication also helps some benchmarks even with initial placement.

WildFire’s Technology Limits

Figure: a node (Dir$ = 8 b/line, Mtag = 2 b/line) and the technology limits WildFire runs into:
• SRAM size = DRAM size / 256
• Snoop frequency
• Dir$ reach >> sum(cache size)
• Slow interconnect
• Hard to make busses faster

Sun’s SunFire 15k/25k

Erik Hagersten Uppsala University

Sweden


StarCat

Sun Fire 15k/25k

(used at Lab2)


Front Side


Back Side


StarCat, 72 CPUs

Figure: 18 slots in total. Each slot holds a SunFire 6800-style CPU board (CPUs, caches, memory, data switches and an address repeater) mounted on an expander board that adds a Dir$, the global coherence protocol and a data repeater; the boards meet in an active backplane with 18x18 address and data crossbars. Coherence is repeater/snoop-based within a board and point-to-point directory-based between boards.

StarCat Coherence Mechanism

Figure: the same system. A data packet on the backplane carries DATA + MTAG + ECC = 576 bits; the MTAG delivered with the data is checked, and if the local access right it encodes is insufficient, the snooped address is turned into a remote request through the global protocol.

StarCat, 72 CPUs

Figure: the same system. The expander board allocates a Dir$ entry only for write requests and speculates on clean data on a Dir$ miss: WildCat coherence without CMR and with a faster interconnect.

Directory cache, but no directory (broadcast on Dir$ miss)

Figure: two nodes, each with a Dir$ in front of its directory protocol but with no in-memory directory behind it; on a Dir$ miss the request is simply broadcast to all nodes.

StarCat Performance Data

Figure: the same system annotated with performance data: latency = 200-340 ns, global bandwidth = 43 GB/s, local bandwidth = 86 GB/s, up to 104 CPUs (trading CPU slots for I/O).
