(1)

Welcome to AVDARK

Erik Hagersten

Uppsala University

(2)

Dept of Information Technology|www.it.uu.se

Intro and Caches 2

© Erik Hagersten|http://user.it.uu.se/~eh

AVDARK 2012

 Ericsson, CPU designer 1982-84: APZ212

 MIT 1984-85: Dataflow parallel architecture

 Ericsson computer science lab 1985-1988 NetInsight, (Erlang)

 SICS, parallel architectures 1988-1993 COMA, (Simics & Virtutech)

 Sun Microsystems, chief architect servers 1993 – 1999  WildFire, E6000, E15000, E25000

 Professor Uppsala University 1999 –  New modeling + Acumem

 Startup Acumem 2006 – 2010 ThreadSpotter

 Chief scientist Rogue Wave Software 2011 –

(3)


Goal for this course

Understand how and why modern computer systems are designed the way they are:

 pipelines

 memory organization

 virtual/physical memory ...

Understand how and why multiprocessors are built

 Cache coherence

 Memory models

 Synchronization…

Understand how and why parallelism is created and leveraged

 Instruction-level parallelism

 Memory-level parallelism

 Thread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are built

 GPU

 Vector processing…

Understand how computer systems are adapted to different usage areas

 General-purpose processors

 Embedded/network processors…

 Understand the physical limitations of modern computers

 Bandwidth

 Energy

 Cooling…

(4)


AVDARK in a nutshell

Literature: Computer Architecture: A Quantitative Approach (4th/5th edition), John Hennessy & David Patterson

Lecturer: Erik Hagersten gives most lectures and is responsible for the course. Andreas Sembrant and Jonas Flodin are responsible for the labs and the hand-ins.

Sverker Holmgren will teach parallel programming.

David Black-Schaffer will teach about graphics processors.

Mandatory assignment: There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point.

Optional assignment: There are four (optional) hand-in assignments. Each can earn you a bonus point.

Examination: Written exam at the end of the course. No books are allowed.

Bonus system: 64p max / 32p to pass. For each bonus point there is a corresponding 4p bonus question. Full bonus => Pass.

(5)


Exam and bonus structure

 4 mandatory labs

 4 hand-ins (optional)

 Written exam

How to get a bonus point:

 Complete the extra bonus activity at a lab occasion

 Complete an optional bonus hand-in [with reasonable accuracy] before a hard deadline

 32p/64p at the exam = PASS

(6)


Schedule in a nutshell: 5 Batches

1. Memory systems

Caches, VM, DRAM, microbenchmarks, optimizing SW

2. Multiprocessors

TLP: coherence, memory models, interconnects, scalability, clusters, …

3. Scalable multiprocessors

Scalability, synchronization, clusters, …

4. CPUs

ILP: pipelines, scheduling, superscalars, VLIWs, Vector instructions…

5. Widening the view

Technology impact, GPUs, multicores, future trends

(7)


AVDARK on the Web

www.it.uu.se/edu/course/homepage/avdark/ht12

Welcome!

News

FAQ Schedule Slides

New Papers Assignments

Reading instructions

Exam

(8)

Crash Course in Computer Architecture

(covering the course in 45 min)

Erik Hagersten

Uppsala University

(9)


≈30 years ago: APZ 212 @ 5MHz

”the AXE supercomputer”

(10)


APZ 212

marketing brochure quotes:

 ”Very compact”

 6 times the performance

 1/6th the size

 1/5 the power consumption

 ”A breakthrough in computer science”

 ”Why more CPU power?”

 ”All the power needed for future development”

 ”…800,000 BHCA, should that ever be needed”

 ”SPC computer science at its most elegance”

 ”Using 64 kbit memory chips”

 ”1500W power consumption”

(11)


CPU Improvements

[Chart: relative performance, log scale 1 to 1000, vs. year 1970-2000]

(12)


How to get efficient architectures…

 Increase clock frequency

 Create and explore locality:

a) Spatial locality b) Temporal locality

c) Geographical locality

 Create and explore parallelism

a) Instruction level parallelism (ILP) b) Thread level parallelism (TLP)

c) Memory level parallelism (MLP)

 Speculative execution

a) Out-of-order execution b) Branch prediction

c) Prefetching

Very hard today

(13)


Memory Accesses

Load/Store architecture (e.g., ”RISC”)

Three regs: Source1, Source2, Destination

ALU ops: Reg --> Reg

Mem ops: Reg <--> Mem (explicit loads/stores between registers and memory)

Example: C = A + B, compiled to:

LD R1, [A]
LD R3, [B]
ADD R2, R1, R3
ST R2, [C]

(14)


Lifting the CPU hood (simplified…)

D C B A

CPU

Mem

Instructions:

(15)


Pipeline

D C B A

Mem Instructions:

I R X W

Regs

(16)


Pipeline

A

Mem I R X W

Regs

(17)


Pipeline

A

Mem I R X W

Regs

(18)


Pipeline

A

Mem I R X W

Regs

(19)


Pipeline:

A

Mem I R X W

Regs

I = Instruction fetch
R = Read register
X = Execute
W = Write register/mem

Processor

”state”

Pipeline stages

Memory for

data and instr.
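The four-stage pipeline above can be sketched in a few lines of code. This is a small Python sketch (not from the course material) that shows which instruction occupies each stage in each cycle, assuming an ideal pipeline with no stalls:

```python
def pipeline_rows(insts, stages=("I", "R", "X", "W")):
    """Which instruction occupies each pipeline stage in each cycle
    (ideal pipeline, one instruction issued per cycle, no stalls)."""
    rows = []
    for cycle in range(len(insts) + len(stages) - 1):
        rows.append([insts[cycle - s] if 0 <= cycle - s < len(insts) else "-"
                     for s in range(len(stages))])
    return rows

# Four instructions A-D flow through the I, R, X, W stages
for row in pipeline_rows(["A", "B", "C", "D"]):
    print(row)
```

The diagonal pattern in the output is the point: after a 3-cycle fill, one instruction completes every cycle.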

(20)


Register Operations [aka ALU operation]

ADD R1, R2, R3

a.k.a. R1 := R2 op R3

A

Mem I R X W

Regs 2 3 1

e.g., +, -, *, / OP

Ifetch

(21)


Load Operation:

LD R1, mem[cnst+R2]

A

Mem I R X W

Regs 1

Ifetch

2

+

(22)


Store Operation:

ST R2, mem[cnst+R1]

A

Mem I R X W

Regs 1 2 Ifetch

+

(23)


Branch Operations:

if (R1 Op Const) GOTO mem[R2]

A

Mem I R X W

Regs 1 2 P c Ifetch OP

PC = Program Counter.

A special register pointing to the next instruction to execute

(24)


Initially

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

PC

(25)


Cycle 1

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

PC

(26)


Cycle 2

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

PC

(27)


Cycle 3

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

+

PC

(28)


Cycle 4

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

+

PC

(29)


Cycle 5

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

+ PC

A

(30)


Cycle 6

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

<

PC

A

(31)


Cycle 7

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

PC

A

Branch: Addr(A) -> PC

A:

(32)


Cycle 8

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

PC

(33)


Data dependency =>

Previous execution example wrong!

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

(34)


Data dependency fix 1:

pipeline delays

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

”Stall”

”Stall”

(35)


Branch delays

D B C A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

A

9 cycles per iteration of 4 instructions =>

Need longer basic blocks with independent instr.

Branch -> Next PC

”Stall”

”Stall”

”Stall”

Next PC

”Stall”

”Stall”

PC

Need to find Instruction-Level Parallelism (ILP) to avoid stalls!
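The 9-cycle figure can be checked with simple arithmetic. A back-of-the-envelope sketch, not from the slides; the split of the five stall cycles into two load-use stalls and three branch stalls is an assumption read off the stall annotations:

```python
def cycles_per_iteration(n_instr, load_use_stalls, branch_stalls):
    # Scalar in-order pipeline: one instruction per cycle, plus stall cycles
    return n_instr + load_use_stalls + branch_stalls

# 4 instructions + 2 load-use stall cycles + 3 branch stall cycles
print(cycles_per_iteration(4, 2, 3))  # 9
```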

(36)


It is actually a lot worse!

Modern CPUs:”superscalars” with ~4 parallel pipelines

Mem I R X W

Regs I R X W I R X W I R X W

+ Higher throughput

- More complicated architecture

- Branch delay more expensive (more instr. missed)

- Harder to find ”enough” independent instr. (need 8 instr. between write and use)

Issue logic

Need to find 4x more ILP

(37)


It is actually a lot worse!

Modern CPUs: ~10-20 stages/pipe

Mem

I R

Regs

+Shorter cycletime (higher GHz)

- Branch delay even more expensive

- Even harder to find ”enough” independent instr.

I R B M M W

I R B M M W

I R B M M W

I I I I Issue

logic

(38)


It is actually a lot worse!

DRAM access: ~150 CPU cycles

I R

Regs

I R B M M W

I R B M M W

I R B M M W

I I I I Issue

logic

Mem

150 cycles

(39)


Pipeline delays get worse

D C B A

Mem I R X W

Regs

LD RegA, (100 + RegC)
RegB := RegA + 1
RegC := RegC + 1
IF RegC < 100 GOTO A

x151 ”Stall”

x1 ”Stall”

(40)


Fix 1: Out-of-order execution:

Improving ILP

LD R1, M(100)
ADD R3, R2, R1
SUB R5, R6, R7
ST R5, M(100)

The HW may execute instructions in a different order, but will make the ”side-effects” of the instructions appear in order.

Assume that the LD takes a long time. The ADD is dependent on the LD =>

Start the SUB and ST before the ADD

Update R5 and M(100) after R3
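The idea can be shown with a toy scheduler: each cycle, issue the first instruction whose source registers are ready. This is an illustrative Python sketch with assumed latencies, not how real issue logic is implemented:

```python
# Assumed latencies: the LD is slow, everything else takes one cycle
latency = {"LD": 3, "ADD": 1, "SUB": 1, "ST": 1}
insts = [("LD", "R1", []),        # R1 := M(100)
         ("ADD", "R3", ["R1"]),   # depends on the LD
         ("SUB", "R5", []),       # independent of the LD
         ("ST", None, ["R5"])]    # depends on the SUB

issue_order = []
ready_at = {}                     # register -> cycle its value is available
remaining = list(range(len(insts)))
cycle = 0
while remaining:
    for i in remaining:
        op, dst, srcs = insts[i]
        if all(ready_at.get(s, 0) <= cycle for s in srcs):
            issue_order.append(op)
            if dst:
                ready_at[dst] = cycle + latency[op]
            remaining.remove(i)
            break
    cycle += 1                    # at most one issue per cycle

print(issue_order)  # ['LD', 'SUB', 'ST', 'ADD']
```

The SUB and ST issue before the ADD, exactly as the slide describes, while register updates can still be retired in program order.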

(41)


Fix 2: Branch prediction

LD R1, M(100)
ADD R3, R2, R1
SUB R5, R6, R7
ST R5, M(100)

>=0? (Y / N)

The HW can guess if the branch is taken or not and avoid branch stalls if the guess is correct.

Assume the guess is ”Y”. The HW can start to execute these instructions before the outcome of the branch is known, but cannot allow any ”side-effect” to take place until the outcome is known.
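A classic way to make the guess is a 2-bit saturating counter per branch. This is a minimal sketch of that scheme (the starting state and the outcome sequence are assumptions for illustration, not from the slides):

```python
def predict_2bit(outcomes):
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken. Returns the number of correct guesses."""
    counter, correct = 2, 0                 # start in ''weakly taken''
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct

# A loop branch: taken 8 times, one exit, taken 8 more times
outcomes = [True] * 8 + [False] + [True] * 8
print(predict_2bit(outcomes), "of", len(outcomes), "correct")
```

The single loop exit causes only one misprediction; the 2-bit hysteresis keeps the predictor in the "taken" half of its state space.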

(42)


Fix 3: Scheduling Past Branches

Improving ILP

[Figure: four basic blocks of LD ADD SUB ST, separated by the branches >=0?, >1?, <2?, =0?, each ”predict taken” (Y), forming the predicted path]

All instructions along the predicted path can be executed out-of-order

(43)


Fix 4: Scheduling Past Branches

Improving ILP

[Figure: the same predicted path of LD ADD SUB ST blocks, but one branch goes the other way]

Wrong prediction!!!

Throw away, i.e., no side effects!

The actual path differs from the predicted path

(44)


Fix 5: Use a cache

Mem

I R

Regs

B M M W

I R B M M W

I R B M M W

I R B M M W

I I I I Issue

logic

Mem: 1GB, 150 cycles

$ (cache): 64kB, 1-10 cycles

(45)


How to get efficient architectures…

 Increase clock frequency

 Create and explore locality:

a) Spatial locality b) Temporal locality

c) Geographical locality

 Create and explore parallelism

a) Instruction level parallelism (ILP) b) Thread level parallelism (TLP)

c) Memory level parallelism (MLP)

 Speculative execution

a) Out-of-order execution b) Branch prediction

c) Prefetching

(46)


Whoops, using too much power (2007)

 Running at 2x the frequency will use much more than 2x the power

 It is also really hard to find enough ILP

 Speculation results in a fair amount of wasted work

(47)


Fix: Multicore

Mem

CPU

$1

CPU

$1

CPU

$1

CPU

$1 L2$

Mem I/F External

I/F

threads

But now we also need to find Thread-Level Parallelism (TLP)

(48)


Example: Intel i7 ”Nehalem”

DRAM

Coherence

(49)


What is computer architecture?

“Bridging the gap between programs and transistors”

“Finding the best model to execute the programs”

best={fast, cheap, energy-efficient, reliable, predictable, …}

(50)

Caches and more caches

or: spam, spam, spam and spam

Erik Hagersten

Uppsala University, Sweden

eh@it.uu.se

(51)


Fix 5: Use a cache

Mem

I R

Regs

B M M W

I R B M M W

I R B M M W

I R B M M W

I I I I Issue

logic

Mem: 1GB, 200 cycles

$ (cache): ~32kB, ~1 cycle

(52)


Webster about “cache”

1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] 1a: a hiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache

(53)


Cache knowledge useful when...

 Designing a new computer

 Writing an optimized program

 or compiler

 or operating system …

 Implementing software caching

 Web caches

 Proxies

 File systems

(54)


Memory/storage

SRAM DRAM disk


2000: 1ns 1ns 3ns 10ns 150ns 5 000 000ns 1kB 64k 4MB 1GB 1 TB

(1982: 200ns 200ns 200ns 10 000 000ns)

(55)


Address Book Cache

Looking for Tommy’s Telephone Number

Ö Ä Å Z Y X V U T

TOMMY 12345

Ö Ä Å Z Y X V

“Address Tag”

One entry per page =>

Direct-mapped caches with 28 entries

“Data”

Indexing

function

(56)


Address Book Cache

Looking for Tommy’s Number

Ö Ä Å Z Y X V U T

OMMY 12345

TOMMY

EQ?

index

(57)


Address Book Cache

Looking for Tomas’ Number

Ö Ä Å Z Y X V U T

OMMY 12345

TOMAS

EQ?

index

Miss!

Lookup Tomas’ number in

the telephone directory

(58)


Address Book Cache

Looking for Tomas’ Number

Z Y X V U T

OMMY 12345

TOMAS

index

Replace TOMMY’s data with TOMAS’ data.

There is no other choice (direct mapped)

OMAS 23457

Ö

Ä

Å

(59)


Cache

CPU

Cache address

data (a word) hit

Memory address

data

(60)


Cache Organization

Cache

OMAS 23457

TOMAS

index

=

Hit (1) (1)

(4) (4)

1

Addr tag

&

(1)

Data (5 digits) Valid

28 entries

(61)


Cache Organization (really)

4kB, direct mapped

index

=

(1) Hit?

(32bits = 4 bytes)

1

Addr tag

&

(1)

Data (1)

Valid

00100110000101001010011010100011

1k entries of 4 bytes each (?)

(?)

(?)

0101001

0010011100101…

32 bit address identifying

a byte in memory

Ordinary Memory

msb lsb

”Byte”

What is a good index function?

(62)


Cache Organization

4kB, direct mapped

index

=

(1) Hit?

(32bits = 4 bytes)

1

Addr tag

&

(1)

Data (1)

Valid

00100110000101001010011010100011

1k entries of 4 bytes each

(10)

(20) (20)

0101001

0010011100101…

32 bit address Identifies the byte within a word

msb lsb

Mem Overhead:

21/32= 66%

Latency =

SRAM+CMP+AND

(63)


Cache

CPU

Cache address

data (a word) hit

Memory address

data

Hit: Use the data provided from the cache

~Hit: Use data from memory and also store it in

the cache

(64)


Cache performance parameters

 Cache “hit rate” [%]

 Cache “miss rate” [%] (= 1 - hit_rate)

 Hit time [CPU cycles]

 Miss time [CPU cycles]

 Hit bandwidth

 Miss bandwidth

 Write strategy

 ….

(65)


How to rate architecture performance?

Marketing:

 Frequency and Number of “cores”…

Architecture “goodness”:

 CPI = Cycles Per Instruction

 IPC = Instructions Per Cycle

Benchmarking:

 SPEC-fp, SPEC-int, …

 TPC-C, TPC-D, …

 Warning: Using an unrepresentative benchmark can be misleading

(66)


Cache performance example

Assumptions:

Infinite bandwidth

A perfect 1.0 Cycles Per Instruction (CPI) CPU

100% instruction cache hit rate

Total number of cycles =
#Instr * ((1 - mem_ratio) * 1 + mem_ratio * avg_mem_access_time) =
#Instr * ((1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time))

CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time)

(67)


Example Numbers

CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time

mem_ratio = 0.25
hit_rate = 0.85
hit_time = 3
miss_time = 100

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 = 0.75 + 0.64 + 3.75 = 5.14

(68)


What if ...

CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time

mem_ratio = 0.25, hit_rate = 0.85, hit_time = 3, miss_time = 100 ==> 0.75 + 0.64 + 3.75 = 5.14

• Twice as fast CPU ==> 0.37 + 0.64 + 3.75 = 4.76

• Faster memory (70c) ==> 0.75 + 0.64 + 2.62 = 4.01

• Improve hit_rate (0.95) ==> 0.75 + 0.71 + 1.25 = 2.71
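The variants above follow directly from the CPI formula. A small Python sketch of the same model (the slide shows per-term roundings; the exact values are 5.1375, 4.7625, 4.0125 and 2.7125):

```python
def cpi(mem_ratio, hit_rate, hit_time, miss_time, cpu_scale=1.0):
    # CPI model from the slides; cpu_scale shrinks only the non-memory part
    # (a ''twice as fast CPU'' is cpu_scale=0.5, leaving memory times as-is)
    return (1 - mem_ratio) * cpu_scale + mem_ratio * (
        hit_rate * hit_time + (1 - hit_rate) * miss_time)

print(cpi(0.25, 0.85, 3, 100))        # base case, ~5.14
print(cpi(0.25, 0.85, 3, 100, 0.5))   # twice as fast CPU, ~4.76
print(cpi(0.25, 0.85, 3, 70))         # faster memory, ~4.01
print(cpi(0.25, 0.95, 3, 100))        # better hit rate, ~2.71
```

Note how improving the hit rate helps far more than doubling the CPU speed: the miss term dominates.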

(69)


How to get more effective caches:

 Larger cache (more capacity)

 Cache block size (larger cache lines)

 More placement choice (more associativity)

 Innovative caches (victim, skewed, …)

 Cache hierarchies (L1, L2, L3, CMR)

 Latency-hiding (weaker memory models)

 Latency-avoiding (prefetching)

 Cache avoiding (cache bypass)

 Optimized application/compiler

 …

(70)


Why do you miss in a cache

 Mark Hill’s three “Cs”

 Compulsory miss (touching data for the first time)

 Capacity miss (the cache is too small)

 Conflict misses (non-ideal cache implementation) (too many names starting with “H”)

 (Multiprocessors)

 Communication (imposed by communication)

 False sharing (side-effect from large cache blocks)
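The three Cs can be measured, not just defined: a common textbook method classifies a miss as a conflict miss if a fully associative LRU cache of the same capacity would have hit. A Python sketch of that method (line-address trace assumed; illustration only):

```python
from collections import OrderedDict

def classify_misses(trace, num_lines):
    """Classify direct-mapped misses into the three Cs, using a fully
    associative LRU cache of the same capacity as the reference."""
    seen = set()                  # lines touched before (compulsory check)
    dm = {}                       # direct-mapped: index -> line address
    fa = OrderedDict()            # fully associative LRU, num_lines lines
    counts = {"hit": 0, "compulsory": 0, "conflict": 0, "capacity": 0}
    for line in trace:
        idx = line % num_lines
        dm_hit = dm.get(idx) == line
        fa_hit = line in fa
        if fa_hit:
            fa.move_to_end(line)          # refresh LRU position
        else:
            fa[line] = True
            if len(fa) > num_lines:
                fa.popitem(last=False)    # evict least recently used
        if dm_hit:
            counts["hit"] += 1
        elif line not in seen:
            counts["compulsory"] += 1
        elif fa_hit:
            counts["conflict"] += 1       # full associativity would have hit
        else:
            counts["capacity"] += 1
        seen.add(line)
        dm[idx] = line
    return counts

# Lines 0 and 4 collide in a 4-line direct-mapped cache: pure conflict misses
print(classify_misses([0, 4, 0, 4], 4))
```

The first touch of each line is compulsory; the repeated 0/4 ping-pong would hit in a fully associative cache, so those misses are conflicts.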

(71)


Avoiding Capacity Misses –

a huge address book

Lots of pages. One entry per page.

Ö Ä Å Z Y X ÖV ÖU ÖT

LING 12345

ÖÖ ÖÄ ÖÅ ÖZ ÖY ÖX

“Address Tag”

One entry per page =>

Direct-mapped cache with 28² = 784 entries

“Data”

New

Indexing

function

(72)


Cache Organization

1MB, direct mapped

index

=

(1) Hit?

(32)

1

Addr tag

&

(1)

Data

(1)

Valid

00100110000101001010011010100011

256k entries

(18)

(12) (12)

0101001

0010011100101

32 bit address Identifies the byte within a word

msb lsb

Mem Overhead:

13/32= 40%

Latency =

SRAM+CMP+AND

(73)


Pros/Cons Large Caches

++ The safest way to get improved hit rate

-- SRAMs are very expensive!!

-- Larger size ==> slower speed (more load on “signals”, longer distances)

-- (power consumption)

-- (reliability)

(74)


Why do you hit in a cache?

 Temporal locality

 Likely to access the same data again soon

 Spatial locality

 Likely to access nearby data soon

Typical access pattern:

(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, ...

temporal spatial

(75)

(32)

Data

Multiplexer (4:1 mux) Identifies the word within a cache line

(2) Select

Identifies a byte within a word

Fetch more than a word:

cache blocks(a.k.a cache line)

1MB, direct mapped, CacheLine=16B

1

00100110000101001010011010100011

64k entries

0101001

0010011100101

(16)

index

0010011100101 0010011100101 0010011100101

=

(1) Hit?

&

(1)

(12) (12)

(32) (32) (32) (32)

128 bits

msb lsb

Mem Overhead:

13/128= 10%

Latency =

SRAM+CMP+AND

(76)


Example in Class

Direct mapped cache:

 Cache size = 64 kB

 Cache line = 16 B

 Word size = 4B

 32 bits address (byte addressable)

“There are 10 kinds of people in the world…

Those who understand binary numbers and those who do not.”
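Working the example: with a 16 B cache line, the low 4 address bits select the byte within the line; a 64 kB direct-mapped cache holds 4096 lines, so the next 12 bits are the index, leaving 16 tag bits. A sketch of the arithmetic (helper name is ours, not from the slides):

```python
import math

def dm_cache_bits(cache_bytes, line_bytes, addr_bits=32):
    """Tag/index/offset bit breakdown for a direct-mapped cache
    (all sizes assumed to be powers of two)."""
    lines = cache_bytes // line_bytes
    offset = int(math.log2(line_bytes))   # byte within the cache line
    index = int(math.log2(lines))         # selects the cache line
    tag = addr_bits - index - offset      # the rest must be compared
    return tag, index, offset

# 64 kB direct-mapped cache, 16 B lines, 32-bit byte addresses
print(dm_cache_bits(64 * 1024, 16))  # (16, 12, 4)
```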

(77)


Pros/Cons Large Cache Lines

++ Explores spatial locality

++ Fits well with modern DRAMs

* first DRAM access slow

* subsequent accesses fast (“page mode”)

-- Poor usage of SRAM & BW for some patterns

-- Higher miss penalty (fix: critical word first)

-- (False sharing in multiprocessors)

[Graph: performance vs. cache line size]

(78)

Thanks: Dr. Erik Berg

UART: StatCache graph (app = matrix multiply)

Small caches: short cache lines are better

Large caches: longer cache lines are better

Huge caches: everything fits regardless of CL size

Note: this is just a single example, but the conclusion typically holds for most applications.

(79)


Cache Conflicts

Typical access pattern (inner loop stepping through an array):

A, B, C, A+1, B, C, A+2, B, C, …

What if B and C index to the same cache location?

Conflict misses -- big time! Potential performance loss 10-100x

(80)


Address Book Cache

Two names per page: index first, then search.

Ö Ä Å Z Y X V U T

OMAS

TOMMY

EQ?

index

12345 23457

OMMY

EQ?

(81)

How should the select signal be produced?

Avoiding conflict: More associativity

1MB, 2-way set-associative, CL=4B

index

(32)

1

Data

00100110000101001010011010100011

128k

“sets”

(17) 0101001

0010011100101

Identifies a byte within a word

Multiplexer (2:1 mux)

(32) (32)

1 0101001

0010011100101

Hit?

=

&

(13)

(1) Select (13)

=

&

“logic”

msb lsb

Latency =

SRAM+CMP+AND+

LOGIC+MUX

One “set”

(82)


Pros/Cons Associativity

++ Avoids conflict misses

-- Slower access time

-- More complex implementation (comparators, muxes, ...)

-- Requires more pins (for external SRAM…)

(83)


Going all the way…!

1MB, fully associative, CL=16B

index

=

Hit? 4B

1

&

Data

00100110000101001010011010100011

One “set”

(0) (28)

0101001

0010011100101

Identifies a byte within a word

Multiplexer (256k:1 mux) Identifies the word within a cache line

Select (16) (13)

16B 16B

1 0101001

0010011100101

=

&

“logic”

(2)

1 0101001

0010011100101

1

=

&

=

&

... 16B

64k

comparators

(84)


Fully Associative

 Very expensive

 Only used for small caches (and sometimes TLBs)

CAM = Content-addressable memory

 ~Fully-associative cache storing key+data

 Provide the key to the CAM and get the associated data

(85)


A combination thereof

1MB, 2-way, CL=16B

index

=

Hit? (32)

1

&

Data

001001100001010010100110101000110

32k

“sets”

(15) (13)

0101001

0010011100101

Identifies a byte within a word

Multiplexer (8:1 mux) Identifies the word within a cache line

Select (1)

(13)

(128) (128)

1 0101001

0010011100101

=

&

“logic”

(2)

(256)

msb lsb

(86)


Example in Class

 Cache size = 2 MB

 Cache line = 64 B

 Word size = 8B (64 bits)

 4-way set associative

 32 bits address (byte addressable)
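Working this one: 2 MB / (64 B x 4 ways) = 8192 sets, so 6 offset bits, 13 index bits, and 13 tag bits remain of the 32-bit address. A sketch of the arithmetic (helper name is ours, not from the slides):

```python
import math

def sa_cache_bits(cache_bytes, line_bytes, ways, addr_bits=32):
    """Tag/index/offset bit breakdown for a set-associative cache
    (all sizes assumed to be powers of two)."""
    sets = cache_bytes // (line_bytes * ways)
    offset = int(math.log2(line_bytes))   # byte within the cache line
    index = int(math.log2(sets))          # selects the set
    tag = addr_bits - index - offset      # compared against all ways
    return tag, index, offset

# 2 MB, 4-way set-associative, 64 B lines, 32-bit byte addresses
print(sa_cache_bits(2 * 1024 * 1024, 64, 4))  # (13, 13, 6)
```

Note how adding associativity shrinks the index and grows the tag compared to a direct-mapped cache of the same size.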

(87)


Who to replace?

Picking a “victim”

 Least-recently used (aka LRU)

 Considered the “best” algorithm (which is not always true…)

 Only practical up to a limited number of ways

 Not most recently used (NMRU)

 Remember who used it last: 8-way -> 3 bits/CL

 Pseudo-LRU

 E.g., based on coarse time stamps. Used in the VM system

 Random replacement

 Can’t continuously have “bad luck”...
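LRU is easy to express for a fully associative cache. A minimal Python sketch (an illustration of the policy, not of a practical hardware implementation, which would track recency with a few bits per set):

```python
from collections import OrderedDict

def lru_hits(trace, capacity):
    """Count hits for a fully associative cache of `capacity` lines
    with LRU replacement."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits

print(lru_hits([1, 2, 3, 1, 4, 1, 2], 3))  # 2
```

With capacity 3, the re-references to line 1 hit, but line 2 has aged out by the time it returns.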

(88)


Cache Model:

Random vs. LRU

[Graphs: miss ratio vs. cache size for random and LRU replacement, art (SPEC 2000) and equake (SPEC 2000)]

(89)


4-way sub-blocked cache

1MB, direct mapped, Block=64B, sub-block=16B

index

=

Hit?

(4)

(32)

1

&

Data

00100110000101001010011010100011

16k

(14)

(12)

0101001

0010011100101

Identifies a byte within a word

0010011100101

16:1 mux

Identifies the word within a cache line

(2) (12)

(128) (128) (128) (128)

512 bits

0 1 0

& & &

logic

4:1 mux

(2) Sub block within a block

msb lsb

Mem Overhead:

16/512= 3%

(90)


Pros/Cons Sub-blocking

++ Lowers the memory overhead

++ (Avoids problems with false sharing -- MP)

++ Avoids problems with bandwidth waste

-- Will not exploit as much spatial locality

-- Still poor utilization of SRAM

-- Fewer sparse “things” allocated


Replacing dirty cache lines

 Write-back

 Write dirty data back to memory (next level) at replacement

 A “dirty bit” indicates an altered cache line

 Write-through

 Always write through to the next level (as well)

 ==> data will never be dirty ==> no write-backs
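The two policies can be contrasted in a small sketch (my own illustrative model; `next_level` stands in for memory or the next cache level):

```python
class WriteBackLine:
    """A cache line under write-back: writes set a dirty bit; the next level
    is updated only when a dirty line is replaced."""
    def __init__(self, data):
        self.data, self.dirty = data, False

    def write(self, data):
        self.data, self.dirty = data, True   # altered: mark dirty, no traffic yet

    def replace(self, next_level):
        if self.dirty:                       # the write-back happens only now
            next_level.append(self.data)

class WriteThroughLine:
    """A cache line under write-through: every write also goes to the next
    level, so the line is never dirty and replacement is free."""
    def __init__(self, data):
        self.data = data

    def write(self, data, next_level):
        self.data = data
        next_level.append(data)              # always write through

mem = []
wb = WriteBackLine(0)
wb.write(1); wb.write(2)                     # two writes, no memory traffic
wb.replace(mem)                              # one write-back of the final value
print(mem)  # → [2]
```

A write-through line would have pushed both values (1 and 2) to the next level immediately, trading extra traffic for never having dirty data.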


Write Buffer/Store Buffer

 Do not need the old value for a store

 One option: write-around (no write allocate in caches), used for smaller lower-level caches

[Slide diagram: a write buffer (WB) sits between the CPU and the cache; stores enter the WB, and subsequent loads compare their address (=) against the buffered stores before going to the cache.]

Innovative cache: Victim cache

[Slide diagram: the CPU probes the Cache and the Victim Cache (VC) in parallel with the address; each returns data and a hit signal, with Memory behind them.]

Victim Cache (VC): a small, fairly associative cache (~10s of entries)

Cache lookup: search cache and VC in parallel

Cache replacement: move the victim to the VC, replacing in the VC

VC hit: swap the VC data with the corresponding data in the cache

“A second life”
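The lookup/replacement protocol above can be modeled in a few lines (an illustrative sketch of a direct-mapped cache backed by a small FIFO-replaced VC; a real VC is a small fully-associative hardware structure):

```python
from collections import OrderedDict

class VictimCache:
    """Direct-mapped cache backed by a victim cache (VC) - a sketch of the
    protocol on the slide, not of any real implementation."""
    def __init__(self, sets, vc_entries):
        self.cache = {}                      # set index -> tag (direct mapped)
        self.sets = sets
        self.vc = OrderedDict()              # tag -> True, FIFO-replaced
        self.vc_entries = vc_entries

    def access(self, addr):
        idx, tag = addr % self.sets, addr // self.sets
        if self.cache.get(idx) == tag:
            return "cache hit"
        if tag in self.vc:                   # VC hit: swap with the cache line
            del self.vc[tag]
            self._evict_to_vc(idx)
            self.cache[idx] = tag
            return "VC hit"
        self._evict_to_vc(idx)               # miss: the victim gets a second life
        self.cache[idx] = tag
        return "miss"

    def _evict_to_vc(self, idx):
        if idx in self.cache:
            if len(self.vc) >= self.vc_entries:
                self.vc.popitem(last=False)
            self.vc[self.cache[idx]] = True

vc = VictimCache(sets=4, vc_entries=2)
print([vc.access(a) for a in [0, 4, 0]])  # 0 and 4 conflict in set 0
# → ['miss', 'miss', 'VC hit']
```

Without the VC, the third access would have missed again; the victim cache turns the two-way conflict in set 0 into a near-hit.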


Skewed Associative Cache

A, B and C have a three-way conflict

[Slide diagram: in a 2-way cache A, B and C all map to the same set, so one of them is always evicted; in a 4-way cache all three fit; in a 2-way skewed cache, where each way is indexed by a different function, the three lines also fit.]

It has been shown that 2-way skewed caches perform roughly the same as 4-way caches


Skewed-associative cache:

Different indexing functions

[Slide diagram: each of the two ways indexes its 128k-entry array with a different function (f1, f2) of more than 18 bits of the 32-bit address, producing a 17-bit index per way. Each way's stored tag is compared (=) and ANDed with the valid bit; the hit logic drives a 2:1 mux selecting the matching way's 32-bit data.]

UART: Elbow cache

Increase “associativity” when needed

Performs roughly the same as an 8-way cache, is slightly faster, and uses much less power!!

[Slide diagram: A and B share a set; when C arrives and causes a severe conflict, the elbow cache “makes room” by moving one of the conflicting lines to its alternative location, so A, B and C all fit.]


Topology of caches: Harvard Arch

 CPU needs a new instruction each cycle

 ~25% of instructions are LD/ST

 Data and Instr. have different access patterns

==> Separate D and I first level cache

==> Unified 2nd and 3rd level caches


Cache Hierarchy of Today

[Slide diagram: the CPU feeds separate on-chip L1 I$ and L1 D$ (split to allow instruction fetch and data fetch in parallel, and small enough to keep up with the CPU speed), backed by an on-chip L2$, an L3$, and DRAM memory. Whatever transistors are left go to the Last-Level Cache (LLC); historically off-chip SRAM, today more commonly on-chip.]


HW prefetching

...a little green man that anticipates your next memory access and prefetches the data to the cache.

Improves MLP!

 Sequential prefetching: Sequential streams [to a page]. Some number of prefetch streams supported. Often only for L2 and L3.

 PC-based prefetching: Detects strides from the same PC. Often also for L1.

 Adjacent prefetching: On a miss, also bring in the “next” cache line. Often only for L2 and L3.


Hardware prefetching

 Hardware ”monitor” looking for patterns in memory accesses

 Brings data of anticipated future accesses into the cache prior to their usage

 Two major types:

 Sequential prefetching (typically page-based, 2nd-level cache and higher). Detects sequential cache lines missing in the cache.

 PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns.
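A per-PC stride detector can be sketched with a small table indexed by the load's PC (a simplified model; real prefetchers add confidence counters and issue the prefetch several strides ahead):

```python
class StridePrefetcher:
    """Per-PC stride detection (simplified reference-prediction table).
    On each load, remember the last address and stride seen for that PC;
    when the same stride repeats, predict the next address."""
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def on_load(self, pc, addr):
        last_addr, last_stride = self.table.get(pc, (None, None))
        prefetch = None
        if last_addr is not None:
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetch = addr + stride     # stride confirmed: prefetch next
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, None)
        return prefetch

p = StridePrefetcher()
# A load at PC 0x40 walking an array with stride 64 (one cache line):
print([p.on_load(0x40, a) for a in (1000, 1064, 1128, 1192)])
# → [None, None, 1192, 1256]: prediction starts once the stride repeats
```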


Data Cache Capacity/Latency/BW


Cache implementation

Caches at all levels roughly work like this:

[Slide diagram: a generic cache. Addr[63..0] is split (MSB to LSB) into tag, index and offset; the index selects a set in the SRAM, where each cache line (here 64 B) holds an address tag (AT), state bits (S) and the data. All ways' tags are compared (=) in parallel; the hit signal selects the matching way (here way ”6”) through a mux, delivering 64 B of data. Example hierarchy: 64 kB I1 and D1 caches, a 256 kB L2, and a 24 MB L3.]


Address Book Analogy

Two names per page: index first, then search.

[Slide: looking up TOMMY in an address book with letter tabs (T, U, V, X, Y, Z, Å, Ä, Ö). The first letter acts as the index selecting the T page; the rest of the name, “OMMY”, is the tag compared (EQ?) against the stored entries “OMAS” (12345) and “OMMY” (23457). The comparison selects the second entry.]


Cache lingo

Cacheline: Data chunk moved to/from a cache

Cache set: Fraction of the cache identified by the index

Associativity: Number of alternative storage places for a cacheline

Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem)

Temporal locality: Likelihood to access the same data again soon

Spatial locality: Likelihood to access nearby data soon

Typical access pattern (inner loop stepping through an array): A, B, C, A+4, B, C, A+8, B, C, ... The repeated B and C accesses show temporal locality; the stride through A shows spatial locality.
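Running that access pattern through a tiny cache model shows both kinds of locality paying off (a sketch with made-up sizes: 16 B lines, so A, A+4 and A+8 share a cache line):

```python
def simulate(addresses, line_bytes=16, lines=64):
    """Direct-mapped cache model: count hits for a byte-address trace."""
    cache = [None] * lines
    hits = 0
    for addr in addresses:
        line_addr = addr // line_bytes
        idx = line_addr % lines
        if cache[idx] == line_addr:
            hits += 1                    # temporal or spatial reuse
        else:
            cache[idx] = line_addr       # miss: fetch the whole line
    return hits

# Three data items far enough apart to live on different cache lines:
A, B, C = 0, 100, 200
trace = [a for i in range(0, 12, 4) for a in (A + i, B, C)]
# A+0,B,C, A+4,B,C, A+8,B,C: the first three accesses miss, the rest hit
print(simulate(trace))  # → 6 hits out of 9 accesses
```

B and C hit again because the same lines are reused (temporal); A+4 and A+8 hit because they land on the line fetched for A+0 (spatial).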


Cache Lingo Picture

[Slide diagram: the generic cache annotated with the lingo. Addr[63..0] splits into tag and index; the index selects a cache set in the SRAM; the set holds eight cache lines (associativity = 8-way), each (here 64 B) with an address tag (AT), state (S) and data; the hit signal selects way ”6” through the mux.]


Nehalem i7 (one example)

[Slide: Nehalem i7 cache hierarchy: 32 kB L1 (8-way, pLRU); 256 kB L2 (8-way, pLRU, non-inclusive); 8 MB L3 (16-way, Nehalem replacement, non-inclusive).]


Take-away message: Caches

 Caches are fast but small

 Cache space is allocated in cache-line chunks (e.g., 64 bytes)

 The LSB part of the address is used to find the ”cache set” (aka indexing)

 There is a limited number of cache lines per set (associativity)

 Typically, several levels of caches

 The most important target for optimizations


How are we doing?

 Increase clock frequency

 Create and explore locality:

a) Spatial locality
b) Temporal locality
c) Geographical locality

 Create and explore parallelism

a) Instruction level parallelism (ILP)
b) Thread level parallelism (TLP)
c) Memory level parallelism (MLP)

 Speculative execution

a) Out-of-order execution
b) Branch prediction
c) Prefetching


Goals for this course

Understand how and why modern computer systems are designed the way they are:

 pipelines

 memory organization

 virtual/physical memory ...

Understand how and why multiprocessors are built

 Cache coherence

 Memory models

 Synchronization…

Understand how and why parallelism is created and leveraged

 Instruction-level parallelism

 Memory-level parallelism

 Thread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are built

 GPU

 Vector processing…

Understand how computer systems are adapted to different usage areas

 General-purpose processors

 Embedded/network processors…

 Understand the physical limitations of modern computers

 Bandwidth

 Energy

 Cooling…
