Welcome to AVDARK
Erik Hagersten
Uppsala University
Dept of Information Technology|www.it.uu.se
Intro and Caches 2
© Erik Hagersten|http://user.it.uu.se/~ehAVDARK 2012
Ericsson, CPU designer 1982-84: APZ212
MIT 1984-85: Dataflow parallel architecture
Ericsson computer science lab 1985-1988 NetInsight, (Erlang)
SICS, parallel architectures 1988-1993 COMA, (Simics & Virtutech)
Sun Microsystems, chief architect servers 1993 – 1999 WildFire, E6000, E15000, E25000
Professor Uppsala University 1999 – New modeling + Acumem
Startup Acumem 2006 – 2010 ThreadSpotter
Chief scientist Rogue Wave Software 2011 –
Goal for this course
Understand how and why modern computer systems are designed the way they are:
pipelines
memory organization
virtual/physical memory ...
Understand how and why multiprocessors are built
Cache coherence
Memory models
Synchronization…
Understand how and why parallelism is created and leveraged
Instruction-level parallelism
Memory-level parallelism
Thread-level parallelism…
Understand how and why multiprocessors of combined SIMD/MIMD type are built
GPU
Vector processing…
Understand how computer systems are adapted to different usage areas
General-purpose processors
Embedded/network processors…
Understand the physical limitation of modern computers
Bandwidth
Energy
Cooling…
Literature: Computer Architecture: A Quantitative Approach (4th/5th edition), John Hennessy & David Patterson
Lecturer
Erik Hagersten gives most lectures and is responsible for the course. Andreas Sembrant and Jonas Flodin are responsible for the labs and the hand-ins.
Sverker Holmgren will teach parallel programming.
David Black-Schaffer will teach about graphics processors.
Mandatory Assignments
There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point.
Optional Assignments
There are four (optional) hand-in assignments. Each can earn you a bonus point.
Examination Written exam at the end of the course. No books are allowed.
Bonus system: 64p max / 32p to pass. For each bonus point, there is a corresponding 4p bonus question. Full bonus => pass.
AVDARK in a nutshell
Exam and bonus structure
4 labs (mandatory)
4 Hand-in (optional)
Written Exam
How to get a bonus point:
Complete the extra bonus activity at the lab occasion
Complete optional bonus hand-in [with a
reasonable accuracy] before a hard deadline
32p/64p at the exam = PASS
Schedule in a nutshell: 5 Batches
1. Memory systems
Caches, VM, DRAM, microbenchmarks, optimizing SW
2. Multiprocessors
TLP: coherence, memory models, interconnects, scalability, clusters, …
3. Scalable multiprocessors
Scalability, synchronization, clusters, …
4. CPUs
ILP: pipelines, scheduling, superscalars, VLIWs, Vector instructions…
5. Widening the view
Technology impact, GPUs, multicores, future trends
AVDARK on the Web
www.it.uu.se/edu/course/homepage/avdark/ht12
Welcome!
News
FAQ Schedule Slides
New Papers Assignments
Reading instructions
Exam
Crash Course in Computer Architecture
(covering the course in 45 min)
Erik Hagersten
Uppsala University
≈30 years ago: APZ 212 @ 5MHz
”the AXE supercomputer”
APZ 212
marketing brochure quotes:
”Very compact”
6 times the performance
1/6th the size
1/5 the power consumption
”A breakthrough in computer science”
”Why more CPU power?”
”All the power needed for future development”
”…800,000 BHCA, should that ever be needed”
”SPC computer science at its most elegance”
”Using 64 kbit memory chips”
”1500W power consumption”
CPU Improvements
[Figure: relative CPU performance, log scale from 1 to 1000, over the years 1970–2000]
How to get efficient architectures…
Increase clock frequency
Create and explore locality:
a) Spatial locality
b) Temporal locality
c) Geographical locality
Create and explore parallelism
a) Instruction level parallelism (ILP)
b) Thread level parallelism (TLP)
c) Memory level parallelism (MLP)
Speculative execution
a) Out-of-order execution
b) Branch prediction
c) Prefetching
Very hard today
Memory Accesses
Three registers: Source1, Source2, Destination
Load/Store architecture (e.g., ”RISC”)
ALU ops: Reg -->Reg
Mem ops: Reg <--> Mem
[Figure: CPU with explicit registers and ALU; load/store instructions move data between the registers and memory]
Example: C = A + B
LD R1, [A]
LD R3, [B]
ADD R2, R1, R3
ST R2, [C]
Compiler
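The four-instruction sequence above can be traced with a tiny sketch in Python (the memory contents 3 and 4 are made-up example values, not from the slides):

```python
# Trace of "C = A + B" on a load/store machine: ALU ops work only on
# registers; LD/ST are the only instructions that touch memory.
mem = {"A": 3, "B": 4, "C": 0}    # hypothetical data memory contents
reg = {}                          # register file

reg["R1"] = mem["A"]                 # LD  R1, [A]
reg["R3"] = mem["B"]                 # LD  R3, [B]
reg["R2"] = reg["R1"] + reg["R3"]    # ADD R2, R1, R3  (Reg -> Reg)
mem["C"] = reg["R2"]                 # ST  R2, [C]
```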
Lifting the CPU hood (simplified…)
D C B A
CPU
Mem
Instructions:
Pipeline
D C B A
Mem Instructions:
I R X W
Regs
Pipeline
A
Mem I R X W
Regs
Pipeline
A
Mem I R X W
Regs
Pipeline
A
Mem I R X W
Regs
Pipeline:
A
Mem I R X W
Regs
I = Instruction fetch R = Read register X = Execute
W= Write register/mem
Processor
”state”
Pipeline stages
Memory for
data and instr.
Register Operations [aka ALU operation]
ADD R1, R2, R3
a.k.a. R1 := R2 op R3
A
Mem I R X W
Regs 2 3 1
e.g., +, -, *, / OP
Ifetch
Load Operation:
LD R1, mem[cnst+R2]
A
Mem I R X W
Regs 1
Ifetch
2
+
Store Operation:
ST R2, mem[cnst+R1]
A
Mem I R X W
Regs 1 2 Ifetch
+
Branch Operations:
if (R1 Op Const) GOTO mem[R2]
A
Mem I R X W
Regs 1 2 P c Ifetch OP
PC = Program Counter.
A special register pointing to the next instruction to execute
Initially
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
PC
Cycle 1
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
PC
Cycle 2
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
PC
Cycle 3
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
+
PC
Cycle 4
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
+
PC
Cycle 5
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
+ PC
A
Cycle 6
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
<
PC
A
Cycle 7
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
PC
A
Branch: Addr(A) PC
A:
Cycle 8
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
PC
Data dependency
The previous execution example was wrong!
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1
RegC := RegC + 1
Data dependency fix 1:
pipeline delays
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
”Stall”
”Stall”
Branch delays
D B C A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
A
9 cycles per iteration of 4 instructions
Need longer basic blocks with independent instr.
Branch Next PC
”Stall”
”Stall”
”Stall”
Next PC
”Stall”
”Stall”
PC
Need to find Instruction-Level Parallelism (ILP) to avoid stalls!
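The arithmetic behind the claim above can be checked directly (assuming, as the slide shows, 5 stall cycles per iteration on this simple 4-stage pipeline):

```python
# With one instruction completing per non-stalled cycle, the iteration
# time is simply instructions plus stalls.
def cycles_per_iteration(n_instructions, n_stalls):
    return n_instructions + n_stalls

cycles = cycles_per_iteration(4, 5)   # the loop body is 4 instructions
ipc = 4 / cycles                      # instructions per cycle, well below 1
```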
It is actually a lot worse!
Modern CPUs: ”superscalars” with ~4 parallel pipelines
Mem I R X W
Regs I R X W I R X W I R X W
+Higher throughput
- More complicated architecture
- Branch delay more expensive (more instr. missed)
- Harder to find ”enough” independent instr. (need 8 instr. between write and use)
Issue logic
Need to find 4x more ILP
It is actually a lot worse!
Modern CPUs: ~10-20 stages/pipe
Mem
I R
Regs
+Shorter cycletime (higher GHz)
- Branch delay even more expensive
- Even harder to find ”enough” independent instr.
I R B M M W
I R B M M W
I R B M M W
I I I I Issue
logic
It is actually a lot worse!
DRAM access: ~150 CPU cycles
I R
Regs
I R B M M W
I R B M M W
I R B M M W
I I I I Issue
logic
Mem
150 cycles
Pipeline delays get worse
D C B A
Mem I R X W
Regs
LD RegA, (100 + RegC) IF RegC < 100 GOTO A
RegB := RegA + 1 RegC := RegC + 1
x151 ”Stall”
x1 ”Stall”
Fix 1: Out-of-order execution:
Improving ILP
LD R1, M(100)
ADD R3, R2, R1
SUB R5, R6, R7
ST R5, M(100)
>=0?
LD ...
ADD ...
SUB ...
ST ...
Y N
The HW may execute instructions in a different order, but will make the ”side-effects” of the
instructions appear in order.
Assume that LD takes a long time.
The ADD is dependent on the LD
Start the SUB and ST before the ADD
Update R5 and M(100) after R3
Fix 2: Branch prediction
LD R1, M(100)
ADD R3, R2, R1
SUB R5, R6, R7
ST R5, M(100)
>=0?
LD ...
ADD ...
SUB ...
ST ...
Y N
The HW can guess if the branch is taken or not and avoid branch stalls if the guess is correct.
Assume the guess is ”Y”.
The HW can start to execute these instructions before the outcome of the branch is known, but cannot allow any ”side-effect” to take place until the outcome is known.
Fix 3: Scheduling Past Branches
Improving ILP
LD ADD SUB ST
>=0? LD
ADD SUB ST
>1?
LD ADD SUB ST
<2?
LD ADD SUB ST
=0?
Y
Y
Y
”Predict taken”
”Predict taken”
”Predict taken”
All instructions along the predicted path can be executed
out-of-order
Predicted path
LD ADD SUB ST
>=0? LD
ADD SUB ST
>1?
LD ADD SUB ST
<2?
LD ADD SUB ST
=0?
Y
Y
Y
Wrong Prediction!!!
Throw away, i.e., no side effects!
Actual path!
Fix 4: Scheduling Past Branches
Improving ILP
Fix 5: Use a cache
Mem
I R
Regs
B M M W
I R B M M W
I R B M M W
I R B M M W
I I I I Issue
logic
150cycles 1GB
$ 64kB
1-10 cycles
How to get efficient architectures…
Increase clock frequency
Create and explore locality:
a) Spatial locality b) Temporal locality
c) Geographical locality
Create and explore parallelism
a) Instruction level parallelism (ILP) b) Thread level parallelism (TLP)
c) Memory level parallelism (MLP)
Speculative execution
a) Out-of-order execution b) Branch prediction
c) Prefetching
Woops, using too much power 2007
Running at 2x the frequency will use much more than 2x the power
It is also really hard to find enough ILP
Speculation results in a fair amount of
wasted work
Fix: Multicore
Mem
CPU
$1
CPU
$1
CPU
$1
CPU
$1 L2$
Mem I/F External
I/F
threads
But now we also need
to find Thread-Level
Parallelism (TLP)
Example: Intel i7 ”Nehalem”
DRAM
Coherence
What is computer architecture?
“Bridging the gap between programs and transistors”
“Finding the best model to execute the programs”
best={fast, cheap, energy-efficient, reliable, predictable, …}
…
Caches and more caches
spam, spam, spam and spam or
Erik Hagersten
Uppsala University, Sweden
eh@it.uu.se
Fix 5: Use a cache
Mem
I R
Regs
B M M W
I R B M M W
I R B M M W
I R B M M W
I I I I Issue
logic
200cycles 1GB
~32kB
$
~1 cycle
Webster about “cache”
1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] 1a: a hiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache
Cache knowledge useful when...
Designing a new computer
Writing an optimized program
or compiler
or operating system …
Implementing software caching
Web caches
Proxies
File systems
Memory/storage
2000: SRAM 1–10 ns (1 kB … 4 MB), DRAM 150 ns (1 GB), disk 5 000 000 ns (1 TB)
(1982: SRAM and DRAM 200 ns, disk 10 000 000 ns)
Address Book Cache
Looking for Tommy’s Telephone Number
Ö Ä Å Z Y X V U T
TOMMY 12345
Ö Ä Å Z Y X V
“Address Tag”
One entry per page =>
Direct-mapped caches with 28 entries
“Data”
Indexing
function
Address Book Cache
Looking for Tommy’s Number
Ö Ä Å Z Y X V U T
OMMY 12345
TOMMY
EQ?
index
Address Book Cache
Looking for Tomas’ Number
Ö Ä Å Z Y X V U T
OMMY 12345
TOMAS
EQ?
index
Miss!
Lookup Tomas’ number in
the telephone directory
Address Book Cache
Looking for Tomas’ Number
Z Y X V U T
OMMY 12345
TOMAS
index
Replace TOMMY’s data with TOMAS’ data.
There is no other choice (direct mapped)
OMAS 23457
Ö
Ä
Å
Cache
CPU
Cache address
data (a word) hit
Memory address
data
Cache Organization
Cache
OMAS 23457
TOMAS
index
=
Hit (1) (1)
(4) (4)
1
Addr tag
&
(1)
Data (5 digits) Valid
28 entries
Cache Organization (really)
4kB, direct mapped
index
=
(1) Hit?
(32bits = 4 bytes)
1
Addr tag
&
(1)
Data (1)
Valid
00100110000101001010011010100011
1k entries of 4 bytes each (?)
(?)
(?)
0101001
0010011100101…32 bit address identifying
a byte in memory
Ordinary Memory
msb lsb
”Byte”
What is a good index function?
Cache Organization
4kB, direct mapped
index
=
(1) Hit?
(32bits = 4 bytes)
1
Addr tag
&
(1)
Data (1)
Valid
00100110000101001010011010100011
1k entries of 4 bytes each
(10)
(20) (20)
0101001
0010011100101…32 bit address Identifies the byte within a word
msb lsb
Mem Overhead:
21/32= 66%
Latency =
SRAM+CMP+AND
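The 10-bit index / 20-bit tag split and the 66% overhead on this slide follow directly from the cache geometry; a sketch (my helper function, not from the course material):

```python
# Address-field breakdown for a power-of-two direct-mapped cache.
def address_fields(cache_bytes, line_bytes, addr_bits=32):
    offset = (line_bytes - 1).bit_length()                # byte within line
    index = (cache_bytes // line_bytes - 1).bit_length()  # entry select
    tag = addr_bits - index - offset                      # rest of address
    return offset, index, tag

# The 4 kB, direct-mapped, 4-byte-line cache on this slide:
offset, index, tag = address_fields(4 * 1024, 4)
overhead = (tag + 1) / 32   # (tag + valid bit) per 32-bit data word, ~66%
```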
Cache
CPU
Cache address
data (a word) hit
Memory address
data
Hit: Use the data provided from the cache
~Hit: Use data from memory and also store it in
the cache
Cache performance parameters
Cache “hit rate” [%]
Cache “miss rate” [%] (= 1 - hit_rate)
Hit time [CPU cycles]
Miss time [CPU cycles]
Hit bandwidth
Miss bandwidth
Write strategy
….
How to rate architecture performance?
Marketing:
Frequency and Number of “cores”…
Architecture “goodness”:
CPI = Cycles Per Instruction
IPC = Instructions Per Cycle
Benchmarking:
SPEC-fp, SPEC-int, …
TPC-C, TPC-D, …
Warning: Using an unrepresentative benchmark can be misleading
Cache performance example
Assumption:
Infinite bandwidth
A perfect 1.0 cycles-per-instruction (CPI) CPU
100% instruction cache hit rate
Total number of cycles =
#Instr * ((1 - mem_ratio) * 1 + mem_ratio * avg_mem_accesstime) =
= #Instr * ((1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time))
CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time)
Example Numbers
CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time
mem_ratio = 0.25
hit_rate = 0.85
hit_time = 3
miss_time = 100
CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 = 0.75 + 0.64 + 3.75 = 5.14
What if ...
CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time
mem_ratio = 0.25
hit_rate = 0.85
hit_time = 3
miss_time = 100 ==> 0.75 + 0.64 + 3.75 = 5.14
•Twice as fast CPU ==> 0.37 + 0.64 + 3.75 = 4.77
•Faster memory (70c) ==> 0.75 + 0.64 + 2.62 = 4.01
•Improve hit_rate (0.95) => 0.75 + 0.71 + 1.25 = 2.71
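The what-if numbers above can be reproduced with a few lines (a sketch; the slide rounds each term to two decimals, which explains the tiny differences):

```python
# CPI model from the slides: non-memory instructions take 1 cycle,
# memory instructions take the average memory access time.
def cpi(mem_ratio, hit_rate, hit_time, miss_time):
    return (1 - mem_ratio) + mem_ratio * (
        hit_rate * hit_time + (1 - hit_rate) * miss_time)

base        = cpi(0.25, 0.85, 3, 100)   # slide: 5.14
faster_mem  = cpi(0.25, 0.85, 3, 70)    # slide: 4.01
better_hits = cpi(0.25, 0.95, 3, 100)   # slide: 2.71
```

The "twice as fast CPU" case on the slide instead halves the non-memory term (0.75 to 0.37) while the memory terms stay put; this simple function does not parameterize that.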
How to get more effective caches:
Larger cache (more capacity)
Cache block size (larger cache lines)
More placement choice (more associativity)
Innovative caches (victim, skewed, …)
Cache hierarchies (L1, L2, L3, CMR)
Latency-hiding (weaker memory models)
Latency-avoiding (prefetching)
Cache avoiding (cache bypass)
Optimized application/compiler
…
Why do you miss in a cache?
Mark Hill’s three “Cs”
Compulsory miss (touching data for the first time)
Capacity miss (the cache is too small)
Conflict misses (non-ideal cache implementation) (too many names starting with “H”)
(Multiprocessors)
Communication (imposed by communication)
False sharing (side-effect from large cache blocks)
Avoiding Capacity Misses –
a huge address book
Lots of pages. One entry per page.
Ö Ä Å Z Y X ÖV ÖU ÖT
LING 12345
ÖÖ ÖÄ ÖÅ ÖZ ÖY ÖX
“Address Tag”
One entry per page =>
Direct-mapped cache with 28² = 784 entries
“Data”
New
Indexing
function
Cache Organization
1MB, direct mapped
index
=
(1) Hit?
(32)
1
Addr tag
&
(1)
Data
(1)
Valid
00100110000101001010011010100011
256k entries
(18)
(12) (12)
0101001
001001110010132 bit address Identifies the byte within a word
msb lsb
Mem Overhead:
13/32= 40%
Latency =
SRAM+CMP+AND
Pros/Cons Large Caches
++ The safest way to get improved hit rate
-- SRAMs are very expensive!!
-- Larger size ==> slower speed (more load on “signals”, longer distances)
-- (power consumption)
-- (reliability)
Why do you hit in a cache?
Temporal locality
Likely to access the same data again soon
Spatial locality
Likely to access nearby data again soon
Typical access pattern:
(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, ...
temporal spatial
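The access pattern above can be run through a toy direct-mapped cache to see both kinds of locality at work (the cache geometry and addresses below are made up for illustration):

```python
# Toy direct-mapped cache: remembers one block tag per entry.
class DirectMapped:
    def __init__(self, n_entries, line_words):
        self.n = n_entries
        self.line = line_words
        self.tags = [None] * n_entries

    def access(self, word_addr):
        block = word_addr // self.line   # which cache line the word is in
        idx = block % self.n             # index function: simple modulo
        hit = self.tags[idx] == block
        self.tags[idx] = block           # allocate/replace on a miss
        return hit

cache = DirectMapped(n_entries=8, line_words=4)
A, B, C = 0, 40, 80                      # hypothetical word addresses
pattern = [A, B, C, A + 1, B, C, A + 2, B, C]
hits = [cache.access(a) for a in pattern]
# A+1 and A+2 hit (spatial locality); repeated B and C hit (temporal).
```

Changing C to 96 (block 24, which also maps to index 0) makes every A and C access miss: the conflict-miss case discussed a few slides later.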
(32)
Data
Multiplexer (4:1 mux) Identifies the word within a cache line
(2) Select
Identifies a byte within a word
Fetch more than a word:
cache blocks(a.k.a cache line)
1MB, direct mapped, CacheLine=16B
1
00100110000101001010011010100011
64k entries
0101001
0010011100101(16)
index
0010011100101 0010011100101 0010011100101
=
(1) Hit?
&
(1)
(12) (12)
(32) (32) (32) (32)
128 bits
msb lsb
Mem Overhead:
13/128= 10%
Latency =
SRAM+CMP+AND
Example in Class
Direct mapped cache:
Cache size = 64 kB
Cache line = 16 B
Word size = 4B
32 bits address (byte addressable)
“There are 10 kinds of people in the world…
Those who understand binary numbers and
those who do not.”
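One way to work the example above (my computation, not the slide's; check it against the lecture answer):

```python
# Field widths for a 64 kB direct-mapped cache, 16 B lines, 4 B words.
cache_bytes = 64 * 1024
line_bytes = 16
word_bytes = 4
addr_bits = 32

byte_in_word = (word_bytes - 1).bit_length()                # 2 bits
word_in_line = (line_bytes // word_bytes - 1).bit_length()  # 2 bits
index = (cache_bytes // line_bytes - 1).bit_length()        # 12 bits, 4k entries
tag = addr_bits - index - word_in_line - byte_in_word       # 16 bits
```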
Pros/Cons Large Cache Lines
++ Explores spatial locality
++ Fits well with modern DRAMs
* first DRAM access slow
* subsequent accesses fast (“page mode”)
-- Poor usage of SRAM & BW for some patterns
-- Higher miss penalty (fix: critical word first)
-- (False sharing in multiprocessors)
[Figure: performance vs. cache line size]
Thanks: Dr. Erik Berg
UART: StatCache Graph
app=matrix multiply
Small cache:
Short cache lines are better
Large caches:
Longer cache lines are better
Huge caches:
Everything fits regardless of CL size
Note: this is just a single example, but
the conclusion typically holds for most
applications.
Cache Conflicts
Typical access pattern:
(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, …
What if B and C index to the same cache location? Conflict misses -- big time!
Potential performance loss 10-100x
temporal spatial
Address Book Cache
Two names per page: index first, then search.
Ö Ä Å Z Y X V U T
OMAS
TOMMY
EQ?
index
12345 23457
OMMY
EQ?
How should the select signal be produced?
Avoiding conflict: More associativity
1MB, 2-way set-associative, CL=4B
index
(32)
1
Data
00100110000101001010011010100011
128k
“sets”
(17) 0101001
0010011100101Identifies a byte within a word
Multiplexer (2:1 mux)
(32) (32)
1 0101001
0010011100101Hit?
=
&
(13)
(1) Select (13)
=
&
“logic”
msb lsb
Latency =
SRAM+CMP+AND+
LOGIC+MUX
One “set”
Pros/Cons Associativity
++ Avoids conflict misses
-- Slower access time
-- More complex implementation: comparators, muxes, ...
-- Requires more pins (for external SRAM…)
Going all the way…!
1MB, fully associative, CL=16B
index
=
Hit? 4B
1
&
Data
00100110000101001010011010100011
One “set”
(0) (28)
0101001
0010011100101Identifies a byte within a word
Multiplexer (256k:1 mux) Identifies the word within a cache line
Select (16) (13)
16B 16B
1 0101001
0010011100101=
&
“logic”
(2)
1 0101001
00100111001011
=
&
=
&
... 16B
64k
comparators
Fully Associative
Very expensive
Only used for small caches (and sometimes TLBs)
CAM = Content-addressable memory
~Fully-associative cache storing key+data
Provide a key to the CAM and get the associated data
A combination thereof
1MB, 2-way, CL=16B
index
=
Hit? (32)
1
&
Data
001001100001010010100110101000110
32k
“sets”
(15) (13)
0101001
0010011100101Identifies a byte within a word
Multiplexer (8:1 mux) Identifies the word within a cache line
Select (1)
(13)
(128) (128)
1 0101001
0010011100101=
&
“logic”
(2)
(256)
msb lsb
Example in Class
Cache size = 2 MB
Cache line = 64 B
Word size = 8B (64 bits)
4-way set associative
32 bits address (byte addressable)
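A worked sketch of this example too (again my computation, not the official answer):

```python
# Field widths for a 2 MB, 4-way set-associative cache with 64 B lines.
cache_bytes = 2 * 1024 * 1024
line_bytes = 64
ways = 4
addr_bits = 32

offset = (line_bytes - 1).bit_length()       # 6 bits: byte within a line
sets = cache_bytes // (line_bytes * ways)    # 8192 sets
index = (sets - 1).bit_length()              # 13 bits
tag = addr_bits - index - offset             # 13 bits
```

Note that associativity shrinks the index (fewer sets than lines) and grows the tag, compared with a direct-mapped cache of the same size.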
Who to replace?
Picking a “victim”
Least-recently used (aka LRU)
Considered the “best” algorithm (which is not always true…)
Only practical up to limited number of ways
Not most recently used
Remember who used it last: 8-way -> 3 bits/CL
Pseudo-LRU
E.g., based on coarse time stamps.
Used in the VM system
Random replacement
Can’t continuously have “bad luck”…
Cache Model:
Random vs. LRU
[Figure: miss ratio vs. cache size with random and LRU replacement, for art and equake (SPEC 2000)]
4-way sub-blocked cache
1MB, direct mapped, Block=64B, sub-block=16B
index
=
Hit?
(4)
(32)
1
&
Data
00100110000101001010011010100011
16k
(14)
(12)
0101001
0010011100101Identifies a byte within a word
0010011100101
16:1 mux
Identifies the word within a cache line
(2) (12)
(128) (128) (128) (128)
512 bits
0 1 0
& & &
logic
4:1 mux
(2) Sub block within a block
msb lsb
Mem Overhead:
16/512= 3%
Pros/Cons Sub-blocking
++ Lowers the memory overhead
++ (Avoids problems with false sharing -- MP)
++ Avoids problems with bandwidth waste
-- Will not explore as much spatial locality
-- Still poor utilization of SRAM
-- Fewer sparse “things” allocated
Replacing dirty cache lines
Write-back
Write dirty data back to memory (next level) at replacement
A “dirty bit” indicates an altered cache line
Write-through
Always write through to the next level (as well)
data will never be dirty => no write-backs
Write Buffer/Store Buffer
Do not need the old value for a store
One option: write-around (no write-allocate in the cache), used for smaller, lower-level caches
CPU cache
WB:
stores loads
Innovative cache: Victim cache
CPU
Cache address
data
hit Memory
address
data
Victim Cache (VC): a small, fairly associative cache (~10s of entries)
Cache lookup: search the cache and the VC in parallel
Cache replacement: move the victim to the VC and replace in the VC
VC hit: swap the VC data with the corresponding data in the cache
“A second life ”
address VC
data (a word)
hit
Skewed Associative Cache
A, B and C have a three-way conflict
2-way A
B C
4-way A
B C
It has been shown that 2-way skewed performs roughly the same as 4-way caches
2-way skewed A
B
C
Skewed-associative cache:
Different indexing functions
index
=
1
&
00100110000101001010011010100011
128k entries
(17)
(13)
0101001
001001110010132 bit address Identifies the byte within a word
1 0101001
0010011100101f 1 f 2
(>18)
(>18) (17)
msb lsb
2:1mux
=
&
function
(32) (32)
(32)
UART: Elbow cache
Increases "associativity" when needed
Performs roughly the same as an 8-way cache, but slightly faster
Uses much less power!!
[Figure: A and B occupy a set; when C causes a severe conflict, A is moved ("elbowed") to its alternative location to make room for C]
Topology of caches: Harvard Arch
The CPU needs a new instruction each cycle
~25% of instructions are LD/ST
Data and instructions have different access patterns
==> Separate D and I first-level caches
==> Unified 2nd- and 3rd-level caches
Cache Hierarchy of Today
[Figure: CPU -> L1 I$ + L1 D$ (both on-chip) -> L2$ (on-chip) -> L3$ -> DRAM memory]
L1: small enough to keep up with the CPU speed
Separate I cache to allow instruction fetch and data fetch in parallel
Last-Level Cache (LLC): use whatever transistors are left
L3: off-chip SRAM (today more commonly on-chip)
HW prefetching
...a little green man that anticipates your next memory access and prefetches the data to the cache.
Improves MLP!
Sequential prefetching: sequential streams [to a page]. Some number of prefetch streams supported. Often only for L2 and L3.
PC-based prefetching: detects strides from the same PC. Often also for L1.
Adjacent prefetching: on a miss, also bring in the "next" cache line. Often only for L2 and L3.
Hardware prefetching
Hardware "monitor" looking for patterns in memory accesses
Brings data of anticipated future accesses into the cache prior to their usage
Two major types:
Sequential prefetching (typically page-based, 2nd-level cache and higher). Detects sequential cache lines missing in the cache.
PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns.
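The per-PC stride detection can be sketched as a small table keyed by the PC of the load. The two-sighting confirmation rule and the table layout are illustrative assumptions, not a description of any shipping prefetcher:

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Observe one load; return an address to prefetch, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)     # first sighting of this PC
            return None
        last, stride, confirmed = self.table[pc]
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            # Same nonzero stride seen twice: confident enough to prefetch
            self.table[pc] = (addr, stride, True)
            return addr + stride
        self.table[pc] = (addr, new_stride, False)
        return None


p = StridePrefetcher()
pc = 0x400ABC                 # hypothetical PC of a load inside a loop
print(p.access(pc, 1000))     # None: first sighting
print(p.access(pc, 1064))     # None: stride 64 seen once, not yet confirmed
print(p.access(pc, 1128))     # 1192: stride confirmed, prefetch one ahead
```

Because the table is keyed by PC, several loops with different strides can be tracked at the same time, which is what lets this scheme sit close to the pipeline and serve L1.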
Data Cache Capacity/Latency/BW
[Figure slide: measured capacity vs. latency/bandwidth across the data cache hierarchy]
Cache implementation
Generic cache:
[Figure: Addr[63..0] is split (MSB..LSB) into address tag (AT), index, and byte offset; the index selects one row of the SRAM; all eight ways' tags in that row are compared in parallel (=), the comparators drive "sel way" and the hit signal, and a mux picks the matching way's cacheline, here 64B; state bits (S) accompany each tag]
Caches at all levels roughly work like this
Example sizes: I1 $ 64kB, D1 $ 64kB, L2 $ 256kB, L3 $ 24MB
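The address decomposition and set search in the figure can be sketched directly. Sizes follow the example (64 kB, 64 B lines, 8 ways); the scan over the set stands in for the eight tag comparators that fire in parallel in hardware:

```python
LINE_BYTES = 64
WAYS = 8
TOTAL_BYTES = 64 * 1024
NUM_SETS = TOTAL_BYTES // (LINE_BYTES * WAYS)    # 128 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1        # 6 offset bits
INDEX_BITS = NUM_SETS.bit_length() - 1           # 7 index bits

def split(addr):
    """Split an address into (tag, index, offset), MSB to LSB."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(sets, addr):
    """Hardware compares all 8 ways' tags in parallel; we just scan them."""
    tag, index, _ = split(addr)
    return any(way_tag == tag for way_tag in sets[index])


sets = [[] for _ in range(NUM_SETS)]             # each set holds up to 8 tags
tag, index, offset = split(0x12345678)
sets[index].append(tag)                          # install the line
print(lookup(sets, 0x12345678))                  # True: tag matches in the set
print(lookup(sets, 0x12345678 + LINE_BYTES * NUM_SETS * 2))  # False: same set, other tag
```

The second probe lands in the same set (the added distance is a multiple of NUM_SETS cachelines) but carries a different tag, which is exactly a conflict in that set.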
Address Book Analogy
Two names per page: index first, then search.
[Figure: the thumb tabs of an address book (T, U, V, X, Y, Z, Å, Ä, Ö, ...) act as the index; the "T" page holds two entries, "TOMAS 12345" and "TOMMY 23457"; the remaining letters ("OMAS" vs. "OMMY") are compared like tags (EQ?) to select the second entry]
Cache lingo
Cacheline: data chunk moved to/from a cache
Cache set: fraction of the cache identified by the index
Associativity: number of alternative storage places for a cacheline
Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem)
Temporal locality: likelihood to access the same data again soon
Spatial locality: likelihood to access nearby data soon
Typical access pattern (inner loop stepping through an array): A, B, C, A+4, B, C, A+8, B, C, ...
The repeated B and C accesses show temporal locality; the A, A+4, A+8 sequence shows spatial locality.
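The access pattern above can be generated by a tiny loop; the concrete addresses (A, B, C below) are hypothetical, chosen just to make the trace readable:

```python
# Inner loop stepping through a 4-byte-element array at A while reusing
# two scalars at B and C every iteration.
A, B, C = 0x1000, 0x2000, 0x3000
trace = []
for i in range(3):
    trace += [A + 4 * i, B, C]
print([hex(a) for a in trace])
# B and C repeat every iteration (temporal locality). The A accesses walk
# consecutive addresses, so with 64-byte cachelines one miss on A also
# fetches the next 15 elements (spatial locality).
```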
Cache Lingo Picture
[Figure: the generic cache from before, annotated with the lingo: Addr[63..0] split (MSB..LSB) into tag and index; one SRAM row — the eight AT/S/64B-data entries sharing an index — is a cache set; associativity = 8-way; the comparators, "sel way", and the mux pick the hitting way's 64B cacheline]
Nehalem i7 (one example)
L1: 32kB, 8-way, pLRU
L2: 256kB, 8-way, pLRU, non-inclusive
L3: 8MB, 16-way, Nehalem replacement, non-inclusive
Take-away message: Caches
Caches are fast but small
Cache space is allocated in cacheline chunks (e.g., 64 bytes)
The LSB part of the address is used to find the "cache set" (aka indexing)
There is a limited number of cachelines per set (associativity)
Typically, there are several levels of caches
Caches are the most important target for optimizations
How are we doing?
Increase the clock frequency
Create and exploit locality:
a) Spatial locality
b) Temporal locality
c) Geographical locality
Create and exploit parallelism:
a) Instruction-level parallelism (ILP)
b) Thread-level parallelism (TLP)
c) Memory-level parallelism (MLP)
Speculative execution:
a) Out-of-order execution
b) Branch prediction
c) Prefetching