(1)

What Thread-Level Parallelism (TLP) can buy you

Erik Hagersten

Uppsala University

(2)


Old Trend 1: Deeper pipelines / higher frequency
Exploiting ILP (instruction-level parallelism)

[Figure: a single deep pipeline with its PC and register file, backed by a 1 GB memory about 150 cycles away; data dependences between instructions limit how much of the pipeline can be kept busy.]

(3)

Old Trend 2: Wider pipelines
Exploiting more ILP

[Figure: a 4-wide superscalar pipeline; the issue logic picks instructions from a single thread (one PC, one register file), backed by a 1 GB memory.]

More pipelines + deeper pipelines ⇒ need more independent instructions.

(4)


Old Trend 3: Deeper memory hierarchy
Exploiting access locality

[Figure: the superscalar pipeline behind a cache hierarchy. Approximate latencies and sizes: memory 150 cycles / 1 GB, then caches of 30 cycles / 2 MB, 10 cycles / 64 kB, and 2 cycles / 2 kB; the larger, slower levels cost less per byte.]

(5)

CMP: Chip Multiprocessor (aka Multicores)
More TLP & geographical locality

[Figure: N superscalar cores on one chip, each with its own PC, register file, issue logic, and cache, running threads 1…N.]

(6)


Not enough MLP?

[Figure: CPU with L1, L2, and L3 caches in front of a slooow memory.]

A = B + C:
Read B (0.3 - 100 ns)
Read C (0.3 - 100 ns)
Add B & C (0.3 ns)
Write A (0.3 - 100 ns)

(7)

TLP ⇒ MLP
⇒ memory accesses from many threads (MLP)

[Figure: threads 1 and 2 each compute A = B + C against the slooow memory; their reads of B and C are independent across threads and can be in flight at the same time.]
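As an illustration (my sketch, not from the slides): a minimal pthreads program in which each thread streams through its own array. Misses from different threads are mutually independent, so the memory system can service several of them at once. The thread count, array size, and the sum_array kernel are all made up for the example.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define N (1L << 22)          /* elements per thread (assumed size) */

    /* Each thread sums a private array: its loads miss in the cache,
       but misses from different threads are independent of each other,
       so they can be outstanding simultaneously (TLP => MLP). */
    static void *sum_array(void *arg) {
        double *a = arg, s = 0.0;
        for (long i = 0; i < N; i++)
            s += a[i];
        double *res = malloc(sizeof *res);
        *res = s;
        return res;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        double *data[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            data[t] = calloc(N, sizeof(double));
            pthread_create(&tid[t], NULL, sum_array, data[t]);
        }
        for (int t = 0; t < NTHREADS; t++) {
            void *res;
            pthread_join(tid[t], &res);
            printf("thread %d: sum = %g\n", t, *(double *)res);
            free(res);
            free(data[t]);
        }
        return 0;
    }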

(8)


Not enough ILP

[Figure: CPU with L1 in front of a slooow memory; the issue logic sees one thread's instruction stream:]

LD X, R2
ST R1, R2, R3
LD X, R2
ST R1, R2, R3
…

Each ST uses the R2 just loaded, so a single thread gives the issue logic few independent instructions to pick from.
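The same serialization in C (my example, not from the slides): pointer chasing, where each load's address comes from the previous load, so at most one miss is outstanding at a time.

    #include <stddef.h>

    struct node { struct node *next; long pad[7]; };  /* ~one cache line per node */

    /* Walk a linked list: every load depends on the previous one, so the
       misses cannot overlap -- no ILP, and no MLP, within this one thread. */
    long chase(const struct node *p) {
        long n = 0;
        while (p != NULL) {
            p = p->next;      /* serialized, latency-bound load */
            n++;
        }
        return n;
    }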

(9)

TLP ⇒ MLP
⇒ feed one superscalar with instructions from many threads (also used in GPUs)

[Figure: one CPU with L1; the issue logic now draws from two threads' instruction streams (each a chain of LD X, R2 / ST R1, R2, R3), so it can always find instructions that are independent of one another.]

(10)


SMT: Simultaneous Multithreading
"Combine TLP & ILP to find independent instructions"

[Figure: one superscalar pipeline with per-thread state, PCs and register files Regs 1 … Regs N for threads 1…N, sharing the issue logic, caches, and memory.]

(11)

Thread-interleaved

Historical examples:

• Denelcor HEP, Tera Computers [B. Smith], 1984
  Each thread executes every nth cycle, in round-robin fashion
  - Poor single-thread performance
  - Expensive (due to early adoption)
• Intel's "Hyperthreading" (2002)
  - Poor implementation

(12)

Design Issues for Multicores

Erik Hagersten Uppsala University

Sweden

(13)

CMP bottlenecks / points of optimization

• Performance per watt?
• Performance per memory byte?
• Performance per bandwidth?
• Performance per $?
• …
• How large a fraction of a CMP system's cost is in the CPU chip?
• Should execution (MIPS/FLOPS) be viewed as a scarce resource?

(14)


Shared Bottlenecks

[Figure: eight CPU/L1 pairs sharing the L2 cache and the memory bandwidth; the shared resources are the shared bottlenecks.]

(15)

Capacity or Capability Computing?

Capacity (≈ several sequential jobs) or capability (≈ one parallel job)? Issues:

- Memory requirement?
- Sharing in the cache?
- Memory bandwidth requirement?

Memory: the major cost of a CMP system! How do we utilize it best?

- Once the working set is in memory, work like crazy!

⇒ Capability computing suits CMPs best (in general)

(16)


A Few Fat or Many Narrow Cores?

• Fat: fewer cores, but…
  - wide issue?
  - O-O-O?
• Narrow: more cores, but…
  - narrow issue?
  - in-order?
  - have you ever heard of Amdahl? (see the sketch below)
  - SMT, run-ahead, execute-ahead… to cure the shortcomings?

Read:
Maximizing CMP Throughput with Mediocre Cores, Davis, Laudon and Olukotun, PACT 2006.
Amdahl's Law in the Multicore Era, Mark Hill, IEEE Computer, July 2008.
http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf
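A minimal sketch (my addition) of the point Amdahl makes against many narrow cores: with parallel fraction f, the speedup on n equal cores is 1 / ((1 - f) + f/n), so the sequential remainder soon dominates. The value of f is an assumption for the example.

    #include <stdio.h>

    /* Amdahl's law: speedup on n cores for a program whose
       fraction f of the work is perfectly parallelizable. */
    static double amdahl(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        double f = 0.95;                 /* assumed parallel fraction */
        for (int n = 1; n <= 128; n *= 2)
            printf("n = %3d  speedup = %5.2f\n", n, amdahl(f, n));
        /* Even at f = 0.95, 128 cores yield under 18x: the 5%
           sequential part argues for at least one fat core. */
        return 0;
    }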

(17)

Cores vs. caches

• Depends on your target applications…
• Niagara's answer: go for cores
  - In-order, 5-stage pipeline
  - 8 cores × 4 SMT threads each ⇒ 32 threads
  - 3 MB shared L2 cache (96 kB/thread)
  - SMT to hide memory latency
  - Memory bandwidth: 25 GB/s
  - Will this approach scale with technology?
• Others: go for cache
  - 2-4 cores for now

(18)


How to Hide Memory Latency (and create MLP)

Options:
• O-O-O
• HW prefetching
• SMT
• Run-ahead / execute-ahead

(19)

Handling shared resources

Erik Hagersten

Uppsala University

(20)


1st-Order MC Performance Problems

[Figure: core, caches ($), and memory, with two binaries fighting for the shared resources; part of the cache holds wasted (fetched but unused) data.]

Additional multicore issues:
- Even less cache resource per application
- Sharing of cache resources
- Wasted cache usage

(21)

Cache Interference in a Shared Cache

• Cache-sharing strategies:
  1. Fight it out!
  2. Fair share: 50% of the cache each
  3. Maximize throughput: who will benefit the most? (see the toy sketch below)

Read:
STATSHARE: A Statistical Model for Managing Cache Share via Decay, Pavlos Petoumenos et al., MOBS workshop at ISCA 2006.

[Figure: single-threaded cache profiling; miss rate vs. cache size for two applications, A (examples: ammp, art, …) and B (examples: vpr_place, vortex, …), with the three sharing strategies marked on the curves.]
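A toy sketch (my addition) of strategy 3: given single-threaded miss-rate profiles for A and B, pick the way-partition of a shared cache that minimizes the combined miss rate. The profiles below are made-up numbers, shaped like the slide's A (keeps benefiting from cache) and B (saturates early).

    #include <stdio.h>

    #define WAYS 8

    /* Hypothetical single-threaded miss rates given w ways (w = 0..WAYS). */
    static const double miss_a[WAYS + 1] =
        {1.0, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20};
    static const double miss_b[WAYS + 1] =
        {1.0, 0.40, 0.20, 0.15, 0.13, 0.12, 0.11, 0.10, 0.10};

    int main(void) {
        int best = 0;
        double best_total = miss_a[0] + miss_b[WAYS];
        /* Strategy 3: try every split and keep the one with the
           lowest combined miss rate, i.e. give each extra way to
           whoever benefits the most from it. */
        for (int w = 1; w <= WAYS; w++) {
            double total = miss_a[w] + miss_b[WAYS - w];
            if (total < best_total) { best_total = total; best = w; }
        }
        printf("give A %d ways, B %d ways (combined miss rate %.2f)\n",
               best, WAYS - best, best_total);
        return 0;
    }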

(22)


Andreas' research: Using Prefetch-NT

[Figure: core, caches, and memory as before; without hints, part of the cache is wasted on streaming data.]

Avoid cache pollution using the existing x86 ISA (NT = non-temporal):

AMD, Prefetch-NT:
• Install in the L1 cache with the NT bit set
• Non-inclusive caching ⇒ not in L2, L3
• Upon eviction from L1, do not install in L2, L3 (if NT is not set, the cache line does get installed)

Intel Core 2, Prefetch-NT:
• Install in the L1 & L2 caches (inclusive caching)
• Put in the MRU place in L2 ⇒ replaced more easily
• Upon eviction from L1, keep in L2

Intel i7, Prefetch-NT:
• Install in the L1 & L2 caches (inclusive caching)
• Put in the MRU place in L2 ⇒ replaced more easily
• Upon eviction from L1, keep in L2

All, Store-NT:
• Keep the cache line in a special write buffer
• When all bytes of the cache line have been updated, write to memory, bypassing the caches
• Huge penalty if not all bytes are updated
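A minimal sketch (my addition) of issuing these hints from C. _mm_prefetch with _MM_HINT_NTA and _mm_stream_si128 are the standard SSE/SSE2 intrinsics for a non-temporal prefetch and a Store-NT; the copy kernel and the prefetch distance are made up for the example.

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>   /* SSE:  _mm_prefetch, _MM_HINT_NTA, _mm_sfence */
    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */

    /* Copy n 16-byte blocks without polluting the caches. Both pointers
       are assumed 16-byte aligned, and n is assumed to cover whole cache
       lines, so every byte of each line is written (avoiding the
       Store-NT penalty for partially written lines). */
    void copy_nt(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i++) {
            /* Prefetch-NT a few blocks ahead: fetched with minimal caching */
            _mm_prefetch((const char *)(src + 16 * (i + 4)), _MM_HINT_NTA);
            __m128i v = _mm_load_si128((const __m128i *)(src + 16 * i));
            /* Store-NT: collects in a write-combining buffer, bypasses caches */
            _mm_stream_si128((__m128i *)(dst + 16 * i), v);
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }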

(23)

Example: Hints to avoid cache pollution (non-temporal prefetches)

[Figure: left, miss rate vs. cache size ("the larger the cache, the better"); running four instances leaves each with a quarter of the actual cache (actual/4), about 2× the miss rate. Right, throughput for one instance vs. four instances: with the "don't allocate!" hint, 40% faster.]

(24)


Example: Hints for mixed workloads (non-temporal prefetches)

[Figure: left, miss rate vs. cache size (up to 64 MB) on an AMD Opteron for Libquantum ("streaming"), LBM ("bigger is better"), and bzip ("tiny"). Right, performance of bzip2, Libquantum, and LBM run individually, in the mix, and in the mix with patched (NT-prefetch) binaries; patching recovers about 25% (geometric mean).]

(25)

Wrapping up about multicores

Erik Hagersten
Uppsala University

(26)


Looks and Smells Like an SMP (aka UMA)?

Well, how about:
• Cost of parallelism?
• Cache capacity per thread?
• Memory bandwidth per thread?
• Cost of thread communication?
• …

[Figure: left, an SMP system: CPUs 1…32, each with its own L1 and L2, connected to memory over an interconnect. Right, a multicore system: threads 1…8 on one chip, sharing an L2 and a single memory.]

(27)

Trends (my guess!)

[Figure: sketched trend curves over time, with "now" marked, for Threads/Chip, Cache/Thread, Bandwidth/Thread, Memory/Chip, Transistors/Thread, and thread-communication cost (temporal).]

(28)

Dept of Information Technology|www.it.uu.se

28

© Erik Hagersten|user.it.uu.se/~eh

AVDARK 2010

What matters for multicore performance?

Are we buying…
- CPU frequency?
- Number of cores?
- MIPS and FLOPS?
- Memory bandwidth?
- Cache capacity?
- Memory capacity?
- Performance/watt?

(29)

MC Questions for the Future

• How to get parallelism?
• How to get performance / data locality?
• How to debug?
• A case for new funky languages?
• A case for automatic parallelization?
• Are we buying:
  - compute power,
  - memory capacity, or
  - memory bandwidth?
• Will 128 cores be mainstream in 5 years?
• Will the CPU market diverge into desktop/capacity/capability/special-purpose CPUs again?

(30)

x86 Architecture

Erik Hagersten
Uppsala University
Sweden

(31)

Intel Archeology

• (8080: 1974, 6.0 kT, 2 MHz, 8-bit)
• 8086: 1978, 29 kT, 5-10 MHz, 16-bit (PC!)
• (80186: 1982, ? kT, 4-40 MHz, integration!)
• 80286: 1982, 0.1 MT, 6-25 MHz, chipset (PC-AT)
• 80386: 1985, 0.3 MT, 16-33 MHz, 32-bit
• 80486: 1989, 1.2 MT, 25-50 MHz, I&D$, FPU
• Pentium: 1993, 3.1 MT, 66 MHz, superscalar
• Pentium Pro: 1995, 5.5 MT, 200 MHz, O-O-O, 3-way superscalar
• Pentium 4: 2001, 42 MT, 1.5 GHz, super-pipelined, L2$ on-chip
• …

(32)


8086 registers

"General purpose" registers:
• AX (accumulator)
• BX (base)
• CX (count)
• DX (data)

"Addressing" registers:
• SP (stack ptr)
• BP (base ptr)
• SI (source index)
• DI (destination index)

Segment registers, for segmented addressing (extending the address range):
• CS (code segment)
• DS (data segment)
• SS (stack segment)
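To make segmented addressing concrete (my sketch, not from the slides): the 8086 forms a 20-bit physical address as segment × 16 + offset, so two 16-bit registers together reach 1 MB.

    #include <stdint.h>
    #include <stdio.h>

    /* 8086 real-mode address formation: a 16-bit segment register and
       a 16-bit offset combine into a 20-bit physical address. */
    static uint32_t phys_addr(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;   /* segment * 16 + offset */
    }

    int main(void) {
        /* e.g., CS:IP = 0x1234:0x0010 -> physical address 0x12350 */
        printf("0x%05X\n", (unsigned)phys_addr(0x1234, 0x0010));
        return 0;
    }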

(33)

Complex instructions of x86

• RISC (Reduced Instruction Set Computer)
  - LD/ST with a limited set of addressing modes
  - ALU instructions (a minimum)
  - Many general-purpose registers
  - Simplifications (e.g., reading R0 returns the value 0)
  - Simpler ISA ⇒ more efficient implementations
• x86 CISC (Complex Instruction Set Computer)
  - ALU/memory in the same instruction
  - Complicated instructions
  - Few, specialized registers (actually an accumulator architecture)
  - Variable instruction length
  - x86 lagged RISC in performance in the 90s

(34)


Micro-ops

• Newer pipelines implement RISC-ish μ-ops.
• Some complex x86 instructions are expanded into several micro-ops at runtime.
• The translated μ-ops may be cached in a trace cache [in their predicted order] (first: Pentium 4).
• Extended to a "loop cache" in Core 2.

(35)

x86-64

• ISA extension to x86 (by AMD, 2001)
• 64-bit virtual address space
• 16 64-bit GP registers:
  x86's regs extended: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi
  x86-64 also adds: r8, r9, … r15 (i.e., a total of 16 regs)
  NOTE: dynamic register renaming makes the effective number of regs higher
• SSEn: 16 128-bit SSE "vector" registers
• Backwards compatible with x86

Intel's adoptions: IA-32e, EM64T, Intel 64

(36)


x86 vector instructions

• MMX: 64-bit vectors (e.g., two 32-bit ops)
• SSEn: 128-bit vectors (e.g., four 32-bit ops)
• AVX: 256-bit vectors (e.g., eight 32-bit ops) (in Sandy Bridge, ~Q1 2011)
• MIC: "16-way vectors". Is this 16 × 32 bits??

(37)

Examples of vector instructions

[Figure: vector registers A–E; SSE_MUL D, B, A multiplies the four elements of B and A pairwise (×, ×, ×, ×) and writes the four results into D.]
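The same idea via the SSE intrinsics in C (my sketch; the values are arbitrary): _mm_mul_ps performs four 32-bit floating-point multiplies with one 128-bit instruction.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float d[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats into a vector reg */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vd = _mm_mul_ps(va, vb);   /* four multiplies, one instruction */
        _mm_storeu_ps(d, vd);

        printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);
        return 0;
    }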

(38)


How to exploit SIMD: nVIDIA

• 512 "processors" (P)
• 16 P per stream processor (SP)
• An SP is SIMD-ish (sort of)
• Full DP-FP IEEE support
• 64 kB L1 cache per SP
• 768 kB global shared cache (less than the sum of the L1s)
• Atomic instructions
• ECC correction
• Debugging support
• Giant chip / high power
• …

[Figure: per-SP L1 caches backed by a shared L2.]

(39)

How to exploit SIMD: Intel MIC

"more than 50 cores"

(40)

Exponential Growth

Erik Hagersten Uppsala University

Sweden

(41)

Ray Kurzweil pictures

www.KurzweilAI.net/pps/WorldHealthCongress/

(42)


Ray Kurzweil pictures

www.KurzweilAI.net/pps/WorldHealthCongress/

(43)

Ray Kurzweil pictures

www.KurzweilAI.net/pps/WorldHealthCongress/

(44)


Ray Kurzweil pictures

www.KurzweilAI.net/pps/WorldHealthCongress/

(45)

Doubling (or Halving) Times [Kurzweil 2006]

• Dynamic RAM memory (bits per dollar): 1.5 years
• Average transistor price: 1.6 years
• Microprocessor cost per transistor cycle: 1.1 years
• Total bits shipped: 1.1 years
• Processor performance in MIPS: 1.8 years
• Transistors in Intel microprocessors: 2.0 years
• Microprocessor clock speed: 2.7 years
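As a worked example (my arithmetic, not Kurzweil's): a doubling time of T years means growth by a factor of 2^(t/T) after t years. Over 15 years, DRAM bits per dollar (T = 1.5) grow by 2^(15/1.5) = 2^10 ≈ 1000×, while clock speed (T = 2.7) grows by only 2^(15/2.7) ≈ 47×, which is why the rows of the table diverge so quickly.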
