(1)

The Future is not what it used to be...

Erik Hagersten

(2)

AVDARK

Then...

ENIAC 1946 (”5kHz”)

18,000 vacuum tubes

Programmed by rewiring cables

(3)

Then (in Sweden)

• BARK (~1950): 8,000 relays, 80 km of cables
• BESK (~1953): 2,400 vacuum tubes, ”20 kHz” (world record)

(4)

“Recently” APZ 212, 1983

Ericsson’s Supercomputer (“5 MHz”)

(5)

APZ 212

Marketing brochure quotes:

• ”Very compact”
• 6 times the performance
• 1/6th the size
• 1/5th the power consumption
• ”A breakthrough in computer science”
• ”Why more CPU power?”
• ”All the power needed for future development”
• ”…800,000 BHCA, should that ever be needed”
• ”SPC computer science at its most elegance”
• ”Using 64 kbit memory chips”
• ”1500W power consumption”

(6)

65 years of “improvements”

• Speed
• Size
• Price
• Price/performance
• Reliability
• Predictability
• Energy
• Safety
• Usability ...

(7)

”Moore’s Law”

Popular version: performance doubles every 18 to 24 months

[Figure: performance (log scale, 1 to 1000) versus year, with a single-core segment and a multicore segment; 2006 is marked on the year axis.]

(8)

Ray Kurzweil pictures

www.KurzweilAI.net/pps/WorldHealthCongress/

(11)

Exponential development:

Doubling/halving times (according to Kurzweil)

Quantity                                      Doubling/halving time
Dynamic RAM Memory (bits per dollar)          1.5 years
Average Transistor Price                      1.6 years
Microprocessor Cost per Transistor Cycle      1.1 years
Total Bits Shipped                            1.1 years
Processor Performance in MIPS                 1.8 years
Transistors in Intel Microprocessors          2.0 years

[Figure: log scale (1 to 1000) versus time.]

(13)

Linear scale, 1940 to 2017

(2x performance every 18 months)

[Figure: performance (linear scale, 0 to 4E+15) versus year (1940 to 2010), doubling every 18 months since 1940.]
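The shape of the linear-scale plot can be checked with a few lines of arithmetic. The following is a minimal Python sketch, assuming an idealized doubling exactly every 18 months and a starting value of 1 in 1940; the function name `growth_factor` is an illustrative choice, not something from the slides.

```python
def growth_factor(start_year: float, end_year: float,
                  doubling_years: float = 1.5) -> float:
    """Relative performance at end_year, taking performance at start_year as 1."""
    return 2.0 ** ((end_year - start_year) / doubling_years)


if __name__ == "__main__":
    total = growth_factor(1940, 2017)
    print(f"1940 -> 2017: {total:.2e}x")
    # Fraction of the 2017 value that had been reached at earlier points in time.
    for year in (1980, 2000, 2010, 2017):
        share = growth_factor(1940, year) / total
        print(f"{year}: {share:.2e} of the 2017 value")
```

Under these assumptions the total factor is on the order of 10^15, which matches the range of the vertical axis, and almost all of the final value is reached in the last few years. That is why the curve looks flat for decades on a linear scale.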

(14)

Exponential development

Example: Doubling every two years

How long does it take for a 1000x improvement?

Example: Doubling every 18 months

How long does it take for a 1000x improvement?

[Figure: the same growth on a log scale (1 to 1000) versus time, and on a linear scale (marked "?").]
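One way to work out the two questions: a factor of 1000 is just under ten doublings (2^10 = 1024), so the time needed is the doubling period times log2(1000). A minimal Python sketch, with a hypothetical helper name `years_to_reach`:

```python
import math


def years_to_reach(factor: float, doubling_years: float) -> float:
    """Years needed to improve by `factor` when the value doubles every `doubling_years`."""
    if factor <= 0 or doubling_years <= 0:
        raise ValueError("factor and doubling_years must be positive")
    return doubling_years * math.log2(factor)


if __name__ == "__main__":
    for period in (2.0, 1.5):
        print(f"doubling every {period} years: "
              f"{years_to_reach(1000, period):.1f} years to reach 1000x")
```

This gives roughly 20 years for a two-year doubling period and roughly 15 years for an 18-month period.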

(15)

Looking Forward

Three rules of common wisdom:

• Do not bet against exponential trends
• Do not bet against exponential trends
• Do not bet against exponential trends

But, is it possible to continue ”Moore’s Law”?

- Are there show-stoppers?
- Can we utilize an exponential growth of #cores?

(16)

Not everything scales as fast!

Example: 470.LBM

"Lattice Boltzmann Method" to simulate incompressible fluids in 3D

Throughput (as defined by SPEC):

Amount of work performed per time unit when several instances of the application are executed simultaneously.

Our TP study: compare the TP improvement when going from 1 core to 4 cores

[Figure: throughput (0 to 3.5) versus number of cores used (1 to 4); the value 1.0 is marked.]

(17)

Nerd Curve: 470.LBM

[Figure: miss rate versus cache size. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem were fixed. Marked points: running one thread and running four threads, with miss rates 3.5% and 5.0%.]

• Less work per memory byte moved at four threads

(18)

Remember: It is getting worse!

[Diagram: four CPUs and one DRAM.]

Computation vs Bandwidth

[Figure: (#T * T_freq) / (#P * P_freq), from 0 to 6, versus year (2007 to 2015). Labels: #Cores ~ #Transistors; #Pins.]

From Karlsson and Hagersten. Conserving Memory Bandwidth in Chip Multiprocessors with Runahead Execution. IPDPS March 2007. [graph updated with more recent data]

Source: International Technology Roadmap for Semiconductors (ITRS)

HPCwire Feb 2011 [cites Linley Gwennap and Justin Rattner]

Without Silicon Photonics, Moore's Law Won't Matter HPCwire Feb 2011

Growing Data Deluge Prompts Processor Redesign

(19)

Case study: Limited by bandwidth

(20)

Nerd Curve (again)

[Figure: miss rate versus cache size, running four threads. Curves: miss rate (excluding HW prefetch effects); utilization, i.e., fraction of cache data used (scale to the right); possible miss rate if the utilization problem were fixed. Marked miss rates: 5.0% for the original application and 2.5% for the optimized application.]

• Twice the amount of work per memory byte moved
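The "twice the work per memory byte moved" figure follows from the two miss rates on this slide if memory traffic is assumed to be proportional to the cache miss rate, with line size and number of accesses per unit of work unchanged. A minimal Python sketch under that assumption; the helper name is hypothetical, and the pairing of 5.0% with the original and 2.5% with the optimized application is read off the slide labels.

```python
def relative_work_per_byte(miss_rate_before: float, miss_rate_after: float) -> float:
    """Work per byte moved after a change, relative to before.

    Assumes memory traffic is proportional to the miss rate, so halving the
    miss rate halves the bytes moved for the same amount of work.
    """
    if miss_rate_before <= 0 or miss_rate_after <= 0:
        raise ValueError("miss rates must be positive")
    return miss_rate_before / miss_rate_after


if __name__ == "__main__":
    # Miss rates in percent: original vs. optimized application at four threads.
    print(relative_work_per_byte(5.0, 2.5))  # 2.0
```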

(21)

• Better Memory Usage!

Example: 470.LBM

[Figure: throughput (0 to 3.5) versus number of cores used (1 to 4). Curves: original code; code modified to promote better cache utilization.]

(22)

Example 2: A Scalable Parallel Application

App: Cigar

[Figure: performance (0 to 4) versus number of cores (1 to 4).]

Looks like a perfectly scalable application!

Are we done?

(23)

Example 2: The Same Application Optimized

App: Cigar

[Figure: performance (0 to 30) versus number of cores (1 to 4). Curves: original and optimized; a 7.3x difference is marked.]

Looks like a perfectly scalable application!

Are we done?

• Duplicate one data structure

(24)

Implementation Trends

(25)

Predicting the future is hard

Predicting: “Chip Multiprocessor” aka Multicores

[from PARA Bergen 2000]

[Diagram: Chip Multiprocessor (CMP): four simple fast CPUs, each with its own L1 cache ($1), an L2 cache (L2$), a memory interface (Mem I/F) to Mem, and an external interface (External I/F); t threads.]

-- many open questions

(26)

Multi-CMPs

[from PARA Bergen 2000]

[Diagram: c chips, each with its own Mem, connected by an Interconnect.]

Explicit parallelism:

#chips x #threads/chip

• Global shared memory

• Global/local comm cost >10

• Gotta’ explore small caches

• Gotta’ explore locality!

• OS scalability ?

• Application scalability ?

(27)

Why Multicores Now?

-- How is ”Moore’s Law” doing? --

1. Not enough ILP/MLP to get payoff from using more transistors

2. Signal propagation delay » transistor delay

3. Power consumption: $P_{dyn} \sim C \cdot f \cdot V^2$

[Figure: performance (log scale) versus time, with a single-core segment and a multicore segment; ~2007 is marked.]
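The dynamic power relation on this slide (power proportional to capacitance times frequency times voltage squared) is one way to see why several slower cores can be attractive. A minimal Python sketch, assuming (for illustration only, not from the slides) that supply voltage can be lowered in proportion to frequency, that switched capacitance per core is constant, and that the workload parallelizes perfectly:

```python
def relative_dynamic_power(cores: int, freq_scale: float, volt_scale: float) -> float:
    """Dynamic power relative to one core at nominal frequency and voltage.

    Uses P ~ C * f * V^2 per core, with the same capacitance C for every core.
    """
    return cores * freq_scale * volt_scale ** 2


if __name__ == "__main__":
    one_fast = relative_dynamic_power(cores=1, freq_scale=1.0, volt_scale=1.0)
    # Two cores at 80% frequency and (assumed) 80% supply voltage.
    two_slow = relative_dynamic_power(cores=2, freq_scale=0.8, volt_scale=0.8)
    print(f"one core, nominal:    power {one_fast:.3f}, peak throughput 1.0")
    print(f"two cores at 0.8 f/V: power {two_slow:.3f}, peak throughput {2 * 0.8:.1f}")
```

With these hypothetical numbers, two slower cores draw about the same dynamic power as one nominal core while offering 1.6 times the peak throughput, provided the software supplies the parallelism.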

(28)

Darling, I shrunk the computer

Mainframes
Super Minis
Microprocessor (with Mem)
Chip Multiprocessor (CMP): A multiprocessor on a chip! (with Mem)

Sequential execution (≈ one program)

Need TLP to make one chip run fast

Paradigm Shift

(29)

HPC in the Rear Mirror...

1980 1990 2000 2010 ????

Nifty Parallel Vector

†Not general Expensive

†Hard to use No standards

Killer Micro SMPs

Beowulf x86 Linux Clusters MC Clusters

MC + Accelerators

* Forced by technology

†High cost, Bad scaling

* Promise of performance

†COTS perf management

* Scalability Naive view

* UNIX Commercial computing

* COTS cost

convergence † ????

† ????

(30)

Parallelism can be used to hide memory latency

• Intel ”Hyper Threading”
• T1 Niagara, MIC, … (4 threads per core)
• GPUs
• ...
• Is this a good idea?
• It cannot hide the need for bandwidth!
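The last point can be illustrated with a toy model. The sketch below is a simplification with hypothetical numbers, not a description of any of the processors listed: each work item needs some compute time, stalls on memory for some time, and moves a fixed number of bytes. Extra threads overlap the stalls, but throughput can never exceed what the memory bandwidth can feed.

```python
def throughput(threads: int, compute_ns: float, stall_ns: float,
               bytes_per_item: float, bandwidth_gb_s: float) -> float:
    """Work items per nanosecond for one core under a simple two-limit model."""
    # Latency limit: with enough threads, one thread's stalls overlap with
    # the compute phases of the others, until the core is fully busy.
    utilization = min(1.0, threads * compute_ns / (compute_ns + stall_ns))
    latency_bound = utilization / compute_ns
    # Bandwidth limit: 1 GB/s equals 1 byte/ns, so this is the number of
    # items per nanosecond that the memory system can supply.
    bandwidth_bound = bandwidth_gb_s / bytes_per_item
    return min(latency_bound, bandwidth_bound)


if __name__ == "__main__":
    # Hypothetical workload: 10 ns compute, 90 ns stall, 64 bytes per item, 3.2 GB/s.
    for t in (1, 2, 4, 8, 16):
        print(f"{t:2d} threads: {throughput(t, 10, 90, 64, 3.2):.3f} items/ns")
```

In this example, throughput grows with the thread count up to four threads and then flattens at the bandwidth limit; adding more threads beyond that point hides no more latency that matters.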

(31)

Parallelism is a Hard Currency

[Figure: speedup versus parallelism.]

Remember Amdahl’s Law?
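As a reminder, Amdahl's Law says that if a fraction p of the work can be parallelized over n workers, the speedup is 1 / ((1 - p) + p / n). A minimal Python sketch; the function name and the example fraction of 0.95 are illustrative choices, not values from the slides.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for n in (4, 16, 1000):
        print(f"p = 0.95, {n:4d} workers: speedup {amdahl_speedup(0.95, n):.2f}")
```

With p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, no matter how many workers are added.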

(32)

Do you have 1000 threads to spare?

SIMD rears its ugly head again

• 512 ”cores” (C)
• 16 C/StreamProcessor (SP)
• SP is SIMD-ish (sort of)
• Full DP-FP IEEE support
• 64kB L1 cache/SP
• 768kB global shared cache (less than the sum of the L1s)
• Atomic instructions
• ECC correction for DRAM
• Debugging support
• Synch within SP efficient
• Giant chip/high power
• ...

[Diagram labels: L1; L2; This is SIMD-ish (aka Vector)]

(33)

SIMD and CPUs?

NVIDIA Fermi:

• Special language...
• Topology matters...
• User-managed memory

[Diagram labels: Coh. Mem.; I/O bus]

Common research papers: ”How to get 100X speedup”.

Those results are starting to get debunked [ISCA 2010, IBM Journal 201

Reminds me of:

† Hard to use, no standards
* Scalability

(34)

SIMD and CPUs?

Intel’s Knights Ferry [MIC]

(topology like Sandy Bridge) Vector instruction

[Pic from Michael Wulf, PGI]

[Diagram label: Coh. Mem.]

Other efforts:

• AMD Fusion (x86 + GPU)
• ARM + NVIDIA collaboration (project Denver)

(35)

Trends for 2016

• No major revolution of the Multicore magnitude
• Challenge: Will the number of cores double every 2 years?
• Moving towards MIMD+SIMD “fusion”
• Architecture complexity grows
• Bumpy memory/communication costs
• Heterogeneous architectures (e.g., ARM: big.LITTLE)
• Memory bandwidth the bottleneck
• Energy is a first-class citizen
• Users are getting less computer-savvy (and ideally should not have to be)

(36)

Implications

• One size will not “fit all”
• SIMD parallelism will be more prominent, but the jury is still out on how this will be done
• More heterogeneous arch (size, mem, isa)
• More parallelism needed, but ... memory/power to become the bottleneck anyhow
• Diversity: Different applications will need different “heterogeneous configurations”

Even harder to use resources efficiently

(37)

HiPEAC Roadmap -- High Performance Embedded Architecture and Compilers
