6th FPGAworld CONFERENCE Book

SEPTEMBER 2009

EDITORS

Lennart Lindh, David Källberg, Santiago de Pablo and Vincent J. Mooney III

The FPGAworld Conference addresses aspects of digital and hardware/software system engineering on FPGA technology. It is a discussion and network forum for students, researchers and engineers working on industrial and research projects, state-of-the-art investigations, development and applications. The book contains some presentations; for more information see (www.fpgaworld.com/conference).

ISBN

978-91-977667-2-2

Copyright and Reprint Permission for personal or classroom use are allowed with credit to

FPGAworld.com. For commercial or other for-profit/for-commercial-advantage uses, prior


2009 PROGRAM COMMITTEE

General Chair

Lennart Lindh, FPGAworld, Sweden

Publicity Chair

David Källberg, FPGAworld, Sweden

Academic Programme Chair

Vincent J. Mooney III, Georgia Institute of Technology, USA, and Nanyang

Technological University, Singapore

Academic Publicity Chair

Santiago de Pablo, University of Valladolid, Spain

Academic Programme Committee Members

Ketil Roed, Bergen University College, Norway

Lennart Lindh, Jönköping University, Sweden

Adam Postula, University of Queensland, Australia

Pramote Kuacharoen, National Institute of Development Administration, Thailand

Santiago de Pablo, University of Valladolid, Spain

Industrial Programme Chair

Lennart Lindh, Jönköping University, Sweden

Industrial Programme Committee Members

Solfrid Hasund, Bergen University College
Kim Petersén, HDC, Sweden
Mickael Unnebäck, ORSoC, Sweden
Fredrik Lång, EBV, Sweden
Niclas Jansson, BitSim, Sweden
Göran Bilski, Xilinx, Sweden
Adam Edström, Elektroniktidningen, Sweden
Espen Tallaksen, Digitas, Norway
Göran Rosén, Actel, Sweden
Tommy Klevin, ÅF, Sweden
Tryggve Mathiesen, BitSim, Sweden
Fredrik Kjellberg, Net Insight, Sweden
Daniel Stackenäs, Altera, Sweden
Martin Olsson, Synective Labs, Sweden
Stefan Sjöholm, Prevas, Sweden
Ola Wall, Synplicity, Sweden
Torbjorn Soderlund, Xilinx, Sweden
Anders Enggaard, Axcon, Denmark
Doug Amos, Synplicity, UK
Guido Schreiner, The Mathworks, Germany
Stig Kalmo, Engineering College of Aarhus, Denmark
Hichem Belhadj, Actel, USA
Rolf Sylvester-Hvid, Aktuell Elektronik

This year’s conference is held in Stockholm (Sweden) and Copenhagen

(Denmark).

We try to balance student, academic and industrial presentations, exhibits and tutorials to give our attendees a unique chance to obtain knowledge from different perspectives.

Track A - Industrial

Track A features presentations with a focus on industrial applications. The presenters were selected by the Industrial Programme Committee. In total, 11 papers were presented (30-minute slots).

Track B - Academic

Track B features presentations with a focus on academic papers and industrial applications. The presenters were selected by the Academic Programme Committee. 7 out of the 12 submitted papers were presented (30-minute slots).

Track C - Product presentations

Track C features product presentations from our exhibitors and sponsors (30-minute slots).

Exhibitors:

A total of 27 unique exhibitors in Stockholm and Copenhagen.

Sponsors:

6 Sponsors of lunch, coffee and snacks.

Student projects:

4-5 Master's thesis projects were presented.

Please check out the website (http://fpgaworld.com) for more information about FPGAworld. In addition, you may contact David Källberg (david@fpgaworld.com) for more information.

We would like to thank all of the authors for submitting their papers. We hope that the attendees enjoyed the FPGAworld conference and that you will come to next year’s conference.


Andrew Dauman, Synopsys

10:00 - 10:30  Coffee Break, Sponsored by Synopsys
10:30 - 11:30  Exhibitors Presentation
11:30 - 12:30  Lunch Break, Sponsored by Abound Logic
12:30 - 14:30  Session Chair: Anders Enggaard

Session A1: Making a simple VHDL testbench - step-by-step
Session A2: Prototyping and Verifying HDL Code with Graphical Development Tool
Session A3: A shortcut to hardware using C - a case study from the real world...
Session A4: FPGA Development with Altium Designer

Session C1: The MAGIC of acquisition and generators (Bitsim)
Session C2: Implementing PCI Express® In High Performance Or Low Cost FPGAs (Silica)
Session C3: A general testbench infrastructure for simple verification (Digitas)
Session C4: Bugs & Problems; - Worst Disasters through many interesting years (Digitas)

14:30 - 15:00  Coffee Break
15:00 - 17:00  Session Chair: Tryggve Mathiesen

Session A5: Breaking through FPGA performance barriers
Session A6: FPGA at 40nm: A great leap forwards...or a leap in the dark?
Session A7: Ultra-low Power FPGAs for 'Cool' Portable Applications
Session A8: Large scale real-time data acquisition and signal processing in SARUS

Session C5: Designing a simple OVM Testbench (Dyrberg Trading)
Session C6: Save time and money by reducing FPGA-PCB revisions - and ensure correct FPGA IO pin assignment (Nordcad)
Session C7: FPGA Raptor (Abound Logic)
Session C8: Products and Roadmap

09:15 - 10:00  The Impact of Reconfigurable Computing on Manycore Programming Trends, Dr. Reiner Hartenstein, professor of Computer Science at the University of Kaiserslautern
10:00 - 10:30  Coffee Break, Sponsored by Synopsys
10:30 - 11:30  Session Chairs: Tryggve Mathiesen, Vincent J. Mooney III, Kristina Kristoffersson

Session A1: Breaking through FPGA Performance Barriers
Session A2: Making a simple VHDL testbench - step-by-step

Session B1: Design of BBN-based Framework for Adaptive IP-reuse
Session B2: Camera and LCM IP-Cores for NIOS SOPC System

Session C1: The MAGIC of acquisition and generators (Bitsim)
Session C2: Products and Roadmap (DINI Group)

11:30 - 12:30  Lunch Break, Sponsored by Mentor Graphics
12:30 - 14:30  Session Chairs: Tryggve Mathiesen, Johnny Öberg, Doug Amos

Session A3: Milkymist™
Session A4: Prototyping and Verifying HDL Code with Graphical Development Tool
Session A5: FPGA: the Verification Platform of the future?

Session B3: Implementing True Random Number Generators by Overfilling the FPGA Chip
Session B4: Combined simulation and emulation setup for complex image processing algorithms in VHDL
Session B5: On-Chip Transactional Memory System for FPGAs using TCC model

Session C3: A shortcut to hardware using C - a case study from the real world... (Bitsim)
Session C4: A general testbench infrastructure for simple verification (Digitas)
Session C5: Bugs & Problems; - Worst Disasters through many interesting years (Digitas)

14:30 - 15:00  Coffee Break
15:00 - 16:30  Session Chairs: Fredrik Lång, Santiago de Pablo, David Källberg

Session A6: CASE STUDY: FPGA technology in robotics
Session A7: FPGA at 40nm: A great leap forwards...or a leap in the dark?
Session A8: Ultra-low Power FPGAs for 'Cool' Portable Applications

Session B6: Power and Energy Efficiency Evaluation for HW and SW Implementation of nxn Matrix Multiplication on Altera FPGAs
Session B7: Design and Implementation of a Plesiochronous Multi-Core 4x4 Network-on-Chip FPGA Platform with MPI HAL Support

Session C6: Live demo of an OpenRISC processor SoC, running Linux and showing the great possibilities of an Open-source system
Session C7: Designing a simple OVM Testbench (Mentor Graphics)
Session C8: FPGA Raptor

Key Note Session

The Impact of Reconfigurable Computing on Manycore Programming Trends

Reiner Hartenstein (reiner@hartenstein.de)

Keynote, 10 Sep 2009, Stockholm, Sweden, 9:15 – 10:00

Teaching for Change: an early martyr

„Turing is irrelevant“

The von Neumann model is the emulation of a tape machine (mimicking tape on RAM)

D. L. Parnas (keynote): "Teaching for Change"; 10th Conf. Softw. Engineering Education and Training (CSEET '97); http://www.sigsoft.org/SEN/parnas.html

„The von Neumann syndrome“: coined ~ a decade later: Prof. C.V. Ramamoorthy (UC Berkeley), SDPS 2006, San Diego, CA

Critique of von Neumann is not new: punished for blasphemy?

Peter G. Neumann

http://hartenstein.de

Outline

(1)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

Impact of the von Neumann Syndrome

Dig more coal -- the PCs are coming; Peter W. Huber, Mark P. Mills, 05.31.99; http://www.forbes.com/forbes/1999/0531/6311070a.html

never run out of energy?

[Figure: typical oil field operation, production (%) from 0 to 100 over time; energy sources shown: coal, hydro, nuclear, gas, oil; labels: > 30 %, ~ 55 %]

2007: 80% crude oil coming from decline fields; natural gas: similar situation [Fatih Birol, Chief Economist IEA]. https://www.theoildrum.com/

„6 more Saudi Arabias needed for demand predicted for 2030“

Server Farms at banks of the Columbia river: the electricity bill is a key issue [Randy Katz: IEEE Spectrum, Febr. 2009]

[Figure: map of WASHINGTON and OREGON with the sites Quincy, Dalles and Boardman]

at Quincy, size: 10 Am. football fields; 48 MW: power for 40,000 homes

at The Dalles: 2, each 6500 m²

Power consumption by internet: x30 til 2030 if trends continue [G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008]

Power Consumption of Computers

... has become an industry-wide issue: incremental improvements are on track, but „we may ultimately need revolutionary new solutions“ [Horst Simon, LBNL, Berkeley]

Energy cost may overtake IT equipment cost in the near future [Albert Zomaya]

Current trends will lead to unaffordable future operation cost of our cyber infrastructure (subject of my talk)

Outline

(2)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

Single-core approach: Software Performance

[Figure: relative performance (10^3 to 10^9, logarithmic) versus year, 1970 to 2008]

free ride on Moore’s Law: the burden of software performance is the task of chip designers*

*) M-&-C-created population

Rapid VLSI Design Education Revolution, 1980 - 1983

The incubator of the free ride on Moore’s law

massive funding: DARPA; NSF; many national governments; European Union …

E.I.S. project

(Heinz Riesenhuber)

Created the missing designer population

The End of Moore’s Law

[Figure: relative performance (10^3 to 10^9, logarithmic) versus year, 1970 to 2008; marked: 2005, the end of the single-core era]

The end of Moore’s Law soon: the 20 nm wall

traditional instruction-based computing is running out of steam [DAC’09 special session: Computation in the Post-Turing Era]

Growth beyond Moore’s Law?

[Figure: relative performance (10^3 to 10^13, logarithmic) versus year, 1970 to 2030; marked: the end of the single-core era]

number of transistors doubles every 2 years

Multimedia in the Multicore Era

[Figure: relative performance versus year, 1994 to 2030; marked: begin of the multicore era]

Multimedia Performance Needs [Pierre Paulin, MPSoC’09]

application    performance needs up to:
Audio          800 MIPS
Graphics       11 GOPS
Video          160 GOPS
Digital TV     900 GOPS

Broadband in the Multicore Era

[Figure: needed performance (MIPS) per standard (GSM, GPRS, EDGE, UMTS) and relative performance versus year, 1994 to 2030; marked: begin of the multicore era] [courtesy E. Sanchez]

needed performance growing faster than Moore’s law

ICT is at an inflection point

Cowhey’s & Aronson’s Law: affordable broadband & software performance

Senior Counselor to the U.S. Trade Representative (USTR) on strategy and negotiations.

Cheap Revolution: „Broadband is significant at the inflection point, prompting major market governance changes“

massive funding needed

„Future prosperity depends on network capacity, ..., efficient pricing, and flexible platforms“

handheld & living room commercially more important than the comparatively small PC market.

requirement (MIPS) growing faster than Moore’s law [courtesy E. Sanchez]

Funding market governance changes

The Recovery Act: $7.2 billion
RUS Broadband Initiatives Program (BIP)
NTIA Broadband Technology Opportunities Program (BTOP)
http://www.broadbandusa.gov/

other sources: ARPA-E ? EU-FP7 ? DARPA ? EFRCs ? (Energy Frontier Research Centers, 777 mio $)

Outline

(3)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

Multicore has been around for decades

Dead (Super)Computer Society [Gordon Bell, keynote, ISCA 2000]

•ACRI
•Alliant
•American Supercomputer
•Ametek
•Applied Dynamics
•Astronautics
•BBN
•CDC
•Convex
•Cray Computer
•Cray Research
•Culler-Harris
•Culler Scientific
•DAPP
•Denelcor
•Elexsi
•ETA Systems
•Evans and Sutherland Computer
•Floating Point Systems
•Galaxy YH-1
•Goodyear Aerospace MPP
•Gould NPL
•Guiltech
•ICL
•Intel Scientific Computers
•International Parallel Machines
•MasPar
•Meiko
•Multiflow
•Myrias
•Numerix
•Prisma
•Tera
•Thinking Machines
•Saxpy
•Scientific Computer Systems (SCS)
•Soviet Supercomputers
•Supertek
•Supercomputer Systems

only 2 or 3 successes; most in 1985-1995 - mainly research

Speed-up factors by GPGPUs (1)

[Figure: speed-up factor* (10^0 to 10^3, logarithmic) of GPGPU applications in imaging, video, bioinformatics and numerics, Jan 2007 to Jan 2010; up to ~150 x] [Michael Garland, NVIDIA Research: Parallel Computing on Manycore GPUs; IPDPS, Rome, Italy, June 25-29, 2009]

*) migration from x86 singlecore

The power efficiency is disputable

this hardware can be used only in certain ways.

effective only at problems that can be solved using stream processing.

streams provide data parallelism

such speed-ups by GPGPUs only for embarrassingly parallel applications

Speed-up factors by GPGPUs (2)

[Figure: speed-up factor (10^0 to 10^3, logarithmic) per application area: Cryptography, Imaging, Bioinformatics, CFD (Computational Fluid Dynamics), DCC (Digital Content Creation), Graphics, Astrophysics, DSP (Digital Signal Processing), EDA, oil & gas; up to ~600 x]

CUDA ZONE pages [NVIDIA Corp.]: non-reviewed CUDA user submissions
http://www.nvidia.co.uk/object/cuda_home_uk.html#state=home

Compute Unified Device Architecture (CUDA), accelerates BLAS libraries (Basic Linear Algebra Subprograms)

Less flexible (GPGPU tool development years earlier than f. x86)

NVIDIA GeForce GTX    stream processor cores    minimum power supply
275                   240                       650–680 watt
295                   480                       650–680 watt

Intel Xeon "Nehalem-EX" for servers: 8 cores
Intel Core™2 Quad (desktop PCs): 4 cores

Growth by Multicore

[Figure: relative performance versus year, 1994 to 2030; marked: begin of the multicore era]

John Hennessy:

Hastily knitted compilers for the heavy lifting? e. g. automatically parallelizing compilation via multi-threading, and many other ad-hoc solutions

“wait for current generation of programmers to die off and be replaced

new types of bugs introduced

Outline

(4)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

[Figure: relative performance versus year, 1994 to 2030; marked: begin of the multicore era]

Speed-up factors obtained by Software to Configware migration (up to ~30,000x)

[Figure: speed-up factor (logarithmic, 10^0 to 10^6) per application]

by FPGA:

DSP and wireless: FFT 100; Reed-Solomon Decoding 2400; Viterbi Decoding 400; MAC 1000
Bioinformatics: molecular dynamics simulation 88; BLAST 52; protein identification 40; Smith-Waterman pattern matching 288; DNA & protein sequencing 8723
Astrophysics: GRAPE 20
Image processing, Pattern matching, Multimedia: SPIHT wavelet-based image compression 457; real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; CT imaging 3000
crypto 1000; DES breaking 28514

vs. GPU (CUDA ZONE: ~50x; Garland IPDPS’09: (200x)): almost 50x; 2 orders of magnitude

intel supports direct front side bus access by FPGAs

“... design techniques will evolve, by necessity, to satisfy the demands of reconfigurable hardware and software programmability”. J. R. Rattner, DAC 2008

+ Pre-FPGA solutions (speed-up values shown: 39.4, 160, 2000, 15000):

2-D FIR filter (no FPGA: DPLA by TU-KL*)
Lee Routing (DPLA by TU-KL*)
Grid-based DRC: no FPGA: DPLA on MoM by TU-KL*
Grid-based DRC* („fair comparison“)

fabricated by E.I.S.

Software vs. FPGA

[Figure: relative performance (10^0 to 10^5, logarithmic) versus year, 1996 to 2008]

Benchmarks: Moore’s law does not indicate microprocessor MIPS!

[Figure: SPECfp2000/MHz/Billion Transistors and 0.5 x MOPS/MHz/Billion Transistors (scale 0 to 200) versus year, 1990 to 2000] [BWRC, UC Berkeley, 2004]

Moore’s law not applicable to all aspects

For multicore*: the Law of More … with drastically declining programmer productivity

*) number of cores doubles every 2 years

Software vs. FPGA (2)

Massive Energy Saving factors: ~10% of speedup factor

[Figure: speed-up factors per application (DSP and wireless, Bioinformatics, Astrophysics, Image processing, Pattern matching, Multimedia, CT imaging, crypto, DES breaking), logarithmic scale]

[Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008]

RC*: Demonstrating the intensive Impact

SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster (Tarek El-Ghazawi)

Application                   Speed-up factor    Savings: Power    Cost    Size
DNA and Protein sequencing    8723               779               22      253
DES breaking                  28514              3439              96      1116

much less equipment needed; much less memory and bandwidth needed; massively saving energy
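The slide's rule of thumb (energy saving factors of about 10% of the speed-up factor) can be checked against the two rows of the table above. The following is a minimal sketch, not part of the original talk; the dictionary layout and the function name `power_to_speedup_ratio` are illustrative assumptions, while the numbers are the ones quoted from El-Ghazawi et al.

```python
# Figures as quoted on the slide (SGI Altix 4700 with RC 100 RASC vs. Beowulf cluster).
# The data structure and names below are hypothetical, chosen only for this sketch.
RESULTS = {
    "DNA and Protein sequencing": {"speedup": 8723, "power": 779, "cost": 22, "size": 253},
    "DES breaking": {"speedup": 28514, "power": 3439, "cost": 96, "size": 1116},
}


def power_to_speedup_ratio(entry):
    """Return the power saving factor as a fraction of the speed-up factor."""
    return entry["power"] / entry["speedup"]


if __name__ == "__main__":
    for name, entry in RESULTS.items():
        ratio = power_to_speedup_ratio(entry)
        print(f"{name}: power saving is {ratio:.1%} of the speed-up factor")
```

Both ratios land near one tenth, which is consistent with the "~10%" statement; two data points are of course not a general law.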

The „Reconfigurable Computing Paradox“

Why such Speed-up Factors ... with FPGAs: a much worse technology!

massive wiring overhead
+ routing congestion growing with FPGA size
+ massive reconfigurability overhead

main reason: no von Neumann Syndrome!

more recently also: more „platform FPGAs“

RC versus Multicore

RC: speed-up often higher by orders of magnitude

RC: energy-efficiency often higher: very much, or, by orders of magnitude? Sure!

Multicore: legacy software, control-intensive applications, etc.

We need both: Multicore and RC; this is the silver bullet

„RC“ = Reconfigurable Computing

Reconfigurable Computing is indispensable!

[Figure: relative performance versus year, 1994 to 2030; marked: end of the singlecore era]

For a Booming Multicore Era

von-Neumann-only is not the silver bullet

Outline

(5)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

CPU-centric flat world

typical programmer qualification: sequential-only mind set (Aristotelian model)

CPU-“centric“ but no hardware know-how (kind of tunnel view)

This Software-centric world model is obsolete

Machine Model of the Mainframe Era

Machine model | resources: property, programming source | sequencer: property, programming source, state register

40 years Software Crisis

Nathan’s Law: Software is a gas. It expands to fill its container ... until being limited by Moore’s Law [& Kryder’s Law] [Nathan Myhrvold]

Wirth’s Law: “software is slowing faster than hardware is accelerating“ [Niklaus Wirth]

[Cyril Northcote Parkinson] The Economist: Nov 19th 1955; Oct 1957
formula: bureaucracy growth independent of actual work to be done
In 1955, Parkinson could not have foreseen the impact of software.

Software criticism is not new:

F. L. Bauer 1968: coined the term „Software Crisis“
N. N. 1995: THE STANDISH GROUP REPORT
Robert N. Charette 2005: Why Software Fails; IEEE Spectrum, Sep 2005
Anthony Berglas 2008: Why it is Important that Software Projects Fail
L. Savain 2006: Why Software is bad
Peter G. Neumann 1985-2003: 216x “Inside Risks“ (18 years inside back cover of Comm_ACM)

“Software” overhead piles up to code sizes of astronomic dimensions

The von Neumann Syndrome [C.V. Ramamoorthy]: stands for extremely memory-cycle-hungry instruction streams

the ugliness of this term

from earlier talks: datastream parallelism; instruction stream parallelism

“The Memory Wall”: coined by Sally McKee (& co-author)

Patterson’s Law [Dave Patterson]: bandwidth gap grows 50% / year; has reached >1000x
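The two memory-wall figures quoted on this slide (a bandwidth gap growing 50% per year that has reached >1000x) are linked by simple compound growth. The sketch below only illustrates that arithmetic; the function names are hypothetical, and the assumption of a constant 50% annual growth rate is taken from the slide, not independently verified.

```python
import math


def gap_after(years, annual_growth=0.50):
    """Gap factor after the given number of years of compound growth."""
    return (1.0 + annual_growth) ** years


def years_to_reach(gap_factor, annual_growth=0.50):
    """Years of compound growth needed to reach the given gap factor."""
    return math.log(gap_factor) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # Under the slide's assumption, a 1000x gap takes a little over 17 years.
    print(f"years to reach 1000x: {years_to_reach(1000):.1f}")
    print(f"gap after 17 years: {gap_after(17):.0f}x")
    print(f"gap after 18 years: {gap_after(18):.0f}x")
```

In other words, at 50% per year a gap of three orders of magnitude accumulates in less than two decades.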

Machine Model of the PC Era

Machine model | resources: property | resources: programming source | sequencer: property | sequencer: programming source | state register
ASIC accelerator | hardwired | - | hardwired | - |
CPU | hardwired | - | programmable | Software (instruction streams) | program counter

ASIC: Application-Specific Integrated Circuit & other accelerators: e.g. graphics processor

“the tail is wagging the dog“ [ISIS 1997, Austin, TX]

Thomas S. Kuhn 1969: The Structure of Scientific Revolutions

Science does not progress continuously, … shortcomings in an established paradigm produce a crisis … in which the established paradigm is overthrown and replaced.

The von Neumann paradigm?

Outline

(6)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

From CPU to RPU

machine model right now:

Machine model | resources: property | resources: programming source | sequencer: property | sequencer: programming source | state register
ASIC accelerator | hardwired | - | hardwired | - |
CPU | hardwired | - | programmable | Software (instruction streams) | program counter
RPU accelerator | programmable | Configware (configuration code) | programmable | Flowware (data streams) | data counters

RPU = Reconfigurable Processing Unit

we need 2 more program sources

non-von-Neumann: now accelerators are programmable!

What Revolution?

[Thomas S. Kuhn 1969: The Structure of Scientific Revolutions] “… in which the established paradigm is overthrown and replaced.”

Thomas Kuhn is right! However, it is not the von Neumann paradigm that will be overthrown and replaced. The CPU-centric world model of Software Engineering will be replaced by removing the tunnel view perspective

RC* outside a CPU-centric flat world?

For the Multicore era we need a new model (Copernican)

*) RC = Reconfigurable Computing

Program Performance

„Multicore computers shift the burden of software performance from chip designers to programmers.“ [J. Larus: Spending Moore's Dividend; C_ACM, May 2009]

... performance drops & other problems in moving single-core to multicore ...

Since people have to write code differently, we anyway need a Software Education Revolution ...

... the chance to move RC* from niche to mainstream

Missing programmer population and methodology: a scenario like before the Mead-&-Conway revolution

Embedded syst. & hdw scene have the right background to reform the parallelism education of SW programmers

*) RC = Reconfigurable Computing

A Heliocentric CS Model

The Generalization of Software Engineering: A Twin Paradigm Dual Dichotomy Approach.

PE = Program Engineering
SE = Software Engineering
FE = Flowware Engineering

RPU

time to space mapping issue

special

*) do not confuse with „dataflow“!

A Multicore Submarine Model?

C is not the silver bullet: it’s inherently serial

mapping parallelism just into the time domain: “abstracting” away the space domain is fatal

But nobody wants to learn a new language.

There is no easy way to program in parallel

The programmer needs to understand how data flows through cores, accelerators, interconnect and peripherals

The programmer* needs system visualization in the space domain, to understand performance under parallelism

The datastream model of the twin-paradigm approach helps to understand the space domain and parallelism

*) and, especially the student

Our Contemporary Computer Machine Model

Machine model | resources: property | resources: programming source | sequencer: property | sequencer: programming source | state register
ASIC accelerator | hardwired | - | hardwired | - |
CPU | hardwired | - | programmable | Software (instruction streams) | program counter
RPU accelerator | programmable | Configware (configuration code) | programmable | Flowware (data streams) | data counters

twin Paradigm Dichotomy

in CPU

in RAM

data counters of reconfigurable address generators in asM (auto-sequencing) data memory blocks

Time to Space Mapping

Machine model | resources: property | resources: programming source | sequencer: property | sequencer: programming source | state register
ASIC accelerator | hardwired | - | hardwired | - |
CPU | hardwired | - | programmable | Software (instruction streams) | program counter
RPU accelerator | programmable | Configware (configuration code) | programmable | Flowware (data streams) | data counters

Relativity Dichotomy

„The biggest payoff will come from Putting Old ideas into Practice and teaching people how to apply them properly.“ David Parnas

loop turns 2 pipeline

C

1967

How to achieve acceptance

Hardware description languages hidden

Courses tailored for students not being hardware-savvy

Tools usable by users not being hardware designers [Courtesy Richard Newton]

„How to hide the ugliness from the user“ [Herman Schmit]

Software Education (R)evolution:

traditional qualification in the time domain
+ lean qualification in the space domain
= lean hardware modeling qualification at a higher level of abstraction

by simultaneous dual domain co-education: viable methodology for dual rail education (only a few % curricula need to be changed)

step by step, not overthrowing the SE scene

We need a Software Education Revolution, 2010 - ....

The incubator of the free ride on Cowhey’s & Aronson’s law

massive funding required

partially re-write the code

Create the missing programmer population

next most effective project in

DOS to Windows took 10 years

Community Building Function of the DATE Friday Workshop

Friday Workshop, Friday, March 12, 2010, 08:30 – 17:00: Software Education Revolution for using Multicore - and RC* (SERUM-RC*)
DATE-Conference, Dresden, DE: http://www.date-conference.com
reiner@hartenstein.de

CfP: http://fpl.org/cfp/

*) Reconfigurable Computing

RAW 2010: 17th Reconfigurable Architectures Workshop, April 19-20, 2010, Atlanta (Georgia), USA
http://www.ece.lsu.edu/vaidy/raw/
Run-Time Reconfiguration & Adaptive Computing: Architectures, Algorithms, Technologies

in conjunction with: 24th IEEE International Parallel and Distributed Processing Symposium, April 19-23, 2010, Atlanta (Georgia) USA
http://www.ipdps.org/

Manuscript due: October 18, 2009
Notification of acceptance: December 14, 2009
Camera-ready Papers Due: February 1, 2010

Outline

(7)

• The Power Consumption of Computing

• The Single-Core Approach

• The Multicore Scenario

• The Silver Bullet?

• A CPU-centric Flat World

• The Generalisation of Software Engineering

• Conclusions

Conclusions (1)

To maintain a Booming Multicore Era: Not without Reconfigurable Computing!

[Figure: relative performance versus year; marked: the end of the singlecore era; possible for 2 or 3 more decades?]

Conclusions

additional Flowware / Configware skills are essential qualifications for programmers.

Mead-&-Conway-style SE Revolution toward dual-rail education is urgently needed

key motivation: performance and energy consumption of programs

we need to master hetero of all 3: Singlecore, Multicore, & Reconfigurable Computing

massive long term R&D funding required like known from DARPA

A main problem: selecting (or creating) tools for lab courses; the key issue: ease of use!

SERUM-RC

We need „une Levée en Masse“

thank you for your patience

CV of Reiner Hartenstein

Professor (ordinarius emeritus), TU Kaiserslautern
IEEE fellow, SDPS fellow, FPL fellow, best paper awards, other awards
All acad. degrees from KIT, Karlsruhe Institute of Technology (his mentor: Karl Steinbuch)
Credited to be „The father of Reconfigurable Computing“ (also pre-FPGA era) [1]
Creator of KARL [2], most successful [3] trailblazer HDL before VHDL came up
KARL: a Pascalish hardware language
EU grant (80ies), 85 mio ECU (pre-€): complete EDA framework [4,5] around KARL
1981: visiting professor at UC Berkeley (& coop. w. Xerox PARC)
1983: founder of the German contribution to the Mead-&-Conway VLSI design revolution: the multi university „E.I.S. project“ (gov. grant: 38 million Deutschmark)
Founder / co-founder of several international annual conference series
his hobby: giving keynotes, http://hartenstein.de/keynotes.htm
reiner@hartenstein.de

[1] qu. Viktor Prasanna (with Gerald Estrin as the grandfather of Reconfigurable Computing, who proposed it in 1960 WJCC)
[2] R. Hartenstein: Fundamentals of Structured Hardware Design; American Elsevier, 1977 -- Bestseller; 1977 & later used as a textbook at UC Berkeley (not only here)
[3] for users, usage details, quotations, etc. see: http://www.fpl.uni-kl.de/staff/hartenstein/KARLUsers.html
[4] R. Hartenstein: The History of KARL and ABL; in: J. Mermet (editor): Fundamentals and Standards in Hardware Description Languages; ISBN 0-7923-2513-4, Kluwer (now Springer), September 1993. also see: http://xputers.informatik.uni-kl.de/karl/karl_history_fbi.html
[5] format-checking functional floorplan graphic editor, and textual editors, calculus-based term rewriting floorplan generator, embedded router, automatic test generation, testability analysis, structured logic synthesis, simulator, et al. -- also see [4]

Double Dichotomy

1) Paradigm Dichotomy

- instruction stream: von Neumann Machine (Software-Domain)

- data stream: Datastream Machine (Flowware-Domain)

2) Relativity Dichotomy

- time: Procedure (Software-Domain)

- space: Structure (Configware-Domain)

Relativity Dichotomy

time domain (procedure domain): von Neumann Machine, 2 phases:

1) programming instruction streams

2) run time

space domain (structure domain): Datastream Machine, 3 phases:

1) reconfiguration of structures

2) programming data streams

3) run time

(40)

time-iterative to space-iterative

loop transformation methodology: 1970s and later

a time to space mapping: n time steps, 1 CPU -> 1 time step, n DPUs

Often the space dimension is limited:

a time to space/time mapping (strip mining [D. Loveman, J-ACM, 1977]): n*k time steps, 1 CPU -> n time steps, k DPUs

example: bubble sort migration

time to space mapping

time domain (procedure domain): program loop, n time steps, 1 CPU

space domain (structure domain): pipeline, 1 time step, n DPUs

Bubble Sort (time algorithm): n x k time steps, 1 „conditional swap“ unit

Shuffle Sort (space algorithm): k time steps, n „conditional swap“ units

[Figure: input x fed through a chain of „conditional swap“ units]
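The migration from a time algorithm to a space algorithm can be illustrated with a small simulation. This is only a sketch under stated assumptions: the slide does not spell out the Shuffle Sort network, so odd-even transposition sort is used here as a stand-in for a row of parallel "conditional swap" units, and the function names `cond_swap`, `bubble_sort_time` and `transposition_sort_space` are hypothetical. The step counters model time steps, not measured run time.

```python
def cond_swap(a, i):
    """One 'conditional swap' unit: put the pair a[i], a[i+1] in order."""
    if a[i] > a[i + 1]:
        a[i], a[i + 1] = a[i + 1], a[i]


def bubble_sort_time(data):
    """Time algorithm: a single conditional-swap unit, reused step after step."""
    a = list(data)
    n = len(a)
    steps = 0
    for _ in range(n - 1):          # n-1 passes
        for i in range(n - 1):      # one conditional swap per time step
            cond_swap(a, i)
            steps += 1
    return a, steps


def transposition_sort_space(data):
    """Space algorithm (stand-in): many conditional-swap units per time step.

    Assumption: odd-even transposition sort replaces the slide's Shuffle Sort.
    Within one phase the units work on disjoint pairs, so they could all
    operate in the same time step.
    """
    a = list(data)
    n = len(a)
    steps = 0
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            cond_swap(a, i)
        steps += 1                  # the whole phase counts as one time step
    return a, steps


if __name__ == "__main__":
    sample = [5, 3, 8, 1]
    print(bubble_sort_time(sample))
    print(transposition_sort_space(sample))
```

Both functions return the sorted list together with the number of modelled time steps, so the trade of time steps against conditional-swap units can be compared directly for a given input length.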

(41)

Flowware language example (MoPL): programming the datastream (an animation)

[Figure: 8 x 8 pixel map (axes x and y, each 1 to 8) with the JPEG zigzag scan pattern, the HalfZigZag path, and four data counters]

*> Declarations

EastScan is
    step by [1,0]
end EastScan;

SouthScan is
    step by [0,1]
end SouthScan;

NorthEastScan is
    loop 6 times until [*,1]
        step by [1,-1]
    endloop
end NorthEastScan;

SouthWestScan is
    loop 7 times until [1,*]
        step by [-1,1]
    endloop
end SouthWestScan;

HalfZigZag is
    EastScan
    loop 3 times
        SouthWestScan
        SouthScan
        NorthEastScan
        EastScan
    endloop
end HalfZigZag;

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)
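The address sequence described by the MoPL example can be checked with a short simulation. This is a sketch, not a MoPL implementation: the `Scanner` class and all function names are hypothetical, "loop N times until [..]" is read as "step at most N times, stopping once the boundary is reached", and `uturn` is assumed to replay the steps of HalfZigZag in reverse order. The slide does not define these semantics. Under those assumptions the scan visits all 64 positions in the familiar zigzag order.

```python
class Scanner:
    """Tracks the current [x, y] position and records the visited path."""

    def __init__(self, x=1, y=1):
        self.pos = (x, y)
        self.path = [self.pos]
        self.steps = []

    def step(self, dx, dy):
        x, y = self.pos
        self.pos = (x + dx, y + dy)
        self.path.append(self.pos)
        self.steps.append((dx, dy))

    def loop_until(self, times, stop, dx, dy):
        # Assumed meaning of "loop <times> times until <stop>".
        for _ in range(times):
            self.step(dx, dy)
            if stop(self.pos):
                break


def east(s):
    s.step(1, 0)


def south(s):
    s.step(0, 1)


def north_east(s):
    s.loop_until(6, lambda p: p[1] == 1, 1, -1)   # until [*,1]


def south_west(s):
    s.loop_until(7, lambda p: p[0] == 1, -1, 1)   # until [1,*]


def half_zigzag(s):
    east(s)
    for _ in range(3):
        south_west(s)
        south(s)
        north_east(s)
        east(s)


def uturn(s, routine):
    # Assumption: replay the routine's steps (recorded from [1,1]) backwards.
    probe = Scanner(1, 1)
    routine(probe)
    for dx, dy in reversed(probe.steps):
        s.step(dx, dy)


def zigzag_scan():
    s = Scanner(1, 1)          # goto PixMap[1,1]
    half_zigzag(s)
    south_west(s)
    uturn(s, half_zigzag)
    return s.path


def reference_zigzag(n=8):
    """Independent reference: walk the anti-diagonals, alternating direction."""
    order = []
    for d in range(2, 2 * n + 1):                 # d = x + y
        xs = list(range(max(1, d - n), min(n, d - 1) + 1))
        if d % 2 == 1:
            xs.reverse()                          # odd d runs south-west
        order.extend((x, d - x) for x in xs)
    return order


if __name__ == "__main__":
    print(zigzag_scan() == reference_zigzag(8))
```

Recording the steps once and replaying them reversed is just one way to model the U-turn; it keeps the second half of the scan point-symmetric to the first.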

Programming model: Flowware

[Figure: stream graph with the blocks FMDemod, Split, LPF1 to LPF3, HPF1 to HPF3, Gather, Adder and Speaker. Source: MIT StreamIT]

• Pros for streaming [Pierre Paulin]

– Streamlined, low-overhead communication

– (More) deterministic behaviour

– Good match for many simple media-rich applications

• Cons

– control-dominated applications

– shunt yard problem

We have to find out which application types and programming models students should exercise for the flowware approach
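The split/gather structure of such a stream graph can be sketched with Python generators. This is an illustration only: it is not StreamIT code, the function names are hypothetical, and the running mean is a placeholder for the real low-pass and high-pass filter stages, which the slide does not specify.

```python
from itertools import tee


def moving_average(stream, taps):
    """Placeholder filter stage: running mean over the last `taps` samples."""
    window = []
    for sample in stream:
        window.append(sample)
        if len(window) > taps:
            window.pop(0)
        yield sum(window) / len(window)


def split(stream, n):
    """Duplicate splitter: every branch sees every sample."""
    return tee(stream, n)


def gather_add(branches):
    """Gather one sample from each branch and add them up."""
    for samples in zip(*branches):
        yield sum(samples)


def pipeline(source, taps_per_branch):
    """source -> split -> one filter per branch -> gather -> adder."""
    branches = split(iter(source), len(taps_per_branch))
    filtered = [moving_average(b, t) for b, t in zip(branches, taps_per_branch)]
    return gather_add(filtered)


if __name__ == "__main__":
    print(list(pipeline([1, 2, 3, 4], [1, 2])))
```

Each stage only consumes and produces samples, so the communication between stages stays explicit, which is the property the "pros for streaming" list refers to.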

(42)

Flowware

Flowware means parallelism resulting from time to space migration

from a generalization of the systolic arrays

supports any wild free form of pipe networks: spiral, zigzag, fork and join, and even more wild, unidirectional and fully or partially bidirectional

Flowware: scheduling data streams

- FIFOs, stacks, registers, register files, RAM blocks...

Ways to implement an Algorithm

• Hardware

• Software

• Configware

• mixed

[Figure labels: von Neumann machine; datastream machine; singlecore; multicore; manycore; manycore per se; RAM-based]

(43)

Acceleration Mechanisms

•parallelism by multi bank memory architecture

•auxiliary hardware for address calculation

•address calculation before run time

•avoiding multiple accesses to the same data

•avoiding memory cycles for address computation

•optimization by storage scheme transformations

•optimization by memory architecture transformations

New boundary constraints are the limiting factor

Legacy scientific applications: predominantly sequential

The entire software ecosystem will need to evolve (including curricula): O/S, libraries, software development environments, compilers and languages

additional levels of parallelism: chaining, pipelining, systolic, super-systolic, wavefront arrays

additional data structures and storage organization: the new distributed memory discipline
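Two of the mechanisms listed above, multi bank memory and storage scheme transformation, can be made concrete with a toy model. The sketch below rests on assumptions that are not taken from the slides: each bank serves one access per cycle, requests are issued in groups of `banks` addresses, and the skewed mapping is one textbook example of a storage scheme transformation. All function names are hypothetical.

```python
from collections import Counter


def interleaved(addr, banks):
    """Plain low-order interleaving: consecutive addresses hit consecutive banks."""
    return addr % banks


def skewed(addr, banks):
    """Skewed storage scheme: each row of `banks` words is rotated by one more bank."""
    return (addr + addr // banks) % banks


def cycles(addresses, banks, bank_of):
    """Count memory cycles when requests are issued in groups of `banks`.

    A group takes as many cycles as the most heavily hit bank receives requests.
    """
    total = 0
    for i in range(0, len(addresses), banks):
        group = addresses[i:i + banks]
        hits = Counter(bank_of(a, banks) for a in group)
        total += max(hits.values())
    return total


if __name__ == "__main__":
    row_walk = list(range(8))            # stride 1
    column_walk = list(range(0, 32, 4))  # stride 4, e.g. a column of a 4-wide array
    for name, scheme in (("interleaved", interleaved), ("skewed", skewed)):
        print(name, cycles(row_walk, 4, scheme), cycles(column_walk, 4, scheme))
```

In this model the stride-4 walk collides on a single bank under plain interleaving, while the skewed scheme spreads both access patterns across all four banks.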

(44)

old Paradigms and Methodologies

1946: Machine Paradigm (von Neumann)

1980: Datastreams (Kung, Leiserson)

1989: Anti Machine** Paradigm (TU-KL)

1990: first rDPA* (Rabaey)

1994: higher Anti Machine** Programming Language (Flowware: TU-KL)

1995: super systolic array: rDPA (Kress)

1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...

1997+: Discipline of Distributed Memory Architectures (IMEC …)

1997: first automatically partitioning Configware/Software Co-Compiler (TU-KL)

*) rDPA = reconfigurable Data Path Array

**) datastream machine (flowware machine)

(45)

Teaching computing fundamentals

Ignoring Reconfigurable Computing in teaching computing fundamentals within our CS curricula is one of the biggest mistakes in the history of information technology application, causing billions of dollars to be wasted.

reiner@hartenstein.de

(46)

Track A - Industrial

Track A features presentations with a focus on industrial applications. The presenters were selected by the Industrial Programme Committee. A total of 11 papers were presented (30-minute slots).

AND

Track C - Product presentations

Track C features product presentations from our exhibitors and sponsors (30-minute slots).

(47)

(48)

1. Introduction

2. Case study

(49)

The industrial robot market is strongly focused on cutting cost and footprint.

The product also needs to be sustained for more than 15 years.

It is therefore very attractive to integrate different functions in low-cost FPGAs.

The presentation will cover some recent examples in ABB's robot control system.

1. Fieldbus communication

2. Position measurement (Encoder interface)

3. Force Measurement

(50)

Traditional solution

[Block diagram: Main computer, PCI bus, DeviceNet M/S PCI card, CAN bus with I/O boards; Gateway unit to the PLC over another fieldbus, e.g. Profibus DP, Interbus, CC Link]

(51)

1. ”DeviceNet Lean”

CAN controller in MC FPGA

CAN interface board

Simple SW driver with pre-defined messages – supports ABB I/O boards only

2. Anybus™ interface

Anybus™ controller in MC FPGA

Anybus™ CompactCOM™ module with fieldbus slave interface

(52)

FPGA based solution vs traditional

[Block diagram: Main computer with FPGA and I/O; CAN i/f board, CAN bus with I/O boards; Anybus CC™ fieldbus module to the PLC over another fieldbus, e.g. Profibus DP, Ethernet IP]

Significant cost cut
