A performance-driven SoC architecture for video synthesis

Master of Science Thesis, Stockholm, Sweden 2010, TRITA-ICT-EX-2010:90

Sébastien Bourdeauducq

KTH Information and Communication Technology

Master of Science Thesis in System-on-Chip Design

Royal Institute of Technology

Department of Software and Computer Systems

© Sébastien Bourdeauducq, June 2010. Milkymist is a trademark of Sébastien Bourdeauducq.

Abstract

Commercial system-on-chips with advanced graphics acceleration capabilities are becoming ubiquitous today. However, in contradiction with the open source idea, little is known about the details of their architecture and implementation, as they are usually covered by trade secrets.

Fostered by the falling costs of high-density FPGAs, our thesis project encompasses researching, developing and implementing the key points of the architecture of an open source and comprehensive system-on-chip with competitive yet reasonable graphics capabilities. The chosen target application is the synthesis of visual effects similar to those produced by the popular MilkDrop visualization plug-in for Winamp.

Our system-on-chip design consists principally of a custom bus infrastructure, a custom DDR SDRAM memory controller, a microprocessor core, and custom graphics accelerators for texture mapping and floating point processing.

Our base microprocessor system is capable of running Linux (without MMU) and outperforms a Microblaze-based solution tested in similar conditions by a 15 to 35% increase in speed of execution. For our video synthesis application, our texture mapping accelerator achieves an average fill rate of 44 megapixels per second and our floating point processing unit provides in excess of 70 million floating point operations per second. Everything, including I/O peripherals (AC97 audio, Ethernet, RS232 UART, GPIO), is implemented on a Virtex-4 XC4VLX25 FPGA, where it utilizes about 80% of the resources.

Finally, we have successfully developed an embedded video synthesis program that leverages the possibilities of our hardware architecture to permit the live rendering of many MilkDrop effects in 640x480 resolution at 30 frames per second.

Acknowledgments

First, I would like to express my gratitude to Professor Mats Brorsson, my supervisor and examiner at the Royal Institute of Technology, for having the open-mindedness of letting me write my thesis on this subject and for his help and advice with it.

I would also like to thank Lattice Semiconductor for opening the source code of their LatticeMico32 processor core.

Special thanks go to all the people who are indirectly involved with this Master's thesis project: Henry de Beauchesne (Xilinx) for getting me started with high-end FPGA tools, Shawn Tan (Aeste Works (M) Sdn Bhd) for his help with understanding the WISHBONE bus, Gregory Taylor (NASA's Jet Propulsion Laboratory) for letting me know that they were using parts of my code in the development of a communications system to be put on board the international space station, Takeshi Matsuya (Keio University) for his work on the port of Linux to the system-on-chip described herein, Michael Walle for developing support of the system-on-chip in the QEMU emulator and Wolfgang Spraul (Sharism at Work Ltd.) for proposing me an agreement for manufacturing devices using the system-on-chip design.

Thanks to the Eidlon music band (Rheims, France), for whom I wrote my first PC-based video synthesis program in 2005, which has been a source of inspiration for this project.

Finally, I would like to thank all the researchers who have retained their copyright on their papers (or have put them in the public domain) and distribute them online for everybody to download freely (incidentally in accordance with the principle of free exchange of information from the KTH ethics policy). This in spite of the default agreement of many publishers such as the IEEE, which asks authors to assign their copyrights to the publishers so the latter have the exclusive permission to sell the download of documents that they did not write, without giving back to the authors, at a price supposedly meant to cover publishing expenses but which is not justified by today's low costs of network bandwidth and servers.

Thanks to these researchers, I have been able to access quality scientific literature before I went to a university, from which I have learned a lot. Even throughout the writing of this Master's thesis, papers freely available online enabled greater productivity as access to them was much faster.

Contents

1 Introduction

2 Background
2.1 Video synthesis
2.1.1 Overview
2.1.2 Principle
2.2 Open source SoC platforms
2.3 DRAM technology
2.3.1 Multiple banks
2.3.2 Refreshing
2.4 Texture mapping
2.5 Organization

3 Memory subsystem
3.1 Attacking the memory wall
3.2 Another approach
3.3 Memory system features
3.3.1 Single SDRAM and system clock domain
3.3.2 Page mode control algorithm
3.3.3 Burst accesses
3.3.4 Burst reordering
3.3.5 Pipelining
3.4 Practical implementation
3.5 Performance measurement
3.5.1 Introduction
3.5.2 Method
3.5.3 Results

4 SoC interconnect
4.1 General SoC interconnect: the Wishbone bus
4.2 Configuration and Status Registers: the CSR bus
4.3 High-throughput memory access bus: the FML bus
4.3.1 Variable latency
4.3.2 Burst only
4.3.3 Burst reordering
4.3.4 Pipelining
4.3.5 Usage
4.4 Bridging Wishbone to FML
4.5 Cache coherency
4.5.1 Coherency issues around the CPU (L1) cache
4.5.2 Coherency issues around the Wishbone-FML (L2) cache

5 Texture mapping unit
5.1 Algorithm
5.1.1 Two-dimensional interpolation
5.1.2 One-dimensional interpolation
5.1.3 Bilinear filtering
5.2 Performance considerations
5.2.1 Context
5.2.2 Execution time of the interpolation algorithm
5.2.3 Total execution time
5.3 Pipelined hardware implementation
5.3.1 Strategy
5.3.2 Vertex fetch engine
5.3.3 Interpolators
5.3.4 Clamping/wrapping
5.3.5 Address generator
5.3.6 Texel cache
5.3.7 Bilinear filter
5.3.8 Write buffer
5.3.9 Control interface
5.4 Extra features
5.5 Implementation results

6 Floating point co-processor
6.1 Purpose
6.2 Forms of parallelism
6.3 Hardware architecture
6.3.1 Overview
6.3.2 Instruction set
6.3.3 Instruction RAM
6.3.4 ALU
6.4 Run-time compiler
6.4.1 Compilation into virtual machine instructions
6.4.2 Scheduling
6.4.3 Constants and user variables

7 Software
7.1 LatticeMico32
7.2 Capabilities
7.3 Benchmarking
7.4 Design of a MilkDrop-like rendering program
7.4.1 Description
7.4.2 Cache coherency
7.4.3 Event-driven operation
7.4.4 Results

8 Conclusion and future works

List of Figures

1.1 FPGA boards of the Ikos Pegasus ASIC emulator (ca. 1999).
1.2 Project logo.
2.1 Sample video frame from the MilkDrop visual synthesizer.
2.2 Sample video frame from Visikord, a program mixing live video into MilkDrop.
2.3 The embedded user interface (based on Genode FX [12]) of Flickernoise, the Milkymist VJ application. The patch editor is shown, with per-frame and per-vertex equations.
2.4 Basic MilkDrop rendering flow.
2.5 Excerpt from the MilkDrop preset Geiss Warp of Dali 1 (with some simplifications).
2.6 Block diagram of a DRAM memory bank.
2.7 Example of distorted picture.
2.8 Principle of bilinear texture filtering.
2.9 Rendering with bilinear filtering enabled.
2.10 Rendering with bilinear filtering disabled (the nearest texel is used).
2.11 SoC block diagram.
3.1 Block diagram of the HPDMC architecture.
3.2 FML transactions.
3.3 Maximum utilization of a FML bus.
5.1 Typical decomposition into triangular primitives of the MilkDrop rendering surface.
5.2 2D linear interpolation on a rectangle.
5.3 One-dimensional linear interpolation algorithm.
5.4 Bilinear filtering using the fixed point texture coordinates.
5.5 Block diagram of the texture mapping unit architecture.
5.6 Vector interpolator.
5.7 Pipelined scalar interpolator.
5.8 Architecture of the four-channel texel cache.
5.9 Disposition of the channels within the texture, general case.
5.10 Disposition of the channels within the texture, vertical wrapping.
5.11 Disposition of the channels within the texture, horizontal wrapping.
5.12 Disposition of the channels within the texture, horizontal and vertical wrapping.
5.13 TMU output picture for the copy set (original picture).
5.14 TMU output picture for the zoom in set.
5.15 TMU output picture for the zoom out set.
5.16 TMU output picture for the rotozoom set.
5.17 Typical TMU simulation trace (excerpt).
5.18 Hit rates versus texel cache size. The X axis (cache size) uses a logarithmic scale.
5.19 Theoretical write buffer throughput versus memory write access time.
5.20 Measured TMU performance versus global texel cache hit rate.
6.1 Hardware architecture of the floating point co-processor.
6.2 Fast inverse square root algorithm.
7.1 LatticeMico32 architecture (Lattice Semiconductor).
7.2 Linux booting on the Milkymist SoC.
7.3 Xilinx ML401 development board.
7.4 Comparative MiBench results of Milkymist and Microblaze.
7.5 Rendering software architecture.
8.1 Printed circuit board floor plan of the Milkymist One.

List of Tables

3.1 Estimate of the memory bandwidth consumption.
3.2 Memory performance in different conditions (Milkymist 0.5.1). Bandwidths are in Mb/s.
5.1 Estimates of the cost of common software operations.
5.2 Detailed estimate of the execution time of the interpolation algorithm.
5.3 Optimistic estimate of the execution time of software texture mapping.
5.4 Texture coordinate sets used for benchmarking the texel cache.
5.5 Hit rates for each set of texture coordinates and different cache sizes.
6.1 PFPU instruction format.
6.2 Greedy PFPU scheduler performance with the per-vertex math of different MilkDrop patches (Milkymist 0.5.1).
6.3 PFPU latencies in cycles (Milkymist 0.5.1).
6.4 Exact cost in instructions of common operations on the PFPU.
7.1 User execution times on Milkymist 0.2.
7.2 User execution times on Microblaze 10.1.

Introduction

The open source model supports the idea that any individual, if he or she has the required level of technical knowledge, can realistically use, share and modify the design of a technical system. During the nineties, this development model gained popularity in the software world with, most notably, the Linux operating system.

But it was not viable for complex SoCs until a few years ago, because the cost of prototyping semiconductor chips is prohibitive and field programmable gate arrays (FPGAs) used to be too slow, too small, and too expensive. System-on-chip design and hands-on computer architecture therefore remained a field reserved to well-funded academia and research and development laboratories of companies of a significant size and wealth, who had access to large FPGA clusters or even semiconductor foundries.

But the cost of FPGAs is falling (this was already the case between 1985 and 1994 [24] and the trend has continued since then) and relatively fast and high-density devices are today becoming available to the general public. For an example of this falling cost (and increasing densities and speed), we will mention the Ikos Pegasus application specific integrated circuit (ASIC) emulator, whose insides are depicted in figure 1.1. The LatticeMico32 CPU core used in the system-on-chip described in this thesis alone occupies 60% of the resources of one of the XC4036XL FPGAs of this device, and runs at 30 MHz. The Ikos Pegasus was a state-of-the-art device a decade ago. It consumes up to 3 kilowatts of power, weighs dozens of kilos and probably cost the equivalent of several million SEK. The same CPU core now occupies about 15% of a modern FPGA costing less than 500 SEK, where it runs in excess of 100 MHz.

This evolution makes it possible to implement complex high-performance system-on-chips (SoC) that can be modified and improved by anyone, thanks to the flexibility of the FPGA platform.

This Master's thesis introduces Milkymist™ [6], a fast and resource-efficient FPGA-based system-on-chip designed for the application of rendering live video effects during performances such as concerts, clubs or contemporary art installations. Such effects are already popularized by artists known as video jockeys, or VJs.

Figure 1.1. FPGA boards of the Ikos Pegasus ASIC emulator (ca. 1999).

Figure 1.2. Project logo.

VJing is commonly done with a PC and computer software such as GrandVJ [5] or Resolume [11]. However, this approach has some drawbacks and using an embedded device instead would be interesting:

• A device of very small size and weight is possible, which is convenient in mobile or temporary setups.

• Boot and set-up time (launching the software) can be greatly reduced (to a few seconds).

• Many interfaces for interactive performances (MIDI, DMX, video input, low-level digital I/O for user sensors) can be integrated. By comparison, the equivalent PC-based solution would be expensive and bulky.

Besides the fact that this is an interesting, creative and popular application, it is also demanding in terms of computational power and memory performance. Such a project would also be a proof that high performance open source system-on-chip design is possible in practice; with a view to help, foster and catalyze similar open hardware initiatives. As the Milkymist system-on-chip is entirely made of synthesizable Verilog and, for the most part, released under the GNU General Public License (GPL), its code can be re-used by other open hardware projects.

Meeting the performance constraints while still using cheap and relatively small FPGAs is perhaps the most interesting and challenging technical point of this project, and it could not be done without substantial work in the field of computer architecture. This is what this Master's thesis covers.

Background

2.1 Video synthesis

2.1.1 Overview

MilkDrop [25] (figure 2.1) is a popular open source video synthesis framework that was originally made to develop visualization plug-ins for the Winamp audio player. People have since then ported MilkDrop to many different platforms [32] and made it react to live events, such as captured audio and video [20] (figure 2.2) or movements of a Wiimote remote control [21].

The idea behind the Milkymist project is to implement an embedded video synthesis platform on a custom open source system-on-chip, based on the same rendering principle as MilkDrop but with more control interfaces and features. The device built around the system-on-chip should be stand-alone, which means that a graphical user interface for configuring the visual effects should be implemented (figure 2.3).

Figure 2.1. Sample video frame from the MilkDrop visual synthesizer.

Figure 2.2. Sample video frame from Visikord, a program mixing live video into MilkDrop.

2.1.2 Principle

General mode of operation

The MilkDrop-like renderer is the most compute and memory intensive process, from which stem most of the technical challenges. We will now get into more details about how the renderer works (figure 2.4).

Rendering is based on a frame buffer on which the steps below are continuously repeated. This repetition is at the origin of many feedback or fractal effects.

• The current frame is distorted (zoomed, translated, warped, scaled, rotated...) by texture mapping. This step is described with more detail in section 2.4.

• The frame is darkened (the colors are shifted to black).

• A waveform of the currently played music is drawn. The wave can be drawn linearly (like an oscilloscope), in a circle, etc.

• Borders around the screen are drawn. If the distortion zooms out, the borders will be pulled into the picture (some effects are based on this).

• Motion vectors are drawn. Motion vectors are simply a grid of dots, which can be used to generate effects by playing with the distortion.

• The process repeats from the beginning.
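The feedback loop listed above can be sketched in a few lines of Python. This is only an illustration and not the Milkymist or MilkDrop code: the `Frame` type, the function names, and the choice of modelling the darkening step as a multiplication by a decay factor are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical frame buffer: rows of grayscale intensities in [0, 1].
Frame = List[List[float]]


def darken(frame: Frame, decay: float) -> Frame:
    """Shift every pixel toward black (assumed here to be a simple multiply)."""
    return [[pixel * decay for pixel in row] for row in frame]


def render_step(
    frame: Frame,
    distort: Callable[[Frame], Frame],
    draw_overlays: Callable[[Frame], Frame],
    decay: float = 0.98,
) -> Frame:
    """One iteration of the loop: distort, darken, then draw on top.

    `distort` stands for the texture mapping pass; `draw_overlays` stands
    for the waveform, borders and motion vectors. The returned frame is
    fed back as the input of the next iteration.
    """
    frame = distort(frame)
    frame = darken(frame, decay)
    return draw_overlays(frame)
```

Because the output of one step becomes the input of the next, anything drawn by `draw_overlays` is distorted and faded again on every later frame, which is where the feedback effects come from.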

These are the basic features of MilkDrop. There are more (custom waves, shapes, ...) which are listed on the MilkDrop website [25]. Some other features

Figure 2.3. The embedded user interface (based on Genode FX [12]) of Flickernoise, the Milkymist VJ application. The patch editor is shown, with per-frame and per-vertex equations.

This process is done on an internal frame buffer whose horizontal and vertical dimensions are a power of 2. This frame buffer is then scaled to the size of the screen in order to be displayed. This brings two features:

• The sizes being a power of 2 allows out-of-bounds texture coordinates to be wrapped (in order to repeat the texture) by simply performing a bitwise AND of the coordinate, instead of the full computation of a division remainder which is a much more expensive operation (even on the traditional GPUs MilkDrop was designed for).

• It enables the implementation of the video echo effect: after the internal frame buffer has been drawn to the screen at its nominal dimensions, a zoomed and semi-transparent copy of it can be overprinted.
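The wrapping trick in the first point can be illustrated with a short Python sketch. The function name `wrap_coordinate` is hypothetical; the only property relied upon is that, for a power-of-2 size, masking with `size - 1` gives the same result as the remainder of the division by `size`.

```python
def wrap_coordinate(coord: int, size: int) -> int:
    """Wrap an integer texture coordinate into [0, size) with a bitwise AND.

    Only valid when `size` is a power of 2, in which case `size - 1` is a
    mask of the low-order bits that are kept.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of 2")
    return coord & (size - 1)
```

For a 512-pixel-wide texture, `wrap_coordinate(513, 512)` gives 1 and `wrap_coordinate(-1, 512)` gives 511, so coordinates that leave the texture on one side re-enter it on the other.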

It must be noted that this two-step process increases the computation time and the consumption of memory bandwidth.

All the steps of the rendering are heavily parameterizable by the user, using a coded format called a patch or preset which defines the aspect and the interaction forms of a particular visual effect. The listing of a sample patch is given by figure 2.5.

Figure 2.4. Basic MilkDrop rendering flow.

fDecay=0.980000
nWaveMode=2
bTexWrap=1
bMotionVectorsOn=0
zoom=1.046000
rot=0.020000
cx=0.500000
cy=0.500000
warp=0.969000
sx=1.000000
sy=1.000000
wave_r=0.600000
wave_g=0.600000
wave_b=0.600000
wave_x=0.500000
wave_y=0.470000
per_frame_1=wave_r = wave_r + 0.400*( 0.60*sin(0.933*time)
    + 0.40*sin(1.045*time) );
per_frame_2=wave_g = wave_g + 0.400*( 0.60*sin(0.900*time)
    + 0.40*sin(0.956*time) );
per_frame_3=wave_b = wave_b + 0.400*( 0.60*sin(0.910*time)
    + 0.40*sin(0.920*time) );
per_frame_4=zoom = zoom + 0.010*( 0.60*sin(0.339*time)
    + 0.40*sin(0.276*time) );
per_frame_5=rot = rot + 0.050*( 0.60*sin(0.381*time)
    + 0.40*sin(0.579*time) );
per_frame_6=cx = cx + 0.030*( 0.60*sin(0.374*time)
    + 0.40*sin(0.294*time) );
per_frame_7=cy = cy + 0.030*( 0.60*sin(0.393*time)
    + 0.40*sin(0.223*time) );
per_vertex_1=sx=sx-0.04*sin((y*2-1)*6+(x*2-1)*7+time*1.59);
per_vertex_2=sy=sy-0.04*sin((x*2-1)*8-(y*2-1)*5+time*1.43);

Figure 2.5. Excerpt from the MilkDrop preset Geiss Warp of Dali 1 (with some simplifications).

Initial conditions

The patch begins with a series of parameters which are used to initialize the renderer, and many of them are kept constant during the execution of the patch. For example:

• bMotionVectorsOn=0 turns off the drawing of the motion vectors.

• nWaveMode=2 selects one of the many ways of drawing the audio waveform.

• sx=1.000000 and sy=1.000000 set the X and Y scaling factors of the distortion to 1 (i.e. the frame is initially not scaled).

• wave_r=0.600000, wave_g=0.600000 and wave_b=0.600000 set the initial RGB color with which the wave is drawn (it is initially grey).
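As an illustration of how such name=value lines could be turned into initial renderer parameters, here is a minimal Python sketch. The function `parse_initial_conditions` is hypothetical and is not the parser used by MilkDrop or Milkymist; it simply assumes one parameter per line and skips the per-frame and per-vertex equation lines.

```python
from typing import Dict


def parse_initial_conditions(patch_text: str) -> Dict[str, float]:
    """Collect the numeric name=value lines of a patch into a dictionary.

    Lines holding per-frame or per-vertex equations are ignored here,
    since they are expressions rather than initial values.
    """
    params: Dict[str, float] = {}
    for line in patch_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("per_frame_", "per_vertex_")):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = float(value)
    return params


example = "nWaveMode=2\nbMotionVectorsOn=0\nsx=1.000000\nwave_r=0.600000\n"
initial = parse_initial_conditions(example)
```

With this example input, `initial["bMotionVectorsOn"]` is 0.0 (motion vectors off) and `initial["wave_r"]` is 0.6.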

Per-frame equations

Using initial conditions only limits the interaction and evolution possibilities of the patch.

It is therefore possible to make the parameters evolve over time, thanks to the per-frame equations. As their name suggests, the per-frame equations are mathematical expressions that are evaluated at each frame.

The example patch (figure 2.5) shows some of them (the lines beginning with per_frame). In this example, they change the wave color over time by modifying the wave_r, wave_g and wave_b values in sinusoidal patterns, as well as the zoom (zoom), rotation (rot) and center of rotation (cx and cy).

Per-frame equations can make the patch react to sound, for example through the bass, mid and treb variables that indicate the intensity of the sound in three frequency bands. One of the ideas in Milkymist is to add other variables that can be controlled by the DMX512 and MIDI protocols, enabling the use of a whole range of devices commonly found among musicians (electronic instruments, faders, stage light consoles, joysticks, ...) to control the visual effects.
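To make the evaluation of a per-frame equation concrete, the first equation of the example patch (per_frame_1) can be written out in Python. This is a sketch under one stated assumption: the parameter is taken to restart from its initial condition at every frame before the equation is applied, which is not spelled out in the text above. The function name is hypothetical.

```python
import math


def per_frame_wave_r(initial_wave_r: float, time: float) -> float:
    """Evaluate per_frame_1 of the example patch for a given time (seconds).

    Assumption: wave_r starts from its initial condition at each frame, so
    the result oscillates around that value instead of accumulating.
    """
    return initial_wave_r + 0.400 * (
        0.60 * math.sin(0.933 * time) + 0.40 * math.sin(1.045 * time)
    )


# Red component of the wave color sampled once per frame at 30 frames/s.
samples = [per_frame_wave_r(0.6, frame / 30.0) for frame in range(300)]
```

Since the two sine terms are weighted 0.60 and 0.40, the modulation stays within plus or minus 0.4 of the initial value, so the red component remains between 0.2 and 1.0 for an initial value of 0.6.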

Per-vertex equations

Per-vertex equations are used to fine-tune the distortion applied to the picture.

Indeed, as explained further in section 2.4, the distortion works by using a mesh of control points (vertices) that can be moved to transform the image in many different ways (effects such as zooming, scaling and rotating are implemented by moving the vertices).

Per-vertex equations are thus evaluated at each vertex (whose position can be retrieved through the x and y variables), and alter the position of that vertex. In the example patch (figure 2.5), the image is locally scaled horizontally and vertically by factors depending on the position of the vertex and on the time, resulting in a twisted visual effect.
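The two per-vertex equations of the example patch can be evaluated over a mesh as in the following Python sketch. The mesh dimensions, the function names and the normalization of x and y to the range [0, 1] are assumptions made for the illustration; only the two formulas come from figure 2.5.

```python
import math
from typing import List, Tuple


def per_vertex_scaling(
    x: float, y: float, time: float, sx: float = 1.0, sy: float = 1.0
) -> Tuple[float, float]:
    """Apply per_vertex_1 and per_vertex_2 of the example patch to one vertex."""
    sx = sx - 0.04 * math.sin((y * 2 - 1) * 6 + (x * 2 - 1) * 7 + time * 1.59)
    sy = sy - 0.04 * math.sin((x * 2 - 1) * 8 - (y * 2 - 1) * 5 + time * 1.43)
    return sx, sy


def evaluate_mesh(
    columns: int, rows: int, time: float
) -> List[List[Tuple[float, float]]]:
    """Evaluate the scaling factors at every vertex of a columns x rows mesh.

    Vertex positions are assumed to be normalized to [0, 1] on both axes.
    """
    return [
        [
            per_vertex_scaling(i / (columns - 1), j / (rows - 1), time)
            for i in range(columns)
        ]
        for j in range(rows)
    ]
```

Every vertex requires two sine evaluations and a handful of multiplications and additions per frame, so the cost grows with the number of vertices in the mesh.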

As discussed in chapter 5, the floating point computations for each vertex are

2.2 Open source SoC platforms

There is an existing effort to build open source system-on-chips. It is interesting to review these projects in order to look forward to building upon them, possibly adding hardware accelerators or performing other modifications in order to improve performance.

There are many SoC designs available on the Internet, which are more or less mature. The system-on-chip projects listed here meet the following criteria:

• they have been shown to work on at least one FPGA board

• they are released under an open source license

• they comprise a synthesizable RISC CPU

• the CPU is supported by a C and C++ compiler

• they include a RS232 compatible UART (for a debug console)

• they support interfacing to off-chip SDRAM memory

OpenSPARC

OpenSPARC [23] is the well-known SPARC processor of Sun Microsystems which is now released under an open source license and included into a SoC FPGA project.

Implemented on a FPGA, this processor is extremely resource-intensive. A cut-down version of the CPU core only, called the SimplyRISC S1, occupies at least 37000 FPGA look-up tables (LUT) without the caches [28]. This is about twice the logic capacity of the Virtex-4 XC4VLX25 FPGA.

As it turns out, the OpenSPARC architecture is a very complex design which implements a huge number of techniques which increase the software execution speed (instructions per clock cycle). While this is a wise choice for a software-centric processor implemented on a fully custom semiconductor chip, with a FPGA process it is more appealing to keep the software processor simple in order to save resources and make room for custom hardware accelerators, taking advantage of the FPGA flexibility.

GRLIB

GRLIB [13] is a very professional and standard-compliant library of SoC cores. The library features a comprehensive set of cores: AMBA AHB/APB bus control elements, the LEON3 SPARC processor, a 32-bit PC133 SDRAM controller, a 32-bit PCI bridge with DMA, a 10/100/1000 Mb/s Ethernet MAC, 16/32/64-bit DDR SDRAM/DDR2 SDRAM controllers and more.

However, its drawbacks are:

• Code complexity. GRLIB is written in VHDL and makes intensive use of

• Cores are not self-contained. GRLIB defines many building blocks that are used everywhere else in the code, making it difficult to re-use code in another project which is not based on GRLIB.

• Significant FPGA resource usage. A system comprising the LEON3 SPARC processor with a 2-way set-associative 16kB cache and no memory management unit (MMU), the DDR SDRAM controller, a RS232 serial port, and an Ethernet 10/100 MAC uses 13264 FPGA look-up tables (LUT). They map to 79% of the Virtex-4 XC4VLX25 FPGA. We have carried out the test with the Xst synthesizer, Xilinx ISE 11.3, and GRLIB 1.0.21-b3957 (GPL release) using the default provided synthesis scripts. This undermines the possibility of adding hardware acceleration cores. In [22], a significant resource usage was also reported for an older version of LEON.

• Relatively low clock frequency. With the same parameters as above, the maximum clock frequency is 84 MHz.

Because of these reasons, GRLIB was not retained.

ORPSoC (OpenRISC)

ORPSoC is based on the OpenRISC [26] processor core, which is the flagship product of OpenCores, a community of developers of open source system-on-chips. ORPSoC is essentially maintained by ORSoC AB.

ORPSoC notably features the OpenRISC OR1200 processor core, the Wishbone [9] bus, comprehensive debugging facilities, a 16550-compatible RS232 UART, a 10/100 Mb/s Ethernet MAC and a SDRAM controller.

Unfortunately, ORPSoC is resource-inefficient and buggy. The OpenRISC implementation is not well optimized for synthesis. We carried out tests on the August 17, 2009 OpenRISC release. Still using the XC4VLX25 FPGA as target, synthesis with Xst and Xilinx ISE 11.4 yields a utilization of 5077 LUTs for the CPU core only (using the default FPGA configuration: no caches, no MMU, multiplier, and with the implementation of the RAMs using the RAMB16 elements of the FPGA selected), running at approximately 100 MHz. A similar resource usage is reported in [22]. The synthesis report shows asynchronous control signals where there should not be any (such as on the output of the program counter), which can be an indication of poor quality of the design. Other IP cores comprising ORPSoC have similar issues (we tested the 16550 UART and the Ethernet MAC). Finally, the provided SDRAM controller only supports the low-bandwidth 16-bit single data rate option, has a high latency due to the extensive use of clock domain transfer FIFOs, does not support pipelined transfers and has poorly written code.

OpenRISC and ORPSoC therefore do not seem to be a good platform for the

LatticeMico32 System

This product [30] from the FPGA vendor Lattice Semiconductor is comparable to Microblaze [34] and Nios II [4] from its competitors, respectively Xilinx and Altera. Like its competing products, LatticeMico32 System features a broad library of light, fast and FPGA-optimized SoC cores.

One interesting move made by Lattice Semiconductor is that parts of the LatticeMico32 System are released under an open source license, and most notably the custom LatticeMico32 microprocessor core. LatticeMico32 System is also based upon the Wishbone [9] bus, whose specification is free of charge and freely distributable.

While it is perhaps technically possible to build Milkymist on top of the LatticeMico32 System, there are licensing issues concerning most notably the DDR SDRAM controller which is proprietary.

However, the LatticeMico32 microprocessor core is interesting. Synthesized for the XC4VLX25 with the 2-way set-associative caches, the barrel shifter, the hardware divider and the hardware multiplier enabled, it occupies only about 2400 4-LUTs and runs at more than 100 MHz.

This microprocessor core has been retained for use in Milkymist, as described in chapter 7.

Microblaze and Nios II

Even though we are not interested in proprietary designs, we still give a brief overview of the resource usage of Microblaze and Nios II systems as a comparison.

Microblaze. In [22], the Microblaze core is reported to use approximately 2400 LUTs, like LatticeMico32. The platform builder GUI in Xilinx ISE 12.1 also limits the frequency of Microblaze systems to 100 MHz when targeting the Virtex-4 family. Thus, Microblaze is close to LatticeMico32 regarding area and frequency.

Nios II. According to an Altera report [3], Nios II/f uses 1600 Cyclone II LEs. A LE is mainly comprised of a 4-LUT and a register, which is comparable to the Virtex-4 architecture on which LatticeMico32 was tested. Thus, it seems that the Nios II core would be approximately two thirds of the area of LatticeMico32.

Some differences can be noted between the LatticeMico32 configuration and the Nios II/f configuration used in the Altera report:

• Caches are direct-mapped and 512 bytes (each).

• There is no multiplier.

• Nios II/f uses a dynamic branch predictor, while LatticeMico32 uses a static

• The report does not say if the optional hardware divider, multiplier and shifter (that were enabled in LatticeMico32) were selected.

The Nios II is also reported to run at 140 MHz with this configuration and UART, JTAG UART, SDR SDRAM controller and timer peripherals. This is very fast, but cannot be compared to the LatticeMico32 results on Virtex-4 for two reasons:

• Routing resources and logic delays for the two FPGA families are different.

• It is possible that Altera hand-tuned the Nios II processor to their FPGA technology.

2.3 DRAM technology

DRAM is by far today's dominant memory technology, often being the only affordable solution when relatively large densities (typically more than a few megabytes) are required. Unfortunately, DRAMs are not straightforward devices and we need preliminary knowledge specific to this technology in order to understand the choices discussed in chapter 3. Indeed, in order to reduce system costs, the intelligence has been moved away from the memory chips and into the memory controller [2], leaving the controller designer with the task of dealing with the low-level details of the DRAM technology.

We will therefore explain how the SDRAM (synchronous DRAM) technology works. These principles are the same for the original single data rate (SDR) SDRAM, and for the subsequent double data rate DDR, DDR2 and DDR3 memories. In all that follows, we suppose that the logic level 0 is represented by a voltage of 0 volts, and a logic level 1 is represented by a positive voltage H.

A DRAM memory bank (figure 2.6) is organized as a two dimensional array of cells. Each cell is comprised of a transistor connected to a capacitor. A cell stores one bit of information, indicated by the presence or not of a charge in the capacitor. The transistor acts as a switch that connects the capacitor to the bit line (vertical lines) when the word line (horizontal lines) its gate is connected to carries a high logic level.

A decoder translates the row address presented to the DRAM device and activates one of the word lines, according to the address.

Each bit line is connected to a sense amplifier, which is a positive feedback device that, when switched on, turns any voltage $X$ on the bit line between 0 and $H$ into 0 (if $X < \frac{H}{2}$) or $H$ (if $X > \frac{H}{2}$). The set of sense amplifiers is called the page buffer.

Accesses to a SDRAM bank are made as follows:

1. We assume the SDRAM is in the precharged (idle) state. In this state, no word line is active, the sense amplifiers are turned off and all the bit lines are held at a voltage of $\frac{H}{2}$.

2. The SDRAM controller presents the row address and issues an ACTIVATE

Figure 2.6. Block diagram of a DRAM memory bank.

de oderandoneofthe wordlinesisasserted. Thishastheee tof onne ting

allthe apa itorsoftheDRAM ellsintherowtotheirrespe tivebitlines. A

transferofele tri hargeo ursbetweentheparasiti apa itorsoftheword

lines(whi h were harged at a voltage of

H

2

⁾ ^and ^the ^DRAM^ell apa itors, whi h were either dis harged (at 0 volts) or harged at a voltage of H. This

ausesasmall hange

ǫ

ⁱⁿ^the^potential^of^the^bit^line,^whi
h^be
omes

^H ₂ −ǫ

^or

H

2 + ǫ

^(dependingôn ^the^harge înitially^stored ⁱⁿ^the^DRAMêll apa itor).

Then,theSDRAMdevi eturnsonallthesenseampliersofthebank. Onea h

bit line, the positive feedba k takes over and amplies the voltage dieren e

ǫ

ûntil ^the ^level ôf ^the ^bit ^line ^rea
hes ⁰ ôr ^H. ^The ÂCTIVÂTE ômmand îs

now ompleted and the row is said to be opened. The DDR SDRAM hips

usedinthe proje t(on the XilinxML401 board)take

20ns

^to ^omplete^these

operations.

3. Once a row has been opened, the controller can present the column address and issue READ and WRITE commands to transfer data. Reading is done by simply measuring the voltages on the bit lines, and writing can be achieved by forcing the bit lines to a particular level. There is a delay, called the CAS¹ latency, between a READ command being sent and the data being returned by the device. This delay is 20 ns with the chips used in the project. However, read operations are pipelined, which means that a new READ command can be sent while the previous one is still transferring data. With proper scheduling, full utilization of the available I/O bandwidth can be achieved.

4. Before accessing another row, the memory controller must disconnect the opened row from the bit lines and go back into the precharged state. It does so by issuing a PRECHARGE command to the device. The device takes some time to process the command (during which the bank cannot be accessed), which is 20 ns with the chips used in the project.

From this principle of operation, it becomes apparent that a performance-oriented controller should try to make several transfers in the same row before opening another one, in order to reduce the time wasted switching rows.
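The cost of a row switch can be illustrated with a small sketch. It uses only the three 20 ns figures quoted above; the function names are hypothetical, and the model is deliberately simplified (it ignores the burst transfer time and any overlap between commands to different banks), so it should be read as an order-of-magnitude illustration rather than a timing model of the actual chips.

```python
# Timings quoted in the text for the project's DDR SDRAM chips, in nanoseconds.
T_ACTIVATE_NS = 20   # ACTIVATE: open a row
T_CAS_NS = 20        # CAS latency: READ command to first data word
T_PRECHARGE_NS = 20  # PRECHARGE: close the currently open row


def read_latency_ns(row_hit: bool) -> int:
    """Time from the request to the first data word for a single read.

    Simplifying assumption: a miss always finds another row open,
    so it pays PRECHARGE + ACTIVATE before the READ.
    """
    if row_hit:
        return T_CAS_NS
    return T_PRECHARGE_NS + T_ACTIVATE_NS + T_CAS_NS


def average_latency_ns(hit_rate: float) -> float:
    """Expected read latency for a given fraction of accesses hitting the open row."""
    return hit_rate * read_latency_ns(True) + (1.0 - hit_rate) * read_latency_ns(False)
```

Under these assumptions an access to the open row is three times faster than one that requires a row switch, which is why grouping transfers by row matters.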

2.3.1 Multiple banks

SDRAM memory chips contain multiple DRAM banks internally, which share the I/O, command and address pins. Additional bank address pins select the bank to send commands to.

Having multiple banks brings two advantages:

• Being able to execute several commands simultaneously (assuming there is no resource conflict for the pins). For example, one bank can be activating one row while another bank is transferring data.

• Having several rows open (one per bank), which can reduce the number of required row switches and thus improve performance.

¹ CAS stands for Column Address Strobe, which is the name of the DRAM chip pin that the

The controller is responsible for managing the banks, and for mapping absolute memory addresses to particular banks. Appropriate bank mapping can improve performance [29].

Standard DDR SDRAM chips come with four internal banks.

2.3.2 Refreshing

Because the DRAM capacitors are not perfect, they gradually lose their charge over time, which results in data corruption.

The solution is to periodically recharge the capacitors, which is done by opening the rows one by one. SDRAM chips provide an AUTO REFRESH command which opens and closes one row in all banks (and increments an internal counter so that the next AUTO REFRESH command will target another row), but it is the responsibility of the controller to issue it. Furthermore, the controller must precharge all banks before a refresh.

With the memory chips used in the project, a refresh must be made every 7.8 µs and takes in the worst case $20 + 80 + 4 \cdot 20 = 180$ ns (precharge time + refresh time + activation time for each bank), so it has a small impact on the memory bandwidth (about 2%).
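The refresh overhead follows directly from the figures above. The short sketch below simply redoes that arithmetic; the constant and function names are illustrative and not taken from the project's code.

```python
# Figures quoted in the text, in nanoseconds.
REFRESH_INTERVAL_NS = 7800.0  # one refresh every 7.8 microseconds
T_PRECHARGE_NS = 20.0         # precharge all banks before the refresh
T_REFRESH_NS = 80.0           # duration of the refresh itself
T_ACTIVATE_NS = 20.0          # re-opening a row, counted once per bank
NUM_BANKS = 4


def worst_case_refresh_ns() -> float:
    """Worst-case time lost to one refresh, as computed in the text."""
    return T_PRECHARGE_NS + T_REFRESH_NS + NUM_BANKS * T_ACTIVATE_NS


def refresh_overhead() -> float:
    """Fraction of the time during which the memory is busy refreshing."""
    return worst_case_refresh_ns() / REFRESH_INTERVAL_NS
```

With these values the overhead is 180 / 7800, a little over 2%, which matches the estimate given in the text.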

2.4 Texture mapping

Texture mapping is a common computer graphics operation found in accelerated 3D APIs like OpenGL and DirectX. It is typically used to draw textured 3D polygons on the screen. It can also distort an image (see figure 2.7 for an example), and MilkDrop uses it for this purpose.

With common GPUs, texture mapping is performed on triangles (and more complex polygons are broken down into a series of triangles). The inputs to the algorithm are the 2D (possibly projected from the original 3D coordinates) positions of the three vertices of the triangle to be filled, and the 2D texture coordinates for these three vertices.

The algorithm then draws a textured triangle pixel by pixel, by interpolating linearly the texture coordinates of the vertices for each pixel and then copying the texture pixel (texel) at these coordinates.

Image processing operations like zooming, rotating or scaling can be implemented with texture mapping, by simply changing the vertex positions or the texture coordinates at each vertex.

Figure 2.7. Example of distorted picture.

Figure 2.8. Principle of bilinear texture filtering.

More often than not, the results of the linear interpolation are not integers, which means that the texture should be sampled between four adjacent pixels (figure 2.8). In this case, for a better rendering, the four pixels should be read and their colors should be averaged (with different weights depending on the fractional parts). This process is called bilinear filtering and is required to obtain a good rendering of MilkDrop presets (see figures 2.9 and 2.10).
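The weighting by fractional parts can be made concrete with a minimal software sketch. It is written in floating point for a single color channel and clamps at the texture border; both choices, as well as the function name, are assumptions made for the example and say nothing about how the hardware accelerator implements the filter.

```python
import math
from typing import Sequence


def bilinear_sample(texture: Sequence[Sequence[float]], x: float, y: float) -> float:
    """Sample a single-channel texture (indexed as texture[row][column]) at (x, y)."""
    height = len(texture)
    width = len(texture[0])
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # fractional parts give the blending weights
    fy = y - y0

    def texel(tx: int, ty: int) -> float:
        # Clamp coordinates so that border samples stay inside the texture.
        tx = min(max(tx, 0), width - 1)
        ty = min(max(ty, 0), height - 1)
        return texture[ty][tx]

    # Blend horizontally on the two rows, then vertically between the rows.
    top = (1 - fx) * texel(x0, y0) + fx * texel(x0 + 1, y0)
    bottom = (1 - fx) * texel(x0, y0 + 1) + fx * texel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom
```

Sampling exactly in the middle of four texels returns their plain average, while sampling at integer coordinates returns the texel itself, which is the nearest-texel result.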

In MilkDrop (and Milkymist), a special case of texture mapping is used, as the only purpose is to distort a 2D image. The target surface is always a rectangle that covers the destination picture, on which the vertices are distributed evenly as a mesh which is always kept the same regardless of the applied distortion. The distortion is defined by altering the texture coordinates at each vertex.

Texture mapping, especially when bilinear filtering is desired, is a very compute-intensive process, as explained in chapter 5. A custom hardware accelerator has been developed, whose details are also covered in that chapter.

2.5 Organization

According to this background, we can derive the following project guidelines:

• develop a fast, resource-efficient and FPGA-optimized system-on-chip

• develop an efficient memory subsystem

• reuse a light-weight soft-core CPU

• partition carefully the tasks between hardware and software

• develop custom hardware accelerators

Figure 2.9. Rendering with bilinear filtering enabled.

Figure 2.10. Rendering with bilinear filtering disabled (the nearest texel is used).

The proposed solution is outlined in figure 2.11. Not all the blocks are ready at the time of this writing, nor are all of them within the scope of this Master's thesis, which focuses on computer architecture.

More specifically, the following components are not developed yet:

• microSD controller (the current prototype uses a CF card through Xilinx SystemACE)

• USB controller

• Video input

• IR receiver

• MIDI controller

• DMX512 controller

Hardware accelerators have been developed for the computation of vertex positions (PFPU) and for texture mapping (TMU), which have been found to be the most compute-intensive parts of the process. They are discussed in detail in chapters 6 and 5, respectively.

Graphics processing also requires a significant amount of memory bandwidth, which is discussed in chapter 3.

Chapter 4 describes the on-chip interconnect used to make the various blocks communicate with one another.

Finally, chapter 7 deals with the software execution environment and how the software is architected to obtain good performance from the hardware.

Figure 2.11. SoC block diagram.

Memory subsystem

3.1 Attacking the memory wall

A recurrent point in many modern computer systems is the memory performance problem. The term memory wall was coined [33] to refer to the growing disparity of performance between logic such as CPUs and off-chip memories. While microprocessor performance has been improving at a rate of 60 percent per year, the access time to DRAM has been improving at less than 10 percent per year [27].

Memory performance is measured with two metrics:

• bandwidth, which is the amount of data that the memory system can transfer during a given period of time.

• latency, which is the amount of time that the memory system spends between the issue of a memory read or write request and its completion.

A memory system can have both high bandwidth and high latency. If the logic making the memory accesses is able to issue requests in a pipelined fashion, sending a new request without waiting for the previous one to complete, high latency will not have an impact on bandwidth.

Latency and bandwidth are however linked in practice. Decreasing the latency also increases the bandwidth in many cases, because latency blocks sequential processes and prevents them from utilizing the full available bandwidth.

High-end processors for servers and workstations cope well with relatively high memory latency, because techniques such as out-of-order execution and hardware multi-threading enable the processor to issue new instructions even though one is blocking on a memory access.

Some SDRAM controllers do a lot to optimize bandwidth but have little focus on latency. Bandwidth-optimizing techniques include:

• reordering memory transactions to maximize the page mode hit rate.

• grouping reads and writes together to reduce write recovery times. Along with the above technique, this has a detrimental impact on latency because of the delays incurred by the additional logic in the address datapath.

• running the system and the SDRAM in asynchronous clock domains in order to be able to run the SDRAM at its maximum allowable clock frequency. This requires the use of synchronizers or FIFOs, which have a high latency.

• configuring the SDRAM at high CAS latencies in order to increase its maximum allowable clock frequency. This trend is best illustrated by the advent of DDR2 and DDR3 memories whose key innovation is to run their internal DRAM core at a sub-multiple of the I/O frequency with a wide data bus which is then serialized on the I/O pins. Since the internal DRAM core has a latency comparable to that of the earlier SDR and DDR technologies, the number of CAS latency cycles relative to the I/O clock is also multiplied.

An extreme example of these memory controller bandwidth optimizations is the MemMax® DRAM scheduler [17]. This unit sits on top of an already existing memory controller (which already has its own latency), adding seven stages of complex and high-latency pipelining that produces a good, but compute-intensive, DRAM schedule. The actual efficiency of this system has been questioned [15] because of that significant increase in latency.

3.2 Another approach

The out-of-order execution and hardware multi-threading processor optimizations discussed above that cope with high memory latency are complex and impractical in the context of small and cheap embedded systems, especially those targeted at FPGA implementations. For example, FPGA implementations of the OpenSPARC [23] processor, which employs such optimizations, typically require an expensive high-end Xilinx XUPV5 board whose Virtex-5 FPGA alone costs roughly 13000 SEK.

Milkymist therefore uses simple in-order execution schemes in its CPU and in its accelerators, and strives to improve performance by focusing on reducing the memory latency.

The memory system features that improve latency (but also bandwidth) are discussed below.

3.3 Memory system features

3.3.1 Single SDRAM and system clock domain

The typical operating frequency of early SDR and DDR SDRAM (technologies that are prior to DDR2 and do not have a clock divider for the internal DRAM core) [...] the complete SoC. Thus, it was decided to run the DRAM and the system synchronously in order to remove the need for any clock domain transfer logic and reduce latency. The SDRAM I/O registers are clocked by the system clock, and timing of the SDRAM interface is met through the use of calibrated on-chip delay elements and delay-locked loops (DLLs) to generate the off-chip SDRAM clock and the data strobes.

3.3.2 Page mode control algorithm

The Milkymist memory controller takes the so-called page mode gamble: after an access, the DRAM row is left open in the hope that the next transaction to the memory bank will occur within the same row. If the memory controller is right, the read or write command can be immediately registered into the SDRAM, and only the CAS or write latency is incurred. If the memory controller is wrong, it must first precharge the DRAM bank and open the correct row, causing extra delays.

Thus, if the memory controller is often wrong, taking the page mode gamble will actually impact performance negatively. However, a study has shown [29] that, with typical memory timings, the point at which the gamble pays off is for a page hit probability of 0.375 only, attainable with many practical memory access patterns.

Page hit probability is improved by the ability of the Milkymist memory controller to track open rows independently in each of the four memory banks that commercial SDRAM chips are equipped with.

This optimization positively affects both latency and bandwidth.
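The per-bank bookkeeping behind the page mode gamble can be sketched in a few lines of software. This is a behavioural model written for illustration only: the class and method names are hypothetical, refresh and timing constraints are ignored, and the actual controller is a hardware state machine rather than Python code.

```python
from typing import List, Optional

NUM_BANKS = 4  # the text states that SDRAM chips come with four internal banks


class OpenRowTracker:
    """Remembers which row, if any, is currently open in each bank."""

    def __init__(self, num_banks: int = NUM_BANKS) -> None:
        self.open_rows: List[Optional[int]] = [None] * num_banks

    def commands_for(self, bank: int, row: int, write: bool = False) -> List[str]:
        """Return the low-level command sequence needed for one burst access."""
        commands: List[str] = []
        current = self.open_rows[bank]
        if current != row:
            if current is not None:
                commands.append("PRECHARGE")  # page miss: close the wrong row
            commands.append("ACTIVATE")       # open the requested row
            self.open_rows[bank] = row
        commands.append("WRITE" if write else "READ")
        # The row is deliberately left open afterwards: this is the gamble.
        return commands
```

A second access to the same row of the same bank yields a single READ or WRITE command, while an access to a different row of that bank costs PRECHARGE and ACTIVATE first. Accesses to other banks do not disturb the open row, which is how independent tracking raises the hit probability.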

3.3.3 Burst accesses

All memory accesses are made using bursts, i.e. when an access for a word is made, the following words are also read or written. Burst mode is a feature of the SDRAM chips: only one read or write command is sent to them, and several words are transferred on subsequent clock cycles.

Using bursts frees the bus and DRAM control signals while other words are transferred, allowing the issue of new commands overlapping the data phase of the previous transaction.

Burst access is a form of prefetching that improves latency. It is only efficient when the prefetched data can be used by the requesting bus master. In the Milkymist system-on-chip, this is often the case:

• The CPU core has caches which access memory by complete cache lines. Thus, if the cache line length is a multiple of the burst length, the bursts can easily be stored in full.

• The video frame buffer repeatedly reads the same block of data in a sequential manner, and can easily make full use of the prefetched data assuming that is [...]

• The texture mapping unit also has a cache and a write buffer which work well with burst accesses. This is discussed in Chapter 5.

3.3.4 Burst reordering

This feature enables the use of the critical-word-first scheme in caches, reducing the overall memory latency.

When a request is issued at an address which is not a multiple of the burst length, the order of the words in the burst is changed so that the first word that comes out is the very word that is at the requested memory address. The prefetch address is then incremented and wraps to stay within the same burst.

For example, assuming a burst length of 4:

• a request at address 0 fetches words 0, 1, 2 and 3 (in this order)

• a request at address 2 fetches words 2, 3, 0 and 1 (in this order)
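The wrapping order in the two examples above can be reproduced with a short function. The name `burst_order` and the word-address convention are choices made for this sketch; the power-of-two restriction is an assumption that keeps the masking arithmetic simple.

```python
from typing import List


def burst_order(address: int, burst_length: int = 4) -> List[int]:
    """Return the word addresses of a burst, requested word first, wrapping inside the burst."""
    if burst_length <= 0 or burst_length & (burst_length - 1):
        raise ValueError("burst_length must be a power of two")
    base = address & ~(burst_length - 1)   # first word of the aligned burst
    offset = address & (burst_length - 1)  # position of the requested word
    return [base | ((offset + i) % burst_length) for i in range(burst_length)]
```

For instance, `burst_order(2)` gives words 2, 3, 0 and 1, and a request at word address 6 stays inside the burst that starts at address 4.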

3.3.5 Pipelining

The memory bus of Milkymist [8] is pipelined. During the transfer of the prefetched (burst) data, a new request can be issued. This is illustrated for a read request by the table below:

Address  A1       A1       A1       A2       A2       A2       A2
Data                       M(A1)    M(A1+1)  M(A1+2)  M(A1+3)  M(A2)    M(A2+1)  M(A2+2)  M(A2+3)

Together with burst access, this helps to achieve high performance: the memory controller can hide DRAM latencies and row switch delays by issuing the requests to the DRAM in advance, while the previous transaction is still transferring data.

3.4 Practical implementation

The Milkymist SoC uses 32-bit DDR SDRAM, configured to its maximum burst length of 8. Since the DDR SDRAM transfers two words per clock cycle (one on each edge), this is turned by the I/O registers into bursts of four 64-bit words synchronous to the system clock.

The memory is run at 100 MHz, yielding a peak theoretical bandwidth of 6.4 Gb/s, which is more than enough for the intended video synthesis application (table 3.1). This bandwidth is however never attained: events such as switching DRAM rows, which takes significant time, and, to a lesser extent, DRAM refreshes introduce dead times on the data bus. We will see in section 3.5 that such an oversizing of the [...]

Task | Required bandwidth
VGA frame buffer, 1024x768, 75 Hz, 16 bpp | 950 Mb/s
Distortion: texture mapping, 512x512 to 512x512, 30 fps, 16 bpp | 250 Mb/s
Live video: texture mapping, 720x576 to 512x512 with transparency, 30 fps, 16 bpp | 300 Mb/s
Scaling: texture mapping, 512x512 to 1024x768, 30 fps, 16 bpp | 500 Mb/s
Video echo: texture mapping, 512x512 to 1024x768 with transparency, 30 fps, 16 bpp | 900 Mb/s
NTSC input, 720x576, 30 fps, 16 bpp | 200 Mb/s
Software and misc. | 200 Mb/s
Total | 3.3 Gb/s

Table 3.1. Estimate of the memory bandwidth consumption.
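Some of these figures can be cross-checked with simple arithmetic. The sketch below recomputes the peak bandwidth and the two entries that are plain scan rates (frame buffer and video input); the helper names are hypothetical, and the texture mapping entries are taken from the table as given, because the assumptions behind them are not spelled out here.

```python
def peak_bandwidth_bps(bus_width_bits: int, clock_hz: int, transfers_per_cycle: int) -> int:
    """Peak bandwidth of a memory bus in bits per second."""
    return bus_width_bits * clock_hz * transfers_per_cycle


def scan_bandwidth_bps(width: int, height: int, rate_hz: int, bits_per_pixel: int) -> int:
    """Bandwidth needed to read or write every pixel of a frame once per frame period."""
    return width * height * rate_hz * bits_per_pixel


# Entries of table 3.1, in Mb/s, copied from the table.
TABLE_3_1_MBPS = {
    "VGA frame buffer": 950,
    "Distortion": 250,
    "Live video": 300,
    "Scaling": 500,
    "Video echo": 900,
    "NTSC input": 200,
    "Software and misc.": 200,
}

PEAK_BPS = peak_bandwidth_bps(32, 100_000_000, 2)   # 32-bit DDR at 100 MHz
FRAMEBUFFER_BPS = scan_bandwidth_bps(1024, 768, 75, 16)
VIDEO_INPUT_BPS = scan_bandwidth_bps(720, 576, 30, 16)
TOTAL_MBPS = sum(TABLE_3_1_MBPS.values())
```

The frame buffer works out to about 944 Mb/s and the video input to about 199 Mb/s, consistent with the rounded 950 Mb/s and 200 Mb/s entries. The total of 3300 Mb/s is roughly half of the 6.4 Gb/s peak.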

Figure 3.1. Block diagram of the HPDMC architecture.

The architecture of the memory controller, called HPDMC (for High Performance Dynamic Memory Controller), is outlined in figure 3.1.

The control interface is used by the system to configure the controller, and also to issue the start-up sequence to the SDRAM. Indeed, SDRAM chips require a sophisticated sequence of commands upon power-up. In many memory controller designs, a hardware finite state machine is used to issue this command sequence. [...] software, and, for this purpose, includes a bypass MUX that directly routes a configuration and status register of HPDMC to the SDRAM command and address pins. Once the SoC has run a software routine that sends the correct initialization sequence to the SDRAM, it permanently switches the bypass MUX to the SDRAM management unit and can use off-chip memory normally.

The SDRAM management unit is a finite state machine that translates the two high-level memory commands "read burst at address" and "write burst at address" into a series of lower-level commands understandable by the SDRAM chips (precharge bank, select row, read from row, etc.). The management unit is responsible for keeping track of the open rows, detecting page hits, switching rows, and issuing periodic DRAM refresh cycles.

The management unit is connected to the data path controller, which follows the activities performed by the management unit in order to decide the direction of the bidirectional I/O pins (they should be set as outputs for writes and as inputs for reads). The data path controller is also responsible for sending signals to the management unit that indicate whether it is safe to perform certain low-level operations. For example, the read_safe signal goes low immediately after a read command is issued, because if another one were sent immediately after, the two resulting bursts would overlap in time, which cannot work because there is only one set of data pins. The data path controller also takes into account the SDRAM write and read latencies to generate an acknowledgement signal when the data is actually there (or needs to be sent to the SDRAM) after a "read row" or "write row" command has been sent to the SDRAM.

Finally, the bus interface is a piece of glue logic that connects the SoC pipelined memory bus (FML) to the data path controller and the management unit.

HPDMC has been implemented in Verilog HDL, tested and debugged in RTL simulation using a DDR SDRAM Verilog model from Micron, integrated into the SoC, synthesized into FPGA technology, and finally calibrated and tested by software routines running on the actual hardware.

This memory controller design, specifically crafted for the Milkymist project and released under the GNU GPL license on the internet, has been picked up by NASA for a software-defined radio project and may be put on board the International Space Station in 2011. Gregory Taylor, Electronics Engineer at the NASA Jet Propulsion Laboratory, wrote:

While searching for a suitable SDRAM controller for the Jet Propulsion Laboratory's Software-Defined Radio on board NASA's CoNNeCT experiment, I found Sébastien's HPDMC SDRAM controller on OpenCores.org. We needed a controller that was both high performance and well documented. Though the original HPDMC controller was designed for DDR SDRAM with a 32-bit bus, Sébastien clearly explained the modifications necessary to adapt the controller to our Single Data Rate, 40-bit wide SDRAM chip. I found the code to be well documented and easy to follow. The performance has met our requirements and the FPGA size requirement is small.

The Communication Navigation and Networking Reconfigurable Testbed (CoNNeCT) [...] of SDRs conforming to the Space Telecommunications Radio Systems (STRS) open architecture standard. The HPDMC controller will likely find its way into one or more loadable waveform payloads in the JPL SDR, and perhaps be used in other NASA projects as well. It may eventually find its way into deep space.

3.5 Performance measurement

3.5.1 Introduction

We wanted to validate and characterize the memory system performance (actual latency and bandwidth) and get an upper bound on its ability to sustain loads, by extrapolating the maximum bandwidth one could get assuming the memory access time remains constant.

Since the memory performance depends on the particular access pattern that the system makes (because of the controller taking the page mode gamble), we wanted to take the measurements on the real system while it is rendering video effects in order to get an accurate result.

3.5.2 Method

A logic core has been added to the SoC that snoops on the memory bus activity in order to report the average latency and bandwidth.

That logic core exploits properties of the Fast Memory Link signaling in order to reduce its complexity to two counters that measure, for a given time period, the number of cycles during which the strobe and acknowledgement signals are active. Several parameters can then be computed:

• the net bandwidth carried by the link (based on the amount of data that the link has actually transferred)

• the average memory access time, which is the time, in cycles, between the request being made to the memory controller and the first word of data being transferred.

• the bus occupancy, which is the percentage of time during which the link was busy and therefore unavailable for a new request.

Every Fast Memory Link transaction begins with the assertion of the strobe signal. Then, after one or more wait cycles, the memory controller asserts the acknowledgement signal together with the first word of data being transferred. The next cycle, the strobe signal is de-asserted (unless a new transaction begins) while the next word in the burst is being transferred. A new transaction can start with the assertion of the strobe signal even if a burst is already going on (pipelining). See figure 3.2 for an example.

Figure 3.2. FML transactions.

• $f$ is the system clock frequency in Hz.

• $T$ is the time during which the counters have been enabled.

• $w$ is the width of an FML word in bits.

• $n$ is the FML burst length.

• $S$ is the number of cycles during which the strobe signal was active.

• $A$ is the number of cycles during which the acknowledgement signal was active.

Net bandwidth. By counting the number of cycles for which the acknowledgement signal was active, one gets the number of transactions. Since each transaction carries exactly a burst of data, which is $w \cdot n$ bits in size, the volume of data transferred is given by $w \cdot n \cdot A$. Thus, one can derive the net bandwidth as:

$$\beta = \frac{w \cdot n \cdot A}{T} \qquad (3.1)$$

Average memory access time. On the bus, a master is waiting when the strobe signal is asserted but the acknowledgement signal is not. Therefore, the total number of wait cycles is given by $S - A$. The average memory access time can thus be computed as:

$$\Delta = \frac{S - A}{A} \qquad (3.2)$$

The average memory access time can be used to derive an upper bound on the maximum bandwidth that the memory system can handle. Indeed, FML is a pipelined bus which supports only one outstanding (waiting) transaction, so the case that uses the most bandwidth for a given memory access time is when the strobe signal is always asserted (figure 3.3) so that a new transaction begins as soon as the [...]

Figure 3.3. Maximum utilization of an FML bus.

Each transaction then occupies $\Delta + 1$ cycles on the bus while transferring $n$ words. Therefore, only a fraction $\alpha$ of the peak bandwidth $f \cdot w$ can be used at most. Since this fraction cannot exceed 1, we have:

$$\alpha = \min\left(1, \frac{n}{\Delta + 1}\right) \qquad (3.3)$$

The maximum bandwidth is:

$$\beta_{max} = \alpha \cdot f \cdot w \qquad (3.4)$$

Bus occupancy. The bus is busy when the strobe signal is asserted. The bus occupancy is therefore given by:

$$\epsilon = \frac{S}{T \cdot f} \qquad (3.5)$$

By using this method, a very simple piece of hardware added to the system can yield useful information about the performance of the memory system.
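The post-processing of the two counters can be sketched as follows. The class and function names are illustrative and do not come from the Milkymist sources. The usable fraction is computed with `min`, because a fraction of the peak bandwidth cannot exceed 1; this choice agrees with the measured results, where an access time of 5.51 cycles corresponds to about 61%.

```python
from dataclasses import dataclass


@dataclass
class FmlCounters:
    f: float  # system clock frequency in Hz
    T: float  # time during which the counters were enabled, in seconds
    w: int    # width of an FML word in bits
    n: int    # FML burst length in words
    S: int    # cycles during which the strobe signal was active
    A: int    # cycles during which the acknowledgement signal was active


def net_bandwidth(c: FmlCounters) -> float:
    """Net bandwidth in bits per second (equation 3.1)."""
    return c.w * c.n * c.A / c.T


def average_access_time(c: FmlCounters) -> float:
    """Average memory access time in cycles (equation 3.2)."""
    return (c.S - c.A) / c.A


def usable_fraction(c: FmlCounters) -> float:
    """Largest usable fraction of the peak bandwidth (equation 3.3)."""
    return min(1.0, c.n / (average_access_time(c) + 1.0))


def max_bandwidth(c: FmlCounters) -> float:
    """Upper bound on the bandwidth in bits per second (equation 3.4)."""
    return usable_fraction(c) * c.f * c.w


def bus_occupancy(c: FmlCounters) -> float:
    """Fraction of the time during which the bus was busy (equation 3.5)."""
    return c.S / (c.T * c.f)
```

As a usage example with made-up counter values, one second of measurement at 100 MHz with 64-bit words, bursts of 4, `S = 6_500_000` and `A = 1_000_000` gives a net bandwidth of 256 Mb/s, an access time of 5.5 cycles and a bus occupancy of 6.5%.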

3.5.3 Results

Results are summarized in table 3.2. The first line corresponds to a system running the demonstration firmware with the video output enabled at the standard VGA mode of 640x480 at 60 Hz (therefore continuously scanning the screen with data from system memory), but not rendering a preset. The other lines represent the results while the demonstration firmware is rendering different MilkDrop presets, still at the same video resolution.

It is difficult to compare these results to those of other memory controllers as they are usually not published (or not measured at all).

However, two conclusions can be drawn:

• there are enough occupancy and bandwidth margins for the system to operate at higher resolutions and/or color depths than 640x480 and 16 bits per pixel. The 3.3 Gb/s bandwidth requirement that was estimated in section 3.4 seems [...]

• to go further, an out-of-order memory controller can be envisioned. Such a controller would have a split transaction bus (allowing a larger number of outstanding transactions, thus minimizing the impact that latency has on bandwidth) and would be able to reorder pending memory transactions to maximize the page hit rate.

Patch | β | ε | Δ | α | β_max
Idle | 292 | 7 % | 5.51 | 61 % | 3932
Geiss - Bright Fiber Matrix 1 | 990 | 28 % | 6.37 | 54 % | 3474
Geiss - Swirlie 3 | 1080 | 32 % | 6.71 | 52 % | 3320
Geiss - Spacedust | 1021 | 29 % | 6.47 | 54 % | 3427
Illusion & Rovastar - Snowflake Delight | 1399 | 39 % | 6.28 | 55 % | 3516
Rovastar & Idiot24-7 - Balk Acid | 1427 | 41 % | 6.38 | 54 % | 3469

Table 3.2. Memory performance in different conditions (Milkymist 0.5.1). Bandwidths are in Mb/s.

SoC interconnect

This chapter explains how the different interconnect busses work, what their features are, why they are there, and how they communicate with each other.

The general SoC block diagram and its interconnect are outlined in figure 2.11.

4.1 General SoC interconnect: the Wishbone bus

Wishbone [9] is a general purpose royalty-free SoC bus with open specifications, advocated by the maintainers of the OpenCores.org website.

Wishbone is a synchronous sequential bus with support for variable latency (wait states) through the use of an acknowledgement signal that marks the end of the transaction. Burst modes (automatic transfer of consecutive words) are supported and are configurable on a per-transaction basis (i.e. bursts of arbitrary lengths and single-word transactions can be freely mixed on the same bus). However, there is no pipelining.

Wishbone is used around the SoC's LatticeMico32 CPU core and for simple DMA masters which have modest requirements of bandwidth and of volume of transferred data. As explained in Section 4.4, connecting DMA masters that transfer small amounts of data (which can fit in the L2 cache) to the same bus as the CPU simplifies dealing with cache coherency issues.

The data width used for the Wishbone bus is 32 bits, yielding a peak bandwidth of 3.2 Gb/s when the system is running at 100 MHz.

4.2 Configuration and Status Registers: the CSR bus

Milkymist uses memory-mapped I/O through configuration and status registers.

If these registers were directly accessed by the Wishbone CPU bus, two problems would arise:

• Connecting all peripherals on the same Wishbone bus involves large multiplexers and high fanout signals, posing routing and timing problems.

• Wishbone requires the generation of an acknowledgement signal by each slave core. This signal is useful in many cases, as it supports peripherals with a variable latency. However, configuration and status register files are usually implemented with actual registers (flip-flops) or SRAM, which can always be accessed in one clock cycle. Thus, there is no need for variable latency and the acknowledgement signal. Keeping this signal for the configuration and status registers wastes hardware resources and development time.

To alleviate these problems, the CSR bus has been developed [7] and used in the system through a bus bridge.

The CSR bus is a simpler bus than Wishbone, where all transfers are done in one cycle. It has an interface similar to that of synchronous SRAM, consisting only of address, data in, data out and write enable pins and clocked by the system clock.

A bridge connects the CSR bus to the CPU Wishbone bus, to allow transparent memory-mapped access to the configuration and status registers by the software. This bridge includes registers for all the signals crossing the two busses, relaxing the timing constraints.

4.3 High-throughput memory access bus: the FML bus

Fast Memory Link (FML) [8] was co-designed with HPDMC (the memory controller) as an on-chip bus tailored to access SDRAM memories at high speed while keeping the memory controller simple. Its key features are listed below.

4.3.1 Variable latency

SDRAM latency varies a lot depending on the state of the SDRAM at the time the request is issued on the bus. It depends on whether the SDRAM was in the middle of a refresh cycle, whether the bank needs to be precharged, and whether a new row needs to be activated. Therefore, FML provides support for a variable number of wait states, defined by the memory controller, through the use of an acknowledgement signal similar to that of Wishbone.

4.3.2 Burst only

SDRAM is best accessed in burst mode (see subsection 3.3.3).

However, enabling or configuring burst mode is a relatively lengthy and complex operation, requiring a reload of the SDRAM mode register which takes several cycles. Furthermore, supporting multiple burst lengths makes the scheduling of the transfers more complex to avoid overlapping transfers that would create conflicts at the data pins.

Therefore, in order to greatly simplify the memory controller, all transfers on
