Master of Science Thesis Stockholm, Sweden 2010 TRITA-ICT-EX-2010:90
SÉBASTIEN BOURDEAUDUCQ
A performance-driven SoC architecture for video synthesis
KTH Information and Communication Technology
Master of Science Thesis in System-on-Chip Design
Royal Institute of Technology
Department of Software and Computer Systems
Sébastien Bourdeauducq, June 2010. Milkymist is a trademark of Sébastien Bourdeauducq.
Commercial system-on-chips with advanced graphics acceleration capabilities are becoming ubiquitous today. However, in contradiction with the open source idea, little is known about the details of their architecture and implementation, as they are usually covered by trade secrets.
Fostered by the falling costs of high-density FPGAs, our thesis project encompasses researching, developing and implementing the key points of the architecture of an open source and comprehensive system-on-chip with competitive yet reasonable graphics capabilities. The chosen target application is the synthesis of visual effects similar to those produced by the popular MilkDrop visualization plug-in for Winamp.
Our system-on-chip design consists principally of a custom bus infrastructure, a custom DDR SDRAM memory controller, a microprocessor core, and custom graphics accelerators for texture mapping and floating point processing.
Our base microprocessor system is capable of running Linux (without MMU) and outperforms a Microblaze-based solution tested in similar conditions by a 15 to 35% increase in speed of execution. For our video synthesis application, our texture mapping accelerator achieves an average fill rate of 44 megapixels per second and our floating point processing unit provides in excess of 70 million floating point operations per second. Everything, including I/O peripherals (AC97 audio, Ethernet, RS232 UART, GPIO), is implemented on a Virtex-4 XC4VLX25 FPGA, where it utilizes about 80% of the resources.
Finally, we have successfully developed an embedded video synthesis program that leverages the possibilities of our hardware architecture to permit the live rendering of many MilkDrop effects in 640x480 resolution at 30 frames per second.
First, I would like to express my gratitude to Professor Mats Brorsson, my supervisor and examiner at the Royal Institute of Technology, for having the open-mindedness of letting me write my thesis on this subject and for his help and advice with it.
I would also like to thank Lattice Semiconductor for opening the source code of their LatticeMico32 processor core.
Special thanks go to all the people who are indirectly involved with this Master's thesis project: Henry de Beauchesne (Xilinx) for getting me started with high-end FPGA tools, Shawn Tan (Aeste Works (M) Sdn Bhd) for his help with understanding the WISHBONE bus, Gregory Taylor (NASA's Jet Propulsion Laboratory) for letting me know that they were using parts of my code in the development of a communications system to be put on board the international space station, Takeshi Matsuya (Keio University) for his work on the port of Linux to the system-on-chip described herein, Michael Walle for developing support of the system-on-chip in the QEMU emulator and Wolfgang Spraul (Sharism at Work Ltd.) for proposing to me an agreement for manufacturing devices using the system-on-chip design.
Thanks to the Eidlon music band (Rheims, France), for whom I wrote my first PC-based video synthesis program in 2005, which has been a source of inspiration for this project.
Finally, I would like to thank all the researchers who have retained their copyright on their papers (or have put them in the public domain) and distribute them online for everybody to download freely (incidentally in accordance with the principle of free exchange of information from the KTH ethics policy). This in spite of the default agreement of many publishers such as the IEEE, which asks authors to assign their copyrights to the publishers so the latter have the exclusive permission to sell the download of documents that they did not write, without giving back to the authors, at a price supposedly meant to cover publishing expenses but which is not justified by today's low costs of network bandwidth and servers.
Thanks to these researchers, I have been able to access quality scientific literature before I went to a university, from which I have learned a lot. Even throughout the writing of this Master's thesis, papers freely available online enabled greater productivity as access to them was much faster.
1 Introduction
2 Background
2.1 Video synthesis
2.1.1 Overview
2.1.2 Principle
2.2 Open source SoC platforms
2.3 DRAM technology
2.3.1 Multiple banks
2.3.2 Refreshing
2.4 Texture mapping
2.5 Organization
3 Memory subsystem
3.1 Attacking the memory wall
3.2 Another approach
3.3 Memory system features
3.3.1 Single SDRAM and system clock domain
3.3.2 Page mode control algorithm
3.3.3 Burst accesses
3.3.4 Burst reordering
3.3.5 Pipelining
3.4 Practical implementation
3.5 Performance measurement
3.5.1 Introduction
3.5.2 Method
3.5.3 Results
4 SoC interconnect
4.1 General SoC interconnect: the Wishbone bus
4.2 Configuration and Status Registers: the CSR bus
4.3 High-throughput memory access bus: the FML bus
4.3.1 Variable latency
4.3.2 Burst only
4.3.3 Burst reordering
4.3.4 Pipelining
4.3.5 Usage
4.4 Bridging Wishbone to FML
4.5 Cache coherency
4.5.1 Coherency issues around the CPU (L1) cache
4.5.2 Coherency issues around the Wishbone-FML (L2) cache
5 Texture mapping unit
5.1 Algorithm
5.1.1 Two-dimensional interpolation
5.1.2 One-dimensional interpolation
5.1.3 Bilinear filtering
5.2 Performance considerations
5.2.1 Context
5.2.2 Execution time of the interpolation algorithm
5.2.3 Total execution time
5.3 Pipelined hardware implementation
5.3.1 Strategy
5.3.2 Vertex fetch engine
5.3.3 Interpolators
5.3.4 Clamping/wrapping
5.3.5 Address generator
5.3.6 Texel cache
5.3.7 Bilinear filter
5.3.8 Write buffer
5.3.9 Control interface
5.4 Extra features
5.5 Implementation results
6 Floating point co-processor
6.1 Purpose
6.2 Forms of parallelism
6.3 Hardware architecture
6.3.1 Overview
6.3.2 Instruction set
6.3.3 Instruction RAM
6.3.4 ALU
6.4 Run-time compiler
6.4.1 Compilation into virtual machine instructions
6.4.2 Scheduling
6.4.3 Constants and user variables
7 Software
7.1 LatticeMico32
7.2 Capabilities
7.3 Benchmarking
7.4 Design of a MilkDrop-like rendering program
7.4.1 Description
7.4.2 Cache coherency
7.4.3 Event-driven operation
7.4.4 Results
8 Conclusion and future works
1.1 FPGA boards of the Ikos Pegasus ASIC emulator (ca. 1999).
1.2 Project logo.
2.1 Sample video frame from the MilkDrop visual synthesizer.
2.2 Sample video frame from Visikord, a program mixing live video into MilkDrop.
2.3 The embedded user interface (based on Genode FX [12]) of Flickernoise, the Milkymist VJ application. The patch editor is shown, with per-frame and per-vertex equations.
2.4 Basic MilkDrop rendering flow.
2.5 Excerpt from the MilkDrop preset Geiss Warp of Dali 1 (with some simplifications).
2.6 Block diagram of a DRAM memory bank.
2.7 Example of distorted picture.
2.8 Principle of bilinear texture filtering.
2.9 Rendering with bilinear filtering enabled.
2.10 Rendering with bilinear filtering disabled (the nearest texel is used).
2.11 SoC block diagram.
3.1 Block diagram of the HPDMC architecture.
3.2 FML transactions.
3.3 Maximum utilization of an FML bus.
5.1 Typical decomposition into triangular primitives of the MilkDrop rendering surface.
5.2 2D linear interpolation on a rectangle.
5.3 One-dimensional linear interpolation algorithm.
5.4 Bilinear filtering using the fixed point texture coordinates.
5.5 Block diagram of the texture mapping unit architecture.
5.6 Vector interpolator.
5.7 Pipelined scalar interpolator.
5.8 Architecture of the four-channel texel cache.
5.9 Disposition of the channels within the texture, general case.
5.10 Disposition of the channels within the texture, vertical wrapping.
5.11 Disposition of the channels within the texture, horizontal wrapping.
5.12 Disposition of the channels within the texture, horizontal and vertical wrapping.
5.13 TMU output picture for the copy set (original picture).
5.14 TMU output picture for the zoom in set.
5.15 TMU output picture for the zoom out set.
5.16 TMU output picture for the rotozoom set.
5.17 Typical TMU simulation trace (excerpt).
5.18 Hit rates versus texel cache size. The X axis (cache size) uses a logarithmic scale.
5.19 Theoretical write buffer throughput versus memory write access time.
5.20 Measured TMU performance versus global texel cache hit rate.
6.1 Hardware architecture of the floating point co-processor.
6.2 Fast inverse square root algorithm.
7.1 LatticeMico32 architecture (Lattice Semiconductor).
7.2 Linux booting on the Milkymist SoC.
7.3 Xilinx ML401 development board.
7.4 Comparative MiBench results of Milkymist and Microblaze.
7.5 Rendering software architecture.
8.1 Printed circuit board floor plan of the Milkymist One.
3.1 Estimate of the memory bandwidth consumption.
3.2 Memory performance in different conditions (Milkymist 0.5.1). Bandwidths are in Mb/s.
5.1 Estimates of the cost of common software operations.
5.2 Detailed estimate of the execution time of the interpolation algorithm.
5.3 Optimistic estimate of the execution time of software texture mapping.
5.4 Texture coordinate sets used for benchmarking the texel cache.
5.5 Hit rates for each set of texture coordinates and different cache sizes.
6.1 PFPU instruction format.
6.2 Greedy PFPU scheduler performance with the per-vertex math of different MilkDrop patches (Milkymist 0.5.1).
6.3 PFPU latencies in cycles (Milkymist 0.5.1).
6.4 Exact cost in instructions of common operations on the PFPU.
7.1 User execution times on Milkymist 0.2.
7.2 User execution times on Microblaze 10.1.
Introduction
The open source model supports the idea that any individual, if he or she has the required level of technical knowledge, can realistically use, share and modify the design of a technical system. During the nineties, this development model gained popularity in the software world with, most notably, the Linux operating system. But it was not viable for complex SoCs until a few years ago, because the cost of prototyping semiconductor chips is prohibitive and field programmable gate arrays (FPGAs) used to be too slow, too small, and too expensive. System-on-chip design and hands-on computer architecture therefore remained a field reserved to well-funded academia and research and development laboratories of companies of a significant size and wealth, who had access to large FPGA clusters or even semiconductor foundries.
But the cost of FPGAs is falling (this was already the case between 1985 and 1994 [24] and the trend has continued since then) and relatively fast and high-density devices are today becoming available to the general public. For an example of this falling cost (and increasing densities and speed), we will mention the Ikos Pegasus application specific integrated circuit (ASIC) emulator, whose insides are depicted in figure 1.1. The LatticeMico32 CPU core used in the system-on-chip described in this thesis alone occupies 60% of the resources of one of the XC4036XL FPGAs of this device, and runs at 30MHz. The Ikos Pegasus was a state-of-the-art device a decade ago. It consumes up to 3 kilowatts of power, weighs dozens of kilos and probably cost the equivalent of several millions of SEK. The same CPU core now occupies about 15% of a modern FPGA costing less than 500 SEK, where it runs in excess of 100MHz.
This evolution makes it possible to implement complex high-performance system-on-chips (SoC) that can be modified and improved by anyone, thanks to the flexibility of the FPGA platform.
This Master's thesis introduces Milkymist™ [6], a fast and resource-efficient FPGA-based system-on-chip designed for the application of rendering live video effects during performances such as concerts, clubs or contemporary art installations.
Figure 1.1. FPGA boards of the Ikos Pegasus ASIC emulator (ca. 1999).
Figure 1.2. Project logo.
Such effects are already popularized by artists known as video jockeys, or VJs. VJing is commonly done with a PC and computer software such as GrandVJ [5] or Resolume [11]. However, this approach has some drawbacks and using an embedded device instead would be interesting:
• A device of very small size and weight is possible, which is convenient in mobile or temporary setups.
• Boot and set-up time (launching the software) can be greatly reduced (to a few seconds).
• Many interfaces for interactive performances (MIDI, DMX, video input, low-level digital I/O for user sensors) can be integrated. By comparison, the equivalent PC-based solution would be expensive and bulky.
Besides the fact that this is an interesting, creative and popular application, it is also demanding in terms of computational power and memory performance. Such a project would also be a proof that high performance open source system-on-chip design is possible in practice; with a view to help, foster and catalyze similar open hardware initiatives. As the Milkymist system-on-chip is entirely made of synthesizable Verilog and, for the most part, released under the GNU General Public License (GPL), its code can be re-used by other open hardware projects.
Meeting the performance constraints while still using cheap and relatively small FPGAs is perhaps the most interesting and challenging technical point of this project, and it could not be done without substantial work in the field of computer architecture. This is what this Master's thesis covers.
Background
2.1 Video synthesis
2.1.1 Overview
MilkDrop [25] (figure 2.1) is a popular open source video synthesis framework that was originally made to develop visualization plug-ins for the Winamp audio player. People have since then ported MilkDrop to many different platforms [32] and made it react to live events, such as captured audio and video [20] (figure 2.2) or movements of a Wiimote remote control [21].
The idea behind the Milkymist project is to implement an embedded video synthesis platform on a custom open source system-on-chip, that is based on the same rendering principle as MilkDrop but with more control interfaces and features. The device built around the system-on-chip should be stand-alone, which means that a graphical user interface for configuring the visual effects should be implemented (figure 2.3).
Figure 2.1. Sample video frame from the MilkDrop visual synthesizer.
Figure 2.2. Sample video frame from Visikord, a program mixing live video into MilkDrop.
2.1.2 Principle
General mode of operation
The MilkDrop-like renderer is the most compute and memory intensive process, from which stem most of the technical challenges. We will now get into more details about how the renderer works (figure 2.4).
Rendering is based on a frame buffer on which the steps below are continuously repeated. This repetition is at the origin of many feedback or fractal effects.
• The current frame is distorted (zoomed, translated, warped, scaled, rotated...) by texture mapping. This step is described with more detail in section 2.4.
• The frame is darkened (the colors are shifted to black).
• A waveform of the currently played music is drawn. The wave can be drawn linearly (like an oscilloscope), in a circle, etc.
• Borders around the screen are drawn. If the distortion zooms out, the borders will be pulled into the picture (some effects are based on this).
• Motion vectors are drawn. Motion vectors are simply a grid of dots, which can be used to generate effects by playing with the distortion.
• The process repeats from the beginning.
These are the basic features of MilkDrop. There are more (custom waves, shapes,...) which are listed on the MilkDrop website [25]. Some other features
Figure 2.3. The embedded user interface (based on Genode FX [12]) of Flickernoise, the Milkymist VJ application. The patch editor is shown, with per-frame and per-vertex equations.
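The rendering steps listed above can be sketched as a small feedback loop. This is only an illustration under simplifying assumptions: the frame is a square grey-scale grid whose side is a power of 2, the distortion is reduced to a zoom with nearest-texel sampling, and the names `render_step`, `decay`, `zoom` and `wave_row` are hypothetical. Borders and motion vectors are left out.

```python
import math

def render_step(frame, decay=0.98, zoom=1.05, wave_row=None):
    """One iteration of a simplified MilkDrop-like feedback loop.

    frame is a square list of lists of floats whose side is a power of 2.
    """
    size = len(frame)
    mask = size - 1          # valid because size is a power of 2
    centre = size / 2
    out = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # Distortion step: sample the previous frame at a zoomed position
            # (nearest texel, wrapped with a bitwise AND).
            sx = math.floor(centre + (x - centre) / zoom) & mask
            sy = math.floor(centre + (y - centre) / zoom) & mask
            # Darkening step: shift the sampled value towards black.
            out[y][x] = frame[sy][sx] * decay
    # Waveform step, reduced here to a flat horizontal line.
    if wave_row is not None:
        for x in range(size):
            out[wave_row][x] = 1.0
    return out

# Feeding the output back as the next input produces the feedback effect:
frame = [[0.0] * 8 for _ in range(8)]
frame = render_step(frame, wave_row=4)   # draw a line
frame = render_step(frame)               # the line is zoomed and darkened
```

Because each output frame becomes the next input, anything drawn once is progressively distorted and faded, which is where the feedback and fractal effects come from.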
This process is done on an internal frame buffer whose horizontal and vertical dimensions are a power of 2. This frame buffer is then scaled to the size of the screen in order to be displayed. This brings two features:
• The sizes being a power of 2 allows out-of-bounds texture coordinates to be wrapped (in order to repeat the texture) by simply performing a bitwise AND of the coordinate, instead of the full computation of a division remainder which is a much more expensive operation (even on the traditional GPUs MilkDrop was designed for).
• It enables the implementation of the video echo effect: after the internal frame buffer has been drawn to the screen at its nominal dimensions, a zoomed and semi-transparent copy of it can be overprinted.
It must be noted that this two-step process increases the computation time and the consumption of memory bandwidth.
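The wrapping trick mentioned in the first point can be illustrated with integer texel coordinates. In this minimal sketch the function names are hypothetical and the 512-texel size is an arbitrary assumption; the only requirement is that the size is a power of 2.

```python
def wrap_and(coord, size):
    """Wrap an integer coordinate with a bitwise AND (size must be a power of 2)."""
    return coord & (size - 1)

def wrap_mod(coord, size):
    """Reference implementation using the (more expensive) division remainder."""
    return coord % size

# Both give the same result, including for negative coordinates:
for c in (-513, -1, 0, 511, 512, 1000):
    assert wrap_and(c, 512) == wrap_mod(c, 512)
```

In hardware, the AND amounts to keeping the low-order bits of the coordinate, so wrapping needs no divider at all.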
All the steps of the rendering are heavily parameterizable by the user, using a coded format called a patch or preset which defines the aspect and the interaction forms of a particular visual effect. The listing of a sample patch is given by figure 2.5.
Figure 2.4. Basic MilkDrop rendering flow.
fDecay=0.980000
nWaveMode=2
bTexWrap=1
bMotionVectorsOn=0
zoom=1.046000
rot=0.020000
cx=0.500000
cy=0.500000
warp=0.969000
sx=1.000000
sy=1.000000
wave_r=0.600000
wave_g=0.600000
wave_b=0.600000
wave_x=0.500000
wave_y=0.470000
per_frame_1=wave_r = wave_r + 0.400*( 0.60*sin(0.933*time)
+ 0.40*sin(1.045*time) );
per_frame_2=wave_g = wave_g + 0.400*( 0.60*sin(0.900*time)
+ 0.40*sin(0.956*time) );
per_frame_3=wave_b = wave_b + 0.400*( 0.60*sin(0.910*time)
+ 0.40*sin(0.920*time) );
per_frame_4=zoom = zoom + 0.010*( 0.60*sin(0.339*time)
+ 0.40*sin(0.276*time) );
per_frame_5=rot = rot + 0.050*( 0.60*sin(0.381*time)
+ 0.40*sin(0.579*time) );
per_frame_6=cx = cx + 0.030*( 0.60*sin(0.374*time)
+ 0.40*sin(0.294*time) );
per_frame_7=cy = cy + 0.030*( 0.60*sin(0.393*time)
+ 0.40*sin(0.223*time) );
per_vertex_1=sx=sx-0.04*sin((y*2-1)*6+(x*2-1)*7+time*1.59);
per_vertex_2=sy=sy-0.04*sin((x*2-1)*8-(y*2-1)*5+time*1.43);
Figure 2.5. Excerpt from the MilkDrop preset Geiss Warp of Dali 1 (with some simplifications).
Initial conditions
The patch begins with a series of parameters which are used to initialize the renderer, and many of them are kept constant during the execution of the patch. For example:
• bMotionVectorsOn=0 turns off the drawing of the motion vectors.
• nWaveMode=2 selects one of the many ways of drawing the audio waveform.
• sx=1.000000 and sy=1.000000 set the X and Y scaling factors of the distortion to 1 (i.e. the frame is initially not scaled).
• wave_r=0.600000, wave_g=0.600000 and wave_b=0.600000 set the initial RGB color with which the wave is drawn (it is initially grey).
Per-frame equations
Using only initial conditions limits the interaction and evolution possibilities of the patch.
It is therefore possible to make the parameters evolve over time, thanks to the per-frame equations. As their name suggests, the per-frame equations are mathematical expressions that are evaluated at each frame.
The example patch (figure 2.5) shows some of them (the lines beginning with per_frame). In this example, they change the wave color over time by modifying the wave_r, wave_g and wave_b values in sinusoidal patterns, as well as the zoom (zoom), rotation (rot) and center of rotation (cx and cy).
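As an illustration, the zoom and rot equations of the example patch can be evaluated by hand as follows. This sketch assumes that every frame starts again from the initial conditions of the patch before the equations are applied; the function name `per_frame` and the dictionary layout are hypothetical.

```python
import math

# Initial conditions taken from the example patch (figure 2.5).
INITIAL = {"zoom": 1.046, "rot": 0.020}

def per_frame(initial, time):
    """Evaluate two per-frame equations of the example patch at a given time.

    Assumption: parameters restart from their initial values at every frame.
    """
    p = dict(initial)
    p["zoom"] = p["zoom"] + 0.010 * (0.60 * math.sin(0.339 * time)
                                     + 0.40 * math.sin(0.276 * time))
    p["rot"] = p["rot"] + 0.050 * (0.60 * math.sin(0.381 * time)
                                   + 0.40 * math.sin(0.579 * time))
    return p

# Parameters for the first three frames at 30 frames per second:
frames = [per_frame(INITIAL, n / 30.0) for n in range(3)]
```

Since the two sine terms are weighted 0.60 and 0.40, the zoom never moves more than 0.010 away from its initial value, and the rotation never more than 0.050.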
Per-frame equations can make the patch react to sound, for example through the bass, mid and treb variables that indicate the intensity of the sound in three frequency bands. One of the ideas in Milkymist is to add other variables that can be controlled by the DMX512 and MIDI protocols, enabling the use of a whole range of devices commonly found among musicians (electronic instruments, faders, stage light consoles, joysticks,...) to control the visual effects.
Per-vertex equations
Per-vertex equations are used to fine-tune the distortion applied to the picture.
Indeed, as explained further in section 2.4, the distortion works by using a mesh of control points (vertices) that can be moved to transform the image in many different ways (effects such as zooming, scaling and rotating are implemented by moving the vertices).
Per-vertex equations are thus evaluated at each vertex (whose position can be retrieved through the x and y variables), and alter the position of that vertex. In the example patch (figure 2.5), the image is locally scaled horizontally and vertically by factors depending on the position of the vertex and on the time, resulting in a twisted visual effect.
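The two per-vertex equations of the example patch can be evaluated over a mesh as sketched below. The sketch assumes that x and y are normalized to the range 0 to 1 across the picture and that sx and sy start from the initial conditions of the patch; the function names and the mesh size are hypothetical.

```python
import math

def per_vertex(x, y, time, sx=1.0, sy=1.0):
    """Evaluate the per_vertex_1 and per_vertex_2 equations of figure 2.5."""
    sx = sx - 0.04 * math.sin((y * 2 - 1) * 6 + (x * 2 - 1) * 7 + time * 1.59)
    sy = sy - 0.04 * math.sin((x * 2 - 1) * 8 - (y * 2 - 1) * 5 + time * 1.43)
    return sx, sy

def evaluate_mesh(n, time):
    """Return the (sx, sy) scaling factors for an n-by-n grid of vertices."""
    return [[per_vertex(i / (n - 1), j / (n - 1), time) for i in range(n)]
            for j in range(n)]

# Scaling factors for a 32x32 mesh, one second into the effect:
mesh = evaluate_mesh(32, 1.0)
```

Every frame requires one such evaluation per vertex, so the amount of floating point work grows with the square of the mesh resolution.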
As discussed in chapter 5, the floating point computations for each vertex are
2.2 Open source SoC platforms
There is an existing effort to build open source system-on-chips. It is interesting to review these projects with a view to building upon them, possibly adding hardware accelerators or performing other modifications in order to improve performance.
There are many SoC designs available on the Internet, which are more or less mature. The system-on-chip projects listed here meet the following criteria:
• they have been shown to work on at least one FPGA board
• they are released under an open source license
• they comprise a synthesizable RISC CPU
• the CPU is supported by a C and C++ compiler
• they include an RS232 compatible UART (for a debug console)
• they support interfacing to off-chip SDRAM memory
OpenSPARC
OpenSPARC [23] is the well-known SPARC processor of Sun Microsystems which is now released under an open source license and included into a SoC FPGA project.
Implemented on an FPGA, this processor is extremely resource-intensive. A cut-down version of the CPU core only, called the SimplyRISC S1, occupies at least 37000 FPGA look-up tables (LUT) without the caches [28]. This is about twice the logic capacity of the Virtex-4 XC4VLX25 FPGA.
As it turns out, the OpenSPARC architecture is a very complex design which implements a huge number of techniques which increase the software execution speed (instructions per clock cycle). While this is a wise choice for a software-centric processor implemented on a fully custom semiconductor chip, with an FPGA process it is more appealing to keep the software processor simple in order to save resources and make room for custom hardware accelerators, taking advantage of the FPGA flexibility.
GRLIB
GRLIB [13] is a very professional and standard-compliant library of SoC cores. The library features a comprehensive set of cores: AMBA AHB/APB bus control elements, the LEON3 SPARC processor, a 32-bit PC133 SDRAM controller, a 32-bit PCI bridge with DMA, a 10/100/1000 Mb/s Ethernet MAC, 16/32/64-bit DDR SDRAM/DDR2 SDRAM controllers and more.
However, its drawbacks are:
• Code complexity. GRLIB is written in VHDL and makes intensive use of
• Cores are not self-contained. GRLIB defines many building blocks that are used everywhere else in the code, making it difficult to re-use code in another project which is not based on GRLIB.
• Significant FPGA resource usage. A system comprising the LEON3 SPARC processor with a 2-way set-associative 16kB cache and no memory management unit (MMU), the DDR SDRAM controller, an RS232 serial port, and an Ethernet 10/100 MAC uses 13264 FPGA look-up tables (LUT). They map to 79% of the Virtex-4 XC4VLX25 FPGA. We have carried out the test with the Xst synthesizer, Xilinx ISE 11.3, and GRLIB 1.0.21-b3957 (GPL release) using the default provided synthesis scripts. This undermines the possibility of adding hardware acceleration cores. In [22], a significant resource usage was also reported for an older version of LEON.
• Relatively low clock frequency. With the same parameters as above, the maximum clock frequency is 84MHz.
For these reasons, GRLIB was not retained.
ORPSoC (OpenRISC)
ORPSoC is based on the OpenRISC [26] processor core, which is the flagship product of OpenCores, a community of developers of open source system-on-chips. ORPSoC is essentially maintained by ORSoC AB.
ORPSoC notably features the OpenRISC OR1200 processor core, the Wishbone [9] bus, comprehensive debugging facilities, a 16550-compatible RS232 UART, a 10/100 Mb/s Ethernet MAC and an SDRAM controller.
Unfortunately, ORPSoC is resource-inefficient and buggy. The OpenRISC implementation is not well optimized for synthesis. We carried out tests on the August 17, 2009 OpenRISC release. Still using the XC4VLX25 FPGA as target, synthesis with Xst and Xilinx ISE 11.4 yields a utilization of 5077 LUTs for the CPU core only (using the default FPGA configuration: no caches, no MMU, multiplier, and with the implementation of the RAMs using the RAMB16 elements of the FPGA selected), running at approximately 100MHz. A similar resource usage is reported in [22]. The synthesis report shows asynchronous control signals where there should be none (such as on the output of the program counter), which can be an indication of poor quality of the design. Other IP cores that make up ORPSoC have similar issues (we tested the 16550 UART and the Ethernet MAC). Finally, the provided SDRAM controller only supports the low-bandwidth 16-bit single data rate option, has a high latency due to the extensive use of clock domain transfer FIFOs, does not support pipelined transfers and has poorly written code.
OpenRISC and ORPSoC therefore do not seem to be a good platform for the
LatticeMico32 System
This product [30] from the FPGA vendor Lattice Semiconductor is comparable to Microblaze [34] and Nios II [4] from its competitors, respectively Xilinx and Altera. Like its competing products, LatticeMico32 System features a broad library of light, fast and FPGA-optimized SoC cores.
One interesting move made by Lattice Semiconductor is that parts of the LatticeMico32 System are released under an open source license, and most notably the custom LatticeMico32 microprocessor core. LatticeMico32 System is also based upon the Wishbone [9] bus, whose specification is free of charge and freely distributable.
While it is perhaps technically possible to build Milkymist on top of the LatticeMico32 System, there are licensing issues concerning most notably the DDR SDRAM controller which is proprietary.
However, the LatticeMico32 microprocessor core is interesting. Synthesized for the XC4VLX25 with the 2-way set-associative caches, the barrel shifter, the hardware divider and the hardware multiplier enabled, it occupies only about 2400 4-LUTs and runs at more than 100MHz.
This microprocessor core has been retained for use in Milkymist, as described in chapter 7.
Microblaze and Nios II
Even though we are not interested in proprietary designs, we still give a brief overview of the resource usage of Microblaze and Nios II systems as a comparison.
Microblaze. In [22], the Microblaze core is reported to use approximately 2400 LUTs, like LatticeMico32. The platform builder GUI in Xilinx ISE 12.1 also limits the frequency of Microblaze systems to 100MHz when targeting the Virtex-4 family. Thus, Microblaze is close to LatticeMico32 regarding area and frequency.
Nios II. According to an Altera report [3], Nios II/f uses 1600 Cyclone II LEs. An LE mainly consists of a 4-LUT and a register, which is comparable to the Virtex-4 architecture on which LatticeMico32 was tested. Thus, it seems that the Nios II core would be approximately two thirds of the area of LatticeMico32.
Some differences can be noted between the LatticeMico32 configuration and the Nios II/f configuration used in the Altera report:
• Caches are direct-mapped and 512 bytes (each).
• There is no multiplier.
• Nios II/f uses a dynamic branch predictor, while LatticeMico32 uses a static
• The report does not say if the optional hardware divider, multiplier and shifter (that were enabled in LatticeMico32) were selected.
The Nios II is also reported to run at 140MHz with this configuration and UART, JTAG UART, SDR SDRAM controller and timer peripherals. This is very fast, but cannot be compared to the LatticeMico32 results on Virtex-4 for two reasons:
• Routing resources and logic delays for the two FPGA families are different.
• It is possible that Altera hand-tuned the Nios II processor to their FPGA technology.
2.3 DRAM technology
DRAM is by far today's dominant memory technology, often being the only affordable solution when relatively large densities (typically more than a few megabytes) are required. Unfortunately, DRAMs are not straightforward devices and we need preliminary knowledge specific to this technology in order to understand the choices discussed in chapter 3. Indeed, in order to reduce system costs, the intelligence has been moved away from the memory chips and into the memory controller [2], leaving the controller designer with the task of dealing with the low-level details of the DRAM technology.
We will therefore explain how the SDRAM (synchronous DRAM) technology works. These principles are the same for the original single data rate (SDR) SDRAM, and for the subsequent double data rate DDR, DDR2 and DDR3 memories. In all that follows, we suppose that the logic level 0 is represented by a voltage of 0 volts, and a logic level 1 is represented by a positive voltage H.
A DRAM memory bank (figure 2.6) is organized as a two dimensional array of cells. Each cell consists of a transistor connected to a capacitor. A cell stores one bit of information, indicated by the presence or absence of a charge in the capacitor. The transistor acts as a switch that connects the capacitor to the bit line (vertical lines) when the word line (horizontal lines) that its gate is connected to carries a high logic level.
A decoder translates the row address presented to the DRAM device and activates one of the word lines, according to the address.
Each bit line is connected to a sense amplifier, which is a positive feedback device that, when switched on, turns any voltage X on the bit line between 0 and H into 0 (if $X < H/2$) or H (if $X > H/2$). The set of sense amplifiers is called the page buffer.
Accesses to an SDRAM bank are made as follows:
1. We assume the SDRAM is in the precharged (idle) state. In this state, no word line is active, the sense amplifiers are turned off and all the bit lines are held at a voltage of $H/2$.
2. The SDRAM controller presents the row address and issues an ACTIVATE command. The row address goes through the decoder and one of the word lines is asserted. This has the effect of connecting all the capacitors of the DRAM cells in the row to their respective bit lines. A transfer of electric charge occurs between the parasitic capacitors of the bit lines (which were charged at a voltage of $H/2$) and the DRAM cell capacitors, which were either discharged (at 0 volts) or charged at a voltage of H. This causes a small change $\epsilon$ in the potential of the bit line, which becomes $H/2 - \epsilon$ or $H/2 + \epsilon$ (depending on the charge initially stored in the DRAM cell capacitor). Then, the SDRAM device turns on all the sense amplifiers of the bank. On each bit line, the positive feedback takes over and amplifies the voltage difference $\epsilon$ until the level of the bit line reaches 0 or H. The ACTIVATE command is now completed and the row is said to be opened. The DDR SDRAM chips used in the project (on the Xilinx ML401 board) take 20 ns to complete these operations.
Figure 2.6. Block diagram of a DRAM memory bank.
3. Once a row has been opened, the controller can present the column address and issue READ and WRITE commands to transfer data. Reading is done by simply measuring the voltages on the bit lines, and writing can be achieved by forcing the bit lines to a particular level. There is a delay, called the CAS¹ latency, between a READ command being sent and the data being returned by the device. This delay is 20 ns with the chips used in the project. However, read operations are pipelined, which means that a new READ command can be sent while the previous one is still transferring data. With proper scheduling, a full utilization of the available I/O bandwidth can be achieved.
4. Before accessing another row, the memory controller must disconnect the opened row from the bit lines and go back into the precharged state. It does so by issuing a PRECHARGE command to the device. The device takes some time to process the command (during which the bank cannot be accessed), which is 20 ns with the chips used in the project.
From this principle of operation, it becomes apparent that a performance-oriented controller should try to make several transfers in the same row before opening another one, in order to reduce the time wasted on switching rows.
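The cost of switching rows can be made concrete with a deliberately crude latency model built from the three 20 ns figures quoted above. The model is an assumption made for illustration: it looks at a single bank, ignores the pipelining of READ commands and the data transfer time, and the function names are hypothetical.

```python
# Timings quoted in the text for the chips used in the project (nanoseconds).
T_ACTIVATE_NS = 20
T_CAS_NS = 20
T_PRECHARGE_NS = 20

def read_latency_ns(open_row, row):
    """Latency of one read in a single bank, depending on the row state."""
    if open_row == row:
        return T_CAS_NS                              # row already open
    if open_row is None:
        return T_ACTIVATE_NS + T_CAS_NS              # bank precharged
    return T_PRECHARGE_NS + T_ACTIVATE_NS + T_CAS_NS  # another row is open

def total_latency_ns(rows):
    """Accumulated latency of a sequence of reads, given as row numbers."""
    open_row = None
    total = 0
    for row in rows:
        total += read_latency_ns(open_row, row)
        open_row = row
    return total

same_row = total_latency_ns([0, 0, 0, 0])      # 40 + 3 * 20 = 100 ns
alternating = total_latency_ns([0, 1, 0, 1])   # 40 + 3 * 60 = 220 ns
```

Even in this simplified model, four reads that alternate between two rows take more than twice as long as four reads within the same row.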
2.3.1 Multiple banks
SDRAM memory chips contain multiple DRAM banks internally, which share the I/O, command and address pins. Additional bank address pins select the bank to send commands to.
Having multiple banks brings two advantages:
• Being able to execute several commands simultaneously (assuming there is no resource conflict for the pins). For example, one bank can be activating one row while another bank is transferring data.
• Having several rows open (one per bank), which can reduce the number of required row switches and thus improve performance.
¹ CAS stands for Column Address Strobe, which is the name of the DRAM chip pin that the
The controller is responsible for managing the banks, and mapping absolute
memory addresses to particular banks. Appropriate bank mapping can improve
performance [29].
Standard DDR SDRAM chips come with four internal banks.
2.3.2 Refreshing
Because the DRAM capacitors are not perfect, they gradually lose their charge over
time, which results in data corruption.
The solution is to periodically recharge the capacitors, which is done by opening
the rows one by one. SDRAM chips provide an AUTO REFRESH command which
opens and closes one row in all banks (and increments an internal counter so that
the next AUTO REFRESH command will target another row), but it is the
responsibility of the controller to issue it. Furthermore, the controller must
precharge all banks before a refresh.
With the memory chips used in the project, a refresh must be made every 7.8 µs
and takes in the worst case $20 + 80 + 4 \cdot 20 = 180$ ns (precharge time +
refresh time + activation time for each bank), so it has a small impact on the
memory bandwidth (about 2%).
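As a quick check of this estimate, the fraction of time spent refreshing is the worst-case refresh duration divided by the refresh period (both values are taken from the text above):

$$\frac{180\ \text{ns}}{7.8\ \mu\text{s}} = \frac{180}{7800} \approx 0.023$$

that is, roughly 2.3% of the available time, in line with the "about 2%" figure.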
2.4 Texture mapping
Texture mapping is a common computer graphics operation found in accelerated 3D
APIs like OpenGL and DirectX. It is typically used to draw textured 3D polygons
on the screen. It can also distort an image (see figure 2.7 for an example), and
MilkDrop uses it for this purpose.
With common GPUs, texture mapping is performed on triangles (and more
complex polygons are broken down into a series of triangles). The inputs to the
algorithm are the 2D (possibly projected from the original 3D coordinates) positions
of the three vertices of the triangle to be filled, and the 2D texture coordinates for
these three vertices.
The algorithm then draws a textured triangle pixel by pixel, by interpolating
linearly the texture coordinates of the vertices for each pixel and then copying the
texture pixel (texel) at these coordinates.
Image processing operations like zooming, rotating or scaling can be implemented
with texture mapping, by simply changing the vertices' positions or the texture
coordinates at each vertex.
Figure 2.7. Example of distorted picture.
Figure 2.8. Principle of bilinear texture filtering.

More often than not, the results of the linear interpolation are not integers, which
means that the texture should be sampled between four adjacent pixels (figure 2.8).
In this case, for a better rendering, the four pixels should be read and their colors
should be averaged (with different weights depending on the fractional parts). This
process is called bilinear filtering and is required to obtain a good rendering of
MilkDrop presets (see figures 2.9 and 2.10).
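The weighted average can be sketched in a few lines of software. This is only an illustration of the principle, not the hardware implementation: it assumes a single-channel texture stored as a list of rows, non-negative coordinates, and clamping at the right and bottom edges. The name `bilinear_sample` is hypothetical.

```python
def bilinear_sample(texture, u, v):
    """Sample `texture` (a list of rows of numbers) at the non-integer
    position (u, v), where u is horizontal and v is vertical.

    Assumes 0 <= u <= width - 1 and 0 <= v <= height - 1.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)                  # top-left texel
    x1 = min(x0 + 1, width - 1)              # clamp at the right edge
    y1 = min(y0 + 1, height - 1)             # clamp at the bottom edge
    fx, fy = u - x0, v - y0                  # fractional parts = weights
    top = (1 - fx) * texture[y0][x0] + fx * texture[y0][x1]
    bottom = (1 - fx) * texture[y1][x0] + fx * texture[y1][x1]
    return (1 - fy) * top + fy * bottom


texture = [[0, 100],
           [200, 300]]
print(bilinear_sample(texture, 0.5, 0.5))    # centre of the four texels: 150.0
print(bilinear_sample(texture, 0.25, 0.75))  # closer to the bottom-left: 175.0
```

Each output pixel therefore needs four texel reads and several multiplications per color channel, which hints at why the operation is costly without dedicated hardware.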
In MilkDrop (and Milkymist), a special case of texture mapping is used, as
the only purpose is to distort a 2D image. The target surface is always a rectangle
that covers the destination picture, on which the vertices are distributed evenly as
a mesh which is always kept the same regardless of the applied distortion. The
distortion is defined by altering the texture coordinates at each vertex.
Texture mapping, especially when bilinear filtering is desired, is a very compute-
intensive process, as explained in chapter 5. A custom hardware accelerator has
been developed, whose details are also covered in that chapter.
2.5 Organization
According to this background, we can derive the following project guidelines:
• develop a fast, resource-efficient and FPGA-optimized system-on-chip
• develop an efficient memory subsystem
• reuse a light-weight soft-core CPU
• partition carefully the tasks between hardware and software
• develop custom hardware accelerators

Figure 2.9. Rendering with bilinear filtering enabled.
Figure 2.10. Rendering with bilinear filtering disabled (the nearest texel is used).

The proposed solution is outlined in figure 2.11. Not all the blocks are ready at
the time of this writing, nor are all of them within the scope of this Master's thesis,
which focuses on computer architecture.
More specifically, the following components are not developed yet:
• microSD controller (the current prototype uses a CF card through Xilinx
  SystemACE)
• USB controller
• Video input
• IR receiver
• MIDI controller
• DMX512 controller

Hardware accelerators have been developed for the computation of vertex
positions (PFPU) and for texture mapping (TMU), which have been found to be the
most compute-intensive parts of the process. They are discussed in detail in chapters
6 and 5, respectively.
Graphics processing also requires a significant amount of memory bandwidth,
which is discussed in chapter 3.
Chapter 4 describes the on-chip interconnect used to make the various blocks
communicate with one another.
Finally, chapter 7 deals with the software execution environment and how the
software is architected to obtain good performance from the hardware.
Figure 2.11. SoC block diagram.
Memory subsystem
3.1 Attacking the memory wall
A recurrent point in many modern computer systems is the memory performance
problem. The term memory wall was coined [33] to refer to the growing disparity of
performance between logic such as CPUs and off-chip memories. While
microprocessor performance has been improving at a rate of 60 percent per year, the
access time to DRAM has been improving at less than 10 percent per year [27].
Memory performance is measured with two metrics:
• bandwidth, which is the amount of data that the memory system can transfer
  during a given period of time.
• latency, which is the amount of time that the memory system spends between
  the issue of a memory read or write request and its completion.

A memory system can have both high bandwidth and high latency. If the logic
making the memory accesses is able to issue requests in a pipelined fashion, sending
a new request without waiting for the previous one to complete, high latency will not
have an impact on bandwidth.
Latency and bandwidth are however linked in practice. Decreasing the latency
also increases the bandwidth in many cases, because latency blocks sequential
processes and prevents them from utilizing the full available bandwidth.
High-end processors for servers and workstations have a good ability to cope with
relatively high memory latency, because techniques such as out-of-order execution
and hardware multi-threading enable the processor to issue new instructions even
though one is blocking on a memory access.
Some SDRAM controllers do a lot to optimize bandwidth but have little focus
on latency. Bandwidth-optimizing techniques include:
• reordering memory transactions to maximize the page mode hit rate.
• grouping reads and writes together to reduce write recovery times. Along with
  the above technique, this has a detrimental impact on latency because of the
  delays incurred by the additional logic in the address datapath.
• running the system and the SDRAM in asynchronous clock domains in order
  to be able to run the SDRAM at its maximum allowable clock frequency. This
  requires the use of synchronizers or FIFOs, which have a high latency.
• configuring the SDRAM at high CAS latencies in order to increase its
  maximum allowable clock frequency. This trend is best illustrated by the advent
  of DDR2 and DDR3 memories whose key innovation is to run their internal
  DRAM core at a sub-multiple of the I/O frequency with a wide data bus which
  is then serialized on the I/O pins. Since the internal DRAM core has a latency
  comparable to that of the earlier SDR and DDR technologies, the number of
  CAS latency cycles relative to the I/O clock is also multiplied.
An extreme example of these memory controller bandwidth optimizations is the
MemMax® DRAM scheduler [17]. This unit sits on top of an already existing
memory controller (which already has its own latency), adding seven stages of complex
and high-latency pipelining that produces a good (but compute-intensive) DRAM
schedule. The actual efficiency of this system has been questioned [15] because of
that significant increase in latency.
3.2 Another approach
The out-of-order execution and hardware multi-threading processor optimizations
discussed above that cope with high memory latency are complex and impractical in
the context of small and cheap embedded systems, especially those targeted at FPGA
implementations. For example, FPGA implementations of the OpenSPARC [23]
processor, which employs such optimizations, typically require an expensive
high-end Xilinx XUPV5 board whose Virtex-5 FPGA alone costs roughly 13000 SEK.
Milkymist therefore uses simple in-order execution schemes in its CPU and in
its accelerators, and strives to improve performance by focusing on reducing the
memory latency.
The memory system features that improve latency (but also bandwidth) are
discussed below.
3.3 Memory system features
3.3.1 Single SDRAM and system clock domain
The typical operating frequency of early SDR and DDR SDRAM (technologies that
are prior to DDR2 and do not have a clock divider for the internal DRAM core)
[...] the complete SoC. Thus, it was decided to run the DRAM and the system
synchronously in order to remove the need for any clock domain transfer logic and
reduce latency. The SDRAM I/O registers are clocked by the system clock, and
timing of the SDRAM interface is met through the use of calibrated on-chip delay
elements and delay-locked loops (DLLs) to generate the off-chip SDRAM clock and
the data strobes.
3.3.2 Page mode control algorithm
The Milkymist memory controller takes the so-called page mode gamble: after an
access, the DRAM row is left open in the hope that the next transaction to the
memory bank will occur within the same row. If the memory controller is right, the
read or write command can be immediately registered into the SDRAM, and only
the CAS or write latency is incurred. If the memory controller is wrong, it must
first precharge the DRAM bank and open the correct row, causing extra delays.
Thus, if the memory controller is often wrong, taking the page mode gamble will
actually impact performance negatively. However, a study has shown [29] that, with
typical memory timings, the point at which the gamble pays off is for a page hit
probability of only 0.375, attainable with many practical memory access patterns.
Page hit probability is improved by the ability of the Milkymist memory
controller to track open rows independently in each of the four memory banks that
commercial SDRAM chips are equipped with.
This optimization positively affects both latency and bandwidth.
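The benefit of tracking one open row per bank can be illustrated with a small simulation. The sketch below rests on assumptions that do not come from the thesis: a row size of 1024 words (`ROW_SIZE_WORDS`) and an address mapping in which consecutive rows are interleaved across the four banks. The function names are hypothetical.

```python
NUM_BANKS = 4            # four banks per chip, as stated in the text
ROW_SIZE_WORDS = 1024    # hypothetical row (page) size, in words


def split_address(address):
    """Hypothetical mapping: consecutive pages go to consecutive banks."""
    page = address // ROW_SIZE_WORDS
    return page % NUM_BANKS, page // NUM_BANKS   # (bank, row)


def page_hit_rate(addresses):
    """Fraction of accesses that find their row already open."""
    open_rows = [None] * NUM_BANKS   # one tracked open row per bank
    hits = 0
    for address in addresses:
        bank, row = split_address(address)
        if open_rows[bank] == row:
            hits += 1
        else:
            # A real controller would precharge and activate here.
            open_rows[bank] = row
    return hits / len(addresses)


# Two interleaved sequential streams that fall into different banks:
print(page_hit_rate([0, 1024, 1, 1025, 2, 1026, 3, 1027]))   # 0.75
# The same pattern when both streams compete for rows of the same bank:
print(page_hit_rate([0, 4096, 1, 4097, 2, 4098, 3, 4099]))   # 0.0
```

In the first pattern each stream keeps its own row open in its own bank, so the hit rate is well above the 0.375 threshold quoted above. In the second, the two streams evict each other's row on every access.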
3.3.3 Burst accesses
All memory accesses are made using bursts, i.e. when an access for a word is made,
the following words are also read or written. Burst mode is a feature of the SDRAM
chips: only one read or write command is sent to them, and several words are
transferred on subsequent clock cycles.
Using bursts frees the bus and DRAM control signals while other words are
transferred, allowing the issue of new commands overlapping the data phase of the
previous transaction.
Burst access is a form of prefetching that improves latency. It is only efficient
when the prefetched data can be used by the requesting bus master. In the Milkymist
system-on-chip, this is often the case:
• The CPU core has caches which access memory by complete cache lines. Thus,
  if the cache line length is a multiple of the burst length, the bursts can easily
  be cached in full.
• The video frame buffer repeatedly reads the same block of data in a sequential
  manner, and can easily make full use of the prefetched data assuming that is [...]
• The texture mapping unit also has a cache and a write buffer which work well
  with burst accesses. This is discussed in Chapter 5.
3.3.4 Burst reordering
This feature enables the use of the critical-word-first scheme in caches, reducing the
overall memory latency.
When a request is issued at an address which is not a multiple of the burst
length, the order of the words in the burst is changed so that the first word that
comes out is the very word that is at the requested memory address. The prefetch
address is then incremented and wraps to stay within the same burst.
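The wrapping order can be expressed compactly. The following is a minimal sketch, assuming word-granular addresses and bursts aligned on multiples of the burst length. The function name `burst_order` is hypothetical.

```python
def burst_order(address, burst_length=4):
    """Addresses of the words transferred by a reordered burst, in order.

    The requested word comes first; the address then increments and
    wraps so that it stays within the same aligned burst.
    """
    base = address - address % burst_length        # start of the aligned burst
    return [base + (address + i) % burst_length for i in range(burst_length)]


print(burst_order(0))   # [0, 1, 2, 3]
print(burst_order(2))   # [2, 3, 0, 1]
print(burst_order(6))   # [6, 7, 4, 5]
```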
For example, assuming a burst length of 4:
• a request at address 0 fetches words 0, 1, 2 and 3 (in this order)
• a request at address 2 fetches words 2, 3, 0 and 1 (in this order)

3.3.5 Pipelining
The memory bus of Milkymist [8] is pipelined. During the transfer of the prefetched
(burst) data, a new request can be issued. This is illustrated for a read request by
the table below:
Address          A1       A1       A1       A2       A2       A2       A2
Data             -        -        M(A1)    M(A1+1)  M(A1+2)  M(A1+3)  M(A2)

Address (cont.)  -        -        -
Data (cont.)     M(A2+1)  M(A2+2)  M(A2+3)
Together with burst access, this helps achieve high performance: the memory
controller can hide DRAM latencies and row switch delays by issuing the requests
to the DRAM in advance, while the previous transaction is still transferring data.
3.4 Practical implementation
The Milkymist SoC uses 32-bit DDR SDRAM, configured to its maximum burst
length of 8. Since the DDR SDRAM transfers two words per clock cycle (one
on each edge), this is turned by the I/O registers into bursts of four 64-bit words
synchronous to the system clock.
The memory is run at 100 MHz, yielding a peak theoretical bandwidth of 6.4 Gb/s,
which is more than enough for the intended video synthesis application (table 3.1).
This bandwidth is however never attained: events such as switching DRAM rows,
which takes significant time, and, to a lesser extent, DRAM refreshes introduce dead
times on the data bus. We will see in section 3.5 that such an oversizing of the [...]
Task                                                              Required bandwidth
VGA frame buffer, 1024x768, 75 Hz, 16 bpp                         950 Mb/s
Distortion: texture mapping, 512x512 to 512x512,
  30 fps, 16 bpp                                                  250 Mb/s
Live video: texture mapping, 720x576 to 512x512
  with transparency, 30 fps, 16 bpp                               300 Mb/s
Scaling: texture mapping, 512x512 to 1024x768,
  30 fps, 16 bpp                                                  500 Mb/s
Video echo: texture mapping, 512x512 to 1024x768
  with transparency, 30 fps, 16 bpp                               900 Mb/s
NTSC input, 720x576, 30 fps, 16 bpp                               200 Mb/s
Software and misc.                                                200 Mb/s
Total                                                             3.3 Gb/s

Table 3.1. Estimate of the memory bandwidth consumption.
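As a quick check of these figures, the peak bandwidth and the total of the estimates can be recomputed from the values given in this section. The symbols $B_{\text{peak}}$ and $B_{\text{req}}$ are introduced here for illustration only:

$$B_{\text{peak}} = 100\ \text{MHz} \times 64\ \text{bits} = 6.4\ \text{Gb/s}$$

$$B_{\text{req}} = 950 + 250 + 300 + 500 + 900 + 200 + 200 = 3300\ \text{Mb/s} = 3.3\ \text{Gb/s}$$

$$\frac{B_{\text{req}}}{B_{\text{peak}}} = \frac{3.3}{6.4} \approx 0.52$$

The estimated requirement therefore amounts to roughly half of the theoretical peak.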
Figure 3.1. Block diagram of the HPDMC architecture.
The architecture of the memory controller, called HPDMC (for High Performance
Dynamic Memory Controller), is outlined in figure 3.1.
The control interface is used by the system to configure the controller, and also
to issue the start-up sequence to the SDRAM. Indeed, SDRAM chips require a
sophisticated sequence of commands upon power-up. In many memory controller
designs, a hardware finite state machine is used to issue this command sequence.
[...] software, and, for this purpose, includes a bypass MUX that directly routes a
configuration and status register of HPDMC to the SDRAM command and address
pins. Once the SoC has run a software routine that sends the correct initialization
sequence to the SDRAM, it permanently switches the bypass MUX to the SDRAM
management unit and can use off-chip memory normally.
The SDRAM management unit is a finite state machine that translates the
two high-level memory commands "read burst at address" and "write burst at
address" into a series of lower-level commands understandable by the SDRAM chips
(precharge bank, select row, read from row, etc.). The management unit is
responsible for keeping track of the open rows, detecting page hits, switching rows, and
issuing periodic DRAM refresh cycles.
The management unit is connected to the data path controller, which follows
the activities performed by the management unit in order to decide the direction
of the bidirectional I/O pins (they should be set as outputs for writes and as inputs
for reads). The data path controller is also responsible for sending signals to the
management unit that indicate whether it is safe to perform certain low-level operations.
For example, the read_safe signal goes low immediately after a read command is
issued, because if another one were sent immediately after, the two resulting bursts
would overlap in time, which cannot work because there is only one set of data
pins. Lastly, the data path controller takes into account the SDRAM write and
read latencies to generate an acknowledgement signal when the data is actually there
(or needs to be sent to the SDRAM) after a "read row" or "write row" command has
been sent to the SDRAM.
Finally, the bus interface is a piece of glue logic that connects the SoC pipelined
memory bus (FML) to the data path controller and the management unit.
HPDMC has been implemented in Verilog HDL, tested and debugged in RTL
simulation using a DDR SDRAM Verilog model from Micron, integrated into the
SoC, synthesized into FPGA technology, and finally calibrated and tested by
software routines running on the actual hardware.
This memory controller design, specifically crafted for the Milkymist project
and released under the GNU GPL license on the internet, has been picked up by
NASA for a software-defined radio project and may be put on board the
International Space Station in 2011. Gregory Taylor, Electronics Engineer at the
NASA Jet Propulsion Laboratory, wrote:
While searching for a suitable SDRAM controller for the Jet Propulsion
Laboratory's Software-Defined Radio on board NASA's CoNNeCT experiment, I found
Sébastien's HPDMC SDRAM controller on OpenCores.org. We needed a controller
that was both high performance and well documented. Though the original HPDMC
controller was designed for DDR SDRAM with a 32-bit bus, Sébastien clearly
explained the modifications necessary to adapt the controller to our Single Data Rate,
40-bit wide SDRAM chip. I found the code to be well documented and easy to follow.
The performance has met our requirements and the FPGA size requirement is small.
The Communication Navigation and Networking Reconfigurable Testbed (CoNNeCT)
[...] of SDRs conforming to the Space Telecommunications Radio Systems (STRS) open
architecture standard. The HPDMC controller will likely find its way into one or
more loadable waveform payloads in the JPL SDR, and perhaps be used in other
NASA projects as well. It may eventually find its way into deep space.
3.5 Performance measurement
3.5.1 Introduction
We wanted to validate and characterize the memory system performance (actual
latency and bandwidth) and get an upper bound of its ability to sustain loads, by
extrapolating the maximum bandwidth one could get assuming the memory access
time remains constant.
Since the memory performance depends on the particular access pattern that the
system makes (because of the controller taking the page mode gamble), we wanted
to take the measurements on the real system while it is rendering video effects in
order to get an accurate result.
3.5.2 Method
A logic core has been added to the SoC that snoops on the memory bus activity in
order to report the average latency and bandwidth.
That logic core exploits properties of the Fast Memory Link signaling in order to
reduce its complexity to two counters that measure, for a given time period, the
number of cycles during which the strobe and acknowledgement signals are active.
Several parameters can then be computed:
• the net bandwidth carried by the link (based on the amount of data that
  the link has actually transferred)
• the average memory access time, which is the time, in cycles, between the
  request being made to the memory controller and the first word of data being
  transferred.
• the bus occupancy, which is the percentage of time during which the link
  was busy and therefore unavailable for a new request.

Every Fast Memory Link transaction begins with the assertion of the strobe signal.
Then, after one or more wait cycles, the memory controller asserts the
acknowledgement signal together with the first word of data being transferred. On the
next cycle, the strobe signal is de-asserted (unless a new transaction begins) while
the next word in the burst is being transferred. A new transaction can start with the
assertion of the strobe signal even if a burst is already going on (pipelining). See
figure 3.2 for an example.
Figure 3.2. FML transactions.
• $f$ is the system clock frequency in Hz.
• $T$ is the time during which the counters have been enabled.
• $w$ is the width of a FML word in bits.
• $n$ is the FML burst length.
• $S$ is the number of cycles during which the strobe signal was active.
• $A$ is the number of cycles during which the acknowledgement signal was active.

Net bandwidth. By counting the number of cycles for which the acknowledgement
signal was active, one gets the number of transactions. Since each transaction
carries exactly a burst of data, which is $w \cdot n$ bits in size, the volume of data
transferred is given by $w \cdot n \cdot A$. Thus, one can derive the net bandwidth as:

$$\beta = \frac{w \cdot n \cdot A}{T} \tag{3.1}$$

Average memory access time. On the bus, a master is waiting when the strobe
signal is asserted but the acknowledgement signal is not. Therefore, the total number
of wait cycles is given by $S - A$. The average memory access time can thus be
computed as:

$$\Delta = \frac{S - A}{A} \tag{3.2}$$

The average memory access time can be used to derive an upper bound on
the maximum bandwidth that the memory system can handle. Indeed, FML is a
pipelined bus which supports only one outstanding (waiting) transaction, so the case
that uses the most bandwidth for a given memory access time is when the strobe
signal is always asserted (figure 3.3) so that a new transaction begins as soon as the [...]
Figure 3.3. Maximum utilization of a FML bus.
Therefore, only a fraction $\alpha$ of the peak bandwidth $f \cdot w$ can be used at most,
and we have:

$$\alpha = \min\left(1, \frac{n}{\Delta + 1}\right) \tag{3.3}$$

The maximum bandwidth is:

$$\beta_{\max} = \alpha \cdot f \cdot w \tag{3.4}$$

Bus occupancy. The bus is busy when the strobe signal is asserted. The bus
occupancy is therefore given by:

$$\epsilon = \frac{S}{T \cdot f} \tag{3.5}$$

By using this method, a very simple piece of hardware added to the system
can yield interesting information about the performance of the
memory system.
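The computations of equations 3.1 to 3.5 can be sketched in software as follows. The function name `fml_metrics` and the counter values in the example are hypothetical. The defaults (100 MHz clock, 64-bit words, bursts of 4) are the values given in section 3.4, and $T$ is assumed to be expressed in seconds. The usable fraction is capped at 1, hence the use of `min`. With $\Delta = 5.51$ this gives $4 / 6.51 \approx 0.61$, which matches the "Idle" line of table 3.2.

```python
def fml_metrics(S, A, T, f=100e6, w=64, n=4):
    """Derive memory performance figures from the two counters.

    S: cycles with the strobe signal active
    A: cycles with the acknowledgement signal active
    T: measurement time in seconds (assumed unit)
    f: clock frequency in Hz, w: word width in bits, n: burst length
    """
    beta = w * n * A / T                  # net bandwidth, bits/s (3.1)
    delta = (S - A) / A                   # average access time, cycles (3.2)
    alpha = min(1.0, n / (delta + 1))     # usable fraction of the peak (3.3)
    beta_max = alpha * f * w              # maximum bandwidth, bits/s (3.4)
    occupancy = S / (T * f)               # bus occupancy, 0..1 (3.5)
    return beta, delta, alpha, beta_max, occupancy


# Hypothetical counter values collected over one second:
beta, delta, alpha, beta_max, occupancy = fml_metrics(S=6_500_000, A=1_000_000, T=1.0)
print(beta / 1e6, "Mb/s net bandwidth")
print(delta, "cycles average access time")
print(round(alpha, 3), "usable fraction of the peak bandwidth")
print(round(beta_max / 1e6), "Mb/s maximum bandwidth")
print(occupancy, "bus occupancy")
```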
3.5.3 Results
Results are summarized in table 3.2. The first line corresponds to a system running
the demonstration firmware with the video output enabled at the standard VGA
mode of 640x480 at 60 Hz (therefore continuously scanning the screen with data
from system memory), but not rendering a preset. The other lines represent the
results while the demonstration firmware is rendering different MilkDrop presets,
still at the same video resolution.
It is difficult to compare these results to those of other memory controllers as
they are usually not published (or not measured at all).
However, two conclusions can be drawn:
• there are enough occupancy and bandwidth margins for the system to operate
  at higher resolutions and/or color depths than 640x480 and 16 bits per pixel.
  The 3.3 Gb/s bandwidth requirement that was estimated in section 3.4 seems [...]
• to go further, an out-of-order memory controller can be envisioned. Such a
  controller would have a split transaction bus (allowing a larger number of
  outstanding transactions, thus minimizing the impact that latency has on
  bandwidth) and would be able to reorder pending memory transactions to maximize
  the page hit rate.

Patch                                     β      ε      Δ      α      β_max
Idle                                      292    7 %    5.51   61 %   3932
Geiss - Bright Fiber Matrix 1             990    28 %   6.37   54 %   3474
Geiss - Swirlie 3                         1080   32 %   6.71   52 %   3320
Geiss - Spacedust                         1021   29 %   6.47   54 %   3427
Illusion & Rovastar - Snowflake Delight   1399   39 %   6.28   55 %   3516
Rovastar & Idiot24-7 - Balk Acid          1427   41 %   6.38   54 %   3469

Table 3.2. Memory performance in different conditions (Milkymist 0.5.1).
Bandwidths are in Mb/s.
SoC interconnect
This chapter explains how the different interconnect busses work, what their features
are, why they are there, and how they communicate with each other.
The general SoC block diagram and its interconnect are outlined in figure 2.11.
4.1 General SoC interconnect: the Wishbone bus
Wishbone [9] is a general purpose royalty-free SoC bus with open specifications,
advocated by the maintainers of the OpenCores.org website.
Wishbone is a synchronous sequential bus with support for variable latency (wait
states) through the use of an acknowledgement signal that marks the end of the
transaction. Burst modes (automatic transfer of consecutive words) are supported
and are configurable on a per-transaction basis (i.e. bursts of arbitrary lengths and
single-word transactions can be freely mixed on the same bus). However, there is
no pipelining.
Wishbone is used around the SoC's LatticeMico32 CPU core and for simple DMA
masters which have modest requirements of bandwidth and of volume of transferred
data. As explained in Section 4.4, connecting DMA masters that transfer small
amounts of data (which can fit in the L2 cache) to the same bus as the CPU simplifies
dealing with cache coherency issues.
The data width used for the Wishbone bus is 32 bits, yielding a peak bandwidth of
3.2 Gb/s when the system is running at 100 MHz.
4.2 Configuration and Status Registers: the CSR bus
Milkymist uses memory-mapped I/O through configuration and status registers.
If these registers were directly accessed by the Wishbone CPU bus, two problems
would arise:
• Connecting all peripherals on the same Wishbone bus involves large
  multiplexers and high fanout signals, posing routing and timing problems.
• Wishbone requires the generation of an acknowledgement signal by each slave
  core. This signal is useful in many cases, as it supports peripherals with a
  variable latency. However, configuration and status register files are usually
  implemented with actual registers (flip-flops) or SRAM, which can always be
  accessed in one clock cycle. Thus, there is no need for variable latency and the
  acknowledgement signal. Keeping this signal for the configuration and status
  registers wastes hardware resources and development time.
To alleviate these problems, the CSR bus has been developed [7] and used in the
system through a bus bridge.
The CSR bus is a simpler bus than Wishbone, where all transfers are done in
one cycle. It has an interface similar to that of synchronous SRAM, consisting only
of address, data in, data out and write enable pins and clocked by the system clock.
A bridge connects the CSR bus to the CPU Wishbone bus, to allow transparent
memory-mapped access to the configuration and status registers by the software.
This bridge includes registers for all the signals crossing the two busses, relaxing the
timing constraints.
4.3 High-throughput memory access bus: the FML bus
Fast Memory Link (FML) [8] was co-designed with HPDMC (the memory controller)
as an on-chip bus tailored to access SDRAM memories at high speed while keeping
the memory controller simple. Its key features are listed below.
4.3.1 Variable latency
SDRAM latency varies a lot depending on the state of the SDRAM at the time
the request is issued on the bus. It depends on whether the SDRAM was in the
middle of a refresh cycle, whether the bank needs to be precharged, and whether
a new row needs to be activated. Therefore, FML provides support for a variable
number of wait states, defined by the memory controller, through the use of an
acknowledgement signal similar to that of Wishbone.
4.3.2 Burst only
SDRAM is best accessed in burst mode (see subsection 3.3.3).
However, enabling or configuring burst mode is a relatively lengthy and complex
operation, requiring a reload of the SDRAM mode register which takes several cycles.
Furthermore, supporting multiple burst lengths makes the scheduling of the transfers
more complex to avoid overlapping transfers that would create conflicts at the data
pins.
Therefore, in order to greatly simplify the memory controller, all transfers on