2008 SEPTEMBER

EDITORS
Lennart Lindh, David Källberg, Santiago de Pablo and Vincent J. Mooney III

The FPGAworld Conference addresses aspects of digital and hardware/software system engineering on FPGA technology. It is a discussion and network forum for students, researchers and engineers working on industrial and research projects, state-of-the-art investigations, development and applications. The book contains all presentations; for more information see www.fpgaworld.com/conference.

ISBN 978-91-976844-1-5

SPONSORS

Copyright and Reprint Permission: copies for personal or classroom use are allowed with credit to FPGAworld.com. For commercial or other for-profit/for-commercial-advantage uses, prior permission is required.
Lennart Lindh, FPGAworld and Jönköping University, Sweden
Publicity Chair
David Kallberg, FPGAworld, Sweden
Academic Programme Chair
Vincent J. Mooney III, Georgia Institute of Technology, USA
Academic Publicity Chair
Santiago de Pablo, University of Valladolid, Spain
Academic Programme Committee Members
Ketil Roed, Bergen University College, Norway
Lennart Lindh, Jönköping University, Sweden
Pramote Kuacharoen, National Institute of Development Administration, Thailand
Mohammed Yakoob Siyal, Nanyang Technological University, Singapore
Fumin Zhang, Georgia Institute of Technology, USA
Santiago de Pablo, University of Valladolid, Spain
Industrial Programme Committee Members
Solfrid Hasund, Bergen University College, Norway
Kim Petersén, HDC, Sweden
Mickael Unnebäck, ORSoC, Sweden
Fredrik Lång, EBV, Sweden
Niclas Jansson, BitSim, Sweden
Göran Bilski, Xilinx, Sweden
Adam Edström, Elektroniktidningen, Sweden
Espen Tallaksen, Digitas, Norway
Göran Rosén, Actel, Sweden
Tommy Klevin, ÅF, Sweden
Tryggve Mathiesen, BitSim, Sweden
Fredrik Kjellberg, Net Insight, Sweden
Daniel Stackenäs, Altera, Sweden
Martin Olsson, Synective Labs, Sweden
Stefan Sjöholm, Prevas, Sweden
Ola Wall, Synplicity, Sweden
Torbjorn Soderlund, Xilinx, Sweden
Anders Enggaard, Axcon, Denmark
Doug Amos, Synplicity, UK
Guido Schreiner, The MathWorks, Germany
Stig Kalmo, Engineering College of Aarhus, Denmark
We hope that the conferences provide you with much more than you expected. We will try to balance academic and industrial presentations, exhibits and tutorials to provide a unique chance for our attendants to obtain knowledge from different views. This year we have the strongest program in FPGAworld's history.

Track A - Industrial
Track A features presentations with focus on industrial applications. The presenters were selected by the Industrial Programme Committee. 8 papers were presented.

Track B - Academic
Track B features presentations with focus on academic papers and industrial applications. The presenters were selected by the Academic Programme Committee. Due to the high quality, 5 out of the 17 papers submitted this year were presented.

Track C - Product presentations
Track C features product presentations from our exhibitors and sponsors.

Track D - Altera Innovate Nordic
Track D is reserved for the Altera Innovate Nordic contest. Three are in the final.

Exhibitors FPGAworld'2008 Stockholm & Lund: 15 unique exhibitors.

The FPGAworld 2008 conference is bigger than the FPGAworld 2007 conference. In total we are close to 300 participants (Stockholm and Lund).

All are welcome to submit industrial/academic papers, exhibits and tutorials to the conference, from student, academic and industrial backgrounds. Together we can make the FPGAworld conference exceed even above our best expectations!

Please check out the website (http://fpgaworld.com/conference/) for more information about FPGAworld. In addition, you may contact David Källberg (david@fpgaworld.com) for more information.

We would like to thank all of the authors for submitting their papers and hope that the attendees enjoyed the FPGAworld conference 2008; you are welcome to next year's conference.
11:30 - 12:30 Lunch Break (sponsored by Actel)
12:30 - 14:30 Session Chair: TBD; Session Chair: TBD
Session A1: Open Source within Hardware
Session A2: World's first mixed-signal FPGA
Session A3: Verification - reducing costs and increasing quality
Session A4: Analog Netlist partitioning and automatic generation of schematic
Session C1: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
Session C2: OVM introduction (Mentor Graphics)
Session C3: Verification Management (Mentor Graphics)
Session C4: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
14:30 - 15:00 Coffee Break
15:00 - 16:30 Session Chair: TBD; Session Chair: TBD
Session A5: Product Presentation (ORSoC)
Session A6: Drive on one chip
Session A7: Standard architecture for typical remote sensing micro satellite payload
Session C5: Product Presentation (The Dini Group)
Session C6: Product Presentation (Actel)
Session C7: Product Presentation - Nextreme: The industry's only Zero
Dr. Ivo Bolsons, CTO, Xilinx
10:00 - 10:30 Coffee Break (sponsored by Synplicity)
10:30 - 11:30 Exhibitors Presentations
11:30 - 12:30 Lunch Break (sponsored by Mentor Graphics)
12:30 - 14:30 Session Chairs: Kim Petersén (HDC AB), Johnny Öberg, Tommy Klevin (ÅF)
Session D1: Altera Innovate Nordic Contest
Session A1: Open Source within Hardware
Session A2: Open and Flexible Network Hardware
Session A3: World's first mixed-signal FPGA
Session A4: Verification - reducing costs and increasing quality
Session B1: A Java-Based System for FPGA Programming
Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
Session C1: Product Presentation (Actel)
Session C2: Product Presentation (The Dini Group)
Session C3: Product Presentation (ORSoC)
Session C4: 7Circuits - I/O Synthesis for FPGA Board Design (Gateline)
14:30 - 15:00 Coffee Break
15:00 - 16:30 Session Chairs: TBD, Santiago de Pablo, TBD
Session D2: Altera Innovate Nordic Contest
Session A5: Large scale real-time data acquisition and signal processing in SARUS
Session A6: Drive on one chip
Session A7: Standard architecture for typical remote sensing micro satellite payload
Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Session B5: The ABB NoC – a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs
Session C5: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
Session C6: OVM introduction (Mentor Graphics)
Session C7: Verification Management (Mentor Graphics)
Session C8: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
16:30 - 17:00 Altera Innovate Nordic Prize draw
17:00 -
Exhibitors FPGAworld'2008 Stockholm & Lund
5 minutes presentation with PowerPoint
Stockholm
Altera
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
VSYSTEMS
Gateline
ORSoC
National Instruments
Lund
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
National Instruments
NOTE Lund
GATEline Overview
Value added reseller of eCAD and ePLM
products on Nordic and Baltic market
Established 1984
7 employees, 6 in Sweden and 1 in Norway
Offices in Stockholm and Oslo
The Ultimate PCB Design Environment
Design for Manufacturing Signal Integrity Simulation PCB Design
Component Database
Schematic Design Functional Simulation FPGA I/O Synthesis
Philosophy … Why are we here?
• We make big FPGA boards
• Fastest, biggest for the lowest cost
– Easy to use where important
– Less polish where not
• What you get:
– Working, easy to use, cutting edge, cost effective,
reference designs
– High performance in both speed and gate density
• What you don’t:
– Pretty GUI’s and other SW that drives up the cost
– The ‘soft-shoe’ on partitioning …
– Partitioning (optional)
• Manual or third party solutions such as Auspy
– Synthesis
• Xilinx/Altera tools work fine
– Place/Route
• Comes from FPGA vendor: Xilinx/Altera
– Debug
• Chipscope, SignalTap, and other third party solutions
Overview of Product Line
• Goal: Provide customers a cost-effective
vehicle to use the biggest and fastest FPGA’s
– Xilinx
• Virtex-5
– Altera
• Stratix III
– Stratix IV when available
– We try to keep lead-times under 2 weeks.
• If not 2 weeks, issue is usually availability of FPGAs
• DN9200k10PCIe-8T– 2 FPGA’s (LX330’s)
• DN9002k10PCIe-8T– 2 FPGA’s (LX330/LX220/LX155/LX110)
– 16 Virtex-5 LX330’s
– Expected to start shipping in Dec ’07
– 32M ASIC gates (measured the real way …)
– 6 DDR2 SODIMM sockets (250MHz)
– 450MHz LVDS chip to chip interconnect
DN9000k10PCI
• 6 Virtex5 LX330
– Oversize PCI circuit board
• 66MHz/64-bit
• Stand-alone operation with ATX power supply
– ~12 million USABLE ASIC gates
• REAL ASIC gates! No exaggeration!
– Any subset of FPGA’s can be stuffed to reduce cost.
• 6, DDR2 SODIMM SDRAM sockets
• Custom DDR2-compatible cards for FLASH, SSRAM, RLDRAM, mictors, and others
• FPGA to FPGA interconnect LVDS (or
single-ended)
• LVDS: 450MHz
• 10x ISERDES/OSERDES tested and verified
• 160-pin Main bus connects all FPGA’s
[FPGA selection guide (DINI_selection_guide_v700.xls): a table comparing Xilinx Virtex-II Pro, Virtex-4 (LX/SX/FX) and Virtex-5 (LX/LXT/SXT/FXT) devices with Altera Stratix II / Stratix II GX, Stratix III and Stratix IV devices, listing speed grades, LUT size (4- or 6-input), flip-flops, gate estimate (maximum at 100% utilization and practical at 60%), maximum I/Os, 18x18 or 25x18 multipliers, and on-chip memory blocks with total kbits and kbytes.]

The MathWorks at a Glance
Headquarters: Natick, Massachusetts, US
US: California, Michigan, Washington DC, Texas
Europe: UK, France, Germany, Switzerland, Italy, Spain, the Netherlands, Sweden
Asia-Pacific: China, Korea, Australia
Worldwide training and consulting
Distributors in 25 countries
Earth’s topography on an equidistant cylindrical projection, created with MATLAB®and Mapping Toolbox™.
The de facto industry-standard, high-level programming language for algorithm development
Numeric computation
Data analysis and visualization
Toolboxes for signal and image processing, statistics, optimization, symbolic math, and other areas
Foundation of MathWorks products
Core MathWorks Products
The leading environment for modeling, simulating, and implementing
communications systems and semiconductors
Foundation for Model-Based Design
Digital, analog, and mixed-signal systems, with floating- and fixed-point support
Algorithm development, system-level design, implementation, and test and verification Optimized code generation for FPGAs and
DSPs
Blocksets for signal processing,
communications, video and image processing, and RF
Open architecture with links to third-party modeling tools, IDEs, and test systems
[Diagram: Model-Based Design flow from algorithm models to implementation (C/C++ on MCU/DSP software; VHDL/Verilog on FPGA/ASIC electronics), with integration and test environments and continuous verification and validation.]
Silica an Avnet Company
550 employees (450 in the sales and engineering team), 23 franchises
Local sales organisations with a centralized backbone for logistics. Excellent portfolio of value-added services and supply chain solutions.
SILICA I The Engineers of Distribution.
Programmable
Logic
(Signal Chain)
SILICA I The Engineers of Distribution.
SILICA I The Engineers of Distribution.
Xilinx® Spartan™-3A Evaluation Kit – $395
• XC3S400A-4FTG256C
• General FPGA prototyping
• Cypress® PSoC evaluation (CapSense)
Flash and Antifuse
Non-volatile Reprogrammable FPGAs
Flash (floating gate) technology
Non-Volatile OTP (One Time Programmable) FPGAs
ONO anti-fuse technology M2M anti-fuse technology
Actel delivers a significant reliability advantage:
All Actel devices function as soon as power is
applied to the board
Single-chip offerings provide total cost
advantage over competition
Actel’s Silicon
Actel’s Silicon
Value-based Low Power FPGA
Ultra-low power
Very high volume
Sub-$1.00 market
Power and System Management
System developers needing
integrated functionality on single chip
System Critical
Where failure and tampering are
Industry’s Most Comprehensive
Power Management Portfolio
Later Today You are Invited
Later Today You are Invited
Low Power Solutions 12:30
Håkan Pettersson
Sr Applications Engineer, Hakan_pettersson@mentor.com
Mentor FPGA Design Solutions – Concept to Manufacturing
[Diagram: design flow spanning System Design, Embedded Development, C-Based Synthesis, RTL Reuse & Creation, Verification, FPGA Synthesis and PCB Design.]
Mentor @ FPGA World
Open Verification Methodology – An Overview
EBV Elektronik - The Full Solution Provider
EBV added values: In-depth design expertise Application know-how Full logistics solutions
130 pan-European Field Application Engineers
– 13% of EBV’s total workforce! –
provide extensive application expertise and design know-how.
2 weeks of internal FAE trainings per year by the product specialists of EBV’s manufacturers. (FSEs also attend)
Technologies are chosen from EBV!
2 weeks of additional training at our suppliers
EBV – The Technical Specialist
EBV FAE Team
Reduces time-to-market
FalconEye Development Board
Open Source - gives the competitive edge
ORSoC makes SoC development easy, accessible and cost-efficient for all companies, regardless of size or financial strength.
USB - Debugger
Development boards
Floppy-disk replacement
Designed and developed by ORSoC
Owned and sold by Swerob
USB - Debugger
Development boards
Customer product
ORSoC makes it easy
Open Source - gives the competitive edge
ORSoC makes SoC development easy, accessible and cost-efficient
OpenCores
reach millions of engineers
OpenCores is owned and maintained by ORSoC. www.opencores.org
OpenCores
Facts
OpenCores is the number one site in the world for open source hardware IPs
• ~540 projects (different IP blocks)
• ~1 000 000 page views every month
• ~70 000 visitors every month
• 6:48 (min:sec) average time on the website
Welcome to Synopsys
May 20th, 2008
FPGAWorld 2008
Welcome to the Synplicity Business Group of Synopsys
The Message is . . .
“The acquisition by Synopsys allows us to scale
our FPGA and rapid prototyping business to help more designers successfully solve increasingly complex problems”
– Gary Meyers, General Manager, Synplicity Business Group
“The combination will support our strategy to
provide rapid prototyping capabilities and will enhance Synplicity’s already strong offering in the FPGA implementation market.”
– Aart de Geus, CEO and Founder, Synopsys
FPGA Implementation Solutions / Confirma™ ASIC/ASSP Verification Platform / ESL Synthesis
• Synplify Premier – The Ultimate in FPGA Implementation
• Synplify Pro – The Industry Leader in FPGA Synthesis
• Identify – Powerful RTL Debug
• Certify – Multi-FPGA Prototyping Environment
• Identify Pro – Full Visibility Functional Verification
• Synplify DSP – DSP Synthesis for FPGA Designers
• Synplify DSP ASIC Edition – DSP Synthesis for ASIC Designers
• Synplify Premier – Single-FPGA Prototyping Environment
• HAPS – High-performance ASIC Prototyping System
© 2008 Actel Corporation September 2008
FPGA World September 2008
© 2008 Actel Corporation September 2008 2 Confidential and Proprietary
Key Market Segments
Value-based FPGA
Ultra-low power
High volumes
Sub-$10 market
Power and System
Management
Needs integrated functionality on single chip
System Critical
Failure and tampering are not options
© 2008 Actel Corporation September 2008 3 Confidential and Proprietary
Power: Actel Technical Advantage
[Diagram: a competitive SRAM configuration cell (bit lines, word line, Vdd) shown next to Actel's Flash cell.]
Competitive SRAM cell: substantial leakage per cell, millions of configuration cells, high static current.
Actel's Flash cell: negligible leakage per cell, millions of configuration cells, ultra-low static current.
Actel’s System Management Solutions
High-end, standards-based system
management specifications
Fusion-based µTCA reference designs
Power Module and Advanced Mezzanine Card
Fusion-based ATCA reference designs
Low-cost system management for typical
embedded design
Robust reference design leverages Fusion and CoreABC
© 2008 Actel Corporation September 2008 5 Confidential and Proprietary
At 13:00 in room A
A presentation and demo of Igloo, showing the difference in power consumption between Flash- and SRAM-based FPGAs
At 15:30 in room C
See you there!
Offices
Head office in Stockholm with regional offices in Lund, Uppsala, Växjö and Gothenburg.
~60 employees
In average 10+ years in electronic design
Advanced Microelectronics
FPGA, Board, DSP, ASIC & System-on-Chip, Analog & SW
Application specialists
cooperation with you. WE OFFER
• A site close to you
• Design and test resourses
• Industrialisation
• NOTEfied for selection of right components
• NOTE LAB for fast prototypes
• Competitive component sourcing
• Serial production including Box Build
• After sales services
NOTE Lab
NOTE Lab
• Specialists in prototyping and other customized production
• Fast prototype production
– experienced component engineers and purchasing personnel
– prototype modifications while you wait
– advanced prototype delivery in days
– feedback based on customer needs
– seamless transfer to serial production
• Box build in small volumes
• Life cycle status
• Symbols
• Footprint
• Production recommendations
NOTE Lab
Let us help you!
We can help you launch your product faster, and that can be the difference between winning and losing.
If you want more information please visit www.note.se or contact us in Lund on 046 – 286 92 00. If you have your business somewhere else in Sweden you can find a NOTE site near you on our home page. We look forward to hearing from you!
Papers

Session B1: A Java-Based System for FPGA Programming
Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Session B5: The ABB NoC – a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs
by discussing both Photon's abstract programming model, which separates computation and data I/O, and by giving an overview of the compiler's internal operation, including a flexible plug-and-play optimization system. We show that designs created with Photon always lead to deeply pipelined hardware implementations, and present a case study showing how a floating-point convolution filter design can be created and automatically optimized. Our final design runs at 250 MHz on a Xilinx Virtex-5 FPGA and has a data processing rate of 1 gigabyte per second.
1. Introduction
Traditional HDLs such as VHDL or Verilog incur major development overheads when implementing circuits, particularly for FPGAs, which should support fast design cycles compared to ASIC development. While tools such as C-to-gates compilers can help, often existing software cannot be automatically transformed into high-performance FPGA designs without major re-factoring.
In order to bridge the FPGA programming gap we propose a tool called Photon. Our goal with Photon is to simplify programming FPGAs with high-performance data-centric designs.
Currently the main features of Photon can be summarized as follows:
• Development of designs using a high-level approach combining Java and an integrated expression parser.
• Designs can include an arbitrary mix of fixed and floating point arithmetic with varied precision.
• Plug-and-play optimizations enabling design tuning without disturbing algorithmic code.
The remainder of this paper is divided up as follows: In Section 2, we compare Photon and other tools for creating FPGA designs. In Section 3 we describe Photon's programming model, which ensures designs often lead to high-performing FPGA implementations. In Sections 4 and 5 we give an overview of how Photon works internally and present a case study. Finally, in Section 6 we summarize our work and present our conclusions on Photon so far.
2. Comparisons to Other Work
In Table 1 we compare tools for creating FPGA designs using the following metrics:
• Design input – Programming language used to create designs.
• High level optimizations – Automatic inference and optimizing computation hardware, simplification of arithmetic expressions etc.
• Low level optimizations – Boolean expression minimisation, state-machine optimizations, eliminating unused hardware etc.
• Floating-point support – Whether the tool has intrinsic support for floating-point and IEEE compliance.
• Meta-programmability – Ability to statically meta-program, with weaker features being conditional compilation and variable bit-widths and stronger features such as higher-order design generation.
[Table 1. Comparison of tools for creating FPGA designs from software code over the metrics above; build automation: Yes / Yes / Limited / No / No / No / No / Yes / No.]

VHDL and Verilog use a traditional combination of structural constructs and RTL to specify designs. These tools typically require a high development effort. Such conventional tools typically have no direct support for floating-point arithmetic and therefore require external IP. Meta-programmability, e.g. generics in VHDL, is fairly inflexible [1]. The advantage of VHDL and Verilog is that they give the developer control over every aspect of the micro-architecture, providing the highest potential for an optimal design. Additionally, synthesis technology is relatively mature and the low-level optimizations can be very effective. Other tools often produce VHDL or Verilog to leverage the low-level optimizers present in the conventional synthesis tool-chain.
Impulse-C [2] and Handel-C [3] are examples of C-to-gates tools aiming to enable hardware designs using languages resembling C. The advantage of this approach is that existing software code can form a basis for generating hardware, with features such as ROMs, RAMs and floating-point units automatically inferred. However, software code will typically require modifications to support a particular C-to-gates compiler's programming model, for example explicitly specifying parallelism, guiding resource mapping, and eliminating features such as recursive function calls. The disadvantage of C-to-gates compilers is that the level of modification or guidance required of a developer may be large, as in general it is not possible to infer a high-performance FPGA design from a C program. This arises as C programs in general are designed without parallelism in mind and are highly sequential in nature. Also, meta-programmability is often limited to the C pre-processor, as there is no other way to distinguish between static and dynamic program control in C.
PamDC [4], JHDL [5], YAHDL [1] and ASC [6] are examples of Domain Specific Embedded Languages [7] (DSELs) in which regular software code is used to implement circuit designs. With this approach all functionality to produce hardware is encapsulated in software libraries with no need for a special compiler. These systems are a purely meta-programmed approach to generating hardware, with the result of executing a program being a net-list or HDL for synthesis. Of these systems, PamDC, JHDL and YAHDL all provide similar functions for creating hardware structurally in C++, Java and Ruby respectively. YAHDL and PamDC both take advantage of operator overloading to keep designs concise, whereas JHDL designs are often more verbose. YAHDL also provides functions for automating build processes and integrating with existing IP and external IP generating tools. ASC is a system built on top of PamDC and uses operator overloading to specify arithmetic computation cores with floating-point operations.
Photon is also implemented as a DSEL in Java. Photon's underlying hardware generation and build system is based on YAHDL, rewritten in Java to improve robustness. Unlike JHDL, Photon minimizes verbosity by using an integrated expression parser which can be invoked from regular Java code. Photon also provides a pluggable optimization system, unlike the other DSELs presented, which generate hardware in a purely syntax directed fashion.
3. Photon Programming Model
Our goal with Photon is to find a way to bridge the growing FPGA size versus programming gap when accelerating software applications. In this section we discuss the programming model employed by Photon which provides high-performance FPGA designs.
FPGA designs with the highest performance are generally those which implement deep, hazard free pipelines. However, in general, software code written without parallelism in mind tends to have loops with dependencies which cannot directly be translated into hazard free pipelines. As such, software algorithm implementations often need to be re-factored to be amenable to a high-performance FPGA implementation. Photon's programming model is built around making it easy to implement suitably re-factored algorithms.
When developing our programming model for Photon, we observe that dense computation often involves a single arithmetic kernel nested in one or more long running loops. Typically, dense computation arises from repeatedly applying such a kernel.
Thus, we turn organizing data I/O for the kernel into a problem that can be tackled separately from the data-path compiler. This leaves us with an arithmetic kernel which does not contain any loop structures and hence can be implemented as a loop-dependency free pipeline.
In Photon we assume the Data I/O problem is solved by Photon-external logic. Based on this assumption, Photon designs are implemented as directed acyclic graphs (DAGs) of computation. The acyclic nature of these graphs ensures a design can always be compiled to a loop-dependency free pipeline.
Within a Photon DAG there are broadly five classes of node:
• I/O nodes – Through which data flows into and out of the kernel under the control of external logic.
• Value nodes – Nodes which produce a constant value during computation. Values may be hard-coded or set via an I/O side-channel when computation is not run-ning.
• Computation nodes – Operations including: arith-metic (+, ÷ . . . ), bit-wise (&, or, . . . ), type-casts etc. • Control nodes – Flow-control and stateful elements,
e.g.: muxes, counters, accumulators etc.
• Stream shifts – Pseudo operations used to infer buffering for simulating access to data ahead of or behind the current in-flow of data.
To illustrate Photon's usage and graph elements, consider the pseudo-code program in Listing 1. This program implements a simple 1D averaging filter passing over data in an array din with output to array dout. The data I/O for this example is trivial: data in the array din should be passed linearly into a kernel implementing the average filter, which outputs linearly into an array dout.
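As a point of reference for the pseudo-code of Listing 1, a plain software version of the same kernel can be written in a few lines of Java. This is an illustrative sketch only: the array names din and dout follow the text above, while the three-point window and equal weights are assumptions rather than the paper's exact filter.

public class Average1D {
    // Software reference for a 1D averaging filter: edge samples are passed
    // through unchanged, interior samples are replaced by the average of
    // themselves and their two neighbours.
    static float[] average(float[] din) {
        int n = din.length;
        float[] dout = new float[n];
        for (int i = 0; i < n; i++) {
            if (i == 0 || i == n - 1) {
                dout[i] = din[i];                                    // edge: bypass the filter
            } else {
                dout[i] = (din[i - 1] + din[i] + din[i + 1]) / 3.0f; // interior: average
            }
        }
        return dout;
    }

    public static void main(String[] args) {
        float[] out = average(new float[] {1f, 4f, 1f, 4f, 1f});
        for (float v : out) System.out.print(v + " ");               // 1.0 2.0 3.0 2.0 1.0
    }
}

The Photon DAG in Figure 1 expresses the same kind of control structure in data-flow form: the mux plays the role of the edge test, and the stream-shifts stand in for the accesses to the previous and next data points.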
Figure 1. Photon DAG for 1D averaging.
Figure 1 shows a Photon DAG implementing the averaging kernel from Listing 1. Exploring this graph from the top down: data flows into the graph through the din input node; from here data either goes into logic implementing an averaging computation or to a mux. The mux selects whether the current input data point should skip the averaging operation and go straight to the output, as should be the case at the edges of the input data. The mux is controlled by logic which determines whether we are at the edges of the stream. The edge of the stream is detected using a combination of predicate operators (<, >, &) and a counter which increases once for each item of data which enters the stream. The constant input N − 1 to the < comparator can be implemented as a simple constant value, meaning the size of data which can be processed is fixed at compilation time. On the other hand, the constant input can be implemented as a more advanced value-node that can be modified via a side-channel before computation begins, thus allowing data-streams of any size to be processed. The logic that performs the averaging computation contains a number of arithmetic operators, a constant and two stream-shifts. The stream-shift operators cause data to be buffered such that it arrives at the addition operator one data-point behind (−1) or one data-point ahead (+1) of the unshifted data which comes directly from din.

Figure 2. Scheduled DAG for 1D average filter.
To implement our 1D averaging Photon DAG in hardware, the design undergoes processing to arrive at a hardware implementation. Figure 2 illustrates the result of Photon processing our original DAG. In this processed DAG, buffering implements the stream-shifting operators and ensures data input streams to DAG nodes are aligned. Clock-enable logic has also been added for data alignment purposes.
With this newly processed DAG, data arriving at din produces a result at dout after a fixed latency. This is achieved by ensuring that data inputs to all nodes are aligned with respect to each other. For example the mux before dout has three inputs: the select logic, din and the averaging logic. Without the buffering and clock-enable logic, data from din would arrive at the left input to the mux before the averaging logic has computed a result. To compensate, buffering is inserted on the left input to balance out the delay through the averaging logic. For the mux-select input a clock-enable is used to make sure the counter is started at the correct time.
After Photon processes a DAG by inserting buffering and clock-enable logic, the DAG can be turned into a structural hardware design. This process involves mapping all the nodes in the graph to pre-made fully-pipelined implementations of the represented operations and connecting the nodes together. As the design is composed of a series of fully-pipelined cores, the overall core is inherently also fully-pipelined. This means Photon cores typically offer a high degree of parallelism with good potential for achieving a high clock-speed in an FPGA implementation.
d.connect(mul(add(a, b), c));
}
}

Listing 2. Photon floating-point add/mul design.
4. Implementation of Photon
In this section we give an overview of Photon's concrete implementation. Of particular interest in Photon is the mechanism by which designs are specified as Java programs, which is covered first in Section 4.1. We then discuss Photon's compilation and hardware generation process in Section 4.2.
4.1. Design Input
Photon is effectively a Java software library and as such, Photon designs are created by writing Java programs. Executing a program using the Photon library results in either the execution of simulation software for testing a design or an FPGA configuration programming file being generated. When using the Photon library a new design is created by extending the PhotonDesign class, which acts as the main library entry point. This class contains methods which wrap around the creation and inter-connection of standard Photon nodes, forming a DAG in memory which Photon later uses to produce hardware. New nodes for custom hardware units, e.g. a fused multiply-accumulate unit, can also be created by users of Photon.
Listing 2 shows an example Photon program. When executed this program creates a hardware design which takes three floating-point numbers a, b and c as inputs, adds a and b together and multiplies the result by c to produce a single floating-point output d. Method calls in the code specify a DAG which has six nodes: three inputs, an output, a multiplier and an adder. These nodes are created by calls to the input, output, mul and add methods respectively. The input and output methods take a string parameter to specify names of I/Os for use by external logic and for performing data I/O. Another parameter specifies the I/O type. For the example in this paper, we use IEEE single precision floating-point numbers. The floating point type is declared using a call to hwFloat which makes a
eval("dout <- sel ? avg : din");

Listing 3. 1D averaging design implemented using Photon expressions.
floating point type object with an 8 bit exponent and a 24 bit mantissa following the IEEE specification. We can also create floating-point numbers with other precisions, fixed-point and/or integer types. Types used at I/Os propagate through the DAG and hence define the types of operator nodes. Casting functions can be used to convert and constrain types further within the design.
One drawback of using Java method calls to create a DAG is verbosity, making it hard to read the code or relate lines back to the original specification. To resolve the function-call verbosity the Photon library provides a mechanism for expressing computation using a simple expression-based language. Statements for this integrated expression parser can be written as regular Java strings passed to an eval method. The eval method uses the statements in the string to call the appropriate methods to extend the DAG.
To demonstrate our eval expressions, Listing 3 shows how our 1D averaging example from Figure 1 is implemented in Photon using eval calls.
4.2. Compilation and Hardware Generation
In addition to using Java for design specification, Photon also implements the compilation and hardware generation process entirely in Java. Photon's design management features cover optimization of Photon designs, generation of VHDL code, and calling external programs such as synthesis, simulation, IP generation, and place-and-route.
After a Photon design is fully specified, Photon turns the specified DAG into a design which can be implemented in hardware. Photon achieves this primarily by executing a series of graph-passes. The first scheduling pass traverses the graph passively, collecting data about the latency (pipeline-depth) of each node in the graph. We then determine an offset in our core pipeline at which each node should be placed in order to ensure that data for all its inputs arrives in synchrony. After all the offsets in a schedule are generated, a second pass applies these offsets by inserting buffering to align node inputs.
Sub-optimal offsets cause unnecessary extra buffering to be inserted into the graph, wasting precious BlockRAM and shift-register resources. To combat this inefficiency we calculate a schedule for the offsets using Integer Linear Programming (ILP). Our ILP formulation ensures all nodes are aligned such that their input data arrives at the same time while minimising the total number of bits used in buffering. Thus, Photon's scheduled designs always have optimal buffering.
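As a sketch of how such a scheduling ILP can be written (an illustrative model consistent with the description above, not the paper's exact formulation): let $l_u$ be the pipeline latency of node $u$, $o_u$ its scheduled offset, $w_{uv}$ the bit-width of the edge from $u$ to $v$, and $b_{uv}$ the number of cycles of buffering inserted on that edge. Aligning every node's inputs while minimising buffer bits then becomes

$$\min \sum_{(u,v) \in E} w_{uv} \, b_{uv} \quad \text{subject to} \quad b_{uv} = o_v - (o_u + l_u) \ge 0 \;\; \forall (u,v) \in E, \qquad o_u \ge 0 \;\; \forall u.$$

Because every incoming edge of a node $v$ must satisfy the same equality, all of its inputs arrive at offset $o_v$ simultaneously, and the objective charges each wasted cycle of buffering by the width of the data it holds.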
After all other graph-passes, a final graph-pass produces a hardware design. By this stage in compilation every node in the DAG has a direct mapping to an existing piece of parameterisable IP. Thus, this final pass creates a hardware design by instantiating one IP component per node in the graph. Hardware is created in Photon using further Java classes to describe structural designs, either directly using Java, by including external pre-written HDL, or by running external processes to generate IP, e.g. CoreGen for floating-point units. After a design is fully described, external synthesis or simulation tools are invoked by Java to produce the required output for this execution. The system used to implement this low-level structural design and tool automation is based on the model described in [1].
5. Case Study
As a case study we consider a simple 2D convolution filter. This kind of design is common in many digital image processing applications.
The filter we implement is shown in Figure 3. The filter is separable, using the equivalent of two 1D 5-point convolutions, and requires additions, subtractions and (after algebraic optimization to factor out common sub-expressions) 5 multiplications per input point.
5.1. Optimizations
The compilation of the convolution case study illustrates two of the optimization graph-passes during the Photon compilation process.
The Photon implementation of the filter makes use of several large stream-shifts on the input data. These shifts are necessary as each output data-point requires the 9 surrounding points to compute the convolved value. These stream-shifts result in a large number of buffers being added to the Photon design. Photon helps reduce this buffering using a graph-pass that combines the multiple delay buffers into a single long chain of buffers. This ensures each data item is only stored once, reducing buffering requirements.
Photon is able to use the precise value of the filter coefficient constants to optimize the floating-point multipliers. Specifically, some of the coefficients are a power of two, which can be highly optimized. To implement this, Photon includes another graph-pass which identifies floating-point multiplications by a power of two and replaces them with a dedicated node representing a dedicated hardware floating-point multiply-by-two IP core. This IP core uses a small number of LUTs to implement the multiplication rather than a DSP as in the conventional multipliers.
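As an aside on why such a core is cheap (an illustrative sketch, not Photon code): for a normalised IEEE-754 single-precision value, multiplying by two only increments the 8-bit exponent field, so no multiplier array is needed. The self-contained Java fragment below demonstrates the idea, falling back to an ordinary multiply for zeros, denormals and values close to overflow.

public class MulByTwoDemo {
    // Multiply a normalised IEEE-754 single by two by incrementing its exponent field.
    static float mulByTwo(float x) {
        int bits = Float.floatToRawIntBits(x);
        int exp = (bits >>> 23) & 0xFF;
        if (exp == 0 || exp >= 0xFE) {
            return x * 2.0f;                            // zero/denormal/near-overflow: real multiply
        }
        return Float.intBitsToFloat(bits + (1 << 23));  // exponent + 1
    }

    public static void main(String[] args) {
        System.out.println(mulByTwo(3.25f));            // prints 6.5
        System.out.println(mulByTwo(-0.375f));          // prints -0.75
    }
}

In hardware the same observation turns a floating-point multiplier into a small incrementer on the exponent bits, which is why the dedicated node needs only a handful of LUTs instead of a DSP block.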
5.2. Implementation Results
For synthesis we target our filter design to a Xilinx Virtex-5 LX110T FPGA, with a clock frequency of 250 MHz. At this speed, with data arriving at and exiting the circuit once per cycle, we achieve a sustained computation rate of 1 GB/s.
Table 2 shows the area impact of the Photon optimization graph-passes on the filter hardware. The multiplier power-of-two substitution pass reduces the number of DSP blocks used from 10 to 6, and the delay merging pass reduces BRAM usage from 18 RAMB36s to 4. The number of LUTs required for the original and optimized designs is similar.
6. Conclusion
In this paper we introduce Photon, a Java-based FPGA programming tool. We describe the programming model for Photon, in which data I/O is separated from computation, allowing designs to implicitly be easy to pipeline and hence perform well in an FPGA. We give an overview of Photon's implementation as a library directed by user-created Java programs. Finally, we present a case study demonstrating that Photon's pluggable optimization system can be used to improve the resource utilisation of designs. Our current and future work with Photon includes developing a system for making it easier to create the data I/O logic external to Photon designs, and creating more advanced optimization passes.
References
[1] J. A. Bower, W. N. Cho, and W. Luk, "Unifying FPGA hardware development," in International Conference on Field-Programmable Technology, December 2007, pp. 113–120.
[2] Impulse Accelerated Technologies Inc., "ImpulseC," http://www.impulsec.com/, 2008.
[3] Agility, "DK design suite," http://www.agilityds.com/, 2008.
[4] O. Mencer, M. Morf, and M. J. Flynn, "PAM-Blox: High performance FPGA design for adaptive computing," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 167–174.
[5] P. Bellows and B. Hutchings, "JHDL - An HDL for reconfigurable systems," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 175–184.
[6] O. Mencer, "ASC: A stream compiler for computing with FPGAs," IEEE Transactions on CAD of ICs and Systems, vol. 25, pp. 1603–1617, 2006.
[7] P. Hudak, "Modular domain specific languages and tools," Intl. Conf. on Software Reuse, vol. 00, p. 134, 1998.
[8] O. Pell and R. G. Clapp, "Accelerating subsurface offset gathers for 3D seismic applications using FPGAs," SEG Tech. Program Expanded Abstracts, vol. 26, no. 1, pp. 2383–2387, 2007.
[9] D. B. Thomas, J. A. Bower, and W. Luk, "Hardware architectures for Monte-Carlo based financial simulations," in International Conference on Field-Programmable Technology, December 2006, pp. 377–380.
of an adaptive multiprocessor system by creating components, like processing nodes or memories, from a parallel program. Therefore message-passing, a paradigm for parallel programming on multiprocessor systems, is used. The analysis and simulation of the parallel application provides data for the formulation of constraints of the multiprocessor system. These constraints are used to solve an optimization problem with Integer Linear Programming: the creation of a suitable abstract multiprocessor hardware architecture and the mapping of tasks onto processors. The abstract architecture is then mapped onto a concrete architecture of components, like a specific PowerPC or soft-core processor, and is further processed using a vendor tool-chain for the generation of a configuration file for an FPGA.
1. Introduction
As apparent in current developments, the reduction of transistor size and the exploitation of instruction-level parallelization can no longer be continued to enhance the performance of processors [1]. Instead, multi-core processors are a common way of enhancing performance by exploiting parallelism of applications. However, designing and implementing multiple processors on a single chip leads to new problems, which are absent in the design of single-core processors. For example, an optimal communication infrastructure between the processors needs to be found. Also, software developers have to parallelize their applications, so that the performance of the application is increased through multiple processors. In the case of multiprocessor systems-on-chip (MPSoCs), which combine embedded heterogeneous or homogeneous processing nodes, memory systems, interconnection networks and peripheral components, even more problems arise. Partly because of the variety of technologies available and partly because of their
computing with multiprocessor systems exist: the communication through shared memory (SMP), i.e. cache or memory on a bus-based system, and the passing of messages (MPI) through a communication network. SMP architectures, like the Sun Niagara processor [8] or the IBM Cell BE processor [9], are the common multiprocessors today. MPI is typically used in computer clusters, where physically distributed processors communicate through a network.
This paper presents a survey of our developments in the area of adaptive MPSoC design with FPGAs (Field-Programmable Gate Arrays) as a flexible platform for Chip-Multi-Processors. In section 2 an overview of the proposed design approach for MPSoCs is given; the steps for architectural synthesis, starting with the analysis and simulation of a parallel program and ending with the generation of a bitfile for the configuration of an FPGA, are described in general. In the following section 3 an on-chip message passing software library for communication between tasks of a parallel program, and a benchmark for the purpose of evaluation, are presented. Section 4 summarizes the formulation of architecture constraints for the design space exploration with Integer Linear Programming. These constraints are formulated from the results of the analysis and simulation of a parallel program. The following section 5 gives an overview of the creation of MPSoCs using abstract components. Finally, this paper is concluded in section 6 and a brief overview of future work is given in section 7.
2. System design using architectural synthesis
To get an efficient multiprocessor system-on-chip from a parallel program several approaches are possible. In figure 1 our proposed synthesis flow using an analytical approach is shown. The architectural synthesis flow starts with the analysis of the parallel program.

Figure 1. Architectural Synthesis Flow

In the first step of the design flow, information on data traffic and on task precedence is extracted from functional simulations of the parallel program. Information on the number of cycles of a task when executed on a specific processor is determined from cycle accurate simulations. This information is used to formulate an instance of an Integer Linear Programming (ILP) problem.
In the following step, called Abstract Component creation, a combinatorial optimization is done by solving an ILP problem. In addition to the information gathered in the first step, platform constraints, e.g. area and speed of the target platform, are needed as well. As a result of this step an abstract system description including the (abstract) hard- and software parts is generated.
The third step is called Component mapping. The abstract system description, which consists of abstract processors, memories, communication components or hardware accelerators and software tasks linked onto abstract processors, is mapped onto a concrete architecture of components like PPC405, MicroBlaze or on-chip BRAMs. If needed, an operating system can be generated with scripts and makefiles and can be mapped onto a processor as well. This step can be done using the PinHaT software (Platform-independent Hardware generation Tool) [11].
In the final step a bitfile is generated from the concrete architecture description.

For the communication between tasks a small-sized message passing library was developed (see figure 2), which is similar to the approaches described in [13], [14] and [15].
Figure 2. SoC-MPI Library
The library consists of two layers: a network independent layer (NInL) and a network dependent layer (NDeL), for the separation of the hardware dependent part from the hardware independent part of the library. The advantage of this separation is the easy migration of the library to other platforms. The NInL provides MPI functions, like MPI_Send, MPI_Receive, MPI_BSend or MPI_BCast. These functions are used to perform the communication between processes in the program. The NDeL is an accumulation of network dependent functions for different network topologies. In this layer the ranks and addresses for concrete networks are determined and the cutting and sending of messages depending on the chosen network is carried out. Currently the length of a message is limited to 64 Bytes, due to the limited on-chip memory of FPGAs. Longer messages are therefore cut into several smaller messages and are sent in series. The parameters of the MPI functions, like count, comm or dest (destination), are also used as signals and parameters for the hardware components of the network topology. That is, the parameters are used to build the header and the data packets for the communication.

Figure 3. Configuration of processing nodes

In figure 3 several processing nodes are connected together via a star network. Additionally, nodes 0 and 1 are directly connected together via FSL (Fast Simplex Link) [16]. Each processing node has only a subset of the SoC-MPI Library, with the dependent functions for the network topology.
3.1. Benchmarks
The MPI library is evaluated using Intel MPI Benchmarks 3.1, which is the successor of the well known package PMB (Pallas MPI Benchmarks) [17]. The MPI implementation was benchmarked on a Xilinx ML-403 evaluation platform [18], which includes a Virtex 4 FPGA running at 100 MHz. Three MicroBlaze soft-core processors [19] were connected together via a star network. All programs were stored in the on-chip memories.
In Figure 4 the results of the five micro benchmarks are shown. Due to the limited on-chip memory not all benchmarks could be performed completely. Furthermore, a small decay between 128 and 256 Bytes message size exists, because the maximum MPI message length is currently limited to 251 KBytes and a message larger than that must be split into several messages. Further increase of the message size would lead to a bandwidth closer to the maximum possible bandwidth, which is limited by the MicroBlaze and was measured at approximately 14 MBytes/s.
4. Abstract component creation using Integer Linear Programming

Figure 4. Benchmarks of the SoC-MPI Library
This step is concerned with locating the best possible abstract architecture for a given parallel application under given constraints. The simultaneous optimization problem is to map parallel tasks to a set of processing elements and to generate a suitable communication architecture which meets the constraints of the target platform and minimizes the overall computation time of the parallel program. The input for this step is obtained by using a profiling tool for the mpich2 package. In the following two subsections, area and time constraints of processors and of the communication infrastructure are described separately.
4.1. Processors - sharing constraint, area constraint and costs
A few assumptions about processors and tasks need to be made, because it is possible to map several tasks onto a processor: (1) a task scheduler exists, so that scheduling is not involved in the optimization problem. (2) Task mapping is static. (3) The instruction sequence of a task is stored in the local program memory of the processor, e.g. instruction cache, and hence the size of the local program memory limits the number of tasks which can be mapped onto a processor. (4) Finally, the cost of switching tasks in terms of processor cycles does not vary from task to task. Let $I_i \in \{I_0, \ldots, I_n\}$ be a task, $J_j \in \{J_0, \ldots, J_m\}$ a processor and $x_{ij} \in \{0, 1\}$ a binary decision variable, where $x_{ij} = 1$ means that task $I_i$ is mapped onto processor $J_j$.

$$\sum_{j=0}^{m} x_{ij} = 1, \quad \forall I_i \qquad (1)$$

A constraint for task mapping (equation 2), called the address space constraint, and the cost of task switching (equation 3) can be formulated, where $s_{ij}$ is the size of a task $I_i$.

Since $x_{ij}$ only shows whether a task $I_i$ is mapped onto a processor $J_j$, and does not show the number of processors in the system or the number of instantiations of a processor, an auxiliary variable $v_j \in \{0, 1\}$ is needed. For each instance of a processor $J_j$ there is a corresponding virtual processor $v_j$, and for all tasks mapped to a certain processor there is only one task which is mapped to the corresponding virtual processor. This leads to the following constraint (equation 4), so that the area of the processors can be calculated with equation 5.

$$v_j \le \sum_{i=0}^{n} x_{ij}, \quad \forall J_j \qquad (4)$$

$$A_{PE} \ge \sum_{j=0}^{m} v_j \cdot a_j \qquad (5)$$
4.2. Communication networks - network capacity and area constraint
Several assumptions have to be made before constraints about the communication network can be formulated. The communication of two tasks mapped onto the same processor is done via intra-processor communication, which has a negligible communication latency and overhead compared to memory access latency. All processors can use any of the available communication networks and can use more than one network. A communication network has arbitration costs resulting from simultaneous access to the network. It is assumed that tasks are not prioritized and that an upper bound on arbitration time for each network can be computed for each network topology depending on the number of processors. Finally, it is not predictable when two or more tasks will attempt to access the network, though a certain probability can be assumed.
$\lambda_{i_1,i_2}$ is an auxiliary 0-1 decision variable that is 1 if two communicating tasks are mapped onto different processors. The sum of $x_{i_1 j_1}$ and $x_{i_2 j_2}$ equals two if the tasks are on different processors, as seen in equation 6.

$$x_{i_1 j_1} + x_{i_2 j_2} - 1 \le \lambda_{i_1,i_2} \qquad (6)$$

$$y_k + \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \triangleleft I_{i_2}} \lambda_{i_1,i_2} \le M_k, \quad \forall C_k \qquad (7)$$
The total area cost of the communication network (resources for routing) can be calculated with equation 8, where $A_k$ is the area cost of topology $C_k$.

$$A_{NET} \ge \sum_{k=0}^{K} A_k \cdot y_k \qquad (8)$$
The cost of the topology in terms of computation time is calculated in equation 10, where $z_{k i_1 i_2}$ is a binary decision variable which is 1 if a network will be used by two tasks; otherwise the variable $z_{k i_1 i_2}$ is 0. $D_{i_1 i_2}$ is the amount of data to be transferred between the two communicating tasks and $p_k$ is the probability that network arbitration will be involved when a task wants to communicate. The upper bound on arbitration time is $\tau_k$.

$$z_{k i_1 i_2} \ge \lambda_{i_1,i_2} + y_k - 1 \qquad (9)$$

$$T_{NET} = \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \triangleleft I_{i_2}} \left( \sum_{k=0}^{K} (D_{i_1 i_2} + \tau_k \cdot p_k) \, z_{k i_1 i_2} \right) \qquad (10)$$

Finally, the total area cost $A$ is calculated from the area of the processing elements $A_{PE}$ (equation 5) and the area for the routing resources $A_{NET}$ (equation 8).

$$A \ge A_{PE} + A_{NET} \qquad (11)$$
The cost of computation time can be calculated with equation 12, where $T_{ij}$ is the time requirement to process a task $I_i$ on a processor $J_j$. The objective, in this case, is to minimize the computation time of a (terminating) parallel program. However, for non-terminating programs, like signal processing programs, the objectives are different.
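A formulation consistent with the definitions of $T_{ij}$ and $x_{ij}$ above, given here as a sketch rather than as the authors' exact equation 12, sums for every task the execution time on the processor it is mapped to:

$$T_{PE} = \sum_{i=0}^{n} \sum_{j=0}^{m} T_{ij} \cdot x_{ij}$$

One natural objective is then to minimise $T_{PE} + T_{NET}$ subject to the area bound of equation 11.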
In the first step an abstract specification of the system using abstract components like CPUs, memories or hardware accelerators is described. In the following step these abstract components are refined and mapped to concrete components, e.g. a specific CPU (PPC405, MicroBlaze) or a hardware divider. Also the software tasks are mapped onto the concrete components. The structure of PinHaT is shown in figure 5. A detailed overview of the PinHaT tool is given in [11].
Figure 5. Structure of PinHaT
The component mapping is divided into the generation and the configuration of the system infrastructure, where hardware is generated and software is configured. In this flow, the input to PinHaT is obtained by high level synthesis.

Each IP-core class carries its own parameters. Such classes can be easily added to the framework to extend the IP-core base. In a subsequent step, another parser creates the platform specific hardware information file from the gathered information. In the second phase individual mappers for all components and target platforms are created, followed by the last phase, where a mapper creates the platform dependent hardware description files. These dependent hardware description files are then passed to the vendor's tool chain, e.g. Xilinx EDK or Altera Quartus II.
5.2. Configuration of the System Infrastructure - SW Mapping
In the case of software, a task is mapped onto a concrete processor. This is in contrast to the mapping of abstract components, e.g. processors or memories, to concrete ones during the generation of the system infrastructure.
For the mapping step, parameters of the software must be specified for each processor. The parameters include information about the application or the operating system, like source code, libraries or the OS type. With this information, scripts and Makefiles for building the standalone applications and the operating systems are created. While standalone applications only need compiling and linking of the application, building the operating system is more difficult. Depending on the operating system, different steps, like configuration of the file-system or the kernel parameters, are necessary. The result of the task mapping is an executable file for each processor in the system.
6. Conclusion
In this paper a concept for the design automation of multiprocessor systems on FPGAs was presented. A small-sized MPI library was implemented to use message passing for the communication between tasks of a parallel program.