2008 SEPTEMBER

EDITORS
Lennart Lindh, David Källberg, Santiago de Pablo and Vincent J. Mooney III

The FPGAworld Conference addresses aspects of digital and hardware/software system engineering on FPGA technology. It is a discussion and network forum for students, researchers and engineers working on industrial and research projects, state-of-the-art investigations, development and applications. The book contains all presentations; for more information see www.fpgaworld.com/conference.

ISBN 978-91-976844-1-5

SPONSORS

Copyright and Reprint Permission: copies for personal or classroom use are allowed with credit to FPGAworld.com. For commercial or other for-profit/for-commercial-advantage uses, prior permission is required.
Lennart Lindh, FPGAworld and Jönköping University, Sweden
Publicity Chair
David Kallberg, FPGAworld, Sweden
Academic Programme Chair
Vincent J. Mooney III, Georgia Institute of Technology, USA
Academic Publicity Chair
Santiago de Pablo, University of Valladolid, Spain
Academic Programme Committee Members
Ketil Roed, Bergen University College, Norway
Lennart Lindh, Jönköping University, Sweden
Pramote Kuacharoen, National Institute of Development Administration, Thailand
Mohammed Yakoob Siyal, Nanyang Technological University, Singapore
Fumin Zhang, Georgia Institute of Technology, USA
Santiago de Pablo, University of Valladolid, Spain
Industrial Programme Committee Members
Solfrid Hasund, Bergen University College, Norway
Kim Petersén, HDC, Sweden
Mickael Unnebäck, ORSoC, Sweden
Fredrik Lång, EBV, Sweden
Niclas Jansson, BitSim, Sweden
Göran Bilski, Xilinx, Sweden
Adam Edström, Elektroniktidningen, Sweden
Espen Tallaksen, Digitas, Norway
Göran Rosén, Actel, Sweden
Tommy Klevin, ÅF, Sweden
Tryggve Mathiesen, BitSim, Sweden
Fredrik Kjellberg, Net Insight, Sweden
Daniel Stackenäs, Altera, Sweden
Martin Olsson, Synective Labs, Sweden
Stefan Sjöholm, Prevas, Sweden
Ola Wall, Synplicity, Sweden
Torbjorn Soderlund, Xilinx, Sweden
Anders Enggaard, Axcon, Denmark
Doug Amos, Synplicity, UK
Guido Schreiner, The MathWorks, Germany
Stig Kalmo, Engineering College of Aarhus, Denmark
We hope that the conferences provide you with much more than you expected. We will try to balance academic and industrial presentations, exhibits and tutorials to provide a unique chance for our attendants to obtain knowledge from different views. This year we have the strongest program in FPGAworld's history.

Track A - Industrial
Track A features presentations with focus on industrial applications. The presenters were selected by the Industrial Programme Committee. 8 papers were presented.

Track B - Academic
Track B features presentations with focus on academic papers and industrial applications. The presenters were selected by the Academic Programme Committee. Due to the high quality, 5 out of the 17 papers submitted this year were presented.

Track C - Product presentations
Track C features product presentations from our exhibitors and sponsors.

Track D - Altera Innovate Nordic
Track D is reserved for the Altera Innovate Nordic contest. Three are in the final.

Exhibitors FPGAworld'2008 Stockholm & Lund: 15 unique exhibitors.

The FPGAworld 2008 conference is bigger than the FPGAworld 2007 conference. In total we are close to 300 participants (Stockholm and Lund).

All are welcome to submit industrial/academic papers, exhibits and tutorials to the conference, from student, academic and industrial backgrounds. Together we can make the FPGAworld conference exceed even above our best expectations!

Please check out the website (http://fpgaworld.com/conference/) for more information about FPGAworld. In addition, you may contact David Källberg (david@fpgaworld.com) for more information.

We would like to thank all of the authors for submitting their papers and hope that the attendees enjoyed the FPGAworld conference 2008; you are welcome to next year's conference.
11:30 - 12:30 Lunch Break (sponsored by Actel)
12:30 - 14:30 Session Chair: TBD; Session Chair: TBD
Session A1: Open Source within Hardware
Session A2: World's first mixed-signal FPGA
Session A3: Verification - reducing costs and increasing quality
Session A4: Analog Netlist partitioning and automatic generation of schematic
Session C1: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
Session C2: OVM introduction (Mentor Graphics)
Session C3: Verification Management (Mentor Graphics)
Session C4: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
14:30 - 15:00 Coffee Break
15:00 - 16:30 Session Chair: TBD; Session Chair: TBD
Session A5: Product Presentation (ORSoC)
Session A6: Drive on one chip
Session A7: Standard architecture for typical remote sensing micro satellite payload
Session C5: Product Presentation (The Dini Group)
Session C6: Product Presentation (Actel)
Session C7: Product Presentation - Nextreme: The industry's only Zero
Dr. Ivo Bolsons, CTO, Xilinx
10:00 - 10:30 Coffee Break (sponsored by Synplicity)
10:30 - 11:30 Exhibitors Presentations
11:30 - 12:30 Lunch Break (sponsored by Mentor Graphics)
12:30 - 14:30 Session Chairs: Kim Petersén (HDC AB), Johnny Öberg, Tommy Klevin (ÅF)
Session D1: Altera Innovate Nordic Contest
Session A1: Open Source within Hardware
Session A2: Open and Flexible Network Hardware
Session A3: World's first mixed-signal FPGA
Session A4: Verification - reducing costs and increasing quality
Session B1: A Java-Based System for FPGA Programming
Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
Session C1: Product Presentation (Actel)
Session C2: Product Presentation (The Dini Group)
Session C3: Product Presentation (ORSoC)
Session C4: 7Circuits - I/O Synthesis for FPGA Board Design (Gateline)
14:30 - 15:00 Coffee Break
15:00 - 16:30 Session Chairs: TBD, Santiago de Pablo, TBD
Session D2: Altera Innovate Nordic Contest
Session A5: Large scale real-time data acquisition and signal processing in SARUS
Session A6: Drive on one chip
Session A7: Standard architecture for typical remote sensing micro satellite payload
Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Session B5: The ABB NoC – a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs
Session C5: Prototyping Drives FPGA Tool Flows (Synplicity Business Group of Synopsys)
Session C6: OVM introduction (Mentor Graphics)
Session C7: Verification Management (Mentor Graphics)
Session C8: MAGIC - Next generation platform for Telecom and Signal Processing (BitSim)
16:30 - 17:00 Altera Innovate Nordic Prize draw
17:00 -
Exhibitors FPGAworld'2008 Stockholm & Lund
5 minutes presentation with PowerPoint
Stockholm
Altera
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
VSYSTEMS
Gateline
ORSoC
National Instruments
Lund
BitSim
Arrow
Silica
Actel
Synplicity
The Mathworks
EBV Elektronik
ACAL Technology
The Dini Group
National Instruments
NOTE Lund
GATEline Overview
Value added reseller of eCAD and ePLM
products on Nordic and Baltic market
Established 1984
7 employees, 6 in Sweden and 1 in Norway
Offices in Stockholm and Oslo
The Ultimate PCB Design Environment
Design for Manufacturing Signal Integrity Simulation PCB Design
Component Database
Schematic Design Functional Simulation FPGA I/O Synthesis
Philosophy … Why are we here?
• We make big FPGA boards
• Fastest, biggest for the lowest cost
– Easy to use where important
– Less polish where not
• What you get:
– Working, easy to use, cutting edge, cost effective,
reference designs
– High performance in both speed and gate density
• What you don’t:
– Pretty GUI’s and other SW that drives up the cost
– The ‘soft-shoe’ on partitioning …
– Partitioning (optional)
• Manual or third party solutions such as Auspy
– Synthesis
• Xilinx/Altera tools work fine
– Place/Route
• Comes from FPGA vendor: Xilinx/Altera
– Debug
• Chipscope, SignalTap, and other third party solutions
Overview of Product Line
• Goal: Provide customers a cost-effective
vehicle to use the biggest and fastest FPGA’s
– Xilinx
• Virtex-5
– Altera
• Stratix III
– Stratix IV when available
– We try to keep lead-times under 2 weeks.
• If not 2 weeks, issue is usually availability of FPGAs
• DN9200k10PCIe-8T– 2 FPGA’s (LX330’s)
• DN9002k10PCIe-8T– 2 FPGA’s (LX330/LX220/LX155/LX110)
– 16 Virtex-5 LX330’s
– Expected to start shipping in Dec ’07
– 32M ASIC gates (measured the real way …)
– 6 DDR2 SODIMM sockets (250MHz)
– 450MHz LVDS chip to chip interconnect
DN9000k10PCI
• 6 Virtex5 LX330
– Oversize PCI circuit board
• 66MHz/64-bit
• Stand-alone operation with ATX power supply
– ~12 million USABLE ASIC gates
• REAL ASIC gates! No exaggeration!
– Any subset of FPGA’s can be stuffed to reduce cost.
• 6, DDR2 SODIMM SDRAM sockets
• Custom DDR2-compatible cards for FLASH, SSRAM, RLDRAM, mictors, and others
• FPGA to FPGA interconnect LVDS (or
single-ended)
• LVDS: 450MHz
• 10x ISERDES/OSERDES tested and verified
• 160-pin Main bus connects all FPGA’s
[FPGA selection guide (DINI_selection_guide_v700.xls): a table comparing Xilinx Virtex-II Pro, Virtex-4 (LX/SX/FX) and Virtex-5 (LX/LXT/SXT/FXT) devices with Altera Stratix II / Stratix II GX, Stratix III and Stratix IV devices, listing speed grades, LUT size (4- or 6-input), flip-flops, gate estimate (maximum at 100% utilization and practical at 60%), maximum I/Os, 18x18 or 25x18 multipliers, and on-chip memory blocks with total kbits and kbytes.]

The MathWorks at a Glance
Headquarters: Natick, Massachusetts, US
US: California, Michigan, Washington DC, Texas
Europe: UK, France, Germany, Switzerland, Italy, Spain, the Netherlands, Sweden
Asia-Pacific: China, Korea, Australia
Worldwide training and consulting
Distributors in 25 countries
Earth’s topography on an equidistant cylindrical projection, created with MATLAB®and Mapping Toolbox™.
The de facto industry-standard, high-level programming language for algorithm development
Numeric computation
Data analysis and visualization
Toolboxes for signal and image processing, statistics, optimization, symbolic math, and other areas
Foundation of MathWorks products
Core MathWorks Products
The leading environment for modeling, simulating, and implementing
communications systems and semiconductors
Foundation for Model-Based Design
Digital, analog, and mixed-signal systems, with floating- and fixed-point support
Algorithm development, system-level design, implementation, and test and verification Optimized code generation for FPGAs and
DSPs
Blocksets for signal processing,
communications, video and image processing, and RF
Open architecture with links to third-party modeling tools, IDEs, and test systems
[Diagram: Model-Based Design flow from algorithm models to implementation (C/C++ on MCU/DSP software; VHDL/Verilog on FPGA/ASIC electronics), with integration and test environments and continuous verification and validation.]
Silica an Avnet Company
550 employees (450 in the sales and engineering team), 23 franchises
Local sales organisations with a centralized backbone for logistics. Excellent portfolio of value-added services and supply chain solutions.
SILICA I The Engineers of Distribution.
Programmable
Logic
(Signal Chain)
SILICA I The Engineers of Distribution.
SILICA I The Engineers of Distribution.
Xilinx® Spartan™-3A Evaluation Kit – $395
• XC3S400A-4FTG256C
• General FPGA prototyping
• Cypress® PSoC evaluation (CapSense)
Flash and Antifuse
Non-volatile Reprogrammable FPGAs
Flash (floating gate) technology
Non-Volatile OTP (One Time Programmable) FPGAs
ONO anti-fuse technology M2M anti-fuse technology
Actel delivers a significant reliability advantage:
All Actel devices function as soon as power is
applied to the board
Single-chip offerings provide total cost
advantage over competition
Actel’s Silicon
Actel’s Silicon
Value-based Low Power FPGA
Ultra-low power
Very high volume
Sub-$1.00 market
Power and System Management
System developers needing
integrated functionality on single chip
System Critical
Where failure and tampering are
Industry’s Most Comprehensive
Power Management Portfolio
Later Today You are Invited
Later Today You are Invited
Low Power Solutions 12:30
Håkan Pettersson
Sr Applications Engineer, Hakan_pettersson@mentor.com
Mentor FPGA Design Solutions – Concept to Manufacturing
[Diagram: design flow spanning System Design, Embedded Development, C-Based Synthesis, RTL Reuse & Creation, Verification, FPGA Synthesis and PCB Design.]
Mentor @ FPGA World
Open Verification Methodology – An Overview
EBV Elektronik - The Full Solution Provider
EBV added values: In-depth design expertise Application know-how Full logistics solutions
130 pan-European Field Application Engineers
– 13% of EBV’s total workforce! –
provide extensive application expertise and design know-how.
2 weeks of internal FAE trainings per year by the product specialists of EBV’s manufacturers. (FSEs also attend)
Technologies are chosen from EBV!
2 weeks of additional training at our suppliers
EBV – The Technical Specialist
EBV FAE Team
Reduces time-to-market
FalconEye Development Board
Open Source - gives the competitive edge
ORSoC makes SoC development easy, accessible and cost-efficient for all companies, regardless of size or financial strength.
USB - Debugger
Development boards
Floppy-disk replacement
Designed and developed by ORSoC
Owned and sold by Swerob
USB - Debugger
Development boards
Customer product
ORSoC makes it easy
Open Source - gives the competitive edge
ORSoC makes SoC development easy, accessible and cost-efficient
OpenCores
reach millions of engineers
OpenCores is owned and maintained by ORSoC. www.opencores.org
OpenCores
Facts
OpenCores is the number one site in the world for open source hardware IPs
• ~540 projects (different IP blocks)
• ~1 000 000 page views every month
• ~70 000 visitors every month
• 6:48 (min:sec) average time on the website
Welcome to Synopsys
May 20th, 2008
FPGAWorld 2008
Welcome to the Synplicity Business Group of Synopsys
The Message is . . .
“The acquisition by Synopsys allows us to scale
our FPGA and rapid prototyping business to help more designers successfully solve increasingly complex problems”
– Gary Meyers, General Manager, Synplicity Business Group
“The combination will support our strategy to
provide rapid prototyping capabilities and will enhance Synplicity’s already strong offering in the FPGA implementation market.”
– Aart de Geus, CEO and Founder, Synopsys
FPGA Implementation Solutions / Confirma™ ASIC/ASSP Verification Platform / ESL Synthesis
• Synplify Premier – The Ultimate in FPGA Implementation
• Synplify Pro – The Industry Leader in FPGA Synthesis
• Identify – Powerful RTL Debug
• Certify – Multi-FPGA Prototyping Environment
• Identify Pro – Full Visibility Functional Verification
• Synplify DSP – DSP Synthesis for FPGA Designers
• Synplify DSP ASIC Edition – DSP Synthesis for ASIC Designers
• Synplify Premier – Single-FPGA Prototyping Environment
• HAPS – High-performance ASIC Prototyping System
© 2008 Actel Corporation September 2008
FPGA World September 2008
© 2008 Actel Corporation September 2008 2 Confidential and Proprietary
Key Market Segments
Value-based FPGA
Ultra-low power
High volumes
Sub-$10 market
Power and System
Management
Needs integrated functionality on single chip
System Critical
Failure and tampering are not options
© 2008 Actel Corporation September 2008 3 Confidential and Proprietary
Power: Actel Technical Advantage
[Diagram: a competitive SRAM configuration cell (bit lines, word line, Vdd) shown next to Actel's Flash cell.]
Competitive SRAM cell: substantial leakage per cell, millions of configuration cells, high static current.
Actel's Flash cell: negligible leakage per cell, millions of configuration cells, ultra-low static current.
Actel’s System Management Solutions
High-end, standards-based system
management specifications
Fusion-based µTCA reference designs
Power Module and Advanced Mezzanine Card
Fusion-based ATCA reference designs
Low-cost system management for typical
embedded design
Robust reference design leverages Fusion and CoreABC
© 2008 Actel Corporation September 2008 5 Confidential and Proprietary
At 13:00 in room A
A presentation and demo of Igloo, showing the difference in power consumption between Flash- and SRAM-based FPGAs
At 15:30 in room C
See you there!
Offices
Head office in Stockholm with regional offices in Lund, Uppsala, Växjö and Gothenburg.
~60 employees
In average 10+ years in electronic design
Advanced Microelectronics
FPGA, Board, DSP, ASIC & System-on-Chip, Analog & SW
Application specialists
cooperation with you. WE OFFER
• A site close to you
• Design and test resourses
• Industrialisation
• NOTEfied for selection of right components
• NOTE LAB for fast prototypes
• Competitive component sourcing
• Serial production including Box Build
• After sales services
NOTE Lab
NOTE Lab
• Specialists in prototyping and other customized production
• Fast prototype production
– experienced component engineers and purchasing personnel
– prototype modifications while you wait
– advanced prototype delivery in days
– feedback based on customer needs
– seamless transfer to serial production
• Box build in small volumes
• Life cycle status
• Symbols
• Footprint
• Production recommendations
NOTE Lab
Let us help you!
We can help you launch your product faster, and that can be the difference between winning and losing.
If you want more information please visit www.note.se or contact us in Lund on 046 – 286 92 00. If you have your business somewhere else in Sweden you can find a NOTE site near you on our home page. We look forward to hearing from you!
Papers

Session B1: A Java-Based System for FPGA Programming
Session B2: Automated Design Approach for On-Chip Multiprocessor Systems
Session B3: ASM++ Charts: an Intuitive Circuit Representation Ranging from Low Level RTL to SoC Design
Session B4: Space-Efficient FPGA-Implementations of FFTs in High-Speed Applications
Session B5: The ABB NoC – a Deflective Routing 2x2 Mesh NoC targeted for Xilinx FPGAs
by discussing both Photon's abstract programming model, which separates computation and data I/O, and by giving an overview of the compiler's internal operation, including a flexible plug-and-play optimization system. We show that designs created with Photon always lead to deeply pipelined hardware implementations, and present a case study showing how a floating-point convolution filter design can be created and automatically optimized. Our final design runs at 250 MHz on a Xilinx Virtex-5 FPGA and has a data processing rate of 1 gigabyte per second.
1. Introduction
Traditional HDLs such as VHDL or Verilog incur major development overheads when implementing circuits, particularly for FPGAs, which should support fast design cycles compared to ASIC development. While tools such as C-to-gates compilers can help, often existing software cannot be automatically transformed into high-performance FPGA designs without major re-factoring.
In order to bridge the FPGA programming gap we propose a tool called Photon. Our goal with Photon is to simplify programming FPGAs with high-performance data-centric designs.
Currently the main features of Photon can be summarized as follows:
• Development of designs using a high-level approach combining Java and an integrated expression parser.
• Designs can include an arbitrary mix of fixed and floating point arithmetic with varied precision.
• Plug-and-play optimizations enabling design tuning without disturbing algorithmic code.
The remainder of this paper is divided up as follows: In Section 2, we compare Photon and other tools for creating FPGA designs. In Section 3 we describe Photon's programming model, which ensures designs often lead to high-performing FPGA implementations. In Sections 4 and 5 we give an overview of how Photon works internally and present a case study. Finally, in Section 6 we summarize our work and present our conclusions on Photon so far.
2. Comparisons to Other Work
In Table 1 we compare tools for creating FPGA designs using the following metrics:
• Design input – Programming language used to create designs.
• High level optimizations – Automatic inference and optimizing computation hardware, simplification of arithmetic expressions etc.
• Low level optimizations – Boolean expression minimisation, state-machine optimizations, eliminating unused hardware etc.
• Floating-point support – Whether the tool has intrinsic support for floating-point and IEEE compliance.
• Meta-programmability – Ability to statically meta-program, with weaker features being conditional compilation and variable bit-widths and stronger features such as higher-order design generation.
[Table 1. Comparison of tools for creating FPGA designs from software code over the metrics above; build automation: Yes / Yes / Limited / No / No / No / No / Yes / No.]

VHDL and Verilog use a traditional combination of structural constructs and RTL to specify designs. These tools typically require a high development effort. Such conventional tools typically have no direct support for floating-point arithmetic and therefore require external IP. Meta-programmability, e.g. generics in VHDL, is fairly inflexible [1]. The advantage of VHDL and Verilog is that they give the developer control over every aspect of the micro-architecture, providing the highest potential for an optimal design. Additionally, synthesis technology is relatively mature and the low-level optimizations can be very effective. Other tools often produce VHDL or Verilog to leverage the low-level optimizers present in the conventional synthesis tool-chain.
Impulse-C [2] and Handel-C [3] are examples of C-to-gates tools aiming to enable hardware designs using languages resembling C. The advantage of this approach is that existing software code can form a basis for generating hardware, with features such as ROMs, RAMs and floating-point units automatically inferred. However, software code will typically require modifications to support a particular C-to-gates compiler's programming model, for example explicitly specifying parallelism, guiding resource mapping, and eliminating features such as recursive function calls. The disadvantage of C-to-gates compilers is that the level of modification or guidance required of a developer may be large, as in general it is not possible to infer a high-performance FPGA design from a C program. This arises as C programs in general are designed without parallelism in mind and are highly sequential in nature. Also, meta-programmability is often limited to the C pre-processor, as there is no other way to distinguish between static and dynamic program control in C.
PamDC [4], JHDL [5], YAHDL [1] and ASC [6] are examples of Domain Specific Embedded Languages [7] (DSELs) in which regular software code is used to implement circuit designs. With this approach all functionality to produce hardware is encapsulated in software libraries with no need for a special compiler. These systems are a purely meta-programmed approach to generating hardware, with the result of executing a program being a net-list or HDL for synthesis. Of these systems, PamDC, JHDL and YAHDL all provide similar functions for creating hardware structurally in C++, Java and Ruby respectively. YAHDL and PamDC both take advantage of operator overloading to keep designs concise, whereas JHDL designs are often more verbose. YAHDL also provides functions for automating build processes and integrating with existing IP and external IP generating tools. ASC is a system built on top of PamDC and uses operator overloading to specify arithmetic computation cores with floating-point operations.
Photon is also implemented as a DSEL in Java. Photon's underlying hardware generation and build system is based on YAHDL, rewritten in Java to improve robustness. Unlike JHDL, Photon minimizes verbosity by using an integrated expression parser which can be invoked from regular Java code. Photon also provides a pluggable optimization system, unlike the other DSELs presented, which generate hardware in a purely syntax directed fashion.
3. Photon Programming Model
Our goal with Photon is to find a way to bridge the growing FPGA size versus programming gap when accelerating software applications. In this section we discuss the programming model employed by Photon which provides high-performance FPGA designs.
FPGA designs with the highest performance are generally those which implement deep, hazard free pipelines. However, in general, software code written without parallelism in mind tends to have loops with dependencies which cannot directly be translated into hazard free pipelines. As such, software algorithm implementations often need to be re-factored to be amenable to a high-performance FPGA implementation. Photon's programming model is built around making it easy to implement suitably re-factored algorithms.
When developing our programming model for Photon, we observe that dense computation often involves a single arithmetic kernel nested in one or more long running loops. Typically, dense computation arises from repeatedly applying such a kernel.
Thus, we turn organizing data I/O for the kernel into a problem that can be tackled separately from the data-path compiler. This leaves us with an arithmetic kernel which does not contain any loop structures and hence can be implemented as a loop-dependency free pipeline.
In Photon we assume the Data I/O problem is solved by Photon-external logic. Based on this assumption, Photon designs are implemented as directed acyclic graphs (DAGs) of computation. The acyclic nature of these graphs ensures a design can always be compiled to a loop-dependency free pipeline.
Within a Photon DAG there are broadly five classes of node:
• I/O nodes – Through which data flows into and out of the kernel under the control of external logic.
• Value nodes – Nodes which produce a constant value during computation. Values may be hard-coded or set via an I/O side-channel when computation is not run-ning.
• Computation nodes – Operations including: arith-metic (+, ÷ . . . ), bit-wise (&, or, . . . ), type-casts etc. • Control nodes – Flow-control and stateful elements,
e.g.: muxes, counters, accumulators etc.
• Stream shifts – Pseudo operations used to infer buffering for simulating access to data ahead of or behind the current in-flow of data.
To illustrate Photon's usage and graph elements, consider the pseudo-code program in Listing 1. This program implements a simple 1D averaging filter passing over data in an array din with output to array dout. The data I/O for this example is trivial: data in the array din should be passed linearly into a kernel implementing the average filter, which outputs linearly into an array dout.
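As a point of reference for the pseudo-code of Listing 1, a plain software version of the same kernel can be written in a few lines of Java. This is an illustrative sketch only: the array names din and dout follow the text above, while the three-point window and equal weights are assumptions rather than the paper's exact filter.

public class Average1D {
    // Software reference for a 1D averaging filter: edge samples are passed
    // through unchanged, interior samples are replaced by the average of
    // themselves and their two neighbours.
    static float[] average(float[] din) {
        int n = din.length;
        float[] dout = new float[n];
        for (int i = 0; i < n; i++) {
            if (i == 0 || i == n - 1) {
                dout[i] = din[i];                                    // edge: bypass the filter
            } else {
                dout[i] = (din[i - 1] + din[i] + din[i + 1]) / 3.0f; // interior: average
            }
        }
        return dout;
    }

    public static void main(String[] args) {
        float[] out = average(new float[] {1f, 4f, 1f, 4f, 1f});
        for (float v : out) System.out.print(v + " ");               // 1.0 2.0 3.0 2.0 1.0
    }
}

The Photon DAG in Figure 1 expresses the same kind of control structure in data-flow form: the mux plays the role of the edge test, and the stream-shifts stand in for the accesses to the previous and next data points.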
Figure 1. Photon DAG for 1D averaging.
Figure 1 shows a Photon DAG implementing the averaging kernel from Listing 1. Exploring this graph from the top down: data flows into the graph through the din input node; from here data either goes into logic implementing an averaging computation or to a mux. The mux selects whether the current input data point should skip the averaging operation and go straight to the output, as should be the case at the edges of the input data. The mux is controlled by logic which determines whether we are at the edges of the stream. The edge of the stream is detected using a combination of predicate operators (<, >, &) and a counter which increases once for each item of data which enters the stream. The constant input N − 1 to the < comparator can be implemented as a simple constant value, meaning the size of data which can be processed is fixed at compilation time. On the other hand, the constant input can be implemented as a more advanced value-node that can be modified via a side-channel before computation begins, thus allowing data-streams of any size to be processed. The logic that performs the averaging computation contains a number of arithmetic operators, a constant and two stream-shifts. The stream-shift operators cause data to be buffered such that it arrives at the addition operator one data-point behind (−1) or one data-point ahead (+1) of the unshifted data which comes directly from din.

Figure 2. Scheduled DAG for 1D average filter.
To implement our 1D averaging Photon DAG in hardware, the design undergoes processing to arrive at a hardware implementation. Figure 2 illustrates the result of Photon processing our original DAG. In this processed DAG, buffering implements the stream-shifting operators and ensures data input streams to DAG nodes are aligned. Clock-enable logic has also been added for data alignment purposes.
With this newly processed DAG, data arriving at din produces a result at dout after a fixed latency. This is achieved by ensuring that data inputs to all nodes are aligned with respect to each other. For example the mux before dout has three inputs: the select logic, din and the averaging logic. Without the buffering and clock-enable logic, data from din would arrive at the left input to the mux before the averaging logic has computed a result. To compensate, buffering is inserted on the left input to balance out the delay through the averaging logic. For the mux-select input a clock-enable is used to make sure the counter is started at the correct time.
After Photon processes a DAG by inserting buffering and clock-enable logic, the DAG can be turned into a structural hardware design. This process involves mapping all the nodes in the graph to pre-made fully-pipelined implementations of the represented operations and connecting the nodes together. As the design is composed of a series of fully-pipelined cores, the overall core is inherently also fully-pipelined. This means Photon cores typically offer a high degree of parallelism with good potential for achieving a high clock-speed in an FPGA implementation.
d.connect(mul(add(a, b), c));
}
}

Listing 2. Photon floating-point add/mul design.
4. Implementation of Photon
In this section we give an overview of Photon's concrete implementation. Of particular interest in Photon is the mechanism by which designs are specified as Java programs, which is covered first in Section 4.1. We then discuss Photon's compilation and hardware generation process in Section 4.2.
4.1. Design Input
Photon is effectively a Java software library and as such, Photon designs are created by writing Java programs. Executing a program using the Photon library results in either the execution of simulation software for testing a design or an FPGA configuration programming file being generated. When using the Photon library a new design is created by extending the PhotonDesign class, which acts as the main library entry point. This class contains methods which wrap around the creation and inter-connection of standard Photon nodes, forming a DAG in memory which Photon later uses to produce hardware. New nodes for custom hardware units, e.g. a fused multiply-accumulate unit, can also be created by users of Photon.
Listing 2 shows an example Photon program. When executed this program creates a hardware design which takes three floating-point numbers a, b and c as inputs, adds a and b together and multiplies the result by c to produce a single floating-point output d. Method calls in the code specify a DAG which has six nodes: three inputs, an output, a multiplier and an adder. These nodes are created by calls to the input, output, mul and add methods respectively. The input and output methods take a string parameter to specify names of I/Os for use by external logic and for performing data I/O. Another parameter specifies the I/O type. For the example in this paper, we use IEEE single precision floating-point numbers. The floating point type is declared using a call to hwFloat which makes a
eval("dout <- sel ? avg : din");

Listing 3. 1D averaging design implemented using Photon expressions.
floating point type object with an 8 bit exponent and a 24 bit mantissa following the IEEE specification. We can also create floating-point numbers with other precisions, fixed-point and/or integer types. Types used at I/Os propagate through the DAG and hence define the types of operator nodes. Casting functions can be used to convert and constrain types further within the design.
One drawback of using Java method calls to create a DAG is verbosity, making it hard to read the code or relate lines back to the original specification. To resolve the function-call verbosity the Photon library provides a mechanism for expressing computation using a simple expression-based language. Statements for this integrated expression parser can be written as regular Java strings passed to an eval method. The eval method uses the statements in the string to call the appropriate methods to extend the DAG.
To demonstrate our eval expressions, Listing 3 shows how our 1D averaging example from Figure 1 is implemented in Photon using eval calls.
4.2. Compilation and Hardware Generation
In addition to using Java for design specification, Photon also implements the compilation and hardware generation process entirely in Java. Photon's design management features cover optimization of Photon designs, generation of VHDL code, and calling external programs such as synthesis, simulation, IP generation, and place-and-route.
After a Photon design is fully specified, Photon turns the specified DAG into a design which can be implemented in hardware. Photon achieves this primarily by executing a series of graph-passes. The first scheduling pass traverses the graph passively, collecting data about the latency (pipeline-depth) of each node in the graph. We then determine an offset in our core pipeline at which each node should be placed in order to ensure that data for all its inputs arrives in synchrony. After all the offsets in a schedule are generated, a second pass applies these offsets by inserting buffering to align node inputs.
Sub-optimal offsets cause unnecessary extra buffering to be inserted into the graph, wasting precious BlockRAM and shift-register resources. To combat this inefficiency we calculate a schedule for the offsets using Integer Linear Programming (ILP). Our ILP formulation ensures all nodes are aligned such that their input data arrives at the same time while minimising the total number of bits used in buffering. Thus, Photon's scheduled designs always have optimal buffering.
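As a sketch of how such a scheduling ILP can be written (an illustrative model consistent with the description above, not the paper's exact formulation): let $l_u$ be the pipeline latency of node $u$, $o_u$ its scheduled offset, $w_{uv}$ the bit-width of the edge from $u$ to $v$, and $b_{uv}$ the number of cycles of buffering inserted on that edge. Aligning every node's inputs while minimising buffer bits then becomes

$$\min \sum_{(u,v) \in E} w_{uv} \, b_{uv} \quad \text{subject to} \quad b_{uv} = o_v - (o_u + l_u) \ge 0 \;\; \forall (u,v) \in E, \qquad o_u \ge 0 \;\; \forall u.$$

Because every incoming edge of a node $v$ must satisfy the same equality, all of its inputs arrive at offset $o_v$ simultaneously, and the objective charges each wasted cycle of buffering by the width of the data it holds.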
After all other graph-passes, a final graph-pass produces a hardware design. By this stage in compilation every node in the DAG has a direct mapping to an existing piece of parameterisable IP. Thus, this final pass creates a hardware design by instantiating one IP component per node in the graph. Hardware is created in Photon using further Java classes to describe structural designs, either directly using Java, by including external pre-written HDL, or by running external processes to generate IP, e.g. CoreGen for floating-point units. After a design is fully described, external synthesis or simulation tools are invoked by Java to produce the required output for this execution. The system used to implement this low-level structural design and tool automation is based on the model described in [1].
5. Case Study
As a case study we consider a simple 2D convolution filter. This kind of design is common in many digital image processing applications.
The filter we implement is shown in Figure 3. The filter is separable, using the equivalent of two 1D 5-point convolutions, and requires additions, subtractions and (after algebraic optimization to factor out common sub-expressions) 5 multiplications per input point.
5.1. Optimizations
The compilation of the convolution case study illustrates two of the optimization graph-passes during the Photon compilation process.
The Photon implementation of the filter makes use of several large stream-shifts on the input data. These shifts are necessary as each output data-point requires the 9 surrounding points to compute the convolved value. These stream-shifts result in a large number of buffers being added to the Photon design. Photon helps reduce this buffering using a graph-pass that combines the multiple delay buffers into a single long chain of buffers. This ensures each data item is only stored once, reducing buffering requirements.
Photon is able to use the precise value of the filter coefficient constants to optimize the floating-point multipliers. Specifically, some of the coefficients are a power of two, which can be highly optimized. To implement this, Photon includes another graph-pass which identifies floating-point multiplications by a power of two and replaces them with a dedicated node representing a dedicated hardware floating-point multiply-by-two IP core. This IP core uses a small number of LUTs to implement the multiplication rather than a DSP as in the conventional multipliers.
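As an aside on why such a core is cheap (an illustrative sketch, not Photon code): for a normalised IEEE-754 single-precision value, multiplying by two only increments the 8-bit exponent field, so no multiplier array is needed. The self-contained Java fragment below demonstrates the idea, falling back to an ordinary multiply for zeros, denormals and values close to overflow.

public class MulByTwoDemo {
    // Multiply a normalised IEEE-754 single by two by incrementing its exponent field.
    static float mulByTwo(float x) {
        int bits = Float.floatToRawIntBits(x);
        int exp = (bits >>> 23) & 0xFF;
        if (exp == 0 || exp >= 0xFE) {
            return x * 2.0f;                            // zero/denormal/near-overflow: real multiply
        }
        return Float.intBitsToFloat(bits + (1 << 23));  // exponent + 1
    }

    public static void main(String[] args) {
        System.out.println(mulByTwo(3.25f));            // prints 6.5
        System.out.println(mulByTwo(-0.375f));          // prints -0.75
    }
}

In hardware the same observation turns a floating-point multiplier into a small incrementer on the exponent bits, which is why the dedicated node needs only a handful of LUTs instead of a DSP block.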
5.2. Implementation Results
For synthesis we target our filter design to a Xilinx Virtex-5 LX110T FPGA, with a clock frequency of 250 MHz. At this speed, with data arriving at and exiting the circuit once per cycle, we achieve a sustained computation rate of 1 GB/s.
Table 2 shows the area impact of the Photon optimization graph-passes on the filter hardware. The multiplier power-of-two substitution pass reduces the number of DSP blocks used from 10 to 6, and the delay merging pass reduces BRAM usage from 18 RAMB36s to 4. The number of LUTs required for the original and optimized designs is similar.
6. Conclusion
In this paper we introduce Photon, a Java-based FPGA programming tool. We describe the programming model for Photon, in which data I/O is separated from computation, allowing designs to implicitly be easy to pipeline and hence perform well in an FPGA. We give an overview of Photon's implementation as a library directed by user-created Java programs. Finally, we present a case study demonstrating that Photon's pluggable optimization system can be used to improve the resource utilisation of designs. Our current and future work with Photon includes developing a system for making it easier to create the data I/O logic external to Photon designs, and creating more advanced optimization passes.
References
[1] J. A. Bower, W. N. Cho, and W. Luk, "Unifying FPGA hardware development," in International Conference on Field-Programmable Technology, December 2007, pp. 113–120.
[2] Impulse Accelerated Technologies Inc., "ImpulseC," http://www.impulsec.com/, 2008.
[3] Agility, "DK design suite," http://www.agilityds.com/, 2008.
[4] O. Mencer, M. Morf, and M. J. Flynn, "PAM-Blox: High performance FPGA design for adaptive computing," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 167–174.
[5] P. Bellows and B. Hutchings, "JHDL - An HDL for reconfigurable systems," in IEEE Symposium on FPGAs for Custom Computing Machines, Los Alamitos, CA, 1998, pp. 175–184.
[6] O. Mencer, "ASC: A stream compiler for computing with FPGAs," IEEE Transactions on CAD of ICs and Systems, vol. 25, pp. 1603–1617, 2006.
[7] P. Hudak, "Modular domain specific languages and tools," Intl. Conf. on Software Reuse, vol. 00, p. 134, 1998.
[8] O. Pell and R. G. Clapp, "Accelerating subsurface offset gathers for 3D seismic applications using FPGAs," SEG Tech. Program Expanded Abstracts, vol. 26, no. 1, pp. 2383–2387, 2007.
[9] D. B. Thomas, J. A. Bower, and W. Luk, "Hardware architectures for Monte-Carlo based financial simulations," in International Conference on Field-Programmable Technology, December 2006, pp. 377–380.
of an adaptive multiprocessor system by creating components, like processing nodes or memories, from a parallel program. Therefore message-passing, a paradigm for parallel programming on multiprocessor systems, is used. The analysis and simulation of the parallel application provides data for the formulation of constraints of the multiprocessor system. These constraints are used to solve an optimization problem with Integer Linear Programming: the creation of a suitable abstract multiprocessor hardware architecture and the mapping of tasks onto processors. The abstract architecture is then mapped onto a concrete architecture of components, like a specific PowerPC or soft-core processor, and is further processed using a vendor tool-chain for the generation of a configuration file for an FPGA.
1. Introduction
As apparent in current developments, the reduction of transistor size and the exploitation of instruction-level parallelization can no longer be continued to enhance the performance of processors [1]. Instead, multi-core processors are a common way of enhancing performance by exploiting parallelism of applications. However, designing and implementing multiple processors on a single chip leads to new problems, which are absent in the design of single-core processors. For example, an optimal communication infrastructure between the processors needs to be found. Also, software developers have to parallelize their applications, so that the performance of the application is increased through multiple processors. In the case of multiprocessor systems-on-chip (MPSoCs), which combine embedded heterogeneous or homogeneous processing nodes, memory systems, interconnection networks and peripheral components, even more problems arise. Partly because of the variety of technologies available and partly because of their
computing with multiprocessor systems exist: the communication through shared memory (SMP), i.e. cache or memory on a bus-based system, and the passing of messages (MPI) through a communication network. SMP architectures, like the Sun Niagara processor [8] or the IBM Cell BE processor [9], are the common multiprocessors today. MPI is typically used in computer clusters, where physically distributed processors communicate through a network.
This paper presents a survey of our developments in the area of adaptive MPSoC design with FPGAs (Field-Programmable Gate Arrays) as a flexible platform for Chip-Multi-Processors. In section 2 an overview of the proposed design approach for MPSoCs is given; the steps for architectural synthesis, starting with the analysis and simulation of a parallel program and ending with the generation of a bitfile for the configuration of an FPGA, are described in general. In the following section 3 an on-chip message passing software library for communication between tasks of a parallel program, and a benchmark for the purpose of evaluation, are presented. Section 4 summarizes the formulation of architecture constraints for the design space exploration with Integer Linear Programming. These constraints are formulated from the results of the analysis and simulation of a parallel program. The following section 5 gives an overview of the creation of MPSoCs using abstract components. Finally, this paper is concluded in section 6 and a brief overview of future work is given in section 7.
2. System design using architectural synthesis
To get an efficient multiprocessor system-on-chip from a parallel program several approaches are possible. In figure 1 our proposed synthesis flow using an analytical approach is shown. The architectural synthesis flow starts with the analysis of the parallel program.

Figure 1. Architectural Synthesis Flow

In the first step of the design flow, information on data traffic and on task precedence is extracted from functional simulations of the parallel program. Information on the number of cycles of a task when executed on a specific processor is determined from cycle accurate simulations. This information is used to formulate an instance of an Integer Linear Programming (ILP) problem.
In the following step, called Abstract Component creation, a combinatorial optimization is done by solving an ILP problem. In addition to the information gathered in the first step, platform constraints, e.g. area and speed of the target platform, are needed as well. As a result of this step an abstract system description including the (abstract) hard- and software parts is generated.
The third step is called Component mapping. The abstract system description, which consists of abstract processors, memories, communication components or hardware accelerators and software tasks linked onto abstract processors, is mapped onto a concrete architecture of components like PPC405, MicroBlaze or on-chip BRAMs. If needed, an operating system can be generated with scripts and makefiles and can be mapped onto a processor as well. This step can be done using the PinHaT software (Platform-independent Hardware generation Tool) [11].
In the final step a bitfile is generated from the concrete architecture description.

For the communication between tasks a small-sized message passing library was developed (see figure 2), which is similar to the approaches described in [13], [14] and [15].
Figure 2. SoC-MPI Library
The library consists of two layers: a network independent layer (NInL) and a network dependent layer (NDeL), for the separation of the hardware dependent part from the hardware independent part of the library. The advantage of this separation is the easy migration of the library to other platforms. The NInL provides MPI functions, like MPI_Send, MPI_Receive, MPI_BSend or MPI_BCast. These functions are used to perform the communication between processes in the program. The NDeL is an accumulation of network dependent functions for different network topologies. In this layer the ranks and addresses for concrete networks are determined and the cutting and sending of messages depending on the chosen network is carried out. Currently the length of a message is limited to 64 Bytes, due to the limited on-chip memory of FPGAs. Longer messages are therefore cut into several smaller messages and are sent in series. The parameters of the MPI functions, like count, comm or dest (destination), are also used as signals and parameters for the hardware components of the network topology. That is, the parameters are used to build the header and the data packets for the communication.

Figure 3. Configuration of processing nodes

In figure 3 several processing nodes are connected together via a star network. Additionally, nodes 0 and 1 are directly connected together via FSL (Fast Simplex Link) [16]. Each processing node has only a subset of the SoC-MPI Library, with the dependent functions for the network topology.
3.1. Benchmarks
The MPI library is evaluated using Intel MPI Benchmarks 3.1, which is the successor of the well known package PMB (Pallas MPI Benchmarks) [17]. The MPI implementation was benchmarked on a Xilinx ML-403 evaluation platform [18], which includes a Virtex 4 FPGA running at 100 MHz. Three MicroBlaze soft-core processors [19] were connected together via a star network. All programs were stored in the on-chip memories.
In Figure 4 the results of the five micro benchmarks are shown. Due to the limited on-chip memory not all benchmarks could be performed completely. Furthermore, a small decay between 128 and 256 Bytes message size exists, because the maximum MPI message length is currently limited to 251 KBytes and a message larger than that must be split into several messages. Further increase of the message size would lead to a bandwidth closer to the maximum possible bandwidth, which is limited by the MicroBlaze and was measured at approximately 14 MBytes/s.
4. Abstract component creation using Integer Linear Programming

Figure 4. Benchmarks of the SoC-MPI Library
This step is concerned with locating the best possible abstract architecture for a given parallel application under given constraints. The simultaneous optimization problem is to map parallel tasks to a set of processing elements and to generate a suitable communication architecture which meets the constraints of the target platform and minimizes the overall computation time of the parallel program. The input for this step is obtained by using a profiling tool for the mpich2 package. In the following two subsections, area and time constraints of processors and of the communication infrastructure are described separately.
4.1. Processors - sharing constraint, area constraint and costs
A few assumptions about processors and tasks need to be made, because it is possible to map several tasks onto a processor: (1) a task scheduler exists, so that scheduling is not involved in the optimization problem. (2) Task mapping is static. (3) The instruction sequence of a task is stored in the local program memory of the processor, e.g. instruction cache, and hence the size of the local program memory limits the number of tasks which can be mapped onto a processor. (4) Finally, the cost of switching tasks in terms of processor cycles does not vary from task to task. Let $I_i \in \{I_0, \ldots, I_n\}$ be a task, $J_j \in \{J_0, \ldots, J_m\}$ a processor and $x_{ij} \in \{0, 1\}$ a binary decision variable, where $x_{ij} = 1$ means that task $I_i$ is mapped onto processor $J_j$.

$$\sum_{j=0}^{m} x_{ij} = 1, \quad \forall I_i \qquad (1)$$

A constraint for task mapping (equation 2), called the address space constraint, and the cost of task switching (equation 3) can be formulated, where $s_{ij}$ is the size of a task $I_i$.

Since $x_{ij}$ only shows whether a task $I_i$ is mapped onto a processor $J_j$, and does not show the number of processors in the system or the number of instantiations of a processor, an auxiliary variable $v_j \in \{0, 1\}$ is needed. For each instance of a processor $J_j$ there is a corresponding virtual processor $v_j$, and for all tasks mapped to a certain processor there is only one task which is mapped to the corresponding virtual processor. This leads to the following constraint (equation 4), so that the area of the processors can be calculated with equation 5.

$$v_j \le \sum_{i=0}^{n} x_{ij}, \quad \forall J_j \qquad (4)$$

$$A_{PE} \ge \sum_{j=0}^{m} v_j \cdot a_j \qquad (5)$$
4.2. Communication networks - network capacity and area constraint
Several assumptions have to be made before constraints about the communication network can be formulated. The communication of two tasks mapped onto the same processor is done via intra-processor communication, which has a negligible communication latency and overhead compared to memory access latency. All processors can use any of the available communication networks and can use more than one network. A communication network has arbitration costs resulting from simultaneous access to the network. It is assumed that tasks are not prioritized and that an upper bound on arbitration time for each network can be computed for each network topology depending on the number of processors. Finally, it is not predictable when two or more tasks will attempt to access the network, though a certain probability can be assumed.
$\lambda_{i_1,i_2}$ is an auxiliary 0-1 decision variable that is 1 if two communicating tasks are mapped onto different processors. The sum of $x_{i_1 j_1}$ and $x_{i_2 j_2}$ equals two if the tasks are on different processors, as seen in equation 6.

$$x_{i_1 j_1} + x_{i_2 j_2} - 1 \le \lambda_{i_1,i_2} \qquad (6)$$

$$y_k + \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \triangleleft I_{i_2}} \lambda_{i_1,i_2} \le M_k, \quad \forall C_k \qquad (7)$$
The total area cost of the communication network (resources for routing) can be calculated with equation 8, where $A_k$ is the area cost of topology $C_k$.

$$A_{NET} \ge \sum_{k=0}^{K} A_k \cdot y_k \qquad (8)$$
The cost of the topology in terms of computation time is calculated in equation 10, where $z_{k i_1 i_2}$ is a binary decision variable which is 1 if a network will be used by two tasks; otherwise the variable $z_{k i_1 i_2}$ is 0. $D_{i_1 i_2}$ is the amount of data to be transferred between the two communicating tasks and $p_k$ is the probability that network arbitration will be involved when a task wants to communicate. The upper bound on arbitration time is $\tau_k$.

$$z_{k i_1 i_2} \ge \lambda_{i_1,i_2} + y_k - 1 \qquad (9)$$

$$T_{NET} = \sum_{I_{i_1}, I_{i_2} \,|\, I_{i_1} \triangleleft I_{i_2}} \left( \sum_{k=0}^{K} (D_{i_1 i_2} + \tau_k \cdot p_k) \, z_{k i_1 i_2} \right) \qquad (10)$$

Finally, the total area cost $A$ is calculated from the area of the processing elements $A_{PE}$ (equation 5) and the area for the routing resources $A_{NET}$ (equation 8).

$$A \ge A_{PE} + A_{NET} \qquad (11)$$
The cost of computation time can be calculated with equation 12, where $T_{ij}$ is the time requirement to process a task $I_i$ on a processor $J_j$. The objective, in this case, is to minimize the computation time of a (terminating) parallel program. However, for non-terminating programs, like signal processing programs, the objectives are different.
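A formulation consistent with the definitions of $T_{ij}$ and $x_{ij}$ above, given here as a sketch rather than as the authors' exact equation 12, sums for every task the execution time on the processor it is mapped to:

$$T_{PE} = \sum_{i=0}^{n} \sum_{j=0}^{m} T_{ij} \cdot x_{ij}$$

One natural objective is then to minimise $T_{PE} + T_{NET}$ subject to the area bound of equation 11.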
In the first step an abstract specification of the system using abstract components like CPUs, memories or hardware accelerators is described. In the following step these abstract components are refined and mapped to concrete components, e.g. a specific CPU (PPC405, MicroBlaze) or a hardware divider. Also the software tasks are mapped onto the concrete components. The structure of PinHaT is shown in figure 5. A detailed overview of the PinHaT tool is given in [11].
Figure 5. Structure of PinHaT
The component mapping is divided into the generation and the configuration of the system infrastructure, where hardware is generated and software is configured. In this flow, the input to PinHaT is obtained by high level synthesis.

Each IP-core class carries its own parameters. Such classes can be easily added to the framework to extend the IP-core base. In a subsequent step, another parser creates the platform specific hardware information file from the gathered information. In the second phase individual mappers for all components and target platforms are created, followed by the last phase, where a mapper creates the platform dependent hardware description files. These dependent hardware description files are then passed to the vendor's tool chain, e.g. Xilinx EDK or Altera Quartus II.
5.2. Configuration of the System Infrastructure - SW Mapping
In the case of software, a task is mapped onto a concrete processor. This is in contrast to the mapping of abstract components, e.g. processors or memories, to concrete ones during the generation of the system infrastructure.
For the mapping step, parameters of the software must be specified for each processor. The parameters include information about the application or the operating system, like source code, libraries or the OS type. With this information, scripts and Makefiles for building the standalone applications and the operating systems are created. While standalone applications only need compiling and linking of the application, building the operating system is more difficult. Depending on the operating system, different steps, like configuration of the file-system or the kernel parameters, are necessary. The result of the task mapping is an executable file for each processor in the system.
6. Conclusion
In this paper a concept for the design automation of multiprocessor systems on FPGAs was presented. A small-sized MPI library was implemented to use message passing for the communication between tasks of a parallel program.