FPGA Design Tools - the Challenges of Reporting Performance Data


Department of Electrical Engineering (Institutionen för systemteknik)

Degree project (Examensarbete)

FPGA Design Tools - the Challenges of Reporting Performance Data

Degree project in Computer Engineering, carried out at the Institute of Technology at Linköping University by

Stefan Persson

LiTH-ISY-EX--16/4935--SE

Linköping 2016

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Andreas Ehliar, ISY, Linköpings universitet

Examiner: Oscar Gustafsson, ISY, Linköpings universitet


Division, Department: Computer Engineering, Department of Electrical Engineering, SE-581 83 Linköping

Date: 2016-05-27

Language (form options): Swedish, English

Report category (form options): Licentiate thesis, Degree project (Examensarbete), C-level essay, D-level essay, Other report

URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-XXXXX

ISBN: -

ISRN: LiTH-ISY-EX--16/4935--SE

Title of series, numbering / ISSN: -

Title: Utvecklingsverktyg för FPGA - utmaningarna med att rapportera prestandadata (Swedish); FPGA Design Tools - the Challenges of Reporting Performance Data (English)

Author: Stefan Persson

Abstract

Since their introduction in the 1980s, field-programmable gate arrays (FPGAs) have seen growing use. Nowadays FPGAs are found in everything from planetary rovers and base transceiver stations to bitcoin miners. With the technological advancements and the growth of the market, there has been a steady flow of new models with increasing capacity. To make it possible to use this capacity efficiently, the software tools have also been improved. The applications in research have grown, and so has the desire to compare both the speed and the size of different implementations that try to solve the same or similar problems. However, how to make a good comparison is not well defined. Since few research papers have source code available, such comparisons are hard to make and there is a high risk of comparing apples to pears.

In this thesis, we study the impact of different software settings and design constraints on the FPGA design flows to better understand how to report research results. This is done by running selected designs through different EDA tools using various settings, and finally analysing the data the tools provide. At the end we begin to define guidelines for how to report and compare implementation data, to give a good account of a design's performance compared to other designs.



1 Introduction

Field-Programmable Gate Arrays (FPGAs) are widely used and can be found in everything from planetary rovers and base transceiver stations to game systems. Due to their flexibility, FPGAs are also excellent for prototyping Application-Specific Integrated Circuits (ASICs). They were first introduced in the 1980s, and a lot has happened since then in terms of resource density and software tools, but not much in how implementation metrics should be reported in research papers. In this introduction, we give a short description of what an FPGA is, explain why reporting is an issue, look deeper into the architecture of modern FPGAs, describe the general tool flow, review what has been done in the past, and state the scope and goals of this thesis.

1.1 A Brief Note on Field-Programmable Gate Array

FPGAs are a type of Programmable Logic Device (PLD), a configurable hardware chip. The main difference compared to older types is the use of Look-Up-Tables (LUTs), which makes it possible to handle a greater variety of complex functionality within a limited area. An FPGA is programmed using a Hardware Description Language (HDL), which differs from other computer languages by describing rather than instructing. For example, foo <= bar in HDL is interpreted as a wire connection from bar to foo, and not as the value of bar being assigned to foo at that point in the execution. As such, HDL has a sense of time in that everything triggers at certain intervals (clock cycles) and then flows through logic in parallel. This difference in the model of computation is one of the major differences from software development.
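To make the difference in the model of computation more concrete, the following Python sketch contrasts sequential software assignment with the simultaneous update that a clocked HDL description implies. It is a toy model and not HDL; the function names and the two-register example are illustrative assumptions.

```python
def software_step(foo, bar):
    """Sequential semantics: each statement sees the result of the previous one."""
    foo = bar   # foo takes bar's value immediately
    bar = foo   # bar reads the already-updated foo
    return foo, bar


def clocked_step(foo, bar):
    """HDL-like semantics: all right-hand sides are sampled at the clock edge,
    then every register is updated at the same time."""
    next_foo, next_bar = bar, foo   # sample the current values in parallel
    return next_foo, next_bar


print(software_step(1, 2))  # both variables end up with bar's old value
print(clocked_step(1, 2))   # the two registers swap their values
```

The same two lines of source text therefore describe different behaviour depending on whether they are read as instructions or as a description of connected registers.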


Today the market is dominated by two companies that together hold almost 90% of the market[1, 2]. These are Xilinx, with 49% of the market in 2010, and Altera, with 40% in 2010. We will discuss their architectures and software tools in Sections 1.3, 1.4 and 2.2.

1.2 The Issue

To see how performance metrics can be an issue, we need to look at the synthesis process, which is where the code is translated and mapped to physical resources. This process often comes across as counterintuitive and is at times viewed as black magic, where small changes can make a larger difference than one would expect. This is not that strange, as the process is inherently random, and for good reasons. As the amount of resources on an FPGA rises, so does the number of possible ways to map the code to those resources. As an example, synthesising a design needing 500 flip-flops onto a Xilinx Spartan-6 X4, which has 4800 flip-flops, gives about

$$\frac{4800!}{(4800 - 500)!} \approx 10^{1828}$$

different possibilities to place the flip-flops alone. In comparison, the number of atoms in the known universe is estimated to be on the order of $10^{80}$[3]. In terms of the time it would take to find the optimum, we can look at how long it takes to brute force an Advanced Encryption Standard (AES)-128 key, which can be one of $10^{38}$ different values. Even with a supercomputer it could take over $10^{18}$ years[4]. The universe has an age of more than $10^{10}$ years[3], so doing a complete search of the design space is simply not possible. Instead, a heuristic approach is used to find a solution that is "good enough" out of the tiny portion that is graspable. This is where the randomness is added to the process. From a commercial point of view it is normally sufficient to just meet requirements, as getting the product to market in time is the greater worry. For the academic world, extra time until publication to maximise performance is a much more valid trade-off, so knowing the process thoroughly should be of greater concern. Unfortunately, this does not always seem to be the case, which can hurt publications unnecessarily or, even worse, lead to an entirely misleading result.
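The order of magnitude of the placement count can be checked without ever forming the huge factorials. The sketch below is an illustration only (the function name is an assumption); it uses the log-gamma function to obtain the base-10 exponent of 4800!/(4800 - 500)! and compares it with the exponent quoted for the number of atoms.

```python
import math


def log10_arrangements(total, needed):
    """log10 of total! / (total - needed)!, via log-gamma to avoid huge integers."""
    ln_value = math.lgamma(total + 1) - math.lgamma(total - needed + 1)
    return ln_value / math.log(10)


exponent = log10_arrangements(4800, 500)
print(f"ordered placements of 500 flip-flops: about 10^{exponent:.1f}")
print(f"orders of magnitude above 10^80 atoms: {exponent - 80:.0f}")
```

The exponent lands between 1828 and 1829, in line with the figure given above.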

1.3 Field-Programmable Gate Array Architectures

As previously mentioned, two companies dominate the FPGA market, and while their products have many similarities, the differences have been large enough for them to give components and concepts different names. Starting with the similarities, an FPGA mostly consists of an array of interconnected basic blocks. Altera calls them Logic Array Blocks (LABs), while Xilinx calls them Configurable Logic Blocks (CLBs). Both consist of multiple instances of another block, called Adaptive Logic Module (ALM) by Altera and Slice by Xilinx. Inside these blocks we find the previously mentioned look-up-tables and flip-flops, but also other components, e.g. logic gates. An in-depth explanation of these blocks can be found in [5] and [6]. At the same level as LABs and CLBs we also have Digital Signal Processing (DSP) blocks and Memory Logic Array Block (MLAB)/Block Random Access Memory (BRAM) blocks. There are also Input/Output (I/O) pins and clock generators, which are of interest for this thesis.

1.4 The Tool Chain

In a broad sense there are four parts to the process. The first is the compilation of HDL code, abstract structures and logic functions into architecture-specific structures. Here optimisation is done on the logic functions of the design, e.g. how to structure a state machine. After this phase the first estimates of resources and timing are available, but given the rework done in later phases they should only be regarded as approximate.

Next, the intermediate logic is mapped to structures of the chosen device. As discussed in Section 1.2, finding an optimal placement is an overwhelming task, and it is here that randomness is added to the process. The user can manage this using seeds, numbers used to generate the pseudorandom behaviour of the process. A given seed always gives the same result if everything else remains the same. Logic may also be duplicated in this phase to increase the performance of the design.
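The role of the seed can be illustrated with a toy model. The sketch below is not how any vendor tool places logic; the cost function, the cell and site counts and the function name are all assumptions, chosen only to show that a seeded pseudorandom search is repeatable per seed while differing between seeds.

```python
import random


def random_placement_cost(seed, cells=20, sites=100):
    """Toy stand-in for seeded placement: pick sites pseudo-randomly and
    return a made-up cost (sum of distances between consecutive cells)."""
    rng = random.Random(seed)                  # the seed fully determines the sequence
    chosen = rng.sample(range(sites), cells)   # one distinct site per cell
    return sum(abs(a - b) for a, b in zip(chosen, chosen[1:]))


costs = {seed: random_placement_cost(seed) for seed in range(1, 6)}
print(costs)                                                  # varies between seeds
print(random_placement_cost(3) == random_placement_cost(3))   # same seed, same result
```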

After that it is time to route the structures, which in itself is an NP-complete task[7]. In cases where all of the interconnect is used, the tools can use the look-up-tables as route-throughs to pass a slice or ALM. In both phases multiple optimisations are performed, both locally and globally.

Lastly, static timing analysis is done on the final layout to find out if it meets the requirements of the design. The delays are usually categorised as setup time, hold time or propagation delay. Setup time is how long before a clock pulse the data needs to be stable, while hold time is how long after a clock pulse it still needs to be stable. Propagation delay is the time it takes for the data to travel through the logic all the way to the flip-flop. Together they define three reported figures: Input to Output, the propagation delay from an input to an output without being stored in a flip-flop or similar; Clock to Output, the time it takes for signals to ripple from the clock pulse all the way to the output pin; and Fmax, the fastest the layout can run according to setup and hold times.
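As a rough illustration of how these delays bound the clock frequency, the sketch below uses the common textbook simplification that the minimum clock period of a register-to-register path is the sum of the clock-to-output delay of the launching flip-flop, the propagation delay through the logic and the setup time of the capturing flip-flop. This model, the parameter names and the numbers are assumptions for illustration; real timing analysis also accounts for clock skew, jitter and hold requirements.

```python
def fmax_mhz(clk_to_q_ns, logic_delay_ns, setup_ns):
    """Maximum clock frequency in MHz for one register-to-register path.
    Simplified model: ignores clock skew, jitter and hold checks."""
    min_period_ns = clk_to_q_ns + logic_delay_ns + setup_ns
    return 1000.0 / min_period_ns   # a period in ns corresponds to 1000/period MHz


# Illustrative numbers only, not taken from any tool report.
print(fmax_mhz(0.5, 3.0, 0.5))  # 4 ns critical path -> 250.0 MHz
```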


1.5 Literature Study

Much of the past published input on the subject has come from the security community. In [8], it is discussed how reporting resources like Slices/ALMs, LUTs and DSPs by themselves is not enough for a qualitative comparison of implementations, as a DSP- or BRAM-heavy design is likely to use fewer LUTs than a design that does not use those resources at all. The same author shows in [9] how the same code can reach different maximum frequencies depending on synthesis tool settings that are seldom or never reported. Both papers advocate reproducible research, by offering source code, as the proper solution to the issue, while noting that uneven effort, or even an inability, to maximise the designs would result in the researchers' own design always having an advantage over the ones it is compared to. There have also been attempts at creating a "fair" methodology for comparing cryptography implementations[10], which will be discussed in more detail in Section 2.1.

Other communities have also touched on the subject. In [11], it is examined how close to a known optimum the synthesis tools come. In [12], it is shown that a manual layout can still beat the tools in performance. And in [13], it is explained how to use the clock period constraint with Xilinx tools.

Outside the FPGA communities the discussion is old and similar[14], with reproducible research being desired but still not achieved, even though open services like SourceForge make hosting a non-issue[15]. The same discussion also brings up how translating an algorithm into native code is hard and error-prone. A study of signal processing articles shows that making code available increases the likelihood of a paper having a higher than average number of citations[16]. Reference [17], from the ASIC community, does something similar to what we will do in this study; we will return to this in Section 2.1.

Altera and Xilinx have also contributed to the discussion, with Altera presenting a benchmarking methodology meant for comparing architectures; many of the concerns are the same when comparing different designs[18, 19]. However, comparing designs using different architectures is not straightforward and can easily turn into a minefield[20, 21]. It has also been documented how the maximum frequency of designs can vary between versions of the software tools[22].

The latest and most interesting trend is the machine learning approach to finding optimal settings. One example is a commercial tool focusing on Altera and cloud computing[23], and another focuses on Xilinx[24]. Unfortunately, these are beyond the scope of this thesis.


1.6 Scope and Purpose

In this thesis we will look into the synthesis process to find settings and choices that impact the performance and size of the final design. The aim is a list of things to consider in order to maximise the frequency and/or minimise the resource usage of one's design. As this can be continued more or less indefinitely, as much as possible will be tested within the time frame of this thesis, with special focus on Xilinx ISE but to some extent also Altera Quartus II. This choice is justified since Xilinx has a larger share of the market[2] and seems to be more used by the research community in general and at Linköping University in particular. At the end of this report, an attempt will be made to answer this question: What factors need to be taken into account when trying to find the "best" result with respect to speed and area when synthesising for an FPGA?


2 Method

In Section 1.5 a handful of methods for benchmarking designs were discussed. We will look at them in relation to the purpose of the survey, present the method that was chosen in a general way and list the designs used. Furthermore, we will give an account of the chosen settings and data and how the general method was tweaked to satisfy specific requirements.

2.1 Discussed Methods

The first method is the one presented in [18, 19]. It is intended for comparing different FPGA architectures, and many of the concerns mentioned in the papers are also applicable here. However, comparing architectures is outside the scope of this thesis, and following the methodology would be too time consuming.

Reference [17] presents a method for exploring a tool's behaviour while varying the clock constraint, and for presenting the results using graphs. This was done for Synopsys' ASIC tool Design Compiler by starting at two end points and halving the distance, running a setting in the middle of them, for as long as the resulting layouts differ and the distance between settings is larger than an arbitrary value. With seeds, this exact method would result in the script running all settings spaced that value apart: no matter how short the distance, the results will most likely be different.
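The interval-halving idea can be sketched as follows. This is one reading of the description above, not the implementation from [17]: the callback `run`, the stopping rule (stop when two neighbouring settings give equal results or are no further apart than `min_step`) and the toy tool model are assumptions for illustration.

```python
def explore(run, low, high, min_step):
    """Sample the constraint range [low, high] by repeated halving.
    `run` is a hypothetical callback mapping a constraint value to a result."""
    results = {low: run(low), high: run(high)}

    def refine(a, b):
        # Stop when the neighbours agree or are closer than the minimum step.
        if results[a] == results[b] or (b - a) <= min_step:
            return
        mid = (a + b) / 2
        results[mid] = run(mid)
        refine(a, mid)
        refine(mid, b)

    refine(low, high)
    return dict(sorted(results.items()))


def toy_tool(period_ns):
    """Toy model: the achieved period follows the constraint but never beats 4 ns."""
    return max(period_ns, 4.0)


print(explore(toy_tool, 2.0, 10.0, min_step=1.0))
```

In the toy model, the range below 4 ns is not subdivided further because both ends give the same result. With seed-dependent results, two neighbouring settings would almost never agree, so every interval would be refined down to `min_step`.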


Last we have [10], a set of Perl scripts that perform a Design Space Exploration (DSE) similar to Altera Design Space Explorer and Xilinx SmartXplorer. None of these cares about the sub-optimal results, which would be of interest to analyse and visualise in order to better understand the overall behaviour and not only the best case.

In the end, all three methods lacked something desirable, so a mix of the three was chosen. A set of designs was chosen with the aim of having variety in both function and resource usage. A set of smaller scripts was developed to help with synthesising and collecting data.

2.2 The Chosen Method

Algorithm 1: Perl script to run the tools

    create constraint file
    run programs not dependent on seed
    for all seeds do
        create folder
        run programs dependent on seed
        move report files to folder
    end for

Figure 2.1: The flow for Altera and Xilinx. Xilinx ISE: XST -> NGDBuild -> MAP -> PAR -> TRACE. Altera Quartus: MAP -> FIT -> TimeQuest.

Central to the method is a set of Perl scripts, described in Algorithm 1. The arguments to the scripts vary depending on what the constraint files are set to contain; these files are .xst[25] and .ucf[26] for Xilinx and .sdc[27] for Altera. Some settings are given on the command line when calling the tools. Makefiles were written to manage the calls to the scripts and to better handle the storing of the report files. They call the scripts with the proper arguments and save the folders in a tar file. This enabled running multiple scripts in succession automatically.

The tools examined are Xilinx ISE 14.7 and Altera Quartus 13.0sp1. The flows for Altera Quartus and Xilinx ISE can be seen in Figure 2.1. XST and NGDBuild for ISE correspond to Quartus MAP; this is the compilation part of the process. MAP and PAR can be said to do the same as FIT, that is, placement and routing, while TRACE and TimeQuest are mostly equivalent as the timing analysis tools. MATLAB scripts were used for extracting and presenting the data. Section 2.4 discusses what data were extracted.
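The structure of Algorithm 1 can be sketched in Python as a dry-run planner. The thesis scripts were written in Perl and are not reproduced here; the function name, the folder naming and the split of Xilinx stages into seed-independent and seed-dependent ones are assumptions made for illustration, and nothing is executed.

```python
def plan_seed_sweep(seeds, fixed_stages, seeded_stages, out_root="runs"):
    """Return the ordered steps of a seed sweep as (action, detail) tuples.
    Stage names are placeholders for real tool invocations."""
    steps = [("write", "constraint file")]
    steps += [("run", stage) for stage in fixed_stages]       # once per sweep
    for seed in seeds:
        folder = f"{out_root}/seed_{seed:03d}"
        steps.append(("mkdir", folder))
        steps += [("run", f"{stage} (seed {seed})") for stage in seeded_stages]
        steps.append(("move reports", folder))
    return steps


# Assumed split: compilation once, then placement, routing and timing per seed.
for step in plan_seed_sweep([1, 2], ["XST", "NGDBuild"], ["MAP", "PAR", "TRACE"]):
    print(step)
```

Running the seed-independent stages once outside the loop avoids repeating the compilation for every seed, which is the main point of the structure.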


2.2.1 The Chosen Designs

Choosing the designs is as hard as it is important. You can only find the behaviour generated by the designs, so having a varied benchmark is important. On the other hand, you don’t really know if they are good until you have the results, if ever. The easiest way to raise the odds of having a good benchmark is to make it very large. As time was limited, this was not an option.

One option is to use an already existing benchmark. An example is [28], which consists of 84 different designs from OpenCores and Aeroflex Gaisler, among others. While some of its designs were used, the benchmark as such has too many designs for this survey considering the limited time frame and resources at hand. In the end, nine designs were chosen:

• Avalon AES ECB-Core[29], a cryptography design implementing AES by Thomas Ruschival and released under BSD license.

• Discrete Cosine Transform core[30], a Discrete Cosine Transform (DCT) core by Michal Krepa, Emrah Yuce and Andreas Bergmann.

• FIR Filter, a small Finite Impulse Response (FIR) core by Stefan Persson. Code available in Appendix 5.3.

• Floating Point Unit Array[31], a floating point adder and multiplier by Guillermo Marcus. Added code available in Appendix 5.3.

• LatticeMico32[32], a Reduced Instruction Set Computing (RISC) microprocessor by Lattice Semiconductor.

• LEON3[33], a RISC microprocessor by Aeroflex Gaisler and released under GNU GPL. Config settings available in Appendix 5.3.

• minSoC[34], a System-on-Chip design consisting of an OpenRISC1200 processor[35], an Ethernet core[36], a UART core[37] and a debugging core[38], developed by various people and released under LGPL.

• nova[39], a H.264/AVC Baseline Decoder by Ke Xu.

• Spiral 128-bit DFT[40], a Fast Fourier Transform (FFT) core generated from software by Peter A. Milder, with an internal-research-use-only type of license. Settings for generating it are available in Appendix 5.3.


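| | AES | DCT | FIR | FPU | LM32 | LEON | SoC | nova | DFT |
|---|---|---|---|---|---|---|---|---|---|
| LUT | •• | •• | • | ••• | •• | ••• | ••(•) | ••• | •• |
| DSP | N/A | N/A | • | •• | • | N/A | • | • | •• |
| RAM | •• | • | N/A | N/A | •• | ••• | ••(•) | •• | •• |
| Freq | •• | ••• | •• | •• | •• | • | •(••) | • | ••• |
| I/O | •• | • | • | •• | ••• | ••(•) | • | •• | ••• |

Table 2.1: Relative properties of the designs.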

| | • | •• | ••• |
|---|---|---|---|
| LUT | 1 – $10^3$ | $10^3$ – $10^4$ | $10^4$ – |
| DSP | 1 – 25 | 25 – 75 | 75 – |
| RAM | 1 – $10^5$ | $10^5$ – $10^6$ | $10^6$ – |
| Freq | 1 – 100 | 100 – 250 | 250 – |
| I/O | 1 – 50 | 50 – 200 | 200 – |

Table 2.2: Intervals of Table 2.1. RAM lists bits and Frequency is in MHz.

Table 2.1 presents the properties of the different designs relative to each other, and Table 2.2 lists the exact intervals. A note on LEON3 and minSoC, where the properties for Altera and Xilinx differ: LEON3 has optimised designs for a variety of development boards using both Altera and Xilinx FPGAs, where identical interfaces are not always available. The alternative is to build a top structure yourself, but even then the interfaces to clocks and memory will be specific to the FPGA family in question. For the survey, similar premade designs were modified to have the same core settings, which results in different numbers of I/Os. With minSoC, Altera's tools put a portion of the LUTs into BRAM, resulting in a slower design, making it effectively ••, •••, • for Altera and •••, ••, ••• for Xilinx, in the order of LUT, Random Access Memory (RAM) and frequency. Together, this is a test suite of manageable size where no two designs have the same relative properties based on the previously defined arbitrary thresholds.
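The intervals of Table 2.2 can be applied mechanically to a measured value. The sketch below is illustrative: the function name is hypothetical, and two details are assumptions because the table does not settle them, namely that a value exactly on a shared boundary falls in the lower class and that a resource that is not used at all is reported as N/A.

```python
import bisect

# Upper bounds of the "•" and "••" classes, taken from Table 2.2.
THRESHOLDS = {
    "LUT": (10**3, 10**4),
    "DSP": (25, 75),
    "RAM": (10**5, 10**6),   # bits
    "Freq": (100, 250),      # MHz
    "I/O": (50, 200),
}


def classify(metric, value):
    """Return '•', '••' or '•••' for a value, or 'N/A' if the resource is unused."""
    if value <= 0:
        return "N/A"
    return "•" * (bisect.bisect_left(THRESHOLDS[metric], value) + 1)


print(classify("LUT", 1562), classify("Freq", 276.4), classify("DSP", 0))
```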

2.3 The Survey and Expected Result

The survey will start with a seed sweep with no constraints on the designs. This gives the metrics of the designs as is, i.e. as unaltered as possible, and will function as a reference point for the rest of the survey. We expect this to show that seeds have an impact on performance. Seeds are set by an argument to the commands: -seed <number> for Altera and -t <number> for Xilinx.

With a reference in place, the next step is to add clock constraints and different numbers of registers on the I/Os. This is to remove any unnecessary delay at the I/O pins and to shorten the routing distance to and from the I/O pins. We expect the use of clock constraints to boost performance, and that we will see an initial increase followed by stagnation as the number of registers on the I/Os increases. The register arrangement can be seen in Figure 2.2. The clock constraint is placed in a constraint file and can be written for Altera as

    create_clock -name clk -period <constraint> [get_ports clk]

and for Xilinx as

    NET "clk" PERIOD = <constraint> ns HIGH 50% INPUT_JITTER 50 ps;

Figure 2.2: Delay elements (registers, D) between Input/Output pins and the Design Under Test (DUT)
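When sweeping clock constraints, the two constraint lines quoted above have to be regenerated for every period. A minimal sketch of such a generator is shown below; the function names and the swept periods are assumptions, while the emitted text follows the templates given in this section.

```python
def sdc_clock(period_ns, port="clk"):
    """Altera .sdc clock constraint, following the template quoted in the text."""
    return f"create_clock -name {port} -period {period_ns} [get_ports {port}]"


def ucf_clock(period_ns, net="clk"):
    """Xilinx .ucf clock constraint, following the template quoted in the text."""
    return f'NET "{net}" PERIOD = {period_ns} ns HIGH 50% INPUT_JITTER 50 ps;'


# Illustrative sweep of three clock periods in nanoseconds.
for period in (10.0, 5.0, 2.5):
    print(sdc_clock(period))
    print(ucf_clock(period))
```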

We will look at different optimisation settings for the tools. For Xilinx the settings are Area and Speed, while Altera also has Balanced. We expect the results to follow the naming of the settings. For Altera this is sent as an argument to the command, --optimize <setting>, while for Xilinx it is set in the xst file as -opt_mode <setting>.

To see if the data we get is applicable even when resources are limited, we will restrict the region the design is allowed to occupy. We will also see if these numbers are comparable to those obtained with multiple designs on an FPGA. We expect that performance will decrease in both cases. For Xilinx this will be done in the constraint file as

    INST "<instance>" AREA_GROUP="<group>";
    AREA_GROUP "<group>" RANGE = SLICE_<start coordinate>:SLICE_<end coordinate>;

and also with CLOCKREGION_<coordinate>.

We will also look at architecture and speed grade. Our expectation is that performance will increase with a more high-end architecture and a higher speed grade. This is generally decided when choosing the FPGA, but for Altera the speed grade can be selected at timing analysis with the argument --speed <grade>.

2.4 Extracted Data

As discussed in Section 1.3, the historical architectural differences between Altera and Xilinx FPGAs have led to different naming and handling of their logic blocks. Slices and LUTs are the two most frequently reported area metrics for Xilinx, while ALMs and Adaptive Look-Up-Tables (ALUTs) are the two most frequently reported for Altera. Other reported logic elements are registers/flip-flops, DSP blocks and on-chip memory. More often reported than any other metric is the maximum frequency of the design. As such, we will look at Slices/ALMs, (A)LUTs and frequency, with mention of registers, DSP blocks and memory. To clarify, by performance we mean register-to-register performance and not performance to and from I/Os. I/O delay is important when implementing a design on an actual FPGA, but as research more often looks at IP cores intended for use in bigger systems without direct connection to I/O pins, this is a reasonable simplification. In cases where this does not seem reasonable, the I/O delay will be listed.

On Altera’s ALM count, it can be noted that the main number reported in the FIT report is not the number of ALMs used, but instead the result of the equation

ALMcount = A − B + C

where A is the real number of ALMs used, B is an estimate of how many of the half-used ALMs can be used again at a later stage as the design grows, and C is an estimate of how many of these will be unavailable due to, for example, routing[41]. Still, that is the number we will use unless stated otherwise, as we assume the community at large does too.

3 Survey Results

In this chapter, the results of the survey discussed in Section 2 will be listed. Both vendors will have their own section starting with Xilinx followed by Altera. Each section will contain subsections for the different parts of the survey.

3.1 Xilinx

In this part of the chapter we present the results from the survey of the Xilinx tools. First we run with no clock constraint and no registers on the I/Os, then add constraints and different numbers of registers before looking at the impact of speed grade. Lastly, we limit the part of the chip the tools can use with different constraints and add multiple designs at the same time.

Unless otherwise stated, the numbers of slices and LUTs come from the PAR report and the frequency is the Clock to Setup, which means that all I/O timing was ignored. The use of registers and RAM blocks is not reported. The reason in the case of registers is given in Section 3.1.1. For RAM, it is hard to know how much is used, as an entire block of at least 18 Kb is flagged as used as soon as a single bit of it is used, which makes the measure too coarse-grained to serve any purpose in this survey.


3.1.1 No Constraint, No Register

This part was done with unmodified versions of the designs listed in Section 2.2.1. They were synthesised using version 14.7i of the tools onto one device from each of the families, namely XC7A200T-3FBG676, XC7K410T-3FBG900 and XC7VX550T-3FFG1158, and all 100 seeds were tested. For Artix and Virtex only PAR and timing reports were saved, while for Kintex the XST and MAP reports were saved too. For Artix all designs were tested, while for Kintex and Virtex, LEON3 was not included because of issues that were deemed too time consuming to fix. Table 3.1 presents the results of the survey for Artix. Registers have a standard deviation of 0, so their σ is not listed in any of the tables. As previously stated, 100 runs with 100 different seeds were done in this section.

| Design | Slices µ | Slices σ | LUTs µ | LUTs σ | Frequency (MHz) µ | Frequency (MHz) σ | Registers µ |
|---|---|---|---|---|---|---|---|
| AES | 551.2 | 20.62 | 1562 | 3.879 | 153.0 | 4.622 | 580 |
| DCT | 538.5 | 30.19 | 1523 | 24.59 | 276.4 | 8.460 | 1940 |
| FIR | 39.59 | 2.340 | 79.96 | 4.388 | 677.8 (a) | 79.88 | 101 |
| FPU | 4757 | 243.1 | 12999 | 4.011 | 54.20 | 20.59 | 6209 |
| LM32 | 871.1 | 102.8 | 1870 | 20.62 | 133.7 | 20.70 | 1595 |
| LEON3 | 13966 | 206.2 | 33755 | 133.8 | 74.94 | 14.61 | 16311 |
| minSoC | 6270 | 393.1 | 17933 | 13.36 | 315.5 | 15.58 | 5942 |
| nova | 12488 | 607.3 | 31888 | 10.66 | 19.99 | 4.367 | 7005 |
| DFT | 1683 | 40.59 | 5281 | 35.83 | 360.2 | 88.27 | 6806 |

Table 3.1: Survey results for Artix, data from PAR and TRACE

(a) For reasons mentioned in the text, this frequency is purely theoretical.

Table 3.2 lists the minimum numbers of slices and LUTs and the highest maximum frequency. These three metrics do not have to come from the same seed; each is simply the best value found.

| Design | Min Slices | Min LUTs | Max Frequency (MHz) |
|---|---|---|---|
| AES | 501 | 1558 | 164.5 |
| DCT | 472 | 1431 | 300.4 |
| FIR | 29 | 76 | 853.2 (a) |
| FPU | 3792 | 12984 | 125.3 |
| LM32 | 606 | 1817 | 156.9 |
| LEON3 | 13245 | 33635 | 105.2 |
| minSoC | 4842 | 17915 | 341.2 |
| nova | 9520 | 31877 | 30.55 |
| DFT | 1612 | 5194 | 467.3 |

Table 3.2: Survey results for Artix, data from PAR and TRACE
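The kind of per-design summary shown in Tables 3.1 and 3.2 can be computed from per-seed results as sketched below. The data extraction in the thesis was done with MATLAB scripts that are not shown; this Python version, its dictionary layout and its use of the sample standard deviation are assumptions, and the input numbers are made up.

```python
import statistics


def summarise(runs):
    """runs: list of dicts with keys 'slices', 'luts', 'fmax', one per seed."""
    summary = {}
    for key in ("slices", "luts", "fmax"):
        values = [run[key] for run in runs]
        summary[key] = {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),   # sample standard deviation (assumed)
            # Best means fewest resources, but highest frequency.
            "best": max(values) if key == "fmax" else min(values),
        }
    return summary


# Made-up per-seed results, not data from the survey.
runs = [
    {"slices": 520, "luts": 1560, "fmax": 150.0},
    {"slices": 560, "luts": 1558, "fmax": 158.0},
    {"slices": 540, "luts": 1565, "fmax": 154.0},
]
print(summarise(runs))
```

In this made-up data the fewest slices come from the first run while the fewest LUTs and the highest frequency come from the second, which mirrors the remark that the best values need not originate from the same seed.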


Looking at the frequencies for FIR, we can conclude that those numbers are not trustworthy: according to the PAR report, the highest frequency the device can handle is 628.9 MHz. Table 3.3 shows the highest I/O to I/O delay for the first seed. Designs not included either did not have any input directly connected to any output or have internal paths with a higher delay.

| Design | Delay (ns) | Frequency (MHz) |
|---|---|---|
| AES | 10.50 | 95.21 |
| FIR | 16.10 | 62.10 |
| FPU | 22.03 | 45.39 |
| nova | 44.14 | 22.66 |

Table 3.3: Highest Input pin to Output pin delay, first seed only

Figures 3.1, 3.2 and 3.3 plot maximum frequency against slice usage, LUTs against slice usage, and maximum frequency against the running time of PAR for the layouts.

[Figure 3.1 plots: one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT, LEON3); x-axis: Slices, y-axis: Max frequency (MHz).]

Figure 3.1: Maximum Frequency to Number of Slices of the Designs when synthesising to an Artix device

[Figure 3.2 plots: one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT, LEON3); x-axis: Slices, y-axis: LUTs.]

Figure 3.2: Number of LUTs to Number of Slices of the Designs when synthesising to an Artix device

[Figure 3.3 plots: one panel per design; x-axis: Running time (seconds for FIR, AES, DCT and LatticeMico32; minutes for nova, FPU, minSoC, Spiral DFT and LEON3), y-axis: Max frequency (MHz).]

Figure 3.3: Running time of PAR to Maximum Frequency of the Designs when synthesising to an Artix device

Figure 3.4 shows histograms of the maximum frequency for different numbers of seeds. The three plots were chosen to represent the different cases present in the survey for Artix. For four of the designs, one or more of the layouts with the highest performance can be found within the first ten seeds. For three of the designs, between 11 and 25 seeds must be run to find one of them. The last two designs have their first high-performing layout among the first fifty seeds and in the later half of the seeds, respectively.

[Figure 3.4 plots: histograms of Frequency (MHz) for AES, DCT and LEON3, each drawn for seeds 1 to 10, 1 to 25, 1 to 50 and 1 to 100.]

Tables 3.4 and 3.5 list the data from synthesising the designs to a Kintex device, and Figures 3.5, 3.6 and 3.7 show the same types of graphs as for Artix. The highest frequency for the Kintex devices according to the PAR report is 741.8 MHz.

Design  µ      σ      µ      σ      µ       σ      µ
AES     610.5  22.33  1654   3.617  198.1   5.874  580
DCT     599.2  29.86  1495   29.72  405.9   16.58  1940
FIR     55.49  3.301  130.9  5.634  879.1a  85.24  101
FPU     4324   66.86  11744  6.644  171.3   2.565  6209
LM32    892.7  51.31  2037   21.31  186.8   14.37  1591
minSoC  6798   112.5  18855  12.96  452.9   14.89  5957
nova    12966  969.2  31611  10.06  36.55   5.902  6994
DFT     1678   27.20  5282   22.31  442.8   63.89  6806

Table 3.4: Survey results for Kintex, data from PAR and TRACE
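To compare the spread between designs of very different speed, the standard deviation can be expressed relative to the mean. The sketch below does this for three rows of Table 3.4, assuming that the third µ/σ pair is the maximum frequency in MHz (which agrees with the axis ranges of Figure 3.5). The function name `relative_spread` is only illustrative.

```python
# Mean and standard deviation of the maximum frequency (MHz), copied from
# Table 3.4 under the assumption that the third mu/sigma pair is frequency.
kintex_freq = {
    "AES": (198.1, 5.874),
    "nova": (36.55, 5.902),
    "DFT": (442.8, 63.89),
}


def relative_spread(mean, std):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * std / mean


spread = {name: relative_spread(m, s) for name, (m, s) in kintex_freq.items()}
for name, pct in spread.items():
    print(f"{name}: sigma is {pct:.1f} % of the mean")
```

Read this way, AES varies by a few percent between seeds, while nova and DFT vary by well over ten percent.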

AES     569    1650   209.6
DCT     503    1399   436.1
FIR     47     124    1078a
FPU     4150   11731  175.8
LM32    788    1966   207.8
minSoC  6467   18819  485.0
nova    11067  31593  48.63
DFT     1630   5230   578.0

Table 3.5: Survey results for Kintex, data from PAR and TRACE

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Slices and Max frequency (MHz).]

Figure 3.5: Maximum Frequency to Number of Slices of the Designs when synthesising to a Kintex device

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Slices and LUTs.]

Figure 3.6: Number of LUTs to Number of Slices of the Designs when synthesising to a Kintex device

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Running time (sec) and Max frequency (MHz).]

Figure 3.7: Running time of PAR to Maximum Frequency of the Designs when synthesising to a Kintex device

The seed distribution for Kintex is such that one design needs 10 seeds to find one of the better layouts, three need 25, one needs 50 and the last three need all 100 seeds.
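The counting behind these seed distributions can be sketched as follows. This is a minimal illustration and not the script used in the survey: the function names, the `tolerance_mhz` parameter (our stand-in for what counts as "one of the better layouts") and the example frequencies are all hypothetical.

```python
def seeds_needed(freqs_by_seed, tolerance_mhz=0.0):
    """Return the 1-based index of the first seed whose frequency lies
    within tolerance_mhz of the best frequency in the whole sweep."""
    best = max(freqs_by_seed)
    for seed, freq in enumerate(freqs_by_seed, start=1):
        if freq >= best - tolerance_mhz:
            return seed


def bucket(seed, limits=(10, 25, 50, 100)):
    """Map a seed index to the smallest sweep size that contains it."""
    for limit in limits:
        if seed <= limit:
            return limit
    raise ValueError("seed outside the sweep")


# Hypothetical frequencies (MHz) for a six-seed sweep, for illustration only.
example = [198.0, 201.5, 209.6, 195.2, 209.6, 190.1]
first = seeds_needed(example)
print(f"first best layout at seed {first}, found within {bucket(first)} seeds")
```

With a tolerance of zero only the very best layout counts; a larger tolerance accepts any layout close to it, which moves designs into smaller buckets.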

Figures 3.8 and 3.9 show the differences between the XST, MAP, PAR and TRACE reports. Figure 3.8 shows DCT, which has the typical characteristics of the designs. Figure 3.9 shows some peculiar behaviour for PAR and TRACE, most likely because PAR treats the clock signal as multiple clocks while TRACE treats it as one.

[Plots for DCT: PAR frequency (MHz) and TRACE frequency (MHz); PAR frequency (MHz) and XST frequency (MHz); PAR LUTs and XST LUTs; PAR LUTs and MAP LUTs; PAR slices and MAP slices.]

Figure 3.8: Differences between XST, MAP, PAR and TRACE reports for DCT synthesised to a Kintex device

[Plots for nova: PAR frequency (MHz) and TRACE frequency (MHz); PAR frequency (MHz) and XST frequency (MHz); PAR LUTs and XST LUTs; PAR LUTs and MAP LUTs; PAR slices and MAP slices.]

Figure 3.9: Differences between XST, MAP, PAR and TRACE reports for nova synthesised to a Kintex device

Tables 3.6 and 3.7 list the data from the survey for a Virtex device, and Figures 3.10, 3.11 and 3.12 show the same types of graphs as for the previous devices. As with Kintex, the Virtex device cannot be clocked above 741.8 MHz.

Design  µ      σ      µ       σ      µ       σ      µ
AES     650.8  42.24  1653.0  4.29   189.2   10.79  580
DCT     551.8  88.01  1518    64.76  353.3   52.14  1940
FIR     56.15  2.826  127.5   3.667  891.3a  130.8  101
FPU     4497   122.5  11744   4.300  149.2   8.534  6209
LM32    925.9  77.85  2030    20.56  179.0   14.84  1591
minSoC  7141   282.8  18833   13.34  439.1   21.81  5957
nova    13777  1453   31611   13.67  32.28   5.398  6994
DFT     1756   57.54  5237    29.40  428.6   54.94  6806

Table 3.6: Survey results for Virtex, data from PAR and TRACE

AES     554    1650   209.3
DCT     437    1361   432.9
FIR     51     124    1084a
FPU     4209   11731  174.0
LM32    781    1983   209.1
minSoC  5372   18801  475.3
nova    10665  31593  42.60
DFT     1640   5194   530.2

Table 3.7: Survey results for Virtex, data from PAR and TRACE

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Slices and Max frequency (MHz).]

Figure 3.10: Maximum Frequency to Number of Slices of the Designs when synthesising to a Virtex device

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Slices and LUTs.]

Figure 3.11: Number of LUTs to Number of Slices of the Designs when synthesising to a Virtex device

The seed distribution for Virtex is such that three designs need only 10 seeds to find one of the better layouts, three need 25, and the last two designs need all 100 seeds.

[Plots, one panel per design (FIR, AES, nova, DCT, FPU, LatticeMico32, minSoC, Spiral DFT); axes: Running time (sec) and Max frequency (MHz).]

Figure 3.12: Running time of PAR to Maximum Frequency of the Designs when synthesising to a Virtex device

3.1.2 Constraint and Buffer Sweeps

In this section we look at the data from the survey on constraint sweeps with registers placed after the inputs and before the outputs when synthesising with the Xilinx tools. This is done with a wrapper that has the structure Input → Registers → Design → Registers → Output. The sweeps were only run for seeds 1 to 25, as a complete search would have taken more time than was available.

Figures 3.13, 3.14 and 3.15 show constraint sweeps with different numbers of registers for three of the designs. Panel (a) shows the highest maximum frequency over all the seeds, and panel (b) shows the maximum frequency for seed 1.

[Plots of Actual Frequency (MHz) against Period Constraint (MHz) for 1, 2, 3 and 4 stages. (a) Seed with highest Fmax of AES. (b) Seed 1 of AES.]

Figure 3.13: Constraint sweeps for multiple sets of buffer stages on Artix

[Figure 3.14 plots: Actual Frequency (MHz) against Clock Period Constraint (MHz) for 1, 2, 3 and 4 stages. (a) Seed with highest Fmax of DFT. (b) Seed 1 of DFT.]

[Plots of Actual Frequency (MHz) against Period Constraint (MHz) for 1, 2, 3 and 4 stages. (a) Seed with highest Fmax of FIR. (b) Seed 1 of FIR.]

Figure 3.15: Constraint sweeps for multiple sets of buffer stages on Artix

Table 3.8 lists metrics for the layouts with the highest maximum frequency from constraint sweeps with different numbers of registers surrounding the designs. The target device was an Artix 200T with package FBG676 and speed grade -3.

Design  Registers  Slices  LUTs   Max Frequency (MHz)
AES     1          560     1562   183.8573
        2          562     1602   185.6320
        3          552     1647   183.1837
        4          595     1567   184.3317
DCT     1          549     1518   315.3579
FIR     1          52      80     133.6184
        2          55      100    133.8330
        3          62      136    134.7891
        4          69      168    133.9226
FPU     1          4332    12994  130.0221
LM32    1          598     1600   164.7446
nova    1          10507   31909  43.4311
DFT     1          1627    5385   461.0420
        2          1701    5342   462.9630
        3          1686    5380   471.9207
        4          1767    5353   460.1933

Table 3.8: Metrics for highest performing layouts

Figures 3.16 and 3.17 show constraint sweeps for AVC (the nova design) and LM32 on two different devices in the Artix family. For LM32 the packages were chosen to be as close to each other as possible, with FBG484 for the 200T and FGG484 for the 100T, while AVC used FBG676 for the 200T and FGG484 for the 100T. Table 3.9 lists metrics for the highest performing layouts of LM32 and nova.

[Plots of Actual Frequency (MHz) for the 200T device, the 200T with ClockPeriod constraint and the 100T device. (a) Seed with highest frequency of nova. (b) Seed 1 of nova.]

Figure 3.16: Constraint sweeps on Artix

[Figure 3.17 plots: Actual Frequency (MHz) for 1 Stage: First Seed, 1 Stage: Best Seed, 2 Stages: First Seed and 2 Stages: Best Seed. (a) Constraint sweeps on Artix 100T. (b) Constraint sweeps on Artix 200T.]

Design  Device        Registers  Slices  LUTs   Max Frequency (MHz)
LM32    100T-3FGG484  1          796     1899   165.2619
                      2          850     1947   166.5002
        200T-3FBG484  1          662     1911   161.5509
                      2          668     1937   165.6452
nova    100T-3FGG484  1          9979    31910  43.5028
        200T-3FBG676  1          10507   31909  43.4311

In Figures 3.18 and 3.19, we have the seed sweeps containing the highest maximum frequency for AES with two registers and DFT with three.

[Plots for AES: panels for Slices, LUTs, Freq (MHz) and Time (sec), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.18: A seed sweep of AES with two registers on I/Os and a period constraint of 186.0 MHz

[Plots for DFT: panels for Slices, LUTs, Freq (MHz) and Time (min), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.19: A seed sweep of DFT with three registers on I/Os and a period constraint of 476.2 MHz

3.1.3 Optimisation Goals

In this section we look at the optimisation goals of the Xilinx tools, Speed and Area, of which Speed is the default. The goal is given to XST.

In Table 3.10, we have the highest performing layouts found when synthesising for Area.

Design  Slices  LUTs   Max Frequency (MHz)
AES     454     1173   181.3565
nova    10347   30791  36.6421
DFT     1609    5122   273.0748

Table 3.10: Metrics for highest performing layouts found on an Artix 200T device when synthesising for Area
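To put the two optimisation goals side by side, the Area results above can be compared with the Speed results from Table 3.8. The sketch below computes the relative change for each metric. It assumes that the one-register rows of Table 3.8 are the matching Speed runs, which the text does not state explicitly, and the helper name `percent_change` is only illustrative.

```python
# (slices, LUTs, max frequency in MHz)
# Speed: one-register rows of Table 3.8 (assumed to be the comparable runs).
speed = {
    "AES": (560, 1562, 183.8573),
    "nova": (10507, 31909, 43.4311),
    "DFT": (1627, 5385, 461.0420),
}
# Area: rows of Table 3.10.
area = {
    "AES": (454, 1173, 181.3565),
    "nova": (10347, 30791, 36.6421),
    "DFT": (1609, 5122, 273.0748),
}


def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


changes = {
    design: tuple(percent_change(s, a) for s, a in zip(speed[design], area[design]))
    for design in speed
}
for design, (slices, luts, freq) in changes.items():
    print(f"{design}: slices {slices:+.1f} %, LUTs {luts:+.1f} %, Fmax {freq:+.1f} %")
```

Under that assumption, AES gives up very little frequency for a clear reduction in area, whereas DFT loses a large share of its frequency for only a small saving.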

In Figures 3.20 and 3.21, we have metrics and plots of the seed sweeps of nova that contained the highest performing layouts for both optimisation goals.

[Plots for AVC: panels for Slices, LUTs, Freq (MHz) and Time (min), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.20: A seed sweep with a period constraint of 43.5 MHz and Speed as optimisation goal

[Plots for AVC: panels for Slices, LUTs, Freq (MHz) and Time (min), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.21: A seed sweep with a period constraint of 38.5 MHz and Area as optimisation goal

3.1.4 Speed Grade

In this section we look at the effect of different speed grades. For Xilinx, the speed grade is specified as part of the name of the target device, which is given to XST.

Table 3.11 lists data from the highest performing layouts for different speed grades. For DFT with a speed grade of -2, two layouts had the same maximum frequency, one with 1620 slices and 5363 LUTs and another with 1598 slices and 5403 LUTs.

Design  Grade  Slices     LUTs       Max Frequency (MHz)
FIR     -1     52         80         97.8091
        -2     51         80         119.0334
        -3     52         80         133.6184
FPU     -1     3842       13021      97.1912
        -2     3873       13032      117.0549
        -3     4332       12994      130.0221
LM32    -1     702        1609       120.6273
        -2     654        1611       151.4922
        -3     598        1600       164.7446
DFT     -1     1634       5375       350.8772
        -2     1620/1598  5363/5403  417.7109
        -3     1627       5385       461.0420

Table 3.11: Metrics for highest performing layouts found on an Artix device when synthesising with different speed grades
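The effect of the speed grade is easier to see as a gain relative to the slowest grade. The following sketch computes that gain from the frequencies in Table 3.11; the function name `gain_over_slowest` is only illustrative.

```python
# Max frequency (MHz) per speed grade, copied from Table 3.11.
fmax = {
    "FIR": {-1: 97.8091, -2: 119.0334, -3: 133.6184},
    "FPU": {-1: 97.1912, -2: 117.0549, -3: 130.0221},
    "LM32": {-1: 120.6273, -2: 151.4922, -3: 164.7446},
    "DFT": {-1: 350.8772, -2: 417.7109, -3: 461.0420},
}


def gain_over_slowest(by_grade):
    """Return the percentage gain of each grade over speed grade -1."""
    base = by_grade[-1]
    return {grade: 100.0 * (freq - base) / base for grade, freq in by_grade.items()}


gains = {design: gain_over_slowest(by_grade) for design, by_grade in fmax.items()}
for design, by_grade in gains.items():
    print(f"{design}: -2 gives {by_grade[-2]:+.1f} %, -3 gives {by_grade[-3]:+.1f} %")
```

For these four designs, moving from grade -1 to grade -3 raises the best reported frequency by roughly a third.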

In Figures 3.22 and 3.23, we have the seed sweeps containing the highest maximum frequency for LM32 speed grades -1 and -2.

[Plots for LatticeMico32: panels for Slices, LUTs, Freq (MHz) and Time (sec), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.22: A seed sweep with a period constraint of 120.5 MHz and a speed grade of -1

[Plots for LatticeMico32: panels for Slices, LUTs, Freq (MHz) and Time (sec), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.23: A seed sweep with a period constraint of 151.5 MHz and a speed grade of -2

3.1.5 Clock Region and Area Range

In this section we look at how performance is affected by limiting the area of the FPGA that the tool can use to create the layout. This is done with both SLICERANGE and CLOCKREGION. We start with CLOCKREGION constraints, which in practice are a SLICERANGE of X84Y50 slices. SLICERANGE in turn defines the range of slices that the tool can use to place the design. This is more fine-grained, and we only have time to scratch the surface of this constraint.

Table 3.12 lists metrics for two designs when specifying a CLOCKREGION constraint. nova uses four regions (X0Y1:X1Y2) while LM32 uses one (X0Y0).

Design  Slices  LUTs   Max Frequency (MHz)
nova    9795    32042  43.4802
LM32    684     1975   169.5203

Table 3.12: Metrics for best layouts found when synthesising with a CLOCKREGION constraint onto an Artix device

[Plots of Actual Frequency (MHz) for the 200T device. (a) Seed with highest frequency of nova. (b) Seed 1 of nova.]

Figure 3.24: Constraint sweeps on Artix

Figure 3.25 shows seed sweeps with a clock period constraint of 133.33 MHz, on the left without an area range and on the right with a range of X0Y0:X47Y103. This is the sweep that contains the layout with the highest maximum frequency for this SLICERANGE constraint.

[Plots for FPU: panels for Slices, LUTs, Freq (MHz) and Time (min), once with No Area Range and once with Area Range.]

Figure 3.25: A seed sweep for a clock period of 133.33 MHz. Left: without RANGE constraint, Right: with RANGE constraint

Figures 3.26, 3.27 and 3.28 show the seed sweeps of minSoC without a range constraint, with a range constraint of X0Y0:X79Y79 and with a range constraint of X0Y0:X89Y99. MAP reports were saved for X0Y0:X89Y99 and it took roughly 2 hours and 50 minutes to run, while the whole sweep took almost 13 days. The X0Y0:X79Y79 sweep took almost 9 days.

[Plots for minSoC: panels for Slices, LUTs, Freq (MHz) and Time (min), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.26: A seed sweep for a clock period of 500.0 MHz without a slicerange constraint

[Plots for minSoC: panels for Slices, LUTs, Freq (MHz) and Time (sec), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.27: A seed sweep for a clock period of 317.5 MHz with a slicerange constraint of X0Y0:X79Y79

[Plots for minSoC: panels for Slices, LUTs, Freq (MHz) and Time (min), followed by Time against Freq, Slices against Freq, LUTs against Freq and Slices against LUTs.]

Figure 3.28: A seed sweep for a clock period of 317.5 MHz with a slicerange constraint of X0Y0:X89Y99

In Table 3.13, we can see the metrics of the layouts with highest performance for different SLICERANGE constraints. For X0Y0:X33Y1 the tools gave up as a successful routing of the design was not possible.

FIR

Range        Slices  LUTs  Frequency (MHz)
X0Y0:X1Y33   32      106   126.1670
X0Y0:X3Y33   37      116   128.1723
X0Y0:X5Y33   38      99    129.7353
X0Y0:X7Y33   39      111   130.3441
X0Y0:X9Y33   39      115   131.6829
X0Y0:X11Y33  43      112   131.8739
X0Y0:X13Y33  38      102   132.0132
X0Y0:X33Y1   –       –     –
X0Y0:X33Y3   42      108   130.4121
X0Y0:X33Y5   40      116   130.0390
X0Y0:X33Y7   41      116   134.0842
X0Y0:X33Y9   45      104   130.3951
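The size of each range can be estimated from its corner coordinates. The sketch below is a rough illustration that assumes the bounds are inclusive and that every coordinate inside the rectangle is a usable slice site, which may not hold on a real device; the function name `slice_sites` is only illustrative.

```python
import re

RANGE_RE = re.compile(r"X(\d+)Y(\d+):X(\d+)Y(\d+)")


def slice_sites(slice_range):
    """Return (columns, rows, sites) for an XaYb:XcYd rectangle.

    Assumes inclusive bounds and a fully populated rectangle, so the site
    count is an upper bound rather than an exact device figure.
    """
    match = RANGE_RE.fullmatch(slice_range)
    if match is None:
        raise ValueError(f"not a slice range: {slice_range!r}")
    x0, y0, x1, y1 = map(int, match.groups())
    columns = x1 - x0 + 1
    rows = y1 - y0 + 1
    return columns, rows, columns * rows


for name in ("X0Y0:X1Y33", "X0Y0:X33Y1", "X0Y0:X13Y33"):
    columns, rows, sites = slice_sites(name)
    print(f"{name}: {columns} x {rows} = {sites} sites")
```

Under these assumptions X0Y0:X1Y33 and X0Y0:X33Y1 cover the same number of sites, yet only the first could be routed, which suggests that the shape of the range matters and not only its size.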


3.1.6 Multiple Designs

In this section we synthesise multiple designs onto a single FPGA. We hope to get data that shows whether it is feasible to assume that a good result from the previous sections is achievable when the design is used as a subsystem in a real application.

Table 3.14 lists the results from each seed when synthesising a DCT, a minSoC, an FPU and a nova core all at once. All designs had one register on the I/Os, and the clock constraints were set close to the suspected maximum frequency of each design. To make the table easier to read, green fields were added to the frequencies that met the clock constraint.

Seed     Slices  LUTs   Max Frequency (MHz)
                        DCT    FPU    minSoC  nova
Clock Period Constraint: 312.5  76.92  470.8   42.55
Seed 1   19550   61693  173.1  71.40  376.6   32.28
Seed 2   17738   61925  268.3  77.36  453.1   31.44
Seed 3   19715   61648  125.1  72.30  380.8   35.39
Seed 4   19432   61724  234.6  57.59  361.4   34.07
Seed 5   19836   61700  176.5  63.44  466.4   37.06
Seed 6   19683   61676  187.3  67.34  468.4   36.80
Seed 7   17941   61815  279.9  77.47  461.7   39.52
Seed 8   18130   61734  268.3  77.04  469.5   39.57
Seed 9   17500   61771  258.1  77.50  469.0   41.23
Seed 10  19412   61687  266.0  65.21  468.6   31.87
Seed 11  17726   61872  283.8  77.39  444.0   38.98
Seed 12  17176   61792  279.8  76.73  279.7   28.58
Seed 13  18221   61984  275.5  77.54  462.3   36.45
Seed 14  17880   62025  266.3  77.40  476.0   36.17
Seed 15  19283   61693  186.8  39.04  255.6   20.79
Seed 16  18425   61680  253.0  77.34  469.0   41.27
Seed 17  19376   61725  252.7  59.92  440.7   35.23
Seed 18  19516   61689  156.7  58.92  448.8   30.68
Seed 19  18413   61755  278.5  77.66  470.1   40.98
Seed 20  19685   61693  177.2  40.55  382.4   26.66
Seed 21  19220   61721  86.84  45.16  197.2   21.45
Seed 22  18516   61889  280.4  77.97  429.9   39.66
Seed 23  19691   61711  200.8  53.71  383.1   32.55
Seed 24  17776   61769  288.9  77.39  472.4   41.89

Table 3.14: Performance of designs when synthesised freely on same FPGA


3.2 Altera

In this part of the chapter we present the results of the survey with the Altera tools. First we synthesise without clock constraints and registers on I/Os, then with both, and lastly we iterate over the various optimisation goals and speed grades. Due to the time constraint on this thesis, limiting the area the tools can use for the designs is left outside the scope of this report. Version 13.0sp1 of the tools was used in this part of the survey, and every part of it was done on ISY's computers.

The metrics we look at are mainly ALMs, ALUTs and frequency. The number of registers is listed in some parts to show that it fluctuates, while the number of memory bits and the number of DSP blocks are, as far as we can tell, constant between runs and are not included. Unless otherwise stated, ALM is the number from the report summary and ALUT is the sum of the number used as logic and the number used for routing.

3.2.1 No Constraint, No Register

In this section we use the Altera tools to synthesise the designs with different seeds. No modifications were made and no constraints were added. There were three target devices, one for each of the families Cyclone, Arria and Stratix, namely 5cgxfc7d6f27c6, 5agxfb3h6f35c6 and 5sgxea7h3f35c3.

Tables 3.15 and 3.16 list the metrics of the survey for the Cyclone. Neither the number of DSP blocks nor the number of memory bits is included, as both appear to be constant between seeds. For FIR the restricted Fmax is 310.08 MHz due to the minimum period restriction of the device. This means that even though the design could run faster in theory, the particular device would not handle a higher frequency.

Design  ALMs          ALUTs         Registers     Frequency (MHz)
        µ      σ      µ      σ      µ      σ      µ       σ
AES     937.1  3.811  1486   3.960  590.5  1.910  138.8   2.834
DCT     1126   1.821  1829   14.38  2108   8.236  199.4   4.633
FIR     49     0      95     0      0      0      567.3a  2.977
FPU     7259   10.31  12855  34.82  6303   9.503  97.95   2.033
LM32    1489   10.06  2105   15.23  1868   3.627  123.4   3.641
LEON3   22588  30.32  35488  69.04  17877  51.2   64.34   7.226
minSoC  4688   13.64  7588   34.85  5066   11.83  54.35   2.047
nova    18733  34.36  26800  61.48  7840   20.18  30.35   1.427
DFT     3247   0.790  8879   37.82  9502   2.704  263.3   5.129

Table 3.15: Survey results for Cyclone V, data from FIT and TimeQuest


Design  ALMs   ALUTs  Registers  Frequency (MHz)
        Min    Min    Min        Max
AES     927    1479   586        144.6
DCT     1123   1798   2090       209.0
FIR     49     95     0          569.5a
FPU     7235   12756  6280       102.4
LM32    1468   2070   1861       131.6
LEON3   22489  35333  17708      82.56
minSoC  4655   7517   5043       59.09
nova    18648  26659  7801       33.07
DFT     3245   8797   9497       272.7

Table 3.16: Minimum and Maximum results for Cyclone V, data from FIT and TimeQuest

Design  Delay (ns)  Frequency (MHz)
AES     16.36       61.12
FIR     9.609       104.1
FPU     19.37       51.63
nova    50.65       19.74

Table 3.17: Highest Input to Output delay, first seed only
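The frequency column above is consistent with taking the reciprocal of the delay, with the frequency expressed in MHz. The sketch below reproduces the conversion for the four rows; the function name `delay_to_mhz` is only illustrative.

```python
def delay_to_mhz(delay_ns):
    """Convert a delay in nanoseconds to the equivalent frequency in MHz."""
    return 1000.0 / delay_ns


# Highest input to output delay (ns), copied from Table 3.17.
delays_ns = {"AES": 16.36, "FIR": 9.609, "FPU": 19.37, "nova": 50.65}

frequencies = {design: delay_to_mhz(d) for design, d in delays_ns.items()}
for design, freq in frequencies.items():
    print(f"{design}: {delays_ns[design]} ns corresponds to {freq:.2f} MHz")
```

The same conversion, applied in the other direction, turns a frequency constraint in MHz into a clock period in nanoseconds.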

In Figure 3.29, we have graphs showing the relation between reported ALMs and number of ALUTs while Figure 3.30 shows the actual number of ALMs used in the layout without the addition and subtraction of estimated values.

[Plots, one panel per design (FIR, AES, DCT, FPU, MinSoC, nova, LatticeMico32, Spiral DFT, LEON3); axes: ALM and ALUTs.]

Figure 3.29: Number of ALMs to number of ALUTs of the Designs when synthesising to a Cyclone device

[Plots, one panel per design (FIR, AES, DCT, FPU, MinSoC, nova, LatticeMico32, Spiral DFT, LEON3); axes: ALM and ALUTs.]

Figure 3.30: Number of actual ALMs to number of ALUTs of the Designs when synthesising to a Cyclone device

In Figure 3.31 we have a graph showing maximum frequency and number of ALUTs of the designs.

[Plots, one panel per design (FIR, AES, DCT, FPU, MinSoC, nova, LatticeMico32, Spiral DFT, LEON3); axes: ALUTs and Max frequency (MHz).]

Figure 3.31: Maximum Frequency to number of ALUTs of the Designs when synthesising to a Cyclone device

In Figure 3.32 we have graphs showing the run-time of quartus_fit and maximum frequency.

[Plots, one panel per design (FIR, AES, DCT, FPU, MinSoC, nova, LatticeMico32, Spiral DFT, LEON3); axes: Running time (sec) and Max frequency (MHz).]

Figure 3.32: Running time of FIT to Maximum Frequency of the Designs when synthesising to a Cyclone device


Tables 3.18 and 3.19 show the metrics from synthesising to the Arria device. DFT and FIR have a restricted Fmax of 220.02 MHz due to the minimum period restriction of the device.
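The relation between a minimum period and a frequency cap can be sketched as follows. This is a minimal illustration, assuming the restriction acts as a simple upper bound on the path-limited Fmax; the function names `period_ns` and `restricted_fmax` are hypothetical and not part of any tool, and only the 220.02 MHz limit is taken from the text.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def restricted_fmax(fmax_mhz: float, limit_mhz: float) -> float:
    """Cap a path-limited Fmax at a device-imposed frequency limit (assumed model)."""
    return min(fmax_mhz, limit_mhz)


# Limit quoted in the text for the Arria device.
DEVICE_LIMIT_MHZ = 220.02

if __name__ == "__main__":
    print(f"minimum period: {period_ns(DEVICE_LIMIT_MHZ):.3f} ns")
    # Example inputs: one value below the limit and one above it.
    for fmax in (120.6, 415.1):
        print(f"{fmax} MHz -> {restricted_fmax(fmax, DEVICE_LIMIT_MHZ)} MHz")
```

Under this assumption a design whose critical path allows more than 220.02 MHz is reported at the limit, while slower designs are unaffected.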

Design    ALMs               ALUTs              Registers          Frequency (MHz)
          µ        σ         µ        σ         µ        σ         µ         σ
AES       931.5    2.376     1488     5.292     590.2    2.071     120.6     3.304
DCT       1127     0.8989    1769     18.28     2111     8.026     173.2     4.051
FIR       49       0         95       0         0        0         415.1a    5.9
FPU       7245     10.43     12800    26.53     6309     11.19     87.99     1.815
LM32      1499     11.48     2118     15.24     1870     3.772     115.3     2.714
LEON3     22588    30.1      35433    32.64     17866    55.27     54.6      2.785
minSoC    4708     15.85     7621     31.95     5060     13.8      46.66     2.108
nova      18766    25.65     26666    57.27     7847     18.16     28.84     1.45
DFT       3245     0.6178    8517     47.68     9503     2.984     223.2a    3.401

Table 3.18: Survey results for Arria, data from FIT and TimeQuest

Design    ALMs     ALUTs    Registers    Frequency (MHz)
          Min      Min      Min          Max
AES       923      1479     586          129.8
DCT       1124     1721     2095         181.3
FIR       49       95       0            419.3a
FPU       7218     12744    6287         91.24
LM32      1472     2078     1864         122.9
LEON3     22511    35344    17744        60.61
minSoC    4674     7538     5028         50.08
nova      18699    26566    7810         31.40
DFT       3244     8375     9496         228.8a

Table 3.19: Minimum and Maximum results for Arria V, data from FIT and TimeQuest
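To give a feel for how much a single best-case run can differ from the average, the sketch below computes by how many percent the maximum frequency in Table 3.19 exceeds the mean frequency in Table 3.18. The numbers are copied from the two tables; FIR and DFT are left out because their frequencies carry the footnote marker. The helper name `best_over_mean_percent` is only illustrative.

```python
# (mean Fmax, maximum Fmax) in MHz, copied from Tables 3.18 and 3.19.
# FIR and DFT are omitted because their values carry the footnote marker "a".
FMAX_MHZ = {
    "AES": (120.6, 129.8),
    "DCT": (173.2, 181.3),
    "FPU": (87.99, 91.24),
    "LM32": (115.3, 122.9),
    "LEON3": (54.6, 60.61),
    "minSoC": (46.66, 50.08),
    "nova": (28.84, 31.40),
}


def best_over_mean_percent(mean_mhz: float, best_mhz: float) -> float:
    """How far the best observed Fmax lies above the mean, in percent of the mean."""
    return (best_mhz - mean_mhz) / mean_mhz * 100.0


gain = {name: best_over_mean_percent(m, b) for name, (m, b) in FMAX_MHZ.items()}

# List the designs from the largest to the smallest gap.
for name, pct in sorted(gain.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {pct:5.1f} %")
```

For these designs the gap ranges from a few percent up to roughly a tenth of the mean, which is the kind of difference that matters when only a single run is reported.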

Figures 3.33 and 3.34 show the number of ALUTs, the run-time of FIT and the maximum frequency for Arria, analogous to the Cyclone figures. Figures showing relations with ALMs were not included.
