
Design Space Exploration of Time-Multiplexed FIR Filters on FPGAs

Master's thesis carried out in Electronics Systems at Linköping Institute of Technology

by

Syed Asad Alam

LiTH-ISY-EX--10/4343--SE
Linköping 2010

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Oscar Gustafsson, ISY, Linköping University

Examiner: Oscar Gustafsson, ISY, Linköping University


Abstract

FIR (Finite-length Impulse Response) filters are the cornerstone of many signal processing devices. A lot of research has gone into their development as well as their efficient implementation. With recent research focusing strongly on the reduction of power consumption, especially with regard to FPGAs, it was found necessary to explore the mapping of FIR filters onto FPGAs.

Time-multiplexed FIR filters are also a good candidate for examination with respect to power consumption and resource utilization, for example when implemented in Field Programmable Gate Arrays (FPGAs). This is motivated by the fact that the usable clock frequency often is higher compared to the required data rate. Current implementations by, e.g., the Xilinx FIR Compiler suffer from high power consumption when the time-multiplexing factor is low. Further, it needs to be investigated how exploiting coefficient symmetry, scaling the coefficients and increasing the time-multiplexing factor influence the performance.


Acknowledgements

Furthermore, I would like to thank PhD students Muhammad Abbas, Fahad Qureshi and Zakaullah Sheikh for providing me guidance and helping me out with writing this thesis report in LaTeX.

I would also like to acknowledge M.Sc. student Syed Ahmed Aamir and M.Sc. Muhammad Saad Rahman for being helpful guides and enabling me to settle into my life in Linköping.

Also, a big thanks to all the members of the Division of Electronics Systems at Linköping University for creating a nice working environment.

Finally, and most importantly, from the depths of my heart, my deepest gratitude to my mother, who has been such a rock for me during all the turbulent times I had initially here, supporting me all the way right up to the completion of this thesis.


Contents

2.2 FIR Filter Structures
2.2.1 Direct Form
2.2.2 Transposed Direct Form
2.2.3 Linear-Phase Structure
2.2.4 Time-Multiplexed FIR Filters
2.2.5 Scaling in FIR Filters

3 FPGA
3.1 Major Companies
3.2 History
3.3 The Role of FPGAs
3.3.1 Advantages of FPGAs
3.4 Modern Developments
3.5 Application
3.6 Types of FPGAs
3.7 FPGA vs Custom VLSI
3.8 Architecture of FPGAs
3.8.1 Configurable Logic Block
3.8.2 Interconnection Network
3.9 FPGA Configuration
3.10 Xilinx
3.10.1 Virtex 5

4 Architecture
4.1 Design Flow
4.1.1 VHDL - The Chosen HDL
4.1.2 Matlab
4.1.3 Functional Simulation and Verification
4.1.4 Synthesis and Implementation
4.1.5 Post PAR Simulation and Verification
4.1.6 Power Estimation


4.2 Architecture - 1 : Fully Parallel Symmetric FIR Filter
4.2.1 Basic Idea
4.2.2 Matlab Flow
4.2.3 FIR Top
4.2.4 Main Controller
4.2.5 Data Memory
4.2.6 Coefficient Memory
4.2.7 Main Filter
4.2.8 Pre-Adder
4.2.9 Multiply-Accumulate (MAC)
4.2.10 Adder Tree
4.2.11 Functional Simulation
4.2.12 Synthesis and Implementation
4.2.13 Post PAR Simulation
4.2.14 Power Estimation
4.2.15 Xilinx FIR Compiler
4.3 Architecture - 2 : Semi Parallel Non-Symmetric FIR Filter
4.3.1 Basic Idea
4.3.2 Matlab Flow
4.3.3 FIR Top
4.3.4 Main Controller
4.3.5 Data Memory
4.3.6 Coefficient Memory
4.3.7 Main Filter
4.3.8 Multiply-Add
4.3.9 Accumulator (ACC)
4.3.10 Functional Simulation
4.3.11 Synthesis and Implementation
4.3.12 Post PAR Simulation
4.3.13 Power Estimation
4.3.14 Xilinx FIR Compiler

5 Results
5.1 Xilinx Resource Facts
5.2 Altera Resource Facts
5.3 Comparison-1, Non-Scaled Filter, Parallel vs Semi-Parallel: Slice Count
5.4 Comparison-2, Non-Scaled, Parallel vs Semi-Parallel: BRAM Count
5.5 Comparison-3, Non-Scaled, Parallel vs Semi-Parallel: DSP48E Slice Count
5.6 Comparison-4, Non-Scaled, Parallel vs Semi-Parallel: Clock Frequency
5.7 Comparison-5, Non-Scaled, Parallel vs Semi-Parallel: Power Dissipation
5.8 Comparison-6, Scaled, Parallel vs Semi-Parallel


5.12.4 Semi-Parallel Architecture (Non-Symmetric), 24-bit Data Path

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
6.2.1 Single Port RAMs
6.2.2 Bit-Serial and Digit-Serial Arithmetic
6.2.3 Other Number Representations


List of Figures

2.1 Impulse response of a causal FIR filter of order N [13]
2.2 Pole-zero configuration - lowpass, linear-phase FIR filter [13]
2.3 Linear Phase Filter Types - Impulse Response [13]
2.4 Direct Form FIR Filter Structure
2.5 Transposed Form FIR Filter Structure
2.6 Direct Form Linear-Phase Structure
3.1 Flash Programmed Switch [12]
3.2 FPGA Structure [11]
3.3 Look-Up Table [12]
3.4 Look-Up Table with storage element [2]
3.5 Advanced Block Diagram of a CLB [12]
3.6 Xilinx Spartan-II CLB [12]
3.7 Altera APEX-II Logic Element [12]
3.8 SRAM Connection Box [12]
3.9 Xilinx Spartan-II General Routing Matrix [12]
3.10 Xilinx Spartan-II On-Chip Three State Bus [12]
3.11 Altera APEX-II Interconnect Structure [12]
3.12 Virtex-5 Configurable Logic Block [17]
3.13 Virtex-5 Slice Type: SLICEM [17]
3.14 Virtex-5 Slice Type: SLICEL [17]
3.15 Basic Architecture of 6-input LUT [17]
3.16 Dual 5-input LUT using 6-input LUT [26]
3.17 Register/Latch configuration in a Slice [17]
3.18 True Dual Port [17]
3.19 Simple Dual Port Block Diagram [17]
3.20 DSP48E Slice [17]
3.21 DSP48E Tile [17]
4.1 Thesis Main Design Flow
4.2 Fully Parallel Symmetric FIR Filter Memory Arrangement
4.3 Fully Parallel Symmetric FIR Filter Block Diagram
4.4 Fully Parallel Symmetric FIR Filter Module Connections
4.5 Fully Parallel Symmetric FIR Filter Memory Arrangement 1 (6 memories)
4.6 Fully Parallel Symmetric FIR Filter Memory Arrangement 2 (6 memories)
4.7 Fully Parallel Symmetric FIR Filter Structure
4.8 Coefficient File for Xilinx Core Generator
4.9 Semi-Parallel, Pipelined Non-Symmetric FIR Filter Block Diagram
4.10 Semi-Parallel, Pipelined Non-Symmetric FIR Filter Module Interconnection
4.11 Semi-Parallel, Pipelined Non-Symmetric Architecture Controller


5.5 Effect of Scaling on Parallel Architecture
5.6 Effect of Scaling on Semi-Parallel, Pipelined Architecture
5.7 Effect of Increased Word Length on Parallel Architecture
5.8 Effect of Increased Word Length on Parallel Architecture
5.9 Effect of Increased Word Length on Semi-Parallel, Pipelined Architecture
5.10 Effect of Increased Word Length on Semi-Parallel, Pipelined Architecture
5.11 Effect of Different Memory Folding Factor on Parallel Architecture
5.12 Effect of Different Memory Folding Factor on Semi-Parallel, Pipelined Architecture
5.13 Parallel Architecture vs Symmetric FIR Core - 18 bit data path
5.14 Parallel Architecture vs Symmetric FIR Core - 18 bit data path
5.15 Parallel Architecture vs Symmetric FIR Core - 24 bit data path
5.16 Parallel Architecture vs Symmetric FIR Core - 24 bit data path
5.17 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 18 bit data path
5.18 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 18 bit data path
5.19 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 24 bit data path
5.20 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 24 bit data path


List of Tables

3.1 Virtex-5 LX20T and LX50T Features
3.2 Virtex-5 LX20T and LX50T Features
3.3 Logic Resources in One CLB
3.4 Logic Resources in Selected FPGAs
3.5 Types of Distributed RAMs
3.6 Types of Distributed ROMs
3.7 True Dual Port Definitions
3.8 Simple Dual Port Definitions
5.1 Slice Comparison
5.2 DSP48E Comparison
5.3 Clock Frequency Comparison, Frequency in MHz
5.4 Power Dissipation Comparison, Power in mW
5.5 Effect of Scaling - Parallel Architecture
5.6 Effect of Scaling - Pipelined Architecture
5.7 Effect of Increased Word Length - Parallel Architecture
5.8 Effect of Increased Word Length - Parallel Architecture
5.9 Effect of Increased Word Length - Pipelined Architecture
5.10 Effect of Increased Word Length - Pipelined Architecture
5.11 Effect of Different Memory Folding Factor - Parallel Architecture
5.12 Effect of Different Memory Folding Factor - Pipelined Architecture
5.13 Parallel Architecture vs Symmetric FIR Core - 18 bit data path
5.14 Parallel Architecture vs Symmetric FIR Core - 18 bit data path
5.15 Parallel Architecture vs Symmetric FIR Core - 24 bit data path
5.16 Parallel Architecture vs Symmetric FIR Core - 24 bit data path
5.17 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 18 bit data path
5.18 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 18 bit data path
5.19 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 24 bit data path
5.20 Semi-Parallel Architecture vs Non-Symmetric FIR Core - 24 bit data path

Abbreviations

RAM Random Access Memory

VHDL VHSIC Hardware Description Language

VHSIC Very High Speed Integrated Circuit

ISE Integrated System Environment

IIR Infinite Impulse Response

PROM Programmable Read Only Memory

PLD Programmable Logic Device

CPU Central Processing Unit

CPLD Complex Programmable Logic Device

SERDES Serializer/De-Serializer

LUT Look-Up Table

HDL Hardware Description Language

CLB Configurable Logic Block

IC Integrated Circuit

LE Logic Element

SRAM Static Random Access Memory

CMOS Complementary Metal Oxide Semiconductor

LAB Logic Array Block

GRM General Routing Matrix

PC Personal Computer

PCB Printed Circuit Board


PLL Phase-Locked Loop

GMACS Giga Multiply-ACcumulate operations per Second

GFLOPS Giga Floating Point Operations Per Second

DMIPS Dhrystone Million Instructions Per Second

PCI Peripheral Component Interconnect

CMT Clock Management Tile

ROM Read Only Memory

FIFO First In First Out

ASMBL Application Specific Modular Block

SIMD Single Instruction Multiple Data

RTL Register Transfer Level

PAR Place And Route

MAP Mapping

DUT Design Under Test

XST Xilinx Synthesis Technology

NGD Native Generic Database

DRC Design Rule Check

NCD Native Circuit Description

PCF Physical Constraint File

VCD Value Change Dump

MAD Multiply-Add

SC Slice Count

BC BRAM Count

BRAM Block Random Access Memory

DSPC DSP48E Slice Count

CF Clock Frequency

PD Power Dissipation


1 Introduction

FIR (Finite-length Impulse Response) filters are key components in many digital signal processing systems. They provide several advantages over IIR filters, such as linear phase, stability and the absence of feedback.

Traditionally, filter design has focused on Application Specific Integrated Circuits (ASICs) using standard cells. Lately, however, with the fast development of Field Programmable Gate Arrays (FPGAs) with dedicated built-in components for signal processing, FPGAs are fast becoming the hardware of choice.

The operation of FIR filters is based on convolution, which relies heavily on Multiply-Accumulate (MAC) operations. Nearly all FPGAs have dedicated built-in Digital Signal Processing (DSP) blocks which support fast MAC as well as multiply operations. This makes the FPGA an obvious choice for implementing multiply- and MAC-intensive FIR filters.

The title of this thesis work is Design Space Exploration of Time-Multiplexed FIR Filters on FPGAs. There are four components to this title. The first one, design space exploration, deals with exploring different ways of implementing FIR filters on FPGAs under the constraint that several clock cycles are available to compute one output sample, but not enough cycles to use just one multiplier. This constraint is the second component, time-multiplexing. FIR and FPGA are the third and fourth components. To this end, two architectures have been implemented for time-multiplexed FIR filters. Both architectures use block RAMs and DSP48E blocks available in Xilinx FPGAs. Special attention has been given to reducing the transition activity of the Block Random Access Memories (RAMs) to reduce the total power consumption. Also, simple scaling has been employed and its effects on FPGA resource usage, clock frequency and power consumption have been studied. In addition, the effects of the data-path word length and of different time-multiplexing factors have been studied. Finally, a comparison has been made between the performance of the implemented designs and the FIR core provided by Xilinx.

This document is organized into the following chapters:

• Chapter 1 : Introduction
• Chapter 2 : FIR - A brief overview of FIR filters, their properties and different structures
• Chapter 3 : FPGA - Introduction to FPGAs, some examples and details about the targeted Xilinx FPGAs
• Chapter 4 : Architecture - Details about the implemented architectures
• Chapter 5 : Results - Different comparisons and their plots
• Chapter 6 : Conclusions and Future Work - Different conclusions drawn on the basis of the results and the future direction of this research

The whole design flow is based on Matlab. The target has been the generation of VHSIC Hardware Description Language (VHDL) code for different combinations of filter orders and time-multiplexing factors. The time-multiplexing factor is often referred to as the memory-folding factor. No fixed VHDL code is written; the user controls the filter order and the time-multiplexing factor, among some other options.

For synthesis, implementation and power estimation, the Integrated System Environment (ISE) tool provided by Xilinx has been used. Mentor Graphics' ModelSim has been used for functional as well as gate-level simulation.


2 FIR Filters

FIR filters are digital filters whose impulse response to a Kronecker delta input is finite, i.e., it settles to zero in a finite number of sample intervals. This is the reason they are called Finite Impulse Response filters, in contrast to Infinite Impulse Response (IIR) filters. The impulse response of an Nth-order FIR filter lasts for N + 1 samples and then dies to zero.

The difference equation that defines the output of an FIR filter in terms of its input is

    y(n) = \sum_{k=0}^{N} h(k) x(n - k)    (2.1)

where y is the output, x is the input and h(k) are the filter coefficients.
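To make the notation concrete, Equation 2.1 can be evaluated directly in a few lines of Python. This is only an illustration added here; the thesis design flow itself uses Matlab and VHDL, and the function name fir_output as well as the example coefficients are assumptions made for this sketch.

```python
def fir_output(h, x, n):
    """Evaluate y(n) = sum_{k=0}^{N} h(k) x(n-k) per Eq. (2.1).

    h : list of N+1 filter coefficients h(0)..h(N)
    x : input samples x(0), x(1), ...; samples before time 0 are taken as 0
    n : time index at which the output is computed
    """
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

# Example: 3rd-order (4-tap) averaging filter applied to a short input
h = [0.25, 0.25, 0.25, 0.25]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [fir_output(h, x, n) for n in range(len(x))]
print(y)  # [0.25, 0.75, 1.5, 2.5, 3.5]
```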

A typical impulse response of a causal FIR filter of order N (length N + 1) [13] is shown in Fig. 2.1.

The transfer function in the z-domain is given by Equation 2.2 [13]:

    H(z) = \sum_{n=0}^{N} h(n) z^{-n}    (2.2)

In the frequency domain it is represented as [13]

    H(e^{j\omega T}) = \sum_{n=0}^{N} h(n) e^{-j\omega T n}    (2.3)

Non-recursive FIR filters have their poles located at the origin of the z-plane. The zeros can be placed anywhere in the z-plane, but are usually located on the unit circle or as pairs that are mirrored in the unit circle. Fig. 2.2 shows a typical pole-zero configuration for a lowpass FIR filter with linear phase [13].

Figure 2.1. Impulse response of a causal FIR filter of order N [13]

A linear-phase FIR filter is obtained by letting the impulse response exhibit symmetry around n = N/2, i.e.

    h(n) = h(N - n),  n = 0, 1, ..., N,    (2.4)

or antisymmetry around n = N/2, i.e.,

    h(n) = -h(N - n),  n = 0, 1, ..., N.    (2.5)

Based on whether the impulse response is symmetric or antisymmetric and whether N is even or odd, there are four different types of linear-phase FIR filters. These types are denoted as Type I, II, III and IV linear-phase FIR filters, respectively, according to the following:

    Type I:   h(n) = h(N - n),   N even
    Type II:  h(n) = h(N - n),   N odd
    Type III: h(n) = -h(N - n),  N even
    Type IV:  h(n) = -h(N - n),  N odd    (2.6)

The typical impulse responses of these different types are shown in Fig. 2.3. The centre value h(N/2) is always equal to zero for Type III filters. Furthermore, when N is even, the point of symmetry is an integer corresponding to one of the sample values; when N is odd, the point of symmetry lies between two sample values [13].
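The classification in Equation 2.6 can be summarized in a small sketch (illustrative Python only; the function name linear_phase_type and the tolerance argument are assumptions, not part of the thesis code):

```python
def linear_phase_type(h, tol=1e-9):
    """Classify a linear-phase FIR filter per Eq. (2.6).

    h is the impulse response h(0)..h(N), so the order is N = len(h) - 1.
    Returns 'I', 'II', 'III', 'IV', or None if h is neither symmetric
    nor antisymmetric.
    """
    N = len(h) - 1
    symmetric = all(abs(h[n] - h[N - n]) < tol for n in range(N + 1))
    antisymmetric = all(abs(h[n] + h[N - n]) < tol for n in range(N + 1))
    if symmetric:
        return 'I' if N % 2 == 0 else 'II'
    if antisymmetric:
        return 'III' if N % 2 == 0 else 'IV'
    return None

print(linear_phase_type([1, 2, 3, 2, 1]))  # 'I'  (N = 4, symmetric)
print(linear_phase_type([1, 2, -2, -1]))   # 'IV' (N = 3, antisymmetric)
```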

2.2 FIR Filter Structures

FIR filters can be realized using both recursive and non-recursive algorithms. Recursive algorithms, however, suffer from a number of drawbacks and are not used in practice. Non-recursive filters are always stable and cannot sustain any type of parasitic oscillations, except when the filters are part of a recursive loop [13].

2.2.1 Direct Form

There are a large number of structures that are of interest for the realization of FIR filters, particularly for multi-rate FIR filters, i.e., filters with several sample frequencies. One of the simplest and most useful structures is the direct form, or transversal, structure, which is depicted in Fig. 2.4 for an Nth-order filter.

The direct form FIR filter of Nth order is described by Equation 2.1. The number of multiplications and additions are N + 1 and N, respectively. The signal levels in this structure are inherently scaled, except for the output, which for short FIR filters normally is scaled using the safe scaling criterion. The roundoff noise at the output of the filter is NQ^2/12, independently of the filter coefficients [13].

Figure 2.3. Linear Phase Filter Types - Impulse Response [13]

Figure 2.5. Transposed Form FIR Filter Structure

2.2.2 Transposed Direct Form

The transposed direct form FIR structure, shown in Fig. 2.5, is derived from the signal-flow graph shown in Fig. 2.4. The signals in this filter are properly scaled if the input and output signals are properly scaled. Also in this structure, the roundoff noise NQ^2/12 is independent of the filter coefficients [13].

The filter shown in Fig. 2.5 is a graphic illustration of the following difference equations:

    y(n) := h(0)x(n) + v_1(n - 1)
    v_1(n) := h(1)x(n) + v_2(n - 1)
    ...
    v_{N-1}(n) := h(N - 1)x(n) + v_N(n - 1)
    v_N(n) := h(N)x(n)    (2.7)
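A minimal sketch of the update in Equation 2.7, with the delay elements held in a state list v (illustrative Python; the names are not taken from the thesis implementation, which is written in VHDL):

```python
def transposed_fir(h, x):
    """Filter the sequence x with the transposed direct form of Eq. (2.7).

    The state v[i] models the content of delay element i+1 (v_1 .. v_N).
    """
    N = len(h) - 1
    v = [0.0] * N            # v_1(n-1) .. v_N(n-1), initially zero
    y = []
    for xn in x:
        yn = h[0] * xn + (v[0] if N > 0 else 0.0)
        # delay-line update: v_i(n) = h(i) x(n) + v_{i+1}(n-1)
        for i in range(N - 1):
            v[i] = h[i + 1] * xn + v[i + 1]
        if N > 0:
            v[N - 1] = h[N] * xn
        y.append(yn)
    return y

print(transposed_fir([0.25, 0.25, 0.25, 0.25], [1.0, 2.0, 3.0, 4.0, 5.0]))
# [0.25, 0.75, 1.5, 2.5, 3.5] -- same output as the direct form
```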

2.2.3 Linear-Phase Structure

One of the most important reasons why FIR filters are used is that they can realize an exact linear-phase response. Linear phase implies that the impulse response is either symmetric or antisymmetric. The number of multiplications can be reduced by exploiting the symmetry (or antisymmetry) in the impulse response, as illustrated in Fig. 2.6. This structure is called the direct form linear-phase structure. The number of multiplications for the direct form linear-phase structure is (N + 1)/2 for odd values of N, and N/2 + 1 for even values of N. The number of multiplications is thus significantly smaller than for the direct form structure, whereas the number of additions remains the same. The signal level at the inputs of the multipliers is twice as large as the input signal. Hence, the input should be divided by two and the filter coefficients should be scaled so that a proper output signal level is obtained. The roundoff noise in this structure is only (N + 1)Q^2/24 and (N + 2)Q^2/24 for odd and even values of N, respectively. This is independent of the filter coefficients.

Figure 2.6. Direct Form Linear-Phase Structure

2.2.4 Time-Multiplexed FIR Filters

In this thesis, the focus is on the implementation of time-multiplexed FIR filters on FPGAs. By time-multiplexing, one means that the input data rate is lower than the clock rate. The design of such filters is motivated by the fact that the usable clock frequency is generally higher than the required data rate. With time-multiplexing, it is possible to decrease the number of multipliers at the cost of additional memory. For filter orders of around 150-200, the cost of memory is much less than the cost of the multipliers or MAC (multiply-accumulate) units.

Furthermore, the cost of multipliers can be brought down further by utilizing coefficient symmetry. This reduction, though, comes at the cost of additional adders.

For a given filter order N, the direct form FIR structure needs N + 1 multipliers and N adders. The numbers remain the same for transposed direct form filters. For linear-phase structures, which utilize symmetry, the number of multipliers is reduced to roughly half, i.e., ⌈N/2⌉.

However, when the direct form filter is mapped to a time-multiplexed architecture, the number of multipliers is reduced further. Thus, for a given filter of order N and a time-multiplexing factor M, the number of multipliers is reduced to ⌈N/(2M)⌉. These multipliers now need to be replaced by multiply-accumulate units in order to accumulate the partial products. Putting it in the form of an equation, we have

    Number of MACs = N/M,         when N is even
    Number of MACs = (N + 1)/M,   when N is odd    (2.9)

where N and M have the same meanings as above.

Thus, when the input data frequency is much lower, computing resources can be re-used to reduce the cost and power consumption.
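The bookkeeping above can be sketched as follows (illustrative Python; the helper name is an assumption). The symmetric case uses the ⌈N/(2M)⌉ expression from the text and the non-symmetric case uses Equation 2.9, with a ceiling applied so that non-divisible cases still give an integer unit count; that rounding is an assumption of this sketch.

```python
import math

def macs_time_multiplexed(N, M, symmetric=False):
    """Estimate the number of MAC units for an order-N FIR filter
    with time-multiplexing (memory-folding) factor M.

    symmetric=True  -> linear-phase mapping, ceil(N / (2M)) units
    symmetric=False -> non-symmetric mapping per Eq. (2.9)
    """
    if symmetric:
        return math.ceil(N / (2 * M))
    return math.ceil(N / M) if N % 2 == 0 else math.ceil((N + 1) / M)

# Order-127 filter with 16 clock cycles available per output sample
print(macs_time_multiplexed(127, 16))                  # 8 MACs, non-symmetric
print(macs_time_multiplexed(127, 16, symmetric=True))  # 4 MACs, symmetric
```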

2.2.5 Scaling in FIR Filters

In fixed-point arithmetic, parasitic oscillations can be produced by an overflow of the signal range. These oscillations can even persist after the cause of the overflow vanishes. Thus, it is necessary to ensure that permanent oscillations are not sustained within a filter. It is also of utmost importance that signal overflow does not occur frequently, as it causes large distortion.

The probability of overflow can, however, be reduced by decreasing the signal levels inside the filter. This can be achieved by inserting scaling multipliers inside the filter. Care must be taken to ensure that the scaling multiplier neither changes the transfer function of the filter nor decreases the signal level so much that the Signal-to-Noise Ratio (SNR) becomes poor.

There are mainly two types of scaling strategies:

• Safe scaling
• Lp-norms

Safe scaling is generally used for short FIR filters. Lp-norms are much better because they utilize the signal range efficiently. However, since time-multiplexing reduces a long filter to a number of short filters, safe scaling is sufficient for such filters.

Safe Scaling

The policy of safe scaling is that if the input to the filter does not overflow, the input to the multiplier will never overflow. In such a scaling, one takes the sum of the absolute values of the filter coefficients at the critical node and divides the coefficients by this value. The critical node is the node where scaling is needed, which is generally the input to the multiplier.
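As a sketch of the safe-scaling rule described above (illustrative Python, not the thesis Matlab flow), the coefficients are divided by the sum of their absolute values, so that an input bounded by one can never overflow the critical node:

```python
def safe_scale(h):
    """Scale coefficients so that sum(|h|) == 1 (safe scaling).

    If |x(n)| <= 1 for all n, then |sum h(k) x(n-k)| <= sum |h(k)| = 1,
    so the critical node can never overflow.
    """
    s = sum(abs(c) for c in h)
    return [c / s for c in h]

h = [0.5, 1.5, 2.0, 1.5, 0.5]
h_scaled = safe_scale(h)
print(h_scaled)                        # [0.0833..., 0.25, 0.333..., 0.25, 0.0833...]
print(sum(abs(c) for c in h_scaled))   # 1.0
```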


3 FPGA

An FPGA is an integrated circuit designed to be configured by the customer or designer after being manufactured; thus it is programmable. Its configuration is generally specified using a Hardware Description Language (HDL), which is also used for designing ASICs.

In the not-so-distant past, the question used to be: can you do that task in an FPGA? With the advent of modern FPGA devices, however, the question has become: why would you not use an FPGA? The complexity of modern FPGAs rivals that of ASICs. The chips contain hundreds of thousands of flip-flops, multiple megabits of RAM, thousands of DSP slices, multiple soft or hard processor cores, multiple Serializer/De-Serializer (SERDES) channels, and more [1].

FPGAs can be used to implement any logic function. Furthermore, this functionality can be updated even after the device has been sold to the customer. The ability to reprogram without going through the whole fabrication cycle that an ASIC must go through provides an enormous advantage, especially for low-volume applications.

The building block of an FPGA is called a 'logic block'. Logic blocks are essentially Look-Up Tables (LUTs) which can be configured to implement complex combinational functions or simple logic gates. LUTs can be combined to form a larger block which might also contain a multiplexer, a flip-flop and even a carry chain.

3.1 Major Companies

There are two major companies manufacturing FPGAs:

1. Xilinx
2. Altera

Each company has produced a string of ultra-modern, low-cost, highly specialized FPGAs with a host of facilities for signal processing, advanced communication, etc.

3.2 History

The FPGA industry emerged from Programmable Read Only Memories (PROMs) and Programmable Logic Devices (PLDs) [2]. Both can be programmed in the factory or in the field (field programmable); however, the programmable logic was hard-wired between logic gates.

The first company to formally begin manufacturing FPGAs was Xilinx, in 1984. Among the initial FPGAs were the XC2064, XC4000 and XC6200, which contained fewer than 100 Configurable Logic Blocks (CLBs) with 3-input LUTs. These devices laid the foundation of a new technology and market.

Xilinx continued unchallenged and grew quickly from 1985 to the mid-1990s, when competitors emerged, reducing its market share significantly. By 1993, 18 percent of the market was served by Actel [5]. The 1990s were an explosive period for FPGAs, both in sophistication and in volume of production. In the early 1990s, FPGAs were primarily used in telecommunications and networking. By the end of the decade, FPGAs had found their way into consumer, automotive and industrial applications [10].

In the first decade of the new century, extremely complex FPGAs were marketed, especially by Xilinx and its main competitor Altera. Platforms like Spartan-3 and Virtex-6 by Xilinx, and Cyclone-III and Stratix-IV by Altera, enabled designers to implement extremely complex applications in FPGAs.

3.3 The Role of FPGAs

Field Programmable Gate Arrays (FPGAs) fill a need in the design space of digital systems, complementary to the role played by microprocessors. Microprocessors can be used in a variety of environments, but because they rely on software to implement functions, they are generally slower and more power hungry than custom chips. Similarly, FPGAs are not custom parts, so they are not as good at any particular function as a dedicated chip designed for that application. FPGAs are generally slower and burn more power than custom logic. FPGAs are also relatively expensive, although it is often tempting to think that a custom-designed chip would be cheaper [12].

3.3.1 Advantages of FPGAs

FPGAs have their disadvantages, as mentioned above; however, they have compensating advantages, largely due to the fact that they are standard parts [12].

1. There is no need to wait for the chip to be fabricated to obtain a working chip. The design can be programmed into the FPGA and tested immediately.
2. FPGAs are excellent prototyping tools. Using the FPGA in the final design makes the jump from prototype to final product much smaller and more easily negotiable.

3.4 Modern Developments

A recent trend has been to take the coarse-grained architectural approach a step further by combining the logic blocks and interconnects of traditional FPGAs with embedded microprocessors and related peripherals to form a complete "system on a programmable chip" [2]. This work mirrors the architecture by Ron Perlof and Hana Potash of Burroughs Advanced Systems Group, which combined a reconfigurable Central Processing Unit (CPU) architecture on a single chip called the SB24; that work was done in 1982. Examples of such hybrid technologies can be found in the Xilinx Virtex-II PRO and Virtex-4 devices, which include one or more PowerPC processors embedded within the FPGA's logic fabric. The Atmel FPSLIC is another such device, which uses an AVR processor in combination with Atmel's programmable logic architecture.

An alternate approach to using hard-macro processors is to make use of “soft” processor cores that are implemented within the FPGA logic.

As previously mentioned, many modern FPGAs have the ability to be reprogrammed at "run time", and this is leading to the idea of reconfigurable computing or reconfigurable systems - CPUs that reconfigure themselves to suit the task at hand. The Mitrion Virtual Processor from Mitrionics is an example of a reconfigurable soft processor implemented on FPGAs. However, it does not support dynamic reconfiguration at runtime, but instead adapts itself to a specific program. Additionally, new, non-FPGA architectures are beginning to emerge. Software-configurable microprocessors such as the Stretch S5000 adopt a hybrid approach by providing an array of processor cores and FPGA-like programmable cores on the same chip.

3.5 Application

FPGAs have a wide range of applications [2], some of which are listed below:

1. Signal Processing
2. Software-defined radio
3. Aerospace and Defence
4. ASIC Prototyping
5. Medical Imaging
6. Computer Vision
7. Telecommunication
8. Speech Recognition
9. Computer Hardware Emulation

Originally, FPGAs were a competitor to Complex Programmable Logic Devices (CPLDs) and mostly were used as glue logic. But as their size, capabilities and speed increased they started being used in a wide range of increasingly complex and diverse applications. They are now being marketed as complete System-on-Chip solutions.

FPGAs became even more widely used in signal processing applications with the introduction of dedicated multipliers and, later, multiply-accumulate blocks, which enabled designers to implement multiplier- and Multiply-Accumulate (MAC)-intensive applications in FPGAs that were fast in speed and low on area.

The built-in parallelism of the resources in an FPGA allows massively parallel applications to be implemented easily. It allows for a high throughput even at low MHz clock rates. This has given birth to a new type of processing called reconfigurable processing, where FPGAs perform time-intensive tasks instead of software.

Since the unit cost of an FPGA is generally higher than that of an Application Specific Integrated Circuit (ASIC), FPGAs normally find applications in low-volume products where the company does not need to incur the high Non-Recurring Engineering (NRE) cost of an ASIC. However, recent advancements in FPGA technology have enabled low-cost FPGAs, which in turn has made FPGAs increasingly viable in high-volume products as well.

3.6 Types of FPGAs

There are three main types of FPGAs [11]:

• Static Random Access Memory (SRAM)
• Anti-fuse
• Flash

Xilinx and Altera both sold early SRAM-based FPGAs. An alternative architecture was introduced by Actel, which used an anti-fuse architecture. This architecture was not re-programmable in the field, which arguably was an advantage in situations that did not require re-configuration. The Actel FPGAs used a mux-oriented logic structure organized around wiring channels.

SRAM stands for Static Random Access Memory. SRAM-based FPGAs are based on static memory technology. Such an FPGA is volatile, i.e., at power-up it has to be programmed using external boot devices. They hold their configuration in static memory, the output of which is directly connected to another circuit, and its state controls the circuit being configured [12]. This type has many advantages [12]:

• Standard VLSI processes can be used to fabricate such FPGAs.
• They do not need to be refreshed when in operation.

However, they also have their disadvantages [12]:

• When power is switched off, they have to be reprogrammed. This requires external boot devices.
• They consume more power. Boot-up power is one of the major components of the overall power consumption.
• The bits in the SRAM are susceptible to theft.

The two technologies that only need to be programmed once are anti-fuse and flash-based FPGAs. Anti-fuse FPGAs are based on open circuits, which take on a low resistance when programmed, and they consume much less power. The anti-fuses are programmed by putting a voltage across them [12]. Each anti-fuse must be programmed separately, and the FPGA must include circuitry that enables each anti-fuse to be addressed separately.

Flash memory is a read-only memory which is highly programmable [12]. It uses a floating-gate structure in which a voltage is held by a low-leakage capacitor. This controls a gate, thus enabling this memory cell to program transistors. A flash programmed switch can be seen in Fig. 3.1 [12].

Figure 3.1. Flash Programmed Switch [12]

3.7 FPGA vs Custom VLSI

The main alternative to the FPGA is the custom-designed application-specific Integrated Circuit (IC), commonly known as an ASIC. ASICs are used to implement a particular function. For an ASIC to be useful, it needs to be designed right up to the mask level and fabricated. This process can take months and a huge amount of money.

ASICs have some significant advantages over FPGAs. Since they are generally designed for a particular purpose, they are faster, consume less energy and are generally cheaper if manufactured in large volume.


Figure 3.2. FPGA Structure[11]

However, there are disadvantages too. ASICs are expensive and require more time to be fabricated, affecting time-to-market. In an increasingly competitive market, this is a critical issue. Furthermore, with the recent advancements and tremendous growth in FPGAs, along with their inherent programmability, FPGAs are expected to take a larger share of the IC market for high-density chips.

3.8 Architecture of FPGAs

Generally [12], an FPGA has three main types of elements:

• Combinational logic
• Interconnect
• I/O pins

Fig. 3.2 shows the basic structure of an FPGA with all these elements. The combinational logic is normally divided into relatively small units called Logic Elements (LEs) or Configurable Logic Blocks (CLBs). With the advancement in technology, these blocks can now also support sequential circuits; this is achieved with a built-in flip-flop. The interconnections are programmable and are organized into channels, which are used to route signals between CLBs and I/O blocks. There are different types of interconnects depending on the distance between the CLBs to be connected; clocks are provided with dedicated interconnection networks for fast routing of the clock with minimum skew. Also, there is clock circuitry for driving the clock signals to each flip-flop in each logic block.


Figure 3.3. Look-Up Table[12]

3.8.1 Configurable Logic Block

The CLBs are generally composed of Look-Up Tables (LUTs). They are sometimes referred to as "Logic Elements" or "LEs". A block diagram of a basic LUT is shown in Fig. 3.3. In SRAM-based FPGAs, a LUT is used to implement a truth table. Each address in the SRAM represents a combination of inputs to the logic element. The value stored at that address represents the value of the function for that input combination. An n-input function requires an SRAM with 2^n locations. As a result, an n-input LE can represent 2^(2^n) functions. A typical logic element has four inputs. The delay through the LUT is independent of the bits stored in the SRAM, so the delay through the logic element is the same for all functions. This means that, for example, a LUT-based LE will exhibit the same delay for a 4-input XOR and a 4-input NAND. In contrast, the same XOR function built using static Complementary Metal Oxide Semiconductor (CMOS) logic is significantly slower than the NAND function, although the static CMOS implementation is generally faster than the LE.
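A small sketch of how an SRAM-based LUT realizes a truth table (illustrative Python; the function names and the bit ordering of the address are assumptions): an n-input LUT is simply 2^n stored bits indexed by the input combination, which is why its delay does not depend on which of the 2^(2^n) possible functions it holds.

```python
def make_lut(func, n):
    """Build the 2**n-entry contents of an n-input LUT for a Boolean function.

    func takes a tuple of n bits and returns 0 or 1.
    The address is formed with input 0 as the least-significant bit.
    """
    contents = []
    for addr in range(2 ** n):
        bits = tuple((addr >> i) & 1 for i in range(n))
        contents.append(func(bits))
    return contents

def lut_read(contents, bits):
    """Look up the stored value for one input combination."""
    addr = sum(b << i for i, b in enumerate(bits))
    return contents[addr]

# A 4-input XOR and a 4-input NAND stored in identical 16-bit LUTs
xor4 = make_lut(lambda b: b[0] ^ b[1] ^ b[2] ^ b[3], 4)
nand4 = make_lut(lambda b: 1 - (b[0] & b[1] & b[2] & b[3]), 4)
print(lut_read(xor4, (1, 1, 0, 1)))   # 1
print(lut_read(nand4, (1, 1, 1, 1)))  # 0
```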

LEs generally include registers (flip-flops and latches) as well as combinational logic. A flip-flop or latch is small compared to the combinational logic element. A LUT with a memory element is shown in Fig. 3.4, where both the unregistered and the registered output are connected to the main output through a multiplexer. More complex logic blocks are also possible. For example, many logic elements contain special circuitry for addition. Many FPGAs incorporate specialized adder logic in the logic element; the critical component of an adder is the carry chain, which can be implemented much more efficiently in specialized logic than using standard lookup table techniques.

A more advanced block diagram of a CLB is shown in Fig. 3.5 [11]. In this figure, inputs C1 through C4 allow outputs from other CLBs to be input to this particular CLB, so that CLBs can be cascaded for advanced functions. A few examples from [12] of advanced and complex CLBs are given below.


Figure 3.4. Look-Up Table with storage element[2]


Figure 3.6. Xilinx Spartan-II CLB[12]

Xilinx Spartan-II

The Spartan-II combinational block [14] consists of two identical slices, with each slice containing a LUT, some carry logic and registers. One slice is shown in Fig. 3.6.

There are two logic cells in one slice. A pair of 4-input LUTs form the foundation of the logic cells. Each LUT can also be used as a 16-bit synchronous RAM or as a 16-bit shift register. The slice also contains carry logic for each LUT so that additions can be performed. A multiplexer is used to combine the results of the two function generators in a slice. Another multiplexer combines the outputs of the multiplexers in the two slices, generating a result for the entire CLB.

Each CLB also contains two three-state drivers (known as BUFTs) that can be used to drive on-chip busses.


Figure 3.7. Altera APEX-II Logic Element[12]

Altera APEX II

The APEX II's logic [24] is organized into Logic Array Blocks (LABs). Each LAB includes 10 logic elements. Each logic element contains a lookup table, a flip-flop, etc. The logic elements in a LAB share some logic, such as a carry chain and some control signal generation. A single logic element is shown in Fig. 3.7.

The main logic chain starts with a 4-input lookup table. The output of the LUT is fed to a carry chain. The cascade chain is used for cascading large fan-in functions. The output of this chain goes to a register. The register can be programmed to function as either a D, T, JK or SR element.

3.8.2 Interconnection Network

An FPGA has several kinds of interconnect. There are short wires, general-purpose wires, global interconnect and specialized clock distribution networks. The reason FPGAs need different types of wires is that wires can introduce a lot of delay, and wiring networks of different lengths and connectivity need different circuit designs [11].

An SRAM-based FPGA uses SRAM to hold the information used to program the interconnect. As a result, the interconnect can also be reconfigured. Figure 3.8 shows a simple version of an interconnection point, often known as a connection box. A programmable connection between two wires is made by a CMOS pass transistor. The pass transistor's gate is controlled by a static memory program bit (the D register in this figure). This transistor conducts bidirectionally; however, it is relatively slow, particularly on a signal path that includes several interconnection points in a row [12].

Figure 3.8. SRAM Connection Box [12]

Performance

In comparison to custom chips, FPGA wiring with programmable interconnect is slower. There are two reasons:

• Pass transistors
• Wire lengths

The pass transistor is not a perfect on-switch, so a programmable interconnection point is slower than wires connected by vias. Furthermore, FPGA wires are longer than would otherwise be necessary because they are designed to connect a variety of logic elements and other FPGA resources. This introduces extra capacitance and resistance that slows the signals on the net. In custom chips, on the other hand, wires can be made only as long as needed [12].

Types

As indicated earlier, there are different types of interconnects in an FPGA, essentially to take advantage of the logic in the LEs. Wiring is often organized into different categories depending on its structure and intended use [12].

• Short Wires: They connect only local Logic Elements. They do not take up much space and only introduce short delays. Example: Carry Chains through the LEs.


• Global Wires: Designed for long-distance communication. They have fewer connection points compared to local connections, which reduces their impedance.

• Special Wires: Dedicated to distributing clocks or other register control signals.

The next examples are taken from [12] and describe the interconnect systems in the two FPGAs discussed in Section 3.8.1.

Xilinx Spartan-II

The Spartan-II includes several types of interconnect. They are listed below:

• Local
• General purpose
• Input/Output
• Global
• Clock

The local interconnect system provides several kinds of connections. It connects the:

• Look-up tables
• Flip-flops
• General-purpose interconnect

It also provides internal CLB feedback. Furthermore, it also includes some direct paths for high-speed connections between horizontally adjacent CLBs. These paths can be used for arithmetic, shift registers or other functions that need structured layout and short connections.

The general-purpose routing network provides the bulk of the routing resources. This includes the following types of interconnect:

• A General Routing Matrix (GRM). This is a switch matrix which is used to connect horizontal and vertical routing channels along with connections between the CLBs and the routing channels.

• Twenty-four single-length lines to connect each GRM to the four nearest, adjacent GRMs.

• Hex lines route GRM signals to the GRMs six blocks away. These lines provide longer interconnect. The hex lines include buffers to drive the longer wires. There are a total of 96 hex lines, of which one-third are bidirectional and the rest are unidirectional.


Figure 3.9. Xilinx Spartan-II General Routing Matrix[12]

This whole structure is shown in Fig. 3.9.

One type of dedicated interconnect resource is the on-chip three-state bus, which runs only horizontally. Four partitionable buses are available per CLB row. This three-state bus is shown in Fig. 3.10.

The global routing system is designed to distribute high-fanout signals, including both clocks and logic signals. The primary global routing network is a set of four dedicated global nets with dedicated input pins. Each global net can drive all CLB, I/O register and block RAM clock pins. The clock distribution network is buffered to provide low delay and low skew.

Altera APEX-II

The APEX-II uses horizontal and vertical interconnect channels to interconnect the LEs and the chip pins. The interconnect structure is shown in Fig. 3.11.

A row line can be driven directly by an LE, I/O element, or embedded memory in that row. A column line can also drive a row line; columns can be used to connect wires in two rows. Some dedicated signals with buffers are also provided for high-fanout signals such as clocks. Column I/O pins can directly drive these interconnect lines. Each line traverses two MegaLAB structures, driving the four MegaLABs in the top row and the four MegaLABs in the bottom row of the chip.

3.9 FPGA Configuration

SRAM-based FPGAs are reconfigured by changing the contents of the configuration SRAM. A few pins on the chip are dedicated to configuration; some additional pins may be used for configuration and later released for use as general-purpose I/O pins. Configuration lines are usually bit-serial; however, several bits can be sent in parallel to decrease the configuration time [12].

Figure 3.10. Xilinx Spartan-II On-Chip Three State Bus [12]

During prototyping and debugging, the configuration is generally changed using a download cable and a Personal Computer (PC). However, once the FPGA is in production, or when it is finally made part of the complete system, specialized Programmable Read Only Memories (PROMs), installed on the same Printed Circuit Board (PCB) as the FPGA, are used to store the configuration. Upon power-up, the FPGA runs through a protocol on its configuration pins and is loaded with the configuration.

3.10 Xilinx

Xilinx is one of the leading companies working in this domain. According to [3], it is the world’s largest supplier of programmable logic devices, the inventor of the FPGA and the first semiconductor company with a fabless manufacturing model. It was founded in 1984 in Silicon Valley and its headquarters are in San Jose, California, USA.

Xilinx has produced a large number of different FPGAs. Among the most famous of its platforms are:

• Spartan-3 and its variants
• Virtex
• Virtex-II and its variants
• Virtex-4 and its variants
• Virtex-5 and its variants
• Virtex-6

The Spartan series primarily targets applications with a low-power footprint, extreme cost sensitivity and high volume, for example displays, set-top boxes, wireless routers and other applications. The latest series in the Spartan family is the Spartan-6, which is built on the latest 45-nm technology.

On the other hand, the Virtex family targets complex and specialized applications such as wired and wireless equipment, advanced medical equipment and defense systems. It also includes embedded fixed-function hardware such as multipliers, memories, serial transceivers and microprocessor cores.

The Virtex-II Pro family was the first to include the PowerPC processor. It also had built-in serial transceivers. The Virtex-4 was introduced in 2004 and manufactured on a 1.2 V, 90-nm, triple-oxide process technology. It consists of three families: LX, focused on logic design; FX, focused on embedded processing and connectivity; and SX, focused on digital signal processing [3].

The Virtex-5 series was introduced in 2006 and is the focus of this thesis. In it, Xilinx introduced 6-input LUTs, and it was fabricated using a 65-nm, 1.0 V, triple-oxide process technology. Virtex-5 also has different families focusing on different types of designs. The latest in the Virtex family is the Virtex-6. It is built on a 40-nm process for highly computation-intensive electronic systems [3].

In this thesis, the focus is the mapping of time-multiplexed FIR filters on FPGAs, with special focus on the utilization of DSP and block RAM resources. A maximum FIR filter order of 128 was generated. The LXT family has enough DSP and block RAM resources to support all the test filters. Thus, the Virtex-5 LX family was chosen for the design implementation.

3.10.1 Virtex 5

The Xilinx Virtex-5 family is one of the newest and most powerful FPGA families on the market today. It uses the second-generation Advanced Silicon Modular Block (ASMBL) column-based architecture [15]. It has five distinct sub-families, each targeted towards specific applications. A brief outline of each sub-family and its intended use is given below [16].

• LX - Optimized for high-performance logic.

• LXT - Optimized for high-performance logic with low-power serial connectivity.

• SXT - Optimized for DSP and memory-intensive applications, with low-power serial connectivity.


Each sub-family has a number of different features. The following gives a brief outline of these [15][16]:

• 550 MHz clocking technology
• 65-nm process
• 1.0 V core voltage
• 6-input LUTs
• PowerPC 440 processor blocks
• 1.25 Gbps LVDS I/O
• 580 GMACS performance from DSP48E slices
• 190 GFLOPS of single-precision and 65 GFLOPS of double-precision floating-point DSP performance
• 1,100 DMIPS per PowerPC 440 processor block with high-bandwidth, low-latency interfaces
• RocketIO GTP transceivers in the LXT and SXT platforms
• RocketIO GTX transceivers in the FXT platform
• PCI Express endpoint blocks and tri-mode Ethernet MACs
• 12 DCMs
• 6 PLLs
• 256 bits of distributed memory per CLB, 64 bits per LUT
• 550 MHz, 36 Kbit block RAM
• 550 MHz DSP48E slices

Although in [15] and [16] Xilinx states that the LX sub-family does not contain DSP48E slices, the tabular data available in these documents states the contrary.

Virtex-5 LX20T and Virtex-5 LX50T

This section gives a brief overview of the two FPGAs on which the FIR filters were implemented (synthesized and placed-and-routed). The majority of the filters (filter orders 15 to 108) were implemented on the LX20T, while filter orders 127 and 128 with different time-multiplexing factors were implemented on the LX50T. The reasons for this change from the 20T to the 50T are explained in Chapter 5.

Virtex-5 LX20T

The LX20T is the smallest FPGA of the LXT sub-family of Virtex-5. LXT, as stated earlier, is targeted towards high-performance logic with advanced serial connectivity. There are eight FPGAs in this sub-family. A list of features of the 20T and the 50T is given in [16] and reproduced in Tables 3.1 and 3.2.

Device      CLB Array (Row x Col)   Virtex-5 Slices(a)   Max Distributed RAM (Kb)
XC5VLX20T   60 x 26                 3,120                210
XC5VLX50T   120 x 30                7,200                480

(a) Each Virtex-5 FPGA slice contains four LUTs and four flip-flops.

Table 3.1. Virtex-5 LX20T and LX50T Features

Device      DSP48E Slices(a)   18 Kb Block RAMs(c)   36 Kb Block RAMs   Max Block RAM (Kb)   CMTs(b)
XC5VLX20T   24                 52                    26                 936                  1
XC5VLX50T   48                 120                   60                 2,160                6

(a) Each DSP48E slice contains a 25 x 18 multiplier, an adder and an accumulator.
(b) Each CMT contains 2 Digital Clock Managers (DCMs) and 1 Phase-Locked Loop (PLL).
(c) Block RAMs are fundamentally 36 Kb in size, but each can also be used as two 18 Kb blocks.

Table 3.2. Virtex-5 LX20T and LX50T Features

The three main resources used in our design have been the Virtex-5 slice, the block RAM and the DSP48E slice. The logic slices have been used to implement the control logic and the coefficient Read Only Memories (ROMs). Block RAMs have been used as data RAMs, while DSP48E slices have been used to implement the multiply, multiply-accumulate and multiply-add operations. Simple additions have been implemented in logic slices, because it was felt to be a waste of DSP resources to implement them in DSP48E slices as well.

The following paragraphs explain these resources in detail.

Virtex-5 Logic Slice

As briefly outlined in the previous section, each slice contains four LUTs. Two slices are combined in one CLB. Each CLB is connected to a switch matrix so that the general routing matrix can be accessed. This is shown in Fig. 3.12, which also shows the pair of slices. These slices are arranged as columns with no direct connection between them, and each has an independent carry chain.

Figure 3.12. Virtex-5 Configurable Logic Block [17]

As mentioned in the previous section, each slice contains four LUTs and four flip-flops. Also contained are multiplexers and carry logic. These provide logic, arithmetic and ROM functions. Some slices also support additional functionality: distributed RAM and 32-bit shift registers. The slices which support these functions are called SLICEM; those which do not are referred to as SLICEL [17]. SLICEM is shown in Fig. 3.13 and SLICEL is shown in Fig. 3.14 [17].

As is evident in these figures, SLICEM has a more advanced LUT that enables it to support those two extra functions. Table 3.3 summarizes the logic resources available in one CLB, and Table 3.4 summarizes the logic resources in the selected FPGAs [17].

Slices                        2
LUTs                          8
Flip-Flops                    8
Arithmetic and Carry Chains   2
Distributed RAM               256 bits
Shift Registers               128 bits

Table 3.3. Logic Resources in One CLB


Device                     XC5VLX20T   XC5VLX50T
CLB Array (Row x Col)      60 x 26     120 x 30
Number of 6-input LUTs     12,480      28,800
Max Distributed RAM (Kb)   210         480
Shift Registers (Kb)       105         240
Number of Flip-Flops       12,480      28,800

Table 3.4. Logic Resources in Selected FPGAs

Figure 3.15. Basic Architecture of 6-input LUT[17]

Look-up Table

In Virtex-5, Xilinx replaced the 4-input LUT with a 6-input LUT. This increased the truth table from 16 to 64 different combinations, thus allowing a much larger logic function to be implemented in one LUT. These LUTs are also referred to as function generators; they generate a function based on their contents.

These function generators have six independent inputs and two independent outputs. The inputs are labeled A1 to A6 and the outputs O5 and O6. They can implement any six-input Boolean function. They can also implement two five-input Boolean functions, provided that the two functions share common inputs. The O5 and O6 outputs are then used, one for each five-input function. For a six-input function, only O6 is used.

Irrespective of the type of function implemented, the propagation delay through a LUT is the same. It is also independent of whether the LUT is used as a single 6-input function generator or as two 5-input function generators [19].

The fundamental architecture of a 6-input LUT is given in Fig. 3.15. This LUT also has associated carry logic, Multiplexers and a flip-flop. It can also be used as a 64-bit distributed RAM or as a 32-bit shift register.


Figure 3.16. Dual 5-input LUT using 6-input LUT[26]

Figure 3.16 shows how a 6-input LUT can be used to generate two 5-input LUTs [26].

Multiplexers

In addition to LUTs, slices also contain three multiplexers, labeled F7AMUX, F7BMUX and F8MUX. They are used to combine up to four LUTs to provide any function of seven or eight inputs in a slice. F7AMUX and F7BMUX are used to generate seven-input functions from LUTs A/B or C/D, while F8MUX is used to generate eight-input functions by combining all the LUTs. Functions of more than eight inputs are implemented using multiple slices. However, note that there are no direct connections between slices within a CLB [17].

The following types of multiplexers can be implemented:

• 4:1 multiplexers using one LUT
• 8:1 multiplexers using two LUTs
• 16:1 multiplexers using four LUTs

All these multiplexers are implemented in one level, i.e., one slice, or even one LUT, using the above-mentioned multiplexers. In one slice, one can implement four 4:1 MUXes. For 8:1 MUXes, F7AMUX and F7BMUX combine the outputs of two LUTs; two 8:1 MUXes can be implemented in one slice. The F8MUX is used to combine the outputs of the F7AMUX and F7BMUX so that a 16:1 MUX can be implemented. One slice can accommodate only one 16:1 MUX.

Storage Elements

As mentioned earlier, a slice also contains flip-flops/latches. They can be configured either as edge-triggered D-type flip-flops or as level-sensitive latches. The D input is driven either directly by the output of a LUT or by the BYPASS slice inputs, which bypass the LUTs.

All the control signals, i.e., clock (CK), clock enable (CE), set/reset (SR) and reverse (REV), are common to all storage elements in one slice. Inputs like SR and CE are shared, i.e., if one flip-flop has them enabled, the other flip-flops also have them enabled. Only the CLK signal has independent polarity. The CE, SR and REV signals are all active high, and all flip-flops and latches have CE and non-CE versions.

There are two attributes, SRHIGH and SRLOW. The SR signal forces the storage element into the state specified by these attributes. As the names suggest, SRHIGH forces a logic high and SRLOW forces a logic low at the output of the storage element [17].

The register/latch configuration in a slice is shown in Fig. 3.17. The set and reset functionality of a register or latch can be configured as one of the following:

• No set or reset
• Synchronous Set
• Synchronous Reset
• Synchronous Set and Reset
• Asynchronous Set (preset)
• Asynchronous Reset (clear)
• Asynchronous Set and Reset (preset and clear)
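For illustration, a register with clock enable and synchronous reset, one of the configurations in the list above, can be described as in the following minimal sketch (not taken from the thesis code; names are illustrative).

-- Minimal sketch: a D-type register with clock enable and synchronous
-- reset, matching one of the configurations listed above.
library ieee;
use ieee.std_logic_1164.all;

entity dff_sync_reset is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;
    sr  : in  std_logic;  -- synchronous reset, active high
    d   : in  std_logic;
    q   : out std_logic
  );
end entity dff_sync_reset;

architecture rtl of dff_sync_reset is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if sr = '1' then
        q <= '0';          -- corresponds to the SRLOW attribute
      elsif ce = '1' then
        q <= d;
      end if;
    end if;
  end process;
end architecture rtl;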

Distributed RAM

As stated earlier, slices and LUTs can also be combined to form “Distributed RAMs”. These memories are synchronous and can be configured in one of the following ways[17]:

• Single-Port 32 x 1-bit RAM
• Dual-Port 32 x 1-bit RAM
• Quad-Port 32 x 2-bit RAM
• Simple Dual-Port 32 x 6-bit RAM
• Single-Port 64 x 1-bit RAM
• Dual-Port 64 x 1-bit RAM
• Quad-Port 64 x 1-bit RAM
• Simple Dual-Port 64 x 3-bit RAM
• Single-Port 128 x 1-bit RAM
• Dual-Port 128 x 1-bit RAM
• Single-Port 256 x 1-bit RAM

These memories have synchronous write. A synchronous read can also be implemented by using the flip-flop in the same slice; this improves performance by reducing the delay to the clock-to-out value of the flip-flop, but introduces one extra cycle of latency.
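As a hedged sketch (assuming the commonly used inference style rather than the actual thesis code), a memory described with a synchronous write and an asynchronous read is typically mapped to LUT-based distributed RAM by the synthesis tool; depth, width and names below are illustrative.

-- Minimal sketch of a memory with synchronous write and asynchronous
-- read, the behaviour that typically infers distributed RAM.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dist_ram_64x8 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  std_logic_vector(5 downto 0);
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0)
  );
end entity dist_ram_64x8;

architecture rtl of dist_ram_64x8 is
  type ram_type is array (0 to 63) of std_logic_vector(7 downto 0);
  signal ram : ram_type := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= din;  -- synchronous write
      end if;
    end if;
  end process;

  dout <= ram(to_integer(unsigned(addr)));       -- asynchronous read
end architecture rtl;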

The resource utilization of LUTs by each distributed RAM configuration is given in Table 3.5[17].

RAM              Number of LUTs
32 x 1 (S)       1
32 x 1 (D)       2
32 x 2 (Q)       4
32 x 6 (SDP)     4
64 x 1 (S)       1
64 x 1 (D)       2
64 x 1 (Q)       4
64 x 3 (SDP)     4
128 x 1 (S)      2
128 x 1 (D)      4
256 x 1 (S)      4

S = Single Port, D = Dual Port, Q = Quad Port, SDP = Simple Dual Port

Table 3.5. Types of distributed RAMs

In the single-port configurations, the RAM has a common address port for synchronous writes and asynchronous reads. In the dual-port configurations, the memory has one port for synchronous writes and asynchronous reads and another port for asynchronous reads. In the quad-port configurations, there is one port for synchronous writes and asynchronous reads and three additional ports for asynchronous reads[17].

Interested readers are referred to [17] for further details on Distributed RAMs.

Read Only Memory

Both SLICEM and SLICEL can implement Read Only Memory. Table 3.6 shows the types of ROMs and their corresponding resource utilization[17].

Virtex-5 Block RAMs

The Xilinx Virtex-5, in addition to distributed RAMs, has a large number of fixed memory blocks called block RAMs. Each block RAM can store up to 36K bits of data and is thus referred to as a 36 Kb RAM. It can be used either as one 36 Kb RAM or as two independent 18 Kb RAMs. These memories are placed in columns. As mentioned in Table 3.2, the LX20T contains 26 such 36 Kb block RAMs and the LX50T contains 60. These blocks can also be cascaded to enable much deeper and wider memory implementations[17].

Through these fixed memory blocks, one can implement different modules. Among them are [17]:

• Single-Port RAMs
• Dual-Port RAMs
• ROMs

• FIFOs

• Multirate FIFOs

These modules can either be:

• Created using the Xilinx CORE Generator™
• Inferred
• Directly instantiated

By inferring, the author means that normal Verilog or VHSIC Hardware Description Language (VHDL) code is written in the style dictated by the FPGA manufacturer, and the synthesis tool is left to recognize it and replace it with dedicated memory blocks. By instantiation, the author means that the memory primitive is explicitly instantiated using a template provided by the manufacturer. Both techniques have their pros and cons, which will be discussed in Chapter 4.
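As a hedged sketch of inference (assuming a common template style, not the thesis code), describing the memory array with a registered, i.e. synchronous, read is what typically lets the synthesis tool map it onto a block RAM instead of distributed RAM; sizes and names are illustrative.

-- Minimal sketch of inferred block RAM: the synchronous (registered)
-- read lets the tool choose a block RAM rather than distributed RAM.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bram_1kx18 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  std_logic_vector(9 downto 0);
    din  : in  std_logic_vector(17 downto 0);
    dout : out std_logic_vector(17 downto 0)
  );
end entity bram_1kx18;

architecture rtl of bram_1kx18 is
  type ram_type is array (0 to 1023) of std_logic_vector(17 downto 0);
  signal ram : ram_type := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= din;
      end if;
      dout <= ram(to_integer(unsigned(addr)));   -- synchronous read
    end if;
  end process;
end architecture rtl;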

Xilinx also provides a dedicated software tool, the CORE Generator mentioned above, to create such fixed modules. One can create a wide range of fixed modules, block RAMs being one, and then instantiate them in the design.

The block RAM in Virtex-5 FPGAs stores up to 36K bits of data and can be configured as either two independent 18 Kb RAMs or one 36 Kb RAM. Each 36 Kb block RAM can be further configured as a 64K x 1 (when cascaded with an adjacent 36 Kb block RAM), 32K x 1, 16K x 2, 8K x 4, 4K x 9, 2K x 18, or 1K x 36 memory. Each 18 Kb block RAM can be configured as a 16K x 1, 8K x 2, 4K x 4, 2K x 9, or 1K x 18 memory[17].


The write and read operations are synchronous; although the two ports share the stored data, they are symmetrical and totally independent. Each port can be configured in one of the available widths, independent of the other port. Furthermore, the read port width can differ from the write port width for each port. The memory can also be initialized or cleared by the configuration bitstream. During a write operation, the data output can be set to either remain unchanged, reflect the new data being written, or reflect the previous data now being overwritten[17].

Additional Virtex-5 FPGA block RAM enhancements include[17]:

• Each 36K block RAM can be set to simple dual-port mode, doubling data width of the block RAM to 72 bits. The 18K block RAM can also be set to simple dual-port mode, doubling data width to 36 bits. Simple dual-port mode is defined as having one read-only port and one write-only port with independent clocks.

• One 64-bit Error Correction Coding block is provided per 36 Kb block RAM or 36 Kb First In First Out (FIFO). Separate encode/decode functionality is available.

• Synchronous Set/Reset of the outputs to an initial value is available for both the latch and register modes of the block RAM output.

• An attribute to configure the block RAM as a synchronous FIFO to eliminate flag latency uncertainty.

• The Virtex-5 FIFO does not have FULL flag assertion latency.

True Dual Port RAM

The Xilinx block RAMs are true dual-port RAMs. The 36 Kb block RAM has a 36 Kb storage area, and its two ports are completely independent; the same holds for the 18 Kb counterpart. The structure is fully symmetrical. Figure 3.18 shows the true dual-port block diagram and Table 3.7 lists the port definitions[17]. Regarding read and write operations, data can be read/written from either one or both ports. Both operations are synchronous and require a clock edge. As is evident, each port has its own address, data in, data out, clock, clock enable and write enable ports.

For information about the different write modes, timing diagrams and other additional details, please refer to [17].
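A minimal sketch of how such a true dual-port memory can be described for inference follows, assuming the commonly used shared-variable template style rather than the actual thesis code; widths and names are illustrative.

-- Minimal sketch of a true dual-port RAM: both ports can read and
-- write. A shared variable allows both clocked processes to update
-- the same array, in the style of the vendor inference templates.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tdp_ram is
  port (
    clka, clkb   : in  std_logic;
    wea, web     : in  std_logic;
    addra, addrb : in  std_logic_vector(9 downto 0);
    dia, dib     : in  std_logic_vector(17 downto 0);
    doa, dob     : out std_logic_vector(17 downto 0)
  );
end entity tdp_ram;

architecture rtl of tdp_ram is
  type ram_type is array (0 to 1023) of std_logic_vector(17 downto 0);
  shared variable ram : ram_type;
begin
  port_a : process (clka)
  begin
    if rising_edge(clka) then
      if wea = '1' then
        ram(to_integer(unsigned(addra))) := dia;
      end if;
      doa <= ram(to_integer(unsigned(addra)));
    end if;
  end process;

  port_b : process (clkb)
  begin
    if rising_edge(clkb) then
      if web = '1' then
        ram(to_integer(unsigned(addrb))) := dib;
      end if;
      dob <= ram(to_integer(unsigned(addrb)));
    end if;
  end process;
end architecture rtl;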

Virtex-5 Simple Dual Port RAM

Apart from the true dual-port RAM, the block RAMs can also be configured as simple dual-port RAMs. The difference is that one port cannot be used for both reading and writing: one port is designated for reading and one for writing, thus allowing independent read and write operations to happen simultaneously. Figure 3.19 shows the simple dual-port data flow[17] and the port definitions are available in Table 3.8.


Figure 3.18. True Dual Port[17]

Virtex-5 DSP48E Slice

The DSP48E slice in Virtex-5 is an extension of the DSP48 slice available in the Virtex-4 FPGA. These special slices allow special functions to be implemented without using the general FPGA fabric. Many DSP algorithms are supported, resulting in low power, high performance and efficient device utilization.

This special slice is a new element in the Xilinx development model referred to earlier as Application Specific Modular Blocks (ASMBL™). The purpose of this model is to deliver off-the-shelf programmable devices with the best mix of logic, memory, I/O, processors, clock management and digital signal processing. ASMBL is an efficient FPGA development model for delivering off-the-shelf, flexible solutions ideally suited to different application domains[20].

Each DSP tile in Virtex-5 contains two DSP48E Slices and a local interconnect for connection to other devices and general FPGA fabric[21].

Port Name            Description
DI[A|B]              Data Input Bus
DIP[A|B]             Data Input Parity Bus, can be used for additional data inputs
ADDR[A|B]            Address Bus
WE[A|B]              Byte-wide Write Enable
EN[A|B]              When inactive, no data is written to the block RAM and the output bus remains in its previous state
SSR[A|B]             Synchronous Set/Reset for either latch or register modes
CLK[A|B]             Clock Input
DO[A|B]              Data Output Bus
DOP[A|B]             Data Output Parity Bus, can be used for additional data outputs
REGCE[A|B]           Output Register Enable
CASCADEINLAT[A|B]    Cascade input pin for 64K x 1 mode when optional output registers are not enabled

Table 3.7. True Dual Port Definitions

The DSP48E slice supports many independent functions such as:

• Multiplier
• Multiply-Accumulate
• Multiply-Add
• Three Input Adder
• Barrel Shifter
• etc.

The architecture also supports connecting multiple DSP48E slices to form wide math functions, DSP filters and complex arithmetic without involving the general-purpose FPGA fabric. In this thesis, the Multiply-Accumulate and Multiply-Add features of the DSP48E slice have been utilized.
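As a hedged illustration of such multiply-accumulate usage (a sketch under the assumption that the synthesis tool maps it onto a DSP48E slice; it is not the actual thesis code), a behavioural description could look as follows. The signal names and the clear input are chosen only for this example.

-- Minimal sketch of a multiply-accumulate that a synthesis tool can
-- map onto a DSP48E slice: a signed 25 x 18 multiply feeding a
-- 48-bit accumulator.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac is
  port (
    clk : in  std_logic;
    clr : in  std_logic;                         -- synchronous clear of the accumulator
    a   : in  signed(24 downto 0);
    b   : in  signed(17 downto 0);
    acc : out signed(47 downto 0)
  );
end entity mac;

architecture rtl of mac is
  signal acc_reg : signed(47 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if clr = '1' then
        acc_reg <= (others => '0');
      else
        acc_reg <= acc_reg + resize(a * b, 48);  -- multiply-accumulate
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture rtl;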

The Virtex-5 DSP48E slice is shown in Fig. 3.20.

The DSP48E slice has a 25 x 18 multiplier and an add/subtract unit as its fundamental components. The add/subtract function has been extended to also work as a logic unit, which performs a number of bitwise logical operations when the multiplier is not in use. The slice also includes a pattern detector and a pattern bar detector that can be used for convergent rounding, overflow/underflow detection for saturation arithmetic and auto-resetting counters/accumulators. A Single Instruction Multiple Data (SIMD) mode of the adder/subtracter/logic unit is also available[21].


Figure 3.19. Simple Dual Port Block Diagram[17]

Some of the main features of the DSP48E slice are listed below[21]:

• 25 x 18 multiplier

• 30-bit A input, of which the lower 25 bits feed the A input of the multiplier, while the entire 30-bit input forms the upper 30 bits of the 48-bit A:B concatenated internal bus.

• Cascading A and B inputs

– Semi-independently selectable pipelining between direct and cascade paths

– Separate clock enables for the 2-deep A and B sets of input registers

• Independent C input and C register with independent reset and clock enable.

• CARRYCASCIN and CARRYCASCOUT internal cascade signals to support 96-bit accumulators/adders/subtracters in two DSP48E slices

• MULTSIGNIN and MULTSIGNOUT internal cascade signals with special OPMODE setting to support a 96-bit MACC extension

• SIMD mode for the three-input adder/subtracter, which precludes use of the multiplier in the first stage

– Dual 24-bit SIMD adder/subtracter/accumulator with two separate CARRYOUT signals

– Quad 12-bit SIMD adder/subtracter/accumulator with four separate CARRYOUT signals


