High Level Model of IEEE 802.15.3c Standard and Implementation of a Suitable FFT on ASIC


Department of Electrical Engineering (Institutionen för systemteknik)

Master's thesis (Examensarbete)

High Level Model of IEEE 802.15.3c Standard and Implementation of a Suitable FFT on ASIC

Master's thesis carried out in Electronics Systems at the Institute of Technology, Linköping University

by

Tanvir Ahmed

LiTH-ISY-EX--11/4462--SE

Linköping 2011

Department of Electrical Engineering

Linköpings tekniska högskola, Linköpings universitet


Supervisors: Carl Ingemarsson

ISY, Linköpings universitet

Mario Garrido

ISY, Linköpings universitet

Examiner: Oscar Gustafsson

ISY, Linköpings universitet


Division, Department: Electronics Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2011-05-15

Language: English

Report category: Master's thesis (Examensarbete)

URL for electronic version: http://www.es.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-68697

ISBN: -

ISRN: LiTH-ISY-EX--11/4462--SE

ISSN: -

Title: High Level Model of IEEE 802.15.3c Standard and Implementation of a Suitable FFT on ASIC

Author: Tanvir Ahmed

Abstract

A high level model of the HSIPHY mode of the IEEE 802.15.3c standard has been constructed in Matlab to optimize the wordlength needed to achieve a specific bit error rate (BER) depending on the application, and an FFT has then been implemented for different wordlengths. The hardware cost and power are proportional to the wordlength. The main objective of this thesis has been to implement a low power, low area cost FFT for this standard. To that end, the whole system has been modeled in Matlab, and the signal to noise ratio (SNR) and wordlength of the system have been studied to achieve an acceptable BER. An FFT has then been implemented on a 65 nm ASIC for wordlengths of 8, 12 and 16 bits. For the implementation, a radix-8 algorithm with eight parallel samples has been adopted. This reduces the area and the power consumption significantly compared to other algorithms and architectures. Moreover, a simple control has been used for this implementation. Voltage scaling has been done to reduce the power. The EDA synthesis results show that for a 16 bit wordlength, the FFT has a throughput of 2.64 GS/s, occupies 1.439 mm² of chip area and consumes 61.51 mW of power.


Acknowledgments

I would like to thank Oscar Gustafsson for giving me the opportunity to do my thesis in Electronics Systems. This gave me access to the resources and facilities I needed for the thesis. It also gave me a new way of thinking, and I believe it will help me in my PhD studies in Japan. I am heartily thankful to my supervisors Carl Ingemarsson and Mario Garrido for guiding me throughout the thesis and for correcting my various documents with attention and care. They also helped me a great deal in solving the technical issues related to the thesis. Their guidance helped me get a grip on VHDL and on different design tools, such as Matlab, Modelsim and Design Compiler. I offer my regards and blessings to all my friends who shared the lab with me, for their inspiration and for exchanging their culture and ideas. It was a great experience for me to work with people from different countries in a multicultural environment. It also helped me learn about different areas of electronics, as they were working on different topics.

Last but not least, I am grateful to my parents for giving me every kind of support from my birth until now. I believe that without their support it would not have been possible for me to pursue my Master's in Sweden.


List of Figures

2.1 Constellation diagram of π/2 BPSK.
2.2 Constellation diagram of π/2 QPSK.
2.3 Constellation diagram of π/2 8-PSK.
2.4 Constellation diagram of π/2 16-QAM.
2.5 Constellation diagram of DAMI.
2.6 Constellation diagram of OOK.
2.7 FEC data multiplexer.
2.8 Constellation diagram of QPSK modulation.
2.9 Constellation diagram of 16 QAM modulation.
2.10 Constellation diagram of 64 QAM modulation.
2.11 Convolutional encoder.
3.1 IEEE 802.15.3c system.
3.2 BER as a function of SNR.
3.3 BER as a function of wordlength at SNR 35 dB.
4.1 SFG of radix-2.
4.2 SFG of radix-4.
4.3 SFG of radix-16 decimation in frequency.
4.4 SFG of radix-16 decimation in time.
4.5 Radix-2 feedforward architecture.
4.6 Radix-4 feedforward architecture.
4.7 Radix-2 feedback architecture.
4.8 Radix-4 feedback architecture.
4.9 Complex multiplier.
4.10 Radix-2 butterfly.
4.11 ROM for coefficients.
4.12 Memory with pointer.
4.13 Shift registers.
5.1 SFG of radix-8 decimation in time.
5.2 SFG of radix-8 decimation in frequency.
5.3 Data path of the FFT.
5.4 Data path of the FFT.
5.5 Implementation of radix-8 butterfly.
5.6 Shuffling circuit.
5.7 Block diagram of shuffler 1.
5.11 Datapath controller.
5.12 ROM controller.
5.13 Entity of complex multiplier.
5.14 Entity of a radix-2 butterfly.


5.16 Area and power consumption of the FFT before and after frequency scaling.
5.17 Power consumption before and after voltage scaling.
5.18 Power and area for different length buffer.
5.19 Power and area of complex multiplier.
5.20 Power and area of radix-8 butterfly.


List of Tables

2.1 Bandwidth and center frequency for different channels
2.2 Modulation dependent normalization factor
2.3 Subcarrier frequency allocation
2.4 Timing-related parameters for HSIPHY
2.5 Low data rate channelization
2.6 High data rate OFDM parameter
2.7 Low data rate OFDM parameter
3.1 MCS 6 specifications
3.2 Argument for modem.qammod
3.3 Argument for modem.qamdemod
4.1 Comparison of pipelined architecture for the N point FFT
5.1 Constraint of the ASIC
5.2 Design constraint of the FFT
5.3 Selection signal information
5.4 Memory and shift register performance for different wordlength
5.5 Area and power for different components
5.6 FFT performance for different wordlength
5.7 Comparison of architectures for the computation of a 512-point 8-parallel FFT


Chapter 1 Introduction

Applications in communication systems, and the data rates they demand, keep advancing with time. Different task groups have developed different standards, and some of them have been adopted by the IEEE. IEEE 802.15.3c is one of them. Other applications of the IEEE 802.15 standard are Bluetooth and Zigbee. These standards can support data rates of up to 100 Mb/s for short range (1 m - 10 m) communication. However, those standards are not suitable for applications such as live HD video streaming with a bit rate of ∼3 Gbps, replacing the HDMI (2.2 Gbps) connection with wireless connectivity, or large file transfers at very high speed.

In 2005, IEEE 802.15 Alternative Task Group 3c developed a standard with the aim of providing wireless communication in a person's area at a data rate high enough to support those applications [1]. This standard uses the 60 GHz band for its carrier frequency. However, research shows that the band near 60 GHz has high attenuation in air compared to the 5 GHz band. As a result, this band is more suitable for indoor than for outdoor applications. Moreover, the high attenuation can limit the problem of channel interference. Later, in 2009, the standard was adopted by the IEEE.

The title of the thesis work is “High Level Model of IEEE 802.15.3c and Implementation of a Suitable FFT on ASIC”. There are two components to this title. The first is a high level model of the IEEE 802.15.3c standard. This includes exploring different aspects of the standard, namely a review of the standard and a high level model of one specific mode of it. The high level model has been used to optimize different parameters (such as SNR and finite wordlength) for the physical layer. The second component is the implementation of a suitable FFT on ASIC. The HSIPHY mode of this standard adopts orthogonal frequency division multiplexing (OFDM) to overcome the multipath fading effect of the wireless channel, and the FFT is the key component of OFDM. To implement the FFT on ASIC, a 65 nm technology standard cell library has been used. The main focus of the implementation was to reduce the power as well as the area.

This document is organized in the following chapters:

• Chapter 1: Introduction

• Chapter 2: Standard Review of mm-Wave - A review of the IEEE 802.15.3c standard and its different modes of operation.

• Chapter 3: High Level Model of IEEE 802.15.3c (HSIPHY) - Modeling of the physical layer for the High Speed Interface (HSIPHY) and the effect of finite wordlength and SNR on the bit error rate.

• Chapter 4: Background of the FFT - Discussion of the algorithm of the discrete Fourier transform (DFT), different architectures of the FFT and the basic building blocks.

• Chapter 5: Implementation of the FFT on ASIC - Details of radix-8 and design issues, hardware implementation and results of the FFT.

• Chapter 6: Conclusion and Future Work - Conclusions drawn on the basis of the results and some directions for further research.

The whole design is based on Matlab and VHDL. The communication toolbox of Matlab has been used for the high level model of the standard, and VHDL has been used as the hardware description language for the implementation of the FFT. Modelsim has been used for functional testing, and Design Compiler for compiling the design for a specific technology library. Finally, the performance measurements (calculation of the power and area for a specific clock frequency) have been done by means of Design Compiler and Nanosim.


Chapter 2 Standard review of mm-Wave

This chapter reviews the IEEE 802.15.3c standard. The standard is mostly used for high data rate transmission at Gbps rates, for example video on demand, HDTV and home theater. It uses 60 GHz as its carrier frequency [1]. This band has high attenuation in free space: research shows that the 60 GHz band has an attenuation of 15 dB per kilometer. The band is therefore a promising candidate for indoor rather than outdoor applications.

It is noted in [1] that the standard can operate in three different modes:

• Single Carrier mode in mmWave PHY (SCPHY)

• High Speed Interface mode in mmWave PHY (HSIPHY)

• Audio/Visual mode in mmWave PHY (AVPHY)

2.1 Single carrier mode in mm wave PHY (SC-PHY)

This mode provides three different classes of modulation and coding schemes targeting different wireless connectivity applications. Class 1 is specified for low rate, low cost mobile operation and can support a data rate of 1.5 Gb/s. Class 2 is specified to achieve data rates of up to 3 Gb/s, and class 3 is specified for high speed, high performance applications with data rates over 5 Gb/s [1].

2.1.1 Bandwidth and carrier frequency

This mode operates on four different carrier frequencies in the range from 57.24 GHz to 65.88 GHz [1]. The bandwidth is the same in all four cases. These channels are defined in Table 2.1.


Table 2.1: Bandwidth and center frequency for different channels

Channel ID   Start frequency (GHz)   Center frequency (GHz)   Stop frequency (GHz)

1 57.24 58.32 59.40

2 59.40 60.48 61.56

3 61.56 62.64 63.72

4 63.72 64.80 65.88
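The regular structure of Table 2.1 can be reproduced with a few lines of Python. This is only a sketch: the channel width of 2.16 GHz is inferred from the table rows (stop minus start) and is not quoted from the text, and the function name `channel_plan` is hypothetical.

```python
# Sketch: rebuild Table 2.1 from the first start frequency and the channel width.
# Assumption: the 2.16 GHz width is read off the table rows, not stated in the text.
FIRST_START_GHZ = 57.24
CHANNEL_WIDTH_GHZ = 2.16


def channel_plan(num_channels=4):
    """Return (channel_id, start, center, stop) tuples, all frequencies in GHz."""
    plan = []
    for ch in range(1, num_channels + 1):
        start = FIRST_START_GHZ + (ch - 1) * CHANNEL_WIDTH_GHZ
        stop = start + CHANNEL_WIDTH_GHZ
        center = (start + stop) / 2
        plan.append((ch, round(start, 2), round(center, 2), round(stop, 2)))
    return plan


for row in channel_plan():
    print(row)
```

Each channel starts where the previous one stops, which is why the four channels cover the band from 57.24 GHz to 65.88 GHz without gaps.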

2.1.2 Forward error correction (FEC)

This mode of operation supports Reed-Solomon (RS) block codes and low density parity check (LDPC) block codes as forward error correction schemes. The RS block code is mandatory and the LDPC block codes are optional. The different coding schemes are described as follows.

RS(255,239)

The RS(255,239) code shall use the generator polynomial in Equation 2.1 [1]. The encoder takes 239 input symbols, generates 16 parity symbols and sends them along with the 239 input symbols, so the total number of output symbols is 255.

\[ g(x) = \prod_{k=1}^{16} \left( x + \alpha^{k} \right) \tag{2.1} \]

Here, \(\alpha\) is a root of the primitive polynomial \(p(x) = 1 + x^{2} + x^{3} + x^{4} + x^{8}\) and \(x\) is the polynomial variable.
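To make the generator polynomial concrete, the following sketch builds it over GF(2^8) and checks its roots. It assumes that the primitive element is the field element 0x02 and that the product in Equation 2.1 runs over consecutive powers of α; the helper names are hypothetical and the code is not taken from the thesis.

```python
# Sketch: generator polynomial of RS(255,239) over GF(2^8).
PRIM_POLY = 0x11D  # bit pattern of p(x) = 1 + x^2 + x^3 + x^4 + x^8
ALPHA = 0x02       # assumed primitive element (a root of p)


def gf_mul(a, b):
    """Multiply two GF(2^8) elements: carry-less multiply reduced modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a GF(2^8) element to a non-negative integer power."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def rs_generator(num_parity=16):
    """Coefficients of prod_{k=1}^{num_parity} (x + alpha^k), highest degree first."""
    g = [1]
    for k in range(1, num_parity + 1):
        root = gf_pow(ALPHA, k)
        shifted = g + [0]                            # g(x) * x
        scaled = [0] + [gf_mul(c, root) for c in g]  # g(x) * alpha^k
        g = [s ^ t for s, t in zip(shifted, scaled)]  # addition in GF(2^8) is XOR
    return g


def poly_eval(poly, x):
    """Horner evaluation of a polynomial (highest degree first) in GF(2^8)."""
    acc = 0
    for c in poly:
        acc = gf_mul(acc, x) ^ c
    return acc


g = rs_generator()
print("degree of g(x):", len(g) - 1)
```

The degree of g(x) equals the number of parity symbols, 255 − 239 = 16, and every power α^1 to α^16 evaluates g(x) to zero.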

LDPC(672,588)

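The LDPC code is systematic, i.e., it encodes an information block i of size k into a codeword c of size n by adding n − k parity bits. Each parity-check matrix is partitioned into square sub-blocks of size z × z. The cyclic permutation matrix \(p^{I}\) is obtained by cyclically shifting the columns of the z × z identity matrix to the right I times:

\[
p^{0} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0 \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix},
\quad
p^{1} = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0
\end{bmatrix},
\quad
p^{2} = \begin{bmatrix}
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0
\end{bmatrix}
\]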

LDPC(672,588) has 588 input bits and 672 output bits with a code rate of 7/8. Here, the number of parity bits is 84. The table is described in [1].
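The cyclic permutation sub-blocks and the code rate can be illustrated with a small sketch. The sub-block size z = 4 below is chosen only to keep the printout short and is not the value used by the standard; the function names are hypothetical.

```python
from fractions import Fraction


def cyclic_permutation(z, shift):
    """z x z identity matrix with its columns cyclically shifted right by `shift`."""
    return [[1 if col == (row + shift) % z else 0 for col in range(z)]
            for row in range(z)]


def mat_mul(a, b):
    """Plain integer matrix product of two square matrices of equal size."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]


Z = 4  # illustrative sub-block size only (assumption)
p0 = cyclic_permutation(Z, 0)
p1 = cyclic_permutation(Z, 1)
p2 = cyclic_permutation(Z, 2)

for row in p1:
    print(row)

# Code rate and parity bits of LDPC(672,588)
code_rate = Fraction(588, 672)
parity_bits = 672 - 588
print(code_rate, parity_bits)
```

Shifting twice by one position gives the same matrix as shifting once by two positions, which is why a single shift value per sub-block is enough to describe the parity-check matrix.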


LDPC(672,504)

LDPC(672,504) has 504 input bits and 672 output bits, giving a code rate of 3/4. The number of parity bits is 168. It follows the same identity and permuted matrices as discussed in Section 2.1.2. The table is described in [1].

LDPC(672,336)

LDPC(672,336) is used for highly reliable applications with a code rate of 1/2. It takes 336 bits as an input and generates 672 bits. It follows the identity matrix of Section 2.1.2 and the table is described in [1].

2.1.3 Modulation

This mode supports six different modulation schemes, depending on the data rate and the performance requirements of the applications. Four of them are mandatory and the other two are optional. The optional schemes are used for low data rate applications.

π/2 BPSK

π/2 BPSK is a binary phase modulation with a counterclockwise phase shift of π/2. Figure 2.1 shows the constellation mapping of the π/2 BPSK signal. Here, zl is the input bit. The input bit is mapped to 1 in the constellation diagram when the input is 1; otherwise the bit is mapped to j. With this modulation, one symbol is generated for every bit.

Figure 2.1: Constellation diagram of π/2 BPSK (axes I and Q; input bit zl; counterclockwise π/2 rotation).

π/2 QPSK

π/2 QPSK encodes 2 bits per symbol, with a counterclockwise rotation of π/2. This modulation technique uses four equally spaced phases at the same radius. Figure 2.2 is the constellation mapping diagram for π/2 QPSK. This modulation scheme uses Gray encoding [1].

Figure 2.2: Constellation diagram of π/2 QPSK (axes I and Q; input bits d1d2).

π/2 8-PSK

The constellation diagram of π/2 8-PSK is depicted in Figure 2.3. In this technique, three bits are mapped to one symbol of the constellation. Here, the three bits are denoted d1d2d3. This scheme also has the π/2 rotation of the previous cases. Eight different symbols are used to represent the arriving bits. The bits are Gray encoded here as well.

Figure 2.3: Constellation diagram of π/2 8-PSK (axes I and Q; input bits d1d2d3).

π/2 16-QAM

The π/2 16-QAM constellation diagram is depicted in Figure 2.4. Here four bits, b1b2b3b4, are mapped to one symbol. Sixteen different symbols with different radii are used to represent the arriving bits.

Figure 2.4: Constellation diagram of π/2 16-QAM (axes I and Q; input bits b1b2b3b4).

Dual Alternate Mark Inversion

Dual Alternate Mark Inversion (DAMI) coding is optional, and this scheme is used for low data rate and low cost applications. The constellation diagram is shown in Figure 2.5. It takes two bits as input and generates one symbol.

On Off Keying

On Off Keying (OOK) is also optional and, like DAMI, is used for low data rate and low cost applications. Figure 2.6 shows the constellation diagram. It generates one symbol for every input bit.

Figure 2.5: Constellation diagram of DAMI.

Figure 2.6: Constellation diagram of OOK.

2.2 High speed interface mode in mm wave PHY (HSIPHY)

The HSI PHY is designed for low latency, high speed data, and it uses orthogonal frequency division multiplexing (OFDM). This mode supports different modulation and coding schemes using different frequency domain spreading factors, modulations and LDPC block codes.

2.2.1 Bandwidth and carrier frequency

This mode uses Channel IDs 2 and 3 of Table 2.1 as carrier frequencies [1]. The band starts at 59.40 GHz and ends at 63.72 GHz. The center frequencies are 60.48 GHz and 62.64 GHz for Channel IDs 2 and 3, respectively.


2.2.2 Forward error correction

This mode uses both equal error protection (EEP) and unequal error protection (UEP), depending on the data rate and performance. The data multiplexer is shown in Figure 2.7. For EEP, both LDPC blocks are the same; for UEP, the two LDPC blocks are different. In this mode, four different LDPC codes with different code rates are used. Three of them are the same as for SCPHY, and the fourth is LDPC(672,420), which is discussed in the following.

Figure 2.7: FEC data multiplexer (octet demux 1:2, two LDPC encoders, MUX).

LDPC(672,420)

LDPC(672,420) is used for high reliability applications and has a code rate of 5/8. It takes 420 bits as input and generates 672 bits, of which 252 are parity bits.

2.2.3 Modulation

This mode uses three different modulation techniques depending on the data rate and the performance. The modulation dependent normalization factor is given in Table 2.2. It is also stated in [1] that the value of d is 1 for normal constellation and 1.25 for skewed constellation.

Table 2.2: Modulation dependent normalization factor

Modulation   K_mod
QPSK         1/√(1 + d²)
16-QAM       1/√(5 (1 + d²))
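The two normalization factors above can be checked numerically: scaling the constellation by K_mod should give unit average symbol power. The sketch below assumes that the I levels are ±1 (QPSK) or ±1, ±3 (16-QAM) and that the Q levels are the same values multiplied by d, as suggested by the axis labels of the constellation figures; the function names are hypothetical.

```python
import math
from itertools import product


def average_power(i_levels, q_levels):
    """Mean squared magnitude over all (I, Q) combinations."""
    points = [complex(i, q) for i, q in product(i_levels, q_levels)]
    return sum(abs(p) ** 2 for p in points) / len(points)


def kmod_qpsk(d):
    return 1 / math.sqrt(1 + d * d)


def kmod_16qam(d):
    return 1 / math.sqrt(5 * (1 + d * d))


def normalized_powers(d):
    """Average power of the scaled QPSK and 16-QAM constellations for skew d."""
    p_qpsk = average_power([-1, 1], [-d, d])
    p_16qam = average_power([-3, -1, 1, 3], [-3 * d, -d, d, 3 * d])
    return kmod_qpsk(d) ** 2 * p_qpsk, kmod_16qam(d) ** 2 * p_16qam


for d in (1.0, 1.25):  # normal and skewed constellation
    print(d, normalized_powers(d))
```

For both d = 1 and d = 1.25 the normalized average power comes out as 1 up to rounding error, which is the purpose of K_mod.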


QPSK

The constellation diagram of QPSK is depicted in Figure 2.8. SCPHY also uses QPSK, but here there is no π/2 rotation. The modulator takes two bits b1b2 as input and maps them to one symbol. There are four symbols, all at the same radius in the constellation diagram.

Figure 2.8: Constellation diagram of QPSK modulation (I axis: ±1; Q axis: ±d; input bits b1b2).

16 QAM

16 QAM takes four bits d1d2d3d4 as input and generates one symbol. The constellation diagram is shown in Figure 2.9. There are 16 different symbols with different values and radii in the constellation diagram. It can provide a higher data rate than QPSK.

64 QAM

The constellation diagram of 64 QAM is shown in Figure 2.10. Six bits are mapped to one symbol. Here b1b2b3b4b5b6 are the six input bits. In the constellation diagram there are 64 different symbols with different radii and angles.

2.2.4 OFDM

This mode supports OFDM. There are 3 DC sub-carriers, 16 pilot sub-carriers, 16 guard sub-carriers and 336 data sub-carriers [1]. The sub-carriers and their logical indexes are described in Table 2.3. The total number of sub-carriers is 512, and the throughput is 2.64 GS/s for this mode. The timing related parameters for the FFT are given in Table 2.4.

Figure 2.9: Constellation diagram of 16 QAM modulation (axes I and Q; input bits d1d2d3d4).

Table 2.3: Subcarrier frequency allocation

Subcarrier type     Number of subcarriers   Logical subcarrier indexes
Null subcarriers    141                     [−256 : −186] ∪ [186 : 255]
DC subcarriers      3                       −1, 0, 1
Pilot subcarriers   16                      [−166 : 22 : −12] ∪ [12 : 22 : 166]
Guard subcarriers   16                      [−185 : −178] ∪ [178 : 185]
Data subcarriers    336                     All others
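The index sets of Table 2.3 can be enumerated directly, which confirms that the five subcarrier types partition the 512 logical indexes. This is a minimal sketch; the helper `inclusive` is hypothetical and reads [a : s : b] as "from a to b inclusive in steps of s".

```python
def inclusive(start, stop, step=1):
    """Integer range that includes both end points."""
    return list(range(start, stop + 1, step))


null = inclusive(-256, -186) + inclusive(186, 255)
dc = [-1, 0, 1]
pilot = inclusive(-166, -12, 22) + inclusive(12, 166, 22)
guard = inclusive(-185, -178) + inclusive(178, 185)

# "All others": whatever is left of the 512 logical indexes
reserved = set(null) | set(dc) | set(pilot) | set(guard)
data = [k for k in range(-256, 256) if k not in reserved]

print(len(null), len(dc), len(pilot), len(guard), len(data))
```

The pilots sit every 22 subcarriers on each side of DC, and the remaining indexes between the DC and guard subcarriers carry data.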

2.3 Audio visual mode in mm wave PHY (AV-PHY)

This mode of the standard is mainly intended for multimedia applications, such as live HD video streaming and replacing wired HDMI connections with wireless connectivity. This mode operates at two data rates, a low data rate and a high data rate. The modulation and coding schemes vary with the data rate.

2.3.1 Bandwidth and carrier frequency

This mode supports two different data rates, a high data rate and a low data rate, and different channels are used for each. The high data rate uses Channel ID 2 of Table 2.1, whereas the low data rate supports three different channels, which are described in Table 2.5. Here f_c(HRP) is the center frequency of the current high data rate channel.

Figure 2.10: Constellation diagram of 64 QAM modulation (axes I and Q; input bits b1b2b3b4b5b6).

2.3.2 Forward error correction

This mode of the standard uses convolutional encoding. The convolutional encoder for this standard is depicted in Figure 2.11. The encoder has a code rate of 1/3 and uses six delay elements. The generator polynomials are g0 = 133₈, g1 = 171₈ and g2 = 165₈ (octal). The initial values of the memories are set to 0.
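A rate-1/3 encoder with six delay elements can be sketched in a few lines. The generator values are the ones above read as octal numbers; the tap convention (most significant bit of each generator acts on the current input bit) is an assumption made for this illustration, and `conv_encode` is a hypothetical name.

```python
GENERATORS = (0o133, 0o171, 0o165)  # g0, g1, g2 in octal
CONSTRAINT_LENGTH = 7               # current input bit plus six delay elements


def conv_encode(bits):
    """Encode a bit sequence; returns three output bits (X, Y, Z) per input bit."""
    state = 0  # the six delay elements, initialised to 0
    out = []
    for b in bits:
        # Register layout: current bit in the MSB, followed by the six past bits.
        reg = (b << (CONSTRAINT_LENGTH - 1)) | state
        for g in GENERATORS:
            # Output bit = parity (XOR) of the register bits selected by the generator.
            out.append(bin(reg & g).count("1") & 1)
        state = reg >> 1  # shift: the oldest bit drops out
    return out


impulse = conv_encode([1, 0, 0, 0, 0, 0, 0])
print(impulse[0::3])  # X output stream for a single 1 followed by zeros
```

Feeding a single 1 followed by zeros makes each output stream replay the binary digits of its generator, which is a convenient check of the tap wiring.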

2.3.3 Modulation

This mode uses the same QPSK and 16 QAM modulation schemes as shown in Figures 2.8 and 2.9, respectively. This mode also uses Gray coded input bits.

2.3.4 OFDM

This mode uses two different OFDM configurations, one for the high data rate and one for the low data rate. These are described in Table 2.6 and Table 2.7, respectively.


Table 2.4: Timing-related parameters for HSIPHY

Parameter   Description                           Value
f_s         Reference sampling rate               2640 MHz
T_C         Sample duration                       0.38 ns
N_sc        Number of subcarriers                 512
N_dsc       Number of data subcarriers            336
N_P         Number of pilot subcarriers           16
N_G         Number of guard subcarriers           141
N_DC        Number of DC subcarriers              3
N_R         Number of reserved subcarriers        16
N_U         Number of used subcarriers            352
N_GI        Guard interval length in samples      64
Δf_sc       Subcarrier frequency spacing          5.15625 MHz
BW          Nominal used bandwidth                1815 MHz
T_FFT       IFFT and FFT period                   193.94 ns
T_GI        Guard interval duration               24.24 ns
T_S         OFDM symbol duration                  218.18 ns
F_S         OFDM symbol rate                      4.583 MHz
N_CPS       Number of samples per OFDM symbol     576
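Most entries of Table 2.4 follow from the sampling rate and the sample counts. The sketch below derives them, assuming the usual OFDM relations (FFT period = N_sc / f_s, symbol duration = FFT period + guard interval, used bandwidth = used subcarriers × spacing); the variable names are hypothetical.

```python
FS_HZ = 2640e6   # reference sampling rate
N_SC = 512       # number of subcarriers (FFT size)
N_GI = 64        # guard interval length in samples
N_USED = 352     # number of used subcarriers

sample_duration = 1 / FS_HZ             # T_C
subcarrier_spacing = FS_HZ / N_SC       # spacing between subcarriers
t_fft = N_SC / FS_HZ                    # IFFT and FFT period
t_gi = N_GI / FS_HZ                     # guard interval duration
t_symbol = t_fft + t_gi                 # OFDM symbol duration
symbol_rate = 1 / t_symbol              # OFDM symbol rate
used_bandwidth = N_USED * subcarrier_spacing
samples_per_symbol = N_SC + N_GI

print(f"T_C   = {sample_duration * 1e9:.2f} ns")
print(f"df    = {subcarrier_spacing / 1e6:.5f} MHz")
print(f"T_FFT = {t_fft * 1e9:.2f} ns")
print(f"T_GI  = {t_gi * 1e9:.2f} ns")
print(f"T_S   = {t_symbol * 1e9:.2f} ns")
print(f"F_S   = {symbol_rate / 1e6:.3f} MHz")
print(f"BW    = {used_bandwidth / 1e6:.0f} MHz")
```

Since one 512-point FFT has to be completed every FFT period of about 194 ns, the FFT processor must sustain the full 2.64 GS/s sample rate.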

Table 2.5: Low data rate channelization

Channel index   Start frequency           Center frequency          Stop frequency
1               f_c(HRP) − 207.625 MHz    f_c(HRP) − 158.625 MHz    f_c(HRP) − 109.625 MHz
2               f_c(HRP) − 49 MHz         f_c(HRP)                  f_c(HRP) + 49 MHz
3               f_c(HRP) + 109.625 MHz    f_c(HRP) + 158.625 MHz    f_c(HRP) + 207.625 MHz

Table 2.6: High data rate OFDM parameter

Parameter                 Value
Occupied bandwidth        1.76 GHz
Reference sampling rate   2.538 GHz
Number of subcarriers     512
FFT period                N_sc(HR) / f_s(HR) ≈ 202 ns
Subcarrier spacing        1 / T_FFT(HR) ≈ 4.96 MHz
Guard interval            64 / f_s(HR) ≈ 25.2 ns
Symbol duration           T_FFT(HR) + T_GI(HR) ≈ 227 ns

Figure 2.11: Convolutional encoder (six delay elements D; one input; X, Y and Z coded data outputs).

Table 2.7: Low data rate OFDM parameter

Parameter                 Value
Occupied bandwidth        92 MHz
Reference sampling rate   317.25 MHz
Number of subcarriers     128
FFT period                N_sc(LR) / f_s(LR) ≈ 403 ns
Subcarrier spacing        1 / T_FFT(LR) ≈ 2.48 MHz
Guard interval            28 / f_s(LR) ≈ 88.3 ns
Symbol duration           T_FFT(LR) + T_GI(LR) ≈ 492 ns
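The high and low data rate OFDM parameters can be derived from the sampling rate, the FFT size and the guard interval length in samples. The sketch below uses the sample counts appearing in the formulas of Tables 2.6 and 2.7 (64 and 28 guard samples); the function name `ofdm_timing` and the dictionary keys are hypothetical.

```python
def ofdm_timing(fs_hz, n_sc, n_gi):
    """Derive FFT period, subcarrier spacing, guard interval and symbol duration."""
    t_fft = n_sc / fs_hz
    t_gi = n_gi / fs_hz
    return {
        "t_fft_ns": t_fft * 1e9,
        "spacing_mhz": 1 / t_fft / 1e6,
        "t_gi_ns": t_gi * 1e9,
        "t_symbol_ns": (t_fft + t_gi) * 1e9,
    }


high_rate = ofdm_timing(fs_hz=2.538e9, n_sc=512, n_gi=64)
low_rate = ofdm_timing(fs_hz=317.25e6, n_sc=128, n_gi=28)

for name, params in (("high rate", high_rate), ("low rate", low_rate)):
    print(name, {key: round(value, 2) for key, value in params.items()})
```

With these inputs the low data rate symbol duration of about 492 ns is the sum of an FFT period of about 403 ns and a guard interval of about 88 ns.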


Chapter 3 High Level Model of IEEE 802.15.3c (HSIPHY)

This chapter focuses on an overview of the system and on its high level model in Matlab. In Chapter 2, the different modes of IEEE 802.15.3c were discussed. Among the three modes, HSIPHY has been chosen. This mode has 11 different MCS (modulation and coding scheme) identifiers. For the high level model, MCS 6 has been selected. The specifications of MCS 6 are given in Table 3.1.

Table 3.1: MCS 6 specifications

Parameter Value

Data Rate 5390 Mb/s

Modulation Scheme 16-QAM

Spreading Factor 1

Forward Error Correction LDPC(672,588)

Coding Mode EEP

3.1 System overview

The system is depicted in Figure 3.1. It can be divided into two main sections, the transmitter and the receiver. The transmitter gets its data from the MAC or protocol layer, and the receiver delivers data to the protocol layer. In the transmitter, the data received from the protocol layer are encoded by the LDPC encoder, which adds extra bits to protect the signal from the noise on the channel. The coded bits are modulated by the modulator and converted to discrete samples. The OFDM block converts those samples from a discrete frequency signal to a discrete time signal. The Digital to Analog Converter (DAC) then converts the discrete signal to a continuous time signal, which is processed in the RF section. Before transmission by the antenna, the RF section up-converts and amplifies the baseband signal. At the other end, the RF section of the receiver receives the signal, applies proper filtering and down-converts the received signal.

Figure 3.1: IEEE 802.15.3c system (transmitter and receiver, each with a MAC/protocol block and a baseband processor containing the LDPC and OFDM blocks).

The transmitted signals propagate through the wireless channel, which introduces noise, to the receiver. The receiver picks up the noisy signal with its antenna. The received signals are continuous time signals. They are processed in the RF blocks and sent to the Analog to Digital Converter (ADC) block to make them ready for the baseband processing section. The ADC converts the continuous time signal to a discrete time signal. The discrete time signal is converted to a frequency domain signal by the OFDM block, which is nothing but an implementation of the FFT. The frequency domain samples are converted into bits in the demodulator block. In the LDPC block, the encoded bits are decoded with the help of the parity bits. The retrieved bits are then sent to the MAC or protocol layer.

3.2 High level model

The high level model has been constructed for the specification in Table 3.1. The modelling setup consists of Matlab and the communication toolbox. The communication toolbox includes most of the blocks needed for the system. The unavailable blocks have been modelled in Matlab. The model consists of three main blocks: transmitter, receiver and channel. The transmitter and the receiver each consist of forward error correction (FEC) using LDPC(672,588), a 16-QAM modulator and OFDM as subcomponents.

3.2.1 Transmitter and receiver

Forward error correction (FEC)

Forward error correction has been used in both the transmitter and the receiver. The LDPC object of the communication toolbox has been used for this purpose. The LDPC(672,588) code follows the standard [1]. The table and the permuted identity matrices have been generated in Matlab. The table consists of zero matrices and permuted identity matrices.

Modulation and demodulation

Modulation converts bits into samples, and demodulation converts samples back into bits. Modulation is done in the transmitter and demodulation in the receiver. 16-QAM modulation and demodulation have been used for this model. The communication toolbox provides the modem.qammod, modem.qamdemod, modulate and demodulate functions for this purpose. The arguments for modem.qammod and modem.qamdemod are described in Table 3.2 and Table 3.3. The created objects are then passed to the modulate and demodulate functions to perform the modulation and demodulation.

Table 3.2: Argument for modem.qammod

Argument      Description                   Value
M             Modulation index              16
PhaseOffset   Offset phase of the mapping   π/2
SymbolOrder   Symbol order of the input     gray
InputType     Type of input                 bit

Table 3.3: Argument for modem.qamdemod

Argument        Description                   Value
M               Modulation index              16
PhaseOffset     Offset phase of the mapping   π/2
SymbolOrder     Symbol order of the input     gray
InputType       Type of input                 bit
DecisionType    Type of decision              LLR
NoiseVariance   Noise variance of system      1.2

Orthogonal frequency division multiplexing (OFDM)

The OFDM block has been modelled using an IFFT in the transmitter and an FFT in the receiver. In the transmitter, 141 null subcarriers, 3 DC subcarriers, 16 pilot subcarriers and 16 guard subcarriers are added to the 336 data subcarriers before the IFFT. In the receiver, the data subcarriers are extracted from the 512 subcarriers.
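The mapping of data subcarriers around an IFFT/FFT pair can be sketched as follows. This is not the Matlab model of the thesis: it is a pure-Python illustration that assumes logical index k is placed in FFT bin k mod 512, sets the pilot and guard subcarriers to zero for brevity, and uses a simple recursive radix-2 FFT. All function names are hypothetical.

```python
import cmath

N_SC = 512


def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(x):
    """Inverse FFT with 1/N scaling."""
    return [v / len(x) for v in fft(x, inverse=True)]


def data_subcarrier_indices():
    """Logical indexes left after removing the null, DC, pilot and guard sets."""
    reserved = set(range(-256, -185)) | set(range(186, 256))          # null
    reserved |= {-1, 0, 1}                                            # DC
    reserved |= set(range(-166, -11, 22)) | set(range(12, 167, 22))   # pilot
    reserved |= set(range(-185, -177)) | set(range(178, 186))         # guard
    return [k for k in range(-256, 256) if k not in reserved]


DATA_IDX = data_subcarrier_indices()


def ofdm_modulate(data_symbols):
    """Place data symbols on their subcarriers and return one time-domain symbol."""
    bins = [0j] * N_SC  # null, DC, pilot and guard bins stay zero in this sketch
    for k, s in zip(DATA_IDX, data_symbols):
        bins[k % N_SC] = s  # negative logical indexes wrap to the upper bins
    return ifft(bins)


def ofdm_demodulate(samples):
    """Transform back to the frequency domain and extract the data subcarriers."""
    bins = fft(samples)
    return [bins[k % N_SC] for k in DATA_IDX]


# Deterministic 16-QAM-like test symbols, one per data subcarrier
tx = [complex((i % 4) * 2 - 3, ((i // 4) % 4) * 2 - 3) for i in range(len(DATA_IDX))]
rx = ofdm_demodulate(ofdm_modulate(tx))
print(max(abs(a - b) for a, b in zip(tx, rx)))
```

Without a channel between the two blocks the data symbols are recovered up to floating point rounding, so any error seen in the full model comes from the channel and from finite wordlength effects.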

3.2.2 Channel

The processed signal is transmitted through the channel. The channel is wireless and exhibits multipath fading. It can be characterized in two ways: by large scale characterization or by small scale characterization [2]. Large scale characterization has been applied here, as in Equation 3.1. The path loss \(PL(d)\) is defined by the average path loss \(\overline{PL}(d)\) and the shadow fading \(X_{\sigma}\).

\[ PL(d)\,[\mathrm{dB}] = \overline{PL}(d)\,[\mathrm{dB}] + X_{\sigma}\,[\mathrm{dB}] \tag{3.1} \]

The average path loss \(\overline{PL}(d)\) can be expressed as in Equation 3.2, where \(d_0\) and \(n\) denote the reference distance and the path loss exponent. The path loss exponent \(n\) varies for different environments. This model has been built for a room environment. \(X_q\) accounts for the additional attenuation due to specific obstruction by objects.

\[ \overline{PL}(d)\,[\mathrm{dB}] = PL(d_0)\,[\mathrm{dB}] + 10\,n \log_{10}\!\left(\frac{d}{d_0}\right) + \sum_{q=1}^{Q} X_q, \quad \text{for } d \ge d_0 \tag{3.2} \]
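The large scale path loss model can be evaluated with a short function. The numerical values below (reference loss of 68 dB at 1 m, exponent 2, one 5 dB obstruction) are purely illustrative assumptions and are not taken from the thesis; the function names are hypothetical.

```python
import math


def mean_path_loss_db(d, d0, pl_d0_db, n, obstruction_losses_db=()):
    """Average path loss: reference loss + distance term + obstruction losses."""
    if d < d0:
        raise ValueError("the model is only defined for d >= d0")
    return pl_d0_db + 10 * n * math.log10(d / d0) + sum(obstruction_losses_db)


def path_loss_db(d, d0, pl_d0_db, n, obstruction_losses_db=(), shadowing_db=0.0):
    """Total path loss: average path loss plus a shadow fading term in dB."""
    return mean_path_loss_db(d, d0, pl_d0_db, n, obstruction_losses_db) + shadowing_db


# Illustrative numbers only (assumptions, not measured values)
print(mean_path_loss_db(d=10.0, d0=1.0, pl_d0_db=68.0, n=2.0))
print(path_loss_db(d=10.0, d0=1.0, pl_d0_db=68.0, n=2.0,
                   obstruction_losses_db=(5.0,), shadowing_db=1.5))
```

With a path loss exponent of 2, every tenfold increase in distance adds 20 dB to the average path loss.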

3.3 Performance evaluation

Two different performance measures have been observed in this model. One is BER as a function of SNR and the second one is BER as a function of wordlength in the FFT. These are described in the following subsections.

3.3.1 SNR vs BER

The BER improves as the SNR of the system increases. Figure 3.2 shows the results for different wordlengths: the blue line is for wordlength 8, the red line for wordlength 12 and the black line for wordlength 16. For a given wordlength, the SNR needed to achieve a target BER can therefore be read from the graph.

3.3.2 WordLength vs BER

BER as a function of wordlength is shown in Figure 3.3. Here, the SNR of the system is 35 dB. The wordlength needed to achieve a specific BER can be selected from the graph. As quantization noise is reduced for higher wordlengths, the BER improves with increasing input wordlength.
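The link between wordlength and quantization noise can be illustrated with a small Python sketch. It does not reproduce the BER simulation of the thesis; it only quantizes a test sine wave to a given number of bits and measures the signal-to-quantization-noise ratio, which grows by roughly 6 dB per additional bit. The test signal, its amplitude and the rounding and saturation scheme are assumptions made for this example.

```python
import math

def quantize(x, bits):
    """Round x in [-1, 1) to signed fixed point with `bits` bits, saturating."""
    scale = 2 ** (bits - 1)
    q = max(-scale, min(scale - 1, round(x * scale)))
    return q / scale

def sqnr_db(bits, samples=4096, cycles=37, amplitude=0.99):
    """Signal-to-quantization-noise ratio of a quantized sine wave, in dB."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(samples):
        x = amplitude * math.sin(2 * math.pi * cycles * i / samples + 0.1)
        e = quantize(x, bits) - x
        signal_power += x * x
        noise_power += e * e
    return 10 * math.log10(signal_power / noise_power)

# The three wordlengths studied in the thesis.
results = {bits: sqnr_db(bits) for bits in (8, 12, 16)}
```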

Figure 3.2: BER as a function of SNR (x-axis: signal to noise ratio in dB; y-axis: bit error rate; curves for 8, 12 and 16 bit wordlengths).

Figure 3.3: BER as a function of wordlength (x-axis: wordlength; y-axis: bit error rate).

Chapter 4 Background of FFT

This chapter gives a short description of the FFT algorithm, the different architectures and the basic building blocks of these architectures. Further information about the algorithm and the architectures can be found in [3–7].

4.1 Theoretical background

Some claim that the modern world started in 1965, when J. Cooley and J. Tukey published their efficient method for numerical computation of the Fourier transform. Others point out that the method goes back to Gauss in the early 1800s: the idea that lies at the heart of the algorithm is clearly present in an unpublished paper that appeared posthumously in 1866. Either way, continuous signals are nowadays processed by discrete methods, since computers and digital processing systems cannot work with continuous sums. The Fourier transform represents a general function as a sum of trigonometric functions. The discrete Fourier transform (DFT), which the FFT computes efficiently, transforms a time domain signal into a frequency domain signal according to:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{kn}, \quad k = 0, \ldots, N-1 \qquad (4.1)$$

In Equation 4.1, $X[k]$ and $x[n]$ are the complex output and input of the N-point FFT, respectively, where $n$ is the time index and $k$ is the frequency index. $W_N^{kn}$ is the twiddle factor, which is defined in Equation 4.2.

$$W_N^{kn} = e^{-j 2\pi kn/N} = \cos\!\left(\frac{2\pi kn}{N}\right) - j \sin\!\left(\frac{2\pi kn}{N}\right) \qquad (4.2)$$
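Equations 4.1 and 4.2 translate directly into a few lines of Python. This is a minimal reference sketch of the direct DFT with the sum running over n = 0, ..., N − 1; it is meant for checking small cases, not as an efficient implementation.

```python
import cmath

def twiddle(N, k, n):
    """Twiddle factor W_N^{kn} = exp(-j*2*pi*k*n/N) from Equation 4.2."""
    return cmath.exp(-2j * cmath.pi * k * n / N)

def dft(x):
    """Direct evaluation of Equation 4.1 for k = 0 .. N-1."""
    N = len(x)
    return [sum(x[n] * twiddle(N, k, n) for n in range(N)) for k in range(N)]

impulse_spectrum = dft([1, 0, 0, 0])   # a unit impulse has a flat spectrum
constant_spectrum = dft([1, 1, 1, 1])  # a constant only excites bin k = 0
```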

For a better understanding of the operations performed by the FFT, the FFT is represented by its signal flow graph (SFG). Examples of signal flow graphs are shown in Figures 4.1, 4.2, 4.3 and 4.4. The SFGs in the figures consist of butterflies and complex rotations. For example, Figure 4.1 represents a radix-2 butterfly, which computes:

X[0] = x[0] + x[1]

X[1] = x[0] − x[1]

Figure 4.1: SFG of radix-2.

Figure 4.2 shows a radix-4 butterfly. A radix-4 butterfly includes a complex multiplication by $e^{-j\pi/2} = -j$. This is a trivial operation: from a hardware point of view, it can be done without any hardware cost.
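The two butterflies can be sketched in Python as follows. The radix-4 butterfly is written here as two layers of radix-2 butterflies with one rotation by −j in between, which is one common decomposition and not necessarily the exact structure of Figure 4.2. The rotation by −j only swaps the real and imaginary parts and changes one sign, so no multiplier is involved.

```python
def bf2(a, b):
    """Radix-2 butterfly: sum and difference of the two inputs."""
    return a + b, a - b

def rot_minus_j(z):
    """Trivial rotation (a + jb) * (-j) = b - ja, done by swapping parts."""
    return complex(z.imag, -z.real)

def bf4(x0, x1, x2, x3):
    """Radix-4 butterfly (4-point DFT) built from four radix-2 butterflies."""
    a0, a1 = bf2(x0, x2)
    b0, b1 = bf2(x1, x3)
    b1 = rot_minus_j(b1)
    X0, X2 = bf2(a0, b0)
    X1, X3 = bf2(a1, b1)
    return X0, X1, X2, X3

# The sequence 1, j, -1, -j is a complex exponential at frequency index 1,
# so all its energy ends up in X[1].
tone = bf4(1, 1j, -1, -1j)
```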

Figure 4.2: SFG of radix-4.

The signal flow graph in Figure 4.3 shows a 16-point radix-2 DIF FFT. The number φ after every stage indicates a rotation by $e^{-j\frac{2\pi}{N}\phi}$. The input sequence is in natural order, whereas the outputs are in bit-reversed order. On the other hand, Figure 4.4 shows the signal flow graph of a 16-point radix-2 DIT FFT. In this case, the inputs are in bit-reversed order and the outputs are in natural order. Besides, the rotations are placed differently in the two graphs.
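The ordering property of the DIF decomposition can be checked with a short Python sketch of a textbook radix-2 DIF FFT. It is a software model under the usual power-of-two length assumption, not a description of the hardware: the inputs are taken in natural order and the results come out in bit-reversed order.

```python
import cmath

def fft_dif_radix2(x):
    """Radix-2 DIF FFT; len(x) must be a power of two.
    Returns the spectrum in bit-reversed order."""
    a = list(x)
    N = len(a)
    span = N // 2
    while span >= 1:
        for start in range(0, N, 2 * span):
            for i in range(span):
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                u, v = a[start + i], a[start + i + span]
                a[start + i] = u + v              # upper butterfly output
                a[start + i + span] = (u - v) * w  # lower output, then rotation
        span //= 2
    return a

def bit_reverse(i, bits):
    """Reverse the `bits` least significant bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

# Output position p of a 16-point DIF FFT holds X[bit_reverse(p, 4)].
output_order = [bit_reverse(p, 4) for p in range(16)]
```

Reading the result at position `bit_reverse(k, bits)` gives X[k] in natural order.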

Figure 4.3: SFG of 16-point radix-2 decimation in frequency.

4.2 Architecture of the FFT

The architecture of an FFT can be divided into different parts: butterflies, complex rotators, memories for the twiddle factors, circuits for data management, and control. Butterflies and rotators carry out the mathematical operations of the signal flow graph. Basic pipelined architectures for the FFT operation are discussed below. The basic components of these architectures are discussed in the next section of this chapter.

4.2.1 Feedforward architectures

Radix-2

A radix-2 feedforward architecture is depicted in Figure 4.5. The input sequence is broken down into two parallel data streams flowing forward, and the reordering circuits schedule the correct distance between the data elements entering each butterfly. In this architecture both butterflies and multipliers have a utilization ratio of 100%. In Figure 4.5, C2 are the switches and BF2 are the radix-2 butterflies. The numbers next to the switches are the lengths of the buffers. A detailed description of the architecture can be found in [3].

Figure 4.4: SFG of 16-point radix-2 decimation in time.

Figure 4.5: Radix-2 feedforward architecture.

Radix-4

A radix-4 feedforward architecture is depicted in Figure 4.6. In Figure 4.6, C4 and BF4 are the switches and the radix-4 butterflies. The lengths of the buffers are shown by the numbers in the boxes. Here, the input sequence is broken into four parallel data streams and the proper distance between data elements is kept by the shuffler. In this architecture the multipliers and the butterflies have a utilization ratio of 100%. This architecture is well suited to high throughput applications and is described in [8].

Figure 4.6: Radix-4 feedforward architecture.

4.2.2 Single path delay feedback

Radix-2

A radix-2 feedback architecture is depicted in Figure 4.7. This architecture uses the registers efficiently by storing one butterfly output in the feedback shift registers, while a single data stream goes through the multiplier at every stage. However, the utilization of the complex multipliers and butterflies is only 50%. This architecture is well suited to area efficient implementations and is described in [9].

Figure 4.7: Radix-2 feedback architecture.

Radix-4

A radix-4 single path feedback architecture is depicted in Figure 4.8. In this architecture the utilization of the multipliers is increased to 75%. However, the radix-4 butterfly contains at least 8 complex adders and its utilization drops to only 25%. More detail about the architecture can be found in [10].

The comparison of the different pipelined architectures is given in Table 4.1.

4.3 Building blocks of the FFT

These architectures use some basic building blocks, such as complex multipliers, butterflies, ROM tables, RAMs and shift registers. These blocks are discussed in the following subsections.


Figure 4.8: Radix-4 feedback architecture.

Table 4.1: Comparison of pipelined architectures for the N-point FFT

Architecture                  Multipliers      Adders      Control
Radix-2 feedforward [11]      2(log₄N − 1)     4 log₄N     Simple
Radix-4 feedforward [8]       3(log₄N − 1)     8 log₄N     Simple
Radix-2 feedback [11]         2(log₄N − 1)     4 log₄N     Simple
Radix-4 feedback [11, 12]     log₄N − 1        8 log₄N     Medium

4.3.1 Complex multiplier

The complex multiplier is shown in Figure 4.9. A complex multiplier computes (a + j · b)(c + j · d) = (ac − bd) + j · (ad + bc). Here a + j · b is the multiplicand and c + j · d is the multiplier; both have real and imaginary parts. This operation can be done by four real multipliers, one adder and one subtractor. The subtractor can be implemented by an adder whose second operand is inverted and whose carry-in is set to 1.
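A minimal Python sketch of this operation on integer samples is given here. The second function models keeping the output wordlength by discarding LSBs with an arithmetic right shift; the parameter `frac_bits` and the example coefficient scaling are assumptions for illustration and are not taken from the VHDL design.

```python
def complex_multiply(a, b, c, d):
    """(a + jb)(c + jd) using four real multiplications,
    one subtraction and one addition."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag

def fixed_point_multiply(a, b, c, d, frac_bits):
    """Same product with the `frac_bits` least significant bits discarded.
    Python's >> on negative integers is an arithmetic shift (rounds down)."""
    real, imag = complex_multiply(a, b, c, d)
    return real >> frac_bits, imag >> frac_bits

# Sample (100 - 50j) times a coefficient near exp(-j*pi/4), stored as
# (90, -90) with 7 fractional bits (90/128 is about 0.70).
product = fixed_point_multiply(100, -50, 90, -90, 7)
```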

4.3.2 Butterfly

The butterfly is depicted in Figure 4.10. For the two complex inputs a and b of the butterfly, the outputs are a + b and a − b. This operation can be done by one complex addition and one complex subtraction. Again, the subtraction can be done with an adder by inverting the second operand and setting the carry-in to 1.
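The adder-based subtraction relies on the two's complement identity a − b = a + (NOT b) + 1. The following Python sketch models this at a fixed width; the function names are hypothetical and the code is a behavioural illustration, not the gate-level structure of the design.

```python
def add_with_carry(a, b, carry_in, width):
    """Model of a `width`-bit adder with carry-in; the carry-out is dropped."""
    mask = (1 << width) - 1
    return (a + b + carry_in) & mask

def subtract(a, b, width):
    """a - b computed as a + (bitwise NOT b) + 1 on the same adder."""
    mask = (1 << width) - 1
    return add_with_carry(a & mask, ~b & mask, 1, width)

def to_signed(value, width):
    """Interpret a `width`-bit pattern as a two's complement number."""
    return value - (1 << width) if value & (1 << (width - 1)) else value

difference = to_signed(subtract(5, 9, 8), 8)
```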

4.3.3 ROM

A ROM is used to store the coefficients of the complex multipliers. Each coefficient is stored at a specific address of the ROM and is accessed through that address. ROMs of different sizes are used depending on the size of the FFT and the input wordlength. A ROM is depicted in Figure 4.11. Here, the address is 5 bits and the wordlength is 8 bits.

4.3.4 Buffers

Figure 4.9: Complex multiplier.

Figure 4.10: Butterfly.

Buffers are used to store the samples as well as to arrange them in the proper sequence for the butterflies. The buffers can be implemented by memories or shift registers. Memories are typically used for long buffers and shift registers for short ones. A memory is depicted in Figure 4.12, where two pointers point to the read and write addresses of the memory. In a shift register, on the other hand, samples are shifted to the next register every clock cycle. A shift register is depicted in Figure 4.13.
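Both buffer styles delay a sample stream by L clock cycles, which the following Python sketch models behaviourally. The class names are hypothetical, and the memory model uses a single pointer that is read before it is written, which is a simplifying assumption compared with the separate read and write pointers of Figure 4.12.

```python
from collections import deque

class ShiftRegister:
    """L registers in a chain; every clock all samples move one position."""
    def __init__(self, length, fill=0):
        self.regs = deque([fill] * length, maxlen=length)

    def clock(self, sample):
        out = self.regs[0]          # oldest sample leaves the chain
        self.regs.append(sample)    # maxlen drops regs[0] automatically
        return out

class PointerMemory:
    """Memory of L words; only the pointer moves, the data stays in place."""
    def __init__(self, length, fill=0):
        self.mem = [fill] * length
        self.ptr = 0

    def clock(self, sample):
        out = self.mem[self.ptr]    # read the oldest word
        self.mem[self.ptr] = sample # overwrite it with the new sample
        self.ptr = (self.ptr + 1) % len(self.mem)
        return out

sr, pm = ShiftRegister(3), PointerMemory(3)
sr_out = [sr.clock(x) for x in range(1, 7)]
pm_out = [pm.clock(x) for x in range(1, 7)]
```

The two models produce the same output stream; in hardware they differ in how many storage elements toggle per clock cycle.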

Figure 4.11: ROM for coefficients.

Figure 4.12: Memory with pointer.

Figure 4.13: Shift register.

Chapter 5 Implementation of FFT on ASIC

This chapter focuses on the implementation of an FFT for the IEEE 802.15.3c (HSIPHY mode) standard. The CORE65LPSVT technology library has been used for this implementation. This library is mainly used for ultra low power applications. For this implementation, a 0.8 V supply voltage and a 330 MHz clock have been used. Table 5.1 shows the specification of the ASIC. The specifications for the FFT are noted in [1]. The FFT shall be 512 points and the sample rate 2.64 GS/s. In order to meet the throughput requirement, 8 parallel samples have been used as input of the FFT. Table 5.2 shows the requirements of the FFT. Among the 512 sub-carriers, 336 are data sub-carriers, 16 are pilot sub-carriers, 16 are guard sub-carriers, 3 are DC sub-carriers and 141 are null sub-carriers.

Table 5.1: Constraints of the ASIC

ASIC Constraint           Value
Library                   CORE65LPSVT
Process                   65 nm
Global Power Supply       0.8 V
Global Clock Frequency    330 MHz

Table 5.2: Design constraints of the FFT

Design Parameter          Value
Length of the FFT         512
Sample rate               2.64 GS/s
Samples in parallel       8


5.1 Design issue related to the FFT processor

FFT architectures can be divided into two categories: pipelined architectures (such as feedforward and feedback) and memory-based architectures. These architectures are described in [7, 13–15] and [16–18], respectively. On one hand, pipelined architectures have the advantage of high throughput, but they have a high area cost for large FFTs. On the other hand, memory-based architectures have the advantage of low area cost, but their throughput is often limited by the memory access bandwidth and the available number of processing elements. In order to meet the requirements of the IEEE 802.15.3c standard, a high throughput FFT processor needs to be designed. For high throughput applications, a pipelined FFT architecture is most often adopted.

Among the different pipelined architectures, single path delay feedback architectures need fewer memories and less hardware than multipath feedforward architectures. However, single path delay feedback architectures only reach 50% utilization of the processing units compared to multipath feedforward architectures. On the other hand, multipath feedforward architectures can process two or more samples in parallel, whereas single path feedback ones only process one sample per clock cycle. Therefore, for the same throughput, feedforward architectures can operate at a slower clock than feedback architectures, and the slower clock allows lower power. However, these architectures increase the hardware cost significantly, as more complex rotators, butterflies and memories are needed. The advantages and common requirements of the architectures listed above are well described in [19–21]. A radix-8 architecture with 8 parallel data has been proposed for this application. As the throughput of the FFT is quite high, 8 parallel data reduce the clock frequency, and the direct implementation of the radix-8 butterfly needs 8 parallel data. Besides, the proposed architecture reduces the number of multipliers and complex adders.

Finally, the processing elements of the data path can operate at a maximum clock frequency of 500 MHz (2 ns delay). With 8 parallel samples, the 2.64 GS/s sample rate requires a clock of 2.64 GHz / 8 = 330 MHz, which is below this limit. Therefore, a 330 MHz clock has been used for the pipelined architecture, and 8 parallel samples are a good choice to reduce the input clock frequency.
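The relation between sample rate, parallelism and clock frequency can be written out as a short Python calculation. The sample rate and the 500 MHz limit are the figures from the text; restricting the search to powers of two is an assumption made here because radix-2, radix-4 and radix-8 structures consume 2, 4 or 8 samples at a time.

```python
def clock_for(sample_rate_hz, parallel_samples):
    """Clock frequency needed to sustain the sample rate."""
    return sample_rate_hz / parallel_samples

def min_parallelism(sample_rate_hz, max_clock_hz):
    """Smallest power-of-two parallelism whose clock fits under the limit."""
    p = 1
    while clock_for(sample_rate_hz, p) > max_clock_hz:
        p *= 2
    return p

SAMPLE_RATE_HZ = 2.64e9   # 2.64 GS/s from Table 5.2
MAX_CLOCK_HZ = 500e6      # 2 ns delay of the processing elements

parallel = min_parallelism(SAMPLE_RATE_HZ, MAX_CLOCK_HZ)
clock_hz = clock_for(SAMPLE_RATE_HZ, parallel)
```

With 4 parallel samples the clock would have to be 660 MHz, which exceeds the limit, so 8 is the smallest power of two that fits.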

5.2 Radix-8

Equation 4.1 shows that, for the direct computation of each value of k, N complex multiplications (4N real multiplications and 2N real additions) and N − 1 complex additions (2N − 2 real additions) are needed, i.e. 4N − 2 real additions in total. The signal flow graph for the radix-8 decimation in time is depicted in Figure 5.1. The $W_8^0$ coefficients in the SFG can be ignored, because they represent a multiplication by 1. Figure 5.1 shows that the samples arrive at the input of the SFG in bit-reversed order, whereas the outputs are in natural order.

The SFG of radix-8 decimation in frequency is depicted in Figure 5.2. Input samples arrive in natural order and the outputs are in bit-reversed order. The complex multiplications change position in the SFG. Apart from that, the decimation in frequency decomposition uses the same number of complex multiplications and additions.

Figure 5.1: SFG of radix-8 decimation in time.

5.3 Proposed architecture

A 512-point FFT processor has been proposed for this application. The architecture of the FFT and its datapath are depicted in Figure 5.3 and Figure 5.4. The architecture consists of three main parts: fourteen ROM tables that hold the coefficients of the multipliers, the data path that computes the FFT, and a controller that selects the ROM coefficients and controls the data path. The controller has been implemented by a simple six-bit counter. Figure 5.4 shows that the datapath consists of three stages of radix-8 butterflies. The first two stages of the FFT include a total of 14 complex rotators. The third stage has only a radix-8 butterfly. Shuffler 1 and shuffler 4 have been used before and after the FFT, in order to provide input and output samples in natural order. Shuffler 2 and shuffler 3 have been used inside the FFT to maintain the proper order of the data. The different blocks of the FFT are described as follows.

5.3.1 Radix-8 butterfly

The implementation of the radix-8 butterfly is depicted in Figure 5.5. For this architecture, the radix-8 butterfly has been realized by direct implementation of the butterflies and constant complex rotations. There are twelve butterflies, two constant complex rotators and three trivial rotators. The radix-8 butterfly has three stages. The first stage of butterflies is followed by two constant complex rotations and one trivial rotation by −j. The second stage is followed by two trivial rotations by −j.
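The structure described above can be checked with a behavioural Python sketch of an 8-point decimation in frequency butterfly. It uses twelve radix-2 butterflies in three stages, two constant rotations (by $W_8^1$ and $W_8^3$) and three trivial rotations by −j. The indexing and the final reordering to natural order are choices made for this sketch and may differ from the wiring of Figure 5.5.

```python
import cmath

W8_1 = cmath.exp(-2j * cmath.pi * 1 / 8)   # constant rotator
W8_3 = cmath.exp(-2j * cmath.pi * 3 / 8)   # constant rotator

def bf2(a, b):
    """Radix-2 butterfly."""
    return a + b, a - b

def rot_minus_j(z):
    """Trivial rotation by -j: swap real and imaginary parts, negate one."""
    return complex(z.imag, -z.real)

def radix8_butterfly(x):
    """8-point DFT from 12 radix-2 butterflies; returns X in natural order."""
    s = [0j] * 8
    for i in range(4):                      # stage 1: four butterflies
        s[i], s[i + 4] = bf2(x[i], x[i + 4])
    s[5] *= W8_1
    s[6] = rot_minus_j(s[6])
    s[7] *= W8_3
    t = [0j] * 8
    for base in (0, 4):                     # stage 2: four butterflies
        for i in range(2):
            t[base + i], t[base + i + 2] = bf2(s[base + i], s[base + i + 2])
        t[base + 3] = rot_minus_j(t[base + 3])
    y = [0j] * 8
    for i in range(0, 8, 2):                # stage 3: four butterflies
        y[i], y[i + 1] = bf2(t[i], t[i + 1])
    X = [0j] * 8
    for pos, k in enumerate([0, 4, 2, 6, 1, 5, 3, 7]):
        X[k] = y[pos]                       # undo the bit-reversed order
    return X
```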

Figure 5.2: SFG of radix-8 decimation in frequency.

Figure 5.3: Architecture of the FFT.

Figure 5.5 shows the interconnection network of the radix-8 butterfly. Trivial rotations (−1, j and −j) have been realized by small modifications in the butterfly at no extra hardware cost. Multiplication by −1 is done by interchanging the inputs at the input port. Multiplication by j is done by interchanging the real and imaginary outputs. Multiplication by −j is done by interchanging both the input and the output signals, combining what is done for −1 and j.

5.3.2 Shuffler

Figure 5.4: Data path of the FFT.

Figure 5.5: Implementation of radix-8 butterfly.

Figure 5.6 shows the basic block of the shuffler. The shuffler consists of two multiplexers and input and output buffers. The input and output buffer lengths vary at different stages of the datapath. Both memories and shift registers have been used for the implementation of the buffers. A study on memories and shift registers has shown that memories take less area and consume less power for long buffers, whereas shift registers consume less power and take less area for short buffers. Samples are stored in the buffers for control signal 0. Samples of the output buffers are replaced by those of the input buffers for control signal 1.
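A behavioural Python sketch of one common form of such a shuffling circuit is given here: the lower input passes through a buffer of length L, two multiplexers either pass or swap the streams, and the upper output passes through a second buffer of length L. Whether this matches Figure 5.6 in every detail, including the polarity of the control signal, is an assumption; the class and signal names are hypothetical.

```python
from collections import deque

class Shuffler:
    """Two buffers of length L and two multiplexers (assumed structure)."""
    def __init__(self, length, fill=None):
        self.in_buf = deque([fill] * length)    # delays the lower input
        self.out_buf = deque([fill] * length)   # delays the upper output

    def clock(self, in0, in1, sel):
        delayed = self.in_buf.popleft()
        self.in_buf.append(in1)
        if sel == 0:
            upper, out1 = in0, delayed          # pass straight through
        else:
            upper, out1 = delayed, in0          # swap the two streams
        out0 = self.out_buf.popleft()
        self.out_buf.append(upper)
        return out0, out1

# L = 1 with the control signal toggling every clock cycle: two parallel
# streams a and b are regrouped into pairs of consecutive elements.
shuffler = Shuffler(1)
upper_in = ["a0", "a1", "a2", "a3", None]
lower_in = ["b0", "b1", "b2", "b3", None]
outputs = [shuffler.clock(u, l, t % 2)
           for t, (u, l) in enumerate(zip(upper_in, lower_in))]
```

After a latency of L cycles the outputs are (a0, a1), (b0, b1), (a2, a3), (b2, b3): samples that were L positions apart in different streams have been exchanged.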

Shuffler 1 is shown in Figure 5.7. Twelve shuffling circuits have been used in three stages, with a different buffer size in each stage: the first, second and third stages have input and output buffers of length 32, 16 and 8, respectively. Three control signals have been used to control the shufflers. For the first stage the control signal shall change every 32 clock cycles, as the length of the input and output buffers is 32. The second and third control signals must change every 16 and 8 clock cycles, respectively. However, the second and third control signals shall be delayed by 32 and 48 clock cycles, respectively.

Shuffler 2 and shuffler 3 also have three stages. Figures 5.8 and 5.9 show shuffler 2 and shuffler 3, respectively, together with their interconnections. The buffer lengths are 1, 2 and 4 for shuffler 2 and 8, 16 and 32 for shuffler 3. Three control signals have been used for the control of the three stages. Control signals 1, 2 and 3 for shuffler 2 shall change every 1, 2 and 4 clock cycles respectively, depending on the length of the input and output buffers in each stage.

Figure 5.6: Shuffling circuit.

Figure 5.7: Block diagram of shuffler 1.

Shuffler 4 is depicted in Figure 5.10. Twenty-four shuffling circuits have been arranged in six stages, and six control signals have been used to control the stages. The lengths of the input and output buffers of the six stages are 32, 4, 16, 2, 8 and 1. The control signals of the six stages must accordingly toggle every 32, 4, 16, 2, 8 and 1 clock cycles.

5.4 ROMs for the coefficients

Fourteen ROMs in two stages have been used for this architecture: seven memories of 64 addresses for the first stage and seven memories of 8 addresses for the second stage. The 64 addresses of the first stage ROMs can be represented by 6 bits, and 64 coefficients have been stored in each ROM. The content of the ROM at each address is $\cos(\frac{2\pi}{N}\phi) - j \sin(\frac{2\pi}{N}\phi)$. For the 8 bit implementation, $\cos(\frac{2\pi}{N}\phi)$ and $\sin(\frac{2\pi}{N}\phi)$ have been represented in 8 bits. The value of φ varies with the address and the ROM. The value of φ for the address $b_5b_4b_3b_2b_1b_0$ of the X-th ROM is $X \times (b_2b_1b_0b_5b_4b_3)_2$. Here, X is the number of the memory, X = 1, 2, ..., 7. As an example, the value of φ for address 001100 of ROM 4 is $4 \times (100001)_2$. So, φ is equal to 132.

Figure 5.8: Block diagram of shuffler 2.

Figure 5.9: Block diagram of shuffler 3.

Again, there are seven ROMs of 8 addresses in this architecture. Each ROM has addresses from 0 to 7, which can be represented by 3 bits. The same expression $\cos(\frac{2\pi}{N}\phi) - j \sin(\frac{2\pi}{N}\phi)$ has been used to calculate the content of the ROM. The value of φ for the address $b_2b_1b_0$ of ROM X is $X \times (b_2b_1b_0)_2$, where X = 1, 2, ..., 7. As an example, the value of φ for address 101 of ROM 5 can be calculated as $5 \times (101)_2 = 25$.
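The address-to-φ mapping of both ROM stages can be expressed as a small Python sketch. The function names are hypothetical; the bit manipulation follows the two formulas in the text, and the two worked examples from the text serve as checks.

```python
def phi_first_stage(rom, address):
    """phi = X * (b2 b1 b0 b5 b4 b3)_2 for a 6-bit address b5..b0, X = 1..7.
    The two 3-bit halves of the address swap places."""
    low = address & 0b111            # b2 b1 b0
    high = (address >> 3) & 0b111    # b5 b4 b3
    return rom * ((low << 3) | high)

def phi_second_stage(rom, address):
    """phi = X * (b2 b1 b0)_2 for a 3-bit address, X = 1..7."""
    return rom * (address & 0b111)

example_first = phi_first_stage(4, 0b001100)   # address 001100 of ROM 4
example_second = phi_second_stage(5, 0b101)    # address 101 of ROM 5
```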

5.5 Controller

The controller for the FFT has been implemented by a simple six-bit counter. The signals of the counter have been used to drive both the control signals of the datapath and the addresses of the ROMs. The control for the datapath is depicted in Figure 5.11. The control signals of the shufflers are derived from the signals of the counter. Fifteen control signals have been mapped to the different signals of the counter depending on the period of each control signal. The MSB of the counter has been mapped to those control signals that have a period of 64 clock cycles, whereas the LSB of the counter has been mapped to those control signals that have a period of 2 clock cycles. Control signals 2 to 15 of the data path are each delayed by half of the sum of the periods of all preceding control signals. An equal number of buffers has been used for these delays. The delays and periods of the signals are listed in Table 5.3.

Figure 5.10: Block diagram of shuffler 4.

Table 5.3: Selection signal information

Control Signal    Counter signal    Period    Delays
1                 Count(5)          64        0
2                 Count(4)          32        32
3                 Count(3)          16        48
4                 Count(0)          2         56
5                 Count(1)          4         57
6                 Count(2)          8         59
7                 Count(3)          16        63
8                 Count(4)          32        71
9                 Count(5)          64        87
10                Count(5)          64        119
11                Count(2)          8         151
12                Count(4)          32        155
13                Count(1)          4         171
14                Count(3)          16        173
15                Count(0)          2         181
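The rule that each control signal waits for half of the summed periods of the preceding signals can be checked against Table 5.3 with a few lines of Python. The list of periods is copied from the table; the function names are hypothetical.

```python
# Periods of control signals 1..15, copied from Table 5.3.
PERIODS = [64, 32, 16, 2, 4, 8, 16, 32, 64, 64, 8, 32, 4, 16, 2]

def control_delays(periods):
    """Delay of each signal = half the sum of all preceding periods."""
    delays, total = [], 0
    for period in periods:
        delays.append(total // 2)
        total += period
    return delays

def counter_bit(period):
    """Counter bit i toggles with period 2**(i + 1) clock cycles."""
    return period.bit_length() - 2

delays = control_delays(PERIODS)
bits = [counter_bit(p) for p in PERIODS]
```

The computed delays and counter bits reproduce the Delays and Counter signal columns of the table.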

The controller for the ROM addresses is depicted in Figure 5.12. The fourteen ROMs are controlled by the same counter. The six counter signals have been mapped to the address bits of the first 7 ROMs, as their addresses are represented by 6 bits. The three LSBs of the counter have been used to control the address bits of the next 7 ROMs. Equalizing delays have been used for the two stages of ROMs: 56 and 63 delays for the 1st stage and 2nd stage ROMs, respectively.

Figure 5.11: Datapath controller.

Figure 5.12: ROM controller.

5.6 Methodology

For the implementation, different design tools have been used: Modelsim for the functionality testing, Design compiler for the synthesis and Nanosim for the power calculation. VHDL has been used as a hardware description language. The basic blocks for the architecture have been programmed in VHDL. As the FFT has been implemented for different wordlengths, generic and generate have been used for parameterizable wordlength of the blocks. Later the blocks have been used to build the FFT. Design compiler and Nanosim have been used to calculate the area and power consumption of the FFT.


5.6.1 Hardware implementation in VHDL

The entity of the complex multiplier is depicted in Figure 5.13. The generics WM1 and WM2 have been used to change the wordlength of multiplier and multiplicand. The basic block of the complex multiplier is a real value multiplier. A Wallace tree array multiplier has been used for this implementation. A pipeline of 5 stages has been used in the adder tree to reduce the critical path as well as to reduce the latency. The complex multiplier maintains the same input and output wordlength by discarding the LSB bits from the output.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_unsigned.all;

entity complex_multiplier is
  generic(
    WM1 : integer := 3;
    WM2 : integer := 2);
  port(
    in_real    : in  std_logic_vector(WM1-1 downto 0);
    in_imag    : in  std_logic_vector(WM1-1 downto 0);
    coeff_real : in  std_logic_vector(WM2-1 downto 0);
    coeff_imag : in  std_logic_vector(WM2-1 downto 0);
    clk        : in  std_logic;
    reset      : in  std_logic;
    mult_real  : out std_logic_vector(WM1-1 downto 0);
    mult_imag  : out std_logic_vector(WM1-1 downto 0));
end complex_multiplier;

Figure 5.13: Entity of complex multiplier.

The entity of the butterfly is shown in Figure 5.14. Generics have been used to change the wordlength and the truncation. The butterfly keeps the input wordlength when TE equals 0 and increases it by one bit when TE equals 1. This basic radix-2 butterfly is used to build the radix-8 butterfly.

The entity of the shuffler is depicted in Figure 5.15. The generics WL, Lin, Lout and BT have been used to change the wordlength, the lengths of the input and output buffers, and the selection between memory and shift registers. A study of memories and shift registers has shown that memories consume less power and take less area for long buffers, and that the opposite holds for short buffers. For this implementation, both buffer types have been taken into consideration to optimize the power and area.

These basic components have been used to build the radix-8 butterfly and the shufflers. Twiddle factors for the complex multipliers have been calculated by Matlab. Matlab has been used to generate the VHDL code for the ROMs. These ROMs, radix-8 butterfly, shufflers and complex multiplier have been used to build the FFT. A simple six-bit counter has been used to control the FFT.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity butterfly is
  generic(
    WL : integer := 3;
    TE : integer := 1);
  port(
    in_1_real  : in  std_logic_vector(WL-1 downto 0);
    in_1_imag  : in  std_logic_vector(WL-1 downto 0);
    in_2_real  : in  std_logic_vector(WL-1 downto 0);
    in_2_imag  : in  std_logic_vector(WL-1 downto 0);
    clk        : in  std_logic;
    out_1_real : out std_logic_vector(WL-1+TE downto 0);
    out_1_imag : out std_logic_vector(WL-1+TE downto 0);
    out_2_real : out std_logic_vector(WL-1+TE downto 0);
    out_2_imag : out std_logic_vector(WL-1+TE downto 0));
end butterfly;

Figure 5.14: Entity of a radix-2 butterfly.

5.6.2 Functionality testing

The functionality of the FFT and the individual components has been tested in Modelsim. Test benches have been built for the individual components and their functionality has been tested. Input and output sequences for the FFT have been generated in Matlab, and the same input sequences have been used in the test bench of the FFT. The output sequences of the FFT have been compared with the output sequences generated by Matlab. In addition, the data management of the datapath has been tested without the radix-8 butterflies and complex multipliers, by applying natural-order input sequences to the circuit with the proper control signals.

5.6.3 Synthesizing and area calculation

The FFT and the individual components have been synthesized using Design compiler with the CORE65LPSVT library. This library is for the 65 nm process technology. Design compiler has been used to synthesize the design and optimize its area for a specific clock, as well as to generate the netlist of the design. The area of the FFT has been calculated by Design compiler.

5.6.4 Power calculation

The power consumption has been calculated by Nanosim. Random sequences for the FFT and the individual components have been generated using Matlab. The netlist generated by Design compiler and the random sequences have been used to calculate the power. Voltage scaling has been done for the design by changing the supply voltage in the SPICE file.

library ieee;
use ieee.std_logic_1164.all;

entity shuffler is
  generic(
    WL   : integer := 10;
    Lin  : integer := 20;
    Lout : integer := 10;
    BT   : integer := 1);
  port(
    in0  : in  std_logic_vector(WL-1 downto 0);
    in1  : in  std_logic_vector(WL-1 downto 0);
    clk  : in  std_logic;
    sel  : in  std_logic;
    out0 : out std_logic_vector(WL-1 downto 0);
    out1 : out std_logic_vector(WL-1 downto 0));
end shuffler;

Figure 5.15: Entity of shuffling circuit.

5.7 Design for Low Power

Dynamic power of any circuit can be illustrated by:

$$P_{dynamic} = \frac{1}{2}\,\alpha f c V_{dd}^{2} \qquad (5.1)$$

In the equation, $c$ is the area capacitance, $V_{dd}$ is the supply voltage, $f$ is the clock frequency and $\alpha$ is the switching activity. The dynamic power can be reduced by reducing the supply voltage $V_{dd}$, the area capacitance $c$ and the clock frequency $f$. The area capacitance is indirectly related to the clock frequency: a design synthesized for a lower clock frequency takes less area and therefore has a lower capacitance. To optimize the power of the FFT, both frequency scaling and voltage scaling have been applied.

Initially, the FFT was synthesized for 380 MHz so that it could operate at any clock below 380 MHz. Due to the higher clock, the FFT takes more area, which results in a higher capacitance and more power consumption. Voltage scaling can reduce the power, but it does not change the area capacitance. Frequency scaling has therefore been done to reduce the area and the power consumption: a 330 MHz clock has been used to reduce the area as well as the capacitance of the FFT. The bar charts in Figure 5.16 show the difference in power and area between the two clocks. The blue bars show the area and power for 380 MHz and the brown bars for 330 MHz. The results are shown for wordlengths 8, 12 and 16.

Voltage scaling has been done to reduce the power consumption of the FFT. Initially, the power of the FFT has been calculated for 1.2 V and there was a slack time of 0.5 ns. The voltage has been reduced from 1.2 V to 0.8 V and the slack time has been reduced as well. The bar chart in Figure 5.17 shows the change of power after voltage scaling. In the figure the blue bars show the power consumption for 1.2 V and the brown bars show the power consumption for 0.8 V.
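The effect of voltage scaling on the dynamic power follows from Equation 5.1 and can be estimated with a short Python sketch. The switching activity and capacitance values are arbitrary placeholders, since only the ratio matters; the estimate assumes that α, f and c stay unchanged and it ignores static power, so it is not a substitute for the simulated results of Figure 5.17.

```python
def dynamic_power(alpha, f_hz, c_farad, vdd):
    """Equation 5.1: P = 0.5 * alpha * f * c * Vdd^2."""
    return 0.5 * alpha * f_hz * c_farad * vdd ** 2

ALPHA = 0.1        # placeholder switching activity (assumption)
C_FARAD = 1e-9     # placeholder area capacitance (assumption)
F_HZ = 330e6       # clock frequency from Table 5.1

p_high = dynamic_power(ALPHA, F_HZ, C_FARAD, 1.2)
p_low = dynamic_power(ALPHA, F_HZ, C_FARAD, 0.8)
ratio = p_low / p_high   # equals (0.8 / 1.2) ** 2
```

Under these assumptions, scaling the supply from 1.2 V to 0.8 V reduces the dynamic power to (0.8/1.2)² = 4/9 of its original value, independent of the placeholder values.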

Shift registers have been replaced by memories for buffer lengths over 8. On one hand, the switching activity increases with the length of the buffers for shift registers and causes more power consumption. On the other hand, the switching