Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2018
Implementation of High-Speed 512-Tap FIR Filters for Chromatic Dispersion Compensation
Cheolyong Bae and Madhur Gokhale
Implementation of High-Speed 512-Tap FIR Filters for Chromatic Dispersion Compensation
Cheolyong Bae and Madhur Gokhale
LiTH-ISY-EX--18/5179--SE
Supervisor: Oscar Gustafsson
isy, Linköpings universitet
Examiner: Oscar Gustafsson
isy, Linköpings universitet
Division of Computer Engineering Department of Electrical Engineering
Linköping University SE-581 83 Linköping, Sweden
Abstract
A digital filter is a system or device that modifies a signal, an essential function in digital communication. Using optical fibers for communication has various advantages over copper wires, such as higher bandwidth and longer reach. However, at high transmission rates, chromatic dispersion arises as a problem that must be mitigated in an optical communication system. Therefore, a filter that compensates for chromatic dispersion is necessary. In this thesis, we introduce the implementation of a new filter architecture and compare it with a previously proposed architecture.
Acknowledgments
We would like to express our gratitude to our supervisor and examiner Oscar Gustafsson for his guidance in this thesis. Since the beginning of this thesis, we have developed our knowledge in the field of computer engineering. We would also like to thank our opponents, Aurélien Moine and Viswanaath Sundaram, for their valuable feedback on the results of this thesis.
Linköping, December 2018 Cheolyong Bae and Madhur Gokhale
Contents
Notation ix
1 Introduction 1
1.1 Goal . . . 1
1.2 Previous Research at LiU . . . 2
1.3 Limitation . . . 2
1.4 Outline of the Thesis . . . 2
2 Theory 3
2.1 Optical Networks and Chromatic Dispersion . . . 3
2.1.1 System Chain . . . 4
2.1.2 Analog-Digital Converter for Optical Transmission . . . 4
2.2 Finite-length Impulse Response Filter . . . 4
2.2.1 FIR Filtering in Frequency Domain . . . 5
2.3 Fast Fourier Transform . . . 6
2.3.1 Various Radix of FFT . . . 7
2.3.2 FFT Architecture . . . 9
2.4 Overlap-save . . . 9
2.5 Pipelining . . . 9
2.6 Complex Multiplication . . . 11
3 Method 13
3.1 Programming Language . . . 13
3.2 ModelSim . . . 13
3.3 Synopsys Design Compiler . . . 13
3.4 Power Estimation . . . 14
3.5 Approach . . . 14
4 Implementation 15
4.1 1024-point FFT Based FIR Filter Architecture . . . 15
4.1.1 Top Level Estimation . . . 15
4.1.2 First Commutator . . . 17
4.1.3 Second Commutator . . . 18
4.1.4 Twiddle Factor Multiplication . . . 19
4.1.5 Filter Coefficient Multiplier . . . 20
4.1.6 Coefficient Selector . . . 20
4.1.7 Multiplexer . . . 20
4.1.8 Global Counter and Control Signals . . . 20
4.2 256-point FFT Based FIR Filter Architecture . . . 20
4.2.1 Top Level Estimation . . . 22
4.2.2 Fast Fourier Transform with Overlap-save Method . . . 23
4.2.3 4-tap FIR Filters . . . 23
4.2.4 Multiplexer . . . 25
4.2.5 Inverse Fast Fourier Transform . . . 25
5 Result 27
5.1 256-point Fast Fourier Transform . . . 27
5.1.1 Radix-2 vs Radix-4 vs Radix-16 . . . 27
5.1.2 Pipelining . . . 28
5.1.3 FFT with Overlap-save Block and Inverse FFT . . . 29
5.2 Complex Multiplier . . . 31
5.2.1 Power Estimation with Random Coefficients . . . 32
5.3 4-tap FIR Filters . . . 32
5.3.1 Pipelining . . . 32
5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters . . . 37
5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters . . . 37
5.3.4 Summary of Results . . . 38
5.3.5 Tap Configuration Results . . . 39
5.4 Different Procedures of Commutator and Twiddle Factor Multiplication . . . 40
5.5 Different Cases in the First Commutator . . . 41
5.6 Comparison between Two Architectures . . . 42
5.7 Comparison with Previous Work . . . 45
6 Conclusion 47
6.1 Discussion . . . 47
6.2 Future Work . . . 48
Notation
Abbreviations
Abbreviation Description
ADC Analog Digital Converter
AWGN Additive White Gaussian Noise
BER Bit Error Rate
CD Chromatic Dispersion
DAC Digital Analog Converter
DFT Discrete Fourier Transform
DIF Decimation in Frequency
DIT Decimation in Time
DSP Digital Signal Processing
FFT Fast Fourier Transform
FIR Finite-length Impulse Response
FPGA Field Programmable Gate Array
HDL Hardware Description Language
IC Integrated Circuit
ICI Intercarrier Interference
IDFT Inverse Discrete Fourier Transform
IFFT Inverse Fast Fourier Transform
IIR Infinite-length Impulse Response
ISI Intersymbol Interference
LSB Least Significant Bit
MSB Most Significant Bit
OLS Overlap-save method
SAIF Switching Activity Information File
SDF Standard Delay Format
VCD Value Change Dump
VHDL Very High-speed Integrated Circuit Hardware Description Language
1 Introduction
Annual global IP traffic is growing and is predicted to reach 3.3 ZB (zettabytes) by 2021, up from 1.2 ZB in 2016 [1]. Due to the high demands of modern communication, fiber-optic communication is widely used because of its various advantages over copper-wire communication [19].
Optical communication can handle longer distances and higher bandwidth, and has better reliability. Despite these benefits, fiber-optic communication has some imperfections that need to be considered. One of these is chromatic dispersion.
Chromatic dispersion (CD) is a form of dispersion in optical fiber [19]. Because the components of a signal or pulse have different frequencies, each component propagates at a different speed. Thus, the signal or pulse is smeared and delivers incorrect information. Therefore, to recover the correct information, filtering is needed to compensate for the chromatic dispersion.
In this thesis, we introduce a new filter architecture to compensate for chromatic dispersion and compare it with a previously proposed architecture. Both architectures are implemented in VHDL and verified by Matlab simulation. The blocks in these two architectures are synthesized with the Design Compiler tool and analyzed in terms of area usage and power consumption. Some of the blocks have several design options intended to increase performance, so the results of those options and the overall comparison between the two architectures will be discussed.
1.1 Goal
The goal of this thesis is to design and evaluate a new architecture of a high-speed 512-tap finite-length impulse response filter for compensation of chromatic dispersion in optical fibers, and to perform a comparative analysis with the previous architecture. Previous research shows that it is possible to achieve an operating speed of 60 GS/s at a maximum clock frequency of 476 MHz, corresponding to a clock period of 2.1 ns [10]. This means that about 128 samples must be processed every clock cycle. The new architecture should achieve the same operating speed with less resource usage.
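The throughput requirement can be checked with a few lines of arithmetic. The sketch below is only an illustration of the numbers quoted from [10] (60 GS/s at 476 MHz); it rounds the required parallelism up to the next power of two:

```python
import math

def lanes_required(sample_rate_hz: float, clock_hz: float) -> int:
    """Samples that must be processed per clock cycle, rounded up
    to the next power of two (hardware works in power-of-two lanes)."""
    exact = sample_rate_hz / clock_hz        # ~126.05 for 60 GS/s at 476 MHz
    return 2 ** math.ceil(math.log2(exact))

# A 476 MHz clock corresponds to a period of 1 / 476e6 ~ 2.1 ns.
period_ns = 1e9 / 476e6
lanes = lanes_required(60e9, 476e6)
print(round(period_ns, 1), lanes)   # 2.1 128
```
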
1.2 Previous Research at LiU
At the Department of Electrical Engineering, Linköping University, 512-tap complex FIR filter architectures for compensation of chromatic dispersion [10] were previously investigated. Other studies on digital filters [4], representations of the FFT [5, 18], and FFT architectures [6, 7] have also been published.
1.3 Limitation
Since this thesis uses fixed-point numbers, not all bits can be retained through the arithmetic operations, and which bits of a sample carry the important information depends on the inputs and coefficients. Therefore, this thesis assumes that external signals are used to choose the desired bits.
The choice of the coefficients is outside the scope of this thesis. Instead of coefficients obtained from a chromatic dispersion filter, randomly generated values are used.
Some blocks had a deep hierarchy, because of which the propagation of switching activity to inner nets could not be ensured. Thus, the detailed gate-level power estimation cannot be said to be wholly accurate.
1.4 Outline of the Thesis
Chapter 2 presents theories behind FIR filters and optical networks. Chapter 3 covers the languages and tools used in this thesis.
Chapter 4 explains how the filters are implemented.
Chapter 5 contains the results of each block and the whole architecture.
2 Theory
2.1 Optical Networks and Chromatic Dispersion
Optical networks have significant advantages over traditional networks based on copper cables. They have much higher bandwidth and a lower Bit Error Rate (BER). Communication systems based on optical fiber are also less susceptible to electromagnetic interference, so they can be used over distances of more than one kilometer at speeds of tens of megabits per second [19].
Optical fibers, which are guided-wave structures, propagate light signals in optical networks. A narrow pulse launched into a fiber spreads, its width broadening as it travels along the fiber. Over long distances, the broadening of pulses extends into neighboring pulses, causing Intersymbol Interference (ISI). This effect is referred to as fiber dispersion. There are two basic types of dispersive effects in a fiber [20]: intermodal dispersion and chromatic dispersion.
Intermodal Dispersion: This form of dispersion exists in multimode fibers, since different modes have different group velocities. Pulses in different modes arrive at different times, each carrying different power. This dispersion limits the bit rate-distance product of an optical communication link [19].
Chromatic Dispersion: This dispersion occurs due to the frequency dependence of the group velocity. Chromatic dispersion can be modeled with the frequency response

C(exp(jωT)) = exp(−jK(ωT)²), K = Dλ²z / (4πcT²), (2.1)

where D is the fiber dispersion parameter, λ is the wavelength, c is the speed of light, T is the sampling period, and z is the propagation distance [4].
Figure 2.1: The system of an optical network model: upsampling by a factor L, anti-imaging filter G_TX, chromatic dispersion C, AWGN, the filter H for compensation of CD, anti-aliasing filter G_RX, and downsampling by a factor L.
The compensation of the chromatic dispersion is done by designing a filter with frequency response [4]:
H(exp(jωT)) = 1 / C(exp(jωT)) = exp(jK(ωT)²). (2.2)
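Equations (2.1) and (2.2) say that the compensator simply applies the opposite quadratic phase, so the cascade of channel and filter is all-pass. The sketch below checks this numerically; the link parameters are hypothetical example values only (D, λ, z, and T depend on the actual fiber and system):

```python
import cmath
import math

# Hypothetical link parameters for illustration only: D = 17 ps/(nm km)
# expressed in s/m^2, a 1550 nm carrier, z = 500 km, T = 1/60 GHz.
D, lam, z, T = 17e-6, 1550e-9, 500e3, 1 / 60e9
c = 3e8
K = D * lam ** 2 * z / (4 * math.pi * c * T ** 2)

def C(wT: float) -> complex:
    """Chromatic-dispersion channel response, Equation (2.1)."""
    return cmath.exp(-1j * K * wT ** 2)

def H(wT: float) -> complex:
    """Compensation filter, Equation (2.2): the opposite phase of C."""
    return cmath.exp(1j * K * wT ** 2)

# The cascade of channel and compensator is unity at every frequency.
for wT in (0.0, 0.3, 1.0, math.pi):
    assert abs(H(wT) * C(wT) - 1) < 1e-9
```

Note that both C and H have unit magnitude: chromatic dispersion is a pure phase distortion, which is why it can be undone exactly by an all-pass filter.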
2.1.1 System Chain
The full system chain is shown in Figure 2.1 [9]. In the system chain, in order to reduce the effects of ISI and intercarrier interference (ICI), interpolation on the transmitter side and decimation on the receiver side have been added. This interpolation and decimation require anti-imaging and anti-aliasing filters, which usually perform low-pass filtering [4].
The filter given by Equation (2.1) is added to simulate chromatic dispersion. The data then passes through an Additive White Gaussian Noise (AWGN) channel to simulate the random processes in nature. On the receiver side, the CD compensation filter reduces the effect of CD.
2.1.2 Analog-Digital Converter for Optical Transmission
Optical transmission relies on digital signal processing (DSP) and conversion between the analog and digital domains [12]. There are several DACs and ADCs aimed at optical communication. According to the sources given in [12], ADC bit resolutions range from 4 to 8 bits and maximum sample rates from 20 to 70 GS/s, using various technologies. In this thesis, we assume an input bit resolution of 6 bits and a target sample rate of 60 GS/s.
2.2 Finite-length Impulse Response Filter
Digital filters can be divided into two classes: Finite-length Impulse Response (FIR) and Infinite-length Impulse Response (IIR). An FIR filter has an impulse response of finite duration. On the contrary, if the impulse response has infinite duration, the filter is called an IIR filter. FIR filters are guaranteed to be stable unless used inside a recursive loop [22]. Equation (2.3) describes an FIR filter of order M with input x(n) and output y(n):

y(n) = Σ_{k=0}^{M} b_k x(n − k), (2.3)

Figure 2.2: Generic 4-tap filter.
where b_k is a coefficient of the FIR filter for 0 ≤ k ≤ M. Similarly, the transfer function can be expressed as

H(z) = Σ_{k=0}^{M} b_k z^{−k}. (2.4)

Also, the unit sample response of the FIR filter is the same as the coefficients b_k, that is,

h(k) = b_k for 0 ≤ k ≤ M, and h(k) = 0 otherwise. (2.5)
The output sequence described by Equation (2.3) can be expressed as the convolution summation of the system

y(n) = Σ_{k=0}^{M} h(k) x(n − k), (2.6)

where M is the order of the filter [17].
Generally, an FIR filter is described using the length of the filter rather than the order. The length of the filter is given by N = M + 1, where M is the order of the filter. The numbers of multiplications and additions in an FIR filter of length N are N and N − 1, respectively [22]. The direct-form structure of the FIR filter is one of the simplest structures and is depicted in Figure 2.2.
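The direct-form structure of Figure 2.2 can be sketched in a few lines of Python; this is an illustrative software model of Equation (2.6), not the VHDL used in the thesis:

```python
def fir_direct(x, h):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k), Eq. (2.6)."""
    M = len(h) - 1                      # filter order; length N = M + 1
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(M + 1):
            if 0 <= n - k < len(x):     # x(n - k) is zero outside the input
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

# 4-tap example as in Figure 2.2: a unit impulse reproduces the coefficients.
h = [0.5, 0.25, 0.125, 0.0625]
print(fir_direct([1.0, 0.0, 0.0, 0.0], h))   # -> [0.5, 0.25, 0.125, 0.0625]
```

The inner loop makes the cost of N multiplications and N − 1 additions per output sample explicit, which is exactly what the frequency-domain approach of the next section avoids for long filters.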
2.2.1 FIR Filtering in Frequency Domain
As discussed in the previous section, convolving the time-domain signal with the impulse response gives the output of the filter. This operation can be sped up by taking the Fourier transform of both the input signal and the coefficients and multiplying them. Taking the inverse Fourier transform of the product gives the same output as the convolution of the inputs. This method is much faster than time-domain convolution, due to the simplicity of multiplication and the speed of the Fast Fourier Transform (FFT). The approach is advantageous when filtering long data sequences. The complete filtering process is shown in Figure 2.3 [14].

Figure 2.3: Fast convolution: Y(k) = X(k)H(k), y(n) = x(n) * h(n).
When filtering long data sequences, the filtering is done block by block. The input stream of data is divided into segments, and each segment is processed one by one by the Discrete Fourier Transform (DFT) and inverse DFT. One method of performing filtering of long data sequences is the Overlap-save (OLS) method [17], which is described further in Section 2.4.
2.3 Fast Fourier Transform
The Fourier Transform (FT) is a mathematical way to decompose a function of time (a signal) into a function of frequency. When the FT is applied to discrete samples, it is called the Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) is simply an optimized algorithm for computing the DFT [16].
The DFT is given by

X_k = Σ_{n=0}^{N−1} x_n exp(−j2πkn/N), (2.7)

where x_n is a sequence of samples, N is the size of the transformation, and X_k is the k-th frequency component of the transform.
With this equation, the computation requires N² complex multiplications and additions or subtractions, without considering the elimination of trivial computations such as multiplication by 1. The FFT reduces the number of complex multiplications from N² to N log2 N. The best-known FFT method is the Cooley-Tukey algorithm [2]. It uses a divide-and-conquer technique that recursively breaks the DFT down into smaller DFTs.

Figure 2.4: An 8-point decimation in frequency FFT algorithm.

In order to explain the algorithm, the DFT of Equation (2.7) can be broken down into two parts [15]:

X_k = Σ_{n=0}^{(N/2)−1} x_{2n} W_{N/2}^{nk} + W_N^k Σ_{n=0}^{(N/2)−1} x_{2n+1} W_{N/2}^{nk}, (2.8)

X_{k+N/2} = Σ_{n=0}^{(N/2)−1} x_{2n} W_{N/2}^{nk} − W_N^k Σ_{n=0}^{(N/2)−1} x_{2n+1} W_{N/2}^{nk}, (2.9)

where W_N = exp(−j2π/N) is called the twiddle factor. With Equations (2.8) and (2.9), an N-point DFT can be computed by performing two N/2-point DFTs, one for the even-indexed samples and one for the odd-indexed samples. These equations can be broken down further until the size of the DFT equals the radix.

Figure 2.4 shows an example of an 8-point FFT, and Table 2.1 shows a comparison between the DFT and the FFT in terms of the number of complex multiplications.
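The even/odd split of Equations (2.8) and (2.9) leads directly to a recursive radix-2 FFT. The sketch below is an illustrative Python model (not the hardware description of this thesis) that checks the recursion against the direct DFT of Equation (2.7):

```python
import cmath

def dft(x):
    """Direct DFT, Eq. (2.7): O(N^2) complex multiplications."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT: split into even/odd halves, Eqs. (2.8)-(2.9)."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0] * N
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)   # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]           # Eq. (2.8)
        out[k + N // 2] = even[k] - w * odd[k]  # Eq. (2.9)
    return out

x = [complex(n) for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```

Each recursion level performs N/2 twiddle multiplications over log2 N levels, which is the N log2 N count quoted above.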
2.3.1 Various Radix of FFT
As discussed above, the FFT breaks down the DFT into smaller DFTs. The minimum size of the DFT depends on which radix is selected for the FFT. The radix is also the size of the butterfly [15]. A butterfly in this thesis denotes the basic operation of the FFT. For example, a radix-2 FFT uses one addition and one subtraction for one butterfly:
X(0) = x(0) + x(1) (2.10)
X(1) = x(0) − x(1) (2.11)
The FFT can, of course, use a higher radix. Figure 2.5 shows a radix-2 butterfly and a radix-4 butterfly.
Number of points   Complex multiplications     Complex multiplications
                   in direct computation       in FFT algorithm
4                  16                          4
8                  64                          12
16                 256                         32
32                 1024                        80
64                 4096                        192
128                16384                       448
256                65536                       1024
512                262144                      2304
1024               1048576                     5120

Table 2.1: Comparison of the number of complex multiplications in the direct computation of the DFT and the FFT algorithm [17].
Figure 2.5: (a) A radix-2 butterfly; (b) a radix-4 butterfly.
2.3.2 FFT Architecture
The FFT architectures used in this thesis are a pipelined implementation and a direct implementation. In the pipelined structure, the number of input samples is a power of two, and the input samples are processed in a continuous flow with data shuffling. Data shuffling is done using buffers and multiplexers [7].
The direct implementation can also be considered a parallel pipelined FFT where the degree of parallelization equals the size of the FFT [7]. This direct implementation is straightforward, mapping each operation according to the FFT flow graph.
In the direct implementation, the input samples arrive simultaneously, so there is no need for pipelining in the data flow. Also, decimation in time (DIT) and decimation in frequency (DIF) have the same architecture, the difference being in the rotators [9].
2.4 Overlap-save
Figure 2.6 illustrates the Overlap-save (OLS) method. When the input is a very long signal to be filtered with an FIR filter, OLS is one method to compute the discrete convolution. In the OLS method, the input samples are divided into blocks of M samples. The first block of M samples is preceded by L − 1 zeros, where L is the filter length, so the new block has a total of M + L − 1 = N samples: L − 1 zeros followed by M samples. Each subsequent block saves the last L − 1 samples of the previous block and prepends them to the next M samples, so it also has M + L − 1 samples: L − 1 old samples and M new samples. The same applies to all segments.
An N-point FFT is performed on each of these blocks. The FFT result of each block is then multiplied with the filter coefficients in the frequency domain, and an IFFT is performed. The initial L − 1 samples of each output block are discarded, and the results are concatenated. The concatenated result is the final output [17].
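The block processing above can be modeled compactly. The sketch below is an illustrative Python model of the OLS method (filter length L, block size N), verified against direct convolution; it is not the hardware of this thesis:

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; inverse=True flips the twiddle sign."""
    N = len(x)
    if N == 1:
        return list(x)
    sign = 2j if inverse else -2j
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    out = [0] * N
    for k in range(N // 2):
        w = cmath.exp(sign * cmath.pi * k / N)
        out[k], out[k + N // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]

def overlap_save(x, h, N=8):
    """OLS: keep the last L-1 samples of each block as the next overlap,
    discard the first L-1 outputs of each IFFT (L = filter length)."""
    L = len(h)
    M = N - (L - 1)                       # new samples consumed per block
    H = fft(h + [0] * (N - L))            # zero-padded filter spectrum
    buf = [0.0] * (L - 1)                 # initial overlap is zeros
    y = []
    for i in range(0, len(x), M):
        block = buf + x[i:i + M]
        block += [0.0] * (N - len(block))  # pad the final short block
        Y = ifft([a * b for a, b in zip(fft(block), H)])
        y.extend(v.real for v in Y[L - 1:])  # discard first L-1 outputs
        buf = block[M:M + L - 1]             # save last L-1 samples
    return y[:len(x)]

def direct(x, h):
    return [sum(h[k] * (x[n - k] if 0 <= n - k < len(x) else 0.0)
                for k in range(len(h))) for n in range(len(x))]

x = [float(n % 5) for n in range(20)]
h = [0.5, 0.25, 0.125]
assert all(abs(a - b) < 1e-9 for a, b in zip(overlap_save(x, h), direct(x, h)))
```

Each iteration consumes M = N − L + 1 new input samples and produces M valid outputs, matching the block structure of Figure 2.6.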
2.5 Pipelining
Pipelining is a method of increasing the throughput of a sequential algorithm. It is achieved by breaking the critical path into shorter paths by adding delays to the original path. Ideally, the critical path is broken into paths of equal length. With P stages of pipelining, P computations can run concurrently, which means an increase in throughput by a factor of P over sequential processing. Pipelining helps to achieve a higher level of parallelism in the structure [22].
In this thesis, pipelining is done for two reasons. One reason is to achieve a higher operating speed: without inserting delay elements in the critical path, the system is unable to run at certain speeds. The other reason is to reduce the number of resources when synthesizing the design, since pipelining reduces overall area usage and power consumption when operating at a higher frequency.
Figure 2.6: The overlap-save method: each block overlaps L − 1 samples with the previous one, and the first L − 1 samples of each IFFT output are discarded.
2.6 Complex Multiplication
In this thesis, we consider two different algorithms for complex multiplication. One is the standard complex multiplication algorithm, and the other is the Gauss complex multiplication algorithm [21].
The standard complex multiplication algorithm uses four real multiplications and two additions, as can be seen in Equation (2.12).
(a + bj)(c + dj) = (ac − bd) + j(bc + ad). (2.12)

When using the Gauss complex multiplication algorithm, it is possible to reduce the number of real multiplications. The algorithm is as follows:

k1 = c(a + b), k2 = a(d − c), k3 = b(c + d), (2.13)

ac − bd = k1 − k3, bc + ad = k1 + k2. (2.14)
This algorithm gains speed if one multiplication is more expensive than three additions or subtractions. However, it has three steps of computation, which makes the architecture more complicated. The difference between these two multipliers is discussed later in this thesis.
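Both algorithms are easy to cross-check numerically. The sketch below verifies that Gauss's three-multiplication formulation, Equations (2.13) and (2.14), agrees with the standard four-multiplication form of Equation (2.12):

```python
def gauss_mult(a, b, c, d):
    """Gauss's algorithm, Eqs. (2.13)-(2.14): 3 real multiplications."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)   # (real, imaginary) parts of (a + jb)(c + jd)

def standard_mult(a, b, c, d):
    """Schoolbook algorithm, Eq. (2.12): 4 real multiplications."""
    return (a * c - b * d, b * c + a * d)

for (a, b, c, d) in [(1, 2, 3, 4), (-5, 7, 0.5, -0.25)]:
    assert gauss_mult(a, b, c, d) == standard_mult(a, b, c, d)
```

Expanding k1 − k3 = c(a + b) − b(c + d) = ac − bd and k1 + k2 = c(a + b) + a(d − c) = bc + ad confirms the identity algebraically as well.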
3 Method
This chapter covers methods that are used in this thesis.
3.1 Programming Language
Firstly, Matlab is used to verify the system. Matlab is a mathematical computing application and programming language developed by MathWorks. Due to its convenient handling of complex computations, we used Matlab to verify our system.
The second language is VHDL, an abbreviation for "Very High-speed Integrated Circuit Hardware Description Language". VHDL is generally used in the electronic design of FPGAs (Field Programmable Gate Arrays) and Integrated Circuits (ICs). We used it because some code from previous studies was available, which reduced the time required to implement both architectures in this thesis.
3.2 ModelSim
ModelSim is a simulation application for hardware description languages such as VHDL and Verilog, and for system-level modeling languages such as SystemC. This tool is used to verify that the system is functionally correct without using any physical equipment.
3.3 Synopsys Design Compiler
Synopsys Design Compiler is a tool that synthesizes high-level design blocks written in HDL code into physical hardware. It creates netlists consisting of logic-level design blocks. When compiling with Design Compiler, it is possible to specify parameters such as clock period and switching activities, which allows comparing results under given constraints.
It also provides useful commands for obtaining optimized results, such as "compile_ultra", which includes many optimization features such as automatic ungrouping, datapath optimization, and timing analysis.
3.4 Power Estimation
In order to obtain a more accurate power estimate, it is necessary to set proper switching activities for the ports in the design. An increase in switching activity causes more dynamic power consumption.
In this thesis, the power estimation is done using a Switching Activity Information File (SAIF) in Design Compiler.
There are two ways of generating a SAIF file. One is to write out a SAIF file directly; the other is to convert a VCD (Value Change Dump) file from the simulation into a SAIF file using the "vcd2saif" command in Design Compiler. Since the designs in this thesis are too big to write a file directly, the latter method is used.
The detailed procedure is as follows:
1. Read the design and compile it in Design Compiler.
2. Generate an SDF (Standard Delay Format) file in Design Compiler by using "write_sdf " command.
3. Read the SDF file and the testbench file of the design in ModelSim and create a VCD file.
4. Convert the VCD file to a SAIF file.
5. Read the SAIF file in Design Compiler and report power.
3.5 Approach
In order to achieve a power-optimized filter, multiple variants of every block were designed and analyzed for power consumption. Each variant was synthesized over a range of frequencies from 100 MHz to 667 MHz to better understand the behaviour of each block. The most power-efficient variant of each block at a frequency of 476 MHz was then selected for the filter.
4 Implementation
This chapter presents the implementation of two different filter architectures. One is the 1024-point FFT based FIR filter architecture, which is the architecture proposed in this thesis. The other is the 256-point FFT based FIR filter architecture, which was previously proposed [9].
In this thesis, wordlengths are determined based on the bit error rate (BER) simulated in the previous work [9]. The input data wordlength is 12 bits, with 6 bits for the real part and 6 bits for the imaginary part, as mentioned in Section 2.1.2. Inside the architectures, the data wordlength after quantization is 24 bits (12 real, 12 imaginary). The filter coefficient wordlength is 16 bits (8 real, 8 imaginary).
4.1 1024-point FFT Based FIR Filter Architecture
Figure 4.1 shows an overview of the architecture. In this architecture, a 1024-point FFT is performed using 4-point FFTs and 256-point FFTs. Commutators perform data shuffling of the input samples so that the FFTs operate on the correct data, and after multiplication with the filter coefficients, the inverse FFT is performed. The inverse FFT is done by mirroring the FFT process, interchanging the real and imaginary parts of the samples from the multipliers and of the outputs [3].
4.1.1 Top Level Estimation
Various operators with different wordlengths are used in the architecture. The operators that influence area usage and power consumption in the design are the adder (subtractor), general complex multiplier, complex multiplier with constants, complex multiplier with reconfigurable constants, multiplexer, and delay element.
Figure 4.1: 1024-point FFT based FIR filter architecture.
A general complex multiplier refers to a complex multiplier whose inputs are both not specifically determined. A complex multiplier with constants refers to a complex multiplier whose multiplier value is constant, whereas reconfigurable constants change every clock cycle.
Although the detailed gates are selected by the synthesis tool, it is good to estimate overall performance by analyzing the number of operators in each block. In the 1024-point FFT based FIR filter architecture, 256 samples are processed at every clock cycle since we use the OLS method.
First, the numbers of operators in the 256-point FFT block can be computed by the following equations:

Complex adders = N log2 N, (4.1)
Complex multipliers = (N/2)(log2 N − 1), (4.2)

where N is the number of processed samples. The complex multipliers here use twiddle factors, which are constants.
Second, the number of multipliers in the twiddle factor multiplication block is (3/4)N or N, depending on which of the two procedures is used for the second commutator and the twiddle factor multiplication block. The difference is discussed in Section 5.4. These multipliers use reconfigurable constants; therefore, three 2 × 1 multiplexers are used to select one constant out of four.
Third, the number of filter coefficient multipliers is N. These are general multipliers whose inputs are not constants. Here, multiplexers are used for each multiplier: three 2 × 1 multiplexers to select the coefficient and two 2 × 1 multiplexers for quantization.
Fourth, the commutators contain delay blocks and 2 × 1 multiplexers. The number of delay blocks differs in each commutator.
Operators Number
Complex adders 5120
General complex mult. 256
Complex mult. const. 1792
Complex mult. w/ reconfigurable const. 384 or 512
2 × 1 Multiplexers 3840 or 4224
Delay elements 2432
Table 4.1:Number of operators in the 1024-point FFT based FIR filter archi-tecture.
The total numbers of operators in the whole architecture, including the inverse FFT, can be computed by the following equations:

Total complex adders = 4N + 2N log2 N, (4.3)
Total general complex multipliers = N, (4.4)
Total complex mult. w/ const. = N(log2 N − 1), (4.5)
Total complex mult. w/ reconfigurable const. = (3/2)N or 2N, (4.6)
Total 2 × 1 multiplexers = ((9/2)N or 6N) + 2N + 3N + N + 2N + 2N + (1/2)N, (4.7)
Total delay elements = (5/2)N + 3N + 3N + N. (4.8)
The overall results for the architecture can be seen in Table 4.1. Note that the exact numbers may differ depending on the optimization process. More details of each block are discussed further below.
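Evaluating the closed-form counts with N = 256 reproduces the totals of Table 4.1. The sketch below is only an arithmetic check; the per-term grouping of the multiplexer count is an estimate read off from the block descriptions, not a statement about the synthesized netlist:

```python
import math

N = 256                          # samples processed per clock cycle
L = int(math.log2(N))            # log2(256) = 8

adders       = 4 * N + 2 * N * L             # Eq. (4.3) -> 5120
general_mult = N                             # Eq. (4.4) ->  256
const_mult   = N * (L - 1)                   # Eq. (4.5) -> 1792
reconf_mult  = (3 * N // 2, 2 * N)           # Eq. (4.6) -> 384 or 512
# Eq. (4.7): 9N/2 or 6N for the two procedures, plus the remaining mux terms
muxes        = tuple(v + 2 * N + 3 * N + N + 2 * N + 2 * N + N // 2
                     for v in (9 * N // 2, 6 * N))   # -> 3840 or 4224
delays       = 5 * N // 2 + 3 * N + 3 * N + N        # Eq. (4.8) -> 2432

print(adders, general_mult, const_mult, reconf_mult, muxes, delays)
```
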
4.1.2 First Commutator
The architecture consists of 4-point FFTs and a 256-point FFT. Therefore, it is necessary to change the order of the inputs when they go through the FFT blocks. The theory behind this block can be found in [6, 8].
The first commutator is used for the 4-point FFTs. The first stage of the FFT needs to use samples 0 to 255, 256 to 511, 512 to 767, and 768 to 1023. This block also needs to handle the OLS method. Therefore, the input data stream needs to be permuted. Details of the permutation flow can be seen in Figure 4.2. The important point in the figure is that b9 and b8 are placed on the right-hand side of the matrix, and the total number of bits on the right-hand side is eight. Then, it is possible to pick the correct data to perform the 4-point FFT.
In order to explain how the data is permuted, it is easier to use bit index numbers as follows:
b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0 (4.9)
Indexes to the left of the vertical bar in Equation (4.9) represent the vertical dimension of the matrix in Figure 4.2, and indexes to the right of the vertical bar represent the horizontal dimension. Note that the bits on the left-hand side (b7, b8, and b9 in Equation (4.9)) are incremented by the clock signal.

Figure 4.2: Operation of the first commutator.
In order to handle the OLS method and select correct bits for the 4-point FFT, the first step is to move b9 to the right-hand side index as follows:
b8 b7 | b9 b6 b5 b4 b3 b2 b1 b0 (4.10)
This is done by adding delay blocks which take four clock cycles at the input stage. By doing this, the 512 samples are overlapped and saved.
The next step is moving b8 to the right-hand side. In order to do this, b8 needs to be exchanged with one of the bits b0 to b6. According to [6, 8], this process requires a serial-parallel circuit with a delay size of two. If we select b0, for example, the final order of the indexes is the following:
b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8 (4.11)
Note that the selection of the bit from b0 to b6 results in different power consumption, as discussed in Section 5.5. Figure 4.3 describes the whole architecture of the first commutator.
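The bit-index shuffle of Equations (4.9) to (4.11) can be modeled as a pure index permutation. The sketch below is an illustrative software model of the data reordering only (not the serial-parallel hardware): with the final order b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8, the four samples x, x+256, x+512, and x+768 that feed one 4-point FFT end up in the same parallel position:

```python
def first_commutator_index(i: int) -> int:
    """Map a 10-bit sample index i to its position after the permutation
    (b9 b8 b7 | b6 ... b0)  ->  (b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8)."""
    b = [(i >> k) & 1 for k in range(10)]   # b[k] = bit bk of i
    new_msb_first = [b[0], b[7], b[9], b[6], b[5], b[4], b[3], b[2], b[1], b[8]]
    out = 0
    for bit in new_msb_first:               # reassemble, MSB first
        out = (out << 1) | bit
    return out

# The mapping is a bijection over the 1024 sample indices ...
assert len({first_commutator_index(i) for i in range(1024)}) == 1024

# ... and i, i+256, i+512, i+768 (differing only in b8 and b9) share the two
# left-hand (parallel) bits, i.e. the same row of the output matrix.
rows = {first_commutator_index(5 + k * 256) >> 8 for k in range(4)}
assert len(rows) == 1
```

Since b8 and b9 both end up in the right-hand (time) dimension, the four samples of each 4-point FFT arrive in the same lane on different clock cycles, which is exactly what the 4-point FFT stage requires.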
4.1.3 Second Commutator
After the 4-point FFTs are performed, the data stream goes to the 256-point FFT. But the data from the 4-point FFTs cannot be used directly, because it was permuted. In order to perform the 256-point FFT, the order of the data needs to be restored to normal order.
Figure 4.3: Detailed architecture of the first commutator.

The exchange of index numbers on the right-hand side of the expression can easily be done by changing the wiring. Therefore, the important feature of the second commutator is to place b8 and b9 on the left side. This can be done similarly to the first commutator: serial-parallel circuits with one delay and with two delays are used in the second commutator.
After using the serial-parallel circuit with one delay, b7 and b8 are exchanged.
b0 b8 | b9 b6 b5 b4 b3 b2 b1 b7 (4.12)
To exchange b9 with b0, a change of wiring is needed first. The following expression shows the index order after the rewiring:
b0 b8 | b7 b6 b5 b4 b3 b2 b1 b9 (4.13)
Then, we can exchange b9 with b0 by using the circuit with two delays.
b9 b8 | b7 b6 b5 b4 b3 b2 b1 b0 (4.14)
The other commutators (third and fourth) are mirrored versions of the first and second, although they use a different data wordlength, and the final commutator discards half of the outputs instead of saving them. More details are discussed in Chapter 5.
4.1.4 Twiddle Factor Multiplication
The twiddle factor multiplication block is more complicated than its counterpart inside a typical FFT block because, in this architecture, it handles 256 of the 1024 samples in each clock cycle. Therefore, the block needs a two-bit control signal and a multiplexer to select the correct twiddle factor for every sample.
4.1.5 Filter Coefficient Multiplier
Three types of complex multipliers are implemented, as shown in Figure 4.4. The Gauss complex multiplication algorithm can be modified to use pre-computed inputs, which removes two adders as described in Figure 4.4(b). With pre-computed inputs, the computation energy is expected to be reduced by the removal of the two additions.
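The arithmetic of Figure 4.4(b) can be sketched in a few lines of Python (function names are illustrative only). For a constant coefficient c + di, the terms c, d − c, and c + d are computed once, leaving three real multiplications and three additions per sample:

```python
def gauss_precompute(c, d):
    """Pre-compute the coefficient terms once (the coefficient is constant)."""
    return c, d - c, c + d

def gauss_multiply(a, b, pre):
    """Multiply the sample a + bi by the pre-computed coefficient c + di."""
    c, d_minus_c, c_plus_d = pre
    k1 = c * (a + b)          # shared product
    k2 = a * d_minus_c
    k3 = b * c_plus_d
    return k1 - k3, k1 + k2   # (real, imaginary) = (ac - bd, ad + bc)

# Agrees with the standard four-multiplication result
pre = gauss_precompute(3, -7)
assert gauss_multiply(2, 5, pre) == (2 * 3 - 5 * (-7), 2 * (-7) + 5 * 3)
```

Without the pre-computation, the terms d − c and c + d cost the two extra additions shown in Figure 4.4(a).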
4.1.6 Coefficient Selector
When multiplying samples with filter coefficients, selecting the correct filter coefficients is necessary, similar to the twiddle factor multiplication block. The difference in the coefficient selector is that registers are needed at every multiplier. Here, we assume that the four coefficients for each multiplier are set externally, so four registers and one multiplexer are added to the complex multiplier, as seen in Figure 4.5.
4.1.7 Multiplexer
After the multiplication, the output wordlength from the multiplier is increased by the coefficient wordlength. In order to restore the original wordlength, a quantization multiplexer is implemented to select 12 bits from the filter output. This multiplexer gives the user of the system the flexibility to select bits according to the number of integer and fractional bits in the input data.
The multiplexer used in this architecture gives the user four choices through a two-bit external control signal: the first choice selects the 12 MSBs of both the real and imaginary parts, and each following choice moves the selection down by one bit for both parts.
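One way to read this selection scheme (an interpretation sketched in Python, not the exact hardware) is as a shift-and-mask on the output word, where the 2-bit control value k selects 12 consecutive bits starting from the MSBs for k = 0:

```python
def quantize(word, width, k, out_bits=12):
    """Select out_bits consecutive bits from an unsigned width-bit word.

    k = 0 keeps the MSBs; each increment of k moves the window down one bit.
    """
    shift = width - out_bits - k
    return (word >> shift) & ((1 << out_bits) - 1)

w = 0b1011_0010_1101_0111_0110_1001      # a 24-bit example word
assert quantize(w, 24, 0) == w >> 12     # the 12 MSBs
assert quantize(w, 24, 1) == (w >> 11) & 0xFFF
```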
Figure 4.6 shows the architecture of the multiplexer.
4.1.8 Global Counter and Control Signals
Synchronization is needed in this architecture because the correct twiddle factors and filter coefficients must be selected at the correct time when they are multiplied with the samples. Also, each block has a different delay due to its pipelining. Therefore, we implement a global counter that generates the control signals for each block.
4.2 256-point FFT Based FIR Filter Architecture
Figure 4.7 shows an overview of the architecture. This architecture uses polynomial convolution [11]: it separates the impulse response into polyphase components and performs an FFT on them. The architecture thus performs an FFT on the input samples and uses a 4-tap FIR filter on each transformed sample. The inverse FFT is performed by interchanging the real and imaginary parts of the FIR filter outputs, performing an FFT on
Figure 4.4: Multipliers used in this thesis: (a) Gauss complex multiplier, (b) Gauss complex multiplier with pre-computation, (c) standard complex multiplier.
Figure 4.5: Coefficient selector: four registers and a multiplexer added to the complex multiplier, with the coefficients selected externally.
Figure 4.6: Multiplexer for quantization.
Figure 4.7: 256-point FFT based FIR filter architecture.
them, and interchanging the real and imaginary parts of the result, as mentioned in [3].
4.2.1 Top Level Estimation
In the 256-point FFT based FIR filter architecture, the complex multipliers in the FFT use constants, and the complex multipliers in the 4-tap FIR filters use configured constants, because the filter coefficients are configured once when the system is set up. Since the complex multipliers inside the FFT multiply by constants, they can be optimized at the hardware level.
In the 256-point FFT block of this architecture, the input has a 12-bit data wordlength, and half of the outputs from the IFFT are discarded.
The number of operators in the FIR filters is computed by the following equations:
Complex adders = N (M − 1) (4.15)
Delay elements = N (M − 1) (4.16)
Complex mult. w/ configured const. = N M (4.17)

where M is the number of taps of each FIR filter and N is the number of FIR filters.
Also, after each 4-tap FIR filter, there are two 2 × 1 multiplexers for quantization. The total number of each operator in the architecture can be computed by
Operators Number
Complex adders 4736
Complex mult. w/ const. 1792
Complex mult. w/ configured const. 1024
2 × 1 Multiplexers 512
Delay elements 896
Table 4.2: Number of operators in the 256-point FFT based FIR filter.
the following equations:
Total complex adders = 2N log2 N + N(M − 1) − N/2 (4.18)
Total complex mult. w/ const. = N(log2 N − 1) (4.19)
Total complex mult. w/ configured const. = N M (4.20)
Total 2 × 1 multiplexers = 2N (4.21)
Total delay elements = N/2 + N(M − 1) (4.22)
The detailed numbers of operators can be seen in Table 4.2. Note that the exact numbers may differ with a different optimization process. More details of each block are discussed further below.
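The totals in Table 4.2 follow directly from Equations (4.18)-(4.22) with N = 256 FFT points/filters and M = 4 taps; a quick sanity check in Python:

```python
from math import log2

N, M = 256, 4                              # 256-point FFT, 4-tap filters

adders  = int(2 * N * log2(N) + N * (M - 1) - N // 2)   # Eq. (4.18)
mult_c  = int(N * (log2(N) - 1))           # constant (twiddle) multipliers
mult_cc = N * M                            # configured-constant multipliers
muxes   = 2 * N                            # 2x1 quantization multiplexers
delays  = N // 2 + N * (M - 1)             # Eq. (4.22)

assert (adders, mult_c, mult_cc, muxes, delays) == (4736, 1792, 1024, 512, 896)
```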
4.2.2 Fast Fourier Transform with Overlap-save Method
As discussed in Section 1.1, 128 samples must be processed in every clock cycle. According to [13], the FFT size needs to be twice the number of required output samples. This leads to a 256-point FFT whose outputs, with 128 overlapped samples, generate 128 output samples.
The FFT based filtering uses the OLS method, implemented in the first block as indicated in Figure 2.6. The block receives 128 new samples, each with 6 bits real and 6 bits imaginary. Six zeros are appended in the LSBs of both the real and imaginary parts, making each of the 128 samples 24 bits wide. These 128 samples are then kept for one cycle. When the next 128 samples arrive and are zero-padded in the same way, the 128 new samples and the 128 retained samples are combined into 256 samples, which are fed into the 256-point FFT block. The purpose of this block is therefore to perform the OLS component of FFT based FIR filtering.
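The framing performed by this block can be sketched as follows (a behavioral Python illustration; the bit-level zero padding and wordlengths are omitted): every cycle the 128 retained samples and the 128 new samples are concatenated into one 256-sample FFT frame, so consecutive frames overlap by half.

```python
FRAME = 256          # FFT length
STEP = FRAME // 2    # 128 new samples per clock cycle

def ols_frames(samples):
    """Yield 256-sample frames overlapping by 128, as fed to the FFT."""
    previous = [0] * STEP          # the very first frame is zero-padded
    for start in range(0, len(samples) - STEP + 1, STEP):
        new = samples[start:start + STEP]
        yield previous + new       # old half followed by new half
        previous = new

frames = list(ols_frames(list(range(512))))
assert len(frames) == 4
assert frames[1][:STEP] == frames[0][STEP:]   # 128-sample overlap
```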
4.2.3 4-tap FIR Filters
A standard 4-tap FIR filter contains three delay elements and four multipliers and adders. The configuration of the summation is left to the synthesis tool; it could be a tree structure or a sequential structure. A generic 4-tap filter is constructed and then reused in all other 4-tap filter implementations. A general 4-tap filter is depicted in Figure 2.2.
4.2.3.1 Direct Form Implementation
The 4-tap filter in direct form consists of three delay elements. The following multipliers are used for the separate implementations:
• With Gauss complex multiplier: the complex multiplier in this 4-tap filter uses the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).
• With Gauss complex multiplier with precomputed inputs: the complex multiplier uses the Gauss complex multiplication algorithm with precomputed inputs, as shown in Figure 4.4(b).
• With standard complex multiplier: the complex multiplier uses the standard complex multiplication algorithm, as shown in Figure 4.4(c).
The structure of the adder that sums the four multiplier outputs is left to the synthesis tool. The filter is depicted in Figure 2.2.
4.2.3.2 Parallel Standard Structure
This filter consists of four generic 4-tap filters, each with a coefficient wordlength of 8 bits, and implements the standard complex multiplication algorithm. Two of the generic 4-tap filters take the real parts of the coefficients as filter coefficients, and the other two take the imaginary parts. One filter with real coefficients and one with imaginary coefficients are fed with the real input values, and the other two filters are fed with the imaginary input values. The outputs of the four filters are then summed to produce the outputs. The filter is depicted in Figure 4.8. The real and imaginary outputs are given by Equations (4.23) and (4.24), respectively:
a hre − b him (4.23)
a him + b hre (4.24)
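Equations (4.23) and (4.24) can be checked with a small behavioral model (Python, illustrative names) that combines four real-valued FIR filters into one complex filter:

```python
def fir(x, h):
    """Real-valued FIR: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def parallel_standard(x_re, x_im, h_re, h_im):
    """Four real filters: re = a*h_re - b*h_im, im = a*h_im + b*h_re."""
    re = [p - q for p, q in zip(fir(x_re, h_re), fir(x_im, h_im))]
    im = [p + q for p, q in zip(fir(x_re, h_im), fir(x_im, h_re))]
    return re, im

# Matches a direct complex convolution
x = [1 + 2j, -3 + 1j, 0 + 4j]
h = [2 - 1j, 1 + 1j, 0 + 2j, -1 + 0j]
re, im = parallel_standard([v.real for v in x], [v.imag for v in x],
                           [c.real for c in h], [c.imag for c in h])
direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
          for n in range(len(x))]
assert all(abs(complex(r, i) - d) < 1e-9 for r, i, d in zip(re, im, direct))
```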
4.2.3.3 Parallel Gauss Structure
This filter implementation consists of three generic 4-tap filters: two with a coefficient wordlength of 9 bits and one with a coefficient wordlength of 8 bits. It uses the Gauss complex multiplication algorithm. The two 9-bit filters have the sum and the difference of the real and imaginary parts of the coefficients as their filter coefficients, and the remaining 8-bit filter has the imaginary parts of the coefficients. This can be seen in Figure 4.9.
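A behavioral sketch of the three-filter idea follows (Python). The input wiring here is an assumption: it is one common three-multiplication decomposition that matches the coefficient sets of Figure 4.9, and the exact wiring in the thesis hardware may differ.

```python
def fir(x, h):
    """Real-valued FIR: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def parallel_gauss(x_re, x_im, h_re, h_im):
    """Three real filters with coefficients h_re+h_im, h_im, h_re-h_im."""
    h_sum = [c + d for c, d in zip(h_re, h_im)]
    h_diff = [c - d for c, d in zip(h_re, h_im)]
    # Shared term: the h_im filter driven by (real - imag) input
    shared = fir([a - b for a, b in zip(x_re, x_im)], h_im)
    re = [p + s for p, s in zip(fir(x_re, h_diff), shared)]
    im = [p + s for p, s in zip(fir(x_im, h_sum), shared)]
    return re, im

# Agrees with the direct complex convolution
x = [2 - 1j, 0 + 3j, -1 + 1j]
h = [1 + 1j, 2 - 3j, 0 + 1j, -2 + 0j]
re, im = parallel_gauss([v.real for v in x], [v.imag for v in x],
                        [c.real for c in h], [c.imag for c in h])
direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
          for n in range(len(x))]
assert all(abs(complex(r, i) - d) < 1e-9 for r, i, d in zip(re, im, direct))
```

Per tap, (c − d)a + d(a − b) = ac − bd and (c + d)b + d(a − b) = ad + bc, so three real multiplications replace four.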
Figure 4.8: Parallel standard structure.
Figure 4.9: Parallel Gauss structure.
4.2.4 Multiplexer
Similar to the 1024-point FFT based filter architecture, a quantization multiplexer is implemented after the 4-tap filter. The difference is that the output data wordlength from the 4-tap filter is 22 bits real and 22 bits imaginary, while the input data wordlength of the inverse FFT is 12 bits real and 12 bits imaginary. The architecture is the same as shown in Figure 4.6.
4.2.5 Inverse Fast Fourier Transform
An inverse FFT is performed by interchanging the real and imaginary parts, as discussed in Section 4.2. Also, since the OLS method is used for filtering in this FIR filter architecture, the overlapped outputs are rejected at the inverse FFT stage.
5 Results
This chapter presents the synthesis results of the blocks used in the two architectures, including a comparison of the various options that have been considered. Results are shown at six different frequencies to investigate each block in more detail.
Notice that the design is synthesized block by block. The whole design was not synthesized due to memory limitations, as the entire design is too large to be synthesized as one unit. Therefore, the synthesis results of the individual blocks are presented in this thesis. The most power-efficient option for each block is selected for each architecture, and the comparison of the two architectures is discussed at the end of the chapter.
The synthesis is done in Design Compiler using a 65-nm low-power process and 1.2 V supply voltage.
5.1 256-point Fast Fourier Transform
5.1.1 Radix-2 vs Radix-4 vs Radix-16
The choice of radix for the FFT can make a difference in the performance of the system. In this thesis, the 256-point FFT is implemented using radix-2, radix-4, and radix-16 butterflies, where the radix-16 butterfly is based on the radix-4 butterfly. Each case involves a different number of complex multiplications. The three algorithms can be represented by binary tree representations [18], as illustrated in Figure 5.1.
The basic number of complex multiplications is N(log2 N − 1) for the radix-2 case and (N/2)(log2 N − 1) for radix-4 and radix-16. However, we can exclude the trivial cases where the rotation angle is 0°, 90°, 180°, or 270°. These rotations are just multiplications by 1, j, −1, and −j, respectively, and can be implemented easily in hardware by interchanging the real
Figure 5.1: Binary tree representations used for the 256-point FFT: (a) radix-2, (b) radix-4, (c) radix-16.

Algorithm Non-trivial rotations
Radix-2 878
Radix-4 492
Radix-16 480
Table 5.1: Number of non-trivial rotations in FFTs with various radixes. The radix-16 case is based on a radix-4 butterfly.
and imaginary parts or by changing signs [5].
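These trivial rotations can be illustrated in Python: multiplying a + bi by 1, j, −1, or −j only swaps the real and imaginary parts and/or negates them, so no multiplier is needed.

```python
def trivial_rotate(a, b, quarter):
    """Rotate a + bi by quarter * 90 degrees (multiply by j**quarter)."""
    if quarter % 4 == 0:
        return a, b            # * 1
    if quarter % 4 == 1:
        return -b, a           # * j : swap, negate real
    if quarter % 4 == 2:
        return -a, -b          # * -1 : negate both
    return b, -a               # * -j : swap, negate imaginary

# Agrees with the complex products for all four quadrants
for q, w in enumerate([1, 1j, -1, -1j]):
    a, b = trivial_rotate(3, 4, q)
    assert complex(a, b) == (3 + 4j) * w
```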
Table 5.1 shows the number of non-trivial rotations. The radix-2 case has more non-trivial rotations, whereas the numbers for radix-4 and radix-16 are similar. Accordingly, the synthesis results of these FFTs show that more non-trivial rotations give higher area usage and power consumption, as can be seen in Tables 5.2 and 5.3. Note that the radix-2 case uses two pipeline registers at every second stage, while the radix-4 and radix-16 cases use two pipeline registers at every stage. Pipelining is discussed in detail in the next section.
Overall, the results show that the radix-16 case has the lowest area usage and power consumption at 476 MHz.
5.1.2 Pipelining
Various pipelining options are available in the FFT. The most important places to introduce pipeline registers are near the twiddle factor multipliers, since complex multiplications take the most computation time of all operations in the FFT block. In this section, the radix-16 case based on radix-4 butterflies is used, because the previous section concluded that it is the most efficient of the three radixes.
When radix-4 butterflies are used, the FFT block has a total of four stages. Three pipelining options are considered in this thesis: two pipeline registers at every stage, one pipeline register at every stage, and one pipeline register only at the second stage. Figure 5.2 shows the three cases.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Radix-2 209 419 637 910 1283 2189
Radix-4 172 291 520 827 898 1362
Radix-16 168 336 509 816 875 1372
Table 5.2: Synthesis results of power consumption for different radixes. All power numbers are in mW. The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and at every stage in radix-4 and radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Radix-2 1.746 1.746 1.760 1.911 1.976 2.311
Radix-4 1.546 1.546 1.549 1.549 1.570 1.742
Radix-16 1.508 1.508 1.508 1.520 1.540 1.732
Table 5.3: Synthesis results of area usage for different radixes. All area numbers are in mm2. The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and at every stage in radix-4 and radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Two reg. at every stage 168 336 509 816 875 1372
One reg. at every stage 155 308 471 1037 1095 2031
One reg. at second stage 178 378 930 - - -
Table 5.4: Synthesis results of power consumption for different pipelining methods. All power numbers are in mW. The FFT basis is radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Two reg. at every stage 1.508 1.508 1.508 1.520 1.540 1.732
One reg. at every stage 1.344 1.343 1.351 1.736 1.756 2.066
One reg. at second stage 1.247 1.306 1.730 - - -
Table 5.5: Synthesis results of area usage for different pipelining methods. All area numbers are in mm2. The FFT basis is radix-16.
Tables 5.4 and 5.5 show the synthesis results of the FFT block with different pipelining. The results show that pipelining is useful when the speed requirement is high, but when a lower speed is required, the pipelined structure has higher area usage and power consumption due to the extra registers. Note that a dash in the tables means the design did not meet the speed constraint.
5.1.3 FFT with Overlap-save Block and Inverse FFT
In the 256-point FFT based architecture, the 256-point FFT is performed first. The result for this block differs from that of the general FFT block, as expected: both power consumption and area usage are reduced compared with the normal FFT. This can be explained by the fact that when zeros are appended, as described in Section 4.2.2, the hardware, such as adders and multipliers, in the
Figure 5.2: Different options of pipelining when using radix-4 butterflies: (a) two registers at every stage, (b) one register at every stage, (c) one register at the second stage. Dotted lines represent the places of pipelining.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
FFT 168 336 509 816 875 1372
FFT w/ OLS. 149 298 452 721 772 1163
FFT w/ Post. 164 328 496 795 854 1332
Table 5.6: Synthesis results of power consumption for various FFT blocks. All power numbers are in mW. Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
FFT 1.508 1.508 1.508 1.520 1.540 1.732
FFT w/ OLS. 1.33 1.33 1.33 1.35 1.36 1.52
FFT w/ Post. 1.46 1.46 1.46 1.48 1.49 1.69
Table 5.7: Synthesis results of area usage for various FFT blocks. All area numbers are in mm2. Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.
initial stages will be reduced. This leads to lower power consumption and area usage.
Also, after the inverse FFT, 128 samples are rejected, so half of the butterflies in the last stage are removed. This results in lower power consumption and lower area usage than a normal FFT block. The results can be found in Tables 5.6 and 5.7.
5.2 Complex Multiplier
In the 1024-point FFT based architecture, complex multipliers are used to multiply the transformed input samples by the filter coefficients. There are three choices for the complex multipliers:
• Gauss complex multiplier: a complex multiplier using the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).
• Gauss complex multiplier with precomputed inputs: the same Gauss complex multiplication algorithm with precomputation, as shown in Figure 4.4(b).
• Standard complex multiplier: the standard way of computation, as shown in Figure 4.4(c).
As discussed in Section 2.6, the Gauss complex multiplication algorithm uses three real multiplications and five additions, while the standard complex multiplication algorithm uses four real multiplications and two additions.
Table 5.8 shows the synthesis results of area usage for the three multipliers. Even though the pre-computed Gauss complex multiplier removes two additions, it shows higher area usage. The reason is that the pre-computed multiplier uses three inputs, of which two are 9 bits and one is 8 bits, so it requires larger
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Gauss mult. 3861 3861 3861 4692 4748 5445
Pre. Gauss mult. 4125 4125 4125 4739 4802 5289
Standard mult. 3756 3756 3756 4352 4380 5487
Table 5.8: Synthesis results of area usage for the three complex multipliers. All area numbers are in µm2.
registers and multiplexers. Meanwhile, the standard complex multiplier performs better than the Gauss complex multiplier because of its simplicity. The synthesis results for power consumption are discussed in the next section.
5.2.1 Power Estimation with Random Coefficients
The coefficients of the FIR filter are determined by Equation (2.1). However, this thesis does not deal with the specific parameters; instead, the coefficients are chosen randomly to evaluate the performance of the complex multipliers.
However, we cannot use a single specific set of random coefficients, because the choice of the four coefficients affects the power results: if the number of transitions between coefficients is small, the power consumed in the gates is lower, and if the number of transitions is high, the power consumption becomes higher. To address this variation, 99 different sets of coefficients are simulated for each type of multiplier, and a histogram is made. Figure 5.3 shows the result.
The results show that the standard complex multiplier is more efficient than the Gauss complex multiplier because of its simplicity, while the pre-computed Gauss complex multiplier is the most efficient of the three. The detailed results for all the multipliers are shown in Tables 5.9 to 5.11.
5.3 4-tap FIR Filters
5.3.1 Pipelining
All the architectures have been implemented with pipelining. The effect of pipelining can be seen at higher frequencies, where the power consumption is reduced significantly. As expected, the power consumption at lower frequencies increases compared to the non-pipelined structures. Pipelining of the parallel Gauss structure is implemented as indicated in Figure 5.4(a). The critical path of this structure is determined to be
2 Tadder + Tfilter (5.1)
After pipelining, the critical path is reduced to

Tadder + Tfilter (5.2)
Figure 5.3: Histogram of power consumption of three complex multipliers at 476 MHz. Power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.510 1.023 1.723 3.323 3.597 6.817
Minimum 0.385 0.770 1.308 2.750 2.984 5.681
Median 0.461 0.921 1.574 3.064 3.316 6.291
Table 5.9: Synthesis results of power consumption for the Gauss complex multiplier. All power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.306 0.613 1.014 1.756 1.939 3.407
Minimum 0.262 0.523 0.878 1.729 1.825 3.238
Median 0.290 0.580 0.965 1.744 1.894 3.340
Table 5.10: Synthesis results of power consumption for the precomputed Gauss complex multiplier. All power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.502 1.005 1.696 3.010 3.230 5.241
Minimum 0.406 0.809 1.368 2.619 2.814 4.577
Median 0.455 0.909 1.560 2.847 3.055 4.951
Table 5.11: Synthesis results of power consumption for the standard com-plex multiplier. All power numbers are in mW.
Figure 5.4: Pipelined structures of 4-tap FIR filters: (a) parallel Gauss structure, (b) parallel standard structure, (c) direct form structure. Dotted lines represent the places of pipelining.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15475 15477 15420 16149 16217 18206
Power 1.167 2.334 3.540 5.350 5.820 11.490
Table 5.12: Synthesis results of the direct form with standard complex multipliers. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 17211 17211 17208 18832 18895 19371
Power 1.326 2.651 4.019 7.018 7.440 10.702
Table 5.13: Synthesis results of the direct form with standard complex multipliers when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
In the direct form structure, pipelining is implemented as in Figure 5.4(c) for the 4-tap filters with the standard complex multiplier and the Gauss complex multiplier. As indicated in Tables 5.12 and 5.13, pipelining shows its positive effect at 667 MHz, where the power consumption of the pipelined structure is lower than that of the non-pipelined structure. This effect is even more pronounced in the direct form filter with Gauss complex multipliers: Tables 5.14 and 5.15 show that from 476 MHz upwards, the power consumption is much lower than in the non-pipelined structure. Therefore, the pipelined structure leads to much higher power savings at high frequencies.
The direct form structure with the standard complex multiplier shows the same effect at 667 MHz, where its power consumption is lower than that of the non-pipelined version, as seen in Tables 5.12 and 5.13.
Regarding area, pipelining also reduces area usage at high operating frequencies. This effect is pronounced in the direct form structure with Gauss complex multipliers, where the area is significantly reduced at 667 MHz. As expected, at lower frequencies the area usage is higher, since extra registers are added where such a speedup of the circuit is not required.
The parallel standard structure is also implemented with pipelining, as indicated in Figure 5.4(b). The critical path of this structure is determined to be

Tadder + Tfilter (5.3)
In the parallel standard structure, pipelining shows the same behavior: at higher frequencies, both power consumption and area usage are reduced, whereas at lower frequencies the pipelined version has higher area usage and power consumption.
The pipelined parallel Gauss structure also performs better. One reason could be the shortened critical path, which makes the circuit more efficient. The area is reduced at 667 MHz. Figure 5.5 shows the effect of pipelining on the power consumption of the various structures.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 16904 16904 16415 21198 21724 31737
Power 1.224 2.448 3.934 8.616 9.841 25.062
Table 5.14: Synthesis results of the direct form filter with Gauss complex multipliers. All area numbers are in µm2 and power numbers are in mW.

Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 18118 18118 18190 21278 21373 22793
Power 1.424 2.848 4.382 8.335 8.888 14.139
Table 5.15: Synthesis results of the direct form filter with Gauss complex multipliers when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15446 15716 15716 19575 20105 28959
Power 1.155 2.309 3.953 8.518 9.353 22.457
Table 5.16: Synthesis results of the direct form filter with Gauss complex multipliers when precomputation is applied. All area numbers are in µm2
and power numbers are in mW.
Figure 5.5: Power consumption (mW) versus frequency (MHz) for the parallel standard, parallel Gauss, direct form standard, and direct form Gauss structures, pipelined and non-pipelined.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 14876 14876 14395 15472 15611 18236
Power 1.278 2.468 3.729 5.701 7.946 12.385
Table 5.17: Synthesis results of the parallel Gauss structure. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15162 15162 14790 16049 16094 17116
Power 1.219 2.437 3.670 5.912 6.209 10.566
Table 5.18: Synthesis results of the parallel Gauss structure when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters
Two filters are implemented based on the Gauss complex multiplication algorithm: the first uses Gauss complex multipliers in the direct form structure, and the other is the parallel Gauss structure. The parallel Gauss structure uses three 4-tap filters, two with coefficients that are the sum and the difference of the real and imaginary coefficients, and one with the imaginary coefficients. The results show that the parallel Gauss structure is the better of the two implementations; the results of both can be seen in Tables 5.14 and 5.17.
This can be explained by the fact that the Gauss complex multiplier in the direct form implementation consists of three internal stages on which the output depends. When operated at high frequency, the power consumption therefore increases considerably, because computations must pass through all three stages at very high speed. The parallel Gauss structure, on the other hand, consists of three real-valued 4-tap filters with real multipliers inside; although it uses the same Gauss complex multiplication principle, it is a much simpler structure.
Another implementation of the Gauss complex multiplication algorithm is the one with pre-computed inputs, seen in Figure 4.4(b). Although two adders are removed in this multiplier to increase performance, the improvement is small. This is because the coefficients are constant: the removed adders would only operate once, when new coefficients arrive, and remain static in all further cycles. Thus, not much improvement is seen with this algorithm.
5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters
There are two filters based on the standard complex multiplication algorithm. The direct form structure with standard complex multipliers performs better than the parallel standard structure.
Figure 5.6: Comparison of power consumption of the Gauss and standard algorithms.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15766 15764 15519 17083 17574 20200
Power 1.245 2.492 3.721 6.380 7.056 13.520
Table 5.19: Synthesis results of the parallel standard structure. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 16435 16435 16435 17218 17412 18725
Power 1.368 2.736 4.138 6.352 6.692 11.053
Table 5.20: Synthesis results of the parallel standard structure when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
The difference between the two architectures lies in how the summation is performed: the order of the adders after the multipliers differs, which causes one architecture to consume more power than the other. The difference in power consumption is depicted in Figure 5.6.
5.3.4 Summary of Results
The synthesis results of all 4-tap FIR filter implementations lead to the conclusion that the direct form with standard complex multipliers is the best filter for operation at 476 MHz. The overall architecture of this block is much simpler, and its dependency on other blocks is minimal compared to the other 4-tap FIR filter structures.
The power estimation is performed multiple times in order to obtain the range of power consumption. The histogram in Figure 5.7 shows the distribution of a set of 99 power consumption values for the direct form filter with standard complex multipliers. The minimum, maximum, and median values of this implementation
Figure 5.7: Histogram of power consumption of the direct form filter with the standard complex multipliers at 476 MHz. Power numbers are in mW.

Frequency 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Median 1.166 2.333 3.537 5.340 5.810 11.475
Minimum 1.150 2.299 3.487 5.253 5.717 11.322
Maximum 1.179 2.361 3.578 5.415 5.891 11.636
Table 5.21: Power consumption range of direct form filter with standard complex multipliers. All power numbers are in mW.
can be seen in Table 5.21.
5.3.5 Tap Configuration Results
The direct form structure with standard complex multipliers is the most power efficient of all the filters, as discussed in Section 5.3.4. A 4-tap filter can also be used with only some taps active. Table 5.22 shows various tap configurations of the 4-tap filter and the related power consumption. Clearly, when taps are switched off, the power consumption is reduced because the multipliers have lower switching activity. When all taps are switched off, the power consumption is quite low: the dynamic power is then mostly that of the delay elements, since all multipliers have minimal switching activity with the coefficients set to zero, and consequently all inputs to the adders are zero. This lowers the total power consumption.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
All Taps On 1.167 2.334 3.540 5.350 5.820 11.490
One Tap Off 0.815 1.630 2.505 4.822 5.256 10.445
Two Taps Off 0.695 1.390 2.136 4.173 4.566 9.140
Three Taps Off 0.577 1.154 1.758 3.465 3.816 7.616
All Taps Off 0.311 0.621 0.986 1.813 1.987 3.572
Table 5.22: Comparison of power consumption of the direct form filter with standard complex multipliers for various tap configurations. All power numbers are in mW.

Figure 5.8: Different procedures of the commutator: (a) twiddle factor multipliers first, (b) commutator first.
5.4 Different Procedures of Commutator and Twiddle Factor Multiplication
The second commutator in the 1024-point FFT based architecture can be located before or after the twiddle factor multiplication block. Both configurations give the same functional result but differ in efficiency, since the input data wordlength of the system is set to 12 bits and the output data wordlength of the twiddle factor multiplier needs to be 24 bits. Also, as seen in Figure 5.8, the first output of the 4-point FFT always has an angle of 0°, so one complex multiplication can be removed from one of the four outputs.
Therefore, if the twiddle factor multiplication block comes first, the block maintains the same configuration regardless of the clock cycle by sending every