Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2018
Implementation of High-Speed 512-Tap FIR Filters for Chromatic Dispersion Compensation
Cheolyong Bae and Madhur Gokhale
Implementation of High-Speed 512-Tap FIR Filters for Chromatic Dispersion Compensation
Cheolyong Bae and Madhur Gokhale
LiTH-ISY-EX--18/5179--SE
Supervisor: Oscar Gustafsson
isy, Linköpings universitet
Examiner: Oscar Gustafsson
isy, Linköpings universitet
Division of Computer Engineering Department of Electrical Engineering
Linköping University SE-581 83 Linköping, Sweden
Abstract
A digital filter is a system or device that modifies a signal, an essential function in digital communication. Using optical fibers for communication has various advantages over copper wires, such as higher bandwidth and longer reach. However, at high transmission rates, chromatic dispersion arises as a problem that must be mitigated in an optical communication system. Therefore, a filter that compensates for chromatic dispersion is necessary. In this thesis, we introduce the implementation of a new filter architecture and compare it with a previously proposed architecture.
Acknowledgments
We would like to express our gratitude to our supervisor and examiner Oscar Gustafsson for his guidance in this thesis. Since the beginning of this thesis, we have developed our knowledge in the field of computer engineering. We would also like to thank our opponents, Aurélien Moine and Viswanaath Sundaram, for their valuable feedback on the results of this thesis.
Linköping, December 2018 Cheolyong Bae and Madhur Gokhale
Contents
Notation ix
1 Introduction 1
1.1 Goal . . . 1
1.2 Previous Research at LiU . . . 2
1.3 Limitation . . . 2
1.4 Outline of the Thesis . . . 2
2 Theory 3
2.1 Optical Networks and Chromatic Dispersion . . . 3
2.1.1 System Chain . . . 4
2.1.2 Analog-Digital Converter for Optical Transmission . . . 4
2.2 Finite-length Impulse Response Filter . . . 4
2.2.1 FIR Filtering in Frequency Domain . . . 5
2.3 Fast Fourier Transform . . . 6
2.3.1 Various Radix of FFT . . . 7
2.3.2 FFT Architecture . . . 9
2.4 Overlap-save . . . 9
2.5 Pipelining . . . 9
2.6 Complex Multiplication . . . 11
3 Method 13
3.1 Programming Language . . . 13
3.2 ModelSim . . . 13
3.3 Synopsys Design Compiler . . . 13
3.4 Power Estimation . . . 14
3.5 Approach . . . 14
4 Implementation 15
4.1 1024-point FFT Based FIR Filter Architecture . . . 15
4.1.1 Top Level Estimation . . . 15
4.1.2 First Commutator . . . 17
4.1.3 Second Commutator . . . 18
4.1.4 Twiddle Factor Multiplication . . . 19
4.1.5 Filter Coefficient Multiplier . . . 20
4.1.6 Coefficient Selector . . . 20
4.1.7 Multiplexer . . . 20
4.1.8 Global Counter and Control Signals . . . 20
4.2 256-point FFT Based FIR Filter Architecture . . . 20
4.2.1 Top Level Estimation . . . 22
4.2.2 Fast Fourier Transform with Overlap-save Method . . . 23
4.2.3 4-tap FIR Filters . . . 23
4.2.4 Multiplexer . . . 25
4.2.5 Inverse Fast Fourier Transform . . . 25
5 Result 27
5.1 256-point Fast Fourier Transform . . . 27
5.1.1 Radix-2 vs Radix-4 vs Radix-16 . . . 27
5.1.2 Pipelining . . . 28
5.1.3 FFT with Overlap-save Block and Inverse FFT . . . 29
5.2 Complex Multiplier . . . 31
5.2.1 Power Estimation with Random Coefficients . . . 32
5.3 4-tap FIR Filters . . . 32
5.3.1 Pipelining . . . 32
5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters . . . 37
5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters . . . 37
5.3.4 Summary of Results . . . 38
5.3.5 Tap Configuration Results . . . 39
5.4 Different Procedures of Commutator and Twiddle Factor Multiplication . . . 40
5.5 Different Cases in the First Commutator . . . 41
5.6 Comparison between Two Architectures . . . 42
5.7 Comparison with Previous Work . . . 45
6 Conclusion 47
6.1 Discussion . . . 47
6.2 Future Work . . . 48
Notation
Abbreviations
Abbreviation Description
ADC Analog Digital Converter
AWGN Additive White Gaussian Noise
BER Bit Error Rate
CD Chromatic Dispersion
DAC Digital Analog Converter
DFT Discrete Fourier Transform
DIF Decimation in Frequency
DIT Decimation in Time
DSP Digital Signal Processing
FFT Fast Fourier Transform
FIR Finite-length Impulse Response
FPGA Field Programmable Gate Array
HDL Hardware Description Language
IC Integrated Circuit
ICI Intercarrier Interference
IDFT Inverse Discrete Fourier Transform
IFFT Inverse Fast Fourier Transform
IIR Infinite-length Impulse Response
ISI Intersymbol Interference
LSB Least Significant Bit
MSB Most Significant Bit
OLS Overlap-save method
SAIF Switching Activity Information File
SDF Standard Delay Format
VCD Value Change Dump
VHDL Very High-speed Integrated Circuit Hardware Description Language
1 Introduction
Annual global IP traffic is growing and is predicted to reach 3.3 ZB (zettabytes) by 2021, up from 1.2 ZB in 2016 [1]. Due to the high demands of modern communication, fiber-optic communication is widely used because of its various advantages over copper-wire communication [19].
Optical communication can handle longer distances and higher bandwidth, and has better reliability. Despite these benefits, fiber-optic communication has some imperfections that need to be considered. One of these is chromatic dispersion.
Chromatic dispersion (CD) is a form of dispersion in optical fiber [19]. Because the components of a signal or pulse have different frequencies, each component propagates at a different speed. Thus, the signal or pulse is smeared and delivers incorrect information. Therefore, to recover the correct information, filtering is needed to compensate for the chromatic dispersion.
In this thesis, we introduce a new filter architecture to compensate for chromatic dispersion and compare it with a previously proposed architecture. Both architectures are implemented in VHDL and verified by Matlab simulation. The blocks in these two architectures are synthesized with the Design Compiler tool and analyzed in terms of area usage and power consumption. Some of the blocks have several design options intended to increase performance, so the results of those options and the overall comparison between the two architectures will be discussed.
1.1 Goal
The goal of this thesis is to design and evaluate a new architecture of a high-speed 512-tap finite-length impulse response filter for compensation of chromatic dispersion in optical fibers, and to perform a comparative analysis with the previous architecture. Previous research shows that it is possible to achieve an operating speed of 60 GS/s at a maximum clock frequency of 476 MHz, corresponding to a clock period of 2.1 ns [10]. This means that about 128 samples must be processed every clock cycle. The new architecture should achieve the same operating speed with less resource usage.
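The throughput requirement can be checked with a few lines of arithmetic. The sketch below is only an illustration of the numbers quoted from [10] (60 GS/s at 476 MHz); it rounds the required parallelism up to the next power of two:

```python
import math

def lanes_required(sample_rate_hz: float, clock_hz: float) -> int:
    """Samples that must be processed per clock cycle, rounded up
    to the next power of two (hardware works in power-of-two lanes)."""
    exact = sample_rate_hz / clock_hz        # ~126.05 for 60 GS/s at 476 MHz
    return 2 ** math.ceil(math.log2(exact))

# A 476 MHz clock corresponds to a period of 1 / 476e6 ~ 2.1 ns.
period_ns = 1e9 / 476e6
lanes = lanes_required(60e9, 476e6)
print(round(period_ns, 1), lanes)   # 2.1 128
```
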
1.2 Previous Research at LiU
At the Department of Electrical Engineering, Linköping University, 512-tap complex FIR filter architectures for compensation of chromatic dispersion [10] were previously investigated. Other studies on digital filters [4], representations of the FFT [5, 18], and FFT architectures [6, 7] have also been published.
1.3 Limitation
Since this thesis uses fixed-point numbers, not all bits can be retained through the arithmetic operations, and which bits of a sample carry the important information depends on the inputs and coefficients. Therefore, this thesis assumes that external signals are used to choose the desired bits.
The choice of the coefficients is outside the scope of this thesis. Instead of coefficients obtained from a chromatic dispersion filter, randomly generated values are used.
Some blocks had a deep hierarchy, because of which the propagation of switching activity to inner nets could not be ensured. Thus, the detailed gate-level power estimation cannot be said to be wholly accurate.
1.4 Outline of the Thesis
Chapter 2 presents theories behind FIR filters and optical networks. Chapter 3 covers the languages and tools used in this thesis.
Chapter 4 explains how the filters are implemented.
Chapter 5 contains the results of each block and the whole architecture.
2 Theory
2.1 Optical Networks and Chromatic Dispersion
Optical networks have significant advantages over traditional networks based on copper cables. They have much higher bandwidth and a lower Bit Error Rate (BER). Communication systems based on optical fiber are also less susceptible to electromagnetic interference, so they can be used over distances of more than one kilometer at speeds of tens of megabits per second [19].
Optical fibers, which are guided-wave structures, propagate light signals in optical networks. A narrow pulse launched into a fiber spreads, its width broadening as it travels along the fiber. Over long distances, the broadening of pulses extends into neighboring pulses, causing Intersymbol Interference (ISI). This effect is referred to as fiber dispersion. There are two basic types of dispersive effects in a fiber [20]: intermodal dispersion and chromatic dispersion.
Intermodal Dispersion: This form of dispersion exists in multimode fibers, since different modes have different group velocities. Pulses in different modes arrive at different times, each carrying different power. This dispersion limits the bit rate-distance product of an optical communication link [19].
Chromatic Dispersion: This dispersion occurs due to the frequency dependence of the group velocity. Chromatic dispersion can be modeled with the frequency response

C(exp(jωT)) = exp(−jK(ωT)²), K = Dλ²z / (4πcT²), (2.1)

where D is the fiber dispersion parameter, λ is the wavelength, c is the speed of light, T is the sampling period, and z is the propagation distance [4].
Figure 2.1: The system of an optical network model: upsampling by a factor L, anti-imaging filter G_TX, chromatic dispersion C, AWGN, the filter H for compensation of CD, anti-aliasing filter G_RX, and downsampling by a factor L.
The compensation of the chromatic dispersion is done by designing a filter with frequency response [4]:
H(exp(jωT)) = 1 / C(exp(jωT)) = exp(jK(ωT)²). (2.2)
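Equations (2.1) and (2.2) say that the compensator simply applies the opposite quadratic phase, so the cascade of channel and filter is all-pass. The sketch below checks this numerically; the link parameters are hypothetical example values only (D, λ, z, and T depend on the actual fiber and system):

```python
import cmath
import math

# Hypothetical link parameters for illustration only: D = 17 ps/(nm km)
# expressed in s/m^2, a 1550 nm carrier, z = 500 km, T = 1/60 GHz.
D, lam, z, T = 17e-6, 1550e-9, 500e3, 1 / 60e9
c = 3e8
K = D * lam ** 2 * z / (4 * math.pi * c * T ** 2)

def C(wT: float) -> complex:
    """Chromatic-dispersion channel response, Equation (2.1)."""
    return cmath.exp(-1j * K * wT ** 2)

def H(wT: float) -> complex:
    """Compensation filter, Equation (2.2): the opposite phase of C."""
    return cmath.exp(1j * K * wT ** 2)

# The cascade of channel and compensator is unity at every frequency.
for wT in (0.0, 0.3, 1.0, math.pi):
    assert abs(H(wT) * C(wT) - 1) < 1e-9
```

Note that both C and H have unit magnitude: chromatic dispersion is a pure phase distortion, which is why it can be undone exactly by an all-pass filter.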
2.1.1 System Chain
The full system chain is shown in Figure 2.1 [9]. In the system chain, in order to reduce the effects of ISI and intercarrier interference (ICI), interpolation on the transmitter side and decimation on the receiver side have been added. This interpolation and decimation require anti-imaging and anti-aliasing filters, which usually perform low-pass filtering [4].
The filter given by Equation (2.1) is added to simulate chromatic dispersion. The data then passes through an Additive White Gaussian Noise (AWGN) channel to simulate the random processes in nature. On the receiver side, the CD compensation filter reduces the effect of CD.
2.1.2 Analog-Digital Converter for Optical Transmission
Optical transmission relies on digital signal processing (DSP) and conversion between the analog and digital domains [12]. There are several DACs and ADCs aimed at optical communication. According to the sources given in [12], ADC bit resolutions range from 4 to 8 bits and maximum sample rates from 20 to 70 GS/s, using various technologies. In this thesis, we assume an input bit resolution of 6 bits and a target sample rate of 60 GS/s.
2.2 Finite-length Impulse Response Filter
Digital filters can be divided into two classes: Finite-length Impulse Response (FIR) and Infinite-length Impulse Response (IIR). An FIR filter has an impulse response of finite duration. On the contrary, if the impulse response has infinite duration, the filter is called an IIR filter. FIR filters are guaranteed to be stable unless used inside a recursive loop [22]. Equation (2.3) describes an FIR filter of order M with input x(n) and output y(n):

y(n) = Σ_{k=0}^{M} b_k x(n − k), (2.3)

Figure 2.2: Generic 4-tap filter.
where b_k is a coefficient of the FIR filter for 0 ≤ k ≤ M. Similarly, the transfer function can be expressed as

H(z) = Σ_{k=0}^{M} b_k z^{−k}. (2.4)

Also, the unit sample response of the FIR filter is the same as the coefficients b_k, that is,

h(k) = b_k for 0 ≤ k ≤ M, and h(k) = 0 otherwise. (2.5)
The output sequence described by Equation (2.3) can be expressed as the convolution summation of the system

y(n) = Σ_{k=0}^{M} h(k) x(n − k), (2.6)

where M is the order of the filter [17].
Generally, an FIR filter is described using the length of the filter rather than the order. The length of the filter is given by N = M + 1, where M is the order of the filter. The numbers of multiplications and additions in an FIR filter of length N are N and N − 1, respectively [22]. The direct-form structure of the FIR filter is one of the simplest structures and is depicted in Figure 2.2.
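The direct-form structure of Figure 2.2 can be sketched in a few lines of Python; this is an illustrative software model of Equation (2.6), not the VHDL used in the thesis:

```python
def fir_direct(x, h):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k), Eq. (2.6)."""
    M = len(h) - 1                      # filter order; length N = M + 1
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(M + 1):
            if 0 <= n - k < len(x):     # x(n - k) is zero outside the input
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

# 4-tap example as in Figure 2.2: a unit impulse reproduces the coefficients.
h = [0.5, 0.25, 0.125, 0.0625]
print(fir_direct([1.0, 0.0, 0.0, 0.0], h))   # -> [0.5, 0.25, 0.125, 0.0625]
```

The inner loop makes the cost of N multiplications and N − 1 additions per output sample explicit, which is exactly what the frequency-domain approach of the next section avoids for long filters.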
2.2.1 FIR Filtering in Frequency Domain
As discussed in the previous section, convolving the time-domain signal with the impulse response gives the output of the filter. This operation can be sped up by taking the Fourier transform of both the input signal and the coefficients and multiplying them. Taking the inverse Fourier transform of the product gives the same output as the convolution of the inputs. This method is much faster than time-domain convolution, due to the simplicity of multiplication and the speed of the Fast Fourier Transform (FFT). The approach is advantageous when filtering long data sequences. The complete filtering process is shown in Figure 2.3 [14].

Figure 2.3: Fast convolution: Y(k) = X(k)H(k), y(n) = x(n) * h(n).
When filtering long data sequences, the filtering is done block by block. The input stream of data is divided into segments, and each segment is processed one by one by the Discrete Fourier Transform (DFT) and inverse DFT. One method of performing filtering of long data sequences is the Overlap-save (OLS) method [17], which is described further in Section 2.4.
2.3 Fast Fourier Transform
The Fourier Transform (FT) is a mathematical way to decompose a function of time (a signal) into a function of frequency. When the FT is applied to discrete samples, it is called the Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) is simply an optimized algorithm for computing the DFT [16].
The DFT is given by

X_k = Σ_{n=0}^{N−1} x_n exp(−j2πkn/N), (2.7)

where x_n is a sequence of samples, N is the size of the transformation, and X_k is the k-th frequency component of the transform.
With this equation, the computation requires N² complex multiplications and additions or subtractions, without considering the elimination of trivial computations such as multiplication by 1. The FFT reduces the number of complex multiplications from N² to N log2 N. The best-known FFT method is the Cooley-Tukey algorithm [2]. It uses a divide-and-conquer technique that recursively breaks the DFT down into smaller DFTs.

Figure 2.4: An 8-point decimation in frequency FFT algorithm.

In order to explain the algorithm, the DFT of Equation (2.7) can be broken down into two parts [15]:

X_k = Σ_{n=0}^{(N/2)−1} x_{2n} W_{N/2}^{nk} + W_N^k Σ_{n=0}^{(N/2)−1} x_{2n+1} W_{N/2}^{nk}, (2.8)

X_{k+N/2} = Σ_{n=0}^{(N/2)−1} x_{2n} W_{N/2}^{nk} − W_N^k Σ_{n=0}^{(N/2)−1} x_{2n+1} W_{N/2}^{nk}, (2.9)

where W_N = exp(−j2π/N) is called the twiddle factor. With Equations (2.8) and (2.9), an N-point DFT can be computed by performing two N/2-point DFTs, one for the even-indexed samples and one for the odd-indexed samples. These equations can be broken down further until the size of the DFT equals the radix.

Figure 2.4 shows an example of an 8-point FFT, and Table 2.1 shows a comparison between the DFT and the FFT in terms of the number of complex multiplications.
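The even/odd split of Equations (2.8) and (2.9) leads directly to a recursive radix-2 FFT. The sketch below is an illustrative Python model (not the hardware description of this thesis) that checks the recursion against the direct DFT of Equation (2.7):

```python
import cmath

def dft(x):
    """Direct DFT, Eq. (2.7): O(N^2) complex multiplications."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT: split into even/odd halves, Eqs. (2.8)-(2.9)."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0] * N
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)   # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]           # Eq. (2.8)
        out[k + N // 2] = even[k] - w * odd[k]  # Eq. (2.9)
    return out

x = [complex(n) for n in range(8)]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
```

Each recursion level performs N/2 twiddle multiplications over log2 N levels, which is the N log2 N count quoted above.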
2.3.1 Various Radix of FFT
As discussed above, the FFT breaks down the DFT into smaller DFTs. The minimum size of the DFT depends on which radix is selected for the FFT. The radix is also the size of the butterfly [15]. A butterfly in this thesis denotes the basic operation of the FFT. For example, a radix-2 FFT uses one addition and one subtraction for one butterfly:
X(0) = x(0) + x(1) (2.10)
X(1) = x(0) − x(1) (2.11)
The FFT can, of course, use a higher radix. Figure 2.5 shows a radix-2 butterfly and a radix-4 butterfly.
Number of points   Complex multiplications     Complex multiplications
                   in direct computation       in FFT algorithm
4                  16                          4
8                  64                          12
16                 256                         32
32                 1024                        80
64                 4096                        192
128                16384                       448
256                65536                       1024
512                262144                      2304
1024               1048576                     5120

Table 2.1: Comparison of the number of complex multiplications in the direct computation of the DFT and the FFT algorithm [17].
Figure 2.5: (a) A radix-2 butterfly; (b) a radix-4 butterfly.
2.3.2 FFT Architecture
The FFT architectures used in this thesis are a pipelined implementation and a direct implementation. In the pipelined structure, the number of input samples is a power of two, and the input samples are processed in a continuous flow with data shuffling. Data shuffling is done using buffers and multiplexers [7].
The direct implementation can also be considered a parallel pipelined FFT where the degree of parallelization equals the size of the FFT [7]. This direct implementation is straightforward, mapping each operation according to the FFT flow graph.
In the direct implementation, the input samples arrive simultaneously, so there is no need for pipelining in the data flow. Also, decimation in time (DIT) and decimation in frequency (DIF) have the same architecture, the difference being in the rotators [9].
2.4 Overlap-save
Figure 2.6 illustrates the Overlap-save (OLS) method. When the input is a very long signal to be filtered with an FIR filter, OLS is one method to compute the discrete convolution. In the OLS method, the input samples are divided into blocks of M samples. The first block of M samples is preceded by L − 1 zeros, where L is the filter length, so the new block has a total of M + L − 1 = N samples: L − 1 zeros followed by M samples. Each subsequent block saves the last L − 1 samples of the previous block and prepends them to the next M samples, so it also has M + L − 1 samples: L − 1 old samples and M new samples. The same applies to all segments.
An N-point FFT is performed on each of these blocks. The FFT result of each block is then multiplied with the filter coefficients in the frequency domain, and an IFFT is performed. The initial L − 1 samples of each output block are discarded, and the results are concatenated. The concatenated result is the final output [17].
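The block processing above can be modeled compactly. The sketch below is an illustrative Python model of the OLS method (filter length L, block size N), verified against direct convolution; it is not the hardware of this thesis:

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; inverse=True flips the twiddle sign."""
    N = len(x)
    if N == 1:
        return list(x)
    sign = 2j if inverse else -2j
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    out = [0] * N
    for k in range(N // 2):
        w = cmath.exp(sign * cmath.pi * k / N)
        out[k], out[k + N // 2] = even[k] + w * odd[k], even[k] - w * odd[k]
    return out

def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]

def overlap_save(x, h, N=8):
    """OLS: keep the last L-1 samples of each block as the next overlap,
    discard the first L-1 outputs of each IFFT (L = filter length)."""
    L = len(h)
    M = N - (L - 1)                       # new samples consumed per block
    H = fft(h + [0] * (N - L))            # zero-padded filter spectrum
    buf = [0.0] * (L - 1)                 # initial overlap is zeros
    y = []
    for i in range(0, len(x), M):
        block = buf + x[i:i + M]
        block += [0.0] * (N - len(block))  # pad the final short block
        Y = ifft([a * b for a, b in zip(fft(block), H)])
        y.extend(v.real for v in Y[L - 1:])  # discard first L-1 outputs
        buf = block[M:M + L - 1]             # save last L-1 samples
    return y[:len(x)]

def direct(x, h):
    return [sum(h[k] * (x[n - k] if 0 <= n - k < len(x) else 0.0)
                for k in range(len(h))) for n in range(len(x))]

x = [float(n % 5) for n in range(20)]
h = [0.5, 0.25, 0.125]
assert all(abs(a - b) < 1e-9 for a, b in zip(overlap_save(x, h), direct(x, h)))
```

Each iteration consumes M = N − L + 1 new input samples and produces M valid outputs, matching the block structure of Figure 2.6.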
2.5 Pipelining
Pipelining is a method of increasing the throughput of a sequential algorithm. It is achieved by breaking the critical path into shorter paths by adding delays to the original path. Ideally, the critical path is broken into paths of equal length. With P stages of pipelining, P computations can run concurrently, which means an increase in throughput by a factor of P over sequential processing. Pipelining helps to achieve a higher level of parallelism in the structure [22].
In this thesis, pipelining is done for two reasons. One reason is to achieve a higher operating speed: without inserting delay elements in the critical path, the system is unable to run at certain speeds. The other reason is to reduce the number of resources when synthesizing the design, since pipelining reduces overall area usage and power consumption when operating at a higher frequency.
Figure 2.6: The overlap-save method: each block overlaps L − 1 samples with the previous one, and the first L − 1 samples of each IFFT output are discarded.
2.6 Complex Multiplication
In this thesis, we consider two different algorithms for complex multiplication. One is the standard complex multiplication algorithm, and the other is the Gauss complex multiplication algorithm [21].
The standard complex multiplication algorithm uses four real multiplications and two additions, as can be seen in Equation (2.12).
(a + bj)(c + dj) = (ac − bd) + j(bc + ad). (2.12)

When using the Gauss complex multiplication algorithm, it is possible to reduce the number of real multiplications. The algorithm is as follows:

k1 = c(a + b), k2 = a(d − c), k3 = b(c + d), (2.13)

ac − bd = k1 − k3, bc + ad = k1 + k2. (2.14)
This algorithm gains speed if one multiplication is more expensive than three additions or subtractions. However, it has three steps of computation, which makes the architecture more complicated. The difference between these two multipliers is discussed later in this thesis.
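Both algorithms are easy to cross-check numerically. The sketch below verifies that Gauss's three-multiplication formulation, Equations (2.13) and (2.14), agrees with the standard four-multiplication form of Equation (2.12):

```python
def gauss_mult(a, b, c, d):
    """Gauss's algorithm, Eqs. (2.13)-(2.14): 3 real multiplications."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)   # (real, imaginary) parts of (a + jb)(c + jd)

def standard_mult(a, b, c, d):
    """Schoolbook algorithm, Eq. (2.12): 4 real multiplications."""
    return (a * c - b * d, b * c + a * d)

for (a, b, c, d) in [(1, 2, 3, 4), (-5, 7, 0.5, -0.25)]:
    assert gauss_mult(a, b, c, d) == standard_mult(a, b, c, d)
```

Expanding k1 − k3 = c(a + b) − b(c + d) = ac − bd and k1 + k2 = c(a + b) + a(d − c) = bc + ad confirms the identity algebraically as well.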
3 Method
This chapter covers methods that are used in this thesis.
3.1 Programming Language
Firstly, Matlab is used to verify the system. Matlab is a mathematical computing application and programming language developed by MathWorks. Due to its convenient handling of complex computations, we used Matlab to verify our system.
The second language is VHDL, an abbreviation for "Very High-speed Integrated Circuit Hardware Description Language". VHDL is generally used in the electronic design of FPGAs (Field Programmable Gate Arrays) and Integrated Circuits (ICs). We used it because some code from previous studies was available, which reduced the time required to implement both architectures in this thesis.
3.2 ModelSim
ModelSim is a simulation application for hardware description languages such as VHDL and Verilog, and for system-level modeling languages such as SystemC. This tool is used to verify that the system is functionally correct without using any physical equipment.
3.3 Synopsys Design Compiler
Synopsys Design Compiler is a tool that synthesizes high-level design blocks written in HDL code into physical hardware. It creates netlists consisting of logic-level design blocks. When compiling with Design Compiler, it is possible to specify parameters such as clock period and switching activities, which allows comparing results under given constraints.
It also provides useful commands for obtaining optimized results, such as "compile_ultra", which includes many optimization features such as automatic ungrouping, datapath optimization, and timing analysis.
3.4 Power Estimation
In order to obtain a more accurate power estimate, it is necessary to set proper switching activities for the ports in the design. An increase in switching activity causes more dynamic power consumption.
In this thesis, the power estimation is done using a Switching Activity Information File (SAIF) in Design Compiler.
There are two ways of generating a SAIF file. One is to write out a SAIF file directly; the other is to convert a VCD (Value Change Dump) file from the simulation into a SAIF file using the "vcd2saif" command in Design Compiler. Since the designs in this thesis are too big to write a file directly, the latter method is used.
The detailed procedure is as follows:
1. Read the design and compile it in Design Compiler.
2. Generate an SDF (Standard Delay Format) file in Design Compiler by using "write_sdf " command.
3. Read the SDF file and the testbench file of the design in ModelSim and create a VCD file.
4. Convert the VCD file to a SAIF file.
5. Read the SAIF file in Design Compiler and report power.
3.5 Approach
In order to achieve a power-optimized filter, multiple variants of every block were designed and analyzed for power consumption. Each variant was synthesized over a range of frequencies from 100 MHz to 667 MHz to better understand the behaviour of each block. The most power-efficient variant of each block at a frequency of 476 MHz was then selected for the filter.
4 Implementation
This chapter presents the implementation of two different filter architectures. One is the 1024-point FFT based FIR filter architecture, which is the architecture proposed in this thesis. The other is the 256-point FFT based FIR filter architecture, which was previously proposed [9].
In this thesis, wordlengths are determined based on the bit error rate (BER) simulated in the previous work [9]. The input data wordlength is 12 bits, with 6 bits for the real part and 6 bits for the imaginary part, as mentioned in Section 2.1.2. Inside the architectures, the data wordlength after quantization is 24 bits (12 real, 12 imaginary). The filter coefficient wordlength is 16 bits (8 real, 8 imaginary).
4.1 1024-point FFT Based FIR Filter Architecture
Figure 4.1 shows an overview of the architecture. In this architecture, a 1024-point FFT is performed using 4-point FFTs and 256-point FFTs. Commutators perform data shuffling of the input samples so that the FFTs operate on the correct data, and after multiplication with the filter coefficients, the inverse FFT is performed. The inverse FFT is done by mirroring the FFT process, interchanging the real and imaginary parts of the samples from the multipliers and of the outputs [3].
4.1.1 Top Level Estimation
Various operators with different wordlengths are used in the architecture. The operators that influence area usage and power consumption in the design are the adder (subtractor), general complex multiplier, complex multiplier with constants, complex multiplier with reconfigurable constants, multiplexer, and delay element.
Figure 4.1: 1024-point FFT based FIR filter architecture.
A general complex multiplier refers to a complex multiplier whose inputs are both not specifically determined. A complex multiplier with constants refers to a complex multiplier whose multiplier value is constant, whereas reconfigurable constants change every clock cycle.
Although the detailed gates are selected by the synthesis tool, it is good to estimate overall performance by analyzing the number of operators in each block. In the 1024-point FFT based FIR filter architecture, 256 samples are processed at every clock cycle since we use the OLS method.
First, the numbers of operators in the 256-point FFT block can be computed by the following equations:

Complex adders = N log2 N, (4.1)
Complex multipliers = (N/2)(log2 N − 1), (4.2)

where N is the number of processed samples. The complex multipliers here use twiddle factors, which are constants.
Second, the number of multipliers in the twiddle factor multiplication block is (3/4)N or N, depending on which of the two procedures is used for the second commutator and the twiddle factor multiplication block. The difference is discussed in Section 5.4. These multipliers use reconfigurable constants; therefore, three 2 × 1 multiplexers are used to select one constant out of four.
Third, the number of filter coefficient multipliers is N. These are general multipliers whose inputs are not constants. Here, multiplexers are used for each multiplier: three 2 × 1 multiplexers to select the coefficient and two 2 × 1 multiplexers for quantization.
Fourth, the commutators contain delay blocks and 2 × 1 multiplexers. The number of delay blocks differs in each commutator.
Operators Number
Complex adders 5120
General complex mult. 256
Complex mult. const. 1792
Complex mult. w/ reconfigurable const. 384 or 512
2 × 1 Multiplexers 3840 or 4224
Delay elements 2432
Table 4.1:Number of operators in the 1024-point FFT based FIR filter archi-tecture.
The total numbers of operators in the whole architecture, including the inverse FFT, can be computed by the following equations:

Total complex adders = 4N + 2N log2 N, (4.3)
Total general complex multipliers = N, (4.4)
Total complex mult. w/ const. = N(log2 N − 1), (4.5)
Total complex mult. w/ reconfigurable const. = (3/2)N or 2N, (4.6)
Total 2 × 1 multiplexers = ((9/2)N or 6N) + 2N + 3N + N + 2N + 2N + (1/2)N, (4.7)
Total delay elements = (5/2)N + 3N + 3N + N. (4.8)
The overall results for the architecture can be seen in Table 4.1. Note that the exact numbers may differ depending on the optimization process. More details of each block are discussed further below.
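Evaluating the closed-form counts with N = 256 reproduces the totals of Table 4.1. The sketch below is only an arithmetic check; the per-term grouping of the multiplexer count is an estimate read off from the block descriptions, not a statement about the synthesized netlist:

```python
import math

N = 256                          # samples processed per clock cycle
L = int(math.log2(N))            # log2(256) = 8

adders       = 4 * N + 2 * N * L             # Eq. (4.3) -> 5120
general_mult = N                             # Eq. (4.4) ->  256
const_mult   = N * (L - 1)                   # Eq. (4.5) -> 1792
reconf_mult  = (3 * N // 2, 2 * N)           # Eq. (4.6) -> 384 or 512
# Eq. (4.7): 9N/2 or 6N for the two procedures, plus the remaining mux terms
muxes        = tuple(v + 2 * N + 3 * N + N + 2 * N + 2 * N + N // 2
                     for v in (9 * N // 2, 6 * N))   # -> 3840 or 4224
delays       = 5 * N // 2 + 3 * N + 3 * N + N        # Eq. (4.8) -> 2432

print(adders, general_mult, const_mult, reconf_mult, muxes, delays)
```
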
4.1.2 First Commutator
The architecture consists of 4-point FFTs and a 256-point FFT. Therefore, it is necessary to change the order of the inputs when they go through the FFT blocks. The theory behind this block can be found in [6, 8].
The first commutator is used for the 4-point FFTs. The first stage of the FFT needs to use samples 0 to 255, 256 to 511, 512 to 767, and 768 to 1023. This block also needs to handle the OLS method. Therefore, the input data stream needs to be permuted. Details of the permutation flow can be seen in Figure 4.2. The important point in the figure is that b9 and b8 are placed on the right-hand side of the matrix, and the total number of bits on the right-hand side is eight. Then, it is possible to pick the correct data to perform the 4-point FFT.
In order to explain how the data is permuted, it is easier to use bit index numbers as follows:
b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0 (4.9)
Indexes to the left of the vertical bar in Equation (4.9) represent the vertical dimension of the matrix in Figure 4.2, and indexes to the right of the vertical bar represent the horizontal dimension. Note that the bits on the left-hand side (b7, b8, and b9 in Equation (4.9)) are incremented by the clock signal.

Figure 4.2: Operation of the first commutator.
In order to handle the OLS method and select correct bits for the 4-point FFT, the first step is to move b9 to the right-hand side index as follows:
b8 b7 | b9 b6 b5 b4 b3 b2 b1 b0 (4.10)
This is done by adding delay blocks which take four clock cycles at the input stage. By doing this, the 512 samples are overlapped and saved.
The next step is moving b8 to the right-hand side. In order to do this, b8 needs to be exchanged with one of the bits b0 to b6. According to [6, 8], this process requires a serial-parallel circuit with a delay size of two. If we select b0, for example, the final order of the indexes is the following:
b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8 (4.11)
Note that the selection of the bit from b0 to b6 results in different power consumption, as discussed in Section 5.5. Figure 4.3 describes the whole architecture of the first commutator.
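The bit-index shuffle of Equations (4.9) to (4.11) can be modeled as a pure index permutation. The sketch below is an illustrative software model of the data reordering only (not the serial-parallel hardware): with the final order b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8, the four samples x, x+256, x+512, and x+768 that feed one 4-point FFT end up in the same parallel position:

```python
def first_commutator_index(i: int) -> int:
    """Map a 10-bit sample index i to its position after the permutation
    (b9 b8 b7 | b6 ... b0)  ->  (b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8)."""
    b = [(i >> k) & 1 for k in range(10)]   # b[k] = bit bk of i
    new_msb_first = [b[0], b[7], b[9], b[6], b[5], b[4], b[3], b[2], b[1], b[8]]
    out = 0
    for bit in new_msb_first:               # reassemble, MSB first
        out = (out << 1) | bit
    return out

# The mapping is a bijection over the 1024 sample indices ...
assert len({first_commutator_index(i) for i in range(1024)}) == 1024

# ... and i, i+256, i+512, i+768 (differing only in b8 and b9) share the two
# left-hand (parallel) bits, i.e. the same row of the output matrix.
rows = {first_commutator_index(5 + k * 256) >> 8 for k in range(4)}
assert len(rows) == 1
```

Since b8 and b9 both end up in the right-hand (time) dimension, the four samples of each 4-point FFT arrive in the same lane on different clock cycles, which is exactly what the 4-point FFT stage requires.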
4.1.3 Second Commutator
After the 4-point FFTs are performed, the data stream goes to the 256-point FFT. But the data from the 4-point FFTs cannot be used directly, because it was permuted. In order to perform the 256-point FFT, the order of the data needs to be restored to normal order.
Figure 4.3: Detailed architecture of the first commutator.

The exchange of index numbers on the right-hand side of the expression can easily be done by changing the wiring. Therefore, the important feature of the second commutator is to place b8 and b9 on the left side. This can be done similarly to the first commutator: serial-parallel circuits with one delay and with two delays are used in the second commutator.
After using the serial-parallel circuit with one delay, b7 and b8 are exchanged.
b0 b8 | b9 b6 b5 b4 b3 b2 b1 b7 (4.12)
To exchange b9 with b0, a change of wiring is needed first. The following expression shows the index order after the rewiring:
b0 b8 | b7 b6 b5 b4 b3 b2 b1 b9 (4.13)
Then, we can exchange b9 with b0 by using the circuit with two delays.
b9 b8 | b7 b6 b5 b4 b3 b2 b1 b0 (4.14)
The other commutators (third and fourth) are mirrored versions of the first and second, although they use a different data wordlength, and the final commutator discards half of the outputs instead of saving them. More details are discussed in Chapter 5.
4.1.4 Twiddle Factor Multiplication
The twiddle factor multiplication block is more complicated than its counterpart inside a typical FFT block because, in this architecture, it handles 256 of the 1024 samples in each clock cycle. Therefore, the block needs a two-bit control signal and a multiplexer to select the correct twiddle factor for every sample.
4.1.5 Filter Coefficient Multiplier
Three types of complex multipliers are implemented, as shown in Figure 4.4. The Gauss complex multiplication algorithm can be modified to use pre-computed inputs, which removes two adders as described in Figure 4.4(b). With pre-computed inputs, the computation energy is expected to be reduced by the removal of the two additions.
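The arithmetic of Figure 4.4(b) can be sketched in a few lines of Python (function names are illustrative only). For a constant coefficient c + di, the terms c, d − c, and c + d are computed once, leaving three real multiplications and three additions per sample:

```python
def gauss_precompute(c, d):
    """Pre-compute the coefficient terms once (the coefficient is constant)."""
    return c, d - c, c + d

def gauss_multiply(a, b, pre):
    """Multiply the sample a + bi by the pre-computed coefficient c + di."""
    c, d_minus_c, c_plus_d = pre
    k1 = c * (a + b)          # shared product
    k2 = a * d_minus_c
    k3 = b * c_plus_d
    return k1 - k3, k1 + k2   # (real, imaginary) = (ac - bd, ad + bc)

# Agrees with the standard four-multiplication result
pre = gauss_precompute(3, -7)
assert gauss_multiply(2, 5, pre) == (2 * 3 - 5 * (-7), 2 * (-7) + 5 * 3)
```

Without the pre-computation, the terms d − c and c + d cost the two extra additions shown in Figure 4.4(a).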
4.1.6 Coefficient Selector
When multiplying samples with filter coefficients, selecting the correct filter coefficients is necessary, similar to the twiddle factor multiplication block. The difference in the coefficient selector is that registers are needed at every multiplier. Here, we assume that the four coefficients for each multiplier are set externally, so four registers and one multiplexer are added to the complex multiplier, as seen in Figure 4.5.
4.1.7 Multiplexer
After the multiplication, the output wordlength from the multiplier is increased by the coefficient wordlength. In order to restore the original wordlength, a quantization multiplexer is implemented to select 12 bits from the filter output. This multiplexer gives the user of the system the flexibility to select bits according to the number of integer and fractional bits in the input data.
The multiplexer used in this architecture gives the user four choices through a two-bit external control signal: the first choice selects the 12 MSBs of both the real and imaginary parts, and each following choice moves the selection down by one bit for both parts.
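One way to read this selection scheme (an interpretation sketched in Python, not the exact hardware) is as a shift-and-mask on the output word, where the 2-bit control value k selects 12 consecutive bits starting from the MSBs for k = 0:

```python
def quantize(word, width, k, out_bits=12):
    """Select out_bits consecutive bits from an unsigned width-bit word.

    k = 0 keeps the MSBs; each increment of k moves the window down one bit.
    """
    shift = width - out_bits - k
    return (word >> shift) & ((1 << out_bits) - 1)

w = 0b1011_0010_1101_0111_0110_1001      # a 24-bit example word
assert quantize(w, 24, 0) == w >> 12     # the 12 MSBs
assert quantize(w, 24, 1) == (w >> 11) & 0xFFF
```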
Figure 4.6 shows the architecture of the multiplexer.
4.1.8 Global Counter and Control Signals
Synchronization is needed in this architecture because the correct twiddle factors and filter coefficients must be selected at the correct time when they are multiplied with the samples. Also, each block has a different delay due to its pipelining. Therefore, we implement a global counter that generates the control signals for each block.
4.2 256-point FFT Based FIR Filter Architecture
Figure 4.7 shows an overview of the architecture. This architecture uses polynomial convolution [11]: it separates the impulse response into polyphase components and performs an FFT on them. The architecture thus performs an FFT on the input samples and uses a 4-tap FIR filter on each transformed sample. The inverse FFT is performed by interchanging the real and imaginary parts of the FIR filter outputs, performing an FFT on
Figure 4.4: Multipliers used in this thesis: (a) Gauss complex multiplier, (b) Gauss complex multiplier with pre-computation, (c) standard complex multiplier.
Figure 4.5: Coefficient selector: four registers and a multiplexer added to the complex multiplier, with the coefficients selected externally.
Figure 4.6: Multiplexer for quantization.
Figure 4.7: 256-point FFT based FIR filter architecture.
them, and interchanging the real and imaginary parts of the result, as mentioned in [3].
4.2.1 Top Level Estimation
In the 256-point FFT based FIR filter architecture, the complex multipliers in the FFT use constants, and the complex multipliers in the 4-tap FIR filters use configured constants, because the filter coefficients are configured once when the system is set up. Since the complex multipliers inside the FFT multiply by constants, they can be optimized at the hardware level.
In the 256-point FFT block of this architecture, the input has a 12-bit data wordlength, and half of the outputs from the IFFT are discarded.
The number of operators in the FIR filters is computed by the following equations:
Complex adders = N (M − 1) (4.15)
Delay elements = N (M − 1) (4.16)
Complex mult. w/ configured const. = N M (4.17)

where M is the number of taps of each FIR filter and N is the number of FIR filters.
Also, after each 4-tap FIR filter, there are two 2 × 1 multiplexers for quantization. The total number of each operator in the architecture can be computed by
Operators Number
Complex adders 4736
Complex mult. w/ const. 1792
Complex mult. w/ configured const. 1024
2 × 1 Multiplexers 512
Delay elements 896
Table 4.2: Number of operators in the 256-point FFT based FIR filter.
the following equations:
Total complex adders = 2N log2 N + N(M − 1) − N/2 (4.18)
Total complex mult. w/ const. = N(log2 N − 1) (4.19)
Total complex mult. w/ configured const. = N M (4.20)
Total 2 × 1 multiplexers = 2N (4.21)
Total delay elements = N/2 + N(M − 1) (4.22)
The detailed numbers of operators can be seen in Table 4.2. Note that the exact numbers may differ with a different optimization process. More details of each block are discussed further below.
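The totals in Table 4.2 follow directly from Equations (4.18)-(4.22) with N = 256 FFT points/filters and M = 4 taps; a quick sanity check in Python:

```python
from math import log2

N, M = 256, 4                              # 256-point FFT, 4-tap filters

adders  = int(2 * N * log2(N) + N * (M - 1) - N // 2)   # Eq. (4.18)
mult_c  = int(N * (log2(N) - 1))           # constant (twiddle) multipliers
mult_cc = N * M                            # configured-constant multipliers
muxes   = 2 * N                            # 2x1 quantization multiplexers
delays  = N // 2 + N * (M - 1)             # Eq. (4.22)

assert (adders, mult_c, mult_cc, muxes, delays) == (4736, 1792, 1024, 512, 896)
```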
4.2.2 Fast Fourier Transform with Overlap-save Method
As discussed in Section 1.1, 128 samples must be processed in every clock cycle. According to [13], the FFT size needs to be twice the number of required output samples. This leads to a 256-point FFT whose outputs, with 128 overlapped samples, generate 128 output samples.
The FFT based filtering uses the OLS method, implemented in the first block as indicated in Figure 2.6. The block receives 128 new samples, each with 6 bits real and 6 bits imaginary. Six zeros are appended in the LSBs of both the real and imaginary parts, making each of the 128 samples 24 bits wide. These 128 samples are then kept for one cycle. When the next 128 samples arrive and are zero-padded in the same way, the 128 new samples and the 128 retained samples are combined into 256 samples, which are fed into the 256-point FFT block. The purpose of this block is therefore to perform the OLS component of FFT based FIR filtering.
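The framing performed by this block can be sketched as follows (a behavioral Python illustration; the bit-level zero padding and wordlengths are omitted): every cycle the 128 retained samples and the 128 new samples are concatenated into one 256-sample FFT frame, so consecutive frames overlap by half.

```python
FRAME = 256          # FFT length
STEP = FRAME // 2    # 128 new samples per clock cycle

def ols_frames(samples):
    """Yield 256-sample frames overlapping by 128, as fed to the FFT."""
    previous = [0] * STEP          # the very first frame is zero-padded
    for start in range(0, len(samples) - STEP + 1, STEP):
        new = samples[start:start + STEP]
        yield previous + new       # old half followed by new half
        previous = new

frames = list(ols_frames(list(range(512))))
assert len(frames) == 4
assert frames[1][:STEP] == frames[0][STEP:]   # 128-sample overlap
```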
4.2.3 4-tap FIR Filters
A standard 4-tap FIR filter contains three delay elements and four multipliers and adders. The configuration of the summation is left to the synthesis tool; it could be a tree structure or a sequential structure. A generic 4-tap filter is constructed and then reused in all other 4-tap filter implementations. A general 4-tap filter is depicted in Figure 2.2.
4.2.3.1 Direct Form Implementation
The 4-tap filter in direct form consists of three delay elements. The following multipliers are used for the separate implementations:
• With Gauss complex multiplier: the complex multiplier in this 4-tap filter uses the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).
• With Gauss complex multiplier with precomputed inputs: the complex multiplier uses the Gauss complex multiplication algorithm with precomputed inputs, as shown in Figure 4.4(b).
• With standard complex multiplier: the complex multiplier uses the standard complex multiplication algorithm, as shown in Figure 4.4(c).
The structure of the adder that sums the four multiplier outputs is left to the synthesis tool. The filter is depicted in Figure 2.2.
4.2.3.2 Parallel Standard Structure
This filter consists of four generic 4-tap filters, each with a coefficient wordlength of 8 bits, and implements the standard complex multiplication algorithm. Two of the generic 4-tap filters take the real parts of the coefficients as filter coefficients, and the other two take the imaginary parts. One filter with real coefficients and one with imaginary coefficients are fed with the real input values, and the other two filters are fed with the imaginary input values. The outputs of the four filters are then summed to produce the outputs. The filter is depicted in Figure 4.8. The real and imaginary outputs are given by Equations (4.23) and (4.24), respectively:
a hre − b him (4.23)
a him + b hre (4.24)
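Equations (4.23) and (4.24) can be checked with a small behavioral model (Python, illustrative names) that combines four real-valued FIR filters into one complex filter:

```python
def fir(x, h):
    """Real-valued FIR: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def parallel_standard(x_re, x_im, h_re, h_im):
    """Four real filters: re = a*h_re - b*h_im, im = a*h_im + b*h_re."""
    re = [p - q for p, q in zip(fir(x_re, h_re), fir(x_im, h_im))]
    im = [p + q for p, q in zip(fir(x_re, h_im), fir(x_im, h_re))]
    return re, im

# Matches a direct complex convolution
x = [1 + 2j, -3 + 1j, 0 + 4j]
h = [2 - 1j, 1 + 1j, 0 + 2j, -1 + 0j]
re, im = parallel_standard([v.real for v in x], [v.imag for v in x],
                           [c.real for c in h], [c.imag for c in h])
direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
          for n in range(len(x))]
assert all(abs(complex(r, i) - d) < 1e-9 for r, i, d in zip(re, im, direct))
```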
4.2.3.3 Parallel Gauss Structure
This filter implementation consists of three generic 4-tap filters: two with a coefficient wordlength of 9 bits and one with a coefficient wordlength of 8 bits. It uses the Gauss complex multiplication algorithm. The two 9-bit filters have the sum and the difference of the real and imaginary parts of the coefficients as their filter coefficients, and the remaining 8-bit filter has the imaginary parts of the coefficients. This can be seen in Figure 4.9.
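A behavioral sketch of the three-filter idea follows (Python). The input wiring here is an assumption: it is one common three-multiplication decomposition that matches the coefficient sets of Figure 4.9, and the exact wiring in the thesis hardware may differ.

```python
def fir(x, h):
    """Real-valued FIR: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def parallel_gauss(x_re, x_im, h_re, h_im):
    """Three real filters with coefficients h_re+h_im, h_im, h_re-h_im."""
    h_sum = [c + d for c, d in zip(h_re, h_im)]
    h_diff = [c - d for c, d in zip(h_re, h_im)]
    # Shared term: the h_im filter driven by (real - imag) input
    shared = fir([a - b for a, b in zip(x_re, x_im)], h_im)
    re = [p + s for p, s in zip(fir(x_re, h_diff), shared)]
    im = [p + s for p, s in zip(fir(x_im, h_sum), shared)]
    return re, im

# Agrees with the direct complex convolution
x = [2 - 1j, 0 + 3j, -1 + 1j]
h = [1 + 1j, 2 - 3j, 0 + 1j, -2 + 0j]
re, im = parallel_gauss([v.real for v in x], [v.imag for v in x],
                        [c.real for c in h], [c.imag for c in h])
direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
          for n in range(len(x))]
assert all(abs(complex(r, i) - d) < 1e-9 for r, i, d in zip(re, im, direct))
```

Per tap, (c − d)a + d(a − b) = ac − bd and (c + d)b + d(a − b) = ad + bc, so three real multiplications replace four.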
Figure 4.8: Parallel standard structure.
Figure 4.9: Parallel Gauss structure.
4.2.4 Multiplexer
Similar to the 1024-point FFT based filter architecture, a quantization multiplexer is implemented after the 4-tap filter. The difference is that the output data wordlength from the 4-tap filter is 22 bits real and 22 bits imaginary, while the input data wordlength of the inverse FFT is 12 bits real and 12 bits imaginary. The architecture is the same as shown in Figure 4.6.
4.2.5 Inverse Fast Fourier Transform
An inverse FFT is performed by interchanging the real and imaginary parts, as discussed in Section 4.2. Also, since the OLS method is used for filtering in this FIR filter architecture, the overlapped outputs are rejected at the inverse FFT stage.
5 Results
This chapter presents the synthesis results of the blocks used in the two architectures, including a comparison of the various options that have been considered. Results are shown at six different frequencies to investigate each block in more detail.
Notice that the design is synthesized block by block. The whole design was not synthesized due to memory limitations, as the entire design is too large to be synthesized as one unit. Therefore, the synthesis results of the individual blocks are presented in this thesis. The most power-efficient option for each block is selected for each architecture, and the comparison of the two architectures is discussed at the end of the chapter.
The synthesis is done in Design Compiler using a 65-nm low-power process and 1.2 V supply voltage.
5.1 256-point Fast Fourier Transform
5.1.1 Radix-2 vs Radix-4 vs Radix-16
The choice of radix for the FFT can make a difference in the performance of the system. In this thesis, the 256-point FFT is implemented using radix-2, radix-4, and radix-16 butterflies, where the radix-16 butterfly is based on the radix-4 butterfly. Each case involves a different number of complex multiplications. The three algorithms can be represented by binary tree representations [18], as illustrated in Figure 5.1.
The basic number of complex multiplications is N(log2 N − 1) for the radix-2 case and (N/2)(log2 N − 1) for radix-4 and radix-16. However, we can exclude the trivial cases where the rotation angle is 0°, 90°, 180°, or 270°. These rotations are just multiplications by 1, j, −1, and −j, respectively, and can be implemented easily in hardware by interchanging the real
Figure 5.1: Binary tree representations used for the 256-point FFT: (a) radix-2, (b) radix-4, (c) radix-16.

Algorithm Non-trivial rotations
Radix-2 878
Radix-4 492
Radix-16 480
Table 5.1: Number of non-trivial rotations in FFTs with various radixes. The radix-16 case is based on a radix-4 butterfly.
and imaginary parts or by changing signs [5].
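These trivial rotations can be illustrated in Python: multiplying a + bi by 1, j, −1, or −j only swaps the real and imaginary parts and/or negates them, so no multiplier is needed.

```python
def trivial_rotate(a, b, quarter):
    """Rotate a + bi by quarter * 90 degrees (multiply by j**quarter)."""
    if quarter % 4 == 0:
        return a, b            # * 1
    if quarter % 4 == 1:
        return -b, a           # * j : swap, negate real
    if quarter % 4 == 2:
        return -a, -b          # * -1 : negate both
    return b, -a               # * -j : swap, negate imaginary

# Agrees with the complex products for all four quadrants
for q, w in enumerate([1, 1j, -1, -1j]):
    a, b = trivial_rotate(3, 4, q)
    assert complex(a, b) == (3 + 4j) * w
```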
Table 5.1 shows the number of non-trivial rotations. The radix-2 case has more non-trivial rotations, whereas the numbers for radix-4 and radix-16 are similar. Accordingly, the synthesis results of these FFTs show that more non-trivial rotations give higher area usage and power consumption, as can be seen in Tables 5.2 and 5.3. Note that the radix-2 case uses two pipeline registers at every second stage, while the radix-4 and radix-16 cases use two pipeline registers at every stage. Pipelining is discussed in detail in the next section.
Overall, the results show that the radix-16 case has the lowest area usage and power consumption at 476 MHz.
5.1.2 Pipelining
Various pipelining options are available in the FFT. The most important places to introduce pipeline registers are near the twiddle factor multipliers, since complex multiplications take the most computation time of all operations in the FFT block. In this section, the radix-16 case based on radix-4 butterflies is used, because the previous section concluded that it is the most efficient of the three radixes.
When radix-4 butterflies are used, the FFT block has a total of four stages. Three pipelining options are considered in this thesis: two pipeline registers at every stage, one pipeline register at every stage, and one pipeline register only at the second stage. Figure 5.2 shows the three cases.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Radix-2 209 419 637 910 1283 2189
Radix-4 172 291 520 827 898 1362
Radix-16 168 336 509 816 875 1372
Table 5.2: Synthesis results of power consumption for different radixes. All power numbers are in mW. The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and at every stage in radix-4 and radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Radix-2 1.746 1.746 1.760 1.911 1.976 2.311
Radix-4 1.546 1.546 1.549 1.549 1.570 1.742
Radix-16 1.508 1.508 1.508 1.520 1.540 1.732
Table 5.3: Synthesis results of area usage for different radixes. All area numbers are in mm2. The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and at every stage in radix-4 and radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Two reg. at every stage 168 336 509 816 875 1372
One reg. at every stage 155 308 471 1037 1095 2031
One reg. at second stage 178 378 930 - - -
Table 5.4: Synthesis results of power consumption for different pipelining methods. All power numbers are in mW. The FFT basis is radix-16.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Two reg. at every stage 1.508 1.508 1.508 1.520 1.540 1.732
One reg. at every stage 1.344 1.343 1.351 1.736 1.756 2.066
One reg. at second stage 1.247 1.306 1.730 - - -
Table 5.5: Synthesis results of area usage for different pipelining methods. All area numbers are in mm2. The FFT basis is radix-16.
Tables 5.4 and 5.5 show the synthesis results of the FFT block with different pipelining. The results show that pipelining is useful when the speed requirement is high, but when a lower speed is required, the pipelined structure has higher area usage and power consumption due to the extra registers. Note that a dash in the tables means the design did not meet the speed constraint.
5.1.3 FFT with Overlap-save Block and Inverse FFT
In the 256-point FFT based architecture, the 256-point FFT is performed first. The result for this block differs from that of the general FFT block, as expected: both power consumption and area usage are reduced compared with the normal FFT. This can be explained by the fact that when zeros are appended, as described in Section 4.2.2, the hardware, such as adders and multipliers, in the
Figure 5.2: Different options of pipelining when using radix-4 butterflies: (a) two registers at every stage, (b) one register at every stage, (c) one register at the second stage. Dotted lines represent the places of pipelining.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
FFT 168 336 509 816 875 1372
FFT w/ OLS. 149 298 452 721 772 1163
FFT w/ Post. 164 328 496 795 854 1332
Table 5.6: Synthesis results of power consumption for various FFT blocks. All power numbers are in mW. Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
FFT 1.508 1.508 1.508 1.520 1.540 1.732
FFT w/ OLS. 1.33 1.33 1.33 1.35 1.36 1.52
FFT w/ Post. 1.46 1.46 1.46 1.48 1.49 1.69
Table 5.7: Synthesis results of area usage for various FFT blocks. All area numbers are in mm2. Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.
initial stages will be reduced. This leads to lower power consumption and area usage.
Also, after the inverse FFT, 128 samples are rejected, so half of the butterflies in the last stage are removed. This results in lower power consumption and lower area usage than a normal FFT block. The results can be found in Tables 5.6 and 5.7.
5.2 Complex Multiplier
In the 1024-point FFT based architecture, complex multipliers are used to multiply the transformed input samples by the filter coefficients. There are three choices for the complex multipliers:
• Gauss complex multiplier: a complex multiplier using the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).
• Gauss complex multiplier with precomputed inputs: the same Gauss complex multiplication algorithm with precomputation, as shown in Figure 4.4(b).
• Standard complex multiplier: the standard way of computation, as shown in Figure 4.4(c).
As discussed in Section 2.6, the Gauss complex multiplication algorithm uses three real multiplications and five additions, while the standard complex multiplication algorithm uses four real multiplications and two additions.
Table 5.8 shows the synthesis results of area usage for the three multipliers. Even though the pre-computed Gauss complex multiplier removes two additions, it shows higher area usage. The reason is that the pre-computed multiplier uses three inputs, of which two are 9 bits and one is 8 bits, so it requires larger
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Gauss mult. 3861 3861 3861 4692 4748 5445
Pre. Gauss mult. 4125 4125 4125 4739 4802 5289
Standard mult. 3756 3756 3756 4352 4380 5487
Table 5.8: Synthesis results of area usage for the three complex multipliers. All area numbers are in µm2.
registers and multiplexers. Meanwhile, the standard complex multiplier performs better than the Gauss complex multiplier because of its simplicity. The synthesis results for power consumption are discussed in the next section.
5.2.1 Power Estimation with Random Coefficients
The coefficients of the FIR filter are determined by Equation (2.1). However, this thesis does not deal with the specific parameters; instead, the coefficients are chosen randomly to evaluate the performance of the complex multipliers.
However, we cannot use a single specific set of random coefficients, because the choice of the four coefficients affects the power results: if the number of transitions between coefficients is small, the power consumed in the gates is lower, and if the number of transitions is high, the power consumption becomes higher. To address this variation, 99 different sets of coefficients are simulated for each type of multiplier, and a histogram is made. Figure 5.3 shows the result.
The results show that the standard complex multiplier is more efficient than the Gauss complex multiplier because of its simplicity, while the pre-computed Gauss complex multiplier is the most efficient of the three. The detailed results for all the multipliers are shown in Tables 5.9 to 5.11.
5.3 4-tap FIR Filters
5.3.1 Pipelining
All the architectures have been implemented with pipelining. The effect of pipelining can be seen at higher frequencies, where the power consumption is reduced significantly. As expected, the power consumption at lower frequencies increases compared to the non-pipelined structures. Pipelining of the parallel Gauss structure is implemented as indicated in Figure 5.4(a). The critical path of this structure is determined to be
2 Tadder + Tfilter (5.1)
After pipelining, the critical path is reduced to

Tadder + Tfilter (5.2)
Figure 5.3: Histogram of power consumption of three complex multipliers at 476 MHz. Power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.510 1.023 1.723 3.323 3.597 6.817
Minimum 0.385 0.770 1.308 2.750 2.984 5.681
Median 0.461 0.921 1.574 3.064 3.316 6.291
Table 5.9: Synthesis results of power consumption for the Gauss complex multiplier. All power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.306 0.613 1.014 1.756 1.939 3.407
Minimum 0.262 0.523 0.878 1.729 1.825 3.238
Median 0.290 0.580 0.965 1.744 1.894 3.340
Table 5.10: Synthesis results of power consumption for the precomputed Gauss complex multiplier. All power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Maximum 0.502 1.005 1.696 3.010 3.230 5.241
Minimum 0.406 0.809 1.368 2.619 2.814 4.577
Median 0.455 0.909 1.560 2.847 3.055 4.951
Table 5.11: Synthesis results of power consumption for the standard com-plex multiplier. All power numbers are in mW.
Figure 5.4: Pipelined structures of 4-tap FIR filters: (a) parallel Gauss structure, (b) parallel standard structure, (c) direct form structure. Dotted lines represent the places of pipelining.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15475 15477 15420 16149 16217 18206
Power 1.167 2.334 3.540 5.350 5.820 11.490
Table 5.12: Synthesis results of the direct form with standard complex multipliers. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 17211 17211 17208 18832 18895 19371
Power 1.326 2.651 4.019 7.018 7.440 10.702
Table 5.13: Synthesis results of the direct form with standard complex multipliers when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
In the direct form structure, pipelining is implemented as in Figure 5.4(c) for the 4-tap filters with the standard complex multiplier and the Gauss complex multiplier. As indicated in Tables 5.12 and 5.13, pipelining shows its positive effect at 667 MHz, where the power consumption of the pipelined structure is lower than that of the non-pipelined structure. This effect is even more pronounced in the direct form filter with Gauss complex multipliers: Tables 5.14 and 5.15 show that from 476 MHz upwards, the power consumption is much lower than in the non-pipelined structure. Therefore, the pipelined structure leads to much higher power savings at high frequencies.
The direct form structure with the standard complex multiplier shows the same effect at 667 MHz, where its power consumption is lower than that of the non-pipelined version, as seen in Tables 5.12 and 5.13.
Regarding area, pipelining also reduces area usage at high operating frequencies. This effect is pronounced in the direct form structure with Gauss complex multipliers, where the area is significantly reduced at 667 MHz. As expected, at lower frequencies the area usage is higher, since extra registers are added where such a speedup of the circuit is not required.
The parallel standard structure is also implemented with pipelining, as indicated in Figure 5.4(b). The critical path of this structure is determined to be

Tadder + Tfilter (5.3)
In the parallel standard structure, pipelining shows the same behavior: at higher frequencies, both power consumption and area usage are reduced, whereas at lower frequencies the pipelined version has higher area usage and power consumption.
The pipelined parallel Gauss structure also performs better. One reason could be the shortened critical path, which makes the circuit more efficient. The area is reduced at 667 MHz. Figure 5.5 shows the effect of pipelining on the power consumption of the various structures.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 16904 16904 16415 21198 21724 31737
Power 1.224 2.448 3.934 8.616 9.841 25.062
Table 5.14: Synthesis results of the direct form filter with Gauss complex multipliers. All area numbers are in µm2 and power numbers are in mW.

Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 18118 18118 18190 21278 21373 22793
Power 1.424 2.848 4.382 8.335 8.888 14.139
Table 5.15: Synthesis results of the direct form filter with Gauss complex multipliers when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15446 15716 15716 19575 20105 28959
Power 1.155 2.309 3.953 8.518 9.353 22.457
Table 5.16: Synthesis results of the direct form filter with Gauss complex multipliers when precomputation is applied. All area numbers are in µm2
and power numbers are in mW.
Figure 5.5: Power consumption (mW) versus frequency (MHz) for the parallel standard, parallel Gauss, direct form standard, and direct form Gauss structures, pipelined and non-pipelined.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 14876 14876 14395 15472 15611 18236
Power 1.278 2.468 3.729 5.701 7.946 12.385
Table 5.17: Synthesis results of the parallel Gauss structure. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15162 15162 14790 16049 16094 17116
Power 1.219 2.437 3.670 5.912 6.209 10.566
Table 5.18: Synthesis results of the parallel Gauss structure when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters
Two filters are implemented based on the Gauss complex multiplication algorithm: the first uses Gauss complex multipliers in the direct form structure, and the other is the parallel Gauss structure. The parallel Gauss structure uses three 4-tap filters, two with coefficients that are the sum and the difference of the real and imaginary coefficients, and one with the imaginary coefficients. The results show that the parallel Gauss structure is the better of the two implementations; the results of both can be seen in Tables 5.14 and 5.17.
This can be explained by the fact that the Gauss complex multiplier in the direct form implementation consists of three internal stages on which the output depends. When operated at high frequency, the power consumption therefore increases considerably, because computations must pass through all three stages at very high speed. The parallel Gauss structure, on the other hand, consists of three real-valued 4-tap filters with real multipliers inside; although it uses the same Gauss complex multiplication principle, it is a much simpler structure.
Another implementation of the Gauss complex multiplication algorithm is the one with pre-computed inputs, seen in Figure 4.4(b). Although two adders are removed in this multiplier to increase performance, the improvement is small. This is because the coefficients are constant: the removed adders would only operate once, when new coefficients arrive, and remain static in all further cycles. Thus, not much improvement is seen with this algorithm.
5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters
There are two filters based on the standard complex multiplication algorithm. The direct form structure with standard complex multipliers performs better than the parallel standard structure.
Figure 5.6: Comparison of power consumption of the Gauss and standard algorithms.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 15766 15764 15519 17083 17574 20200
Power 1.245 2.492 3.721 6.380 7.056 13.520
Table 5.19: Synthesis results of the parallel standard structure. All area numbers are in µm2 and power numbers are in mW.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Area 16435 16435 16435 17218 17412 18725
Power 1.368 2.736 4.138 6.352 6.692 11.053
Table 5.20: Synthesis results of the parallel standard structure when pipelining is applied. All area numbers are in µm2 and power numbers are in mW.
The difference between the two architectures lies in how the summation is performed: the order of the adders after the multipliers differs, which causes one architecture to consume more power than the other. The difference in power consumption is depicted in Figure 5.6.
5.3.4 Summary of Results
The synthesis results of all 4-tap FIR filter implementations lead to the conclusion that the direct form with standard complex multipliers is the best filter for operation at 476 MHz. The overall architecture of this block is much simpler, and its dependency on other blocks is minimal compared to the other 4-tap FIR filter structures.
The power estimation is performed multiple times in order to obtain the range of power consumption. The histogram in Figure 5.7 shows the distribution of a set of 99 power consumption values for the direct form filter with standard complex multipliers. The minimum, maximum, and median values of this implementation
Figure 5.7: Histogram of power consumption of the direct form filter with the standard complex multipliers at 476 MHz. Power numbers are in mW.

Frequency 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
Median 1.166 2.333 3.537 5.340 5.810 11.475
Minimum 1.150 2.299 3.487 5.253 5.717 11.322
Maximum 1.179 2.361 3.578 5.415 5.891 11.636
Table 5.21: Power consumption range of direct form filter with standard complex multipliers. All power numbers are in mW.
can be seen in Table 5.21.
5.3.5 Tap Configuration Results
The direct form structure with standard complex multipliers is the most power efficient of all the filters, as discussed in Section 5.3.4. A 4-tap filter can also be used with only some taps active. Table 5.22 shows various tap configurations of the 4-tap filter and the related power consumption. Clearly, when taps are switched off, the power consumption is reduced because the multipliers have lower switching activity. When all taps are switched off, the power consumption is quite low: the dynamic power is then mostly that of the delay elements, since all multipliers have minimal switching activity with the coefficients set to zero, and consequently all inputs to the adders are zero. This lowers the total power consumption.
Freq 100 MHz 200 MHz 300 MHz 476 MHz 500 MHz 667 MHz
All Taps On 1.167 2.334 3.540 5.350 5.820 11.490
One Tap Off 0.815 1.630 2.505 4.822 5.256 10.445
Two Taps Off 0.695 1.390 2.136 4.173 4.566 9.140
Three Taps Off 0.577 1.154 1.758 3.465 3.816 7.616
All Taps Off 0.311 0.621 0.986 1.813 1.987 3.572
Table 5.22: Comparison of power consumption of the direct form filter with standard complex multipliers for various tap configurations. All power numbers are in mW.

Figure 5.8: Different procedures of the commutator: (a) twiddle factor multipliers first, (b) commutator first.
5.4 Different Procedures of Commutator and Twiddle Factor Multiplication
The second commutator in the 1024-point FFT based architecture can be located before or after the twiddle factor multiplication block. Both configurations give the same functional result but differ in efficiency, since the input data wordlength of the system is set to 12 bits and the output data wordlength of the twiddle factor multiplier needs to be 24 bits. Also, as seen in Figure 5.8, the first output of the 4-point FFT always has an angle of 0°, so one complex multiplication can be removed from one of the four outputs.
Therefore, if the twiddle factor multiplication block comes first, the block maintains the same configuration regardless of the clock cycle by sending every