Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs


Linköping University Post Print


Fahad Qureshi and Oscar Gustafsson

N.B.: When citing this work, cite the original article.

©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Fahad Qureshi and Oscar Gustafsson, Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs, 2009, 43rd Asilomar Conference on Signals, Systems, and Computers, 217-220.

Postprint available at: Linköping University Electronic Press

Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Fahad Qureshi and Oscar Gustafsson

Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden

E-mail: {fahadq, oscarg}@isy.liu.se

Abstract—In this work, we analyze different approaches to

store the coefficient twiddle factors for different stages of pipelined Fast Fourier Transforms (FFTs). The analysis is based on complexity comparisons of different algorithms when implemented on Field-Programmable Gate Arrays (FPGAs) and ASICs for different radix-$2^i$ algorithms. The objective of this work is to investigate the best possible combination for storing the coefficient twiddle factors for each stage of the pipelined FFT.

I. INTRODUCTION

Computation of the discrete Fourier transform (DFT) and inverse DFT is used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems and spectrometers. An $N$-point DFT can be expressed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $W_N = e^{-j2\pi/N}$ is the twiddle factor, the $N$:th primitive root of unity with its exponent being evaluated modulo $N$, $n$ is the time index, and $k$ is the frequency index. Various methods for efficiently computing (1) have been the subject of a large body of published literature. They are commonly referred to as fast Fourier transform (FFT) algorithms. Also, many different architectures to efficiently map the FFT algorithm to hardware have been proposed [1].
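As a reference point for (1), the following minimal Python sketch evaluates the DFT directly from the definition. The function names `twiddle` and `dft` are illustrative choices, not taken from the paper, and the code uses floating-point arithmetic rather than the fixed-point representation of a hardware implementation.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N); the exponent is reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct O(N^2) evaluation of (1): X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# A unit impulse has a flat spectrum; a constant input has only a DC term.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```

Because the exponent is taken modulo $N$, only $N$ distinct twiddle factor values can ever occur, which is what makes a table-based implementation possible.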

A commonly used architecture for transforms of length $N = b^r$ is the pipelined FFT. The pipeline architecture is characterized by continuous processing of input data. In addition, the pipeline architecture is highly regular, making it straightforward to automatically generate FFTs of various lengths.

Figure 1 outlines the architecture of a radix-$2^i$ single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT for length $N$. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for varying values of $i$. For the twiddle factor multipliers with small ranges, special methods have been proposed. In particular, for a $W_4$ multiplier the possible coefficients are $\{\pm 1, \pm j\}$ and, hence, the multiplication reduces to optionally interchanging the real and imaginary parts and possibly negating (or replacing the addition with a subtraction in the subsequent stage). For larger ranges ($W_8$, $W_{16}$, and $W_{32}$), approaches have been proposed in [4], [6]–[8].
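The trivial $W_4$ case can be illustrated with a short Python sketch. The function name `mul_w4` is hypothetical, and the sketch models only the arithmetic (swap and negate), not any particular hardware structure from the cited works.

```python
def mul_w4(re, im, e):
    """Multiply (re + j*im) by W_4^e = (-j)^e using only swaps and negations."""
    e %= 4
    if e == 0:          # times  1
        return re, im
    if e == 1:          # times -j: (a + jb)(-j) = b - ja
        return im, -re
    if e == 2:          # times -1
        return -re, -im
    return -im, re      # times +j: (a + jb)(j) = -b + ja

# Example: (3 + 4j) * (-j) = 4 - 3j
result = mul_w4(3, 4, 1)
```

No multiplier is needed for this case, which is why the analysis in the remainder of the paper concerns only the larger twiddle factor ranges.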

In this work we instead focus on using standard complex multipliers. The twiddle factors are calculated in advance, stored in memories, and retrieved for multiplication whenever necessary. The size of the twiddle factor memory for each stage depends on the arithmetic precision, the number of FFT points, and the stage number. Usually, for a long FFT, the look-up tables are large in comparison with the butterflies and complex multipliers. In [9], [10] methods are proposed to reduce the size of the memories by utilizing the octave symmetry of the twiddle factors, hence only storing values for angles $0 \le \alpha \le \pi/4$. The memory then has at most $N/8 + 1$ words. However, the results in [9], [10] are given for complete FFTs using the same architecture for all memories and only for radix-$2^2$. In this work we show that octave symmetry is not always useful due to the overhead of multiplexers and negations. Furthermore, we investigate the effect of wordlength scaling, as previous work has shown that the occupied cell area when synthesizing look-up tables does not grow linearly with the number of bits in the look-up table [11]. It is noted that one could use dedicated memory structures on the FPGAs, but depending on available resources and the size of the memories this may not be suitable. A cost model for using the dedicated memory structures is proposed in [12].

In the next section, the different architectures to implement the twiddle factor memories are explained. In Section III, we analyze and compare the implementation results of those architectures. Finally, some conclusions are presented.

II. ARCHITECTURES FOR TWIDDLE FACTOR MEMORIES

The twiddle factor memory should provide the real and imaginary parts of the twiddle factor. Typically, in an SDF pipelined FFT architecture a counter is used to keep track of which row of the FFT is computed in each clock cycle. Hence, we will here assume that the mapping should be from row number to the real and imaginary parts of the twiddle factor.

TABLE I
MULTIPLICATION AT DIFFERENT STAGES FOR DIFFERENT ARCHITECTURES.

| Radix     | Stage 1 | Stage 2     | Stage 3     | Stage 4     | Stage 5      |
|-----------|---------|-------------|-------------|-------------|--------------|
| 2         | $W_N$   | $W_{N/2}$   | $W_{N/4}$   | $W_{N/8}$   | $W_{N/16}$   |
| $2^2$ [2] | $W_4$   | $W_N$       | $W_4$       | $W_{N/4}$   | $W_4$        |
| $2^3$ [3] | $W_4$   | $W_8$       | $W_N$       | $W_4$       | $W_8$        |
| $2^4$ [4] | $W_4$   | $W_8$       | $W_{16}$    | $W_N$       | $W_4$        |
| $2^5$ [5] | $W_4$   | $W_8$       | $W_{16}$    | $W_{32}$    | $W_N$        |

Fig. 1. The R$2^i$ single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT architecture with twiddle factor stages.

Fig. 2. Block diagram of single look-up table twiddle factor memory.

Fig. 3. Block diagram of twiddle factor memory with address mapping.

A. Single Look-up Table

The simplest approach, as shown in Fig. 2, is to just use a large look-up table to store the twiddle factors. For a $W_N$ multiplier, $N$ words need to be stored. Hence, for large $N$ one could expect this method to have a higher complexity compared to the reduced schemes. On the other hand, it lacks any overhead. It should also be noted that this scheme possibly stores the same twiddle factor in several positions, as the mapping is from row to twiddle factor, and for radix-$2^i$ algorithms with $i \ge 2$ some twiddle factors appear more than once.
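The duplication of entries can be made concrete with a small Python sketch. It assumes the row-to-exponent mapping commonly described for the $W_N$ multiplier of a radix-$2^2$ DIF pipeline, where the four quarters of the $N$ rows use the exponents $0$, $2n$, $n$, and $3n$; this ordering is an assumption for illustration and is not stated in this paper. The names `radix22_exponent` and `single_lut` are likewise hypothetical, and floating-point values stand in for the fixed-point words.

```python
import cmath

def radix22_exponent(row, N):
    """Assumed radix-2^2 mapping: quarters of the rows use exponents 0, 2n, n, 3n."""
    quarter = N // 4
    block, n = divmod(row, quarter)
    return (n * (0, 2, 1, 3)[block]) % N

def single_lut(N):
    """One word per row, addressed directly by the row counter (N words)."""
    return [cmath.exp(-2j * cmath.pi * radix22_exponent(row, N) / N)
            for row in range(N)]

N = 64
lut = single_lut(N)
exponents = [radix22_exponent(row, N) for row in range(N)]
print("words stored:", len(lut))
print("distinct twiddle factors:", len(set(exponents)))
```

Under this assumed mapping the table holds $N$ words but fewer than $N$ distinct values, which is the redundancy that the address mapping schemes below try to exploit.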

B. Twiddle Factor Memory with Address Mapping

A possible simplification is to use an address mapping circuit that maps the row to the corresponding angle ($k$ in (1)) and to use a memory storing the required elements only once. For the general case, we will need to store many, but not all, values, still using $N$ possible words even though many can be set to “don’t care”. Because of this, one can expect the resources used for the look-up table to be reduced compared to the previous approach, given that the synthesis tool can benefit from it. The structure is shown in Fig. 3.
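A minimal Python sketch of this idea follows. The mapping in `address_map` is an assumed radix-$2^2$ ordering (quarters of the rows using exponents $0$, $2n$, $n$, $3n$) chosen only for illustration; the paper's actual mapping unit is the one in Fig. 5. Unused memory locations are modelled as `None` to represent the “don’t care” entries.

```python
import cmath

def address_map(row, N):
    """Assumed row-to-angle mapping for a radix-2^2 W_N multiplier (illustrative)."""
    quarter = N // 4
    block, n = divmod(row, quarter)
    return (n * (0, 2, 1, 3)[block]) % N

def build_memory(N):
    """Memory addressed by angle index; each required value is stored once."""
    mem = [None] * N                      # None marks a "don't care" word
    for row in range(N):
        e = address_map(row, N)
        if mem[e] is None:
            mem[e] = cmath.exp(-2j * cmath.pi * e / N)
    return mem

def lookup(row, N, mem):
    """Row counter -> address mapping -> coefficient memory."""
    return mem[address_map(row, N)]

N = 64
mem = build_memory(N)
dont_care = sum(word is None for word in mem)
print("address space:", len(mem), "don't-care words:", dont_care)
```

Whether the “don’t care” words translate into savings depends on the synthesis tool, as the results in Section III indicate.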

Fig. 4. Block diagram of twiddle factor memory with address mapping and symmetry.

Fig. 5. Block diagram of address mapping unit.

C. Twiddle Factor Memory with Address Mapping and Symmetry

Another modification, proposed in [9], [10], is to use the well-known octave symmetry to only store twiddle factors for $0 \le \alpha \le \pi/4$. The additional cost is an address mapping circuit, as discussed in the previous section, as well as multiplexers to interchange the real and imaginary parts and possible negations. The main benefit is that only $N/8 + 1$ words are required to be stored. The resulting structure is shown in Fig. 4.
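The octave symmetry can be sketched in Python as follows. The table stores (cos, sin) pairs for the $N/8 + 1$ angles in $[0, \pi/4]$, and the remaining octants are recovered by mirroring the address, swapping, and negating. The function names and the exact storage layout are assumptions made for this illustration (and $N$ is assumed to be a multiple of 8); the schemes in [9], [10] may differ in detail.

```python
import cmath
import math

def build_octant_table(N):
    """(cos, sin) for the N/8 + 1 angles 2*pi*m/N, m = 0..N/8."""
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]

# For each octant: (cos, sin) of the full angle from the stored pair (c, s).
_OCTANT_OPS = [
    lambda c, s: (c, s),    lambda c, s: (s, c),
    lambda c, s: (-s, c),   lambda c, s: (-c, s),
    lambda c, s: (-c, -s),  lambda c, s: (-s, -c),
    lambda c, s: (s, -c),   lambda c, s: (c, -s),
]

def twiddle_from_table(e, N, table):
    """Reconstruct W_N^e = cos(t) - j*sin(t), t = 2*pi*e/N, from the octant table."""
    e %= N
    octant, r = divmod(e, N // 8)
    # Odd octants read the table backwards (mirrored address).
    m = r if octant % 2 == 0 else N // 8 - r
    c, s = table[m]
    cos_t, sin_t = _OCTANT_OPS[octant](c, s)
    return complex(cos_t, -sin_t)

N = 64
table = build_octant_table(N)
max_error = max(abs(twiddle_from_table(e, N, table)
                    - cmath.exp(-2j * cmath.pi * e / N)) for e in range(N))
print("table words:", len(table), "max reconstruction error:", max_error)
```

The swaps and sign changes in `_OCTANT_OPS` correspond to the multiplexers and negations that form the overhead of this architecture.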

D. Address Mapping

The address mapping for a radix-$2^i$ FFT is done as shown in Fig. 5. Here, the total length of the FFT is $2^L$ points and the resolution of the twiddle factor multiplier is $W_{2^k}$. It is worth noting that the address mapping for a given $W_N$ multiplier is independent of $L$. Clearly, $i$ will affect the complexity of the address mapping circuitry.

III. ANALYSIS AND RESULTS

We have analyzed the complexity of twiddle factor memories with resolution ≥ 64 for the different architectures, considering radix-$2^i$ algorithms with different values of $i$. The architectures of the twiddle factor memories have been coded in VHDL. These architectures were synthesized using three different synthesis tools: Mentor Graphics Precision targeting an Altera Stratix-IV FPGA, Xilinx ISE targeting a Virtex-4 FPGA, and Synopsys Design Compiler targeting 0.35 μm CMOS standard cells. The twiddle factors are represented using 16 bits each for the real and imaginary parts, in two’s complement representation. The resulting complexity for each stage is illustrated in Figs. 6, 7, and 8 for the Altera Stratix-IV FPGA, the Virtex-4 FPGA, and the 0.35 μm CMOS ASIC, respectively.

Figures 6, 7, and 8 show that the twiddle factor memory with address mapping and symmetry is the most advantageous architecture for large ranges. However, for small ranges, the simple look-up table approach is most beneficial. The point where address mapping and symmetry becomes more beneficial than the simple look-up table moves towards higher twiddle factor resolutions as the value of $i$ increases.

Fig. 6. Radix-$2^i$ SDF pipelined FFT twiddle factor memory complexity using Mentor Graphics Precision targeting an Altera Stratix-IV FPGA. Panels: radix-$2^2$, radix-$2^3$, radix-$2^4$, and radix-$2^5$. Horizontal axis: twiddle factors $W_{64}$ to $W_{8192}$. Vertical axis: LUTs. Curves: memory, memory with AG, memory with AG and symmetry.

Fig. 7. Radix-$2^i$ SDF pipelined FFT twiddle factor memory complexity using Xilinx ISE targeting a Virtex-4 FPGA. Horizontal axis: twiddle factors $W_{64}$ to $W_{8192}$. Vertical axis: 4-input LUTs.

Fig. 8. Radix-$2^i$ SDF pipelined FFT twiddle factor memory complexity using 0.35 μm CMOS standard cells. Horizontal axis: twiddle factors $W_{64}$ to $W_{8192}$. Vertical axis: cell area.

In FPGA designs, the memory with address mapping is not a beneficial choice because the synthesis tool does not utilize the “don’t care” conditions. In the ASIC designs, however, it falls between the other two architectures, although it is never the best. To illustrate the impact of the wordlength, we synthesize a $W_{1024}$ twiddle factor memory using wordlengths varying from 10 to 18 bits for a Xilinx Virtex-4 FPGA. The results are shown in Fig. 9 and show the expected linear behaviour. However, the offset, corresponding to circuitry that is independent of the wordlength, such as address generation, differs between the approaches. Hence, for resolutions that gave similar complexity in Figs. 6, 7, and 8, one would have to re-evaluate the best architecture based on the wordlength used.

Figure 10 shows the complexity using the best architecture of the twiddle factor memory for radix-$2^i$ algorithms in different technologies. It can be seen that the complexity for the same twiddle factor resolution increases as the value of $i$ increases in radix-$2^i$ algorithms.

Fig. 9. $W_{1024}$ twiddle factor memory complexity for different wordlengths. Panels: radix-$2^2$, radix-$2^3$, radix-$2^4$, and radix-$2^5$. Horizontal axis: wordlength. Vertical axis: 4-input LUTs. Curves: memory, memory with AG, memory with AG and symmetry.

Fig. 10. Best architecture of twiddle factor memory for different twiddle factors. Panels: Altera LUTs, Xilinx 4-input LUTs, and ASIC cell area. Horizontal axis: twiddle factors $W_{64}$ to $W_{8192}$. Curves: radix-$2^2$, radix-$2^3$, radix-$2^4$, and radix-$2^5$.

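TABLE II
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS.

| Algorithm | Memory 1     | Memory 2     | Memory 3    | Memory 4    |
|-----------|--------------|--------------|-------------|-------------|
| $2^2$ [2] | $W_{8192}$   | $W_{2048}$   | $W_{512}$   | $W_{128}$   |
| $2^3$ [3] | $W_{8192}$   | $W_{1024}$   | $W_{128}$   | -           |
| $2^4$ [4] | $W_{8192}$   | $W_{512}$    | -           | -           |
| $2^5$ [5] | $W_{8192}$   | $W_{256}$    | -           | -           |

TABLE III
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (ALTERA).

| Algorithm | Memory 1 | Memory 2 | Memory 3 | Memory 4 | Total |
|-----------|----------|----------|----------|----------|-------|
| $2^2$ [2] | 2650     | 729      | 240      | 95       | 3714  |
| $2^3$ [3] | 2835     | 581      | 96       | -        | 3512  |
| $2^4$ [4] | 3002     | 339      | -        | -        | 3341  |
| $2^5$ [5] | 3123     | 157      | -        | -        | 3280  |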
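The resolutions listed in Table II are consistent with a simple pattern suggested by Table I: the first non-trivial memory has resolution $N$, and each following one is smaller by a factor of $2^i$, stopping once the resolution drops below 64. The short Python sketch below reproduces the table under that inferred rule; the function name `memory_resolutions` and the rule itself are assumptions drawn from the tables, not a statement from the text.

```python
def memory_resolutions(N, i, min_resolution=64):
    """Resolutions W_N, W_(N/2^i), W_(N/2^(2i)), ... that are >= min_resolution."""
    resolutions = []
    r = N
    while r >= min_resolution:
        resolutions.append(r)
        r //= 2 ** i
    return resolutions

table_ii = {i: memory_resolutions(8192, i) for i in (2, 3, 4, 5)}
for i, res in table_ii.items():
    print(f"radix-2^{i}:", ", ".join(f"W_{r}" for r in res))
```

A higher radix therefore needs fewer large twiddle factor memories, which is one of the trade-offs reflected in the totals of Tables III to V.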

Table II shows the twiddle factors for an 8192-point FFT single-path delay feedback pipelined architecture having resolution ≥ 64 for different radix-$2^i$ algorithms. The complexity of each complex twiddle factor memory with the best architecture in the three different technologies is shown in Tables III, IV, and V, respectively. The values in italics correspond to the architecture where only a look-up table is used. This justifies the initial assumption that the same architecture is not beneficial for all twiddle factor memories. The total complexity of the twiddle factor memory is reduced as the value of $i$ is increased, except for the Xilinx results.

TABLE IV
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (XILINX).

| Algorithm | Memory 1 | Memory 2 | Memory 3 | Memory 4 | Total |
|-----------|----------|----------|----------|----------|-------|
| $2^2$ [2] | 1592     | 735      | 383      | 201      | 2911  |
| $2^3$ [3] | 1653     | 556      | 228      | -        | 2437  |
| $2^4$ [4] | 1791     | 550      | -        | -        | 2341  |
| $2^5$ [5] | 1863     | 527      | -        | -        | 2390  |

TABLE V
TWIDDLE FACTOR MEMORY COMPLEXITY OF 8192-FFT SDF PIPELINED WITH DIFFERENT ALGORITHMS (ASIC).

| Algorithm | Memory 1 | Memory 2 | Memory 3 | Memory 4 | Total    |
|-----------|----------|----------|----------|----------|----------|
| $2^2$ [2] | 246500.8 | 89471.2  | 39967.2  | 21294.0  | 397233.2 |
| $2^3$ [3] | 260059.8 | 66739.4  | 25771.2  | -        | 352570.4 |
| $2^4$ [4] | 283501.4 | 58167.2  | -        | -        | 341668.6 |
| $2^5$ [5] | 300829.6 | 27318.2  | -        | -        | 328147.8 |
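The totals in Tables III to V can be rechecked with a few lines of Python. The per-memory values below are copied from the tables; the dictionary layout and variable names are only an illustrative arrangement.

```python
# Per-memory complexity from Tables III, IV, and V, keyed by i in radix-2^i.
results = {
    "Altera LUTs": {2: [2650, 729, 240, 95], 3: [2835, 581, 96],
                    4: [3002, 339], 5: [3123, 157]},
    "Xilinx 4-input LUTs": {2: [1592, 735, 383, 201], 3: [1653, 556, 228],
                            4: [1791, 550], 5: [1863, 527]},
    "ASIC cell area": {2: [246500.8, 89471.2, 39967.2, 21294.0],
                       3: [260059.8, 66739.4, 25771.2],
                       4: [283501.4, 58167.2], 5: [300829.6, 27318.2]},
}

# Sum the memories of each algorithm, then pick the radix with the lowest total.
totals = {tech: {i: round(sum(mems), 1) for i, mems in rows.items()}
          for tech, rows in results.items()}
best_radix = {tech: min(t, key=t.get) for tech, t in totals.items()}

for tech in results:
    print(tech, totals[tech], "lowest total at radix-2^%d" % best_radix[tech])
```

The sums agree with the Total columns: the total decreases monotonically with $i$ for the Altera and ASIC results, while for Xilinx the minimum occurs at radix-$2^4$.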

IV. CONCLUSIONS

In this paper, we have analyzed the complexity of twiddle factor memories for pipelined FFTs considering different architectures. The analysis is based on complexity comparisons of different radix-$2^i$ algorithms when implemented either on field-programmable gate arrays (FPGAs) or standard cells. The results show that a plain look-up table is advantageous for low-resolution memories, while for larger-resolution twiddle factor memories, utilizing octave symmetry and an address generator is advantageous. The break-point up to which the plain look-up table approach is advantageous increases with increasing $i$.

REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

[2] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proc. IEEE Parallel Processing Symp., 1996, pp. 766–770.

[3] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” in Proc. IEEE URSI Int. Symp. Sig. Elect., 1998, pp. 257–262.

[4] J.-E. Oh and M.-S. Lim, “New radix-2 to the 4th power pipeline FFT processor,” IEICE Trans. Electron., vol. E88-C, no. 8, pp. 694–697, Aug. 2005.

[5] A. Cortes, I. Velez, and J. F. Sevillano, “Radix $r^k$ FFTs: matricial representation and SDC/SDF pipeline implementation,” IEEE Trans. Signal Processing, vol. 57, no. 7, pp. 2824–2839, Jul. 2009.

[6] Y.-E. Kim, K.-J. Cho, and J.-G. Chung, “Low power small area modified Booth multiplier design for predetermined coefficients,” IEICE Trans. Fund., vol. E90-A, no. 3, pp. 694–697, Mar. 2007.

[7] W. Han, T. Arslan, A. T. Erdogan, and M. Hasan, “High-performance low-power FFT cores,” ETRI Journal, vol. 30, no. 3, pp. 451–460, June 2008.

[8] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex constant multiplication for FFTs,” in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 24–27, 2009.

[9] H. Cho, M. Kim, D. Kim, and J. Kim, “R$2^2$SDF FFT implementation with coefficient memory reduction scheme,” in Proc. Vehicular Technology Conf., 2006.

[10] M. Hasan and T. Arslan, “Scheme for reducing size of coefficient memory in FFT processor,” IEEE Electronics Letters, vol. 38, no. 4, pp. 163–164, Feb. 2007.

[11] O. Gustafsson and K. Johansson, “An empirical study on standard cell synthesis of elementary function look-up tables,” in Proc. Asilomar Conf. Signals Syst. Comp., Pacific Grove, CA, Oct. 26–29, 2008.

[12] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, “Fast and accurate resource estimation of automatically generated custom DFT IP cores,” in Proc. FPGA, 2006, pp. 211–220.