Minimising Memory Access Conflicts for FFT on a DSP

N/A
N/A
Protected

Academic year: 2021

Share "Minimising Memory Access Conflicts for FFT on a DSP"

Copied!
95
0
0

Loading.... (view fulltext now)

Full text

(1)

Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2019

Minimising Memory Access Conflicts for FFT on a DSP

Sofia Jonsson


Minimising Memory Access Conflicts for FFT on a DSP: Sofia Jonsson

LiTH-ISY-EX--19/5263--SE

Supervisor: Oscar Gustafsson

ISY, Linköpings universitet

Jean Michel Chabloz

Ericsson

Examiner: Kent Palmkvist

ISY, Linköpings universitet

Division of Computer Engineering
Department of Electrical Engineering

Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Sofia Jonsson


Abstract

The FFT support in an Ericsson proprietary DSP is to be improved in order to achieve high performance without disrupting the current DSP architecture too much. The FFT:s and inverse FFT:s in question should support FFT sizes ranging from 12 to 2048, where the size is a product of the prime factors 2, 3 and 5. Memory access conflicts in particular could cause low performance in terms of speed compared with the existing hardware accelerator. The problem addressed in this thesis is how to minimise these memory access conflicts. The studied FFT is a mixed-radix DIT FFT where the butterfly results are written back to addresses of a certain order. Furthermore, different buffer structures and sizes are studied, as well as different orders in which to perform the operations within each FFT butterfly stage, and different orders in which to shuffle the samples in the initial stage.

The study shows that for both studied buffer structures there are buffer sizes giving good performance for the majority of the FFT sizes, without largely changing the current architecture. By using certain orders for performing the operations and shuffling within the FFT stages for the remaining FFT sizes, it is possible to reach good performance for these cases as well.


Acknowledgments

Firstly, I would like to thank my supervisor Jean-Michel Chabloz for supporting me all along this thesis project at Ericsson. Thanks for numerous interesting discussions and for giving me key insights about the subject.

I would like to extend my gratitude to Jakob Brundin, who helped come up with the idea behind this thesis; thanks for sharing knowledge about the subject. I would also like to thank my academic supervisor Oscar Gustafsson as well as my examiner Kent Palmkvist for their advice and feedback on my work.

Finally, I would like to thank Pierre Rohdin for giving me the opportunity to work on this interesting thesis project at Ericsson.

Stockholm, August 2019 Sofia Jonsson


Contents

Notation
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Research Questions
  1.4 Delimitation
  1.5 Background
2 Theory
  2.1 Study of FFT
    2.1.1 Radix-2 DIT FFT
    2.1.2 Mixed Radix DIT FFT
  2.2 DSP architecture
    2.2.1 Current DSP Architecture
    2.2.2 FFT Support on the DSP Architecture
  2.3 FFT and Methods of Study
    2.3.1 FFT of study
    2.3.2 Methods to Avoid Memory Conflicts
    2.3.3 Address pattern
3 Methodology
  3.1 Modelling
    3.1.1 Shared Buffers
    3.1.2 Increment Step
    3.1.3 Split Shuffle Writing
    3.1.4 Optimisation
    3.1.5 Result metrics
4 Results
  4.1 Results for Buffer Size 3-10
    4.1.1 Non-shared Buffers, Optimistic Model
    4.1.2 Non-shared Buffers, Pessimistic Model
    4.1.3 Shared Buffers, Optimistic Model
    4.1.4 Shared Buffers, Pessimistic Model
  4.2 Results for Buffer Size 4
    4.2.1 Factors and Their Order
    4.2.2 Operation Increment Step and Split Shuffle Writing
    4.2.3 Comparison with Accelerator
5 Discussion
  5.1 Factors and Their Order
  5.2 Results for Buffer Size 3-10
  5.3 Results for Buffer Size 4
    5.3.1 Comparison with Accelerator
6 Conclusions
A Detailed Results for Buffer Size 4
  A.1 Detailed Results for Non-shared Buffers, Optimistic Model
  A.2 Detailed Results for Non-shared Buffers, Pessimistic Model
  A.3 Detailed Results for Shared Buffers, Optimistic Model
  A.4 Detailed Results for Shared Buffers, Pessimistic Model
  A.5 Comparison with Accelerator


Notation

Abbreviations

Abbreviation Meaning

CC    Clock cycle/clock cycles
CM    Common memory
DFT   Discrete Fourier transform
DIF   Decimation in frequency
DIT   Decimation in time
DSP   Digital signal processor
FFT   Fast Fourier transform
FIFO  First in, first out
IDFT  Inverse discrete Fourier transform
IFFT  Inverse fast Fourier transform
LDM   Local data memory
PTR   Pointer
SFG   Signal flow graph
VLIW  Very long instruction word


1

Introduction

In this chapter, the thesis project is introduced. Section 1.1 briefly describes the motivation behind the project and section 1.2 explains its goals. The research questions are described in section 1.3. In section 1.4, the delimitations are described. Lastly, the background is further explained in section 1.5.

1.1

Motivation

The Fast Fourier Transform (FFT) is an important tool when processing digital signals in mobile networks. Having high performance on FFT calculations is therefore crucial. The FFT calculations in an Ericsson proprietary chip are performed by a hardware accelerator, but a software solution on a digital signal processor (DSP) will now also be tried. Therefore, it is of interest to improve the FFT support on the DSP to achieve the best performance possible based on the current DSP architecture, without any major changes to it. The FFT support consists of new DSP operations and methods to use them. The FFT:s and inverse FFT:s in question should support sizes ranging from 12 to 2048, where the size is a product of any of the factors 2, 3 and 5. There are numerous ways of implementing this, where different ways face different challenges.

A first study of the DSP architecture and some FFT algorithms showed that especially the memory accessing could cause low performance in terms of speed, in comparison with the existing hardware accelerator. Therefore, it is important to find ways of minimising memory access conflicts, which is the problem addressed in this thesis.


1.2

Goals

The aim of the thesis is to investigate what performance in terms of speed a certain version of the DIT mixed-radix FFT could achieve on an Ericsson internal VLIW DSP processor. The study focuses only on the memory accessing aspects. The FFT differs from the in-place DIT mixed-radix FFT in that the butterfly results are written back in a certain shuffled order, the FFT thus not being in-place. The FFT in question, as well as why it is interesting for the target architecture, is explained further in chapter 2.

The performance of the FFT calculations in the DSP is expected to be lower than when run by the hardware accelerator, since the latter is designed uniquely to perform FFT:s while the DSP is not. However, there are other advantages of performing the FFT on a DSP that might compensate for the lower performance. This of course depends on how large the difference in performance is, which is why it is important to examine different ways of implementing FFT support on the DSP. The aim of the thesis is therefore to see if the FFT described later in this thesis, as well as the further methods studied, could make up an interesting solution to implement. Apart from reaching reasonable performance, the solution should not be too disruptive on the current DSP architecture.

1.3

Research Questions

What would the performance of an FFT mixed-radix DIT algorithm with extra reordering on the target architecture be?

Which factorization and orders of factors are optimal for the different FFT sizes?

The FFT sizes in question can be factorized into different sets of factors. These factors correspond to the radices of the butterfly stages. Different factorizations and orders of factors hence lead to different memory access patterns.

Could the order in which the operations are performed within each butterfly stage be modified in order to improve the performance?

The addresses to which the results of the butterflies should be written are decided by the algorithm (and the selected factorization and order of factors). One order for performing the operations within a butterfly stage is top-to-bottom order; however, other orders are also possible.

Could the order in which the input samples are shuffled in the initial shuffle stage be modified in order to improve the performance?


According to the algorithm, the input samples to the FFT are to be shuffled to certain addresses before the first butterfly stage takes place. Up to 6 samples can be shuffled at a time. Since the FFT sizes range from 12 to 2048 samples, there are different orders in which the samples can be shuffled.

How does the size of the write buffers affect the performance? And how would the performance be affected if two banks share the same write buffer, instead of letting all banks have their own?

The write buffers refer to buffers which the butterfly results are written to before reaching the local data memory (LDM). These buffers are supposed to be added to the current architecture in order to support the FFT operation.

1.4

Delimitation

The study is done by modelling using MATLAB. Implementation is not in the scope of this thesis. As mentioned, the investigation of the algorithms considers only the memory accessing aspects. How the implementation of butterfly calculations and twiddle factor application affects the performance is not studied. Where these matters affect the memory accessing aspects, and hence the modelling done in this thesis, assumptions are made, since it is not known exactly how the implementation would be done. The modelling aims to give as exact results as possible; however, some aspects have been neglected in order to ease the modelling. These assumptions and simplifications are described and discussed in chapter 3.

1.5

Background

The platform in question includes a common memory (CM) which is connected to DSP:s and accelerators. There is also a job handler unit which dispatches the jobs for the DSP:s and accelerators. This is illustrated in figure 1.1.


Figure 1.1: The platform - one CM is connected to DSP:s and accelerators ("ACC"). A job handler unit dispatches jobs to the DSP:s and the accelerators.


One of the main reasons making it preferable to perform the FFT calculations on a DSP rather than on an accelerator is that in many cases the functions before and after an FFT are done in DSP:s. This means that the data frequently has to pass (i.e. be written to and read from) the CM, which increases the latency and leads to a higher CM load, which in turn reduces the performance of the rest of the system. For the same cases where the functions before and after an FFT are done in DSP:s, performing the FFT in a DSP instead of in an accelerator would also ease the work for the job handler unit.


2

Theory

This chapter describes the theory behind the problem investigated in this thesis. First, in section 2.1, there is a brief introduction to fast Fourier transforms. Thereafter, a study of the DSP in question is presented in section 2.2. Lastly, in section 2.3, the FFT and methods investigated in this thesis are motivated and described.

2.1

Study of FFT

The study of the FFT presented in this section is mainly based on [4] and [5]. A signal can be represented in different ways and the representation is important when the signal is to be manipulated efficiently. The discrete Fourier transform, DFT, performs the separation of a time-domain signal into its frequency components. A signal is then said to be represented in the frequency domain. The inverse discrete Fourier transform, IDFT, instead performs a summation of the frequency components, transforming the signal from the frequency domain back to the time domain. The computational effort needed for many signal processing operations on an arbitrary discrete signal is much smaller when the signal is represented in the frequency domain rather than in the time domain. The N-point DFT of the time-domain sequence x(n) with period N is defined as:

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},   k = 0, 1, ..., N - 1   (2.1)


where W_N = e^{-j2\pi/N}. The N-point IDFT of the frequency coefficients X(k) is defined as:

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-nk},   n = 0, 1, ..., N - 1   (2.2)

With simple modifications, any DFT algorithm can be used to calculate an IDFT. The use of DFT:s is very efficient thanks to the fast algorithms that exist, which utilize the properties of the DFT such as its linearity and periodicity (with N). A fast Fourier transform, FFT, is an efficient algorithm for computation of a DFT. By decomposing the DFT, the computational complexity is reduced since calculations of subparts are reused. An inverse fast Fourier transform, IFFT, is an efficient algorithm for computation of the IDFT. With trivial modifications, any FFT algorithm can be used to calculate an IFFT.
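The definitions (2.1) and (2.2) can be sketched directly in code. The following is an illustrative O(N^2) implementation (the function names are mine, not taken from the thesis), checked by a DFT/IDFT round trip:

```python
import cmath

def dft(x):
    """Direct N-point DFT per eq. (2.1), with W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

def idft(X):
    """Direct N-point IDFT per eq. (2.2); note the negated exponent."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-n * k) for k in range(N)) / N for n in range(N)]

# Round trip: IDFT(DFT(x)) should recover x up to rounding errors.
x = [1 + 2j, -0.5j, 3.0, 4 - 1j]
y = idft(dft(x))
assert all(abs(a - b) < 1e-9 for a, b in zip(x, y))
```

A constant input gives all its energy in bin 0, which is a quick sanity check on the sign convention.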

There are numerous ways of performing FFT calculation and they come with different advantages and disadvantages. In Searching for the Best Cooley-Tukey FFT Algorithm [2], it is described that a general exhaustive search for the best algorithm is hard (and therefore a specific length is explored). It is also mentioned that given a certain architecture, an optimality criterion may be chosen for the algorithm (e.g. number of multiplications) in order to then search for the best algorithm according to that. In this thesis, the optimality criterion is performance in terms of clock cycles on the given DSP architecture (where the current architecture may be modified, though not too disruptively).

The most commonly used ways are the algorithms for 2^m-point FFT:s (where m is an integer > 0), such as radix-2 [1], radix-4 and split-radix FFT:s. This is mentioned in [6], which proposes an algorithm for computing 6^m-length DFT:s using radix 3 and radix 6. This is just one example of FFT:s for lengths not being 2^m, which has also been an area of research.

The FFT studied in this thesis is a variant of the mixed-radix in-place decimation in time (DIT) FFT. The DIT FFT is described below in section 2.1.1. The FFT used in this thesis is motivated and described in section 2.3, after the DSP architecture has been described in section 2.2.

2.1.1

Radix-2 DIT FFT

The radix-2 DIT FFT is based on the decomposition of a sequence of length N into two sequences of size N/2 each, where one sequence is a DFT of the even samples and the other is a DFT of the odd samples. The N-point DFT is then obtained in terms of the two N/2-point DFT:s. It is assumed that N is a power of two. The N/2-point DFT:s can then be repeatedly decomposed and obtained in the same way until only 2-point sequences are reached.

The N-point DFT X(k) of the samples x(n), expressed by the decompositions G(k) and H(k), where G(k) and H(k) are N/2-point DFT:s of the even and odd samples of x(n), respectively, can be written as follows:


X(k) = G(k) + W_N^k H(k),   k = 0, 1, ..., N/2 - 1   (2.3)

X(k + N/2) = G(k) - W_N^k H(k),   k = 0, 1, ..., N/2 - 1   (2.4)

The above expressions can be represented as a butterfly of radix 2, see figure 2.1. Radix 2 means that the DFT/butterfly in question transforms 2 samples. W_N^k is called a twiddle factor.

Figure 2.1: Radix-2 butterfly.
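Equations (2.3)-(2.4) can be verified numerically. The sketch below (with illustrative helper names of my own) assembles an 8-point DFT from two 4-point DFT:s using one column of radix-2 butterflies:

```python
import cmath

def naive_dft(x):
    # Direct DFT, eq. (2.1), used here only as a reference.
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

def butterfly2(g, h, w):
    """Radix-2 butterfly of eqs. (2.3)-(2.4); w is the twiddle W_N^k."""
    t = w * h
    return g + t, g - t

N = 8
x = [complex(n * n, -n) for n in range(N)]
G = naive_dft(x[0::2])          # N/2-point DFT of the even samples
H = naive_dft(x[1::2])          # N/2-point DFT of the odd samples
X = [0j] * N
for k in range(N // 2):
    w = cmath.exp(-2j * cmath.pi * k / N)
    X[k], X[k + N // 2] = butterfly2(G[k], H[k], w)

ref = naive_dft(x)
assert all(abs(a - b) < 1e-9 for a, b in zip(X, ref))
```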

The decomposition of N samples will be illustrated by an example with N=8. The signal flow graph (SFG) after the first decomposition of an 8-point DIT FFT is shown in figure 2.2. The two N/2-point DFT:s take the even indexed samples and the odd indexed samples as input, respectively. The output of the FFT is then in natural order. The four “crosses” at the right half are four radix-2 butterflies.

Figure 2.2: SFG for an 8-point DIT FFT after one decomposition.

For each iteration of decomposition, the order of the input samples needs to be changed, so that the two resulting DFT:s that a DFT gets decomposed into get their even and odd input samples, respectively. This is required in order to get the output in natural order. For an N-point FFT (where N is a power of two), where the DFT has


been decomposed repeatedly until the only calculations are radix-2 butterflies, the reordering of the initial input samples becomes bit-reversed order.
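The bit-reversed input ordering can be sketched as follows (an illustrative helper, not code from the thesis):

```python
def bit_reversed_indices(n_bits):
    """Input ordering for a radix-2 DIT FFT of size 2**n_bits."""
    size = 1 << n_bits

    def rev(i):
        # Reverse the n_bits-bit binary representation of i.
        r = 0
        for _ in range(n_bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    return [rev(i) for i in range(size)]

# For the 8-point example: recursive even/odd splitting ends up
# placing sample 1 (binary 001) at position 4 (binary 100), etc.
assert bit_reversed_indices(3) == [0, 4, 2, 6, 1, 5, 3, 7]
```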

With a similar strategy as for the DIT decomposition, decimation in frequency (DIF) can be achieved. The butterflies will be performed for the exact same samples as for DIT. What changes is the placement of the twiddle factors.

2.1.2

Mixed Radix DIT FFT

The above radix-2 DIT strategy for decomposing an FFT can also be applied when N is not a power of two but any composite number. These FFT:s are called mixed-radix DIT FFT:s.

The decomposition of N is written as a product of factors, N = m_1 m_2 ... m_r. First, we let N = m_1 N_1, where N_1 = m_2 ... m_r. The input sequence x(n) can then be separated into m_1 subsequences, where each subsequence has N_1 elements. This corresponds to the first decomposition. The DFT can then be written as:

X(k) = \sum_{n=0}^{N_1-1} x(n m_1) W_N^{n m_1 k} + \sum_{n=0}^{N_1-1} x(n m_1 + 1) W_N^{(n m_1 + 1)k} + ... + \sum_{n=0}^{N_1-1} x(n m_1 + m_1 - 1) W_N^{(n m_1 + m_1 - 1)k}   (2.5)

This first decomposition means that the DFT of size N is transformed into N_1 DFT:s of size m_1 and m_1 DFT:s of size N_1, plus twiddle factors. The second decomposition then consists of transforming each DFT of size N_1 into m_2 DFT:s of size m_3 and m_3 DFT:s of size m_2, plus twiddle factors. By performing decompositions recursively, we come down to a series of columns of butterflies. The number of columns is the number of factors m_1, m_2, ..., m_r that N is factorized into. Each column consists only of butterflies of a certain radix. These radices are the factors m_1, m_2, ..., m_r. Figure 2.3 illustrates the recursive decomposition for an FFT of size 24.


Figure 2.3: Recursive decomposition for a mixed-radix DIT FFT of size 24.

In figure 2.3, the first decomposition step decomposes a 24-point DFT into four 6-point DFT:s and six 4-point DFT:s, plus twiddle factors. In the second decomposition, each 6-point DFT is decomposed into three 2-point DFT:s and two 3-point DFT:s, plus twiddle factors. This means that the 24-point DFT has been decomposed into three columns of DFT:s of size 2, 3 and 4, respectively. These DFT:s


can be illustrated as butterflies of radix 2, 3 and 4, respectively. In this example, the FFT of size 24 has been factorized into 4, 3 and 2.
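The regrouping in equation (2.5) can be checked numerically. This sketch (with names of my own choosing) computes X(k) via the m_1-way split of the input sequence and compares it with the direct DFT sum:

```python
import cmath

def naive_dft(x):
    # Direct DFT, eq. (2.1), used as a reference.
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]

def dft_split(x, m1):
    """Compute X(k) by the m1-way split of eq. (2.5):
    one partial sum per subsequence x(n*m1 + offset)."""
    N = len(x)
    N1 = N // m1
    W = cmath.exp(-2j * cmath.pi / N)
    X = []
    for k in range(N):
        acc = 0j
        for offset in range(m1):            # the m1 subsequences
            acc += sum(x[n * m1 + offset] * W ** ((n * m1 + offset) * k)
                       for n in range(N1))
        X.append(acc)
    return X

# A 24-point example with m1 = 4, matching the decomposition in the text.
x = [complex(n, 1 - n) for n in range(24)]
assert all(abs(a - b) < 1e-9
           for a, b in zip(dft_split(x, 4), naive_dft(x)))
```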

In this thesis, the expression base is used to describe the factors an FFT is factorized into and in which order each radix should be performed. E.g. base [2 3 2 4] means that the FFT is factorized into 48 = 2 · 3 · 2 · 4, where 4 is the radix of the first butterfly stage (the leftmost stage in an SFG), 2 is the radix of the second butterfly stage, etc. The base of the FFT in figure 2.3 is [4 3 2].

Similarly to the radix-2 DIT FFT, the input samples need to be in reversed order for the mixed radix DIT FFT. In the mixed radix case, the reversal is done with regard to the current base. How this is calculated is described in section 2.3.3. The output samples are in natural order.
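Such a base-dependent reversal can be sketched as below. Note that this shows one common digit-reversal convention and is not necessarily the exact scheme of section 2.3.3; with base [2 2 2] it degenerates to ordinary bit reversal:

```python
def digit_reversed_indices(base):
    """Mixed-radix digit reversal for the given base (list of radices).
    Digits of each index are peeled off LSB-first and pushed MSB-first."""
    size = 1
    for r in base:
        size *= r
    order = []
    for i in range(size):
        idx, n = 0, i
        for radix in base:
            idx = idx * radix + n % radix
            n //= radix
        order.append(idx)
    return order

# Degenerates to bit reversal for a pure radix-2 base.
assert digit_reversed_indices([2, 2, 2]) == [0, 4, 2, 6, 1, 5, 3, 7]
# For any base the result is a permutation of 0..N-1.
assert sorted(digit_reversed_indices([4, 3, 2])) == list(range(24))
```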

2.2

DSP architecture

Ericsson internal documents have been read to learn relevant information about the DSP architecture in question, with focus on the local data memory (LDM). The parts relevant for this thesis are presented in this section.

2.2.1

Current DSP Architecture

The LDM has 32 16-bit banks. The data which should be handled in the FFT is complex numbers of 32 bits, 16 bits each for the real and imaginary parts. Since the relevant FFT data samples are 32 bits, the LDM is in this thesis considered to have 16 32-bit banks. The bank number for an address is obtained as address mod 16 = bank number (0-15). The LDM is illustrated in figure 2.4.

[Figure 2.4: the LDM with its 16 banks (0-15) and their write FIFO:s.]



It is possible to perform up to three parallel reads from or writes to LDM per cycle. However, each bank has only one read/write-port. This means that only one address per bank can be read per cycle. The exception is when reading the exact same address in a bank, which is allowed. One read access can be up to 512 bits (corresponding to all 16 banks), but as just mentioned, bank overlaps from parallel reads are not possible. Two parallel reads from the same bank causes a stall of one cycle, and three parallel reads from the same bank cause a stall of two cycles.

Similarly, it is possible to perform up to 3 parallel writes to LDM per cycle, where each write access can be up to 512 bits, bank overlaps from parallel writes not being possible. Each bank has a FIFO (first in, first out) write buffer of 3 positions. This means that when performing 3 parallel writes to addresses of the same bank, no stall cycle is required, unless the FIFO is already partially full.
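The read-conflict rules above can be captured in a small model (a sketch under the stated assumptions, not Ericsson code): bank = address mod 16, reads of the exact same address are merged, and each extra distinct address on a bank costs one stall cycle:

```python
NUM_BANKS = 16

def read_stall_cycles(addresses):
    """Stall cycles for one set of parallel reads.  Two reads hitting
    the same bank at different addresses cost one extra cycle each;
    reads of the exact same address are allowed and merged."""
    per_bank = {}
    for a in addresses:
        per_bank.setdefault(a % NUM_BANKS, set()).add(a)
    return sum(len(s) - 1 for s in per_bank.values())

assert read_stall_cycles([0, 1, 2]) == 0    # three different banks
assert read_stall_cycles([0, 16, 32]) == 2  # three reads, same bank
assert read_stall_cycles([5, 5, 21]) == 1   # duplicate address merged
```

The three assertions mirror the rules in the text: no stall for distinct banks, a two-cycle stall for three reads from one bank, and the same-address exception.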

2.2.2

FFT Support on the DSP Architecture

The structure of the studied FFT support is illustrated in figure 2.5. It includes a butterfly pipeline, which can be stopped by a stall signal. When implementing new instructions for FFT support, it is possible to modify the DSP so that up to 6 parallel reads or writes can be performed during an FFT. This is required since the butterfly operation can perform either 1 radix-6, 1 radix-5, 1 radix-4, 1-2 radix-3 or 1-3 radix-2 butterflies. At the end of the pipeline, there will be up to 6 samples to be written to the LDM. If these samples are written to consecutive addresses, no two samples will be written to the same bank. However, normally these samples are to be written to non-consecutive addresses, meaning that several samples could be written to the same bank. As mentioned in section 2.2.1, there is a FIFO write buffer on each bank, but these will overflow and give back pressure to the butterfly pipeline, meaning it will be stalled. Therefore, buffers are added at the end of the butterfly pipeline. The reason why the size of the existing LDM write buffers is not increased instead is that the LDM is a very critical part of the DSP and it would be hard and expensive to do so.

[Figure 2.5: the butterfly pipeline with FFT buffers, feeding the write buffers and memory banks of the LDM; a stall signal can stop the pipeline.]


Generating the stall signal is very timing critical since it has a large fan-out: it has to be sent to each flip-flop in the butterfly pipeline. Late in the clock cycle, the LDM will inform the FFT buffer whether or not it has the possibility to receive the data items. The FFT buffer then decides whether the butterfly pipeline must be stalled or not, depending on the incoming data from the pipeline and the current FFT buffer status. It may occur that it is hard to stop the pipeline in time.

FFT Buffer Structures

There are different structures possible for the additional FFT buffers. Two different ideas are explored in this thesis. One idea is to have one additional buffer per bank (plus the current write buffer), while the other idea is to let two banks (plus respective write buffers) share one additional buffer in order to reduce the number of required flip-flops.

If each bank has one additional buffer, these buffers could be implemented as FIFO buffers by using shift registers. Two advantages of using shift registers are that they are relatively simple and that they provide good timing, since the output of the FIFO is always taken from the same register. Such a buffer solution is illustrated in figure 2.6. A-D are 32-bit registers and each cycle, the value in register A is written to LDM. The control block ("CONTROL") controls the multiplexers deciding whether a register should be shifted or get a butterfly output value ("bfy out"), based on information from LDM. The control block also decides whether a stall is required.

Figure 2.6: Shift register (non-shared) buffer architecture with four positions. Each cycle the value in register A is written to LDM.
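The behaviour of such a buffer can be sketched abstractly (a behavioural model only; register-level details such as the shift multiplexers and the control block are abstracted away, and the class name is mine):

```python
class ShiftRegisterFifo:
    """Behavioural sketch of the non-shared buffer in figure 2.6:
    each cycle the oldest value (register A) drains to LDM if the
    bank can accept it, and one new butterfly output may be pushed.
    push() reports a stall when the buffer is full."""

    def __init__(self, depth=4):
        self.depth = depth
        self.regs = []              # regs[0] plays the role of register A

    def cycle(self, ldm_ready):
        """Drain one value to LDM if possible; returns it, or None."""
        if self.regs and ldm_ready:
            return self.regs.pop(0)
        return None

    def push(self, value):
        """Accept a butterfly output; returns True on stall (full)."""
        if len(self.regs) == self.depth:
            return True
        self.regs.append(value)
        return False

buf = ShiftRegisterFifo(depth=4)
stalls = [buf.push(v) for v in "wxyz"]
assert stalls == [False, False, False, False]
assert buf.push("!") is True                 # fifth push must stall
assert buf.cycle(ldm_ready=True) == "w"      # register A drains to LDM
assert buf.push("!") is False                # one position freed again
```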

If instead there is one additional buffer per two banks, this would be implemented differently. Since we want to be able to empty one value per bank per buffer, thus up to two samples per cycle, a shift register cannot be used. One way of implementing a shared buffer is by multiplexing the buffer registers, as illustrated in figure 2.7. A drawback of this architecture is the bad timing, since a multiplexer is required to select the buffer output, and the LDM connection is very timing critical. This is resolved by adding a register after the multiplexer, holding the value to be written to LDM. This extra register means an increased area, but is necessary to ensure good timing. It would be possible to make this register writable directly, without passing registers A-D (when these are empty). However, this would require additional logic.

Figure 2.7: Shared buffer architecture with four positions.

The control block ("CONTROL") has a write pointer ("WRITEPTR") keeping track of which register each butterfly output ("bfy out") is written to, as well as a read pointer ("READPTR") keeping track of which register should be read from to LDM.

From here on in this report, the additional write buffers will be referred to simply as "the buffers". If it is not explicitly written that the buffers in question are shared buffers, they are non-shared buffers (i.e. one write buffer per bank).


2.3

FFT and Methods of Study

The FFT support studied in this thesis is supposed to include a new butterfly operation which can take up to 6 samples as input (and give up to 6 samples as output). The motivation behind this is based on the current DSP architecture but falls outside the scope of this thesis and will therefore not be further explained. The butterfly operation can perform either 1 radix-6, 1 radix-5, 1 radix-4, 1-2 radix-3 or 1-3 radix-2 butterflies. The twiddle factors are supposed to be applied after the butterfly calculation, before writing the samples to LDM (hence a DIT algorithm). Since the memory access pattern remains the same for DIT and DIF, it is not of importance for this study whether DIT or DIF is used. In this report, the term basic operation includes the butterfly calculation as well as the twiddle factor application.

2.3.1

FFT of study

One issue with this organization of the FFT in the general mixed-radix case, when implemented on the current DSP architecture in question, is the memory accessing, i.e. the reads from LDM to the basic operations and the writes from the basic operations to LDM. When looking at the input samples which are read from LDM to the basic operations, one problem that could occur is when we want to read data from addresses of the same bank during the same cycle (i.e. a memory access conflict), which would introduce stall cycles.

When writing the basic operation output samples to LDM, similar issues could occur. One difference though is that for writing, we have the write buffers, which means that we can write several samples to addresses of the same bank without causing stall cycles (depending on the size and structure of the buffer).

For these reasons, the main FFT studied in this thesis is a modification of the in-place DIT mixed-radix FFT, where a shuffle of the samples is introduced after each butterfly stage in order to place all input samples for the following butterfly stage at consecutive addresses. This guarantees that all input samples for a butterfly can be read during the same clock cycle. The way this shuffle is done in this thesis is illustrated in figure 2.8. Shuffle b makes the input samples to each butterfly of the following stage consecutive. The pattern for this is such that the first butterfly in stage 2 should have as its uppermost input sample the first output sample from stage 1 (i.e. the uppermost output sample from the uppermost butterfly of stage 1 in the SFG). The second butterfly in stage 2 should include the uppermost output sample from stage 1 which has not yet been taken. The following butterflies follow the same principle.



Figure 2.8: Signal flow graph for FFT with base [4 3 2], where the two sub-shuffles in stage 2 are shown. The twiddle factors are not shown.

Shuffle c is the opposite of shuffle b. This means that the samples after shuffle c are in the exact same order as the samples after butterfly stage 2 for an in-place mixed-radix DIT FFT. Shuffle d then makes the input samples to each butterfly of the following stage consecutive. The pattern is the same as described for the previous stage. Shuffle e is then the opposite of shuffle d, which means that the output samples of the FFT are in the same order as for an in-place mixed-radix DIT FFT, which is natural order.

The FFT:s in this thesis have 2-6 butterfly stages. The shuffle after the first and last butterfly stage will always be just one shuffle. For the intermediate stages (for FFT:s having 3-6 butterfly stages), two shuffles occur. The first one is the opposite of the previous shuffle, and the second one makes the sample order consecutive for the following stage, as described in the example above. These two shuffles are however always merged into one. An illustration of an SFG showing the merged shuffle is presented in figure 2.9, where it is denoted cd. Each shuffle corresponds to writing the output samples from the basic operations to LDM addresses corresponding to the respective order in question. How these addresses are calculated is described in section 2.3.3.

Although reading from LDM will be from consecutive addresses, thus not causing any memory access conflicts, writing to LDM could still cause memory access conflicts when there are no, or not enough, empty positions in a write buffer when required.
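The difference between consecutive and shuffled write addresses can be illustrated with a small sketch (bank = address mod 16, as in section 2.2.1; the function name is mine):

```python
NUM_BANKS = 16

def bank_pressure(write_addresses):
    """Maximum number of one operation's output samples landing in any
    single bank.  Consecutive addresses spread over distinct banks;
    strided shuffle addresses may pile up on one bank and fill its
    write buffer, eventually stalling the butterfly pipeline."""
    hits = [0] * NUM_BANKS
    for a in write_addresses:
        hits[a % NUM_BANKS] += 1
    return max(hits)

# Six consecutive writes (e.g. after the final shuffle): one per bank.
assert bank_pressure(range(100, 106)) == 1
# Six stride-16 shuffle writes: all samples hit the same bank.
assert bank_pressure(range(0, 96, 16)) == 6
```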


Figure 2.9: Signal flow graph for FFT with base [4 3 2], where the shuffle in stage 2 is merged. The twiddle factors are not shown.

2.3.2

Methods to Avoid Memory Conflicts

For the modified DIT mixed-radix FFT, several methods are studied in order to see if they can be used to increase the performance of the FFT by avoiding memory access conflicts. These methods are presented below. Other research has been done in this area; e.g. in [3], the selected number of banks plays a key role in improving the performance. In this thesis though, the number of banks is already set.

Choosing factors and their order

As mentioned, the FFT support should be for butterflies of radices 2, 3, 4, 5 and 6 and for sizes 12-2048, where the size is a product of any of the aforementioned radices. This makes up 101 different sizes, and the total number of factorizations is more than 9000 (if limiting the maximum number of butterfly stages to 6, which is a reasonable limitation since the maximum FFT size can be reached within this). For each FFT stage, every bank will be written to an equal number of times (or the counts differ by one when the FFT size is not a multiple of 16, the number of banks). However, different factorizations and different orders of the factors lead to different address patterns and therefore also different memory access patterns. Which factorizations and which orders of factors give the least number of stall cycles is investigated. Should, for example, an FFT of size 12 be divided into factors 2 and 6, or 4 and 3, and for the latter case, would it be better to calculate radix 4 first and then radix 3, or the opposite?
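The 101 sizes can be enumerated directly. The following is a minimal Python sketch (not part of the thesis model, which is written in MATLAB) listing the sizes in [12, 2048] whose only prime factors are 2, 3 and 5:

```python
def five_smooth_sizes(lo=12, hi=2048):
    """Enumerate FFT sizes in [lo, hi] whose only prime factors are 2, 3 and 5."""
    sizes = []
    for n in range(lo, hi + 1):
        m = n
        for p in (2, 3, 5):
            while m % p == 0:
                m //= p
        if m == 1:          # fully reduced: n is a product of 2s, 3s and 5s
            sizes.append(n)
    return sizes

sizes = five_smooth_sizes()
```

Running this confirms the 101 supported sizes, ranging from 12 up to 2048.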

For bases not giving any stall it is better to select the largest factors possible, i.e. selecting factor 4 instead of two factors 2, and selecting factor 6 instead of factors 2 and 3. This is because fewer factors in a base leads to fewer stages to calculate.


2.3 FFT and Methods of Study 17

It is true that since we can perform 3 radix-2 butterflies at the same time, a radix-2 stage for a certain size could be performed quicker than a radix-4 stage for the same size. However, since a radix-4 stage corresponds to two radix-2 stages, selecting radix-4 remains better. This is true for the aspects regarded in this thesis, and does not consider how the basic operations would be implemented and what that might bring.

Orders of operations within each FFT stage

For a given FFT stage, the operations could be calculated in different orders. If e.g. an FFT butterfly stage consists of four basic operations, op1, op2, op3 and op4, performing the operations in order op1, op2, op3, op4 and in order op1, op3, op2, op4 yields different memory access patterns. The method consists in finding an operation order which minimises the memory access conflicts. Both the initial shuffle stage and the butterfly stages are looked at. This method is in this report called operation increment step, or just step.

Split writing at shuffle stage

In the shuffle stage, normally 6 samples are shuffled per operation. When implementing this, it is possible to either shuffle 6 consecutive samples per operation, or split these 6 samples into two groups of 3 consecutive samples each. This means that each clock cycle, shuffling of two groups of 3 samples at consecutive addresses is performed. The shuffling can be performed, similarly to the operation increment step, in many different orders, yielding different memory access patterns. The method consists in finding a shuffling order which minimises the memory access conflicts. This method is in this report called split shuffle writing.

Write buffers

As described in section 2.2, two different write buffers are considered - non-shared and shared buffers. These are investigated to see what impact they could have on the memory access conflicts. The main advantage of the shared buffers is that the performance per buffer register could be higher than for the non-shared solution. E.g. a shared buffer of 6 positions equals two non-shared buffers of 3 positions when it comes to the number of buffer registers (ignoring combinational logic). Writing 6 samples to one bank during one clock cycle without causing any stall is possible in the first case but not in the latter. Also the buffer size is studied. Larger buffers could reduce memory access conflicts, but require more area.

2.3.3 Address pattern

This section describes how the addresses are calculated. The address calculation is for the mixed radix DIT FFT with extra reordering used in this thesis.

When calculating the addresses which the results from the basic operations should be written to, the idea is to first express the addresses 0 to N-1 in a certain base (it is assumed the addresses where each FFT should be placed are 0 to N-1), and then switch places of columns. The switch of columns corresponds to a shuffle.


To explain how the addresses are calculated, an example illustrating this will firstly be presented in section 2.3.3. Then, a general description is presented in section 2.3.3. Table 2.1 presents some expressions that will be used in these explanations.


Expression: base matrix
Description: Matrix where each row is a number expressed in the base in question (one digit per column), where the numbers range from 0 to N-1. The first row equals 0, the second row equals 1, and so on. E.g. if the base is [2 3 2], the base matrix becomes the following, where the rightmost column is the least significant:

0:  0 0 0
1:  0 0 1
2:  0 1 0
3:  0 1 1
4:  0 2 0
5:  0 2 1
6:  1 0 0
7:  1 0 1
8:  1 1 0
9:  1 1 1
10: 1 2 0
11: 1 2 1

Expression: weights
Description: The weights of the base. The weight of position m is calculated as w_m = prod_{n=1}^{m-1} b_n for 2 ≤ m ≤ 6, where b_n is the element of the base at position n. The first weight is w_1 = 1. E.g. the weights for the base [2 3 2] are [6 2 1]. The weights for a certain base are used when calculating the decimal value from a number represented in that base. The decimal value is calculated as sum_{n=1}^{k} w_n x_n, where w_n is the weight for position n and x_n is the digit at position n in the number represented in the base in question (where 1 is the least significant position).

Expression: address matrix
Description: Matrix from which, together with weights, the addresses are calculated. The addresses are calculated by performing the following on each row: sum_{n=1}^{k} w_n x_n, where w_n is the weight at position n and x_n is the row element at position n. k is the number of row elements (which equals the number of weights). An address matrix is calculated by performing column switches on a base matrix.


Example - Address Calculation for Base = [4 3 2]

This example presents the address calculation for an FFT with base [4 3 2], N=24. Figures 2.8 and 2.9 in the previous section show the SFG for this FFT. Figure 2.8 explicitly shows the two sub-shuffles performed in stage 2 (shuffles c and d), while figure 2.9 shows the merged shuffle (shuffle cd). Furthermore, the addresses are written in the latter. Below, it is described how the addresses are calculated for each FFT stage. The idea for each step is to get an address matrix and weights which give the addresses according to the description in table 2.1.


Initial stage

The initial reordering before the first butterfly stage is the same as for an in-place mixed radix DIT FFT, which is a reversed order with regard to the base in question. In order to get the address matrix, the base matrix is first generated for the reversed initial base - [2 3 4]. The address matrix is the reversal of this matrix. The weights are calculated from the initial base, and become [6 2 1]. This stage is illustrated in figure 2.10, where the resulting addresses also are shown.

Figure 2.10: Initial stage - calculation of the addresses for base [4 3 2]. The resulting addresses are [0 6 12 18 2 8 14 20 4 10 16 22 1 7 13 19 3 9 15 21 5 11 17 23].
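The initial-stage calculation can be sketched in code. The Python sketch below (with illustrative function names, not from the thesis model) expresses each index in the reversed base, reverses the digit columns, and weights the result with the initial base's weights:

```python
from math import prod

def initial_stage_addresses(base):
    """Initial-stage write addresses: express 0..N-1 in the reversed base,
    reverse the digit columns, and weight with the initial base's weights."""
    N = prod(base)
    # Weights of the initial base; the rightmost position has weight 1.
    w = [1]
    for b in reversed(base[1:]):
        w.append(w[-1] * b)
    w = w[::-1]                       # e.g. [4, 3, 2] -> [6, 2, 1]
    addrs = []
    for i in range(N):
        digits = []
        # Digits of i in the reversed base, produced least-significant first;
        # this is exactly the column-reversed base-matrix row.
        for b in base:
            digits.append(i % b)
            i //= b
        addrs.append(sum(wn * d for wn, d in zip(w, digits)))
    return addrs
```

For base [4 3 2] this reproduces the address sequence 0, 6, 12, 18, 2, 8, 14, 20, ... shown above.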


Stage 1

The base matrix is first generated for the initial base [4 3 2]. Shuffle b (figure 2.8) corresponds to moving column 2 to position 1 (which means changing place of columns 2 and 1). The resulting matrix is the address matrix for stage 1. The weights are generated from the new base, [4 2 3] (where element 2 from the initial base has been moved to position 1, according to the column change). The weights for this base become [6 3 1], which are the address weights. This stage is illustrated in figure 2.11, where the resulting addresses also are shown.

Figure 2.11: Stage 1 - calculation of the addresses for base [4 3 2]. The resulting addresses are [0 3 1 4 2 5 6 9 7 10 8 11 12 15 13 16 14 17 18 21 19 22 20 23].


Stage 2

The base matrix is generated for the new base from the previous stage, [4 2 3]. Shuffle c (figure 2.8) corresponds to the opposite of shuffle b, and corresponds therefore to moving back column 1 to position 2. Shuffle d corresponds to moving column 3 to position 1 (also meaning a left shift of columns 2 and 1). The combination of these operations consists in switching place of columns 3 and 1. This corresponds to shuffle cd (figure 2.9). The resulting matrix is the address matrix. The corresponding new base becomes [3 2 4]. The address weights are generated from this base, and become [8 4 1]. This stage is illustrated in figure 2.12, where the resulting addresses also are shown.

Figure 2.12: Stage 2 - calculation of the addresses for base [4 3 2]. The resulting addresses are [0 8 16 4 12 20 1 9 17 5 13 21 2 10 18 6 14 22 3 11 19 7 15 23].


Stage 3

The base matrix is generated for the new base from previous stage, [3 2 4]. Shuffle e (figure 2.8) corresponds to the opposite of shuffle d, and corresponds therefore to moving back column 1 in the base matrix to position 3 (leading to a right shift of column 2 and 3). The corresponding new base becomes [4 3 2], which is the same as the initial base. The address weights are generated from this base, and become [6 2 1]. This stage is illustrated in figure 2.13, where the resulting addresses also are shown.

Figure 2.13: Stage 3 - calculation of the addresses for base [4 3 2]. The resulting addresses are [0 6 12 18 1 7 13 19 2 8 14 20 3 9 15 21 4 10 16 22 5 11 17 23].


General Description of Address Calculation

The bases can have 2-6 elements. The above example has 3 elements. The description below assumes instead that the base has 6 elements. For bases with fewer elements, redundant stages are disregarded.

Initial stage

The initial reordering before the first butterfly stage is the same as for an in-place mixed radix DIT FFT, which is a reversed order with regard to the base in question. For FFT:s of only radix 2, the initial stage consists in bit reversal. In order to get the address matrix, the base matrix is first generated for the reversed initial base. The address matrix is the reversal of this matrix. The weights are calculated from the new base, which is the same as the initial base since it has been reversed twice. This stage is illustrated in figure 2.14.


Figure 2.14: Initial stage - calculation of addresses for a general case with 6 elements in the base.

The upper left light green field in figure 2.14 is the initial base, and its reversal is represented by the upper middle light green field. The dark green field represents the base matrix for this reversed base. The light blue field is the initial base


reversed twice and the dark blue field is the address matrix. The red field is the weight elements and the yellow field is the addresses.

Stage 1

The base matrix is first generated for the initial base. The shuffle right after the first butterfly stage corresponds to moving column 2 to position 1 (which means changing place of column 2 and 1). The resulting matrix is the address matrix for stage 1. The weights are generated from the new base (where element 2 from the initial base has been moved to position 1, according to the column change). This stage is illustrated in figure 2.15.


Figure 2.15: Stage 1 - calculation of addresses for a general case with 6 elements in the base.

Stage 2

Generate the base matrix for the new base from the previous stage. The first sub-shuffle corresponds to the opposite of the previous stage's shuffle, and corresponds therefore to moving back column 1 to position 2. The second sub-shuffle corresponds to moving column 3 to position 1 (also meaning a left shift of columns 2 and 1). These operations put together result in switching place of columns 3 and 1. The resulting matrix is the address matrix. An illustration of this is presented in


figure 2.16. The weights are generated from the corresponding new base where the same column switches have been made.


Figure 2.16: Stage 2 - first step of calculation of addresses for a general case with 6 elements in the base.

Stage 3

Generate the base matrix for the new base from the previous stage. The first sub-shuffle corresponds to the opposite of the previous stage's second sub-shuffle, and corresponds therefore to moving back column 1 to position 3. The second sub-shuffle corresponds to moving column 4 to position 1 (also meaning a left shift of columns 3, 2 and 1). These operations put together result in switching place of columns 4 and 1. The resulting matrix is the address matrix. An illustration of this is presented in figure 2.17. The address weights are generated from the corresponding new base where the same column switches have been made.


Figure 2.17: Stage 3 - first step of calculation of addresses for a general case with 6 elements in the base.

Stage 4

Generate the base matrix for the new base from the previous stage. For the same reason as previously, the shuffle for this stage means switching place of columns 5 and 1. The resulting matrix is the address matrix. An illustration of this is presented in figure 2.18. The address weights are generated from the corresponding new base where the same column switches have been made.



Figure 2.18: Stage 4 - first step of calculation of addresses for a general case with 6 elements in the base.

Stage 5

Generate the base matrix for the new base from the previous stage. For the same reason as previously, the shuffle for this stage means switching place of columns 6 and 1. The resulting matrix is the address matrix. An illustration of this is presented in figure 2.19. The address weights are generated from the corresponding new base where the same column switches have been made.


Figure 2.19: Stage 5 - first step of calculation of addresses for a general case with 6 elements in the base.

Stage 6 - Last Stage

Generate the base matrix for the new base from the previous stage. The shuffle of this stage corresponds to moving back column 1 in the base matrix to the leftmost position (leading to a right shift of the remaining columns). The corresponding new base where the same switches have been made is the same as the initial base. An illustration of this is presented in figure 2.20. The address weights are generated from this base.



Figure 2.20: Stage 6 - first step of calculation of addresses for a general case with 6 elements in the base.

To conclude, the shuffles of the first and last stage correspond to moving column 2 to position 1 and moving column 1 to the leftmost position, respectively. Each intermediate stage corresponds to switching column stage + 1 and the first (rightmost) column. The stages above are for a base containing 6 elements. For bases with fewer than 6 elements, redundant stages above are disregarded.
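The column switches above can be sketched in code. The following Python sketch (function names are illustrative, not from the thesis model) computes the write addresses of each stage as a column permutation of the base matrix, and reproduces the addresses of the [4 3 2] example from figures 2.10-2.13:

```python
from math import prod

def col_weights(base):
    """Column weights for a base listed most-significant first;
    the rightmost column has weight 1."""
    w = [1]
    for b in reversed(base[1:]):
        w.append(w[-1] * b)
    return w[::-1]

def digits(i, base):
    """Row of the base matrix for index i (most-significant digit first)."""
    out = []
    for b in reversed(base):
        out.append(i % b)
        i //= b
    return out[::-1]

def stage_addresses(base, perm):
    """Write addresses for one stage: permute the base-matrix columns by
    'perm' (output column j takes input column perm[j]) and weight the
    permuted rows with the weights of the permuted base."""
    w = col_weights([base[p] for p in perm])
    return [sum(wn * d for wn, d in zip(w, [digits(i, base)[p] for p in perm]))
            for i in range(prod(base))]

# The [4 3 2] example; columns are indexed 0..2 from the left here, so
# "column 1" in the text (the rightmost column) is index 2.
init   = stage_addresses([2, 3, 4], [2, 1, 0])  # reversal of the reversed base
stage1 = stage_addresses([4, 3, 2], [0, 2, 1])  # swap columns 2 and 1
stage2 = stage_addresses([4, 2, 3], [2, 1, 0])  # swap columns 3 and 1
stage3 = stage_addresses([3, 2, 4], [2, 0, 1])  # column 1 back to leftmost
```

The four sequences begin 0 6 12 18 ..., 0 3 1 4 ..., 0 8 16 4 ... and 0 6 12 18 1 7 ..., matching the addresses in figures 2.10-2.13.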


3 Methodology

A model for calculating the number of clock cycles required for FFT calculation on the DSP using the DIT FFT algorithm with extra shuffle has been implemented in MATLAB. This model makes up the foundation of this study and is described in section 3.1.

3.1 Modelling

A model for calculating the number of clock cycles required for FFT calculation on the DSP using the DIT FFT algorithm with extra shuffle has been implemented in MATLAB.

The supported input parameters are the following:

• base (i.e. factors)
• number of banks sharing the same buffer (explanation in section 2.3.2)
• buffer size (explanation in section 2.3.2)
• increment steps for each stage (explanation in section 2.3.2)
• split shuffle writing (start index and step length) (explanation in section 2.3.2)
• whether read is considered (explanation below)

The outputs obtained are the following:


• number of clock cycles that the FFT would require (the total number of clock cycles as well as for each stage)

• number of clock cycles being stall cycles due to memory accessing issues (the total number of stall cycles as well as for each stage)

There are two versions of the model. One version assumes that each buffer can be emptied by one position per bank each clock cycle. This will most likely not always be the case, since the existing buffer for a certain bank can't be emptied if a read is made from that bank that cycle (since each bank only has one read-write port). The modelling is pessimistic when it comes to the write buffers; the existing buffers are assumed to always be full. Since the pipeline length is not known, it is hard to model these reads accurately. Instead, the second version of the model has been made as if there is no pipeline, where the output is obtained and should be written back to LDM in the same cycle as the input samples for that operation are read. An illustration of this is shown in figure 3.1.

Figure 3.1: Illustration of how the second version of the model works. The second stage for an FFT of base [4 3 2] is used as example. The black numbers are the addresses of the input and output samples, and the grey numbers are respective banks.


The input samples of the first operation in figure 3.1 (denoted 1) are read from banks 0, 1, 2, 3, 4 and 5. The output samples from the same operation are written to banks 0, 8, 0, 4, 12 and 4. Since the butterfly pipeline is neglected, it is assumed that when the output samples are to be written to banks 0, 8, 0, 4, 12 and 4, it is not possible to directly write to bank 0 or 4, since banks 0, 1, 2, 3, 4 and 5 are read from during the same cycle. In this case, considering reads thus leads to a stall.

Neglecting the pipeline is not accurate, but it models that up to 6 of the 16 banks may be read from during each cycle. This second version of the model (considering reads) is pessimistic and will therefore be called the pessimistic model, while the first version of the model (not considering reads) will be called the optimistic model.

The model starts by looking at the shuffle stage. It handles the groups of samples to shuffle per clock cycle in the order that the input parameters dictate (step and split). For each group of samples being handled at a time, the clock cycle counter is first incremented by 1 and the buffers are emptied by one position per bank (except the banks being read from that cycle, if reads are considered). Then the buffers are filled according to the shuffle addresses for the samples in question. The buffers are assumed to be empty at the first clock cycle. If we want to write more than allowed to any buffer, every buffer is emptied by one position per bank (reads are never considered here) until there is a free buffer position for the sample in question, which is then written. This is repeated until all samples for that group have been written. Each time the buffers are emptied due to this, the clock cycle counter and the stall cycle counter are incremented by 1. The next group of samples will then proceed in the same way, based on the buffer status updated for the previous group. When the first stage is done, the following butterfly stages are in turn calculated in the same way.

The number of clock cycles excluding the eventual stall cycles for a stage is equal to N /(number of samples handled per operation) rounded up to nearest integer. The model assumes that we could write maximum 6 samples to LDM each clock cycle, which means that for a radix-2 butterfly stage, the result from 3 butterflies (plus twiddle factor application) could be written back each cycle. For the radix-3 case, the result from 2 butterflies (plus twiddle factor application) could be written each cycle. For the cases where the FFT size is not a multiple of 6 and there is a radix 2 or a radix 3 stage, the basic operation will take 6 samples every time but one, which will consist of the remaining samples. An example of this is presented in figure 3.2. For the radix-4, radix-5 and radix-6 cases – 4, 5 and 6 samples are written back each cycle.
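As a small illustration (a Python sketch; the thesis model itself is in MATLAB), the stall-free cycle count of a stage follows directly from the 6-samples-per-cycle write limit:

```python
from math import ceil

def stage_cycles(N, radix):
    """Stall-free clock cycles for one butterfly stage: radix-2 and radix-3
    stages pack 3 resp. 2 butterflies per cycle (6 samples written), while
    radix-4, -5 and -6 stages write radix samples per cycle."""
    samples_per_cycle = 6 if radix in (2, 3) else radix
    return ceil(N / samples_per_cycle)
```

E.g. a radix-3 stage of an FFT of size 27 takes ceil(27/6) = 5 cycles, matching the five operations in figure 3.3.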


Figure 3.2: Example of when the basic operation will take 6 samples every time but one. The operations denoted 1, 2, 3 and 4 take 6 samples each, while the operation denoted 5 only takes 3 samples.

When an FFT stage is performed, the proceeding stage is performed directly afterwards in this model. When implementing the FFT though, the proceeding stage cannot be performed until each result from the current stage has been written back. In cases of small FFT:s, it might be needed to idle the pipeline before starting the proceeding stage while waiting for all the results of the current stage to be written to LDM. For the case studied in this thesis, normally a large number of FFT:s of the same size has to be run. This means that stage 1 could first be run for many FFT:s, before performing stage 2, and so on.

3.1.1 Shared Buffers

In this study, it is considered that either one or two banks share the same buffer. If the number of banks sharing the same buffer is one, it means that there is one buffer per bank (i.e. non-shared buffers). The banks sharing a buffer are always consecutive, i.e. when two banks share a buffer, banks 0 and 1 share a buffer, banks 2 and 3 share a buffer, and so on. When two banks share a buffer, one


sample per bank may be emptied per clock cycle. If the reads are considered, no sample is emptied to banks being read from the same cycle.

The order in which the samples from an operation are written to the write buffers is top to bottom. E.g. if the result from a radix-6 basic operation is aimed for banks 0, 0, 0, 0, 1, 1 (the first 0 being the top sample), and banks 0 and 1 share a buffer which is currently empty and has a size of 4 positions, then the four samples for bank 0 are placed in the buffer first, while the two samples for bank 1 are written to the buffer in the following two (stall) cycles.
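A minimal sketch of this buffering rule (Python; the function and its exact drain policy are an assumption based on the description above, not the thesis implementation):

```python
from collections import deque

def push_with_stalls(dest_banks, buffer_size, shared_banks):
    """Write one operation's output samples (given by destination bank) into
    an initially empty shared write buffer, top sample first. When the buffer
    is full, a stall cycle drains at most one buffered sample per bank."""
    buf = deque()
    stalls = 0
    for bank in dest_banks:
        while len(buf) >= buffer_size:
            stalls += 1                 # stall cycle
            for b in shared_banks:      # drain one sample per bank, if present
                if b in buf:
                    buf.remove(b)       # removes the oldest matching sample
        buf.append(bank)
    return stalls
```

For the example above - samples for banks 0, 0, 0, 0, 1, 1 into an empty 4-position buffer shared by banks 0 and 1 - this gives the two stall cycles described.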

3.1.2 Increment Step

As mentioned in section 2.3, an increment step could be used to decide the order in which the basic operations or shuffle operations are performed. How the increment step is modelled is described below. The example shows a butterfly stage, but the same principle is used for the shuffle stage.

Figure 3.3: Operation Indices for a radix-3 butterfly stage of a FFT of size 27.

Each operation to be performed within a stage is given an index starting with 1 at the top, counting upwards, as illustrated in figure 3.3, which shows the butterflies of a radix-3 stage for an FFT of size 27. When the FFT size is not a multiple of the number of samples handled at each butterfly calculation, it is always the operation with the highest index that treats fewer samples than the others. The operation with index 1 is always the first to be performed. The index for the next operation to be performed is calculated as next_idx = (current_idx + step) % nr_operations, where the notations presented in table 3.1 are used. If the next index becomes an index of an already performed operation, the next index is incremented by one index position until an index of a non-performed operation is reached.

Notation Meaning

% Modulo operation

nr_operations Number of operations in a stage

current_idx Current operation index

step Step number

next_idx Next operation index

Table 3.1: Notations for Calculation of Next Operation Index

If e.g. step = 3 is applied on the example in figure 3.3, the order of indices in which the basic operations are performed would be [1 4 2 5 3].
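The index calculation can be sketched as follows (Python; 1-based indices as in the text, with the skip-if-already-performed rule):

```python
def operation_order(nr_operations, step):
    """Order in which the operations of a stage are performed for a given
    increment step. Operation 1 is always first; if the next computed index
    belongs to an already performed operation, it is incremented by one
    position until a non-performed one is reached."""
    used = [False] * nr_operations
    order = []
    idx = 0                                  # 0-based internally
    for _ in range(nr_operations):
        while used[idx]:
            idx = (idx + 1) % nr_operations  # skip already performed operations
        order.append(idx + 1)
        used[idx] = True
        idx = (idx + step) % nr_operations
    return order
```

operation_order(5, 3) returns [1, 4, 2, 5, 3], the order given in the example above.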

The different FFT stages can use different steps. When optimizing for best step, the smallest step giving best performance is chosen.

3.1.3 Split Shuffle Writing

Each clock cycle, two groups of 3 samples at consecutive addresses are shuffled. To decide in which order the shuffle is done, each group of 3 samples is given an index, as illustrated in figure 3.4, which shows the samples to be shuffled for an FFT of size 24. There are many different combinations of two indices which could make up a good order; however, to keep regularity (to ease eventual implementation) it was chosen that the two indices should be incremented with the same step size. In order to avoid that the two indices collide, one of them is always odd and the other one is always even. The indices start at one. The odd index always starts at one, while the even index could start at any even index. In the cases where the size is not a multiple of 6, dummy samples are added so that the number of samples becomes a multiple of 6. This way, there will be an equal number of even and odd indices and each cycle, two indices may be selected for shuffling. The next even and odd indices are calculated as next_even_idx = (even_idx + 2 * step) % number_of_idx and next_odd_idx = (odd_idx + 2 * step) % number_of_idx, respectively, where the notations from table 3.2 are used. If the next even/odd index becomes an index already used, it is incremented by one even/odd index until an unused index is reached.
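A sketch of this index generation (Python; the wrap-around of a used index to the smallest unused one of the same parity is an assumption here, as the text only says the index is incremented until unused):

```python
def split_shuffle_order(number_of_idx, step, even_start=2):
    """Pairs of (odd, even) group indices for split shuffle writing.
    number_of_idx is assumed even (dummy samples pad the size to a
    multiple of 6); the odd index always starts at 1."""
    def sequence(start, first):
        pool = list(range(first, number_of_idx + 1, 2))
        used, order, idx = set(), [], start
        while len(order) < len(pool):
            while idx in used:        # skip to next unused same-parity index
                idx = idx + 2 if idx + 2 <= number_of_idx else pool[0]
            order.append(idx)
            used.add(idx)
            # 1-based form of (idx + 2 * step) % number_of_idx
            idx = (idx - 1 + 2 * step) % number_of_idx + 1
        return order
    odds = sequence(1, 1)
    evens = sequence(even_start, 2)
    return list(zip(odds, evens))
```

With the 8 groups of figure 3.4 and step 1, the pairs come out in natural order: (1, 2), (3, 4), (5, 6), (7, 8).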

Figure 3.4: Indices for split shuffle writing at the shuffle stage of a FFT of size 24. The indices are written on the grey fields.

Notation Meaning

number_of_idx Number of indices (both odd and even)

step Step number

odd_idx Current odd index

even_idx Current even index

next_odd_idx Next odd index

next_even_idx Next even index

Table 3.2: Notations for calculation of next shuffle indices

3.1.4 Optimisation

When the optimal base is to be selected for a certain FFT size, the base giving the lowest number of clock cycles is selected. If several bases give the same number of clock cycles, the base with as low radices as possible (for the earliest stages possible) is selected. E.g. for N = 16, if all the possible bases give the same number of clock cycles, the order in which the bases would be chosen is listed in table 3.3. If all the bases would give the same number of clock cycles, base number 1 would be selected. The leftmost radix corresponds to the first butterfly stage to be performed.

Order  Base
1      [2 2 2 2]
2      [2 2 4]
3      [2 4 2]
4      [4 2 2]
5      [4 4]

Table 3.3: The order in which the bases for N=16 would be selected in optimisation when giving the same number of clock cycles.
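This tie-breaking preference can be sketched as picking the lexicographically smallest base among the candidates (a Python sketch with illustrative names, not the thesis implementation):

```python
def factorizations(n, radices=(2, 3, 4, 5, 6), max_stages=6):
    """All ordered factorizations of n into the supported radices,
    limited to at most max_stages factors."""
    if n == 1:
        return [[]]
    if max_stages == 0:
        return []
    out = []
    for r in radices:
        if n % r == 0:
            out += [[r] + rest
                    for rest in factorizations(n // r, radices, max_stages - 1)]
    return out

# Lexicographic order prefers lower radices in the earliest stages.
bases_16 = sorted(factorizations(16))
```

For N = 16 this yields exactly the order of table 3.3: [2 2 2 2], [2 2 4], [2 4 2], [4 2 2], [4 4].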

When optimising with regard to step, the shuffle stage is optimised first. The steps are tried from 1 and up, until the largest step giving a different order is reached. The smallest step giving the minimum number of stalls is selected. The buffer status after the complete stage for the optimal step is the buffer status the proceeding stage begins with. The butterfly stages are then optimised in the same manner. When optimising with regard to step and split, the following rules apply. For the initial stage, the odd index always starts at 1, as mentioned. As for the even index, 2 is always tried first. For these indices, different steps are tried. If this gives any stall, the even index is incremented by one even index and then different steps are tried again. The simulation proceeds this way until a stall-free configuration is found, or until all even start indices and steps giving different orders have been tried. If two configurations give the same optimal number of stalls, the configuration with the smallest even start index is chosen. If two configurations having the same start indices give the same optimal number of stalls, the configuration with the smallest step is chosen.

3.1.5 Result metrics

Simulations have been run for both buffer structures and both models, for buffer sizes 3-10. For a given buffer structure, model and buffer size, results are presented showing performance for three different simulation cases. The first simulation case (Simulation case 1 - Initial Simulation) is when the algorithm is used without any step or split writing. The optimal factorisation and order of factors is chosen for each N.

The second simulation case (Simulation case 2 - Step Simulation) is based on the previous case, but here optimisation with regard to step is also performed. For the sizes with no stall-free base from the previous case, the chosen optimal base is optimised with regard to step. For the sizes where a stall-free base was found in the previous case, the optimal base and its result remain the same as in the previous simulation case.


The third and last simulation case (Simulation case 3 - Split Writing Simulation) is based on the previous cases, but here optimisation with regard to split writing is also performed. For the sizes from the previous simulation case that still give stalls, the chosen optimal base is optimised with regard to split writing. For the other sizes (being stall free), the optimal base and any chosen step remain the same as in the previous simulation case.

The three different optimisations are done according to the descriptions in section 3.1.4. Note that when optimisation is done with regard to step and split writing in the above cases, the optimisations are only made for the already chosen base. For some sizes, higher performance could be reached if optimisation of step and/or split writing were also tried on other bases for the respective size. However, the results for these simulation cases still show the impact that step and split writing have on the performance.
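The cascading of the three simulation cases can be sketched as below. All three callbacks (`best_base`, `opt_step`, `opt_split`) are hypothetical stand-ins for the simulations described above; the point is only that each later optimisation is applied solely to the sizes that still stall, while stall-free sizes keep their earlier result.

```python
def run_simulation_cases(sizes, best_base, opt_step, opt_split):
    """Apply the three simulation cases in cascade.

    Each callback returns (configuration, stall_count) for one FFT size:
    case 1 picks the optimal base, case 2 adds the increment step, and
    case 3 adds split shuffle writing, always refining the already
    chosen configuration rather than re-searching over all bases."""
    results = {}
    for n in sizes:
        cfg, stalls = best_base(n)           # case 1: optimal base only
        if stalls > 0:
            cfg, stalls = opt_step(n, cfg)   # case 2: add increment step
        if stalls > 0:
            cfg, stalls = opt_split(n, cfg)  # case 3: add split writing
        results[n] = (cfg, stalls)
    return results
```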

The results presented are the following:

• Average number of clock cycles required to perform an FFT (average over all 101 sizes)

• Total % of the clock cycles for all 101 FFT sizes being stall cycles

• Maximum % of clock cycles being stall cycles for a single size

• Number of sizes (of 101) where any stall cycle occurs when performing the respective FFT

The different metrics have different advantages and disadvantages. The performance is measured in clock cycles: the lower the number of clock cycles required to perform an FFT, the higher the performance. Reducing the number of stall cycles lowers the total number of clock cycles; however, depending on the radices, a base for a certain N can give zero stalls without being better than another base that gives stalls. It is hence interesting to look at both the average number of clock cycles and the number of stall cycles. A drawback of the average number of clock cycles and the total percentage of stall cycles over all 101 sizes is that they give no information about any specific FFT size. Therefore, the number of sizes with stalls and the maximum stall percentage for a single FFT size are also shown to give more information about the matter.
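A minimal sketch of how the four metrics could be computed from per-size simulation data. The `per_size` mapping of FFT size to (total cycles, stall cycles) is an assumed structure for this illustration, not the thesis's actual data format.

```python
def result_metrics(per_size):
    """Compute the four result metrics from per-size simulation data.

    per_size: dict mapping FFT size N -> (total_cycles, stall_cycles)
    for one buffer structure, model and buffer size."""
    cycles = [c for c, _ in per_size.values()]
    stalls = [s for _, s in per_size.values()]
    return {
        # average clock cycles over all simulated sizes
        "avg_cycles": sum(cycles) / len(cycles),
        # stall cycles as a share of all cycles, summed over all sizes
        "total_stall_pct": 100.0 * sum(stalls) / sum(cycles),
        # worst stall share seen for any single size
        "max_stall_pct": max(100.0 * s / c for c, s in per_size.values()),
        # how many sizes stall at all
        "sizes_with_stalls": sum(1 for s in stalls if s > 0),
    }
```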

Although these metrics give information about the performance, it is interesting to look at the results in more detail. To get more insight, more detailed results are shown for buffer size 4. Size 4 has been chosen since it turned out to be the most interesting size for implementation; this is discussed in chapter 5.


4 Results

Section 3.1.5, Result Metrics, describes the simulations for which the results in this section are presented.

In section 4.1, simulation results for buffer sizes 3-10 are presented for the two different buffer structures and the two models.

Section 4.2 presents further results for buffer size 4, including results showing the effect of optimisation of factorisation and factor order, operation increment step, and split shuffle writing. A comparison between the DSP solution for buffer size 4 and the hardware accelerator on which the FFTs are currently run is also presented. A detailed comparison can be found in section A.5.

4.1 Results for Buffer Size 3-10

The four sections below, 4.1.1-4.1.4, present simulation results for the two buffer structures and models, for buffer sizes 3-10. The first section, section 4.1.1, presents simulation results for the non-shared buffer structure with the optimistic model. The following section, section 4.1.2, presents simulation results for the non-shared buffer structure with the pessimistic model. In the third section, section 4.1.3, results are presented for the shared buffer structure with the optimistic model. The fourth and last section, section 4.1.4, presents results for the shared buffer structure with the pessimistic model. In each of these sections, four graphs are presented, each of them showing results for simulation cases 1-3, i.e. the initial simulations (optimal bases), step simulations (optimal increment steps) and split simulations (optimal split shuffle writing).
