
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Benchmarking of Sleipnir DSP Processor, ePUMA Platform

Thesis carried out in Computer Engineering at the Institute of Technology, Linköping University

by

Somasekar Murugesan

LiTH-ISY-EX--11/4536--SE

Linköping 2011

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor:

Di Wu, Linköpings universitet

Examiner:

Di Wu, Linköpings universitet

Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2011-12-02

Language: English

Report category: Examensarbete (Master's thesis)

URL for electronic version: http://www.da.isy.liu.se/, http://www.ep.liu.se

ISRN: LiTH-ISY-EX--11/4536--SE

Titel — Title:

Benchmarking av Sleipnir DSP-processor, ePUMA Plattform
Benchmarking of Sleipnir DSP Processor, ePUMA Platform

Författare — Author:

Somasekar Murugesan


Nyckelord — Keywords: Benchmarking, DSP Processor, Sleipnir core, Digital Signal Processing, FFT, DFT, DWT, DCT, FIR, assembler, ePUMA


Abstract

Choosing the right processor for an embedded application, designing a new processor, or selling a processor all require knowing how it stacks up against the competition and communicating its performance credibly to customers, which makes benchmarking very important. Benchmarks are recognized worldwide by processor vendors and customers alike as the fact-based way to evaluate and communicate embedded processor performance. This thesis describes in detail the benchmarking of the ePUMA multiprocessor developed by the Division of Computer Engineering, ISY, Linköping University, Sweden. A number of typical digital signal processing algorithms were chosen as benchmarks. These benchmarks were implemented in assembly code, with performance measured in clock cycles and in root mean square error relative to results computed in double precision. The ePUMA multiprocessor platform, which comprises the Sleipnir DSP processor and the Senior DSP processor, was used to implement the DSP algorithms. Matlab built-in models served as references for deriving the root mean square error values of the different algorithms against the assembly implementations. The execution time for the different DSP algorithms ranged from 51 to 6148 clock cycles, and the root mean square error values varied between 0.0003 and 0.11.


Acknowledgments

I would like to thank all the people who were involved in my work on this master thesis. My special thanks go to:

• Professor Dake Liu, for giving me the opportunity to carry out the thesis in his department.

• Di Wu, Guest Lecturer, Supervisor and Examiner, for suggesting the thesis topic and assisting me during the work.

• Jian Wang, Ph.D. student, for assisting me with the ePUMA setup.

• Andreas Karlsson, Ph.D. student, for assisting me with the instruction set architecture and the ePUMA microarchitecture design.

• Andreas Ehliar, Assistant Professor, who helped me understand DSP microarchitecture design.


Contents

1 Introduction
  1.1 Thesis Outline
  1.2 DSP
  1.3 DSP Implementations
    1.3.1 ASIC
    1.3.2 DSP Processors
  1.4 Scope of the Thesis
  1.5 Limitation of the Thesis

2 Benchmarking
  2.1 Why is benchmarking required?

3 Sleipnir architecture
  3.1 Overview
  3.2 Datatypes
  3.3 Addressing
  3.4 Datapath
  3.5 Register and Memory Layout view
  3.6 Program Flow Control
    3.6.1 Jumps
    3.6.2 Subroutines
    3.6.3 Hardware Repeat
  3.7 Pipeline Behavior
    3.7.1 Data Access
    3.7.2 Datapath
    3.7.3 Quick instructions

4 Transforms
  4.1 Definition of DFT
    4.1.1 Radix-2 DIT FFT
    4.1.2 Radix-3 DIT FFT
    4.1.3 Radix-4 DIT FFT
    4.1.4 Radix-5 DIT FFT
    4.1.5 Mixed Radix FFT Algorithms
  4.2 Definition of DCT
    4.2.1 1-D DCT
    4.2.2 2-D DCT
    4.2.3 DCT Benchmark
  4.3 Discrete Wavelet Transforms (DWT)
    4.3.1 DWT Benchmark
  4.4 Finite Impulse Response Filter (FIR)
    4.4.1 Properties of a FIR filter
    4.4.2 Implementation of 12-TAP FIR filter on Sleipnir
  4.5 Modularization of Assembly code
  4.6 Modularization Methods

5 Benchmarking Tools
  5.1 Reference Model and Algorithm Verification
    5.1.1 Reference Model
    5.1.2 Algorithm Verification
  5.2 Algorithm Implementation on Sleipnir DSP processor
    5.2.1 ePUMA Simulator
    5.2.2 Assembly File Simulation

6 Results
  6.1 Test setup information
  6.2 FFT Implementation
    6.2.1 12-Point FFT
    6.2.2 20-Point FFT
    6.2.3 30-Point FFT
    6.2.4 60-Point FFT
    6.2.5 300-Point FFT
    6.2.6 600-Point FFT
    6.2.7 900-Point FFT
    6.2.8 1200-Point FFT
  6.3 DCT Implementation
    6.3.1 DCT 8x8
    6.3.2 DCT 16x16
  6.4 DWT Implementation
    6.4.1 DWT 8x8
    6.4.2 DWT 16x16
  6.5 12-TAP FIR Implementation
    6.5.1 12-TAP FIR Filter

7 Conclusion

8 Future Work

9 Glossary

A DSP Algorithm - Sleipnir Assembly code
  A.1 12-Point FFT
  A.2 20-Point FFT

Chapter 1

Introduction

1.1 Thesis Outline

The report begins with background information and the problem description. It then presents the information required to solve the problem: the processor architecture details, simulator usage, the theory supporting the implementation, and the steps involved in the implementation process. It continues with a discussion of the results and ends with the conclusions of the thesis.

1.2 DSP

[Figure 1.1. A Simple Digital Processing System: analog input → ADC → digital signal processing → DAC → analog output.]

DSP is an abbreviation of Digital Signal Processing. A DSP system, as shown in figure 1.1, converts a signal from the analog to the digital domain, manipulates the signal to accomplish certain functions, and converts the result back from the digital to the analog domain. This process of digital signal manipulation is called digital signal processing.

Digital signal processing is used in a wide range of applications, in which the nature of the processed signals can be very different, such as radio, sound and video. Some applications of DSP are audio signal processing, audio compression, digital image processing, video compression, speech processing, speech recognition, digital communications and biomedicine. The complexity of the processing can vary depending on the application and its performance requirements. Depending on these requirements, digital signal processing can be implemented on general purpose computers or with embedded processors that may or may not include digital signal processors. For non-real-time applications, processing is done on general purpose computers. For high-volume real-time applications, processing can be implemented in ASICs. For other real-time applications, depending on the flexibility, performance, power and time-to-market requirements, specialized fixed-point or floating-point DSP processors or FPGAs can be used.

1.3 DSP Implementations

[Figure 1.2. Comparing three types of DSP implementations (from [1]): performance in MOPS (100 to 100000) against power consumption (0.001 to 0.1 mW/MHz, lower is better) for general DSP, ASIP DSP and DSP ASIC.]

1.3.1 ASIC

A digital signal processing algorithm is implemented in an ASIC under two different scenarios: when the application requires extreme performance, or when the application demands ultra low power and ultra low area, assuming the algorithm is stable and simple. In either case there is no flexibility, and hence no programming solution is possible.

1.3.2 DSP Processors

A digital signal processor (DSP) can be of two types: one that is highly efficient and designed for a specific application, called an Application Specific Instruction Processor (ASIP), or one that is designed to be highly flexible and to cover multiple application domains, called a General Purpose DSP.

GP DSP

A general purpose DSP has a generic instruction set which can be used across many applications. The generic instruction set covers all the basic arithmetic and sufficient control functions. It does not cover application specific features or acceleration of arithmetic and control operations for a specific application. Such a processor can be bought off the shelf, and the major work consists of developing software for the application. The hardware development is limited to peripheral design, that is, connecting the DSP core to the surrounding components, including an MCU (microcontroller), main memory and other input-output peripheral controllers. As a result, almost any application can be developed on this kind of general purpose processor in a short time, but at the cost of performance.

ASIP

A DSP ASIP has an instruction set designed for a specific application or class of applications. An application specific instruction set implies hardware support for accelerating certain arithmetic functions or control operations, and the removal of instructions unnecessary for the application. These decisions are based on application specific requirements such as low power consumption, high performance, small area and low cost for high volume production. If such a processor is used in an application for which it was not intended, poor performance can be expected. For example, using a baseband processor in an image processing application may yield catastrophically poor performance.

1.4 Scope of the Thesis

The aim of this thesis is to benchmark the Sleipnir DSP processor by implementing different DSP algorithms: 12, 20, 30, 60, 300, 600, 900 and 1200-point Discrete Fourier Transforms, 8x8 and 16x16 Discrete Wavelet Transforms, 8x8 and 16x16 Discrete Cosine Transforms, and a 12-tap Finite Impulse Response filter. Benchmarking is done by capturing the execution time in clock cycles and calculating the root mean square error values for the different algorithms.

1.5 Limitation of the Thesis

During the course of the project, a few limitations were placed on its scope because of the short implementation time available. They are listed below.

1. The mandatory implementation requirements are restricted to an effective and functional implementation. Code-optimal and memory-optimal implementations are part of the optional requirements.

2. The mandatory verification requirements are restricted to functional verification of the algorithm. Stress-test verification, which provides insight into the performance and consistency of the results, is part of the optional requirements.

3. All published results represent the cumulative execution time of the prologue, program logic and epilogue sections.

4. The optional requirements may be implemented depending on the implementation time available.


Chapter 2

Benchmarking

2.1 Why is benchmarking required?

Currently, different companies offer different DSP processors for different applications. DSP processors mainly differ in their speed, cost, memory size, instruction set, interfaces, additional hardware, and software tools for development and debugging. Customers are therefore in a tricky situation when choosing the DSP processor appropriate for their application requirements. This is where benchmarking of DSP processors can help customers solve the problem of choice.

Benchmarking is a testing procedure that measures execution speed, power consumption, memory size and so on. The most important of these is the execution speed of the processor for a particular algorithm. Once the processors are narrowed down by the required execution speed, selecting a processor satisfying the remaining parameters is not that difficult.

It is difficult to make a fair speed comparison of DSP processors because there are several ways to measure speed. One approach is to measure speed in MIPS (millions of instructions per second), another in MOPS (millions of operations per second), and a third in MFLOPS (millions of floating point operations per second). The problem with such comparisons is that different DSP processors accomplish different amounts of work per instruction, and there is no standard definition of an operation for a DSP processor. So all vendors follow the tradition of giving performance in clock cycles.

In this thesis work, benchmarking is done by giving the number of clock cycles required to complete a given task or application. This approach works well, as it exposes the speed as well as the strengths and weaknesses of the processor's instruction set design. A lower cycle count demonstrates the power of the instruction set and architecture, while a higher cycle count reveals the weaknesses of the instruction set design and architecture.

The most important information for a customer of a DSP processor is probably provided by benchmark results of basic DSP algorithms such as matrix multiplication, filtering operations such as finite impulse response filters, fast Fourier transforms, discrete cosine transforms, discrete wavelet transforms and so on. Based on these inputs the customer will have a fair idea of the performance of the processor in the respective applications.

The benchmarking process is generally done by implementing the algorithm in assembly language for the given processor. Implementing the algorithm at the assembly language level ensures a highly optimized solution with respect to both the instruction set and memory usage. The algorithm could also be implemented in a higher level language such as C, but the effectiveness of such implementations depends heavily on the compiler used. They therefore show the optimality of the compiler rather than the performance of the processor.
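Cycle counts are only half of the benchmark figures reported later; the other half is numerical accuracy. The accuracy comparison — a quantized implementation measured against a double-precision reference via root mean square error — can be sketched as follows. This is a Python illustration only: the thesis uses Matlab built-in models as references, and `quantize_q15` is a hypothetical stand-in for the fixed-point datapath, assuming plain 16-bit Q15 rounding.

```python
import math

def rmse(result, reference):
    """Root mean square error between an implementation and its reference."""
    n = len(reference)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(result, reference)) / n)

def quantize_q15(v):
    """Hypothetical model of 16-bit fractional (Q15) storage:
    round the value to the nearest multiple of 2^-15."""
    return round(v * 2 ** 15) / 2 ** 15

# Double-precision reference signal (the thesis uses Matlab models here).
reference = [math.sin(2 * math.pi * n / 32) for n in range(32)]
implemented = [quantize_q15(v) for v in reference]
err = rmse(implemented, reference)   # on the order of the Q15 step size
```

The same scheme applies to the FFT, DCT, DWT and FIR benchmarks: run the fixed-point implementation, run the double-precision reference, and report the RMSE over the outputs.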


Chapter 3

Sleipnir architecture

Most details in this chapter are drawn from [2]. For more details on the Sleipnir architecture, refer to [2]; for the Sleipnir instruction set, [3]; for the Sleipnir memory subsystem and permutation, [4], [6] and [12]; and for the Sleipnir simulator, [5].

3.1 Overview

The internal structure of Sleipnir is depicted in figure 3.1. Sleipnir is an 8-way SIMD datapath processor supporting both real valued and complex valued fixed-point data. The Sleipnir structure has two levels of memory. At the first level is a program memory which contains the Sleipnir executable. At the next level are the main data memory, the constant memory, the general purpose register file, the special purpose register file, the vector accumulator register and a vector flag register. The main data memory comprises three local vector memories (LVMs) acting as input, output and intermediate storage for data. The constant memory stores constants and permutation patterns for the LVMs and acts as a read only memory during program execution on the Sleipnir DSP processor. The general purpose register file is a vector register file used to store intermediate data. The special register file holds the address registers. The vector accumulator register is used in the datapath, and the vector flag register is updated with accumulated data and flag results respectively. The Sleipnir instruction set supports a maximum of two source operands and one destination operand.

3.2 Datatypes

The data types that Sleipnir support are all fixed-point data types. The supported types are:

• Byte - Signed or unsigned 8-bit value.

• Word - Signed or unsigned 16-bit value.

• Double word - Signed or unsigned 32-bit value.

• Complex word - Signed or unsigned 32-bit value with a 16-bit real part and a 16-bit imaginary part.

• Complex double word - Signed or unsigned 64-bit value with a 32-bit real part and a 32-bit imaginary part.

[Figure 3.1. Sleipnir Overview: the instruction decoder, PC/PC-FSM and program memory (PM) drive the datapath; LVM0, LVM1, LVM2, the VRF, SRF and CM are accessed through the AGU.]

3.3 Addressing

The Sleipnir processor supports addressing of words (16 bits), double words (32 bits), half vectors (64 bits) and vectors (128 bits). It also supports operations on bytes; a word must then be split into two parts and operated on separately.

3.4 Datapath

The Sleipnir datapath consists of 8 x 16-bit data lanes. The following classes of instructions are supported by the datapath.


• Scalar operations - The first datapath lane (real words) or the first two datapath lanes (complex/double words) are used to perform operations on scalars.

• Vector operations - The whole pipeline, or a part of it (depending on data width), is used by vector operations to process data in parallel.

• Triangular operations - These operations take a vector as input and produce scalars as output. Dependencies exist between datapath lanes during triangular operations.

• Intra-vector operations - These operations take a vector as input and output a vector. Dependencies exist between the datapath lanes. Sorting and butterfly instructions belong to the intra-vector operations.

3.5 Register and Memory Layout view

The table below shows the size and the smallest addressable unit of the different memory systems in the Sleipnir processor.

Name                         | Smallest Addr | Size
Vector Register File         | Word          | 8*128 bits
Special Register File        | Word          | 8*128 bits
Vector Accumulator Register  | Word          | 320 (8*40) bits
Local Vector Memory          | Word          | 4096*128 bits
Constant Memory              | Vector        | 128*128 bits
Program Memory               | -             | 1024 instructions

3.6 Program Flow Control

3.6.1 Jumps

Sleipnir supports both conditional and unconditional jumps. An unconditional jump transfers control to a location with a single delay slot. For conditional jumps, Sleipnir assumes the conditions are set before the jump instruction, not during it. The typical usage is to execute an instruction that sets the flags ahead of the jump instruction and then perform the jump based on those flags.

3.6.2 Subroutines

Call and return instructions are used to call a subroutine and return from it, respectively. For faster calls, it is better to use the hardware stack instead of the memory stack. The hardware stack supports four levels of nesting.


3.6.3 Hardware Repeat

The Sleipnir instruction set supports hardware repeat through the repeat and repeatr instructions. The number of instructions to repeat and the repeat count are given as immediate values to the repeat instruction, and passed through a register to the repeatr instruction.

3.7 Pipeline Behavior

The processor pipeline varies between 2 and 13 stages. Figure 3.2 shows the possible pipeline behaviors. Control instructions mostly use 2 pipeline stages, quick datapath instructions use 8 to 10 stages, and the normal datapath varies between 11 and 13 stages. The variations are due to differences in the destination location.

3.7.1 Data Access

Data access requires four stages, shown below.

• CMAGU - Update the CM address registers to perform addressing for the CM.

• SAGU and CMACCESS - Both the scalar part of the LVM and the CM are accessed, and the memory location for the LVM is computed. During this stage, the VRF may be accessed if the addressing mode uses the VRF for LVM address computation.

• VAGU - The CM data is now available and is used for vector address generation.

• LVM access - The generated vector address is used for the LVM access. If the instruction references the VRF and SRF to fetch operands, the access to the VRF and SRF is also done.

3.7.2 Datapath

The datapath is organized as a four-stage pipeline. The pipeline stages are the following:

• MUL - Performs both real and complex multiplication using 16 18x18-bit multipliers. Sign extension and data formatting are also done in this stage.

• ATREEF and ATREES - These stages consist of three levels of arithmetic units, with each level consisting of 8 units. The lane dependent operations and the triangular operations use this structure to achieve their functionality.

• ALU - The final stage, which comprises the arithmetic and logic units, accumulator registers, flag generation and post-operation support.


3.7.3 Quick instructions

A quick instruction bypasses the multiplier and atree datapath stages, effectively reducing the latency from 4 clock cycles to 1. This is used when the result must be available immediately because another instruction depends on the data.

[Figure 3.2. Pipeline behavior: control instructions pass through IF and ID only; quick datapath instructions add the CMAGU, SAGU, VAGU, DF, OS, ALU and WB stages; normal datapath instructions additionally pass through MUL, ATREE1 and ATREE2. Each class has accumulate, VRF/SRF output and LVM output variants.]


Chapter 4

Transforms

The mathematical derivations for DFTs are drawn from [7], and those for mixed-radix FFTs from [8], [13], [14], [15], [17] and [18].

4.1 Definition of DFT

The DFT X(k), k = 0, 1, 2, ..., N-1, of a sequence x(n), n = 0, 1, 2, ..., N-1, is

X(k) = \sum_{n=0}^{N-1} x(n) \exp(-j 2\pi nk / N) = \sum_{n=0}^{N-1} x(n) W_N^{nk}    (4.1)

W_N^{nk} = \exp(-j 2\pi nk / N) = \cos(2\pi nk / N) - j \sin(2\pi nk / N)    (4.2)

where N is the number of data points, j = \sqrt{-1} and W_N^{nk} is the twiddle factor. Equation 4.1 is called the N-point DFT of the sequence x(n). For each value of k, X(k) represents the Fourier transform at the frequency 2\pi k / N.

From equation 4.1 it is clear that computing X(k) for each k requires N complex multiplications and N-1 complex additions. To calculate the entire DFT, N^2 complex multiplications and N(N-1) = N^2 - N \approx N^2 complex additions are required. Thus, calculating the DFT directly is very computationally intensive. Note that the multiplication of two complex numbers, (a + jb)(c + jd) = (ac - bd) + j(bc + ad), requires four real multiplications and two real additions. Similarly, a complex addition (a + jb) + (c + jd) = (a + c) + j(b + d) requires two real additions.


4.1.1 Radix-2 DIT FFT

Theory

The radix-2 DIT FFT algorithm breaks the DFT calculation into several 2-point DFTs and can be derived as below. Each of the sums P(k) and Q(k) in equation 4.4 is recognized as an N/2-point DFT. Although the index k ranges over N values, k = 0, 1, 2, ..., N-1, each of the sums needs to be computed only for k = 0, 1, 2, ..., N/2 - 1, since they are periodic with period N/2. In short, the radix-2 FFT can be represented as shown in figure 4.1.

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
     = \sum_{n=0}^{N/2-1} x(2n) W_N^{2nk} + \sum_{n=0}^{N/2-1} x(2n+1) W_N^{(2n+1)k}
     = \sum_{n=0}^{N/2-1} x(2n) W_N^{2nk} + W_N^k \sum_{n=0}^{N/2-1} x(2n+1) W_N^{2nk}
     = \sum_{n=0}^{N/2-1} x(2n) W_{N/2}^{nk} + W_N^k \sum_{n=0}^{N/2-1} x(2n+1) W_{N/2}^{nk}    (4.3)
     = P(k) + W_N^k Q(k)    (4.4)

where k = 0, 1, 2, ..., N-1, the twiddle factors are W_N^k and the radix-2 butterfly multipliers are W_{N/2}^{nk}.

[Figure 4.1. Radix-2 DIT Butterfly: inputs A = x(k) and B = x(k+N/2) are multiplied by Wa = 1 and Wb = exp(-j2\pi k / N), then combined with factors (1, 1) into Aout and (1, -1) into Bout, for k = 0, 1, ..., N/2 - 1.]

Aout = A * Wa * 1 + B * Wb * 1
Bout = A * Wa * 1 + B * Wb * (-1)

where Wa = 1 and Wb = W_br + j W_bi are the twiddle factors, and 1, -1 are the radix-2 butterfly multiply factors.

Since the radix-2 butterfly multiplication factors are always 1 and -1, these multiplications are achieved by adjusting the connections in the hardware. Therefore, only the twiddle factor multiplications need to be done. The Sleipnir core architecture supports two complete radix-2 butterfly operations in one cycle.


Algorithm implementation on ePUMA platform

To implement the radix-2 algorithm, call the cr2bf instruction with the appropriate input data and output location. The Sleipnir core is tuned for the radix-2 algorithm, so the user need not worry about the intricacies of the algorithm.
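For reference, the decimation of equations 4.3 and 4.4 maps onto the following recursive sketch. It is plain double-precision Python for a power-of-two N; the Sleipnir implementation instead uses the fixed-point cr2bf butterfly instruction.

```python
import cmath

def fft_radix2(x):
    """Radix-2 DIT FFT: X(k) = P(k) + W_N^k Q(k), equation 4.4.
    The lower half uses X(k+N/2) = P(k) - W_N^k Q(k), which follows
    from the symmetry W_N^(k+N/2) = -W_N^k."""
    N = len(x)
    if N == 1:
        return x[:]
    P = fft_radix2(x[0::2])      # N/2-point DFT of even samples
    Q = fft_radix2(x[1::2])      # N/2-point DFT of odd samples
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * Q[k]   # W_N^k Q(k)
        X[k] = P[k] + t              # butterfly factor +1
        X[k + N // 2] = P[k] - t     # butterfly factor -1
    return X

x = [complex(v) for v in [1, 2, 3, 4, 0, 0, 0, 0]]
X = fft_radix2(x)
```

Each loop iteration is one butterfly of figure 4.1; on Sleipnir, two such butterflies complete per cycle.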

4.1.2 Radix-3 DIT FFT

Theory

The radix-3 DIT FFT algorithm breaks the DFT calculation into several 3-point DFTs and can be derived as below. Each of the sums P(k), Q(k) and R(k) in equation 4.6 is recognized as an N/3-point DFT. Although the index k ranges over N values, k = 0, 1, 2, ..., N-1, each of the sums needs to be computed only for k = 0, 1, 2, ..., N/3 - 1, since they are periodic with period N/3. In short, the radix-3 FFT can be represented as shown in figure 4.2.

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
     = \sum_{n=0}^{N/3-1} x(3n) W_N^{3nk} + \sum_{n=0}^{N/3-1} x(3n+1) W_N^{(3n+1)k} + \sum_{n=0}^{N/3-1} x(3n+2) W_N^{(3n+2)k}
     = \sum_{n=0}^{N/3-1} x(3n) W_N^{3nk} + W_N^k \sum_{n=0}^{N/3-1} x(3n+1) W_N^{3nk} + W_N^{2k} \sum_{n=0}^{N/3-1} x(3n+2) W_N^{3nk}
     = \sum_{n=0}^{N/3-1} x(3n) W_{N/3}^{nk} + W_N^k \sum_{n=0}^{N/3-1} x(3n+1) W_{N/3}^{nk} + W_N^{2k} \sum_{n=0}^{N/3-1} x(3n+2) W_{N/3}^{nk}    (4.5)
     = P(k) + W_N^k Q(k) + W_N^{2k} R(k)    (4.6)

where k = 0, 1, 2, ..., N-1, the twiddle factors are W_N^k and W_N^{2k}, and the radix-3 butterfly multipliers are W_{N/3}^{nk}.

Aout = A * Wa * 1 + B * Wb * 1 + C * Wc * 1    (4.7)
Bout = A * Wa * 1 + B * Wb * (-0.5 - 0.866j) + C * Wc * (-0.5 + 0.866j)    (4.8)
Cout = A * Wa * 1 + B * Wb * (-0.5 + 0.866j) + C * Wc * (-0.5 - 0.866j)    (4.9)

where Wa = 1, Wb = W_br + j W_bi and Wc = W_cr + j W_ci are the twiddle factors, and (-0.5 - 0.866j), (-0.5 + 0.866j) are the radix-3 butterfly multiply factors.

So it can be concluded that 6 complex multiplications are required to calculate the Aout value. To reduce this to 3 complex multiplications, the radix-3 butterfly factors and the twiddle factors are premultiplied, and the resulting values are loaded as coefficients in memory. This method increases performance at the cost of increased memory usage.
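The premultiplication idea can be illustrated in a few lines. This is a double-precision Python sketch; on Sleipnir the premultiplied coefficients would be stored in the constant memory.

```python
import cmath

W3 = cmath.exp(-2j * cmath.pi / 3)   # -0.5 - 0.866j, as in equations 4.8-4.9

def radix3_butterfly(a, b, c, wa, wb, wc):
    """Radix-3 butterfly with premultiplied coefficients: the butterfly
    factors are folded into the twiddle factors, so each output costs
    3 complex multiplications instead of 6."""
    rows = [(1, 1, 1),
            (1, W3, W3.conjugate()),
            (1, W3.conjugate(), W3)]
    # r * w below are the premultiplied coefficients that would sit
    # in the Sleipnir constant memory.
    return [a * (r0 * wa) + b * (r1 * wb) + c * (r2 * wc)
            for (r0, r1, r2) in rows]

# With unit twiddles the butterfly reduces to a plain 3-point DFT.
A, B, C = radix3_butterfly(1, 1, 1, 1, 1, 1)
```

Since the coefficients are computed once per FFT size and reused across all input blocks, the extra memory buys a per-butterfly saving of three complex multiplications.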

[Figure 4.2. Radix-3 DIT Butterfly: inputs A = x(k), B = x(k+N/3) and C = x(k+2N/3) are multiplied by Wa = 1, Wb = exp(-j2\pi k / N) and Wc = exp(-j2\pi (2k) / N), then combined with factors (1, 1, 1) into Aout, (1, -0.5-j0.866, -0.5+j0.866) into Bout and (1, -0.5+j0.866, -0.5-j0.866) into Cout, for k = 0, 1, ..., N/3 - 1.]

Algorithm implementation on ePUMA platform

The radix-3 butterfly can be implemented in two different ways on the Sleipnir core. Figure 4.3 shows one method using the triangular complex multiply and accumulate (tcmacw) instruction, and figure 4.4 shows another method using the complex multiply and accumulate (cmacw) instruction. Figures 4.3 and 4.4 show the calculation of the Aout value; the remaining output values are calculated similarly.

[Figure 4.3. TCMACW method: the products A*Wa, B*Wb and C*Wc are read from memory and reduced by a single tcmacw instruction into Aout.]


[Figure 4.4. CMACW method: four butterflies are computed in parallel; the lane products A1*Wa1 ... A4*Wa4, B1*Wb1 ... B4*Wb4 and C1*Wc1 ... C4*Wc4 are accumulated by three cmacw instructions into A1out ... A4out.]

Method   | Cycles for single radix-3 | Cycles for 4 radix-3's
TCMACW   | 3                         | 12
CMACW    | 9                         | 9

Table 4.1. Comparison of Two Approaches

From table 4.1, the following inferences can be drawn. Use the CMACW method if the total number of radix-3 computations is a multiple of 4; this ensures the hardware is used effectively, resulting in lower power consumption for the whole algorithm. Use the TCMACW method if the total number of radix-3 computations is not a multiple of 4; in this method a significant amount of hardware is wasted. For the best performance, the two methods need to be combined efficiently. Note that these cycle counts do not take the pipeline into consideration; they are theoretical estimates of the number of cycles required to compute a radix-3 butterfly on the Sleipnir core.
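The selection rule above can be captured in a small helper that mixes the two methods. The counts are the theoretical estimates from table 4.1, ignoring the pipeline as noted; the function name is illustrative, not part of the ePUMA tools.

```python
def radix3_cycles(n_butterflies):
    """Theoretical cycle estimate combining both methods:
    CMACW handles groups of 4 radix-3 butterflies in 9 cycles,
    TCMACW handles the remainder at 3 cycles each (table 4.1)."""
    groups_of_four, rest = divmod(n_butterflies, 4)
    return groups_of_four * 9 + rest * 3
```

For example, 4 butterflies cost 9 cycles with pure CMACW, while 5 butterflies cost 12 cycles: one CMACW group plus one TCMACW butterfly.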


4.1.3 Radix-4 DIT FFT

Theory

The radix-4 DIT FFT algorithm breaks the DFT calculation into several 4-point DFTs and can be derived as below. Each of the sums P(k), Q(k), R(k) and S(k) in equation 4.11 is recognized as an N/4-point DFT. Although the index k ranges over N values, k = 0, 1, 2, ..., N-1, each of the sums needs to be computed only for k = 0, 1, 2, ..., N/4 - 1, since they are periodic with period N/4. In short, the radix-4 FFT can be represented as shown in figure 4.5.

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
     = \sum_{n=0}^{N/4-1} x(4n) W_N^{4nk} + \sum_{n=0}^{N/4-1} x(4n+1) W_N^{(4n+1)k} + \sum_{n=0}^{N/4-1} x(4n+2) W_N^{(4n+2)k} + \sum_{n=0}^{N/4-1} x(4n+3) W_N^{(4n+3)k}
     = \sum_{n=0}^{N/4-1} x(4n) W_N^{4nk} + W_N^k \sum_{n=0}^{N/4-1} x(4n+1) W_N^{4nk} + W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2) W_N^{4nk} + W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3) W_N^{4nk}
     = \sum_{n=0}^{N/4-1} x(4n) W_{N/4}^{nk} + W_N^k \sum_{n=0}^{N/4-1} x(4n+1) W_{N/4}^{nk} + W_N^{2k} \sum_{n=0}^{N/4-1} x(4n+2) W_{N/4}^{nk} + W_N^{3k} \sum_{n=0}^{N/4-1} x(4n+3) W_{N/4}^{nk}    (4.10)
     = P(k) + W_N^k Q(k) + W_N^{2k} R(k) + W_N^{3k} S(k)    (4.11)

where k = 0, 1, 2, ..., N-1, the twiddle factors are W_N^k, W_N^{2k} and W_N^{3k}, and the radix-4 butterfly multipliers are W_{N/4}^{nk}.

Aout = A * Wa * 1 + B * Wb * 1    + C * Wc * 1    + D * Wd * 1     (4.12)
Bout = A * Wa * 1 + B * Wb * (-j) + C * Wc * (-1) + D * Wd * (j)   (4.13)
Cout = A * Wa * 1 + B * Wb * (-1) + C * Wc * (1)  + D * Wd * (-1)  (4.14)
Dout = A * Wa * 1 + B * Wb * (j)  + C * Wc * (-1) + D * Wd * (-j)  (4.15)

where Wa = 1, Wb = W_br + j W_bi, Wc = W_cr + j W_ci and Wd = W_dr + j W_di are the twiddle factors, and 1, -1, j, -j are the radix-4 butterfly multiply factors.

From equations 4.12, 4.13, 4.14 and 4.15, it can be concluded that the radix-4 butterfly multiplication factors are 1, -1, j and -j. These multiplications are achieved by adjusting the connections in the hardware; therefore, only the twiddle factor multiplications need to be done. The Sleipnir core architecture supports one complete radix-4 butterfly operation in one cycle.

[Figure 4.5. Radix-4 DIT Butterfly: inputs A = x(k), B = x(k+N/4), C = x(k+2N/4), D = x(k+3N/4) are multiplied by Wa = 1, Wb = exp(-j2\pi k / N), Wc = exp(-j2\pi (2k) / N), Wd = exp(-j2\pi (3k) / N), then combined with factors (1, 1, 1, 1) into Aout, (1, -j, -1, j) into Bout, (1, -1, 1, -1) into Cout and (1, j, -1, -j) into Dout, for k = 0, 1, ..., N/4 - 1.]

Algorithm implementation on ePUMA platform

To implement the radix-4 algorithm, call the cr4bf instruction with the appropriate input data and output location. The Sleipnir core is designed for the radix-4 algorithm, so the user need not worry about the intricacies involved; the hardware handles them.
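The decomposition of equations 4.10 and 4.11 can be checked numerically. The sketch below (illustrative only, not the Sleipnir cr4bf implementation) builds one radix-4 DIT stage from four N/4-point DFTs and compares it with a full FFT:

```python
import numpy as np

# One radix-4 DIT decomposition step, following equation 4.11:
# X(k) = P(k) + W^k Q(k) + W^2k R(k) + W^3k S(k)
def radix4_stage(x):
    N = len(x)
    # N/4-point DFTs of the four decimated sub-sequences
    P = np.fft.fft(x[0::4])
    Q = np.fft.fft(x[1::4])
    R = np.fft.fft(x[2::4])
    S = np.fft.fft(x[3::4])
    k = np.arange(N)
    m = k % (N // 4)                      # the sums are periodic with period N/4
    W1 = np.exp(-2j * np.pi * k / N)      # twiddle factor W_N^k
    W2 = np.exp(-2j * np.pi * 2 * k / N)  # W_N^2k
    W3 = np.exp(-2j * np.pi * 3 * k / N)  # W_N^3k
    return P[m] + W1 * Q[m] + W2 * R[m] + W3 * S[m]

x = np.arange(16, dtype=complex)
assert np.allclose(radix4_stage(x), np.fft.fft(x))
```

The same pattern recursed to the sub-sequences yields the full radix-4 FFT.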

4.1.4 Radix-5 DIT FFT

Theory

The radix-5 DIT FFT algorithm breaks the DFT calculation into several 5-point DFTs. The radix-5 DIT FFT can be derived as below. Symmetry property and periodicity property were used in our derivations. Equation 4.16 represents the symmetry property and equation 4.17 represents periodicity property.


W_N^{(k + N/2)n} = -W_N^{nk}    (4.16)
W_N^{n(k + N)} = W_N^{nk}       (4.17)

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}
     = \sum_{n=0}^{N/5-1} x(5n) W_N^{5nk} + \sum_{n=0}^{N/5-1} x(5n+1) W_N^{(5n+1)k} + \sum_{n=0}^{N/5-1} x(5n+2) W_N^{(5n+2)k} + \sum_{n=0}^{N/5-1} x(5n+3) W_N^{(5n+3)k} + \sum_{n=0}^{N/5-1} x(5n+4) W_N^{(5n+4)k}
     = \sum_{n=0}^{N/5-1} x(5n) W_N^{5nk} + W_N^{k} \sum_{n=0}^{N/5-1} x(5n+1) W_N^{5nk} + W_N^{2k} \sum_{n=0}^{N/5-1} x(5n+2) W_N^{5nk} + W_N^{3k} \sum_{n=0}^{N/5-1} x(5n+3) W_N^{5nk} + W_N^{4k} \sum_{n=0}^{N/5-1} x(5n+4) W_N^{5nk}
     = \sum_{n=0}^{N/5-1} x(5n) W_{N/5}^{nk} + W_N^{k} \sum_{n=0}^{N/5-1} x(5n+1) W_{N/5}^{nk} + W_N^{2k} \sum_{n=0}^{N/5-1} x(5n+2) W_{N/5}^{nk} + W_N^{3k} \sum_{n=0}^{N/5-1} x(5n+3) W_{N/5}^{nk} + W_N^{4k} \sum_{n=0}^{N/5-1} x(5n+4) W_{N/5}^{nk}    (4.18)
     = P(k) + W_N^{k} Q(k) + W_N^{2k} R(k) + W_N^{3k} S(k) + W_N^{4k} T(k)    (4.19)

where k = 0, 1, 2, ..., N-1, the twiddle factors are W_N^{k}, W_N^{2k}, W_N^{3k}, W_N^{4k} and the radix-5 butterfly multipliers are W_{N/5}^{nk}.

Each of the sums P(k), Q(k), R(k), S(k) and T(k) in equation 4.18 is recognized as an N/5-point DFT. Although the index k ranges over N values, k = 0, 1, 2, ..., N-1, each of the sums needs to be computed only for k = 0, 1, 2, ..., N/5-1, since they are periodic with period N/5. In short, the radix-5 FFT can be represented as shown in figure 4.6.

From figure 4.6, the following information can be interpreted.

A_out = A W_a (1) + B W_b (1)       + C W_c (1)       + D W_d (1)       + E W_e (1)         (4.20)
B_out = A W_a (1) + B W_b (a - jb)  + C W_c (-c - jd) + D W_d (-c + jd) + E W_e (a + jb)
C_out = A W_a (1) + B W_b (-c - jd) + C W_c (a + jb)  + D W_d (a - jb)  + E W_e (-c + jd)
D_out = A W_a (1) + B W_b (-c + jd) + C W_c (a - jb)  + D W_d (a + jb)  + E W_e (-c - jd)
E_out = A W_a (1) + B W_b (a + jb)  + C W_c (-c + jd) + D W_d (-c - jd) + E W_e (a - jb)

where W_a = 1, W_b = W_{br} + jW_{bi}, W_c = W_{cr} + jW_{ci}, W_d = W_{dr} + jW_{di}, W_e = W_{er} + jW_{ei} are the twiddle factors and 1, (a+jb), (a-jb), (-c+jd), (-c-jd) are the radix-5 butterfly multiply factors.

Figure 4.6. Radix-5 DIT Butterfly (twiddle factors W_a = 1, W_b = e^{-j2πk/N}, W_c = e^{-j2π(2k)/N}, W_d = e^{-j2π(3k)/N}, W_e = e^{-j2π(4k)/N}, for k = 0, 1, ..., N/5-1; butterfly constants a = 0x278e, b = 0x79bc, c = 0x678e, d = 0x4b3d)

From equation 4.20, it can be concluded that 10 complex multiplications are needed to calculate one output value, i.e. A_out. In order to bring this down to 5 complex multiplications, the radix-5 butterfly factors and the twiddle factors had to be premultiplied and loaded as the coefficient values in the memory. This method increases the performance of the algorithm, but at the cost of an increase in memory space.

Algorithm implementation on ePUMA platform

The radix-5 butterfly can be implemented in two different ways on the Sleipnir core. Figure 4.7 shows one method, using a combination of the triangular complex multiply and accumulate (tcmacw) and complex multiply and accumulate (cmacw) instructions. Figure 4.8 shows the other method, using the complex multiply and accumulate (cmacw) instruction alone. Figures 4.7 and 4.8 show the calculation of the A_out value; the other values B_out, C_out, D_out and E_out are calculated similarly.

Figure 4.7. TCMACW Method (A·W_a, B·W_b, C·W_c and D·W_d are accumulated with TCMACW, then E·W_e with CMACW, producing A_out)

Figure 4.8. CMACW Method (four butterflies are computed in parallel: A1·W_a1 ... A4·W_a4 through E1·W_e1 ... E4·W_e4 are accumulated with successive CMACW instructions, producing A1_out ... A4_out)

Method    Cycles for single radix-5    Cycles for 4 radix-5's
TCMACW    10                           40
CMACW     25                           25

Table 4.2. Comparison of Two Approaches

From table 4.2, the following inferences can be drawn. Use the CMACW method of implementation if the total number of radix-5 computations is a multiple of 4; this method ensures the hardware is used effectively, resulting in lower power consumption for the whole algorithm. Use the TCMACW+CMACW method of implementation if the total number of radix-5 computations is not a multiple of 4; in this method, a significant amount of hardware is wasted in the second step. In order to get the best performance, the two methods need to be combined efficiently. Please note that the cycle counts do not take pipelining into consideration; they are theoretical estimates of the number of cycles required to calculate a radix-5 algorithm using the Sleipnir core.
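The selection rule implied by Table 4.2 can be written down directly. The sketch below uses the table's theoretical counts (pipeline effects ignored, as noted above): 10 cycles per single radix-5 butterfly with TCMACW, 25 cycles per group of 4 butterflies with CMACW:

```python
# Theoretical cycle estimate for n radix-5 butterflies, per Table 4.2:
# full groups of 4 go through CMACW, the remainder through TCMACW.
def radix5_cycles(n_butterflies):
    groups_of_4, rest = divmod(n_butterflies, 4)
    return groups_of_4 * 25 + rest * 10

assert radix5_cycles(4) == 25           # pure CMACW beats 4 x TCMACW (40)
assert radix5_cycles(6) == 25 + 2 * 10  # mix both methods for the remainder
```

For butterfly counts that are a multiple of 4 the CMACW-only path is always chosen, matching the inference drawn from the table.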

4.1.5 Mixed Radix FFT Algorithms

The number of input data points (the number of DFT points), N, is equal to the product of the FFT radixes, which can be defined as follows.

N = R(0) * R(1) * R(2) * R(3) * ... * R(M-1) = \prod_{s=0}^{M-1} R(s)    (4.21)

In equation 4.21, the parameters are defined as follows.

• s - stage number

• M - Total number of stages

• R(s) - Radix used for the particular stage
• N - Total number of DFT points

Mixed radix FFTs can be arbitrarily assigned in the sequence order. For exam-ple, the DFT of 300 points can be factored as 5 * 5 * 3 * 4 and can be calculated as a combination of either of the following FFT radixes:

• radix(5)*radix(5)*radix(3)*radix(4)
• radix(4)*radix(3)*radix(5)*radix(5)
• radix(5)*radix(5)*radix(4)*radix(3)
• radix(3)*radix(4)*radix(5)*radix(5)


Each stage of the FFT represents N/R(s) executions of an R(s)-point butterfly calculation. Table 4.3 shows the radixes used for each stage in the implementation of various DFT points.

DFT Points    Stages    Radix indexes
12            2         3*4
20            2         5*4
30            3         5*3*2
60            3         5*3*4
300           4         5*3*4*5
600           5         5*3*4*5*2
900           5         5*3*4*5*3
1200          5         5*3*4*5*4

Table 4.3. Mixed radix implementation for different DFT points
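The stage plans of Table 4.3 can be checked against equation 4.21 and the butterfly count rule (N/R(s) butterflies per stage) with a short sketch:

```python
from functools import reduce

# Stage plans from Table 4.3: each stage s of an N-point FFT performs
# N / R(s) radix-R(s) butterflies, and the radixes multiply to N.
plans = {12: [3, 4], 20: [5, 4], 30: [5, 3, 2], 60: [5, 3, 4],
         300: [5, 3, 4, 5], 600: [5, 3, 4, 5, 2],
         900: [5, 3, 4, 5, 3], 1200: [5, 3, 4, 5, 4]}

for n, radixes in plans.items():
    assert reduce(lambda a, b: a * b, radixes) == n   # equation 4.21

# e.g. the 12-point FFT: 4 radix-3 butterflies, then 3 radix-4 butterflies
assert [12 // r for r in plans[12]] == [4, 3]
```

These butterfly counts match the per-stage counts quoted for the 12-, 20- and 15-point examples below.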

12-Point FFT Implementation

An example of a 12-point mixed radix DIT FFT diagram is shown in figure 4.9 to illustrate an entire DIT algorithm. There are 2 stages to process the whole DFT computation. Radix-3 DIT butterflies were used in the first stage, resulting in 4 butterflies. The radix-4 DIT algorithm was used in the second stage, resulting in 3 butterflies.

Figure 4.9. 12-Point FFT

(39)


20-Point FFT Implementation

An example of a 20-point mixed radix DIT FFT diagram is shown in figure 4.10 to illustrate an entire DIT algorithm. There are 2 stages to process the whole DFT computation. Radix-5 DIT butterflies were used in the first stage, resulting in 4 butterflies. The radix-4 DIT algorithm was used in the second stage, resulting in 5 butterflies.

Figure 4.10. 20-Point FFT

15-Point FFT Implementation

Figure 4.11. 15-Point FFT

(40)

An example of a 15-point mixed radix DIT FFT diagram is shown in figure 4.11 to illustrate an entire DIT algorithm. There are 2 stages to process the whole DFT computation. Radix-5 DIT butterflies were used in the first stage, resulting in 3 butterflies. The radix-3 DIT algorithm was used in the second stage, resulting in 5 butterflies.

30-Point FFT Implementation

Figure 4.12. 30-Point FFT

An example of a 30-point mixed radix DIT FFT diagram is shown in figure 4.12. The 30-point DFT computation was divided into 3 stages. The 30-point DFT was split into two 15-point DFTs, and in the first two stages the DFTs of the two 15-point sequences were calculated. In the third stage, the radix-2 algorithm was used, resulting in 15 butterflies. Please refer to figure 4.11 for the calculation of the 15-point DFTs.

60-Point FFT Implementation

An example of a 60-point mixed radix DIT FFT diagram is shown in figure 4.13. The 60-point DFT computation was divided into 3 stages. The 60-point DFT was split into four 15-point DFTs, and in the first two stages the DFTs of the four 15-point sequences were calculated. In the third stage, the radix-4 algorithm was used, resulting in 15 butterflies. Please refer to figure 4.11 for the calculation of the 15-point DFTs.


Figure 4.13. 60-Point FFT

300-Point FFT Implementation

An example of a 300-point mixed radix DIT FFT diagram is shown in figure 4.14. The 300-point DFT computation was divided into 4 stages. The 300-point DFT was split into five 60-point DFTs, and in the first three stages the DFTs of the five 60-point sequences were calculated. In the fourth stage, the radix-5 algorithm was used, resulting in 60 butterflies. Please refer to figure 4.13 for the calculation of the 60-point DFTs.


Figure 4.14. 300-Point FFT

600-Point FFT Implementation

An example of a 600-point mixed radix DIT FFT diagram is shown in figure 4.15. The 600-point DFT computation was divided into 5 stages. The 600-point DFT was split into two 300-point DFTs, and in the first 4 stages the DFTs of the two 300-point sequences were calculated. In the fifth stage, the radix-2 algorithm was used, resulting in 300 butterflies. Please refer to figure 4.14 for the calculation of the 300-point DFTs.


Figure 4.15. 600-Point FFT

900-Point FFT Implementation

Figure 4.16. 900-Point FFT


An example of a 900-point mixed radix DIT FFT diagram is shown in figure 4.16. The 900-point DFT computation was divided into 5 stages. The 900-point DFT was split into three 300-point DFTs, and in the first 4 stages the DFTs of the three 300-point sequences were calculated. In the fifth stage, the radix-3 algorithm was used, resulting in 300 butterflies. Please refer to figure 4.14 for the calculation of the 300-point DFTs.

1200-Point FFT Implementation

Figure 4.17. 1200-Point FFT

An example of a 1200-point mixed radix DIT FFT diagram is shown in figure 4.17. The 1200-point DFT computation was divided into 5 stages. The 1200-point DFT was split into four 300-point DFTs, and in the first 4 stages the DFTs of the four 300-point sequences were calculated. In the fifth stage, the radix-4 algorithm was used, resulting in 300 butterflies. Please refer to figure 4.14 for the calculation of the 300-point DFTs.



4.2 Definition of DCT

The discrete cosine transform (DCT) is a real transform that transforms a sequence of real data points into a sum of cosine functions operating at different frequencies. Applications include speech information compression, JPEG, MPEG1 and MPEG2 encoders, and image and pattern recognition. In an image processing application, the DCT transform of an image produces a set of numbers called coefficients. A coefficient's usefulness is determined by its variance over a set of images: if a coefficient has a lot of variance over a set of images, then it cannot be removed without affecting the picture quality. Figure 4.18 gives the behavioral perspective of the DCT. The mathematical derivations for the DCT were referred from [7], [11] and [16].

Figure 4.18. DCT Function (an 8x8 block is transformed into frequency coefficients)

4.2.1 1-D DCT

The 1-D DCT of a sequence of length N is defined as

C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left(\frac{(2x+1)u\pi}{2N}\right)    (4.22)

where u = 0, 1, 2, 3, ..., N-1 and

\alpha(u) = \sqrt{1/N} if u = 0, \sqrt{2/N} if u \neq 0
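Equation 4.22 can be collected into a matrix T[u, x] = α(u) cos((2x+1)uπ/2N), which is the DCT matrix used in the benchmark below. A minimal NumPy sketch (illustrative, not the Sleipnir fixed-point code) builds it and checks its key property, orthonormality, which is what makes the inverse DCT a simple transpose:

```python
import numpy as np

# DCT matrix per equation 4.22: T[u, x] = alpha(u) * cos((2x+1) u pi / 2N)
def dct_matrix(N):
    u = np.arange(N).reshape(-1, 1)
    x = np.arange(N).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return alpha * np.cos((2 * x + 1) * u * np.pi / (2 * N))

T = dct_matrix(8)
# Rows are orthonormal, so T @ T.T is the identity and T^{-1} = T.T
assert np.allclose(T @ T.T, np.eye(8))
```

For N = 8 this reproduces (to four decimals) the matrix T shown in the worked example of section 4.2.3.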

4.2.2 2-D DCT


C(u, v) = \alpha(u)\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\left(\frac{(2x+1)u\pi}{2N}\right) \cos\left(\frac{(2y+1)v\pi}{2N}\right)    (4.23)

where u, v = 0, 1, 2, 3, ..., N-1, \alpha(u) = \sqrt{1/N} if u = 0 and \sqrt{2/N} otherwise, and likewise for \alpha(v).

Figure 4.19 gives the example of how the 2-D DCT is utilized in the field of image processing.

Figure 4.19. DCT Concept in Image processing (the picture is divided into 16x16 macroblocks; each macroblock is 16 pixels by 16 lines, i.e. 4 blocks of 8 pixels by 8 lines each)

4.2.3 DCT Benchmark

The input is a two-dimensional data array, stored in memory row by row. To implement the two-dimensional DCT, first apply the one-dimensional DCT on the rows, then on the columns (figure 4.20). The output is also an array, stored in memory row by row. Benchmarking was done for 8x8 and 16x16 inputs. The method followed is matrix multiplication.



Figure 4.20. 8x8 DCT Procedure (input data, DCT on rows, DCT on columns, final result)

Calculation of 2D-DCT using matrix multiplication consists of the following steps.

1. Calculate the DCT matrix for the DCT along the rows. The DCT matrix has the entries \alpha(u)\cos((2x+1)u\pi/2N).

2. Calculate the transpose of the DCT matrix for the DCT along the columns, with the entries \alpha(v)\cos((2y+1)v\pi/2N).

3. Calculate the product of DCT matrix and input matrix.

4. Calculate the product of resultant matrix from step 3 and the transpose of DCT matrix calculated in step 2. The output matrix is the final resultant matrix.

Mathematically, the 2D-DCT matrix multiplication is as shown below:

D = T M T'    (4.24)

where D is the resultant output matrix, M is the input matrix, T is the DCT matrix for the DCT calculation along the rows and T' is its transpose for the DCT calculation along the columns.
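The four steps above reduce to two matrix multiplications. A minimal NumPy sketch of equation 4.24 (illustrative, in floating point rather than the Sleipnir fixed-point arithmetic) is:

```python
import numpy as np

# 2-D DCT of an NxN block via equation 4.24: D = T M T'
def dct2(M):
    N = M.shape[0]
    u = np.arange(N).reshape(-1, 1)
    x = np.arange(N).reshape(1, -1)
    alpha = np.where(u == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    T = alpha * np.cos((2 * x + 1) * u * np.pi / (2 * N))
    return T @ M @ T.T    # rows first, then columns

M = np.full((8, 8), 100.0)   # a flat block: all energy lands in the DC term
D = dct2(M)
assert np.isclose(D[0, 0], 800.0)          # DC coefficient = (1/8) * sum
assert np.allclose(D.flatten()[1:], 0.0)   # all AC coefficients vanish
```

Running the same function on the sample 8x8 image block of the next subsection reproduces the matrix D shown there (to the printed precision).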

DCT on 8x8 block

This section shows an example of a DCT 8x8 block calculation. The sample image block is shown below.


Input =
[  26   -5   -5   -5   -5   -5   -5    8
   64   52    8   26   26   26    8  -18
  126   70   26   26   52   52   -5   -5
  111   52    8   52   52   38   -5   -5
   52   26    8   39   38   21    8    8
    0    8   -5    8   26   52   70   26
   -5  -23  -18   21    8    8   52   38
  -18    8   -5   -5   -5    8   26    8 ]

Calculating T = \alpha(u)\cos((2x+1)u\pi/2N) gives

T =
[ .3536  .3536  .3536  .3536  .3536  .3536  .3536  .3536
  .4904  .4157  .2778  .0975 -.0975 -.2778 -.4157 -.4904
  .4619  .1913 -.1913 -.4619 -.4619 -.1913  .1913  .4619
  .4157 -.0975 -.4904 -.2778  .2778  .4904  .0975 -.4157
  .3536 -.3536 -.3536  .3536  .3536 -.3536 -.3536  .3536
  .2778 -.4904  .0975  .4157 -.4157 -.0975  .4904 -.2778
  .1913 -.4619  .4619 -.1913 -.1913  .4619 -.4619  .1913
  .0975 -.2778  .4157 -.4904  .4904 -.4157  .2778 -.0975 ]

Calculating T' = \alpha(v)\cos((2y+1)v\pi/2N) gives

T' =
[ .3536  .4904  .4619  .4157  .3536  .2778  .1913  .0975
  .3536  .4157  .1913 -.0975 -.3536 -.4904 -.4619 -.2778
  .3536  .2778 -.1913 -.4904 -.3536  .0975  .4619  .4157
  .3536  .0975 -.4619 -.2778  .3536  .4157 -.1913 -.4904
  .3536 -.0975 -.4619  .2778  .3536 -.4157 -.1913  .4904
  .3536 -.2778 -.1913  .4904 -.3536 -.0975  .4619 -.4157
  .3536 -.4157  .1913  .0975 -.3536  .4904 -.4619  .2778
  .3536 -.4904  .4619 -.4157  .3536 -.2778  .1913 -.0975 ]

Applying equation 4.24 gives

D =
[ 162.3   40.6   20.0   72.3   30.3   12.5  -19.7  -11.5
   30.5  108.4   10.5   32.3   27.7  -15.5   18.4   -2.0
  -94.1  -60.1   12.3  -43.4  -31.3    6.1   -3.3    7.1
  -38.6  -83.4   -5.4  -22.2  -13.5   15.5   -1.3    3.5
  -31.3   17.9   -5.5  -12.4   14.3   -6.0   11.5   -6.0
   -0.9  -11.8   12.8    0.2   28.1   12.6    8.4    2.9
    4.6   -2.4   12.2    6.6  -18.7  -12.8    7.7   12.0
  -10.0   11.2    7.8  -16.3   21.5    0.0    5.9   10.7 ]



4.3 Discrete Wavelet Transforms (DWT)

The Fourier transform has a major drawback: it gives the frequencies present in a signal, but cannot detect when and where those frequencies occur. This initiated the development of the wavelet transform. A wavelet is, as the name might suggest, a little piece of a wave, which exists in a finite domain and is zero elsewhere. A wavelet transform involves convolving the signal against particular instances of the wavelet at different time scales and positions. Major applications of the DWT include image compression, video compression and audio compression. The material in this section on the discrete wavelet transform was referred from [9] and [10].

4.3.1 DWT Benchmark

The Haar transform is the simplest of the discrete wavelet transforms. In a one-dimensional transform, a 2-element vector X = (X(1), X(2))^T can be transformed into Y = (Y(1), Y(2))^T using equation 4.25.

Y = CX    (4.25)

where C = [0.707 0.707; 0.707 -0.707], Y = [Y(1); Y(2)], X = [X(1); X(2)].

Thus Y(1) and Y(2) are simply the sum and difference of X(1) and X(2) and then scaled by 0.707. Note that C is an orthonormal matrix because its rows are orthogonal to each other (their dot products are zero) and they are normalised to unit magnitude. In this case C is symmetric so CT = C.

For an input of a two-dimensional data array, stored in memory row by row, first apply the one-dimensional DWT on the rows, then on the columns (figure 4.21). The output is also an array, stored in memory row by row. The Haar transform equation is shown as 4.26.

Y = C X C^T    (4.26)

where C = C^T = [0.707 0.707; 0.707 -0.707], Y = [M N; O P], X = [A B; C D].

After the first step of multiplication,

Y_int = CX = 0.707 [ A+C  B+D
                     A-C  B-D ]

After the second step, Y = Y_int C^T:

[ M  N    = 0.5 [ A+C+B+D  A+C-B-D
  O  P ]          A+B-C-D  A+D-B-C ]

(50)

These operations correspond to the following filtering processes:

• M = A+B+C+D = 4-point average, or 2-D lowpass (Lo-Lo) filter.
• N = A+C-B-D = average horizontal gradient, or horizontal highpass and vertical lowpass (Hi-Lo) filter.
• O = A+B-C-D = average vertical gradient, or horizontal lowpass and vertical highpass (Lo-Hi) filter.
• P = A+D-B-C = diagonal curvature, or 2-D highpass (Hi-Hi) filter.
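The 2x2 Haar step and its four filter interpretations can be verified with a short sketch (illustrative floating point, using the exact sqrt(1/2) value behind the 0.707 entries):

```python
import numpy as np

c = np.sqrt(0.5)                  # exact value of the 0.707 matrix entries
C = np.array([[c, c], [c, -c]])   # orthonormal Haar matrix, C^T = C

A, B, Cv, D = 1.0, 2.0, 3.0, 4.0  # sample 2x2 block X = [A B; Cv D]
X = np.array([[A, B], [Cv, D]])
M, N_, O, P = (C @ X @ C.T).ravel()   # equation 4.26, Y = C X C^T

assert np.isclose(M, 0.5 * (A + B + Cv + D))   # Lo-Lo: 4-point average
assert np.isclose(N_, 0.5 * (A + Cv - B - D))  # Hi-Lo: horizontal gradient
assert np.isclose(O, 0.5 * (A + B - Cv - D))   # Lo-Hi: vertical gradient
assert np.isclose(P, 0.5 * (A + D - B - Cv))   # Hi-Hi: diagonal curvature
```

The 8x8 benchmark below is exactly this step applied blockwise through the larger block-diagonal matrix C.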

DWT on 8x8 block

This section shows an example of a DWT 8x8 block calculation. Figure 4.21 shows the 8x8 DWT procedure.

Figure 4.21. 8x8 DWT Procedure (input data, DWT on rows, DWT on columns, final result)

The sample image block and the 8x8 DWT calculation are as follows.

X =
[  5.5000   6.0000   6.5000   7.0000   7.5000   8.0000   8.5000   9.0000
  10.5000  11.0000  11.5000  12.0000  12.5000  13.0000  13.5000  14.0000
  15.5000  16.0000  16.5000  17.0000  17.5000  18.0000  18.5000  19.0000
  20.5000  21.0000  21.5000  22.0000  22.5000  23.0000  23.5000  24.0000
  25.5000  26.0000  26.5000  27.0000  27.5000  28.0000  28.5000  29.0000
  30.5000  31.0000  31.5000  32.0000  32.5000  33.0000  33.5000  34.0000
  35.5000  36.0000  36.5000  37.0000  37.5000  38.0000  38.5000  39.0000
  40.5000  41.0000  41.5000  42.0000  42.5000  43.0000  43.5000  44.0000 ]


C = C^T =
[ 0.7071   0.7071   0        0        0        0        0        0
  0.7071  -0.7071   0        0        0        0        0        0
  0        0        0.7071   0.7071   0        0        0        0
  0        0        0.7071  -0.7071   0        0        0        0
  0        0        0        0        0.7071   0.7071   0        0
  0        0        0        0        0.7071  -0.7071   0        0
  0        0        0        0        0        0        0.7071   0.7071
  0        0        0        0        0        0        0.7071  -0.7071 ]

After the first step of multiplication, Y_int = CX =

[ 11.3137  12.0208  12.7279  13.4350  14.1421  14.8492  15.5563  16.2635
  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355
  25.4558  26.1630  26.8701  27.5772  28.2843  28.9914  29.6985  30.4056
  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355
  39.5980  40.3051  41.0122  41.7193  42.4264  43.1335  43.8406  44.5477
  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355
  53.7401  54.4472  55.1543  55.8614  56.5685  57.2756  57.9828  58.6899
  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355  -3.5355 ]

After the second step, Y = Y_int C^T =

[ 16.5000  -0.5000  18.5000  -0.5000  20.5000  -0.5000  22.5000  -0.5000
  -5.0000   0       -5.0000   0       -5.0000   0       -5.0000   0
  36.5000  -0.5000  38.5000  -0.5000  40.5000  -0.5000  42.5000  -0.5000
  -5.0000   0       -5.0000   0       -5.0000   0       -5.0000   0
  56.5000  -0.5000  58.5000  -0.5000  60.5000  -0.5000  62.5000  -0.5000
  -5.0000   0       -5.0000   0       -5.0000   0       -5.0000   0
  76.5000  -0.5000  78.5000  -0.5000  80.5000  -0.5000  82.5000  -0.5000
  -5.0000   0       -5.0000   0       -5.0000   0       -5.0000   0 ]

4.4 Finite Impulse Response Filter (FIR)

A finite impulse response (FIR) filter is a type of signal processing filter whose impulse response (or response to any finite-length input) is of finite duration, because it settles to zero in finite time: if you send an impulse, that is, a single "1" sample followed by many "0" samples, zeroes will come out after the "1" sample has made its way through the delay line of the filter. An FIR filter produces an output, y(n), that is the weighted sum of the current and past inputs, x(n) (equation 4.28). This is shown in figure 4.22, representing a 12-tap FIR filter, with REG representing a unit delay. The impulse response of an FIR filter is simply its set of coefficients: if you give an impulse, the output of the filter will be the set of coefficients, as the single "1" sample moves past each coefficient in turn to form the output.

Figure 4.22. 12-Tap FIR Filter (a delay line of REG elements feeds coefficients b0 ... b12, whose products are summed into y(n))

An FIR tap is simply a coefficient-delay pair. The number of taps is an indication of 1) the amount of memory required to implement the filter, 2) the number of calculations required and 3) the amount of filtering the filter can do. In effect, more taps mean more stopband attenuation, less ripple and a narrower filter. The details on FIR filters were referred from [17] and [18].

y(n) = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} + b_3 x_{n-3} + ... + b_N x_{n-N}    (4.27)
     = \sum_{j=0}^{N} b_j x_{n-j}    (4.28)

where

• x(n) is the input signal
• y(n) is the output signal

• bj known as the filter coefficients, also known as the tap weights, that make up the impulse response

• N is the filter order; an Nth-order filter will have N+1 terms on the right-hand side. The delayed samples x_{n-j} are referred to as taps.

4.4.1 Properties of a FIR filter

FIR filters have a number of useful properties.

• No feedback required - Rounding errors are not compounded by summed iterations; the same relative error occurs in each calculation. This also makes implementation simpler.

• High stability - Since there is no feedback, all the poles are located at the origin, within the unit circle.

• Linear phase - The filter can be designed to have linear phase by making the coefficients symmetric. Linear phase corresponds to equal delay at all frequencies.



4.4.2 Implementation of 12-TAP FIR filter on Sleipnir

Steps involved in implementation of the 12-TAP FIR filter.

1. Load the input data in LVM memory 1.

2. Load the coefficient values in LVM memory 2. The coefficients can be calculated using the matlab firdesign function for a 12-tap filter.

3. Iterate step 4 for 64 iterations. This is required to calculate a 12-tap FIR filter for 512 input data samples.

4. A 12-tap filter has 13 stages of multiply and accumulate operations, so a single MAC instruction represents one stage of multiply and accumulate. The MAC instruction operates on a vector of inputs and a vector of coefficients. Implement 13 steps of the MAC operation for one iteration.
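The computation performed by the 13 MAC stages can be sketched as a direct-form FIR in NumPy (illustrative only, not the vectorized Sleipnir MAC code), following equation 4.28:

```python
import numpy as np

# Direct-form FIR: y[n] = sum_j b[j] * x[n-j], one MAC per tap per sample
def fir_filter(x, b):
    y = np.zeros(len(x))
    for n in range(len(x)):
        for j in range(len(b)):
            if n - j >= 0:
                y[n] += b[j] * x[n - j]
    return y

# A 12-tap filter has 13 coefficients b0..b12. Feeding an impulse returns
# the coefficient set itself, as stated in section 4.4.
b = np.linspace(1.0, 13.0, 13)   # placeholder coefficients for illustration
x = np.zeros(32)
x[0] = 1.0
assert np.allclose(fir_filter(x, b)[:13], b)
```

The Sleipnir version replaces the inner loop with 13 vector MAC instructions, each processing a whole vector of input samples at once.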

4.5 Modularization of Assembly code

DSP algorithms implemented in the current manner consume more memory space, which means only fewer algorithms can be supported on the Sleipnir core. In order to support more algorithms on the Sleipnir core, the assembly code implementation needs to be done in a more modular manner. Modularization of the assembly code also has the following advantages.

1. Decrease in Code Size, which means less storage space requirement and more features can be supported.

2. Easier to debug the code.

3. Easier to extend the existing code base to support further DSP algorithms.

4.6 Modularization Methods

1. Implement the most repetitive code as subroutines. For example, in case of mixed radix FFT’s, radix-3 and radix-5 computing code can be implemented as subroutines.

2. Use the repeatr instruction instead of the repeat instruction in subroutines. This ensures the loop count required for iteration in the subroutine can be passed through a register; hence configurability of the subroutines can be achieved.

3. Address generation for fetching the data and coefficients from LVM memory and writing the data to the LVM memory needs to be modified.


References
