Implementation study of radar signal processing Using SIMD architectures

(1)

Technical Report, IDE0608, January 2006

IMPLEMENTATION STUDY OF RADAR SIGNAL PROCESSING

USING SIMD ARCHITECTURES

Master Thesis in Electrical Engineering Mikael Ekstr¨om and Martin Westerberg

School of Information Science, Computer and Electrical Engineering Halmstad University

(2)

(3)

Implementation study of radar signal processing Using SIMD architectures

Master Thesis in Electrical Engineering

School of Information Science, Computer and Electrical Engineering Halmstad University

Box 823, S-301 18 Halmstad, Sweden

January 2006

c 2006

(4)

(5)

Preface

This Master thesis in Electrical Engineering has been conducted at the School of Infor- mation Science, Computer and Electrical Engineering at Halmstad University. In this project the CSX600 processor developed by Clearspeed Technologies has been used, we would like to thank Clearspeed Technologies for supplying development tools and support. We would also like to thank our supervisor at Halmstad University, Profssor Bertil Svensson, for guidance and support throughout this thesis and Anders ˚Ahlander for ad- vising us regarding radar technology.

Mikael Ekstr¨om Martin Westerberg

Halmstad University, 20th January 2006

(6)

(7)

Abstract

The aim of this project was to evaluate the use of SIMD array architectures in radar signal processing. This has been done by implementing one of the most demanding parts of the radar signal processing chain for airborne radar on the CSX600 architecture developed by Clearspeed Technologies. The CSX600 architecture is a SIMD processor with 96 processing elements which can be arranged either as a linera array or as a ring. The QR- decomposition, which was the part chosen for implementation, is the most performance demanding part of the STAP stage. In order to create a relevant test case the well known RT STAP benchmark from Mitre Corporation has been used. Two different algorithms for performing QR-decompositions have been implemented and verified. In both cases it has been concluded that either longer (> ≈256) or shorter (< ≈32) processor array lengths would, in general, yield a higher utilization ratio. The FLOP count and utilization has been measured for both algorithms, and it has been concluded that at least eight CSX600 processors are needed to meet the real-time demand of the benchmark.

(8)

(9)

Abbreviations

ALU Arithmetic Logic Unit

ASIC Application Specific Integrated Circuit CFAR Constant False Alarm Ratio

FFT Fast Fourier Transform FLOP Floating Point Operation FPU Floating Point Unit

GFLOPS Giga Floating Point Operations per Second MAC Multiply and Accumulate Unit

MFLOP Mega Floating Point Operations

MIMD Multiple Instruction streams Multiple Data streams PE Processing Element

QRD QR-Decomposition

RISC Reduced Instruction Set Computer SDK Software Development Kit

SIMD Single Instruction stream Multiple Data streams STAP Space Time Adaptive Processing

TFLOPS Tera Floating Point Operations per Second

(10)

(11)

LIST OF FIGURES

List of Figures

1.1 An illustration describing the growth in data as phased array antenna radar systems were introduced. . . . 1 2.1 A simplified illustration of the radar signal processing chain. . . . 3 2.2 A graphical illustration of the transformations performed by the beam-

forming calculations. . . . 5 2.3 A graphical illustration of the input data cube. The matrices X is used in

the weight calculations. . . . . 5 2.4 A graphical illustration of a general QR-decomposition. . . . 6 2.5 A working annihilation scheme for Givens QR factorization of a 4 × 3-matrix. 8 3.1 An overview of the CSX600 processor architecture. . . . 9 3.2 An illustration of a CSX600 processing element. . . . 10 5.1 An illustration of the annihilation scheme used in the implementation of the

Givens QRD method. . . . . 17 5.2 An illustration of the first stages of the Givens QRD implementation. . . . . 18 5.3 An illustration of the intermediate section of the Givens QRD implementation. 18 5.4 An illustration of the stop sequence of the Givens QRD implementation. . . 19 5.5 An illustration of the Householder algorithm in progress. . . . 20 5.6 The figure shows how complex addition and complex multiplication instruc-

tions are distributed over the processor array. . . . 22 5.7 The graph shows the utilization ratio when three different matrix widths are

used. . . . 22 5.8 The figure shows the distribution of the dominating operations. . . . 23 5.9 The graph shows the utilization ratio when three different matrix sizes are

used. . . . 24 5.10 An illustration of the rejected Givens method. . . . 25 6.1 An illustration of the processor array utilization as a function of the matrix

width. . . . 28

(12)

(13)

CONTENTS

Contents

1 Introduction 1

1.1 Objectives . . . . 2

1.1.1 Algorithm mapping . . . . 2

1.1.2 System scalability . . . . 2

1.1.3 Engineering efficiency . . . . 2

2 Radar 3 2.1 Signal processing chain . . . . 3

2.1.1 Beam-forming . . . . 3

2.1.2 Pulse Compression . . . . 3

2.1.3 Doppler filter . . . . 4

2.1.4 Envelope detection . . . . 4

2.1.5 Constant False Alarm Ratio . . . . 4

2.2 STAP . . . . 4

2.2.1 Adaptive beam-forming . . . . 4

2.2.2 Weight calculations . . . . 5

2.3 QR-Decomposition . . . . 6

2.3.1 Householder QR factorizations . . . . 6

2.3.2 Gram-Schmidt QR factorization . . . . 7

2.3.3 Givens QR factorization . . . . 7

2.4 Mitre RT STAP . . . . 8

3 The Clearspeed CSX600 architecture 9 3.1 The mono execution unit . . . . 9

3.2 The poly execution unit . . . . 9

3.3 Development tools . . . . 10

4 Related work 13 4.1 Radar research . . . . 13

4.2 Parallel architectural development . . . . 14

(14)

5 QRD Implementation 15

5.1 Implementation of complex number arithmetic . . . . 15

5.2 Implementation of Givens QRD method . . . . 16

5.2.1 Implemented algorithm . . . . 16

5.2.2 Givens QRD implementation explained . . . . 17

5.3 Implementation of Householders QRD method . . . . 19

5.4 Cycle count estimation method . . . . 21

5.5 Result of Givens QRD . . . . 22

5.6 Result of Householders QRD . . . . 23

5.7 Discussion of results . . . . 24

5.8 Rejected QRD methods . . . . 25

6 Conclusions 27 6.1 Mapping of QRD algorithms . . . . 27

6.2 Scalability . . . . 27

6.3 Engineering efficiency . . . . 28

6.4 Array length and utilization . . . . 28

7 Discussion 31 7.1 QR-decomposition algorithms . . . . 31

7.2 Processor architecture . . . . 31

7.3 Future work . . . . 32

References 33 Appendix A Source Code 35 A.1 Givens rotations based algorithm . . . . 35

A.2 Householder reflections based algorithm . . . . 38

(15)

CHAPTER 1. INTRODUCTION

1 Introduction

The aim of this project is to evaluate the use of highly parallel computer architectures in radar signal processing. This is done using an architecture developed by Clearspeed Technologies, the CSX600 processor which has been identified as interesting for radar signal processing by earlier theses at Halmstad University [1][2]. The processor has 96 processing elements arranged in a Single Instruction stream Multiple Data streams (SIMD) array, controlled by a mono execution unit that uses a common Reduced Instruction Set Computer (RISC) instruction set. More detailed information will be given later (Chapter 3).

The modern generation radar systems use phased array antennas. The antenna beam- direction is electronically controlled without any moving parts. This is done by using hundreds of small transmitting and receiving elements. The beam direction is controlled by changing the relative phase of the signal. When using this technique an area is scanned faster than with traditional radar systems. Earlier generation radar systems generated 2-dimensional input data, the time and range dimension. Using the phased array antenna the amount of input data grows one dimension, namely the channel dimension, see Figure 1.1. This new dimension significantly increases the need for processing power; it could reach levels that are in the Tera Floating Point Operations per Second (TFLOPS) area [3]. In early phased array antenna systems this problem has been solved using Application Specific Integrated Circuit (ASIC).

Figure 1.1: An illustration describing the growth in data as phased array antenna radar systems were introduced.

Since the first microprocessor saw the light of day, the processing power of processors has been growing exponentially. In recent time the performance has reached levels that make them useful in high performance signal processing. Processors are used instead of other high performance calculation units like ASIC for several reasons. Time to market is decreased because of higher levels of abstraction in software development compared to hardware development. Another reason is economic security, if an error is made in the development process, the cost of correcting it will be considerably lower when using processors, because no hardware has to be exchanged in most cases. Changing the func- tionality of a system at a low cost without changing the hardware can also be used to upgrade the system, adapting it to changes in forthcoming systems etc.

(16)

1.1 Objectives

The objective of this thesis is to implement the most computational demanding part of the radar signal processing chain using the CSX600 processor. The result of this implementation will make it possible to evaluate whether or not this type of architecture is suitable for radar applications. The aspects that will be considered in this evaluation are efficiency of algorithm mappings, system scalability and engineering efficiency.

1.1.1 Algorithm mapping

The algorithm that will be investigated is the QR-Decomposition (QRD), which is the most computationally intense part of the signal processing chain for airborne radar. Dif- ferent ways for computing the QRD will be evaluated and the most promising candidates will be implemented. The performance of the different implemented algorithms will be measured. To be able to put the results into context, a widely used benchmarking case needs to be implemented.

1.1.2 System scalability

Scalability is an issue concerning both hardware and software. An example is when a radar system is to be updated with a new larger antenna. The new antenna will generate larger amounts of data; in turn this increases the demand for processing power. Ideally this problem would be solved by adding a number of processors and recompiling the program only changing a few parameters. This level of scalability is virtually impossible to achieve, but the higher the level of scalability, the lesser resources have to be spent for developing the next generation of the system.

The scalability issues that will be considered in this thesis are how the algorithms utilize the added computing power and how the I/O and interconnection network are affected, as the number of processors increases.

1.1.3 Engineering efficiency

The engineering efficiency issues that will be considered in this thesis are how easy the tools are to work with, what functionalities are provided by the manufacturer and how easy it is to migrate software from other platforms. These issues are quite abstract;

therefore it is hard to make concrete measurements.

(17)

CHAPTER 2. RADAR

2 Radar

This chapter will treat the radar aspects important for this project. It will give a brief explanation of the radar signal processing chain. This thesis mainly concerns the Space Time Adaptive Processing (STAP) stage, which is used only in airborne radar systems.

Therefore that part will be explained in greater detail with the main focus on QRD.

2.1 Signal processing chain

A typical radar signal processing chain has five main stages which are illustrated in Figure 2.1. Each stage is briefly presented in this section.

Figure 2.1: A simplified illustration of the radar signal processing chain.

2.1.1 Beam-forming

When using array antennas, beam-forming, which can be done statically or adaptively, is needed to be able to ”look” in different directions. Ground based radar systems often use static beam-forming while adaptive beam-forming is used in airborne radar. When using adaptive beam-forming, the beam-direction is calculated in real time, which provides the possibility to suppress certain directions, thus jamming signals and other noise sources can be avoided. This process is called Space Time Adaptive Processing (STAP), for more detailed information see Section 2.2.

2.1.2 Pulse Compression

When sending a pulse, the ideal scenario is that the pulse is short with high amplitude, to attain good range resolution. In reality, the amplitude is limited by the transmitter hardware. Therefore the pulse energy has to be distributed over time, but then the range resolution becomes poorer. To compensate for this, the signal pulse is modulated and the

(18)

receiver uses a compression filter matched to the modulation, which makes it possible to separate objects with overlapping echoes.

2.1.3 Doppler filter

The Doppler filter is introduced to improve the signal-to-noise ratio and to measure the object’s speed relative to the radar. By using Fast Fourier Transform (FFT), the Doppler frequency can be extracted, and used to estimate the object’s speed.

2.1.4 Envelope detection

After the Doppler filter stage, the phase information of the signal is no longer needed.

Envelope detection is introduced to remove this information, which is done by a simple absolute value calculation.

2.1.5 Constant False Alarm Ratio

Object detection is in principle done by comparing the input signal to a threshold value.

The Constant False Alarm Ratio (CFAR) calculates this threshold value by averaging the surrounding noise level.

2.2 STAP

To utilize the potential of the phased array antenna in airborne radar system STAP is introduced, which is a part of the radar signal processing chain that can be seen as a beam-former to suppress ground noise and jamming signals.

2.2.1 Adaptive beam-forming

Adaptive beam-forming is performed by amplifying the directions differently based on previous information. Every range for every pulse is represented by a vector containing data from all antenna channels. These vectors are multiplied by a weight matrix thus making it possible to suppress or amplify certain ranges and directions. This operation is described mathematically in (2.1). The vector containing antenna channels is represented by x, the weight matrix is represented by W and the resulting lobe vector is represented by y.

y = Wx ⇒





 y1

... y_n





=







W11 . . . W1m

... . .. ... W_n1 . . . W_nm











 x1

... x_n





 (2.1)

In order for adaptive beam-forming to work, this operation needs to be applied to all input data, which will result in data being transformed into the desired lobes. The transformation is graphically described in Figure 2.2.

(19)

CHAPTER 2. RADAR

Figure 2.2: A graphical illustration of the transformations performed by the beam-forming calculations.

2.2.2 Weight calculations

To calculate the weight matrix W Equation (2.2) is used, where wm is a row vector of W, s_m is the pointing vector for direction m and C is the covariance matrix of the received signal.

w_m = C⁻¹s_m

s^H_mC⁻¹s_m (2.2)

The covariance matrix C is defined by Equation (2.3) where X is the received signal, see Figure 2.3.

C = X^HX (2.3)

Figure 2.3: A graphical illustration of the input data cube. The matrices X is used in the weight calculations.

Calculating the matrix C according to (2.3) can lead to numerical instability. Therefore the QR-decomposition approach is normally used. The matrix X is divided into two matrix components Q and R (2.4), where Q is unitary and R is upper triangular.

X = QR (2.4)

When (2.3) and (2.4) are combined, C can be written as (2.5).

C = (QR)^HQR = R^HQ^HQR = R^HR, (Q^HQ = I) (2.5) The actual calculation of w_m is done by calculating the R matrix by QR-Decomposition of X. Then calculating an intermediate vector am according to the expression in Equation

(20)

(2.6). At last the weight vector w_m is calculated according to Equation (2.7), which is derived from combining Equations (2.2), (2.5) and (2.6).

R^Ham = sm (2.6)

Rw_m= a_m

a^H_ma_m (2.7)

2.3 QR-Decomposition

QR-Decomposition (QRD) is a method for factorizing a matrix, A, into two factor matrices, Q and R. Using this method the resulting factor matrices have certain properties.

The Q matrix is unitary, which means that the transpose, or in the complex case hermitian, multiplied by the original Q matrix will result in the identity matrix. The R matrix is upper triangular, which means that all elements below the diagonal are zero.

QRD is explained in Figure 2.4.

There are three main methods [4] to calculate QRD, Householders factorization, modified Gram-Schmidt factorization and Givens factorization, which are discussed in the following sections. These methods are all very performance demanding, in fact the QRD is the most computationally intense stage of STAP. In this section the different QR factorization methods will be described from a mathematical point of view. In Chapter 5 a more implementation oriented approach of the QR factorization problem is described.

A = Q R 0

, R =







r₁₁ r₁₂ . . . r_1n 0 r₂₂ . . . r_2n ... ... . .. ... 0 . . . r_nn







Figure 2.4: A graphical illustration of a general QR-decomposition.

2.3.1 Householder QR factorizations

The Householder method calculates Q and R by multiplying a number of so called House- holder’s matrices, H_n, one for each column in A. The Householder’s matrices are both symmetric and orthogonal. The calculation of Q and R is shown in Equation (2.8).

Q = H₁H₂· · · H_n

R = H_n· · · H₂H₁A (2.8)

The Householder matrices are calculated according to Equation (2.9), where I has the same dimensions as Q. v_n is a vector containing the elements from the diagonal to the bottom of column n in the matrix constructed from multiplying previous Householder matrices with A.

H_n = diag

I, I_n− 2v_nv_n^T v_n^Tv_n

(2.9)

(21)

CHAPTER 2. RADAR

2.3.2 Gram-Schmidt QR factorization

The Gram-Schmidt QR factorization method calculates the columns of Q one vector at a time. R is calculated element-wise as shown in Equation (2.10). The different columns of Q are calculated according to Equation (2.12) and the elements of R are calculated according to Equation (2.11). These calculations need to be done in a certain order because of data dependencies. When calculating Q and R, work is done along the diagonal, starting at the top left corner. The first step is to calculate the diagonal element of R and then the column vector of Q, with the same column index as the recently calculated R-element, is calculated. The other elements of R, located at the same row, are then calculated.











Q = [w₁ w₂ . . . w_n]

R =







r₁₁ r₁₂ . . . r_1n 0 r₂₂ . . . r_2n ... ... . .. ... 0 . . . r_nn







(2.10)

r_ij =











w_i· v_j , i < j

v_i−

i−1

X

x=1

r_xiw_x

, i = j

0 , i > j

(2.11)

w_i = 1

r_ii v_i−

i−1

X

x=1

r_xiw_x

!

(2.12)

2.3.3 Givens QR factorization

The Givens method calculates Q and R by multiplying a number of Givens matrices, one for each element in A located below the diagonal. This is shown in Equation (2.13).

Q = G₁G₂· · · G_n

R = G^T_n· · · G^T₂G^T₁A (2.13)

The Givens matrices are calculated according to Equation (2.14), where k is the index of the column element to be annihilated and i is the index of a column element not yet annihilated. The variables ξ_i and ξ_k are the cell values indexed by i and k in the current column of the product matrix generated from A and previous Givens matrices.

(22)

G (i, k, θ) =







1 . . . 0 . . . 0 . . . 0 ... . .. ... ... ... 0 . . . c . . . s . . . 0

... ... . .. ... ... 0 . . . −s . . . c . . . 0 ... ... ... . .. ...

0 . . . 0 . . . 0 . . . 1





 ,







c = cos(θ) = √^ξⁱ

ξ²_i+ξ_k²

s = sin(θ) = √^−ξ^k

ξ²_i+ξ²_k

(2.14)

There is a restriction on the annihilation order, all elements indexed by i and k in columns to the left of the current one must be previously annihilated. A working annihilation scheme is described in Figure 2.5.







x x x 3 x x 2 5 x 1 4 6







Figure 2.5: A working annihilation scheme for Givens QR factorization of a 4 × 3-matrix. The annihilated elements are indexed by k, and i is assumed to be k − 1.

2.4 Mitre RT STAP

In this project, the RT STAP benchmark from Mitre cooperation [5] is going to be used as a basis for the processor evaluation. The benchmark includes the whole STAP stage, but since this project focuses on the QRD, only this part is implemented. There are three levels of the benchmark, easy, medium and hard, in these cases the performance demand is very dependent on the number of antenna channels and which algorithm is used. The easy case uses two antenna channels. The matrices that need to be QR-factorized therefore have 80×2 elements and are just a minor part of the whole benchmark case. The total performance demand of the easy case is about 0.6 GFLOPS, which is not much in the context of modern radar systems. The medium benchmark case is a First-Order Doppler- Factored STAP using 16 antenna channels. The matrices that need to be factorized have 80×16 elements and the total performance demand is about 6 GFLOPS. About one third of the performance demand in the medium case is related to QRD calculations. The hard benchmark case implements a Third-Order Doppler-Factored STAP, which uses 22 antenna channels. The QR decompositions, in the hard case, are working on matrices that have 240×66 elements; this generates a total performance demand of about 40 GFLOPS.

In the hard benchmark, an extensive part of the performance demand is directly related to QRD calculations. In Table 2.1, a compilation of the performance demands in the benchmark is shown.

Benchmark Number of QRDs Matrix size QRD % of STAP

Easy 384 80×2 5

Medium 384 80×16 31

Hard 128 240×66 84

Table 2.1: The calculation demands of the different cases described by Mitre RT STAP.

(23)

CHAPTER 3. THE CLEARSPEED CSX600 ARCHITECTURE

3 The Clearspeed CSX600 architecture

The CSX600 architecture from Clearspeed Technologies is a SIMD type, single chip multi- processor. It consists of a mono execution unit, or control unit, very much like an ordinary uniprocessor and a poly execution unit containing 96 identical Processing Element (PE)s.

The PEs are arranged in a linear array or a ring fashion, selectable at runtime. The architecture also has a high bandwidth interconnection facility, making it possible to construct powerful parallel computers consisting of several CSX600 processor chips. This interconnection system, called the ClearConnect Bus, can also be used to connect custom I/O units to the computer system. A graphical overview of the different parts of the CSX600 architecture can be seen in Figure 3.1.

Figure 3.1: An overview of the CSX600 processor architecture.

3.1 The mono execution unit

The mono execution unit, or control unit, decodes and dispatches both array (poly) and I/O instructions. The control unit also has hardware support for up to eight execution threads, which means that the processor is well suited for small real-time kernels. The swapping between threads is prioritized, and its main purpose is to be able to hide latencies in e.g. external I/O accesses. To shorten the access time, both for data and instruction fetches, cache memories are used, 8 Kbytes instruction cache and 4 Kbytes data cache.

3.2 The poly execution unit

The poly execution unit consists of 96 identical PEs. The PEs are connected to each other via what is known as the swazzle path. The swazzle path connects the register file of each

(24)

PE to the register files of its left and right neighbours. This connection makes it possible for the PEs to perform register-to-register transfers very rapidly. Transfers can be done either in a crossing fashion, between pairs of neighbouring PEs, or in a shifting fashion, where every PE writes to its left or right neighbour. The data written to the ends of the swazzle path can either be set by the control unit, making the PEs work as a linear array, or they can be connected to each other, making the PEs work as a ring [6][7].

Each of the 96 PEs has an instruction execution part consisting of two Floating Point Unit (FPU)s, an integer Multiply and Accumulate Unit (MAC) and an Arithmetic Logic Unit (ALU), see Figure 3.2. The FPU can handle both single and double precision floating point variables and can perform all the necessary mathematical operations such as addition, multiplication and division. Each PE has a 128 byte register used for instruction operands and temporary storage. For more long term storage each PE has 6 KB of high speed local SRAM. This local memory is very important to achieve high parallel performance, since working with local memory is the only way to parallelize memory accesses to different addresses. The PEs also has direct access to the ClearConnect bus for fast I/O and inter-chip communication in case of multi-chip systems.

Figure 3.2: An illustration of a CSX600 processing element.

3.3 Development tools

The Software Development Kit (SDK) provided by ClearSpeed includes a compiler, a debugger, a simulator and some libraries. The simulator, which is used in this project to test and evaluate the architecture, can simulate both single- and multi-chip systems. The debugger is based on the GNU Debugger (GDB) and allows the programmer to single-step instructions, watch and edit memory and registers. The language used to program the processor is an extended version of C, called Cⁿ, which has the capability to work with parallel structures, such as parallel variables and the PE activation/deactivation system.

To control when to use parallel or sequential variables, two extra keywords, mono and poly, are added.

(25)

CHAPTER 3. THE CLEARSPEED CSX600 ARCHITECTURE ClearSpeed has developed a hardware that includes two CSX600 processors and up to 4 GB of DDR2 SDRAM. Its main purpose is to accelerate mathematical applications, but it can also be used as debugging hardware, in conjunction with the SDK. The accelerator board uses the PCI-X interface and can therefore be used in any modern PC.

(26)

(27)

CHAPTER 4. RELATED WORK

4 Related work

This project is conducted in cooperation with Centre for Research on Embedded Systems (CERES) at Halmstad University as part of there research on radar signal processing.

CERES has been conducting research on radar signal processing in cooperation with Ericsson Microwave Systems and Chalmers University of Technology for many years.

4.1 Radar research

In 1996 an experimental S-band digital beam-forming antenna was developed at the Na- tional Defense Research Establishment in Sweden. The experiment was demonstrated using specially designed ASIC chips as processing elements. This system failed to run in real time so the goal for future research was obviously to create a system that can run in real time preferably using more than the 12 channels used by the original antenna [8]. At the time there were several similar projects, getting similar results, underway in different parts of the world. An example is the AMSAR project conducted by Thomson CSF, GEC Marconi Avionics Ltd and Daimler-Benz Aerospace AG with support from France, UK and Germany [9].

In order to be able to perform the signal processing in real-time a theoretical processor architecture named VEGA was proposed by Mikael Taveniku et al. [10] in 1998. This architecture is a combined Multiple Instruction streams Multiple Data streams (MIMD) and SIMD mesh based parallel architecture that can be scaled from 6.4 Giga Floating Point Operations per Second (GFLOPS) to several TFLOPS.

In a report from 1999, Anders ˚Ahlander and Per S¨oderstam [11] summarized a project conducted by Ericsson Microwave Systems, entitled ”High-Speed Signal Processing”. The objective of the project was to present a solution for a phased-array-antenna based airborne radar system. After analyzing the signal processing demands they came to the conclusion that an ASIC-based multiple SIMD system was the best solution with the technology available at the time, quite similar to the architecture used in this project.

A master thesis by Per S¨oderstam [12] (actually part of the above study, but published in 2004), investigates the possibility to use ring or torus SIMD architectures for radar signal processing. The algorithms considered in the thesis are common radar signal processing algorithms, including simple matrix algebra and the QRD. Two different implementations of QRD were done theoretically; the Householder’s and modified Gram-Schmidt algorithms, using a ring structure with 16 to 64 processing elements. Both algorithms performed approximately equal in terms of performance and utilization ratio. The author came to the conclusion that an array using 16-32 processing elements and the House- holder’s algorithm was the most promising way to perform the QRD.

(28)

4.2 Parallel architectural development

Lately a lot has happened in the microprocessor area. Earlier the main approach to increase performance of microprocessors was to increase the clock frequency. This development seems to have come to a halt mainly because of heat generation getting more and more out of control as clock frequency increases. This does not mean that the development has been standing still. The rate of parallel processor architectures emerging has increased a great deal, since parallelism is a good way to increase performance. In 2004 two master theses studying the use of parallel microprocessors for signal processing in phased array antenna based radar systems were written at Halmstad University. The first was a survey by C. Bangsgaard, T. Erlandsson and A. ¨Orning [1]. The second one was carried out by B. Bylin and R. Karlsson [2]. These theses had slightly different approaches concerning the level of parallelism in the processors. C. Bangsgaard, T. Erlandsson and A. ¨Orning identified ClearSpeed CSX600 as an interesting processor for further studies.

The reason they stated that the CSX600 could be useful in radar systems was the high performance per watt factor. The other thesis also identified the CSX600 as interesting for further investigation. The CSX600 architecture is similar to the VEGA architecture, mentioned in the previous section, in some ways.

(29)

CHAPTER 5. QRD IMPLEMENTATION

5 QRD Implementation

This chapter will describe the practical work done within this project, including detailed descriptions of the implemented algorithms and the simulation results.

5.1 Implementation of complex number arithmetic

Since the compiler used in this project did not have native support for complex numbers, this had to be implemented. The representation of the complex numbers has to be in rectangular form since addition and subtraction are undefined operations for complex numbers in polar form. Since single precision floating point numbers have sufficient precision to describe the radar signal the chosen representation of complex numbers consists of two single precision floating point numbers, one for the real and one for the imaginary part.

The operations that had to be implemented for complex numbers are addition, subtraction, multiplication and inverted square root. The first three operations were implemented according to Equation (5.1). Since the best way to support complex operations is to implement them as part of the compiler and ClearSpeed will add this support in the next version of the compiler, these implementations are not optimized in any way.

re (a + b) = re (a) + re (b) im (a + b) = im (a) + im (b)

re (a − b) = re (a) − re (b) im (a − b) = im (a) − im (b)

re (a × b) = re (a) × re (b) − im (a) × im (b) im (a × b) = re (a) × im (b) + im (a) × re (b)

(5.1)

The inverse square root operation for complex numbers was implemented according to Equation (5.2). This operation is quite complex, it includes square roots and inverse square roots of floating point numbers, among other things. There was no inverse square root for floating point numbers provided by ClearSpeed so that had to be implemented as well, it was done according to a paper by Ken Turkowski at Apple Computer Inc. [13].

The algorithm described in this paper uses a look-up table and a number of iterations to calculate the inverse square root value. The number of iterations needed is depending on the size of the look-up table. When using a look-up table of about 1 KB, sufficient precision is achieved after the first iteration, and a computation corresponding to about 8 Floating Point Operation (FLOP) is needed to perform the calculation. To compute an ordinary square root only one additional FLOP is needed, this results in a total need of about 30 FLOP to compute the complex inverse square root.

(30)





 re

√1 x

=

√ 2

2 ×q

re (x) ×√ ¹

re(x)²+im(x)² + re (x) im

√1 x

= −sgn (im (x)) ×

√2

2 ×q

re (x) × √ ¹

re(x)²+im(x)² − re (x)

(5.2)

5.2 Implementation of Givens QRD method

Givens QR-decomposition method has been mathematically explained in Section 2.3.3.

This section will describe how this method has been implemented using a CSX600 processor.

5.2.1 Implemented algorithm

As stated earlier the Givens method uses a number of Givens-matrices, one for each element located below the diagonal of the matrix on which the QR-decomposition is performed, to calculate both the Q- and the R-matrices. Since in this case only the R- matrix is needed, some, otherwise essential, calculations can be optimized away. This fact makes it possible to use an algorithm that performs updating calculations on the original matrix, which leads to a resulting matrix equal to the R-matrix. When using this updating algorithm, one update corresponds to one matrix multiplication, thus annihilating one element of the updated matrix. The reason that the updating method is much more effective than the matrix multiplication method is that only four elements of the Givens- matrix differ from the elements of the identity matrix and therefore only two rows of a matrix will be affected when multiplied by a Givens-matrix; the updating method only calculates the elements affected by the matrix multiplication, while the actual matrix multiplication will calculate all elements.

Since an update of the matrix corresponds to a matrix multiplication by a Givens-matrix, the calculations performed in an update are very similar to the calculations involved in the generation of a Givens-matrix. Each update is performed in order to annihilate one element of the matrix. A verbal explanation of what is done is: The column vectors are rotated around a base vector of the space in which the column vectors reside, just enough to annihilate one element of one vector. One update consists of a calculation of the sine and cosine values of the rotation angle, according to Equation (2.14), followed by an update of the column elements as they will be after the rotation. This is done according to Equation (5.3), x represents an element in the column vectors and the indices i and k are the same as in Equation (2.14); the same indices as were used to calculate the rotation angle. The element that will be annihilated is x_k of the column used to perform the calculation of the rotation angle.

x_i = cos(θ) × x_i− sin(θ) × x_k xk = cos(θ) × xk+ sin(θ) × xi

(5.3) The annihilation scheme is very important for the calculation. In case the elements are annihilated in the wrong order the resulting matrix will not be upper triangular, thus not equal to the R-matrix. The annihilation scheme used in this implementation was

(31)

CHAPTER 5. QRD IMPLEMENTATION proposed by Sameh and Kuck [14] and is illustrated in Figure 5.1. This is not the only working annihilation scheme, but it is effective since it allows the calculation of the sine and cosine values of the rotation angle for several updates to be calculated in parallel.

This is obviously a huge advantage in this case. The only restriction an annihilation scheme needs to meet, in order to work, is that x_i and x_k must be zero in all columns to the left of the one having an element annihilated. An obvious way to do this, which is often used in sequential implementations, is to simply annihilate all elements below the diagonal of the first, or leftmost, column, then the same procedure is performed on the second and third column, and so on. Most other popular annihilation schemes start from the bottom left corner of the matrix, working their way upwards in some kind of diagonal fashion, as is the case with the one used in this implementation.







x x x x

8 : 1 x x x

7 : 1 9 : 1 x x 6 : 1 8 : 2 10 : 1 x 5 : 1 7 : 2 9 : 2 11 : 1 4 : 1 6 : 2 8 : 3 10 : 2 3 : 1 5 : 2 7 : 3 9 : 3 2 : 1 4 : 2 6 : 3 8 : 4 1 : 1 3 : 2 5 : 3 7 : 4







Figure 5.1: An illustration of the annihilation scheme used in the implementation of the Givens QRD method. The left number specifies angle calculation instance and the right number specifies the update instance following that angle calculation instance.

The actual implementation of the Givens QRD method can be found in Appendix A.1.

This implementation has been verified using the simulator included in the SDK. The verification was done by comparing the calculated R-matrix to the output of the qr function in Matlab.

5.2.2 Givens QRD implementation explained

The easiest way to explain how this implementation works is probably through an example.

In this example a 12×5-matrix, with each column vector located in separate PEs, will be used to describe the three different stages of the calculation. Since the parallel angle calculations are performed in a diagonal fashion, there will be a start-up sequence before the maximum number of parallel angle calculations is reached. In Figure 5.2, the start-up sequence is described step by step. In stage one angle calculation for the annihilation of the bottom left element of the matrix, the element marked with a bold outline, is performed and in the second stage the rotation that actually performs the annihilation is executed.

Since two elements of the first column need to be annihilated before annihilations can be started in second column, stage 3 and 4 also perform annihilation in the first column only.

At this point annihilation can be performed in the second column, so stage 5 now executes two angle calculations in parallel. Stage 6 and 7 then performs annihilation in each of the first two columns. The start-up sequence will continue in this way, annihilating two elements of every column, before annihilations are started in the following column until

(32)

the number of parallel computed angle calculations equals the width of the matrix; at that point the intermediate section of the calculation is started.

Figure 5.2: An illustration of the first stages of the Givens QRD implementation. The top half of the matrix has been removed since those elements will not be affected in these stages.

Figure 5.3 graphically describes how the intermediate section of the calculation works. As can be seen in stage 35, all PEs are busy when angle calculations are performed; there is one square with bold outlines in each column. In the following stages the rotations that annihilate the elements, for which the angle calculations where performed, are executed one by one, starting from the left. This intermediate section continues until all elements of the first column that shall be annihilated are annihilated and then the stop sequence is begun.

Figure 5.3: An illustration of the intermediate section of the Givens QRD implementation.

In the stop sequence, which is described in Figure 5.4, the columns will be finished one by

(33)

CHAPTER 5. QRD IMPLEMENTATION one. As can be seen in stage 47, no angle calculation is performed for the leftmost column.

When the rotations in the next four stages have been executed, the second column is also finished and in the last stage of the illustration, the angle calculations for the remaining columns are preformed. The stop sequence will continue like this, finishing one column for each time angle calculations are performed, until all columns are finished and the R-matrix is calculated.

Figure 5.4: An illustration of the stop sequence of the Givens QRD implementation.

5.3 Implementation of Householders QRD method

The implementation of Householders QRD method has its mathematical background in the equations described in Section 2.3.1, but the actual implementation uses a more efficient way to perform the calculations. The method used is an updating method with similarities to the updating Givens method described in earlier in this chapter. This method can be used since the Q-matrix is not needed. The QRD of a matrix A is performed by multiplying it, from the left, by a virtual Householder-matrix, H, thus annihilating the elements located below the diagonal in the current column and updating the other elements of the matrix. The updates actually work with a sub matrix located in bottom right corner of the original matrix. The first update uses the whole matrix but for every new update the sub matrix used decrease one row and one column in size. The updates are described in Equation (5.4), see also Figure 5.5 for a context overview.

H₁A⁽¹⁾ = H₁







a_1,1 a_1,2 a_1,3 a_2,1 a_2,2 a_2,3 a_3,1 a_3,2 a_3,3 a_4,1 a_4,2 a_4,3







=







a⁰_1,1 a⁰_1,2 a⁰_1,3 0 a⁰_2,2 a⁰_2,3 0 a⁰_3,2 a⁰_3,3 0 a⁰_4,2 a⁰_4,3







= A⁽²⁾

H₂A⁽²⁾ = H₂





a⁰_2,2 a⁰_2,3 a⁰_3,2 a⁰_3,3 a⁰_4,2 a⁰_4,3



=





a⁰⁰_2,2 a⁰⁰_2,3 0 a⁰⁰_3,3 0 a⁰⁰_4,3



= A⁽³⁾

H₃A⁽³⁾ = H₃ a⁰⁰_3,3 a⁰⁰_4,3

= a⁰⁰⁰_3,3 0

= A⁽⁴⁾

(5.4)

(34)

Figure 5.5: An illustration of the Householder algorithm in progress.

As stated earlier these matrix multiplications are virtual, they are actually calculated according to Equation (5.5) where the variables v, β, and w are calculated as described in Equations (5.7), (5.6) and (5.8).

HA = I − βvv^H A = A − βvw^H (5.5)

The calculation of the vector v in Equation (5.6) is implemented in a sequential way. It involves a calculation of the norm of the vector x, which is done by a single PE leaving the other PEs idle. The vector v is then distributed to the other PEs using a broadcast mechanism.

v = x + x₁

|x₁|kxk₂× e₁, e₁ = [1 0 . . . 0]^T (5.6) When the real scalar value β is calculated, some shortcuts can be made by reusing values generated by the calculation of the vector v (see Equations (5.6) and (5.7)), therefore the calculation have to be made by the same PE who calculated the vector v. The variable β is then distributed via the same broadcasting mechanism as the vector v previously did.

β = 2

v^Hv, v^Hv = kxk₂(2 |x₁| + kxk₂) + kxk (5.7) The elements of the vector w are calculated in parallel, one element in each active PE.

This is done by calculating the scalar product the vector v, now located in all PEs, and the hermitian of the A vector located in each PE.

w = A^Hv (5.8)

The actual update of the matrix A, see Equation (5.5), is done in parallel as well. The calculation of the updated value of each element in A is done by multiplying the hermitian value of w element located in each PE by β and then, for each element of the column vector of A, subtract the product of this, previously calculated value and the corresponding element of v.

An implementation of the algorithm described in this section can be found in Appendix A.2. Just like the implementation of the Givens algorithm, described in the previous section, the verification was done by comparing the output of the simulation to the output of the qr function in Matlab.