## Institutionen för systemteknik

### Department of Electrical Engineering

**Master's Thesis**

**Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor**

Master's thesis in Computer Engineering carried out at Linköping Institute of Technology

by

**Yi-Hsien Li**

LITH-ISY-EX--07/3953--SE

Linköping 2007


Supervisor: **Di Wu**

ISY, Linköpings universitet

Examiner: **Dake Liu**

ISY, Linköpings universitet

**Division, Department**

Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

**Date:** 2007-03-06

**URL for electronic version:** http://www.control.isy.liu.se, http://www.ep.liu.se/2007/3953

**ISRN:** LITH-ISY-EX--07/3953--SE

**Title:** Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor

**Author:** Yi-Hsien Li


**Abstract**

Space-Time Adaptive Processing (STAP) is widely used in modern radar systems, such as Ground Moving Target Indication (GMTI) systems, to suppress jamming and interference. However, its high performance comes at the price of high computational complexity, which demands extensive and powerful hardware.

The new STI Cell Broadband Engine (CBE) processor, which combines a PowerPC core with eight streamlined high-performance SIMD processing engines, offers an opportunity to implement STAP baseband signal processing without any full-custom hardware. This thesis presents the implementation of a STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied by mapping the kernel subroutines of STAP, such as QR decomposition, the Fast Fourier Transform (FFT), and FIR filtering, to the SPE co-processors of the Cell BE processor using a variety of architecture-specific optimization techniques.

This report starts with an overview of airborne radar techniques and then introduces standard STAP and, specifically, third-order Doppler-factored STAP. Next, it gives a thorough description of the Cell BE architecture, its programming toolchain, and parallel programming methods for the Cell BE. Later chapters discuss how STAP is implemented on the Cell BE processor and present the simulation results. Furthermore, based on the results of the earlier benchmarking, an optimized task partitioning and scheduling method is proposed to improve the overall performance.

**Acknowledgments**

First and foremost, I wish to thank everyone who supported me, and above all my supervisor, Di Wu, for his stimulating suggestions and encouragement throughout the thesis work. In addition, I gratefully thank my professor, Dake Liu, for his constructive comments and valuable suggestions during my studies at Linköpings universitet.

I also wish to thank all my friends in Linköping for always making me feel welcome and less far from home. Thank you for the good times we had together. Everybody, including my friends in Taiwan, has in his or her own way helped me face the difficult situations that I experienced.

Last but not least, I would like to thank my wonderful family, including my parents and my sisters, for all their enduring support and for always believing in me.

Yi-Hsien Li Linköping, March 6

**Contents**

- **1 Introduction**
  - 1.1 Background
  - 1.2 Purpose of the Thesis
  - 1.3 Way of Work
  - 1.4 Outline
- **2 Overview of Radar Signal Processing**
  - 2.1 Airborne Radar Application
    - 2.1.1 Moving Target Indicator (MTI)
    - 2.1.2 Two-Dimensional Space-Time Spectrum
  - 2.2 Space Time Adaptive Processing (STAP)
    - 2.2.1 Mathematical Model
    - 2.2.2 Full Rank STAP
- **3 Functional Specification of STAP System**
  - 3.1 Preprocessing
    - 3.1.1 I/Q Conversion
    - 3.1.2 Array Calibration
    - 3.1.3 Pulse Compression
    - 3.1.4 Combination of Array Calibration and Pulse Compression
  - 3.2 Post-Doppler Adaptive Processing
    - 3.2.1 Doppler Filtering
    - 3.2.2 Weight Computation
  - 3.3 Computation issues
- **4 Overview of Cell Broadband Engine**
  - 4.1 Architecture
    - 4.1.1 Power Processing Element (PPE)
    - 4.1.2 Synergistic Processing Element (SPE)
    - 4.1.3 Element Interconnect Bus (EIB)
    - 4.1.4 Memory and I/O
  - 4.2 Programming Toolchain
    - 4.2.1 Compiler
    - 4.2.2 Accelerated Library Framework
    - 4.2.3 The Simulator
  - 4.3 Parallel Programming Methods for the Cell Broadband Engine
    - 4.3.1 SIMD Vectorization
    - 4.3.2 Interleaved Load
    - 4.3.3 Loop Unrolling
    - 4.3.4 Double-Buffering
    - 4.3.5 Reducing the Impact of Branches
    - 4.3.6 Data Alignment
- **5 Design Consideration**
  - 5.1 Real-Time Performance Issues
  - 5.2 Finite-Length Precision
  - 5.3 System Partition
    - 5.3.1 Design Consideration
    - 5.3.2 Programming Flow
- **6 Kernel Subroutines**
  - 6.1 Implementation
    - 6.1.1 Preprocessing
    - 6.1.2 Doppler Processing
    - 6.1.3 QR Decomposition
    - 6.1.4 Forward and Backward Substitution
  - 6.2 Benchmark Results
- **7 Original STAP Flow**
  - 7.1 Implementation
    - 7.1.1 Multidimensional Data Cube Rotation
    - 7.1.2 System Integration for Whole STAP Flow
  - 7.2 Benchmark Results
- **8 Optimized STAP Flow**
  - 8.1 Implementation
    - 8.1.1 Problems of Original STAP Flow
    - 8.1.2 System Integration for Optimized STAP Flow
  - 8.2 Benchmark Results
- **9 Conclusion and Future Work**
  - 9.1 Conclusion
  - 9.2 Future Work
- **Bibliography**
- **A SPE Kernel C Intrinsic Code for QR decomposition**
- **B SPE Kernel C Intrinsic Code for Forward/Backward Substitution**

**List of Figures**

- 2.1 Illustration of an airborne radar environment
- 2.2 Airborne radar electromagnetic environment
- 2.3 Two-dimensional Space-Time spectrum
- 2.4 General STAP filter structure
- 2.5 Input data cube for a single CPI
- 2.6 Taxonomy of reduced-dimension STAP algorithms
- 3.1 Preprocessing block diagram for a single-array channel
- 3.2 Overlap-save method
- 3.3 Third-order Doppler-factored STAP
- 4.1 Cell Broadband Engine (CBE) block diagram
- 4.2 Power Processing Element (PPE) block diagram
- 4.3 Synergistic Processing Element (SPE) block diagram
- 4.4 Overview of ALF [4]
- 4.5 Simulation stack of Cell BE processor
- 4.6 Example of static analysis of SPE threads
- 4.7 Example of dynamic analysis of SPE threads
- 4.8 Software development flow on CBE
- 4.9 Example of SIMD addition
- 4.10 Eliminate data dependency by interleaved loading
- 4.11 DMA transfer using a double-buffering method
- 5.1 STAP function flow
- 5.2 System partition of STAP on CBE
- 5.3 Program flow of STAP (right part shows the flow in SPE for preprocessing)
- 5.4 Program flow of STAP (right part shows the flow in SPE for QR decomposition)
- 6.1 DIT of a length-N DFT into two length-N/2 DFTs followed by a combining stage
- 6.2 The basic operation of FFT: butterfly
- 6.3 Radix-2 DIT FFT algorithm for a length-8 signal
- 6.4 Program flow of preprocessing
- 6.5 Parallelism of preprocessing (DMA transfer, pipeline 0, and pipeline 1 are executed in parallel)
- 6.6 The computation of the ith iteration in QR decomposition
- 6.7 The computation of vector normalization in MGS
- 6.8 Newton-Raphson's method
- 6.9 Pseudo code of forward substitution
- 6.10 Programming flow of forward substitution
- 6.11 Block diagram of 4x4 matrix forward substitution
- 6.12 Block diagram to subtract y by x times first four column elements
- 6.13 Simulation result of QR decomposition
- 7.1 Original data flow and functional description of STAP processing stages
- 7.2 Example of data rotation between first and third dimension
- 7.3 The data and computation flow of original STAP system on each SPE
- 8.1 Ideal multi buffering with more computation and less DMA overhead
- 8.2 Problem of multi buffering with less computation and more DMA overhead
- 8.3 The ratio of DMA transfer to computation in each processing stage
- 8.4 The channel stall cycles rate in each processing stage
- 8.5 Modified data flow and functional description of STAP system
- 8.6 The data and computation flow of modified STAP system on each SPE
- 8.7 Comparison between original data flow and modified data flow

**List of Tables**

- 3.1 Complex operation counts of STAP
- 3.2 Instruction counts of STAP (one complex operation per cycle)
- 3.3 Floating-point operations of STAP (one floating-point operation per cycle)
- 4.1 Example for unaligned load in SPU
- 6.1 Conversion from complex to RFLOPs
- 6.2 Example for approaching reciprocal square root
- 6.3 Performance measurement of kernel subroutines (pure computation without data movement)
- 7.1 The memory address of data depends on the setting of dimension
- 7.2 Theoretical performance of STAP benchmark (pure computation without data movement)
- 7.3 Performance of STAP benchmark of original dataflow (including the latency of memory subsystem)
- 8.1 Required data and computation cycles for each processing stage, where the permutation/transposition is not included in the computation cycles. Each data item is a 32-bit complex floating-point value. All processing coefficients are assumed to have been served in the SPE.
- 8.2 Performance of STAP benchmark with modified dataflow (including the latency of memory subsystem)

**Chapter 1**

**Introduction**

**1.1 Background**

Recently, more and more Unmanned Aerial Vehicles (UAVs) have been deployed for battlefield surveillance tasks, which require long-duration cruising at relatively low altitude. Compared to manned aircraft, UAVs greatly reduce the risk of casualties. However, this places tougher requirements on the design of the sensor system. To meet strict constraints on on-board space, power consumption, and accessibility for maintenance and upgrades, the airborne electronics system must be highly compact, low-power, and flexible. To meet these requirements, Software-Defined Radar (SDR) was first introduced by Wiesbeck as an alternative to fixed-function hardware-based systems; it employs programmable devices to accommodate various radar sensors for different missions by updating the software.

To meet both the performance and the flexibility requirements of an SDR system, state-of-the-art hardware is needed. The scaling of semiconductor processes allows more processing units to be integrated into a single chip, making the system compact and powerful enough that a full radar system can be carried by a UAV with strict constraints on space and power consumption. STI CELL [3] is a state-of-the-art multiprocessor designed by a joint venture of Sony, Toshiba and IBM (STI). However, it also brings a brand-new parallel programming model that is unfamiliar to most application programmers.

Recently, an estimate of Space-Time Adaptive Processing (STAP) performance on STI CELL was presented in [7]. However, that estimate is based on a calculation of the Floating-Point Operations Per Second (FLOPS) involved in the computation rather than on cycle-accurate simulation, and it does not explicitly expose the overhead of the memory subsystem, so the result needs further verification. In this thesis, in order to explore the potential of CELL for array signal processing, a complete STAP baseband processing flow is implemented on STI CELL and benchmarked using the cycle-accurate simulator from IBM.

**1.2 Purpose of the Thesis**

The scope of the project is to design and implement Space-Time Adaptive Processing (STAP) algorithms on the recent Cell Broadband Engine, a heterogeneous multi-core processor. The thesis benchmarks the performance of the CELL architecture for a floating-point radar application: state-of-the-art, real-time, high-resolution adaptive processing for interference nulling (each PRI processed within a 32.5 millisecond time interval). The kernel subroutines of STAP are accelerated using SIMDization, and task/data partitioning and scheduling are optimized to improve the overall performance.

**1.3 Way of Work**

The benchmark programs were written in C using intrinsics, a C-language extension that gives access to inline assembly-language instructions. The programs were then compiled and executed in the IBM Full System Simulator for the Cell Broadband Engine, which supports both functional and cycle-accurate simulation. The aim was to design a program with a high degree of data and task parallelism that achieves as short an execution time as possible, mainly for real-time processing.

**1.4 Outline**

Chapter 2 presents the basic theory of Space-Time Adaptive Processing, such as Moving Target Indication (MTI), beamforming, and the environment in which STAP operates.

Chapter 3 elaborates the functional and timing specification of the real-time STAP benchmark. The benchmark case corresponds to third-order Doppler-factored STAP, which consists of several computation stages, such as preprocessing, Doppler filtering, and weight computation.

Chapter 4 gives an overview of the Cell BE processor. It first describes the architecture thoroughly and then discusses the programming toolchain, from the compiler to the simulator. Several parallel programming methods for developing software on the Cell BE multiprocessor are presented at the end of the chapter.

Chapter 5 discusses the design considerations for implementing the STAP algorithm on the Cell BE processor. It covers the real-time constraints and numerical precision required by the specification of the STAP system. System scheduling and partitioning that exploit the Cell BE architecture are also discussed.

Chapter 6 presents the implementation and benchmark results of all kernel subroutines in the STAP system. The kernel subroutines consist of preprocessing, Doppler processing, QR decomposition, forward substitution, and backward substitution. For each processing stage, the optimization of significant modules in the SIMD environment is presented. At the end of the chapter, simulation results of the implementation are presented.

Chapter 7 presents the whole computation flow and the integration of the system. The benchmark results show the performance including the latency of data movement and permutation.

Chapter 8 proposes, based on the results of the earlier benchmarking, an optimized task partitioning and scheduling method to improve the overall performance. The simulation results, which explain the improvement achieved by the optimized implementation, are presented in the second part of the chapter.

**Chapter 2**

**Overview of Radar Signal Processing**

Space-Time Adaptive Processing (STAP) is a signal processing technique most commonly used in airborne radar. It uses adaptive array processing algorithms to optimize target detection. Radar systems benefit from STAP in environments with strong interference. Through the application of STAP, it is possible to suppress clutter and jamming and achieve order-of-magnitude sensitivity improvements in target detection.

This chapter discusses the technical background required for adaptive airborne radar, such as Moving Target Indication, beamforming, and Space-Time Processing. This knowledge also has broad application in several distinct disciplines, for example space-based radar, sonar, spectral estimation, biomedical imaging, wireless communications, and many other fields that use estimates of statistical correlations between random quantities.

**2.1 Airborne Radar Application**

All radars transmit electromagnetic energy and receive scattered echoes from reflective objects. These objects can be classified as targets or clutter, where clutter is defined as any unwanted echo that interferes with the target echoes. Consider a high-altitude airborne surveillance radar designed to detect moving aircraft or slow-moving ground vehicles, as depicted in Figure 2.1. In general, the dominant source of clutter for a down-looking surveillance radar is the Earth, which includes all natural and man-made structures. Ground clutter, the Earth's echo, is usually several orders of magnitude greater in power than a target echo. Therefore, the ground clutter must be mitigated as much as possible through filtering in order to detect the relatively weak target signal buried in the clutter.

A non-clutter source of interference to all radars is intentional or unintentional jamming by a source of electromagnetic radiation transmitting signals in the radar's transmit/receive frequency band. This jamming could be generated by hostile sources or by radiating equipment that happens to have a similar interference effect on the radar transceiver.

**Figure 2.1. Illustration of an airborne radar environment**

Discriminating features that distinguish the target from the clutter must exist in order to filter a target out of the clutter. The three most commonly used discriminating features are **range**, **radial velocity**, and **azimuth angle**. As shown in Figure 2.2, range gating the received echo helps to separate targets from clutter by limiting the clutter backscatter area to a narrow ring on the ground that circles the aircraft and whose effective width equals the compressed pulse width. The figure, in which only half of the ring is shown, suggests that low gain is typical in the antenna back-lobe region. In the front-lobe region, the antenna pattern gain, as a function of azimuth angle, multiplies the clutter and jamming arriving from the same angles and generally attenuates these signals by several dB if they are not located in the main-beam region. When the compressed pulse width is made smaller, a target confined to a single range bin has less clutter power to compete with during detection processing.

The radial-velocity discriminant is the relative velocity of the target with respect to the radar velocity vector, measured along the line of sight to the radar. It is directly related to the Doppler frequency of the target echo. The azimuth-angle discriminant is simply the target location with respect to the antenna pointing direction. Elevation angle is usually not useful for surveillance radars, since they generally have very modest vertical antenna apertures and consequently broad elevation beamwidths.

**Figure 2.2. Airborne radar electromagnetic environment**
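The direct relationship between radial velocity and Doppler frequency mentioned above can be made concrete with the standard monostatic radar relation, which this chapter does not spell out. As a rough worked example, assume a wavelength $\lambda = 0.03$ m and a radial velocity $v_r = 15$ m/s (both values are illustrative assumptions, not parameters of the benchmark):

```latex
f_d = \frac{2 v_r}{\lambda}
    = \frac{2 \times 15\ \text{m/s}}{0.03\ \text{m}}
    = 1000\ \text{Hz}
```

The Doppler shift scales linearly with radial velocity, so a slow-moving ground vehicle produces a correspondingly small shift.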

**2.1.1 Moving Target Indicator (MTI)**

In many surveillance radars, the pulses of a Coherent Processing Interval (CPI) are temporally processed, for each range bin, by two stages collectively called Airborne Moving Target Indicator (AMTI). In the first stage, the AMTI circuit attempts to reverse the effect of the radar aircraft's motion by making the radar signals appear as if they were transmitted from a stationary platform. A complementary circuit technique called TACCAR allows a simple second-stage Moving Target Indicator (MTI) filter design, which significantly attenuates the stationary (or DC) signal component associated with the ground clutter that remains in the first-stage output. For more details on MTI, please refer to [8].

**2.1.2 Two-Dimensional Space-Time Spectrum**

Figure 2.3 illustrates the two-dimensional Space-Time spectrum of the received echo for a single range bin, which is typical of airborne surveillance radars. The spectrum assumes that the radar's look direction is perpendicular to the direction of flight. The clutter is already spread in azimuth angle due to the antenna pattern. In addition, due to the motion of the aircraft, the clutter spreads in Doppler frequency. One important characteristic is that the clutter Doppler frequencies and azimuth angles are correlated, meaning that clutter from a certain azimuth angle has an associated Doppler frequency shift. This correlation produces a structure in the Space-Time spectrum called the clutter ridge. The clutter ridge is the diagonal ridge in Figure 2.3 that cuts across both the azimuth and Doppler axes.

Imagine that the moving radar slows and becomes stationary in the air. The effect would be to rotate the clutter ridge clockwise (still centered at 0 Hz and 0 degrees) until it lies along the zero-Doppler-frequency axis. Now every azimuth angle has the same Doppler frequency of 0 Hz, and there is no longer a correlation between the two dimensions. This is precisely the effect that the MTI algorithm attempts to achieve (via a technique called Displaced Phase Center Antenna, DPCA), but typically with limited success.

Since stopping the radar motion rotates the clutter ridge, the moving targets in Figure 2.3 can become unmasked from the clutter. In other words, the projection of the clutter onto the Doppler axis shrinks from the full extent of the Doppler axis to just the width caused by internal clutter motion, revealing the relatively weak moving targets at their non-zero Doppler frequencies. Once the clutter spectrum is rotated, MTI and Doppler-factored processing produce a near-optimum solution. By reversing the effects of radar motion, targets are unmasked from the clutter Doppler spectrum, which is correlated from pulse to pulse.

Recently, targets with low radar cross section have come to require better clutter and jamming suppression than current airborne radars provide. The level of clutter and jamming rejection in airborne radars that use cascaded DPCA and sidelobe-canceller techniques is not sufficient to detect these targets. Space-Time Adaptive Processing was developed to solve this problem.

**2.2 Space Time Adaptive Processing (STAP)**

STAP is a two-dimensional, linear, adaptive filtering process. It filters received signals over both space and time in order to extract desired signals while reducing interference. In contrast to one-dimensional spatial or temporal filtering, STAP uses both of these orthogonal dimensions simultaneously in order to discriminate optimally between desired and interfering signals. The received field is sampled in space and time using a finite array of N spatial antennas with a length-M tapped delay line attached to each antenna, as shown in Figure 2.4.

**Figure 2.4. General STAP filter structure**

To solve the target-masking problem, STAP uses space and time samples of the received field, which provides more information than a single (time) dimension does. If only time samples were available, moving targets might be masked by the Doppler-spread clutter ridge. Because of the added dimension of data, STAP has the ability to create a two-dimensional adapted filter pattern to null out the strong clutter ridge only, without nulling out moving targets at other azimuth angles and Doppler frequencies. This alleviates the need to first remove platform motion effects (i.e., rotating the clutter ridge is no longer necessary).

STAP was developed for the purpose of unmasking targets from Doppler-spread clutter, but it can also fulfill the role of multiple sidelobe cancellation. Moreover, STAP provides a distortionless response (unity gain) to the target signal and optimizes the Signal-to-Interference-plus-Noise Ratio (SINR). Thus, theoretically optimal target detection can be performed directly on the (squared) magnitude of the STAP filter output, without the need for a follow-on Doppler filter bank. As shown in Figure 2.4, the weights are applied by taking the complex inner product of the general weight vector $\mathbf{w}$ and the vector of space and time samples $\boldsymbol{\chi}$ for the range bin under test (referred to as a snapshot vector) as $y = \mathbf{w}^H \boldsymbol{\chi}$, where $y$ is a complex scalar quantity.

**2.2.1 Mathematical Model**

Before outlining the derivation of the STAP weight solution, the mathematical models are described below.

**Coherent Processing Interval (CPI):** The array of $L$ antenna channels samples the echo spatially, while each receiver samples the echo temporally over $P$ pulses in a CPI. The receiver on each element down-converts the echo to baseband, and data for each of $R$ range bins is stored along a particular antenna pointing angle. A three-dimensional data cube called a CPI, which consists of $L$ channels, $P$ pulses and $R$ range samples, is used by a STAP process, as shown in Figure 2.5. It holds the received complex Space-Time data for all ranges along one array azimuth pointing angle. Both the range and the pulse dimensions are time samples, but the range dimension is called fast time because it is sampled much faster than the $P$ pulses of the pulse dimension (called slow time).

For each of the $R$ range bins, the associated $L \times P$ matrix of Space-Time samples is reshaped into a single long column vector of size $LP \times 1$, denoted $\boldsymbol{\chi}_l$, which is called a Space-Time snapshot because it corresponds to a single range gate (or range sample time) indexed by the variable $l$.

**Figure 2.5. Input data cube for a single CPI**
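The reshaping of one range bin of the data cube into a Space-Time snapshot can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `snapshot`, the nested-list layout `cube[channel][pulse][range]`, and the toy sample values are all assumptions. The slice is stacked column by column (channels vary fastest), which is the ordering that matches a temporal-by-spatial Kronecker product.

```python
def snapshot(cube, r):
    """Stack the L x P space-time slice at range bin r into an LP-long vector.

    cube[l][p][r] is the complex sample of channel l, pulse p, range bin r
    (hypothetical layout). Channels vary fastest, pulses slowest.
    """
    L = len(cube)
    P = len(cube[0])
    return [cube[l][p][r] for p in range(P) for l in range(L)]


# Toy cube with L=2 channels, P=3 pulses, R=2 range bins.
# Each sample encodes its own indices as 100*l + 10*p + r for easy checking.
L, P, R = 2, 3, 2
cube = [[[complex(100 * l + 10 * p + r, 0) for r in range(R)]
         for p in range(P)]
        for l in range(L)]

chi0 = snapshot(cube, 0)  # snapshot for range bin 0
chi1 = snapshot(cube, 1)  # snapshot for range bin 1
```

Each snapshot has length $LP = 6$; entry `p * L + l` holds the sample of channel `l` and pulse `p`.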

**Steering Vector:** A spatial steering vector is the set of normalized signal values that the antenna array receives, over all $L$ elements, from a wave arriving from a particular angle at any given time:

$$\mathbf{a}(\theta) = \left[1,\ e^{j2\pi\theta},\ \ldots,\ e^{j(L-1)2\pi\theta}\right]^T \tag{2.1}$$

A temporal steering vector comprises the normalized signal values that a single element of the antenna array receives, over all $P$ pulses, from a wave having a particular Doppler frequency:

$$\mathbf{b}(\omega) = \left[1,\ e^{j2\pi\omega},\ \ldots,\ e^{j(P-1)2\pi\omega}\right]^T \tag{2.2}$$

Thus, the Space-Time steering vector $\mathbf{v}(\theta, \omega)$ is the Kronecker product of the temporal and spatial steering vectors:

$$\mathbf{v}(\theta, \omega) = \mathbf{b}(\omega) \otimes \mathbf{a}(\theta) \tag{2.3}$$

where $\mathbf{a}(\theta)$ is the spatial steering vector at relative spatial frequency $\theta$ and $\mathbf{b}(\omega)$ is the temporal steering vector at relative Doppler frequency $\omega$. A Space-Time steering vector is defined as the normalized response of a target having relative spatial frequency $\theta$ and relative Doppler frequency $\omega$.
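The construction of the Space-Time steering vector can be sketched in a few lines. The sketch assumes that the temporal steering vector has the same form as Equation (2.1), with $\omega$ in place of $\theta$ and $P$ pulses in place of $L$ elements; the function names are hypothetical and chosen only for illustration.

```python
import cmath


def spatial_steering(theta, L):
    """a(theta) = [1, e^{j 2 pi theta}, ..., e^{j (L-1) 2 pi theta}]."""
    return [cmath.exp(2j * cmath.pi * l * theta) for l in range(L)]


def temporal_steering(omega, P):
    """b(omega), assumed to mirror a(theta) over P pulses."""
    return [cmath.exp(2j * cmath.pi * p * omega) for p in range(P)]


def kron(b, a):
    """Kronecker product of two vectors: every b[p] scales a full copy of a."""
    return [bp * al for bp in b for al in a]


def space_time_steering(theta, omega, L, P):
    """v(theta, omega) = b(omega) (x) a(theta), a vector of length L*P."""
    return kron(temporal_steering(omega, P), spatial_steering(theta, L))


# Small example: L = 2 elements, P = 2 pulses.
# a(0.25) = [1, j] and b(0.5) = [1, -1], so v = [1, j, -1, -j].
v = space_time_steering(0.25, 0.5, 2, 2)
```

Because the temporal vector is the left factor, the spatial index varies fastest in `v`, matching a snapshot that stacks the $L \times P$ sample matrix column by column.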

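**Interference:** The Space-Time snapshot may be decomposed as

$$\boldsymbol{\chi} = \alpha_t \mathbf{v}_t + \boldsymbol{\chi}_u \tag{2.4}$$

where $\mathbf{v}_t = \mathbf{v}(\theta, \omega)$ is the target Space-Time steering vector being tested, $\alpha_t$ is the target complex amplitude, and $\boldsymbol{\chi}_u$ is the undesired component, defined as

$$\boldsymbol{\chi}_u = \boldsymbol{\chi}_c + \boldsymbol{\chi}_j + \boldsymbol{\chi}_n \tag{2.5}$$

which consists of clutter $\boldsymbol{\chi}_c$, jamming $\boldsymbol{\chi}_j$, and noise $\boldsymbol{\chi}_n$. Since these are assumed to be mutually uncorrelated, the total covariance matrix is given by

$$\boldsymbol{\Psi}_u = E\{\boldsymbol{\chi}_u \boldsymbol{\chi}_u^H\} = \boldsymbol{\Psi}_c + \boldsymbol{\Psi}_j + \boldsymbol{\Psi}_n \tag{2.6}$$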
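Equation (2.6) follows from Equation (2.5) in one step. A sketch of the expansion is given below, under the additional assumption (not stated explicitly in the text) that the three components are zero-mean, so that mutual uncorrelatedness makes every cross term vanish:

```latex
\begin{aligned}
\boldsymbol{\Psi}_u
  &= E\left\{ (\boldsymbol{\chi}_c + \boldsymbol{\chi}_j + \boldsymbol{\chi}_n)
              (\boldsymbol{\chi}_c + \boldsymbol{\chi}_j + \boldsymbol{\chi}_n)^H \right\} \\
  &= E\{\boldsymbol{\chi}_c \boldsymbol{\chi}_c^H\}
   + E\{\boldsymbol{\chi}_j \boldsymbol{\chi}_j^H\}
   + E\{\boldsymbol{\chi}_n \boldsymbol{\chi}_n^H\}
   + \sum_{a \neq b} \underbrace{E\{\boldsymbol{\chi}_a \boldsymbol{\chi}_b^H\}}_{=\,\mathbf{0}},
   \qquad a, b \in \{c, j, n\} \\
  &= \boldsymbol{\Psi}_c + \boldsymbol{\Psi}_j + \boldsymbol{\Psi}_n
\end{aligned}
```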

**2.2.2 Full Rank STAP**

Referring to the Finite Impulse Response (FIR) STAP architecture in Figure 2.4, the filter weights are applied to the snapshot vector via a complex inner product that produces a complex scalar output $y$:

$$y = \mathbf{w}^H \boldsymbol{\chi} \tag{2.7}$$

where $\mathbf{w}$ is the general weight vector and the Space-Time snapshot vector for the range bin under test is $\boldsymbol{\chi} = \alpha_t \mathbf{v}_t + \boldsymbol{\chi}_u$ as defined in Equation (2.4); the generally complex amplitude $\alpha_t$ is zero if no target energy is present. The general weight vector $\mathbf{w}$ is often chosen adaptively using training data in an attempt to maximize the average SINR.

The optimal weight vector that maximizes the SINR while maintaining unity gain in the desired Space-Time signal direction is well known to be

$$\mathbf{w}_{MVDR} = \mu \boldsymbol{\Psi}_u^{-1} \mathbf{v}_t \tag{2.8}$$

where $\mathbf{v}_t$ is the target Space-Time steering vector being tested and $\mu = 1/(\mathbf{v}_t^H \boldsymbol{\Psi}_u^{-1} \mathbf{v}_t)$ is a complex constant. The filter is also called a Minimum Variance Distortionless Response (MVDR) beamformer because of the single main-beam constraint used. The optimal solution is never found in practice because it requires an infinite number of independent and identically distributed snapshot sample vectors. In the Sample Matrix Inversion (SMI) method, an estimate $\hat{\boldsymbol{\Psi}}_u$ of the interference covariance matrix is inverted and substituted for the true covariance matrix in Equation (2.8) to form the adaptive weight vector, which is an estimate of the optimal weight vector.

Specifically, the filter $\hat{\mathbf{w}}_{MVDR}$ realized from this estimate has been shown to produce, on average, an SINR within 3 dB of the optimum when applied to the data. Note that once the covariance matrix is estimated, it must be inverted in order to solve for the adaptive weight vector. Since matrix inversion is computationally intensive, alternative methods are presented and evaluated in the following chapter so that the computation can be performed efficiently in our implementation.
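The weight computation of Equation (2.8) can be illustrated without forming an explicit inverse, by solving the linear system $\boldsymbol{\Psi}_u \mathbf{z} = \mathbf{v}_t$ and scaling the result. The sketch below uses plain Gaussian elimination with partial pivoting as a stand-in solver; it is not the method used in this thesis, whose kernel subroutines are QR decomposition with forward and backward substitution. All function names and the $2 \times 2$ covariance matrix are assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def inner(w, x):
    """Complex inner product w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


def mvdr_weights(Psi, v):
    """w = mu * Psi^{-1} v with mu = 1 / (v^H Psi^{-1} v)."""
    z = solve(Psi, v)      # z = Psi^{-1} v, no explicit inverse formed
    mu = 1 / inner(v, z)
    return [mu * zi for zi in z]


# Illustrative Hermitian, positive-definite covariance and steering vector.
Psi = [[2 + 0j, 1j],
       [-1j, 2 + 0j]]
v_t = [1 + 0j, 1j]

w = mvdr_weights(Psi, v_t)
gain = inner(w, v_t)  # response in the steering direction
```

For this toy input the weights come out as `[0.5, 0.5j]`, and `gain` equals 1 up to rounding, which is the distortionless (unity gain) constraint.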

Because signals nearly aligned with the target vector are interpreted by the adaptive filter as interference and consequently cancelled, the range bin under test and nearby range cells are typically excluded from the interference covariance estimation. Since the interference in the remaining range cells is statistical in nature, it can be estimated by averaging over those cells. The well-known maximum likelihood estimate of $\Psi_u$ is formed as

$$\hat{\Psi}_u = \frac{1}{K}\sum_{k=1}^{K} \chi_{uk}\,\chi_{uk}^H \tag{2.9}$$

where $\chi_{uk}$ is the $k$th Space-Time snapshot used in the estimate, which may be from any range bin.
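As an illustration of equations (2.8) and (2.9), the following is a minimal NumPy sketch of SMI weight computation. The function name `smi_mvdr_weights`, the random training data, and the steering vector are assumptions made only for this example; they are not taken from the benchmark.

```python
import numpy as np

def smi_mvdr_weights(snapshots, v_t):
    """Estimate the covariance (eq. 2.9) and form MVDR weights (eq. 2.8).

    snapshots: complex array of shape (K, n), one training snapshot per row.
    v_t:       complex steering vector of shape (n,).
    """
    K = snapshots.shape[0]
    # Sample covariance: (1/K) * sum_k chi_k chi_k^H
    psi_hat = snapshots.T @ snapshots.conj() / K
    # Solve psi_hat * z = v_t instead of forming the inverse explicitly.
    z = np.linalg.solve(psi_hat, v_t)
    mu = 1.0 / (v_t.conj() @ z)
    return mu * z

rng = np.random.default_rng(0)
n, K = 8, 64                                   # illustrative sizes only
chi = rng.standard_normal((K, n)) + 1j * rng.standard_normal((K, n))
v_t = np.exp(1j * np.pi * 0.3 * np.arange(n))  # hypothetical steering vector
w = smi_mvdr_weights(chi, v_t)
gain = w.conj() @ v_t                          # distortionless constraint: w^H v_t = 1
```

The sketch solves a linear system rather than inverting the matrix, which anticipates the factorization-based approach used later in the thesis.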

**Figure 2.6. Taxonomy of reduced-dimension STAP algorithms.**

For most airborne radar applications, the implementation of fully adaptive STAP is not feasible due to the computational complexity of the weight computation process and the amount of data required to train the adaptive weights. Therefore, sub-optimal adaptive techniques are used in practice. Four classes of STAP algorithms are distinguished by the type of processing applied before adaptive processing: element-space pre-Doppler, element-space post-Doppler, beam-space pre-Doppler and beam-space post-Doppler, as shown in


Figure 2.6. This figure divides STAP architectures into four basic types, according to the type of preprocessor or the data domain in which the adaptive processing is performed. Element-space approaches retain full spatial adaptivity but reduce the dimensionality through temporal preprocessing on each element. Temporal preprocessing may simply select a small number of pulses (pre-Doppler), or it may filter the pulses on each element or beam (post-Doppler). The former type of approach is termed pre-Doppler, as full CPI filtering occurs after adaptation.

The implementation focuses on element-space post-Doppler adaptive techniques that provide increasingly more effective clutter mitigation at the cost of higher processing throughput requirements. The benchmark is based on the hard case in [6] called third-order Doppler-factored STAP.

**Chapter 3**

**Functional Specification of STAP System**

As discussed in the previous chapter, STAP algorithms perform two-dimensional filtering, using a phased-array antenna with multiple spatial channels (spatial dimension) and multiple pulses per channel (temporal dimension), to cancel Doppler-spread clutter and interference. The STAP weight vector is formed from the statistics of the interference environment. This weight vector is applied to the coherent samples collected by the radar.

The critical problem in implementing STAP algorithms is the overwhelming computational complexity placed on the radar platform processor, because the weight vectors need to be updated within real-time requirements. Our goal is to provide a benchmarking methodology that can be used to evaluate the CBE processor for supporting STAP applications.

The STAP benchmark in this thesis is based on the RT_STAP benchmark developed by MITRE Corporation [6]. The benchmark represents a processing mode with $L$ spatial channels sampled at 5 MHz and performs Doppler-factored STAP with order $Q$ adaptive nulling along with pulse compression and Doppler filtering operations ($L = 22$, $Q = 3$). The whole processing flow needs to be finished within an interval of 32.5 milliseconds. In this chapter, the functional and timing specification of the application is presented, and the computational complexity is elaborated.

**3.1 Preprocessing**

Our benchmark includes the implementation of the data preprocessing typically performed before the application of STAP. Preprocessing usually includes conversion of the received radar signals to in-phase and quadrature (I/Q) samples at baseband, array calibration, and pulse compression. In the past, preprocessing has been implemented using special-purpose hardware. However, implementing


the preprocessing functions within the CBE communication fabric should significantly reduce the number of interfaces, thus simplifying the system.

A CPI corresponding to $L$ channels, $P$ Pulse Repetition Intervals (PRIs), and $N$ time samples per PRI must be processed by the STAP algorithms. As input, these data samples are real-valued integers. As shown in Figure 3.1, the preprocessing functions are applied to the A/D data samples independently across the $L$ channels.

**Figure 3.1. Preprocessing block diagram for a single-array channel**

**3.1.1 I/Q Conversion**

I/Q conversion is used to demodulate the signal to baseband and generate digital samples at a certain sampling rate. In many cases, digital samples are generated at an Intermediate Frequency (IF) and sampled at a higher rate than required to accurately represent the baseband signal. For our system, the digital signal must be demodulated to baseband, low-pass filtered, and decimated to a lower sample rate.

Demodulation is performed by multiplying the data with demodulation coefficients (i.e. a complex sinusoid) to translate the signal to baseband. The multiplication is applied to both the in-phase and quadrature-phase components of the data. It requires $2N$ floating-point operations per channel and PRI, resulting in a total of $L \times P \times 2N$ floating-point operations.

The samples are then processed by a low-pass filter to remove aliased frequency components. The low-pass filter is a real-valued filter of length $K_a$. Decimation is achieved simply by keeping the samples corresponding to the desired sample rate. To minimize computational complexity, direct linear convolution is chosen instead of Fast Fourier Transform (FFT) based convolution. The reason is that decimation takes place as part of this process: direct linear convolution provides the most efficient implementation of the low-pass filter because only one of every $D$ output samples needs to be computed ($D$ is the decimation rate). Using this result, FIR filtering and decimation require $2 \times K_a - 1$ floating-point operations per sample. In total, this takes $L \times P \times N_D \times (2 \times K_a - 1) \times 2$ floating-point operations, where $N_D$ is the number of decimated filter output samples.
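The saving from computing only every $D$th output can be sketched as follows. This is a minimal illustration, not the benchmark code: the function name `fir_decimate`, the Hamming-shaped taps, and the random input are placeholders chosen for the example.

```python
import numpy as np

def fir_decimate(x, h, D):
    """Low-pass filter x with real taps h, computing only every D-th output."""
    Ka = len(h)
    # Prepend Ka-1 zeros so every kept output has a full window of inputs.
    xp = np.concatenate([np.zeros(Ka - 1, dtype=x.dtype), x])
    n_out = len(x) // D
    y = np.empty(n_out, dtype=x.dtype)
    for m in range(n_out):
        n = m * D                        # index of the output sample to keep
        window = xp[n:n + Ka]            # samples x[n-Ka+1 .. n]
        y[m] = np.dot(window, h[::-1])   # Ka multiplies and Ka-1 adds per component
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(1920) + 1j * rng.standard_normal(1920)
h = np.hamming(36) / np.hamming(36).sum()     # placeholder low-pass taps
y = fir_decimate(x, h, D=4)
reference = np.convolve(x, h)[:len(x)][::4]   # filter every sample, then discard
```

Both paths give the same 480 outputs, but the direct form never computes the three out of every four samples that decimation would throw away.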

However, we suppose the data received is demodulated and low-pass filtered so that I/Q conversion is not part of our implementation.


**3.1.2 Array Calibration**

Array calibration measures the antenna response and equalizes it across all channels. In general, variations over time in the response of the antenna elements, receivers, and other components of the array introduce unknown amplitude and phase variations in the data. For wideband signals, these variations are often a function of frequency, and such frequency-dependent variations degrade the ability of STAP algorithms to null undesired interference.

The calibration can be achieved by applying an FIR filter to the data, with filter coefficients designed to equalize the antenna response. The filter coefficients are typically determined offline. The most efficient implementation uses either overlap-save or overlap-add fast convolution techniques. In addition, combining the computation of array calibration and pulse compression can reduce the computational complexity even further, as discussed in Section 3.1.4.

**3.1.3 Pulse Compression**

Pulse compression allows relatively long pulses with low peak power to be transmitted while still achieving high signal energy and improved detection performance. Pulse compression is implemented by an FIR filter with coefficients matched to the signal waveform. After this "matched filter", the pulse has a duration equivalent to the inverse of the transmitted signal bandwidth. The filter coefficients are matched to the transmitted signal, with a taper applied to reduce range sidelobes and improve range resolution. As noted in the previous section, it is most efficient to combine the implementation of calibration and pulse compression, which is described in the following section.

**3.1.4 Combination of Array Calibration and Pulse Compression**

The combined filter for array calibration and pulse compression has length $K_{cp} = K_c + K_p$, where $K_c$ is the length of the calibration filter and $K_p$ is the length of the pulse compression filter. Fast convolution techniques based on either the overlap-add or the overlap-save method represent the most efficient implementation of the linear convolution. In our implementation, overlap-save is chosen and is presented in the subsequent discussion.

The overlap-save method is explained in Figure 3.2. As a first step, $K_{cp} - 1$ leading zeros are prepended to the complex data. The extended data sequence is then divided into $B$ overlapped segments of length $L + K_{cp} - 1$, where $B = \lceil N_D / N_{fft} \rceil$ and $N_{fft}$ is the length of the discrete linear convolution. Each segment overlaps its neighbour by $K_{cp} - 1$ samples. Each data segment is circularly convolved with the sequence of filter coefficients using FFT techniques. FFTs are applied to both the data block and the sequence of filter coefficients, with both sequences zero padded so that the length of the FFT is a power of two.


**Figure 3.2. Overlap-save method**

Once the FFTs are computed, the transformed sequences are multiplied and an inverse FFT is applied to the result to obtain the time-domain representation of the circular convolution. Samples 1 through $K_{cp} - 1$ are discarded and the remaining samples from the $B$ data segments are assembled to form the final output of the preprocessing.
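The overlap-save procedure described above can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `overlap_save`, the 32-tap placeholder filter, and the way the segment count is derived from the step size are assumptions of the example, not details taken from the benchmark.

```python
import numpy as np

def overlap_save(x, h, nfft):
    """Linear convolution of x with h via overlap-save, truncated to len(x)."""
    Kcp = len(h)
    step = nfft - (Kcp - 1)                 # new samples consumed per segment
    H = np.fft.fft(h, nfft)                 # zero-padded filter spectrum
    n_seg = -(-len(x) // step)              # ceiling division
    # Kcp-1 leading zeros, plus trailing zeros to fill the last segment.
    xp = np.concatenate([np.zeros(Kcp - 1, dtype=complex), x,
                         np.zeros(n_seg * step - len(x), dtype=complex)])
    out = []
    for b in range(n_seg):
        seg = xp[b * step: b * step + nfft]
        y = np.fft.ifft(np.fft.fft(seg) * H)   # circular convolution of one segment
        out.append(y[Kcp - 1:])                # discard the first Kcp-1 samples
    return np.concatenate(out)[:len(x)]

rng = np.random.default_rng(4)
x = rng.standard_normal(480) + 1j * rng.standard_normal(480)
h = np.hamming(32)                          # placeholder combined filter
y = overlap_save(x, h, nfft=256)
reference = np.convolve(x, h)[:len(x)]      # direct linear convolution
```

The discarded samples are exactly those corrupted by the wrap-around of the circular convolution, which is why the assembled output matches the direct linear convolution.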

Each FFT or inverse FFT takes $5 \times N_{fft} \times \log_2 N_{fft}$ floating-point operations. Multiplication of the transformed sequences requires $6 \times N_{fft}$ floating-point operations per data block. In total, $L \times P \times 3 \times (10 \times N_{fft} \times \log_2 N_{fft} + 6 \times N_{fft})$ floating-point operations are required.

**3.2 Post-Doppler Adaptive Processing**

Modern airborne radar systems must have the capacity to detect targets having relatively small cross-sections in the presence of strong clutter and interference. Due to the motion of the radar platform, the clutter returns have a non-zero Doppler shift that depends on the azimuth of the clutter source. Space-Time Adaptive Processing algorithms have been developed to directly address the cancellation of Doppler-spread clutter, as shown in Figure 2.3.

As shown in Figure 2.6, there are four types of STAP architectures, and element-space post-Doppler STAP is the one implemented in this report. The element-space approach retains full spatial adaptivity but reduces the temporal dimensionality, and post-Doppler means that the temporal preprocessing consists of filtering the pulses on each element, followed by weight compensation. The Doppler filter, with its potential for low sidelobes, can suppress portions of the clutter ridge, thereby localizing the


competing clutter in angle. In factored post-Doppler STAP, separate spatial adaptive processing is done in each Doppler bin. The weight update rate with post-Doppler STAP is once per CPI.

**3.2.1 Doppler Filtering**

The first component of the post-Doppler adaptive algorithm is Doppler processing, which transforms the signals from the time domain to the frequency domain to reduce the size of the STAP problem. A precomputed window function is first applied to the data to reduce spectral leakage. Doppler filtering is then implemented by applying a discrete Fourier transform of length $K$ across the $P$ pulses of the preprocessed data for a given range cell and channel, where $K$ represents the number of Doppler cells to be processed. In practice, the discrete Fourier transform is implemented using an FFT, and the data samples are zero padded so that the length of the FFT is a power of two. The radix-2 FFT employed in our implementation is presented in Section 6.1.1.

Applying the real-valued window function of length $P$ across all pulses of a given range cell and channel requires $2P$ floating-point operations. A $K$-point FFT takes $5 \times K \times \log_2 K$ floating-point operations. Therefore, $5 \times K \times \log_2 K + 2 \times P$ operations are needed for Doppler filtering of each range cell and channel. In total, $L \times R \times (5 \times K \times \log_2 K + 2 \times P)$ floating-point operations are required to implement Doppler filtering.
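Windowing followed by an FFT across the pulse axis can be sketched as below. The array layout (channel, pulse, range), the Hann window, and the function name `doppler_filter` are assumptions made for illustration; the thesis itself uses a hand-written radix-2 FFT rather than a library call.

```python
import numpy as np

def doppler_filter(cube, K):
    """cube: (L, P, R) complex samples -> (L, K, R) Doppler-domain samples."""
    P = cube.shape[1]
    window = np.hanning(P)                      # placeholder real-valued window
    tapered = cube * window[None, :, None]      # 2P real multiplies per (channel, range)
    # np.fft.fft zero-pads along the pulse axis when K > P.
    return np.fft.fft(tapered, n=K, axis=1)

L, P, R, K = 22, 64, 480, 64
n = np.arange(P)
cube = np.zeros((L, P, R), dtype=complex)
cube[:, :, :] = np.exp(2j * np.pi * 10 * n / P)[None, :, None]  # tone in Doppler bin 10
spec = doppler_filter(cube, K)
peak_bin = int(np.argmax(np.abs(spec[0, :, 0])))
```

With a test tone placed exactly on Doppler bin 10, the windowed spectrum peaks in that bin for every channel and range cell.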

**3.2.2 Weight Computation**

The high-order Doppler-factored STAP algorithm is one of the most effective STAP techniques known for clutter and interference suppression. Figure 3.3 shows the high-order Doppler-factored STAP architecture of order $Q$, where $Q$ is three in our implementation. The architecture consists of Doppler processing across all PRIs followed by adaptive filtering across sensors and adjacent Doppler bins. Adaptive filtering of the data simultaneously uses spatial and temporal degrees of freedom (DOF) in each specified Doppler bin. The spatial DOF are provided by the $L$ array channels, while the temporal DOF are provided by the $Q$ adjacent Doppler bins centered on the specified Doppler bin.

For third-order Doppler-factored STAP, we define $k_l$ and $k_r$ as the Doppler bins adjacent to the Doppler bin $k_c$. The $\hat{L} \times 1$ Space-Time snapshot vector is defined to be:

$$\vec{y}(k,r) = [\vec{x}(k_l,r)\ \ \vec{x}(k_c,r)\ \ \vec{x}(k_r,r)]^T \tag{3.1}$$

where $\hat{L} = L \times 3$ and each $\vec{x}(k,r)$ contains all channel samples at the $r$th range cell and $k$th Doppler bin.

As illustrated in equation (2.8), the adaptive weights are obtained by solving the equation:

$$\vec{\Psi}(k,r)\,\vec{w}(k,r) = \mu\,\vec{v}_t \tag{3.2}$$

where $\mu$ is a scale factor, $\vec{v}_t$ is the target Space-Time steering vector, and $\vec{\Psi}(k,r)$ is the Space-Time covariance matrix computed by:

$$\vec{\Psi}(k,r) = E\{\vec{y}(k,r)\,\vec{y}^H(k,r)\} \tag{3.3}$$

From equation (3.2), we know that the high-order Doppler-factored STAP algorithm clearly depends on knowledge of the Space-Time covariance matrix. For most practical applications, this matrix is unknown and must be estimated from the data samples. In general, an estimate of the covariance matrix is computed by averaging over snapshot vectors from adjacent range cells.

Following [6], the training strategy for selecting the snapshot vectors involves dividing the range cells into $M$ non-overlapping blocks containing $N_R$ range samples each, where $M = N_D/N_R$. The covariance matrix for the $k$th Doppler bin and $m$th block of contiguous range cells is computed by averaging the outer products of the snapshots:

$$\vec{\Psi}'(k,r) = \frac{1}{N_R}\sum_{r=r_1}^{r_1+N_R-1} \vec{y}(k,r)\,\vec{y}^H(k,r) \tag{3.4}$$

where $\vec{\Psi}'(k,r)$ is the estimate of the covariance matrix and $r_1$ is the first range cell of the block. This estimate is substituted for the covariance matrix in equation (3.2) to compute the adaptive weight vector.

If the $\hat{L} \times N_R$ Space-Time data matrix $\vec{X}$ is defined to be:

$$\vec{X}(k,m) = [\vec{y}(k, mN_R)\ \ \vec{y}(k, mN_R+1)\ \cdots\ \vec{y}(k,(m+1)N_R-1)] \tag{3.5}$$

then equation (3.4) can be rewritten as

$$\vec{\Psi}'(k,r) = \frac{1}{N_R}\,\vec{X}(k,m)\,\vec{X}^H(k,m) \tag{3.6}$$

Then equation (3.2) becomes:

$$\vec{X}(k,m)\,\vec{X}^H(k,m)\,\vec{w}(k,r) = \mu N_R\,\vec{v}_t \tag{3.7}$$

Solving this equation is the main computation in our implementation of STAP. Obtaining the weight vector $\vec{w}$ directly requires inverting the product $\vec{X}(k,m)\vec{X}^H(k,m)$, and the cost of matrix inversion grows rapidly with the matrix size. The alternative is to first factorize the matrix and then solve the equation using the simpler factors.

**QR decomposition**

The weight vector is computed by first performing a QR decomposition on the full-rank Space-Time data matrix $\vec{X}(k,m)$ that appears in equation (3.7). The QR decomposition produces an $N_R \times N_R$ unitary matrix $\vec{Q}$ and an $N_R \times \hat{L}$ upper triangular matrix $\vec{R}$ such that $\vec{X}^T = \vec{Q}\vec{R}$. The matrix $\vec{R}$ can be written as $[\vec{R}_1^T\ \ \vec{0}]^T$, where $\vec{R}_1$ is an $\hat{L} \times \hat{L}$ full-rank upper triangular matrix. The matrix product $\vec{X}(k,m)\vec{X}^H(k,m)$ decomposes to

$$\vec{X}(k,m)\,\vec{X}^H(k,m) = \vec{R}^T\vec{Q}^T\vec{Q}^*\vec{R}^* = \vec{R}_1^T\vec{R}_1^* \tag{3.8}$$

where $\vec{Q}^T\vec{Q}^* = I$. There are a variety of methods to implement QR decomposition. The Modified Gram-Schmidt (MGS) method is chosen due to its lower computational complexity; more discussion is presented in Section 6.1.3. MGS takes $8 \times N_R \times L^2 \times Q^2$ floating-point operations per matrix. A total of $K \times M \times (8 \times N_R \times L^2 \times Q^2)$ floating-point operations are required for all Doppler bins and all blocks of range cells.
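A compact reference version of Modified Gram-Schmidt, written to check equation (3.8) numerically, might look like the sketch below. It computes the thin factorization, which yields $\vec{R}_1$ directly; the function name `mgs_qr` and the random test matrix are assumptions of the example, and the optimized SPE implementation in the thesis is organized differently.

```python
import numpy as np

def mgs_qr(A):
    """Thin QR of A (n x m, n >= m, full column rank) by Modified Gram-Schmidt."""
    n, m = A.shape
    Q = A.astype(complex).copy()
    R = np.zeros((m, m), dtype=complex)
    for j in range(m):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]                        # complex-by-real division
        # Project the new direction out of all remaining columns at once.
        R[j, j + 1:] = Q[:, j].conj() @ Q[:, j + 1:]
        Q[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R

rng = np.random.default_rng(2)
L_hat, N_R = 66, 240                              # L * Q = 66 rows, 240 range cells
X = rng.standard_normal((L_hat, N_R)) + 1j * rng.standard_normal((L_hat, N_R))
Q, R1 = mgs_qr(X.T)                               # X^T = Q R1
lhs = X @ X.conj().T                              # X X^H
rhs = R1.T @ R1.conj()                            # R1^T R1^*
```

Because each column is orthogonalized against the already updated remaining columns, MGS is numerically better behaved than classical Gram-Schmidt at the same operation count.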

**Forward and Backward Substitution**

Equation (3.7) can be rewritten as

$$\vec{X}(k,m)\,\vec{X}^H(k,m)\,\vec{w}(k,r) = \vec{R}_1^T\vec{R}_1^*\,\vec{w}(k,r) = \mu\,\vec{v}_t \tag{3.9}$$

where the constant $N_R$ of equation (3.7) has been absorbed into the scale factor $\mu$. The vector $\vec{w}$ is obtained by first applying forward elimination to solve $\vec{R}_1^T\vec{p} = \mu\vec{v}_t$, where $\vec{R}_1^T$ is a lower triangular matrix, and then applying backward substitution to solve $\vec{R}_1^*\vec{w}(k,r) = \vec{p}$, where $\vec{R}_1^*$ is an upper triangular matrix.

Forward and backward substitution each require $4 \times L^2 \times Q^2$ floating-point operations. In total, $K \times M \times 2 \times (4 \times L^2 \times Q^2)$ operations are required.
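The two triangular solves can be sketched as follows. This is a small illustrative example with assumed sizes, a placeholder steering vector and $\mu = 1$; a library QR stands in for MGS so that the block is self-contained, and the function names are hypothetical.

```python
import numpy as np

def forward_substitution(Lo, b):
    """Solve Lo p = b for a lower-triangular matrix Lo."""
    n = len(b)
    p = np.zeros(n, dtype=complex)
    for i in range(n):
        p[i] = (b[i] - Lo[i, :i] @ p[:i]) / Lo[i, i]
    return p

def backward_substitution(U, p):
    """Solve U w = p for an upper-triangular matrix U."""
    n = len(p)
    w = np.zeros(n, dtype=complex)
    for i in range(n - 1, -1, -1):
        w[i] = (p[i] - U[i, i + 1:] @ w[i + 1:]) / U[i, i]
    return w

rng = np.random.default_rng(3)
n, N_R = 6, 24                          # small illustrative sizes
X = rng.standard_normal((n, N_R)) + 1j * rng.standard_normal((n, N_R))
R1 = np.linalg.qr(X.T)[1]               # library QR stands in for MGS here
v_t = np.ones(n, dtype=complex)         # placeholder steering vector, mu = 1
p = forward_substitution(R1.T, v_t)     # R1^T p = v_t
w = backward_substitution(R1.conj(), p) # R1^* w = p
residual = np.linalg.norm(X @ X.conj().T @ w - v_t)
```

The residual confirms that the two substitutions together solve $\vec{X}\vec{X}^H\vec{w} = \vec{v}_t$ without ever forming or inverting the covariance matrix.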

**Weights Application**

At the output of STAP, $N_R$ values are given by the product of the weight vector and the data matrix:

$$\vec{w}^H(k,m)\,\vec{X}(k,m) \tag{3.10}$$

Detection algorithms can then be applied to the result to locate targets in range and Doppler.

This process requires $8 \times L \times Q \times N_R$ floating-point operations per Doppler bin per block of range cells. In total, $K \times M \times (8 \times L \times Q \times N_R)$ floating-point operations are required.

**3.3 Computation issues**

Definitions of the variables used in Table 3.1 are given below:

$L$: Number of channels (22)

$P$: Number of pulses per CPI (64)

$N$: Number of samples per pulse before decimation (1920)

$D$: Decimation factor (4)

$N_D$: Number of samples per pulse after decimation, $N_D = \lfloor N/D \rfloor$ (480)

$K_a$: FIR filter length used for anti-aliasing in I/Q conversion (36)

$N_{fft}$: FFT size of the combined filter for array calibration and pulse compression (256)

$K$: Doppler FFT size (64)

$M$: Number of independent non-overlapping blocks $N_D/N_R$ of contiguous range samples used to calculate the adaptive weights (2)

$N_R$: Number of contiguous range cells per weight computation (240)

$Q$: Processing order (3)

| **Function** | **Operation Count** | **Type** |
|---|---|---|
| I/Q Conversion | $L \times P \times (N + N_D \times K_a)$ | mul |
| | $L \times P \times N_D \times (K_a - 1)$ | add |
| Calibration and Pulse Compression | $L \times P \times 3 \times (N_{fft} + \frac{N_{fft}}{2} \times \log_2 N_{fft})$ | mul |
| | $L \times P \times 3 \times (N_{fft} \times \log_2 N_{fft})$ | add |
| Doppler Filtering | $L \times N_D \times (\frac{K}{2} \times \log_2 K + P)$ | mul |
| | $L \times N_D \times (K \times \log_2 K)$ | add |
| QR Decomposition | $K \times M \times (N_R \times L^2 \times Q^2)$ | mul |
| | $K \times M \times (N_R \times L \times Q)$ | comp-real div |
| | $K \times M \times (N_R \times L^2 \times Q^2 - \frac{1}{2} \times L^2 \times Q^2 - \frac{1}{2} \times L \times Q)$ | add |
| Forward Substitution | $K \times M \times (\frac{L^2 \times Q^2}{2})$ | mul |
| | $K \times M \times (\frac{L^2 \times Q^2}{2})$ | add |
| Backward Substitution | $K \times M \times (\frac{L^2 \times Q^2}{2})$ | mul |
| | $K \times M \times (\frac{L^2 \times Q^2}{2})$ | add |
| Weight Apply | $K \times M \times (L \times Q \times N_R)$ | mul |
| | $K \times M \times (L \times Q \times N_R)$ | add |

**Table 3.1. Complex operation counts of STAP**

| **Function** | **Complex Operations** | **Share of total** |
|---|---|---|
| I/Q Conversion | 50,688,000 | 14.6% |
| Calibration and Pulse Compression | 14,057,472 | 4.0% |
| Doppler Filtering | 6,758,400 | 1.9% |
| QR Decomposition | 269,377,152 | 77.8% |
| Forward Substitution | 557,568 | 0.1% |
| Backward Substitution | 557,568 | 0.1% |
| Weight Apply | 4,055,040 | 1.1% |
| Total | 346,051,200 | |

**Table 3.2. Instruction counts of STAP (one complex operation per cycle)**

If we consider each complex operation as one computing instruction, the instruction count is as presented in Table 3.2, for a total of 346,051,200 instructions. The sampling rate is 5 MHz for 1920 samples per pulse, and it reduces to 1.25 MHz after decimation, giving 480 samples per pulse.


| **Function** | **FLOPs** | **Share of total** |
|---|---|---|
| I/Q Conversion | 101,376,000 | 7.75% |
| Calibration and Pulse Compression | 92,995,584 | 7.11% |
| Doppler Filtering | 21,626,880 | 1.65% |
| QR Decomposition | 1,070,530,560 | 81.91% |
| Forward Substitution | 2,097,152 | 0.16% |
| Backward Substitution | 2,097,152 | 0.16% |
| Weight Apply | 16,220,160 | 1.24% |
| Total | 1,306,943,488 | |

**Table 3.3. Floating-point operations of STAP (one floating-point operation per cycle)**

If we instead consider each real-valued floating-point operation as one computing instruction, a complex multiplication takes 6 FLOPs, a complex addition takes 2 FLOPs, and a complex-real division takes 2 FLOPs, and the instruction count is as given in Table 3.3, for a total of 1,306,943,488 instructions. The sampling rate is 5 MHz for 1920 samples per pulse, and it reduces to 1.25 MHz after decimation, giving 480 samples per pulse. Weight computation is the most computation-intensive part, accounting for more than 80% of the computation. Therefore, one of the most important issues in this thesis is to implement the weight computation efficiently.
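The per-stage FLOP formulas given in Sections 3.1 and 3.2 can be evaluated directly with the benchmark parameters, as in the sketch below. The variable names are chosen for this example, and the segment count `B = 3` is simply the factor 3 that appears in the formula of Section 3.1.4. Evaluated this way, all rows reproduce Table 3.3 except the two substitution rows, which come out at 2,230,272 each rather than the listed 2,097,152 (the listed value corresponds to a matrix dimension of 64 instead of $L \times Q = 66$). Dividing the total by the 32.5 ms deadline gives a required sustained rate of roughly 40 GFLOPS.

```python
import math

L, P, N, D, Ka = 22, 64, 1920, 4, 36
N_D = N // D                 # 480 samples per pulse after decimation
N_fft, K, M, N_R, Q = 256, 64, 2, 240, 3
B = 3                        # segments per pulse, the factor 3 in Section 3.1.4

flops = {
    "I/Q Conversion": L * P * 2 * N + L * P * N_D * (2 * Ka - 1) * 2,
    "Calibration and Pulse Compression":
        L * P * B * (10 * N_fft * int(math.log2(N_fft)) + 6 * N_fft),
    "Doppler Filtering": L * N_D * (5 * K * int(math.log2(K)) + 2 * P),
    "QR Decomposition": K * M * 8 * N_R * L**2 * Q**2,
    "Forward Substitution": K * M * 4 * L**2 * Q**2,
    "Backward Substitution": K * M * 4 * L**2 * Q**2,
    "Weight Apply": K * M * 8 * L * Q * N_R,
}
total = sum(flops.values())
deadline_s = 32.5e-3                          # real-time interval from Chapter 3
required_gflops = total / deadline_s / 1e9

for name, count in flops.items():
    print(f"{name:36s} {count:>15,d} {100 * count / total:6.2f}%")
print(f"required throughput: {required_gflops:.1f} GFLOPS")
```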

**Chapter 4**

**Overview of Cell Broadband Engine**

Emerging multimedia, digital entertainment, and other compute-intensive applications have generated an upsurge in demand for computing power. Fortunately, advances in VLSI technology, with the continual scaling down of transistor size over the last decade, have brought large performance improvements. However, increasing the operating frequency, refining the architecture, and lengthening pipelines by using ever more transistors no longer yield the significant returns they once did. With that in mind, the most efficient approach today is to integrate multiple processing units on the same die.

Fundamentally, there are two different approaches to multi-core processing. One is Symmetric MultiProcessing (SMP), adopted by vendors such as Intel and AMD, in which similar processing cores are integrated. The other approach is the so-called tiled processor. The most popular breakthrough tiled processor these days is the Cell processor.

The Cell Broadband Engine (CBE), commonly known as Cell, is a microprocessor developed through the partnership of Sony, Toshiba and IBM (STI). The Cell processor was initially designed for game consoles and media-rich consumer-electronics devices, yet it is flexible enough to serve as a conventional general-purpose microprocessor. Meanwhile, much broader use in a wide variety of application domains, from scientific computing to supercomputing, is envisioned.

**4.1 Architecture**

The Cell Broadband Engine is a single-chip, multi-core, heterogeneous processor that combines a general-purpose 64-bit Power Architecture core, called the Power Processing Element (PPE), with eight streamlined high-performance SIMD RISC engines called Synergistic Processing Elements (SPEs). The latest Cell processor works with a core frequency of up to 3.2 GHz. Connecting these nine cores with the external memory and the input/output interface is the Element Interconnect Bus (EIB), a specialized high-bandwidth circular data bus [3]. Figure 4.1 shows a block diagram of the Cell Broadband Engine.

**Figure 4.1. Cell Broadband Engine (CBE) block diagram**

**4.1.1 Power Processing Element (PPE)**

The Power Processing Element, as the name suggests, is a 64-bit PowerPC architecture Reduced Instruction Set Computer (RISC) core with the Vector/SIMD Multimedia Extensions. In the Cell processor, the PPE runs the operating system and acts as a controller for the eight SPEs, which handle most of the computational workload. The PPE supports two simultaneous hardware threads of execution, which effectively enables it to process two tasks simultaneously. Figure 4.2 shows a block diagram of the PPE.

The PPE consists of two main units, namely the PowerPC Processing Unit (PPU) and the PowerPC Processor Storage Subsystem (PPSS). The PPU has six execution units, including one Vector/Scalar Unit (VSU) for the 128-bit Vector/SIMD Multimedia Extension, which together execute floating-point and Vector/SIMD Multimedia Extension instructions. The PPE supports a conventional cache hierarchy with 32 kB each for the Level-1 (L1) instruction cache and data cache. It can load 32 bytes and store 16 bytes, independently and memory-coherently, per processor cycle. The PPSS handles all memory accesses between the PPE and external memory, the SPEs, or I/O devices over the Element Interconnect Bus. It has a 512 kB unified Level-2 (L2) instruction and data cache with error-correction code (ECC).


**Figure 4.2. Power Processing Element (PPE) block diagram**

**4.1.2 Synergistic Processing Element (SPE)**

The SPE is a new processor architecture with a 128-bit, dual-issue, pipelined SIMD data path that aims to accelerate streaming, data-rich, computation-intensive applications. The SPE consists of two main units, namely the Synergistic Processor Unit (SPU) and the Memory Flow Controller (MFC). Figure 4.3 shows a block diagram of the SPE.

**Figure 4.3. Synergistic Processing Element (SPE) block diagram**

Each SPU contains 256 kB of dedicated local store (LS) memory (four 64 kB SRAM arrays). The local store is fully pipelined and single-ported; it supports 16-bytes-per-cycle load and store bandwidth, quadword-aligned memory access only, and 128-byte instruction fetches and DMA transfers. Because the LS has a single port,


load, store, DMA read, DMA write, and instruction fetch operations compete for the same port. DMA operations are buffered and can access the local store at most once every eight cycles. Instruction fetches occur during idle memory cycles, and up to 3.5 fetches may be buffered in the instruction fetch buffer to better tolerate long runs of consecutive memory instructions that would otherwise cause instruction fetch starvation.

Each SPE has a large register file of 128 general-purpose registers, each 128 bits wide, with six read ports and two write ports. The registers store all available data types (integer, single-precision and double-precision floating-point, scalars, vectors, logicals, bytes, and others). The SPU has two pipelines, the even pipeline and the odd pipeline. With this dual pipeline, the SPU can dispatch and execute up to two instructions per cycle in order, one on each pipeline. The two pipelines contain different types of execution units, so whether an instruction goes to the even or the odd pipeline depends on the execution unit that performs that instruction type. In general, the even pipeline handles floating-point and fixed-point instructions, while memory load/store and permute instructions are handled by the odd pipeline.

The MFC serves as the communication interface, by means of the Element Interconnect Bus (EIB), to external memory, other processing elements and I/O devices. The MFC operates by means of Direct Memory Access (DMA), which performs bulk instruction and data transfers between the LS and external memory. The MFC supports naturally aligned transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes, with a maximum transfer size of 16 kB. Besides the DMA mechanism, the MFC on each SPE uses mailboxes and signal-notification messaging to allow explicit inter-processor communication and synchronization between the processing elements in the system. Mailboxes are intended for passing messages of up to 32 bits in length, for short data transfers such as completion flags and storage addresses. Signaling is similar to a mailbox but can be configured for a message accumulation mode (overwrite mode or logical OR mode).
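The transfer-size rules above determine how a large buffer has to be broken into individual DMA commands. The following sketch only models that bookkeeping in Python; it is not SPE code, the helper names `is_valid_dma_size` and `split_transfer` are hypothetical, and the example buffer assumes single-precision complex samples (8 bytes each).

```python
MAX_DMA_BYTES = 16 * 1024   # maximum single transfer size stated above

def is_valid_dma_size(nbytes):
    """True if nbytes is 1, 2, 4, 8, or a multiple of 16 up to 16 kB."""
    if nbytes in (1, 2, 4, 8):
        return True
    return 0 < nbytes <= MAX_DMA_BYTES and nbytes % 16 == 0

def split_transfer(total_bytes):
    """Split a 16-byte-multiple buffer into (offset, size) chunks of at most 16 kB."""
    if total_bytes <= 0 or total_bytes % 16 != 0:
        raise ValueError("this sketch only handles positive multiples of 16 bytes")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

# One range block: 240 snapshots x 66 complex values x 8 bytes (assumed layout).
block_bytes = 240 * 66 * 8
chunks = split_transfer(block_bytes)
```

Under these assumptions, one $66 \times 240$ data matrix occupies 126,720 bytes, about half of the 256 kB local store, and needs eight DMA commands to move.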

Overall, the SPE does not work like a conventional superscalar CPU, since it has no cache, virtual memory support or memory coherency. The objective of this memory hierarchy is to address the "memory wall" limitation of conventional architectures, that is, to avoid the penalty of up to a thousand cycles incurred when data is not available in the cache. Instructions and data are transferred between main memory and the local store of the respective SPE, or between the local stores of different SPEs, using asynchronous direct memory access (DMA). This allows an SPE to have many concurrent memory accesses in flight, and DMA latency can easily be hidden with suitable programming techniques.

The SPE omits hardware branch prediction and scheduling logic, and branches are assumed not taken. A mispredicted branch flushes the pipeline and incurs a high penalty of 18 cycles. To override the default branch decision, the architecture supports a branch hint instruction. The branch hint can prefetch


up to 32 instructions, and a correctly hinted taken branch incurs no penalty.

**4.1.3 Element Interconnect Bus (EIB)**

The EIB is a circular bus made of two 128-bit data channels running in opposite directions at half the CPU clock speed. It allows data communication from the PPE and L2 cache to the SPEs and vice versa. Each channel can convey up to 3 simultaneous transfers. Each processor element has one on-ramp and one off-ramp, which can drive and receive data simultaneously. The EIB is also connected to the Memory Interface Controller and to the FlexIO for external communications. Theoretically, the EIB's internal bandwidth is 96 bytes per processor-clock cycle (384 GB/s). The EIB can support more than 100 outstanding DMA requests.

**4.1.4 Memory and I/O**

The Memory Interface Controller (MIC) of the Cell supports one or two high-speed Rambus Extreme Data Rate (XDR) memory interfaces. It has a memory bandwidth of 25.6 GB/s (dual 12.8 GB/s channels). The total memory that can be supported is configurable between 64 MB and 64 GB of XDR DRAM.

The Cell Broadband Engine Interface (BEI) unit is the interface that the processor uses to communicate with the I/O devices of the system. It consists of the Broadband Interface Controller (BIC), the I/O Controller (IOC) and the Internal Interrupt Controller (IIC). The BEI supports two Rambus FlexIO interfaces. One is a non-coherent I/O interface, used for I/O devices such as sound cards and video cards. The other, which supports both non-coherent and coherent protocols, can be used to coherently extend the EIB to link up with other Cell processors for sending and receiving data. Across both FlexIO interfaces, the bandwidth is 76.8 GB/s.

**4.2 Programming Toolchain**

In essence, the heterogeneous multi-core nature of the Cell processor means that developing an application introduces additional sources of complexity. Low-level programming models have demonstrated very competitive performance, but making them work is the difficult part. IBM provides a broad array of resources to help programmers exploit the performance of the Cell processor.

**4.2.1 Compiler**

Currently, there are two compiler sets available for the Cell BE processor: a modified GCC compiler from Sony and the newly released XL C/C++ compiler from IBM. Both compilers are available for both the PPU and the SPU.

The famous IBM XL C/C++ compiler has been updated to fully exploit the PPE and SPE of the Cell Broadband Engine Architecture. This compiler provides


state-of-the-art support for user-directed exploitation of parallelization and partitioning over the wide range of heterogeneous parallelism offered by the Cell processor. The user directives adopted to communicate with the Cell processor are based on the OpenMP programming model. This approach allows the programmer to view the computer system as possessing a single shared-memory address space in which all program data reside [1].

Both compilers come with a rich set of C/C++ language extension intrinsics for the SPE and VMX that greatly simplify SIMD programming. The intrinsics give precise control over SIMD instructions and data layout while leaving instruction scheduling and register allocation to the compiler. This allows the programmer to control high-level transformations such as loop unrolling while still benefiting from low-level optimization.

Programming the SPE is significantly enhanced by the automatic SIMDization framework in the XL C compiler. SIMDization is the process of extracting SIMD parallelism from scalar code. This optimization is mainly concerned with extracting SIMD parallelism from the code structures listed below:

**Loop-level SIMDization:** Similar to vectorization of innermost loops, instances of a statement in consecutive iterations of a loop are aggregated. This SIMDization is particularly successful at extracting SIMD parallelism in loops and at recognizing certain vectorizable patterns such as parallel reductions.

**Basic-Block Level SIMDization:** Isomorphic computations on adjacent memory are aggregated into virtual vectors. Such SIMDization is particularly successful at extracting SIMD parallelism in loops that have been unrolled, either manually by programmers or automatically by the compiler.
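The effect of loop-level SIMDization on a reduction can be emulated in a few lines. The sketch below is only a Python model of the transformation, not compiler output: a 128-bit register holds four single-precision values, so the scalar loop is replaced by a loop over four-element vectors with one partial sum per lane, followed by a horizontal reduction and a scalar epilogue. The function names are hypothetical.

```python
import numpy as np

def scalar_sum(a):
    total = np.float32(0.0)
    for value in a:                 # one element per iteration
        total += value
    return total

def simd_style_sum(a, lanes=4):
    """Emulate a loop-level SIMDized reduction with `lanes` partial sums."""
    n_vec = len(a) // lanes
    acc = np.zeros(lanes, dtype=np.float32)        # models one 128-bit register
    for i in range(n_vec):
        acc += a[i * lanes:(i + 1) * lanes]        # one vector add per iteration
    total = acc.sum()                              # final horizontal reduction
    for value in a[n_vec * lanes:]:                # scalar epilogue for leftovers
        total += value
    return total

a = np.arange(1, 103, dtype=np.float32)            # 102 elements: 25 vectors + 2
s_scalar = scalar_sum(a)
s_simd = simd_style_sum(a)
```

The vectorized form executes roughly a quarter of the loop iterations, which is the kind of gain the SIMDization framework tries to obtain automatically.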

The XL C compiler also supports the OpenMP programming model to guide parallelization decisions. A key component of this approach is presenting the system through the abstraction of a single shared-memory address space. In the Cell processor, however, each local memory is addressable only by its own SPE, and each SPE moves data between its local memory and main memory primarily through DMA transfers. The compiler therefore provides this abstraction through a compiler-controlled software cache that permits reuse of temporary buffers in the SPE local store. In this mechanism, SPE program data reside in system memory, and the compiler automatically manages the data transfer between their location in system memory and a temporary buffer in the respective local memory. Instead of plain load and store instructions, this approach uses procedural analysis together with instructions that explicitly look up the effective address of the data in a directory. If the desired data is present in local memory, the local address of the requested variable is computed and the local copy is used. Otherwise, a miss-handler subroutine is invoked and the requested data is transferred from system memory.
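The lookup and miss-handling steps of such a software cache can be sketched in portable C. This is an illustration under stated assumptions, not the XL compiler's actual runtime: all names are hypothetical, the cache is direct-mapped and read-only, the line size and line count are arbitrary choices, effective addresses are modeled as byte offsets into a simulated system memory, and `memcpy` stands in for the DMA transfer an SPE would issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128u   /* assumed cache-line size */
#define NUM_LINES  4u     /* assumed number of lines held in local store */

/* Hypothetical software cache: a directory (tag + valid) and line buffers. */
typedef struct {
    const uint8_t *sysmem;          /* simulated system memory */
    size_t   sysmem_bytes;
    size_t   tag[NUM_LINES];        /* effective address of each cached line */
    int      valid[NUM_LINES];
    uint8_t  line[NUM_LINES][LINE_BYTES];
    unsigned misses;
} sw_cache;

void sw_cache_init(sw_cache *c, const void *sysmem, size_t sysmem_bytes)
{
    assert(sysmem_bytes % LINE_BYTES == 0);
    memset(c, 0, sizeof *c);
    c->sysmem = (const uint8_t *)sysmem;
    c->sysmem_bytes = sysmem_bytes;
}

/* Miss handler: fetch the whole line from system memory.
 * memcpy stands in for the DMA transfer. */
static void miss_handler(sw_cache *c, size_t idx, size_t line_ea)
{
    memcpy(c->line[idx], c->sysmem + line_ea, LINE_BYTES);
    c->tag[idx] = line_ea;
    c->valid[idx] = 1;
    c->misses++;
}

/* Translate an effective address into an address in the local buffer. */
const void *sw_cache_lookup(sw_cache *c, size_t ea)
{
    assert(ea < c->sysmem_bytes);
    size_t line_ea = ea - ea % LINE_BYTES;
    size_t idx = (line_ea / LINE_BYTES) % NUM_LINES;

    if (!c->valid[idx] || c->tag[idx] != line_ea)
        miss_handler(c, idx, line_ea);      /* miss: transfer the line */
    return &c->line[idx][ea - line_ea];     /* local address of the data */
}

/* Example access: read one aligned 32-bit word through the cache. */
uint32_t sw_cache_read_u32(sw_cache *c, size_t ea)
{
    uint32_t v;
    assert(ea % sizeof v == 0);
    memcpy(&v, sw_cache_lookup(c, ea), sizeof v);
    return v;
}
```

Two reads that fall in the same 128-byte line cause a single miss, and the second read is served from the local buffer. Write-back of modified lines is omitted to keep the sketch short.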

In the SPE, the limited local memory is shared by code and data, so a single SPE object may not fit in the available space. In this case, overlays can be useful. This approach divides the program into several partitions, and the compiler reserves a small portion of SPE local memory for the code partition manager. At runtime, the code partition manager is responsible for dynamically loading partitions from system memory into local memory when necessary.
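The role of the code partition manager can be sketched as follows. This is a simplified model with hypothetical names, not the actual overlay runtime: a function pointer stands in for each partition's code, and "loading" a partition is reduced to recording which partition is resident and counting the load.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*overlay_fn)(int);

/* Hypothetical partition descriptor. In a real overlay the code bytes live
 * in system memory; here a function pointer stands in for them. */
typedef struct {
    overlay_fn entry;
    size_t     code_bytes;
} partition;

typedef struct {
    const partition *table;
    size_t   count;
    size_t   region_bytes;  /* overlay region shared by all partitions */
    size_t   resident;      /* index of loaded partition; count means none */
    unsigned loads;         /* number of loads from system memory */
} partition_manager;

void pm_init(partition_manager *pm, const partition *table,
             size_t count, size_t region_bytes)
{
    /* Every partition must fit in the shared overlay region. */
    for (size_t i = 0; i < count; ++i)
        assert(table[i].code_bytes <= region_bytes);
    pm->table = table;
    pm->count = count;
    pm->region_bytes = region_bytes;
    pm->resident = count;
    pm->loads = 0;
}

/* Call into partition id, loading it first if it is not resident. */
int pm_call(partition_manager *pm, size_t id, int arg)
{
    assert(id < pm->count);
    if (pm->resident != id) {
        /* A real manager would copy the partition's code into the
         * overlay region here, replacing the previous partition. */
        pm->resident = id;
        pm->loads++;
    }
    return pm->table[id].entry(arg);
}

/* Two placeholder routines, each imagined to live in its own partition. */
int stage_a(int x) { return x + 1; }
int stage_b(int x) { return x * 2; }
```

Consecutive calls into the same partition trigger only one load, while alternating between partitions reloads the overlay region on every switch. This is why the grouping of routines into partitions matters for performance.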

**4.2.2 Accelerated Library Framework**

The Accelerated Library Framework (ALF) is an Application Programming Interface (API) that accelerates the software development process and provides an abstract view of parallel problems on multi-core systems with memory hierarchies. The framework focuses on solving data-parallel problems on a host-accelerator hybrid system. Currently, ALF supports only the Single-Program-Multiple-Data (SPMD) programming scenario, in which an application consists of a control task, typically residing on the PPE, and a single program running on all allocated accelerator elements (SPEs) at one time. ALF's most important features include data transfer management, parallel task management, double buffering, and data partitioning [4].

**Figure 4.4. Overview of ALF [4]**

On the host element, input data and the corresponding output data are broken up into a number of smaller partitions, called work blocks. With the provided ALF API, these work blocks are placed in a work queue, where they wait for ALF on the host to assign them to the accelerators. The accelerators then process the assigned work blocks and return the corresponding output to the host element.
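The work-block flow described above can be modeled in a few lines of C. The sketch below does not use the real ALF API: every type and function name is hypothetical, the kernel is a placeholder that doubles each sample, and the accelerators are simulated sequentially with a round-robin assignment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64

/* Hypothetical work block: one partition of the input and its output slot. */
typedef struct {
    const float *in;
    float       *out;
    size_t       len;
} work_block;

typedef struct {
    work_block blocks[MAX_BLOCKS];
    size_t head, tail;
} work_queue;

/* Host side: split n elements into blocks of at most block_len elements
 * and place them in the work queue. Returns the number of blocks. */
size_t enqueue_work_blocks(work_queue *q, const float *in, float *out,
                           size_t n, size_t block_len)
{
    assert(block_len > 0);
    q->head = q->tail = 0;
    for (size_t off = 0; off < n; off += block_len) {
        assert(q->tail < MAX_BLOCKS);
        size_t len = (n - off < block_len) ? n - off : block_len;
        q->blocks[q->tail++] = (work_block){ in + off, out + off, len };
    }
    return q->tail;
}

/* Accelerator side: placeholder kernel applied to one work block. */
void kernel(const work_block *wb)
{
    for (size_t i = 0; i < wb->len; ++i)
        wb->out[i] = 2.0f * wb->in[i];
}

/* Runtime stand-in: assign queued blocks to accelerators round-robin.
 * processed[a] counts the blocks handled by accelerator a. */
void run_queue(work_queue *q, size_t num_accel, size_t *processed)
{
    assert(num_accel > 0);
    for (size_t a = 0; a < num_accel; ++a)
        processed[a] = 0;
    while (q->head < q->tail) {
        size_t a = q->head % num_accel;
        kernel(&q->blocks[q->head++]);
        processed[a]++;
    }
}
```

For example, ten samples split into blocks of four give three work blocks, the last of which holds only two samples. With two accelerators the first one processes two blocks and the second processes one.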

Overall, ALF relieves the developer of much of the burden, because the runtime framework handles the underlying resource and task management, data movement, load balancing, and synchronization. However, designing a good partitioning strategy and optimizing the computational kernel remain the developer's responsibility. Data partitioning is crucial to the ALF programming model. For the Cell BE architecture, the size of each data partition must be chosen with the limited size of the local store and the double buffering scheme of the SPE in mind. Likewise, the computational kernel must be optimized for the SIMD nature of the SPE.
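The constraint on partition size can be made concrete with a small calculation. The sketch below assumes the 256 KB SPE local store and a double buffering scheme that keeps two input and two output buffers resident at once. The function name and the code and stack footprints are hypothetical inputs, not figures from this work.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u)   /* SPE local store size */

/* Largest number of elements per work block such that two input and two
 * output buffers (double buffering) fit beside the code and the stack. */
size_t max_block_elems(size_t code_bytes, size_t stack_bytes,
                       size_t in_elem_bytes, size_t out_elem_bytes)
{
    assert(code_bytes + stack_bytes <= LOCAL_STORE_BYTES);
    assert(in_elem_bytes + out_elem_bytes > 0);
    size_t free_bytes = LOCAL_STORE_BYTES - code_bytes - stack_bytes;
    return free_bytes / (2 * (in_elem_bytes + out_elem_bytes));
}
```

As a worked example, with 64 KB of code, 16 KB of stack, and 8-byte complex samples for both input and output, 176 KB remain for buffers, so `max_block_elems(64 * 1024, 16 * 1024, 8, 8)` yields 5632 samples per work block.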

**4.2.3 The Simulator**

The IBM Full-System Simulator Version 2.0 is a system software infrastructure used to build modeling applications for program performance analysis and detailed microarchitectural modeling of the Cell BE processor. The simulator can be configured for modes ranging from fast functional simulation of processor instructions to performance simulation of an entire system [3][5].

**Figure 4.5. Simulation stack of Cell BE processor**

Functional simulation, in this case, runs as a debugger to test the functionality and feature correctness of the developed software without modeling the cycle time it takes to execute the target application. While for the performance simulation, which is, running a cycle-accurate model to gather and provide various types of