Department of Electrical Engineering (Institutionen för systemteknik)

Master's Thesis

A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor

Master's thesis in Computer Engineering at Linköping Institute of Technology
by Jonas Einemo and Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE
Linköping 2010
Supervisor: Olof Kraigher, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet
Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2010-06-15
ISRN: LiTH-ISY-EX--10/4392--SE
URL: http://www.da.isy.liu.se/en/index.html, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4292
Title: A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor
Authors: Jonas Einemo, Magnus Lundqvist

Abstract
H.264 is a video coding standard which offers high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.
Acknowledgments
We would like to thank everyone who has helped us during our thesis work, especially our supervisor Olof Kraigher for all his help and useful hints, and our examiner Professor Dake Liu for his support, comments and the opportunity to do this thesis. We would also like to thank Jian Wang for the support on the DMA firmware, Jens Ogniewski for the help with understanding the H.264 standard, and our families and friends for their support and for bearing with us during the work on this thesis.
Jonas Einemo Magnus Lundqvist
Contents
1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Way of Work
  1.5 Outline
2 Overview of Video Coding
  2.1 Introduction to Video Coding
  2.2 Color Spaces
  2.3 Predictive Coding
  2.4 Transform Coding and Quantization
  2.5 Entropy Coding
  2.6 Quality Measurements
    2.6.1 Subjective Quality
    2.6.2 Objective Quality
3 Overview of H.264
  3.1 Introduction to H.264
  3.2 Coded Slices
    3.2.1 I Slice
    3.2.2 P Slice
    3.2.3 B Slice
    3.2.4 SP Slice
    3.2.5 SI Slice
  3.3 Intra Prediction
  3.4 Inter Prediction
    3.4.1 Hexagon Search
  3.5 Transform Coding and Quantization
    3.5.1 Discrete Cosine Transform
    3.5.2 Inverse Discrete Cosine Transform
    3.5.3 Quantization
    3.5.4 Rescaling
  3.6 Deblocking Filter
  3.7 Entropy Coding
4 Overview of the ePUMA Architecture
  4.1 Introduction to ePUMA
  4.2 ePUMA Memory Hierarchy
  4.3 Master Core
    4.3.1 Master Memory Architecture
    4.3.2 Master Instruction Set
    4.3.3 Datapath
  4.4 Sleipnir Core
    4.4.1 Sleipnir Memory Architecture
    4.4.2 Datapath
    4.4.3 Sleipnir Instruction Set
    4.4.4 Complex Instructions
  4.5 DMA Controller
  4.6 Simulator
5 Elaboration of Objectives
  5.1 Task Specification
    5.1.1 Questions at Issue
  5.2 Method
  5.3 Procedure
6 Implementation
  6.1 Motion Estimation
    6.1.1 Motion Estimation Reference
    6.1.2 Complex Instructions
    6.1.3 Sleipnir Blocks
    6.1.4 Master Code
  6.2 Discrete Cosine Transform and Quantization
    6.2.1 Forward DCT and Quantization
    6.2.2 Rescaling and Inverse DCT
7 Results and Analysis
  7.1 Motion Estimation
    7.1.1 Kernel 1
    7.1.2 Kernel 2
    7.1.3 Kernel 3
    7.1.4 Kernel 4
    7.1.5 Kernel 5
    7.1.6 Master Code
    7.1.7 Summary
8 Discussion
  8.1 DMA
  8.2 Main Memory
  8.3 Program Memory
  8.4 Constant Memory
  8.5 Vector Register File
  8.6 Register Forwarding
  8.7 New Instructions
    8.7.1 SAD Calculations
    8.7.2 Call and Return
  8.8 Master and Sleipnir Core
  8.9 ePUMA H.264 Encoding Performance
  8.10 ePUMA Advantages
  8.11 Observations
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
Bibliography
A Proposed Instructions
List of Figures
2.1 Overview of the data flow in a basic encoder and a decoder
2.2 YUV 4:2:0 sampling format
3.1 Overview of the data flow in an H.264 encoder
3.2 4x4 luma prediction modes
3.3 16x16 luma prediction modes
3.4 Different ways to split a macroblock in inter prediction
3.5 Subsamples interpolated from neighboring pixels
3.6 Multiple frame prediction
3.7 Large (a) and small (b) search pattern in the hexagon search algorithm
3.8 Movement of the hexagon pattern in a search area and the change to the smaller search pattern
3.9 DCT functional schematic
3.10 IDCT functional schematic
3.11 Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
3.12 Pixels in blocks adjacent to vertical and horizontal boundaries
4.1 ePUMA memory hierarchy
4.2 ePUMA star network interconnection
4.3 Senior datapath for short instructions
4.4 Sleipnir datapath pipeline schematic
4.5 Sleipnir Local Store switch
6.1 Motion estimation program flowchart
6.2 Motion estimation computational flowchart
6.3 Hexagon search program flow controller
6.4 Proposed implementation of call and return hardware
6.5 Reference macroblock overlap
6.6 Reference macroblock partitioning for 13 data macroblocks
6.7 Master program flowchart
6.8 Memory allocation of data memory in the master (a) and main memory allocation (b)
6.9 Sleipnir core motion estimation task partitioning and synchronization
6.10 DCT flowchart
6.11 Memory transpose schematic
7.1 Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed
7.2 Frame 10 from Pedestrian Area video sequence
7.3 Difference between frame 10 and frame 11 in Pedestrian Area video sequence
7.4 Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence
7.5 Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation
8.1 Sleipnir core DCT task partitioning and synchronization
8.2 Memory allocation of macroblock in LVM for intra coding
A.1 HVBSUMABSDWA
A.2 HVBSUMABSDNA
A.3 HVBSUBWA
List of Tables
3.1 Qstep for a few different values of QP
3.2 Multiplication factor MF
3.3 Scaling factor V
4.1 Pipeline specification
4.2 Register file access types
4.3 Address register increment operations
4.4 Addressing modes examples
7.1 Short names for kernels that have been tested
7.2 Description of table columns
7.3 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core
7.4 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores
7.5 Block 1 costs
7.6 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core
7.7 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores
7.8 Block 2 costs
7.9 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores
7.10 Kernel 3 costs
7.11 Motion estimation results from simulation with Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores
7.12 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores
7.13 Kernel 4 costs
7.14 Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores
7.15 Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores
7.16 Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores
7.17 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores
7.18 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores
7.19 Kernel 5 costs
7.20 Master code cost
7.21 Prolog and epilog cycle costs
7.22 Simulated epilog cycle cost including waiting for last Sleipnir to finish
7.23 DMA cycle costs
7.24 Costs for DCT with quantization block and IDCT with rescaling block
B.1 Simulation cycle cost of motion estimation kernels
Abbreviations
AGU Address Generation Unit
ALU Arithmetic Logic Unit
AVC Advanced Video Coding
CABAC Context-based Adaptive Binary Arithmetic Coding
CAVLC Context-based Adaptive Variable Length Coding
CB Copy Back
CM Constant Memory
CODEC COder/DECoder
DCT Discrete Cosine Transform
DMA Direct Memory Access
DSP Digital Signal Processing
ePUMA Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access
FIR Finite Impulse Response
FPS Frames Per Second
FS Full Search
HDTV High-Definition Television
HVBSUBNA Half Vector Bytewise SUBtraction Not word Aligned
HVBSUBWA Half Vector Bytewise SUBtraction Word Aligned
HVBSUMABSDNA Half Vector Bytewise SUM of ABSolute Differences Not word Aligned
HVBSUMABSDWA Half Vector Bytewise SUM of ABSolute Differences Word Aligned
IDCT Inverse Discrete Cosine Transform
IEC International Electrotechnical Commission
ISO International Organization for Standardization
ITU International Telecommunications Union
LS Local Storage
LVM Local Vector Memory
MAE Mean Absolute Error
MB Macroblock
MC Motion Compensation
ME Motion Estimation
MF Multiplication Factor
MPEG Moving Picture Experts Group
MSE Mean Square Error
NAL Network Abstraction Layer
NoC Network on Chip
PM Program Memory
PSNR Peak Signal to Noise Ratio
QP Quantization Parameter
RGB Red, Green and Blue, A color space
ROM Read Only Memory
SAD Sum of Absolute Differences
SPRF SPecial Register File
STI Sony Toshiba IBM
V Rescaling Factor
VCEG Video Coding Experts Group
VRF Vector Register File
Chapter 1
Introduction
This chapter gives the background to the thesis, defines its purpose and scope, describes the way of work, and presents the outline of the thesis.
1.1 Background
With new handheld devices and mobile systems offering more advanced services, the need for increased computational power at low cost, both in terms of chip area and power dissipation, is ever increasing. Now that video playback and recording are standard applications rather than premium features in mobile devices, high computational power at a low cost is still a problem without a sufficient solution.
The Division of Computer Engineering at the Department of Electrical Engineering at Linköpings Tekniska Högskola has for some time been part of a research project called ePUMA, which can be read out as Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access. The development is driven by the pursuit of the next generation of digital signal processing demands. By developing a cheap and low power processor with large calculation power, this new architecture aims to meet tomorrow's demands in digital signal processing. The main applications for the processor are future radio base stations, radar and High-Definition Television (HDTV).
H.264 is a standard for video compression that was first released in 2003. It is now a mature and widely spread standard that is used in Blu-ray, popular video streaming websites like YouTube, television services and video conferencing. It provides very good compression at the cost of high computational complexity. The hope is that the ePUMA multi-core architecture will be able to handle real-time video encoding using the H.264 standard.
At the Division of Computer Engineering previous work has been done on implementing an H.264 encoder for another multi-core architecture. This work was done on the STI Cell which is used in e.g. the popular video gaming console PLAYSTATION 3.
1.2 Purpose
The purpose of this master thesis is to evaluate the capability of the ePUMA processor architecture with respect to real-time video encoding using the H.264 video compression standard, and to identify and expose possible areas of improvement in the ePUMA architecture. This is done by implementing parts of an H.264 encoder and, if possible, comparing the cycles needed to the previously implemented STI Cell H.264 encoder.
1.3 Scope
By implementing the most computationally expensive parts in the H.264 standard it would be possible to better estimate if the ePUMA processor architecture is capable of encoding video using the H.264 standard in real time. Studying the H.264 standard it can be seen that entropy coding is the most time consuming part if it is done in software. Because of the large amount of bit manipulations needed, it is not feasible to perform entropy coding in the processor. Therefore an early decision was made that entropy coding had to be hardware accelerated and that it should not be a part of this thesis.
In this thesis no exact hardware costs for performance improvements will be calculated; instead their feasibility will be discussed.
The time constraint of this master thesis is twenty weeks which restricts the extent of the work. Because of the time constraint some parts of a complete encoder have had to be left out.
1.4 Way of Work
One of the most time consuming tasks is motion estimation, which together with the discrete cosine transform and quantization became the primary target for evaluation. First a working implementation was produced. An iterative development process was then used to refine the implementations and reach better performance. The partial implementations of the H.264 standard were written for the ePUMA system simulator. The simulator was also used for all performance measurements of the implementations, using frames from several commonly used test video sequences. Once the performance measurement results were acquired they were analyzed and conclusions were drawn. The way of work is elaborated in section 5.2 and section 5.3.
1.5 Outline
This thesis is aimed at an audience with an education in electrical engineering, computer engineering or similar. Expertise in video coding or the H.264 standard is not necessary as the main principles of these topics will be covered.
The outline of this thesis is ordered as naturally as possible, where this introduction chapter is followed by theoretical chapters containing the topics needed to understand the rest of the thesis. The first of these is chapter 2 which covers the basics of video coding, followed by chapter 3 which offers an introduction to the H.264 video coding standard. The last theoretical chapter is chapter 4 which covers the hardware architecture and toolchain of the ePUMA processor. The theory is followed by chapter 5 where a more detailed task specification, method and procedure of the thesis is presented with help from the knowledge obtained from the theoretical chapters. After that, chapter 6 describes the function and development of the implementations produced. Chapter 7 then presents the results obtained and gives an analysis of them. Chapter 8 contains a discussion about the results as well as ideas thought of while working on this thesis. The final chapter is chapter 9 which contains the conclusions and the future work that could be done in the area.
Chapter 2
Overview of Video Coding
This chapter gives an introduction to video coding, color spaces, predictive coding, transform coding and entropy coding. This knowledge is necessary to understand the rest of the thesis.
2.1 Introduction to Video Coding
A video consists of several images, called frames, shown in a sequence. The amount of disk space required to store a sequence of raw data is huge and therefore video coding is needed. The purpose of video coding is to minimize the data to store on disk or the data to send over a network, without decreasing the image quality too much. There are many techniques and algorithms to do this, such as MPEG-2, MPEG-4 and H.264/AVC. [10]
Figure 2.1: Overview of the data flow in a basic encoder and a decoder

All of these algorithms are constructed from a similar template. First some technique is used to reduce the amount of data to be transformed. The video is then transformed with for example a Discrete Cosine Transform (DCT). After this a quantization is performed to shrink the data further. The data is then pushed through an entropy coder such as Huffman, or a more advanced algorithm such as Context-based Adaptive Binary Arithmetic Coding (CABAC) or Context-based Adaptive Variable Length Coding (CAVLC), which all compress the data based on patterns in the bit-stream. [10] The data flow of a basic encoder and a basic decoder is illustrated in figure 2.1.
As mentioned, a video sequence consists of many frames. In video coding these frames can be divided into something called slices. A slice can be a part of a frame or contain the complete frame. This slice division is advantageous because it gives the ability to know e.g. that data in a slice does not depend on data outside the slice. The frames are also divided into something called macroblocks. A macroblock is a block consisting of 16×16 pixels. This partitioning of the data makes computations easier to organize and structure. [10]
2.2 Color Spaces
To understand video coding some knowledge about different color spaces is needed. One such color space is RGB, whose name comes from its components red, green and blue. With these three colors and different intensities of them it is possible to visualize all colors in the spectrum. Another commonly used color space is YCbCr, also called YUV. In this color space Y represents the luminance (luma) component, which corresponds to the brightness of a specific pixel. The other two components, namely Cb and Cr, are chrominance (chroma) components which carry the color information. [10] The conversion from the RGB color space to the YUV color space is shown in equation (2.1).

Y = kr*R + kg*G + kb*B
Cb = B - Y
Cr = R - Y
Cg = G - Y    (2.1)

As seen in equation (2.1) there also exists a third chrominance component for green, namely Cg, which thanks to equation (2.2) can be calculated as shown in equation (2.3). This means that Cg can be calculated by the decoder and does not have to be transmitted, which is advantageous in the sense of data compression. [10]

kb + kr + kg = 1    (2.2)

Cg = Y - Cb - Cr    (2.3)
The human eye is more sensitive to luminance than to chrominance, and because of that a smaller number of bits can be used to represent the chrominance and a larger number for the luminance. With this feature of the YUV color space the total amount of bits needed to encode a pixel can be reduced. A common way to do this is by applying the 4:2:0 sampling format.
Figure 2.2: YUV 4:2:0 sampling format
The 4:2:0 sampling format can be described as a ’12 bits per pixel’ format where there are 2 samples of chrominance for every 4 samples of luminance as shown in figure 2.2. If each sample is stored using 8 bits this will add up to 6 ∗ 8 = 48 bits for 4 YUV 4:2:0 pixels with an average of 48/4 = 12 bits per pixel. [10]
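The arithmetic above generalizes to whole frames. As a sketch (illustrative Python, not part of the thesis implementation), the storage cost of a 4:2:0 frame can be computed as:

```python
def yuv420_bits_per_frame(width, height, bits_per_sample=8):
    """Bits needed for one YUV 4:2:0 frame: a full-resolution luma
    plane plus two chroma planes subsampled by 2 in both directions."""
    luma_samples = width * height
    chroma_samples = 2 * (width // 2) * (height // 2)
    return (luma_samples + chroma_samples) * bits_per_sample

# Four luma samples share one Cb and one Cr sample,
# giving 12 bits per pixel on average, as in the text.
bits = yuv420_bits_per_frame(1920, 1080)
print(bits / (1920 * 1080))  # 12.0
```

For a 2x2 block this gives 4 luma samples plus 2 chroma samples at 8 bits each, i.e. the 48 bits for 4 pixels mentioned above.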
2.3 Predictive Coding
There are two kinds of predictive coding: intra coding and inter coding. By studying a picture it is easy to see that some parts of the picture are very similar; this is called spatial correlation. The predictive coding that uses these spatial correlations within a frame to form a prediction of other parts of the frame is called intra coding. By studying a sequence of pictures or a video sequence it can be seen that there is usually not much difference between the frames; this is called temporal correlation. By exploiting this temporal correlation a difference, also called a residue, can be calculated which is comprised of smaller values and therefore can be described with a smaller number of bits. This results in better data compression. The predictive coding that uses temporal correlations between different frames is called inter coding. [10]
2.4 Transform Coding and Quantization
The purpose of transform coding is to convert the image data or motion compensated data into another representation of data. This can be done with a number of different algorithms, where the block based Discrete Cosine Transform (DCT) is one of the most common in video coding. The DCT algorithm converts the data to be described into sums of cosine functions oscillating at different frequencies. [10]
There are some different transforms that could be used in video coding, but the common property of them all is that they are reversible, meaning the transform can be reversed without loss of data. This is an important property because otherwise drift between the encoder and decoder can occur, and special algorithms would have to be applied to correct these errors. As mentioned before, block based transform coding is the most common. When using block based transform coding the picture is divided into smaller blocks such as 8 × 8 or 4 × 4 pixels. Each block is then transformed with the chosen transform. The transformed data is then quantized to remove high frequency data. This can be done because the human eye is insensitive to higher frequencies, and therefore these can be removed without any noticeable loss of quality. The quantizer re-maps the input data with one range of values to output data with a smaller range of possible values. This means the output can be coded with fewer bits than the original data, and in this way data compression is achieved. [10]
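As a toy illustration of the remapping step (a plain uniform quantizer in Python, not the exact H.264 scheme described in chapter 3; the step size `qstep` here is only a stand-in for the standard's quantization parameter):

```python
def quantize(coeff, qstep):
    # Map a transform coefficient from a wide input range to a small
    # range of integer levels; information is lost in the rounding.
    return int(round(coeff / qstep))

def rescale(level, qstep):
    # Inverse mapping used by the decoder; only multiples of qstep
    # can be reconstructed, which is what makes the step lossy.
    return level * qstep

print(quantize(17, 8))              # 2
print(rescale(quantize(17, 8), 8))  # 16, not the original 17
```

The larger `qstep` is, the fewer distinct levels survive and the fewer bits the entropy coder needs, at the cost of larger reconstruction error.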
2.5 Entropy Coding
Entropy coding is a lossless data compression technique. The different entropy coding algorithms encode symbols that occur often with a small number of bits and symbols that occur less often with more bits. The bits are all put in a bitstream that can be written to disk or sent over a network. In video coding these symbols can be quantized transform coefficients, motion vectors, headers or other information that needs to be sent to be able to decode the video stream. As mentioned earlier, a few of the usual entropy coding algorithms are Huffman, CABAC and CAVLC. [10]
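The principle can be illustrated with a minimal Huffman construction (illustrative Python using only the standard library; the actual H.264 entropy coders, CAVLC and CABAC, are considerably more elaborate):

```python
import heapq

def huffman_code_lengths(freqs):
    """Return a code length per symbol: frequent symbols get short
    codes, rare symbols get long ones."""
    # Each heap entry: (total frequency, tie-breaker id, {symbol: depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths({"a": 5, "b": 2, "c": 1, "d": 1})
print(lengths["a"], lengths["d"])  # 1 3  (common 'a' is cheap, rare 'd' is not)
```

The code lengths satisfy the prefix-code property, so the bitstream can be decoded unambiguously.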
2.6 Quality Measurements
There exist several ways to measure the quality of images and to compare uncompressed images with reconstructed ones in order to evaluate video coding algorithms.
2.6.1 Subjective Quality
Subjective quality is the quality that someone watching an image or a video sequence experiences. Subjective quality can be measured by having evaluators rate each part of a series of images or video sequences with different properties. This can be a time consuming and impractical way of measurement in most circumstances. [10]
2.6.2 Objective Quality
To enable more automatic measurements of quality some algorithms are commonly used. One of these is Peak Signal to Noise Ratio (PSNR), which can be used to measure the quality of a reconstructed image by comparing it to an uncompressed one. PSNR gives a logarithmic scale where a higher value is better. The Mean Square Error (MSE) is used in the calculation of PSNR and is calculated as

MSE = \frac{1}{m n} \sum_{i=1}^{m} \sum_{j=1}^{n} (C(i,j) - R(i,j))^2    (2.4)

where n is the image height, m is the image width and C and R are the current and reference images being compared. With the MSE value the PSNR can be calculated as

PSNR = 10 \log_{10} \frac{(2^{bits} - 1)^2}{MSE}    (2.5)

where 2^{bits} - 1 is the largest representable value of a pixel with the specified number of bits. [10]
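The two measures translate directly into code. A sketch of equations (2.4) and (2.5) (illustrative Python with images as lists of rows, assuming the conventional squared peak value in the PSNR numerator):

```python
import math

def mse(current, reference):
    # Equation (2.4): mean of the squared pixel differences.
    total = sum((c - r) ** 2
                for row_c, row_r in zip(current, reference)
                for c, r in zip(row_c, row_r))
    return total / (len(current) * len(current[0]))

def psnr(current, reference, bits=8):
    # Equation (2.5): logarithmic scale, higher is better;
    # identical images give an infinite PSNR.
    err = mse(current, reference)
    if err == 0:
        return float("inf")
    peak = (1 << bits) - 1  # largest representable pixel value
    return 10 * math.log10(peak ** 2 / err)

a = [[10, 10], [10, 10]]
b = [[11, 11], [11, 11]]
print(mse(a, b))             # 1.0
print(round(psnr(a, b), 1))  # 48.1
```

Typical reconstructed video lands in the 30-50 dB range on this scale, which is why PSNR is convenient for comparing encoder settings automatically.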
Chapter 3
Overview of H.264
This chapter presents an overview of the H.264 video compression standard. Some sections are more detailed than others because of their relevance to this thesis. The topics covered include the different frame and slice types, intra and inter prediction, transform coding, quantization, the deblocking filter and finally entropy coding.
3.1 Introduction to H.264
H.264 [12], also known as Advanced Video Coding (AVC) and MPEG-4 Part 10, is a standard for video compression. The standard has been developed by the Video Coding Experts Group (VCEG) of the International Telecommunications Union (ITU) and the Moving Picture Experts Group (MPEG), which is a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The main objective when H.264 was developed was to maximize the efficiency of the video compression, but also to provide a standard with high transmission efficiency which supports reliable and robust transmission of data over different channels and networks. [10]
H.264 is divided into a number of different profiles. These profiles include different parts of the video coding features from the H.264 standard. Some of the most common ones are the Extended, Baseline, Constrained Baseline and Main profiles. The Baseline profile supports inter and intra coding and entropy coding with CAVLC. The Main profile supports interlaced video, inter coding using B-slices and entropy coding using CABAC. The Extended profile does not support interlaced video nor CABAC but supports switching slices and has improved error resilience. [10]
In figure 3.1 a detailed view of the data flow in an H.264 encoder can be seen. This figure illustrates the important prediction coding and how it is connected to the other parts of the encoder. The in-loop deblocking filter can also be seen in this illustration. [10]
Figure 3.1: Overview of the data flow in an H.264 encoder
3.2 Coded Slices
A frame can be divided into smaller parts called slices. These slices can then be coded in different modes. The different coding modes in H.264 are presented below. [14]
3.2.1 I Slice
In the I slice all macroblocks are intra coded. The encoder uses the spatial correlations within a single slice to code that slice. The I slice occupies the most space of all the different types of slices after it has been encoded. [10]
3.2.2 P Slice
P slices can contain both I coded macroblocks and P coded macroblocks. P coded macroblocks are predicted from a list of reference macroblocks. [10]
3.2.3 B Slice
B slices or bidirectional slices can contain both B coded macroblocks and I coded macroblocks. B coded macroblocks can be predicted from two different lists of reference macroblocks both before and after the current frame in time. [10]
3.2.4 SP Slice
A Switching P (SP) slice is coded in a way that supports easy switching between similar precoded video streams without suffering a high penalty for sending a new I slice. [10]
3.2.5 SI Slice
A Switching I (SI) slice is an intra coded slice and supports easy switching between two different streams that do not correlate. [10]
3.3 Intra Prediction
In intra coding the encoder only uses data from the current frame. Intra prediction is the next step in this direction, trying to minimize the coded frame size. With intra prediction the encoder tries to utilize the spatial correlation within the frame. [10]
Figure 3.2: 4x4 luma prediction modes (0 Vertical, 1 Horizontal, 2 DC, 3 Diagonal down-left, 4 Diagonal down-right, 5 Vertical-right, 6 Horizontal-down, 7 Vertical-left, 8 Horizontal-up)

Figure 3.3: 16x16 luma prediction modes (0 Vertical, 1 Horizontal, 2 DC, 3 Plane)
H.264 supports 9 different intra prediction modes for 4x4 sample luma blocks, four different modes for 16x16 sample luma blocks and four modes for 8x8 chroma components. The 9 4x4 prediction modes are illustrated in figure 3.2 and the 4 16x16 luma prediction modes are illustrated in figure 3.3. The pixels are interpolated or extrapolated from the pixels nearby, i.e. the pixels marked with letters. Usually the encoder selects the prediction mode that minimizes the difference between the predicted block and the block to be encoded. I_PCM is another prediction mode which makes it possible to transmit samples of an image without prediction or transformation. [10, 14]
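As a concrete example, the DC mode (mode 2) predicts every pixel of a 4x4 block from the mean of its neighbors. A sketch (illustrative Python; the +4 before the shift is the usual rounding term, and the boundary cases where neighbors are unavailable are omitted):

```python
def predict_4x4_dc(top, left):
    """DC intra prediction (mode 2): fill the 4x4 block with the
    rounded mean of the four pixels above (A..D) and the four to
    the left (I..L)."""
    dc = (sum(top) + sum(left) + 4) >> 3  # mean of 8 samples with rounding
    return [[dc] * 4 for _ in range(4)]

pred = predict_4x4_dc([100, 104, 96, 100], [100, 100, 100, 100])
print(pred[0])  # [100, 100, 100, 100]
```

The encoder would build such a prediction for every available mode and keep the one whose residue (difference from the actual block) is cheapest to code.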
3.4 Inter Prediction
Inter prediction creates a prediction model from one or more previously encoded video frames or slices using block-based motion compensation. The motion vector precision can be up to a quarter pixel resolution. The task is to find a vector that points to a block of pixels that have the smallest difference between the reference block and the block in the frame that is being encoded. [10]
[Figure: macroblock partitions 16x16, 16x8, 8x16 and 8x8, with 8x8 sub-partitions 8x8, 8x4, 4x8 and 4x4]
Figure 3.4: Different ways to split a macroblock in inter prediction.
H.264 supports a range of block sizes from 16x16 down to 4x4 pixels. This is illustrated in figure 3.4. Using big blocks saves data because fewer motion vectors are needed, but the distortion can be very high when there are a lot of small things moving around in the video sequence. Using smaller blocks will in many cases lower the distortion but will instead increase the amount of bits needed to store the increased number of motion vectors. By letting the encoder find the best trade-off, a good data compression of the video sequence can be achieved. The blocks are split when a threshold value is reached. [10]
  SAD = sum_{i=1..m} sum_{j=1..n} |C(i,j) - R(i,j)|    (3.1)

  MSE = (1 / (m*n)) * sum_{i=1..m} sum_{j=1..n} (C(i,j) - R(i,j))^2    (3.2)

  MAE = (1 / (m*n)) * sum_{i=1..m} sum_{j=1..n} |C(i,j) - R(i,j)|    (3.3)
The macroblock cost is commonly calculated in one of a few different ways; Sum of Absolute Difference (SAD) is the most common as it offers the lowest computational complexity. The definition of SAD can be found in equation (3.1). Two other common ways to calculate the cost are Mean Square Error (MSE) and Mean Absolute Error (MAE), presented in equation (3.2) and equation (3.3) respectively. In equations (3.1), (3.2) and (3.3), n is the image width and m is the image height. [10]
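As a concrete illustration, the three cost measures can be sketched in a few lines of Python (NumPy is used for convenience; this code is not part of the thesis implementation):

```python
import numpy as np

def sad(c, r):
    """Sum of Absolute Difference, equation (3.1)."""
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def mse(c, r):
    """Mean Square Error, equation (3.2)."""
    d = c.astype(np.int64) - r.astype(np.int64)
    return (d * d).mean()

def mae(c, r):
    """Mean Absolute Error, equation (3.3)."""
    return np.abs(c.astype(np.int64) - r.astype(np.int64)).mean()

c = np.full((16, 16), 10, dtype=np.uint8)   # current 16x16 macroblock
r = np.full((16, 16), 12, dtype=np.uint8)   # reference macroblock
print(sad(c, r))  # |10 - 12| over 256 pixels -> 512
```

SAD is the natural choice for the motion estimation inner loop since it avoids both the multiplications of MSE and the division by m*n.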
Figure 3.5: Subsamples interpolated from neighboring pixels
More accurate motion estimation in the form of sub pixel motion vectors is available in H.264. Up to a quarter pixel resolution is supported for the luma component and one eighth sample resolution for the chroma components. This motion estimation is made possible by interpolating neighboring pixels and then comparing with the current frame in the encoder. The interpolation is performed by a 6 tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). [10]
In figure 3.5 the half pixel sample b can be located. To generate this sample equation (3.4) can be used. Sample m can be calculated in a similar way shown in equation (3.5). [10]
  b = round((E - 5F + 20G + 20H - 5I + J) / 32)    (3.4)

  m = round((B - 5D + 20H + 20N - 5S + U) / 32)    (3.5)
After generating all half pixel samples from real samples there are some half pixel samples that have not been generated. These samples have to be generated from already generated samples. The sample j in figure 3.5 is an example of that. To generate j the same FIR filter is used but with samples 1, 2, b, s, 7 and 8. j could also be generated with samples 3, 4, h, m, 5 and 6. Note that unrounded versions of the samples should be used when calculating j. When all half pixel samples are generated it is time to generate the quarter pixel samples. This is done by linear interpolation. Sample a in figure 3.5 is calculated as in equation (3.6) and sample d is calculated as in equation (3.7). To generate the last samples two diagonal half pixel samples are used, see equation (3.8). [10]
a = round((G + b)/2) (3.6)
d = round((G + h)/2) (3.7)
e = round((h + b)/2) (3.8)
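The half and quarter pixel interpolation above can be sketched as follows (a Python illustration, not the thesis implementation; the flat-region example relies on the filter weights summing to 32):

```python
def half_pel(e, f, g, h, i, j):
    """6-tap FIR half-pixel interpolation with weights (1,-5,20,20,-5,1)/32,
    as in equations (3.4) and (3.5)."""
    return round((e - 5*f + 20*g + 20*h - 5*i + j) / 32)

def quarter_pel(p, q):
    """Linear interpolation between two samples, equations (3.6)-(3.8)."""
    return round((p + q) / 2)

# In a flat region every interpolated sample equals the surrounding pixels.
b = half_pel(100, 100, 100, 100, 100, 100)
a = quarter_pel(100, b)
print(b, a)  # 100 100
```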
To enhance the video compression even more H.264 has support for predicting macroblocks from more than one frame. This can be applied to both B and P coded slices. With the possibility to predict macroblocks from different frames a much better video compression can be achieved. The downside with multiframe prediction is an increased cost in memory size, memory bandwidth and computational complexity. [10]
Figure 3.6: Multiple frame prediction
To find the best motion vector the encoder uses a search algorithm such as Full Search (FS), Diamond Search or Hexagon Search. With Full Search a complete search of the whole search area is performed. This algorithm provides the best compression efficiency but is also the most time consuming algorithm. Diamond Search is a less time consuming search algorithm where the search pattern is formed as a diamond. Its performance, in terms of compression, is good in comparison with FS. Hexagon Search is an even more refined search pattern where the search points are formed as a hexagon (figure 3.7a). By decreasing the number of search points the effort to calculate the motion vector is minimized and the result will be almost as good as with Diamond Search [16].
Motion estimation is the part of H.264 encoding that consumes the most computational power and is predicted to consume about 60% to 80% of the total encoding time [15].
3.4.1 Hexagon search
Hexagon search uses a 7 point search pattern which can be seen in figure 3.7a. Each cross in the grid represents a search point in the search area where the grid resolution is one pixel. For each search point a Sum of Absolute Difference, equation (3.1), is calculated. [16]
(a) (b)
Figure 3.7: Large(a) and small(b) search pattern in the hexagon search algorithm.
The search steps in the hexagon search are the following.
1. Calculate the SAD of the six closest search points and the current search point.
2. Set the search point with the smallest SAD as the new current search point. If the middle point has the smallest SAD, jump to step 5.
3. Calculate the SAD of the 3 new search points that have not yet been calculated, as illustrated in figure 3.8.
4. Jump to step 2.
5. Calculate the SAD of the 4 new search points forming a diamond around the middle point. This is illustrated in figure 3.7b.
6. Choose the search point that resulted in the smallest SAD and form a motion vector to this search point.
When the smallest SAD is found the motion compensated residue can be calculated. This residue is then sent to the transformation part of the encoder for further processing. In the decoder the motion vectors are used to restore the image correctly from the residue that was sent from the encoder. [16]
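The six search steps above can be sketched as follows (a Python illustration with assumed hexagon and diamond offsets; a real encoder would add search-range clipping):

```python
import numpy as np

# Large pattern: the six hexagon corners; small pattern: a one-pixel diamond.
LARGE = [(0, 2), (0, -2), (2, 1), (2, -1), (-2, 1), (-2, -1)]
SMALL = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def sad(cur, ref, y, x):
    h, w = cur.shape
    block = ref[y:y + h, x:x + w].astype(np.int32)
    return int(np.abs(cur.astype(np.int32) - block).sum())

def hexagon_search(cur, ref, y0, x0, steps=16):
    best = (y0, x0)
    cost = {best: sad(cur, ref, y0, x0)}
    for _ in range(steps):                      # steps 1-4: move the hexagon
        for dy, dx in LARGE:
            p = (best[0] + dy, best[1] + dx)
            if p not in cost:
                cost[p] = sad(cur, ref, *p)
        new_best = min(cost, key=cost.get)
        if new_best == best:                    # center is smallest -> step 5
            break
        best = new_best
    for dy, dx in SMALL:                        # step 5: small diamond
        p = (best[0] + dy, best[1] + dx)
        if p not in cost:
            cost[p] = sad(cur, ref, *p)
    best = min(cost, key=cost.get)              # step 6: form the motion vector
    return best[0] - y0, best[1] - x0
```

With a reference frame containing an exact copy of the current block two pixels to the right, the search converges to the motion vector (0, 2) in the first iteration.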
Figure 3.8: Movement of the hexagon pattern in a search area and the change to the smaller search pattern.
3.5 Transform Coding and Quantization
The main transform used in H.264 is the discrete cosine transform.
3.5.1 Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a widely used transform in image and video compression algorithms. In H.264 the DCT decorrelates the residual data before quantization takes place. The DCT is a block based algorithm which means it transforms one block at a time. In standards prior to H.264 the blocks were 8x8 pixels large but that is now changed to 4x4 samples to reduce the blocking effects, which lower the visual quality of the video. The DCT used in H.264 is a modified two-dimensional (2D) DCT transform. The transform matrix for the modified 2D DCT can be found in equation (3.9). [10]
        | 1  1  1  1 |
  C_f = | 2  1 -1 -2 |    (3.9)
        | 1 -1 -1  1 |
        | 1 -2  2 -1 |
  Y = C_f X C_f^T ⊗ E_f =

      | 1  1  1  1 |     | 1  2  1  1 |     | a^2   ab/2   a^2   ab/2  |
    = | 2  1 -1 -2 |  X  | 1  1 -1 -2 |  ⊗  | ab/2  b^2/4  ab/2  b^2/4 |    (3.10)
      | 1 -1 -1  1 |     | 1 -1 -1  2 |     | a^2   ab/2   a^2   ab/2  |
      | 1 -2  2 -1 |     | 1 -2  1 -1 |     | ab/2  b^2/4  ab/2  b^2/4 |

where

  a = 1/2    (3.11)

  b = sqrt(2/5)    (3.12)
and X is the 4x4 block of pixels to calculate the DCT of. To simplify computation somewhat the post-scaling (⊗E_f) can be absorbed into the quantization process. [10] This will be described in more detail in section 3.5.3 which covers the quantization.
The modified 2D DCT is an approximation to the standard DCT. It does not give the same result but the compression is almost identical. The advantage of this approximation is that the core equation C_f X C_f^T can be computed in 16-bit arithmetic with only shifts, additions and subtractions [6].
To do a two-dimensional DCT two one-dimensional DCTs can be performed after each other, the first one on rows and the second one on columns or vice versa. The function of the one-dimensional DCT can be seen in figure 3.9. [6]
Figure 3.9: DCT functional schematic
The operations performed while calculating the DCT as shown in figure 3.9 can be written as equation (3.13).
  X0 = (x0 + x3) + (x1 + x2)
  X2 = (x0 + x3) - (x1 + x2)
  X1 = 2(x0 - x3) + (x1 - x2)    (3.13)
  X3 = (x0 - x3) - 2(x1 - x2)
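The butterfly in equation (3.13) computes exactly the matrix product with C_f, which can be checked with a short Python sketch (illustrative only, not the thesis implementation):

```python
import numpy as np

# The modified DCT matrix Cf from equation (3.9).
CF = np.array([[1, 1, 1, 1],
               [2, 1, -1, -2],
               [1, -1, -1, 1],
               [1, -2, 2, -1]])

def dct_1d(x0, x1, x2, x3):
    """Butterfly form of the 1D core transform, equation (3.13)."""
    s03, d03 = x0 + x3, x0 - x3
    s12, d12 = x1 + x2, x1 - x2
    return s03 + s12, 2 * d03 + d12, s03 - s12, d03 - 2 * d12

def dct_2d(block):
    """Core 4x4 transform W = Cf X Cf^T (scaling absorbed elsewhere)."""
    return CF @ block @ CF.T

x = np.array([1, 2, 3, 4])
print(dct_1d(*x), CF @ x)   # the butterfly matches the matrix product
```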
3.5.2 Inverse Discrete Cosine Transform
The transform that reverses the DCT is called the Inverse Discrete Cosine Transform (IDCT). With the design of the DCT in H.264 it is possible to ensure zero mismatch between different decoders. This is because the DCT and IDCT (3.14) can be calculated in integer arithmetic. In the standard DCT some mismatch can occur, caused by different representation and precision of fractional numbers in encoder and decoder. [10]
The 2D IDCT transform in H.264 is given by

  X_r = C_i^T (Y ⊗ E_i) C_i    (3.14)

where

          | 1   1    1   1/2 |          | a^2  ab   a^2  ab  |
  C_i^T = | 1  1/2  -1  -1   |    E_i = | ab   b^2  ab   b^2 |
          | 1 -1/2  -1   1   |          | a^2  ab   a^2  ab  |
          | 1  -1    1  -1/2 |          | ab   b^2  ab   b^2 |
and X_r is the reconstructed original block and Y is the previously transformed block. As with the DCT the pre-scaling (⊗E_i) can be absorbed into the rescaling process. [10] This will be described in more detail in section 3.5.4 which covers the rescaling.

Figure 3.10: IDCT functional schematic
The function of the IDCT can be seen in figure 3.10. To do a two-dimensional IDCT two one-dimensional IDCTs are performed after each other, the first one on rows and the second one on columns or vice versa. [6] The operations performed while calculating the IDCT can be written as equation (3.15).
  x0 = (X0 + X2) + (X1 + (1/2)X3)
  x1 = (X0 - X2) + ((1/2)X1 - X3)
  x2 = (X0 - X2) - ((1/2)X1 - X3)    (3.15)
  x3 = (X0 + X2) - (X1 + (1/2)X3)
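That the scaled transform pair really inverts can be verified numerically; a floating point Python sketch (illustrative only, the real encoder uses the integer forms with scaling absorbed into quantization and rescaling):

```python
import numpy as np

a, b = 0.5, np.sqrt(2 / 5)
CF = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
CI = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
sf = np.array([a, b / 2, a, b / 2])      # forward per-row scaling, Ef = sf sf^T
si = np.array([a, b, a, b])              # inverse per-row scaling, Ei = si si^T
EF = np.outer(sf, sf)
EI = np.outer(si, si)

X = np.arange(16, dtype=float).reshape(4, 4)
Y = (CF @ X @ CF.T) * EF                 # forward, equation (3.10)
Xr = CI.T @ (Y * EI) @ CI                # inverse, equation (3.14)
print(np.allclose(Xr, X))                # True: the pair is an exact inverse
```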
3.5.3 Quantization
Information is often concentrated in the lower frequency area; therefore quantization can be used to further compress the data after applying the DCT. H.264 uses a parameter in the quantization called the Quantization Parameter (QP). The QP describes how much quantization should be applied, i.e. how much data should be truncated. A total of 52 values ranging from 0 to 51 are supported by the H.264 standard. Using a high QP will decrease the size of the coded data but it will also decrease the visual quality of the coded video. With QP = 0 the quantization is at its finest and almost all data is kept. [10]
From QP the quantizer step size (Qstep) can be derived. The first values of Qstep are presented in table 3.1. Note that Qstep doubles in value for every increase of 6 in QP. The large number of step sizes provides the ability to accurately control the trade-off between bitrate and quality in the encoder. [10]
  QP    | 0     | 1      | 2      | 3     | 4 | 5     | 6    | 7     | 8     | ...
  Qstep | 0.625 | 0.6875 | 0.8125 | 0.875 | 1 | 1.125 | 1.25 | 1.375 | 1.625 | ...

Table 3.1: Qstep for a few different values of QP
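Because of the doubling property, the whole Qstep range can be generated from its first six entries; a small Python sketch (illustrative):

```python
# Qstep for an arbitrary QP derived from the first six entries of table 3.1,
# since Qstep doubles for every increase of 6 in QP.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    return QSTEP_BASE[qp % 6] * 2 ** (qp // 6)

print(qstep(6), qstep(8))  # 1.25 1.625
```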
The basic formula for quantization can be written as

  Z_ij = round(Y_ij / Qstep)    (3.16)

where Y_ij is a coefficient of the previously transformed block to be quantized and Z_ij is a coefficient of the quantized block. The rounding operation does not have to be to the nearest integer; it could be biased towards smaller integers which could give perceptually higher quality. This is true for all rounding operations in the quantization. [10]
As mentioned in section 3.5.1 the quantization can absorb the post-scaling (⊗E_f) from the DCT. The unscaled output from the DCT can then be written as W = C_f X C_f^T (as compared to the scaled output which is Y = C_f X C_f^T ⊗ E_f). [10] This gives

  Z_ij = round(W_ij * PF_ij / Qstep)    (3.17)

where W_ij is a coefficient of the unscaled transformed block, Z_ij is a coefficient of the quantized block and PF_ij is either a^2, ab/2 or b^2/4 for each (i,j) according to

       | a^2   ab/2   a^2   ab/2  |
  PF = | ab/2  b^2/4  ab/2  b^2/4 |    (3.18)
       | a^2   ab/2   a^2   ab/2  |
       | ab/2  b^2/4  ab/2  b^2/4 |
PF and Qstep can then be reformulated using a multiplication factor (MF) and a division. MF is in fact a 4x4 matrix of multiplication factors according to

       | A  C  A  C |
  MF = | C  B  C  B |    (3.19)
       | A  C  A  C |
       | C  B  C  B |
where the values of A, B and C depend on QP according to

  QP |   A   |  B   |  C
   0 | 13107 | 5243 | 8066
   1 | 11916 | 4660 | 7490
   2 | 10082 | 4194 | 6554
   3 |  9362 | 3647 | 5825
   4 |  8192 | 3355 | 5243
   5 |  7282 | 2893 | 4559

Table 3.2: Multiplication factor MF
The scaling factors in MF are repeated for every increase of 6 in QP. The reformulation of PF and Qstep then becomes

  PF / Qstep = MF / 2^qbits    (3.20)

where qbits is calculated as

  qbits = 15 + floor(QP / 6)    (3.21)

This gives a new quantization formula according to

  Z_ij = round(W_ij * MF_ij / 2^qbits)    (3.22)

which is the final form. [10]
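The final quantization formula can be sketched as follows (a Python illustration; the position rule for A, B and C follows the MF layout in (3.19)):

```python
# Integer-friendly quantization, equation (3.22). MF values for QP 0-5 are
# taken from table 3.2 and repeat (with qbits growing) every 6 QP steps.
MF_ABC = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
          (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559)]

def mf(qp, i, j):
    a_, b_, c_ = MF_ABC[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return a_                     # positions where PF = a^2
    if i % 2 == 1 and j % 2 == 1:
        return b_                     # positions where PF = b^2/4
    return c_                         # positions where PF = ab/2

def quantize(w, qp, i, j):
    qbits = 15 + qp // 6
    return round(w * mf(qp, i, j) / 2 ** qbits)

print(quantize(100, 0, 0, 0))  # round(100 * 13107 / 32768) = 40
```

Note that 13107 / 2^15 ≈ 0.4, which matches PF / Qstep = 0.25 / 0.625 for QP = 0 at the a^2 positions.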
3.5.4 Rescaling
The rescaling also uses Qstep, which depends on the Quantization Parameter (QP) and is the same as for quantization (see table 3.1). The basic formula for rescaling can be written as

  Y'_ij = Z_ij * Qstep    (3.23)
where Z_ij is a coefficient of the previously quantized block and Y'_ij is a coefficient of the rescaled block. The rounding operation, as in the quantizer, does not have to be to the nearest integer; it could be biased towards smaller integers which could give perceptually higher quality. This is true for all rounding operations in the rescaling. [10]
As the quantization formula was reformulated, the rescaling formula can also absorb the pre-scaling (⊗E_i) and be reformulated to match the quantization formula. The new formula for rescaling where the pre-scaling factor is included can be written as
  W'_ij = Z_ij * Qstep * PF_ij * 64    (3.24)

where PF_ij here takes the position dependent values a^2, ab or b^2 from the pre-scaling matrix E_i, Z_ij is a coefficient of the previously quantized block, W'_ij is a coefficient of the rescaled block and the constant scaling factor of 64 is included to avoid rounding errors while calculating the Inverse DCT. [10]
Much like MF for the quantization the rescaling also uses a 4 × 4 matrix of scaling factors called V, which also incorporates the constant scaling factor of 64 introduced in (3.24). V can be written as
      | A  C  A  C |
  V = | C  B  C  B |    (3.25)
      | A  C  A  C |
      | C  B  C  B |
where the values of A, B and C depend on QP according to

  QP |  A |  B |  C
   0 | 10 | 16 | 13
   1 | 11 | 18 | 14
   2 | 13 | 20 | 16
   3 | 14 | 23 | 18
   4 | 16 | 25 | 20
   5 | 18 | 29 | 23

Table 3.3: Scaling factor V
The scaling factors in V are, like MF, repeated for every increase of 6 in QP. With V the rescaling formula can be written as

  W'_ij = Z_ij * V_ij * 2^floor(QP/6)    (3.26)
which is the final form. [10]
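Putting transform, quantization and rescaling together, the whole forward and inverse path can be checked numerically with a small Python sketch (illustrative; QP = 0 and the QP = 0 rows of tables 3.2 and 3.3 are assumed):

```python
import numpy as np

# End-to-end sketch: core DCT, integer quantization (3.22), rescaling (3.26)
# and the IDCT with the final division by 64.
CF = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
CI = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
MF0 = np.array([[13107, 8066, 13107, 8066], [8066, 5243, 8066, 5243],
                [13107, 8066, 13107, 8066], [8066, 5243, 8066, 5243]])
V0 = np.array([[10, 13, 10, 13], [13, 16, 13, 16],
               [10, 13, 10, 13], [13, 16, 13, 16]])

qp = 0
x = np.array([[1, 2, 3, 4], [5, 6, 7, 8],
              [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int64)
w = CF @ x @ CF.T                                # core transform W = Cf X Cf^T
z = np.round(w * MF0 / 2 ** (15 + qp // 6))      # quantize, equation (3.22)
wr = z * V0 * 2 ** (qp // 6)                     # rescale, equation (3.26)
xr = np.round(CI.T @ wr @ CI / 64)               # inverse transform and /64
print(np.abs(xr - x).max())                      # small reconstruction error
```

The products MF * V are tuned so that the forward and inverse scalings cancel (for example 13107 * 10 / 2^15 = 4.0 at the DC position), leaving only the rounding error of the quantizer.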
3.6 Deblocking filter
When using block coding algorithms such as the DCT, blocking artifacts can occur. This is unwanted because it lowers the visual quality and prediction performance. The solution to this is to add a filter that removes these artifacts. The filter is placed after the IDCT in the encoding loop, which can be seen in figure 3.1. The filter is used on both luma and chroma samples of the video sequence. [10]
Figure 3.11: Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance(a) and start in 1 and end in 4 for chrominance(b)
The deblocking filter in H.264 has 5 levels of filtering, 0 to 4, where 4 is the option with the strongest filtering. The filter is actually two different filters where the first filter is applied on levels 1 to 3 and the second on level 4. Level 0 means that no filter should be applied. The filter level parameter is called boundary strength (bS). The parameter depends on the current quantization parameter, macroblock type and the gradient of the image samples across the boundary. There is one bS for every boundary between two 4x4 pixel blocks. The deblocking filter is applied to one macroblock at a time in a raster scan order throughout the frame. [5]
Figure 3.12: Pixels in blocks adjacent to vertical and horizontal boundaries

When applying the deblocking filter on a macroblock it is done in a special order which is illustrated in figure 3.11. The filter is applied on vertical and horizontal edges as shown in figure 3.12, where p0, p1, p2, p3, q0, q1, q2, q3 are pixels from two neighboring blocks, p and q. The filtering of these pixels only takes place if equations (3.27), (3.28) and (3.29) are fulfilled.
  |p0 - q0| < α(index_A)    (3.27)

  |p1 - p0| < β(index_B)    (3.28)

  |q1 - q0| < β(index_B)    (3.29)

  index_A = Min(Max(0, QP + Offset_A), 51)    (3.30)

  index_B = Min(Max(0, QP + Offset_B), 51)    (3.31)

The values of α and β are approximately given by equation (3.32) and equation (3.33).

  α(x) = 0.8 * (2^(x/6) - 1)    (3.32)

  β(x) = 0.5x - 7    (3.33)
Note that in equations (3.30) and (3.31) it can be seen that the filtering is dependent on the Quantization Parameter. The different filters applied are 3-, 4- and 5-tap FIR filters which are further described in [5].
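The filtering decision can be sketched as follows (a Python illustration; alpha and beta use the approximations in (3.32) and (3.33), whereas the standard defines them as tables):

```python
def clip_index(qp, offset):
    """Clip the table index to the QP range, equations (3.30)/(3.31)."""
    return min(max(0, qp + offset), 51)

def alpha(x):
    return 0.8 * (2 ** (x / 6) - 1)   # approximation, equation (3.32)

def beta(x):
    return 0.5 * x - 7                # approximation, equation (3.33)

def should_filter(p1, p0, q0, q1, qp, off_a=0, off_b=0):
    """Check conditions (3.27)-(3.29) for one boundary."""
    ia, ib = clip_index(qp, off_a), clip_index(qp, off_b)
    return (abs(p0 - q0) < alpha(ia) and
            abs(p1 - p0) < beta(ib) and
            abs(q1 - q0) < beta(ib))

print(clip_index(50, 6))  # clipped to 51
```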
3.7 Entropy coding
The H.264 standard supports two different entropy coding algorithms, Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). CABAC is the more efficient of the two but requires higher computational complexity. Bitrate savings of CABAC can be between 9% and 14% compared to CAVLC [7]. CAVLC is supported in all H.264 profiles but CABAC is only supported in the profiles above extended. [10]
Chapter 4
Overview of the ePUMA
Architecture
This chapter covers an introduction to the ePUMA processor architecture. The memory hierarchy, master core, Sleipnir core, the direct memory access controller and the simulator will be covered.
4.1 Introduction to ePUMA
Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access (ePUMA) is a multi-core DSP processor architecture with 1 master core and 8 calculation cores. The master core handles the Direct Memory Access (DMA) communications. The slave core, which is also called Sleipnir, is a 15-stage pipelined calculation core.
4.2 ePUMA Memory Hierarchy
The ePUMA memory hierarchy consists of three levels where the first level is the off-chip main memory, the second level is the local storage of the master and slaves and the third and final level is the registers of the master and slave cores. In figure 4.1 an illustration of how each core is connected to the on-chip interconnection is depicted. The on-chip interconnection is in turn connected to the off-chip main memory. The main memory is addressed with a high word of 16 bits and a low word of 16 bits, i.e. 32-bit addressing is used where each address corresponds to one word of data.
Figure 4.1: ePUMA memory hierarchy
The on-chip network is depicted in figure 4.2 where N0 to N7 are interconnection nodes. As can be seen from the figure the nodes are connected both to the master and the respective Sleipnir core but also to other nodes. This gives the ability to transfer data between Sleipnir cores and even pipeline the cores. With this setup data can be transferred in any way and combination that does not overlap.

Figure 4.2: The ePUMA on-chip network
4.3 Master Core
The master core is for the moment based on a processor called Senior. This processor has been around at the Division of Computer Engineering for some years now and is used in some courses for educational purposes. The Senior processor is a DSP processor, which means it has a Multiply and ACcumulate (MAC) unit and other DSP related capabilities. To enable it to serve as a master core, memory ports for the DMA controller and interrupts coming from the DMA and Sleipnir cores have been added.
4.3.1 Master Memory Architecture
The master core has 2 RAMs and 2 ROMs, organized as Data Memory 0 (DM0) and Data Memory 1 (DM1). These memories are the local storage of the master core. The ROMs start at address 0x8000 in the respective memory, which leaves 0x7FFF = 32767 words in each RAM to work with.
For calculations the master core has 32 16-bit registers that can be used as buffers. There are also a number of special registers such as 4 address registers, registers for hardware looping and registers supporting cyclic addressing in address registers 0 and 1. Address registers 0 and 1 also support different step sizes.
4.3.2 Master Instruction Set
The programming guide and instruction set for Senior can be found in [9] and [8], even though they might not be totally accurate because of the modifications for the ePUMA project. The master's instruction set is in large the same as the Senior instruction set. It is a standard DSP instruction set with support for a convolution instruction which multiplies and accumulates the results. To speed up looping a hardware loop function called repeat is included. All jumps, calls and returns can use 0 to 3 delay slots. The number of delay slots specifies how many instructions after the flow control instruction will be executed. If not all delay slots are used for useful instructions, nop instructions will be inserted in the pipeline.
4.3.3 Datapath
The datapath of the master consists of a 5-stage pipeline which can be seen in figure 4.3. There is only one exception to this: the convolution instruction (conv) uses a 7-stage pipeline, but a figure of this is omitted for lack of relevance. The datapath is advanced enough for scalar calculations; larger computational loads should be delegated to the Sleipnir cores. In table 4.1, originally found in [9], a description of the pipeline stages is presented.
Figure 4.3: Senior datapath for short instructions
  Pipe | RISC-E1/E2 RISC            | Memory load/store
  P1   | IF: Instr. Fetch           | IF: Instr. Fetch
  P2   | ID: Instr. Decode          | ID: Instr. Decode
  P3   | OF: Operand Fetch          | OF+AG: Compute addr
  P4   | EX1: Execution (set flags) | MEM: Read/Write
  P5   | EX2: Only for MAC, RWB     | WB: Write back (if load)

Table 4.1: Pipeline specification
4.4 Sleipnir Core
Sleipnir is the name of the calculation core. In the ePUMA processor there are 8 of them. The Sleipnir is a Single Instruction Multiple Data (SIMD) architecture which in this case means it can perform vector calculations. Each full vector consists of 128 bits and is divided into 8 words of 16 bits which can run through the pipeline in parallel. The datapath of the Sleipnir core has 15 pipeline stages. The pipeline length of an instruction is variable depending on the choice of operands.
4.4.1 Sleipnir Memory Architecture
The Sleipnir core has 3 memories where 2 of them are connected to the core and the third memory is connected to the DMA bus. The memories are called Local Vector Memories (LVMs). By being able to swap which memories are connected to the processor and which memory is connected to the DMA, better utilization can be reached and a lot of the transfer cycle cost can be hidden.
Constant Memory
Each Sleipnir is also provided with a Constant Memory (CM) for use of constants during runtime. This memory can be used for different tasks such as holding scalar constants or permutation vectors. All constants that will be used during runtime can be stored in the CM. The memory can contain up to 256 vectors.
Local Vector Memory
The Local Vector Memories (LVM) are the local memories of the Sleipnir core. As described above each core has access to 2 LVMs at runtime. These memories are 4096 vectors large, where each vector is 128 bits wide. The memories have one address for each word of 16 bits. The memories consist of 8 memory banks, one for each word in a vector. The constant memory can be used to address the LVMs according to the values stored in the constant memory. The constant memory addressing of the LVMs can be used to generate a permutation of data which can be used for e.g. transposing a matrix.
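The permutation addressing can be sketched as follows (a Python illustration; the names and index layout are illustrative, not actual Sleipnir syntax). A CM vector of addresses selects LVM words in permuted order, here to read a 4x4 matrix transposed:

```python
import numpy as np

lvm = np.arange(16, dtype=np.int16)          # a 4x4 matrix stored row-major
perm = np.array([(i % 4) * 4 + i // 4 for i in range(16)])  # transpose pattern
transposed = lvm[perm]                       # CM-addressed (gather) read
print(transposed.reshape(4, 4)[0])           # first row = old first column
```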
Vector Register File
There are 8 Vector Registers (VR) in the Vector Register File (VRF), VR0 to VR7, for use in computations during runtime. Each word can be accessed separately; it is also possible to access a double word or a half vector (high or low) in each of the 8 vector registers. The different access types are listed in table 4.2, originally found in [4].
  Syntax   | Size    | Description
  vrX.Y    | 16-bit  | Word
  vrX.Yd   | 32-bit  | Double word
  vrX{h,l} | 64-bit  | Half vector
  vrX      | 128-bit | Vector

Table 4.2: Register file access types
Special Registers
There are 4 address registers ar0-ar3 which can be used to address memory in the LVMs. There are also 4 configuration registers for these 4 address registers, holding values for top, bottom and step size which can be used when addressing memories in all kinds of loops. The different increment operations are listed in table 4.3, originally found in [4].
  arX+=C  | Fixed increment; C = 1, 2, 4 or 8
  arX-=C  | Fixed decrement; C = 1, 2, 4 or 8
  arX+=S  | Increment from stepX register
  arX+=C% | Fixed increment with cyclic addressing
  arX-=C% | Fixed decrement with cyclic addressing
  arX+=%  | Increment from stepX with cyclic addressing
Table 4.3: Address register increment operations
The addressing of the two LVMs can be done with one of the four address registers, immediate addresses, vector registers or in combination with the constant memory, to form advanced addressing schemes as shown in table 4.4, originally found in [4].
Mode# Index Offset Pattern Syntax example
0 arX 0 0,1,2,3,4,5,6,7 [ar0]
1 arX 0 cm[carX] [ar0 + cm[car0]]
2 arX 0 cm[imm8] [ar0 + cm[10]]
3 arX 0 cm[carX + imm8] [ar0 + cm[car0 + 10]]
4 0 vrX.Y 0,1,2,3,4,5,6,7 [vr0.0]
5 0 vrX.Y cm[carX] [vr0.0 + cm[car0]]
6 0 vrX.Y cm[imm8] [vr0.0 + cm[10]]
7 0 vrX.Y cm[carX + imm8] [vr0.0 + cm[car0 + 10]]
8 0 0 vrX [vr0]
9 0 0 cm[carX] [cm[car0]]
10 0 0 cm[imm8] [cm[10]]
11 0 0 cm[carX + imm8] [cm[car0 + 10]]
12 arX 0 vrX [ar0 + vr0]
13 arX vrX.Y 0,1,2,3,4,5,6,7 [ar0 + vr0.0]
14 arX imm16 0,1,2,3,4,5,6,7 [ar0 + 1024]
15 0 imm16 0,1,2,3,4,5,6,7 [1024]
Table 4.4: Addressing modes examples
Program Memory
The program memory (PM) can contain up to 512 instructions. It can be loaded from the main memory by issuing a DMA transaction.
The program that is loaded into the Sleipnir PM is called a block. A kernel is a combination of master code and blocks. A block can utilize several Sleipnir cores with internal data transfers. Blocks can however not communicate with cores outside the block and cannot be data dependent on any other block running at the same time.
If for some reason the Sleipnir block code is larger than 512 lines of instructions it can be divided into two programs and the memory can be transferred between two Sleipnir cores. For this to work code is needed in the master to keep track of the cores and move data to the next core for further processing. When developing a new block or kernel it can sometimes be good to have a little extra memory. Therefore it is possible to increase the size of the PM in the simulator.
4.4.2 Datapath
The datapath of the Sleipnir slave core is an 8-way 16-bit datapath. The datapath is divided into 15 pipeline stages and is depicted in figure 4.4. A more detailed version of the datapath can be found in [2].
Figure 4.4: Sleipnir datapath pipeline schematic
The datapath includes 16 16x16-bit multipliers and two Arithmetic Logic Units (ALU) connected in series. Simpler instructions can bypass the first ALU and by that become a shorter instruction which saves some execution time. These bypasses can be seen in stage D1 to D4 in figure 4.4. Some instructions use a very short datapath such as the jump instruction which is executed in stage A2. This makes the use of precalculated branch decisions unnecessary. Stage E1 to E4 can be described as the write back stage and therefore it follows after stage D4. Stage D3 and D4 are very similar but provides the core with the possibility of performing summation of a complete vector and similar tasks.
4.4.3 Sleipnir Instruction Set
The instruction set used is application specific. The instruction set includes no move or load instructions for data; these functions are all included in one instruction which is called copy. Operands and instructions can be combined in different ways, with variable pipeline length as a result. The pipeline length depends on e.g. where the input operands are fetched from, where the result will be stored and if the instruction uses or bypasses the first ALU and the multipliers. Instruction names are built upon what data they affect and how. For example the instruction vcopy m0[0].vw m1[0].vw copies a vector from memory 1 address 0 to memory 0 address 0. If the instruction scopy would be used instead it would only copy a scalar word. Another example is the add instruction. If vaddw m0[0].vw m1[0].vw vr0 is used, two vectors will be loaded from m1 and vr0. The .vw after the memory address denotes that the vectors will be added word wise, that means they will be considered as eight words. This means that the processor can carry out 8 additions per clock cycle. [4]
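The word-wise vector add can be modeled in a few lines of Python (an illustration of the lane semantics only; the int16 wraparound used here is an assumption about overflow behaviour, not a documented Sleipnir property):

```python
import numpy as np

def vaddw(va, vb):
    """A 128-bit vector treated as eight 16-bit words: one vaddw
    performs 8 independent word-wise additions."""
    return (va.astype(np.int16) + vb.astype(np.int16)).astype(np.int16)

va = np.arange(8, dtype=np.int16)
vb = np.full(8, 10, dtype=np.int16)
print(vaddw(va, vb))  # [10 11 12 13 14 15 16 17]
```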
4.4.4 Complex Instructions
To reach better performance results the datapath has to be utilized as much as possible, especially in the inner loops of the critical path. To be able to reach this better performance, new specialized instructions that perform several smaller tasks could be implemented. The result of this is that by pipelining several of these new complex instructions more work can be done in less time and the program will reach an increased throughput.
Things that have been considered when deciding upon accelerating certain parts of code are listed below.
• Motivation – Why should the acceleration be done
• Description – What is going to be accelerated
• Extra hardware needed – What extra hardware is needed for acceleration of the specific task
• Profiling and usage – Is the task used a lot and therefore worth accelerating
• Extra hardware cost – What is the cost of the extra hardware
• Cycle gain – How many cycles can be saved
• Efficiency – How efficient is the new solution in terms of cost per gain in performance
4.5 DMA Controller
The Direct Memory Access (DMA) controller is used to load and store data to and from an off-chip memory. The DMA can transfer a 128-bit vector to one of the