Department of Electrical Engineering (Institutionen för systemteknik)

Master's Thesis

A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor

Master's thesis in Computer Engineering at Linköping Institute of Technology
by Jonas Einemo and Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE
Linköping 2010
Supervisor: Olof Kraigher, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet
Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2010-06-15
ISRN: LiTH-ISY-EX--10/4392--SE
URL: http://www.da.isy.liu.se/en/index.html, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4292
Title: A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor
Authors: Jonas Einemo, Magnus Lundqvist

Abstract
H.264 is a video coding standard which offers high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.
Acknowledgments
We would like to thank everyone who has helped us during our thesis work, especially our supervisor Olof Kraigher for all his help and useful hints, and our examiner Professor Dake Liu for his support, comments and the opportunity to do this thesis. We would also like to thank Jian Wang for the support on the DMA firmware, Jens Ogniewski for the help with understanding the H.264 standard, and our families and friends for their support and for bearing with us during the work on this thesis.
Jonas Einemo Magnus Lundqvist
Contents
1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Way of Work
  1.5 Outline
2 Overview of Video Coding
  2.1 Introduction to Video Coding
  2.2 Color Spaces
  2.3 Predictive Coding
  2.4 Transform Coding and Quantization
  2.5 Entropy Coding
  2.6 Quality Measurements
    2.6.1 Subjective Quality
    2.6.2 Objective Quality
3 Overview of H.264
  3.1 Introduction to H.264
  3.2 Coded Slices
    3.2.1 I Slice
    3.2.2 P Slice
    3.2.3 B Slice
    3.2.4 SP Slice
    3.2.5 SI Slice
  3.3 Intra Prediction
  3.4 Inter Prediction
    3.4.1 Hexagon Search
  3.5 Transform Coding and Quantization
    3.5.1 Discrete Cosine Transform
    3.5.2 Inverse Discrete Cosine Transform
    3.5.3 Quantization
    3.5.4 Rescaling
  3.6 Deblocking Filter
  3.7 Entropy Coding
4 Overview of the ePUMA Architecture
  4.1 Introduction to ePUMA
  4.2 ePUMA Memory Hierarchy
  4.3 Master Core
    4.3.1 Master Memory Architecture
    4.3.2 Master Instruction Set
    4.3.3 Datapath
  4.4 Sleipnir Core
    4.4.1 Sleipnir Memory Architecture
    4.4.2 Datapath
    4.4.3 Sleipnir Instruction Set
    4.4.4 Complex Instructions
  4.5 DMA Controller
  4.6 Simulator
5 Elaboration of Objectives
  5.1 Task Specification
    5.1.1 Questions at Issue
  5.2 Method
  5.3 Procedure
6 Implementation
  6.1 Motion Estimation
    6.1.1 Motion Estimation Reference
    6.1.2 Complex Instructions
    6.1.3 Sleipnir Blocks
    6.1.4 Master Code
  6.2 Discrete Cosine Transform and Quantization
    6.2.1 Forward DCT and Quantization
    6.2.2 Rescaling and Inverse DCT
7 Results and Analysis
  7.1 Motion Estimation
    7.1.1 Kernel 1
    7.1.2 Kernel 2
    7.1.3 Kernel 3
    7.1.4 Kernel 4
    7.1.5 Kernel 5
    7.1.6 Master Code
    7.1.7 Summary
8 Discussion
  8.1 DMA
  8.2 Main Memory
  8.3 Program Memory
  8.4 Constant Memory
  8.5 Vector Register File
  8.6 Register Forwarding
  8.7 New Instructions
    8.7.1 SAD Calculations
    8.7.2 Call and Return
  8.8 Master and Sleipnir Core
  8.9 ePUMA H.264 Encoding Performance
  8.10 ePUMA Advantages
  8.11 Observations
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
Bibliography
A Proposed Instructions
List of Figures
2.1 Overview of the data flow in a basic encoder and a decoder
2.2 YUV 4:2:0 sampling format
3.1 Overview of the data flow in an H.264 encoder
3.2 4x4 luma prediction modes
3.3 16x16 luma prediction modes
3.4 Different ways to split a macroblock in inter prediction
3.5 Subsamples interpolated from neighboring pixels
3.6 Multiple frame prediction
3.7 Large (a) and small (b) search pattern in the hexagon search algorithm
3.8 Movement of the hexagon pattern in a search area and the change to the smaller search pattern
3.9 DCT functional schematic
3.10 IDCT functional schematic
3.11 Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
3.12 Pixels in blocks adjacent to vertical and horizontal boundaries
4.1 ePUMA memory hierarchy
4.2 ePUMA star network interconnection
4.3 Senior datapath for short instructions
4.4 Sleipnir datapath pipeline schematic
4.5 Sleipnir Local Store switch
6.1 Motion estimation program flowchart
6.2 Motion estimation computational flowchart
6.3 Hexagon search program flow controller
6.4 Proposed implementation of call and return hardware
6.5 Reference macroblock overlap
6.6 Reference macroblock partitioning for 13 data macroblocks
6.7 Master program flowchart
6.8 Memory allocation of data memory in the master (a) and main memory allocation (b)
6.9 Sleipnir core motion estimation task partitioning and synchronization
6.10 DCT flowchart
6.11 Memory transpose schematic
7.1 Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed
7.2 Frame 10 from Pedestrian Area video sequence
7.3 Difference between frame 10 and frame 11 in Pedestrian Area video sequence
7.4 Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence
7.5 Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation
8.1 Sleipnir core DCT task partitioning and synchronization
8.2 Memory allocation of macroblock in LVM for intra coding
A.1 HVBSUMABSDWA
A.2 HVBSUMABSDNA
A.3 HVBSUBWA
List of Tables
3.1 Qstep for a few different values of QP
3.2 Multiplication factor MF
3.3 Scaling factor V
4.1 Pipeline specification
4.2 Register file access types
4.3 Address register increment operations
4.4 Addressing modes examples
7.1 Short names for kernels that have been tested
7.2 Description of table columns
7.3 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core
7.4 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores
7.5 Block 1 costs
7.6 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core
7.7 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores
7.8 Block 2 costs
7.9 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores
7.10 Kernel 3 costs
7.11 Motion estimation results from simulation with Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores
7.12 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores
7.13 Kernel 4 costs
7.14 Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores
7.15 Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores
7.16 Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores
7.17 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores
7.18 Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores
7.19 Kernel 5 costs
7.20 Master code cost
7.21 Prolog and epilog cycle costs
7.22 Simulated epilog cycle cost including waiting for last Sleipnir to finish
7.23 DMA cycle costs
7.24 Costs for DCT with quantization block and IDCT with rescaling block
B.1 Simulation cycle cost of motion estimation kernels
Abbreviations
AGU Address Generation Unit
ALU Arithmetic Logic Unit
AVC Advanced Video Coding
CABAC Context-based Adaptive Binary Arithmetic Coding
CAVLC Context-based Adaptive Variable Length Coding
CB Copy Back
CM Constant Memory
CODEC COder/DECoder
DCT Discrete Cosine Transform
DMA Direct Memory Access
DSP Digital Signal Processing
ePUMA Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access
FIR Finite Impulse Response
FPS Frames Per Second
FS Full Search
HDTV High-Definition Television
HVBSUBNA Half Vector Bytewise SUBtraction Not word Aligned
HVBSUBWA Half Vector Bytewise SUBtraction Word Aligned
HVBSUMABSDNA Half Vector Bytewise SUM of ABSolute Differences Not word Aligned
HVBSUMABSDWA Half Vector Bytewise SUM of ABSolute Differences Word Aligned
IDCT Inverse Discrete Cosine Transform
IEC International Electrotechnical Commission
ISO International Organization for Standardization
ITU International Telecommunications Union
LS Local Storage
LVM Local Vector Memory
MAE Mean Absolute Error
MB Macroblock
MC Motion Compensation
ME Motion Estimation
MF Multiplication Factor
MPEG Moving Picture Experts Group
MSE Mean Square Error
NAL Network Abstraction Layer
NoC Network on Chip
PM Program Memory
PSNR Peak Signal to Noise Ratio
QP Quantization Parameter
RGB Red, Green and Blue, A color space
ROM Read Only Memory
SAD Sum of Absolute Differences
SPRF SPecial Register File
STI Sony Toshiba IBM
V Rescaling Factor
VCEG Video Coding Experts Group
VRF Vector Register File
Chapter 1
Introduction
This chapter gives the background to the thesis, defines its purpose and scope, describes the way of work, and presents the outline of the thesis.
1.1 Background
With new handheld devices and mobile systems offering more advanced services, the need for increased computational power at low cost, both in terms of chip area and power dissipation, is ever increasing. Now that video playback and recording are standard applications rather than premium features in mobile devices, high computational power at a low cost is still a problem without a sufficient solution.
The Division of Computer Engineering at the Department of Electrical Engineering at Linköpings Tekniska Högskola has for some time been part of a research project called ePUMA, which can be read out as Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access. The development is driven by the pursuit of the next generation of digital signal processing demands. By developing a cheap and low power processor with large calculation power, this new architecture aims to meet tomorrow's demands in digital signal processing. The main applications for the processor are future radio base stations, radar and High-Definition Television (HDTV).
H.264 is a standard for video compression that was first released in 2003. It is now a mature and widely spread standard that is used in Blu-ray, popular video streaming websites like YouTube, television services and video conferencing. It provides very good compression at the cost of high computational complexity. The hope is that the ePUMA multi-core architecture will be able to handle real-time video encoding using the H.264 standard.
At the Division of Computer Engineering previous work has been done on implementing an H.264 encoder for another multi-core architecture. This work was done on the STI Cell which is used in e.g. the popular video gaming console PLAYSTATION 3.
1.2 Purpose
The purpose of this master thesis is to evaluate the capability of the ePUMA processor architecture with respect to real-time video encoding using the H.264 video compression standard, and to identify and expose possible areas of improvement in the ePUMA architecture. This is done by implementing parts of an H.264 encoder and, if possible, comparing the cycles needed to the previously implemented STI Cell H.264 encoder.
1.3 Scope
By implementing the most computationally expensive parts in the H.264 standard it would be possible to better estimate if the ePUMA processor architecture is capable of encoding video using the H.264 standard in real time. Studying the H.264 standard it can be seen that entropy coding is the most time consuming part if it is done in software. Because of the large amount of bit manipulations needed, it is not feasible to perform entropy coding in the processor. Therefore an early decision was made that entropy coding had to be hardware accelerated and that it should not be a part of this thesis.
In this thesis no exact hardware costs for performance improvements will be calculated; instead their feasibility will be discussed.
The time constraint of this master thesis is twenty weeks which restricts the extent of the work. Because of the time constraint some parts of a complete encoder have had to be left out.
1.4 Way of Work
One of the most time consuming tasks is motion estimation, which together with the discrete cosine transform and quantization became the primary target for evaluation. First a working implementation was produced. An iterative development process was then used to refine the implementations and reach better performance. The partial implementations of the H.264 standard were written for the ePUMA system simulator. The simulator was also used for all performance measurements of the implementations, using frames from several commonly used test video sequences. Once the performance measurement results were acquired they were analyzed and conclusions were drawn. The way of work is elaborated in section 5.2 and section 5.3.
1.5 Outline
This thesis is aimed at an audience with an education in electrical engineering, computer engineering or similar. Expertise in video coding or the H.264 standard is not necessary as the main principles of these topics will be covered.
The outline of this thesis is ordered as naturally as possible, where this introduction chapter is followed by theoretical chapters containing the topics needed to understand the rest of the thesis. The first of these is chapter 2 which covers the basics of video coding, followed by chapter 3 which offers an introduction to the H.264 video coding standard. The last theoretical chapter is chapter 4 which covers the hardware architecture and toolchain of the ePUMA processor. The theory is followed by chapter 5 where a more detailed task specification, method and procedure of the thesis is presented with help from the knowledge obtained from the theoretical chapters. After that, chapter 6 describes the function and development of the implementations produced. Chapter 7 then presents the results obtained and gives an analysis of them. Chapter 8 contains a discussion about the results as well as ideas thought of while working on this thesis. The final chapter is chapter 9 which contains the conclusions and the future work that could be done in the area.
Chapter 2
Overview of Video Coding
This chapter gives an introduction to video coding, color spaces, predictive coding, transform coding and entropy coding. This knowledge is necessary to understand the rest of the thesis.
2.1 Introduction to Video Coding
A video consists of several images, called frames, shown in a sequence. The amount of disk space required to store a sequence of raw data is huge and therefore video coding is needed. The purpose of video coding is to minimize the data to store on disk or the data to send over a network, without decreasing the image quality too much. There are many techniques and algorithms to do this, such as MPEG-2, MPEG-4 and H.264/AVC. [10]
Figure 2.1: Overview of the data flow in a basic encoder and a decoder

All of these algorithms are constructed from a similar template. First some technique is used to reduce the amount of data to be transformed. The video is then transformed with for example a Discrete Cosine Transform (DCT). After this a quantization is performed to shrink the data further. The data is then pushed through an entropy coder such as Huffman, or a more advanced algorithm such as Context-based Adaptive Binary Arithmetic Coding (CABAC) or Context-based Adaptive Variable Length Coding (CAVLC), which all compress the data based on patterns in the bit-stream. [10] The data flow of a basic encoder and a basic decoder is illustrated in figure 2.1.
As mentioned, a video sequence consists of many frames. In video coding these frames can be divided into something called slices. A slice can be a part of a frame or contain the complete frame. This slice division is advantageous because it gives the ability to know e.g. that data in a slice does not depend on data outside the slice. The frames are also divided into something called macroblocks. A macroblock is a block consisting of 16×16 pixels. This partitioning of the data makes computations easier to organize and structure. [10]
2.2 Color Spaces
To understand video coding some knowledge about different color spaces is needed. One such color space is RGB, whose name comes from its components red, green and blue. With these three colors and different intensities of them it is possible to visualize all colors in the spectrum. Another commonly used color space is YCbCr, also called YUV. In this color space Y represents the luminance (luma) component, which corresponds to the brightness of a specific pixel. The other two components, namely Cb and Cr, are chrominance (chroma) components which carry the color information. [10] The conversion from the RGB color space to the YUV color space is shown in equation (2.1).

Y = kr*R + kg*G + kb*B
Cb = B - Y
Cr = R - Y
Cg = G - Y    (2.1)

As seen in equation (2.1) there also exists a third chrominance component for green, namely Cg, which thanks to equation (2.2) can be calculated as shown in equation (2.3). This means that Cg can be calculated by the decoder and does not have to be transmitted, which is advantageous in the sense of data compression. [10]

kb + kr + kg = 1    (2.2)

Cg = Y - Cb - Cr    (2.3)
The human eye is more sensitive to luminance than to chrominance, and because of that a smaller number of bits can be used to represent the chrominance and a larger number for the luminance. With this feature of the YUV color space the total amount of bits needed to encode a pixel can be reduced. A common way to do this is by applying the 4:2:0 sampling format.
Figure 2.2: YUV 4:2:0 sampling format
The 4:2:0 sampling format can be described as a ’12 bits per pixel’ format where there are 2 samples of chrominance for every 4 samples of luminance as shown in figure 2.2. If each sample is stored using 8 bits this will add up to 6 ∗ 8 = 48 bits for 4 YUV 4:2:0 pixels with an average of 48/4 = 12 bits per pixel. [10]
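The arithmetic above generalizes to whole frames. As a sketch (illustrative Python, not part of the thesis implementation), the storage cost of a 4:2:0 frame can be computed as:

```python
def yuv420_bits_per_frame(width, height, bits_per_sample=8):
    """Bits needed for one YUV 4:2:0 frame: a full-resolution luma
    plane plus two chroma planes subsampled by 2 in both directions."""
    luma_samples = width * height
    chroma_samples = 2 * (width // 2) * (height // 2)
    return (luma_samples + chroma_samples) * bits_per_sample

# Four luma samples share one Cb and one Cr sample,
# giving 12 bits per pixel on average, as in the text.
bits = yuv420_bits_per_frame(1920, 1080)
print(bits / (1920 * 1080))  # 12.0
```

For a 2x2 block this gives 4 luma samples plus 2 chroma samples at 8 bits each, i.e. the 48 bits for 4 pixels mentioned above.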
2.3 Predictive Coding
There are two kinds of predictive coding: intra coding and inter coding. By studying a picture it is easy to see that some parts of the picture are very similar; this is called spatial correlation. The predictive coding that uses these spatial correlations within a frame to form a prediction of other parts of the frame is called intra coding. By studying a sequence of pictures or a video sequence it can be seen that there is usually not much difference between the frames; this is called temporal correlation. By exploiting this temporal correlation a difference, also called a residue, can be calculated which is comprised of smaller values and therefore can be described with a smaller number of bits. This results in better data compression. The predictive coding that uses temporal correlations between different frames is called inter coding. [10]
2.4 Transform Coding and Quantization
The purpose of transform coding is to convert the image data or motion compensated data into another representation of data. This can be done with a number of different algorithms, where the block based Discrete Cosine Transform (DCT) is one of the most common in video coding. The DCT algorithm converts the data to be described into sums of cosine functions oscillating at different frequencies. [10]
There are some different transforms that could be used in video coding, but the common property of them all is that they are reversible, meaning the transform can be reversed without loss of data. This is an important property because otherwise drift between the encoder and decoder can occur, and special algorithms would have to be applied to correct these errors. As mentioned before, block based transform coding is the most common. When using block based transform coding the picture is divided into smaller blocks such as 8 × 8 or 4 × 4 pixels. Each block is then transformed with the chosen transform. The transformed data is then quantized to remove high frequency data. This can be done because the human eye is insensitive to higher frequencies, and therefore these can be removed without any noticeable loss of quality. The quantizer re-maps the input data with one range of values to output data with a smaller range of possible values. This means the output can be coded with fewer bits than the original data, and in this way data compression is achieved. [10]
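As a toy illustration of the remapping step (a plain uniform quantizer in Python, not the exact H.264 scheme described in chapter 3; the step size `qstep` here is only a stand-in for the standard's quantization parameter):

```python
def quantize(coeff, qstep):
    # Map a transform coefficient from a wide input range to a small
    # range of integer levels; information is lost in the rounding.
    return int(round(coeff / qstep))

def rescale(level, qstep):
    # Inverse mapping used by the decoder; only multiples of qstep
    # can be reconstructed, which is what makes the step lossy.
    return level * qstep

print(quantize(17, 8))              # 2
print(rescale(quantize(17, 8), 8))  # 16, not the original 17
```

The larger `qstep` is, the fewer distinct levels survive and the fewer bits the entropy coder needs, at the cost of larger reconstruction error.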
2.5 Entropy Coding
Entropy coding is a lossless data compression technique. The different entropy coding algorithms encode symbols that occur often with a small number of bits and symbols that occur less often with more bits. The bits are all put in a bitstream that can be written to disk or sent over a network. In video coding these symbols can be quantized transform coefficients, motion vectors, headers or other information that needs to be sent to be able to decode the video stream. As mentioned earlier, a few of the usual entropy coding algorithms are Huffman, CABAC and CAVLC. [10]
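The principle can be illustrated with a minimal Huffman construction (illustrative Python using only the standard library; the actual H.264 entropy coders, CAVLC and CABAC, are considerably more elaborate):

```python
import heapq

def huffman_code_lengths(freqs):
    """Return a code length per symbol: frequent symbols get short
    codes, rare symbols get long ones."""
    # Each heap entry: (total frequency, tie-breaker id, {symbol: depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level down.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths({"a": 5, "b": 2, "c": 1, "d": 1})
print(lengths["a"], lengths["d"])  # 1 3  (common 'a' is cheap, rare 'd' is not)
```

The code lengths satisfy the prefix-code property, so the bitstream can be decoded unambiguously.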
2.6 Quality Measurements
There exist several ways to measure the quality of images and to compare uncompressed images with reconstructed ones in order to evaluate video coding algorithms.
2.6.1 Subjective Quality
Subjective quality is the quality that someone watching an image or a video sequence experiences. Subjective quality can be measured by having evaluators rate each part of a series of images or video sequences with different properties. This can be a time consuming and impractical way of measurement in most circumstances. [10]
2.6.2 Objective Quality
To enable more automatic measurements of quality some algorithms are commonly used. One of these is Peak Signal to Noise Ratio (PSNR), which can be used to measure the quality of a reconstructed image by comparing it to an uncompressed one. PSNR gives a logarithmic scale where a higher value is better. The Mean Square Error (MSE) is used in the calculation of PSNR and is calculated as

MSE = \frac{1}{m n} \sum_{i=1}^{m} \sum_{j=1}^{n} (C(i,j) - R(i,j))^2    (2.4)

where n is the image height, m is the image width and C and R are the current and reference images being compared. With the MSE value the PSNR can be calculated as

PSNR = 10 \log_{10} \frac{(2^{bits} - 1)^2}{MSE}    (2.5)

where 2^{bits} - 1 is the largest representable value of a pixel with the specified number of bits. [10]
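The two measures translate directly into code. A sketch of equations (2.4) and (2.5) (illustrative Python with images as lists of rows, assuming the conventional squared peak value in the PSNR numerator):

```python
import math

def mse(current, reference):
    # Equation (2.4): mean of the squared pixel differences.
    total = sum((c - r) ** 2
                for row_c, row_r in zip(current, reference)
                for c, r in zip(row_c, row_r))
    return total / (len(current) * len(current[0]))

def psnr(current, reference, bits=8):
    # Equation (2.5): logarithmic scale, higher is better;
    # identical images give an infinite PSNR.
    err = mse(current, reference)
    if err == 0:
        return float("inf")
    peak = (1 << bits) - 1  # largest representable pixel value
    return 10 * math.log10(peak ** 2 / err)

a = [[10, 10], [10, 10]]
b = [[11, 11], [11, 11]]
print(mse(a, b))             # 1.0
print(round(psnr(a, b), 1))  # 48.1
```

Typical reconstructed video lands in the 30-50 dB range on this scale, which is why PSNR is convenient for comparing encoder settings automatically.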
Chapter 3
Overview of H.264
This chapter presents an overview of the H.264 video compression standard. Some sections are more detailed than others because of their relevance to this thesis. The topics covered include the different frame and slice types, intra and inter prediction, transform coding, quantization, the deblocking filter and finally entropy coding.
3.1 Introduction to H.264
H.264 [12], also known as Advanced Video Coding (AVC) and MPEG-4 Part 10, is a standard for video compression. The standard has been developed by the Video Coding Experts Group (VCEG) of the International Telecommunications Union (ITU) and the Moving Picture Experts Group (MPEG), which is a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The main objective when H.264 was developed was to maximize the efficiency of the video compression, but also to provide a standard with high transmission efficiency which supports reliable and robust transmission of data over different channels and networks. [10]
H.264 is divided into a number of different profiles. These profiles include different parts of the video coding features from the H.264 standard. Some of the most common ones are the Extended, Baseline, Constrained Baseline and Main profiles. The Baseline profile supports inter and intra coding and entropy coding with CAVLC. The Main profile supports interlaced video, inter coding using B-slices and entropy coding using CABAC. The Extended profile does not support interlaced video nor CABAC but supports switching slices and has improved error resilience. [10]
In figure 3.1 a detailed view of the data flow in an H.264 encoder can be seen. This figure illustrates the important prediction coding and how it is connected to the other parts of the encoder. The in-loop deblocking filter can also be seen in this illustration. [10]
Figure 3.1: Overview of the data flow in an H.264 encoder
3.2 Coded Slices
A frame can be divided into smaller parts called slices. These slices can then be coded in different modes. The different coding modes in H.264 are presented below. [14]
3.2.1 I Slice
In the I slice all macroblocks are intra coded. The encoder uses the spatial correlations within a single slice to code that slice. The I slice occupies the most space of all the different types of slices after it has been encoded. [10]
3.2.2 P Slice
P slices can contain both I coded macroblocks and P coded macroblocks. P coded macroblocks are predicted from a list of reference macroblocks. [10]
3.2.3 B Slice
B slices or bidirectional slices can contain both B coded macroblocks and I coded macroblocks. B coded macroblocks can be predicted from two different lists of reference macroblocks both before and after the current frame in time. [10]
3.2.4 SP Slice
A Switching P (SP) slice is coded in a way that supports easy switching between similar precoded video streams without suffering a high penalty for sending a new I slice. [10]
3.2.5 SI Slice
A Switching I (SI) slice is an intra coded slice and supports easy switching between two different streams that do not correlate. [10]
3.3 Intra Prediction
In intra coding the encoder only uses data from the current frame. Intra prediction is the next step in this direction, trying to minimize the coded frame size. With intra prediction the encoder tries to utilize the spatial correlation within the frame. [10]
Figure 3.2: 4x4 luma prediction modes (0 Vertical, 1 Horizontal, 2 DC, 3 Diagonal down-left, 4 Diagonal down-right, 5 Vertical-right, 6 Horizontal-down, 7 Vertical-left, 8 Horizontal-up)

Figure 3.3: 16x16 luma prediction modes (0 Vertical, 1 Horizontal, 2 DC, 3 Plane)
H.264 supports 9 different intra prediction modes for 4x4 sample luma blocks, four different modes for 16x16 sample luma blocks and four modes for 8x8 chroma components. The 9 4x4 prediction modes are illustrated in figure 3.2 and the 4 16x16 luma prediction modes are illustrated in figure 3.3. The pixels are interpolated or extrapolated from the pixels nearby, i.e. the pixels marked with letters. Usually the encoder selects the prediction mode that minimizes the difference between the predicted block and the block to be encoded. I_PCM is another prediction mode which makes it possible to transmit samples of an image without prediction or transformation. [10, 14]
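As a concrete example, the DC mode (mode 2) predicts every pixel of a 4x4 block from the mean of its neighbors. A sketch (illustrative Python; the +4 before the shift is the usual rounding term, and the boundary cases where neighbors are unavailable are omitted):

```python
def predict_4x4_dc(top, left):
    """DC intra prediction (mode 2): fill the 4x4 block with the
    rounded mean of the four pixels above (A..D) and the four to
    the left (I..L)."""
    dc = (sum(top) + sum(left) + 4) >> 3  # mean of 8 samples with rounding
    return [[dc] * 4 for _ in range(4)]

pred = predict_4x4_dc([100, 104, 96, 100], [100, 100, 100, 100])
print(pred[0])  # [100, 100, 100, 100]
```

The encoder would build such a prediction for every available mode and keep the one whose residue (difference from the actual block) is cheapest to code.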
3.4 Inter Prediction
Inter prediction creates a prediction model from one or more previously encoded video frames or slices using block-based motion compensation. The motion vector precision can be up to a quarter pixel resolution. The task is to find a vector that points to a block of pixels that have the smallest difference between the reference block and the block in the frame that is being encoded. [10]
[Figure: macroblock partitions 16x16, 16x8, 8x16 and 8x8, with 8x8 sub-partitions 8x8, 8x4, 4x8 and 4x4]
Figure 3.4: Different ways to split a macroblock in inter prediction.
H.264 supports a range of block sizes from 16x16 down to 4x4 pixels. This is illustrated in figure 3.4. Using big blocks saves data because fewer motion vectors are needed, but the distortion can be very high when there are a lot of small things moving around in the video sequence. Using smaller blocks will in many cases lower the distortion but will instead increase the amount of bits needed to store the increased number of motion vectors. By letting the encoder find the best trade-off, a good data compression of the video sequence can be achieved. The blocks are split when a threshold value is reached. [10]
  SAD = sum_{i=1..m} sum_{j=1..n} |C(i,j) - R(i,j)|    (3.1)

  MSE = (1 / (m*n)) * sum_{i=1..m} sum_{j=1..n} (C(i,j) - R(i,j))^2    (3.2)

  MAE = (1 / (m*n)) * sum_{i=1..m} sum_{j=1..n} |C(i,j) - R(i,j)|    (3.3)
The macroblock cost is commonly calculated in one of a few different ways; Sum of Absolute Difference (SAD) is the most common as it offers the lowest computational complexity. The definition of SAD can be found in equation (3.1). Two other common ways to calculate the cost are Mean Square Error (MSE) and Mean Absolute Error (MAE), presented in equation (3.2) and equation (3.3) respectively. In equations (3.1), (3.2) and (3.3), n is the image width and m is the image height. [10]
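As a concrete illustration, the three cost measures can be sketched in a few lines of Python (NumPy is used for convenience; this code is not part of the thesis implementation):

```python
import numpy as np

def sad(c, r):
    """Sum of Absolute Difference, equation (3.1)."""
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def mse(c, r):
    """Mean Square Error, equation (3.2)."""
    d = c.astype(np.int64) - r.astype(np.int64)
    return (d * d).mean()

def mae(c, r):
    """Mean Absolute Error, equation (3.3)."""
    return np.abs(c.astype(np.int64) - r.astype(np.int64)).mean()

c = np.full((16, 16), 10, dtype=np.uint8)   # current 16x16 macroblock
r = np.full((16, 16), 12, dtype=np.uint8)   # reference macroblock
print(sad(c, r))  # |10 - 12| over 256 pixels -> 512
```

SAD is the natural choice for the motion estimation inner loop since it avoids both the multiplications of MSE and the division by m*n.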
Figure 3.5: Subsamples interpolated from neighboring pixels
More accurate motion estimation in the form of sub pixel motion vectors is available in H.264. Up to a quarter pixel resolution is supported for the luma component and one eighth sample resolution for the chroma components. This motion estimation is made possible by interpolating neighboring pixels and then comparing with the current frame in the encoder. The interpolation is performed by a 6 tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). [10]
In figure 3.5 the half pixel sample b can be located. To generate this sample equation (3.4) can be used. Sample m can be calculated in a similar way shown in equation (3.5). [10]
  b = round((E - 5F + 20G + 20H - 5I + J) / 32)    (3.4)

  m = round((B - 5D + 20H + 20N - 5S + U) / 32)    (3.5)
After generating all half pixel samples from real samples there are some half pixel samples that have not been generated. These samples have to be generated from already generated samples. The sample j in figure 3.5 is an example of that. To generate j the same FIR filter is used but with samples 1, 2, b, s, 7 and 8. j could also be generated with samples 3, 4, h, m, 5 and 6. Note that unrounded versions of the samples should be used when calculating j. When all half pixel samples are generated it is time to generate the quarter pixel samples. This is done by linear interpolation. Sample a in figure 3.5 is calculated as in equation (3.6) and sample d is calculated as in equation (3.7). To generate the last samples two diagonal half pixel samples are used, see equation (3.8). [10]
a = round((G + b)/2) (3.6)
d = round((G + h)/2) (3.7)
e = round((h + b)/2) (3.8)
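The half and quarter pixel interpolation above can be sketched as follows (a Python illustration, not the thesis implementation; the flat-region example relies on the filter weights summing to 32):

```python
def half_pel(e, f, g, h, i, j):
    """6-tap FIR half-pixel interpolation with weights (1,-5,20,20,-5,1)/32,
    as in equations (3.4) and (3.5)."""
    return round((e - 5*f + 20*g + 20*h - 5*i + j) / 32)

def quarter_pel(p, q):
    """Linear interpolation between two samples, equations (3.6)-(3.8)."""
    return round((p + q) / 2)

# In a flat region every interpolated sample equals the surrounding pixels.
b = half_pel(100, 100, 100, 100, 100, 100)
a = quarter_pel(100, b)
print(b, a)  # 100 100
```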
To enhance the video compression even more H.264 has support for predicting macroblocks from more than one frame. This can be applied to both B and P coded slices. With the possibility to predict macroblocks from different frames a much better video compression can be achieved. The downside with multiframe prediction is an increased cost in memory size, memory bandwidth and computational complexity. [10]
Figure 3.6: Multiple frame prediction
To find the best motion vector the encoder uses a search algorithm such as Full Search (FS), Diamond Search or Hexagon Search. With Full Search a complete search of the whole search area is performed. This algorithm provides the best compression efficiency but is also the most time consuming algorithm. Diamond Search is a less time consuming search algorithm where the search pattern is formed as a diamond. Its performance, in terms of compression, is good in comparison with FS. Hexagon Search is an even more refined search pattern where the search points are formed as a hexagon (figure 3.7a). By decreasing the number of search points the effort to calculate the motion vector is minimized and the result will be almost as good as with Diamond Search [16].
Motion estimation is the part of H.264 encoding that consumes the most computational power and is predicted to consume about 60% to 80% of the total encoding time [15].
3.4.1 Hexagon search
Hexagon search uses a 7 point search pattern which can be seen in figure 3.7a. Each cross in the grid represents a search point in the search area where the grid resolution is one pixel. For each search point a Sum of Absolute Difference, equation (3.1), is calculated. [16]
(a) (b)
Figure 3.7: Large(a) and small(b) search pattern in the hexagon search algorithm.
The search steps in the hexagon search are the following.
1. Calculate the SAD of the six closest search points and the current search point.
2. Set the search point with the smallest SAD as the new current search point. If the middle point has the smallest SAD, jump to step 5.
3. Calculate the SAD of the 3 new search points that have not yet been calculated, as illustrated in figure 3.8.
4. Jump to step 2.
5. Calculate the SAD of the 4 new search points forming a diamond around the middle point. This is illustrated in figure 3.7b.
6. Choose the search point that resulted in the smallest SAD and form a motion vector to this search point.
When the smallest SAD is found the motion compensated residue can be calculated. This residue is then sent to the transformation part of the encoder for further processing. In the decoder the motion vectors are used to restore the image correctly from the residue that was sent from the encoder. [16]
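The six search steps above can be sketched as follows (a Python illustration with assumed hexagon and diamond offsets; a real encoder would add search-range clipping):

```python
import numpy as np

# Large pattern: the six hexagon corners; small pattern: a one-pixel diamond.
LARGE = [(0, 2), (0, -2), (2, 1), (2, -1), (-2, 1), (-2, -1)]
SMALL = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def sad(cur, ref, y, x):
    h, w = cur.shape
    block = ref[y:y + h, x:x + w].astype(np.int32)
    return int(np.abs(cur.astype(np.int32) - block).sum())

def hexagon_search(cur, ref, y0, x0, steps=16):
    best = (y0, x0)
    cost = {best: sad(cur, ref, y0, x0)}
    for _ in range(steps):                      # steps 1-4: move the hexagon
        for dy, dx in LARGE:
            p = (best[0] + dy, best[1] + dx)
            if p not in cost:
                cost[p] = sad(cur, ref, *p)
        new_best = min(cost, key=cost.get)
        if new_best == best:                    # center is smallest -> step 5
            break
        best = new_best
    for dy, dx in SMALL:                        # step 5: small diamond
        p = (best[0] + dy, best[1] + dx)
        if p not in cost:
            cost[p] = sad(cur, ref, *p)
    best = min(cost, key=cost.get)              # step 6: form the motion vector
    return best[0] - y0, best[1] - x0
```

With a reference frame containing an exact copy of the current block two pixels to the right, the search converges to the motion vector (0, 2) in the first iteration.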
Figure 3.8: Movement of the hexagon pattern in a search area and the change to the smaller search pattern.
3.5 Transform Coding and Quantization
The main transform used in H.264 is the discrete cosine transform.
3.5.1 Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a widely used transform in image and video compression algorithms. In H.264 the DCT decorrelates the residual data before quantization takes place. The DCT is a block based algorithm which means it transforms one block at a time. In standards prior to H.264 the blocks were 8x8 pixels large but that is now changed to 4x4 samples to reduce the blocking effects, which lower the visual quality of the video. The DCT used in H.264 is a modified two-dimensional (2D) DCT transform. The transform matrix for the modified 2D DCT can be found in equation (3.9). [10]
        | 1  1  1  1 |
  C_f = | 2  1 -1 -2 |    (3.9)
        | 1 -1 -1  1 |
        | 1 -2  2 -1 |
  Y = C_f X C_f^T ⊗ E_f =

      | 1  1  1  1 |     | 1  2  1  1 |     | a^2   ab/2   a^2   ab/2  |
    = | 2  1 -1 -2 |  X  | 1  1 -1 -2 |  ⊗  | ab/2  b^2/4  ab/2  b^2/4 |    (3.10)
      | 1 -1 -1  1 |     | 1 -1 -1  2 |     | a^2   ab/2   a^2   ab/2  |
      | 1 -2  2 -1 |     | 1 -2  1 -1 |     | ab/2  b^2/4  ab/2  b^2/4 |

where

  a = 1/2    (3.11)

  b = sqrt(2/5)    (3.12)
and X is the 4x4 block of pixels to calculate the DCT of. To simplify computation somewhat the post-scaling (⊗E_f) can be absorbed into the quantization process. [10] This will be described in more detail in section 3.5.3 which covers the quantization.
The modified 2D DCT is an approximation to the standard DCT. It does not give the same result but the compression is almost identical. The advantage of this approximation is that the core equation C_f X C_f^T can be computed in 16-bit arithmetic with only shifts, additions and subtractions [6].
To do a two-dimensional DCT two one-dimensional DCTs can be performed after each other, the first one on rows and the second one on columns or vice versa. The function of the one-dimensional DCT can be seen in figure 3.9. [6]
Figure 3.9: DCT functional schematic
The operations performed while calculating the DCT as shown in figure 3.9 can be written as equation (3.13).
  X0 = (x0 + x3) + (x1 + x2)
  X2 = (x0 + x3) - (x1 + x2)
  X1 = 2(x0 - x3) + (x1 - x2)    (3.13)
  X3 = (x0 - x3) - 2(x1 - x2)
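The butterfly in equation (3.13) computes exactly the matrix product with C_f, which can be checked with a short Python sketch (illustrative only, not the thesis implementation):

```python
import numpy as np

# The modified DCT matrix Cf from equation (3.9).
CF = np.array([[1, 1, 1, 1],
               [2, 1, -1, -2],
               [1, -1, -1, 1],
               [1, -2, 2, -1]])

def dct_1d(x0, x1, x2, x3):
    """Butterfly form of the 1D core transform, equation (3.13)."""
    s03, d03 = x0 + x3, x0 - x3
    s12, d12 = x1 + x2, x1 - x2
    return s03 + s12, 2 * d03 + d12, s03 - s12, d03 - 2 * d12

def dct_2d(block):
    """Core 4x4 transform W = Cf X Cf^T (scaling absorbed elsewhere)."""
    return CF @ block @ CF.T

x = np.array([1, 2, 3, 4])
print(dct_1d(*x), CF @ x)   # the butterfly matches the matrix product
```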
3.5.2 Inverse Discrete Cosine Transform
The transform that reverses the DCT is called the Inverse Discrete Cosine Transform (IDCT). With the design of the DCT in H.264 it is possible to ensure zero mismatch between different decoders. This is because the DCT and IDCT (3.14) can be calculated in integer arithmetic. In the standard DCT some mismatch can occur, caused by different representation and precision of fractional numbers in encoder and decoder. [10]
The 2D IDCT transform in H.264 is given by

  X_r = C_i^T (Y ⊗ E_i) C_i    (3.14)

where

          | 1   1    1   1/2 |          | a^2  ab   a^2  ab  |
  C_i^T = | 1  1/2  -1  -1   |    E_i = | ab   b^2  ab   b^2 |
          | 1 -1/2  -1   1   |          | a^2  ab   a^2  ab  |
          | 1  -1    1  -1/2 |          | ab   b^2  ab   b^2 |
and X_r is the reconstructed original block and Y is the previously transformed block. As with the DCT the pre-scaling (⊗E_i) can be absorbed into the rescaling process. [10] This will be described in more detail in section 3.5.4 which covers the rescaling.

Figure 3.10: IDCT functional schematic
The function of the IDCT can be seen in figure 3.10. To do a two-dimensional IDCT two one-dimensional IDCTs are performed after each other, the first one on rows and the second one on columns or vice versa. [6] The operations performed while calculating the IDCT can be written as equation (3.15).
  x0 = (X0 + X2) + (X1 + (1/2)X3)
  x1 = (X0 - X2) + ((1/2)X1 - X3)
  x2 = (X0 - X2) - ((1/2)X1 - X3)    (3.15)
  x3 = (X0 + X2) - (X1 + (1/2)X3)
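That the scaled transform pair really inverts can be verified numerically; a floating point Python sketch (illustrative only, the real encoder uses the integer forms with scaling absorbed into quantization and rescaling):

```python
import numpy as np

a, b = 0.5, np.sqrt(2 / 5)
CF = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
CI = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
sf = np.array([a, b / 2, a, b / 2])      # forward per-row scaling, Ef = sf sf^T
si = np.array([a, b, a, b])              # inverse per-row scaling, Ei = si si^T
EF = np.outer(sf, sf)
EI = np.outer(si, si)

X = np.arange(16, dtype=float).reshape(4, 4)
Y = (CF @ X @ CF.T) * EF                 # forward, equation (3.10)
Xr = CI.T @ (Y * EI) @ CI                # inverse, equation (3.14)
print(np.allclose(Xr, X))                # True: the pair is an exact inverse
```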
3.5.3 Quantization
Information is often concentrated in the lower frequency area; therefore quantization can be used to further compress the data after applying the DCT. H.264 uses a parameter in the quantization called the Quantization Parameter (QP). The QP describes how much quantization should be applied, i.e. how much data should be truncated. A total of 52 values ranging from 0 to 51 are supported by the H.264 standard. Using a high QP will decrease the size of the coded data but it will also decrease the visual quality of the coded video. With QP = 0 the quantization is at its finest and almost all data is kept. [10]
From QP the quantizer step size (Qstep) can be derived. The first values of Qstep are presented in table 3.1. Note that Qstep doubles in value for every increase of 6 in QP. The large number of step sizes provides the ability to accurately control the trade-off between bitrate and quality in the encoder. [10]
  QP    | 0     | 1      | 2      | 3     | 4 | 5     | 6    | 7     | 8     | ...
  Qstep | 0.625 | 0.6875 | 0.8125 | 0.875 | 1 | 1.125 | 1.25 | 1.375 | 1.625 | ...

Table 3.1: Qstep for a few different values of QP
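Because of the doubling property, the whole Qstep range can be generated from its first six entries; a small Python sketch (illustrative):

```python
# Qstep for an arbitrary QP derived from the first six entries of table 3.1,
# since Qstep doubles for every increase of 6 in QP.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    return QSTEP_BASE[qp % 6] * 2 ** (qp // 6)

print(qstep(6), qstep(8))  # 1.25 1.625
```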
The basic formula for quantization can be written as

  Z_ij = round(Y_ij / Qstep)    (3.16)

where Y_ij is a coefficient of the previously transformed block to be quantized and Z_ij is a coefficient of the quantized block. The rounding operation does not have to be to the nearest integer; it could be biased towards smaller integers which could give perceptually higher quality. This is true for all rounding operations in the quantization. [10]
As mentioned in section 3.5.1 the quantization can absorb the post-scaling (⊗E_f) from the DCT. The unscaled output from the DCT can then be written as W = C_f X C_f^T (as compared to the scaled output which is Y = C_f X C_f^T ⊗ E_f). [10] This gives

  Z_ij = round(W_ij * PF_ij / Qstep)    (3.17)

where W_ij is a coefficient of the unscaled transformed block, Z_ij is a coefficient of the quantized block and PF_ij is either a^2, ab/2 or b^2/4 for each (i,j) according to

       | a^2   ab/2   a^2   ab/2  |
  PF = | ab/2  b^2/4  ab/2  b^2/4 |    (3.18)
       | a^2   ab/2   a^2   ab/2  |
       | ab/2  b^2/4  ab/2  b^2/4 |
PF and Qstep can then be reformulated using a multiplication factor (MF) and a division. MF is in fact a 4x4 matrix of multiplication factors according to

       | A  C  A  C |
  MF = | C  B  C  B |    (3.19)
       | A  C  A  C |
       | C  B  C  B |
where the values of A, B and C depend on QP according to

  QP |   A   |  B   |  C
   0 | 13107 | 5243 | 8066
   1 | 11916 | 4660 | 7490
   2 | 10082 | 4194 | 6554
   3 |  9362 | 3647 | 5825
   4 |  8192 | 3355 | 5243
   5 |  7282 | 2893 | 4559

Table 3.2: Multiplication factor MF
The scaling factors in MF are repeated for every increase of 6 in QP. The reformulation of PF and Qstep then becomes

  PF / Qstep = MF / 2^qbits    (3.20)

where qbits is calculated as

  qbits = 15 + floor(QP / 6)    (3.21)

This gives a new quantization formula according to

  Z_ij = round(W_ij * MF_ij / 2^qbits)    (3.22)

which is the final form. [10]
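The final quantization formula can be sketched as follows (a Python illustration; the position rule for A, B and C follows the MF layout in (3.19)):

```python
# Integer-friendly quantization, equation (3.22). MF values for QP 0-5 are
# taken from table 3.2 and repeat (with qbits growing) every 6 QP steps.
MF_ABC = [(13107, 5243, 8066), (11916, 4660, 7490), (10082, 4194, 6554),
          (9362, 3647, 5825), (8192, 3355, 5243), (7282, 2893, 4559)]

def mf(qp, i, j):
    a_, b_, c_ = MF_ABC[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return a_                     # positions where PF = a^2
    if i % 2 == 1 and j % 2 == 1:
        return b_                     # positions where PF = b^2/4
    return c_                         # positions where PF = ab/2

def quantize(w, qp, i, j):
    qbits = 15 + qp // 6
    return round(w * mf(qp, i, j) / 2 ** qbits)

print(quantize(100, 0, 0, 0))  # round(100 * 13107 / 32768) = 40
```

Note that 13107 / 2^15 ≈ 0.4, which matches PF / Qstep = 0.25 / 0.625 for QP = 0 at the a^2 positions.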
3.5.4 Rescaling
The rescaling also uses Qstep, which depends on the Quantization Parameter (QP) and is the same as for quantization (see table 3.1). The basic formula for rescaling can be written as

  Y'_ij = Z_ij * Qstep    (3.23)
where Z_ij is a coefficient of the previously quantized block and Y'_ij is a coefficient of the rescaled block. The rounding operation, as in the quantizer, does not have to be to the nearest integer; it could be biased towards smaller integers which could give perceptually higher quality. This is true for all rounding operations in the rescaling. [10]
As the quantization formula was reformulated, the rescaling formula can also absorb the pre-scaling (⊗E_i) and be reformulated to match the quantization formula. The new formula for rescaling where the pre-scaling factor is included can be written as
  W'_ij = Z_ij * Qstep * PF_ij * 64    (3.24)

where PF_ij here takes the position dependent values a^2, ab or b^2 from the pre-scaling matrix E_i, Z_ij is a coefficient of the previously quantized block, W'_ij is a coefficient of the rescaled block and the constant scaling factor of 64 is included to avoid rounding errors while calculating the Inverse DCT. [10]
Much like MF for the quantization the rescaling also uses a 4 × 4 matrix of scaling factors called V, which also incorporates the constant scaling factor of 64 introduced in (3.24). V can be written as
      | A  C  A  C |
  V = | C  B  C  B |    (3.25)
      | A  C  A  C |
      | C  B  C  B |
where the values of A, B and C depend on QP according to

  QP |  A |  B |  C
   0 | 10 | 16 | 13
   1 | 11 | 18 | 14
   2 | 13 | 20 | 16
   3 | 14 | 23 | 18
   4 | 16 | 25 | 20
   5 | 18 | 29 | 23

Table 3.3: Scaling factor V
The scaling factors in V are, like MF, repeated for every increase of 6 in QP. With V the rescaling formula can be written as

  W'_ij = Z_ij * V_ij * 2^floor(QP/6)    (3.26)
which is the final form. [10]
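Putting transform, quantization and rescaling together, the whole forward and inverse path can be checked numerically with a small Python sketch (illustrative; QP = 0 and the QP = 0 rows of tables 3.2 and 3.3 are assumed):

```python
import numpy as np

# End-to-end sketch: core DCT, integer quantization (3.22), rescaling (3.26)
# and the IDCT with the final division by 64.
CF = np.array([[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]])
CI = np.array([[1, 1, 1, 1], [1, .5, -.5, -1], [1, -1, -1, 1], [.5, -1, 1, -.5]])
MF0 = np.array([[13107, 8066, 13107, 8066], [8066, 5243, 8066, 5243],
                [13107, 8066, 13107, 8066], [8066, 5243, 8066, 5243]])
V0 = np.array([[10, 13, 10, 13], [13, 16, 13, 16],
               [10, 13, 10, 13], [13, 16, 13, 16]])

qp = 0
x = np.array([[1, 2, 3, 4], [5, 6, 7, 8],
              [9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.int64)
w = CF @ x @ CF.T                                # core transform W = Cf X Cf^T
z = np.round(w * MF0 / 2 ** (15 + qp // 6))      # quantize, equation (3.22)
wr = z * V0 * 2 ** (qp // 6)                     # rescale, equation (3.26)
xr = np.round(CI.T @ wr @ CI / 64)               # inverse transform and /64
print(np.abs(xr - x).max())                      # small reconstruction error
```

The products MF * V are tuned so that the forward and inverse scalings cancel (for example 13107 * 10 / 2^15 = 4.0 at the DC position), leaving only the rounding error of the quantizer.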
3.6 Deblocking filter
When using block coding algorithms such as the DCT, blocking artifacts can occur. This is unwanted because it lowers the visual quality and prediction performance. The solution to this is to add a filter that removes these artifacts. The filter is placed after the IDCT in the encoding loop, which can be seen in figure 3.1. The filter is used on both luma and chroma samples of the video sequence. [10]
Figure 3.11: Filtering order of a 16x16 pixel macroblock with start in A and end in H for luminance(a) and start in 1 and end in 4 for chrominance(b)
The deblocking filter in H.264 has 5 levels of filtering, 0 to 4, where 4 is the option with the strongest filtering. The filter is actually two different filters where the first filter is applied on levels 1 to 3 and the second on level 4. Level 0 means that no filter should be applied. The filter level parameter is called boundary strength (bS). The parameter depends on the current quantization parameter, macroblock type and the gradient of the image samples across the boundary. There is one bS for every boundary between two 4x4 pixel blocks. The deblocking filter is applied to one macroblock at a time in a raster scan order throughout the frame. [5]
Figure 3.12: Pixels in blocks adjacent to vertical and horizontal boundaries

When applying the deblocking filter on a macroblock it is done in a special order which is illustrated in figure 3.11. The filter is applied on vertical and horizontal edges as shown in figure 3.12, where p0, p1, p2, p3, q0, q1, q2, q3 are pixels from two neighboring blocks, p and q. The filtering of these pixels only takes place if equations (3.27), (3.28) and (3.29) are fulfilled.
  |p0 - q0| < α(index_A)    (3.27)

  |p1 - p0| < β(index_B)    (3.28)

  |q1 - q0| < β(index_B)    (3.29)

  index_A = Min(Max(0, QP + Offset_A), 51)    (3.30)

  index_B = Min(Max(0, QP + Offset_B), 51)    (3.31)

The values of α and β are approximately given by equation (3.32) and equation (3.33).

  α(x) = 0.8 * (2^(x/6) - 1)    (3.32)

  β(x) = 0.5x - 7    (3.33)
Note that in equations (3.30) and (3.31) it can be seen that the filtering is dependent on the Quantization Parameter. The different filters applied are 3-, 4- and 5-tap FIR filters which are further described in [5].
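The filtering decision can be sketched as follows (a Python illustration; alpha and beta use the approximations in (3.32) and (3.33), whereas the standard defines them as tables):

```python
def clip_index(qp, offset):
    """Clip the table index to the QP range, equations (3.30)/(3.31)."""
    return min(max(0, qp + offset), 51)

def alpha(x):
    return 0.8 * (2 ** (x / 6) - 1)   # approximation, equation (3.32)

def beta(x):
    return 0.5 * x - 7                # approximation, equation (3.33)

def should_filter(p1, p0, q0, q1, qp, off_a=0, off_b=0):
    """Check conditions (3.27)-(3.29) for one boundary."""
    ia, ib = clip_index(qp, off_a), clip_index(qp, off_b)
    return (abs(p0 - q0) < alpha(ia) and
            abs(p1 - p0) < beta(ib) and
            abs(q1 - q0) < beta(ib))

print(clip_index(50, 6))  # clipped to 51
```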
3.7 Entropy coding
The H.264 standard supports two different entropy coding algorithms, Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). CABAC is the more efficient of the two but requires higher computational complexity. Bitrate savings of CABAC can be between 9% and 14% compared to CAVLC [7]. CAVLC is supported in all H.264 profiles but CABAC is only supported in the profiles above extended. [10]
Chapter 4
Overview of the ePUMA
Architecture
This chapter covers an introduction to the ePUMA processor architecture. The memory hierarchy, master core, Sleipnir core, the direct memory access controller and the simulator will be covered.
4.1 Introduction to ePUMA
Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access (ePUMA) is a multi-core DSP processor architecture with 1 master core and 8 calculation cores. The master core handles the Direct Memory Access (DMA) communications. The slave core, which is also called Sleipnir, is a 15-stage pipelined calculation core.
4.2 ePUMA Memory Hierarchy
The ePUMA memory hierarchy consists of three levels where the first level is the off-chip main memory, the second level is the local storage of the master and slaves and the third and final level is the registers of the master and slave cores. In figure 4.1 an illustration of how each core is connected to the on-chip interconnection is depicted. The on-chip interconnection is in turn connected to the off-chip main memory. The main memory is addressed with a high word of 16 bits and a low word of 16 bits, i.e. 32-bit addressing is used where each address corresponds to one word of data.
Figure 4.1: ePUMA memory hierarchy
The on-chip network is depicted in figure 4.2 where N0 to N7 are interconnection nodes. As can be seen from the figure the nodes are connected both to the master and the respective Sleipnir core but also to other nodes. This gives the ability to transfer data between Sleipnir cores and even pipeline the cores. With this setup data can be transferred in any way and combination that does not overlap.

Figure 4.2: The ePUMA on-chip network
4.3 Master Core
The master core is for the moment based on a processor called Senior. This processor has been around at the Division of Computer Engineering for some years now and is used in some courses for educational purposes. The Senior processor is a DSP processor, which means it has a Multiply and ACcumulate (MAC) unit and other DSP related capabilities. To enable it to serve as a master core, memory ports for the DMA controller and interrupts coming from the DMA and Sleipnir cores have been added.
4.3.1 Master Memory Architecture
The master core has 2 RAMs and 2 ROMs, organized as Data Memory 0 (DM0) and Data Memory 1 (DM1). These memories are the local storage of the master core. The ROMs start at address 0x8000 in the respective memory, which leaves 0x7FFF = 32767 words in each RAM to work with.
For calculations the master core has 32 16-bit registers that can be used as buffers. There are also a number of special registers such as 4 address registers, registers for hardware looping and registers supporting cyclic addressing in address registers 0 and 1. Address registers 0 and 1 also support different step sizes.
4.3.2 Master Instruction Set
The programming guide and instruction set for Senior can be found in [9] and [8], even though they might not be totally accurate because of the modifications for the ePUMA project. The master's instruction set is in large the same as the Senior instruction set. It is a standard DSP instruction set with support for a convolution instruction which multiplies and accumulates the results. To speed up looping a hardware loop function called repeat is included. All jumps, calls and returns can use 0 to 3 delay slots. The number of delay slots specifies how many instructions after the flow control instruction will be executed. If not all delay slots are used for useful instructions, nop instructions will be inserted in the pipeline.
4.3.3 Datapath
The datapath of the master consists of a 5-stage pipeline which can be seen in figure 4.3. There is only one exception to this: the convolution instruction (conv) uses a 7-stage pipeline, but a figure of this is omitted for lack of relevance. The datapath is advanced enough for scalar calculations; larger computational loads should be delegated to the Sleipnir cores. In table 4.1, originally found in [9], a description of the pipeline stages is presented.
Figure 4.3: Senior datapath for short instructions
  Pipe | RISC-E1/E2 RISC            | Memory load/store
  P1   | IF: Instr. Fetch           | IF: Instr. Fetch
  P2   | ID: Instr. Decode          | ID: Instr. Decode
  P3   | OF: Operand Fetch          | OF+AG: Compute addr
  P4   | EX1: Execution (set flags) | MEM: Read/Write
  P5   | EX2: Only for MAC, RWB     | WB: Write back (if load)

Table 4.1: Pipeline specification
4.4 Sleipnir Core
Sleipnir is the name of the calculation core. In the ePUMA processor there are 8 of them. The Sleipnir is a Single Instruction Multiple Data (SIMD) architecture which in this case means it can perform vector calculations. Each full vector consists of 128 bits and is divided into 8 words of 16 bits which can run through the pipeline in parallel. The datapath of the Sleipnir core has 15 pipeline stages. The pipeline length of an instruction is variable depending on the choice of operands.
4.4.1 Sleipnir Memory Architecture
The Sleipnir core has 3 memories where 2 of them are connected to the core and the third memory is connected to the DMA bus. The memories are called Local Vector Memories (LVMs). By being able to swap which memories are connected to the processor and which memory is connected to the DMA, better utilization can be reached and a lot of the transfer cycle cost can be hidden.
Constant Memory
Each Sleipnir is also provided with a Constant Memory (CM) for use of constants during runtime. This memory can be used for different tasks such as holding scalar constants or permutation vectors. All constants that will be used during runtime can be stored in the CM. The memory can contain up to 256 vectors.
Local Vector Memory
The Local Vector Memories (LVM) are the local memories of the Sleipnir core. As described above each core has access to 2 LVMs at runtime. These memories are 4096 vectors large, where each vector is 128 bits wide. The memories have one address for each word of 16 bits. The memories consist of 8 memory banks, one for each word in a vector. The constant memory can be used to address the LVMs according to the values stored in the constant memory. The constant memory addressing of the LVMs can be used to generate a permutation of data which can be used for e.g. transposing a matrix.
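The permutation addressing can be sketched as follows (a Python illustration; the names and index layout are illustrative, not actual Sleipnir syntax). A CM vector of addresses selects LVM words in permuted order, here to read a 4x4 matrix transposed:

```python
import numpy as np

lvm = np.arange(16, dtype=np.int16)          # a 4x4 matrix stored row-major
perm = np.array([(i % 4) * 4 + i // 4 for i in range(16)])  # transpose pattern
transposed = lvm[perm]                       # CM-addressed (gather) read
print(transposed.reshape(4, 4)[0])           # first row = old first column
```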
Vector Register File
There are 8 Vector Registers (VR) in the Vector Register File (VRF), VR0 to VR7, for use in computations during runtime. Each word can be accessed separately; it is also possible to access a double word or a half vector (high or low) in each of the 8 vector registers. The different access types are listed in table 4.2, originally found in [4].
  Syntax   | Size    | Description
  vrX.Y    | 16-bit  | Word
  vrX.Yd   | 32-bit  | Double word
  vrX{h,l} | 64-bit  | Half vector
  vrX      | 128-bit | Vector

Table 4.2: Register file access types
Special Registers
There are 4 address registers ar0-ar3 which can be used to address memory in the LVMs. There are also 4 configuration registers for these 4 address registers, holding values for top, bottom and step size which can be used when addressing memories in all kinds of loops. The different increment operations are listed in table 4.3, originally found in [4].
  arX+=C  | Fixed increment; C = 1, 2, 4 or 8
  arX-=C  | Fixed decrement; C = 1, 2, 4 or 8
  arX+=S  | Increment from stepX register
  arX+=C% | Fixed increment with cyclic addressing
  arX-=C% | Fixed decrement with cyclic addressing
  arX+=%  | Increment from stepX with cyclic addressing
Table 4.3: Address register increment operations
The addressing of the two LVMs can be done with one of the four address registers, immediate addresses, vector registers or in combination with the constant memory, to form advanced addressing schemes as shown in table 4.4, originally found in [4].
Mode# Index Offset Pattern Syntax example
0 arX 0 0,1,2,3,4,5,6,7 [ar0]
1 arX 0 cm[carX] [ar0 + cm[car0]]
2 arX 0 cm[imm8] [ar0 + cm[10]]
3 arX 0 cm[carX + imm8] [ar0 + cm[car0 + 10]]
4 0 vrX.Y 0,1,2,3,4,5,6,7 [vr0.0]
5 0 vrX.Y cm[carX] [vr0.0 + cm[car0]]
6 0 vrX.Y cm[imm8] [vr0.0 + cm[10]]
7 0 vrX.Y cm[carX + imm8] [vr0.0 + cm[car0 + 10]]
8 0 0 vrX [vr0]
9 0 0 cm[carX] [cm[car0]]
10 0 0 cm[imm8] [cm[10]]
11 0 0 cm[carX + imm8] [cm[car0 + 10]]
12 arX 0 vrX [ar0 + vr0]
13 arX vrX.Y 0,1,2,3,4,5,6,7 [ar0 + vr0.0]
14 arX imm16 0,1,2,3,4,5,6,7 [ar0 + 1024]
15 0 imm16 0,1,2,3,4,5,6,7 [1024]
Table 4.4: Addressing modes examples
Program Memory
The program memory (PM) can contain up to 512 instructions. It can be loaded from the main memory by issuing a DMA transaction.
The program that is loaded into the Sleipnir PM is called a block. A kernel is a combination of master code and blocks. A block can utilize several Sleipnir cores with internal data transfers. Blocks can however not communicate with cores outside the block and cannot be data dependent on any other block running at the same time.
If for some reason the Sleipnir block code is larger than 512 lines of instructions it can be divided into two programs and the memory can be transferred between two Sleipnir cores. For this to work code is needed in the master to keep track of the cores and move data to the next core for further processing. When developing a new block or kernel it can sometimes be good to have a little extra memory. Therefore it is possible to increase the size of the PM in the simulator.
4.4.2 Datapath
The datapath of the Sleipnir slave core is an 8-way 16-bit datapath. The datapath is divided into 15 pipeline stages and is depicted in figure 4.4. A more detailed version of the datapath can be found in [2].
Figure 4.4: Sleipnir datapath pipeline schematic
The datapath includes 16 16x16-bit multipliers and two Arithmetic Logic Units (ALU) connected in series. Simpler instructions can bypass the first ALU and by that become a shorter instruction which saves some execution time. These bypasses can be seen in stage D1 to D4 in figure 4.4. Some instructions use a very short datapath such as the jump instruction which is executed in stage A2. This makes the use of precalculated branch decisions unnecessary. Stage E1 to E4 can be described as the write back stage and therefore it follows after stage D4. Stage D3 and D4 are very similar but provides the core with the possibility of performing summation of a complete vector and similar tasks.
4.4.3 Sleipnir Instruction Set
The instruction set used is application specific. The instruction set includes no move or load instructions for data; these functions are all included in one instruction which is called copy. Operands and instructions can be combined in different ways, with variable pipeline length as a result. The pipeline length depends on e.g. where the input operands are fetched from, where the result will be stored and if the instruction uses or bypasses the first ALU and the multipliers. Instruction names are built upon what data they affect and how. For example the instruction vcopy m0[0].vw m1[0].vw copies a vector from memory 1 address 0 to memory 0 address 0. If the instruction scopy would be used instead it would only copy a scalar word. Another example is the add instruction. If vaddw m0[0].vw m1[0].vw vr0 is used, two vectors will be loaded from m1 and vr0. The .vw after the memory address denotes that the vectors will be added word wise, that means they will be considered as eight words. This means that the processor can carry out 8 additions per clock cycle. [4]
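The word-wise vector add can be modeled in a few lines of Python (an illustration of the lane semantics only; the int16 wraparound used here is an assumption about overflow behaviour, not a documented Sleipnir property):

```python
import numpy as np

def vaddw(va, vb):
    """A 128-bit vector treated as eight 16-bit words: one vaddw
    performs 8 independent word-wise additions."""
    return (va.astype(np.int16) + vb.astype(np.int16)).astype(np.int16)

va = np.arange(8, dtype=np.int16)
vb = np.full(8, 10, dtype=np.int16)
print(vaddw(va, vb))  # [10 11 12 13 14 15 16 17]
```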
4.4.4 Complex Instructions
To reach better performance results the datapath has to be utilized as much as possible, especially in the inner loops of the critical path. To be able to reach this better performance, new specialized instructions that perform several smaller tasks could be implemented. The result of this is that by pipelining several of these new complex instructions more work can be done in less time and the program will reach an increased throughput.
Things that have been considered when deciding upon accelerating certain parts of code are listed below.
• Motivation – Why should the acceleration be done
• Description – What is going to be accelerated
• Extra hardware needed – What extra hardware is needed for acceleration of the specific task
• Profiling and usage – Is the task used a lot and therefore worth accelerating
• Extra hardware cost – What is the cost of the extra hardware
• Cycle gain – How many cycles can be saved
• Efficiency – How efficient is the new solution in terms of cost per gain in performance
4.5 DMA Controller
The Direct Memory Access (DMA) controller is used to load and store data to and from an off-chip memory. The DMA can transfer a 128-bit vector to one of the