Institutionen för systemteknik
Department of Electrical Engineering

Thesis (Examensarbete)

A study of CABAC hardware acceleration with configurability in multi-standard media processing

Thesis carried out in Computer Engineering at Linköping Institute of Technology (Tekniska högskolan i Linköping)

by

Oskar Flordal

LITH-ISY-EX–05/3788–SE

Linköping 2005

Department of Electrical Engineering
Linköpings universitet

Supervisor: Di Wu
ISY, Linköpings universitet

Examiner: Dake Liu
ISY, Linköpings universitet

Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, S-581 83 Linköping, Sweden

Date: 2005-10-12
Language: English
Report category: Examensarbete (thesis)
URL for electronic version: http://www.da.isy.liu.se, http://www.ep.liu.se/2005/3788
ISRN: LITH-ISY-EX–05/3788–SE

Title: En studie i konfigurerbar hårdvaruaccelerering för CABAC i flerstandards media-bearbetning (A study of CABAC hardware acceleration with configurability in multi-standard media processing)
Author: Oskar Flordal
Keywords: CABAC, arithmetic coding, H.264, JPEG 2000, instruction level optimisation, accelerators, assembler, VLC, encoder, hardware reuse, profiling, DSP, multimedia


Abstract

To achieve greater compression ratios, new video and image CODECs like H.264 and JPEG 2000 take advantage of context-adaptive binary arithmetic coding (CABAC). As it contains computationally heavy algorithms, fast implementations are needed when large amounts of data are processed, for example when compressing high resolution formats like HDTV. This document describes how entropy coding works in general, with a focus on arithmetic coding and CABAC. Furthermore, the document discusses the demands of the different CABACs and proposes different options for hardware and instruction level optimisation. Testing and benchmarking of these implementations are done to ease evaluation. The main contribution of the thesis is the parallelisation and unification of the CABACs, which is discussed and partly implemented. The result of the instruction level acceleration (ILA) is improved program flow through specialised branching operations. The result of the dedicated hardware accelerator (DHA) is a two-bit parallel accelerator with hardware sharing between the JPEG 2000 and H.264 encoders, with limited decoding support.


Acknowledgements

I want to thank my examiner Dake Liu for giving me an interesting topic and resources to work with. I also want to direct a big thanks to my supervisor Di Wu for interesting discussions and lots of invaluable inspiration and other help.

I also want to take the opportunity to thank my family and friends who have been with me through the years, thanks! Extra thanks to Kattis for proofreading and for keeping my mind off technology once in a while!


Glossary

ASIC Application Specific Integrated Circuit.

ASIP Application Specific Instruction set Processor.

AVC Advanced Video Coding.

Bit plane A set of bits of the same significance from neighbouring coefficients, upon which contexts are built in JPEG 2000.

CABAC Context-based Adaptive Binary Arithmetic Coding. Method for arithmetic coding used in JPEG 2000 and H.264.

CAVLC A less performance consuming alternative to CABAC in H.264.

CIF Common Interface Format. Standard of resolutions made for video conferencing. CIF is 352x288, 4CIF is 704x576, 9CIF is 1056x864 and so on.

CODEC COder DECoder. A mechanism for encoding or decoding something; video is a good example.

Coding range In arithmetic coding: the current coding interval of the coder.

Context The information available on the probability of the next bit being mps.

Entropy coding Removing redundant information through coding.

FPGA A programmable logic device used for testing or, in small volumes, for implementing hardware.

fps Frames per second.

gprof GNU profiler used to produce an "execution profile".

H.264 An MPEG4 generation video CODEC.

HDTV High Definition TV. Usually 1920x1080 or 1280x720 pixel resolution.

ITU International Telecommunication Union.

JasPer A collection of software for coding and manipulating images. Especially interesting for its open source implementation of a JPEG 2000 encoder and decoder.

JM The official H.264 software reference implementation.

JPEG 2000 An image compression CODEC from the Joint Photographic Experts Group.

lps Least probable symbol, the bit that is not predicted by the coding context.

mps Most probable symbol, the bit that is predicted by the coding context.

MPEG4 Moving Picture Experts Group 4. A family of video CODECs.

OR1200 RISC processor with Verilog source available.

Qe The probability of an lps symbol.

Verilog A hardware description language.

VLC Variable Length Coding.


Contents

1 Introduction
  1.1 Background
  1.2 Purpose and objectives
  1.3 Methods and tools
  1.4 Time constraints
  1.5 Layout of this document

2 Entropy coding
  2.1 Background
  2.2 Actual implementation/Historic background
  2.3 Arithmetic coding
  2.4 A few more practicalities
  2.5 Adding it all together

3 JPEG 2000
  3.1 Purpose
  3.2 Context modelling
  3.3 Motion JPEG 2000

4 H.264/AVC
  4.1 Purpose
  4.2 Anatomy of the H.264 frame
  4.3 Context formation

5 The CABACs
  5.1 Basics
  5.2 Probability states
  5.3 Avoiding multiplication
  5.4 Renorm
  5.5 Problems with implementations
    5.5.1 Uniting the CABACs

6 Analysis of software implementation
  6.1 Requirements
  6.2 On benchmarking with gprof
  6.3 JasPer
    6.3.1 Benchmarking
  6.4 JM
    6.4.1 Benchmarking
  6.5 OR1200
  6.6 Assembler with analysis
  6.7 Similarities and conclusions on what to optimise

7 Instruction level acceleration
  7.1 Introduction
  7.2 ISA implementation
  7.3 Better branch control
  7.4 General idea
  7.5 Resources
  7.6 Formal proposal
  7.7 Problems
  7.8 Other improvements
  7.9 Parallelising
  7.10 Assembler implementation
  7.11 Benchmarking

8 A dedicated accelerator
  8.1 Advantages and disadvantages
  8.2 Contexts
    8.2.1 Context switching
  8.3 Arithmetic encoder
    8.3.1 H.264 encoder
    8.3.2 JPEG 2000 encoder
  8.4 Renorm
    8.4.1 H.264 renorm
    8.4.2 JPEG 2000 renorm
  8.5 Benchmarking the straightforward hardware
  8.6 Parallelising
  8.7 A common accelerator structure
  8.8 Parallelised and unified hardware
    8.8.1 Stage 1
    8.8.2 Stage 2: The coder
    8.8.3 Stage 3: Updating low
    8.8.4 Stage 4+: Renorm
  8.9 Benchmarking the parallel accelerator
  8.10 Adding even more features

9 Conclusions
  9.1 The demands of the CODECs
  9.2 Summing it up
  9.3 Possible future work

Bibliography

A Cycle count on assembler implementation
  A.1 Description of unoptimised assembler version on OR1200
    A.1.1 JPEG 2000
    A.1.2 JPEG 2000 Renorm
    A.1.3 H.264
    A.1.4 H.264 Renorm

List of Figures

2.1 The standard procedure in VLC.
2.2 A VLC model with adaption.
2.3 A graphical explanation of arithmetic coding.

3.1 The standard procedure of coding JPEG 2000.
3.2 A few neighbouring bitplanes.

4.1 A description of the encoding process.
4.2 A group of macro-blocks (dotted) framed into slices that divide a field.
4.3 Calculating context by using neighbours.

5.1 The JPEG 2000 renorm with byte-out.
5.2 The H.264 renorm. The writeBit function is omitted.

7.1 Structure of comparisons.
7.2 Rough sketch of a device for multibranching.
7.3 A simple implementation of a jumpback.
7.4 Estimate of running JPEG 2000 in parallel.

8.1 H.264 hardware encoder the straightforward way.
8.2 JPEG 2000 hardware encoder the straightforward way.
8.3 Idea of H.264 hardware renorm a multi cycle way.
8.4 Idea of JPEG 2000 hardware renorm.
8.5 The stages in the unified parallel accelerator.
8.6 Simplified stage 1 when only considering H.264.
8.7 Updating range in the unified parallel accelerator.
8.8 Building a temporary output in H.264 renorm.
8.9 Updating low.
8.10 Transforming undetermined bits using the outstanding register.
8.11 A part of the propagating chain for finding values for undetermined outstanding bits.

List of Tables

6.1 Approximate bitrates of various resolutions at 30 fps.
6.2 Performance estimates on JPEG 2000 coding in JasPer.
6.3 Performance estimates on H.264 coding in JM.
6.4 Number of cycles in an assembler implementation of the CABACs.
6.5 Approximate MIPS cost of software implementation at various resolutions.

7.1 Approximate MIPS cost of ASIP implementation at various resolutions.

8.1 Cycle consumption of straightforward hardware implementation.
8.2 Approximate MIPS cost of accelerator implementation at various resolutions.
8.3 Approximate MIPS cost of parallel implementation at various resolutions.

9.1 Approximate bitrates of various resolutions at 30 fps.
9.2 Approximate MIPS cost at various resolutions at 30 fps.

A.1 Cycle cost in the encode stage of JPEG 2000 CABAC.
A.2 Detailed calculation of cycle costs in JPEG 2000.
A.3 Cycle cost in the renorme stage of JPEG 2000 CABAC.
A.4 Cycle cost in the encode stage of H.264 CABAC.
A.5 Cycle cost in the renorme stage of H.264 CABAC.

List of Examples

2.1 Doing arithmetic coding
2.2 Doing arithmetic coding with integers and shifting
7.1 A JPEG 2000 CABAC encoder with multibranch


Chapter 1

Introduction

This chapter gives a background to why this document exists and what questions it intends to answer. It should make clear to the reader why the work is interesting, how it was conducted, and what its limits are.

1.1 Background

With the advent of new and improved compression techniques like the H.264/AVC video compression standard published by ITU-T/ISO/IEC, as well as the still image (with extensions for motion) compression standard JPEG 2000, come new challenges for the hardware doing the compressing and uncompressing. One interesting technique that results in significantly reduced size, but also significantly higher computing demands, is the new and improved entropy coding. This is done by utilising advanced context models and arithmetic coding. Using entropy coding on bit level in these types of CODECs, which deal with lots of data, in real time on general purpose computers requires a lot of processing power. At certain resolutions this power is very tough to achieve even on multi-gigahertz machines, and power which is not available on state of the art desktop computers is certainly not available on smaller, less power hungry DSPs or other CPUs for embedded systems. To solve this, different solutions could be considered, such as using dedicated hardware in external ASIC chips, which is potentially very fast but not very flexible. Another solution, if it is possible, is to accelerate the most performance demanding parts of the CODECs in separate on-chip accelerators or with specialised instructions. This leads us on to the purpose of this paper.

1.2 Purpose and objectives

The purpose of this document is to analyse the computational requirements of JPEG 2000 and H.264 and discuss where acceleration is needed the most. With this knowledge, different solutions can be approximated in cost of area and performance. Another aspect of the performance is to find out whether it is really possible, with a set amount of resources, to do the entropy coding in real time. Yet another interesting question is whether it is possible to build a uniform accelerator for both techniques if the accelerators turn out to be really similar. Is it possible to parallelise the accelerators to improve performance in some way? Is it possible to do it by only adding a few specialised instructions that could also be used to accelerate other similar algorithms? Is there some part of the coding that is better done in hardware or software? These and similar questions should result in a hardware platform that can be benchmarked in simulation, with suggestions on how to integrate it in a real DSP or CPU.

1.3 Methods and tools

The task is approached by studying how the CODECs work in general and doing a deeper study of the entropy coding. Based on two software implementations, the CODECs are profiled to find out which parts are run the most and approximately how much time these parts consume. After that, a hand coded assembler version of both CODECs is studied to analyse approximately how large they are in that respect and where time could be shaved off with instruction level accelerators. Hardware implementations are then produced of the areas that seem the most interesting to accelerate, and benchmarking is performed on the implementations. Instruction level acceleration is tested by using a simulator for a simple architecture with common architectural features. Along the way the demands of the entropy coders are compared to see if it is possible to find similarities that can be exploited to build combined accelerators, as well as possibilities to parallelise and other relevant observations.

1.4 Time constraints

As time is limited, the thesis concentrates on acceleration of the entropy coding, only commenting on the surrounding areas. Focus is also on the parts that are interesting to optimise, such as the arithmetic coding. The work should be done in 20 weeks, which limits the scope of the document.

1.5 Layout of this document

The document is arranged so that chapters with background on the subject, with commentary on issues and features that will be interesting later on, come first. As the document progresses it contains more discussion and actual testing, and ends up in chapters 6, 7 and 8, which describe different ways to implement and optimise the problem in a sensible way. Every chapter evaluates how far the current model goes towards meeting the requirements, to weigh cost versus performance of the method in question.

Chapter 2 contains a quick introduction to VLC and various problems that will be of interest later in the document.

Chapters 3 and 4 give a brief introduction to JPEG 2000 and H.264 as a background to what kind of data is actually coded by the arithmetic coders.

Chapter 5 discusses the CABAC implementations in H.264 and JPEG 2000 and serves as a background for understanding CABAC.

Chapter 6 contains profiling of and comments on software-only implementations.

Chapter 7 suggests and evaluates different instruction level optimisations.

Chapter 8 discusses implementing different parts as hardware accelerators and discusses other extensions and SW/HW partitioning.

Chapter 9 concludes the benchmarking done in previous chapters with added discussion of pros and cons of the different models.


Chapter 2

Entropy coding

This chapter provides a background to how development went from variable length coding to arithmetic coding, how arithmetic coding works in general and some relevant issues that could cause problems when implementing arithmetic coding.

2.1 Background

The idea behind variable length coding (VLC) is to use information about the probabilities of different symbols, like numbers or letters, in a collection of data. These probabilities are used to code different symbols with codes of different length depending on their frequency in the collection of data. For example, a letter that occurs frequently in a text should use fewer bits than a symbol that occurs less frequently.

Common to all VLC coding systems is that they basically consist of a model of the data that contains probabilities, and a coder that utilises this data. The model could for example be decided before the data is encoded, if it is known exactly what the data to encode looks like and there is some means to store the probabilities for decoding. The first problem with doing this is that if data is to be communicated between two independent machines, the model must also be communicated to the decoding side for the data to be decoded according to the correct probabilities. The second problem is that all data might not be available at the time of sending, because of timing demands, or it may be impossible to analyse all data at the same time because of architectural limitations.

Figure 2.1. The standard procedure in VLC: uncompressed data goes through a coder, driven by a model, and comes out as compressed data.


The common solution to these problems is using an adaptive model that does not contain any probabilities from the start, but updates the model with probabilities as it codes data and finds out how frequently different symbols occur in the data set relative to other symbols. As the encoder does not have any information about the particular data at the beginning, there is no need to communicate any specific model data to the decoder, which instead adapts the model as it decodes, in the same way the encoder did. A variation of this theme used in modern coders is context adaption. Using context adaption, the encoder and decoder are informed of how the data will look by analysing similar data, and code it differently depending on the context in which the data is coded. They also adapt to how the actual data looks by increasing and decreasing probabilities according to some set standard. This means the context model only has to be communicated once and can then code any amount of new, similar data without sending detailed model information.[10]
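As an illustration of the adaption loop in figure 2.2, the following minimal C sketch (written for this text, not taken from any of the CODECs) keeps a binary model as two symbol counts that both encoder and decoder update identically after every coded bit, so no model data ever has to be transmitted.

#include <stdio.h>

/* Minimal adaptive binary model in the spirit of figure 2.2.  The encoder
 * and decoder both start from the same counts and update them identically
 * after every coded bit, so the model never has to be transmitted.        */
typedef struct {
    unsigned count[2];
} adaptive_model;

static void model_init(adaptive_model *m)
{
    m->count[0] = m->count[1] = 1;      /* no bias before any data is seen */
}

/* Probability of the next bit being 1, as a fraction count1/total. */
static unsigned model_p1(const adaptive_model *m, unsigned *total)
{
    *total = m->count[0] + m->count[1];
    return m->count[1];
}

/* Called with the bit that was just coded (or decoded). */
static void model_update(adaptive_model *m, int bit)
{
    m->count[bit]++;
}

int main(void)
{
    adaptive_model m;
    unsigned total;
    model_init(&m);
    model_update(&m, 1);
    model_update(&m, 1);
    printf("P(1) ~ %u/%u\n", model_p1(&m, &total), total);   /* prints 3/4 */
    return 0;
}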

Figure 2.2. A VLC model with adaption: each uncompressed symbol is coded and the model is then updated before the next symbol.

2.2 Actual implementation/Historic background

Data compression became interesting in the late 1940s, as the need to remove redundant information from messages to reduce their size grew because of different constraints. One of the forerunners in the field was Claude Shannon, "the father of information theory", who co-invented one of the first methods of doing variable length coding in a reasonable manner, called Shannon-Fano coding. Another, better known, method is Huffman coding, which arrived shortly after and was slightly better than Shannon-Fano coding before Shannon-Fano had made a real dent in the industry. Despite being similar in their design, the algorithm for building a Shannon-Fano tree used for coding different characters is quite different from building a Huffman tree with the same purpose. A Shannon-Fano tree is built roughly by these five steps:

1. Determine frequencies of different symbols to be coded.

2. Sort these symbols, most probable first and least probable last.


3. Divide the list in two so the upper half contains symbols with roughly the same total probabilities as the lower half.

4. The upper half is assigned a 0 and the lower half a 1.

5. Repeat steps 3 and 4 until all symbols have a unique code.

Building a Huffman tree is instead done bottom up, by assigning each symbol to a leaf node with a weight that is the probability of that symbol. A parent node is made for the two nodes with the lowest probability, and that node gets the collective weight of the nodes beneath it. Each of the nodes beneath it is assigned either a 0 or a 1, and the procedure is repeated on the nodes that now have the lowest probability until there are no loose nodes left.

Huffman coding has a weakness in that it cannot code a symbol in less than one bit, or in fractions of a bit. This means a symbol that should optimally have been coded with, say, 2.3 bits will get either two or three bits, which makes the coding suboptimal. Besides resulting in better compression, fractional coding is also of critical importance for the CODECs discussed in this document, as will be explained later. One solution to the non-fractional inefficiency is arithmetic coding, or variations of it such as range coding.[10]
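To make the bottom-up construction concrete, here is a small C sketch (written for this text; the probabilities are an arbitrary example) that repeatedly merges the two loose nodes with the lowest weight and then reads off each symbol's code length as its depth in the resulting tree.

#include <stdio.h>

#define NSYM 4

/* Minimal Huffman sketch: repeatedly merge the two loose nodes with the
 * smallest weight, as described above.  Internal nodes are appended after
 * the NSYM leaves; parent[] lets us walk back up to count code lengths.   */
int main(void)
{
    double weight[2 * NSYM] = {0.5, 0.25, 0.15, 0.10};  /* example probabilities */
    int parent[2 * NSYM] = {0};
    int used[2 * NSYM] = {0};
    int nodes = NSYM;

    while (1) {
        int a = -1, b = -1;
        for (int i = 0; i < nodes; i++) {        /* find the two smallest loose nodes */
            if (used[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        if (b < 0) break;                        /* only the root is left */
        weight[nodes] = weight[a] + weight[b];   /* new parent node */
        parent[a] = parent[b] = nodes;
        used[a] = used[b] = 1;
        nodes++;
    }

    for (int s = 0; s < NSYM; s++) {             /* code length = depth of the leaf */
        int len = 0;
        for (int n = s; parent[n] != 0; n = parent[n]) len++;
        printf("symbol %d: p = %.2f, code length %d bits\n", s, weight[s], len);
    }
    return 0;
}

With these weights the sketch prints code lengths 1, 2, 3 and 3, showing that the most probable symbol gets the shortest code but never less than one bit.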

2.3 Arithmetic coding

Arithmetic coding works by iteratively dividing a range, for example between 0 and 1, according to the probabilities of different symbols. The output of the coder then points to a value in the range that can only be reached by one particular sequence of symbols. Say that at the beginning the range is 0 to 1 and there are four symbols to encode, A, B, C and D, with the probabilities 1/2, 1/6, 1/6 and 1/6 respectively. Whenever a symbol is to be encoded, the current range is divided according to the size of the symbol's probability, and the new range starts at the accumulated probabilities of all the symbols before it in the list, to make it unique. This new range is then the basis for the next symbol to be encoded, as shown in the following example and figure.[10]


Example 2.1: Doing arithmetic coding

To encode the sequence ABAD with the probabilities stated earlier, the first step is to narrow the original range [0, 1[ to [0, 0.5[, as this is the probability of A (the range for B would be [0.5, 0.66...[). This is the first vertical line in figure 2.3.

After that the second symbol (B) is coded in this new range. Having the probability 1/6 it shortens the length to 0.083333..., and as its coding range comes after the A, which has the coding range [0, 0.25[ in the second step, the range becomes [0.25, 0.33...[, as depicted between the second and third lines of the figure.

Doing the same with the next letter first gives [0.25, 0.291666...[, and after the fourth symbol the range starts at 0.25 + (1/2 + 1/6 + 1/6) ∗ (0.291666... − 0.25) = 0.2847222..., giving [0.2847222..., 0.291666...[. In this range the value 0.2890625 can be found, and that value can be encoded with 7 bits, which means the message is compressed by one bit.

Figure 2.3. A graphical explanation of arithmetic coding: the interval is narrowed from [0.0, 1.0[ to [0.0, 0.5[, [0.25, 0.333[, [0.25, 0.29166[ and finally [0.284722, 0.29166[ as A, B, A and D are coded.

It should however be noted that some sort of "end of message" symbol with low probability is needed if this is to be used practically. That symbol would in this case make the message a few bits longer than the original message. It is however clear that for a longer message the tendency of size reduction would have been approximately the same with a correct probability model, and the total size of the coded message would therefore have been smaller than the original.

An interesting effect of this method is that in extreme cases it can also reach compression ratios of less than one bit per symbol, as suggested earlier. By allowing less than one bit per symbol there is a possibility to do binary arithmetic coding, where only two symbols are used, for example 1 and 0, which in many cases is the most convenient and efficient thing to do. To do this with Huffman coding it is necessary to build larger symbols, even though the probabilities only exist for the basic symbols 0 and 1. This means that statistics have to be calculated for the larger symbols based on the binary symbols, which is not practical in some cases.
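Example 2.1 can be reproduced directly in a few lines of C. The sketch below (ours, not part of any CODEC) encodes ABAD with the fixed probabilities 1/2, 1/6, 1/6, 1/6 using floating point arithmetic and prints the shrinking interval; a practical coder would use the integer formulation of the next section instead.

#include <stdio.h>

/* Floating point arithmetic coding of "ABAD" as in example 2.1.
 * cum[i] is the accumulated probability of all symbols before i. */
int main(void)
{
    const double p[4]   = {1.0 / 2, 1.0 / 6, 1.0 / 6, 1.0 / 6};   /* P(A..D)   */
    const double cum[4] = {0.0, 1.0 / 2, 2.0 / 3, 5.0 / 6};       /* cumulative */
    const int msg[4]    = {0, 1, 0, 3};                           /* A B A D    */

    double low = 0.0, range = 1.0;
    for (int i = 0; i < 4; i++) {
        low   += range * cum[msg[i]];   /* move the base up to the symbol's slot */
        range *= p[msg[i]];             /* shrink the range to the symbol's size */
        printf("after symbol %d: [%f, %f[\n", msg[i], low, low + range);
    }
    /* Any value inside the final interval identifies the whole message. */
    printf("final interval: [%f, %f[\n", low, low + range);
    return 0;
}

Running this reproduces the intervals of the example, ending in [0.284722, 0.291666[, inside which 0.2890625 (7 bits) lies.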

2.4 A few more practicalities

Even though arithmetic coding might seem quite simple in theory, there are a few implementation issues that make things a little difficult and are therefore interesting to discuss. First is the fact that using floating point or fixed point arithmetic might seem costly and impractical, as the digits will run out pretty quickly even on a modern CPU with 80-bit floating point arithmetic. That kind of machinery is far from the norm when it comes to smaller CPUs or DSPs in embedded systems. This first problem is easily solved by using integers, which work almost equally well. Instead of dividing a series between 0 and 1, numbers between two integers are used, where the range limits are discrete values. "Almost equally well" has to be said, as this scheme introduces another problem, as shall be seen later. With integers the coding range can easily be renewed by shifting out a few bits of the low range value and high range value when they are too close to each other, so that for example both low and high range look like YYYXXX where the Y values are equal. The shifting will make sure they are sufficiently far away from one another, ensuring there is enough precision to divide the coding range into potentially very small fractions.

As is done in the CODECs, the coding range will from now on be described by a low value, which is the integer base and the lowest value of the actual coding range, and another value referred to as range, which is the current length of the coding interval. A low of 234 and range of 45 would in principle mean that the coding is done between 234 and 279.[10]


Example 2.2: Doing arithmetic coding with integers and shifting

If the integer arithmetic coding range is based from 0 to 8000, the area could be divided into 1/2, 1/4 and 1/4 for the symbols A, B and C. If a B is coded the coding interval becomes 2000 long with a base value, often referred to as low, of 4000. Before coding the next symbol the coding range value (2000) can now be shifted until it is 8000 once again, and then the low value is shifted equally much. When the low value grows too big, a few of the top bits are removed from it and these bits are considered coded. In this example it seems perfectly simple to do because the probabilities are even binary fractions, but it also works with other fractions, whether multiplications are used or, as shall be seen later, estimates.

Integers may look perfect, but there is an added complication, as mentioned earlier. The situation occurs when the range closes in around a value where low, according to the YYYXXX model above, is getting really close to low + range but not quite close enough, like the values 2999 and 3000. In this case it is not certain what the next value to shift out from low will actually be, as there can still be carry-overs. But as the coding range is getting really small, costing precision, something must be done. This can be solved by shifting low anyway, but remembering the uncertainty by counting the number of times such a shift has been done without actually knowing what to shift out, and ignoring the bit that is shifted out. When the final value is known, that value is shifted out and X digits of 9s or 0s are added depending on whether the value was high or low (or in the binary case: 0s if the bit that came out was 1, giving 1000..., or 1s if the bit was 0, which gives 0111...). This solution takes a little extra checking and can be implemented with variations, as is done in JPEG 2000.
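Example 2.2 in code form, as a minimal C sketch written for this text: the interval is kept as integer low and range values, and the renorm doubles both until the range fills the window again. Emitting the settled top bits of low, and the carry problem that comes with it, is deliberately left out; that is exactly the outstanding-bits machinery revisited in section 5.4.

#include <stdio.h>

/* Integer version of example 2.2: the coding window is 0..8000 and the
 * symbols A,B,C have probabilities 1/2, 1/4, 1/4.  subdivide() picks the
 * symbol's slot of the current interval and renorm() doubles low and
 * range until the range fills the window again.  Writing out the settled
 * top bits of low (and handling carries into them) is left out here.     */
#define WINDOW 8000L

static long low = 0, range = WINDOW;

static void subdivide(long cum_num, long p_num, long denom)
{
    low  += range * cum_num / denom;   /* base of the symbol's slot */
    range = range * p_num   / denom;   /* size of the symbol's slot */
}

static void renorm(void)
{
    while (range < WINDOW) {           /* double (shift) until full again */
        low   *= 2;
        range *= 2;
    }
}

int main(void)
{
    subdivide(2, 1, 4);                /* code a B (cumulative 2/4, size 1/4) */
    printf("after B      : low = %ld, range = %ld\n", low, range);
    renorm();
    printf("after renorm : low = %ld, range = %ld\n", low, range);
    return 0;
}

The output shows low = 4000 and range = 2000 after the B, and low = 16000, range = 8000 after the renorm, matching the example.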

2.5 Adding it all together

When a few of the described techniques (arithmetic coding with binary symbols, context adaption and the integer modification) are brought together, parts of a technique called context-adaptive binary arithmetic coding, or CABAC for short, can be understood. CABAC is what is used in JPEG 2000, and as one of the alternative entropy coding methods in H.264, which are the CODECs discussed in this thesis. What has been discussed in this chapter is mainly what is referred to as the M-coder in H.264 and the MQ-coder in JPEG 2000; an overview of the other parts is given in chapter 5.


Chapter 3

JPEG 2000

This chapter contains a brief overview of what is done with the data in the JPEG 2000 standard, as a background to what is actually coded in the entropy coder and how the contexts are chosen.

3.1 Purpose

The JPEG 2000 standard is a modern standard for image compression that was developed in an effort to improve a few parameters of the old JPEG standard. A set of important features, some of which are new and others which are improvements upon JPEG, are state-of-the-art low bitrate compression performance, progressive transmission by quality, resolution, component or spatial locality, lossy and lossless compression, and random (spatial) access to the bit-stream. The improvements over the old JPEG standard are primarily better compression at smaller resolutions and lossless encoding, together with better facilities to choose quality/compression ratio on the fly, which means the same source could easily be sent over mediums with different bandwidth using different quality.[8]

The JPEG 2000 standard consists of a baseline that contains the basic compression features, as well as a few proposed extensions for specialised uses such as Motion JPEG 2000 and formats to use with other mediums such as pre-press and fax applications.[14]

The compression steps in baseline JPEG 2000 are:

1. Apply colour transforms and whatever else might be necessary to the original image.

2. Divide each colour channel into tiles that are coded separately.

3. Apply a wavelet transform on each tile, resulting in four different sub-bands per tile to compress separately.

4. In lossy encoding, apply quantisation, which results in loss of precision in the coefficients but also requires less space.

5. The bits from the wavelet transform are now formed into bitplanes (see figure 3.2) that consist of bits of the same significance from neighbouring coefficients. The relations the bits have in the bitplane are used to form coding contexts for the entropy coder (the MQ-coder).

6. The bits are coded and put into packets that make sure searching for data in the stream and similar operations are possible.

Figure 3.1. The standard procedure of coding JPEG 2000: colour space transform, wavelet transform, quantisation, entropy coding tier 1 (context formation and MQ-coding) and entropy coding tier 2 (rate optimisation) turn the uncompressed image into a compressed image.

3.2 Context modelling

In more detail the context formation works as follows: a bit in a bitplane is "connected" to the bits in the same coefficient and the bits closest to it in the bitplane. A bit can be encoded in three different passes depending on the surrounding bits and on how bits were coded in previous bitplanes for that coefficient. A bit can be coded in these passes:

Significance pass If a neighbouring coefficient had its first significant bit (a bit that is 1) in a previous bitplane.

Magnitude refinement pass When a coefficient has already had its significant bit, it is coded in the magnitude refinement pass.

Normalisation/cleanup pass If the bit was not coded in either of the previous passes, it is coded in the normalisation/cleanup pass.

This way, depending on what state the neighbours are in, it is possible to use a good probability model to choose, from a range of predefined contexts, which probability state a bit should be encoded with.
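A minimal sketch of that decision, with flag names invented for the illustration (the real standard keeps a few more state bits per coefficient and also distinguishes sub-bands when forming the final context number):

/* Hypothetical pass selection for one bit, in the spirit of section 3.2.
 * The two flags stand in for the per-coefficient state that the standard
 * tracks across bitplanes.                                                */
enum coding_pass { SIGNIFICANCE_PASS, REFINEMENT_PASS, CLEANUP_PASS };

enum coding_pass select_pass(int coeff_already_significant,
                             int any_neighbour_significant)
{
    if (coeff_already_significant)
        return REFINEMENT_PASS;     /* coefficient had its first 1 earlier   */
    if (any_neighbour_significant)
        return SIGNIFICANCE_PASS;   /* a neighbour became significant before */
    return CLEANUP_PASS;            /* everything else                       */
}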

3.3 Motion JPEG 2000

Figure 3.2. A few neighbouring bitplanes.

The need for a large amount of resources comes when trying to code Motion JPEG 2000. Motion JPEG 2000 works almost the same as standard JPEG 2000, because each frame of the movie is coded as a reference frame and no added compression techniques like motion estimation are used. But unlike baseline JPEG 2000, to get video there are multiple frames to code and decode. This technique is designed to be used for implementations with a high demand for quality, like encoding digital cinema (at resolutions like 4096x3112), material recorded on an HD camera, or medical images where there is no room to sacrifice quality, so less compression or even lossless coding is used. The fact that there are features like lossless coding and easy scaling makes Motion JPEG 2000 even more interesting, as a high quality stream could be saved directly from the camera and the same compressed stream could first be stored in high quality. The data can then be sent over a limited bandwidth channel by removing some of the data, sacrificing quality, or in original quality over a higher bandwidth channel. The most important thing for us to consider is that Motion JPEG 2000 generates a lot of data. This is especially true when used at cinema or medical size resolutions, which makes it clear that the encoder will have to sustain a high bitrate if encoding is to be done in real time.[4]


Chapter 4

H.264/AVC

This chapter describes how H.264 works at large and, in a bit more detail, how context modelling and encoding are done.

4.1 Purpose

H.264 is another fairly recent standard aimed at providing efficient coding for a wide variety of formats, from small scale QCIF all the way up to full scale 1080p HDTV and beyond. Examples given by the principal authors of the standard in their overview [19] include:

* Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.

* Interactive or serial storage on optical and magnetic devices, DVD, etc.

* Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc., or mixtures of these.

* Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.

* Multimedia Messaging Services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc.

There is clearly a need for better compression to send next generation HDTV over non-broadcast channels with lower bandwidth, without saturating the end points or routers with a large number of streams. With the need to send most of the data over some data channel, it becomes apparent that the design should also be based around a network abstraction layer so the data can be transmitted over a wide variety of data channels.[19]

As it is a full blown video CODEC, it carries more information concerning inter-frame coding and the like, which is the basis of much of the context formation for the entropy coding, as shall be seen from the features.

Figure 4.1. A description of the encoding process: input in 16x16 macroblocks passes through transform, scaling and quantisation and then CABAC, steered by the coder control and motion estimation.

The classical way to compress video is to exploit spatial and temporal redundancy by making the coding of large blocks that have approximately the same colour, or blocks that do not move between two frames, small. H.264 improves on these techniques with flexibility in how large the motion vector blocks can be, and also improves where these motion vectors are allowed to reside in the picture. H.264 also allows more reference pictures to be used for temporal redundancy, by forcing the decoder to store more frames when decoding. The restriction on the order in which pictures are stored and reference each other is thus removed which, as said earlier, forces the decoder to allocate more memory but also gives more freedom to the encoder. It is interesting to note that the standard does not state exactly how to encode the picture, and two encoders could certainly do it in slightly different ways. The encoding could be done with different compression and computational efficiency but still comply with the standard. Other noteworthy additions to the standard are the use of smaller blocks in the basic transform, and the fact that the transform produces exactly the same content on all decoders when decoding the same file, thanks to the inverse transform being computationally practical to perform in an exact manner. The standard contains a lot of features concerning these redundancies, and a good description of how it works can be found in [19].

The feature of extra interest to this thesis is the new arithmetic coding system that can be used instead of the less performance hungry CAVLC (context-adaptive variable-length coding). The entropy coder is a CABAC, with the arithmetic coding part referred to as the M-coder (described in detail in chapter 5). Both methods use context adaptivity but they use it a little differently; section 4.3 describes how it is done in the CABAC. The CAVLC path basically has a pre-calculated table with codewords based on Exp-Golomb codes that most things are mapped to. The exception is the quantised transform coefficients, for which the CAVLC method itself is actually employed; it exploits statistical probabilities of various properties of the quantised transform coefficients to perform an adaptive VLC.

4.2 Anatomy of the H.264 frame

To understand why large scale parallelisation is or is not possible in H.264, it is of interest to study the way a frame is built in H.264. A coded video sequence in H.264 consists of pictures. These pictures can be built from two interleaved fields that each contain the odd or even rows. They are called progressive when both fields are from the same point in time, and really part of the same picture, or interlaced when they are taken at slightly different times. As a human is more perceptive to brightness than to colour, the YCbCr colour space with 4:2:0 sampling is used (referred to as YUV 420), which means that Y, the luma, has a higher resolution than Cb and Cr (chrominance blue and red). When coding a field it is partitioned into macro-blocks that contain 16x16 luma samples and 8x8 chroma samples of each colour. These macro-blocks are in turn grouped into slices that do not need to share data with each other, except when using filters that apply to the whole picture (otherwise they only need data from their own slice and the reference pictures). Macro-blocks can be divided into slices arbitrarily through a concept called flexible macro-block ordering. This way slices can be formed for best performance, where concepts like region of interest are used to spend more resources on the parts of the image that are actually interesting.

Figure 4.2. A group of macro-blocks (dotted) framed into slices (slice 0, 1 and 2) that divide a field.

The state of the entropy coding is reset before coding every slice, which is good from a parallelising standpoint, as it allows several threads to run one slice each without having to worry about data dependencies on nearby slices. The slices are actually one of the few easily exploitable parallelisation opportunities in the standard, from the entropy coding point of view, when not going low level.


4.3 Context formation

There are a number of different ways to build contexts in H.264, and they can be divided into four main categories. The first type gets its context from previously coded blocks that are neighbours above and to the left (see figure 4.3). This is used for things like motion vectors, where nearby blocks have a tendency to move the same way. The second type is a model which is used to code macro-block information and builds its context based on previous bits coded in this state. The third is only used on residual data. The third and fourth categories have separate context formation for different internal types, but a common feature of group three is that it does not use past data. The fourth type is, like the third, only used for residual data, and the contexts are based on accumulated encoded values from earlier encodings. Some data is also encoded without a particular model. [9]

Figure 4.3. Calculating context by using neighbours.

The different context indexes describe how to encode things like status info for macro-blocks, motion vectors (there are different contexts depending on the length of the vectors and so on) and transforms. There are in total 399 different context indexes, and these in turn each point to one of 64 different (6 bit) probability states together with a current mps (another bit).[18]


Chapter 5

The CABACs

This chapter describes how the CABAC is implemented in JPEG 2000 and H.264 and discusses differences and similarities.

5.1 Basics

In principle the arithmetic coding parts of the CABACs in JPEG 2000 and H.264 work like the techniques described in 2.3, but with a few practical differences. As it is a binary encoder, bits are encoded one by one, and bits are also shifted one at a time (it would certainly be impractical to shift with a base of 10). The symbols that are coded are referred to as the most probable symbol (mps) and the least probable symbol (lps), and whether mps is 0 or 1 is determined by the context. In both implementations the convention of a low value that is the base of the current range is used (referred to as C in the JPEG 2000 standard [7]), together with a range value that is the actual coding range (called A in JPEG 2000). The division of these between lps and mps is done a little differently: in H.264 the lps is coded in the higher values of the range (that is, range minus the lps probability is added to low) [18]. In JPEG 2000 it is by default reversed, so that the lps field is coded in the lower part of the range, except when the lps range is actually larger than the mps field, in which case the fields are swapped; more on that later.

Besides the binary arithmetic coding, a large amount of work is done in the other steps of the CABAC: binarisation and context modelling. Binarisation is largely already specified in the standard and is where it is decided how to describe the different actions of the CODECs in a sensible and easily coded way. The context modelling deals with selecting the correct context for the bits (which have been decided by the binarisation) and is perhaps the largest block in the CABAC.

5.2 Probability states

The contexts in JPEG 2000 are built from the bitplanes as described in 3.2. The probabilities are then calculated from 47 different probability states that are updated depending on whether an mps or an lps was coded. In H.264 the table has 64 probability states, where the states with close to equal probabilities are the low-numbered ones, so the current mps is inverted when an lps is coded in state 0.
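A context in either coder is thus only a probability state index plus the current mps. Below is a minimal C sketch of how this could be laid out for the H.264 case, together with the update rule just described; the names are ours (not the JM code's), and the transition tables are those defined by the standard, only declared here and not reproduced.

#include <stdint.h>

/* One H.264-style CABAC context as described above: an index into the 64
 * probability states plus the current most probable symbol.  399 of them
 * are kept, selected by the context index discussed in section 4.3.      */
typedef struct {
    uint8_t state;   /* 0..63, probability state of the lps  */
    uint8_t mps;     /* 0 or 1, current most probable symbol */
} cabac_context;

static cabac_context ctx[399];

/* Transition tables defined by the standard (declared only). */
extern const uint8_t next_state_mps[64];
extern const uint8_t next_state_lps[64];

/* Update after coding one bit in context c. */
static void update_context(cabac_context *c, int coded_bit)
{
    if (coded_bit == c->mps) {
        c->state = next_state_mps[c->state];
    } else {
        if (c->state == 0)           /* near-equal probabilities: flip mps */
            c->mps = !c->mps;
        c->state = next_state_lps[c->state];
    }
}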

5.3 Avoiding multiplication

One of the main performance concerns, especially when coding bitwise, is the multiplication that is required to accurately calculate the new range. If Qe is the probability of the least probable symbol, the arithmetic coding could look like this for an mps:

low = low + range ∗ Qe

range = range − range ∗ Qe

and for lps. . .

low = low

range = range ∗ Qe

As multiplication is considered slow, and expensive to implement when not already available, both CABACs have taken measures to avoid it, in different ways. In JPEG 2000 the range is kept as an integer that is mapped to a floating point value, so for example 0x8000, which is the initial value, corresponds to 0.75. The trick used to avoid the calculation range ∗ Qe is to keep the range sufficiently close to 1.0 that the calculation can be approximated. With this approximation an mps instead looks like:

low = low + Qe

range = range − Qe

and lps . . .

low = low

range = Qe

The way to make sure that the value is always close enough to 1, where close enough is defined as 0.75 < range < 1.5, is to shift the range as many times as necessary after every encoded bit that leaves the range below 0x8000, in a process called renorming.[7]

When designing H.264 the designers did not think solutions such as the one used in JPEG 2000 resulted in sufficiently good compression, as the extreme cases, 1.5 and 0.75, give a rather bad approximation of 1.0. The solution in H.264 is to use a table from which different probabilities are loaded depending on the size of the range. A table is of course used for the probability values in JPEG 2000 too, but in H.264 there are four different values per probability state, selected by the range. This means the solution looks like this in H.264, mps first (notice the difference in where lps and mps are put):


low = low

range = range − Qe[range]

and for lps . . .

low = low + (range − Qe[range])

range = Qe[range]

This does not avoid the renorm problem, but it gives better accuracy and, according to the designers, significantly better coding efficiency at a small drop in performance. The JPEG 2000 CODEC actually contains another quirk in that the probability of the least probable symbol can in certain cases be larger than that of the most probable symbol, forcing a swap of the probability fields. This is largely avoided with the added precision in H.264 as, even if the mps field is smaller, the difference is never that big.[9]
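Put as code, and heavily simplified, the two update rules above could look like the following C sketch. The table contents (the 47-entry Qe table of JPEG 2000 and the 64x4 lps-range table of H.264) are defined by the respective standards and only declared here, renormalisation and the JPEG 2000 interval swap are left to section 5.4, and the names are ours rather than those of the reference code.

#include <stdint.h>

extern const uint16_t qe_table[47];        /* JPEG 2000 lps probabilities */
extern const uint8_t  range_lps[64][4];    /* H.264 quantised lps ranges  */

/* JPEG 2000 MQ-coder style mps: range is kept near 1.0 (0x8000), so the
 * product range*Qe is approximated by Qe itself.  For an lps the roles
 * are reversed: low is left alone and range becomes Qe.                  */
static void j2k_encode_mps(uint32_t *low, uint32_t *range, int state)
{
    uint32_t qe = qe_table[state];
    *low   += qe;
    *range -= qe;
    /* renorm when *range drops below 0x8000 (section 5.4) */
}

/* H.264 M-coder style mps: bits 6 and 7 of range pick one of four
 * pre-quantised lps ranges for the probability state, so no
 * multiplication is needed and the approximation is finer.               */
static void h264_encode_mps(uint32_t *low, uint32_t *range, int state)
{
    uint32_t rlps = range_lps[state][(*range >> 6) & 3];
    *range -= rlps;
    /* for an lps: *low += *range (after the subtraction), *range = rlps;
       renorm when *range drops below 0x100 (section 5.4)                 */
    (void)low;   /* low is not touched for an mps in H.264 */
}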

5.4 Renorm

The renorming process makes sure the range value stays sufficiently close to 1, either to be used as is, as in JPEG 2000, or to be used for a table lookup, as in H.264. As all the shifts are done here, there is no need to shift for extra precision because of a short coding range as described in 2.3; the range will automatically be kept long enough. The renorm is noticeably different between the two CODECs, which has its basis in how they treat the outstanding bits problem (described in section 2.4). In JPEG 2000 it is solved by stuffing in an extra bit when there is risk of carry propagation, thereby eliminating the problem. This check is done every time low has been shifted 8 times; then those bits are simply removed and the rest of the calculation can be carried out. In H.264 it is solved closer to the solution described in chapter 2, where a counter is used: the number of bits that are unresolved is counted, and when a safe bit is found the outstanding bits are emitted as the inverse of the safe bit. There is a theoretical problem in that the number of outstanding bits could grow indefinitely, but this is only a theoretical problem in most practical uses (when simulating one of the HDTV example clips it never grew beyond 25, even in that quite large data set), and the actual maximum is given by the number of binary decisions done in one slice (as the count is reset at the end of the slice), as Ceil(Log2(MaxBinCountInSlice + 1)) [18]. Flowcharts of the renorms can be found in figures 5.1 and 5.2.
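For reference, the H.264 side of this can be written out in a few lines of C; the sketch below follows the flow of figure 5.2 (put_bit resolving the counted outstanding bits with the inverse of the next safe bit), with write_bit reduced to a printf instead of real bitstream output and the demo values in main chosen arbitrarily.

#include <stdint.h>
#include <stdio.h>

static unsigned bits_outstanding = 0;

static void write_bit(int b) { printf("%d", b); }

static void put_bit(int b)
{
    write_bit(b);
    while (bits_outstanding > 0) {   /* deferred bits resolve to the inverse */
        write_bit(1 - b);
        bits_outstanding--;
    }
}

/* Renorm loop as in figure 5.2: range is kept at or above 0x100. */
static void renorm(uint32_t *low, uint32_t *range)
{
    while (*range < 0x100) {
        if (*low >= 0x200) {             /* top bit settled to 1 */
            put_bit(1);
            *low -= 0x200;
        } else if (*low < 0x100) {       /* top bit settled to 0 */
            put_bit(0);
        } else {                         /* undecided: count it as outstanding */
            bits_outstanding++;
            *low -= 0x100;
        }
        *low   <<= 1;
        *range <<= 1;
    }
}

int main(void)
{
    uint32_t low = 0x2AB, range = 0x40;  /* arbitrary demo values */
    renorm(&low, &range);
    printf("\nlow = 0x%x, range = 0x%x, outstanding = %u\n",
           low, range, bits_outstanding);
    return 0;
}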

5.5 Problems with implementations

An implementation of a fast CABAC has a few problems, most notably dependencies and speed. The JPEG 2000 coder is a bit simpler to implement, with less reliance on large memories to fetch data and save contexts (although slightly smaller tables are still needed), and especially its solution to the bits outstanding problem makes it easier to implement in one cycle. Doing the renorm fast is harder in H.264, at least in theory, and if a special solution is not used it can take several cycles of renorm before the next bit can be coded. As the renorm can consist of around 10 loop iterations in H.264, and even more in JPEG 2000, implementing them as a simple combinatorial net is out of the question, and they have to be built as state machines, which in practice means coding a bit takes a few cycles, as will be described in the hardware chapters later. A better solution to the problem is to divide the different parts to update between different machines and use barrel shifters to do fast renorms, a solution that will be discussed later.

5.5.1 Uniting the CABACs

There are also obvious problems in uniting the two CODECs into the same accelerator, even in the coding stage, as the definitions of what part of the range becomes lps and what becomes mps work a little differently if a common instruction were made for that step. The difference is even more evident in the renorm step, where it is hard to find similarities that would give much of a boost to both encoders, but possible solutions to these problems will be discussed later.

Figure 5.1. The JPEG 2000 renorm with byte-out.

Figure 5.2. The H.264 renorm. The writeBit function is omitted.


Chapter 6

Analysis of software implementation

This chapter discusses benchmarking of the software implementations and how fast a hand optimised assembler version would be, which is good to know when comparing the software algorithms versus the hardware.

6.1 Requirements

The requirement for encoding H.264 at HDTV resolutions of 1080p / 30 fps in real time with good quality is approximately 10 Mbps, as discussed earlier. Reaching this kind of performance, encoding that mass of bits within a fixed MIPS budget, requires some serious optimisation, as will be discovered in this chapter. To evaluate how many resources are needed it is useful to state a goal for how many MIPS can be spared for the CABAC. The device should, if possible, do the arithmetic coding part of the CABAC at 10 Mbps in less than 50 MIPS (a made up limit used as a general aim, as 5 cycles per bit was deemed fair). The demands of other resolutions, added for reference, can be found in table 6.1.

          QCIF        CIF        4CIF       HDTV
Bitrate   <0.5 Mbps   <1 Mbps    <3 Mbps    <10 Mbps

Table 6.1. Approximate bitrates of various resolutions at 30 fps.

Typical characteristics of the arithmetic coding and renorming in JPEG 2000 and H.264 are:

1. Large amounts of branches.

2. Not very much arithmetic to go through.

3. In H.264 the function used when fetching the length of the lps, ((range >> 6) & 3), is simple in hardware but requires a shift and an AND when done in assembler.

6.2 On benchmarking with gprof

The GNU profiler (gprof) was used to benchmark which functions are called the most and approximately how much time they use. With simple single threaded jobs like these it should give enough precision to make decisions on what to improve in an understandable way. To get the profiler to work properly, optimisations in the code had to be rewritten, and functions that had previously been defined as macros (to avoid function call overhead) had to be reverted to ordinary functions. This, together with the overhead of using gprof when doing measurements and the general state of the code as unoptimised, might make some conclusions regarding time and percentage slightly misleading, but the frequencies of the function calls are correct and result in some interesting statistics to work with. This is especially true for the JasPer code. In the JM code a few things like the renorm had to be left alone when rewriting, as the performance penalty was more than triple the time and therefore did not give any extra information (the renorm is run every time a bit is coded, so the call frequencies are available anyway).

For some additional counting, especially in JM, a few counters were added to count things like the mps/lps ratio.

6.3 JasPer

JasPer is, according to [1]:

In simple terms, JasPer is a software tool kit for the handling of image data. The software provides a means for representing images, and facilitates the manipulation of image data, as well as the import/export of such data in numerous formats.

What makes JasPer interesting to this thesis is the fact that it has a mature implementation of JPEG 2000 that can be used for benchmarking and profiling. JasPer implements the context building mostly in jpc_t1enc.c and the MQ-encoder in jpc_mqenc.c. The probability index table is done like in the standard, but its length is doubled, with one state for mps = 1 and one for mps = 0 for every state in the standard. This of course makes for a bigger table, but a little less arithmetic is needed when fetching the next state. As JPEG 2000 does not have as many contexts as H.264, it is a lot simpler to analyse the code.

6.3.1 Benchmarking

Running gprof and the analysis tools on JasPer gave the numbers in table 6.2. The images used are the ones that come on the CD in [15]. Small samuel is the samuel picture scaled, cut and cropped to 352x288, which makes other things like file IO and initialisations take a bigger percentage. Otherwise the images, although looking different visually, looked roughly the same from a performance standpoint.

               Samuel     Hawaii     Ge         Small samuel
Rate           0.01       0.01       0.01       0.01
Context (%)    30.46      30.78      33.79      23.86
Context (#)    48876      33168      35709      1929
MQ-coder (%)   29.1       28.0       31.0       19.2
MQ-coder (#)   68288944   42114326   49467612   1163483

Table 6.2. Performance estimates on JPEG 2000 coding in JasPer.

In this implementation JPEG 2000 has a lot of its MIPS cost in the CABAC region. Around 60% of the time seems to be spent in this area, roughly half on the context modelling in the bitplanes and the rest on arithmetic coding and shuffling out the bytes. So in JPEG 2000 it is of interest to look at optimising at least the arithmetic encoder, which is little code but is run many times. Worth noting, though, is that the time is spread over a large number of calls, and because the algorithm is already pretty compact there is of course a limit to how much it can be optimised (it is for example impossible to get under one cycle per bit without using a parallel solution), but that will be examined in the next step. The amount of work done by the coding does not depend on the compression rate, as the rate distortion work is done after the encoding. The work load presented here should be approximately the same in other implementations, if optimisations are not considered, as there is nothing that could be done differently in the encoder to get a better compression ratio.

Another interesting thing which can be found in the numbers is that approximately 2/3 of the calls to encode a bit result in a renorm, and approximately 2/3 of the calls are mps.

6.4 JM

JM is the official reference software released by the people behind the standard, such as the Fraunhofer Institute in Germany. The code does not seem to be written for efficiency, but is instead nicely partitioned according to the different parts of the standard. The context building can be found in cabac.c and the binary encoder in biariencode.c. As the CABAC implementation is done by the same people that wrote the standard, the code follows the behaviour suggested by the standard very closely, with no major differences in the implementation of the state tables and other such things. This implementation is really slow, which is understandable for an encoder of this complexity, but one should note once again that an encoder does not have to follow the standard in anything besides the format of the file, and could produce material of varying quality while still being compliant. This leaves room for performance boosts in other areas.[19]

6.4.1 Benchmarking

The numbers generated by JM are in many ways less relevant than the numbers generated by JasPer, as the H.264 standard does not specify how to encode a file. Considering that H.264 has features like motion estimation, which could theoretically be skipped when encoding, there is a lot of freedom to do things differently when encoding. This fact makes the relative numbers of the CABAC less interesting, as the reference design might be overly thorough and not especially optimised. Or, the other way around, an implementation could perhaps be made that is even more extreme when it comes to looking for temporal redundancy in the image sequence. The split between context building and arithmetic coder is however interesting. It should be noted that the tests have been run in low complexity mode (RDoptimization = 0) to limit the amount of work (otherwise several hundred Mbps would be needed when doing HDTV, which is just not feasible on a small scale system anyway, for limited gains in compression efficiency). The approximate percentage of work done by the CABAC when complex motion estimation is on is less than 0.1%.

The different clips in table 6.3 are taken from a set of standard 1080p clips [17] and standard clips in smaller sizes from [16].

                   Paris     Highway   Pedestrian
Resolution         352x288   352x288   1920x1080
Frames             250       250       50
Context(#)         1914198   1208284   6790618
M-coder(#)         5551384   3129586   15618854
M-coder/CABAC (%)  34        31        31
Mbps (30fps)       0.67      0.38      9.37

Table 6.3. Performance estimates on H.264 coding in JM

As stated earlier the CABAC in this implementation is a very small part of the total MIPS cost. Nonetheless it is run a very large number of times, especially on the HDTV clip where approximately 10^7 bit encodings per second have to be executed if the encoding is to be done in real time. With a budget of 50 MIPS that would give us 5 cycles per bit to encode, with all memory transfers and so on included. How much this can be optimised is of course based on how complex the assembler is, and this will be investigated further in this chapter, but it should at least be theoretically possible.
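As a sanity check of those figures, the short calculation below derives the bin rate from the Pedestrian column of table 6.3 and the resulting cycle budget at the 50 MIPS target. This is a rough sketch; the 50 MIPS budget is the figure assumed in the text, everything else comes from the table.

/* Back-of-the-envelope check of the bin rate and cycle budget, using the
   Pedestrian column of table 6.3 (50 frames, 15618854 M-coder calls, 30 fps)
   and the 50 MIPS budget assumed above. */
#include <stdio.h>

int main(void)
{
    const double bins        = 15618854.0; /* M-coder(#) for Pedestrian  */
    const double frames      = 50.0;
    const double fps         = 30.0;
    const double budget_mips = 50.0;       /* assumed CABAC cycle budget */

    double bins_per_second = bins / frames * fps;        /* roughly 9.4e6 */
    double cycles_per_bin  = budget_mips * 1e6 / bins_per_second;

    printf("bins/s: %.3g  cycle budget per bin: %.1f\n",
           bins_per_second, cycles_per_bin);             /* about 5.3     */
    return 0;
}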

It should be noted that if a faster implementation is used for other stages such as motion estimation, there would either be more bits to encode, which would have a negative impact on the CABAC performance, or worse quality, which in many cases might not be acceptable. The context processing in H.264 is, as earlier stated, quite a big machine and it takes up the majority of the performance in this implementation of the CABAC.

The renorm loop in this CABAC is run around 83% of the number of times a bit coding is run (sometimes it runs more than once). mps is coded for around 70% of the bits in one test.

6.5 OR1200

To calculate an approximation of how many cycles are required to encode a bit, an approximate assembler implementation was constructed. The target architecture for that implementation was the OpenRISC 1200 (OR1200), see [5] and [6]. OR1200 is an open source implementation of a RISC architecture and has a standard RISC instruction set with architectural features such as delayed branching. Even though it was aimed at a certain architecture the approximation is still rather crude, and no effort was wasted on trying to get correct timing numbers for memory reads and writes. These depend too much on the target architecture, and a pure RISC assembler implementation will be far from fast enough anyway.

6.6 Assembler with analysis

By using probabilities for the different paths from the above benchmarking, the following pessimistic approximations were produced (for a little more thorough description see section A.1).

        Renorm  mps    lps    Total
H.264   18(1)   10(2)  13(2)  29(3)
J2K     14(3)   8(2)   11(2)  20(3)

Table 6.4. Number of cycles in an assembler implementation of the CABACs

Note that X(Y) in the table means X cycles, of which Y are memory operations. The JPEG 2000 coder gets a little shorter by avoiding renorm totally, but instead JPEG 2000 will have a lot more bits to encode. These values are pessimistic but show that to run H.264 HDTV almost 300 MIPS are needed for the encoding stage alone, which, considering there are other parts in dire need of MIPS, is a lot.

           QCIF  CIF  4CIF  HDTV
Software   14    29   87    290

Table 6.5. Approximate MIPS cost of software implementation at various resolutions.
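As a rough cross-check of the HDTV entry (this author's estimate, not a figure taken from the appendix): using the approximate rate of 10^7 bin encodings per second from section 6.4.1 and the 29-cycle total for H.264 in table 6.4, 10^7 bins/s * 29 cycles/bin ≈ 2.9 * 10^8 cycles/s, i.e. about 290 MIPS, which is consistent with the "almost 300 MIPS" statement above.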

6.7 Similarities and conclusions on what to optimise

From the analysis it seems that both the renorm and the encoding step need to be optimised to reach the performance goal of 50 MIPS for the arithmetic encoding step in H.264. As the CABAC is a major part of JPEG 2000 in this implementation, and probably would be in other implementations, there are good incentives to optimise both renorm and the encode step in JPEG 2000 too, but the cost and advantages of different hardware accelerations will be discussed further in chapters 7 and 8.


Chapter 7

Instruction level acceleration

This chapter describes ideas for hardware acceleration of the CABACs using instruction level acceleration, and discusses what is and what is not a good idea to implement. There is also some benchmarking done here to see if values closer to the requirements set up in the last chapter are produced.

7.1 Introduction

Hardware acceleration can be done in a few different ways, two of which will be discussed in this and the next chapter.

Instruction level acceleration is about finding ways to add instructions and architectural features that are beneficial for the task to be optimised, and hopefully also beneficial to other tasks the processor might be doing. This kind of acceleration is by definition tied very closely to the processor core, but as the optimisations hopefully are useful for a larger set of tasks, it is considered less intrusive than building dedicated hardware designed especially for a specific task.

Dedicated hardware, commonly referred to as accelerators and discussed in the next chapter, is usually detached from the processor core but still tied closely to it through memory or buses for low latency communication. It is common that accelerators do a large portion of the task to be optimised and therefore become very specialised and not very useful for other tasks the processor might deal with.

Hardware acceleration when there is access to the blueprints of the processor is generally not that hard. With an infinite amount of power and die space it is possible to add an extremely large accelerator for almost any task. The problem is that processor size and other costly factors should be kept small, and by adding all too specific accelerators the implementation turns closer to an ASIC. The result of this is that all advantages of a general purpose CPU, DSP or for that matter ASIP are sacrificed. These factors should be considered when trying to achieve the best possible solution in this case. To do real time encoding of a complex CODEC like H.264 or Motion JPEG 2000 there is likely a need for either a pretty fast and big architecture designed for processing data rapidly, or a rather specialised architecture, which obviously makes the implementation non-trivial when considering cost and size.

7.2 ISA implementation

The common way to accelerate multiple CODECs that are not exactly the same but very similar would be to use a specialised instruction set. However, a quick look at the two different CABACs suggests that this is far from trivial, as discussed in 5.5. The problem is that both coders are very branch heavy, but with different operations in the branches, thus being quite far from exactly the same.

One set of improvements are those suggested by [2], which would use features like delayed execution (already a common feature of RISC architectures), address register arithmetic (also rather common) and, perhaps most interesting, if-then-else decisions. All but the last feature are implemented in OR1200, which means applying this instruction set would only achieve small gains compared to the suggestions already made in this document. The if-then-else construct is however interesting to discuss as it could be beneficial. To function properly the different branches should be properly balanced, or else the branch that does less will suffer a penalty from executing instructions it would not normally have to execute. The benefit is of course that there is no pipeline break and no extra jump instruction needed, and that is clearly beneficial in a few cases with balanced branches.
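To illustrate what a balanced if-then-else construct buys, the sketch below writes the mps/lps decision of the H.264 bin encoder with conditional selects, which is the kind of code a predicated or select-style instruction could execute without a pipeline break. This is a rough sketch and not the JM source: the table and variable names are invented here, and the renormalisation bit output and carry handling are omitted.

/* Rough sketch: H.264-style bin encoding with the mps/lps choice expressed
   as conditional selects instead of branches. Identifiers are placeholders
   and the renorm bit output / carry propagation is left out. */
extern const unsigned char range_tab_lps[64][4];   /* rLPS lookup table    */
extern const unsigned char trans_idx_mps[64];      /* next state after mps */
extern const unsigned char trans_idx_lps[64];      /* next state after lps */

void encode_bin_select(unsigned *range, unsigned *low,
                       unsigned char *state, unsigned char *val_mps,
                       unsigned bin)
{
    unsigned q    = (*range >> 6) & 3;
    unsigned rlps = range_tab_lps[*state][q];
    unsigned rmps = *range - rlps;
    int is_lps    = (bin != *val_mps);     /* predicate: 0 = mps, 1 = lps */

    /* Both outcomes are computed, then the predicate selects one of them,
       which is what balanced if-then-else branches amount to. */
    *low     += is_lps ? rmps : 0;
    *range    = is_lps ? rlps : rmps;
    *val_mps ^= (unsigned char)(is_lps && *state == 0); /* flip MPS at state 0 */
    *state    = is_lps ? trans_idx_lps[*state] : trans_idx_mps[*state];

    while (*range < 0x100) {               /* renorm; PutBit etc. omitted */
        *range <<= 1;
        *low   <<= 1;
    }
}

The cost, as noted above, is that the mps path now pays for the lps work as well, so this only pays off when the two paths are reasonably balanced.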

7.3 Better branch control

With similarities to the goals of the if-then-else construct, an instruction that is a bit more complex in its construction but leaves smaller and simpler branches during execution could be considered. As suggested above, the inner coding loop contains a huge number of branches for a small amount of arithmetic work. If a multibranch instruction was used that compares a few preloaded constants, in flexible combinations, with at most two other registers (which is the maximum most bus structures would allow), the lps/mps branch could be executed together with a little rewriting. The first branches in both lps and mps are then done in one cycle, and with delayed branching there is time to do range − Qe, which is done in both branches. Similarly the same function could deal with the H.264 renorm process, and as there would be a constant left over it could possibly accelerate the encoding process too. The problem with the instruction is that it would need some preloading/configuring and it would not be fully orthogonal to most instruction sets. But with quite simple hardware, 3 comparators, a few muxes and short registers, it would be quite useful.

Another unorthodox approach is to make it a sort of branching machine that works with the program counter and at a few certain program counter values unconditionally jumps to a common location. This way the jump instruction that is needed on most branches could be skipped, and the branching would in principle take 0 cycles if the first branch could be done in the background too. The limitation would be that the registers would probably have to be hardwired to do the comparisons for the branching, and special registers would be needed for the locations of the automatic jumps, as the bus in most architectures would be busy with another instruction (or indeed several instructions in a more complex architecture).

7.4 General idea

The instruction could really be seen as an architectural feature to accelerate control flow. It will shave off the overhead that is associated with branch heavy tasks like the CABACs in H.264 and JPEG 2000. This would be done by calculating multiple branching steps at once. A reasonable limit would be to do three comparisons, thereby going to the correct branch in a binary tree, as depicted in figure 7.1.

[Figure 7.1. Structure of comparisons: CMP 1 at the root of a binary tree, with CMP 2 and CMP 3 below it selecting between Branch 1, Branch 2, Branch 3 and Branch 4.]

There are several options for how to implement this. One solution is to do it as an instruction that points to two registers and then has two more registers hardwired to the function. A special purpose register would then point out which registers to compare with the newly set values. With three compares it would then be possible to jump to the four different relative locations depicted in figure 7.1; the locations could be relative and stored in an additional set of short registers, because the program will not branch far away in memory most of the time. These four branches will end after a few cycles and then usually reunite in one of the branches. This predictability can be used to advantage by letting the program counter be adjusted after X clock cycles (where X is different for each branch), with the exception of when some branch jumps somewhere else; that way there would also be some extra flexibility. This could be implemented either with counters and something watching for jumps, or, if even more large registers are acceptable, by jumping when the PC reaches one of four values. Another method would be to use the logic used at the jump and only have one register to look for, which has its address selected when doing the branch.
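A small behavioural model can pin down the intended semantics of the proposal. The sketch below is purely illustrative: the register names, the number and width of the fields and the choice of comparison operator are all assumptions made here, not part of any existing instruction set.

/* Hypothetical behavioural model of the multibranch feature: three
   comparisons form the two-level tree of figure 7.1 and select one of
   four short relative branch targets. Everything here is illustrative. */
typedef struct {
    unsigned cmp_val[3];   /* preloaded comparison constants (CMP 1..3) */
    int      target[4];    /* four short relative offsets (Branch 1..4) */
} MultiBranchCfg;

/* Compute the new program counter from the two operand registers. */
int multi_branch(int pc, const MultiBranchCfg *cfg,
                 unsigned op_a, unsigned op_b)
{
    int left = (op_a < cfg->cmp_val[0]);          /* CMP 1 at the root  */
    int leaf = left ? (op_b < cfg->cmp_val[1])    /* CMP 2              */
                    : (op_b < cfg->cmp_val[2]);   /* CMP 3              */
    int idx  = (left ? 0 : 2) + (leaf ? 0 : 1);   /* select Branch 1..4 */
    return pc + cfg->target[idx];
}

The rejoining described above would then be a matter of the hardware restoring the program counter to a common point once the selected branch has run its fixed number of instructions, unless the branch itself jumps elsewhere.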

The approach of watching the PC and doing an automatic branch at a certain point could also be used to initiate the function if there is no room to spend a single instruction on the jump. This approach would require 4 hardwired registers to

