
Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

Image coding with H.264 I-frames

Master's thesis in Image Coding, performed at Linköping Institute of Technology

by Anders Eklund
LiTH-ISY-EX--07/3902--SE

Linköping 2007

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden


Image coding with H.264 I-frames

Master's thesis in Image Coding,

performed at Linköping Institute of Technology

by

Anders Eklund
LiTH-ISY-EX--07/3902--SE

Supervisor: Harald Nautsch

ISY, Linköpings universitet

Examiner: Robert Forchheimer

ISY, Linköpings universitet


Division, Department: Image Coding Group, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2007-03-26

Language: English
Report category: Master's thesis (Examensarbete)

URL for electronic version: http://www.icg.isy.liu.se, http://www.ep.liu.se
ISRN: LiTH-ISY-EX--07/3902--SE

Title: Stillbildskodning med H.264 I-frames / Image coding with H.264 I-frames
Author: Anders Eklund

Keywords: Image coding, H.264, I-frames, MPEG-4 Part 10, JPEG, JPEG2000, Data compression


Abstract

In this thesis work a part of the video coding standard H.264 has been implemented. The part of the video coder that is used to code the I-frames has been implemented to see how well suited it is for regular image coding.

The big difference compared to other image coding standards, such as JPEG and JPEG2000, is that this video coder uses both a predictor and a transform to compress the I-frames, while JPEG and JPEG2000 only use a transform. Since the prediction error is sent instead of the actual pixel values, a lot of the values are zero or close to zero before the transformation and quantization. The method is much like a video encoder, but the difference is that blocks of an image are predicted instead of frames in a video sequence.



Contents

1 Introduction
1.1 Problem definition
1.2 Methodology
1.3 Limitations
1.4 Structure of the thesis

2 Introduction to data compression
2.1 Introduction
2.2 Lossless compression
2.2.1 Data rate
2.2.2 Huffman coding
2.2.3 Arithmetic coding
2.2.4 Prediction
2.3 Lossy compression
2.3.1 Distortion
2.3.2 Quantization
2.3.3 Transforms
2.4 Image coding
2.4.1 Colour space
2.4.2 JPEG
2.4.3 JPEG2000
2.5 Video coding

3 MPEG-4 H.264
3.1 Background
3.2 Introduction
3.3 Motion estimation
3.4 Intra prediction
3.5 Transforms
3.6 Quantization
3.7 Entropy coding
3.7.1 CABAC
3.7.2 Macroblock information
3.7.3 Binarization
3.7.4 Context modeling
3.7.5 Transform components
3.8 Deblocking filter

4 Implementation
4.1 Development of the encoder and decoder
4.2 Arithmetic coding
4.3 Rate distortion optimization
4.4 User interface
4.5 Improvement of the encoder

5 Results and conclusions
5.1 Testing procedure
5.2 Results
5.2.1 Block division
5.2.2 Deblocking filter
5.2.3 Lossy compression
5.2.4 Lossless compression
5.3 Conclusions

6 Future work
6.1 Improvements

Bibliography

7 Appendix A
8 Appendix B


List of Figures

2.1 An encoder and a decoder, a codec.
2.2 The different distributions.
2.3 Arithmetic coding.
2.4 Transformation of samples.
2.5 RGB image.
2.6 YCbCr image.
2.7 The zig-zag scan order for the 8 × 8 DCT blocks.
2.8 Division of an image into different sub bands.
2.9 The flowchart of a video encoder.

3.1 Block division.
3.2 Transformation blocks.
3.3 Prediction of 4 × 4 luminance blocks.
3.4 Determination of most probable mode.
3.5 Flowchart of the CABAC coding engine.

5.1 Result of block division.
5.2 Result of deblocking filter.
5.3 Rate distortion diagram, luminance of test image 1.
5.4 Rate distortion diagram, luminance of test image 2.
5.5 Rate distortion diagram, luminance of test image 3.
5.6 Rate distortion diagram, luminance of test image 4.
5.7 Rate distortion diagram, luminance of test image 5.
5.8 Rate distortion diagram, luminance of test image 6.
5.9 Rate distortion diagram, mean values for the results of the six luminance images.
5.10 Rate distortion diagram, test image 1.
5.11 Rate distortion diagram, test image 2.
5.12 Rate distortion diagram, test image 3.
5.13 Rate distortion diagram, test image 4.
5.14 Rate distortion diagram, test image 5.
5.15 Rate distortion diagram, test image 6.
5.16 Rate distortion diagram, mean values for the results of the six test images.

7.1 Test image 1.
7.2 Test image 2.
7.3 Test image 3.
7.4 Test image 4.
7.5 Test image 5.
7.6 Test image 6.


Chapter 1

Introduction

The information society of today involves more and more data. As digital cameras get more megapixels, it becomes even more important to improve the image coding methods to be able to store the data in an efficient way. The dimensions of the signals are also increasing, and 4D signals are becoming more common.

In the area of image coding, not much has happened since the standards JPEG (1992) and JPEG2000 (2000). Despite the fact that JPEG2000 is a better image coder than JPEG, it has not had a great impact yet. One explanation may be that the difference between JPEG and JPEG2000 is too small and that JPEG2000 is harder to implement in hardware due to its wavelet transform. Changing the method for image compression can also involve paying a lot of money for new patents, which the customers in the end do not want to pay for if the difference is not big enough.

1.1 Problem definition

The purpose of this master's thesis has been to implement and evaluate the part of the video coding standard H.264 that is used to code the I-frames. The purpose of the I-frames, intra frames, is to provide random access capability: to be able to start watching a video sequence at an arbitrary point or to fast forward in a simple way. The I-frames in a video stream are coded independently of the previous and following frames and can therefore also be used to code regular images. The implementation was compared to JPEG and JPEG2000 to see how the I-frame coder stands against the image coding standards of today.

The master's thesis also involved making a rate distortion optimization to find the best parameters for the encoding. The rate distortion optimization in video coding may be subject to real-time demands, which is not an issue in image coding. Even if real-time encoding is not demanded, a full rate distortion optimization for a video stream, meaning testing all combinations of parameters, could take a very long time.

At the beginning of the thesis work I stated four questions that will be answered in this thesis.


• How good is the H.264 I-frame coder for image coding versus JPEG and JPEG2000?

• Can the coding method be improved in some way, and in that case, how much?

• Can the coding method have any influence on future coding methods, such as coding of 3D and 4D signals?

• Is the coding method suitable for implementation in hardware, such as digital cameras?

1.2 Methodology

The thesis work was divided into smaller parts. First, a prestudy of image coding, video coding and especially H.264 was made. The main documents for this thesis have been the final draft of the H.264 standard [12], the H.264 tutorial white papers [3], H.264 and MPEG-4 Video Compression [2] and an overview of the H.264/AVC video coding standard [11].

After the prestudy, the predictions and transforms of different sizes were implemented and tested, followed by implementation and integration of the entropy coder and a rate distortion optimizer. Then the implementation was tested with different images and parameters and compared to the existing image coding standards.

I started the implementation in Matlab, but since it became very slow I decided to use mex-files to speed it up. After having some difficulties with mex-files I finally decided to use C++ and the CImg Library [1] for reading, displaying and saving images.

1.3 Limitations

The focus in this master's thesis has been on the performance of the image coder. H.264 is already adapted for hardware implementations since it uses integer transforms instead of the DCT and, as far as possible, shifts, additions, subtractions and lookup tables instead of multiplications.

It has not been a part of this thesis work to make as efficient and fast an application as possible, even though some effort has been put into speeding up the encoder and decoder.


1.4 Structure of the thesis

This thesis is organized as follows: chapter 2 gives a brief introduction to the area of data compression. Chapter 3 contains a description of H.264, focused on the coding of I-frames, and chapter 4 describes my implementation of the I-frame coder. Chapter 5 contains results and conclusions, and chapter 6 is about future work and improvement of the codec.


Chapter 2

Introduction to data compression

The amount of information around us increases every day. We listen to music on our mp3 players, download information from the internet and take pictures with our digital cameras. An uncompressed video signal of TV quality requires about 27 megabytes per second, which would fit about 3 minutes on a DVD that can hold 4.5 gigabytes. It is clear that data compression is becoming more and more important.

In this chapter I will try to explain the fundamentals of data compression, especially focused on image and video compression. The main source for this chapter has been [9].

2.1 Introduction

The encoding of a signal can be seen as a conversion of a signal X to another signal Y. The decoding of a signal is the reverse process that converts Y to X′. If X′ = X, the signal is reconstructed without any error.

The encoded signal can then be transferred over a channel and then decoded. The channel can be any medium that can hold information, for example a DVD or a wireless network.

A codec consists of two parts, an encoder and a decoder. The word codec comes from enCOder/DECoder.

Figure 2.1. An encoder and a decoder, a codec.

Data compression can be divided into two parts, lossless compression and lossy compression. With lossless compression, the data can be reconstructed without any error. For text compression, lossless compression is the only option since a small error can make a big difference.


In lossy compression, the data rate can be much lower than with lossless compression, but with the disadvantage that the reconstructed signal will not be the same as the original.

2.2 Lossless compression

For lossless compression, there is only one measure of importance, if we neglect such things as encoding time and memory consumption, and that is the data rate of the compressed version of the signal, or rather, the ratio between the data rate of the original signal and the data rate of the compressed version. The data rate is the number of bits required to code, for instance, a character in a text, a second of music or a pixel in an image.

When dealing with data compression, the term source is commonly used. A source is something that produces a sequence of symbols from a discrete alphabet. An example can be a book where the symbols can be any of the letters in the alphabet.

Encoding of a symbol means that we assign a binary representation for the symbol, called a codeword. There are four ways of encoding a symbol.

• A fixed number of symbols for each codeword and a fixed number of bits for each codeword.

An example is text coding where each codeword consists of 8 bits and each codeword represents one character.

• A fixed number of symbols for each codeword and a varying number of bits for each codeword.

This is the case in Huffman coding, which will be explained later, where each codeword can consist of a varying number of bits but each codeword only represents one symbol.

• A varying number of symbols for each codeword and a fixed number of bits for each codeword.

This method is used in Tunstall coding, which can be called the inverse of Huffman coding.

• A varying number of symbols for each codeword and a varying number of bits for each codeword.


2.2.1 Data rate

The entropy H of a source is the theoretically lowest average data rate for the source. If the source does not have any memory, meaning that each new symbol from the source is independent of the previous ones, the entropy is simply the data rate for each symbol A_i weighted with its own probability P(A_i). If the information is coded in bits, −log2 P(A_i) is used to calculate the number of bits needed to code a symbol A_i with the probability P(A_i). Note that the entropy does not tell us anything about the data rate at a given time or for a given symbol; the entropy is only the average data rate for the source.

H = −Σ_i P(A_i) log2 P(A_i)

This is called the memory-free entropy, or the entropy of the first order. There are other measures of entropy, for example the entropy of conditional probability. Conditional probability is best explained with an example: given that the current character in an ordinary text is a period, the conditional probability that the next character is a space is very high.
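To make the formula concrete, here is a minimal C++ sketch that evaluates the first-order entropy for a source with the probabilities 0.5, 0.2, 0.2, 0.1 (the same source as in the Huffman coding example in section 2.2.2); the result, about 1.76 bits per symbol, can be compared with the Huffman rate of 1.8 bits per symbol obtained there.

```cpp
#include <cmath>
#include <cstdio>

// First-order entropy of a memoryless source:
// H = -sum_i P(A_i) * log2(P(A_i)).
int main() {
    const double P[] = {0.5, 0.2, 0.2, 0.1};  // example probabilities
    double H = 0.0;
    for (double p : P)
        if (p > 0.0)
            H -= p * std::log2(p);
    std::printf("H = %.4f bits per symbol\n", H);  // about 1.76
}
```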

From the definition of entropy we can learn that the distribution of the source is very important. The worst case is a source with a uniform distribution; then no assumptions about the next value can be made. The best case is a source where the probability is equal to one for one symbol and zero for all the other symbols; then all the values will be the same.

The Gaussian distribution is a big step from a uniform distribution toward the theoretically ideal distribution, since the data is more concentrated around one value. The Laplacian distribution is even more concentrated and thereby more like the ideal distribution.

Figure 2.2. The different distributions. The Laplacian and Gaussian curves are approximations of the continuous distributions.


So, in order to achieve good compression ratios, a good start is to go from a uniform distribution towards a Laplacian distribution. This can be done by representing the data in a different way. In an image there is high correlation between neighbouring pixels. Instead of coding the actual pixel values themselves, the difference between two pixels can be coded. One can say that we guess, or predict, that the next pixel in the image has the same value as the current pixel. Since the neighbouring pixels are highly correlated, the error between our guess and the real value, the prediction error, will be rather small and have a distribution that is much like a Gaussian or a Laplacian distribution. This is an example of how we can encode a given signal more efficiently by just representing the data in a different way. Note that no distortion is introduced by encoding the prediction error instead of the original values.

In lossless compression there are a number of techniques used. The main principle is to change the distribution of the source, by representing the data in a different way as mentioned above, and then take advantage of the new distribution.
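As a small illustration of the prediction idea, the sketch below predicts each pixel with its left neighbour so that only the differences need to be coded; the pixel values are made up for the example.

```cpp
#include <cstdio>

// Predict each pixel with its left neighbour and print the prediction
// errors. For correlated data the errors cluster around zero, giving a
// distribution that is easier to compress. The pixel values are made up.
int main() {
    const int pixels[] = {100, 102, 101, 104, 108, 107, 107, 110};
    int previous = 0;  // assume the first pixel is predicted by 0
    for (int p : pixels) {
        std::printf("%d ", p - previous);  // the prediction error to encode
        previous = p;
    }
    std::printf("\n");  // prints: 100 2 -1 3 4 -1 0 3
}
```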

2.2.2 Huffman coding

The best known technique for lossless compression is called Huffman coding. The idea is to take advantage of the differences between the probabilities of the symbols in the source by assigning long codewords to symbols with low probabilities and short codewords to symbols with high probabilities. By doing this, the average data rate will be lower than if all the symbols have codewords of the same length. Even better compression can be achieved by making a new, larger alphabet where each symbol consists of pairs or triplets of symbols from the smaller alphabet. A similar idea was used by Morse to design the Morse code: letters that occur often have short codewords while letters that occur more rarely have longer codewords.

Example

Assume that we have a source with the symbols a, b, c, d and that the symbols have the probabilities P(a) = 0.5, P(b) = 0.2, P(c) = 0.2, P(d) = 0.1. Without Huffman coding, each symbol would get a fixed length codeword of 2 bits and the rate would be 2 bits per symbol. Using the algorithm described in Sayood [9], we come up with the codewords below instead. The new rate becomes 1 · 0.5 + 2 · 0.2 + 3 · 0.2 + 3 · 0.1 = 1.8 bits per symbol.

Symbol  Probability  Fixed length codeword  Huffman codeword
a       0.5          00                     0
b       0.2          01                     10
c       0.2          10                     110
d       0.1          11                     111
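As a cross-check of the table, the following sketch builds Huffman code lengths by repeatedly merging the two least probable subtrees; the tie-breaking (and hence the exact codewords) may differ from the algorithm in Sayood [9], but the lengths are optimal either way.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

// One tree node: its total probability and the symbols underneath it.
struct Node { double p; std::vector<int> symbols; };
struct ByProb { bool operator()(const Node& a, const Node& b) const { return a.p > b.p; } };

int main() {
    const double P[] = {0.5, 0.2, 0.2, 0.1};  // symbols a, b, c, d
    int length[4] = {0, 0, 0, 0};

    std::priority_queue<Node, std::vector<Node>, ByProb> queue;
    for (int s = 0; s < 4; ++s) queue.push({P[s], {s}});

    while (queue.size() > 1) {
        Node a = queue.top(); queue.pop();
        Node b = queue.top(); queue.pop();
        // Every symbol under the merged node gets a one bit longer codeword.
        for (int s : a.symbols) ++length[s];
        for (int s : b.symbols) ++length[s];
        a.p += b.p;
        a.symbols.insert(a.symbols.end(), b.symbols.begin(), b.symbols.end());
        queue.push(a);
    }
    for (int s = 0; s < 4; ++s)
        std::printf("symbol %c: %d bits\n", char('a' + s), length[s]);  // 1, 2, 3, 3
}
```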


2.2.3 Arithmetic coding

The problem with Huffman coding is that the lengths of the codewords have to be integers, and thereby it cannot completely take advantage of sources that have a distribution much like a Laplacian distribution, for instance sources with symbols with a probability higher than 0.5. In order to come close to the entropy of the source, each codeword has to represent a large number of symbols, resulting in a very large number of codewords. A better approach is to use arithmetic coding, whose complexity is independent of the number of symbols encoded each time.

Arithmetic coding is based on the idea of dividing the probability interval between 0 and 1 into one interval for each symbol of the source. The length of each interval shall be equal to the probability of the symbol. The interval that corresponds to the current symbol being encoded is chosen, and the method then continues by dividing the current interval.

When encoding a symbol x_n, the lower limit l_n and the upper limit u_n are updated. F(x_n) is the value of the distribution function for the symbol x_n.

l_n = l_{n−1} + (u_{n−1} − l_{n−1}) · F(x_n − 1)
u_n = l_{n−1} + (u_{n−1} − l_{n−1}) · F(x_n)

The limits are initialized with l_0 = 0 and u_0 = 1.

The size of the final interval is equal to the probability P(x̄) of the sequence. The codeword for the sequence of symbols is the shortest possible binary representation of a number within the interval. The number of bits needed for the codeword is given by ⌈−log2(P(x̄))⌉. Since all other numbers that start with the same bits as the codeword must also be within the interval, we may need one extra bit to be sure.

Example

Assume that we have a source with the symbols a, b, c and that the symbols have the probabilities P(a) = 0.6, P(b) = 0.3, P(c) = 0.1. The distribution function becomes F(0) = 0, F(a) = 0.6, F(b) = 0.9, F(c) = 1.

We want to encode the sequence a, c, b, a and start our encoding by initializing the lower and the upper limit.

l_0 = 0
u_0 = 1

Encode a:
l_1 = 0 + (1 − 0) · 0 = 0
u_1 = 0 + (1 − 0) · 0.6 = 0.6

Encode c:
l_2 = 0 + (0.6 − 0) · 0.9 = 0.54
u_2 = 0 + (0.6 − 0) · 1 = 0.6

Encode b:
l_3 = 0.54 + (0.6 − 0.54) · 0.6 = 0.576
u_3 = 0.54 + (0.6 − 0.54) · 0.9 = 0.594

Encode a:
l_4 = 0.576 + (0.594 − 0.576) · 0 = 0.576
u_4 = 0.576 + (0.594 − 0.576) · 0.6 = 0.5868

The interval for the sequence becomes [0.576, 0.5868). The size of the interval is the same as the probability of the sequence a, c, b, a, which is P(a, c, b, a) = 0.6 · 0.1 · 0.3 · 0.6 = 0.0108, since the probabilities are independent.

The size of the codeword has to be at least ⌈−log2(0.0108)⌉ = 7 bits. The number 0.578125 is within the interval, and its binary representation is (0.1001010)_2. The largest possible number that starts with 0.1001010 is 0.1001010111111... = 0.1001011 = 0.5859375, which is smaller than the upper limit of the interval, 0.5868. Thereby 7 bits is enough and the codeword for the sequence is 1001010.

The decoder can determine the value of the symbols by simply checking which symbol belongs to the current interval and then updating the intervals in the same way as the encoder.
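The interval update is easy to mirror in code. This sketch reproduces the worked example using double precision, which works for a four-symbol toy example but is not how a practical coder is built (see the remark about finite precision below).

```cpp
#include <cstdio>

// Encode the sequence a, c, b, a with P(a) = 0.6, P(b) = 0.3, P(c) = 0.1.
// F[k] is the distribution function, so symbol k owns [F[k], F[k+1]).
int main() {
    const double F[] = {0.0, 0.6, 0.9, 1.0};
    const int sequence[] = {0, 2, 1, 0};  // a, c, b, a

    double low = 0.0, high = 1.0;
    for (int s : sequence) {
        double width = high - low;
        high = low + width * F[s + 1];  // upper limit uses F(x_n)
        low  = low + width * F[s];      // lower limit uses F(x_n - 1)
        std::printf("[%g, %g)\n", low, high);
    }
    // The final interval is [0.576, 0.5868), as in the example above.
}
```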

When arithmetic coding is implemented in computers, we do not have infinite precision and we do not want to wait for the whole sequence of symbols to be encoded before we can start sending bits. Whenever we are certain of a bit, we shift out the bit and rescale the interval to make the most use of our precision.

Theoretically, arithmetic coding is not as good as Huffman coding, but in practice it is much easier to come close to the lowest average data rate of the source, the entropy. Arithmetic coding works best with conditional probabilities, but then the encoder and decoder become more complex, since all the different conditional probabilities have to be calculated, updated and stored in memory. Arithmetic coding gives a much better compression ratio than Huffman coding, but at the price of more complex encoding and decoding.

2.2.4 Prediction

As described before, encoding prediction errors instead of the actual values can be a good way of changing the distribution of the source. Since prediction does not introduce any error, prediction can be used in both lossless and lossy compression. If prediction is used in lossy compression, the prediction has to be based on the reconstructed values instead of the original values, since the decoder does not have access to the original values, only the reconstructed ones. In order to accomplish this, the decoder must be integrated into the encoder.

2.3 Lossy compression

For lossy compression, there is another measure that is important besides the data rate: the distortion of the signal. The distortion tells us how much error the reconstructed version of the signal contains, compared to the original signal.

2.3.1 Distortion

The distortion is normally given as a signal to noise ratio, SNR, or peak signal to noise ratio, PSNR. The signal to noise ratio is a measure that tells us the ratio between the energy in the signal and the energy in the error, the noise.

In order to calculate the distortion, a measure of the error between the reconstructed signal and the original signal must be defined. Normally the mean of the squared error is used.


Let x_n denote the original sequence of samples and y_n the reconstructed sequence of samples. The squared error e_squared(x_n, y_n) and the absolute error e_abs(x_n, y_n) are defined as follows.

e_squared(x_n, y_n) = (x_n − y_n)²
e_abs(x_n, y_n) = |x_n − y_n|

The most common error measure is the mean squared error, σ_d², which represents the energy of the error.

σ_d² = (1/N) Σ_{n=1}^{N} (x_n − y_n)² = E{(x_n − y_n)²}

The energy in the signal, σ_x², is here defined as the variance of the signal; x̄ denotes the mean of the signal x.

σ_x² = (1/N) Σ_{n=1}^{N} (x_n − x̄)² = E{(x_n − x̄)²}

The signal to noise ratio and peak signal to noise ratio, in dB, are defined as follows; x_peak denotes the highest possible value of the signal x.

SNR = 10 log10(σ_x² / σ_d²)
PSNR = 10 log10(x_peak² / σ_d²)
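These definitions translate directly into code; a short sketch, assuming 8-bit samples so that x_peak = 255:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// PSNR in dB for 8-bit samples (x_peak = 255).
double psnr(const std::vector<int>& original, const std::vector<int>& reconstructed) {
    double mse = 0.0;  // mean squared error, the energy of the error
    for (std::size_t n = 0; n < original.size(); ++n) {
        double e = original[n] - reconstructed[n];
        mse += e * e;
    }
    mse /= original.size();
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

int main() {
    std::vector<int> x = {100, 120, 140, 160};  // made-up original
    std::vector<int> y = {101, 118, 141, 159};  // made-up reconstruction
    std::printf("PSNR = %.2f dB\n", psnr(x, y));
}
```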

2.3.2 Quantization

Nothing of what we have discussed so far can introduce distortion to the signal. One way to compress a signal further than with lossless methods is to use quantization. Quantization can be seen as a mapping from a larger alphabet to a smaller alphabet. Quantization can be made on an analog signal to get a digital signal, or on a digital signal to get another digital signal with a smaller alphabet.

For instance, if the pixels in an image are encoded with 8 bits each, a simple form of compression would be to quantize the pixels to 4 bits each instead of 8 bits, by dividing each value by 16. Then each pixel can only have the values 0-15 instead of 0-255. The disadvantage is that when we try to go back to 8 bits per pixel again, by multiplying the values by 16, we get a different result than the original values. Quantization is thus an irreversible process.

Example

If we quantize the value 190, represented by 8 bits, to 4 bits instead we get the value 12, if we round up. When we go back, we get 12 · 16 = 192.
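Written out in code, with a step size of 16 and rounding to the nearest level, the example looks as follows:

```cpp
#include <cstdio>

// Uniform quantization of an 8-bit value to 4 bits with step size 16,
// rounding to the nearest level, and the corresponding reconstruction.
int main() {
    int x = 190;
    int q = (x + 8) / 16;        // quantizer index: 190 -> 12
    int reconstructed = q * 16;  // 12 * 16 = 192, not the original 190
    std::printf("%d -> %d -> %d\n", x, q, reconstructed);
}
```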

The most common quantization is uniform quantization, where each level is of the same size, as in our example where we simply divided the value by 16.

2.3.3 Transforms

The problem with most signals is that there is often some kind of correlation between the sample values. If we want to quantize a pair of samples and there is high correlation between the samples, both quantizers have to be able to cover large variations of the signal.

To cover all the pairs of samples, a rather large quantization area must be used, which means that areas with no samples are also covered. This means that we are wasting bits on the areas without any samples. To get rid of the unnecessary bits, it is possible to make the quantization area smaller by using coarser quantization, but the problem is then loss of information instead, which means high distortion.

Instead of minimizing the area directly, we can first make a change of basis to one that better suits the samples, and then minimize the area again. We then see that we can minimize the area without losing as much information as before. If we manage to decorrelate the samples, only one of the quantizers has to be able to cover large variations, resulting in a smaller quantization area.

Figure 2.4. The figures show that high correlation leads to bits being wasted on areas without samples, marked with green. If a transform is used, only one of the quantizers has to be able to cover large variations and the area without samples is much smaller.

This is the basic idea of transformations. The transformation itself does not provide any compression or distortion of the signal, but the new representation is easier to quantize without introducing too much distortion.

If X is the original signal and Y is the transformed version of the signal, the relationships between the signals can be written as matrix products, where A is the transform matrix and A⁻¹ is the inverse of the transform matrix.

Y = AX
X = A⁻¹Y


In the case of a two-dimensional signal, such as an image, the relationships can also be written as matrix products, where Aᵀ is the transpose of the matrix A.

Y = AXAᵀ
X = AᵀYA
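As a sketch of the two-dimensional case, the function below computes Y = AXAᵀ for a 4 × 4 block with two plain matrix multiplications, so it can be used with any of the 4 × 4 transform matrices given below.

```cpp
// Apply a 4x4 transform to a 4x4 block as Y = A * X * A^T.
void transform4x4(const double A[4][4], const double X[4][4], double Y[4][4]) {
    double AX[4][4] = {};                // AX = A * X
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                AX[i][j] += A[i][k] * X[k][j];

    for (int i = 0; i < 4; ++i)          // Y = AX * A^T
        for (int j = 0; j < 4; ++j) {
            Y[i][j] = 0.0;
            for (int k = 0; k < 4; ++k)
                Y[i][j] += AX[i][k] * A[j][k];  // A^T(k, j) = A(j, k)
        }
}
```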

There are a number of different transforms that can be used. The ideal transform should decorrelate the samples as much as possible, concentrate the energy in the signal to a small number of components and be easy to compute. Since we only want to perform a change of basis, the basis functions of the transform should be orthogonal and normalized.

The discrete Walsh-Hadamard transform, DWHT, is the easiest transform to compute, but it gives poor decorrelation of the samples.

DWHT_4×4 = (1/2) ·
[  1   1   1   1
   1   1  −1  −1
   1  −1  −1   1
   1  −1   1  −1 ]

The Karhunen-Loève transform, KLT, is the mathematically best transform, totally decorrelating the samples. But it is also the transform hardest to compute, since it is based on eigenvalues and eigenvectors of the autocorrelation matrix of the signal. The transform is thereby also signal dependent, which means that the basis functions have to be sent to the decoder.

The discrete cosine transform, DCT, decorrelates the samples well and is fairly easy to compute. The basis functions are much like the basis functions of the KLT, and the small extra decorrelation from the KLT is not worth the extra effort.

DCT_4×4 =
[  1/2               1/2               1/2                1/2
   √(1/2)·cos(π/8)   √(1/2)·cos(3π/8)  √(1/2)·cos(5π/8)   √(1/2)·cos(7π/8)
   √(1/2)·cos(2π/8)  √(1/2)·cos(6π/8)  √(1/2)·cos(10π/8)  √(1/2)·cos(14π/8)
   √(1/2)·cos(3π/8)  √(1/2)·cos(9π/8)  √(1/2)·cos(15π/8)  √(1/2)·cos(21π/8) ]

In order to choose the size of the transform, several things should be considered. The larger the transform, the better the energy concentration. But since the signal can vary quite a lot it is better with a smaller transform, which is also easier to compute.

2.4 Image coding

I will now give a brief introduction to the area of image coding. I will start with a description of different colour spaces and then give a short description of how JPEG and JPEG2000 work. JPEG stands for Joint Photographic Experts Group.


2.4.1 Colour space

Normally, the colours in an image are represented as combinations of three colours: red, green and blue, RGB. When colour TV was introduced, there was the problem of making it possible for those with a black and white TV to watch TV as well as those with a colour TV, using the same signal. Therefore a combination of the red, green and blue signals was made to represent the black and white signal, the luminance signal. Then two colour difference signals were created, the chrominance signals. By doing this, the ones with a black and white TV only use the luminance signal, while those with a colour TV use all three signals and convert back to RGB. The luminance signal is denoted by Y, chrominance blue by Cb and chrominance red by Cr.

Y = 0.299R + 0.587G + 0.114B
Cb = 0.564(B − Y)
Cr = 0.713(R − Y)

R = Y + 1.402Cr
G = Y − 0.344Cb − 0.714Cr
B = Y + 1.772Cb

It was later discovered that the luminance signal contains higher frequencies, like edges and details, than the chrominance signals. The chrominance signals can therefore be down sampled by a factor 2 in each direction, horizontally and vertically, without losing too much visual quality. This reduces the number of samples to half the original number, by simply representing the colour information in another way.

When the chrominance signals are down sampled by a factor 2 in both directions, the colour format is called 4:2:0.
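A sketch of the conversion for a single 8-bit RGB pixel using the equations above; the offset of 128 for the chrominance components and the clamping to [0, 255] are assumptions made here to keep the values in an unsigned 8-bit range, not something given in the text.

```cpp
#include <algorithm>

// Round and clamp to the 8-bit range [0, 255].
inline int clamp255(double v) { return std::min(255, std::max(0, int(v + 0.5))); }

// Convert one 8-bit RGB pixel to YCbCr using the equations above.
// The +128 offset for Cb and Cr is an assumption for unsigned storage.
void rgbToYCbCr(int R, int G, int B, int& Y, int& Cb, int& Cr) {
    double y = 0.299 * R + 0.587 * G + 0.114 * B;
    Y  = clamp255(y);
    Cb = clamp255(0.564 * (B - y) + 128.0);
    Cr = clamp255(0.713 * (R - y) + 128.0);
}
```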


Figure 2.5. The first test image in the RGB colour space. The images are the red, green and blue channels of the RGB image.

Figure 2.6. The first test image in the YCbCr colour space. The images are the luminance image, the down sampled chrominance blue image and the down sampled chrominance red image.

2.4.2 JPEG

The most known standard for compression of images is the JPEG standard. It is implemented in almost all digital cameras and in different computer programs such as web browsers and image editors. The standard was defined in 1992 and is still used today.

The three colour channels are encoded separately, as one luminance image and two chrominance images. The chrominance images are normally down sampled by a factor 2 in each direction.

The image is divided into blocks of 8 × 8 pixels and each block is then transformed with an 8 × 8 DCT. The DC component, representing the average pixel value of the block, is located at the top left corner of the transformed 8 × 8 block. The result is then quantized, using a table of quantization step sizes. The step sizes can be different for each transform component; normally rather small steps are used for the low frequency components and somewhat bigger steps for the high frequency components.

The transformed and quantized transform components are then scanned in a zig-zag order, since the probability of significant transform components, components that are not zero, decreases with growing distance from the DC component. This is at least the case for natural images, which have most of their energy in the lower frequencies.


Figure 2.7. The zig-zag scan order for the 8 × 8 DCT blocks.
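The scan order in figure 2.7 can be generated rather than stored; a sketch that walks the 15 anti-diagonals of an 8 × 8 block, alternating direction:

```cpp
#include <cstdio>

// Print the zig-zag scan order of an 8x8 block as linear indices
// (row * 8 + column), walking the anti-diagonals alternately.
int main() {
    const int N = 8;
    for (int d = 0; d < 2 * N - 1; ++d) {      // d = row + column
        bool up = (d % 2 == 0);                // even diagonals run upwards
        for (int i = 0; i <= d; ++i) {
            int row = up ? d - i : i;
            int col = d - row;
            if (row < N && col < N)
                std::printf("%d ", row * N + col);
        }
    }
    std::printf("\n");  // starts 0 1 8 16 9 2 3 10 ...
}
```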

The DC component for each block is encoded separately, as the difference between the DC component for the previous block and the current one. The AC components are encoded as pairs of symbols. The first symbol is the number of zeros in a row and the second symbol is the first significant component after the zeros. Probabilities for the different pairs are calculated and a Huffman codeword is assigned to each pair of symbols. A special symbol, end of block, is used to indicate that the rest of the transform components in the block are zero.

The original JPEG standard supports arithmetic coding of the transform components instead of Huffman coding. But since the arithmetic coding used in the standard is protected by different patents, Huffman coding is normally used.

2.4.3 JPEG2000

The major problem with JPEG is the annoying blocking artifacts that occur if the image is compressed too much. JPEG2000 uses a wavelet transform instead of a DCT. A wavelet transform can be described as a division of the signal into different frequency bands, or sub bands, using different filters. Normally only low pass and high pass filters are used. The different frequency signals can then be down sampled, since they contain less bandwidth than the original signal. A wavelet transform is much like a normal transform, such as the DCT, but since the image is not divided into blocks, there will not be any blocking artifacts.

After the wavelet transform, uniform quantization is used for the wavelet components and then entropy coding is applied.

The advantage of the wavelet transform is that it tells us where the different wavelet components are in the image. The Fourier transform of an image only tells us the total amount of each frequency, but does not include any spatial information.


Figure 2.8. Division of an image into different sub bands. L denotes low pass filtering and H denotes high pass filtering.

2.5 Video coding

In this section, the main principles of video coding will be described.

A video stream consists of a number of images, or frames, per second. The most common is 25 or 30 frames per second. The images can be sent as the whole image at once, progressive, or as half the image, a field, at double the frame rate, interlaced. If the images are sent in interlaced mode, first the even lines are sent and then the odd lines.

In a video sequence there is high correlation between adjacent frames, especially if there is little movement in the scene. By predicting a whole frame from the previous frame, only the prediction error has to be sent. The main principle of the well known MPEG standard is to use prediction between the frames and then encode the prediction error with a DCT and quantization, similar to JPEG. The frames encoded with prediction from previous frames are called P-frames.

If there is movement in the scene, it is harder to predict the next frame and the prediction error becomes larger. To prevent this, motion estimation is performed and the motion vectors are sent to the decoder. The motion estimation is block based and a small neighbourhood is searched to find the motion vectors with the best match.

Since only the prediction error is sent, it is not possible to start viewing a movie in the middle of the video stream, unless you want to watch frames of prediction errors. To get random access capability, frames that are encoded independently of the previous frames, called intra frames or I-frames, have to be sent. Since no prediction from the previous frame can be used for the I-frames, the compression ratio is substantially smaller for the I-frames. To compensate for this, a third type of frame was introduced, the B-frames. The B-frames use prediction from both previous and following frames and are thereby bidirectionally predicted.

The current state of the art in video coding is the H.264 standard, also called MPEG-4 AVC, which will be described in the next chapter.


Figure 2.9. The flowchart of a video encoder. Each frame is predicted and then transformed and quantized. Motion estimation is used to compensate for movement in the scene; the motion vectors are sent to the decoder. The frame is then rescaled and inverse transformed, so that the next frame can be predicted from the reconstructed frame.


Chapter 3

MPEG-4 H.264

In this chapter I will give a brief introduction to the video coding standard H.264, focused on the part about I-frame coding, since that is the part I have implemented in my thesis work.

3.1 Background

The video coding standard H.264 emerges from the previous MPEG standards. MPEG stands for Moving Picture Experts Group.

The first MPEG standard was MPEG-1, where B-frames were introduced. The next standard was MPEG-2, which is much like MPEG-1 but with the ability to handle higher resolutions and data rates. MPEG-2 is used for DVD and digital TV broadcasting. From the beginning, it was intended that there would be an MPEG-3 standard as well, but MPEG-2 already managed the demands that were stated for MPEG-3. Mp3, which is a popular method for compressing music, is not the MPEG-3 standard: mp3 stands for MPEG-1 audio layer 3 and is a method for compressing the audio in MPEG-1. After MPEG-2, the development of MPEG-4 started. MPEG-4 is more of a multimedia standard than a standard for video coding, but it also contains a video encoder. There are a number of different video encoders in MPEG-4; the most known is MPEG-4 Part 10, H.264/AVC. The difference from the previous MPEG standards is many small improvements that together make a rather big difference.

3.2 Introduction

The standard only defines the decoding of a video stream, not the encoding. An encoder is allowed to generate any bit stream that can be correctly decoded. The standard is not very readable, since it is very technical and compactly written. Therefore I complemented the reading of the standard with other documents, such as H.264 and MPEG-4 Video Compression [2], the H.264 tutorial white papers [3] and an overview of the H.264/AVC video coding standard [11].


Each frame is transformed from the RGB colour space to the YCbCr colour space, and the chrominance signals are then down sampled by a factor 2 in each direction; the colour format is hence 4:2:0.

The luminance image is divided into macroblocks of 16 × 16 pixels. Each macroblock can then be coded as a single block or be divided further into sixteen 4 × 4 blocks. If the 8 × 8 transform is used, each macroblock can be coded as four 8 × 8 blocks.

The chrominance images are divided into macroblocks of 8 × 8 pixels. Since the chrominance images are down sampled by a factor 2 in each direction, each 8 × 8 chrominance block corresponds to a 16 × 16 luminance block.

The current frame is predicted from previous frames if the current frame is a P-frame, and from both previous and following frames for B-frames, while the I-frames are coded independently of the adjacent frames. In order to compress the I-frames efficiently, intra prediction is used instead of inter prediction. Intra prediction means that the current macroblock is predicted from the surrounding macroblocks in the same frame.

Figure 3.1. The figure shows the division of a 16 × 16 luminance block into four 8 × 8 blocks. The 8 × 8 blocks can then be further divided into two 4 × 8 blocks, two 8 × 4 blocks or four 4 × 4 blocks. The numbers inside the blocks indicate in which order the predictions and transforms are made.

If the luminance macroblock is predicted in 16 × 16 mode, each 4 × 4 block of prediction errors is first transformed using a 4 × 4 integer transform. Then the DC components from the 4 × 4 blocks are gathered in a 4 × 4 block and transformed again with a 4 × 4 Hadamard transform, to decorrelate the DC components once more. Otherwise, the size of the transform is the same as the size of the prediction. The chrominance macroblocks are always predicted in 8 × 8 blocks and transformed as four 4 × 4 blocks, with an extra 2 × 2 Hadamard transform of the four DC components.


Figure 3.2. The figure shows the transformations of 16 × 16 luminance blocks and 8 × 8 chrominance blocks. The numbers inside the blocks indicate in which order the blocks are transformed and quantized.

3.3 Motion estimation

To improve the prediction where there is movement in the scene, motion estimation is used to predict how things move between two frames. This makes the prediction error smaller, but the motion vectors have to be sent to the decoder so that it can make the same prediction.

The main principle of the motion estimation is the same as in the previous MPEG standards, but in H.264 it is possible to use motion vectors of quarter pixel accuracy. I will not say more about motion estimation, since it has not been a part of my implementation.

3.4 Intra prediction

The big difference of H.264 intra coding compared to JPEG and JPEG2000 is that a predictor is used together with a transform; normally only a predictor or a transform is used for image coding. With the predictor, a lot of values are zero or close to zero before the transformation and quantization.

The intra prediction in H.264 is a bit different from the predictors in the lossless image coding standards PNG and JPEG-LS. The predictor predicts whole blocks of pixels instead of single pixels. For every 4 × 4 luminance block, a total of 9 different prediction modes are used, and the mode with the least error, measured by the sum of absolute errors or the sum of squared errors, is selected for each block. The choice of error measure depends on the implementation of the encoder.


The different prediction modes are:

• Mode 0: Vertical prediction
• Mode 1: Horizontal prediction
• Mode 2: DC prediction
• Mode 3: Diagonal down/left prediction
• Mode 4: Diagonal down/right prediction
• Mode 5: Vertical-right prediction
• Mode 6: Horizontal-down prediction
• Mode 7: Vertical-left prediction
• Mode 8: Horizontal-up prediction

Figure 3.3. The figure shows the prediction of 4 × 4 luminance blocks. There are 9 different prediction modes, where mode 2 is DC prediction. The pixels with small letters represent the pixels to be predicted and the pixels with capital letters are the pixels used to predict the 4 × 4 block.
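To illustrate the three simplest modes, the sketch below predicts a 4 × 4 block from the four reconstructed pixels above it (A-D in the figure) and the four to its left (I-L); the array names and the neighbour handling are simplifications for this example, not the notation of the standard.

```cpp
// Predict a 4x4 block with mode 0 (vertical), mode 1 (horizontal) or
// mode 2 (DC). Both neighbour arrays are assumed to be available.
void predict4x4(int mode, const int above[4], const int left[4], int pred[4][4]) {
    if (mode == 0) {                            // vertical: copy the row above
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                pred[r][c] = above[c];
    } else if (mode == 1) {                     // horizontal: copy the left column
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                pred[r][c] = left[r];
    } else {                                    // DC: rounded mean of the neighbours
        int sum = 0;
        for (int i = 0; i < 4; ++i) sum += above[i] + left[i];
        int dc = (sum + 4) >> 3;
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                pred[r][c] = dc;
    }
}
```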

For 16 × 16 luminance blocks and 8 × 8 chrominance blocks, 4 different prediction modes are used. The first 3 are the same as for the prediction of 4 × 4 blocks and the fourth is a plane prediction. The two chrominance blocks are predicted in the same way, but independently of the prediction mode used for the luminance block.

If the 8 × 8 transform is used, luminance blocks of 8 × 8 pixels can also be predicted in 9 different ways. The prediction of chrominance blocks remains the same.

The prediction mode that is used must be sent, since the decoder does not have access to the original image and thereby cannot determine which prediction mode is best. Since the prediction modes of adjacent blocks are highly correlated, a most probable mode is calculated instead of sending the prediction mode directly, and a bit is sent to tell whether the most probable mode is used or not. If the most probable mode is not used, an additional parameter with the prediction mode that was used is sent.


The most probable mode is calculated as the minimum of the prediction mode for the block above the current block and the prediction mode for the block to the left of the current block. If either of the blocks is not available, or belongs to a macroblock of a different type than the current macroblock, the prediction mode for that block is set to 2, DC prediction.

Figure 3.4. The figure shows how the most probable prediction mode is determined. The most probable mode is calculated as the minimum of the prediction mode for the block above the current block, marked with T, and the prediction mode for the block to the left of the current block, marked with L. The current block is marked with C.
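In code the rule is short; a sketch with the availability check reduced to two flags:

```cpp
// Most probable mode: the minimum of the modes of the block above (T)
// and the block to the left (L); an unavailable neighbour counts as
// mode 2, DC prediction.
int mostProbableMode(int modeAbove, int modeLeft,
                     bool aboveAvailable, bool leftAvailable) {
    const int DC = 2;
    int a = aboveAvailable ? modeAbove : DC;
    int l = leftAvailable ? modeLeft : DC;
    return (a < l) ? a : l;
}
```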

3.5 Transforms

The transform in H.264 is done with an integer transform instead of the DCT that is used in JPEG. The advantage of an integer transform over the DCT is that the integer transform is much easier to compute, since it can be calculated with only shifts, additions and subtractions.

The integer transform can be derived from the DCT by multiplying the DCT matrix by a constant and rounding the resulting matrix.

The inverse integer cosine transform, IICT, is given in the standard as an algorithm, where w_n are the transformed coefficients, z_n is a set of intermediate values and x_n are the inverse transformed values. The algorithm is first applied to each row of the transformed coefficients and then to each column of the resulting matrix; >> denotes arithmetic right shift.

z_0 = w_0 + w_2
z_1 = w_0 − w_2
z_2 = (w_1 >> 1) − w_3
z_3 = w_1 + (w_3 >> 1)

x_0 = z_0 + z_3
x_1 = z_1 + z_2
x_2 = z_1 − z_2
x_3 = z_0 − z_3


The algorithm is equal to applying the following matrix.

IICT =
[  1    1    1    1
   1   1/2 −1/2  −1
   1   −1   −1    1
  1/2  −1    1  −1/2 ]

The integer cosine transform, ICT, should be the same as the IICT. But since division by 2 results in loss of accuracy, the second and fourth rows are multiplied by a factor 2.

ICT =
[  1   1   1   1
   2   1  −1  −2
   1  −1  −1   1
   1  −2   2  −1 ]

The basis functions of these transform matrices are orthogonal but not normalized. Normalizing the basis functions would spoil the whole idea of an integer transform. Instead, the normalization is done together with the quantization. For the forward transform, a scaling matrix is applied after the transform, and for the inverse transform, a scaling matrix is applied before the transform.

The resulting integer transform is not an exact copy of the DCT, but the difference is so small that the benefit of being able to perform the transform with just shifts, additions and subtractions is more important. Note that the transform matrices are much like the simple Hadamard transform.
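The row/column algorithm maps directly to code; a sketch of the inverse transform of a 4 × 4 block, with the scaling that the standard applies around the transform left out:

```cpp
// The inverse transform butterfly from the text, for one row or column.
static void inverseButterfly(const int w[4], int x[4]) {
    int z0 = w[0] + w[2];
    int z1 = w[0] - w[2];
    int z2 = (w[1] >> 1) - w[3];
    int z3 = w[1] + (w[3] >> 1);
    x[0] = z0 + z3;
    x[1] = z1 + z2;
    x[2] = z1 - z2;
    x[3] = z0 - z3;
}

// Apply the butterfly first to each row, then to each column.
void inverseTransform4x4(int block[4][4]) {
    int tmp[4];
    for (int r = 0; r < 4; ++r) {
        inverseButterfly(block[r], tmp);
        for (int c = 0; c < 4; ++c) block[r][c] = tmp[c];
    }
    for (int c = 0; c < 4; ++c) {
        int col[4] = {block[0][c], block[1][c], block[2][c], block[3][c]};
        inverseButterfly(col, tmp);
        for (int r = 0; r < 4; ++r) block[r][c] = tmp[r];
    }
}
```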

3.6 Quantization

The quantization of the transform components in H.264 is done with scalar quantization. Only 6 different step sizes are defined, but the quantization parameter, QP, can take the values 0-51; each time QP is increased by 6, the step size is doubled. The quantization is performed as a multiplication with a multiplication factor MF followed by an arithmetic right shift, to avoid division operations. Instead of using a rounding function, a rounding offset is added before the right shift operation. The rounding offset should be equal to half the value of the right shift operation.

Since we are dealing with signed integer arithmetic, the absolute value of the transform component should be shifted instead of the signed value. If a transform component with a negative value is shifted, the smallest possible value after the right shift is −1, since ones are shifted in from the left. In order to make the smallest possible value after the right shift zero, the absolute value of the transform component must be shifted, and the sign is put back afterwards.
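A sketch of the multiplication-and-shift quantization with the sign handling described above; the multiplication factor MF and the shift amount qbits stand in for the standard's QP-dependent tables and are not the actual values.

```cpp
#include <cstdlib>

// Quantize one transform component with a multiplication and an
// arithmetic right shift. The rounding offset is half the shift value,
// and the shift is applied to the absolute value, as described above.
int quantize(int coefficient, int MF, int qbits) {
    int offset = 1 << (qbits - 1);  // half of 2^qbits
    int level = (std::abs(coefficient) * MF + offset) >> qbits;
    return (coefficient < 0) ? -level : level;  // put the sign back
}
```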

The QP for the chrominance is derived as a function of the QP for the luminance.


3.7 Entropy coding

In H.264 there are two different entropy coders: the context adaptive variable length coding, CAVLC, and the context-based adaptive binary arithmetic coding, CABAC. CABAC is a bit more complex than CAVLC but achieves better compression ratios.

In the next section I will describe how CABAC works. I will not include a description of CAVLC, since I have only implemented CABAC.

When I implemented CABAC, there was not much information about it in my first sources. Eventually I found a good overview [8] written by the creators of the standard. The overview was a very good complement to the standard itself.

3.7.1 CABAC

CABAC uses binary arithmetic coding, which means that the alphabet only consists of two symbols, 0 and 1, or rather, a least probable symbol, LPS, and a most probable symbol, MPS. Context based coding means that there are a number of different context models, or probability models, to choose from. In CABAC there are 460 different context models. It is easy to think that it would be hard to keep track of all the 460 context models, but this is solved pretty neatly. Each context model is represented by a state, which represents the probability that the next symbol is a least probable symbol, and the meaning of the most probable symbol for the context model, i.e. whether the most probable symbol is 0 or 1. Since the coding is supposed to be adaptive, each time a decision is encoded the state for the current context model is updated.

The states of all the context models are represented by a value between 0 and 63. Since the state only represents the probability of the least probable symbol, the meaning of the most probable symbol has to be switched if the current state is 0 and a least probable symbol is encoded.

The arithmetic coding engine for CABAC consists of three parts. The first part is the binarizer, which converts the elements to be encoded to binary strings. The binarizer is only used for the elements that are not already binary. After the binarization, the bits for the current symbol are sent to the second part of the engine, which consists of a regular and a bypass coding engine. The bypass coding engine does not make use of any context model and is used for symbols that have a nearly uniform distribution, like the sign of the transform components. The regular coding engine uses context models, and a context model has to be selected before the actual encoding. This is done by the third part of the coding engine, the context modeler.

In arithmetic coding, there are a lot of multiplications when calculating the limits of the new interval. These multiplications are one of the main bottlenecks in practical implementations of arithmetic coding. In CABAC, this is solved by using precalculated values that are stored in a lookup table, thus eliminating all the multiplications for calculating the new limits. The precalculated values are not an exact representation of the limits, but the approximations are good enough to be able to use a fast multiplication-free coding engine.


Figure 3.5. Flowchart of the CABAC coding engine.

3.7.2 Macroblock information

In order for the decoder to know what kind of macroblock the current macroblock is, some information about the macroblock has to be sent. The first thing that is sent is the macroblock type: whether the macroblock is predicted as a 16 × 16 block or as sixteen 4 × 4 blocks. Then all the prediction modes are sent. If the current macroblock is a 16 × 16 macroblock, the prediction mode is included in the information about the macroblock type. Then the parameter delta QP is sent. Delta QP represents the change of quantization parameter from the previous macroblock to the current one. After that, the symbol coded block pattern Y is sent to indicate which of the four 8 × 8 blocks in the 16 × 16 luminance block contain significant components. For the chrominance, a coded block pattern called nc is sent to indicate whether there are significant DC components, both DC and AC components, or whether all the transform components are zero. Finally, all the different blocks of transform components are sent.

3.7.3 Binarization

In this section, the binarization of the different symbols will be explained. The binarization of symbols that are only used for P- and B-frames, such as motion vectors, will not be included. There are five different types of binarization in CABAC: unary binarization, truncated unary binarization, concatenated unary / kth-order Exp-Golomb binarization, fixed length binarization and binarization by lookup tables.

Unary binarization means that a binary string of C ones followed by a zero is used to binarize a symbol with the value C. Truncated unary binarization is the same as unary binarization, except for the last symbol, which does not have the zero at the end. Fixed length binarization is simply the binary representation of the symbol with a fixed number of bits.

Symbol  Unary     Truncated unary  Fixed length
0       0         0                000
1       10        10               001
2       110       110              010
3       1110      1110             011
4       11110     11110            100
5       111110    111110           101
6       1111110   1111110          110
7       11111110  1111111          111
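The first two binarizations in the table are easy to express in code; a sketch generating the unary and truncated unary strings, where cMax is the largest possible symbol value of the truncated code:

```cpp
#include <string>

// Unary binarization: C ones followed by a terminating zero.
std::string unary(int C) {
    return std::string(C, '1') + '0';
}

// Truncated unary binarization: as unary, except that the largest
// possible value cMax is sent without the terminating zero.
std::string truncatedUnary(int C, int cMax) {
    return (C < cMax) ? unary(C) : std::string(C, '1');
}
```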

The concatenated unary / kth-order Exp-Golomb binarization is only used for the quantized transform components. The exponential Golomb codewords are derived from the Golomb codewords, which can be proven to be optimal prefix-free codewords for sources with geometric distributions. The codeword consists of a truncated unary prefix and an exponential Golomb order 0 suffix. Since the zeros in the transformed and quantized prediction error blocks are only represented by a flag that indicates that the component is zero, the value of the transform component minus one is binarized. The sign of the transform component is sent separately.

Transform component  TU prefix       EG0 suffix  Codeword
1                    0                           0
2                    10                          10
3                    110                         110
4                    1110                        1110
5                    11110                       11110
...                  ...             ...         ...
13                   1111111111110               1111111111110
14                   11111111111110              11111111111110
15                   11111111111111  0           111111111111110
16                   11111111111111  100         11111111111111100
17                   11111111111111  101         11111111111111101
18                   11111111111111  11000       1111111111111111000
19                   11111111111111  11001       1111111111111111001
20                   11111111111111  11010       1111111111111111010

The entropy coding of the transform components can be seen as a combination of Huffman coding and arithmetic coding. The codewords are short for small transform components and grow longer the larger the transform component is. Each bit of the codeword is then individually encoded with the arithmetic coding engine for further compression.


The following table defines the type of binarization for the different symbols.

Symbol                       Type of binarization
Macroblock type              Lookup table
Luma prediction mode         Fixed length
Chroma prediction mode       Truncated unary
Delta QP                     Unary
Coded block pattern, luma    Fixed length
Coded block pattern, chroma  Truncated unary
Transform components         UEG0

3.7.4 Context modeling

The context model, or context index, for the current bit to be encoded depends on several factors. First, a context index offset is assigned, depending on which kind of symbol the current bit belongs to. Second, a context category offset is added. The context category offset depends on which category the current symbol belongs to. For the transform components, there are different context categories depending on whether the transform component belongs to a block of DC or AC components and whether the block is a block of transformed luminance or chrominance prediction errors. Third, a context index increment is added. The context index increment is determined by information from surrounding blocks and transform components.

The resulting context index is hence calculated as the sum of the context index offset, the context category offset and the context index increment.

In order to have appropriate values of the state and the most probable symbol for each context model from the beginning, the context models are initialized before the first bit is encoded. The initialization depends on the quantization parameter for the first macroblock.

3.7.5 Transform components

For each block of transform components, a flag called coded block flag is sent to indicate whether the block contains significant components or not. If the block contains significant components, a flag called significant coefficient flag is sent for each component to tell whether the component is significant or not. To know when the last component is reached, another flag called last significant coefficient flag is also sent for each significant component, to indicate whether the current component is the last significant one or not.

The first 14 bits of the codeword for the current transform component are encoded using the regular coding engine, and the rest using the bypass coding engine. The sign of the transform component is also encoded using the bypass coding engine.
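The following sketch shows how the three flags could be sent for one block in scan order. The function encodeBin() stands in for the arithmetic coding engine, and the context indices are placeholders; both are assumptions for the illustration.

    void encodeBin(int ctxIdx, int bit);   // assumed CABAC interface

    void encodeSignificanceMap(const int* coeff, int numCoeff) {
        int last = -1;                     // position of the last significant component
        for (int i = 0; i < numCoeff; ++i)
            if (coeff[i] != 0)
                last = i;
        encodeBin(0, last >= 0);           // coded block flag
        if (last < 0)
            return;
        for (int i = 0; i <= last; ++i) {
            int significant = (coeff[i] != 0);
            encodeBin(1 + i, significant);        // significant coefficient flag
            if (significant)
                encodeBin(20 + i, i == last);     // last significant coefficient flag
        }
    }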


3.8 Deblocking filter

One of the advantages of JPEG2000 over JPEG is that JPEG2000 does not suffer from blocking artifacts, since it does not divide the image into blocks. H.264 takes care of the blocking artifacts with an adaptive deblocking filter.

First, different boundary strengths are assigned depending on whether the current frame is a P-, B- or I-frame and whether the current boundary is a boundary between two macroblocks or between two blocks inside a macroblock. Stronger filtering is used for macroblock boundaries.

The filter is then applied in raster scan order. The boundaries for the current macroblock are first filtered vertically and then horizontally. The idea of the filter is to look for significant changes in the image and only filter if there is no significant change across the block boundary. The definition of a significant change depends on the average quantization parameter of the current and the previous macroblock. The filtering is a low pass filtering that smooths the image; in essence, a mean value of the neighbouring pixels is used.
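A heavily simplified sketch of this idea is given below. In the real filter, the thresholds come from tables indexed by the average quantization parameter and the filter offsets; here they are plain parameters, and the smoothing itself is reduced to a simple averaging.

    #include <cstdlib>

    // Filter one quadruple of pixels across a block boundary:
    // p1 p0 | q0 q1, where | is the boundary.
    void filterEdgeSample(unsigned char& p1, unsigned char& p0,
                          unsigned char& q0, unsigned char& q1,
                          int alpha, int beta)
    {
        // A large step across the boundary is probably a real edge
        // in the image, so it is left untouched.
        if (std::abs(p0 - q0) >= alpha) return;
        if (std::abs(p1 - p0) >= beta || std::abs(q1 - q0) >= beta) return;
        // Otherwise smooth the boundary pixels with a local mean.
        int avg = (p1 + p0 + q0 + q1 + 2) / 4;
        p0 = static_cast<unsigned char>((p0 + avg + 1) / 2);
        q0 = static_cast<unsigned char>((q0 + avg + 1) / 2);
    }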

The visual quality after filtering is better than before. It is not certain, however, that the PSNR of the filtered version will be higher than that of the unfiltered one, since PSNR only measures mean squared error and not the visual quality perceived by humans.

It is possible to control the strength of the filtering by editing the two variables filter offset A and filter offset B, which are sent as header information to the decoder.


Chapter 4

Implementation

In my implementation, I have used the final draft for H.264 [12] as the main document along with papers by Iain Richardson [3]. My implementation includes the CABAC entropy coder and the 8 × 8 transform. In this chapter I will describe how my implementation was done.

4.1 Development of the encoder and decoder

I started the development of the encoder by implementing the different prediction modes for prediction of 4×4 and 16×16 blocks. To keep track of which pixels were available, I used an available pixels image where a pixel had the value 1 if it was available. Then the transforms and the quantization were added. I began the implementation in Matlab, but the program became very slow: a run for one picture took about 15 minutes, without the entropy encoding and the rate distortion optimization.

I then decided to use mex-files to speed up the program. Mex-files are a way of combining C programming with Matlab, and compiled C programs are much faster than Matlab for certain things, such as for-loops. I converted the code from Matlab to C, and with a compiled mex-file a run of the program took about 1 second instead of 15 minutes. However, when I tried to add more things to the program, it started to behave strangely; for example, Matlab sometimes exited by itself. Since I was not able to determine the source of the problem, I finally decided to use the programming language C++ with Bloodshed Dev-C++ as the development environment. I found the library CImg [1], which I used for reading the original images, displaying different kinds of results and saving the reconstructed versions of the images. I started with the C code from the mex-files and eventually added a class for the macroblocks, containing all the information about a macroblock.

When I look back at my implementation now, I am glad that I decided to use C++, since its object orientation made the work much easier. The entropy coding would have been rather hard to implement in a good way in Matlab. A run of the final program takes about 2 minutes, including the entropy encoding and a full rate distortion optimization. I guess that a run of a final Matlab implementation would have taken several hours.

Since I had to rewrite the code several times, the development of the encoder took a bit longer than I expected. But when the encoder was finished, the development of the decoder was very easy. The development of the encoder took about 3.5 months of work and the decoder took about 5 hours. Since a predictor is used for encoding the H.264 I-frames, the decoder had to be integrated as a part of the encoder, for the predictions to be made from the reconstructed version of the image.

The encoder and decoder were first tested without the arithmetic coding, to eliminate as many sources of errors as possible. Since all the symbols were binarized by the binarizer, I could write the bits to an ordinary text file. At first, encoding and decoding of one macroblock was accomplished. To find the errors, I implemented a function that printed all the information about the current macroblock. With the help of the print function, I could easily step forward macroblock by macroblock in the encoder and decoder and find where the errors were. When I got the encoder and decoder to work without the arithmetic coding, I added the arithmetic coding and repeated the procedure to find the new errors.

Since the post scaling factors MF for 8 × 8 blocks were not given in the draft of the standard or in the tutorial white papers by Iain Richardson, I took them from a reference implementation of H.264 [7].

4.2 Arithmetic coding

In the standard [12], the decoding process for CABAC is described. But since the encoding is not a part of the standard, the encoding process is not included. When I implemented the coding of the prediction modes, I first thought that there was something wrong in the standard. Therefore I searched for a newer version of the standard to see if the description of the coding of the prediction modes was the same, and eventually I found a prepublished version of the standard from 2005 [4]. There was nothing wrong in the standard; the problem was that I had misinterpreted the phrase remaining mode selector, since there was no formula for how it was calculated. When I looked in the newer version of the standard, I discovered that an informative description of the encoding process was included. The encoding process is not a part of the standard, but if the described algorithm is used, it will work together with the defined decoder. Since the encoding process in the arithmetic coding is not obvious from the defined decoding process, I decided to use the encoding, and decoding, described in the newer version of the standard.

The precalculated values for the new limits of the current interval are taken from the newer standard, since they have been changed since my version and the new values are used in the description of the encoding and decoding process. All the context models are also taken from the newer standard.


4.3 Rate distortion optimization

When compressing an image, the goal is of course to compress the image as much as possible without losing too much visual quality. Within an image there are variations that make different combinations of parameters, such as the choice of prediction mode and level of quantization, optimal in different parts of the image. In order to achieve the best compression, optimization should therefore be used. But the question is then: what should be maximized or minimized?

If only the rate is minimized, the best choice is to not send any bits at all, resulting in infinite distortion. If only the distortion is minimized, the best choice is to not quantize any transform components at all, so the rate cannot be lower than that of lossless techniques. Thus, a combination of the rate and the distortion must be optimized. The most common approach is to minimize a weighted sum of the rate and the distortion. To be able to choose between a high or low data rate, or high or low image quality, a Lagrange multiplier λ is used.

min J = R + λ · D, λ > 0

The parameter lambda controls the goal of the optimization. A small lambda value means that a low rate is more important, while a high lambda value means that low distortion is more important. The value J to be minimized is also called the R-D cost.

Since the reconstruction of the blocks affects the prediction and reconstruction of the following blocks, all the blocks in the image should be optimized as one unit at the same time, but that would be a very complex optimization problem. Instead I perform the rate distortion optimization separately for each macroblock and assume that it will not affect the following macroblocks.

To find the optimal combination for each macroblock, the rate for each parameter combination of the macroblock has to be calculated. This means that the whole encoding procedure has to be carried out. As described in the chapter about H.264, CABAC is an adaptive arithmetic coder. Each time a macroblock is encoded to determine the rate, the state of the encoder and the states of all the probability models are therefore set back to their previous values. When the macroblock is encoded for real, the previous values are updated to the new values.

The distortion for the current macroblock is calculated as the sum of squared errors when the macroblock is reconstructed.

In my implementation I have used the brute force approach to find the best combination of prediction size and quantization parameter for each macroblock. This means that all combinations of the different parameters are tested in order to find the combination with the lowest R-D cost.
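A sketch of the brute-force search for one macroblock is shown below. The helper functions are assumed stand-ins for the real encoder: encodeMacroblock() is supposed to return the number of bits and the sum of squared errors, and the save/restore calls represent resetting the arithmetic coder and the probability models as described above.

    void saveCoderState();     // assumed helpers, not defined here
    void restoreCoderState();
    void encodeMacroblock(int qp, bool use16x16,
                          double& rate, double& distortion);

    struct Choice { int qp; bool use16x16; };

    Choice bestChoice(int qpMin, int qpMax, double lambda) {
        Choice best = { qpMin, false };
        double bestCost = 1e300;
        for (int qp = qpMin; qp <= qpMax; ++qp) {
            for (int size = 0; size < 2; ++size) {     // prediction size
                bool use16x16 = (size == 1);
                saveCoderState();                      // trial encoding only
                double rate = 0.0, distortion = 0.0;
                encodeMacroblock(qp, use16x16, rate, distortion);
                restoreCoderState();
                double cost = rate + lambda * distortion;   // R-D cost J
                if (cost < bestCost) {
                    bestCost = cost;
                    best.qp = qp;
                    best.use16x16 = use16x16;
                }
            }
        }
        return best;
    }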

At first I implemented my codec from the older standard. Then I compared it with an existing reference implementation [7] and discovered that my implementation was a bit worse. I then added a rate distortion optimization of the prediction modes, instead of choosing the prediction mode with the lowest prediction error. The rate distortion optimization of the prediction modes made the data rate drop about 3-5 per cent. In the old standard it is possible to choose between macroblocks of the type 4 × 4 and 16 × 16. If the adaptive block size transform is used, it is possible to choose between macroblocks of the type 4 × 4, 4 × 8, 8 × 4 and 8 × 8. I implemented the adaptive block size transform but it did not make any big difference. In the newer version of the standard, it is possible to choose between macroblocks of the type 4 × 4 and 16 × 16 or, if the 8 × 8 transform is used, macroblocks of the type 4 × 4, 8 × 8 and 16 × 16. The 8 × 8 macroblocks are a good compromise between 4 × 4 and 16 × 16 macroblocks and make the data rate go down a few more per cent. The blocking artifacts for an 8 × 8 macroblock are not as severe as when a 16 × 16 macroblock is used.

4.4 User interface

I have made a text file where the user can set the different settings for the encoder. The settings are the following; a hypothetical example of such a settings file is shown after the list.

• Input image filename
  The filename of the image to be encoded.
• QPmin
  The lower quantization boundary for the rate distortion optimization.
• QPmax
  The upper quantization boundary for the rate distortion optimization.
• Lambda
  The lambda value that controls the goal of the rate distortion optimization.
• Transform size 8 × 8
  Enable or disable the use of 8 × 8 macroblocks.
• Filter Offset A
  A parameter to the deblocking filter.
• Filter Offset B
  A parameter to the deblocking filter.
• Lossless
  Enable or disable lossless compression.
• Grayscale
  Used to indicate if the image to be encoded is a gray scale image or not.
• Rate distortion optimize luma prediction modes
  Enable or disable rate distortion optimization of the luminance prediction modes.
• Rate distortion optimize chroma prediction modes
  Enable or disable rate distortion optimization of the chrominance prediction modes.
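A hypothetical example of what the settings file could look like is given below; the exact names and layout in my implementation may differ.

    InputImage                    lena.bmp
    QPmin                         20
    QPmax                         32
    Lambda                        0.85
    TransformSize8x8              1
    FilterOffsetA                 0
    FilterOffsetB                 0
    Lossless                      0
    Grayscale                     0
    RDOptLumaPredictionModes      1
    RDOptChromaPredictionModes    1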


The rate distortion optimization is performed for all QPmin ≤ QP ≤ QPmax. The bigger the difference between QPmin and QPmax, the longer the encoding takes. If rate distortion optimization is used for the prediction modes as well, the encoding is much slower. Since the decoder is integrated in the encoder, the reconstructed image and the filtered reconstructed image can then be shown to the user.

4.5 Improvement of the encoder

There are a lot of articles on the internet about small improvements of H.264. At the beginning I intended to implement one or more of the improvements, but as the development of the encoder took longer than I expected I did not have time to do so. Instead I added a lossless version of the I-frame coding.

In order to compress an image losslessly, I had to add a conversion to another colour space than YCbCr, since that colour space conversion is not reversible, due to rounding errors. In my lossless implementation I have used the colour space YCgCo [10], where Y is the luminance, Cg is chrominance green and Co is chrominance orange. Since the colour space conversion has to be lossless, the dynamic range of Cg and Co has to be twice as big as the dynamic range of the luminance.

Y  = 0.25R + 0.5G + 0.25B
Cg = −0.5R + G − 0.5B
Co = R − B

R = Y − 0.5Cg + 0.5Co
G = Y + 0.5Cg
B = Y − 0.5Cg − 0.5Co
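A direct transcription of these equations into C++ is sketched below. How the non-integer intermediate values of Y and Cg are represented in my implementation is not shown here.

    struct YCgCo { double y, cg, co; };

    // Forward conversion, exactly as in the equations above.
    YCgCo fromRGB(double r, double g, double b) {
        YCgCo c;
        c.y  =  0.25 * r + 0.5 * g + 0.25 * b;
        c.cg = -0.5  * r +       g - 0.5  * b;
        c.co =          r -               b;
        return c;
    }

    // Inverse conversion; for integer RGB input the round trip is exact,
    // since all the factors are exactly representable in binary.
    void toRGB(const YCgCo& c, double& r, double& g, double& b) {
        r = c.y - 0.5 * c.cg + 0.5 * c.co;
        g = c.y + 0.5 * c.cg;
        b = c.y - 0.5 * c.cg - 0.5 * c.co;
    }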

I then removed the downsampling of the chrominance images and simply skipped the transformation and quantization of the prediction errors, encoding them in the same way as the transform components in the lossy version.

The scanning of the transform components is based on the fact that the probability of significant components decreases with growing distance from the DC component. But the probability of a significant prediction error increases with the distance from the top left pixel in the current block. To take advantage of this fact, I reversed the scanning order of the prediction errors, for the lossless compression only.
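As a sketch, the reversal can be done by simply traversing the ordinary scan table backwards; scanOrder[] is a placeholder for the actual zig-zag scan of a 4 × 4 block.

    extern const int scanOrder[16];   // zig-zag scan positions (assumed given)

    // Read out the prediction errors so that positions far from the
    // top left pixel, which are more likely to be significant, come first.
    void scanReversed(const int block[16], int out[16]) {
        for (int i = 0; i < 16; ++i)
            out[i] = block[scanOrder[15 - i]];
    }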


References
