
Transform Coefficient Thresholding and Lagrangian Optimization for H.264 Video Coding


Transform Coefficient Thresholding and Lagrangian Optimization for H.264 Video Coding

Master's thesis in Image Coding
at Linköping Institute of Technology

by

Pontus Carlsson

LiTH-ISY-EX-3561-2004


Avdelning, Institution / Division, Department:
Institutionen för systemteknik
581 83 LINKÖPING

Datum / Date: 2004-03-04
Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete (Master's thesis)
ISRN: LITH-ISY-EX-3561-2004

URL för elektronisk version:
http://www.ep.liu.se/exjobb/isy/2004/3561/

Titel / Title:
Transformkoefficient-tröskling och Lagrangeoptimering för H.264 Videokodning
Transform Coefficient Thresholding and Lagrangian Optimization for H.264 Video Coding

Författare / Author: Pontus Carlsson

Sammanfattning / Abstract:

H.264, also known as MPEG-4 Part 10: Advanced Video Coding, is the latest MPEG standard for video coding. It provides approximately 50% bit rate savings for equivalent perceptual quality compared to any previous standard. In the same fashion as previous MPEG standards, only the bitstream syntax and the decoder are specified. Hence, coding performance is not only determined by the standard itself but also by the implementation of the encoder. In this report we propose two methods for improving the coding performance while remaining fully compliant to the standard. After transformation and quantization, the transform coefficients are usually entropy coded and embedded in the bitstream. However, some of them might be beneficial to discard if the number of saved bits is sufficiently large. This is usually referred to as coefficient thresholding and is investigated in the scope of H.264 in this report.

Lagrangian optimization for video compression has proven to yield substantial improvements in perceived quality and the H.264 Reference Software has been designed around this concept. When performing Lagrangian optimization, lambda is a crucial parameter that determines the tradeoff between rate and distortion. We propose a new method to select lambda and the quantization parameter for non-reference frames in H.264.

The two methods are shown to achieve significant improvements. When combined, they reduce the bitrate by around 12%, while preserving the video quality in terms of average PSNR.

To aid development of H.264, a software tool has been created to visualize the coding process and present statistics. This tool is capable of displaying information such as bit distribution, motion vectors, predicted pictures and motion compensated block sizes.

Nyckelord / Keyword:
Lagrangian Optimization, Rate-Distortion Optimization, Transform Coefficient Thresholding, Preprocessing, MATLAB, Developer Tool, MPEG-4 Part 10, Advanced Video Coding, AVC, H.26L, H.264


Transform Coefficient Thresholding and Lagrangian Optimization for H.264 Video Coding

Pontus Carlsson
ponca707@student.liu.se

Linköpings Universitet
Nanyang Technological University

LiTH-ISY-EX-3561-2004 SCE03-458


Abstract

H.264, also known as MPEG-4 Part 10: Advanced Video Coding, is the latest MPEG standard for video coding. It provides approximately 50% bit rate savings for equivalent perceptual quality compared to any previous standard. In the same fashion as previous MPEG standards, only the bitstream syntax and the decoder are specified. Hence, coding performance is not only determined by the standard itself but also by the implementation of the encoder. In this report we propose two methods for improving the coding performance while remaining fully compliant to the standard.

After transformation and quantization, the transform coefficients are usually entropy coded and embedded in the bitstream. However, some of them might be beneficial to discard if the number of saved bits is sufficiently large. This is usually referred to as coefficient thresholding and is investigated in the scope of H.264 in this report.

Lagrangian optimization for video compression has proven to yield substantial improvements in perceived quality and the H.264 Reference Software has been designed around this concept. When performing Lagrangian optimization, λ is a crucial parameter that determines the tradeoff between rate and distortion. We propose a new method to select λ and the quantization parameter for non-reference frames in H.264.

The two methods are shown to achieve significant improvements. When combined, they reduce the bitrate by around 12%, while preserving the video quality in terms of average PSNR.

To aid development of H.264, a software tool has been created to visualize the coding process and present statistics. This tool is capable of displaying information such as bit distribution, motion vectors, predicted pictures and motion compensated block sizes.

Keywords: Lagrangian Optimization, Rate-Distortion Optimization, Transform Coefficient Thresholding, Preprocessing, MATLAB, Developer Tool, MPEG-4 Part 10, Advanced Video Coding, AVC, H.26L, H.264


Acknowledgments

First I would like to thank my supervisors Dr. Liang-Tien Chia and Dr. F. Pan for their help and guidance during my project in Singapore.

A part of the project was carried out at the Institute for Infocomm Research, I2R. Many thanks to the staff there for their warm welcome and for sharing their knowledge of video coding in numerous discussions.

This project was supported by the DUO-Singapore Award for which I am very grateful.

Pontus Carlsson
February 2004


Contents

1 Introduction
   1.1 Project Review
   1.2 Report Outline

2 Background
   2.1 Video Coding Basics
   2.2 Quality Metrics
   2.3 Overview of H.264
       2.3.1 Frame Structure
       2.3.2 Motion Compensation
       2.3.3 Transform and Quantization
       2.3.4 In-loop Deblocking Filter
       2.3.5 Intra Prediction
       2.3.6 Entropy Coding
       2.3.7 Rate-Distortion Optimization

3 H.264 Developer Tool
   3.1 Introduction
   3.2 Requirements
       3.2.1 Data Presentation
       3.2.2 Application
   3.3 Architecture
   3.4 Implementation
       3.4.1 Modification of H.264 Reference Software
       3.4.2 MATLAB Application
   3.5 Results
   3.6 Concluding Discussion

4 Transform Coefficient Thresholding
   4.1 Introduction
   4.2 Thresholding in H.264 Reference Software
   4.3 R-D Optimized Thresholding
   4.4 Experiments
   4.5 Conclusions

5 Lagrangian Multiplier Selection
   5.1 Introduction
   5.2 Lagrangian Multiplier for B-frames
       5.2.1 Lagrangian Multiplier Selection in H.264 Reference Software
       5.2.2 Proposed Lagrangian Multiplier Selection
   5.3 Approximation of the Global Rate-Distortion Function
   5.4 Experiments
   5.5 Conclusions

6 Combined Experiments
   6.1 Conclusions

7 Future Work

Table of Abbreviations


List of Figures

2.1 Coding of a P-frame. Upper left: Decoded frame t − 1, which here is used to predict frame t. Upper right: Uncompressed frame t with motion vectors referring to frame t − 1 superimposed. Middle left: Motion compensated prediction of frame t. Middle right: Decoded frame t. Bottom left: Coded residue, i.e. the difference between the decoded and predicted frame t. Bottom right: Quantized transform coefficients corresponding to the coded residue.

2.2 Illustration of the deficiency of PSNR. Here, all images (except the original) have similar PSNR. Top left: Original "Lena" image. Top right: Contrast stretched image, PSNR = 24.61. Bottom left: Blurred image, PSNR = 24.61. Bottom right: JPEG compressed image, PSNR = 24.81 [27].

2.3 Example of repetitive motion suitable for multiple reference picture motion compensation.

2.4 Motion vectors and respective block sizes in a P-frame. The motion vectors within a block are the same and only one vector per block is coded in the bitstream.

2.5 Effect of the deblocking filter on an I-frame. Left: Before deblocking. Right: After deblocking.

2.6 Coding process for an I-frame. Upper left: Uncompressed blocks. Upper right: Intra predicted blocks. Lower left: Decoded blocks before deblocking filter (used for prediction). Lower right: Final decoded blocks.

2.7 Intra prediction in detail. Left: Decoded pixels before deblocking. The central 4 × 4 block is to be predicted using pixels A-K. Right: Intra predicted pixels. The central 4 × 4 block has been horizontally predicted using the previously decoded pixels A-D in the left picture.

3.1 Data flow from the encoder and decoder to H.264 Tool. During the encoding and decoding process, data such as motion vectors and predicted pictures are written to files. These files are then read by H.264 Tool for visualization and presentation.

3.2 "Control Panel" of H.264 Tool. The current frame number is 182 and data from two different encoders are being compared. The buttons to the left are used to open different windows for data analysis.

3.3 Typical usage of H.264 Tool. The leftmost window shows the decoded picture with motion vectors superimposed. In the middle is the predicted picture and the rightmost plot shows the size of the motion compensated blocks. Each of these windows is opened by pressing "New Plot" on the "Control Panel".

3.4 "Plot View" of H.264 Tool. In the "Codec" drop-down menu the user selects which encoder to view. "MV Scale" scales the motion vectors if present. "View" and "Subtract" set the data to view and subtract. Here, the predicted picture is subtracted from the original, uncompressed picture. The checkboxes superimpose information on the picture such as motion vectors and block sizes for motion compensation. This picture has been zoomed in, which is done by pressing the left mouse button on the center point.

3.5 Decoded picture from two different encoders compared side by side. Both plots superimpose the block size used for motion compensation. The right plot marks skipped macroblocks with black color. Note the slight difference in the block sizes for motion compensation.

3.6 Global PSNR and bit distribution. Accessed by pushing "Global Stat" on the "Control Panel".

3.7 Left: Transform coefficients. Middle: Bit distribution, mean squared error (MSE) and quantization parameter (QP) for each macroblock of the current frame. Right: Bit distribution for the frame.

4.1 Rate-PSNR curves for the proposed thresholding scheme compared to the H.264 Reference Software.

4.2 Comparison between the proposed thresholding scheme and the H.264 Reference Software. 300 frames of Foreman (QCIF) encoded with fixed QP = 28.

4.3 Close-up of the graphs in Figure 4.2. Using the proposed thresholding scheme, most B-frames have a higher PSNR, even though they use fewer bits.

5.1 Approximation of the global rate-distortion function for different values of C1. The operational rate-distortion functions (solid) have been obtained by encoding the sequences using fixed QP.

5.2 Rate-PSNR curves for the proposed Lagrangian optimization scheme compared to the H.264 Reference Software.

5.3 Comparison between the proposed Lagrangian optimization scheme and the H.264 Reference Software using fixed QP = 28. 300 frames of Foreman (QCIF).

5.4 Close-up of the graphs in Figure 5.3. We note that most B-frames have a higher PSNR in the proposed scheme, even though they use fewer bits.


List of Tables

4.1 Test Conditions.

4.2 Average improvement of PSNR and rate for the proposed thresholding scheme compared to H.264 RS using fixed QP, calculated according to [2]. The columns should be regarded as equivalent in the sense that there is either the increase in PSNR or the decrease in rate.

5.1 Average improvement of PSNR and rate for the proposed Lagrangian optimization scheme compared to H.264 RS using fixed QP, calculated according to [2]. The columns should be regarded as equivalent in the sense that there is either the increase in PSNR or the decrease in rate.

6.1 Average improvement of PSNR and rate for PTHRES and PLOPT combined, compared to H.264 RS using fixed QP. The columns should be regarded as equivalent in the sense that there is either the increase in PSNR or the decrease in rate [2].

6.2 Average improvement in PSNR for PTHRES and PLOPT combined, compared to using PTHRES or PLOPT separately. The improvements are relative to H.264 RS using fixed QP, and are calculated according to [2].


Chapter 1

Introduction

1.1 Project Review

Initially this project was to focus on preprocessing for H.264, and the beginning of the project was spent studying literature related to this subject. Little has yet been reported on preprocessing specific to H.264. Available literature usually targets previous MPEG standards, but new concepts in H.264 such as the deblocking filter (section 2.3.4) and enhanced prediction modes (section 2.3.2) make several previously employed preprocessing strategies less efficient or even obsolete.

One of the most common video coding artifacts is the blocking artifact, which manifests itself as "blockiness": visible and disturbing artificial edges and squares in the decoded video. For previous MPEG standards, preprocessing or prefiltering of the video is one way to reduce these artifacts [11, 13]. This filtering operation can also be a more integrated part of the coding process and be applied to the residue after motion compensation, such as in [9]. The deblocking filter introduced in H.264 greatly reduces the need for this type of preprocessing. Even though minor blocking artifacts can still occur, significant improvements are not likely to be achieved using this approach.

In [31], the residue after motion compensation was filtered using a perceptual criterion known as Just Noticeable Distortion (JND). This criterion is based on visual masking, for example that distortion is less visible in highly textured regions of an image. The implementation was done in an MPEG-2 encoder with very good results. Perceptual quality increased significantly and, surprisingly, so did the PSNR.

Hoping for similar results for H.264, we implemented JND filtering in the H.264 Reference Software [1], but results were disappointing. Both perceptual quality and PSNR decreased with the JND filtering. We believe that one reason for this is the enhanced prediction modes in H.264. Compared to MPEG-2, the residue after motion compensation is much smaller in H.264, and this property might leave less room for improvement by manipulating the residue. However, the exact reason is difficult to know.

A preprocessor can also be used to remove noise and irrelevant details from the video in order to increase coding efficiency [3, 5, 26, 30]. This type of preprocessing is especially useful for high resolution and high bitrate applications such as Digital Versatile Disc (DVD) and Digital Video Broadcasting (DVB), which currently use the MPEG-2 standard. For this standard, preprocessing has been very successful and was a major contribution in achieving broadcast quality at bitrates around 2 Mbit/s in the late 1990's [6].

During this project, some experiments with noise reduction were conducted and the spatiotemporal noise filter proposed in [34] was implemented. Performance was evaluated using artificially added white noise and was good in terms of rate and PSNR. However, a prohibitive factor for these experiments was the difficulty of finding uncompressed high resolution video material, for which noise reduction is most useful. The maximum resolution of the standard MPEG test sequences is 352 × 288. Further, the H.264 Reference Software is very slow, and encoding a high resolution movie would take several days.

A highly integrated preprocessing technique is modification of the transform coefficients. As transform coefficients are to be coded and embedded in the bitstream, the connection to the rate using this approach is very strong. One way of doing the modification is to discard transform coefficients that are expensive in terms of rate and contribute only little to the distortion. In [20], this is done for previous MPEG standards with substantial improvements in terms of rate and PSNR. For each coefficient, the decision whether to keep it or not is made using rate-distortion optimization (section 2.3.7) and dynamic programming. We have implemented a simplified thresholding scheme for H.264 which is presented in chapter 4.
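As a rough illustration, a per-coefficient keep-or-drop decision can be sketched with the standard Lagrangian cost J = D + λR: zeroing a coefficient saves its bits but adds its squared value to the distortion. The function and the per-coefficient bit cost below are hypothetical, not the actual scheme of chapter 4:

```python
def keep_coefficient(value, bits_needed, lam):
    """Keep a quantized coefficient only if zeroing it would raise the
    Lagrangian cost J = D + lam * R, i.e. if the squared error added by
    dropping it exceeds lam times the bits saved."""
    distortion_if_dropped = value ** 2
    return distortion_if_dropped > lam * bits_needed

lam = 100.0
print(keep_coefficient(5, 6, lam))   # small coefficient costing 6 bits: dropped
print(keep_coefficient(40, 6, lam))  # large coefficient costing 6 bits: kept
```

In [20] the decision is made jointly over all coefficients with dynamic programming, since with run-length entropy coding the bit cost of one coefficient depends on its neighbors.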

In parallel with the experiments on coefficient thresholding, some attempts were made to modify the λ parameter. This parameter is crucial in rate-distortion optimization as it determines the tradeoff between rate and distortion. Because of the complex prediction schemes in video coding, it is not obvious how to choose λ optimally. Several trial and error experiments were conducted with inconsistent results. Finally the problem was attacked in a more mathematical manner, which is presented in chapter 5.

The work on Lagrangian multiplier selection was summarized into a paper which was submitted to the ICIP 2004 conference.

The H.264 developer tool presented in chapter 3 was first implemented as part of an image processing course at Linköpings Universitet. During the project in Singapore it has been further developed and many new features have been added.

1.2 Report Outline

The rest of this report is organized as follows. Chapter 2 gives a brief review of the fundamentals of video coding and an overview of H.264. The H.264 developer tool is introduced in chapter 3. Chapters 4, 5 and 6 present our work on transform coefficient thresholding and Lagrangian multiplier selection. Finally, future work is discussed in chapter 7.


Chapter 2

Background

2.1 Video Coding Basics

Motion video data consists of a sequence of pictures, and cameras typically generate around 25 pictures per second. To represent this digitally, each picture is sampled at some resolution and each pixel value is represented using a limited number of bits. A common format for high quality video is a resolution of 720 × 576 where each pixel is represented using 24 bits. At 25 frames per second this representation would consume a bandwidth of 249 Mbit/s and the need for compression is obvious.
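The bandwidth figure is easy to verify; a quick sketch of the arithmetic in Python:

```python
# Raw bit rate of uncompressed 720 x 576 video, 24 bits/pixel, 25 frames/s
width, height = 720, 576
bits_per_pixel = 24
frames_per_second = 25

bits_per_second = width * height * bits_per_pixel * frames_per_second
print(bits_per_second / 1e6)  # 248.832, i.e. roughly 249 Mbit/s
```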

Compression is based on redundancy removal, and video contains both spatial and temporal redundancy. That is, images usually contain areas with non-random structures, and in video the change between consecutive frames is typically very small. More mathematically, nearby pixels in space or time usually have a high correlation.

The most successful video coding scheme to date is called hybrid video coding. This scheme uses motion compensated prediction to remove temporal redundancy. Typically, each picture is divided into blocks of size 16 × 16 called macroblocks. For each macroblock, a prediction is formed by estimating the motion from a previously coded picture. Hence, the only data that needs to be coded is this motion information and the residue picture, i.e. the difference between the prediction and the current picture. If the prediction is good, the energy of this residue picture will be small. A picture used for prediction is usually referred to as a reference picture or reference frame.

To further remove redundancy, a block based transform is applied to the residue picture. The transform coefficients are then quantized and finally entropy coded to achieve the actual compression.

The most common transform for lossy coding is the DCT (Discrete Cosine Transform), which is well known to concentrate most of the energy of natural signals into a small number of coefficients. Because of this property, there will only be a few non-zero coefficients left after quantization, which can be efficiently entropy coded.
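The energy compaction can be seen with a small pure-Python orthonormal DCT-II applied to a smooth ramp (an illustrative sketch; H.264 itself uses the integer transform of section 2.3.3):

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D signal (pure Python, for illustration)."""
    n_samples = len(x)
    out = []
    for k in range(n_samples):
        s = sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_samples))
                for n in range(n_samples))
        scale = math.sqrt(1.0 / n_samples) if k == 0 else math.sqrt(2.0 / n_samples)
        out.append(scale * s)
    return out

# A smooth ramp, typical of a row of pixels in a natural image
x = [10, 12, 14, 16, 18, 20, 22, 24]
c = dct_1d(x)

total = sum(v * v for v in c)      # equals the signal energy (orthonormal)
compacted = c[0] ** 2 + c[1] ** 2  # energy in the two lowest coefficients
print(compacted / total)           # > 0.99: almost all energy in two terms
```

The remaining coefficients are tiny and quantize to zero, which is exactly what makes the subsequent entropy coding efficient.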

A visual summary of this process is found in Figure 2.1. From this figure we observe that

• The motion vectors do not always represent motion. The most straightforward algorithm for motion estimation is to find the best matching block by searching all possible positions in the reference frame. This process does not necessarily find the true motion vectors.

• Temporal changes that do not correspond to motion cannot be well predicted. The bad prediction of the mouth area is an example of this.

• The remaining correlation in the residue after motion compensation is efficiently removed by the block based transform. The residue in transform domain is concentrated to just a few coefficients.

In previous MPEG standards¹, coded frames can be of type I, P or B. The terms refer to Intra, Predicted and Bidirectional, respectively.

I-frames are coded independently of other frames, in a way similar to still image standards such as JPEG.

P-frames are coded using prediction from a previously coded picture as described above.

B-frames extend the concept of P-frames by using two previously coded pictures for prediction, one temporally preceding and one temporally following. As a result, the prediction is improved and the energy of the residue that needs to be coded is further reduced.

The term GOP, or Group Of Pictures, refers to a sequence of coded pictures that can be independently decoded. A typical GOP structure is

I − B − B − P − B − B − P − . . .

where the order refers to the temporal order of the pictures. However, a consequence of B-frames is that the decoding order is different from the temporal order of the pictures. As the B-frames here use the previous I-frame and

¹ For simplicity, we here limit the discussion to MPEG standards prior to H.264. The concepts are similar in H.264, but the details are more complicated (section 2.3.1).


the following P-frame for prediction, they both have to be available when the B-frame is to be decoded. Below, the corresponding decoding order is indicated by subscripts.

I0 − B2 − B3 − P1 − B5 − B6 − P4 − . . .

P1 is predicted from I0; B2 and B3 are predicted from I0 and P1, etc. To further complicate things, individual macroblocks can be separately coded as I, P or B. This is useful for example when parts of a P-frame cannot be well predicted.
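The reordering above can be sketched as follows; `decode_order` is an illustrative helper, not part of any standard:

```python
def decode_order(display_types):
    """Return display indices in decoding order: every B-frame must wait
    for the next reference frame (I or P) to be decoded first."""
    order, pending_b = [], []
    for i, frame_type in enumerate(display_types):
        if frame_type == 'B':
            pending_b.append(i)      # held back until its future reference exists
        else:
            order.append(i)          # reference frame is decoded first
            order.extend(pending_b)  # then the B-frames that depend on it
            pending_b = []
    order.extend(pending_b)
    return order

gop = ['I', 'B', 'B', 'P', 'B', 'B', 'P']
order = decode_order(gop)
print(order)  # [0, 3, 1, 2, 6, 4, 5]

# Label each displayed frame with its decoding index, as in the text
labels = [''] * len(gop)
for k, pos in enumerate(order):
    labels[pos] = gop[pos] + str(k)
print(' - '.join(labels))  # I0 - B2 - B3 - P1 - B5 - B6 - P4
```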

The MPEG standards for video coding specify only the syntax of the bitstream and how to decode a bitstream compliant to this syntax. How to design an encoder is not part of the standard: for example, how the bits should be distributed, how macroblock coding parameters are chosen and what information to discard from the video are all left to the encoder. Hence, different encoders implementing the same standard can have very different performance.

2.2 Quality Metrics

The most widely used metric for measuring the quality of images is PSNR, or Peak Signal to Noise Ratio. If f and f̃ denote the original and reconstructed picture respectively, PSNR is defined as

\[ \mathrm{PSNR} = 20 \log_{10} \frac{255}{\sqrt{\mathrm{MSE}}} \tag{2.1} \]

where MSE is the mean squared error, defined as

\[ \mathrm{MSE} = \frac{1}{M \times N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left( f_{i,j} - \tilde{f}_{i,j} \right)^2 \tag{2.2} \]

where M and N are the width and height of the picture.
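Equations (2.1) and (2.2) translate directly into a few lines of Python (a sketch assuming 8-bit grayscale pictures stored as lists of rows):

```python
import math

def psnr(f, f_tilde, peak=255):
    """PSNR in dB between an original picture f and a reconstruction
    f_tilde of the same size, per equations (2.1) and (2.2)."""
    m, n = len(f), len(f[0])
    mse = sum((f[i][j] - f_tilde[i][j]) ** 2
              for i in range(m) for j in range(n)) / (m * n)
    return 20 * math.log10(peak / math.sqrt(mse))

# Every pixel off by 5 gives MSE = 25 and PSNR = 20*log10(255/5) = 34.15 dB
a = [[100, 100], [100, 100]]
b = [[105, 105], [105, 105]]
print(round(psnr(a, b), 2))  # 34.15
```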

PSNR is thus a simple and mathematically convenient way to measure image quality. Unfortunately it does not always correlate well with the perceived quality of the image, as visualized in Figure 2.2. Typical PSNR values range between 20 and 40 and are usually reported using two decimals.

A great deal of effort has been made to develop new objective image and video quality metrics by considering the characteristics of the human visual system. Surprisingly, only limited success has been achieved. It has been reported that none of the complicated objective image quality metrics in the literature has shown any clear advantage over simple mathematical measures such as PSNR under strict testing conditions and different image distortion environments. For example, in a test conducted by the Video Quality Experts Group (VQEG) in validating objective video quality assessment methods, there are eight to nine proposed models whose performance is statistically indistinguishable. Unfortunately, this group of models includes PSNR [27].

Further, the type of artifacts preferred can be subjective and also dependent on the target application. As an example, a smooth image with some details removed might "look better" than an image with more details preserved containing more coding artifacts. However, for some people or applications, preservation of details may have a higher priority than reduction of coding artifacts.

Figure 2.1: Coding of a P-frame. Upper left: Decoded frame t − 1, which here is used to predict frame t. Upper right: Uncompressed frame t with motion vectors referring to frame t − 1 superimposed. Middle left: Motion compensated prediction of frame t. Middle right: Decoded frame t. Bottom left: Coded residue, i.e. the difference between the decoded and predicted frame t. Bottom right: Quantized transform coefficients corresponding to the coded residue.

Measuring quality for video is even more difficult as motion should also be taken into account. Fine details in fast moving objects can for example be difficult to perceive by the human visual system.

2.3 Overview of H.264

H.264 is the newest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The main goals of the H.264 standardization effort have been enhanced compression performance and provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "nonconversational" (storage, broadcast, or streaming) applications. H.264 has achieved a significant improvement in rate-distortion efficiency relative to existing standards [29]. Some new concepts have been introduced, such as an in-loop deblocking filter (section 2.3.4) and intra prediction (section 2.3.5), but the basic structure is still a hybrid video coding scheme and the improvements have been achieved by refining each functional element. The features of the new design provide approximately a 50% bit rate savings for equivalent perceptual quality relative to the performance of prior standards [29].

In the following we briefly overview the most important functional elements of H.264, focusing on the improvements as compared to previous standards.

2.3.1 Frame Structure

Similar to previous standards, each picture is divided into macroblocks of size 16 × 16. The macroblocks are grouped into arbitrarily shaped slices

Figure 2.2: Illustration of the deficiency of PSNR. Here, all images (except the original) have similar PSNR. Top left: Original "Lena" image. Top right: Contrast stretched image, PSNR = 24.61. Bottom left: Blurred image, PSNR = 24.61. Bottom right: JPEG compressed image, PSNR = 24.81 [27].


of type I, P or B². An I-slice contains only I-macroblocks and a P-slice can contain both I- and P-macroblocks. B-slices can consist of any type of macroblock. The concept of rectangular frames has been replaced by the slice concept in H.264.

Since each slice in a coded picture can be (approximately) independently decoded, the H.264 design enables sending and receiving the slices of the picture in any order relative to each other. This capability can improve end-to-end delay in realtime applications, particularly when used on networks having out-of-order delivery behavior (e.g., Internet protocol networks) [29].

2.3.2 Motion Compensation

In standards prior to H.264, the block size for motion compensation is in general the same as the size of the macroblock, i.e. 16 × 16 pixels. P-frames typically use one reference frame and B-frames use two reference frames. Further, the reference frames are limited to adjacent pictures.

H.264 extends these concepts by introducing multiple reference picture motion compensation and variable block-size motion compensation.

Multiple Reference Picture Motion Compensation

As the name implies, this concept is about using more than one reference frame for prediction. In H.264, each macroblock can be predicted using any previously decoded frame in the sequence. This feature is useful for dealing with

• Motion that is periodic in nature. Suppose the video contains an object that is changing in a repetitive way, for example a flying bird with the wings going up and down. The wings are best predicted from a picture where they are in a similar position, which is not necessarily the preceding picture (figure 2.3).

• Alternating camera angles that switch back and forth between two different scenes. When switching back to a scene that has previously been encoded, the best prediction is made from a frame originating from the same scene.

• Occlusions. Once an object is made visible after occlusion, it is beneficial to do prediction from the frame where the object was last visible. A twinkling eye is an example of this, the eye being occluded from time to time.

² H.264 also defines switching slice types such as SP and SI [29], but we don't go into the details of those here.

Figure 2.3: Example of repetitive motion suitable for multiple reference picture motion compensation.

Variable Block-size Motion Compensation

H.264 extends the fixed block size previously used for motion compensation to a variable block size ranging from 4 × 4 to 16 × 16 pixels, including rectangular block sizes such as 8 × 4 and 8 × 16. Figure 2.4 illustrates the block sizes for a P-frame where all different block sizes are represented. The way in which the macroblock is split is usually referred to as the mode of the macroblock.

2.3.3 Transform and Quantization

The 8 × 8 DCT transform used in previous standards has been replaced by a 4 × 4 transform in H.264, which can be computed exactly using integer arithmetic. The transform is a close approximation to the DCT but has lower computational complexity due to the lack of floating point operations.

The standard DCT produces real valued output, and rounding errors are inevitably introduced when the output is truncated to a digital representation. The inverse transform will thus also contain errors, although the transform itself is orthogonal. As arithmetic precision is not standardized, this can cause mismatch between decoded data in the encoder and decoder. For isolated blocks this is not a problem: the rounding errors are very small and can in practice be neglected. However, video coding relies heavily on prediction, and predictions are made from previously decoded data. If the decoded data is not the same in the decoder as in the encoder, there will be error drifting, which accumulates over time. An attempt to solve this problem can be found in [10].

Figure 2.4: Motion vectors and respective block sizes in a P-frame. The motion vectors within a block are the same and only one vector per block is coded in the bitstream.

H.264 makes extensive use of prediction, since even the intra coding modes rely upon spatial prediction (section 2.3.5). As a result, H.264 is very sensitive to prediction drift. In prior standards, prediction drift accumulation can occur once per P-frame. In contrast, in H.264 prediction drift can occur much more frequently. As an illustration, in an I-frame, 4 × 4 blocks can be predicted from their neighbors. At each stage prediction drift can accumulate. For a CIF image, which has a width of 88 4 × 4 blocks, prediction drift can accumulate 88 times in decoding one row of an I-frame. Thus, it is clear that as a result of the extensive use of prediction in H.264, the residual coding must be drift free [15].
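The count of 88 follows directly from the CIF dimensions (a trivial check):

```python
# CIF pictures are 352 x 288 pixels; intra prediction operates on 4 x 4
# blocks, so one row of an I-frame holds 352 / 4 = 88 blocks, and spatial
# prediction drift can accumulate up to 88 times across the row.
cif_width = 352
block_size = 4
blocks_per_row = cif_width // block_size
print(blocks_per_row)  # 88
```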

The 4×4 transform allows signals to be represented in a more locally adaptive fashion, which reduces artifacts known as "ringing" [33]. Moreover, the use of small block sizes for motion compensation in H.264 significantly reduces correlation between 4×4 blocks, further motivating the choice of a 4×4 transform [15, 29].

The integer transform is achieved by introducing pre- and post-scaling matrices, and the scaling is integrated in the quantization process. Together with the desire to avoid division during quantization, this makes for a rather complicated process compared to previous standards.

For the advanced profiles, H.264 allows the transform to be applied to variable sized blocks. Apart from the standard 4×4 block size, 4×8, 8×4 and 8×8 are also available. This feature can increase the PSNR by around 0.5 dB [7]; however, it is not yet implemented in the official H.264 Reference Software.

A detailed description of the transform and quantization in H.264 can be found in [15] and [23].

2.3.4 In-loop Deblocking Filter

One of the most noticeable artifacts in video coding is the blocking artifact. This manifests itself as "blockiness": visible and disturbing artificial edges and squares in the decoded video. Two building blocks in MPEG video coding can cause this effect, the first being the block based motion compensation. Since there is almost never a perfect match, discontinuities on the edges of the copied blocks typically arise. Additionally, existing edge discontinuities in the reference frames are carried into the interior of the block to be compensated.


Figure 2.5: Effect of the deblocking filter on an I-frame. Left: Before deblocking. Right: After deblocking.

The second is the use of a block transform: it is well known that the coding errors are larger near the block boundaries than in the middle of the block [14].

For previous MPEG standards, post-filtering is typically applied in the decoder to reduce blocking artifacts. However, as it is only employed by the decoder, this filtering does not solve the problem of removing the artifacts from the reference frames. The encoder still uses blocky frames for prediction, and interior edges as discussed above are not reduced using this approach.

To solve this problem, a standardized in-loop deblocking filter is introduced in H.264, used by both the encoder and decoder. The filter is integrated in the coding loop and the reference frames are deblocked before they are used for prediction. In this way, interior edges are removed and the overall video quality is significantly improved.

Although the main objective of deblocking is to remove perceptually disturbing blocking artifacts, objective measures such as PSNR have also been shown to increase with this approach [14].

An overview of the deblocking filter used in H.264 can be found in [22].

2.3.5 Intra Prediction

If a macroblock is encoded in intra mode a prediction block is formed based on previously encoded and reconstructed blocks in the same frame. The residue is then transformed and quantized. The size of the predicted block can be either 4×4 or 16×16. This is a new concept not available in previous MPEG standards.


Figure 2.6: Coding process for an I-frame. Upper left: Uncompressed blocks. Upper right: Intra predicted blocks. Lower left: Decoded blocks before the deblocking filter (used for prediction). Lower right: Final decoded blocks.

There are several different modes available for intra prediction, depending on the direction of the prediction. As an example, if the picture contains a horizontal edge ranging over several macroblocks, the best prediction is likely to be made from a horizontally adjacent block. The intra prediction process is visualized in Figures 2.6 and 2.7.
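As a minimal sketch, one of the directional modes, horizontal prediction (the mode illustrated in Figure 2.7), simply copies the reconstructed pixel to the left of each row across the 4×4 block. The function name and pixel values below are hypothetical illustrations, not the reference software implementation:

```python
import numpy as np

def intra_4x4_horizontal(left_pixels):
    """Horizontal intra prediction of a 4x4 block: every row is filled
    with the previously decoded pixel to its immediate left."""
    left = np.asarray(left_pixels, dtype=np.uint8).reshape(4, 1)
    return np.repeat(left, 4, axis=1)  # 4x4 prediction block

# Hypothetical reconstructed left-neighbor pixels (A-D in Figure 2.7)
pred = intra_4x4_horizontal([100, 102, 104, 106])
```

The encoder would then transform and quantize the residue between the original block and this prediction.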

Intra prediction is a big contribution to the compression efficiency of intra coded pictures in H.264. For still images, the coding scheme of H.264 has been shown to outperform the latest wavelet based image coding standard, JPEG2000, in terms of rate and PSNR [7].

2.3.6 Entropy Coding

In H.264 there are two schemes available for entropy coding, called CAVLC (Context Adaptive Variable Length Coding) and CABAC (Context Adaptive Binary Arithmetic Coding). Both are based on context adaptation and adapt to the changing statistics of the video content. This is in contrast to previous MPEG standards, which typically employ static entropy coding schemes such as a fixed Huffman table and run length coding of zeros. CABAC is the most powerful scheme but also has a very high complexity. Compared to CAVLC it reduces the bitrate by about 9-14% [17].

Figure 2.7: Intra prediction in detail. Left: Decoded pixels before deblocking. The central 4×4 block is to be predicted using pixels A-K. Right: Intra predicted pixels. The central 4×4 block has been horizontally predicted using the previously decoded pixels A-D in the left picture.

More details on CABAC and CAVLC can be found in [16, 21, 17].

2.3.7 Rate-Distortion Optimization

Video encoding is a very complex process involving a huge number of parameters. For each macroblock in H.264, the coder must determine the

• Type (I, P or B)

• Coding mode (how to split the macroblock for motion compensation, or which intra prediction mode to use)

• Set of motion vectors (one motion vector for each block to be motion compensated)

• Quantization parameter (transform coefficient fidelity)

These parameters will determine the rate and distortion of the macroblock and subsequently the overall quality of the encoded sequence. Due to the use of prediction, the local parameters also affect the quality of other frames in the sequence which further complicates the parameter selection.


A straightforward strategy is to:

• Perform a full search for the motion vectors that minimize the prediction error using some error metric, usually SAD (Sum of Absolute Differences).

• Transform, quantize and entropy code the remaining residue.
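The first step above can be sketched as an exhaustive SAD-based block-matching search. The function name and frame layout below are illustrative assumptions, not the reference software implementation:

```python
import numpy as np

def sad_full_search(ref, cur_block, top_left, search_range):
    """Slide cur_block over the reference frame within +/- search_range
    around top_left and return the displacement (dy, dx) with the
    smallest Sum of Absolute Differences (SAD)."""
    y0, x0 = top_left
    h, w = cur_block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block lies outside the reference frame
            cand = ref[y:y + h, x:x + w].astype(int)
            sad = int(np.abs(cand - cur_block.astype(int)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Note that the cost here is prediction error only; the number of bits needed to code the resulting motion vector is ignored.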

A problem with this approach is that it does not take into account how many bits are used to represent parameters such as the mode and the motion vectors. In H.264, this strategy would result in many 4×4 blocks for motion compensation and thus many bits used to code motion vectors. Further, the number of bits spent on motion vectors would be relatively independent of the target bitrate. For a sufficiently low bitrate, most bits would be spent on coding the motion vectors and the transform coefficients would have a very low fidelity. It should be clear that the above strategy does not in general result in an optimal distribution of bits among different coding parameters such as motion vectors and transform coefficients. A joint optimization of rate and distortion is necessary. For this purpose, Lagrangian optimization is the ideal tool.

Lagrangian Optimization

The ultimate objective of video coding is to minimize the distortion of the coded sequence at a given target bitrate. This can be formulated as

min D subject to R ≤ Rc (2.3)

where D, R and Rc denote distortion, bitrate and target bitrate respectively. The solution to this minimization problem can be obtained using Lagrangian optimization, where the cost function J is defined as

J = D + λR (2.4)

Minimizing this function results in an optimal solution to (2.3) for a particular value of Rc [24, 25]. In the literature, J is often referred to as the rate-distortion cost or R-D cost.

In video coding, this result can be used when selecting the macroblock coding parameters [28]. By introducing a fixed Lagrangian multiplier λ and minimizing the Lagrangian cost for each macroblock individually, a near optimal solution can be found. If macroblocks were independent of each other, the solution would be optimal [19]. However, in video coding this is not the case due to the use of prediction.

Finding the coding parameters that minimize J can be a very time consuming process. For each combination of parameters, the rate and distortion must be evaluated, and in H.264 this can only be done by actually coding the macroblock. For each set of coding parameters, transformation, quantization and entropy coding must be performed. In H.264 there are 51 possible values of the quantization parameter and numerous possible modes and types available to code a macroblock. Considering the high complexity even for a single evaluation of the rate and distortion, it is clear that evaluating J for all possible combinations of parameters is not a practical approach.

In [25], a strong connection between the local Lagrangian multiplier and the quantization parameter QP was found experimentally as

λ ≈ 0.85 × 2^((QP − 12)/3) (2.5)

This was done by fixing a λ and, for each macroblock, selecting the quantization parameter that minimized J.

The resulting equation does not depend on properties of the video content such as spatial complexity or motion. The good news is that given a QP, a near optimal λ can be selected. Thus the need to evaluate the rate and distortion for all possible values of QP is removed. Equation 2.5 has been a fundamental part in the design of the H.264 Reference Software.
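As a quick sketch (the helper name is ours), equation 2.5 can be evaluated as:

```python
def lagrange_multiplier(qp):
    """Empirical relation from equation 2.5:
    lambda ~= 0.85 * 2^((QP - 12) / 3)."""
    return 0.85 * 2 ** ((qp - 12) / 3)

# lambda doubles for every increase of 3 in QP, i.e. coarser
# quantization makes rate savings increasingly valuable.
lam = lagrange_multiplier(28)
```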

Lagrangian Optimization in H.264 Reference Software

The rate-distortion optimization process in H.264 can be summarized as follows:

1. A QP is selected for the macroblock and λ is selected according to equation 2.5.

2. Motion vectors for all possible block sizes are obtained.

In H.264 motion vectors are predicted and entropy coded. Hence, a smooth motion vector field requires fewer bits to code than a random motion vector field. For this reason, the vectors are searched for using rate-distortion optimization principles: each motion vector candidate is evaluated using a Lagrangian cost function including both rate and distortion.

3. Rate and distortion are evaluated for all possible modes and types for this macroblock.

4. The type and mode that minimizes the Lagrangian cost J = D + λR is selected for final coding.
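The mode decision in steps 3-4 can be sketched as follows. The candidate modes and their (distortion, rate) pairs are hypothetical numbers; in the real encoder each pair would come from actually coding the macroblock:

```python
def select_mode(candidates, lam):
    """Pick the mode minimizing J = D + lambda * R.
    candidates maps mode name -> (distortion, rate_in_bits)."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Hypothetical measurements for one macroblock
modes = {"inter_16x16": (900, 20), "inter_4x4": (450, 70), "intra_4x4": (600, 45)}
best_low_rate = select_mode(modes, lam=10.0)   # a high lambda favors cheap modes
best_high_rate = select_mode(modes, lam=1.0)   # a low lambda favors low distortion
```

With these numbers, λ = 10 selects intra_4x4 while λ = 1 selects inter_4x4, illustrating how the multiplier steers the rate-distortion trade-off.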

This process is highly computationally intensive and some research effort has been put into reducing the complexity [18, 32].

We note that rate-distortion optimization is not a normative part of the standard and is only an issue when designing an encoder. An H.264 compliant encoder does not have to use rate-distortion optimization. However, the high performance of the H.264 Reference Software is largely due to Lagrangian optimization and it is likely that most H.264 encoders will use similar optimization techniques.


Chapter 3

H.264 Developer Tool

3.1 Introduction

For each frame in H.264, a prediction picture is formed using up to 32 motion vectors¹ per macroblock. The macroblock is split into variable sized smaller units which are separately motion compensated, and the motion vectors can refer to any previously encoded frame in the sequence. Further, a typical frame rate is about 25 frames per second. The amount of data generated by this coding scheme is very large and difficult to overview. If the development environment is a C or C++ compiler, the data is basically accessed by printing to the console or tracking variables while debugging. This representation makes it difficult to detect structures in the data which might be clearer using a better GUI representation. A clearer data representation can

• Increase the understanding of the encoder and the algorithms used.

• Help identify parts of the encoder which can be improved.

• Make it easier to evaluate and compare algorithms, for example comparison of motion vectors generated by a fast algorithm and motion vectors resulting from a full search.

• Simplify debugging.

¹A B-macroblock containing 16 4×4 blocks with one forward and one backward motion vector for each block.


3.2 Requirements

In the following two subsections, the desired functionality of "H.264 Tool" is listed. The first subsection identifies the data to visualize or present in some other way. The second subsection lists the application oriented requirements and focuses on how the data should be presented.

3.2.1 Data Presentation

• Uncompressed picture

• Decoded picture

• Predicted picture

• Picture before deblocking filter

• Motion vectors
  – Forward and backward motion vectors in different colors.

• Size of motion compensated blocks

• Transform coefficients

• Macroblock type
  – Intra, skipped etc.

• PSNR information
  – Each frame
  – Global average
  – A curve for the whole sequence

• Bit distribution
  – Within individual macroblocks: how many bits are used for motion vectors, transform coefficients, macroblock mode etc.
  – Among macroblocks within a picture.
  – Among pictures within a sequence.


3.2.2 Application

• Frame selection
  – The user should be able to select the frame for which to show data.

• Show difference pictures
  – Show the difference between the uncompressed, decompressed or predicted picture.

• Comparison of data
  – It should be possible to compare data from different encoders.
  – Several windows should be able to open, to put pictures side by side for comparison.
  – Switching of data within the same window, for easy spotting of differences.

• Zooming
  – The user should be able to zoom pictures and graphs.

• Animation
  – The data should be able to animate, for example motion vectors superimposed on the sequence during playback.

• Macroblock highlighting
  – Highlighting of selected macroblock types (intra, skipped etc.)

3.3 Architecture

The basic idea is to modify the encoder and decoder to write the relevant data to files while encoding and decoding. After encoding and decoding is finished, the data is read by H.264 Tool for visualization and presentation (Figure 3.1). Thus, the codec and application are rather independent, only sharing a common data format. The data format has been kept as simple as possible to allow simple parsing and fast access in MATLAB.


Figure 3.1: Data flow from the encoder and decoder to H.264 Tool. During the encoding and decoding process, data such as motion vectors and predicted pictures are written to files. These files are then read by H.264 Tool for visualization and presentation.


3.4 Implementation

3.4.1 Modification of H.264 Reference Software

The first problem of the implementation was to get acquainted with the H.264 Reference Software [1]. This software is written in C, has many contributors and contains a lot of function calls. Moreover, the documentation is rather sparse.

Initially, some schematic diagrams were produced to find the structure of the data flow between the different modules. By debugging and observing where different functions were called, we could get a rough overview of the encoding and decoding process.

The next step was to identify the relevant data. For some data this was fairly easy. As an example, the motion vectors were stored in a matrix containing all the motion vectors of a frame. Other data, like the predicted picture, only existed in the scope of decoding one macroblock, and additional code for buffering had to be written. In parallel with this, simple visualization routines were written in MATLAB to continuously verify the consistency of the data.

3.4.2 MATLAB Application

To create the graphical user interface we used GUIDE, an application included in MATLAB for this purpose. Using this, it is easy to connect MATLAB code to standard GUI elements such as buttons and checkboxes. Compared to C/C++, there are several advantages of implementing H.264 Tool in MATLAB, for example

• Access to high quality data and signal processing functionality, which makes it a breeze to analyze data (histograms, graphs etc.) and apply signal processing to images (preprocessing, postprocessing etc.).

• Built in functionality for producing good looking graphs and plots.

• Simplicity of the MATLAB programming language.

3.5 Results

The requirements from sections 3.2.1 and 3.2.2 have been implemented. The resulting application and the overall functionality are presented here by a number of screenshots with detailed captions.


Figure 3.2: "Control Panel" of H.264 Tool. The current frame number is 182 and data from two different encoders are being compared. The buttons to the left are used to open different windows for data analysis.

Figure 3.3: Typical usage of H.264 Tool. The leftmost window shows the decoded picture with motion vectors superimposed. In the middle is the predicted picture, and the rightmost plot shows the size of the motion compensated blocks. Each of these windows is opened by pressing "New Plot" on the "Control Panel".


Figure 3.4: "Plot View" of H.264 Tool. In the "Codec" drop down menu the user selects which encoder to view. "MV Scale" scales the motion vectors if present. "View" and "Subtract" set the data to view and subtract. Here, the predicted picture is subtracted from the original, uncompressed picture. The checkboxes superimpose information on the picture, such as motion vectors and block sizes for motion compensation. This picture has been zoomed in, which is done by pressing the left mouse button on the center point.


Figure 3.5: Decoded picture from two different encoders compared side by side. Both plots superimpose the block size used for motion compensation. The right plot marks skipped macroblocks with black color. Note the slight difference in the block sizes for motion compensation.


Figure 3.6: Global PSNR and bit distribution. Accessed by pushing ”Global Stat” on the ”Control Panel”.


Figure 3.7: Left: Transform coefficients. Middle: Bit distribution, mean squared error (MSE) and quantization parameter (QP) for each macroblock of the current frame. Right: Bit distribution for the frame.


3.6 Concluding Discussion

A MATLAB tool for research and development of an H.264 codec has been implemented. H.264 Tool makes it easy to observe coding statistics and data such as bit distribution, motion vectors and predicted pictures. Comparison of different algorithms becomes more convenient, and the visual presentation makes it easier to get ideas for new algorithms. H.264 Tool can also be used to aid understanding of the H.264 standard.

During the project, H.264 Tool has been used extensively to evaluate and compare different algorithms. Creating the tool gave much insight into the H.264 Reference Software, which was very useful when starting to actually modify the encoder. Further, most images in this report, such as the visualizations of intra prediction, motion compensated block sizes and motion vectors, were created using H.264 Tool.

The data extraction from the encoder and decoder has been kept to the decoder to the furthest extent possible. As a result, external H.264 bitstreams can also be analyzed by the tool. However, some data is more difficult to extract from the decoder. In the current implementation, the bit distribution is still fetched from the encoder. With some more work, though, we believe that this should be fairly easy to move to the decoder.

Although outside the scope of this project, an interesting point is that the structure of H.264 Tool makes it possible to also compare different coding standards, as long as the source code implementing the standard can be modified to generate the same kind of data as H.264 Tool can read. For example, it would be fairly easy to modify a codec implementing a previous MPEG standard to do this. Except for some minor differences, data such as predicted frames, motion vectors and transform coefficients are pretty much the same in, for example, MPEG-2 as in H.264.


Chapter 4

Transform Coefficient Thresholding

4.1 Introduction

The process of discarding transform coefficients that have a non-zero value after quantization is usually referred to as coefficient thresholding. By doing this in a rate-distortion framework, substantial gains in PSNR can be obtained for JPEG and previous MPEG video standards [4, 20]. In these standards, static variable length coding of transform coefficients is combined with run-length coding of zeros. This makes it possible to determine, for each coefficient, its rate and distortion, and algorithms based on dynamic programming can be designed to achieve the R-D optimum. However, the entropy coding scheme in H.264 is more complex. It utilizes binary arithmetic coding and context adaptation to dynamically update the probability distributions. Performing R-D optimized thresholding for this scheme is difficult. We target the simpler problem of whether to discard or to keep the coefficients in an 8×8 block consisting of four transform blocks. This decision is integrated into the R-D optimization process of the H.264 Reference Software [1] and a block is discarded only if this results in a lower R-D cost.

4.2 Thresholding in H.264 Reference Software

In the H.264 Reference Software, all transform coefficients within an 8×8 block are set to zero if their total cost is below a threshold, which is set to 4. Noting that the transform block size in H.264 is 4×4, this thresholding can be described as


Σ_{i=0}^{3} Σ_{j=0}^{15} c(i, j) ≤ 4 (4.1)

where i is the transform block index in raster scan order and j is the transform coefficient index in zig-zag scan order. c(i, j) is a cost function defined in the H.264 Reference Software as

c(i, j) =
  ∞  if coeff(i, j) > 1
  3  if j = 0 and coeff(i, j) = 1
  2  if 1 ≤ j ≤ 2 and coeff(i, j) = 1
  1  if 3 ≤ j ≤ 5 and coeff(i, j) = 1
  0  otherwise

where coeff(i, j) is the transform coefficient value after quantization. Thus, only coefficients with value 1 can be discarded, and the coefficient index, or frequency, determines the cost. For example, if all non-zero coefficients within an 8×8 block have the value 1 and their associated coefficient indices are greater than 5, the total cost will be zero and the block will be discarded. If two out of the four DC coefficients have the value 1, the cost will be 6 and the block is retained.
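This rule can be sketched as follows, treating coeff(i, j) as the magnitude of the quantized level (an assumption; function names are ours, not the reference software's):

```python
import math

def coeff_cost(j, level):
    """Cost of one quantized coefficient: j is the zig-zag index
    within a 4x4 block, level the quantized magnitude."""
    if abs(level) > 1:
        return math.inf   # blocks with larger coefficients are never discarded
    if level == 0:
        return 0
    if j == 0:
        return 3          # DC coefficient
    if j <= 2:
        return 2
    if j <= 5:
        return 1
    return 0              # high-frequency ones are essentially free to drop

def discard_8x8(blocks):
    """blocks: four lists of 16 quantized levels in zig-zag order.
    The whole 8x8 block is zeroed if the summed cost is at most 4."""
    total = sum(coeff_cost(j, c) for block in blocks for j, c in enumerate(block))
    return total <= 4
```

Both examples from the text fall out directly: ones above index 5 sum to a cost of zero (discard), while two DC ones cost 6 (retain).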

4.3 R-D Optimized Thresholding

We note that:

• P-frames are used as reference frames, and global performance might be improved by using a lower threshold.

• Thresholding can be rate-distortion optimized.

Based on these observations, we evaluate the R-D cost twice instead of once for each 8×8 block. The rate and distortion are first obtained for the case of keeping all coefficients and then for the case of discarding all coefficients. The alternative with the lowest R-D cost is then selected for final coding. For non-reference frames, the R-D cost is evaluated using the same Lagrangian multiplier as for mode selection. For reference frames a lower value is used; in the experiments we scale the original Lagrangian multiplier by 0.2. This has the effect that coefficients are only discarded if their rate is very high compared to their energy.
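The decision can be sketched as follows; the 0.2 scale follows the description above, while the function shape and the (distortion, rate) numbers are illustrative assumptions:

```python
def threshold_8x8(rd_keep, rd_discard, lam, is_reference, ref_scale=0.2):
    """Evaluate the R-D cost of keeping vs. discarding all coefficients
    of an 8x8 block and return the cheaper alternative.
    rd_keep, rd_discard: (distortion, rate) pairs from actually coding
    the block both ways."""
    if is_reference:
        lam *= ref_scale  # discard only when the rate saving is very large
    cost_keep = rd_keep[0] + lam * rd_keep[1]
    cost_discard = rd_discard[0] + lam * rd_discard[1]
    return "discard" if cost_discard < cost_keep else "keep"
```

For instance, with λ = 10 and hypothetical measurements keep = (100, 50) and discard = (400, 2), a non-reference block is discarded (cost 420 vs. 600), but the same block in a reference frame is kept (404 vs. 200 at the scaled λ = 2), matching the intended bias toward higher quality reference frames.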


Table 4.1: Test Conditions.

MV Resolution            1/4 pel
Hadamard                 ON
RDO                      ON
Search Range             ±32
Reference Frames         1
Restricted Search Range  2
Symbol Mode              CABAC
GOP Structure            IBBP
Intra Period             0
Slice Mode               OFF

4.4 Experiments

The proposed thresholding scheme (PTHRES) has been integrated into version 6.1c of the H.264 Reference Software [1]. The setup of the encoder is given in Table 4.1. In Table 4.2 the average rate and PSNR are compared. This average has been calculated using the method in [2], which is based on interpolation of the R-D curve from a number of known points. We obtain these points by encoding the sequences four times using a fixed QP of 28, 32, 36 and 40 respectively. The corresponding rate-PSNR curves are shown in Figure 4.1.

In Figures 4.2 and 4.3 we observe that PTHRES distributes more bits to P-frames and fewer bits to B-frames compared to the original scheme. This is expected, and the reason is the lower value of the Lagrangian multiplier used for P-frames, which makes distortion more "expensive" than rate. As a result, the difference in PSNR between P-frames and B-frames increases.

For most B-frames, PTHRES achieves a higher PSNR than H.264 RS even though fewer bits are used. This is due to the higher quality reference frames, which increase the quality of the B-frames as well.

4.5 Conclusions

A new coefficient thresholding scheme for H.264 has been proposed, based on the current frame type and Lagrangian optimization. The new scheme increases the PSNR for all sequences tested. The average improvement over all sequences is 0.25 dB, or equivalently a bitrate reduction of 5.0%.


Figure 4.1: Rate-PSNR curves for the proposed thresholding scheme compared to the H.264 Reference Software.


Table 4.2: Average improvement in PSNR and rate for the proposed thresholding scheme compared to H.264 RS using fixed QP, calculated according to [2]. The columns should be regarded as equivalent in the sense that there is either the increase in PSNR or the decrease in rate.

Sequence      ∆Rate(%)  ∆PSNR
foreman.qcif  -4.13     +0.20
mobile.qcif   -6.68     +0.31
news.qcif     -5.30     +0.30
silent.qcif   -5.01     +0.23
stefan.qcif   -7.12     +0.39
weather.qcif  -1.49     +0.10
Average       -4.96     +0.25

Figure 4.2: Comparison between the proposed thresholding scheme and the H.264 Reference Software. 300 frames of Foreman (QCIF) encoded with fixed QP = 28.


Figure 4.3: Close-up of the graphs in Figure 4.2. Using the proposed thresholding scheme, most B-frames have a higher PSNR, even though they use fewer bits.


The added complexity of the proposed scheme is one extra evaluation of the R-D cost for 8×8 blocks. Informal experiments indicated that this extra function call increases the encoding time by around 5%.

The new scheme distributes slightly more bits to reference frames and fewer bits to non-reference frames compared to H.264 RS. In the next chapter we specifically investigate how to optimize the bit distribution among different frame types.


Chapter 5

Lagrangian Multiplier Selection

5.1 Introduction

Rate-distortion optimization for video compression has proven to yield substantial improvements in perceived quality, and the H.264 Reference Software [1] is designed around this concept. In the H.264 Reference Software, R-D optimization is performed for each frame individually, and the problem of distributing the bits among frames to achieve the global optimum is not addressed. This is usually the case in video coding, as the strong dependence between frames makes global optimization very complex. In this chapter we address the problem of how to select the local coding parameters for non-reference frames in order to maximize the global performance.

5.2 Lagrangian Multiplier for B-frames

The objective of video coding is to minimize the distortion of the coded sequence at a given target bitrate. This can be formulated as

min D(R) subject to R ≤ Rc (5.1)

where D, R and Rc denote distortion, bitrate and target bitrate respectively. The solution to this minimization problem can be obtained using Lagrangian optimization, where the cost function J is defined as

J(R) = D + λR (5.2)


Minimizing this function results in an optimal solution to (5.1) for a particular value of Rc [24, 25]. The minimum value of J(R) can be found by setting its derivative to zero, which yields

λ = −∂D(R)/∂R (5.3)

Thus, at optimality, λ corresponds to the negative slope of the R-D curve. In general, rate and distortion can be assumed additive [28] and (5.2) can be formulated as

J(R0, . . . , RN−1) = Σ_{i=0}^{N−1} Di(R0, . . . , RN−1) + λ Σ_{i=0}^{N−1} Ri (5.4)

where i is the frame number and N the total number of frames in the sequence. This equation can be minimized by setting all its partial derivatives with respect to Ri to zero. For an arbitrary frame k we have

∂J/∂Rk = Σ_{i=0}^{N−1} ∂Di/∂Rk + λ Σ_{i=0}^{N−1} ∂Ri/∂Rk = Σ_{i=0}^{N−1} ∂Di/∂Rk + λ

and we obtain the equation system

∂J/∂R0 = Σ_{i=0}^{N−1} ∂Di/∂R0 + λ = 0
  ⋮
∂J/∂Rk = Σ_{i=0}^{N−1} ∂Di/∂Rk + λ = 0
  ⋮
∂J/∂RN−1 = Σ_{i=0}^{N−1} ∂Di/∂RN−1 + λ = 0

As

Σ_{i=0}^{N−1} ∂Di/∂Rk + λ = ∂Dk/∂Rk + Σ_{i≠k} ∂Di/∂Rk + λ = 0 (5.5)

for each k, we conclude that

∂Dk/∂Rk = −λ − Σ_{i≠k} ∂Di/∂Rk (5.6)

should hold at optimality.

In practical video coding it is impossible to consider overall global dependencies and the optimization must be performed more or less locally. A frame level Lagrangian cost function can be defined as

Jk(Rk) = Dk + λk Rk (5.7)


which is minimized analogously to (5.2), yielding

λk = −∂Dk/∂Rk (5.8)

However, our objective is to minimize the global R-D cost and a question that arises is

How should the local Lagrangian multiplier λk be selected in order to minimize the global R-D cost?

Noting that (5.6) and (5.3) should hold at the global optimum, (5.8) can be rewritten as

λk = −∂Dk/∂Rk = λ + Σ_{i≠k} ∂Di/∂Rk = −∂D/∂R + I(k), where I(k) = Σ_{i≠k} ∂Di/∂Rk (5.9)

The optimal local Lagrangian multiplier thus reflects the impact of the local rate on the global distortion which is represented by I(k).

In inter frame video coding there is a strong dependence between frames and I(k) will in general have a large negative value. This is due to the fact that higher quality reference frames also improve the quality of subsequently coded frames. In general, I(k) is difficult to approximate. However, a special case is when there is no dependence between frames; then I(k) is equal to zero. Lagrangian optimization under similar independence assumptions is the most commonly studied in the literature [24, 28].

We observe that this special case is valid for non-reference frames, as their rate does not affect the distortion of other frames¹, and conclude that for non-reference frames, the local Lagrangian multiplier should be set to the negative slope of the global R-D curve.

5.2.1 Lagrangian Multiplier Selection in H.264 Reference Software

In [25], a strong connection between the local Lagrangian multiplier and the quantization parameter QP was found experimentally as

λk ≈ 0.85 × 2^((QP − 12)/3) (5.10)

¹When using rate control this is not entirely true. Here we consider strict Lagrangian optimization, where the rate cannot be explicitly controlled but is merely a result of the selected Lagrangian multiplier.


In the H.264 Reference Software, the Lagrangian multiplier for I- and P-frames (λI,P) is set according to this relation. For non-reference frames, or equivalently B-frames in the H.264 Reference Software², it is modified as

λB = 4λI,P (5.11)

where λI,P is derived from a given QP according to (5.10).

5.2.2 Proposed Lagrangian Multiplier Selection

For non-reference frames, we suggest that the local Lagrangian multiplier be set equal to the global Lagrangian multiplier. We then select QP using (5.10), which has been shown to be a near optimal mapping between the quantization parameter and the Lagrangian multiplier [25]. Solving for QP in this equation yields

QPB = 3 log2(λ / 0.85) + 12 (5.12)

where QPB denotes the quantization parameter for B-frames. Thus, in contrast to the H.264 Reference Software, QP for B-frames is adapted to the negative slope of the global R-D curve instead of being set according to (5.11) given a QP.
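A sketch of equation 5.12 (the function name is ours; a real encoder would additionally round and clip the result to the valid integer QP range 0-51):

```python
import math

def qp_for_b_frames(lam):
    """Invert equation 5.10: the B-frame QP corresponding to a given
    global Lagrangian multiplier, QP_B = 3 * log2(lambda / 0.85) + 12."""
    return 3 * math.log2(lam / 0.85) + 12
```

As a sanity check, feeding in the λ produced by (5.10) for some QP should return that same QP.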

5.3 Approximation of the Global Rate-Distortion Function

In general, the global R-D curve is not known a priori and is subject to approximation. Experimentally, we have found the following equation to approximate the global PSNR curve well for a variety of sequences and bitrates (Figure 5.1):

PSNR ≈ C1 + 10√2 log(R) (5.13)

where C1 is a constant dependingon the source statistics. Usingthe def-inition of PSNR, this can be shown equivalent to the distortion D being approximated as

D ≈ C / R^√2 (5.14)

where C is another constant depending on the source statistics. The derivation

² Unlike previous MPEG standards, B-frames in the H.264 standard are not restricted to being non-reference frames. However, in the H.264 Reference Software they are.



Figure 5.1: Approximation of the global rate-distortion function for different values of C1. The operational rate-distortion functions (solid) have been obtained by encoding the sequences using fixed QP.

is conducted as follows:

    PSNR = 20 log(255/√D) ≈ C1 + 10√2 log(R)
    ⟹ 255/√D ≈ 10^((C1 + 10√2 log(R))/20)
    ⟹ D ≈ 255² / 10^((C1 + 10√2 log(R))/10)
        = 255² / (10^(C1/10) · 10^(√2 log(R)))
        = C / (10^(log(R)))^√2
        = C / R^√2

where the new constant C replaces 255² / 10^(C1/10).
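The equivalence of the two models can also be checked numerically; the values of C1 and R below are hypothetical, chosen only for illustration:

```python
import math

C1, R = 30.0, 500.0   # hypothetical constant and rate

# PSNR model (5.13)
psnr = C1 + 10 * math.sqrt(2) * math.log10(R)

# Distortion model (5.14) with C = 255^2 / 10^(C1/10)
C = 255 ** 2 / 10 ** (C1 / 10)
D = C / R ** math.sqrt(2)

# Recovering PSNR from D reproduces the PSNR model
psnr_from_D = 10 * math.log10(255 ** 2 / D)
print(abs(psnr - psnr_from_D) < 1e-9)  # True
```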

As λ should be equal to the negative slope of the R-D curve, we have

λ ≈ −∂/∂R (C / R^√2) = √2 C / R^(1+√2) ≈ √2 D / R (5.15)

where (5.14) is used to obtain the final approximation. Thus, λ can be approximated using the total rate and distortion. Unfortunately, these quantities are also not known a priori. Assuming that the intra period is known and the GOP structure is fixed, we use

R_avg ≈ (R̄_I + R̄_{P,B} (N_GOP − 1)) / N_GOP

D_avg ≈ (1/k) Σ_{i=0}^{k−1} D_i

to estimate the average rate and distortion, denoted R_avg and D_avg. N_GOP is the total number of frames in a GOP, R̄_I and R̄_{P,B} denote the current average rates for intra and inter frames, respectively, and k is the current frame number.
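Putting the pieces together, the running-average estimate of λ from (5.15) can be sketched as follows. The function and variable names are ours, not from the Reference Software:

```python
import math

def estimate_lambda(r_intra, r_inter, n_gop, distortions):
    """Estimate lambda ~ sqrt(2) * D_avg / R_avg as in (5.15),
    using the running averages R_avg and D_avg defined above.

    r_intra, r_inter -- current average rates for intra/inter frames
    n_gop            -- total number of frames in a GOP
    distortions      -- per-frame distortions D_0 .. D_{k-1} seen so far
    """
    r_avg = (r_intra + r_inter * (n_gop - 1)) / n_gop
    d_avg = sum(distortions) / len(distortions)
    return math.sqrt(2) * d_avg / r_avg
```

The resulting λ would then be inserted into (5.12) to obtain QP_B for the next B-frame.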
