On Computational Complexity of Motion Estimation Algorithms in MPEG-4 Encoder


Muhammad Shahid

This thesis report is presented as a part of degree of Master of Science in Electrical Engineering

Blekinge Institute of Technology, 2010

Supervisor: Tech Lic. Andreas Rossholm, ST-Ericsson


Abstract

Video encoding in mobile equipment is a computationally demanding feature that requires well designed and well developed algorithms. The optimal solution requires trade-offs in the encoding process; motion estimation, for example, trades low complexity against high perceptual quality and efficiency. This thesis works on reducing the complexity of the motion estimation algorithms used for MPEG-4 video encoding, taking the SLIMPEG motion estimation algorithm as reference. Inherent properties of video, such as spatial and temporal correlation, have been exploited to test new motion estimation techniques. Four motion estimation algorithms have been proposed, and their computational complexity and encoding quality have been evaluated. The resulting encoded video quality has been compared against the standard Full Search algorithm, while the reduction in computational complexity of the improved algorithms is compared against SLIMPEG, which is already about 99% more efficient than Full Search in terms of computational complexity. The fourth proposed algorithm, Adaptive SAD Control, offers a mechanism for dynamically choosing the trade-off between computational complexity and encoding quality.


Acknowledgements

It is a matter of great pleasure to express my deepest gratitude to my advisors Dr. Benny Lövström and Andreas Rossholm for all their guidance, support and encouragement throughout my thesis work. It was a great opportunity to do research work at ST-Ericsson under the marvelous supervision of Andreas Rossholm, and the counseling provided by Benny Lövström was of great value in writing up this manuscript.

I am also grateful for the help I received from Fredrik Nillson and Jimmy Rubin of ST-Ericsson in setting up the working environment and getting started with the ST-E algorithm. I owe my successes in life so far to all of my family members, for their magnificent kindness and love!


Contents

Abstract
Acknowledgements

1 Introduction

2 Basics of Digital Video
  2.1 Color Spaces
  2.2 Video Quality
  2.3 Representation of Digital Video
  2.4 Applications
    2.4.1 Internet
    2.4.2 Video Storage
    2.4.3 Television
    2.4.4 Games and Entertainment
    2.4.5 Video Telephony

3 Video Compression Fundamentals
  3.1 CODEC
  3.2 A Video CODEC
  3.3 Video Coding Standards
    3.3.1 MPEG-1
    3.3.2 MPEG-2
    3.3.3 MPEG-4
    3.3.4 MPEG-7
    3.3.5 MPEG-21
    3.3.6 H.261
    3.3.7 H.263
    3.3.8 H.263+
    3.3.9 H.264
  3.4 MPEG-4
  3.5 Syntax

4 Motion Estimation and its Implementation
  4.1 Block Matching
  4.2 Motion Estimation Algorithms
    4.2.1 Full Search
    4.2.2 Three-Step Search
    4.2.3 Diamond Search
    4.2.4 SLIMPEG

5 Rate Distortion Optimization and Bjontegaard Delta PSNR
  5.1 Measurement of Distortion
  5.2 Bjontegaard Delta PSNR

6 Simulation, Results and Discussion
  6.1 SAD as a Comparison Metric
  6.2 Proposed Techniques
    6.2.1 Spatial Correlation Algorithm
    6.2.2 Temporal Correlation Algorithm
    6.2.3 Adaptive SAD Control
  6.3 Simulations with different video sequences
    6.3.1 Football Sequence
    6.3.2 Foreman Sequence
    6.3.3 Claire Sequence

7 Conclusion and Future Work

List of figures
List of tables
Bibliography


Introduction

Since the advent of the first digital video coding standard, issued in 1984 by the International Telecommunication Union (ITU), the technology has seen great progress. The two main standard-setting bodies in this regard are the ITU and the International Organization for Standardization (ISO).

ITU recommendations include standards such as H.261/262/263/264, which focus on applications in the area of telecommunication. The Moving Picture Experts Group (MPEG) of ISO has released standards such as MPEG-1/-2/-4, which focus on applications in the computer and consumer electronics area. The standards defined by these two groups have some parts in common, and some of the work has been performed jointly. The field of video compression develops continuously, with enhancements to previous versions of the standards and the introduction of new recommendations. The MPEG-4 standard is followed in this thesis work.

Video compression is a core requirement in practically any multimedia storage or transmission scenario: the video is encoded before being sent or stored, and decoded at the receiver end or when viewed. Besides the presence of digital video in television and on CD/DVD, cellular phones will probably be the next place where video content is heavily used. The limited storage capacity of mobile equipment dictates the need for efficient video compression tools. Video encoding in mobile equipment has developed from a high-end feature into something that is taken for granted. Nevertheless, it is a computationally demanding feature that requires well designed and well developed algorithms, and many different algorithms need to be evaluated in order to come close to the optimal solution.

As early as 1929, Ray Davis Kell described a form of video compression for which he obtained a patent [1]. Since a video is actually a series of pictures transmitted at some designated rate, Kell's patent gave rise to the idea of transmitting the difference between successive images instead of sending each whole image. However,


it took a long time for the idea to be implemented in practice, but it remains a keystone of many video compression standards today. Connected to this idea is the concept of motion estimation, which tries to exploit the temporal correlation present between video frames.

It predicts the motion found in the current frame using already encoded frames. The residual frame then contains much less energy than the actual frame, so the motion vectors and the residual frame can be encoded at a much lower bit rate than a regular frame.

Motion estimation may require a tremendous amount of computational work inside the video coding process. Several algorithms exist for performing motion estimation. The basic class, Full Search algorithms, gives optimal performance but is computationally very time consuming. To deal with this, many sub-optimal fast search algorithms have been designed; this thesis focuses on some of them in an attempt to improve the performance of one. The SLIMPEG motion estimation algorithm is taken as the reference, and inherent video properties such as spatial and temporal correlation have been exploited to devise less complex yet well performing motion estimation algorithms.

The rest of the report is organized as follows: Chapter 2 and Chapter 3 deal with the fundamentals of digital video and video compression, respectively. Implementation aspects of motion estimation are explored in Chapter 4, ending with an introduction of the SLIMPEG motion estimation algorithm.

Rate distortion and delta PSNR are the contents of Chapter 5. The results of the main contribution are provided and discussed in Chapter 6. Chapter 7 contains the conclusion and some hints about future work in the field.


Basics of Digital Video

A video image is obtained by capturing a 2D plane view of a 3D scene.

Digital video, then, is a sequence of spatially and temporally sampled frames. The spatio-temporal sampling unit, usually called a pixel (picture element), is represented by a digital value describing its color and brightness. The more sampling points used to form the video frame, the higher the visual quality usually is, but at the cost of higher storage capacity.

The video frame is usually rectangular. The smoothness of a video is determined by the rate at which its frames are presented in succession; a video with a frame rate of thirty frames per second looks fairly smooth for most purposes. A general comparison of the appearance of a video as determined by its frame rate is given in Table 2.1 [2].

Table 2.1: Video frame rates.[2]

Frame rate                   Appearance
Below 10 frames per second   'Jerky', unnatural appearance to movement
10-20 frames per second      Slow movement appears OK; rapid movement is clearly jerky
20-30 frames per second      Movement is reasonably smooth
50-60 frames per second      Movement is very smooth

2.1 Color Spaces

The pixel may be represented by just one number (grey-scale image) or by multiple numbers (colored image). A particular scheme used for representing colors is called a color space. Two of the most common schemes are RGB (red/green/blue) and YCrCb (luminance/red chrominance/blue chrominance).

In the RGB color space, each pixel is represented by three numbers indicating the relative proportions of the three colors. Each of the numbers usually consists of eight bits, so one pixel requires twenty-four bits for its complete representation. Psycho-visual experiments have shown that the human optical system is less sensitive to color than to luminance.

This fact is exploited in the YCrCb color space, where luminance is concentrated in the Y component and the color information is contained in the remaining components. The two color spaces are related by a transformation, so one representation can be converted into the other; for details, see [2].
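The conversion between the two color spaces can be illustrated with a small routine. This is a minimal sketch assuming the commonly used ITU-R BT.601 full-range coefficients; the exact constants depend on the standard and signal range, and the function name is purely illustrative:

```python
# RGB -> YCbCr sketch using ITU-R BT.601 full-range coefficients
# (an assumption; broadcast-range variants use different scaling).

def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

# A grey pixel carries no colour information: Cb and Cr sit at the
# mid-point 128, so all the signal energy ends up in the Y component.
y, cb, cr = rgb_to_ycbcr(200, 200, 200)
```

Concentrating the energy in Y is precisely what makes chrominance subsampling (as in 4:2:0) possible without much visible loss.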

2.2 Video Quality

Video quality is an important parameter and, being judged by humans, inherently a subjective issue. There are many objective criteria for measuring video quality, e.g. PSNR, whose results correlate to some degree with human experience. However, they may not satisfy the demands of the subjective experience of a human observer: experiments show that a picture with a lower PSNR may look visually better than one with a higher PSNR. Human visual experience also varies from person to person, which raises the need for alternatives that cover both objective and subjective tests. An objective test that matches the human visual experience well will give acceptable results.
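As an illustration of the objective metric mentioned above, PSNR is derived from the mean squared error between two frames. A minimal sketch, with frames represented as flat lists of 8-bit samples (the function name is illustrative):

```python
import math

def psnr(original, distorted, peak=255.0):
    """PSNR in dB between two equally sized 8-bit frames (flat lists)."""
    mse = sum((o - d) ** 2 for o, d in zip(original, distorted)) / len(original)
    if mse == 0:
        return float("inf")   # identical frames: distortion-free
    return 10.0 * math.log10(peak * peak / mse)

# Toy 2x2 "frames": every pixel off by 1 gives MSE = 1,
# i.e. PSNR = 10 * log10(255^2) ~ 48.13 dB.
value = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```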

2.3 Representation of Digital Video

Before the video is ready for coding, it is often converted to one of the intermediate formats. The most central of these is the Common Intermediate Format, CIF, with a frame resolution of 352 x 288 pixels. Table 2.2 gives information about some standard intermediate formats.

2.4 Applications

There has been exponential growth in the applications of digital video, and the technology continues to evolve rapidly. Some examples of


Table 2.2: Intermediate formats. [2]

Format               Luminance resolution (horz. x vert.)
Sub-QCIF             128 x 96
Quarter CIF (QCIF)   176 x 144
CIF                  352 x 288
4CIF                 704 x 576

widely used digital video applications are given in the following subsections.

2.4.1 Internet

It can safely be said that the internet era holds most of today's digital video applications, ranging from small video clips to full-length movies, and from a casual video chat to a corporate video conference. Remote teaching and learning, video telephony and video sharing have all been made possible by digital video. The video broadcasting service YouTube presents billions of videos to viewers worldwide using the benefits of digital video technology.

2.4.2 Video Storage

Digital video has reshaped the way videos are stored. CD/DVD-ROM and Blu-ray Disc have almost wiped out the classic film tape storage media. These new storage discs come with huge advantages in capacity, portability and durability. The latest of them, the Blu-ray Disc, stores 25 GB per layer, with dual-layer discs holding 50 GB and multi-layer BDXL discs up to 100 GB [3].

2.4.3 Television

Satellite television channels across the planet create a global village by virtue of digital video. There are literally thousands of television channels operating in various areas of the world, and the number is still increasing. News, current affairs shows and popular drama serials gather huge numbers of viewers.


2.4.4 Games and Entertainment

Video games and movies have gained enormous popularity, and these too are applications of digital video. Nowadays we see an increasing trend in the popularity of 3D animated movies, a big success for digital video; take the example of 'Avatar', a blockbuster 3D film that is among the most popular movies of the current era.

2.4.5 Video Telephony

Digital video has made it possible to get video along with voice when communicating by telephone. At both government and private levels, video conferencing is replacing the need to travel far to attend meetings in one place. Skype is probably the brand leader in this field.


Video Compression Fundamentals

The size of an ordinary digitized video signal is far greater than usual storage capacities and transmission media bandwidths. This fact shows the need for systems capable of compressing the video.

For example, one channel of ITU-R 601 television (at 30 fps) requires a bit rate of 216 Mbps for broadcasting in uncompressed form; a 4.7 GB DVD can store only 87 seconds of uncompressed video at this bit rate. There is thus a clear need for mechanisms that make the data fit media of limited capacity. Hence compression, albeit with the drawback of some loss in visual quality: an effective compression system is, in general, lossy in nature.

3.1 CODEC

The term CODEC denotes a combined system capable of encoding (compressing) and decoding (decompressing). A typical codec is shown in Figure 3.1. The encoder compresses the original signal, a process called source coding; after further signal processing the signal reaches the source decoder, where it is decompressed.

According to information theory, there is statistical redundancy in an ordinary data signal. This principle is utilized in Huffman coding, and this kind of CODEC is known as an entropy CODEC. However, entropy encoders alone do not perform well on images and video; source models need to be deployed before entropy coding can be applied to such data.

Figure 3.1: Source coder, channel coder, channel [2].

Certain properties of video are taken into consideration to benefit the source models. These properties include the spatial and temporal redundancy present among pixels in video frames. Moreover, psycho-visual experiments have shown that the human visual system is more sensitive to lower frequencies, so in the video encoding process some high frequencies can safely be ignored. Codecs are often designed to emphasize certain aspects of the media, or their use, to be encoded. For example, a digital video (using a DV codec) of a sports event, such as baseball or soccer, needs to encode motion well but not necessarily exact colors, while a video of an art exhibit needs to encode color and surface texture well. Pertaining to video quality, there are two kinds of codecs.

In order to achieve a good level of compression, most codecs degrade the original quality of the signal; these are known as lossy codecs. There are also codecs which preserve the original quality of the signal, known as lossless codecs [10]. Some examples of coding techniques follow. In Differential Pulse Code Modulation (DPCM), each pixel is predicted from already transmitted pixels, and the prediction error, the difference between the prediction and the actual pixel, is transmitted. Transform coding changes the domain of the frame signal; this change makes it possible to round off insignificant coefficients, achieving lossy compression, and transform coding finds a great deal of application in various video compression techniques. Another technique is motion compensated predictive coding, which is the emphasis of this thesis. In a similar way to DPCM, a model of an actual frame of a video is obtained by prediction from an already encoded frame; this model is then subtracted from the original frame to obtain a residual frame, which contains much less energy than the original frame [2].
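The DPCM idea described above can be sketched in a few lines, using the simplest possible predictor (the previous sample); a real codec would use more elaborate prediction and would entropy code the residuals:

```python
def dpcm_encode(samples):
    """Encode each sample as its difference from the previous one
    (previous-sample prediction; the first sample is sent as-is)."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def dpcm_decode(residuals):
    """Invert the encoding by accumulating the residuals."""
    prev = 0
    samples = []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

pixels = [100, 102, 104, 104, 101]
residuals = dpcm_encode(pixels)
```

Note how the residuals after the first one are small numbers clustered around zero, which is exactly what makes them cheap to entropy code.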

3.2 A Video CODEC

Video signals consist of a sequence of still images, better known as video frames.

Figure 3.2: Video CODEC with prediction [2].

These frames can be encoded using intra-frame coding techniques, but the resulting compression is not good enough for video. This fact, together with the temporal redundancy present in a video sequence, drives the need for inter-frame encoding. A prediction of the actual video frame, based on the previous frame, is subtracted from the actual frame to form what is called the residual frame. The residual frame is then encoded by the frame codec. A block diagram of such a video coder is shown in Figure 3.2. Encoding the residual frame includes a transformation; the transform coefficients are quantized and then entropy coded for transmission or storage. At the decoder end, the reverse of these steps is applied to recover the data [2].
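The prediction/subtraction step described above can be illustrated on a toy block; the pixel values are made up purely for illustration:

```python
def energy(block):
    """Sum of squared sample values of a (flat) block."""
    return sum(p * p for p in block)

current    = [12, 14, 13, 15]
prediction = [11, 14, 12, 15]   # model built from the previous frame
residual   = [c - p for c, p in zip(current, prediction)]

# The residual carries far less energy than the frame itself,
# which is why it compresses so much better after the transform.
```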

3.3 Video Coding Standards

Most video codecs currently in use follow one of the two mainstream families of video coding standards, from the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). ISO has introduced the JPEG and MPEG-x series for images and video respectively; similarly, ITU has introduced its standards in the H.26x series. A brief description of these standards follows, with MPEG-4 described in more detail.

ISO has covered applications related to storage and distribution through its standards. The Moving Picture Experts Group (MPEG) has developed the recommendations for video, and its standards include the following [2][3].


3.3.1 MPEG-1

Under this standard, video and audio data can be compressed and played back in real time from CD-ROM (at a bit rate of 1.4 Mbps). VHS-quality digital video is compressed at a ratio of about 26:1.

3.3.2 MPEG-2

The bit rate has been increased from the previous standard to 3-5 Mbps for compression, storage and transmission of video and audio data. Additionally, support for interlaced video has been added.

3.3.3 MPEG-4

Released in late 1998, it provides additional features beyond those of the previous standards. It supports a huge range of bit rates and is discussed in detail at the end of this section.

3.3.4 MPEG-7

It is a multimedia content description standard, providing support for describing multimedia content data with the aim of a standardized system for content-based indexing and retrieval of multimedia information. It is meant for accessing multimedia data rather than for coding and compression. MPEG-7 is formally known as the Multimedia Content Description Interface.

3.3.5 MPEG-21

It is usually referred to as the Multimedia Framework and defines an open framework for multimedia applications. The Rights Expression Language defined by MPEG-21 standardizes the sharing of digital rights for digital content from its source to the consumer end. The standard promotes integration and interoperation between the various technologies of the multimedia field.

ITU has focused on applications related to real-time, two-way video communications. Its standardization working body is called the Video Coding Experts Group (VCEG), and it has produced the following standards.


3.3.6 H.261

It was primarily introduced for video telephony over ISDN lines, where channel capacity is a multiple of 64 kbps. It supports two video sizes, CIF and QCIF.

3.3.7 H.263

It offers videoconferencing at a variety of bit rates ranging from a few kbps to many Mbps, and is quite popular in internet applications.

3.3.8 H.263+

It is the second version of H.263 and adds enhancements to the original standard, including better encoding and a level of immunity to transmission errors. A later version, H.263++, added further annexes with more functionality.

3.3.9 H.264

Also known as MPEG-4 Part 10 or Advanced Video Coding (AVC), its first set of recommendations came in 2003. It is used in applications such as Blu-ray Disc, YouTube videos and television services. H.264 was developed by the Joint Video Team (JVT), a collaborative working group of ITU and ISO.

3.4 MPEG-4

This standard was developed to enhance the functionality of the already existing MPEG video coding standards. One of the added features is efficient compression for applications involving transmission media with low bit rates. A whole new concept of video scenes and video objects has been introduced, which codes video based on its contents instead of treating everything uniformly as rectangular frames.

The MPEG-4 standard is progressive in the sense that it has the capacity to absorb new tools and enhancements. The tools MPEG-4 offers for encoding are organized in various subsets; such subsets are called profiles, and a specific profile addresses a specific application.

One example is the Simple Profile, which aims at applications requiring low bit rate and low resolution. Another is the Advanced Simple Profile, which


has features such as support for bidirectionally predicted frames and quarter-pixel motion compensation. Some salient functionalities of MPEG-4 video frame encoding are described in the following [2].

• Video core: the core of the standard uses coding algorithms designed for very low bit rates.

• Input format: video data is pre-processed, and sometimes converted to one of the picture sizes listed in Table 2.2, at a frame rate of up to 30 frames per second and in 4:2:0 (Y:Cr:Cb) format, before the codec is applied.

• Picture type: frames are encoded as I frames (intra coded), P frames (predictively coded) or B frames (bidirectionally predicted). For video encoding, a frame is usually divided into small sections of a certain size called macroblocks. I frames contain strictly intra coded macroblocks, while P frames may contain either inter or intra coded macroblocks.

• Motion estimation: normally performed on macroblocks of size 16 x 16, with optional block sizes of 8 x 8, 4 x 4, 4 x 8, 8 x 4, 8 x 16 and 16 x 8 depending on the profile in operation. The motion vectors (coordinate pairs representing relative motion) may have sub-pixel resolution.

• Transform coding: the residual frame obtained from the motion estimation process is coded using the discrete cosine transform (DCT). The resulting coefficients are quantized and arranged in zig-zag order, and finally run-level coding is applied.
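The zig-zag arrangement mentioned in the last bullet can be sketched as an index generator: coefficients are visited diagonal by diagonal, alternating direction, so low-frequency coefficients come first. This is a sketch, not the normative scan table of any particular standard:

```python
def zigzag_order(n=8):
    """Return the zig-zag scan order of an n x n coefficient block
    as (row, col) pairs, low frequencies first."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],                       # which diagonal
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

# Scan order of a 4x4 block; the DC coefficient (0, 0) comes first.
order = zigzag_order(4)
```

After quantization, most high-frequency coefficients are zero, so this ordering produces long runs of zeros at the end of the scan, which run-level coding exploits.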

3.5 Syntax

The main features of the syntax of the MPEG-4 coded bit stream are described below.

• Picture layer: the top layer of the syntax contains a complete coded picture. The picture header contains values describing the picture resolution, the type of coded picture (inter or intra) and a temporal reference field.

• Group of blocks layer: a complete row of macroblocks forms a group of blocks (GOB) in QCIF, CIF and SCIF frames; this layer helps the decoder resynchronize if errors cause a loss of synchronization.

• Macroblock layer: four luminance and two chrominance blocks form one macroblock. The header contains information about the macroblock type and, for inter coded macroblocks, the motion vectors.


Motion Estimation and its Implementation

Motion estimation is a key component in video compression, video processing and computer vision. Knowledge of the motion in a video helps eradicate temporal redundancy among successive frames, and consequently a high compression ratio is achieved. This makes motion estimation an essential module in the video coding standards.

Contrary to the older standards, MPEG-4 introduces a region-based motion model which is more flexible and more efficient. Consider a video frame from a set of video frames and call it the current frame. Frames from the video which have already been encoded can be used to predict the contents of the current frame; such frames are called reference frames, and the prediction is called motion prediction. The temporal order of a reference frame can be earlier or later than the current frame, giving forward prediction or backward prediction respectively.

Forward and backward prediction can also be combined, in which case the prediction is called bidirectional. The process can be understood from the pictorial explanation in Figure 4.1 [11] and the block diagram in Figure 4.2 [2]. The target of a motion estimation algorithm is to model the current frame from the reference frame with maximum accuracy and minimum computational effort.

As shown in Figure 4.2, the Motion Estimation block creates such a model by altering a reference frame. The Motion Compensation block creates a residual frame by subtracting the model of the current frame from the original current frame. This residual frame is then transform and entropy coded and sent for transmission along with the motion vector information. Another interesting step taken here is the decoding of this encoded frame, so as to reproduce the current frame to be


Figure 4.1: Motion Estimation and Compensation [11].

Figure 4.2: Motion Estimation and Compensation [2].


Figure 4.3: Block Matching [2].

used as a reference frame in the further encoding process.

The degree of compression can be measured from the size of the coded residual frame, also called the displaced frame difference (DFD), plus the overhead information for the motion vectors. The size of the coded residual frame is proportional to the energy remaining in the DFD after the motion compensation process, and this energy can be reduced using motion estimation and compensation to achieve higher compression efficiency [2].

4.1 Block Matching

To carry out motion estimation and compensation, a video frame is treated as composed of non-overlapping blocks of a certain size, e.g. 16 x 16 pixels; other standard sizes are used in different video coding standards. Such blocks are formally known as macroblocks, and motion estimation applied to them is known as block matching.

Block matching is performed on luminance samples (e.g. on Y blocks in MPEG-4 encoding). A macroblock of the current frame is compared with macroblocks of the reference frame, aiming to minimize the energy difference between them. The search area in the reference frame is centered around the position of the macroblock under consideration, exploiting the temporal redundancy and avoiding a search of the whole reference frame.

The block matching process is depicted in Figure 4.3.

In this figure, a 3 x 3 current block is searched for a match around the corresponding position in the reference frame, with the search region kept one pixel wider than the block. There are various search criteria for estimating the optimum matching point; examples include the Sum of Absolute Differences (SAD), Mean Square Error (MSE) and Mean Absolute Error (MAE). SAD is calculated as:

SAD = \sum_{i=1}^{N} \sum_{j=1}^{N} |C_{ij} - R_{ij}|                (4.1)


where i, j are pixel positions and C, R represent the current and reference frames respectively. SAD is usually chosen because of its simple calculation procedure. The SAD between the current block and its co-located (0, 0) block in the reference frame is taken as the search metric and is given in equation (4.2):

|1-4| + |3-2| + |2-3| + |6-4| + |4-2| + |3-2| + |5-4| + |4-3| + |3-3| = 12                (4.2)

The SAD is calculated for the other positions as well. The best matching block is centered at the position returning the minimum SAD value, in this case (-1, 1), with a SAD value of 2. The same procedure is repeated for the rest of the blocks in the current frame to complete its motion estimation. The video encoding process can thus be itemized as follows [2].

• The energy difference is calculated between a block in the current frame and candidate blocks inside a search window of a certain size, positioned around the center of the corresponding position in the reference frame.

• The matching region picked is the one giving the least value of the search metric.

• The procedure is repeated to obtain an estimation model of the whole current frame.

• A residual frame is then obtained by subtracting the resultant model of the current frame from the original current frame.

• The residual frame, along with its motion vector information, is then encoded and sent for transmission or storage.
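The SAD metric of equation (4.1) can be sketched directly; the 3 x 3 blocks below are illustrative values chosen to reproduce the absolute differences of the worked example in equation (4.2):

```python
def sad(current, reference):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(current, reference)
               for c, r in zip(row_c, row_r))

# Illustrative blocks whose element-wise absolute differences are
# 3+1+1+2+2+1+1+1+0 = 12, as in equation (4.2).
current_block   = [[1, 3, 2], [6, 4, 3], [5, 4, 3]]
reference_block = [[4, 2, 3], [4, 2, 2], [4, 3, 3]]
metric = sad(current_block, reference_block)
```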

4.2 Motion Estimation Algorithms

For the best match, the current block would have to be searched for in the whole reference frame, but this approach is computationally very time consuming. It has been observed that a potential match for a block in the current frame is often found near its position in the reference frame. So the search area is limited to a region smaller than the whole frame, known as the search window. The optimal search window size is steered by a trade-off between two important factors, performance and complexity: large search windows usually perform better, but at the cost of a larger number of comparisons, and hence require more computational resources.

There are various methods for finding the matching region in the reference frame(s), broadly divided into two classes.

Figure 4.4: Full Search Methods [2].

The basic class is called Full Search and is the optimal solution for motion estimation in the block matching domain. The sub-optimal class is called Fast Search, and such methods are, as the name suggests, fast. Starting with Full Search, some of them are described in the following.

4.2.1 Full Search

As said earlier, this is the optimal way of finding motion vectors. The method searches the whole search window, at each of its search points (pixels), for the best match according to the search criterion, e.g. SAD. Two search orders are used in practice.

These are depicted in Figure 4.4. Raster order starts in one corner and proceeds all the way to the opposite corner. Spiral order starts at the middle of the search window and moves outwards, successively scanning the pixels on its way. The latter method is efficient in that it can be stopped early if some minimum value of the search criterion is reached before the whole search window has been scanned [2].

Full Search is brute force by nature. It gives the best results in terms of accuracy, and hence the best matches, but involves a massive amount of computation. This computational complexity restricts the usability of the Full Search method in general and in real-time CODECs in particular.

Coming next are some algorithms which try to reduce the number of com- parisons.


Figure 4.5: Three Step Search [2].

4.2.2 Three-Step Search

This algorithm is one of the earliest fast search methods. As shown in figure 4.5, it starts by placing eight search points at a distance (called the step size) of half the search window length from the (0,0) position. The point giving the least value of the search criterion (SAD) is then chosen as the center for the next round of searching, with half the step size of the previous round. This continues until a step size of one is reached, and the search point with the minimum SAD value is selected as the best match.

The considerable reduction in search points compared to Full Search is obvious from the mentioned figure.

4.2.3 Diamond Search

Contrary to the previous algorithm, the search shape is a diamond, and the number of steps taken by the algorithm to converge is not limited. On the basis of the patterns the algorithm can adopt, it is divided into the Large Diamond Search Pattern (LDSP) and the Small Diamond Search Pattern (SDSP), as depicted in figure 4.6. The algorithm starts with LDSP, and if the minimum SAD value is found at the center, it jumps to step four as given in figure 4.6. The remaining steps are like LDSP except the last one. The number of points checked for SAD calculation can again be 3 or 5, as indicated by the mentioned figure. The last step performs SDSP around the new search origin, and the point with the least SAD value is declared the best match. As the search pattern is neither too small nor too big, and since there is no limit on the number of steps, this algorithm can find the global minimum quite accurately. The performance is very close to Full Search, but the computational complexity is much lower [9].

Figure 4.6: Diamond Search [7].

4.2.4 SLIMPEG

This is the standard reference algorithm for motion estimation used in this thesis work. It follows a recursive/predictive methodology, and motion estimation is done in two steps: first, a rough estimate of the motion fields is obtained, and a refinement is then performed to yield a fine-tuned motion estimate.

The predictive phase is performed by utilizing the motion vectors of a predefined set of neighboring macro blocks to see which of them are closer matches to the current macro block. The predefined set comes from the same frame (spatial neighbors) and from the previous frame (temporal neighbors). The idea is to exploit the natural correlation found amongst neighboring macro blocks. This approach provides an initial estimate to start with, instead of starting from zero every time as happens in Full Search or in some fast search algorithms like Three-Step Search.

Figure 4.7: Coarse and Refinement Phases [6].

In the next phase, a refinement of the earlier coarse estimate is performed.

It involves searching for the best match within a small distance of the initially determined motion fields. The distance here is dictated by the motion content of the video under consideration. This procedure is then repeated for the rest of the macro blocks [5][6]. These phases are depicted in figure 4.7.

A significant advantage of this algorithm over Full Search is that the use of previously calculated motion vectors provides a natural alignment towards the actual motion. This alignment improves the performance of the encoding procedure. It has been observed that Full Search sometimes gets stuck in a local minimum because of its full dependence on the error measure (SAD) value. This algorithm, however, tracks the actual motion very well.

As the number of predictors tried for a macro block is fixed a priori, there is freedom in choosing the optimal search window size. Moreover, the search window size can also be altered dynamically. Such experiments reveal a huge reduction in computational complexity compared to the Full Search algorithm.


Rate Distortion Optimization and

Bjontegaard Delta PSNR

Rate distortion efficiency has been a standard measure of video encoder performance. It is a quite sophisticated process, and a concise description is presented here. Lagrangian optimization techniques are applied to address the decisions found in common hybrid video coders. In such coders, decisions have to be made between various methods of encoding involving different parameter settings. Intra coding stands for the situation where no motion estimation is performed. When a block from the current frame exactly matches a block in the reference frame, a SKIP flag is sent to the decoder, telling it that the block is just a replica of its counterpart in the previous frame. Starting with a description of why rate distortion optimization is required, the optimization process is presented below [7].

Given the background of motion estimation and compensation in the previous chapters, a hybrid video codec is one that has both a motion handling structure and frame coding. In a typical hybrid video coder, the following questions are to be addressed:

• Segmentation of video images to form areas.

• Whether INTRA coding or not.

• Description of INTRA data.

• The various steps inside motion estimation, if INTER coding is used.

The compression efficiency of hybrid video codecs comes with some additional modes, which are selected during online operation for different sections of the encoded image. These modes of operation are related to signal-dependent rate distortion properties, and rate-distortion trade offs are necessarily present in the design phase of such aspects.

5.1 Measurement of Distortion

It is a natural requirement of rate distortion optimization that some measure of distortion be available. On the other hand, distortion is not easy to measure, as the perception of the human visual system has not been well translated into a physical quantity. The distortion measures in common use are: sum of squared differences (SSD), mean square error (MSE) and sum of absolute differences (SAD). Peak signal to noise ratio (PSNR), which is a normalized representation covering the whole range of pixel values, is calculated as:

PSNR = 10 log10[(2^n - 1)^2 / MSE] dB    (5.1)

where n is the pixel bit depth.

So, the rate distortion optimized coding options improve the performance of a video coder. The goal of an encoder is to optimize its overall fidelity:

Minimize the distortion D, subject to a constraint r on the number of bits used, R.

This statement can be formulated as :

min{D} (5.2)

subject to R < r.

This problem can be solved by applying Lagrangian optimization, which minimizes J, where

J = D + λR (5.3)

The Lagrangian rate-distortion cost is minimized for some suitable value of the Lagrangian multiplier.

A solution of equation 5.3 for one value of the Lagrangian multiplier provides an optimal solution to equation 5.2 for a certain value of r. The aforementioned optimization method has proven simple and effective for evaluating a number of coding options.

5.2 Bjontegaard Delta PSNR

It is required to calculate the average PSNR difference between two rate distortion (RD) curves. Gisle Bjontegaard proposed a method [8] for this, and the method was accepted by the Video Coding Experts Group (VCEG). Given are two simulation conditions with four Quantization Parameter (QP) values each. The input data set thus contains four PSNR values and the corresponding data rates for each of the two simulation conditions. The baseline encoder is called the anchor and the other the test.

An interpolation curve through four data values of a normal RD curve is obtained as:

SNR = a + b*bit + c*bit^2 + d*bit^3    (5.4)

where a, b, c and d are determined such that the curve passes through all 4 data points, and 'bit' means the bit rate expressed on a logarithmic scale. In the same way we can interpolate to find the bit rate as a function of SNR:

Bitrate = a + b*SNR + c*SNR^2 + d*SNR^3    (5.5)

In this way we can find both:

• Average PSNR difference in dB over the whole range of bitrates.

• Average bitrate difference in percentage over the whole range of PSNR.

The calculation process is explained below and also depicted in figure 5.1.

Let B1, B2, B3 and B4 be the bit rates, and P1, P2, P3 and P4 their corresponding PSNR values, at one simulation condition. Using equation 5.4, the RD curve equations are given by:

P1 = a + b*B1 + c*B1^2 + d*B1^3    (5.6)

P2 = a + b*B2 + c*B2^2 + d*B2^3    (5.7)

P3 = a + b*B3 + c*B3^2 + d*B3^3    (5.8)

P4 = a + b*B4 + c*B4^2 + d*B4^3    (5.9)

Figure 5.1: BD PSNR [8]

These equations are put in matrix form here:

    | P1 |   | 1  B1  B1^2  B1^3 |   | a |
    | P2 | = | 1  B2  B2^2  B2^3 | * | b |
    | P3 |   | 1  B3  B3^2  B3^3 |   | c |
    | P4 |   | 1  B4  B4^2  B4^3 |   | d |    (5.10)

In compact matrix notation, it is written as

b = A*x    (5.11)

where the column vector b represents the PSNR values on the left-hand side, A the matrix on the right-hand side, and x the column vector of unknown coefficients.

This system of equations is solved by using the expression:

x = inv(A)*b    (5.12)

BD PSNR is found by taking the difference of the values obtained after integrating both of the RD curves given by equation 5.4 over the interval from the lowest bit rate value to the highest bit rate value.

Figure 5.2: BD PSNR

In order to quantify the accuracy of BD measurements, a metric of reliability was introduced [7]. The value of reliability is calculated differently for BD PSNR and BD rate calculations; both are presented next.

As given above, BD PSNR measurements cover the overlapping domain of bit rates of the test and anchor RD curves, and it is observed that the accuracy of BD PSNR depends on the amount of this overlap. The situation is illustrated in figure 5.2. Let Full Search be the anchor and one of our contributions the test curve here. The greater the overlap between the two curves in bit rate or PSNR, the better the accuracy that can be guaranteed for the BD results.

In figure 5.2, the difference between 'oh' and 'ol' is the region of overlap in bit rates, and the whole range of PSNR values, from lowest to highest, is taken into account to get the BD PSNR value. BD rate measurements are performed on similar grounds: the overlapping PSNR area decides the range of bit rate values used to evaluate the BD rate.

Finally, a value of reliability is computed by the following formula:

r = (oh - ol) / max(ah - al, th - tl)    (5.13)

where ah, al are the highest and lowest bit rates of the anchor curve, th, tl those of the test curve, and oh, ol bound the overlap. A reliability metric value approaching 1 is evidence of an acceptable test, while lower values indicate the need for a new simulation with a new set of QP values that could bring better overlap.


Simulation, Results and Discussion

Given the fact that a video is created by frames (pictures) presented in succession at a certain rate, there exists a variable amount of redundancy of information amongst these successive frames, depending upon the video content. Technically, we can say that there exists a correlation between the frames of a video, called temporal correlation. This fact gives a clue to use information from earlier coded frames as an estimate of the upcoming frames to be encoded. Moreover, the dense sampling points within a frame give a fair amount of correlation amongst neighboring blocks, known as spatial correlation. Inspired by the presence of these phenomena in video, various methods based on spatial and temporal correlation have been applied to devise new techniques for motion estimation. The resulting video sequences have been compared against video sequences encoded with the Full Search motion estimation algorithm. The testing measures were delta PSNR and delta bit rates, which were introduced in the previous chapter. The subjective quality was observed with Ericsson's proprietary VIPS viewer. Moreover, the reduction in complexity offered by the new techniques has been compared against the complexity of the SLIMPEG motion estimation algorithm. The software platform used for simulations was an MPEG-4 simple profile encoder. All of the work was performed in the C language in a Linux environment.

The system used was an Intel Core 2 Duo desktop computer.

The algorithms were tested on the first 100 frames of the Football Sequence (high motion content), the Foreman Sequence (moderate motion content) and the Claire Sequence (low motion content), in the luminance domain. The videos were of QCIF size, at 15 fps, in 4:2:0 YUV format.

The values of quantization parameter (QP) used are mentioned wherever required along the text.


6.1 SAD as a Comparison Metric

As this research targets mobile multimedia applications, processing time consumption should always be kept under consideration. Following the SLIMPEG implementation given in [5], after the whole video encoder has been optimized, in this case for an ARM Cortex-A9 with NEON support (a co-processor that performs vectorization), the load is approximately distributed as given in table 6.1. We observe that SAD computation accounts for a large share of the Motion Estimation task, more than half of its workload. Hence, we chose SAD computation complexity as the metric of comparison between motion estimation algorithms.

Table 6.1: Computational workload distribution for SLIMPEG

Task                  Workload   Remarks
Motion Estimation       50 %
  SAD                   30 %
  Pre work              15 %     Create predictors
  Other                  5 %     Limit MVs inside search window
Motion Compensation      6 %
Write                   12 %
Encode                  30 %     Transforms, Quantization, ...
Other                    2 %

6.2 Proposed Techniques

During the work on this thesis, some algorithms were formulated that exploit the inherent properties of video, and they have been tested against the SLIMPEG motion estimation algorithm for reduction in complexity. The complexity metric taken here is the number of times an algorithm has to perform SAD calculations in the search for similarity between the current frame and the reference frame. The coded video quality is estimated by its PSNR and BD PSNR measures. The introduced algorithms offer a good level of reduction in SAD computational complexity. It is common practice to compare the quality of the resulting video against video encoded with the Full Search motion estimation algorithm. A brief description of the contributions made in this thesis comes next, followed by a comparison of results on the mentioned video sequences.

Before we introduce our techniques, it is worth mentioning that SLIMPEG has the possibility to INTRA code a macro block when a set threshold decides that its SAD value is poor. Our temporal and spatial correlation algorithms are not provided with this facility, so that exactly what they estimated can be seen; the motion estimation done by them is kept unchanged. This decision may result in lower quality, but it was adopted to see the exact results of our algorithms.

In the graphs to follow, 'temporal' and 'spatial' mean the temporal correlation and spatial correlation algorithms, respectively.

6.2.1 Spatial Correlation Algorithm

It has been observed that there is a high degree of relation amongst neighboring blocks inside a frame of a video. This relation amongst the blocks inside a video frame builds up a correlation called spatial correlation.

This fact leads to the idea of using some of the motion vectors of earlier blocks in a sequence for some of the successive blocks in the same frame. The idea can be implemented in a number of ways. We introduce some techniques here which considerably reduce the computational complexity of the motion estimation process, as explained in the following text.

Consider a frame from a given video, divided into small grids called macro blocks. It is usual to do motion estimation of these macro blocks individually, by comparing them against a region defined by a given window size in the reference frame. The first row and first column of the frame are coded with the SLIMPEG estimation algorithm. Then an average of motion vectors is taken for the motion estimation of the remaining macro blocks in the same frame. The macro blocks chosen for the average for a particular block are: 1) the preceding macro block in the same row; 2) the macro block exactly above the current block in the previous row. As the prediction is done with some error, this error may grow along the way when predicting frames from their previous frames. This issue is tackled by adding an INTRA coded frame after a certain number of INTER coded frames, and that is also done here. This algorithm has one built-in inability: it cannot deal with motion that does not occur in the area of the macro blocks encoded by SLIMPEG, and hence its performance depends highly on the motion content of the video being encoded.

6.2.2 Temporal Correlation Algorithm

The objects in a video look smooth in motion due to the high temporal sampling. Normally, most parts of an object inside a video undergo the same motion, and hence an equal amount of motion estimation may be applied over the whole object. The motion amongst successive frames thus has some sort of relation, a temporal correlation. Observing the successive images of a video reveals that, depending on the video content, many parts/blocks of one image can be found exactly the same in a later image. Also, the movement found amongst the images is generally related, e.g. all parts of a moving car move in the same direction and with the same speed. These observations give rise to the concept of temporal correlation, stating that the images of a video are correlated in the time domain. This concept can be implemented in different ways; the method adopted here mainly consists of reusing the motion vector information of already encoded macro blocks in temporal order. Moreover, after encoding a fixed number of frames, one frame is encoded fully with SLIMPEG.

Figure 6.1: SAD Control Algorithm [2 successive frames; two motion estimation algorithms are switched between black and grey macro blocks]

6.2.3 Adaptive SAD Control

Our approach here is to minimize the error measure along successive frames while encoding a video. There are two alternatives to implement this technique. One is the SAD Control Algorithm, which is static. This algorithm offers a huge reduction in complexity (SAD calculations), though with a trade off in the quality of the resulting video. It is depicted in figure 6.1. The black and gray colors represent two different motion estimation algorithms. As seen in figure 6.1, the selection of a particular algorithm for a certain block position alternates between consecutive frames. One of them could be SLIMPEG, and the other some reduced complexity algorithm. The error incurred due to reduced motion estimation for a macro block in one frame is then taken care of in the next frame.


Figure 6.2: Adaptive SAD control

We enhance the above algorithm to deal adaptively with the prediction error, forming an algorithm called Adaptive SAD Control. This algorithm dynamically chooses one of two available methods of motion estimation: SLIMPEG, and one which offers reduced complexity in SAD computation. Contrary to SAD Control, it has no fixed pattern of selection of the ME algorithm; instead, it adapts online with respect to the SAD measure values. The decision to choose a particular motion estimation algorithm is made by a 'threshold' value.

The algorithm works as given in the following and also depicted in figure 6.2.

• The first frame of the video sequence under consideration is encoded using intra coding.

• Then a certain number of frames are encoded using predictive coding. The second frame is encoded using the SLIMPEG motion estimation algorithm.

• Starting from the third frame, the following scheme is implemented, with a SAD threshold value of n%: 1) sort the SAD values for all the macroblocks of the previous frame in ascending order; 2) find the index i of the macroblock that has the lowest SAD value among the top n% of SAD values.

• For all positions above index i, the corresponding macroblocks in the current frame are encoded by SLIMPEG. The rest of the macroblocks in the current frame, for which the corresponding macroblocks had SAD values below the n% threshold in the previous frame, are encoded using the temporal correlation motion estimation algorithm.

Table 6.2: Adjustability of Adaptive SAD Control Algorithm

SAD Threshold                        30%                        70%
Video sequence             Football  Foreman  Claire  Football  Foreman  Claire
Reduction in Complexity [%]    63       65       62       35       36       34
Loss in BD PSNR [dB]         -1.62    -3.03    -1.13    -0.64    -1.16    -0.32

• Intra coding is iterated after a certain number of frames have been encoded using the above method.

Altogether, this algorithm offers the liberty of choosing between high data rate, high quality encoding and low data rate, low quality encoding. In addition, the latter option is less computationally involved. A summary of the results obtained by applying the Adaptive SAD Control algorithm with SAD threshold values of 30 % and 70 % is shown in table 6.2.

6.3 Simulations with different video sequences

The aforementioned algorithms were tested on different kinds of video sequences. The framework adopted for the simulations is given in the following.

• Encode the video under consideration using motion estimation by Full Search, SLIMPEG, the Temporal Correlation Algorithm, the Spatial Correlation Algorithm and the Adaptive SAD Control Algorithm separately, with a defined set of four QP values. The QP values are chosen such that the data rates of the resulting videos are limited to the approximate range of 35 kbps to 300 kbps.

• Record the data rates of the encoded videos and SAD computational complexity offered by each algorithm.

• Decode the videos and compare each of them against the original video to calculate individual PSNR values for each QP value.

• Take the video encoded by Full Search as the 'anchor' and evaluate the BD PSNR and BD rate values for the videos encoded by the rest of the algorithms.


• Compare the efficiency of SLIMPEG against Full Search for SAD computational complexity. Take those values as a reference to check the improvement in SAD computational complexity achieved by the other algorithms.

The data so obtained was used to plot graphs for the video sequences presented in the coming subsections. One such graph is the BD PSNR graph, which shows the Bjontegaard Delta PSNR measure of the encoding quality offered by a certain motion estimation algorithm in comparison with Full Search. In this graph, and others to be discussed, the Adaptive SAD Control algorithm has a variable response, with one value for each value of the SAD threshold. The other algorithms have a flat response, as none of their attributes depend on the SAD threshold. The BD rate plot presents a comparison of the data rates offered by the algorithms under test, compared with Full Search. The reduction in SAD computation complexity plot shows the decrease in SAD calculations for an algorithm when compared against SLIMPEG. PSNR plots show the individual PSNR values achieved by each algorithm, compared against Full Search at a certain QP value. Some sample frames from each video sequence are depicted in the next sections, for example figure 6.3. Macro blocks inside such a frame are shown as green squares, and the motion vectors are represented by small arrows inside these squares. Next follows the description of the simulations and results for the three video sequences.

6.3.1 Football Sequence

This video sequence has quite high motion content. One of its frames is shown in figure 6.3, which was encoded using SLIMPEG as the motion estimation algorithm at QP = 16. The pointer gives information about macro block 55: it was INTER coded, and the value of its motion vector is also shown. Figure 6.4 shows the BD PSNR measure of the video encoded by four different motion estimation algorithms. As the players move fast and the background changes accordingly, it can be expected that different macro blocks inside any frame of the video have little relation to each other's movement. The nature of the motion content also gives some idea of the relation amongst the frames in the temporal domain. These intuitive ideas are well supported by the results obtained for the spatial and temporal correlation algorithms, as shown in the mentioned figure. SLIMPEG lies above, as it should. Our Adaptive SAD Control reaches the SLIMPEG quality level gradually with increasing SAD threshold. Figure 6.5 depicts the BD rate measure of the aforementioned algorithms. Similar to the BD PSNR results, motion estimation based on temporal correlation offers better BD rate values than the one based on spatial correlation. The temporal correlation based algorithm is better in SAD computational complexity as well, as shown in figure 6.6.

Figure 6.3: 10th frame of the Football Sequence [Encoded using SLIMPEG]

It is mentioned again here for reference that the computational complexity comparisons were made against the value offered by SLIMPEG. Adaptive SAD Control offers wide liberty to choose between quality and reduced complexity.

The next four graphs, figures 6.7 to 6.10, present the PSNR values of the encoded video for QP values 8, 16, 24 and 31. These values have been chosen carefully so as to keep the minimum and maximum bit rates of the encoded video inside the prescribed range. As expected, high PSNR values are obtained for low QP values and vice versa. The temporal correlation algorithm maintains its better performance over the spatial correlation algorithm at individual QP values as well.

6.3.2 Foreman Sequence

This is a moderate motion content video sequence. One of its frames is shown in figure 6.11, which was encoded using Adaptive SAD Control as the motion estimation algorithm at QP = 11. The pointer gives information about macro block 34: it was INTER coded, and the value of its motion vector is also shown. An intuitive observation of the foreman's movement inside the video suggests rather abrupt changes in the direction of the motion vectors.

Figure 6.4: BD PSNR [Football Sequence]

Figure 6.5: BD Rate [Football Sequence]

Figure 6.6: SAD Computation Complexity [Football Sequence]

Figure 6.7: PSNR values for QP = 8 [Football Sequence]

Figure 6.8: PSNR values for QP = 16 [Football Sequence]

Figure 6.9: PSNR values for QP = 24 [Football Sequence]

Figure 6.10: PSNR values for QP = 31 [Football Sequence]

As most of the motion content of this video comes from the nodding of the foreman, the temporal correlation may be of less value due to the frequent changes in the direction of motion. Figure 6.12 confirms this with the BD PSNR results, where the spatial correlation algorithm lies above the temporal correlation algorithm. The next plot, in figure 6.13, presents the BD rate values, where spatial correlation again offers better performance. Our Adaptive SAD Control algorithm once again shows an exponential fall with increasing SAD threshold. The SAD computational complexity graph in figure 6.14 shows, as expected, better performance for the spatial correlation algorithm than for the temporal one. The next four plots, figures 6.15 to 6.18, depict the PSNR values of the encoded videos at QP values 4, 11, 18 and 25.

6.3.3 Claire Sequence

This is a low motion content video sequence. One of its frames is shown in figure 6.19, which was encoded using the Temporal Correlation Algorithm at QP = 3. The pointer gives information about macro block 6: it was coded as SKIP, and hence no motion vector is required in this case. A look at the video makes it clear that most of its visual areas remain the same frame after frame. As there is very little motion in the sequence, we see in figure 6.20 that the spatial correlation algorithm keeps a higher BD PSNR value than the temporal correlation algorithm. Also, SLIMPEG becomes more efficient here than the so-called optimum Full Search algorithm, because it uses previous estimates for motion vector calculations. The Adaptive SAD Control algorithm performs well here and keeps its values higher than even the spatial algorithm for most of the SAD threshold range. The BD rate shows similar results, as presented in figure 6.21. Temporal correlation gives more reduction in SAD computation complexity, however, as shown in figure 6.22.

Figure 6.11: 4th frame of the Foreman Sequence [Encoded using Adaptive SAD Control algorithm]

Figure 6.12: BD PSNR [Foreman Sequence]

Figure 6.13: BD Rate [Foreman Sequence]

Figure 6.14: SAD Computation Complexity [Foreman Sequence]

Figure 6.15: PSNR values for QP = 4 [Foreman Sequence]

Figure 6.16: PSNR values for QP = 11 [Foreman Sequence]

Figure 6.17: PSNR values for QP = 18 [Foreman Sequence]

Figure 6.18: PSNR values for QP = 25 [Foreman Sequence]

Figure 6.19: 13th frame of the Claire video sequence [Encoded using Temporal Correlation Algorithm]

The PSNR results for the individual QP values are shown in figures 6.23 to 6.26.


Figure 6.20: BD PSNR[Claire Sequence]

Figure 6.21: BD Rate[Claire Sequence]


Figure 6.22: SAD Computation Complexity[Claire Sequence]

Figure 6.23: PSNR values for QP = 2[Claire Sequence]


Figure 6.24: PSNR values for QP = 3[Claire Sequence]

Figure 6.25: PSNR values for QP = 5[Claire Sequence]


Figure 6.26: PSNR values for QP = 7[Claire Sequence]


Conclusion and Future Work

The task of reducing the complexity of motion estimation algorithms for MPEG-4 video encoding, in terms of SAD calculations for the measure of similarity, has been addressed in this thesis. Four methods have been introduced which proved their benefit in reducing motion estimation complexity.

The comparison criteria here are the complexity offered by SLIMPEG and the quality provided by the standard Full Search motion estimation algorithm.

The temporal correlation algorithm focuses on reusing motion vectors from the previous frame as motion vectors for the current macro block. This method provides an enormous reduction in SAD computation complexity, but may not be a practical stand-alone solution for motion estimation because of poor PSNR results.

The spatial correlation algorithm aims at reusing the few fine-tuned motion vectors, provided by SLIMPEG, within the same frame. This method too turned out to be very efficient in terms of the low number of SAD calculations, but it is hard to employ alone for the motion estimation process because of its poor performance in terms of PSNR of the resulting video.

• Football Sequence: The temporal correlation algorithm offers better quality than that offered by the spatial correlation algorithm by 0.79 dB in BD PSNR measure for this video. Moreover, the temporal correlation algorithm has 17 % less complexity of SAD computations than the spatial correlation algorithm.

• Foreman Sequence: The spatial correlation algorithm offers better quality than that offered by the temporal correlation algorithm by 1.1743 dB in BD PSNR measure for this video. Moreover, the tem- poral correlation algorithm has 18 % less complexity of SAD compu- tations than the spatial correlation algorithm.


• Claire Sequence: The spatial correlation algorithm offers quality better than the temporal correlation algorithm by 2.2572 dB in BD-PSNR for this video. Moreover, the temporal correlation algorithm has 18 % lower SAD computation complexity than the spatial correlation algorithm.

The SAD control algorithm alternates between a low-complexity and a full motion estimation algorithm. This method offers a promising reduction in complexity, and we have used this basic concept to formulate a dynamic adaptive algorithm.
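Such alternation may be sketched as a simple per-frame schedule (the frame-level granularity and the period of 4 are our assumptions for illustration; the thesis algorithm is more involved):

```python
def choose_search(frame_number, period=4):
    """SAD control sketch: run the full (SLIMPEG) search on every
    `period`-th frame and the cheap predictor-based search otherwise."""
    return "full" if frame_number % period == 0 else "low_complexity"

# Frames 0 and 4 get the full search, the rest use the cheap one
print([choose_search(n) for n in range(5)])
```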

The Adaptive SAD Control algorithm offers a system where complexity and quality can be traded off in a dynamic way. Higher quality may be obtained by increasing the complexity, and low computational complexity is possible if encoding quality can be compromised a little. We were successful in showing that this algorithm can provide a so-called sliding knob, which can be moved all the way from lowest quality with lowest complexity to the point of highest quality and highest complexity.

At the point of highest quality, the SAD calculation complexity is the same as that offered by SLIMPEG.
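The sliding-knob behaviour may be sketched as a mapping from a knob setting in [0, 1] to a per-frame SAD budget (the linear interpolation and all names are illustrative assumptions; at knob = 1 the budget equals the SLIMPEG cost, matching the remark above):

```python
def sad_budget(knob, slimpeg_sads, min_sads):
    """Map a quality knob in [0, 1] to a per-frame SAD budget by
    interpolating between the cheapest mode and the full SLIMPEG
    cost; the encoder would then spend at most this many SADs."""
    knob = max(0.0, min(1.0, knob))  # clamp the knob to [0, 1]
    return min_sads + knob * (slimpeg_sads - min_sads)

print(sad_budget(0.5, 1000, 100))  # 550.0
print(sad_budget(1.0, 1000, 100))  # 1000.0, i.e. the SLIMPEG complexity
```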

The aforementioned conclusive remarks show the benefit of exploiting the inherent properties of video in the motion estimation process. The correlation found among different parts of a video frame, and between consecutive frames, proved useful for compression.

Future work on motion estimation algorithms may proceed in different directions. Motion vectors obtained under a rate-distortion constraint may lead to lower bit rate usage. Depending on the nature of the motion in the video under consideration, the temporal and spatial correlation algorithms may be combined to achieve better results.
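One way such a combination could work, sketched under our own assumptions, is to merge the temporal and spatial predictions into one candidate set and let a cost function (e.g. SAD against the reference frame) pick the winner:

```python
def best_candidate(candidates, cost):
    """Pick the motion vector with the lowest cost from a merged
    candidate set; `cost` would be a SAD evaluation in an encoder."""
    return min(candidates, key=cost)

# Candidate set merging the zero vector with hypothetical temporal
# and spatial predictions
cands = [(0, 0), (2, -1), (3, 1)]
# Toy cost: L1 distance from an assumed true motion of (2, 0)
cost = lambda mv: abs(mv[0] - 2) + abs(mv[1])
print(best_candidate(cands, cost))  # (2, -1)
```

Only a handful of SAD evaluations (one per candidate) would be spent per macroblock, so the combined scheme could retain most of the complexity savings of both algorithms.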
