
Implementation of a Frame Rate Up-Conversion Filter

Fredrik Vestermark

May 24, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Frank Drewes
Examiner: Eddie Wadbro

Umeå University
Department of Computing Science
SE-901 87 UMEÅ


Abstract

When providing video for devices that support different frame rates, some kind of frame rate conversion has to be done. Simple methods such as frame repetition, frame averaging and frame dropping are commonly used to change the frame rate, but these methods introduce jerky motions or blurring artifacts. To avoid these artifacts, many algorithms have been developed to approximate new frames based on estimated movements between successive frames (motion compensation).

This thesis summarizes a selection of previously proposed motion compensation techniques for frame rate up-conversion, and presents a prototype implementation that uses a subset of the described methods.


Contents

1 Introduction
2 Problem Description
  2.1 Problem Statement
  2.2 Methods
  2.3 Related Work
3 Frame Rate Conversion
  3.1 Frame Averaging
  3.2 Block Motion Estimation
    3.2.1 Block Distortion
    3.2.2 Estimation Techniques
  3.3 Search Algorithms
    3.3.1 Search Speed
    3.3.2 Full Search
    3.3.3 Diamond Search
    3.3.4 Hexagon Based Search
    3.3.5 Other Hexagon Based Algorithms
  3.4 Motion Vector Refinement
  3.5 Heuristics
    3.5.1 One-step-stop
    3.5.2 Spatial and Temporal Correlation
    3.5.3 Subsampling
  3.6 Motion Compensation
    3.6.1 Block Motion Compensation
    3.6.2 Overlapping Block Motion Compensation
    3.6.3 Control Grid Interpolation
4 Implementation
  4.1 Structure
    4.1.1 Motion Estimation
    4.1.2 Motion Compensation
  4.2 Evaluation
    4.2.1 Analysis
5 Discussion
  5.1 Limitations and Future Work
6 Acknowledgements
References
A Diagrams

List of Figures

2.1 Frame generated via frame repetition
3.1 Frame generated via frame averaging
3.2 Temporal shifting
3.3 Forward motion estimation
3.4 Bilateral motion estimation
3.5 Overlapping motion estimation
3.6 Full search
3.7 Diamond search
3.8 Hexagon based search
3.9 Bad spatial MV refinement
3.10 1:2 pixel decimation
3.11 Alternating 1:4 pixel decimation of distortion method
3.12 Bilateral motion compensation
3.13 Direct motion estimation
3.14 Pixel-based OBMC
3.15 Control grid
4.1 System structure and communication
4.2 Examples of a quick scene change
4.3 Examples of a slow fade between two scenes
4.4 Example of artefacts along the edges
4.5 Approximations of a frame from the clip Park Joy
4.6 Approximations of a frame from the clip Foreman
4.7 Example of a bad approximation
A.1 Box plots of the PSNR for images generated from the test videos
A.2 Line charts of frame rate up-conversion PSNR for test sequences


List of Tables

2.1 Test videos used when evaluating the quality of frame rate conversions
3.1 Hexagon based search algorithms and SIR claimed by their authors


Chapter 1

Introduction

Different video devices support different formats, and sometimes it is necessary to convert video to a different format or frame rate. Film, as an example, is produced at 24 frames per second (FPS), but shown on TV at 25 FPS in most of Europe and 30 FPS in most of America. While the audio and frame rate can be sped up from 24 to 25 FPS without being noticeable to humans, movements and sounds become unnatural if sped up to 30 FPS, which makes conversion necessary. It is impossible to perform a completely flawless frame rate conversion (FRC) in the general case, but a number of algorithms to generate approximated intermediate frames exist.

This thesis presents a filter for motion compensated frame rate up-conversion (MC-FRUC) that can perform FRUC with higher image quality than frame repetition and frame averaging.

The topic is based on an external work at the IT consulting firm CodeMill (http://www.codemill.se), whose partner Vidispine AB offers an API media asset management platform (http://vidispine.com). They currently use frame repetition as their FRC method, but are interested in offering a module which gives their customers a better visual experience.

The report is structured as follows:

Chapter 2 describes the problem in further detail and the restrictions that were initially made.

Chapter 3 describes a few methods for block motion estimation and compensation.

Chapter 4 includes a description of the system structure and an image quality evaluation based on statistical comparisons between implemented motion compensation methods.

Chapter 5 includes a discussion about the project and the results, what limitations had to be made and future work.

Chapter 6 expresses my gratitude to those that in one way or another have helped me during the writing of this thesis.



Chapter 2

Problem Description

2.1 Problem Statement

As previously mentioned, Vidispine currently uses frame repetition to implement FRC. Frame repetition is a very fast method, but it introduces jerky motions to moving objects if the new frame rate is not an integer multiple of the old one since frames are repeated a different number of times (Castagno et al., 1996). For example, every other image is repeated when the frame rate is increased by a factor of 3/2. Figure 2.1 shows an example of frame repetition.

Figure 2.1: Frame generated via frame repetition surrounded by the reference frames. The generated frame is identical to the first one.
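To make the repetition pattern concrete, the sketch below (in C, the language the filter described in Chapter 4 is written in) maps an output frame index to the source frame shown at that instant; the function name and signature are hypothetical, not part of the thesis code:

    #include <stdint.h>

    /* Source frame shown at output index out_idx under frame repetition.
     * With in_fps = 24 and out_fps = 36 (up-conversion by a factor 3/2),
     * the source indices become 0, 0, 1, 2, 2, 3, 4, 4, ...: every other
     * frame is shown twice, which is what causes the jerky motion. */
    static int64_t repeated_source_index(int64_t out_idx, int in_fps, int out_fps)
    {
        return out_idx * in_fps / out_fps;   /* floor division */
    }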

The goal of the project was to implement a frame rate conversion filter that would be able to convert video with higher quality than the currently used frame repetition, preferably with higher quality than frame averaging and in real-time. No restriction on video resolution was specified in the original problem specification, but real-time 720p/24 (720 pixels high progressive 24 FPS video) to 720p/30 conversion was agreed upon as an aim, since 720p is a common format and conversion from 24 FPS to 30 FPS is a common problem. An agreement was also made to aim at support for resolutions up to 4320p if time permitted.

CodeMill wished to have the filter implemented using the FFmpeg library libavfilter (https://ffmpeg.org/libavfilter.html) instead of as a native Vidispine filter, mainly for two reasons:

– libavfilter support is already planned to be integrated into Vidispine.

– If the filter turns out to require too much maintenance, it can be released to the FFmpeg community under the LGPL, which can use and help improve the filter. Eventually it may fulfill Vidispine's requirements.

2.2 Methods

As a first step, a frame repetition filter and an edge detection filter were implemented to get some basic understanding of how frames are represented and how libavfilter works.

The plan was to start with literature studies on block motion compensation, different search algorithms and optimizations, followed by implementation of a few of the algorithms. In reality these steps overlapped, since practical implementation improved the understanding of the studied algorithms. Additionally, an existing tool for FRUC was briefly studied. Afterwards, the structure of the implemented filter was improved and the code was tidied up.

To assure that the filter would meet the image quality goal, it was continuously tested with some standard benchmarking videos from Xiph.org (http://media.xiph.org/video/derf/) and with trailers (http://www.bigbuckbunny.org/index.php/trailer-page/, http://www.sintel.org/download) for videos released under CC BY 3.0. More information on how the evaluation was performed can be found in Section 4.2.1.

Video                    Resolution  Frame rate  Source
Foreman                  352x288     29.97       Xiph.org
Bus                      352x288     30          Xiph.org
Park Joy                 1280x720    50          Xiph.org
Big Buck Bunny Trailer   1920x1080   25          Official web page
Sintel Trailer           1280x544    24          Official web page
Tears of Steel           1920x800    24          Official YouTube channel

Table 2.1: Test videos used when evaluating the quality of frame rate conversions.

2.3 Related Work

Motion estimation and motion compensation are well studied areas with thousands of related articles. The use of motion compensation in frame interpolation was first proposed by H.C. Bergmann in 1984, but motion compensation was used in video coding even earlier (Luessi and Katsaggelos, 2009). Recent related work consists of, but is not limited to, algorithms that improve estimation speed (Zhu et al., 2002; Tsai and Pan, 2004; Huang and Chang, 2011), motion estimation accuracy (Porto et al., 2012; Huang and Chang, 2011) and the visual quality of interpolations (Choi et al., 2000).

MVTools is a plugin for AviSynth which can perform motion compensated frame rate conversion. Since it is licensed under GPL, it cannot be used in Vidispine but was used for comparison and some inspiration.

Notably, a thesis similar to this one was written by Jostell and Isberg (2012). Their aim was to up-convert video from a surveillance camera from 1080p/20 to 1080p/60.



Chapter 3

Frame Rate Conversion

In this chapter, the low-complexity FRC method frame averaging (FA) is described, followed by methods to perform motion estimation (ME) and motion compensation (MC). Example figures show the interpolated frame F_i surrounded by the previous reference frame F_p and the current reference frame F_c. To simplify, F_i is located at a temporal position exactly in the middle of F_p and F_c if not stated otherwise.

3.1 Frame Averaging

The process of blending frames together by interpolating pixel values is called frame averaging. Pixel values in F_i are linearly interpolated from the corresponding pixel values in F_p and F_c, weighted depending on the relative temporal position. It reduces flickering at the cost of moving objects getting blurred (Wong and Au, 1995). Figure 3.1 shows the result of a FA.

Figure 3.1: Frame generated via frame averaging surrounded by the reference frames. Blurring artifacts are especially visible on the fence and the statue with background.
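As a minimal sketch of this interpolation, assuming a single 8-bit luma plane where both reference frames share width, height and stride, and with alpha denoting the relative temporal position of F_i between F_p (0.0) and F_c (1.0):

    #include <stdint.h>

    /* Weighted frame averaging of one 8-bit luma plane. */
    static void average_plane(uint8_t *fi, const uint8_t *fp, const uint8_t *fc,
                              int width, int height, int stride, double alpha)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                fi[y * stride + x] =
                    (uint8_t)((1.0 - alpha) * fp[y * stride + x]
                              + alpha * fc[y * stride + x] + 0.5);
    }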

When performing FA to achieve a frame rate that is not an integer multiple of the original frame rate, some of the original frames will not be present in the up-converted video due to non-matching temporal frame positions. Castagno et al. (1996) argue that in such cases the original frame with a slightly shifted temporal position, as shown in Figure 3.2, would usually give a better result than straightforward frame averaging. Their subjective experimental results verify that the jerkiness introduced by the temporal shifting is not disturbing in a 50 FPS to 75 FPS conversion even when viewing a uniform motion.


(a) Straightforward frame averaging. (b) Frame averaging with temporal shifting.

Figure 3.2: Frame rate up-conversion from 50 FPS to 75 FPS, with temporal shifting to reduce the amount of frames with artifacts. The thick lines represent frames. Black frames are frames in the original video, green frames are losslessly copied from the original video and red frames are interpolated from two neighbouring frames.

3.2 Block Motion Estimation

Block motion estimation (BME) aims to find motions between two or more images by systematically comparing a block of pixels in one frame with blocks of the same size in the other frame. This is an important preparatory step of block motion compensation and the estimated motion vectors should approximate true motions, since these estimated motions are later used to approximate new frames by drawing these blocks in intermediate positions. Block motion estimation has other application areas such as video compression, but these estimation methods cannot necessarily be used in frame rate conversion, since video compression usually does not require the true motions (Zhai et al., 2005).

3.2.1 Block Distortion

Measurement of the distortion between two blocks is one of the keystones in making an accurate motion estimation. A lower distortion implies a higher similarity. This section discusses a few block distortion methods, where the intensity of the pixel at coordinate (x, y) in frame n is denoted by f_n(x, y). Note that chroma values are ignored in these examples, but can be incorporated in implementations. The width and height of a block in the following formulas are denoted by W and H respectively. All methods are presented with both the total difference of all pixel values and the corresponding mean version, but Jostell and Isberg (2012) point out that the mean versions of the distortion methods below make use of floating point precision and are therefore more expensive than the sum versions.

Absolute Differences

The sum of absolute differences (SAD), shown in Formula 3.1, is a commonly used distortion method; its mean version, the mean of absolute differences (MAD), is shown in Formula 3.2 (Wong and Au, 1995).

SAD(x_1, y_1, x_2, y_2) = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} | f_k(x_1 + i, y_1 + j) - f_{k-1}(x_2 + i, y_2 + j) |   (3.1)

MAD(x_1, y_1, x_2, y_2) = SAD(x_1, y_1, x_2, y_2) / (W * H)   (3.2)

A problem caused by the use of absolute differences in general is that two blocks where only a few pixels differ greatly can easily be considered to have a lower distortion than two blocks with a slightly higher but evenly distributed overall distortion.

Squared Differences

The previously mentioned problem can be solved by using squared differences. The formula for summed squared differences (SSD), also known as summed squared error, is shown in Formula 3.3. The corresponding mean squared difference (MSD), also known as mean squared error (MSE), is shown in Formula 3.4.

SSD(x_1, y_1, x_2, y_2) = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} ( f_k(x_1 + i, y_1 + j) - f_{k-1}(x_2 + i, y_2 + j) )^2   (3.3)

MSD(x_1, y_1, x_2, y_2) = SSD(x_1, y_1, x_2, y_2) / (W * H)   (3.4)

According to Xiong and Zhu (2008), MSD is the most widely accepted distortion method, but the less accurate SAD is commonly used because it does not require multiplication. They also suggest the multiplication-free distortion method weighted SAD (WSAD), which gives more accurate results than SAD. It is however not covered in this thesis.
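The sum versions translate directly into code. The sketch below assumes 8-bit luma planes with a common row stride and blocks that lie fully inside both frames; the mean versions are obtained by dividing by W * H:

    #include <stdint.h>
    #include <stdlib.h>

    /* SAD between a w x h block at (x1,y1) in frame k and one at (x2,y2)
     * in frame k-1 (Formula 3.1). */
    static long block_sad(const uint8_t *fk, const uint8_t *fk1, int stride,
                          int x1, int y1, int x2, int y2, int w, int h)
    {
        long sad = 0;
        for (int j = 0; j < h; j++)
            for (int i = 0; i < w; i++)
                sad += labs((long)fk[(y1 + j) * stride + x1 + i]
                            - (long)fk1[(y2 + j) * stride + x2 + i]);
        return sad;
    }

    /* SSD between the same two blocks (Formula 3.3). */
    static long block_ssd(const uint8_t *fk, const uint8_t *fk1, int stride,
                          int x1, int y1, int x2, int y2, int w, int h)
    {
        long ssd = 0;
        for (int j = 0; j < h; j++)
            for (int i = 0; i < w; i++) {
                long d = (long)fk[(y1 + j) * stride + x1 + i]
                         - (long)fk1[(y2 + j) * stride + x2 + i];
                ssd += d * d;
            }
        return ssd;
    }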

3.2.2 Estimation Techniques

During motion estimation, blocks in the previous frame F_p and the current frame F_c are compared to each other. How the comparison is performed depends on the estimation technique, but common to the techniques described below is that they generate a field of motion vectors which describes movements between F_p and F_c.

Uni- and Bidirectional Motion Estimation

In unidirectional motion estimation, as described by Luessi and Katsaggelos (2009, p. 374), either F_p or F_c is first split into a non-overlapping grid of blocks. For each of these blocks, the other reference frame is searched for similar blocks within a range limited by a search window. The best match is used to create a motion vector representing the estimated movement. Figure 3.3 shows the estimation of a forward motion vector. Backward motion vectors can be estimated in the same way by simply interchanging the frames F_p and F_c.

Bidirectional motion estimation is performed by estimating both forward and backward motion vectors (Tang and Au, 1997, p. 1444), and can be used to generate more accurate frames.


Figure 3.3: Forward motion estimation of a single block within a limited search window, and what an interpolated frame F_i using that motion vector would look like.

Bilateral Motion Estimation

Bilateral motion estimation is designed to remove the need to handle hole areas and overlapping blocks. It does this by using the to-be interpolated frame F_i as a starting point, as opposed to unidirectional ME, which uses F_p or F_c. The ME is performed by dividing F_i into a non-overlapping grid of blocks and searching for blocks in F_p that match blocks in F_c around that location, as shown in Figure 3.4. The matching blocks in F_p and F_c are located at the relative positions (dx, dy) and (-dx, -dy) respectively. Note that the blocks in F_i do not contain any data and are only used as location reference points.

Figure 3.4: Bilateral motion estimation of a single block within a limited search window, and what an interpolated frame F_i using that motion vector could look like.
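In code, the only difference from unidirectional matching is the mirrored block positions. A sketch, reusing block_sad() from Section 3.2.1 and assuming F_i lies temporally midway between F_p and F_c:

    /* Bilateral distortion for the block of F_i anchored at (bx,by): the
     * candidate (dx,dy) pairs the block at (bx+dx, by+dy) in F_p with the
     * mirrored block at (bx-dx, by-dy) in F_c. */
    static long bilateral_sad(const uint8_t *fp, const uint8_t *fc, int stride,
                              int bx, int by, int dx, int dy, int w, int h)
    {
        return block_sad(fp, fc, stride,
                         bx + dx, by + dy, bx - dx, by - dy, w, h);
    }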

The most apparent problem with bilateral estimation is that it is hard to capture the movement of small, fast-moving objects, which are instead treated as two separate objects: one existing in F_p but not in F_c, and one existing in F_c but not in F_p. In the compensation step, this may result in a visual experience similar to frame averaging, or in the moving object blinking in and out of existence.

Choi et al. (2000), who suggest this motion estimation technique, call it bidirectional motion estimation, but more recently published papers seem to prefer the term bilateral motion estimation. The most obvious reasons for using the latter term are that the original term already had a different definition and that the word bilateral better describes the technique.

Overlapping Block Motion Estimation

The accuracy of the estimation can be improved by using enlarged blocks when computing block distortion, blocks that are larger than the ones in the original contiguous non-overlapping grid. An example is shown in Figure 3.5. This is called overlapped block motion estimation (OBME), since the grid can be seen as a grid of overlapping blocks.

Figure 3.5: Overlapping bilateral ME of a single block and what an interpolated frame F_i using that motion vector could look like. The large dashed squares in F_p and F_c are compared by the distortion method, as opposed to the solid squares in non-overlapping ME.

OBME can be performed in combination with both unidirectional and bilateral ME for improved accuracy, but at the cost of more expensive computations.

3.3 Search Algorithms

Searching for similarities between blocks in frames is computationally intensive, which has led to many different algorithms that reduce the number of search points (SPs). A search point is a point in a reference frame for which a block is evaluated against a block in the other reference frame, using a distortion method. Reducing the number of SPs can significantly speed up the motion estimation, but usually results in decreased quality (Tham et al., 1998, p. 369). Furthermore, Porto et al. (2012) claim that the output quality of fast algorithms such as the ones covered in this chapter can decrease significantly with increased video definition, since the high definition makes them vulnerable to getting stuck in local minima.

This section covers the search algorithms diamond search (DS) and hexagon based search (HEXBS) with some variations. Some other commonly used ME methods, such as three-step search, new three-step search and four-step search, were not considered during this thesis due to time limitations and because results from Tham et al. (1998) and Zhu et al. (2002) indicate that DS and HEXBS generally evaluate substantially fewer SPs than the previously mentioned methods.

For simplicity, algorithms described in this section demonstrate forward ME but they can be altered and used for backward ME and bilateral ME as well.

3.3.1 Search Speed

The speed of a BME search algorithm depends on the number of SPs that are evaluated by the algorithm. The speed improvement ratio (SIR), how many times faster an algorithm alg2 is compared to another algorithm alg1, is defined as in Formula 3.5, where SP(alg) denotes the number of search points evaluated by the algorithm according to its definition.

SIR(alg1, alg2) = ( SP(alg1) - SP(alg2) ) / SP(alg2)   (3.5)

Guanfeng et al. (2003), Tsai and Pan (2004) and others, however, use a slightly different version, which expresses the speedup as a percentage (see Formula 3.6). Formula 3.5 was used during this thesis since Formula 3.6 was falsely considered to be incorrect.

SIR(alg1, alg2) = ( SP(alg1) - SP(alg2) ) / SP(alg1)   (3.6)
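For illustration, with hypothetical averages of SP(alg1) = 225 and SP(alg2) = 20 evaluated search points per block, Formula 3.5 gives SIR = (225 - 20) / 20 = 10.25 (alg2 is 10.25 times faster), whereas Formula 3.6 gives (225 - 20) / 225 ≈ 0.91, i.e. a 91% reduction in evaluated search points.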

However, it is important to note that the estimated SIR often depends on the video resolution and the size of movements in the test material. As long as it is not shown that the SIR stays constant when the video resolution changes, which research papers seldom seem to state, the SIR should be treated with scepticism.

3.3.2 Full Search

Full search (FS), also known as exhaustive search, is the unoptimized algorithm which finds the block B_p within the search window that is most similar to B_c by evaluating every search point within the search window, iterating over the search area one pixel at a time, as illustrated in Figure 3.6.

(a) Starting step. (b) Ending step.

Figure 3.6: Full search. Search points evaluated in the current step are white, previously evaluated are grey and the best match in the previous step is green.
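A sketch of full search for one block, building on the block_sad() sketch from Section 3.2.1 and assuming that the whole search window lies inside F_p:

    /* Full-search ME of the w x h block of F_c anchored at (bx,by): every
     * displacement within [-range, range] in both axes is evaluated and
     * the one with the lowest SAD is returned in (*best_dx, *best_dy). */
    static void full_search(const uint8_t *fp, const uint8_t *fc, int stride,
                            int bx, int by, int w, int h, int range,
                            int *best_dx, int *best_dy)
    {
        long best = -1;
        for (int dy = -range; dy <= range; dy++)
            for (int dx = -range; dx <= range; dx++) {
                long d = block_sad(fc, fp, stride,
                                   bx, by, bx + dx, by + dy, w, h);
                if (best < 0 || d < best) {
                    best = d;
                    *best_dx = dx;
                    *best_dy = dy;
                }
            }
    }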

3.3.3 Diamond Search

Center-biased algorithms such as diamond search (DS) favour small movements over large movements. They generally evaluate some SPs that represent no or little movement and repeatedly evaluate SPs contiguous to the best match until no better match is found. The implementation suggested by Tham et al. (1998) is performed in three steps, namely starting, searching and ending.


– Searching is an optimised version of starting, performed by evaluating previously unevaluated search points in a diamond shape located around the best candidate search point in the previous step. Five new search points have to be evaluated if the best match is located at a vertex, or three search points if it is located at a face, as shown in Figure 3.7b and Figure 3.7c. If the best match is found at the center position, the algorithm jumps to the ending step, or else it performs a new search.

– Ending is performed by evaluating every search point located at [x ± 1, y] and [x, y ± 1] around the best match, as shown in Figure 3.7d. The location of the best match is returned by the algorithm.

(a) Starting step. (b) Vertex search. (c) Face search. (d) Ending step.

Figure 3.7: Diamond search. Search points evaluated in the current step are white, previously evaluated are grey and the best match in the previous step is green.

3.3.4 Hexagon Based Search

Hexagon based search (HEXBS), as suggested by Zhu et al. (2002), is performed similarly to DS, but with a search shape formed as a hexagon instead of a diamond. Figure 3.8 shows the different search steps, where the starting and searching steps evaluate new search points at [x, y], [x ± 1, y ± 2] and [x ± 2, y]. The ending step, identical to the DS ending step, evaluates [x ± 1, y] and [x, y ± 1] and returns the position of the best match.

(a) Starting step. (b) Searching step. (c) Ending step.

Figure 3.8: Hexagon based search. Search points evaluated in the current step are white, previously evaluated are grey and the best match in the previous step is green.
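The search pattern above can be sketched as follows; eval(ctx, dx, dy) is an assumed callback returning the block distortion for a displacement, for example the cached getter described in Section 4.1.1, and (mx, my) holds the start vector and receives the result:

    static const int hex_pts[6][2] = { {-2,0}, {2,0}, {-1,-2}, {1,-2}, {-1,2}, {1,2} };
    static const int end_pts[4][2] = { {-1,0}, {1,0}, {0,-1}, {0,1} };

    static void hexbs(long (*eval)(void *ctx, int dx, int dy), void *ctx,
                      int *mx, int *my)
    {
        long best = eval(ctx, *mx, *my);
        for (;;) {                        /* starting and searching steps */
            int bi = -1;
            for (int i = 0; i < 6; i++) {
                long d = eval(ctx, *mx + hex_pts[i][0], *my + hex_pts[i][1]);
                if (d < best) { best = d; bi = i; }
            }
            if (bi < 0)
                break;                    /* centre is best: go to ending step */
            *mx += hex_pts[bi][0];
            *my += hex_pts[bi][1];
        }
        int bi = -1;                      /* ending step */
        for (int i = 0; i < 4; i++) {
            long d = eval(ctx, *mx + end_pts[i][0], *my + end_pts[i][1]);
            if (d < best) { best = d; bi = i; }
        }
        if (bi >= 0) {
            *mx += end_pts[bi][0];
            *my += end_pts[bi][1];
        }
    }

This naive version re-evaluates points shared between consecutive hexagons; the distortion cache described in Section 4.1.1 avoids that.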

3.3.5 Other Hexagon Based Algorithms

A number of other search algorithms that evaluate search points in the shape of a hexagon were developed after HEXBS, of which a few are listed in Table 3.1. Their main characteristics are described below.

Modified hexagon based search uses the search heuristics one-step-stop and subsampling of the distortion method. These heuristics and/or simpler variants of them are described in Section 3.5. The SIR is claimed to be 708.45 for video sequences with medium and large motions and 3685.74 for video sequences with small motions (converted to the definition of SIR used in this paper); substantially higher than the other algorithms in Table 3.1. The average increase of MAD is between 0.35% and 7.7% over DS in their test sequences. However, the parameters are fine-tuned to suit the test videos, so the results cannot be properly compared.

Predict hexagon search takes into account that, according to studies made by Tsai and Pan (2004), there is a probability of 65% that the best motion vector is the zero vector, and a probability of 85% that the vector lies within a 5 × 5 area around the center point. The probabilities are of course resolution dependent. Additionally, studies show a higher probability for blocks to move horizontally or vertically rather than diagonally, which is also reflected in the search patterns.

Proposed HEXBS uses HEXBS with an adaptively sized search area, based on the well-known fact that spatially adjacent blocks most likely have similar motions, so the search window can be shrunk and displaced accordingly (Chiang et al., 2007).

Algorithm  Avg. SIR  Year  Source
MHBS       2197.10   2003  Calc. from Guanfeng et al. (2003, Table IV p. 1208)
PHS        71.68     2004  Calc. from Tsai and Pan (2004, p. 611)
PHEXBS     57.62     2007  Calc. from Chiang et al. (2007, p. 1153)

Table 3.1: Experimental SIR over HEXBS claimed by the respective authors. It is important to note that different test videos are used in each paper, and the results are therefore not sufficient to make a trustworthy comparison.

3.4 Motion Vector Refinement

Sometimes an estimated motion vector is pointing in a completely different direction than the surrounding ones, which typically indicates that the estimation was invalid. Such vectors are commonly removed by applying a median filter on the MVF (Ha et al., 2004). A negative side effect of median filters is that edges of moving objects may become blurred (Xu et al., 2011).
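One simple variant, sketched below, applies a component-wise 3 x 3 median; whether the filters in the cited papers operate component-wise or on whole vectors varies, so this is illustrative only. The filtered field is written to separate output arrays so that already filtered vectors do not influence their neighbours:

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* Component-wise 3 x 3 median filtering of a bw x bh motion vector
     * field; border blocks are copied unchanged in this sketch. */
    static void median_filter_mvf(const int *mvx, const int *mvy,
                                  int *out_x, int *out_y, int bw, int bh)
    {
        for (int y = 0; y < bh; y++)
            for (int x = 0; x < bw; x++) {
                out_x[y * bw + x] = mvx[y * bw + x];
                out_y[y * bw + x] = mvy[y * bw + x];
                if (x == 0 || y == 0 || x == bw - 1 || y == bh - 1)
                    continue;
                int nx[9], ny[9], k = 0;
                for (int j = -1; j <= 1; j++)
                    for (int i = -1; i <= 1; i++, k++) {
                        nx[k] = mvx[(y + j) * bw + x + i];
                        ny[k] = mvy[(y + j) * bw + x + i];
                    }
                qsort(nx, 9, sizeof nx[0], cmp_int);
                qsort(ny, 9, sizeof ny[0], cmp_int);
                out_x[y * bw + x] = nx[4];   /* median of 9 values */
                out_y[y * bw + x] = ny[4];
            }
    }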


(a) Original image. (b) Badly reconstructed image.

Figure 3.9: Example of a bad spatial MV refinement based on the best of the spatially adjacent vectors, where the background (a cloudy sky) and the foreground (a dragon) are moving fast in different directions.

3.5 Heuristics

As previously hinted, search algorithms can be sped up significantly by using heuristics. This section presents heuristics that are based on the observations that blocks are generally stationary, that blocks tend to move like their neighbours, and that neighbouring pixels usually generate similar distortion.

3.5.1 One-step-stop

One-step-stop is a speed improvement method performed by defining a block distortion threshold value under which blocks are presumed to be stationary (Guanfeng et al., 2003). It is based on the observation that many blocks are stationary, and is especially applicable in low resolution video.

3.5.2 Spatial and Temporal Correlation

Zafar et al. (1991) show that there is a high correlation between the motion vectors of spatially adjacent blocks, and an accompanying paper (Zhang and Zafar, 1991) shows that there is a high correlation between motion vectors of temporally adjacent blocks as well. This implies that a good MV approximation of a block can often be made from previously estimated MVs of spatially and temporally adjacent blocks, so an accurate estimation can be made with less computation.


3.5.3 Subsampling

Subsampling methods use only a subset of the information available to approximate their results.

Block Distortion

The number of pixels compared by a block distortion method can be halved by subsampling the pixel block, thereby only evaluating every other pixel in, for example, a chessboard-like pattern (Wong and Au, 1995), as shown in Figure 3.10.

Figure 3.10: 1:2 pixel decimation of distortion method. Pixels measured for distortion are marked with a black square.
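The chessboard pattern of Figure 3.10 is easily applied to the block_sad() sketch from Section 3.2.1; only pixels where i + j is even are compared, halving the work:

    /* 1:2 subsampled SAD (same assumptions as block_sad). */
    static long block_sad_half(const uint8_t *fk, const uint8_t *fk1, int stride,
                               int x1, int y1, int x2, int y2, int w, int h)
    {
        long sad = 0;
        for (int j = 0; j < h; j++)
            for (int i = j & 1; i < w; i += 2)   /* (i + j) even */
                sad += labs((long)fk[(y1 + j) * stride + x1 + i]
                            - (long)fk1[(y2 + j) * stride + x2 + i]);
        return sad;
    }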

Liu and Zaccarin (1993) suggest a pattern which reduces the evaluations to one fourth of the full estimation. To achieve a result close to the original distortion method, they alternate between four patterns depending on the location of the search point, as shown in Figure 3.11.

(a) Pattern one. (b) Pattern two. (c) Pattern three. (d) Pattern four. (e) Alternation.

Figure 3.11: Alternating 1:4 pixel decimation of distortion method. Pixels where distortion is evaluated are marked with a black square in (a)-(d), whereas (e) shows which pattern should be applied at which search point.

Even though these methods could in theory increase the speed significantly, few recently published papers seem to mention them. Fast hardware support for performing SAD on multiple pixels at the same time (Jostell and Isberg, 2012, p. 40) is a possible cause.

Motion Vector Field


3.6 Motion Compensation

Motion compensation is used to approximate the frame F_i from the estimated MV field and the reference frames F_p and F_c. This paper covers block motion compensation and motion compensation by control grid interpolation.

3.6.1 Block Motion Compensation

In block motion compensation (BMC), the MVs are interpreted as the translational motion of blocks between two frames, which corresponds very well to block motion estimation.

When performing BMC based on a bilateral MV field, F_i can be interpolated directly from pixel values of blocks from F_p and F_c at locations calculated from the MVs, as shown in Figure 3.12, without creating any holes or overlapped regions in the interpolated image.

Figure 3.12: Bilateral motion compensation of a single block. The block in frame F_i is interpolated from the two blocks in F_p and F_c, according to the previously estimated motion vector.
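A sketch of this interpolation for one block, under the same single-plane assumptions as the earlier sketches and with alpha again denoting the temporal position of F_i between F_p and F_c:

    /* Bilateral BMC of the w x h block of F_i anchored at (bx,by), given
     * its estimated vector (dx,dy): each output pixel is the temporally
     * weighted average of the two motion-shifted reference pixels. */
    static void bmc_bilateral_block(uint8_t *fi,
                                    const uint8_t *fp, const uint8_t *fc,
                                    int stride, int bx, int by,
                                    int dx, int dy, int w, int h, double alpha)
    {
        for (int j = 0; j < h; j++)
            for (int i = 0; i < w; i++) {
                int p = fp[(by + dy + j) * stride + bx + dx + i];
                int c = fc[(by - dy + j) * stride + bx - dx + i];
                fi[(by + j) * stride + bx + i] =
                    (uint8_t)((1.0 - alpha) * p + alpha * c + 0.5);
            }
    }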

The simplest form of block motion compensation based on uni- and bidirectional MV fields is performed by directly drawing blocks in F_i, pulling the blocks in F_p half of the vector length towards the corresponding block in F_c, or the other way around. In this case, however, blocks will almost certainly overlap and there will be holes in the generated image, as shown in Figure 3.13. These holes can be filled by, for example, estimating motion vectors from the MVs adjacent to the hole (Choi et al., 2000). Wong and Au (1995), however, first convert the old MV field into a bilateral MV field with holes by following the trajectories and adding the MVs to a list of candidate vectors. The best MV for each block is then selected before performing the hole filling.


3.6.2 Overlapping Block Motion Compensation

The block motion compensation algorithms themselves suffer from blocking artifacts, but as long as the motion vector field is a non-overlapping contiguous grid (bilateral MV fields have this feature, but not uni- or bidirectional ones), these artifacts can easily be reduced by using overlapping block motion compensation (OBMC). The original block sizes are increased, while keeping the blocks centered around the original location. This generates overlapping blocks, and pixel values in overlapping regions around the block edges can then be interpolated from all surrounding blocks.

Choi et al. (2000) perform OBMC (a version of it that was originally suggested by Kuo and Kuo (1997)) by increasing the block size and dividing the blocks into three types of regions, namely non-overlapping regions (R1), regions with two overlapping blocks (R2) and regions with four overlapping blocks (R3), as shown in Figure 3.14. The pixel values for R1 are generated by using only the original motion vector, as in straightforward BMC, while the pixel values in R2 and R3 are based equally on the pixel values of the blocks overlapping at that location.

Figure 3.14: Pixel-based OBMC. The grid of solid lines represents the original blocks of size N, generated during block motion estimation, while the dashed lines represent areas created when increasing the block size by w on each side.
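One way to realize the equal weighting of R2 and R3 without tracking the regions explicitly, sketched under assumptions of my own rather than taken from the cited papers, is to let every enlarged block accumulate its prediction and a hit count per pixel, and divide afterwards:

    /* acc[] holds the summed predictions of all enlarged blocks covering
     * each pixel and cnt[] how many blocks contributed; the enlarged grid
     * is assumed to cover every pixel at least once (cnt > 0). */
    static void obmc_normalize(uint8_t *fi, const unsigned *acc,
                               const unsigned *cnt,
                               int width, int height, int stride)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                fi[y * stride + x] =
                    (uint8_t)((acc[y * width + x] + cnt[y * width + x] / 2)
                              / cnt[y * width + x]);   /* rounded average */
    }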

3.6.3 Control Grid Interpolation

Sullivan and Baker (1991) suggest using Control Grid Interpolation (CGI) in video coding, in order to encode F_c in terms of F_p and motion vectors. Instead of interpreting the MV field as a set of translational motion vectors of individual blocks, they view it as a control grid which describes a spatial transformation of F_p into F_c, as shown in Figure 3.15. This transformation generates a smooth image without blocking artifacts, and captures zooming and warping fairly well. On the downside, it performs badly at abrupt changes in motion direction (Ishwar and Moulin, 2000). In the example with the rolling ball in Figure 3.15, the transformation of the ball is accurate, but the area closest to the ball is also skewed as a result.


Figure 3.15: Example of control grid representing the motion of a ball rolling to the right and approaching the camera, and MVs approximated from the original MVs (the control grid).
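The core of CGI is that each pixel gets its own motion vector, interpolated from the control points of the grid cell it belongs to. A bilinear sketch, with (u, v) in [0, 1] denoting the pixel's position within the cell and mv00..mv11 the vectors at the four surrounding control points (names are illustrative):

    typedef struct { double x, y; } vec2;

    /* Bilinear interpolation of a per-pixel motion vector inside one
     * grid cell of the control grid. */
    static vec2 cgi_interp_mv(vec2 mv00, vec2 mv10, vec2 mv01, vec2 mv11,
                              double u, double v)
    {
        vec2 r;
        r.x = (1 - v) * ((1 - u) * mv00.x + u * mv10.x)
            +      v  * ((1 - u) * mv01.x + u * mv11.x);
        r.y = (1 - v) * ((1 - u) * mv00.y + u * mv10.y)
            +      v  * ((1 - u) * mv01.y + u * mv11.y);
        return r;
    }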


Chapter 4

Implementation

As discussed in the introduction, the practical goal of the project underlying this thesis was to implement a frame rate up-conversion filter which produces intermediate frames with higher quality than frame repetition and frame averaging. The filter, named xfps due to its ability to change frame rate, was implemented using the programming language C and the framework libavfilter. It supports unidirectional, bidirectional and bilateral motion estimation as well as direct compensation and control grid interpolation.

Most of the non-libavfilter-related implementation is strongly related to the algorithms described in Chapter 3 and is thereby already described, even though not all previously described features are implemented. Xfps supports both motion compensated frame rate up-conversion and frame averaging, but the term xfps refers to motion compensated up-conversion in the text below, unless explicitly stated otherwise.

4.1 Structure

The system structure can be coarsely divided into the main parts motion estimation and motion compensation. Figure 4.1 shows a slightly simplified version of the structure and communication.

4.1.1 Motion Estimation

The most important inputs to the motion estimation function are the two reference frames and the time stamp at which an intermediate frame will be created during motion compensation (the time stamp is only used in bilateral ME). Default values or user-specified values also available during motion estimation are:

– Motion estimation block size.

– Maximum search distance.

– Block distortion threshold, under which a block should be presumed to be stationary or have the same motion as neighbouring blocks.

First, a MV field is instantiated (unidirectional, bidirectional or bilateral, depending on the preselected ME method) and a reference to it is sent, together with all of the input parameters, to a motion prediction function which estimates the MVs in the MV field. When the function returns, after motion predictive search, the MV field is passed on to motion compensation.

[Figure 4.1 diagram: motion estimation (with no, spatial or left-spatial prediction and full, diamond or hexagon based search) produces a MV field that is optionally median filtered, after which motion compensation generates the frame via direct BMC or overlapping BMC.]

Figure 4.1: System structure and communication. Solid output edges denote mandatory steps, dashed edges denote that one of the choices has to be selected and dotted edges denote optional steps.

Predictive Search

Motion vectors in the MV field are estimated from left to right, top to bottom, in all currently implemented motion prediction methods. The actual prediction is performed by evaluating MVs of previously estimated spatially neighbouring blocks. The MV which generates the lowest block distortion is then used as a starting point when searching for the best match, using a search method preselected by the user.

The implementation currently supports a spatially predictive search which selects the best MV from all previously estimated spatially adjacent blocks in the MV field, a left-spatially predictive search which only uses the MV of the block located to the left, and a non-predictive search.
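A sketch of the spatially predictive variant; mvx/mvy hold the per-block vectors estimated so far in raster order, eval() is the distortion callback from the earlier search sketches, and all names are illustrative rather than the filter's actual identifiers:

    /* Pick the start vector for block (bx,by) as the best of the zero
     * vector and the vectors of the already estimated left, top-left,
     * top and top-right neighbours. */
    static void predict_start(long (*eval)(void *ctx, int dx, int dy), void *ctx,
                              const int *mvx, const int *mvy, int bw,
                              int bx, int by, int *sx, int *sy)
    {
        static const int cand[4][2] = { {-1,0}, {-1,-1}, {0,-1}, {1,-1} };
        long best = eval(ctx, 0, 0);
        *sx = 0;
        *sy = 0;
        for (int i = 0; i < 4; i++) {
            int cx = bx + cand[i][0], cy = by + cand[i][1];
            if (cx < 0 || cy < 0 || cx >= bw)
                continue;                 /* neighbour not estimated yet */
            long d = eval(ctx, mvx[cy * bw + cx], mvy[cy * bw + cx]);
            if (d < best) {
                best = d;
                *sx = mvx[cy * bw + cx];
                *sy = mvy[cy * bw + cx];
            }
        }
    }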

Search Methods

The following search methods are currently supported by the filter:

– Full search (see Section 3.3.2); straightforward implementation, implemented mainly for comparison.


– Hexagon based search (see Section 3.3.4); implemented since it is fast compared to many other algorithms, popular and many even faster search algorithms are based on it (as covered in Section 3.3.4 and 3.3.5).

The steps in DS and HEXBS and others all have in common that they take two reference frames, a location and a motion vector (the best match from the previous step) as input, evaluate block distortion for a set of MVs relative to the best match and then return the new best match. Due to this, a generic function which takes the parameters previously mentioned and also a list of relative MVs was written to simplify implementation of new search algorithms.

To further simplify the implementation of search algorithms and to avoid repeated comparisons of the same search point, helper structures and functions which cache distortion values were implemented. The values are cached in a two-dimensional array of the same size as the search window. Distortion values are evaluated or read from the cache with a getter function which returns a high distortion for invalid (out-of-window) comparisons.

Together, the two types of helper functions make it possible to add new search methods by implementing only the general case, since the special cases are handled automatically.
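A sketch of such a cached getter, with illustrative names and layout rather than the filter's actual code; val[] holds one entry per search point in the window and is initialized to UNSET:

    #include <limits.h>

    #define UNSET (-1L)   /* distortions are non-negative */

    typedef struct {
        long *val;        /* (2*range+1)^2 entries, UNSET when unevaluated */
        int range;
        long (*eval)(void *ctx, int dx, int dy);
        void *ctx;
    } dist_cache;

    /* Evaluate-or-read a distortion; out-of-window displacements return
     * a huge distortion so callers need no bounds checks. */
    static long cache_get(dist_cache *c, int dx, int dy)
    {
        if (dx < -c->range || dx > c->range || dy < -c->range || dy > c->range)
            return LONG_MAX;
        long *slot = &c->val[(dy + c->range) * (2 * c->range + 1)
                             + (dx + c->range)];
        if (*slot == UNSET)
            *slot = c->eval(c->ctx, dx, dy);
        return *slot;
    }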

4.1.2 Motion Compensation

The implementation supports two motion compensation methods, namely direct compensation and control grid interpolation, described in Sections 3.6.1 and 3.6.3 respectively. The input parameters are the two reference frames, the MV field and an output buffer. From this information, the function draws the intermediate frame in the output buffer and returns.

Control grid interpolation is currently the only practically useful implementation of motion compensation, since the direct BMC does not perform hole filling or block artifact reduction. Notably, the implemented CGI method performs hole filling on uni- and bidirectional MV fields by converting them into bilateral MV fields before generating the new frame, which makes it suitable for all kinds of MV fields discussed in this report.

The implementation of direct compensation is minimalistic and does not perform any hole filling or block artifact reduction, which makes it unsuitable for most practical purposes. Nevertheless, it can be of interest when measuring the performance of other compensation methods.

4.2 Evaluation

Evaluation of video quality was performed both objectively and subjectively with the videos listed in Section 2.2. The output of xfps was compared to the output of frame repetition, frame averaging and MVTools to verify that the goals were reached and to get an idea about how good the image quality is compared to other tools. The evaluation was performed by dropping every other frame, reconstructing them and finally comparing each reconstructed frame to the corresponding dropped frame.


This makes artifacts easier for the human eye to detect, which simplifies the subjective evaluation process. Nevertheless, the conditions should be as similar as possible when evaluating whether or not a filter is sufficient in a particular case.

The frame rate up-conversion was performed with bidirectional motion estimation of 8x8 pixel blocks, and the compensation used control grid interpolation or similar. MVTools was used with diamond search and xfps was used with hexagon based search. It is however noteworthy that MVTools has far more features than xfps, and a few of these features were accidentally left activated. Since MVTools was only used to indicate how well xfps performed compared to other tools, the conversion was not repeated.

For objective comparison of frames, the mathematical model peak signal-to-noise ratio (PSNR) was used, due to its common usage in video quality assessment. Importantly, Huynh-Thu and Ghanbari (2008) show that PSNR correlates well with subjective video quality when comparing codecs operating on fixed content, but that the correlation between PSNR and subjective quality is low when considering how well the codecs work on different content. In other words, PSNR can be used to compare the output quality of two different codecs which operate on the same clip or frame, but not to compare how well a codec performs on different clips or frames. The ImageMagick implementation of PSNR was used to generate the PSNR values for the diagrams shown in Appendix A.
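For reference, the standard definition for 8-bit content, PSNR = 10 log10(255^2 / MSE), can be sketched as follows (this is the textbook formula, not the ImageMagick implementation used in the evaluation):

    #include <math.h>
    #include <stdint.h>

    /* PSNR in dB between a reconstructed and an original 8-bit luma plane. */
    static double psnr_plane(const uint8_t *a, const uint8_t *b,
                             int width, int height, int stride)
    {
        double sse = 0.0;
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++) {
                double d = (double)a[y * stride + x] - b[y * stride + x];
                sse += d * d;
            }
        if (sse == 0.0)
            return INFINITY;   /* identical planes */
        return 10.0 * log10(255.0 * 255.0 * width * height / sse);
    }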

4.2.1 Analysis

Objective quality comparisons (see the box plots in Figure A.1) show that MVTools generates the best result of the compared filters, followed by xfps, frame averaging and frame repetition. By analyzing the line charts in Figure A.2 and comparing the corresponding generated frames, several interesting observations can be made.

Rapid Scene Change

During rapid scene changes, such as in generated frame 114 in the Big Buck Bunny Trailer (frame approximations shown in Figure 4.2), all methods generate a low PSNR, as can be seen in Figure A.2d. Frame repetition generates the lowest PSNR, followed by xfps and frame averaging, which is also used by MVTools. Frame averaging is subjectively the best method in this case (with a perfectly satisfying result as long as only one intermediate frame is generated), followed by frame repetition (since it introduces jerky motions) and xfps (worst since the image is obfuscated).

However, MVTools has a few large negative spikes in Figure A.2b, where it uses frame averaging even though there is no scene change. Nevertheless, if a scene change is not detected and the result looks similar to Figure 4.2d, then the artifacts of a false positive detection of a scene change (which results in frame averaging) are considerably smaller than those of a false negative (which results in invalid motion compensation).

Fading

Sometimes a scene slowly fades to black, which creates different pixel intensity for blocks that otherwise would result in a perfect match. This in turn generates bad estimations and a flickering image when using xfps, but MVTools (and frame averaging) handles this problem much better.

A case of fading that is harder to account for is a fade between two scenes with moving objects. In this case the previously described problem with fading is combined with the fact that there are two semi-correct motion vectors for each block, but none which is truly correct unless objects in the two scenes have the same motion. An example is shown in Figure 4.3. Subjectively, xfps does not generate reasonable results in such cases, whereas MVTools does in most cases. These kinds of artifacts are not necessarily clearly visible when looking at PSNR, since the area of the errors is small and most of the image is well compensated.

(a) Original frame. (b) Frame repetition.

(c) Frame averaging. (d) Xfps.

Figure 4.2: Examples of artefacts introduced by a quick scene change, from the Big Buck Bunny Trailer.

Slow Motions

Frame averaging performs very well in some cases, especially in the trailers (see PSNR in Figure A.2c and A.2d). It turns out that these passages generally contain no or very small movements. Based on this observation, it would be suitable to perform frame averaging instead of MC when two frames are very similar, both to remove the risk of using invalid motion vectors and to reduce the computational power needed.

Frame Edges

One recurring type of artifact is caused by invalid motion estimation along the edges, as can be seen along the bottom edge in Figure 4.4. The invalid estimations occur when an object or part of an object moves in or out of view between two frames, since the implemented search algorithms cannot estimate the movement of an object if it only exists in one of the frames.


(a) Original frame. (b) Frame averaging.

(c) Xfps. (d) MVTools.

Figure 4.3: Examples of artefacts introduced by a slow fade from one scene to another, from the Sintel Trailer.

One example would be the upper left and lower right corners of the frame when the camera pans up and left at the same time.

Subjective Evaluation


(a) Original frame. (b) Frame averaging.

(c) Xfps. (d) MVTools.

Figure 4.4: Examples of artefacts introduced when part of an object moves out of the image. Note especially the bottom of the frames generated by Xfps and MVTools.

(a) Original frame. (b) Frame averaging. (c) Xfps.

Figure 4.5: Approximations of a frame in the clip Park Joy. It contains small movements of small objects and large movements of a large object. The frame is subjectively representative of this video.

(a) Original frame. (b) Frame averaging. (c) Xfps.

Figure 4.6: Approximations of a frame from the clip Foreman.

(a) Motion compensated frame. (b) Original frame.

Figure 4.7: Example of a bad approximation.

Chapter 5

Discussion

The objective evaluations made in Section 4.2.1 indicate that the implemented frame rate up-conversion filter generates an improved video quality compared to frame repetition and frame averaging. However, the subjective evaluations show that the result is not good enough for high-resolution video, especially in the worst cases. MVTools outperforms xfps, but cannot be used in Vidispine due to licensing incompatibilities.

In the problem statement, an aim was set alongside the goals, namely real-time conversion from 720p/24 to 720p/30. Unfortunately, little time was spent on optimizing the time-efficiency of the implementation, and the current implementation generates between 1 and 2 frames per second on an Aspire 3750 with an Intel Core i3 2330M processor and 4 GB DDR3 SDRAM at 1066 MHz. Due to the little time spent on optimizations, no efforts were made to scientifically quantify these results. A faster computer may improve the speed slightly, but is unlikely to come even close to real-time conversion.

In conclusion, a lot of optimizations have to be done before xfps is ready for release, both regarding video quality and speed.

5.1 Limitations and Future Work

The development of xfps has been put on hold due to a large number of limitations. These limitations are listed below, sometimes together with a possible solution in case of continued development.

– The implementations of the functions for measuring distortion between pixels and the function for interpolating pixel values only work with video in the YUVJ family of pixel formats. Support for more YUV-based pixel formats could be added to the functions with little effort, but RGB-based pixel formats have to be converted to YUV or similar before they can be used successfully with common distortion methods.

– The prototype lacks proper error handling, and invalid/incompatible parameters may cause segmentation fault.

– Motion estimation does not work well on high-resolution video or fast movements. One solution may be to use a hierarchical frame structure with the reference frames scaled to different resolutions and perform a coarse-to-fine grained motion estimation (Jeon et al., 2003), or to use iterative random search (Porto et al., 2012). MVTools uses a variant of the coarse-to-fine grained estimation, which is likely a major reason why MVTools works so well.

– Motion estimation does not work well if a fade occurs between two successive reference frames. Thaipanich et al. (2009) suggest a method to adjust the frame intensity before performing motion estimation.

– Scene changes cause poor image quality. A fallback to frame averaging on large distortion between two frames may work as a solution, but a false positive scene change detection may also result in poor image quality. It has been shown that more advanced scene change detection, based on the difference between successively estimated MV fields (Shu and Chau, 2005) or on histogram analysis (Kang et al., 2012), can improve the accuracy of the detection.

– The comparatively slow functions malloc and calloc are overused in the implementation. Reusing allocated memory to a larger extent is likely to increase the execution speed.

– The prototype runs on the CPU in a single thread. Jostell and Isberg (2012) used OpenMP for parallelization, OpenCL to enable GPU processing and SSE (PSADBW) to compute SAD on the CPU for up to 16 pixels at a time. These libraries, and OpenCV (Open Computer Vision), may be of interest when improving the hardware support.

– Block distortion methods are currently unable to compare blocks partly out-of-frame. Section 4.2.1 mentions how this may be implemented, but also new problems that would arise with such an implementation.

– Luessi and Katsaggelos (2009) suggest reusing MVs from the video bitstream to improve the estimation speed. They acknowledge that problems exist in many algorithms which reuse the MVs from the bitstream, and compensate for these without a decrease in image quality. Notably, though, they only use low-resolution video, and the results may not be valid for high-resolution video.

– Pixel-based MV selection is a method that can be used to reduce the skewing artefacts introduced by Control Grid Interpolation and similar motion compensation techniques (Tran and LeDinh, 2011).

Since MVTools performs substantially better than xfps, it would be interesting to take an extra look at its functionality. What seem to be the most interesting parameters to the motion estimation function (MVAnalyse) can be divided into two groups:

– For coherence between vectors in the MV field: lambda, lsad and pnew.

– For improved estimation at luma flicker and fades: dct.


Chapter 6

Acknowledgements

Many thanks go out to employers and employees at CodeMill, to my supervisors Frank Drewes and Tomas Härdin, and to everybody else who supported me in one way or another.


References

Castagno, R., Haavisto, P., and Ramponi, G. (1996). A method for motion adaptive frame rate up-conversion. IEEE Transactions on Circuits and Systems for Video Technology, 6(5):436–446.

Chalidabhongse, J. and Kuo, C.-C. (1997). Fast motion vector estimation using multiresolution-spatio-temporal correlations. IEEE Transactions on Circuits and Systems for Video Technology, 7(3):477–488.

Chiang, J., Kuo, W., and Su, L. (2007). Fast motion estimation using hexagon-based search pattern in predictive search range. In Proceedings of 16th International Conference on Computer Communications and Networks, 2007, pages 1149–1153.

Choi, B., Lee, S., and Ko, S. (2000). New frame rate up-conversion using bi-directional motion estimation. IEEE Transactions on Consumer Electronics, 46(3):603–609.

Chung, K.-L. and Chang, L.-C. (2003). A new predictive search area approach for fast block motion estimation. IEEE Transactions on Image Processing, 12(6):648–652.

Guanfeng, Z., Guizhong, L., and Rui, S. (2003). A modified hexagon-based search algorithm for block motion estimation. In Proceedings of the 2003 International Conference on Neural Networks and Signal Processing, volume 2, pages 1205–1208.

Ha, T., Lee, S., and Kim, J. (2004). Motion compensated frame interpolation by new block-based motion estimation algorithm. IEEE Transactions on Consumer Electronics, 50(2):752–759.

Huang, H. and Chang, S. (2011). Block motion estimation based on search pattern and predictor. In 2011 IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing, pages 47–51.

Huynh-Thu, Q. and Ghanbari, M. (2008). Scope of validity of PSNR in image/video quality assessment. Electronics Letters, 44(13):800–801.

Ishwar, P. and Moulin, P. (2000). On spatial adaptation of motion-field smoothness in video coding. IEEE Transactions on Circuits and Systems for Video Technology, 10(6):980–989.

Jeon, B., Lee, G., Lee, S., and Park, R. (2003). Coarse-to-fine frame interpolation for frame rate up-conversion using pyramid structure. IEEE Transactions on Consumer Electronics, 49(3):499–508.

Jostell, J. and Isberg, A. (2012). Frame rate up-conversion of real-time high-definition remote surveillance video. Master’s thesis, Chalmers University of Technology, Sweden.


Kang, S.-J., Cho, S. I., Yoo, S., and Kim, Y. H. (2012). Scene change detection using multiple histograms for motion-compensated frame rate up-conversion. Journal of Display Technology, 8(3):121–126.

Kuo, T.-Y. and Kuo, C.-C. J. (1997). Complexity reduction for overlapped block motion compensation (OBMC). In Proc. SPIE Visual Communications and Image Processing, volume 3024, pages 303–314.

Liu, B. and Zaccarin, A. (1993). New fast algorithms for the estimation of block motion vectors. IEEE Transactions on Circuits and Systems for Video Technology, 3(2):148–157.

Luessi, M. and Katsaggelos, A. (2009). Efficient motion compensated frame rate upconver-sion using multiple interpolations and median filtering. In 2009 16th IEEE International Conference on Image Processing, pages 373–376.

Porto, M., Cristani, C., Dall’Oglio, P., Grellert, M., Mattos, J., Bampi, S., and Agostini, L. (2012). Iterative random search: a new local minima resistant algorithm for motion estimation in high-definition videos. Multimedia Tools and Applications, pages 1–21.

Shu, H. and Chau, L.-P. (2005). A new scene change feature for video transcoding. In 2005 IEEE International Symposium on Circuits and Systems (ISCAS 2005), pages 4582–4585. IEEE.

Sullivan, G. J. and Baker, R. L. (1991). Motion compensation for video compression using control grid interpolation. In 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 2713–2716.

Tang, C. and Au, O. (1997). Unidirectional motion compensated temporal interpolation. In Proceedings of 1997 IEEE International Symposium on Circuits and Systems., volume 2, pages 1444–1447.

Thaipanich, T., Wu, P., and Kuo, C. (2009). Low complexity algorithm for robust video frame rate up-conversion (FRUC) technique. IEEE Transactions on Consumer Electronics, 55(1):220–228.

Tham, J., Ranganath, S., Ranganath, M., and Kassim, A. (1998). A novel unrestricted center-biased diamond search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 8(4):369–377.

Thambidurai, P., Ezhilarasan, M., and Ramachandran, D. (2007). Efficient motion estimation algorithm for advanced video coding. In International Conference on Computational Intelligence and Multimedia Applications, 2007, volume 3, pages 47–52.

Tran, T. and LeDinh, C. (2011). Frame rate converter with pixel-based motion vectors selection and halo reduction using preliminary interpolation. IEEE Journal of Selected Topics in Signal Processing, 5(2):252–261.

Tsai, T. and Pan, Y. (2004). A novel predict hexagon search algorithm for fast block motion estimation on H.264 video coding. In The 2004 IEEE Asia-Pacific Conference on Circuits and Systems, Proceedings, volume 1, pages 609–612.


Xiong, B. and Zhu, C. (2008). A new multiplication-free block matching criterion. IEEE Transactions on Circuits and Systems for Video Technology, 18(10):1441–1446.

Xu, C., Chen, Y., Gao, Z., Ye, Y., and Shan, T. (2011). Frame rate up-conversion with true motion estimation and adaptive motion vector refinement. In 2011 4th International Congress on Image and Signal Processing, volume 1, pages 353–356.

Zafar, S., Zhang, Y.-Q., and Baras, J. S. (1991). Predictive block-matching motion estimation for TV coding. Part I: Inter-block prediction. IEEE Transactions on Broadcasting, 37(3):97–101.

Zhai, J., Yu, K., Li, J., and Li, S. (2005). A low complexity motion compensated frame interpolation method. In IEEE International Symposium on Circuits and Systems, 2005, pages 4927–4930.

Zhang, Y.-Q. and Zafar, S. (1991). Predictive block-matching motion estimation for TV coding. Part II: Inter-frame prediction. IEEE Transactions on Broadcasting, 37(3):102–105.


Appendix A

Diagrams


A.1 Box Plots

[Figure A.1: Box plots of the PSNR for images generated from the test videos, comparing the filters fps, xfps:avg, xfps:mc and mvtools. Panels: (a) Sintel trailer PSNR, (b) Park Joy PSNR, (c) Tears of Steel trailer PSNR, (d) Big Buck Bunny PSNR, (e) Foreman PSNR, (f) Bus PSNR.]

A.2 Line Charts

[Figure A.2: Line charts of per-frame PSNR for the test sequences, comparing the filters fps, xfps:avg, xfps:mc and mvtools. Panels: (a) Sintel trailer PSNR, (b) Park Joy PSNR, (c) Tears of Steel trailer PSNR, (d) Big Buck Bunny PSNR, (e) Foreman PSNR, (f) Bus PSNR.]
