
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Investigating the Adaptive Loop Filter in Next Generation Video Coding

ALFONSO DE LA ROCHA GÓMEZ-AREVALILLO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING


TRITA 2017:002 ISSN 1653-5146

www.kth.se


Abstract

Current trends in video technologies and services demand higher bit rates, higher video resolutions and better video quality. This results in the need for a new generation of video coding techniques that increase the quality and compression rates achieved by previous standards.

Since the release of HEVC, ITU-T VCEG and ISO/IEC MPEG have been studying the potential need for standardization of future video coding technologies with a compression capability that significantly exceeds that of current standards. These standardization and compression-enhancement efforts are being implemented and evaluated in a software test model known as the Joint Exploration Model (JEM). One of the blocks being explored in JEM is an Adaptive Loop Filter (ALF) at the end of each frame's processing flow. ALF aims to minimize the error between original pixels and decoded pixels using Wiener-based adaptive filter coefficients, reporting, in its JEM implementation, improvements of around 1% in the BD MS-SSIM rate. A lot of effort has been devoted to improving this block over the past years. However, current ALF implementations do not consider the potential use of adaptive QP algorithms at the encoder. Adaptive QP algorithms enable the use of different quality levels for the coding of different parts of a frame to enhance its subjective quality.

In this thesis, we explore potential improvements over different dimensions of JEM's Adaptive Loop Filter block, considering the potential use of adaptive QP algorithms. In the document, we explore a wide range of modifications over ALF's processing stages, the ones with the best results being (i) a QP-aware implementation of ALF where the filter coefficient estimation, the internal RD optimization and the CU-level flag decision process are optimized for the use of adaptive QP, (ii) the optimization of ALF's standard block activity classification stage through the use of CU-level information given by the different QPs used in a frame, and (iii) the optimization of ALF's standard block activity classification stage in B-frames through the application of a correction weight on coded, i.e. not predicted, blocks of B-frames. Combined, these ALF modifications obtained improvements of 0.419% on average for the BD MS-SSIM rate in the luma channel, with the individual modifications showing improvements of 0.252%, 0.085% and 0.082%, respectively. Thus, we conclude the importance of optimizing ALF for the potential use of adaptive QP algorithms in the encoder, and the benefits of considering CU-level and frame-level metrics in ALF's block classification stage.

This thesis was developed in cooperation with Ericsson Research.

Keywords

Joint Exploration Model, JEM, Adaptive Loop Filter, variance-based adaptive quantization, video compression, video coding standards.


Sammanfattning (Summary)

Developments in video technology and services demand higher bit rates, higher video resolutions and better quality. This calls for a new generation of coding techniques to increase quality and compression rates beyond what earlier technology could deliver. Since the release of HEVC, ITU-T VCEG and ISO/IEC MPEG have studied the potential need for standardization of future video coding technologies with a compression capability that far exceeds today's systems. These standardization and compression efforts have been implemented and evaluated within a software test model called the Joint Exploration Model (JEM). One of the areas investigated within JEM is the Adaptive Loop Filter (ALF), which is added at the end of each picture's processing flow. ALF aims to minimize the error between original pixels and decoded pixels through Wiener-based adaptive filter coefficients. A lot of effort has been spent on improving this area in recent years. However, current ALF implementations do not consider the potential of using adaptive QP algorithms in the video encoder. Adaptive QP algorithms allow different quality levels to be used for the coding of different parts of a picture in order to improve its subjective quality.

In this thesis we investigate the potential improvement of JEM's adaptive loop filter that can be achieved by considering adaptive QP algorithms. We examine a large number of modifications to ALF's processing stages, the ones with the best results being: (i) a QP-aware implementation of ALF where the filter coefficient estimation, the internal RD optimization and the CU-level flag decision process are optimized for the use of adaptive QP, (ii) the optimization of ALF's standard block activity classification stage using CU-level information produced by the different QPs used in a picture, and (iii) the optimization of ALF's standard block activity classification stage in B-pictures by applying a correction weight to coded, i.e. not predicted, blocks of B-pictures. When these ALF modifications were combined, the BD MS-SSIM rate in the luma channel improved on average by 0.419%, with the individual modifications improving by 0.252%, 0.085% and 0.082%, respectively. We therefore concluded that it is important to optimize ALF for the potential use of adaptive QP algorithms in the encoder, and that there are benefits in considering CU-level and picture-level metrics in ALF's block classification stage.

This thesis was produced in cooperation with Ericsson Research.


Acknowledgements

To Per Wennersten, my thesis advisor at Ericsson Research, for his wise guidance and his eternal patience after having me in his office with questions every two hours for six months. I have learnt a lot about research from you, and that is something I will always appreciate. Even more, to Ericsson Research, and especially to Thomas Rusert, manager of our research group, for giving me the opportunity to experience what it is like to do research in a big company.

To my family, for supporting me at all times, and for dealing with my bad moods when my simulations didn't give the results I was expecting. To my mother, for her incredible strength that allows her to deal with a husband in Nicaragua and each son in a different country. To my father, for his personal and wise life advice delivered directly from Nicaragua through Skype. One more year and we will, hopefully, be back together in our beloved Spain.

And last but not least, to my brother Juan, to whom, apparently, 60% of this thesis and its related income belongs according to a, probably nonexistent, verbal contract we had when we were 10 years old. I am starting to think that he is just trying to make a profit from all my work without having to work himself. Smart!


Contents

List of Figures

List of Tables

1 Introduction
1.1 Background
1.2 Motivation
1.3 Related Work
1.4 Outline

2 Video Coding
2.1 Video coding standards
2.2 Video coding overview
2.2.1 Video frame
2.2.2 Picture partitioning
2.2.3 Prediction
2.2.4 Transformation and Quantization
2.2.5 Variance-based adaptive quantization
2.2.6 Entropy coding
2.2.7 In-loop filters
2.2.8 Rate-Distortion Optimization [20]
2.2.9 Video compression evaluation

3 Adaptive Loop Filter
3.1 Block Classification and initial RD cost computation
3.2 Initial filter coefficients estimation
3.2.1 Discrete Wiener Filter coefficient estimation
3.2.2 Geometric transformation of filter coefficients
3.2.3 Filter merging, fixed filters and temporal coefficient prediction
3.3 CU-level flag and filter tap decisions
3.4 Chroma ALF processing and picture-level flag decision

4 Development
4.1 Test conditions
4.2 Baseline experiments
4.3 Adaptive QP optimizations
4.3.1 QP-aware RD cost computation optimization
4.3.2 QP-aware CU-level flag decision
4.3.3 QP-aware coefficient estimation optimization
4.3.4 QP-aware ALF optimizations results
4.4 Block distribution improvements in the classification stage
4.4.1 Average activity distribution classification
4.4.2 Constant number of blocks classification
4.4.3 Modified block distribution results
4.5 Enhanced activity classifications including CU-level metrics
4.5.1 Block classification considering expected block activity in CUs for a specific QP
4.5.2 Block classification considering average block activity for a CU
4.5.3 Block classification considering the absolute relation between block activity and average CU activity
4.5.4 Block classification considering QP weight
4.5.5 Enhanced classifications results
4.6 QP-weight classification fine-tuning
4.7 Activity block distribution fine-tuning
4.7.1 QP weight average activity correction
4.7.2 Optimal activity thresholds univariate regression model
4.7.3 Optimal activity thresholds multivariate model
4.7.4 Results of block distribution fine-tuning
4.8 B-frame classification enhancements
4.8.1 B-weight activity correction
4.8.2 B-weight activity correction fine-tuning

5 Results and discussion

6 Conclusions
6.1 Future work
6.2 Societal and ethical aspects

Acronyms


List of Figures

2.1 Simplified JEM encoder block diagram
2.2 HEVC picture partitioning structure
2.3 JEM picture partitioning structure [11]
2.4 Original image to the left, and compressed image to the right showing ringing and blocking artifacts [47]
3.1 ALF processing stages at the encoder [7]
3.2 Value of Aq related to the activity quantization index
3.3 ALF available filter shapes (5x5, 7x7, 9x9) [7]
4.1 Static thresholds average activity distribution
4.2 Dynamic thresholds average activity distribution
4.3 Constant classification activity bins
4.4 Constant number of blocks per activity group
4.5 Constant number of blocks per activity group and direction
4.6 Uncertainty level over 2x2 blocks with different QP
4.7 Number of sequences with minimum RD cost for different values of qp
4.8 Best activity thresholds for a one-frame full simulation compared to average frame activity
4.9 Relative influence of the average activity and QP over the model


List of Tables

3.1 Group classification index according to directional gradients (D) and quantized activity values (Aq)
3.2 Mapping of the gradient computed for a block and transformations
4.1 BD MS-SSIM Rate comparison of ALF's effect with the AQ algorithm enabled or disabled
4.2 BD MS-SSIM Rate results for the QP-aware optimized ALF implementation
4.3 BD MS-SSIM Rate results for different block distributions
4.4 BD MS-SSIM Rate results for modified classifications using different block distributions
4.5 BD MS-SSIM Rate results for modified classifications using different block distributions
4.6 BD MS-SSIM Rate results for different qp
4.7 BD MS-SSIM Rate results for the optimal number of blocks linear model
4.8 RSS for average activity regression models
4.9 RSS for multivariate regression models
4.10 BD MS-SSIM Rate results for QP weight block distribution enhancements
4.11 BD MS-SSIM Rate I-frame results for QP weight block distribution enhancements
4.12 BD MS-SSIM Rate I-frame results for QP weight block distribution enhancements
4.13 BD MS-SSIM Rate results for different qp
4.14 BD MS-SSIM Rate results for B weight activity correction
5.1 BD MS-SSIM Rate final results
6.1 Class D simulation video sequences
6.2 Full simulation video sequences


Chapter 1

Introduction

1.1 Background

The amount of data generated daily in our interconnected society is astounding. An important share of this data is transmitted in a video format. According to the latest Cisco Zettabyte report [1], in 2015 82% of all IP traffic, both business and consumer, was IP video traffic. Moreover, the trend predicts that this value will keep increasing with the market penetration of novel services such as video-on-demand, internet video to TV, virtual reality or internet-based video surveillance, and existing services such as cloud storage and video streaming. In addition, the report [1] details how all this information will, most likely, be consumed through portable devices with limited computation and bandwidth capabilities. It is expected that by 2020 smartphone traffic will double PC traffic.

Along with the increase in the amount of video information exchanged over the internet, and the bandwidth and computational limitations of portable devices, there is an increasing demand for higher bit rates, higher video resolutions and better video quality for the same video transmission times and file sizes. Thus, compression technologies that provide higher coding efficiency than the current generation of video standards are needed.

Over the past two decades, a series of video coding standards have appeared, namely MPEG-1 Video [2], MPEG-2 Video [3], MPEG-4 Visual [4] and H.264/Advanced Video Coding (AVC) [5]. These standards have played a fundamental role in enabling the multimedia applications that we use today. Their evolution and technical improvements show the aforementioned trend of increasing coding efficiency and compression requirements to enable higher bit rates, higher resolutions and better video quality. The most recent video coding standard, released in 2013, is High Efficiency Video Coding (HEVC/H.265) [6]. HEVC was designed to improve coding efficiency compared to its predecessor, AVC/H.264 [5]. Specifically, HEVC aims to halve the bit-rate requirements of AVC for the same image quality, at the expense of increased computational complexity.

After the release of HEVC, ITU-T VCEG (Q6/16) and ISO/IEC MPEG (JTC 1/SC 29/WG 11) formed a joint working group, the Joint Video Exploration Team (JVET), to study the potential need for standardization of future video coding technologies with a compression capability that significantly exceeds that of the current HEVC standard. These standardization and compression-enhancement efforts are being explored and implemented in a software test model known as the Joint Exploration Model (JEM) [7]. JEM represents the software model which might be used as the starting point for the next generation of video coding standards following HEVC.


1.2 Motivation

One of the new blocks considered for the Joint Exploration Model, which was already considered but not included nor standardized in HEVC, is an Adaptive Loop Filter at the end of the decoding loop [8], [9]. This Adaptive Loop Filter (ALF) aims to minimize the error between original pixels and decoded pixels using Wiener-based adaptive filter coefficients. ALF is located at the last processing stage of a picture, after the deblocking filter [10], and can be seen as a tool that tries to catch and fix artifacts generated, or not removed, by previous coding stages. ALF mainly tries to fix ringing and blurring artifacts, enhancing in this way the objective quality and compression efficiency of the video. ALF works by automatically classifying pixel blocks into one of several categories, and designing optimal filter parameters for each category.

In addition to ALF, current video encoders are starting to use a feature already explored in H.264 [5], variance-based adaptive quantization (adaptive QP), which allows the encoder to use a different value of the quantization parameter (QP) during the coding of different blocks. Thus, a different QP is selected in the coding process of each block of a frame according to the video signal variance in the block. Current implementations of the Adaptive Loop Filter, even in JEM's [11] latest proposal, do not consider the potential use of adaptive QP in the coding process, treating all the pixels and all the blocks in a frame as if they were coded using the same global QP set for the frame by the encoder. Thus, the classification of pixels and the estimation of filter parameters are no longer optimal, resulting in pixels being filtered with the wrong filter parameters, or not filtered at all, when important quality enhancements could be achieved by filtering these pixels accurately. Later in the document, a detailed overview of variance-based adaptive QP and the Adaptive Loop Filter will be presented, giving a better view of the problem at hand in this thesis.

With JEM, brand new coding and compression techniques are being explored to improve on the video quality and compression rates currently achieved by HEVC. One of the coding blocks where additional improvements may be achieved is ALF, which is still in the early stages of standardization and adoption. Hence, the aim of the thesis is to investigate potential improvements over JEM's current standard implementation of the Adaptive Loop Filter, considering its current QP-unaware design and the potential use of adaptive QP by modern video encoders, in order to verify the compression enhancements that may be obtained through changes to this block.

Thus, the main questions to be investigated and answered in this thesis are the following:

• Can we modify JEM's current implementation of the Adaptive Loop Filter so that measurable video compression improvements are achieved?

• Will the design of an enhanced Adaptive Loop Filter that considers the use of adaptive QP in the encoder increase its compression capabilities compared to current QP-unaware ALF implementations?

• Can other modifications in ALF’s pixel categorization stage result in more optimally designed filter coefficients for each block with a consequent quality improvement?

For the development of the thesis, the following resources, models and references were used as baseline and starting point for our research:

• Modified JEM 3.1 reference software [12]: This is the Joint Exploration Model reference encoder software. It presents a basic implementation of JVET's latest encoder for research purposes. For the development of the thesis, an internal, Ericsson-modified version of the JEM version 3.1 reference software was used. The main difference of this modified software compared to the official one available in [12] is the inclusion of a slightly improved implementation of H.264's variance-based adaptive quantization algorithm.

• Current standard implementation of adaptive loop filters: In [11], [13] and [7], a standardized description of the ALF implementation used in the reference software is presented. These resources were very useful for understanding the exact ALF techniques included in the reference software. However, they were not detailed enough to completely understand ALF's operation; therefore, additional references, along with the source code, were needed to completely understand JEM's ALF.

• ALF's standard algorithms and techniques: As already mentioned above, additional resources were needed to completely understand ALF's standard implementation. Although slightly outdated, [8] and [9] give a very good overview of the base techniques used in video coding adaptive loop filters. Most of the techniques presented in these papers are used, with improvements, in JEM's ALF implementation.

To conclude this introduction, it is worth mentioning the method and process followed for the development of the thesis. The aim of this work is to investigate potential improvements over a deeply studied system such as the Adaptive Loop Filter. As will be shown in the related work section (section 1.3), a lot of effort has been devoted in academia and industry to improving and fine-tuning this video coding block for its inclusion in the next generation of video coding standards. Thus, an iterative method was selected to approach the thesis. The project is based on testing different ideas and hypotheses to explore every potential improvement over ALF. This iterative method allowed us to rapidly implement and evaluate ideas, enabling their quick dismissal if the results obtained were not the ones expected. Every iteration gave us good insight into the problem, useful on the path to a final solution, by improving our development after each iteration.

Hence, the iterative process was performed as follows:

• Firstly, we conducted an initial state-of-the-art review to research the fields and techniques that have been explored to improve ALF. During this initial stage, we also familiarized ourselves with JEM's reference software, and specifically with the standard operation and software implementation of JEM's ALF.

• Prior to the first iteration, a set of baseline test measures were gathered from the standard encoder in order to understand how ALF behaved under different encoding configurations.

• Then, the actual iterative development process was conducted. Every stage started with the implementation of a new idea or technique in the encoder. Once the modifications were included in the encoder, test measures were gathered in order to analyze the performance, quality and potential of the new implementation. After every iteration, a set of conclusions were drawn that served as groundwork for the next iteration step. Every new implementation followed a complex workflow because of the enormous amounts of computation time required for the formal testing and data gathering process of the modified prototypes. Thus, once the new feature was programmed in the encoder, local tests were performed, using a small number of small-size frames, in order to verify its correct operation, to gather sample results and to discard, if needed, implementations that gave no sign of improvement. After this verification, formal tests were performed over the new prototype, again with small-size videos, using Ericsson Research's server infrastructure (performing what we will call class D simulations). Finally, complete formal tests (what we called full simulations) were performed on the ALF implementations that showed the most promising results. Each test needed between three days and a week to finish for the small-size video sequence simulations, and over five weeks for the full-quality tests, resulting in the need for exhaustive planning in the development of the thesis. A detailed description of the testing process used in the thesis will be presented in chapter 4. This testing process was so time consuming because of how slow JEM is: JEM has been implemented over the years by a great number of programmers, all aiming to explore new video coding techniques without always considering the efficiency of their algorithms, resulting in JEM being a very slow video encoder/decoder implementation.

• After the final iteration, our ALF development was fine-tuned and subjected to final tests in order to draw conclusions.

Further details of this development process will be given in the development, results and conclusions chapters (chapters 4-6).

1.3 Related Work

A lot of effort was devoted to improving ALF prior to its inclusion in JEM. In this section, an exhaustive overview of the state-of-the-art techniques explored for the improvement of ALF is presented. This section is the result of the state-of-the-art review process referred to in section 1.2.

ALF aims to minimize the error between original samples and decoded samples by using Wiener-based adaptive filters. This type of filter adaptively estimates the best filter coefficients to minimize this error. ALF is included at the last processing stage of each picture, inside the in-loop filtering stage. In-loop filters were included in HEVC [14] as a tool to remove artifacts from previous coding stages.

As already mentioned above, in [8] and [9], Chen, Tsai et al. present a detailed foundational reading for understanding ALF's operation and techniques. These papers are quite outdated, as they present ALF's implementation from version HM-7.0 [15] of the encoder, which belonged to one of the proposals to include ALF in HEVC. However, the papers present a detailed explanation of how ALF's Wiener filters work, along with ALF's core operations and techniques, which set the basis for its current implementation in JEM. A detailed explanation of JEM's ALF implementation, and all its mechanisms, will be presented in chapter 3.

There have been different approaches to the task of improving ALF, the most relevant being the following:

• Unification of filters in the in-loop filtering stage: The in-loop filtering stage consists of the deblocking filter [10], the Sample Adaptive Offset (SAO) [16] and the Adaptive Loop Filter (ALF). Each of these blocks was designed independently. Neither SAO nor ALF considers the effect of the deblocking filter in its operation when processing a frame. This is why there have been some efforts to unify these three blocks in order to increase their effect and improve their efficiency. In [17] a unified filter is proposed as an alternative to these three blocks, in an attempt to achieve the highest possible artifact correction rate. It works by classifying pixels into block-boundary pixels, so-called enhancement pixels in the paper, and non-boundary pixels, called restoration pixels, and then designing a different type of filter for each of these pixel types, so that stronger deblocking capabilities are applied to boundary pixels and smoother filters to non-boundary pixels. This approach shows improvements compared to a basic encoder implementation without in-loop filtering, and similar gains compared to the processing chain of DBF, SAO and ALF, but with a lower encoding complexity. A less ambitious work considering the effect of the deblocking filter in ALF may be seen in [18]. In this paper, the authors propose a Classified Quadtree-based Adaptive Loop Filter (CQALF), which classifies pixels in a picture considering the impact of the deblocking filter over them. Thus, for the pixels that are modified by the deblocking filter, the filter is estimated at the encoder by minimizing the mean square error between the original input frame and a combined frame which is a weighted average of the reconstructed frames before and after the deblocking filter. For pixels that the deblocking filter does not modify, the filter is estimated using the standard ALF method, minimizing the mean square error between the original frame and the reconstructed frame. CQALF achieved average BD bitrate reductions of between 5% and 10% compared to the standard quadtree-based implementation of ALF. These papers fall under the specific approach of trying to unify filters in the in-loop filtering stage; however, they could easily have been included in the group of papers that aim to improve ALF's pixel classification process, as the above implementations rely on a smart pixel categorization for their operation. Nevertheless, for illustrative purposes, it is interesting to see this attempt at unifying in-loop filters as an independent improvement approach.

• ALF efficiency improvements: Some of the work performed on ALF has focused on reducing its computational complexity. ALF improves the objective and subjective quality at the expense of a higher computational complexity and the transmission of additional information. A first approach to reduce the amount of information transmitted by ALF, increasing its compression rate, is shown in [19]. In this paper, a temporal prediction of filter parameters between different frames is presented. For each frame, two sets of adaptive loop filter parameters are adaptively selected. The first set is estimated using the traditional ALF operation, minimizing the mean square error between the original frame and the current reconstructed frame, while the second set of parameters is the one used in the latest prior frame. The selection of the set of parameters that best suits a frame is performed by rate-distortion optimization [20]. As will be discussed below, the standard ALF uses rate-distortion optimization in each step of its operation to decide whether to enable each of its available features. This temporal prediction method showed small improvements of less than 0.5% in the BD rate and an average 1% increase in decoding time. Another interesting approach used to reduce ALF's complexity is the one followed in [21]. ALF performs several picture buffer accesses for its operation, significantly increasing the encoder's memory access, encoding latency and power consumption. Consequently, the paper proposes a method to estimate filtering distortion without performing real filter operations. With this, the number of encoding passes is reduced from 16 to 1, inducing a 0.17% BD-rate increase.

• Pixel classification improvements: Another branch in the effort to improve ALF considers that the current implementation of the Wiener-filter coefficient estimation has been well studied for some time and is, in consequence, very optimized. Therefore, the remaining way to improve ALF is to modify the way pixels are classified, in order to ensure that the most accurate filter is designed for each classification group. To classify pixels, a frame is divided into multiple non-overlapping regions and a set of statistics is collected for each region. According to these statistics, the region, and its included pixels, are classified, and the optimal filter is designed for all the regions in the group. As will be seen below, this is one of the approaches selected in this thesis to explore potential improvements over ALF. There are many papers that explore different classification methods in order to design and apply the best possible filter to each pixel. In fact, most of the proposals for pixel categorization that will be presented now have been gradually adopted in the current JEM ALF implementation. Firstly, in ALF's core paper [9], the two main techniques for pixel classification are presented. In region-based adaptation (RA) [22], a frame is divided into 16 regions of similar size and a local filter is adapted and applied for each region. This technique is useful for pictures that show apparent structure and repetitive patterns in a local region, for instance a region composed of blue sky. Block-based adaptation (BA) [23], [24], on the other hand, divides the image into non-overlapping 2x2 blocks (according to the latest ALF implementation [7]) and each block is classified into one of 25 possible groups according to the block's statistics, which are mainly the direction and gradient of the pixels. A filter is then designed for each group according to the pixels included in it. These papers focused on the granularity of the classification, while other authors focus on improving or modifying the statistics collected from each region, or block, to optimize their classification. In [25] a classification scheme is presented where blocks are categorized according to their local directional characteristics, instead of the standard technique of classifying according to the blocks' activity and direction gradient. The authors claim that this classification based on the pixels' local orientation yields improvements of around 5% on average compared to the quadtree-based implementation of ALF.

• Use of static picture techniques: Finally, as ALF works with static images and does not exploit the temporal domain of video sequences, we decided to research the possibility of using state-of-the-art techniques for static pictures in the improvement of ALF. ALF is designed to reduce coding artifacts and, as the deblocking filter already fixes most of the blocking artifacts, the artifacts that are expected to be solved by ALF are mainly blurring and ringing. The problem with using static picture techniques is that they must be computationally fast, so that the encoder (and decoder) can process the subsequent pictures arriving from the video sequence in an acceptable amount of time. The Wiener filters used by ALF are already very good at removing these types of artifacts [26]; however, the classification of the pixels to be processed by each filter should also be good, as shown in the point above, to design an accurate filter. These static image techniques may, therefore, be very useful for detecting and classifying pixels that suffer from a specific type of artifact, so that they can all benefit from a Wiener filter. Thus, in [27], [28], [29] and [30], different algorithms to efficiently detect blur and ringing artifacts are presented. These methods might be a great tool to improve ALF's pixel classification method.

Finally, the concept of using Wiener filters to adaptively improve the objective and subjective quality of video encoders by reducing coding artifacts has also been explored outside the scope of the standard ALF. For instance, in [31] an adaptive Wiener filter, the so-called Adaptive Interpolation Filter (AIF), is used to compensate fractional pixel displacements that are a consequence of the limited precision of the interpolation process. Furthermore, [32] presents a new adaptive filter that combines the use of spatial information from a single frame, as in traditional adaptive filters, with temporal information. The filter is located before the deblocking filter and it uses the concept of Temporal Trajectory Filtering (TTF), described in [33], to follow the motion of a single image point through different frames. This trajectory information allows the design of more accurate per-pixel filters. However, [32] compares the use of ALF and this new TTF, showing how ALF outperforms TTF in every experiment performed. A similar approach is presented in [34], which mixes the concept of unifying the three blocks in the in-loop filtering stage with the combination of temporal and spatial information by using what the authors call trilateral adaptive filters, as they include temporal statistics between frames in the estimation of coefficients.

1.4 Outline

This thesis report is organized as follows. In this chapter, the motivation, aim and scope of the thesis were presented, along with a brief introduction of the method followed to approach the problem and an extensive study of ALF's state of the art. Chapter 2 covers the theoretical background with the main concepts of video coding and, specifically, the techniques used in the Joint Exploration Model (JEM). In chapter 3, a detailed overview of the techniques and operation of JEM's adaptive loop filter implementation is given. With all this, all the background needed to follow the development of the thesis, presented in chapter 4, is provided. Finally, in chapter 5, a discussion of the final results of our implementation is presented, followed by the conclusions and future work of the thesis in chapter 6.


Chapter 2

Video Coding

In this chapter, a theoretical background of the most important concepts and principles in video coding and video compression will be presented. The specific details about the technologies and techniques presented in this chapter belong to the latest published video coding standard, HEVC [35], and to the modifications and amendments included in the latest version of the Joint Exploration Model (JEM) [11]. JEM represents a possible basis for the next generation of video standards. For the development of the thesis and the implementation of our prototypes, we used JEM's standard reference encoder.

The main goal of video coding tools is to remove redundancy from multimedia data [36]. The main types of redundancy that video coding aims to remove are:

• Spatial redundancy: Refers to the high correlation between neighboring pixels in an image. This redundancy is seen especially for pixels in regular objects or background areas, where the distribution of brightness and color saturation is regular.

• Temporal redundancy: Indicates the high correlation among pixels of neighboring frames in a video sequence. This correlation appears because the physical characteristics of neighboring frames are very similar, as they were captured within small time intervals.

• Statistical redundancy: This redundancy refers to the coding process. The symbols of a bit stream do not appear with the same probability in the real world. Thus, the encoder must try to code the most frequent symbols with a lower number of bits to reduce the size of the bitstream (a toy illustration of this idea is given below).

• Visual redundancy: Refers to those details in video images not easily perceived by the Human Visual System (HVS).

As will be seen below, all the video coding techniques included in the standards are designed to remove, as far as possible, these major redundancies appearing in raw video, reducing in this way the amount of information needed for video data representation, storage and transmission.
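To make the statistical-redundancy point concrete, the following toy sketch (illustrative only, not part of any codec) compares the cost of a fixed-length code with the Shannon entropy bound that variable-length entropy coders, such as the CABAC coder described in section 2.2.6, try to approach. The symbol stream and its probabilities are invented for the example.

```python
import math
from collections import Counter

def code_lengths(symbols):
    """Average bits/symbol: fixed-length code vs. the Shannon entropy bound."""
    counts = Counter(symbols)
    total = len(symbols)
    # A fixed-length code spends ceil(log2(alphabet size)) bits on every symbol.
    fixed_bits = math.ceil(math.log2(len(counts)))
    # The entropy bound rewards giving frequent symbols shorter codewords.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return fixed_bits, entropy

# A stream where one symbol dominates, as is typical for residual data.
stream = list("aaaaaaaabbbbccd")
fixed, bound = code_lengths(stream)
print(f"fixed-length: {fixed} bits/symbol, entropy bound: {bound:.2f} bits/symbol")
```

The gap between the two numbers is exactly the statistical redundancy that the entropy coding stage removes.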

2.1 Video coding standards

Video coding standards define (i) the full decoder's behavior and (ii) the way encoders and decoders communicate, i.e. the bitstream syntax. Besides defining the bitstream syntax, video coding standards are also required to be efficient, allowing the use of good compression algorithms and the efficient implementation of encoders and decoders [37].


Video coding standards have evolved primarily through the development of the well-known ITU-T and ISO/IEC standards. The ITU-T produced H.261 [38] and H.263 [39], while ISO/IEC produced MPEG-1 and MPEG-4 Visual [40]. A few years later, the two organizations jointly released the H.262/MPEG-2 Video [3] and H.264/MPEG-4 Advanced Video Coding (AVC) [5] standards. These two jointly produced standards have had a particularly strong impact and have found their way into a wide variety of products that are increasingly prevalent in our daily lives, including broadcast of high-definition (HD) TV signals over satellite, cable and terrestrial transmission systems, video content acquisition and editing systems, camcorders, security applications, internet and mobile network video, Blu-ray Discs, and real-time conversational applications such as video chat, video conferencing and telepresence systems [41]. However, as presented in Cisco's Zettabyte report and expressed in section 1.1, the increasing diversity of services, the increase in the quality, resolution and compression requirements of users and industry, and the amount of video traffic exchanged over the internet, especially using portable devices, generate the need for new and enhanced video coding standards. Thus, HEVC appeared in an attempt to halve the bit-rate requirements of its predecessor, AVC/H.264 [5], for the same image quality, at the expense of increased computational complexity. HEVC was released in 2013 and, since then, new efforts to improve the efficiency and capabilities of video coding for the next generation of standards have been explored by JVET with the implementation of the Joint Exploration Model (JEM).

As has been the case for all past ITU-T and ISO/IEC video coding standards, in HEVC and JEM only the bitstream structure, syntax, constraints and mapping for the generation of decoded pictures are standardized. Consequently, every decoder conforming to the standard will produce the same output for a given standard-conformant bitstream. This scope limitation gives encoder developers maximum freedom to optimize their implementations and explore new techniques to improve the standard. Moreover, to assist the industry community in learning how to use the standard, the standardization effort not only includes the development of a text specification document, but also the reference software source code of the encoder and decoder's standard implementation. This reference software is usually used as a research tool to improve the standard and explore new techniques [42]. For the development of the thesis, the reference software for JEM version 3.0 was used as the base encoder in the investigation and implementation of ALF improvements.

2.2 Video coding overview

Now that the goal of video coding and its associated standards has been explained, the following subsections give a detailed overview of the video coding process, with its specific concepts and techniques. The main stages into which video coding is divided are: picture partitioning, prediction, transformation, quantization and entropy coding [41].

After the quantization of the image, a decoding loop appears. This decoding loop allows the encoder to generate a reconstructed image identical to the one the decoder will generate in the decoding process. The reconstructed image serves as a reference for further prediction stages. The decoding loop is where the in-loop filters (the deblocking filter, the Sample Adaptive Offset (SAO) and the Adaptive Loop Filter (ALF)) are located. The in-loop filters' main goal is to estimate, and transmit to the decoder, the optimal filter parameters that fix potential artifacts in the reconstructed image, increasing in this way the subjective and objective quality of the reconstructed video. A simplified block diagram of a JEM encoder may be seen in figure 2.1. The main differences between JEM's encoder and HEVC's standard encoder are the inclusion of the ALF block and the improvement of the techniques used in every other coding stage. The blocks inside the red rectangle make up the decoding loop of the encoder. A JEM decoder is formed by these same blocks shown in the red box (decoding loop).

Figure 2.1: Simplified JEM encoder block diagram

2.2.1 Video frame

Throughout the document we have been talking about video frames (or pictures) and video sequences without formally presenting them. A frame in a video sequence is a matrix, or set of matrices, whose values represent intensity measures in an image [43]. These values are referred to as pixels. A set of these matrices generates a video sequence. The most common way of using pixel values to represent color and brightness in full-color video coding is through what is known as the YUV (YCbCr) color space. The YUV color space divides the value of a pixel into three channels: the luma value (Y), which represents the gray-level intensity in an image, and the chrominance values (Cb, Cr), which represent the extent to which the color differs from gray towards blue and red, respectively [41]. Each of these channels is sent independently in the video bitstream, allowing a different treatment for each of them. Thus, different sampling approaches may be followed for each channel. This is very convenient, as the Human Visual System (HVS) is more sensitive to changes in brightness than in color [36]. Video encoders will, therefore, tend to use more luma samples than chroma samples. This technique enables lower bandwidth requirements for a video sequence and a better masking of transmission errors and compression artifacts. The latest coding standards use a YCbCr color space with 4:2:0 subsampling. This means that for each 2x2 block of luma samples there is only one sample from each of the two chrominance channels (Cb, Cr). Initially, in HEVC, each of these pixel samples was represented using 8 bits [41], i.e. values from 0 to 255. However, in the latest HEVC revisions and in JEM, this value has been increased to allow a 10-bit color depth. In this case, every pixel of the frame is represented with a value from 0 to 1023, increasing the representation accuracy of each sample. The selection between an 8-bit and a 10-bit color depth is completely optional.

Finally, the number of pixels in a frame is determined by the resolution of the video. Higher resolutions mean more pixels in the frame and, in consequence, a better definition of the image. For instance, the well-known High Definition resolution has 1920 x 1080 pixels per frame, and the emerging 4K video resolution presents 3840 x 2160 pixels. Intuitively, the higher the resolution, the higher the bandwidth, storage and data transmission requirements of a video sequence.
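To give an idea of the magnitudes involved, the short sketch below (with assumed, illustrative parameters) computes the raw bit rate of an uncompressed sequence in the YCbCr 4:2:0 format with 10-bit samples described above; figures like these are what motivate the compression ratios that video coding standards must deliver.

```python
def raw_bitrate(width, height, bit_depth=10, fps=60):
    """Raw bit rate of a YCbCr 4:2:0 sequence.
    With 4:2:0 subsampling, Cb and Cr each carry 1/4 of the luma sample count."""
    luma_samples = width * height
    chroma_samples = 2 * (luma_samples // 4)   # Cb + Cr, each subsampled 2x2
    bits_per_frame = (luma_samples + chroma_samples) * bit_depth
    return bits_per_frame * fps

# 1080p at 60 fps, 10-bit, 4:2:0: roughly 1.9 Gbit/s before any compression.
print(f"{raw_bitrate(1920, 1080) / 1e9:.2f} Gbit/s")
```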

2.2.2 Picture partitioning

In HEVC, each picture in a video sequence is divided into a set of non-overlapping blocks [41] prior to its coding. The largest blocks are called coding tree units (CTUs), which can be further divided into smaller square-shaped blocks called coding units (CUs). Each CU can be as large as its root CTU, or it can be subsequently divided from there into subdivision blocks as small as an 8x8 block. These CUs can be split into several prediction units (PUs), which are used during the prediction stages of the encoding, as detailed below. Each PU can have a size from 64x64 down to 4x4 pixels. In HEVC, CUs are also divided into transform units (TUs), used for the transformation stage in the encoder. The size of TUs can range from 32x32 down to 4x4 pixels. PU subdivisions can generate non-square blocks such as 4x16 or 32x8 pixel blocks, thus increasing the partitioning accuracy for the prediction stage [43]. In figure 2.2 an example of a possible block structure derived from a single CTU is shown.

Figure 2.2: HEVC picture partitioning structure

In JEM, some additional concepts are added to this picture partitioning stage compared to HEVC. JEM incorporates a quadtree plus binary tree (QTBT) structure for blocks. QTBT removes the concept of multiple partition types, which means that the separation into CU, PU and TU no longer exists, supporting more flexibility for CU partition shapes to better match the local characteristics of the video data. Thus, in JEM, through QTBT, CUs, TUs and PUs have the same block size. In QTBT, CUs can have either square or rectangular shapes, as was the case for PUs in HEVC. Each CTU is first partitioned by a quadtree structure. Then the quadtree leaf nodes are further partitioned by a binary tree structure. The binary splitting allows two types: symmetric horizontal splitting and symmetric vertical splitting, defining the final structure of the CTU and, in consequence, its specific subdivision into CUs [11]. An example of a CTU block structure in JEM is depicted in figure 2.3. We may see how the binary tree determines the type of CU splitting: a value of 1 in the binary structure of the QTBT determines a symmetric vertical split for a CU, while a value of 0 specifies a horizontal split.

Figure 2.3: JEM picture partitioning structure [11]
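As a complement to figure 2.3, the following toy sketch walks a QTBT-style partition of a CTU. It is a deliberately simplified model (the flag handling, fixed split depths and block representation are assumptions made for this illustration, not JEM's actual data structures): quadtree splits are applied first, and each quadtree leaf may then take one binary split, with 1 denoting a symmetric vertical split and 0 a horizontal split, as in the figure.

```python
def split_qtbt(x, y, w, h, quad_depth, bin_flag=None):
    """Toy QTBT walk: quad-split `quad_depth` times, then apply one optional
    binary split at each quadtree leaf (1 = vertical, 0 = horizontal, None = keep)."""
    if quad_depth > 0:
        hw, hh = w // 2, h // 2
        leaves = []
        for dy in (0, hh):
            for dx in (0, hw):
                leaves += split_qtbt(x + dx, y + dy, hw, hh, quad_depth - 1, bin_flag)
        return leaves
    if bin_flag == 1:         # symmetric vertical split: two side-by-side CUs
        return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
    if bin_flag == 0:         # horizontal split: two stacked CUs
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    return [(x, y, w, h)]     # leaf block: in QTBT this single block is CU = PU = TU

# A 128x128 CTU: one quadtree level, then a vertical binary split in every leaf.
for cu in split_qtbt(0, 0, 128, 128, quad_depth=1, bin_flag=1):
    print(cu)   # (x, y, width, height) of each resulting CU
```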

2.2.3 Prediction

The prediction stage in the encoder aims to remove the spatial and temporal redundancies presented in the introduction of the chapter. It tries to exploit the similarities inside the same frame (spatial redundancy) through intra-picture prediction, and between subsequent frames (temporal redundancy) through inter-picture prediction, motion estimation and motion compensation.

The basic idea of predictive coding is to transmit a differential signal between the original signal and a prediction of the original signal, instead of the original signal directly. The differential signal is also called the residual signal and, at the receiver side, the original signal can be reconstructed by adding the residual and the prediction [36]. The differential signal has a lower correlation than the original one, reducing the number of bits needed for its transmission.
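A minimal numeric illustration of this differential principle is given below (toy numbers, not an actual codec path): the encoder sends the low-energy residual, and the decoder recovers the original block by adding the residual to the same prediction.

```python
import numpy as np

original = np.array([[52, 54], [53, 55]], dtype=np.int16)    # source pixels
prediction = np.array([[50, 50], [50, 50]], dtype=np.int16)  # e.g. from neighbors

residual = original - prediction       # small-magnitude signal, cheaper to code
reconstructed = prediction + residual  # receiver side: prediction + residual

assert (reconstructed == original).all()
print(residual)   # [[2 4] [3 5]] -- low-energy values centered near zero
```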

Intra-prediction forms the prediction of a block of pixels from the values of spatially neighboring pixels in the same frame. To perform this, the standard defines a set of intra-prediction modes that determine the direction of the optimal pixels from which to perform the prediction. In HEVC, 33 directional intra modes are available, while in JEM [11] this value is extended to 67 modes to increase the prediction accuracy.

Inter-prediction, on the other hand, exploits the temporal redundancy between subsequent frames by representing the displacement of a block of pixels in a picture according to the value of pixels in previous or following frames, a process called motion compensation. This pixel displacement is represented through a motion vector that is sent to the decoder along with the residual between the original pixels and the motion-compensated pixels, computed from another frame's pixels. The decoder uses the residual and the motion vector to reconstruct the block of pixels in the reconstructed frame. The process of motion compensation may be performed at sub-pixel level for enhanced accuracy, using motion vectors with a precision of down to a quarter of the distance between pixels (quarter-pel motion). This results in the need for an interpolation filter to be able to estimate the prediction at sub-pixel positions. Inter-prediction typically requires a lower amount of information to be transmitted to the decoder compared to intra-prediction.

According to the type of prediction used, a frame, or a slice, in HEVC and JEM, is classified as:

• I-frame: Intra-coded (I) frames use only intra-prediction in their coding process. This type of frame does not need other frames to be coded; its coding and prediction processes are performed using data elements from the frame itself.

• P-frame: Predictive (P) frames may use intra-prediction and inter-prediction from previous frames. This type of frame is compressed further than I-frames through the use of inter-prediction, but a previous frame must have been coded in order to code them. P-frames send the index of the reference picture along with the residual signal and the motion vector.

• B-frame: Bi-predictive (B) frames can use data from previous and following frames in their coding. Thus, B-frames may use intra-prediction, or inter-prediction using an interpolated prediction from two different frames, increasing the accuracy of the motion estimation process. The indices of the reference pictures used for the motion estimation of the frame are sent to the decoder, along with the residual signal and motion vector. This type of frame needs the reference pictures to be coded first. However, B-frames achieve, in general, higher compression rates than the other two types of frames.

2.2.4 Transformation and Quantization

The transformation and quantization stages of the encoder exploit the statistical redundancy presented in the introduction of this chapter. In the transformation stage, the residual data from the prediction stage, which will be sent to the decoder, is coded using block transforms [41]. A discrete cosine block transform (DCT) is used to convert this data into the transform domain. In the transform domain the information is organized so that a smaller number of transform coefficients is needed to represent the information. Therefore, less data and lower bandwidth are required to transmit the information to the decoder. This DCT block transformation is lossless, as it is just a smart way of representing the information to reduce the number of coefficients needed, losing no information in the process [44].

Once the transform coefficients have been computed by the encoder, they are quantized before being transmitted to the decoder. The quantization process aims to limit the amount of bits generated and sent by the transformation process. The quantization of each coefficient is computed by directly dividing the value of the coefficient by a quantization step (Qstep), which is determined from the quantization parameter (QP) following the expression in equation 2.1 [45].

$Q_{step} = 2^{(QP - 4)/6}$   (2.1)

At the decoder, and in the encoder's decoding loop, a dequantization process is performed to obtain the pixel values of the reconstructed image. To dequantize a frame, the quantized value of each transform coefficient is multiplied by the quantization step (Qstep), obtaining the reconstructed transform coefficients. An inverse DCT is then applied to these values to obtain the reconstructed image.
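A minimal sketch of this quantization and dequantization round trip is shown below. It is a didactic rounding division by Qstep following equation 2.1, not HEVC's exact integer-arithmetic quantizer, and the coefficient value is invented.

```python
def qstep(qp: int) -> float:
    """Quantization step derived from the quantization parameter (equation 2.1)."""
    return 2 ** ((qp - 4) / 6)

def quantize(coeff: float, qp: int) -> int:
    return round(coeff / qstep(qp))

def dequantize(level: int, qp: int) -> float:
    return level * qstep(qp)

coeff = 100.0
for qp in (22, 37):
    level = quantize(coeff, qp)
    rec = dequantize(level, qp)
    # A higher QP means a larger Qstep, coarser levels and a larger error.
    print(f"QP={qp}: Qstep={qstep(qp):.1f}, level={level}, reconstructed={rec:.1f}")
```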

The value of QP is determined in the encoder and ranges from 0 to 51. For each video sequence, a QP value is set for each frame of the sequence. A high QP means that the Qstep is high and the range of available quantized values is small. Therefore, the transform coefficients will be quantized very coarsely, resulting in a high compression rate but a low quality of the reconstructed video sequence, as the dequantized values in the decoder will only represent a coarse approximation of the original values of the frame (a high level of uncertainty). Thus, the quantization stage presents a trade-off between the quality of the reconstructed sequence and the amount of information required to represent the sequence. Hence, the lower the QP, the better the quality of the decoded video, and the higher the amount of data required for its representation and transmission. On the other hand, high QPs will generate lower-quality reconstructed video sequences with lower data and bandwidth requirements. The quantization process, unlike the transformation, performs a lossy compression, i.e. some of the original information is lost.

2.2.5 Variance-based adaptive quantization

Since H.264 [5], a variance-based adaptive quantization algorithm has been available, introduced in x264 [46], a high-performance H.264/AVC encoder. Adaptive quantization (AQ) allows each macroblock, in the case of H.264, and every CU, in HEVC and JEM, to use a different quantization parameter in its coding process, instead of using the same frame QP for the coding of every CU, or macroblock, of the frame. Variance-based adaptive QP (VAQ) algorithms move bits from high-variance blocks into flat blocks, so that flat blocks are coded more accurately than blocks that present high variance. This is done by adaptively lowering the quantization parameter of certain blocks while increasing it in others. Without VAQ, flat and dark areas of the image tend to show unsightly blocking or banding artifacts. By using VAQ these artifacts are greatly reduced, as lower QPs in these flat-region CUs enable a more accurate coding, avoiding the strong changes between neighboring blocks that generate these artifacts. To select the specific QP of a CU, its variance is computed. If a CU's variance is higher than the average variance of the frame, a QP higher than the frame's QP is set for the CU. On the other hand, if the CU presents a lower variance than the average variance of the frame, a lower QP is assigned. The QP of each CU in the frame is represented through a QP offset from the global QP set in the encoder for the frame, as shown in equation 2.2.

$QP_{offset} = QP_{CU} - QP_{frame}$   (2.2)

This algorithm generates important quality gains, especially in frames with dark or flat regions coming from, for instance, grass, sky or walls. Currently, more and more video encoder implementations include this feature, as it is a simple and computationally cheap technique to enhance quality. It is worth mentioning that some encoders include other types of adaptive quantization algorithms which are not variance-based. However, the one included in the encoder's standard reference implementation since HEVC, and the one included in JEM's reference encoder software, is variance-based.
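The sketch below illustrates the variance-based rule described above. It is a simplified toy: the logarithmic mapping, the strength parameter and the clamping range are assumptions made for this example, not the exact formula used by x264 or JEM. CUs with a variance above the frame average receive a positive QP offset, while flatter CUs receive a negative one, in the sense of equation 2.2.

```python
import math

def vaq_offsets(cu_variances, strength=1.0, max_offset=6):
    """Toy variance-based adaptive QP: one QP offset per CU (equation 2.2)."""
    avg = sum(cu_variances) / len(cu_variances)
    offsets = []
    for var in cu_variances:
        # Log-ratio against the frame average keeps the offset sign symmetric.
        raw = strength * math.log2((var + 1) / (avg + 1))
        offsets.append(max(-max_offset, min(max_offset, round(raw))))
    return offsets

# Flat CUs (low variance) get a lower QP, busy CUs a higher QP.
print(vaq_offsets([4, 16, 256, 1024]))   # -> [-6, -4, 0, 2]
```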

2.2.6 Entropy coding

Entropy coding is the last stage in the encoding process, prior to the transmission of the information to the decoder. It represents another lossless compression scheme that aims to remove the statistical redundancy of the transmitted information. In HEVC, the entropy coder used is Context-based Adaptive Binary Arithmetic Coding (CABAC) [41]. CABAC uses probability estimates to reduce these statistical redundancies. It uses context models to store statistics about codewords and adaptively builds a codebook (look-up table) with information about the block of symbols to be sent. The compressed bitstream generated consists of the block structure, coded prediction parameters and coded residual coefficients. Additionally, information about the estimated parameters of the in-loop filters, such as ALF, is sent. JEM uses an enhanced version of CABAC with a modified context modeling for transform coefficients, a multi-hypothesis probability estimation with context-dependent updating speed, and an adaptive initialization of models. Further information about these enhancements may be consulted in [11].


2.2.7 In-loop filters

In-loop filters are located in the decoding loop of the encoder. During all the video coding stages explained above, and especially in the lossy compression performed in the quantization stage, the subjective quality of a video sequence can be reduced, resulting in the appearance of blocking, ringing or blurring artifacts. An example of an image showing blocking and ringing artifacts may be seen in figure 2.4. In order to remove these artifacts, and increase the subjective and objective quality of the reconstructed sequence, a set of in-loop filters is used. The in-loop filters in the encoder estimate the optimal filter parameters that increase the objective quality of a frame the most. These parameters are then transmitted to the decoder so that the decoder's in-loop filters can use them to optimally filter the reconstructed frame and achieve the same quality improvements reached for the reconstructed frame in the encoder.

Figure 2.4: Original image to the left, and compressed image to the right, showing ringing and blocking artifacts [47]

In JEM [11] three in-loop filters are used:

• Deblocking filter [10]: The deblocking filter aims to remove the blocking artifacts that appear at the edges of CUs, and specifically PUs and TUs, as a consequence of using a block structure in the processing of every stage of the encoder. The different values between neighboring CUs generate strong edges at CU transitions, which result in these blocking artifacts if no smoothing filter is applied at block edges. Hence, the deblocking filter is in charge of applying a smoothing filter to edges in order to remove these artifacts. A different type of smoothing filter is applied to a block according to the properties of its neighboring blocks. The strength of the filter is also determined by the specific values of the edge pixels of the block.

• Sample Adaptive Offset (SAO) [16]: After removing blocking artifacts, the SAO filter aims to reduce undesirable visible artifacts such as ringing. The SAO filter categorizes the pixels of a CTU into different categories. A positive or negative offset is determined for each category. This offset is then added to all the samples in each category. Thus, if the reconstructed samples in a specific category have smaller values than the original ones, a positive offset is applied to reduce the existing error. On the other hand, if the samples in a specific category have higher values than their corresponding original samples, a negative offset is applied. This processing flow, in which a classification of pixels is performed prior to the actual processing of the block, will also be seen in ALF, as detailed in chapter 3 (an illustrative sketch of the per-category offsets is given after this list).


• Adaptive Loop Filter (ALF): ALF also aims to reduce visible artifacts such as ringing and blurring by reducing the mean absolute error between the original image and the reconstructed image. As already mentioned above, chapter 3 is completely dedicated to giving a detailed overview of ALF, so no further explanation will be given here.
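As a rough illustration of the SAO mechanism described in the list above, the following sketch adds a decoded per-category offset to each sample. The per-sample classifier input and the category count are assumptions; JEM's actual edge and band classifications are more elaborate:

```cpp
#include <array>
#include <vector>

// Apply SAO-style offsets: each sample belongs to a category (here 0..3) and
// receives the offset that was estimated for that category, pulling the
// reconstructed samples of each class back toward the original values.
void applySao(std::vector<int>& samples,
              const std::vector<int>& category,  // per-sample class, 0..3 (assumed)
              const std::array<int, 4>& offsets) // decoded per-class offsets
{
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] += offsets[category[i]];
}
```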

2.2.8 Rate-Distortion Optimization [20]

Rate-Distortion Optimization (RDO) is a technique used to make several decisions along the coding process of a video. It refers to the optimization problem of finding the optimal balance between the amount of distortion in a sequence, i.e. the loss of video quality, and the amount of data required to encode the video, i.e. the bitrate. This technique is used in the encoder, for instance, to choose which coding mode and prediction parameters to use in each block [36].

This method optimizes the coding parameters to jointly minimize the distortion, D, and the bitrate, R, through the RDO cost, J, following the expression in equation 2.3.

J = \lambda \cdot R + D (2.3)

Thus, RDO measures the error between the original video and the reconstructed one, and the bit cost, for each specific coding configuration. The parameter \lambda in the equation, or Lagrangian, represents the relationship between bit cost and quality, or distortion, for a certain quality level, and it may be estimated as shown in equation 2.4.

\lambda = 0.57 \cdot 2^{(QP-12)/3} (2.4)
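As a small illustration of how equations 2.3 and 2.4 drive a decision, the following sketch selects, among a set of candidate coding choices, the one minimizing J. The candidate structure and function names are assumptions for illustration:

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Candidate
{
    double distortion; // D: e.g. SSD between original and reconstructed block
    double bits;       // R: bits needed to signal this coding choice
};

// Lagrangian of equation 2.4 for a given QP.
double lambdaFromQp(int qp)
{
    return 0.57 * std::pow(2.0, (qp - 12) / 3.0);
}

// Pick the candidate minimizing J = lambda * R + D (equation 2.3).
std::size_t bestCandidate(const std::vector<Candidate>& cands, int qp)
{
    const double lambda = lambdaFromQp(qp);
    double bestJ = std::numeric_limits<double>::max();
    std::size_t best = 0;
    for (std::size_t i = 0; i < cands.size(); ++i)
    {
        const double j = lambda * cands[i].bits + cands[i].distortion;
        if (j < bestJ) { bestJ = j; best = i; }
    }
    return best;
}
```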

As will be explained in the chapter dedicated to ALF, RDO is important for the reference implementation of ALF. After each step in ALF's algorithm, a rate-distortion evaluation is performed to determine whether the additional bits transmitted by ALF in the bitstream are worth the improvement in the final distortion of the image. Furthermore, the decision on the best set of filter parameters in ALF is also conducted using RDO in the reference implementation.

2.2.9 Video compression evaluation

To evaluate video coding techniques, three main metrics are considered: the output quality of the video sequence, the resulting bitrate and the encoding time. Thus, we can define different types of objective measures to evaluate how well an encoder's implementation performs on these three coding metrics.

• Objective Quality: The easiest way to measure the quality performance of a video sequence is to do it objectively using algorithmic quality measures [43]. One of the most common is the Peak Signal-to-Noise Ratio (PSNR) [48]. The PSNR relates the maximum available amplitude of the signal to the mean squared error between the original signal and the reconstructed one, i.e. PSNR = 10 \cdot log_{10}(MAX^2 / MSE). Accordingly, a low PSNR usually indicates low quality while a high PSNR indicates high quality for a given image or video frame. This metric is used to objectively compare different encoder implementations by comparing the PSNR of their output video sequences and evaluating the improvements (a minimal sketch of its computation is given after this list).

• Subjective Quality: Objective metrics do not always correlate with the quality perceived by humans. A low PSNR does not necessarily mean a lower perceived quality of the video sequence, as sometimes the information removed, which produces the decrease of the PSNR, is not perceived by the human brain. A useful objective metric to evaluate subjective quality is SSIM [49]. SSIM measures the similarity between two images. Like PSNR, SSIM is a full reference metric, so its measurement is based on an initial uncompressed or distortion-free image as reference (the original video sequence). Therefore, to evaluate the subjective quality of an encoder's implementation, we compare the SSIM measurements of the output videos of the encoder under test and the anchor, or baseline encoder, using the original input video sequence as reference for the computation of SSIM. A more advanced form of SSIM is the Multiscale SSIM (MS-SSIM) [50]. This metric is computed by calculating the SSIM after several stages of sub-sampling. The sub-sampling process is performed by successively filtering each frame of the video sequence with a low-pass filter, and the SSIM is computed at each filtering stage. The experiments in [50] show that, with appropriate parameter settings, the multi-scale method outperforms the best single-scale SSIM model as well as other state-of-the-art image quality metrics.

• BD rate: To improve video coding techniques we want to reduce the number of bits transmitted by the encoder while maintaining a good video quality. Therefore, not only video quality should be considered to evaluate how good an encoder's implementation is, but also the bitrate. Evaluating these two metrics at the same time may be complex. To solve this, the Bjontegaard delta metric [51] (BD-rate) is used. BD-rate provides an alternative comparison method of compression performance. For a fixed quality level, the BD rate calculates the average percentage of loss or gain in bitrate between the coded videos. A negative value indicates a lower bitrate and a positive value a higher one. As will be seen below, the BD MS-SSIM rate is used to evaluate the improvements achieved by all of our ALF modifications. When comparing this metric between our implementation and the standard one, we want this value to be as negative as possible, as this indicates that our implementation needs a smaller bitrate than the anchor to transmit videos with the same level of output quality.
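As a complement to the first bullet above, the following minimal sketch computes the PSNR of an 8-bit reconstructed frame against the original (MAX = 255). The flat-array frame layout is an illustrative assumption:

```cpp
#include <cmath>
#include <vector>

// PSNR between an original and a reconstructed frame, both stored as flat
// arrays of 8-bit luma samples of equal length.
double psnr(const std::vector<unsigned char>& orig,
            const std::vector<unsigned char>& recon)
{
    double sse = 0.0;
    for (std::size_t i = 0; i < orig.size(); ++i)
    {
        const double d = static_cast<double>(orig[i]) - static_cast<double>(recon[i]);
        sse += d * d;
    }
    const double mse = sse / static_cast<double>(orig.size());
    return 10.0 * std::log10((255.0 * 255.0) / mse); // assumes mse > 0 (lossy coding)
}
```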


Chapter 3

Adaptive Loop Filter

In this section, a detailed overview of the Adaptive Loop Filter (ALF) implementation in JEM3 [7] will be given in order to understand its operation and analyze the potential improvements that may be performed over this block.

ALF is located at the last processing stage of each picture, and can be regarded as a tool that tries to catch and fix artifacts created by the previous coding stages [9]. Figure 3.1 depicts all the processing stages in which ALF is divided. In the following sections a detailed explanation of all these stages is presented.

3.1 Block Classification and initial RD cost computation

In JEM's reference encoder, ALF's first task is to compute the RD cost of the to-be-filtered picture, J_{R0}. This value is used as a reference measure to analyze the gain obtained after each ALF processing stage. This measure is also used to perform the RD optimization in charge of selecting the optimal set of filter coefficients and ALF features that will afterwards be signaled to the decoder.

With the initial RD cost computed, the pixel classification stage starts. Section 1.3 already mentioned ALF's technique of classifying blocks, regions or pixels of a frame with similar statistics into the same group, so that the same optimal filter is designed for all of them. Specifically, in JEM's ALF a block-based filter adaptation (BA) is used. Each 2x2 block of a frame is categorized into one out of 25 possible categories according to the directional gradients and global activity of the block. To compute these metrics, the gradients of the horizontal (g_h), vertical (g_v) and two diagonal (g_{d0}, g_{d1}) directions are computed through the 1-D Laplacian, as shown in equation 3.1.


Figure 3.1: ALF processing stages at encoder [7]

g_v = \sum_{k=i-2}^{i+3} \sum_{l=j-2}^{j+3} V_{k,l}, \quad V_{k,l} = |2R(k,l) - R(k,l-1) - R(k,l+1)|

g_h = \sum_{k=i-2}^{i+3} \sum_{l=j-2}^{j+3} H_{k,l}, \quad H_{k,l} = |2R(k,l) - R(k-1,l) - R(k+1,l)|

g_{d0} = \sum_{k=i-2}^{i+3} \sum_{l=j-2}^{j+3} D0_{k,l}, \quad D0_{k,l} = |2R(k,l) - R(k-1,l-1) - R(k+1,l+1)|

g_{d1} = \sum_{k=i-2}^{i+3} \sum_{l=j-2}^{j+3} D1_{k,l}, \quad D1_{k,l} = |2R(k,l) - R(k-1,l+1) - R(k+1,l-1)|

(3.1)

Indices i, j refer to the coordinates of the upper left pixel of the 2x2 block and R(i, j) indicates the value of pixel (i, j) in the reconstructed image.
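The following sketch illustrates equation 3.1 in code, accumulating the four 1-D Laplacian gradients over the window around a 2x2 block with top-left pixel (i, j). The accessor R(k, l) is assumed to return a reconstructed luma sample, and border handling is omitted for brevity:

```cpp
#include <cstdlib>

extern int R(int k, int l); // reconstructed sample accessor (assumed)

// Accumulate the vertical, horizontal and two diagonal gradients of
// equation 3.1 for the 2x2 block whose top-left pixel is (i, j).
void blockGradients(int i, int j, int& gv, int& gh, int& gd0, int& gd1)
{
    gv = gh = gd0 = gd1 = 0;
    for (int k = i - 2; k <= i + 3; ++k)
    {
        for (int l = j - 2; l <= j + 3; ++l)
        {
            gv  += std::abs(2 * R(k, l) - R(k, l - 1) - R(k, l + 1));
            gh  += std::abs(2 * R(k, l) - R(k - 1, l) - R(k + 1, l));
            gd0 += std::abs(2 * R(k, l) - R(k - 1, l - 1) - R(k + 1, l + 1));
            gd1 += std::abs(2 * R(k, l) - R(k - 1, l + 1) - R(k + 1, l - 1));
        }
    }
}
```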

After computing each gradient, the maximum and minimum values of the block between the horizontal and vertical directions, and between the two diagonal directions, are set as shown in equation 3.2.

g^{max}_{h,v} = max(g_h, g_v), \quad g^{min}_{h,v} = min(g_h, g_v)

g^{max}_{d0,d1} = max(g_{d0}, g_{d1}), \quad g^{min}_{d0,d1} = min(g_{d0}, g_{d1}) (3.2)

According to its direction, each 2x2 block will be given a direction value (D) from 0 to 4. D will be used, after the computation of the activity, to decide the group in which a block must be categorized. To determine D, the following algorithm is followed [7][12]:

• Step 1: If both g^{max}_{h,v} \leq t_1 \cdot g^{min}_{h,v} and g^{max}_{d0,d1} \leq t_1 \cdot g^{min}_{d0,d1} are true, then D is set to 0.

• Step 2: If g^{max}_{h,v} \cdot g^{min}_{d0,d1} > g^{max}_{d0,d1} \cdot g^{min}_{h,v}, the algorithm continues through step 3. Otherwise, it goes to step 4.

• Step 3: If g^{max}_{h,v} > t_2 \cdot g^{min}_{h,v}, then D is set to 2; otherwise it is set to 1.

• Step 4: If g^{max}_{d0,d1} > t_2 \cdot g^{min}_{d0,d1}, then D is set to 4; otherwise it is set to 3.

where t_1 and t_2 represent two specific thresholds specified in JEM's reference encoder implementation.
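Steps 1 to 4 above translate directly into code, as the following sketch shows. The thresholds t_1 and t_2 are left as parameters since the sketch does not assume JEM's exact constants:

```cpp
// Direction class D (0..4) of a 2x2 block from its gradient extrema.
int directionClass(int ghvMax, int ghvMin, // max/min of (g_h, g_v)
                   int gdMax, int gdMin,   // max/min of (g_d0, g_d1)
                   int t1, int t2)
{
    // Step 1: weak gradients in all directions -> no dominant direction.
    if (ghvMax <= t1 * ghvMin && gdMax <= t1 * gdMin)
        return 0;
    // Step 2: compare dominance of the H/V pair against the diagonal pair
    // (cross-multiplied to avoid divisions).
    if (ghvMax * gdMin > gdMax * ghvMin)
        return (ghvMax > t2 * ghvMin) ? 2 : 1; // Step 3
    else
        return (gdMax > t2 * gdMin) ? 4 : 3;   // Step 4
}
```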

With this, a block is categorized according to its direction. To determine its activity value, equation 3.3 is used.

A = \sum_{k=i-2}^{i+3} \sum_{l=j-2}^{j+3} (V_{k,l} + H_{k,l}) (3.3)

The activity value, A, is then quantized to the range 0 to 4. We denote the quantized activity value as A_q. To obtain the A_q of a block from A in JEM, the following algorithm is followed: first, the measured activity of the block is transformed into a quantization index using equation 3.4.

Q_{index} = (24 \cdot A) >> 13 = \frac{24A}{2^{13}} (3.4)

where >> 13 denotes a right shift of 13 bit positions of the activity measure. Then, according to the value of the quantization index, a quantized activity value from 0 to 4 is assigned as shown in figure 3.2. The figure shows the activity quantization indices associated with each of the quantized activity values (the values depicted inside the blue boxes). If Q_{index} has a value higher than 14 for a block, the block is directly assigned a quantized activity A_q = 4.

Figure 3.2: Value of Aq related to the activity quantization index

With A_q and D computed, each block is classified into one of the 25 possible groups by applying equation 3.5, where C indicates the index of the group in which the block is categorized.

C = 5D + A_q (3.5)

A visual aid to understand how blocks are classified into specific groups according to their values of A_q and D is presented in table 3.1.
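To tie the classification pieces together, the following sketch quantizes the activity A of equation 3.3 into A_q and combines it with the direction D into the group index C of equation 3.5. The exact cut points of the Q_{index}-to-A_q mapping are an assumption for illustration, not JEM's actual table:

```cpp
// Quantized activity Aq (0..4) from the raw activity A.
int quantizedActivity(int A)
{
    const int qIndex = (24 * A) >> 13; // quantization index of equation 3.4
    if (qIndex > 14) return 4;         // indices above 14 map directly to Aq = 4
    // Illustrative monotone mapping of the remaining indices onto 0..3.
    static const int aqFromIndex[15] =
        { 0, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3 };
    return aqFromIndex[qIndex];
}

// Group index C of equation 3.5.
int groupIndex(int D, int A)
{
    return 5 * D + quantizedActivity(A);
}
```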
