
Compression of High Dynamic Range Video

Simon Ekström 2015

Master of Science in Engineering Technology Computer Science and Engineering

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering

Abstract

For a long time the main interest in the TV industry has been in increasing the resolution of the video. However, we are getting to a point where there is little benefit in increasing it even further. New technologies are quickly rising as a result of this, and High Dynamic Range (HDR) video is one of these. The goal of HDR video is to provide a greater range of luminosity to the end consumer. MPEG (Moving Picture Experts Group) wants to know if there is potential for improvements to the HEVC (High Efficiency Video Coding) standard, specifically for HDR video, and in early 2015 the group issued a Call for Evidence (CfE) to find evidence of whether improvements can be made to the existing video coding standard. This work presents the implementation and analysis of three different ideas for suggestions: bit shifting at the coding unit level, histogram-based color value mapping, and modifications to the existing Sample Adaptive Offset (SAO) in-loop filter in HEVC. Out of the three suggestions, the histogram-based color value mapping is shown to provide significant improvements to the coding efficiency, both objectively and subjectively. The thesis concludes the work with a discussion and possible directions for future work.

Acknowledgments

I am very grateful for the opportunity to perform my thesis work at Ericsson Research's Visual Technology unit in Kista, and I would like to thank all the people there that have assisted me throughout this work. I would like to especially thank my external supervisor at Ericsson, Martin Pettersson, for assisting me and providing me with valuable feedback throughout the whole process. I would also like to thank my supervisor at Luleå University of Technology, Anders Landström, for showing an interest in the work and providing valuable guidance. Finally I would like to thank family and friends, both new and old, for all the support I have been given during the work and especially the move to a new town.

Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Delimitations
  1.4 Related Work
  1.5 Contribution

2 Theory
  2.1 High Dynamic Range
    2.1.1 Transfer Functions
      2.1.1.1 Philips TF
      2.1.1.2 PQ-TF
  2.2 Color Models
    2.2.1 RGB
    2.2.2 YCbCr
  2.3 Color Spaces
    2.3.1 CIE 1931
    2.3.2 CIELAB
    2.3.3 Wide Color Gamut
    2.3.4 BT.709
    2.3.5 BT.2020
    2.3.6 DCI P3
  2.4 Chroma Subsampling
    2.4.1 4:4:4 to 4:2:0
    2.4.2 4:2:0 to 4:4:4
  2.5 File Formats
    2.5.1 EXR
    2.5.2 TIFF
  2.6 Video Coding
    2.6.1 Encoder
    2.6.2 Decoder
    2.6.3 Rate-Distortion Optimization
    2.6.4 Video Coding Artifacts
    2.6.5 HEVC Standard
      2.6.5.1 Quantization Parameter
      2.6.5.2 Coding Tree Units
      2.6.5.3 Deblocking Filter
      2.6.5.4 Sample Adaptive Offset
      2.6.5.5 Profiles
  2.7 Quality Measurement
    2.7.1 PSNR
    2.7.2 tPSNR
    2.7.3 CIEDE2000
    2.7.4 mPSNR
    2.7.5 Bjøntegaard-Delta Bit-Rate Measurements

3 Method
  3.1 Test Sequences
  3.2 Processing Chain
    3.2.1 Preprocessing
    3.2.2 Postprocessing
    3.2.3 Anchor Settings
    3.2.4 Conversion of TIFF Input Files
  3.3 Evaluation
    3.3.1 Objective Evaluation
    3.3.2 Subjective Evaluation
  3.4 HEVC Profile Tests
  3.5 Bitshifting at the CU Level
    3.5.1 Variation 1
    3.5.2 Variation 2
  3.6 Histogram Based Color Value Mapping
    3.6.1 Preprocessing
    3.6.2 Postprocessing
    3.6.3 Parameters
  3.7 SAO XYZ

4 Results
  4.1 HEVC Profile Tests
    4.1.1 Main-RExt, 12 bits, 4:2:0
    4.1.2 Main-RExt, 10 bits, 4:4:4
    4.1.3 Main-RExt, 12 bits, 4:4:4
    4.1.4 Main-RExt, QP offsets
  4.2 Bitshifting at the CU Level
    4.2.1 Variation 1
    4.2.2 Variation 2
  4.3 Histogram Based Color Value Mapping
    4.3.1 Objective Results
    4.3.2 Subjective Results
  4.4 SAO XYZ

5 Discussion
  5.1 Reflections
  5.2 Conclusions
  5.3 Future Work
    5.3.1 Bitshifting at the CU Level
    5.3.2 Histogram Based Color Value Mapping
    5.3.3 SAO XYZ

1 Introduction

The TV industry is developing quickly. Until now the main interest has been in increasing the resolution of the video content. Ultra HDTV (High Definition Television) provides a number of improvements, including increased resolution, higher frame rate, and an improved color space. However, we are currently approaching the point where there is little benefit in increasing the resolution for ordinary TV sets. Therefore, there is a rising interest in other technologies which can be used to increase the visual experience. One technology introduced is High Dynamic Range (HDR) video, and content providers such as Amazon are already providing HDR content [1]. The goal of HDR video is to provide a greater range of luminosity to the end consumer.

Today's television systems only provide Standard Dynamic Range (SDR), which is a dynamic range of about 1000:1 (the ratio between the brightest and darkest luminance), with a luminosity between 0.1 and 100 candela per square metre (cd/m²). As an example, the sky near the horizon at noon on a clear day has a luminance level of approximately 10 000 cd/m². HDR is defined as a dynamic range greater than 65 536:1.

HDR video, however, may require changes throughout the video chain, affecting everything from video capturing to the TV sets that are meant to display the video. The content providers want to produce and distribute content that actually utilizes this new feature, but they also want to be able to distribute it as efficiently as possible. This may require new tools that are specific to the compression of HDR video.

This thesis presents three main ideas for improvements to existing tools that improve the coding of HDR video:

• Bit depth shifting at the CU level,

• Histogram-based color value mapping,


• Sample adaptive offset in XYZ domain.

These ideas will be described in detail and analyzed throughout the report.

Chapter 1 provides an introduction to the work, covering its background, purpose, and delimitations. Chapter 2 provides the theory necessary to get an understanding of the presented ideas. It covers areas such as HDR, color theory, and video coding.

Chapter 3 presents the three approaches, describing them in detail. This chapter also provides an overview of the method used for evaluation of the ideas. Chapter 4 presents the results for each of the ideas together with a brief analysis. The last chapter, chapter 5, provides a discussion of the results, including a more detailed analysis and overall conclusions.

1.1 Background

HEVC (High Efficiency Video Coding) is a video compression standard [2], and version 3 of the standard was approved in April 2015 [3]. In the spring of 2015, MPEG (Moving Picture Experts Group) issued a Call for Evidence (CfE) for HDR and WCG (Wide Color Gamut) video coding.

This process has a clear purpose: MPEG wants to explore if the coding efficiency and/or the functionality of the HEVC standard can be improved for HDR.

The introduction of HDR video presents a number of challenges not previously considered. Many of the methods used for compressing ordinary video may not work as well for HDR video, both in terms of quality and compression rate.

There is an interest both in how to efficiently represent the colors within the given number of bits per pixel and in how to perform the coding as efficiently as possible. The former is typically handled in the pre- and post-processing stages of the processing chain.

Standard video is usually represented in the sRGB color space, which gives a more efficient use of the pixels compared to just storing them in a linear color space. However, the sRGB gamut, i.e. the complete subset of colors which can be represented within the color space, is restricted and the non-linear gamma model used in sRGB is not well-suited for HDR imagery, as the input and output ranges are unknown [4].

1.2 Purpose

The purpose of this work is first and foremost to study the concepts of HDR video and wide color gamuts to get a basic understanding of what they are, how they differ from today's technology, and how they may affect the video coding process. The next goal is, given the guidelines of the MPEG standardization process, to explore whether there are any possible changes and/or additions that can improve the video coder for HDR and WCG video.

Any suggestions for changes are then to be implemented and evaluated.

1.3 Delimitations

This work is connected to a standardization process led by MPEG. The process has a clear purpose, and so does this work: to find compression efficiency improvements for the existing video coder. MPEG has limited the proposals to three different categories:

1. Normative changes to the HEVC standard. Proposals in this category need to be justified with significant improvements to the performance.

2. Backwards compatibility. This category covers how to present HDR content on older systems that do not support HDR.

3. Optimization using the existing standardized Main 10 profiles, described in section 2.6.5.5. This category consists of two subcategories covering non-normative changes, i.e. changes that do not have an impact on the decoding process, to (a) the Main 10 profile and (b) the Scalable Main 10 profile.

To limit the work, the thesis is restricted to categories 1 and 3a. The work mainly focuses on finding improvements to the processing chain presented in the CfE and is not specifically limited to either normative or non-normative changes.

1.4 Related Work

HDR video, not to be confused with HDR photography, is still a quite new concept, but there has been previous work done in the area. Lu et al. [5] discussed the implications of distributing HDR and WCG content, and Zhang et al. [6] provided a review of HDR image and video compression. Banitalebi-Dehkordi et al. [7] provided a comparison of H.264/AVC and HEVC, showing that HEVC performs better when it comes to compressing HDR video. As mentioned, there are no standards specified for broadcasting yet. However, HDR was recently standardized for the Blu-ray disc format [8].


There is currently a lot of work going on: MPEG is in the process of standardizing HDR for HEVC [9], hardware manufacturers are introducing HDR displays [10], and content providers are starting to provide their users with HDR content [1].

MPEG is not the only organization working on HDR; other groups such as SMPTE, DVB, ATSC, and EBU are also working on specifying standards relating to HDR video.

One of the initial problems with HDR video is how to cope with the extended range. This requires more efficient usage of the available bits. The BBC, for instance, has done a lot of work in this area [11], trying to find an efficient transfer function for HDR. Dolby has also presented work in this area [12], introducing the PQ transfer function. Zhang et al. [13] present a method for reducing the required bit depth for HDR video, resulting in efficiency improvements, and in [14], several other methods for reducing the required bit depth are proposed, all based on an adaptive uniform re-quantization applied prior to the encoding.

There has also been work on how to perform the evaluation of HDR video compression methods. In [15], a comparison of four objective metrics is presented: mPSNR, tPSNR, and PSNR∆E, as included in the CfE, as well as the HDR-VDP-2 metric [16].

In addition to the efficiency improvements, there has been work done on how to provide backwards compatibility. Dai et al. [17] presented techniques that were shown to provide efficiency improvements for HDR video, but also to provide backwards compatibility by allowing tone mapping algorithms to be applied, reducing the contrast and luminance.

1.5 Contribution

This thesis presents three different ideas for improvements to the video coding of HDR content. Two of these provide no significant gains, but the work itself should provide a small base for further work in the area.

The third proposal, histogram-based color value mapping, provides significant gains both objectively and subjectively. This proposal is closely related to the transfer functions suggested by BBC, Dolby, and Philips. It tries to improve the utilization of the available bits, which is also the purpose of the transfer functions. However, the key difference is that the proposal in this thesis looks at the actual video content and tries to optimize the mapping for individual sequences, while the transfer functions are designed as a generic solution by looking at the properties of the human visual system.


There are existing similar techniques, but they are not completely identical, so the proposed idea is still worth considering. For instance, the transfer functions and the proposed mapping technique are not mutually exclusive; in this thesis they are used together to improve the efficiency even further. The mapping technique provides efficiency improvements for HDR and possibly SDR, and the technique could be worth developing further or taking inspiration from.


2 Theory

This chapter covers the background required to get an understanding of HDR, color theory, and video coding in general. It will begin by covering the theory and tools used for coding HDR, including Wide Color Gamut (WCG). It will then continue on by giving a general understanding of the HEVC video coder. The chapter will not go into full detail about the inner workings of the coder, but it will cover what is necessary to get an understanding of the ideas proposed in this thesis.

2.1 High Dynamic Range

High Dynamic Range (HDR) imaging is a set of techniques used in imaging and photography that allows for a greater dynamic range of luminosity compared to what is possible with standard digital imaging techniques. Dynamic range can be described as the ratio between the maximum and the minimum luminous intensity in a scene. Luminance is a measure of luminous intensity per unit area and the SI unit for this measure is candela per square metre (cd/m²); another term for the same unit is "nit".

In photography the dynamic range is commonly measured in terms of f-stops, which describe the light range in powers of 2.

• 10 f-stops = 2^10 : 1 = 1024 : 1.
• 16 f-stops = 2^16 : 1 = 65 536 : 1.
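
As an illustration of the relation above, the dynamic range of a scene can be expressed in f-stops by taking the base-2 logarithm of the contrast ratio. The small Python sketch below is only illustrative (the helper function is not part of any standard tool):

```python
import math

def f_stops(l_max: float, l_min: float) -> float:
    """Dynamic range in f-stops, i.e. the base-2 logarithm of the contrast ratio."""
    return math.log2(l_max / l_min)

print(2 ** 10, 2 ** 16)       # 1024:1 and 65 536:1, matching the bullets above
print(f_stops(100.0, 0.1))    # ~9.97 f-stops for a 0.1-100 cd/m^2 SDR range
```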

The human eye can approximately see a difference of 100 000 : 1 in a scene with no adaptation [18]. Table 2.1 shows five examples of luminance values in common scenarios [19][20].


Table 2.1: Approximate luminance levels in common scenarios.

Environment                | Luminance level (cd/m²)
Frosted bulb, 60 W         | 120 000
White fluorescent lamp     | 11 000
Clear sky at noon          | 10 000
Cloudy sky at noon         | 1 000
Night sky with full moon   | 0.01

Standard Dynamic Range (SDR) Today’s television systems only provide SDR, which is less than or equal to 10 f-stops. SDR typically supports a range of luminance of around 0.1 to 100 cd/m2. Table 2.1 indicates that SDR is far from being able to provide the luminance levels that the human eye is used to.

Enhanced Dynamic Range (EDR) EDR is an enhanced version of SDR which supports a dynamic range between 10 and 16 f-stops.

High Dynamic Range (HDR) HDR supports a dynamic range of more than 16 f-stops. This means that the range of HDR is significantly bigger than the one of SDR.

Using a SIM2 HDR display [10] which supports a brightness up to 4000 cd/m2, it would be possible to reproduce the brightness of a cloudy sky at noon.

2.1.1 Transfer Functions

When capturing video with a camera, the colors are captured in the linear domain. This means that the color values are linearly proportional to the amount of luminance. The linear domain, however, is not suitable for the quantization required before video coding; the number of bits is typically too small to represent the colors without causing visible errors. The video is therefore typically transferred to a perceptual domain using a Transfer Function (TF) before encoding, and transformed back to the linear domain after decoding using the inverse transfer function.

Barten's model [21] is a model of the human eye's sensitivity to contrast at different levels of luminance. Comparing a transfer function to the model shows how likely it is that the function will cause visible banding artifacts, as described in section 2.6.4, and how efficiently the bits are used. Figure 2.1 is a graph of Barten's curve showing the contrast sensitivity of the human eye. Noticeable in Barten's curve is the fact that the human eye is less sensitive to contrast in dark regions. This fact has been used a lot by traditional gamma models when trying to optimize the usage of bits when encoding images.


Figure 2.1: 8 and 10 bit BT.1886 compared to the Barten curve.

Anything beneath the curve will not cause any visible artifacts, but putting the whole transfer function underneath the curve would require a larger number of available bits.

BT.1886 [22] is a gamma model suggested for HDTV, and two versions of this model are visible in figure 2.1: an 8 bit version and a 10 bit version. However, this model is designed for a limited dynamic range, in this case 0.1 to 100 nits.

As BT.1886 is not suitable for the increased range of HDR [12], three new transfer functions for coding HDR have been discussed in MPEG: the BBC TF, the Philips TF, and the Dolby Perceptual Quantizer Electro-Optical TF (PQ-EOTF, or simply PQ-TF) [12]. These three transfer functions, together with the Barten curve, can be seen in figure 2.2. The BBC model is very similar to BT.1886 up to a certain level, after which an exponential curve is used. The Philips and Dolby transfer functions both follow the Barten curve more smoothly.

2.1.1.1 Philips TF

The Philips transfer function is defined as [9]

PhilipsTF(x, y) = log(1 + (ρ − 1) · (r · x)^(1/γ)) / (log(ρ) · M),   (2.1)

where ρ = 25, γ = 2.4, r = y/5000, and

M = log(1 + (ρ − 1) · r^(1/γ)) / log(ρ).

Figure 2.2: Barten's model with the BBC TF, Philips TF, and PQ-TF.

2.1.1.2 PQ-TF

The Dolby PQ-TF is defined as [9]

PQ TF(x) = ((c1 + c2 · x^m1) / (1 + c3 · x^m1))^m2,   (2.2)

where

m1 = 2610/4096 · 1/4,   (2.3)
m2 = 2523/4096 · 128,   (2.4)
c1 = c3 − c2 + 1 = 3424/4096,   (2.5)
c2 = 2413/4096 · 32,   (2.6)
c3 = 2392/4096 · 32.   (2.7)

The inverse of the PQ-TF is defined as [9]

PQ TF⁻¹(N) = (max[N^(1/m2) − c1, 0] / (c2 − c3 · N^(1/m2)))^(1/m1).   (2.8)
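
As a concrete illustration of equations 2.2-2.8, the following Python sketch evaluates the PQ-TF and its inverse directly from the constants above. It is only an illustrative transcription, not the reference software used later in the thesis; x is linear light normalized so that 1.0 corresponds to 10 000 cd/m².

```python
# Illustrative transcription of the PQ transfer function (eq. 2.2) and its
# inverse (eq. 2.8). x is linear light normalized to [0, 1] (1.0 = 10 000 cd/m^2).
m1 = 2610.0 / 4096.0 / 4.0          # eq. 2.3
m2 = 2523.0 / 4096.0 * 128.0        # eq. 2.4
c2 = 2413.0 / 4096.0 * 32.0         # eq. 2.6
c3 = 2392.0 / 4096.0 * 32.0         # eq. 2.7
c1 = c3 - c2 + 1.0                  # eq. 2.5 (equals 3424/4096)

def pq_tf(x: float) -> float:
    xm1 = x ** m1
    return ((c1 + c2 * xm1) / (1.0 + c3 * xm1)) ** m2

def pq_tf_inv(n: float) -> float:
    nm2 = n ** (1.0 / m2)
    return (max(nm2 - c1, 0.0) / (c2 - c3 * nm2)) ** (1.0 / m1)

x = 100.0 / 10000.0                 # 100 cd/m^2 in normalized units
print(pq_tf(x))                     # ~0.51: 100 nits uses about half the code range
print(pq_tf_inv(pq_tf(x)))          # round-trips back to ~0.01
```

Note how a 100 cd/m² SDR peak already consumes roughly half of the PQ code values, which is the kind of bit allocation that the comparison against the Barten curve in figure 2.2 illustrates.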

2.2 Color Models

A color model is a mathematical model that describes the way colors can be represented using a predefined number of components. Examples of color models are RGB and CMYK. A color model together with a reference color space and an associated mapping function results in a set of colors referred to as a color gamut, where the gamut refers to a subset of the complete reference color space.

2.2.1 RGB

Figure 2.3: Picture separated into the R, G, and B channels.

The RGB color model splits the color information into the three primary colors: red, green, and blue. Figure 2.3 depicts an example picture and the three color channels. This model is an additive color model, meaning that the three colors, added together, can reproduce any of the colors in the color space.

Figure 2.4 depicts how the three primary colors can be mixed to represent other colors.

For instance, adding green to red will result in yellow, and adding all primary colors together will result in white.

Figure 2.4: Additive color mixing, depicting the mixing of the three primary colors.

An RGB color space is defined by three additive primaries: red, green, and blue. Plotted on a chromaticity diagram, an RGB color space is visualized as a triangle, as seen in figure 2.6 with the BT.709 [23] and BT.2020 [24] color spaces. The triangle's corners are defined by the chosen color primaries of that color space, and any color within the triangle can be reproduced. A complete specification of an RGB color space also requires a white point and a gamma correction curve to be defined. Not shown in the figure is the sRGB color space [25]. sRGB shares the same color primaries as BT.709, which means they both share the same color gamut. sRGB, however, explicitly specifies an output gamma of 2.2.

RGB is used when displaying colors on a number of common display types, such as Cathode Ray Tube (CRT), Liquid Crystal Display (LCD), plasma displays, or Organic Light Emitting Diode (OLED). Each pixel on the display consists of three different light sources, one for each color. From a normal viewing distance, the separate sources will be indistinguishable, giving the appearance of a single color.

2.2.2 YCbCr

Figure 2.5: Picture separated into the Y, Cb, and Cr channels.

YCbCr is a family of color spaces where a color value is represented by three components: Y, Cb, and Cr. Y represents the brightness (luminance), while Cb and Cr are the blue and red chroma components holding the color information. Figure 2.5 depicts a picture separated into the three channels. YCbCr should be distinguished from Y'CbCr, where the Y' component (luma), compared to Y, is in a non-linear domain, for instance encoded by gamma correction.

Y’CbCr is a relative color space derived from an RGB color space. The color primaries are provided by a color space such as BT.709 or BT.2020. For conversions between Y’CbCr and R’G’B’, see section 2.3.4 or 2.3.5 depending on color primaries used.

Y’CbCr is preferred when doing video coding as the separation of the luma and chroma components allows for operations such as storing the components at different resolutions.

This is to take advantage of the human visual system and the fact that the chromatic visual acuity is lower than the achromatic acuity [26]. This means that the chroma components can be stored at a lower resolution than the luma component without any major visual impact. The same does not apply for the RGB color model as each of the three channels is of equal importance.

2.3 Color Spaces

To be able to capture, store, and display video with colors, the color information needs to be represented in some way. For this purpose color spaces are used. Color spaces allow for a reproducible color representation.

2.3.1 CIE 1931

The CIE 1931 color spaces [27][28], and the CIE 1931 XYZ color space specifically, describe all colors visible to the human eye and can be seen as the color gamut of the human visual system. The CIE 1931 XYZ color space is depicted as the complete colored area in the chromaticity diagram in figure 2.6. The chromaticity diagram is a simplification of the color space; the color space is actually expressed as a 3D hull, and the X, Y, and Z components of the color space are coordinates in this 3D hull. Given the properties of the human eye, the model defines the Y component as the luminance.

2.3.2 CIELAB

CIELAB, or CIE L*a*b* [28], is a color space describing all the colors in the gamut of human vision, and it was specified mainly to serve as a device-independent model to be used as a reference.


Figure 2.6: Gamuts of BT.709, DCI P3, and BT.2020 on the CIE 1931 color space.

The color space consists of three components, L*, a*, and b*, where L* represents the lightness of the color while a* and b* are the color components.

Equation 2.9 defines the CIELAB color space and how to convert a color value from the CIE 1931 XYZ color space.

L = 116 · f(Y/Yn) − 16,   (2.9a)
a = 500 · [f(X/Xn) − f(Y/Yn)],   (2.9b)
b = 200 · [f(Y/Yn) − f(Z/Zn)],   (2.9c)

where

f(t) = t^(1/3) if t > (24/116)^3, and f(t) = (841/108) · t + 16/116 otherwise,

and Xn, Yn, and Zn are the tristimulus values of a specified white object color stimulus. In this case Yn = 100, Xn = Yn · 0.95047, and Zn = Yn · 1.08883 [9].
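
To illustrate equation 2.9, a direct (and purely illustrative) Python transcription with the white point stated above could look as follows:

```python
# Illustrative transcription of the XYZ -> CIELAB conversion in equation 2.9,
# using the white point Yn = 100, Xn = 0.95047*Yn, Zn = 1.08883*Yn quoted above.
Yn = 100.0
Xn = Yn * 0.95047
Zn = Yn * 1.08883

def f(t: float) -> float:
    return t ** (1.0 / 3.0) if t > (24.0 / 116.0) ** 3 else (841.0 / 108.0) * t + 16.0 / 116.0

def xyz_to_lab(X: float, Y: float, Z: float):
    L = 116.0 * f(Y / Yn) - 16.0
    a = 500.0 * (f(X / Xn) - f(Y / Yn))
    b = 200.0 * (f(Y / Yn) - f(Z / Zn))
    return L, a, b

print(xyz_to_lab(Xn, Yn, Zn))   # the white point itself maps to L = 100, a = b = 0
```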


2.3.3 Wide Color Gamut

Color gamut, as mentioned previously, describes a subset of colors. This could for instance be the range of colors that the human eye may perceive or the range supported by a particular output device. This chapter covers three important color spaces, BT.709, DCI P3, and BT.2020. These color spaces all have their own gamut and figure 2.6 compares these gamuts on top of the CIE 1931 color space.

In addition to HDR, one could increase the realism even further by using a wider color gamut. Ideally the color gamut used would cover the complete color gamut of the human visual system, as in the CIE 1931 color space, but there are still limitations in the video chain. A color gamut larger than that of BT.709 is typically referred to as a Wide Color Gamut (WCG), and as such, both BT.2020 and DCI P3 are referred to as wide color gamuts. These gamuts give a closer rendition of human color perception, and together with HDR they allow for very bright, saturated colors.

2.3.4 BT.709

BT.709 [23] is a standard defining the format parameters for High Definition Television (HDTV). It specifies parameters such as aspect ratio, supported resolutions, frame rates, and the color space. The color space of this standard is what will be covered in this section.

Color Space Conversion

The equation below defines the conversion from Y'CbCr to R'G'B' with BT.709 primaries [9]:

R' = Y' + 1.57480 · Cr,   (2.10a)
G' = Y' − 0.18733 · Cb − 0.46813 · Cr,   (2.10b)
B' = Y' + 1.85563 · Cb,   (2.10c)

where R', G', and B' are non-linear RGB values in the perceptual domain, resulting from transforming the color values using the PQ transfer function. The perceptual domain is connected to the perceptual properties of the human visual system, and the purpose of using it is that it makes for a more efficient representation of the color values. These variables can be defined as

R' = PQ TF(max(0, min(R/10000, 1))),   (2.11a)
G' = PQ TF(max(0, min(G/10000, 1))),   (2.11b)
B' = PQ TF(max(0, min(B/10000, 1))),   (2.11c)

where PQ TF is defined in equation 2.2 in section 2.1.1.2.

Additionally, the conversion from R’G’B’ to Y’CbCr for BT.709 can be approximated as [9]

Y' = 0.212600 · R' + 0.715200 · G' + 0.072200 · B',   (2.12a)
Cb = −0.114572 · R' − 0.385428 · G' + 0.500000 · B',   (2.12b)
Cr = 0.500000 · R' − 0.454153 · G' − 0.045847 · B'.   (2.12c)

The equation below defines how to convert RGB with BT.709 primaries to the CIE 1931 XYZ color space [9]:

X = 0.412391 · R + 0.357584 · G + 0.180481 · B,   (2.13a)
Y = 0.212639 · R + 0.715169 · G + 0.072192 · B,   (2.13b)
Z = 0.019331 · R + 0.119195 · G + 0.950532 · B.   (2.13c)
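
As an illustrative check of equations 2.10 and 2.12 (not code from the thesis processing chain), the following Python sketch converts an R'G'B' triplet to Y'CbCr and back; up to the rounding of the listed coefficients, the round trip returns the input.

```python
# Illustrative round trip with the BT.709 coefficients above:
# eq. 2.12 (R'G'B' -> Y'CbCr) followed by eq. 2.10 (Y'CbCr -> R'G'B').
def rgbp_to_ycbcr_709(Rp, Gp, Bp):
    Yp =  0.212600 * Rp + 0.715200 * Gp + 0.072200 * Bp
    Cb = -0.114572 * Rp - 0.385428 * Gp + 0.500000 * Bp
    Cr =  0.500000 * Rp - 0.454153 * Gp - 0.045847 * Bp
    return Yp, Cb, Cr

def ycbcr_to_rgbp_709(Yp, Cb, Cr):
    Rp = Yp + 1.57480 * Cr
    Gp = Yp - 0.18733 * Cb - 0.46813 * Cr
    Bp = Yp + 1.85563 * Cb
    return Rp, Gp, Bp

Yp, Cb, Cr = rgbp_to_ycbcr_709(0.8, 0.4, 0.2)
print(ycbcr_to_rgbp_709(Yp, Cb, Cr))   # approximately (0.8, 0.4, 0.2)
```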

2.3.5 BT.2020

While BT.709 is the standard for HDTV, the BT.2020 [24] standard defines the format parameters for Ultra High Definition Television (UHDTV). The color spaces of both BT.709 and BT.2020 are visualized in figure 2.6.

Color Space Conversion

The equations below define the conversion from Y'CbCr to R'G'B' with BT.2020 primaries [9]:

R' = Y' + 1.47460 · Cr,   (2.14a)
G' = Y' − 0.16455 · Cb − 0.57135 · Cr,   (2.14b)
B' = Y' + 1.88140 · Cb,   (2.14c)

where R', G', and B' are non-linear RGB values defined as in equation 2.11.


Additionally, the conversion from R’G’B’ to Y’CbCr for BT.2020 can be approximated as [9]

Y' = 0.262700 · R' + 0.678000 · G' + 0.059300 · B',   (2.15a)
Cb = −0.139630 · R' − 0.360370 · G' + 0.500000 · B',   (2.15b)
Cr = 0.500000 · R' − 0.459786 · G' − 0.040214 · B'.   (2.15c)

The equations below define how to convert RGB with BT.2020 primaries to the CIE 1931 XYZ color space [9]:

X = 0.636958 · R + 0.144617 · G + 0.168881 · B,   (2.16a)
Y = 0.262700 · R + 0.677998 · G + 0.059302 · B,   (2.16b)
Z = 0.000000 · R + 0.028073 · G + 1.060985 · B.   (2.16c)

2.3.6 DCI P3

DCI P3 [29] is the specification of the color space used in digital cinemas and it is meant as a standard for modern digital projection. All modern digital cinema projectors are capable of displaying the color space. However, there are not many commercially available monitors that support the DCI P3 color gamut.

The color gamut of DCI P3 is smaller than BT.2020, but larger than the one of BT.709, as seen in figure 2.6. The color gamut of P3 is therefore referred to as a wide color gamut.

The equation below defines how to convert RGB with DCI P3 primaries to the CIE 1931 XYZ color space [9]:

X = 0.486571 · R + 0.265668 · G + 0.198217 · B,   (2.17a)
Y = 0.228975 · R + 0.691739 · G + 0.079287 · B,   (2.17b)
Z = 0.000000 · R + 0.045113 · G + 1.043944 · B.   (2.17c)
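
The RGB-to-XYZ conversions in equations 2.13, 2.16, and 2.17 are all 3x3 matrix multiplications, so they can be collected in one small, purely illustrative sketch:

```python
# The RGB -> XYZ matrices of equations 2.13, 2.16 and 2.17, keyed by primaries.
RGB_TO_XYZ = {
    "BT.709":  [[0.412391, 0.357584, 0.180481],
                [0.212639, 0.715169, 0.072192],
                [0.019331, 0.119195, 0.950532]],
    "BT.2020": [[0.636958, 0.144617, 0.168881],
                [0.262700, 0.677998, 0.059302],
                [0.000000, 0.028073, 1.060985]],
    "DCI P3":  [[0.486571, 0.265668, 0.198217],
                [0.228975, 0.691739, 0.079287],
                [0.000000, 0.045113, 1.043944]],
}

def rgb_to_xyz(rgb, primaries):
    m = RGB_TO_XYZ[primaries]
    return tuple(sum(m[row][col] * rgb[col] for col in range(3)) for row in range(3))

# The same "pure red" pixel lands on different XYZ values under each set of primaries:
for name in RGB_TO_XYZ:
    print(name, rgb_to_xyz((1.0, 0.0, 0.0), name))
```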

2.4 Chroma Subsampling

As mentioned previously, Y'CbCr separates the luma component from the chroma components. Having a lower resolution for the chroma components allows for a lower bit rate without lowering the overall subjective image quality significantly.


The chroma subsampling formats [30] are commonly expressed using a three part ratio a:b:c, which specifies the ratio between the luma and chroma samples.

• a is the Y’ horizontal sampling reference, defining the width of the sampling region.

• b specifies the horizontal subsampling of Cb and Cr, which is the number of Cb and Cr samples in the first row.

• c is the vertical subsampling for Cb and Cr. It is either the same as b or zero, where zero indicates that Cb and Cr are subsampled 2:1 vertically.

The two types that are the most important to understand for this thesis are the 4:2:0 and the 4:4:4.

• 4:4:4 specifies that no subsampling is used, meaning that there is the same number of samples for all components.

• 4:2:0 specifies a subsampling by a factor of 2 for the chroma components, both horizontally and vertically. This means that the resolution is a quarter of the original resolution for the chroma components.

Figure 2.7: The 4:4:4 and 4:2:0 chroma subsampling formats.

Figure 2.7 depicts, for 4:4:4 and 4:2:0, how the luma (Y') and the chroma (Cb + Cr) samples are merged to produce the resulting pixels. 4:4:4 does not give any compression gains, as it results in 3 samples per pixel, similar to an ordinary picture in an RGB color space. 4:2:0, on the other hand, lowers the amount of data required, as we go from 8 chroma samples to 2 chroma samples for every 2×2 block of pixels.
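
A small, purely illustrative calculation of what this means for one Full HD frame:

```python
# Sample counts for one 1920x1080 frame with 4:4:4 versus 4:2:0 subsampling.
width, height = 1920, 1080
luma = width * height                  # Y' samples, unchanged by subsampling

samples_444 = luma + 2 * luma          # full-resolution Cb and Cr
samples_420 = luma + 2 * (luma // 4)   # Cb and Cr at a quarter of the resolution

print(samples_444, samples_420, samples_420 / samples_444)   # ratio is 0.5
```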

2.4.1 4:4:4 to 4:2:0

Chroma downsampling from 4:4:4 to 4:2:0 is done in two steps: first the picture is downsampled horizontally to 4:2:2, and then vertically from 4:2:2 to 4:2:0, as follows [9]:


• First, perform the horizontal downsampling down to 4:2:2. Let the input picture be s[i][j], while W and H are the width and height in chroma samples. For i = [0, H − 1] and j = [0, W/2 − 1], the 4:2:2 samples f[i][j] are derived as follows:

f[i][j] = Σ_{k=−1}^{1} c1[k] · s[i][Clip3(0, W − 1, 2 · j + k)],   (2.18)

where c1[−1] = 1, c1[0] = 6, c1[1] = 1, and

Clip3(x, y, z) = x if z < x, y if z > y, and z otherwise.

• Then perform the vertical downsampling. For i = [0, H/2 − 1] and j = [0, W/2 − 1], the output 4:2:0 samples r[i][j] are derived as follows:

r[i][j] = (Σ_{k=−1}^{1} c2[k] · f[Clip3(0, H − 1, 2 · i + k)][j] + offset) ≫ shift,   (2.19)

where c2[−1] = 0, c2[0] = 4, c2[1] = 4, shift = 6, offset = 32, and ≫ is the right bitshift operator.
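
A runnable, illustrative transcription of the two steps above for a single chroma plane (stored as a list of integer rows) could look as follows; it is not the resampling code used in the CfE processing chain:

```python
# Illustrative 4:4:4 -> 4:2:0 chroma downsampling per equations 2.18-2.19.
def clip3(lo, hi, v):
    return lo if v < lo else hi if v > hi else v

def downsample_420(s):
    H, W = len(s), len(s[0])
    c1 = {-1: 1, 0: 6, 1: 1}          # horizontal filter, eq. 2.18 (sums to 8)
    c2 = {-1: 0, 0: 4, 1: 4}          # vertical filter, eq. 2.19 (sums to 8)
    f = [[sum(c1[k] * s[i][clip3(0, W - 1, 2 * j + k)] for k in c1)
          for j in range(W // 2)] for i in range(H)]
    return [[(sum(c2[k] * f[clip3(0, H - 1, 2 * i + k)][j] for k in c2) + 32) >> 6
             for j in range(W // 2)] for i in range(H // 2)]

flat = [[100] * 8 for _ in range(8)]  # a constant 8x8 chroma plane
print(downsample_420(flat))           # a 4x4 plane that stays at 100
```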

2.4.2 4:2:0 to 4:4:4

Chroma upsampling from 4:2:0 to 4:4:4 is performed in a similar fashion to the downsampling: first vertical filtering is performed, and then horizontal. The steps are as follows [9]:

• Let H and W be the dimensions of the input picture s[i][j] in chroma samples. For i = [0, H − 1] and j = [0, W/2 − 1], the intermediate samples f[i][j] are derived as follows:

f[2 · i][j] = Σ_{k=−2}^{1} d0[k] · s[Clip3(0, H − 1, i + k)][j],
f[2 · i + 1][j] = Σ_{k=−2}^{1} d1[k] · s[Clip3(0, H − 1, i + k + 1)][j],   (2.20)

where the coefficients are defined as in table 2.2.


Table 2.2: Chroma upsampling coefficients.

Phase  | −2 | −1 | 0  | 1
d0[k]  | −2 | 16 | 54 | −4
d1[k]  | −4 | 54 | 16 | −2

• For i = [0, 2 · H − 1] and j = [0, W − 1], the output samples r[i][j] are derived as

r[i][2 · j] = (f[i][j] + offset1) ≫ shift1,
r[i][2 · j + 1] = (Σ_{k=−2}^{1} c[k] · f[i][Clip3(0, W − 1, j + k + 1)] + offset2) ≫ shift2,   (2.21)

where c[−2] = −4, c[−1] = 36, c[0] = 36, c[1] = −4, shift1 = 6, offset1 = 32, shift2 = 12, and offset2 = 2048.
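
The corresponding illustrative sketch of the upsampling steps is shown below; it treats W and H as the chroma-plane dimensions throughout and is again only a transcription of the equations, not the CfE tool chain. The c[−2] coefficient is taken as −4 so that the horizontal filter taps sum to 64, consistent with shift2 = 12, and the intermediate samples f keep the 64x scale of the 4-tap vertical filters.

```python
# Illustrative 4:2:0 -> 4:4:4 chroma upsampling per equations 2.20-2.21.
def clip3(lo, hi, v):
    return lo if v < lo else hi if v > hi else v

def upsample_444(s):
    H, W = len(s), len(s[0])
    d0 = {-2: -2, -1: 16, 0: 54, 1: -4}    # table 2.2, even output rows
    d1 = {-2: -4, -1: 54, 0: 16, 1: -2}    # table 2.2, odd output rows
    c  = {-2: -4, -1: 36, 0: 36, 1: -4}    # horizontal half-sample filter
    f = [[0] * W for _ in range(2 * H)]    # vertical pass (eq. 2.20), keeps x64 scale
    for i in range(H):
        for j in range(W):
            f[2 * i][j]     = sum(d0[k] * s[clip3(0, H - 1, i + k)][j]     for k in d0)
            f[2 * i + 1][j] = sum(d1[k] * s[clip3(0, H - 1, i + k + 1)][j] for k in d1)
    r = [[0] * (2 * W) for _ in range(2 * H)]
    for i in range(2 * H):                 # horizontal pass (eq. 2.21)
        for j in range(W):
            r[i][2 * j]     = (f[i][j] + 32) >> 6
            r[i][2 * j + 1] = (sum(c[k] * f[i][clip3(0, W - 1, j + k + 1)] for k in c)
                               + 2048) >> 12
    return r

print(upsample_444([[100] * 4 for _ in range(4)]))   # an 8x8 plane staying at ~100
```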

2.5 File Formats

Working with video requires formats for representing and storing the data and it is important that this can be performed with minimal losses of information. The format needs to support a larger color gamut and an increased dynamic range compared to the traditional video formats. There are a number of formats for HDR video to choose from, all with different capabilities [4].

This section presents two file formats that are used in this thesis work. It will focus on the bit encoding on a pixel level and not how the full image compression is performed.

2.5.1 EXR

OpenEXR [31] is an open source image format created by Industrial Light and Magic [32] with the purpose of being used as an image format for special effects rendering and compositing. The format is a general purpose wrapper for the 16 bit half-precision floating-point data type Half [4]. The Half format, or binary16, is specified in the IEEE 754-2008 standard [33]. OpenEXR also supports other formats, such as 32 bit floating-point and integer formats [31]. Using the Half data type, the format has 16 bits per channel, or 48 bits per pixel.

The OpenEXR format is able to cover the entire visible gamut and a range of about 10.7 orders of magnitude with a relative precision of 0.1%. Based on the fact that the human eye can see no more than 4 orders of magnitude simultaneously, OpenEXR makes for a good candidate for archival image storage [4].


2.5.2 TIFF

TIFF (Tagged Image File Format) [34] is a widely supported and flexible image format.

It provides support for a wide range of image formats wrapped into one file. The format allows the user to specify the type of image (CMYK, YCbCr, etc), compression methods, and also to specify the usage of any of the extensions provided for TIFF.

2.6 Video Coding

For the purpose of this thesis we see video as a sequence of frames, i.e. a series of pictures. Each frame (picture) consists of a number of pixels, where every pixel stores color information. The color information is typically stored either in an RGB color space or in the YCbCr color space. The frames store the color information using one of the various types of chroma subsampling. The most important subsampling types to consider for this thesis are 4:2:0 and 4:4:4.

Uncompressed video requires a very high data rate, and to meet the limitations of today's networks and typical storage devices the video needs to be compressed. To understand the need for video compression we can look at the size of an uncompressed video sequence in HD. A sequence in Full HD (1920x1080) with a frame rate of 30 frames per second (60i, i.e. an interlaced frame rate of 60 fields per second [35]), 10 bits per color channel, and 4:2:0 sampling has a data rate of 932 Mbps [35], or a total of 410 GB of data for every hour of video. It is clear that this huge amount of data would be impossible to distribute and store efficiently for the average consumer.
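
The data-rate figure quoted above can be verified with a short back-of-the-envelope calculation; the figures in [35] differ slightly, presumably due to rounding or unit conventions.

```python
# Rough data rate of uncompressed Full HD, 4:2:0, 10 bits per sample, 30 frames/s.
width, height, bits, fps = 1920, 1080, 10, 30
samples_per_frame = width * height * 3 // 2      # luma plus quarter-resolution Cb and Cr
bits_per_second = samples_per_frame * bits * fps

print(bits_per_second / 1e6)                     # ~933 Mbps
print(bits_per_second * 3600 / 8 / 1e9)          # ~420 GB for one hour of video
```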

There are two types of compression: lossy compression and lossless compression, where a majority of the existing algorithms use lossy compression. For lossy compression techniques there is a trade-off between video quality, data rate, and coding complexity.

A high quality video stream requires a high data rate, while the required data rate can be lowered by reducing the video quality. The complexity of the coder is also a big factor: a complex coder is able to perform a lot of optimizations when coding, which may increase quality and decrease data rate. However, the time it takes to code a sequence increases with increased complexity.

A typical compression algorithm tries to reduce the required data rate by removing redundant data in the stream, both in terms of spatial and temporal data. Another important concept is perceptual video coding. This concept is about understanding and using human perception to enhance the perceptual quality of the coded video. A good example of this are the transfer functions presented in section 2.1.1; they take advantage of the human visual system and its properties to utilize the available bits more efficiently.

2.6.1 Encoder

Figure 2.8: Block diagram of an encoder.

Figure 2.8 shows a typical video encoder which consists of three main units: a temporal model, a spatial model, and an entropy encoder.

In a video sequence there are typically two types of redundancies that the coding process tries to reduce:

• Temporal redundancy, which is similarities between multiple frames, for instance if two sequential frames have the same values in a given region.

• Spatial redundancy, which is similarities or patterns within the same frame, for instance if a picture consists of a solid color or a repeated pattern.

Temporal Model The temporal model attempts to reduce temporal redundancy by finding similarities between neighboring video frames. It then constructs a prediction of the current video frame by looking at these similarities. The input for this step is the uncompressed video sequence, and there are two outputs: the residual and the motion vectors. The residual is the difference between the prediction and the actual frame, and the motion vectors describe the motion in reference to the neighboring frames.

Spatial Model The spatial model attempts to reduce any spatial redundancy. Compared to the temporal model, this model only references the current frame. It makes use of similarities in the local picture. One way to reduce the redundancy is to use transform coding. The residual samples are transformed into the frequency domain, in which the video signal is viewed in terms of frequency bands. The signal is there represented by transform coefficients. The coefficients are then quantized to reduce the number of insignificant values.

Entropy Encoder The entropy encoder takes the motion vectors and the transform coefficients as input. This step uses a more general compression approach; it tries to compress the input using entropy coding to even further reduce any redundant data.

2.6.2 Decoder

The decoder is similar to the encoder but works in reverse. It takes the bit stream generated by the encoder as input and then tries to reproduce the original sequence of frames. First the process decodes the motion vectors and the quantized transform coefficients. The coefficients are rescaled to invert the quantization performed in the encoder; however, as this is a lossy process, the coefficients will not be equal to the original coefficients.

The residual data is then restored by doing an inverse transform on the coefficients.

Due to the losses in the process the resulting residual data will not be the same as the original. The picture will then be reconstructed by adding the decoded residual data to the predicted picture generated by the motion vectors together with any previous reference frames.

2.6.3 Rate-Distortion Optimization

When compressing video the coder wants to provide high quality video; however, there is a trade-off between video quality and the data rate required. Rate-Distortion Optimization (RDO) refers to optimizing the amount of distortion in the video against the data rate required.

Rate-distortion optimization is utilized a lot within a typical video coder. This allows the coder to try out a number of various techniques for coding the video and then comparing the cost for each, making sure the most cost effective technique is used. This will however increase the complexity at the encoder side as it will try to code the video in a number of different ways.

The typical video coder splits the input video into smaller regions, macroblocks in older standards or coding tree units in HEVC, allowing the rate-distortion optimization to determine the best type of prediction and mode on a region to region basis.


2.6.4 Video Coding Artifacts

As the video coder tries to reduce the data rate as much as possible, as mentioned previously, there is a clear trade-off between data rate and video quality. There are usually very noticeable artifacts on highly compressed video. This section describes some of the more common types of artifacts encountered in video coding.

Figure 2.9: Illustration of color banding.

Color Banding Banding is an artifact that causes inaccurate colors in an image. This artifact is produced when there is not a sufficient number of bits to represent the colors in an image. Natural gradients are typical examples where this artifact may be visible: the number of bits is not sufficient to represent the complete gradient without abrupt changes between two colors. Figure 2.9 shows three versions of the same image: one with a very low number of bits per channel (leftmost) and visible banding artifacts, and one with a higher bit count (rightmost) that appears to be smooth. There are possible ways to avoid or hide this type of artifact.

• One could increase the bits per pixel. However, it is not always possible to increase the bit depth.

• Try to encode the available bits more efficiently, as described in section 2.1.1.

• Attempt to hide the artifact by applying intentional noise (dither) to the image, see middle image in figure 2.9.

Blurring Transform coding is typically used in video compression and as a way to control the quality of the video stream the resulting coefficients are quantized. For low quality video coding the coefficients are quantized very coarsely and this may zero out the high frequency components [36]. This yields a low-pass like effect and the resulting video may be perceived as low resolution and blurry. Figure 2.10 shows an example of this artifact with a clear loss of detail in the middle region of the white tent.


Figure 2.10: Illustration of blurring artifacts.

Figure 2.11: Illustration of blocking artifacts.

Blocking Blocking is common when using macroblocks, or, as in HEVC, coding tree units, in both image and video coding. The use of macroblocks or coding tree units may cause the coder to code neighboring blocks differently. For instance, when performing transform coding, each block produces its own set of transform coefficients, and the blurring artifact previously mentioned will then lead to discontinuities at the block boundaries [37]. Figure 2.11 shows an image with block coding artifacts caused by the macroblocking performed in JPEG coding. To reduce this type of artifact the coder typically performs either post filtering or in-loop filtering. In-loop filtering is applied as a part of the encoder loop. HEVC uses two in-loop filters in an attempt to minimize this type of artifact: the deblocking filter and the so-called Sample Adaptive Offset (SAO) filter [2].

Ringing Ringing artifacts are fundamentally associated with the Gibbs phenomenon and are as such typically produced along high contrast edges in areas that are generally smooth [37]. It typically appears as a rippling outwards from the edge. Figure 2.12 illustrates several examples of ringing, the clearest being visible around the edges of the cube. This type of artifact is closely related to the blurring artifact, as they are both caused by quantization of the transform coefficients [37].


Figure 2.12: Illustration of ringing artifacts.

The SAO filter was partly designed to correct these types of errors [38].

2.6.5 HEVC Standard

The High Efficiency Video Coding (HEVC) [2] standard is a successor to the MPEG-4 H.264/AVC standard [39]. HEVC can provide significantly increased coding efficiency compared to previous standards [40].

The second version of the standard includes a range extension (RExt) which supports higher bit depth and additional chroma sampling formats on top of 4:2:0 (4:0:0, 4:2:2 and 4:4:4) [3].

HEVC uses the same hybrid approach as many of the previous standards, using a combination of inter-/intra-picture prediction and 2-D transform coding [2].

Figure 2.13 depicts a block diagram of a typical HEVC video encoder. The encoder also duplicates the decoding process; the decoder elements are the shaded blocks in the figure. This allows the encoder to generate predictions identical to the ones of the decoder, which allows for better inter-picture prediction. The Sample Adaptive Offset (SAO) filter [38] also uses the generated predictions to determine suitable parameters that help correct various errors and artifacts.

Input Video This is the input video that the coder is encoding. The encoder first proceeds by splitting each picture of the input video into block-shaped regions called coding tree units [2]. The coder then goes on to decide which type of prediction to use.

Figure 2.13: Block diagram of the HEVC encoder (blocks shaded in gray are decoder elements).

Intra-Picture Estimation Intra-picture prediction is the first of the two types of prediction used, and it performs predictions based only on data available in the same picture. Intra-picture prediction therefore has no dependence on other pictures. It is the only possible prediction mode when coding the first picture of the sequence or the first picture of a random access point [2].

Motion Compensation For the remaining pictures of the sequence, inter-picture prediction is typically used for the majority of the blocks [2]. In this mode predictions are made based on adjacent pictures in the sequence. The encoder side predicts motion vectors for the blocks that the decoder then will compensate for.

Mode Decision The encoder will then have to decide which mode to use, intra-picture prediction or motion compensation. If the picture does not happen to be a picture where intra-picture prediction is forced (i.e. first picture of the sequence or a random access point), the type of prediction is typically determined by performing RDO [2]. The prediction is decoded and the result is subtracted from the original picture to create a residual. The data needed to perform the predictions are also sent to the CABAC module, either the motion vectors of the inter-picture prediction or the intra-picture prediction data depending on what decision was made.


Transform, Scaling & Quantization The residual signal of the intra- and inter-picture prediction is then coded using transform coding. This is done by first transforming the signal by a linear spatial transform. The transform coefficients are scaled, quantized, and entropy coded before being sent to the CABAC module. The quantized transform coefficients are then inverse transformed to duplicate the decoded approximation of the residual signal. The residual signal is then added to the predicted signal, and the resulting signal is fed into the deblocking and SAO (Sample Adaptive Offset) filters.

Filter Control When reconstructing the picture, the deblocking and SAO filters of the decoder also need to be duplicated. The purpose of these filters is to smooth out any artifacts caused by the block-wise processing and quantization. In this step the encoder also determines the parameters for the SAO filter which will be used in the real decoding process, so the resulting parameters are sent to the CABAC module.

After the reconstructed signal has gone through the two filters it will be saved in a buffer of decoded pictures. This is the buffer that will be used when doing prediction on subsequent pictures.

CABAC Any data that is about to be a part of the bitstream is run through an entropy coder. In this case Context Adaptive Binary Arithmetic Coding (CABAC) is used [2]. This module will code all the coefficients, motion vectors, intra-picture prediction data, filter parameters, and any other data necessary before constructing the resulting bitstream.

2.6.5.1 Quantization Parameter

The quantization performed on the transform coefficients is determined by a Quantization Parameter (QP) [2] that is set during coding to control the quality or the data rate of the coder. The range of the QP values is defined from 0 to 51. An increase of 1 in the QP means an increase of the quantization step size by approximately 12%, and an increase of 6 means an increase by exactly a factor of 2. It can also be noted that an increase of the quantization step size by 12% typically results in a reduction of roughly 12% in bit rate [39].
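
The step-size behaviour described above corresponds to the quantization step growing as 2^(ΔQP/6). This exponential relation is not spelled out in the text, so the sketch below is only an illustration of the quoted 12% and factor-of-2 figures:

```python
# Relative quantization step size as a function of a QP increase.
def qstep_ratio(delta_qp: int) -> float:
    return 2.0 ** (delta_qp / 6.0)

print(qstep_ratio(1))   # ~1.12, i.e. roughly a 12% larger step size
print(qstep_ratio(6))   # 2.0, i.e. exactly a factor of 2
```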

2.6.5.2 Coding Tree Units

In previous standards such as H.264/AVC [39], the picture was typically split into macroblocks, consisting of a 16x16 block of luma samples and two 8x8 blocks of chroma samples in the case of 4:2:0 subsampling.


Figure 2.14: Overview of the coding tree unit (CTU).

HEVC introduces a new concept replacing the typical macroblock with the Coding Tree Unit (CTU) [2]. Compared to a macroblock, the CTU is not of fixed size; the size is selected by the encoder and it can be larger than a traditional macroblock, up to 64x64 pixels. A CTU consists of three Coding Tree Blocks (CTBs), one for the luma samples and two for the corresponding chroma samples, as shown in figure 2.14. The tree structure of the CTB allows for partitioning into smaller blocks called Coding Blocks (CBs) using quadtree-like signaling [41].

The prediction types for the blocks are then coded into Coding Units (CUs), where each CU consists of three CBs, one for luma and two for chroma. Each CU also has an associated partitioning into Prediction Units (PUs) and Transform Units (TUs).

The decision whether to use interpicture or intrapicture prediction is made at the CU level. Depending on decisions in the prediction process, the CBs can be further split in size and predicted by the Prediction Blocks (PBs) in the PUs.

The residual from the prediction is coded using transform coding. A TU is a tree structure with its root at the CU level consisting of Transform Blocks (TBs). A TB may be of the same size as a CB residual, or it may be split into smaller TBs.

2.6.5.3 Deblocking Filter

Similar to the H.264/AVC standard, HEVC also uses an in-loop deblocking filter [42].

This filter operates within the encoding and decoding loops and is used to reduce the visible artifacts at the block boundaries caused by the block-based coding. The filter detects the artifacts and it then makes decisions on whether to use filtering or not, and subsequently what filtering mode to use.


2.6.5.4 Sample Adaptive Offset

In addition to the deblocking filter, HEVC introduces a new in-loop filtering technique, Sample Adaptive Offset (SAO) [38]. This filter is applied after the deblocking filter, and its purpose is to improve the reconstruction of the original by correcting various errors caused by the encoding process. The filter is applied at the CTB level, and given that a CTU has a CTB for every component, one for luma and two for chroma, the filter is applied for every color component.

At the encoder the filter classifies each reconstructed sample into one of two categories, Edge Offset (EO) or Band Offset (BO). This is done similarly to how RDO is performed, determining which mode and offsets are optimal. The offsets are optional and will only be applied if they have the possibility to increase the quality of the final picture. The offsets are determined and signaled through the bitstream to the decoder, which applies these offsets to the samples when reconstructing the picture.

Edge Offset For the EO mode the sample is classified by comparing the sample to two of its eight neighboring samples, in one of four directional patterns: horizontal, vertical, and two diagonal patterns [38]. EO allows for both smoothing and sharpening of edges in the picture, and it helps correcting errors such as ringing artifacts. A positive offset results in smoothing, while a negative offset would make the edge sharper. However, based on statistical analysis, HEVC disallows sharpening and only sends absolute values of the offsets [38].

Band Offset For the BO mode the offsets are selected based on the amplitude of the sample. The full sample range is divided into 32 bands and the sample is categorized into one of these bands. Four offsets are determined for four consecutive bands and are then signaled to the decoder. At the decoder, one offset is applied to all samples of the corresponding band [38]. Using only four consecutive bands helps correcting banding artifacts, as these typically appear in smooth areas where the sample amplitudes tend to be concentrated in only a few of the bands [2].

SAO also provides a number of ways to reduce the information needed to be transmitted between the encoder and the decoder, such as allowing multiple CTUs to share the same SAO parameters.
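
To make the band-offset idea concrete, the following illustrative sketch splits a 10-bit sample range into 32 equal bands and adds an offset only in four consecutive bands. The band start and the offset values are made up here, whereas in HEVC they are chosen by the encoder and signaled in the bitstream:

```python
# Illustrative band-offset filtering: 32 equal bands, offsets on 4 consecutive bands.
def apply_band_offset(samples, band_start, offsets, bit_depth=10):
    shift = bit_depth - 5                     # 32 bands across the full sample range
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        band = s >> shift
        if band_start <= band < band_start + 4:
            s = min(max(s + offsets[band - band_start], 0), max_val)
        out.append(s)
    return out

# Samples in bands 3-5 get adjusted; the last sample is outside the signaled bands.
print(apply_band_offset([100, 130, 160, 600], band_start=3, offsets=[2, -1, 3, 0]))
```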


2.6.5.5 Profiles

The standard defines a number of different profiles [3]. A profile defines a range of bit depths, supported chroma sampling formats, and a set of coding tools that conforms to that profile [2]. The encoder may choose which settings and coding tools to use as long as they conform to the specification of the profile. The decoder on the other hand is required to support all coding tools that the profile supports.

In the second version of the standard, the format range extensions (RExt) were included. These extensions allow for profiles with higher bit depths and additional chroma sampling formats. The 12 bits per channel and 4:4:4 profiles are examples of the formats supported by RExt.

Main The Main profile is the most common profile and it allows for a bit depth of 8 bits per sample and 4:2:0 chroma subsampling. This is the most common format of video used [3].

Main 10 Main 10 is similar to the Main profile with 4:2:0 chroma subsampling, but it allows for a bit depth of up to 10 bits per sample [3]. The extra two bits per sample compared to the 8 bits of the Main profile is a big benefit, and it also allows for larger color spaces [43]. This thesis, with its requirements of HDR and WCG, will focus mainly on this profile. It has also been stated that the Main 10 profile with 10 bits per sample provides higher picture quality at the same bit rate as the Main profile [44].

Main 12 Main 12 allows for a bit depth between 8 and 12 bits per sample with 4:0:0 and 4:2:0 chroma subsampling [3].

Main 4:4:4 10 Main 4:4:4 10 only allows a bit depth of 10 bits per sample, just as Main 10, but in addition to 4:2:0 it also supports 4:0:0, 4:2:2, and 4:4:4 chroma subsampling [3].

Main 4:4:4 12 Main 4:4:4 12 supports the same chroma subsampling formats as Main 4:4:4 but it allows for a bit depth up to 12 bits per sample [3].


2.7 Quality Measurement

To measure the performance of a video coder, you can either perform subjective measurements or objective measurements. When doing subjective measurements, human observers watch and rate the quality of the video. For objective measurements, on the other hand, mathematical models are designed to approximate the results obtained from subjective measurements.

Given that the video is to be consumed by a human being in the end, subjective measurements are of more value. However, they are usually very costly and time-consuming to gather. Therefore, objective measurements are commonly used as a preliminary quality measurement.

Objective measurements can be done using a number of various models. This thesis will focus mainly on the tPSNR and the mPSNR measurements introduced for the MPEG CfE, which both are variations of the PSNR (Peak Signal-to-Noise Ratio) measure. The reason for introducing these measures is that the non-linear behavior of the human visual system makes PSNR an ill-fitted measurement when it comes to image compression [45].

Despite its drawbacks it has been widely used for Standard Dynamic Range (SDR) video.

However, there seems to be a general understanding that the measurement works much worse for HDR video.

2.7.1 PSNR

Peak signal-to-noise ratio (PSNR) defines the ratio between the original video and the error introduced by the compression. The PSNR is calculated as

PSNR = 10 · log10(255² / MSE),   (2.22)

where MSE is the mean square error, defined as

MSE = (1 / (W · H)) · Σ_{y=1}^{H} Σ_{x=1}^{W} [Fo(x, y) − Fr(x, y)]².   (2.23)

Here W and H are the width and height of the video, Fo(x, y) is the original frame, and Fr(x, y) is the reconstructed frame.
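
A minimal, illustrative transcription of equations 2.22-2.23 for two equally sized 8-bit frames, each given as a flat list of samples:

```python
import math

def psnr(orig, recon):
    """PSNR per equations 2.22-2.23 for 8-bit content (peak value 255)."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)

orig  = [10, 50, 90, 130, 170, 210, 250, 30]
recon = [12, 49, 91, 128, 171, 208, 252, 30]
print(psnr(orig, recon))   # roughly 44 dB for these small example errors
```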


2.7.2 tPSNR

When calculating the tPSNR [9] measurement, an average of the PQ and the Philips transfer functions is used. This is to give a result closer to the subjective results and to avoid biasing the measurement towards any specific transfer function.

First, both transfer functions are required to be normalized to support 10 kcd/m². The content to be transformed is expected to be in linear-light 4:4:4 RGB EXR format; if not, the content has to be converted first.

To calculate the measurement there are a number of steps applied for each sample of the two contents to compare.

• Each sample needs to be normalized to support a luminance of 10 000 cd/m²; this is done by dividing the values by 10 000.

• The samples are then converted to XYZ, see equation 2.13, 2.16, or 2.17 depending on the color space of the samples (BT.709, BT.2020, or DCI P3).

• Apply the transfer functions for each sample:

X' = (PQ TF(X) + PhilipsTF(X, 10000)) / 2,
Y' = (PQ TF(Y) + PhilipsTF(Y, 10000)) / 2,
Z' = (PQ TF(Z) + PhilipsTF(Z, 10000)) / 2,

where PQ TF(x) is defined in equation 2.2 and PhilipsTF(x, y) is defined in equation 2.1.

• Four sum-of-squared-error (SSE) values are computed between the two contents: SSEx, SSEy, SSEz, and SSExyz, where SSExyz = (SSEx + SSEy + SSEz) / 3.

• Finally, the PSNR values are computed for each SSE as

tPSNR = 10 · log10(nbSamples / SSE),

where nbSamples = 1024² for an input with 10 bits per color channel, and the SSEs are clipped to 1e−20.
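
The steps above can be sketched as follows. The sketch is illustrative only: philips_tf follows the reconstruction of equation 2.1 (with r = y/5000), pq_tf is equation 2.2, and the actual sample count of the input is used in place of the fixed nbSamples constant, so the absolute numbers should not be read as CfE-conformant values.

```python
import math

m1, m2 = 2610 / 4096 / 4, 2523 / 4096 * 128
c2, c3 = 2413 / 4096 * 32, 2392 / 4096 * 32
c1 = c3 - c2 + 1

def pq_tf(x):                                       # eq. 2.2
    return ((c1 + c2 * x ** m1) / (1 + c3 * x ** m1)) ** m2

def philips_tf(x, y=10000.0, rho=25.0, gamma=2.4):  # eq. 2.1 as reconstructed above
    r = y / 5000.0
    m = math.log(1 + (rho - 1) * r ** (1 / gamma)) / math.log(rho)
    return math.log(1 + (rho - 1) * (r * x) ** (1 / gamma)) / (math.log(rho) * m)

def tpsnr(orig_xyz, test_xyz):
    """tPSNR over lists of linear-light (X, Y, Z) samples normalized by 10 000 cd/m^2."""
    transform = lambda s: [(pq_tf(v) + philips_tf(v)) / 2 for v in s]
    sse = [0.0, 0.0, 0.0]
    for o, t in zip(orig_xyz, test_xyz):
        for comp, (ov, tv) in enumerate(zip(transform(o), transform(t))):
            sse[comp] += (ov - tv) ** 2
    sse_xyz = max(sum(sse) / 3, 1e-20)
    return 10 * math.log10(len(orig_xyz) / sse_xyz)

orig = [(0.010, 0.010, 0.010), (0.50, 0.40, 0.30)]
test = [(0.011, 0.010, 0.010), (0.50, 0.41, 0.30)]
print(tpsnr(orig, test))
```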

2.7.3 CIEDE2000

The CIEDE2000 [46] formula is used to compute the difference, ∆E, or distance, between two colors. The difference between two colors is a metric of great interest in color science, and the purpose of the metric here is to provide a measurement of the difference between two pictures.

This section describes how to compute an objective measurement based on the CIEDE2000 [9], which will be used when evaluating the quality of a particular implementation.

Firstly, the process requires the contents that are to be compared to be in linear-light 4:4:4 RGB EXR format. For instance, if the content is in the Y'CbCr 4:2:0 format, it first needs to be upsampled to 4:4:4 (see section 2.4.2) and then converted to linear-light RGB according to equation 2.10 for content with BT.709 primaries, or 2.14 for BT.2020 primaries.

Subsequently, the following steps are to be applied for each (R, G, B) sample of the two contents to compare, both the original and the test material.

• Convert the samples from RGB to the XYZ color space according to equation 2.13 for BT.709 primaries, or 2.16 for BT.2020 primaries.

• Convert from XYZ to the CIELAB color space according to equation 2.9.

• Given the two samples to compare, (L1, a1, b1) and (L2, a2, b2), the CIEDE2000 color difference ∆E is calculated as follows [46][47]:

1. Calculate the modified chroma, C'_i, and hue angle, h'_i:

C_{i,ab} = sqrt(a_i² + b_i²) for i = 1, 2,   (2.25)

C̄_ab = (C_{1,ab} + C_{2,ab}) / 2,   (2.26)

G = 0.5 · (1 − sqrt(C̄_ab^7 / (C̄_ab^7 + 25^7))),   (2.27)

a'_i = (1 + G) · a_i for i = 1, 2,   (2.28)

C'_i = sqrt((a'_i)² + b_i²) for i = 1, 2,   (2.29)

h'_i = 0 if b_i = a'_i = 0, and h'_i = tan⁻¹(b_i / a'_i) otherwise, for i = 1, 2.   (2.30)
