
Institutionen för systemteknik
Department of Electrical Engineering

Scalable video coding using the Discrete Wavelet Transform

Master's thesis (examensarbete) in Image Coding, performed at Linköping Institute of Technology

by

Gustaf Johansson

LITH-ISY-EX--10/4209--SE

Linköping 2010

Department of Electrical Engineering, Linköpings universitet


Handledare (Supervisors): Harald Nautsch (ISY, Linköpings universitet) and Ola Hållmarker (MobiTV Inc.)
Examinator (Examiner): Robert Forchheimer (ISY, Linköpings universitet)


Avdelning, Institution / Division, Department: Division of Information Coding, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Datum / Date: 2010-09-14
Språk / Language: English (Engelska)
Rapporttyp / Report category: Examensarbete (Master's thesis)
URL för elektronisk version: http://www.bk.isy.liu.se http://www.ep.liu.se
ISRN: LITH-ISY-EX--10/4209--SE
Titel / Title: Skalbar videokodning med diskret wavelettransform / Scalable video coding using the Discrete Wavelet Transform
Författare / Author: Gustaf Johansson
Nyckelord / Keywords: DWT, Discrete Wavelet Transform, EZW, Embedded Zerotree Wavelet, SVC, Scalable Video Coding


Abstract

A method for constructing a highly scalable bit stream for video coding is presented in detail and implemented in a demo application with a GUI in the Windows Vista operating system. The video codec uses the Discrete Wavelet Transform in both spatial and temporal directions together with a zerotree quantizer to achieve a highly scalable bit stream in the senses of quality, spatial resolution and frame rate.

Sammanfattning

In this work a method for creating a highly scalable video stream is presented. The method is then implemented in its entirety in the C and C++ programming languages, with a graphical user interface, on the Windows Vista operating system.

The method uses the discrete wavelet transform in the spatial dimensions as well as in the time dimension, together with a zerotree quantizer, to achieve a video stream that is scalable in terms of image quality, display resolution and frame rate.


Acknowledgments

I would first like to thank my supervisor M.Sc. Ola Hållmarker at MobiTV, especially for his advice and guidance on the practical implementation in this work, and Professor Robert Forchheimer for patiently and kindly answering my questions regarding the results of the practical work as well as giving his thoughts on how to outline this report.

Thanks to Klas Nordberg at ISY¹ for giving the interesting course Multidimensional Signal Processing and especially for giving the application in wavelet transforms as a special project.

Thanks to Kurt Hansson at MAI² for giving the great course Introduction to Wavelets during the spring of 2008. These two courses have certainly increased my interest in wavelet transforms and their applications and are two of the biggest motivators for pursuing a thesis related to wavelet transforms.

Furthermore I would like to thank my brother Andreas for advice as well as emotional support in handling the vile and ruthless creatures hiding in the bug jungles. Also lots of love to my parents, who have always supported me in the big battles finally leading to this great victory.

Thank you very much!

1. Department of Electrical Engineering.
2. Department of Mathematics.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Method
  1.3 Thesis outline
  1.4 Glossary
2 Video coding
  2.1 What is video data?
    2.1.1 What is a video?
    2.1.2 Sampling the video data
    2.1.3 Video resolution standards
    2.1.4 Color channel representations
    2.1.5 Properties of the human eye
    2.1.6 Standards for subsampling the chrominances
  2.2 The structure of a video codec
    2.2.1 Overview of a modern general purpose coder
    2.2.2 The Transform
    2.2.3 The Quantizer
    2.2.4 Exploiting the video motion
3 Wavelet Theory
  3.1 The Discrete Wavelet Transform
    3.1.1 History
    3.1.2 Wavelet Theory
    3.1.3 The Mallat fast wavelet algorithm
    3.1.4 Complexity analysis of the FWT
    3.1.5 Detailed description of the FWT
    3.1.6 Variable filter bank multiresolutions
    3.1.7 Images as signals
    3.1.8 Further DWT properties of practical importance
    3.1.9 The lifting transform
  3.2 Organizing the transform coefficients
    3.2.1 A neat property of the DWT
    3.2.2 The great discovery of Jerome Shapiro
    3.2.3 Time, what is time?
4 The Bit Stream
  4.1 The problem of scalability
    4.1.1 Separable transforms
    4.1.2 The quantization
    4.1.3 Composing the stream
    4.1.4 Rate Controller
    4.1.5 File Writer
    4.1.6 Wavelet Coder
    4.1.7 Wavelet Decoder
    4.1.8 File Reader
    4.1.9 Raw Writer
5 The codec demo application
  5.1 Overview
  5.2 Encoder
  5.3 Decoder
  5.4 Stream information
6 Visual results and performance comparisons
  6.1 MRA images
  6.2 Scalability in quality
  6.3 Performance results and comparisons
    6.3.1 Performance comparisons for quality scalability
    6.3.2 Comparisons with the x264 encoder
7 Future Improvements
  7.1 Motion Prediction
    7.1.1 Block search methods
    7.1.2 Motion threading
    7.1.3 Speed optimizations
    7.1.4 The Lifting Transform
Bibliography
A The Rescaling Function
  A.1 Depicting image MRAs
  A.2 Some initial observations
  A.3 The coefficient rescaling function
  A.4 Properties of the logarithms
  A.5 Iterative properties

Chapter 1

Introduction

Mobile devices (such as mobile phones, handhelds, netbooks and laptops) are getting better screens and computational performance at a rapid rate. These mobile devices have large variations in screen resolutions and computational abilities. To be able to stream a specific video to these devices, it is necessary to have the coded video data available in many different combinations of resolutions, frame rates and bit rates. If a non-scalable video codec is used, many different streams have to be encoded at the sender side - which requires a large amount of calculation for all the encoders as well as a large amount of storage space. This makes it interesting to investigate different methods to code video with a bit stream that is scalable in all aspects of quality, resolution and frame rate. It is well known in the image coding community that the Discrete Wavelet Transform (hereafter denoted DWT) allows a decomposition of images for coding that makes scalability in resolution easy to achieve.

It is interesting to investigate whether this can be used to code moving pictures as well. The aim of this thesis is to investigate if it is possible to create a scalable video stream so that it is easy for the receiving devices to pick out only the parts that are relevant to them.

If this is possible, a full implementation of a demo application for one codec with these properties should be made in the C and C++ programming languages that demonstrates the scalability of the bit stream on the Microsoft Windows Vista operating system.

1.1 Problem statement

The problem to be solved in this thesis is to investigate whether the Discrete Wavelet Transform can be used to create a video bit stream that is scalable in spatial resolution, temporal resolution (frame rate) and bit rate (quality). If this is possible, one method should be proposed, implemented in the C and C++ programming languages, and demonstrated.


1.2 Method

The Discrete Wavelet Transform is used in both the spatial and temporal dimensions to create a decomposition of the video. Then a quantization similar to Shapiro's method for images [2] is used to create packets that are localized in space, time and quality. These data packets are stored in a specified file format with a specified header structure on a secondary storage device.

A sender selects the interesting packets from the stream and sends them to the decoder, which decodes and stores the reconstructed video in the .yuv format. A stream stripper sorts out the interesting packets from the stream and rewrites them as a new stream.

1.3 Thesis outline

This chapter is an introduction to the thesis, describing the problem at hand and the thesis outline, together with a glossary explaining various often-occurring terms and abbreviations.

The second chapter is an introduction to hybrid video coding in general: color spaces, groups of pictures, transform coding, motion compensation and so on.

The third chapter contains the theory for the implemented algorithms required to construct the stream, with focus on the Discrete Wavelet Transform and the zerotree quantizer.

The fourth chapter describes how the bit stream is constructed with the tools presented in the theory chapter.

The fifth chapter describes the implemented demo application, the different parts of the GUI and how to encode and decode video streams.

The sixth chapter contains sample images decoded at different bit rates, spatial resolutions and frame rates to demonstrate the scalability achieved from a single high quality video stream with high bit rate, resolution and frame rate. The source stream and all decoded frames shown there are created with the demo application described in chapter 5. The chapter also demonstrates the different types of artifacts that occur in the different temporal modes. Finally, a set of performance tests is made on video sources of different resolutions.

The seventh chapter describes possible future enhancements to the codec, such as speed optimizations of existing algorithms and possible performance gains from algorithms presented in some relevant papers.


1.4 Glossary

Here is a table describing some commonly used abbreviations.

General Video Coding terms and abbreviations:

DFT: Discrete Fourier Transform
FFT: Fast Fourier Transform
DCT: Discrete Cosine Transform
MC: Motion Compensation
ME: Motion Estimation
Pixel: Picture Element
MPEG: Motion Picture Experts Group
Codec: Implementation of coder and decoder
JPEG: Joint Picture Experts Group
HDTV: High Definition TeleVision
Luminance: Gray-scale channel of video
Chrominance: Color channel of video
HVS: Human Visual System
Transform: Mathematical tool to change the representation of a function to terms of other functions.
Quantization: Representing transform coefficients with limited precision.
Scanning pattern: Order of storing quantized transform coefficients.
Entropy coding: Lossless coding of a signal based on statistics of the signal.
I frame: Intra frame.
P frame: Predicted frame.
B frame: Frame predicted from multiple frames (bidirectional prediction).
Sub-band coding: Splitting the signal into several (Fourier) frequency bands.
Separable transform: Transformation of a multi-dimensional signal can be performed by transforming the individual dimensions one after another.
Lossless coding: No error is introduced in the coding process.
Lossy coding: Coding that introduces error in the coding process.


Wavelet Transform abbreviations:

DWT: Discrete Wavelet Transform
MRA: Multi Resolution Analysis
FWT: Fast Wavelet Transform
EZW: Embedded Zerotree Wavelet

Stream terms:

GOP: Group of pictures
Frame: A still image or transform image
Level: A spatial resolution layer in the DWT
Digit: A quality layer in the EZW
Data packet: A stream of bits belonging to a certain GOP, Frame, Level and Digit
Data header: Specifies what Level and Digit the following data packet belongs to.
Frame header: Specifies what Frame the following data packets belong to.
GOP header: Specifies what GOP the following frames belong to, and data that is important to the next GOP.
Movie header: Gives information that is relevant to the entire video stream.


Chapter 2

Video coding

2.1 What is video data?

2.1.1 What is a video?

A video is a time sequence of images.

An image is defined as a 2D signal. In the case of a grayscale image, only one value is assigned to each sample point. If we have a color image, there are three values assigned to each sample point. What these values correspond to in the real world depends on the color space used in representing the video data.

2.1.2 Sampling the video data

Sampling of video data is done with photographic cameras. The CCD arrays are almost always organized so that the data is sampled on a 2-dimensional, regularly spaced sampling grid¹. Each discrete sampled value of an image is traditionally called a pixel, although pel (Picture Element) might be a more intuitive abbreviation. However, pixel quickly became popular, so that today probably many more people know what a pixel is than a pel². Images usually have a fixed height and width, which are measured in pixels. The aspect ratio is width/height. For instance, an image with width 640 pixels and height 480 pixels has the aspect ratio 640/480 = 4/3 = 1.333...

1. This is usually the case. Other sampling formats have been proposed, such as a hexagonal sampling grid, which can be shown to have better properties in the Fourier transform domain.
2. For instance, in digital camera advertisements one can often see the "performance" of the cameras (the maximum resolution of the output images) measured in megapixels.

2.1.3 Video resolution standards

There are many different standards for video resolution. Depending on the application, the resolutions of video images span a very large range today.

The new HDTV standard, for instance, allows the resolutions 1280x720 (720p), 1920x1080 (1080p, full HD) and 3840x2160 (2160p, quad HD). The last resolution is aimed at cinema applications while the two lower ones are for home use (Blu-ray video, HDTV, entertainment systems and so on). Traditional resolutions for computer monitors with aspect ratio 4:3 are 1024x768 (XGA), 640x480 (VGA) and 1600x1200, together with the common 1280x1024 (which is strictly a 5:4 ratio). Wide-screen computer monitors usually support the HD resolutions as well as resolutions in between.

For mobile TV, resolutions are rarely higher than 320x240 (QVGA) or 352x288 (CIF). Common resolutions are around 176x144 (QCIF).

2.1.4 Color channel representations

There are numerous ways to represent the color information in video. The most well known is probably the RGB system. RGB stands for Red, Green and Blue, so the channels each store the amount of red, green and blue contained in the pixel. Traditional CRT monitors and TVs use RGB directly to display the colors on the screen. Other representations are possible, but to understand why they are interesting, we must take a deeper look into the workings of the human eye.

2.1.5 Properties of the human eye

The human eye has two different kinds of "photosensors": rods and cones. The rods can't distinguish colors, but are sensitive to brightness. The cones are responsible for color vision, although they need a higher brightness to work than the rods. [] So if we can construct a representation of color images that is decomposed into brightness and color channels, we might get a representation that better matches that of the human eye. Of course, this has been done. There are quite a few different color spaces available, though in this thesis the YUV (YCbCr) format is used.

The Y component is called the luminance (or brightness) component, while Cb and Cr are the chrominance components of the pixel. The eyes' limited sensitivity to the color components can be exploited for compression purposes. Experiments show that it is often possible to sub-sample the chrominances without losing very much visual quality to the human eye - the color components of images usually change more smoothly than the brightness components.

The following formulas can be used to calculate YCbCr from RGB³:

    Y  = 0.299R + 0.587G + 0.114B
    Cb = B − Y
    Cr = R − Y

3. Note the weights for the different color channels in the Y channel. Can the reader explain the weightings? Hint: where in the visible spectrum of light can red, green and blue be found? It might be sensible to assume that the sensitivity as a function of frequency is continuous and zero outside the visible range...
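A minimal C++ sketch of this conversion follows. The struct and function names are illustrative and not taken from the thesis implementation; note also that these are the plain difference formulas above, without the scaling and offsets used in broadcast YCbCr variants.

struct RGB   { double r, g, b; };
struct YCbCr { double y, cb, cr; };

YCbCr rgb_to_ycbcr(const RGB& p) {
    YCbCr out;
    out.y  = 0.299 * p.r + 0.587 * p.g + 0.114 * p.b;  // luminance
    out.cb = p.b - out.y;                              // blue chrominance
    out.cr = p.r - out.y;                              // red chrominance
    return out;
}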


2.1.6 Standards for subsampling the chrominances

There are various ways to subsample the chrominances. The ones used here are presented in table form, where "YUV" marks a sample position carrying both luminance and chrominance and "Y" marks a position carrying only luminance.

Ordinary sampling (RGB): every sample position carries all three color components.

The 4:2:2 sampling format (YUV color space):

    YUV  Y  YUV  Y  YUV  Y
    YUV  Y  YUV  Y  YUV  Y
    YUV  Y  YUV  Y  YUV  Y
    YUV  Y  YUV  Y  YUV  Y

The 4:2:0 sampling format (YUV color space):

    YUV  Y  YUV  Y  YUV  Y
    Y    Y  Y    Y  Y    Y
    YUV  Y  YUV  Y  YUV  Y
    Y    Y  Y    Y  Y    Y
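As a hedged illustration of 4:2:0, here is a minimal C++ sketch that keeps every luminance sample and averages each 2x2 block of a chrominance plane into one sample. Averaging is one common choice (plain decimation is also used); the row-major plane layout and even width/height are assumptions of this sketch.

#include <vector>

std::vector<double> subsample_420(const std::vector<double>& chroma,
                                  int width, int height) {
    std::vector<double> out((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
        for (int x = 0; x < width / 2; ++x) {
            // Average the 2x2 neighborhood of full-resolution chroma samples.
            double sum = chroma[(2 * y)     * width + 2 * x]
                       + chroma[(2 * y)     * width + 2 * x + 1]
                       + chroma[(2 * y + 1) * width + 2 * x]
                       + chroma[(2 * y + 1) * width + 2 * x + 1];
            out[y * (width / 2) + x] = sum / 4.0;
        }
    return out;
}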

2.2 The structure of a video codec

A video coder can be constructed in very many different ways, depending on the application. It is clear that for very low bit rate applications one must find special models for the video data. One example where this has been true is video conferencing and mobile phone video communication. These coders typically perform very poorly when used on data they are not modelled for⁴. Video coders that are supposed to be able to handle many types of video are called general purpose video coders. Since the coder proposed in this thesis is a general purpose coder, that is what will be described in this chapter.

Figure 2.1. Components of a video codec. The video data is taken from the source and put into the encoder. Then the encoded video is stored or transmitted. After storage or transmission, the coded video is inserted into the decoder which decodes the video (recreates the original representation). If the coding is lossy, the decoded video doesn’t need to be exactly the same as the source.

4. Model-based coding is used for instance in speech coding. When trying to code non-speech sound, such as music, with model-based speech coding, it usually gives very poor results.


2.2.1 Overview of a modern general purpose coder

Very many different video coders have been constructed over the years. Here the most common components of a video coder are presented, and then each part is described in more detail.

Figure 2.2. Overview of the encoder for a hybrid video codec. T is the transform, Q is the quantizer, MC is motion compensation. Modern hybrid encoders often have many different modes and parameters to allow for better performance if the source is analyzed. Therefore this picture is just a rough overview of the workings of a hybrid coder. Inputs of parameters and choice of modes to the transform, the quantization and the motion compensation are not visible in the picture as these vary between different established standards.

2.2.2 The Transform

A transform is usually applied to individual video frames to take advantage of the spatial correlation between neighboring pixels. There are very many different transforms that have been used in video coding. Below are some of the more well known transforms used in general purpose video coding schemes.

The Discrete Cosine Transform

The Discrete Cosine Transform (DCT) is used in many coding standards. For still image coding, the DCT has been used for instance in the JPEG standard. Video coding schemes that apply the DCT include MPEG 1⁵ and MPEG 2. The basis functions of the discrete cosine transform are (as the name implies) cosine waves of different frequencies, Dirac-sampled at the integer points. The lowest frequency is the DC component. Since cosines do not have compact support⁶, it is useful to split the images into smaller parts before performing any actual transformation. This is done in the JPEG as well as the MPEG 1 and MPEG 2 standards, which use transform blocks of size 8x8 pixels. The DCT is a 1-dimensional transform and is applied on the rows and the columns separately. It is also possible to make the cosines overlap, in which case it is called a lapped transform. A transform where the cosine basis functions overlap is also called an MDCT (Modified Discrete Cosine Transform). This has been used for instance in audio coding.

5. MPEG is short for Motion Pictures Experts Group. It has defined various standards for video coding. The most well known of the general purpose coders are probably MPEG 1, MPEG 2 and MPEG 4.

Integer transforms

Integer transforms have also been used in video coding. For instance, in H.264 (MPEG 4) the following transform matrix is used:

    [ 1   1   1   1 ]
    [ 2   1  -1  -2 ]
    [ 1  -1  -1   1 ]
    [ 1  -2   2  -1 ]

with the corresponding inverse transform matrix:

    [ 1   1    1    1/2 ]
    [ 1   1/2 -1   -1   ]
    [ 1  -1/2 -1    1   ]
    [ 1  -1    1   -1/2 ]

As can be seen in these matrices, only additions, subtractions and shifts⁷ are required. This is a good property for implementation in digital hardware and on simple computer architectures which do not have any floating point arithmetic instruction set. This is true for many small CPUs used in handheld devices such as mobile phones⁸.
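To make the shift-and-add property concrete, here is a small C++ sketch of the forward 4-point pass implied by the matrix above, applied separably to a 4x4 block. This is a sketch of the general technique, not code from any reference implementation; the final scaling stage used by real H.264 encoders is omitted.

void forward4(const int in[4], int out[4]) {
    int s0 = in[0] + in[3];
    int s1 = in[1] + in[2];
    int s2 = in[1] - in[2];
    int s3 = in[0] - in[3];
    out[0] = s0 + s1;         // row [ 1  1  1  1 ]
    out[1] = (s3 << 1) + s2;  // row [ 2  1 -1 -2 ]  (x << 1 is x*2)
    out[2] = s0 - s1;         // row [ 1 -1 -1  1 ]
    out[3] = s3 - (s2 << 1);  // row [ 1 -2  2 -1 ]
}

void forward4x4(int block[4][4]) {
    int tmp[4][4];
    for (int r = 0; r < 4; ++r) forward4(block[r], tmp[r]);  // transform rows
    for (int c = 0; c < 4; ++c) {                            // then columns
        int col[4] = { tmp[0][c], tmp[1][c], tmp[2][c], tmp[3][c] };
        int res[4];
        forward4(col, res);
        for (int r = 0; r < 4; ++r) block[r][c] = res[r];
    }
}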

2.2.3 The Quantizer

Transforms are mappings from one function basis to another. There are transforms that are integer-to-integer mappings, and these are of special importance for practical applications. However, in the general case we get a transformation of the coefficients that is integer-to-real or real-to-real. Since it is impossible to store an arbitrary real number with finite precision, we have to replace the real numbers with finite-precision counterparts in order to represent the set of coefficients with a bit stream. The act of doing this is called quantization. Quantization can be made in very many different ways. In video coding it is interesting to quantize the values in a manner that takes advantage of the video properties, transform properties and perceptual properties in order to get better compression at the same bit rates. In a video codec the quantization step is the only lossy step; all other steps in the video coder are lossless.

6. They stretch out indefinitely in the spatial dimension.
7. A shift is an integer multiplication or division by a power of 2. A left shift is a multiplication while a right shift is a division.
8. Hang on! Do not despair yet! We will see later on that some wavelets enjoy these practical benefits as well!

Example: The JPEG DCT Quantization

The JPEG DCT coefficients are quantized using a quantization matrix that contains perceptual weights for the different cosine frequencies. In this way the coefficients are given priority in a sense that takes the HVS¹⁰ into consideration. Roughly speaking, this means high priority for low frequencies and lower priority for higher-frequency components. The standard allows specification of custom quantization matrices. A typical JPEG perceptual weighting matrix is the one presented in [?].

Figure 2.3. A typical quantization matrix for the DCT in the JPEG standard.

The transform coefficients are divided by the corresponding elements in the quantization matrix and then rounded to the nearest integer; in practice this is a uniform quantizer with step size 1.
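A minimal C++ sketch of this divide-and-round step (the matrix values would come from a JPEG quantization table such as the one in Figure 2.3; the function name is illustrative):

#include <cmath>

void quantize_8x8(const double coeff[8][8], const int qmatrix[8][8],
                  int out[8][8]) {
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            // Divide by the perceptual weight, then round to nearest integer.
            out[i][j] = static_cast<int>(std::lround(coeff[i][j] / qmatrix[i][j]));
}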

10. Human Visual System.


2.2.4 Exploiting the video motion

Video created by photographic cameras has the property that objects in the frames move over time. How fast they move and in what way can depend very much on the video content. Techniques have been developed to take advantage of the fact that objects move in image sequences. The most widely used method is based on block matching. The next frame to be encoded is divided into blocks of certain sizes¹¹. Then the search for suitable matches against the previous frame begins. How the block matching is done can vary between coders, but every coder needs a difference measure to decide which of the evaluated positions is best. Calculating a difference measure is done by the following steps:

1. Pick a block in the predicted image.
2. Pick a motion vector.
3. Displace the position of the block with the motion vector.
4. Perform calculations.

A popular difference measure is the SAD measure. Described in steps, step 4 of the SAD measure is performed by:

4.1. Calculate the difference between each pixel pair.
4.2. Calculate the absolute values of those differences.
4.3. Sum all the absolute values obtained.

The objective is of course to find the motion vector that minimizes the difference measure for the specified block. If a good enough match is found, we can then transmit the motion vector. The remaining difference is usually transformed with some chosen transformation and, if large enough, stored and transmitted.

The SAD measure is popular because of its simplicity. It is easy to implement on hardware architectures as well as in software, and it is very fast to execute. It is common to use a fast difference measure to sort out the best matches and store them in a list, and then to perform more careful difference measures, which require more processing power, on the vectors in that list to find the best one. The SAD thus gives a rough approximation that can be used to discard the vast majority of unsuitable motion vector candidates early, as the sketch below illustrates.
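A minimal C++ sketch of the SAD computation for one candidate motion vector (mvx, mvy); row-major 8-bit frames and in-bounds coordinates are assumptions of this sketch, and the names are illustrative:

#include <cstdlib>

int sad_block(const unsigned char* cur, const unsigned char* ref, int stride,
              int bx, int by, int bsize, int mvx, int mvy) {
    int sum = 0;
    for (int y = 0; y < bsize; ++y)
        for (int x = 0; x < bsize; ++x) {
            int a = cur[(by + y) * stride + (bx + x)];              // block pixel
            int b = ref[(by + y + mvy) * stride + (bx + x + mvx)];  // displaced pixel
            sum += std::abs(a - b);  // steps 4.1-4.3: difference, absolute value, sum
        }
    return sum;
}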

11. These sizes do not need to be constant! For instance, in the H.264 motion scheme, block sizes can be adaptively chosen between 8x16, 16x8 and 16x16.


Chapter 3

Wavelet Theory

3.1 The Discrete Wavelet Transform

3.1.1 History

The history of the Discrete Wavelet Transform stretches back to the beginning of the 20th century. In 1909 the German mathematician Alfred Haar proposed the very first wavelet transform, known today as the Haar wavelet [1]. The Haar wavelet is probably the most well known DWT and certainly the one most used for demonstrating multiresolution analysis, due to its simple basis functions. The discovery of the Haar wavelet went almost unnoticed in the engineering community until the 1970s. In the 1980s Meyer showed the connection between subband filter banks and wavelet transforms, and Stéphane Mallat introduced the fast wavelet transform (FWT). This implementation is usually depicted as a pyramid (the "pyramid algorithm"), as it is an iterative algorithm working with the same filters at each step.

The pyramid algorithm is a very important tool for computing the DWT fast. It can be shown that a DWT using Mallat's FWT takes O(n) operations, where n is the signal length, with the number of operations directly proportional to both the length of the wavelet filters and the signal length.

Very few other algorithms in signal processing and computer science have such a nice and low computational complexity.

Although the Haar wavelet gives a multiresolution analysis (MRA), its filters have very limited regularity and not very good properties for coding of natural images. One of the most astonishing results of the development of the early wavelet theory is the family of Daubechies wavelets. Each of these wavelets has the property (besides orthogonality to its own translates) of having maximum smoothness for its length while also having compact support. The property of compact support together with orthogonality makes it possible to implement these DWTs with finite impulse response (FIR) filters in the filter bank. This is a very important property for speed and practical usefulness. The compact support property also gives a very localized representation of the analyzed signal.


Symmetry is a property that is intuitively desirable in image coding, because an image is essentially the same whether or not it is mirrored. That is, it represents the same information no matter if it is viewed in its original form or watched in a mirror. The idea was to develop wavelets that have symmetry, so that the coded information need not change if the image is mirrored. This search for symmetric wavelets led to biorthogonal wavelets, where the filters (if they have compact support) all have odd length. Two popular filter pairs are the Cohen-Daubechies-Feauveau (CDF) 9/7 and 5/3 tap filters, named after their discoverers.

Later discoveries include the generalization to wavelet packet analysis, M-adic wavelets, and wavelet lifting (the "second generation wavelet transform"). By using the techniques of lifting, all traditional wavelet transforms can be constructed and implemented more efficiently than convolution using the Mallat pyramid algorithm allows.

3.1.2 Wavelet Theory

The early theory of wavelets was based on Fourier analysis. Criteria for the existence and convergence of scaling functions (father wavelets) and wavelet functions (mother wavelets) were discovered and presented.

Readers interested in the theory of wavelets should see the works of Ingrid Daubechies (Ten Lectures on Wavelets) and Gilbert Strang (Wavelets and Filter Banks). The second generation wavelet transform is described in the works of Wim Sweldens. A brief introduction to the lifting transform is "Factoring wavelet transforms into lifting steps" by Ingrid Daubechies and Wim Sweldens. More theoretically advanced papers have been presented by Sweldens, as well as practical applications.

In a sense the whole theory circles around two important equations involving the scaling function φ and the wavelet function Ψ.

The dilation equation:

    φ(t) = Σ_{n=−∞}^{∞} a_n φ(2t − n)

The wavelet equation:

    Ψ(t) = Σ_{n=−∞}^{∞} b_n φ(2t − n)

The a_n and b_n are called the scaling coefficients and wavelet coefficients. These are simply the values of the FIR filters used in the transform. As we can see, the equations above state a relation between the functions φ and Ψ and scaled and shifted φ functions. At first glance the infinite range of n might seem troubling. The trick for practical usability is to choose the coefficients in such a way that they are 0 except for a small number of coefficients close to n = 0. The choice of the coefficients a_n and b_n is crucial to the properties of the corresponding scaling and wavelet functions. The theory puts restrictions on these coefficients in order to make the functions lie in the L² function space, and makes it possible to analyze how much regularity they are able to contain. How do we find out what the functions our coefficients define look like²? This is done by applying the cascade algorithm. The algorithm is a somewhat straightforward implementation of the above equations, starting with the scaling filter coefficients at dyadic points. The images below show the wavelet and scaling functions for some more or less famous wavelet transforms. If we have finite length filters, the above equations can be rewritten:

The dilation equation:

    φ(t) = Σ_{n=N0}^{N1} a_n φ(2t − n)

The wavelet equation:

    Ψ(t) = Σ_{n=M0}^{M1} b_n φ(2t − n)

The lengths of the filters are of course N1 − N0 + 1 and M1 − M0 + 1, respectively.

Orthogonal wavelets

A DWT is orthogonal if the scaling functions are orthogonal to their translates. In this case M0 = N0 and M1 = N1. In particular:

    a_j = (−1)^j b_j

This equation is known as the alternating flip condition, and it effectively restricts orthogonal DWTs to be defined by just one filter. It is also easy to show that orthogonal DWTs always have even length filters. Biorthogonal filter pairs, on the other hand, are not bound by the alternating flip condition, always have odd length, and the two filters can also be of different lengths. This means that it is necessary to specify two filters to get a biorthogonal filter bank. It can be shown that only biorthogonal filters can have the property of symmetry³. Symmetric wavelet filters give symmetric dual basis functions, and this is interesting if the information in the analyzed signal should have the same representation regardless of the order of the data. For image coding applications this is intuitively desirable, because images are perceived to represent⁴ the same information no matter if they are watched in a mirror or not.

2. At a given scale.
3. That a filter bank is symmetric means that the filters are symmetric, i.e. that they satisfy a_j = a_{−j} and b_j = b_{−j}.
4. For applications that want to preserve the cognitive features of images, this is interesting. Cognitive features are interesting for television and video coding!

Biorthogonal wavelets

A DWT is biorthogonal if the translates of the scaling functions do not span an orthogonal basis. However, as for all wavelets, the wavelet spaces are still orthogonal to each other.

3.1.3 The Mallat fast wavelet algorithm

Stéphane Mallat found the connection between subband coding and wavelet transforms. The filter bank has the following structure:

1. First the original signal is convolved with two different analysis filters - G and H.
2. The resulting signals are downsampled by a factor of 2. This means that every second sample is thrown away.
3. The first iteration of the transform is complete!

We call this building block "FB", for filter block. The next picture shows the "pyramid" structure of these filter blocks in the algorithm. Because every second sample is discarded in the downsampling operation of step 2 above, we can improve the performance by a factor of 2 by computing the convolution with the filters centered at only the even (or odd) positions. Every filter block divides the incoming signal into high-pass and low-pass components, which both have half the length of the original incoming signal. This means in practice that we get a structure and relation between the numbers of transform coefficients in the different subbands.

(a) The filter block for forward transformation.
(b) The filter block for inverse transformation.

Figure 3.1. The filter block used in the FWT for orthogonal filters and the corresponding inverse FWT.
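A minimal C++ sketch of one analysis filter block (FB): a convolution evaluated only at the even positions, with circular handling of the signal endpoints as in Figure 3.3 further below. Haar filters are used purely as an example here; any orthogonal pair could be substituted.

#include <vector>

void filter_block(const std::vector<double>& in,
                  std::vector<double>& low, std::vector<double>& high) {
    const std::vector<double> h = { 0.70710678,  0.70710678 };  // scaling (low pass)
    const std::vector<double> g = { 0.70710678, -0.70710678 };  // wavelet (high pass)
    const int n = static_cast<int>(in.size());                  // assumed even
    low.assign(n / 2, 0.0);
    high.assign(n / 2, 0.0);
    for (int k = 0; k < n / 2; ++k)                  // evaluate only even positions
        for (int t = 0; t < static_cast<int>(h.size()); ++t) {
            int idx = (2 * k + t) % n;               // circular convolution
            low[k]  += h[t] * in[idx];
            high[k] += g[t] * in[idx];
        }
}

Iterating filter_block on its own low-pass output gives the pyramid structure described further below: after N iterations there are N high-pass bands and one low-pass band.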

3.1.4 Complexity analysis of the FWT

It should be noted that this means one convolution operation per sample at each level of the pyramid, if the filter lengths are finite and bear no necessary relation to the length of the signal as a whole. At each new level the length of the processed signal is halved, so that if M operations are required at the first level, M/2 operations are needed at the second level, M/4 at the third, and so on. No matter the number of levels, we have an upper bound on the number of operations required, namely 2·M (the well-known geometric series M + M/2 + M/4 + ... = Σ_{i=0}^{∞} M/2^i converges to 2M). M is linear in the length of the signal, since a normal convolution operation (with a FIR filter) is, giving O(n) complexity where n is the signal length. O(n) complexity is very fast! As a comparison, the classical and more well known fast Fourier transform (FFT) [3] algorithm has a complexity of O(n·log(n))⁵. The low complexity is due to the existence of basis functions (wavelets) with short length (compact support). If the basis functions had length equal to (or related to) the length of the signal, this would not be possible (as is the case with the sines and cosines of the Discrete Fourier Transform)!

3.1.5 Detailed description of the FWT

Straight to the core

The FBs of the FWT work as a modified convolution operation. Since we have a downsampling operator after the convolution, we can skip computing every second value of the convolutions. The transformation is divided into two convolutions: one with the scaling filter and one with the wavelet filter. If the original signal consists of S samples, both of the result signals from the convolutions will have S/2 samples. The picture below shows how the filter moves 2 signal samples for each sample in the result signals. Since we have 2 different filters to convolve with (one scaling filter (low pass) and one wavelet filter (high pass)), the total amount of operations equals one full-length convolution with a filter of the mean length of these two filters. For all orthogonal filter pairs the two filter lengths are equal, so the total computational load equals an ordinary convolution with just one of the filters. The pictures below depict a wavelet transform:

5. The original discrete Fourier transform has a complexity of O(n²). The speedup from using the FFT instead of the DFT in practical applications is often 100 to 1000 times. The FFT is considered by many to be one of the most important computer algorithms of the 20th century. It is perhaps the single most important discovery in DSP of all time.


Figure 3.2. Picture of a Discrete Wavelet Transform. The different background colors of the transform samples represent different subbands. Red is the low pass channel (L0) of the current level. The brown, blue and green samples belong to the high pass bands H0, H1 and H2.


The following picture describes one level of the DWT:

Figure 3.3. Depiction of one iteration of a Discrete Wavelet Transform on a 16 sample 1 dimensional signal. The green boxes represent filter positions for calculating the different transform coefficients. The coefficients given by the low pass analysis filter are stored in the red boxes of corresponding number. Similarly the high pass coefficients are stored at the blue area of the transformed image in the same order. As can be seen here, the DWT is a lapped transform if the filter length is larger than 2. The method depicted uses circular convolution to handle the signal endpoints.

The Mallat filter bank structure

In the traditional pyramid algorithm invented by Stéphane Mallat, the filter bank is constructed by iterating on the low pass part. That is:

1. Convolve the signal with the wavelet filter. Downsample by a factor of 2.
2. Convolve the signal with the scaling filter. Downsample by a factor of 2.
3. Iterate only on the result of operation 2.

Figure 3.4. The Mallat pyramid algorithm for a 4 level decomposition.

Note that this 4 level decomposition yields 5 different subbands. There are always as many high-pass bands as there are levels of decomposition, and one low pass band at the lowest level. That means N + 1 sub-bands if we have N levels of decomposition: N of them are high-pass bands of different levels and 1 is the lowest level low-pass band.

The Wavelet Packet filter bank structure

The wavelet packet filter bank structure is a different structure springing from an extension of the original wavelet theory. That theory is not presented here, but a

brief explanation (along with loads of other somewhat nasty generalizations of the wavelet theory) is found in a chapter of [6]. There is by now much more detailed literature on wavelet theory available, and very much more research has been conducted since the early days of Mallat's and Daubechies' discoveries. The algorithm:

1. Convolve the signal with the wavelet filter. Downsample by a factor of 2.
2. Convolve the signal with the scaling filter. Downsample by a factor of 2.
3. Iterate 1 and 2 on both of the respective results.

Wavelet packets are not implemented in this work, but they are an elegant extension of the theory, giving greater freedom to the transform.

3.1.6 Variable filter bank multiresolutions

Since all wavelets give perfect reconstruction, we need not restrict the filter banks to using the same filter pair at each level. Rather, we could change the filter pair at each level and still preserve the perfect reconstruction property. What would this imply? We have the possibility of choosing the filters at each level, making it possible to adapt the transform to our signal or to other practical constraints⁶ to improve speed and/or performance. If we use only filter banks with finite length filters, we get an upper bound on the required operations by approximating with the largest filter pair at every level of the transform.

3.1.7 Images as signals

Images are interpreted as sampled 2-dimensional signals. The conventional way of transforming an image is by applying the filters along the horizontal and vertical dimensions one after the other in the following fashion. For each new level, at the current sub-image:

1. Apply the analysis filter bank along the rows.
2. Apply the analysis filter bank along the columns.

6. For instance, filters with integer precision can be used for the lower frequency sub-bands to make the transform easier and faster to implement on certain architectures.


3. Make the LL sub-band from this decomposition the active image.
4. Iterate.

Perform N iterations of the above to get an N level wavelet decomposition of the image. Furthermore, the order of steps 1 and 2 is not important - the result will be the same, because the convolution operations are linear.

The image at the previous level can then be divided into 4 different sub-images depending on which sub-band the corresponding coefficients belong to. Usually the low-low pass (LL) information is depicted in the upper left corner of the picture, with the corresponding high-pass bands at the same level below, to the right and diagonally, as the picture shows.

Figure 3.5. Depiction of a 3 level Multi Resolution Analysis of an image.

Figure 3.5 depicts a 3 level separable 2D DWT performed on an image.
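A minimal C++ sketch of one level of this separable 2D DWT, again with Haar filters as a stand-in for whichever filter pair is actually used; row-major layout and even active width/height are assumptions of the sketch.

#include <algorithm>
#include <vector>

static void haar1d(std::vector<double>& buf) {
    const int n = static_cast<int>(buf.size());
    std::vector<double> tmp(n);
    const double s = 0.70710678;
    for (int k = 0; k < n / 2; ++k) {
        tmp[k]         = s * (buf[2 * k] + buf[2 * k + 1]);  // low pass half
        tmp[n / 2 + k] = s * (buf[2 * k] - buf[2 * k + 1]);  // high pass half
    }
    buf = tmp;
}

// One level on the active w x h sub-image of an image with the given stride.
// Iterating on the top-left (w/2 x h/2) LL quadrant gives the next level.
void dwt2d_level(std::vector<double>& img, int stride, int w, int h) {
    std::vector<double> line;
    for (int y = 0; y < h; ++y) {   // 1. rows
        line.assign(img.begin() + y * stride, img.begin() + y * stride + w);
        haar1d(line);
        std::copy(line.begin(), line.end(), img.begin() + y * stride);
    }
    for (int x = 0; x < w; ++x) {   // 2. columns
        line.resize(h);
        for (int y = 0; y < h; ++y) line[y] = img[y * stride + x];
        haar1d(line);
        for (int y = 0; y < h; ++y) img[y * stride + x] = line[y];
    }
}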

3.1.8 Further DWT properties of practical importance

As noted, the filter kernels extend in only one dimension at a time when transforming images. This makes it easy to perform the transform operations in parallel, since none of the calculations in different rows or columns, respectively, depend on each other.


Figure 3.6. The lifting scheme presented in block form.

There is of course an upper limit to how many processes can run efficiently in parallel. However, implementing software that allows for massive parallelism of the transformation is especially interesting on the encoder side, since there is an infinite number of wavelet filters to choose from, as well as other parameters that can be adapted to the video content for better performance.

3.1.9 The lifting transform

As wavelet theory developed, a practical technique evolved for factoring wavelet transforms and their inverses into smaller parts. One of the more prominent methods is called the lifting transform. The ideas behind lifting, as well as some interesting results related to wavelet theory, are described in detail in [7]. Of special interest is the result that any wavelet filter bank with finite length filters can be constructed by (or factored into) lifting steps. Figure 3.6 describes the process of lifting a 1D signal with k lifting steps (steps 2 through k − 1 are omitted in the picture). The P operations are called lifting steps while the U operations are called dual lifting steps. Inverse lifting is achieved by changing the direction of the signals and the signs of the filters P_i and U_i, and by flipping the order of application: dual step first, then regular step, starting at index k and stopping at 0.
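A minimal C++ sketch of a single P/U step pair, using the Haar transform written in lifting form as the example (the factorization of other filter banks follows the same pattern, only with longer filters). The inverse simply runs the steps backwards with flipped signs:

#include <vector>

void haar_lift(std::vector<double>& even, std::vector<double>& odd) {
    for (size_t i = 0; i < odd.size(); ++i) odd[i]  -= even[i];       // P: predict odds from evens
    for (size_t i = 0; i < odd.size(); ++i) even[i] += 0.5 * odd[i];  // U: update evens from new odds
}

void haar_unlift(std::vector<double>& even, std::vector<double>& odd) {
    for (size_t i = 0; i < odd.size(); ++i) even[i] -= 0.5 * odd[i];  // undo update
    for (size_t i = 0; i < odd.size(); ++i) odd[i]  += even[i];       // undo predict
}

After haar_lift, even holds the (unnormalized) averages and odd the differences; haar_unlift restores the original samples exactly, illustrating the perfect reconstruction property.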

3.2 Organizing the transform coefficients

3.2.1 A neat property of the DWT

The DWT has a fundamental property that Fourier-based transforms do not have: transform coefficients have spatial locality. Sines and cosines do not have compact support, so each isolated Fourier coefficient represents parts of the signal that are spatially everywhere. The fact that each higher level of the DWT gives details at a finer scale makes it easy to pick out those coefficients that are relevant for a certain spatial resolution. A remaining problem, however, is to quantize these coefficients efficiently. In the next section, an elegant classical method to arrange and quantize the wavelet subbands is presented.


Figure 3.7. A 2D MRA of an image. The 3 different spatial orientations are coded in different colors.

3.2.2 The great discovery of Jerome Shapiro

Wavelet and subband coding had been investigated for quite some time when, in 1993, Jerome Shapiro published a paper describing an elegant way of organizing the wavelet coefficients of images for coding purposes. The algorithm is called Embedded Zerotrees of Wavelet coefficients, or EZW for short. A hypothesis is first made: "natural" images (without noise) contain fine details almost only where coarser details exist. To put it in the language of wavelets: the wavelet coefficients in the high pass bands of the fine scales are only significant where we have coefficients of significant size at a coarser level at the same spatial position and orientation. The method used to transform the image leaves us with 3 different "spatial orientations" of the high pass bands: HL, LH and HH. Figure 3.7 depicts how these trees are arranged.

The quantization starts with a threshold T. All coefficients outside the interval (−3T/2, 3T/2) are given a symbol depending on their sign: the symbol P if positive and the symbol N if negative. The symbols P and N are called positive and negative significant. The opposite of being significant is being insignificant; all other coefficients are insignificant with respect to the current threshold. Now the central idea comes into the picture: we use two different symbols for zero, depending on the transform coefficient and all of the coefficients at the same spatial position and orientation at the higher levels. If all of these coefficients are insignificant with respect to the current threshold, we give the symbol zerotree. If at least one of them is significant, then we give the symbol isolated zero.

Since each band pass coefficient has 4 children - that is, coefficients in the band of the same orientation at the next finer level - the savings increase by a factor of 4 for each lower level at which we are able to put a zerotree symbol. These 4 children coefficients come from the 2 downsampling operations (a factor of 2 each in the horizontal and vertical directions).

The coding gain is mainly due to the way these symbols are coded. In photographic images the isolated zero symbol is found very rarely, and the "pure zero" zerotree symbol occurs very often (especially "early on", for the first few thresholds used). The symbols of the LL0 sub-band are coded separately, and then follows a stream of symbols for the LH0, HL0 and HH0 subbands. These first-level high-pass bands are usually subject to some entropy coding, due to the very frequently occurring zerotree symbol. Run length encoding or arithmetic coding have been used, for instance.

In the original paper, Shapiro proposes an adaptive arithmetic coder to code the first level of coefficients. Statistical encoding of the lowest level symbols is not implemented in this work, and thus the corresponding bit streams are longer than they need to be. On the last page of the paper [2] a simple example of an 8x8 3-level DWT transform image is quantized and the corresponding symbol stream generated. That example was used to check that the algorithm implemented here does in fact give exactly the same stream of symbols as Shapiro's original method. For the next level, the zerotree symbols of this level are used to "prune" the coefficients at higher levels, determining where we need not store any further information about significant coefficients. Because zerotrees very often occur at low levels, this greatly reduces the sizes of most higher level symbol streams. In the image below the LL0 band is colored purple, the insignificant coefficients of the HLi bands are colored red, and the insignificant coefficients of the LHi bands are colored green. A significant coefficient in the LH0 band is colored yellow. Here only the symbols for the yellow and the lowest level red and green coefficients need to be stored, because of the definition of the zerotree symbol in the algorithm.

The number of cut-off symbols increases exponentially for each level we can cut, dramatically reducing the stream length required if zerotrees are common at low levels. This is almost always the case for the first few digits, as the coefficients in the LL0 band are larger than almost all coefficients in the high-pass bands. For low bit rates, low frequency components are more important. Thanks to our tree structure these occupy almost all of the bits used at these rates, without the need of any entropy coding to code long runs of zero symbols in the high level high-pass bands.
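A minimal C++ sketch of the dominant-pass decision for a single coefficient, using the significance test as stated above. The coefficient-tree representation is hypothetical; how the four children are located in practice depends on how the MRA is laid out in memory.

#include <cmath>
#include <vector>

enum Symbol { POS, NEG, ISOLATED_ZERO, ZEROTREE };

struct Node {                     // hypothetical coefficient-tree node
    double value;
    std::vector<Node*> children;  // the 4 children at the next finer level
};

bool subtree_significant(const Node* n, double T) {
    if (std::fabs(n->value) > 1.5 * T) return true;  // outside (-3T/2, 3T/2)
    for (const Node* c : n->children)
        if (subtree_significant(c, T)) return true;
    return false;
}

Symbol classify(const Node* n, double T) {
    if (std::fabs(n->value) > 1.5 * T)
        return n->value > 0 ? POS : NEG;             // positive/negative significant
    for (const Node* c : n->children)                // any significant descendant?
        if (subtree_significant(c, T)) return ISOLATED_ZERO;
    return ZEROTREE;                                 // whole subtree insignificant
}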

3.2.3 Time, what is time?

Various wavelet-based methods have been proposed to take advantage of the temporal redundancy in video.

Experience tells us that motion compensation is a good way to increase the coding performance of most existing transform coders. Using a combination of transform coding and motion prediction for coding video is called hybrid coding, and is by far the most used approach today.

Figure 3.8. Some examples of zerotrees. The red tree has one more level of depth than the green tree, resulting in 16 more symbols being cut off in this example.

Figure 3.9. The temporal difference lifting used in temporal mode 3.

One method that does not use motion compensation instead uses a wavelet transform in the time dimension, either before or after transforming the frame data. This is in a sense the same as guessing that the video content is a still image. Of course this is naive and not very true for video in general. For slow-moving video, however, it has been reported to work well. In this work, three different modes of temporal coding have been implemented:

1. Still image coding (no temporal prediction).
2. Temporal Haar prediction.
3. Lifting difference prediction.

The first of these does not utilize any temporal prediction. It only quantizes the individual frames after the spatial DWT.

Temporal Haar prediction uses a Haar DWT in the time direction over the GOP length, as described above.

Difference lifting is a trivial type of lifting (as explained in the theory chapter). The scheme below explains this very simple form of lifting.


Comparing this figure with the general construction of lifting in the theory chapter, we see that we use only one step and that:

1. P1[z] = 1
2. U1[z] = 0

meaning that we make a simple difference prediction from the evens band to the odds band.

As we can see, P1 and U1 are indeed very trivial filters! Much more advanced constructions are certainly possible within this lifting framework (and have been successfully tried for coding purposes, for instance in [])! Further investigation of using lifting in the algorithm is thus interesting, but sadly beyond the scope of this thesis.
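A minimal C++ sketch of this difference lifting on a GOP, with P[z] = 1 and U[z] = 0: each odd frame is replaced by its difference against the preceding even frame, and the even frames are kept unchanged. A Frame is assumed here to be a flat vector of samples; the names are illustrative.

#include <vector>

using Frame = std::vector<double>;

void difference_lift(std::vector<Frame>& gop) {
    for (size_t f = 0; f + 1 < gop.size(); f += 2)
        for (size_t i = 0; i < gop[f].size(); ++i)
            gop[f + 1][i] -= gop[f][i];  // predict odd frame from even; no update
}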

Some properties of the modes used:

1. Still image coding is of course the fastest mode, and also the one with the least performance gain. It is extremely scalable, since any unwanted frame can simply be skipped, giving the sender maximum freedom for frame rate scalability. There exist a few standards for still-image coding of video. For instance, M-JPEG⁸ is in practice a standard for encoding images using the JPEG standard in a fashion that does not make any inter-prediction between frames at all. In this sense it only codes a stack of consecutive images. The advantage of using M-JPEG and related schemes is mainly speed and simplicity. However, the large performance loss from not making any temporal prediction has given it limited success and popularity in video coding applications.

2. Temporal Haar prediction seems to give the best performance of the three methods used, in an SNR & PSNR sense (at the same rates). The GOP energy is crunched harder into the temporal low-pass frame, giving its data packets higher priority over the high-pass bands when rate-controlling using the difference measure given by the 3D MRA. The high-pass bands in each GOP can be discarded in a specific order to yield a certain frame rate scalability. Discarding the highest temporal level (half the GOP size in frames) gives half the frame rate, since the highest level temporal frames describe the difference between each pair of original frames. Discarding the 3/4 highest level frames yields a frame rate of 1/4 of the original frame rate, and so on. The temporal artifacts of this method are visible as a form of temporal blur, which of course becomes more visually severe as higher amounts of visible motion are present in the video content. This is natural, since no motion compensation scheme is applied before the DWT in this thesis.⁹

3. The difference lifting method is a form of temporal prediction that at each level stores the first of two consecutive frames and the difference between them.

8. Motion JPEG.
9. Although proposals for investigating such methods are made in the last chapter of this thesis.


It can be shown¹⁰ that this method does not correspond to a wavelet, nor to an MRA in the strict sense. However, storing the exact frames instead of low-pass filterings removes the temporal artifacts seen in the Haar prediction above. The performance seems to be a bit lower than with the Haar method, though, but pure original-image/difference-image coding has more similarities with motion compensation schemes, and will perhaps behave better than the other methods if a motion compensation step is introduced in the codec. In particular, we can easily obtain a given frame rate without the temporal artifacts of the Haar method.

10. Hint: look at the signal ...001100.... It gives us ...010... and ...000... when using the proposed lifting scheme. What FIR filter would be able to recreate the second 1 in the original sequence by using only coefficients in the second channel of the filter bank? Wavelet filters must be able to do this!


Figure 3.10. Demonstration of the Haar temporal transform. One pixel position of a GOP consisting of 8 frames is decomposed the maximum number of times (three) with the pyramid algorithm. The resulting sub-band structure is shown with the corresponding coefficient indices. V0 is the low pass (scaling) band and the Wi are the high pass (wavelet) bands.


Figure 3.11. Example of irrelevant data packets in a GOP when the frame rate is full, half, one quarter and one eighth of the full frame rate. This illustration only applies to modes 2 and 3 described above. The red packets are used and the blue ones are ignored. The horizontal scale is frame number and the vertical scale is data packets belonging to that frame.


Chapter 4

The Bit Stream

4.1 The problem of scalability

The solution to the problem of creating a bit stream scalable in all three aspects of spatial resolution, frame rate and bit rate is presented here, divided into subsections, and the next section gives a more detailed view of the different parts of the practical implementation used to achieve this.

4.1.1 Separable transforms

The DWT is, as stated before, a 1-dimensional transform. Transformation of a 2D image is performed by alternately transforming the rows and columns of that image. A 3-dimensional DWT can be constructed by transforming either first in the temporal direction (T+2D) or first in the spatial dimensions (2D+T). Because of the dyadic nature of the transform, the details at each level of the transform constitute half the total signal length at that level. Throwing away the detail coefficients in the temporal dimension should in some sense mean that we retain the coarse (most important, low motion) information in the time dimension. The main point is to produce coefficients that represent parts of the video that belong to a certain scale in each of these three aspects. The property of separability makes sure that the spatial scalability properties are preserved if the temporal transformation (which yields temporal scalability in one sense) is performed.

The remaining problem is then to order and quantize these coefficients in a way that is reasonably effective and also gives a bit stream that can be separated into data packets, each belonging to a certain level of resolution, frame rate and bit rate. Three modes of temporal prediction have been implemented (see the theory chapter).
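To make the separability concrete, here is a minimal sketch of one 2D DWT level built entirely from a 1D routine, using the Haar pair for brevity (the actual codec supports several filter banks); all names are illustrative.

    #include <vector>

    // One unnormalized Haar analysis step on a strided 1D signal of
    // length n (n even): low-pass half first, then high-pass half.
    static void haar1D(std::vector<double>& s, int n, int stride, int offset) {
        std::vector<double> tmp(n);
        for (int i = 0; i < n / 2; ++i) {
            double a = s[offset + (2 * i) * stride];
            double b = s[offset + (2 * i + 1) * stride];
            tmp[i]         = 0.5 * (a + b);
            tmp[n / 2 + i] = a - b;
        }
        for (int i = 0; i < n; ++i) s[offset + i * stride] = tmp[i];
    }

    // One 2D DWT level on a w-by-h image stored row major:
    // transform every row, then every column.
    static void dwt2D(std::vector<double>& img, int w, int h) {
        for (int y = 0; y < h; ++y) haar1D(img, w, 1, y * w); // rows
        for (int x = 0; x < w; ++x) haar1D(img, h, w, x);     // columns
    }

A T+2D scheme would first run haar1D along the temporal axis for every pixel position of the GOP and then apply dwt2D to each resulting frame; 2D+T reverses the order. Further spatial levels are obtained by repeating dwt2D on the low-pass quadrant.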

4.1.2 The quantization

After the spatial DWT and temporal prediction, an embedded zerotree quantizer is applied to the different frames. In the mode of no temporal prediction, these frames are transformed still images; in the other modes they are transform frames, belonging to different transform bands in the time dimension. The zero-tree symbols for each frame are ordered in one stream for each spatial level and threshold digit, giving a certain (quality & resolution)-layer (called Q&R from now on) for the frame. These layers have a certain hierarchy of importance, based on the properties of the zero-tree quantizer. Since each Q&R-layer depends on both the previous quality layers for the same resolution and the previous resolution layers for the same quality, we have a type of dependency that can easily be checked; this also limits the number of ways we can combine Q&R-layers to get a video stream where all present layers are useful. This is depicted in the images below. Present layers are colored red and non-present layers are colored white or purple. The white layers that are not present have no other packets depending on them, while the purple packets limit the usefulness of the stream so that all present layers with higher quality OR resolution than the purple ones are useless for decoding purposes. Since quantization is done separately for each frame, no restrictions are imposed between Q&R-layers that belong to different frames. The implementation of the quantizer can also perform reconstruction.
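The dependency rule can be checked mechanically. By transitivity, a Q&R-layer (q, r) is useful only if the whole rectangle of layers below it in quality and resolution is present; a small sketch of such a check, with a hypothetical flag matrix:

    #include <vector>

    // present[q][r] == true if the Q&R-layer with quality index q and
    // resolution index r was received for one frame. A layer is decodable
    // only if every layer at lower or equal quality and resolution is
    // present (the rectangle below it, by transitivity of the rule).
    static bool isDecodable(const std::vector<std::vector<bool>>& present,
                            int q, int r) {
        for (int qq = 0; qq <= q; ++qq)
            for (int rr = 0; rr <= r; ++rr)
                if (!present[qq][rr]) return false;
        return true;
    }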

4.1.3 Composing the stream

A data packet is the part of the stream that corresponds to a specific Q&R-level. Without statistical encoding, this means the stream of zero-tree symbols; if statistical encoding is used, a data packet is the zero-tree stream after entropy coding. A data packet is the smallest part ("atom") of the stream that can be sent and decoded. The stream typically consists of many data packets, giving a large number of possible packet combinations to send. Solving the problem of finding the best combination of packets for a specific need is the objective of the rate controller. (A hypothetical sketch of the header records is given after Figure 4.2 below.)



Figure 4.1. Description of the stream structure. MH: Movie Header, GH: GOP Header, FH: Frame Header, DH: Data Header, DP: Data Packet.

Figure 4.2. Sample of a part of a movie bit stream. A movie header indicates the start of the video sequence, followed by a GOP header describing the properties of the first GOP. The frame headers between two GOP headers describe the frames of that GOP. After each frame header, a sequence of (DH, DP) pairs is sent. The specific information required to decode a DP is stored in the corresponding DH.
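The stream layout of Figures 4.1 and 4.2 could be mirrored by header records such as the following. The fields shown are illustrative guesses for the purpose of exposition, not the codec's actual byte layout:

    #include <cstdint>
    #include <vector>

    // Hypothetical header records mirroring Figure 4.1; the real on-disk
    // layout of the thesis codec is not reproduced here.
    struct MovieHeader { uint16_t width, height; uint8_t filterBank; };
    struct GopHeader   { uint8_t gopSize; uint8_t temporalMode; };
    struct FrameHeader { uint8_t frameIndex; uint8_t numPackets; };
    struct DataHeader  {                  // one per data packet
        uint8_t  quality;                 // Q-index of the layer
        uint8_t  resolution;              // R-index of the layer
        uint32_t packetBytes;             // size of the following DP
    };
    struct DataPacket  { std::vector<uint8_t> zeroTreeSymbols; };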

4.1.4 Rate Controller

The Rate Controller can be applied wherever we have a set of data packets. It needs a difference measure and a size measure to decide which packets are to be sent under certain restrictions. These restrictions are of the following types:

1. Maximum target bit rate.
2. Maximum resolution.
3. Maximum frame rate.

Numbers 2 and 3 are the criteria checked before a data packet is allowed to participate in the rate control. This is because there may of course be data packets that are very efficient in a rate/distortion sense but are of no interest to the device, if it does not have a big enough screen for the resolution of the video, or a display or computational power fast enough for the frame rate associated with the data packet. It is of course possible to do the rate control with different weights on different frames, if information about the importance of motion is available. For instance, in a fixed scene with a beautiful background (a nature video, say) motion might be of limited importance. On the other hand, in a sports or action movie clip, information in the high temporal frequency frames is of higher importance than the non-moving background. The problem of rate control in general, and specifically of adapting the rate control to the content of the video, is a very interesting problem, but far beyond the scope of this thesis.
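One straightforward way to realize such a rate controller, assuming each packet carries the difference measure and size measure mentioned above, is a greedy selection by difference per bit. The sketch below makes that assumption and, for brevity, omits the Q&R dependency checking discussed in the previous section; all names are hypothetical:

    #include <algorithm>
    #include <vector>

    struct Packet {
        double difference;   // distortion reduction if the packet is decoded
        int    sizeBits;     // size measure
        int    resolution;   // R-level of the packet
        double frameRate;    // frame rate the packet belongs to
        bool   selected = false;
    };

    // Greedy rate control: admit only packets within the resolution and
    // frame-rate limits, then pick them in order of difference per bit
    // until the bit budget is exhausted.
    static void rateControl(std::vector<Packet>& packets, int maxBits,
                            int maxResolution, double maxFrameRate) {
        std::vector<Packet*> candidates;
        for (auto& p : packets)
            if (p.resolution <= maxResolution && p.frameRate <= maxFrameRate)
                candidates.push_back(&p);

        std::sort(candidates.begin(), candidates.end(),
                  [](const Packet* a, const Packet* b) {
                      return a->difference / a->sizeBits >
                             b->difference / b->sizeBits;
                  });

        int used = 0;
        for (Packet* p : candidates)
            if (used + p->sizeBits <= maxBits) {
                p->selected = true;
                used += p->sizeBits;
            }
    }

A real controller would additionally have to respect the Q&R-layer dependencies, for example by only admitting a packet once all layers it depends on have been selected.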


4.1.5 File Writer

The objective of the file writer is to label each data packet with a header describing what layers it is part of and what information is required in order to decode the packet. It then writes the composed video stream to a file. Scalability is achieved!

4.1.6 Wavelet Coder

At the highest abstraction layer, the wavelet coder uses the above modules to encode the video. The following scheme shows the order of operations of the Wavelet Coder (a code outline follows the list):

First of all: write the Movie Header. Then, for each GOP to be encoded:

1. Read GOP from raw video data (.yuv or .cif).

2. Perform the DWT (three methods supported: pure 2D, 2D+T and T+2D, where T can be either Haar decomposition or temporal lifting).

3. Zero-Tree quantization to create symbols for the frames.

4. Rate Control is applied on the Zero-Trees generated to meet the requirements for storage of the stream.

5. Write bit streams from the quantization symbols. (Note that an entropy coder should be applied, especially on the low-pass bands between 3 and 4, for increased performance!)

6. The File Writer writes the GOP header at the highest level, followed by a frame header for each newly encoded frame. After each frame header, the selected data headers are stored, followed by the bit streams from the bit writer.

After the last GOP has been encoded: write the Movie Ender (End Header).
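In code form the scheme might be outlined as below. This is a sketch only: every type and function name is a placeholder standing in for the corresponding module described above, declared here but left undefined; none of the names are taken from the actual implementation.

    #include <vector>

    struct Settings { int numGops, gopSize, temporalMode; };
    struct Gop { std::vector<std::vector<double>> frames; };

    Gop  readGop(int gopSize);                              // raw reader
    void dwt3D(Gop& gop, int temporalMode);                 // 2D / 2D+T / T+2D
    std::vector<int> zeroTreeQuantize(const Gop& gop);      // zero-tree symbols
    std::vector<int> rateControl(const std::vector<int>& s,
                                 const Settings& cfg);      // select layers
    void writeMovieHeader(const Settings& cfg);
    void writeGopToFile(const Gop& gop, const std::vector<int>& selected);
    void writeMovieEnder();

    void encodeMovie(const Settings& cfg) {
        writeMovieHeader(cfg);                              // movie header first
        for (int g = 0; g < cfg.numGops; ++g) {
            Gop gop = readGop(cfg.gopSize);                 // 1. read raw GOP
            dwt3D(gop, cfg.temporalMode);                   // 2. DWT
            auto symbols  = zeroTreeQuantize(gop);          // 3. zero-tree symbols
            auto selected = rateControl(symbols, cfg);      // 4. rate control
            writeGopToFile(gop, selected);                  // 5-6. bit streams, headers
        }
        writeMovieEnder();                                  // end header
    }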

4.1.7 Wavelet Decoder

The wavelet decoder is constructed for two different purposes, because of the advantages scalability brings. These two purposes are:

1. Decode a stream to a raw video file (.yuv or .cif)

2. Rewrite a stream to a new stream by stripping it of unwanted packets.

The nature of scalability allows for a larger amount of flexibility at both the encoder and the decoder than non-scalable coding does. The decoder must know its specifications and needs in order to decode the stream. When parsing the stream it picks out the interesting packets and decodes them.

The rewriter strips the stream of overly detailed information and rewrites the interesting parts (given some specifications such as limited resolution, frame rate and bit rate). This part is unique to scalable coders and allows the "decoder" to act as a proxy server that redistributes substreams to servers that send to receivers with lower demands. In this sense it would be more correct to split the decoder into two different parts, called sender and decoder. For practical reasons these two parts are both implemented in the decoder class, where one input to the decoder is whether to rewrite, decode (or both!) the selected stream. The operation of changing the encoding is also called transcoding.

Figure 4.3. Illustration of the encoder with the different temporal modes shown. TP is the temporal prediction flag and TF is the temporal first flag.

The process of decoding with constraints can be listed in the following fashion:

Sender side:

0. Receive needs and specifications from the receiver.
1. Parse the stream and read headers.

2. Perform rate control and select packets to send. (Optionally, rewrite the stream and store it to disk.)

Receiver side:

3. Read all data packets received, using a File Reader.
4. Perform inverse zero-tree quantization.

5. Inverse DWT.

6. Display frames (Write to raw .yuv/.cif file using a Raw Writer).

The receiver side is of course ignored if only rewriting is selected. This scheme can be used in a unicast mode, where each receiver can tell the sender what needs it has. Simpler modes can of course be constructed where the sender is a "dumb agent", merely retransmitting the packets received from a server that performs the stripping of the stream. This can be useful if a large number of receivers have similar or identical requirements in terms of bandwidth, resolution and frame rate.
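A minimal sketch of the sender-side stripping (transcoding by omission), assuming the packets have already been parsed into records carrying their Q&R and frame-rate attributes; all names and fields are hypothetical:

    #include <cstdint>
    #include <vector>

    struct PacketInfo {              // parsed from a Data Header
        int quality, resolution;
        double frameRate;
        std::vector<uint8_t> bytes;  // header + payload, copied verbatim
    };

    struct Needs { int maxResolution; double maxFrameRate; };

    // Sender-side stripping: keep only the packets the receiver can use
    // and concatenate them into a new, thinner stream.
    static std::vector<uint8_t> rewrite(const std::vector<PacketInfo>& in,
                                        const Needs& needs) {
        std::vector<uint8_t> out;
        for (const auto& p : in)
            if (p.resolution <= needs.maxResolution &&
                p.frameRate  <= needs.maxFrameRate)
                out.insert(out.end(), p.bytes.begin(), p.bytes.end());
        return out;
    }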


Figure 4.4. Illustration of the decoder implementation.

4.1.8 File Reader

The file reader reads the stream. Two different main functions are implemented:

1. Stream parser.

2. Stream reader.

The parser function is used to analyze the stream (look for errors or read out headers). The reader function is used to decode the stream; it gets information about which packets to decode from the decoder.

4.1.9 Raw Writer

After a GOP of the stream has been read, we are able to recompose it with the inverse quantizer and the inverse transformations. We then need something that writes the reconstructed video data to file. That is the job of the Raw Writer. It is constructed as a class that opens a file in the constructor and writes a new GOP of data when fed one from the "outside world". This construction is vital to allow storage of one GOP at a time! In the sender-receiver case, the frames of the decoded GOP are rendered instead of written to disk; thus in practice the raw writer is replaced with a frame renderer at the receiver side after decoding.
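A minimal sketch of such a class, assuming 8-bit 4:2:0 frames stored as byte vectors; the names are illustrative, not the thesis's actual class:

    #include <cstdio>
    #include <vector>

    // Sketch of the Raw Writer: the file is opened in the constructor and
    // each call to writeGop appends one GOP of reconstructed 4:2:0 frames,
    // so only one GOP needs to be held in memory at a time.
    class RawWriter {
    public:
        explicit RawWriter(const char* path) : f_(std::fopen(path, "wb")) {}
        ~RawWriter() { if (f_) std::fclose(f_); }

        void writeGop(const std::vector<std::vector<unsigned char>>& frames) {
            if (!f_) return;                     // file failed to open
            for (const auto& frame : frames)     // one Y'CbCr 4:2:0 frame
                std::fwrite(frame.data(), 1, frame.size(), f_);
        }
    private:
        std::FILE* f_;
    };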


Chapter 5

The codec demo application

5.1 Overview

The wavelet codec implementation has been made in Visual C++ 2005 as a forms application with a graphical user interface (GUI) and runs on a PC with Windows Vista. The application is split into different parts, which are described below:

1. Encoder.

2. Decoder.

3. Stream information.

5.2 Encoder

The encoder part consists of some options related to the encoding of a raw video file. The possible options for encoding are:

1. Which filter bank to use.
2. GOP size controller.

3. Number of GOPs controller.
4. Encode video button.

The supported input files are raw video sequences in the YUV (YCrCb) color space with 4:2:0 subsampling of the chrominances. The spatial resolution of the source must be specified, since this information is not present in the raw video format. It is possible to choose the filter bank to use for the clip, as well as the GOP size and the total number of GOPs. It is also possible to choose a maximum bit rate for the total stream; achieving it is the objective of the encoder-side rate control. Although the construction allows for variable GOP size over time as well as variable bit rate over time, this has not been implemented in the test application. It would be necessary to create GOP-control functions that in some sense choose the best GOP size for the current position in the video content, for instance to avoid scene changes in the middle of a GOP.
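For 4:2:0 material each chrominance plane has half the width and half the height of the luminance plane, so one raw frame occupies w*h + 2*(w/2)*(h/2) = w*h*3/2 bytes. A minimal sketch of reading one GOP from such a file, with illustrative names:

    #include <cstdio>
    #include <vector>

    // Read gopSize raw 4:2:0 frames of resolution w x h from an open file.
    // Frame size: w*h luma bytes + 2 * (w/2)*(h/2) chroma bytes = w*h*3/2.
    static std::vector<std::vector<unsigned char>>
    readGop(std::FILE* f, int w, int h, int gopSize) {
        const size_t frameBytes = (size_t)w * h * 3 / 2;
        std::vector<std::vector<unsigned char>> gop;
        for (int i = 0; i < gopSize; ++i) {
            std::vector<unsigned char> frame(frameBytes);
            if (std::fread(frame.data(), 1, frameBytes, f) != frameBytes)
                break;                       // end of file: partial GOP
            gop.push_back(std::move(frame));
        }
        return gop;
    }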
