Parallel JPEG Processing with a Hardware Accelerated DSP Processor


Parallel JPEG Processing with a

Hardware Accelerated DSP Processor

Master's thesis in Computer Engineering, performed at Linköping Institute of Technology (Tekniska Högskolan i Linköping)

by

Mikael Andersson and Per Karlström. Reg. no: LiTH-ISY-EX-3548-2004


Supervisor: Dake Liu. Examiner: Dake Liu. Linköping, 19th October 2004.


Avdelning, Institution (Division, Department): Institutionen för systemteknik, 581 83 Linköping
Datum (Date): 2004-05-03
Språk (Language): English
Rapporttyp (Report category): Examensarbete (master's thesis)
ISRN: LITH-ISY-EX-3548-2004
URL för elektronisk version: http://www.ep.liu.se/exjobb/isy/2004/3548/
Titel (Title): Parallell JPEG behandling med en hårdvaruaccelererad DSP-processor / Parallel JPEG Processing with a Hardware Accelerated DSP Processor
Författare (Author): Mikael Andersson, Per Karlström
Sammanfattning (Abstract): see the Abstract below.
Nyckelord (Keywords): JPEG, JFIF, 2-D DCT, Huffman, Accelerator, HW/SW partitioning

Abstract

This thesis describes the design of fast JPEG processing accelerators for a DSP processor.

Certain computation tasks are moved from the DSP processor to hardware accelerators. The accelerators are slave co-processing machines and are controlled via a new instruction set. The clock cycle and power consumption are reduced by utilizing the custom built hardware. The hardware can perform the tasks in fewer clock cycles, and several tasks can run in parallel. This reduces the total number of clock cycles needed.

First a decoder and an encoder were implemented in DSP assembler. The cycle consumption of the parts was measured, and from this the hardware/software partitioning was done. Behavioral models of the accelerators were then written in C++ and the assembly code was modified to work with the new hardware. Finally, the accelerators were implemented in Verilog.

The accelerator instruction set was extended following a custom design flow.

Keywords: JPEG, JFIF, 2-D DCT, Huffman, Accelerator, HW/SW partitioning.


Acknowledgment

We would like to thank our supervisor and examiner, Professor Dake Liu, for guidance and support during the project.

We would also like to thank our opponents, Claes Hedlund and Mats Karlsson, for many useful comments on our thesis.


Notation

Symbols

The following graphical symbols are used in flow graphs and program flow charts:

- Random Access Memory.
- Sequential Access Memory.
- End of function (return).
- Two way choice (Yes/No).
- Subroutine call.
- A function/subroutine start.
- Action taken in a program flow graph.
- Multiplier; if there is a value instead of the star, the signal gets multiplied by that fixed value.
- A node; signals pass through, or get duplicated if it has more than one output.
- Addition of two signals.
- Negation of a signal.

Operators and functions

DCT2()         Two dimensional Discrete Cosine Transform
DCT2^{-1}()    Two dimensional Inverse Discrete Cosine Transform
DCT()          One dimensional Discrete Cosine Transform
DCT^{-1}()     One dimensional Inverse Discrete Cosine Transform
| · |          Euclidean length
max(X, Y, Z)   The largest of X, Y and Z


Abbreviations and Explanations

RGB         Red, Green and Blue, a color model with these components.
YCbCr       Y = Luminance, Cb = Chrominance Blue and Cr = Chrominance Red, a color model with these components.
Channel     A color component of the image, e.g. in an RGB-format image R is one channel.
Component   Synonym for Channel, used interchangeably in this thesis.
DU          Data Unit, an 8 × 8 pixel block of data from one channel in the picture.
MCU         Minimum Coded Unit, consists of a collection of DUs to satisfy the relative sampling frequencies.
cc          Clock cycles in the DSP processor and accelerators.
RLE         Run Length Encoding.
RLD         Run Length Decoding.
DCT         Discrete Cosine Transform.
IDCT        Inverse Discrete Cosine Transform.
PE          Processing Element, a block that performs arithmetic operations.
1-D         One Dimensional.
2-D         Two Dimensional.
PC          Program Counter.
FSM         Finite State Machine.
Matlab      The MATLAB program from MathWorks.
QVGA        Quarter Video Graphics Array, a 320 × 240 pixel array.
pixel       Atomic element of a computer image.
B           Byte.
JPEG        Joint Photographic Experts Group, ISO/IEC group of experts.
JFIF        JPEG File Interchange Format.
IDE         Integrated Design Environment.
DSP         Digital Signal Processing.
SW          SoftWare.
HW          HardWare.
W_image     Width of image.
S_p         Sample period.
T_{C,d}     Sample period; when subscripted, C is the channel and d is the direction, x for horizontal and y for vertical.
f_{C,d}     Sample frequency; C is either the channel, max or min. If C is max or min it denotes the maximum or minimum sampling frequency of all channels. d is the direction, x for horizontal and y for vertical.
Z_{YX}      The DU in the Y:th DU row and the X:th DU column of the Z component of an MCU.
MSB         Most Significant Bit, the bit with the highest numerical weight in a bit vector.
LSB         Least Significant Bit, the bit with the lowest numerical weight in a bit vector.
S[x:y]      Bit slice from x to y in bit vector S; the LSB is assumed to have index zero. The sliced vector goes from bit x as MSB to bit y as LSB. If the signal name is clear from the context, S can be omitted.
P_s         Bit precision, the same as the number of bits.
nn.ffff     Bit vector format displaying the number of integer bits (n) and fraction bits (f).
ISO         International Organization for Standardization.
IEC         International Electrotechnical Commission.
RF          Register File.

Notations

Number Formats

Values are always written as v_b, where v is a number and b the base; when b = 10, b is omitted. b is always given in base ten. The letters A, B, C, D, E and F (or their respective lower-case forms) are used for what in base ten would be 10, 11, 12, 13, 14 and 15 respectively.


Contents

Abstract

1. Introduction
   1.1. Background
   1.2. Purpose and Goal
   1.3. Disposition
   1.4. Reading Instructions
   1.5. Who Should Read This Thesis?

2. Basic Theory
   2.1. Flow Graphs and Matrix Multiplications
   2.2. Binary Number Representation
      2.2.1. Two's Complement
      2.2.2. Fractional Values
      2.2.3. Hexadecimal Representation
   2.3. The Kronecker Product

3. JPEG Theory
   3.1. Overview
   3.2. RGB and YCbCr Color Models
   3.3. Up and Down Sampling
   3.4. Data Units and Minimum Coded Units
   3.5. The Discrete Cosine Transform
   3.6. Quantization
   3.7. Zigzag Ordering
   3.8. Run Length Encoding
      3.8.1. Magnitude Encoding
   3.9. Huffman Encoding and Decoding
   3.10. JPEG Markers
   3.11. Marker Types
      3.11.1. SOI
      3.11.2. APPn
      3.11.3. COM
      3.11.4. DQT
      3.11.5. SOFn
      3.11.6. DHT
      3.11.7. SOS
      3.11.8. EOI
      3.11.9. A Typical JPEG File
      3.11.10. Summary

4. Software Implementation
   4.1. Decoder
      4.1.1. Design Overview
      4.1.2. Design Decisions
      4.1.3. Program Flow
      4.1.4. Performance
   4.2. Encoder
      4.2.1. Design Overview
      4.2.2. Design Decisions
      4.2.3. Program Flow
      4.2.4. Performance

5. Implementation with Accelerators
   5.1. Hardware/Software Partitioning
   5.2. Accelerators
   5.3. Selected Hardware/Software Partitioning
   5.4. Decoder
      5.4.1. Design Overview
      5.4.2. Design Decisions
      5.4.3. Performance
   5.5. Encoder
      5.5.1. Design Overview
      5.5.2. Design Decisions
      5.5.3. Performance

6. DU Processor
   6.1. Functionality Overview
   6.2. Design Decisions
   6.3. Interface
   6.4. Implementation
      6.4.1. ctrl
      6.4.2. statRegs
      6.4.3. mcuPingPongMem
      6.4.4. MCUaddrCalc
      6.4.5. colormac
      6.4.6. upsampleAddr
      6.4.7. imgAddrCalc
   6.5. Performance

7. DCT and IDCT Processor
   7.1. From Formula To Hardware Architecture
   7.2. Functionality Overview
   7.3. Design Decisions
      7.3.1. Bit Width
   7.4. DCT Calculation Steps
   7.5. Interface
      7.5.1. Micro Program Memory
      7.5.2. Control Block
      7.5.3. Processing Element 0
      7.5.4. Processing Element 1
      7.5.5. Register File
   7.6. Performance

8. Stream Reader and Writer

9. Improvements
   9.1. Software Improvements
      9.1.1. Support More JFIF Types
      9.1.2. Better Algorithms
   9.2. HW/SW Partitioning and Overall Architecture
      9.2.1. Framed JFIFs
      9.2.2. Improved Instructions
   9.3. Hardware Improvements
      9.3.1. DCT/IDCT Processor
      9.3.2. DU-Processor

10. Conclusions
   10.1. Stream Reader and Writer
   10.2. Performance
   10.3. JFIF Accelerability
   10.4. Proposed Future Work
   10.5. Final Conclusions

A. Tables
   A.1. Quantization tables
      A.1.1. Normal
      A.1.2. Fine
      A.1.3. Superfine
   A.2. Huffman tables
      A.2.1. Typical Huffman tables for the DC coefficient differences
      A.2.2. Typical Huffman tables for the AC coefficients

List of Tables

3.1. Different sampling frequencies and their corresponding MCU size and DU ordering.
3.2. RLE encoded values.
3.3. DC Difference Magnitude Codes and Ranges.
3.4. Letter frequencies and Huffman codes.
3.5. Code lengths and code counts.
4.1. Cycle count of functions in figure 4.2.
4.2. Cycle count of functions in figure 4.3.
4.3. Cycle count of functions in figure 4.4.
4.4. Decoder performances for the Lena image (figure 4.5), quantization tables are defined in section A.
4.5. Decoder performances for the Hard image (figure 4.6), quantization tables are defined in section A.
4.6. Cycle count of blocks in figure 4.8.
4.7. Subjective quality scale.
4.8. Encoder performance for the Lena image (figure 4.5), quantization tables are defined in section A, the quality scale is defined in table 4.7.
4.9. Encoder performance for the Hard image (figure 4.6), quantization tables are defined in section A, the quality scale is defined in table 4.7.
5.1. Decoder performances for the Lena image (figure 4.5), quantization tables are defined in section A.
5.2. Decoder performances for the Hard image (figure 4.6), quantization tables are defined in section A.
5.3. Encoder performances for the Lena image (figure 4.5), quantization tables are defined in section A, the quality scale is defined in table 4.7.
5.4. Encoder performances for the Hard image (figure 4.6), quantization tables are defined in section A, the quality scale is defined in table 4.7.
6.1. DU processor register.
6.2. Performance of the DU Processor.
6.3. Zigzag offset sequence.
7.1. Bit precision measurements in Matlab for the DCT.
7.2. Row PE scheduling for the DCT calculation.
7.3. Column PE scheduling for the DCT calculation.
7.4. Row and column PE scheduling for the IDCT calculation.
7.5. Format of the micro memory.
10.1. Cycle consumption of the accelerated and not accelerated JPEG decoder.
10.2. Cycle consumption of the accelerated and not accelerated JPEG encoder.
A.1. Huffman table for luminance DC coefficient differences.
A.2. Huffman table for chrominance DC coefficient differences.
A.3. Huffman table for luminance AC coefficients.

List of Figures

2.1. One stage flow graph.
2.2. Three stage flow graph.
2.3. Flow graph transposition.
2.4. Reversed flow graph of figure 2.2.
3.1. Encoding flow.
3.2. Decoding flow.
3.3. A sample image decomposed into, from top to bottom, Y, Cb and Cr.
3.4. A sample image decomposed into, from top to bottom, R, G and B.
3.5. A down sampled image and its Y, Cb and Cr components.
3.6. MCU with relative sample frequencies of {2 × 2, 1 × 1, 1 × 1}.
3.7. Separated calculation of an 8 × 8 matrix.
3.8. Zigzag pattern.
3.9. Typical JFIF file format.
3.10. Decoder algorithmic flow.
3.11. Encoder algorithmic flow.
4.1. Decoder architecture overview.
4.2. Flow graph of the top loop of the JPEG decoder.
4.3. Flow graph for SOS in figure 4.2.
4.4. Flow graph for getNewMCU in figure 4.3.
4.5. Lena image in QVGA format, used for performance measurements.
4.6. Hard image in QVGA format, used for performance measurements.
4.7. Encoder architecture overview.
4.8. Flow graph of the top loop of the JPEG encoder.
5.1. Overall HW/SW architecture.
5.2. Decoder architecture overview.
5.3. SW/HW pipeline for the decoder for an MCU with four DUs.
5.4. Accelerated encoder architecture overview.
5.5. Schedule for the accelerated encoder pipeline.
6.1. DU processor top level architecture overview for decoder mode.
6.2. DU processor top level architecture overview for encoder mode.
6.3. DU position and associated addresses.
6.4. Positions of DU rows in MCU buffer.
6.5. Image address calculator timing.
6.6. Zigzag FSM.
7.1. The DCT/IDCT processor architecture.
7.2. DCT row flow graph.
7.3. Flow graph for column calculation.
7.4. PE 0 all; note that not all operations can be combined with each other.
7.5. PE 0 mode 0.
7.6. PE 0 mode 1.
7.7. PE 0 mode 2.
7.8. PE 0 mode 3.
7.9. PE 0 move.
7.10. PE 1 mode 0.
7.11. PE 1 mode 1.
7.12. PE 1 mode 7.

1. Introduction

1.1. Background

A digital image is represented as a 2-D matrix of pixels with different color values. To represent colors visible to the human eye, a 3-D color space can be used. Using eight bits for each dimension for every pixel soon renders huge amounts of data; for example, a QVGA color image results in 320 · 240 · 3 = 230400 B = 225 kB of data. In modern communication images are often transmitted over band-limited media, so reducing the size of what is transmitted is preferable and sometimes a necessity to get acceptable performance. Luckily, images usually contain a substantial amount of redundant data, which makes it possible to reduce the amount of data needed to store and send them.
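The arithmetic above can be verified with a one-line helper (a sketch; the function name is our own):

```cpp
#include <cassert>

// Raw size in bytes of a w x h image with 'channels' color components,
// one byte (eight bits) per component, as in the QVGA example above.
constexpr long rawImageBytes(long w, long h, long channels) {
    return w * h * channels;
}
```

For QVGA, rawImageBytes(320, 240, 3) gives 230400 bytes, i.e. 230400 / 1024 = 225 kB.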

Two different compression schemes exist: lossless and lossy. The lossless scheme only takes advantage of the redundancy in the picture data, whereas the lossy scheme also takes advantage of the visual capabilities of the human eye. The price paid for reducing image data is calculation cost. When viewing a picture, a raw-format image needs to be restored; in order to do this the data must be processed (decompressed), and when saving an image, the image data once again needs to be processed (compressed). Thus storage space is traded for computation time. One of the most widespread standards for lossy compression today is the JFIF standard [6], a standard derived from the JPEG ISO/IEC 10918-1 standard [4]. When someone refers to a "JPEG image", they usually mean a JFIF image. The JPEG standard [4] by itself just describes a number of different methods for image compression; it has no formally specified image format.

1.2. Purpose and Goal

The purpose of this master's thesis is to construct a JFIF encoder and decoder for a DSP processor. To increase the speed of the encoding and decoding, some hardware accelerators need to be built; thus the hardware/software partitioning is an important part of the work. The goal is to encode or decode 25 QVGA pictures per second. Certain parts that could be reused for other compression schemes are designed to be somewhat faster than the lower limit set by the 25 pictures per second constraint.



1.3. Disposition

The work has mainly consisted of four phases. In the first phase we studied how JFIF images work and how they are compressed. In the second phase we built a pure software based decoder and encoder. From the SW implementation we could make measurements and decide which parts needed to be accelerated in order to meet the goal of encoding or decoding at least 25 QVGA images per second. In the third phase we started to modify the original program and built models of the accelerators in C++ which could be linked to the IDE. A pipelined execution of the encoding and decoding was implemented, and some final changes in the SW/HW partitioning were made in this stage. In the final stage we constructed the actual hardware to match the functionality of the C++ models.

1.4. Reading Instructions

Chapter 2 describes some basic theory about flow graphs and binary numbers.

Chapter 3 describes the algorithms used by JPEG and the JFIF format.

Chapter 4 describes how the JFIF encoder/decoder package was implemented in software.

Chapter 5 describes the hardware/software partitioning.

Chapters 6, 7 and 8 describe how the accelerators were implemented.

Chapter 9 lists changes that could be made to increase speed, reduce cost and/or reduce power consumption. These parts have not been implemented due to time constraints.

Chapter 10 lists the conclusions made.

In the rest of the text, sections (including chapters and subsections) are always referenced as "section number", formulae as "equation number", images and figures as "figure number", and tables as "table number", where number is the referenced object's unique number. This format is used regardless of whether, for instance, the referred equation is not strictly an equation, or the referred section is in fact a chapter.

1.5. Who Should Read This Thesis?

The thesis can be read with several intentions: either as a basic introduction to JPEG compression and hardware/software partitioning, or as a description of how to implement hardware accelerators for a subset of the JPEG standard [4]. The intended reader has knowledge equivalent to a fourth year technical master's student. Some familiarity with digital design at the RTL level and basic transform theory is good but not necessary. Basic knowledge of general DSP and DSP processors can help make this thesis easier to read and understand.


2. Basic Theory

This chapter familiarizes the reader with some basic concepts used in this thesis. Section 2.1 describes how flow graphs and matrix multiplications are related and section 2.2 describes how binary numbers are used in this thesis.

2.1. Flow graphs and Matrix Multiplications

It is possible to represent a matrix-vector multiplication, a transformation, as a flow graph. This is best illustrated with an example. Equation 2.1 describes a transformation matrix operated on a vector.

$$\begin{pmatrix} \alpha & -\alpha & 0 \\ 2 & 0 & 3 \\ \gamma+\beta & -\beta & \gamma \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \alpha a - \alpha b \\ 2a + 3c \\ (\beta+\gamma)a - \beta b + \gamma c \end{pmatrix} \qquad (2.1)$$

The transformation in equation 2.1 corresponds to the flow graph shown in figure 2.1. As we can see, this flow graph is not very well suited for a hardware implementation: it includes many multiplications, and one addition is a three input addition. To get a more hardware friendly implementation we can rewrite the transformation matrix in equation 2.1 as a series of matrix multiplications, shown in equation 2.2. These three matrix multiplications can then be represented by a three stage flow graph as shown in figure 2.2. This new flow graph is less complex to implement in hardware and also allows for a pipelined execution. Note that the first stage in the flow graph is the rightmost matrix in the multiplication. Equations 2.3–2.5 show the operation stage by stage.

$$\begin{pmatrix} \alpha & -\alpha & 0 \\ 2 & 0 & 3 \\ \gamma+\beta & -\beta & \gamma \end{pmatrix} = \begin{pmatrix} 0 & \alpha & 0 \\ 1 & 0 & 1 \\ \gamma & \beta & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 \\ 1 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (2.2)$$

Figure 2.1. One stage flow graph.

$$\begin{pmatrix} 0 & \alpha & 0 \\ 1 & 0 & 1 \\ \gamma & \beta & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 \\ 1 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} \qquad (2.3)$$

$$= \begin{pmatrix} 0 & \alpha & 0 \\ 1 & 0 & 1 \\ \gamma & \beta & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} a + c \\ a - b \\ c \end{pmatrix} \qquad (2.4)$$

$$= \begin{pmatrix} 0 & \alpha & 0 \\ 1 & 0 & 1 \\ \gamma & \beta & 0 \end{pmatrix} \begin{pmatrix} a + c \\ a - b \\ a + c + c \end{pmatrix} = \begin{pmatrix} \alpha a - \alpha b \\ 2a + 3c \\ (\beta+\gamma)a - \beta b + \gamma c \end{pmatrix} \qquad (2.5)$$

Figure 2.2. Three stage flow graph.
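The equivalence of the one stage and three stage flow graphs is easy to check numerically. The sketch below (our own function names, with arbitrary sample values substituted for α, β, γ) applies the three stages of figure 2.2 right to left and compares the result with equation 2.1:

```cpp
#include <array>

// Three stage evaluation following figure 2.2, applied right to left.
std::array<double, 3> stagedTransform(double alpha, double beta, double gamma,
                                      double a, double b, double c) {
    // Stage 1: (a, b, c) -> (a + c, a - b, c)
    double s1 = a + c, s2 = a - b, s3 = c;
    // Stage 2: (x, y, z) -> (x, y, x + z)
    double t1 = s1, t2 = s2, t3 = s1 + s3;
    // Stage 3: (x, y, z) -> (alpha*y, x + z, gamma*x + beta*y)
    return {alpha * t2, t1 + t3, gamma * t1 + beta * t2};
}

// One stage transform of equation 2.1, for comparison.
std::array<double, 3> directTransform(double alpha, double beta, double gamma,
                                      double a, double b, double c) {
    return {alpha * a - alpha * b,
            2 * a + 3 * c,
            (beta + gamma) * a - beta * b + gamma * c};
}
```

Both functions return the same vector for any choice of symbol values, which is exactly the factorization claim of equation 2.2.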

Sometimes the transform as well as its inverse have to be calculated. When the inverse transform is equal to the transposition of the forward transform, this becomes particularly easy. Transposing the transform matrix is the same as running the flow graph in reverse: additions become forks and forks become additions, while multiplications and negations remain the same. Figure 2.3 shows these relationships. The flow graph in figure 2.2 reversed is shown in figure 2.4. Remembering that $(ABC)^T = C^T B^T A^T$, the order of the individual stages becomes clear.

Figure 2.3. Flow graph transposition. Left column is the base form and the right the transposed form.

Figure 2.4. Reversed flow graph of figure 2.2.

2.2. Binary Number Representation

Representing a number in any base, and converting between different bases, poses no problem as long as an infinite number of digits can be used. However, when designing hardware it is important to be careful about how many bits to use, since too many bits result in unnecessary hardware.

2.2.1. Two’s Complement

All binary numbers in this thesis, if not stated otherwise or if the format is irrelevant, are written in two's complement. Equation 2.6 defines the value of a two's complement number, where $P_s$ is the number of bits and $b_n$ is the $n$:th bit, counting from zero.

$$V = -2^{P_s-1}\,b_{P_s-1} + \sum_{k=0}^{P_s-2} 2^k b_k \qquad (2.6)$$
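As a concrete check of equation 2.6, a small routine (a sketch; the function name is ours) that evaluates a two's complement bit vector:

```cpp
#include <vector>

// Value of a two's complement bit vector per equation 2.6.
// bits[0] is the LSB, bits.back() the sign bit b_{Ps-1}.
long twosComplementValue(const std::vector<int>& bits) {
    const int Ps = static_cast<int>(bits.size());
    long v = -(1L << (Ps - 1)) * bits[Ps - 1];  // sign bit weighs -2^(Ps-1)
    for (int k = 0; k < Ps - 1; ++k)
        v += (1L << k) * bits[k];               // remaining bits weigh 2^k
    return v;
}
```

For a four bit vector, 1000_2 evaluates to -8 and 1111_2 to -1.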

2.2.2. Fractional Values

Fractional values in base two are no different from fractional values in base 10; the only difference is the weight assigned to the number positions. To find the value of a fractional two's complement number with a limited number of bits we only need to modify equation 2.6 into equation 2.7, where $V_f$ is the new value and $B_s$ the number of bits after the binal point (the analogue of the decimal point in base 10). Also, the bit vector index is extended into the negative range for the bits after the binal point. E.g. the vector 01.11 has the indexes 1, 0, -1, -2 for the respective positions, and the value of the vector is $-2^1 \cdot 0 + 2^0 \cdot 1 + 2^{-1} \cdot 1 + 2^{-2} \cdot 1 = 1.75$.

$$V_f = -2^{P_s-B_s-1}\,b_{P_s-B_s-1} + \sum_{k=-B_s}^{P_s-B_s-2} 2^k b_k \qquad (2.7)$$

Although in the rest of this thesis bit vectors are always assumed to have index zero at the LSB, this only modifies the previous theory by an offset in the indexes.

One thing to point out is that a number with a terminating fractional representation in a higher base may have a non-terminating representation in a lower base. When the bit width is limited, it is often necessary to use an approximation of the correct value.
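Equation 2.7 can be sketched the same way (function name ours); the bit vector is stored LSB first and B_s gives the number of bits after the binal point:

```cpp
#include <cmath>
#include <vector>

// Value of a fractional two's complement vector per equation 2.7.
// bits[0] is the LSB; the last Bs positions lie after the binal point.
double fractionalValue(const std::vector<int>& bits, int Bs) {
    const int Ps = static_cast<int>(bits.size());
    double v = -std::ldexp(1.0, Ps - Bs - 1) * bits[Ps - 1];  // sign weight -2^(Ps-Bs-1)
    for (int k = 0; k < Ps - 1; ++k)
        v += std::ldexp(1.0, k - Bs) * bits[k];               // weight 2^(k-Bs)
    return v;
}
```

The 01.11 example above corresponds to fractionalValue({1, 1, 1, 0}, 2), which gives 1.75.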

2.2.3. Hexadecimal Representation

It is often convenient, and done frequently in this thesis, to represent binary numbers as hexadecimal numbers. One hexadecimal digit can be converted directly to its corresponding four bit number. If the hexadecimal number consists of several digits, every digit is converted individually. E.g. $FA7_{16} = 1111\,1010\,0111_2$ since $F_{16} = 1111_2$, $A_{16} = 1010_2$ and $7_{16} = 0111_2$.
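The digit-by-digit conversion can be sketched as follows (function name ours):

```cpp
#include <string>

// Convert a hexadecimal string to its binary expansion, four bits per
// digit, each digit converted individually as described above.
std::string hexToBinary(const std::string& hex) {
    std::string out;
    for (char c : hex) {
        int d = (c >= '0' && c <= '9') ? c - '0'
              : (c >= 'a' && c <= 'f') ? c - 'a' + 10
              :                          c - 'A' + 10;
        for (int bit = 3; bit >= 0; --bit)
            out += ((d >> bit) & 1) ? '1' : '0';
    }
    return out;
}
```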

2.3. The Kronecker Product

The Kronecker or tensor product, denoted ⊗, is a vector and matrix product. Applying the Kronecker product to two matrices renders a new, bigger matrix; equation 2.8 illustrates this. Note that this is a short and simple explanation of the tensor product; for a more comprehensive treatment, consult a good book on mathematics.

(31)

$$\begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix} \otimes \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} \alpha \begin{pmatrix} a & b \\ c & d \end{pmatrix} & \beta \begin{pmatrix} a & b \\ c & d \end{pmatrix} \\ \gamma \begin{pmatrix} a & b \\ c & d \end{pmatrix} & \delta \begin{pmatrix} a & b \\ c & d \end{pmatrix} \end{pmatrix} = \begin{pmatrix} \alpha a & \alpha b & \beta a & \beta b \\ \alpha c & \alpha d & \beta c & \beta d \\ \gamma a & \gamma b & \delta a & \delta b \\ \gamma c & \gamma d & \delta c & \delta d \end{pmatrix} \qquad (2.8)$$


3. JPEG Theory

The JPEG standard [4] uses a number of different compression algorithms. We will only discuss the methods used by JFIF since our implementation follows the JFIF-standard. The methods are described in the order they occur when encoding. For a more exhaustive explanation, see [6] and [7].

The second part of this chapter describes the basics of how data is arranged in a JFIF file.

3.1. Overview

To compress a picture a number of different steps are employed. This chapter will describe the individual methods in a number of sections. Figure 3.1 shows the flow utilized for compression (encoding) and figure 3.2 shows the flow for restoring (decoding) the compressed image to a raw data format.

The first step in the encoding is to convert the original image colors to a color space better suited for down sampling (sections 3.2 and 3.3). The image is then divided into blocks (section 3.4) and a DCT is performed on each block to find the frequency components of the picture (section 3.5). The output from the DCT is then quantized to remove components of lesser importance to the visible image quality (section 3.6). To further reduce the data needed to represent the picture, run length encoding (section 3.8) and Huffman encoding (section 3.9) are performed. When decoding the image all steps are run in reverse. First a Huffman decoding is performed (section 3.9). Next the blocks (section 3.4) are reconstructed by a run length decoder (section 3.8). The data is then dequantized (section 3.6) before an IDCT (section 3.5) is performed in order to restore the spatial domain data. The components are up sampled (section 3.3) if necessary and finally converted back to the RGB color space (section 3.2).



Figure 3.1. Encoding flow

Figure 3.2. Decoding flow

3.2. RGB and YCbCr Color Models

One well known way of representing colors in a computer is to use the RGB color model. A color is decomposed into its red, green and blue component and each component is stored as a separate value, a color pixel is thus a triplet composed of



one value from each channel, e.g. <R, G, B>. There are many other models in use, for example the YCbCr format which is used in the JFIF standard. Colors are still represented as a triplet of values, but instead of decomposing colors into their red, green and blue components, the colors are represented as luminance and chrominance. The luminance component is denoted Y and specifies the intensity. Cb and Cr are the chrominance components; Cb gives the blueness and Cr gives the redness. Equations 3.1–3.3 describe the relations specified by the JFIF standard to calculate Y, Cb and Cr from R, G and B, and equations 3.4–3.6 describe the relations the other way around. All values are expected to be in the range $[0, 2^{P_S} - 1]$, where $P_S$ denotes the sampling precision. E.g. for eight bit precision $P_S = 8$ and the resulting values will all be in the range [0, 255].

Y  = 0.299R + 0.587G + 0.114B                                      (3.1)
Cb = -0.1687R - 0.3313G + 0.5B + 2^{P_S-1}                         (3.2)
Cr = 0.5R - 0.4187G - 0.0813B + 2^{P_S-1}                          (3.3)
R  = Y + 1.402(Cr - 2^{P_S-1})                                     (3.4)
G  = Y - 0.34414(Cb - 2^{P_S-1}) - 0.71414(Cr - 2^{P_S-1})         (3.5)
B  = Y + 1.772(Cb - 2^{P_S-1})                                     (3.6)

The JPEG standard specifies that an offset must be added to the values in order to get a mean DC component of zero; this will be elaborated in section 3.5. The number of operations can be reduced if the offset is merged with the color model transformation, as shown in equations 3.7–3.12. The R, G and B values are still expected to be in the range $[0, 2^{P_S} - 1]$, but the Y, Cb and Cr values resulting from equations 3.7–3.9 will be in the range $[-2^{P_S-1}, 2^{P_S-1} - 1]$. Thus for eight bit precision Y, Cb and Cr will be in the range [-128, 127]. However, equations 3.10–3.12 ensure that the R, G and B values are converted back to the range $[0, 2^{P_S} - 1]$.

Y  = 0.299R + 0.587G + 0.114B - 2^{P_S-1}                          (3.7)
Cb = -0.1687R - 0.3313G + 0.5B                                     (3.8)
Cr = 0.5R - 0.4187G - 0.0813B                                      (3.9)
R  = Y + 1.402Cr + 2^{P_S-1}                                       (3.10)
G  = Y - 0.34414Cb - 0.71414Cr + 2^{P_S-1}                         (3.11)
B  = Y + 1.772Cb + 2^{P_S-1}                                       (3.12)
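Equations 3.7–3.12 translate directly into code. The sketch below (our own function names) uses the coefficient values from the JFIF standard and eight bit precision, i.e. an offset of 2^7 = 128. Note that the forward and inverse coefficient sets are only approximate inverses of each other, so a round trip reproduces the input to within a small error:

```cpp
#include <array>

// RGB -> YCbCr with the DC offset merged in (equations 3.7-3.9), for
// eight bit samples (PS = 8). Outputs lie in [-128, 127].
std::array<double, 3> rgbToYCbCr(double r, double g, double b) {
    return { 0.299 * r + 0.587 * g + 0.114 * b - 128.0,
            -0.1687 * r - 0.3313 * g + 0.5 * b,
             0.5 * r - 0.4187 * g - 0.0813 * b};
}

// YCbCr -> RGB (equations 3.10-3.12); restores values to [0, 255].
std::array<double, 3> yCbCrToRgb(double y, double cb, double cr) {
    return {y + 1.402 * cr + 128.0,
            y - 0.34414 * cb - 0.71414 * cr + 128.0,
            y + 1.772 * cb + 128.0};
}
```

A fixed point version of exactly this computation is what the colormac block of the DU processor implements in hardware.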

Figure 3.3 shows an image decomposed into YCbCr components and figure 3.4 shows the same image decomposed into RGB components.

Figure 3.3. A sample image decomposed into, from top to bottom, Y, Cb and Cr.

Figure 3.4. A sample image decomposed into, from top to bottom, R, G and B.

The Y component by itself describes a gray scale version of the original image. We can see that the Y component contributes the most visual information in the YCbCr color model.


Equations 3.13–3.15 give sample R, G and B values for an 8 × 8 pixel block, and equations 3.16–3.18 show the corresponding Y, Cb and Cr matrices.

R =
  226 226 223 223 226 226 228 227
  226 226 223 223 226 226 228 227
  226 226 223 223 226 226 228 227
  226 226 223 223 226 226 228 227
  226 226 223 223 226 226 228 227
  227 227 227 222 226 228 226 230
  228 228 225 224 225 229 229 229
  223 223 226 221 227 225 226 228      (3.13)

G =
  137 137 137 136 138 129 138 134
  137 137 137 136 138 129 138 134
  137 137 137 136 138 129 138 134
  137 137 137 136 138 129 138 134
  137 137 137 136 138 129 138 134
  140 140 131 130 136 133 132 133
  134 134 141 133 134 137 132 128
  133 133 129 132 131 133 129 131      (3.14)

B =
  125 125 133 128 120 116 123 124
  125 125 133 128 120 116 123 124
  125 125 133 128 120 116 123 124
  125 125 133 128 120 116 123 124
  125 125 133 128 120 116 123 124
  123 123 113 111 120 115 120 113
  119 119 116 115 125 112 116 105
  121 121 106 114 120 116 112 106      (3.15)

Y =
  162 162 162 161 162 157 163 161
  162 162 162 161 162 157 163 161
  162 162 162 161 162 157 163 161
  162 162 162 161 162 157 163 161
  162 162 162 161 162 157 163 161
  164 164 158 155 161 159 159 160
  160 160 163 158 160 162 159 156
  159 159 155 157 158 159 156 157      (3.16)


Cb =
  107 107 111 109 104 105 105 107
  107 107 111 109 104 105 105 107
  107 107 111 109 104 105 105 107
  107 107 111 109 104 105 105 107
  107 107 111 109 104 105 105 107
  105 105 103 103 105 103 106 102
  105 105 101 104 108 100 104  99
  107 107 100 104 106 104 103  99      (3.17)

Cr =
  173 173 171 172 173 178 174 175
  173 173 171 172 173 178 174 175
  173 173 171 172 173 178 174 175
  173 173 171 172 173 178 174 175
  173 173 171 172 173 178 174 175
  173 173 177 176 174 177 176 178
  176 176 172 175 174 176 178 180
  174 174 178 174 177 175 178 179      (3.18)

3.3. Up and Down sampling

The human eye is more sensitive to changes in the intensity of an image than to its exact color value; this is why the YCbCr color model is used by JFIF.

The JFIF standard supports relative sampling frequencies of 1, 2, 3 or 4 for each component and for each direction, vertical and horizontal. It is called a relative sampling frequency since it only describes how the different components are sampled relative to the others, i.e. the relative sampling frequencies {4 × 4, 4 × 4, 4 × 4} are equal to the relative sampling frequencies {1 × 1, 1 × 1, 1 × 1}.

If the Y channel (the intensity) is not down sampled, it is possible to down sample the Cb and Cr channels (the chrominance) without obvious degradation of the image quality of a photographic image.

Figure 3.5¹ illustrates this phenomenon. The top right image shows the Y channel and the bottom row shows the two chrominance channels down sampled by keeping only every fourth value in each direction, horizontal and vertical. The Cb and Cr channels have been zoomed in this figure; in reality they only contain 1/4 · 1/4 = 6.25% of the Y channel data. The top left image shows the components merged together after the down sampled components have been up sampled, which results in a color image. Even though it is clearly visible that the Cb and Cr components have been down sampled in the decomposed version, this is not obvious in the assembled image.

¹ The top left image is a color picture; in the printed thesis there might therefore not be any color visible.


Figure 3.5. A down sampled image and its Y, Cb and Cr components.

In this example 18 values (4 · 4 + 1 + 1 = 18, Y+Cb+Cr) are required to describe 16 pixels instead of 48 values (4 · 4 + 4 · 4 + 4 · 4 = 48, Y+Cb+Cr), which is a data reduction of 62.5%.

It is worth mentioning that a very primitive up and down sampling algorithm has been used in this example: decimating and nearest neighbor. Better results would have been obtained if more advanced methods, e.g. low pass filtering and cubic spline interpolation, were used [2]. Relative sampling frequencies of {4 × 4, 1 × 1, 1 × 1} are also a little too high to get really good results.
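The primitive scheme used in the example can be sketched as follows. This is our own Python illustration, not the thesis' DSP assembler; the function names are ours, and a real implementation would rather use the filtering and interpolation methods mentioned above:

```python
def downsample(channel, px, py):
    # Decimation: keep every px-th sample horizontally and
    # every py-th sample vertically.
    return [row[::px] for row in channel[::py]]

def upsample(channel, px, py):
    # Nearest-neighbor upsampling: repeat each sample px times
    # horizontally and py times vertically.
    return [[v for v in row for _ in range(px)]
            for row in channel for _ in range(py)]
```

Down sampling an up sampled channel with the same factors returns the original data, which is why the decomposed chrominance channels in figure 3.5 could be shown zoomed.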

Instead of specifying the sample frequencies it is possible to give the sample periods. The sample frequencies and the sample periods give exactly the same information, only in different formats: instead of specifying how often values are sampled, the distance between two samples is given. It is often useful to know the periods instead of the frequencies when implementing the up and down sampling in hardware or software. Equations 3.19–3.24 show how to compute the sampling periods for the sampling frequencies {f_{Y,x} × f_{Y,y}, f_{Cb,x} × f_{Cb,y}, f_{Cr,x} × f_{Cr,y}}. Note that when sampling information is written in this format, {Y × Y, Cb × Cb, Cr × Cr}, the sampling frequencies are always intended.

\[ T_{Y,x} = \frac{\max(f_{Y,x}, f_{Cb,x}, f_{Cr,x})}{f_{Y,x}} \qquad (3.19) \]
\[ T_{Cb,x} = \frac{\max(f_{Y,x}, f_{Cb,x}, f_{Cr,x})}{f_{Cb,x}} \qquad (3.20) \]
\[ T_{Cr,x} = \frac{\max(f_{Y,x}, f_{Cb,x}, f_{Cr,x})}{f_{Cr,x}} \qquad (3.21) \]
\[ T_{Y,y} = \frac{\max(f_{Y,y}, f_{Cb,y}, f_{Cr,y})}{f_{Y,y}} \qquad (3.22) \]
\[ T_{Cb,y} = \frac{\max(f_{Y,y}, f_{Cb,y}, f_{Cr,y})}{f_{Cb,y}} \qquad (3.23) \]
\[ T_{Cr,y} = \frac{\max(f_{Y,y}, f_{Cb,y}, f_{Cr,y})}{f_{Cr,y}} \qquad (3.24) \]
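Equations 3.19–3.24 translate directly into code. The sketch below is ours (Python, one axis at a time); integer division is safe here because fractional sampling frequencies are not supported by our implementation:

```python
def sampling_periods(f_y, f_cb, f_cr):
    # Equations 3.19-3.24 for one direction (x or y): each period is
    # the maximum frequency divided by the component's own frequency.
    f_max = max(f_y, f_cb, f_cr)
    return f_max // f_y, f_max // f_cb, f_max // f_cr
```

For the common {2 × 2, 1 × 1, 1 × 1} case this gives horizontal periods (1, 2, 2): every Y sample is kept while Cb and Cr are sampled every second pixel.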

The JPEG standard does not forbid down sampling of the Y channel. This is however only used for experimental purposes since the visual image quality is reduced.

It is also possible, at least according to the JPEG standard, to use fractional sampling frequencies. Suppose that the Y component has a vertical sampling frequency of 3, the Cb component a sampling frequency of 2 and the Cr component a sampling frequency of 1. This means that each data value for the Cb component in the vertical direction represents 1.5 pixels. However, not many JPEG applications permit this kind of sampling and neither does our implementation.

3.4. Data Units and Minimum Coded Units

In JPEG, almost every operation is performed on an 8 × 8 pixel block of data. These blocks are called data units (DUs). A minimum coded unit (MCU) is a collection of data units. How many data units go into each MCU is determined by the relative sampling frequencies of the different channels. The height in DUs of an MCU is equal to the highest vertical sampling frequency (of the Y, Cb and Cr components) while the width in DUs is equal to the highest horizontal sampling frequency (of the Y, Cb and Cr components). The number of pixels that an MCU covers is thus (f_{max,x} · 8) × (f_{max,y} · 8) pixels.
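The MCU geometry described above can be sketched as follows (our own Python illustration; the dictionary-based interface is an assumption, not the thesis' data layout):

```python
def mcu_layout(freqs):
    # freqs maps a component name to its relative sampling frequency
    # pair (f_x, f_y), e.g. {2 x 2, 1 x 1, 1 x 1} as in table 3.1.
    f_max_x = max(fx for fx, fy in freqs.values())
    f_max_y = max(fy for fx, fy in freqs.values())
    # MCU size in pixels: (f_max_x * 8) x (f_max_y * 8)
    # DUs per MCU for each component: f_x * f_y
    dus_per_mcu = {c: fx * fy for c, (fx, fy) in freqs.items()}
    return (f_max_x * 8, f_max_y * 8), dus_per_mcu
```

For {2 × 2, 1 × 1, 1 × 1} this yields a 16 × 16 pixel MCU containing 4 + 1 + 1 = 6 data units, in agreement with table 3.1.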

Figure 3.6 illustrates an encoded MCU with the relative sampling frequencies {2 × 2, 1 × 1, 1 × 1}. With these frequencies one MCU consists of 6 data units, four from the Y component and one each from Cb and Cr, shown together with a 16 × 16 pixel piece of an example image. As seen we need four DUs in the Y layer to make up for the fact that each pixel in the Cb and Cr layers corresponds to four pixels in the original image. The JPEG standard [4] specifies that an MCU may contain a maximum of 10 data units; sampling frequencies that render more than 10 DUs in one MCU are not allowed. One way of getting around this is to split the data into multiple frames. This will not be discussed further since we do not handle it; it is however described in [6].

Figure 3.6. MCU with relative sample frequencies of {2 × 2, 1 × 1, 1 × 1}.

The DUs of each component are ordered from left to right, top to bottom within an MCU. All the DUs from the Y component are stored first, then all the Cb DUs and finally all the Cr DUs. The MCUs in an image are also always stored from left to right, top to bottom.

Table 3.1 lists some sampling frequencies and their corresponding MCU size and DU ordering.


Sampling Frequencies     MCU Size       DU Order in Scan
{1 × 1, 1 × 1, 1 × 1}    8×8 pixels     Y11, Cb11, Cr11
{2 × 2, 1 × 1, 1 × 1}    16×16 pixels   Y11, Y12, Y21, Y22, Cb11, Cr11
{4 × 2, 1 × 1, 1 × 1}    32×16 pixels   Y11, Y12, Y13, Y14, Y21, Y22, Y23, Y24, Cb11, Cr11
{2 × 4, 1 × 1, 1 × 1}    16×32 pixels   Y11, Y12, Y21, Y22, Y31, Y32, Y41, Y42, Cb11, Cr11
{2 × 2, 2 × 1, 1 × 2}    16×16 pixels   Y11, Y12, Y21, Y22, Cb11, Cb12, Cr11, Cr21

Table 3.1. Different sampling frequencies and their corresponding MCU size and DU ordering.

3.5. The Discrete Cosine Transform

The discrete cosine transform is the core of lossy JPEG compression [6], [7]. The N-point 1-D DCT and IDCT are shown in equations 3.25 and 3.26:

\[ T[i] = c(i) \sum_{x=0}^{N-1} v(x) \cos\left(\frac{(2x+1)i\pi}{2N}\right) \qquad (3.25) \]
\[ v(x) = \sum_{i=0}^{N-1} c(i) T[i] \cos\left(\frac{(2x+1)i\pi}{2N}\right) \qquad (3.26) \]

where

\[ c(i) = \sqrt{\frac{1}{N}},\ i = 0 \qquad c(i) = \sqrt{\frac{2}{N}},\ i \neq 0 \]

Equations 3.27 and 3.28 show the general N²-point 2-D DCT and IDCT; for JPEG, N equals 8.

\[ T[i,j] = c(i,j) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} v[y,x] \cos\frac{(2y+1)i\pi}{2N} \cos\frac{(2x+1)j\pi}{2N} \qquad (3.27) \]
\[ v[y,x] = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} c(i,j) T[i,j] \cos\frac{(2y+1)i\pi}{2N} \cos\frac{(2x+1)j\pi}{2N} \qquad (3.28) \]

where c(i, j) = c(i) · c(j), i.e.

\[ c(i,j) = \frac{2}{N} \text{ for } i, j \neq 0, \qquad c(i,j) = \frac{\sqrt{2}}{N} \text{ for exactly one of } i, j = 0, \qquad c(i,j) = \frac{1}{N} \text{ for } i = j = 0 \]

The DCT transforms the input data from the spatial domain to the frequency domain. The DCT has been chosen due to its tendency to collect most of the energy into a few frequency components when operating on typical photographic images [2].

In JPEG the DCT is performed on one data unit at a time. Thus the DCT utilized by JPEG is a 2-D 8 × 8 point DCT.

It is not hard to realize that the 1-D DCT can be written in matrix form, resulting in equations 3.29 and 3.30:

\[ t = D \cdot v \qquad (3.29) \]
\[ v = D^{-1} \cdot t \qquad (3.30) \]

This matrix form of the DCT and IDCT can be applied to the 2-D case as well, the simplest way being to use the Kronecker, or tensor, product ⊗. The 2-D DCT transform matrix is then the 64 × 64 matrix D₂ = D ⊗ D. The N × N matrix to be transformed, denoted V, has to be rewritten as an N · N-point vector, denoted v, read from the matrix V row-wise from left to right and top to bottom. The following discussion is easier to present and understand in 1-D, but the same theory applies to the 2-D DCT/IDCT.

One property of the DCT is that it is an orthonormal transform, meaning that the inverse of the transformation matrix is equal to its transpose, that is, D⁻¹ = Dᵀ [7].

Another property of an orthonormal transform is that a higher-dimensional version of the same transform can be computed by applying the lower-dimensional transform to parts of the values several times. We can utilize this fact for the 2-D transform by first transforming all the rows with a 1-D transform and then applying the 1-D transform again to the columns of the transformed rows. This is illustrated in figure 3.7.

Figure 3.7. Separated calculation of an 8 × 8 matrix.
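The row-column decomposition of figure 3.7 can be sketched as follows. This is our own direct O(N²) Python illustration of equations 3.25 and 3.27, not the scaled fast transform the accelerator uses:

```python
import math

def dct_1d(v):
    # Direct evaluation of equation 3.25, with c(i) folded in.
    n = len(v)
    return [(math.sqrt(1 / n) if i == 0 else math.sqrt(2 / n))
            * sum(v[x] * math.cos((2 * x + 1) * i * math.pi / (2 * n))
                  for x in range(n))
            for i in range(n)]

def dct_2d(block):
    # Separable 2-D DCT as in figure 3.7: a 1-D DCT on every row,
    # then a 1-D DCT on every column of the row-transformed block.
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

For a constant 8 × 8 block all the energy lands in the DC coefficient, illustrating the energy-compaction property discussed above.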

The JPEG standard [4] states that, to reduce the information after the transform, the transformed values shall be multiplied by a quantization value and rounded; this will be further discussed in section 3.6. The transform matrix can be factored according to equation 3.31, where S is a diagonal scaling matrix. Multiplying by the quantization values gives equation 3.32; note that the Q matrix is also diagonal.

\[ D = S \cdot F \qquad (3.31) \]
\[ U = Q \cdot S \cdot F \qquad (3.32) \]

For the IDCT

\[ U^{-1} = (Q \cdot S \cdot F)^{-1} \qquad (3.33) \]

This gives

\[ U^{-1} = (S \cdot F)^{-1} \cdot Q^{-1} \qquad (3.34) \]
\[ D^{-1} = D^{T} \qquad (3.35) \]
\[ U^{-1} = (S \cdot F)^{T} \cdot Q^{-1} \qquad (3.36) \]
\[ U^{-1} = F^{T} \cdot S^{T} \cdot Q^{-1} \qquad (3.37) \]
\[ S^{T} = S \qquad (3.38) \]
\[ U^{-1} = F^{T} \cdot S \cdot Q^{-1} \qquad (3.39) \]

Worth noting from equation 3.39 is that the scaling factors are the same for both the DCT and the IDCT. The reason for writing the DCT and IDCT in the forms of 3.32 and 3.39 will become clear in chapter 7.

When the YCbCr conversion is done, all values are in the range [0, 255]. The JPEG standard requires that an offset of −128 be added to all values, thus giving a mean DC component of zero. When unpacking JFIF images an IDCT is performed; after the IDCT the values must be offset back to their original range by adding 128 to them. This problem is however eliminated by using equation 3.7–3.6 instead of equation 3.1–3.6. Equations 3.40–3.42 show typical values after the DCT is performed. Here it becomes very clear that the value in the first row of the first column gets the most energy. This value is called the DC coefficient of the DU; all the non-DC coefficients are called AC coefficients. The concept of DC and AC is borrowed from electrical engineering.

DCT2(Y) =
    259   5   3   0   0  −1  −5   6
      8  −1   1  −5   2   3  −4   3
     −5   0  −2   2  −1   0   2  −2
      2   1   2   1  −1  −1   0   1
     −1  −1   0  −1   2   1  −1  −1
      1   0  −2   0  −2   1   2   1
     −2   0   3   2   2  −2  −1  −1
      1   0  −2  −2  −1   2   1   1
                                        (3.40)


DCT2(Cb) =
   −179   8  −1  −4   0   3   0   0
     11   0   1 −10  −1   2   7  −2
     −4   1  −1   5   1  −1  −3   0
     −1  −2   0   0   0   0  −1   2
      3   1   0  −1   0   0   2  −2
     −2   0  −1   0   1   0  −2   2
     −1  −1   1   1  −1   0   0  −1
      1   1  −1  −2   1   0   0   0
                                        (3.41)

DCT2(Cr) =
    372 −10   3   4  −2  −2   3  −4
     −8   0  −1   5  −1  −1   1  −3
      3   0   1  −2   1   0  −1   2
      0   0   0  −1   0   1   0  −1
     −2   0  −1   1  −1  −1   0   1
      1   0   2   1   1   0  −1  −2
      1   0  −2  −2  −1   1   2   2
     −1   0   2   2   0  −1  −1  −1
                                        (3.42)

3.6. Quantization

A type of low pass filtering is performed on each data unit by dividing the individual values resulting from the DCT operation by an integer (called the quantum value) and rounding. This operation is called quantization. The matrix containing the denominators is called a quantization table. The JPEG standard does not specify the quantization tables to be used; however, it provides two empirically tested tables (one for the luminance channel and one for the two chrominance channels) that give good results. These are shown in equations 3.43–3.44.

Equations 3.45–3.47 show typical Y, Cb and Cr matrices after quantization; note that only 6 of the 192 elements are non-zero. The quantization is the algorithmic reason JFIF images lose information in the compression. In an actual implementation there can however also be some information loss due to precision errors and rounding when performing the color conversions. Note that the DC component is in the upper left corner and that the frequency increases going to the right and downward.

QL =
     16  11  10  16  24  40  51  61
     12  12  14  19  26  58  60  55
     14  13  16  24  40  57  69  56
     14  17  22  29  51  87  80  62
     18  22  37  56  68 109 103  77
     24  35  55  64  81 104 113  92
     49  64  78  87 103 121 120 101
     72  92  95  98 112 100 103  99
                                        (3.43)


QC =
     17  18  24  47  99  99  99  99
     18  21  26  66  99  99  99  99
     24  26  56  99  99  99  99  99
     47  66  99  99  99  99  99  99
     99  99  99  99  99  99  99  99
     99  99  99  99  99  99  99  99
     99  99  99  99  99  99  99  99
     99  99  99  99  99  99  99  99
                                        (3.44)

Y =
     16   0   0   0   0   0   0   0
      1   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
                                        (3.45)

Cb =
    −11   0   0   0   0   0   0   0
      1   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
                                        (3.46)

Cr =
     22  −1   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
      0   0   0   0   0   0   0   0
                                        (3.47)

As can be seen, the quantization values are bigger for the high frequencies than for the low frequencies. This is because the human eye is much more sensitive to low frequency data than to high frequency data; thus high frequency data can be removed without any obvious degradation of the image to a human observer.

When decoding an image, the DU is of course multiplied element-wise with the quantization table before it is inverse transformed.
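The quantization step and its inverse can be sketched as below. This is our own Python illustration; note that Python's built-in round uses round-half-to-even, which may differ from a given JPEG implementation's rounding at exact .5 boundaries:

```python
def quantize(du, qtable):
    # Divide each DCT coefficient by its quantum value and round;
    # this is the algorithmically lossy step of JPEG.
    return [[round(du[y][x] / qtable[y][x]) for x in range(8)]
            for y in range(8)]

def dequantize(du, qtable):
    # Element-wise multiplication restores approximate coefficients.
    return [[du[y][x] * qtable[y][x] for x in range(8)]
            for y in range(8)]
```

For example, the DC coefficient 259 of equation 3.40 divided by the quantum value 16 of QL rounds to the 16 seen in equation 3.45; dequantizing gives back 256, not 259, which is where the information is lost.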


Figure 3.8. Zigzag pattern.

3.7. Zigzag Ordering

Each DU is rewritten to vector form in a zigzag pattern. This is done in order to group as many of the zero valued coefficients as possible together to get good RLE compression [6]; see section 3.8 for further details. The zigzag pattern used to reorder the values is shown in figure 3.8. This pattern is defined in the JPEG standard [4]. Equations 3.48–3.50 show equations 3.45–3.47 in zigzag order.

16, 0, 1, 0, . . . , 0    (3.48)
−11, 0, 1, 0, . . . , 0    (3.49)
22, −1, 0, 0, . . . , 0    (3.50)
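The scan order of figure 3.8 can be generated rather than stored as a table: walk the anti-diagonals of the block and alternate direction on every diagonal. A minimal Python sketch of ours:

```python
def zigzag_order(n=8):
    # Walk the anti-diagonals of an n x n block, alternating the
    # direction on each diagonal, producing the (row, column)
    # scan order of figure 3.8.
    order = []
    for d in range(2 * n - 1):
        diag = [(y, d - y) for y in range(n) if 0 <= d - y < n]
        order.extend(diag if d % 2 else reversed(diag))
    return order

def zigzag(du):
    # Flatten an 8 x 8 DU into its 64-point zigzag vector.
    return [du[y][x] for y, x in zigzag_order()]
```

Applying this to the quantized Y matrix of equation 3.45 yields the vector of equation 3.48.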

3.8. Run Length Encoding

Run length encoding, or RLE for short, is a method to compress data that has long runs of equal values [6]. For example, suppose I want to tell my colleague what the bit stream 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1 looks like. I could either tell him the values one by one or just tell him that there are two ones, then six zeros and then seven ones. The latter method being much more efficient, I would probably prefer it. This is essentially what RLE is all about.

In JFIF files RLE is used to encode the zero runs in the zigzag ordered vector described in section 3.7. For normal photographic pictures this results in a reduction of data.

The RLE values are encoded as bytes where the most significant four bits are the number of zeros until the next value, whereas the least significant four bits are the magnitude of the value. See section 3.8.1 for further details.

Value   Raw Bits
05      10110
04      1100
13      100
24      0011
04      0111
F0      —
F0      —
D1      0
00      —

Table 3.2. RLE encoded values

The DC-value and the AC-values are encoded in slightly different manners. The DC-value is encoded as the difference between the previous DC-value for the channel and the current one. The DC-value, quite naturally, contains no zero run information.

The AC-values are encoded as described previously but have two special codes: 00₁₆ and F0₁₆. 00₁₆ means that the rest of the values in the DU are all zeros and F0₁₆ means that there is a run of 16 zeros.

The inverse operation of RLE is called run length decoding, or RLD for short. Matrix 3.52, read row-wise, is the zigzag ordered vector of matrix 3.51. The RLE encoded values of matrix 3.52 and the raw bit codes are shown in table 3.2.

            22 12 0 −12 0 0 0 0 0 0 −8 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0             (3.51)             22 12 0 4 0 0 −12 −8 0 0 −8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0             (3.52)


3.8.1. Magnitude Encoding

Magnitude encoding is yet another way to save space. The magnitude is the number of bits used for a value. The values are then represented by bits where all values starting with a 0 are negative and all values starting with a 1 are positive [6]. For binary codes starting with a 1 the value is equal to the binary code read as a binary number, e.g. 101 → 101₂ = 5. For binary codes starting with a 0 the value is equal to the negative of the bit-inverted binary code, e.g. 010 → −101₂ = −5. Table 3.3 shows the range for each magnitude.

Encoded Value   DC Value Range
0               0
1               [−1] [1]
2               [−3, −2] [2, 3]
3               [−7, −4] [4, 7]
4               [−15, −8] [8, 15]
5               [−31, −16] [16, 31]
6               [−63, −32] [32, 63]
7               [−127, −64] [64, 127]
8               [−255, −128] [128, 255]
9               [−511, −256] [256, 511]
10              [−1023, −512] [512, 1023]
11              [−2047, −1024] [1024, 2047]

Table 3.3. DC Difference Magnitude Codes and Ranges
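The magnitude encoding and its inverse can be sketched in a few lines; this Python illustration is ours, with the negative branch implemented as the bit-inverted absolute value described above:

```python
def magnitude_encode(value):
    # Returns (magnitude, raw bits). Negative values store the
    # bit-inverted absolute value, so their raw bits start with a 0.
    if value == 0:
        return 0, ''
    magnitude = abs(value).bit_length()
    bits = value if value > 0 else value + (1 << magnitude) - 1
    return magnitude, format(bits, '0{}b'.format(magnitude))

def magnitude_decode(bits):
    # Inverse mapping: raw bits starting with 1 are positive.
    value = int(bits, 2)
    return value if bits[0] == '1' else value - (1 << len(bits)) + 1
```

This reproduces the examples of this section (5 → 101, −5 → 010) as well as the DC entry 05/10110 of table 3.2 (the value 22).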

3.9. Huffman Encoding and Decoding

Huffman encoding is the final step for reducing the size of the image [6]. The bytes encoded by the RLE, described in section 3.8, are Huffman encoded, but the raw bits are left untouched.

Huffman encoding works by encoding the most frequently occurring item with the fewest bits and then using more bits for items occurring less frequently. For instance, take the sequence ABAACDAAAB; the letters' frequencies are calculated and listed in table 3.4.

Four different values imply that we need at least two bits per value; thus we would need 10 · 2 = 20 bits to encode the string. If we were to use the Huffman codes in table 3.4 we would need only 6 · 1 + 2 · 2 + 3 + 3 = 16 bits, thus saving four bits. How the Huffman codes are calculated is not in the scope of this text; for further information about Huffman coding see [6].

When the Huffman tables have been calculated, encoding the data is a simple lookup and replace: each item is replaced with its corresponding bit field.


Letter   Frequency   Huffman code
A        6           0
B        2           10
C        1           110
D        1           111

Table 3.4. Letter frequencies and Huffman codes

code length   1   2   3
code count    1   1   2

Table 3.5. Code lengths and code counts.

Imagine we have the bit stream 01000110111 generated from the Huffman codes in table 3.4. How do we know when we have read a complete letter? The calculation of the Huffman codes ensures that a complete code is never the prefix of a longer Huffman code. Thus, decoding the bit stream previously mentioned we get 0 → A, 10 → B, 0 → A, 0 → A, 110 → C and 111 → D.

We need to know the Huffman table in order to decode the coded values; therefore the Huffman tables used to encode values in a JFIF image are always included in the file. They are stored in a JFIF image as two lists. The first list contains the number of encoded words for each particular code length. The second list is the lookup table for the Huffman codes, sorted on the binary values of the Huffman codes. For the example above the first list would be 1 1 2 (table 3.5 will perhaps make this clearer) and the second list would look like A B C D.
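The two lists stored in the file suffice to rebuild the full code table, since the codes are assigned in order of increasing length. A Python sketch of ours of this reconstruction and of the prefix-based decoding described above:

```python
def build_huffman(counts, symbols):
    # Rebuild the code table from the two DHT lists: codes per length
    # (index 0 = length 1) and the length-ordered symbol list.
    table, code, i = {}, 0, 0
    for length, n in enumerate(counts, start=1):
        for _ in range(n):
            table[format(code, '0{}b'.format(length))] = symbols[i]
            code += 1
            i += 1
        code <<= 1                # move to the next code length
    return table

def huffman_decode(bits, table):
    # The prefix property guarantees the first match is correct.
    decoded, current = [], ''
    for bit in bits:
        current += bit
        if current in table:
            decoded.append(table[current])
            current = ''
    return decoded
```

Running it on the lists 1 1 2 and A B C D recovers exactly the codes of table 3.4 and decodes the example bit stream 01000110111 to ABAACD.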

Usually four Huffman tables are stored in each JFIF file: one for the DC coefficients in the Y channel and one for the AC coefficients in the Y channel, while two tables are dedicated to the Cb and Cr components, one for DC and one for AC coefficients. When the Huffman encoding is done the result is stored in the JPEG file as a bit stream with no byte boundaries.

3.10. JPEG Markers

Markers are used to divide a JPEG stream into its component structures. All markers are 2 bytes long; the first byte is always FF₁₆ while the second byte specifies the marker type. There are two general types of markers: stand-alone markers and non-stand-alone markers. Stand-alone markers contain no other data than the 2 bytes that specify their type. Markers that do not stand alone are always immediately followed by a 2-byte value that specifies the length of the data the marker contains. The JPEG standard is not very strict about the order in which the markers should occur, other than that the JPEG stream has to begin with a start-of-image marker and end with an end-of-image marker. However, if information from one marker is required to process a second marker, the first marker must appear before the second.
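A marker walk along these rules can be sketched as below. This is our own simplified Python illustration for header data only: entropy-coded scan data following SOS is not handled, and of the stand-alone markers only SOI and EOI are recognized:

```python
def scan_markers(data):
    # Yield (marker byte, payload) pairs from a JPEG header stream.
    STAND_ALONE = {0xD8, 0xD9}        # SOI and EOI
    pos = 0
    while pos < len(data):
        assert data[pos] == 0xFF      # every marker starts with FF
        marker = data[pos + 1]
        pos += 2
        if marker in STAND_ALONE:
            yield marker, b''
        else:
            # the 2-byte big-endian length includes the length field
            length = int.from_bytes(data[pos:pos + 2], 'big')
            yield marker, data[pos + 2:pos + length]
            pos += length
        if marker == 0xD9:            # stop at EOI
            break
```

Note that the stored length counts the two length bytes themselves, so a COM marker carrying the two characters "hi" has the length field 0004₁₆.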

3.11. Marker Types

This section lists all the markers supported by our implementation.

3.11.1. SOI

The SOI (Start Of Image) marker stands alone and has to be at the very beginning of the file. Only one SOI marker is allowed per file.

3.11.2. APPn

The APP0–APP15 markers hold application specific data. The markers are used to hold additional data beyond what is specified by the JPEG standard. The JFIF format uses the APP0 marker to identify the file as a JFIF file and to store the vertical and horizontal resolution and an optional, uncompressed thumbnail of the image. In a JFIF file the APP0 marker has to immediately follow the SOI marker.

3.11.3. COM

The comment marker is used to hold comment strings about copyright information, what application the file was created with etc. The string is stored as plain ASCII text.

3.11.4. DQT

The DQT (Define Quantization Table) marker defines or redefines quantization tables. Up to 4 quantization tables can be defined by a single DQT marker. Each quantization table is assigned a unique identifier and the table is always stored in zigzag order.

3.11.5. SOFn

The SOFn (Start Of Frame) marker defines a frame. The width and height of the image are specified here along with information about how many components (channels) the frame consists of. Each component is also assigned a quantization table by the SOFn marker.

3.11.6. DHT

The DHT (Define Huffman Table) marker defines or redefines Huffman tables, which are identified by a class (AC or DC) and a number. A single DHT marker can define up to 4 Huffman tables. The DHT block contains a 16 element array of integers that gives the number of Huffman codes for each possible code length (1–16). The sum of the 16 counts is the number of values in the Huffman table. The values follow, ordered by Huffman code length.

3.11.7. SOS

The SOS (Start Of Scan) marker marks the beginning of compressed data. The components from the SOF marker are here assigned AC and DC Huffman tables. The compressed scan data immediately follows the marker.

3.11.8. EOI

The EOI (End Of Image) marker has to be at the very end of the JPEG file. The EOI marker stands alone and only one EOI marker is allowed per file.

Figure 3.9. A typical JPEG file: SOI, APP0, COM, DQT, DQT, SOF0, DHT, DHT, DHT, DHT, SOS, EOI.


3.11.9. A Typical JPEG File

Figure 3.9 shows a typical JPEG file decomposed into its markers. The two DQT markers define one quantization table each, one for the luminance and one for the chrominance channels. The four DHT markers define one Huffman table each, one table is needed for every combination of AC/DC and luminance/chrominance. The encoder we implemented writes markers in this format.

3.11.10. Summary

This rather substantial chapter gives an overview of the basics of JFIF compression. The reader who wants to know more about the entire JPEG standard and gain a deeper understanding should read [7] and [4]. A reader interested in a more implementation-oriented treatment of JFIF is referred to [6].

The various algorithmic steps, the DU flow and the data needed for the various steps are displayed in figures 3.10 and 3.11. They should hopefully serve as a summary of this chapter and clarify the algorithmic flow somewhat more precisely than figures 3.1 and 3.2.


Figure 3.10. The JFIF decoder. Phase 1 processes the JFIF header data: the Huffman tables are read, written and partly calculated, the quantization tables are read and written, and the JFIF parameters (the DU–MCU relationship) are calculated into the DM0 table buffer. Then, for each DU, memory words are read from the JFIF image buffer, Huffman decoded, run length decoded into a DU (8 × 8 16-bit words, frequency domain, quantized), dequantized and scaled, passed through a separable 8 × 8 point 2-D IDCT and written to the MCU buffer. For each complete MCU the data is read back, up sampled, converted from YCbCr to RGB, offset by +128 and written as a sub-image to the RGB image buffer.


Figure 3.11. The JFIF encoder. Phase 1 processes the JFIF header data: the Huffman and quantization tables are read and written and a number of JFIF parameters are calculated into the DM0 table buffer. Then, for each DU, bytes are read from the RGB image buffer, down sampled and formed into DUs, converted from RGB to YCbCr with a −128 offset, passed through a separable 8 × 8 point 2-D DCT with quantization, scaling and rounding, zigzag reordered, run length encoded, Huffman encoded and finally grouped into memory words that are written to the JFIF image buffer.

4. Software Implementation

The software implementation of the JFIF encoder and decoder was done to evaluate the cycle costs of the different parts, in order to make a reasonable HW/SW partitioning.

The software is implemented in pure assembler.

4.1. Decoder

The decoder takes a JFIF image and yields a raw format image, in our case an uncompressed TGA image. The current implementation can only handle sequential interleaved single frame images without restart markers; see [6] for an explanation of restart markers. Furthermore, the image needs to be of a format that generates an integer number of MCUs. These limitations are due to the fact that this implementation is only a test implementation.

4.1.1. Design Overview

The decompression process essentially consists of two phases:

1. Read and process the header data
2. Read and process the actual image data (scan data)

In the first phase the Huffman and quantization tables are constructed. Information about the MCU and DU relation is also calculated. The complete list is shown below.

• Calculate the sampling periods from the sampling frequencies.

• Calculate the horizontal and vertical number of MCUs in the picture from the sampling frequencies and the image resolution.

• Calculate the size in pixels of one MCU from the sampling frequencies.

• Calculate the number of Y, Cb and Cr DUs in one MCU from the sampling frequencies.

• Read the Huffman tables and calculate the min, max and first tables from the respective tables stored in the JFIF file, in order to facilitate the Huffman decoding later.

• Read the quantization tables and store them. No further processing is needed.

Figure 4.1 gives an architectural overview of the program. The first phase corresponds to the first two boxes (Read JFIF-headers... and pre computing). The rest of the figure constitutes the second phase.

In the second phase the image is constructed from the bit stream. This is where most of the calculations are done: Huffman decoding, run length decoding, IDCT and dequantization. This is also the component where the major part of the clock cycles is consumed.

In this design everything runs sequentially, and the next process is started only after the preceding one has finished. In figure 4.1 the dotted lines mark parts of the program that are actually merged in the implementation.

The program will process all DUs in one MCU before taking on the next MCU. The decoded data for one MCU is stored in a buffer large enough to accommodate one MCU of maximum size, 640 B (maximum number of DUs in one MCU = 10, size of one DU = 64 B). The reason the MCU needs to be buffered before writing data to the final picture is that in the scan the components are stored sequentially, and we need all three components for every pixel to do the YCbCr to RGB conversion. If the relative sampling frequencies are {2 × 2, 1 × 1, 1 × 1}, the sequential scan will first contain four Y DUs, then one Cb DU and then one Cr DU. Thus we first decode the four Y components and store them in a buffer, then we decode the Cb component and store it, and finally we do the same with the Cr component. Now we have all the data we need to do the necessary up sampling and color conversion.

4.1.2. Design decisions

The decision to implement a decoder for JFIF images is due to the fact that these are the most common image format following the JPEG standard [6], [4]. To increase speed, as many calculations as possible are made in the first phase (section 4.1.1). The focus of the code in this phase is on abstraction, to reduce errors, whereas the focus of the second phase is speed, resulting in somewhat messier code constructs. The uncompressed TGA format was chosen because one of its modes is very close to raw data: it has just a small header, then the data stored raw as one byte per channel in the order blue, green, red, and finally a small footer.

4.1.3. Program Flow

Figure 4.2 describes the top loop of the JFIF decoder and table 4.1 lists the cycle consumption of each function. DHT, DQT, COM, SOI, APP0, SOF0 and SOS all take care of the respective marker type.

DHT parses the definitions of the Huffman tables and stores these into the DSP processor memory for later use.
