
Evaluation and Hardware Implementation of

Real-Time Color Compression Algorithms

Master’s Thesis

Division of Electronics Systems

Department of Electrical Engineering

Linköping University

By

Ahmet Caglar

Amin Ojani

Report number: LiTH-ISY-EX--08/4265--SE

Linköping, December 2008


Evaluation and Hardware Implementation of

Real-Time Color Compression Algorithms

Master’s Thesis

Division of Electronics Systems

Department of Electrical Engineering

at Linköping Institute of Technology

By

Ahmet Caglar

Amin Ojani

LiTH-ISY-EX--08/4265--SE

Supervisor: Henrik Ohlsson, Ericsson Mobile Platforms (EMP)

Examiner: Oscar Gustafsson, Electronics Systems, Linköping University

Linköping, December 2008


Presentation Date

2008-12-16

Publishing Date (Electronic version)

Department and Division

Department of Electrical Engineering, Division of Electronic Systems

URL, Electronic Version

http://www.ep.liu.se

Publication Title

Evaluation and Hardware Implementation of Real-Time Color Compression Algorithms

Author(s)

Amin Ojani, Ahmet Caglar

Abstract

A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware-accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed in RGB format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate one pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data.

One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e. it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is well implementable even with stringent area and throughput constraints. However, the memory interfacing of these blocks is more critical and could be the dominating factor.

Keywords

Graphics Hardware, Color Compression, Image Compression, Mobile Graphics, Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.

Language: English

Number of Pages: 88

Type of Publication: Degree thesis

ISRN: LiTH-ISY-EX--08/4265--SE


Abstract

A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware-accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. In a graphics hardware block, color data is typically processed in RGB format. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate one pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption. Therefore, it is important to minimize the amount of color buffer data. One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. This compression/decompression must be done “on-the-fly”, i.e. it has to be very fast so that the hardware accelerator does not have to wait for data. In this thesis, we investigated several exact (lossless) color compression algorithms from a hardware implementation point of view, to be used in high-throughput hardware. Our study shows that the compression/decompression datapath is well implementable even with stringent area and throughput constraints. However, the memory interfacing of these blocks is more critical and could be the dominating factor.

Keywords: Graphics Hardware, Color Compression, Image Compression, Mobile Graphics, Compression Ratio, Frame Buffer Compression, Lossless Compression, Golomb-Rice coding.


Acknowledgements

First, we would like to express our gratitude and appreciation to our supervisor Dr. Henrik Ohlsson from Ericsson Mobile Platform (EMP) for his valuable guidance and discussions.

We would also like to thank our supervisor from Electronics Systems at Linköping University, Dr. Oscar Gustafsson, for his great support and recommendations.

Finally, our deepest thanks go to our beloved parents for their everlasting support and encouragement throughout our educational years. This thesis is dedicated to them.


Table of Contents

CHAPTER 1 ... 1

1 INTRODUCTION ... 1

1.1 COLOR BUFFER AND GRAPHICS HARDWARE ... 2

1.2 COLOR BUFFER COMPRESSION VS. IMAGE COMPRESSION ... 3

1.3 STRUCTURE OF THE REPORT ... 3

CHAPTER 2 ... 5

2 LOSSLESS COMPRESSION ALGORITHMS ... 5

2.1 INTRODUCTION ... 5

2.2 THEORETICAL BACKGROUND OF LOSSLESS IMAGE COMPRESSION ... 6

2.2.1 JPEG-LS Algorithm ... 6

2.3 REFERENCE LOSSLESS COMPRESSION ALGORITHM ... 7

2.3.1 Color Transform and Reverse Color Transform ... 8

2.3.2 Predictor and Constructor ... 10

2.3.3 Golomb-Rice Encoder ... 12

2.3.4 Golomb-Rice Decoder ... 16

2.4 GOLOMB-RICE ENCODING OPTIMIZATION ... 17

2.4.1 Proposed method for exhaustive search solution ... 17

2.4.2 Estimation method... 22

2.5 IMPROVED LOSSLESS COLOR BUFFER COMPRESSION ALGORITHM ... 24

2.5.1 Modular Reduction ... 24

2.5.2 Embedded Alphabet Extension (Run-length Mode) ... 25

2.5.3 Previous Header Flag ... 26

2.6 COMPRESSION PERFORMANCES OF ALGORITHMS ... 26

2.7 POSSIBLE FUTURE ALGORITHMIC IMPROVEMENTS ... 28

2.7.1 Pixel Reordering ... 28

2.7.2 Spectral Predictor ... 28

2.7.3 CALIC Predictor ... 28

2.7.4 Context Information ... 29

CHAPTER 3 ... 30

3 COLOR BUFFER COMPRESSION/DECOMPRESSION HARDWARE ... 30

3.1 DESIGN CONSTRAINTS ... 30

3.2 COMPRESSOR BLOCK ... 31

3.2.1 Addr_Gen1 (Source memory address generator) ... 32

3.2.2 Color_T (Color Transformer) ... 36

3.2.3 Pred_RegFile_Ctrl (Prediction Register File Controller) ... 37

3.2.4 Predictor ... 39

3.2.5 Enc_RegFile_Ctrl (Golomb-Rice Encoder Register File Controller) ... 40

3.2.6 GR_Encoder (Golomb-Rice Encoder) ... 42

3.2.6.1 GR_k Block (Golomb-Rice Parameter Estimation) ... 43

3.2.6.2 Enc Block (Encoding Block) ... 45

3.2.6.3 GR_ctrl (Golomb-Rice Control Block) ... 47

3.2.7 Data_Packer (Variable Bit Length Packer to Memory Word) ... 47

3.2.8 Addr_Gen2 (Destination memory address generator) ... 49

3.2.9 Compressor_Ctrl (Control Path) ... 50

3.2.10 Overall Compressor Datapath and Address Generation ... 51

3.3 DECOMPRESSOR BLOCK ... 52


3.3.2 Rev_Color_T (Reverse Color Transformer) ... 54

3.3.3 Const_RegFile_Ctrl (Construction Register File Controller) ... 55

3.3.4 Constructor ... 56

3.3.5 Dec_RegFile_Ctrl (Golomb-Rice Decoder Register File Controller) ... 57

3.3.6 GR_Decoder (Golomb-Rice Decoder) ... 58

3.3.7 Data_Unpacker (Variable Bit Length Unpacker from Memory Word) ... 59

3.3.8 Addr_Gen1 (Destination memory address generator) ... 60

3.3.9 Decompressor_Ctrl (Control Path) ... 61

3.3.10 Overall Decompressor Datapath and Address Generation ... 62

3.4 FUNCTIONAL VERIFICATION FRAMEWORK ... 63

3.5 SYNTHESIS RESULTS ... 64

3.6 EVALUATION OF OTHER HARDWARE IMPLEMENTATIONS ... 66

3.6.1 Parallel pipeline Implementation of LOCO-I for JPEG-LS [17] ... 66

3.6.2 Benchmarking and Hardware Implementation of JPEG-LS [18] ... 67

3.6.3 A Lossless Image Compression Technique Using Simple Arithmetic Operations [19] ... 67

3.6.4 A Low power, Fully Pipelined JPEG-LS Encoder for Lossless Image Compression [11]... 67

3.6.5 Hardware Implementation of a Lossless Image Compression Algorithm Using a FPGA [20] ... 68

3.6.6 Comparison ... 68

CHAPTER 4 ... 69

4 CONCLUSION ... 69

4.1 WORKFLOW ... 69

4.2 RESULTS AND OUTCOMES ... 69

4.3 FUTURE WORK ... 71

REFERENCES ... 73

APPENDIX A ... 75

PROPOSED COST REDUCTION METHOD ANALYSIS ... 75

A.1 Overlap-limited Search ... 75

A.2 Remainder-Based Correction ... 83

APPENDIX B ... 85

TEST IMAGE SETS ... 85

B.1 Standard Photographic Test Images ... 85

B.2 Computer Generated Test Scenes ... 86


Table of Figures

FIGURE 1: COMPRESSOR/DECOMPRESSOR HARDWARE ON MEMORY INTERFACE ... 2
FIGURE 2: ERROR ACCUMULATION DUE TO TANDEM COMPRESSION ... 6
FIGURE 3: COMPRESSION / DECOMPRESSION FUNCTIONAL BLOCKS ... 8
FIGURE 4: COLOR TRANSFORM / REVERSE COLOR TRANSFORM BLOCK INTERFACE ... 9
FIGURE 5: COLOR TRANSFORM / REVERSE COLOR TRANSFORM OPERATION FLOW GRAPH ... 9
FIGURE 6: MEDIAN EDGE DETECTOR (MED) PREDICTOR PREDICTION WINDOW ... 10
FIGURE 7: PREDICTOR / CONSTRUCTOR BLOCK INTERFACE ... 11
FIGURE 8: PREDICTOR / CONSTRUCTOR OPERATION FLOW GRAPH ... 11
FIGURE 9: ENCODED DATA IN THE STREAM ... 12
FIGURE 10: ENCODED DATA FOR (2, 0, 13, 3) AND K = 2 ... 12
FIGURE 11: GOLOMB-RICE ENCODER FUNCTIONAL BLOCKS ... 13
FIGURE 12: GOLOMB-RICE PARAMETER EXHAUSTIVE SEARCH HARDWARE ... 14
FIGURE 13: A POSSIBLE GOLOMB-RICE ENCODER HARDWARE ... 15
FIGURE 14: A POSSIBLE GOLOMB-RICE DECODER HARDWARE ... 16
FIGURE 15: HW-COST VS. NUMBER OF INPUT SAMPLES (N) ... 19
FIGURE 16: HW-COST VS. NUMBER OF PARAMETERS (K) ... 20
FIGURE 17: HW IMPLEMENTATION OF THE NEW COMBINED METHOD ... 21
FIGURE 18: ILLUSTRATION OF MODULAR REDUCTION ... 24
FIGURE 19: CALIC GAP PREDICTION WINDOW ... 29
FIGURE 20: COMPRESSOR BLOCK ... 31
FIGURE 21: MEMORY MAPPING AND CORRESPONDING PIXELS OF THE IMAGE ... 33
FIGURE 22: TRAVERSAL IN PREDICTION WINDOW ... 34
FIGURE 23: ADDRESS GENERATOR I INTERFACE ... 35
FIGURE 24: ADDRESS GENERATOR I HARDWARE DIAGRAM ... 35
FIGURE 25: COLOR TRANSFORM HARDWARE DIAGRAM ... 36
FIGURE 26: PREDICTION REGISTER FILE CONTROLLER INTERFACE ... 37
FIGURE 27: CHANGE OF PREDICTION WINDOW FOR PIXELS OF ONE SUBTILE ... 37
FIGURE 28: STATES AND REGISTER INPUT CONNECTIVITY IN PREDICTION REGISTER FILE CONTROLLER ... 38
FIGURE 29: MED PREDICTION HARDWARE FOR BOTH PREDICTOR AND CONSTRUCTOR ... 39
FIGURE 30: PREDICTOR BLOCK HARDWARE DIAGRAM ... 40
FIGURE 31: ENCODER REGISTER FILE CONTROLLER BLOCK INTERFACE ... 41
FIGURE 32: GOLOMB-RICE ENCODER BLOCK DIAGRAM ... 42
FIGURE 33: K-PARAMETER ESTIMATION HARDWARE ... 44
FIGURE 34: GOLOMB-RICE ENCODER REALIZATION ... 46
FIGURE 35: P3 BLOCK, BASIC HARDWARE REALIZATION ... 48
FIGURE 36: PACKED DATA ORDER FORMAT IN THE MEMORY ... 48
FIGURE 37: DATA PACKER ... 49
FIGURE 38: DESTINATION MEMORY ADDRESS GENERATOR BLOCK INTERFACE ... 50
FIGURE 39: CONTROL PATH BLOCK INTERFACE ... 50
FIGURE 40: OVERALL COMPRESSOR ... 51
FIGURE 41: DECOMPRESSOR BLOCK ... 52
FIGURE 42: SOURCE MEMORY ADDRESS GENERATOR BLOCK INTERFACE ... 53
FIGURE 43: REVERSE COLOR TRANSFORM HARDWARE DIAGRAM ... 54
FIGURE 44: CONSTRUCTION REGISTER FILE CONTROLLER INTERFACE ... 55
FIGURE 45: STATES AND REGISTER INPUT CONNECTIVITY IN CONSTRUCTION REGISTER FILE CONTROLLER ... 56
FIGURE 46: CONSTRUCTOR BLOCK HARDWARE DIAGRAM ... 57
FIGURE 47: DECODER REGISTER FILE CONTROLLER BLOCK INTERFACE ... 58
FIGURE 48: GOLOMB-RICE DECODER HARDWARE ... 58
FIGURE 49: DATA UNPACKER INTERFACE AND BLOCK DIAGRAM ... 59
FIGURE 50: READ / WRITE ADDRESSES FROM/TO DESTINATION MEMORY TO CONSTRUCT ONE SUBTILE ... 60
FIGURE 51: ACTUAL ADDRESSING SCHEME FOR DESTINATION MEMORY ADDRESSES ... 60
FIGURE 53: OVERALL DECOMPRESSOR ... 62
FIGURE 54: VERIFICATION FRAMEWORK FSM ... 63
FIGURE 55: FUNCTIONAL VERIFICATION FRAMEWORK ... 64
FIGURE 56: ONE BLOCK OF N VALUES ... 75
FIGURE 57: OVERLAP REGIONS OF CONSECUTIVE LENGTH FUNCTIONS WITH RESPECT TO ET ... 77
FIGURE 58: OVERLAP REGIONS BETWEEN LENGTH FUNCTIONS L1, L2, L3, L4 ... 78
FIGURE 59: OVERLAP REGIONS FOR N=4 AND K={0,1,2,3,4,5,6} WITH RESPECT TO ET ... 79
FIGURE 60: REQUIRED COMPARISONS OF OVERLAP REGIONS FOR N=4, K={0,1,2,3,4,5,6} BASED ON ET ... 80
FIGURE 61: OVERLAP REGIONS OF NON-CONSECUTIVE LENGTH FUNCTIONS WITH RESPECT TO ET ... 81


List of Tables

TABLE 1: ENCODED OUTPUT LENGTHS FOR EACH K-PARAMETER ... 14
TABLE 2: LOGIC COST OF FUNCTIONAL BLOCKS ... 17
TABLE 3: HW COST COMPARISON OF EXHAUSTIVE SEARCH AND NEW COMBINED METHOD ... 22
TABLE 4: ESTIMATION INTERVALS ACCORDING TO SUM OF INPUTS ... 23
TABLE 5: HW COST AND COMPRESSION RATIO OF ESTIMATION METHOD ... 23
TABLE 6: COMPARISON OF COMPRESSION PERFORMANCES ... 27
TABLE 7: COMPRESSOR BLOCK INTERFACE PORT DESCRIPTION ... 32
TABLE 8: SOURCE MEMORY ADDRESS GENERATOR ADDRESSING SCHEME ... 34
TABLE 9: ESTIMATION FUNCTION ... 45
TABLE 10: HEADER FORMAT GENERATED BY GR_CTRL BLOCK ... 47
TABLE 11: DECOMPRESSOR BLOCK INTERFACE PORT DESCRIPTION ... 53
TABLE 12: DESTINATION MEMORY ADDRESS GENERATOR ADDRESSING SCHEME ... 61
TABLE 13: COMPRESSOR SYNTHESIS RESULT ... 65
TABLE 14: DECOMPRESSOR SYNTHESIS RESULT ... 66
TABLE 15: CHARACTERISTICS OF DIFFERENT HARDWARE IMPLEMENTATIONS ... 68


Chapter 1

1 Introduction

A major bottleneck, for performance as well as power consumption, for graphics hardware in mobile devices is the amount of data that needs to be transferred to and from memory. In, for example, hardware-accelerated 3D graphics, a large part of the memory accesses is due to large and frequent color buffer data transfers. Therefore, it is important to minimize the amount of color buffer data.

In a graphics hardware block (for example image composition or 3D graphics rasterization), color data is typically processed in RGB format. Depending on the color resolution of the image, 8, 12, 16, or 32 bits can be used to represent one pixel. For both 3D graphics rasterization and image composition, several pixels need to be read from and written to memory to generate one pixel in the frame buffer. This generates a lot of data traffic on the memory interfaces, which impacts both performance and power consumption.

One way of reducing the required memory bandwidth is to compress the color data before writing it to memory and decompress it before using it in the graphics hardware block. Figure 1 shows the location of the compressor/decompressor hardware with respect to the graphics hardware block and the memory. The compressor/decompressor hardware helps reduce the data traffic on the memory interface, shown with arrows in the figure. The reduction in memory bandwidth can be used to minimize power consumption (reduced access to the memory bus), to increase performance (more data traffic with the same memory bandwidth), or a combination of both. Hence, a better trade-off between power and performance can be found depending on the design constraints.



Figure 1: Compressor/Decompressor hardware on memory interface

Hardware implementation of such a compressor/decompressor is the subject of this work. Our thesis, based on a reference color buffer compression algorithm [1], aims at:

− Evaluation of color buffer compression algorithms with respect to hardware implementation properties,

− VHDL implementation of a selected algorithm in order to validate the hardware cost estimations.

Accordingly, the thesis has been carried out in two phases. In the first phase, the following tasks have been carried out:

− Analysis of the problem and modeling of the reference algorithm,

− Evaluation of the proposed solution with respect to both compression performance and implementation properties,

− Exploration of algorithmic and hardware optimizations to improve both compression performance and implementation cost,

− Decision of the final algorithm to be implemented.

The second phase of the thesis work is dedicated to hardware implementation in VHDL and verification of the algorithm which is decided in the first phase, and completion of the thesis report.

1.1 Color buffer and graphics hardware

The color buffer refers to a portion of memory where the actual pixel data to be sent to the display is stored. Graphics hardware uses this buffer during rasterization. Depending on the rasterizer architecture, the buffer can be accessed in different ways. In traditional immediate-mode rendering, each triangle is rendered as soon as it comes in. Hence, for every triangle that is drawn, the related pixel data are written to the buffer unless the triangle is completely hidden. For tiled, deferred rendering architectures, on the other hand, the color buffer is written when a complete tile (a unit of w × h pixels) is finished. Hence, only visible color writes are performed, which reduces the overall color buffer bandwidth. A more detailed explanation of the topic can be found in [2].


1.2 Color buffer compression vs. image compression

Color buffer data compression, as a specific application of general data compression, shares lots of similarities with image compression. Consequently, the theory developed for image compression is well-suited to be used for compressing color buffer data in 3D graphics hardware. Specifically, correlation between neighboring pixel values is also valid for color buffer data and can be used as a basis for compression.

On the other hand, there are important differences between color buffer data compression and image compression. First of all, most of the image compression algorithms in the literature have been developed for continuous-tone still images. Their compression results have customarily been reported on a set of well-known test images. Those images are real (photographic) images, and it is harder to get information about the performance of image compression algorithms on computer-generated images. Secondly, and more importantly, most image compression algorithms assume the availability of a whole, completed image. For example, most (if not all) of the state-of-the-art image compression algorithms are adaptive, which can briefly be explained as learning from the image itself while traversing it in some order. Rasterization in graphics hardware, on the other hand, is an incremental process. Depending on the rasterizer architecture, the data to be compressed could be an unfinished scene, and it could also be only a part of the whole scene. In a tiled architecture, for example, a tile is the unit of data to be compressed, and the tile size could be too small to learn from. Hence, the success of adaptive image compression algorithms on color buffer data is not obvious and depends on the specific rasterizer architecture.

Another difference between our framework and image compression algorithms is the requirements on complexity and implementation cost. As mentioned in [1], most image compression algorithms are not symmetric, i.e., compression and decompression take different times. Moreover, for most compression algorithms only the complexity of the forward path (compression) is discussed, since they aim at applications where only compression and storage of the image data are important. The backward path (decompression) is not considered as critical. However, in our case the compression/decompression must be done “on-the-fly”, i.e. it has to be very fast so that the hardware accelerator does not have to wait for data. Finally, a compressor/decompressor for mobile devices has extra requirements on the implementation cost. Specifically, the size of the hardware block is of prime concern. This prohibits the use of sophisticated algorithms that require more logic and storage (buffering) than is affordable in our case.

1.3 Structure of the report

Chapter 1 of the report has given a description of the aim of this thesis work and some background information about the application area. Chapter 2, starting with an explanation of the need for lossless compression in our case, gives a thorough analysis of the lossless compression algorithms considered for this thesis and an evaluation of their implementation properties. This chapter corresponds to the first phase of our thesis work. Chapter 3 describes the implementation and hardware of the compressor/decompressor and presents synthesis results. Chapter 4 includes concluding remarks and a discussion of some possible future work.


Chapter 2

2 Lossless Compression Algorithms

In this chapter we discuss several lossless color data compression algorithms, their compression performance, and their hardware implementation properties. Later, we propose a modified algorithm which is especially effective for compressible images. The chapter ends with a comparison of the compression ratio and cost of those algorithms and some remarks about possible future improvements.

2.1 Introduction

Lossless image compression is customarily used in specific application areas like medical and astronomical imaging, preservation of art work and professional photography. It is not surprising that lossless compression is not used for multimedia in general when one considers its limited compression performance. The achievable compression ratio varies between 2:1 and 3:1 in general, which is significantly lower than what lossy compression can offer. Furthermore, in lossy compression the resulting image quality and desired compression performance can always be traded-off depending on the requirements.

Considering the disadvantages just mentioned, the use of lossless compression for color buffer data in 3D graphics hardware may be questioned. However, [1] explains and illustrates the possibility of getting unbounded errors due to so-called tandem compression when a lossy algorithm is used. Tandem compression artifacts arise when lossy compression is performed for every triangle written to a tile during rasterization, resulting in accumulation of error. This is a direct consequence of rasterization being an incremental process. Figure 2, from [1], illustrates the accumulation of error.


Figure 2: Error accumulation due to Tandem Compression

Although it is possible to control the accumulated error in those cases, as suggested in [1], the resulting image quality may not be acceptable. In our work we employ a conservative approach (lossless compression) instead, since the resulting compression ratio is sufficient for our application.

2.2 Theoretical Background of Lossless Image Compression

In image compression, there are several algorithms which offer different approaches for the compression of still images. The most famous algorithms are FELICS [3], LOCO-I [4], and CALIC [5]. Owing to its better trade-off between complexity and compression ratio, LOCO-I was standardized as JPEG-LS [6].

2.2.1 JPEG-LS Algorithm

The idea behind JPEG-LS is to take advantage of both the simplicity and the compression potential of context models. The error residuals are computed using an adaptive predictor, and the Golomb-Rice technique is used for encoding the data. The purpose of having an adaptive predictor instead of a fixed predictor is that it produces smaller prediction residuals, which leads to a higher compression ratio. It should be noted that better prediction helps only when the encoding parameter can be extracted from the compressed stream itself (i.e., derived from already decoded data), which is the case in JPEG-LS. Otherwise, the major overhead that degrades the compression ratio is sending the header information, and in that case improving the predictor cannot help much in reaching higher performance. The reason why non-adaptive algorithms give a lower compression ratio is that their compression performance is limited by the first-order entropy of the prediction residuals, which in general cannot achieve total decorrelation of the data [6]. As a consequence, the compression gap between these simple schemes and more complex algorithms is significant.

The LOCO-I algorithm is built from three main components. The first component is the predictor, which consists of an adaptive part and a fixed part. The fixed part performs horizontal and vertical edge detection, where the dependence on the surrounding samples is through fixed coefficients. The fixed predictor used in LOCO-I is a simple median edge detector (MED) predictor and will be explained in subsection 2.3.2. The adaptive part, on the other hand, is context dependent and performs bias cancellation, since a DC offset is typically present in context-based prediction [6].

The second component is the context model. A more complex context modeling technique results in a higher achievable dependency order. In LOCO-I, the context model computes the gradients of neighboring pixels and then quantizes the gradients into a small number of equally probable connected regions. Although in principle the number of those regions should be adaptively optimized, the low-complexity requirement dictates a fixed number of equally probable regions. The gradients represent information about the part of the image surrounding the current pixel. By knowing the gradients, we can learn the level of activity, such as smoothness or edginess, around the current pixel. This information governs the statistical behavior of the prediction error [6]. For JPEG-LS, the number of contexts is 365. This number represents a suitable trade-off between the storage requirement, which is proportional to the number of contexts, and the compression performance.

The last component, the coder, is used to encode the corrected prediction residuals. LOCO-I uses the Golomb-Rice coding technique [6, 7] in two different modes, a regular mode and a run-length mode. This coding technique is discussed in detail in subsection 2.3.3.

There are several different implementation approaches for the JPEG-LS algorithm, each of which uses a specific hardware architecture such as a parallel architecture, a pipelined architecture, or a combination of both. Implementation options include dedicated DSPs, FPGA boards, and ASICs. Factors that affect the choice of platform include cost, speed, memory, size, and power consumption. One very important characteristic of the JPEG-LS algorithm is its sequential execution nature, due to the use of context statistics when coding the error residuals from the prediction phase. This characteristic makes it possible to design a parallel, pipelined encoder architecture in order to speed up the compression. In section 3.6, different hardware architectures and their implementation results are discussed.

Compression in a mobile application is limited by the available storage and memory bandwidth. Therefore, context-based algorithms such as JPEG-LS may not be applicable, as their storage requirement for the context information could be quite high for this application.

2.3 Reference Lossless Compression Algorithm

Our thesis work is based on [1], which gives a survey of color buffer data compression algorithms and proposes a new exact (lossless) algorithm. In this section, we present a thorough analysis of this algorithm and of the role and hardware implementation cost of its functional blocks. The result of this analysis serves as the basis for our later work on both algorithmic and hardware optimizations.

This algorithm, as opposed to more complex adaptive context-modeling schemes like LOCO-I [4], can be classified as a variant of a simplicity-driven DPCM technique, employing variable bit length coding of prediction residuals obtained from a fixed predictor [6]. To get a better decorrelation of the pixel data, a lossless (exactly reversible) color transform precedes those blocks. The block diagram of the compressor and decompressor is given in figure 3.

Figure 3: Compression / Decompression Functional Blocks

In context-based algorithms, the encoding parameter for each pixel is estimated from previously traversed data (the context). Since the decoder traverses the data in the same order, it will make the same decision as the encoder for the parameter of the current pixel. This eliminates the overhead of sending the encoding parameter in the stream. However, since no context information is stored in our reference algorithm, the overhead of sending the encoding parameter of each pixel is significant. An important feature of the algorithm is thus to encode a number of pixels (a 2×2 subtile) with the same parameter in the encoder stage. This allows a trade-off between the overhead and the use of a non-optimal encoding parameter for some pixels.

Another feature of the reference algorithm is that it operates on tiles (8×8 blocks of pixels) to make it compliant with a tiled architecture. However, the functional blocks of the algorithm itself do not use any tile-specific information.

In the following subsections blocks of the algorithm are discussed.

2.3.1 Color Transform and Reverse Color Transform

The color transform block converts the RGB triplet to a YCoCg triplet in order to decorrelate the channels. The Y channel is the luminance channel; Co and Cg are chrominance channels. It is stated in [1] that decorrelation of the channels improves the compression ratio by about 10%. This transformation and its important features were introduced in [9]. Exact reversibility is an essential feature of this transformation, since the overall algorithm is lossless. The forward and backward transformation equations are:

Forward:  Co = R - B,   t = B + (Co >> 1),   Cg = G - t,   Y = t + (Cg >> 1)
Reverse:  t = Y - (Cg >> 1),   G = Cg + t,   B = t - (Co >> 1),   R = B + Co        (1)


From an implementation point of view, this transformation has a dynamic range expansion of 2 bits, i.e., if the input RGB channels are n bits each, the output Y channel will require n bits and the chrominance channels will require n+1 bits each. The block interfaces of the forward and reverse transforms with 8-bit RGB channels are given in figure 4.

Figure 4: Color Transform / Reverse Color Transform Block Interface

As the equations suggest, both the color transform and the reverse color transform have 2 shift and 4 add/subtract operations per pixel, which can be expressed as follows:

[2(>>) , 4(+)] per pixel.

The flow graph of the operations is given in figure 5.

Figure 5: Color Transform / Reverse Color Transform Operation Flow Graph

The operation cost and data lengths indicate that both blocks can be realized by:


- Two 9-bit adders/subtractors
- Two 8-bit adders/subtractors

This is a per-pixel cost, and the overall cost depends on the throughput requirement. It should also be noted that the color transform has a maximum logic depth of two 9-bit adders and two 8-bit adders, whereas the reverse color transform has a maximum logic depth of one 9-bit adder and two 8-bit adders.
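As an illustration of the datapath just described, a small C model of the forward and reverse transform is given below. It assumes the lossless YCoCg-R formulation of [9] written out in equation (1); the variable names and test values are chosen here only for the example.

```c
#include <assert.h>

/* Forward lossless color transform (YCoCg-R variant of [9]):
   8-bit R, G, B in; Y stays 8 bits, Co and Cg become 9-bit signed values.
   An arithmetic (sign-preserving) right shift is assumed, as in the hardware. */
static void rgb_to_ycocg(int r, int g, int b, int *y, int *co, int *cg)
{
    int t;
    *co = r - b;             /* 9-bit signed */
    t   = b + (*co >> 1);
    *cg = g - t;             /* 9-bit signed */
    *y  = t + (*cg >> 1);    /* 8-bit unsigned */
}

/* Exact inverse: recovers the original R, G, B. */
static void ycocg_to_rgb(int y, int co, int cg, int *r, int *g, int *b)
{
    int t = y - (cg >> 1);
    *g = cg + t;
    *b = t - (co >> 1);
    *r = *b + co;
}

int main(void)
{
    int y, co, cg, r, g, b;
    rgb_to_ycocg(200, 17, 93, &y, &co, &cg);
    ycocg_to_rgb(y, co, cg, &r, &g, &b);
    assert(r == 200 && g == 17 && b == 93);   /* exactly reversible */
    return 0;
}
```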

2.3.2 Predictor and Constructor

The predictor used in our reference algorithm is called the MED predictor in [6] and was originally introduced by Martucci [10]. This predictor uses three surrounding pixels to predict the value of the current pixel, as shown in figure 6.

Figure 6: Median Edge Detector (MED) predictor prediction window

The prediction is performed with the following formula, where a and b denote the two adjacent neighbors (left and above) and c the diagonal (above-left) neighbor of the current pixel in figure 6:

x̂ = min(a, b)    if c ≥ max(a, b)
x̂ = max(a, b)    if c ≤ min(a, b)        (2)
x̂ = a + b - c    otherwise

The first two cases correspond to a primitive test for horizontal and vertical edge detection. If no edge is detected, the third case predicts the value of the current pixel by considering it to lie on a plane formed by the three neighboring pixels. Despite its simplicity, the MED predictor is reported to be a very effective fixed predictor.
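A minimal C sketch of the prediction rule in (2) is given below; the neighbor naming (a, b, c) follows the convention introduced above and is only illustrative, not taken from the thesis hardware.

```c
/* MED (median edge detector) prediction, cf. equation (2).
   a = left neighbor, b = above neighbor, c = above-left neighbor. */
int med_predict(int a, int b, int c)
{
    int mn = (a < b) ? a : b;
    int mx = (a < b) ? b : a;

    if (c >= mx)            /* edge detected: pick the smaller neighbor */
        return mn;
    if (c <= mn)            /* edge detected: pick the larger neighbor */
        return mx;
    return a + b - c;       /* planar prediction when no edge is detected */
}
```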

After the prediction, the predicted value (x̂) is subtracted from the actual pixel value (x) and the resulting error residual (e) is sent out to be encoded in the encoder block. Conversely, in the decompression path the same prediction is performed from the previously constructed pixels, and the prediction (x̂) is added to the decoded error residual (e) from the stream to reconstruct the actual pixel value (x). The block interfaces of the predictor and constructor are given in figure 7. In this figure, the input pixel values are 9-bit signed chrominance components (Co and Cg), and the error residual is a 10-bit signed value. For the Y and α predictors/constructors, the input size is 8 bits.


Figure 7: Predictor / Constructor Block Interface

The operations extracted from (2) can be expressed as follows: [3 comp. (<), 3 (+)] per pixel-component.

The flow graphs of the predictor and constructor operations are identical and are given in figure 8.

Figure 8: Predictor / Constructor Operation Flow Graph

The flow graph and data wordlengths indicate that both the predictor and constructor blocks can be realized by:

- Three 10-bit comparators
- Two 9-bit adders/subtractors
- One 10-bit adder/subtractor
- One 9-bit 4x1 MUX (with some additional logic at the select inputs)

This cost is per pixel-component cost and the overall cost depends on the throughput requirement. Both the predictor and constructor have a maximum logic depth of two 9-bit and one 10-bit adders.

Since the next stage, i.e. Golomb-Rice encoding, requires unsigned (one-sided) error residuals, the following signed-to-unsigned conversion, as suggested in [4], needs to be performed after the prediction:

M(e) = 2e          if e ≥ 0
M(e) = -2e - 1     if e < 0        (3)

Conversely, after Golomb-Rice decoding in the decompression path, the corresponding unsigned-to-signed conversion is needed.
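A small C sketch of this mapping and its inverse, assuming the standard LOCO-I style remapping of equation (3):

```c
/* Map a signed prediction residual to a non-negative value for Golomb-Rice
   coding (equation (3)), and the inverse mapping used after decoding. */
unsigned map_residual(int e)
{
    return (e >= 0) ? (unsigned)(2 * e) : (unsigned)(-2 * e - 1);
}

int unmap_residual(unsigned m)
{
    /* even codes come from non-negative residuals, odd codes from negative ones */
    return (m & 1u) ? -(int)((m + 1u) / 2u) : (int)(m / 2u);
}
```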

2.3.3 Golomb-Rice Encoder

Golomb codes are variable bit rate codes that are optimal for one-sided geometric distributions (OSGD) of non-negative integers. Since the statistics of the prediction error residuals from a fixed predictor in continuous-tone images are well modeled by a two-sided geometric distribution (TSGD) centered at zero [6], Golomb coding is widely used in lossless image coding algorithms, with a mathematical absolute value operation at the beginning to obtain an OSGD.

Since Golomb coding requires an integer division and a modulo operation with the Golomb parameter m, Rice codes [8] are generally used in implementations. Rice coding is a special case of Golomb coding which reduces the division and modulo operations to simple shift and mask operations. In Golomb-Rice encoding, we encode an input value e by dividing it by a constant 2^k. The results are a quotient q and a remainder r. The quotient q is stored using unary coding, and the remainder r is stored using normal binary coding with k bits. To illustrate with an example (figure 10), let us assume that we want to encode the values 2, 0, 13, 3 and that we have selected the constant k = 2. After the division we get the following (q, r) pairs: (0, 2), (0, 0), (3, 1), (0, 3). Unary coding represents a value by as many zeros as the magnitude of the value, followed by a terminating one. The encoded values therefore become (1b, 10b), (1b, 00b), (0001b, 01b), (1b, 11b), which is 15 bits in total.

Figure 9: Encoded Data in the Stream

Figure 10: Encoded Data for (2, 0, 13, 3) and k = 2

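To make the coding step concrete, the following C sketch reproduces the worked example above. It encodes one value e with Rice parameter k as a unary quotient followed by k remainder bits and prints the bit pattern; the packing of such code words into memory words is a separate step (the data packer).

```c
#include <stdio.h>

/* Encode one non-negative value e with Rice parameter k.
   The quotient q = e >> k is written in unary (q zeros and a terminating one),
   followed by the k low bits of e. Returns the code length in bits.
   Illustrative model only; it prints the bits instead of packing them. */
static int rice_encode(unsigned e, unsigned k)
{
    unsigned q = e >> k;
    unsigned r = e & ((1u << k) - 1u);

    for (unsigned i = 0; i < q; i++)
        putchar('0');                               /* unary part */
    putchar('1');                                   /* terminating one */
    for (int i = (int)k - 1; i >= 0; i--)
        putchar(((r >> i) & 1u) ? '1' : '0');       /* k-bit remainder */
    putchar(' ');

    return (int)(q + 1 + k);
}

int main(void)
{
    unsigned vals[4] = { 2, 0, 13, 3 };
    int total = 0;
    for (int i = 0; i < 4; i++)
        total += rice_encode(vals[i], 2);
    printf("\ntotal = %d bits\n", total);           /* 15 bits, as in the example */
    return 0;
}
```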


In our reference algorithm, the optimal Golomb-Rice parameter k for a 2×2 pixel subtile of error residuals is computed with an exhaustive search, and the Golomb-Rice coded residuals are sent out to the stream preceded by the k-parameter as a header. During decompression, the decoder decodes the data from the stream using the k-parameter received in the header.

Encoding requires three functional blocks as given in figure 11:

Figure 11: Golomb-Rice Encoder functional blocks

The reference algorithm uses a 3-bit header (k = 0, 1, …, 7) to encode a subtile. Among those headers, k = 7 is reserved for the special case when all error residuals in a subtile are zero. In this case only the header is stored; otherwise the header is followed by the coded component-wise residuals. The exhaustive search for the best k-parameter requires comparison of the lengths of the output code created by each possible k value (0, 1, …, 6), excluding the special case. The length of the output code corresponding to a k-parameter can be expressed with the following formula:

Lk = ⌊e1/2^k⌋ + ⌊e2/2^k⌋ + ⌊e3/2^k⌋ + ⌊e4/2^k⌋ + 4k + 4        (4)

The lengths of each output code from this formula are given in table 1.

k-parameter    Length of output code (Lk)
0              e1 + e2 + e3 + e4 + 0 + 4
1              ⌊e1/2⌋ + ⌊e2/2⌋ + ⌊e3/2⌋ + ⌊e4/2⌋ + 4 + 4
2              ⌊e1/4⌋ + ⌊e2/4⌋ + ⌊e3/4⌋ + ⌊e4/4⌋ + 8 + 4
3              ⌊e1/8⌋ + ⌊e2/8⌋ + ⌊e3/8⌋ + ⌊e4/8⌋ + 12 + 4
4              ⌊e1/16⌋ + ⌊e2/16⌋ + ⌊e3/16⌋ + ⌊e4/16⌋ + 16 + 4
5              ⌊e1/32⌋ + ⌊e2/32⌋ + ⌊e3/32⌋ + ⌊e4/32⌋ + 20 + 4
6              ⌊e1/64⌋ + ⌊e2/64⌋ + ⌊e3/64⌋ + ⌊e4/64⌋ + 24 + 4

Table 1: Encoded output lengths for each k-parameter
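In software terms, the exhaustive search simply evaluates the length expression of equation (4) for every candidate k and keeps the minimum. A minimal C sketch for one subtile-component is shown below; it also handles the reserved all-zero case (k = 7) first. This is an illustrative model, not the hardware realization discussed next.

```c
/* Exhaustive Golomb-Rice parameter search for one subtile-component.
   Returns the k in 0..6 that minimizes the encoded length of equation (4);
   the special all-zero case (k = 7) is checked first. */
unsigned best_k(const unsigned e[4])
{
    if ((e[0] | e[1] | e[2] | e[3]) == 0)
        return 7;                       /* only the 3-bit header is stored */

    unsigned best = 0, best_len = ~0u;
    for (unsigned k = 0; k <= 6; k++) {
        unsigned len = 4 * k + 4;       /* remainder bits plus terminating ones */
        for (int i = 0; i < 4; i++)
            len += e[i] >> k;           /* unary quotient bits */
        if (len < best_len) {
            best_len = len;
            best = k;
        }
    }
    return best;
}
```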

In order to find the best k-parameter, four additions have to be performed for each k to calculate the length of its corresponding output code (three additions are needed for k = 0). The fixed term “4” is common to all the choices; therefore its addition is not needed for the comparison. This corresponds to 6 × 4 + 3 = 27 additions. To compare the lengths of seven values, six comparison operations are needed. To summarize, the operations to find the best k-parameter with exhaustive search can be expressed as follows:

[6 comp. (<), 27 (+)] per subtile-component = [6 comp. (<), 27 (+)] per pixel

The hardware diagram is given in figure 12.


Figure 12: Golomb-Rice parameter exhaustive search hardware

More specifically, the hardware cost is:

- Six 13-bit comparators
- Two 12-bit adders
- Four 11-bit adders
- Four 10-bit adders
- Four 9-bit adders
- Four 8-bit adders
- Four 7-bit adders
- Three 6-bit adders
- Two 5-bit adders

This cost is per subtile-component, which can equivalently be thought of as a per-pixel cost. The overall cost depends on the throughput requirement. This block has a logic depth of three 13-bit, one 12-bit, one 11-bit, and one 10-bit adders.

The second encoder block encodes the input residuals of a subtile with the calculated k-parameter. The output of this block is four encoded words corresponding to each pixel of a subtile and their corresponding lengths.

A very simple possible architecture for this block is given in [11]. Adjusting this architecture to our case, the hardware for each pixel of the second block is given in figure 13.

Figure 13: A possible Golomb-Rice encoder hardware

The hardware cost per pixel-component of this block is:

- One 5-bit adder
- Two 10-bit shifters
- One 22-bit shifter
- 10 XOR gates
- 22 OR gates


The final block of the encoder is the data packer. This block receives the 3-bit header (k-parameter) and the code/length pairs of each pixel in a subtile. It packs the code words into fixed-size memory words and sends them as output to the external memory.

2.3.4 Golomb-Rice Decoder

The role of the decoder is to extract the error residuals of a subtile by decoding the compressed data using the header, according to figure 9. Its functional blocks are similar to those of the encoder, but since the header is provided by the incoming stream, the k-parameter determination block is not needed. The data unpacker provides the header and the (q, r) pairs of each pixel of a subtile. The q data is obtained with a unary-to-binary conversion.

The next block combines the binary (q, r) pairs with the header and reproduces the error residual at its output according to:

e = q · 2^k + r        (5)

A simple possible decoder hardware for each pixel-component of a subtile is given in figure 14.

Figure 14: A possible Golomb-Rice decoder hardware

The hardware cost per pixel-component of this block is:

- One 22-bit shifter
- 10 OR gates


To summarize, table 2 gives the logic cost of the functional blocks in both the compressor and the decompressor (only the adder cost is considered). Note that this calculation only includes the datapath functional blocks shown in figure 3. This means the actual hardware is expected to include other blocks for memory interfacing, memory addressing, pipelining, the control path, etc. It is also important to note that the actual hardware size depends to a great extent on the design requirements, while table 2 shows the generic per-pixel cost of the algorithm.

Functional block                    Compressor logic cost        Decompressor logic cost
                                    (adder cells) per pixel      (adder cells) per pixel
Color transform                     34                           -
Reverse color transform             -                            34
Prediction                          232                          -
Construction                        -                            232
GR Encoder – k determination        310                          -
GR Encoder – residual encoding      20                           -
GR Decoder – residual decoding      -                            -
Total                               596                          266

Table 2: Logic Cost of Functional Blocks

2.4 Golomb-Rice Encoding Optimization

Considering the results given in table 2, it is obvious that the most costly part of the design is the hardware necessary to find the best k-parameter for Golomb-Rice coding. Therefore, in order to reduce the hardware cost, it is natural to try to reduce the cost of this circuitry.

Two approaches have been considered to reduce the complexity. The first is an improved exhaustive search method, which is presented in subsection 2.4.1. The second is an estimation formula given in [8], which is presented in subsection 2.4.2.

2.4.1 Proposed method for exhaustive search solution

The exhaustive search method for finding the k-parameter is straightforward to implement, but its computational cost is large and increases linearly with the number of k values. For all k values, the length of the encoded data has to be calculated, and the k corresponding to the minimum length is chosen among them by comparison. For example, consider a block size of n, which indicates the number of inputs to be encoded together, and the set k = {0, 1, 2, …, m-1}, where m is variable and depends on the application requirements. The best member of the set should be selected as the Golomb-Rice parameter.

The computational requirements of the exhaustive search method can be significantly reduced with our new solution, while still finding the Golomb-Rice (best k) parameter for a group of input data. The proposed approach uses a combination of two different ideas.


The first idea, which will be referred to as “overlap-limited search”, removes the need for computation and comparison of all the length values for each possible k. It is mathematically proven that, for any given set of input samples {e1, e2, e3, …, en}, depending on their sum, there are overlap regions only between a fixed, limited number of length functions, and that only those length functions need to be computed and compared to get the best k. In other words, not all possible k values but only a fixed, limited, and consecutive subset of them can be candidates for the Golomb-Rice parameter of each block. This idea is not limited to hardware implementations, but reduces the time complexity of the comparison in software implementations as well.

The second idea, which will be referred to as “remainder-based correction”, eliminates the computational redundancy of performing identical bit additions when calculating the code lengths (Lk) corresponding to each k. We identify bit additions common to all Lk and save hardware by performing those additions only once. From another point of view, instead of adding shifted versions of the input data (the quotients) for each k, we first add the inputs only once and then shift the same sum for each k. This way of calculating, however, ignores the effect of the remainders on the sum. To obtain exactly the same result, a correction is performed for each k after the addition by using the remainders of the division. Since the correction hardware is much smaller than the adders used for each k, a significant hardware saving is possible. This idea is only applicable to hardware implementations of finding the Golomb-Rice parameter (best k-parameter).
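The arithmetic behind the remainder-based correction can be sketched as follows: since each ei = 2^k · qi + ri, the sum of the quotients equals (eT − Rk) / 2^k exactly, where Rk is the sum of the k low-order bits of the inputs. The total eT is therefore computed only once, and only the small remainder sums differ between the candidate k values. The C fragment below is an illustrative software formulation of this idea for n = 4; the exact hardware structure is the one shown in figure 17 and analyzed in Appendix A.

```c
/* Code length of equation (4) computed with the "sum once, correct with
   remainders" idea: e_i = 2^k * q_i + r_i implies
   sum(q_i) = (eT - Rk) >> k, where Rk is the sum of the k low bits of each
   input and eT = e[0] + e[1] + e[2] + e[3] is computed once by the caller. */
unsigned length_for_k(const unsigned e[4], unsigned eT, unsigned k)
{
    unsigned mask = (1u << k) - 1u;
    unsigned Rk = (e[0] & mask) + (e[1] & mask) + (e[2] & mask) + (e[3] & mask);
    return ((eT - Rk) >> k) + 4 * k + 4;
}
```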

To put the solution into perspective, the plots in figures 15 and 16 show the cost functions of three different implementations, namely the exhaustive search, the overlap-limited search method, and the combined method (overlap-limited search and remainder-based correction), with respect to n (the number of input samples) and k (the number of candidates for the Golomb-Rice parameter), respectively. In figure 15, the cost function is represented with respect to n (the number of input samples to be encoded together). It is assumed that the set k = {0, 1, 2, 3, 4, 5, 6, 7} is fixed and that the input data word length is 8 bits. It can be observed from the plot that the slope of the cost function of the combined method is ⅓ of that of the exhaustive search method.


Figure 15: HW-cost vs. number of input samples (n)

In figure 16, the cost is shown as a function of the number of members in set k. This plot shows a very important feature of “overlap-limited search”. The number of comparisons to find the Golomb-Rice parameter (best k) is fixed and independent of the number of k values to be compared. Hence, for applications where dynamic range of input data is larger, a larger set of k values should be used and “overlap-limited search” leads to even more significant reductions in the complexity of number of comparisons. Audio applications using 16-bit input data is an example of this case [12].


Figure 16: HW-cost vs. number of parameters (k)

The mathematical derivation and data analysis of this proposed method are given in Appendix A. Our implementation, combining both methods, is shown in the circuit diagram of figure 17; it takes the input bits (A5-A0, B5-B0, C5-C0, D5-D0), eT, k, k+1, and k+2 as inputs. eT is obtained by adding the input values. The region corresponding to eT is then located to find the three k values (k, k+1, k+2) to compare. The outputs of the circuit diagram are Lk, Lk+1, and Lk+2. These three values are compared using two comparators to find the best k.

Figure 17: HW implementation of the new combined method


This method is a general solution for implementations of Golomb-Rice encoders in all applications, with any set of Golomb-Rice parameters k and different block sizes n (the subtile size in our case). It is an exact method which replaces the exhaustive search for the best k-parameter and leads to much lower computational requirements. The improvement in hardware cost with the implementation explained above is given in table 3.

Method                         Cost (full adders)    Compression ratio (norm.)
Exhaustive search (exact)      310                   1
New combined method (exact)    111                   1

Table 3: HW cost comparison of exhaustive search and new combined method

The table shows that the new implementation method leads to a 65% reduction in hardware cost compared with the exhaustive search, while still finding the best k-parameter for a block.

The comparison of the results is presented in figures 15 and 16, which shows the advantage of the new method in reducing the hardware cost. For example, in figure 16, considering a word length of 32 bits, the set k = {0, 1, 2, …, 31} should be used in order to achieve the minimum code-word length. In this case the hardware cost is reduced by 83% with the overlap-limited search and by 89% with the combined method, with exactly the same result.

2.4.2 Estimation method

In [8], an estimation formula based on the sum of all inputs is given, where the k-parameter is determined according to the range of the sum of the input values,

sum = e1 + e2 + e3 + e4,        (6)

where e1, …, e4 are the inputs to be encoded (in our case, the four pixels of a subtile). The estimation works according to table 4:

sum                  k
sum = 0              7
0 < sum < 8          0
8 ≤ sum < 16         1
16 ≤ sum < 32        2
32 ≤ sum < 64        3
64 ≤ sum < 128       4
128 ≤ sum < 256      5
sum ≥ 256            6

Table 4: Estimation intervals according to sum of inputs
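In software, the estimation amounts to summing the four residuals once and selecting k from the range of the sum. A minimal C sketch following table 4 (not the thesis VHDL) is given below.

```c
/* Golomb-Rice parameter estimation from the sum of the four residuals
   of a subtile-component, following the intervals of table 4. */
unsigned estimate_k(const unsigned e[4])
{
    unsigned sum = e[0] + e[1] + e[2] + e[3];

    if (sum == 0)   return 7;   /* all-zero subtile, header-only case */
    if (sum < 8)    return 0;
    if (sum < 16)   return 1;
    if (sum < 32)   return 2;
    if (sum < 64)   return 3;
    if (sum < 128)  return 4;
    if (sum < 256)  return 5;
    return 6;
}
```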

The advantage of the estimation method over the exhaustive method in reducing the hardware cost is shown in table 5. The cost of the estimation method is the cost of the hardware needed to calculate the sum in (6); therefore, two 10-bit adders and one 11-bit adder are required. The estimation method may occasionally find a non-optimal k-parameter. However, empirical results with a wide range of test images show that the reduction in compression performance is insignificant, as shown in table 5. In [8] it is also mathematically proven that the effect of the estimation on the compression performance is bounded.

Method                       Cost (full adders)    Compression ratio (norm.)
Exhaustive search (exact)    310                   1
Estimation                   31                    0.998

Table 5: HW cost and compression ratio of estimation method

For applications where the exact exhaustive search is preferred, the method proposed in subsection 2.4.1 can reduce the hardware cost significantly. However, in this thesis work the estimation method has been chosen since it is cheaper and the resulting compression ratio is good enough.


2.5 Improved Lossless Color Buffer Compression Algorithm

As mentioned in [1], our reference algorithm is influenced by the LOCO-I algorithm. It can be thought of as a low-cost, non-adaptive projection of LOCO-I. This has led us to a deeper analysis of the ideas behind LOCO-I and has enabled us to improve the algorithm to get a better compression ratio, especially for highly compressible images, with negligible extra hardware cost. The modifications to the reference algorithm are the use of the estimation method (explained in subsection 2.4.2), modular reduction, a run-length mode, and a previous-header flag, which are explained in the following subsections.

2.5.1 Modular Reduction

The error residual at the output of the predictor is one bit wider than the data at the predictor inputs. For example, in our case the inputs x, x1, x2, x3 are all 9-bit data and the error residual is 10 bits. The reason for this expansion is the e = x - x̂ subtraction. However, since the predicted value (x̂) is known to both the decoder and the encoder, the error residual (e) can actually be restricted to values that can be represented with the same number of bits as the input data. Since this data is not centered around zero, a remapping of large prediction residuals is needed. This is called modular reduction [4]. Figure 18 illustrates the technique.

Figure 18: Illustration of Modular Reduction



The effect of modular reduction is two-fold. Firstly, it leads to slightly better compression in the encoding stage, since the absolute value of the error residual is smaller. Secondly, the compression and decompression hardware blocks have a smaller area due to the smaller data size in the datapath.
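A hedged C sketch of the idea is given below. It assumes an n-bit alphabet of size 2^n with pixel values taken in [0, 2^n - 1]; the residual is wrapped into a window of that width around zero, and the decoder can undo the wrap exactly because the wrapped residual is congruent to the true one modulo the alphabet size. The thesis applies the reduction per channel with the word lengths given above; the formulation here is only illustrative.

```c
/* Modular reduction of a prediction residual (cf. [4]). Illustrative sketch:
   x and xhat are n-bit values in [0, ALPHABET-1]; e.g. a 512-entry alphabet
   is used here as a stand-in for the 9-bit chrominance channels. */
enum { ALPHABET = 512 };

int reduce_residual(int x, int xhat)
{
    int e = x - xhat;                          /* raw residual in (-ALPHABET, ALPHABET) */
    if (e < -ALPHABET / 2) e += ALPHABET;      /* wrap into [-ALPHABET/2, ALPHABET/2 - 1] */
    else if (e >= ALPHABET / 2) e -= ALPHABET;
    return e;                                  /* representable with n bits */
}

int reconstruct_pixel(int xhat, int e)
{
    int x = xhat + e;                          /* congruent to the original x mod ALPHABET */
    if (x < 0) x += ALPHABET;                  /* undo the wrap: x is known to lie in [0, ALPHABET-1] */
    else if (x >= ALPHABET) x -= ALPHABET;
    return x;
}
```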

2.5.2 Embedded Alphabet Extension (Run-length Mode)

In subsection 2.3.3 it was mentioned that the header k = 7 is used for the case where all four error residuals of a subtile are zero during the GR encoding process. In this case the whole subtile is encoded with 3 bits only. This addresses the redundancy of sending extra terminating bits for each error residual in a subtile. Although the redundancy is removed within a single subtile boundary, significant redundancy may still exist among adjacent subtiles. In a graphics application this corresponds to the case where a whole tile (an 8×8 block of pixels) is covered by one or two triangles during rasterization. A typical example is the user menus of mobile devices. A menu typically consists of large icons and several flat regions in the background. In image compression applications, a quite similar problem exists for large smooth regions of a still image. In [4] it is stated that, in general, symbol-by-symbol (in our case Golomb-Rice) encoding of error residuals in low-entropy distributions (large flat regions) results in significant redundancy. They address this problem by introducing an “alphabet extension”. Specifically, the LOCO-I/JPEG-LS algorithm enters a “run-length mode” when a flat region is encountered.

We used the same idea for more efficient encoding of low-entropy regions. In order to do this, we keep track of the headers used for each component of the previous subtile. Whenever all four headers are 7 (kα = 7, kY = 7, kCg = 7, kCo = 7), the algorithm enters run-length mode. In this mode we no longer put any bits into the output stream as long as the incoming error residuals to the Golomb-Rice encoder are zero. Instead, we increase a 4-bit run-length counter by one for each component. The run-length counter indicates the total number of zero error residuals so far. Whenever a non-zero error residual is encountered, the run-length mode is broken. In this case the current value of the run-length counter is put into the output stream and the normal mode of operation continues again.

During decoding, the decoder also keeps track of the headers of the previously decoded subtile. Hence, it enters run-length mode at the same position during traversal. As soon as it enters run-length mode, it first reads the 4-bit run-length counter value from the stream. Then it outputs zero error residuals for that many cycles and continues the normal mode of operation.

The 4-bit run-length counter has a fixed range (0-15). This causes a problem for representing run lengths longer than four subtiles (16 components). The problem is solved by introducing a run-length flag. During encoding, when the run-length counter becomes 15, a “1” bit is put into the stream, representing the completion of one 16-component block. Correspondingly, when the run length is broken, a “0” bit is put into the stream just before the run-length counter value. For the decoder, each “1” read from the stream means one 16-component block in run-length mode. Similarly, a following “0” bit designates that the run length is broken.
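The flag scheme can be illustrated with the small C sketch below, which emits the bits for one terminated run of zero residuals. The emit_bit() helper is hypothetical (here it simply prints the bit), and the exact counter reset behavior is an assumption made for the illustration.

```c
#include <stdio.h>

/* Illustrative emission of one terminated run of zero residuals:
   a '1' flag for each completed block of 16 zero components, then a '0'
   flag followed by the 4-bit counter value for the partial block that was
   broken. emit_bit() is a stand-in for the real bit packer. */
static void emit_bit(int b) { putchar(b ? '1' : '0'); }

static void emit_run(unsigned zero_components)
{
    while (zero_components >= 16) {     /* completed 16-component blocks */
        emit_bit(1);
        zero_components -= 16;
    }
    emit_bit(0);                        /* run broken */
    for (int i = 3; i >= 0; i--)        /* 4-bit run-length counter value */
        emit_bit((zero_components >> i) & 1u);
}

int main(void)
{
    emit_run(37);       /* prints "11" (two full blocks), "0" (break), "0101" (count 5) */
    putchar('\n');
    return 0;
}
```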


The hardware cost of the run-length mode implementation is four 3-bit registers to store the component headers and a 4-bit run-length counter. Its size relative to the other functional blocks will be given in section 3.5.

2.5.3 Previous Header Flag

Once the headers of the previous subtile are stored in the encoder for the run-length mode, better compression can be achieved by comparing the current header with the previous header. Due to the spatial correlation among adjacent subtiles, it is likely that these two headers have the same value. Hence, instead of putting a 3-bit header into the output stream for each subtile, a “0” flag bit is put, which means that the current header is the same as the previous header. Conversely, when the headers are different, a “1” bit is put before the actual header.

Now that all the modifications to the reference algorithm have been introduced, the final algorithm to be implemented is decided. The algorithm includes all the modifications explained in this section. Moreover, the algorithm will be implemented not for tiled traversal but for scan-line traversal of the input data. Therefore, both the reference algorithm and the modified algorithm were modeled for a left-to-right scan-line data traversal. The results in table 6 were obtained from scan-line traversal of the images as well.

It is important to note that the maximum output size for a 32-bit input pixel is 64 bits. Therefore, it is theoretically possible to get a compressed size twice the original input size. However, unless the input is completely noisy, meaningless data, the output size is always smaller than the input size. This is the same for most other compression algorithms as well.

2.6 Compression Performances of Algorithms

In order to evaluate the compression performance gained, software models of both algorithms were prepared in the MATLAB environment. Three different groups of test data have been used. The first group includes well-known standard photographic test images used for benchmarking image compression algorithms, taken from [15]. The second group includes several computer-generated scenes. The first four of them in table 6 are also used in [1] to benchmark the reference algorithm. The third group includes several menu screen snapshots typical of mobile devices. Finally, the compression of a completely black image is also evaluated to observe the performance of the algorithms in an extreme case. All test data used are 24-bit color images in .PNG or .BMP format. The images evaluated are given in Appendix B.

It is important to note that the data used for the evaluation are screenshots of completed scenes. This means that the results do not include the full, incremental rasterization process. An evaluation of the improvement gained within a real or software-simulated rasterizer framework is definitely of interest. Nevertheless, we anticipate that the results would be similar or even better during a rasterization process, since an unfinished scene is generally simpler and contains fewer details than a complete scene. It has already been mentioned that the improved algorithm works better on simpler, compressible scenes. This is also verified in table 6 for the group 3 data.

Another important point to mention is that all input data are 24-bit RGB images, while the algorithms are modeled for a 32-bit RGBA data format. For the evaluation, the alpha channel of all the image data was padded with eight “0” bits; hence the evaluation is performed with 32-bit RGB0 data for all the input images. This is the reason for getting higher compression ratios than expected for both algorithms. For example, the compression ratio for the well-known Lena image is found to be 1.945 and 2.021 for the two algorithms, respectively. On the other hand, the JPEG-LS compression ratio is reported as 1.773 [16]. JPEG-LS would certainly be expected to compress better than both algorithms within the same framework.

IMAGE                               REFERENCE ALGORITHM    IMPROVED ALGORITHM

Group 1 (standard photographic test images, 24-bit color)
Peppers (512 × 512)                 2.812                  3.016
Peppers2 (512 × 512)                1.769                  1.828
Mandrill (512 × 512)                1.542                  1.591
Lena (512 × 512)                    1.945                  2.021
House (256 × 256)                   2.131                  2.226
Sailboat (512 × 512)                1.690                  1.744
Airplane (512 × 512)                2.289                  2.404
Average                             2.025                  2.118

Group 2 (computer generated test scenes, 24-bit color)
Ducks (640 × 480)                   2.785                  2.991
Square (640 × 480)                  2.937                  3.155
Car (640 × 480)                     3.609                  4.059
Quake4 (640 × 480)                  3.173                  3.469
Bench_scr1 (640 × 360)              2.992                  3.253
Bench_scr2 (640 × 360)              2.976                  3.249
Bench_scr4 (640 × 360)              3.168                  3.567
Average                             3.091                  3.392

Group 3 (computer generated user menu scenes, 24-bit color)
Menu1 (240 × 320)                   4.684                  6.377
Menu2 (240 × 320)                   2.776                  3.056
Menu3 (240 × 320)                   1.992                  2.068
Menu4 (240 × 320)                   2.700                  2.941
Menu5 (240 × 320)                   4.166                  5.734
Menu6 (320 × 480)                   3.416                  3.803
Menu7 (320 × 480)                   4.606                  6.395
Average                             3.477                  4.340

Group 4
Black (1280 × 1024)                 10.667                 511.926

Table 6: Comparison of compression performances

2.7 Possible Future Algorithmic Improvements

In this thesis work, several solutions have been examined to improve the compression performance while still keeping the complexity and hardware cost reasonably low. However, there are still several possibilities for algorithmic and architectural improvements. This section describes some of the techniques proposed in the literature which might be applicable to image compression for mobile 3D graphics and could be considered as future work in the area.

2.7.1 Pixel Reordering

This is one of the solutions that have been examined within our work. The objective of this technique is to minimize the header overhead in the Golomb-Rice encoder. The idea is to group the pixels/subtiles inside a tile based on their Golomb-Rice parameter (k value). This increases the compression ratio significantly, since it helps to reduce the header overhead in the stream. As future work, it would be interesting to investigate the storage requirements needed to keep track of the original places of the pixels in order to reconstruct the pixels in their original order [13].

2.7.2 Spectral Predictor

As mentioned before, the main overhead which degrades the compression performance is storing the headers in the encoded stream. Improving the predictor might not contribute much to the compression performance, and this small improvement might not justify a more complex and costly predictor. However, there is an opportunity to get rid of the color transform block if we could efficiently take advantage of the spectral correlation between the color components R, G, and B. In order to do so, a spectral predictor is needed, which can predict pixel values of one color component based on the predicted value of another component for the same pixel. This method is described in detail in [14]. What is interesting for the area of mobile image compression is to investigate the cost and complexity of this method, compare it with the total cost of both the color transform block and the fixed MED predictor, and measure the compression performance improvement that could be achieved by using a spectral predictor.

2.7.3 CALIC Predictor

Context-Based Adaptive Lossless Image Compression (CALIC) was proposed by Wu and Memon [5]. The algorithm is based on an adaptive predictor followed by a context-based arithmetic coder. CALIC uses a gradient-adjusted predictor (GAP), which is able to adapt itself with respect to the intensity gradients of the surrounding and neighboring pixels near the pixel under prediction [14].

