Academic year: 2022


MASTER'S THESIS

Steganography in Reed-Solomon Codes

Peter Hanzlik

Master program

Master of Science in Information Security

Luleå University of Technology

Department of Business Administration, Technology and Social Sciences


Abstract

This thesis describes a steganographic system that embeds hidden data into a communication channel that utilizes Reed-Solomon error-correction codes. A formal model of a Reed-Solomon covert channel is proposed by stating the requirements laid on such a technique. The model was validated by experimental research methods. Findings indicate that the proposed model satisfies the primary attributes of steganography: capacity, imperceptibility and robustness. The research provides a basis for further research into the wide range of applications of Reed-Solomon codes.

Table of Contents

1 Introduction
1.1 Problem Description
1.1.1 Digital Watermarking
1.1.2 Message Hiding
1.1.3 Video Stream Resynchronization
1.1.4 Steganographic File System
1.2 Research Problem
1.3 Research Design
1.4 Data Collection and Visualization
1.5 Scope
2 Coding Theory
2.1 Hamming Distance
2.2 Types of Codes
2.3 Errors
2.4 Galois Fields
2.4.1 Arithmetic in Finite Field
2.4.2 Construction of Finite Field GF(2^3)
3 Reed-Solomon Codes
3.1 Characteristics of Reed-Solomon Codes
3.2 Encoding Reed-Solomon Codes
3.3 Decoding Reed-Solomon Codes
3.3.1 Syndrome Computation
3.3.2 Determination of the Error-locator Polynomial
3.3.3 Finding Roots of the Error-locator Polynomial
3.3.4 Calculation of Error Values
4 Steganography in Reed-Solomon Codes
4.1 General Steganography System
4.2 Model of Reed-Solomon Covert Channel
4.2.1 Requirements for a Steganographic Technique over RS Codes
4.2.2 Specific Steganographic Technique over RS Codes
5 Validation
5.1 Methodology of the Validation
5.2 Findings
5.2.1 Simulation of Individual Errors
5.2.2 Simulation of Burst Errors
5.3 Result
5.4 Possible Areas of Application
5.4.1 Digital Watermarking
5.4.2 Message Hiding
5.4.3 Video Stream Resynchronization
5.4.4 Steganographic File System
Conclusion and Discussion
References

Table of Figures

Figure 1: Error-correction Process
Figure 2: Rise of Errors
Figure 3: Reed-Solomon Coder/Decoder
Figure 4: Reed-Solomon Codeword
Figure 5: Decoding Process
Figure 6: A Generic Steganography System
Figure 7: Inserting Steganographic Data into Error-correction
Figure 8: Model of Reed-Solomon Covert Channel
Figure 9: Steganographic Algorithm over Reed-Solomon Codes
Figure 10: Flow Chart of the Steganographic Technique
Figure 11: Class Diagram
Figure 12: Pilot Study - Test Repetitions
Figure 13: Individual Error Distribution
Figure 14: Steganographic Locations Distribution
Figure 15: Burst Error Distribution
Figure 16: Steganographic Locations Distribution

List of Tables

Table 1: Basic Operations in GF(2)
Table 2: Elements of GF(2^3)
Table 3: Pilot Study - Standard Deviation of Error Location Frequency
Table 4: Pilot Study - Successful Decoding Comparison
Table 5: Ratio of Successful Decoding of the Information
Table 6: Ratio of Successful Decoding of the Steganographic Information
Table 7: Ratio of Successful Decoding of the Information (Burst Error)
Table 8: Ratio of Successful Decoding of the Steganographic Information (Burst Error)
Table 9: Theoretical Steganographic Capacity of Blu-ray Disc


1 Introduction

Information hiding is a general term encompassing many sub-disciplines, of which this thesis focuses on steganography and watermarking.

Steganography is the science of hidden communication. The aim of steganography is to hide a message in such a way that no one except the sender and the intended recipient knows of its existence. Hidden information does not attract attention and is not exposed to attacks. This approach differs from cryptography, whose goal is to make information unreadable.

Digital watermarking is a sub-field of copyright marking, and its aim is to protect intellectual property rights. Watermarks may be imperceptible or visible depending on whether a steganographic or non-steganographic system is used. This work concentrates on steganographic systems.

Interest in steganography significantly increased after the 9/11 terrorist attacks, when it became clear that information hiding is likely to be used for criminal activities.

1.1 Problem Description

There are many steganographic techniques. In ancient times two main techniques were used: writing under the wax of wax tablets, and tattooing messages on the shaved heads of slaves and waiting for the hair to grow back. More modern techniques include invisible ink, microdots, and hiding information in images or other multimedia files.

Developing new theories that utilize steganography opens new possibilities for its application (Information Hiding, 2002). Among the applications on which this thesis sheds new light are digital watermarking, message hiding, video stream resynchronization and steganographic file systems.

1.1.1 Digital Watermarking

Copyright infringement represents a serious issue in the field of software and multimedia. According to the Business Software Alliance (Corbin, 2009), in 2008 the software industry lost $53 billion due to piracy. The Institute for Policy Innovation concludes in its analysis (Siwek, 2007) that piracy of sound recordings in the U.S. represents an overall annual loss of $12.501 billion, and piracy of motion pictures results in a $20.5 billion annual loss across all U.S. industries.

Recording and software companies constantly try to develop new watermarking techniques that are robust against digital piracy. Digital watermarking is a relatively new discipline in steganography, introduced at the end of the 20th century (Houmansadr et al., 2006). Much research has been performed in this field, most of it dedicated to image watermarking, for instance in the JPEG format (Suhail et al., 2003; Lai and Wu, 2009; Singhal et al., 2009).

As mentioned, the largest piracy losses nowadays occur in the motion picture business. Ye et al. (2010) researched watermarking in video on demand, specifically in the MPEG format. Many studies have proposed watermarking techniques for various video formats, for instance AVS (Wang et al., 2009), H.264/AVC (Wu et al., 2010) and many others. An in-depth overview of digital watermarking in video files is provided by Doerr and Dugelay (2003), who conclude their survey with a call for new research in the video watermarking domain: "New applications have to be considered, specific challenges have to be taken up and video-driven approaches have to be investigated." (Doerr and Dugelay, 2003, p. 263). Nearly all methods modify the original multimedia file when adding the watermark and are thus file-format dependent. A watermarking system independent of the multimedia file format would give recording companies much more flexibility in the fight against copyright piracy.

1.1.2 Message Hiding

The Federal Plan for Cyber Security and Information Assurance Research and Development (2006, p. 9) states that "immediate concerns also include the use of cyberspace for covert communications, particularly by terrorists but also by foreign intelligence services", and its research topics include the detection of hidden information and covert information flows. The plan considers steganography a potential risk:

“International interest in R&D for steganographic technologies and their commercialization and application has exploded in recent years. These technologies pose a potential threat to U.S. national security. Because steganography secretly embeds additional, and nearly undetectable, information content in digital products, the potential for covert dissemination of malicious software, mobile code, or information is great. The threat posed by steganography has been documented in numerous intelligence reports.” (Interagency Working Group on Cyber Security and Information Assurance, 2006, p. 41).

Governments have realized the risk represented by the misuse of steganography and have therefore developed controversial systems for intelligence monitoring such as Echelon or Carnivore (Nabbali and Perry, 2003; Sloan, 2001). There is a strong suspicion that terrorists use steganographic covert channels to communicate with each other by placing hidden messages in images available on the Internet (Kelley, 2001). A steganalysis project (Provos, 2001) analyzed 3 million images from eBay and USENET archives and to date has not decoded any hidden message. Steganalysis techniques are not yet developed to the same level as steganography.

Steganalysis techniques may be divided into two main groups: specific and universal (Nissar and Mir, 2010). Specific techniques attack a specific embedding technique or a slight variation of it; deep knowledge of the embedding technique is needed to design such an attack. Universal techniques apply neural networks and other artificial intelligence concepts to identify steganographic files. Their disadvantage is that such a system needs an extensive training period covering all types of embedding techniques; otherwise its detection of new techniques is limited. Truly universal algorithms do not exist (Nissar and Mir, 2010). Therefore, every new embedding algorithm has to be described so that it can later be incorporated into steganalysis techniques and prevent terrorist messages from propagating on the Internet.

1.1.3 Video Stream Resynchronization

Video streaming is subject to errors during transmission because of low bandwidth, low processing power, a poor-quality channel or other causes. Once such a disruption in the data stream occurs, the decoder has to resynchronize quickly. Even a single bit error causes a defect across a group of pictures (Robie and Mersereau, 2002). By embedding steganographic data into video streams, the decoder can resynchronize the stream more quickly, and the experience of watching the video is significantly improved. As the resynchronization data is encoded steganographically, such a modified video file remains compatible with the original decoders: decoders implementing the feature can utilize the resynchronization information, while old decoders simply skip it. Robie and Mersereau (2002) proposed such a resynchronization technique for the MPEG-2 standard. More powerful codecs have also been presented; however, as they are not compliant with the standard ones, their wide application is improbable (Yilmaz and Alatan, 2003; Puri et al., 2001). A review of error concealment techniques for video communication may be found in the work of Wang and Zhu (1998). Developing a technique that can be applied to a range of video codecs while remaining backwards compatible with old decoders represents a challenging task for researchers. Such techniques are currently lacking in this field, and therefore new research has to be performed.

1.1.4 Steganographic File System

The steganographic file system was first proposed by Anderson et al. in 1998. A steganographic file system hides the existence of the data and does not attract adversaries to perform attacks. An attacker does not know whether hidden data even exists, which provides another layer of security on top of cryptography. The motivation behind steganographic file systems is to protect users through plausible deniability, for example in legal disputes.

The novel approach of Anderson (1998) resides in the fact that hidden data is encoded directly into disk volumes rather than into cover data (images, audio or video files). He proposed two techniques for realizing a steganographic file system: the first is to generate random cover files in which the hidden data is encoded; the second is to encrypt blocks of hidden files and write them to random absolute disk addresses. A model for a steganographic file system has also been proposed, together with a practical implementation on local machines and an extended version designed for open network platforms such as SAN, DataGrid or P2P (Zhou, 2005). A distributed steganographic file system was presented by Giefer and Letchner (2004). Unfortunately, there is not much research in this field and, as the authors admit, nearly all developed models have been tested only in theory. All of this research more or less enhances the models of Anderson (1998), which are vulnerable to traffic analysis attacks (Troncoso, 2007; Diaz et al., 2008). The authors conclude by stating that "more sophisticated mechanisms are required in order to design a traffic analysis resistant steganographic file system; developing such mechanisms is left as an open problem" (Troncoso, 2007, p. 232). Therefore, a new impulse has to be brought to this field of steganography.

1.2 Research Problem

As described in the Problem Description section, steganography has a wide range of applications with possibilities for improvement. There are two ways to implement steganography: either develop an application-specific solution that utilizes specific features of the application, or develop a solution that utilizes the communication channel itself and can thus be used in a wider range of applications. I have decided to contribute to the field of steganography by proposing an adaptable method for information hiding in a specific communication channel.

Researchers have concentrated on various communication channels so far. One of the most discussed topics is embedding data into TCP/IP protocols, as the TCP/IP packet architecture allows inserting steganographic data into unused or optional locations within the packet. By extracting specific bits from the packets, the recipient may reassemble the hidden message. Murdoch and Lewis (2005) developed tests to detect steganographic data in TCP/IP headers and furthermore describe a new technique that they claim is undetectable. Chakinala (2007) generalized the concept of the TCP/IP communication channel for steganographic purposes and, in the research "Steganographic Communication in Ordered Channels", presented a formal model for transmitting hidden information by packet reordering. Lucena et al. (2004) described an approach to application layer protocol steganography and furthermore introduced the notions of syntax and semantics to ensure conformity.

Much research has concentrated on communication channels other than TCP/IP. Westfeld and Wolf (1998) described a steganographic system which embeds secret messages into a video stream, i.e. into an unordered channel. Unordered channels are mainly used in real-time systems where waiting for delayed packets is not desirable; in these applications, noise that modifies information during transmission often occurs. Korjik and Morales-Luna (2001) described a scenario of information hiding through noisy channels. Westfeld (2006) presented a practical application of hiding information with respect to channel noise in the research "Steganography for Radio Amateurs — A DSSS Based Approach for Slow Scan Television".

One of the most recently invented communication media is the quantum channel. Keye (2007) introduced a scheme for steganographic communication hidden in the quantum key distribution protocol BB84.

The focus of researchers has, however, missed one type of communication channel: the transmission channel with error-correction capability. By adding redundant information, error-correction codes are capable of recovering a damaged message at the destination (Figure 1).

Figure 1: Error-correction Process

Researchers have so far studied the utilization of error-correction codes in steganography in the context of reducing distortion during the embedding process (Fontain and Galand, 2008; Zhang et al., 2009; Schönfeld and Winkler, 2006). These techniques apply syndrome coding of error-correction codes to communication channels that are, however, not themselves based on error-correction codes. Research on steganography within error-correction codes might reveal new steganographic techniques. The potential hidden in error-correction codes motivated me to perform research in this field. As this area has not yet been targeted by any research in the way that I intend, I have decided to propose and describe a model for hiding steganographic information in error-correction channels.

There is a wide range of error-correction codes. For my research I have selected the Reed-Solomon error-correction codes, which are widely used in technologies such as CD, DVD, Blu-ray, DVB and DSL, making them an ideal candidate. The broad scope of Reed-Solomon codes provides extensive possibilities for setting up research that should result in the improvement of common steganographic applications, such as those described in the previous section.

The research problem is formulated as:

Propose and describe a model for steganography over a Reed-Solomon error-correction covert channel.

The overall research question is formulated as:

"How might Reed-Solomon codes be utilized for steganographic purposes?"

The hypothesis answering the research question is stated as:

"Steganography over a Reed-Solomon transmission channel may be performed by exploiting the error-correction capability of the channel, encoding hidden bits as errors in transmitted messages."

To answer the overall research question, I also have to answer the subordinate question:

"What are the requirements laid on a technique utilizing the steganography model in the Reed-Solomon covert channel?"

1.3 Research Design

The research design used in this thesis is experimental. Experimental design is concerned with the analysis of data generated from experiments (Easton and McColl, 1997). It enables the researcher to test a hypothesis by reaching valid conclusions about relationships between independent and dependent variables (Key, 1997). The purpose of experimental design is to provide a framework within which the experiment is performed.

We may identify four main experimental design approaches (Ross and Morrison, 2004): true experiments, repeated measures, quasi-experimental designs, and time series designs. True experiments are considered the ideal design, maximizing internal validity, and this approach is used in the thesis. The major advantage of true experiments is the random assignment of subjects, which eliminates systematic error (Ross and Morrison, 2004).

In the experimental design we have to consider validity threats that have the potential to bias the results. The experimenter has to take these threats into account when evaluating the results and try to limit their impact on the experiments. One possible validity threat influencing the experiments in this thesis is that the artificially generated subjects of the experiments provide ideal laboratory conditions rather than real-life conditions. Validity threats in randomized experiments are discussed in several papers (Conrad and Conrad, 1994; Borg, 1984) that will be considered in the evaluation of the results in this thesis.

Conducting experimental research is a difficult process in which the phases have to be carried out in the right order. Ross and Morrison (2004) developed a model representing a sequence of logical steps for planning and conducting experimental research:

1. Select a Topic
2. Identify the Research Problem
3. Conduct a Literature Search
4. State the Research Questions (or Hypotheses)
5. Determine the Research Design
6. Determine Methods
7. Determine Data Analysis Techniques

This thesis follows the proposed model, which is considered the research strategy for this research.

In order to identify possible problems that may arise in the major study, I decided to conduct a pilot study. A pilot study "can greatly improve the proposed study design and methodology", and "testing instruments and making adjustments before instigating a major study helps to ensure that data collection is efficient and successful" (Monsen and Horn, 2008).

Conducting a pilot study brings certain advantages to the research (Woken, 2010):

1. Permits preliminary testing of the hypothesis and its possible change in the major study
2. Provides researchers with ideas and clues that may improve the findings in the main study
3. Permits a thorough check of the planned procedures
4. Reduces the number of unanticipated problems
5. May save a lot of time, money and effort
6. Gives the possibility to try out a number of alternative measures and then select the most appropriate one

A pilot study is performed as a sample size calculation on the best available data (Lancaster et al., 2002). I consider performing a pilot study before conducting a true experiment inevitable; otherwise the major study may easily miss its aim. In order not to exceed the extent of the thesis, the pilot study will not be fully documented; only references to it will be provided.

1.4 Data Collection and Visualization

The research problem lies at the intersection of two theories: steganography and error correction. These two concepts therefore have to be described sufficiently. I collect information from the literature, conference documents and research papers.

The theory is described in text, supported by schemas and examples. The schemas visually present the idea and logic behind the presented theory. Particular care is paid to the examples, which I find very effective when explaining a difficult topic. I attach examples after each section to make the topics much easier for a reader to understand.

Data collection is represented by a series of experiments that validate the proposed model. I have developed an application utilizing the concepts presented in the thesis. Data gathered from the application was statistically analyzed and presented in the form of charts and tables.

1.5 Scope

The thesis focuses on steganography in Reed-Solomon error-correction codes. Description of the steganographic technique over other error-correction codes is out of the scope of this thesis, as are concrete algorithms for practical applications of the model. The research acts as an enabler for further research.


2 Coding Theory

According to Hoffman (1991, p. 1), coding theory is "the study of methods for efficient and accurate transfer of information from one place to another". From a mathematical point of view (Prause, 2001), coding is an injection that assigns to every symbol of the source set a symbol from the code set; in this way a codeword is created. A code is the set of all codewords. Codewords are represented as n-tuple sequences, where each of the n positions may take one of q states. The length of the code is the number of codewords M; for binary codes this is given by the relation M = 2^k, where k is the number of information bits.

2.1 Hamming Distance

The main idea of coding is that all codewords have to be distinguishable from each other. For this purpose coding theory introduces the concept of distance between codewords. Two codewords can only be compared with each other if they have the same length.

Definition 2.1 (Morelos-Zaragoza, 2006):

Let C be an error-correcting code with binary elements. C is a subset of the n-dimensional binary vector space {0, 1}^n, such that its elements are as far apart as possible. In the binary space the distance is defined as the number of elements in which the vectors differ. Let x̄ = (x_0, ..., x_(n−1)) and ȳ = (y_0, ..., y_(n−1)) be vectors from C. Then the Hamming distance between x̄ and ȳ, denoted d(x̄, ȳ), is defined as

d(x̄, ȳ) = |{ i : x_i ≠ y_i, 0 ≤ i ≤ n−1 }|   (2.1)

Definition 2.2 (Morelos-Zaragoza, 2006):

The minimum Hamming distance d_min of the code C is defined as the smallest Hamming distance among all possible distinct pairs of codewords in C:

d_min = min { d(x̄, ȳ) : x̄, ȳ ∈ C, x̄ ≠ ȳ }   (2.2)
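Definitions 2.1 and 2.2 translate directly into code. The following Python sketch (an illustration added here, not part of the thesis; the codeword set is an arbitrary example) computes the Hamming distance of two equal-length binary words and the minimum distance of a small code:

```python
from itertools import combinations

def hamming_distance(x, y):
    # Definition 2.1: number of positions in which two equal-length words differ
    assert len(x) == len(y), "codewords must have the same length"
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    # Definition 2.2: smallest distance over all distinct pairs of codewords
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

code = ["0000000", "1110100", "0111010", "1101001"]  # illustrative binary code
print(hamming_distance("0000000", "1110100"))  # -> 4
print(minimum_distance(code))                  # -> 4
```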

2.2 Types of Codes

There are several aspects according to which codes may be divided.

According to the way of adding redundancy (Moreira and Farrell, 2006):

1. Systematic codes: from the k information bits, n − k redundancy bits are derived, where n is the length of the codeword. The codeword consists of the information bits followed by the redundancy bits, and the information bits of the message remain separated from the redundancy bits during transmission. Systematic codes are denoted as (n, k) codes.

2. Non-systematic codes: information bits are replaced with sequences of bits with a higher rate of redundancy. During the transmission process it is not possible to differentiate the redundancy bits from the information bits.

According to the way of correcting errors (Prause, 2001):

1. Detection codes: errors are only identified, not corrected right away. Correction may be performed by requesting retransmission of the faulty segment.

2. Correction codes: errors are identified and corrected straight away, without the need for a backward channel. The computational complexity as well as the capabilities of the correction are defined by the error-correction code and its specification.

There are two groups of codes (Gao, 2007):

1. Source coding (entropy coding): the process of compressing source data in order to achieve higher transmission efficiency.

2. Channel coding (forward error correction): the process of adding redundant bits to the message in order to ensure resistance against communication noise.

Algebraic code theory distinguishes two classes of error correction codes (Morelos-Zaragoza, 2006):

1. Linear block codes 2. Convolutional codes

Linear block codes process the message on a block-by-block basis. Each block is independent of the other blocks, i.e. block codes do not have memory. The property of linearity means that the sum as well as a scalar multiple of any two codewords yields another codeword (Shah et al., 2001). A linear block code is denoted as a triple (n, k, d), where:

n is the number of symbols of the codeword,

k is the number of information symbols to be encoded,

d is the minimum Hamming distance of the code.

Convolutional codes, in contrast to linear block codes, depend not only on the current input information but also on previous inputs and outputs, on a block-by-block or bit-by-bit basis. For this purpose the convolutional encoder has memory and thus represents a finite-state machine, whose state is determined by the content of its memory. The decoding process is performed by the Viterbi algorithm.

2.3 Errors

The main goal of coding is to prevent errors from arising in the transmitted message. The transmission channel carries a message between sender and recipient. Figure 2 illustrates the places where errors may arise due to distortion and noise.

[Block diagram: Data Source → Transmitter → Transmission Channel → Receiver → Data Recipient; distortion and noise arise at the transmitter and receiver, while distortion, attenuation and noise arise in the transmission channel.]

Figure 2: Rise of Errors

In coding we recognize two classes of errors (Moon, 2005):

1. Single bit errors (independent errors): caused by noise that affects only one signal element; various statistical characteristics are determined in order to describe their properties. In the sequence of bits only one is faulty. Multiple errors may also occur, meaning that the sequence of bits contains several independent errors. The number of independent errors is denoted by the letter t.

2. Burst errors: sequences of transmitted signal elements in which the frequency of faulty elements is higher than the frequency of correctly transmitted elements. The length of the burst error is denoted b.

2.4 Galois Fields

Coding theory uses the arithmetic of finite fields. Finite fields were studied by the French mathematician Évariste Galois and are therefore referred to as Galois fields. A finite field is an algebraic structure in which the basic mathematical operations of addition, subtraction, multiplication and division exist among its elements, and for all elements it holds that the result of any operation is again an element of the field. Finite fields consist of a finite number of elements (Moreira and Farrell, 2006).

A finite field is denoted by GF(q), where q is the order of the field and expresses the number of field elements. In general q = p^m, where p is a prime number, called the characteristic of the field, and m is a positive integer.

Definition 2.3 (Vlcek, 2004):

GF(2)[x] is the set of all polynomials of arbitrary degree with coefficients from {0, 1}, i.e. from the finite field GF(2).

Definition 2.4 (Vlcek, 2004):

A polynomial f(x) of degree m defined over GF(2) is said to be irreducible if f(x) has no factor polynomials of degree higher than zero and lower than m.

Definition 2.5 (Vlcek, 2004):

An irreducible polynomial p(x) of degree m over GF(2) is said to be primitive if the lowest positive integer n for which p(x) is a factor of x^n + 1 is n = 2^m − 1.

Reed-Solomon codes utilize extensions of the binary field GF(2), called GF(2^m). The finite field GF(2) contains two elements, {0, 1}. In GF(2^m) there is a primitive root α with which all other elements of GF(2^m) may be expressed as powers of α, besides the zero element (binary zero), which is denoted 0.

Definition 2.6:

Let GF(q) be a finite field. If there exists an element α such that every nonzero element of GF(q) is a power of α, then this finite field is called a cyclic finite field and the element α is a generator of this field.

The finite field GF(2^m) contains the elements {0, α^0, α^1, ..., α^(2^m − 2)}, in which α^0 represents binary 1. All other elements are computed from the equation

p(α) = 0   (2.3)

where p(x) is a primitive polynomial for GF(2^m).

It is possible to generate powers of α higher than 2^m − 2. These only copy the pattern according to

α^(i + 2^m − 1) = α^i   (2.4)

so α^(2^m − 1) = α^0, α^(2^m) = α^1, and so on.

Similarly, it is possible to transform an element with a negative power into an element with a positive power according to

α^(−i) = α^((2^m − 1) − i)   (2.5)

Not every polynomial over GF(2) is a primitive polynomial; valid are only those that generate GF(2^m) with no repeated elements. To find a primitive polynomial of degree m, it is necessary to generate the candidate polynomials of degree m and test them by generating GF(2^m), whereby the resulting field has to satisfy the rule of no repeated elements. Since a primitive polynomial in binary form starts and finishes with binary 1, the search is reduced to half of the possible polynomials.
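The search procedure described above can be sketched in Python (an illustration added here, not the thesis's own implementation): a degree-m candidate is primitive exactly when repeated multiplication by α visits all 2^m − 1 nonzero field elements without repetition before returning to 1.

```python
def is_primitive(p, m):
    # p is a degree-m binary polynomial as a bitmask (bit i = coefficient of x^i).
    # Primitive <=> successive powers of alpha visit all 2^m - 1 nonzero elements
    # exactly once, with alpha^(2^m - 1) returning to 1.
    if p >> m != 1:
        return False          # degree must be exactly m
    x, seen = 1, set()
    for _ in range(2 ** m - 1):
        if x in seen:
            return False      # repeated element: p is not primitive
        seen.add(x)
        x <<= 1               # multiply by alpha
        if x >> m:
            x ^= p            # reduce using p(alpha) = 0
    return x == 1

# degree-3 candidates whose binary form starts and finishes with binary 1
print([bin(p) for p in range(0b1000, 0b10000) if p & 1 and is_primitive(p, 3)])
# -> ['0b1011', '0b1101'], i.e. x^3+x+1 and x^3+x^2+1
```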

2.4.1 Arithmetic in Finite Field

All mathematical operations over GF(p) are performed modulo p. A special case occurs in GF(2), where the operations of addition and subtraction are identical and are represented by the XOR operation; adding an element to itself yields the zero element (binary zero). The operation of multiplication is represented by the AND operation. Multiplication and division are easier to perform in exponential (non-binary) form, for example α^i · α^j = α^(i+j). When multiplying, the powers are summed; when dividing, they are subtracted, as in ordinary arithmetic. If the resulting power does not belong to GF(2^m), then 2^m − 1 has to be added or subtracted according to Equation (2.4).

Table 1: Basic Operations in GF(2)

a) addition        b) multiplication     c) inversion
+ | 0  1           * | 0  1              a | -a | a^-1
0 | 0  1           0 | 0  0              0 |  0 |  -
1 | 1  0           1 | 0  1              1 |  1 |  1
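Table 1 maps directly onto bitwise operators, which is how GF(2) arithmetic is commonly implemented; a minimal check (added illustration, not from the thesis):

```python
# GF(2): addition (= subtraction) is XOR, multiplication is AND (Table 1)
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b   # addition modulo 2 equals XOR
        assert (a * b) % 2 == a & b   # multiplication modulo 2 equals AND
print("Table 1 matches XOR/AND")
```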

Example 2.1:

Let f(x) = x^7 + x^6 + x^5 + x^3 + x + 1 and g(x) = x^3 + x + 1 be polynomials defined over GF(2). Then:

Addition:
f(x) + g(x) = (x^7 + x^6 + x^5 + x^3 + x + 1) + (x^3 + x + 1) = x^7 + x^6 + x^5

Subtraction:
f(x) − g(x) = (x^7 + x^6 + x^5 + x^3 + x + 1) − (x^3 + x + 1) = x^7 + x^6 + x^5

Multiplication:
f(x) · g(x) = (x^7 + x^6 + x^5 + x^3 + x + 1)(x^3 + x + 1)
= (x^10 + x^9 + x^8 + x^6 + x^4 + x^3) + (x^8 + x^7 + x^6 + x^4 + x^2 + x) + (x^7 + x^6 + x^5 + x^3 + x + 1)
= x^10 + x^9 + x^6 + x^5 + x^2 + 1

Division:
f(x) : g(x) = (x^7 + x^6 + x^5 + x^3 + x + 1) : (x^3 + x + 1) = x^4 + x^3 with remainder x + 1,

since x^4 · (x^3 + x + 1) = x^7 + x^5 + x^4 leaves x^6 + x^4 + x^3 + x + 1, and x^3 · (x^3 + x + 1) = x^6 + x^4 + x^3 then leaves x + 1.

Arithmetic in GF(2^m) has the same rules as in GF(2); just when the multiplication of two polynomials yields a resulting polynomial of degree higher than m − 1, this polynomial is reduced by the irreducible primitive polynomial p(x).
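The reduction rule can be sketched as follows, with polynomials stored as integer bit masks. The primitive polynomial used here, p(x) = x^8 + x^4 + x^3 + x^2 + 1, is one common choice for GF(2^8) and is an assumption of this sketch, as are the function names.

```python
M = 8
P = 0b100011101          # p(x) = x^8 + x^4 + x^3 + x^2 + 1

def gf_add(a, b):
    return a ^ b         # coefficient-wise XOR, no carries

def gf_mul(a, b):
    """Carry-less multiplication followed by reduction modulo P."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    # subtract (XOR) shifted copies of P while the degree is at least M
    for shift in range(result.bit_length() - 1 - M, -1, -1):
        if result & (1 << (M + shift)):
            result ^= P << shift
    return result

# (x^7 + x^6 + x^5 + x^3)(x^3 + x + 1) reduced into the field
print(bin(gf_mul(0b11101000, 0b00001011)))
```

The carry-less product has degree 10 here; the loop cancels the bits of degree 8 and above with shifted copies of P, exactly the polynomial long division by p(x) described in the text.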

Example 2.2:

Let us have defined the polynomials f(x) and g(x) from Example 2.1. Then the mathematical operations of addition and multiplication over GF(2^8) with primitive polynomial p(x) = x^8 + x^4 + x^3 + x^2 + 1 are:

Addition gives the same result as in GF(2), f(x) + g(x) = x^7 + x^6 + x^5 + x + 1, since the degree of the sum does not exceed 7 and no reduction is needed.

Multiplication gives f(x) · g(x) = x^10 + x^9 + x^6 + x^5 + x^4 + x^3. As the degree of this product is higher than 7 (it is 10), it is needed to reduce the result by p(x):

x^10 + x^9 + x^6 + x^5 + x^4 + x^3 − x^2 · p(x) = x^9 + x^3 + x^2
x^9 + x^3 + x^2 − x · p(x) = x^5 + x^4 + x^2 + x

So f(x) · g(x) = x^5 + x^4 + x^2 + x in GF(2^8).


2.4.2 Construction of Finite Field GF(2^3)

In the case of GF(2^3), the primitive polynomial of degree 3 that generates all elements of the field may be p(x) = x^3 + x + 1. The primitive polynomial is a polynomial over GF(2), since the coefficients of its variables are elements of {0, 1}.

Let α be a primitive root of GF(2^3). As α is the root of p(x), it holds that p(α) = 0 and therefore α^3 = α + 1.

Equation (2.3) shows how to generate the elements of GF(2^3) as powers of α. Elements of the field may also be represented as polynomials over GF(2) of degree less than or equal to 2 (in general m − 1).

The mathematical operation of addition is then in this case a sum (XOR) of the coefficients of the field elements. The binary notation is a direct consequence of the polynomial one, where the coefficients of the polynomial form a binary number based on the corresponding power of the variable.

Multiplication of two field elements means a sum of the powers of α, in general α^i · α^j = α^(i+j). For example, α^3 · α^5 = α^8 = α^1. According to Equation (2.4), higher powers copy the pattern, so α^7 = α^0, α^8 = α^1, and so on. The finite field GF(2^3) is shown in Table 2.

Table 2: Elements of GF(2^3)

Exponential notation   Polynomial notation   Binary notation
0                      0                     000
α^0                    1                     001
α^1                    α                     010
α^2                    α^2                   100
α^3                    α + 1                 011
α^4                    α^2 + α               110
α^5                    α^2 + α + 1           111
α^6                    α^2 + 1               101
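Table 2 can be generated programmatically. The sketch below assumes the primitive polynomial p(x) = x^3 + x + 1 chosen above, so that α^3 = α + 1; each element is a 3-bit mask whose bits are the coefficients of α^2, α^1 and α^0.

```python
P, M = 0b1011, 3          # p(x) = x^3 + x + 1

def build_field():
    elems = []
    e = 1                 # alpha^0
    for _ in range(2**M - 1):
        elems.append(e)
        e <<= 1           # multiply by alpha
        if e & (1 << M):
            e ^= P        # apply alpha^3 = alpha + 1
    return elems          # [alpha^0, alpha^1, ..., alpha^6]

for i, e in enumerate(build_field()):
    print(f"alpha^{i} = {e:03b}")
```

Running the loop reproduces the binary column of Table 2 in exponential order.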


3 Reed-Solomon Codes

Reed-Solomon codes are codes for forward-error correction that are used in data transmission vulnerable to channel noise. Reed-Solomon codes are block codes that by adding redundant data before transmission are capable of detection and correction of errors within the block of data (Shah et al., 2001). Reed-Solomon codes are non-binary codes, i.e. signal elements are represented by group of bits.

A typical example of a Reed-Solomon code application is illustrated in Figure 3.

[Figure omitted: Data Source → RS Encoder → Transmission Channel / Medium (noise) → RS Decoder → Data Recipient]

Figure 3: Reed-Solomon Coder/Decoder

Reed-Solomon codes were invented in 1960 by the scientists Irving S. Reed and Gustave Solomon, members of the Massachusetts Institute of Technology. At the time the concept was published, digital technology was not advanced enough for its implementation. Application of Reed-Solomon codes became practical after the invention of an efficient decoding algorithm by Elwyn Berlekamp in 1968 and its modification by James Massey a year later (Reed and Solomon, 1960; Berlekamp, 1968; Massey, 1969).

Today, Reed-Solomon codes have found numerous applications in the field of digital storage and communication systems to correct burst errors. Reed-Solomon codes were first used by NASA for the transmission of digital pictures from space missions such as Voyager, Mars Pathfinder, Galileo, Cassini and others (Gao, 2007). The first application of Reed-Solomon codes in mass production was in Compact Disks, where the code can deal with error bursts of up to 4000 consecutive bits (Hulpke, 1993). Similar schemes are included also in DVDs.

Applications of Reed-Solomon codes may also be found in the latest technologies, such as digital terrestrial and satellite television (DVB), xDSL technologies, Blu-ray disks, and microwave and satellite transmission (Morelos-Zaragoza, 2006; Wicker and Bhargava, 1994).

3.1 Characteristics of Reed-Solomon codes

Reed-Solomon codes are block error-correction codes. The input consists of blocks of data to which the encoder adds redundant (parity) data. The redundant data are utilized to recover data damaged by noise during the communication transfer. The number and type of errors that can be corrected depend on the characteristics of the Reed-Solomon code.

Reed-Solomon codes are a non-binary subfamily of the BCH codes (Moreira and Farrell, 2006). They are denoted RS(n, k) with m-bit symbols, where k is the number of information symbols, each of length m bits. To the k information symbols the encoder adds parity symbols, the output of which is a codeword of length n symbols. The number of parity symbols is, therefore, n − k = 2t, each of length m bits. Non-binary means that a symbol is composed of more than one bit. The Reed-Solomon decoder allows correcting up to t symbols in a codeword, where t = (n − k) / 2. The number of parity symbols is directly proportional to the strength of the code, i.e. to the number of symbol errors which the code is able to correct.

Figure 4 is a diagram illustrating a typical Reed-Solomon codeword, encoded systematically, i.e. the information data is unchanged and the parity symbols are attached.

[Figure omitted: a codeword of n symbols, consisting of k information symbols followed by 2t parity symbols]

Figure 4: Reed-Solomon Codeword

Where in Figure 4:

n denotes the total number of symbols in the codeword
k denotes the number of information symbols
2t denotes the number of parity symbols

Example 3.1:

Let us have the popular RS(255, 223) Reed-Solomon code with 8-bit symbols. Every codeword consists of 255 bytes (1 symbol = 1 byte), of which 223 bytes are information and the remaining 32 bytes are parity.

For this code: n = 255, k = 223, m = 8 and 2t = 32, so t = 16.

The decoder is able to correct 16 faulty symbols anywhere within the codeword, i.e. in our example the code may correct up to 16 corrupted bytes.

The computational complexity of encoding and decoding Reed-Solomon codes is directly proportional to the number of parity symbols in the codeword. The more parity symbols are attached to the information symbols, the more errors the code is able to correct, but the more computing power it requires.

For a defined symbol size m, the maximal length of a Reed-Solomon codeword is n = 2^m − 1 symbols. For example, the maximum length of a codeword with 8-bit symbols (m = 8) is 255 bytes.

A Reed-Solomon code may be conceptually shortened so that not all information symbols are used. The redundant symbols are computed from all (also the zero) information symbols; however, only the used portion of the information symbols is transferred. The decoder adds back the zero information symbols and decodes the message.

Example 3.2:

The (255, 223) code may be shortened to a (200, 168) code. The encoder takes a block of 168 symbols, conceptually adds 55 zero symbols, and generates a (255, 223) codeword, but transmits only the 168 information symbols and the 32 parity symbols.

One of the characteristics of Reed-Solomon codes is that a symbol error occurs whether one or all bits within the symbol are corrupted. The code mentioned above with 8-bit symbols is able to correct 16 symbol errors. In the worst case, this means 16 bit errors, when each faulty symbol (byte) has just one wrong bit. In the best case, it is 16 × 8 bit errors, when all 16 symbols are completely damaged.

Therefore, Reed-Solomon codes are particularly suitable for the correction of burst errors, where long runs of consecutive faulty bits are received.

3.2 Encoding Reed-Solomon Codes

Reed-Solomon codes are algebraic codes, in which polynomials over GF(2^m) represent the message that should be encoded and the encoded codeword. In other words, input and output data consist of symbols that are elements of GF(2^m). These elements are arranged as polynomial coefficients, and the power of the polynomial variable indicates the order in which the encoder and decoder process the associated symbol: in an information polynomial i(x) = i_(k−1)·x^(k−1) + … + i_1·x + i_0, the symbol i_(k−1) will be processed by the encoder first and the symbol i_0 last (Shah et al., 2001).

Example 3.3:

Let us have a code with 4-bit symbols (m = 4). Then the Galois field GF(2^4) is used for encoding and decoding. All symbols are elements of this Galois field, all polynomials are polynomials over GF(2^4), and also the arithmetic used for encoding and decoding is that of this Galois field.

Systematic encoding, as already described in the previous chapter, attaches parity symbols to the information symbols. From the polynomial point of view, the information symbols are shifted to higher power coefficients (by multiplication with x^(n−k)) and the parity symbols are appended after them. The parity symbols are calculated as the remainder from the division of the shifted information polynomial by the generator polynomial. By this the codeword is created.

The equation defining the systematic encoding of Reed-Solomon codes is (Shah et al., 2001):

c(x) = x^(n−k)·i(x) + [x^(n−k)·i(x) mod g(x)]    (3.1)

where:

c(x) denotes the codeword polynomial of degree n − 1
i(x) denotes the information polynomial of degree k − 1
g(x) denotes the generator polynomial of degree n − k
[x^(n−k)·i(x) mod g(x)] denotes the parity polynomial of degree at most n − k − 1.

In general, the generator polynomial for Reed-Solomon codes is (Shah et al., 2001):

g(x) = ∏_{i=b}^{b+2t−1} (x − α^i) = (x − α^b)(x − α^(b+1)) ⋯ (x − α^(b+2t−1))    (3.2)

where b may be assigned any integer; however, it is usually assigned the value b = 1. After applying b = 1 we get:

g(x) = (x − α)(x − α^2) ⋯ (x − α^(2t))    (3.3)

Non-systematic encoding, in comparison to the systematic one, does not reside in division of polynomials, but in multiplication of the information polynomial by the generator polynomial:

c(x) = i(x)·g(x)    (3.4)


The resulting codeword is valid, since it is divisible by the generator polynomial without a remainder. However, a codeword encoded non-systematically has to be additionally decoded even if the information was not damaged during the transmission. Therefore the main advantage of systematic encoding is that decoding is only needed to determine whether the received codeword is valid; if so, the message is contained in the first k symbols of the codeword.

It is important to note that all mentioned polynomials, including g(x), are polynomials over GF(2^m). An important characteristic of g(x) is that it divides x^n − 1 without remainder.
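The steps above (generator polynomial per Equation (3.3), then the remainder per Equation (3.1)) can be sketched for a small RS(7, 5) code over GF(2^3) with t = 1. The field tables assume p(x) = x^3 + x + 1 and b = 1; polynomials are coefficient lists with the lowest power first, and all names are illustrative.

```python
M, N, K, P = 3, 7, 5, 0b1011

EXP = [0] * (2 * N)               # antilog table: EXP[i] = alpha^i
LOG = [0] * (N + 1)               # log table: LOG[alpha^i] = i
e = 1
for i in range(N):
    EXP[i] = EXP[i + N] = e
    LOG[e] = i
    e <<= 1
    if e & (1 << M):
        e ^= P                    # reduce by the primitive polynomial

def gmul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gmul(a, b)
    return out

def poly_eval(p, x):
    acc = 0
    for c in reversed(p):         # Horner's rule
        acc = gmul(acc, x) ^ c
    return acc

def generator(two_t, b=1):
    """g(x) = (x - alpha^b)...(x - alpha^(b+2t-1)); minus is plus here."""
    g = [1]
    for j in range(b, b + two_t):
        g = poly_mul(g, [EXP[j], 1])
    return g

def encode(msg):
    """Systematic codeword: parity = remainder of x^(n-k)*i(x) / g(x)."""
    g = generator(N - K)
    rem = [0] * (N - K) + list(msg)       # shift message up by n - k
    for i in range(N - 1, N - K - 1, -1):
        coef = rem[i]
        if coef:                          # cancel leading term with g(x)
            for j, gc in enumerate(g):
                rem[i - (N - K) + j] ^= gmul(coef, gc)
    return rem[:N - K] + list(msg)        # parity symbols, then message

codeword = encode([1, 2, 3, 4, 5])
print(codeword)
```

Evaluating the resulting codeword at α and α^2 gives zero, i.e. it is divisible by g(x), and the message appears unchanged in the high-order symbols, as systematic encoding requires.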

Example 3.4:

Let us use the Galois field GF(2^3) assembled in Section 2.4.2, with primitive polynomial p(x) = x^3 + x + 1 and b = 1. Let the Reed-Solomon code be an RS(7, 5) code with 3-bit symbols (m = 3, n = 7, k = 5). This code is able to correct t = 1 error symbol (2t = n − k = 2).

Let the message to be transmitted be given in binary form. The defined Reed-Solomon code processes the message in blocks of 3 bits and assigns to every symbol the exponential notation according to Table 2 from the previous chapter.

The procedure of encoding the message is as follows:

The message enters the encoder sequentially, starting with the highest power of the variable x, forming the information polynomial i(x).

The generator polynomial has the form:

g(x) = (x − α)(x − α^2) = x^2 + (α + α^2)x + α^3

Since α + α^2 = α^4, according to the arithmetic rules of GF(2^3), the generator polynomial has the following form:

g(x) = x^2 + α^4·x + α^3

According to formula (3.1), the message has to be shifted to higher power coefficients by multiplying by x^(n−k) = x^2:


After the information symbols, the parity symbols are attached; they are computed as the remainder of the division of the shifted information polynomial by the generator polynomial:

[Worked long division omitted: x^2·i(x) is divided by g(x) = x^2 + α^4·x + α^3. In each step, the leading term of the running remainder is cancelled by the corresponding multiple of g(x), until a remainder of degree less than 2 is left. The two coefficients of this final remainder are the parity symbols.]

The calculated remainder is attached to the shifted information polynomial and assembles the final codeword.

After transcription into the binary form, we get the encoded message that is the output of the encoder.

The trailing 2t symbols make up the parity part of the message. As can be seen, the information is directly contained in the first k symbols as the result of the systematic encoding.

3.3 Decoding Reed-Solomon Codes

The logic for decoding Reed-Solomon codes is similar to that for binary BCH codes. There exist many algorithms for decoding Reed-Solomon codes. This section presents the basic general steps of decoding.


Algebraic decoding of Reed-Solomon codes consists of the following steps (Moon, 2005):

1. Syndrome computation.

2. Determination of the error-locator polynomial σ(x), the multiplicative inverses of whose roots determine the locations of the errors. There are several algorithms for computing the error-locator polynomial:

a. Berlekamp-Massey algorithm
b. Euclidean algorithm
c. Direct solution

3. Finding the roots of the error-locator polynomial. This step is typically accomplished using the Chien search algorithm.

4. Calculation of the error values (magnitudes of the errors). For this purpose the error-locator polynomial is used, from which the error-evaluator polynomial is calculated and afterwards the magnitudes of the errors. The usual way of doing this is Forney's algorithm.

The decoding process may be illustrated by the diagram shown in Figure 5.

[Figure omitted: r(x) → syndrome computation s(x); if s(x) = 0, output c(x) directly; otherwise determination of the error-locator polynomial σ(x) → determination of error locations (roots of σ(x)) → calculation of error values → determination of the error polynomial e(x) → correction of errors r(x) + e(x) → c(x)]

Figure 5: Decoding Process

After encoding a message there is a codeword polynomial

c(x) = c_(n−1)·x^(n−1) + … + c_1·x + c_0    (3.5)

which is exposed to noise disturbance during the transmission. This modified message is then the input to the decoder and is represented by the received polynomial

r(x) = r_(n−1)·x^(n−1) + … + r_1·x + r_0    (3.6)

The received polynomial is associated with the error polynomial e(x), between which the following relation holds:

r(x) = c(x) + e(x)    (3.7)


Let the received polynomial contain ν non-zero error elements, which means that ν errors occurred during the transmission, residing at positions l_1, l_2, …, l_ν, where 0 ≤ l_i ≤ n − 1 and ν ≤ t. Then the form of the error polynomial is

e(x) = e_(l1)·x^(l1) + e_(l2)·x^(l2) + … + e_(lν)·x^(lν)    (3.8)

After calculating the coefficients and powers of the error polynomial we get the sets {e_(l1), …, e_(lν)} and {l_1, …, l_ν}. The first set contains the magnitudes of the errors, the second one the locations of those errors. For binary codes all magnitudes equal 1.

After determining the error polynomial (3.8), the errors in the received polynomial are corrected according to Equation (3.7). In case of sufficient error-correction capability of the Reed-Solomon code (defined by its characteristic t), the sum of the received polynomial with the error polynomial gives the original codeword polynomial which was the output of the encoder.
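Since addition in GF(2^m) is a bit-wise XOR, the correction r(x) + e(x) of Equation (3.7) amounts to XOR-ing the error magnitudes onto the received symbols. A tiny illustration with made-up 3-bit symbols (the values are illustrative, not from a specific codeword):

```python
# received symbols r(x) and the recovered error polynomial e(x),
# both as coefficient lists of 3-bit symbols
received = [0b101, 0b011, 0b110, 0b001]
error    = [0b000, 0b111, 0b000, 0b000]   # one corrupted symbol at position 1

# symbol-wise XOR flips the corrupted symbol back to its original value
corrected = [r ^ e for r, e in zip(received, error)]
print(corrected)
```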

3.3.1 Syndrome Computation

Syndromes are the evaluations of the received polynomial at each root of the generator polynomial (Morelos-Zaragoza, 2006). Since the roots of the generator polynomial are also roots of every polynomial that is a multiple of g(x), they are also roots of the codeword polynomial.

S_i = r(α^(b+i−1)) = c(α^(b+i−1)) + e(α^(b+i−1)) = e(α^(b+i−1)), i = 1, 2, …, 2t    (3.9)

By expanding (3.9) we get a sequence of algebraic syndrome equations:

S_1 = r(α^b) = e_(l1)·(α^b)^(l1) + e_(l2)·(α^b)^(l2) + … + e_(lν)·(α^b)^(lν)
S_2 = r(α^(b+1)) = e_(l1)·(α^(b+1))^(l1) + e_(l2)·(α^(b+1))^(l2) + … + e_(lν)·(α^(b+1))^(lν)
⋮
S_2t = r(α^(b+2t−1)) = e_(l1)·(α^(b+2t−1))^(l1) + e_(l2)·(α^(b+2t−1))^(l2) + … + e_(lν)·(α^(b+2t−1))^(lν)    (3.10)

These equations might be, by use of the error-locator polynomial

σ(x) = ∏_{i=1}^{ν} (1 − α^(l_i)·x)    (3.11)

written as a system of linear equations. The roots of the error-locator polynomial are equal to the inverses of the error locations. Then the following relation between the coefficients of σ(x) and the syndromes exists (Morelos-Zaragoza, 2006):

| S_1   S_2     …  S_ν      |   | σ_ν     |   | −S_(ν+1) |
| S_2   S_3     …  S_(ν+1)  | · | σ_(ν−1) | = | −S_(ν+2) |
| ⋮                         |   | ⋮       |   | ⋮        |
| S_ν   S_(ν+1) …  S_(2ν−1) |   | σ_1     |   | −S_(2ν)  |    (3.12)

(over GF(2^m) the minus signs may be dropped, since subtraction equals addition)

Equation (3.12) is also called the key equation in the decoding of Reed-Solomon codes, and its solution is a computationally extensive operation. Finding the solution of this equation means finding the error-locator polynomial.

If all syndromes are equal to zero, then the codeword has not been altered during the transmission and the decoding algorithm for the given block of data has finished.
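A sketch of the syndrome computation: S_i = r(α^(b+i−1)) for i = 1, …, 2t, i.e. the received polynomial evaluated at the roots of the generator polynomial. The GF(2^3) tables assume p(x) = x^3 + x + 1 and b = 1; names are illustrative. A convenient test vector is g(x) itself, since every multiple of g(x), including g(x)·1, is a valid codeword.

```python
M, N, P = 3, 7, 0b1011
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
e = 1
for i in range(N):
    EXP[i] = EXP[i + N] = e
    LOG[e] = i
    e <<= 1
    if e & (1 << M):
        e ^= P

def gmul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def poly_eval(p, x):
    acc = 0
    for c in reversed(p):         # Horner's rule
        acc = gmul(acc, x) ^ c
    return acc

def syndromes(received, two_t, b=1):
    """S_i = r(alpha^(b + i - 1)) for i = 1..2t."""
    return [poly_eval(received, EXP[b + i]) for i in range(two_t)]

# g(x) = x^2 + alpha^4 x + alpha^3 is itself a valid codeword of the
# (7, 5) code, so its syndromes vanish; flipping one symbol does not.
print(syndromes([3, 6, 1, 0, 0, 0, 0], 2))   # [0, 0]
print(syndromes([3, 6, 1, 0, 0, 0, 1], 2))   # non-zero syndromes
```

All-zero syndromes end the decoding for the block, exactly as stated above.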

3.3.2 Determination of the Error-locator Polynomial

Common methods for solving the Key Equation (3.12) and thus computing the error-locator polynomial are (Morelos-Zaragoza, 2006):

1. Berlekamp-Massey algorithm

This algorithm, named after its inventors, is a computationally effective method of solving the key equation in terms of the number of operations in GF(2^m). This method is particularly popular in software decoders.

2. Euclidean algorithm

This algorithm solves the key equation in polynomial form. Thanks to its regular structure, the algorithm is widely used in hardware implementations of BCH and Reed-Solomon codes.

3. Direct solution

This method, also called the Peterson-Gorenstein-Zierler decoder, directly finds the coefficients of the error-locator polynomial using standard techniques for solving (3.12) as a set of linear equations. The complexity of inverting a matrix grows with the cube of the error-correcting capability t, and therefore the direct method can only be used for small values of t.

Since the Euclidean algorithm has a regular structure and works with polynomials, in this thesis I focus on its description. The Euclidean algorithm is one of the oldest algorithms in the world, named after its inventor. Originally, it was formulated for finding the greatest common divisor (GCD) of two integers. Later, it was extended also for use with polynomials.

Definition 3.1:

For numbers: if d = GCD(a, b), then there exist integers u and v such that:

d = u·a + v·b

For polynomials: if d(x) = GCD(a(x), b(x)), then there exist polynomials u(x) and v(x) such that:

d(x) = u(x)·a(x) + v(x)·b(x)

For an arbitrary integer b ≠ 0 it holds:

GCD(a, b) = GCD(b, a mod b)

For an arbitrary polynomial b(x) ≠ 0 it holds:

GCD(a(x), b(x)) = GCD(b(x), a(x) mod b(x))

In order to be able to use the extended Euclidean algorithm for decoding Reed-Solomon codes, it is necessary to define also the following polynomials.

Let the syndrome polynomial be defined as (Morelos-Zaragoza, 2006)

S(x) = S_1 + S_2·x + S_3·x^2 + … + S_2t·x^(2t−1)    (3.13)

and the error-evaluator polynomial defined as

Ω(x) = σ(x)·S(x) mod x^2t    (3.14)

The problem of decoding may be formulated as finding the error-evaluator polynomial Ω(x) satisfying Equation (3.14). This may be performed by the extended Euclidean algorithm for the polynomials x^2t and S(x), such that if at the j-th iteration

r_j(x) = a_j(x)·x^2t + b_j(x)·S(x)    (3.15)

with degree deg r_j(x) < t, then Ω(x) corresponds to r_j(x) and σ(x) to b_j(x), up to a common scaling constant. From the decoding point of view, the polynomial a_j(x) is not important.


Steps of the Euclidean algorithm (Morelos-Zaragoza, 2006):

Inputs:

r_0(x) = x^2t, r_1(x) = S(x)

Initial conditions:

b_0(x) = 0, b_1(x) = 1

At iteration j ≥ 2, long division is applied to the polynomials r_(j−2)(x) and r_(j−1)(x),

r_j(x) = r_(j−2)(x) mod r_(j−1)(x), q_j(x) = quotient[r_(j−2)(x) / r_(j−1)(x)]

and b_j(x) = q_j(x)·b_(j−1)(x) + b_(j−2)(x) is computed. The cycle stops at the iteration j for which

deg r_j(x) < t    (3.16)

Then σ(x) = λ·b_j(x) and Ω(x) = λ·r_j(x), where λ is a constant chosen so that σ(0) = 1.

Pseudo code of the extended Euclidean algorithm:

Let f(x) be an irreducible polynomial that defines the Galois field GF(2^m). Let a(x) be an element of this field whose multiplicative inverse is to be found. Then the algorithm for finding the inverse is as follows [Wikipedia, 2010]:


remainder[1] := f(x)
remainder[2] := a(x)
auxiliary[1] := 0
auxiliary[2] := 1
i := 2
while remainder[i] > 1
    i := i + 1
    remainder[i] := remainder(remainder[i-2] / remainder[i-1])
    quotient[i]  := quotient(remainder[i-2] / remainder[i-1])
    auxiliary[i] := -quotient[i] * auxiliary[i-1] + auxiliary[i-2]
inverse := auxiliary[i]
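A runnable sketch of the pseudo code above, with polynomials over GF(2) stored as integer bit masks (bit i = coefficient of x^i); function names are illustrative.

```python
def poly_divmod(a, b):
    """Quotient and remainder of a(x) / b(x) over GF(2)."""
    q = 0
    while a and a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        q |= 1 << shift
        a ^= b << shift              # subtraction is XOR
    return q, a

def poly_mulmod(a, b, f):
    """a(x) * b(x) mod f(x) over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = a << 1, b >> 1
    return poly_divmod(r, f)[1]

def gf_inverse(a, f):
    """Inverse of a(x) in GF(2)[x] / f(x), mirroring the pseudo code."""
    remainder = [f, a]               # remainder[i-2], remainder[i-1]
    auxiliary = [0, 1]
    while remainder[-1] > 1:         # stop when the remainder is the constant 1
        q, r = poly_divmod(remainder[-2], remainder[-1])
        remainder.append(r)
        # over GF(2), -q*aux + aux_prev is simply q*aux XOR aux_prev
        auxiliary.append(poly_mulmod(q, auxiliary[-1], f) ^ auxiliary[-2])
    return auxiliary[-1]

# in GF(2^3) with f(x) = x^3 + x + 1, the inverse of alpha (0b010)
# is alpha^6 (0b101), since alpha^7 = 1
print(bin(gf_inverse(0b010, 0b1011)))
```

Multiplying the result back, poly_mulmod(a, gf_inverse(a, f), f) returns 1, which is a quick self-check of the routine.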

The array auxiliary[] in the pseudo code represents the polynomial v(x) for which v(x)·a(x) ≡ 1 (mod f(x)).

Example 3.5:

Let the code be the RS(7, 5) code over GF(2^3) from Example 3.4, with the primitive polynomial p(x) = x^3 + x + 1 generating the field and b = 1.

Let r(x) be the received polynomial.

According to Equations (3.10), the syndromes are computed by evaluating r(x) at α and α^2.

The syndrome polynomial is then constructed according to Equation (3.13).

The input conditions of the Euclidean algorithm are set from x^2t and the syndrome polynomial.

Iterating as described above, the algorithm stops, according to condition (3.16), as soon as deg r_j(x) < t.

The result of the Euclidean algorithm is the error-locator polynomial σ(x).

It is important to note that the extended Euclidean algorithm also computes the error-evaluator polynomial Ω(x) at the same time, as Ω(x) = λ·r_j(x).

3.3.3 Finding roots of error-locator polynomial

Computation of the roots of a polynomial with coefficients over a Galois field is performed by the Chien search. It is a method of trial and error: all non-zero elements α^i are substituted into the polynomial and the condition σ(α^i) = 0 is tested. Although this method is computationally intensive, the Chien algorithm is so far the only practical way of finding the roots of polynomials over GF(2^m) (Morelos-Zaragoza, 2006).

The multiplicative inverses of the roots of the error-locator polynomial represent the positions of the errors in the received polynomial r(x).
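A sketch of the Chien search: every non-zero element α^0, …, α^(n−1) is substituted into σ(x) and the exponents where the polynomial evaluates to zero are collected; the multiplicative inverse of each root then marks an error position. The GF(2^3) tables assume p(x) = x^3 + x + 1, and the example σ(x) is illustrative.

```python
M, N, P = 3, 7, 0b1011
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
e = 1
for i in range(N):
    EXP[i] = EXP[i + N] = e
    LOG[e] = i
    e <<= 1
    if e & (1 << M):
        e ^= P

def gmul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def poly_eval(p, x):
    acc = 0
    for c in reversed(p):
        acc = gmul(acc, x) ^ c
    return acc

def chien_search(sigma):
    """Exponents i with sigma(alpha^i) = 0, tried over all non-zero elements."""
    return [i for i in range(N) if poly_eval(sigma, EXP[i]) == 0]

# sigma(x) = 1 + alpha^6 x locates a single error: its root is alpha^1,
# whose multiplicative inverse alpha^6 marks error position 6
roots = chien_search([1, EXP[6]])
positions = [(N - i) % N for i in roots]   # invert each root exponent
print(roots, positions)
```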

Example 3.6:

Let the error-locator polynomial σ(x) have coefficients from GF(2^3), as in Example 3.5.

Substituting all non-zero elements into the error-locator polynomial:

For exactly one element α^i it holds that σ(α^i) = 0, which means that α^i is the only root of the error-locator polynomial σ(x). The multiplicative inverse of this root is α^(7−i) and, according to Equation (2.4), it marks the position of the error in the received polynomial.

3.3.4 Calculation of Error Values

Evaluation of the error values (their magnitudes) at the positions l_1, …, l_ν is done by the Forney algorithm. All the previous steps have led to the construction of the polynomials needed for the equation (Morelos-Zaragoza, 2006):

e_(li) = Ω(β_i^(−1)) / σ'(β_i^(−1)), where β_i = α^(l_i)    (3.17)

(valid for b = 1), where:

Ω(x) denotes the error-evaluator polynomial
σ'(x) denotes the formal derivative of the error-locator polynomial with respect to x

After the construction of the error polynomial according to Equation (3.8), the original codeword polynomial is computed according to Equation (3.7). The original message resides in the first k symbols (each of length m bits) of the codeword polynomial.
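A sketch of Forney's algorithm for Equation (3.17) with b = 1: the error magnitude at location β = α^l is Ω(β^−1) / σ'(β^−1), where the formal derivative over GF(2^m) keeps only the odd-power terms. The GF(2^3) tables assume p(x) = x^3 + x + 1, and the single-error values used below are illustrative.

```python
M, N, P = 3, 7, 0b1011
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
e = 1
for i in range(N):
    EXP[i] = EXP[i + N] = e
    LOG[e] = i
    e <<= 1
    if e & (1 << M):
        e ^= P

def gmul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def gdiv(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % N]

def poly_eval(p, x):
    acc = 0
    for c in reversed(p):
        acc = gmul(acc, x) ^ c
    return acc

def formal_derivative(p):
    """In characteristic 2 only the odd-power terms survive differentiation."""
    return [c if i % 2 == 1 else 0 for i, c in enumerate(p)][1:]

def error_value(omega, sigma, x_inv):
    """Forney: Omega(X^-1) / sigma'(X^-1) at the inverted error location."""
    num = poly_eval(omega, x_inv)
    den = poly_eval(formal_derivative(sigma), x_inv)
    return gdiv(num, den)

# single error of value 1 at position 6: syndromes are [alpha^6, alpha^5],
# sigma(x) = 1 + alpha^6 x, and Omega(x) = sigma(x)S(x) mod x^2 = alpha^6
omega, sigma = [EXP[6]], [1, EXP[6]]
x_inv = EXP[1]                    # inverse of the error location alpha^6
print(error_value(omega, sigma, x_inv))
```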

Example 3.7:

Let the code be the same RS(7, 5) code over GF(2^3), with primitive polynomial p(x) = x^3 + x + 1 and b = 1, as in Example 3.4. Let the received polynomial r(x) be as in Example 3.5.

As already mentioned, the Euclidean algorithm calculates the error-locator as well as the error-evaluator polynomial at the same time. In the last, j-th, iteration of the Euclidean algorithm it holds:
