Turbo Code Performance Analysis Using Hardware Acceleration

Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2016

Turbo Code Performance Analysis using Hardware Acceleration


Oskar Nordmark
LiTH-ISY-EX--16/5010--SE

Supervisor: Niclas Wiberg

Ericsson AB

Examiner: Oscar Gustafsson

isy, Linköpings universitet

Computer Engineering, Department of Electrical Engineering

Linköping University SE-581 83 Linköping, Sweden


Abstract

The upcoming 5G mobile communications system promises to enable use cases requiring ultra-reliable and low latency communications. Researchers therefore require more detailed information about aspects such as channel coding performance at very low block error rates. The simulations needed to obtain such results are very time consuming, and this poses a challenge to studying the problem. This thesis investigates the use of hardware acceleration for performing fast simulations of turbo code performance. Special interest is taken in investigating different methods for generating normally distributed noise based on pseudo-random number generator algorithms executed in DSPs. A comparison is also made regarding how well different simulator program structures utilize the hardware.

Results show that even a simple program for utilizing parallel DSPs can achieve good usage of hardware accelerators and enable fast simulations. It is also shown that for the studied process the bottleneck is the conversion of hard bits to soft bits with addition of normally distributed noise. It is indicated that methods for noise generation which do not adhere to a true normal distribution can further speed up this process and yet yield simulation quality comparable to methods adhering to a true Gaussian distribution. Overall, it is shown that the proposed use of hardware acceleration in combination with the DSP software simulator program can, in a reasonable time frame, generate results for turbo code performance at block error rates as low as 10^-9.


Acknowledgments

This work was performed at Ericsson Research in Linköping, Sweden, during the spring of 2016.

I would very much like to thank my supervisor Niclas Wiberg for proposing this work and giving me the possibility to perform it. It has truly been a privilege to work with you and to share in your knowledge and insight, as well as your continuous input and guidance.

I would also like to thank professor Oscar Gustafsson at Linköping University for his time and the good discussions. My thanks also extend to Patrik Sträng, Mirsad Cirkic and the Flake team in Linköping for their help and explanations.

Lastly I would like to thank all the colleagues at LINLAB for taking an interest in my work and, above all, for making my stay so enjoyable.

Stockholm, April 2017
Oskar Nordmark


Contents

Notation ix

1 Introduction 1
1.1 Purpose 1
1.2 Problem formulation 2
1.3 Related work 2
1.4 Method 3
1.5 System overview 3
1.6 Limitations 3
1.6.1 Random numbers 4
1.6.2 Channel model 4
1.6.3 Modulation 4

2 Background Theory 5
2.1 The binomial distribution and interval estimation 5
2.2 Channel coding and turbo codes 6
2.3 Pseudo-random number generation 7
2.3.1 Matrix linear recurrence modulo 2 8
2.3.2 Equidistribution 8
2.3.3 Mersenne Twister 9
2.3.4 Xorshift 9
2.4 Methods for generating normally distributed variables 12
2.4.1 The Box-Müller method 12
2.4.2 The Marsaglia Polar method 13
2.4.3 The Ziggurat method 13
2.5 Hardware acceleration for turbo codes 15

3 Method 17
3.1 Fundamental simulator model 17
3.2 General principles of the simulator setup 18
3.2.1 Soft bits and noise addition 18
3.2.2 SNR calculations 19
3.2.3 Number of coding attempts 20
3.3 Simulator program structure 20
3.3.1 Busy-wait simulator structure 21
3.3.2 Callback simulator structure 22
3.4 The table method for noise generation 23

4 Results 27
4.1 BLER curves 27
4.1.1 Reference curves and axis 27
4.1.2 Different noise methods 29
4.1.3 Different simulator structures 31
4.2 Execution time for PRNG methods 33
4.3 Normal distribution methods 33
4.4 Hard to soft conversion with noise addition 36
4.5 Simulator throughput 40
4.6 Distribution of execution time 43
4.7 Low error-rate BLER-curves 44

5 Conclusion and Discussion 47
5.1 Simulator output and speed 47
5.2 Program structure 48
5.3 Bottleneck 49
5.4 Normally distributed noise 51

6 Future Work 53
6.1 Improved usability of the simulator 53
6.2 Comparison with theoretical properties 53
6.3 Optimizations of the simulator 54

A Normal distribution table 57


Notation

Abbreviations

Abbreviation Definition
LTE Long Term Evolution
3GPP 3rd Generation Partnership Project
SoC System on Chip
PRNG Pseudo Random Number Generator
DSP Digital Signal Processor
BLER Block error rate
SNR Signal to noise ratio
CDF Cumulative distribution function
PDF Probability density function


1

Introduction

The evolution of mobile communication systems over the past decades has brought about a new technological generation about every ten years. In the current discussions of the next technological generation, 5G, many in the industry point out the demand for low latency and high reliability as important requirements [12, 23].

One area of high interest in this context is the channel coding used for controlling errors in the transmission of data on a wireless channel. In the discussions for 5G there are proposals for different channel coding techniques such as polar codes and low density parity check (LDPC) codes [11]. In the current LTE system the channel coding used is turbo coding [22, 9].

In some use cases for 5G the high requirements on the communication conditions correspond to a maximum block error rate of 10^-9 and a latency down to a millisecond or less [12]. This creates a need to be able to perform simulations in order to evaluate the performance of proposed algorithms for this type of context. Obtaining accurate results in the case of very low error probabilities requires a very large number of simulations. This emphasises the need for quick simulations and motivates the use of hardware acceleration.

1.1

Purpose

The aim of this work is to study how simulations concerning channel coding can be carried out utilizing hardware acceleration. The overall objective with this type of approach is to achieve faster simulations than can be accomplished using a software implementation running on a general purpose processor.

More specifically, this thesis studies the use of system-on-chip (SoC) products containing hardware acceleration for turbo codes conforming to the 3GPP specification of the turbo codes used in LTE. The system used in this thesis has several digital signal processors (DSPs) as well as a hardware accelerator for turbo codes. In order to study the use of hardware acceleration for turbo code performance simulations, a simulator set-up has to be implemented for the specific system.

One key factor needed for this type of simulator set-up to study the performance of the error-correcting codes is the simulation of random noise. In this thesis this is accomplished by implementing pseudo-random number generation (PRNG) algorithms on the DSPs, which ensures repeatability of the simulations. The use of well known and tested standard methods also provides a reliable basis for discussing the statistical quality of the results.

The various performance limitations and bottlenecks of the simulator set-up are of interest for determining the capabilities of the simulator and of the type of hardware utilized.

1.2

Problem formulation

This thesis aims to answer the following:

1. How can a simulator be set up in a multi core environment to enable fast simulations using accelerators?

2. Which are the limitations on simulation speed and accuracy with the em-ployed simulator set-up and hardware?

3. How efficiently can random numbers be generated in the system using PRNG on DSP processors? Can a potential performance improvement motivate the use of other approaches or methods?

4. How well does the LTE turbo code perform at very low block error rates? What is the general shape of this curve in the range of block error rates from 10^-4 to 10^-9?

1.3

Related work

Since turbo codes were introduced by Berrou et al. in 1993 [4], the algorithm has been studied and improved or adapted by many researchers and also implemented in many systems [10, 22].

There have been many results presented concerning the performance of turbo codes, both in the form of analytical work and empirical studies where the simulated performance is presented. There has been work presented concerning the specific turbo coding used in LTE, but much of it focuses on relatively high BLER values. In many cases, a BLER value of 10% is seen as a reasonable error rate to perform transmissions around.

To the best of our knowledge, there has not been any work presented concerning the use of dedicated hardware acceleration in simulations of turbo code performance.


1.4

Method

In this thesis, an implementation of a simulator using DSP and hardware acceleration components is presented. Several different PRNG methods have been researched through literature studies and some of these methods have been implemented. Methods for generating normally distributed values have also been researched and implemented for the simulator. A very simple method for generating normally distributed variables which was not found in the literature studies is also implemented and presented.

The implemented simulator is benchmarked by recording the simulation results and the execution time, both for the overall program and for different segments and algorithms. These results are then used to analyse the performance and capability of the simulator and the hardware system.

The software for the simulator is implemented in C and the simulation results are compared to reference results from an existing simulator. It should however be stated that this thesis does not aim to guarantee an implementation completely free of flaws or bugs. However, care is taken to verify that the results appear reasonable and relevant for the conclusions.

1.5

System overview

The simulator set-up studied in this thesis utilises both DSP resources and hardware acceleration resources. The system is schematically illustrated in figure 1.1.

Figure 1.1: An illustration of the type of hardware system studied for performing simulations: a SoC system with multiple DSPs and hardware accelerators for turbo encoding and decoding.

1.6

Limitations

This section presents the limitations used to narrow the scope of the thesis. The limitations are divided into the three subsections below.


1.6.1

Random numbers

The implemented simulator should ideally replicate the random nature of a transmission channel. Generation of seemingly random numbers has been an area of interest for a very long time. In 1951 the article "Various techniques used in connection with random digits" by John von Neumann was published in the National Bureau of Standards Applied Mathematics Series [24]. Von Neumann therein discusses the use of either physical methods or algorithmic methods to generate random numbers. Physical methods can generate true randomness but make it impossible to repeat the process to check for errors. On the other hand, von Neumann very neatly summarizes a fundamental problem of algorithmic methods as "Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin." [24].

These principal concerns are still relevant today. If repeatability is required, then methods with true randomness are abandoned in favour of algorithmic pseudo-random methods. However, an algorithmic method will never generate an actually random number, and von Neumann points out that these "cooking recipes" for making digits "should merely be judged by their results. Some statistical study of the digits generated by a given recipe should be made, but exhaustive tests are impractical."

For the simulator set-up presented in this thesis, only PRNG methods implemented in software and executed on the DSPs are studied.

1.6.2

Channel model

The simulator set-up used in this thesis is based on an AWGN channel model. The concept of a channel in the context of a communication link is introduced in section 2.2, and the AWGN model is presented in section 3.1. This thesis will not discuss the relevance of AWGN channels or how appropriate an AWGN model is for this type of simulation; we simply observe that the model is commonly used, and we apply it in this context of utilising hardware acceleration for performing simulations. For a further discussion of channel models and background information concerning random signals, see for example [13, 30, 8].

1.6.3

Modulation

When data is actually transferred over a transmission medium it is necessary to have some form of modulation which maps the binary information sequence to a continuous physical signal that can be transmitted. In LTE, several different modulation techniques are available, such as quadrature phase-shift keying (QPSK) and quadrature amplitude modulation (QAM) [9]. This thesis will not study the possible effect of using different modulation schemes, and no actual modulation to a physical signal will ever take place. We will only use a simple binary phase-shift keying (BPSK) model where a logical one or zero is mapped to a positive or negative value.
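As a sketch, the BPSK mapping described above could look like the following C routine. The function name and the sign convention (a zero mapped to the positive amplitude) are illustrative assumptions of this sketch; the thesis only specifies that ones and zeros are mapped to values of opposite sign.

```c
#include <stddef.h>

/* Hypothetical BPSK soft-value mapper: bit 0 -> +amp, bit 1 -> -amp.
   The name and the sign convention are assumptions for illustration;
   the text only states that ones/zeros map to values of opposite sign. */
static void bpsk_map(const unsigned char *bits, double *soft,
                     size_t n, double amp)
{
    for (size_t i = 0; i < n; i++)
        soft[i] = bits[i] ? -amp : amp;
}
```

In the simulator described later, normally distributed noise would then be added to these values to produce the soft bits fed to the decoder.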


2

Background Theory

In the following sections some background information relevant to the scope of the thesis is presented. First there is some very brief information about the binomial distribution and confidence intervals in section 2.1. Information about channel coding and turbo codes is presented in section 2.2, theory and methods for generating pseudo-random variables are presented in 2.3, how these random variables can be transformed to achieve a normal distribution is discussed in 2.4, and finally a brief overview of available products with hardware acceleration for turbo codes is presented in 2.5.

2.1

The binomial distribution and interval estimation

The binomial distribution is a discrete probability distribution. If a series of n independent experiments is conducted, each with a probability of success p, then the number of successful experiments x will have a binomial distribution. We write this as X ∼ B(n, p). The binomial distribution has an expected value of np and a standard deviation of √(np(1 − p)).

For large values of n the binomial distribution can be approximated by the normal distribution, so that approximately X ∼ N(np, √(np(1 − p))). This approximation can typically be done if np(1 − p) ≥ 10.

Interval estimation in statistics is used to calculate a confidence interval of possible values for an unknown parameter based on some experimental sample data. A confidence interval with a confidence level of 1 − α will contain the true parameter value with a probability of 1 − α. A commonly used confidence level is 95%.

Say that we have an unknown population parameter Θ and that the point estimator for this parameter is Θ*. Let this estimator be approximately normally distributed with expected value Θ and standard deviation D. Then a confidence interval for Θ with a confidence level of 1 − α is shown in 2.1:

IΘ = (Θ* − λα/2 · D, Θ* + λα/2 · D)   (2.1)

where D can be replaced by a suitable estimate d if D depends on Θ [5].

2.2

Channel coding and turbo codes

This section presents an introduction to some components of a modern digital communication link, including the general concept of channel coding and the specific turbo codes used in LTE.

In a digital communication link, as described in [13], a message consisting of a binary string is transmitted over a waveform channel, which for example can serve as a model for the air interface between a transmitter and a receiver in a mobile wireless communication system such as LTE. This type of link is illustrated in figure 2.1, where only the components of the link relevant to this discussion are shown.

Here we see that the binary message goes through the steps of channel coding and modulation before the waveform is transmitted over the channel.

In the channel coding step, which can also be referred to as error protection, the message is mapped to a codeword by adding artificial redundancy to it. This is done in order to make it possible to decode the message even in the case of errors induced by the channel. Typically this step also includes an interleaver which permutes the order of the coded bits, in order to spread bit errors over the codeword. During the modulation this codeword is then mapped onto a continuous waveform which can be transmitted over the channel.

In the receiver, the steps are undone. The received waveform is first demodulated into a digital sequence. In practice, soft demodulation is almost always used, in which case the demodulator does not produce bits, but rather soft bits, which contain reliability information. This is also what is used in this study. These soft bits then undergo the reverse process of the channel coding, called decoding. This decoding will ideally eliminate errors induced during the transmission so that the output string from the decoder resembles the input string to the encoder.


Figure 2.1: An illustration of the described digital communication link for sending a message over a waveform channel

In the LTE mobile communication system, the channel coding used is turbo coding as specified in [22]. Turbo codes were first published in 1993 [4] and use a design where two encoders at the transmitter and two decoders at the receiver work in parallel. The two decoders at the receiver can exchange decoding information between them in an iterative way. This iterative process can either continue until both decoders agree on the result or until a pre-determined number of iterations is reached. The number of iterations used in the decoding process is typically in the range of four to ten.

According to [10], the name "turbo codes" was chosen due to a noticed similarity to a turbocharger in an engine. Turbo codes use the output of the decoders to improve the decoding process, much like the turbocharger uses the exhaust gas to improve the combustion by forcing more air into the engine.

The specific turbo encoder in LTE uses two 8-state constituent encoders and a turbo code internal interleaver. The coding rate is 1/3 and the codewords support information bit sequence sizes for a subset of the sizes in the interval between 40 bits and 6144 bits [22].

2.3

Pseudo-random number generation

There are many methods available to generate a series of pseudo-random numbers, and below is a presentation of two algorithms which have been used in the work of this thesis: the Mersenne Twister and the family of xorshift generators. Both these methods are based on linear recursion over the two element field [25, 29].

The quality of these PRNG methods is studied using empirical tests. Several standardised test suites for the statistical properties of pseudo-random number generators have been developed. Among the most notable are the Diehard test suite [15] by George Marsaglia and TestU01 [14] by Pierre L’Ecuyer and Richard Simard. The performance of the algorithms in these two suites has been studied in the cited sources, and the main results for some of the PRNG implementations are briefly presented for the respective algorithms.


2.3.1

Matrix linear recurrence modulo 2

Linear recurrence over the two element field uses the theory of finite fields, also known as Galois fields. Typically, we let F2 = GF(2) be the finite field with the two elements {0, 1}. We write the field operations as + and ×. If 0 is regarded as "false" and 1 as "true", then the field operations are "exclusive or" (⊕) and "and" (∧). The vectors and matrices used have elements in F2 [7].

As described in [25], a PRNG which uses linear recurrence over the two element field is based on the relationships shown in 2.2-2.4.

xi = A xi−1 mod 2,   (2.2)
yi = B xi mod 2,   (2.3)
ui = Σ_{l=1}^{w} yi,l−1 · 2^−l = 0.yi,0 yi,1 yi,2 · · ·   (2.4)

In the equations above xi = (xi,0, . . . , xi,k−1)^T and yi = (yi,0, . . . , yi,w−1)^T are the k-bit state and the w-bit output vector at step i, A is a k × k binary transition matrix, B is a w × k binary output transformation matrix, k and w are positive integers, and ui ∈ [0, 1) is the output at step i.

Several well known algorithms belong to this class of PRNGs. Apart from the Mersenne Twister and the xorshift family of generators, which are described in further detail below, it includes others such as the linear feedback shift register (LFSR), the generalized feedback shift register (GFSR), the twisted GFSR and the WELL generators [25]. In fact, it has been shown in [7] that the xorshift generators are equivalent to certain linear feedback shift registers.
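The defining property of this class is linearity over F2: the state update distributes over XOR. A quick C check of that property for a single left xorshift step (a building block of the generators in section 2.3.4) might look like this; the function names are ours, not from the thesis.

```c
#include <stdint.h>

/* One left xorshift step, y -> y ^ (y << a): over GF(2) this is
   multiplication of the bit vector y by the matrix (I + L^a). */
static uint32_t xorshift_left(uint32_t y, int a)
{
    return y ^ (y << a);
}

/* Linearity over GF(2) means the map distributes over XOR:
   f(u ^ v) == f(u) ^ f(v) for all words u, v. */
static int is_gf2_linear(uint32_t u, uint32_t v, int a)
{
    return xorshift_left(u ^ v, a) ==
           (xorshift_left(u, a) ^ xorshift_left(v, a));
}
```

The same check works for right shifts and for compositions of shifts, since compositions of linear maps stay linear.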

2.3.2

Equidistribution

It is desirable that a PRNG should generate a sequence which behaves like independent samples of stochastic variables with the same distribution. However, there is no decisive definition of what good "randomness" for a PRNG means. Equidistribution, which in this context can also be referred to as the k-distribution test, is one property which is used as a measure of randomness [19, 25, 29].

For a sequence xi of real numbers, equidistribution in one dimension means that the values in the sequence are uniformly distributed over the interval of the sequence. The proportion of terms falling in a subinterval is thus proportional to the length of the interval.

If the values in the sequence are for example taken as digits in a decimal representation of π, then each of the digits 1, 2, 3, ..., 9 and 0 should appear on average in a tenth of the cases. In the same way, if we take a sequence of 32 bits, or w bits in general, then each of the 2^w possible combinations should occur equally many times in a period. A slight flaw is permitted in the case of the all-zero combination, which appears once less often.

Equidistribution in 2 dimensions means that we study pairs of the w-bit values in the sequence, and require that all the 2^2w possible combinations occur equally many times in a period. For the general case of t dimensions, we require that all the 2^tw possible combinations occur equally many times in a period, except for the all-zero combination which occurs once less often.

The geometrical meaning comes from viewing the values in the sequence as points in a t-dimensional space. For a sequence of w-bit integers, we divide each value by 2^w to normalize it into a pseudo-random real number in the [0, 1] interval, and place each consecutive t-tuple in the t-dimensional unit hypercube. The sequence is t-dimensionally equidistributed if these points are uniformly distributed in the t-dimensional unit hypercube [5, 19, 29, 25].

This can be further generalized. Instead of studying all w bits in the output word, we only study the l most significant ones. This corresponds to truncating each value in the sequence and taking the value formed by the leading l bits. Then the sequence is said to be (t, l)-equidistributed if each of the 2^tl possible combinations of bits occurs the same number of times in a period, except for the all-zero combination which occurs once less often.

In the geometrical interpretation of this, we still study the t-dimensional unit hypercube. We divide the interval [0, 1] into 2^l equally long segments, so that we partition the unit hypercube into 2^tl cubic cells of equal size. The sequence is said to be (t, l)-equidistributed if each of these cubic cells contains the same number of points [19, 25].
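To make the counting concrete, the sketch below checks (1, 2)-equidistribution for a toy full-period generator over 8-bit words. The generator (an LCG, x ← 5x + 1 mod 256) is not from the thesis and is not F2-linear; it is chosen only because its full period makes the cell counts exactly equal (here the all-zero word is part of the period, so the once-less-often exception does not apply).

```c
#include <stdint.h>

/* Toy full-period 8-bit generator, x <- (5x + 1) mod 256. Not from
   the thesis; used only to illustrate (t, l) counting for t = 1, l = 2. */
static uint8_t toy_next(uint8_t x)
{
    return (uint8_t)(5u * x + 1u);
}

/* Count how often each value of the l = 2 leading bits occurs over
   one full period of 256 outputs. (1, 2)-equidistribution means all
   four counts are equal: 256 / 2^2 = 64. */
static void count_leading2(long counts[4])
{
    uint8_t x = 0;
    for (int i = 0; i < 4; i++)
        counts[i] = 0;
    for (int i = 0; i < 256; i++) {
        x = toy_next(x);
        counts[x >> 6]++;   /* leading 2 of 8 bits */
    }
}
```

For t > 1 the same idea applies with t-tuples of consecutive outputs indexing 2^tl cells instead of 2^l bins.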

2.3.3

Mersenne Twister

The Mersenne Twister is a pseudo-random number generator proposed by Makoto Matsumoto and Takuji Nishimura in 1998 [19]. It has become a very popular choice as a PRNG for many software systems. In the words of Cleve Moler, founder of MathWorks and author of the first MATLAB [27]: "Mersenne Twister is, by far, today’s most popular pseudorandom number generator. It is used by every widely distributed mathematical software package. It has been available as an option in MATLAB since it was invented and has been the default for almost a decade." [21]

The Mersenne Twister in its typical implementation, the MT19937, has a very large period of 2^19937 − 1. It creates a sequence of 32-bit integers and this sequence is 623-dimensionally equidistributed. To achieve this the Mersenne Twister uses a state consisting of 624 32-bit words, meaning roughly 2.5 kB of memory [19].

The Mersenne Twister passes the Diehard tests [19] but fails some of the tests in TestU01 [14].

2.3.4

Xorshift

In the paper "Xorshift RNGs" [16], George Marsaglia proposes a class of simple and very fast random number generators based on the repeated use of the logical operation exclusive or (xor) of a computer word with a shifted version of itself. This operation can easily be implemented in software as a computer instruction and can be executed very quickly by a processor. In C code, the xorshift operation is y^(y << a) for left shifts and y^(y >> a) for right shifts. Marsaglia establishes the requirement needed for these PRNGs to have a full period, and supplies all sets of parameters which achieve this for certain types of xorshift generators.

To showcase the simple structure of the xorshift generators, Marsaglia provides the essential C code for generating a random sequence of 32-bit integers with a period of 2^128 − 1. This code is shown below in listing 2.1. As visible from the code, only three xorshift operations are used, and the generator requires four random 32-bit seeds x, y, z, w.

Listing 2.1: C code example for the xorshift generator

tmp = (x ^ (x << 15));
x = y; y = z; z = w;
return w = (w ^ (w >> 21)) ^ (tmp ^ (tmp >> 4));
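Wrapped into a self-contained routine, the listing above can be exercised as follows. The shift constants are those of listing 2.1, and the period 2^128 − 1 is the one stated in the text; the seed values are arbitrary nonzero choices for illustration, not taken from the thesis.

```c
#include <stdint.h>

/* State and step function for the xorshift generator of listing 2.1.
   The seeds below are arbitrary nonzero values for illustration. */
typedef struct { uint32_t x, y, z, w; } xor128_state;

static xor128_state xor128_seed(void)
{
    xor128_state s = { 123456789u, 362436069u, 521288629u, 88675123u };
    return s;
}

static uint32_t xor128_next(xor128_state *s)
{
    uint32_t tmp = (s->x ^ (s->x << 15));
    s->x = s->y;
    s->y = s->z;
    s->z = s->w;
    return s->w = (s->w ^ (s->w >> 21)) ^ (tmp ^ (tmp >> 4));
}
```

Two instances seeded identically produce the same stream, which is exactly the repeatability property motivated in section 1.6.1.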

The xorshift generators are mathematically modelled by Marsaglia as linear transformations over the binary vector space, characterized by a nonsingular n × n binary matrix T. The value of n is typically selected to be 32 or a multiple thereof, to facilitate the implementation on a computer.

In this notation, the linear transformation of a binary vector y is denoted by yT. The xorshift operation for a left shift is represented by T = I + L^a, where L is the n × n binary matrix that effects a left shift of one position on a binary vector y, that is, L is all 0's except for 1's on the principal subdiagonal. Similarly, R is used to denote a right shift of one position on a binary vector, so that the xorshift operation for a right shift is represented by T = I + R^a.

For the mathematical model we also define the seed set Z as the set of 1 × n binary vectors β = (b1, b2, . . . , bn), excluding the zero vector. Such a binary vector can also be referred to as the state of the xorshift generator. If β is a uniform random choice from Z, then each member of the sequence βT, βT^2, βT^3, . . . is also uniformly distributed over Z.

Marsaglia presents a theorem, shown here as theorem 2.1, for establishing when the xorshift generators will have a full period.

Theorem 2.1. In order that a nonsingular n × n binary matrix T produce all possible non-null 1 × n binary vectors in the sequence βT, βT^2, βT^3, . . . for every non-null initial 1 × n binary vector β, it is necessary and sufficient that, in the group of nonsingular n × n binary matrices, the order of T is 2^n − 1.

If the order of T is 2^n − 1, then each of the matrices T, T^2, T^3, . . . , T^k is distinct and nonsingular, and it follows from the theorem that the sequence βT, βT^2, βT^3, . . . must have a period of k = 2^n − 1.

When using only two xorshift operations for n = 32 or n = 64, there are no values for the shift lengths which will create a matrix T of the desired order 2^n − 1, and thus a sequence of the maximum period. But when using three consecutive shifts on the same binary vector y, i.e. the operation yT where T is of the form T = (I + L^a)(I + R^b)(I + L^c), there are several sets of shift lengths (a, b, c) which yield the desired property of the matrix, and thus attain the maximum period of the sequence. Marsaglia presents all the possible choices which achieve this for both n = 32 and n = 64. In a later examination of the xorshift generators [25], it has been shown that to reach the maximal period, both left and right xorshifts must be used.

Marsaglia points out that although the sequence generated by this type of linear transformation over the binary vector space is identically distributed, the elements are not independent. Furthermore, since a long period xorshift RNG necessarily uses a nonsingular matrix transformation, every n successive vectors must be linearly independent, while truly random binary vectors will be linearly independent only some 30% of the time. This property makes the xorshift generators fail the binary rank test in Diehard, but they pass all other tests of randomness in Diehard [16].

Marsaglia also remarks that the xorshift generators can be further developed by assigning any function to modify the output of the generator. In the example shown in listing 2.1, the output w is taken to be the last value of the state (x, y, z, w). This value must be calculated and assigned to achieve the full period, but the output could instead be taken as any function of the state. For example, returning x multiplied by a constant would also constitute a PRNG of a full period.

Panneton and L’Ecuyer have in a subsequent analysis of the xorshift generators pointed out some weaknesses [25]. Their report included an analysis of the theoretical properties, a search for the best xorshift generators according to equidistribution, and empirical tests using the TestU01 suite. In further detail, these tests included all full period xorshift generators which have either 32-bit xorshift operations and a 32-bit state, or 64-bit xorshift operations and a 64-bit state. They also analysed xorshift generators with a larger state, and which utilise different xorshift operations on different words in the state. However, among this type of generators they only empirically tested the best generators they found according to equidistribution. They concluded that the xorshift generators are fast, but not reliable according to their analysis of equidistribution and results from empirical statistical tests. They state that the generators which only use 32 or 64 bits of state are doomed due to their short period. To overcome the other limitations, they propose using more xorshifts than three, or combining the xorshift generators with other RNGs from different classes [25].

The proposed use of a non-linear operation to scramble the result of xorshift generators is explored in a paper by Sebastiano Vigna [29]. The scrambling is done with multiplication by a suitable constant to achieve what is called an xorshift* generator. Vigna restricts the search to consider only 64-bit shifts, and states consisting of 64 bits or a power of two thereof. Several xorshift generators of different state sizes using this type of multiplication are presented. Their results in empirical tests with TestU01 are compared with those of other generators, among them the Mersenne Twister. Among other things, Vigna concludes that an xorshift generator followed by a multiplication can give very good statistical quality. He remarks that the Mersenne Twister and other well known generators have more linear artifacts than even a 64-bit state xorshift generator followed by multiplication. The Mersenne Twister and other generators with extremely long periods also have problems when the state has many zeros. Vigna also points out problems with evaluating generators according to equidistribution, and remarks on the high-bit bias of TestU01, such that the result is very different when the bits are tested in reverse order.

Other methods for scrambling are presented by Saito and Matsumoto with what they call the xorshift-add (XSadd) generator [26, 28]. The proposed generator has a 128-bit internal state and uses addition as a non-linear output function. It passes the BigCrush test of TestU01, but fails when the bits are reversed. Sebastiano Vigna further develops this use of addition with xorshift generators and creates what he calls the xorshift128+ generator. This generator has 128 bits of state and uses 64-bit operations. It passes the BigCrush test of TestU01, even reversed, and is presently used by the JavaScript engines of Chrome, Firefox and Safari according to the author [28].

The current recommendation from Vigna is to use a successor to the xorshift128+ generator called xoroshiro128+, but he also states that the xorshift* generators "are an excellent choice for all non-cryptographic applications" [3].
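As a concrete reference, the xorshift128+ algorithm is compact enough to reproduce in full. The C sketch below follows the published shift constants (23, 17, 26); the seed values are arbitrary placeholders, and the state must never be all zeros.

```c
#include <stdint.h>

/* xorshift128+ (Vigna): 128 bits of state, 64-bit operations.
   The seed values below are arbitrary placeholders. */
static uint64_t s[2] = { 0x0123456789ABCDEFULL, 0x0FEDCBA987654321ULL };

static uint64_t xorshift128plus(void)
{
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23;                          /* shift a */
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);  /* shifts b and c */
    return s[1] + y;                       /* non-linear output: addition */
}
```

The addition in the return statement is the scrambling step that breaks the linearity of the underlying xorshift recurrence.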

2.4 Methods for generating normally distributed variables

If you have access to independent uniformly distributed random variables it is in principle possible to generate any one-dimensional probability distribution. If we desire to generate a random variable X from a distribution with a cumulative distribution function F, this can be done by using the inverse of the CDF and setting X = F⁻¹(U), where U is a random variable with uniform distribution, U ∼ U(0, 1) [5].

For the normal distribution, it is not very easy to calculate the inverse of the CDF Φ(x). Instead other methods are used. Two common methods are the Box-Müller method and the Marsaglia Polar method. The Box-Müller method is more computationally demanding than the Polar method since it requires calculating the sine and cosine of an angle. Below are more detailed descriptions of the two methods, as well as a description of the Ziggurat method, a more complex method which aims at being more computationally efficient.

2.4.1 The Box-Müller method

The method proposed by George Edward Pelham Box and Mervin Edgar Müller in 1958 generates a pair of independent random variables X1, X2 from a normal distribution with zero mean and unit variance using a pair of independent random variables U1, U2 with uniform distribution over [0, 1]. The fundamental relationship is shown in 2.5 [6].

X1 = √(−2 ln U1) cos(2πU2)
X2 = √(−2 ln U1) sin(2πU2)   (2.5)


2.4.2 The Marsaglia Polar method

Proposed by George Marsaglia and Thomas Bray in 1964, this method is similar to the Box-Müller method in that it also generates a pair of normally distributed variables of the form shown in 2.6.

X1 = R cos φ
X2 = R sin φ   (2.6)

However, this method avoids the use of the trigonometric functions. Instead it is based on first finding a random point which is uniformly distributed in the unit circle, and using a fraction of polynomials for the sine and cosine of the angle to this point. To find a point with uniform distribution in the unit circle, we generate values U1, U2 which are uniform in [−1, 1] and reject the point (U1, U2) until U1² + U2² < 1. Then the normally distributed pair X1, X2 shown in 2.7 is returned.

X1 = U1 · √(−2 ln(U1² + U2²)) / √(U1² + U2²)
X2 = U2 · √(−2 ln(U1² + U2²)) / √(U1² + U2²)   (2.7)

We can see that the sine and cosine terms in 2.6 are replaced by a rational expression, and the corresponding angle will be uniformly distributed since the point (U1, U2) is uniformly distributed in the unit circle. The distance R to the new point (X1, X2) is, both in this case and in the Box-Müller method, given by an expression of the form √(−2 ln U). This expression is the inverse of the CDF for the Rayleigh distribution; a full motivation for its use can be seen in [5]. The Marsaglia Polar method uses U = U1² + U2², and this is based on the fact that U1² + U2² is uniform on [0, 1] and is independent of U1/U2, and hence independent of U1/√(U1² + U2²) and U2/√(U1² + U2²) [17].

Since the Marsaglia Polar method initially rejects all points (U1, U2) unless U1² + U2² < 1, it is referred to as a rejection sampling technique. Since the area of the unit circle is π, it will reject on average 1 − π/4 ≈ 21% of the generated uniform random variables. This requires the method to generate 1/(π/4) ≈ 27% more values than required by the application.

2.4.3 The Ziggurat method

This method for generating variables from a normal distribution was published by George Marsaglia and Wai Wan Tsang in 2000, but is based on a method the two authors developed in the 1980s [18]. It is the method used for MATLAB's randn function [20].


The Ziggurat method is essentially a table lookup algorithm which uses a rejection method. It can be used to sample from any decreasing density function. The approach is to cover the target density with the union of a collection of sets from which it is easy to choose uniform points, and then use the rejection method. If we denote the area under the plot of the PDF by C and the union of the covering sets by Z, note that C ⊂ Z. Then we can describe the rejection method like this: select random points (x, y) in Z until you get a point that is in C, and return the value of x. This way the returned values will belong to the desired distribution.

The Ziggurat method covers the PDF of the normal distribution by stacking horizontal rectangles on top of a base strip which tails off to infinity. An important aspect of the algorithm is that all of these sets, both the rectangles and the base strip, have the same area. This is illustrated in figure 2.2 using 8 sets: 7 rectangles and a base strip. The number of sets is best chosen to be a power of 2 to make random selection from a table easy. It is also preferable to utilize many more sets than 8 in order to reduce the area which is covered in excess of the desired PDF. In the C code provided in [18], the method uses 128 sets for the normal distribution.

Figure 2.2: The Ziggurat method with 7 rectangles and a base strip. This figure is copied without alterations from the article [18] by George Marsaglia and Wai Wan Tsang, provided under the Creative Commons Attribution 3.0 Unported License

The Ziggurat method gets its name from the shape of the rectangles. Since each of the rectangles has an equal area, the top ones are taller than the bottom ones, so that the layered rectangles resemble a ziggurat step pyramid. The reason for desiring the area of all sets to be equal is that this allows us to select a set at random using a uniform index.

We let the top rectangle be number one, and we denote the rightmost point of each rectangle i as xi, and let r be the rightmost xi. We also assume an empty rectangle R0 with x0 = 0, the left edge of R1. Let W be a uniform value in [−1, 1], and U be uniform in [0, 1]. Let f(x) be the PDF of the desired normal distribution. Using this notation we can describe the Ziggurat algorithm with the following steps:


1. Select a random index i from a uniform distribution to specify the set
2. Set x = W·xi
3. If |x| < xi−1, return x
4. If i = 0, return an x from the tail by a special method
5. If [f(xi−1) − f(xi)]·U < f(x) − f(xi), return x
6. Go to step 1

The special method for returning a value from the tail is as follows: generate x = −ln(U1)/r and y = −ln(U2) until y + y > x · x, then return r + x or −r − x depending on the sign of the original x.

When 256 sets are used to cover the PDF, the algorithm will return in step 3 around 99% of the time [18].

2.5 Hardware acceleration for turbo codes

This section presents some commercially available hardware systems which include a DSP processor and hardware acceleration for turbo codes.

Texas Instruments offers several products for telecom infrastructure. Their product TCI6618 [2] is a multicore system-on-a-chip DSP system with hardware accelerators for bit rate processing such as turbo encoding and decoding. It can perform turbo decoding with a throughput of 582 Mbps for the 6144-bit code block size and 6 iterations.

The TCI6618 system is based on the C66x DSP core. The level-1 (L1) program and data memories on the TCI6618 device are 32 kB each per core. The level-2 (L2) memory is shared between program and data space for a total of 4,096 kB (1,024 kB per core). The TCI6618 contains 2,048 kB of multicore shared memory (MSM) that is used as a shared L2 SRAM or shared L3 SRAM.

Another hardware system that is commercially available is the MSC8157 [1] from NXP Semiconductors (previously Freescale Semiconductor, soon to be acquired by Qualcomm). The MSC8157 is a system-on-chip DSP system with hardware acceleration for baseband applications such as turbo encoding and decoding. It can perform turbo decoding with a throughput of up to 330 Mbps.

The MSC8157 includes six StarCore SC3850 DSP subsystems, each with an SC3850 DSP core, 32 kB L1 instruction cache, 32 kB L1 data cache, and a unified 512 kB L2 cache configurable as M2 memory in 64 kB increments, as well as an M3 shared memory of 3072 kB.


3 Method

This chapter describes how the simulator has been implemented, both in a wider view where the fundamental model and general implementation considerations for the simulator setup are described, and also on a detailed level describing the proposed method for using a look-up table to generate random noise.

First the fundamental simulator model is described in 3.1, and then the general principles for the implementation and use of the simulator are described in 3.2. This is followed in section 3.3 by a description of the different program structures which have been implemented and tested. Lastly, the proposed table method for generating random noise is described in section 3.4.

3.1 Fundamental simulator model

A typical approach used to study the performance of channel coding techniques is to model the transmission media as an additive white Gaussian noise (AWGN) channel. This is the underlying model and general idea used for the simulator de-scribed below. For a time-continuous channel, where X is the transmitted signal and Y is the received signal, the AWGN channel is modelled by 3.1

Y = X + W (3.1)

where the noise W is white and Gaussian. The property of being white is a strictly mathematical model which signifies that the energy of the signal is uniformly distributed over all frequencies. White noise is characterized by a single parameter σ² called the energy per degree of freedom or spectral density. That the noise is Gaussian means that the coefficients of the signal have a Gaussian or normal distribution with zero mean: Wn ∼ N(0, σ). In the application at hand, we will only utilize a time-discrete channel. In this case, the samples of W are independent and belong to a Gaussian or normal distribution. [13]


3.2 General principles of the simulator setup

The objective with the simulator is to be able to generate curves showing the turbo code block error rate (BLER) as a function of the signal-to-noise ratio (SNR). The overall method to achieve this is very simple: try to perform the turbo coding process several times and see how often it fails. This general principle can be described in the few steps below:

1. Generate input message
2. Perform turbo encoding
3. Generate random noise according to AWGN model and add to data
4. Perform decoding
5. Compare the output to the input and store statistics
6. Repeat until sufficient statistics have been obtained

The input message generated in step 1 could be any bit sequence. However, it can be advantageous to use a random sequence rather than all ones or all zeros to avoid suffering from potential implementation flaws, such as a decoder which has a bias to the all-zero codeword.

Since hardware acceleration components will be used for the encoding and decoding of turbo codes, the task of the simulator is mainly to interface these components and to use suitable methods to simulate disturbance from an AWGN channel. General considerations about interfacing hardware acceleration components and adding noise are described in section 3.2.1.

To generate a curve over the BLER performance the simulations have to be run at several different SNR values. How to perform the needed calculations concerning SNR is described in section 3.2.2. Considerations about how many times this process has to be repeated for each SNR value are described in section 3.2.3.

3.2.1 Soft bits and noise addition

The turbo decoder used in this thesis accepts input in the form of 8-bit integer values. These 8 bits are together referred to as a soft bit, and the original binary value from the encoding process is referred to as an encoded hard bit. So each encoded hard bit in the set {0, 1} must be mapped to a soft bit value belonging to the integer set {−128, −127, . . . , 126, 127}.

To begin describing how this mapping is done we first select a positive value µ. This µ can be referred to as the signal level, and the mapping F of encoded hard bits to soft bit values can be described by the relations in 3.2.

F(0) = µ
F(1) = −µ   (3.2)

Based on the AWGN model, we want to add white Gaussian noise to each soft bit value. If we let W ∼ N(0, σ) then the mapping of encoded hard bits to soft bits with additive white Gaussian noise can be described by the relations in 3.3.

F(0) = µ + W
F(1) = −µ + W   (3.3)

With this notation, a soft bit X has a distribution of X ∼ N(±µ, σ), where the sign depends on whether the hard bit value is zero or one.

In order to facilitate the implementation in code and to achieve effective execution, we will first generate a random value G ∼ N(µ, σ) regardless of the hard bit value, and then assign a positive or negative sign depending on the hard bit value. The random value is generated using PRNG methods implemented to run on DSP processors. The different methods used have been described in chapter 2. Using this approach, we can express the implementation in code as shown below in 3.1.

Listing 3.1: C implementation of hard to soft bit conversion with additive white Gaussian noise

soft_bit_value = (hard_bit_value == 0 ? G : -G);

Due to the integer nature of the soft bits, the value G first has to be rounded to the nearest integer value in the interval [−127, 127] in order to be sure that both G and −G belong to the set {−128, −127, . . . , 126, 127}. However, with this method, the soft bit value −128 will never be utilized.

It should be noted that the use of integer values implies that a quantization noise will be present in the resulting values.

3.2.2 SNR calculations

The signal-to-noise ratio measures the energy of a useful signal relative to the noise part of a signal. So if we have a useful signal X which is contaminated by additive noise W, then the signal-to-noise ratio of the resulting signal is defined as the ratio of the energy in X to that of W. We thus get SNR = EX/EW. The SNR is typically expressed in the logarithmic decibel scale as shown in 3.4. [13]

SNRdB = 10 log10(SNR) = 10 log10(EX/EW)   (3.4)

In our case, with the signal defined in 3.3, the SNR is µ²/σ², so that we arrive at 3.5

SNRdB = 10 log10(µ²/σ²)   (3.5)

With the current application of finding the BLER value at a particular SNR level, we are however more interested in the reverse calculation order. We wish to find the standard deviation σ corresponding to a certain SNR value. For a given value of the signal level µ, this standard deviation for the normally distributed noise can be calculated according to 3.6

σ = √(µ² / 10^(SNRdB/10))   (3.6)

SNR can also be presented in another way using the notation Eb/N0, but this alternative will not be used in this thesis.

3.2.3 Number of coding attempts

The simulator has been configured to continue running successive attempts of encoding and decoding until 100 errors are encountered. This is a typical approach used for these types of simulations; it puts a limit on how many iterations will be performed, and it gives the generated values a certain statistical significance without having to perform any calculations for the desired number of iterations in advance. For an SNR level which corresponds to a BLER value of 10⁻⁹ we would expect to have to run 10¹¹ iterations of attempted encoding and decoding processes.

The statistical significance of the generated values may also depend on other aspects, such as how well the PRNG methods actually mimic true randomness. There is some background theory for this supplied in chapter 2, but for the remainder of this section and the description of the method for determining the statistical significance of values we assume true randomness. First we recall the background theory concerning a binomial distribution presented in section 2.1.

If we perform n different attempts of encoding and decoding, and we model this as independent trials each with a probability of failure p, then the number of failed attempts x will belong to a binomial distribution. If n is large this can be approximated by a normal distribution where µ = np and σ = √(np(1 − p)).

As also described in section 2.1, if we have a point estimator which is approximately normally distributed, and we have the estimators Θ* = np = 100 and d = √(np(1 − p)) ≈ 10, then a confidence interval with a confidence level of 95% is given by 3.7

IΘ = (Θ* − λα/2·d, Θ* + λα/2·d) = (100 − 1.96 · 10, 100 + 1.96 · 10) = (80.4, 119.6)   (3.7)

If we were to calculate BLER values from an estimator which has this type of confidence interval, it would give the calculated BLER value roughly one significant figure.

3.3 Simulator program structure

Two different program structures for performing the simulations have been implemented and tested in this thesis. These program structures define how the flow of execution should be passed along between the different steps of the simulation process and how multiple DSP resources should be utilized to enable a parallel execution of the program.

In a sense, the implementation according to these two program structures corresponds to the implementation of two different simulator programs. Much of the program can be, and has been, divided into different subroutines which are callable from a library. However, the two implementations differ completely concerning the part of the simulator program from where the simulation is started and ended, and the execution flow is controlled throughout the duration of the simulation.

The two different program structures are implemented to study how well the program can be executed in parallel over multiple DSP resources, and to aid in studying how well both the hardware acceleration and DSP components are utilized. Due to the differences in the implementation, the two simulator programs have also been used as references to each other, for comparing the simulation results and searching for potential bugs in the program as an attempt to verify the implementations.

The two different program structures implemented and tested in this thesis are referred to as the busy-wait simulator structure and the callback simulator structure.

3.3.1 Busy-wait simulator structure

The busy-wait simulator structure is a very simple program structure which is probably seen by many as a straightforward and natural way to implement the steps of the simulator described in section 3.2.

With this program structure, every iteration of the simulator, or independent trial of coding and decoding, is run to completion before the next iteration or trial is started. All the steps in the simulation process are executed in series on a single DSP resource, so that all data can be handled in memory local to only that DSP resource, except of course for the final statistics which need to be shared by and accessible from all DSP resources. The busy-wait structure is illustrated in figure 3.1.


Figure 3.1: A graphical representation of the busy-wait program structure

3.3.2 Callback simulator structure

The callback simulator structure allows a new iteration, or an independent trial of coding and decoding, to be started before the previous one is finished. This behaviour is achieved by making each of the steps 1, 3, and 5 described in chapter 3.2 into self-contained and separate jobs. These jobs must still be executed in a serial way since the results obtained in one step are needed for the next.

The steps are made self-contained by requiring that all data needed for the execution of the respective procedure be supplied as arguments. In this way, subroutines can be defined which can perform each step without the need for the different steps to be executed with access to the same memory. The different steps can therefore be executed on different DSP resources and new independent trials can be started before the previous iteration has completed. The aim with this program structure is to better utilize the DSP resources by eliminating the need to wait for the accelerator components to finish their task. Instead the DSP resource can be utilized to perform a subroutine belonging to a different step from another independent trial of coding and decoding.


The name of this structure comes from the callback concept, where a function is passed as an argument to another function with the intention that, at an appropriate time, the executing function makes a call back to the argument function.

This is precisely how program control is passed along when a coding or decoding job is sent to the hardware accelerator components. The call to the accelerator is bundled with arguments to the subroutine which will perform the continuation of the process, and the data needed to perform it. When the hardware acceleration process is complete, the callback function is executed on an available DSP resource.

In the actual implementation, the steps 1 and 5 in chapter 3.2, i.e. the step performing the initialization of an attempt and the step performing the evaluation of the results from decoding, are combined into one self-contained job. In this way, there are two separate, self-contained jobs performed by software executed on a DSP, in addition to the encoding and decoding jobs executed in the hardware acceleration components. We can refer to the job which executes steps 5 and 1 as A and the job which executes step 3 as B. This program structure is illustrated in figure 3.2, where an additional software job, C, is defined to initialize the very first iteration.

Figure 3.2: A graphical representation of the callback simulator structure: the circles represent execution of software in DSPs, the rectangles represent execution in hardware accelerators. The dotted lines represent how program control is passed along by callback instructions, the solid lines show the chronological execution path

3.4 The table method for noise generation

The table method for generating noise is a very simple table look-up method which is proposed, implemented and tested in this thesis. The idea of the method is to generate approximately normally distributed random variables in a very fast way based on uniformly distributed random values. The reason for proposing this method is to study whether even an extremely simple method generating only approximately normally distributed values can be used in the simulator to achieve a comparable level of quality in the generated BLER curves. The hope is also that this method can supply a relevant baseline concerning the execution speed of generating noise values in a DSP system, in order to compare with the execution speed of other and more established methods which adhere to a true normal distribution.

As stated above, this method is a very simple table look-up method. First a table is set up according to a desired normal distribution, and then a random integer can be generated and used as an index into the table. We used a table containing 256 values so that a positive integer of 8 random bits can be used as an index. The PRNG methods used in this thesis generate a 32-bit integer value, so each generated value was divided up into four different random indexes.
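The splitting of one 32-bit PRNG output into four table indexes can be sketched as below; the function and table names are illustrative, not taken from the simulator source.

```c
#include <stdint.h>

/* Table method (section 3.4): one 32-bit PRNG output is split into
   four 8-bit indexes into a 256-entry noise table. The table is
   assumed to hold precomputed integer noise values. */
static void four_table_lookups(uint32_t r, const int8_t table[256],
                               int8_t out[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = table[(r >> (8 * k)) & 0xFFu];
}
```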

Section 3.2.1 describes the general approach used in the implementation to create a signal with additive noise. For each simulation at a specified SNR value, there is a need to generate a large number of random variables from a normal distribution with a constant mean µ and a constant standard deviation σ. The table method for noise generation is designed to enable the generation of such values in a fast way.

In general, this method can be described like this: for a probability distribution specified by a constant parameter set P, a table consisting of N elements is set up such that the values in the table represent points {x1, x2, · · · , xN} on the x-axis of the distribution CDF, which we can denote by Φ(x). These points are centred around the expected value of the distribution and are equidistant points in terms of probability. By this we mean that for each pair of adjacent points xi and xi+1, the probability mass between these points is the same, i.e. they adhere to the rule shown in 3.8.

Φ(xi+1) − Φ(xi) = C, ∀i ∈ {1, 2, · · · , N − 1}, constant C ∈ R   (3.8)

Further on, the values have been set so that the probability mass in each tail of the distribution beyond the largest or smallest value in the table should equal C/2, corresponding to the relationships in 3.9

Φ(x1) − Φ(−∞) = Φ(x1) = C/2
Φ(∞) − Φ(xN) = 1 − Φ(xN) = C/2   (3.9)

With this set-up of the tables, the probability mass in the tails of the distribution not covered by the table values will amount to 1/N.

In the current context and implementation, the distribution is of course the normal distribution specified by P = {µ, σ} and the table contains N = 256 entries. This means that the probability mass of the distribution not covered by the table amounts to 0.4%. The table entries are all integer values due to the integer nature required by the accelerator components, as described in section 3.2.1.

In order to facilitate an easy implementation in the DSP software for setting up a table according to a current parameter set, there is a table stored in memory containing the 256 floating point values for a normal (0, 1) distribution. These values can be scaled, shifted and rounded as needed, according to the desired values of µ and σ, to get a table with integer entries.

The table of floating point values used for a normal (0, 1) distribution can be generated using the MATLAB code shown in listing 3.2 below. The actual values of the table are also presented in appendix A.

Listing 3.2: MATLAB code for generating a table for the standard normal distribution

N = 256;
P = 1/(2*N) : 1/N : 1;
X = norminv(P);

4 Results

In this chapter the results are presented. First are some results showing output from the simulator, i.e. BLER curves, for the different simulator structures and methods for generating normally distributed values. These are shown in section 4.1. Then results showing the execution time for the implemented PRNGs are shown in section 4.2. The execution times for the implemented methods which convert uniform variables to a normal distribution are shown in section 4.3, together with examples of the empirical distribution of the methods. The execution times for the resulting different ways of converting a hard bit value to a soft bit value are shown in section 4.4. Results showing the throughput of the simulator are shown in section 4.5, and after this the distribution of the execution time among the different steps of the simulator is shown in section 4.6. Finally, in section 4.7, some BLER curves for very low error rates are presented.

4.1 BLER curves

4.1.1 Reference curves and axes

The BLER-curves presented here will be shown together with a reference curve. Results for two different code block sizes will be presented, both for 40 bit code blocks and for 6144 bit code blocks, which are the smallest and largest code blocks available. These reference curves are previous results from a completely different simulator implemented entirely in software. The two reference curves for the different code block sizes are shown in 4.1 and 4.2.

The BLER curves show the block error rates, i.e. the probability that the codeword is incorrectly decoded, for different signal-to-noise values. All BLER curves shown are for turbo codes with a code rate of 1/3. All BLER curves are also based on the principles described in 3.2.3, i.e. that for each SNR value the simulation has run until 100 failed decodings have been encountered, unless stated otherwise. For all BLER curves, the SNR values shown on the x-axis are set to be zero where the reference curve for 6144-bit code blocks has a block error rate of 1%.

Figure 4.1: The reference curve for 40 bit code blocks


Figure 4.2: The reference curve for 6144 bit code blocks

4.1.2 Different noise methods

The figures below each show three different generated BLER-curves, in addition to a reference curve. The curves in each figure display the simulation results when using different methods for generating the additive noise. The results for 40 bit code blocks are shown in 4.3 and the results for 6144 bit code blocks are shown in 4.4.


Figure 4.3: Simulation results for 40 bit code block size

Figure 4.4: Simulation results for 6144 bit code block size


4.1.3 Different simulator structures

The figures below show the simulation results of the implementations in the different simulator structures for the same methods for generating the noise. The random seeds for the different simulations also differ. We see the results of the different simulator structures for the table method in 4.5, for the Ziggurat method in 4.6, and for the Polar method in 4.7.

Figure 4.5: Simulation results of the different simulator structures for 6144 bit code block size using the table method


Figure 4.6: Simulation results of the different simulator structures for 6144 bit code block size using the Ziggurat method

Figure 4.7: Simulation results of the different simulator structures for 6144 bit code block size using the Polar method



4.2 Execution time for PRNG methods

The execution times for the two implemented PRNG methods are shown in Figure 4.8. The values shown are calculated as an average over 10 000 consecutive calls to the PRNG. The actual values are also shown in Table 4.1.

Figure 4.8: The average execution time, measured in clock cycles, for generating a 32-bit pseudo-random integer

Table 4.1: The average execution time, measured in clock cycles, for generating a 32-bit pseudo-random integer

Method              Clock cycles
xorshift            4.002
Mersenne Twister    46.363
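The large gap in the table is explained by the structure of the two generators: xorshift produces each output with only a handful of shift and XOR operations, whereas the Mersenne Twister maintains a large state and periodically regenerates it. As an illustration (not the thesis's exact DSP implementation), a minimal 32-bit xorshift step using one of Marsaglia's full-period shift triples (13, 17, 5) can be sketched as:

```python
def xorshift32(state):
    """One step of a 32-bit xorshift PRNG.

    Three shift/XOR operations per output explain its low cycle count.
    The shift triple (13, 17, 5) is one of Marsaglia's full-period choices.
    Returns (next_state, output); for plain xorshift they coincide.
    """
    state ^= (state << 13) & 0xFFFFFFFF  # mask emulates 32-bit wraparound
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state, state
```

Because the state is a single 32-bit word, the generator is trivially cheap to seed and to keep per-DSP-core, which matters when many cores run the simulator in parallel.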

4.3 Normal distribution methods

Three different methods for generating values with a normal distribution have been implemented: the Polar method and the Ziggurat algorithm described in chapter 2, as well as the proposed Table method described in chapter 3.4. The results presented here show the execution time of these methods as implemented in the simulator, as well as an example of the empirical distribution of these methods.
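For orientation, the Polar method in its commonly formulated shape (chapter 2 has the thesis's exact variant) uses a rejection loop plus a square root and a logarithm per accepted pair, which is why its per-sample cost is the highest of the three:

```python
import math
import random

def polar_pair(uniform=random.random):
    """Marsaglia's polar method: two independent standard-normal
    samples from uniform (0,1) draws.

    The rejection loop makes the number of calls to the underlying
    PRNG variable, and each accepted pair costs a sqrt and a log.
    """
    while True:
        u = 2.0 * uniform() - 1.0
        v = 2.0 * uniform() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:  # accept points strictly inside the unit circle
            f = math.sqrt(-2.0 * math.log(s) / s)
            return u * f, v * f
```

About 21 % of candidate pairs are rejected (1 − π/4), so on average the loop consumes slightly more than 2.5 uniform draws per normal pair.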

As described in the sections mentioned above, all three of these methods rely on an underlying PRNG for generating uniformly distributed values, which are used to generate normally distributed values. For all the results presented here concerning the normal distribution methods, the underlying PRNG used is the xorshift random number generator described in chapter 2.3. It should perhaps be reiterated that the normal distribution methods tested require a different number of calls to the underlying PRNG to generate a random normally distributed value. The output from these methods also differs. In the current application, the value from the normal distribution methods eventually has to take the form of an 8-bit signed integer, as described in chapter 3.2.1. The results presented here concerning execution time are for the average generation time of an 8-bit integer value. This means that for the Polar method and the Ziggurat algorithm, the output has been converted to an 8-bit integer value.
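The conversion step mentioned above can be sketched as a scale-round-clamp operation; the exact scaling used in the simulator is defined in chapter 3.2.1, so the µ and σ values here are only the illustrative ones from the empirical-CDF example:

```python
def to_soft_bit(x, mu=32, sigma=10):
    """Map a standard-normal sample x to a signed 8-bit soft-bit value.

    Illustrative sketch: scale to the target mean/std, round to the
    nearest integer, and clamp to the representable range [-128, 127].
    mu and sigma match the example parameters of Figure 4.10, not
    necessarily the simulator's operating point.
    """
    v = int(round(mu + sigma * x))
    return max(-128, min(127, v))
```

For the Polar and Ziggurat methods this conversion is extra work on top of the floating-point sample generation, whereas a table of precomputed 8-bit values skips it entirely.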

The execution time for generating an 8-bit integer value with each of these normal distribution methods is shown in Figure 4.9. The values are calculated as an average over 10 000 consecutive calls to the respective method. The values are also shown in Table 4.2.

Figure 4.9: The average execution time, measured in clock cycles, for generating an 8-bit normally distributed integer value

Table 4.2: The average execution time, measured in clock cycles, for generating an 8-bit normally distributed integer value

Method              Clock cycles
Table method        9.51
Ziggurat algorithm  126.27
Polar method        249.68
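The table method's advantage is that sampling reduces to a single indexed read. As a rough sketch of how such a table can be built (the thesis's exact construction is given in chapter 3.4; the table size and rounding used here are assumptions for illustration), each 8-bit value appears with a frequency proportional to the discretized normal density:

```python
import math

def build_table(mu=32, sigma=10, size=4096):
    """Build a lookup table of 8-bit values whose frequencies follow
    a discretized normal distribution N(mu, sigma^2).

    Sampling is then table[prng() % len(table)]: one PRNG call and
    one memory read, with no transcendental math at run time.
    Illustrative sketch only; chapter 3.4 defines the actual method.
    """
    norm = sigma * math.sqrt(2.0 * math.pi)
    table = []
    for v in range(-128, 128):
        p = math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2)) / norm
        table.extend([v] * round(p * size))  # count ~ pmf(v) * size
    return table
```

The rounding to integer counts is one source of the deviation from a true normal distribution visible in the empirical CDF below; the granularity improves as the table grows, at the cost of memory.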

The empirical distribution for these different methods can be seen in figure 4.10. In this figure, the values of the mean µ and the standard deviation σ have been set to 32 and 10 respectively. The theoretical CDF of integer values from a normal distribution with these parameters is also shown in the figure. The empirical CDF:s for the Polar method and the Ziggurat algorithm have been created using 10 000 samples from their output, using the xorshift PRNG as the underlying source of uniformly distributed random variables. The CDF of the table method has instead been created directly from the table generated in this method. The x-axis is set to the interval [−128, 127], which is the available interval for the 8-bit soft bit values. Figure 4.11 shows the same graph but zoomed in around where the CDF approaches 1.

[Figure: Empirical CDF. Curves: Theoretical, Polar method, Ziggurat algorithm, Table method]

Figure 4.10: The empirical CDF:s for µ = 32 and σ = 10 of the table method, Ziggurat algorithm and Polar method alongside the theoretical CDF for a normal distribution of integer values
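Computing an empirical CDF over the 8-bit output range is straightforward counting; a sketch of the procedure used to produce curves like those above (assuming samples are already clamped to the signed 8-bit range) is:

```python
def empirical_cdf(samples):
    """Empirical CDF over the signed 8-bit range [-128, 127].

    Returns a list of 256 values where cdf[i] is the fraction of
    samples less than or equal to (i - 128).
    """
    n = len(samples)
    counts = [0] * 256
    for s in samples:
        counts[s + 128] += 1  # shift to non-negative index
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / n)
    return cdf
```

For the table method, the counts can be read directly from the generated table instead of drawing samples, which is why its CDF in the figure is exact rather than sample-based.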
