
Institutionen för systemteknik

Department of Electrical Engineering

Master's Thesis (Examensarbete)

Hardware Implementation and Assessment of a Soft

MIMO Detector Based On SUMIS

Master's thesis carried out in Communication Systems at the Institute of Technology at Linköping University

by

Tomas Frostensson

LiTH-ISY-EX--13/4664--SE

Linköping 2013

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Mirsad Čirkić

ISY, Linköpings universitet

Sven Holmquist

Synective Labs

Examiner: Daniel Persson

ISY, Linköpings universitet


Avdelning, Institution / Division, Department: Division of Communication Systems, Department of Electrical Engineering, SE-581 83 Linköping

Datum / Date: 2013-05-20
Språk / Language: Engelska / English
Rapporttyp / Report category: Examensarbete
URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-92627
ISBN: —
ISRN: LiTH-ISY-EX--13/4664--SE
Serietitel och serienummer / Title of series, numbering: —
ISSN: —
Titel / Title: Hardware Implementation and Assessment of a Soft MIMO Detector Based On SUMIS
Författare / Author: Tomas Frostensson
Sammanfattning / Abstract: reproduced on the following page
Nyckelord / Keywords: —

Abstract

One technique to allow faster and more reliable wireless communication is to use multiple antennas in the transmitter and the receiver. This technique is called MIMO. The use of MIMO adds complexity to the receiver, which must determine what the transmitter actually sent. This thesis focuses on a hardware implementation, suitable for an FPGA, of a detection algorithm called SUMIS.

A background to detection and SUMIS in particular is given as a theoretical aid for a better understanding of how an algorithm like this can be implemented. An introduction to hardware and digital design is also presented.

A subset of the operations in the SUMIS algorithm, such as matrix inversion and the sum of logarithmic values, are analyzed and suitable hardware architectures are presented. These operations are implemented in RTL hardware using VHDL targeted for an FPGA, a Virtex-6 from Xilinx.

The accuracy of the implemented operations is investigated, showing promising results, alongside a presentation of the necessary resource usage.

Finally, other approaches to hardware implementation of detection algorithms are discussed, and more suitable approaches for a future implementation of SUMIS are commented on. The key aspects are flexibility through software reprogrammability and area efficiency by designing a custom processor architecture.


Acknowledgments

I would like to thank my examiner Daniel Persson and my supervisor Mirsad Čirkić at ISY for examining and providing feedback during this master’s thesis. It has been interesting to hear about the problems associated with the subject from another point of view rather than just my own.

I would like to acknowledge everyone at Synective Labs in Gothenburg for the friendly atmosphere and the opportunity for discussions. I also appreciate the feedback from my opponent Emelie Nilsson, which led to a better report.

Gothenburg, May 2013 Tomas Frostensson


Contents

Notation

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Limitations
  1.4 Outline

2 Theory
  2.1 MIMO
  2.2 Detection
    2.2.1 Soft Detection
  2.3 SUMIS
    2.3.1 First Stage
    2.3.2 Second Stage
    2.3.3 Complexity Selection
  2.4 Number Representation
  2.5 Hardware Introduction
  2.6 Programmable Hardware
    2.6.1 Hardware Flow
    2.6.2 Reusable Modules

3 Problem Analysis
  3.1 Overview
  3.2 Matrix Multiplication
  3.3 Matrix Inversion
    3.3.1 LDL^T Decomposition
    3.3.2 Reciprocal
    3.3.3 Forward Substitution
    3.3.4 Final Steps
  3.4 Log Sum of Exponentials

4 Methodology and Equipment
  4.1 Modeling
  4.2 VHDL
  4.3 RTL
  4.4 Hardware

5 Implementation
  5.1 Overview
  5.2 Matrix Multiplication
    5.2.1 IP Block Trade-offs
    5.2.2 Interface
    5.2.3 Example Implementation
  5.3 Matrix Inversion
    5.3.1 LDL^T Decomposition
    5.3.2 Reciprocal Unit
    5.3.3 Forward Substitution
  5.4 Jacobi Logarithm

6 Result and Analysis
  6.1 Testing and Measurements
    6.1.1 Matrix Multiplication
    6.1.2 LDL^T Decomposition
    6.1.3 Forward Substitution
    6.1.4 Jacobi Logarithm
  6.2 Resource Usage
    6.2.1 Matrix Multiplication
    6.2.2 Matrix Inversion
    6.2.3 Jacobi Logarithm
  6.3 Remaining Work
    6.3.1 Hyperbolic Tangent
    6.3.2 Exponential Function
    6.3.3 Additional Matrix Operations
    6.3.4 Control Structure
  6.4 Improvements
    6.4.1 Hardware Time-Multiplexing and Control
    6.4.2 Wordlength Optimization or Floating Point Implementation
    6.4.3 Design Space Exploration using High Level Synthesis
  6.5 Alternative Approaches and Comparison
  6.6 Insights from Alternative Approaches
    6.6.1 Number Representation
    6.6.2 Processor Architecture
    6.6.3 Flexibility
    6.6.4 Integration
  6.7 Final Conclusions

Bibliography

Notation

Number sets

Notation   Meaning
R          Set of real numbers
C          Set of complex numbers

Abbreviations

Abbreviation   Meaning
ASIC           Application-Specific Integrated Circuit
BRAM           Block RAM
CORDIC         Coordinate Rotation Digital Computer
FFT            Fast Fourier Transform
FPGA           Field Programmable Gate Array
HDL            Hardware Description Language
IEEE           Institute of Electrical and Electronics Engineers
IP             Intellectual Property
JTAG           Joint Test Action Group
LLR            Log-Likelihood Ratio
LUT            Lookup Table
MAC            Multiply and Accumulate
MIMO           Multiple-Input and Multiple-Output
OFDM           Orthogonal Frequency-Division Multiplexing
QAM            Quadrature Amplitude Modulation
RAM            Random Access Memory
RTL            Register Transfer Level
SIMD           Single Instruction, Multiple Data
SNR            Signal-to-Noise Ratio
SUMIS          Subspace Marginalization with Interference Suppression
VHDL           VHSIC Hardware Description Language
VHSIC          Very High Speed Integrated Circuit

1 Introduction

One technique to improve wireless communication reliability as well as performance is to use multiple antennas in the transmitter and receiver; this technique is called MIMO.

Unfortunately, this technique adds complexity to the receiver, since the receiver has to determine what was actually sent given the overlapping input from multiple antennas. Efficient methods must therefore be developed to cope with this complexity under the strict real-time demands of a communication system.

1.1 Background

The main area of this thesis is the implementation aspect of detection algorithms in the receiver used in a MIMO system.

The background for this thesis is a detection algorithm described in the conference paper [Čirkić and Larsson, 2012] and in more detail in the longer article [Čirkić and Larsson, 2012]. These papers present a detection algorithm called SUMIS (subspace marginalization with interference suppression), which has shown promising results at a lower complexity compared to other detection algorithms.

The high-level description of the mathematics involved in the detection given in the mentioned papers does not disclose how it could be efficiently implemented in hardware for use in a real wireless system. Therefore this thesis examines the implementation aspects of the proposed algorithm.


1.2 Goal

The goal of this thesis is to evaluate and assess suitable hardware structures for the implementation of a soft MIMO detector based on the SUMIS algorithm on an FPGA.

The selected operations of the SUMIS algorithm, described in Chapter 3, will be implemented in hardware and discussed. The implementation aspects of the algorithm will be examined to see what must be taken into consideration when implementing such a detection algorithm.

The algorithm will be evaluated to determine how suitable it is for real-time implementation in contemporary and future wireless systems.

Implementation-wise it should serve as a proof of concept with discussion about possible improvements rather than providing a solution ready for production.

1.3 Limitations

Limitations have been made to reduce the complexity and keep the workload associated with this thesis reasonable. The number of antennas supported is considered constant, and the modulation is fixed to 16-QAM, since it affects the size of the numbers involved.

The main limitation is that only a subset of the operations involved in the SUMIS algorithm has been considered for hardware implementation, and these are described in Chapter 3.

1.4 Outline

The thesis is divided into several chapters. Chapter 2 describes the background theory that is useful for the understanding of the succeeding chapters.

The selected problems that must be solved are described in Chapter 3, with accompanying algorithms and possible solutions to the problems. The hardware that was utilized and the methodology used for the implementation are described in Chapter 4.

The step of actual hardware implementation is presented in Chapter 5 where the individual modules are described.

Finally, the results of the implementation, measurements and comparisons with other implementations can be seen in Chapter 6. The chapter also contains discussions about future work and implementation aspects of the SUMIS algorithm.

2 Theory

This chapter describes the background theory that is necessary to comprehend other sections of this thesis.

2.1 MIMO

A MIMO communication system is a communication system that uses multiple antennas for transmission as well as for reception. A basic setup of a MIMO system can be seen in Figure 2.1.

Figure 2.1: A MIMO system using N_t transmit and N_r receive antennas

A real valued MIMO channel can be seen as

y = Hs + e,    (2.1)

(18)

where H ∈ R^{N_r × N_t} denotes the channel matrix. Each entry of the matrix is a possible path from the transmitter to the receiver; it therefore contains N_r × N_t elements, which are all the possible paths from the transmitting antennas to the receiving antennas. The vector s ∈ S^{N_t} contains the modulated symbols that the transmitter will try to send, where S is the set containing the possible symbols. The vector e ∈ R^{N_r} is the noise vector e ∼ N(0, (N_0/2) I), containing additive Gaussian noise with zero mean and variance N_0/2. Finally, y ∈ R^{N_r} is the vector with the received symbols as seen by the receiver.

As mentioned before, the MIMO channel described in Equation 2.1 is real valued. A complex channel is more common, but as described in [Larsson and Jalden, 2008], every complex channel can, given a few prerequisites, be posed as a real model. This is straightforward since C^n is isomorphic to R^{2n}. A real model is used since it simplifies the explanation of the SUMIS algorithm, and this model can easily be derived from a complex valued model.
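As an illustration, the standard real-valued decomposition can be written out as a short Python model (a sketch under the usual stacking convention; the function and variable names are illustrative assumptions, not thesis code):

    import numpy as np

    def complex_to_real_model(H, s, e):
        """Map a complex channel y = Hs + e to an equivalent real model
        by stacking real and imaginary parts."""
        Hr = np.block([[H.real, -H.imag],
                       [H.imag,  H.real]])
        sr = np.concatenate([s.real, s.imag])
        er = np.concatenate([e.real, e.imag])
        return Hr, sr, er

    rng = np.random.default_rng(0)
    H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    s = rng.choice([-3, -1, 1, 3], size=4) + 1j * rng.choice([-3, -1, 1, 3], size=4)
    e = 0.1 * (rng.normal(size=4) + 1j * rng.normal(size=4))
    y = H @ s + e
    Hr, sr, er = complex_to_real_model(H, s, e)
    yr = np.concatenate([y.real, y.imag])
    assert np.allclose(yr, Hr @ sr + er)   # the 8-dimensional real model matches

This is why a 4 × 4 complex setup corresponds to 8 × 8 real matrices later in the thesis.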

2.2 Detection

The principle of detection in MIMO systems is to determine s given y, as described in Equation 2.1. The channel matrix H is assumed to be known to the receiver, and in practice it often is, through estimation.

Detection can be divided into two subcategories: hard detection and soft detection. Hard detectors give an estimate of s without additional information, while soft detectors provide both an estimate of s and probability information for each bit in the symbols in s. This means that the detector provides information about how accurate the estimated s is at bit level.

Since detectors in communication systems are commonly used together with a coding scheme, this probability information is useful when trying to decode the received symbol. If it is known to the decoder that a specific bit in the received symbol has a lower probability of being correct, it can be possible to achieve a lower error rate by inverting that bit.

As the title of this thesis describes, the focus lies mainly on soft detectors.

2.2.1 Soft Detection

The information that the detector can provide the decoder with is the log-likelihood ratio, LLR, which is the logarithm of the likelihood ratio. The likelihood ratio is a statistical test to compare the fit of two models, in this case whether a zero or a one was transmitted given the received data. The ratio tells how many times more likely one case is than the other.

With this ratio expressed for each of the received bits, the decoder can use this knowledge to decode the received data correctly. With the ratio expressed in the logarithmic domain, the sign will show the hard detection, i.e. whether the detector detected a zero or a one, while the magnitude of the ratio will tell how accurate this detection is. The log-likelihood ratio is

l(s_i \mid y) = \log \frac{\sum_{\forall s \in \{s : s_i = 1\}} \exp\left(-\frac{1}{N_0} \lVert y - Hs \rVert^2\right)}{\sum_{\forall s \in \{s : s_i = 0\}} \exp\left(-\frac{1}{N_0} \lVert y - Hs \rVert^2\right)},    (2.2)

given that the symbols are uniformly distributed, i.e. it is equally probable that a zero or a one is sent.

The sums in Equation 2.2 are over the set {s : s_i = x}, which means all possible vectors s where the i:th bit is x = 1 or x = 0 respectively.

The computation effort needed to calculate the log-likelihood ratio grows polynomially with the number of possible symbols in the constellation and exponentially with the number of transmit antennas, N_t. If |S| is the number of possible symbols an element of s can take, the complexity of the calculation is proportional to |S|^{N_t}. This is the big limitation when it comes to MIMO detectors: as the constellation size and the number of antennas grow, the computation effort becomes impractical to deal with.
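To make the complexity concrete, the following Python sketch evaluates Equation 2.2 by brute force. It is purely illustrative (it treats the sign of each real symbol as its "bit", which is a simplifying assumption rather than the thesis' bit mapping); the point is the |S|^{N_t} enumeration:

    import itertools
    import numpy as np

    def llr_brute_force(y, H, N0, S, i):
        """LLR for symbol i per Equation 2.2, enumerating all |S|^Nt vectors."""
        Nt = H.shape[1]
        num = den = 0.0
        for cand in itertools.product(S, repeat=Nt):
            s = np.asarray(cand, dtype=float)
            metric = np.exp(-np.sum((y - H @ s) ** 2) / N0)
            if s[i] > 0:
                num += metric   # hypothesis: "bit" of symbol i is 1
            else:
                den += metric   # hypothesis: "bit" of symbol i is 0
        return np.log(num / den)

    rng = np.random.default_rng(0)
    H = rng.normal(size=(4, 4))
    S = [-3.0, -1.0, 1.0, 3.0]           # real 16-QAM amplitudes per dimension
    s = rng.choice(S, size=4)
    y = H @ s + 0.1 * rng.normal(size=4)
    print(llr_brute_force(y, H, N0=1.0, S=S, i=0))   # 4^4 = 256 vectors enumerated

Already at N_t = 4 real dimensions this enumerates 256 vectors; at 8 dimensions it would be 65536, which illustrates why approximate methods are needed.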

Numerous methods that deal with this complexity by introducing approximations exist, such as sphere decoding in [Chu and McAllister, 2012]. The method that is investigated further in this thesis is SUMIS, which is introduced in [Čirkić and Larsson, 2012]. SUMIS is based upon a mix of two approaches: partial marginalization and soft interference cancellation. Partial marginalization is further described in [Larsson and Jalden, 2008], [Čirkić et al., 2011], [Persson and Larsson, 2011] and [Persson et al., 2012]. Soft interference cancellation is described in [Lampe and Huber, 1999] and [Choi et al., 2000].

2.3 SUMIS

One of the main concepts in the SUMIS algorithm is to partition Equation 2.1 into

y = \bar{H}\bar{s} + \tilde{H}\tilde{s} + e.    (2.3)

The partitioning can be used to group together \tilde{H}\tilde{s} + e and treat it as interference and noise.

The partition in Equation 2.3 is dependent on the parameter n_s ∈ {1, ..., N_t}, which can be seen as a complexity parameter. This complexity parameter determines how much effort will be put into the detection algorithm. The dimensions of the partitioned matrices will be as follows: \bar{H} ∈ R^{N_r × n_s}, \tilde{H} ∈ R^{N_r × (N_t − n_s)}, \bar{s} ∈ S^{n_s} and finally \tilde{s} ∈ S^{N_t − n_s}.

The partitioning must be chosen so that the interesting bit s_i is contained in \bar{s}. To be able to cover all of the available bits, it is necessary to have N_t different partitions, so that there is at least one partition containing each interesting bit.


If n_s = 1 it is easy to choose a partition for bit s_i, since only one exists, but for n_s > 1 it is a more complex problem. In [Čirkić and Larsson, 2012, Section 3C] a suitable approach to perform this selection is presented. The approach is to base the selection on the matrix product H^T H. The goal is to minimize the impact of \tilde{H}\tilde{s} + e on the selected columns that will be contained in \bar{H}. This is achieved by selecting the column in H^T H that contains the interesting bit, alongside the n_s − 1 columns that contain the largest values intersecting the chosen column. This will leave the remaining columns to \tilde{H}, and the impact will be minimized.
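The selection heuristic can be sketched in a few lines of Python (an illustration of the described idea, not thesis code; the function name and tie-breaking behaviour are assumptions):

    import numpy as np

    def select_partition(H, i, ns):
        """Keep column i plus the ns-1 columns most strongly coupled to it
        in H^T H; the remaining columns form H-tilde."""
        G = H.T @ H
        coupling = np.abs(G[:, i]).astype(float)
        coupling[i] = np.inf                    # column i is always kept
        order = np.argsort(-coupling)           # strongest coupling first
        return np.sort(order[:ns]), np.sort(order[ns:])

    H = np.random.default_rng(5).normal(size=(8, 8))
    bar_idx, tilde_idx = select_partition(H, i=2, ns=3)
    print(bar_idx, tilde_idx)                   # column indices for H-bar and H-tilde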

2.3.1 First Stage

Given Equation 2.3 it is possible to choose an approximate model

y \approx \bar{H}\bar{s} + n,    (2.4)

where n ∼ N(0, Q) and Q = \tilde{H}\tilde{H}^T + (N_0/2) I.

The key point of Equation 2.4 is that the computations can be simplified by assuming that the interference from \tilde{H}\tilde{s} can be seen as Gaussian noise. With these assumptions made, it is possible to perform the first step of the SUMIS algorithm, which has the purpose of reducing the impact of the interfering terms. This is achieved by approximately computing the conditional expected value of each bit, and this computation is performed symbol-wise by first computing

\lambda_k = \log \frac{\sum_{\forall \bar{s} \in \{\bar{s} : s_k = 1\}} \exp\left(-\frac{1}{2} (y - \bar{H}\bar{s})^T Q^{-1} (y - \bar{H}\bar{s})\right)}{\sum_{\forall \bar{s} \in \{\bar{s} : s_k = 0\}} \exp\left(-\frac{1}{2} (y - \bar{H}\bar{s})^T Q^{-1} (y - \bar{H}\bar{s})\right)},    (2.5)

followed by

E\{s_k \mid y\} = \tanh\left(\frac{\lambda_k}{2}\right).    (2.6)

2.3.2 Second Stage

The purpose of the second stage of the SUMIS algorithm is to suppress the interfering vector \tilde{s}. The first step is defining a new model to suppress this vector, and this model is

y' \approx \bar{H}\bar{s} + n',    (2.7)

where n' ∼ N(0, Q') and Q' = \tilde{H}\tilde{\Phi}\tilde{H}^T + (N_0/2) I. The matrix \tilde{\Phi} is the conditional covariance matrix of \tilde{s} and is described by

\tilde{\Phi} = E\{\tilde{S}^2 \mid y\} - E\{\tilde{S} \mid y\}^2.    (2.8)

In Equation 2.8 the matrix \tilde{S} is a diagonal matrix with the diagonal consisting of the elements of \tilde{s}. With all of these computations performed, the model can be assumed to be purified and it is possible to calculate the desired LLRs. The main difference from Equation 2.2 is that these computations in SUMIS are over the space spanning n_s dimensions instead of the original N_t dimensions. This computation is performed for each bit and is described by

l(s_i \mid y) \approx \log \frac{\sum_{\forall \bar{s} \in \{\bar{s} : s_i = 1\}} \exp\left(-\frac{1}{2} (y' - \bar{H}\bar{s})^T Q'^{-1} (y' - \bar{H}\bar{s})\right)}{\sum_{\forall \bar{s} \in \{\bar{s} : s_i = 0\}} \exp\left(-\frac{1}{2} (y' - \bar{H}\bar{s})^T Q'^{-1} (y' - \bar{H}\bar{s})\right)}.    (2.9)

Since the LLRs are the information desired by the decoder, the SUMIS algorithm has completed its task.

2.3.3 Complexity Selection

As can be seen in the previous sections, n_s is the complexity parameter of the algorithm and can be assumed to be much smaller than N_t. With n_s = N_t the benefits of SUMIS are nonexistent, since \bar{H} = H and the complete computation in Equation 2.2 will be performed. The work in [Čirkić and Larsson, 2012] further describes possible optimizations to minimize the computations needed, and these results have been used when selecting the operations to be analysed. One aspect is that the inverse Q^{-1} can be computed for all of the partitions by inverting a larger matrix of dimension N_t, followed by smaller inverses of dimension n_s.

2.4 Number Representation

Throughout the thesis, a fixed point number representation is used for the hardware implementation. A fixed point number representation is used to represent a decimal number using a limited number of bits. The wordlength denotes the number of bits used.

To be able to understand how the number representation works, it is possible to start with how a regular integer is represented using two's complement. This can be exemplified by

X = -x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i,    (2.10)

which denotes the value of a number X represented by the N bits x_{N-1} ... x_0.

With an N-bit binary number as described in Equation 2.10, any integer in the range -2^{N-1} \le X \le 2^{N-1} - 1 can be represented.

With the knowledge of how to represent whole numbers, it is possible to move on to decimal numbers. These numbers can be represented by allocating a number of bits for the integer part of the number and the rest for the fractional part. This is achieved by applying a scaling factor to the number, as can be seen in

X = 2^{-f} \left(-x_{N-1} \cdot 2^{N-1} + \sum_{i=0}^{N-2} x_i \cdot 2^i\right),    (2.11)

which also features an N-bit binary number like the one in Equation 2.10, but this time representing a decimal number.

The number represented by Equation 2.11 is scaled by 2^{-f}, which means that f bits have been allocated for the fractional part and the remaining N − f bits represent the integer part and sign.

The number can be in the range -2^{N-1-f} \le X \le 2^{N-1-f} - 2^{-f}, in steps of 2^{-f}. One big difference compared to a floating point representation is that the resolution is constant over the whole number range.
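As a concrete illustration of Equations 2.10 and 2.11, the following Python sketch (not thesis code) quantizes a real number to an N-bit word with f fractional bits and converts it back:

    def to_fixed(value, N, f):
        """Quantize to an N-bit two's complement word with f fractional bits
        (the representation of Equation 2.11); returns the stored integer."""
        word = int(round(value * (1 << f)))           # scale by 2^f and round
        lo, hi = -(1 << (N - 1)), (1 << (N - 1)) - 1
        return max(lo, min(hi, word))                 # saturate at the range limits

    def from_fixed(word, f):
        """Interpret a stored word as word * 2^-f."""
        return word / (1 << f)

    # sfixed(5 downto -12), used in Chapter 5, corresponds to N = 18, f = 12:
    w = to_fixed(3.14159, N=18, f=12)
    print(w, from_fixed(w, 12))   # 12868 3.1416015625

The rounding error is bounded by half a step, 2^{-f-1}, everywhere in the range, which is the constant-resolution property mentioned above.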

2.5 Hardware Introduction

To be able to fully comprehend the implementation aspect of this thesis, an introduction to digital design and hardware is necessary.

Digital circuits can be divided into two main areas: combinatorial and sequential. Combinatorial circuits perform boolean algebra on a given set of inputs to produce one or multiple output signals. They have no memory, and thus the output is only dependent on the provided input. Given the ability to express boolean algebra, many different kinds of circuits can be constructed; some examples are adders, which can add two numbers, and multiplexers, which work as switches with multiple inputs and one output.

The drawback with purely combinatorial circuits is that they are stateless because of the lack of memory. Sequential logic, on the other hand, groups together combinatorial circuits with memory elements, which allows the circuit to take into account not only the input signals but also the current state. The basic memory element of a sequential circuit is called a flip-flop. A common D-type flip-flop has a data input, a data output and a clock input. The flip-flop will only change its output value on the rising edge of the clock; otherwise it will retain the old value.

With sequential logic it is possible to create more advanced circuits such as finite state machines, counters and registers. A register is constructed using a flip-flop and a multiplexer and it has a load signal. When the load signal is low, the old value will remain regardless of the clock signal. When the load signal is high and there is a rising clock edge, a new value will be stored in the register.

Random access memories are very important in digital circuits and heavily used in this thesis. Such memories are much more suitable than flip-flops when there is a need to store greater amounts of data, since they are more area efficient. The memories have an address port, a data port and a write signal. With an address provided, the data stored at that particular address will be available on the data port with a certain delay. Using the write signal it is possible to store new data into the memory by selecting the correct address, providing data on the data port and asserting the write signal.

If necessary, a more detailed introduction to digital design can be obtained from [Danielsson and Bengtsson, 1996].

2.6 Programmable Hardware

When it comes to programmable hardware, the current choice is often to use an FPGA. An FPGA is a field-programmable gate array that can be configured to implement almost any digital design.

An FPGA is built up of small logic blocks that can be configured and connected to each other to implement different functions. Instead of using logic gates such as AND, OR and NOT, boolean functions are represented by their truth table. This truth table is stored in a small component called a LUT. The LUT is a lookup table with the input variables to the boolean function connected as an address, and the output is the value stored in the truth table. This allows a 4-input LUT to implement any boolean function with at most 4 inputs. Additional LUTs can be interconnected to implement boolean functions with more inputs. An FPGA does not only contain LUTs but also flip-flops that can be connected to the output of a LUT, which makes it possible to implement the sequential circuits mentioned in Chapter 2.5. All of these small components can be connected almost arbitrarily using a pre-existing routing network in the FPGA.

These components are necessary for a simple FPGA to function, but contemporary devices often include more hardware. Since the interconnection between the building blocks introduces overhead, the manufacturers often add additional building blocks that the customers are likely to use, such as multipliers and random access memories. If a memory were to be implemented using only flip-flops, the overhead would be substantial, and this would limit what else could be implemented at the same time. The same reasoning is valid for multipliers, since multiplication is complex to implement with the aid of only LUTs. Since multiplication is a common operation, the manufacturers are likely to include prefabricated blocks.

2.6.1 Hardware Flow

From the designer's point of view, the hardware is described using a hardware description language such as VHDL or Verilog. The hardware is described in terms of software, even though the code is supposed to be a description of hardware and not be executed on the hardware itself. The written code can be simulated as it is to verify the behaviour, even if not everything that can be simulated can be transformed to hardware.

The source code that describes the hardware can be synthesised into a netlist of building blocks such as LUTs and flip-flops appropriate for the targeted FPGA device. This can be seen as an analogy to how a compiler compiles software written in a high-level language into a low-level language.


The synthesised netlist can then be analysed by a tool referred to as place-and-route which organizes the building blocks into a structure suitable for the FPGA. The place-and-route then attempts to connect them using the routing network available in the FPGA. The result is a configuration file that can be loaded into the FPGA using a configuration interface such as JTAG.

2.6.2 Reusable Modules

With increasing demands on a fast time-to-market it has become more common to reuse existing building blocks as much as possible. These blocks are commonly referred to as IP cores or IP blocks where IP stands for intellectual property. These blocks can be anything from a simple counter to a complete processor and can be seen in analogy to the software world as a library.

This allows for a shorter implementation cycle since each IP block’s functionality can be verified beforehand and the block can often easily be integrated with the rest of the design.

It is common for FPGA manufacturers to provide a collection of simpler IP cores that can be used on their devices. The form the IP block is delivered in varies, it can be for example readable VHDL code or an already synthesised netlist.

3 Problem Analysis

This chapter provides an analysis of a subset of the operations, outlined in Chapter 3.1, that are needed for the implementation of the SUMIS algorithm.

3.1 Overview

A subset of the operations involved in the SUMIS algorithm was chosen for further analysis and hardware implementation. Since the algorithm relies heavily on matrix operations such as matrix multiplication and matrix inversion, these subproblems are described further in Chapter 3.2 and Chapter 3.3.

Since probabilities are handled in the log domain, there exist problems that have to be accounted for when summing them. This is described in Chapter 3.4.

3.2 Matrix Multiplication

Matrix multiplication is an integral part of the detection algorithm. Both matrix-matrix and matrix-vector multiplications are used heavily. A standard matrix-matrix multiplication is described by

AB = C,    (3.1)

where A ∈ R^{M×L}, B ∈ R^{L×N} and C ∈ R^{M×N}.

A naive algorithm for matrix multiplication can be seen in Algorithm 3.1. Other algorithms exist that reduce the number of multiplications but instead introduce several additions and subtractions, which affect the constant that is usually left out when discussing asymptotic complexity. This implies that the real benefit from a clever algorithm is only present when operating on very large matrices.

Algorithm 3.1 Matrix multiplication - naive algorithm

for i := 1 → M do
  for j := 1 → N do
    sum := 0
    for k := 1 → L do
      sum := sum + A[i][k] * B[k][j]
    end for
    C[i][j] := sum
  end for
end for

If N = M = L = 8, the number of multiply-and-add operations will be 512. In some of the matrix multiplications, such as H^T H, some of the operations could be reduced, since the result is symmetric around the diagonal. The drawback with these reductions is that the same matrix-multiply unit could not as easily be shared between the different operations. The advantage of a general matrix multiplication implementation is that it can be reused for all of the matrix multiplications of the same dimension that need to be computed.
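The symmetry argument can be illustrated with a short Python model (an illustration only; the hardware discussed later computes all 64 elements with a shared general unit):

    import numpy as np

    def gram_lower(H):
        """Compute G = H^T H exploiting symmetry: only the lower triangle
        (including the diagonal) is computed, then mirrored."""
        n = H.shape[1]
        G = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1):               # j <= i: lower triangle only
                G[i, j] = np.dot(H[:, i], H[:, j])
                G[j, i] = G[i, j]                # mirror to the upper triangle
        return G

    H = np.random.default_rng(1).normal(size=(8, 8))
    assert np.allclose(gram_lower(H), H.T @ H)
    # 8x8: 36 inner products instead of 64, i.e. 288 multiply-and-add instead of 512.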

3.3 Matrix Inversion

One of the obstacles in the detection algorithm is the need to calculate a matrix inverse. The matrix is sufficiently large that a practical closed form formula for calculating the inverse does not exist.

A common way to calculate the inverse of a larger matrix is to use some sort of decomposition to decompose the original matrix into a product of matrices. The matrices acquired from the decomposition have a regular structure, such as triangular or diagonal, that makes them easier to invert. The inverses of these individual matrices can be combined into the sought inverse of the original matrix. The following sections describe the steps involved in calculating the inverse, denoted Q^{-1}, of an original positive definite matrix Q, starting with the chosen method of decomposition.

3.3.1 LDL^T Decomposition

The chosen method of decomposition is the LDL^T decomposition described in [Golub and Van Loan, 1996]. The decomposition is closely related to the Cholesky decomposition, also described by the previously mentioned authors.

One of the advantages of the LDL^T decomposition compared to the Cholesky decomposition is that it avoids the square root, which is an expensive operation in hardware and favorable to avoid. The LDL^T decomposition demands that the matrix to be decomposed is symmetric and positive definite. It is possible to rewrite the matrix equations in the detection algorithm to fully comply with these prerequisites, to be able to utilize this decomposition. These rewrites are described in detail in [Čirkić and Larsson, 2012].

The decomposition can be described by

Q = LDL^T,    (3.2)

where L is a lower triangular matrix, D is a diagonal matrix containing only positive elements, and L^T is the transpose of L. A lower triangular matrix is a matrix where only the elements below and including the diagonal are non-zero. Pseudo code for the LDL^T decomposition can be seen in Algorithm 3.2, where the matrix Q is of dimension N. Loops are not evaluated if the lower bound is greater than the upper bound.

Algorithm 3.2 Algorithm for LDL^T decomposition. The input matrix is Q and the output matrix is L, along with the vector d, which is the diagonal of D

v := zeros(N, 1)
d := zeros(N, 1)
L := zeros(N, N)
for i := 1 → N do
  sum := 0
  for j := 1 → i − 1 do
    v[j] := L[i][j] * d[j]
    sum := sum + L[i][j] * v[j]
  end for
  v[i] := d[i] := Q[i][i] − sum
  rec := 1 / v[i]
  for j := i + 1 → N do
    sum := 0
    for k := 1 → i − 1 do
      sum := sum + L[j][k] * v[k]
    end for
    L[j][i] := (Q[j][i] − sum) * rec
  end for
end for

In Algorithm 3.2 it is required to have a temporary vector denoted v to store intermediate results. It is also possible to rewrite the algorithm to work in-place and store the resulting matrix L and vector d in the original matrix Q. The reason for not choosing that approach is for readability and ease of implementation.
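A direct Python transcription of Algorithm 3.2 (a model for verification, not thesis code; the unit diagonal of L is made explicit here) shows the decomposition at work:

    import numpy as np

    def ldlt(Q):
        """LDL^T decomposition following Algorithm 3.2 (0-based indexing).
        Returns L (unit diagonal) and d, the diagonal of D."""
        N = Q.shape[0]
        v = np.zeros(N)
        d = np.zeros(N)
        L = np.eye(N)
        for i in range(N):
            s = 0.0
            for j in range(i):
                v[j] = L[i, j] * d[j]
                s += L[i, j] * v[j]
            v[i] = d[i] = Q[i, i] - s
            rec = 1.0 / v[i]          # the division the reciprocal unit replaces
            for j in range(i + 1, N):
                s = 0.0
                for k in range(i):
                    s += L[j, k] * v[k]
                L[j, i] = (Q[j, i] - s) * rec
        return L, d

    A = np.random.default_rng(2).normal(size=(8, 8))
    Q = A @ A.T + 8 * np.eye(8)              # symmetric positive definite test input
    L, d = ldlt(Q)
    assert np.allclose(L @ np.diag(d) @ L.T, Q)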

3.3.2 Reciprocal

In the LDL^T decomposition described in Section 3.3.1, some divisions need to be performed. Division is by far the most expensive of the four basic math operations in terms of hardware area and speed. One effective approach is to calculate the reciprocal of the divisor and multiply that result with the dividend. This means that instead of dividing the number n by d, the reciprocal 1/d is calculated and the operation n · (1/d) is subsequently performed.

The reciprocal 1/d can be approximated using the Newton-Raphson method [Chen et al., 2005]. The Newton-Raphson method consists of choosing a function f(x) that is zero at x = 1/d and using Newton's method to approximate the root. A suitable function is

f(x) = \frac{1}{x} - d.    (3.3)

The Newton-Raphson method is an iterative method, and each iteration can be described by

x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)},    (3.4)

where x_{i+1} is the next approximation, closer to the root, while x_i is the value from the previous iteration.

the previous iteration.

Combining Equation 3.3 and Equation 3.4 gives

xi+1 = xi(2 − d ∗ xi) = 2 ∗ xid ∗ x2i. (3.5)

The performance of this algorithm is dependent on how good the guess of xi

for the first iteration, thus x0 is. A good approach to avoid excessive number of

iterations is to use a lookup table with an initial guess that can be correct for up to a few decimals. To store a complete table with the desired final precision is not feasible since this table will be very large.

3.3.3 Forward Substitution

When the lower triangular matrix L has been acquired, it is necessary to calculate L^{-1}, since this intermediate result is needed to produce the sought inverse described in Section 3.3.

It is possible to calculate L^{-1} by solving the matrix equation

L x_i = e_i,    (3.6)

for i = 1, ..., n, where e_i is the i:th column of the unit matrix and n is the dimension of L. The resulting vectors x_1, ..., x_n are the column vectors of L^{-1}.

These equations can be solved efficiently by applying forward substitution. An outline of a general algorithm to solve the equation described in Equation 3.6 can be seen in Algorithm 3.3.


Algorithm 3.3 Forward substitution - general algorithm

for i := 1 → N do
  for j := 1 → N do
    sum := 0
    for k := 1 → j − 1 do
      sum := sum + L[j][k] * x[k][i]
    end for
    x[j][i] := (e[j][i] − sum) / L[j][j]
  end for
end for

Since Algorithm 3.3 is general, it does not use all available knowledge about the matrices x = (x_1, ..., x_n) and e = (e_1, ..., e_n). If L is of dimension 8, this algorithm needs 224 multiply-and-add operations, 64 subtractions and 64 divisions. The number of operations can be reduced by adapting the algorithm to this particular case, using the prior knowledge available about the input and output data.

What prior knowledge can be utilized to decrease the number of operations? The following knowledge can be considered useful:

1. L is unitriangular, which means that the diagonal consists of only ones.
2. The inverse of a lower triangular matrix is also a lower triangular matrix.
3. e is a unit matrix.

The first assumption effectively eliminates the divisions since all of the divisions will be by one. This assumption also gives the fact that the diagonal of x will consist of only ones.

The second assumption will change the limits on the second innermost loop since only the lower triangular matrix of the result will be non-zero. It will also change the limits on the innermost loop since the upper triangular part of x will be zero. Since e is a unit matrix the first multiply-and-add operation when k = i will be a multiplication by one and thus can be eliminated and lifted outside of the loop. With these changes, the number of operations has been greatly reduced. If L is of dimension 8, the operation count is now 56 multiply-and-add and 28 subtractions. The modified algorithm can be seen in Algorithm 3.4.


Algorithm 3.4 Forward substitution - optimized for this particular case

for i := 1 → N do
  x[i][i] := 1
  for j := i + 1 → N do
    sum := L[j][i]
    for k := i + 1 → j − 1 do
      sum := sum + L[j][k] * x[k][i]
    end for
    x[j][i] := −sum
  end for
end for
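A Python transcription of Algorithm 3.4 (a verification model, not thesis code) confirms that the optimized forward substitution needs no divisions:

    import numpy as np

    def invert_unit_lower(L):
        """Invert a unit lower triangular matrix following Algorithm 3.4
        (0-based indexing); note the absence of divisions."""
        N = L.shape[0]
        X = np.zeros((N, N))
        for i in range(N):
            X[i, i] = 1.0
            for j in range(i + 1, N):
                s = L[j, i]
                for k in range(i + 1, j):
                    s += L[j, k] * X[k, i]
                X[j, i] = -s
        return X

    L = np.tril(np.random.default_rng(3).normal(size=(8, 8)), -1) + np.eye(8)
    assert np.allclose(invert_unit_lower(L), np.linalg.inv(L))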

3.3.4 Final Steps

As of now, L^{-1} has been obtained from the forward substitution in Chapter 3.3.3. One additional matrix is needed for the calculation of the matrix inverse: D^{-1}. This matrix can be obtained for free from the LDL^T decomposition in Chapter 3.3.1 by taking the values from the reciprocal unit instead of the values from the d vector, since D is diagonal and thus D^{-1} consists of the reciprocal values of D.

The matrix inverse Q^{-1} can now be obtained by

Q^{-1} = L^{-T} D^{-1} L^{-1},    (3.7)

where the matrix L^{-T} is the transpose of L^{-1}. With these final matrix multiplications, the inverse Q^{-1} has been calculated.
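Tying the stages together, a short sketch (reusing the hypothetical ldlt and invert_unit_lower models from the previous sections) reproduces Equation 3.7:

    import numpy as np

    def invert_spd(Q):
        """Combine the stages per Equation 3.7: Q^-1 = L^-T D^-1 L^-1."""
        L, d = ldlt(Q)                            # LDL^T decomposition (Section 3.3.1)
        Linv = invert_unit_lower(L)               # forward substitution (Section 3.3.3)
        return Linv.T @ np.diag(1.0 / d) @ Linv   # reciprocals of d give D^-1

    A = np.random.default_rng(4).normal(size=(8, 8))
    Q = A @ A.T + 8 * np.eye(8)                   # symmetric positive definite test matrix
    assert np.allclose(invert_spd(Q), np.linalg.inv(Q))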

3.4 Log Sum of Exponentials

In the SUMIS algorithm, and in detection algorithms in general, probabilities are handled in log space. The reason for this is that when performing calculations on small probabilities, the result will be greatly affected by the precision used when performing the calculations. If the calculations are performed in log space, the quantities will be scaled to a workable range where the precision does not affect the result as much.

When performing calculations in log space, regular multiplication will be mapped to addition, division to subtraction, and exponentiation will be mapped to multiplication. A summary of these identities can be seen in Table 3.1.

Operation     Log space
log(a * b)    log(a) + log(b)
log(a / b)    log(a) - log(b)
log(a^b)      b * log(a)

Table 3.1: Computations in log space

The drawback of computations in log space is that a suitable mapping for addition does not exist. The operation that must be performed is

\log(a + b) = \log(e^{\log(a)} + e^{\log(b)}).    (3.8)

Note that a and b are not actually stored, but instead their logarithmic counterparts log(a) and log(b).

Apart from requiring several operations, including exponentiation and a subsequent logarithm, Equation 3.8 has additional drawbacks. If one of the probabilities a or b is very small, underflow might occur and its value will disappear in the addition. If multiple probabilities are summed, overflow is possible, since the sum might be very large.

With these limitations in mind, it is possible to rewrite Equation 3.8 and normalize the calculations using the larger of the two probabilities. The rewrite yields

\log(e^{\log(a)} + e^{\log(b)}) = \log\left(e^{\max(\log(a), \log(b))} \left(1 + e^{-|\log(a) - \log(b)|}\right)\right) = \max(\log(a), \log(b)) + \log\left(1 + e^{-|\log(a) - \log(b)|}\right),    (3.9)

and is often denoted the "Jacobi logarithm".

As can be seen in Equation 3.9, the summation of the two probabilities in log space is performed by selecting the maximum of the two probabilities and adding the additional logarithmic expression to it.

The advantage of this method is that the remaining logarithmic expression is limited in size. Its maximum value will be log(2) ≈ 0.69 and it will approach 0 when the difference between log(a) and log(b) grows large. Since the expression is limited to a small range, it can be precalculated and stored in a table to allow faster computations.
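A small Python model (illustrative; the table range and size are assumptions, not thesis parameters) shows both the exact Jacobi logarithm and its table-based form:

    import math

    def jacobi_log(log_a, log_b, table=None):
        """Sum two probabilities stored in log space via Equation 3.9.
        With a precomputed correction table, the table stands in for the
        hardware lookup; otherwise the correction is computed exactly."""
        diff = abs(log_a - log_b)
        if table is None:
            corr = math.log1p(math.exp(-diff))       # exact: log(1 + e^-|x|)
        else:
            step = 4.0 / len(table)                  # table covers x in [0, 4)
            corr = table[int(diff / step)] if diff < 4.0 else 0.0
        return max(log_a, log_b) + corr

    table = [math.log1p(math.exp(-(k + 0.5) * 4.0 / 256)) for k in range(256)]
    print(jacobi_log(math.log(0.4), math.log(0.6)))          # log(0.4 + 0.6) = 0.0
    print(jacobi_log(math.log(0.4), math.log(0.6), table))   # table approximation

Beyond the table range the correction is close enough to zero to be dropped, which is what bounds the table size in hardware.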

4 Methodology and Equipment

This chapter describes the methodology and technology involved in the project.

4.1 Modeling

The individual sections that had to be implemented in hardware were first analyzed using Matlab with high level matrix constructs and operations. The operations were then rewritten using lower level abstractions, implementing the matrix operations in separate functions. This allowed for an easier way to transform the software into a suitable hardware structure.

The number range was investigated using Matlab to see how large the largest numbers were in the different sections of the algorithm, and therefore how many bits the numbers had to be represented by. Numeric scopes were widely used, since they allowed visualization of the precision needed.

4.2 VHDL

The hardware description language used in this thesis is VHDL. In VHDL it is common, when working with fixed point numbers, to use an ordinary data type called std_logic_vector that simply contains a number of bits, and to think of the decimal point as implicit. This approach is suitable only for very simple designs and is not that easy to extend or rework, since the interpretation of the data type is not explicitly specified.

In this thesis, a fixed point package included in the VHDL-2008 standard [IEEE, 2009] has been used instead of the simple approach. The package is named fixed_pkg and is further described in [Bishop, 2008]. The package contains the necessary definitions to perform basic arithmetic on signed and unsigned fixed point numbers. The package allows the wordlengths, both integer and fractional, to be configured more easily by using constants or generics instead of hard coding the word lengths in each and every one of the operations performed.

For some of the operations there are IP blocks available. These are configured using the tool CoreGen from Xilinx. The use of IP blocks when designing hardware is described in Chapter 2.6.2. The IP blocks might lack in flexibility but will reduce design time immensely.

The hardware structure was simulated with Mentor Graphics ModelSim to ensure correct functionality. The VHDL source code was synthesized and adapted for the FPGA using tools from Xilinx.

4.3 RTL

The design abstraction used when describing the hardware is register transfer level. This means that the VHDL source code describes registers and the operations performed while transferring from one register to another.

Finite state machines have been described using two separate parts: one purely combinatorial part that produces the next state and the appropriate outputs, and one sequential part. The sequential part only stores the next state into the state registers.

Records have been heavily used, since if the registers are grouped together it is easier to add additional registers without much rewriting. Records in VHDL are the equivalent of structs in, for example, C.

4.4 Hardware

The FPGA used for the project is delivered by Xilinx and is a member of the Virtex-6 family, namely the XC6VLX240T, speed grade -2. More information about this family of devices can be found in [Xilinx Inc, 2012]. The development board used is delivered by Hitech Global and is a PCI-Express based board. The PCI-Express connection allows the board to be connected to the PCI-Express bus of a computer as a peripheral card and be interfaced from running software.

How a common FPGA is constructed is described in Chapter 2.6. FPGAs from Xilinx are divided into blocks called slices. Each slice in a Virtex-6 FPGA contains four LUTs and eight flip-flops with the appropriate interconnect circuitry. The FPGA also contains RAMs, denoted block RAM or BRAM, and other dedicated hardware such as clock managers that can generate an arbitrary clock frequency from the system clock.

One important type of dedicated block is called DSP48E1. This is a highly optimized building block containing a 25×18 bit multiplier and an adder. It also has a register, so it can accumulate the calculated result. This block can perform numerous operations, and its behavior can be modified dynamically. The inclusion of such building blocks is described in Chapter 2.6.

The interesting resource count of the chosen part is summarized in Table 4.1.

Name of resource      Number of resource units
Slice                 37680
Block RAM (36 Kb)     416
DSP48E1               768
PCI-Express block     2

Table 4.1: An overview of the interesting resources available in the XC6VLX240T

Even though the end result will not be a complete implementation, it is suitable to target an FPGA platform, with the limitations that entails, when choosing suitable structures.

5 Implementation

This chapter describes the implementation of the chosen hardware modules for the SUMIS algorithm.

5.1 Overview

The following sections describe the implementation of a subset of the hardware modules needed for the SUMIS algorithm. A 4 × 4 complex MIMO setup has been assumed, which implies that the corresponding real matrices will be of dimension 8 × 8.

This assumption is reflected in both the matrix multiplication and matrix inversion units, where the input matrices are of said dimension. The choice of matrix dimension also affects the wordlengths necessary to provide accurate results.

A small note on the notation used in this chapter can be considered useful. A signal of type "std_logic" is a one dimensional logic signal. An array of logic values of dimension N is denoted "std_logic_vector(N-1 downto 0)". A signed fractional number is described by "sfixed(A downto -B)", which denotes A + 1 integer bits and B fractional bits, where position 0 is implicitly the decimal point.


5.2 Matrix Multiplication

To perform matrix multiplication an IP block from Xilinx was used [Xilinx Inc, 2011b]. The drawback of this IP block is that the input wordlength is limited to at most 18 bits and this limitation has affected the choice of wordlengths for the other modules as well.

5.2.1 IP Block Trade-offs

When generating the IP block it is possible to trade off performance versus hardware usage by selecting an unroll factor. This unroll factor determines how many elements of the input matrices can be provided as input simultaneously, and thus how many multipliers will be used. The highest unroll factor for an 8 × 8 matrix is 64. It requires 8 multipliers, and one element of the resulting matrix is computed at a time. The lowest unroll factor is 1, where all of the elements of the resulting matrix are computed in parallel. This requires 512 multipliers but is naturally faster than calculating one element at a time.

The problem with choosing an unroll factor of 1, apart from the need for a tremendous number of multipliers, is that it is hard to provide that much input in parallel. Given that each element is represented with 18 bits, providing the two matrix inputs would require a bus of width 18 × 64 × 2 = 2304 bits. It is not feasible to both route this wide bus and provide this parallel memory access, since the matrix multiplication is just one of the modules in the whole design. For the reasons described, an unroll factor of 64 is suitable, which allows the use of a single block RAM to provide the input. Additional discussion about the unroll factor can be seen in Chapter 6.4.

5.2.2 Interface

The interface to the IP block is called AMBA AXI4-Stream, which was originally developed by ARM but adopted by Xilinx, as described in [Xilinx Inc, 2011a]. For each of the data inputs there exist three signals: valid, last and data. When data is transferred to the module, valid must be held high and a new element must be available each clock cycle. When the last element of the matrix is present, the signal last must be held high during that clock cycle. A figure of this behavior can be seen in Figure 5.1.

Figure 5.1: Control signals for the AMBA AXI4-Stream interface

5.2.3 Example Implementation

Since matrix multiplication is present in several instances in the SUMIS algorithm, one case was chosen as an example for implementation. One of the first matrix multiplications that has to be calculated is H^T H, and this case was chosen as an example.

This example implementation assumes that the matrix H has previously been written to a block RAM with dual read ports. It reads out the input data from the dual port block RAM, feeds it to the IP block and gathers the output in an additional block RAM.

The matrix H is stored row-wise in the block RAM. To be able to access both H and H^T simultaneously to provide input to the IP block, both read ports of the block RAM must be used. To read out the regular matrix, a counter can be used that counts from 0 to 63 and generates the read address. To obtain the transposed matrix, the read out must be performed column-wise; this can be solved by using the same counter as for the original matrix but inserting a lookup table between the counter and the address input. This lookup table will address the matrix in column order instead of row order. Thus the lookup table will map the sequence {0, 1, 2, ..., 62, 63} onto {0, 8, 16, ..., 55, 63}.
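The lookup table contents follow directly from the row-wise storage; a one-line Python sketch (not thesis code) generates them:

    # Read index i of the transposed stream maps to element (row i mod 8,
    # column i div 8) of the row-wise stored 8 x 8 matrix.
    transpose_lut = [(i % 8) * 8 + (i // 8) for i in range(64)]
    print(transpose_lut[:9])    # [0, 8, 16, 24, 32, 40, 48, 56, 1]
    print(transpose_lut[62:])   # [55, 63], matching the sequence in the text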

Everything in the implementation is controlled by a control FSM that contains the address counter and drives the control signals shown in Figure 5.1. It observes the status signals from the IP block and also stores the result in an additional block RAM. A block diagram of the implementation can be seen in Figure 5.2.

Figure 5.2: Block diagram of the example implementation: input BRAM (read ports a and b), address LUT, control FSM, matrix multiplication IP block and output BRAM


5.3 Matrix Inversion

To be able to perform matrix inversion, multiple modules are needed. The first step is the LDL^T decomposition described in Chapter 5.3.1, including a reciprocal unit described in Chapter 5.3.2. The resulting matrix L will be inverted using a forward substitution unit described in Chapter 5.3.3.

Finally, the inverted matrix can be produced using the previously described matrix multiplication unit in Chapter 5.2.

5.3.1 LDL^T Decomposition

The LDL^T decomposition has been implemented as a separate module in hardware. An effort has been made to use a minimum amount of control logic by performing redundant computations, to be able to exploit regular structures in the computations.

Recall the algorithm describing the decomposition in Algorithm 3.2. The vector v can be obtained by performing pair-wise multiplication between the vector d and the current row of L. The reason why the described loop only iterates to i − 1 is that the remaining elements of both L and d are still zero.

The computation of the next element in position i of v and d can be interpreted as a pair-wise multiplication between the newly calculated v and the same row from L as before followed by a summation that can be performed with an adder tree. This result can be subtracted from the current diagonal element from Q and thus the new element of both v and d has been calculated.

The same structure can be seen in the remaining calculations for each iteration, but now there are more values that need calculation, and there is an additional multiplication with the reciprocal of the i:th element of d. These conclusions result in the computation unit presented in Figure 5.3, with the adder tree to the left.

Figure 5.3: Computation unit used in the LDL^T unit, with 8 parallel multipliers and an adder tree

The data path described in Figure 5.3 also contains the reciprocal unit, which is described in detail in Chapter 5.3.2. To be able to fully utilize the computation unit, it must be possible to access a complete row of the matrix L simultaneously while being able to write an individual element. This is possible to achieve using a dual port block RAM created using CoreGen, since it allows for asymmetric access ports. The dual port block RAM has two sides, A and B. Side A has a wide read and write port that allows a complete matrix row to be read or written at once. Side B, on the other hand, has narrow ports that allow a single element to be read or written. This asymmetric memory is constructed of multiple smaller memories together with some logic to perform the address decoding. In this case, the block RAM for L is composed of eight smaller memories as building blocks. The structure of the LDL^T unit can be seen in Figure 5.4. The control unit is built of an FSM that controls the memories, computation unit and registers.

Figure 5.4: Block diagram of the LDL^T unit

The input and output ports are described in Table 5.1.

Name      Dir   Type                           Comment
clk       in    std_logic                      Input clock
rst_n     in    std_logic                      Reset, active low
start     in    std_logic                      Start computation
addr_in   in    std_logic_vector(5 downto 0)   Input address
data_in   in    sfixed(5 downto -12)           Data input
we        in    std_logic                      Write enable
ready     out   std_logic                      Ready for input
done      out   std_logic                      Computation done
addr_out  in    std_logic_vector(5 downto 0)   Output address
L_data    out   sfixed(2 downto -15)           L matrix output
D_data    out   sfixed(2 downto -15)           D^-1 matrix output

Table 5.1: Input and output ports of the LDL^T decomposition module

5.3.2 Reciprocal Unit

The goal of the reciprocal unit is to implement the computation described in Equation 3.5. One problem is that the lookup table must be limited in size while still providing a good initial guess for all input numbers. If the input d can be scaled to 0.5 ≤ d < 1, it follows that 1 < 1/d ≤ 2, and the lookup table can be limited in size. To perform this dynamic scaling in hardware, the most significant set bit of the input number must reside at position −1, next to the decimal point. If the current bit position is known, it is possible to scale the number by shifting left or right the appropriate number of steps until the bit is in the correct position.

If the input was shifted N steps to provide the correct scaling, then the approximated reciprocal must also be shifted N steps to reflect the reciprocal of the original input number.

With the practicalities of input scaling handled, the issue of how to index into the lookup table remains. By investigating the scaled number, the following conclusions can be made:

1. The smallest number is 0.5, which means only bit −1 is set and all bits to the right are zero.
2. The largest number is almost 1, which means that bit −1 is set and all bits to the right are one.

Since bit −1 is always set, it can be ignored, and the remaining bits to the right can be used as an index into the lookup table. This means that at index 0 the initial guess for input = 0.5 must be stored, while at the last index the guess for input ≈ 1 must be stored. This manipulation can be seen as a subtraction of 0.5, which moves the interval 0.5 ≤ d < 1 to 0 ≤ d < 0.5, more suitable as an index.
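The scaling and indexing can be modeled in a few lines of Python (a sketch; the table size is an assumption):

    def scale_and_index(d, table_bits=6):
        """Model of the reciprocal unit's input conditioning: normalize d to
        0.5 <= m < 1 by shifting, then derive the table index from the bits
        to the right of the always-set bit -1."""
        m, shifts = d, 0
        while m >= 1.0:                 # shift right until below 1
            m /= 2.0
            shifts += 1
        while m < 0.5:                  # shift left until at least 0.5
            m *= 2.0
            shifts -= 1
        index = int((m - 0.5) * (1 << (table_bits + 1)))
        return m, shifts, index         # the looked-up 1/m is shifted back by 'shifts'

    print(scale_and_index(6.25))        # (0.78125, 3, 36): 1/6.25 = (1/0.78125) >> 3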

One additional adaptation of Equation 3.5 is that a multiplication by 2 is equivalent to a left shift by one place when using binary numbers. A block diagram of the complete structure of the reciprocal unit can be seen in Figure 5.5. Registers are placed after the operations to allow for a higher operating frequency as well as to balance the paths. These registers are not shown in Figure 5.5, for clarity.

Figure 5.5: Block diagram of the reciprocal unit

The ports of the reciprocal unit are described in Table 5.2. The unit has no control signals except for load, and it is used as a component in the LDL^T decomposition unit.

Name    Dir   Type                   Comment
clk     in    std_logic              Input clock
load    in    std_logic              Load new d
d       in    ufixed(5 downto -12)   d input
result  out   ufixed(5 downto -12)   1/d output

Table 5.2: Input and output ports of the reciprocal unit

5.3.3 Forward Substitution

The implementation approach for the forward substitution differs from the implementation of the LDL^T decomposition. Instead of pursuing minimized control logic, efforts have been made to utilize more efficient building blocks and avoid unnecessary computation, at the cost of increased control overhead.

Analyzing Algorithm 3.4 shows that, apart from subtraction, the operation used is multiply-and-accumulate, which can be described as

c = c ± a × b,    (5.1)

with c being an accumulator register. It is also necessary to be able to clear the value in register c. A suitable hardware structure for the multiply-and-accumulate operation can be seen in Figure 5.6.


Figure 5.6: Block diagram of the multiply-and-accumulate module
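Behaviorally, the datapath in Figure 5.6 amounts to the following illustrative Python model, where the clear input models the mux selecting 0 instead of the accumulator (class and parameter names are hypothetical):

```python
class Mac:
    """Clearable multiply-and-accumulate: c = (0 if clear else c) +/- a*b.
    On the FPGA this entire structure maps onto one DSP48E1 primitive."""

    def __init__(self):
        self.c = 0.0

    def step(self, a, b, clear=False, subtract=True):
        base = 0.0 if clear else self.c   # the input mux in Figure 5.6
        self.c = base - a * b if subtract else base + a * b
        return self.c
```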

Given the algorithm described in Algorithm 3.4, the final subtraction can be moved into the accumulation, and thus the only computation unit needed is this multiply-and-accumulate unit performing c = c − a × b. The main problem that has to be solved is how to control these units, provide them with input in the correct order, and store the resulting values.
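Using the Mac model above, the required ordering can be sketched as follows. This is a sketch that solves L·X = B for a unit lower triangular L with only clear/accumulate/write steps; whether Algorithm 3.4 matches this exact formulation is an assumption, and in the actual unit the schedule is encoded in the control memory rather than in nested loops:

```python
import numpy as np

def forward_substitute(L, B):
    """Solve L @ X = B, L unit lower triangular, using one MAC unit."""
    n = L.shape[0]
    X = np.zeros_like(B, dtype=float)
    mac = Mac()
    for col in range(B.shape[1]):        # independent systems; could be
        for i in range(n):               # spread over several MAC units
            mac.step(B[i, col], 1.0, clear=True, subtract=False)
            for j in range(i):
                mac.step(L[i, j], X[j, col])   # c = c - L[i,j] * x[j]
            X[i, col] = mac.c            # write back to the X memory
    return X

# Self-check
L = np.tril(np.random.randn(6, 6), -1) + np.eye(6)
X = forward_substitute(L, np.eye(6))
assert np.allclose(L @ X, np.eye(6))
```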

The multiply-and-accumulate operation is very common in various kinds of signal processing, and the DSP48E1 blocks in the FPGA therefore implement this operation among others. This means that the complete structure shown in Figure 5.6 can be absorbed fully inside a single DSP48E1 block [Xilinx Inc, 2011c]. One multiply-and-accumulate unit was chosen for the implementation; it would be possible to use several units, since the matrix equations described in Chapter 3.3.3 are independent and thus can easily be solved by different computation units. This is a compromise between hardware resources and execution time. The implementation contains one multiply-and-accumulate unit, one memory for the input matrix L, one memory for the output matrix X, and a control FSM. Since the FSM would involve many states, a separate memory was used for the control signals, which are summarized in Table 5.3.

Name      Purpose
sel       Control input mux to MAC unit
clr       Clear accumulator register
L_x, L_y  X, Y coordinate in L matrix
X_x, X_y  X, Y coordinate in X matrix
W_x, W_y  X, Y coordinate in X matrix for write
we        Write signal for X matrix

Table 5.3: Control signals in the forward substitution unit

The control FSM is basically a counter that increments the address to the memory containing the control signals. A complete block diagram of the forward substitution unit can be seen in Figure 5.7.



Figure 5.7: Block diagram of the forward substitution unit

The input and output ports of the forward substitution module are described in Table 5.4.

Name      Dir  Type                           Comment
clk       in   std_logic                      Input clock
rst_n     in   std_logic                      Reset, active low
start     in   std_logic                      Start computation
addr_in   in   std_logic_vector(5 downto 0)   Input address
data_in   in   sfixed(2 downto -15)           Data input
we        in   std_logic                      Write enable
done      out  std_logic                      Computation done
addr_out  in   std_logic_vector(5 downto 0)   Output address
data_out  out  sfixed(2 downto -15)           X matrix output

Table 5.4: Input and output ports of the forward substitution module



5.4 Jacobi Logarithm

As mentioned in Chapter 3.4, the method called the Jacobi logarithm is suitable when adding probabilities in log space to avoid overflow, underflow and unnecessary computations. Overflow means that the resulting number is too large for the chosen integer wordlength, and underflow means that the number is too small in magnitude to be represented using the chosen fractional wordlength. Recall Equation 3.9, especially the second term. If log(a) and log(b) are available as input, x can be defined as x = |log(a) − log(b)|. With x defined, the computation that has to be performed is

result = max(log(a), log(b)) + log(1 + e^−x).    (5.2)

Since log(a) − log(b) must be calculated anyway, this knowledge can be used when performing the max selection. If the result of the subtraction is negative, log(b) is the larger term and shall be selected, otherwise log(a). This means that a simple multiplexer, with the sign bit of the result as control signal, can be used to select the larger value.

The remaining term in the expression presented in Equation 5.2 is log(1 + e^−x). A graph of this function can be seen in Figure 5.8.

Figure 5.8: The function log(1 + e^−x) on the interval 0 ≤ x < 8

Since the expression is limited in value on a small interesting interval, it is suitable to use a table of precomputed values instead of implementing the complex operations standalone. As can be seen in Figure 5.8, the function approaches zero, so it is only necessary to precompute a table for the interval 0 ≤ x < 8 to still achieve good accuracy.

A block RAM is suitable for storage of the precomputed table. To avoid excessive hardware, it is suitable to limit the table so that it fits in a single 36 Kb block RAM primitive. With a data width of 16 bits this allows for 2048 elements in the lookup table, or in terms of the address bus width, 11 bits.

To use the result x as an index into the lookup table it must somehow be represented by 11 bits. Since the table will cover 0 ≤ x < 8, it is possible to saturate x to only contain log2(8) = 3 integer bits. This leaves 11 − 3 = 8 bits for the fractional part of x. With these limitations on x, the table can be precomputed with x ranging over 0 ≤ x < 8 in steps of 2^−8.

A block diagram of the complete structure for the Jacobi logarithm module can be seen in Figure 5.9. Not shown in the figure are the delay elements needed before and after the selection mux, since the subtraction and table lookup have latency.

Figure 5.9: Block diagram of the Jacobi logarithm unit
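The complete unit can be mirrored by a few lines of Python. This is a sketch with the table dimensions chosen above (2048 entries, step 2^−8, saturation at x = 8); the RTL additionally pipelines the subtraction and lookup, and function names here are illustrative:

```python
import math

STEP = 2.0 ** -8                        # 8 fractional bits of x
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(2048)]

def jacobi_log(log_a, log_b):
    """Compute log(a + b) given log(a) and log(b)."""
    diff = log_a - log_b
    largest = log_b if diff < 0 else log_a      # mux on the sign bit
    x = abs(diff)
    index = min(int(x / STEP), len(TABLE) - 1)  # saturate x to < 8
    return largest + TABLE[index]

# Adding two probabilities in the log domain
pa, pb = 0.2, 0.05
assert abs(jacobi_log(math.log(pa), math.log(pb))
           - math.log(pa + pb)) < 1e-2
```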

The input and output ports of the Jacobi logarithm unit can be seen in Table 5.5. The module lacks control signals because it can be seen as a computation unit that has a certain latency and is supposed to output results continuously. Therefore a control signal such as "start" is unnecessary.

Name    Dir  Type                   Comment
clk     in   std_logic              Input clock
log_a   in   sfixed(5 downto -12)   log(a) input
log_b   in   sfixed(5 downto -12)   log(b) input
result  out  sfixed(5 downto -12)   Result output

Table 5.5: Input and output ports of the Jacobi logarithm unit


6 Result and Analysis

This chapter describes the results of the implementation, both the accuracy of the computations and the resource usage. The results are discussed, and the approach taken in this thesis is compared with other implementations and approaches to see what remains until a complete implementation of SUMIS can be obtained.

6.1 Testing and Measurements

The modules were tested with input data generated using Matlab. All of the modules were simulated in ModelSim using this input data, and the results of these computations were then imported into Matlab, where they were compared and verified against the expected output.

This was performed to ensure correct functionality and to be able to determine how accurate the hardware was compared to ideal computations performed with double precision floating-point numbers. Descriptions of the accuracy are presented in the following sections.

The errors presented were acquired using randomized input data and observing the largest individual error in the output elements. Multiple simulations were run to ensure that the maximum error was likely to be observed.
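In code, this check amounts to comparing the simulation output against the double precision reference. The sketch below uses NumPy; the thesis flow used Matlab, so the function names here are illustrative stand-ins:

```python
import numpy as np

def max_abs_error(reference, hardware):
    """Largest individual element error between the double precision
    reference and the values read back from simulation."""
    return float(np.max(np.abs(np.asarray(reference)
                               - np.asarray(hardware))))

def correct_fractional_bits(err):
    """Express an absolute error as equivalent fractional bits,
    e.g. an error of about 0.0002 corresponds to roughly 12 bits."""
    return float(-np.log2(err))
```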

6.1.1 Matrix Multiplication

The matrix multiplication implementation yielded an error of approximately 0.0002, which directly corresponds to an accuracy of 12 fractional bits, the chosen output fractional wordlength. It would be possible to achieve a higher accuracy by allowing more bits in the result, but the limiting factor may be that the module consuming the results utilizes fewer bits.

6.1.2 LDL^T Decomposition

While testing the decomposition module a maximum error of approximately 0.001 was observed, which corresponds to an accuracy of 10 fractional bits. Since the algorithm operates on columns from left to right and reuses intermediate results, the error accumulates toward the rightmost columns; the accuracy in the leftmost columns corresponds to 14 fractional bits.

6.1.3 Forward Substitution

The forward substitution module had a maximum error of approximately 0.00002, which corresponds to an accuracy of 15 fractional bits. All of the computations were performed using 3 integer bits and 15 fractional bits, so this accuracy was expected. To allow for a higher precision, the computations would need to be performed with more bits.

6.1.4 Jacobi Logarithm

The error observed when testing the Jacobi module was very small, approximately 0.0002, which indicates that the implementation has negligible accuracy loss; the achieved accuracy corresponds to the chosen input wordlength of 12 fractional bits. The precision of the lookup table does not affect the result in any meaningful way, mostly because the table is very limited in range, from 0 to log(2), while still having a step size of 2^−8.

6.2 Resource Usage

The following sections describe the resource usage for each individual module that was implemented. The resources of primary interest are LUTs, flip-flops, DSP48E1 blocks and block RAMs.

It is also interesting to note the maximum frequency at which each module can operate. This is described alongside a description of the module's critical path, which dictates the maximum frequency.

6.2.1 Matrix Multiplication

The resource usage for the matrix multiplication implementation can be seen in Table 6.1. This implementation computes H^T H as described in Chapter 5.2.3.
