
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2017

Implementation and Evaluation of Two 512-Tap Complex FIR Filter Architectures for Compensation of Chromatic Dispersion in Optical Networks

Anton Kovalev


Implementation and Evaluation of two 512-Tap Complex FIR Filter Architectures for Compensation of Chromatic Dispersion in Optical Networks

Anton Kovalev

LiTH-ISY-EX--17/5094--SE

Supervisor: Oscar Gustafsson
ISY, Linköpings universitet

Examiner: Mario Garrido
ISY, Linköpings universitet

Division of Computer Engineering
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2017 Anton Kovalev


Sammanfattning

Filtering a signal is an important part of digital signal processing, since different applications require different properties of a signal to be changed. A digital filter does exactly this: it alters (filters) a signal so that the desired signal properties are obtained. One application area for digital filters is optical networks. Optical fibers are widely used to send information over long distances with high bandwidth, higher than other systems for transmission of information, such as copper wire. In order to achieve high transmission capacity, the signal needs to be filtered to compensate for the distortion of a signal that has been subjected to chromatic dispersion. This work presents and evaluates two filter architectures for compensation of chromatic dispersion in fiber optics. Both filters can be clocked at 475 MHz, which results in a throughput of 60 GS/s.


Abstract

Filtering is an important part of digital processing, since the applications often require a change of features of a digital or analog signal. A digital filter is a device or a system that removes or alters certain parts of a signal. Optical fibers are used to transmit information over longer distances and at higher bandwidths than traditional copper cables. In order to enable high-rate transmission in optical communication systems, it is necessary to have a filter that compensates for chromatic dispersion in optical links, since the dispersion alters the signal in an unwanted way. This thesis presents the implementation and evaluation of two filter architectures, used in fiber-optic communication. The clock frequency of the implemented designs reaches 475 MHz, which results in a processing speed of 60 GS/s.


Acknowledgments

I would like to express my gratitude to my supervisor Oscar Gustafsson and my examiner Mario Garrido for their immense help in this thesis. Thank you for your help and for sharing your vast knowledge within the field of computer engineering with me. Additionally, I would like to thank Erik Nybom for giving me valuable feedback as a result of his opposition, but also for many interesting discussions during the course of the thesis.

Linköping, August 2017 Anton Kovalev


Contents

1 Introduction
  1.1 Goals
  1.2 Previous research at LiU
  1.3 Limitations
  1.4 Outline of the thesis

2 Theory
  2.1 Optical Networks and Chromatic Dispersion
    2.1.1 System chain
    2.1.2 Bit error rate
  2.2 Fast Fourier Transform
    2.2.1 FFT architectures
  2.3 Filters
    2.3.1 FIR filter architectures
  2.4 Overlap and save
  2.5 Overlap and add
  2.6 Word length
  2.7 Gauss' complex multiplication algorithm

3 Method
  3.1 Tools
    3.1.1 Programming language
    3.1.2 ModelSim
    3.1.3 Synopsys Design Compiler and Tcl
    3.1.4 Git
  3.2 Verification, power consumption and design partitioning of the design
    3.2.1 Verification
    3.2.2 Power estimation
    3.2.3 Design partitioning

4 Hardware Implementation
  4.1 Hardware implementation of the 1024-point FFT filter
    4.1.1 Complex multiplications per sample in 1024-point FFT filter
    4.1.2 Time domain versus frequency domain
    4.1.3 Permutation circuit before the FFT
    4.1.4 Permutation circuit after FFT
    4.1.5 Permutation circuit after IFFT
    4.1.6 Global counter
    4.1.7 Multiplications
    4.1.8 MUX
  4.2 Hardware implementation of the 256-point FFT filter
    4.2.1 Complex multiplications in 256-point FFT filter
    4.2.2 Overlap and save
    4.2.3 Permutations
    4.2.4 FIR filters in frequency
    4.2.5 Scaling

5 Simulations

6 Results
  6.1 MATLAB simulations of the 1024-point FFT filter
  6.2 MATLAB simulations of the 256-point FFT filter
  6.3 Assessment of the results
  6.4 Synthesis of the filters
  6.5 Power estimation
  6.6 Area and power comparison

7 Conclusion
  7.1 Discussion
  7.2 Future work


1 Introduction

Fiber optics is the cornerstone of modern communication networks, since it carries Internet traffic, long-distance phone calls and television channels. Optical fibers are used for transmission of information over longer distances and at higher bit rates than traditional wire cables, such as coaxial cables and copper wires. Further benefits of optical fibres are the resistance to electromagnetic interference and low attenuation over long distances. The fiber is a thin and very flexible medium that conducts pulses of light. Each pulse represents a bit that the receiver can interpret at the other end of the fiber [1].

Despite the advantages of optical fibres over other transmission media, the fibre is not a perfect transmission medium and suffers from imperfections, such as chromatic dispersion.

Chromatic dispersion (CD) is a phenomenon by which different components of a pulse travel at different speeds [2]. This entails distortion of the shape of the pulses. The information becomes corrupted and thus there is a need to compensate the loss of information at the receiver, by using a digital filter.

This Thesis presents two filter architectures, both implemented in VHDL and MATLAB, which can be used for compensation of chromatic dispersion. The designs are suitable for hardware implementation on an application-specific integrated circuit (ASIC). The architectures use the Fast Fourier Transform to reduce the number of computations compared to direct convolution. The architectures are evaluated with regard to the power consumption and area of each of the designs. The wordlength of the samples (i.e. the information transmitted) in each of the filters is chosen after implementation and simulation of the filters in MATLAB, where the filters are part of a simulation chain in which chromatic dispersion is included. The criterion, based on which the wordlengths are chosen,


is the bit error rate of the system.

1.1 Goals

The goal of the thesis is to build and evaluate two architectures of a 512-tap finite impulse response (FIR) filter for compensation of chromatic dispersion in optical fibers at the Department of Electrical Engineering of Linköping Institute of Technology. The architectures should be able to operate at 60 GS/s. The reason for such a rate is that it is enough for a propagation distance of 2300 km, which is a common length of optical fibers used in transmission of signals [2]. In addition, the rate, together with the fact that the most complex component of the filter, the FFT, can be clocked at almost 500 MHz, results in 60 GS/s / 500 MHz = 120 samples to be processed by the filter each clock cycle. This number is rounded towards the closest power-of-two number, resulting in 128 samples per clock cycle to be processed by the filter. Moreover, the filters should be designed in both software and hardware. The hardware version will be used for the analysis of resource consumption of the filters. The software version will be used to determine the bit error rate in a simulation of the system for different wordlengths of samples, which will consist of pulse shaping, the CD model, and a CD compensation filter. The full simulation chain is based on [3] and will be explained in the theory section. It is desirable to minimize the wordlength, since the area of the design depends on the size of the words, and yet maintain a good bit error rate of the system.

1.2 Previous research at LiU

A lot of research on architectures of digital filters [3], arithmetic circuits [4][5][6] and communication systems [7][8] is carried out at the Department of Electrical Engineering at Linköping University. A large number of papers have been published on the topic. Moreover, a number of those are used in this Thesis and serve as background information and a basis for the design.

1.3 Limitations

For the calculations, the system proposed in this Thesis uses fixed-point format. Therefore, the precision of the samples is limited. Furthermore, advanced techniques for dealing with arithmetic discrepancies, such as overflow and/or underflow, will not be used.

The system will not be put on physical hardware during the course of the thesis, but rather synthesized for an ASIC using a synthesis tool.

Finally, the choice of filter coefficients for digital filters will not be the focus of this thesis.


1.4 Outline of the thesis

Chapter 2 touches upon the theory behind the design of FIR filters and optical fibers.

Chapter 3 is an overview of the tools and the workflow used in this thesis.

Chapter 4 discusses the hardware implementation of the system.

Chapter 5 presents the MATLAB simulations.

Chapter 6 contains the results of the simulations and the results of the synthesis.

Chapter 7 includes the discussion and conclusions drawn from the results of the thesis.


2 Theory

2.1 Optical Networks and Chromatic Dispersion

The rise of the use of optical networks in the 1980s was fueled by the need for higher capacity transmission. Today, the capacity of commercial terrestrial systems is in the range of Tb/s [2], and the length of the cables can reach thousands of kilometers, connecting different continents.

An optical fibre consists of a core, which is made of a dense material, and a cladding, which is made of a less dense material than the core. The cladding and the core are usually made of different types of silica glass and have different refractive indices, which allows successive reflections within the fiber core [2]. The fiber is coated with a protective covering made of plastic that protects it from moisture and other damage. A picture of a part of a cable can be seen in figure 2.1.

Since the cable is made for transmission of information, it also has a transmitter and a receiver at its ends. The transmitter converts the electrical information (bits) to optical format by using a laser diode. Similarly, a photodiode at the receiver converts the optical signal to an electrical format. The light pulses which propagate from the transmitter to the receiver through the fiber suffer from unwanted phenomena. One of the issues is chromatic dispersion, mentioned in the introduction of this Thesis. Chromatic dispersion is a physical phenomenon which alters the propagated signal in an unwanted way. Therefore, it is desirable to minimize the effects of CD, which is traditionally compensated using optical devices with opposite dispersion [3],[9]. The idea is to connect another fiber with opposite dispersion to the existing one. The addition of such a fiber will cancel out the dispersion. The problem with optical devices is that they cannot easily



Figure 2.1: Illustration of the cross-section of an optical fiber.

be tuned to accommodate different fiber properties [10]. If the characteristics of the original fiber change over time, the compensation fiber has to be changed.

With faster analog-to-digital converters and coherent detection schemes, digital signal processing is playing a growing role in the field of optical networks [10],[11]. In digital coherent optical receivers, chromatic dispersion is usually modeled as a frequency response given by

$$C(e^{j\omega T}) = e^{-jK(\omega T)^2}, \qquad K = \frac{D \lambda^2 z}{4 \pi c T^2} \qquad (2.1)$$

The parameters D, λ, z and c are all fiber-related parameters: D is the fiber dispersion parameter, λ is the wavelength, z is the propagation distance and c is the speed of light. The parameter T is the sampling period. Taking into consideration all the parameters mentioned above, the coefficients of the compensation filter can be determined, since the compensation filter is the inverse of the modeled chromatic dispersion filter C, given by


$$H_{comp}(e^{j\omega T}) = \frac{1}{C(e^{j\omega T})} = e^{jK(\omega T)^2} \qquad (2.2)$$

which is the filter that is designed in this Thesis. The length of the filter is dependent on K [3], and is given by

$$N = 2 \lfloor 2K\pi \rfloor \qquad (2.3)$$
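As a rough illustration of equations (2.1)-(2.3), the MATLAB sketch below evaluates the dispersion model, its inverse and the resulting filter-length estimate. The numeric values of D, λ, z and T are illustrative assumptions only; they are not taken from this thesis and do not necessarily reproduce the 512-tap case considered later.

```matlab
% Illustrative sketch of equations (2.1)-(2.3); parameter values are assumptions.
D      = 17e-6;            % dispersion parameter in s/m^2 (17 ps/(nm*km)), assumed
lambda = 1550e-9;          % wavelength in m, assumed
z      = 2300e3;           % propagation distance in m
c      = 3e8;              % speed of light in m/s
T      = 1/60e9;           % sampling period at 60 GS/s

K = D*lambda^2*z/(4*pi*c*T^2);   % constant K in equation (2.1)
N = 2*floor(2*K*pi);             % filter-length estimate, equation (2.3)

wT = linspace(-pi, pi, 4096);    % normalized angular frequency
C  = exp(-1j*K*wT.^2);           % chromatic dispersion response, equation (2.1)
H  = exp( 1j*K*wT.^2);           % compensation filter, equation (2.2)
max(abs(C.*H - 1))               % ~0, i.e. H is the inverse of C
```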

2.1.1 System chain

The full system treated in this Thesis can be seen in figure 2.2 [3].

(Figure 2.2 block diagram: x(n) → upsampling by a factor L → anti-imaging filter G_TX(e^{jωT}) → chromatic dispersion C(e^{jωT}) → AWGN → filter for compensation of CD H(e^{jωT}) → anti-aliasing filter G_RX(e^{jωT}) → downsampling by a factor L → y(n).)

Figure 2.2: The system of an optical fiber. The fiber is modeled as chromatic dispersion and the AWGN channel.

Notice that a fiber suffers from other unwanted fiber-specific phenomena, such as intermodal dispersion and non-linear effects. In this Thesis, only the effects of CD are considered. As further explained in [3], in order to reduce the inter-symbol interference (ISI) and consequently improve the bit error rate, interpolation/decimation at the transmitter/receiver is performed after the modulation/demodulation. The interpolation results in a spectrum of a new, interpolated sequence, together with repeated images of the baseband [12]. In order to filter out those unwanted, repeated images, a low-pass filter is inserted in the chain, after the upsampling on the transmitter side, and before the downsampling on the receiver side. Next, chromatic dispersion (CD) is modeled as a filter with the frequency response given by equation 2.1. The data is then passed through an Additive White Gaussian Noise (AWGN) channel for different values of Signal-to-Noise ratio. This is done in order to mimic the consequences of random processes that occur in nature, i.e. it models general imperfections of a communication channel. When received, almost all the steps above are done in reverse order. In order to compensate CD, a filter is inserted right after the channel. The filter should be as close to the inverse of the CD response as possible.

2.1.2 Bit error rate

In order to estimate the quality of a telecommunication system, a measurement of the bit error rate (BER) is often made. It is defined as the number of erroneous


bits received from the channel, divided by the number of total bits transferred over the channel.

BER can be affected by many factors, such as noise in the channel, symbol interference or synchronization errors in the system. In the current work, the BER is only dependent on the noise of the channel and the quantization noise inside the filter.
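A minimal MATLAB sketch of the BER computation used throughout the simulations; the bit vectors and the error probability below are placeholders, not data from the thesis.

```matlab
% BER = erroneous bits / total bits; the stimulus here is synthetic.
tx_bits = rand(1, 1e5) > 0.5;                    % transmitted bits
rx_bits = xor(tx_bits, rand(1, 1e5) < 1e-3);     % received bits, ~0.1% flipped
BER = sum(rx_bits ~= tx_bits) / numel(tx_bits)
```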

2.2 Fast Fourier Transform

The Fourier transform [13] is a powerful mathematical tool used in many fields that involve mathematical computations. A few examples are polynomial multiplication, filtering and matrix multiplication [14]. To put it simply, the main reason why the Fourier transform is used is that it makes it possible to view signals in different domains, which in turn makes problems that are hard to solve or analyze in one domain easier to solve in the other.

The Fast Fourier Transform (FFT) [13] is an algorithm for computing the Discrete Fourier Transform of a signal. The Discrete Fourier Transform is given by equation 2.4 below

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi nk/N}, \qquad k = 0, 1, 2, \ldots, N-1 \qquad (2.4)$$

where N is the size of the transform and X[k] is the k-th frequency of the transform.

The amount of computation required to compute the DFT using a direct approach (i.e. performing all the calculations according to the formula, without exploiting certain properties of the transform) is N complex multiplications and (N − 1) complex additions for each output frequency. Each complex multiplication consists of four real multiplications and two real additions, and each complex addition consists of two real additions. In total, the number of real multiplications required to compute all output frequencies is equal to 4N². The number of real additions required to compute an N-point DFT is N(4N − 2). In addition, the digital computation of the DFT calls for a considerable amount of memory to store and access the N complex input sequence values x(n) and the values of the complex coefficients. The amount of computation, as shown earlier, is proportional to N², which results in a very large number of arithmetic operations needed to compute the DFT for large values of N [14]. Therefore, it is of interest to reduce the amount of computation needed to calculate the DFT of a signal.

As mentioned earlier, the most famous class of algorithms that reduce the number of additions and multiplications is called the Fast Fourier Transform (FFT) [13]. The FFT uses a technique called divide and conquer, which decomposes the discrete Fourier transform into smaller transforms, calculates those first, and then merges the results of the smaller transforms into bigger ones.


This approach results in a much smaller computational complexity, proportional to N·log(N) arithmetic operations [14]. This decomposition can be done in many ways [14]. One way of decomposing the DFT is decimation in frequency (DIF), which computes smaller and smaller subsequences, or stages, of the input sequence x[n]. In total, the number of stages is n = log_r(N), where r is the base of the radix ρ of the FFT, i.e. ρ = r^a [5]. The even and odd samples of the frequency output are split, and can be written as

$$X[2r] = \sum_{n=0}^{N/2-1} \left(x[n] + x[n+N/2]\right) e^{-j\frac{2\pi rn}{N/2}}, \qquad r = 0, 1, 2, \ldots, (N/2)-1 \qquad (2.5)$$

$$X[2r+1] = \sum_{n=0}^{N/2-1} \left(x[n] - x[n+N/2]\right) e^{-j\frac{2\pi}{N}(2r+1)n}, \qquad r = 0, 1, 2, \ldots, (N/2)-1 \qquad (2.6)$$
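The split in equations (2.5)-(2.6) can be checked numerically against MATLAB's built-in fft; the sketch below does so for an 8-point transform.

```matlab
% Even/odd output split of the DIF decomposition, equations (2.5)-(2.6).
N = 8;
x = randn(1, N) + 1j*randn(1, N);
n = 0:N/2-1;
a = x(1:N/2) + x(N/2+1:N);                           % x[n] + x[n+N/2]
b = (x(1:N/2) - x(N/2+1:N)) .* exp(-1j*2*pi*n/N);    % (x[n] - x[n+N/2]) * W_N^n
X = fft(x);
max(abs([fft(a) - X(1:2:end), fft(b) - X(2:2:end)])) % ~0
```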

The result is two DFTs of half the original size. Those DFTs can further be broken down into smaller DFTs of odd and even samples, and so on, until the size of the smallest DFT is the same as the radix r, which is the size of the butterfly. A butterfly is the basic operation of the FFT. For radix-2, it calculates one addition and one subtraction:

X[0] = x[0] + x[1]
X[1] = x[0] − x[1]

Graphically, this butterfly is usually described using a signal flowgraph:


Figure 2.3: Radix-2 butterfly. The butterfly is taken from the radix-2 algorithm.

The name butterfly comes from the fact that the graph in figure 2.3 resembles a butterfly. From this basic butterfly, larger butterflies can be constructed. The butterfly in figure 2.4 is a radix-4 butterfly.

A full flow graph of an 8-point FFT using decimation in frequency is shown in figure 2.5. Another way of decomposing an 8-point FFT, decimation in time (DIT), is shown in figure 2.6. Notice the input order of the samples and the output order of the frequencies. The input order of both flow graphs is in natural order. The output is in bit-reversed order. The naming natural comes from the fact that the binary representation of each sample index is obtained in the traditional way, by transforming the decimal number of each index to a binary one.



Figure 2.4: Radix-4 butterfly. The butterfly is taken from the radix-4 algorithm.

The name bit-reversed implies that the binary number of each index is reversed. An illustration of bit-reversal is shown in figure 2.7. It is important to notice that the input and output order are not part of the FFT algorithm. Each FFT algorithm can be represented with any input and output orders [15].
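A short base-MATLAB sketch of the bit-reversal illustrated in figure 2.7 (the Signal Processing Toolbox function bitrevorder produces the same permutation):

```matlab
% Natural index order versus bit-reversed order for an 8-point FFT.
N    = 8;
n    = 0:N-1;
bits = dec2bin(n, log2(N));        % e.g. index 6 -> '110'
rev  = bin2dec(fliplr(bits)).';    % reverse each bit pattern, e.g. '110' -> '011' -> 3
disp([n; rev])                     % 0 1 2 3 4 5 6 7  ->  0 4 2 6 1 5 3 7
```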


Figure 2.5: Flow graph of an 8-point FFT using Decimation in frequency. The multiplication constants are annotated with a "W".


Figure 2.6: Flow graph of an 8-point FFT using Decimation in time. The multiplication constants are annotated with a "W".

2.2.1 FFT architectures

The FFT architectures can be divided into three main categories: pipelined architectures [5], memory-based architectures [16] and direct implementations [17], [6].


Figure 2.7: Bit-reversed input and natural output order of an 8-point FFT.

The highest throughput is achieved by pipelined and direct implementations. Memory-based architectures calculate the FFT iteratively, on data stored in memory, while pipelined architectures compute the transform in a continuous flow. The direct implementation is the most straightforward one, since it maps each addition/multiplication in the corresponding FFT flow graph to an adder/multiplier in hardware.

The fully parallel pipelined architecture is a direct implementation because there is no need to pipeline the dataflow, since the input arrives simultaneously (all samples are coming in parallel), instead of being divided into blocks and processed one block per clock cycle. Furthermore, the FFT architecture can be realized using both decompositions mentioned in the previous section: decimation in time and decimation in frequency. The number of mathematical operations is the same in both decompositions. The difference lies in the placement of rotators, since those can be moved between various stages [15]. An example of the movement of the rotators can be seen in figure 2.8, which is taken from [15]. The equation of the upper part of the structure is

$$X = W^c(x \cdot W^a + y \cdot W^b)$$
$$Y = W^d(x \cdot W^a - y \cdot W^b)$$

The inputs can be divided by a factor W^q, which makes it possible to pass this factor to the outputs, leading to the equations


$$X = W^{c+q}(x \cdot W^{a-q} + y \cdot W^{b-q})$$
$$Y = W^{d+q}(x \cdot W^{a-q} - y \cdot W^{b-q})$$

which is represented in the lower part of figure 2.8.


Figure 2.8: Movement of the rotations in the butterfly. The upper part is the original structure. The bottom one is the structure after movement.

The use of different rotators in the algorithms results in different hardware implementations. In each of the implementations there is a tradeoff between butterflies, rotators and memory [6].

2.3 Filters

FIR filters are an important class of linear time-invariant systems. To put it briefly, filtering is the process of altering certain characteristics of a signal, by multiplying and adding the samples of a signal with system-specific coefficients. A system that is causal, time-invariant and linear can be uniquely described by its impulse response h(n). The result y(n) of filtering a signal x(n) through a filter h(n) can be described by the convolution of the impulse response of the filter and the input sequence:

$$y(n) = h(n) * x(n) = \sum_{k=0}^{\infty} h(k)\, x(n-k) \qquad (2.7)$$

where n is the n-th sample of the signal. If the impulse response becomes zero after a finite number of samples, the filter is a Finite Impulse Response (FIR) filter. On the contrary, if the impulse response never reaches zero, the filter is called an Infinite Impulse Response (IIR) filter [18].
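A plain time-domain sketch of equation (2.7) in MATLAB, compared against the built-in filter function; the coefficients below are arbitrary example values.

```matlab
% Direct-form FIR filtering: y(n) = sum_k h(k) x(n-k).
h = [0.2 0.5 0.2 0.1];                 % example 4-tap impulse response
x = randn(1, 32);                      % example input
y = zeros(size(x));
for n = 1:numel(x)
    for k = 1:numel(h)
        if n - k + 1 >= 1
            y(n) = y(n) + h(k) * x(n - k + 1);
        end
    end
end
max(abs(y - filter(h, 1, x)))          % ~0
```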

FIR filters have a number of advantages over IIR filters. FIR filters are unconditionally stable and are in general much faster in hardware [3]. Moreover, FIR filters generally require a shorter data word length. Conversely, FIR filters are more complex to build than IIR filters, and require more computations and power.


2.3.1 FIR filter architectures

Convolving a signal with filter coefficients in the time domain is a mathematical operation that can be realized in hardware in different ways. As is often the case in hardware design, there is always a tradeoff between multiple factors that the designer should take into account when constructing the filter, such as area on the chip and throughput of the system (the reciprocal of the time between successive outputs, measured in samples/s). The most common and simplest FIR filter often seen in the literature is depicted in figure 2.10 [18]. Each multiplier and adder can be mapped to a hardware unit. Another way of realizing the same filter is by using hardware multiplexing, as shown in figure 2.9. In this way, five multipliers are merged into one multiplier, which significantly reduces the hardware cost [4]. The address generator generates control signals to the multiplexers in order to perform multiplication on the correct data at the correct time.

The hardware required to realize the structure in figure 2.9 is 3.6 times smaller than that in figure 2.10 [4]. Conversely, the speed will be four times lower than that of the direct implementation in figure 2.10. Therefore, if high throughput is desired, the direct implementation is preferred over the merged one. The "performance over area cost" is higher in the directly mapped FIR circuit [4]. To increase the throughput even further, it is possible to pipeline the filter in figure 2.10. This reduces the critical path, the longest data path in a given circuit in terms of the time it takes for the data to travel from one memory element to another. When pipelining the critical path, the throughput is increased at the cost of increased area, since more delay elements are added to the design.

Figure 2.9: 4-tap FIR filter using hardware multiplexing: one shared multiplier and adder, with an address generator FSM controlling the multiplexers that select among x(n)...x(n-3) and h(0)...h(3).



Figure 2.10: Direct mapping of a 4-tap FIR filter.

The process of convolving one signal with another is a slow process in computer hardware. Therefore, filtering in the time domain is typically avoided when large signals are considered. A more efficient way of filtering a signal is to transform the signal x(n) into its frequency-domain representation X(k) and then use the fundamental property in signal processing that convolution in the time domain is equal to multiplication in the frequency domain. That is, instead of convolving x(n) with h(n), it is more efficient to take the FFT of both the signal and the filter coefficients, carry out the multiplication with the coefficients in the frequency domain, and then carry out the inverse FFT (IFFT) on the resulting signal. The result is the same as the convolution between the signals. If an FIR filter is implemented in the time domain, the filtering is done through linear convolution. From a computing point of view, the cost in a direct realization is

$$\mathrm{Multiplications}_{FIR} = N \cdot K \qquad (2.8)$$

$$\mathrm{Additions}_{FIR} = N \cdot (K-1) \qquad (2.9)$$

where N is the number of input samples and K is the number of filter taps (impulse response length).

If the filter is implemented in the frequency domain, the cost of the filtering itself is reduced to N multiplications, because convolution in the time domain is equal to multiplication in the frequency domain, i.e. Y(f) = H(f) · X(f), where H is the filter coefficients transformed to the frequency domain. In general, the computational cost of filtering in the frequency domain can


be calculated as

$$\mathrm{Multiplications}_{FFT/IFFT} = \frac{N}{2}\log_2(N) \qquad (2.10)$$

$$\mathrm{Additions}_{FFT/IFFT} = N\log_2(N) \qquad (2.11)$$

$$\mathrm{Multiplications}_{frequency\ domain} = N \qquad (2.12)$$

where N is the transform length. It can be assumed that N > K, since the block size of the data transformed to the frequency domain must be larger than the filter length [19]. Also, in practice, the load on the fiber is usually in the order of terabits [2]. The first equation denotes the number of multiplications performed by the FFT/IFFT, the second one denotes the number of additions performed by the FFT/IFFT and the third equation denotes the number of multiplications in the frequency domain due to filtering.

The process of filtering in the frequency domain can be seen in figure 2.11. It is also important to notice that the IFFT can be realized by using the FFT, with the only differences being that each input sample into the FFT has to be complex conjugated, and the output should be complex conjugated once again, and divided by the transform length.
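This identity is easy to verify in MATLAB; the sketch below compares the built-in ifft with the conjugate-FFT-conjugate-and-scale construction described above.

```matlab
% IFFT realized with the forward FFT: ifft(X) = conj(fft(conj(X))) / N.
N = 1024;
X = randn(1, N) + 1j*randn(1, N);
max(abs(ifft(X) - conj(fft(conj(X)))/N))   % ~0
```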

By applying the FFT in the filtering process, the lengths of the impulse response and the FFT have to be taken into account. The length of the impulse response is rarely equal to the length of the input signal. However, when taking the trans-form of the signal and the impulse response, the length of both of them must be equal, in order to multiply each input sample with the corresponding impulse sample in frequency. This is done by using one of two most widely used tech-niques: Overlap-and-save and Overlap-and-add. The idea behind both is the same: divide the input signal into blocks of smaller length and use the FFT on each block. Both approaches are described in the next two sections.



Figure 2.11: Filtering in frequency domain. The IFFT is realized by using the FFT.

2.4 Overlap and save

The Overlap and save method (OLS) [20] is illustrated in figure 2.12. Considering an N-point FFT, the input signal is divided into segments of N samples. M of those samples are "old" samples, i.e. they are the samples saved from the previous input block. In addition, L new samples are processed together with the M old ones, thus N = M + L − 1 [20]. The reason why samples are saved for the next iteration of the FFT is a phenomenon called circular convolution. Contrary to linear convolution, convolution using the FFT makes the signal "wrap around" itself, and therefore corrupts some of the samples of the output. This is a result of the periodic nature of the Fourier transform. The samples that become corrupt are discarded at the output; therefore this technique is also often called "Overlap and discard".


Figure 2.12: Overlap and save diagram. The shaded region (samples) of each transformed block overlaps with the previous block and is discarded at the end, i.e. not included in the final output. As an example, input block Ax contains N samples (the shaded region included), whereas output block Bx contains L samples (the shaded region discarded). The number of discarded samples is thus N − L + 1 = M.
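A textbook overlap-and-save sketch in MATLAB with N = 1024 and M = 512, as in this thesis. Here L = N − M + 1 new samples enter per block and the first M − 1 outputs of every block are discarded; the hardware in chapter 4 rounds this bookkeeping to 512 new samples per transform, so the sketch illustrates the principle rather than the exact implementation.

```matlab
% Overlap-and-save filtering checked against MATLAB's filter function.
N = 1024; M = 512; L = N - M + 1;          % block size, filter length, new samples/block
h = randn(1, M);  x = randn(1, 8*N);
H = fft(h, N);
buf = zeros(1, N);  y = [];
for s = 1:L:numel(x) - L + 1
    buf = [buf(end-M+2:end), x(s:s+L-1)];  % keep M-1 old samples, append L new ones
    blk = real(ifft(fft(buf) .* H));       % circular convolution via the FFT
    y   = [y, blk(M:end)];                 % discard the M-1 corrupted samples
end
y_ref = filter(h, 1, x);
max(abs(y - y_ref(1:numel(y))))            % ~0 (up to rounding)
```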

2.5 Overlap and add

The Overlap and add (OLA) [20] technique is similar to the previously mentioned Overlap and save. Instead of inserting old samples, zeros are inserted in order to pad the signal being transformed, to avoid circular convolution. In addition, instead of discarding samples at the output, some of them are added to the new ones in the next block, as can be seen in figure 2.13. This method does not require additional storage at the input to save samples, as in the OLS case. However, it is necessary to have extra arithmetic units (adders) at the output and memory in order to hold and add the overlapping samples with the new ones coming from the next block [20].



Figure 2.13: Overlap and add diagram. The right edge of each output block is added to the edge of the next block.
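For comparison, a corresponding overlap-and-add sketch under the same assumptions as the previous example: each zero-padded block is filtered in the frequency domain and its tail samples are added into the following output block.

```matlab
% Overlap-and-add filtering checked against MATLAB's filter function.
N = 1024; M = 512; L = N - M + 1;
h = randn(1, M);  x = randn(1, 8*N);
H = fft(h, N);
y = zeros(1, numel(x) + M - 1);
for s = 1:L:numel(x) - L + 1
    blk = real(ifft(fft(x(s:s+L-1), N) .* H));   % linear convolution of one block
    y(s:s+N-1) = y(s:s+N-1) + blk;               % overlap the tails and add
end
done  = s + L - 1;                               % last fully processed output index
y_ref = filter(h, 1, x);
max(abs(y(1:done) - y_ref(1:done)))              % ~0 (up to rounding)
```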

2.6 Word length

The resolution of the samples decides the precision that can be achieved when filtering a signal. An analog-to-digital converter (ADC) samples an analog signal and returns a quantized digital value, corresponding to the analog value that was sampled. This leads to a precision loss, since the digital value is an approximation of the analog one. The more bits that can be used to represent a value, the better precision can be achieved. However, there are a number of factors that limit the AD conversion. Firstly, the arithmetic operations in the application take longer to compute if the word length is too large. Secondly, the sampling frequency limits the performance of the ADC. The resolution in bits is a decreasing function of the sampling frequency, which imposes limitations on the resolution.

A limited number of bits results in an increase of the bit error rate. In order to analyze the effects of limited word length, a software simulation of the desired (or the whole) part of the system can be performed, which can shed some light on how the quantization affects the BER.
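A small sketch of the kind of word-length experiment described here: the signal is rounded to a given number of fractional bits and the resulting quantization error is measured. The word length W and the SNR metric are illustrative choices, not values from the thesis.

```matlab
% Quantize to W fractional bits and measure the quantization SNR.
W  = 8;                                     % assumed number of fractional bits
x  = randn(1, 1e4);
xq = round(x * 2^W) / 2^W;                  % fixed-point rounding
snr_dB = 10*log10(sum(x.^2) / sum((x - xq).^2))
```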


2.7 Gauss' complex multiplication algorithm

Normally, a multiplication of two complex numbers requires four multiplications and two additions, as shown in equation 2.13.

$$(a + ib)(c + id) = (ac - bd) + (bc + ad)i \qquad (2.13)$$

By using Gauss' multiplication algorithm [21], it is possible to reduce the number of multiplications to three, at the cost of three extra additions:

$$k_1 = c \cdot (a + b), \qquad k_2 = a \cdot (d - c), \qquad k_3 = b \cdot (c + d) \qquad (2.14)$$

$$(a + ib)(c + id) = (k_1 - k_3) + (k_1 + k_2)i \qquad (2.15)$$

This alternative way of multiplying two complex numbers is beneficial in hardware, since multiplication is a much more demanding arithmetic operation than addition in fixed-point arithmetic [22]. Gauss' multiplication algorithm can be used when multiplying the coefficients with the input samples in the frequency domain.
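Equations (2.14)-(2.15) are easy to sanity-check numerically; the MATLAB sketch below compares the three-multiplication form with a direct complex multiplication for arbitrary example operands.

```matlab
% Gauss' complex multiplication: three real multiplications instead of four.
a = 3; b = -2; c = 1.5; d = 4;             % example operands
k1 = c*(a + b);  k2 = a*(d - c);  k3 = b*(c + d);
gauss  = (k1 - k3) + 1j*(k1 + k2);
direct = (a + 1j*b) * (c + 1j*d);
abs(gauss - direct)                         % 0
```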


3 Method

This chapter covers the main tools used in this thesis. It also touches upon some general topics that needed to be taken into account during the Thesis, such as testing and design partitioning. In addition, the path towards power estimation of the design is described, since it was a large and time-consuming part of the Thesis.

3.1 Tools

3.1.1 Programming language

The main programming language used in this thesis was VHDL, which is an abbreviation for "Very High Speed Integrated Circuit Hardware Description Language". VHDL is a widely used programming language for description of the behavior of digital hardware circuits. The reason why VHDL was chosen was that the FFT tool [17], used for generation of the FFT and IFFT of the filter, was already written at the start of the thesis and only needed some modifications to be able to work in the desired way and in the environment used in the project. The project was carried out with the help of a number of programs, which are described below.

3.1.2 ModelSim

ModelSim is a simulation environment used for simulation of the circuits made in hardware description languages, such as VHDL and Verilog. The program was used for verification of the functionality of the design, without the use of physical hardware. Furthermore, ModelSim, together with Design Compiler, was used to generate the results of the power consumption of the design.


3.1.3 Synopsys Design Compiler and Tcl

Synopsys Design Compiler, sometimes referred to as Design Compiler, is a synthesis tool which maps the higher-level building blocks created in VHDL (or other hardware description languages, such as Verilog) to physical hardware and produces a gate-level netlist. The physical hardware consists of transistor-gate-level building blocks, which the tool uses to assemble (compile) the design and estimate different characteristics of the design, such as power consumption, maximum frequency, total area of the design and the timing between different parts of the system. The designer of the circuit can also specify a number of parameters before synthesis, such as different delays, the desired operating frequency of the design, the number of clocks used in the design, etcetera. This allows the designer to get detailed information and an estimate of whether the design meets the required constraints. Another benefit of using Design Compiler is the fact that it is easy to write scripts containing different commands and execute them in the shell. A script file can be written in Tcl, a powerful high-level programming language, which is supported by the dc_shell. This saves time and makes it easier to specify and change parameters for each part of the design at runtime.

3.1.4 Git

Git is a version control system for handling different versions of files in a controlled and structured manner. Since the thesis involved a large number of files, it was very convenient to e.g. be able to revert to older versions of files and compare those versions with newer ones. Furthermore, the University already had Git installed on the computers, so only minor steps were needed in order to be able to use it.

3.2 Verification, power consumption and design partitioning of the design

3.2.1 Verification

In order to make sure that each part of the design works correctly, it needs to be thoroughly tested. Therefore, the design was tested with the help of random input data, generated and adjusted to the desired binary format in MATLAB. The input data was varied not only with respect to the numerical values, but also to the length of the input and length of each individual sample. The output from the design in VHDL was then sent back to MATLAB and compared with the output data of the same filter written in MATLAB code. Since the software version of the filter was in turn tested in the simulation chain described in section 2.1.1, it was verified that the software version of the filter worked properly. Therefore it was reliable when used for testing the hardware implementation.

Before the whole filter was tested, each component was tested on its own. This was done not only in order to verify each component’s functionality, but also to


make it possible to reduce the possible errors in the design that could arise when the components were assembled in the end. It also made it easier to plan the workflow and follow internal milestones.

3.2.2 Power estimation

Power estimation of a design in Design Compiler can be done in two ways: register transfer level or gate level.

At register transfer (RT) level, the memory elements and combinational logic in the VHDL/Verilog source files are used for power estimation. Information about the switching activity of every memory output can be collected and used for estimation of the power consumption after logic synthesis.

If a power estimation on gate level is desired, the netlist of the design, produced after synthesis, is simulated. The netlist is used as an instance in a testbench, and input data is sent into the design. The switching activity of every gate during the simulation is logged in a file, which is later used for power estimation. This method produces the most accurate results.

The workflow of the gate-level power estimation in this thesis can be seen in figure 3.1. First, the design was written in VHDL, together with a testbench written in Verilog. The reason why the testbench was not written in VHDL was the standard library cells, which were available in Verilog. Next, the design was synthesized and a netlist was generated, together with an SDF file. SDF stands for "Standard Delay Format", which is an IEEE standard for representation of timing data of an electronic design. Thereafter, the SDF file was mapped to the netlist and simulated in ModelSim using the testbench. The delays and glitches of the gates in the netlist during simulation were logged in a VCD file. VCD stands for "Value Change Dump" and is a format for dump files which are generated by EDA simulation tools. This format is not supported by Design Compiler, which uses another format, called SAIF, in order to do power estimation. Finally, the VCD file was converted to a SAIF file and used for power estimation by Design Compiler. The results deviated by approximately 10-15% from the power estimation on register level, which was expected, since gate-level estimation is more accurate. The power numbers are presented in chapter 6.

3.2.3 Design partitioning

Design reuse decreases time to market by reducing the design, integration and testing effort. When constructing a design, it is always a good idea to divide the code into different modules and parameterize those. This makes it easier to reuse the different components, especially if the interfaces are thoroughly defined. In addition, partitioning can enhance synthesis results and reduce the compile time.

The design in this work was divided into several modules, with related combinational logic being kept inside each module, instead of opting for large amounts of glue combinational logic. Glue logic connects different blocks and thus restrains the optimization that the synthesis tool can perform across block boundaries.


Figure 3.1: Workflow of power estimation. The VHDL design and a Verilog testbench are synthesized in Design Compiler using a 65 nm low-power library to generate a netlist and an SDF file; the netlist is simulated in ModelSim to obtain a VCD file, which is converted to a SAIF file and read back into Design Compiler to produce the power report.


4 Hardware Implementation

This chapter presents hardware implementations of the two filter architectures, which consist of different parts. The focal parts of the design will be highlighted and explained.

In section 4.1, the hardware implementation of the first filter is described. The number of complex multiplications per sample of the filter is derived, in order to be able to explain some of the drawbacks and benefits of the choice of FFT lengths. Also, timing diagrams are added in order to explain the choice of the FFT size for the 1024-point FFT architecture. In section 4.2, the hardware implementation of the second filter is explained.

A detailed explanation behind the logic and derivation of different control signals of the permutation circuits is not given. The derivation of the structure can be achieved by diving into Boolean algebra and Karnaugh diagrams, or by reading the work in [23], where a detailed explanation of how to construct any type of bit-dimension permutation circuit is given.

An explanation of the FFT will not be given either, only the word length through the FFT will be discussed in chapter 5. Detailed information about the hardware design of the FFT used in this thesis can be found in [5].

Finally, the code for the multiplications and FIR filters between the FFT and the IFFT is parameterizable. In other words, the multiplications and "rotation" by the coefficients are not restricted to the sizes of the FFTs, but can easily be adjusted to any size. Likewise, the filters in frequency are not restricted to 4 taps, but can be made arbitrarily large, with an arbitrary word length. However, the permutation circuits are restricted to the size of the FFT in the first filter.


4.1 Hardware implementation of the 1024-point FFT filter

The overview of the 1024-point FFT filter can be seen in figure 4.1. Notice that each FFT/IFFT processes 128 parallel samples each clock cycle.

4.1.1 Complex multiplications per sample in 1024-point FFT filter

Taking the full 1024-point FFT filter into account, and using the Overlap and save technique, the total number of complex multiplications per sample in the FFT, the multiplications in frequency and the IFFT becomes:

$$\frac{\frac{N}{2}(\log_2 N - 1) + \frac{N}{2}(\log_2 N - 1) + N}{N - M} \qquad (4.1)$$

where M is the filter length. The expression in the numerator denotes the number of complex multiplications in the FFT and IFFT, together with the complex multiplications in the frequency domain. The expression in the denominator is the number of new samples processed each time by the FFT (N = M + L − 1, see the Overlap and save section).
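The expression in equation (4.1) can be swept over candidate transform sizes in MATLAB, as below. How many FFT rotations are counted as complex multiplications (for example whether trivial rotations are excluded) affects the absolute numbers, so this sketch illustrates the trend over N rather than reproducing table 4.1 exactly.

```matlab
% Complex multiplications per output sample for a 512-tap filter and various N.
M = 512;
N = 2.^(10:14);                                   % 1024 ... 16384
fft_mults = (N/2) .* (log2(N) - 1);               % per FFT/IFFT, as written in (4.1)
mults_per_sample = (2*fft_mults + N) ./ (N - M);  % numerator / new samples per block
disp([N.' mults_per_sample.'])
```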

The main reason why the Overlap and save technique is preferred to Overlap and add in this work is the avoidance of adders at the end of the filter. Furthermore, the padding of the signal with zeros before it goes into the filter is not needed. For different FFT lengths N, different numbers of multiplications per sample are obtained. Given the function above, one can find the most suitable transform size for a desired filter length. In this thesis, M = 512. For different values of N, where N is a power of two, it is possible to calculate the average number of multiplications for each sample that passes the filter. The reason why N is chosen to be a power of two is that it results in a more efficient and symmetric implementation in hardware, especially considering the timing issues. The complex multiplications per sample for different choices of N are presented in table 4.1 below. Considering the values in table 4.1, it can be seen that for a 512-tap filter, the most suitable transform size is 4096. Although it results in fewer multiplications per sample, the 4096 architecture has a number of drawbacks. Firstly, the larger the FFT size, the larger the area of the design. Also, with larger area comes greater power consumption, since the number of registers increases.

Moreover, the timing issues result in each FFT being idle for a long period of time, i.e. it is not receiving data during a long period of time and has to wait. Thus, the FFTs are not fully utilized during the whole period of time if the 4096-point FFT is chosen. A more detailed explanation of the timing issues in such a filter is given in figure 4.2.


Figure 4.1: Structure of the 1024-point FFT filter. Two branches, each consisting of a 1024-point FFT and IFFT processing 128 parallel complex samples per clock cycle, with permutation circuits, coefficient multiplications, complex conjugation and scaling (>>10), a global counter providing the control and reset signals, and an output multiplexer.


Fourier transform size N    Complex multiplications per sample
1024                        18
2048                        13.3
4096                        12.6
8192                        12.8
16384                       13.4

Table 4.1: Number of complex multiplications per sample for different sizes of the FFT.


Figure 4.2: Flow diagram of the output using a 4096-point FFT/IFFT.


Figure 4.3: Timing diagram of the filter output. The darker samples are the ones sent to the output of the filter, by the use of the multiplexer. The white samples are discarded.

Recall the description of the OLS technique in section 2.4. Using OLS, the filter processes L new and M old samples in each transform, where M is the length of the impulse response and is also the number of samples discarded from each transform. Since the impulse response length is equal to 512, the number of discarded samples and processed samples is equal (1024 − 512 = 512). The schematic of the input flow is shown in figure 4.3. The number 512 comes from the fact that 128 × 4 = 512 samples are processed by the filter during 4 clock cycles. Notice that the first 512 samples are not processed by FFT2. This is done in order to start FFT2 4 clock cycles later, which allows the samples that become corrupted in FFT1 to be processed through FFT2, and vice versa.


The non-corrupted samples are chosen by the multiplexer before the output, shaded in figure 4.3. It is easy to see that since the number of old and new samples is equal (the filter length is half the FFT length), the samples that come in as "new" into FFT1 are also the "old" samples into FFT2. After 512/128 = 4 clock cycles, the next 128 × 4 = 512 samples will be coming in as "new" into FFT2 and "old" into FFT1. Consider choosing the size of the FFT/IFFT to be 4096 instead of 1024. The new output flow from each branch of the filter is depicted in figure 4.2. Notice that the problem of each branch being idle for a long period of time arises. This is the result of the number of overlapping samples not being equal to the number of new samples. The number of new samples arriving into the FFT is now 4096 − 512 = 3584. This is another reason for choosing the 1024-point FFT when the filter response length is 512: the FFTs and IFFTs are never in an idle state.

4.1.2 Time domain versus frequency domain

Given the transform and filter size, it is possible to deduce whether the transition to the frequency domain requires fewer mathematical operations. Once again, complex multiplications per sample should be used for the analysis. Computing the total number of operations would not be a good measure, since linear convolution (time domain) filters all N samples from input to output, while circular convolution (frequency domain) filters N − M samples, due to the OLA/OLS techniques. The number of samples processed by the 1024-point FFT is, as shown previously, 1024 − 512 = 512. Using time-domain filtering and equation 2.8, the number of complex multiplications per output sample becomes

$$\frac{(N-M) \cdot M}{(N-M)} = \frac{(1024-512) \cdot 512}{(1024-512)} = 512 \ \text{complex mult/sample}$$

which is far more than the 18 required in the frequency domain.

4.1.3 Permutation circuit before the FFT

As discussed in section 2.2, the FFT can be represented with any input and output order. The FFT used in this thesis uses a bit-reversed input. Since the data arriving into the circuit is expected to come in natural order, it is necessary to rearrange the samples so that they come in the desired order into the FFT. The theory behind the design of the permutation circuits can be found in [24] and in [23].

The dataflow into the 1024-point FFT is shown in figure 4.4. The matrices represent the flow of samples. The numbers inside each matrix indicate the indices of the data. Data flows from left to right, which means that the rightmost column of a matrix arrives first; the second column from the right arrives in the next clock cycle, and so on.


Figure 4.4: Input order into the filter and input order into the FFT after the rearrangement of the samples.

Data in series are coming into the same terminal but at different clock cycles. On the other hand, parallel data are flowing into different terminals but in the same clock cycle. To understand how the data is permuted in figure 4.4, it is more convenient to rewrite the bit indexes of the input flow in the following way:

b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0   (4.2)

The indexes to the left of the vertical line represent the serial dimensions, whereas the indexes on the right represent the parallel dimensions. In a similar way, the bit indexes of the desired flow into the filter can be written as:

b2 b1 b0 | b3 b4 b5 b6 b7 b8 b9   (4.3)

This way of writing the equations makes it easier to illustrate which circuits are necessary in order to permute the data [23]. First, b9 in 4.2 will be exchanged for b0:

b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0   (4.4)

According to [23], this exchange of dimension will require a serial-parallel circuit with a buffer size of 4. Next, the exchange of bit indices b8 and b1 is carried out, which is also a serial-parallel permutation:

b0 b8 b7 | b6 b5 b4 b3 b2 b1 b9 (4.5) This permutation requires a serial-parallel circuit, with buffer size of 2. Finally,

(41)

4.1 Hardware implementation of the 1024-point FFT filter 31

the permutation

b0 b1 b7 | b6 b5 b4 b3 b2 b8 b9   (4.6)

requires a serial-parallel circuit with a buffer size of 1. The permutation after those exchanges will thus be

b0 b1 b2 | b6 b5 b4 b3 b7 b8 b9 (4.7)

In order to get to the desired permutation in 4.3, b2 and b0 in 4.7 need to change places:

b0 b1 b2 | b6 b5 b4 b3 b7 b8 b9   (4.8)

This is a serial-serial permutation, which can be achieved by a circuit with a buffer size of 3 [23], resulting in

b2 b1 b0 | b6 b5 b4 b3 b7 b8 b9   (4.9)

Finally, the b6...b3 indices must be permuted to b3...b6 in order to achieve the permutation in 4.3:

b2 b1 b0 | b6 b5 b4 b3 b7 b8 b9   (4.10)

As pointed out in [23], no circuit is needed for this permutation, since it is parallel-parallel; only interconnections between input terminals have to be made.

A detailed picture of the permutation circuit for the input into the filter can be seen in figure 4.5. The first three stages achieve the serial-parallel permutations described in this section. The final stage achieves the serial-serial one. This permutation is Permutation_1 in figure 4.1.

Considering the size of the logic, not all 128 terminals are depicted; only the first 8 inputs are shown. It is nevertheless easy to visualize the full circuit, since the structure in figure 4.5 is replicated 16 times in order to permute all 128 incoming samples.
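The overall reordering can be modelled in MATLAB as a permutation of the ten index bits. In the sketch below, the perm vector encodes a rearrangement in the spirit of (4.3); exactly how the serial and parallel dimensions map onto a flat 10-bit address is a modelling assumption here, so the code illustrates the mechanics of a bit-dimension permutation rather than the precise control logic of figure 4.5.

```matlab
% Bit-dimension permutation of 10-bit sample indices.
nbits = 10;
idx   = (0:2^nbits-1).';
bits  = dec2bin(idx, nbits) - '0';        % column 1 holds b9, column 10 holds b0
perm  = [8 9 10 7 6 5 4 3 2 1];           % assumed rearrangement of the bit columns
newb  = bits(:, perm);                    % permute the index bits
newidx = newb * 2.^(nbits-1:-1:0).';      % back to integer indices
isequal(sort(newidx), idx)                % true: the mapping is one-to-one
```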



Figure 4.5: Permutation circuit before FFT.

The multiplexers in each stage are controlled by a counter. The counter consists of 3 binary digits, corresponding to the 3 serial dimensions. The counter starts at zero and counts up to seven, thereafter it restarts from zero again. The control signal S of each stage is connected to the corresponding bit in the counter:

• The control signal S1 to the multiplexers in the first stage is controlled by the leftmost (most significant) bit x2 in the counter.

• The control signal S2 in the second stage is controlled by the middle bit x1 of the counter.

• Finally, the third stage is controlled by the rightmost (least significant) bit x0 in the counter.

The fourth stage of the permutation circuit differs from the other three, since it permutes only the serial dimensions. The control signal of the multiplexers is obtained as:

$$S_4 = \overline{x_2}\ \mathrm{OR}\ x_0 \qquad (4.11)$$

4.1.4 Permutation circuit after FFT

Since the input and output order of the FFT are not the same, there has to be a permutation circuit between the FFT and IFFT. The rearrangement of samples that this circuit has to perform is shown in figure 4.6. Notice that only the serial dimensions are changed. In other words, the samples arriving in parallel do not have to be changed, but they need to arrive at other time instances.


Figure 4.6: Output order from the FFT and input order into IFFT after the shuffle.

The issue of different orders of the samples from the FFT and input of the IFFT is resolved by adding another permutation component after the FFT. Notice that it does not matter whether the circuit is before the multiplications or after, since the order of multiplications can be arranged with regard to any order of the arriving samples, because the filter coefficients can be changed between the multiplication registers.

The logic of the permutation component is the same as in stage four of figure 4.5, and is considerably simpler than that of the circuit described in the previous section. The reason for this is that the permutation of the serial dimensions is the only permutation that has to be done in this case:

b0 b1 b2 | b6 b5 b4 b3 b7 b8 b9   (4.12)

which results in

b2 b1 b0 | b6 b5 b4 b3 b7 b8 b9   (4.13)

Moreover, the circuit and the control signals to the multiplexers are exactly the same as in the last stage of the circuit described in the previous section.

Notice that since there are two FFTs and two IFFTs, there is a need to have two permutation circuits.

4.1.5 Permutation circuit after IFFT

The final permutation circuit permutes the data from the IFFT to natural order, which becomes the output of the filter.

This permutation can be realized by using the first three stages of the circuit in section 4.1.3, together with the same control signals:


Figure 4.7: Output order from the IFFT and after the shuffle.

b0 b1 b2 | b6 b5 b4 b3 b7 b8 b9    (4.14)

b9 b1 b2 | b6 b5 b4 b3 b7 b8 b0    (4.15)

b9 b8 b2 | b6 b5 b4 b3 b7 b1 b0    (4.16)

resulting in the desired permutation:

b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0    (4.17)
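The three swaps above can be illustrated with a short Python sketch that composes them on a 10-bit sample index; the helper functions are hypothetical and only serve to show how the index bits move, not the timing of the hardware.

```python
# Sketch of the index-bit swaps in equations (4.14)-(4.17): three successive
# bit swaps on the 10-bit sample index compose the final permutation.

def swap_bits(value, i, j):
    """Swap bit positions i and j of an integer index."""
    if ((value >> i) & 1) != ((value >> j) & 1):
        value ^= (1 << i) | (1 << j)
    return value

def final_permutation(index):
    index = swap_bits(index, 0, 9)  # step (4.14) -> (4.15)
    index = swap_bits(index, 1, 8)  # step (4.15) -> (4.16)
    index = swap_bits(index, 2, 7)  # step (4.16) -> (4.17)
    return index

# Print the mapping for the first few of the 1024 sample indices.
for i in range(8):
    print(i, '->', final_permutation(i))
```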

4.1.6 Global counter

Considering that all the blocks need a fixed, but different, number of clock cycles to finish their corresponding operations, there is a need to synchronize the blocks. For instance, the FFTs and IFFTs need to be started in exactly the same time slot as the input data arrives. This can be done by having a counter for each block, so that when a component finishes its arithmetic operations on one block of data that is spread in time (the order of operations is time-dependent), the counter can be reset in order to start processing the next block of data.

Although it is possible to synchronize all the blocks using one counter for each of them, it is easier to have a centralized counter which all the blocks can use as a reference. Each block can use modular arithmetic, i.e. "wrap around" or restart upon reaching a certain value of the global counter. Therefore, one global eight-bit counter is chosen in the design of the filter.
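A behavioural Python sketch of the idea is given below. The eight-bit width follows the text, while the example periods and start offsets are assumptions made only for illustration.

```python
# Behavioural sketch of one global 8-bit counter shared by all blocks.
# Each block derives its own local phase with modular arithmetic instead
# of keeping a private counter.

GLOBAL_COUNTER_BITS = 8
WRAP = 2 ** GLOBAL_COUNTER_BITS  # the counter wraps at 256

def local_phase(global_count, period, start_offset=0):
    """Phase of a block that repeats every `period` cycles and was
    started `start_offset` cycles after the global counter."""
    return (global_count - start_offset) % period

# Example: a block that repeats every 8 cycles, and one started 4 cycles
# later (hypothetical offsets, for illustration only).
for g in range(12):
    print(g % WRAP, local_phase(g, 8), local_phase(g, 8, start_offset=4))
```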


4.1.7 Multiplications

The complex multipliers are implemented using Gauss' multiplication algorithm, described in section 2.7. Since the 1024-point FFT takes 128 parallel samples each time, there are 128 multipliers, together with 1024 registers where the filter coefficients are held. In the first clock cycle, when the first 128 samples arrive, the data is multiplied with the coefficients in registers 1 to 128. In the next clock cycle, the incoming samples are multiplied with the corresponding coefficients placed in registers 129 to 256. After 1024/128 = 8 clock cycles, one transform of 1024 samples in total is filtered, and the next transform is processed in the same way.
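A minimal sketch of the coefficient addressing is shown below, assuming the registers are indexed from 0 instead of 1 for convenience.

```python
# Which coefficient register each of the 128 multipliers reads in a given
# clock cycle of the 8-cycle schedule (0-indexed registers for convenience).

def coefficient_index(multiplier, cycle):
    """Multiplier 0..127, cycle 0..7 -> coefficient register 0..1023."""
    return (cycle % 8) * 128 + multiplier

# In cycle 0 the multipliers read registers 0..127, in cycle 1 registers
# 128..255, and so on; after 8 cycles one 1024-point transform is filtered.
assert coefficient_index(0, 0) == 0
assert coefficient_index(127, 7) == 1023
```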

Each complex multiplier (figure 4.9) is implemented using Gauss' complex multiplication algorithm. After the multiplication is done, the result is truncated. Thereafter follows a block that simply takes the complex conjugate of each complex sample. After the IFFT, each sample is complex conjugated once again and divided by the transform length, 1024.
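A Python sketch of a common formulation of Gauss' complex multiplication with three real multiplications is shown below; the truncation and fixed-point word lengths of the hardware are left out.

```python
# Gauss' complex multiplication: (a + jb)(c + jd) with three real multiplies
# instead of four.  Truncation and fixed-point word lengths are omitted here.

def gauss_complex_multiply(a, b, c, d):
    k1 = c * (a + b)          # shared product
    k2 = a * (d - c)
    k3 = b * (c + d)
    real = k1 - k3            # = a*c - b*d
    imag = k1 + k2            # = a*d + b*c
    return real, imag

# Quick check against the direct four-multiplication formula.
assert gauss_complex_multiply(3, 4, 5, 6) == (3 * 5 - 4 * 6, 3 * 6 + 4 * 5)
```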

Figure 4.8: Coefficient registers C1–C8 and the complex multiplier; the select signal S of the multiplexer is controlled by the 3 least significant bits of the global counter.



Figure 4.9: Complex multiplier.

4.1.8 MUX

The multiplexer after the complex conjugation chooses which of the IFFT outputs should be sent to the output of the filter. The multiplexer makes sure that only 512 samples from one IFFT are passed to the output during 4 clock cycles. Thereafter, 512 samples from the other IFFT are passed to the output for 4 clock cycles. Thus, 128 × 4 = 512 samples are passed to the output from one IFFT during the course of 4 clock cycles, then 512 samples are passed from the other, and so on. A timing diagram is shown in the figure below, where the shaded parts are the ones that are passed to the output. The 512 samples from the other IFFT during that time are discarded.

The output from IFFT2 starts at clock cycle 4 because the lower branch of the filter (FFT2, the permutation circuit after it, the multiplications and IFFT2) is started 4 clock cycles after the top branch.
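As a sketch, the selection can be written as a function of the global counter. The modulo-8 phase used below is an assumption for illustration; the exact counter bits used in the hardware are not detailed here.

```python
# Sketch of the output multiplexer schedule: 4 clock cycles (4 x 128 = 512
# samples) are taken from IFFT1, then 4 cycles from IFFT2, and so on.
# The modulo-8 phase of the global counter used here is an assumption.

def selected_ifft(global_count):
    return 1 if (global_count % 8) < 4 else 2

for cycle in range(16):
    print(cycle, 'output taken from IFFT', selected_ifft(cycle))
```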


4.2 Hardware implementation of the 256-point FFT filter

An overview of the 256-FFT filter can be seen in figure 4.10.

4.2.1 Complex multiplications in 256-point FFT filter

In the case of the 256-FFT filter, the equation for calculating complex multiplications per sample is different from the first filter, because it uses complex 4-tap FIR filters in the frequency domain, and not complex multipliers. Notice that the number of new samples each iteration, L, is now equal to half the size of the FFT, see the explanation below. According to [19], an FFT-based overlapped block structure with P parallel samples has 2P parallel branches in the frequency domain, where each branch is an FIR filter of length M/P, M being the filter length. This entails the need for a 2P-point FFT and a 2P-point IFFT, generating P output samples. Therefore, the equation for complex multiplications per sample becomes:

(2P · M/P + 2P log2(2P)) / P = 2M/P + 2 log2(2P)    (4.18)

In this case, the number of processed samples P is always equal to half the size of the FFT, i.e. the number of new samples each iteration is equal to the number of old samples, when using the Overlap and Save method.

It can be seen from the equation that the FFT size does not affect the number of complex multiplications per sample in this case. Therefore, having a bigger transform length is not always beneficial. Moreover, most often, the structure with complex FIR filters in between the FFT and IFFT is worse in terms of complex multiplications per sample, compared to an architecture where only multiplications are used. In the case of the designed filter, P = 128 and M = 512, resulting in

1024/128 + 2 log2(256) = 8 + 16 = 24,

which is more than the 18 complex multiplications per sample for the other filter architecture, in table 4.1. However, there are a number of advantages with such a design. Firstly, if the size of the filter M is decreased, the transform length does not need to be altered; it is only necessary to turn off certain taps of the intermediate FIR filters. Secondly, since each multiplier is mapped to only one coefficient constant, the structure of the multiplier can be simplified [22]. This can in turn significantly decrease the power consumption of the circuit, since the switching activity can be reduced.
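The numbers above can be reproduced with a few lines of Python; the value of 18 complex multiplications per sample for the 1024-point FFT filter is taken from table 4.1 and therefore only appears as a constant here.

```python
from math import log2

# Complex multiplications per sample for the 256-point FFT filter,
# equation (4.18): 2M/P + 2*log2(2P).
P = 128   # new samples per iteration (half the FFT size)
M = 512   # filter length

mults_per_sample_256 = 2 * M / P + 2 * log2(2 * P)
print(mults_per_sample_256)       # 8 + 16 = 24

# For comparison, the 1024-point FFT filter needs 18 complex
# multiplications per sample (table 4.1).
print(mults_per_sample_256 > 18)  # True
```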

Similar to the case of the 1024-point FFT described in 4.1.2, the number of complex multiplications per sample in the time domain can be calculated as

N · M / N = M = 512,

Figure 4.10: 256-point FIR filter.


which shows that it is more beneficial to implement the filter in the frequency domain than in the time domain.

4.2.2 Overlap and save

The Overlap and save block consists of delay elements at the start of the filter, which hold 128 samples for one clock cycle, so that the samples can be inserted as "old" together with 128 new samples arriving in the next clock cycle. The new samples thus become the old ones in the next clock cycle.
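A behavioural Python sketch of this buffering is shown below; the ordering of old and new samples within the 256-sample block is assumed here, and the variable names are illustrative only.

```python
import numpy as np

# Behavioural sketch of the overlap-and-save input stage: 128 delayed
# ("old") samples are concatenated with 128 new samples to form a
# 256-sample block for the 256-point FFT each clock cycle.

P = 128
old_samples = np.zeros(P, dtype=complex)  # contents of the delay elements

def next_fft_block(new_samples):
    global old_samples
    block = np.concatenate([old_samples, new_samples])  # 256 samples
    old_samples = new_samples                           # new becomes old
    return block

# Example: two consecutive clock cycles.
block0 = next_fft_block(np.ones(P, dtype=complex))
block1 = next_fft_block(2 * np.ones(P, dtype=complex))
```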

4.2.3 Permutations

As mentioned in [23], a permutation circuit is not needed for a fully parallel FFT, since a new frame of data enters the circuit each clock cycle and it is not necessary to buffer the data. Therefore, no permutation circuits are needed before, within, or after the filter; only interconnections between input and output terminals are made.

4.2.4 FIR filters in frequency

The most common architecture of real-valued FIR filters was given in section 2.3.1. Given a complex input signal, which is the case in this thesis, the filter can be implemented using four real-valued filters and two real-valued additions [25]. In mathematical terms, the equations for a complex filter can be expressed as:

yRe = (xRe ∗ hRe) − (xIm ∗ hIm)    (4.19)

yIm = (xRe ∗ hIm) + (xIm ∗ hRe)    (4.20)

Note that such a complex filter requires four real-valued filters, one performing each convolution in equations 4.19 and 4.20.

As pointed out in [25], it is possible to implement equations 4.19 and 4.20 using only three real-valued filters, namely

h1 = (hRe + hIm)

h2 = (hRe − hIm)

h3 = hIm

and three additions. This results in

yRe = (xRe ∗ h1) − ((xIm + xRe) ∗ h3)    (4.21)

yIm = (xIm ∗ h2) + ((xIm + xRe) ∗ h3)    (4.22)

Equations 4.21 and 4.22 give the implementation shown in figure 4.11. Each of the three subfilters is a direct implementation of a four-tap FIR filter, as in figure 2.10.


Figure 4.11: One of 256 FIR filters in frequency domain.
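A numpy sketch of the structure in figure 4.11 is given below. It uses numpy's convolution for the three real-valued subfilters, whereas the hardware uses 4-tap direct-form filters, and all fixed-point aspects are ignored; the function name is illustrative only.

```python
import numpy as np

# Sketch of the three-subfilter complex FIR structure in figure 4.11.
# h1 = hRe + hIm, h2 = hRe - hIm and h3 = hIm are the three real subfilters.

def complex_fir_3mult(x_re, x_im, h_re, h_im):
    h1 = h_re + h_im
    h2 = h_re - h_im
    h3 = h_im
    shared = np.convolve(x_re + x_im, h3)     # term shared by both outputs
    y_re = np.convolve(x_re, h1) - shared     # equation (4.21)
    y_im = np.convolve(x_im, h2) + shared     # equation (4.22)
    return y_re, y_im

# Check against a direct complex convolution (four real filters).
rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
h = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y_re, y_im = complex_fir_3mult(x.real, x.imag, h.real, h.imag)
assert np.allclose(y_re + 1j * y_im, np.convolve(x, h))
```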

4.2.5 Scaling

Finally, scaling is added at the end of the filter, in order to ensure division by the transform length and complex conjugation of the output samples. Notice also that only 128 samples from the IFFT are passed to the output; the other 128 are discarded due to aliasing.
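The conjugations around the IFFT together with the division by the transform length correspond to computing an inverse transform by means of a forward FFT. A short numpy sketch of this identity, which only illustrates the arithmetic and not the hardware, is shown below.

```python
import numpy as np

# The conjugate-and-scale trick: an inverse DFT can be obtained from a
# forward FFT by conjugating its input, conjugating its output and
# dividing by the transform length N.

N = 256
rng = np.random.default_rng(1)
X = rng.standard_normal(N) + 1j * rng.standard_normal(N)

ifft_via_fft = np.conj(np.fft.fft(np.conj(X))) / N
assert np.allclose(ifft_via_fft, np.fft.ifft(X))
```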


5 Simulations

In order to examine the effects of a system under design, a software simulation can be conducted, where different parameters can be set in order to make the model as close to the real environment as possible. In [3], a communication chain was simulated using Monte Carlo simulations in MATLAB, a multi-paradigm numerical environment often used for solving engineering and scientific problems. Since the filter designed in this Thesis is part of the simulation chain in [3], the same system is used in this Thesis, except that the Hcomp filter used in [3] is replaced with the filters designed in this Thesis. Since the simulations are done in MATLAB, the filters are also written in the MATLAB programming language. The simulation is based on the chain depicted in figure 2.2. Billions of samples were passed through the chain to determine the bit error rate of the system.

The hallmark of Monte Carlo simulations is the reliance on random sampling in order to obtain numerical results [26]. In the simulations used in [3], the data sent into the system is pseudo-randomly generated and thereafter modulated using 16-point quadrature amplitude modulation (16-QAM). As in the chain in figure 2.2, interpolation (decimation) is performed at the transmitter (receiver) in order to reduce inter-symbol interference. The interpolation results in the spectrum of a new, interpolated sequence, together with repeated images of the baseband [12]. In order to filter out those unwanted, repeated images, a low-pass filter is inserted in the chain, after the upsampling on the transmitter side and before the downsampling on the receiver side. The filters are designed so that

gTX(t) = gRX(t)    (5.1)

A common solution is the square-root raised cosine filter [18] as

