Design and implementation of an approximate full adder and its use in FIR filters


Department of Electrical Engineering (Institutionen för systemteknik)

Master's thesis

Master's thesis carried out in Electronics Systems at Tekniska högskolan vid Linköpings universitet

by

Nikhil Satheesh Varma

LiTH-ISY-EX--12/4565--SE

Supervisor: Dr. Anton Blad, ISY, Linköpings universitet

Examiner: Dr. Oscar Gustafsson, ISY, Linköpings universitet

Linköping, 19 February, 2013

Division, Department: Division of Electronics Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2013-02-19

URL for electronic version: http://www.es.isy.liu.se

ISRN: LiTH-ISY-EX--12/4565--SE

Title: Design and implementation of an approximate full adder and its use in FIR filters

Author: Nikhil Satheesh Varma

Abstract

Implementation of the polyphase decomposed FIR filter structure involves two steps; the generation of the partial products and the efficient reduction of the generated partial products. The partial products are generated by a constant multiplication of the filter coefficients with the input data and the reduction of the partial products is done by building a pipelined adder tree using FAs and HAs. To improve the speed and to reduce the complexity of the reduction tree a 4:2 counter is introduced into the reduction tree. The reduction tree is designed using a bit-level optimized ILP problem which has the objective function to minimize the overall cost of the hardware used. For this purpose the layout design for a 4:2 counter has been developed and the cost function has been derived by comparing the complexity of the design against a standard FA design.

The layout design for a 4:2 counter is implemented in a 65nm process using static CMOS logic style and DPL style. The average power consumption drawn from a 1V power supply, for the static CMOS design was found to be 16.8µW and for the DPL style it was 12.51µW. The worst case rise or fall time for the DPL logic was 350ps and for the static CMOS logic design it was found to be 260ps. The usage of the 4:2 counter in the reduction tree infused errors into the filter response, but it helped to reduce the number of pipeline stages and also to improve the speed of the partial product reduction.


Acknowledgments

I would like to take this opportunity to express my sincere gratitude to several individuals who gave me a lot of support during my time here in Linköping. Without them, my pursuit of a master's degree would not have been successful. Firstly, I am indebted to my examiner, Oscar Gustafsson, for the opportunity to do the thesis work under him. I thank my supervisor, Anton Blad, for giving his precious time for valuable guidance and encouragement throughout the entire work. He has taught me a lot, given me his best support and shown his enthusiasm in helping me to solve technical problems. I would also like to thank J Jacob Wikner for his important technical support in Cadence. Last but not least, I would like to thank my loving parents, my friends and all in my family for their love, support and patience.

Contents

1 Introduction
1.1 Thesis objective
1.2 Thesis organization
2 Logic style
2.1 Static CMOS logic
2.2 Pass transistor logic
2.3 Double pass transistor logic
2.3.1 XOR gate
2.3.2 AND gate
2.3.3 XNOR gate
3 FIR filters
3.1 Linear-phase FIR filters
3.2 FIR filter structures
3.2.1 Direct form structure
3.2.2 Transpose direct form structure
3.3 Multirate techniques
3.3.1 Interpolation
3.3.2 Decimation
3.4 Polyphase decimation filters
3.5 Multirate filter implementation
3.5.1 Filter coefficients
3.5.2 Partial product generation
3.5.3 CSA reduction tree
4 4:2 counters
4.1 Introduction to 4:2 counter
4.2 4:2 counter using DPL style
4.2.1 Simulation of layout and schematic
4.3 4:2 counter using static CMOS logic
4.3.1 Simulation of layout
5 ILP optimization
5.1 Architecture
5.1.1 Partial product generation
5.1.2 CSA reduction tree
5.2 Optimization constraints
5.2.1 Cost
5.2.2 Complexity
6 Results
6.1 FIR filter simulation
6.1.1 FIR filter spec-1
6.1.2 FIR filter spec-2
6.1.3 FIR filter spec-3
6.2 CSA tree complexity study
6.2.1 Register complexity
6.2.2 Adder complexity
6.2.3 Cost
7 Conclusion and future work
7.1 Conclusion
7.2 Suggestions
7.3 Future work

List of Figures

2.1 Schematic of DPL style XOR gate.
2.2 Simulation of the DPL style XOR gate.
2.3 Schematic of DPL style AND gate.
2.4 Simulation of the DPL style AND gate.
2.5 Schematic of DPL style XNOR gate.
2.6 Simulation of the DPL style XNOR gate.
3.1 Impulse response for different types of linear phase FIR filters.
3.2 Direct form structure for a FIR filter.
3.3 Transpose direct form structure for a FIR filter.
3.4 The block diagram of an upsampler.
3.5 Block diagram of an interpolator.
3.6 Upsampling with factor of 3.
3.7 Block diagram of a downsampler.
3.8 Block diagram of a decimator.
3.9 Downsampling of the signal with factor of 3.
3.10 Polyphase decomposition filter, signal flow graph for M = 4.
3.11 Polyphase decomposition filter identity with switching arrangement for M = 4.
3.12 Sub-filter implementation for the polyphase decomposition filter with 4-taps.
3.13 DF implementation of the polyphase decomposed FIR filter with decimation factor 4.
4.1 Block diagram of a 4:2 Counter.
4.2 Schematic of a 4:2 counter using DPL style.
4.3 Layout of the DPL design 4:2 counter.
4.4 Simulation of the DPL design 4:2 counter with 1ns clock.
4.5 Schematic of a 4:2 counter using static CMOS logic.
4.6 Layout of the 4:2 counter using static CMOS logic.
4.7 Simulation of a 4:2 counter using static CMOS logic with 1ns clock.
4.8 Average power consumption against time period for a DPL style and a static CMOS style 4:2 counter.
5.1 A pipeline stage with Hmax adder levels.
5.2 Pipelined CSA tree, with number of stages Nstage, for DF and TF architecture.
(a) Pipeline stages for DF architecture
(b) Pipeline stages for TF architecture with number of cascaded stages for FIR filter as 2
5.3 Relationship between partial product array ibits, bits, regs and inbits.
6.1 Register complexity for the reduction tree with 4:2 counter against pipeline height.
6.3 Adder complexity for filter with Wd = 4.
6.4 4:2 counter complexity against maximum height Hmax.
6.5 Total cost for the reduction tree.

List of Tables

2.1 The truth table for a XOR gate.
2.2 The truth table for an AND gate.
2.3 The truth table for a XNOR gate.
4.1 Truth table for a 4:2 counter.
5.1 Cost of the hardware used for building CSA tree.
6.1 Implementation details of CSA reduction tree for Wd = 2 and Hmax = 2 and architecture as TF.
6.2 Results of the simulation.
6.3 Implementation details of CSA reduction tree for Wd = 4 and Hmax = 2 and architecture as DF.
6.4 Results of the simulation.
6.5 Implementation details of CSA reduction tree for Wd = 5 and Hmax = 3 and architecture as TF.

Acronyms

ADC Analog-to-Digital Converter
CSA Carry Save Adder
CMOS Complementary Metal Oxide Semiconductor
DSP Digital Signal Processor
DAC Digital-to-Analog Converter
DF Direct Form
DPL Double Pass transistor Logic
FIR Finite-Impulse-Response
FA Full Adder
HA Half Adder
IIR Infinite-Impulse-Response
ILP Integer Linear Programming
LSB Least Significant Bit
MSB Most Significant Bit
RCA Ripple Carry Adder
TF Transpose direct Form
VMA Vector Merge Adder

Chapter 1

Introduction

Decimation and interpolation filters are integral parts of multirate systems, where the sampling rate differs between subsystems. A high-speed Σ∆ modulated ADC can be considered part of a multirate system in which the signal is decimated to change the data rate. Decimation is performed on the signal for an ADC, and interpolation is performed on the signal for a DAC. One of the main characteristics of Σ∆ modulation is a short data wordlength. A Σ∆ modulated ADC requires digital filters, which are often the most complex part of the ADC. The focus of this work is to understand how to simplify the digital filters used and to improve the speed of the system. One such method is to use approximate arithmetic.

A polyphase decomposed realization of the FIR filter for a Σ∆ modulated ADC is an efficient way of implementing the digital filter. Such a filter implementation involves partial product generation and building an efficient reduction tree for the generated partial products [1]. The reduction of the partial products is conventionally done using HAs and FAs. The speed of this reduction is crucial for achieving a high speed of operation. To that end, an approximate full adder is introduced into the reduction tree along with FAs and HAs. The approximate full adder is called a 4:2 counter. A 4:2 counter is a device which takes four equally weighted inputs and produces two outputs, the carry and the sum. However, using 4:2 counters to build the reduction tree introduces errors into the output of the Σ∆ modulated ADC. The output of the ADC already has some inherent noise, so the use of the 4:2 counter could increase the error at the output. The scope of this work does not cover how to reduce the noise and improve the output; instead, it focuses on the benefits achieved when using the 4:2 counter.
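Why a device with four equally weighted inputs and only two outputs must be approximate can be seen from a counting argument: the exact sum of four bits lies between 0 and 4, while a carry of weight 2 and a sum of weight 1 can only encode 0 to 3. The following Python sketch is an illustration and is not taken from the thesis; the function names are hypothetical, and nothing is assumed about what the actual counter outputs for the unrepresentable input.

```python
from itertools import product

def exact_sum(bits):
    """Exact arithmetic sum of equally weighted input bits."""
    return sum(bits)

def representable(total):
    """True if total can be encoded as 2*carry + sum with one-bit carry and sum."""
    return 0 <= total <= 3

# Enumerate all 16 input patterns and keep those whose exact sum
# cannot be encoded by the two outputs.
unrepresentable = [bits for bits in product((0, 1), repeat=4)
                   if not representable(exact_sum(bits))]

print(unrepresentable)
```

Under this argument at least the all-ones input pattern cannot be summed exactly; whether the design deviates for other patterns depends on its truth table.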

The reduction tree of the FIR filter is designed using a bit-level optimized ILP problem. A similar ILP formulation for the implementation of the reduction tree has already been done [1], but without using a 4:2 counter. The objective function of the ILP problem is to minimize the overall cost of the hardware used to build the reduction tree by efficiently placing the hardware components at its disposal. The components used here are the FAs, the HAs, the pipeline registers and the 4:2 counters. Since the objective is to minimize the cost of the hardware used, it is important to find the cost of the 4:2 counter. This cost was estimated by developing the layout in a 65nm process and comparing the area of the design against a FA design provided by the vendor, ST Microelectronics. Other factors taken into consideration when deciding the cost of the 4:2 counter were the correctness of the FIR filter result when using a 4:2 counter and the complexity of the design.

1.1 Thesis objective

This work aims to understand the benefits of introducing a 4:2 counter into the reduction tree of a polyphase decomposed FIR filter. The project has two subtasks. The first task is to design and implement a 4:2 counter in a 65nm process. The layout design is used to decide the cost of the 4:2 counter in the formulation of the ILP problem. The second task is to analyze the FIR filter as designed and implemented with and without 4:2 counters in the reduction tree.

1.2 Thesis organization

The report is organized as follows:

• Chapter 2 discusses the static CMOS logic style and the double pass transistor logic style. The AND, XOR and XNOR gates developed using double pass transistor logic are discussed together with simulation results.

• Chapter 3 briefly discusses the basics of FIR filters and the different architectures used for implementing FIR filters.

• Chapter 4 introduces the schematic and layout design for a 4:2 counter in detail, with simulation results.

• Chapter 5 discusses the details of the implementation of the FIR filter and the formulation of the ILP problem for the reduction of the partial products.

• Chapter 6 presents the simulation results for the FIR filters designed and implemented with different specifications and discusses the results obtained.

• Chapter 7 concludes the work done and summarizes the results obtained.

Chapter 2

Logic style

The choice of logic style for implementing combinatorial circuits depends on many parameters which affect the performance of the circuit, and on the applications where the design is meant to be used. Some of these parameters are power dissipation, speed of operation and the complexity of the design [6, 11]. The size or complexity of the design is determined by the number of transistors used and their respective sizing. Power dissipation is governed by the switching activity of the transistors and by the parasitic capacitance of the circuit. Speed of operation depends on the number of transistors in series and their sizing. Wiring complexity depends on the lengths of the wires used for the connections and on the placement of the transistors in the design. If the placement of the transistors is not planned properly, the wiring becomes unnecessarily complicated and the complexity increases. Selecting a logic style which optimizes all of the above parameters is difficult, so the choice of logic style involves a trade-off.

Here, two logic styles, static CMOS logic style and double pass transistor logic style (DPL), are considered for design. The 4:2 counter designed using DPL and static CMOS logic styles will provide the opportunity to compare both the designs and to evaluate the performance. Since there are some advantages in using DPL style over static CMOS style, an evaluation of the performance of both the designs would help to select the 4:2 counter design to be used in a high speed FIR filter design.

2.1 Static CMOS logic

Static CMOS designs are based on NMOS pull-down networks and PMOS pull-up networks. These networks can be used to implement any logic function. Some of the advantages of static CMOS design are good stability, no static power consumption, robustness in the presence of noise, and ease of design.

The performance of a static CMOS gate depends on the fan-in and the fan-out of the logic gate. When the fan-out of a static CMOS gate increases, the complexity, area and cost of the design increase and the speed of the circuit decreases. The reduction in speed can be compensated by using buffers in the design. However, the disadvantage of using large buffers is that the complexity of the design increases. Different studies have concluded that the delay of the design is a linear function of the fan-out [6, 11].

2.2 Pass transistor logic

Pass transistor logic is a popular and widely used alternative to static CMOS logic [6, 11]. This logic style may be preferred for the implementation of circuits like adders or multiplexers. Compared to static CMOS, the implementation of pass transistor logic is simple, as the design uses either NMOS or PMOS transistors. Using only one transistor type helps to reduce the size of the design. The reduction in area does not guarantee a reduction in wiring complexity, as the placement of the transistors can be tricky compared to a static CMOS design. The transistors must be sized carefully to ensure that the fan-out of each transistor is handled properly and that the transistors can operate at low supply voltages.

A pass transistor implements a logic function with two input signals, one connected to the gate and the other to the source or drain of the transistor, with the output taken at the remaining source or drain terminal. In static CMOS logic, by contrast, the source of a transistor is mostly connected to a power line. In pass transistor logic, when an NMOS transistor is turned on and passes a logic 1, there is a threshold voltage drop across it. The resulting output voltage is $V_{out} = V_{dd} - V_{th}$, which reduces the voltage swing of the circuit. Consider a pass transistor logic circuit where two NMOS transistors are cascaded, with the degraded output of the first driving the gate of the second. Here, extra care should be taken, as each transistor contributes a threshold voltage drop, so the output voltage becomes $V_{out} = V_{dd} - 2V_{th}$. This could lead to a situation where the voltage at the output node is insufficient to drive the next stage of the circuit, and ultimately the circuit fails to deliver the expected result. The voltage at each node of the circuit should therefore be checked, and the designer has to ensure that it is sufficient to drive the next stage. Special swing restoration circuitry can be used in the design to ensure that the voltage swing is adequate and the circuit is robust.
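The first-order estimate used in the text, in which every threshold drop lowers the available high level by one threshold voltage, can be tabulated with a few lines of Python. This is a sketch for illustration only: the function name and the threshold value of 0.3 V are hypothetical, and body effect and leakage are ignored.

```python
def pass_level(vdd, vth, n_drops):
    """High output level after n_drops threshold drops (first-order model)."""
    return vdd - n_drops * vth

VDD = 1.0  # 1 V supply, the supply voltage quoted in the abstract
VTH = 0.3  # hypothetical threshold voltage, chosen only for illustration

for n in (0, 1, 2):
    level = pass_level(VDD, VTH, n)
    # A following NMOS stage can only turn on if its gate level exceeds VTH.
    print(n, round(level, 3), level > VTH)
```

With a larger threshold voltage or a lower supply, the margin after two drops shrinks quickly, which is the situation the swing restoration circuitry is meant to prevent.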

2.3 Double pass transistor logic

Pass transistor logic circuits show speed degradation when used in designs where supply voltages are comparatively low [3]. A pass transistor circuit with only NMOS transistors or only PMOS transistors cannot drive the output node fully to $V_{dd}$ or to zero, respectively. As an improvement over pass transistor logic, double pass transistor logic (DPL) uses both NMOS and PMOS pass transistors in parallel. A DPL circuit reduces the speed degradation and improves the robustness. An NMOS transistor passes a strong logic 0 and a PMOS transistor passes a strong logic 1. Thus, for a DPL circuit, when both the NMOS and PMOS paths are turned on at the same time, the full voltage swing is restored. This avoids the level-restoring circuitry needed for pass transistor logic circuits. One of the disadvantages of DPL design is the use of large PMOS transistors, which increases the capacitive load in the design [11] and affects the speed. The AND, XOR and XNOR gates used in the 4:2 counter design use the DPL style and are discussed here. The switching of the transistors is designed such that at least one NMOS transistor is on to pass logic 0 to the output and at least one PMOS transistor is on to pass logic 1 to the output.

2.3.1 XOR gate

The schematic of a DPL style XOR gate is shown in Fig. 2.1. The truth table for the XOR gate is given in Table 2.1. The column pass in the truth table lists the inputs passed to the output. When the inputs A and B are both set to 0, the transistors M1, M3 and M4 are on and pass the inputs A and B to the output. When A is set to 0 and B is set to 1, the transistors M3 and M2 are on and pass A and B to the output. When A is set to 1 and B is set to 0, the transistors M1 and M4 are on and pass A to the output. When A and B are both set to 1, the transistor M2 is on and passes A to the output.

Figure 2.1: Schematic of DPL style XOR gate.

Simulation result

The simulation result of the XOR gate is as shown in Fig. 2.2. The input signals A and B are two of the outputs from a 4-bit counter. The output signal out is the output of the XOR gate. The output signal plotted against the inputs follows the truth table and has full rail swing between 0 and Vdd.

A B out pass
0 0 0   A, B
0 1 1   A, B
1 0 1   A
1 1 0   A

Table 2.1: The truth table for a XOR gate.

[Plot: input A, input B and output out; voltage (V) against time (s), 0 to 2 × 10^-8 s]

Figure 2.2: Simulation of the DPL style XOR gate.

2.3.2 AND gate

The schematic of the DPL style AND gate is shown in Fig. 2.3. The truth table for the AND gate is shown in Table 2.2. The column pass in the table lists the signal passed to the output for each combination of inputs. When the input B is set to 0, the transistor M2 is on and the input B is passed to the output. When the input B is set to 1, the transistors M3 and M1 are on and the input A is passed to the output.

Simulation result

The simulation result of the AND gate is shown in Fig. 2.4. The input signals A and B are two of the outputs from a 4-bit counter. The output signal out is the output of the AND gate. The output signal plotted against the inputs follows the truth table and has full rail swing between 0 and Vdd.

Figure 2.3: Schematic of DPL style AND gate.

A B out pass
0 0 0   B
0 1 0   A
1 0 0   B
1 1 1   A

Table 2.2: The truth table for an AND gate.

[Plot: input A, input B and output out; voltage (V) against time (s), 0 to 2 × 10^-8 s]

Figure 2.4: Simulation of the DPL style AND gate.

2.3.3 XNOR gate

The schematic of the DPL style XNOR gate is shown in Fig. 2.5. The XNOR gate shows double-transmission characteristics [3]. This implies that there are two paths to the output for any combination of inputs. The truth table for the XNOR gate is given in Table 2.3. The column pass in the truth table shows how the XNOR gate exhibits the double-transmission property. When the inputs A and B are both set to 0, the transistors M3 and M2 are on and pass the signals A and B to the output. When A is set to 0 and B is set to 1, the transistors M3 and M1 are on and pass A and B to the output. When A is set to 1 and B is set to 0, the transistors M2 and M4 are on and pass A and B to the output. When A and B are both set to 1, the transistors M1 and M4 are on and pass A and B to the output. Thus, for any given combination of inputs there exist two paths to the output. This reduces the effective resistance between input and output compared to pass transistor logic or complementary pass transistor logic [3]. The reduced effective resistance increases the speed of the overall gate.
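The benefit of the two parallel paths can be illustrated with a first-order model, offered here as a sketch rather than as the analysis of the thesis. Assume, hypothetically, that the two conducting transistors are modelled as on-resistances $R_1$ and $R_2$ (symbols not used in the original). The parallel combination is then

```latex
R_{\mathrm{eff}} = \frac{R_1 R_2}{R_1 + R_2} < \min(R_1, R_2)
```

so the effective resistance is lower than that of either path alone, and for $R_1 = R_2 = R$ it is $R/2$.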

Figure 2.5: Schematic of DPL style XNOR gate.

A B out pass
0 0 1   A, B
0 1 0   A, B
1 0 0   A, B
1 1 1   A, B

Table 2.3: The truth table for a XNOR gate.

Simulation result

The simulation result of the XNOR gate is shown in Fig. 2.6. The input signals A and B are two of the outputs from a 4-bit counter. The signal out is the output of the XNOR gate. The output signal plotted against the inputs follows the truth table and has full rail swing between 0 and Vdd.

[Plot: input A, input B and output out; voltage (V) against time (s), 0 to 2 × 10^-8 s]

Figure 2.6: Simulation of the DPL style XNOR gate.

Chapter 3

FIR filters

Digital filters can be implemented in two ways: by convolution or by recursion. FIR filters are implemented using convolution and IIR filters using recursion. In this chapter, time-domain FIR filters, or moving average filters, are discussed. Moving average filters are commonly used for interpolation and decimation. A FIR filter can be represented mathematically by the convolution equation shown in (3.1).

$$y(n) = \sum_{k=N_1}^{N_2} x(n-k)\, h(k) \tag{3.1}$$

Here x(n) represents the input data and h(k) represents the impulse response of the system. FIR filters are always stable, and, unlike IIR filters, some FIR filters exhibit a linear phase response. The linear phase property results from the symmetry or antisymmetry of the impulse response of the filter. Stability and linear phase response make FIR filters useful in many DSP applications.
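The convolution sum (3.1) can be evaluated directly. The Python sketch below is a minimal illustration, assuming a finite input sequence that is zero outside its stored range and an impulse response stored as a list starting at index N1 (taken as 0 by default); the function name and the sample values are hypothetical.

```python
def fir_output(x, h, n, n1=0):
    """Evaluate y(n) = sum over k of x(n - k) * h(k), with k starting at n1.

    Samples of x outside the stored range are treated as zero.
    """
    total = 0
    for i, hk in enumerate(h):
        k = n1 + i
        if 0 <= n - k < len(x):
            total += x[n - k] * hk
    return total

x = [1, 2, 3, 4]   # example input samples
h = [1, 1, 1]      # unnormalised three-tap moving average
y = [fir_output(x, h, n) for n in range(len(x) + len(h) - 1)]
print(y)           # [1, 3, 6, 9, 7, 4]
```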

3.1 Linear-phase FIR filters

Linear-phase FIR filters find many applications in digital signal processing. In fact, most commonly used FIR filters are linear-phase filters. Linear phase means that the phase of the filter is a linear function of frequency. Since the phase is linear in frequency, the delay through the filter is the same for all frequencies. The linear phase property thus helps the FIR filter to avoid phase distortion.

A FIR filter is a linear-phase filter when its impulse response is either symmetric or antisymmetric around n = N/2. For a linear-phase FIR filter the impulse response h(n) can be written as

$$h(n) = \begin{cases} h(N-n) & \text{for symmetry around } n = N/2 \\ -h(N-n) & \text{for antisymmetry around } n = N/2 \end{cases}$$

where n = 0, 1, ..., N. Depending on the type of symmetry and on whether the order N is even or odd, linear-phase filters are classified into four types.

Type I : h(n) = h(N − n), N → even

Type II : h(n) = h(N − n), N → odd

Type III : h(n) = −h(N − n), N → even

Type IV : h(n) = −h(N − n), N → odd

Figure 3.1 illustrates the impulse responses of the four types of linear-phase FIR filters.

[Four panels, each plotting h(n) against n with N/2 and N marked: Type I (N even), Type II (N odd), Type III (N even), Type IV (N odd)]

Figure 3.1: Impulse response for different types of linear phase FIR filters.
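The classification into the four types can be checked mechanically from the coefficients. The following Python sketch assumes the impulse response is given as a list h(0), ..., h(N) of exactly comparable values (for example integers); the function name is hypothetical.

```python
def linear_phase_type(h):
    """Return 'I', 'II', 'III' or 'IV' for a linear-phase impulse response,
    or None if h is neither symmetric nor antisymmetric around N/2."""
    N = len(h) - 1  # filter order
    symmetric = all(h[n] == h[N - n] for n in range(N + 1))
    antisymmetric = all(h[n] == -h[N - n] for n in range(N + 1))
    if symmetric:
        return "I" if N % 2 == 0 else "II"
    if antisymmetric:
        return "III" if N % 2 == 0 else "IV"
    return None

print(linear_phase_type([1, 2, 1]))     # I
print(linear_phase_type([1, 2, 2, 1]))  # II
print(linear_phase_type([1, 0, -1]))    # III
print(linear_phase_type([1, -1]))       # IV
```

Note that for Type III the centre coefficient h(N/2) must be zero, since it has to equal its own negative.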

3.2 FIR filter structures

The filter structure gives the framework for how a FIR filter is implemented. The two structures most commonly used for implementing FIR filters, the DF structure and the TF structure, are discussed here.

The transfer function of the FIR filter is defined in (3.2). For a FIR filter of order N the impulse response is zero for n > N, so the filter has N + 1 coefficients. Multiplying the input data with the N + 1 coefficients requires N + 1 multipliers, and summing the results requires N adders. The output of any FIR filter can be represented using the convolution equation (3.1), from which the output y(n) can be written as the difference equation (3.3).

$$H(z) = \sum_{n=0}^{\infty} h(n)\, z^{-n} \tag{3.2}$$

$$y(n) = x(n)h(0) + x(n-1)h(1) + x(n-2)h(2) + \dots + x(n-k)h(k) \tag{3.3}$$

3.2.1 Direct form structure

DF structures are structures in which the multiplier coefficients are the same as the coefficients of the transfer function [10]. The DF structure is the direct realization of (3.3). Figure 3.2 shows the signal flow graph of the DF structure for a 5th order FIR filter. This structure provides the simplest framework for implementing the filter. There are 6 multiplications and 5 additions, and the delay elements in the structure provide the algorithmic delays for the input. The input data is multiplied with the filter coefficients and the sum of the results gives the filter response.

[Signal flow graph: the input x(n) passes through five delay elements T giving x(n-1) to x(n-5); the taps are multiplied by h(0) to h(5) and summed by five adders to give y(n)]

Figure 3.2: Direct form structure for a FIR filter.
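The direct form structure can be mirrored in software by a delay line holding the past input samples. The Python sketch below is an illustration under the assumption of zero initial state; the function name and the coefficient values are hypothetical and are not the filter coefficients used in the thesis.

```python
def direct_form_fir(x, h):
    """Filter x with coefficients h using a direct form delay line."""
    delay = [0] * (len(h) - 1)      # holds x(n-1), ..., x(n-N), initially zero
    y = []
    for sample in x:
        taps = [sample] + delay     # x(n), x(n-1), ..., x(n-N)
        y.append(sum(c * t for c, t in zip(h, taps)))
        delay = taps[:-1]           # shift the delay line by one sample
    return y

# A 5th order filter has 6 coefficients; an impulse at the input
# reproduces them at the output.
h = [1, 2, 3, 4, 5, 6]
print(direct_form_fir([1, 0, 0, 0, 0, 0], h))  # [1, 2, 3, 4, 5, 6]
```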

3.2.2 Transpose direct form structure

The TF structure can be derived from the signal flow graph of the DF structure. First, interchange the input and output and reverse all the arrow directions. Then replace all the pick-up nodes with summation nodes and all the summation nodes with pick-up nodes. Pick-up nodes are those indicated by a dot in the diagram, and summation nodes are the addition operations. This process is called transposition, and the resulting signal flow graph represents the transpose direct form structure. For the TF structure, too, the multiplier coefficients are the same as the coefficients of the transfer function. Figure 3.3 shows the implementation of a 5th order FIR filter with the TF structure. In the transpose structure the input data is multiplied with all the coefficients. The results of the multiplications are then passed through the delays, and the sum of all the results gives the filter response.

[Signal flow graph: the input x(n) is multiplied by h(5) to h(0); the products are combined through five adders and five delay elements T to give y(n)]

Figure 3.3: Transpose direct form structure for a FIR filter.
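In the transposed structure the registers sit between the adders and hold partial sums rather than past inputs. The Python sketch below illustrates this under the assumption of zero initial state and compares the result against a plain convolution; the function names and test values are hypothetical.

```python
def transposed_form_fir(x, h):
    """Filter x with coefficients h using the transposed direct form."""
    state = [0] * (len(h) - 1)          # registers between the adders
    y = []
    for sample in x:
        products = [c * sample for c in h]   # input multiplied by every coefficient
        y.append(products[0] + (state[0] if state else 0))
        # Each register takes the next product plus the next register's old value.
        new_state = []
        for i in range(len(state)):
            carried = state[i + 1] if i + 1 < len(state) else 0
            new_state.append(products[i + 1] + carried)
        state = new_state
    return y

def convolve_reference(x, h):
    """Reference output computed directly from the convolution sum."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

x = [1, 2, 3, 4, 0, 0]
h = [1, 2, 3, 4, 5, 6]
print(transposed_form_fir(x, h) == convolve_reference(x, h))  # True
```

Both structures compute the same output; they differ in where the delays are placed, which affects the critical path and the register wordlengths in hardware.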

3.3 Multirate techniques

One of the advantages of digital signal processing over analog signal processing is that multirate techniques can be introduced [8, 10] into a digital signal processing system. Multirate processing implies that the sampling rate can differ between subsystems. There are well-established techniques for changing the sample rate in different applications. Different sample rates are achieved by upsampling and downsampling the signals. Multirate techniques help to improve the performance and to reduce the workload of the system.

3.3.1 Interpolation

Upsampling [10] is the process of increasing the sampling rate. Figure 3.4 shows the block diagram of an upsampler. Upsampling involves introducing zeros between the samples. Upsampling a signal by a factor of M introduces M − 1 zeros between consecutive samples.

[Block diagram: x(n) at rate fsample → upsampling by M → x1(k) at rate M·fsample]

Figure 3.4: The block diagram of an upsampler.

$$x_1(k) = \begin{cases} x(k/M) & \text{if } k = 0, \pm M, \pm 2M, \dots \\ 0 & \text{otherwise} \end{cases} \tag{3.4}$$

$$X_1(z) = X(z^M) \tag{3.5}$$

When a signal x(n) is upsampled, it is converted into another signal x1(k). The mathematical representation of the signal is given by (3.4) and the frequency relation corresponding to the upsampling is given by (3.5). An example of how the signal x(n) is changed to x1(k) when x(n) is upsampled with a factor of M = 3 is shown in Fig. 3.6. In the example the upsampling introduces two zeros between each pair of samples. The new signal x1(k) introduces unwanted images of the original spectrum, which are removed by using an anti-imaging low pass filter [10]. The process of upsampling and filtering the resulting signal with an anti-imaging filter is called interpolation, and the system performing interpolation is called an interpolator. The block diagram of an interpolator is shown in Fig. 3.5.
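The zero insertion performed by the upsampler can be illustrated with a minimal Python sketch. The function name `upsample` and the sample values are chosen here for illustration only, and the anti-imaging filter is not modelled.

```python
def upsample(x, M):
    """Insert M - 1 zeros after every sample of x (zero insertion)."""
    out = []
    for sample in x:
        out.append(sample)          # x1(k) = x(k / M) when k is a multiple of M
        out.extend([0] * (M - 1))   # x1(k) = 0 otherwise
    return out


x = [5, 7, 9]                       # arbitrary illustrative samples
x1 = upsample(x, 3)
print(x1)                           # [5, 0, 0, 7, 0, 0, 9, 0, 0]
```

With M = 3, two zeros appear after each input sample and the output rate is three times the input rate.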

[Figure: x(n) at rate f_sample → upsampling by M → x1(k) at rate M·f_sample → anti-imaging filter H(z) → y(k)]

Figure 3.5: Block diagram of an interpolator.

[Figure: the samples of x(n) at n = 0, 1, ..., 6 appear in x1(k) at k = 0, 3, 6, ..., 18 after upsampling with factor M = 3]

Figure 3.6: Upsampling with factor of 3.

3.3.2 Decimation

Downsampling [10] is the process of decreasing the sampling rate. Figure 3.7 shows the block diagram of a downsampler. The basic operation of downsampling involves discarding some of the samples of the original signal. Downsampling a signal by a factor of M discards M − 1 out of every M samples; in other words, only every Mth sample is kept. This reduces the sampling rate by a factor of M.

[Figure: x(k) at rate M·f_sample → downsampling by M → x1(n) at rate f_sample]

Figure 3.7: Block diagram of a downsampler.

The block diagram of a decimator is shown in Fig. 3.8. Downsampling is performed on the signal x1(k), which is thereby converted into the signal y(n). The frequency relation for the downsampler is given by (3.6) and the mathematical representation of the signal by (3.7). Before downsampling, the signal must be free from unwanted components, so the input signal x(k) is first low pass filtered to obtain x1(k). The low pass filter used is an anti-aliasing filter. The purpose of the anti-aliasing filter is to reduce the aliasing effect that occurs when the sampling rate is reduced by a factor M. In the example shown in Fig. 3.9, downsampling with a factor of M = 3 removes the zeros introduced by the interpolation, and the input samples are retrieved.

$$Y(z^M) = \frac{1}{M} \sum_{k=0}^{M-1} X_1\left(z\, e^{-j 2\pi k / M}\right) \tag{3.6}$$

$$y(n) = x_1(nM) \tag{3.7}$$
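The sample selection done by the downsampler can be sketched in a few lines of Python. The name `downsample` and the input values are illustrative assumptions, and the anti-aliasing filter is left out.

```python
def downsample(x1, M):
    """Keep every M-th sample of x1, i.e. y(n) = x1(n * M)."""
    return x1[::M]


# A signal that was zero-stuffed with factor 3 (illustrative values).
x1 = [5, 0, 0, 7, 0, 0, 9, 0, 0]
y = downsample(x1, 3)
print(y)                            # [5, 7, 9]
```

As in the example of Fig. 3.9, downsampling with M = 3 discards exactly the zeros that an upsampling with the same factor would have inserted.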

[Figure: x(k) at rate M·f_sample → anti-aliasing filter H(z) → x1(k) → downsampling by M → y(n) at rate f_sample]

Figure 3.8: Block diagram of a decimator.

[Figure: the samples of x1(k) at k = 0, 3, 6, ..., 18 become y(n) at n = 0, 1, ..., 6 after downsampling with factor M = 3]

Figure 3.9: Downsampling of the signal with factor of 3.

3.4 Polyphase decimation filters

Polyphase decomposition is an efficient way of implementing sample rate conversion. A reduction in the workload is also attained if the filters are implemented using polyphase techniques. For a decimation operation with a factor of M, the M − 1 samples between each retained sample are ignored. This rejection of M − 1 samples can be utilized to build the filter in an efficient way, which helps to reduce the workload of the system. The FIR filter used for the analysis and the simulation is a polyphase decimation filter.

The frequency response of the polyphase filter can be derived from the original filter. Let h(n) be the impulse response of the filter, then the frequency response of the filter is given by (3.2). Rewrite the impulse response h(n) as the sum of M partial signals as shown in (3.8). By doing so the impulse response h(n) can be considered as divided into M parallel filters.

$$h_k(j) = h(jM + k), \quad k = 0, 1, \ldots, M-1 \tag{3.8}$$

Each hk(j) represents the original impulse response h(n) at the time instances jM + k. The new representation has M parallel filters. For example, consider the following 10 filter coefficients:

h(n) : 8 3 2 4 7 10 5 4 6 1

If M = 4, then the coefficients for each parallel filter would be,

h0(j) : {8, 7, 6}

h1(j) : {3, 10, 1}

h2(j) : {2, 5}

h3(j) : {4, 4}
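The grouping in (3.8) can be reproduced with a short Python sketch; the helper name `polyphase_split` is chosen here for illustration.

```python
def polyphase_split(h, M):
    """Return the M sub-filters h_k(j) = h(j*M + k) for k = 0, ..., M-1."""
    return [h[k::M] for k in range(M)]


h = [8, 3, 2, 4, 7, 10, 5, 4, 6, 1]     # the ten example coefficients
for k, hk in enumerate(polyphase_split(h, 4)):
    print(f"h{k}(j): {hk}")
```

For M = 4 this prints the four coefficient sets listed above. Since the number of coefficients is not a multiple of M, the last two sub-filters have one coefficient fewer.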

If h(n) is replaced with hk(j) in the frequency response equation of the filter, the frequency response of the polyphase form of the filter is obtained, as given by (3.9) [10].

$$H(z) = \sum_{k=0}^{M-1} z^{-k} H_k(z^M) \tag{3.9}$$

Polyphase decimator structures are implemented using decimation identities [10]. One such identity is explained here. Figure 3.10 represents the signal flow graph of the identity and Fig. 3.11 shows the implementation of the decimation identity [5, 10]. In the signal flow graph the input data is downsampled by M = 4. The downsampled input is then fed to the parallel filter branches, and the results of the branches are added to provide the filter response. The identity is implemented by replacing the downsamplers with a switching arrangement. The input is fed to the switch, which moves from one position to the next in anti-clockwise direction. This sends the input samples to the different parallel filter branches in turn, so the switch performs the downsampling of the input samples.
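The effect of the identity can be checked numerically. The following Python sketch, written under the assumption of zero initial conditions, compares filtering at the high rate followed by downsampling with the polyphase form, in which branch k receives the input samples x(mM − k) and runs at the low rate. All function names and the test input are illustrative and not taken from the referenced implementation.

```python
def fir(h, x):
    """Direct-form FIR: y(n) = sum_i h(i) * x(n - i), same length as x."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]


def decimate_direct(h, x, M):
    """Filter at the high rate, then keep every M-th output sample."""
    return fir(h, x)[::M]


def decimate_polyphase(h, x, M):
    """Distribute the input over M branches, filter each at the low rate."""
    n_out = (len(x) + M - 1) // M
    y = [0] * n_out
    for k in range(M):
        hk = h[k::M]                              # h_k(j) = h(j*M + k)
        # Branch input u_k(m) = x(m*M - k); samples before time 0 are zero.
        uk = [x[m * M - k] if m * M - k >= 0 else 0 for m in range(n_out)]
        vk = fir(hk, uk)                          # sub-filter at the low rate
        y = [a + b for a, b in zip(y, vk)]        # add the branch outputs
    return y


h = [8, 3, 2, 4, 7, 10, 5, 4, 6, 1]               # example coefficients
x = list(range(1, 21))                            # arbitrary test input
assert decimate_direct(h, x, 4) == decimate_polyphase(h, x, 4)
```

In the direct form every output of the high-rate filter is computed and three out of four are thrown away, whereas in the polyphase form each sub-filter only computes the outputs that are kept.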

3.5 Multirate filter implementation

The implementation of the FIR filter involves the following steps. The first step is the generation of the filter coefficients. The second step is the partial product generation for the filter using constant multipliers. The third step involves the realization of the pipelined CSA reduction tree structure.

[Figure: signal flow graph with input x(n), three delay elements T, four downsamplers by 4, sub-filters H0(z) to H3(z), three adders and output y(n)]

Figure 3.10: Polyphase decomposition filter, signal flow graph for M = 4.

[Figure: input x(n), sub-filters H0(z) to H3(z), three adders and output y(n)]

Figure 3.11: Polyphase decomposition filter identity with switching arrangement for M = 4.

3.5.1 Filter coefficients

The FIR filter used is a linear phase moving average filter. The coefficients of the FIR filter are generated using the convolution equation given by (3.1). The filter coefficients thus generated are unsigned numbers, and the optimization problem is applied to these generated filter coefficients.

The coefficients for the polyphase decomposed FIR filter with N taps and M cascaded stages are generated. Each generated filter coefficient is denoted h(n) and is defined by (3.10), which gives the binary representation of the coefficient. The parameter Wc is the wordlength of the filter coefficients, hn,k ∈ {0, 1}, and k denotes the bit position.

$$h(n) = \sum_{k=0}^{W_c} h_{n,k}\, 2^k \tag{3.10}$$
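As a worked instance of the binary representation, take a hypothetical coefficient value of 10, chosen here only for illustration:

$$h(n) = 10 = 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0$$

so that $h_{n,3} = h_{n,1} = 1$ and $h_{n,2} = h_{n,0} = 0$. Only the two non-zero bits contribute terms to the sum.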

3.5.2 Partial product generation

Once the filter coefficients are generated, they are multiplied with the input data to generate the partial products for the filter. The input data has a wordlength Wd and the output of the multiplication has a wordlength Wout. Since the filter coefficients are known values, a general multiplier is not needed and can be replaced with a constant multiplier. The multiplication is done at the bit level: partial products are generated only for the non-zero bits of a coefficient, and the zero bits are ignored.

The implementation of the polyphase decimation filter involves the implementation of the sub-filters. The number of sub-filters is determined by the decimation factor N, so there will be a maximum of N sub-filters, numbered from 0 to N − 1. The sub-filters are generated by grouping the filter coefficients in anti-clockwise direction. That is, the first filter coefficient is mapped to sub-filter–(N − 1) and the second filter coefficient is mapped to sub-filter–(N − 2). This mapping continues until N filter coefficients are mapped to the N sub-filters. The delays in the filter are also assigned to each sub-filter when filter coefficients are mapped to it.

The partial products are generated after the filter coefficients are mapped to each sub-filter. The wordlength of the partial product is the input data wordlength Wd. Each time a partial product is generated it is merged to a partial product vector. The partial product vector has a length of Wout and it has the weight information needed by the CSA reduction tree to reduce the generated partial products. The result from the CSA reduction tree is the FIR filter response.
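The bit-level generation and merging of partial products can be sketched in Python as follows. This is only one possible reading of the text: the merged partial product vector is modelled as the number of partial product bits at each bit weight, and the names `partial_products` and `column_heights` as well as the numeric values are assumptions made for illustration.

```python
def partial_products(h, x):
    """One shifted copy of x for every non-zero bit of the constant h."""
    return [x << k for k in range(h.bit_length()) if (h >> k) & 1]


def column_heights(coeffs, w_d, w_out):
    """Count how many partial product bits land on each bit weight."""
    heights = [0] * w_out
    for h in coeffs:
        for k in range(h.bit_length()):
            if (h >> k) & 1:                  # zero bits generate nothing
                for bit in range(w_d):        # a partial product is w_d bits wide
                    heights[k + bit] += 1     # shifted by the bit position k
    return heights


# Constant multiplication by 12 = 2^3 + 2^2 needs two partial products.
print(partial_products(12, 5), sum(partial_products(12, 5)))   # [20, 40] 60

# Hypothetical sub-filter with coefficients 1, 12 and 3 and 4-bit input data.
print(column_heights([1, 12, 3], w_d=4, w_out=8))
```

The column heights are the kind of weight information a reduction tree needs, since they state how many bits of each weight have to be added.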

Consider a polyphase decomposed FIR filter with a decimation factor of 4, 3 cascaded stages and the DF architecture. The convolution operation generates ten non-zero filter coefficients for this FIR filter. Let those filter coefficients be denoted by h0, h1, ..., h9. As the number of sub-filters for the polyphase decomposed filter is 4, the filter coefficients are grouped into 4 different sub-filters.

sub-filter– 3 : h0, h4, h8

sub-filter– 2 : h1, h5, h9

sub-filter– 1 : h2, h6

sub-filter– 0 : h3, h7
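The coefficient count and the grouping can be reproduced with a small Python sketch. It assumes that each stage is a 4-tap moving average with unnormalized unit coefficients and that the stages are combined by repeated convolution; the thesis text does not list the coefficient values, so the numbers below follow from this assumption only.

```python
def convolve(a, b):
    """Linear convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def moving_average_cascade(taps, stages):
    """Coefficients of `stages` cascaded moving average filters of length `taps`."""
    h = [1]
    for _ in range(stages):
        h = convolve(h, [1] * taps)
    return h


N = 4                                            # decimation factor
h = moving_average_cascade(taps=N, stages=3)
print(len(h), h)                                 # ten unsigned coefficients

# h0 goes to sub-filter 3, h1 to sub-filter 2, and so on, wrapping around.
sub_filters = {N - 1 - k: h[k::N] for k in range(N)}
for index in sorted(sub_filters, reverse=True):
    print(f"sub-filter {index}: {sub_filters[index]}")
```

Three cascaded stages of length 4 give 3 · (4 − 1) + 1 = 10 coefficients, which matches the ten coefficients h0 to h9 used in the grouping above.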

Figure 3.12 shows the implementation of sub-filter–3. The input data x0 with wordlength Wd is multiplied with the filter coefficient h0, and the delayed versions of the input are multiplied with h4 and h8. The results of the multiplications are then added together to produce the sub-filter–3 response. Figure 3.13 shows the implementation of the polyphase decomposed FIR filter with the specification mentioned above. The decimation factor 4 implies that there are four inputs x0, x1, x2 and x3, each with wordlength wd, one to each sub-filter. The partial product block is fed with the input data and its delayed versions, and it generates and merges the partial products.

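[Figure: x0(n), x0(n−1) and x0(n−2), each of wordlength Wd, multiplied by h0, h4 and h8 in the partial product generation block, followed by the CSA tree with output wordlength Wout]

Figure 3.12: Sub-filter implementation for the polyphase decomposition filter with 4-taps.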

[Figure: inputs x0, x1, x2 and x3 of wordlength wd, four rows of delay elements T, the partial product generation block and the CSA tree with output wordlength Wout]

Figure 3.13: DF implementation of the polyphase decomposed FIR filter with decimation factor 4.

3.5.3 CSA reduction tree

The input to the CSA reduction tree is the merged partial product vector, which carries the weight information. The CSA reduction tree is implemented as a pipelined reduction tree. It uses 4:2 counters, FAs, HAs and pipeline registers to reduce the merged partial product vector. The output from the CSA reduction tree is the response of the FIR filter with a wordlength of Wout.


Chapter 4

4:2 counters

Several realizations of multipliers are available; one such method is the generation of the partial products followed by their reduction using a reduction tree. Some conventional approaches for building the reduction tree are those of Wallace [9] and Dadda [2]. These approaches help in faster reduction of the partial products and generally use FAs and HAs. To further increase the speed of reduction, another generalized component, the (n, m) parallel counter [4], can be used along with the HAs and FAs. The author of [4] uses the Dadda scheme for the reduction of the partial products and introduces (n, m) parallel counters into the reduction tree. An (n, m) parallel counter takes n inputs of equal weight and converts them into m outputs. The m output bits represent the number of ones at the n inputs. For example, a FA can be considered a (3, 2) counter, which takes three inputs of equal weight and gives two outputs, the carry and the sum. These output bits represent the count of the number of ones at the input.
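That a FA acts as a (3, 2) counter can be confirmed exhaustively with a short Python sketch; the function name `full_adder` is chosen here for illustration.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (carry, sum) for three input bits of equal weight."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s


# The two output bits, read as a binary number, count the ones at the input.
for a, b, c in product((0, 1), repeat=3):
    carry, s = full_adder(a, b, c)
    assert 2 * carry + s == a + b + c
```

Three inputs can hold at most three ones, which fits in two output bits, so the (3, 2) counter is exact.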

4.1 Introduction to 4:2 counter

A 4:2 counter is a combinatorial circuit which takes in four inputs of equal weight and produces two outputs the carry and the sum which represent the count of the number of ones at the input.

The truth table for the 4:2 counter is shown in Table 4.1. The four inputs are labeled A, B, D and E in the table, and the two outputs are named carry and sum. It is clear from the truth table that the combination of the outputs carry and sum represents the count of the number of ones at the input. The only scenario where it fails to give the correct result is when all the inputs are set to 1. For an (n, m) counter, if four of the input bits are set to one, the output should give a result of four. Since a 4:2 counter has only two bits at the output, it is not possible to represent four; the maximum value which can be represented using 2 bits is three. Here, instead of approximating the output to three, a value of two is set.

A  B  D  E    carry  sum
0  0  0  0      0     0
0  0  0  1      0     1
0  0  1  0      0     1
0  0  1  1      1     0
0  1  0  0      0     1
0  1  0  1      1     0
0  1  1  0      1     0
0  1  1  1      1     1
1  0  0  0      0     1
1  0  0  1      1     0
1  0  1  0      1     0
1  0  1  1      1     1
1  1  0  0      1     0
1  1  0  1      1     1
1  1  1  0      1     1
1  1  1  1      1     0

Table 4.1: Truth table for a 4:2 counter.
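The behaviour in Table 4.1 can be written as a short behavioural model in Python. The function name `counter_4_2` is an illustrative choice, and the model describes only the input-output relation, not the circuit.

```python
from itertools import product


def counter_4_2(a, b, d, e):
    """Approximate 4:2 counter: return (carry, sum) for four input bits."""
    ones = a + b + d + e
    value = 2 if ones == 4 else ones    # four does not fit in two bits
    return value >> 1, value & 1


# Print the truth table in the column order A B D E carry sum.
for bits in product((0, 1), repeat=4):
    carry, s = counter_4_2(*bits)
    print(*bits, carry, s)

# The only input combination with a wrong count is the all-ones case.
wrong = [bits for bits in product((0, 1), repeat=4)
         if 2 * counter_4_2(*bits)[0] + counter_4_2(*bits)[1] != sum(bits)]
print(wrong)                            # [(1, 1, 1, 1)]
```

Fifteen of the sixteen input combinations are counted exactly; for the all-ones input the model returns two instead of four.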

Some scenarios considered when approximating the output value are described here. First, when a 4:2 counter is used in the reduction tree of a FIR filter it may introduce an error at the output, irrespective of the value to which the 4:2 counter approximates. That is, the number of errors introduced into the system will be the same. Second, while designing the 4:2 counter it was found that the complexity of the design is higher for the counter with an output value of three than for the one with the value set to two. Third, the number of bit changes is larger when the output is approximated from four to three than from four to two. That is, there are three bit changes when the output is set to three, from 100 to 011, and only two bit changes when the output is set to two, from 100 to 010.

Figure 4.1 shows the block diagram of a 4:2 counter. A 4:2 counter circuit could be confused with a 4:2 compressor [4, 7] circuit. A 4:2 compressor is a combinatorial circuit which takes in 4 inputs of the same weight and produces the outputs carry and sum. Along with the four inputs, a 4:2 compressor circuit also takes in a carry input from the previous stage and generates one extra carry bit to pass on to the next stage.
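The difference to a compressor can be made concrete with a Python sketch of a 4:2 compressor built from two cascaded FAs. This is a commonly used construction and is assumed here for illustration; it is not necessarily the circuit described in [4, 7], and the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (carry, sum) for three input bits."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor from two full adders: return (cout, carry, sum)."""
    cout, s1 = full_adder(x1, x2, x3)   # cout is passed to the next stage
    carry, s = full_adder(s1, x4, cin)  # cin comes from the previous stage
    return cout, carry, s


# With the extra carry in and carry out, all five input bits are counted exactly.
for bits in product((0, 1), repeat=5):
    cout, carry, s = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```

The compressor is exact because it has three output bits, two of them with weight two, whereas the 4:2 counter has only the two outputs carry and sum and therefore approximates the all-ones input.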

The 4:2 counter circuit is designed using two logic styles, static CMOS logic style and DPL style. This provides the opportunity to compare both the designs and evaluate the performance.

[Figure: 4:2 counter block with inputs A, B, C and D and outputs carry and sum]

Figure 4.1: Block diagram of a 4:2 Counter.

4.2 4:2 counter using DPL style

From the truth table in Table 4.1, the realization of a 4:2 counter for the outputs sum and carry can be derived as

$$\begin{aligned} \text{sum} &= A \oplus B \oplus D \oplus E \\ \text{carry} &= (A \oplus B) \cdot (D \oplus E) + A \cdot B + D \cdot E \end{aligned} \tag{4.1}$$

The transistor-level schematic of the 4:2 counter is shown in Fig. 4.2. The AND, XOR and XNOR gates used for the realization of sum and carry use the DPL style as explained in Section 2.3, and the 3-input NOR gate, the buffers and the inverters use the static CMOS logic style.

[Figure: transistor-level schematic with inputs A, B, D and E, internal nodes int1 to int5, a 3-input NOR gate, two buffers and outputs carry and sum]

Figure 4.2: Schematic of a 4:2 counter using DPL style.

4.2.1 Simulation of layout and schematic

The goal of the layout design was to analyze the area of the design and compare it with the area of a FA. The comparison helps to set the cost function of the 4:2 counter for the formulation of the ILP problem. It also helps to analyze how efficiently the design fits into the CSA tree of a high speed filter. Figure 4.3 shows the layout of a 4:2 counter using the DPL design, completed in a 65 nm process. The ILP problem formulation and the details of the cost function of the 4:2 counter are discussed in Chapter 5. Figure 4.4 shows the simulation of the RC-extraction of the design. The input signals for the testbench are taken from a 4-bit counter. The signals cout and sum are the outputs from the RC-extraction. The simulation shown was run for a clock frequency of 1 GHz. The outputs cout and sum are as expected and can be verified against the truth table in Table 4.1.



Figure 4.3: Layout of the DPL design 4:2 counter.

4.3 4:2 counter using static CMOS logic

For the design of the 4:2 counter using static CMOS logic, the realization used for the DPL design is not followed. Instead, a new realization is derived as given in (4.2). The DPL realization uses three XOR gates, which increases the number of transistors used in the design. This increased complexity in turn increases the area of the design. To reduce the number of transistors, the number of XOR gates is reduced to one and the rest of the design is realized using AND, OR and NAND gates. The realization for the outputs sum and carry of the 4:2 counter using static CMOS logic is given below.

$$\begin{aligned} \text{sum} &= \overline{(A \cdot B)} \cdot (A + B) \oplus \overline{(D \cdot E)} \cdot (D + E) \\ \text{carry} &= (A + B) \cdot (D + E) + A \cdot B + D \cdot E \end{aligned} \tag{4.2}$$

If the pass transistor realization had been followed for static CMOS logic as well, it would have required 64 transistors to realize the static CMOS design. However, the new realization requires only 50 transistors, which reduces the complexity of the design. The schematic of the 4:2 counter using static CMOS logic is shown in Fig. 4.5.
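The equivalence of the two realizations can be checked exhaustively with a Python sketch. It takes the products A · B and D · E in the sum expression of (4.2) as complemented (NAND) terms, in line with the NAND gates mentioned in the text; the function names are illustrative.

```python
from itertools import product


def dpl_realization(a, b, d, e):
    """(carry, sum) according to (4.1), using three XOR operations."""
    s = a ^ b ^ d ^ e
    carry = ((a ^ b) & (d ^ e)) | (a & b) | (d & e)
    return carry, s


def cmos_realization(a, b, d, e):
    """(carry, sum) according to (4.2), using a single XOR operation."""
    nand_ab = 1 - (a & b)
    nand_de = 1 - (d & e)
    s = (nand_ab & (a | b)) ^ (nand_de & (d | e))
    carry = ((a | b) & (d | e)) | (a & b) | (d & e)
    return carry, s


for bits in product((0, 1), repeat=4):
    carry, s = cmos_realization(*bits)
    assert (carry, s) == dpl_realization(*bits)
    expected = 2 if sum(bits) == 4 else sum(bits)   # behaviour of Table 4.1
    assert 2 * carry + s == expected
```

The term NAND(A, B) · (A + B) equals A ⊕ B, which is why a single explicit XOR gate is sufficient for the sum output in this realization.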

[Figure: waveforms of the inputs A, B, C and D and the outputs cout and sum; voltage V (V) from 0 to 1 versus time (s) from 0 to 1.8 × 10^−8]

Figure 4.4: Simulation of the DPL design 4:2 counter with 1ns clock.

[Figure: schematic with inputs A, B, D and E and outputs sum and carry]

Figure 4.5: Schematic of a 4:2 counter using static CMOS logic.

4.3.1 Simulation of layout

The goal of the layout designing was to analyze the design against the DPL design implemented. A comparison on performance and robustness with the
