

Final thesis

Low Power Design Using RNS

by

Viktor Classon

LITH-ISY-EX--14/4792--SE

2014-08-25

Supervisor, ISY: Oscar Gustafsson
Supervisor, Ericsson: Shafqat Ullah


Abstract

Power dissipation has become one of the major limiting factors in the design of digital ASICs. Low power dissipation will increase the mobility of the ASIC by reducing the system cost, size and weight. DSP blocks are a major source of power dissipation in modern ASICs. The residue number system (RNS) has, for a long time, been proposed as an alternative to the regular two's complement number system (TCS) in DSP applications to reduce the power dissipation. The basic concept of RNS is to first encode the input data into several smaller independent residues. The computational operations are then performed in parallel and the results are eventually decoded back to the original number system. Due to the inherent parallelism of the residue arithmetic, a hardware implementation results in multiple smaller design units. Therefore an RNS design requires low-leakage power cells and will result in a lower switching activity.

The residue number system has been analyzed by first investigating different implementations of RNS adders and multipliers (which are the basic arithmetic functions in a DSP system) and then deriving an optimal combination of these. The optimal combinations have been used to implement an FIR filter in RNS that has been compared with a TCS FIR filter.

By providing different input data and coefficients to both the RNS and TCS FIR filters, an evaluation of their respective performance in terms of area, power and operating frequency has been performed. The result is promising for uniformly distributed random input data, with approximately 15 % reduction of average power with RNS compared to TCS. For a realistic DSP application with normally distributed input data, the power reduction is negligible for practical purposes.


Acknowledgements

First of all I would like to thank the employees at the section Digital ASIC at Ericsson in Kista for all the help and support and most of all for giving me the opportunity of doing my master’s thesis. A big thanks especially to my supervisor Shafqat Ullah at Ericsson for all the support and for sharing his knowledge with me. I also want to thank my examiner, Mark Vesterbacka, and my supervisor at ISY, Oscar Gustafsson, for all the help and support during the thesis.

Most of all I want to thank my parents, Maria and Svante, and my partner, Linda, for supporting me during my five years of studies. Without you I would not have been able to make it!

Finally I would like to thank all fellow students at Linköping University and master thesis students at Digital ASIC for a fantastic time and supporting company! Especially a big thanks to Emil Lundqvist for his review of this work as an opponent and to Ejaz Sadiq for being a great sounding board during my master's thesis.

Stockholm, August 2014 Viktor Classon

Contents

1 Introduction
  1.1 Problem statement
  1.2 Methodology
  1.3 Prior work
  1.4 Outline
  1.5 Limitations

2 Background
  2.1 RNS arithmetic
    2.1.1 Basic arithmetic operations
    2.1.2 Conversion
    2.1.3 Choosing a moduli-set
  2.2 FIR filters
  2.3 Design flow
    2.3.1 Synthesis
    2.3.2 Profile developing
    2.3.3 Power

3 Proposed design
  3.1 Arithmetic functions
    3.1.1 Addition
    3.1.2 Multiplication
  3.2 Conversion
    3.2.1 Forward conversion
    3.2.2 Reverse conversion
  3.3 Choosing a moduli-set
    3.3.1 Modulus for comparison

4 Implementation
  4.1 RNS addition
    4.1.1 LUT and binary adders
    4.1.3 End-around carry parallel-prefix adder
    4.1.4 Parallel-prefix adder using the diminished-one number representation for modulo 2^n + 1
    4.1.5 Addition using Verilog's built-in modulo operator
    4.1.6 Ordinary addition for modulo 2^n
  4.2 Multiplication
    4.2.0 LUT based multiplication
    4.2.1 Modulo-m product-partitioning multiplier with ROM
    4.2.2 Parallel-prefix multiplier for modulo 2^n - 1
    4.2.3 Parallel-prefix multiplier for modulo 2^n + 1
    4.2.4 Modular multiplication using the isomorphic technique
    4.2.5 High radix modulo 2^n - 1 multiplier
    4.2.6 Using Verilog's built-in operators
    4.2.7 Ordinary multiplication for modulo 2^n
  4.3 Forward conversion
    4.3.1 RNS adder tree
    4.3.2 Periodicity
    4.3.3 Forward conversion for modulo 2^n - 1
    4.3.4 Using Verilog's built-in modulo operator
    4.3.5 Forward conversion for modulo 2^n
  4.4 Reverse conversion
    4.4.1 CRT
  4.5 Choosing a moduli set
  4.6 FIR filter

5 Results
  5.1 Input data and coefficients
    5.1.1 Uniformly distributed data and coefficients
    5.1.2 Sawtooth data and ramp coefficients
    5.1.3 Realistic input data and FIR coefficients
    5.1.4 Different properties of the data and coefficients
  5.2 Adders and multipliers
    5.2.1 Adders
    5.2.2 Multipliers
  5.3 Moduli-set
  5.4 FIR filters
    5.4.1 Varying input word length
    5.4.2 Varying number of taps
    5.4.3 Folded FIR filter
  5.5 Maximum frequency

6 Discussion and conclusions
  6.1 Adders
  6.2 Multipliers
  6.3 FIR filters

Appendix
  A Modulus
  B Optimum moduli-sets
  C RNS adders results
  D RNS multiplier results

List of Figures

1.1 The basic principle of RNS

2.1 Flowchart for profile development

4.1 RNS addition using two binary adders
4.2 Hybrid version of RNS addition
4.3 Logic operators for the parallel-prefix adder
4.4 End-around carry prefix adder with Sklansky parallel-prefix structure
4.5 Adder based on the diminished-one number representation
4.6 Adder using Verilog's built-in operator
4.7 Modulo 2^n addition using a binary adder
4.8 Modulo-m product-partitioning multiplier with ROM
4.9 Multiplier based on parallel-prefix RNS adders
4.10 Multiplier based on the isomorphic technique
4.11 Modular high-radix RNS multiplier
4.12 Multiplier using Verilog's built-in operator
4.13 Multiplier based on a binary multiplier for modulo 2^n
4.14 Forward conversion with registers at input and output
4.15 Forward conversion using an RNS adder tree
4.16 Reverse conversion
4.17 Reverse conversion using CRT
4.18 Modulo-m product-partitioning multiplier with combinatorial logic instead of LUT. Changes from the ordinary RNS multiplier are shown in white.
4.19 Direct-form FIR filter
4.20 Transposed direct-form FIR filter
4.21 Folded FIR

5.1 Discrete uniform distributions for different number of bits
5.2 Sawtooth data and ramp coefficients
5.3 Histogram for realistic input data for a 20-bit FIR filter
5.4 Frequency response for some different FIR filter coefficients
5.5 Description of the RNS multiplier and adder graphs
5.6 Test setup for RNS adders and multipliers
5.7 Total power dissipation for all RNS adders using uniformly distributed input data as described in section 5.1.1
5.8 The best RNS adder for each modulo compared with RNS adders for modulo 2^n. Power dissipation was calculated using uniformly distributed input data as described in section 5.1.1.
5.9 RNS adders type 0 and 1
5.10 RNS adders type 2 and 3
5.11 RNS adders type 4 and 5
5.12 RNS adders type 6
5.13 All RNS multipliers
5.14 RNS multipliers type 0 and 1
5.15 RNS multipliers type 2 and 4
5.16 RNS multipliers type 5 and 6
5.17 RNS multipliers type 7 and TCS multiplier
5.18 Combinations of RNS multipliers with a maximum of 3, 5, 7, 9 and 11 RNS multipliers in the moduli-set compared with the TCS multiplier
5.19 Combinations of RNS adders with a maximum of 3, 5, 7, 9 and 11 RNS adders in the moduli-set compared with the RNS adder for modulo 2^n (which is almost identical to a TCS adder)
5.20 64-tap FIR filter with varying input bit width for RNS and TCS. Uniform data as described in section 5.1.1. The red line represents the power reduction.
5.21 16-tap FIR filter with varying input word length for RNS and TCS. Sawtooth data with ramp coefficients as described in section 5.1.2. The red line represents the power reduction.
5.22 20-bit FIR filter with varying number of taps for RNS and TCS. Uniformly distributed data and coefficients are used as described in section 5.1.1. The red line represents the power reduction.
5.23 20-bit FIR filter with varying number of taps for RNS and TCS. Realistic data with constant FIR coefficients are used as described in section 5.1.3. The red line represents the power reduction.

List of Tables

2.1 Example of signed and unsigned representations using the moduli-set {m_1, m_2} = {2, 3}, M = 6

3.1 Definition of different adder types
3.2 Definition of different RNS multiplier types
3.3 Definition of different RNS forward conversion types
3.4 Definition of different RNS reverse conversion types

4.1 Periodicity of some residues

5.1 Sign switching rate of input data
5.2 Theoretical toggle rate at the output of a 20-bit input multiplication. The optimum moduli-sets as presented in table 5.5 are used.
5.3 The best adder type for chosen modulo with respect to power; refer to table 3.1 for details about the adder types
5.4 The best multiplier type for chosen modulo with respect to power; refer to table 3.2 for details about the multiplier types
5.5 Some of the optimum moduli-sets and their resulting number of bits. For the complete list refer to Appendix B.
5.6 Results for an FIR filter folded 22 times with 20-bit input and 22 taps
5.7 Synthesis results for 4-tap FIR filters with 20 or 30 input bit-width. The synthesis maximum frequency goal was set to 1.5 GHz.

B.1 Resulting moduli-sets

C.1 Results for RNS adders

Nomenclature

ASIC   Application-specific integrated circuit
CRT    Chinese remainder theorem
DSP    Digital signal processing
FIR    Finite impulse response
LUT    Look-up table
RNS    Residue number system
ROM    Read-only memory
RTL    Register-transfer level
TCS    Two's complement number system
VHDL   Very high speed integrated circuit hardware description language
VLSI   Very large scale integration

Chapter 1

Introduction

Power dissipation has become one of the major limiting factors in the design of digital ASICs. Low power dissipation will increase the mobility of the ASIC by reducing the system cost, size and weight. DSP blocks are a major source of power dissipation in modern ASICs. The residue number system (RNS) has, for a long time, been proposed as an alternative to the regular two's complement number system (TCS) in DSP applications to reduce the power dissipation. Some research has shown that implementing FIR filters in the residue number system (RNS) instead of the two's complement number system (TCS) can give a reduction in power dissipation. FIR filters are among the less complex DSP blocks. A general sketch of how RNS computations can be performed is shown in figure 1.1. The earliest usage of the residue number system can be found in The Mathematical Classic of Sun Tzu by the Chinese mathematician Sun Tzu, who lived in the 3rd century AD. A famous riddle from his book [1] is quoted below.

Now there are an unknown number of things. If we count by threes, there is a remainder of 2. If we count by fives there is a remainder 3. If we count by sevens, there is a remainder 2. Find the number of things.

Figure 1.1: The basic principle of RNS

1.1 Problem statement

The problem to be investigated in this thesis is to compare RNS with TCS. This will be done by implementing FIR filters in RNS and TCS and comparing the two implementations. The requirement on the RNS implementation is to minimize the power while still being able to run the circuit at 500 MHz and without a massive increase in area. Both the RNS and TCS implementations shall be able to receive and process one sample per clock cycle. Another design goal is to be able to process an input of around 20 bits. An important idea in the thesis is that RNS could in the future be implemented in large parts of the ASIC; the forward and reverse conversion would then not contribute as much as the computational operations to power dissipation and area, and therefore the implementation and results focus on an implementation without the conversion. The ASIC is intended for and implemented in a 32 nm technology. The aim of the thesis can be summarized by answering the following questions:

• Is RNS better than TCS with respect to power, area and timing?
• How can RNS be implemented and what different design choices can be made?
• What further extensions of RNS exist that can further improve its properties?

1.2 Methodology

The thesis work has been performed at Ericsson in Kista, Stockholm. The work has been executed in the following way:

1. Literature study

2. Implementation of adders and multipliers in RNS
3. Comparison of individual RNS adders and multipliers
4. Study of which RNS adders and multipliers to use and which combinations of them will result in the lowest power dissipation
5. Implementation of RNS and TCS FIR filters
6. Comparison of RNS and TCS FIR filters
7. Implementation of forward and reverse conversion
8. Comparison of different forward and reverse conversion techniques
9. Analysis of the different results

1.3 Prior work

The arithmetic of a residue number system and its application to digital signal processing and computer technology has earlier been described in [2], [3] and [4]. The use of RNS for reduction of power in FIR filters has earlier been discussed in for example [5], [6] and [7] with good results. Two promising results can be seen in figure 5 from [7] and figure 6 from [5], where the power is significantly lower with RNS compared to TCS.

In figure 5 from [7] we can see an RNS FIR filter with forward and reverse conversion, with 16-bit coefficients and a 32-bit dynamic range, compared with a TCS FIR filter designed with the same restrictions.

In figure 6 from [5] the dynamic and static power dissipation of an RNS FIR filter is compared with that of a TCS FIR filter. Both the RNS and TCS filters have 10-bit input and coefficients and a dynamic range of 20 bits. Note that neither in figure 5 from [7] nor in figure 6 from [5] do the authors take account of the increasing bit width in the accumulator due to the number of taps.

1.4 Outline

A brief introduction to the thesis is given in chapter 1. In chapter 2 the basic mathematical principles of RNS are presented. From these basic mathematical principles a set of different implementations of RNS is presented, and a subset of these forms the proposed design presented in chapter 3. The detailed implementation is presented in chapter 4 and the simulation results of the implementation are presented in chapter 5. The results are discussed in chapter 6. From the results and the discussion some conclusions can be drawn, which are presented in chapter 7 together with suggestions for future work in the subject.

1.5 Limitations

The aim of the thesis is to investigate RNS, hence individual TCS adders and multipliers will not be implemented (here the synthesis tool will decide which adders and multipliers to use). The focus of the thesis has been on RNS-specific algorithms and not on low power algorithms that are suitable for TCS or FIR filters in general. When implementing the individual RNS adders and multipliers the focus has been on the structure and not on the exact implementation of the ordinary binary adders and binary multipliers used in the implementation; again, this has been left to the synthesis tool in most cases. The major limitation of the thesis work is that it has a time budget of 20 weeks.

Chapter 2

Background

The basic concept of a residue number system (RNS) is to represent a large number with a set of smaller integers. In RNS some computations can be performed more efficiently. RNS originates from the Chinese remainder theorem (CRT) of modular arithmetic, which was first described by the Chinese third-century mathematician Sun Tzu [4]. The CRT can be used to solve his famous riddle quoted in chapter 1.

2.1 RNS arithmetic

RNS arithmetic is based on the mathematical congruence relation. Let a and b be integers. These integers are said to be congruent modulo m if a − b is exactly divisible by m. This is often in mathematical contexts written as a ≡ b (mod m). The number m is called a modulus or base.

Now let q be the quotient and r be the remainder from the division of the integer a by the modulus m, a = q · m + r. From the congruence definition above we then have a ≡ r (mod m). The integer r is the residue of a with respect to m, which will be denoted as r = |a|m. We shall assume

that r ∈ {0, 1, 2, ..., m − 1}, that is r lies in the set of least positive residues modulo m.

Now define a moduli-set {m_1, m_2, ..., m_N} that contains N positive and pairwise relatively prime moduli. That is, for every i and j where i ≠ j, the moduli m_i and m_j in the moduli-set have no common divisor larger than unity. Now M can be defined as the dynamic range of the RNS moduli-set. M can be computed as the product of the moduli-set according to equation (2.1),

\[ M = \prod_{n=1}^{N} m_n \qquad (2.1) \]

For every moduli-set a number X < M has a unique representation consisting of the N residues. This representation can be calculated as {x_i = |X|_{m_i} : 1 ≤ i ≤ N}. We shall write such a representation as ⟨x_1, x_2, ..., x_N⟩.

Example 1. Take the moduli-set {3, 5, 7}, so m_1 = 3, m_2 = 5 and m_3 = 7. The dynamic range of the moduli-set is

\[ M = \prod_{n=1}^{3} m_n = m_1 \cdot m_2 \cdot m_3 = 3 \cdot 5 \cdot 7 = 105. \]

Now let X = 10. Then ⟨x_1, x_2, x_3⟩ can be calculated as follows:

\[ x_1 = |X|_{m_1} = |10|_3 = |3 \cdot 3 + 1|_3 = |3 \cdot 3|_3 + |1|_3 = 1 \]
\[ x_2 = |X|_{m_2} = |10|_5 = |2 \cdot 5|_5 = 0 \]
\[ x_3 = |X|_{m_3} = |10|_7 = |1 \cdot 7 + 3|_7 = |1 \cdot 7|_7 + |3|_7 = 3. \]

So X = 10 can be represented as ⟨1, 0, 3⟩ in the RNS moduli-set {3, 5, 7}.

A residue number system can be used to represent both signed and unsigned numbers. For unsigned numbers, RNS can represent numbers in the range 0 ≤ X ≤ M − 1. For signed numbers, RNS can represent numbers that satisfy one of the following relations:

\[ -\frac{M-1}{2} \leq X \leq \frac{M-1}{2} \quad \text{if } M \text{ is odd}, \qquad -\frac{M}{2} \leq X \leq \frac{M}{2} - 1 \quad \text{if } M \text{ is even.} \]

See table 2.1 for an example of RNS representation for signed and unsigned numbers.

⟨x_1, x_2⟩   Unsigned   Signed
⟨0, 0⟩       0          0
⟨1, 1⟩       1          1
⟨0, 2⟩       2          2
⟨1, 0⟩       3          −3
⟨0, 1⟩       4          −2
⟨1, 2⟩       5          −1

Table 2.1: Example of signed and unsigned representations using the moduli-set {m_1, m_2} = {2, 3}, M = 6.
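As a small illustration of table 2.1, the following Python sketch (a hypothetical helper, not code from the thesis) computes the residue representation; a signed value X and its unsigned counterpart X + M map to the same residues.

    def to_rns(x, moduli):
        # residue representation <x_1, ..., x_N> of the integer x
        return tuple(x % m for m in moduli)

    moduli = [2, 3]                    # M = 6, as in table 2.1
    # The signed value -3 and the unsigned value 3 share the tuple <1, 0>.
    assert to_rns(-3, moduli) == to_rns(3, moduli) == (1, 0)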

2.1.1 Basic arithmetic operations

Addition, subtraction and multiplication are quite straightforward to calculate in RNS. Division, sign determination, overflow detection and magnitude comparison are significantly harder to implement. For addition, subtraction and multiplication the only difference from ordinary TCS operations is that the result has to be in the range [0 : m − 1]. Addition X + Y = Z can be calculated as

\[ X + Y = \langle x_1, x_2, ..., x_N \rangle + \langle y_1, y_2, ..., y_N \rangle = \langle z_1, z_2, ..., z_N \rangle = Z, \quad z_i = |x_i + y_i|_{m_i}. \]

Multiplication X · Y = Z can be calculated in a similar fashion,

\[ X \cdot Y = \langle x_1, x_2, ..., x_N \rangle \cdot \langle y_1, y_2, ..., y_N \rangle = \langle z_1, z_2, ..., z_N \rangle = Z, \quad z_i = |x_i \cdot y_i|_{m_i}. \]

Note the difference between addition and multiplication: for addition x_i + y_i ≤ 2(m_i − 1), while for multiplication x_i · y_i ≤ (m_i − 1)², which means that the reduction required to get a result in the range [0 : m − 1] can be much greater for multiplication. This fact leads to a more complex implementation of RNS multipliers compared to RNS adders.
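The channel-wise nature of these operations can be sketched in a few lines of Python (illustrative only; the function names are not from the thesis):

    def rns_add(xr, yr, moduli):
        # z_i = |x_i + y_i|_{m_i}, computed independently in each modulo channel
        return [(x + y) % m for x, y, m in zip(xr, yr, moduli)]

    def rns_mul(xr, yr, moduli):
        # z_i = |x_i * y_i|_{m_i}
        return [(x * y) % m for x, y, m in zip(xr, yr, moduli)]

    moduli = [3, 5, 7]
    x, y = [10 % m for m in moduli], [7 % m for m in moduli]
    assert rns_mul(x, y, moduli) == [70 % m for m in moduli]   # 10 * 7 = 70 < M = 105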

2.1.2 Conversion

The goal with the forward and reverse conversion is to convert a number represented in TCS into RNS, and RNS into TCS.

Forward conversion

Conversion from TCS to RNS can in a straightforward way be computed using division, where the remainder of the division will be the residue.

Reverse conversion

Reverse conversion is described from an implementation perspective in chapter 4.

2.1.3 Choosing a moduli-set

There are in general two types of moduli: arbitrary and special. The special moduli are usually referred to as the ones used in a special moduli-set, {2^n − 1, 2^n, 2^n + 1}, or extensions of it. The arbitrary moduli are the remaining integers, including the primes.

In this thesis it will be assumed that the arbitrary sets consist only of primes, due to the fact that completely arbitrary moduli are not guaranteed to be relatively prime. The special sets are designed to be more hardware efficient and are only guaranteed to be relatively prime. Using only prime moduli probably gives the best moduli-set from a purely mathematical view [8], but the special sets might have other advantages. This means that the desired moduli for comparison are the primes and those fulfilling the requirements of a special set.


Special moduli-sets

The most common special moduli-set is {2^n − 1, 2^n, 2^n + 1} and extensions of it [4]. The use of this moduli-set is often motivated by the less complicated implementation of RNS to TCS converters and the fact that dedicated hardware multipliers can be used on FPGA platforms [9]. A common extension is to add 2^{n±q} ± 1, where q ≥ 1, to the moduli-set.

2.2 FIR filters

Finite-duration impulse response (FIR) filters are probably the most commonly used digital filters. An FIR filter is based on the mathematical concept of discrete convolution, where the filtered output of a signal can be calculated using equation (2.2) [10],

\[ y[n] = \sum_{i=0}^{N} h[i] \, x[n-i] \qquad (2.2) \]

In equation (2.2) y[n] is the output, x[n] is the input and h[i] are the coefficients. N is defined as the order of the filter and the filter will have N + 1 taps.
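Equation (2.2) translates directly into code; the short Python sketch below (not part of the thesis) evaluates the convolution for a finite input sequence, treating samples outside the sequence as zero.

    def fir(x, h):
        # y[n] = sum_{i=0}^{N} h[i] * x[n - i], with N + 1 = len(h) taps
        return [sum(h[i] * x[n - i] for i in range(len(h)) if 0 <= n - i < len(x))
                for n in range(len(x))]

    # The impulse response of the filter equals its coefficients.
    assert fir([1, 0, 0, 0], [2, 3, 4]) == [2, 3, 4, 0]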

2.3 Design flow

Each implementation was performed in the way presented below. If an error occurred at any step the process was restarted from step 1.

1. Implement

2. Simulate and verify against the TCS result
3. Synthesize
4. Analyze the synthesis and develop a profile in terms of area, power and delay

Figure 2.1: Flowchart for profile development

2.3.1 Synthesis

The RTL code has been synthesized using Synopsys Design Compiler. During synthesis (for all designs, both RNS and TCS) some optimizations will be done by the synthesis tool. The synthesis tool will try to minimize the power while still fulfilling the required critical path [11].

2.3.2 Profile developing

The design flow for developing a profile in terms of area, delay and power is shown in figure 2.1. The sources of the area, delay, power and other interesting parameters are presented below:

Synthesis reports: Area (cell library specific), gate count, UVT cell ratio (see section 2.3.3), etc.

Power reports: Power dissipation (leakage power, switching power and internal power), delay, critical path, etc.

2.3.3 Power

The power calculations are made in the Power Calculation block in figure 2.1. The power dissipation can be divided into dynamic and static power dissipation. Dynamic power dissipation consists of switching and internal power dissipation. The power dissipation reports that are generated from PrimeTime¹ are described in [13]. Note that both static and dynamic power in the equations below also scale with the size of the design.

• Static power
  – Leakage power: P_l = V · I_leak
• Dynamic power
  – Switching power: P_s = ½ · C_load · V² · f
  – Internal power: P_int = (½ · C_int · V² · f) + (V · I_shortcut)
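As a numerical illustration of the switching-power formula (with made-up values, not numbers from the thesis), a 10 fF net toggling at 500 MHz from a 0.9 V supply dissipates roughly 2 µW:

    # Hypothetical values, purely to illustrate P_s = 1/2 * C_load * V^2 * f
    C_load, V, f = 10e-15, 0.9, 500e6        # 10 fF, 0.9 V, 500 MHz toggle rate
    P_s = 0.5 * C_load * V**2 * f            # ~2.0e-6 W, i.e. about 2 uW per net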

Different standard cells

Depending on which standard cell the synthesis tool chooses, the leakage power consumption will be different. A higher V_T results in smaller leakage. The synthesis tool can choose between the following standard cell types (sorted in decreasing V_T):

UVT  Ultra-high V_T
SVT  Super-high V_T
MVT  Mezzanine V_T
HVT  High V_T

¹The Synopsys PrimeTime suite provides a single, golden, trusted signoff solution for timing, signal integrity and power analysis.

Chapter 3

Proposed design

3.1 Arithmetic functions

The basic arithmetic functions of an FIR filter are addition and multiplication. These operations can be implemented in many different ways in the residue number system. The basic complication with RNS is to deal with the modulo overflow that occurs when the result is bigger than the modulo. For a modulo m_i the result of the operations always has to be within the range {0, ..., m_i − 1}. For addition the result will be in the range {0, ..., 2(m_i − 1)}, and therefore at most one subtraction of m_i has to be performed to get into the correct range. For multiplication, on the other hand, the product will be in the range {0, ..., (m_i − 1)²}, which complicates the reduction.

To find out which algorithms for addition and multiplication are the best in terms of power dissipation, simulations will be made on individual adders and multipliers for all chosen moduli.

3.1.1 Addition

Three basic approaches for designing adders for arbitrary moduli are presented in [14]: using a LUT, using two ordinary binary adders, and a hybrid between these two. Each one of these three will be optimal in terms of area and timing for certain moduli [14].

An interesting approach for implementing addition in the special moduli-set {2^n − 1, 2^n, 2^n + 1} by using a parallel-prefix adder is presented in [4]. A more detailed description is available in [15]. Due to the low level of this approach, [16] can be used as an initial implementation idea.

The Verilog language and the synthesis tool have support for the built-in Verilog operators "+" (addition) and "%" (modulus). An implementation with only these operators will be a good naive reference when comparing with the other implementations. Also, for modulo 2^n the trivial implementation with an ordinary binary adder will be used. The different adder types that have been selected for implementation are summarized in table 3.1.

Type  Description
0     Look-up table (LUT) based RNS adder
1     Two binary adders
2     A hybrid between 0 and 1
3     Modulo 2^n − 1 using a modified parallel-prefix adder
4     Modulo 2^n + 1 using the diminished-one number representation
5     Using Verilog's built-in operators "+" and "%"
6     Ordinary adder for modulo 2^n

Table 3.1: Definition of different adder types

3.1.2 Multiplication

RNS multiplication can be implemented in a huge variety of ways. A promising implementation is presented in [17], which is a modulo-m product-partitioning multiplier with ROM. This implementation seems more promising than multiplication by the reciprocal of the modulus as described in [18], since the implementation in [18] uses three multipliers instead of two.

For the special set {2^n − 1, 2^n, 2^n + 1} some improvements in terms of area, power and delay can be made. A parallel modulo-m multiplier for 2^n ± 1 is presented in [4] without any special speed-up techniques. This implementation might be interesting especially for relatively small n. An implementation for 2^n ± 1 using Booth-8 encoding is presented in [19]. This approach is compared with other implementations with good results for n ≥ 32, though this can be extrapolated to give good results at lower n as well. If this is not the case a Booth-4 encoding could be used. The Booth encoding technique is well known in other contexts than RNS and will therefore not be investigated further in this thesis.

Another interesting approach is [20]. In [5] an isomorphic technique is used to replace multiplication with addition and a look-up table. This implementation would be very interesting.

The different RNS multipliers that have been selected for implementation are presented in table 3.2.

3.2 Conversion

As with the RNS adders and multipliers, several forward and reverse conversion algorithms should be investigated.

Type  Description
0     Look-up table (LUT) based RNS multiplier
1     Modulo-m product-partitioning multiplier with ROM
2     Parallel modulo-m multiplier for 2^n − 1
3     Parallel modulo-m multiplier for 2^n + 1
4     Isomorphism technique as described in [21]
5     High radix multiplier for modulo 2^n − 1 [20]
6     Using Verilog's built-in operators "+" and "%"
7     Ordinary multiplication for modulo 2^n

Table 3.2: Definition of different RNS multiplier types

3.2.1 Forward conversion

Forward conversion is generally far less complicated to implement than reverse conversion. Even though the residue number system needs to be able to represent a certain bit width, the input is mostly represented with a much smaller bit width, which of course reduces the complexity. The general way of solving the forward conversion problem uses the fact that a TCS number can be written in the following well known manner: −a_{n−1} 2^{n−1} + Σ_{i=0}^{n−2} a_i 2^i. The most straightforward solution is to calculate the sum of the a_i 2^i terms using RNS adders instead of TCS adders. By slightly modifying the solution on page 64 in [4] it can support negative numbers as well.

A modification of this algorithm is to use the periodic properties of the modulus. The periodic properties can be derived by calculating the residue of each 2^i mod m.

A look-up table based solution is also possible, but it would have to contain all possible input combinations of n_input bits corresponding to RNS values of n_rns ≥ n_input bits. Due to this fact, this solution can be excluded from further investigation.

In [22] a modular exponentiation algorithm is proposed that seems promising. Unfortunately it is very complex and therefore very difficult to implement in a parametrized way for arbitrary modulo and input bit width.

Several other sequential algorithms have been proposed in [4], but these will not produce one result per clock cycle and are therefore not investigated further.

Type  Description
0     RNS adder tree
1     RNS adder tree with periodicity
2     Forward conversion for the special moduli-set
3     Using SystemVerilog's built-in operators

Table 3.3: Definition of different RNS forward conversion types

3.2.2 Reverse conversion

Reverse conversion is the conversion process from RNS to TCS. The main methods for implementing the reverse conversion are the Chinese Remainder Theorem (CRT) and the Mixed-Radix Conversion (MRC) technique. All other techniques are variants of these two [4]. Among these, CRT is the most straightforward solution. MRC utilizes mixed-radix techniques and would require far more investigation. Other implementations involve using pseudo-SRT division (simply a modification of a division algorithm so that it only produces the remainder) or the core function (as described in [4]). Another interesting implementation would be to use a LUT; unfortunately the resulting LUT would be larger than what a synthesis tool would support. The resulting reverse converter to be implemented is presented in table 3.4.

Type Description

1 Using CRT

Table 3.4: Definition of different RNS reverse conversion types

3.3 Choosing a moduli-set

Previous research [8], [5], [7] has shown that a significant amount of the power dissipation will still take place in the regular computations and not in the forward or reverse conversion when the number of taps in an FIR filter is large. Therefore the initial guess of which moduli-sets to choose was made by comparing the power dissipation of a simple one-tap FIR filter element without conversion. These simple components were designed in various ways and then an optimal (or near optimal) combination was calculated. There are basically two groups of moduli-sets: arbitrary and special sets, as described in section 2.1.3.

3.3.1 Modulus for comparison

Since the basic idea with RNS is to choose several small numbers to represent a big number, it will be advantageous to choose these numbers quite small (but not necessarily as small as possible). A requirement is that the RNS FIR filter will be able to compute inputs that are 20 bits wide, and due to the multiplication the incoming word length has to be extended to 40 bits. Therefore the moduli for the comparison described above will be chosen as follows:

• All primes between 2 and 251
• All numbers of the form 2^n or 2^n ± 1 where n ≤ 14 (to get a dynamic range of 2^40 with the moduli-set {2^n − 1, 2^n, 2^n + 1})
• For each n ≤ 14, the closest prime smaller than 2^n. If these turn out to be optimal, possibly more similar primes will be added.

Note that these sets intersect and no modulo shall be tested twice. These rules result in the set of integers presented in Appendix A.

Chapter 4

Implementation

The main implementation philosophy has been to use parametrized modules and functions. The implementation has been done at RTL level in SystemVerilog [23]; therefore it cannot be guaranteed (and is in fact unlikely) that the synthesis tool (as described in section 2.3.1) maps the RTL code directly to the hardware structure described by it. It has of course been verified that the functionality is consistent.

The parametrization of the RTL code makes the implementation of RNS easily adaptable to new DSP algorithms, scalable in terms of the number of taps and bit widths, and easily modifiable for new algorithms for, for example, adders and multipliers.

4.1 RNS addition

The main issue with RNS addition is that the sum has to be within the range [0, m_i − 1]. The corresponding binary adder would produce a sum in [0, 2(m_i − 1)], and therefore at most one modulo reduction by m_i is required.

4.1.1 LUT and binary adders

The most direct approach to implement RNS addition is to use a look-up table (LUT), two binary adders, or a combination of these.

LUT

The LUT RNS adder implementation is a straightforward ROM storing the modulo sum for every pair of inputs.

Two binary adders

By using one binary adder for the addition and the other adder for the subtraction in the modulo reduction and for modulo overflow detection, a quite neat RNS adder, shown in figure 4.1, was implemented.

Figure 4.1: RNS addition using two binary adders
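Behaviorally, the adder in figure 4.1 computes the sum and the sum minus m in parallel and selects the result based on the sign of the subtraction. A Python model of this (a sketch, not the RTL) is:

    def rns_add_two_adders(a, b, m):
        s = a + b                    # first binary adder
        d = s - m                    # second binary adder performs the modulo reduction
        return s if d < 0 else d     # the sign of d selects the correct result

    assert rns_add_two_adders(4, 3, 5) == 2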

Hybrid

The hybrid RNS adder consists of one adder connected to a LUT. The LUT stores the resulting residue for each sum of the adder.


Figure 4.2: Hybrid version of RNS addition

4.1.3 End-around carry parallel-prefix adder

The end-around carry parallel-prefix adder is designed to work only for modulo 2^n − 1, where the advantage is that, by using the end-around carry, it uses approximately the same hardware as an ordinary parallel-prefix adder.

The parallel-prefix adder was implemented by translating the RNS adder in [16] from VHDL to SystemVerilog. It uses a Sklansky parallel-prefix structure with an end-around carry. The adder uses different logic operators as shown in figure 4.3. The exact behavior of the logic operators is described in equation (4.1).

Figure 4.3: Logic operators for the parallel-prefix adder

\[
\begin{aligned}
\text{pass node:} \quad & G^l_{i:k} = G^{l-1}_{i:k}, \quad P^l_{i:k} = P^{l-1}_{i:k} \\
\text{prefix node:} \quad & G^l_{i:k} = G^{l-1}_{i:j+1} \vee \left(G^{l-1}_{j:k} \wedge P^{l-1}_{i:j+1}\right), \quad P^l_{i:k} = P^{l-1}_{j:k} \wedge P^{l-1}_{i:j+1} \\
\text{input node:} \quad & g_i = \begin{cases} (a_0 \wedge b_0) \vee (a_0 \wedge c_0) \vee (b_0 \wedge c_0) & \text{if } i = 0 \\ a_i \wedge b_i & \text{otherwise} \end{cases}, \quad p_i = a_i \oplus b_i \\
\text{output node:} \quad & c_{i+1} = G^{m}_{i:0}, \quad s_i = p_i \oplus c_i
\end{aligned}
\qquad (4.1)
\]

In equation (4.1), i is the bit position with i = 0, ..., n_bits − 1, and l is the level in the prefix structure with l = 1, ..., m, where m is the total required depth of the prefix structure (which can be calculated as ⌈log2(n_bits)⌉), and 0 ≤ k ≤ j ≤ i (for more details see [15]). An 8-bit example of the parallel-prefix adder can be seen in figure 4.4.


Figure 4.4: End-around carry prefix adder with Sklansky parallel-prefix structure
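The essential behavior of the end-around carry can be modeled without the prefix network: the carry-out of the n-bit addition is fed back into the least significant position. A minimal Python sketch (behavioral only, ignoring the prefix structure and the double representation of zero as all ones):

    def add_mod_2n_minus_1(x, y, n):
        # n-bit addition where the carry-out re-enters at the LSB (end-around carry)
        t = x + y
        return (t & ((1 << n) - 1)) + (t >> n)

    assert add_mod_2n_minus_1(200, 100, 8) == 300 % 255    # modulo 2^8 - 1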

4.1.4 Parallel-prefix adder using the diminished-one number representation for modulo 2^n + 1

By using the fact that numbers modulo 2^n + 1 can almost be represented with n bits, a diminished-one number representation can be implemented. In the diminished-one representation n bits represent the number and an (n + 1)-th bit is used to identify a zero. Hence an ordinary number X can be represented as X̂ in the diminished-one representation, as presented in equation (4.2),

\[
\begin{aligned}
X = 0 &: \quad \hat{X}[n] = 1 \\
X \neq 0 &: \quad \hat{X}[n] = 0, \quad \hat{X}[n-1:0] = X - 1.
\end{aligned}
\qquad (4.2)
\]

The advantage with this adder is that the parallel-prefix structure used in the modulo 2^n − 1 adder in section 4.1.3 can be reused, except for the small change that the end-around carry is inverted. Some forward and reverse conversion is also needed, which is shown in figure 4.5. The blocks used in figure 4.5 are the same as those used in the adder for modulo 2^n − 1 and are described in equation (4.1).

Figure 4.5: Adder based on the diminished-one number representation
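A behavioral Python sketch of the diminished-one arithmetic (an illustration under the conventions of equation (4.2); the helper names are hypothetical): the zero flag is handled separately and the end-around carry is inverted, i.e. one is added only when no carry-out occurs.

    def dim1_encode(x, n):
        # (zero flag, n-bit value) per equation (4.2)
        return (1, 0) if x == 0 else (0, x - 1)

    def dim1_decode(xhat):
        zero, value = xhat
        return 0 if zero else value + 1

    def dim1_add(a, b, n):
        mask = (1 << n) - 1
        if a[0]: return b                      # a represents zero
        if b[0]: return a                      # b represents zero
        t = a[1] + b[1]
        if t == mask:                          # sum is congruent to 0 modulo 2^n + 1
            return (1, 0)
        carry = t >> n
        return (0, (t + (1 - carry)) & mask)   # inverted end-around carry

    n = 3                                      # modulo 2^3 + 1 = 9
    s = dim1_add(dim1_encode(5, n), dim1_encode(7, n), n)
    assert dim1_decode(s) == (5 + 7) % 9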

4.1.5 Addition using Verilog's built-in modulo operator

Addition using Verilog’s built-in modulo operator can be performed by using the %-sign and then letting the synthesis tool decide what to do with it. The

(35)

4.2. MULTIPLICATION CHAPTER 4. IMPLEMENTATION

implementation will look as figure 4.6 and can be expressed as

assign output sum = ( i n p u t a + i n p u t b ) % m o d u l o p a r a m e t e r ;

a

sum % b

Figure 4.6: Adder using Verilog’s built-in operator

4.1.6 Ordinary addition for modulo 2^n

The easiest and most efficient implementation of an RNS addition is the one for modulo 2^n, as it only requires an ordinary binary adder where the resulting carry-out is neglected. The implementation will look like figure 4.7, or:

    assign output_sum = input_a + input_b;

Figure 4.7: Modulo 2^n addition using a binary adder

4.2 Multiplication

RNS multiplication has the same requirement as RNS addition in that the product has to be within the range [0, m_i − 1], but unfortunately the product of an ordinary binary multiplier will be within the range [0, (m_i − 1)²], so the number of modulo reductions by m_i would instead be up to m_i − 2 (instead of one for the RNS adder), which would increase the complexity dramatically with increasing modulo.

4.2.0 LUT based multiplication

The look-up table based RNS multiplication uses the two operands as addresses to a two-dimensional look-up table where the product is stored.

4.2.1 Modulo-m product-partitioning multiplier with ROM

A modulo-m product-partitioning multiplier with ROM is presented in [17] and [4] for arbitrary moduli. This multiplier is based on the fact that the product P = AB can be expressed as in equation (4.3), where AB is partitioned into four parts: W, k + 1 bits; Z, n − (k + 1) bits; Y, 1 bit; and X, n − 1 bits,

\[ P = AB = 2^{2n-(k+1)}W + 2^{n}Z + 2^{n-1}Y + X. \qquad (4.3) \]

Here n is the number of bits, c = 2^n − m, k = 1 + ⌊log2 c⌋ and m is the modulo. By ensuring that the product is within the range of the moduli, and using that 2^n = m + c implies |2^n|_m = c, equation (4.3) can be rewritten as equation (4.4),

\[ |AB|_m = \left| 2^{2n-(k+1)}W + 2^{n}Z + 2^{n-1}Y + X \right|_m = \left| \left| 2^{2n-(k+1)}W + 2^{n-1}Y \right|_m + cZ + X \right|_m. \qquad (4.4) \]

Since k will be relatively small and Y consists of only one bit, e ≜ |2^{2n-(k+1)}W + 2^{n-1}Y|_m can be pre-calculated for each value of W and Y and stored in a ROM. Due to the number of bits used to store e, cZ and X, the result will be in the range 0 ≤ e + cZ + X < 2m. It is slightly better to instead store e − m in the ROM [4] and detect whether the result became negative, in which case m is added. The resulting RNS multiplier can be seen in figure 4.8.

Figure 4.8: Modulo-m product-partitioning multiplier with ROM
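To make the partitioning concrete, the following Python model (a sketch of equations (4.3) and (4.4), not the RTL) slices the 2n-bit product, looks up e, and reduces; for simplicity the final correction is done with a plain modulo instead of the conditional ±m stage in the figure.

    def pp_mult(a, b, m, n):
        # behavioural model of eq. (4.3)/(4.4); assumes 2^(n-1) < m < 2^n
        c = (1 << n) - m
        k = c.bit_length()                        # k = 1 + floor(log2(c))
        p = a * b                                 # 2n-bit product
        x = p & ((1 << (n - 1)) - 1)              # X: the n-1 low bits
        y = (p >> (n - 1)) & 1                    # Y: one bit
        z = (p >> n) & ((1 << (n - 1 - k)) - 1)   # Z: n-(k+1) bits
        w = p >> (2 * n - (k + 1))                # W: the k+1 high bits
        e = ((w << (2 * n - (k + 1))) + (y << (n - 1))) % m    # ROM contents
        return (e + c * z + x) % m                # final reduction stage

    assert pp_mult(11, 12, 13, 4) == (11 * 12) % 13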

4.2.2 Parallel-prefix multiplier for modulo 2^n − 1

By reusing the parallel-prefix adder from section 4.1.3 and connecting partial products to it, an implementation of a parallel-prefix multiplier for modulo 2^n − 1 can be achieved. The entire multiplication can be rewritten as equation (4.5); note that due to the properties of RNS, PP_i will always be n bits wide,

\[ |X \cdot Y|_{2^n-1} = \sum_{i=0}^{n-1} PP_i \quad \text{where} \quad PP_i = x_i \wedge (y_{n-i-1} \ldots y_0 \, y_{n-1} \ldots y_{n-i}). \qquad (4.5) \]

In figure 4.9 the schematic of the parallel-prefix multiplier for modulo 2^n − 1 is shown. Note that more optimal adder tree structures can probably be used.

Figure 4.9: Multiplier based on parallel-prefix RNS adders
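Since |2^i · Y|_{2^n − 1} is a cyclic left rotation of Y by i positions, each partial product PP_i in equation (4.5) is either zero or a rotated copy of Y. A small Python check of this property (an illustrative sketch, not the RTL):

    def rol(y, i, n):
        # cyclic left rotation of an n-bit value: |2^i * y|_{2^n - 1} for y < 2^n - 1
        mask = (1 << n) - 1
        return ((y << i) | (y >> (n - i))) & mask if i else y

    def mult_mod_2n_minus_1(x, y, n):
        m = (1 << n) - 1
        s = 0
        for i in range(n):
            if (x >> i) & 1:                 # PP_i = x_i AND rol(y, i)
                s = (s + rol(y, i, n)) % m   # summed with modulo-(2^n - 1) additions
        return s

    assert mult_mod_2n_minus_1(11, 12, 4) == (11 * 12) % 15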

4.2.3 Parallel-prefix multiplier for modulo 2^n + 1

The parallel-prefix multiplier for modulo 2^n + 1 may be implemented using the diminished-one representation to remove the extra bit required compared to modulo 2^n. This implementation would require a diminished-one adder, but due to the poor results of this adder (as can be seen in figure 5.7) this multiplier has not been implemented.

4.2.4 Modular multiplication using the isomorphic technique

This technique has earlier been used in [5] and [8]. The basic principle of the isomorphic technique is described in [21] and can be summarized as in equation (4.6). When m is a prime there exists a q that fulfills the equation, which means that a multiplicand n_i can instead be represented by w_i,

\[ n_i = \left| q^{w_i} \right|_m \quad \text{with} \quad n_i \in [1, m-1], \; w_i \in [0, m-2]. \qquad (4.6) \]

For the specific case of a two-input modular multiplier, i ∈ [1, 2], we get equation (4.7),

\[ |a_1 \cdot a_2|_m = \left| q^{w} \right|_m \quad \text{where} \quad w = |w_1 + w_2|_{m-1} \quad \text{and} \quad a_1 = \left| q^{w_1} \right|_m, \; a_2 = \left| q^{w_2} \right|_m. \qquad (4.7) \]

A direct implementation of equations (4.6) and (4.7) can be made using two different look-up tables, each storing m − 1 entries, and an RNS modulo-(m − 1) adder. A schematic of this implementation can be seen in figure 4.10. Due to the fact that zero cannot be represented by n_i = |q^{w_i}|_m, this has to be taken care of, which is done by simple zero-detection logic.


Figure 4.10: Multiplier based on the isomorphic technique
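The isomorphism can be illustrated with index (log) and power (antilog) tables built by brute force for a prime modulus (a Python sketch with hypothetical helper names; the hardware stores the tables in LUTs and adds the indices with a modulo-(m − 1) adder):

    def build_tables(m):
        # find a q (primitive root of the prime m) such that n = |q^w|_m covers [1, m-1]
        for q in range(2, m):
            powers = [pow(q, w, m) for w in range(m - 1)]
            if len(set(powers)) == m - 1:
                log = {n: w for w, n in enumerate(powers)}
                return powers, log
        raise ValueError("m must be prime")

    def iso_mult(a, b, m, tables):
        powers, log = tables
        if a == 0 or b == 0:                # zero cannot be represented, handle separately
            return 0
        return powers[(log[a] + log[b]) % (m - 1)]    # eq. (4.7)

    tables = build_tables(13)
    assert iso_mult(7, 11, 13, tables) == (7 * 11) % 13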

4.2.5 High radix modulo 2^n − 1 multiplier

The high radix modular RNS multiplier for modulo 2^n − 1 is based on a multiplier suggested in [4]. In [20] another multiplier is suggested that will only work for moduli where (n − 1)/k = 4, where k is an integer and n − 1 is the number of bits required to represent a number modulo 2^n − 1.

This multiplier is based on the fact that a multiplication A · B can be rewritten as a sum of partial products. First divide A and B into two k-bit numbers, where k = ⌊(⌈log2(2^n − 1)⌉ + 1)/2⌋, so that A = A_1 2^k + A_0 and B = B_1 2^k + B_0. The product A · B can now be rewritten as equation (4.8) by using cyclic convolution,

\[ A \cdot B = (A_1 2^k + A_0) \cdot (B_1 2^k + B_0) = 2^k (A_1 B_0 + A_0 B_1) + (A_1 B_1 + A_0 B_0) = 2^k P_1 + P_0. \qquad (4.8) \]

This can be extended to |A · B|_{2^n−1} = |2^k P_1 + P_0|_{2^n−1} for modulo 2^n − 1. P_0 and P_1 can also be expressed as

\[ P_0 = \frac{a^2 - b^2 - c^2 + d^2}{8}, \qquad P_1 = \frac{a^2 + b^2 - c^2 - d^2}{8}, \]

where

\[ a = A_0 + A_1 + B_0 + B_1, \quad b = A_0 - A_1 - B_0 + B_1, \quad c = A_0 + A_1 - B_0 - B_1, \quad d = A_0 - A_1 + B_0 - B_1. \]

By combining these equations a schematic can be derived, as shown in figure 4.11.

Figure 4.11: Modular high-radix RNS multiplier

4.2.6 Using Verilog's built-in operators

Multiplication using Verilog’s built-in modulo operator can be performed by using the %-sign and then letting the synthesis tool decide what to do with it. The implementation will look as figure 4.12 and can be expressed as

assign o u t p u t p r o d u c t = ( i n p u t a ∗ i n p u t b ) % m o d u l o p a r a m e t e r ;

a

product % b

Figure 4.12: Multiplier using Verilog's built-in operator

4.2.7 Ordinary multiplication for modulo 2^n

The easiest and most efficient implementation of an RNS multiplication is the one for modulo 2^n, as it only requires an ordinary binary multiplier where the most-significant half of the product is neglected. The implementation will look like figure 4.13 and can be expressed as

    assign output_product = input_a * input_b;

Figure 4.13: Multiplier based on a binary multiplier for modulo 2^n

4.3 Forward conversion

Forward conversion is the translation process from TCS to RNS. Since the TCS bit width is usually smallest at the input, the complexity of the forward conversion will be less than the complexity of the reverse conversion. Due to this smaller bit width no pipelining¹ is required to fulfill the timing goal of 500 MHz, and therefore only registers at the input and output of the forward conversion are considered in the design, as can be seen in figure 4.14.

Figure 4.14: Forward conversion with registers at input and output

¹Pipelining is a process where registers are inserted in the critical path to increase the maximum operating frequency.

4.3.1 RNS adder tree

The most straightforward solution for the forward conversion is to use the fact that a TCS number can be represented as −a_{n−1} 2^{n−1} + Σ_{i=0}^{n−2} a_i 2^i. The RNS representation of a number in TCS can be derived by first converting each operand in the summation to RNS (using a LUT) and then calculating each individual addition with RNS adders. This results in an RNS adder tree.

The RNS adder tree has parametrized input bit width and modulo. The entire tree will scale with this parameter, as seen in figure 4.15. The number of levels, n_levels, in the RNS adder tree can be calculated as n_levels = ⌈log2(n)⌉, where n is the number of TCS input bits. At each level l there will be w^in_l = ⌈n/2^l⌉ input wires and w^out_l = ⌈w^in_l/2⌉ output wires, which results in n_adders = ⌊w^in_l/2⌋ adders at that level.

Figure 4.15: Forward conversion using an RNS adder tree

4.3.2 Periodicity

The periodicity of a modulus can be derived from the fact that the result of 2^i mod m will repeat itself for every modulus as i increases (note that this repetition is not necessarily valid for residues where i < ⌈log2(m)⌉). The periodicity can be found by a brute-force search, and the periodicity of the relevant moduli can be stored in a ROM. An example of this can be seen in table 4.1.

Modulus m   Residues 2^i mod m                               Periodicity p
3           1, 2, 1, 2, 1, 2, ...                            2
4           1, 2, 0, 0, 0, ...                               1
5           1, 2, 4, 3, 1, 2, 4, 3, ...                      4
6           1, 2, 4, 2, 4, 2, 4, ...                         2
7           1, 2, 4, 1, 2, 4, ...                            3
11          1, 2, 4, 8, 5, 10, 9, 7, 3, 6, 1, 2, 4, 8, ...   10
17          1, 2, 4, 8, 16, 15, 13, 9, 1, 2, 4, 8, ...       8
31          1, 2, 4, 8, 16, 1, 2, 4, 8, 16, ...              5
51          1, 2, 4, 8, 16, 32, 13, 26, 1, 2, 4, 8, ...      8

Table 4.1: Periodicity of some residues

The periodic property of a modulus can be used to reduce the TCS bit width used for forward conversion. A TCS number can be sign-extended so that its width is a multiple of the periodicity p and then partitioned into chunks that are p bits wide. These chunks are then added with regular TCS adders. The sum of the addition is then used in the forward conversion, which reduces the number of bits used in the RNS forward conversion. A conversion process identical to the one presented in section 4.3.1 can follow the periodicity simplification.

Example 2. Consider the forward conversion of the 13-bit TCS representation of the number −3821 for modulus 5. −3821 is expressed as 1000100010011 in TCS.

The periodicity of modulus 5 is 4, which can be fetched from table 4.1. So sign-extend the TCS number to a multiple of 4 bits, in this case 16 bits: 1111000100010011. Separate this number into 4-bit chunks and add them (remember the negative weight of the MSB):

1111 + 0001 + 0001 + 0011 = −1 + 1 + 1 + 3 = 4 = 000100

Then use an RNS adder tree to compute the residue of this much smaller number: |4|_5 = 4, which is indeed |−3821|_5.
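The chunk-and-add trick of example 2 can be prototyped in Python as below (an illustrative sketch; it assumes an odd modulus so that every chunk weight 2^{jp} is congruent to 1 modulo m):

    def periodicity(m):
        # period p of |2^i|_m once the sequence has become periodic (cf. table 4.1)
        start = max(1, (m - 1).bit_length())      # roughly ceil(log2(m))
        r0 = pow(2, start, m)
        r, p = (2 * r0) % m, 1
        while r != r0:
            r, p = (2 * r) % m, p + 1
        return p

    def forward_convert_periodic(x, nbits, m):
        p = periodicity(m)
        width = -(-nbits // p) * p                # sign-extend to a multiple of p bits
        u = x & ((1 << width) - 1)                # two's complement bit pattern of x
        chunks = [(u >> (j * p)) & ((1 << p) - 1) for j in range(width // p)]
        if chunks[-1] >= 1 << (p - 1):            # the top chunk carries the sign
            chunks[-1] -= 1 << p
        return sum(chunks) % m                    # reduction done by the RNS adder tree

    assert forward_convert_periodic(-3821, 13, 5) == 4    # example 2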

4.3.3 Forward conversion for modulo 2^n − 1

By extending the solution for forward conversion in the special moduli-set in [4], a more general forward conversion solution for a bigger moduli-set containing modulo 2^n − 1 can be achieved.

The first step is to sign-extend the TCS input to a number of bits, n_{S.e.-bits}, that is evenly divisible by n. This new number is then divided into n_{S.e.-bits}/n chunks, which are summed using an RNS adder tree. A small modification will be necessary to allow the modulo-(2^n − 1) RNS adder to support input in the range [0, m_i] instead of [0, m_i − 1].

4.3.4 Using Verilog's built-in modulo operator

For comparison, a TCS to RNS forward converter has also been implemented using the built-in Verilog modulo operator, %, as seen in the Verilog code below.

    assign output_rns = input_tcs % modulo_parameter;

4.3.5 Forward conversion for modulo 2^n

Forward conversion modulo 2^n is easily performed by selecting the n least significant bits. An example of how this could be realized in Verilog is shown below.

    assign output_rns = input_tcs[n_bits - 1 : 0];

4.4 Reverse conversion

Reverse conversion is the translation process from RNS to TCS. Observe that, compared with forward conversion, no computations can be performed individually in each modulo channel; the entire moduli-set has to be taken into account. This complication makes reverse conversion a major, if not the major, drawback of RNS. Lacking the parallelism seen in other parts of RNS, the reverse conversion process is also more complex. The two main reverse conversion algorithms are based on Mixed-Radix Conversion (MRC) or the Chinese Remainder Theorem (CRT), where only the latter has been investigated in this thesis. The reverse conversion process can be seen in figure 4.16; note that in comparison to forward conversion the reverse conversion has to be pipelined to fulfill the timing goals.


Figure 4.16: Reverse conversion

4.4.1 CRT

The Chinese Remainder Theorem (recall the quote in chapter 1) is a mathematical way of finding the TCS representation of an RNS number. Recall from chapter 2 that a moduli-set m_1, m_2, ..., m_N consisting of N pairwise relatively prime moduli can represent a number X within the range k ≤ X ≤ k + M, where M = ∏_{i=1}^{N} m_i and k is an integer. A number in RNS can be represented as ⟨x_1, x_2, ..., x_N⟩ where each x_i = |X|_{m_i}.

Now define M_i = M/m_i and M_i^{-1} as the multiplicative inverse, where ||M_i^{-1}|_{m_i} M_i|_{m_i} = 1. The CRT states that the TCS number X can be computed by equation (4.9),

\[ X = \left| \sum_{i=1}^{N} x_i \left| M_i^{-1} \right|_{m_i} M_i \right|_M. \qquad (4.9) \]

The M_i are easily computed as described above, though the multiplicative inverse M_i^{-1} is far harder to calculate. There is in fact no general expression for calculating the multiplicative inverse in this context [4]. For prime moduli, Fermat's theorem may sometimes be useful for finding the multiplicative inverse. A far less complicated way of finding the multiplicative inverse is to instead calculate |M_i^{-1}|_{m_i} with a brute-force search over all numbers between 0 and m_i − 1. This can be computed on a PC, and at elaboration time the multiplicative inverses of the chosen moduli in the moduli-set can be stored in a memory element on the ASIC. Note that M_i and M_i^{-1} will be unique for each moduli-set. Pseudo code for finding M_i and |M_i^{-1}|_{m_i} is presented below:

    from math import prod

    for modulus in moduli_set:
        M_i = prod(moduli_set) // modulus
        for inv_iter in range(1, modulus):
            if (M_i * inv_iter) % modulus == 1:
                M_i_inverse = inv_iter
                break
        print(modulus, M_i, M_i_inverse)

The product of M_i and |M_i^{-1}|_{m_i} will be stored in a look-up table in the ASIC. Besides the LUT, the reverse conversion is just a matter of multiplying each x_i with the content of the LUT and then adding the products. Both the multiplication and the addition will be performed using RNS multipliers and RNS adders. The resulting schematic will look like figure 4.17.
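Equation (4.9) and the inverse search combine into a complete software model of the reverse conversion (a Python sketch for checking the arithmetic, not a description of the hardware pipeline):

    from math import prod

    def crt_reverse(residues, moduli):
        # X = | sum_i x_i * |M_i^-1|_{m_i} * M_i |_M, equation (4.9)
        M = prod(moduli)
        x = 0
        for r, m in zip(residues, moduli):
            M_i = M // m
            M_i_inverse = pow(M_i, -1, m)    # same value the brute-force search finds
            x += r * M_i_inverse * M_i
        return x % M

    # Example 1 revisited: <1, 0, 3> in {3, 5, 7} decodes back to 10.
    assert crt_reverse([1, 0, 3], [3, 5, 7]) == 10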


Figure 4.17: Reverse conversion using CRT

In figure 4.17 the RNS adder tree has been pipelined due to the huge number of bits needed to represent the dynamic range M. This is enough to fulfill the timing requirements in the RNS adder tree when using the simplest and most straightforward adder type with two connected binary adders (adder type 1, figure 4.1). After a quick review of the implemented multipliers it was discovered that none of them are purely combinatorial for arbitrary moduli. Due to the need for multiplication of large bit widths in the CRT reverse conversion, multiplier type 1 (the modulo-m product-partitioning multiplier with LUT) was reimplemented with combinatorial logic instead of a LUT, and some registers were inserted to pipeline the multiplier, as can be seen in figure 4.18.

Figure 4.18: Modulo-m product-partitioning multiplier with combinatorial logic instead of LUT. Changes from the ordinary RNS multiplier are shown in white.

By adding registers in the RNS adder tree, designing the RNS multiplier to be purely combinatorial and adding registers inside the multiplier, the required maximum operating frequency of 500 MHz could be achieved.

4.5 Choosing a moduli set

The optimum moduli-set in terms of power dissipation for representing n bits can be found by solving equation (4.10). Here p_i is the power dissipation, m_i is the modulo, N is the number of candidate moduli and s_i ∈ {0, 1} indicates whether modulus m_i is included in the set. Only the power dissipation of the computational operations has been taken into account (one adder and one multiplier); in a larger system the conversion is considered negligible.

\[ \min_{s_i \in \{0,1\}} \left( \sum_{i=0}^{N-1} s_i p_i \right) \quad \text{when} \quad \prod_{s_i \neq 0} s_i m_i \geq 2^n \quad \text{and} \quad |m_i|_{m_j} \neq 0 \;\; \forall \, (i \neq j,\ m_i \geq m_j). \qquad (4.10) \]

This can be solved with the following pseudo code:

    from itertools import combinations
    from math import gcd, prod

    for n_comb in range(1, max_n_comb):
        for moduli_set in combinations(all_modulus, n_comb):
            current_cost = sum(power_cost[m] for m in moduli_set)
            if current_cost < best_cost:
                if prod(moduli_set) >= dynamic_range:
                    relative_prime = True
                    for pair in combinations(moduli_set, 2):
                        # gcd = greatest common divisor
                        if gcd(pair[0], pair[1]) != 1:
                            relative_prime = False
                    if relative_prime:
                        best_cost = current_cost
                        best_moduli_set = moduli_set

Due to the exponentially increasing number of combinations with the number of moduli, the set of candidate moduli sent to the program has been optimized, and those moduli with a very high $p_i/\log_2(m_i)$ have been excluded without any effect on the outcome.

4.6 FIR filter

Several different implementations are possible to achieve functionality identical to the direct-form implementation of equation (2.2) on page 8, as shown in figure 4.19 on the facing page. A major improvement of this design can be achieved by moving the registers from before the multiplications to inside the summation chain, as shown in figure 4.20 on the next page. This design is usually referred to as the transposed direct-form FIR filter and will have a larger area than the direct-form FIR filter (due to more than twice the total register size), but the critical path will only go through one multiplication and one addition (compared to the entire summation chain and one multiplication in the previous case). The simulations performed in this thesis use the transposed direct-form FIR filter as shown in figure 4.20 on the facing page unless otherwise stated. The word length used in the accumulator registers in this case is given by equation (4.11) on the next page to prevent overflow.



$$w_{acc} = w_{data} + w_{coef} + \lceil \log_2(n_{taps}) \rceil \qquad (4.11)$$


Figure 4.19: Direct-form FIR filter


Figure 4.20: Transposed direct-form FIR filter
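As a behavioural reference for the structure in figure 4.20, a minimal Python sketch of the transposed direct-form update is given below. The helper name transposed_fir and the assumption of at least two taps are illustrative, and Python's unbounded integers stand in for the $w_{acc}$-bit accumulator registers of equation (4.11).

def transposed_fir(samples, coefs):
    # Behavioural model of the transposed direct-form FIR filter (>= 2 taps).
    # In hardware each register would be w_acc = w_data + w_coef + ceil(log2(n_taps))
    # bits wide, as given by equation (4.11).
    n_taps = len(coefs)
    regs = [0] * (n_taps - 1)        # accumulator registers inside the summation chain
    outputs = []
    for x in samples:
        # Critical path: one multiplication and one addition per output sample
        y = coefs[0] * x + regs[0]
        # Each register is updated from its product and the next register's old value
        for i in range(n_taps - 2):
            regs[i] = coefs[i + 1] * x + regs[i + 1]
        regs[-1] = coefs[-1] * x
        outputs.append(y)
    return outputs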

In a larger DSP system the samples are very unlikely to arrive every clock cycle; therefore, the hardware can be reused by using a folded FIR filter as presented in figure 4.21. Several other techniques for designing the structure of FIR filters are available but are not further discussed in this thesis.

Figure 4.21: Folded FIR

There are several different ways of deriving the FIR coefficients to fulfill certain goals for the filter. The method used in this thesis is based on the program described in [24] from [25], which is implemented in the signal processing library of SciPy, an open-source library of scientific tools for Python. The choice of coefficients is not very important for the results of this thesis as long as they are realistic.


Chapter 5

Results

The results have been achieved using conditions that are very close to those of a real DSP system:

• A 500 MHz clock has been used

• New data has been assumed to arrive every clock cycle
• The libraries used are based on a 32 nm technology

• The power dissipation was calculated as average power dissipation and not peak power dissipation

5.1 Input data and coefficients

The dynamic power dissipation depends strongly on the input data and coefficients provided to the system. The input data and coefficients will also affect RNS and TCS systems in different ways. For example, signed and unsigned values result in similar behavior in RNS, but signed values will most likely give a considerably higher dynamic power dissipation in TCS compared with unsigned values.

5.1.1 Uniformly distributed data and coefficients

In some cases uniformly distributed random data has been used. The data and coefficients have been generated by randomly generating each bit. The resulting distributions can be seen in figure 5.1 on the next page for different numbers of bits. In this case both the data and coefficients are updated each clock cycle; the reason for also updating the coefficients is to not let the choice of coefficients affect the result. The updating of the coefficients will affect the result, but the results are assumed to be affected in the same way for TCS and RNS.
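A minimal sketch of how such data can be generated is shown below; the function name, the 20-bit example word length and the two's complement interpretation of the random bits are assumptions made for illustration.

import random

def random_word(n_bits):
    # Draw every bit independently, giving a discrete uniform distribution
    word = random.getrandbits(n_bits)
    # Interpret the word as a signed two's complement value
    return word - (1 << n_bits) if word >= (1 << (n_bits - 1)) else word

# Example: one new data word and one new coefficient word per clock cycle
data = [random_word(20) for _ in range(1000)]
coefs = [random_word(20) for _ in range(1000)]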



Figure 5.1: Discrete uniform distributions for different number of bits

5.1.2 Sawtooth data and ramp coefficients

Another interesting type of input data to investigate is sawtooth data. The idea is to generate the highest possible switching activity. This can then be combined with, for example, a ramp as coefficients. The input data and coefficients used in this case are presented in figure 5.2 on the following page. They are generated using equation (5.1), where $i$ is the current clock cycle.

$$\text{data} = i \cdot (-1)^i, \qquad \text{coef} = i \qquad (5.1)$$

5.1.3 Realistic input data and FIR coefficients

The most realistic data is normally distributed data with constant FIR coefficients. The FIR coefficients have been generated using the Remez exchange algorithm as presented in [26]. The FIR filter has a passband between 0 and 2π · 0.297 rad/sample and a stopband between 2π · 0.328 and π rad/sample. The resulting frequency response for some different numbers of taps can be seen in figure 5.4 on page 39. The input data is normally distributed and consists of a signal that has already been processed by a low-pass filter. These are typical signal properties for an input data signal to an FIR filter in a DSP application. The histogram of the input data is plotted in figure 5.3 on the next page.
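A minimal sketch of how such a filter can be designed with SciPy's remez routine is given below; the 151-tap example and the 20-bit coefficient quantization are illustrative assumptions and not necessarily the exact script used in the thesis.

from scipy.signal import remez

# Band edges in cycles/sample (sample rate fs = 1.0): passband up to 0.297,
# stopband from 0.328 up to the Nyquist frequency 0.5
coefs = remez(151, [0.0, 0.297, 0.328, 0.5], [1, 0], fs=1.0)

# One possible quantization to 20-bit two's complement coefficients
w_coef = 20
quantized = [round(c * 2**(w_coef - 1)) for c in coefs]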

5.1.4 Different properties of the data and coefficients

Due to their different properties, the data and coefficients will behave in different ways in RNS and TCS.

Sign switching rate The sign switching rate is the rate at which the data switches from positive to negative or vice versa. The switching rates for the input data used, and for an ordinary normal distribution containing white noise, are presented in table 5.1 on page 40.
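A minimal sketch of how such a rate can be computed from a sequence of samples is given below; the function is illustrative and counts a change of sign between consecutive samples as one switch.

def sign_switching_rate(samples):
    # Fraction of consecutive sample pairs where the sign changes
    switches = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return switches / (len(samples) - 1)

For the sawtooth data of equation (5.1) every transition changes sign, which corresponds to the rate 1.0 in table 5.1, while symmetric random data changes sign in roughly half of the transitions.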



Figure 5.2: Sawtooth data and ramp coefficients

Figure 5.3: Realistic input data for the 20-bit FIR filter


Figure 5.4: FIR filter frequency response (3, 47, 95 and 151 taps)



Theoretical multiplier toggle rate The theoretical multiplier toggle rate is the rate at which the product of a multiplication in TCS and RNS toggles for different input data and coefficients. The results are shown in table 5.2. A 60-bit RNS multiplier is compared as well in table 5.2 because, for example, an FIR filter requires a word length larger than two times the input word length when using more than one tap, as seen in equation (4.11) on page 35.

Input data             Sign switching rate
Uniform distribution   0.5
Normal distribution    0.5
Sawtooth data          1.0
Realistic data         0.33

Table 5.1: Sign switching rate of input data

Input data             40-bit TCS   40-bit RNS   60-bit RNS
Uniform distribution   0.486        0.425        0.456
Normal distribution    0.484        0.427        0.458
Sawtooth data          0.810        0.414        0.451
Realistic data         0.414        0.429        0.458

Table 5.2: Theoretical toggle rate at the output of a multiplication with 20-bit inputs. The optimum moduli-sets as presented in table 5.5 on page 55 are used.
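A sketch of how such a theoretical toggle rate can be estimated in software is given below; it assumes that the toggle rate is the average fraction of output bits that change between consecutive products, which is one plausible reading of the metric but not necessarily the exact definition used for table 5.2.

import random

def toggle_rate(values, word_length):
    # Average fraction of bits that flip between consecutive words
    mask = (1 << word_length) - 1
    flips = sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))
    return flips / ((len(values) - 1) * word_length)

# Illustrative TCS case: products of 20-bit uniformly distributed operands
data = [random.getrandbits(20) for _ in range(10000)]
coefs = [random.getrandbits(20) for _ in range(10000)]
products = [d * c for d, c in zip(data, coefs)]
print(toggle_rate(products, 40))

For RNS the same measure would be applied to each residue channel, using the word length of that channel, and the bit flips summed over the channels.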

5.2 Adders and multipliers

The results for different moduli for each adder and multiplier are presented in this section. The test setup used for generating the results is shown in figure 5.6 on the next page, where Op. A and Op. B are provided with two different streams of uniformly distributed random data. In the resulting graphs the Total power, Toggle rate, UVT ratio and Gate count can be found on the two y-axes. On the x-axis the modulo is plotted with a logarithmic scale of base two. In figure 5.5 on the facing page the values in the graphs for the RNS adders and multipliers are pointed out, and a description of them can be found below.

Total power The total power is the sum of the static and dynamic power dissipation.

Toggle rate The toggle rate is plotted relative to the entire y-axis, that is, the maximum "total power" represents a toggle rate of one and a "total power" of zero represents a toggle rate of zero. The toggle rate itself represents the average rate at which all the nets in the design toggle. The toggle rate, in combination with the gate count, is therefore approximately proportional to the dynamic power dissipation.

Figure 5.5: Description of the RNS multiplier and adder graphs

UVT ratio The UVT ratio is represented on the y-axis in the same way as the toggle rate. The UVT ratio is the ratio of low-leakage cells used in the design. A UVT ratio close to 100 % is desirable.

Gate count The gate count is a technology-independent measure of the total area, and together with the UVT ratio it correlates with the static power dissipation.

Figure 5.6: Test setup for RNS adders and multipliers

5.2.1 Adders

The resulting best RNS adders for each modulo are compared in figure 5.8 on page 43 with the RNS adders for modulo $2^n$. In table 5.3 on page 44 the


[Figure: RNS adders — total power dissipation (µW) versus modulo for adder types 0 to 6]
