Parallel Evaluation Of Fixed-Point Polynomials


Department of Electrical Engineering (Institutionen för systemteknik)

Master’s thesis (Examensarbete)

Master’s thesis carried out in Electronics Systems at the Institute of Technology, Linköping University

by

Shahid Nawaz Khan

LiTH-ISY-EX--10/4406--SE

Linköping 2010

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Muhammad Abbas, ISY, Linköpings universitet

Examiner: Oscar Gustafsson, ISY, Linköpings universitet


Division, Department: Division of Electronic Systems, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2010-09-14

Language: English

Report category: Examensarbete (Master’s thesis)

URL for electronic version: http://www.es.isy.liu.se/ http://www.ep.liu.se

ISRN: LiTH-ISY-EX--10/4406--SE

Title: Parallell evaluering av polynom i fix-talrepresentation (Parallel Evaluation Of Fixed-Point Polynomials)

Author: Shahid Nawaz Khan

Abstract

In some applications polynomials must be evaluated, e.g., polynomial approximation of elementary functions and Farrow filters for arbitrary resampling. For polynomial evaluation, Horner’s scheme uses the minimum amount of hardware resources, but it is sequential. Many algorithms have been developed to introduce parallelism in polynomial evaluation. This parallelism is achieved at the cost of hardware, but gives evaluation in less time.

This work examines the trade-off between hardware cost and the critical path for different levels of parallelism in polynomial evaluation. The trade-offs in generating powers in polynomial evaluation using different building blocks (squarers and multipliers) are also discussed, as are the wordlength requirements of the polynomial evaluation and the effect of power generating schemes on the timing of operations. The area requirements are calculated using Design Analyzer from Synopsys (a tool for logic synthesis), and GLPK (GNU Linear Programming Kit) is used to calculate the bit requirements.

Keywords


Acknowledgments

Countless thanks to ALLAH Almighty, worthy of all praise, Who guides us from darkness to light, and many blessings and peace be upon Mohammad s.a.w, the final messenger of Allah.

I am thankful to my supervisor M.Sc. Muhammad Abbas for his kind support and guidance. I am very thankful to my examiner Dr. Oscar Gustafsson, head of the Division of Electronic Systems, for his help and especially for his encouragement. I am also very thankful to

• M.Sc. Zaka ullah Sheikh and M.Sc. Yasir Ali Shah, who were always there whenever I needed their help.

• M.Sc. Farooq Ul Amin, the person who convinced me to come to Linköping and guided me about the study system.

• M.Sc. Abdul Majid and M.Sc. Owais, who really encouraged me during my studies at Linköping University.

Finally I would like to thank my family, especially my mother, for her unconditional support and love.


Introduction

In some applications polynomials must be evaluated. This is the case for polynomial approximation of elementary functions and for Farrow filters, which can be used for arbitrary resampling. The inputs to the polynomial evaluation are the polynomial coefficients $a_i$ and the input data $x$. The computation to be performed for an $N$-th order polynomial is

$$P(x) = \sum_{i=0}^{N} a_i x^i. \tag{1.1}$$

If the polynomial is fixed, it is possible to optimize the architecture further. However, here we primarily consider architectures for general polynomial coefficients. We first discuss simpler, serial architectures like Horner’s scheme and then move towards parallel architectures that require more hardware and give better speed.

1.1 Aim of the thesis

The main objectives of the thesis are as follows:

• Study the trade-off between the number of operations and the critical path for various degrees of parallelism in polynomial evaluation.

• Study the trade-off in generating powers and polynomial evaluations using different building blocks (squarers, multipliers etc).

• Study how the different power generating schemes affect the timing of operations.

• Study wordlength requirements of polynomial evaluation.

1.2 Thesis organization

The parallel evaluation of a polynomial can be broadly divided into two major tasks:

• The first task is at the algorithm level, where the structure is expanded into more branches, i.e., parallelism is introduced, with a minimum requirement of powers of x.

• The second task is to generate the powers of x efficiently with minimum hardware. The critical path can also be an important aspect in the generation of powers of x.

Following the introduction, chapter one discusses different algorithms and schemes that introduce some kind of parallelism in the polynomial evaluation. These schemes are explained in detail with their advantages and disadvantages. In chapter two the power generation techniques are discussed in detail with their relative advantages and disadvantages. A simple solution to the shortcomings of the existing methods is also discussed.

In chapter three, pipelining registers and bit requirements are calculated. The pipeline registers are introduced after each arithmetic operation. The bit requirements for selected schemes are calculated using an integer optimization tool. Chapter four focuses on the results obtained and compares the schemes on the basis of these results. Finally, chapter five contains conclusions and future work.

1.3 Different schemes

Equation (1.1) can be factored in different ways with different trade-offs. Some of the schemes used are discussed below.

1.3.1 Horner’s scheme

Consider a polynomial $P(x)$ of degree $N$ as in (1.1). Dividing $P(x)$ by the linear factor $x - x_0$, we get

$$P(x) = (x - x_0)(b_1 + b_2 x + \dots + b_N x^{N-1}) + b_0. \tag{1.2}$$

By equating coefficients with (1.1), the values of $b_0, b_1, \dots, b_N$ can be calculated as

$$b_N = a_N, \tag{1.3}$$

$$b_j = a_j + x_0 b_{j+1}, \quad j = N-1, \dots, 0. \tag{1.4}$$

The $b_j$ can be calculated from (1.3) and (1.4), and setting $x = x_0$ in (1.2) gives

$$P(x_0) = b_0.$$

This method of polynomial evaluation is called Horner’s rule [4] and can also be expressed by

$$P(x) = a_0 + x(a_1 + x(a_2 + \dots + x(a_{N-1} + a_N x) \dots )). \tag{1.5}$$

This method happens to be the most economical of all the possible methods in terms of arithmetic operations required to calculate $P(x)$. It requires $N$ multiplications and $N$ additions to calculate $P(x)$. For order 6 we have

$$P(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + x(a_4 + x(a_5 + a_6 x))))). \tag{1.6}$$

Figure 1.1. A 6-th order polynomial in (1.6) implemented according to Horner’s scheme.
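The recurrence in (1.3) and (1.4) can be sketched in a few lines of Python. This is only an illustration: the function name `horner` and the example coefficients are assumptions, and plain Python integers stand in for the fixed-point words of a hardware implementation.

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_N*x^N, where coeffs[i] holds a_i."""
    acc = coeffs[-1]                 # b_N = a_N, as in (1.3)
    for a in reversed(coeffs[:-1]):  # b_j = a_j + x*b_{j+1}, as in (1.4)
        acc = a + x * acc            # one multiplication and one addition
    return acc                       # b_0 = P(x)


# Arbitrary example coefficients a_0 .. a_6 (not taken from the thesis).
coeffs = [1, 2, 3, 4, 5, 6, 7]
print(horner(coeffs, 2))
```

Each loop iteration needs the result of the previous one, which is the sequential dependency visible in (1.6).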

Pros and Cons of Horner’s scheme

• The minimum number of hardware resources is guaranteed by this method.

• There is no need to generate powers of x, so neither logic nor hardware for generating powers is required.

• It is the simplest of the available schemes.

• The critical path is comparatively the longest.

• It is clear from (1.6) and Fig. 1.1 that this method is sequential in its approach, because each calculation depends on the result of the previous calculation and cannot be started before that result is available. So no parallelism is possible in this case.

1.3.2 K-th Order Horner’s scheme

As already noted, Horner’s scheme is the simplest scheme. To introduce some parallelism into it, a generalization of Horner’s scheme was developed by W. S. Dorn [4].

Horner’s scheme obtains $P(x_0)$ by dividing $P(x)$ by $x - x_0$. In the generalization, $P(x)$ is divided by a polynomial $q(x)$ for which $q(x_0) = 0$, so that the remainder again gives $P(x_0)$ at $x = x_0$. By choosing $q(x) = x^K - x_0^K$, where $K \geq 1$,

$$P(x) = (x^K - x_0^K)(b_K + b_{K+1} x + \dots + b_N x^{N-K}) + (b_{K-1} x^{K-1} + \dots + b_1 x + b_0). \tag{1.7}$$

By comparing (1.7) with (1.1), we get

$$b_j = a_j, \quad j = N, \dots, N-K+1,$$

$$b_j = a_j + x_0^K b_{j+K}, \quad j = N-K, \dots, 0,$$

$$P(x_0) = b_{K-1} x_0^{K-1} + \dots + b_1 x_0 + b_0.$$

For $K = 1$ this is the ordinary Horner’s scheme, but for values of $K$ greater than 1 it produces a parallel structure. For $N = 6$ and $K = 3$ the K-th Order Horner scheme is given by

$$P(x) = a_0 + x(a_1 + a_4 x^3) + x^2(a_2 + a_5 x^3) + x^3(a_3 + a_6 x^3). \tag{1.8}$$

Figure 1.2. 6-th order polynomial arranged according to K-th Order Horner scheme represented by (1.8)
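A minimal sketch of the K-th order idea, assuming the polynomial is split into K interleaved branches that are each evaluated by Horner’s rule in $x^K$ and then combined. The function names are illustrative, and the grouping for K = 3 is algebraically equal to (1.8) although the terms are associated slightly differently.

```python
def horner(coeffs, x):
    """Plain Horner evaluation; coeffs[i] is the coefficient of x^i."""
    acc = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        acc = a + x * acc
    return acc


def kth_order_horner(coeffs, x, k):
    """Evaluate P(x) as k independent Horner branches in X = x^k."""
    xk = x ** k                               # the power shared by all branches
    total = 0
    for j in range(k):
        branch = horner(coeffs[j::k], xk)     # a_j + a_{j+k} X + a_{j+2k} X^2 + ...
        total += (x ** j) * branch            # branches could run in parallel
    return total


coeffs = [1, 2, 3, 4, 5, 6, 7]                # arbitrary a_0 .. a_6
print(kth_order_horner(coeffs, 2, 3))
```

With `k = 1` the function reduces to ordinary Horner evaluation, and with `k = 2` it gives the two branches of the even-odd arrangement.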

1.3.3 Even Odd scheme

This is a special case of the K-th Order Horner’s scheme for $K = 2$. It is called Even Odd because of its two branches: one consists of the even powers of the polynomial and the other of the odd powers. Hence the single path is divided into two paths. The only increase in hardware is one extra multiplier, used for calculating the $x^2$ needed in this scheme:

$$P(x) = a_0 + x(a_1 + x^2(a_3 + a_5 x^2)) + x^2(a_2 + x^2(a_4 + a_6 x^2)). \tag{1.9}$$

Pros and Cons of K-th Order Horner’s scheme

• It can increase the parallelism by a factor of K, at least theoretically. In other words, it decreases the processing time by a factor of K.

• The critical path is reduced because of the parallel structures.

• Different sub-structures can be formed with different values of K. An important one is K = 2, here called the Even Odd scheme.

• The processing time is decreased, but by less than a factor of K if the time to generate the powers of x is included.

• Generating powers of x is an extra burden in this case, requiring an algorithm and additional hardware.

Figure 1.3. 6-th order polynomial arranged according to Even Odd scheme represented by (1.9)

1.3.4 Estrin’s scheme

In [5], Gerald Estrin proposed the scheme that is now known by his name. As seen in Horner’s scheme, everything runs in series, so there is a long path to the result. Estrin’s scheme introduces parallelism to decrease the critical path. In Estrin’s method, sub-expressions of the form $(A + Bx)$ and $x^{2^n}$ are isolated. Equation (1.1) for $N = 6$ can be written as

$$P(x) = (a_0 + a_1 x) + (a_2 + a_3 x)x^2 + ((a_4 + a_5 x) + a_6 x^2)x^4. \tag{1.10}$$

Equation (1.10) is obtained from (1.1) by splitting the polynomial in the following way [9]:

$$P(x) = q(x)\, x^{(N/2)+1} + r(x),$$

where

$$q(x) = a_N x^{(N/2)-1} + \dots + a_{(N/2)+1},$$

and

$$r(x) = a_{N/2} x^{N/2} + \dots + a_0.$$

Estrin’s scheme [5] adapts to the polynomial: the larger the polynomial becomes, the more parallel the scheme becomes. It also uses only powers of x obtained by squaring, which are easier to implement than other powers, since a squarer is cheaper than a multiplier in terms of hardware.
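The pairing of sub-expressions $(A + Bx)$ followed by repeated squaring of the power can be sketched as below. This is an illustrative software model only: the name `estrin` is an assumption, and padding with a zero coefficient is one simple way to handle an odd number of terms, not necessarily how a hardware structure such as the one for (1.10) would be built.

```python
def estrin(coeffs, x):
    """Evaluate a polynomial with Estrin's scheme; coeffs[i] multiplies x^i."""
    terms = list(coeffs)
    power = x                       # x, then x^2, x^4, ... by squaring
    while len(terms) > 1:
        if len(terms) % 2:          # pad so that terms can be paired
            terms.append(0)
        # All pairs in one pass are independent and could run in parallel.
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power       # only a squarer is needed here
    return terms[0]


coeffs = [1, 2, 3, 4, 5, 6, 7]      # arbitrary a_0 .. a_6
print(estrin(coeffs, 2))
```

For seven coefficients the first pass forms $(a_0 + a_1 x)$, $(a_2 + a_3 x)$, $(a_4 + a_5 x)$ and $a_6$, the second pass combines them with $x^2$, and the last pass with $x^4$, which matches the grouping in (1.10).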

Pros and Cons of Estrin’s scheme

• The algorithm adapts itself to the order of the polynomial, unlike the K-th Order Horner scheme, in which the algorithm depends on the value of K, and this value differs for different polynomial orders depending on the requirements to be met.

• Along with its advantage of increasing parallelism with increasing polynomial order, another advantage is that it only requires powers of x that are squares of the previous one. This is the minimum time in which a specific power of x can be generated.

• In general it may be better for introducing parallelism, but for a specific polynomial order it has only one solution, and that may not be the optimal one. In other schemes we can change the value of a given variable to get the desired results.

Figure 1.4. 6-th order polynomial arranged according to Estrin’s scheme represented by (1.10)

1.3.5 A Simple Parallel Algorithm For Polynomial Evaluation

In [8] a parallel algorithm for polynomial evaluation is presented for a polynomial $P(x)$ of degree $N$,

$$P(x) = \sum_{i=0}^{N} a_i x^i, \quad N + 1 = KL,$$

where the number of processors is $L + 1$. The $N + 1$ terms of $P(x)$ are divided into $L$ groups,

$$P(x) = \sum_{i=0}^{L-1} b_i x^{iK},$$

where

$$b_i = a_{iK} + a_{iK+1} x + a_{iK+2} x^2 + \dots + a_{iK+K-1} x^{K-1}$$

for $i = 0, 1, \dots, L-1$. For an even value of $N + 1$, [8] gives the maximum flexibility of parallelism. Since every even number is divisible by two, setting $K = 2$ gives maximum parallelism, because $L$ is then at its maximum:

$$N + 1 = KL, \quad L = (N + 1)/2.$$

As an example, take a structure of this type with order $N = 5$ and $L = 3$. In this case

$$P(x) = (a_0 + a_1 x) + (a_2 + a_3 x)x^2 + (a_4 + a_5 x)x^4. \tag{1.11}$$

For odd $N + 1$, however, this level of parallelism is not achievable by this algorithm.

Figure 1.5. 5th order polynomial arranged according to Lie’s scheme represented by (1.11)

Among odd values of $N + 1$, values that are divisible by 3, i.e., $K = 3$, give better parallelism. Parallelism is reduced when $N + 1$ is divisible by neither 2 nor 3, such as 25, 49 etc. The worst case occurs when $N + 1$ is a prime number. In this scenario the algorithm fails completely as far as parallelism is concerned.
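The grouping into $L$ blocks of $K$ terms can be sketched as follows. The function name `grouped_eval` is an assumption made for illustration, and the `ValueError` simply models the restriction that $N + 1$ must be divisible by $K$; it is not part of the algorithm in [8].

```python
def grouped_eval(coeffs, x, k):
    """Evaluate P(x) as L groups of K terms, with N + 1 = K * L."""
    n_terms = len(coeffs)                      # N + 1
    if n_terms % k:
        raise ValueError("N + 1 must be divisible by K")
    n_groups = n_terms // k                    # L
    # b_i = a_{iK} + a_{iK+1} x + ... + a_{iK+K-1} x^(K-1); groups are independent.
    groups = [sum(coeffs[i * k + j] * x ** j for j in range(k))
              for i in range(n_groups)]
    xk = x ** k
    return sum(b * xk ** i for i, b in enumerate(groups))


coeffs = [1, 2, 3, 4, 5, 6]                    # arbitrary a_0 .. a_5, so N = 5
print(grouped_eval(coeffs, 2, 2))              # K = 2, L = 3, as in (1.11)
```

When $N + 1$ is prime, the only valid choices are $K = 1$ or $K = N + 1$, which is the failure case described above.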

Pros and Cons of Simple Parallel Algorithm for Polynomial Evaluation

• When it is applicable, it can give the best results, along with the option of changing the structure according to the priorities, whether the aim is to minimize area or critical path.

• Unfortunately, it is not generally applicable due to certain limitations.

1.3.6 Algorithm A´

This algorithm is defined in [9] along with its different versions. According to this algorithm, if we have $K < N$ processors, we can express a polynomial of degree $N$ as

$$P(x) = A_0(x) + A_1(x) X + \dots + A_{K-1}(x) X^{K-1},$$

where $X = x^{N/K}$ and the $A_i$ are polynomials of degree $N/K$.

Using different values of $K$, different aspects of the structure can be targeted. A 6-th order polynomial as in (1.1) with $N = 6$ can be represented using Algorithm A´ as

$$P(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + (a_4 x + a_5 x^2 + a_6 x^3)x^3. \tag{1.12}$$

Figure 1.6. 6-th order polynomial arranged according to Algorithm A´ scheme represented by (1.12)

Pros and Cons of Algorithm A´

• A setback is that it is also not generally applicable to all polynomials.

1.3.7 Direct Evaluation

This method appears to be the most parallel version. The required powers of x are computed and multiplied with the coefficients in parallel. No algorithm is required:

$$P(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6. \tag{1.13}$$

Figure 1.7. 6-th order polynomial arranged according to Direct Evaluation represented by (1.13)

Pros and Cons of Direct Evaluation

• It is simple in the sense that no algorithm is needed: generate the powers, multiply them with the coefficients and then add the products.

• An expected advantage is the shortest critical path, but it does not turn out to be the shortest one.

• The number of adders is not an issue in any of these schemes, because the number of adders is the same in each scheme for a given polynomial. As far as the hardware cost comparison is concerned, the difference lies in the number of multipliers used. In all the previous schemes only some specific powers of x have to be generated, always fewer than the order of the polynomial, but in this case all powers of x from two up to the order of the polynomial have to be generated. Here we consider a general polynomial, so the case of a coefficient being zero is ignored, because that would address a specific case rather than the general case.


1.4 A general overview

As we move from Horner’s scheme to Direct Evaluation, the demand for hardware resources tends to increase, but at the same time the parallelism also increases. So, before looking at the details, it can be observed that parallelism comes at the cost of hardware: if we gain one advantage, we have to compromise on the other. As the main aim is to introduce as much parallelism as possible with minimum hardware, all the schemes are going to be more expensive than Horner’s scheme, which is sequential.

A common trend is that each scheme offers a variable that controls the parallelism. By changing this variable, the advantages and disadvantages can change completely. The choice of a particular value need not follow the general trends observed up to that point; the value should be chosen that is best for the given conditions and requirements. Within the same scheme, one value of the variable may make the scheme good in terms of area, while another value makes it good for the critical path at an increased area cost.

1.4.1 Using two schemes at the same time

To improve the performance of some schemes, another scheme can be used inside them. Some schemes divide the polynomial into sections, and if a section is itself a polynomial, then another scheme can be applied to that inner polynomial as well. This is shown below for an 11-th order polynomial implemented by Lie’s scheme (Section 1.3.5) with Horner’s rule (Section 1.3.1) applied to the inner polynomial.

Improvement for Lie’s scheme K ≥ 3

With $K \geq 3$, the inner sum can be considered as an independent polynomial of degree $K - 1$, and different techniques can be used to simplify the sum in terms of arithmetic operations and to gain more parallelism, e.g., we can use Horner’s method for the inner polynomial. Let $N = 11$, so $N + 1 = 12$. For $K = 2$ we have $L = 6$ and get

$$P(x) = a_0 + a_1 x + x^2(a_2 + a_3 x) + x^4(a_4 + a_5 x) + x^6(a_6 + a_7 x) + x^8(a_8 + a_9 x) + x^{10}(a_{10} + a_{11} x).$$

For $K = 3$ and the same polynomial, i.e., $N = 11$, we get

$$P(x) = a_0 + a_1 x + a_2 x^2 + x^3(a_3 + a_4 x + a_5 x^2) + x^6(a_6 + a_7 x + a_8 x^2) + x^9(a_9 + a_{10} x + a_{11} x^2),$$

and with Horner’s rule applied to the inner sums,

$$P(x) = a_0 + x(a_1 + a_2 x) + x^3(a_3 + x(a_4 + a_5 x)) + x^6(a_6 + x(a_7 + a_8 x)) + x^9(a_9 + x(a_{10} + a_{11} x)).$$

Chapter 2

Generating the power terms

2.1 Algorithms for short chains

Apart from Horner’s scheme, the more parallel schemes require higher powers of x as the order of the polynomial increases. In Horner’s scheme we only need x, so no powers of x have to be generated. In Estrin’s scheme, for example, we may need $x^{16}$. To get $x^{16}$ we can either start with x and multiply by x 15 times, or use a more efficient method that reduces the number of multiplications. An efficient method is to generate $x^2$, $x^4$, $x^8$ and $x^{16}$ by successively squaring the previous result, which reduces the number of multiplications from 15 to 4:

$$\begin{aligned} x^2 &= x \cdot x \\ x^4 &= x^2 \cdot x^2 \\ x^8 &= x^4 \cdot x^4 \\ x^{16} &= x^8 \cdot x^8 \end{aligned}$$

Writing the exponents of x successively as a sequence gives s = (1, 2, 4, 8, 16). As another example take $x^{23}$. Its generation sequence could be

$$\begin{aligned} x^2 &= x \cdot x \\ x^3 &= x^2 \cdot x \\ x^5 &= x^2 \cdot x^3 \\ x^8 &= x^5 \cdot x^3 \\ x^{13} &= x^5 \cdot x^8 \\ x^{21} &= x^{13} \cdot x^8 \\ x^{23} &= x^{21} \cdot x^2. \end{aligned}$$

Written as a sequence, this gives s = (1, 2, 3, 5, 8, 13, 21, 23). It is called a chain for 23 of length 7, since

$$\begin{aligned} 2 &= 1 + 1 \\ 3 &= 2 + 1 \\ 5 &= 3 + 2 \\ 8 &= 5 + 3 \\ 13 &= 8 + 5 \\ 21 &= 13 + 8 \\ 23 &= 21 + 2. \end{aligned}$$

2.1.1 The Binary Method

The binary method for calculating short chains is one of the oldest known methods; it appeared before 200 B.C.

The algorithm

In order to calculate $x^N$ by this method, we write $N$ in binary form and remove the zeros at the left. Each ‘1’ is replaced by SX and each ‘0’ by S. After removing the SX at the left, the remaining string is the rule for calculating $x^N$, where S denotes squaring and X denotes multiplication by x.

Take $x^{23}$ as an example, so $N = 23$. The first step is to write 23 in binary form, i.e., 10111. Replacing each ‘1’ by SX and each ‘0’ by S gives SXSSXSXSX. Removing SX from the left gives SSXSXSX, which means that the sequence of operations is squaring, squaring, multiplication by x, squaring, multiplication by x, squaring and then multiplication by x [7]. This can be expressed as

$$\begin{aligned} x^2 &= x \cdot x \\ x^4 &= x^2 \cdot x^2 \\ x^5 &= x \cdot x^4 \\ x^{10} &= x^5 \cdot x^5 \\ x^{11} &= x^{10} \cdot x \\ x^{22} &= x^{11} \cdot x^{11} \\ x^{23} &= x^{22} \cdot x \end{aligned}$$

s = (1, 2, 4, 5, 10, 11, 22, 23).

The main advantage of this method is that temporary storage is only required for x and the current partial result. Its simplicity also makes it the most cited method in the literature. Many authors regarded it as the optimal algorithm, but this is not true for all values of $N$. Take $x^{15}$ as an example, so $N = 15$, which is 1111 in binary. This gives SXSXSXSX, and after removing the leftmost SX we are left with SXSXSX. So the sequence to generate $x^{15}$ is

$$\begin{aligned} x^2 &= x \cdot x \\ x^3 &= x^2 \cdot x \\ x^6 &= x^3 \cdot x^3 \\ x^7 &= x^6 \cdot x \\ x^{14} &= x^7 \cdot x^7 \\ x^{15} &= x^{14} \cdot x \end{aligned}$$

s = (1, 2, 3, 6, 7, 14, 15).

With the binary method we need 6 multiplications to get $x^{15}$ from x, but the same result can be obtained in 5 multiplications:

$$\begin{aligned} x^2 &= x \cdot x \\ x^3 &= x^2 \cdot x \\ x^6 &= x^3 \cdot x^3 \\ x^9 &= x^6 \cdot x^3 \\ x^{15} &= x^9 \cdot x^6 \end{aligned}$$

s = (1, 2, 3, 6, 9, 15).

The binary method for calculating a short chain for $x^N$ can also be expressed recursively as [3]

$$x^N = \begin{cases} x & \text{if } N = 1 \\ x^{N/2} \cdot x^{N/2} & \text{if } N \text{ is even} \\ x^{N-1} \cdot x & \text{otherwise.} \end{cases}$$
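The S/X rule of the binary method can be sketched as below. The helper names `binary_rule` and `binary_chain` are assumptions for illustration; the chain is the list of exponents produced, starting from 1.

```python
def binary_rule(n):
    """Return the S/X string of the binary method for exponent n >= 1."""
    bits = bin(n)[2:]                                   # binary form, no leading zeros
    rule = "".join("SX" if b == "1" else "S" for b in bits)
    return rule[2:]                                     # drop the leftmost SX


def binary_chain(n):
    """Return the exponents generated while following the S/X rule."""
    chain = [1]
    for op in binary_rule(n):
        if op == "S":
            chain.append(chain[-1] * 2)                 # squaring doubles the exponent
        else:
            chain.append(chain[-1] + 1)                 # multiplying by x adds one
    return chain


print(binary_rule(23), binary_chain(23))
print(binary_rule(15), binary_chain(15))
```

The number of multiplications is `len(binary_chain(n)) - 1`, which gives 7 for $N = 23$ and 6 for $N = 15$, in line with the examples above.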

2.1.2 Factor Method

This method, as its name suggests, is based on the factorization of $N$. It works entirely differently from the binary method.

The algorithm

In order to calculate $x^N$, we take $N = p \cdot q$, where $p$ is the smallest prime factor of $N$ and $q > 1$. First $x^p$ is calculated, and the outcome is then raised to the $q$-th power. If $N$ is a prime number, $x^{N-1}$ is calculated and multiplied by x. For $N = 1$ no calculation is needed. By applying this algorithm recursively, we can calculate $x^N$. Take $x^{55}$ as an example. Here

$$N = 55 = 5 \cdot 11, \quad p = 5, \quad q = 11.$$

Now we again apply this algorithm to $p$ and $q$. As $p$ is a prime number,

$$y = x^5 = x^4 \cdot x = (x^2)^2 \cdot x,$$

$$y^{11} = y^{10} \cdot y = (y^2)^5 \cdot y,$$

which can be written out step by step as

$$\begin{aligned} x^2 &= x \cdot x \\ x^4 &= x^2 \cdot x^2 \\ x^5 &= x^4 \cdot x \\ x^{10} &= x^5 \cdot x^5 \\ x^{20} &= x^{10} \cdot x^{10} \\ x^{40} &= x^{20} \cdot x^{20} \\ x^{50} &= x^{40} \cdot x^{10} \\ x^{55} &= x^{50} \cdot x^5 \end{aligned}$$

s = (1, 2, 4, 5, 10, 20, 40, 50, 55).

To calculate $x^{55}$, 8 multiplications are needed by this method, while the binary method requires 9 multiplications for the same calculation. Generally this method is better than the binary method, but not always. The smallest value for which the binary method is better than the factor method is $N = 33$ [7].
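The recursion of the factor method can be sketched as follows. The function names are illustrative assumptions; the sketch returns the list of exponents generated, built by scaling the chain for $q$ by $p$ after the chain for $p$ has been produced.

```python
def smallest_prime_factor(n):
    """Return the smallest prime factor of n >= 2 by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n                                    # n itself is prime


def factor_chain(n):
    """Return the exponents generated by the factor method for x^n."""
    if n == 1:
        return [1]
    p = smallest_prime_factor(n)
    if p == n:                                  # prime: compute x^(n-1), multiply by x
        return factor_chain(n - 1) + [n]
    q = n // p
    base = factor_chain(p)                      # y = x^p
    # Raising y to the q-th power: every exponent in the chain for q scales by p.
    return base + [p * e for e in factor_chain(q)[1:]]


print(factor_chain(55))
```

For $N = 55$ this reproduces the sequence s = (1, 2, 4, 5, 10, 20, 40, 50, 55) with 8 multiplications, and for $N = 33$ it needs 7 multiplications where the binary method needs 6.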

2.1.3 Power Tree Method

Another method, a graphical one, which gives minimal addition chains, i.e., the minimum number of multiplications, for relatively small values of $N$ is the power tree shown in Fig. 2.1 [7]. To find the desired result, the required $N$ is located in the tree, and the path from the root to $N$ then gives the sequence of operations required to calculate $x^N$.

The algorithm

Suppose that $i$ levels of the tree are complete and the $(i+1)$-th level is to be constructed. Take each node $N$ in the $i$-th level in turn, starting from the left and moving towards the right, and attach the nodes $N + 1, N + a_1, N + a_2, \dots, N + a_{i-1} = 2N$ to node $N$, where $1, a_1, a_2, \dots, a_{i-1}$ is the path from the root of the tree to the node $N$. A node that has already appeared in the tree is not repeated.

Take $N = 14$ as an example (Fig. 2.1). Here $a_1, a_2, a_3, a_4 = 2, 3, 5, 7$ respectively, and the path ends at 14 itself. The candidate new nodes are 15, 16, 17, 19, 21, 28, but 15, 16 and 17 already appear in the previous level, so they are not added again. This leaves three nodes, i.e., 19, 21 and 28, attached to node $N = 14$, as shown in Fig. 2.1. It has been verified that this method gives optimal results for the values of $N$ listed in this example, but this might not be the case for large enough values. The smallest values of $N$ for which this method is not optimal are $N = 77, 154, 233$. The smallest value for which it beats both the binary and the factor method is $N = 23$. It is not always better than the factor method: for $N \leq 100000$, it is better than the factor method 88803 times, gives the same result 11191 times, and loses to the factor method only 6 times [7].

Figure 2.1. Power Tree.
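The level-by-level construction of the power tree can be sketched as below. The dictionary-of-parents representation and the function names are assumptions made for illustration; the construction rule itself follows the description above.

```python
def power_tree(levels):
    """Build the power tree down to the given level; return a child -> parent map."""
    parent = {1: None}
    level = [1]
    for _ in range(levels):
        next_level = []
        for n in level:                          # left to right within a level
            path = []
            node = n
            while node is not None:              # walk back up to the root
                path.append(node)
                node = parent[node]
            path.reverse()                       # 1, a_1, ..., n
            for a in path:                       # candidates n+1, n+a_1, ..., 2n
                child = n + a
                if child not in parent:          # already declared nodes are skipped
                    parent[child] = n
                    next_level.append(child)
        level = next_level
    return parent


def chain_from_tree(parent, n):
    """Read the chain for n off the tree as the path from the root."""
    chain = []
    while n is not None:
        chain.append(n)
        n = parent[n]
    return chain[::-1]


tree = power_tree(6)
print(chain_from_tree(tree, 23))
print(sorted(c for c, p in tree.items() if p == 14))
```

With six levels the tree already contains 23, and the children of node 14 come out as 19, 21 and 28, as in the example above.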

2.2 Addition chains

The problem of generating powers of x with the minimum number of multiplications is actually the problem of finding the shortest addition chain for an integer. In our case this integer is the power of x. In the two examples above we derived addition chains for 16 and 23.

An addition chain for an integer $N$ is an ascending list of integers $1 = a_0, a_1, \dots, a_r = N$ such that any element except the first one can be represented as the sum of two preceding elements [3].
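The definition can be checked mechanically. The sketch below uses an assumed helper name `is_addition_chain` and allows the two preceding elements to be the same element, which corresponds to a squaring.

```python
def is_addition_chain(chain, n):
    """Check that chain is an ascending addition chain for n starting at 1."""
    if not chain or chain[0] != 1 or chain[-1] != n:
        return False
    if any(b <= a for a, b in zip(chain, chain[1:])):
        return False                              # must be strictly ascending
    for idx in range(1, len(chain)):
        earlier = chain[:idx]
        # Each element must be the sum of two (possibly equal) earlier elements.
        if not any(chain[idx] - e in earlier for e in earlier):
            return False
    return True


print(is_addition_chain([1, 2, 4, 8, 16], 16))
print(is_addition_chain([1, 2, 3, 5, 8, 13, 21, 23], 23))
```

The length of a chain is the number of elements after the leading 1, i.e., the number of multiplications or squarings needed.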

In [13], a new perspective on the problem was studied: the differentiation between a squarer and a multiplier in addition chains, which had not been considered before. Treating a squarer as a different entity from a multiplier is beneficial, as shown by the work in [13]. According to [13], if the area cost of a binary adder on an FPGA is taken to be $C_a = n$, where $n$ is the input bit size, then the costs of binary multiplication and binary squaring can be calculated. Based on [12], it is observed in [13] that a parallel array multiplier consists of $n$ $n$-bit adders. A parallel add vector based multiplier needs similar hardware resources. To produce the partial products, $n^2$ AND operations are needed. The multiplier cost on an FPGA is calculated as $C_m = 2n^2$. In [10], it is shown that when computing a square the number of partial product bits can be reduced to half. On the basis of this work, the cost of a squarer on an FPGA is calculated to be $C_s = n^2$.

If this is true, then addition chains that contain more squarers give more economical results as far as the area cost is concerned. Take $N = 5$ as an example. The two possible chains are

chain1 = 1, 2, 3, 5

chain2 = 1, 2, 4, 5.

Both chains have the same length. If we only consider the number of multiplications they are the same, but if we treat squarers as a different entity they differ from each other. The area cost on an FPGA is $2C_m + C_s$ for chain1 and $C_m + 2C_s$ for chain2. It can be seen that chain2 is more economical, keeping in mind that the same number of multiplication operations is needed for both chains. For this reason, the minimal cost addition chain problem is defined in [13] through the cost

$$C_A = \sum_{a \in A} w(a), \quad \text{where} \quad w(a) = \begin{cases} C_m & \text{if } a = b + c,\ b \neq c,\ b, c \in A \\ C_s & \text{if } a = 2b,\ b \in A \\ 0 & \text{if } a = 1. \end{cases}$$

The following theorems are derived in [13].

Theorem 1

If $C_s$ and $C_m$ are the area costs of a squarer and a multiplier respectively, then a lower bound for the cost of an addition chain for $N$ is $C_s \lfloor \log_2 N \rfloor + C_m \lceil \log_2 v(N) \rceil$, where $v(N)$ is the number of ones in the binary representation of $N$ [13].

Theorem 2

If $C_s$ and $C_m$ are the area costs of a squarer and a multiplier respectively, then a lower bound for the cost $C_{pq}$ of any addition chain with elements from $p$ to $q$, where $p > q$, is given by:

(a) If $q \geq 2^t p$, then $C_{pq} = t C_s$ if $q$ is even, otherwise $C_{pq} = (t - 1) C_s + C_m$.

(b) If $q$ is odd and $q \geq 3 \cdot 2^t r$, then $C_{pq} = (t + 1) C_s + 2 C_m$ [13].
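The weighted chain cost and the lower bound of Theorem 1 can be evaluated with a short sketch, assuming the FPGA cost model $C_m = 2n^2$ and $C_s = n^2$ quoted above. The function names and the 8-bit word length are illustrative assumptions, and an element that can be formed by a squaring is always costed as a squarer.

```python
def chain_cost(chain, n_bits):
    """Area cost of an addition chain with C_m = 2*n^2 and C_s = n^2."""
    c_m = 2 * n_bits ** 2
    c_s = n_bits ** 2
    cost = 0
    for idx, a in enumerate(chain):
        if a == 1:
            continue                                  # w(1) = 0
        earlier = chain[:idx]
        if a % 2 == 0 and a // 2 in earlier:
            cost += c_s                               # a = 2b: squarer
        elif any(a - b in earlier for b in earlier):
            cost += c_m                               # a = b + c, b != c: multiplier
        else:
            raise ValueError("not an addition chain")
    return cost


def theorem1_lower_bound(n, n_bits):
    """C_s * floor(log2 N) + C_m * ceil(log2 v(N)) from Theorem 1."""
    c_m = 2 * n_bits ** 2
    c_s = n_bits ** 2
    ones = bin(n).count("1")                          # v(N)
    floor_log2_n = n.bit_length() - 1
    ceil_log2_ones = (ones - 1).bit_length()
    return c_s * floor_log2_n + c_m * ceil_log2_ones


print(chain_cost([1, 2, 3, 5], 8), chain_cost([1, 2, 4, 5], 8))
print(theorem1_lower_bound(5, 8))
```

For $N = 5$ the second chain meets the lower bound, while the first chain costs one extra multiplier in place of a squarer.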

2.3 Implementation issues

2.3.1 Multiple power terms

Even with all these power generation techniques, certain limitations remain when practical issues are tackled. The first and most important shortcoming of all the methods above is that they focus on finding the shortest addition chain for a single value of $N$, whereas in a real polynomial evaluation task more than one power of x is needed at a time. With the above methods the hardware cost for generating one specific power may be minimal, but if more than one power has to be calculated, the overall cost will increase, because each power is generated independently. From Table 2.1 it can be seen that generating different powers requires different shortest addition chains, independent of each other. As an example, $N = 8$ has the shortest addition chain (1, 2, 4, 8), which does not contain 7, 6, 5 or 3; the same situation can be seen for 7, 6 and 5. This situation gets worse as the value of $N$ increases, along with the number of powers to be generated.

Table 2.1. Shortest addition chains for values of N from 5 to 16.

N    Shortest addition chains
5    (1, 2, 3, 5) (1, 2, 4, 5)
6    (1, 2, 3, 6) (1, 2, 4, 6)
7    (1, 2, 3, 4, 7) (1, 2, 3, 5, 7) (1, 2, 3, 6, 7) (1, 2, 4, 5, 7) (1, 2, 4, 6, 7)
8    (1, 2, 4, 8)
9    (1, 2, 3, 6, 9) (1, 2, 4, 5, 9) (1, 2, 4, 8, 9)
10   (1, 2, 3, 5, 10) (1, 2, 4, 5, 10) (1, 2, 4, 6, 10) (1, 2, 4, 8, 10)
11   (1, 2, 3, 4, 7, 11) (1, 2, 3, 4, 8, 11) (1, 2, 3, 5, 6, 11) (1, 2, 3, 5, 8, 11)
     (1, 2, 3, 5, 10, 11) (1, 2, 3, 6, 8, 11) (1, 2, 3, 6, 9, 11) (1, 2, 4, 5, 6, 11)
     (1, 2, 4, 5, 7, 11) (1, 2, 4, 5, 9, 11) (1, 2, 4, 5, 10, 11) (1, 2, 4, 6, 7, 11)
     (1, 2, 4, 6, 10, 11) (1, 2, 4, 8, 9, 11) (1, 2, 4, 8, 10, 11)
12   (1, 2, 3, 6, 12) (1, 2, 4, 6, 12) (1, 2, 4, 8, 12)
13   (1, 2, 3, 5, 8, 13) (1, 2, 3, 5, 10, 13) (1, 2, 3, 6, 7, 13) (1, 2, 3, 6, 12, 13)
     (1, 2, 4, 5, 8, 13) (1, 2, 4, 5, 9, 13) (1, 2, 4, 6, 7, 13) (1, 2, 4, 6, 12, 13)
     (1, 2, 4, 8, 9, 13) (1, 2, 4, 8, 12, 13)
14   (1, 2, 3, 4, 7, 14) (1, 2, 3, 5, 7, 14) (1, 2, 3, 6, 7, 14) (1, 2, 3, 6, 8, 14)
     (1, 2, 3, 6, 12, 14) (1, 2, 4, 5, 7, 14) (1, 2, 4, 5, 9, 14) (1, 2, 4, 5, 10, 14)
     (1, 2, 4, 6, 7, 14) (1, 2, 4, 6, 8, 14) (1, 2, 4, 6, 10, 14) (1, 2, 4, 6, 12, 14)
     (1, 2, 4, 8, 10, 14) (1, 2, 4, 8, 12, 14)
15   (1, 2, 3, 5, 10, 15) (1, 2, 3, 6, 9, 15) (1, 2, 3, 6, 12, 15) (1, 2, 4, 5, 10, 15)
16   (1, 2, 4, 8, 16)

2.3.2 Critical path

The second important issue which is not covered in shortest addition chains or any of the previous methods for generating power is the critical path from x to xN_{. It}

has been observed that in addition chains and the other power generation schemes, the path may be shortest in terms of the number of addition but it may not be shortest is terms of critical path. Lets take an example in Fig. 2.2. Two chains for generating x23 _{are compared. Let T be the time taken by a single multiplication}

or squaring, so the tie of a single multiplication operation is scaled horizontally. In this particular case sizes of all the multipliers are taken same which might not be the case in the real time scenario. Here a comparison is made between the critical paths of the two power generation schemes, so different multiplier sizes will not effect the result concluded.

The chain for the scheme at the top of Fig. 2.2 is 1, 2, 3, 4, 7, 8, 16, 23, and for the scheme at the bottom it is 1, 2, 3, 5, 10, 20, 23. The scheme at the bottom is a shortest addition chain of length 6, whereas the scheme at the top has a length of 7. The important point, however, is that the scheme at the top gives the result x^23 after 5 multiplication times, i.e., 5T, while the scheme at the bottom gives the same result only after 6 multiplication times. The disadvantage of the top scheme is that its chain length is 7.

Figure 2.2. Comparison.
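The comparison can be reproduced with a short Python sketch. It assumes, as in the text, that every squaring or multiplication takes one time unit T and that each chain element is formed from the pair of earlier elements that becomes available soonest; the function name `critical_path` is hypothetical.

```python
def critical_path(chain):
    """Return the earliest completion time (in units of T) of every chain element."""
    depth = {1: 0}  # x itself is available at time 0
    for k in chain[1:]:
        # Try every pair a + (k - a) = k of exponents that already exist.
        candidates = [max(depth[a], depth[k - a]) + 1
                      for a in depth if (k - a) in depth]
        if not candidates:
            raise ValueError(f"{k} is not a sum of two earlier chain elements")
        depth[k] = min(candidates)
    return depth


top = [1, 2, 3, 4, 7, 8, 16, 23]
bottom = [1, 2, 3, 5, 10, 20, 23]
for chain in (top, bottom):
    # chain length (number of operations) and critical path to x^23
    print(len(chain) - 1, critical_path(chain)[23])
# prints "7 5" for the top chain and "6 6" for the bottom chain
```

The longer chain finishes earlier because x^16 and x^7 are produced on two parallel branches before the final multiplication.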

2.4 A proposed solution

Three issues, explained separately in the previous section, have to be dealt with in a combined way:

• How to use minimum resources to generate multiple powers.

• How to trade off the least number of multiplications against the minimum critical path.

• How to take advantage of the lower complexity of a squarer compared with a multiplier.

2.4.1 Generating all powers of x from 2 to N

Here we have N = 32, as shown in Table 2.2, where S represents a squarer and M represents a multiplier. The subscripts of M and S give their position in the power generation sequence. Column 2 shows the length of the shortest addition chain for the respective value of N. Column 3 shows the number of squarers and multipliers required to generate that specific power by the proposed method. Column 4 shows the multipliers and squarers in the critical path for generating that specific power in the proposed method.

The fifth and last column shows the total number of multipliers and squarers spent up to that particular value of N for the proposed method. Columns 2 and 3 allow the length of the shortest addition chain to be compared with that of the proposed solution. For N = 1 to N = 32, the shortest addition chain is shorter than the proposed one only at N = 23, 27, 30, and 31. From Table 2.1 it can be observed that generating each power independently would require far more resources than the proposed solution in Table 2.2. The reason is that as many multiplications as possible are converted to squarings, and the critical path issue has also been taken into account. A scheme generating the powers of x from 1 to 16 is shown in Fig. 2.3.
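One way to read the critical-path column of Table 2.2 is sketched below in Python. The rule used here is an assumption inferred from the table, not a statement of the exact procedure: every even power is produced by squaring the power of half the exponent, and every odd power by one multiplication of the two already generated powers that are available soonest. The name `power_schedule` is hypothetical.

```python
def power_schedule(n_max):
    """Assign an operation type and a critical-path depth to each power 2..n_max."""
    depth = {1: 0}   # x is available at time 0
    op = {}
    for n in range(2, n_max + 1):
        if n % 2 == 0:
            op[n] = "S"                      # squarer: x^n = (x^(n/2))^2
            depth[n] = depth[n // 2] + 1
        else:
            op[n] = "M"                      # multiplier: x^n = x^a * x^(n-a)
            depth[n] = 1 + min(max(depth[a], depth[n - a])
                               for a in range(1, n // 2 + 1))
    return op, depth


op, depth = power_schedule(32)
# depth[23] == 5, the number of operations in the critical path listed for N = 23
```

Under this rule the depth of every power up to 32 equals the number of operations in the corresponding critical-path entry of Table 2.2.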

Figure 2.3. Generating powers of x

2.4.2 Generating specific powers of x

In real implementations of polynomial evaluation schemes, some schemes need only specific powers. Following the above method in such cases may spend resources that are not needed. Consider a scenario in which only even powers of x are needed in the polynomial evaluation. Table 2.3 shows a scheme for generating only the even powers of x for N from 2 to 32. Compared with Table 2.2, especially the last column, it clearly indicates that the hardware cost is reduced by half. A scheme for generating only the even powers of x, from x^2 to x^32, is shown in Fig. 2.4. Table 2.4 shows an example of generating the powers of x that are multiples of three. The last column shows the reduction in the number of multiplications and squarings. The same method is illustrated in Fig. 2.5 for powers up to N = 15. Figure 2.6 shows the generation of the powers for Estrin’s scheme.

Figure 2.4. Generating even powers of x

Figure 2.5. Generating powers of x for multiples of three
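The totals in the last rows of Tables 2.3 and 2.4 can be reproduced with a small Python sketch. It rests on an interpretation rather than on a stated rule: first form y = x^step at some base cost, then generate the even powers of y with squarers and the odd powers of y with multipliers. The function `ops_for_multiples` and its parameter `base_cost` are hypothetical names.

```python
def ops_for_multiples(step, n_max, base_cost):
    """Count (squarers, multipliers) for the powers x^step, x^(2*step), ..., up to x^n_max.

    base_cost is the (squarers, multipliers) pair spent on forming y = x^step.
    """
    m = n_max // step                            # highest required power of y
    squarers = base_cost[0] + m // 2             # y^2, y^4, ... one squarer each
    multipliers = base_cost[1] + (m - 1) // 2    # y^3, y^5, ... one multiplier each
    return squarers, multipliers


print(ops_for_multiples(2, 32, (1, 0)))   # even powers, y = x^2 costs one squarer
print(ops_for_multiples(3, 30, (1, 1)))   # multiples of three, y = x^3 costs S + M
# prints (9, 7) and (6, 5)
```

These are the values S = 9, M = 7 and S = 6, M = 5 given in the last rows of Table 2.3 and Table 2.4, respectively.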

2.4.3 Sharing of hardware resources

This method is independent of all the methods explained above and can be used to further reduce the implementation cost in terms of hardware and power. The idea is simple: x is the same for generating all powers, so looking inside a multiplier circuit and at the multiplication principle, we find that some of the partial products generated for one power of x repeat among the partial products generated for another power of x. (This probability certainly increases if the powers of x being compared are adjacent to each other, e.g., x^5 and x^6.)

Figure 2.6. Generating powers of x for Estrin’s scheme

These partial products can be shared instead of being produced again and again. This may complicate the interconnect inside the circuit, and the interconnect area might become an additional overhead.

Table 2.2. Generating multiple powers of x from 2 to 32 (S = squarer and M = multiplier).

N    Length of shortest   Multiplications for   Critical path        Multipliers needed to
     addition chain       single power                               generate powers up to N
2    1                    S = 1, M = 0          S1                   S = 1, M = 0
3    2                    S = 1, M = 1          S1 M1                S = 1, M = 1
4    2                    S = 2, M = 0          S1 S2                S = 2, M = 1
5    3                    S = 2, M = 1          S1 S2 M2             S = 2, M = 2
6    3                    S = 2, M = 1          S1 M1 S3             S = 3, M = 2
7    4                    S = 2, M = 2          S1 M1 M3             S = 3, M = 3
8    3                    S = 3, M = 0          S1 S2 S4             S = 4, M = 3
9    4                    S = 3, M = 1          S1 S2 S4 M4          S = 4, M = 4
10   4                    S = 3, M = 1          S1 S2 M2 S5          S = 5, M = 4
11   5                    S = 3, M = 2          S1 S2 S4 M5          S = 5, M = 5
12   4                    S = 3, M = 1          S1 M1 S3 S6          S = 6, M = 5
13   5                    S = 3, M = 1          S1 S2 M2 M7          S = 6, M = 6
14   5                    S = 3, M = 2          S1 M1 M3 S7          S = 7, M = 7
15   5                    S = 3, M = 3          S1 M1 M3 M8          S = 7, M = 8
16   4                    S = 4, M = 0          S1 S2 S4 S8          S = 8, M = 8
17   5                    S = 4, M = 1          S1 S2 S4 S8 M9       S = 8, M = 9
18   5                    S = 4, M = 1          S1 S2 S4 M4 S9       S = 9, M = 9
19   6                    S = 4, M = 2          S1 S2 S4 S8 M10      S = 9, M = 10
20   5                    S = 4, M = 1          S1 S2 M2 S5 S10      S = 10, M = 10
21   6                    S = 4, M = 2          S1 S2 S4 S8 M11      S = 10, M = 11
22   6                    S = 4, M = 2          S1 S2 S4 M5 S11      S = 11, M = 11
23   6                    S = 4, M = 3          S1 S2 S4 S8 M12      S = 11, M = 12
24   5                    S = 4, M = 1          S1 M1 S3 S6 S12      S = 12, M = 12
25   6                    S = 4, M = 2          S1 S2 S4 M4 M13      S = 12, M = 13
26   6                    S = 4, M = 2          S1 M1 S3 M7 S13      S = 13, M = 13
27   6                    S = 4, M = 3          S1 S2 S4 M5 M14      S = 13, M = 14
28   6                    S = 4, M = 2          S1 M1 M3 S7 S14      S = 14, M = 14
29   7                    S = 4, M = 3          S1 S2 M2 M7 M15      S = 14, M = 15
30   6                    S = 4, M = 3          S1 M1 M3 M8 S15      S = 15, M = 15
31   7                    S = 4, M = 4          S1 M1 M3 M8 M16      S = 15, M = 16
32   5                    S = 5, M = 0          S1 S2 S4 S8 S16      S = 16, M = 16

Table 2.3. Generating multiple even powers of x from 2 to 32 (S = squarer and M = multiplier).

N    Length of shortest   Multiplications for   Critical path        Multipliers needed to
     addition chain       single power                               generate powers up to N
2    1                    S = 1, M = 0          S1                   S = 1, M = 0
4    2                    S = 2, M = 0          S1 S2                S = 2, M = 0
6    3                    S = 2, M = 1          S1 S2 M1             S = 2, M = 1
8    3                    S = 3, M = 0          S1 S2 S3             S = 3, M = 1
10   4                    S = 3, M = 1          S1 S2 S3 M2          S = 3, M = 2
12   4                    S = 3, M = 1          S1 S2 M1 S4          S = 4, M = 2
14   5                    S = 3, M = 2          S1 S2 M1 M3          S = 4, M = 3
16   4                    S = 4, M = 0          S1 S2 S3 S5          S = 5, M = 3
18   5                    S = 4, M = 1          S1 S2 S3 S5 M4       S = 5, M = 4
20   5                    S = 4, M = 1          S1 S2 S3 M2 S6       S = 6, M = 4
22   6                    S = 4, M = 2          S1 S2 S3 S5 M5       S = 6, M = 5
24   5                    S = 4, M = 1          S1 S2 M1 S4 S7       S = 7, M = 5
26   6                    S = 4, M = 2          S1 S2 S3 S5 M6       S = 7, M = 6
28   6                    S = 4, M = 2          S1 S2 M1 M3 S8       S = 8, M = 6
30   6                    S = 4, M = 3          S1 S2 M1 M3 M7       S = 8, M = 7
32   5                    S = 5, M = 0          S1 S2 S3 S5 S9       S = 9, M = 7

Table 2.4. Generating powers (multiples of 3) of x from 3 to 30 (S = squarer and M = multiplier).

N    Length of shortest   Multiplications for   Critical path        Multipliers needed to
     addition chain       single power                               generate powers up to N
3    2                    S = 1, M = 1          S1 M1                S = 1, M = 1
6    3                    S = 2, M = 1          S1 M1 S3             S = 2, M = 1
9    4                    S = 3, M = 1          S1 S2 S4 M4          S = 2, M = 2
12   4                    S = 3, M = 1          S1 M1 S3 S6          S = 3, M = 2
15   5                    S = 3, M = 3          S1 M1 M3 M8          S = 3, M = 3
18   5                    S = 4, M = 1          S1 S2 S4 M4 S9       S = 4, M = 3
21   6                    S = 4, M = 2          S1 S2 S4 S8 M11      S = 4, M = 4
24   5                    S = 4, M = 1          S1 M1 S3 S6 S12      S = 5, M = 4
27   6                    S = 4, M = 3          S1 S2 S4 M5 M14      S = 5, M = 5
30   6                    S = 4, M = 3          S1 M1 M3 M8 S15      S = 6, M = 5

Chapter 3

Pipeline registers and bit requirements for different schemes

Due to time limitations, not all of the schemes explained earlier are considered in this chapter; only four of them are considered here.

3.1 Bit requirement

Integer optimization

The tool used to obtain the optimized number of bits for the different structures is GLPK (GNU Linear Programming Kit) [2]. It solves large-scale linear programming (LP), mixed integer programming (MIP), and other related problems. One example from each of the four schemes is explained below, and the bit requirements for the remaining values of N (where N is the order of the polynomial, ranging from 3 to 8) are calculated for these four schemes.

An important basic requirement before using this optimization tool is to calculate the area of the multipliers that might be used in the optimization process. For this, simple VHDL code is written for the multiplication operation with different bit sizes, in our case multipliers from 2 to 20 bits. The areas are computed using the Design Analyzer tool. The area values are very important because they are the priority factors in the calculation of the bit requirement for a given output noise variance.

3.1.1 Even Odd scheme

This section explains the optimization method used to obtain the bit requirement for the Even Odd scheme. For simplicity and functional advantage, polynomials of even and odd order are considered separately.

In this scheme the structures for even-order polynomials differ from those for odd-order polynomials, which is the reason for considering them separately in the optimization problem that finds their bit requirements.
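As a reminder of why the two branches look different, the following Python sketch evaluates a polynomial with an even/odd split. It assumes the usual decomposition P(x) = E(x^2) + x·O(x^2), with each half evaluated by Horner’s rule in x^2; this matches the impulse responses listed for the order 4 structure but is otherwise an illustrative assumption, and the function names are hypothetical.

```python
def horner(coeffs, y):
    """Evaluate sum(coeffs[i] * y**i) with Horner's rule (lowest order first)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * y + c
    return acc


def even_odd(coeffs, x):
    """Evaluate P(x) = a_0 + a_1 x + ... + a_N x^N by an even/odd split."""
    x2 = x * x
    even = horner(coeffs[0::2], x2)   # a_0 + a_2 x^2 + a_4 x^4 + ...
    odd = horner(coeffs[1::2], x2)    # a_1 + a_3 x^2 + ...
    return even + x * odd             # the odd branch needs one extra multiplication by x


print(even_odd([1, 2, 3, 4, 5], 2.0))   # 1 + 2*2 + 3*4 + 4*8 + 5*16
# prints 129.0
```

For even N the even branch holds the highest coefficient, while for odd N the odd branch does, which is why the two cases lead to differently labeled structures.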

For even ordered polynomials

For this case, consider the simple example of a polynomial of order 4, as shown in Fig. 3.1. In Fig. 3.2, Q represents the quantizers introduced after the multiplications and for the coefficients. In Fig. 3.3, Q is replaced by its linear model. The errors e_m1, e_m2, e_m3, e_m4 and e_s1, e_s2, e_s3, e_s4, e_s5 are introduced after the multiplications and at the coefficient inputs, respectively.

Figure 3.1. An order 4 polynomial using the Even Odd scheme.

Figure 3.2. An Even Odd order 4 polynomial after inserting the quantizations.

It is assumed that all errors are uncorrelated with each other. This provides simplicity because the contribution from each error source can then be calculated independently using the superposition principle [6]. The impulse response h(n) for each of the noise sources e_m1, e_m2, e_m3, e_m4 and e_s1, e_s2, e_s3, e_s4, e_s5 is calculated independently.

Figure 3.3. An Even Odd order 4 polynomial after inserting the noise sources

From Fig. 3.3 the impulse responses are

\[
h_{s5}(n) = b^4, \quad h_{s4}(n) = b^3, \quad h_{s3}(n) = b^2, \quad h_{s2}(n) = b, \quad h_{s1}(n) = 1
\]

for e_s5, e_s4, e_s3, e_s2, e_s1, respectively, and

\[
h_{m4}(n) = b^2, \quad h_{m3}(n) = b, \quad h_{m2}(n) = 1, \quad h_{m1}(n) = 1
\]

for e_m4, e_m3, e_m2, e_m1, respectively. The total output noise variance is the sum of the variances calculated for the above sources using the following equation [11]:

\[
\sigma^2 = \frac{1}{12}\left(2^{-2b_1}\right)\sum_{n} h^2(n)
\]
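The summation over the sources can be sketched in Python as follows. Two assumptions are made that the text does not state explicitly: the exponent b_1 in the equation is taken as the number of fractional bits at the quantizer in question, and each impulse response reduces to a single gain value for a given input b. The function name and the numeric values are hypothetical.

```python
def output_noise_variance(gains, frac_bits):
    """Sum the contributions (2^(-2*bits) / 12) * gain^2 of uncorrelated noise sources."""
    return sum(g * g * 2.0 ** (-2 * bits) / 12.0
               for g, bits in zip(gains, frac_bits))


b = 0.5  # hypothetical input value
gains = [b**4, b**3, b**2, b, 1.0,   # h_s5 ... h_s1 (coefficient quantizers)
         b**2, b, 1.0, 1.0]          # h_m4 ... h_m1 (multiplier quantizers)
variance = output_noise_variance(gains, [10] * len(gains))  # 10 fractional bits everywhere
```

Because the sources are assumed uncorrelated, the word length at each quantizer can be varied independently and its effect on the total read off directly from its own term.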

This is how the output noise variance is calculated. Next, the bit requirement for a given noise variance must be determined. For that purpose integer linear optimization is used, with the objective of minimizing the area of the multipliers and adders used in the structure for a given round-off noise requirement at the output. The areas of multipliers of different word lengths and of a full adder are calculated using the Design Analyzer tool by Synopsys. The general theme of the GLPK code is to find the optimal bit requirement for this structure for a given output noise variance at minimal hardware cost.

In our case the objective function is to minimize the area of the adders and multipliers for any given task. A set of constraints is defined so that the calculation is limited to certain boundaries.

In order to calculate the sizes of the multipliers and adders, the minimum and maximum of the two inputs X_1i and X_2i are denoted X_4i and X_5i, respectively. The constraints are functions of binary variables, as shown in Fig. 3.4.

An important constraint is the computation of the total noise variance: with bit widths assigned to the different variables in the structure, the noise variance is calculated and required to be less than the specified limit at the output.

The objective function is to minimize the area while meeting the limit on the output noise variance. Here X_4i and X_5i are used to calculate the area of the multipliers and adders, respectively, whereas the other input of the multiplier is fixed to some word length.

As a result of this objective function, the bit widths are forced toward smaller values, minimizing the size of the multipliers and adders used while still meeting the output requirement.
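The interplay between the objective and the noise constraint can be illustrated with a toy Python search. This is not the GLPK model: an exhaustive search over a few quantizers is swapped in for the integer linear program, and the linear area cost per bit is a hypothetical stand-in for the synthesized multiplier and adder areas. All names and numbers are illustrative.

```python
from itertools import product


def noise_variance(gains, bits):
    """Total output noise variance for uncorrelated sources with the given gains."""
    return sum(g * g * 2.0 ** (-2 * w) / 12.0 for g, w in zip(gains, bits))


def cheapest_wordlengths(gains, area_per_bit, max_variance, widths=range(2, 21)):
    """Return (area, bits) with minimal area that meets the noise limit, or None."""
    best = None
    for bits in product(widths, repeat=len(gains)):
        if noise_variance(gains, bits) > max_variance:
            continue                          # constraint violated
        area = sum(a * w for a, w in zip(area_per_bit, bits))
        if best is None or area < best[0]:
            best = (area, bits)               # cheaper feasible assignment found
    return best


# Three hypothetical quantizers with different gains to the output and different costs.
result = cheapest_wordlengths([1.0, 0.5, 0.25], [3.0, 2.0, 1.0], 1e-6)
```

The search space grows exponentially with the number of quantizers, which is one reason a dedicated integer programming solver is preferable for the full structures.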

Figure 3.4. Binary variables used for the integer optimization code, labeled on the structure of Fig. 3.3

For odd ordered polynomials

Figure 3.5 shows the structure for odd-order polynomials in the Even Odd polynomial evaluation scheme. The structure shown is of order 3 to keep things simple. The whole method explained in Section 3.1.1 can also be applied to odd-order polynomials, with some minor adjustments because of the structural difference between the even-order and odd-order structures of the Even Odd scheme.

These minor changes concern the parameter set definitions and some variable constraints, where conditions may differ according to the labeling of the structure. For even-order polynomials using the Even Odd scheme, the method also varies slightly with a change of order, e.g., from 4 to 6: some parameter values that were set for order 4 need to be adjusted for order 6.

Figure 3.5. Order 3 polynomial with Estrin’s scheme, binary variables used for integer optimization code

3.1.2 Horner’s scheme

Bit requirements of a polynomial based on Horner’s scheme for a Farrow filter structure are calculated in [1]. For calculating the bit requirement, Horner’s scheme is the simplest one in the sense that its structure grows linearly with the order of the polynomial. For this reason the indexing of the variables relative to each other is comparatively easy, and minimal change is required when the order of the polynomial changes.

The linear integer optimization code for Horner’s scheme is similar to that of the Even Odd scheme as far as aim and methodology are concerned. The constraint sets are the same as for the Even Odd scheme; the only differences are in the parameter definitions, set definitions, and indexing of variables relative to each other.

The quantizations are shown in Fig. 3.7 and are then replaced by noise sources. Figure 3.8 shows the noise sources e_m1, e_m2, e_m3 and e_s1, e_s2, e_s3, e_s4, which are inserted in place of the quantizers as explained in Section 3.1.1. The only difference is that the structure follows Horner’s scheme, as shown in Fig. 3.6.
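The placement of the quantizers can be mimicked in software. The Python sketch below quantizes every coefficient and every product to a fixed number of fractional bits, giving N + 1 coefficient quantizers and N product quantizers for order N, like the four e_s and three e_m sources of the order 3 case. Using one common word length and round-to-nearest are simplifying assumptions, and the function names are hypothetical.

```python
import math


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2^(-frac_bits)."""
    scale = 1 << frac_bits
    return math.floor(value * scale + 0.5) / scale


def horner_quantized(coeffs, x, frac_bits):
    """Horner evaluation with coeffs = [a_N, ..., a_1, a_0] and quantized products."""
    acc = quantize(coeffs[0], frac_bits)
    for a in coeffs[1:]:
        # one quantizer after the multiplication, one at the coefficient input
        acc = quantize(acc * x, frac_bits) + quantize(a, frac_bits)
    return acc


exact = ((0.3 * 0.7 + 0.2) * 0.7 - 0.1) * 0.7 + 0.5
approx = horner_quantized([0.3, 0.2, -0.1, 0.5], 0.7, 8)
error = approx - exact   # accumulated round-off error at the output
```

Repeating the evaluation with different values of `frac_bits` shows how the output error shrinks as the word length grows.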

Figure 3.6. Horner’s scheme for order 3 polynomial

Figure 3.8. A Horner’s scheme of order 3 polynomial after inserting noise sources and binary variables for using integer optimization tool

Figure 3.9. Order 5 polynomial using Estrin’s scheme

3.1.3 Estrin’s scheme

For the calculation of the bit requirement for Estrin’s scheme, the same approach is followed, keeping in view the structural differences compared with the previous schemes. Binary variable indexing is the main issue when converting from one scheme to another. In Estrin’s scheme the structure does not change in a regular fashion from one order to the next; e.g., the structures for N = 6 and N = 8 are completely different.

The code for optimizing the bit requirements for a particular noise variance at the output depends heavily on the structure of the scheme, so in this case new code has to be developed for the new structure shown in Fig. 3.9. The structure after introducing quantizations is shown in Fig. 3.10. Figure 3.11 shows the noise sources e_m1, e_m2, e_m3, e_m4, e_m5 and e_s1, e_s2, e_s3, e_s4, e_s5, e_s6, which are inserted in place of the quantizers as explained in Section 3.1.1.
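Why the structure changes irregularly with the order can be seen from a generic formulation of Estrin’s scheme. The Python sketch below uses the common pairwise formulation (combine neighbouring terms as t_i + t_{i+1}·x^(2^k) and square the power at each level); it is an illustration under that assumption, not a transcription of the structures in the figures, and the function name is hypothetical.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by pairwise combination (lowest order first)."""
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)                 # pad so every term has a partner
        # all pairs on one level are independent and can be computed in parallel
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power                 # x, x^2, x^4, ...
    return terms[0]


print(estrin([1, 2, 3, 4, 5, 6], 2))   # order 5 polynomial evaluated at x = 2
# prints 321
```

The number of terms on each level depends on the order, so the tree, and with it the indexing of the binary variables, has to be rebuilt for every N.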

3.1.4 Lie’s scheme

Figure 3.12 shows a structure realized according to Lie’s algorithm. In Fig. 3.13 quantizations are introduced, and in Fig. 3.14 the error sources are shown along with the indexing of the nodes and branches. As with Estrin’s scheme, a completely new structure appears whenever the order of the polynomial changes, so many changes are needed for every order of the polynomial.

Figure 3.10. Order 5 polynomial using Estrin’s scheme after insertion of quantizations

Figure 3.11. An Estrin’s scheme of order 5 polynomial after inserting noise sources and labeling for using integer optimization tool

Figure 3.13. Lie’s scheme for order 7 polynomial after inserting quantizations

Figure 3.14. Lie’s scheme of order 7 polynomial after inserting noise sources and labeling for using integer optimization tool

3.2 Pipeline registers

Registers are introduced after each arithmetic operation to pipeline the structure. The size of these registers is not discussed; we are more concerned with their number. The insertion of the registers is shown by lines cutting the connections of the structure, drawn as blue dotted lines in the figures.

3.2.1 Horner’s scheme

Registers inserted in a Horner structure of order 6 are shown in Fig. 3.15. The relationship between the number of registers and the order of the polynomial is given by

Number of registers = (N + 1)(2N − 1), where N is the order of the polynomial.
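A quick Python check of this formula for the order 6 case is sketched below. It assumes that the labels 2, 3, ..., 12 attached to the cut lines in Fig. 3.15 are the numbers of registers on each cut, which is an interpretation of the figure; the function name is hypothetical.

```python
def horner_registers(order):
    """Number of pipeline registers for Horner's scheme of the given order."""
    return (order + 1) * (2 * order - 1)


order = 6
per_cut = range(2, 2 * order + 1)    # assumed registers per cut line: 2, 3, ..., 12
print(sum(per_cut), horner_registers(order))
# prints "77 77"
```

Under this reading, each additional pipeline cut carries one more signal than the previous one, which gives the quadratic growth in N.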

Figure 3.15. Pipelining in order 6 polynomial using Horner’s scheme

3.2.2 Even Odd scheme

Even Odd scheme structure for a polynomial of order 6 is shown in Fig. 3.16 with registers inserted after each arithmetic operation. The relation between the number of registers and the order of polynomials is given by:

For even N,

number of registers = N(N + 2) + 2.

For odd N,

number of registers = N(N + 3) + 1,

where N is the order of the polynomial.
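The two register-count formulas can be tabulated side by side with the one given for Horner’s scheme in Section 3.2.1. The Python sketch below only evaluates the formulas as stated; the function names are hypothetical.

```python
def horner_registers(order):
    """Register count for Horner's scheme: (N + 1)(2N - 1)."""
    return (order + 1) * (2 * order - 1)


def even_odd_registers(order):
    """Register count for the Even Odd scheme, with separate formulas for even and odd N."""
    if order % 2 == 0:
        return order * (order + 2) + 2
    return order * (order + 3) + 1


for n in range(3, 9):
    print(n, horner_registers(n), even_odd_registers(n))
# for N = 6 this prints "6 77 50"
```

For the orders 4 to 8 the Even Odd scheme needs fewer registers than Horner’s scheme according to these formulas, in addition to its shorter critical path.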

3.2.3 Direct Evaluation

A polynomial of order 6 using the direct evaluation scheme is shown in Fig. 3.17, with registers introduced after each arithmetic operation.

3.2.4 Estrin’s scheme

A polynomial of order 6 using Estrin’s scheme is shown in Fig. 3.18, with registers introduced after each arithmetic operation.

Figure 3.16. Pipelining in order 6 polynomial using Even Odd scheme

Figure 3.17. Pipelining in order 6 polynomial using Direct Evaluation

3.2.5 K-th Order Horner’s scheme

K-th order Horner’s scheme structure for a polynomial of order 6 is shown in Fig. 3.19 with registers inserted after each arithmetic operation.

3.2.6 Lie’s Algorithm

Figure 3.20 shows the structure of Lie’s algorithm with pipeline registers introduced after each arithmetic operation.

3.2.7 Algorithm A´

Figure 3.21 shows the structure of Algorithm A´ with pipeline registers introduced after each arithmetic operation.

Figure 3.18. Pipelining in order 6 polynomial using Estrin’s scheme

Figure 3.20. Pipelining in order 5 polynomial using Lie’s Algorithm

Chapter 4

Results

4.1 Bit requirements for selected schemes

The calculation of the bit requirements for the different schemes is explained in Section 3.1; here only the results are given. All values correspond to noise variance constraints of 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, and 1e-10, for polynomials of order 3 to 8. This range was chosen because of the memory requirements of the optimization tool: in some cases the tool required more memory than was available.

The multiplication constant, i.e., X, is shown in the triangle in the figures and is assumed to be 11 bits wide. The boxes before and after the additions and multiplications show the number of bits at that stage of the evaluation. The bit requirements in each table should be read alongside the corresponding figure.

4.1.1 Horner’s scheme

This section gives the bit requirements for polynomials of order 3 to 7 implemented with Horner’s scheme, which is explained in Section 1.3.1.

3rd order polynomial

The bit requirements for a polynomial of order 3 implemented with Horner’s scheme (Section 1.3.1) are given in Table 4.1 and shown in Fig. 4.1.

Figure 4.1. Bit requirements of Horner order 3 given in Table 4.1.

Table 4.1. Bit requirements of Horner’s scheme of order 3 for different error rates, as shown in Fig. 4.1.

Different   Bit requirement for different error rates
points      1e-5   1e-6   1e-7   1e-8   1e-9   1e-10
A           5      7      8      9      11     13
B           6      7      9      11     13     14
C           6      7      9      11     12     14
D           6      7      9      11     13     14
E           7      9      10     12     14     15
F           7      9      10     12     14     15
G           7      9      10     12     14     15
H           19     15     14     20     15     19
I           8      10     19     16     20     18
J           19     15     19     20     20     19

4th order polynomial

The bit requirements for a polynomial of order 4 implemented with Horner’s scheme are given in Table 4.2 and shown in Fig. 4.2.

Different   Bit requirement for different error rates
points      1e-5   1e-6   1e-7   1e-8   1e-9   1e-10
A           4      5      7      9      10     12
B           5      7      8      10     12     14
C           5      7      8      10     12     14
D           5      7      8      10     12     14
E           6      8      10     11     13     15
F           6      8      9      11     13     14
G           6      8      10     11     13     15
H           7      9      11     12     14     15
I           7      9      11     12     14     15
J           7      9      11     12     14     15
K           9      10     12     19     15     17
L           20     17     20     13     16     18
M           20     17     20     19     16     18

5th order polynomial

Different   Bit requirement for different error rates
points      1e-5   1e-6   1e-7   1e-8   1e-9   1e-10
A           3      4      6      8      9      11
B           4      6      7      10     11     13
C           4      6      7      10     11     13
D           4      6      7      10     11     13
E           6      7      9      10     12     14
F           6      7      9      10     12     14
G           6      7      9      10     12     14
H           6      8      10     11     13     15
I           6      8      10     11     13     15
J           6      8      10     11     13     15
K           7      9      11     12     14     15
L           7      9      11     12     14     15
M           7      9      11     12     14     15
N           12     11     14     20     20     18
O           9      11     12     14     16     17
P           12     11     14     20     20     18

6th order polynomial

7th order polynomial
