Design and implementation of a hardware unit for complex division


Department of Electrical Engineering

Master's thesis in Computer Engineering, carried out at Tekniska högskolan i Linköping

by

Erik Alfredsson

LITH-ISY-EX--05/3724--SE

Linköping 2005

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Supervisor: Anders Nilsson

ISY, Linköpings universitet

Eric Tell

Examiner: Dake Liu


Department of Electrical Engineering, Linköpings universitet, S-581 83 Linköping, Sweden

Date: 2005-10-27

URL for electronic version: http://www.ep.liu.se/exjobb/isy/2005/3724

Title (Swedish): Design och implementering av en hårdvaruenhet för komplex division

Title: Design and implementation of a hardware unit for complex division

Author: Erik Alfredsson

Abstract

The purpose of the thesis was to investigate and evaluate existing algorithms for division of complex numbers. The investigation should include implementation of a few suitable algorithms in VHDL. The main application for the divider is compensation for fading in a baseband processor.

Since not much public research is done within the area of complex division in hardware, a divider based on real valued division was designed. The design only implements inversion of complex numbers instead of complete division because it is simpler and the application does not need full division, thus the required chip size is reduced.

An examination of the different kinds of algorithms that exist for real valued division was done and two of the methods were found suitable for implementation: digit recurrence and functional iteration. From each of the two classes of algorithms one algorithm was chosen and implemented in VHDL. Two different versions of the inverter were designed for each method, one with full throughput and one with half throughput. The implementations show very similar results in terms of speed, size and performance. For most cases however, the digit recurrence implementation has a slight advantage.


1 Introduction
1.1 Background
1.2 Purpose
1.3 Goals
2 Background
2.1 Complex valued division
2.2 Algorithms
2.2.1 Digit recurrence
2.2.2 Functional iteration
2.2.3 Very high radix
2.2.4 Lookup table
2.2.5 Variable latency
2.2.6 Summary
3 Implementation
3.1 Target environment
3.2 Internal representation
3.3 Suitability of the algorithms
3.3.1 Very high radix
3.3.2 Variable latency
3.3.4 Functional iteration
3.3.5 Lookup table
3.3.6 Complex algorithm
3.4 The choice of algorithm
3.4.2 Functional iteration with lookup table
3.5 Hardware Structure
3.5.1 Input/dividend prescaling
3.5.4 Postscaling
3.5.5 Control signals
3.5.6 Integration with host processor
4 Simulation and verification
4.1 The application and performance measurement
4.2 Output range
4.3 Internal representation
4.4 Verification
5 Synthesis and physical implementation
6 Results and discussion
6.1 Area and speed comparison
6.1.1 Pipelining
6.1.2 Area report
6.1.3 Comparison
6.2 Limitations
6.3 Final conclusions
Bibliography


Introduction

Division is an operation that people often try to avoid when writing computer software, especially for embedded systems where high performance is required. The reason is that division is a rather complicated operation and thus requires many resources. If it is done in software it takes a long time to calculate. If both the hardware and the software are to be designed, the speed can be increased by using dedicated hardware for the task. The extra hardware becomes a cost in terms of chip area and, as so often, there is a trade-off between size and speed. Despite this there are still many applications where division is necessary. Some of the applications where it is not strictly necessary may still benefit from having a divider to achieve better results.

1.1 Background

In baseband processing for radio communication there is a need to manipulate complex numbers in various ways in order to decode the transmitted data from the digitized radio waves. One of the effects that the receiver has to take into account is fading. The radio waves from the transmitter will bounce on many different surfaces in many directions before ending up at the receiver. There is a resulting time shift between the signals from the different paths, and when they arrive at the antenna they will be added together. The result of this is that different parts of the frequency spectrum are attenuated differently. One way to compensate for this is to get an estimate of the attenuation and amplify the frequencies with low amplitude. The magnitude of the amplification is equal to the inverse of the estimated attenuation. The usually high data rates of the communication put a demand of high speed on the inversion. Therefore it needs to be implemented in hardware.


1.2 Purpose

Since division is not a standard operation for DSP processors and because it can be implemented in several different ways, there is no specific algorithm that is clearly the one to choose. It all depends on the requirements one has on its properties, such as accuracy, size and speed. The purpose of this thesis is to examine the different methods and algorithms that exist for high speed division of complex numbers in hardware. A few suitable algorithms should be selected and implemented in VHDL for evaluation. The implementation is expected to be a part of an existing baseband processor and should be able to handle the high speed requirements while keeping the size down.

1.3 Goals

The following goals were set for the thesis:

• The thesis should have a short introduction to existing division algorithms.
• The unit should perform complex valued 1/X operations.
• The unit should have a 1/throughput of one or two clock cycles.
• The unit should have a reasonable latency.
• The size of the unit should not exceed 10% of the target processor area.
• The RTL code should be suitable for ASIC implementation.
• The unit should handle a clock frequency of 100-200 MHz in a 0.18 µm process.

In case there is time, these additional tasks should be performed:

• Construction of a wrapper between the unit and the baseband processor BBP1.
• Integration of the unit with BBP1 and writing of a test program for the unit.


Background

2.1 Complex valued division

Almost all of the research in the field of hardware division is based on real valued operands. Only one article that targets complex division was found. This does not mean that we are forced to use that one. Division of complex numbers in Cartesian form can be done using a combination of real valued additions, multiplications and divisions. The division of the complex numbers x = a + bi and y = c + di looks like this:

\[
\frac{x}{y} = \frac{a+bi}{c+di} = \frac{(a+bi)(c-di)}{(c+di)(c-di)} = \frac{(a+bi)(c-di)}{c^2+d^2} = \frac{(ac+bd)+(bc-ad)i}{c^2+d^2}
\]

If this formula is used there can be overflow errors because of the intermediate computation of $c^2 + d^2$, even if the final result is within the number range used. A modified version of the formula that handles this much better can be found in [1]. That method however requires an extra division and is not used in this thesis. Instead, the number range is made large enough so that no overflow can occur.

If there are several numbers $x_i$ that need to be divided by one single number y, it may be beneficial to calculate the reciprocal of y, z = 1/y, and then multiply z with each $x_i$ to get the result. Only one inversion is then needed and for every number just one multiplication is required. If a complex multiplier already exists in the system, it can be used for this and the method saves area and probably also power. The memory usage will be a bit larger though, since the reciprocals need to be stored somewhere between the calculations. On the other hand one might be able to calculate the reciprocal and send it directly to the multiplier, thus eliminating the extra memory usage at the expense of having to use the inverter for every division. The calculation of z = 1/y will be done like this:

\[
\frac{1}{c+di} = \frac{c-di}{(c+di)(c-di)} = \frac{c-di}{c^2+d^2} = \frac{c}{c^2+d^2} - \frac{d}{c^2+d^2}\cdot i
\]

That requires two multiplications, one addition and two divisions.
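As a minimal sketch of the reciprocal formula above, the following Python function computes 1/(c + di) with real valued operations only and then reuses the reciprocal for several dividends. The name `complex_reciprocal`, the sample values and the use of floating point instead of the fixed point hardware arithmetic are assumptions made purely for illustration.

```python
def complex_reciprocal(c: float, d: float) -> tuple[float, float]:
    """Return (re, im) of 1/(c + di) using only real valued operations."""
    denom = c * c + d * d              # two multiplications, one addition
    if denom == 0.0:
        raise ZeroDivisionError("reciprocal of 0 + 0i is undefined")
    return c / denom, -d / denom       # two divisions (plus a sign change)


# Divide several numbers x_i by the same y: invert once, then multiply.
y = (3.0, 4.0)
z_re, z_im = complex_reciprocal(*y)
samples = [(1.0, 2.0), (-0.5, 0.25)]
quotients = [(a * z_re - b * z_im, a * z_im + b * z_re) for a, b in samples]
```

Each entry of `quotients` costs one complex multiplication, which is the operation the text proposes to run on an already existing multiplier.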

2.2 Algorithms

Now that we know that we can use real valued division there are several different algorithms to choose from. The article about complex division [1] required prescaling of the operands with a complex number, and that made it more complicated than using a real valued algorithm when it comes to calculation of the complex reciprocal. In the article [3] by Stuart Oberman and Michael Flynn, five main classes of division algorithms are defined based on the hardware structure: digit recurrence, functional iteration, very high radix, table lookup and variable latency. There are, however, several division algorithms that are a combination of two or more of these methods.

A brief description of the classes and how the algorithms work is given below.

2.2.1 Digit recurrence

The digit recurrence algorithms are based on subtraction. As the name implies there are a few steps that are repeated several times in order to get the result. For every iteration, one digit of the result is produced, thus it has to be repeated once for every digit in the result. A digit is not necessarily the same as one bit. The range of a digit is dependent on which radix the algorithm uses. A radix-2 algorithm produces one bit every iteration whereas a radix-4 algorithm produces two bits and so forth. So to produce an n-bit result, a radix-$2^b$ algorithm needs $k = n/b$ iterations. One could of course have a radix that is not a power of two, but that would complicate things so it is only used in special applications where it is needed. One example is the use of radix-10 in pocket calculators. A sometimes useful property of this class of algorithms is that the remainder comes for free. A partial remainder is calculated in every iteration and after the last iteration the partial remainder is the final remainder. This is of course useful in the cases when the remainder is needed directly, but it is more often used for rounding, as the IEEE standard for floating point numbers requires correct rounding of the result.


Below is a slightly more detailed explanation of the algorithm. You might recognize the basic principle as the same as in the pen-and-paper division method that you probably learned in school.

Let us say that we want to divide a by b to retrieve the result c; a/b = c. What we actually are looking for is a number c that when multiplied with b gives a as a result: c · b = a. If a, b and c are integers represented in base ten (radix-10) and smaller than 1000, they can be written like this:

\[
x = 100x_2 + 10x_1 + x_0; \quad x \in \{a, b, c\}, \quad 0 \le x_i \le 9
\]

We now have to decide $c_0$, $c_1$ and $c_2$ in such a way that $b \cdot (100c_2 + 10c_1 + c_0) = a$. $c_2$ is decided first by choosing it as the largest number that fulfills the inequality $100c_2 \cdot b \le a$. The equation now becomes $b \cdot (10c_1 + c_0) = a - b \cdot 100c_2 = d$, where the right part is a constant, d.

$c_1$ is chosen in the same way: select the largest $c_1$ that fulfills $b \cdot 10c_1 \le d$. The equation is again rewritten as $b \cdot c_0 = d - b \cdot 10c_1 = e$.

Finally, $c_0$ is chosen as the largest value that fulfills $b \cdot c_0 \le e$. Whatever is left of e after this step is the remainder, so the first equation holds exactly only when the remainder is zero. Here is a numerical example: a = 538, b = 4

100 · b = 400, so $c_2 \cdot 400 \le 538 \Rightarrow c_2 = 1$

d = 538 − 1 · 400 = 138

10 · b = 40, so $c_1 \cdot 40 \le 138 \Rightarrow c_1 = 3$

e = 138 − 3 · 40 = 18

b = 4, so $c_0 \cdot 4 \le 18 \Rightarrow c_0 = 4$

The answer then is 538/4 = 134.

The remainder is the difference between e and $c_0 \cdot b$, that is 18 − 4 · 4 = 2.
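The pen-and-paper procedure above can be sketched in a few lines of Python. The function name `digit_recurrence_divide` and its parameters are hypothetical, and the digit selection is done by plain trial comparison, which is the naive form of the method rather than what a hardware divider would use.

```python
def digit_recurrence_divide(a: int, b: int, digits: int = 3, radix: int = 10):
    """Return (quotient, remainder) of a / b, one digit per iteration."""
    if b <= 0 or a < 0:
        raise ValueError("sketch handles a >= 0 and b > 0 only")
    if a >= b * radix ** digits:
        raise ValueError("quotient does not fit in the requested digits")
    quotient = 0
    remainder = a                      # partial remainder starts as the dividend
    for position in range(digits - 1, -1, -1):
        weight = radix ** position     # 100, 10, 1 for three decimal digits
        digit = 0
        # Select the largest digit whose multiple of the divisor still fits.
        while digit + 1 < radix and (digit + 1) * weight * b <= remainder:
            digit += 1
        remainder -= digit * weight * b
        quotient += digit * weight
    return quotient, remainder


print(digit_recurrence_divide(538, 4))   # (134, 2)
```

Calling the function with `radix=2` produces one bit per iteration, which corresponds to the radix-2 case mentioned earlier.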

SRT

The most common digit recurrence algorithm is called SRT. The name comes from the initials of Sweeney, Robertson and Tocher, who developed the algorithm independently of each other at approximately the same time. The algorithm assumes that both the dividend and the divisor are in the range [1,2). It can be implemented using any radix, but 2 and 4 are the most common ones. Since it operates on the specified number range, simplifications can be made and this may make it look a bit different from the algorithm explained above, even though it is the same principle. One simplification is that a large extended number range for the partial remainder is not needed.

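The algorithm looks like this, where r is the radix, $P_j$ the partial remainder and $q_j$ the quotient digits:

\[
r \cdot P_0 = \text{dividend}
\]

\[
P_{j+1} = r \cdot P_j - q_{j+1} \cdot \text{divisor}
\]

The resulting quotient is

\[
\sum_{j=0}^{k} q_j \cdot r^{-j}
\]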

For every iteration one goes through the following steps:

• Select a value for the next quotient digit $q_{j+1}$.
• Generate $q_{j+1} \cdot \text{divisor}$.
• Get $P_{j+1}$ by subtracting $q_{j+1} \cdot \text{divisor}$ from the shifted partial remainder $r \cdot P_j$.

In the worst case, the selection function would have to generate all possible multiples of the divisor (for example 0 to 9 times the divisor if radix-10 is used) and subtract each of them from the current partial remainder in order to select the largest multiple, $q_{j+1}$, that gives a positive result, $P_{j+1}$.

In order to simplify the selection process, redundant number systems are used for the quotient digits. The quotient can then be represented in more than one way. This means that there is more freedom in the choice of the quotient digits and thus a simpler selection function can be used. The resulting quotient is of course converted back to standard binary representation before the result is presented.

For every iteration one full length subtraction has to be made, which results in a long carry chain. This can be avoided by using a redundant number system for the partial remainder too. Using a carry save adder structure is one way to do this. Although doing that will complicate the selection function, it is done anyway because it is still beneficial.
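To illustrate how a redundant digit set simplifies the selection, here is a small radix-2 sketch with quotient digits in {-1, 0, 1}. It is not the divider designed in this thesis: the function name `srt_radix2`, the selection thresholds of ±1 (the usual textbook radix-2 constants scaled to operands in [1,2)) and the use of exact fractions instead of a carry save partial remainder are all assumptions made for illustration.

```python
from fractions import Fraction


def srt_radix2(dividend: Fraction, divisor: Fraction, iterations: int):
    """Radix-2 SRT sketch with quotient digits in {-1, 0, 1}.

    Both operands are assumed to lie in [1, 2). Returns the quotient
    digits, the quotient they represent and the final partial remainder.
    """
    if not (1 <= dividend < 2 and 1 <= divisor < 2):
        raise ValueError("operands must be in [1, 2)")
    r = 2
    p = dividend / r                   # r * P_0 = dividend
    digits = []
    for _ in range(iterations):
        shifted = r * p
        # The selection only compares against two constants; the overlap
        # allowed by the redundant digit set makes this coarse test safe.
        if shifted >= 1:
            q = 1
        elif shifted < -1:
            q = -1
        else:
            q = 0
        p = shifted - q * divisor      # P_{j+1} = r * P_j - q_{j+1} * divisor
        digits.append(q)
    # With this scaling the first digit has weight 1, the next 1/2, and so on.
    quotient = sum(Fraction(q, r ** j) for j, q in enumerate(digits))
    return digits, quotient, p
```

In this sketch the partial remainder never exceeds the divisor in magnitude, so after k iterations the quotient is within $2^{1-k}$ of the exact result. A negative digit simply corrects an earlier digit that was chosen slightly too large.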

2.2.2 Functional iteration

This class uses multiplication as its base operation. The advantage of this is that it can give quadratic convergence of the result, as opposed to the linear convergence in the case of digit recurrence. This means that the number of correct bits in the output doubles for every iteration. This is a very important property when dealing with wide operands such as standard double precision floating point numbers, whose significand is 53 bits (plus sign and exponent). Usually one starts with some coarse approximation of the result, for example from a lookup table, and then iterates the function for as many steps as needed. If a 16-bit result is desired, one may start with a 4-bit approximation and iterate two times, giving 8 bits and then 16 bits of correct result. A problem with this class of algorithms is that it is hard to get a correctly rounded result.


Newton-Raphson

One of the most used algorithms in this class is the well-known Newton-Raphson method. It is used to calculate the reciprocal of the divisor, which is then multiplied with the dividend. Because Newton-Raphson solves the equation f(x) = 0, we need a function that has a root at the reciprocal of the divisor. There are many such functions and a commonly used one is $f(x) = 1/x - b$, where b is our divisor. The iteration equation for Newton-Raphson looks like this:

\[
X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}
\]

With the goal function inserted we get

\[
X_{i+1} = X_i \cdot (2 - b \cdot X_i)
\]

This means we need two multiplications and one subtraction for every iteration.
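The quadratic convergence is easy to observe numerically. The sketch below iterates the reciprocal recurrence in floating point; the function name `newton_raphson_reciprocal`, the divisor 1.7 and the starting value 0.5625 (a coarse approximation of 1/1.7 with four fraction bits) are assumptions chosen for illustration.

```python
def newton_raphson_reciprocal(b: float, x0: float, iterations: int) -> list[float]:
    """Return the successive approximations of 1/b, starting from x0."""
    approximations = [x0]
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - b * x)          # two multiplications, one subtraction
        approximations.append(x)
    return approximations


b = 1.7
xs = newton_raphson_reciprocal(b, 0.5625, 3)
relative_errors = [abs(1.0 - b * x) for x in xs]
```

Since $1 - b \cdot X_{i+1} = (1 - b \cdot X_i)^2$, each entry of `relative_errors` is the square of the previous one, which is what doubling the number of correct bits per iteration means.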

Taylor series or Goldschmidt’s algorithm

Another method is to use Taylor series expansion. This is sometimes referred to as Goldschmidt's algorithm. A straightforward approach would be to expand $z = 1/y$ around the point p = 1. It is however better to choose $z = 1/(1+y)$ and expand that around p = 0, which is the Maclaurin series, because it results in simpler calculations. With y = b − 1 the quotient q = a/b then becomes

\[
q = a \cdot \frac{1}{1+y} = a \cdot (1 - y + y^2 - y^3 + \ldots)
\]

which can be written as

\[
q = a \cdot (1-y)(1+y^2)(1+y^4)(1+y^8)\ldots
\]

By setting

\[
q_i = \frac{N_i}{D_i}
\]

it can be implemented iteratively. $D_i$ will converge towards 1 and $N_i$ towards the quotient. To begin, set $N_0 = a$, $D_0 = b$ and $R_0 = 1 - y$. For each iteration calculate

\[
N_{i+1} = N_i \cdot R_i, \qquad D_{i+1} = D_i \cdot R_i, \qquad R_{i+1} = 2 - D_{i+1}
\]

After i iterations the result is

\[
N_i = a \cdot (1-y)(1+y^2)(1+y^4)\ldots(1+y^{2^{i-1}}), \qquad D_i = 1 - y^{2^i}
\]

With 0.5 ≤ b < 1, the magnitude of y is less than one and the precision doubles for each iteration. To reduce the required number of iterations, a and b should be prescaled by an approximation of the reciprocal 1/b. The iteration steps may seem very similar to those of Newton-Raphson. They are in fact mathematically the same if y = b − 1 and $X_0 = 1$.
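A minimal floating point sketch of this iteration is given below. The update rule used, where numerator and denominator are multiplied by the same factor and the next factor is two minus the new denominator, is the usual textbook form of Goldschmidt's algorithm and is an assumption here; the function name `goldschmidt_divide` is hypothetical and no prescaling by a table value is included.

```python
def goldschmidt_divide(a: float, b: float, iterations: int) -> float:
    """Approximate a / b for 0.5 <= b < 1 by driving the denominator to 1."""
    if not 0.5 <= b < 1.0:
        raise ValueError("sketch assumes a divisor in [0.5, 1)")
    n, d = a, b                        # N_0 = a, D_0 = b
    r = 2.0 - d                        # R_0 = 1 - y with y = b - 1
    for _ in range(iterations):
        n *= r                         # numerator and denominator are scaled
        d *= r                         # by the same factor, so n / d is kept
        r = 2.0 - d                    # next correction factor
    return n                           # d is close to 1, so n is close to a / b


approx = goldschmidt_divide(0.6, 0.75, 5)
```

Unlike Newton-Raphson, the two multiplications in each step do not depend on each other, so they can run in parallel or be pipelined through one multiplier.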

2.2.3 Very high radix

Digit recurrence algorithms are very useful for low radices but not suitable for higher radices because of the increased complexity of the quotient selection hardware. To get high radix division, a variant of the digit recurrence algorithm with simpler quotient digit selection hardware can be used. This class is for algorithms that retire more than around 10 bits per iteration. The algorithms use lookup tables for an initial approximation and multiplication to generate multiples of the divisor.

Oberman and Flynn [3] put two algorithms into this category: Accurate Quotient Approximation and Short Reciprocal. As these algorithms are targeting systems with numbers that are several tens of bits in width, they are not useful for the purpose of this thesis and are therefore not described in more detail. Interested readers can read the original article.

2.2.4 Lookup table

Using look-up tables as a direct method of obtaining a reciprocal may be useful if the requirement on the precision is low. The size of the lookup table grows exponentially with the accuracy and thus a high accuracy result may require a huge table. An advantage of the look-up table is its speed; there are no arithmetic calculations that slow things down. Using a lookup table is a very common way to get a starting approximation for a functional iteration or a very high radix algorithm.

If the number b used to index a reciprocal table is a standard normalized floating point number in the range [1,2), it can be done as follows. Since b always has a leading one it can be removed and the next k bits are used as index. The reciprocal is in the range (0.5, 1] and for all cases except when b = 1.0 the reciprocal has 0.1 (binary) as leading bits, which are implicitly assumed and not stored in the table. m bits of the reciprocal are stored in the table, giving an (m + 1)-bit result. The case of b being exactly one is detected and handled by separate hardware. The resulting size of the table is $2^k \cdot m$ bits. The values in the table are often chosen as the reciprocal of the midpoint between the index and the next index, rounded to the required number of bits. This will minimize the relative error of the table. If m is chosen as k it can be shown that the maximum relative error is less than 1.5 in the (k + 1)st bit position, that is $1.5 \cdot 2^{-(k+1)}$. If the precision of the table is denoted by the number α that gives a relative error of $2^{-\alpha}$, the precision of the table is at least k + 0.415 bits.
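The midpoint construction can be tried out with a short script. This is a simplified sketch under stated assumptions: entries are stored as ordinary numbers rounded to m + 1 fraction bits, so the implicit leading bits and the special handling of b = 1.0 are not modelled, and the function names are hypothetical.

```python
def build_reciprocal_table(k: int, m: int) -> list[float]:
    """Table of 2**k reciprocals for b in [1, 2), one per k-bit index."""
    scale = 2 ** (m + 1)               # entries keep m + 1 fraction bits
    table = []
    for index in range(2 ** k):
        midpoint = 1.0 + (index + 0.5) / 2 ** k
        table.append(round(scale / midpoint) / scale)
    return table


def lookup_reciprocal(table: list[float], k: int, b: float) -> float:
    """Index the table with the k most significant fraction bits of b."""
    index = int((b - 1.0) * 2 ** k)
    return table[index]


def max_relative_error(k: int, m: int, steps_per_entry: int = 64) -> float:
    """Largest |table(b) * b - 1| over a grid of b values in [1, 2)."""
    table = build_reciprocal_table(k, m)
    worst = 0.0
    for i in range(2 ** k * steps_per_entry):
        b = 1.0 + i / (2 ** k * steps_per_entry)
        worst = max(worst, abs(lookup_reciprocal(table, k, b) * b - 1.0))
    return worst
```

Calling `max_relative_error(k, k)` for a few values of k and comparing the result with $1.5 \cdot 2^{-(k+1)}$ gives a quick feel for how the table size trades against precision.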

Instead of just using one look-up table for the reciprocal, one can combine a couple of tables with some arithmetic to get a polynomial approximation. By using two tables one can get a linear approximation.

\[
P(b) = -C_1 \cdot b + C_0
\]

The constants $C_1$ and $C_0$ in the polynomial are looked up in the tables and P(b) is calculated. By using k bits to index the two tables and returning m = 2k + 3 bits from each of them, the approximation guarantees a precision of 2k + 2 bits. The total table size needed is $2^k \cdot m \cdot 2$ bits, and in addition an m × m bit multiplier and an adder are needed.

2.2.5 Variable latency

All the algorithms we have studied so far have completed their task in a constant number of cycles. The algorithms that fit into this section are the ones that may need a different number of cycles depending on the numbers to divide. The effect of this is that the average latency decreases while the maximum latency may not, it may even increase in some cases.

Variable latency can be achieved in different ways; one implementation that uses a variant of the SRT algorithm does it as follows. Whenever a series of only ones or zeros is found in the partial remainder, it can retire around the same number of quotient bits as the length of the series. Another technique that is used is result caches. The inputs to a division operation are sometimes the same as in earlier calculations. This can be taken advantage of by storing previous results in a cache and looking for the answer there whenever a new division is executed. The advantage of a cache is of course larger if the division algorithm used calculates the reciprocal, like for example the Newton-Raphson method. This means that only the reciprocal needs to be stored and it will match more division operands than a table where both the dividend and the divisor need to match. The cache size is as usual a trade-off between speed and area and should be compared with other possible uses of that area, such as improvements to the division algorithm itself. When using a high latency, small area algorithm like SRT it is probably more efficient to improve the divider itself instead of using a cache, but for lower latency algorithms that need further improvements it may be advantageous.

There is also a possibility to use speculative quotient digit selection. This means that the quotient selection logic is reduced in such a way that it may sometimes be mistaken in its selections. The effect is smaller and simpler selection logic and hopefully a faster divider. The cases when the selection is wrong must then be detected and compensated for by at least one extra iteration. If the selection logic is right most of the time it can be worth a couple of extra cycles every now and then. The hardware needed to detect errors may in some cases increase the size of the divider beyond the size of a non-speculative version of the divider.

2.2.6 Summary

Here is a summary of the methods and their properties:

• Digit recurrence
  – Iterative.
  – Small due to the use of addition/subtraction.
  – Linear convergence.
  – High latency.
  – Low throughput.
  – Can be unrolled to get high throughput at the cost of area.
  – Useful where small area is required.
• Functional iteration
  – Iterative.
  – Larger than digit recurrence due to the use of multiplication.
  – Quadratic convergence.
  – Medium throughput.
  – Can be unrolled to get high throughput at the cost of area.
• Very high radix
  – Iterative.
  – High radix.
  – Low or medium latency.
  – Medium to high throughput.
  – Useful for large operands at high speed.
• Lookup table
  – Low latency.
  – High throughput.
  – Large area.
  – Useful for very low precision and for initial approximations for the other methods.
• Variable latency
  – Modified versions of the above.
  – Variable latency.
  – Low to high throughput.
  – Small to large area.


Implementation

3.1 Target environment

In the target processor where the unit is to be used, the complex values are represented as two 16-bit fixed point numbers, the real part and the imaginary part. Both the real and imaginary part inputs will be treated as belonging to the range [-1,1). The application will use the unit to compensate for fading and thus the incoming samples should be divided by the estimate of the channel attenuation. The estimate will probably not change for every symbol received, so the same estimate will be used multiple times. By only calculating the reciprocal of the estimate in the unit and then multiplying that with the incoming samples in the existing multiplier, area can be saved. It will probably also save power as the divider does not need to be used for every sample, only when the estimate is updated. The estimate will be 64 samples long or even more and the unit will thus operate on series of at least 64 samples.

An important property for the divider is the throughput. To keep up with the high sample rate it needs to be able to accept new input every, or at least every other, clock cycle. The latency of the unit is not very important as long as it is not several tens of cycles. The total time to invert a whole series of numbers decreases by only one single cycle if the inverter latency is decreased by one cycle. On the other hand, if 1/throughput is decreased by one clock cycle, the time for the whole inversion decreases by roughly the size of the series. That is assuming that the sending and receiving units for the data can handle the higher speed.
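The argument can be made concrete with a simple cycle count. As a rough model, with the symbols N, L and c introduced here only for illustration, a pipelined unit with a latency of L cycles and a 1/throughput of c cycles inverts a series of N numbers in

\[
T(N) = L + (N - 1) \cdot c
\]

cycles. For N = 64, reducing L by one cycle saves one cycle in total, whereas reducing c from 2 to 1 saves N − 1 = 63 cycles.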


3.2 Internal representation

By using an internal representation that is a type of floating point, much precision is gained and the multipliers and dividers are fully utilized. The inputs are scaled to the range [1, 2) and all calculations are done based on that. This also suits the dividers nicely since they are designed for floating point numbers in this range. After all calculations are done the results are scaled back to their correct values and sent out as fixed point numbers again.
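The scaling idea can be sketched as follows for a positive fixed point value. The function names, the choice of 15 fraction bits for a 16-bit number in [-1,1) and the omission of sign and zero handling are assumptions made for illustration; they are not taken from the hardware described in this thesis.

```python
def normalize(value: int, frac_bits: int = 15) -> tuple[float, int]:
    """Split a positive fixed point value into (mantissa, exponent).

    `value` is the raw integer of a number with `frac_bits` fraction bits.
    The result satisfies value / 2**frac_bits == mantissa * 2**exponent
    with 1 <= mantissa < 2.
    """
    if value <= 0:
        raise ValueError("sketch handles positive values only")
    msb = value.bit_length() - 1       # position of the leading one
    mantissa = value / 2 ** msb        # leading one moved to the units place
    exponent = msb - frac_bits
    return mantissa, exponent


def denormalize(mantissa: float, exponent: int) -> float:
    """Scale a result back to its true value."""
    return mantissa * 2.0 ** exponent


m, e = normalize(12288)                # raw value of 0.375 with 15 fraction bits
reciprocal = denormalize(1.0 / m, -e)  # invert the mantissa, negate the exponent
```

Because the mantissa always lies in [1, 2), its reciprocal lies in (0.5, 1], and only the exponent needs to change sign when the result is scaled back.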

3.3 Suitability of the algorithms

This section explains why each class of algorithms presented in the previous chapter is or is not suitable for implementation.

3.3.1 Very high radix

The relatively small size of the operands (16 bits) and the relaxed latency requirement made the very high radix algorithms unsuitable. They are designed to be fast for large numbers with several tens of bits, and for our purpose the latency would be only a couple of cycles, which is faster than needed. The hardware required for such a divider, with its large tables and multipliers, is much larger than what can be achieved when a higher latency is accepted. The very high radix dividers would simply consume an unnecessarily large area.

3.3.2 Variable latency

The variable latency algorithms were also discarded because it would be difficult to take advantage of the decreased average latency, for the following reason. Most of those algorithms are iterative and will have the result ready after a variable number of cycles. The throughput for such a divider is equal to 1/latency, and as 1/throughput needs to be one or two cycles the latency cannot be any larger. This puts a requirement on the latency that is hard to meet and it will increase the area of the divider significantly. The iterative structure can be made non-iterative by duplicating the hardware and putting the copies in series instead. This however would defeat most of the benefits of the lower average latency because we would need hardware for the worst case latency. That means that if the average latency is 6 cycles and the maximum latency is 9 cycles, we still need to have 9 of the hardware units in series. Also, the throughput of such a divider is not dependent on the latency, so it would be fixed to the required number of cycles. This makes the average latency advantage almost completely useless because the latency for dividing the series of samples could only decrease by the very few cycles that can be saved for the last sample of the series.

One could argue that only the minimum number of hardware units is needed, and that if more iterations are required the pipeline would stall and the unfinished result would be fed back to go through some of the iteration units again. This however changes the throughput and one needs to make sure that the iteration feedback is not needed too often. The hardware for such a unit would also be so large that a fixed latency algorithm is probably around the same size. Another very important property of the fixed latency algorithms is predictability. The target environment is a hard real-time system and as such predictability is necessary. The processor is not allowed to miss any deadline; for example, when new data comes on the channel it must be ready to receive it and not be busy doing calculations on previous data. By knowing exactly how many cycles it takes to finish the calculations, it can schedule the inversions and know that it will be ready in time. If an algorithm with variable latency were used, the processor would have to rely on the maximum number of cycles needed to perform the inversion and base the scheduling on that. The decreased average time is then of no use.

3.3.3 Digit recurrence

The digit recurrence algorithms are a suitable class of dividers for our purpose. Their base operation of subtraction makes a small size possible. The latency might be larger than for other solutions but that is not a big problem since latency is not of primary concern. It is probable that at least one addition together with the corresponding quotient digit selection can be performed every clock cycle. This means that the latency for the division will not exceed 16 cycles since the operands are 16 bits, and it will probably be much less. The throughput on the other hand is, as mentioned, important. A digit recurrence implementation would need to be turned into a serial structure by removing the recursion, as will be described later. This is done to improve the throughput to the desired level.

3.3.4 Functional iteration

This type of divider might be useful if a good algorithm is found. Since they use multiplication as the base function they can potentially be too large, but with the quadratic convergence and 16-bit operands, only a few iterations may be needed if a low-resolution starting approximation is used. With an initial approximation of four bits two iterations are required, and with eight bits only one iteration.

3.3.5 Lookup table

Using only a lookup table for the implementation is the simplest solution, but it is not an area efficient one. A 16-bit divider would require a table with $2^{16}$ rows with 16 bits for each row. This means that a one megabit ROM is required. By using an input range of [1,2) and not storing the most significant bit of the table entries (which is always one), the table can be reduced to 256 kilobits with the addition of some logic. This is however still too large. A table could however be used in combination with a functional iteration algorithm. This requires only a low resolution result from the table, which will then not be so large.

3.3.6 Complex algorithm

Only one article about complex division was found [1]. It describes a digit recurrence algorithm that needs complex prescaling of the operands. It was discarded because of that, in combination with the fact that it divides two complex numbers rather than just calculating the reciprocal. Together these things made it too complex and would give a larger area. However, it could be useful if a full complex divider were to be implemented.

3.4 The choice of algorithm

As explained above, there are mainly two classes of dividers that would be suitable for our purpose. From them two algorithms were selected for comparison, one digit recurrence algorithm [4] and one mix of a look-up table and series expansion [2]. One of the reasons that the specific digit recurrence algorithm was chosen is its simple quotient selection function. The fact that it is radix-2 made it likely that the result would be a divider that could operate at high frequencies, and if the high speed was not necessary several radix-2 stages could be stacked to create a higher radix divider. Stacking would reduce latency and area because of the removed intermediate registers. The choice of the other divider was based on its relatively small table size and the fact that only a couple of multipliers were needed. Those two factors would probably make it fast enough and keep the latency low.


The first algorithm is, as mentioned, a digit recurrence radix-2 divider that uses an over-redundant number system for the quotient. Because it is radix-2 it outputs one bit of the result for every iteration. The quotient digits, however, are over-redundant, which means that each digit can take on any value from −2 to 2. The partial remainder is kept in a redundant form using a carry save adder structure. The quotient selection logic only needs to inspect the two most significant digits of the partial remainder in order to decide the new quotient digit. The quotient is converted to binary in two steps. The first step converts the over-redundant digits into ordinary redundant radix-2 digits, that is, digits that can have the value −1, 0 or 1. The second step converts the result into standard binary representation. The division algorithm, like most others, is designed for using standardised floating point numbers as inputs. That means that both the dividend and the divisor need to be in the range [1,2) and the output will be in the range [0.5, 2).

3.4.2 Functional iteration with lookup table

The other algorithm combines a lookup table with a couple of multiplications. It is not a real functional iteration algorithm in the sense that there is no iteration involved. It is still placed in this category since it uses multiplications and can be derived using Taylor series. It works like this: Let $X$ be the dividend and $Y$ be the divisor, both $2m$-bit numbers. The divisor is split in two parts, $Y_h$ and $Y_l$, such that $Y_h$ is the $m+1$ most significant bits and $Y_l$ is the $m-1$ least significant bits of $Y$. The quotient can then be written like this:

$$\frac{X}{Y} = \frac{X}{Y_h + Y_l} = \frac{X \cdot (Y_h - Y_l)}{Y_h^2 - Y_l^2}$$

$Y_l^2$ is much smaller than $Y_h^2$ (the msb of $Y_h$ is set due to normalization) and can therefore be neglected without a large error. A lookup table is used to get $1/Y_h^2$, which is then multiplied by $X \cdot (Y_h - Y_l)$ to get the desired result. The resulting approximation is the same as if the first two terms in the Taylor series are used:

$$\frac{X}{Y_h + Y_l} = \frac{X}{Y_h} \cdot \left(1 - \frac{Y_l}{Y_h} + \dots\right)$$

To get a precision of $2m$ bits in the answer, $2m + 2$ bits are needed from the table. The numbers in the table are scaled to have a leading one in the first bit. This bit can then be omitted from the table and added afterwards. The scaling factor can be calculated by some simple logic. Thus a table with $2^m$ rows (one for each value of the $m$ bits of $Y_h$ below its always-set msb) and $2m + 1$ bits per row is required.
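The approximation above can be sketched as follows. This is an illustration under stated assumptions rather than the implemented hardware: operands are integers with $2m - 1$ fractional bits representing values in [1,2), the table entries are kept as exact fractions instead of truncated $2m + 2$ bit words, and the names `build_table` and `taylor_divide` are hypothetical.

```python
from fractions import Fraction

def build_table(m):
    """Table of 1/Y_h^2, indexed by the m bits of Y_h below its leading one."""
    table = []
    for index in range(1 << m):
        y_h = 1 + Fraction(index, 1 << m)  # Y_h in [1, 2) with m fractional bits
        table.append(1 / (y_h * y_h))
    return table

def taylor_divide(x_int, y_int, m, table):
    """Approximate X/Y as X * (Y_h - Y_l) * (1 / Y_h^2)."""
    frac_bits = 2 * m - 1
    y_l_int = y_int & ((1 << (m - 1)) - 1)      # m-1 least significant bits
    y_h_int = y_int - y_l_int                   # m+1 most significant bits
    index = (y_h_int >> (m - 1)) - (1 << m)     # drop the always-set msb
    x = Fraction(x_int, 1 << frac_bits)
    y_h = Fraction(y_h_int, 1 << frac_bits)
    y_l = Fraction(y_l_int, 1 << frac_bits)
    return x * (y_h - y_l) * table[index]

# Hypothetical 16 bit example (m = 8): X = 1.5, Y = 0xA123 / 2^15.
m = 8
table = build_table(m)
approx = taylor_divide(0xC000, 0xA123, m, table)
exact = Fraction(0xC000, 0xA123)
print(float(approx), float(exact))
```

Since the only term neglected is $Y_l^2$, the relative error of the sketch is $Y_l^2 / Y_h^2$, which stays below $2^{-2m}$ for normalized operands.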


3.5 Hardware Structure

An overview of the hardware structure is presented here, and a more detailed description of each part follows below. There are four main steps. The first is sign adjustment and prescaling of the inputs into the range required for the dividends. The next step is to calculate the divisor and to scale it to the proper range. The third step is the actual division, which is the only place where the two implementations differ. The fourth step is postscaling of the result. A graphical representation of the structure can be viewed in figure 3.1. The inverter has a 1/throughput of one clock cycle, that is, it accepts a new input every clock cycle.

[Figure 3.1: block diagram. The 16 bit real and imaginary inputs each pass through "Prescale and adjust sign", "Divide" and "Postscale and adjust sign" blocks, together with one "Calculate prescaled absolute value" block. Inputs and outputs are 16 bits wide; internal signals are w bits wide.]

Figure 3.1. An overview of the standard hardware structure.

Most of the blocks in the previous structure are duplicated: one block handles the calculations on the real part and the other one those on the imaginary part. A version of the hardware structure that takes advantage of this redundancy to reduce the size has also been implemented. This is accomplished by introducing timesharing. This structure contains only one instance of each part and the area requirement is drastically decreased. The decreased area comes at a price though. The main drawback is that the inverter can no longer accept and deliver data in every clock cycle. The things that were previously calculated in parallel in the real and imaginary halves of the inverter are now calculated one after the other. New input can only be accepted every other clock cycle, thus the throughput is halved. The total time needed is not doubled however, only increased by one cycle. A representation of the timeshared structure can be viewed in figure 3.2.

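[Figure 3.2: block diagram. A single chain of "Prescale and adjust sign", "Calculate prescaled absolute value", "Divide" and "Postscale and adjust sign" blocks, followed by two registers with load signals for the real and imaginary outputs. Inputs and outputs are 16 bits wide; internal signals are w bits wide.]

Figure 3.2. An overview of the time-shared hardware structure.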

3.5.1 Input/dividend prescaling

The input to our block is two 16 bit fixed point numbers in the range [-1,1). The first thing that is done with them is to remove the sign. This is done by taking the ones' complement if the number is negative. By not using two's complement for the sign change a 16 bit adder can be saved, and simulations show no noticeable difference in SNR. Of course, removing the sign will generate a sign error in the output if not taken care of. This is handled by changing the sign of the output if the sign of the corresponding input was changed. The sign removal reduces the possible range of the real and imaginary parts to [0,1) and the required number of bits to 15.

The divider requires both inputs to be in the range [1,2). Since the inputs are not in this range to start with, they must be prescaled into the proper range and the output postscaled to get the correct result. The prescaling is done by finding the first nonzero bit, starting from the leftmost position. The number is then shifted the corresponding number of steps to the left. This results in the leftmost bit being set, and thus the range will be [0.5, 1). To get it into the range [1,2), an extension of the possible number range to [0,2) is necessary. This is done by just changing the interpretation of the bits to a 15 bit number ranging from 0 to 2 (moving the binary point one step to the right). This extra implied shift is compensated for after the division in the postscaling step. The RTL schemes of the block can be viewed in figures 3.3 and 3.4.
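The sign removal and prescaling steps can be sketched on integers as below. This is an illustration only: the function names are hypothetical, the input is taken as a signed 16 bit integer, and passing a zero magnitude through unshifted is an assumption, since the text does not say how zero is treated.

```python
def remove_sign(value16):
    """Return (15 bit magnitude, was_negative) using ones' complement."""
    if value16 < 0:
        # Ones' complement: ~v equals -v - 1, so no adder is needed.
        return (~value16) & 0x7FFF, True
    return value16, False

def prescale(magnitude15):
    """Shift left until bit 14 is set; return (shifted value, shift count)."""
    if magnitude15 == 0:
        return 0, 0  # assumption: zero is passed through unchanged
    shift = 15 - magnitude15.bit_length()
    return magnitude15 << shift, shift

# Hypothetical example: the magnitude 0x0123 needs six left shifts.
shifted, count = prescale(0x0123)
# Reading the 15 bits with one integer bit gives a value in [1, 2).
print(hex(shifted), count, shifted / 2**14)
```

Note that the ones' complement of −1 is 0 rather than 1, which is the one-LSB error that the text accepts in exchange for saving an adder.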

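[Figure 3.3: RTL scheme of the standard input/dividend prescaling block. The real and imaginary inputs each pass through "Adjust sign", "Find first 1" and "Shifter" blocks, and there is one "Min" block. Signals shown: Real, Imag, Real sign, Imag sign, a, b, x, y, z and the two dividends.]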

[Figure 3.4: RTL scheme. Blocks shown: Adjust sign, Find first 1, Min and Shifter. Signals shown: a/b, x/y, z, Sign and Dividend.]

Figure 3.4. RTL scheme of the time-shared input/dividend prescaling block.

3.5.2 Divisor generation and divisor scaling

As stated, the divisor also needs to be in the range [1,2). The divisor is the squared absolute value of the complex number. There are two approaches to scaling the divisor. The first is to calculate the divisor using the unscaled real and imaginary parts and then scale it to the required range. The other way is to calculate it using prescaled real and imaginary parts and then make a minor adjustment to the divisor to get it within the correct range. The latter method is preferred because a smaller adder and shifter can then be used without losing precision.

To determine which scale factor to use, the following reasoning is useful. Let the real part be called $a$ and the imaginary part $b$. $a$ is prescaled by a factor $2^x$ and $b$ by a factor $2^y$. The absolute value squared is

$$a^2 + b^2 = (2^{-x} \cdot a \cdot 2^x)^2 + (2^{-y} \cdot b \cdot 2^y)^2$$

Now, if $x = y = z$, we can write it as

$$a^2 + b^2 = 2^{-2z} \cdot \left((a \cdot 2^z)^2 + (b \cdot 2^z)^2\right)$$

In most cases $x$ and $y$ will not be equal, and thus, if we just square the prescaled values and add them together, we will not get the desired value. We need to scale them by the same factor, $z$, before the squaring and addition. Because of this we need separate scaling units for the divisor generation and the dividend generation. The scale factor $z$ must be chosen in such a way that neither $a \cdot 2^z$ nor $b \cdot 2^z$ falls outside of the number range for the dividend. On the other hand, one wants to choose $z$ as large as possible to gain precision. Therefore $z$ is chosen as the smaller of $x$ and $y$. The resulting square sum of the prescaled inputs is not necessarily in the range [1,2) that we need. At least one of the scaled numbers $a$ and $b$ will be in the range [0.25, 0.5), and the other one is in the range [0, 0.5). The square sum of these is in the range [0.125, 1), and an adjustment is needed to bring it into the range [1,2). This extra shift is accounted for in the postscaling section.

In the timeshared version there is a register following the multiplier. This is necessary for the calculations, as the adder following it should add $a^2$ and $b^2$. Since $a^2$ is calculated in one clock cycle and $b^2$ in the next, $a^2$ needs to be saved for one cycle, and this is the purpose of that register. The RTL schemes can be seen in figure 3.5 and figure 3.6.
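The choice of $z$ and the final adjustment can be illustrated with a small floating point sketch. It is not the fixed-point hardware: the function names are hypothetical, both inputs are assumed to be strictly positive magnitudes below one, and the sketch uses its own convention in which the larger prescaled input lies in [0.5, 1), so its intermediate ranges differ from the fixed-point ranges quoted above. Only the relation divisor = $(a^2 + b^2) \cdot 2^{2z+e}$ is meant to carry over.

```python
import math

def shifts_to_half(v):
    """Left shifts needed to bring v in (0, 1) into [0.5, 1)."""
    _, exponent = math.frexp(v)  # v = mantissa * 2**exponent, mantissa in [0.5, 1)
    return -exponent

def make_divisor(a, b):
    """Return (divisor in [1, 2), z, e) for positive a, b below one."""
    x = shifts_to_half(a)
    y = shifts_to_half(b)
    z = min(x, y)                        # common shift that cannot overflow
    square_sum = (a * 2**z) ** 2 + (b * 2**z) ** 2
    _, exponent = math.frexp(square_sum)
    e = 1 - exponent                     # extra shift that brings the sum into [1, 2)
    return square_sum * 2**e, z, e

# Hypothetical example values.
divisor, z, e = make_divisor(0.3, 0.05)
print(divisor, z, e)
```

Using the smaller of the two shift counts keeps the larger input inside the number range while still giving it full precision.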

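[Figure 3.5: RTL scheme of the standard divisor generation. Blocks shown: two shifters, two multipliers (*), one adder (+) and an "Adjust" block. Signals shown: a, b, z, e and Divisor.]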

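[Figure 3.6: RTL scheme. Blocks shown: one shifter, one multiplier (*), a register, an adder (+) and an "Adjust" block. Signals shown: a/b, z, e and Divisor.]

Figure 3.6. RTL scheme of the time-shared divisor generation.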

3.5.3 Dividers

The two real valued divisions that are needed in the algorithm are performed in this step. The implementation of the dividers is exactly the same in both the standard and the timeshared case; the difference is that the standard implementation has two of them. The SRT-based divider will be presented first and then the Taylor based one.

SRT divider

The internal structure of the SRT-divider can be viewed in figure 3.7. It is based on a block called an iteration block which is the part that does the repetitive calculations.

The iteration block consists of three main parts: an adder, a quotient selector and a quotient converter.

[Figure 3.7: block diagram. Blocks shown: register X (loaded with the dividend on Start), quotient selection, quotient conversion, a |q'|*D unit fed by the divisor, a carry save adder/subtractor controlled by Sign(q') and a x2 shift. The iteration block produces Z and the quotient digit q.]

Figure 3.7. An overview of the SRT divider structure.

The dividend is first loaded into the register. The next quotient digit, q', is then selected based on the two most significant digits of X. |q'| · D is calculated and, depending on the sign of q', subtracted from or added to the partial remainder X. The result, Z, is shifted one step left and fed back into the register to become the next partial remainder. The quotient converter is then responsible for converting the over-redundant radix-2 digits, ranging over [-2,2], into ordinary redundant radix-2 digits ranging over [-1,1]. The redundant result is represented by two numbers, one positive and one negative, r₊ and r₋. The correct answer is finally calculated as r = r₊ − r₋.
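The recurrence and the final subtraction r = r₊ − r₋ can be illustrated with a simplified sketch. Note that it swaps in the textbook radix-2 SRT recurrence with digit set {−1, 0, 1} and an exact, non-redundant partial remainder; it is not the over-redundant carry save algorithm of [4]. The function name, the selection thresholds and the initial halving of the dividend are assumptions of the sketch.

```python
from fractions import Fraction

def srt_divide(x, d, n):
    """Approximate x/d for x, d in [1, 2) using n signed quotient digits."""
    d = Fraction(d)
    w = Fraction(x) / 2          # halve the dividend so that |w| <= d initially
    r_pos = 0                    # collects the +1 digits
    r_neg = 0                    # collects the -1 digits
    for _ in range(n):
        w2 = 2 * w               # shift the partial remainder one step left
        if w2 >= 1:
            q = 1
        elif w2 < -1:
            q = -1
        else:
            q = 0
        w = w2 - q * d           # subtract or add the divisor
        r_pos = (r_pos << 1) | int(q == 1)
        r_neg = (r_neg << 1) | int(q == -1)
    # Conversion to ordinary binary; the scale undoes the initial halving.
    return Fraction(r_pos - r_neg, 1 << (n - 1))

# Hypothetical example: 1.5 / 1.25 with 16 quotient digits.
print(float(srt_divide(Fraction(3, 2), Fraction(5, 4), 16)))
```

Keeping the positive and negative digits in separate words means that the conversion to binary is a single subtraction at the end.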

The iterative structure can be converted into a serial structure. This is done by removing the feedback and instead duplicating the iteration block and connecting the copies in series for as many stages as are needed. For reference, see figure 3.8. This increases the hardware cost but also increases the throughput. When using the iterative structure, one has to wait N cycles to get an N bit result before any new data can be processed. With a completely serialized structure, new data can be input every cycle. Depending on the speed of the different parts there is also a possibility of decreasing the latency. If the iteration blocks are fast enough, one may be able to stack several of them after each other without any registers in between. The result of this is a structure that gives several quotient digits each clock cycle, and hence the radix is increased. If two blocks are stacked the result is radix-4, three blocks give radix-8, and so on.

[Figure 3.8: block diagram. The dividend and divisor enter a chain of iteration blocks separated by registers; each block takes X, produces Z (followed by a x2 shift) and outputs one quotient digit.]

Figure 3.8. The serialization of the divider.

[...] feedback. Let us say, for example, that 16 bits of output are requested. There is, as we have seen, a possibility to have one block and iterate it 16 times, or 16 blocks without any iteration. An intermediate approach is to connect 4 blocks in series, each with its own feedback. That means each block iterates 4 times before sending the result to the next block. This gives a latency of 16 cycles and a new input/result every 4th cycle. Another approach is to have 4 blocks in series inside one feedback loop, meaning that 4 bits are produced every cycle and the loop is iterated 4 times. This gives a latency of 4 cycles and the same throughput as the previous example, although the maximum possible clock frequency is decreased due to the longer path between the registers. A picture that illustrates the different options can be seen in figure 3.9.

[Figure 3.9: block diagrams of two 16 bit dividers built from four iteration blocks each, one with registers and control (Ctrl) per block and one with the four blocks chained between registers under a single Ctrl.]

Figure 3.9. Two different versions of a 16 bit divider. a) Latency = 16, 1/throughput = 4. b) Latency = 4, 1/throughput = 4.
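The trade-offs described above can be summarized in a small timing model. It is a simplification that counts cycles only and ignores the effect on the maximum clock frequency; the function and parameter names are hypothetical.

```python
def divider_timing(pipeline_stages, iterations_per_stage, blocks_per_cycle=1):
    """Return (result bits, latency in cycles, 1/throughput in cycles).

    pipeline_stages:      registered stages connected in series
    iterations_per_stage: times each stage feeds its result back to itself
    blocks_per_cycle:     iteration blocks stacked without registers in between
    """
    bits = pipeline_stages * iterations_per_stage * blocks_per_cycle
    latency = pipeline_stages * iterations_per_stage
    interval = iterations_per_stage
    return bits, latency, interval

# The four 16 bit configurations discussed in the text.
print(divider_timing(1, 16))     # one block iterated 16 times
print(divider_timing(16, 1))     # 16 blocks, no iteration
print(divider_timing(4, 4))      # figure 3.9 a)
print(divider_timing(1, 4, 4))   # figure 3.9 b)
```

In this model stacking blocks reduces latency without changing 1/throughput, while adding registered stages improves 1/throughput without changing latency.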

Taylor divider

The second type of divider is the Taylor expansion with look-up table. An overview of it can be seen in figure 3.10. The implementation follows straightforwardly from the derived equations. First, $Y_h - Y_l$ is calculated and multiplied by the dividend $X$. $Y_h$ is used to index the table, and $1/Y_h^2$ is received as output. After these stages there are two registers that are used for pipelining. The result is then calculated by the second multiplier, and a second pipeline stage is placed last. The latency of the divider is then 2 cycles from the input. The unit can accept a new value every cycle, so 1/throughput is 1. There is a possibility of timesharing a multiplier and thereby using just one multiplier instead of two. If this is implemented, the unit can only accept a new value every other cycle, so the throughput is decreased. The latency would still be the same. Doing this will, however, only be useful if the inverter structure is already timeshared and one would like to further decrease the area at the expense of throughput. If the standard full throughput inverter is used, it will be much more beneficial to change to the timeshared inverter structure instead of just timesharing the multipliers inside the division. 1/throughput will be at least four clock cycles if both are timeshared.

[Figure 3.10: block diagram. Blocks shown: multiplier (*), look-up table and three registers. Signals shown: Divisor, Dividend, Yh − Yl, Yh and r.]

Figure 3.10. An overview of the look-up divider.

3.5.4 Postscaling

As mentioned before, the output needs to be scaled. This is because of the scaling of the inputs. Suppose the dividend, $N$, is scaled by $2^u$ and the divisor, $D$, by $2^{2z+e} = 2^t$. The division result will then be

$$r = \frac{N \cdot 2^u}{D \cdot 2^t} = \frac{N}{D} \cdot \frac{2^u}{2^t} = \frac{N}{D} \cdot 2^{u-t}$$

This means that the output $R$ can be calculated as

$$R = \frac{N}{D} = r \cdot 2^{t-u}$$

That is, $r$ should be scaled by $2^{t-u}$. The output should have 7 integer bits, including the sign bit, and 9 fractional bits according to the simulation results presented in the next chapter. The output from the divider has one integer bit and no sign bit. This means that the output would have to be extended with 5 integer bits plus a sign bit and the 5 lowermost bits removed. Instead of doing this, the interpretation of the number is changed. The result is an implicit left shift of 5 steps that has to be compensated for. This is done by adjusting the scale factor by $2^{-5}$, that is, 5 right shifts. Remember that the dividend was scaled with one left shift before the division; therefore another right shift must be added to compensate for that, giving 6 shifts. The resulting scale factor then becomes $2^{t-u-6}$. If the exponent is positive it means a left shift, otherwise a right shift. However, a left shift will force the result outside of the number range, causing a faulty result. This means that only the right shift is implemented; if a left shift is called for, we make the best of the situation and saturate the answer to the maximum possible output.

Just before the scaling takes place, the result of the division is converted to standard binary format from the redundant form. This is accomplished by just subtracting the negative part of the result from the positive part, as mentioned previously. The sign of the output is also adjusted in this part, just before the result is ready. Again ones' complement is used to save the area of an adder, with no noticeable change in SNR. As in the case of the dividers, the implementation of this step looks the same in both the standard and timeshared versions. The RTL scheme can be seen in figure 3.11.

[Figure 3.11: RTL scheme of the postscaling block. Blocks shown: adders (+), a x2 unit, the constant 6, a shifter and "Adjust sign". Signals shown: z, e, x/y, r, R and Sign a/b.]
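The shift and saturation rule can be sketched as follows. This is an illustration under assumptions, not the RTL: the divider output `r` is taken to be an unsigned integer with one integer bit and 14 fractional bits, the returned integer is read with 7 integer bits (including sign) and 9 fractional bits, and the function name and the overflow flag are hypothetical.

```python
def postscale(r, t, u, negate, out_bits=16):
    """Scale the divider output r by 2**(t - u - 6); return (value, overflow)."""
    exponent = t - u - 6
    max_out = (1 << (out_bits - 1)) - 1
    if exponent > 0:
        # A left shift is never performed; the output is saturated instead.
        value, overflow = max_out, True
    else:
        value, overflow = r >> -exponent, False
    if negate:
        value = ~value  # ones' complement sign change: equals -value - 1
    return value, overflow

# Hypothetical example: r = 1.0 (0x4000), t = 3 and u = 1 give four right shifts.
value, overflow = postscale(0x4000, 3, 1, False)
print(value, value / 2**9, overflow)
```

As with the input sign removal, the ones' complement negation is one LSB away from the true negative value.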


3.5.5 Control signals

The unit is controlled by three input signals: reset, new_input and halt. The reset signal is only for initializing the unit to a start state. new_input is used to tell the unit that new data is available at the input. When no new data is available to a part of the pipeline, that part is stalled to save power. It will not change its state until valid data is received, so no unnecessary switching that draws power occurs. The halt signal can be used to stall the whole unit. This is useful if the unit that is to receive the output data cannot accept any more input, or for some other reason. No new input is accepted in the halt state, and when halt is released the unit continues its operations as if nothing had happened. The unit also signals when it is ready and is sending out a new result, and when an overflow occurs.

3.5.6 Integration with host processor

The inverter has been integrated into the baseband processor BBP1. To be compatible with it, an interface was written that handles the processor specific connections. An input buffer and an output buffer were also needed for this and were therefore added together with some logic to control them. A small program was written for the processor that tests the unit and the interface in some test cases. The system has been simulated using ModelSim and found to be working. More thorough testing is, however, needed to be sure that the unit does not contain any bugs.


Simulation and verification

In order to know how many bits are needed in various places in the design, different simulations are made. The first simulation is used to decide on the output range of the inverter. For practical reasons, Matlab is used to generate the simulation input and to analyze the output. A C-model that simulates the behaviour of the inverter is used. The C-model is faster and easier to simulate than the VHDL RTL code.

4.1 The application and performance measurement

There is a need to be able to compare different implementations against each other in terms of how well they perform. There are a few different things one can look at when defining the performance. The chosen method is to look at the application and how the inverter will be used in the processor, and from that derive a performance measure.

The values that are to be inverted are the channel estimates. To get a channel estimate for use in the simulations, a channel model is used. The model takes multipath fading into consideration, and there are several standard models that one can use. In the thesis two different versions of the JTC indoor channel model have been used, namely A and C. The models have taps representing the different paths the signal can take. Each tap consists of a delay and an average power. The delay is the delay between the straight path signal and the reflected path signal. The average power is assumed to be Rayleigh distributed.

For the simulations, the input data is generated in Matlab and saved to a text file. The data is fed into the C-model of the inverter, which saves its output to a file. This is read back into Matlab and the SNR is calculated. The generated data is for IEEE 802.11a and thus 64 values are generated every iteration, corresponding to one OFDM symbol. First, a channel estimate is needed. An impulse response, $h[n]$, of the channel is generated from the JTC model. This is transferred into the frequency domain, $H[f]$, by using a 64-point FFT. $H[f]$ is in floating point format and needs to be truncated to fit the input range of the C-model of the inverter. The C-model computes and returns the inverse of the channel estimate. Random 64-QAM data, $X$, to be 'sent' over the channel is created and 'transmitted' by multiplying the data and the channel. The inverse estimate is applied, resulting in $\hat{X}$, to try to remove the effect of the channel. With a perfect inverse, $\hat{X}$ should be equal to $X$. SNR is measured using the difference between them:

$$\mathrm{SNR} = 10 \cdot \log_{10}\left(\frac{\|X\|^2}{\left\|\frac{\mathrm{Quant}(XH)}{\mathrm{Quant}(H)} - X\right\|^2}\right) = 10 \cdot \log_{10}\left(\frac{\sum_{i=1}^{64} |X[i]|^2}{\sum_{i=1}^{64} |\hat{X}[i] - X[i]|^2}\right)$$
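The SNR measure can be sketched in a few lines. This is only an illustration of the formula and not the Matlab scripts used in the thesis; the function names are hypothetical, the two-value channel and data are made up, and the inverse is modelled as an exact reciprocal truncated to 9 fractional bits rather than by the C-model.

```python
import math

def truncate(z, frac_bits):
    """Truncate the real and imaginary parts of z to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return complex(math.floor(z.real * scale) / scale,
                   math.floor(z.imag * scale) / scale)

def snr_db(sent, recovered):
    """10*log10 of signal energy over the energy of the recovery error."""
    signal = sum(abs(x) ** 2 for x in sent)
    noise = sum(abs(r - x) ** 2 for x, r in zip(sent, recovered))
    return 10 * math.log10(signal / noise)

# Hypothetical two-value example with 64-QAM points as data.
channel = [0.5 + 0.25j, -0.3 + 0.6j]
sent = [3 - 1j, -5 + 7j]
inverse = [truncate(1 / h, 9) for h in channel]
recovered = [x * h * g for x, h, g in zip(sent, channel, inverse)]
print(round(snr_db(sent, recovered), 1))
```

A perfect inverse would make the noise term zero, so the measure is only meaningful when some quantization error is present.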

4.2 Output range

The number range of the output data was not set from the beginning and had to be decided. It should be chosen in such a way that the dynamic range is utilized as much as possible. The range was decided in the following way. The inputs are two 16 bit, real-valued numbers in the range [-1, 1). The norm of the complex number is then (if zero is not considered) between $\sqrt{2} \cdot 2^{-15}$ and $\sqrt{2}$. The norm of the reciprocal is therefore in the range $\frac{1}{\sqrt{2}} \cdot 1$ to $\frac{1}{\sqrt{2}} \cdot 2^{15}$.

From this, one may think that two 16 bit outputs, each ranging from 0 to $2^{15} - 1$ together with a sign bit, should be enough. Although this does cover the whole output number range, it lacks precision. For example, values that are close to one will have a reciprocal output that is either one or two depending on the rounding. This is definitely not good enough, especially since the input values will be scaled to have a mean value of around one half. So, two 16 bit numbers are the expected output: if we choose too many integer bits the precision will be too low, and if too many fractional bits are chosen, we will not be able to handle large results correctly. Where to put the binary point is decided with simulations.
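The trade-off between integer and fractional bits can be illustrated with a small quantization sketch. The function name and the example values are assumptions; truncation and saturation of a 16 bit signed fixed-point word are modelled in floating point.

```python
import math

def quantize(value, int_bits, total_bits=16):
    """Truncate value to a signed fixed-point format and saturate.

    int_bits counts the sign bit; the remaining bits are fractional.
    """
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = math.floor(value * scale)
    raw = max(lowest, min(highest, raw))
    return raw / scale

# Hypothetical example: the reciprocal of 0.9 in two output formats.
print(quantize(1 / 0.9, 16))   # integer bits only: all fractional detail is lost
print(quantize(1 / 0.9, 7))    # 7 integer bits and 9 fractional bits
print(quantize(100.0, 7))      # too large for 7 integer bits: saturates
```

With only integer bits the reciprocal collapses to a whole number, while too few integer bits make large reciprocals saturate, which is the compromise the simulations are used to settle.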

The simulations are made with floating point calculations and truncation of the output. As can be seen in figure 4.1, 7 integer bits and thus 9 fractional bits yield the best SNR. This is the best SNR we can get, as going over to fixed point calculations will at best keep the precision, not improve it.

Figure 4.1. The result of the simulation of output number range. The x-axis corresponds to the number of integer bits (including sign bit) and the y-axis gives the SNR (dB).

4.3 Internal representation

The word length (number of bits) to use within the inverter for intermediate calculations is an important factor and directly affects the required area. Again simulations were made to get performance measurements for different configurations. Here a C-model was written that is made to imitate the final VHDL implementation. The reason for taking the time to implement a C-model instead of using the RTL model directly is that the C-model is much faster and more flexible to use. Since C has a much higher abstraction level than RTL, it is also easier to use for testing and evaluating different algorithms. There are already C-models written for other parts of the processor, and the inverter can then be integrated and simulated together with those more easily than by interconnecting RTL models. The C-model should return exactly the same result as the hardware would. This means that the standard division operation that was used in the previous simulation cannot be used. The model needs to implement all the various bit manipulations that are present in the algorithm in the same way as the VHDL implementation. To simplify the bit manipulations and calculations, a library that handles fixed-point numbers was used. A problem with the library was that it only handles signed numbers, and most of the numbers used were unsigned. One extra sign bit was therefore needed in all variables, and they sometimes had to be sign adjusted.

Figure 4.2. SNR from the simulations of internal word length. Solid line is for the Taylor based inverter and the dotted line for the SRT based one.

The model was parametrized with one parameter, w, as seen in figure 3.1. Simulation results for this can be seen in figure 4.2. For the largest word lengths the SNR is around 58 dB, but it starts to decrease at 11 bits, and below that it decreases by around 6 dB/bit. This makes perfect sense since SNR is defined as $10 \cdot \log_{10}\left((S/N)^2\right) = 20 \cdot \log_{10}(S/N)$. With a decrease of one bit in word length, the noise will increase by a factor of 2. The resulting SNR is then $20 \cdot \log_{10}\left(\frac{S}{2N}\right) = 20 \cdot \left(\log_{10}\frac{1}{2} + \log_{10}\frac{S}{N}\right)$. Since $\log_{10}\frac{1}{2}$ is around −0.3, $20 \cdot \log_{10}\frac{1}{2}$ will be around −6 dB. The reason for the result not being around 61 dB at maximum, as the floating point calculations gave, but instead around 58 dB, is probably that the algorithm is not completely correct in the last bit position, together with the truncations that occur in various places.


4.4 Verification

To make sure that the C-model and the RTL model are equivalent, that is, that they produce the same output for the same input, they are verified against each other. This was done as follows. A Matlab program that generates random channel estimates was created. The estimates were scaled to fit the input format of the inverters and then saved to a file. The input data was fed both to the C-model and to the RTL model, and the outputs were saved to two output files. The files were then compared to spot any differences in the results.


Synthesis and physical implementation

The VHDL code is written on the RTL level and as such is fairly low level, using building blocks such as registers, muxes, adders, multipliers and so on. There are still several different possible physical implementations that perform this function. If high speed is necessary, an implementation using, for example, faster and larger multipliers or high speed adders may be required. Depending on the requirements on the final result, such as speed and which manufacturing technology is used, the resulting physical implementation will vary in terms of consumed area, speed, power consumption and so on. The process of translating the VHDL code into something that can be used for physical implementation is done by using a synthesis tool. The tool does optimizations on the logic and translates it into a netlist. The netlist uses simple building blocks called standard cells. The standard cells are then given a place on the chip and all the interconnection wires are routed. This step is called place and route. The tool that has been used for this is called PKS and is made by Cadence.

PKS stands for Physical Knowledge Synthesis. PKS does not do a complete place and route; instead it does some estimated placing and estimates the wire lengths and delays. From this it can generate timing information and see if the design can handle the specified clock frequency. It also gives an estimate of how large an area the implementation needs. The reason for not doing a complete place and route is that it is not possible. There are several things that need to be specified in order to be able to do a complete layout. A clock distribution network needs to be constructed, placement of the input/output pads needs to be specified and the power supply must be placed. Since this is not specified, PKS tries to do the best it can by doing an estimated placement of the standard cells, modelling the wire lengths and so on.


Results and discussion

In this chapter the results that have been achieved are presented, together with explanations of why some results turned out the way they did. The limitations of the thesis are also discussed.

6.1 Area and speed comparison

To compare the size of the different implementations of the inverter at different speeds the four different versions have been synthesized at several clock frequencies and with different internal word lengths. The four versions are:

• SRT based inverter with full throughput (1/throughput = 1)
• SRT based inverter with half throughput (1/throughput = 2)
• Taylor based inverter with full throughput (1/throughput = 1)
• Taylor based inverter with half throughput (1/throughput = 2)

Each of them has been synthesized at 100, 143, 166 and 200 MHz and with between 7 and 12 bits of word length. They are all synthesized for UMC's 0.18 µm process. The results can be viewed in figures 6.1 to 6.6. The figures do not include area for interconnection wires, so the final size will be a little bit larger.

6.1.1 Pipelining

All of the inverter implementations use pipelining. It is simply not possible to use them at high speed if they are not pipelined. Pipelining mainly affects three properties of a system: maximum clock frequency, area and latency. The latency will of course increase by one clock cycle if one pipeline step is added. The reason for pipelining is clock frequency. When a design is pipelined the clock frequency can be increased. Pipelining means inserting extra sets of registers in the design. These extra registers use chip area, so one might think the size of the design will increase because of this. That is not always the case though. If the frequency is high relative to the number of pipeline steps and the tasks performed, the synthesis tool might need to use faster components, like multipliers, that are larger. When pipelining such a design one might get away with a slower component that is also smaller. There is also a possibility that some logic needs to be duplicated because the wire delay makes it impossible to share it. Things like this make it a bit uncertain whether the area will increase or decrease when another pipelining step is added.

When the different inverters were synthesized, the size of the implementations was of primary concern, and the number of pipeline stages was chosen based on the resulting size.

In some cases when the frequency requirement was increased, PKS was not able to create an implementation that could handle it. In those cases one more pipeline stage was introduced and the synthesis completed successfully. When an implementation with a certain latency is synthesized for a low frequency, PKS will use the smallest multipliers and so on and still fulfill the timing requirement. This sort of 'smallest' implementation will do fine up to a certain frequency, and then the area will start to increase. As the frequency increases PKS will use faster and faster components and also duplicate logic to try to fulfill the demands. If the frequency is close to the maximum of what PKS can achieve, the area will increase fast. When an implementation was not fast enough for a certain frequency given the number of pipeline stages, an extra pipeline stage was added. The latter version was also synthesized for the closest lower frequency to compare the size of the two versions. In one of the five cases it was smaller, and in two cases the two versions were almost the same size.

6.1.2 Area report

As mentioned before, the inverters are synthesized for four different frequencies, 100, 143, 166 and 200 MHz, and for internal word lengths of 7 to 12 bits. The 12 bit versions are of course the largest ones and the 7 bit versions the smallest.

Figures 6.1 and 6.2 show the size of the SRT based inverters with full throughput and half throughput, respectively. With a few exceptions the latency is 6 cycles for the first one and 7 for the second one. The ones with increased latency are marked with a star, square or ring. The ring indicates that both the lower and higher latency versions were synthesizable but the higher latency version required less area; the square indicates that they were the same size. As expected, the size increases with frequency and with word length.

[Plot axes: Frequency (MHz), 100 to 210; Area (mm²), 0.08 to 0.22]

Figure 6.1. The estimated area requirement for the SRT based inverter versus frequency for word length between 7 and 12 bits. 1/Throughput = 1, Latency = 6 cycles except for the marked points where Latency = 7.

Figure 6.2. The estimated area requirement for the SRT based inverter versus frequency for word length between 7 and 12 bits. 1/Throughput = 2, Latency = 6 cycles except for the marked points where Latency = 7.

Figures 6.3 and 6.4 show the size of the Taylor based inverters with full throughput and half throughput, respectively. The latency is 7 clock cycles for all of them. The reason that the 11 bit version is close to the 12 bit version, the 9 bit version is close to the 10 bit version and so on, is that the real valued divider produces a result that is 2m bits wide. Ideally m would be w/2, but since m must be an integer, it has to be rounded upwards when w is odd. This means that the dividers inside the 11 bit version have the same size as those of the 12 bit version. The area difference that exists between them comes from the reduction in size of the other parts.
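The pairing of word lengths follows directly from the rounding of m. The short sketch below tabulates m and the divider result width 2m for the word lengths used in this chapter. The function name is illustrative, and w is assumed here to be the internal word length.

```python
def divider_result_bits(w: int) -> int:
    """Width 2m of the real valued divider result, with m = ceil(w / 2)."""
    m = (w + 1) // 2  # integer form of rounding w/2 upwards
    return 2 * m


if __name__ == "__main__":
    for w in range(7, 13):
        # Odd w gives the same result width as the next even w.
        print(w, (w + 1) // 2, divider_result_bits(w))
```

For w = 7 to 12 the result widths come out as 8, 8, 10, 10, 12 and 12 bits, which is why the curves appear in pairs.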


Figure 6.3. The estimated area requirement for the Taylor based inverter versus frequency for word length between 7 and 12 bits. 1/Throughput = 1, Latency = 7 cycles.

[Plot axes: Frequency (MHz), 100 to 210; Area (mm²), 0.05 to 0.12]

Figure 6.4. The estimated area requirement for the Taylor based inverter versus frequency for word length between 7 and 12 bits. 1/Throughput = 2, Latency = 7 cycles.