
06 – MAC Oscar Gustafsson


Academic year: 2021


(1)

06 – MAC

Oscar Gustafsson

(2)

A few remaining issues from last lecture

• ALU example

• Hardware for |x − y|

• Shifts in ALUs

(3)

ALU example from lecture 5

(4)

ALU example from lecture 5

-- purpose: Select operand 1 for adder 1
-- type   : combinational
-- inputs : operand, a, b
-- outputs: adder1op1
adder1op1select: process (operand, a, b)
begin  -- process adder1op1select
  case operand is
    when 1|2|3|4|7 => adder1op1 <= a;
    when 5|8 =>
      if (a(wordlength+1) = '0') then
        adder1op1 <= a;
      else
        adder1op1 <= not(a);
      end if;
    when 6 =>
      if (a(wordlength+1) = '0' or b(wordlength+1) = '0') then
        adder1op1 <= a;
      else
        adder1op1 <= not(a);
      end if;
    when others => adder1op1 <= (others => '-');
  end case;
end process adder1op1select;

(5)

ALU example from lecture 5

(6)

ALU example from lecture 5

-- The case expression is c1, so c1 (not operand) belongs in the sensitivity list
adder1op1select: process (c1, a, b)
begin  -- process adder1op1select
  case c1 is
    when 0 => adder1op1 <= a;
    when 1 =>
      if (a(wordlength+1) = '0') then
        adder1op1 <= a;
      else
        adder1op1 <= not(a);
      end if;
    when others =>
      if (a(wordlength+1) = '0' or b(wordlength+1) = '0') then
        adder1op1 <= a;
      else
        adder1op1 <= not(a);
      end if;
  end case;
end process adder1op1select;

c1adder1op1select: process (operand)
begin  -- process c1adder1op1select
  case operand is
    when 1|2|3|4|7 => c1 <= 0;
    when 5|8       => c1 <= 1;
    when 6         => c1 <= 2;
    when others    => c1 <= 0;
  end case;
end process c1adder1op1select;

(7)

ALU example from lecture 5

(8)

ALU example from lecture 5

type c0options is (An, Ai);

...

adder1op1select: process (c0, a)
begin  -- process adder1op1select
  case c0 is
    when An     => adder1op1 <= a;
    when others => adder1op1 <= not(a);
  end case;
end process adder1op1select;

c0adder1op1select: process (operand, asign, bsign)
begin  -- process c0adder1op1select
  case operand is
    when 1|2|3|4|7 => c0 <= An;
    when 5|8 =>
      if asign = '0' then
        c0 <= An;
      else
        c0 <= Ai;
      end if;
    when 6 =>
      if asign = '1' and bsign = '1' then
        c0 <= Ai;
      else
        c0 <= An;
      end if;
    when others => c0 <= An;
  end case;
end process c0adder1op1select;

(9)

ALU example from lecture 5

[Figure: results for 8-bit ALU in 65 nm, area vs f_clk (GHz)]

(10)

16-bit |A − B|

Three approaches

(11)

16-bit |A − B|

[Figure: results for 16-bit |A − B| in 65 nm, area vs f_clk (GHz)]

(12)

Typical ALU shift operations

[Figure: shift operations on a 16-bit word, bits 15 down to 0]

• Arithmetic right shift

• Logic right shift (0 shifted in)

• Logic left shift (0 shifted in)

• Rotate right without carry flag

• Rotate left without carry flag

• Rotate left with carry flag C (more than one bit not needed)

• Rotate right with carry flag C (more than one bit not needed)

[Liu2008]
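The shift variants listed above can be modelled on a 16-bit word roughly as follows. This is a minimal C sketch, not the hardware from [Liu2008]; the function names are made up for illustration, and the rotates through carry move one bit per call, matching the "more than one bit not needed" remark.

```c
#include <assert.h>
#include <stdint.h>

/* Logic shifts: vacated positions are filled with 0 (n must be 0..15). */
uint16_t lsr16(uint16_t x, unsigned n) { assert(n < 16); return (uint16_t)(x >> n); }
uint16_t lsl16(uint16_t x, unsigned n) { assert(n < 16); return (uint16_t)((uint32_t)x << n); }

/* Arithmetic right shift: vacated positions are filled with the sign bit.
 * Uses unsigned operations only, so it does not depend on how the
 * compiler treats >> on negative values. */
uint16_t asr16(uint16_t x, unsigned n)
{
    assert(n < 16);
    uint16_t fill = (x & 0x8000u) ? (uint16_t)~(0xFFFFu >> n) : 0;
    return (uint16_t)((x >> n) | fill);
}

/* Rotates without carry: bits leaving one end re-enter at the other. */
uint16_t ror16(uint16_t x, unsigned n)
{
    n &= 15u;
    return (uint16_t)((x >> n) | ((uint32_t)x << ((16u - n) & 15u)));
}

uint16_t rol16(uint16_t x, unsigned n)
{
    n &= 15u;
    return (uint16_t)(((uint32_t)x << n) | (x >> ((16u - n) & 15u)));
}

/* Rotates through the carry flag, one position per call.
 * *carry holds the old flag on entry and the shifted-out bit on return. */
uint16_t rcr16(uint16_t x, unsigned *carry)
{
    unsigned out = x & 1u;
    uint16_t r = (uint16_t)((x >> 1) | ((*carry & 1u) << 15));
    *carry = out;
    return r;
}

uint16_t rcl16(uint16_t x, unsigned *carry)
{
    unsigned out = (x >> 15) & 1u;
    uint16_t r = (uint16_t)(((uint32_t)x << 1) | (*carry & 1u));
    *carry = out;
    return r;
}
```

For example, `asr16(0x8000, 3)` keeps the sign and yields `0xF000`, while `lsr16(0x8000, 3)` yields `0x1000`.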

(13)

Shifter primitive

[Figure: 16-bit shifter primitive. Ports: Shift input [15:0], Shift output [15:0], Fill in [0] to Fill in [14], and four control bits (CTRL LSB, CTRL [1], CTRL [2], CTRL MSB).]

[Liu2008]

• Note: Barrel shifters based on 4-to-1 multiplexers may be more efficient
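A behavioural sketch of such a primitive in C may help to see how the four control bits and the fill-in bits interact. It assumes a right-shift primitive in which stage k shifts by 2^k positions, and it assumes the fill-in bits sit directly above the input word; the actual bit assignment in the [Liu2008] figure may differ, and `shift_primitive` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Right-shift primitive modelled as four 2-to-1 multiplexer stages.
 * Stage k shifts by 2^k positions when bit k of ctrl is set.
 * Assumption: fill[14:0] is concatenated above in[15:0], so the bits that
 * enter from the left are taken from fill, lowest fill bit first. */
uint16_t shift_primitive(uint16_t in, unsigned ctrl, uint16_t fill)
{
    assert(ctrl < 16);
    uint32_t w = ((uint32_t)(fill & 0x7FFFu) << 16) | in;
    for (unsigned k = 0; k < 4; k++) {
        if (ctrl & (1u << k))
            w >>= (1u << k);   /* this stage's mux selects the shifted path */
    }
    return (uint16_t)w;        /* the output is the low 16 bits */
}
```

With this model a logic right shift passes `fill = 0`, an arithmetic right shift passes the sign bit replicated into all fill positions, and a rotate right passes the low 15 bits of the input itself.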

(14)

Hardware multiplexing in shifter

[Figure: 16-bit filling-in shift primitive shared between shift directions. Right shift uses Shift in [15:0] and Shift out [15:0]; left shift uses the reversed bit order, Shift in [0:15] and Shift out [0:15]. The 16-bit fill-in port is driven from the filling-in table, with MSB [15] and "0" among its sources.]

[Liu2008]

• Note: Fill in table may be complicated for some shift operations
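The [15:0] versus [0:15] labels suggest that a single right-shift primitive is reused for left shifts by reversing the bit order at the input and again at the output. The following C sketch illustrates that reading for a logic shift with zero fill; it is an interpretation of the figure labels, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 16-bit word (pure wiring in hardware). */
uint16_t reverse16(uint16_t x)
{
    uint16_t r = 0;
    for (unsigned i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (x & 1u));
        x >>= 1;
    }
    return r;
}

/* The only real shifter: a logic right shift with zero fill. */
uint16_t right_shift16(uint16_t x, unsigned n)
{
    assert(n < 16);
    return (uint16_t)(x >> n);
}

/* Left shift = reverse, shift right, reverse back. */
uint16_t left_shift_via_reversal(uint16_t x, unsigned n)
{
    return reverse16(right_shift16(reverse16(x), n));
}
```

The reversal costs only multiplexers in front of and behind the primitive, which is cheaper than building a second shifter for the other direction.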

(15)

Bitwise logic operations

• AND, OR, XOR

• More or less trivial

(16)

Other ALU operations

• Find leading one: Returns the position of the most significant bit set to one

• Find leading zero: Returns the position of the most significant bit set to zero

• Population count: Returns the number of bits set to one

// RTL code for find leading one
casez (opa)
  16'b1???????????????: msbbit = 16;
  16'b01??????????????: msbbit = 15;
  16'b001?????????????: msbbit = 14;
endcase
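For reference, the three operations can be modelled in C as below. This is a sketch only: it follows the RTL fragment in numbering bit positions from 1 (LSB) to 16 (MSB), and returning 0 when no such bit exists is an assumption, since the slide does not say what happens for an all-zero (or all-one) operand.

```c
#include <stdint.h>

/* Position (1..16) of the most significant 1 bit, or 0 if x == 0. */
unsigned find_leading_one(uint16_t x)
{
    unsigned pos = 0;
    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Position (1..16) of the most significant 0 bit, or 0 if x == 0xFFFF. */
unsigned find_leading_zero(uint16_t x)
{
    return find_leading_one((uint16_t)~x);
}

/* Number of bits set to one; each iteration clears the lowest set bit. */
unsigned popcount16(uint16_t x)
{
    unsigned n = 0;
    while (x) {
        x &= (uint16_t)(x - 1);
        n++;
    }
    return n;
}
```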

(17)

MAC - Introduction

• MAC - Multiply and ACcumulate

• One of the most important parts of a DSP processor

(18)

Why MAC?

• Convolution based algorithms

• FIR, IIR, Auto correlation, Cross correlation

• Linear algebra based algorithms

• Inner product, Matrix multiplication

• Support most transformation algorithms

• FFT, DCT

• Example: FIR implemented with 4 MACs and one MULT

[Figure: FIR data path ending in Round and Saturation blocks]

(19)

Basic structure of a MAC unit

[Figure: multiplier (*) with inputs OpA and OpB, a pipeline register, and the accumulator register ACR]

• Why ACR instead of normal RF?

• Reduce the number of RF ports

• No register/accumulator size mismatch

• No need to write to RF after a very long pipeline

– During convolution: (Address generation, memory read, multiply, add)

(20)

Multiplier basics

• Create a schematic for a 16 × 16 multiplier with support for:

• OP1: Signed/unsigned integer multiplication

• OP2: Signed/unsigned integer multiplication (read out high part)

• OP3: Signed fractional multiplication (with proper rounding)

• The instruction decoder will set the C_s signal according to whether signed or unsigned multiplication is desired

(21)

MAC unit basics

• Example: Create a schematic for a MAC unit with 16-bit inputs, 8 guard bits and the following operations:

• OP0: NOP

• OP1: ACR = 0

• OP2: ACR = Signed/unsigned multiplication

• OP3: ACR = Signed/unsigned multiply and accumulation

• OP4: ACR = SAT(RND(ACR)) // Rounding/saturation for fractional inputs/outputs

• Note: A fairly common mistake on the exam is to forget the NOP instruction!

(22)

MAC unit basics

• Something is missing in the specification.

• (Something which was previously a common error on the exam…)

(23)

MAC unit basics

• We need to read back the result of the accumulation:

• OP5: RF = ACR[31:16]

• OP6: RF = ACR[15:0]

• OP7: RF = ACR[39:32]

• (In fact, we need to be able to save/restore the complete state of the MAC unit.)

• We need to be able to do context switches!

(24)

MAC unit basics - Discussion break

• Question: If we have 16-bit fractional inputs and want a 16-bit fractional output, which bits of the accumulator do we need to read out? How do we perform saturation and rounding?

(25)

MAC unit basics - Discussion break

• Question: If we have 16-bit fractional inputs and want a 16-bit fractional output, which bits of the accumulator do we need to read out?

• OP8: RF = ACR[30:15]
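To see why ACR[30:15] is the right window: two Q15 operands give a Q30 product, so the 16-bit Q15 result starts at bit 15 of the accumulator. A minimal C sketch of accumulate, round, saturate and read out follows. It models the 40-bit ACR in an int64_t and assumes round-half-up (adding 2^14 before the cut); the function names are hypothetical, and the slides do not prescribe this particular rounding mode.

```c
#include <assert.h>
#include <stdint.h>

#define ACR_MAX ((INT64_C(1) << 39) - 1)   /* largest 40-bit two's complement value */
#define ACR_MIN (-ACR_MAX - 1)

/* Floor division by 2^n, which is what an arithmetic right shift does. */
int64_t asr_floor(int64_t v, unsigned n)
{
    return v >= 0 ? v >> n : -((-v + (INT64_C(1) << n) - 1) >> n);
}

/* One MAC step: ACR = ACR + OpA * OpB (Q15 x Q15 gives a Q30 product). */
int64_t mac_q15(int64_t acr, int16_t opa, int16_t opb)
{
    int64_t sum = acr + (int32_t)opa * (int32_t)opb;
    assert(sum >= ACR_MIN && sum <= ACR_MAX);  /* sketch does not model 40-bit wrap */
    return sum;
}

/* Round at bit 14, saturate to 16 bits, then read out bits [30:15]. */
int16_t read_frac(int64_t acr)
{
    int64_t r = asr_floor(acr + (1 << 14), 15);
    if (r > INT16_MAX) return INT16_MAX;   /* the guard bits caught an overflow */
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}
```

Accumulating 0.5 × 0.5 once reads back as 0.25 (8192 in Q15); accumulating it five times gives 1.25 in the accumulator, which is representable thanks to the guard bits but saturates to 32767 on readout.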

(26)

MAC unit basics

• Alternatives:

• Support fractional multiplication by scaling right by one step directly after the multiplication. Now both (saturated) integer and (saturated) fractional MAC is supported by reading out only ACR[31:16] and ACR[15:0].

(27)

MAC unit basics

• Alternatives:

• Allow post operations on the ACR value during readout:

– RF = SAT(ROUND(ACR[31:16]))

– Drawback: Potentially a long critical path.

• Sign-extended readout of guard bits: RF = { {8{ACR[39]}}, ACR[39:32]};

• Allow a few other different kinds of readout options

(28)

MAC unit basics

• We probably also want to be able to set the ACR to arbitrary values:

• OP9: ACR[31:16] = OpA

• OP10: ACR[15:0] = OpA

• OP11: ACR[39:32] = OpA[7:0]

• We may want some other variants as well:

• ACR = { {8{OpA[15]}}, OpA[15:0], 16'b0}; (Load high and sign extend)

• ACR = { {9{OpA[15]}}, OpA[15:0], 15'b0}; (Load fractional value and sign extend)

• ACR = { {8{OpA[15]}}, OpA[15:0], OpB[15:0]}; (Load high and low part simultaneously)

• OpA and OpB may be either a memory or register file operand

(29)

Scaling

• Scaling is typically desired in a MAC unit

• A scaling operation allows us to handle more than just fractional/integer multiplication

• OP12: Scaling by 0.5 (easy)

• OP13: Scaling by 0.25 (easy)

• OP14: Scaling by 2 (easy)

• OP15: Scaling by 4 (easy)

• etc...

• OP16: Scaling by 1.5 (easy or hard?)

• OP17: Scaling by 0.75 (easy or hard?)
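One way to read the "easy or hard?" question: 1.5 and 0.75 are not powers of two, but each is a sum or difference of two powers of two, so scaling needs a shift plus one pass through the adder instead of a shift alone. The C sketch below illustrates this on an accumulator value; whether this is cheap depends on how the shifted value can be routed to the adder, which is an assumption about the data path that the slide leaves open. The function names are hypothetical.

```c
#include <stdint.h>

/* Floor division by 2^n, which is what an arithmetic right shift does. */
int64_t asr_floor(int64_t v, unsigned n)
{
    return v >= 0 ? v >> n : -((-v + (INT64_C(1) << n) - 1) >> n);
}

/* 1.5 * x = x + x/2: one shift and one addition. */
int64_t scale_1_5(int64_t acr)
{
    return acr + asr_floor(acr, 1);
}

/* 0.75 * x = x - x/4: one shift and one subtraction. */
int64_t scale_0_75(int64_t acr)
{
    return acr - asr_floor(acr, 2);
}
```

The results are exact whenever the shifted-out bits are zero; otherwise the truncation in the shift carries over into the scaled value.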

(30)

Scaling

• WARNING: Arithmetic right shift is not identical to division by a power of two!

• Not a big concern if you round the result properly

#include <stdio.h>

int main(int argc, char **argv)
{
    int a = -1;
    printf("%d, %d\n", a / 2, a >> 1);
    return 0;
}
// Output from this program on x86_64 (using GCC): 0, -1

Nitpick: The result of a right shift on a negative number is actually not specified by the C standard! (It is implementation-defined behavior, although an implementation is unlikely to use anything but an arithmetic right shift.)

(31)

Wide multiplication operation

• You may sometimes want to run a 32 × 32 bit wide multiplication on a MAC unit with a 16 bit wide multiplier.

• Straightforward method:

$signed(A[31:0]) * $signed(B[31:0]) =
      $unsigned(B[15:0]) * $unsigned(A[15:0])
  + (($unsigned(B[15:0]) * $signed(A[31:16])) << 16)
  + (($signed(B[31:16]) * $unsigned(A[15:0])) << 16)
  + (($signed(B[31:16]) * $signed(A[31:16])) << 32)

• One reason why Senior has separate mulxx/macxx/convxx instructions, where each x can be either s (signed) or u (unsigned).

(32)

Wide multiplication operation

• Somewhat trickier in practice (64-bit result)

AH = $signed(A[31:16]); AL = $unsigned(A[15:0]);
BH = $signed(B[31:16]); BL = $unsigned(B[15:0]);
ACR = AL * BL;
R0 = ACR[15:0];
ACR = ACR >> 16;  // Question to audience: Do we need rounding here?
ACR += BL*AH;
ACR += AL*BH;
R1 = ACR[15:0];
ACR = ACR >> 16;
ACR += AH*BH;
R2 = ACR[15:0];
R3 = ACR[31:16];
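The sequence above can be checked against a native 64-bit multiplication. The C sketch below mirrors it step by step, with the accumulator modelled as an int64_t and the result words R0 to R3 packed into one 64-bit value. The shifts use plain truncation (floor), which is what makes the carries between the 16-bit words line up. `mul32x32` is a hypothetical name, and returning the result as an unsigned bit pattern is a choice made here to stay within well-defined C.

```c
#include <stdint.h>

/* Floor division by 2^n: the arithmetic right shift of the accumulator. */
static int64_t asr_floor(int64_t v, unsigned n)
{
    return v >= 0 ? v >> n : -((-v + (INT64_C(1) << n) - 1) >> n);
}

/* 32x32 signed multiplication from four 16x16 products.
 * Returns the 64-bit two's complement bit pattern {R3, R2, R1, R0}. */
uint64_t mul32x32(int32_t a, int32_t b)
{
    /* Unsigned low halves, signed high halves. */
    int64_t al = (uint32_t)a & 0xFFFFu;
    int64_t bl = (uint32_t)b & 0xFFFFu;
    int64_t ah = ((int64_t)a - al) / 65536;   /* exact division */
    int64_t bh = ((int64_t)b - bl) / 65536;

    int64_t acr = al * bl;                    /* unsigned x unsigned */
    uint64_t r0 = (uint64_t)acr & 0xFFFFu;
    acr = asr_floor(acr, 16);
    acr += bl * ah;                           /* unsigned x signed */
    acr += al * bh;                           /* unsigned x signed */
    uint64_t r1 = (uint64_t)acr & 0xFFFFu;
    acr = asr_floor(acr, 16);
    acr += ah * bh;                           /* signed x signed */
    uint64_t r2 = (uint64_t)acr & 0xFFFFu;
    uint64_t r3 = ((uint64_t)acr >> 16) & 0xFFFFu;

    return (r3 << 48) | (r2 << 32) | (r1 << 16) | r0;
}
```

The mix of signed and unsigned partial products is visible in the comments: only the high halves carry a sign, which is the motivation for the separate s/u instruction variants mentioned on the previous slide.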

(33)

Long Arithmetic Operations

• There is a wide adder in the MAC unit. This may be used for long addition/subtraction:

• ACRz = ACRx + ACRy; (x, y, and z are numbers from 0 to N − 1, where N is the number of accumulator registers)

• ACRz = ACRx - ACRy;

• ACRz = ABS(ACRx);

• Compare (set flags according to ACRx - ACRy)

(34)

Long Arithmetic Operations

• It is also convenient to be able to add immediates or values from the register file:

• ACRz = {{24{OpA[15]}}, OpA[15:0]};

• ACRz = {{8{OpA[15]}}, OpA[15:0], 16'b0};

• ACRz = {{8{OpA[15]}}, OpA[15:0], OpB[15:0]};

• We may want the following variant to make it easy to work with fractional data:

• ACRz = {{9{OpA[15]}}, OpA[15:0], 15'b0};

(35)

Flags

• We probably want all the normal flags in the MAC unit

• Zero (Z)

• Negative (N)

• Overflow (V)

• Carry (C) (Probably not so important)

• We may also want a sticky version of the overflow flag (S) dealing with the full accumulator (not overflow as in saturation)

• Rationale: Do a number of calculations, go to error handling code if an overflow occurred at any point in the calculation.
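As an illustration, the flags for a 40-bit accumulator addition could be derived as in the following C sketch. The 40-bit width follows the 16-bit inputs plus 8 guard bits used earlier in the lecture; the struct layout and the name `acr_add` are hypothetical. The sticky flag S is only ever set here, so software has to clear it explicitly before a calculation.

```c
#include <stdint.h>

#define ACR_BITS 40
#define ACR_MASK ((UINT64_C(1) << ACR_BITS) - 1)
#define ACR_SIGN (UINT64_C(1) << (ACR_BITS - 1))

struct mac_flags {
    unsigned z;  /* result is zero */
    unsigned n;  /* result is negative (sign bit set) */
    unsigned v;  /* two's complement overflow of the full accumulator */
    unsigned c;  /* carry out of bit 39 */
    unsigned s;  /* sticky overflow: set by v, never cleared here */
};

/* 40-bit addition of two accumulator bit patterns, updating the flags. */
uint64_t acr_add(uint64_t x, uint64_t y, struct mac_flags *f)
{
    uint64_t wide = (x & ACR_MASK) + (y & ACR_MASK);
    uint64_t sum = wide & ACR_MASK;

    f->z = (sum == 0);
    f->n = (sum & ACR_SIGN) != 0;
    f->c = (unsigned)((wide >> ACR_BITS) & 1u);
    /* Overflow: both operands have the same sign and the result's sign differs. */
    f->v = ((~(x ^ y) & (x ^ sum)) & ACR_SIGN) != 0;
    f->s |= f->v;
    return sum;
}
```

After a sequence of additions, checking `s` once tells whether any of them overflowed, which matches the error-handling rationale above.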

(36)

Advanced MAC architectures: Dual MAC

[Figure: dual MAC with two multipliers (*) and two adders (+) fed from the register file. Labels: A, B, AAR, BAR, MO1, MO2.]

(37)

Advanced MAC architectures: Complex MAC

[Figure: complex MAC. Operands OPA = AR + jAI and OPC = CR + jCI (further inputs BR, BI); four N×N multipliers with 2N-bit products RMR, IMI, RMI and IMR; ADD, SUB and ADD/SUB units; accumulators ACRR and ACIR; Re output and Im output.]

(38)

Complex multiplication algorithms

• Normal algorithm:

(a + bi)(c + di) = (ac − bd) + i(ad + bc)

• 4 multiplications, 2 additions

• If one factor (c + di) is constant:

• Use case: FFT, complex FIR/IIR, etc.

• Still 4 multiplications and 2 additions

• Critical path: multiplier → adder

(39)

Complex multiplication algorithms

• Gauss’ algorithm

k1 = c(a + b)

k2 = a(d − c)

k3 = b(c + d)

(a + bi)(c + di) = (k1 − k3) + i(k1 + k2)

• 3 multiplications, 5 additions

• Drawback: Needs slightly wider multipliers/adders

• If one factor (c + di) is constant:

Precompute d − c and c + d

• 3 multiplications and 3 additions

• Critical path: adder → multiplier → adder
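A small C sketch can confirm that Gauss' three products give the same result as the normal four-product form. It uses 16-bit inputs and 64-bit intermediates so that the wider pre-additions (a + b, d − c, c + d need 17 bits) cannot overflow; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int64_t re;
    int64_t im;
} cplx64;

/* Normal algorithm: 4 multiplications, 2 additions. */
cplx64 cmul_direct(int16_t a, int16_t b, int16_t c, int16_t d)
{
    cplx64 r = { (int64_t)a * c - (int64_t)b * d,
                 (int64_t)a * d + (int64_t)b * c };
    return r;
}

/* Gauss' algorithm: 3 multiplications, 5 additions. */
cplx64 cmul_gauss(int16_t a, int16_t b, int16_t c, int16_t d)
{
    int64_t k1 = (int64_t)c * ((int64_t)a + b);
    int64_t k2 = (int64_t)a * ((int64_t)d - c);
    int64_t k3 = (int64_t)b * ((int64_t)c + d);
    cplx64 r = { k1 - k3, k1 + k2 };
    return r;
}
```

For (1 + 2i)(3 + 4i): k1 = 9, k2 = 1, k3 = 14, so the result is (9 − 14) + i(9 + 1) = −5 + 10i, the same as the direct form.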

(40)

Wide multiplication: Karatsuba’s variant

• A similar trick can be used for performing real-valued multiplications

• Could be useful for handling multiplications wider than the native datawidth

• If you are interested, search for Karatsuba’s algorithm

• Wikipedia has a good introduction

• A very similar trick can also reduce the computational complexity of FIR filters (search for fast FIR algorithm, FFA)

• We might look at this in a later lecture

• Talk to Oscar if you think this is really interesting!

(from earlier year’s slides...)
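For readers who want to see the trick without leaving the slides, here is a minimal one-level Karatsuba sketch in C for a 32 × 32 multiplication built from three narrower products instead of four. Unsigned operands are assumed to keep the example short, and `karatsuba32` is a hypothetical name. As with Gauss' algorithm, the middle product needs slightly wider operands (17 bits).

```c
#include <stdint.h>

/* One level of Karatsuba: x = xh*2^16 + xl, y = yh*2^16 + yl.
 * x*y = z2*2^32 + z1*2^16 + z0, where z1 is obtained from one extra
 * multiplication instead of two. */
uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint64_t xh = x >> 16, xl = x & 0xFFFFu;
    uint64_t yh = y >> 16, yl = y & 0xFFFFu;

    uint64_t z2 = xh * yh;                           /* high x high */
    uint64_t z0 = xl * yl;                           /* low x low */
    uint64_t z1 = (xh + xl) * (yh + yl) - z2 - z0;   /* equals xh*yl + xl*yh */

    return (z2 << 32) + (z1 << 16) + z0;
}
```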

(41)

Floating-point MAC

• Potential problem:

• Floating-point addition is cumbersome

• The accumulation part is hence cumbersome

• Easiest solution (from HW point of view): Use several accumulators and use loop unrolling/software pipelining

• Floating point MAC is sometimes called FMA (Fused Multiply and Add)

• In this case rounding is only done once instead of twice

• Alignment can be performed in parallel with the multiplication

(42)

Critical path issues in MAC unit

[Figure: operand selection for the MAC inputs. Sources D-mem 1 to D-mem 4, RF (32-to-1 select) and a constant feed OPA and OPB; long wires and very heavy fan-out at the MAC input.]

[Liu2008]

(43)

Critical path issues in MAC unit

[Figure: accumulator registers ACR1, ACR2, …, ACRm, ACRn with register select logic and data select logic (sources: RF, a port, data memory); long wire and heavy fan-out for MAC internal logic.]

[Liu2008]

(44)

Critical path issues in MAC unit

[Figure: three pipelining options, each with multiplier (*), accumulator ACR and flag circuit: (a) MAC in one clock cycle, (b) MAC using two clocks, (c) MAC using three clocks.]

[Liu2008]

(45)

www.liu.se
