2. ALU Design
Olle Seger (olles@isy.liu.se) Dake Liu (dake@isy.liu.se)
Oscar Gustafsson (oscar.gustafsson@liu.se)
• ALU, an overview
• AU, a case study
• Exercises
• About Lab-2
ALU
Key component in datapath of a DSP Processor
Usually all operands from RF, except imm
Execution Cost : 1 Clock Cycle
Use one guard bit
Key Components of ALU
Arithmetic Unit
Logic Unit (AND, OR, XOR etc)
Shifter (LRS, LLS, ASR, ASL)
Special Functions (e.g. bit manipulation)
Multiplexers
ALU Overview
[Figure: ALU block diagram showing pre-processing, the AU, Logic, Shift, and Special units, post-processing, result saturation, and flags.]
Let’s design a small AU
Functional Specification
0. A + B with saturation OP=0000
1. A + B without saturation OP=0001
2. A + B + Cin with saturation OP=0010
3. A + B + Cin without saturation OP=0011
4. A - B with saturation OP=0100
5. A - B without saturation OP=0101
6. A compare to B, flags only OP=0110
7. ABS(A) Absolute operation on A OP=0111
8. NEG(A) Negate operation on A OP=1000
9. (A+B)/2 Average operation OP=1001
10. NOP OP=1010
The C, Z, V, and N flags should be updated for OP 0-9.
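The saturating, non-saturating, and averaging variants in the specification can be sketched in C by computing the sum in a wider type and then choosing how to bring it back to 16 bits. This is only a minimal reference sketch, not the lab's model: the function names `add_wide`, `saturate16`, `truncate16`, and `average16` are hypothetical, and the average is assumed to round toward minus infinity, as an arithmetic right shift would.

```c
#include <stdint.h>

/* Add two 16-bit operands plus an optional carry-in in a wider type,
   so the result never overflows before post-processing. */
static int32_t add_wide(int16_t a, int16_t b, int cin) {
    return (int32_t)a + (int32_t)b + (cin ? 1 : 0);
}

/* With saturation: clamp to the 16-bit two's-complement range. */
static int16_t saturate16(int32_t s) {
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Without saturation: keep the low 16 bits, so the result wraps around. */
static int16_t truncate16(int32_t s) {
    uint32_t low = (uint32_t)s & 0xFFFFu;
    return (low & 0x8000u) ? (int16_t)((int32_t)low - 65536) : (int16_t)low;
}

/* (A+B)/2: halve the wide sum, rounding toward minus infinity
   (assumed to match an arithmetic shift right by one). */
static int16_t average16(int16_t a, int16_t b) {
    int32_t s = add_wide(a, b, 0);
    int32_t half = (s >= 0) ? s / 2 : -((-s + 1) / 2);
    return (int16_t)half;
}
```

For example, 30000 + 10000 saturates to 32767 but wraps to -25536 without saturation, while the average of the same operands is 20000 because the wide sum is halved before it is narrowed.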
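AU functions
[Figure: adder configuration for each AU function.
  SAT(A + B), A + B: adder on A and B, with or without saturation of the result.
  SAT(A + B + C), A + B + C: adder on A and B with Cin, with or without saturation.
  Average (A+B): adder on A and B followed by ASR.
  SAT(A - B), A - B: adder on A and B with carry-in '1', with or without saturation.
  compare: adder on A and B with carry-in '1', flag-only.
  ABS(A): adder on A with B=0 and a 0/1 multiplexer controlled by the MSB of A.
  NEG(A): adder on A with B=0 and carry-in '1'.]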
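HW with multiplexing
[Figure: AU datapath built around a 17-bit adder. Multiplexers controlled by C1 and C2 select the A[15:0] and B[15:0] operand paths (including an XOR with A[15] and a constant 0), a multiplexer controlled by C3 selects the carry-in Cin (0, '1', C, or A[15]), and a multiplexer controlled by C4 selects the result from SAT, trunc, or ASR. Cout = S[16]. C5 enables the flag update. The decoder DEC generates C1, C2, C3, C4 (and C5) from OP.]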
HW with multiplexing
Flags
    always @(posedge clk)
      if (C5) begin
        C <= Cout;
        Z <= ~|R;                 // zero flag: no bit of R is set
        N <= R[15];
        V <= (S[16] != S[15]);
      end
ASR ½
    assign R = S[16:1];
Sat
    always @(*)
      if (S[16] == S[15])  R = S[15:0];
      else if (S[16] == 0) R = 16'h7fff;
      else                 R = 16'h8000;
DEC
OP  Function     C1  C2  C3  C4  C5
0   Sat(A+B)     00  00  00  00  1
1   A+B          00  00  00  01  1
2   Sat(A+B+C)   00  00  10  00  1
3   A+B+C        00  00  10  01  1
4   Sat(A-B)     00  01  01  00  1
5   A-B          00  01  01  01  1
6   Cmp(A,B)     00  01  01  -   1
7   Abs(A)       10  10  11  01  1
8   Neg(A)       01  10  01  01  1
9   (A+B)/2      00  00  00  10  1
10  NOP          -   -   -   -   0
Trunc
assign R = S[15:0];
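A bit-level C sketch of this datapath can help when checking decoder rows by hand. The selector encodings below are assumptions read off the decoder table, not taken from the lab sources: C1 (00 = A, 01 = inverted A, 10 = A inverted when A[15] is set), C2 (00 = B, 01 = inverted B, 10 = zero), C3 (00 = 0, 01 = '1', 10 = C, 11 = A[15]), C4 (00 = Sat, 01 = Trunc, 10 = ASR). Both adder inputs are assumed to be sign-extended to 17 bits after the operand multiplexers. The names `au_ctrl`, `au_flags`, `sext17`, and `au_exec` are hypothetical.

```c
#include <stdint.h>

/* Flag register; updated only when C5 = 1. */
typedef struct { unsigned c, z, n, v; } au_flags;

/* Control signals as produced by the decoder (DEC). */
typedef struct { unsigned c1, c2, c3, c4, c5; } au_ctrl;

/* Sign-extend a 16-bit pattern to 17 bits: {x[15], x[15:0]}. */
static uint32_t sext17(uint32_t x) {
    return (x & 0xFFFFu) | ((x & 0x8000u) << 1);
}

/* Operands and result are raw 16-bit patterns. */
static uint16_t au_exec(au_ctrl k, uint16_t a, uint16_t b, au_flags *f) {
    uint32_t a15 = (a >> 15) & 1u;

    /* Operand and carry-in multiplexers (assumed encodings, see lead-in). */
    int inv_a = (k.c1 == 1u) || (k.c1 == 2u && a15);
    uint32_t xa  = inv_a ? (~(uint32_t)a & 0xFFFFu) : a;
    uint32_t xb  = (k.c2 == 1u) ? (~(uint32_t)b & 0xFFFFu)
                 : (k.c2 == 2u) ? 0u : b;
    uint32_t cin = (k.c3 == 1u) ? 1u
                 : (k.c3 == 2u) ? f->c
                 : (k.c3 == 3u) ? a15 : 0u;

    /* 17-bit adder. */
    uint32_t s   = (sext17(xa) + sext17(xb) + cin) & 0x1FFFFu;
    uint32_t s16 = (s >> 16) & 1u;
    uint32_t s15 = (s >> 15) & 1u;

    /* Result multiplexer: ASR, Trunc, or Sat. */
    uint16_t r;
    if (k.c4 == 2u)      r = (uint16_t)(s >> 1);        /* S[16:1] */
    else if (k.c4 == 1u) r = (uint16_t)(s & 0xFFFFu);   /* S[15:0] */
    else if (s16 == s15) r = (uint16_t)(s & 0xFFFFu);   /* no overflow */
    else                 r = s16 ? 0x8000u : 0x7FFFu;   /* saturate */

    if (k.c5) {                       /* flag update, as in the Flags block */
        f->c = s16;                   /* Cout = S[16] */
        f->z = (r == 0);
        f->n = (r >> 15) & 1u;
        f->v = (s16 != s15);
    }
    return r;
}
```

With the Abs(A) row (C1=10, C2=10, C3=11, C4=01), the pattern 0xFFFB (-5) gives 0x0005: A is inverted to 0x0004 and the carry-in A[15] adds the missing one.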
Exercise 2.1
Exercise 2.2
We have a processor with a pipeline where we can:
* Read out two operands from the register file and write one operand to the register file, all at the same time
* Instead of reading out one of the operands you can
choose to take a 16-bit immediate from the instruction word
* We have 32 16-bit registers
* A conditional branch takes 3 clock cycles
* We have a repeat instruction
* We have only one load instruction of interest: load Rd, DM0[AR0++]. AR0 is set with the instruction set AR0, Rs
* The store instruction works the same way: store DM0[AR0++], Rs
* After a load instruction we must wait a clock cycle before
Exercise 2.3
Function 1
(execution time max 105 clock cycles, excluding the RET instruction)
    #include <stdint.h>
    #include <stdlib.h>

    int16_t dct_indata[32];

    // Return value in r0
    uint16_t find_maxabsval(void) {
        uint16_t biggest = 0, b;
        int16_t a;
        for (int i = 0; i < 32; i++) {
            a = dct_indata[i];
            b = abs(a);
            if (b > biggest) biggest = b;
        }
        return biggest;
    }
Exercise 2.3
    int64_t packet_ctr;

    /* Length is in register r0 when this function is called */
    void update_statistics(int16_t length)
    {
        packet_ctr += length;
    }
(execution time max 25 clock cycles, excluding the RET instruction)
Exercise 2.3
    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,32
    LD r1,(ar0++)
    NOP
    ABS r2,r1
    MAX r0,r2,r0
loop:
    RET
4*32 + 3 = 131

    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,16
    LD r1,(ar0++)
    LD r3,(ar0++)
    ABS r2,r1
    MAX r0,r2,r0
    ABS r4,r3
    MAX r0,r4,r0
loop:
    RET
6*16 + 3 = 99
A gold star if you can do it faster!
Exercise 2.3
    SET ar0,dct_indata
    LD r1,(ar0++)       ; prolog
    SET r0,0            ; max value
    ABS r2,r1           ; prolog
    REPEAT loop,31
    LD r1,(ar0++)       ; loop
    MAX r0,r2,r0
    ABS r2,r1
loop:
    MAX r0,r2,r0        ; epilog
    RET
3*31 + 6 = 99
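The prolog/loop/epilog structure above is software pipelining: each iteration loads element i while finishing the MAX for element i - 1, so the load delay is hidden behind useful work. The same restructuring can be sketched in C to check that it computes the same result as the straightforward loop. The function names `maxabs_plain` and `maxabs_pipelined` are hypothetical, and the comments map the C statements to the assembly only loosely.

```c
#include <stdint.h>
#include <stdlib.h>

#define N 32

/* Straightforward loop: load, abs, and max for each element in turn. */
static uint16_t maxabs_plain(const int16_t *x) {
    uint16_t biggest = 0;
    for (int i = 0; i < N; i++) {
        uint16_t b = (uint16_t)abs(x[i]);
        if (b > biggest) biggest = b;
    }
    return biggest;
}

/* Same computation, arranged like the software-pipelined assembly. */
static uint16_t maxabs_pipelined(const int16_t *x) {
    uint16_t r0 = 0;                     /* SET r0,0                   */
    int16_t  r1 = x[0];                  /* prolog: LD                 */
    uint16_t r2 = (uint16_t)abs(r1);     /* prolog: ABS                */
    for (int i = 1; i < N; i++) {        /* REPEAT loop,31             */
        r1 = x[i];                       /* LD  for element i          */
        if (r2 > r0) r0 = r2;            /* MAX for element i - 1      */
        r2 = (uint16_t)abs(r1);          /* ABS for element i          */
    }
    if (r2 > r0) r0 = r2;                /* epilog: MAX for last item  */
    return r0;
}
```

The epilog matters: without the final MAX, the last element of the array would be loaded and made absolute but never compared.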
Exercise 2.3
    set ar0,packet_ctr
    set r4,0
    add r1,r0,0x8000    ; carry = (length<0)
    addc r4,r4,r4       ; r4 = (length<0)
    ld r1,(ar0)
    sub r4,0,r4         ; r4 = (length<0)?-1:0
    add r1,r0
    st (ar0++),r1
    repeat endloop,3
    ld r1,(ar0)
    nop                 ; Silverstar if you remove this
                        ; without unrolling loop completely!
    addc r1,r4
    st (ar0++),r1
endloop:
    ret
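[Figure: packet_ctr stored as four 16-bit words P_c[0] to P_c[3], addressed through ar0; length (in r0) is added to P_c[0] and its sign extension (ext) is added to P_c[1], P_c[2], and P_c[3].]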
Exercise 2.3
3*4 + 9 = 21
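The assembly adds a sign-extended 16-bit value to a 64-bit counter held as four 16-bit words, propagating the carry from word to word. A minimal C sketch of that arithmetic follows. It assumes the least significant word is stored first (as the order of the add and addc instructions suggests); the function name `add_length` and the array parameter are hypothetical.

```c
#include <stdint.h>

/* Add a signed 16-bit length to a 64-bit counter stored as four
   16-bit words, least significant word first (assumed layout). */
static void add_length(uint16_t pc[4], int16_t length) {
    /* Sign extension word: 0xFFFF if length < 0, else 0 (r4 in the assembly). */
    uint16_t ext = (length < 0) ? 0xFFFFu : 0x0000u;

    /* Lowest word: plain add, remember the carry out of bit 15. */
    uint32_t sum = (uint32_t)pc[0] + (uint16_t)length;
    pc[0] = (uint16_t)sum;
    uint32_t carry = sum >> 16;

    /* Upper three words: add the extension word plus carry (addc). */
    for (int i = 1; i < 4; i++) {
        sum = (uint32_t)pc[i] + ext + carry;
        pc[i] = (uint16_t)sum;
        carry = sum >> 16;
    }
}
```

For instance, adding -1 to a counter holding 65536 (words 0, 1, 0, 0) yields 65535 (words 0xFFFF, 0, 0, 0): the extension word 0xFFFF and the rippling carry cancel in the upper words.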
    set ar0,packet_ctr
    set r4,0
    add r1,r0,0x8000    ; carry = (length<0)
    addc r4,r4,r4       ; 1 in r4 if length<0
    ld r1,(ar0)
    sub r4,0,r4         ; -1 in r4 if neg
    add r2,r1,r0
    repeat endloop,3
    ld r1,(ar0+1)
    st (ar0++),r2       ; loop
    addc r2,r1,r4
endloop:
    st (ar0++),r2
    ret
Exercise 2.3
software pipelining
3*3 + 9 = 18
ALU
Operation  C1  C2  C3  C4  C5
ABS(A)     1   10  11  0   0
MAX(A,B)   0   01  00  1   0
A+B        0   00  01  0   1
A-B        0   01  00  0   1
A+B+C      0   00  10  0   1
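[Figure: ALU datapath built around a 17-bit adder with the sign-extended operands {A[15],A[15:0]} and {B[15],B[15:0]}. Multiplexers controlled by C1 to C4 select the operand paths (including an XOR with A[15] and a constant 0), the carry-in (among 0, C, and A[15]), and the result. The adder produces S[15:0], S[16], and Cout.]
    always @(posedge clk)
      if (C5) begin
        C <= Cout;
      end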
Exercise 2.3
Exercise 2.4
Exercise 2.4
Software pipelining
    SET ar0,dct_indata
    SET r0,0            ; max value
    LD r1,(ar0++)       ; prolog
    REPEAT loop,31
    LD r1,(ar0++)
    MAXABS r0,r1,r0     ; loop
loop:
    MAXABS r0,r1,r0     ; epilog
    RET
2*31+5=67
This code utilizes pipeline delay!
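Why this works can be sketched with a small C model of the load delay. Two assumptions are made here that the slides do not spell out: MAXABS dst,src,acc computes max(|src|, acc), and a loaded value becomes visible in its destination register one instruction late, so the MAXABS directly after an LD still reads the previous value of r1. The names `maxabs` and `find_maxabsval_delayed` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define N 32

/* Assumed semantics of MAXABS dst,src,acc: dst = max(|src|, acc). */
static uint16_t maxabs(int16_t src, uint16_t acc) {
    uint16_t m = (uint16_t)abs(src);
    return (m > acc) ? m : acc;
}

/* Model of the pipelined code with a one-instruction load delay. */
static uint16_t find_maxabsval_delayed(const int16_t *x) {
    uint16_t r0 = 0;          /* SET r0,0                              */
    int16_t  r1 = 0;          /* value currently visible in r1         */
    int16_t  pending;         /* loaded, but not yet visible           */
    int      ar0 = 0;         /* SET ar0,dct_indata                    */

    pending = x[ar0++];       /* prolog: LD r1,(ar0++)                 */
    for (int i = 0; i < N - 1; i++) {   /* REPEAT loop,31              */
        r1 = pending;         /* the earlier load has now landed       */
        pending = x[ar0++];   /* LD r1,(ar0++): result not visible yet */
        r0 = maxabs(r1, r0);  /* MAXABS sees the previous element      */
    }
    r1 = pending;             /* last load lands                       */
    r0 = maxabs(r1, r0);      /* epilog: MAXABS r0,r1,r0               */
    return r0;
}
```

In this model the loop issues 31 loads after the prolog load, covering all 32 elements, and each MAXABS consumes the element loaded one LD earlier, which is why no NOP is needed.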
Exercise 2.4
Loop unrolling
    SET ar0,dct_indata
    SET r0,0            ; max value
    REPEAT loop,16
    LD r1,(ar0++)
    LD r2,(ar0++)
    MAXABS r0,r1,r0
    MAXABS r0,r2,r0
loop:
    RET
4*16+3=67
About Lab 2 (Datapath)
• Manual for Lab 2 (Ch-2)
• Source code for LAB-2
• You can use Verilog or VHDL.
• Go through Ch-0 and Ch-2 for all details
Read the manuals carefully before starting the labs!
About Lab 2
Write this HW:
  saturation.vhd
  mac_dp.vhd
  adder_ctrl.vhd
  min_max_ctrl.vhd
Write this SW:
  saturation.asm
  rounding_vector.asm
  alu_test.asm
1) Run SW on srsim for reference
2) Run SW and HW using vsim
3) Compare output
4) Check coverage. Was all your HW tested?
SW should test all corner cases
About Lab 2
Verification
– Write Assembly Program to test your modules
– Some Templates are provided
– Fill with your choice of registers, and operands
– Perform the operation
– Write the results to a file using “out 0x11, r?”
– Use coverage metrics to find obvious missing corner cases
– Run Modelsim Simulator using commands mentioned in Section 0.5
– Simulate and Debug