4. Design of Program Flow Control Units
Olle Seger Andreas Ehliar
Jian Wang
Andréas Karlsson
Oscar Gustafsson
Program Flow Control Unit
• What instruction to execute next?
• Easy! If it weren’t for...
– Hardware repeat (simple to handle)
– Unconditional jumps
– Conditional jumps
– Call/return
– Program counter as destination operand
– ...
• Can be handled in many ways:
– Flush pipeline
– Use delay slots
– ...
Speculative techniques
• Branch prediction
– "Guess" the outcome of a branch
– Static techniques (based on heuristics)
– Dynamic techniques (based on history)
– Specialized predictors, e.g. loop predictor
• Branch target prediction
– "Guess" the branch target before it is known/calculated
• Return address buffering
– Keep a small stack of return addresses in faster registers (in addition to the stack in memory)
• (Not part of this course)
Finite State Machines
• Moore
• Mealy
[Figure: block diagrams of a Moore and a Mealy machine, each drawn with Input, F, State, G, Output]
Which should we use for our program flow control unit?
Finite State Machines
• Lets us react immediately to an incoming flow control instruction!
• Beware of combinational loops (feedback paths from output to input)...
• Mealy!!!
[Figure: block diagram of a Mealy machine: Input, F, State, G, Output]
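The timing difference between the two machine types can be illustrated with a small Python sketch. It is not the course's PC FSM: the state names `IDLE`, `S1`, `S2`, the input `jmp`, and the two-cycle `forcenop` pulse are assumptions chosen for illustration. Both machines force two nops after a jump is seen; only the Mealy machine, whose output logic G also sees the input, starts in the same cycle as the jump arrives.

```python
def step_moore(state, jmp):
    """Moore: the output G depends on the state only."""
    forcenop = state in ("S1", "S2")
    if state == "IDLE":
        nxt = "S1" if jmp else "IDLE"
    elif state == "S1":
        nxt = "S2"
    else:
        nxt = "IDLE"
    return nxt, forcenop


def step_mealy(state, jmp):
    """Mealy: the output G depends on the state and the current input."""
    forcenop = (state == "IDLE" and jmp) or state == "S1"
    if state == "IDLE":
        nxt = "S1" if jmp else "IDLE"
    else:
        nxt = "IDLE"
    return nxt, forcenop


def run(step, inputs):
    """Clock the machine once per input and collect the outputs."""
    state, outputs = "IDLE", []
    for jmp in inputs:
        state, out = step(state, jmp)
        outputs.append(out)
    return outputs


inputs = [False, True, False, False, False]   # a jump arrives in cycle 1
print("Moore:", run(step_moore, inputs))
print("Mealy:", run(step_mealy, inputs))
```

In this sketch the Mealy machine needs one state fewer and asserts `forcenop` one cycle earlier than the Moore machine.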
A pipelined CPU
• Similar (but not identical) to Senior
• The PC FSM controls the jump instructions
Inputs:
– jmpdec = decoded instruction (what kind, number of delay slots)
– JT = jump taken? (deduced from IR2, flags)
Outputs:
– nextPC = target address, ++, stall
– forcenop = insert HW nop
Implement:
– jmp ds0/ds1 addr
– jmp.eq ds0/ds1 addr
Let’s try this program:
0: add (sets flags)
1: jmp.eq ds0 5
2: xxx
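[Figures: five snapshots of the pipeline (PC, IR, ..., IR3) with the PC FSM outputs while the program executes]
1. add in IR, PC = 1: nextPC = ++, forcenop = no, JT = -
2. jmp.eq ds0 5 in IR, PC = 2: nextPC = stall, forcenop = nop, JT = -
3. jmp.eq ds0 5 one stage further, PC = 2: nextPC = stall, forcenop = nop, JT = -
4. Jump taken! JT = 1: nextPC = TA, forcenop = nop
5. Jump NOT taken! JT = 0: nextPC = ++, forcenop = no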
Pipeline diagram ds0
0: add
1: jmp.eq ds0 5
2: xxx
3: yyy
4: zzz
5: and
6: www
Jump taken:
PC  IR   IR1  IR2  IR3  ...
0
1   add
2   jmp  add
2   nop  jmp  add
2   nop  nop  jmp  add
5   nop  nop  nop  jmp
6   and  nop  nop  nop
Jump not taken:
PC  IR   IR1  IR2  IR3  ...
0
1   add
2   jmp  add
2   nop  jmp  add
2   nop  nop  jmp  add
3   xxx  nop  nop  jmp
4   yyy  xxx  nop  nop
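State graph for ds0
[Figure: state graph with states 0 to 3; states on the jump path are labelled "jmp in IR1", "jmp in IR2", "jmp in IR3". Edge labels, written as input (outputs):]
– ds0 (nop, stall PC)
– - (nop, stall PC)
– JT (nop, PC=Target)
– not JT (PC++)
– - (PC++)
– jmp (PC++)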
Pipeline diagram ds1
0: add
1: jmp.eq ds1 5
2: xxx
3: yyy
4: zzz
5: and
6: www
Jump taken:
PC  IR   IR1  IR2  IR3  ...
0
1   add
2   jmp  add
3   xxx  jmp  add
3   nop  xxx  jmp  add
5   nop  nop  xxx  jmp
6   and  nop  nop  xxx
Jump not taken:
PC  IR   IR1  IR2  IR3  ...
0
1   add
2   jmp  add
3   xxx  jmp  add
3   nop  xxx  jmp  add
4   yyy  nop  xxx  jmp
5   zzz  yyy  nop  xxx
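State graph for ds0 and ds1
[Figure: state graph with states 0 to 4; states on the jump path are labelled "jmp in IR1", "jmp in IR2", "jmp in IR3". Edge labels, written as input (outputs):]
– ds0 (nop, stall PC)
– ds1 (PC++)
– - (nop, stall PC), on two edges
– JT (nop, PC=Target)
– not JT (PC++)
– - (PC++)
– jmp (PC++)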
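State graph (simplified)
[Figure: state graph with states 0, 1 and 2. Edge labels, written as input (outputs):]
– ds0 (nop, stall PC)
– ds1 (PC++)
– - (nop, stall PC)
– JT (nop, PC=Target)
– not JT (PC++)
– jmp (PC++)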
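The simplified state graph can be checked against the pipeline diagrams with a short simulation. The following Python sketch is one possible reading, not the reference design: the function name `simulate`, the string encoding of instructions, and passing `ds`, `taken` and `target` as arguments (instead of deriving them from jmpdec, IR2 and the flags) are assumptions made for illustration. The edge wiring is also an assumption: state 0 goes to 1 on a jump, 1 goes to 2, and 2 returns to 0.

```python
NOP = "nop"


def simulate(program, ds, taken, target, cycles):
    """Return one (PC, IR, IR1, IR2, IR3) row per clock cycle."""
    pc, state = 0, 0
    pipe = ["", "", "", ""]                      # IR, IR1, IR2, IR3
    rows = [(pc, *pipe)]
    for _ in range(cycles):
        is_jmp = pipe[0].startswith("jmp")       # jump decoded in IR
        # Mealy machine: outputs depend on the state and the inputs.
        if state == 0 and is_jmp and ds == 0:
            next_pc, forcenop, state = pc, True, 1        # ds0 (nop, stall PC)
        elif state == 0 and is_jmp and ds == 1:
            next_pc, forcenop, state = pc + 1, False, 1   # ds1 (PC++)
        elif state == 1:
            next_pc, forcenop, state = pc, True, 2        # - (nop, stall PC)
        elif state == 2 and taken:
            next_pc, forcenop, state = target, True, 0    # JT (nop, PC=Target)
        else:
            next_pc, forcenop, state = pc + 1, False, 0   # no jump or not JT (PC++)
        fetched = NOP if forcenop else program[pc]
        pipe = [fetched] + pipe[:3]              # advance the pipeline one stage
        pc = next_pc
        rows.append((pc, *pipe))
    return rows


program = ["add", "jmp", "xxx", "yyy", "zzz", "and", "www"]
for row in simulate(program, ds=0, taken=True, target=5, cycles=6):
    print(row)
```

Under these assumptions, calling `simulate` with `ds=0` or `ds=1` and `taken=True` or `taken=False` yields the rows of the four pipeline diagrams for the example program.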
Exercises!
4.1
0: jmp 5
1: xxx
2: yyy
3:
4:
5: zzz
jmp 5 1
5
4.1a
xxx will be executed
1 delay slot
add r3,r0,r1 nop
add r3,r2,r3 add r3,r0,r1
nop
add r3,r2,r3
4.1b
"Write-back to RF" is inside the register file
=> 1 nop
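The one-nop rule can be sketched as a small helper that pads a program wherever an instruction reads a register written by the instruction directly before it. This is an illustration only: the `(opcode, dest, sources)` tuple format and the name `insert_nops` are assumptions, and the helper models nothing beyond this single write-back hazard.

```python
NOP = ("nop", None, ())


def insert_nops(instrs):
    """Insert one nop between an instruction and a direct successor that reads its result.

    Each instruction is a hypothetical (opcode, dest, sources) tuple.
    """
    out = []
    prev_dest = None
    for instr in instrs:
        _, dest, sources = instr
        if prev_dest is not None and prev_dest in sources:
            out.append(NOP)                  # wait for write-back
        out.append(instr)
        prev_dest = dest
    return out


program = [
    ("add", "r3", ("r0", "r1")),             # add r3,r0,r1
    ("add", "r3", ("r2", "r3")),             # add r3,r2,r3 reads r3 right away
]
print([op for op, _, _ in insert_nops(program)])
```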
xxx
jmp.lt yyy
5
4.1c
0: set r2,10
1: nop
2: jump.lt r0,r2,20
3: xxx ; Delay slot 1
4: yyy ; Delay slot 2
5: zzz ; Delay slot 3
…
20:
20
4.1c
set r2, 10
nop ; Wait for write-back
jump.lt r0, r2, skip
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
set r3, 55
nop ; Wait for write-back
add r0, r0, r3
jump.eq r14, r14, endprog ; No need for unconditional jump
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
skip:
set r3, 48
nop ; Wait for write-back
add r0, r0, r3
set r3, 1
nop ; Wait for write-back
jump.neq r1, r3, endprog
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
set r3, 32
nop ; Wait for write-back
add r0, r0, r3
endprog:
25 instructions
4.1c
set r2, 10
set r3, 55 ; May not be used, but in case
jump.lt r0, r2, skip
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
jump.eq r14, r14, endprog ; No need for unconditional jump
add r0, r0, r3
nop ; Delay slot 2
nop ; Delay slot 3
skip:
set r3, 48
nop ; Wait for write-back
add r0, r0, r3
set r3, 1
nop ; Wait for write-back
jump.neq r1, r3, endprog
set r3, 32
nop ; Delay slot 2
nop ; Delay slot 3
add r0, r0, r3
endprog:
20 instructions
4.1c
set r2, 10
set r3, 55 ; May not be used, but in case
jump.lt r0, r2, skip
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
jump.eq r14, r14, endprog ; No need for unconditional jump
add r0, r0, r3
nop ; Delay slot 2
nop ; Delay slot 3
skip:
set r4, 1
set r3, 48
jump.neq r1, r4, endprog
add r0, r0, r3
set r3, 32
nop ; Delay slot 3
add r0, r0, r3
endprog:
17 instructions
4.1c
set r2, 10
set r3, 55
jump.lt r0, r2, skip
set r4, 48
set r5, 1
nop ; Delay slot 3
jump.eq r14, r14, endprog ; No need for unconditional jump
add r0, r0, r3
nop ; Delay slot 2
nop ; Delay slot 3
skip:
jump.neq r1, r5, endprog
add r0, r0, r4
set r3, 32
nop ; Delay slot 3
add r0, r0, r3
endprog:
15 instructions
4.1c
set r13, 0 ; Constant 0
set r14, 1 ; Constant 1
set r15, 36 ; Loop counter
loop:
ld r2, [r0]
add r0, r0, r14
st [r1], r2
add r1, r1, r14
sub r15, r15, r14
nop ; Wait for write-back
jump.neq r15, r13, loop
nop ; Delay slot 1
nop ; Delay slot 2
nop ; Delay slot 3
3 + 36 × 10 = 363 cycles
4.1c
set r13, 0 ; Constant 0
set r14, 1 ; Constant 1
set r15, 36 ; Loop counter
loop:
sub r15, r15, r14
ld r2, [r0]
jump.neq r15, r13, loop
add r0, r0, r14
st [r1], r2
add r1, r1, r14
3 + 36 × 6 = 219 cycles
4.1c unfold loop
set r13, 0 ; Constant 0
set r14, 1 ; Constant 1
set r15, 18 ; Loop counter
loop:
ld r2, [r0]
add r0, r0, r14
st [r1], r2
add r1, r1, r14
sub r15, r15, r14
ld r2, [r0]
jump.neq r15, r13, loop
add r0, r0, r14
st [r1], r2
add r1, r1, r14
3 + 18 × 10 = 183 cycles
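The cycle counts for the three loop versions follow the same pattern: setup instructions plus iterations times instructions per loop body. A minimal Python sketch of that arithmetic, assuming one instruction (nops included) issues per cycle; the helper name `loop_cycles` is hypothetical:

```python
def loop_cycles(setup, iterations, body):
    """Total cycles = setup instructions + iterations * instructions per iteration."""
    return setup + iterations * body


versions = {
    "nops in the delay slots": loop_cycles(setup=3, iterations=36, body=10),
    "delay slots filled": loop_cycles(setup=3, iterations=36, body=6),
    "unfolded, delay slots filled": loop_cycles(setup=3, iterations=18, body=10),
}
for name, cycles in versions.items():
    print(f"{name}: {cycles} cycles")
```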
[Figure: state graphs for the PC FSM; the full graph has states 0 to 3. Edge labels, written as input (outputs): jmp (PC++); - (PC++); JT (PC=PFC_DATA_2); not JT (PC++)]
jmp = PFC_OP[4]
JT = PFC_OP2[4]*(PC_FSM_EQUAL*(PFC_OP2[1:0]==0) + …)
4.1d
PC = JT ? PFC_DATA2 : PC + 1
[Figure: next-PC logic. Labels: +1, multiplexer (inputs 1, 0), PC, multiplexer (inputs 2, 1, 0), &, PFC_OP[4], PFC_OP[1:0], jump decision]