Low Power Digital Design
• Low-power Finite-State Machine (FSM) Design
– Design model for partitioned FSMs based on mixed synchronous/asynchronous state memory
– Comparative study of low-voltage performance of standard-cell flip-flops
Low Power Design
Design Model for Partitioned Finite-State Machines Based on Mixed Synchronous/Asynchronous State Memory
Bengt Oelmann
Design Model for Partitioned FSMs
• Objective of this work
– Provide a design model that enables low-power FSM design with low area overhead
• Outline
– Background to low-power FSM design
– Basic principles of the proposed design model
– Outline the design procedure
– Indicate improvements compared to other work
– Future work
Power consumption in static digital CMOS
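Components: leakage, short circuit, dynamic

$$P_{dynamic} = \sum_i \alpha_i C_i f V_{DD}^2$$

where $\alpha_i$ is the switching activity and $C_i$ the capacitance of node $i$, $f$ is the clock frequency and $V_{DD}$ is the supply voltage.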
Dynamic power management
• Keep in mind for the following discussion
– In static CMOS circuits quiescent portions of the circuit dissipate a minimal amount of power
– Average power can be reduced even when hardware is added, provided we can guarantee that less hardware is active on average.
$$P_{dynamic} = \sum_i \alpha_i C_i f V_{DD}^2$$

Goal: reduce the effective capacitance, $\sum_i \alpha_i C_i$
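
The point that added hardware can still lower average power can be checked numerically. The sketch below evaluates the dynamic power expression for a monolithic block and for a partitioned version with extra gating logic. All capacitances, activities, the clock frequency and the supply voltage are hypothetical values chosen only for illustration.

```python
def dynamic_power(nodes, f_hz, vdd):
    """Return sum_i(alpha_i * C_i) * f * Vdd^2 for a list of (alpha, C) pairs."""
    effective_capacitance = sum(alpha * cap for alpha, cap in nodes)
    return effective_capacitance * f_hz * vdd ** 2

# Hypothetical operating point (not taken from the slides).
F_HZ = 50e6
VDD = 3.3

# One block that is always clocked: (switching activity, capacitance in farads).
monolithic = [(0.5, 10e-12)]

# Partitioned version: a small active part, a large mostly idle part,
# and the added shutdown logic. Total capacitance is larger than before.
partitioned = [(0.5, 3e-12), (0.05, 8e-12), (0.5, 0.5e-12)]

p_mono = dynamic_power(monolithic, F_HZ, VDD)
p_part = dynamic_power(partitioned, F_HZ, VDD)

print(f"monolithic:  {p_mono * 1e3:.3f} mW")
print(f"partitioned: {p_part * 1e3:.3f} mW")
```

With these assumed numbers the partitioned design has more total capacitance but a smaller effective capacitance, so its dynamic power is lower.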
Functional overhead in power management
• Systems are in most cases designed for worst-case conditions, here meaning full utilization
– they are fully utilized only during a small fraction of their operational time
– a shutdown mechanism shuts down the parts of the design that are not needed when full utilization is not required
• Functional overhead
– shutdown circuits are not needed for functional correctness
• The power management system must never degrade the design to the point of violating its performance constraints: area, timing, power (!)
Where is power management used?
[Figure: design abstraction levels (algorithm level, architectural level, register-transfer level, logic level) with the range covered by dynamic power management techniques indicated]
Implementation of Power Management
• The design must be partitioned
– idle states of the different units must be detected so that these units can be shut down
• Partitioning the design according to the functional units
– can be made manually and intuitively -- the functional units are well separated and easy to identify
– clock gating is introduced in a small number of places
• Partitioning of one single functional unit
– it is less obvious how to partition the unit
– automated procedures are needed and supported by CAD tools
Low Power FSM design
Approaches to low-power FSM design
• State encoding
• Shut-down techniques
– Gated clock
– Input disabling
Gated-clock approaches result in FSMs with 35% lower power consumption than state-encoding techniques [Chow96]
Partitioned FSMs
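[Figure: state transition graph with states S0 to S4, edges labelled with transition probabilities (0.6, 0.2, 0.1, 0.04, 0.04, 0.01, 0.005, 0.005), and the sets S, X and Y marked]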
Organization of the states
• Local states
– States from the original FSM, stored in state memory based on flip-flops triggered by the active edge of the clock signal
• Global states
– The global state indicates which one of the sub-FSMs is active
– It is updated independently of the clock signal (asynchronously)
– Entering a g-state will change the global state
• Restrictions on state assignment
– Coupled states must have identical codes ($g_i$, $s_i$)
– Other states may share the same codes if they reside in different sub-FSMs
Basic principles of the proposed design model
[Figure: partitioning the original FSM; separate sub-FSMs with coupled states indicated]
Crossing transitions
• Original transition from $s_5$ to $s_1$ in one clock cycle
• In the transformed graph the $s_5$ to $s_1$ “transition” requires two transitions
– In a fully synchronous operation this takes 2 cycles
– or simultaneous clocking of $F_1$ and $F_2$ in the transition cycle
– or the crossing transition is handled asynchronously
[Figure: sub-FSMs $F_1$ and $F_2$ with states $s_5$, $g_1$ and $s_1$ and the crossing transition marked]
Example, coupled states
• Let there be a partition $\{S_1, S_2, S_3, S_4\}$ resulting in the following sets of local states:
$U_1 = \{s_1, s_2, s_3, g_4\}$
$U_2 = \{s_4, s_5, s_6, g_7\}$
$U_3 = \{s_7, g_1\}$
$U_4 = \{s_8, s_9, s_{10}, s_{11}, s_{12}, s_{13}, g_1, g_5\}$
• Clustering of the coupled states
Example, clustering the remaining states
• States not coupled may be freely placed in any cluster
$U_1 = \{s_1, s_2, s_3, g_4\}$
$U_2 = \{s_4, s_5, s_6, g_7\}$
$U_3 = \{s_7, g_1\}$
$U_4 = \{s_8, s_9, s_{10}, s_{11}, s_{12}, s_{13}, g_1, g_5\}$
Example, state encoding
• The procedure ensures that each sub-FSM needs only the minimum number of bits in the state memory
• Binary state encoding in re-ordered state table
Saving flip-flops
• States in different sub-FSMs share the same code
• Total number of bits in local state memory
– Sharing: 3
– Not sharing: 1 + 2 + 2 + 3 = 8
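
The two bit counts above follow from the sizes of the local state sets in the example (4, 4, 2 and 8 states). A minimal sketch of that arithmetic, assuming plain binary encoding per sub-FSM; the helper name `bits_needed` is introduced here for illustration only:

```python
def bits_needed(num_states):
    """Minimum number of bits for a binary encoding of num_states states."""
    return max(1, (num_states - 1).bit_length())

# Sizes of the local state sets U1..U4 from the example.
sub_fsm_sizes = {"U1": 4, "U2": 4, "U3": 2, "U4": 8}

bits = {name: bits_needed(size) for name, size in sub_fsm_sizes.items()}

# Separate state memory for every sub-FSM: the widths add up.
not_sharing = sum(bits.values())

# Shared local state memory: the widest sub-FSM sets the width.
sharing = max(bits.values())

print(bits)
print("not sharing:", not_sharing, "sharing:", sharing)
```

Sharing works because only one sub-FSM is active at a time, so states in different sub-FSMs may reuse the same codes.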
Improvements compared to others ...
• Performance of previously presented mixed synch./asynch. approach
• Estimated performance of this approach
– Significant reduction in both area and power for the output logic
Work                                                                      Average power reduction   Max. power reduction
Benini et al., ISCAS ’98                                                  -31%                      -43%
Chow et al., ACM Trans. on Design Automation of Electronic Systems ’96    -29%                      -59%
Benini et al., Trans. on CAD ’96                                          -19%                      -49%
Previous mixed synch./asynch. method by us                                -45%                      -68%
Future work
• Develop suitable implementation architecture
• Performance evaluation using standard benchmarks
• Design automation
– Develop CAD-tool for automatic synthesis
• Low-level optimization
– e.g. optimal state encoding
• Apply this approach to “real-world” problems
– e.g. high-speed -- low-power protocol processors
• Partitioning of FSM along with data path
• ... this work is well suited for a PhD-thesis project
Comparative study of flip-flops
Comparative study of low-voltage performance of standard-cell flip-flops
Xue Shang and Bengt Oelmann
Comparative study of flip-flops
• Objective
– Characterize and compare the power consumption and speed performance of flip-flops designed as standard cells
– Propose suitable combination of flip-flop types to be included in a cell library used in power-driven synthesis
• Outline
– Motivation
– Background to standard cell design
– Different types of flip-flops
– Characterization of the flip-flops
– Simulation results
– Conclusions
Motivation
• Pipelining together with voltage scaling is an efficient way to achieve both high speed and low power
• With increased pipelining, flip-flops occupy a larger part of the design
case 1: The delay in the critical path is T at Vdd = V1
case 2: Pipeline registers are introduced to shorten the critical-path delay at Vdd = V1; Vdd is then reduced to V2 so that the critical-path delay becomes T again
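
A rough estimate of what case 2 buys can be sketched as follows. It assumes that dynamic power dominates and scales with C f Vdd^2, that the clock frequency is the same in both cases, and that the pipeline registers add a fixed fraction of switched capacitance. The voltages and the 20% register overhead are hypothetical values, not results from this work.

```python
def relative_power(v1, v2, register_overhead):
    """Dynamic power of case 2 relative to case 1 at the same clock frequency.

    register_overhead is the extra switched capacitance of the pipeline
    registers as a fraction of the original capacitance (an assumption).
    """
    return (1.0 + register_overhead) * (v2 / v1) ** 2

# Hypothetical values for illustration only.
V1 = 3.3
V2 = 1.8
OVERHEAD = 0.20

ratio = relative_power(V1, V2, OVERHEAD)
print(f"case 2 uses about {ratio:.0%} of the case 1 dynamic power")
```

The larger the share of flip-flops in the pipelined design, the larger the overhead term, which is why the power consumption of the flip-flops themselves matters.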
Standard-cell based design flow
[Figure: design flow]
HDL code (RTL) → automatic synthesis → gate netlist → to physical design
Inputs to automatic synthesis: constraints and the standard cell library
Constraints:
– power (max power)
– timing (max cycle time)
– area (max circuit area)
Flip-flops for timing-driven synthesis
• In timing-driven synthesis only one type of flip-flop is needed in the library
• The synthesis tool picks the flip-flop with the best driving capability for the actual loading condition
[Figure: combinational logic surrounded by flip-flops of different drive strengths (FFx3, FFx1, FFx1)]
Flip-flops for power-driven synthesis
• The synthesis tool must compute the switching probability ($\alpha$) and signal probability ($p$) at every node in the network
• The synthesis tool will pick the flip-flop with the lowest power consumption for the actual $\alpha$ and $p$.
[Figure: network with each node annotated with its ($\alpha_i$, $p_i$) pair and flip-flops of type 1 or type 2 selected accordingly]
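
The selection step can be sketched as a lookup over a small flip-flop library. The sketch below assumes a linear power model per flip-flop type (a fixed part plus a part proportional to the switching probability) and ignores the signal probability. The two entries and their numbers are hypothetical, in arbitrary units, and are not characterization results.

```python
# Hypothetical library: type -> (fixed power, power per unit switching probability).
# Arbitrary units; one type is cheap when idle, the other cheap when toggling.
LIBRARY = {
    "strongARM": (2.0, 1.0),
    "SRIS": (1.0, 5.0),
}

def ff_power(ff_type, alpha):
    """Power of one flip-flop at switching probability alpha (linear model, assumed)."""
    fixed, per_toggle = LIBRARY[ff_type]
    return fixed + alpha * per_toggle

def pick_flip_flop(alpha):
    """Return the library type with the lowest modelled power at this alpha."""
    return min(LIBRARY, key=lambda ff_type: ff_power(ff_type, alpha))

for alpha in (0.05, 0.25, 0.9):
    print(alpha, pick_flip_flop(alpha))
```

A real power-driven flow would take the power figures from the characterized cell library rather than from a two-parameter model.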
Design criteria on standard cell flip-flops
• Static CMOS
– Robustness
– no restrictions on the lowest allowable clock frequency
• Single clock phase
– facilitates the automated design process supported by CAD-tools
• Single-ended data inputs
• Primary inputs of the flip-flop cell may only be connected to gate terminals of transistors
– Source or drain connections are not well suited for simple timing calculations
Types of flip-flops
• Two different types of master-slave flip-flops
[Figure: master-slave flip-flop schematics; labels: n-latch, p-latch, latch, latch, differential data input, two-phase clock]
Characterization of flip-flops
• Parameters to control for comparison
– common technology (0.6 µm)
– transistor sizing
– input signal transition time
– loading conditions
– data input sequence
• Simulation testbench
[Figure: simulation testbench with D flip-flops, clock input C and load capacitances CL]
Timing
• Cycle time calculation in synchronous designs
$$T_{cycle} = t_{C-Q} + t_{comb} + t_{setup}$$
• Performance measure of interest
$$t_{D-Q} = t_{setup} + t_{C-Q}$$
How to determine $t_{D-Q}$?
• $t_{C-Q} = f(t_{setup})$
– Stable region ($t_{C-Q}$ = constant)
– Meta-stable region ($t_{C-Q} = f(t_{C-C})$)
– Failure region
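
One way to extract the minimum D-Q delay from such a characterization is to sweep the setup time, record the clock-to-Q delay at each point, and take the smallest sum. The sketch below assumes the sweep is already available as a list of pairs; the sample values (in picoseconds) are hypothetical and only mimic the three regions.

```python
def min_d_to_q(sweep):
    """Return (t_d_q, t_setup, t_c_q) with the smallest t_d_q = t_setup + t_c_q.

    sweep holds (t_setup, t_c_q) pairs; t_c_q is None where the flip-flop
    failed to capture the data (failure region).
    """
    valid = [(ts + tcq, ts, tcq) for ts, tcq in sweep if tcq is not None]
    if not valid:
        raise ValueError("no successful capture in the sweep")
    return min(valid)

# Hypothetical sweep in picoseconds: stable, then meta-stable, then failure.
sweep_ps = [
    (1000, 500), (600, 500), (400, 500),              # stable: t_C-Q constant
    (300, 520), (200, 580), (100, 750), (50, 1100),   # meta-stable: t_C-Q grows
    (0, None),                                         # failure
]

t_d_q, t_setup, t_c_q = min_d_to_q(sweep_ps)
print(f"minimum t_D-Q = {t_d_q} ps at t_setup = {t_setup} ps (t_C-Q = {t_c_q} ps)")
```

In this made-up data the minimum lies inside the meta-stable region, where a shorter setup time still outweighs the growing clock-to-Q delay.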
Power
• Power dissipation is separated into
– Data power (power to switch the data input)
– Clock power (power to switch the clock input)
– Internal power (power dissipation within the flip-flop)
• Data input pattern ($\alpha$ is the switching activity of the data input)
– Worst power ($\alpha = 1$): ... 010101 ...
– Average power ($\alpha = 0.5$): pseudo-random pattern
– Minimum power ($\alpha = 0$): ... 000000 ... or ... 111111 ...
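
The three data input patterns are easy to generate and check. The sketch below is in Python for illustration (the slides name Matlab for test pattern generation); the function names and the seed are assumptions of this example.

```python
import random

def worst_case(n):
    """Alternating pattern ... 010101 ...: the data input toggles every cycle."""
    return [i % 2 for i in range(n)]

def constant(n, value=0):
    """Constant pattern ... 000000 ... or ... 111111 ...: no toggles."""
    return [value] * n

def pseudo_random(n, seed=1):
    """Pseudo-random bits from a seeded generator, reproducible between runs."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]

def switching_activity(bits):
    """Fraction of consecutive cycles in which the bit value changes."""
    toggles = sum(a != b for a, b in zip(bits, bits[1:]))
    return toggles / (len(bits) - 1)

for name, bits in [("worst", worst_case(10000)),
                   ("average", pseudo_random(10000)),
                   ("minimum", constant(10000))]:
    print(name, round(switching_activity(bits), 3))
```

The pseudo-random pattern only approaches a switching activity of 0.5 for long sequences, so short test patterns give a noisier average-power figure.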
Flip-flops
• C2MOS
– based on clocked CMOS stages (Traditional)
• MUX
– Multiplexer based, static combinational gates (Traditional)
• SRIS
– Static Ratio-Insensitive Latch (Yuan and Svensson 1997)
• SSTC
– Static Single Transistor Clocked (Yuan and Svensson 1997)
• strongARM
– Used in ARM RISC processors (Advanced RISC Machines)
• TG
Delays
• Minimum D-Q delay
Power-Delay-Product
Power consumption
Where is the power dissipated?
Conclusions
• Voltage scaling
– is efficient down to 1.8 V
• Speed
– strongARM is the fastest flip-flop with low power consumption
• Power
– for low switching activity SRIS has half the power consumption compared to strongARM
– for high switching activity SRIS has up to twice the power consumption compared to strongARM
• For a standard cell library
– use both strongARM and SRIS -- let the synthesis tool pick the
Concluding remarks
• Large number of simulations must be carried out
– an automated characterization procedure is a must
• Simulation environment built on
– PowerMill
• A fast analog simulator
– Matlab
• Test pattern generation
• Automatic analysis of simulation results
• Graphical User interface
• Graphical presentation of simulation results
– Perl
• Netlist conversions