Low Power Digital Design

(1)

Low Power Digital Design

• Low-power Finite-State Machine (FSM) Design

– Design model for partitioned FSMs based on mixed synchronous/asynchronous state memory

– Comparative study of low-voltage performance of standard-cell flip-flops

(2)

Low Power Design

Design Model for Partitioned Finite-State Machines based on mixed synchronous/asynchronous state memory

Bengt Oelmann

(3)

Design Model for Partitioned FSMs

• Objective of this work

– Provide a design model that enables low-power FSM design with low area overhead

• Outline

– Background to low-power FSM design

– Basic principles of the proposed design model

– Outline the design procedure

– Indicate improvements compared to other work

– Future work

(4)

Power consumption in static digital CMOS

Power consumption in static CMOS

Leakage, short circuit, dynamic

$P_{dynamic} = \sum_i \alpha_i C_i f V_{DD}^2$

where $\alpha_i$ is the switching activity and $C_i$ the capacitance of node $i$



(5)

Dynamic power management

• To keep in mind for the discussion that follows

– In static CMOS circuits quiescent portions of the circuit dissipate a minimal amount of power

– Average power can be reduced even when hardware is added, provided that less hardware is active on average.

$P_{dynamic} = \sum_i \alpha_i C_i f V_{DD}^2$

Goal: reduce the effective capacitance, $\sum_i \alpha_i C_i$
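The effect of shutting down idle hardware on the dynamic power can be illustrated with a minimal sketch. The node lists, capacitances, clock frequency, and supply voltage below are hypothetical values chosen only for illustration; the function simply evaluates the sum of activity times capacitance, scaled by f and V_DD squared.

```python
def dynamic_power(nodes, f, vdd):
    """Evaluate sum_i(alpha_i * C_i) * f * vdd^2 for nodes given as (alpha_i, C_i)."""
    effective_capacitance = sum(alpha * c for alpha, c in nodes)
    return effective_capacitance * f * vdd ** 2

# Hypothetical circuit: four nodes of 10 fF, each switching in half of the cycles.
base = [(0.5, 10e-15)] * 4

# Same circuit with power management (all values are assumptions):
# two nodes are idle and shut down (alpha = 0), and one extra control
# node of 10 fF is added that switches in 10% of the cycles.
managed = [(0.5, 10e-15), (0.5, 10e-15), (0.0, 10e-15), (0.0, 10e-15), (0.1, 10e-15)]

p_base = dynamic_power(base, f=100e6, vdd=3.3)
p_managed = dynamic_power(managed, f=100e6, vdd=3.3)
print(f"power with management / power without: {p_managed / p_base:.2f}")
```

In this made-up case the managed circuit has more hardware but a lower effective capacitance, so its dynamic power is lower.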

(6)

Functional overhead in power management

• Systems are in most cases designed for worst-case conditions, here meaning full utilization

– they are fully utilized only during a small fraction of their operational time

– a shutdown mechanism turns off the parts of the design that are not needed when full utilization is not required

• Functional overhead

– shutdown circuits are not needed for functional correctness

• The power management system must never degrade the design to the point of violating its performance constraints: area, timing, power (!)

(7)

Where is power management used?

Algorithm level

Architectural level

Register-Transfer level

Logic level

Dynamic power management techniques

(8)

Implementation of Power Management

• The design must be partitioned

– idle states of the different units must be detected so that these units can be shut down

• Partitioning the design according to the functional units

– can be done manually and intuitively, since the functional units are well separated and easy to identify

– clock gating is introduced in only a small number of places

• Partitioning of one single functional unit

– it is less obvious how to partition the unit

– automated procedures, supported by CAD tools, are needed

(9)

Low Power FSM design

Approaches to low-power FSM design

– State encoding

– Shut-down techniques: gated clock, input disabling

Gated-clock approaches result in FSMs with 35% lower power consumption than state-encoding techniques [Chow96]

(10)

Partitioned FSMs

[Figure: state graph with states S0 to S4 annotated with probabilities (0.6, 0.2, 0.1, 0.04, 0.04, 0.01, 0.005, 0.005); signals labeled X, Y and S]

(11)

Organization of the states

• Local states

– States from the original FSM stored in state memory based on flip-flops triggered by active edge of the clock signal

• Global states

– The global state indicates which of the sub-FSMs is active

– It is updated independently of the clock signal (asynchronously)

– Entering a g-state changes the global state

• Restrictions on state assignment

– Coupled states (g_i, s_i) must have identical codes

– Other states may share the same codes if they reside in different sub-FSMs

(12)

Basic principles of the proposed design model

[Figure: partitioning the original FSM]

[Figure: separate sub-FSMs with coupled states indicated]

(13)

Crossing transitions

• Original transition from s₅ to s₁ in one clock cycle

• In the transformed graph the s₅ to s₁ “transition” requires two transitions

– In fully synchronous operation this takes 2 cycles

– or simultaneous clocking of F¹ and F² in the transition cycle

– or the crossing transition is handled asynchronously

[Figure: crossing transition from s₅ to s₁ via g₁, between sub-FSMs F¹ and F²]

(14)

Example, coupled states

• Let there be a partition {S¹, S², S³, S⁴} resulting in the following sets of local states:

U¹ = {s₁, s₂, s₃, g₄}

U² = {s₄, s₅, s₆, g₇}

U³ = {s₇, g₁}

U⁴ = {s₈, s₉, s₁₀, s₁₁, s₁₂, s₁₃, g₁, g₅}

• Clustering of the coupled states

(15)

Example, clustering the remaining states

• States not coupled may be freely placed in any cluster

U¹ = {s₁, s₂, s₃, g₄}

U² = {s₄, s₅, s₆, g₇}

U³ = {s₇, g₁}

U⁴ = {s₈, s₉, s₁₀, s₁₁, s₁₂, s₁₃, g₁, g₅}

(16)

Example, state encoding

• The procedure ensures that each sub-FSM needs only the minimum number of bits in the state memory

• Binary state encoding in re-ordered state table

(17)

Saving flip-flops

• States in different sub-FSMs share the same code

• Total number of bits in local state memory

– Sharing: 3

– Not sharing: 1 + 2 + 2 + 3 = 8
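The two totals above can be reproduced with a short sketch. It assumes that the sub-FSM sizes are the sizes of the local state sets U¹ to U⁴ from the example (4, 4, 2 and 8 states, g-states included), that binary encoding is used, and that "sharing" means one local state memory sized for the largest sub-FSM. The global state memory is not counted. The function and variable names are hypothetical.

```python
def bits_needed(num_states):
    """Minimum number of flip-flops for a binary encoding of num_states states."""
    return max(1, (num_states - 1).bit_length())

# Sizes of the local state sets from the example (assumed, g-states included).
sub_fsm_sizes = {"U1": 4, "U2": 4, "U3": 2, "U4": 8}

bits = {name: bits_needed(n) for name, n in sub_fsm_sizes.items()}

# Separate state memory per sub-FSM: the widths add up.
not_sharing = sum(bits.values())

# One shared local state memory: only the widest sub-FSM matters.
sharing = max(bits.values())

print(bits, "not sharing:", not_sharing, "sharing:", sharing)
```

Sharing works because only one sub-FSM is active at a time, so identical codes in different sub-FSMs are told apart by the global state.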

(18)

Improvements compared to others ...

• Performance of previously presented mixed synch./asynch. approach

• Estimated performance of this approach

– Significant reduction in both area and power for the output logic

Work                                                              Average power reduction   Max. power reduction

Benini et al., ISCAS ’98                                          -31%                      -43%

Chow et al., ACM Trans. on Design Automation of Electronic
Systems ’96                                                       -29%                      -59%

Benini et al., Trans. on CAD ’96                                  -19%                      -49%

Previous mixed synch./asynch. method by us                        -45%                      -68%

(19)

Future work

• Develop suitable implementation architecture

• Performance evaluation using standard benchmarks

• Design automation

– Develop CAD-tool for automatic synthesis

• Low-level optimization

– e.g. optimal state encoding

• Apply this approach to “real-world” problems

– e.g. high-speed, low-power protocol processors

• Partitioning of FSM along with data path

• ... this work is well suited for a PhD-thesis project

(20)

Comparative study of flip-flops

Comparative study of low-voltage performance of standard-cell flip-flops

Xue Shang and Bengt Oelmann

(21)

Comparative study of flip-flops

• Objective

– Characterize and compare the power consumption and speed performance of flip-flops designed as standard cells

– Propose suitable combination of flip-flop types to be included in a cell library used in power-driven synthesis

• Outline

– Motivation

– Background to standard cell design

– Different types of flip-flops

– Characterization of the flip-flops

– Simulation results

– Conclusions

(22)

Motivation

• Pipelining together with voltage scaling is an efficient way to achieve both high speed and low power

• Increased pipelining → flip-flops will occupy a larger part of the design

Case 1: The delay in the critical path is T at V_dd = V₁

Case 2: Pipeline registers are introduced to shorten the critical-path delay at V_dd = V₁. V_dd is then reduced to V₂ so that the critical-path delay becomes T again.
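A rough first-order estimate of the gain from case 1 to case 2 follows from the dynamic power expression. It assumes that dynamic power dominates and that the clock frequency f is the same in both cases; C₁ and C₂ are symbols introduced here for the total effective switched capacitance before and after the pipeline registers are added.

```latex
\frac{P_2}{P_1}
  = \frac{C_2 \, f \, V_2^2}{C_1 \, f \, V_1^2}
  = \frac{C_2}{C_1} \left( \frac{V_2}{V_1} \right)^2
```

The quadratic voltage term gives the saving, while the ratio C₂/C₁ > 1 reflects the capacitance added by the pipeline registers, which is why the power consumption of the flip-flops themselves becomes important.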

(23)

Standard-cell based design flow

HDL code (RTL)

Automatic Synthesis

Gate netlist

constraints standard

cell library

to physical design

power (max power)

timing (max cycle time)

area (max circuit area)

(24)

Flip-flops for timing-driven synthesis

• In timing-driven synthesis only one type of flip-flop is needed in the library

• The synthesis tool picks the flip-flop with the best driving capability for the actual loading condition

[Figure: flip-flops of different drive strengths (FFx1, FFx3) driving combinational logic]

(25)

Flip-flops for power-driven synthesis

• The synthesis tool must compute the switching probability (α) and signal probability (p) at every node in the network

• The synthesis tool will pick the flip-flop with the lowest power consumption for the actual α and p.

[Figure: network with nodes annotated (α₀,p₀) to (α₈,p₈); the flip-flops are mapped to FF type 1 or FF type 2]
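The selection step can be sketched as a lookup that picks the cell with the lowest estimated power for a given switching probability. Everything in the sketch is an assumption made for illustration: the linear power model, the numbers in the library, and the function names. The signal probability p is ignored for simplicity, and no real synthesis tool API is implied.

```python
# Hypothetical power model per cell, in arbitrary units:
#   P(alpha) = p_idle + alpha * p_switch
FF_LIBRARY = {
    "FF type 1": {"p_idle": 1.0, "p_switch": 6.0},  # cheap when data rarely toggles
    "FF type 2": {"p_idle": 2.0, "p_switch": 2.0},  # cheap when data toggles often
}

def ff_power(cell, alpha):
    """Estimated power of a library cell at switching probability alpha."""
    model = FF_LIBRARY[cell]
    return model["p_idle"] + alpha * model["p_switch"]

def pick_flip_flop(alpha):
    """Return the library cell with the lowest estimated power at alpha."""
    return min(FF_LIBRARY, key=lambda cell: ff_power(cell, alpha))

for alpha in (0.1, 0.9):
    print(alpha, pick_flip_flop(alpha))
```

With these made-up numbers the two cells cross over at α = 0.25, so a library holding both types lets each register be mapped to the cheaper one.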

(26)

Design criteria on standard cell flip-flops

• Static CMOS

– Robustness

– no restrictions on the lowest allowable clock frequency

• Single clock phase

– facilitates the automated design process supported by CAD-tools

• Single-ended data inputs

• Primary inputs of the flip-flop cell must only be connected to gate terminals of transistors

– Source or drain connections are not well suited for simple timing calculations

(27)

Types of flip-flops

• Two different types of master-slave flip-flops

[Figure: two master-slave flip-flop structures built from latch pairs; labels: n-latch, p-latch, differential data input, two-phase clock]

(28)

Characterization of flip-flops

• Parameters to control for comparison

– common technology (0.6 µm)

– transistor sizing

– input signal transition time

– loading conditions

– data input sequence

• Simulation testbench

[Figure: simulation testbench with the flip-flop under test (D, C, Q) and load capacitances C_L]

(29)

Timing

• Cycle time calculation in synchronous designs

$T_{cycle} \ge t_{C-Q} + t_{comb} + t_{setup}$

• Performance measure of interest

$t_{D-Q} = t_{setup} + t_{C-Q}$

(30)

How to determine t_D-Q?

• t_C-Q = f(t_setup)

– Stable region (t_C-Q = constant)

– Meta-stable region (t_C-Q = f(t_C-C))

– Failure region
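One way to read these regions is as a sweep: the data edge is moved closer to the clock edge, t_C-Q is recorded for each setup time, and the smallest sum of the two is taken as the D-Q delay. The sketch below follows that reading. The sweep values, the combinational delay, and all names are hypothetical; `None` marks a point in the failure region where the data is not captured.

```python
# Hypothetical sweep results as (t_setup, t_C-Q) pairs in ns.
sweep = [
    (0.50, 0.40),  # stable region: t_C-Q constant
    (0.40, 0.40),
    (0.30, 0.40),
    (0.20, 0.45),  # meta-stable region: t_C-Q grows as setup time shrinks
    (0.10, 0.60),
    (0.05, None),  # failure region: data not captured
]

def min_d_to_q(samples):
    """Return (t_D-Q, t_setup, t_C-Q) for the smallest t_setup + t_C-Q."""
    valid = [(ts + tcq, ts, tcq) for ts, tcq in samples if tcq is not None]
    return min(valid)

def min_cycle_time(t_dq, t_comb):
    """Lower bound on the cycle time: t_C-Q + t_comb + t_setup = t_D-Q + t_comb."""
    return t_dq + t_comb

t_dq, t_setup, t_cq = min_d_to_q(sweep)
print(f"t_D-Q = {t_dq:.2f} ns at t_setup = {t_setup:.2f} ns")
print(f"T_cycle >= {min_cycle_time(t_dq, t_comb=1.5):.2f} ns")
```

In this made-up data the minimum lies just inside the meta-stable region, where the extra clock-to-output delay is still smaller than the setup time that was saved.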

(31)

Power

• Power dissipation is separated into

– Data power (power to switch the data input)

– Clock power (power to switch the clock input)

– Internal power (power dissipation within the flip-flop)

• Data input pattern

– Worst power (α = 1): ... 010101 ...

– Average power (α = 0.5): pseudo-random pattern

– Minimum power (α = 0): ... 000000 ... or ... 111111 ...
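The relation between the three input patterns and the switching activity can be checked with a small sketch. It assumes that the activity is measured as the fraction of cycles in which the data input changes value. Python's seeded generator stands in for the pseudo-random pattern source, and the sequence length and function name are arbitrary choices.

```python
import random

def switching_activity(bits):
    """Fraction of consecutive sample pairs in which the data input changes value."""
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)

n = 1000
worst = [i % 2 for i in range(n)]                # ... 010101 ...
minimum = [0] * n                                # ... 000000 ...
rng = random.Random(1)                           # fixed seed for repeatability
average = [rng.randint(0, 1) for _ in range(n)]  # stand-in pseudo-random pattern

for name, pattern in (("worst", worst), ("average", average), ("minimum", minimum)):
    print(name, round(switching_activity(pattern), 3))
```

The alternating and constant patterns give exactly 1 and 0, while the pseudo-random pattern only approaches 0.5 as the sequence gets longer.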

(32)

Flip-flops

• C2MOS

– based on clocked CMOS stages (Traditional)

• MUX

– Multiplexer based, static combinational gates (Traditional)

• SRIS

– Static Ratio-Insensitive Latch (Yuan and Svensson 1997)

• SSTC

– Static Single Transistor Clocked (Yuan and Svensson 1997)

• strongARM

– Used in ARM RISC processors (Advanced RISC Machines)

• TG

(33)

Delays

• Minimum D-Q delay

(34)

Power-Delay-Product

(35)

Power consumption

(36)

Where is the power

dissipated ?

(37)

Conclusions

• Voltage scaling

– is efficient down to 1.8 V

• Speed

– strongARM is the fastest flip-flop with low power consumption

• Power

– for low switching activity SRIS has half the power consumption compared to strongARM

– for high switching activity SRIS has up to twice the power consumption compared to strongARM

• For a standard cell library

– use both strongARM and SRIS -- let the synthesis tool pick the

(38)

Concluding remarks

• Large number of simulations must be carried out

– an automated characterization procedure is a must

• Simulation environment built on

– PowerMill

• A fast analog simulator

– Matlab

• Test pattern generation

• Automatic analysis of simulation results

• Graphical User interface

• Graphical presentation of simulation results

– Perl

• Netlist conversions
