Digital Design for Low Power Systems

Transcription

1 Digital Design for Low Power Systems Shekhar Borkar Intel Corp.

2 Outline Low Power Outlook & Challenges Circuit solutions for leakage avoidance, control, & tolerance Microarchitecture for Low Power System considerations Technology, Circuits, Architecture, and System co-optimization 2

3 Unconstrained SD Leakage Ioff (na/u) nm 90nm Temp (C) 3

4 Unconstrained SD Leakage Power 1,000 SD Leakage (Watts) X Tr Growth 2X Tr Growth Temp = 100C 1.5V 1.7V 2V 1.2V 1V 0.9V u 0.18u 0.13u 90nm 65nm 45nm 4

5 Product Cost Pressure Desk Top ASP ($) Platform Budget Power Delivery $25 Other Thermal $10 Shrinking ASP & budget for power 5

6 Unconstrained Total Power Power (W), Power Density (W/cm 2 ) mm Die SiO2 Lkg SD Lkg Active 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, & Architecture to constrain power 6

7 Active Power Reduction Low Supply Voltage Slow Fast Slow High Supply Voltage Multiple Supply Voltages Multiple supply voltages mimic Vdd scaling Issues: Performance loss due to interface circuits Multiple Vdd routing overhead 7

8 SD Leakage Avoidance Dual Vt # of Paths # of Paths High Vt Low Vt Delay Delay Low-Vt transistor width (% of total transistor width) 100% 80% Full Low Vt Performance 60% 40% 20% 0% 0% 5% 10% 15% 20% 25% % T scaling from all high-vt design # of Paths Dual Vt Delay Leakage 3X smaller (Active & Standby) No performance loss 8

9 Leakage Control Circuits Body Bias Stack Effect Sleep Transistor Vdd +Ve Vbp Equal Loading Logic Block -Ve Vbn 2-10X Reduction 5-10X Reduction X Reduction 9

10 Body Bias Reverse BB Forward BB Vdd Vbp Increases V t Reduces V t +Ve Reduces I off Increases I off To inactive circuits to reduce leakage To active circuits to improve performance -Ve Vbn Tradeoff & analysis required Applying BB consumes active power Switching delay between body bias 10

11 Reverse Body Bias Leakage Reduction Factor (X) C 0.5V RBB Higher V T Shorter L Lower V T Target I off (na/µm) RBB less effective at shorter L and lower Vt * A. Keshavarzi et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 11

12 Scalability of RBB Power (Watts) 400.0E E E E E+0 Ibp junction leakage Total Leakage Power Optimum SD leakage Ibn junction leakage Body Bias (V) Total Leakage Power Measured on Test Chip Microprocessor critical path circuit Optimum RBB 0.18 µm I/O circuit 0.35 µm 0.18 µm 2V 0.5V Ioff Red. 1000X 10X RBB Less effective with technology scaling * A. Keshavarzi et. al., 1999 Int. Symp. Low Power Electronics & Design (ISLPED) 12

13 Forward Body Bias Normalized operating frequency Router chip with body bias mV Forward body bias (mv) CBG I/O: S-Links Digital Core 24 LBGs PLL Die size 10.1 X 10.1 mm2 Technology 150nm CMOS Transistors 6.6 million Area overhead 2% Power overhead 1% I/O: F-Links Fmax (MHz) Fmax (MHz) Active power (W) * S. Narendra et al, ISSCC 2002, * S. Narendra et al, ISSCC 2002, FBB increases frequency & SD leakage Body bias chip with 450 mv FBB T j ~ 60 C NBB chip & body bias chip with ZBB Vcc (V) Body bias chip with 450 mv FBB T j ~ 60 C Body bias chip with ZBB

14 Adaptive Body Bias Experiment 5.3 mm Multiple subsites Resistor Network PD & Counter CUT Delay Resistor Network Bias Amplifier 4.5 mm Technology Number of subsites per die Body bias range Bias resolution 150nm CMOS V FBB to 0.5V RBB 32 mv 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Die frequency: Min(F 1..F 21 ) Die power: Sum(P 1..P 21 ) 14

15 Adaptive Body Bias Results Number of dies too slow too leaky ABB FBB RBB Accepted die 100% 60% 20% 0% nobb ABB f target 97% highest bin 100% yield Higher Frequency Frequency within die ABB f target For given Freq and Power density 100% yield with ABB 97% highest freq bin with ABB for within die variability 15

16 Transistor Stack V dd V int Normalized Current Vint(V) Stack leakage is ~5-10X smaller * S. Narendra et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 16

17 Scalability of Stack Effect Stack Stack effect effect factor factor X X Zero Zero body body bias80 C 80 C 0.13 µm LV T 0.13 µm HV T 0.18 µm Model Vdd (V) X Stack effect factor µm Zero body bias 80 C 10% Vdd scaling 20% Vdd scaling Technology generation (µm) Stack effects becomes stronger with scaling 17

18 Exploiting Natural Stacks 32-bit Kogge-Stone adder % of input vectors 30% 20% High V T Low V T 10% 0% Standby leakage current (µa) Reduction Avg Worst High V T 1.5X 2.5X Low V T 1.5X 2X * Y. Ye et. al., 1998 Symp. VLSI Circuits Energy Overhead High V T Low V T 1.64 nj 1.84 nj Savings 2.2 µa 38.4 µa Min time in Standby 84 µs 5.4 µs 18

19 Stack Forcing 5 Equal Loading Delay (Relative) High Vt Low Vt Performance Loss Leakage Reduction 1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00 Ioff (Relative) Circuit technique provides additional Vt s 19

20 Sleep Transistor Technique V dd external Dynamic ALU 32 Tota power (mw) Switching Leakage Overhead 4GHz, 75C α= % 10.3% ALU core 0 Clock gating + body bias Clock gating only Clock gating + sleep transistor V ss external Non-sleep ALU Scan FIFO Sleep ALU Output scan Control Body bias J. Tschanz et al, ISSCC 2003, 6.1 PMOS sleep transistor PMOS body bias Frequency change Leakage savings Area overhead -2% 13X 11% 0% 2X 2% 20

21 Leakage Tolerant Register File Domino: leakage-sensitive Φ1 RS0 D0 M1 RS15 M2 D15 Local Bit-line LBL0 N0 LBL1 LBL DC Noise Robustness Pseudo-static: leakage-tolerant This work 6GHz design Dual-Vt Noise floor Keeper upsizing Low-Vt Read Delay (ps) Φ1 D0# RS0 M1 Vs Px M2 Pk D15# RS15 LBL0 LBL1 Improves delay Improves noise margin Leakage tolerant * R. Krishnamurthy et. al., 2001 Symp. VLSI Circuits 21

22 Employing Leakage Control Number of Paths Low Vt + Stack Forcing Low Vt Dual Vt + Stack Forcing Dual Vt Delay Target 22

23 Optimum Frequency Sub-threshold Leakage increases exponentially Relative Frequency Process Technology Performance Diminishing Return Relative Frequency (Pipelining) Pipeline & Performance Power Efficiency Optimum Relative Pipeline Depth Pipeline Depth Maximum performance with Optimum pipeline depth Optimum frequency 23

24 Efficiency of Microarchitectures Growth (X) from previous uarch Same Process Technology Die Area Performance Power S-Scalar Dynamic Deep Pipeline Reduction in MIPS/Watt 40% 20% 0% Same Process Technology Enegry efficiency drops ~20% S-Scalar Dynamic Employ efficient microarchitecture and design Deep Pipeline 24

25 Memory Latency 1000 CPU Small ~few Clocks Cache Memory Large ns Memory Latency (Clocks) Assume: 50ns Memory latency Freq (MHz) Cache miss hurts performance Worse at higher frequency 25

26 Increase on-die Memory 100% Power Density of Logic 10X of Memory Cache % of Total Area 75% 50% 25% 486 Pentium M Pentium III Pentium Pentium 4 0% 1u 0.5u 0.25u 0.13u 65nm Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power 26

27 Multi-threading Performance 100% 80% 60% 40% 20% 1 GHz 2 GHz 3 GHz Thermals & Power Delivery designed for full HW utilization ST Single Thread Full HW Utilization MT1 Wait for Mem Multi-Threading Wait for Mem 0% MT2 Wait 100% 98% 96% Cache Hit % MT3 Multi-threading improves performance without impacting thermals & power delivery 27

28 Throughput Oriented Design Vdd Logic Block 0.7 x Vdd Freq = 1 Throughput = 1 Active Power = 1 SD Lkg Power = 1 Logic Block Freq = 0.7 Throughput = 1.4 Logic Block Active Power = 0.7 SD Lkg Power = 0.7 Higher logic throughput, yet lower power 28

29 Special Purpose Hardware TCP Offload Engine 1.E+06 PLL OOO ROB Send buffer MIPS 1.E+05 1.E+04 GP TCB Exec Core TCB Exec Core ROM ROM Input seq CAM1 CLB 1.E+03 1.E+02 TOE mm X 3.54 mm, 260K transistors Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines Special Purpose HW Best Mips/Watt 29

30 Chip Level Multi-processing Rule of thumb: Voltage Frequency Power Performance 1% 1% 3% 0.66% In the same process technology Cache Cache Core Core Core Voltage = 1 Freq = 1 Area = 1 Power = 1 Perf = 1 Voltage = -15% Freq = -15% Area = 2 Power = 1 Perf = ~1.8 30

31 Multi-Core Cache Large Core Power 4 Performance Small Core Power = 1/4 Performance = 1/2 1 1 C1 C3 Cache C2 C Multi-Core: Power efficient Better power and thermal management 31

32 From Multi to Many 13mm, 100W, 48MB Cache, 4B Transistors, in 22nm 12 Cores 48 Cores 144 Cores Relative Performance 1.2 Single Core Performance Large 0.5 Med 0.3 Small System Performance TPT One App Large Med Small Two App Four App Eight App 32

33 Future Multi-core Platform GP C GP C GP C GP C General Purpose Cores GP C C SP C GP SP C C GP C GP GP C GP C GP C SP C SP C GP C Special Purpose HW Interconnect fabric Heterogeneous Multi-Core Platform 33

34 Low Power & Platform Breakdown of Platform Power Other CPU MCH, ICH CPU MCH, ICH Disk Display Gfx Memory CLK Gen Power Supply Loss Gfx Disk Mobile Com Desktop Other CPU & electronics power in a platform is low Must reduce power of other platform ingredients 34

35 Efficiency of Power Delivery 90% Workload % Efficiency (%) 85% 80% 75% Loss Design Point 70% I (amp) Difficult to maintain high efficiency across workloads Power supply design needs to exploit new technologies 35

36 Typical Power Delivery System Power Delivery Electronics Mobile Platform, ~50W 36

37 Fine-grain Power Management Power Supply Response Usage Fast Slow Slow Fast Cache Time Improve response time Bring power supply closer Provide multiple supply voltages Fine grain Vdd and Freq scaling Cache Core Hopping for hot spot & power density management 37

38 Software & Low Power Applications Compilers Operating System Firmware Provide guidance to compilers and operating system through policy Add hooks for fine grain power management, and assist OS by providing checkpoints Fine grain power management algorithm Control HW through firmware Hardware control Detect conditions, take action, notify OS Fine grain power management is possible only by cooperation of the entire software stack 38

39 Co-optimization Technology Circuits µarch Platform Software SD Leakage (Ioff) Number of Vt s Gate dielectric (Tox) Supply voltage (Vdd) Transistor density Leakage avoidance, control & tolerance Multiple Vdd circuits Tradeoff frequency for transistors Optimal pipeline Throughput oriented microarchitecture Partitioning of logic Error correction Throughput oriented microarchitecture Fine grain power management hardware & software Efficient power delivery & management of multiple Vdd s Software and applications to utilize logic throughput Interconnect RC delay tolerance Interconnect cost vs efficiency of power delivery 39

40 Summary Power and energy will limit integration Employ circuit techniques for SD leakage avoidance, control, & tolerance Microarchitecture for Low Power Opportunities to save power at the platform level Hardware & Software Technology, Circuits, Architecture, and System co-optimization will be key 40