Digital Design for Low Power Systems Shekhar Borkar Intel Corp.
Outline Low Power Outlook & Challenges Circuit solutions for leakage avoidance, control, & tolerance Microarchitecture for Low Power System considerations Technology, Circuits, Architecture, and System co-optimization 2
Unconstrained SD Leakage Ioff (na/u) 10000 1000 100 10 1 45nm 90nm 0.25 30 80 130 Temp (C) 3
Unconstrained SD Leakage Power 1,000 SD Leakage (Watts) 100 10 1 1.5X Tr Growth 2X Tr Growth Temp = 100C 1.5V 1.7V 2V 1.2V 1V 0.9V 0 0.25u 0.18u 0.13u 90nm 65nm 45nm 4
Product Cost Pressure Desk Top ASP ($) 1400 1200 1000 800 600 400 200 0 1998 2000 2002 2004 2006 Platform Budget Power Delivery $25 Other Thermal $10 Shrinking ASP & budget for power 5
Unconstrained Total Power Power (W), Power Density (W/cm 2 ) 1400 1200 1000 800 600 400 200 0 10 mm Die SiO2 Lkg SD Lkg Active 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, & Architecture to constrain power 6
Active Power Reduction Low Supply Voltage Slow Fast Slow High Supply Voltage Multiple Supply Voltages Multiple supply voltages mimic Vdd scaling Issues: Performance loss due to interface circuits Multiple Vdd routing overhead 7
SD Leakage Avoidance Dual Vt # of Paths # of Paths High Vt Low Vt Delay Delay Low-Vt transistor width (% of total transistor width) 100% 80% Full Low Vt Performance 60% 40% 20% 0% 0% 5% 10% 15% 20% 25% % T scaling from all high-vt design # of Paths Dual Vt Delay Leakage 3X smaller (Active & Standby) No performance loss 8
Leakage Control Circuits Body Bias Stack Effect Sleep Transistor Vdd +Ve Vbp Equal Loading Logic Block -Ve Vbn 2-10X Reduction 5-10X Reduction 2-1000X Reduction 9
Body Bias Reverse BB Forward BB Vdd Vbp Increases V t Reduces V t +Ve Reduces I off Increases I off To inactive circuits to reduce leakage To active circuits to improve performance -Ve Vbn Tradeoff & analysis required Applying BB consumes active power Switching delay between body bias 10
Reverse Body Bias Leakage Reduction Factor (X) 100 10 1 110C 0.5V RBB Higher V T Shorter L Lower V T 0.01 0.1 1 10 100 100 0 Target I off (na/µm) RBB less effective at shorter L and lower Vt * A. Keshavarzi et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 11
Scalability of RBB Power (Watts) 400.0E-9 300.0E-9 200.0E-9 100.0E-9 000.0E+0 Ibp junction leakage Total Leakage Power Optimum SD leakage Ibn junction leakage -1.0-0.8-0.6-0.4-0.2 0.0 Body Bias (V) Total Leakage Power Measured on Test Chip Microprocessor critical path circuit Optimum RBB 0.18 µm I/O circuit 0.35 µm 0.18 µm 2V 0.5V Ioff Red. 1000X 10X RBB Less effective with technology scaling * A. Keshavarzi et. al., 1999 Int. Symp. Low Power Electronics & Design (ISLPED) 12
Forward Body Bias Normalized operating frequency Router chip with body bias 1.5 1 0.5 0 450mV 0 200 400 600 Forward body bias (mv) CBG I/O: S-Links Digital Core 24 LBGs PLL Die size 10.1 X 10.1 mm2 Technology 150nm CMOS Transistors 6.6 million Area overhead 2% Power overhead 1% I/O: F-Links Fmax (MHz) Fmax (MHz) 2000 1750 1500 1250 1000 Active power (W) * S. Narendra et al, ISSCC 2002, 16.4 13 * S. Narendra et al, ISSCC 2002, 16.4 750 500 250 2000 1750 1500 1250 1000 FBB increases frequency & SD leakage 750 500 250 Body bias chip with 450 mv FBB T j ~ 60 C NBB chip & body bias chip with ZBB 0.9 1.1 1.3 1.5 1.7 Vcc (V) Body bias chip with 450 mv FBB T j ~ 60 C Body bias chip with ZBB 0 5 10 15 20
Adaptive Body Bias Experiment 5.3 mm Multiple subsites Resistor Network PD & Counter CUT Delay Resistor Network Bias Amplifier 4.5 mm Technology Number of subsites per die Body bias range Bias resolution 150nm CMOS 21 0.5V FBB to 0.5V RBB 32 mv 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Die frequency: Min(F 1..F 21 ) Die power: Sum(P 1..P 21 ) 14
Adaptive Body Bias Results Number of dies too slow too leaky ABB FBB RBB Accepted die 100% 60% 20% 0% nobb ABB f target 97% highest bin 100% yield Higher Frequency Frequency within die ABB f target For given Freq and Power density 100% yield with ABB 97% highest freq bin with ABB for within die variability 15
Transistor Stack V dd V int Normalized Current 1.2 1 0.8 0.6 0.4 0.2 0 0 0.5 1 1.5 Vint(V) Stack leakage is ~5-10X smaller * S. Narendra et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 16
Scalability of Stack Effect Stack Stack effect effect factor factor X X 20 15 10 5 0 Zero Zero body body bias80 C 80 C 0.13 µm LV T 0.13 µm HV T 0.18 µm Model 0 0.5 1 1.5 2 Vdd (V) X Stack effect factor 50 40 30 20 10 0 1.8 V @ 0.18 µm Zero body bias 80 C 10% Vdd scaling 20% Vdd scaling 0.18 0.13 0.09 0.06 Technology generation (µm) Stack effects becomes stronger with scaling 17
Exploiting Natural Stacks 32-bit Kogge-Stone adder % of input vectors 30% 20% High V T Low V T 10% 0% 5.0 5.6 6.2 6.8 7.4 105 120 135 Standby leakage current (µa) Reduction Avg Worst High V T 1.5X 2.5X Low V T 1.5X 2X * Y. Ye et. al., 1998 Symp. VLSI Circuits Energy Overhead High V T Low V T 1.64 nj 1.84 nj Savings 2.2 µa 38.4 µa Min time in Standby 84 µs 5.4 µs 18
Stack Forcing 5 Equal Loading Delay (Relative) 4 3 2 1 0 High Vt Low Vt Performance Loss Leakage Reduction 1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00 Ioff (Relative) Circuit technique provides additional Vt s 19
Sleep Transistor Technique V dd external Dynamic ALU 32 Tota power (mw) 12 10 8 6 4 2 Switching Leakage Overhead 4GHz, 75C α=0.1 9.6% 10.3% ALU core 0 Clock gating + body bias Clock gating only Clock gating + sleep transistor V ss external Non-sleep ALU Scan FIFO Sleep ALU Output scan Control Body bias J. Tschanz et al, ISSCC 2003, 6.1 PMOS sleep transistor PMOS body bias Frequency change Leakage savings Area overhead -2% 13X 11% 0% 2X 2% 20
Leakage Tolerant Register File Domino: leakage-sensitive Φ1 RS0 D0 M1 RS15 M2 D15 Local Bit-line LBL0 N0 LBL1 LBL DC Noise Robustness Pseudo-static: leakage-tolerant 0.25 0.2 0.15 0.1 0.05 0 This work 6GHz design Dual-Vt Noise floor Keeper upsizing Low-Vt 150 160 170 180 190 200 Read Delay (ps) Φ1 D0# RS0 M1 Vs Px M2 Pk D15# RS15 LBL0 LBL1 Improves delay Improves noise margin Leakage tolerant * R. Krishnamurthy et. al., 2001 Symp. VLSI Circuits 21
Employing Leakage Control Number of Paths Low Vt + Stack Forcing Low Vt Dual Vt + Stack Forcing Dual Vt Delay Target 22
Optimum Frequency 10 8 6 4 2 0 10 8 6 4 2 0 Sub-threshold Leakage increases exponentially 1 2 3 4 5 6 7 8 9 10 Relative Frequency Process Technology Performance Diminishing Return 1 2 3 4 5 6 7 8 9 10 Relative Frequency (Pipelining) Pipeline & Performance 10 8 6 4 2 0 Power Efficiency Optimum 1 2 3 4 5 6 7 8 9 10 Relative Pipeline Depth Pipeline Depth Maximum performance with Optimum pipeline depth Optimum frequency 23
Efficiency of Microarchitectures Growth (X) from previous uarch 4 3 2 1 0 Same Process Technology Die Area Performance Power S-Scalar Dynamic Deep Pipeline Reduction in MIPS/Watt 40% 20% 0% Same Process Technology Enegry efficiency drops ~20% S-Scalar Dynamic Employ efficient microarchitecture and design Deep Pipeline 24
Memory Latency 1000 CPU Small ~few Clocks Cache Memory Large 50-100ns Memory Latency (Clocks) 100 10 Assume: 50ns Memory latency 1 100 1000 10000 Freq (MHz) Cache miss hurts performance Worse at higher frequency 25
Increase on-die Memory 100% Power Density of Logic 10X of Memory Cache % of Total Area 75% 50% 25% 486 Pentium M Pentium III Pentium Pentium 4 0% 1u 0.5u 0.25u 0.13u 65nm Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power 26
Multi-threading Performance 100% 80% 60% 40% 20% 1 GHz 2 GHz 3 GHz Thermals & Power Delivery designed for full HW utilization ST Single Thread Full HW Utilization MT1 Wait for Mem Multi-Threading Wait for Mem 0% MT2 Wait 100% 98% 96% Cache Hit % MT3 Multi-threading improves performance without impacting thermals & power delivery 27
Throughput Oriented Design Vdd Logic Block 0.7 x Vdd Freq = 1 Throughput = 1 Active Power = 1 SD Lkg Power = 1 Logic Block Freq = 0.7 Throughput = 1.4 Logic Block Active Power = 0.7 SD Lkg Power = 0.7 Higher logic throughput, yet lower power 28
Special Purpose Hardware TCP Offload Engine 1.E+06 PLL OOO ROB Send buffer MIPS 1.E+05 1.E+04 GP MIPS @75W TCB Exec Core TCB Exec Core ROM ROM Input seq CAM1 CLB 1.E+03 1.E+02 TOE MIPS @~2W 1995 2000 2005 2010 2015 2.23 mm X 3.54 mm, 260K transistors Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines Special Purpose HW Best Mips/Watt 29
Chip Level Multi-processing Rule of thumb: Voltage Frequency Power Performance 1% 1% 3% 0.66% In the same process technology Cache Cache Core Core Core Voltage = 1 Freq = 1 Area = 1 Power = 1 Perf = 1 Voltage = -15% Freq = -15% Area = 2 Power = 1 Perf = ~1.8 30
Multi-Core Cache Large Core Power 4 Performance 3 2 1 2 1 Small Core Power = 1/4 Performance = 1/2 1 1 C1 C3 Cache C2 C4 4 4 3 3 2 2 1 1 Multi-Core: Power efficient Better power and thermal management 31
From Multi to Many 13mm, 100W, 48MB Cache, 4B Transistors, in 22nm 12 Cores 48 Cores 144 Cores Relative Performance 1.2 Single Core Performance 1 0.8 0.6 0.4 0.2 0 1 Large 0.5 Med 0.3 Small System Performance 30 25 20 15 10 5 0 TPT One App Large Med Small Two App Four App Eight App 32
Future Multi-core Platform GP C GP C GP C GP C General Purpose Cores GP C C SP C GP SP C C GP C GP GP C GP C GP C SP C SP C GP C Special Purpose HW Interconnect fabric Heterogeneous Multi-Core Platform 33
Low Power & Platform Breakdown of Platform Power Other CPU MCH, ICH CPU MCH, ICH Disk Display Gfx Memory CLK Gen Power Supply Loss Gfx Disk Mobile Com Desktop Other CPU & electronics power in a platform is low Must reduce power of other platform ingredients 34
Efficiency of Power Delivery 90% Workload % Efficiency (%) 85% 80% 75% Loss Design Point 70% 0 10 20 30 40 50 I (amp) Difficult to maintain high efficiency across workloads Power supply design needs to exploit new technologies 35
Typical Power Delivery System Power Delivery Electronics Mobile Platform, ~50W 36
Fine-grain Power Management Power Supply Response Usage Fast Slow Slow Fast Cache Time Improve response time Bring power supply closer Provide multiple supply voltages Fine grain Vdd and Freq scaling Cache Core Hopping for hot spot & power density management 37
Software & Low Power Applications Compilers Operating System Firmware Provide guidance to compilers and operating system through policy Add hooks for fine grain power management, and assist OS by providing checkpoints Fine grain power management algorithm Control HW through firmware Hardware control Detect conditions, take action, notify OS Fine grain power management is possible only by cooperation of the entire software stack 38
Co-optimization Technology Circuits µarch Platform Software SD Leakage (Ioff) Number of Vt s Gate dielectric (Tox) Supply voltage (Vdd) Transistor density Leakage avoidance, control & tolerance Multiple Vdd circuits Tradeoff frequency for transistors Optimal pipeline Throughput oriented microarchitecture Partitioning of logic Error correction Throughput oriented microarchitecture Fine grain power management hardware & software Efficient power delivery & management of multiple Vdd s Software and applications to utilize logic throughput Interconnect RC delay tolerance Interconnect cost vs efficiency of power delivery 39
Summary Power and energy will limit integration Employ circuit techniques for SD leakage avoidance, control, & tolerance Microarchitecture for Low Power Opportunities to save power at the platform level Hardware & Software Technology, Circuits, Architecture, and System co-optimization will be key 40