Digital Design for Low Power Systems

Similar documents

DESIGN CHALLENGES OF TECHNOLOGY SCALING

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Intel Labs at ISSCC Copyright Intel Corporation 2012

Low Power AMD Athlon 64 and AMD Opteron Processors

Introducción. Diseño de sistemas digitales.1

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

«A 32-bit DSP Ultra Low Power accelerator»

Introduction to Multi-Core

McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures

Data Center and Cloud Computing Market Landscape and Challenges

路論 Chapter 15 System-Level Physical Design

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

Parallel Programming Survey

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Power Reduction Techniques in the SoC Clock Network. Clock Power

Scaling in a Hypervisor Environment

Digital Integrated Circuit (IC) Layout and Design

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology

CSEE W4824 Computer Architecture Fall 2012

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Thread level parallelism

Alpha CPU and Clock Design Evolution

Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors

Introduction to VLSI Programming. TU/e course 2IN30. Prof.dr.ir. Kees van Berkel Dr. Johan Lukkien [Dr.ir. Ad Peeters, Philips Nat.

CPU Performance. Lecture 8 CAP

SRAM Scaling Limit: Its Circuit & Architecture Solutions

Photonic Networks for Data Centres and High Performance Computing

Fault Modeling. Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults. Transistor faults Summary

Area 3: Analog and Digital Electronics. D.A. Johns

Enabling Technologies for Distributed Computing

Am186ER/Am188ER AMD Continues 16-bit Innovation

Enabling Technologies for Distributed and Cloud Computing

International Journal of Electronics and Computer Science Engineering 1482

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

CMOS, the Ideal Logic Family

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Performance evaluation

Control 2004, University of Bath, UK, September 2004

CS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge

A 10,000 Frames/s 0.18 µm CMOS Digital Pixel Sensor with Pixel-Level Memory

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

FPGA-based Multithreading for In-Memory Hash Joins

OpenSoC Fabric: On-Chip Network Generator

Resource Efficient Computing for Warehouse-scale Datacenters

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7

Itanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation

Scalable Internet Services and Load Balancing

Memory Module Specifications KVR667D2D4F5/4G. 4GB 512M x 72-Bit PC CL5 ECC 240-Pin FBDIMM DESCRIPTION SPECIFICATIONS

Design Cycle for Microprocessors

Gigabit Ethernet Design

Memory Architecture and Management in a NoC Platform

Networking Virtualization Using FPGAs

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

VLSI Design Verification and Testing

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

Riding silicon trends into our future

On-Chip Interconnection Networks Low-Power Interconnect

DRG-Cache: A Data Retention Gated-Ground Cache for Low Power 1

How To Understand The Design Of A Microprocessor

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

Innovativste XEON Prozessortechnik für Cisco UCS

Asynchronous Bypass Channels

What is a System on a Chip?

VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

Chapter 2 Logic Gates and Introduction to Computer Architecture

1. Memory technology & Hierarchy

HCC/HCF4032B HCC/HCF4038B

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage

Optimizing Shared Resource Contention in HPC Clusters

Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package

Lecture 030 DSM CMOS Technology (3/24/10) Page 030-1

PowerPC Microprocessor Clock Modes

. MEDIUM SPEED OPERATION - 8MHz . MULTI-PACKAGE PARALLEL CLOCKING FOR HCC4029B HCF4029B PRESETTABLE UP/DOWN COUNTER BINARY OR BCD DECADE

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

ECE 5745 Complex Digital ASIC Design Course Overview

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

FPGA-based MapReduce Framework for Machine Learning

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Introduction to Microprocessors

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

How To Design A Chip Layout

Multi-Threading Performance on Commodity Multi-Core Processors

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

1.1 Silicon on Insulator a brief Introduction

Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Transcription:

Digital Design for Low Power Systems Shekhar Borkar Intel Corp.

Outline Low Power Outlook & Challenges Circuit solutions for leakage avoidance, control, & tolerance Microarchitecture for Low Power System considerations Technology, Circuits, Architecture, and System co-optimization 2

Unconstrained SD Leakage Ioff (na/u) 10000 1000 100 10 1 45nm 90nm 0.25 30 80 130 Temp (C) 3

Unconstrained SD Leakage Power 1,000 SD Leakage (Watts) 100 10 1 1.5X Tr Growth 2X Tr Growth Temp = 100C 1.5V 1.7V 2V 1.2V 1V 0.9V 0 0.25u 0.18u 0.13u 90nm 65nm 45nm 4

Product Cost Pressure Desk Top ASP ($) 1400 1200 1000 800 600 400 200 0 1998 2000 2002 2004 2006 Platform Budget Power Delivery $25 Other Thermal $10 Shrinking ASP & budget for power 5

Unconstrained Total Power Power (W), Power Density (W/cm 2 ) 1400 1200 1000 800 600 400 200 0 10 mm Die SiO2 Lkg SD Lkg Active 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, & Architecture to constrain power 6

Active Power Reduction Low Supply Voltage Slow Fast Slow High Supply Voltage Multiple Supply Voltages Multiple supply voltages mimic Vdd scaling Issues: Performance loss due to interface circuits Multiple Vdd routing overhead 7

SD Leakage Avoidance Dual Vt # of Paths # of Paths High Vt Low Vt Delay Delay Low-Vt transistor width (% of total transistor width) 100% 80% Full Low Vt Performance 60% 40% 20% 0% 0% 5% 10% 15% 20% 25% % T scaling from all high-vt design # of Paths Dual Vt Delay Leakage 3X smaller (Active & Standby) No performance loss 8

Leakage Control Circuits Body Bias Stack Effect Sleep Transistor Vdd +Ve Vbp Equal Loading Logic Block -Ve Vbn 2-10X Reduction 5-10X Reduction 2-1000X Reduction 9

Body Bias Reverse BB Forward BB Vdd Vbp Increases V t Reduces V t +Ve Reduces I off Increases I off To inactive circuits to reduce leakage To active circuits to improve performance -Ve Vbn Tradeoff & analysis required Applying BB consumes active power Switching delay between body bias 10

Reverse Body Bias Leakage Reduction Factor (X) 100 10 1 110C 0.5V RBB Higher V T Shorter L Lower V T 0.01 0.1 1 10 100 100 0 Target I off (na/µm) RBB less effective at shorter L and lower Vt * A. Keshavarzi et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 11

Scalability of RBB Power (Watts) 400.0E-9 300.0E-9 200.0E-9 100.0E-9 000.0E+0 Ibp junction leakage Total Leakage Power Optimum SD leakage Ibn junction leakage -1.0-0.8-0.6-0.4-0.2 0.0 Body Bias (V) Total Leakage Power Measured on Test Chip Microprocessor critical path circuit Optimum RBB 0.18 µm I/O circuit 0.35 µm 0.18 µm 2V 0.5V Ioff Red. 1000X 10X RBB Less effective with technology scaling * A. Keshavarzi et. al., 1999 Int. Symp. Low Power Electronics & Design (ISLPED) 12

Forward Body Bias Normalized operating frequency Router chip with body bias 1.5 1 0.5 0 450mV 0 200 400 600 Forward body bias (mv) CBG I/O: S-Links Digital Core 24 LBGs PLL Die size 10.1 X 10.1 mm2 Technology 150nm CMOS Transistors 6.6 million Area overhead 2% Power overhead 1% I/O: F-Links Fmax (MHz) Fmax (MHz) 2000 1750 1500 1250 1000 Active power (W) * S. Narendra et al, ISSCC 2002, 16.4 13 * S. Narendra et al, ISSCC 2002, 16.4 750 500 250 2000 1750 1500 1250 1000 FBB increases frequency & SD leakage 750 500 250 Body bias chip with 450 mv FBB T j ~ 60 C NBB chip & body bias chip with ZBB 0.9 1.1 1.3 1.5 1.7 Vcc (V) Body bias chip with 450 mv FBB T j ~ 60 C Body bias chip with ZBB 0 5 10 15 20

Adaptive Body Bias Experiment 5.3 mm Multiple subsites Resistor Network PD & Counter CUT Delay Resistor Network Bias Amplifier 4.5 mm Technology Number of subsites per die Body bias range Bias resolution 150nm CMOS 21 0.5V FBB to 0.5V RBB 32 mv 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Die frequency: Min(F 1..F 21 ) Die power: Sum(P 1..P 21 ) 14

Adaptive Body Bias Results Number of dies too slow too leaky ABB FBB RBB Accepted die 100% 60% 20% 0% nobb ABB f target 97% highest bin 100% yield Higher Frequency Frequency within die ABB f target For given Freq and Power density 100% yield with ABB 97% highest freq bin with ABB for within die variability 15

Transistor Stack V dd V int Normalized Current 1.2 1 0.8 0.6 0.4 0.2 0 0 0.5 1 1.5 Vint(V) Stack leakage is ~5-10X smaller * S. Narendra et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 16

Scalability of Stack Effect Stack Stack effect effect factor factor X X 20 15 10 5 0 Zero Zero body body bias80 C 80 C 0.13 µm LV T 0.13 µm HV T 0.18 µm Model 0 0.5 1 1.5 2 Vdd (V) X Stack effect factor 50 40 30 20 10 0 1.8 V @ 0.18 µm Zero body bias 80 C 10% Vdd scaling 20% Vdd scaling 0.18 0.13 0.09 0.06 Technology generation (µm) Stack effects becomes stronger with scaling 17

Exploiting Natural Stacks 32-bit Kogge-Stone adder % of input vectors 30% 20% High V T Low V T 10% 0% 5.0 5.6 6.2 6.8 7.4 105 120 135 Standby leakage current (µa) Reduction Avg Worst High V T 1.5X 2.5X Low V T 1.5X 2X * Y. Ye et. al., 1998 Symp. VLSI Circuits Energy Overhead High V T Low V T 1.64 nj 1.84 nj Savings 2.2 µa 38.4 µa Min time in Standby 84 µs 5.4 µs 18

Stack Forcing 5 Equal Loading Delay (Relative) 4 3 2 1 0 High Vt Low Vt Performance Loss Leakage Reduction 1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00 Ioff (Relative) Circuit technique provides additional Vt s 19

Sleep Transistor Technique V dd external Dynamic ALU 32 Tota power (mw) 12 10 8 6 4 2 Switching Leakage Overhead 4GHz, 75C α=0.1 9.6% 10.3% ALU core 0 Clock gating + body bias Clock gating only Clock gating + sleep transistor V ss external Non-sleep ALU Scan FIFO Sleep ALU Output scan Control Body bias J. Tschanz et al, ISSCC 2003, 6.1 PMOS sleep transistor PMOS body bias Frequency change Leakage savings Area overhead -2% 13X 11% 0% 2X 2% 20

Leakage Tolerant Register File Domino: leakage-sensitive Φ1 RS0 D0 M1 RS15 M2 D15 Local Bit-line LBL0 N0 LBL1 LBL DC Noise Robustness Pseudo-static: leakage-tolerant 0.25 0.2 0.15 0.1 0.05 0 This work 6GHz design Dual-Vt Noise floor Keeper upsizing Low-Vt 150 160 170 180 190 200 Read Delay (ps) Φ1 D0# RS0 M1 Vs Px M2 Pk D15# RS15 LBL0 LBL1 Improves delay Improves noise margin Leakage tolerant * R. Krishnamurthy et. al., 2001 Symp. VLSI Circuits 21

Employing Leakage Control Number of Paths Low Vt + Stack Forcing Low Vt Dual Vt + Stack Forcing Dual Vt Delay Target 22

Optimum Frequency 10 8 6 4 2 0 10 8 6 4 2 0 Sub-threshold Leakage increases exponentially 1 2 3 4 5 6 7 8 9 10 Relative Frequency Process Technology Performance Diminishing Return 1 2 3 4 5 6 7 8 9 10 Relative Frequency (Pipelining) Pipeline & Performance 10 8 6 4 2 0 Power Efficiency Optimum 1 2 3 4 5 6 7 8 9 10 Relative Pipeline Depth Pipeline Depth Maximum performance with Optimum pipeline depth Optimum frequency 23

Efficiency of Microarchitectures Growth (X) from previous uarch 4 3 2 1 0 Same Process Technology Die Area Performance Power S-Scalar Dynamic Deep Pipeline Reduction in MIPS/Watt 40% 20% 0% Same Process Technology Enegry efficiency drops ~20% S-Scalar Dynamic Employ efficient microarchitecture and design Deep Pipeline 24

Memory Latency 1000 CPU Small ~few Clocks Cache Memory Large 50-100ns Memory Latency (Clocks) 100 10 Assume: 50ns Memory latency 1 100 1000 10000 Freq (MHz) Cache miss hurts performance Worse at higher frequency 25

Increase on-die Memory 100% Power Density of Logic 10X of Memory Cache % of Total Area 75% 50% 25% 486 Pentium M Pentium III Pentium Pentium 4 0% 1u 0.5u 0.25u 0.13u 65nm Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power 26

Multi-threading Performance 100% 80% 60% 40% 20% 1 GHz 2 GHz 3 GHz Thermals & Power Delivery designed for full HW utilization ST Single Thread Full HW Utilization MT1 Wait for Mem Multi-Threading Wait for Mem 0% MT2 Wait 100% 98% 96% Cache Hit % MT3 Multi-threading improves performance without impacting thermals & power delivery 27

Throughput Oriented Design Vdd Logic Block 0.7 x Vdd Freq = 1 Throughput = 1 Active Power = 1 SD Lkg Power = 1 Logic Block Freq = 0.7 Throughput = 1.4 Logic Block Active Power = 0.7 SD Lkg Power = 0.7 Higher logic throughput, yet lower power 28

Special Purpose Hardware TCP Offload Engine 1.E+06 PLL OOO ROB Send buffer MIPS 1.E+05 1.E+04 GP MIPS @75W TCB Exec Core TCB Exec Core ROM ROM Input seq CAM1 CLB 1.E+03 1.E+02 TOE MIPS @~2W 1995 2000 2005 2010 2015 2.23 mm X 3.54 mm, 260K transistors Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines Special Purpose HW Best Mips/Watt 29

Chip Level Multi-processing Rule of thumb: Voltage Frequency Power Performance 1% 1% 3% 0.66% In the same process technology Cache Cache Core Core Core Voltage = 1 Freq = 1 Area = 1 Power = 1 Perf = 1 Voltage = -15% Freq = -15% Area = 2 Power = 1 Perf = ~1.8 30

Multi-Core Cache Large Core Power 4 Performance 3 2 1 2 1 Small Core Power = 1/4 Performance = 1/2 1 1 C1 C3 Cache C2 C4 4 4 3 3 2 2 1 1 Multi-Core: Power efficient Better power and thermal management 31

From Multi to Many 13mm, 100W, 48MB Cache, 4B Transistors, in 22nm 12 Cores 48 Cores 144 Cores Relative Performance 1.2 Single Core Performance 1 0.8 0.6 0.4 0.2 0 1 Large 0.5 Med 0.3 Small System Performance 30 25 20 15 10 5 0 TPT One App Large Med Small Two App Four App Eight App 32

Future Multi-core Platform GP C GP C GP C GP C General Purpose Cores GP C C SP C GP SP C C GP C GP GP C GP C GP C SP C SP C GP C Special Purpose HW Interconnect fabric Heterogeneous Multi-Core Platform 33

Low Power & Platform Breakdown of Platform Power Other CPU MCH, ICH CPU MCH, ICH Disk Display Gfx Memory CLK Gen Power Supply Loss Gfx Disk Mobile Com Desktop Other CPU & electronics power in a platform is low Must reduce power of other platform ingredients 34

Efficiency of Power Delivery 90% Workload % Efficiency (%) 85% 80% 75% Loss Design Point 70% 0 10 20 30 40 50 I (amp) Difficult to maintain high efficiency across workloads Power supply design needs to exploit new technologies 35

Typical Power Delivery System Power Delivery Electronics Mobile Platform, ~50W 36

Fine-grain Power Management Power Supply Response Usage Fast Slow Slow Fast Cache Time Improve response time Bring power supply closer Provide multiple supply voltages Fine grain Vdd and Freq scaling Cache Core Hopping for hot spot & power density management 37

Software & Low Power Applications Compilers Operating System Firmware Provide guidance to compilers and operating system through policy Add hooks for fine grain power management, and assist OS by providing checkpoints Fine grain power management algorithm Control HW through firmware Hardware control Detect conditions, take action, notify OS Fine grain power management is possible only by cooperation of the entire software stack 38

Co-optimization Technology Circuits µarch Platform Software SD Leakage (Ioff) Number of Vt s Gate dielectric (Tox) Supply voltage (Vdd) Transistor density Leakage avoidance, control & tolerance Multiple Vdd circuits Tradeoff frequency for transistors Optimal pipeline Throughput oriented microarchitecture Partitioning of logic Error correction Throughput oriented microarchitecture Fine grain power management hardware & software Efficient power delivery & management of multiple Vdd s Software and applications to utilize logic throughput Interconnect RC delay tolerance Interconnect cost vs efficiency of power delivery 39

Summary Power and energy will limit integration Employ circuit techniques for SD leakage avoidance, control, & tolerance Microarchitecture for Low Power Opportunities to save power at the platform level Hardware & Software Technology, Circuits, Architecture, and System co-optimization will be key 40