Lecture 11. Low Power and Power Efficient Circuits

Intel Corporation, jstinson@stanford.edu
(Many thanks to Intel University for much of the material.)

Power Trends
[Figure: max power (W, log scale from 1 to 100) versus process generation (1.5u, 1.0u, 0.8u, 0.5u, 0.35u, 0.25u, 0.18u, 0.13u); data labels: Intel386, Intel486, Pentium, Pentium Pro, Pentium II, Pentium 4]
[Figure: power (W, 0 to 30) versus process (0.8u, 0.5u, 0.35u); data labels: Pentium 5V, 66MHz; Pentium Pro 3.3V, 200MHz; Pentium 3.3V, 100MHz; Pentium 3.3V, 200MHz]
- Power historically has grown at almost 2X/generation
  - Frequency growing at ~2X/generation
  - Xtor count growing at ~2X/generation
- Voltage scaling and process: temporary relief within a uproc generation
  - Has not prevented growth through architectural generations

Power Cost
- Cost of power delivery and thermal solutions is growing at an exponential rate (Source: Intel Technology Journal, Q1 '01)
- 25,000 sq. ft. data center, ~8000 servers = 2,000,000 Watts (about 250 W per server)
  - Cost driven by rack height, cooling air flow, power delivery, maintenance
- "What matters most to the computer designers at Google is not speed, but power, low power, because data centers can consume as much electricity as a city." (Eric Schmidt, CEO, Google; NYT, 09/29/02)

Power/Thermal Density
[Figure: power density (W/cm2, log scale from 1 to 10000) versus year (1970 to 2020); data labels: 4004, 8080, 8086, 286, 386, 486, Pentium processors, Pentium 4, Projected; reference levels: Hotplate, Nuclear Reactor, Rocket Nozzle, Surface of the Sun]
- Power is going up, yet die size is going down
- Power isn't evenly distributed across the die
  - Makes local hotspots even more difficult
- Designs must be reliable at worst-case temperature and voltage
  - Speed, noise, leakage, oxide wearout, EM, SH, etc.

di/dt
- Changes in current (A) create large di/dt events
  - Min power state to max power state can occur in less than 4-5 cycles
  - Min power state is typically 30-40% lower than max power state
- Architectural efforts to exploit parallelism increase di/dt
  - Putting more hardware on die to do work in parallel
  - Turn it off to save power
  - Creates large discrepancy between hi and lo utilization power
- Low power efforts can increase di/dt
  - Turning off hardware creates current (A) switching events
Source: Intel Technology Journal, Q1 '01
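To get a feel for the size of such an event, the sketch below estimates the current step and its slew rate for a min-to-max power transition. The 100 W, 1.5 V, and 2 GHz figures are illustrative assumptions, not values from the slides; only the 30-40% gap and the 4-5 cycle transition come from the text above.

```python
def current_step_rate(p_max_w, vcc_v, min_fraction_lower, cycles, freq_hz):
    """Return (delta_i in amps, di/dt in amps per second) for a
    transition from the min power state to the max power state."""
    i_max = p_max_w / vcc_v                        # current at max power
    i_min = i_max * (1.0 - min_fraction_lower)     # min state is a fraction lower
    delta_i = i_max - i_min
    transition_time = cycles / freq_hz             # seconds spent in the transition
    return delta_i, delta_i / transition_time

# Assumed example: 100 W part at 1.5 V and 2 GHz, min state 35% lower,
# transition completed in 5 cycles.
delta_i, di_dt = current_step_rate(100.0, 1.5, 0.35, 5, 2e9)
print(f"delta I = {delta_i:.1f} A, di/dt = {di_dt:.2e} A/s")
```

With these assumed numbers the supply has to absorb a step of roughly 23 A in 2.5 ns, which is why the power delivery network sees such events as a problem.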

Dynamic Power Analysis
P = AF * C * V^2 * F
- AF and C: greatest flexibility/control
- V and F: usually fixed by the project
- Activity Factor (AF): switching or toggle rate
  - Measurement of how often a node switches
  - Logic and circuit design can control
- Capacitance (C)
- Voltage and Frequency
  - Typically set by the product spec
- Majority of power reduction techniques focus on AF and C
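The dynamic power relation can be evaluated directly. The following is a minimal sketch; the activity factor, switched capacitance, voltage, and frequency used in the example are assumed values chosen only to make the arithmetic visible.

```python
def dynamic_power(af, c_farads, v_volts, f_hz):
    """Dynamic power in watts: P = AF * C * V^2 * F."""
    return af * c_farads * v_volts ** 2 * f_hz

# Assumed example: AF = 0.1, 50 nF of switched capacitance, 1.5 V, 2 GHz.
baseline = dynamic_power(af=0.1, c_farads=50e-9, v_volts=1.5, f_hz=2e9)
# Halving AF (for example through clock gating) halves the dynamic power.
half_af = dynamic_power(af=0.05, c_farads=50e-9, v_volts=1.5, f_hz=2e9)
print(baseline, half_af)
```

Power is linear in AF, C, and F but quadratic in V, so when V and F are fixed by the product spec the remaining levers are AF and C.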

Where does the dynamic power go?
- Majority of power consumed in the clock/clocked elements
  - Clock distribution, sequentials, domino, enables, clocked logic
  - 5-10% of the node capacitance, close to 50% of the power!
  - AF makes the difference
- Large I/O and bus drivers
  - Large capacitances
  - Not too many; tend to be localized issues more than total power
- Datapath
  - Tend to upsize drivers to meet frequency requirements
  - High utilization
- Memory
  - Tends to be low power (dynamic)
  - Can be largest capacitance on die, but lowest power
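A short calculation shows how a small share of the capacitance can account for about half of the power. The sketch assumes clocked nodes switch every cycle (AF = 1) while the remaining nodes average AF = 0.1; both activity factors are assumptions for illustration, and V and F are common to all nodes so they cancel.

```python
def clock_power_share(clock_cap_fraction, af_clock=1.0, af_other=0.1):
    """Fraction of dynamic power burned in clocked nodes, given their
    share of total node capacitance and the two activity factors."""
    clock_term = clock_cap_fraction * af_clock
    other_term = (1.0 - clock_cap_fraction) * af_other
    return clock_term / (clock_term + other_term)

for cap_fraction in (0.05, 0.10):
    share = clock_power_share(cap_fraction)
    print(f"{cap_fraction:.0%} of capacitance -> {share:.0%} of dynamic power")
```

Under these assumptions, 10% of the capacitance yields a little over half of the dynamic power, consistent with the point that AF makes the difference.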

Node Activity Factors
[Figure: node activity factors; contents not recoverable from the transcript]

Device Sizing and Design Style
- Most designs have an optimum point for power/performance
  - Historically, designers always went for the speed
  - Low power designs must choose the optimum point
- Domino vs. CMOS
  - Domino node cap lower than CMOS (typically 65-75%)
  - Clock loading and AF usually make domino significantly higher power
  - Design domino logic so that precharge state is most common

Clock Gating
[Figure: clock gating circuit; labels: Enable, Latch (D, Q), CLK, Logic]
- Look for opportunities in the design to turn off the clock
  - Unused hardware, guaranteed to evaluate in a certain direction, don't care
- Advantages
  - Reduces activity factor for clocks
  - Prevents downstream data from switching
- Disadvantages
  - Can be difficult to identify opportunities
    - Goal of most architectures is to keep everything busy
  - Often creates speedpaths
    - Easier to identify gating opportunity closest to the event
  - di/dt problems
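The behavior of a latch-based clock gate can be sketched at the sample level. This model assumes the common arrangement in which the enable latch is transparent while CLK is low and the gated clock is CLK AND the latched enable; the slide's figure is not recoverable, so that arrangement is an assumption, as are the sample waveforms.

```python
def gated_clock(clk, enable):
    """clk and enable are equal-length lists of 0/1 samples.
    Returns the gated clock samples."""
    latched = 0
    out = []
    for c, en in zip(clk, enable):
        if c == 0:              # latch is transparent while the clock is low
            latched = en
        out.append(c & latched) # enable changes during the high phase are ignored
    return out

def rising_edges(samples):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    previous = [0] + samples[:-1]
    return sum(1 for p, s in zip(previous, samples) if p == 0 and s == 1)

clk = [0, 1] * 4                      # four clock cycles, two samples per cycle
enable = [1, 1, 0, 0, 0, 1, 1, 0]     # assumed enable waveform
gated = gated_clock(clk, enable)
print(gated, rising_edges(clk), rising_edges(gated))
```

In this example the downstream logic sees two clock edges instead of four, which is the activity factor reduction the slide lists as the main advantage. The latch also keeps an enable that changes during the high phase from producing a partial pulse.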

Clock Collapsing
- Combine local clocked xtors into grouped single xtor
  - Reduces total driven xtor width
- Reduces generation overhead (pulses, enables)
  - Locally generated pulses/enables require distributed logic: power hungry
- Can cause a performance impact
  - Group domino logic that is unlikely to evaluate together
- Can increase clock skew (esp. on pulses)
  - Tradeoff: power reduction vs. race margin

Clock Throttling
- Dynamically change the frequency of the design
  - Directly impacts the dynamic power of the design
  - Need effective means of changing the frequency on-the-fly
  - Need effective means of triggering the frequency change
- Missing Teeth
  - Globally disable every Nth high phase of the clock
  - Need architecture that understands on-the-fly frequency variation
  - Design must still meet constraints of full frequency clock
[Figure: missing-teeth clock generation; labels: PLL, pllout, Counter, CLK]
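The missing-teeth scheme can be sketched as a counter that suppresses one high phase out of every N. The function name and the per-cycle 0/1 representation are illustrative assumptions, not part of the original design.

```python
def missing_teeth(num_cycles, n):
    """Return one flag per PLL cycle: 1 if the high phase is delivered,
    0 if it is suppressed. Every n-th high phase is dropped (n >= 2)."""
    counter = 0
    pulses = []
    for _ in range(num_cycles):
        counter += 1
        if counter == n:
            pulses.append(0)    # the "missing tooth"
            counter = 0
        else:
            pulses.append(1)
    return pulses

pulses = missing_teeth(8, 4)
effective_fraction = sum(pulses) / len(pulses)
print(pulses, effective_fraction)
```

Dropping every 4th pulse leaves an average clock rate of (N-1)/N = 3/4 of nominal. The delivered pulses keep their full-frequency width and spacing, which is why the design must still meet full-frequency timing constraints.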

Clock Throttling (cont'd)
- Re-sync the PLL
  - Change the fundamental frequency coming from PLL
  - Design doesn't have to meet full frequency constraints during throttle
    - Opportunity to dynamically lower voltage
  - Must wait for PLL to lock to new frequency
    - Limits fast transition in and out of throttled mode
[Figure: PLL re-sync clock generation; labels: System CLK, PLL, Divide by N, CLK]
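Because dynamic power scales with V^2 * F, lowering the voltage along with the frequency saves more than a frequency change alone. The sketch below compares the two cases; the divide-by-2 setting and the 1.5 V and 1.2 V supply values are assumed for illustration.

```python
def throttled_power_ratio(freq_divider, v_nominal, v_throttled):
    """Dynamic power in throttled mode relative to nominal,
    with AF and C unchanged: (F'/F) * (V'/V)^2."""
    freq_ratio = 1.0 / freq_divider
    voltage_ratio = (v_throttled / v_nominal) ** 2
    return freq_ratio * voltage_ratio

frequency_only = throttled_power_ratio(2, 1.5, 1.5)   # divide by 2, same voltage
with_voltage = throttled_power_ratio(2, 1.5, 1.2)     # divide by 2, 1.5 V -> 1.2 V
print(frequency_only, with_voltage)
```

With these assumed values, halving the frequency alone gives half the dynamic power, while also dropping the supply to 1.2 V brings it to about a third.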

Low Swing Circuits
- Separate power supply for portion of the design
  - Using either an explicit power supply or on-die regulation
- Needs separate supply lines
  - Need to separate out well taps
  - Usually share Vss, split Vcc
    - Cuts the overhead in half
- Lower supply = lower performance
  - Identify portions of the design that are not speed critical
    - Caches are one of the best targets
    - Localized logic within larger blocks (hard)
  - Identify structures that are less impacted by lower supply voltage
    - Differential, low swing circuit families (long wires)

Leakage
- Leakage increasing at ~3-4X per process generation
  - Ioff inversely proportional to exp(Vth)
    - Vth is reducing for performance
  - Gate leakage inversely proportional to exp(Tox)
    - Gate leakage quickly approaching same levels as S-D leakage
- Modern microprocessors have up to 20-30% leakage budget
  - Within the large caches, leakage can account for up to 50% of total power
  - Burn-in conditions (stress testing) can increase leakage to over 90% of total power
  - Strong temperature, voltage dependence
[Figure: S-D leakage. Source: V. De et al., ISLPED 1999]
[Figure: gate leakage. Source: K.M. Cao et al., IEDM 2000]
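The exponential dependence of Ioff on Vth can be made concrete with the standard subthreshold model, Ioff proportional to 10^(-Vth/S), where S is the subthreshold swing. The 100 mV/decade swing and the 50 mV threshold shift below are assumed values, not figures from the slides.

```python
def leakage_multiplier(delta_vth_mv, swing_mv_per_decade=100.0):
    """Factor by which subthreshold leakage changes when Vth shifts by
    delta_vth_mv (negative = lower Vth), using Ioff ~ 10^(-Vth/S)."""
    return 10.0 ** (-delta_vth_mv / swing_mv_per_decade)

lower_vth = leakage_multiplier(-50.0)    # Vth reduced by 50 mV
higher_vth = leakage_multiplier(+100.0)  # Vth raised by 100 mV
print(lower_vth, higher_vth)
```

Under these assumptions a 50 mV reduction in Vth multiplies Ioff by roughly 3, and a 100 mV increase cuts it by a factor of 10, which is the lever that the later long-channel, multi-Vt, and body-bias techniques all pull.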

Long Le
- Increasing channel length dramatically reduces leakage
  - Insignificant increase in gate leakage
  - Insignificant increase in gate capacitance (performance)
- Corresponding performance loss
  - Idsat reduced with longer channel
  - Must identify circuits with no performance impact
    - Memories are good candidates

Transistor Stacking
- Stacked devices reduce leakage
  - Intermediate voltages between stacked devices help combat leakage
  - Only true for off xtors in series!!
- Rely on statistical nature of inputs to reduce overall leakage
  - May not be true depending on real distribution of input vectors
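How much stacking helps depends on how often both series devices are actually off. The sketch below averages the leakage of a two-transistor stack over independent random inputs. The relative leakage values, including the 10X reduction when both devices are off, are hypothetical placeholders and not measured data; leakage through the rest of the gate is ignored.

```python
from itertools import product

def expected_stack_leakage(p_high, single_off_leak=1.0, stack_factor=10.0):
    """Expected leakage (relative units) through a two-transistor series
    stack whose gate inputs are independently high with probability p_high."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        prob = (p_high if a else 1.0 - p_high) * (p_high if b else 1.0 - p_high)
        if a == 0 and b == 0:
            leak = single_off_leak / stack_factor   # both off: stack effect applies
        elif a == 1 and b == 1:
            leak = 0.0                              # stack conducting, not leaking
        else:
            leak = single_off_leak                  # only one device off
        total += prob * leak
    return total

print(expected_stack_leakage(0.5), expected_stack_leakage(0.1))
```

With these placeholder values, inputs that are mostly low give a much lower average leakage than evenly distributed inputs, which illustrates why the benefit may not hold for the real distribution of input vectors.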

Multiple Vt
- Strong trend towards multi-Vt processes
- Provides two types of FETs to designer:
  - HiVt: geared towards non-critical transistors to reduce leakage
  - LoVt: geared towards most critical transistors
- Usage/success highly dependent on ability to detect critical paths
  - Large complex designs: not always easy to identify the right locations
  - *Can* run into situation where wrong application of LoVt devices actually slows down the device!!
[Figure: HiVt/LoVt assignment along a critical path; labels: Hi, Lo, Critical Path, LoVt]

Body Biasing
- Reverse body-bias xtors to increase Vt
  - Leakage reduced significantly (S-D; Igate reduced slightly)
  - Performance goes down accordingly
- Can also forward bias xtors to increase performance (and leakage)
- Need to route separate supply for body-biasing
- Can enable during normal operation
  - Either to decrease leakage OR increase performance
- Can also enable only during special operating modes
  - Burn-in: performance doesn't matter, leakage is everything