Lecture 11. Clocking High-Performance Microprocessors



Similar documents
Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

Alpha CPU and Clock Design Evolution

TIMING-DRIVEN PHYSICAL DESIGN FOR DIGITAL SYNCHRONOUS VLSI CIRCUITS USING RESONANT CLOCKING

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

Latch Timing Parameters. Flip-flop Timing Parameters. Typical Clock System. Clocking Overhead

Power Reduction Techniques in the SoC Clock Network. Clock Power

LOW POWER DESIGN OF DIGITAL SYSTEMS USING ENERGY RECOVERY CLOCKING AND CLOCK GATING

On-Chip Interconnect: The Past, Present, and Future

Signal Integrity: Tips and Tricks

TIMING ISSUES IN DIGITAL CIRCUITS

Low latency synchronization through speculation

A 1.62/2.7/5.4 Gbps Clock and Data Recovery Circuit for DisplayPort 1.2 with a single VCO

ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7

Low Power AMD Athlon 64 and AMD Opteron Processors

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

White Paper Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces

Lecture 10: Latch and Flip-Flop Design. Outline

Sequential 4-bit Adder Design Report

FPGA Clocking. Clock related issues: distribution generation (frequency synthesis) multiplexing run time programming domain crossing

ICS514 LOCO PLL CLOCK GENERATOR. Description. Features. Block Diagram DATASHEET

IL2225 Physical Design

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 16 Timing and Clock Issues

S. Venkatesh, Mrs. T. Gowri, Department of ECE, GIT, GITAM University, Vishakhapatnam, India

路 論 Chapter 15 System-Level Physical Design

Fairchild Solutions for 133MHz Buffered Memory Modules

PowerPC Microprocessor Clock Modes

Reducing EMI and Improving Signal Integrity Using Spread Spectrum Clocking

The Motherboard Chapter #5

DESIGN CHALLENGES OF TECHNOLOGY SCALING

Harmonics and Noise in Photovoltaic (PV) Inverter and the Mitigation Strategies

Lecture 11: Sequential Circuit Design

POWER FORUM, BOLOGNA

EMI and t Layout Fundamentals for Switched-Mode Circuits

A 3.2Gb/s Clock and Data Recovery Circuit Without Reference Clock for a High-Speed Serial Data Link

How To Design A Chip Layout

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

EE 42/100 Lecture 24: Latches and Flip Flops. Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad

Abstract. Cycle Domain Simulator for Phase-Locked Loops

Grounding Demystified

DDR subsystem: Enhancing System Reliability and Yield

Evaluating AC Current Sensor Options for Power Delivery Systems

Photonic Networks for Data Centres and High Performance Computing

PCB Design Conference - East Keynote Address EMC ASPECTS OF FUTURE HIGH SPEED DIGITAL DESIGNS

Analyzing Electrical Effects of RTA-driven Local Anneal Temperature Variation

CHAPTER 11: Flip Flops

Jitter in PCIe application on embedded boards with PLL Zero delay Clock buffer

8 Gbps CMOS interface for parallel fiber-optic interconnects

White Paper: Electrical Ground Rules

Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow

On-Chip Interconnection Networks Low-Power Interconnect

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

Lecture 7: Clocking of VLSI Systems

11. High-Speed Differential Interfaces in Cyclone II Devices

AVX EMI SOLUTIONS Ron Demcko, Fellow of AVX Corporation Chris Mello, Principal Engineer, AVX Corporation Brian Ward, Business Manager, AVX Corporation

Clock Distribution Networks in Synchronous Digital Integrated Circuits

Lecture 10: Sequential Circuits

ECEN689: Special Topics in High-Speed Links Circuits and Systems Spring 2010

DirectFET TM - A Proprietary New Source Mounted Power Package for Board Mounted Power

Chapter 6 PLL and Clock Generator

ZL40221 Precision 2:6 LVDS Fanout Buffer with Glitchfree Input Reference Switching and On-Chip Input Termination Data Sheet

Signal Types and Terminations

MODULE BOUSSOLE ÉLECTRONIQUE CMPS03 Référence :

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

On-chip clock error characterization for clock distribution system

Phase-Locked Loop Based Clock Generators

Electromagnetic. Reducing EMI and Improving Signal Integrity Using Spread Spectrum Clocking EMI CONTROL. The authors describe the

PROGETTO DI SISTEMI ELETTRONICI DIGITALI. Digital Systems Design. Digital Circuits Advanced Topics

Multi-GHz Systems Clocking Invited Paper

Sequential Circuits. Combinational Circuits Outputs depend on the current inputs

Efficient Interconnect Design with Novel Repeater Insertion for Low Power Applications

ECE124 Digital Circuits and Systems Page 1

Lecture 2. High-Speed I/O

Current Probes, More Useful Than You Think

Application Note 58 Crystal Considerations with Dallas Real Time Clocks

ICS379. Quad PLL with VCXO Quick Turn Clock. Description. Features. Block Diagram

ICS SPREAD SPECTRUM CLOCK SYNTHESIZER. Description. Features. Block Diagram DATASHEET

International Journal of Electronics and Computer Science Engineering 1482

Equalization/Compensation of Transmission Media. Channel (copper or fiber)

Design Compiler Graphical Create a Better Starting Point for Faster Physical Implementation

Design and analysis of flip flops for low power clocking system

Introduction to CMOS VLSI Design

A PC-BASED TIME INTERVAL COUNTER WITH 200 PS RESOLUTION

inter-chip and intra-chip Harm Dorren and Oded Raz

Timing Methodologies (cont d) Registers. Typical timing specifications. Synchronous System Model. Short Paths. System Clock Frequency

Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package

(Amplifying) Photo Detectors: Avalanche Photodiodes Silicon Photomultiplier

Introduction to Cloud Computing

A NEAR FIELD INJECTION MODEL FOR SUSCEPTIBILITY PREDICTION IN INTEGRATED CIRCUITS

Demystifying Data-Driven and Pausible Clocking Schemes

Lecture-3 MEMORY: Development of Memory:

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

Intel Labs at ISSCC Copyright Intel Corporation 2012

Transcription:

Lecture 11 Clocking High-Performance Microprocessors Computer Systems Laboratory Stanford University horowitz@stanford.edu Copyright 2007 Ron Ho, Mark Horowitz 1 Overview Readings (Actually for today s lecture) Gronowski Alpha clocking paper Restle Clock grid paper Harris Variations paper Today s topics Overview of clock distribution methods and challenges Three schemes of clock distribution Overcoming and exploiting clock skew and jitter 2

Latches and Flops Need Clocks Previous lecture on latches and flops assumed an input clock We talked briefly about imperfections in that clock: skew, jitter In a chip, we can have n clocks for the latches and flops asynchrony n=1: one clock for the whole chip n<10: one clock per major chip portion (e.g., core and cache) 10<n<100: one clock per major functional block (e.g., GALS) 100<n<10000: one clock per datapath stage This lecture will focus on the globally synchronous systems One or a few clocks distributed across the chip How do we do this, and why is it so hard? 3 Principal Goal of Clock Distribution We need to distribute a stable clock across a die Clock should be phase-aligned with an external input clock Skew should be zero, or at least well-characterized Jitter should be zero Both skew and jitter should be insensitive to PVT Duty cycle should be 50%-50%, or at least well-characterized clk1 t skew t +jit clk2 t -jit 4

Other Goals of Clock Distribution Clock network should be low-power But remember that most of the power is burned in the latches/flops Clock distribution should be simple to design and verify Clocking should allow external buses at slower frequencies Bus/system clock is much slower than on-chip clock (why?) Example: 533MHz bus clock, 3.0GHz chip clock Often at a non-integer fraction of the CPU clock (3/5, 2/9 ) Every n th CPU clock is phase-aligned with every m th bus clock Clock generation should support test and debug Control over frequency, skews, and duty cycle 5 Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Uncertainty from clock generation clk source clk loads (latches/flops) 6

Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Mismatched buffers Uncertainty from clock generation clk source clk loads (latches/flops) 7 Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Mismatched buffers Uncertainty from clock generation clk source Misplaced buffers clk loads (latches/flops) 8

Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Mismatched buffers Mismatched wires Uncertainty from clock generation clk source Misplaced buffers clk loads (latches/flops) 9 Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Mismatched buffers Mismatched wires Variable loads Uncertainty from clock generation clk source Misplaced buffers clk loads (latches/flops) 10

Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Mismatched buffers Mismatched wires Variable loads Uncertainty from clock generation clk source Misplaced buffers clk loads (latches/flops) Temp variations 11 Principal Goal is Difficult Uncertainty in the clocking network appears everywhere! Affects skew, jitter, and duty cycle of the delivered clock Power supply variations Mismatched buffers Mismatched wires Variable loads Uncertainty from clock generation clk source Misplaced buffers clk loads (latches/flops) Temp variations 12

Mismatched Buffers Earlier we talked about process corners (FF, SS, TT, SF, etc.) This represents variation from one die to another die There is also significant cross-die variation: process tilt ITRS predicts 3σ of L e will grow to be about 20% of L e Two devices on opposite chip sides will behave differently Won t show up in process corners: all devices identically fast in FF There is also local systematic variation: proximity effects Local wiring density affects etch rates and thus dimension control Use dummy structures to equalize and match environments Both process tilt and proximity effects must be considered 13 Mismatched Wires and Misplaced Buffers We would prefer to put buffers in exactly the right places But usually we cannot due to congestion (middle of the cache?) Leads to mismatched wire loads Wires don t always fit exactly right either They also have to fit in the middle of other layouts You get different capacitive components (M5-M4, M5-M5, fringe ) But even if we put everything in the right places (Expensive, but perhaps worth it for the clock) Still have process tilt and proximity effects for wires Again, not counted in process corner simulations Designers often use shield wires and then ignore this 14

Variable Loads Clocks drive latches and flops, which can be passgate loads Load depends on the data going through the passgate Drain Source Drain Fall Fall Low High High High Rise Source Fall Low Low Low High Rise Rise Gate load 1.3x 1.1x 1.0 0.8x 0.42x 0.31x 0.13x Source: Bailey JSSC 1998 2:1 variation 10:1 variation 15 Power Supply Variations RTL-based power traces can drive a global grid simulation Example of Itanium II power supply variations (snapshot in time) Source: Harris TransVLSI 12/01 16

Temperature Variations Same for temperature variations, but these are slower effects Long-term skew and/or jitter 17 Source: Harris TransVLSI 12/01 So What Makes Up Skew? If we eliminate systematic offsets we care mostly about devices Source: Geannopoulos ISSCC 1998 18

How Much Skew Do We Have? Historically, skew budgets have been about 5% of the cycle time What people generally try to hit in designs Source: Rusu Intel 19 Simulating Skew Do not just gather all 3σ numbers in the worst possible way Way too pessimistic; you'll never get the design done You ll never see 3σ variation in clock buffers across a single datapath Statistical simulations figure out the right skew budget Key point: skew from process variation affects both clock and data Data variation makes local skew budgets worse but makes the difference in global and local skew smaller Global skew Local skew C1 C2 20

Simulating Skew: A Note on Inductance Inductance doesn t matter for skinny signal wires Short wires: driver resistance swamps out wire inductance When R driver > 2Z 0, where Z 0 Long wires: wire resistance swamps out wire inductance When R wire > 2Z 0 (i.e., if attenuation factor α > 1) RLC models are exceedingly complicated and time-consuming But: model L for clock networks Wires are wide (much less R wire ) We care about ps of skew Only a few wires (100s) to model L noise also worth modeling Source: Restle JSSC 1998 21 What About Cycle-to-Cycle Variation (Jitter)? Jitter comes from temporal variations in the clock network Some comes from the PLL/DLL, but not very much Power supply changes mostly; temperature variations are slow Also data coupling noise, but can design this away with shielding Jitter depends linearly on the delay through the clock network Insertion delay is the time from PLL to the final latches Around 2 cycles in CPUs today Note that jitter is primarily a max-path problem Min-path problems are at the same clock edge Although beware differential jitter between different clock nets 22

How Much Jitter Do We Have? Also around 5% of the cycle time (statistical simulations) But you should ask: how do people measure jitter? More later Source: Rusu Intel 23 Clock Distribution Schemes Three principal ways of distributing clocks around a chip Grid (DEC Alpha processors) H-tree (IBM S/390, Intel Itanium) Spines (Intel PentiumII, PentiumIII, P4) Tree-Grid hybrids (IBM Power, Sun Ultrasparc) Other methods also being researched Optical distribution Resonant clocks 24

Clock Grids Cover the entire chip with a grid that carries clock Good: simple, low skew (because everything is shorted together) Bad: lots of power and metal resources consumed Need to figure out from where you will drive the grid 25 Three Generations of Gridded Alphas DEC Alpha used clock grids, driven by panes of drivers 21064 (1993): 200MHz, 0.75μm process, one driver pane 21164 (1996): 333MHz, 0.5μm process, two driver panes 21264 (1997): 600MHz, 0.35μm process, sixteen driver panes Source: Gronowski JSSC 1998 26

Alpha Gridded Skews Skews dropped: 240pS to 120pS to 75pS As fraction of Tcycle: 4.8%, 3.6%, 4.5% Simulated only, not measured Source: Gronowski JSSC 1998 27 Thermal Image of 21064 and 21164 Single pane driver on 21064 was definitely a hotspot Splitting into two panes reduced thermal gradients significantly Source: Gronowski JSSC 1998 28

Clock H-Trees Balanced H-Trees are complex to design but are appealing Good: Low power, low routing resources Bad: Skew between divergent branches, complex design Example: IBM S/390 used balanced RC paths in the tree Source: Webb JSSC 1997 29 H-Trees and Points of Divergence Skew between nodes on an H-Tree depends on location Their last common node is their Point of Divergence (POD) Two nodes with a near POD = low skew (10% s+j budget) Two nodes with a far POD = high skew (20% s+j budget) Trees must balance RC delays with non-uniform loads small skew big skew IBM testchip with loads shown by diameter Source: Restle VLSI 2000 30

Hybrid: Trees Driving Grids Eliminates skew differences due to POD But consumes lots of power, like grids IBM Power4 in 0.18μm consumes 70% power in clocks and latches Source: Anderson ISSCC 2001 Source: Restle ISSCC 2001 31 Hybrid: Trees Driving Grids Simulations of skew-smoothing that you get with grids IBM Power4 with intentional skew added to one buffer side Simulations only 32

Animations of Grid Clocks Restle (IBM Research) does interesting work visualizing circuits In particular, clock animations http://researchweb.watson.ibm.com/people/r/restle/animations/dac01top.html 33 Hybrid: Trees Driving Grids Sun UltraSparcIII: balanced tree to global grid (skew simulated) Source: Heald JSSC 2000 34

Clock Spines Spines look like a collapsed tree Each spine is equipotential and can radiate outwards Example: First Intel Pentium4 microprocessor used three spines Source: Kurd ISSCC 2001 Source: Kurd JSSC 2001 35 Clock Spines Skew Later Intel s 90nm P4 added more spines Skew was simulated to be under 10pS Source: Bindal ISSCC 2003 36

A True Hybrid Compaq Alpha (EV8, r.i.p.) Grid in core, panes and trees in memory/network, trees in cache Skew in the glue memory/network controller < 90pS Source: Xanthopoulos ISSCC 2001 37 Minimizing Skew In Design Follow basic heuristics to match all gates Single cell design with dummy structures around it, replicated Same cell orientation in each instance Similar heuristics for wires Consider and account for return path resistance Heavily shield clock routes: makes C uniform and L minimal Limit maximum wire dimensions: avoid skin effect Simplify the problem by using templates All of M5 looks like the following, repeated over the entire die Gnd/Vdd Local Clk Data 38

Active Skew Cancellation (Insertion) Look at clocks and adjust them Need to be careful about the loop bandwidth and stability concerns Example from Intel PentiumIII (dual spine design) Source: Geannopoulos ISSCC 1998 39 Multi-zone Deskew Circuits Example from Intel Itanium clocking Distribute an array of deskew buffers, each tying together 4 zones Source: Rusu ISSCC 2000 40

Deskew Effectiveness Chip Publication Zones Skew After Stepsize PentiumIII ISSCC98 2 60pS 15pS 12pS Itanium ISSCC00 30 110pS 28pS 8pS Pentium4 ISSCC01 47 64pS 16pS 8pS Itanium (3 rd G) ISSCC03 23 60pS 7pS 7pS Source: Stinson ISSCC 2003 41 Minimizing Jitter In Design Jitter predominantly comes from power supply noise So filter the clock power supply Example from Pentium4 clock network (simulated) Source: Kurd JSSC 2001 42

Minimizing Jitter In Design, con t Compaq Alpha EV8 used essentially the same idea (simulated) Source: Xanthopoulos ISSCC 2001 43 Measuring Skew and Jitter There's a good reason why skew numbers are usually simulated E-beam and picoprobe don t work with flip-chip packaging IBM PICA: Picosecond Imaging Circuit Analysis Saturated nmos emit photons in the channel Very rarely; photons photomultiplied exponentially to be seen Subsample to probe the whole chip at once over a long time Can measure skew by imaging XOR of two different clocks Intel LVP: Laser Voltage Probe Shine a laser onto the diffusion (from the back of the substrate) Need a good transparent heat-spreader (diamond) Reflectivity depends on voltage bias. Probes only one point at a time, but rapidly 44

PICA Movie of Ring 45 PICA Measurements PICA measurements of S/390 clock shown Source: Tsang IBM JourResDev 2000 46

LVP Measurements of Skew Voltage levels not well characterized, but transitions are Good for teasing out skew numbers from the edges Shows a layout error that was fixed in a subsequent stepping Source: Anderson ISSCC 2002 47 Measuring Jitter Almost impossible to do, so everybody estimates it You can try to measure total skew + jitter (Restle ISSCC 2005) Both PICA and LVP are averaging measurements (no jitter) Typical schemes drive a clock node to a pin to expose it But need to buffer up to drive the pin Ramp-up chain introduces jitter itself Measurement scopes have source jitter, too Measured jitter: 150pS. Source trigger jitter: 150pS. Hmmm Commercial systems try to cancel trig jitter (M2, Wavecrest) But still end up with on-chip buffer jitter 48