Alpha CPU and Clock Design Evolution



Similar documents
Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

TIMING-DRIVEN PHYSICAL DESIGN FOR DIGITAL SYNCHRONOUS VLSI CIRCUITS USING RESONANT CLOCKING

Power Reduction Techniques in the SoC Clock Network. Clock Power

PowerPC Microprocessor Clock Modes

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

IL2225 Physical Design

Low Power AMD Athlon 64 and AMD Opteron Processors

ICS379. Quad PLL with VCXO Quick Turn Clock. Description. Features. Block Diagram

ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7

Lecture 11. Clocking High-Performance Microprocessors

Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package

11. High-Speed Differential Interfaces in Cyclone II Devices

Abstract. Cycle Domain Simulator for Phase-Locked Loops

On-Chip Interconnect: The Past, Present, and Future

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

TIMING ISSUES IN DIGITAL CIRCUITS

ANN Based Modeling of High Speed IC Interconnects. Q.J. Zhang, Carleton University

Sequential 4-bit Adder Design Report

Latch Timing Parameters. Flip-flop Timing Parameters. Typical Clock System. Clocking Overhead

Fairchild Solutions for 133MHz Buffered Memory Modules

Phase-Locked Loop Based Clock Generators

ICS514 LOCO PLL CLOCK GENERATOR. Description. Features. Block Diagram DATASHEET

Timing Methodologies (cont d) Registers. Typical timing specifications. Synchronous System Model. Short Paths. System Clock Frequency

DESIGN CHALLENGES OF TECHNOLOGY SCALING

TRIPLE PLL FIELD PROG. SPREAD SPECTRUM CLOCK SYNTHESIZER. Features

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

IBIS for SSO Analysis

How To Design A Chip Layout

International Journal of Electronics and Computer Science Engineering 1482

PL-277x Series SuperSpeed USB 3.0 SATA Bridge Controllers PCB Layout Guide

Lecture 7: Clocking of VLSI Systems

Reconfigurable ECO Cells for Timing Closure and IR Drop Minimization. TingTing Hwang Tsing Hua University, Hsin-Chu

Clock Distribution in RNS-based VLSI Systems

Features. Modulation Frequency (khz) VDD. PLL Clock Synthesizer with Spread Spectrum Circuitry GND

Status of the design of the TDC for the GTK TDCpix ASIC

A 10,000 Frames/s 0.18 µm CMOS Digital Pixel Sensor with Pixel-Level Memory

AP Microcontroller. EMC Design Guidelines for Microcontroller Board Layout. Microcontrollers. Application Note, V 3.

路 論 Chapter 15 System-Level Physical Design

Clock Distribution Networks in Synchronous Digital Integrated Circuits

RF data receiver super-reactive ASK modulation, low cost and low consumption ideal for Microchip HCS KEELOQ decoder/encoder family. 0.

Title: Low EMI Spread Spectrum Clock Oscillators

University of Texas at Dallas. Department of Electrical Engineering. EEDG Application Specific Integrated Circuit Design

ICS SPREAD SPECTRUM CLOCK SYNTHESIZER. Description. Features. Block Diagram DATASHEET

S. Venkatesh, Mrs. T. Gowri, Department of ECE, GIT, GITAM University, Vishakhapatnam, India

ZL40221 Precision 2:6 LVDS Fanout Buffer with Glitchfree Input Reference Switching and On-Chip Input Termination Data Sheet

Design and analysis of flip flops for low power clocking system

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

Application Note AN:005. FPA Printed Circuit Board Layout Guidelines. Introduction Contents. The Importance of Board Layout

This application note is written for a reader that is familiar with Ethernet hardware design.

LOW POWER DESIGN OF DIGITAL SYSTEMS USING ENERGY RECOVERY CLOCKING AND CLOCK GATING

A 1.62/2.7/5.4 Gbps Clock and Data Recovery Circuit for DisplayPort 1.2 with a single VCO

Efficient Interconnect Design with Novel Repeater Insertion for Low Power Applications

Application Note 58 Crystal Considerations with Dallas Real Time Clocks

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

Managing High-Speed Clocks

DDR subsystem: Enhancing System Reliability and Yield

Class 11: Transmission Gates, Latches

Signal Integrity: Tips and Tricks

Lecture 11: Sequential Circuit Design

ISSCC 2003 / SESSION 10 / HIGH SPEED BUILDING BLOCKS / PAPER 10.5

EE 42/100 Lecture 24: Latches and Flip Flops. Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad

Sentinel-SSO: Full DDR-Bank Power and Signal Integrity. Design Automation Conference 2014

Grounding Demystified

PROGETTO DI SISTEMI ELETTRONICI DIGITALI. Digital Systems Design. Digital Circuits Advanced Topics

Sequential Circuit Design

The Future of Multi-Clock Systems

SPREAD SPECTRUM CLOCK GENERATOR. Features

Low latency synchronization through speculation

PCB Design Conference - East Keynote Address EMC ASPECTS OF FUTURE HIGH SPEED DIGITAL DESIGNS

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

A Survey on Sequential Elements for Low Power Clocking System

IDT80HSPS1616 PCB Design Application Note - 557

ECE410 Design Project Spring 2008 Design and Characterization of a CMOS 8-bit Microprocessor Data Path

Evaluating AC Current Sensor Options for Power Delivery Systems

Lecture 2. High-Speed I/O

Clocks Basics in 10 Minutes or Less. Edgar Pineda Field Applications Engineer Arrow Components Mexico

Chapter 6 PLL and Clock Generator

RX-AM4SF Receiver. Pin-out. Connections

4 OUTPUT PCIE GEN1/2 SYNTHESIZER IDT5V41186

McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 16 Timing and Clock Issues

Supply voltage Supervisor TL77xx Series. Author: Eilhard Haseloff

Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow

Push-Pull FET Driver with Integrated Oscillator and Clock Output

StarRC Custom: Next-Generation Modeling and Extraction Solution for Custom IC Designs

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter

EVALUATION KIT AVAILABLE Single-Chip Global Positioning System Receiver Front-End BIAS CBIAS GND GND RFIN GND GND IFSEL

Figure 1 FPGA Growth and Usage Trends

8 Gbps CMOS interface for parallel fiber-optic interconnects

HT1632C 32 8 &24 16 LED Driver

Supertex inc. HV Channel High Voltage Amplifier Array HV256. Features. General Description. Applications. Typical Application Circuit

The Classical Architecture. Storage 1 / 36

Sequential Logic. (Materials taken from: Principles of Computer Hardware by Alan Clements )

Chapter 13: Verification

Baseband delay line QUICK REFERENCE DATA

2 TO 4 DIFFERENTIAL PCIE GEN1 CLOCK MUX ICS Features

On-chip clock error characterization for clock distribution system

Transcription:

Alpha CPU and Clock Design Evolution This lecture uses two papers that discuss the evolution of the Alpha CPU and clocking strategy over three CPU generations Gronowski, Paul E., et.al., High Performance Microprocessor Design, IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, May 1998, pp. 676-686 Bailey, Daniel W. and Bradley J. Benschneider, Clocking Design and Analysis for a 600-Mhz Alpha Microprocessor, IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, November 1998, pp. 1627-1633 These papers are interesting because one can see how shrinking device sizes presented new design challenges that forced design methodology changes. All notes in this lecture are from these two papers. Integer performance graph. Clock speed increased from 150 Mhz to 600 Mhz. BR 6/00 1 Gronowski, Paul E., et.al., High Performance Microprocessor Design, IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, May 1998, pp. 676-686 BR 6/00 2 21064 Die Photo Single Clock driver, 2 transistors for buffer visible to naked eye Transistor count increased by 9X (mainly due to caches) Die size slight increase to constant Frequency 3X increase Power dissipation - 2.4X increase Power supply reduced by 1/3 Gate delays indicates longest register-to-register path BR 6/00 3 Clocking scheme was 2 phase, single wire. Clock load was 3.5 nf Gate length of final driver was 35 cm (not a misprint, used serpentine layout to get this gate length). BR 6/00 4 21064 Clock Skew Distribution Max clock skew approx. 180 ps (3.6% of 5 ns clock period) 1 gate delay about 300 ps, so clock skew about 50% of a gate delay. Thermal Image of 20164 76C at center of chip 46C at edges of chip Note the skew is smallest closest to center of chip where driver is located. 30C thermal gradient across chip!! BR 6/00 5 BR 6/00 6

On Chip Decoupling Capacitors Large inrush current at clock edges (di/dt) required use of on-chip decoupling capacitors placed around clock driver. Created a decoupling capacitor using NMOS transistor: Vdd Cg 21164 Clock Distribution Goal of 21164 Clock distribution was to reduce skew by 30% and reduce the thermal gradient. A predriver was centered between two main clock drivers Predriver Clock skew was reduced by factor of 2, and thermal gradient was reduced. Main clock drivers BR 6/00 7 BR 6/00 8 Clock skew lowest near two main clock drivers. Max clock skew approx. 80 ps (2.4% of 3.3 ns clock period) 1 gate delay about 240 ps, so clock skew about 1/3 of a gate delay. Aside: Why a Gridded Clock? Both 21064, 21164 used a single global clk distributed by a metal clock grid. Skew is largely determined by grid interconnect density and is insensitive to gate load placement Why? Because capacitance of grid wiring dominates the gate loads connected to it. Universal availability of clock signals Design teams can proceed in parallel since clock constraints well known Good process-variation tolerance The disadvantage is the extra capacitance of the grid Power-performance tradeoff is determined by choice of skew target, which establishes the needed grid density, which determines the clock driver size. BR 6/00 9 BR 6/00 10 21264 Clock Distribution 21264 clocking fundamentally different from previous Alphas because it supported a hierarchy of clocks Still had a GCLK (global Clk) grid, but conditional and local clocks had several buffer stages after GCLK Conditional clocks used to save power Clocks gated to functional units in design If not executing a floating point instruction, then stop the clock to the floating point unit to save power! State elements and clocking points were 0 to 8 gates past Gclk Six major regional clocks two gain stages past GCLK with grids juxaposed with GCLK, but shielded from it. Major clocks drive local clocks and conditional clocks Goals were to improve performance, reduce power. Clock Hierarchy of 21264 BR 6/00 11 BR 6/00 12

21264 Global Clock Distribution Phased Lock Loops (PLLs) PLLs and Delay Locked Loops (DLLs) are used to perform clock multiplication of an off-chip clock PLLs/DLLs used to align clock edges of original clock with multiplied clocks PLLs are analog circuits that use a charge pump and a voltage controlled oscillator (VCO) to perform phase alignment Alpha 21264 PLL used a separate, regulated 3.3 V supply and was located in the corner of the chip to minimize noise impact Section 9.5.2 of Rabaey text has a block diagram of a PLL All high performance CPUs and most ASICs now include a PLL for internal clock generation Window pane arrangement - same skew to all panes. Note redudant drive to clock nets BR 6/00 13 BR 6/00 14 Global clock grid. Uses 3% of M3/M4 routing layers (lines in picture are misleadingly thick). All GCLK lines are laterally shielded by Vss/Vdd signal GCLK signal Vss Vdd Lateral shielding via Vss/Vdd prevents clock noise from coupling into signal lines. Clock wires and lateral shields were manually placed BR 6/00 15 BR 6/00 16 Simulated worst case GCLK skew was 72 ps. Skew on M1, M2 was less than 10 ps. Major clocks are two inversions past GCLK Measured worst case GCLK skew via ebeam tester was 65 ps BR 6/00 17 Major clocks saved power over a single global clocks because they service a lighter load and distribution area is smaller both of these means smaller drivers are needed. Gclk+Major clocks used 24 W @ 2.2 V, 600 Mhz. It is estimated that at least 40W would have been required if only global clocks were used. 10%-90% rise/fall times were targeted at < 320 ps. BR 6/00 18

Local, Conditional Clocks Major Clock grids. Densest major clock grids used up to 6% of M3/M4 routing. White areas are serviced by local clocks, local clocks also present in major clock grids. Major clocks also laterally shielded. Local clocks generated from any clock GCLK, Major clocks, other local clocks Local clocks were neither shielded or gridded Having local clocks gave freedom to move clock edges with respect to data to solve timing problems This is another form of time borrowing 60,000 local clocks nodes, all were analyzed with SPICE using minimum and maximum gate capacitance estimates Some local clocks had very high min/max delay variation tolerances (up to 280 ps) BR 6/00 19 BR 6/00 20 effective gate input capacitance is not a constant depends on state of other inputs. Causes timing characterization headaches!! Power Consumption 21264 72 W total (600 Mhz @ 2.2 V) Clock distribution power consumption: 46.8 W Gclk 10.2 W Major Clocks 24 W Local unconditional clocks 7.6 W Local conditional clocks 15.6 Clocking accounted for 65% of the total power in the 21264! BR 6/00 21 BR 6/00 22 Power Distribution 21064 Power Distribution 21164 M3 vertical, M2 horizontal. M3 width was twice the process minimum. Vdd/Gnd bond pads only on top/bottom, connected to M2. Supply current of 9A (30W @ 3.3 V). BR 6/00 23 M3, M4 used for power grid. Vdd/Gnd bond pads on all 4 sides. Supply current was 15.2A (50W @ 3.3 V). BR 6/00 24

Power Distribution 21264 21064, 21164 used level-based design (latch based) 21064 used a version of the TSPC latch (true-single phase clock) 21164 used a simpler latch design that had less latency Both embedded logic into latches to reduce gate delays Supply current of 33A! (72 W @ 2.2 V). Difference in current requirements between adjacent clock phases because of conditional clocking could be as high as 25A. This required two, thick, low resistance aluminum reference planes to be used for Vdd/Vss. The VSS plane reduced crosstalk between M3 and M2 signals. 15%-20% of the die area was used for on-chip decoupling capacitors. BR 6/00 25 21064 latch design, dynamic 21164 latch design, dynamic. BR 6/00 26 Embedded logic in 21064, 21164 latches 21264 used edge-triggered design Because of gated clock usage to save power, needed static latches. Went to edge-triggered design to simply timing verification since timing was now much more complicated due to conditional clocking. Family of edgetriggered FFs built using this static latch design. Embedding logic in latches helps to reduce the number of gate delays between latches. BR 6/00 27 BR 6/00 28 Timing Analysis 21264 ECLK, FCLK major clocks derived clocks from GCLK. Need to hand values off between clock domains. D is drive path delay time. R is receive path delay time. Worst case delay analysis maximizes D, minimizes R (if R arrives early, then can clock in data before it arrives). BR 6/00 29 Timing Analysis 21264 (cont.) Race analysis determines if value produced by ECLK is actually clocked in by FCLK before a new value from ECLK arrives. Race analysis minimizes D, maximizes R. BR 6/00 30

Worst Case Analysis Race Analysis Max D Min D Eclk Eclk Eclk dataa Eclk datab Eclk dataa Eclk datab Min R Fclk Fclk Fclk stage gets incorrect value because new data is not ready yet before Fclk edge arrives. Delay on Fclk helps in this case. BR 6/00 31 Max R Fclk stage gets incorrect value because of race condition (shortest path in logic exercised). Delay on Fclk hurts in this case. BR 6/00 32 Conditional Clock Analysis Same analysis (worst case, race) applied for Conditional Clocks. BR 6/00 33