DESIGN AND TUNING OF A TREE-MESH CLOCK DISTRIBUTION

Similar documents
Alpha CPU and Clock Design Evolution

Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

Lecture 11. Clocking High-Performance Microprocessors

White Paper Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces

University of Texas at Dallas. Department of Electrical Engineering. EEDG Application Specific Integrated Circuit Design

IL2225 Physical Design

Design Compiler Graphical Create a Better Starting Point for Faster Physical Implementation

Reconfigurable ECO Cells for Timing Closure and IR Drop Minimization. TingTing Hwang Tsing Hua University, Hsin-Chu

Efficient Interconnect Design with Novel Repeater Insertion for Low Power Applications

Digital to Analog Converter. Raghu Tumati

Dual DIMM DDR2 and DDR3 SDRAM Interface Design Guidelines

TIMING-DRIVEN PHYSICAL DESIGN FOR DIGITAL SYNCHRONOUS VLSI CIRCUITS USING RESONANT CLOCKING

Resonant Clock Design for a Power-efficient, High-volume

Frequency Response of Filters

How To Design A Chip Layout

Sequential 4-bit Adder Design Report

StarRC Custom: Next-Generation Modeling and Extraction Solution for Custom IC Designs

Addressing the DDR3 design challenges using Cadence DDR3 Design-In Kit

FPGA Clocking. Clock related issues: distribution generation (frequency synthesis) multiplexing run time programming domain crossing

Title: Low EMI Spread Spectrum Clock Oscillators

ATE-A1 Testing Without Relays - Using Inductors to Compensate for Parasitic Capacitance

Cloud-Based Apps Drive the Need for Frequency-Flexible Clock Generators in Converged Data Center Networks

DDR subsystem: Enhancing System Reliability and Yield

Managing High-Speed Clocks

Features. Modulation Frequency (khz) VDD. PLL Clock Synthesizer with Spread Spectrum Circuitry GND

Objective. Testing Principle. Types of Testing. Characterization Test. Verification Testing. VLSI Design Verification and Testing.

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

RF data receiver super-reactive ASK modulation, low cost and low consumption ideal for Microchip HCS KEELOQ decoder/encoder family. 0.

Fairchild Solutions for 133MHz Buffered Memory Modules

Latch Timing Parameters. Flip-flop Timing Parameters. Typical Clock System. Clocking Overhead

Sentinel-SSO: Full DDR-Bank Power and Signal Integrity. Design Automation Conference 2014

e.g. τ = 12 ps in 180nm, 40 ps in 0.6 µm Delay has two components where, f = Effort Delay (stage effort)= gh p =Parasitic Delay

AVX EMI SOLUTIONS Ron Demcko, Fellow of AVX Corporation Chris Mello, Principal Engineer, AVX Corporation Brian Ward, Business Manager, AVX Corporation

AN-837 APPLICATION NOTE

Jitter Transfer Functions in Minutes

A 3.2Gb/s Clock and Data Recovery Circuit Without Reference Clock for a High-Speed Serial Data Link

POWER FORUM, BOLOGNA

Signal Integrity: Tips and Tricks

Abstract. Cycle Domain Simulator for Phase-Locked Loops

Low Power AMD Athlon 64 and AMD Opteron Processors

8 Gbps CMOS interface for parallel fiber-optic interconnects

Reducing EMI and Improving Signal Integrity Using Spread Spectrum Clocking

Equipment: Power Supply, DAI, Transformer (8341), Variable resistance (8311), Variable inductance (8321), Variable capacitance (8331)

AMS Verification at SoC Level: A practical approach for using VAMS vs SPICE views

Step Response of RC Circuits

11. High-Speed Differential Interfaces in Cyclone II Devices

Accurate Measurement of the Mains Electricity Frequency

High Precision TCXO / VCTCXO Oscillators

CHIP-PKG-PCB Co-Design Methodology

DEVELOPMENT OF DEVICES AND METHODS FOR PHASE AND AC LINEARITY MEASUREMENTS IN DIGITIZERS

USB 3.0 CDR Model White Paper Revision 0.5

Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow

ECEN689: Special Topics in High-Speed Links Circuits and Systems Spring 2010

GaAs Switch ICs for Cellular Phone Antenna Impedance Matching

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

CHAPTER 11: Flip Flops

Designing the NEWCARD Connector Interface to Extend PCI Express Serial Architecture to the PC Card Modular Form Factor

BURST-MODE communication relies on very fast acquisition

Phase-Locked Loop Based Clock Generators

Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package

The Gamma Match. 1 Equal Size Elements

RF Measurements Using a Modular Digitizer

RC Scrub R. High Performance Test Socket Systems for Micro Lead Frame Packages

A Pausible Bisynchronous FIFO for GALS Systems

TI GPS PPS Timing Application Note

PL-277x Series SuperSpeed USB 3.0 SATA Bridge Controllers PCB Layout Guide

On-Chip Interconnect: The Past, Present, and Future

Document Contents Introduction Layout Extraction with Parasitic Capacitances Timing Analysis DC Analysis

CLOCK DOMAIN CROSSING CLOSING THE LOOP ON CLOCK DOMAIN FUNCTIONAL IMPLEMENTATION PROBLEMS

Laboratory 4: Feedback and Compensation

Understanding Power Impedance Supply for Optimum Decoupling

Power Reduction Techniques in the SoC Clock Network. Clock Power

S-PARAMETER MEASUREMENTS OF MEMS SWITCHES

Since any real component also has loss due to the resistive component, the average power dissipated is 2 2R

Source-Synchronous Serialization and Deserialization (up to 1050 Mb/s) Author: NIck Sawyer

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

Design of Experiments (DOE) Tutorial

S-Band Low Noise Amplifier Using the ATF Application Note G004

RLC Series Resonance

SPREAD SPECTRUM CLOCK GENERATOR. Features

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter

Clock Distribution Networks in Synchronous Digital Integrated Circuits

Data Center Switch Fabric Competitive Analysis

Agilent EEsof EDA.

Customer Training Material. Lecture 4. Meshing in Mechanical. Mechanical. ANSYS, Inc. Proprietary 2010 ANSYS, Inc. All rights reserved.

Using CFD to improve the design of a circulating water channel

Chapter 2 Sources of Variation

Route Power 10 Connect Powerpin 10.1 Route Special Route 10.2 Net(s): VSS VDD

VLSI Design Verification and Testing

Wideband Driver Amplifiers

S. Venkatesh, Mrs. T. Gowri, Department of ECE, GIT, GITAM University, Vishakhapatnam, India

Minimizing crosstalk in a high-speed cable-connector assembly.

System on Chip Design. Michael Nydegger

Model-Based Synthesis of High- Speed Serial-Link Transmitter Designs

VWF. Virtual Wafer Fab

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

ε: Voltage output of Signal Generator (also called the Source voltage or Applied

An octave bandwidth dipole antenna

Transcription:

DESIGN AND TUNING OF A TREE-MESH CLOCK DISTRIBUTION Nikhil Jayakumar, Dave Murata, Valery Kugel { nikhilj, dmurata, valery } @juniper.net Juniper Networks

OVERVIEW Comparison of Clock trees vs Clock Grids/Mesh Juniper s Clock distribution design overview Juniper s 2 step tuning flow for clock meshes Coarse Tuning Fine Tuning Conclusion

CLOCK TREES VS CLOCK GRIDS There are 2 two kinds of clock skews Structural (layout) skew Capacitive load mismatch Wire length mismatch STA Handled by balanced clock trees (Eg: Htrees) Zero / low skew only in the absence of PVT variations Skew due to PVT variations Two types of approaches to handle this: Dynamic: Dynamic clock deskewing schemes Static: Cross-link addition Clock mesh / grid / hybrid tree-mesh Necessitates SPICE based analysis Regular STA won t work due to re-convergences. (more on this later.)

VERTICAL CLOCK SPINE + HORIZONTAL CLOCK RIBS Constructed to be balanced have low latency (and hence low jitter) Wire width, spacing, buffer drive strength, wire length between buffers chosen after careful simulation. Factors considered: Jitter (chose wire code for minimum jitter per unit length) Slew constraints Dynamic IR drop & EM limits Routability &area constraints Overshoot & undershoot due to inductance Cancel out PVT variations through insertion of cross-links (shorting wires) at regular intervals. Cross-links were inserted only if skew reduction outweighed jitter increase.

WHY CROSS-LINKS COMPLICATE TIMING? 0p 50p 100p 150p 200p 250p 300p 395p 350p? STA cannot handle re-convergence in non-linear circuits. SPICE confirms the averaging effect of the short, but STA cannot see this. Where is the point of divergence? 405p? Need a SPICE simulation to estimate delays.

JUNIPER GLOBAL CLOCK DISTRIBUTION Hybrid tree-mesh Balanced tree driving a mesh Cross-links added at regular intervals in the tree also to reduce skew due to PVT Horizontal clock ribs Vertical clock spine Core clock region Construction: PLL drives Vertical Spine Vertical Spine drives 6 Horizontal Ribs Horizontal Ribs drive clock mesh 3.1mm Technology Details: Frequency: 700Mhz to 800Mhz TSMC 40nm (45GS_1P10M_6X1Y2Z + Al RDL) Top 2 (thick) metal layers (Mz) used to distribute the core clock 17mm

WHY REDUCE SKEW IN A MESH? Q : Clock meshes reduce skew - so then why do we have to tune it? Clock meshes have an effect of averaging the delay but at the cost of short circcuit current Large skew can result in a very large short-circuit current for drivers whose outputs are shorted Should not rely on the mesh to reduce structural skew. The mesh is used to only reduce PVT skew.

JUNIPER S 2 STEP TUNING FLOW 1. Coarse-tuning through balancing Tuning the vertical spine and horizontal ribs through RC balancing Tuning the mesh through selective removal of horizontal cross-link wires in the mesh Based on effective wire length (capacitance) driven by each buffer 2. Fine-tuning through driver sizing Automatic driver tuning flow that sizes drivers in the vertical spine and horizontal ribs Drivers are sized to achieve uniform output delay and slew Flow can simultaneously size several thousands of buffers Manual tuning is impossible on such a scale

COARSE TUNING FLOW OF THE MESH DB with full clock mesh Remove all horizontal Mz wires of the clock mesh except the ones closest to the horizontal clock ribs Find effective length (and thus capacitance) of vertical Mz wires of clock mesh driven by each buffer Add back horizontal Mz cross-links such that total effective capacitance is equal across all output buffers Extract Clock mesh (STAR-RC) Simulate in SPICE and verify skew

FINE TUNING FLOW Extracted netlist Simulate in SPICE and gather slew and delay data Re-size buffers based on slew at output of buffers (aim is to get slew at all buffers to be uniform) Simulate modified netlist (with re-sized buffers) and gather slew and delay data Is [skew(previous_run) skew (current_run)] > 1ps? NO Modified netlist & DB YES Buffers are sized based on output slew If slew is larger than target slew, the buffer is up-sized proportionally to achieve target slew If slew is smaller than target slew, the buffer is down-sized proportionally to achieve target slew The fine tuning flow is able to converge to a low-skew solution within 2 to 3 iterations Buffers can be re-sized without re-extracting since the buffers are designed to be footprint compatible Saves significant runtime since extraction alone can take a day or more

RESULTS The tuning flow allowed us to reduce the structural skew of the mesh Skew was reduced to < 30ps across the whole core region and across multiple process corners (from > 100ps before tuning) The removal of the majority of the cross-links also helped save power Power consumed by the distribution (including buffers in the vertical spine + horizontal ribs) was = 1.4W for a 16mm X17mm clock mesh area at 0.9V, 800Mhz Removal of the horizontal cross-links helped reduce mesh capacitance and thus clock power by 30% Skew Core clock area = 30ps = 16mm X 17mm * Example of skew plot over an 16mm X 17mm core region

CONCLUSION We have presented a 2 step tuning flow that can de-skew and tune a clock mesh containing several thousand buffers The fine-tuning flow enables 2 to 3 iterations to be completed within 24 hours. Structural skew of more than 100ps was reduced to less than 25ps Removal of horizontal Mz cross-links in the clock mesh helped reduce clock power Clock distribution + mesh consumed a total of 1.4W in a 100W chip The removal of most of the horizontal cross-links reduced mesh capacitance and power by ~30% This tuning flow was used in multiple chips across two technology generations

Thank You