DDR subsystem: Enhancing System Reliability and Yield

Similar documents

White Paper Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces

Tuning DDR4 for Power and Performance. Mike Micheletti Product Manager Teledyne LeCroy

Calibration Techniques for High- Bandwidth Source-Synchronous Interfaces

Tuning DDR4 for Power and Performance. Mike Micheletti Product Manager Teledyne LeCroy

Managing High-Speed Clocks

Computer Architecture

A New Chapter for System Designs Using NAND Flash Memory

Dual DIMM DDR2 and DDR3 SDRAM Interface Design Guidelines

Table 1: Address Table

Technical Note DDR2 Offers New Features and Functionality

Features. DDR SODIMM Product Datasheet. Rev. 1.0 Oct. 2011

SOLVING HIGH-SPEED MEMORY INTERFACE CHALLENGES WITH LOW-COST FPGAS

Technical Note. Initialization Sequence for DDR SDRAM. Introduction. Initializing DDR SDRAM

DDR2 Specific SDRAM Functions

Memory Module Specifications KVR667D2D4F5/4G. 4GB 512M x 72-Bit PC CL5 ECC 240-Pin FBDIMM DESCRIPTION SPECIFICATIONS

Table 1 SDR to DDR Quick Reference

User s Manual HOW TO USE DDR SDRAM

LatticeECP3 High-Speed I/O Interface

ADQYF1A08. DDR2-1066G(CL6) 240-Pin O.C. U-DIMM 1GB (128M x 64-bits)

Sentinel-SSO: Full DDR-Bank Power and Signal Integrity. Design Automation Conference 2014

Testing Low Power Designs with Power-Aware Test Manage Manufacturing Test Power Issues with DFTMAX and TetraMAX

Features. DDR3 Unbuffered DIMM Spec Sheet

1.55V DDR2 SDRAM FBDIMM

Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI. Article for InfoStor November 2003 Paul Griffith Adaptec, Inc.

DDR3 SDRAM UDIMM MT8JTF12864A 1GB MT8JTF25664A 2GB

AMD Opteron Quad-Core

GR2DR4B-EXXX/YYY/LP 1GB & 2GB DDR2 REGISTERED DIMMs (LOW PROFILE)

Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

Highlights of the High- Bandwidth Memory (HBM) Standard

DDR Memory Overview, Development Cycle, and Challenges

Application Note for General PCB Design Guidelines for Mobile DRAM

DDR3(L) 4GB / 8GB UDIMM

are un-buffered 200-Pin Double Data Rate (DDR) Synchronous DRAM Small Outline Dual In-Line Memory Module (SO-DIMM). All devices

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

Features. DDR3 SODIMM Product Specification. Rev. 1.7 Feb. 2016

Lecture 7: Clocking of VLSI Systems

NAND Flash FAQ. Eureka Technology. apn5_87. NAND Flash FAQ

Hardware and Layout Design Considerations for DDR4 SDRAM Memory Interfaces

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

DDR SDRAM UDIMM MT16VDDT6464A 512MB MT16VDDT12864A 1GB MT16VDDT25664A 2GB

Memory Architecture and Management in a NoC Platform

Intel 965 Express Chipset Family Memory Technology and Configuration Guide

Chapter 11 I/O Management and Disk Scheduling

DDR SDRAM SODIMM. MT8VDDT3264H 256MB 1 MT8VDDT6464H 512MB For component data sheets, refer to Micron s Web site:

- Nishad Nerurkar. - Aniket Mhatre

1. Memory technology & Hierarchy

Alpha CPU and Clock Design Evolution

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

Technical Note. Micron NAND Flash Controller via Xilinx Spartan -3 FPGA. Overview. TN-29-06: NAND Flash Controller on Spartan-3 Overview

OpenSPARC T1 Processor

DDR4 Memory Technology on HP Z Workstations

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers

D1 D2 D3 D4 D5 D6 D7. Benefits: Stable system operation No trace length matching Low PCB cost

Addressing the DDR3 design challenges using Cadence DDR3 Design-In Kit

Power Reduction Techniques in the SoC Clock Network. Clock Power

Semiconductor Device Technology for Implementing System Solutions: Memory Modules

Computer Organization & Architecture Lecture #19

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

Asynchronous Bypass Channels

DDR3 memory technology

Memory unit. 2 k words. n bits per word

Selecting the Optimum PCI Express Clock Source

Q & A From Hitachi Data Systems WebTech Presentation:

PROGETTO DI SISTEMI ELETTRONICI DIGITALI. Digital Systems Design. Digital Circuits Advanced Topics

Achieving High Performance DDR3 Data Rates

Fairchild Solutions for 133MHz Buffered Memory Modules

Ring Local Area Network. Ring LANs

Source-Synchronous Serialization and Deserialization (up to 1050 Mb/s) Author: NIck Sawyer

Low Power AMD Athlon 64 and AMD Opteron Processors

DDR2 SDRAM SODIMM MT8HTF3264HD 256MB MT8HTF6464HD 512MB MT8HTF12864HD 1GB For component data sheets, refer to Micron s Web site:

Computer Architecture

LTE BACKHAUL REQUIREMENTS: A REALITY CHECK

Qsys and IP Core Integration

PCI Express: The Evolution to 8.0 GT/s. Navraj Nandra, Director of Marketing Mixed-Signal and Analog IP, Synopsys

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE

Memory Module Specifications KVR667D2D8F5/2GI. 2GB 256M x 72-Bit PC CL5 ECC 240-Pin FBDIMM DESCRIPTION SPECIFICATIONS

RAM & ROM Based Digital Design. ECE 152A Winter 2012

Hunting Asynchronous CDC Violations in the Wild

Technical Note FBDIMM Channel Utilization (Bandwidth and Power)

DDR SDRAM SODIMM. MT9VDDT1672H 128MB 1 MT9VDDT3272H 256MB MT9VDDT6472H 512MB For component data sheets, refer to Micron s Web site:

The Bus (PCI and PCI-Express)

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

Motivation: Smartphone Market

Figure 1 FPGA Growth and Usage Trends

PCI Express Overview. And, by the way, they need to do it in less time.

Lizy Kurian John Electrical and Computer Engineering Department, The University of Texas as Austin

Low latency synchronization through speculation

Latch Timing Parameters. Flip-flop Timing Parameters. Typical Clock System. Clocking Overhead

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

White Paper David Hibler Jr Platform Solutions Engineer Intel Corporation. Considerations for designing an Embedded IA System with DDR3 ECC SO-DIMMs

11. High-Speed Differential Interfaces in Cyclone II Devices

Slide Set 8. for ENCM 369 Winter 2015 Lecture Section 01. Steve Norman, PhD, PEng

Advantages of e-mmc 4.4 based Embedded Memory Architectures

Interconnection Networks

Sage ERP Accpac Online

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology

Sage 300 ERP Online. Mac Resource Guide. (Formerly Sage ERP Accpac Online) Updated June 1, Page 1

Multiple clock domains

Transcription:

DDR subsystem: Enhancing System Reliability and Yield

Agenda Evolution of DDR SDRAM standards What is the variation problem? How DRAM standards tackle system variability What problems have been adequately addressed and what haven t? Uniquify s approach to tackling the variability problem. What does this mean for system reliability 2

Evolution of JEDEC Standards 2000 2003 2007 2012 DDR1 DDR2 Source synchronous clocking (DQS) Differential DQS (for signal integrity) introduced On-die termination Data rate doubled compared to 533MHz top speed SDRAM by transferring data on both 512Mb 2Gb edges of DQS 4 or 8 bank devices 200MHz top speed 256Mb 512Mb 2 or 4 bank devices 3 LPDDR1 Similar to DDR1 without DLL Relaxed DQS vs clock timing on read; round-trip timing more difficult to predict Per-bank self-refresh 200MHz top speed DDR3 Write leveling Bit leveling Fly-by routing topology on the board 1066MHz top speed 1Gb 4Gb 8 bank standard DDR4 More data training patterns CRC DBI 1600MHz top speed LPDDR2 Many power-saving enhancements Low power modes Multiple refresh schemes including temperature controlled refresh 400MHz top speed

The Variation Problem DDRx Controller / PHY / I/O? SoC Package? Die Output loading uncertainty? Board trace uncertainty DDRx SDRAM? DDR memory timing uncertainty PCB 4

How can we address variation? OPTION 1 Recover the clock from the incoming data stream But what are the disadvantages? Need to encode data Training requirements to detect data eye IO becomes large and complex Overhead for packetizing data and synchronizing data from parallel bit streams. Added latency IO size and complexity, overhead and latency makes it ill-suited for main memory application. 5

How can we address variation? OPTION 2 Use clock forwarding: Send the clock to be used to capture data with the data. Source synchronous clocking is the strategy used by all generations of DDR SDRAM. Very effective in addressing variation all path delays on data are also seen on clock (called data strobe). If the system design is done carefully by routing data and associated clock together, we can ensure all variations on the data path are also seen on the strobe path Both static (process varation, package and board configuration induced) and dynamic variations (caused by system operating voltage and temperature fluctuations) can be minimized by design in this way. 6

Timing Margins are Shrinking DDR Type Clock (MHz) / Data Speed (Mbps) Timing Window (ps) DRAM Margin (ps) Package / Board Margin (ps) On-Chip Margin (ps) DDR1 200 / 400 2,500 900 800 800 DDR2 533 / 1066 938 425 256 256 DDR3 1066 / 2133 469 188 140 140 DDR4 1600 / 3200 312 125 93 93 DDR1 DDR2 DDR3 DDR4 7

Uniquify s approach to meet the variation challenge Ensure that all the data and associated strobe paths in the SoC are carefully balanced so they track each other effectively as voltage and temperature variations occur over time Board and package design should also support the effort by routing data and strobe as a group to carefully match delays. For in-system training, DDR3 and DDR4 provide pre-defined registers that can be used to read out fixed training patterns. These can be used to further train the data eye in-system and obtain additional margin. 8

Uniquify s approach to meet the variation challenge Read side training values can also be applied on the write side to avoid the need for separate write training. This is possible since data and strobe are bi-directional signals on the package and board and any skews seen on the read path will also be seen on the write path. But this does require that IC design is done such that it does not introduce different skews on write and read paths. 9

The Clock domain crossing problem Even though we capture the data successfully using the source forwarded clock, the data still needs to be transferred to the destination clock domain. This is far from simple, and has always been in our experience the most under-estimated, yet most deadly of all problems. Each generation of DDR standards tries to find new solutions, yet many people found that their problems seemed to stay one step ahead! 10

The write side The write side is relatively simple. All generation of DDR standards force the SoC to observe a timing relationship between strobe and DRAM clock. DRAM clock Write DQS This allows the DRAM to transfer data from its strobe clock domain to its system clock domain fairly easily DRAM clock Write DQS 11

DDR2: Board routing using star topology For DDR2, SoC and board designers had no choice but to match clock path and strobe path delays inside the SoC and on the board. On the SoC, this had to be done by matching insertion delay on the clock output with insertion delay on the DQ & DQS output of every byte lane not trivial considering that the DDR interface might span a very wide area within the chip. On the board, clock traces to DRAMs for all byte lanes were matched using the star topology. But this placed limitations on board signal quality. DRAM 0 clock SOC clock output DRAM 1 clock 12 Reflections caused by the star topology DRAM 2 clock..

DDR3 / DDR4: Write leveling Starting with DDR3, SoC and board designers received a big break with the introduction of write leveling. Write leveling provides a mechanism to do automatic training in-system to satisfy strobe and clock relationship timing. So clock trace delays did not need to match across byte lanes. So SoC designers no longer had to match the clock insertion delay to match DQS output timing on every single byte lane in the SoC. 13

DDR3 / DDR4: Board routing using daisy chain topology And board designers could use a routing architecture for the clock that was more optimized for signal quality.. such as the daisy chain topology. SOC clock output DRAM 0 DRAM 1 DRAM 2... DRAM n Reflections are minimized 14

Write Leveling - restrictions Basic concept behind write leveling: Schmoo the write DQS and let the DRAM feedback information on when rising edge of DQS matches with rising edge of DRAM clock. DRAM clock Write DQS Write leveling works without any problems if prior to write leveling we can guarantee these 2 conditions: There is < 1 cycle of skew between write DQS and DRAM clock at every DRAM s pins & It can be guaranteed that write DQS delay is smaller than DRAM clock delay. This restriction is not a problem for embedded systems that usually have limited number of DRAM ranks and byte lanes. 15

Write Leveling Failure! But what happens if there is more than 1 cycle of skew between write DQS and DRAM clock? In this case the write DQS could get aligned with the wrong DRAM clock edge. DRAM clock Write DQS This could potentially happen for high speed multi-rank DIMM configurations with wide data buses DRAM protocol does not specify any means to automatically correct this. 16

Validate Write using Read To validate if write is working correctly, we need to be able to read and verify the written data. But this requires that reads should work correctly! So this is the time to segue into the last and most difficult aspect of DDR interface timing the read side clock domain crossing problem. How do we transfer the data captured using the DRAM forwarded read clock to the internal system clock? 17

What does the Read strobe depend on? Unlike the write side where the DRAM has a pre-defined specification to solve the clock domain crossing problem, there is no such specification to help us on the SoC side. First question we should ask is then what is the expected relationship between these 2 clocks? The DRAM memories do provide a timing specification on the relationship between the DRAM receive clock and the read strobe forwarded clock. However these timing specifications vary between different flavors of DRAM. This specification is particularly loose in the case of LPDDR1 and LPDDR2. 18

The round trip timing problem So the relationship between these 2 clocks on the SoC side is dependent on the total path delays in the system from the SoC system clock tree to the DRAM clock pin, the DRAM itself and then back from the DRAM read strobe pin to the SoC. So this is commonly called the round trip timing problem 19

Solving the round trip time problem. The most common solution to the round trip time problem is to use an asynchronous FIFO. The write pointer is incremented by Read data strobe received from DRAM. The read pointer is incremented by the system clock whenever it does not match the write pointer. DRAM read strobe (clocks FIFO write pointer) DRAM read data in System clock (clocks FIFO read pointer) DRAM read data out (synchronized to system clock) Circular FIFO 20

Does this solve the problem fully? The strobe signal is bi-directional since it is used during both reads and writes. So, before the read starts, it will be un-driven and perhaps noisy. So it needs to be gated (blocked) when it is not driving a valid signal and un-gated when it is driving a valid signal. Read strobe Read strobe gate Without the correct gate timing, the FIFO might clock in invalid data (if too early) or NOT clock in valid data (if too late) So, the round trip time problem has not really gone away. It s come back in a different form that of finding the correct gate timing 21

How bad is the round trip time problem, really? The preamble time is specified as 90% of the clock period in the JEDEC DRAM specifications. This translates to: 400 Mbps DDR1 4.5ns 1066 Mbps DDR2 1.7ns 2133 Mbps DDR3 844ps 3200 Mbps DDR4 560ps Pictorially: Preamble time 22

What s happening with process variation? But dynamic voltage and temperature dependent variations are getting worse as we go to smaller and smaller process nodes. The combination of increased variation with smaller timing margins is a double whammy for system reliability! 400 Mbps DDR1 @ 130nm node 1066 Mbps DDR2 @ 90nm node 1600 Mbps DDR3 @ 65nm node 2133 Mbps DDR3 @ 40nm node 3200 Mbps DDR4 @ 28nm node Shortest Round trip Longest Round trip DQS gating window 3.5ns 6ns 4.5ns 2ns 4ns 1.7ns 1ns 2.5ns 1.12ns 0.8ns 2ns 844ps 0.6ns 1.5ns 560ps 23

From yield loss to system instability For DDR1 and low speed DDR2, a simple fixed register setting could be used to fix the gate timing. As DDR2 clock speeds reached 400Mhz and 533Mhz, gate timing caused lots of problems needing tweaking and adjustment of timing registers by system engineers. With DDR3 800Mhz being mainstream now, many system engineers are struggling with system instability. 24

System instability and failure! At some point, because of the random voltage and temperature variations, the static parameters no longer hold good! Static Parameters System Failure t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t n 25

ATE testing challenge And even if the system engineers succeed in getting reliable operation on the bench, getting to production is an even bigger challenge. A robust ATE test methodology to ensure DDR sub-system reliability is needed so that test escapes do not translate to system yield loss! 26

Uniquify s solution: SCL There have been various attempts by the DRAM community to address the round trip timing problem (such as DFI 2.1 gate training), but Uniquify came up with a different approach. We found a way to solve the round trip time problem in it s entirety! So it makes no difference if the round trip time is ½ the cycle period or 4 times the cycle period! Our solution: A mechanism to exactly measure the round trip time in-system on power up! We call this patented technology SCL (for Self Calibrating Logic). 27

What does SCL do? SCL provides exact information regarding: (A) The phase relationship between incoming read strobe and system clock. (B) Read strobe latency in system clock cycles. SCL programs the registers that control these settings automatically with no user intervention. As a result, the DDR subsystem is truly and completely Self Calibrating. 28

SCL s capability Precision & Range SCL is capable of measuring and compensating many cycles of round trip time variation with a fine precision. It has more than enough range for even the smallest process nodes and highest clock speeds. This entire range of variation can be covered with a fine granularity about 40ps in TSMC 40G node. DRAM clock Fastest round trip time Slowest round trip time 29 Wide range of variation covered with fine precision

SCL s capability Read Latency Further SCL guarantees the BEST possible system read latency. It ensures that the read data is delivered to the system at the earliest possible cycle after the read strobe has been received on all byte lanes. Lowest read Latency Faster than the FIFO based approach since it eliminates pointer synchronization latencies! 30

SCL s capability Multi-cycle byte lane skews Every byte lane is tuned by SCL independent of others, so SCL accommodates a wide range of variation in round trip time across byte lanes. No need for SoC designers to worry about matching data strobe output delay across multiple byte lanes. No need for system designers to worry about delay matching across different byte lanes. SCL is even more powerful in combination with write leveling, since reads can be used to check writes! If SCL fails after write leveling, re-run write leveling to align the write strobe to the next DRAM clock edge and try again! 31

Traditional vs. SCL SCL sets up the timing parameters to the best values at power on: Static Parameters SCL Calibration System Failure t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t n 32

Introducing Dynamic SCL (DSCL) Over the last 3 years, Uniquify has further enhanced SCL, so that SCL is now also optimized to be run dynamically in the middle of system operation. Dynamic SCL (DSCL) needs to be very optimal in minimizing impact on system performance by reducing the time interval of interruption. DSCL can be run in just 300ns at DDR3 2133Mbps. This is comparable to a refresh interval for higher density DDR3 and DDR4 memories. Needs to be run once a second or at most a few times a second to optimize the read timing for temperature driven variations. Long term voltage variations are usually also caused by temperature since temperature affects the leakage current. So DSCL compensates for this effect as well. 33

Traditional vs. SCL / DSCL DSCL delivers the highest system reliability Static Parameters SCL Calibration DSCL Calibration System Failure t 0 t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 t n 34

DSCL is a quantum leap forward We have seen by real world experience and customer feedback, that DSCL is a quantum leap forward in providing excellent system reliability. It s a comprehensive solution that addresses all possible sources of DDR interface integrity problems on both writes and reads. It addresses the unknown static variations due to the SoC, package, board, manufacturer of the DRAM or the type of DRAM AND the dynamic variations in system operating conditions. 35

SCL / DSCL additional benefits Simplified ATE testing. More and more SoC manufacturers are finding that the best way to test DDR memory subsystem on the ATE is to add real DRAM memory on the load board. Relays can be used to isolate the DRAM during connectivity test or VOH/VOL test. With DRAM on the load board, DDR sub-system testing with DSCL is as simple as performing a sequence of registers writes and a single register readback to verify test pass or fail. Since DSCL does comprehensive interface training, if there is a solution, DSCL will find it and make the test pass! Ensures both high quality test coverage AND high yield 36

Customer Experience SCL/DSCL $60B leading edge consumer electronics company HDTV application Thermal issues caused field reliability problems Design based on standard (non-uniquify) PHY No SCL/DSCL Latest generation of application Adopted Uniquify s DDR memory subsystem with SCL/DSCL Problem Solved! we were surprised to find that the DDR interface is now much more robust than it was in the past. SCL/DSCL solved the problem. 37

Thank you 38