1 Important Differences Between Consumer and Enterprise Flash Architectures
Robert Sykes, Director of Firmware
Flash Memory Summit 2013, Santa Clara, CA
OCZ Technology
2 Introduction
This presentation describes key firmware design differences between consumer- and enterprise-class SSDs. Topics covered include end-to-end data protection; bad block recovery techniques, such as Low-Density Parity Check (LDPC), for the latest NAND nodes; advances in Flash Translation Layer (FTL) techniques, with a focus on firmware requirements for sustained performance; host interface buffering techniques; and a review of RAID and the flash interface.
3 Contents
Consumer / Enterprise basics
End-to-End Data Protection
Data Corruption / Correction (ECC, BCH, LDPC, RAID)
Enterprise FTL architecture
7 Consumer vs Enterprise Basics
8 There have been several discussions on the forums over the past few months asking: what is the difference between consumer- and enterprise-grade SSDs?
A snapshot of how some of those discussions describe the differences:
The interface is the difference. PCIe / SAS / NVMe are enterprise; SATA is consumer.
Consumer: Cost / Capacity / Performance / Data Integrity
Enterprise: Data Integrity / Performance / Capacity / Cost
Enterprise has data redundancy
9 Consumer / Enterprise Basics
Enterprise has consistent performance
Enterprise drives have greater endurance
Enterprise drives have additional raw capacity
Enterprise drives require custom applications for specific needs
Client / consumer drives contain only one or two SSDs; enterprise could comprise several SSDs
10 Consumer / Enterprise Basics
Turning to specifications for the answer: JEDEC helps to define the difference by specifying workload differences.
Data Usage Models
JEDEC JESD218 and JESD219 define client and enterprise usage models.
Enterprise products will be used in heavy workload environments and are expected to perform.
» JESD218A defines this as 24 hrs / day at 55°C with 3 months retention at 40°C
Client drives generally have a lighter load.
» JESD218A defines this as 8 hrs / day at 40°C with 1 year retention at 30°C
Note: Data retention is defined here as the ability of the SSD to retain data in a power-off state for a period of time.
11 Consumer / Enterprise Basics
Enterprise SSDs should not really be defined by the interface, but by the application and system requirements.
However, many enterprise systems are, not surprisingly, built around fast host interfaces such as PCIe / NVMe, as some enterprise systems require high-end performance (high IOPS) with sustained data rates.
Questions to ask when designing your system:
Is the data WORM-like (write once, read many)?
Is the data volatile (like swap data)?
Is the data redundant (backed up somewhere else)?
12 Consumer / Enterprise Basics
Issues with sustained data reads / writes:
During sustained activity in an enterprise system, the probability of a system-level issue increases. This could be anything from a host memory issue or a DRAM failure to an uncorrectable flash page.
The following methods can be used to prevent system issues:
End-to-End Data Protection
Read Retry
LDPC
RAID
Data Stirring
Temperature Throttling
Wear Leveling
13 End-to-End Data Protection
14 End-to-End Data Protection
Why do we need it?
Protect against silent errors:
» OS issues, including device driver problems
» Any issues within the HW or FW of the storage controller (e.g., bus issues, firmware writing the wrong data)
[Figure: data path Application -> OS -> HBA -> Storage Array -> NAND. Metadata travels with the data, and a metadata validity check is performed on each data transfer.]
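The "metadata validity check on each data transfer" idea can be sketched as follows. This is a minimal illustration, not any particular drive's implementation: the field names `guard` and `ref_tag`, the 512-byte sector size, and the use of CRC-32 from `zlib` as a stand-in checksum are all assumptions.

```python
import zlib
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumed sector size in bytes


@dataclass(frozen=True)
class ProtectedSector:
    data: bytes
    guard: int    # checksum over the data (hypothetical field name)
    ref_tag: int  # logical block address the data was written for


def protect(lba: int, data: bytes) -> ProtectedSector:
    """Attach metadata at the top of the stack, before the data moves."""
    return ProtectedSector(data=data, guard=zlib.crc32(data), ref_tag=lba)


def check(sector: ProtectedSector, expected_lba: int) -> None:
    """Validity check intended to run at every hop on the data path."""
    if zlib.crc32(sector.data) != sector.guard:
        raise ValueError("guard mismatch: data corrupted in flight")
    if sector.ref_tag != expected_lba:
        raise ValueError("reference tag mismatch: data belongs to another LBA")


sector = protect(42, b"\xAA" * SECTOR_SIZE)
check(sector, 42)  # passes silently when data and address are intact
```

The checksum catches corrupted payloads, while the address tag catches the "firmware writing the wrong data" case, where intact data ends up at the wrong location.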
15 Data Corruption / Correction
16 Data Corruption / Correction
What are the possible causes?
NAND flash has an inherent weakness (quantum mechanics!): as the NAND process shrinks, so does the effectiveness of the oxide layer, which in turn increases the probability that a 0 becomes a 1, or vice versa, when it shouldn't.
[Figure: floating-gate cell, with labels Electron, Source, Floating Gate, Oxide Layer, Drain]
18 Data Corruption / Correction
How is ECC calculated?
The spare area is used for bad block markers (mandatory), ECC and user-specific data.
[Figure: NAND block with X pages per block, each page split into a Data Area and a Spare Area]
19 Data Corruption / Correction
The amount of spare area dictates the amount of ECC that can be applied. For example:
25nm MLC may have 450 bytes of spare area, allowing 24-bit ECC per 1K of data
20nm MLC may have 750 bytes of spare area, allowing 40-bit or more ECC per 1K of data
Note: The NAND vendors specify the level of ECC expected to meet the specified endurance of the drive.
Consumer drives could use BCH to achieve this level of ECC.
BCH (Bose, Chaudhuri, Hocquenghem; invented 1959)
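The spare-area figures above can be sanity-checked with the standard bound for binary BCH codes: a code over GF(2^m) that corrects t bits needs at most m * t parity bits. A minimal sketch, assuming an 8K page split into 1K ECC chunks (the page size is an assumption; the slides do not state it):

```python
import math


def bch_parity_bytes(data_bytes: int, t: int) -> int:
    """Upper bound on parity bytes for a binary BCH code correcting t bits.

    A binary BCH code over GF(2^m) is at most 2^m - 1 bits long and needs
    at most m * t parity bits.
    """
    data_bits = data_bytes * 8
    m = 1
    # Smallest field size whose code length fits the data plus its parity.
    while data_bits + m * t > (1 << m) - 1:
        m += 1
    return math.ceil(m * t / 8)


def spare_needed(page_bytes: int, t: int, chunk_bytes: int = 1024) -> int:
    """Parity bytes needed per page when each chunk is protected separately."""
    return (page_bytes // chunk_bytes) * bch_parity_bytes(chunk_bytes, t)


PAGE = 8 * 1024  # assumed page size
for t, spare in ((24, 450), (40, 750)):
    need = spare_needed(PAGE, t)
    print(f"{t}-bit ECC per 1K: {need} of {spare} spare bytes")
```

Under those assumptions, 24-bit ECC needs 336 of the 450 spare bytes and 40-bit ECC needs 560 of the 750 spare bytes, leaving the remainder for bad block markers and user-specific data.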
20 Data Corruption / Correction
Enterprise drives require greater endurance than consumer-grade drives, given the different requirements set out by the JEDEC standard; BCH may not be enough to meet these standards.
One possible method to achieve this is to use LDPC (Low-Density Parity Check):
LDPC can easily utilise soft information
LDPC can enhance endurance and retention
Requires new ratio(s) of user data to parity data
Its correction capability is compared using SNR / frame error rate graphs, not a simple N-bit correction figure
How does LDPC work?
21 Data Corruption / Correction
[Figure: two panels, "LDPC code word - satisfied" and "LDPC code word error d3"]
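The "satisfied" versus "error on d3" cases can be illustrated with a toy parity-check matrix. This is a sketch only: the matrix `H` and the code word are made up for illustration, and the single hard-decision bit-flipping step shown here is a much simpler stand-in for the soft-information decoding mentioned on the previous slide.

```python
# Toy parity-check matrix (hypothetical, chosen for illustration).
# Rows are parity checks; columns are code word bits d0..d5.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
]


def syndrome(word):
    """One result per check: 0 means satisfied, 1 means unsatisfied."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def bit_flip_once(word):
    """Flip the bit that takes part in the most unsatisfied checks."""
    s = syndrome(word)
    if not any(s):
        return list(word)  # every check satisfied, nothing to correct
    votes = [
        sum(s[r] for r in range(len(H)) if H[r][c])
        for c in range(len(word))
    ]
    suspect = votes.index(max(votes))
    fixed = list(word)
    fixed[suspect] ^= 1
    return fixed


good = [1, 1, 0, 0, 1, 0]   # satisfies every check in H
bad = list(good)
bad[3] ^= 1                 # inject an error on d3

print(syndrome(good))              # [0, 0, 0, 0]
print(syndrome(bad))               # [1, 0, 0, 1]
print(bit_flip_once(bad) == good)  # True
```

Only the two checks that include d3 fail, which is what lets the decoder single out d3 as the most likely bit in error.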
23 Data Corruption / Correction
What happens when BCH and LDPC are not enough to recover the data?
RAID could be used as a method to protect the data, but it comes at the cost of additional NAND to store parity and data copies.
25 RAID
RAID protection
As the NAND process shrinks, NAND becomes more susceptible to errors; sometimes ECC such as BCH and LDPC may not be enough.
HDD vendors use RAID to protect data across physical drives.
RAID 5 controller layout:
Disk 1  Disk 2  Disk 3  Disk 4
A1      A2      A3      AP
B1      B2      BP      B3
C1      CP      C2      C3
DP      D1      D2      D3
26 RAID
A similar approach can be taken with SSDs, with the SSD controller striping blocks across dies:
Die 1     Die 2     Die 3     Die 4
Block A1  Block A2  Block A3  Block AP
Block B1  Block B2  Block BP  Block B3
Block C1  Block CP  Block C2  Block C3
Block DP  Block D1  Block D2  Block D3
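The parity blocks (AP, BP, CP, DP) in a RAID 5 style stripe are typically the XOR of the data blocks in that stripe, so any single lost block can be rebuilt from the survivors. A minimal sketch, with tiny two-byte "blocks" used purely for illustration:

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block for one stripe (for example AP from A1, A2, A3)."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity):
    """Recover the one missing block of a stripe from the rest plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


a1, a2, a3 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
ap = make_parity([a1, a2, a3])

# Suppose the die holding A2 returns an uncorrectable block.
print(rebuild([a1, a3], ap) == a2)  # True
```

The cost is one block of parity per stripe, which is the "additional NAND" trade-off noted earlier.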
27 RAID
What is RAID protecting against?
The latest NAND nodes could be susceptible to mass data loss due to charge issues within the NAND.
This can result in the loss of a whole word line of data.
[Figure: data block marked as bad]
28 Enterprise FTL
29 Enterprise FTL
You can put a price on performance, but you can't buy low latency!
30 Enterprise FTL
Basic FTL System Architecture
[Figure: block diagram with labels Host, SATA, NCQ Command Queue, Event Queue, Transfer Buffer Manager, HIL, FTL (FTL / BBM / WL), FIL, DRAM, NAND]
31 Enterprise FTL
Enterprise drives may require sustained data rates that far exceed those of a general consumer-grade NAND solution.
As such, the system requirements are far more complex than those of a consumer-grade drive.
In the previous example, the system is defined for a SATA-only system.
When considering the sustained data rates of NVMe, a different approach is required.
32 Enterprise FTL
In addition to coping with fast hosts, enterprise solutions also require reduced latency.
Updating the FTL mapping takes time:
For a 16K page there would be one entry for each 16K
One entry is ~4B (i.e., for each 16K page written, 4B of metadata must be stored in DDR)
For a 256GB drive with 16K page NAND:
» (256G / 16K) * 4B = 64MB
For a 256GB drive with 8K page NAND:
» (256G / 8K) * 4B = 128MB
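The mapping-table arithmetic above can be reproduced in a few lines. A small sketch, assuming binary units (1G = 1024^3 bytes) and the 4-byte entry size given on the slide; the function name is illustrative:

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def map_table_bytes(capacity_bytes, page_bytes, entry_bytes=4):
    """One mapping entry per NAND page, entry_bytes of metadata each."""
    return (capacity_bytes // page_bytes) * entry_bytes


for page_kib in (16, 8):
    size = map_table_bytes(256 * GIB, page_kib * KIB)
    print(f"{page_kib}K pages: {size // MIB} MB")
```

Halving the page size doubles the number of entries, and therefore the amount of DDR needed to hold the table.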
33 Enterprise FTL
Writing many megabytes of mapping table data to the NAND on a regular basis could be a major contributing factor to latency issues in the system.
Instead of writing the complete table, latency can be reduced by splitting the table into smaller chunks of data and then creating a link between the multiple maps.
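One way to picture the split table is a two-level map: a small top-level directory links to fixed-size chunks, and only chunks that changed since the last save are written back. This is a hypothetical sketch, not the architecture from the slides; the class name, chunk size and the `write_chunk` callback are all assumptions.

```python
class ChunkedMap:
    """Logical-to-physical map split into fixed-size chunks (hypothetical layout)."""

    def __init__(self, total_pages, entries_per_chunk=1024):
        self.entries_per_chunk = entries_per_chunk
        n_chunks = -(-total_pages // entries_per_chunk)  # ceiling division
        self.chunks = [[None] * entries_per_chunk for _ in range(n_chunks)]
        # Top-level directory: where each chunk's saved copy lives in NAND.
        self.directory = [None] * n_chunks
        self.dirty = set()

    def update(self, logical_page, physical_page):
        chunk, offset = divmod(logical_page, self.entries_per_chunk)
        self.chunks[chunk][offset] = physical_page
        self.dirty.add(chunk)

    def lookup(self, logical_page):
        chunk, offset = divmod(logical_page, self.entries_per_chunk)
        return self.chunks[chunk][offset]

    def flush(self, write_chunk):
        """Persist only the dirty chunks.

        write_chunk is a caller-supplied function that stores one chunk
        and returns the location it was written to.
        """
        flushed = 0
        for chunk in sorted(self.dirty):
            self.directory[chunk] = write_chunk(chunk, self.chunks[chunk])
            flushed += 1
        self.dirty.clear()
        return flushed


ftl_map = ChunkedMap(total_pages=4096)
ftl_map.update(5, 100)
ftl_map.update(6, 101)
ftl_map.update(3000, 7)
written = ftl_map.flush(lambda index, entries: ("nand", index))
print(written)  # 2 chunks written instead of the whole table
```

Three updates touch only two chunks, so the flush writes two chunks rather than all four, which is the latency saving the slide describes.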
34 Enterprise FTL
Now that we have a new (very high-level) FTL architecture, we can address the performance bottlenecks.
35 Enterprise FTL
Enterprise FTL System Architecture
[Figure: block diagram with labels Host, PCIe Gen3 (4 lanes), HIL, Buffer Manager (x2), Multi CPU, FTL / BBM / WL, Soft LDPC, FIL, SRAM, DRAM, NAND]
36 Enterprise FTL
Other factors to consider in an enterprise system:
Higher OP (over-provisioning) levels may be required to achieve good dirty performance
Inclusion of super caps or batteries to protect against power loss
In-field debugging: continuous logging of firmware behaviour, accessible remotely, for a speedy, unobtrusive solution