Reliability of Data Storage Systems




Reliability of Data Storage Systems
Ilias Iliadis, Zurich Research Laboratory
Keynote, NexComm 2015, April 2, 2015
www.zurich.ibm.com
© 2015 IBM Corporation

Long-term Storage of an Increasing Amount of Information

An increasing amount of information is required to be stored:
- Web services: email, photo sharing, web-site archives
- Fixed-content repositories: scientific data, libraries, movies, music
- Regulatory compliance and legal issues: the Sarbanes-Oxley Act of 2002 for financial services; the Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the healthcare industry

Information needs to be stored for long periods and be retrieved reliably.

Storage

Disk drives are widely used as a storage medium in many systems:
- personal computers (desktops, laptops)
- distributed file systems
- database systems
- high-end storage arrays
- archival systems
- mobile devices

Disks fail and need to be replaced:
- Mechanical errors: wear and tear eventually leads to failure of moving parts; the drive motor can spin irregularly or fail completely
- Electrical errors: a power spike or surge can damage in-drive circuits and hence lead to drive failure
- Transport errors: the transport connecting the drive and the host can also be problematic, causing interconnection problems

Data Losses in Storage Systems

Storage systems suffer from data losses due to component failures:
- disk failures
- node failures
- media failures
- unrecoverable and latent media errors

Reliability is enhanced by a large variety of redundancy and recovery schemes, notably RAID systems (Redundant Array of Independent Disks).

[Figure: RAID-5 array. A parity disk holds redundant information, the XOR of the data disks; after a disk failure, the lost contents are rebuilt onto a spare disk by XOR-ing the surviving disks.]

RAID-5: tolerates one disk failure.
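
The XOR parity rebuild sketched above can be illustrated in a few lines of Python (a toy stripe, not any particular RAID implementation):

```python
from functools import reduce

def parity(blocks):
    """XOR all blocks together, byte by byte, to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

# Toy RAID-5-like stripe: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
p = parity(data)

# Simulate losing one data block and rebuilding it from the survivors:
# XOR-ing the remaining data blocks with the parity recovers the lost one.
lost = data[1]
rebuilt = parity([data[0], data[2], p])
assert rebuilt == lost
```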

Data Losses in Storage Systems (cont.)

[Figure: the same RAID-5 array, now with a second disk failing during the rebuild; the XOR parity can no longer reconstruct the missing contents, and data is lost.]

RAID-5 tolerates one disk failure; RAID-6 tolerates two disk failures.

Time to Failure and MTTDL

[Figure: number of failed disks over time for a RAID-5 array; each disk failure is followed by a repair, and a second failure before the repair completes leads to data loss.]

Reliability metric: MTTDL (Mean Time to Data Loss), obtained from continuous-time Markov chain models, with disk failure rate λ = 1/MTTF and repair rate μ = 1/MTTR.

RAID-5 (a disk failure during the repair time causes data loss):
  MTTDL ≈ μ / (N(N-1)λ²)   [Patterson et al. 1988]

RAID-6:
  MTTDL ≈ 2μ² / (N(N-1)(N-2)λ³)   [Chen et al. 1994]

These are the original MTTDL equations.
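
As a quick sanity check, the two closed forms can be evaluated directly (the MTTF and MTTR values below are illustrative assumptions, not the deck's exact parameters):

```python
def mttdl_raid5(n, mttf, mttr):
    """Original RAID-5 MTTDL approximation [Patterson et al. 1988], in hours."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    return mu / (n * (n - 1) * lam**2)

def mttdl_raid6(n, mttf, mttr):
    """Original RAID-6 MTTDL approximation [Chen et al. 1994], in hours."""
    lam, mu = 1.0 / mttf, 1.0 / mttr
    return 2 * mu**2 / (n * (n - 1) * (n - 2) * lam**3)

# Illustrative parameters: 8-disk array, MTTF = 500,000 h, MTTR = 17.8 h.
hours_per_year = 24 * 365
r5 = mttdl_raid5(8, 500_000, 17.8) / hours_per_year
r6 = mttdl_raid6(8, 500_000, 17.8) / hours_per_year
print(f"RAID-5 MTTDL ~ {r5:.3g} years, RAID-6 MTTDL ~ {r6:.3g} years")
```

The extra factor of μ/λ makes the RAID-6 MTTDL several orders of magnitude larger, which is exactly why the later slides focus on unrecoverable errors as the limiting factor.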

Markov Models for Unrecoverable Errors

Parameters:
- C_d: disk capacity (in sectors)
- P_s: P(unrecoverable sector error)
- h: P(unrecoverable failure during rebuild in critical mode)
- q: P(unrecoverable failure during RAID-6 rebuild in degraded mode)

Reliability metric: MTTDL (Mean Time To Data Loss for the array). Data loss now occurs owing to either DF (disk failure) or UF (unrecoverable failure).

RAID-5: from the degraded state, the rebuild either completes, suffers an unrecoverable failure (probability h), or is hit by a second disk failure. This yields
  MTTDL = 1 / ( Nλ [ (N-1)λ/μ + h ] ),  with  h ≈ (N-1) C_d P_s,
the probability that at least one of the (N-1)·C_d sectors read from the surviving disks during a critical-mode rebuild is unrecoverable.

RAID-6: in critical mode (two disks failed), h ≈ (N-2) C_d P_s + O(P_s²); in degraded mode (one disk failed), an unrecoverable failure requires coincident sector errors, so q = O(P_s²) and hence q ≪ h for P_s ≪ 1.

MTTDL for RAID-5 and RAID-6

Assumptions:
- U_D: 1 PB = 10^15 bytes user-data base
- C_d: 300 GB SATA disk-drive capacity
- N: 8 disks per array group for RAID-5; 16 disks per array group for RAID-6
- N_total: 38,096 disks in the installed base: 4,762 arrays for RAID-5, 2,381 arrays for RAID-6
- MTTF_d: 500,000 hours for a SATA disk
- MTTR_d: 17.8 hours expected repair time
- P_b: P(unrecoverable bit error) = 10^-14 for SATA
- P_s = 4096 × 10^-14 ≈ 4.1 × 10^-11 (512-byte, i.e., 4096-bit, sectors)

h: P(unrecoverable failure during rebuild in the critical mode); q: P(unrecoverable failure during RAID-6 rebuild in the degraded mode); q ≪ h for P_s ≪ 1.

[Figure: Markov chains for RAID-5 and RAID-6 with absorbing DF (disk failure) and UF (unrecoverable failure) states, and the resulting MTTDL values for SATA drives.]
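
The dominant role of unrecoverable errors is easy to see by plugging these parameters into the enhanced RAID-5 model. A minimal sketch, assuming 512-byte sectors and the first-order approximation h ≈ (N-1)·C_d·P_s:

```python
def raid5_mttdl_unrec(n, c_d_sectors, p_s, mttf, mttr):
    """RAID-5 MTTDL including unrecoverable errors (first-order model).

    h approximates the probability that at least one of the (n-1)*C_d
    sectors read during a critical-mode rebuild is unrecoverable.
    Returns (MTTDL in hours, h).
    """
    lam, mu = 1.0 / mttf, 1.0 / mttr
    h = min(1.0, (n - 1) * c_d_sectors * p_s)
    return 1.0 / (n * lam * ((n - 1) * lam / mu + h)), h

p_s = 4096 * 1e-14              # 512-byte (4096-bit) sectors, P_b = 1e-14
c_d = int(300e9 // 512)         # 300 GB drive capacity, in sectors
mttdl, h = raid5_mttdl_unrec(8, c_d, p_s, 500_000, 17.8)
print(f"h = {h:.3f}; per-array MTTDL = {mttdl / (24 * 365):.1f} years")
```

With these numbers h is on the order of 0.1, i.e., the unrecoverable-failure term dwarfs the second-disk-failure term (N-1)λ/μ, pulling the per-array MTTDL down to mere decades.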

Intra-Disk Redundancy (IDR) Scheme

Design concept:
- For every n data sectors, m parity sectors are assigned
- The redundant sectors are placed on the same disk drive as the data
- The m parity sectors protect against unrecoverable media errors of any m sectors in a group of n sectors
- Intra-disk redundancy segment: l = n + m sectors (n data sectors + m redundant sectors)
- Storage efficiency is n/(n+m)

[Figure: an unrecoverable error encountered during a RAID rebuild leads to data loss; with IDR, the error is corrected within the disk before the RAID-level XOR rebuild needs it.]

By choosing proper values of n and m, storage efficiency, performance, and reliability can be optimized.

Dholakia et al., "A New Intra-disk Redundancy Scheme for High-Reliability RAID Storage Systems in the Presence of Unrecoverable Errors," ACM Trans. Storage, 2008.
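
The efficiency/protection trade-off can be sketched numerically. A toy calculation assuming independent sector errors; the segment sizes and the exaggerated error probability in the second example are illustrative assumptions, not values from the talk:

```python
from math import comb

def idr_segment_failure_prob(n, m, p_s):
    """P(more than m unrecoverable sectors in an IDR segment of l = n+m
    sectors), assuming independent sector errors; more than m errors
    exceeds what the m parity sectors can correct."""
    l = n + m
    p_ok = sum(comb(l, k) * p_s**k * (1 - p_s)**(l - k) for k in range(m + 1))
    return 1.0 - p_ok

# Storage efficiency for an illustrative segment: 120 data + 8 parity sectors.
n, m = 120, 8
print("efficiency:", n / (n + m))

# Exaggerated per-sector error probability so the result is visible in
# double-precision floats (realistic P_s values underflow the subtraction).
print("toy segment failure prob:", idr_segment_failure_prob(6, 2, 0.01))
```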

Interleaved Parity-Check (IPC) Coding Scheme

Advantages:
- Easy to implement, using the existing XOR engine
- Flexible design parameters: segment size, efficiency

Disadvantage:
- Not all erasure patterns can be corrected

[Figure: conceptual encoding and physical layout. The n data sectors of a segment are arranged in m columns (interleaves), with a single parity sector per interleave.]

- Can tolerate only one error per column (interleave)
- Can correct a single burst of m consecutive sector errors

MTTDL for Independent Unrecoverable Sector Errors

[Figure: MTTDL results for SATA drives under independent unrecoverable sector errors.]

MTTDL for Correlated Unrecoverable Sector Errors

[Figure: MTTDL results for SATA drives under correlated unrecoverable sector errors.]

Disk Scrubbing

Periodically accesses disk drives to detect unrecoverable errors.
- T_s: scrubbing period, the time required for a complete check of all sectors of a disk
- Identifies unrecoverable errors at an early stage
- Corrects the unrecoverable errors using the RAID capability
- Increases the workload because of the additional read operations

Sector-write operations result in unrecoverable errors:
- P_w = P(sector-write operation results in an error)
- Causes: transition noise (media noise), high-fly writes, off-track writes; the contribution of thermal asperities and particle contamination is ignored
- Disk-unrecoverable sector errors are created by write operations and remain latent until read or successfully overwritten

Workload:
- h: load of a given data sector, the rate at which the sector is read/written (e.g., h = 0.1/day means 10% of the disk is read/written per day)
- r_w: ratio of write operations to read+write operations, typically 2/3

Modeling Approach

1. Derive P_s = P(sector error | scrubbing is used) = f(T_s, P_w, h, r_w)
2. Evaluate MTTDL = f(P_s)

[Figure: MTTDL for SATA drives with scrubbing vs. without scrubbing.]

Analytical Results: Probability of Unrecoverable Sector Error

P_s: P(unrecoverable error on a tagged sector at an arbitrary time).

Without scrubbing:
  P_s → P_e = r_w · P_w
P_e depends on the ratio r_w of write to read+write operations, but not on the workload h (for SATA drives, P_w = 4.096 × 10^-11).

Deterministic scrubbing scheme:
  P_s = [1 - (1 - e^(-h·T_s)) / (h·T_s)] · P_e

Random scrubbing scheme:
  P_s = [h·T_s / (1 + h·T_s)] · P_e

P_s(deterministic) < P_s(random); for h·T_s ≪ 1, P_s(deterministic) ≈ ½ P_s(random).

Iliadis et al., "Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems," ACM Trans. Storage, 2011.
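
The deterministic vs. random comparison can be checked numerically. A sketch assuming the forms P_s,det = [1 - (1 - e^(-hT_s))/(hT_s)]·P_e and P_s,rand = [hT_s/(1 + hT_s)]·P_e, with illustrative parameter values:

```python
import math

def p_s_deterministic(p_e, h, t_s):
    """P(unrecoverable sector error) under deterministic scrubbing."""
    x = h * t_s
    return (1 - (1 - math.exp(-x)) / x) * p_e

def p_s_random(p_e, h, t_s):
    """P(unrecoverable sector error) under random scrubbing."""
    x = h * t_s
    return x / (1 + x) * p_e

p_e, h = 4.096e-11, 0.1                   # illustrative: h = 0.1/day load
for t_s in (0.1, 1.0, 10.0, 1000.0):      # scrubbing period in days
    d, r = p_s_deterministic(p_e, h, t_s), p_s_random(p_e, h, t_s)
    print(f"T_s={t_s:7.1f} d  det={d:.3e}  rand={r:.3e}  ratio={d/r:.3f}")
# Deterministic scrubbing always wins; for h*T_s << 1 the ratio tends to 1/2.
```

Both expressions vanish as T_s → 0 (scrubbing constantly) and approach P_e as T_s → ∞ (no scrubbing), matching the no-scrubbing limit above.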

Reliability Results for RAID-5 and RAID-6 Systems

SATA disk drives: C_d = 300 GB, MTTF_d = 500,000 h, MTTR = 17.8 h, N = 8 (RAID-5), N = 16 (RAID-6). MTTDL for an installed base of systems storing 1 PB of user data.

[Figure: MTTDL of RAID-5 and RAID-6 without protection, with scrubbing, and with IDR.]

- The IDR scheme improves MTTDL by more than two orders of magnitude, which practically eliminates the negative impact of unrecoverable sector errors.
- The scrubbing mechanism may not be able to reduce the number of unrecoverable sector errors sufficiently to reach the desired level of reliability.

Enhanced MTTDL Equations for RAID Systems

Latent or unrecoverable errors: P_s = P(sector error).

[Figure: RAID-5 Markov chain. From the fully operational state, a disk fails with rate Nλ; from the degraded state, the rebuild completes with rate (1-h)μ, an unrecoverable failure occurs with probability h (state UF), or a second disk fails (state DF). Data loss owing to DF (disk failure) or UF (unrecoverable failure).]

Disk scrubbing periodically accesses disk drives to detect unrecoverable errors:
- Identifies unrecoverable errors at an early stage
- Corrects the unrecoverable errors using the RAID capability
- P_s(equivalent) = P(sector error | scrubbing is used)

Distributed Storage Systems

Markov models assume that times to disk failures and rebuild durations are exponentially distributed (-). Still, MTTDL has been proven to be a useful metric (+) for estimating the effect of the various parameters on system reliability, and for comparing schemes and assessing tradeoffs.

Non-Markov-based analysis:
- V. Venkatesan et al., "Reliability of Clustered vs. Declustered Replica Placement in Data Storage Systems," MASCOTS 2010
- V. Venkatesan et al., "A General Reliability Model for Data Storage Systems," QEST 2012
- General non-exponential failure and rebuild time distributions
- MTTDL is insensitive to the failure time distributions; it depends only on the mean value

[Figure: clustered placement replicates a node's data on one other node, like a RAID array; declustered placement spreads the replicas of D_1, ..., D_k across the other nodes, enabling a distributed rebuild onto spare space that reduces the vulnerability window.]

Time To Data Loss vs. Amount of Data Lost

MTTDL measures the time to data loss, but gives no indication of the amount of data lost. Consider the following example: replicas of the data D_1, D_2, ..., D_k are placed either on the same node (clustered placement) or on different nodes (declustered placement).

[Figure: clustered vs. declustered placement of the replicated data across nodes n_1, n_2, n_3.]

We should distinguish between data loss events involving high amounts of data lost and those involving low amounts, hence the need for a measure that quantifies the amount of data lost.

Expected Annual Fraction of Data Loss (EAFDL)

[Figure: system timeline showing normal operation, a failure with XOR rebuild onto a spare disk, the time to data loss (MTTDL), and the amount of data lost in the loss event.]

Amazon: the Reduced Redundancy Storage option within Amazon S3 is designed to provide 99.999999999% durability of objects over a given year, i.e., an average annual expected loss of a fraction 10^-11 of the data stored in the system. Data loss events are documented in practice by Yahoo!, LinkedIn, and Facebook.

Goal: assess the implications of system design choices on both the frequency of data loss events (MTTDL) and the amount of data lost.

Expected annual fraction of data loss (EAFDL): the fraction of stored data that is expected to be lost by the system annually.
- The EAFDL metric is meant to complement, not to replace, MTTDL
- The two metrics together provide a useful profile of the magnitude and frequency of data losses
- For storage systems with similar EAFDL, the one with the maximum MTTDL is most preferable

Previous Work on Storage Reliability

MTTDL-based theory and analysis:
- Markov models: the original RAID-5 and RAID-6 MTTDL equations; enhanced MTTDL equations covering latent or unrecoverable errors and scrubbing operations
- Non-Markov-based models: general non-exponential failure and rebuild time distributions; placement schemes; network bandwidth, latent errors, erasure codes
- Simulation: non-Markov-based MTTDL simulations

Other metrics:
- Normalized Magnitude of Data Loss (NOMDL)
- Fraction of Data Loss Per Year (FDLPY), equivalent to EAFDL

I. Iliadis and V. Venkatesan, "Expected Annual Fraction of Data Loss as a Metric for Data Storage Reliability," IEEE MASCOTS, September 2014.

Non-Markov Analysis for EAFDL and MTTDL

[Figure: exposure level over time for replication factor r = 3; device failures and repairs alternate, and data loss can occur when a device fails at the critical exposure level e = r - 1.]

- r: replication factor
- e: exposure level, the maximum number of copies that any data has lost
- T_i: cycles (fully operational periods followed by repair periods)
- P_DL: probability of data loss during a repair period
- U: amount of user data in the system
- Q: amount of data lost upon a first-device failure

EAFDL is evaluated in parallel with MTTDL; the MTTDL and EAFDL equations are obtained using non-Markov analysis.
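
One way to read the cycle-based definitions: MTTDL ≈ E[T]/P_DL, and EAFDL ≈ P_DL·E[Q]/(E[T]·U) = E[Q]/(MTTDL·U). A sketch with made-up cycle statistics (all values below are illustrative assumptions):

```python
def mttdl_eafdl(e_t_hours, p_dl, e_q_tb, user_data_tb):
    """Reconstruct MTTDL and EAFDL from cycle statistics.

    A cycle is a fully operational period followed by a repair period.
    Data loss occurs during a repair period with probability p_dl, losing
    e_q_tb on average.  Returns (MTTDL in years, EAFDL per year).
    """
    hours_per_year = 24 * 365
    mttdl_years = e_t_hours / p_dl / hours_per_year
    eafdl = e_q_tb / (user_data_tb * mttdl_years)   # fraction lost per year
    return mttdl_years, eafdl

m, f = mttdl_eafdl(e_t_hours=8000, p_dl=1e-4, e_q_tb=0.5, user_data_tb=2000)
print(f"MTTDL ~ {m:.0f} years, EAFDL ~ {f:.2e} per year")
```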

Theoretical Results: Model Parameters

- n: number of storage devices (4 to 64)
- c: amount of data stored on each device (2 TB)
- r: replication factor (2, 3, 4)
- b: reserved rebuild bandwidth per device (96 MB/s)
- 1/λ: mean time to failure of a storage device (in hours)

Failure distributions: Weibull distributions with shape parameters greater than one (e.g., 2.5), i.e., failure rates that increase over time.

[Figure: symmetric clustered and declustered placement of the replicas of data D_1, ..., D_k over the devices.]

Reliability Results for Replication Factor of 2

[Figure: MTTDL vs. number of nodes for clustered and declustered placement, r = 2.]

MTTDL: declustered placement is not better than the clustered one.

Distributed Storage Systems: Clustered vs. Declustered Placement

Replicated data for D_1, D_2, ..., D_k is placed either on the same node (clustered placement) or on different nodes (declustered placement).

[Figure: clustered vs. declustered placement of the replicas across nodes n_1, n_2.]

MTTDL: declustered placement reduces the repair time (+), shrinking the vulnerability window, but increases the exposure to subsequent device failures (-).
EAFDL: declustered placement reduces the amount of data lost per event (+).

Reliability Results for Replication Factor of 2 (cont.)

MTTDL:
- Declustered placement is not better than the clustered one

EAFDL:
- Independent of the number of nodes for clustered placement
- Inversely proportional to the number of nodes for declustered placement
- Declustered placement is better than the clustered one

Reliability Results for Replication Factor of 3

MTTDL:
- Inversely proportional to the number of nodes for clustered placement
- Independent of the number of nodes for declustered placement
- Declustered placement is better than the clustered one

EAFDL:
- Independent of the number of nodes for clustered placement
- Inversely proportional to the cube of the number of nodes for declustered placement
- Declustered placement is better than the clustered one

Reliability Results for Replication Factor of 4

The MTTR/MTTF ratio (MTTR ≈ 34.7 h) is not very small here, hence some deviation between theory and simulation.

MTTDL:
- Proportional to the square of the number of nodes for declustered placement
- Declustered placement is far superior to the clustered one

EAFDL:
- Inversely proportional to the sixth power of the number of nodes for declustered placement
- Declustered placement is far superior to the clustered one

Theoretical EAFDL Results for Replication Factor of 3

MTTF = 1/λ = 500,000 h

[Figure: theoretical vs. simulated EAFDL as a function of the number of nodes.]

Theoretical results are accurate when devices are very reliable, i.e., when the MTTR/MTTF ratio is small. This allows a quick assessment of EAFDL with no need to run lengthy simulations.

Discussion

EAFDL should be used cautiously. Suppose EAFDL = 0.01%; this does not necessarily imply that 0.01% of the user data is lost each year. Consider:
- System 1: MTTDL = 10 years, 0.1% of the data lost upon a loss event
- System 2: MTTDL = 100 years, 1% of the data lost upon a loss event
Both have the same EAFDL, but very different loss profiles.

The desired reliability profile of a system depends on the application or underlying service. If the requirement is that data losses should not exceed 0.1% in a loss event, only System 1 could satisfy this requirement.
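
The caveat can be made concrete with two hypothetical systems that share the same EAFDL yet differ sharply in per-event loss (the numbers are illustrative assumptions):

```python
def eafdl(mttdl_years, fraction_lost_per_event):
    """EAFDL from MTTDL and the average fraction of data lost per event."""
    return fraction_lost_per_event / mttdl_years

# Frequent small losses vs. rare large losses:
sys1 = eafdl(mttdl_years=10, fraction_lost_per_event=0.001)    # 0.1%/event
sys2 = eafdl(mttdl_years=100, fraction_lost_per_event=0.01)    # 1%/event
print(sys1, sys2)  # same EAFDL (0.01%/year), very different loss profiles
```

Only the first system keeps every individual loss event below 0.1% of the data, so a per-event requirement cannot be checked with EAFDL alone.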

Summary

- Reviewed the widely used mean time to data loss (MTTDL) metric
- Demonstrated that unrecoverable errors are becoming a significant cause of user data loss
- Considered the expected annual fraction of data loss (EAFDL) metric
- Established that the EAFDL metric, together with the traditional MTTDL metric, provides a useful profile of the magnitude and frequency of data losses, and that the two can be jointly evaluated analytically in a general theoretical framework
- Derived the MTTDL and EAFDL of replication-based storage systems that use clustered and declustered data placement schemes, for a large class of failure time distributions, including real-world distributions such as Weibull and gamma
- Demonstrated the superiority of the declustered placement scheme

Future work: apply the methodology developed to derive the reliability of systems using other redundancy schemes, such as erasure codes.