Reliability and Fault Tolerance in Storage




Reliability and Fault Tolerance in Storage
Dalit Naor / Dima Sotnikov, IBM Haifa Research, Storage Systems
Advanced Topics on Storage Systems - Spring 2014, Tel-Aviv University
http://www.eng.tau.ac.il/semcom

Agenda
- RAID systems (RAID 0-5)
- Limitations of RAID
- What comes after RAID 5?
- Distributed replication systems
- Distributed ECC systems
Builds on materials from:
- Operating System Concepts, 7th ed., by Silberschatz, Galvin & Gagne
- CS 3013, Operating Systems, WPI
- Notes by André Brinkmann, U. Paderborn
- Other sources (as indicated)

Definitions
- MTTF: Mean Time To Failure
- MTTR: Mean Time To Repair
- MTBF: Mean Time Between Failures = MTTF + MTTR
- AFR: Annualized Failure Rate - the estimated probability that a hard disk will fail during a full year of use
- MTTDL: Mean Time To Data Loss (the system MTTF) - the time (in years) before a disk failure is likely to cause data loss in a RAID system
- Byte- or bit-level vs. block-level organization; a block is, e.g., 512 bytes

Disk Arrays
Disk arrays aggregate disks into groups of n disks.
- Idea: combine multiple inexpensive disks into one large virtual disk.
- The virtual disk appears to the computer as a regular disk, increases capacity, and delivers high performance.
Problem: the disk-system mean time to failure (MTTF) of the array drops proportionally with the number of disks:
- System MTTF of n disks = Disk MTTF / n
- AFR = (365 * 24) / MTTF, where AFR is the fraction of disks in the array that will fail in a year.
Example (disks assumed identical and independent):
- If the mean time to failure (MTTF) of a disk drive is 100,000 hours, the MTTF of an array of 100 identical disks drops to 1,000 hours (i.e., 41.67 days, about 6 weeks).
- You lose 1% of your data every 6 weeks! Use redundancy (e.g. mirroring).
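The array-MTTF arithmetic above can be sketched in a few lines of Python (a toy calculation mirroring the slide's assumption of identical, independent disks):

```python
# Toy sketch of the slide's arithmetic: with identical, independent disks,
# the array MTTF drops proportionally with the number of disks.
HOURS_PER_YEAR = 365 * 24

def array_mttf(disk_mttf_hours: float, n_disks: int) -> float:
    """System MTTF of an n-disk array."""
    return disk_mttf_hours / n_disks

def afr(mttf_hours: float) -> float:
    """Annualized failure rate: fraction of disks expected to fail per year."""
    return HOURS_PER_YEAR / mttf_hours

# The slide's example: 100 disks with a 100,000-hour MTTF each.
mttf = array_mttf(100_000, 100)   # 1000 hours
days = mttf / 24                  # ~41.7 days, about 6 weeks
```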

A Real Example of Array Reliability with Replication
What is the real MTTF value (in hours) for a single disk? Estimates vary widely (*):
- Manufacturer MTTF: x = 1,200,000 hours
- Inspected (real-world) MTTF: 0.2x = 250,000 hours
- Conservative MTTF: 0.1x = 145,000 hours
- Pessimistic MTTF: 0.03x = 36,000 hours
Consider a 4 TB disk, today's large capacity. To create a pool of 40 TB usable capacity with replication, we need to aggregate 20 disks; the MTTF of the array is Disk MTTF / 20.
The probability of losing data in the array with replication (per hour) is
  Q = 20 * (1/MTTF)^2 * MTTR
MTTR depends on the disk capacity and the disk throughput; for example, the MTTR of a 4 TB disk is ~35 hours (**). Q is shown in the table:

MTTF (hours)         | 1,200,000 | 250,000   | 145,000   | 36,000
Q (prob. data loss)  | 4.8*10^-10| 1.12*10^-8| 3.32*10^-8| 5.4*10^-7
MTTDL (years)        | 234,833   | 10,129    | 3,428     | 211

For a small array, this is a very reliable system.
(*) http://www.zetta.net/docs/zetta_mttdl_june10_2009.xls
(**) Time to rebuild a SATA 7.2K RPM disk at 30 MB/s

Expected Time Between Disk Failures

Time from last failure | Expected time until the next failure
Right after            | 4 days
10 days                | 10 days
20 days                | 15 days

RAID - Redundant Array of Independent Disks
- Aggregates disks into groups ("arrays") of n disks.
- Stripes the data blocks and uses extra capacity to store information redundantly (e.g. an error-correcting code) on other disks in the group.
- When a disk fails, its information is restored from the other disks.
Provides:
- High reliability and availability
- Fast recovery from failure (*)
- Increased performance
Penalties:
- Write/read amplification
- Bandwidth
- Cost (more capacity, more complexity)
(*) As disks get bigger, rebuild takes longer, so higher resiliency is needed.
Originally introduced to replace the Single Large Expensive Disk (SLED) used for mainframes, e.g. the IBM 3380 model CJ2.

RAID Level 0 - Striping
Simple: block i goes on disk (i mod n).
Advantage:
- Read/write n blocks in parallel; n times the bandwidth.
Disadvantage:
- No redundancy at all, hence no fault tolerance. System MTTF is 1/n of a single disk's MTTF!
High I/O performance through parallel I/O.
Layout (4 disks):
  disk 0: stripe 0, stripe 4, stripe 8
  disk 1: stripe 1, stripe 5, stripe 9
  disk 2: stripe 2, stripe 6, stripe 10
  disk 3: stripe 3, stripe 7, stripe 11
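The placement rule "block i goes on disk (i mod n)" can be expressed directly (an illustrative helper, not from the slides):

```python
# Sketch: RAID-0 block placement. Logical block i lands on disk (i mod n),
# at stripe row (i // n) on that disk.
def raid0_location(block: int, n_disks: int) -> tuple:
    """Return (disk index, row on that disk) for a logical block."""
    return (block % n_disks, block // n_disks)

# With 4 disks, stripes 0..3 form the first row across disks 0..3,
# stripes 4..7 the second row, matching the layout above.
```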

RAID Level 1 - Mirroring
Simple: each stripe is written twice; block i is on disks (i mod 2n) and ((i + n) mod 2n).
Advantages:
- Read/write n blocks in parallel.
- Redundancy: System MTTF = (Disk MTTF)^2 / (MTTR * 2n); tolerates a single disk fault.
- Writes are amplified: write throughput is 50%. Reads can be optimized: read throughput is doubled.
- Simple rebuild: a failed disk is replaced by copying its mirror; MTTR = Disk Capacity / Disk Throughput.
Disadvantage:
- Capacity utilization is 50%.
RAID 1+0 combines striping with mirroring; the original RAID 1 used only two disks, with no striping.
Layout (mirrored pair of 4-disk stripe sets):
  disk 0: stripe 0, stripe 4, stripe 8     disk 0': stripe 0, stripe 4, stripe 8
  disk 1: stripe 1, stripe 5, stripe 9     disk 1': stripe 1, stripe 5, stripe 9
  disk 2: stripe 2, stripe 6, stripe 10    disk 2': stripe 2, stripe 6, stripe 10
  disk 3: stripe 3, stripe 7, stripe 11    disk 3': stripe 3, stripe 7, stripe 11

RAID Level 2 - Parity with ECC (Obsolete)
- Error correction at the bit level using a Hamming(7,4) code: can detect up to two bit errors and correct one.
- Disk spindle rotation is synchronized; data is striped so that each sequential bit is on a different disk.
- ECC is implemented by the hard disk.
- RAID 2 is obsolete; there are no commercial RAID 2 systems.

RAID Level 3 - Parity, Byte-Interleaved (Obsolete)
- Dedicated parity disk (computed via XOR), byte-level striping.
- A single block of data is spread across all members of the set and resides in the same location on each; any I/O requires activity on every disk and usually requires synchronized spindles.
Advantages over RAID 2:
- Improved capacity utilization.
- Simple computation (XOR).
Disadvantages:
- Disk spindle rotation is synchronized.
- Detects/corrects only a single error.
- Cannot serve multiple requests in parallel!
Good for long sequential reads and writes. Obsolete; not found in commercial systems.

RAID Level 4 - Parity, Block-Interleaved
One disk is used for parity. The data is split into equal-sized stripes, and each stripe is split across n + 1 disks. A full stripe is a single row {D0, D1, D2, D3, P0-3}, where
  Parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3
n stripes plus parity are written/read in parallel. If any disk/stripe fails, it can be reconstructed from the others:
  P = D0 xor D1 xor D2 xor D3
  D2 = D0 xor D1 xor D3 xor P
Advantages:
- n times the read bandwidth.
- System MTTF = (Disk MTTF)^2 / (MTTR * n(n+1)).
- Capacity utilization is 1 - 1/(n+1).
- Simple rebuild with a hot-swap disk: a failed disk can be reconstructed on the fly.
- Hot expansion: can upgrade to larger disks easily.
Disadvantage:
- The parity disk is a bottleneck: every write requires a read-modify-write of the parity stripe, so there is only 1x write bandwidth.
Layout (4 data disks + 1 parity disk):
  disk 0: stripe 0, stripe 4, stripe 8
  disk 1: stripe 1, stripe 5, stripe 9
  disk 2: stripe 2, stripe 6, stripe 10
  disk 3: stripe 3, stripe 7, stripe 11
  disk 4: parity 0-3, parity 4-7, parity 8-11
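The XOR identities above can be demonstrated with a small sketch: parity is the XOR of the data chunks, and any single lost chunk is the XOR of everything that survives.

```python
# Sketch of RAID-4 parity: P is the XOR of the data chunks, and any one
# lost chunk (data or parity) is the XOR of all the surviving chunks.
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def parity(chunks):
    """P = D0 xor D1 xor ... xor Dn-1."""
    return reduce(xor_bytes, chunks)

def reconstruct(surviving):
    """Rebuild the single missing chunk from all the others."""
    return reduce(xor_bytes, surviving)

data = [b"\x01\x02", b"\x0f\x00", b"\xaa\x55", b"\x10\x20"]
p = parity(data)
# Disk 2 fails: rebuild its chunk from the remaining data chunks plus parity.
rebuilt = reconstruct([data[0], data[1], data[3], p])
```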

RAID Level 5 - Distributed Parity, Block-Interleaved
Similar to RAID Level 4, but parity is distributed over all disks; otherwise the characteristics are similar. This is the most popular level; rotating the parity across disks spreads out the parity load.
Key additional advantages:
- Avoids the bottleneck at the parity disk.
- Increases write parallelism.
Writing individual stripes (RAID 4 & 5): 1 logical write = 2 reads + 2 writes:
- Read the existing stripe and the existing parity.
- Recompute the parity.
- Write the new stripe and the new parity.
Layout (parity rotates across 5 disks):
  disk 0: stripe 0, stripe 4, stripe 8, stripe 12
  disk 1: stripe 1, stripe 5, stripe 9, parity 12-15
  disk 2: stripe 2, stripe 6, parity 8-11, stripe 13
  disk 3: stripe 3, parity 4-7, stripe 10, stripe 14
  disk 4: parity 0-3, stripe 7, stripe 11, stripe 15
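The small-write path above follows from the XOR identity new_parity = old_parity xor old_data xor new_data, which is why only the old data and old parity need to be read. A minimal sketch:

```python
# Sketch of the RAID-4/5 small-write path: updating one data block needs only
# the old data and old parity, since
#   new_parity = old_parity xor old_data xor new_data.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Recompute parity without reading the other data disks."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)
```

This is exactly the 2-reads-plus-2-writes cost: read old data and old parity, then write new data and new parity.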

RAID 5 Operations

RAID Levels and Configurations (up to Level 5)
Source: OS course notes, Kai Li, Princeton. http://www.cs.princeton.edu/courses/archive/fall12/cos318/schedule.html

A Real Use Case of RAID 5 (Today)
Requirements:
- Usable data: 30 PB
- Single disk capacity: 4 TB
- RAID 5 configuration: 5+1 (5 data disks + 1 parity)
- Expected time to read one sequential megabyte from disk: ~20 milliseconds (*)
Some calculations:
- Required number of RAID boxes = total capacity in TB / RAID usable capacity in TB = 30*1000 / (5*4) = 1,500
- Required number of disks = number of RAID boxes * disks per RAID = 1500 * 6 = 9,000
- Single-disk expected bandwidth = 1 second / time to read 1 MB = 1/0.02 = 50 MB/sec
- Assuming the system is 80% utilized, 10 MB/sec of bandwidth can be dedicated to recovery
- MTTR in hours = disk capacity in MB / recovery bandwidth per hour = (4*1024*1024) / (10*3600) ~ 116 hours
(*) http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
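The sizing arithmetic above can be checked with a few lines (all constants are the slide's assumptions):

```python
# Sketch of the slide's sizing arithmetic for the 30 PB RAID-5 (5+1) system.
USABLE_PB = 30
DISK_TB = 4
DATA_DISKS, PARITY_DISKS = 5, 1

raid_boxes = USABLE_PB * 1000 // (DATA_DISKS * DISK_TB)       # 1500 boxes
total_disks = raid_boxes * (DATA_DISKS + PARITY_DISKS)        # 9000 disks
disk_bw = 1 / 0.02                                            # 50 MB/sec
recovery_bw = disk_bw * 0.2                                   # 10 MB/sec free at 80% load
mttr_hours = (DISK_TB * 1024 * 1024) / (recovery_bw * 3600)   # ~116 hours
```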

A Real Use Case of RAID 5 (Today), Continued
- Single-disk MTTF: 145,000 hours
- MTTR: 116 hours
- Number of RAID groups: 1,500 (denoted m)
- Disks per RAID group: 5 + 1 (denoted n + 1)
  MTTDL = MTTF^2 / (m * (n+1) * n * MTTR) = 145000^2 / (1500 * 6 * 5 * 116) ~ 4028 hours
MTTDL: 4,028 hours, about 168 days. If we take the MTTF to be 250,000 hours, the MTTDL becomes ~499 days - not much better.
Martin Schulze, Garth Gibson, Randy Katz, David Patterson, "How Reliable Is a RAID?", COMPCON Spring '89.
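The MTTDL formula above can be sketched as a function and evaluated for both MTTF estimates:

```python
# Sketch: the slide's MTTDL formula for m RAID-5 groups, each with n data
# disks and one parity disk: MTTDL = MTTF^2 / (m * (n+1) * n * MTTR).
def mttdl_raid5_fleet(mttf_h: float, mttr_h: float, m_groups: int, n_data: int) -> float:
    return mttf_h**2 / (m_groups * (n_data + 1) * n_data * mttr_h)

hours = mttdl_raid5_fleet(145_000, 116, 1500, 5)    # ~4028 hours (~168 days)
better = mttdl_raid5_fleet(250_000, 116, 1500, 5)   # ~12,000 hours (~499 days)
```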

But in Reality, RAID 5 Is Even Worse
With 18,500 disks, a storage array is always rebuilding!
- Disk annual failure rate = hours per year / MTTF
- Disk failures per day = annual failure rate * total number of disks / days per year
- For MTTF = 250,000: 9000 * 24 / 250000 ~ 0.86 disk failures per day
- For MTTF = 145,000: 9000 * 24 / 145000 ~ 1.5 disk failures per day
A hard error rate of 1 in 10^15 bits implies data loss roughly every 6th rebuild:
- During a single disk recovery, all the data in the RAID group is read: data disks per RAID * disk capacity in bits = 5 * 4 * 1024^4 * 8 = 175,921,860,444,160 bits, and 175,921,860,444,160 / 10^15 ~ 1/6.
RAID 5 will lose data every ~4 days for MTTF = 145,000, every ~7 days for MTTF = 250,000, or about once a month for MTTF = 1,200,000.
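Combining the fleet failure rate with the per-rebuild unrecoverable-error probability gives the data-loss intervals quoted above (a sketch under the slide's assumption of a 1-in-10^15-bit hard error rate):

```python
# Sketch: fleet failure rate x probability of hitting an unrecoverable
# bit error during a rebuild (hard error rate assumed 1 in 10**15 bits).
BITS_READ_PER_REBUILD = 5 * 4 * 1024**4 * 8          # 5 data disks x 4 TB, in bits
P_UBE_PER_REBUILD = BITS_READ_PER_REBUILD / 10**15   # ~0.18, i.e. ~every 6th rebuild

def failures_per_day(n_disks: int, mttf_hours: float) -> float:
    return n_disks * 24 / mttf_hours

def days_between_data_loss(n_disks: int, mttf_hours: float) -> float:
    return 1 / (failures_per_day(n_disks, mttf_hours) * P_UBE_PER_REBUILD)

d = days_between_data_loss(9000, 145_000)   # ~4 days
```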

The Limitations of RAID 5 - Conclusions
- The system is consistently busy with rebuilds, and the I/O traffic generated by the excessive rebuilds is exposed to the disks' bit errors.
- RAID 5 does not meet today's reliability requirements.
What about data availability?
- RAID control is local, confined to a box, and redundancy is all within the array of disks. Advantage: simple, and rebuild uses local, close-by traffic.
- But large systems can no longer be built out of a single box; they need many RAID boxes.
- Any component within an enclosure can fail, making all data unreachable even though all bits are intact on the disks.
- The amount of information needed to recover a given unit (e.g. 1 MB) is very large (in RAID 5, n MB, where n is the number of data disks).

The I/O Request Path in the Storage Subsystem
- Protocol stack
- Disk drivers
- SCSI protocol
- Storage layer
- FC adapter w/ drivers
- Networks
- Disk

What Comes After RAID 5?
Approach 1: improve reliability within a single box - RAID Level 6, recovering from double failures.
- RAID 6 is a general term for any type of RAID that can tolerate two disk failures; MTTDL increases to ~100 years.
- Extends RAID 5 by adding an additional parity block: block-level striping with two parity blocks distributed across all member disks.
- The P parity is based on XOR; Q uses other codes, e.g. a different XOR or Reed-Solomon.
- Box-related problems remain, but are postponed to a later point.
Approach 2: handle system-wide reliability.

Storage System Reliability: Distributed Replication
Motivation:
- Avoid excessive reads on every recovery: RAID 1 recovers 1 MB by reading 1 MB, while RAID 5 recovers 1 MB by reading n MB.
- Avoid a single-disk bottleneck at recovery (like the dedicated disk in RAID 4) by distributing the work.
Distributed RAID 1 vs. 2-way replication: RAID 1 rebuilds onto a dedicated spare disk, while 2-way replication rebuilds into spare capacity spread across all disks.

Distributed Two-Way Replication
Consider the previous example: usable capacity of 30 PB, single-disk capacity of 4 TB.
- A 2-way replication configuration requires 60 PB of raw capacity.
- Required number of disks: 60*1000/4 = 15,000 (denoted N).
- Single-disk recovery bandwidth: 10 MB/sec (under the 80% utilization assumption).
- The replication object (chunk) size is 1 MB.
- If a disk fails, every other disk contains (on average) 4*1024*1024 / 15000 ~ 280 objects (280 MB) of the failed disk's data.
- Therefore, during disk rebuild (recovery), every disk needs to read 280 MB and write 280 MB of data. This takes ~56 seconds (assuming unbounded network resources).
  MTTDL = MTTF^2 / (N * (N-1) * MTTR) = 145000^2 / (15000 * 14999 * 0.0156) ~ 5990 hours
For a 40 Gbit InfiniBand network (*) at 5 GB per second, the rebuild is network-bound: 4*1024/5 = 820 sec ~ 14 minutes, so
  MTTDL = MTTF^2 / (N * (N-1) * MTTR) = 145000^2 / (15000 * 14999 * 0.23) ~ 406 hours
(*) http://en.wikipedia.org/wiki/InfiniBand
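The two MTTDL figures follow directly from the slide's formula; a sketch:

```python
# Sketch: the slide's MTTDL for distributed 2-way replication over N disks,
#   MTTDL = MTTF^2 / (N * (N-1) * MTTR).
def mttdl_2way(mttf_h: float, mttr_h: float, n_disks: int) -> float:
    return mttf_h**2 / (n_disks * (n_disks - 1) * mttr_h)

unbounded = mttdl_2way(145_000, 56 / 3600, 15_000)    # ~6000 hours
infiniband = mttdl_2way(145_000, 820 / 3600, 15_000)  # ~410 hours
```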

Distributed Three-Way Replication
For the same example of 30 PB usable capacity:
- Required number of disks: 90*1000/4 = 22,500 (denoted N).
  MTTDL = MTTF^3 / (N * (N-1) * (N-2) * MTTR^2) = 145000^3 / (22500 * 22499 * 22498 * 0.0156^2) ~ 1,099,930 hours
Ignoring network limitations, the MTTDL is more than 125 years. Assuming a 40 Gbit InfiniBand network, however, the MTTDL is still unacceptable:
  MTTDL = MTTF^3 / (N * (N-1) * (N-2) * MTTR^2) = 145000^3 / (22500 * 22499 * 22498 * 0.23^2) ~ 5060 hours
5,060 hours is about 7 months.
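The three-way formula can be sketched the same way as the two-way case:

```python
# Sketch: the slide's MTTDL for 3-way replication,
#   MTTDL = MTTF^3 / (N * (N-1) * (N-2) * MTTR^2).
HOURS_PER_YEAR = 365 * 24

def mttdl_3way(mttf_h: float, mttr_h: float, n_disks: int) -> float:
    n = n_disks
    return mttf_h**3 / (n * (n - 1) * (n - 2) * mttr_h**2)

unbounded = mttdl_3way(145_000, 0.0156, 22_500)   # ~1.1e6 hours, > 125 years
infiniband = mttdl_3way(145_000, 0.23, 22_500)    # ~5060 hours, ~7 months
```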

Distributed Schemes with ECC
What's next? 4-way replication is too costly.
Objective: a system with the following properties:
- Capacity utilization better than 2-way replication.
- Can withstand many disk failures; this requires a more complex distributed RAID.
Erasure coding notation:
- m (message size) is the number of original data chunks.
- n is the number of encoded data chunks, n > m.
- Every data item encoded into n chunks can be reconstructed from any m chunks.
- Encoding rate r = m/n (< 1); capacity overhead is 1/r.
For example, use Reed-Solomon encoding.

ECC - Example
For an erasure code with parameters m = 8 and n = 12:
- Required amount of usable data: 30 PB; single-disk capacity: 4 TB.
- m = 8 and n = 12 requires 45 PB of raw capacity.
- Required number of disks: 45*1000/4 = 11,250 (denoted N).
- Single-disk recovery bandwidth: 10 MB/sec; data chunk size: 1 MB.
- If a disk fails, every other disk contains (on average) 8*4*1024*1024 / 11250 ~ 2983 chunks (2983 MB) needed to rebuild the failed disk.
- Therefore, during disk rebuild (recovery), every disk needs to read 2983 MB and write 373 MB of data. This takes ~336 seconds (assuming unbounded network resources), or ~14 minutes assuming a 40 Gb InfiniBand network.
  MTTDL = MTTF^5 / (N * (N-1) * (N-2) * (N-3) * (N-4) * MTTR^4) ~ 8575 years
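The MTTDL formula generalizes the replication cases: with f = n - m = 4 tolerated failures, it is MTTF^(f+1) over the product of disk counts times MTTR^f. The slide does not state exactly which MTTR produces the ~8575-year figure, so the sketch below uses the network-bound ~0.23 h and only claims the same order of magnitude:

```python
# Sketch: MTTDL for an erasure code tolerating f disk failures,
#   MTTDL = MTTF^(f+1) / (N * (N-1) * ... * (N-f) * MTTR^f),
# here with f = n - m = 4. The MTTR behind the slide's ~8575-year figure is
# an assumption; ~0.23 h (the network-bound rebuild) is used below.
HOURS_PER_YEAR = 365 * 24

def mttdl_ecc(mttf_h: float, mttr_h: float, n_disks: int, f: int = 4) -> float:
    denom = mttr_h**f
    for i in range(f + 1):
        denom *= (n_disks - i)
    return mttf_h**(f + 1) / denom

years = mttdl_ecc(145_000, 0.23, 11_250) / HOURS_PER_YEAR  # thousands of years
```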

In Summary
- RAID is widely used today - a key concept in today's storage systems. The levels actually used are RAID 1, RAID 5, and RAID 6.
- RAID improves reliability within a single box, but this approach is reaching its limitations due to the growth in capacity.
- New approaches (e.g. at cloud scale) build on:
  - Replication (two- or three-way)
  - Distributed ECC