Reliability and Fault Tolerance in Storage
Dalit Naor / Dima Sotnikov, IBM Haifa Research, Storage Systems
Advanced Topics on Storage Systems - Spring 2014, Tel-Aviv University
http://www.eng.tau.ac.il/semcom
Agenda
- RAID systems (RAID 0-5)
- Limitations of RAID
- What comes after RAID 5?
- Distributed replication systems
- Distributed ECC systems
Builds on materials from:
- Operating System Concepts, 7th ed., by Silberschatz, Galvin & Gagne
- CS 3013, Operating Systems, WPI
- Notes by André Brinkmann, U. Paderborn
- Other (as indicated)
Definitions
- MTTF: Mean Time To Failure
- MTTR: Mean Time To Repair
- MTBF: Mean Time Between Failures = MTTF + MTTR
- AFR: Annual Failure Rate - the estimated probability that a hard disk will fail during a full year of use
- MTTDL: Mean Time To Data Loss (system MTTF) - the time (in years) before a disk failure is likely to cause data loss in a RAID system
- Byte or bit level vs. block level: a block is, e.g., 512 bytes
Disk Arrays
Disk arrays: aggregation of disks into groups of n disks.
Idea: combine multiple inexpensive disks into one large virtual disk.
- The virtual disk appears to be a regular disk to the computer
- Increases capacity
- The virtual disk will have high performance
Problem: the mean time to failure (MTTF) of the array drops proportionally:
- System MTTF of n disks = Disk MTTF / n
- AFR = (365*24) / MTTF, AFR being the fraction of disks in the array that will fail in a year
Example (disks are assumed to be identical and independent):
- If the MTTF of a disk drive is 100,000 hours,
- the MTTF of an array of 100 identical disks drops to 1,000 hours (i.e., 41.67 days, ~6 weeks)
- lose 1% of your data every 6 weeks!
Solution: use redundancy (e.g., mirroring).
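The array-MTTF and AFR arithmetic above can be sketched directly; a minimal example using the slide's numbers (100 disks, 100,000-hour disk MTTF):

```python
# Reliability of an n-disk array with no redundancy (identical, independent disks).
def array_mttf(disk_mttf_hours, n_disks):
    """MTTF of the whole array: any single disk failure is a failure."""
    return disk_mttf_hours / n_disks

def afr(disk_mttf_hours):
    """Annual Failure Rate: fraction of disks expected to fail per year."""
    return 365 * 24 / disk_mttf_hours

mttf = array_mttf(100_000, 100)
print(mttf, mttf / 24)   # 1000 hours, ~41.7 days
print(afr(100_000))      # 0.0876 -> ~8.8% of disks fail per year
```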
A Real Example of Array Reliability w/ Replication
What is the real MTTF value (in hours) for a single disk? Estimates vary widely (*):
- Manufacturer MTTF: x = 1,200,000 hours
- Inspected (real-world) MTTF: ~0.2x = 250,000 hours
- Conservative MTTF: ~0.1x = 145,000 hours
- Pessimistic MTTF: ~0.03x = 36,000 hours
Consider a 4 TB disk (today's large-capacity disks). To create a pool of 40 TB usable capacity with replication, aggregate 20 disks.
- MTTF of the array is Disk MTTF / 20
- Probability (per hour) of losing data in the array with replication: Q = 20 * (1/MTTF)^2 * MTTR
- MTTR depends on the disk capacity and disk throughput; e.g., the MTTR of a 4 TB disk is ~35 hours (**)
Q is shown in the table:

MTTF (hours):            1,200,000   250,000     145,000     36,000
Q (prob. of data loss):  4.8*10^-10  1.12*10^-8  3.32*10^-8  5.4*10^-7
MTTDL (years):           234,833     10,129      3,428       211

For a small array, this is a very reliable system.
(*) http://www.zetta.net/docs/zetta_mttdl_june10_2009.xls
(**) Time to rebuild a SATA 7.2K RPM disk at 30 MB/s
Expected Time Between Disk Failures

Time from last failure | Expected time until the next failure
Right after            | 4 days
10 days                | 10 days
20 days                | 15 days
RAID - Redundant Array of Independent Disks
- Aggregates disks into groups ("arrays") of n disks
- Stripes the data blocks, and uses the extra capacity to store information redundantly (e.g., an error-correction code) on other disks in the group
- When a disk fails, its information is restored from the other disks
Provides:
- High reliability and availability
- Fast recovery from failure (*)
- Increased performance
Penalty:
- Write/read amplification
- Bandwidth
- Cost (more capacity, more complexity)
(*) As disks get bigger, rebuild takes longer, so higher resiliency is needed.
Originally introduced to replace the Single Large Expensive Disk (SLED) used for mainframes (e.g., the IBM 3380 model CJ2).
RAID Level 0 - Striping
Simple: block/group i is on disk (i mod n).
Advantage:
- Read/write n blocks in parallel; n times the bandwidth
Disadvantage:
- No redundancy at all. System MTTF is 1/n of the disk MTTF!
No redundancy, no fault tolerance; high I/O performance through parallel I/O.
Layout (4 disks):
disk 0: stripe 0, stripe 4, stripe 8
disk 1: stripe 1, stripe 5, stripe 9
disk 2: stripe 2, stripe 6, stripe 10
disk 3: stripe 3, stripe 7, stripe 11
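The block-to-disk mapping above is just modular arithmetic; a minimal sketch (the function name is illustrative):

```python
# RAID 0: block i lives on disk (i mod n), in stripe row (i // n).
def raid0_location(block, n_disks):
    return block % n_disks, block // n_disks   # (disk, row)

# 12 blocks over 4 disks reproduce the layout above, e.g.:
print(raid0_location(5, 4))    # (1, 1): stripe 5 sits on disk 1, row 1
print(raid0_location(10, 4))   # (2, 2): stripe 10 sits on disk 2, row 2
```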
RAID Level 1 - Mirroring
Simple: each stripe is written twice. Block/group i is on disks (i mod 2n) and (i+n mod 2n).
Advantages:
- Read/write n blocks in parallel
- Redundancy: System MTTF = (Disk MTTF)^2 / (MTTR * 2n); tolerates a single disk fault
- Writes are amplified: throughput for writes is 50%
- Reads can be optimized: throughput for reads is doubled
- Simple rebuild: a failed disk is replaced by copying its mirror; MTTR = Disk Capacity / Disk Throughput
Disadvantage:
- Capacity utilization is 50%
RAID 1+0 combines striping with mirroring; the original RAID 1 used only two disks, with no striping.
Layout: two mirrored sets of 4 disks, each holding stripes 0-11 as in the RAID 0 layout.
RAID Level 2 - Parity with ECC (Obsolete)
- Error correction at the bit level
- Uses a Hamming(7,4) code: can detect up to two bit errors and correct up to one
- Disk spindle rotation is synchronized; data is striped so that each sequential bit is on a different disk
- ECC is implemented by the hard disk
RAID 2 is obsolete; there are no commercial RAID 2 systems.
RAID Level 3 - Parity, Byte-Interleaved (Obsolete)
- Dedicated parity disk (via XOR computation), byte-level striping
- A single block of data is spread across all members of the set and resides in the same location on each
- Any I/O requires activity on every disk and usually requires synchronized spindles
Advantages over RAID 2:
- Improved capacity utilization
- Simplicity of computation (XOR)
Disadvantages:
- Disk spindle rotation is synchronized
- Detection/correction of only a single error
- Cannot serve multiple requests in parallel!
Good for long sequential reads and writes. Obsolete; not found in commercial systems.
RAID Level 4 - Parity, Block-Interleaved
- One disk is used for parity; the data is split into equal-sized stripes
- Each stripe is split over n + 1 disks; a full stripe is a single row {D0, D1, D2, D3, P0-3}
- Parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3, i.e., P = D0 ⊕ D1 ⊕ D2 ⊕ D3
- n stripes plus parity are written/read in parallel
- If any disk/stripe fails, it can be reconstructed from the others, e.g., D2 = D0 ⊕ D1 ⊕ D3 ⊕ P
Advantages:
- n times the read bandwidth
- System MTTF = (Disk MTTF)^2 / (MTTR * n(n+1))
- Capacity utilization is 1 - 1/(n+1)
- Simple rebuild with a hot-spare disk: a failed disk can be reconstructed on the fly
- Hot expansion: can upgrade to larger-sized disks easily
Disadvantage:
- The parity disk is a bottleneck: a write requires a read-modify-write of the parity stripe, so only 1x write bandwidth
Layout (4+1 disks):
disk 0: stripe 0, stripe 4, stripe 8
disk 1: stripe 1, stripe 5, stripe 9
disk 2: stripe 2, stripe 6, stripe 10
disk 3: stripe 3, stripe 7, stripe 11
parity disk: parity 0-3, parity 4-7, parity 8-11
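The XOR parity and reconstruction equations above can be demonstrated at the byte level; a toy sketch with 4-byte stripe units (the helper name is illustrative):

```python
# P = D0 ^ D1 ^ D2 ^ D3; any lost unit is the XOR of the surviving four.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [bytes([d] * 4) for d in (1, 2, 3, 4)]    # D0..D3, toy 4-byte units
parity = xor_blocks(data)                         # P = D0 ^ D1 ^ D2 ^ D3

# "Lose" D2 and rebuild it from D0, D1, D3 and P:
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
assert rebuilt == data[2]
```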
RAID Level 5 - Distributed Parity, Block-Interleaved
- Similar to RAID Level 4, but parity is distributed over all disks
- Similar characteristics to RAID Level 4; the most popular level - rotating the parity spreads out the parity load
Key additional advantages:
- Avoids the bottleneck at the parity disk
- Increases write parallelism
Writing individual stripes (RAID 4 & 5): 1 logical write = 2 reads + 2 writes
- Read the existing stripe and the existing parity
- Recompute the parity
- Write the new stripe and the new parity
Layout (5 disks, rotating parity):
disk 0: stripe 0, stripe 4, stripe 8, stripe 12
disk 1: stripe 1, stripe 5, stripe 9, parity 12-15
disk 2: stripe 2, stripe 6, parity 8-11, stripe 13
disk 3: stripe 3, parity 4-7, stripe 10, stripe 14
disk 4: parity 0-3, stripe 7, stripe 11, stripe 15
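The small-write path above works because the new parity depends only on the old data, the new data, and the old parity - which is exactly why one logical write costs 2 reads + 2 writes. A minimal sketch:

```python
# RAID 4/5 read-modify-write: P_new = D_old ^ D_new ^ P_old, byte by byte.
def new_parity(old_data, new_data, old_parity):
    return bytes(od ^ nd ^ op for od, nd, op in zip(old_data, new_data, old_parity))

# Stripe D = [5, 7] -> P = 5 ^ 7 = 2; overwrite D0: 5 -> 9.
p_updated = new_parity(b"\x05", b"\x09", b"\x02")
p_recomputed = bytes([9 ^ 7])       # full recomputation gives the same parity
print(p_updated == p_recomputed)    # True
```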
RAID 5 Operations (figure)
RAID Levels and Configurations (up to Level 5)
Source: OS course notes, Kai Li, Princeton - http://www.cs.princeton.edu/courses/archive/fall12/cos318/schedule.html
A Real Use Case of RAID 5 (today)
Real use case:
- Required amount of usable data: 30 PB
- Single disk capacity: 4 TB
- RAID 5 configuration: 5+1 (5 data disks + 1 parity)
- Expected time to read one sequential megabyte from disk: ~20 milliseconds (*)
Some calculations:
- Required number of RAID boxes = total capacity in TB / RAID usable capacity in TB = 30*1000/(5*4) = 1500
- Required number of disks = number of RAID boxes * disks per RAID = 1500 * 6 = 9000
- Single disk expected bandwidth = 1 second / time to read 1 MB = 1/0.02 = 50 MB/sec
- Assuming the system is 80% utilized, 10 MB/sec of bandwidth can be dedicated to recovery
- MTTR in hours = disk capacity in MB / recovery bandwidth per hour = (4*1024*1024)/(10*3600) ~ 116 hours
(*) http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
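The sizing arithmetic above, written out as a short script with the slide's inputs:

```python
# RAID 5 (5+1) sizing for 30 PB usable, 4 TB disks, 20 ms per sequential MB.
USABLE_PB, DISK_TB = 30, 4
DATA_DISKS, PARITY_DISKS = 5, 1

boxes = USABLE_PB * 1000 // (DATA_DISKS * DISK_TB)          # 1500 RAID boxes
disks = boxes * (DATA_DISKS + PARITY_DISKS)                 # 9000 disks
disk_bw = 1 / 0.02                                          # 50 MB/s per disk
recovery_bw = disk_bw * (1 - 0.8)                           # 10 MB/s left at 80% load
mttr_hours = DISK_TB * 1024 * 1024 / (recovery_bw * 3600)   # ~116 hours
print(boxes, disks, int(mttr_hours))
```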
A Real Use Case of RAID 5 (today)
- Single disk MTTF: 145,000 hours
- MTTR: 116 hours
- #RAIDs: 1500 (denoted by m)
- #Disks per RAID: 5 + 1 (denoted by n + 1)
MTTDL = MTTF^2 / (m * (n+1) * n * MTTR) = 145,000^2 / (1500 * (5+1) * 5 * 116) ~ 4028
MTTDL: 4028 hours ~ 168 days.
If we take the MTTF to be 250,000 hours, the MTTDL will be ~499 days - not much better.
Martin Schulze, Garth Gibson, Randy Katz, David Patterson, "How Reliable Is a RAID?", COMPCON Spring '89.
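The MTTDL formula above can be checked numerically; a sketch reproducing both figures on the slide:

```python
# MTTDL for m RAID 5 groups of n data disks + 1 parity disk each.
def raid5_mttdl_hours(mttf, mttr, m, n):
    return mttf**2 / (m * (n + 1) * n * mttr)

h = raid5_mttdl_hours(145_000, 116, m=1500, n=5)
print(round(h), round(h / 24))                                # ~4028 hours, ~168 days
print(round(raid5_mttdl_hours(250_000, 116, 1500, 5) / 24))   # ~499 days
```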
But in reality RAID 5 is actually even worse
With 18,500 disks, a storage array is always rebuilding!
- Disk annual failure rate = hours per year / MTTF
- Number of disk failures per day = annual failure rate * total number of disks / days per year
- For MTTF = 250,000: 9000*24/250,000 ~ 0.86 disk failures per day
- For MTTF = 145,000: 9000*24/145,000 ~ 1.5 disk failures per day
A hard error rate of 1 in 10^15 bits implies data loss on roughly every 6th rebuild:
- During a single disk recovery, all the data from the single RAID is read: #data disks per RAID * disk capacity in bits = 5*4*1024*1024*1024*1024*8 = 175,921,860,444,160 bits
- 175,921,860,444,160 / 10^15 ~ 0.18, i.e., a bit error on about 1 in 6 rebuilds
So RAID 5 will lose data every ~4 days for MTTF = 145,000, every ~7 days for MTTF = 250,000, or about once a month for MTTF = 1,200,000.
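These failure-rate and bit-error figures follow from the slide's 9000-disk example; a sketch of the arithmetic:

```python
# 9000 disks; a rebuild reads 5 data disks of 4 TiB, and the hard error
# rate is 1 in 1e15 bits read.
DISKS = 9000

def failures_per_day(mttf_hours):
    return DISKS * 24 / mttf_hours

bits_read_per_rebuild = 5 * 4 * 1024**4 * 8     # 175,921,860,444,160 bits
p_error = bits_read_per_rebuild / 1e15          # ~0.18 -> loss every ~6th rebuild

for mttf in (145_000, 250_000):
    f = failures_per_day(mttf)                  # ~1.5 and ~0.86 per day
    print(mttf, round(f, 2), round(1 / (f * p_error)))  # days between data losses
```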
The Limitations of RAID 5 - Conclusions
- The system is consistently busy with rebuilds
- The I/O traffic caused by the excessive rebuilds is exposed to the disks' bit errors
- RAID 5 does not address today's reliability needs
What about data availability?
- RAID control is local, confined to a box; redundancy is all within the array of disks
- Advantage: simple, and rebuild uses local disks with nearby traffic
- Large systems can no longer be built out of a single box - they need many RAID boxes
- Any component within an enclosure can fail, making all data unreachable even though all bits are intact on the disks
- The amount of information needed to recover a given unit (e.g., 1 MB) is very large (e.g., in RAID 5, n MB, n being the number of data disks)
The I/O Request Path in a Storage Subsystem
Protocol stack: disk drivers - SCSI protocol - storage layer - FC adapter w/ drivers - networks - disk
What comes after RAID 5?
Approach 1 - improve reliability within a single box:
- RAID Level 6 - recovering from double failures
- RAID 6 is a general term for any type of RAID that can tolerate two disk failures; MTTDL increases to ~100 years
- Extends RAID 5 by adding an additional parity block: block-level striping with two parity blocks distributed across all member disks
- P parity is based on XOR; Q uses other codes, e.g., a different XOR or Reed-Solomon
- Box-related problems remain, but are postponed to a later point
Approach 2 - handle system-wide reliability.
Storage System Reliability - Distributed Replication
Motivation:
- Avoid excessive reads on every recovery: RAID 1 recovers 1 MB by reading 1 MB, while RAID 5 recovers 1 MB by reading n MB
- Avoid a single-disk bottleneck at recovery, e.g., use distribution as in RAID 4
Distributed RAID 1 vs. 2-way replication: RAID 1 rebuilds onto a dedicated spare disk, while 2-way replication spreads spare capacity across all disks.
Distributed Two-Way Replication
Consider the previous example: usable capacity of 30 PB, single disk capacity of 4 TB.
- A 2-way replication configuration requires 60 PB of raw capacity
- Required number of disks: 60*1000/4 = 15,000 (denoted by N)
- Single disk recovery bandwidth: 10 MB/sec (under the 80% utilization assumption)
- The replication object (chunk) size is 1 MB
- In case of a disk failure, every other disk contains (on average) 4*1024*1024/15,000 ~ 280 objects (280 MB) of the failed disk
- Therefore, during disk rebuild (recovery), every disk needs to read 280 MB and write 280 MB of data
- This takes ~56 seconds ~ 0.0156 hours (assuming that the network resources are unbounded)
MTTDL = MTTF^2 / (N * (N-1) * MTTR) = 145,000^2 / (15,000 * 14,999 * 0.0156) ~ 5990 hours
For a 40 Gbit InfiniBand network (*) at ~5 GB per second: 4*1024/5 = 820 sec ~ 14 minutes ~ 0.23 hours
MTTDL = MTTF^2 / (N * (N-1) * MTTR) = 145,000^2 / (15,000 * 14,999 * 0.23) ~ 406 hours
(*) http://en.wikipedia.org/wiki/InfiniBand
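The two MTTDL figures above can be reproduced from the formula with the slide's MTTR values:

```python
# Distributed 2-way replication: MTTDL = MTTF^2 / (N * (N-1) * MTTR).
MTTF, N = 145_000, 15_000

def mttdl_hours(mttr_hours):
    return MTTF**2 / (N * (N - 1) * mttr_hours)

print(round(mttdl_hours(0.0156)))   # ~5990 hours: unbounded network (~56 s rebuild)
print(round(mttdl_hours(0.23)))     # ~406 hours: 40 Gbit InfiniBand (~820 s rebuild)
```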
Distributed Three-Way Replication
For the same example of 30 PB usable capacity:
- Required number of disks: 90*1000/4 = 22,500 (denoted by N)
MTTDL = MTTF^3 / (N * (N-1) * (N-2) * MTTR^2) = 145,000^3 / (22,500 * 22,499 * 22,498 * 0.0156^2) ~ 1,099,930 hours
Ignoring network limitations, the MTTDL is more than 125 years.
Assuming 40 Gbit InfiniBand, the MTTDL is still unacceptable:
MTTDL = MTTF^3 / (N * (N-1) * (N-2) * MTTR^2) = 145,000^3 / (22,500 * 22,499 * 22,498 * 0.23^2) ~ 5060 hours
5060 hours are ~7 months.
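Because three-way replication survives two failures, MTTR now enters squared; a sketch checking both numbers on the slide:

```python
# Distributed 3-way replication: MTTDL = MTTF^3 / (N*(N-1)*(N-2) * MTTR^2).
MTTF, N = 145_000, 22_500

def mttdl3_hours(mttr):
    return MTTF**3 / (N * (N - 1) * (N - 2) * mttr**2)

print(int(mttdl3_hours(0.0156) / (365 * 24)))   # 125 years (unbounded network)
print(round(mttdl3_hours(0.23)))                # ~5060 hours (~7 months)
```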
Distributed Schemes with ECC
What's next? 4-way replication is too costly.
Objective: a system with the following properties:
- Capacity utilization better than 2-way replication
- Can withstand many disk failures - requires a more complex distributed RAID
Erasure coding notation:
- m (message size) is the number of original data chunks
- n is the number of encoded data chunks, n > m
- Every data item encoded into n chunks can be reconstructed from any m chunks
- Encoding rate r = m/n (< 1); capacity utilization is 1/r
For example, use Reed-Solomon encoding.
ECC - Example
For ECC with parameters m = 8 and n = 12:
- Required amount of usable data: 30 PB; single disk capacity: 4 TB
- m = 8 and n = 12 requires 45 PB of raw capacity
- Required number of disks: 45*1000/4 = 11,250 (denoted by N)
- Single disk recovery bandwidth: 10 MB/sec; data chunk size: 1 MB
- In case of a disk failure, every other disk contains (on average) 8*4*1024*1024/11,250 ~ 2983 chunks (2983 MB) needed to rebuild the failed disk
- Therefore, during disk rebuild (recovery), every disk needs to read 2983 MB and write 373 MB of data
- This takes ~336 seconds (assuming that the network resources are unbounded), or ~14 minutes assuming a 40 Gb InfiniBand network
MTTDL = MTTF^5 / (N * (N-1) * (N-2) * (N-3) * (N-4) * MTTR^4) ~ 8575 years
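Since the (8,12) code survives n - m = 4 failures, MTTR enters the MTTDL formula at the fourth power, so the result is very sensitive to the exact MTTR assumed (the slide reports ~8575 years; a different MTTR choice moves the figure by orders of magnitude). A sketch of the formula, with the ~336-second rebuild time as an assumed MTTR:

```python
# Erasure-coded (m=8, n=12) pool: MTTDL = MTTF^5 / (N*(N-1)*...*(N-4) * MTTR^4).
from math import prod

MTTF, N = 145_000, 11_250

def ec_mttdl_years(mttr_hours):
    denom = prod(N - i for i in range(5)) * mttr_hours**4
    return MTTF**5 / denom / (365 * 24)

print(ec_mttdl_years(336 / 3600))   # MTTR enters at the 4th power: very sensitive
```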
In Summary
- RAID is widely used today - a key concept in today's storage systems; the levels actually used are RAID 1, RAID 5, and RAID 6
- RAID improves reliability within a single box; this approach is reaching its limitations due to the growth in capacity
- New approaches (e.g., at cloud scale) build on replication (two- or three-way) and distributed ECC