HDD Ghosts, Goblins and Failures in RAID Storage Systems
  Jon G. Elerath, Ph.D.
  February 19, 2013
Agenda
  RAID > HDDs > RAID
  Redundant Array of Independent Disks
    What is RAID storage?
    How does RAID work?
    Common configurations
    Why is RAID needed?
  HDD ghosts, goblins and failures (errors and failures)
  HDD field reliability for RAIDed HDDs
  HDD reliability metric
  Realities of RAID operations
  Impact on RAID reliability estimates
HDDs in RAID Storage Systems
  What is RAID storage?
    Redundant Array of Independent Disks
  How does RAID work?
    Data redundancy
    Parity: exclusive-OR (XOR) logic
    Single parity creates one parity bit for a data stripe
    Dual parity creates two parity bits across different stripes of data
  Why is it needed?
RAID Storage Systems: Common Configurations
  RAID 0
    No redundancy
    Any failure causes data loss
  RAID 1
    Mirrored disks (n+n); all data is saved on two identical HDDs
    Data loss requires both disks to fail concurrently
  RAID 4 and RAID 5 (n+1)
    n data disks + 1 parity disk
    Data loss requires 2 failures
  RAID 6 (RAID DP) (n+2)
    n data disks + 2 parity disks
    Data loss requires 3 failures
RAID Operation (Overly Simplified)
  RAID 0: no redundancy
    [Diagram: blocks Data 1 through Data 6 spread across the disks, one copy of each]
  RAID 1: mirrored disks
    [Diagram: blocks Data 1, Data 2 and Data 3, each stored on both disks]
RAID Operation (Unbelievably Oversimplified)
  RAID 5: n+1 redundancy
    [Diagram: CPU, data disks, 1 ECC disk equivalent]
  RAID 6: n+2 redundancy
    [Diagram: CPU, data disks, 2 ECC disk equivalents]
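The single-parity (n+1) idea can be sketched in a few lines of Python. This is a minimal illustration, assuming byte-wise XOR parity over equal-length blocks; the function names `xor_parity` and `reconstruct` and the sample data are hypothetical, not part of any RAID product. Dual parity (RAID 6) needs a second, independent code and is not shown.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])

# Three data blocks in one stripe (illustrative values only).
data = [b"\x0f\xa0", b"\x33\x55", b"\xf0\x0c"]
parity = xor_parity(data)

# Simulate losing the second disk and rebuilding its block.
rebuilt = reconstruct([data[0], data[2]], parity)
print(rebuilt == data[1])
```

XOR-ing every data block together with the parity block yields all zeros, which is why any one missing block equals the XOR of the rest.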
HDDs in RAID Storage Systems
  Why is RAID needed?
    "Disk drives suck"
    A perception by Dan Warmenhoven, CEO NetApp (circa 2005)
HDD Events (Ghosts, Goblins and Failures)
  [Diagram: HDD event taxonomy]
  Uncorrectable: Cannot Find Data; Cannot Read Data
  Correctable in RAID: Data Missing; Error During Write; Written & Destroyed
  Causes shown: Servo Track, SMART Limit Hit, Electronics, Can't Stay on Track, Read Head, Media, Inherent Bit Error, High Fly Write, Scratched Media, Corrosion, Thermal Asperity
HDD Events (Ghosts, Goblins and Failures)
  Servo Track
    Servo data is written at regular intervals on every data track of every disk surface. The servo data is used to control the positioning of the read/write heads.
    Tracks on an HDD are not perfectly circular. Head position is continuously measured and compared to where it should be. A position error signal is used to reposition the head over the track.
    Repeatable run-out is part of normal HDD head positioning control. Non-repeatable run-out (NRRO) cannot be corrected by the HDD firmware since it is non-repeatable.
    NRRO, caused by mechanical tolerances from the motor bearings, actuator arm bearings, noise, vibration and servo loop response errors, can cause the head positioning to take too long to lock onto a track and ultimately produce an error.
HDD Events (Ghosts, Goblins and Failures)
  Most events are head and media related
Uncorrectable Errors and Field Failures (RAID)
  Uncorrectable errors constitute what most people think of as failures
    HDD crashed
    Won't spin
  Most field data is based on these ONLY
  Field data do not match vendor specifications
  HDD reliability depends on the HDD's
    Manufacturer
    Product family from the manufacturer
    Capacity
    Age
    Use conditions
Mean Time Between Failures (MTBF)
  MTBF is NOT a good metric for measuring HDD reliability
    (Elerath, J. G. 1998. Specifying Reliability in the Disk Drive Industry. Proc. Annual Rel. and Maint. Symposium, IEEE. pp. 194-199.)
  What does MTBF really mean to a consumer?
  Does NOT include correctable events
  [Figure: Combined Failure Rate. Failure rate h(t), 0 to 0.000008, versus time, 0 to 175200 hrs. Curves: early h(t), useful h(t), wearout h(t), total h(t)]
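The combined failure rate in the figure can be read as the sum of an early-life, a useful-life and a wear-out hazard. The sketch below assumes each component follows a Weibull hazard, which is a common modeling choice but is not stated on the slide; all parameter values are hypothetical and chosen only to show the shape, not to match the plotted curves.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1), t in hours (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Hypothetical parameters for illustration only (not taken from the slide).
EARLY = dict(beta=0.7, eta=2.0e6)    # beta < 1: decreasing failure rate
USEFUL = dict(beta=1.0, eta=1.0e6)   # beta = 1: constant failure rate
WEAROUT = dict(beta=3.0, eta=2.5e5)  # beta > 1: increasing failure rate

def total_hazard(t):
    """Total h(t) as the sum of the three component hazards."""
    return sum(weibull_hazard(t, **p) for p in (EARLY, USEFUL, WEAROUT))

for hours in (1000, 43800, 87600, 131400, 175200):
    print(f"{hours:>7} h  total h(t) = {total_hazard(hours):.3e} per hour")
```

A single MTBF number corresponds only to the constant (useful-life) term, so it cannot describe either the early decrease or the wear-out increase.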
Real HDD Field Reliability Data
  Vintage affects reliability
    [Figure: probability of failure versus time to failure (hrs). Curves: Vintage 1, Vintage 2, Vintage 3, Vintage 4, Vintage 5, Composite]
  Slightly decreasing failure rates, then increasing rate (esp. HDD #2)
    [Figure: probability plot of probability of failure versus time to failure (hrs). Curves: HDD #1, HDD #2, HDD #3]
Reality of RAID Operations
  RAID 4 and RAID 5 (n+1)
    The most likely combination of failures is an unknown data corruption in a small amount of data (several sectors), followed by a concurrent uncorrectable error in a different HDD.
    During reconstruction onto a new HDD, the corrupted data cannot be recovered.
  RAID 6 (n+2)
    The most likely combination is one data corruption followed by two simultaneous uncorrectable failures (HDD failures). The second failure occurs before the first can be incorporated into the RAID group and reconstructed.
    Far less likely than RAID 4 or RAID 5 data loss, but still measurable.
Reality of RAID Operations
  Read-after-write verification is too slow, so it is not used. Data is checked only on READ.
  Data corruptions can occur
    During READ operations
    During WRITE operations
    Whenever an HDD is spinning
    Sometimes when the HDD is not spinning (depending on the design)
  Small data corruptions
    All HDDs have built-in error recovery algorithms
    Small errors are corrected on the fly (in less than one revolution)
  Large data corruptions
    Require ECC and data from other HDDs in the RAID group
    Disk scrubbing by the RAID controller proactively finds these larger errors and corrects them to prevent data loss at the RAID group level
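A scrub pass can be pictured as a walk over every stripe that repairs unreadable blocks while redundancy is still intact. The toy model below is a sketch under strong assumptions: single XOR parity, stripes held in memory as lists of byte blocks, and `None` standing in for a sector the HDD could not read. The names `xor_blocks` and `scrub` are hypothetical and do not describe any real RAID controller.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def scrub(stripes):
    """Walk every stripe and rebuild any single unreadable block in place.

    Each stripe is a list of blocks (data blocks plus one XOR parity block).
    None marks an unreadable block. Returns (repaired, lost) stripe counts.
    """
    repaired, lost = 0, 0
    for stripe in stripes:
        bad = [i for i, blk in enumerate(stripe) if blk is None]
        if len(bad) == 1:
            # One block missing: XOR of the remaining blocks restores it.
            stripe[bad[0]] = xor_blocks([b for b in stripe if b is not None])
            repaired += 1
        elif len(bad) > 1:
            # Single parity cannot recover two missing blocks in one stripe.
            lost += 1
    return repaired, lost

stripes = [
    [b"\x01", b"\x02", b"\x03"],  # healthy stripe (last block is parity)
    [b"\x0a", None, b"\x0f"],     # one latent defect, repairable
    [None, None, b"\x00"],        # two defects, data lost
]
print(scrub(stripes))
```

The third stripe shows why scrub frequency matters: a defect that is not found and repaired before a second one appears in the same stripe can no longer be corrected.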
Impact on RAID Reliability Estimates
  Mean Time To Data Loss (MTTDL)
    Concept statistically flawed [1]
    Doesn't include undiscovered defects
    Ridiculously optimistic
  RAID 4/5 reliability estimates
    Monte Carlo simulations [1]
    Simple equation [2]
  RAID 6
    Monte Carlo simulations
    Simple equation [3]
  [Figure: DDFs per 1000 RAID groups (0 to 25) versus age in hours (0 to 13140), RAID group size = 14. Curves: Non-HPP Simulation, NetApp RAID Eq., NetApp Field Data (FC), MTTDL]
  http://raideqn.netapp.com
  [1] Elerath & Pecht. A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID). Trans. on Computers. 58, 3, March 2009.
  [2] Elerath. A Simple Equation for Estimating Reliability of an N+1 Redundant Array of Independent Disks (RAID). DSN 2009.
  [3] Elerath & Schindler. Beyond MTTDL: A closed form RAID 6 reliability equation. WIP. FAST 2013.
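For context, the textbook MTTDL approximation for a single-parity group is MTTF^2 / (N x (N - 1) x MTTR), where N is the total number of HDDs in the group. It assumes constant failure and repair rates and ignores latent defects. The sketch below evaluates it for a 14-disk group, the group size shown in the figure; the MTTF and MTTR values are hypothetical inputs chosen for illustration, and the function name is not from the cited papers.

```python
HOURS_PER_YEAR = 8760

def mttdl_single_parity(mttf_hours, mttr_hours, n_disks):
    """Classic MTTDL approximation for one single-parity group of n_disks HDDs.

    MTTDL ~= MTTF**2 / (N * (N - 1) * MTTR), assuming constant failure and
    repair rates and no latent (undiscovered) defects.
    """
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

# Hypothetical inputs: 1,000,000 h MTTF, 24 h rebuild time, 14-disk group.
mttdl = mttdl_single_parity(mttf_hours=1.0e6, mttr_hours=24.0, n_disks=14)
print(f"MTTDL = {mttdl:.3e} hours ({mttdl / HOURS_PER_YEAR:,.0f} years)")
```

With these inputs the result comes out in the tens of thousands of years, which illustrates the kind of optimism the slide objects to: the formula has no term for corruptions that sit undiscovered until a rebuild needs the data.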
Conclusion
  HDD head/disk interface is critical to achieving RAID reliability
  RAID designers must do background media scrubbing
  Scrub frequency is critical to RAID reliability
  Larger HDDs (2+ TB) have more opportunities to develop media defects and require long times to scrub
  Field data does not map to true defect rates