Enhanced Reliability Modeling of RAID Storage Systems


Enhanced Reliability Modeling of RAID Storage Systems

Jon G. Elerath, Network Appliance, Inc.
Michael Pecht, University of Maryland

Abstract

A flexible model for estimating the reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and usage-dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, would result in double disk failures. Field-based times-to-failure data and mathematical justification for a new model are presented. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as estimated using the current mean time to data loss method.

1. Introduction

Storage systems consisting of redundant arrays of inexpensive disks (RAID) were developed circa 1988 to improve storage system reliability [1]. Reliability estimates were created assuming that both hard disk drive (HDD) failures and RAID system failures follow a homogeneous Poisson process. If these assumptions are accepted, the time to failure for the common RAID 4 or RAID 5 system can be expressed as the mean time to data loss (MTTDL), or an average time to double-disk failures (DDF). MTTDL is commonly turned into an hourly rate, using the exponentially distributed component failure rate. This means the probability of system failure in any time interval is constant. For example, operating 100 RAID groups for 87,600 hours will have the same probability of failure as operating 87,600 groups for 100 hours. That is, assuming renewal theory, it is assumed that the number of DDFs can be estimated by multiplying the system failure rate by time, N(t) = λt, where N(t) is the estimated number of failures and λ is the constant system failure rate per unit time (in the same units as t). This is an attempt to invoke the relationship N(t) = λt ≈ 1 − exp(−λt), assuming that the approximation error is less than 1% when λt is sufficiently small. However, the failure rate of a component, h(t), is statistically different from the failure rate of the system, more correctly referred to as the rate of occurrence of failure [2]-[5]. As noted by Ascher [6], there is little connection between the properties of component hazard rates and the properties of the process that produces a sequence of failures. That is, times between successive system failures can become increasingly larger even though each component hazard rate is increasing [3]. Even if the HDD follows a homogeneous Poisson process (HPP), there is no statistical basis for assuming the system will be an HPP.

A second contributory problem is the assumption that HDD failures follow an HPP. Recent field data analyses show that HDD failure rates are anything but constant. Data for specific HDD products often indicate sub-populations such as infant mortality or wear-out. Different vintages of the same HDD from the same manufacturer may exhibit varying failure distributions. A third issue with current methods is that undiscovered data corruptions can occur at any time in the life of an HDD.
These defects were acknowledged by Kari [9], but he assumed that they were caused only by media deterioration and were independent of usage. While Schwarz included latent defects to optimize scrub algorithms [10], he still assumed the system follows a homogeneous Poisson process with constant failure rates. The significance of undiscovered latent defects (LDs) is apparent when a catastrophic (operational) failure ultimately occurs. The latent defect, combined with the operational failure, constitutes a DDF, defeating the reliability gains of (N+1) RAID. These three issues raise the question of the usefulness of MTTDL models in estimating the number of DDFs in a RAID group. This paper presents a new model that includes latent defects and does not assume an HPP for the HDD or the system. HDD failure modes and mechanisms are briefly presented to justify the need

for discerning between operational failures and latent defects, which are modeled explicitly. Data scrubbing [10], the remedy for latent defects, is also incorporated in the model. Times to fail are supported through actual field data; times to restore are modeled and acknowledge a minimum and maximum time to restore a failed HDD and reconstruct the lost data. The model is evaluated using a sequential Monte Carlo simulation. The expected number of DDFs predicted by the MTTDL method is compared to the number estimated by the proposed model, showing that the previous assumptions result in incorrect predictions.

2. Field reliability data

A number of recent papers shed light on the distributions underlying HDD failures [7], [8], [10]-[13]. Figure 1 and Figure 2 present Weibull probability plots for new, unpublished data. Several noteworthy observations can be made from the aggregate of these data:
- HDD failure rates are rarely constant.
- Failure distributions exhibit decreasing failure rates, late-life increasing failure rates, early-life increasing failure rates, vintage-based improvements, and vintage-based deterioration.
- Distributions change as a result of both design and manufacturing process changes.

The above observations are a significant departure from the assumption of constant failure rates. In Figure 1, data for three different products are plotted assuming a two-parameter Weibull distribution (a straight line indicates a good fit). Only HDD #1 appears to follow a Weibull distribution. Both of the other two datasets are clearly not linear and indicate abrupt changes in the distribution. HDD #2 shows two separate linear sections, denoting that two distributions dominate at different points in time, with the last one, sometime after 10,000 hours, having a marked increase in failure rate (the data plot bends upwards). Failure analyses showed the slope change was due to a change in failure mechanisms. HDD #3 shows two inflection points. Initially, the failure rate is high but decreasing, and follows the slope of HDD #1 (β = 0.9). A significant decrease occurs (for the population) followed by a significant increase (the plot line bends upward). This population has the characteristics of both competing risks and population mixtures. In mixed populations, some of the HDDs have a failure mechanism that the others do not have and so do not, in fact, fail from that mechanism. An example in an HDD is particle contamination [11]. A mixture of populations is likely responsible for the first inflection point for HDD #3 in Figure 1 (decrease in failure rate) and competing risks for the second (upturn in failure rate).

Figure 1. Cumulative probability of failure versus time to failure (hrs). Only HDD #1 fits a Weibull distribution (straight line).

Figure 2. HDD vintage effects. Vintage 1: β1=1.0987, η1=4.5444e+5 (F=198, S=10433); Vintage 2: β2=1.2162, η2=1.2566e+5 (F=992, S=23064); Vintage 3: β3=1.4873, η3=7.5012e+4 (F=921, S=22913).

In Figure 2, the three lines represent three nonconsecutive HDD vintages from one manufacturer. Vintage 1 has an essentially constant failure rate (β=1.09), whereas the others are increasing (β=1.2 and β=1.4). In a Weibull distribution the shape parameter, β, indicates

whether the failure rate is decreasing (β<1.0), constant (β=1.0), or increasing (β>1.0).

3. HDD failure modes and mechanisms

In MTTDL calculations, all HDDs are assumed to have a single failure rate for catastrophic failures, and latent defects are ignored. But latent defects are significant and must be included. Failure modes and mechanisms based on HDD electro-mechanical and magnetic events are summarized in Figure 3, grouped by one of two possible consequences: operational (catastrophic) failures or latent defects. Each group has its own unique failure distribution and consequence at the system level. All read failures can be classified as 1) the HDD being incapable of finding the data or 2) the data being missing or corrupted. Figure 3 shows a structured list of the major causes of inability to read data. The failure mechanisms presented here are not novel [10], [14], but neither are they readily available from HDD manufacturers. The novelty is their use in the model.

Operational Failures (Cannot find data):
- Bad servo-track
- Bad electronics
- Can't stay on track
- Bad read head
- SMART limit exceeded

Latent Defects (Data missing):
- Error during writing: bad media, inherent bit-error rate, high-fly writes
- Written but destroyed: thermal asperities, corrosion, scratched media

Figure 3. Breakdown for read error causes.

3.1 Cannot find data

The inability to "find" data is most often caused by operational failures, which can occur any time the HDD disks are spinning and the heads are staying on track. Heads must read servo wedges that are permanently recorded onto the media during the manufacturing process and cannot be reconstructed with RAID if they are destroyed. These segments contain no user data, but provide information used solely to control the positioning of the read/write heads for all movements. If servo-track data is destroyed or corrupted, the head cannot correctly position itself, resulting in loss of access to user data even though the user's data is uncorrupted. Servo tracks can be damaged by scratches or thermal asperities.

Tracks on an HDD are never perfectly circular. The present head position is continuously measured and compared to where it should be, and a position error signal is used to properly reposition the head over the track. This repeatable run-out is all part of normal HDD head-positioning control. Non-repeatable run-out caused by mechanical tolerances from the motor bearings, excessive wear, actuator arm bearings, noise, vibration, and servo-loop response errors can cause the head positioning to take too long to lock onto a track and ultimately produce an error. High rotational speeds exacerbate this mechanism in both ball and fluid-dynamic bearings.

HDDs use self-monitoring, analysis, and reporting technology (SMART) to predict impending failure based on performance data. For example, data reallocations are expected and many spare sectors are available on each HDD, but an excessive number in a specific time interval will exceed the SMART threshold, resulting in a SMART trip.

Currently, most head failures are due to changes in magnetic properties. Electro-static discharge (ESD), physical impact (contamination), and high temperatures can accelerate magnetic degradation. ESD-induced degradation is difficult to detect and can propagate to full failure when exposed to localized heat from thermal asperities (T/As). The HDD electronics are attached to the outside of the HDD; DRAM and cracked chip-capacitors have also been known to cause failure.
3.2 Data missing

Data is sometimes written poorly initially, but can be corrupted after being written. Unless corrected, missing and corrupted data become latent defects.

1. Errors during writing

The bit-error rate (BER) is a statistical measure of the effectiveness of all the electrical, mechanical, magnetic, and firmware control systems working together to write (or read) data. Most bit-errors occur on a read command and are corrected, but since written data is rarely checked immediately after writing, bit-errors can also occur during writes. BER accounts for some fraction of defective data written to the HDD, but a greater source of errors is the magnetic recording media coating the disks. Writing on scratched, smeared, or pitted media can result in corrupted data. Scratches can be caused by loose hard particles (TiW, Si2O3, C) becoming lodged between the head and the media surface. Smears, caused by soft particles such as stainless

steel and aluminum, will also corrupt data. Pits and voids are caused by particles that were originally embedded in the media during the sputtering process and subsequently dislodged during the final processing steps, the polishing process to remove embedded contaminants, or field use. Hydrocarbon contamination (machine oil) on the disk surface can result in write errors as well. A common cause of poorly written data is the high-fly write. The heads are aerodynamically designed to have a negative pressure and maintain a small, fixed distance above the disk surface at all times. If the aerodynamics are perturbed, the head can fly too high, resulting in weakly (magnetically) written data that cannot be read. All disks have a very thin film of lubricant on them as protection from head-disk contact, but lubrication build-up on the head can increase the flying height.

2. Data written but destroyed

Most RAID reliability models assume that data will remain undestroyed except by degradation of the magnetic properties of the media ("bit-rot"). While it is correct that media can degrade, this failure mechanism is not a significant cause. Data can become corrupted any time the disks are spinning, even when data is not being written to or read from the disk. Three common causes of erasure are thermal asperities, corrosion, and scratches/smears. Thermal asperities are instances of high heat for a short duration caused by head-disk contact. This is usually the result of heads hitting small bumps created by particles embedded in the media surface during the manufacturing process. The heat generated by a single contact may not be sufficient to thermally erase data, but may be sufficient after many contacts. Heads are designed to push particles away, but contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD, such as Al2O3, TiW, and C, can cause surface scratches and data erasure any time the disk is rotating. Other soft materials such as stainless steel can come from assembly tooling. Soft particles tend to smear across the surface of the media, rendering the data unreadable. Corrosion, although carefully controlled, can also cause data erasure and may be accelerated by T/A-generated heat.

4. Model logic

RAID reduces the probability of data loss by grouping together multiple inexpensive hard disk drives in a redundant configuration and adding error correction using parity. Most RAID configurations use a single additional HDD within the RAID group for redundancy. As part of the write process, an exclusive-OR calculation generates parity bits that are also written to the RAID group. Error correcting codes (ECC) on the HDD and parity across the HDDs are common methods to ensure accurate data transfer and recording. ECC uses Boolean operations to encode blocks of data, interleaving the data and the ECC bits. On each read command, user data and ECC are read. If a data inconsistency occurs, the data is corrected on-the-fly (in less than one revolution), data integrity is preserved, and performance is not degraded. ECC strength is enhanced by interleaving multiple blocks of data so that errors covering a large physical area (many bits) can be corrected. ECC is faster than data recovery across multiple HDDs, but since ECC is read with every block of user data, excessive ECC use can degrade performance.

4.1 Previous models

MTTDL was introduced as the measure of RAID group reliability nearly 20 years ago [1].
Researchers have attempted to improve RAID reliability models, but the primary change has been to introduce Markov models, resulting in a probability of failure rather than an MTTDL [7], [15], [16]. Ultimately, all past work is based on the assumption of constant failure and repair rates. A review of the methods used to assess reliability in papers [1], [17]-[20] identified the following deficiencies:
1. Failure rates are not constant in time.
2. Failure rates change based on production vintage.
3. Failure distributions can be mixtures of multiple distributions because of production vintages.
4. Repair rates are not constant, and there exists a minimum time to complete restoration.
5. Permanent errors can occur at any time.
6. Latent defects must be considered in the model.
7. RAID system failures are assumed to follow a homogeneous Poisson process.

MTTDL attempts to estimate the average time between simultaneous failures of two hard disk drives in an (N+1) RAID group. Disk drives are assumed to have constant failure rates, λ, the reciprocal of the mean time to failure, and constant repair (restoration) rates, μ, the reciprocal of the mean time to restore; RAID group failures are assumed to follow a homogeneous Poisson process. Based on these assumptions, an N+1 RAID group has an MTTDL as shown in equation 1.

MTTDL = [(2N + 1)λ + μ] / [N(N + 1)λ²]        eq. 1
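As a quick sanity check of equation 1 and the constant-rate assumption behind it, the short sketch below evaluates the MTTDL and the implied expected number of DDFs using the example values quoted just below (MTBF = 461,386 hrs, MTTR = 12 hrs, N = 7, 1,000 RAID groups, 10 years). It is only an illustration of the arithmetic, not part of the authors' model.

# Numeric sketch of eq. 1 (and the usual mu >> lambda simplification); illustrative only.
N = 7                          # N+1 = 8 HDDs in the RAID group
lam = 1.0 / 461_386.0          # constant HDD failure rate per hour (1/MTBF)
mu = 1.0 / 12.0                # constant restoration rate per hour (1/MTTR)

mttdl_eq1 = ((2 * N + 1) * lam + mu) / (N * (N + 1) * lam ** 2)   # eq. 1, hours
mttdl_simplified = mu / (N * (N + 1) * lam ** 2)                  # mu >> lambda, hours

HOURS_PER_YEAR = 8_760
print(mttdl_eq1 / HOURS_PER_YEAR)         # ~36,000 years
print(mttdl_simplified / HOURS_PER_YEAR)  # ~36,160 years (the 36,162 yr quoted with eq. 3)

# Under the same HPP assumption, the expected DDF count is just group-years / MTTDL:
# 1,000 RAID groups operated for 10 years gives roughly 0.28 expected DDFs.
print(1_000 * 10 / (mttdl_simplified / HOURS_PER_YEAR))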

Since the repair rate is usually much larger than the failure rate, the MTTDL expression can be simplified.

MTTDL_Indep ≈ μ / [N(N + 1)λ²] = MTTF²_disk / [N(N + 1) · MTTR_disk]        eq. 2

From the MTTDL, the expected number of failures, E[N(t)], in a time interval is estimated by multiplying the time interval by the number of systems and dividing by the MTTDL. Equation 3 shows the estimate for an MTTDL of 36,162 years (MTBF = 461,386 hrs; MTTR = 12 hrs; N = 7), 1,000 RAID groups, and 10 years of operation.

N(t) = (10 yrs × 1,000 RAID groups) / (36,162 yrs per failure) ≈ 0.28 failures        eq. 3

This calculation does not include latent defects or non-constant failure or restoration rates. Non-constant failure rates invalidate the MTTDL.

4.2 NHPP-latent defect model

The state diagram in Figure 4 is used to convey the model logic at a high level. The model is evaluated using Monte Carlo simulation rather than a Markov model because estimating the number of failures in time from a probability model can be erroneous [21]. Four distributions are required, denoted d in Figure 4: time to operational failure, time to latent defect, time to operational repair, and time to scrub (latent defect repair). System failure occurs when two HDDs fail simultaneously, depicted as states 3 and 5. An operational failure (Op) is one in which no data on the HDD can be read, even though the data may have no defect. Removal and replacement of the HDD is the only resolution for operational failures. A latent defect (Ld) refers to unknown or undetected data corruption. Latent defects are corrected only when the corrupted data is read; correction requires reading the data on the other HDDs in the RAID group and the associated parity bits. If only a few blocks of data are corrupted, the reconstructed data is written to another good section of the HDD and the faulty section is mapped out to prevent reuse.

The order of occurrence of operational failures and latent defects is significant. If an operational failure occurs after the existence of a latent defect on a different HDD, the data cannot be reconstructed on the replacement HDD because the required redundant data is corrupted or missing. Thus, a latent defect followed by an operational failure results in a DDF. Write errors that occur during reconstruction of an HDD will be corrected the next time the data is read or will remain as latent defects, but their creation during a reconstruction does not constitute a DDF. The probability of suffering a usage-related data corruption in an unread area during the time of reconstruction is small, so DDFs rarely occur during reconstructions. Multiple HDDs with latent defects do not constitute a DDF unless the defects happen to coexist in blocks from a single data stripe across more than one HDD, an extremely rare event that is not modeled.

Figure 4. State diagram for an N+1 RAID group. State 1: fully functional (all HDDs in the RAID group are operating). State 2: degraded, one latent defect (1 Ld). State 4: degraded, one operational failure (1 Op). States 3 and 5: failed (DDF). Transitions are of the form g[(N+1); dLd], g[dScrub], g[(N+1); dOp], g[(N); dOp], g[dRestore]. Note 1: the Op failure must be on a different HDD than the one with the Ld. Note 2: the excessive-reallocation transition does not have an explicit rate; it is included in the measured rate of Op failures from field data.

Recently, latent defects have been recognized by some system integrators and have been reduced by data scrubbing. Schwarz [10] presented a Markov model for mirrored HDDs in an off-line archive system including scrub optimization, but the analysis did not include large RAID groups with latent defects.
During scrubbing, data on the HDD is read and checked against its parity bits even though the data is not being requested by the user. The corrupted data is corrected, bad spots on the media are mapped out, and the data is saved to good locations on the HDD. Since this is a background activity, it may be rather slow so that it does not impede performance. Depending on the foreground I/O demand, the scrub time may be as short as the maximum HDD and data-bus transfer rates permit, or may be as long as weeks. In summary, the two scenarios that result in a DDF are 1) two simultaneous operational failures and 2) an operational failure that occurs after a latent defect has been introduced and before it is corrected. Multiple simultaneous latent defects do not constitute a failure.
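As an illustration of these two scenarios, the minimal Python sketch below encodes the DDF decision for a single operational failure, given the failure and defect windows on the other HDDs in the group. The record format is hypothetical, not the authors' data structure.

def is_ddf(op_failure_time, other_hdd_events):
    """Decide whether an operational failure on one HDD constitutes a DDF.

    other_hdd_events lists (kind, start, cleared) tuples for *other* HDDs in the
    RAID group: kind 'op' is an operational failure restored at time 'cleared';
    kind 'ld' is a latent defect scrubbed/corrected at time 'cleared'."""
    for kind, start, cleared in other_hdd_events:
        if kind == 'op' and start < op_failure_time < cleared:
            return True   # scenario 1: two simultaneous operational failures
        if kind == 'ld' and start < op_failure_time < cleared:
            return True   # scenario 2: op failure after an uncorrected latent defect
    return False

# Example: a latent defect at 4,000 h that is not scrubbed until 4,168 h exposes the
# group; an operational failure on another HDD at 4,100 h is therefore a DDF.
print(is_ddf(4_100, [('ld', 4_000, 4_168)]))   # True
print(is_ddf(4_200, [('ld', 4_000, 4_168)]))   # False: the defect was already corrected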

The model logic is partially depicted in Figure 4. In state 1, the data and parity HDDs are good, there are no latent defects, and a spare HDD is available. In state 2, one or more HDDs have latent defects. Failure transitions depend on the number of HDDs available and the distribution of time to failure or restoration. A generic functional notation, g[a; b], is used to represent transitions and the critical variables a and b without conveying any specific operation. For example, the transition from state 1 to state 2 is a function of the N+1 HDDs developing a latent defect according to the failure distribution dLd. From state 2, an operational failure in any of the N HDDs other than the one with the latent defect results in state 3, a DDF state. The transition from state 2 to 3 is governed by the operational failure distribution dOp. A transition from state 2 to state 4 occurs when the time to reallocate a sudden burst of media defects on a single HDD exceeds a user-specified threshold, resulting in a "time-out" error or a SMART trip such as excessive block reallocations. In this transition, massive media problems render the HDD inoperative, just like any other operational failure, so the frequency of the transition from state 2 to state 4 is included in the operational failure distribution dOp. A third transition from state 2 is back to state 1; this represents repair of latent defects according to the scrubbing distribution dScrub. State 4 represents one operational failure. The transition to state 4 from state 1 is a function of the number of HDDs in the RAID group and the operational failure distribution dOp. There are two transitions out of state 4. A second simultaneous operational failure results in a transition to DDF state 5. Otherwise, the failed HDD is replaced with a new HDD and the data is reconstructed according to the restoration distribution dRestore, returning the RAID group to state 1 with full operability. The distribution dRestore includes the delay time to physically incorporate the spare HDD and has a minimum time to reconstruct based on the HDD capacity, the maximum transfer rate, and concurrent I/O.

5. Sequential Monte Carlo modeling

In a sequential Monte Carlo simulation, the time-dependent, or chronological, behavior of the system is simulated [22]. For each HDD in the RAID group, each transition distribution in Figure 4 is sampled. The operating and failure times are accumulated until a specified mission time is exceeded. This research uses a mission of 87,600 hours (10 years). During that time, the sequence of HDD failures, repairs, latent defects, scrubs, and DDFs is tracked. Each sequence of sampling required to reach the mission is a single simulation and represents one possible system operating chronology. If 10,000 simulations are run to develop the cumulative failure function described in [23], it is equivalent to monitoring the number of DDFs for 10,000 systems over the mission life. Figure 5 shows the sequential sampling process used. For simplicity, only four HDD slots are shown. The graph looks like a digital timing diagram, with the high signal representing the operating (non-defective) condition and the low signal representing the failed (defective) condition. Throughout this process, each HDD slot in the RAID group carries its own times-to-failure (both TTOp and TTLd) and times-to-restore (TTR and TTScrub) distributions. When a DDF involves an HDD with a latent defect, the TTR for the failure is the same as the concomitant operational failure time.
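The sketch below is a much-simplified rendition of this sequential sampling idea, intended only to make the mechanics concrete. The three-parameter Weibull form is the one defined in section 6; the mission length, group size, TTOp characteristic life, and TTR parameters are taken from the text, while the TTOp shape and the TTLd/TTScrub parameters are illustrative placeholders. The bookkeeping (for example, the shifted restart times of Figure 5) is deliberately simplified and is not the authors' implementation.

import math, random

MISSION = 87_600.0      # 10-year mission, hours (from the text)
GROUP_SIZE = 8          # N+1 HDDs per RAID group

# Transition distributions as (gamma, eta, beta).  TTOp's eta and the TTR values come
# from the text; the TTOp beta and the TTLd/TTScrub parameters are assumed placeholders.
TTOP    = (0.0, 461_386.0, 1.1)   # operational failure
TTR     = (6.0, 12.0, 2.0)        # restore/reconstruct after an operational failure
TTLD    = (0.0, 9_259.0, 1.0)     # latent defect; eta ~ 1 / (1.08e-4 err/hr), assumed
TTSCRUB = (6.0, 168.0, 3.0)       # scrub completion, assumed 168 h characteristic life

def weib3(params):
    """Inverse-CDF sample from a three-parameter Weibull (location, scale, shape)."""
    gamma, eta, beta = params
    return gamma + eta * (-math.log(1.0 - random.random())) ** (1.0 / beta)

def outage_intervals(fail_dist, repair_dist):
    """[start, end) windows during which one slot is failed (or carries a defect)."""
    t, windows = 0.0, []
    while True:
        t += weib3(fail_dist)                 # next failure or defect
        if t > MISSION:
            return windows
        t_clear = t + weib3(repair_dist)      # restored or scrubbed
        windows.append((t, t_clear))
        t = t_clear

def simulate_group():
    op = [outage_intervals(TTOP, TTR) for _ in range(GROUP_SIZE)]
    ld = [outage_intervals(TTLD, TTSCRUB) for _ in range(GROUP_SIZE)]
    ddf = 0
    for i in range(GROUP_SIZE):
        for start, _ in op[i]:
            # DDF if, at the moment slot i fails operationally, any *other* slot is
            # either down with an operational failure or carries an unscrubbed defect.
            exposed = any(s < start < e
                          for j in range(GROUP_SIZE) if j != i
                          for s, e in op[j] + ld[j])
            ddf += exposed
    return ddf

random.seed(1)
groups = 1000
print(sum(simulate_group() for _ in range(groups)), "DDFs per", groups, "RAID groups in 10 years")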
Figure 5. Timing diagram for sampling TTFs and TTRs (four HDD slots shown). Initially, a TTF and TTR are sampled for each HDD slot, t1 to t8. Then, pair-wise comparisons (e.g., "Is t1 < t3 < t2?") are made, as indicated below the diagram, to determine whether a newly sampled failure falls within another slot's failure-to-restoration window (a DDF); after each comparison, a new TTF and TTR are sampled for the slot just processed, and restart times are shifted to coincide with the completion of the concurrent restoration.

The simulation begins by sampling a TTOp and a TTLd for every HDD and storing the times in separate arrays. For the two HDDs with the shortest times to failure (or defect), a time to restore (or time to scrub) is sampled. If two operational failures exist simultaneously, a DDF occurs. Since two latent defects will not fail the system, there is no DDF if the shortest and second-shortest event times are both latent defects. If one event is an operational failure and one

is a latent defect, a DDF exists when the operational failure occurs after the latent defect has occurred and before the scrub process corrects the corrupted data from the latent defect. A system failure does not occur if the shortest time is an operational failure and the second-shortest is a latent defect. Once a DDF has occurred, a subsequent one cannot occur until the first is restored. If no DDF is detected, then the TTR (or TTScrub) that has already been sampled and used in the preceding comparison is added to the earliest time to failure. A new TTOp (or TTLd) is sampled and added to the previous sum, the HDDs are again sorted, and a slot is removed from further sampling once its cumulative time exceeds the mission time. This process is reiterated until the cumulative operating times for all HDD slots have exceeded the mission time.

6. Transition distributions

The four component-related distributions required for this model are time to operational failure, time to restore an operational failure, time to generation of a latent defect, and time to scrub HDDs for latent defects. The simulations in this paper use a three-parameter Weibull probability density function, f(t), of the form

f(t) = (β/η) · ((t − γ)/η)^(β−1) · exp[−((t − γ)/η)^β]

where γ is the location parameter, η is the characteristic life, and β is the shape parameter.

6.1 Time to operational failure (TTOp)

To illustrate the improvement over the MTTDL method, a single TTOp distribution is used: a Weibull failure distribution with a slightly increasing failure rate. The characteristic life, η, is 461,386 hours, and the shape parameter, β, is slightly greater than 1. These parameters are from a field population of over 120,000 HDDs that operated for up to 6,000 hours each.

6.2 Time to restore (TTR)

A constant restoration rate implies that the probability of completing the restoration in any time interval is as likely as in any other interval of equal length. Therefore, it is just as likely to complete restoration in the interval 0 to 48 hours as in the interval 1,000 to 1,048 hours. But this is clearly unrealistic for two reasons. First, there is a finite amount of time required to reconstruct all the data on the HDD. It is a function of the HDD capacity, the data rate of the HDD, the data rate of the data-bus, the number of HDDs on the data-bus, and the amount of I/O transferred as a foreground process. Reconstruction is performed on a high-priority basis but does not stop all other I/O to accelerate completion. This model recognizes that there is a minimum time before which the probability of being fully restored is zero. Fibre Channel HDDs can sustain up to 100 MB/s data transfer rates, although 50 MB/s is more common. The data-bus to which the RAID group is attached has only a 2 Gb/s capability. Thus, in a RAID group of 14, a 144 GB HDD on a Fibre Channel interface will require a minimum of three hours with no other I/O to reconstruct the failed HDD. A 500 GB Serial ATA HDD on a 1.5 Gb/s data-bus will require 10.4 hours to read all other HDDs and reconstruct a replaced HDD. The added I/O associated with continuing to serve data will lengthen the time to restore an operational failure. Second, some operating systems place a limit on the amount of I/O that takes place during reconstruction, thereby assuring that reconstruction will complete in a prescribed amount of time; this results in a maximum reconstruction time. The minimum time of six hours is used for the location parameter.
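The rough sketch below reproduces this minimum-reconstruction-time arithmetic under one simplifying assumption: a rebuild streams every surviving drive plus the replacement drive across the shared data-bus at its raw payload rate, with no competing I/O and no protocol overhead. With that assumption the Serial ATA case lands near the 10.4-hour figure above, while the Fibre Channel case comes out somewhat below the quoted three hours, so treat it only as an order-of-magnitude illustration.

# Minimum time to rebuild one failed HDD, bus-limited; see assumptions above.
def min_rebuild_hours(capacity_gb, group_size, bus_gbit_per_s):
    bus_mb_per_s = bus_gbit_per_s * 1000.0 / 8.0     # raw payload rate of the shared bus
    total_mb = group_size * capacity_gb * 1000.0     # (group_size - 1) reads + 1 write
    return total_mb / bus_mb_per_s / 3600.0

print(min_rebuild_hours(144, 14, 2.0))   # Fibre Channel case: ~2.2 h (text quotes ~3 h)
print(min_rebuild_hours(500, 14, 1.5))   # Serial ATA case: ~10.4 h, matching the text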
The shape parameter of 2 generates a right-skewed distribution, and the characteristic life is 12 hours.

6.3 Time to latent defect (TTLd)

Personal conversations with engineers from four of the world's leading HDD manufacturers support the contention that HDD failure rates are usage dependent, but the exact transfer function of reliability as a function of use (number of reads and writes, lengths of reads and writes, sequential versus random) is not known (or they aren't telling anyone). These analyses approximate usage by combining read errors per Byte read and the average number of Bytes read per hour. The result is shown in Table 1, and the following discussion is the justification. Schwarz [10] claims the rate of data corruption is five times the rate of HDD operating failures. Network Appliance completed a study in late 2004 on 282,000 HDDs used in RAID architectures. The read error rate (RER), averaged over three months, was 8x10^-14 errors per Byte read. At the same time, another analysis of 66,800 HDDs showed an RER of approximately 3.2x10^-13 errors per Byte. A more recent analysis of 63,000 HDDs over five months showed a much improved 8x10^-15 errors per Byte read. In these studies, data corruption was verified by the HDD manufacturer as an HDD problem and not a result of the operating system controlling the RAID group.
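The entries of Table 1 (given with the hourly read volumes in the next paragraph) are just the product of these read-error rates and an assumed Bytes-read-per-hour figure; the snippet below spells out that arithmetic.

# Latent-defect generation rate = read-error rate (errors/Byte) x Bytes read per hour.
read_error_rates = {"low": 8.0e-15, "med": 8.0e-14, "high": 3.2e-13}   # errors per Byte
bytes_read_per_hour = {"low": 1.35e9, "high": 1.35e10}                 # Bytes per hour

for rer_name, rer in read_error_rates.items():
    for rate_name, bph in bytes_read_per_hour.items():
        print(f"RER {rer_name}, read rate {rate_name}: {rer * bph:.2e} errors/hr")

# For example, the medium RER at the high read rate gives ~1.1e-3 errors/hr, i.e. a
# mean time to latent defect on the order of 900 hours for a single HDD.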

While Gray [25] asserts that it is reasonable to transfer 4.32x10^12 Bytes/day/HDD, the study of 63,000 HDDs read 7.3x10^17 Bytes of data in five months, an approximate read rate of 2.7x10^11 Bytes/day/HDD. The following studies used a high of 1.35x10^10 Bytes/hour and a low of 1.35x10^9 Bytes/hour. Using combinations of the RERs and the number of Bytes read yields the hourly read failure rates in Table 1.

Table 1. Range of average read error rates (errors per hour per HDD)
                              Bytes read per hour
Read errors per Byte          Low rate (1.35x10^9)     High rate (1.35x10^10)
Low  (8.0x10^-15)             1.08x10^-5 Err/hr        1.08x10^-4 Err/hr
Med  (8.0x10^-14)             1.08x10^-4 Err/hr        1.08x10^-3 Err/hr
High (3.2x10^-13)             4.32x10^-4 Err/hr        4.32x10^-3 Err/hr

6.4 Time to scrub (TTScrub)

Latent defects (data corruptions) can occur any time the disks are spinning. However, these defects can be eliminated by background scrubbing, which is essentially preventive maintenance on data errors. Scrubbing occurs during times of idleness or low I/O activity. During scrubbing, data is read and compared to the parity. If they are consistent, no action is taken. If they are inconsistent, the corrupted data is recovered and rewritten to the HDD. If the media is defective, the recovered data is written to new physical sectors on the HDD and the bad blocks are mapped out. Scrubbing is a background activity performed on an as-possible basis so it does not affect performance. If not scrubbed, the period of time to accumulate latent defects starts when the HDD first begins operation in the system. The latent defect rate is assumed to be constant with respect to time (β=1) and is based on the error generation rate and the hourly data transfer rate. As with full HDD data reconstruction, the time required to scrub an entire HDD is a random variable that depends on the HDD capacity and the amount of foreground activity. The minimum time to cover the entire HDD is based on capacity and foreground I/O. The operating system may invoke a maximum time to complete scrubbing. In all cases the shape parameter, β, is 3, which produces a Normal-shaped distribution after the delay set by the location parameter, γ.

7. Results

Analyses were conducted to study the effects of parametric variants of a base case with the parameters shown in Table 2. All analyses have an 87,600-hour (10-year) mission and 8 HDDs in a RAID group.

Table 2. Base case input parameters: γ, η, and β for each of the four distributions (operational failure distributions TTOp and TTR; latent defect distributions TTLd and TTScrub).

Four variants of the base case, none of which include latent defects or scrubbing, are shown in Figure 6. Line "c-c" has constant rates for both failures and restorations. Line "f(t)-c" has time-dependent failure rates and constant restoration rates. Line "c-r(t)" has constant failure rates and time-dependent restoration rates. For line "f(t)-r(t)," failures and restorations are both time dependent, as per Table 2. The last line is based on the MTTDL assuming constant failure and restoration rates. As expected, the model result "c-c" follows the MTTDL line closely. The plot shows the model will produce the same results as the MTTDL under the same (time-independent rate) assumptions but is sensitive to time-dependent failure and restoration rates. The directions of change are counter-intuitive, but result from the shift in the probability density function "mass" when the characteristic life is not changed.

Figure 6. Model compared to MTTDL without latent defects (DDFs per 1,000 RAID groups versus time in hours, for the MTTDL, c-c, f(t)-c, c-r(t), and f(t)-r(t) cases).

The difference between the MTTDL and the model is on the order of 2 to 1.
If latent defects are not included, this difference may not be enough to warrant the use of this complex model. However, when latent defects are added to the analysis, the differences become great. Figure 7 compares the base case (including latent defects and 168 hour scrub) to the case of latent defects without scrubbing, which introduces significantly more DDFs. Notice that in both of these studies, the plot lines are not linear, showing the effects of the time-dependent failure and restoration rates. The increasing rate of occurrence of

failure (ROCOF) is verified by finding the number of DDFs that occur in any fixed time interval (Figure 8).

Figure 7. Effects of latent defects with no scrub and with a 168 hr scrub (DDFs per 1,000 RAID groups versus time in hours).

Figure 8. ROCOFs for the plots in Figure 7.

In Figure 9, additional scrub durations are compared. Again, the plots exhibit a non-linear (time-dependent) ROCOF. Remember from Figure 6 that the MTTDL without latent defects predicts only 0.27 DDFs/1000 RAID groups in 10 years.

Figure 9. Effects of scrub durations (DDFs per 1,000 RAID groups versus time in hours; scrub durations include 12 hr, 48 hr, and 168 hr).

The assumption of constant failure rates is inherent to the MTTDL calculations. However, Figure 10 clearly shows the potential inaccuracy resulting from that assumption even when this new model is used. A shape parameter of 0.8 may actually produce 83% more DDFs than when beta is 1.0. Similarly, if the actual beta is 1.4, there may be only 30% of the DDFs predicted using constant failure rates.

Figure 10. Effects of the operational failure shape parameter for a given characteristic life (DDFs per 1,000 RAID groups versus time in hours, for β = 0.8, 1, 1.12, and 1.4).

This research and new model show a clear difference between the estimated number of DDFs as a function of time based on the MTTDL and on the new model. The number of DDFs predicted by the model is, in all cases, greater than that predicted by the MTTDL when latent defects are included. Without scrubbing, and assuming the distributions in Table 2, this model estimates that in 1,000 RAID groups there will be over 1,200 DDFs in the 10-year mission, contrary to the 0.3 predicted by the MTTDL. Table 3 shows the ratio of DDFs expected with the new model to the number estimated using the MTTDL during the first year alone. The highest ratio, >2,500, occurs when latent defects are included but there is no scrubbing. Even if scrubbing is completed in 168 hours, the new model predicts over 360 times as many DDFs as the MTTDL method.

Table 3. DDF comparisons: DDFs in the first year and the ratio to the MTTDL estimate, for the MTTDL, the base case without scrub, and several scrub durations.

8. Conclusions

The MTTDL calculations exclude latent defects and implicitly assume the rate of occurrence of failure for any RAID group is an HPP (constant in time). The model results show that correctly including time-dependent failure rates and restoration rates along with latent defects yields estimates of DDFs that are as

much as 4,000 times greater than the MTTDL-based estimates. Additionally, the ROCOF for a RAID group is not linear in time and depends heavily on the underlying component failure distributions. Field data show HDD failure rates are not constant in time and vary from vintage to vintage. Latent defects are inevitable, and scrubbing latent defects is imperative to RAID (N+1) reliability. Short scrub durations can improve reliability, but at some point the extensive scrubbing required to support high-capacity HDDs will unacceptably impact performance. This model provides a tool by which RAID designers can better evaluate the impact of the latent defect occurrence rate, which may be 100 times greater than the operational failure rate, and of the scrubbing rate. The RAID architect can use this model to drive the design, providing insights as to the best RAID group size based on a specific manufacturer's HDDs and the impact of an increasing failure rate. For systems that currently do not scrub, consumers can see that this is a recipe for disaster. It appears that, eventually, RAID 6 will be required to meet high reliability requirements.

9. References

[1] D. A. Patterson, G. A. Gibson, and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proc. ACM Conference on Management of Data (SIGMOD), Chicago, IL, June.
[2] W. A. Thompson, "On the Foundations of Reliability," Technometrics, vol. 23, no. 1, Feb. 1981.
[3] H. E. Ascher, "A Set-of-Numbers is NOT a Data-Set," IEEE Trans. on Reliability, vol. 48, no. 2, June.
[4] L. H. Crow, "Evaluating the Reliability of Repairable Systems," Proc. Annual Reliability & Maintainability Symp.
[5] W. Nelson, "Graphical Analyses of System Repair Data," Journal of Quality Technology, vol. 20, no. 1, Jan.
[6] H. Ascher, "[Statistical Methods in Reliability]: Discussion," Technometrics, vol. 25, no. 4, Nov.
[7] J. G. Elerath and S. Shah, "Disk Drive Reliability Case Study: Dependence Upon Head Fly-Height and Quantity of Heads," Proc. Annual Reliability & Maintainability Symp.
[8] S. Shah and J. G. Elerath, "Disk Drive Vintage and Its Affect on Reliability," Proc. Annual Reliability & Maintainability Symp.
[9] H. H. Kari, "Latent Sector Faults and Reliability of Disk Arrays," Ph.D. Dissertation, TKO-A33, Helsinki University of Technology, Espoo, Finland, 1997.
[10] T. J. E. Schwarz et al., "Disk Scrubbing in Large Archival Storage Systems," IEEE Computer Society Symposium, MASCOTS.
[11] S. Shah and J. G. Elerath, "Reliability Analysis of Disk Drive Failure Mechanisms," Proc. Annual Reliability & Maintainability Symp.
[12] E. Pinheiro, W. D. Weber, and L. A. Barroso, "Failure Trends in a Large Disk Drive Population," Proc. 5th USENIX Conference on File and Storage Technologies (FAST '07), Feb.
[13] B. Schroeder and G. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" Proc. 5th USENIX Conference on File and Storage Technologies (FAST), Feb.
[14] V. Prabhakaran et al., "IRON File Systems," SOSP '05, Brighton, UK, Oct. 2005.
[15] R. Geist and K. Trivedi, "An Analytic Treatment of the Reliability and Performance of Mirrored Disk Subsystems," Twenty-Third Inter. Symp. on Fault-Tolerant Computing (FTCS), June.
[16] M. Malhotra, "Specification and solution of dependability models of fault tolerant systems," Ph.D. Dissertation, Dept. of Computer Science, Duke University, May 14.
[17] D. A. Patterson et al., "Introduction to Redundant Arrays of Inexpensive Disks (RAID)," Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage (COMPCON), Feb.
[18] P. M. Chen et al., "RAID: High-Performance, Reliable Secondary Storage," ACM Computing Surveys.
[19] W. V. Courtright, II, "A Transactional Approach to Redundant Disk Array Implementation," Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, May.
[20] T. J. E. Schwarz and W. A. Burkhard, "Reliability and Performance of RAIDs," IEEE Transactions on Magnetics, vol. 31, no. 2, Mar.
[21] W. A. Thompson, "The Rate of Failure Is the Density, Not the Failure Rate," The American Statistician, Editorial, vol. 42, no. 4, Nov.
[22] C. L. T. Borges et al., "Composite Reliability Evaluation by Sequential Monte Carlo Simulation on Parallel and Distributed Operating Environments," IEEE Trans. on Power Systems, vol. 16, no. 2, May.
[23] D. Trindade and S. Nathan, "Simple Plots for Monitoring Field Reliability of Repairable Systems," Proc. Annual Reliability & Maintainability Symp.
[24] P. Corbett et al., "Row-Diagonal Parity for Double Disk Failure Correction," Proc. 3rd USENIX Conference on File and Storage Technologies, San Francisco.
[25] J. Gray and C. van Ingen, "Empirical Measurements of Disk Failure Rates and Error Rates," Microsoft Research Technical Report, MSR-TR, Dec.


More information

Designing a Cloud Storage System

Designing a Cloud Storage System Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes

More information

Why disk arrays? CPUs speeds increase faster than disks. - Time won t really help workloads where disk in bottleneck

Why disk arrays? CPUs speeds increase faster than disks. - Time won t really help workloads where disk in bottleneck 1/19 Why disk arrays? CPUs speeds increase faster than disks - Time won t really help workloads where disk in bottleneck Some applications (audio/video) require big files Disk arrays - make one logical

More information

Chapter 9: Peripheral Devices: Magnetic Disks

Chapter 9: Peripheral Devices: Magnetic Disks Chapter 9: Peripheral Devices: Magnetic Disks Basic Disk Operation Performance Parameters and History of Improvement Example disks RAID (Redundant Arrays of Inexpensive Disks) Improving Reliability Improving

More information

Availability and Cost Monitoring in Datacenters. Using Mean Cumulative Functions

Availability and Cost Monitoring in Datacenters. Using Mean Cumulative Functions Availability and Cost Monitoring in Datacenters Using Mean Cumulative Functions David Trindade, Swami Nathan Sun Microsystems Inc. {david.trindade,swami.nathan} @sun.com Keywords : Availability analysis,

More information

VERY IMPORTANT NOTE! - RAID

VERY IMPORTANT NOTE! - RAID Disk drives are an integral part of any computing system. Disk drives are usually where the operating system and all of an enterprise or individual s data are stored. They are also one of the weakest links

More information

Self-Adaptive Disk Arrays

Self-Adaptive Disk Arrays Proc. 8 th International Symposium on Stabilization, Safety and Securiity of Distributed Systems (SSS 006), Dallas, TX, Nov. 006, to appear Self-Adaptive Disk Arrays Jehan-François Pâris 1*, Thomas J.

More information

Taurus Super-S3 LCM. Dual-Bay RAID Storage Enclosure for two 3.5-inch Serial ATA Hard Drives. User Manual March 31, 2014 v1.2 www.akitio.

Taurus Super-S3 LCM. Dual-Bay RAID Storage Enclosure for two 3.5-inch Serial ATA Hard Drives. User Manual March 31, 2014 v1.2 www.akitio. Dual-Bay RAID Storage Enclosure for two 3.5-inch Serial ATA Hard Drives User Manual March 31, 2014 v1.2 www.akitio.com EN Table of Contents Table of Contents 1 Introduction... 1 1.1 Technical Specifications...

More information

Operating Systems. RAID Redundant Array of Independent Disks. Submitted by Ankur Niyogi 2003EE20367

Operating Systems. RAID Redundant Array of Independent Disks. Submitted by Ankur Niyogi 2003EE20367 Operating Systems RAID Redundant Array of Independent Disks Submitted by Ankur Niyogi 2003EE20367 YOUR DATA IS LOST@#!! Do we have backups of all our data???? - The stuff we cannot afford to lose?? How

More information

Striped Set, Advantages and Disadvantages of Using RAID

Striped Set, Advantages and Disadvantages of Using RAID Algorithms and Methods for Distributed Storage Networks 4: Volume Manager and RAID Institut für Informatik Wintersemester 2007/08 RAID Redundant Array of Independent Disks Patterson, Gibson, Katz, A Case

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

How To Write A Disk Array

How To Write A Disk Array 200 Chapter 7 (This observation is reinforced and elaborated in Exercises 7.5 and 7.6, and the reader is urged to work through them.) 7.2 RAID Disks are potential bottlenecks for system performance and

More information

Disk Array Data Organizations and RAID

Disk Array Data Organizations and RAID Guest Lecture for 15-440 Disk Array Data Organizations and RAID October 2010, Greg Ganger 1 Plan for today Why have multiple disks? Storage capacity, performance capacity, reliability Load distribution

More information

Chapter 10: Mass-Storage Systems

Chapter 10: Mass-Storage Systems Chapter 10: Mass-Storage Systems Physical structure of secondary storage devices and its effects on the uses of the devices Performance characteristics of mass-storage devices Disk scheduling algorithms

More information

Increasing the capacity of RAID5 by online gradual assimilation

Increasing the capacity of RAID5 by online gradual assimilation Increasing the capacity of RAID5 by online gradual assimilation Jose Luis Gonzalez,Toni Cortes joseluig,toni@ac.upc.es Departament d Arquiectura de Computadors, Universitat Politecnica de Catalunya, Campus

More information

Moving Beyond RAID DXi and Dynamic Disk Pools

Moving Beyond RAID DXi and Dynamic Disk Pools TECHNOLOGY BRIEF Moving Beyond RAID DXi and Dynamic Disk Pools NOTICE This Technology Brief contains information protected by copyright. Information in this Technology Brief is subject to change without

More information

Flash Memory Technology in Enterprise Storage

Flash Memory Technology in Enterprise Storage NETAPP WHITE PAPER Flash Memory Technology in Enterprise Storage Flexible Choices to Optimize Performance Mark Woods and Amit Shah, NetApp November 2008 WP-7061-1008 EXECUTIVE SUMMARY Solid state drives

More information

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS TECHNOLOGY BRIEF August 1999 Compaq Computer Corporation Prepared by ISSD Technology Communications CONTENTS Executive Summary 1 Introduction 3 Subsystem Technology 3 Processor 3 SCSI Chip4 PCI Bridge

More information

Q & A From Hitachi Data Systems WebTech Presentation:

Q & A From Hitachi Data Systems WebTech Presentation: Q & A From Hitachi Data Systems WebTech Presentation: RAID Concepts 1. Is the chunk size the same for all Hitachi Data Systems storage systems, i.e., Adaptable Modular Systems, Network Storage Controller,

More information

Embedded Systems Lecture 9: Reliability & Fault Tolerance. Björn Franke University of Edinburgh

Embedded Systems Lecture 9: Reliability & Fault Tolerance. Björn Franke University of Edinburgh Embedded Systems Lecture 9: Reliability & Fault Tolerance Björn Franke University of Edinburgh Overview Definitions System Reliability Fault Tolerance Sources and Detection of Errors Stage Error Sources

More information

RAID: Redundant Arrays of Independent Disks

RAID: Redundant Arrays of Independent Disks RAID: Redundant Arrays of Independent Disks Dependable Systems Dr.-Ing. Jan Richling Kommunikations- und Betriebssysteme TU Berlin Winter 2012/2013 RAID: Introduction Redundant array of inexpensive disks

More information

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/)

COS 318: Operating Systems. Storage Devices. Kai Li Computer Science Department Princeton University. (http://www.cs.princeton.edu/courses/cos318/) COS 318: Operating Systems Storage Devices Kai Li Computer Science Department Princeton University (http://www.cs.princeton.edu/courses/cos318/) Today s Topics Magnetic disks Magnetic disk performance

More information

Guide to SATA Hard Disks Installation and RAID Configuration

Guide to SATA Hard Disks Installation and RAID Configuration Guide to SATA Hard Disks Installation and RAID Configuration 1. Guide to SATA Hard Disks Installation... 2 1.1 Serial ATA (SATA) Hard Disks Installation... 2 2. Guide to RAID Configurations... 3 2.1 Introduction

More information

Best Practices RAID Implementations for Snap Servers and JBOD Expansion

Best Practices RAID Implementations for Snap Servers and JBOD Expansion STORAGE SOLUTIONS WHITE PAPER Best Practices RAID Implementations for Snap Servers and JBOD Expansion Contents Introduction...1 Planning for the End Result...1 Availability Considerations...1 Drive Reliability...2

More information

William Stallings Computer Organization and Architecture 8 th Edition. External Memory

William Stallings Computer Organization and Architecture 8 th Edition. External Memory William Stallings Computer Organization and Architecture 8 th Edition Chapter 6 External Memory Types of External Memory Magnetic Disk RAID Removable Optical CD-ROM CD-Recordable (CD-R) CD-R/W DVD Magnetic

More information

Distribution One Server Requirements

Distribution One Server Requirements Distribution One Server Requirements Introduction Welcome to the Hardware Configuration Guide. The goal of this guide is to provide a practical approach to sizing your Distribution One application and

More information

CHAPTER 4 RAID. Section Goals. Upon completion of this section you should be able to:

CHAPTER 4 RAID. Section Goals. Upon completion of this section you should be able to: HPTER 4 RI s it was originally proposed, the acronym RI stood for Redundant rray of Inexpensive isks. However, it has since come to be known as Redundant rray of Independent isks. RI was originally described

More information

SSDs and RAID: What s the right strategy. Paul Goodwin VP Product Development Avant Technology

SSDs and RAID: What s the right strategy. Paul Goodwin VP Product Development Avant Technology SSDs and RAID: What s the right strategy Paul Goodwin VP Product Development Avant Technology SSDs and RAID: What s the right strategy Flash Overview SSD Overview RAID overview Thoughts about Raid Strategies

More information

CS420: Operating Systems

CS420: Operating Systems NK YORK COLLEGE OF PENNSYLVANIA HG OK 2 RAID YORK COLLEGE OF PENNSYLVAN James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz,

More information

"Reliability and MTBF Overview"

Reliability and MTBF Overview "Reliability and MTBF Overview" Prepared by Scott Speaks Vicor Reliability Engineering Introduction Reliability is defined as the probability that a device will perform its required function under stated

More information

Version : 1.0. SR3620-2S-SB2 User Manual. SOHORAID Series

Version : 1.0. SR3620-2S-SB2 User Manual. SOHORAID Series Version : 1.0 SR3620-2S-SB2 User Manual SOHORAID Series Introduction About this Manual Thank you for using the product of RAIDON Technology Inc. This user manual will introduce the STARDOM SR3620-2S-SB2

More information

An Introduction to RAID 6 ULTAMUS TM RAID

An Introduction to RAID 6 ULTAMUS TM RAID An Introduction to RAID 6 ULTAMUS TM RAID The highly attractive cost per GB of SATA storage capacity is causing RAID products based on the technology to grow in popularity. SATA RAID is now being used

More information

Technology Update White Paper. High Speed RAID 6. Powered by Custom ASIC Parity Chips

Technology Update White Paper. High Speed RAID 6. Powered by Custom ASIC Parity Chips Technology Update White Paper High Speed RAID 6 Powered by Custom ASIC Parity Chips High Speed RAID 6 Powered by Custom ASIC Parity Chips Why High Speed RAID 6? Winchester Systems has developed High Speed

More information

RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1

RAID HARDWARE. On board SATA RAID controller. RAID drive caddy (hot swappable) SATA RAID controller card. Anne Watson 1 RAID HARDWARE On board SATA RAID controller SATA RAID controller card RAID drive caddy (hot swappable) Anne Watson 1 RAID The word redundant means an unnecessary repetition. The word array means a lineup.

More information

ICMP HDD. Installation manual

ICMP HDD. Installation manual ICMP HDD Installation manual R5905769/02 17/04/2015 Barco nv Noordlaan 5, B-8520 Kuurne Phone: +32 56.36.82.11 Fax: +32 56.36.883.86 Support: www.barco.com/en/support Visit us at the web: www.barco.com

More information

Storage Options for Document Management

Storage Options for Document Management Storage Options for Document Management Document management and imaging systems store large volumes of data, which must be maintained for long periods of time. Choosing storage is not simply a matter of

More information

Why disk arrays? CPUs improving faster than disks

Why disk arrays? CPUs improving faster than disks Why disk arrays? CPUs improving faster than disks - disks will increasingly be bottleneck New applications (audio/video) require big files (motivation for XFS) Disk arrays - make one logical disk out of

More information

RAID Levels and Components Explained Page 1 of 23

RAID Levels and Components Explained Page 1 of 23 RAID Levels and Components Explained Page 1 of 23 What's RAID? The purpose of this document is to explain the many forms or RAID systems, and why they are useful, and their disadvantages. RAID - Redundant

More information

Dell Reliable Memory Technology

Dell Reliable Memory Technology Dell Reliable Memory Technology Detecting and isolating memory errors THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS

More information

Fault Tolerance & Reliability CDA 5140. Chapter 3 RAID & Sample Commercial FT Systems

Fault Tolerance & Reliability CDA 5140. Chapter 3 RAID & Sample Commercial FT Systems Fault Tolerance & Reliability CDA 5140 Chapter 3 RAID & Sample Commercial FT Systems - basic concept in these, as with codes, is redundancy to allow system to continue operation even if some components

More information

RAID Storage Systems with Early-warning and Data Migration

RAID Storage Systems with Early-warning and Data Migration National Conference on Information Technology and Computer Science (CITCS 2012) RAID Storage Systems with Early-warning and Data Migration Yin Yang 12 1 School of Computer. Huazhong University of yy16036551@smail.hust.edu.cn

More information

Maintenance Best Practices for Adaptec RAID Solutions

Maintenance Best Practices for Adaptec RAID Solutions Maintenance Best Practices for Adaptec RAID Solutions Note: This document is intended to provide insight into the best practices for routine maintenance of Adaptec RAID systems. These maintenance best

More information

How To Fix A Fault Fault Fault Management In A Vsphere 5 Vsphe5 Vsphee5 V2.5.5 (Vmfs) Vspheron 5 (Vsphere5) (Vmf5) V

How To Fix A Fault Fault Fault Management In A Vsphere 5 Vsphe5 Vsphee5 V2.5.5 (Vmfs) Vspheron 5 (Vsphere5) (Vmf5) V VMware Storage Best Practices Patrick Carmichael Escalation Engineer, Global Support Services. 2011 VMware Inc. All rights reserved Theme Just because you COULD, doesn t mean you SHOULD. Lessons learned

More information

Getting Started With RAID

Getting Started With RAID Dell Systems Getting Started With RAID www.dell.com support.dell.com Notes, Notices, and Cautions NOTE: A NOTE indicates important information that helps you make better use of your computer. NOTICE: A

More information

High-Performance SSD-Based RAID Storage. Madhukar Gunjan Chakhaiyar Product Test Architect

High-Performance SSD-Based RAID Storage. Madhukar Gunjan Chakhaiyar Product Test Architect High-Performance SSD-Based RAID Storage Madhukar Gunjan Chakhaiyar Product Test Architect 1 Agenda HDD based RAID Performance-HDD based RAID Storage Dynamics driving to SSD based RAID Storage Evolution

More information

VIA / JMicron RAID Installation Guide

VIA / JMicron RAID Installation Guide VIA / JMicron RAID Installation Guide 1. Introduction to VIA / JMicron RAID Installation Guide. 3 2. VIA RAID Installation Guide. 3 2.1 VIA BIOS RAID Installation Guide.. 3 2.1.1 Introduction of RAID.

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

COS 318: Operating Systems. Storage Devices. Kai Li and Andy Bavier Computer Science Department Princeton University

COS 318: Operating Systems. Storage Devices. Kai Li and Andy Bavier Computer Science Department Princeton University COS 318: Operating Systems Storage Devices Kai Li and Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall13/cos318/ Today s Topics! Magnetic disks!

More information