Predctng Archval Lfe of Removable Hard Dsk Drves Paul Wllams; ProStor Systems, Inc.; Boulder, CO., USA / Davd S. H. Rosenthal; Stanford Unversty Lbrares; Stanford, CA., USA / Mema Roussopoulos; Insttute of Computer Scence, FORTH, Greece / Steve Georgs; ProStor Systems, Inc.; Boulder, CO., USA Abstract A collaboraton between the developer of a Removable Hard Dsk Drve (RHDD) storage technology product and academcs presents an accelerated lfe test of non-spnnng powered down hard dsks. The results are used to predct the relablty of removable dsk cartrdges stored n shrtsleeve envronments as archval meda. The contrbutons of ths paper nclude an ntal study of the relablty of data stored off-lne on RHDDs, and a descrpton of the ndustry standard accelerated lfe test process together wth the methodology used to predct storage meda relablty based on the results. Our study shows that removable hard dsk drve technology has consderable potental as a relable archval medum. 1. Introducton Attenton has recently been drawn to the lack of sutable data on whch desgns for long-term storage can be based. At the 2007 FAST conference the keynote [1] and papers from Google [2] and Carnege-Mellon [3] dscussed the msmatch between manufacturer's specfcatons and actual performance n practce of dsk drves. At the 2008 conference, data from NetApp showed that other components of storage systems also contrbute to a sgnfcant rate of storage corrupton [4]. Ths paper s the result of collaboraton between a storage vendor and academcs to see what data can be made avalable, to encourage system desgners and vendors to talk the same language, and to nvestgate some of the mpedments to better storage meda performance data Current data collectons nvolve large arrays of spnnng drves; ths work looks at data retenton on non-spnnng, dle drves. These are becomng mportant. MAID (Massve Array of Idle Dsks) technology, n whch densely packed SATA dsks are spun down whenever possble to reduce power, wear and vbraton, s becomng avalable. Removable hard dsk cartrdges (RHDD) are becomng an alternatve to tape cartrdges for off-lne storage. Idle drves are a more tractable case for performance data, because there are many fewer parameters to consder. They are much less subject to the contnued access, thermal and power varatons, vbraton and other envronmental factors whch spnnng drves nevtably encounter. Nevertheless, some of the formdable dffcultes n the way of provdng storage system desgners wth hgh-qualty performance data of dsk n general stll apply to dle dsks. Dsks are remarkably relable. It was only because Google and Carnege-Mellon studed vast numbers of drves for long perods of tme that they were able to reveal clearly the dspartes between specfcatons and performance. A partcular dsk product may only be on the market for 18 months where the servce lfe of a spnnng drve mght be 60 months. Relablty data based on experence n servce of a partcular product wll typcally be avalable after the product s obsolete. Useful relablty data about partcular products wll necessarly be a predcton based on testng a sample of early producton unts. To provde meanngful predctons of storage relablty usng feasble sample szes and test duratons requres accelerated lfe test technques. The storage devces are subjected to condtons extreme enough to cause many falures. The falure data s adjusted to predct falures n normal condtons usng a model of the effect of envronmental condtons on falure rates. We report on an accelerated lfe test on RHDDs and a small sample of desktop SATA drves, usng ndustry-standard technques [e.g.,5], performed by an ndependent testng lab under contract to the developer of the dsk cartrdge technology. We provde an overvew of the accelerated agng test process, report the data collected, descrbe the model used to generate predcted relablty from the data and show the results. Ths analyss concludes that, stored n realstc condtons, the RHDDs are relable enough for archval use. The predcted falure rate s less than 1% over 30 years. The factors leadng to the gap between predcton and experence for spnnng dsks should be less sgnfcant n the dle case. The contrbuton of ths paper s an ntal study of the relablty of data stored off-lne on RHDDs usng (and descrbng) the ndustry-standard test process, and the ndustry-standard methodology for predctng servce relablty. We hope ths wll help storage system desgners better understand the dsk medum, and the lmtatons of data avalable about t. 2. Removable Hard Dsk Drves RHDDs encapsulate ndustry-standard 2.5 SATA HDDs (notebook computers drves) n a ruggedzed cartrdge that has the removablty, portablty, durablty and off-lne storage characterstcs of legacy magnetc tape, but wth the random-access performance advantages of dsk. RHDDs are desgned to protect the embedded HDD aganst shock, vbraton and Electrostatc Dscharge (ESD) RHDDs are rapdly ganng acceptance as tape replacement for data protecton and archve applcaton n servers and workstatons. One example of RHDD s the mult-vendor RDX format developed by ProStor Systems and manufactured by several storage ndustry supplers, such as Tandberg Data, Imaton and Dell Computer. One dfference between 2.5 laptop drves and standard desktop or server 3.5 drves that s mportant for use n dle drve systems s the technque used to park the heads when the drve s dle. 3.5 drves use contact start-stop (CSS) technology, parkng the heads on a specal landng zone on the dsk medum when the drve sn't spnnng. 2.5 drves, desgned for laptop use, use ramp loadng technology to physcally remove the heads and lock them away from the medum when the drve spns down. Ths provdes much greater non-operatng shock tolerance, but t may also be mportant for longterm data retenton. MAID systems usng 3.5 drves typcally spn the drves up perodcally to exercse the mechancs and reduce the rsk of falures caused by CSS, among other reasons. RHDDs are desgned to be removed from the system for storage, makng perodc exercse of ths knd mpractcal.
If dle dsk drves are to be used as an archval medum, storage system desgners need to know how long the data wrtten to them wll reman readable. Exstng research on spnnng dsks has suggested a number of factors whch affect ther relablty, ncludng power-on hours, duty cycle, operatng envronment, rotatonal vbraton and access patterns (e.g., [6]). Idle dsks operate at very low duty cycles and are mostly powered-down. Consequently operatng envronment, vbraton and access patterns are not sgnfcant concerns. 3. Background The condtons, under whch the powered-down RHDDs are stored, n partcular temperature and relatve humdty, can affect ther ablty to retan data. ProStor Systems desgned and contracted wth an ndependent testng servce to conduct an accelerated lfe test so as to understand these relablty factors and the lmts of longterm data preservaton usng RDX cartrdges. Through an understandng of the desgn of HDDs, known falure mechansms and agng effects, t s possble to postulate a set of archval lfe relablty factors for RHDDs and desgn experments to test for these factors. These potental factors dentfed are: 1. Magnetc thermal decay of recorded bts and control sgnals 2. Meda corroson 3. Meda lubrcant evaporaton 4. Flud dynamc bearng ol evaporaton 5. Electroncs corroson and degradaton These archval lfe factors are all functons of temperature, humdty and tme makng the factors excellent canddates for accelerated testng stresses. 4. QALT Quanttatve accelerated lfe testng (QALT) [7] conssts of a seres of related tests desgned to quantfy the lfe characterstcs of a component or system under normal use condtons by testng the unts at hgher stress levels n order to accelerate the occurrence of falures. These tests provde valuable nformaton about a product s performance under normal use condtons that can allow a manufacturer to make predctve statements about ts products feld performance. The obvous beneft of quanttatve accelerated lfe testng s the tme savngs, whch s based on the decrease n test duraton due to ncreased stress levels. Wth stress related acceleraton, one or more envronmental factors that are known to cause the product to fal under normal condtons such as temperature, voltage, humdty, vbraton etc. are ncreased n order to cause the product to fal more quckly n the test. The stress and levels of stress used n accelerated tests must be chosen so that they accelerate the falure modes of the product but do not ntroduce falure modes that would not normally occur under normal condtons. These stress levels may fall outsde the product specfcaton lmts for the product but usually well nsde the real desgn lmts. The lfe data obtaned from these tests requre accelerated lfe data analyss technques, whch nclude a mathematcal model to translate from accelerated condtons to the product under normal use condtons. Ths model can be used to calculate mportant relablty statstcs. These nclude: Relablty or the probablty of success, or the converse (the probablty of falure), the mean lfe of a populaton, the falure rate per unt tme (hours), and the B(X). The dstrbuton parameter B(X) where X s the gven proporton of the populaton whch wll fal by the tme beng evaluated. For example f t s desrable to know at what tme 1% a populaton wll fal, then B(1) would be evaluated to produce a number of hours correspondng to ths cumulatve falure proporton. 5. Experment Desgn For ths partcular accelerated lfe test we have already dscussed the expected falure mechansms and the assocated stresses whch precptate them. These factors are temperature and relatve humdty. The expermental desgn utlzed three levels of stress for each n order to enable the use of non-lnear lfe models n analyses of the resultant data. Ths s mportant because real world practcal models are exponental n ther nature [8]. To accomplsh a three level stress experment wth two dfferent stresses, sx test cells are requred. To save on samples and test resources one of the cells from each stress can be common. Ths mproves the effcency of the test by reducng the total number of stress cells to fve. The levels of stress were chosen to stress the drves to the maxmum lmts of ther capabltes wthout creatng unrelated falure mechansms. Also, larger sample szes were used n the lower stress cells n an attempt to observe enough falures for statstcal requrements. Longer test ntervals were also used n these low stress cells to further mprove the effcency of the test and to reduce hazards from handlng the drves. Ths exercse resulted n a 5 stress cell experment detaled n Table 1, where the Test Cell s a letter desgnaton to smplfy trackng. The rest of table 1 ncludes, Test Stress conssts of Temperature n degrees Celsus and Hum RH s Relatve Humdty. Samples refer to the number of RDX removable cartrdges used n each test cell and the test nterval s the number of hours between each test pont. Table 1: Test Cell Defntons Test Stress Test Cell Temp C Hum RH Samples Test nterval A 80 85 10 336 B 80 55 10 500 C 80 10 15 500 D 70 85 15 750 E 60 85 30 1000 6. Test Samples A sample of 80 RDX 160GB (80GB/Platter) 2.5 form factor hard drves was used for ths experment. The partcular drves were selected because they were the frst commercally avalable drves usng perpendcular recordng technology at ths areal densty. These samples were dstrbuted as detaled n Table 1. Addtonally, a small sample of fve 500GB 3.5 hard drves (usng the same areal densty and perpendcular recordng technology as the 160GB RDX drves) were placed nto the chamber wth test cell A. Ths group s later referred to as group F n the results secton. The relatve performance of these drves to the RDX removable drves n ths cell are of nterest to benchmark desktop backup solutons performance over tme. 7. Test Procedure The testng was carefully montored and rgorously performed and our methodology follows ndustry-standard testng practces [5].
Frst, each drve was functonally tested and random data wrtten to the entre drve. Then each drve was read comparng the data to what was orgnally wrtten to verfy the data was wrtten properly and to record any errors. Ths process establshed that the removable dsk was error-free before any stresses were appled. The samples were separated nto the 5 test groups and permanently marked to dentfy them. The followng steps were executed for each test nterval: 1. The cartrdges were placed n the envronmental chamber accordng to sample group and envronmental settngs. 2. The chamber was ramped to the desred temperature and humdty levels over a perod long enough to nsure temperature and humdty ramp tmes specfed by the hard drve manufacturer were not volated. Once the temperature and humdty levels were reached they were mantaned at the desred levels for the specfed nterval. 3. At the completon of the nterval, the humdty was reduced to 10% RH and the temperature was mantaned at the elevated level before beng ramped to 20º C over a perod long enough to nsure temperature and humdty ramp tmes specfed by the hard drve manufactures were not volated. Ths was done n order to dry out the chamber and hard drves so that condensaton dd not occur n ether. 4. The cartrdges were then removed from the chamber and all data sectors were read whle checkng for errors. 5. All cartrdges that accurately read the data were returned to the chamber and the nterval cycle was repeated untl the desred total duraton was acheved. See Table 2 n secton 8 for detals. Falure was defned as an unrecoverable read error whle readng any data sector from the dsk or a ms-comparson of the data to what was orgnally wrtten. Once a cartrdge experenced an unrecoverable read error or ms-compare, t was removed from the test and logged as a falure. Each cartrdge was montored untl ether falure or test completon. Durng the course of testng, the envronmental condtons n the chambers were contnuously montored to ensure operaton at specfed levels. 8. Test Results The results of the accelerated lfe tests are shown n the table below. (Courtesy of: Percept Technology Labs, RDX Removable Dsk Archvablty Study, Test Report [9]). Table 2: Results by Test Cell Frst the tme to falure data was coded nto ALTA for each of the envronmental condton cells of ths experment. Ths codng ncludes the tme to each of the falures, the tme at whch drves that dd not fal were suspended as well as the envronmental condtons of temperature and humdty for each data pont. ALTA provdes a tool called dstrbuton wzard whch helps determne the best lfetme dstrbuton for the data set. The wzard estmates the parameters for each of several dstrbutons, compares the log-lkelhood values and then recommends the one that s the best statstcal ft for the data. The dstrbuton wzard was employed to determne the best dstrbuton for the expermental data and Lognormal was chosen as the best ft for the data. ALTA provdes a choce of Lfe-Stress relatonshps ncludng models for Arrhenus, Eyrng, Inverse Power Law (ILP), Temperature-Humdty and Temperature-nonthermal. These models allow the extrapolaton of test data to enable Relablty predcton for other envronmental condtons as well as the desred tme. The Temperature-Humdty lfe-stress model s a varaton of the Eyrng relatonshp [8,11] whch s used for the analyss of data from temperature and humdty tests. Ths s the model chosen because of ts applcablty to the stresses used n the archve test. 10. Temperature-Humdty Relatonshp The temperature-humdty (T-H) relatonshp, a varaton of the Eyrng relatonshp,[8,11] has been proposed for predctng the lfe at use condtons when temperature and humdty are the accelerated stresses n a test. Ths combnaton model s gven by: [8,11] where: s the thermal actvaton energy. b s the actvaton energy for humdty. A s a constant parameter. U s the relatve humdty (decmal or percentage). V s temperature (n absolute unts, Kelvn n our work). T-H Acceleraton Factor The acceleraton factor for the T-H relatonshp s gven by: Test Stress Cell Number Temp C Temp F Hum RH Samples Falures Total tme A 80 176 85 10 4 4000 B 80 176 55 10 4 2000 C 80 176 10 15 2 3500 D 70 158 85 15 5 3000 E 60 140 85 30 1 4000 F 80 176 85 5 5 4000 9. Relablty Analyss The results of the accelerated testng for achevablty were analyzed usng RelaSoft s ALTA relablty software, whch s well suted for ths type of quanttatve data. ALTA s a hgh-qualty commercally avalable Accelerated Lfe Test Analyss (ALTA) statstcal tool. It s wdely accepted as a standard tool for ths type of analyss wthn the Relablty Engneerng communty [10]. where: LUSE s the lfe at use stress level. LAccelerated s the lfe at the accelerated stress level. Vu s the use temperature level. VA s the accelerated temperature level. U A s the accelerated humdty level U u s the use humdty level. The objectve of ths analyss was to extract a predcton of the lfespan for RDX removable dsk drves at varous envronmental condtons as well as set envronmental lmts for varous desred archval perods. Intal envronmental usage condtons are were set as 20 C / 68 F at 30% RH to correspond to a standard commercal Archval condton. Graph 1 s the result of ths analyss. In the
legend each test cell s dentfed along wth the number of Falures (F) and Suspensons (S) for each. Graph 2: Archval lfe vs. Storage condtons: Graph 1: ALTA Lfe Dstrbuton Plot RelaSoft ALTA 6.0 - ALTA.RelaSoft.com Probablty Lognormal 99.00 T-H/Log Unrelablty 50.00 Test Cell E F=1 S=29 Test Cell D F=5 S=10 Test Cell C F=2 S=13 Test Cell B F=4 S=6 Test Cell A F=4 S=6 Use Condto 20 C / 30% R 10.00 5.00 1.00 100.00 1000.00 10000.00 1.00E+5 1.00E+6 1.00E+7 1.00E+8 Std=1.0776, A=4.9850E-14, b= 14.7783, Ea= 1.1713 Tme (hrs) In Graph 1 the X axs s tme n hours and the Y axs s Unrelablty or the cumulatve falure proporton of the test samples. For example, take the upper leftmost pont n Graph 1, t represents a falure that occurred at 1500 hours and represented ~35% cumulatve falure for the populaton for test cell B. It should also be stated that Unrelablty (probablty of falure) s equal to 1 Relablty (probablty of success). As an example unrelablty on the graph of 1% corresponds to a 99% Relablty. Each set of colored ponts represents the falures for a partcular test cell and the correspondng colored lne represents the best statstcal ft for those ponts. The legend dentfes whch test group to whch the data belongs. The rghtmost (red) lne wth no ponts surroundng t s a predcton of the lfe characterstcs for the envronmental usage condton nput. In ths case t was set as 20 C / 68 F at 30% RH. Usng the tools wthn ALTA, the temperature / humdty lmts under whch the RDX removable dsk drve can be safely stored for 20 years, 25 years and 30 years can be predcted. As expected, the hgher the temperature n the storage envronment, the lower the humdty has to be n order to get the maxmum length of storage lfe. All of ths analyss was performed wth a data Relablty requrement for the data of 99%. Graph 2 dsplays these results: 11. Moble vs. Desktop Drves It was also desrable to understand the dfference between the group of RDX removable drves and the standard 3.5 CSS drves commonly used n MAID systems and desktop external USB devces. In group A, 10 samples of RDX 160GB 2.5 form factor hard drves were subjected to 176ºF / 80ºC 85% for 4000hrs wth 4 out of 10 eventually falng as a result of the experment. In group F, 5 samples of 500GB 3.5 form factor drves were placed nto the chamber wth test group A. Durng the course of ths experment all 5 of the samples faled. The 500GB 3.5 equates to the same areal densty as the 2.5 160GB RDX drves. Both groups utlze the same perpendcular recodng technologes sharng common heads and meda technologes. Graph 3: Lfe expectancy comparson RDX vs. 3.5 Hard RelaSoft Webull++ 7 - www.relasoft.com 99.000 t ) ( F, t y l b l a r e63.000 n U 50.000 MTBF x 2 2.5 vs 3.5 nch Drve Comparson MTBF 2604 hrs 17613 hrs Drves Lognormal Group A - 2.5 F=4/S=6 Probablty Lne Group F - 3.5 F=5/S=0 Probablty Lne x 2 10.000 1000 10000 100000 1000000 Test Tme, (hrs) To compare these two groups whch were stored at the same envronmental condtons Relasoft s Webull++ was used to create a lfe dstrbuton plot of tme vs. cumulatve falure percentage for the two populatons. In ths case a reasonable comparson pont between the two groups s the characterstc or average lfe or MTBF. Ths pont s chosen because t s the key parameter for the dstrbuton of tmes to falure. For the 3.5 drves the MTBF s 2604 hours and for the RDX the MTBF s 17,613 hours. Ths rato of >6.7 tmes s a very strong ndcator that 3.5 CSS drves are far more vulnerable to
data loss than s the RDX removable dsk when used as archval meda. Ths comparson s clear n Graph 3. 12. Related Work Many storage systems have been desgned for specfc falure modes (e.g., [12, 13]) and there s great need for real data from the feld to support the use of these falure models. Publshed work presentng and analyzng falure data n real storage systems s only startng to appear. Recent FAST conferences drew attenton to the lack of sutable data on whch desgns for long-term storage can be based. Steve Kleman's keynote [1] and papers by Pnhero et al. [2] and Schroeder et al. [3] dscuss the msmatch between manufacturers specfcatons of dsk drves and actual performance n practce. Specfcally, Schroeder et al. study and analyze data from 100,000 drves wth SCSI, FC, and SATA nterfaces and show many dspartes between manufacturer specfcatons and actual dsk performance. Pnhero et al. study dsk replacement data from more than 100,000 hard drves n operaton at Google, ncludng seral and ATA drves and smlarly fnd dsparty n the annual replacement rates measured n actual use and the rates predcted by the vendors. Jang et al [4] show that other components of the storage system contrbute to these falure rates. Baravasundaram et al [14] show that n addton to these vsble falures, slent data corrupton occurs at sgnfcant rates. Elerath and Shah dscuss n detal the factors that can lead to the dsparty between specfcatons and actual measured drve relablty [6, 15]. These nclude thermal varaton, duty cycle, archtecture and logc of the system n whch the hard drve s used, and the data collecton and analyss process of the falures. Whle vendor-publshed falure-predcton metrcs such as MTTF have been crtczed by the research communty [16, 17], t s mportant to understand how these metrcs are derved. One ntent of our work n detalng the accelerated lfe tests typcally used by ndustry and the methodology used to predct storage meda relablty usng the test results s to help storage system desgners better understand the lmtatons of data avalable about the dsk medum. Because dsks are already very relable, meanngful data about actual performance s avalable only by studyng very large populatons of drves over very long perods. Because dsk technology changes so rapdly, data about current products wll nevtably be predctons based on accelerated lfe tests, not experence. The predctons wll nevtably be based on models of how envronmental factors affect performance. 13. Conclusons Analyss of data from an accelerated lfe test predcts that, f data s wrtten to 160GB RDX removable hard drve cartrdges based on 2.5 laptop dsk technology, and the cartrdges are then stored n realstc condtons for 30 years, more than 99% of the drves wll then read ther entre contents wth no errors. A small sample of 3.5 CSS drves ncluded n the test demonstrated much lower data retenton. The related work above focuses on measurng, analyzng, and predctng falure rates of spnnng hard drves,.e., hard drves that were powered on, spnnng, and n operaton. To the best of our knowledge, ths s the frst publshed work on non-spnnng meda and thus complements these studes. Although ts ndustry-standard methodology shares lmtatons wth the tests on whch manufacturers base ther publshed specfcatons, the envronmental effects n the dle case are much smpler. Ths should lead to more realstc performance predctons. We hope ths paper s a step towards gettng storage system researchers and storage meda relablty engneers talkng the same language. Clearly, storage system desgners need hgher-qualty, more tmely data about the relablty of the components from whch they must construct ther systems. Equally, the dffcultes n the way of storage meda vendors provdng such data are formdable. References 1. Steve Kleman, Trends n Managng Data at the Petabyte Scale, 5th USENIX Conference on Fle and Storage Technologes (FAST), February 2007. 2. Eduardo Pnhero, Wolf-Detrch Weber, Luz Andre Barroso, Falure trends n large dsk drve populaton, 5th USENIX Conference on Fle and Storage Technologes (FAST), February 2007. 3. Banca Schroeder, Garth A. Gbson, Dsk falures n the real world: What does an MTTF of 1,000,000 hours mean to you? 5th USENIX Conference on Fle and Storage Technologes (FAST), February 2007. 4. W. Jang, C. Hu, and Y. Zhou, A. Kanevsky, Are Dsks the Domnant Contrbutor for Storage Falures? A Comprehensve Study of Storage Subsystem Falure Characterstcs, FAST 2008. 5. European Computer Manufacturers Assocaton EMCA-379 Standard: Test Method for the Estmaton of the Archval Lfetme of Optcal Meda, Frst Edton, June 2007. 6. J.G. Elerath, S. Shah, Server class dsk drves: how relable are they?, Proceedngs of the Annual Relablty and Mantanablty Symposum, January 2004. 7. Relasoft, Quanttatve Accelerated Lfe Testng, Data Analyss Software, http://alta.relasoft.com/ 8. Relasoft ALTA, Accelerated Lfe Testng Reference, 1996-2001, pp179. 9. Percept Technology Labs, RDX Removable Dsk Archvablty Study, Test Report, July 2007. 10. G. Cole, Estmatng Drve Relablty n Desktop Computers and Consumer Electroncs Systems, Seagate Technology Paper TP- 338.1, November 2000. 11. W. Nelson, Accelerated Testng: Statstcal Models, Test Plans and Data Analyses, (Wley Seres n Probablty and Mathematcal Statstcs-Appled Probablty),1990, pp100. 12. P.F. Corbett, R. Englsh, A. Goel, T. Grcanac, S. Kleman, J. Leong, Suntha Sankar, Row-dagonal party for double dsk falure correcton, Proceedngs of Conference on Fle and Storage Technologes (FAST), 2004. 13. S. Ghemawat, H. Goboff, S.T. Leung, The Google fle system, Proceedngs of the 19th ACM Symposum on Operatng Systems Prncples (SOSP), October 2003. 14. L. Baravasundaram, G. Goodson, B. Schroder, A. Arpac-Dusseau, R. Arpac-Dusseau, An Analyss of Data Corrupton n the Storage Stack, FAST 2008. 15. J.G. Elerath, Specfyng relablty n the dsk drve ndustry: No more MTBFs, Proceedngs of the Annual Relablty and Mantanablty Symposum, 2000. 16. J.G. Elerath, AFR: problems of defnton, calculaton and measurement n a commercal envronment, Proceedngs of the Annual Relablty and Mantanablty Symposum, 2000. 17. J. Yang and F.B. Sun, A comprehensve revew of hard-dsk drve relablty, Proceedngs of the Annual Relablty and Mantanablty Symposum, 1999.