Reliability Analysis in HPC Clusters

Narasimha Raju Gottumukkala, Yudan Liu, Chokchai Box Leangsuksun 1, Raja Nassar, Stephen Scott 2
College of Engineering & Science, Louisiana Tech University; Oak Ridge National Lab
{rg003, yli010, box, nassar}@latech.edu, sscott@ornl.gov

1 Research supported by the Department of Energy Grant no: DE-FG02-05ER25659.
2 Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Abstract
Resource failures and down times have become a growing concern for large-scale computational platforms, as they tend to have an adverse effect on the performance of the computing system. Reliability-aware resource allocation and checkpointing algorithms have been investigated to minimize the performance loss due to unexpected failures. The effectiveness of a reliability-aware policy relies on the accuracy of the reliability prediction. The reliability of a group of nodes is evaluated as a combination of individual node information under the assumption that each node's reliability is independent. In this paper, we describe a reliability analysis based on the time between failures for a system/group of nodes. Various reliability models are compared for different cases in order to find an optimal reliability model. The reliability models and analysis techniques were evaluated based on actual failure data of 512 nodes over their four-year operational period.

1. Introduction
Increasing demand for computing power in scientific and engineering applications has spurred the deployment of high performance computing (HPC) systems that deliver tera-scale performance. Current and future HPC systems that are capable of running large-scale parallel applications may span hundreds of thousands of nodes. In fact, top500.org reports the current highest processor count to be 131K processors [9]. For parallel programs, the failure probability increases significantly with the increase in the number of nodes. Ignoring failures or system reliability can have a severe effect on the performance of the HPC cluster and on the quality of service. A reliability monitoring and analysis framework [4] provides up-to-date reliability of selected components. A resource manager can use the reliability information of nodes to schedule a parallel job on a set of nodes so as to minimize the job completion time. A checkpoint algorithm can also use this information to schedule a checkpoint based on the failure probability of the selected nodes [10] in order to minimize the performance loss. The reliability of a set of nodes is constituted from the individual node failure information. System reliability is the reliability of one or more nodes and/or components that are required for the successful running of a parallel application. In this paper, we analyze system reliability by considering the failure events of a system comprising k nodes. Based on the reliability monitoring and analysis framework from our prior work [4], we present an approach to calculate the system reliability of k nodes from the failure event logs of individual nodes. We also compare various reliability models based on our approach for calculating system reliability.

The rest of the paper is organized as follows. Section 2 discusses the related work in reliability modeling and failure prediction for HPC. Section 3 briefly describes the various categories of failure events. Section 4 presents the time to failure reliability models and the proposed algorithm for combining the time to failure data of k nodes.
Section 5 details the results of comparing the reliability models when different numbers of nodes are selected, and finally Section 6 presents the conclusion and future work.

2. Related Work
Failures in large-scale HPC systems have an adverse impact on both performance and quality of service. There have been efforts to predict failures based on historical data and to develop failure-aware scheduling algorithms [1][7][8]. Failure prediction based on individual events
is still an area of ongoing research, and it may not be possible to predict the exact type and time of a failure. While current failure prediction techniques are not yet suitable as an alternative means of enabling fault tolerance mechanisms, the failure probability can be used to develop reliability-aware schemes [10][8][5], such as checkpointing, to minimize performance loss. Our reliability analysis is based on widely used statistical reliability models.

3. Failures in large scale HPC systems
For a parallel program running on a set of k nodes, if any component in a node, or a component common to the set of nodes (e.g. a network hub), fails, the program running on all the k nodes is affected. Figure 1 shows examples of hardware and software failures that affect a single node, a group of nodes, and all the nodes. In the failure trace, failure events that affect more than one node are recorded as failure events for each of the nodes that were impacted. When the failure events of each of the k nodes are combined, common failures are treated as one to avoid duplication when analyzing the system of nodes.

Figure 1: Description of failures in an HPC platform.

Failure Data. For our analysis, we use the failure data of the ASC White machine [6], a 512-node cluster from LLNL. Each node is a 16-way SMP (Symmetric Multi-Processor), so there are 8192 processors in total. The failures and down times of each node were collected over a 4-year, 3-month period from 7/15/2000 to 10/1/2004 (a total of 37218 hours). The raw failure log consists of the date and time of the failure event, the down time, the type, and a very brief description of the failure. Some failures, which we think do not affect the job, were filtered out. For example, some failures were only repaired after a long time and did not affect the job runtime.

4. Time to Failure distribution
The failure data consists of the failure times and down times of the nodes. The time between failures is assumed to be a random variable that follows a certain distribution. The actual failure times are obtained by subtracting the down times (see Figure 2). That is, suppose a node fails at times f1, f2, f3, ...; we fit the distribution of the times between failures (intervals), i.e. f2-f1, f3-f2, f4-f3, ..., and obtain F(t), the cumulative failure distribution for individual nodes. We obtain the individual node reliabilities R1, R2, ..., RN from R = 1 - F(t).

4.1 Time between Failures for a system of k nodes
Parallel applications are allocated a set of k nodes for execution. Each node has an individual failure distribution. In our model, the system fails when at least one node fails. Thus, we are interested in the minimum time to failure when the k nodes are combined. The times to failure of the individual nodes are combined to obtain the time to failure distribution of the k nodes up to the first failure. The algorithm for combining the times between failures from the individual nodes is described in Figure 3. We compare three different distributions in our reliability models, namely Exponential, Weibull and Lognormal. Figure 4 shows the CDFs (Cumulative Distribution Functions) of the reliability models.
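As a minimal illustration of the per-node computation in Section 4 and the k-node combination described above (and in Figure 3), the following Python sketch shows one way the steps could be coded. It is not the authors' implementation: the simplified log layout (per-node failure start times and downtime durations in hours) and the helper names are assumptions.

```python
# Minimal sketch (not the authors' code) of Sections 4 and 4.1.  Assumes each
# node's raw log has already been reduced to failure start times and downtime
# durations, both in hours since the start of the trace.
import numpy as np

def node_tbf(failure_starts, downtimes):
    """Per-node time between failures, with downtime (repair time) removed."""
    starts = np.asarray(failure_starts, dtype=float)
    ends = starts + np.asarray(downtimes, dtype=float)   # end of each outage
    # Uptime between failure i and failure i+1: next start minus previous end.
    return starts[1:] - ends[:-1]

def node_reliability(tbf, t):
    """R(t) = 1 - F(t), where F(t) is the empirical CDF of the TBF sample."""
    tbf = np.sort(np.asarray(tbf, dtype=float))
    F = np.searchsorted(tbf, t, side="right") / len(tbf)
    return 1.0 - F

def system_tbf(per_node_failure_times):
    """Figure 3 procedure: union of the failure-time sets T1..Tk of k nodes,
    with common failures counted once, sorted, then consecutive differences."""
    merged = np.unique(np.concatenate(
        [np.asarray(t, dtype=float) for t in per_node_failure_times]))
    return np.diff(merged)            # TBF = {s_(i+1) - s_i}, i = 1..m-1

# Tiny made-up example: two nodes sharing one common failure at t = 500 h.
node1_fails, node1_down = [100.0, 500.0, 900.0], [5.0, 10.0, 2.0]
node2_fails, node2_down = [300.0, 500.0], [1.0, 10.0]
print(node_tbf(node1_fails, node1_down))        # node 1 intervals: [295. 490.]
print(node_reliability(node_tbf(node1_fails, node1_down), 300.0))   # R(300 h)
print(system_tbf([node1_fails, node2_fails]))   # system TBF: [200. 200. 400.]
```

The de-duplication in system_tbf mirrors the rule from Section 3 that a failure affecting several nodes is counted only once when the system of nodes is analyzed.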
Figure 2. Time between failures obtained from the failure logs by removing downtimes and calculating the difference between successive failure times.

Algorithm for calculating the time between failures for a system of k nodes:
Let T1 = {t11, t12, t13, ...}, T2 = {t21, t22, t23, ...}, ..., Tk = {tk1, tk2, tk3, ...} be the sets of failure times on node 1, node 2, ..., node k, where, for example, tk2 denotes the start time of the second failure on node k. For a system of k nodes, the failure time set is the union F = T1 ∪ T2 ∪ ... ∪ Tk = {s1, s2, ..., sm}, with the elements ordered in time. The time between failures (TBF) of the k-node system is then TBF = {s(i+1) - si}, i = 1, 2, 3, ..., m-1.
Figure 3. Time between failures of a system of k nodes.

CDFs of the reliability models:
Exponential: F(t) = 1 - e^(-λt), where λ is the failure rate.
Weibull: F(t) = 1 - e^(-(t/c)^m), where c is the characteristic life and m is the shape parameter.
Lognormal: F(t) = Φ((ln t - ln t50)/σ), where σ is the shape parameter and t50 is the median life (50% failure point).

5. Comparison Results
We calculate the reliability of k nodes by analyzing the time between failures of the system comprising the k nodes. The time between failures is calculated using the algorithm given in Section 4. The fitted Exponential, Weibull and Lognormal failure distributions are then compared with the empirical failure distribution. We assume that parallel programs are usually allocated 2n processors. We show the time to failure distributions for k = 4, 64, 128 and 256. We compare the goodness of fit when nodes are selected randomly and when nodes are selected in order in Table 1. The Kolmogorov-Smirnov (K-S) goodness-of-fit test is used to compare the distributions; it gives the maximum distance between the empirical and theoretical distributions. A p-value less than or equal to 0.05 indicates that the distribution does not fit the data. Table 1 shows the goodness of fit for the two cases: when nodes are selected randomly, and when nodes are selected in order of node number. In both cases of our experiment, Weibull is observed to be a better model for the reliability of a system of k nodes than the Exponential and Lognormal fits.
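The comparison described above can be approximated with standard statistical tooling. The sketch below is an assumption about tooling, not the authors' procedure: it uses SciPy's parameterizations and fixes the location parameter at zero (the paper does not specify its fitting method), then fits the three candidate distributions to a system TBF sample and reports Kolmogorov-Smirnov statistics and p-values of the kind listed in Table 1.

```python
# Sketch of the Section 5 comparison using SciPy (not the authors' tooling).
# Fixing the location parameter at zero is an assumption.
import numpy as np
from scipy import stats

def ks_compare(tbf):
    """Fit Exponential, Weibull and Lognormal to the TBF sample and run the
    Kolmogorov-Smirnov test of each fitted CDF against the empirical data."""
    tbf = np.asarray(tbf, dtype=float)
    results = {}
    # Exponential: F(t) = 1 - exp(-lambda*t); SciPy's scale = 1/lambda.
    loc, scale = stats.expon.fit(tbf, floc=0)
    results["exponential"] = stats.kstest(tbf, "expon", args=(loc, scale))
    # Weibull: F(t) = 1 - exp(-(t/c)^m); shape m, scale c (characteristic life).
    m, loc, c = stats.weibull_min.fit(tbf, floc=0)
    results["weibull"] = stats.kstest(tbf, "weibull_min", args=(m, loc, c))
    # Lognormal: F(t) = Phi((ln t - ln t50)/sigma); shape sigma, scale t50.
    sigma, loc, t50 = stats.lognorm.fit(tbf, floc=0)
    results["lognormal"] = stats.kstest(tbf, "lognorm", args=(sigma, loc, t50))
    return results

# Synthetic Weibull-distributed inter-failure times (hours), for illustration.
sample = stats.weibull_min.rvs(0.7, scale=120.0, size=200, random_state=0)
for name, res in ks_compare(sample).items():
    print(f"{name:12s} D = {res.statistic:.3f}  p = {res.pvalue:.3f}")
```

Note that K-S p-values computed with parameters estimated from the same sample are known to be optimistic; the sketch simply mirrors the comparison described above.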
Figure 4. Comparison of empirical and Weibull CDFs when k = 4, 64, 128 and 256 nodes are selected. [Figure panels show K = 8, 16, 32, 64, 128 and 256 nodes.]
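Once Weibull parameters have been fitted for a given set of nodes, the predicted reliability follows directly from the Weibull CDF listed in Section 4. The short sketch below shows this downstream use; the shape and characteristic-life values are illustrative placeholders, not fitted results from the paper's data.

```python
# Hypothetical downstream use of a fitted Weibull model: predicted system
# reliability and mean time between failures for a set of allocated nodes.
# The parameter values below are illustrative, not results from Table 1.
import math

def weibull_reliability(t, m, c):
    """R(t) = exp(-(t/c)^m), with shape m and characteristic life c (hours)."""
    return math.exp(-((t / c) ** m))

def weibull_mtbf(m, c):
    """Mean of the Weibull distribution: c * Gamma(1 + 1/m)."""
    return c * math.gamma(1.0 + 1.0 / m)

m, c = 0.7, 120.0        # illustrative parameters for some k-node system
print(weibull_reliability(8.0, m, c))   # probability of surviving an 8 h job
print(weibull_mtbf(m, c))               # expected hours between system failures
```

A resource manager or reliability-aware checkpoint scheduler of the kind discussed in Sections 1 and 6 could consume exactly this kind of R(t) estimate when selecting nodes or checkpoint intervals.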
Table 1: Comparison of failure distributions using the Kolmogorov-Smirnov Goodness-of-Fit Test

Comparison of Kolmogorov-Smirnov goodness-of-fit test for various numbers of nodes (selected in order)

No. of Nodes K   Exponential p-value   Weibull p-value   Lognormal p-value
2                0.2628                0.8679            0.5409
4                0.2049                0.4310            0.9034
8                0.1916                0.9980            0.8571
16               0.0818                0.9845            0.3269
32               0.0002                0.6300            0.0438
64               0.0000                0.7122            0.1538
128              0.0000                0.2652            0.0779
256              0.0000                0.0599            0.0000
350              0.0000                0.0388            0.0000

Comparison of Kolmogorov-Smirnov goodness-of-fit test for various numbers of nodes (selected randomly)

No. of Nodes K   Exponential p-value   Weibull p-value   Lognormal p-value
2                0.6060                0.4460            0.6573
4                0.9940                0.8151            0.9852
8                0.2272                0.5758            0.7485
16               0.3193                0.7091            0.4671
32               0.4829                0.4829            0.2460
64               0.0224                0.2484            0.0785
128              0.0000                0.1169            0.0061
256              0.0000                0.0453            0.0000
350              0.0000                0.0159            0.0000

6. Conclusion and Future work
Failures and downtimes are a growing concern for large scale HPC systems. System performance and quality of service can be adversely affected by resource outages. Current checkpoint-based fault tolerance can have a significant overhead. In addition, failure prediction based on prior events is still in the initial stages of research. Thus, we believe in a reliability-aware approach: a resource manager can exploit the reliability information to better allocate a particular job so that the job completion time is minimized. In addition, reliability-aware checkpointing algorithms can schedule checkpoints at intervals based on the reliability of the resources to minimize checkpoint overhead. These services require reliability information when a parallel application is allocated to a single node or a group of nodes. In this paper, we described an approach to evaluate the reliability of a single node and then of a system of k nodes. Reliability is estimated based on the time between failures data obtained from the failure history of the nodes. Using the actual failure trace obtained from a prominent HPC platform, we studied and compared the appropriateness of different distributions, Exponential, Weibull and Lognormal, for various systems of k nodes. Our results indicate that the Weibull distribution results in the better reliability model in most of the cases for the given data. In the future, we plan to investigate the performance improvement in resource management and checkpoint interval selection with different reliability prediction techniques.

7. References
[1] A. J. Oliner, R. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware Job Scheduling for BlueGene/L Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[2] C. Engelmann and G. A. Geist. "Super-Scalable Algorithms for Computing on 100,000 Processors". In Proceedings of the International Conference on Computational Science (ICCS), Atlanta, GA, USA, May 2005.
[3] Elmootazbellah N. Elnozahy, James S. Plank. "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, April-June 2004.
[4] H. Song, C. Leangsuksun, N. R. Gottumukkala, S. L. Scott, and A. Yoo. Near-real-time availability monitoring and modeling for HPC/HEC runtime system. In Proceedings of the Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11-13, 2005.
[5] James S. Plank and Michael G. Thomason, "The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters," 29th International Symposium on Fault-Tolerant Computing, Madison, WI, June 1999, pp. 250-259.
[6] Lawrence Livermore National Laboratory Trace Logs: http://www.llnl.gov/asci/platforms/white/
[7] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-435, August 2003.
[8] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 2006.
[9] Top 500 Supercomputing Sites List, July 2006: http://www.top500.org/
[10] Y. Liu and C. B. Leangsuksun. "Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off". Submitted to IEEE Cluster Computing (Cluster), Boston, MA, USA, September 2005.
[11] Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," pp. 425-434, International Conference on Dependable Systems and Networks (DSN'06), 2006.