Reliability Analysis in HPC clusters




Narasimha Raju Gottumukkala, Yudan Liu, Chokchai Box Leangsuksun¹, Raja Nassar, Stephen Scott²
College of Engineering & Science, Louisiana Tech University
Oak Ridge National Lab²
{nrg003, yli010, box, nassar}@latech.edu, sscott@ornl.gov

¹ Research supported by the Department of Energy Grant no: DE-FG02-05ER25659.
² Research supported by the Mathematics, Information and Computational Sciences Office, Office of Advanced Scientific Computing Research, Office of Science, U. S. Department of Energy, under contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Abstract

Resource failures and down times have become a growing concern for large-scale computational platforms, as they tend to have an adverse effect on the performance of the computation system. Reliability-aware resource allocation and checkpointing algorithms have been investigated to minimize the performance loss due to unexpected failures. The effectiveness of a reliability-aware policy relies on the accuracy of reliability prediction. The reliability of a group of nodes is evaluated as a combination of individual node information under the assumption that the reliability of each node is independent. In this paper, we describe a reliability analysis based on time between failures for a system/group of nodes. Various reliability models are compared for different cases in order to find an optimal reliability model. The reliability models and analysis techniques were evaluated based on actual failure data of 512 nodes during their four year operational period.

1. Introduction

Increasing demand for computing power in scientific and engineering applications has spurred deployment of high performance computing (HPC) systems that deliver tera-scale performance. Current and future HPC systems that are capable of running large-scale parallel applications may span hundreds of thousands of nodes. In fact, top500.org reports the current highest processor count to be 131K nodes [9]. For parallel programs, the failure probability increases significantly with the number of nodes. Ignoring failures or system reliability can have a severe effect on the performance of the HPC cluster and on quality of service.

A reliability monitoring and analysis framework [4] provides up-to-date reliability of selected components. A resource manager can use the reliability information of nodes to schedule a parallel job on a set of nodes so as to minimize job completion time. A checkpoint algorithm can also use this information to schedule a checkpoint based on the failure probability of the selected nodes [10], in order to minimize the performance loss.

The reliability of a set of nodes is built from the failure information of the individual nodes. System reliability is the reliability of one or more nodes and/or components that are required for the successful running of a parallel application. In this paper, we analyze the system reliability by considering the failure events of the system comprising k nodes. Based on the reliability monitoring and analysis framework from our prior work [4], we present an approach to calculate the system reliability of k nodes from the failure event logs of individual nodes. We also compare various reliability models based on our approach of calculating system reliability.

The rest of the paper is organized as follows. Section 2 discusses the related work in reliability modeling and failure prediction for HPC. Section 3 briefly describes various categories of failure events. Section 4 presents the time to failure reliability models and the proposed algorithm for combining time to failure data of k nodes. Section 5 reports the results of comparing reliability models when different numbers of nodes are selected, and finally Section 6 presents the conclusion and future work.

2. Related Work

Failures in large-scale HPC systems have an adverse impact on both performance and quality of service. There have been efforts to predict failures based on historical data, and to develop failure-aware scheduling algorithms [1][7][8]. Failure prediction based on individual events is still an area of ongoing research, and it may not be possible to predict the exact type and time of a failure. While current failure prediction techniques are not suitable as an alternative means of enabling fault tolerance, the failure probability can be used to develop reliability-aware schemes [10][8][5], such as checkpointing, to minimize performance loss. Our reliability analysis is based on widely used statistical reliability models.

3. Failures in large scale HPC systems

For a parallel program running on a set of k nodes, if any component in a node, or a component common to the set of nodes (e.g. a network hub), fails, the program running on all the k nodes is affected. Figure 1 shows examples of hardware and software failures that affect a single node, a group of nodes and all the nodes. In the failure trace, the failure events that affect more than one node are recorded as failure events for each of the nodes that were impacted. When the failure events of each of the k nodes are combined, common failures are treated as one to avoid duplication when analyzing the system of nodes.

Figure 1: Description of Failures in an HPC platform

Failure Data

For our analysis, we use failure data of the ASC White machine [6], a 512 node cluster from LLNL. Each node is a 16-way SMP (symmetric multiprocessor), so there are 8192 processors in total. The failures and down times of each node were collected over a 4 year 3 month time period from 7/15/2000 to 10/1/2004 (a total of 37218 hours). The raw failure log consists of the date and time of the failure event, the down time, the type and a very brief description of the failure. Some of the failures, which we think do not affect the job, were filtered out. For example, some failures were only repaired after a long time, and did not affect the job runtime.

4. Time to Failure distribution

The failure data consists of failure times and down times of the nodes. The time between failures is assumed to be a random variable which follows a certain distribution. The actual failure times are obtained by subtracting the down times (see Figure 2). That is, suppose a node fails at times f1, f2, f3, ...; we fit the distribution of the time between failures (intervals), i.e. f2-f1, f3-f2, f4-f3, ..., and obtain F(t), the cumulative distribution of failure times, for individual nodes. We obtain the individual node reliabilities R1, R2, ..., RN from R = 1 - F(t).

4.1 Time between Failures for a system of k nodes

Parallel applications are allocated a set of k nodes for execution. Each node has an individual failure distribution. In our model, the system fails when at least one node fails. Thus, we are interested in the minimum time till failure when k nodes are combined. The times till failure of the individual nodes are combined to obtain the time to failure distribution of the k nodes, taken at the first failure to occur. The algorithm for combining the time between failures from individual nodes is described in Figure 3. We compare three different distributions in our reliability models, namely Exponential, Weibull and Lognormal. Figure 4 shows the CDFs (Cumulative Distribution Functions) of the reliability models.
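The combination step described in Section 4.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (failure time, downtime) event layout and the sample numbers are hypothetical, and a failure common to several nodes is assumed to carry the same timestamp on each affected node so that it can be counted once.

```python
from typing import Dict, List, Tuple

# Hypothetical event layout: (failure_time_h, downtime_h), in hours since the log start.
Event = Tuple[float, float]


def node_uptimes(events: List[Event]) -> List[float]:
    """Time between failures for one node, with the repair downtime removed."""
    ordered = sorted(events)
    gaps = []
    for (t_prev, down_prev), (t_next, _) in zip(ordered, ordered[1:]):
        # The node is only "up" again once the previous repair has finished.
        gaps.append(t_next - (t_prev + down_prev))
    return gaps


def system_tbf(failure_times: Dict[str, List[float]]) -> List[float]:
    """Time between failures of the system formed by all listed nodes.

    The per-node failure times are merged into one set (so a failure shared
    by several nodes is counted once), sorted, and differenced.
    """
    merged = sorted({t for times in failure_times.values() for t in times})
    return [later - earlier for earlier, later in zip(merged, merged[1:])]


# Made-up data for two nodes; the failure at t = 50 h is common to both.
logs = {"node1": [10.0, 50.0, 120.0], "node2": [30.0, 50.0, 200.0]}
intervals = system_tbf(logs)
single_node = node_uptimes([(10.0, 2.0), (50.0, 5.0), (120.0, 1.0)])
```

Using a set for the merge is one simple way to treat common failures as a single event; real logs would likely need a tolerance window instead of exact timestamp equality.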

Figure 2: Time between failures obtained from failure logs by removing downtimes, and calculating the difference between times.

Algorithm for calculating the time between failures for a system of k nodes

$T_1 = \{T_{11}, T_{12}, T_{13}, \ldots\}$
$T_2 = \{T_{21}, T_{22}, T_{23}, \ldots\}$
...
$T_k = \{T_{k1}, T_{k2}, T_{k3}, \ldots\}$

where $T_1, T_2, \ldots, T_k$ are the sets of failure times on node 1, node 2, ..., node k; for example, $T_{k2}$ means the start time of the second failure on node k.

For a system of k nodes, the failure time set TF is

$TF = \bigcup_{i=1}^{k} T_i = \{s_1, s_2, s_3, \ldots, s_m\}$

The time between failures (TBF) of the k-node system is:

$TBF = \{\, s_{i+1} - s_i \,\}, \quad i = 1, 2, 3, \ldots, m-1$

Figure 3. Time between failures of a system of k nodes

5. Comparison Results

We calculate the reliability of k nodes by analyzing the time between failures of the system comprising the k nodes. The time between failures is calculated using the algorithm given in Section 4. The exponential, Weibull and lognormal failure distributions are then compared with the empirical failure distribution. We assume that parallel programs are usually allocated to $2^n$ processors. We show the time to failure distributions for k = 4, 64, 128 and 256. We compare the goodness of fit when nodes are selected randomly, and when nodes are selected in order, in Table 1. The Kolmogorov-Smirnov (K-S) goodness of fit test is used to compare the distributions. The K-S goodness of fit test gives the maximum distance between the empirical and theoretical distributions. A p-value equal to or less than 0.05 indicates that the distribution does not fit the data. Table 1 shows the goodness of fit for two cases: when nodes are selected randomly, and when nodes are selected according to node number. In both cases of our experiment, Weibull is observed to be a better model for the reliability of a system of k nodes than the exponential and lognormal fits.

Distribution   CDF                                                     Parameters
Exponential    $F(t) = 1 - e^{-\lambda t}$                             $\lambda$ - failure rate
Weibull        $F(t) = 1 - e^{-(t/c)^m}$                               c - characteristic life, m - shape parameter
Log Normal     $F(t) = \Phi\left(\frac{\ln(t/T_{50})}{\sigma}\right)$  $\sigma$ - shape parameter, $T_{50}$ - median life at 50% failure point
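The three CDFs listed above can be written directly with the Python standard library. The following is a minimal sketch; the function and parameter names are chosen here for illustration, and the parameter values in the usage lines are arbitrary rather than fitted to the ASC White data.

```python
import math


def exponential_cdf(t: float, lam: float) -> float:
    """F(t) = 1 - exp(-lam * t), where lam is the failure rate."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0


def weibull_cdf(t: float, c: float, m: float) -> float:
    """F(t) = 1 - exp(-(t / c) ** m); c is the characteristic life, m the shape."""
    return 1.0 - math.exp(-((t / c) ** m)) if t > 0 else 0.0


def lognormal_cdf(t: float, t50: float, sigma: float) -> float:
    """F(t) = Phi(ln(t / t50) / sigma); t50 is the median life, sigma the shape."""
    if t <= 0:
        return 0.0
    z = math.log(t / t50) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reliability(cdf_value: float) -> float:
    """R(t) = 1 - F(t), as used for the node reliabilities in Section 4."""
    return 1.0 - cdf_value


# Arbitrary illustrative parameters (hours).
r_exp = reliability(exponential_cdf(100.0, lam=0.005))
r_wbl = reliability(weibull_cdf(100.0, c=200.0, m=0.8))
r_lgn = reliability(lognormal_cdf(100.0, t50=100.0, sigma=1.2))
```

Two sanity checks follow from the formulas themselves: a Weibull with m = 1 reduces to an exponential with failure rate 1/c, and the lognormal CDF equals 0.5 at t = T50.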

Figure 4. Comparison of empirical and Weibull CDFs when k=4, 64, 128 and 256 nodes are selected. (Panels: K = 8, 16, 32, 64, 128 and 256 nodes.)
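The K-S distance used in Section 5 is the largest vertical gap between an empirical CDF and a theoretical CDF, the two kinds of curve compared in Figure 4. The sketch below computes that distance with the standard library only. The function names, the sample intervals and the Weibull parameters are made up for illustration, and turning the distance into a p-value (as reported in the paper's table) additionally requires the K-S distribution, which a statistics package such as SciPy's `scipy.stats.kstest` provides.

```python
import math
from typing import Callable, Sequence


def ks_distance(samples: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Maximum distance between the empirical CDF of samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def weibull_cdf(t: float, c: float, m: float) -> float:
    """Weibull CDF with characteristic life c and shape parameter m."""
    return 1.0 - math.exp(-((t / c) ** m)) if t > 0 else 0.0


# Made-up system time-between-failure intervals (hours) and model parameters.
tbf_hours = [12.0, 30.0, 45.0, 80.0, 150.0, 310.0]
distance = ks_distance(tbf_hours, lambda t: weibull_cdf(t, c=100.0, m=0.9))
```

A smaller distance means the model curve tracks the empirical curve more closely; the comparison across distributions in the paper is made on the resulting p-values rather than on the raw distance.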

Table 1: Comparison of Failure distributions using the Kolmogorov-Smirnov Goodness of Fit Test

Comparison of Kolmogorov-Smirnov Goodness-of-Fit Test for various numbers of nodes (selected in order)

No. of Nodes K   p-value Exponential   p-value Weibull   p-value Lognormal
2                0.2628                0.8679            0.5409
4                0.2049                0.4310            0.9034
8                0.1916                0.9980            0.8571
16               0.0818                0.9845            0.3269
32               0.0002                0.6300            0.0438
64               0.0000                0.7122            0.1538
128              0.0000                0.2652            0.0779
256              0.0000                0.0599            0.0000
350              0.0000                0.0388            0.0000

Comparison of Kolmogorov-Smirnov Goodness-of-Fit Test for various numbers of nodes (selected randomly)

No. of Nodes K   p-value Exponential   p-value Weibull   p-value Lognormal
2                0.6060                0.4460            0.6573
4                0.9940                0.8151            0.9852
8                0.2272                0.5758            0.7485
16               0.3193                0.7091            0.4671
32               0.4829                0.4829            0.2460
64               0.0224                0.2484            0.0785
128              0.0000                0.1169            0.0061
256              0.0000                0.0453            0.0000
350              0.0000                0.0159            0.0000

6. Conclusion and Future work

Failures and downtimes are a growing concern for large scale HPC systems. System performance and quality of service can be adversely affected by resource outages. Current checkpoint based fault tolerance can have a significant overhead. In addition, failure prediction based on prior events is still in the initial stages of research. Thus, we believe in a reliability-aware approach: for example, a resource manager can exploit the reliability information to better allocate a particular job so that the job completion time is minimized. In addition, reliability-aware checkpointing algorithms can schedule checkpoints at intervals based on the reliability of resources to minimize checkpoint overhead. These services require reliability information when a parallel application is allocated to a single node or a group of nodes. In this paper, we described an approach to evaluate the reliability of a single node, and then of a system of k nodes. Reliability is estimated based on the time between failures data obtained from the failure history of the nodes.
Using the actual failure trace obtained from a prominent HPC platform, we studied and compared the appropriateness of different distributions (Exponential, Weibull and Lognormal) for various cases of systems of k nodes. Our results indicate that the Weibull distribution results in the better reliability model in most of the cases for the given data. In the future, we plan to investigate the performance improvement in resource management and checkpoint interval selection with different reliability prediction techniques.

7. References:

[1] A. J. Oliner, R. Sahoo, J. E. Moreira, M. Gupta, and A. Sivasubramaniam. Fault-aware Job Scheduling for BlueGene/L Systems. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[2] Engelmann and G. A. Geist. "Super-Scalable Algorithms for Computing on 100,000 Processors". Proceedings of International Conference on Computational Science (ICCS), Atlanta, GA, USA, May 2005.
[3] Elmootazbellah N. Elnozahy, James S. Plank. "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Transactions on Dependable and Secure Computing, vol. 01, no. 2, pp. 97-108, April-June, 2004.
[4] H. Song, C. Leangsuksun, N. R. Gottumukkala, S. L. Scott, and A. Yoo. Near-real-time availability monitoring and modeling for HPC/HEC runtime system. In Proceedings of Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11-13, 2005.
[5] James S. Plank and Michael G. Thomason, "The Average Availability of Parallel Checkpointing Systems and Its Importance in Selecting Runtime Parameters," 29th International Symposium on Fault-Tolerant Computing, Madison, WI, June, 1999, pp. 250-259.

[6] Lawrence Livermore National Laboratory Trace Logs: url: http://www.llnl.gov/asci/platforms/white/
[7] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the ACM SIGKDD, Intl. Conf. on Knowledge Discovery Data Mining, pages 426-435, August 2003.
[8] Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, June 2006.
[9] Top 500 Super Computing Sites List, July 2006, url: http://www.top500.org/
[10] Y. Liu and C. B. Leangsuksun. "Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off". Submitted to IEEE Cluster Computing (Cluster), Boston, MA, USA, September 2005.
[11] Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra Sahoo, "BlueGene/L Failure Analysis and Prediction Models," dsn, pp. 425-434, International Conference on Dependable Systems and Networks (DSN'06), 2006.