Cloud Service Reliability: Modeling and Analysis




Yuan-Shun Dai *,a,c, Bo Yang b, Jack Dongarra a, Gewei Zhang c
a Innovative Computing Laboratory, Department of Electrical Engineering & Computer Science, University of Tennessee, Knoxville, TN, USA
b Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, Chengdu, China
c Department of Industrial and Information Engineering, University of Tennessee, Knoxville, TN, USA

Abstract
Cloud computing is a recently developed technology for complex systems with massive-scale service sharing, which differs from the resource sharing of grid computing systems. Cloud reliability analysis and modeling are not easy tasks because of the complexity and large scale of the system. This paper systematically analyzes cloud computing and models the reliability of cloud services. Various types of failures are interleaved in the cloud computing environment, such as overflow failure, timeout failure, resource missing failure, network failure, hardware failure, software failure, and database failure. This paper investigates all of them to achieve a comprehensive picture of cloud service reliability, and models those failures in a holistic manner using Markov models, queuing theory and graph theory. In accordance with the proposed model, a new evaluation algorithm is further developed that integrates Bayesian approaches with graph theory.

Keywords: Cloud computing, reliability modeling, graph theory, queuing theory, Bayesian analysis

1. Introduction
Cloud computing enables massive-scale service sharing, which allows users to access technology-enabled services without knowledge of, expertise with, or control over the technology infrastructure that supports them. Cloud computing is different from but related to grid computing, utility computing and transparent computing. Grid computing [1] is a form of distributed computing whereby a "super and virtual computer" composed of a cluster of networked, loosely coupled computers acts in concert to perform very large tasks. Utility computing [2] is the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility such as electricity. Transparent computing [3] means that complex back-end services are transparent to users, who see only a simple and easy-to-use front-end interface. Today's cloud computing deployments are powered by grids, have transparent characteristics and are billed like utilities; cloud computing is thus a natural next step from the grid-utility-transparent model. Based on this model, cloud computing can realize service sharing rather than only the resource sharing coined by grid computing. Cloud computing is thus more service-oriented than resource-oriented. Dai et al. [4] have already noted that users do not care much about the resources of the grid system but are more concerned with the services they are using. Hence, the service sharing enabled by cloud computing will be more interesting to general users than the resource sharing of grid computing.
A variety of cloud services are provided by the cloud system. The cloud system could become very large, even spanning the whole Internet. Users can request cloud services from any corner of the world. Some examples of commercial cloud services include Amazon EC2 [5], Xen [6], Google Cloud [7], IBM Cloud [8], and Microsoft Cloud [9].
The reliability of cloud computing is critical but hard to analyze because of massive-scale service sharing, wide-area networks, heterogeneous software/hardware components and the complicated interactions among them. Hence, the reliability models for pure software/hardware or conventional networks [10-11] cannot simply be applied to study cloud reliability. This paper therefore first presents an innovative reliability model for cloud computing. The cloud reliability model is service oriented and hierarchical, which makes it tractable and effective in addressing such a large and complex system. This new model comprehensively considers the types of failures that have significant influence on the success or failure of cloud services, including overflow, timeout, data resource missing, computing resource missing, software failure, database failure, hardware failure, and network failure.
The remainder of this paper is organized as follows. Section 2 presents a general architecture of the cloud computing system and makes a thorough analysis of cloud services. Section 3 builds a holistic model for cloud service reliability and presents a new evaluation algorithm. Section 4 concludes this paper and discusses future research.

2. Cloud Computing System and Failure Analysis
Cloud computing is distinguished from conventional distributed computing by its focus on massive-scale service sharing. The characteristics of cloud computing are described in subsection 2.1, and the various failures in a cloud service are analyzed in subsection 2.2.

2.1. Description of the cloud computing
We are developing a cloud computing system in the VGrADS (Virtual Grid Application Development Software) project sponsored by the National Science Foundation (NSF). This system is already collaborating and integrated with Amazon EC2 [5]. The architecture of our cloud service system is depicted in Fig. 1, which is also a typical representation of most present or future cloud service systems. There is a cloud management system (CMS), which is composed of a set of servers (either centralized or distributed). The CMS mainly fulfills four different functions, as shown in Fig. 1:
1) To manage a request queue that receives job requests from different users for cloud services;
2) To manage computing resources (such as PCs, clusters, supercomputers, etc.) all over the Internet;
3) To manage data resources (such as databases, publicized information, URL contents, etc.) all over the Internet; and
4) To schedule a request, divide it into different subtasks, and assign the subtasks to different computing resources that may access different data resources over the Internet.

[Figure 1: users send job requests over the Internet to the Cloud Management System (CMS), which comprises the request queue, the computing resource manager, the data resource manager and the scheduler; the scheduler sends subtasks to the computing resources (C) and data resources (D) connected through the network, and results are returned to the users.]
Fig. 1. Cloud Service System.

When a user requests a given cloud service, we apply a workflow to describe and manage the cloud service [2]. Fig. 2 depicts a workflow template of a service that includes four different subtasks (S1, S2, S3, S4) and their interrelationship (data dependency), e.g. S3 needs the inputs that result from S1 and S2. It also shows the data resources that the subtasks need to access, e.g. S1 needs to access data resource D1 when running, S2 needs D2 and D3, and S4 needs D4, but S3 needs nothing. Given the workflow of a cloud service, the scheduler in the CMS can assign these subtasks to different computing resources while allocating the data resources, as shown in Fig. 2: the computing resource C1 is assigned two subtasks, S1 and S3, to run; C5 is a data resource offering data D2, D3 and D4; and C3 is both a computing resource and a data resource, running subtask S2 while offering data D1 and D3. After the computing resources and data resources receive the commands/subtasks from the CMS, they form a network according to connectivity or accessibility, e.g. C3 is directly connected with C5 but cannot directly communicate with C4 (for instance, computers C3 and C4 may both be behind routers that translate their original IP addresses so that they cannot directly build a TCP/IP connection, or they may not have access to each other [3]).
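The workflow and assignment described above can be written down as plain data. The sketch below is illustrative only: the dictionary names (`precedents`, `required_data`, `runs_on`, `offers`) and the helper `data_sources` are hypothetical, and it records only what the text states. The text gives no dependency for S4 and does not say which resource runs S4, so neither is assumed here.

```python
from graphlib import TopologicalSorter

# Subtask -> subtasks whose results it needs (the text states only the S3 dependency).
precedents = {"S1": set(), "S2": set(), "S3": {"S1", "S2"}, "S4": set()}
# Subtask -> data resources it must access while running.
required_data = {"S1": {"D1"}, "S2": {"D2", "D3"}, "S3": set(), "S4": {"D4"}}
# Resource -> subtasks assigned to it (the host of S4 is not stated in the text).
runs_on = {"C1": {"S1", "S3"}, "C3": {"S2"}}
# Resource -> data it offers.
offers = {"C3": {"D1", "D3"}, "C5": {"D2", "D3", "D4"}}

def data_sources(subtask):
    """Map each data item the subtask needs to the resources that offer it."""
    return {d: sorted(c for c, items in offers.items() if d in items)
            for d in sorted(required_data[subtask])}

# One execution order that respects the data dependencies.
order = list(TopologicalSorter(precedents).static_order())
print(order)
print(data_sources("S2"))
```

Calling `data_sources("S2")` lists C5 for D2 and both C3 and C5 for D3, which is the kind of lookup a scheduler needs when it allocates data resources to a subtask.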

[Figure 2: workflow of a service with subtasks S1 to S4 and their required data (D1 for S1; D2 and D3 for S2; D4 for S4), and the scheduler's assignment of subtasks and data to the networked resources C1 to C6.]
Fig. 2. Workflow of a Cloud Service and Scheduling.

The cloud network shown in Fig. 2 can be very large, and each link in Fig. 2 is actually a virtual link that may go through many components (routers/cables/optical fibers/machines) over a long distance. Thus, the computing resources work together via the network to run the subtasks while accessing the necessary data from the data resources. When the job is finished, the results are returned to the user who requested the service, as shown in Fig. 1.

2.2. Failure Analysis of Cloud Service
As Fig. 1 and Fig. 2 show, a variety of failure types may affect the success/reliability of a cloud service, including overflow, timeout, data resource missing, computing resource missing, software failure, database failure, hardware failure, and network failure. We analyze these failures in more detail.
Overflow: The request queue should have a limit on the maximal number of requests waiting in the queue. Otherwise, new requests would have to wait too long in the queue, which could make timeout failures much more dominant. Therefore, if the queue is full when a new job request arrives, the request is simply dropped and the user is unable to get service, which is called an overflow failure.
Timeout: The cloud service usually has a due time set by the user or the service monitor. If the waiting time of the request in the queue exceeds the due time, a timeout failure occurs, see e.g. [4]. As a result, timed-out requests are dropped from the queue so as not to affect the following requests.
Data resource missing: In the CMS, the data resource manager (DRM) registers all data resources. However, it is possible that some previously registered data are removed but the DRM is not updated. As a result, if those data resources are assigned in a job request, they cause a data resource missing failure.
Computing resource missing: As with data resources, a computing resource may go missing, for example when a PC is turned off without notifying the CMS.
Software failure: The subtasks are actually software programs running on different computing resources, and these programs contain software faults, see e.g. [5].
Database failure: The database that stores the required data resources may also fail, so that the running subtasks cannot access the required data.
Hardware failure: The computing resources and data resources in general have hardware (such as computers or servers), which may also encounter hardware failures.
Network failure: When subtasks access remote data, the communication channels may be broken either physically or logically, which causes a network failure, especially for long transmissions of large datasets, see e.g. [6].

The model for cloud computing reliability has to consider all of these failure types, which is very complicated; existing reliability models cannot address all of these concerns in a holistic manner, although each single topic has been studied. Moreover, these different types of failures are correlated with one another (i.e., not independent) in a cloud service, which is another reason why the cloud reliability model cannot simply reuse any single existing model from an individual topic (such as software reliability, hardware reliability, or network reliability). For example, failures of schedulers may increase the waiting time, which could affect the timeout and overflow failures; a large queue limit may reduce the probability of overflow failure but increase that of timeout failure; a database failure may leave a software program unable to finish for lack of necessary data; and network failures may block the communications that software programs need in order to get inputs from others.
With such correlations, it is clear that a new holistic model has to be developed for cloud reliability.

3. Cloud Service Reliability Modeling and Evaluation
In this section, we develop a holistic model for cloud service reliability, which is defined as the probability that a cloud service under consideration can be successfully completed for a user in a specified period of time. In particular, this requires that the job request be successfully served by the schedulers in time, that the set of subtasks contained in the service be completed, that the computing/data resources required by the subtasks be available, and that the network be operational during the communications. From this definition, it is clear that all types of failures discussed in Section 2 will more or less affect the probability of providing a successful service. We classify the above failures into two groups:
1. Request Stage Failures: overflow and timeout.
2. Execution Stage Failures: data resource missing, computing resource missing, software failure, database failure, hardware failure, and network failure.
The failures in Group 1 may occur before the job request is successfully assigned to computing/data resources; the failures in Group 2 may occur after the job request has been successfully assigned, during the execution of subtasks. Therefore, the two groups of failures can be deemed independent. Nevertheless, failures within each group are strongly correlated. In summary, the modeling of cloud service reliability can be separated into two parts: modeling of Request Stage reliability and modeling of Execution Stage reliability.

3.1. Request Stage Reliability
The request stage contains two types of failures: overflow and timeout. The due time for a specific service is the time allowed from the submission of the job request to the completion of the job. The due time can be set by the user or by the service monitor. If a job request is not served by a scheduler before the due time, it is dropped. The dropping rate is denoted by $\mu_d$. Suppose the capacity of the request queue is $N$ (the maximal number of requests in the queue). We assume that job requests arrive according to a Poisson process with arrival rate $\lambda_a$. Usually, multiple scheduler servers serve the requests. These scheduler servers are usually homogeneous, with similar structures, schemes and equipment.
Here, we assume a total of $S$ homogeneous scheduler servers running simultaneously to serve the requests. The service time to complete one request by each scheduler server is assumed to be exponentially distributed with parameter $\mu$. Thus, this process can be modeled by a Markov process as depicted in Fig. 3, in which state $n$ ($n = 0, 1, \ldots, N$) represents the number of requests in the queue.

[Figure 3: state-transition diagram with states 0, 1, 2, ..., S-1, S, S+1, ..., N-1, N; each forward transition has rate $\lambda_a$; the backward transitions have rates $\mu + \mu_d$, $2\mu + 2\mu_d$, ..., $S\mu + S\mu_d$, $S\mu + (S+1)\mu_d$, ..., $S\mu + N\mu_d$.]
Fig. 3. Markov model for the request queue.

In Fig. 3, the transition rate from state $n$ to state $n+1$ is $\lambda_a$. At state $N$, the arrival of a new request makes the request queue overflow, so the request is dropped and the queue stays at state $N$. The service rate of a request by a scheduler server is $\mu$. If $n \le S$, then all $n$ requests can be served immediately by the $S$ scheduler servers, so the departure rate of any one request is equal to $n\mu$. If $n > S$, only $S$ requests are being served simultaneously by the scheduler servers, so the departure rate is $S\mu$. The dropping rate for any one request in the queue to reach its due time is $n\mu_d$ ($n = 1, 2, \ldots, N$). Denote by $q_n$ the steady-state probability that the system stays at state $n$ ($n = 0, 1, \ldots, N$). It is easy to derive $q_n$ by solving the following Chapman-Kolmogorov equations:
\[ \lambda_a q_0 = \mu q_1 \qquad (2) \]
\[ \Big( n\mu + \sum_{x=1}^{N-n} p(x)\lambda_a \Big) q_n = (n+1)\mu\, q_{n+1} + \sum_{y=0}^{n-1} p(n-y)\lambda_a q_y \quad (n = 1, \ldots, S-1) \qquad (3) \]
\[ \Big( S\mu + \sum_{x=1}^{N-n} p(x)\lambda_a \Big) q_n = S\mu\, q_{n+1} + \sum_{y=0}^{n-1} p(n-y)\lambda_a q_y \quad (n = S, \ldots, N-1) \qquad (4) \]
\[ S\mu\, q_N = \sum_{y=0}^{N-1} p(N-y)\lambda_a q_y \qquad (5) \]
and
\[ \sum_{n=0}^{N} q_n = 1 \qquad (6) \]
The probability for the overflow failure NOT to occur is thus
\[ R_{overflow} = \sum_{n=0}^{N-1} q_n, \qquad (7) \]
where $q_n$ ($n = 0, 1, \ldots, N$) can be obtained by solving equations (2)-(6).
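As a numerical illustration, the sketch below computes the steady-state probabilities of the birth-death chain drawn in Fig. 3 and the no-overflow probability of (7). It is a minimal sketch under stated assumptions, not the authors' implementation: it takes the transition rates exactly as described for Fig. 3 (arrival rate $\lambda_a$, departure rate $\min(n, S)\mu + n\mu_d$ out of state $n$) and uses the standard product-form solution of a birth-death chain rather than solving (2)-(6) directly. The function names and parameter values are hypothetical. The last function evaluates the timeout term with the Erlang waiting-time density that the text introduces next in (8)-(10).

```python
import math

def steady_state(lam_a, mu, mu_d, S, N):
    """Steady-state probabilities q_0..q_N of the birth-death chain of Fig. 3."""
    weights = [1.0]
    for n in range(1, N + 1):
        departure = min(n, S) * mu + n * mu_d   # service plus dropping out of state n
        weights.append(weights[-1] * lam_a / departure)
    total = sum(weights)
    return [w / total for w in weights]

def erlang_cdf(shape, rate, t):
    """P{X < t} for an Erlang variable with integer shape >= 1 and the given rate."""
    x = rate * t
    return 1.0 - math.exp(-x) * sum(x**k / math.factorial(k) for k in range(shape))

def request_reliability(q, mu, S, T_d):
    """Probability of neither overflow nor timeout, following equation (10)."""
    N = len(q) - 1
    served_at_once = sum(q[n] for n in range(min(S, N)))
    # f_n in (8) is an Erlang density with shape n - S + 1 and rate S * mu.
    served_in_time = sum(q[n] * erlang_cdf(n - S + 1, S * mu, T_d) for n in range(S, N))
    return served_at_once + served_in_time

# Hypothetical parameters: 8 requests/s arriving, 2 schedulers, queue capacity 10.
q = steady_state(lam_a=8.0, mu=3.0, mu_d=0.5, S=2, N=10)
r_overflow = sum(q[:-1])                  # equation (7): q_0 + ... + q_{N-1}
r_request = request_reliability(q, mu=3.0, S=2, T_d=1.0)
print(f"R_overflow = {r_overflow:.6f}, R_request = {r_request:.6f}")
```

Because every Erlang term is at most 1, the request-stage value can never exceed the no-overflow value, and the two coincide as the due time $T_d$ grows large.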

To study the timeout failure, suppose the current length of the request queue is $n$ ($n = 0, 1, \ldots, N-1$) when the new service request under consideration arrives. The probability density function (p.d.f.) of the waiting time for the $S$ scheduler servers to complete the $n$ requests is
\[ f_n(t) = S\mu \, \frac{(S\mu t)^{n-S}}{(n-S)!} \, e^{-S\mu t}, \quad t \ge 0 \text{ and } n \ge S. \qquad (8) \]
If the waiting time is longer than the due time $T_d$, the timeout failure occurs. Therefore, the probability that the waiting time in completing the $n+1$ requests is less than $T_d$ is
\[ P\{t < T_d\} = \int_0^{T_d} f_n(t)\,dt, \quad n \ge S. \qquad (9) \]
If $n < S$, the newly arrived request can be served immediately without any waiting time. Therefore, the probability that neither the timeout nor the overflow failure occurs (i.e. the request stage is reliable) is
\[ R_{request} = \sum_{n=0}^{S-1} q_n + \sum_{n=S}^{N-1} q_n \int_0^{T_d} f_n(t)\,dt \qquad (10) \]
where $f_n(t)$ is given by (8). The summation in (10) over $[0, N-1]$ already includes the condition that the overflow failure does not occur, as analyzed in (7). Thus, $R_{request}$ in (10) represents the probability that neither timeout nor overflow failures occur.

3.2. Execution Stage
3.2.1. A New Model
To address the various types of failures during the execution of a cloud service, we propose a new model here. All types of execution stage failures are integrated in this new model, as illustrated by the graph model in Fig. 4.

[Figure 4: graph of hardware nodes (solid circles) joined by network links (solid lines), with virtual nodes (dashed circles) for the databases and software programs they host; the elements are annotated with hardware failures, network failures, database failures, software failures, and data/computing resource missing (a special hardware failure).]
Fig. 4. A graph model integrating different types of failures at the execution stage.

In this model, hardware (such as a computer) is represented by a solid-line node, so the characteristics of the hardware (such as hardware failures, processing speed, etc.) can be associated with the node. A network link is represented by a solid line, which stands for a communication channel between two nodes, so the characteristics of the channel (such as link failure, bandwidth, etc.) can be associated with the link. Hardware may contain databases or software required by the cloud service, so we suggest using virtual nodes, drawn as dashed-line circles, to represent databases or software programs. The idea of virtual nodes differs from previous graph models for distributed computing systems [7]. Those models only place the software/database inside the hardware node, which fits the physical structure (software does run inside the computer hardware), but such physical representations cannot represent the heterogeneity of hardware/software/databases, so these models use only the node property to incorporate all the different characteristics. In cloud computing, however, the heterogeneity is significant and includes various kinds of resources, so these resources should be treated separately. The virtual nodes, which turn the physical structure inside out, fulfill this requirement. As a result, the characteristics of the database (such as size of data uploaded/downloaded, database failures, etc.) and of the software (such as software failures and the running time of the software) can be associated with different virtual nodes without interfering with the characteristics of the hardware (the solid-line node). A virtual link (dashed line) connects each virtual node to its host hardware node. The virtual structure in Fig. 4, while exhibiting the heterogeneity, can also show the failure correlation and thus match practice better: if the hardware fails, then all its virtual nodes (the components inside this hardware) are isolated from the outside, which means they become unavailable to external components at the same time. Finally, this graph model can also address missing data/computing resources. If missing resources are included by the cloud scheduler by mistake, the situation can be treated as the resource failing at the beginning of the execution of the cloud service. Thereby, the missing of resources can be incorporated in the hardware property as a special type of hardware failure. In summary, the new graph model built as per the above methodology can address these different failures in a holistic manner for a given cloud service during the execution stage.

3.2.2. Parameters
In accordance with the new model depicted in Fig. 4, the parameters of the different components are discussed here; they will be used in the proposed evaluation algorithm.
For the $i$-th hardware node ($i = 1, 2, \ldots, H$), denote by $ps_i$ its processing speed, e.g. in MIPS (million instructions per second).
For the $j$-th data resource ($j = 1, 2, \ldots, J$), denote by $sd_j$ the amount of data downloaded/uploaded by remote software programs, e.g. in MB (megabytes).
For software (such as a software program to complete a subtask), denote by $wp_k$ ($k = 1, 2, \ldots, K$) the workload of the $k$-th software program, e.g. in NoI (number of instructions) to be executed. Denote by $sd_{ij}$ ($i = 1, 2, \ldots, J$, $j = 1, 2, \ldots, J$, $i \ne j$) the amount of data exchanged between the $i$-th subtask and the $j$-th subtask.
Denote by $bw_m$ ($m = 1, 2, \ldots, M$) the bandwidth of the $m$-th communication link, e.g. in bps (bits per second).
Any of the hardware, database, software or link elements may encounter failures. The failure rate [11] is another parameter of interest. As explained in [8], no modifications are made to the software source code in the operational phase of software, so the software failure rate is constant.
For electronic hardware, a constant failure rate is normally observed in the operational phase as well. We thus denote by $\lambda(element_n)$ the failure rate of the $n$-th element. Therefore, the reliability of each individual element can be derived as
\[ R(element_n) = \exp\{-\lambda(element_n)\, T_w(element_n)\}, \qquad (11) \]

where $T_w(element_n)$ denotes the length of working time of the $n$-th element in a cloud service, which can be derived for each kind of element as follows.
The time that the $k$-th software program runs on the $i$-th machine is
\[ T_w(Software) = \frac{\text{Software Workload}}{\text{Processing Speed}} = \frac{wp_k}{ps_i} \qquad (12) \]
The time that the $m$-th communication link spends transmitting data is
\[ T_w(Communication) = \frac{\text{Amount of Data}}{\text{Bandwidth}} = \frac{sd_{ij}}{bw_m} \qquad (13) \]
The total working time of a hardware element has two parts, running software and transmitting data, thus
\[ T_w(Hardware) = \sum_{Hardware} T_w(Software) + \sum_{Hardware} T_w(Communication) \qquad (14) \]
which is the sum of the execution times of all software programs running on this hardware and the communication times of all channels going through this hardware. The working time of a data source is the sum of all communication times that access the data on the data source:
\[ T_w(DataSource) = \sum_{Data} T_w(Communication) \qquad (15) \]
With the working times derived by equations (12)-(15), the reliability of an individual element can be obtained from (11). This is more realistic and practical than conventional methods [7] that assume the reliabilities of elements (nodes and links) are constant (e.g. a node is always 90% reliable, regardless of how long it works). In fact, the reliability of an individual element is affected by various conditions such as failure rate, amount of data, bandwidth, operation time, etc.

3.2.3. New Evaluation Algorithm
Though the new graph model and the element parameters are more realistic and practical, they also make the evaluation of overall reliability much more complicated, so the existing algorithms [7] cannot be directly applied here. For instance, those conventional algorithms make one or more of the following assumptions, which do not hold for the above new model: 1) the network topology is made up of physical nodes/links without considering virtual nodes/links; 2) the operational probabilities (reliabilities) of nodes or links are constant; 3) only hardware failures of links and processors are considered, without taking into account software, data and resource failures. Therefore, we further present a new algorithm for evaluating the overall cloud service reliability that considers all of these factors during the execution stage, given the new graph model and the above parameters. The new evaluation algorithm, based on graph theory and the Bayesian theorem, is presented as follows.

A. Minimal Subtask Spanning Tree (MSST)
The set of all nodes and links involved in completing a specific subtask forms a Subtask Spanning Tree (SST). This SST can be considered a combination of several minimal subtask spanning trees (MSSTs), where each MSST represents a minimal possible combination of available elements (nodes and links) that guarantees the successful execution of this specific subtask (i.e., failure of any element in an MSST leads to the subtask failure). By this definition, each MSST contains exactly one set of data resources without any duplication, because any duplication could be reduced to another, smaller SST. Therefore, for any MSST, the data resources and precedent subtasks that provide input for the subtask are also determined. One can also obtain the working times of the different elements by (12)-(15). Some elements inside one MSST can still belong to several paths if they are involved in different communication tasks, such as data transmission or data resource access. Note that all elements in the execution stage are hot-standby, although some elements/subtasks may be waiting for the output of other subtasks. During the waiting period, those elements may therefore also fail. Thus, we suppose that an MSST completes the entire service if none of its elements fails during the maximal time allowed to complete all the subtasks in whose execution they are involved. Therefore, when calculating the element reliability in a given MSST, one has to use the corresponding record with the maximal time.
Assume there are a total of $K$ elements in an MSST, and let $element_i$ ($i = 1, 2, \ldots, K$) denote the $i$-th element in the MSST. Accordingly, the working time of the $i$-th element is denoted by $T_w(element_i)$ and $\lambda(element_i)$ represents its failure rate. The reliability of this single MSST can then be expressed as
\[ R_{MSST} = \prod_{i=1}^{K} \exp\{-\lambda(element_i)\, T_w(element_i)\} \qquad (16) \]
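Equations (11)-(16) can be checked with a few lines of arithmetic. The sketch below is a hedged illustration rather than the authors' code: the function names, the choice of seconds as the common time unit, the conversion of 1 MB to 8e6 bits, and all numeric values (workload, speed, bandwidth, failure rates) are assumptions made for the example.

```python
import math

def software_time(workload_noi, speed_mips):
    """Equation (12): workload in instructions, speed in MIPS, result in seconds."""
    return workload_noi / (speed_mips * 1e6)

def communication_time(data_mb, bandwidth_bps):
    """Equation (13): data in megabytes (taken as 8e6 bits), bandwidth in bits/s."""
    return data_mb * 8e6 / bandwidth_bps

def element_reliability(failure_rate, working_time):
    """Equation (11), with the failure rate per second and the time in seconds."""
    return math.exp(-failure_rate * working_time)

def msst_reliability(elements):
    """Equation (16): product over (failure_rate, working_time) pairs."""
    return math.prod(element_reliability(lam, t) for lam, t in elements)

# Hypothetical MSST: one program, one link, and the hardware node that hosts both.
t_sw = software_time(workload_noi=6e9, speed_mips=2000)        # 3 s
t_link = communication_time(data_mb=100, bandwidth_bps=100e6)  # 8 s
t_hw = t_sw + t_link                                           # equation (14): 11 s
elements = [(1e-4, t_sw), (2e-4, t_link), (5e-5, t_hw)]        # made-up failure rates
print(msst_reliability(elements))
```

The example makes the point of Section 3.2.2 concrete: the same hardware node becomes less reliable simply because it has to stay up for the sum of its software and communication times, rather than carrying a fixed reliability value.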

With equation (16), the reliability of an MSST can be computed once the working times of all its elements are obtained. Hence, finding all the MSSTs and determining the working times of their elements is the first step in deriving the execution reliability of a cloud service. To solve this graph traversal problem, several classical algorithms have been suggested, such as depth-first search and breadth-first search. These algorithms can find all MSSTs in an arbitrary graph. Here, we propose a depth-first search algorithm, which is briefly described as follows:
Step 1. Given a program/subtask, say $S_m$, start from a node that contains this program and search for the required data resources and precedent subtasks/programs along the possible links, recording the elements that compose the search route and their communication times.
Step 2. When all the required data resources and precedent subtasks/programs have been reached, an MSST is found; record this MSST.
Step 3. Try other routes to search for other MSSTs, until all the MSSTs are found.
Step 4. Change to another node that also contains the program $S_m$. Repeat the above three steps until all the nodes that have $S_m$ are evaluated. Save all the MSSTs found for $S_m$ into the vector $MSST(S_m)$.
Step 5. Change to another program and repeat the above four steps until all the programs are explored. Then all the vectors $MSST(S_m)$ ($m = 1, 2, \ldots, M$) are generated.

B. Minimal Execution Spanning Tree (MEST)
Similar to the MSST, a Minimal Execution Spanning Tree (MEST) represents a minimal possible combination of available elements (nodes and links) that guarantees the successful execution of the entire service. Thus, at least one MSST of each $MSST(S_m)$ ($m = 1, 2, \ldots, M$) must be reliable, so that the subtask $S_m$ ($m = 1, 2, \ldots, M$) can be connected to the remote resources and exchange data with them successfully through the network.
If one MSST for each of the M subtasks is successful, the cloud service can reliably execute the required set of subtasks, so the MEST can be derived as the intersection of the above sets of MSSTs:

$$MEST = \bigcap_{m=1}^{M} MSST(S_m) \qquad (17)$$

In practice, all MESTs can be generated in the following steps:

Step 1: Select an MSST from each set $MSST(S_m)$, where m = 1, 2, ..., M.

Step 2: Put the M selected MSSTs together to generate an MEST. For each element common to several of the intersected trees, record the greater working time as the final working time of this element in the MEST.

Step 3: Repeat Steps 1-2 until all combinations have been tried, generating all N MESTs.

Similar to (16), the reliability of a single MEST can be calculated by

$$R_{MEST} = \prod_{i \in MEST} \exp\{-\lambda(element_i)\, T_w(element_i)\} \qquad (18)$$

C. Execution Reliability

Having the list of N MESTs and the corresponding task completion times, one can determine the reliability of the cloud service at the execution stage as follows:

$$R_{execute} = P\left(\bigcup_{i=1}^{N} MEST_i\right) \qquad (19)$$

which means that the success of any one of the N MESTs makes the cloud service execute successfully in the execution stage. Denote by $E_j$ the event that $MEST_j$ operates successfully, and by $\bar{E}_j$ the event that $MEST_j$ fails. Using the Bayesian theorem on conditional probability, we can expand (19) into a summation of conditional probabilities:

$$R_{execute} = P\left(\bigcup_{i=1}^{N} MEST_i\right) = \sum_{j=1}^{N} P(E_j)\, P(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j) \qquad (20)$$

The probability $P(E_j)$ can be directly obtained from (18) as $R_{MEST_j}$, and the probability $P(\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j)$ can be computed by the following two-step algorithm.

Step 1 identifies the failures of all the critical elements, within a period of time, that lead to the failure of any one MEST among the previous j-1 MESTs but do not affect $MEST_j$.

Step 2 generates, by a binary search, all the possible combinations of the identified critical elements that lead to the event $\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j$, and computes the probabilities of those combinations. Their summation is $P\{\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_{j-1} \mid E_j\}$.
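The MEST generation steps and the execution-stage probability can be sketched together as below. Note that this sketch evaluates the probability of the union of MESTs by plain inclusion-exclusion, which is a swapped-in technique rather than the two-step algorithm described above; it is exponential in N and only meant for small illustrations. It assumes independent elements with constant failure rates, all working times measured from a common start, and trees represented as hypothetical dictionaries mapping an element name to its working time.

```python
import math
from itertools import combinations, product


def merge(trees):
    """Combine trees; a shared element keeps its greater working time."""
    merged = {}
    for tree in trees:
        for element, time in tree.items():
            merged[element] = max(time, merged.get(element, 0.0))
    return merged


def generate_mests(msst_sets):
    """Pick one MSST per subtask, over all combinations, and merge each pick."""
    return [merge(choice) for choice in product(*msst_sets)]


def reliability(tree, rates):
    """Product of exp(-rate * working time) over the elements of a tree."""
    return math.exp(-sum(rates[e] * t for e, t in tree.items()))


def execution_reliability(mests, rates):
    """P(at least one MEST works), by inclusion-exclusion over subsets.

    Under the stated assumptions, several MESTs all work exactly when every
    element in their union survives its greatest required working time, so
    the joint probability is the reliability of the merged tree.
    """
    total = 0.0
    for k in range(1, len(mests) + 1):
        sign = 1.0 if k % 2 else -1.0
        for subset in combinations(mests, k):
            total += sign * reliability(merge(subset), rates)
    return total


# Hypothetical data: two subtasks, failure rates per element.
rates = {"a": 0.01, "b": 0.02, "c": 0.03}
msst_sets = [
    [{"a": 1.0, "b": 2.0}, {"c": 1.0}],  # MSSTs of the first subtask
    [{"a": 3.0}],                        # MSSTs of the second subtask
]
mests = generate_mests(msst_sets)
r_execute = execution_reliability(mests, rates)
```

In this example element "a" appears in both subtasks' trees, so each MEST keeps its greater working time of 3.0.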

When calculating the failure probabilities of the elements of an MEST, the maximal time among the corresponding records in the list for the given MEST should be used.

Finally, for a cloud service to be completed successfully, both the request stage and the execution stage must be reliable. Having derived the reliability of both stages, we obtain the cloud service reliability as

$$R_{Service} = R_{request} \times R_{execute} \qquad (21)$$

where $R_{request}$ can be derived from the reliability of the request stage by (10), and $R_{execute}$ can be derived from the reliability of the execution stage by (20).

4. Conclusion and Discussion

In this paper, reliability modeling and analysis of cloud service is conducted. We first elaborate the various types of possible failures in a cloud service, based on which a holistic reliability model is developed. A new algorithm is proposed to evaluate cloud service reliability based on the developed model. The developed cloud service reliability model and evaluation algorithm, however, are yet to be validated by simulation and real-life data. This issue shall be addressed in our future research.

Acknowledgement: This work is supported by National Science Foundation (No. 083609) of USA. This work is supported by National Natural Science Foundation of China (No. 5080508) and Key Project of Chinese Ministry of Education (No. 0938).