RELIABILITY IN CLOUD COMPUTING: ISSUES AND CHALLENGES

Dr. Bahman Javadi
School of Computing, Engineering and Mathematics
University of Western Sydney, Australia

Partner Institutions
» Over 104 existing international partners in 31 countries
» Collaborative agreements
» Research
» Exchange, including students and staff
» Academic cooperation
» Guaranteed admissions and articulation
» Offshore delivery

RESEARCH IN SCHOOL
Research Centre: Centre for Research in Mathematics (CRM)
Three University Research Groups: Artificial Intelligence Research Group (AI); Digital Humanities Research Group (DH); Solar Energy Technologies Research Group (SET)
School Groups: Networking, Security and Cloud Research Group (NSCR); E-Health Research Group; Computational Astrophysics

OUTLINE
Introduction and Background
Reliability in Cloud: Definition
Failure Correlation
Case Study: Failure-aware Hybrid Cloud Architecture
Results
Conclusions
Issues, Challenges and Open Questions

CLOUD COMPUTING: OLD OR NEW?
In 1969, Leonard Kleinrock, one of the chief scientists of the original ARPA project which seeded the Internet, wrote:
"As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of 'computer utilities', which, like present electric and telephone utilities, will service individual homes and offices across the country."

WHAT IS CLOUD COMPUTING
Having a car?
Buying a car (classical computing): pay the initial cost, pay the insurance cost, pay the maintenance cost.
Renting a car (cloud computing): pay-as-you-go.

WHAT IS CLOUD COMPUTING
New Technology, New Architecture, New Service
Cloud Computing: on-demand IT service, anywhere, anytime, with a pay-as-you-go model.
The name comes from the use of a cloud-shaped symbol as an abstraction for the complex infrastructure it contains in system diagrams.

TRADE-OFFS
[Figure by D. Kondo]

CLOUD INTERESTS OVER TIME
[Figure: search interest in "Grid" versus "Cloud". Ref: Google Trends]

CLOUD SERVICES
[Figure]

CLOUD SERVICE PROVIDERS
[Figure]

SPECTRUM OF ABSTRACTIONS
Different levels of abstraction:
Instruction Set VM: Amazon EC2
Framework VM: Google App Engine
Similar to languages: higher-level abstractions can be built on top of lower ones.
Lower-level: more flexibility, more management, not scalable by default.
Higher-level: less flexibility, less management, automatically scalable.
Spectrum, from lower to higher level: EC2, Azure, AppEngine, Google Apps

CLOUD COMPUTING: PROS AND CONS
Pros: elasticity; cost-effective; quick deployment; unlimited storage
Cons: security and privacy; vendor lock-in; lack of control

CLOUD COMPUTING IN JAPAN
The preparedness to support the growth of cloud computing.
These 24 countries account for 80 percent of the global ICT market.

CLOUD RELIABILITY
A failure is an event that makes a system fail to operate according to its specifications.
Resource failure is inevitable in the Cloud, where failure is the norm rather than the exception.
Public Clouds: highly reliable service; redundant components to provide reliable services.
Private Clouds: frequent service failures.

CLASSIFYING FAILURES OF SERVER
[Pie chart. Values: 5%, 18%, 6%, 70%. Categories: Hard Disk, RAID Controller, Memory, Others]
Hard disks are not only the most replaced component, they are also the most dominant reason behind server failure.
Vishwanath, Kashi Venkatesh, and Nachiappan Nagappan. "Characterizing cloud computing hardware reliability." Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010.

FAILURE RATE FOR COMPONENTS
[Chart. Values: 0.1%, 2.4%, 0.7%, 2.7%. Categories: Hard Disk, RAID Controller, Memory, Others]
The cost per server repair (which includes downtime, the IT ticketing system to send a technician, and hardware repairs) is $300. This amounts to close to 2.5 million dollars for 100,000 servers.
Vishwanath, Kashi Venkatesh, and Nachiappan Nagappan. "Characterizing cloud computing hardware reliability." Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010.
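
The two cost figures on this slide imply a repair volume that the slide does not state directly. The following is a minimal sketch of that arithmetic; the function names are illustrative, and the period over which the 2.5 million dollars accrues is not given in the transcript.

```python
# Figures quoted on the slide.
REPAIR_COST_USD = 300          # cost per server repair
TOTAL_COST_USD = 2_500_000     # "close to 2.5 million dollars"
FLEET_SIZE = 100_000           # servers


def implied_repairs(total_cost: float, cost_per_repair: float) -> float:
    """Number of repairs that would add up to the quoted total cost."""
    return total_cost / cost_per_repair


def implied_repair_fraction(total_cost: float, cost_per_repair: float,
                            fleet_size: int) -> float:
    """Repairs per server in the fleet (derived here, not stated on the slide)."""
    return implied_repairs(total_cost, cost_per_repair) / fleet_size


repairs = implied_repairs(TOTAL_COST_USD, REPAIR_COST_USD)
fraction = implied_repair_fraction(TOTAL_COST_USD, REPAIR_COST_USD, FLEET_SIZE)
print(f"about {repairs:.0f} repairs, i.e. {fraction:.1%} of the fleet")
```

In other words, the quoted total corresponds to roughly one repair for every twelve servers.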

FAILURE CORRELATION
Correlation in failures → overlapped failures
Two types: spatial and temporal.
Spatial correlation means multiple failures occur on different nodes within a short time interval.
Temporal correlation is the skewness of the failure distribution over time, which means failure events exhibit considerable autocorrelation at small time lags, so the failure rate changes over time.

FAILURES IN SERVICE
The sequence of overlapped failures:
$$H = \{F_i \mid F_i = (E_1, \ldots, E_n),\ T_s(E_{i+1}) \le T_e(E_i)\} \quad (1)$$
Downtime of the service:
$$D = \sum_{\forall F_i \in H} \left( \max\{T_e(F_i)\} - \min\{T_s(F_i)\} \right) \quad (2)$$
[Figure 1: Serving of a request in the presence of resource failures.]
JEC-ECC 2015

A. User Request
In the system under study, each request can be thought of as a rectangle whose length is the user-requested duration ($T$) and whose width is the number of required VMs ($S$). Since the resources in the private Cloud are failure-prone, some failure events ($E_i$) may occur in the nodes while a request is being serviced. In addition, it has been shown that there are spatial and temporal correlations in the failure events, as well as a dependency of the failure rate on workload type and intensity [12], [13], [36]. Spatial correlation means multiple failures occur on different nodes within a short time interval, while temporal correlation refers to the skewness of the failure distribution over time. Therefore, several overlapped failure events may be present in the system.

Let $T_s(\cdot)$ and $T_e(\cdot)$ be the functions that return the start and end time of a failure event. Let $H$ be the sequence of overlapped failure events, ordered according to increasing start time, as in Equation (1), where $1 \le i \le n-1$. Since we are dealing with tightly coupled workflows, all VMs must be available for the whole requested duration, so any failure event in any VM stops the execution of the whole request. In this case, the downtime of the service is given by Equation (2).

The above analysis reveals that even in the presence of an optimal fault-tolerant mechanism (e.g., perfect checkpointing) in the private Cloud, a given request still incurs $D$ time units of delay.

2.3. Failure Model
In this work, we take into account resource failures in the system, where a failure is defined as an event in which the system fails to operate according to its specifications. We investigate a case where resource failures occur only in the private Cloud, as public Cloud environments are usually able to provide highly reliable services to their customers [15].
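
The downtime computation in Equations (1) and (2) can be sketched as an interval-merging pass. This is a minimal illustration, not the authors' implementation: failure events are assumed to be `(start, end)` pairs, and an event joins the current overlapped sequence when it starts no later than the latest end time seen so far in that sequence (one reading of the overlap condition in Equation (1)).

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start time, end time) of one failure event


def overlapped_sequences(events: List[Event]) -> List[List[Event]]:
    """Group failure events into sequences of overlapped events (the F_i)."""
    groups: List[List[Event]] = []
    group_end = float("-inf")
    for start, end in sorted(events):          # increasing start time
        if groups and start <= group_end:      # overlaps the current sequence
            groups[-1].append((start, end))
            group_end = max(group_end, end)
        else:                                  # gap: a new sequence begins
            groups.append([(start, end)])
            group_end = end
    return groups


def downtime(events: List[Event]) -> float:
    """Sum over sequences of (latest end - earliest start), as in Equation (2)."""
    total = 0.0
    for group in overlapped_sequences(events):
        total += max(e for _, e in group) - min(s for s, _ in group)
    return total


# Hypothetical events: three overlap between t=0 and t=5, one is isolated.
example = [(0.0, 2.0), (1.0, 4.0), (10.0, 11.0), (3.0, 5.0)]
print(downtime(example))
```

Because overlapping events are counted once per sequence, the result is shorter than the plain sum of the individual event lengths.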

CASE STUDY
Hybrid Cloud Systems: Public Clouds; Private Clouds
Resource Provisioning in Hybrid Cloud: users' QoS (i.e., deadline); resource failures
Taking into account: workload model; failure characteristics (failure correlations, failure model)

HYBRID CLOUD ARCHITECTURE
Based on InterGrid components
Using a Gateway (IGG) as the broker
[Figure: InterGrid Gateway (IGG) components: Communication Module (Message-Passing); Management & Monitoring (JMX); Scheduler (Provisioning Policies & Peering); Persistence (DB: Java Derby); Virtual Machine Manager (Emulator, Local Resources, Grid Middleware, IaaS Provider)]

WORKLOAD MODEL
Scientific Applications:
Potentially large number of resources over a short period of time.
Several tasks that are sensitive to communication networks and resource failures (tightly coupled).
User Requests:
Type of virtual machine;
Number of virtual machines;
Estimated duration of the request;
Deadline for the request (optional).

PROPOSED APPROACHES
Knowledge-free Approach: no failure model; using failure correlation; three brokering policies
Knowledge-based Approach: failure model; generic resource provisioning model; two brokering policies (cost-aware)
Workload model: request size; request duration

PROPOSED POLICIES
Size-based Strategy
Spatial correlation: multiple failures occur on different nodes within a short time interval.
Strategy: sends wider requests to more reliable public Cloud systems.
Mean number of VMs per request:
$P_1$: probability of one VM
$P_2$: probability of power of two VMs
$$\bar{S} = P_1 + 2^{\lceil k \rceil} P_2 + 2^{k} \left(1 - (P_1 + P_2)\right) \quad (3)$$
Request size: two-stage uniform distribution $(l, m, h, q)$
$$k = \frac{ql + m + (1-q)h}{2} \quad (4)$$

The service downtime is strongly dependent on the number of requested VMs: the more VMs requested, the more likely the jobs will be affected by simultaneous failures. To cope with this situation, we proposed a redirecting strategy that sends wider requests (with larger $S$) to more reliable public Cloud systems, while minimizing requests sent to potentially more failure-prone local resources. A key element of this strategy is the mean number of VMs required per request.

To determine the mean number of VMs per request, we need the probability of different numbers of VMs in incoming requests. Assume that $P_1$ and $P_2$ are the probabilities of requests with one VM and with a power-of-two number of VMs in the workload, respectively. The mean number of virtual machines required by requests is then given by Equation (3).

Based on the parallel workload models, the size of each request follows a two-stage uniform distribution with parameters $(l, m, h, q)$ [22], [23]. This distribution consists of two uniform distributions, where the first is in the interval $[l, m]$ with probability $q$ and the second is in the interval $[m, h]$ with probability $1 - q$. So, $m$ is the middle point of possible values between $l$ and $h$. Hence, $k$ can be found as the mean value of this distribution, as in Equation (4).

The redirection strategy submits requests to the public Cloud provider if the number of requested VMs is greater than $\bar{S}$.
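
The size-based decision can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the parameter values are the ones that appear in the transcript's workload table ($l = 0.8$, $m = 3.5$, $h = 6$, $q = 0.9$, $P_1 = 0.02$, $P_2 = 0.78$).

```python
import math


def mean_log2_size(l: float, m: float, h: float, q: float) -> float:
    """Mean k of the two-stage uniform distribution, Equation (4)."""
    return (q * l + m + (1 - q) * h) / 2


def mean_vms(p1: float, p2: float, k: float) -> float:
    """Mean number of VMs per request, Equation (3)."""
    return p1 + 2 ** math.ceil(k) * p2 + 2 ** k * (1 - (p1 + p2))


def redirect_by_size(requested_vms: int, s_mean: float) -> str:
    """Wider-than-average requests go to the public Cloud."""
    return "public" if requested_vms > s_mean else "private"


k = mean_log2_size(l=0.8, m=3.5, h=6.0, q=0.9)
s_mean = mean_vms(p1=0.02, p2=0.78, k=k)
print(f"k = {k:.2f}, mean VMs = {s_mean:.2f}")
for vms in (1, 4, 8, 32):
    print(vms, redirect_by_size(vms, s_mean))
```

With these values the threshold falls between 4 and 8 VMs, so only requests of 8 VMs or more would be redirected.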

PROPOSED POLICIES (CONT.)
Time-based Strategy
Temporal correlation: the failure rate is time-dependent, and some periodic failure patterns can be observed at different time scales.
Request duration: request durations are long-tailed.
The mean request duration: Lognormal distribution in a parallel production system
$$\bar{T} = e^{\mu + \frac{\sigma^2}{2}} \quad (5)$$

This strategy uses the mean request duration as the decision point for the gateway to redirect incoming requests to the Cloud providers. In other words, if the request duration is less than or equal to the mean request duration, the request will be run using local resources. With this technique, the majority of short requests could meet the deadline, as they are less likely to encounter failures. Moreover, longer requests will be served by more reliable public Cloud provider resources, where they can meet the deadline under enhanced resource availability.

The mean request duration can be obtained from the fitted distribution of the workload model. For instance, the request duration of the parallel workloads in the DAS-2 multi-cluster system follows the Lognormal distribution with parameters $\mu$ and $\sigma$, so the mean value is given by Equation (5) [22].

The redirection strategy submits requests to the public Cloud provider if the duration of the request is greater than $\bar{T}$; otherwise the request is serviced using local resources. Another advantage of this strategy is better utilization of the Cloud resources. For example, with Amazon's EC2, if a request uses a VM for less than one hour, the cost of one hour must still be paid.
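
A minimal sketch of the time-based decision follows, assuming the Lognormal mean in Equation (5). The function names are illustrative, and the time unit of the duration is whatever unit the fitted workload model uses; the transcript does not state it.

```python
import math


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of a Lognormal(mu, sigma) variable: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def redirect_by_duration(duration: float, t_mean: float) -> str:
    """Longer-than-average requests go to the public Cloud."""
    return "public" if duration > t_mean else "private"


# mu = 3.0 lies inside the range 2.5..3.5 quoted in the workload table,
# sigma = 1.7 is the value quoted there.
t_mean = lognormal_mean(mu=3.0, sigma=1.7)
print(f"mean request duration: {t_mean:.1f}")
print(redirect_by_duration(20.0, t_mean), redirect_by_duration(500.0, t_mean))
```

Because the Lognormal distribution is long-tailed, its mean sits well above its median, so most requests fall below the threshold and stay on local resources.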

PROPOSED POLICIES (CONT.)
Area-based Strategy
Making a compromise between the size-based and time-based strategies.
The mean area of the requests:
$$\bar{A} = \bar{T} \cdot \bar{S} \quad (6)$$
This strategy sends long and wide requests to the public Cloud. It would be more conservative than a size-based strategy and less conservative than a time-based strategy.

D. Area-based Strategy
The two aforementioned strategies are each based on only one aspect of the request: the number of VMs ($S$) or the duration ($T$). A third strategy aims at making a compromise between the size-based and time-based strategies. Here we utilize the area of a request as the decision point for the gateway; this is given as the rectangle with length $T$ and width $S$. Using this model, we are able to calculate the mean request area by multiplying the mean number of VMs by the mean request duration, as in Equation (6).
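
The area-based rule can be sketched next to the other two strategies to show where they differ. This is an assumption-laden illustration: the comparison "area greater than the mean area" is inferred by analogy with the size-based and time-based rules, and the mean values used below are hypothetical.

```python
def mean_area(t_mean: float, s_mean: float) -> float:
    """Mean request area: mean duration times mean number of VMs, Equation (6)."""
    return t_mean * s_mean


def decide(duration: float, vms: int, t_mean: float, s_mean: float) -> dict:
    """Destination chosen by each strategy for one request."""
    def pick(redirect: bool) -> str:
        return "public" if redirect else "private"
    return {
        "size": pick(vms > s_mean),
        "time": pick(duration > t_mean),
        "area": pick(duration * vms > mean_area(t_mean, s_mean)),
    }


# Hypothetical means, chosen only for illustration.
T_MEAN, S_MEAN = 80.0, 7.0

print(decide(duration=400.0, vms=2, t_mean=T_MEAN, s_mean=S_MEAN))   # long, narrow
print(decide(duration=10.0, vms=32, t_mean=T_MEAN, s_mean=S_MEAN))   # short, wide
print(decide(duration=400.0, vms=32, t_mean=T_MEAN, s_mean=S_MEAN))  # long and wide
```

A request that is extreme in only one dimension can fall on either side of the area threshold, which is the sense in which this rule trades off the other two.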

KNOWLEDGE-BASED APPROACH: GENERIC RESOURCE PROVISIONING MODEL
Model based on routing in distributed parallel queues
$P_i$: routing probability
$K_i$: price of provider $i$
[Fig. 2: Model of resource brokering for n providers.]

The objective for the broker could be expressed as follows:
$$\min \sum_{i=1}^{n} \left( K_i \cdot E[T_i] \right) \quad (1)$$
where $E[T_i]$ is the expected response time of requests served by queue $i$ and is described in the following. Based on Figure 2, queue $i$ has the mean inter-arrival time $\lambda_i^{-1} = (P_i \lambda)^{-1}$, so we can find its variance by Wald's equation [23] as follows:
$$V[I_i] = \frac{\sigma_I^2}{P_i} + \frac{1 - P_i}{\lambda^2 P_i^2} \quad (2)$$
As mentioned before, queue $i$ has the mean service time and coefficient of variance $\mu_i^{-1}$ and $\sigma_{S_i} \mu_i$, respectively. As incoming requests have several VMs that potentially can be as large as $M_i$ nodes, we model each provider with a single queue.
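
The inter-arrival moments of one routed stream can be checked numerically. The sketch below assumes the reading of Equation (2) in which an arrival stream with rate λ and inter-arrival variance σ_I² is split at random, with probability P_i, towards queue i; the function names are illustrative. Splitting a Poisson stream is used as a sanity check, since the routed stream is then again Poisson and its squared coefficient of variation must equal 1.

```python
def split_interarrival_moments(lam: float, var_i: float, p_i: float):
    """Mean and variance of the inter-arrival time seen by queue i."""
    mean = 1.0 / (p_i * lam)
    variance = var_i / p_i + (1.0 - p_i) / (lam ** 2 * p_i ** 2)
    return mean, variance


def squared_cv(mean: float, variance: float) -> float:
    """Squared coefficient of variation: variance / mean^2."""
    return variance / mean ** 2


lam = 2.0                      # hypothetical total arrival rate
poisson_var = 1.0 / lam ** 2   # exponential inter-arrival times
mean_i, var_i = split_interarrival_moments(lam, poisson_var, p_i=0.3)
print(mean_i, var_i, squared_cv(mean_i, var_i))
```

With `p_i = 1` the functions return the moments of the original stream unchanged.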

MODEL PARAMETERS
Using the Lagrange multipliers method, we obtain the routing probabilities.
Private Cloud service rate:
$$\mu_s = \left( \frac{W}{M_s \tau_s} \cdot \frac{t_a + t_u}{t_a} + L_s \right)^{-1} \quad (12)$$
Public Cloud service rate:
$$\mu_c = \left( \frac{W}{M_c \tau_c} + L_c \right)^{-1} \quad (13)$$

$C_{I_i}^2$ is the squared coefficient of variance of the inter-arrival time at queue $i$, and can be calculated using Equation (2) as follows:
$$C_{I_i}^2 = \frac{V[I_i]}{E[I_i]^2} = 1 + P_i \left( C_I^2 - 1 \right) \quad (4)$$
Next, we apply the Lagrange multipliers method to optimize Equation (1), using $E[T_i]$ from Equation (3) and assuming $\sum_{i=1}^{n} P_i = 1$. The solution of this optimization gives the routing probabilities $P_i$ in terms of the service rates $\mu_i$ and the prices $K_i$ (Equation 5). There are several approximations for this queue in the literature, but we choose one which is a good estimate for heavily loaded systems.

Here $t_a$, $t_u$, $\sigma_a^2$ and $\sigma_u^2$ are the mean and the variance of the availability and unavailability interval lengths, respectively. As the input workload includes parallel requests (i.e., requests with several VMs), we consider the mean request size ($W$) as the given work to the system. We define the mean request size by multiplying the mean number of VMs ($V$) by the mean request duration ($D$). Hence, $W = V \cdot D$. These two parameters are dependent on the workload model (see Section VI-A).

By considering $W$ time units of work over $M_s$ failure-prone nodes, we define the service rate of the cluster queue as the reciprocal of the mean completion time for a given workload, as in Equation (12), where $\tau_s$ is the computing speed of the nodes in the local cluster in terms of instructions per second, and $L_s$ is the time to transfer the application (e.g., configuration file or input file(s)) to the cluster through the communication network. Another required parameter is the coefficient of variance of the cluster service time (i.e., $C_{S_s}$), which is nothing but Equation (9).

B. Runtime Model for Public Cloud
Although resource failures are inevitable, public Cloud providers employ efficient and expensive mechanisms to manage resource failures. These mechanisms are mainly based on redundancy. Hence, public Cloud providers are usually able to provide highly reliable services to their customers [1]. Therefore, we can use the Normal distribution for the request completion time in the Cloud. This can be justified by the central limit theorem, which assures that when summing many independent random variables (here, request completion times), the resulting distribution tends toward a Normal distribution. So, the service rate of the Cloud queue can be found as the reciprocal of the mean request completion time for a given workload on $M_c$ reliable nodes, as in Equation (13).
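
The two service rates can be compared with a short sketch. It assumes the reading of Equations (12) and (13) in which the private-Cloud completion time is the failure-free time W / (M τ) stretched by the factor (t_a + t_u) / t_a, plus a transfer time L. The availability figures are the ones listed in the transcript's failure table; the workload size, node speed and transfer time are hypothetical.

```python
def private_service_rate(w: float, m_s: int, tau_s: float,
                         t_a: float, t_u: float, l_s: float) -> float:
    """Service rate of the failure-prone local cluster, Equation (12)."""
    completion = w / (m_s * tau_s) * (t_a + t_u) / t_a + l_s
    return 1.0 / completion


def public_service_rate(w: float, m_c: int, tau_c: float, l_c: float) -> float:
    """Service rate of the reliable public Cloud, Equation (13)."""
    return 1.0 / (w / (m_c * tau_c) + l_c)


T_A, T_U = 22.25, 10.22   # mean availability / unavailability length (hours)

mu_s = private_service_rate(w=100.0, m_s=64, tau_s=1.0, t_a=T_A, t_u=T_U, l_s=0.0)
mu_c = public_service_rate(w=100.0, m_c=64, tau_c=1.0, l_c=0.0)
print(f"mu_s = {mu_s:.3f}, mu_c = {mu_c:.3f}, ratio = {mu_s / mu_c:.3f}")
```

With equal node counts, equal speeds and no transfer time, the ratio of the two rates reduces to t_a / (t_a + t_u), the fraction of time a local node is available.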

ADAPTIVE POLICIES
Adaptive with Random Sequence (ARS): routing probabilities ($P_i$); dispatch using Bernoulli distribution.
Adaptive with Deterministic Sequence (ADS): routing probabilities ($P_i$); dispatch using Billiard sequence.

In contrast to the Adaptive with Random Sequence (ARS) policy, we also propose another method with a deterministic sequence, which considers the past sequence of dispatching with a very limited time overhead. We call this method the Adaptive with Deterministic Sequence (ADS) policy.

To generate the deterministic sequence, we used the Billiard scheme [11], determined as follows. Suppose a billiard ball bounces in an n-dimensional cube where each side and its opposite side are assigned an integer value in the range $\{1, 2, \ldots, n\}$. Then, a deterministic billiard sequence is generated by the series of integer values which shows the sides hit by the ball when shot. In [11], the authors proposed a method to generate the billiard sequence as follows:
$$i_b = \min_{\forall i} \left\{ \frac{X_i + Y_i}{P_i} \right\} \quad (7)$$
where $i_b$ is the target queue (the index attaining the minimum), and $X$ and $Y$ are vectors of integers with size $n$. $X_i$ reflects the fastest queue, and is set to one for the fastest queue and zero for all other queues [2]. $Y_i$ keeps track of the number of requests that have been sent to queue $i$ and is initialized to zero. After finding the target queue, it is updated as $Y_{i_b} = Y_{i_b} + 1$. $P_i$ is the fraction of requests routed to queue $i$, i.e., its routing probability.

In the specific case studied here, index $i = s$ denotes the local cluster and $i = c$ the public Cloud, and homogeneity is assumed within each provider.
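
The two dispatching schemes can be sketched side by side. This is an illustrative reading, not the authors' code: the class and function names are hypothetical, ARS is written as a weighted random draw (a Bernoulli trial when there are two queues), ADS follows Equation (7) with ties broken in favour of the lower index, and all routing probabilities are assumed to be strictly positive.

```python
import random
from typing import List


def ars_dispatch(probs: List[float], rng: random.Random) -> int:
    """ARS: pick a queue at random according to the routing probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


class BilliardDispatcher:
    """ADS: deterministic billiard sequence based on Equation (7)."""

    def __init__(self, probs: List[float], fastest: int) -> None:
        self.probs = list(probs)                       # P_i, all > 0
        self.x = [1 if i == fastest else 0 for i in range(len(probs))]
        self.y = [0] * len(probs)                      # requests sent so far

    def next_queue(self) -> int:
        # Target queue: index minimising (X_i + Y_i) / P_i.
        target = min(range(len(self.probs)),
                     key=lambda i: (self.x[i] + self.y[i]) / self.probs[i])
        self.y[target] += 1
        return target


probs = [0.75, 0.25]          # hypothetical: local cluster, public Cloud
ads = BilliardDispatcher(probs, fastest=0)
ads_sequence = [ads.next_queue() for _ in range(8)]
print("ADS:", ads_sequence)

rng = random.Random(0)
ars_sequence = [ars_dispatch(probs, rng) for _ in range(8)]
print("ARS:", ars_sequence)
```

Over the first eight requests the deterministic scheme reproduces the 3:1 split exactly, while the random scheme only approaches it as the number of requests grows.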

SCHEDULING ALGORITHMS
Scheduling the request across private and public Cloud resources.
Two well-known algorithms where requests are allowed to leap forward in the queue: conservative backfilling; selective backfilling.
VM Checkpointing: the VM stops working for the unavailability period; the request is restarted from where it left off when the node becomes available again.

In Figure 3, the request is redirected to the remote IGG to get the resource from the public Cloud provider (i.e., Amazon's EC2). Once the IGG has allocated the requested VMs, it makes them available, and the DVE manager will be able to access the VMs and finally deploy the user's application.

4.3. Fault-Tolerant Scheduling Algorithms
As depicted in Figure 4, we need an algorithm for scheduling the requests for the private and public Clouds. For this purpose, we utilize a well-known scheduling algorithm for parallel requests, which is called selective backfilling [33]. Backfilling is a dynamic mechanism to identify the best place to fit the requests in the scheduler queue. In other words, backfilling works by identifying holes in the processor-time space and moving forward smaller requests that fit those holes. Selective backfilling grants a reservation to a request when its expected slowdown exceeds a threshold, which means the request has waited long enough in the queue. The expected slowdown of a given request is also called the eXpansion Factor (XFactor) and is given by the following equation:
$$\text{XFactor} = \frac{W_i + T_i}{T_i} \quad (7)$$
where $W_i$ and $T_i$ are the waiting time and the run time of request $i$, respectively. We use the Selective-Differential-Adaptive scheme proposed in [33], which lets the XFactor threshold be the average slowdown of previously completed requests. It has been shown that selective backfilling outperforms the other types of backfilling algorithms [33].

We used another scheduling algorithm, aggressive backfilling [41], in our experiments as the base algorithm. In aggressive backfilling (EASY), only the request at the head of the queue, called the pivot, is granted a reservation. Other requests are allowed to move ahead in the queue as long as they do not delay the pivot.
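
The reservation test used by selective backfilling can be sketched as follows. The function names are illustrative, and the fallback threshold of 1.0 when no request has completed yet is an assumption of this sketch, not something the transcript specifies.

```python
from statistics import mean
from typing import Sequence


def xfactor(wait_time: float, run_time: float) -> float:
    """Expected slowdown of a request: (W_i + T_i) / T_i."""
    return (wait_time + run_time) / run_time


def grants_reservation(wait_time: float, run_time: float,
                       completed_slowdowns: Sequence[float]) -> bool:
    """Grant a reservation once the XFactor exceeds the adaptive threshold.

    The threshold is the average slowdown of previously completed requests;
    1.0 is used when there is no history (assumption of this sketch).
    """
    threshold = mean(completed_slowdowns) if completed_slowdowns else 1.0
    return xfactor(wait_time, run_time) > threshold


history = [2.0, 3.0]                       # hypothetical past slowdowns
print(xfactor(30.0, 10.0))                 # waited 30, runs for 10
print(grants_reservation(30.0, 10.0, history))
print(grants_reservation(5.0, 10.0, history))
```

Short requests reach the threshold after a shorter wait than long ones, because the same waiting time produces a larger XFactor when the run time is small.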

PERFORMANCE EVALUATION
CloudSim Simulator
Performance Metrics: deadline violation rate; slowdown; Cloud cost on EC2
Workload Model: parallel jobs model of a multi-cluster system (i.e., DAS-2)

IV. PERFORMANCE EVALUATION
Interested readers should refer to [3] to see how checkpoint overheads and periods can be computed based on the associated failure model.

In order to evaluate the performance of the proposed policies, we implemented a discrete event simulator using CloudSim [4]. We used simulation because experiments are reproducible and the cost of conducting experiments on real public Cloud resources would be prohibitively expensive.

The performance metrics considered in all simulation scenarios are the violation rate and the bounded slowdown [9]. The violation rate is the fraction of requests that do not meet a given deadline. The bounded slowdown for the $M$ requests is defined as:
$$\text{Slowdown} = \frac{1}{M} \sum_{i=1}^{M} \frac{W_i + \max(T_i, bound)}{\max(T_i, bound)} \quad (7)$$
where $bound$ is set to 10 seconds to eliminate the effect of very short requests [9].

To compute the cost of using resources from a public Cloud, we use the amounts charged by Amazon to run basic virtual machines and network usage at EC2. The cost of using EC2 for policy $pl$ can be calculated as follows:
$$\text{Cost}_{pl} = (H_{pl} + M_{pl} \cdot H_u) \cdot C_n + (M_{pl} \cdot B_{in}) \cdot C_x \quad (8)$$
where $H_{pl}$ is the Cloud usage per hour for the policy $pl$.

TABLE I
INPUT PARAMETERS FOR THE WORKLOAD MODEL
Input Parameters      Distribution/Value
Inter-arrival time    Weibull (scale = 23.375, 0.2 ≤ shape ≤ 0.3)
No. of VMs            Loguniform (l = 0.8, m, h = log2 N_s, q = 0.9)
Request duration      Lognormal (2.5 ≤ μ ≤ 3.5, σ = 1.7)
P_1                   0.02
P_2                   0.78
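
The three metrics can be computed with a few small functions. This is a sketch under stated assumptions: the function and argument names are illustrative, a request is taken to meet its deadline when it finishes no later than the deadline, and the cost function simply evaluates Equation (8) on its symbols, since the surviving text only defines H_pl.

```python
from typing import Sequence, Tuple


def bounded_slowdown(requests: Sequence[Tuple[float, float]],
                     bound: float = 10.0) -> float:
    """Mean of (W_i + max(T_i, bound)) / max(T_i, bound) over all requests.

    Each request is a (waiting time, run time) pair in seconds.
    """
    total = sum((w + max(t, bound)) / max(t, bound) for w, t in requests)
    return total / len(requests)


def violation_rate(finish_times: Sequence[float],
                   deadlines: Sequence[float]) -> float:
    """Fraction of requests that finish after their deadline."""
    missed = sum(1 for f, d in zip(finish_times, deadlines) if f > d)
    return missed / len(finish_times)


def ec2_cost(h_pl: float, m_pl: float, h_u: float,
             c_n: float, b_in: float, c_x: float) -> float:
    """Equation (8): (H_pl + M_pl * H_u) * C_n + (M_pl * B_in) * C_x."""
    return (h_pl + m_pl * h_u) * c_n + (m_pl * b_in) * c_x


# Hypothetical inputs, for illustration only.
print(bounded_slowdown([(0.0, 100.0), (10.0, 1.0)]))
print(violation_rate([5.0, 12.0], [10.0, 10.0]))
print(ec2_cost(h_pl=10, m_pl=2, h_u=0.5, c_n=0.085, b_in=1, c_x=0.1))
```

The `bound` term caps the contribution of very short requests: the one-second request above is treated as if it ran for ten seconds.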

[...] the policy pl. That [...] minutes, for example, [...] is the fraction of [...] Cloud. H_u [...] systems on a VM [...] into account this [...] when the VM [...] which needs to be [...] request. This is set to [...] instance on the [...] as 0.085 USD per [...] of data transfer to [...] which is 0.1 USD [...] consider a case where [...] transferred to the [...]

We also varied the middle point of the Loguniform distribution (i.e., m) to generate workloads with different numbers of VMs per request, where m = h − ω and ω is between 1.5 and 3.3. Thus, the larger the value of ω, the fewer VMs are required to service requests. The same scheduling algorithms are used for the private and public Cloud providers in all scenarios.

PERFORMANCE EVALUATION (cont.)
- Failures from the Failure Trace Archive (FTA): http://fta.scem.uws.edu.au/
  - Grid 5000 traces
  - 18 months
  - 800 events/node
- Synthetic deadline
  - f: stringency factor
  - f > 1 is a normal deadline (e.g., f = 1.3)

[...] Lognormal distributions, respectively. These [...] their parameters are listed in Table I. It [...] that the number of VMs in the request [...] the system size (e.g. M nodes) by setting [...]

TABLE I
INPUT PARAMETERS FOR THE WORKLOAD MODEL.

Input Parameters     Distribution/Value
Inter-arrival time   Weibull (α = 23.375, 0.2 ≤ β ≤ 0.3)
No. of VMs           Loguniform (l = 0.8, m = 3.5, h = 6, q = 0.9)
Request duration     Lognormal (2.5 ≤ µ ≤ 3.5, σ = 1.7)
P_1                  0.02
P_2                  0.78

To generate the request deadlines, we utilize the same techniques given in [18], which provide a feasible schedule for the jobs. To obtain deadlines, we conducted experiments based on scheduling requests on local resources without failure events using aggressive backfilling. We used the following equation to calculate the deadline for each request i:

$$d_i = \begin{cases} st_i + (f \cdot ta_i), & \text{if } [st_i + (f \cdot ta_i)] < ct_i \\ ct_i, & \text{otherwise} \end{cases} \qquad (9)$$

where st_i is the request's submission time, ct_i is its completion time, and ta_i is the request's turnaround time (i.e., ct_i − st_i).
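The deadline rule of Equation (9) can be written as a small function. This is a literal sketch of the equation as printed, with a hypothetical function name; it assumes ta_i is computed as ct_i − st_i, as the text states, and that all times share one unit.

```python
def request_deadline(st: float, ct: float, f: float) -> float:
    """Equation (9): st + f * ta if that is earlier than ct, otherwise ct.

    st: submission time, ct: completion time in the failure-free
    backfilling schedule, f: stringency factor.
    """
    ta = ct - st  # turnaround time, as defined in the text
    candidate = st + f * ta
    return candidate if candidate < ct else ct


if __name__ == "__main__":
    print(request_deadline(st=0.0, ct=10.0, f=0.5))
    print(request_deadline(st=0.0, ct=10.0, f=1.3))
```

Read literally with ta_i = ct_i − st_i, the first branch is taken only when f < 1; for f ≥ 1 the function returns ct_i. The sketch follows the equation as printed and does not attempt to reinterpret it.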
We define f as a stringency factor that indicates how urgent the deadlines are. If f = 1, then the request's deadline is [...] time (based on [...] uses an aggressive backfilling scenario [...] to ensure completion. We evaluate strategies with different stringency factors, however [...]

[...] mean number of VMs per request, we need [...] different number of VMs in the incoming [...] that P_one and P_pow2 are probabilities of [...] VM and power of two VMs in the workload, [...] the mean number of VMs required by [...] as follows: [...]

Workload model: parallel job model, N_s = N_c = 64, Grid [22]
JEC-ECC 2015

TABLE II
INPUT PARAMETERS FOR THE FAILURE MODEL.

Parameters   Description                      Value (hours)
t_a          Mean availability length         22.25
σ_a          Std of availability length       41.09
t_u          Mean unavailability length       10.22
σ_u          Std of unavailability length     40.75

The failure trace for the experiments is obtained from the Failure Trace Archive [15]. We used the failure trace of a cluster in Grid 5000 with 64 nodes for a duration of 18 months, which includes on average 795 failure events per node. The parameters for the failure model of Grid 5000 are listed in Table II (see [15] for more details). Also, each experiment utilizes a unique starting point in the failure traces to avoid biased results.

In order to generate different synthetic workloads, we modified two parameters of the workload model, one at a time. To change the inter-arrival time, we modified the second parameter of the Weibull distribution (the shape parameter β) [...]
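To make the workload generation step concrete, the sketch below draws inter-arrival times from a Weibull distribution and request durations from a Lognormal distribution using Python's standard `random` module. It is an illustration under stated assumptions: the function name is hypothetical, the first Weibull parameter (23.375) is taken as the scale and the second as the shape, the defaults β = 0.25 and µ = 3.0 are arbitrary points inside the ranges given in Table I, and the number of VMs per request is omitted because its full model is not recoverable from the text.

```python
import random
from itertools import accumulate
from typing import List, Tuple


def generate_workload(n: int, alpha: float = 23.375, beta: float = 0.25,
                      mu: float = 3.0, sigma: float = 1.7,
                      seed: int = 0) -> List[Tuple[float, float]]:
    """Return n (arrival_time, duration) pairs for a synthetic workload."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    gaps = [rng.weibullvariate(alpha, beta) for _ in range(n)]
    arrivals = list(accumulate(gaps))  # cumulative sum of inter-arrival gaps
    # lognormvariate(mu, sigma): mu and sigma of the underlying normal.
    durations = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    return list(zip(arrivals, durations))


if __name__ == "__main__":
    # Vary one parameter at a time, as the text describes.
    for shape in (0.2, 0.25, 0.3):
        jobs = generate_workload(1000, beta=shape)
        print(shape, jobs[-1][0])  # arrival time of the last request
```

Changing `beta` alters the arrival pattern while `mu` shifts the request durations, which mirrors the one-parameter-at-a-time variation described above.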

SIMULATION RESULTS
Violation rate (knowledge-free policies)
[Figure panels: Request arrival rate, Request size, Request duration]

SIMULATION RESULTS (CONT.)
Cloud Cost on EC2 (knowledge-free policies)
[Figure panels: Request arrival rate, Request size, Request duration]

SIMULATION RESULTS (CONT.)
Slowdown (knowledge-based policies)
[Figure panels: Request arrival rate (SB), Request arrival rate (CB), Request arrival rate (EB)]

FAILURE TRACE ARCHIVE (FTA)
- 27 failure traces
- Supercomputers, HPC, Grid, P2P
- FTA format
- Simulator and scripts
- http://fta.scem.uws.edu.au/

CONCLUSIONS
- Adaptive resource provisioning in a failure-prone hybrid Cloud system
- Flexible brokering strategies based on failure correlation/model as well as workload model
- Improve performance of hybrid Cloud
  - Knowledge-free approach: 32% in terms of deadline violation and 57% in terms of slowdown while using $135/month on EC2
  - Knowledge-based approach: 4.1 times in terms of response time while using $1200/month on EC2

OPEN QUESTIONS
- Resource Failures vs. Energy Consumption for Cloud Systems
  - How are they related?
- Reliability-as-a-Service (RaaS) in Cloud Computing
  - Providing reliability on demand based on the users' requirements (e.g., Amazon Spot Instances)
- Cost Model for Resource Failures in Cloud Systems
  - Repair, replacement

REFERENCES
- Bahman Javadi, Parimala Thulasiraman, and Rajkumar Buyya, Enhancing Performance of Failure-prone Clusters by Adaptive Provisioning of Cloud Resources, Journal of Supercomputing, 63(2) (2013) 467-489.
- Bahman Javadi, Jemal Abawajy, and Rajkumar Buyya, Failure-aware Provisioning for Hybrid Cloud Infrastructure, Journal of Parallel and Distributed Computing, 72(10) (2012) 1318-1331.
- Bahman Javadi, Jemal Abawajy, and Richard O. Sinnott, Hybrid Cloud Resource Provisioning Policy in the Presence of Resource Failures, 4th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2012), Taipei, Taiwan, December 2012, pp. 10-17. Best Paper Finalist Award.
- Bahman Javadi, Parimala Thulasiraman, and Rajkumar Buyya, Cloud Resource Provisioning to Extend the Capacity of Local Resources in the Presence of Failures, 14th IEEE International Conference on High Performance Computing and Communications (HPCC-2012), Bradford, UK, June 2012, pp. 311-319.
