Exploiting Performance and Cost Diversity in the Cloud

2013 IEEE Sixth International Conference on Cloud Computing

Luke M. Leslie, Young Choon Lee, Peng Lu and Albert Y. Zomaya
Centre for Distributed and High Performance Computing, School of Information Technologies, The University of Sydney, NSW 2006, Australia
{lles9991, {young.lee,

Abstract

Infrastructure-as-a-Service (IaaS) platforms, such as Amazon EC2, allow clients access to massive computational power in the form of virtual machines (VMs) known as instances. Amazon hosts three different instance purchasing options, each with its own service level agreement covering availability and pricing. In addition, Amazon offers access to a number of geographical regions, zones, and instance types from which to select. In this paper, we present a resource allocation and job scheduling framework (RAMC-DC), which utilizes Amazon's rich selection of service offerings, particularly within the Spot and On-Demand instance purchasing options, to execute deadline-constrained jobs cost-efficiently. The framework is capable of ensuring quality of service in terms of cost, deadline compliance and service reliability. These capabilities are realized by incorporating a set of novel strategies, including execution time and cost approximation, bidding, and resource allocation strategies. To the best of our knowledge, RAMC-DC most extensively exploits the service diversity of Amazon EC2, and offers a comprehensive cost efficiency solution that is able to deliver both the performance and reliability of On-Demand instances and the low costs of Spot instances. Experimental results obtained from extensive simulations using Amazon's Spot price traces show that our approach keeps deadline breaches and early-termination rates as low as 0.47% and 0.18%, respectively. This reliable performance is achieved with total costs between 13% and 20% of an equivalent approach using only On-Demand instances.
Keywords: Spot instances; Cloud provisioning; Cost efficiency

I. INTRODUCTION

Cloud computing provides a means to acquire pay-as-you-go computing power and data storage in a manner similar to publicly available utilities such as gas and electricity. As such, cloud computing has become a very powerful and popular tool among users who require access to computational resources and data storage without the fixed costs involved in purchasing, installing, and maintaining a private cloud. Moreover, cloud computing providers are in a prime position to leverage economies of scale unavailable to users with small private clouds, and can therefore pass these savings on to the users. There are many different cloud computing providers and each offers different layers of services. The focus in this paper is on IaaS providers, specifically Amazon EC2.

Amazon EC2 offers three different instance purchasing options: reserved instances, where a user pays a yearly fee and receives a discount on hourly rates; On-Demand instances, where a user pays a single hourly rate; and Spot instances, where users bid for instances [1]. While Spot instances benefit from low costs, they also suffer from inherent volatility. On the other hand, On-Demand instances offer very low volatility but higher prices than equivalent Spot instances. Taking advantage of this apparent dichotomy provides an attractive path to achieving both high cost-efficiency and low volatility. As in the Pinterest example (see footnote 1), supplementing Spot instances with On-Demand instances can help reduce the volatility encountered by the application and lower total costs.

There have been several previous studies aimed at creating resource allocation strategies utilizing Spot instances. Examples such as BROKER [3], SRRP and DRRP [4], and the strategy introduced in [5] all seek simply to provision Spot instances to run computations.
Among such strategies, only [3] incorporates the use of different types of Spot instances from separate availability zones, but it compares only cost efficiency across these zones, rather than the inherent reliability of each.

In this paper, we present a resource allocation and job scheduling framework (RAMC-DC) that exploits performance and cost diversity in order to schedule deadline-constrained jobs on a dynamic cluster of Spot and On-Demand instances. RAMC-DC is unique in that it compares price dynamics from different purchasing options, instance types, and availability zones, and evaluates comprehensive methods to estimate the cost of execution on a Spot instance as well as different checkpointing strategies. Furthermore, our approach provides bidding and resource allocation strategies designed to examine: (i) the interchangeability of an instance among short jobs, (ii) the probability of completion of long jobs on an instance, (iii) the tradeoffs of using an On-Demand or Spot instance, and (iv) the cost effectiveness of running jobs on an instance.

Experimental results using Amazon's Spot prices from the period between July and November 2012, and using Downey's speedup model [6], show that our approach can keep total costs between 13% and 20% of the equivalent cost when using only On-Demand instances. Furthermore, these cost savings are achieved while maintaining early-terminations and deadline breaches as low as 0.18% and 0.47%, respectively.

The remainder of this paper is organized as follows. Section II describes background and related work. Section III outlines the instance and job models, and the overall problem RAMC-DC attempts to solve. Section IV introduces the checkpointing strategies, and develops and evaluates comprehensive cost approximation methods.
Section V presents the dynamic resource provisioning strategy as the primary component of RAMC-DC, and an evaluation of RAMC-DC is given in Section VI. The conclusions are then drawn in Section VII.

Footnote 1: The content provider Pinterest was able to reduce its costs from $54/hr to $20/hr by using a dynamic combination of Spot and On-Demand instances to handle elastic load [2].

Fig. 1: Spot prices for instance type m1.large, running a Linux OS, for all availability zones in region us-east-1. [Plot: Spot price ($US) against time (HH:MM) for zones us-east-1a to us-east-1e, with the On-Demand price for comparison.]

II. BACKGROUND AND RELATED WORK

In this section, we begin by describing Amazon's service diversity, with a focus on Spot instances. We then discuss previous studies as they relate to bidding and resource allocation strategies for Spot instances.

A. Spot Instances

Amazon offers its IaaS instances in three different ways with different price and availability/reliability dynamics. Much of the diversity in these offerings is within Spot instances. Each Spot instance type is available in certain zones through a Spot market in which users bid for instances. This Spot market has an associated time-varying price which determines the cost of running a user's Spot instance, as well as if and when the instance is terminated by Amazon. If the user's bid exceeds the current market price, the user gains access to the instance for as long as this holds true, and is charged the current market price at the start of each hour block. If the market price exceeds the bid, the instance is terminated by Amazon and the user is not charged for that partial hour. An example of these market prices is shown in Figure 1.

B. Bidding Strategies

Andrzejak et al. [7] present bidding strategies, designed to satisfy SLA constraints, based on the execution times of an instance for jobs requiring 1,000 minutes of execution. Zafer, Song, and Lee [8], [9] developed optimal bidding strategies in Spot markets from both a client and a broker perspective. From the client's perspective, Zafer et al.
designed a dynamic bidding policy (DBA) to minimize the total cost of a parallel or serial job with a deadline constraint. From the broker's perspective, Song et al. develop a profit aware dynamic bidding algorithm (PADB) that maximizes the time average profit of the broker. The bidding strategies in this paper differ from these works and others by allowing a trade-off between deadline breaches and total cost, adjusting the bidding strategy depending on the execution time, and comparing the price dynamics across availability zones. Furthermore, when handling jobs with low execution times, it becomes prudent to acquire instances with bids independent of the initial jobs, so that other jobs can fill idle hour blocks.

C. Resource Allocation Strategies

A number of resource allocation strategies have been proposed to utilize Spot instances. Zhao et al. [4] develop deterministic and stochastic resource rental planning models (DRRP and SRRP) to minimize costs when running elastic computations on Spot instances. Chen et al. [10] attempt to model the interaction between customer satisfaction and service profit for a provider that leases Spot instances from IaaS providers. Liu [11] and Chohan et al. [12] utilize Spot instances to run MapReduce applications. Voorsluys et al. [3] propose a resource provisioning policy that encompasses two novel fault tolerance techniques aimed at decreasing the volatility of a heterogeneous cluster composed of Spot instances, including migration of VM states across availability zones. Estimates of execution times were made in a fashion similar to those in this work. In contrast to the work presented here, neither [10] nor [3] compares the pricing and reliability dynamics across markets or supplements Spot instances with On-Demand instances.

III. MODELS

In this section, the models used by RAMC-DC for instances and jobs are described, followed by our problem formulation.

A. Instances

Each job is scheduled for execution on an instance, either Spot or On-Demand, represented by $\nu$. The instance $\nu$ refers to either a to-be-leased instance or an existing instance. If $\nu$ is to-be-leased, it may be represented by the triple $\langle i, z, b \rangle$, where $i$ is the instance type, $z$ is the availability zone, and $b$ is the bid. The type $i$ is an element of the set of all types, $I$, and $z$ is an element of $Z$, the set of availability zones. Not all instance types are available as Spot instances in all zones: if $\nu$ is a Spot instance, $z \in Z_i$, where $Z_i$ is the set of all zones in which a Spot instance of type $i$ is available. $\nu$ is an On-Demand instance if and only if $b = \emptyset$; otherwise, $\nu$ is a Spot instance and $b \in \mathbb{R}^+$. When an instance is leased, a pointer to the EC2 instance is added to $\nu$ to retrieve data such as the status of the instance and the time remaining in the hour.

B. Jobs

Users submit a job, $j$, which is placed in a FIFO queue, $J$. Each job is independent and includes a desired instance type, $i_j$, an estimated execution time, $t$ (given in full or partial hours), and a deadline, $D$. In the event of termination, $j$ also contains a reference to the last zone $j$ was executed in, $z_j$ (initially $\emptyset$). The execution time of a job is most likely dependent on the assigned instance. This variation is captured by an example model of job moldability that uses the number of EC2 compute units provided by an instance and the extent to which execution time is altered by them. Thus, we assume each job is one of the following.

Moldable: A moldable job observes some speedup or slowdown when run on larger or smaller instances (with respect to computational power), respectively. As an exemplar of moldability, speedup is determined similarly to [3], using Downey's speedup model [6], which measures the change in execution time for a job running on $n$ processors compared to a job running on 1 processor. Downey's model requires two additional parameters, $A$ and $\sigma$, which measure the average parallelism and the coefficient of variance in parallelism, respectively, and generates $SU(n)$, the speedup of a job using $n$ processors. In this work, we calculate the estimated execution time on $i$ as $t_i = t \cdot SU(n_{i_j}) / SU(n_i)$, where $n_i$ is the number of EC2 Compute Units in $i$, so that a type with a higher speedup than the requested type $i_j$ yields a shorter estimate.

Rigid: A rigid job encounters no speedup on larger instances (i.e., $t_i = t$) and will not execute on smaller instances. Therefore, rigidity requires that only instance types with $n_i \ge n_{i_j}$ be used to execute the job, and these will be the only instance types incorporated in the search. Such a requirement can easily be extended to memory size, storage capacity, etc.

C. Problem Formulation

Scheduling and resource allocation decisions in RAMC-DC are made based primarily on two tunable parameters, $S_{lb} \in [0, 1]$ and $\tau \in [0, \infty)$. The parameter $S_{lb}$ specifies an evaluation lower bound in the Spot instance bidding strategies. The value of $\tau$ represents a splitting parameter used to classify jobs as short or long, thereby determining the evaluation used for that job. In this paper, if $t \le \tau$, the job is classified as short. Otherwise, the job is classified as long. $S_{lb}$ generally specifies how much resistance to early-termination is required for Spot instances; higher values of $S_{lb}$ generally increase the bid and thus incur higher costs but lower early-termination rates.
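As an illustration of the job model, the following minimal sketch computes a moldable job's estimated execution time on a candidate type and classifies the job as short or long. It rests on several assumptions: the piecewise form of Downey's speedup function is reproduced here from Downey's model [6] rather than from this paper, the estimate scales the requested-type time by the ratio of speedups so that a larger instance yields a shorter time, and the names `downey_speedup`, `estimated_time`, `classify`, and `tau` are hypothetical labels chosen for the example.

```python
def downey_speedup(n: float, A: float, sigma: float) -> float:
    """Speedup on n >= 1 processors under Downey's model.

    A is the average parallelism and sigma the coefficient of variance
    in parallelism. The piecewise form is an assumption taken from [6].
    """
    if sigma <= 1.0:  # low-variance branch
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return A
    # high-variance branch
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return A


def estimated_time(t: float, n_requested: float, n_candidate: float,
                   A: float, sigma: float) -> float:
    """Estimated hours on a candidate type, given t hours on the requested type."""
    return (t * downey_speedup(n_requested, A, sigma)
            / downey_speedup(n_candidate, A, sigma))


def classify(t: float, tau: float) -> str:
    """A job is short if its estimated time does not exceed the threshold tau."""
    return "short" if t <= tau else "long"


if __name__ == "__main__":
    # Illustrative values only: a 4-hour job requested on 4 compute units,
    # evaluated on a candidate type with 8 compute units.
    print(estimated_time(4.0, 4, 8, A=16, sigma=0.0))
    print(classify(4.0, tau=6.0))
```

A rigid job would skip the scaling entirely and simply filter candidate types by their compute-unit count.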
The set of instance types searched, $I$, includes Amazon's m1, m2, m3, c1, and cc1 types. For each job in the queue, RAMC-DC must locate a job-instance assignment $(j, \nu)$ that minimizes the cost, $C(j, \nu)$, of running a job $j$ on an instance $\nu$, while meeting either reliability or availability constraints, depending on whether the job is classified as short or long (i.e., whether $t \le \tau$). Thus, given a job $j$, RAMC-DC searches among Spot and On-Demand instances to find the instance $\nu^*$ such that:

$$\nu^* = \arg\min_{\nu \in V} C(j, \nu).$$

Here, $V$ is a set of leased and to-be-leased instances such that, for all $\nu \in V$, $j$ can be executed on $\nu$ before $D$ and $S(j, \nu, \tau) \ge S_{lb}$, where $S(j, \nu, \tau) \in [0, 1]$ is an evaluation of the assignment $(j, \nu)$ using the parameter $\tau$.

IV. COST APPROXIMATION

The cost of running a job on a particular instance is a function of the job execution time and the unit cost of resource rental (the hourly rate in Amazon EC2's case). Because we adopt checkpointing to improve reliability, the checkpointing overhead needs to be taken into account when calculating job execution time. Meanwhile, the resource rental cost of Spot instances is subject to change and must be estimated. In this section, we present three checkpointing strategies incorporated by RAMC-DC. Following this, we compare and evaluate five different methods for approximating the cost of execution on Spot instances.

A. Checkpointing Strategies

For some availability zone, $z$, the execution time in $z$ is modified to include the estimated checkpointing time of the job, if run on a Spot instance, and the time to resume the job from a suspended state if the job was previously checkpointed:

$$t_{i,z} = t_i + t^{chkpt}_i + t^{res}_{z_j \to z}.$$
Here, both the estimated checkpointing time, $t^{chkpt}_i$, and the resume time, $t^{res}_{z_j \to z}$, of an instance in zone $z$ from some checkpoint in $j$'s last zone $z_j$, are determined as in [3], [13], where the suspend and resume rates of a VM state in the same availability zone are $s = 63.67$ MB/s and $r = 81.27$ MB/s, and the resume rate from a different availability zone is set to $r/2$. Thus, the time per checkpoint is the time required to save the instance's memory to a global file system (e.g., Amazon S3), and is given as $t^{susp}_i = m_i / s$, where $m_i$ is the size of the instance's memory. Similarly, the time to resume a checkpointed instance state is calculated as $t^{res}_{z_j \to z} = m_i / r$ if $z = z_j$, and $t^{res}_{z_j \to z} = m_i / (r/2)$ otherwise. When resuming instance states on On-Demand instances, we let $t^{res}_{OD} = t^{res}_{z_j \to z_j} = m_i / r$, as we assume $z$ always equals $z_j$ in such cases.

To provide fault-tolerance, the following checkpointing strategies are compared when running jobs on Spot instances.

1) None: No checkpoints are taken. The estimated checkpointing time is $t^{chkpt}_i = 0$ and, upon forced termination, all completed computation is lost, forcing the job to be restarted.

2) Hourly: A checkpoint is taken at the end of each hour block. The estimated checkpointing time is therefore calculated as $t^{chkpt}_i = \lfloor t_i \rfloor \, t^{susp}_i$ and, upon forced termination, execution resumes from the end of the last hour.

3) Rising Market Price: A checkpoint is taken each time the market price rises for that instance. Thus, we determine the estimated number of checkpoints as the average number of price increases, $avg_{incr}$, for a $t_i$ period over some Spot market window, and the checkpointing time is $t^{chkpt}_i = avg_{incr} \cdot t^{susp}_i$.

B. Execution Costs

For each type-zone pair $(i, z)$, RAMC-DC has access to Amazon's Spot price history for some past span of time:

$$H_{i,z} = \{(p_1, d_1), \ldots, (p_k = p_{mkt}, d_k)\},$$

where $p_m$ is the price at time $d_m$, and $p_k$ is the current market price.
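The checkpointing arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the suspend and resume rates are the values quoted in the text, memory sizes are given in MB and converted to hours, the hourly strategy is assumed to take one checkpoint per completed hour block (a floor), and the function names and the example memory size are hypothetical.

```python
import math

SUSPEND_RATE_MB_S = 63.67  # s, suspend rate quoted in the text
RESUME_RATE_MB_S = 81.27   # r, same-zone resume rate quoted in the text
SECONDS_PER_HOUR = 3600.0


def suspend_hours(mem_mb: float) -> float:
    """Time per checkpoint, m_i / s, expressed in hours."""
    return mem_mb / SUSPEND_RATE_MB_S / SECONDS_PER_HOUR


def resume_hours(mem_mb: float, same_zone: bool = True) -> float:
    """Resume time: m_i / r in the same zone, m_i / (r / 2) across zones."""
    rate = RESUME_RATE_MB_S if same_zone else RESUME_RATE_MB_S / 2
    return mem_mb / rate / SECONDS_PER_HOUR


def checkpoint_hours(strategy: str, t_hours: float, mem_mb: float,
                     avg_price_rises: float = 0.0) -> float:
    """Estimated checkpointing overhead for the three strategies."""
    if strategy == "none":
        return 0.0
    if strategy == "hourly":
        # Assumption: one checkpoint per completed hour block.
        return math.floor(t_hours) * suspend_hours(mem_mb)
    if strategy == "rising":
        return avg_price_rises * suspend_hours(mem_mb)
    raise ValueError(f"unknown strategy: {strategy}")


def total_hours(t_hours: float, mem_mb: float, strategy: str,
                resumed: bool = False, same_zone: bool = True,
                avg_price_rises: float = 0.0) -> float:
    """t_i plus checkpointing overhead plus resume time (if previously checkpointed)."""
    total = t_hours + checkpoint_hours(strategy, t_hours, mem_mb, avg_price_rises)
    if resumed:
        total += resume_hours(mem_mb, same_zone)
    return total


if __name__ == "__main__":
    mem = 7680.0  # illustrative memory size in MB
    print(total_hours(2.5, mem, "hourly", resumed=True, same_zone=False))
```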
To determine the best way to estimate the cost of execution, we evaluate the following five methods for approximating the cost.

1) Market Price (mkt): The cost is approximated as the current market price ($p_{mkt}$) times the ceiling of $t_{i,z}$:

$$\hat{C}_{mkt}(j, \nu) = p_{mkt} \lceil t_{i,z} \rceil.$$

2) Monte Carlo (mc): The cost is approximated using a nonparametric Monte Carlo estimate:

$$\hat{C}_{mc}(j, \nu) = \frac{1}{|X_C|} \sum_{x \in X_C} C_x(j, \nu),$$

where $X_C$ is a set of 10,000 dates sampled uniformly over the past 60 days and $C_x(j, \nu)$ is the true cost of running the job at time $x$ if the job completes successfully, and is otherwise equal to $\hat{C}_{mkt}(j, \nu)$.

3) Average Price (avg): The cost is approximated using an average per-hour price calculated as the weighted sum of all previous market prices less than the bid over the past 60 days, with each weight proportional to the fraction of the time spent at each market price:

$$\hat{C}_{avg}(j, \nu) = \frac{\sum_{p_m \le b,\, m < k} p_m (d_{m+1} - d_m)}{\sum_{p_m \le b,\, m < k} (d_{m+1} - d_m)} \lceil t_{i,z} \rceil.$$

4) Market-Monte Carlo (mmc_α): If the runtime of the job is less than some parameter $\alpha$, the estimated cost is determined using the Market Price estimate of the cost. Otherwise, the estimated cost is calculated as the sum of the Market Price method for the first $\alpha$ hours and the Monte Carlo method for the remaining time.

5) Market-Average (ma_α): As in Market-Monte Carlo, but using the Average Price estimate for the remaining time.

If $\nu$ has been leased, $\hat{C}(j, \nu)$ is calculated as the total cost accounting for the fact that the remaining hour block has already been paid for. Thus, $\hat{C}(j, \nu)$ approximates the cost using the estimated execution time $t_{i,z} - RemHour(\nu)$, where $RemHour(\nu)$ is the predicted time that will remain in $\nu$'s hour block when $j$ is expected to start.

The above approximations were chosen to determine the appropriate weight given to the current market price, and to evaluate the difference in accuracy between a random sampling and an averaging approach. Increasing the weight of the market price will lead to more accurate results if the inter-price time (the time between price changes) is very high. When this is true, the probability of a change in market price during execution is low, so the price first encountered is likely to remain constant.
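The Market Price, Average Price, and Market-Average estimates can be sketched as below. This is a simplified illustration, not the framework's implementation: the price history is assumed to be a time-ordered list of `(price, time_in_hours)` pairs whose last entry is the current market price, both estimates are assumed to bill whole hours via a ceiling, `alpha` is assumed to be a whole number of hours, and the fallback to the market price when no past price lies under the bid is an assumption added for the example. All function names are hypothetical.

```python
import math
from typing import List, Tuple

History = List[Tuple[float, float]]  # (price, time in hours), oldest first


def market_cost(history: History, t_hours: float) -> float:
    """Market Price estimate: current price times the ceiling of the runtime."""
    p_mkt = history[-1][0]
    return p_mkt * math.ceil(t_hours)


def average_price(history: History, bid: float) -> float:
    """Time-weighted mean of past prices that were at or below the bid."""
    weighted = 0.0
    duration = 0.0
    for (price, start), (_, end) in zip(history, history[1:]):
        if price <= bid:
            weighted += price * (end - start)
            duration += end - start
    if duration == 0.0:
        # Assumption for this sketch: fall back to the current market price.
        return history[-1][0]
    return weighted / duration


def average_cost(history: History, bid: float, t_hours: float) -> float:
    """Average Price estimate over the whole runtime."""
    return average_price(history, bid) * math.ceil(t_hours)


def market_average_cost(history: History, bid: float, t_hours: float,
                        alpha: float) -> float:
    """Market price for the first alpha hours, average price for the rest."""
    if t_hours <= alpha:
        return market_cost(history, t_hours)
    return market_cost(history, alpha) + average_cost(history, bid, t_hours - alpha)


if __name__ == "__main__":
    # Illustrative history with one price spike between hours 10 and 12.
    h = [(0.10, 0), (0.30, 10), (0.10, 12), (0.12, 20)]
    print(market_cost(h, 2.5), average_cost(h, 0.2, 2.5),
          market_average_cost(h, 0.2, 3.5, alpha=2))
```

In this toy history a bid of 0.2 excludes the spike from the average, whereas a higher bid would fold it in, which is the effect discussed in the comparison of the methods.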
Approximations utilizing the time-weighted average price will generally yield higher accuracy than random sampling when spikes are more frequent, since such approximations incorporate these spikes, while random sampling has the potential to miss them altogether.

C. Comparison of Cost Approximations

To determine which cost estimation method achieves the lowest relative error, simulations are run using 20,000 jobs over two months, with desired runtime uniformly sampled between 1 and 12 hours (desired computation times above 1 hour guarantee that the true cost is nonzero), and with desired type and zone also uniformly sampled. Each job is allocated a new Spot instance of the requested type, and in the requested zone. Spot price history from the region us-east-1 is used.

Fig. 2: A comparison of the relative error ($\eta$) of each cost estimation method (mkt, mc, avg, mmc_avg, ma_avg, mmc_2, ma_2, mmc_4, ma_4) using availability bids with no checkpointing.

Figure 2 illustrates the accuracy of each cost estimation method for successful jobs (not terminated early) using an $S_{lb}$-availability bid. Here, the variable avg represents the average inter-price time over the Spot price history for the market corresponding to $(i, z)$. For values of $S_{lb}$ less than 0.7, the Average, Market-Average and Market-Monte Carlo (with $\alpha$ = avg) approximations perform similarly, achieving comparably low relative errors. Of these, Market-Average with $\alpha = 4$ is marginally the most accurate. For higher values of $S_{lb}$, the simplest estimate, Market Price, performs the best, with the other estimates quickly becoming more inaccurate as $S_{lb}$ increases. This increasing disparity between cost estimates reflects the fact that the other cost estimation methods rely on the instance's potential bid. As $S_{lb}$ is increased, the bid also increases monotonically, allowing a wider range of past market prices to be taken into account when calculating the average prices or the average costs.
As Spot prices exhibit periods of little fluctuation punctuated by large price spikes, using data from periods of different market prices in the estimation will be less indicative of the actual cost. For lower values of $S_{lb}$, the range of bids which satisfy the confidence level is constricted (when $S_{lb} = 0$, the bid will always be equal to the market price), and thus cost estimation methods utilizing past Spot prices will be more accurate.

V. DYNAMIC RESOURCE PROVISIONING

The overall resource provisioning process RAMC-DC employs is performed by (1) evaluating instance suitability based on $j$'s execution time, (2) finding the most cost-effective instance among already leased and to-be-leased instances that satisfy the evaluation lower bound, $S_{lb}$, as well as the deadline, $D$, (3) leasing a new Spot or On-Demand instance if required (i.e., the optimal instance is to-be-leased), and (4) assigning $j$ to the resulting instance.

A. Two-Tier Instance Evaluation

The suitability of an instance is determined through a two-tier instance evaluation strategy that involves the calculation of the reliability or availability of an instance $\nu$, depending on whether the job is classified as short or long, and is defined as:

$$S(j, \nu, \tau) = \begin{cases} Availability(\nu) & \text{if } t \le \tau, \\ Reliability(j, \nu) & \text{if } t > \tau, \end{cases}$$

with Reliability and Availability defined below. If $\nu$ is an On-Demand instance, we assume $S(j, \nu, \tau) = 1$.

If $t \le \tau$, $S(j, \nu, \tau)$ is calculated as the portion of time that $b$ was above the market price during a 60-day Spot price window for the Spot market determined by $(i, z)$:

$$Availability(\nu) = \frac{\sum_{p_m \le b,\, m < k} (d_{m+1} - d_m)}{d_k - d_1}.$$

As described in Section IV-B, $(p_m, d_m) \in H_{i,z}$ for $m = 1, \ldots, k$, and $p_k = p_{mkt}$ is the current market price.

If $t > \tau$, $S(j, \nu, \tau)$ is calculated as the empirical probability of successful completion if $\nu$ was used to execute $j$. That is,

$$Reliability(j, \nu) = P(T_{i,z,b} \ge t_{i,z}),$$

where $T_{i,z,b}$ is a random variable representing the true length of time for which the Spot instance is available to the user when bidding $b$ on instance type $i$ in availability zone $z$. This probability is estimated using the nonparametric Kaplan-Meier estimator:

$$Reliability(j, \nu) = \prod_{x \in X:\; t_{i,z,b}(x) \le t_{i,z}} \frac{n_{i,z,b}(x) - 1}{n_{i,z,b}(x)},$$

where $X$ is a set of 10,000 dates sampled uniformly over the past 60 days of the Spot price history, $t_{i,z,b}(x)$ is the true step length for an instance leased at time $x \in X$ with type $i$ in zone $z$ and with bid $b$, and $n_{i,z,b}(x)$ is the number of samples in which the instance was available for longer than $t_{i,z,b}(x)$.

B. Bidding

The optimal bid for a Spot instance is calculated as the minimum bid that satisfies a lower bound on the instance evaluation function described above. Thus, the bid for a Spot instance of type $i$ in zone $z$ is calculated as:

$$b(j, i, z, \tau) = \min\{\, b \in \mathbb{R}^+ : S(j, \langle i, z, b \rangle, \tau) \ge S_{lb} \;\wedge\; b \ge p_{mkt}(i, z) \,\}.$$

If the job has an estimated execution time greater than the splitting parameter $\tau$, the bidding strategy locates the minimum bid such that the empirical probability of completion of $j$ on an instance of type $i$ in zone $z$ is greater than or equal to $S_{lb}$. Such a strategy helps to provide job-specific bids that can limit the risk of early-termination for long jobs.
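For short jobs, the availability evaluation and the corresponding minimum bid can be sketched as follows. The sketch assumes the same time-ordered `(price, time)` history layout as the Spot price history above, with the last entry holding the current market price, and it assumes that only observed prices need to be tried as candidate bids, because availability is a step function of the bid that changes only at those prices. The function names are hypothetical, and the reliability-based bid for long jobs is not covered here.

```python
from typing import List, Optional, Tuple

History = List[Tuple[float, float]]  # (price, time), oldest first


def availability(history: History, bid: float) -> float:
    """Fraction of the price window during which the price was at or below the bid."""
    window = history[-1][1] - history[0][1]
    under_bid = sum(end - start
                    for (price, start), (_, end) in zip(history, history[1:])
                    if price <= bid)
    return under_bid / window


def min_availability_bid(history: History, s_lb: float) -> Optional[float]:
    """Smallest bid at or above the market price whose availability reaches s_lb."""
    p_mkt = history[-1][0]
    # Candidate bids: observed prices not below the current market price.
    candidates = sorted({price for price, _ in history if price >= p_mkt})
    for bid in candidates:
        if availability(history, bid) >= s_lb:
            return bid
    return None  # no observed price satisfies the bound


if __name__ == "__main__":
    h = [(0.10, 0), (0.30, 10), (0.10, 12), (0.12, 20)]
    for s_lb in (0.0, 0.9, 0.95):
        print(s_lb, min_availability_bid(h, s_lb))
```

With a lower bound of zero the search returns the current market price, which matches the observation that the bid always equals the market price in that case.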
On the other hand, if the execution time is less than $\tau$, the bidding strategy instead locates the minimum bid such that the instance has been available (i.e., the market price has been under the bid) for at least $S_{lb}$ of the time over the Spot price window. This approach helps guarantee that instances are interchangeable among short jobs, thereby filling partially empty hour blocks.

C. Resource Provisioning

The process of resource provisioning and job assignment is described in Algorithm 1. Here, ETUI(ν) represents the estimated time until ν is idle and equals the sum of the remaining estimated runtimes of each job assigned to ν.

Algorithm 1: Provision - Identifying the minimum cost job-instance assignment and provisioning resources.
Data: J, S_lb, τ
 1  begin
 2    SPOT ← ∅, OD ← ∅
 3    while true do
 4      j ← Pop(J)                      // waits for J to be non-empty
 5      V ← ∅, ν* ← ∅, ν_new ← ∅, breach ← false
 6      I_j ← { i ∈ I : t_i + t_res_OD ≤ D }
 7      if I_j == ∅ then
 8        I_j ← { i ∈ I : n_i ≥ n_{i_j} }, breach ← true
 9        V ← { ν ∈ SPOT ∪ OD : i ∈ I_j ∧ ETUI(ν) = 0 ∧ S(j, ν, τ) ≥ S_lb }
10      else
11        V ← { ν ∈ SPOT ∪ OD : i ∈ I_j ∧ ETUI(ν) + t_{i,z} ≤ D ∧ S(j, ν, τ) ≥ S_lb }
12      end
13      ν_new ← MinNew(j, S_lb, τ, i_j, breach)   // Alg. 2
14      ν* ← argmin over ν ∈ V ∪ {ν_new} of Ĉ(j, ν)
15      if ν* == ν_new then
16        Lease(ν*)                     // lease ν* from Amazon EC2
17        if ν* is a Spot instance then
18          Add(ν*, SPOT)
19        else
20          Add(ν*, OD)
21        end
22      end
23      Assign(j, ν*)                   // push j to ν*'s FIFO queue
24    end
25  end

From the set of all instance types $I$ we determine the set of feasible types, $I_j \subseteq I$, that would satisfy the deadline with the corresponding On-Demand instances (line 6). If no feasible types exist, $I_j$ is constructed as the set of instance types with $n_i$ greater than or equal to that of the job's previous instance. Then, the set of all feasible instances, $V$, is constructed.
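The selection step of Algorithm 1 (lines 9, 11, and 14) can be sketched as follows. This is a deliberately reduced illustration: each instance is represented by precomputed values for the current job (estimated cost, evaluation score, time until idle, and execution time), the instance-type filter is assumed to have been applied already, and the `Candidate` class and `choose_instance` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    est_cost: float    # estimated cost of running the job on this instance
    evaluation: float  # availability or reliability score in [0, 1]
    etui: float        # estimated hours until the instance is idle
    exec_time: float   # estimated execution time on this instance (hours)


def choose_instance(leased: List[Candidate], new_candidate: Candidate,
                    deadline: float, s_lb: float,
                    breach: bool = False) -> Candidate:
    """Pick the cheapest feasible instance among leased ones and the best new one."""
    feasible = []
    for c in leased:
        if c.evaluation < s_lb:
            continue  # fails the evaluation lower bound
        if breach:
            ok = c.etui == 0  # deadline cannot be met: require an idle instance
        else:
            ok = c.etui + c.exec_time <= deadline
        if ok:
            feasible.append(c)
    # The new candidate always competes with the already-leased instances.
    return min(feasible + [new_candidate], key=lambda c: c.est_cost)


if __name__ == "__main__":
    leased = [Candidate("a", 0.5, 0.9, 1.0, 2.0),
              Candidate("b", 0.2, 0.5, 0.0, 1.0)]
    new = Candidate("new", 0.4, 1.0, 0.0, 2.0)
    print(choose_instance(leased, new, deadline=4.0, s_lb=0.8).name)
```

If the returned candidate is the new one, the caller would lease it and add it to the Spot or On-Demand pool before assigning the job.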
If there are no feasible types, as discussed above, each $\nu \in V$ must be idle and have at least as many EC2 compute units as the last instance $j$ was executed on (lines 8 and 9); otherwise, the sum of the estimated time until $\nu$ is idle and the execution time of $j$ on $\nu$ must be less than or equal to $D$ (line 11). In both cases, each $\nu$ must also satisfy the instance evaluation inequality, $S(j, \nu, \tau) \ge S_{lb}$. The instance $\nu_{new}$ in line 13, identified using Algorithm 2, represents the lowest cost instance that may potentially be leased if no lower cost already-leased instances are found. $\nu^*$ is determined as the instance that minimizes the estimated cost of execution defined by $\hat{C}(j, \nu)$ (line 14). Estimated costs for On-Demand instances are calculated as $t_{i,z} - RemHour(\nu)$ times the hourly On-Demand price for type $i$. If $\nu^*$ is not yet leased (i.e., $\nu^* == \nu_{new}$, as discussed above), an instance matching $\nu^*$'s description is leased and added to the corresponding set of leased instances, SPOT or OD. Thus, if $b \ne \emptyset$, a Spot instance of type $i$, in zone $z$, and with bid $b$ is leased (lines 16-18); otherwise, an On-Demand instance in zone $z_j$ is leased, where $z_j$ is either us-east-1a if $j$ has not been previously attempted, or $j$'s last availability zone. After leasing the instance, a pointer to this instance is added to $\nu^*$ and $j$ is then assigned to $\nu^*$'s queue.

Algorithm 2: MinNew - Identifying the minimum cost new potential instance satisfying S(j, ν, τ) ≥ S_lb.
Data: j, S_lb, τ, i_j, breach
Result: ν_new (an unleased instance)
 1  begin
 2    ν_new ← ∅, c* ← ∞, s* ← 0
 3    for i ∈ I_j do
 4      for z ∈ Z_i do
 5        if t_{i,z} ≤ D ∨ breach then
 6          ν_SPOT ← ⟨i, z, b(j, i, z, τ)⟩
 7          s ← S(j, ν_SPOT, τ)
 8          c ← Ĉ(j, ν_SPOT) + C_resume(j, z)
 9          if (c < c*) ∨ (c == c* ∧ s > s*) then
10            ν_new ← ν_SPOT, s* ← s, c* ← c
11          end
12        end
13      end
14      ν_OD ← ⟨i, z_j, ∅⟩              // potential new On-Demand
15      c ← Ĉ(j, ν_OD)                  // see Section V-C
16      if c ≤ c* then
17        ν_new ← ν_OD, s* ← 1, c* ← c
18      end
19    end
20  end

D. Identification of New Resources

The identification of $\nu_{new}$ from line 13 of Algorithm 1 is outlined in Algorithm 2. The first for loop (line 3) iterates through the set of feasible instance types given by $I_j$, and the nested for loop (line 4) iterates through the corresponding availability zones in which a Spot instance of type $i$ is available. For each $(i, z)$ combination, if the estimated execution time on type $i$ in zone $z$ ($t_{i,z}$) satisfies the deadline, or if the job will surpass the deadline regardless, the estimated cost is compared to the current minimum. If the job must be resumed from another availability zone, a data transfer cost is added at the rate of $0.01/GB [1] and is calculated by $C_{resume}(j, z)$, although Amazon has since made such transfers free. Due to the static pricing characteristics of On-Demand instances among availability zones, potential On-Demand instances are evaluated for each instance type only (in the outer loop of Algorithm 2).

E. Job Scheduling and Resource Deprovisioning

If $j$ has been assigned to an instance but has not been started before $D - t_{i,z}$ and the assigned instance is not idle, $j$ is pushed to the front of $J$. Otherwise, prior to execution, the algorithm again searches for any lower cost instances on which to run the job and reassigns the job if a cheaper alternative is found. If no cheaper alternatives are found, $i_j$ and $z_j$ are updated and $j$ is executed. In addition to the loss of Spot instances from early-termination, On-Demand and Spot instances are automatically released at the end of the current hour block if their assignment queues are empty.

VI. EVALUATION

This section evaluates RAMC-DC using real workload traces and presents results based on total costs, deadline breaches and early-termination rates.

A. Experimental Setup

The workload used for the evaluation consists of two sets of 20,000 jobs, with execution times and arrival times taken from the traces of the ANL Intrepid supercomputer [14]. Jobs in the first set are assumed to be moldable, and jobs in the second are rigid. Each job is assumed to initially require access to a data file with size less than 1 GB. Thus, jobs initially run in zones other than us-east-1a incur an additional transfer cost of $0.01. Deadlines for jobs are given as twice the estimated runtime, the requested instance type is uniformly sampled from the available types, and values of $A$ and $\sigma$ for moldable jobs are calculated using the model of Cirne and Berman [15]. In addition, we adopt the Market Price estimation, i.e., $\hat{C} = \hat{C}_{mkt}$, for the sake of simplicity and accuracy. The workload traces were chosen for the general proximity of estimated and true execution times, as well as the range of these execution times, and the dispersion of jobs over a time period of several months.

B. Total Costs

Total costs when using no checkpointing with moldable jobs are shown in Figure 3a (rigid jobs evince similar characteristics).
Total costs for all checkpointing strategies and with both sets of jobs, at the two extreme values of $\tau$ (as they determine upper and lower bounds), are given in Figures 3b and 3c. Thus, these two figures specify the range of observable total costs given each checkpointing strategy. Varying $\tau$ between these two values therefore effectively allows a tradeoff between cost and volatility, with higher values of $\tau$ decreasing cost but increasing volatility. As seen in Figure 3, for all values of $S_{lb}$ and $\tau$, and each checkpointing strategy, total costs when incorporating Spot instances are very low, attaining a minimum of 12.75% of the On-Demand cost for moldable jobs and 19.5% for rigid jobs. The total costs from using only On-Demand instances in our approach are equal to $15,305 when using moldable jobs, and $24,433 when using rigid jobs. For both sets of jobs, hourly checkpointing generally incurs the highest total cost due to the checkpointing overhead. Incorporating no checkpointing strategy often results in the lowest costs, although a rising-market-price strategy achieves the lowest costs over part of the $S_{lb}$ range (delimited by $S_{lb} = 0.8$) when using rigid jobs. Allowing instance evaluation to be a function of the execution time of a job yields higher costs than evaluation relying on availability; total costs decrease by as much as 4% of the total On-Demand cost when evaluation is always calculated as instance availability (i.e., when every job is classified as short). Such a decrease is due to the increased interchangeability of instances, and thus the higher number of feasible instances, inherent in such an evaluation.

Fig. 3: The total cost over On-Demand cost using various checkpointing strategies, job types, and values of $\tau$. Panels: (a) Moldable Jobs, No Checkpointing; (b) Moldable Jobs, All Checkpointing; (c) Rigid Jobs, All Checkpointing.

Fig. 4: The deadline breach rate using different checkpointing strategies and values of $\tau$ with moldable jobs. Panels: (a) No Checkpointing; (b) Hourly Checkpointing; (c) Rising-Price Checkpointing.

Fig. 5: The deadline breach rate using different checkpointing strategies and values of $\tau$ with rigid jobs. Panels: (a) No Checkpointing; (b) Hourly Checkpointing; (c) Rising-Price Checkpointing.

C. Deadline Breaches

Deadline breaches generally occur very infrequently, with rates reaching a minimum of just 0.74% when using an hourly checkpointing strategy with moldable jobs, and 0.46% when using no checkpointing strategy and rigid jobs (see Figures 4 and 5). For both sets of jobs, hourly checkpointing tends to have the lowest number of deadline breaches, and increasing $S_{lb}$ and decreasing $\tau$ tends to decrease the number of deadline breaches. However, when using moldable jobs, setting $S_{lb} = 1$ results in a sharp spike in deadline breaches for the none and rising checkpointing strategies. These spikes often occur because no instance can be found that satisfies $S_{lb}$ while maintaining a reasonable bid and market price, and thus the job must wait for such an instance to become available.
This additional waiting time, together with the sparsity of suitable instances, increases the risk and propagation of a deadline breach. When S_lb decreases, total costs decrease and early-terminations increase. Depending on the value of and

the checkpointing strategy used, however, it is still possible to maintain low deadline breach rates as S_lb decreases. As seen in Figure 4b, using an hourly checkpointing strategy and letting = 0, for example, allows RAMC-DC to still maintain low deadline breaches when S_lb = 0.05, while keeping the total cost equal to 13% and 14.5% of the On-Demand cost for moldable and rigid jobs, respectively. In the case of moldable jobs, a rising-market price strategy allows RAMC-DC to maintain steady deadline breach rates at around 1.6% of all jobs while incurring lower costs than an hourly checkpointing strategy.

D. Early-Terminations

Early-terminations occur in as few as 0.18% of all jobs when S_lb = 1, regardless of moldability, checkpointing strategy, and value of (as seen in Figure 6). This figure also shows that the approach presented in this paper can keep these early-termination rates below 9.5% and 12.5% of moldable and rigid jobs, respectively, when S_lb. Indeed, while achieving such low early-termination rates, our approach still keeps total costs under 19.5% of the On-Demand cost, regardless of job type. The variation in early-termination rates for different values of is highest when S_lb is not equal to 0 or 1, and decreases as S_lb approaches these values. For lower values of S_lb (less than 0.5), all values of achieve roughly similar early-termination rates, with higher values of incurring slightly lower rates. As expected, as S_lb increases, however, early-termination rates decrease until S(j, ν, ) = Reliability(ν, ), due to the shift in focus to successful completion, rather than the reliability.

Fig. 6: Early-termination rates with no checkpointing. (a) Moldable Jobs. (b) Rigid Jobs.

VII. CONCLUSIONS

The approach presented in this paper provides a cost-effective and low-volatility means to run both moldable and rigid deadline-constrained jobs.
The dynamic provisioning strategy in RAMC-DC handles a mixture of Spot and On-Demand instances, and exploits the inherent performance and cost diversity within these complementary purchasing options. Furthermore, the tunable parameters, S_lb and , allow the operator to effectively trade higher total cost for lower volatility. To validate our approach, experiments were run using Amazon's Spot price history, workload traces from ANL Intrepid with a deadline of twice the estimated runtime, and Downey's speedup model as an exemplar approach to predicting job moldability. Our evaluation results confirm these claims: deadline breach and early-termination rates are mostly below 2% and 1%, respectively, while incurring below 20% of equivalent On-Demand instance costs.

ACKNOWLEDGMENT

Prof. Zomaya would like to acknowledge the support of the Australian Research Council Discovery Grant DP