Towards Auction-Based HPC Computing in the Cloud *

Computer Technology and Application 3 (2012)

Moussa Taifi, Justin Y. Shi and Abdallah Khreishah
Computer and Information Sciences Department, Temple University, Philadelphia, PA 19122, USA

Received: June 01, 2012 / Accepted: July 03, 2012 / Published: July 25, 2012.

Abstract: Cloud computing is expanding widely in the world of IT infrastructure, due in part to the cost-saving effect of economies of scale. Fair market conditions can in theory provide a healthy environment that reflects the most reasonable costs of computations. While fixed cloud pricing provides an attractive low entry barrier for compute-intensive applications, both the consumer and the supplier of computing resources can obtain higher efficiency for their investments by participating in auction-based exchanges. There are strong incentives for the cloud provider to offer auctioned resources; from the consumer's perspective, however, using these resources is a sparsely discussed challenge. This paper reports a methodology and framework designed to address the challenges of running HPC (high performance computing) applications on auction-based cloud clusters. The authors focus on HPC applications and describe a method for determining bid-aware checkpointing intervals. They extend a theoretical model for determining checkpoint intervals using statistical analysis of pricing histories. The latest developments in the SpotHPC framework are also introduced, which aim at facilitating the managed execution of real MPI applications on auction-based cloud environments. The authors use their model to simulate a set of algorithms with different computing and communication densities. The results show the complex interactions between optimal bidding strategies and parallel application performance.

Key words: Auction-based cloud computing, fault tolerance, cloud HPC (high performance computing).

1. Introduction

The economy of scale offers cloud computing virtually unlimited, cost-effective processing potential. While it is in general difficult to assess the real cost of a computation task, the auction-based provisioning scheme offers a reasonable pricing structure. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computations. The fairness is ensured by the mutual agreements between the sellers and the buyers.

From the consumer's perspective, among all computing applications, HPC (high performance computing) applications are the biggest potential beneficiaries. From the seller's perspective, HPC applications represent the most reliable income stream, since they are the most resource-intensive users. Theoretically, resource usage efficiency is also maximized under the auction-based provisioning schemes.

Traditional HPC applications are typically optimized for hardware features to obtain processing efficiency. Since transient component errors can halt the entire application, it has become increasingly important to create autonomic applications that can automate checkpointing and restarting with little loss of useful work.

Corresponding author: Moussa Taifi, Ph.D. candidate, research fields: dependable cloud computing, high performance computing, auction-based cloud computing, fault tolerance. E-mail: moussa.taifi@temple.edu.
* This paper is an extended version of "SpotMPI: A framework for auction-based HPC computing using Amazon spot instances", published in the International Symposium on Advances of Distributed Computing and Networking (ADCN 2011).
Although existing HPC applications are not designed for volatile computing environments, with an automated checkpoint-restart (CPR) HPC toolkit it is plausible that practical HPC applications could gain additional cost advantages from auction-based resources by dynamically minimizing CPR overheads.

Unlike existing MPI (Message Passing Interface) fault tolerance tools, the authors emphasize dynamically adjusting optimal CPR intervals in order to offset the large number of out-of-bid failures typical of a volatile auction-based computing platform. The authors introduce a formal model and an HPC application toolkit, SpotHPC, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms.

In section 2, the background and context of the current research are described. In section 3, the authors establish models for estimating the running times of HPC applications using auction-based cloud resources. The proposed models take into account the time complexities of the HPC application, the overheads of checkpoint-restart, and the publicly available resource bidding history. They seek to unravel the inter-dependencies between the application's computing/communication complexities, the number of required processors, bidding prices and the eventual processing costs. The authors then introduce the SpotHPC toolkit and show how it can automate MPI application processing using volatile resources under the guidance of the formal models. In section 4, the proposed models are applied to recent bidding histories of Amazon EC2 HPC resources. Preliminary results for two HPC application types with different computing and communication complexities are reported. Section 5 concludes the paper and outlines potential future research directions.

2. Background

2.1 Auction-Based Computing: Spot Instances

Amazon is one of the first cloud computing vendors to provide at least two types of cloud instances: on-demand instances and spot instances. An on-demand instance has a fixed price. Once ordered, it provides service according to Amazon's Service Level Agreement (SLA). A spot instance is a type of resource whose availability is controlled by the current bidding price and the auction market.

There are unique characteristics of the auction-based computing platform. First, a stable computing environment can potentially be established by bidding the on-demand prices. Second, lower costs can be gained if the applications can tolerate partial failures; thus, the most fault-resilient implementation will gain the best possible cost effectiveness. Third, given an application, its required processing time and/or budget requirements, as well as the bidding history of the required resources, it is possible to develop an optimized bidding strategy to meet the desired target(s).

There are three special features of Amazon's spot instance pricing policy: a successful bid does not guarantee exclusive resource access for the entire requested duration, since the Amazon engine can terminate access at any time if a higher bid is received; Amazon does not charge a partial hour (a job terminated before reaching the hour boundary) if the termination is caused by out-bidding, whereas the partial hour is charged in full if the user terminates the job; and Amazon only charges the user the highest market price, which may be less than the user's successful bid. (A cost-accounting sketch of these rules is given at the end of this subsection.)

The authors have chosen two types of Amazon EC2 HPC resources for this study. The cc1.4xlarge and the cg1.4xlarge are cluster HPC instances that provide cluster-level performance (23 GB of memory, 8 cores, 10 Gigabit Ethernet). The main difference is the presence of GPUs (graphical processing units, 2 x NVIDIA Tesla Fermi M2050) in the cg1.4xlarge, which provides more power for compute-intensive applications (Table 1).

Fig. 1 records a sample market price history for the cc1.4xlarge instance type from May 5 to May 11. This instance type shows typical user behavior for more legacy HPC applications. The cg1.4xlarge instance type illustrates resources for HPC applications that can benefit from GPU processing. Since many legacy HPC applications are not yet suitable for GPU processing, the cg1.4xlarge pricing history shows fewer fluctuations.
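To make the pricing rules above concrete, the short sketch below accounts for the cost of one spot session under the three stated rules. It is only an illustration: the hour-by-hour market prices, the termination cause, the partial-hour fraction and the function name are all assumed values, not anything published by Amazon or by the authors.

```python
def spot_session_cost(hourly_market_prices, terminated_by_outbid, partial_hour_fraction):
    """Estimate the bill for one spot session under the rules described above.

    hourly_market_prices: market price charged for each completed hour
                          (Amazon charges the market price, not the user's bid).
    terminated_by_outbid:  True if the session ended because of an out-of-bid event.
    partial_hour_fraction: fraction of the final, partial hour that was used."""
    cost = sum(hourly_market_prices)  # completed hours are always charged
    if partial_hour_fraction > 0 and not terminated_by_outbid:
        # A user-terminated partial hour is billed as a full hour; the price for
        # that hour is assumed here to equal the last known market price.
        cost += hourly_market_prices[-1] if hourly_market_prices else 0.0
    # An out-of-bid termination makes the partial hour free of charge.
    return cost

# Illustrative example: 3 completed hours at varying market prices, then out-bid mid-hour.
print(spot_session_cost([0.52, 0.55, 0.61],
                        terminated_by_outbid=True,
                        partial_hour_fraction=0.4))  # -> 1.68
```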

Table 1  Amazon HPC resource types.
cc1.4xlarge: 23 GB memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core, Nehalem architecture), 1690 GB storage, 64-bit platform, 10 Gigabit Ethernet.
cg1.4xlarge: 22 GB memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core, Nehalem architecture), 2 x NVIDIA Tesla Fermi M2050 GPUs, 1690 GB storage, 64-bit platform, 10 Gigabit Ethernet.

Fig. 1  Market prices of the cc1.4xlarge instance in May.

2.2 HPC in the Cloud

Although HPC applications are the biggest potential beneficiaries of cloud computing, except for a few simple applications there are still many practical concerns. Most mathematical libraries rely on optimized numerical codes that exploit common hardware features for extreme efficiency; some of these hardware features, such as the hardware cache, are not mapped into virtual machines, so HPC applications suffer additional performance drawbacks on top of the normal virtualization overhead. Many HPC applications have high inter-processor communication demands, and current virtualized networks have difficulty meeting them. Finally, all existing HPC applications handle only two communication states: success and failure. While success is a reliable state, failure is not, and existing applications treat a timeout as identical to a failure. Consequently, any transient component failure can halt the entire application. Using volatile spot instances for these applications is a serious challenge.

Initially, low-end cloud services provided little guarantee on the deliverable performance for HPC applications. Recently, high-end cloud resources have been developed specifically for HPC applications. These improvements have demonstrated hopeful features [1-3] and show the diminishing overhead of virtual machine monitors such as Xen [4]. Due to the severity of declining MTBFs, fault tolerance for MPI applications has also progressed. These developments inspired the design and development of SpotHPC.

2.3 Checkpoint-Restart (CPR) MPI Applications

Much research has been done in the past to provide seamless fault tolerance for MPI applications. FT-MPI [5] uses interactive process fault tolerance. Starfish [6] supports a number of CPR protocols and LA-MPI [7] provides message-level failure tolerance. Egida [8] experimented with a message-logging grammar specification as a means for fault tolerance. Cocheck [9] extends the Condor [10] scheduler to provide a set of fault tolerance methods that can be used by MPI applications. The authors choose OpenMPI's coordinated CPR because of its cluster-wide checkpoint advantage, since more fine-grained strategies will not work in this highly volatile environment [11-13].

The challenge faced by applications willing to exploit the volatile nature of spot instances is of a different kind than on regular clusters. The behavior of spot instances can be analyzed as a fail-stop mechanism: the Amazon engine uses the market information and the user's bid to terminate an instance with no prior notice [14]. This is called an out-of-bid failure. These out-of-bid failures require applications to adapt to runs that are frequently interrupted. While there are many different fault tolerance libraries for MPI, the authors choose Open MPI's coordinated CPR mechanism [15]. The high volatility of the spot instance platform makes many fine-grained checkpoint strategies impractical. These include local checkpoints [11], multilevel checkpoints [12] and using pairs or groups of nodes to provide redundancy [13].

The OpenMPI coordinated CPR is an all-or-nothing, single-task CPR interface that, although it incurs higher overheads, is guaranteed to work correctly regardless of the number of processors.

Other single-task CPR efforts involving spot instances include map-reduce applications [16-17]. A map-reduce application does not require inter-task communication, so parallel processing can be controlled externally to the individual tasks. Therefore, spot instances can be used as accelerators via a simple job monitor that tracks and restarts dead jobs automatically [18].

Other noticeable efforts studying spot instances include Refs. [19-20]. These are based on simulating the behavior of a single instance under different bids, and they outlined the inherent tradeoff between completion time and budget. Ref. [19] proposes a decision model and describes a simulator that is able to determine, under a set of conditions, the total expected time of a single application. Another study, Ref. [20], discussed a set of checkpoint strategies that can maximize the use of spot instances while minimizing costs. Resource allocation strategies are also identified in Refs. [21-22]. This line of work uses monetary and runtime estimation techniques to simulate the runtime of grid-based applications on such volatile infrastructure, and provides a heuristic study of the effects and benefits of generic fault tolerance techniques such as checkpointing, task duplication and migration. Additionally, the research carried out in Ref. [23] explores the spot instance markets and captures the long-term behavior of spot instance pricing, including the distributions of prices and inter-price times, as well as the difficulties of fitting analytical distributions to a young market where only sparse variations exist so far. In addition, the research in Ref. [24] points out the existence of different epochs in the pricing behavior and the implications for using the spot price data. That report contributes to the understanding of the usefulness of the publicly available data and of how decision models need to be aware of changes in pricing policies and the supplier's announcements concerning new prices and new pricing regions.

While these research efforts have clarified some of the challenges of spot instances, only a few, such as Ref. [25], have touched specifically on HPC applications and on how the nature of the application, being high performance or high throughput, impacts the fault tolerance strategies used to mitigate interruptions due to market fluctuations. While describing strategies for single applications is crucial to the understanding of spot resources and auction-based computing in general, this knowledge is not fully usable in the context of HPC computing, and many issues need examination when applications are meant to scale to higher orders of magnitude. To the best of the authors' knowledge, there has been no direct evaluation of practical MPI applications on spot instances. The volatile auction-based computing platform challenges established HPC programming practices.

2.4 Evaluating MPI Applications Using Auction-Based Platforms

For HPC applications using a large number of processors, the CPR overhead is the biggest cost factor. Without CPR optimization, MPI applications are unlikely to gain practical acceptance on volatile auction-based platforms.
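While the paper relies on SpotHPC to drive this process, the coordinated CPR mechanism chosen in section 2.3 can also be exercised from a small wrapper script. The sketch below is a minimal illustration, assuming an Open MPI build with BLCR-backed checkpoint/restart support, in which jobs are launched with `mpirun -am ft-enable-cr` and the `ompi-checkpoint`/`ompi-restart` commands are available; the snapshot-handle parsing and the fixed interval are illustrative assumptions, and in SpotHPC the interval would come from the bid-aware model of section 3.

```python
import subprocess
import time

CHECKPOINT_INTERVAL = 600  # seconds; illustrative, would come from the bid-aware model

def run_with_coordinated_cpr(mpi_cmd, interval=CHECKPOINT_INTERVAL):
    """Launch an MPI job with Open MPI's CR framework and checkpoint it periodically.

    Assumes an Open MPI + BLCR installation providing ompi-checkpoint/ompi-restart."""
    # Enable the checkpoint/restart parameter set when launching the job.
    job = subprocess.Popen(["mpirun", "-am", "ft-enable-cr"] + mpi_cmd)
    last_handle = None
    while job.poll() is None:  # job still running
        time.sleep(interval)
        if job.poll() is not None:
            break
        # Take a coordinated, cluster-wide checkpoint of the running mpirun job.
        out = subprocess.run(["ompi-checkpoint", "-v", str(job.pid)],
                             capture_output=True, text=True)
        if out.returncode == 0 and out.stdout.strip():
            # The console output format is not guaranteed; keeping the last line
            # as the global snapshot handle is an assumption for illustration.
            last_handle = out.stdout.strip().splitlines()[-1]
    return job.returncode, last_handle

def restart_from(handle):
    """Restart a previously checkpointed job from its snapshot handle."""
    return subprocess.call(["ompi-restart", handle])
```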
The authors report a theoretical model based on application resource time complexities [26] and optimal CPR models [27-28]. In addition, they describe a toolkit named SpotHPC that can support autonomic MPI applications on spot instance clusters. This toolkit can monitor spot instances and bidding prices, automate checkpointing at bidding-price (and history) adjusted optimal intervals, and automatically restart applications after out-of-bid failures.

3. Theoretical Model

The auction prices vary dynamically depending on the supply and demand in the Amazon marketplace.

There are no guidelines from Amazon as to how the prices are set. Unlike other projects (e.g., Ref. [29]) that use autoregressive models to maximize the profit of fictitious cloud providers, this paper focuses on the intrinsic characteristics of the user's application and the bidding history. The authors are interested in the inherent dependencies between these characteristics and their impact on the optimal CPR interval, the largest cost factor for MPI applications running on a volatile platform.

3.1 Bid-Aware Optimal CPR Interval

The authors assume that the time between consecutive out-of-bid failures is exponentially distributed with rate lambda_b. This allows out-of-bid failures to be modeled in the same way as traditional failures, but at a rate that depends on the bid price b. Thus, the authors can extend previous work on optimal CPR intervals for distributed memory applications. This paper builds on the original CPR interval work of Ref. [28], which is extended by Ref. [27] and later adapted to MPI by Ref. [30]; the discussion uses the same symbols (Table 2).

Similar to Ref. [30], the authors obtain the expected application running time with checkpoints and failures, under the important assumptions that an out-of-bid failure occurs at most once per checkpoint interval and that all failures are independent. This leads to the first-order optimal CPR interval of Refs. [27-28, 30], tau_opt = sqrt(2*delta/lambda_b), where delta is the time needed to create and store a checkpoint.

A crucial difference between stable clusters and spot instance clusters is that an out-of-bid failure forces an application downtime that is absent for ordinary component failures: the restart cannot begin until the average downtime per out-of-bid failure, D, has elapsed. D can be obtained using the price history and the current bid. Adding this downtime to the recovery cost of each out-of-bid failure yields the expected running time using spot instances, Eq. (1).

Another difference between traditional clusters and spot-based clusters is that the failure rate is not fixed for all instances. Instead, the failure rate is a function of the bid proposed by the user and the market price. This situation requires statistical tools to determine the failure rate of an instance given a bid price. The authors obtain the price history and determine the empirical cumulative distribution of failures given a specific bid price; various fits of this distribution are then obtained. Fig. 2 shows that, as the bid price increases, the CDF of the availability of the instance under that bid price also increases. This measure of price stability is used as a measure of failure-free runtime: the higher the bid price, the longer the availability time of an instance at that bid price. Since the proposed model relies on an exponential distribution, the authors use the corresponding exponential fit's parameters and simulate the runtime of various applications using this exponential failure rate.

3.2 SpotHPC Framework

Due to the lack of toolkits that deal explicitly with auction-based HPC, the authors developed a framework to run distributed applications such as MPI on virtual clusters composed of failure-prone spot instances. SpotHPC is composed of four components: cluster orchestration, a checkpoint/restart service, a checkpoint forecaster and a monitoring service (Fig. 3). These modules are initially installed on an on-demand instance that is free of out-of-bid failures. The cluster monitoring service pulls the status and bidding prices of all instances continuously.
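The bid-aware interval computation of section 3.1 can be illustrated with a short sketch. The price-history format, the numeric values and the helper names below are illustrative assumptions; the sketch simply measures how long the market price stays at or below a given bid, fits an exponential rate to those durations, and applies the first-order interval formula above.

```python
import math

def failure_free_durations(price_history, bid):
    """price_history: list of (timestamp_seconds, market_price) sorted by time.
    Returns the lengths (seconds) of the periods during which the market price
    stayed at or below `bid`, i.e. the observed failure-free run lengths."""
    durations, start = [], None
    for (t, price) in price_history:
        if price <= bid and start is None:
            start = t                     # a failure-free period begins
        elif price > bid and start is not None:
            durations.append(t - start)   # an out-of-bid event ends the period
            start = None
    return durations

def out_of_bid_rate(price_history, bid):
    """Exponential-fit failure rate lambda_b: for an exponential distribution the
    maximum-likelihood rate is 1 / mean(observed durations)."""
    d = failure_free_durations(price_history, bid)
    return 1.0 / (sum(d) / len(d)) if d else 0.0

def optimal_cpr_interval(delta, lam):
    """First-order optimal checkpoint interval (Young [28]): sqrt(2*delta/lambda)."""
    return float("inf") if lam == 0 else math.sqrt(2.0 * delta / lam)

# Example with hypothetical numbers: a 120 s checkpoint cost at a bid of $0.60/hour.
# history = ...  # e.g. pulled from the EC2 spot price history API
# lam = out_of_bid_rate(history, bid=0.60)
# tau = optimal_cpr_interval(delta=120.0, lam=lam)
```

The exponential fit mirrors the distributional assumption made in section 3.1; other fits could be substituted without changing the role of the interval formula.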
The interactive bidding price and the dynamic price history are used by the CPR calculator to generate the next optimal CPR interval. A composite timing model (next section) is responsible for estimating the total processing times.

The checkpoint service saves the state of the MPI application in the user's EBS (elastic block storage) volume at dynamically adjusted intervals. Any out-of-bid failure will cause the application to halt. Upon a future winning bid, the cluster orchestration rebuilds the virtual cluster to match the pre-failure configuration and the application is automatically restarted from the last checkpoint using the Open MPI restart library and the stored checkpoint.

Fig. 2  CDF of price probability per bid price for cc1.4xlarge instances.
Fig. 3  SpotHPC architecture design.
Fig. 4  Clustering workflow using HPCFY.

The current design of this framework includes the OpenMPI coordinated CPR library [15] for the checkpoint service, based on the BLCR project [31]. The cluster orchestration is managed by the HPCFY library (Fig. 4) [32]. The OpenMPI and BLCR libraries facilitate the execution of automatic CPR at optimal intervals. The orchestration service, HPCFY, facilitates the creation and management of HPC clusters using virtual cloud resources such as Amazon EC2 instances. The HPC user is assumed to have an on-demand instance on which the SpotHPC packages are installed and from which they deploy their own auction-based clusters. This process starts by requesting a set of VMs from the cloud controller, specifying a virtual machine image and type as well as the number of instances and the bid price to be used. The cloud controller allocates the requested number of spot VMs when the market price is favorable to the bid price. Subsequently, one of the nodes is selected as the head node of the virtual cluster. The user then sends the cluster configuration to the puppet master, which deploys it to the rest of the nodes. The worker nodes send back their statuses and receive any new modifications at regular intervals. Once the cluster has stabilized/converged and the configuration is uniform across the nodes, the user is able to launch parallel and distributed applications from the head node. The project is openly available and can be easily downloaded and extended. Currently, it supports popular HPC packages such as MPI and Hadoop/MapReduce with the large-scale data mining package Mahout. It also provides automatic configuration of distributed user accounts, security settings and cluster monitoring using the Ganglia project [33]. Following the launch of the application, the monitoring system keeps track of the running application and performs checkpoints and restarts until the application completes successfully.
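The request-and-wait step of this workflow maps naturally onto the EC2 spot APIs. The sketch below is not HPCFY's actual interface (which is not reproduced here); it is a minimal illustration using the AWS boto3 SDK, with the region, AMI ID, instance type, count and bid price as placeholder values.

```python
import time
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def request_spot_cluster(ami_id, instance_type, count, bid_price):
    """Request `count` spot instances at `bid_price` and wait until they are fulfilled.
    Returns the instance IDs once the market price allows the allocation."""
    resp = ec2.request_spot_instances(
        SpotPrice=str(bid_price),
        InstanceCount=count,
        LaunchSpecification={"ImageId": ami_id, "InstanceType": instance_type},
    )
    req_ids = [r["SpotInstanceRequestId"] for r in resp["SpotInstanceRequests"]]

    while True:
        desc = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=req_ids)
        requests = desc["SpotInstanceRequests"]
        # A spot request is fulfilled once it carries an InstanceId.
        if all("InstanceId" in r for r in requests):
            return [r["InstanceId"] for r in requests]
        time.sleep(30)  # poll until the market price becomes favorable to the bid

# Example with placeholder values: a 16-node cluster of cluster-compute instances.
# nodes = request_spot_cluster("ami-xxxxxxxx", "cc1.4xlarge", 16, 0.60)
# One node would then be chosen as the head node and the configuration pushed
# to the workers, as described above.
```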

4. Computational Results

4.1 Steady State Timing Model

To evaluate the expected running time, an estimate of the failure-free processing time is needed. The authors use the steady state timing model [26] to determine the required running time based on major component usage complexities. Table 2 shows the symbols used in the timing models.

Table 2  Definition of symbols and variables for modeling the runtime.
tau: interval of application-wide checkpoints
lambda_b: expected rate of out-of-bid failures corresponding to bid b
delta: time needed to create and store a checkpoint
R: time needed to read and recover a checkpoint
D: average out-of-bid downtime
T_s: estimated time needed to run the application with no checkpoints and no failures
T_seg: expected running time between checkpoints
E[T]: expected total running time
T_obs: total observed time
P: number of processing units
w: instrumented processor capacity in number of computational steps per second
beta: instrumented network capacity in bytes per second
n: problem size
k: number of iterations
T_p: parallel processing time

The general problem of assessing the processing time of a parallel application is difficult; there are too many hard-to-quantify factors. However, a steady state timing model can capture the intrinsic dependencies between the major time-consuming elements, such as computing, communication and input/output, by using the instrumented capacities w and beta. The idea is to eliminate the non-essential constant factors, so that contrasting timing models can reveal non-trivial parallel processing insights [34]. In this paper, the authors choose to study two typical algorithm classes (Table 3) for the use of spot instance computing. Timing models in general can be applied to all deterministic algorithm classes [26].

Table 3  Algorithm classes A1 and A2 (compute and communication complexities, sample application, timing model).
A1: molecular force simulation, with linear communication complexity.
A2: linear solvers, with higher communication complexity.

4.2 Evaluation of Checkpointing Overheads on Amazon EC2 Cluster Instances

A central goal of checkpointing libraries is to decrease the overhead incurred when saving the state of a parallel application. To quantify the effect of checkpointing on real machines, the authors conducted an experiment on Amazon EC2 using cluster instances of type cc2.8xlarge and the MPI-based NAS benchmark. The cc2.8xlarge instances and the benchmark used are described in Table 4.

Table 4  Checkpoint overhead experimental setup.
AMI: cc2.8xlarge, 60.5 GB memory, 88 EC2 Compute Units, 3370 GB storage, 64-bit platform, 10 Gigabit Ethernet.
Benchmark: NAS-NPB 2.3 SP, scalar-pentadiagonal kernel, non-linear PDE solver.

The experiment ran on 4 cc2.8xlarge nodes with 1 MPI task on each node, and consisted of running classes C and D of the SP NAS benchmark while observing the effect of a set of checkpoint frequencies. The goal of this experiment is to show the importance of choosing a correct checkpoint frequency.

Fig. 5  Impact of checkpoint frequency on the runtime.
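Before looking at the measurements, it helps to see why the frequency matters even without failures. The following sketch computes the failure-free slowdown implied by the simple overhead model used in this section (a checkpoint of cost delta taken once per interval); the runtime and cost values are illustrative assumptions, not the measured numbers.

```python
def checkpoint_slowdown(failure_free_runtime, interval, checkpoint_cost):
    """Failure-free slowdown factor when a checkpoint costing `checkpoint_cost`
    seconds is taken every `interval` seconds: (T_s + n_ckpt * delta) / T_s."""
    n_checkpoints = failure_free_runtime // interval
    return (failure_free_runtime + n_checkpoints * checkpoint_cost) / failure_free_runtime

# Illustrative values only: a 2-hour run with a 90-second checkpoint cost.
T_s, delta = 7200.0, 90.0
for interval in (60.0, 300.0, 900.0, 3600.0):
    print(f"interval {interval:6.0f} s -> slowdown {checkpoint_slowdown(T_s, interval, delta):.2f}x")
```

With these assumed numbers, checkpointing every minute yields a 2.5x slowdown while checkpointing hourly costs only a few percent, which matches the qualitative trend reported below.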

Fig. 5 shows the slowdown incurred by two application sizes compared to the failure-free runtimes. In both cases, increasing the checkpoint frequency increases the slowdown of the application. Although no failures occurred in this scenario, over-protective checkpointing strategies can lead to slowdowns of 2 to 3 times the failure-free runtimes. On the other hand, because failures do occur in practice, there should be a balance between the out-of-bid failure rate of the cluster nodes and the checkpoint frequency. To this effect, the authors use the model developed in section 3 to determine the optimal checkpoint interval/frequency based on an application's characteristics and the corresponding bidding strategy.

4.3 Evaluation of Bid-Aware Optimal CPR Interval

The bid-aware CPR interval is validated against non-optimal intervals. Fig. 6 visualizes the behavior of the speedup under different CPR intervals and shows the clear advantage of bid-aware optimal CPR intervals, which avoid longer completion times and higher total costs. The authors also notice that as the bid increases, the advantage of the optimal CPR interval decreases, because at higher bids frequent checkpointing is not needed as much.

Fig. 6  A1 speedup using 100 spot instances and different CPR intervals.

4.4 Bidding Price and Application Processing Time

The authors are also interested in understanding, given the price history, how a new bid would affect the total processing time. Once this is done, the authors can derive a number of other important metrics, such as speedup, efficiency, total cost, speedup per dollar, and efficiency per dollar, when deploying different numbers of processing units. In the following calculations, the authors assume that: the application uses the bid-aware optimal CPR intervals; the HPC application is run at the optimal granularity and optimal degree of parallelism, which allows the synchronization overhead to be set to zero; and the Amazon resources deliver the advertised capabilities.

The steady state timing models in Table 3 can then be plugged directly into Eq. (1), yielding Eqs. (2)-(3) for A1 and A2, respectively. Eqs. (2)-(3) capture the intrinsic dependencies between critical factors, such as the bidding price, the price history, the number of spot instances and the overall processing time. To minimize errors, program instrumentation is conducted to obtain the ranges of w and beta. Table 5 shows the value ranges used in the calculations. Results are reported in Figs. 7-8.

First, it is observed that HPC applications can indeed gain practical feasibility using spot instances under optimized CPR intervals. As indicated by Amdahl's law [35], the effect of diminishing returns is also clearly visible when the number of spot instances increases for the same algorithms. For A1 (with linear communication complexity), speedup and efficiency drop significantly when the number of spot instances exceeds 200. For A2 (with higher communication complexity), speedup and efficiency drop much earlier.
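The metric sweep described in section 4.4 can be sketched as a small simulation loop. Everything below is illustrative: the expected-runtime function is only a generic first-order stand-in for Eq. (1), the timing model and failure-rate fit are placeholder lambdas (the paper's Eqs. (2)-(3) are not reproduced here), and the bid grid, instance counts and cost accounting are assumed values.

```python
import math

def expected_runtime(T_s, lam, delta=120.0, restart=120.0, downtime=300.0):
    """Generic first-order stand-in for Eq. (1): failure-free time plus checkpoint
    overhead plus the expected cost of out-of-bid failures (recovery, downtime,
    and roughly half an interval of lost work per failure)."""
    tau = math.sqrt(2.0 * delta / lam) if lam > 0 else float("inf")
    overhead = T_s * (delta / tau) if tau != float("inf") else 0.0
    failures = lam * (T_s + overhead)
    return T_s + overhead + failures * (restart + downtime + tau / 2.0)

def sweep(T_serial, timing_model, rate_for_bid, bids, instance_counts):
    """Derive speedup, cost and speedup-per-dollar for each (bid, P) pair."""
    rows = []
    for bid in bids:
        lam = rate_for_bid(bid)      # from the price-history fit (section 3.1)
        for P in instance_counts:
            T_s = timing_model(P)    # failure-free parallel time on P instances
            T = expected_runtime(T_s, lam)
            cost = P * bid * T / 3600.0  # upper bound: Amazon actually charges the market price
            speedup = T_serial / T
            rows.append((bid, P, speedup, cost, speedup / cost))
    return rows

# Illustrative usage with placeholder inputs (not the paper's Eqs. (2)-(3)):
# timing_model = lambda P: 1e6 / P + 50.0 * math.log2(P)
# rate_for_bid = lambda bid: 1.0 / (bid * 2e5)
# table = sweep(1e6, timing_model, rate_for_bid,
#               bids=[0.5, 0.7, 0.9], instance_counts=[200, 400, 600, 800, 1000])
```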

Table 5  Critical parameters.
Number of processing units P: 200 to 1,000 instances.
Processor capacity w: measured algorithmic steps per second on cc1.4xlarge.
Network capacity beta: 250 MBps (measured).
Problem size n: 10^4 to 10^5.
Number of iterations k: 10^3 to 10^6.

Fig. 7  A1 using 200 to 1,000 spot instances with N = 100,000 to 1,000,000 iterations.
Fig. 8  A2 using 200 to 1,000 spot instances for ns = 10,000 and 1,000 iterations.

The authors also notice that, for A2, the bidding prices have a much bigger impact on speedup than for A1. The added dimension of bidding price reveals the cost effectiveness of different configurations. Although higher bids can deliver better performance, the cost effectiveness actually decreases (see the speedup-per-dollar charts). Therefore, users can use these figures to optimize for budget, for a processing deadline, or for anything in between. The non-trivial insight is the high price sensitivity of algorithms with high communication complexities. The cost effectiveness is also difficult to visualize without the proposed tools. These results provide the basis for selecting the best number of processors (spot instances) and the most promising bidding price for a given objective.

5. Conclusions

Finding the optimal bidding strategy for any application is a difficult problem. For specific applications, the proposed approach gives reasonable predictions that can guide the choice of a promising bidding strategy based on the intrinsic dependencies of critical factors. The timing model, along with the bid-aware CPR model, provides an effective tool for determining the optimal bid as well as the optimal number of processing units needed to complete a specific application.

This research paves the way for more specialized pricing models for cloud providers by giving more insight into the return on investment. For example, since the speedup gain slows down once the number of processors reaches a certain level, it makes sense to offer lower prices as volume discounts that are sensitive to the communication complexities. The new pricing models may change user behavior, which in turn would also affect the providers, eventually reaching an equilibrium in which resource utilization is maximized. Other innovative ideas are also possible. For example, self-healing applications [36] could enjoy much better cost advantages by setting bidding ranges to organize defensive rings that protect the users' core interests while maintaining the lowest cost structures.

Spot instances give the provider much freedom in dispatching resources to meet dynamic user needs. This freedom allows for high computational efficiency and fair revenue/cost generation. It also challenges the HPC community to develop highly efficient and more flexible programming methods that can automatically exploit cheaper resources on the fly.

Acknowledgment

This research is supported in part by a National Science Foundation CNS grant and by educational resource grants from Amazon.com.

References
[1] L. Youseff, R. Wolski, B. Gorda, C. Krintz, Evaluating the performance impact of Xen on MPI and process execution for HPC systems, in: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, 2006, p. 1.
[2] C. Vecchiola, S. Pandey, R. Buyya, High-performance cloud computing: A view of scientific applications, in: Proceedings of the 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN), 2009.
[3] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, D. Epema, Performance analysis of cloud computing services for many-tasks scientific computing, IEEE Transactions on Parallel and Distributed Systems 22 (6) (2011).
[4] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of virtualization, in: Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
[5] G.E. Fagg, J. Dongarra, FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world, in: Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2000.
[6] A. Agbaria, R. Friedman, Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations, in: Proceedings of the 8th International Symposium on High Performance Distributed Computing, 1999.
[7] R. Graham, S. Choi, D. Daniel, N. Desai, R. Minnich, C. Rasmussen, L. Risinger, M. Sukalski, A network-failure-tolerant message-passing system for terascale clusters, International Journal of Parallel Programming 31 (4) (2003).
[8] S. Rao, L. Alvisi, H. Vin, Egida: An extensible toolkit for low-overhead fault-tolerance, in: 29th Annual International Symposium on Fault-Tolerant Computing, Digest of Papers, IEEE, 1999.
[9] G. Stellner, CoCheck: Checkpointing and process migration for MPI, in: Proceedings of the 10th International Parallel Processing Symposium (IPPS '96), IEEE Computer Society, 1996.
[10] M. Litzkow, T. Tannenbaum, J. Basney, M. Livny, Checkpoint and migration of Unix processes in the Condor distributed processing system, Technical Report.
[11] J. Hursey, J.M. Squyres, T.I. Mattox, A. Lumsdaine, The design and implementation of checkpoint/restart process fault tolerance for Open MPI, in: Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007.
[12] A. Moody, G. Bronevetsky, K. Mohror, B.R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010.
[13] G. Zheng, L. Shi, L. Kale, FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, in: Proceedings of the 2004 IEEE International Conference on Cluster Computing, 2004.
[14] Amazon HPC Cluster Instances, 2011, available online.
[15] J. Hursey, Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems, Ph.D. Dissertation, Indiana University, Bloomington, IN, USA, July 2010.
[16] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51 (1) (2008).
[17] D. Borthakur, The Hadoop Distributed File System: Architecture and Design, 2007, available online.
[18] N. Chohan, C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, C. Krintz, See Spot Run: Using spot instances for MapReduce workflows, in: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, USENIX Association, 2010, p. 7.
[19] A. Andrzejak, D. Kondo, S. Yi, Decision model for cloud computing under SLA constraints, in: Proceedings of the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010.
[20] S. Yi, D. Kondo, A. Andrzejak, Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud, in: 2010 IEEE 3rd International Conference on Cloud Computing, 2010.
[21] W. Voorsluys, R. Buyya, Reliable provisioning of spot instances for compute-intensive applications, arXiv preprint, 2011.
[22] W. Voorsluys, S. Garg, R. Buyya, Provisioning spot market cloud resources to create cost-effective virtual clusters, in: Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing, 2011.
[23] B. Javadi, R. Buyya, Comprehensive statistical analysis and modeling of spot instances in public cloud environments, Technical Report CLOUDS-TR, The University of Melbourne.
[24] O. Ben-Yehuda, M. Ben-Yehuda, A. Schuster, D. Tsafrir, Deconstructing Amazon EC2 spot instance pricing, Technical Report CS, Technion-Israel Institute of Technology.
[25] M. Taifi, J. Shi, A. Khreishah, SpotMPI: A framework for auction-based HPC computing using Amazon spot instances, in: Proceedings of ICA3PP, 2011.
[26] J. Shi, Program scalability analysis, in: International Conference on Distributed and Parallel Processing, Georgetown University, Washington, D.C., October.
[27] J. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems 22 (3) (2006).
[28] J. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM 17 (9) (1974).
[29] Q. Zhang, E. Gurses, R. Boutaba, J. Xiao, Dynamic resource allocation for spot markets in clouds, in: Proceedings of the 11th USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services, 2011.
[30] W. Gropp, E. Lusk, Fault tolerance in MPI programs, Special issue of the International Journal of High Performance Computing Applications 18 (2002).
[31] P. Hargrove, J. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters, Journal of Physics: Conference Series 46 (2006).
[32] M. Taifi, HPCFY: Virtual HPC cluster orchestration library, available online.
[33] M. Massie, B. Chun, D. Culler, The Ganglia distributed monitoring system: Design, implementation, and experience, Parallel Computing 30 (7) (2004).
[34] K. Blathras, D. Szyld, Y. Shi, Timing models and local stopping criteria for asynchronous iterative algorithms, Journal of Parallel and Distributed Computing 58 (3) (1999).
[35] G. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67), ACM.
[36] J. Shi, M. Taifi, A. Khreishah, J. Wu, Sustainable GPU computing at scale, in: 14th IEEE International Conference on Computational Science and Engineering, 2011.


Experimental Investigation Decentralized IaaS Cloud Architecture Open Stack with CDT Experimental Investigation Decentralized IaaS Cloud Architecture Open Stack with CDT S. Gobinath, S. Saravanan PG Scholar, CSE Dept, M.Kumarasamy College of Engineering, Karur, India 1 Assistant Professor,

More information

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India 1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India Call for Papers Colossal Data Analysis and Networking has emerged as a de facto

More information

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams Neptune A Domain Specific Language for Deploying HPC Software on Cloud Platforms Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams ScienceCloud 2011 @ San Jose, CA June 8, 2011 Cloud Computing Three

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Efficient Load Balancing using VM Migration by QEMU-KVM

Efficient Load Balancing using VM Migration by QEMU-KVM International Journal of Computer Science and Telecommunications [Volume 5, Issue 8, August 2014] 49 ISSN 2047-3338 Efficient Load Balancing using VM Migration by QEMU-KVM Sharang Telkikar 1, Shreyas Talele

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 An Efficient Approach for Load Balancing in Cloud Environment Balasundaram Ananthakrishnan Abstract Cloud computing

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Experimental Study of Bidding Strategies for Scientific Workflows using AWS Spot Instances

Experimental Study of Bidding Strategies for Scientific Workflows using AWS Spot Instances Experimental Study of Bidding Strategies for Scientific Workflows using AWS Spot Instances Hao Wu, Shangping Ren Illinois Institute of Technology 10 w 31 St. Chicago, IL, 60616 hwu28,ren@iit.edu Steven

More information

Dynamic Load Balancing of Virtual Machines using QEMU-KVM

Dynamic Load Balancing of Virtual Machines using QEMU-KVM Dynamic Load Balancing of Virtual Machines using QEMU-KVM Akshay Chandak Krishnakant Jaju Technology, College of Engineering, Pune. Maharashtra, India. Akshay Kanfade Pushkar Lohiya Technology, College

More information

An On-Line Algorithm for Checkpoint Placement

An On-Line Algorithm for Checkpoint Placement An On-Line Algorithm for Checkpoint Placement Avi Ziv IBM Israel, Science and Technology Center MATAM - Advanced Technology Center Haifa 3905, Israel avi@haifa.vnat.ibm.com Jehoshua Bruck California Institute

More information

Converting A High Performance Application to an Elastic Cloud Application

Converting A High Performance Application to an Elastic Cloud Application Converting A High Performance Application to an Elastic Cloud Application Dinesh Rajan, Anthony Canino, Jesus A Izaguirre, and Douglas Thain Department of Computer Science and Engineering University of

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Migration of Virtual Machines for Better Performance in Cloud Computing Environment

Migration of Virtual Machines for Better Performance in Cloud Computing Environment Migration of Virtual Machines for Better Performance in Cloud Computing Environment J.Sreekanth 1, B.Santhosh Kumar 2 PG Scholar, Dept. of CSE, G Pulla Reddy Engineering College, Kurnool, Andhra Pradesh,

More information

Optimal Service Pricing for a Cloud Cache

Optimal Service Pricing for a Cloud Cache Optimal Service Pricing for a Cloud Cache K.SRAVANTHI Department of Computer Science & Engineering (M.Tech.) Sindura College of Engineering and Technology Ramagundam,Telangana G.LAKSHMI Asst. Professor,

More information

Building Cost-Effective Storage Clouds A Metrics-based Approach

Building Cost-Effective Storage Clouds A Metrics-based Approach Building Cost-Effective Storage Clouds A Metrics-based Approach Ning Zhang #1, Chander Kant 2 # Computer Sciences Department, University of Wisconsin Madison Madison, WI, USA 1 nzhang@cs.wisc.edu Zmanda

More information

Key words: cloud computing, cluster computing, virtualization, hypervisor, performance evaluation

Key words: cloud computing, cluster computing, virtualization, hypervisor, performance evaluation Hypervisors Performance Evaluation with Help of HPC Challenge Benchmarks Reza Bakhshayeshi; bakhshayeshi.reza@gmail.com Mohammad Kazem Akbari; akbarif@aut.ac.ir Morteza Sargolzaei Javan; msjavan@aut.ac.ir

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Ensuring Reliability and High Availability in Cloud by Employing a Fault Tolerance Enabled Load Balancing Algorithm G.Gayathri [1], N.Prabakaran [2] Department of Computer

More information

On-Demand Supercomputing Multiplies the Possibilities

On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server 2003 Partner Solution Brief Image courtesy of Wolfram Research, Inc. On-Demand Supercomputing Multiplies the Possibilities Microsoft Windows Compute Cluster Server

More information

Performance Isolation of a Misbehaving Virtual Machine with Xen, VMware and Solaris Containers

Performance Isolation of a Misbehaving Virtual Machine with Xen, VMware and Solaris Containers Performance Isolation of a Misbehaving Virtual Machine with Xen, VMware and Solaris Containers Todd Deshane, Demetrios Dimatos, Gary Hamilton, Madhujith Hapuarachchi, Wenjin Hu, Michael McCabe, Jeanna

More information

AMAZING: An Optimal Bidding Strategy for Amazon EC2 Cloud Spot Instance

AMAZING: An Optimal Bidding Strategy for Amazon EC2 Cloud Spot Instance : An Optimal Bidding Strategy for Amazon EC2 Cloud Spot Instance ShaoJie Tang, Jing Yuan, Xiang-Yang Li Department of Computer Science, Illinois Institute of Technology, Chicago, IL 666 Department of Computer

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications

How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications Amazon Cloud Performance Compared David Adams Amazon EC2 performance comparison How does EC2 compare to traditional supercomputer for scientific applications? "Performance Analysis of High Performance

More information

SR-IOV: Performance Benefits for Virtualized Interconnects!

SR-IOV: Performance Benefits for Virtualized Interconnects! SR-IOV: Performance Benefits for Virtualized Interconnects! Glenn K. Lockwood! Mahidhar Tatineni! Rick Wagner!! July 15, XSEDE14, Atlanta! Background! High Performance Computing (HPC) reaching beyond traditional

More information

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida Amazon Web Services Primer William Strickland COP 6938 Fall 2012 University of Central Florida AWS Overview Amazon Web Services (AWS) is a collection of varying remote computing provided by Amazon.com.

More information

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang

An Efficient Hybrid P2P MMOG Cloud Architecture for Dynamic Load Management. Ginhung Wang, Kuochen Wang 1 An Efficient Hybrid MMOG Cloud Architecture for Dynamic Load Management Ginhung Wang, Kuochen Wang Abstract- In recent years, massively multiplayer online games (MMOGs) become more and more popular.

More information

VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing

VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing Journal of Information & Computational Science 9: 5 (2012) 1273 1280 Available at http://www.joics.com VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing Yuan

More information

A Chromium Based Viewer for CUMULVS

A Chromium Based Viewer for CUMULVS A Chromium Based Viewer for CUMULVS Submitted to PDPTA 06 Dan Bennett Corresponding Author Department of Mathematics and Computer Science Edinboro University of PA Edinboro, Pennsylvania 16444 Phone: (814)

More information

Building Platform as a Service for Scientific Applications

Building Platform as a Service for Scientific Applications Building Platform as a Service for Scientific Applications Moustafa AbdelBaky moustafa@cac.rutgers.edu Rutgers Discovery Informa=cs Ins=tute (RDI 2 ) The NSF Cloud and Autonomic Compu=ng Center Department

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Statistical Modeling of Spot Instance Prices in Public Cloud Environments

Statistical Modeling of Spot Instance Prices in Public Cloud Environments 2011 Fourth IEEE International Conference on Utility and Cloud Computing Statistical Modeling of Spot Instance Prices in Public Cloud Environments Bahman Javadi, Ruppa K. Thulasiram, and Rajkumar Buyya

More information

REM-Rocks: A Runtime Environment Migration Scheme for Rocks based Linux HPC Clusters

REM-Rocks: A Runtime Environment Migration Scheme for Rocks based Linux HPC Clusters REM-Rocks: A Runtime Environment Migration Scheme for Rocks based Linux HPC Clusters Tong Liu, Saeed Iqbal, Yung-Chin Fang, Onur Celebioglu, Victor Masheyakhi and Reza Rooholamini Dell Inc. {Tong_Liu,

More information

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System By Jake Cornelius Senior Vice President of Products Pentaho June 1, 2012 Pentaho Delivers High-Performance

More information

Cloud Computing and E-Commerce

Cloud Computing and E-Commerce Cloud Computing and E-Commerce Cloud Computing turns Computing Power into a Virtual Good for E-Commerrce is Implementation Partner of 4FriendsOnly.com Internet Technologies AG VirtualGoods, Koblenz, September

More information

Efficient Cloud Management for Parallel Data Processing In Private Cloud

Efficient Cloud Management for Parallel Data Processing In Private Cloud 2012 International Conference on Information and Network Technology (ICINT 2012) IPCSIT vol. 37 (2012) (2012) IACSIT Press, Singapore Efficient Cloud Management for Parallel Data Processing In Private

More information