Resource Management for Scientific Application in Hybrid Cloud Computing Environments. Simon Ostermann

Resource Management for Scientific Application in Hybrid Cloud Computing Environments

Dissertation by Simon Ostermann, submitted to the Faculty of Mathematics, Computer Science and Physics of the University of Innsbruck in partial fulfillment of the requirements for the degree of Doctor of Science.

Advisor: Assoz.-Prof. Priv.-Doz. Dr. Radu Prodan, Institute of Computer Science

Innsbruck, 17 April 2012


Certificate of authorship/originality

I certify that the work in this thesis has not previously been submitted for a degree, nor has it been submitted as part of the requirements for a degree, except as fully acknowledged within the text. I also certify that the thesis has been written by me. Any help that I have received in my research work and in the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Simon Ostermann, Innsbruck, 17 April 2012


Abstract

Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing hardware. Nevertheless, the potential of using Cloud computing infrastructure to support computational and data-intensive scientific applications has not yet been sufficiently addressed. This thesis closes this gap by researching an architecture and techniques for performance- and cost-efficient execution of scientific applications on Cloud computing infrastructures, organized in five chapters.

First, we investigate the suitability of the workflow paradigm for programming scientific applications, following from their success on related distributed computing infrastructures such as computational Grids. We present case studies for modeling two applications from the astrophysics field as scientific workflow applications to be run with improved performance on multiple leased Cloud resources. We further analyze the workflow traces collected over the last three years of research in the Austrian Grid and classify them according to their different structural and performance characteristics for later evaluation purposes.

Second, we investigate the problem of provisioning and management of Cloud resources for large-scale scientific workflows that do not find sufficient free Grid resources for their computational requirements. For this purpose, we propose an extended architecture comprising new services that allow using Cloud resources in an integrated manner with minimal application interface changes: resource management for virtualized hardware, software catalogues for machine images, and integrated security and authentication features. The evaluation of the proposed architecture indicates that using Cloud resources for scientific applications is a viable choice, and that execution times can be significantly reduced by acquiring additional on-demand Cloud resources.

Third, there is currently a lack of models to understand the performance offered by existing Cloud computing infrastructures, required for scheduling and running scientific workflow applications. We perform an empirical evaluation of the performance of four commercial Cloud computing services including Amazon EC2, using different benchmarks for single resource instances and virtual clusters. We compare the performance characteristics and cost models of Clouds to other scientific computing platforms such as production parallel computers and computational Grids. The results prove that certain resource types offered by Cloud providers have high potential for speeding up the execution of loosely coupled parallel applications such as scientific workflows, especially for short deadlines.

Fourth, to address the lack of scalable simulators to support Cloud computing research, we developed GroudSim, a Grid and Cloud simulation toolkit for scientific computing based on a scalable, simulation-independent discrete-event engine. GroudSim provides a comprehensive set of features for complex simulation scenarios, from simple job executions on leased computing resources to file transfers, calculation of costs, and background load on resources. We illustrate real scenarios of using this simulation toolkit to accelerate the evaluation of various optimized resource provisioning techniques by a factor of 700 compared to real execution, with no resource cost expenses.

Finally, we address the problem of dynamic provisioning of Cloud resources to large-scale scientific workflows with respect to four important aspects: (1) when to extend the Grid infrastructure with Cloud resources, (2) the amount of Cloud resources to be provisioned, (3) when to move computation from the Cloud back to the Grid, and (4) when to release Cloud resources if no longer necessary. Then, we address the NP-complete problem of scheduling scientific workflows on heterogeneous Cloud resources by proposing an extension to the dynamic critical path scheduling algorithm for dealing with the general resource leasing model encountered in today's commercial Clouds. We analyze the availability of the cheaper and unreliable Spot instances and study their potential to complement the unavailability of Grid resources for large workflow executions. Experimental results demonstrate that Spot instances represent a 60% cheaper but equally reliable alternative to Standard instances, provided that a correct user bid is made.

Acknowledgements

I thank my advisor, Professor Radu Prodan, for the invaluable support and guidance throughout my Ph.D. years. His advice and example left a strong positive imprint on my formation as a researcher. I thank Professor Thomas Fahringer for the opportunity of working in his research group at the University of Innsbruck and for his continued support and confidence in my abilities. I thank the present and past members of the Distributed and Parallel Systems group, especially Kassian Plankensteiner, Vlad Nae and Simone Pellegrini, who helped me in my research efforts. I thank my bachelor and master students Daniel Bodner, Georg Kraler, Christian Hollaus and Markus Brejla for implementing tools needed to fulfill these research ideas, and Alexandru Iosup for our pleasant and fruitful collaboration. I thank my family, Renate, Gerhard, Felix, Maria, Regina and Theres, for offering support and creating an environment which enabled and often motivated my studies, and for the weekend lunches, which were always one of my week's culinary and social highlights. Finally, I would like to express my strong bond to Tyrol and its beautiful mountains. The time spent hiking and climbing here helped me regenerate, and conquering Tyrol's unique mountain tops gave me the strength to get through the most demanding times of my work.


Contents

1 Introduction
   1.1 Motivation (Scientific Workflows, Resource Management, Cloud Performance, Simulation, Resource Provisioning and Scheduling)
   1.2 Goals (Scientific Workflows, Resource Management, Cloud Performance, Simulation, Resource Provisioning and Scheduling, Summary)
   1.3 Outline

2 Model
   2.1 Workflows (Activities, Structures)
   2.2 Grid Computing (Austrian Grid, Globus Toolkit)
   2.3 Cloud Computing (Scientific View, Market View, Virtualization, Cloud Types, Amazon Elastic Compute Cloud, Eucalyptus)
   2.4 ASKALON (Execution Engine, Scheduling, Resource Management, Performance Prediction, Performance Analysis)
   2.5 Summary

3 Scientific Workflows
   Design and Analysis, Montage (Design, Evaluation), Grasil (Design, Evaluation), Wien2k, Invmod, MeteoAG, Workflow Characteristics, Workflow-based Grid Workload Analysis (Results), Related Work, Future Work, Summary

4 Architecture
   Cloud Computing Survey, Taxonomy (Service Type, Resource Deployment, Hardware, Runtime Tuning, Security, Business Model, Middleware, Performance), Resource Management Architecture (ASKALON Resource Management, Cloud Resource Management, Image Catalogue, Security), Cloud-based Workflow Execution, Related Work, Conclusions and Future Work, Summary

5 Cloud Performance Analysis
   Introduction, Benchmark Design, Many Task Computing (Method and Experimental Setup, Results), Cloud Performance Evaluation (Method, Experimental Setup, Results), Clouds versus other Infrastructures (Method, Experimental Setup, Results), Related Work, Conclusions and Future Work, Summary

6 Simulation Toolkit
   GroudSim (Discrete-event Simulation, Entities, Jobs, File Transfer, Cost, Tracing, Probability Distributions, Failures, Background Load, Evaluation), ASKALON Integration (Evaluation), Simulation Times (Sequential Grid Sites, Parallel Grid Sites, Simulation versus Execution), Related Work, Conclusions, Summary

7 Resource Provisioning and Scheduling
   Optimized Cloud Provisioning (Cloud Start, Instance Size, Grid Rescheduling, Cloud Stop), Provisioning Evaluation, Scheduling, Resource Model, Dynamic Critical Path Algorithm, Spot Price Analysis, Dynamic Critical Path for Clouds (DCP-C Algorithm, Rescheduling, Cloud Choice, Prescheduling), DCP-C Evaluation (Wien2k, Invmod), Related Work, Conclusion, Summary

8 Conclusions
   Scientific Workflows, Resource Management, Cloud Performance, Simulation Toolkit, Resource Provisioning and Scheduling

List of Figures
List of Tables
Bibliography


Chapter 1

Introduction

1.1 Motivation

Scientific computing requires an ever-increasing number of resources to deliver results for growing problem sizes in a reasonable timeframe. A few years ago, supercomputers were the only way to get enough computational power for such compute-intensive tasks. In the last decade, while the largest research projects were able to afford expensive supercomputers, other projects were forced to opt for cheaper resources such as commodity clusters or the more modern and challenging computational Grids [52]. While aggregating a potentially unbounded number of computational resources to serve highly demanding scientific applications, computational Grids suffer from serious problems related to reliability, fulfillment of Quality of Service (QoS) guarantees, and automation of software deployment and installation processes, which makes their use rather tedious and accessible only to specialized computing experts. Moreover, while an enormous amount of funding has been invested by national and international agencies to build large-scale computational Grids, operational costs and ultimately hardware deprecation are significant barriers to their daily and long-term maintenance. Today, a new research direction coined by the term Cloud computing proposes an alternative by which resources are no longer hosted by the researcher's computational facilities, but leased from large specialized data centers only when needed. Compared to traditional parallel and distributed environments such as Grids, computational Clouds present at least four advantages that make them attractive to computational scientists. First, Clouds promote the concept of leasing remote resources rather than buying own hardware, which frees institutions from permanent maintenance costs and eliminates the

burden of hardware deprecation following Moore's law. Second, Clouds eliminate the physical overhead cost of adding new hardware such as compute nodes to clusters or supercomputers, and the financial burden of permanently over-provisioning occasionally needed resources. Through a new concept of scaling-by-credit-card, Clouds promise to immediately scale an infrastructure up or down according to temporal needs in a cost-effective fashion. Third, the concept of hardware virtualization can represent a significant breakthrough in enhancing resource utilization. Additionally, the automatic and scalable deployment of complex scientific software, which today remains a tedious and manual process requiring the intervention of skillful computer scientists, can be simplified when virtualization is used. Fourth, the provisioning of resources through business relationships binds specialized data center companies to offering a certain degree of QoS encapsulated in Service Level Agreements (SLA), which significantly increases the reliability and the fulfillment of user expectations. Despite the existence of many vendors that, similar to Grid computing, aggregate a potentially unbounded number of compute resources, Cloud computing remains a domain dominated by business applications (e.g. Web hosting, database servers) whose suitability for scientific computing remains largely unexplored. The way resources are offered and advertised by the Cloud providers opens several questions about hardware, software and performance in general. As only little research has been done in this area, the use of Clouds for scientific computing is the major contribution of this thesis to the field of computer science.

1.1.1 Scientific Workflows

An important class of applications, which has been largely ignored so far when dealing with effective parallelization for Cloud resources, are workflow applications [90]. Workflows have a strong impact on application development in industry [54], commerce [63] and science [149] on desktop, server and parallel computing infrastructures, accelerating and simplifying programming by allowing programmers to focus on the composition of existing legacy programs to create larger and more powerful applications. Workflows have emerged as an easier way to formalize and structure data analysis, to execute the necessary computations on computing resources, to collect information about the derived results and, if necessary, to repeat the analysis.

Researchers across many disciplines such as life sciences, physics, astronomy, ecology, meteorology, neuroscience or computational chemistry create and use ever-increasing amounts of often highly complex data, and rely more and more on computationally intensive modeling, simulations and statistical analysis. Scientific workflows [149] have become a key paradigm for managing such complex tasks and have emerged as a unifying mechanism for handling scientific data. Similarly, industry and commerce have been using workflow technology for a long time to describe and manage business and industrial processes, and to define flows of work and data that have high business value to companies [63]. In short, workflows encapsulate the essence of describing scientific, industrial, and business user expertise through logical flows of data and work, modeled and described in the form of workflow activities, which are mapped onto a concrete computing infrastructure with the goal of managing the data processing in an automated and scalable way. With so many driving forces at work, it is clear that workflows are here to stay and will play a major role in the future Information Technology (IT) strategies of business and scientific organizations, both large and small. A large variety of workflows has been created and is being used in production in numerous areas of science [149], industry [54] and business [63]. Many of these workflows are highly data- and/or computation-intensive, presenting great potential for taking advantage of today's Cloud computing resources. Although a plethora of techniques and tools exists to manage and execute workflows on sequential and distributed computing infrastructures such as Grids [149], workflow applications have not yet entered the domain of Cloud computing and lack effective tools and support for programming, parallelization and optimization on Cloud infrastructures. These observations result in the following motivational question we are trying to answer with our research in the area of scientific workflows: are scientific workflow applications well suited for execution on Cloud environments?

1.1.2 Resource Management

In the last decade, Grid computing gained high popularity in the field of scientific computing through the idea of distributed resource sharing among institutions and scientists. Scientific computing is traditionally a high-utilization workload, with production

Grids often running at over 80% utilization [66] (generating high and often unpredictable latencies), and with smaller national Grids offering a rather limited amount of high-performance resources. Running large-scale simulations in such overloaded Grid environments often becomes latency-bound or suffers from well-known Grid reliability problems [33]. Despite the existence of several integrated environments for transparent programming and high-performance use of Grid infrastructures for scientific applications [176], there are still no results published in the community that report on extending them to enjoy the benefits offered by Cloud computing. While there are several early efforts that investigate the appropriateness of Clouds for scientific computing, they are either limited to simulations [36] or do not address the highly successful workflow paradigm [10], and none attempts to extend Grids with Clouds into a hybrid combined platform for scientific computing. Additional open questions have to be answered in the field of resource management and provisioning. Grid resources offer monitoring services that allow resource managers to keep an overview of the current system state in a transparent way. Clouds, on the other hand, do not offer a standardized representation of their available hardware and billing models, which makes it difficult to automatically parse and compare their resources. The existing providers only offer commercial information, mostly distributed over multiple Web pages, which is time-consuming to gather. To handle this problem, a resource manager is needed that can handle the information about the offered Cloud hardware in a unified and centralized manner. Clouds also have advantages compared to Grid hardware. For example, the local resource management systems used on Grid systems to let multiple users share the resources sometimes cause large and unpredictable queueing overheads. Clouds, on the other hand, do not target the resource-sharing use case and eliminate this queueing overhead by not employing a batch resource management system by default. Instead, Cloud systems introduce a provisioning overhead, the time spent on processing a resource request and making the resource available to the user, which does not exist in Grid systems. The different advantages and disadvantages of Grid and Cloud systems could be combined into a hybrid system that uses both types of resources while trying to eliminate as much overhead as possible.
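A hedged sketch of this trade-off in Java: lease a Cloud instance only if the expected Grid queue wait exceeds the expected Cloud provisioning overhead by some margin. The estimates, the margin and the method names are placeholders invented for this illustration; obtaining good estimates is exactly the hard part addressed later in this thesis.

```java
/** Illustrative decision rule for a hybrid Grid/Cloud setup: prefer the free Grid,
 *  fall back to the Cloud when queueing is expected to dominate. All values are placeholders. */
class HybridProvisioningDecision {

    /** Decide whether to provision a Cloud instance for a waiting task. */
    static boolean useCloud(double estimatedGridQueueWaitSec,
                            double estimatedCloudProvisioningSec,
                            double safetyMarginSec) {
        return estimatedGridQueueWaitSec > estimatedCloudProvisioningSec + safetyMarginSec;
    }

    public static void main(String[] args) {
        double gridWait = 1800;   // e.g. 30 min predicted queue wait on the Grid
        double cloudBoot = 120;   // e.g. 2 min to provision and boot a Cloud instance
        double margin = 300;      // require a clear benefit before paying for the Cloud

        System.out.println(useCloud(gridWait, cloudBoot, margin)
                ? "Provision a Cloud instance"
                : "Wait for a Grid resource");
    }
}
```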

When no Grid resources are available, the provisioning overhead of Clouds might take less time than waiting for Grid resources to become available. We ask the research question: can the resource pool of a Grid be extended with Cloud computing resources, and does a scientist benefit from using this hybrid approach?

1.1.3 Cloud Performance

When talking about scientific usage of commercial Clouds, the expenses required for their usage are an important factor. Typically, existing Infrastructure as a Service (IaaS)-based Cloud providers offer classes of resources described in fuzzy terms, whose suitability for scientific applications is unclear. For example, Amazon Elastic Compute Cloud (EC2) advertises the so-called Elastic Compute Unit (ECU) of its resources as the equivalent of a gigahertz 2007 Opteron processor, and the I/O performance of the associated storage systems with fuzzy medium and high values. On the other hand, the virtual machines to be executed on IaaS-based Clouds are often cross-compiled on local, compatible architectures (e.g. an AMD Opteron image executed on a Xeon processor) and may exhibit significant losses in performance when executed on unknown Cloud resources. This uncertainty, combined with the fuzzy terms in which the Cloud resources are described, makes the execution of applications a black-box approach that is hard to understand and predict, requiring different models than the ones used for parallel computers or computational Grids. Although a few benchmarking efforts have been conducted to evaluate the raw performance of existing academic and commercial Clouds, there is a lack of performance models to support schedulers in taking optimal mapping decisions. Without such knowledge, the risk of spending money without gaining performance is considerably high. In our research we want to take the cost and performance metrics into account and show more precise results compared to related work, which either does not take these metrics into account or uses billing intervals that are not offered by any Cloud provider right now. The majority of Cloud providers only offer hourly billing, and simulating an environment with billing based on the seconds used, as done in related work, oversimplifies the billing model of Clouds.
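To illustrate why the billing interval matters, the following sketch contrasts an idealized per-second cost with the hour-rounded cost charged under hourly billing; the hourly price is an assumed value, not a quote from any provider.

```java
/** Hour-based billing rounds every started hour up, so short or oddly sized runs
 *  can cost noticeably more than a per-second model would suggest. Price is illustrative. */
class BillingModel {

    static double perSecondCost(long runtimeSeconds, double pricePerHour) {
        return runtimeSeconds / 3600.0 * pricePerHour;          // idealized model used by some simulators
    }

    static double hourlyCost(long runtimeSeconds, double pricePerHour) {
        long billedHours = (runtimeSeconds + 3599) / 3600;      // every started hour is charged in full
        return billedHours * pricePerHour;
    }

    public static void main(String[] args) {
        double pricePerHour = 0.40;                              // assumed instance price
        long runtime = 3 * 3600 + 600;                           // 3 h 10 min of computation

        System.out.printf("per-second model: $%.2f%n", perSecondCost(runtime, pricePerHour));
        System.out.printf("hourly billing:   $%.2f%n", hourlyCost(runtime, pricePerHour));
    }
}
```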

20 on the Cloud is: is the performance delivered by the different Cloud providers suitable for scientific computing and what do we have to pay for it? Simulation Commercial Cloud resources are not offered for free to customers or scientists when needed in a larger amount. These costs result in problems when trying to evaluate scientific approaches on this resource type as such an evaluation might need several thousand hours of excessive usage. To handle this problem and to speed up the evaluation process, we investigated the available simulation frameworks that support Grid and Cloud environments. ASKALON [47] is a software middleware that eases the use of distributed Grid and Cloud resources by providing high-level abstractions for programming complex scientific workflow applications through a textual XML representation or a graphical UML-based diagram. Beside this, different middleware services support the user with sophisticated mechanisms for transparent scheduling and execution of the applications on the underlying hardware resources. Besides execution of applications in real Grid and Cloud environments as supported by ASKALON, simulation is an alternative technique to analyze a real-world model, which has several important advantages: it delivers fast results, it allows for reproducibility, it saves costs, and it enables the investigation of scenarios that cannot be easily reproduced in reality. The number of experiments can significantly be increased when using simulation and the resources used can be more flexibly specified. To support such simulations, there is the need for a simulation framework that supports Grid and Cloud resources and workflow applications. To let researchers benefit the most from the simulation system, an integration into the existing execution framework would be a desired solution. This leads to the research question in the field of simulation, which we try to answer in this thesis: is it possible to integrate a simulation framework into an execution framework to allow seamless simulation and execution? 6
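To make the discrete-event approach behind such a simulation framework concrete, the following sketch shows a minimal event loop in Java. It illustrates the general principle only and does not reflect GroudSim's actual API; the class and event names are chosen for this example.

```java
import java.util.PriorityQueue;

/** Minimal discrete-event loop: virtual time jumps from event to event,
 *  so simulating a long-running job costs only a few method calls. */
class MiniSimulator {
    /** A simulation event scheduled at a point in virtual time. */
    static class Event implements Comparable<Event> {
        final double time;          // virtual time stamp in seconds
        final Runnable action;      // what happens when the event fires
        Event(double time, Runnable action) { this.time = time; this.action = action; }
        public int compareTo(Event other) { return Double.compare(time, other.time); }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();
    private double now = 0.0;       // current virtual time

    double now() { return now; }

    void schedule(double delay, Runnable action) {
        queue.add(new Event(now + delay, action));
    }

    /** Process events in time order until none are left. */
    void run() {
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            now = e.time;           // advance virtual time instantly
            e.action.run();
        }
    }

    public static void main(String[] args) {
        MiniSimulator sim = new MiniSimulator();
        // A "job" that would take 2 hours of real time finishes immediately in simulation.
        sim.schedule(7200, () -> System.out.println("job finished at t=" + sim.now()));
        sim.run();
    }
}
```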

1.1.5 Resource Provisioning and Scheduling

The scientific community is highly interested in the field of Cloud computing, characterized by leasing of computation, storage, message queues, databases, and other raw resources from specialized providers under certain Quality of Service and Service Level Agreements (usually a certain resource uptime for a certain price). Extending Grid infrastructures with on-demand Cloud resources appears to be a promising way to improve executions of scientific applications that do not have sufficient Grid resources available for their computational demand. However, Cloud resources have different characteristics than Grid resources, which have to be taken into account when deciding about their provisioning. In the worst case, when the wrong decisions are taken, the application execution might take longer with Cloud resources, which has to be avoided. Scheduling of scientific workflow applications to heterogeneous resources is known to be an NP-complete problem, meaning that an optimal mapping in terms of execution time cannot be calculated in polynomial time. For each part of the workflow, a resource must be chosen for its execution. We require a heuristic scheduling method that produces mappings close to the optimum in a reasonable time. By restricting the general scheduling problem with additional constraints that represent our use case better, the problem can be simplified, which reduces the complexity. Therefore we need an algorithm that performs well when resource availability changes frequently, as is the case when additional Cloud resources can be started and stopped at any time. Besides the Standard instances rented at a fixed price per hour, Amazon offers the possibility to bid on unused resources, called Spot instances, and rent them at variable prices with no reliability guaranteed. The user demand influences and determines the Spot instance market price, which is in most cases lower than the Standard one. However, when this price gets higher than the user's bid, the access is terminated and the resources are claimed back by Amazon. Therefore, the availability of such Spot instances is limited, and they deliver a cheaper but possibly unreliable environment. We investigate the usage of Cloud resources and propose optimizations to raise efficiency and therefore lower the overall cost that has to be spent on resources. We extend an existing scheduling algorithm to handle the new resource type and add several optimizations to lower the overall resource cost.
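As a rough illustration of the Spot instance trade-off described above, the sketch below compares the cost of an on-demand (Standard) instance with a Spot instance whose hourly price varies; when the market price exceeds the user's bid, the instance is terminated and the remaining work is lost. All prices and the price trace are invented for the example and do not correspond to actual Amazon rates or billing details.

```java
/** Toy comparison of Standard vs. Spot instance cost for a job of fixed length.
 *  Prices and the hourly Spot price trace are purely illustrative. */
class SpotVsStandard {
    public static void main(String[] args) {
        double standardPricePerHour = 0.40;                         // assumed on-demand price
        double[] spotPricePerHour = {0.12, 0.15, 0.45, 0.14, 0.13}; // assumed market trace
        double userBid = 0.20;                                      // maximum price the user is willing to pay
        int jobHours = 5;

        double standardCost = jobHours * standardPricePerHour;

        double spotCost = 0.0;
        int completedHours = 0;
        for (double price : spotPricePerHour) {
            if (price > userBid) {                                  // out-bid: instance is terminated
                System.out.println("Spot instance terminated after " + completedHours + " h");
                break;
            }
            spotCost += price;                                      // Spot hours are billed at the market price
            completedHours++;
        }

        System.out.printf("Standard: %d h for $%.2f%n", jobHours, standardCost);
        System.out.printf("Spot:     %d of %d h done for $%.2f%n", completedHours, jobHours, spotCost);
    }
}
```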

The motivation in terms of provisioning and scheduling can be summarized in the short question: how can we reduce the workflow execution time and minimize the overall resource cost in a hybrid Grid and Cloud execution environment?

1.2 Goals

We define the following goals matching the motivational thoughts that led to this thesis.

1.2.1 Scientific Workflows

We want to fulfill two goals in the field of scientific workflows. First, we want to take sequential scientific applications and decompose them into smaller application parts, which can then be put together into a workflow-shaped application. This transformation allows efficient execution of the workflow on distributed systems such as Grids and Clouds. We have an existing set of workflow applications available that can be reused for the planned research, but additional scientific applications will help to prove the usability of the workflow paradigm for scientific applications. Second, we aim to analyze execution logs of existing workflow applications in the Austrian Grid and to observe the typical execution parameters and workflow sizes. Based on this study, we will be able to select a set of reference workflows and sizes to be used for our later evaluations.

1.2.2 Resource Management

We see a high potential in combining these two resource types to increase performance for workflow executions. There is a need for an infrastructure that allows the execution of workflows on conventional Grid resources which can be supplemented on demand with additional Cloud resources, if necessary. Details about the extensions to the resource management service to consider Cloud resources, comprising new Cloud management, software (image) deployment, and security components, are of particular interest. We plan to seamlessly integrate the Cloud resource class into the existing scientific Grid workflow environment available to the scientists of the distributed and parallel computing group at the University of Innsbruck. The changes should not affect the

end-user besides the need to store his access credentials for the Cloud in our system if he plans to use this resource class. The credential management is required to allow accounting based on the personal account with a Cloud provider. Executions will be possible on Grid resources as before, but if results are needed faster and the required login credentials are provided, additional Cloud resources can be used to speed up executions. Experimental results using a real-world application in the Austrian Grid environment, extended with our own academic Cloud constructed as a private Cloud with virtualization, will be used to verify the usefulness of the presented integration.

1.2.3 Cloud Performance

Cloud providers sell their offers mostly as black boxes to the end user. Performance values are hard to find and no guarantees are given about the speed, but mostly only about reachability and availability. We will analyze the resources that can be rented from four different Cloud providers and will investigate the raw performance that is available to the end user. Using well-known micro-benchmarks and application kernels, we evaluate the performance of four commercial Cloud computing services that can be used for scientific computing, among which the Amazon Elastic Compute Cloud (EC2), the largest commercial computing Cloud in production. We compare the performance of Clouds with scientific computing alternatives such as Grids and parallel production infrastructures. Our comparison uses trace-based simulation and the empirical performance results from our Cloud performance evaluation.

1.2.4 Simulation

Execution of scientific workflows is time-consuming and resource-intensive. When optimizing such a workflow execution, several hundreds of executions are needed to evaluate the optimization impact, and more runs are needed to verify that there are no side effects on the system. To allow a faster evaluation process, we plan to develop an event-based simulation framework in Java that provides improved scalability compared to other related approaches. Simulation reduces the workflow execution time significantly by reducing the real execution time of tasks to zero. The resulting

system allows running several hundreds of workflows within minutes without needing available hardware from the Grid or Cloud. This development will be useful to all researchers working with our execution environment and will dramatically speed up development and debugging processes, which should result in better validation of the proposed and developed features. Experimental setups will be more flexible in a simulated environment, and error handling can be analyzed better than when using real environments, where it is not always possible to trigger failures when needed to test fault-tolerance aspects of new developments.

1.2.5 Resource Provisioning and Scheduling

When Grid resources, which are mostly available for free to scientists in Austria, are extended with Cloud resources that are charged on an hourly basis, the additional cost is an important factor that needs to be considered. We plan to develop a scheduling mechanism that uses additional paid resources only when they have a positive effect on the overall execution time and price ratio. We study the problem of dynamically provisioning additional Cloud resources to large-scale scientific workflows running in Grid infrastructures with respect to four important aspects: (1) Cloud start, representing when it is sensible to extend the Grid infrastructure with Cloud resources; (2) instance size, quantifying the amount of Cloud resources that shall be provisioned; (3) Grid rescheduling, indicating when to move computation from the Cloud back to the Grid if new fast resources become available; and (4) Cloud stop, meaning when it is sensible to release Cloud resources if no longer necessary, considering their hourly payment interval. We analyze the impact of these four aspects with respect to the overall execution time, overall cost, as well as cost per unit of saved time for using Cloud resources. We plan to research a scheduling algorithm that is potentially well suited for the new challenges that arise from using Cloud resources. Once a good candidate is selected, we plan to implement and extend this algorithm within the existing workflow environment with advanced provisioning optimizations. As workflow scheduling is an NP-complete problem, a scheduler with reasonable complexity, that can schedule a workflow within seconds or minutes, will not give optimal results. We aim for a heuristic that is well suited for the workflow applications running on hybrid Grid and Cloud resources.
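A minimal sketch of the "Cloud stop" idea mentioned above: because a started hour is paid in full, an idle instance is released only shortly before the next full billing hour; otherwise it is kept available for upcoming tasks. The shutdown margin is an assumed tuning parameter invented for this illustration, not a value prescribed by the thesis.

```java
/** Illustrative "Cloud stop" rule under hourly billing: keep an idle instance until
 *  shortly before the next paid hour ends, since releasing earlier saves no money. */
class CloudStopRule {

    /** @param uptimeSeconds     how long the instance has been running
     *  @param shutdownMarginSec time reserved for a clean shutdown before the hour boundary */
    static boolean shouldRelease(boolean idle, long uptimeSeconds, long shutdownMarginSec) {
        long secondsIntoCurrentHour = uptimeSeconds % 3600;
        long secondsUntilNextCharge = 3600 - secondsIntoCurrentHour;
        return idle && secondsUntilNextCharge <= shutdownMarginSec;
    }

    public static void main(String[] args) {
        System.out.println(shouldRelease(true, 3500, 120));  // idle, 100 s before next charge -> true
        System.out.println(shouldRelease(true, 1800, 120));  // idle, but half an hour already paid -> false
    }
}
```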

1.2.6 Summary

These five goals aim to research a workflow system that supports Cloud resources and Grid environments. Simulation of executions is one of the planned features that will allow fast and extensive evaluation of developments in the workflow system. Using this simulator will help evaluating all further goals. Workflow executions can benefit from a sophisticated scheduling mechanism that optimizes the resource mapping for execution time and resource cost. The end user wants to decrease the execution time of his applications with minimal economic cost for leased resources. The benchmark analysis of the different Cloud providers will help to understand how much performance a user gets for his investment from the different Cloud providers. This knowledge can significantly improve the resource selection process when using Cloud resources in scientific workflows. The overall goal is a seamless integration of the new Cloud resources into an existing Grid workflow environment. The optimization needed to make this non-trivial combination a useful alternative to existing Grid-only executions is one of the key research goals, even though it is not directly visible to end users. Therefore the advantages need to be proven with a detailed evaluation of each optimization.

1.3 Outline

In Chapter 2 we introduce the model for the research presented in this thesis. The important terms and technologies are defined and explained. We continue in Chapter 3 with the introduction and development of the scientific workflows we used for our evaluation process in this thesis. The chapter continues with the analysis of workflow characteristics collected from historical execution traces in the Austrian Grid. Chapter 4 starts with a classification of Infrastructure as a Service Cloud providers according to eight criteria derived from a market analysis. This information is then used to explain the technical integration of this resource class into the ASKALON system. A detailed performance analysis of four Cloud providers is presented in Chapter 5. Different benchmarks are executed and evaluated, combined with a comparison of workload traces from different cluster systems. In Chapter 6, we introduce a combined Grid and Cloud simulator, which is more

scalable than other available simulators. The simulator is integrated into the ASKALON system to allow simulations from within the regular execution environment. We develop and evaluate four provisioning optimizations in the first half of Chapter 7. In the second part we present the extension of a scheduling algorithm that is optimized for Cloud resources with variable pricing. Chapter 8 concludes the thesis.

Chapter 2

Model

This chapter defines the terminology needed for understanding the presented topics. The connection between the components is explained in detail, giving a model of the used environment and the environmental conditions these studies rely on. Key topics of this chapter are workflows, the hardware and software used for workflow execution, and the environment for the experimental evaluation. General terminology is defined for the scope of this work to clarify the scientific context this work relies on.

2.1 Workflows

This section introduces workflows in general and, in particular, the scientific workflow applications this thesis focuses on. After the introduction of the general workflow model, detailed information about the applications used throughout this thesis is presented.

Definition 1. A workflow consists of a sequence of concatenated steps. Emphasis is on the flow paradigm, where each step follows the preceding one without delay or gap and ends just before the subsequent step may begin.

This simple case of a workflow can be extended to the more concrete workflow application, where each step is a fine-grained part of the overall application. Additionally, data dependencies can exist between these application steps, resulting in transfer delays between steps and leading to a more precise definition.

Definition 2. A workflow application is a software application which automates, at least to some degree, a process or processes. The processes are usually business-related, but it may be any process that requires a series of steps that can be automated via software.

The workflows this thesis focuses on are a subset of the general workflow application and rely on a more technical perspective:

Definition 3. A scientific workflow application is a computationally intensive software application, which might take a long time period to be executed. The workflow structure allows its execution in a distributed fashion to speed up the overall execution.

Processes in general need input parameters and files and produce output files. As a basis for this workflow structure, a graph representation is used, which is introduced below. For our workflows we use the model introduced by Coffman and Graham [31]. Formally, a workflow is a directed graph G in which nodes are computational tasks, which we call activities, and the directed edges represent communication; we denote the number of nodes and edges by N and E, respectively. To begin an activity, all its predecessor activities and all communication with that activity as the sink must be finished. Workflows in general are often directed acyclic graph (DAG) based, but we rely on a richer representation model that allows loops, which are important for complex scientific applications. More details about the applications and their workflow representations are given in Chapter 3. Resulting from possible distributed executions, the file dependencies of such an application are very important. Files need to be copied to each location where tasks of a workflow are executed, and after completion the results have to be gathered. The workflows are no longer built from generic steps but are a composition of tasks we call activities.
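The graph model just described can be made concrete with a small sketch: activities as nodes, communication edges as predecessor sets, and an activity becoming ready only when all its predecessors (and the corresponding transfers) have finished. This is an illustration of the model only, not of ASKALON's actual data structures, and the activity names are made up for the example.

```java
import java.util.*;

/** Minimal workflow graph: N activities, E directed edges representing communication.
 *  An activity may start only when all its predecessor activities have finished. */
class WorkflowGraph {
    private final Map<String, Set<String>> predecessors = new HashMap<>();
    private final Set<String> finished = new HashSet<>();

    void addActivity(String name) {
        predecessors.putIfAbsent(name, new HashSet<>());
    }

    /** Directed edge from -> to, i.e. 'to' consumes data produced by 'from'. */
    void addDependency(String from, String to) {
        addActivity(from);
        addActivity(to);
        predecessors.get(to).add(from);
    }

    void markFinished(String name) { finished.add(name); }

    /** Ready = not yet finished and all predecessors (including their transfers) are done. */
    List<String> readyActivities() {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : predecessors.entrySet()) {
            if (!finished.contains(e.getKey()) && finished.containsAll(e.getValue())) {
                ready.add(e.getKey());
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        WorkflowGraph wf = new WorkflowGraph();
        wf.addDependency("prepare", "simulateA");
        wf.addDependency("prepare", "simulateB");
        wf.addDependency("simulateA", "collect");
        wf.addDependency("simulateB", "collect");

        System.out.println(wf.readyActivities());   // only "prepare" is ready
        wf.markFinished("prepare");
        System.out.println(wf.readyActivities());   // the two parallel simulations become ready
    }
}
```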

2.1.1 Activities

The building blocks of workflows are called activities. In this section we define the different flavors of activities that are encountered throughout this thesis.

Definition 4. An atomic activity is a single task that cannot be split into smaller activities. It represents an activity that applies a specified transformation or calculation on a given input and produces output.

A scientific workflow application is a graph of connected atomic activities. These activities represent the application parts that the workflow consists of. Each activity is of a special activity type, which represents the functionality of such an atomic activity by:

- input ports representing the files and parameters needed for the execution of the activity,
- output ports representing the files generated by the activity,
- the activity name and possible semantic information about this activity.

Definition 5. An activity type is an abstraction of an application type defined by an activity name and an arbitrary number of input and output ports. These ports might be of different types.

The ports of such an activity type might be of one of the following types:

- agwl:file, representing a file that is required for execution when used in an input port, and a resulting file when used in an output port;
- xs:integer, representing an integer number;
- xs:float, representing a float number;
- xs:boolean, representing the values true or false;
- xs:string, representing a string value;
- agwl:collection, representing one or multiple of the other port types. This structure is comparable to a vector.

This concept of activity types represents an abstraction from the real applications that are executed on physical hardware, which are stored and represented by activity deployments. These deployments are comparable to concrete application installations and consist of:

- the file names that the application expects for its inputs and outputs,
- the underlying service architecture used for execution, which may be a Web Service, a Globus service or simple SSH job submission routines,
- the location of the deployment, given by server, service URI or file system location,
- information on the invocation of this deployment via command line or URI parameters needed for the execution.

Definition 6. An activity deployment is a concrete installation matching an activity type. The deployment contains all the information needed to execute the application it represents.

One or multiple of these activities are connected with control flow and data flow dependencies. To allow more complex workflow structures, a workflow language is used that supports different high-level structures to build workflows.

2.1.2 Structures

Workflows can be simply structured lists of atomic activities, which should be executed one after the other. More interesting for parallel computing are workflows with a higher complexity in their structure, as provided by compound activities. To allow easier creation of complex workflow structures, the Abstract Grid Workflow Language (AGWL) [48] used in this thesis supports multiple compound activities, which can have an arbitrary number of atomic activities in their body:

- for has in its body a set of activities which are executed an arbitrary number of times sequentially, similar to a classical for loop.
- parallelfor is similar to for, but the iterations of the loop are all executed in parallel.
- foreach is a structure that needs a collection as an input and then executes the activities defined in its body for each element of the collection in sequence.
- parallelforeach is similar to foreach, but all the elements of the collection are processed in parallel.

- while is an unbounded for loop with a stop condition.
- if allows optional workflow activities or alternative workflow structures, allowing different atomic activities on the if and else branches.
- fork defines parallel sections in the workflow and allows two or more activities to be executed beside each other.
- DAG stands for Directed Acyclic Graph and allows sections with dependencies known from this graph structure. Here dependencies might be less strict than with the fork construct.

With these structures it is easy to build workflows that allow distributed execution on Grid and/or Cloud environments. The scientific workflows used for the experiments and evaluation are presented in Chapter 3 and show examples of most of these compound activities.

2.2 Grid Computing

A computer can be defined as a machine with one or many CPUs, memory and a hard drive connected to a main board. This unit is empowered by an operating system that allows easier use of the given hardware. When multiple of these computers, mostly with the same hardware, are interconnected with fast networks, the resulting system is called a cluster. Clusters can also be built from standard desktop computers, as in commodity clusters, i.e. Beowulf clusters. When special hardware is used to achieve better performance, these clusters are built from server cases hosted in racks in server rooms, with redundant power supplies and high-speed network interconnections. In the 1990s, scientists at the different universities had access to their local cluster systems, which were not fast enough to fulfill their requirements at peak usage, yet remained largely unused for a considerable amount of time. Scientists therefore developed the idea of combining their clusters with the clusters of other universities into a bigger system, which is called the Grid. This system is globally distributed and heterogeneous in terms of hardware and operating systems. The administrators of the different domains of such a Grid need to agree on a set of standard software that needs to be installed on

all systems that participate in such a Grid, to allow scientists uniform access to all the resources. Back in 1998, Carl Kesselman and Ian Foster attempted a definition in the book [52], which includes the following: A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities. Later on, a list was developed with three aspects a system has to fulfill to be a Grid:

- Computing resources are not administered centrally.
- Open standards are used.
- Nontrivial quality of service is achieved.

From a computer scientist's point of view, the Grid is an infrastructure that offers resources which might be used for computation, storage or other forms of sensors. They are used to share resources across administrative domains and allow uniform access to different systems with central authentication. For the work presented in this thesis, we had access to the Austrian Grid system, which is explained in more detail in the upcoming subsection.

2.2.1 Austrian Grid

The Austrian Grid is a nationwide initiative to establish Grid computing in Austria. It combines Austria's leading researchers in advanced computing technologies with well-recognized partners in Grid-dependent application areas. [160] The Austrian Grid was started in 2004 as a three-year project and was continued by a second project starting in 2007. Different universities from Vorarlberg to Vienna contributed to this national project with their hardware and created a shared hardware pool with a fluctuating cluster count, hitting a peak of 1372 cores in total. After the end of the second project, the infrastructure was kept functional to allow scientists to continue their work. The hardware available within the Grid is highly diverse, ranging from desktop clusters built from unused computer rooms of the universities to expensive shared-memory systems with up to 768 cores.

Table 2.1 shows a snapshot of the Austrian Grid system from the Monitoring & Discovery System (MDS), which is installed to give a machine-readable overview of the usable resources. The value for RAM is given in megabytes per computing node, which does not show the total amount of memory available for traditional cluster systems. Additionally, two benchmark values are presented, where SI00 shows the result of the SpecInt2000 benchmark and SF00 the result of the SpecFloat2000 benchmark (single CPU). The local resource management system (LRM) indicates which scheduling system is used locally within the site. This information is important when submitting jobs directly to the cluster, bypassing the Grid middleware, or when adding additional parameters that should be passed to the LRM.

[Table 2.1: Austrian Grid resources snapshot taken from MDS, with columns Site, Master, CPUs, Free, RAM, SI00, SF00 and LRM; the listed sites include altix-uibk (torque), jku dps-prod (sge), a further jku site (pbs) and leo1 (sge), plus a Sum row.]

2.2.2 Globus Toolkit

Globus is open source Grid software that addresses the most challenging problems in distributed resource sharing. The Globus Toolkit is a fundamental enabling technology for building Grids that allow distributed computing power, storage resources, scientific instruments, and other tools to be shared securely across corporate, institutional, and geographic boundaries. [152] The Austrian Grid uses Globus as middleware to allow uniform login to the different clusters available to scientists all over Austria and cooperating countries. Job submission is done using the Globus job manager called GRAM, which is built on top of the local resource management systems of the different clusters and offers a uniform interface to the different resource management flavors. The monitoring of the resources is done using MDS, and file transfers are supported by the GridFTP protocol, which is

an optimized file transfer protocol that allows multi-streaming and buffer adjustments to maximize the throughput of transfers.

2.3 Cloud Computing

Cloud computing is, at the current point in time, a hyped buzzword in IT. Industry and most companies want to label their products with Cloud computing to be covered by the media attention currently given to this term. Most Internet businesses, from one-person start-ups to global players, want to promote their services as Cloud-based to be part of the hype. This trend results in a growing area of IT that wants to be covered by the term Cloud.

2.3.1 Scientific View

Scientific computing requires an ever-increasing number of resources to deliver results for ever-growing problem sizes in a reasonable time frame. In the last decade, while the largest research projects were able to afford (access to) expensive supercomputers, many projects were forced to opt for cheaper resources such as commodity clusters [147, 151] and Grids [52]. Cloud computing proposes an alternative in which resources are no longer hosted by the researchers' computational facilities, but are leased from big data centers only when needed, in a pay-per-use fashion. From a scientific point of view, the most popular interpretation of Cloud computing is Infrastructure as a Service (IaaS), which provides generic means for hosting and provisioning of access to raw computing infrastructure and its operating software. IaaS is typically provided by data centers renting modern hardware facilities to customers that only pay for what they effectively use, which frees them from the burden of hardware maintenance and deprecation. IaaS is characterized by the concept of resource virtualization, which allows a customer to deploy and run his own guest operating system on top of the virtualization software (e.g. [29]) offered by the provider. Virtualization in IaaS is also a key step towards distributed, automatic, and scalable deployment, installation, and maintenance of software. More information about IaaS and virtualization is given in the following sections. To deploy a guest operating system showing the user another abstract and

higher-level emulated platform, the user creates a virtual machine image, in short image. In order to use a Cloud resource, the user needs to copy and boot an image on top of it, called a virtual machine instance, in short instance. After an instance has been started on a Cloud resource [6], we say that the resource has been provisioned and can be used. If a resource is no longer necessary, it must be released such that the user no longer pays for its use. Commercial Cloud providers typically offer customers a selection of resource classes or instance types with different characteristics including CPU type, number of cores, memory, hard disk, and I/O performance. The Cloud computing paradigm holds great promise for the performance-hungry scientific computing community: Clouds can be a cheap alternative to supercomputers and specialized clusters, a much more reliable platform than Grids, and a much more scalable platform than the largest of commodity clusters. Clouds also promise to scale by credit card, that is, to scale up instantly and temporarily within the limitations imposed only by the available financial resources, as opposed to the physical limitations of adding nodes to clusters or even supercomputers and to the administrative burden of over-provisioning resources. Moreover, through the use of resource management systems such as Condor [151], Clouds offer good support for bags-of-tasks, which currently constitute the dominant Grid application type [72].

2.3.2 Market View

In general, the term Cloud computing is defined more broadly than the pay-as-you-go part of this large field that we are interested in. The search for a common definition brings up multiple different interpretations, of which we quote three to visualize the diversity of the general definitions:

- Cloud computing is Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand, like the electricity Grid.
- A self-service environment for the creation of highly-scalable applications, with the immediate availability of compute power and granular levels of billing.
- Using word processing, spreadsheet or other programs that are installed

somewhere other than on the computer upon which you are currently typing. Simply put, the applications live in a Cloud on the Internet rather than being installed on your computer's hard drive.

Resulting from this lack of a general definition of the term Cloud computing, the work in [166] tried to categorize the field of Cloud computing more scientifically. The research done in [57] in 2009 conducted a survey resulting in multiple possible definitions for the term Cloud computing, collected from experts in the field of computer science. These definitions were checked for common features that definitions from multiple experts contained, and the following list of Cloud characteristics was derived:

- Virtualization: shield the user from the hardware via a virtualization layer.
- Pay per use: no monthly contracts are required and the user only pays for the actual resource usage (mostly on an hourly basis).
- User friendliness: easy-to-use interfaces and graphical control portals are often available to control and use the Cloud.
- Internet centric: services are located on the Internet and not on local servers. The user sometimes does not even know where his data or service is stored.
- Variety of resources: a provider has different offers performance-wise, which fit the different needs of customers with diverse requirements.
- Automatic adaptation: Clouds should be evolving and dynamic systems that make it easy to adapt to new paradigms and trends.
- Scalability: when more power is needed due to overloading of a service, the system should offer scaling of the used services in an easily adoptable way or automatically.
- Resource optimization: the hardware should be used at the best possible efficiency by using optimized software and resource- and time-sharing methods.
- Service SLAs: Clouds should offer contracts that specify what service is guaranteed and what compensations are applied when the service quality is below the specified limit.

The goal of this thesis is not to define the term Cloud computing. We aim to use products that are promoted under the term Cloud computing for scientific workflow executions. In the following subsections we introduce a short categorization of Cloud types and define the part this research is focusing on.

2.3.3 Virtualization

Servers can be virtualized to raise efficiency by putting multiple applications together on a single hardware server. For instance, a mail server with an average load of 5%, an SQL server with 10% load and a Web server with 15% load could be put together using three virtual machines with a total average load of 30% on a single server, saving the investment for the other two servers. The advantage of such an optimization is the lower hardware cost and better resource utilization, but the disadvantage is that at peak usage the reserves for each service are shared, and the chances of overloading the overall system are higher. Assuming equal resource sharing, each service can have at most 33.3% of the server's capacity, compared to 100% when executed on a dedicated server. Virtualization can be used to lower over-provisioning when sharing resources among different applications. From this development, virtualization evolved to be used not only to share resources in order to improve utilization, but also to allow strict resource separation by creating multiple virtual servers on one big server infrastructure. These newly created virtual dedicated servers behave like physical servers with only a part of the computational power of the host. Scientific software, compared to commercial mainstream products, is often hard to install and use. In most cases, special requirements have to be fulfilled to install it, as discussed in [20]. The time needed to install the specific required compiler versions, libraries and software on each host used is higher than the time spent on the creation of one virtual machine image, which can be started by virtualization software on any supported hardware. The two most common virtualization environments are VMware [165] and Xen [29], but there exist other solutions used by smaller communities such as KVM, VirtualBox, VServer, OpenVZ, and Qemu. The largest part of the scientific community, including Amazon, has chosen Xen as its virtualization platform as it is open source and, therefore, can be freely used and

adapted to various needs, if required.

[Figure 2.1: Xen paravirtualization compared to a micro-kernel virtualization architecture.]

Figure 2.1 shows the paravirtualization architecture implemented by Xen. The advantage is the thinner layer between the hardware and the guest operating system in the software stack, which ensures the best possible performance. If the operating system and hardware support paravirtualization, the guest operating system can directly access hardware resources like CPU and memory. The device drivers are moved to a different layer of the stack to separate them from the guest. The advantages of virtualization are rather important when using heterogeneous computing resources, as is common in Grid and Cloud computing. A virtual machine image only needs to be created once and can then be used on all machines having the same virtualization software installed. While the performance reachable with this approach depends on the software and hardware used, Xen claims to add only a minimal overhead [106] in the best case. However, in most cases, when comparing to hardware-optimized binaries and libraries, this performance loss may increase dramatically. Using machines with similar architecture and speed, and a machine image optimized for this main architecture, may result in a small total virtualization overhead and comparable performance with less compilation and optimization effort. For example, when running a virtual machine image on an AMD Opteron host environment, the used libraries and binaries should have been compiled to use the 3DNow! extension that this architecture offers, but such an image will then not be executable on Intel Xeon architectures.

2.3.4 Cloud Types

We can classify the field of Cloud computing that we are interested in into four categories. Many Cloud computing companies distinguish themselves by the type of services they offer. At the highest level, we observe three main directions (see Figure 2.2) and one area containing all other services that do not fall into the first three directions.

[Figure 2.2: Service type taxonomy: Infrastructure as a Service (IaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and specialized services such as Web hosting and file hosting.]

Infrastructure as a Service (IaaS)

IaaS provides generic functionality for hosting and provisioning of access to raw computing infrastructure and its operating middleware software. IaaS is typically provided by data centers that rent modern hardware facilities to customers, who are freed from the burden of their maintenance and depreciation. IaaS is characterized by the concept of resource virtualization, which allows customers to deploy and run their own guest operating system on top of the virtualization software offered by the provider. Virtualization in IaaS is a key step towards distributed, automatic, and scalable deployment, installation, and maintenance of software. An example of this service type is described later in this chapter.

Software as a Service (SaaS)

SaaS is the second category of Cloud services, defining a new model of software deployment where an application is hosted as a service and provided to customers across the Internet, with no need to install and run it on the customer's own computer. In SaaS, the hosting is done transparently by the service provider (usually the same as the developer), which eliminates the hosting intermediary (and its underlying functionality) between itself and the customer. SaaS is a more restrictive model than IaaS, as it constrains customers to using an existing set of services rather than deploying their own ones. Software companies adopt this model to give access to their software on a pay-as-you-go basis rather than selling licenses. This allows cheaper usage of the software at the beginning, but might increase the tool cost for the customer in a long-term usage scenario. Examples of SaaS providers are Google Docs, Microsoft Office 365 and Adobe Photoshop Express.

Platform as a Service (PaaS)

PaaS, also known as Cloudware, is the third category that takes IaaS and SaaS one step further by providing all facilities and APIs to support the complete life cycle of building and delivering Web applications and services (including design, development, testing, deployment, and hosting), with no more need for tedious software downloads and installations. PaaS is a relatively new and immature concept that still needs to gain community acceptance and support before being surveyed in detail. Examples of PaaS providers are Google AppEngine, Microsoft Azure and Heroku.

Specialized hosting services

Besides these three main categories, we introduce a fourth category of specialized hosting services that are closely related to, or claim to support, Cloud computing, although they offer significantly restricted or specialized functionality. We see two successful representatives of this category on the market:

1. Web hosting environments act as intermediaries between service providers and customers by renting packages for hosting Web sites, comprising Web servers, FTP and SSH access, storage space, and various software capabilities such as

Perl, PHP, Python, or Ruby. There are three main aspects invoked by Web hosting companies which connect them to Clouds: (i) virtualization of resources (although not exposed to the users) for improved management of time-shared resources, (ii) automatic scaling of the provisioned resources to cope with dynamic client load with guaranteed Quality of Service (QoS) (see Section 4.2.4), and (iii) business models inspired by utility computing (see Section 4.2.6);

2. File hosting environments offer a virtual and persistent storage system where customers can safely save their data at a certain price with guaranteed QoS delivery.

3. Everything else is of minor interest for the scientific community. Therefore, we did not concentrate on providers that claim to offer Cloud computing but do not fit into any of the previous categories.

Amazon Elastic Compute Cloud

"Amazon Elastic Compute Cloud (Amazon EC2) is a Web service that provides resizable compute capacity in the Cloud. It is designed to make Web-scale computing easier for developers. Amazon EC2's simple Web service interface allows you to obtain and configure capacity with minimal friction. It provides complete control of computing resources and lets one run on Amazon's proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing one to quickly scale capacity, both up and down, as the computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios." [6]

EC2 is an IaaS provider with a wide range of options for the customer to choose from. It is one of the biggest providers known and plays the role of the market leader. As internal numbers are rarely published, these observations rely on third-party companies that try to figure out how many resources such a provider has and how they are used. In Figure 2.3 we see such an analysis from [137] showing how many of the top Web sites listed in the Internet have IP addresses associated with Cloud providers.

[Figure 2.3: Estimation of the market share of different Cloud providers [137].]

Amazon is in first place, closely followed by Rackspace Cloud Servers. It is also remarkable that, of the analyzed sites producing the most Web traffic, nearly 1.8% are using Cloud technologies. This shows the importance of Cloud technology in the Web hosting market.

There are several terms that are important when talking about EC2 and IaaS Cloud computing:

Definition 7. An instance image is a file containing the operating system and possibly additional applications, which can be started in a Cloud. The image might be optimized for special architectures and has a specified size. When started, the instance image is copied, as the original file is not changed during execution. After termination, all changes to such an image are lost if not stored separately.

Definition 8. An instance type is a hardware configuration, which can be built from physical resources or a virtual environment. This type has characteristics which might be defined in detail or in a broader way; the CPU or memory of such an instance might be shared or dedicated.

Definition 9. An instance is a Cloud resource which is running a specific instance image on a matching instance type. The user has root rights on this instance and can log into it using the SSH protocol with public/private key authentication once the startup process is finished.
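To illustrate how these three terms map onto the EC2 API, the following Python sketch uses the boto3 client library to start an instance from an instance image on a chosen instance type and waits until the startup process has finished. The image ID, key pair name and region are placeholders for illustration, not values used in this thesis.

# Illustrative sketch (not part of ASKALON): launching an EC2 instance with boto3.
# The AMI ID, key pair name and region below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # instance image (Definition 7), placeholder ID
    InstanceType="m1.small",           # instance type (Definition 8)
    KeyName="my-keypair",              # key pair used for SSH public/private key login
    MinCount=1,
    MaxCount=1,
)

instance = instances[0]                # the running instance (Definition 9)
instance.wait_until_running()          # blocks until the startup process finishes
instance.reload()
print(f"Instance {instance.id} running at {instance.public_dns_name}")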

Eucalyptus

Eucalyptus [111] is a university project that tries to mimic the feature set offered by Amazon EC2 and S3 by implementing the same underlying API and, additionally, a Web portal for user management. A public test Cloud is also provided. This software bundle allows users to install their own EC2-compatible Cloud on local hardware for development, test or production environments. We installed Eucalyptus on two local servers with a total of 12 cores, allowing us to start up to 12 virtual machines.

2.4 ASKALON

"The goal of ASKALON is to simplify the development and optimization of applications that can harness the power of Grid computing. The ASKALON project crafts a novel environment based on new innovative tools, services, and methodologies to make Grid application development and optimization for real applications an everyday practice." [157]

The ASKALON Grid middleware [47] can execute, via its workflow execution engine, workflows specified in AGWL. During execution, this abstract specification is instantiated, that is, the tasks are annotated with details concerning the used resources. ASKALON's workflow execution engine features a fine-grained event system, which is implemented as a WSRF service and allows event forwarding even through NAT or firewalls. An overview of this and other Grid workflow systems can be found in [176]. ASKALON is a service-oriented architecture that consists of several independent services which communicate with each other to give scientists a simple environment for executing workflow applications on parallel and distributed systems such as clusters, Grids and Clouds. For the composition of a workflow, ASKALON provides a graphical user interface, as shown in Figure 2.4, through which application developers can conveniently assemble activities, define their execution order, decide which activities can be executed in parallel, and monitor them. Figure 2.5 shows the architecture of ASKALON at the state where the presented research started.

[Figure 2.4: User interface of ASKALON.]

In the following subsections, we give more details about the main services of ASKALON, which were used and extended to produce the research presented in this thesis.

Execution Engine

The execution engine is the entry point to the ASKALON services. The user submits a workflow for execution in a Web service call with an embedded AGWL description of the workflow to execute. In the year 2008, the ASKALON workflow execution engine evolved in two major steps. The first version, DEE [41], focused on functionality and hard-coded optimizations. DEE's primary shortcomings were the internal loop unrolling from the workflow specification and the complete scheduling at the start of the execution. To improve scalability and adaptability to highly dynamic Grid environments, the second-generation engine Execution Engine 2 (EE2) was developed [128]. The EE2 internally uses a structure that is kept close to the AGWL specification and scales better for the execution of large workflows. Each job that is ready for execution is dynamically sent to the best available Grid site at that moment.
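The dynamic behaviour of the EE2 described above can be summarized as follows: an activity becomes ready as soon as all of its predecessors have finished and is then submitted individually, instead of scheduling the complete workflow up front as DEE did. The following Python sketch is a simplified illustration of that idea under assumed data structures (a dependency map and two callback functions); it is not ASKALON code.

# Simplified illustration of readiness-driven dispatch as performed by EE2:
# an activity is submitted as soon as all of its predecessors have completed.

def dispatch_workflow(dependencies, submit, wait_for_any):
    """dependencies: {activity: set of predecessor activities}.
    submit(activity) starts an activity on the best available resource.
    wait_for_any(running) blocks and returns the set of activities that finished."""
    finished, running = set(), set()
    pending = set(dependencies)
    while pending or running:
        # Find activities whose predecessors have all completed.
        ready = {a for a in pending if dependencies[a] <= finished}
        for activity in ready:
            submit(activity)          # just-in-time submission, one activity at a time
            running.add(activity)
        pending -= ready
        if not running:
            raise RuntimeError("workflow contains a dependency cycle")
        done = wait_for_any(running)  # block until at least one activity finishes
        running -= done
        finished |= done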

[Figure 2.5: Original ASKALON architecture with no Cloud support. [157]]

To decide where an activity should be executed, the EE2 asks the Scheduler and, once a decision is taken, submits the activity to the specified resource. The engine is responsible for the correct execution of file transfers and activities that follow the control and data flow of the workflow. Optimizations for better workflow performance and stability are also part of this service, but are not covered in this thesis.

Scheduling

The scheduler is the component that has to decide on which host an activity should be executed. As the scheduling of a workflow on heterogeneous resources is known to be NP-complete [156], it is not possible to deliver the best solution for scheduling problems in a reasonable time. The scheduler gets a request from the EE2 to map one activity to a resource. Currently there are different implementations of this scheduler using different methods:

JIT is a Just In Time (JIT) scheduler that looks at single activities only and tries to find the resource with the minimum completion time for the current activity. The scheduler does not take dependencies or other activities into account when taking this decision. The advantage is the fast decision-making and the up-to-
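As an illustration of this just-in-time strategy, the following Python sketch picks, for a single activity, the resource with the smallest estimated completion time, composed here of an assumed queue-waiting estimate plus a predicted runtime. The estimator functions, resource names and numbers are hypothetical and do not correspond to ASKALON interfaces or measured values.

# Illustrative sketch of a just-in-time scheduling decision: choose the resource
# that minimizes the estimated completion time of one activity, ignoring the rest
# of the workflow. Both estimators below are hypothetical.

def jit_schedule(activity, resources, queue_wait, predicted_runtime):
    """Return the resource with the minimum estimated completion time."""
    def completion_time(resource):
        return queue_wait(resource) + predicted_runtime(activity, resource)
    return min(resources, key=completion_time)

# Example with static, made-up estimates (in seconds) for three generic sites.
waits = {"site_a": 120, "site_b": 10, "site_c": 300}
runtimes = {"site_a": 500, "site_b": 900, "site_c": 450}

best = jit_schedule("activity_1", list(waits),
                    queue_wait=lambda r: waits[r],
                    predicted_runtime=lambda a, r: runtimes[r])
print(best)  # site_a: 120 + 500 = 620 is the smallest estimate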
