Automatic, multi-grained elasticity-provisioning for the Cloud
User Requirements and System Architecture V1
Deliverable no.: D1.1
Date: 05-04-2013
CELAR is funded by the European Commission DG-INFSO Seventh Framework Programme, Contract no.: 317790
Table of Contents

1 Introduction ... 6
1.1 The Vision of CELAR ... 6
1.2 CELAR Expected Outcome ... 8
1.3 Purpose of this Document ... 9
1.4 Document Structure ... 9
2 Application Requirements Analysis ... 10
2.1 Translational Cancer Detection pipeline (SCAN) ... 10
2.2 EverythingHere Application ... 12
2.3 Application-driven Use cases ... 13
2.3.1 Actors ... 13
2.3.2 Use Cases ... 14
2.4 Functional Requirements ... 17
2.5 Non-Functional Requirements ... 19
3 CELAR Approach and Architecture ... 21
3.1 Current Approach to Scaling ... 21
3.2 CELAR Architecture ... 23
3.3 CELAR Components ... 24
3.3.1 Application Management Platform ... 24
3.3.2 Elasticity Platform ... 27
3.3.3 Cloud Information and Performance Monitor ... 31
3.4 CELAR Workflows ... 34
3.4.1 Application Description-Submission Workflows ... 36
3.4.2 Application Deployment Workflow ... 37
3.4.3 Profiling Workflow ... 38
3.4.4 Monitoring Workflow ... 38
3.4.5 Decision-making Workflow ... 39
4 Conclusions ... 41
5 Citations and References ... 42

List of Figures

Figure 1: Active vs. Idle Resources in an Over-provisioning Scenario [Feroldi2009] ... 6
Figure 2: Dataflow of the SCAN Pipeline ... 10
Figure 3: Architectural Overview of the Web-based Policy Game Developed ... 12
Figure 4: CELAR Actors ... 13
Figure 5: CELAR Application Lifecycle ... 15
Figure 6: CELAR Use-case UML Diagram ... 15
Figure 7: Standard Layering of Cloud-based Applications ... 21

D1.1 User Requirements and System Architecture V1 2
Figure 8: CELAR System Architecture ... 24
Figure 9: Decision Module Overview ... 28
Figure 10: CELAR Deployment Overview ... 35
Figure 11: Application Description and Submission Workflows ... 37
Figure 12: Monitoring Workflow ... 39
Figure 13: Decision-making Workflow ... 40

List of Tables

Table 1: Use Cases for the Application User Actor ... 16
Table 2: Use Cases for the CELAR Expert Actor ... 17
Table 3: Use Cases for the IaaS/Application Platform ... 17
Table 4: Default Monitoring System Metrics ... 33

List of Abbreviations

AMI: Amazon Machine Image
API: Application Programming Interface
DB: DataBase
GUI: Graphical User Interface
IaaS: Infrastructure as a Service
ICR: Institute of Cancer Research
IS: Information System
MS: Monitoring System
NGS: Next Generation Sequencing
PaaS: Platform as a Service
UML: Unified Modeling Language
VM: Virtual Machine
WP: Work Package
Deliverable Title: User Requirements and System Architecture V1
Deliverable no: D1.1
Filename: CELAR_D1.1_finalrelease.docx
Author(s): Dimitrios Tsoumakos, Ioannis Konstantinou, Nikolaos Papailiou, Ioannis Giannakopoulos, Demetris Trihinas, Nicholas Loulloudes, Stalo Sofokleous, Georgiana Copil, Daniel Moldovan, Wei Xing, Kam Star
Date: 29-03-2013
Start of the project: 01-10-2012
Duration: 36 months
Project coordinator organization: ATHENA RESEARCH AND INNOVATION CENTER IN INFORMATION COMMUNICATION & KNOWLEDGE TECHNOLOGIES (ATHENA)
Due date of deliverable: 31-03-2013
Actual submission date: 05-04-2013

Dissemination Level
[X] PU: Public
[ ] PP: Restricted to other programme participants (including the Commission Services)
[ ] RE: Restricted to a group specified by the consortium (including the Commission Services)
[ ] CO: Confidential, only for members of the consortium (including the Commission Services)

Deliverable status version control

Version | Date | Author
1.1 | 04-04-2013 | Dimitrios Tsoumakos, Ioannis Giannakopoulos, Nikos Papailiou
1.0 | 28-03-2013 | Dimitrios Tsoumakos
0.9 | 24-03-2013 | Ioannis Giannakopoulos, Nikos Papailiou, Wei Xing
0.8 | 13-03-2013 | Dimitrios Tsoumakos, Ioannis Giannakopoulos
0.7 | 11-03-2013 | Dimitrios Tsoumakos
0.6 | 09-03-2013 | Demetris Trihinas, Stalo Sofokleous, Georgiana Copil, Daniel Moldovan
0.5 | 07-03-2013 | Dimitrios Tsoumakos, Ioannis Giannakopoulos, Nikos Papailiou
0.4 | 06-03-2013 | Demetris Trihinas, Nicholas Loulloudes
0.3 | 05-03-2013 | Georgiana Copil
0.2 | 28-02-2013 | Nikos Papailiou, Ioannis Giannakopoulos, Wei Xing, Kam Star
0.1 | 25-02-2013 | Dimitrios Tsoumakos

Abstract
The aim of this document is to gather the use cases, compile the user requirements and present the detailed CELAR system architecture. Collecting the use cases relevant to the CELAR applications from the user partners enables a thorough record of requirements and functionality, which leads to the specification of the overall CELAR system architecture. Additionally, module functionality and basic workflows are defined in this document.

Keywords
CELAR System Architecture, Automated Elasticity Provisioning, Resource Allocation, Use Cases, System Requirements, Cloud Monitoring, c-eclipse, Application Management Platform, Elasticity Platform, Cloud Computing, Workflows
1 Introduction

1.1 The Vision of CELAR

Cloud computing refers to the notion of delivering computing as a service rather than a product. Through Cloud computing, on-demand, ubiquitous network access to a shared pool of configurable and often virtualized computing resources is achieved in a metered-by-use, cost-efficient manner, i.e., with minimal management effort or service provider interaction. All these features make cloud computing an ideal paradigm for modern business, with a predicted global market growth from $40.7 billion in 2011 to more than $241 billion in 2020 [forester].

Figure 1: Active vs. Idle Resources in an Over-provisioning Scenario [Feroldi2009]

One of the most appealing (albeit challenging) characteristics of cloud computing is the ability to support elastic computing. Elasticity refers to the ability of the infrastructure (IaaS), platform (PaaS) or software (SaaS) to expand or contract dedicated resources in order to meet the exact demand at runtime. Optimal resource allocation is of great importance in the realm of cloud infrastructure provisioning. Businesses, organizations and individual users alike witness wide variations in the load of their respective applications within the time-span of a year, month, day or even a few minutes. Under-provisioning runs the risk of costly service denials at peak hours (e.g., the recent Amazon Cloud outage [Cloudoutage] or the Foursquare outage [Horowitz2010]). The standard model of provisioning for the expected peak load (see Figure 1) shows how static resource provisioning incurs increased costs: the majority of commissioned resources remain idle during off-peak hours. Cloud elasticity, driven by simple yet customizable rules, allows application performance to be throttled in a multi-grained, controlled manner, benefiting both parties: from the cloud provider perspective, on-demand (elastic) provisioning allows for increased flexibility and, for the customers, performance gains.
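The cost gap between static peak provisioning and elastic provisioning can be made concrete with a small back-of-the-envelope calculation. The load profile, VM capacity and hourly price below are hypothetical numbers chosen only for illustration:

```python
# Illustrative cost comparison between static peak provisioning and
# elastic provisioning. All numbers (load profile, VM capacity, price)
# are hypothetical and only serve to make the over-provisioning
# argument concrete.

import math

hourly_load = [120, 80, 60, 55, 70, 150, 400, 650,   # requests/sec over a day
               700, 680, 620, 590, 610, 640, 660, 700,
               720, 650, 500, 380, 300, 240, 180, 140]

VM_CAPACITY = 100       # requests/sec a single VM can serve
VM_PRICE = 0.10         # cost per VM-hour

def vms_needed(load):
    return math.ceil(load / VM_CAPACITY)

# Static provisioning: always keep enough VMs for the daily peak.
static_vms = vms_needed(max(hourly_load))
static_cost = static_vms * VM_PRICE * len(hourly_load)

# Elastic provisioning: match capacity to demand every hour.
elastic_cost = sum(vms_needed(l) * VM_PRICE for l in hourly_load)

print(f"static:  {static_cost:.2f}")   # pays for peak capacity all day
print(f"elastic: {elastic_cost:.2f}")  # pays only for what each hour needs
```

Even with this simple profile, elastic provisioning roughly halves the daily bill, since peak capacity is never paid for while it sits idle off-peak.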
However, cloud consumers should not have to deal with the exact provisioning of resources according to expected demand. All cloud consumers care about is cost and quality; thus, elastic resource provisioning should be performed by the provider [Dustdar2011]. On the one hand, consumers of cloud services want to minimize the execution time of their submitted tasks without exceeding a given budget; on the other, cloud providers are keen on maximizing their financial gain while keeping their customers satisfied. In order for cloud applications, their users and cloud providers to harvest the benefits of elastic provisioning, it is imperative that it is performed in an automated, fully customizable manner. Autoscaling of resources has been identified as one of the top obstacles and opportunities for Cloud Computing [Armbrust2009]. CELAR plans to fill this gap and deliver a fully automated and highly customizable system that performs elastic resource provisioning for cloud computing applications. The vision of the CELAR
(Cloud ELAsticity provisioning) project is to provide a complete software stack that efficiently programs and manages resource allocation to cloud applications in the same way an operating system manages processes: when an application requires more resources to reach a required quality (in the same way, for instance, a process requires more memory), our system will automatically expand the application's virtual hardware at runtime. Conversely, when the cloud application can achieve its performance goals with fewer resources, an automatic contraction (at runtime) will free virtual resources for other applications. The software stack adaptively regulates the efficient allocation of these resources to applications, in a multi-grained manner, according to predefined consumer elasticity constraints or application descriptions. To achieve this, dynamic resource and quality performance information is collected at both the platform and the application side of the Cloud infrastructure, evaluated cost-wise and exposed to the users. Our proposal covers the three layers required by an application to operate over the Cloud: the infrastructure layer (deployment over the ~Okeanos IaaS and the FlexiScale IaaS), the monitoring/optimization middleware (automatic elasticity provisioning over cloud platforms) and the programming development environment (i.e., c-eclipse, a distributed tool that enables developers, administrators and users to define the characteristics of their applications, launch them, submit jobs and monitor performance). The outcome of the proposed project is a modular, completely open-source system that offers elastic programmability for the user and automatic elasticity at the platform level. This outcome will be provided in a way that allows simple installation of any application along with its automated resource provisioning over a Cloud IaaS.
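The operating-system analogy above can be sketched as a simple control loop. The following is a minimal illustration, not the actual CELAR decision logic; the metric, thresholds and constraint fields are assumptions made for the sketch:

```python
# Minimal sketch (not the actual CELAR implementation) of the control
# loop described above: like an OS granting a process more memory,
# the platform grows or shrinks an application's virtual hardware to
# keep a quality metric within the consumer's elasticity constraints.

from dataclasses import dataclass

@dataclass
class ElasticityConstraint:
    max_response_ms: float   # quality goal the application must meet
    min_vms: int = 1
    max_vms: int = 10        # budget cap set by the consumer

def decide(current_vms: int, response_ms: float, c: ElasticityConstraint) -> int:
    """Return the new VM count for one iteration of the control loop."""
    if response_ms > c.max_response_ms and current_vms < c.max_vms:
        return current_vms + 1          # expand: quality goal is missed
    if response_ms < 0.5 * c.max_response_ms and current_vms > c.min_vms:
        return current_vms - 1          # contract: resources are wasted
    return current_vms                  # steady state

c = ElasticityConstraint(max_response_ms=200)
print(decide(3, 350.0, c))  # overloaded -> 4
print(decide(3, 60.0, c))   # under-utilized -> 2
print(decide(3, 150.0, c))  # within bounds -> 3
```

A real decision module would of course consider many metrics, costs and multi-grained actions (not just VM counts), but the expand/contract cycle is the core idea.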
The proposed system specifically targets:

Infrastructure providers: Our elasticity provisioning subsystem will be modular in both design and implementation, and deployable with minimum effort across various commercial and open-source platforms. As a proof of concept, our system will be deployed over three infrastructures: ~Okeanos (https://okeanos.grnet.gr), a public IaaS cloud service developed and hosted by GRNET; the FlexiScale public platform (http://www.flexiscale.com/, hosted by BlueSquare Data servers); and Flexiant's in-house data-centre with increased API and platform customization capabilities.

Cloud Application Users/Developers: The system will enable a wide range of applications and their respective developers/expert users to easily utilize the resources of the underlying infrastructure through an intuitive programming development environment plug-in. Specifically, users will be able to define the application characteristics, including cost and quality and their trade-offs, for optimal resource allocation at design-time and runtime, and will submit their applications without having to manually perform resource mappings and bindings. Search capabilities for making Cloud resources easily accessible to end-users will be provided. To circumvent the burdensome installation and integration process, our project offers a unique cloudification feature that allows click-and-go installation of the application over the CELAR system. To showcase the great potential of our modules and methods, two novel applications, one in the area of large-scale internet gaming and one in the area of scientific computing, will be deployed over our prototype system.

Cloud administrators: The prototype will enable administrators to manage and monitor the available resources (storage, processing and networking). A Cloud resource description framework will be introduced, which will be used for conceptual description of Cloud resources and will support visualization and search.
1.2 CELAR Expected Outcome

The goal of the proposed project is to develop methods and tools for applying and controlling multi-grained, automatic elasticity at the application level over Cloud infrastructures. The main expected outcome of the project is a complete set of methods materializing into open-source tools that will allow the enhancement of a platform towards intelligent, automatic, multi-grained resource provisioning according to the needs of user applications. Specifically:

i. The elasticity provisioning subsystem, which manages platform resources.
ii. The cloud-eclipse (c-eclipse) framework, adapted and extended from the g-eclipse official Eclipse project so as to provide plug-ins for accessing and managing Cloud resources on the envisioned platform.
iii. A scalable, multi-layer Cloud Monitoring tool that gathers a rich set of platform, infrastructure and application-side metrics and evaluates them in a composite fashion.

Our modules will be both generic and open-source, in order to allow for maximum utilization and ease of adaptation with existing commercial, academic and community systems. Providing added value and greatly simplifying application deployment over CELAR, the project will develop a framework for the cloudification of any elasticity-demanding cloud application with the CELAR system, offering this integration as a single installable software package.

In more detail, the expected outcome of CELAR is on two levels: the level of innovative methods and technologies and the level of tools and applications. On the methods and technology level, novel methods will be designed and developed to automatically decide the exact amount and type of resources that need to be commissioned or freed on a per-application basis. These will be tightly coupled with the idea of exposing cloud and application performance metrics to the user and enabling both qualitative and quantitative characterization of the application's performance.
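Item iii above mentions evaluating metrics from several layers "in a composite fashion". One common way to realize such a composite evaluation is a weighted combination of normalized per-layer readings; the sketch below is a hypothetical illustration of that idea, with metric names and weights invented for the example (they are not CELAR's):

```python
# Hedged sketch of a "composite" evaluation of multi-layer metrics:
# readings from the infrastructure, platform and application layers are
# normalized to [0, 1] and combined with user-chosen weights into one
# health score that decision logic can threshold. All names and weights
# here are illustrative assumptions.

def composite_score(readings, weights):
    """Weighted average of normalized metric readings in [0, 1]."""
    total = sum(weights[name] for name in readings)
    return sum(weights[name] * value for name, value in readings.items()) / total

readings = {
    "infra.cpu_util": 0.90,        # fraction of CPU in use (IaaS level)
    "platform.queue_fill": 0.75,   # fraction of request queue occupied
    "app.latency_ratio": 0.80,     # observed latency / latency goal, capped at 1
}
weights = {"infra.cpu_util": 1.0, "platform.queue_fill": 1.0, "app.latency_ratio": 2.0}

score = composite_score(readings, weights)
print(score)  # a single figure between 0 and 1
```

Weighting the application-side metric more heavily reflects the document's emphasis on application-level quality rather than raw resource usage.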
The integrated technology will enable fast, hassle-free development and submission of both simple and highly demanding applications that take full advantage of Cloud resources according to both demand and user requirements. Moreover, a unique integration framework will provide a roadmap for one-step, click-and-go installation of applications on virtually any cloud provider. On the tools and applications level, the outcome of the CELAR project will be a set of open-source tools that can be used both separately and as an integrated system in order to provide real-time, multi-grained elasticity and control over applications running on the Cloud. Two exemplary applications that showcase and validate the aforementioned technology will be developed, providing a clear path towards the adoption of the CELAR rationale and increasing the visibility and impact of the project results. The first application will showcase the use of CELAR technology for the massive data management and large-scale collaboration required in the on-line gaming realm. It will be driven by a leading games development company that designs, develops and delivers Games, Applications and Simulations across industry sectors. The second application will target the area of scientific computing, with an application that requires compute- and storage-intensive genome computations of varying difficulty as part of cancer analysis workflows. It will be driven by one of the world's foremost independent cancer research organisations, specializing in the prevention, diagnosis and treatment of cancer. In the next Section, more details about the two use-case applications as well as their envisioned functionality and requirements are given.
1.3 Purpose of this Document

The aim of this document is to present the CELAR use cases as these were initially documented by the intended users in Milestone 1 (MS1) of the project, and to describe the first version of the CELAR architecture. Use cases are used to define requirements, which drive the design of the first version of the CELAR system architecture. Factors influencing the detailed architecture of the CELAR system can be grouped, according to their characteristics, into the following categories:

Functional requirements (what the system will be capable of doing): The goals that users want to reach and the tasks they intend to perform with the new software must be determined. By recognizing the functional requirements, we understand the tasks that involve the abstraction of why the user performs certain activities, what their constraints and preferences are, etc. The important point to note is that WHAT is wanted is specified, and not HOW it will be delivered.

Infrastructure requirements: Special or already existing hardware/software systems that must be used in the project fall into this category.

Non-functional requirements (the restrictions on the types of solutions that will meet the functional requirements): Specification of non-functional requirements includes the description of user characteristics such as prior knowledge and experience, the special needs of professional (i.e., developers, cloud experts, etc.) and personal users (Application Experts), subjective preferences, and the description of the environment in which the product or service will be used. As such, performance, usability and scalability requirements are also covered.

1.4 Document Structure

The structure of the rest of this document is as follows: In Section 2 we present a semantic overview of the CELAR applications, derive the basic actors and use cases, and compile the CELAR functional and non-functional requirements. In Section 3 we define the overall system architecture.
Section 3 outlines the difference between current elastic application management and CELAR, presents the system architecture and describes its components. Finally, we define the basic workflows based on the CELAR use cases and functionality. We conclude this document in Section 4.
2 Application Requirements Analysis

In this section, we provide an overview of the two applications that will use the CELAR infrastructure. While a more detailed description will be available in the corresponding application deliverables (D7.1 and D8.1), we provide here an overview of the scenarios that motivate the applications as well as their envisioned architecture/modules.

2.1 Translational Cancer Detection pipeline (SCAN)

The identification of genes that are mutated and hence drive oncogenesis has been a central aim of cancer research since the advent of recombinant DNA technology. ICR has developed several pipelines to capture and analyze genomic, proteomic and clinical information by using several biology tools such as BWA, GATK, The Global Proteome Machine, MaxQuant, CellProfiler and Cytoscape. Details of the SCAN pipeline are shown in Figure 2:

Figure 2: Dataflow of the SCAN Pipeline

Below, we briefly describe the tools mentioned above as well as the respective platforms/software licences under which they operate:

Burrows-Wheeler Aligner (BWA): an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It is a CPU-intensive application implemented in the C language, running under Linux.

Genome Analysis Toolkit (GATK): a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce [Dean2004]. It is a Java application using the Java JDK on Linux platforms.

Global Proteome Machine: a powerful search engine which uses mass spectrometry data to identify proteins from primary sequence databases.

MaxQuant: a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at high-resolution mass spectrometry data.
Several labelling techniques as well as label-free quantification are supported. It runs over a Windows platform.
CellProfiler Analyst: open-source software in Python for exploring and analyzing large, high-dimensional image-derived data. It includes machine learning tools for identifying complex and subtle phenotypes. It runs over Linux platforms.

Cytoscape: an open-source software platform for visualizing complex networks and integrating these with any type of attribute data. It is a Java application.

The SCAN pipeline comprises four processes: A) NGS data processing on a Linux system; B) mass spectrometry sample data processing on a Windows system; C) cell image data processing on a Linux web server; D) integrative network analysis on a Linux system. The resources required for the four processes are very different (Linux, Windows, web services, etc.). Currently, ICR has to prepare the different systems (hardware and software) with various run-time environments in advance in order to run the whole pipeline in five steps. This approach, however, is unproductive and highly inefficient. For example, a dedicated Windows system is required for the protein discovery application but is only used for about 30 minutes out of the roughly 90 hours of the whole pipeline execution. Furthermore, over-provisioning the required systems with maximum hardware capability is currently mandatory for a few special cases (e.g., the analysis of some very complicated and very large patient data sets), although such capacity is unnecessary in most cases. CELAR can provision system resources automatically to the heterogeneous applications of the SCAN analytic pipeline in a just-enough, just-in-time manner. It will allow the SCAN pipeline to complete smoothly, without interruption. In particular, CELAR will be able to monitor the state of execution of the various SCAN steps, so that it can dynamically allocate the required resources for each step of the pipeline when needed.
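The per-step provisioning behaviour described above can be illustrated with a small sketch. The step names and resource sizings below are assumptions, not ICR's actual requirements; the point is that each heterogeneous environment is commissioned only for the duration of its own step, instead of all systems being kept provisioned for the full run:

```python
# Illustrative encoding of per-step resource needs for a SCAN-like
# pipeline. Names and sizings are invented for the sketch; a scheduler
# could commission each step's environment only while that step runs,
# rather than keeping every system (Linux, Windows, web server)
# provisioned for the whole ~90-hour execution.

SCAN_STEPS = [
    {"name": "ngs_processing",      "os": "linux",   "cpus": 16, "ram_gb": 32},
    {"name": "mass_spec_analysis",  "os": "windows", "cpus": 4,  "ram_gb": 16},
    {"name": "cell_image_analysis", "os": "linux",   "cpus": 8,  "ram_gb": 16},
    {"name": "network_integration", "os": "linux",   "cpus": 4,  "ram_gb": 8},
]

def provision_plan(steps):
    """Yield (step name, resources) pairs in execution order: resources
    for a step are acquired just before it starts and released after."""
    for step in steps:
        yield step["name"], {k: step[k] for k in ("os", "cpus", "ram_gb")}

for name, res in provision_plan(SCAN_STEPS):
    print(name, res)
```

With such a plan, the Windows environment needed only briefly for protein discovery exists only while its step executes, which is exactly the waste the document attributes to the current static setup.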
2.2 EverythingHere Application

EverythingHere is a web-based policy game, developed by Playgen, which will utilize Big Data from the government website http://data.gov.uk, pulling in historical and real-time information in order to model and simulate the infrastructure of London as game challenges. Players will be put in the shoes of policy makers and progress through the game by slicing and analysing data. In order to complete game challenges they must discover emerging stories within the data sets (correspondence analysis). The game will encourage players to use its inbuilt data tools to perform sentiment analysis and deep data mining in order to complete their objectives and progress further. The game will allow multiple concurrent users to access it ubiquitously. It will be deployed and run over the CELAR system in order to demonstrate the capability of CELAR for elastic processing of huge volumes of highly volatile social data, produced and updated during a cloud social game, as well as for accommodating a large, varying number of concurrent user accesses. EverythingHere's architecture is shown in Figure 3. The game consists of three separate tiers: Data Aggregation modules, Data Analysis modules and Data Presentation/Manipulation modules.

Figure 3: Architectural Overview of the Web-based Policy Game Developed

Each module of the architecture is responsible for a set of functionalities provided to the system. Specifically:

Data Agents represent data sources, such as http://data.gov.uk, that will provide data.

Data Receivers are specific applications responsible for interacting with Data Agents and pulling data. The Data Receiver will store the data pulled from Data Agents into the Data Stores, which are implemented using Cassandra [Cassandra], a NoSQL data store.

The Analyser Engine will analyse data according to defined analytic queries. Data analysis involves reading data from the data stores, updating the data and storing them back.
An interactive User Interface is responsible for presenting the results of queries over the analysed data to the user. If the analysed data is not available, the Data Receiver will request it from the Data Agents.

For the above modules, CELAR must provide the necessary resources so that CPU- and I/O-intensive applications like Data Analysis can operate adequately and execute queries in the shortest possible time. CELAR must also monitor the application executed on the CELAR platform and maintain appropriate metrics, so that overloading is avoided and the dynamic addition or removal of resources helps achieve scalable application performance.

2.3 Application-driven Use cases

In this section, we present the use cases for the CELAR platform. We first provide a description of the actors (or user roles) that will interact with the system; we then describe the functionality each actor expects to access. Figure 4 gives an overview of all the CELAR actors as well as their hierarchy.

Figure 4: CELAR Actors

2.3.1 Actors

The CELAR system accepts as input an application and its corresponding descriptions, obtains resources and metrics from the underlying provider and eventually performs elastic, automatic application resource management. To that extent, CELAR users may be both physical entities (relating to the CELAR platform and the application that runs over it) and other systems (e.g., the IaaS monitoring infrastructure). The identified actors are:

Application User
The Application User is a person that is knowledgeable about the application currently executed over the CELAR platform. His goal is to describe, deploy and monitor the application over CELAR. This general user is a generalization of the Application Owner and the Application Expert/Developer that are described in the respective sections.
Application Owner
The owner is the user responsible for defining the cost and performance policies that govern the elastic decisions made by the CELAR platform. This user can define or update the optimization criteria under which elastic actions will be taken. Moreover, this user will be able to monitor the application's performance and costs and terminate the application execution under CELAR.

Application Expert/Developer
The Application Expert is a person aware of the application's structure, its modules, its execution environment and history, etc. This actor can provide CELAR with any available information relevant to the description of the application and various submission details. Additionally, this actor will be monitoring the application's performance.

CELAR Expert
A CELAR Expert is a person who has knowledge of the CELAR system internals; he can be either a CELAR Admin or a CELAR Engineer.

CELAR Admin
The CELAR Admin is responsible for setting up and maintaining the CELAR platform. He deploys the platform inside the IaaS infrastructure and is responsible for maintaining it in the case of hardware or software faults. The CELAR Admin is experienced with the internal components and architecture of the platform.

CELAR Engineer
A CELAR Engineer is a person who understands the CELAR system and is able to interact with and operate the CELAR platform when needed. He has experience in writing custom-made resizing and deployment scripts for CELAR. As such, the engineer is someone who has knowledge of the CELAR modules and the underlying IaaS.

IaaS/Application Platform
The IaaS provider as well as the Application Platform interact with CELAR, providing it with adequate performance measurements indicating the application's resource usage, load information, cluster status, etc.

2.3.2 Use Cases

The use cases, as compiled through the requirements gathered by the user partners, impose a general structure on the lifecycle of an application that is executed over the CELAR system.
This lifecycle is pictorially described in Figure 5: the Application User's input is required in order to describe, submit and deploy his application. During these stages, the CELAR system is provided with information used to correctly deploy, manage and monitor this application. The application is then profiled and elastically managed by CELAR. These stages involve the CELAR Expert and IaaS/Application Platform actors, who provide metrics, scripts and various administrative and technical support respectively. Finally, the Application User may terminate his application.
Figure 5: CELAR Application Lifecycle

All CELAR functionalities are presented in the common UML use-case diagram of Figure 6, which describes the relations between the different functionalities in a graphical way.

Figure 6: CELAR Use-case UML Diagram
We now describe the documented use cases on a per-actor level. The actors described previously interact with CELAR through specific actions. For each actor, the use cases are:

Table 1: Use Cases for the Application User Actor

Describe Application (used by: Submit Application)
Application description is given by the Application Expert in order to give specific details of the application's structure and topology. The description must be given before the submission of the application to the CELAR platform and it can be divided into a number of shorter steps, like structural description, elasticity directives and data/load hints, as described below.

Describe Structure (used by: Describe Application)
Structural description of the application provides information to the platform about the application's topology (e.g., number and types of tiers, number and types of application components per tier, etc.) and the dependencies between them. This description is provided by the Application Expert.

Describe Elasticity (used by: Describe Application)
Elasticity directives are given by the Application Expert in order to provide information about the application's elastic behaviour towards resizing actions.

Describe Data/Load (used by: Describe Application)
Data and load hints are provided by the Application Expert in order to be used by CELAR and help the platform predict the application's behaviour under different amounts of skew, read/write/update schemes, load, etc.

Submit Application (used by: Deploy Application)
The Application Expert submits the application to the platform after the description step, where he has provided the static information of the application. The submission will proceed when details about the optimization policy and deployment are provided.

Set Optimization Policy (used by: Submit Application)
Policy details are given by the owner; these details are used by the CELAR decision module to optimize the application's behaviour according to the defined preferences.

Set Deployment Parameters (used by: Submit Application)
Deployment details are given by the Application Expert in order to be used by CELAR to initialize the configuration of separate tiers and orchestrate them at the physical level (allocate the necessary resources).

Deploy Application (used by: Profile Application)
After all description and submission information has been completed, the Application User chooses to finally deploy his application over the CELAR system.

Monitor Application (used by: Profile Application)
Application Users have access to monitoring statistics of their application, regarding the application's performance measured by any metric available at the application and physical level. Different Application Users may have different privileges in the monitoring interface.

Save/Upload Preferences
Preferences of the application's deployment or policy input given by the policy expert or owner may be saved and uploaded at any time during the application's execution, so that the user can retrieve older configurations and reuse them.

Terminate Application
The owner of an application chooses to terminate the execution of his application over the CELAR platform.

Table 2: Use Cases for the CELAR Expert Actor

Profile Application
Application profiling is the process where the application is executed under different deployment setups in order to export conclusions about the application's behaviour under different loads/committed resources. These conclusions will be very helpful to the decision module. Profiling uses the Deploy Application and Monitor Application use cases. The CELAR Engineer is responsible for invoking and controlling the profiling.

Application-script Generation
The CELAR Engineer helps the Application Expert create custom scripts so that resizing actions can be correctly implemented for the application at hand.

Maintenance
The CELAR Admin is responsible for the maintenance of the platform.

Table 3: Use Cases for the IaaS/Application Platform

Provide Monitor Statistics (used by: Monitor Application)
The IaaS/Application Platform will provide monitor statistics to CELAR, used by the Monitor Application use case.

2.4 Functional Requirements

This section derives the functional requirements from the previously described use cases and elaborates on how the required functionality is expected to be realized by the CELAR system.

Application Submission: Application Owners should be able to use a CELAR UI that allows them to submit all necessary information in order for their application to be deployed and run over a cloud infrastructure.
The first step is to be able to provide, via a user-friendly interface, meaningful information about the application to be deployed and elastically managed by CELAR. In order to assist CELAR and accommodate the general application case, a minimum set of hints relating to the structure of the application, its dependencies, sample elasticity directives and workload/data usage is expected to be provided. The second step includes (but is not limited to) providing an Optimization Policy for CELAR and selecting the appropriate VM images. The user will be able to pick one of the base VM images available from the IaaS provider. Alternatively, he can use custom machine images built and configured over the base images. Building and configuring custom snapshots is a job aided by automated tools.

Application Deployment: Application Experts should be able to perform the deployment of their application using the CELAR technology. In detail, this means that after all relevant information has been submitted, the system should reserve the right type and amount of cloud resources, guided by the given hints and application profiling, so that the initial application execution performs within the user-defined policy.

Real-time Application Monitoring: Application Users should be able to monitor current and past application performance through a user-friendly UI. Moreover, they should have access to aggregated statistics on cost and performance based on metrics of their choice. In order to provide a complete and accurate view, the monitoring system should be able to collect metrics from multiple infrastructure layers, and combine and evaluate them against costs and benefits before presenting them.

Real-time, Automated and User-defined Resource Provisioning: The fundamental requirement of the CELAR system is to perform elastic resource allocation in a completely transparent manner: Users and owners should perceive the performance and corresponding costs of their applications to vary, at all times, within the limits they defined when submitting the application.
The CELAR system should thus adaptively add and remove resources in real time, so that the perceived behaviour range is always bound by the user's requirements.

Customizable System Interaction: The users should be able to adjust the system's behaviour and output in various important aspects: First and foremost, the users should be able to change the policies governing their application's behaviour if they so wish; moreover, they should be able to alter or refine their application or its description and deployment requirements; finally, they should be able to dynamically alter the importance of metrics or the granularity of detail at which they are reported.

Application Termination: The Application Owner should be able, besides submitting and starting an application, to also terminate its execution under CELAR.
2.5 Non-Functional Requirements

Apart from the functional requirements described above, which specify the system behaviour and reflect the use cases that the CELAR actors can perform, the CELAR architecture is equally driven by several non-functional requirements which define the qualities of the system.

Scalability

Scalability is of paramount importance for the CELAR system. As the platform is expected to handle multiple I/O-intensive and data-intensive applications, scaling both in the number of managed applications and in the load/data that each one produces is a hard requirement. In essence, scalability in CELAR relates to:
i. Efficient management of multiple deployments, both with respect to monitoring multiple application layers in real time and to maintaining the required metadata per application.
ii. Taking elastic resize actions over multiple applications and huge resource pools from the IaaS provider.

High Availability

Availability of the CELAR system is an important aspect: Users should be able to manage their applications consistently and reliably via CELAR. Consequently, the system should exhibit very high levels of robustness against both hardware and software failures, and utilize redundancy, load balancing, etc., in order to ensure that CELAR components and services remain highly available.

Efficiency

The CELAR system will require many computation and storage steps and resources for analysing large amounts of monitoring and application deployment data in a reasonable time, mainly in order to provide fast and accurate resize actions. Thus, it is imperative that efficient data management schemes (in both data storage and data processing technologies) be employed at multiple stages of the CELAR workflows: during monitoring/profiling and processing of the metrics, during decision making, and during actual deployments.
Wide Applicability

To ensure wider usage of the CELAR system after the end of the project by different organizations and environments, it is necessary to develop a system that is portable, easy to deploy and maintain, and intuitive to use and extend. One step in that direction is to provide CELAR as a service: Services offer a higher level of flexibility, as they can be composed on demand to provide new functionalities. Furthermore, services can be spread and replicated on other machines, ensuring a good quality of service. CELAR has the potential of being offered by different IaaS providers as a service that dynamically manages application resource allocation. Portability can be ensured by a series of choices such as:
i. Clear API and interaction definitions: Each module in the CELAR architecture must specify a list of public methods (services) for interaction. Specifying these APIs for the software modules hides the implementation details and eases workflow creation and changes.
ii. Portable programming languages and standards: Utilize platform-independent languages like Java, and standard tools and libraries that exist on a wide variety of platforms.
User-friendliness

CELAR offers a set of tools and interfaces to allow Application Users and experts to interact with the system. As the level of interaction with CELAR requires novel information exchange (e.g., elasticity constraints per module, load hints, Optimization Policy, etc.), it is imperative that the interface be designed in a user-friendly and intuitive manner: users should not be forced to input information unknown to them, yet they should provide useful hints for CELAR operations.
3 CELAR Approach and Architecture

3.1 Current Approach to Scaling

CELAR aims at providing an elasticity layer that is currently missing from cloud infrastructures. Applications need to be able to take advantage of the elastic, pay-as-you-go resource provisioning nature of cloud infrastructures in a transparent and customizable manner. In that sense, a cloud-based application should ideally be able to automatically scale both horizontally and vertically, based on an expert- or user-provided policy. Resources should be dynamically allotted to or freed from the application at runtime, so that the application's performance and cost remain within an objective function specified by the expert.

Figure 7: Standard Layering of Cloud-based Applications

Figure 7 depicts the current layering in the realm of cloud computing infrastructures and applications. There exists a clearly defined 3-layer architecture that consists of the physical layer, the IaaS layer and the PaaS/SaaS layer. The offerings of each layer are based on the layer under it, gradually adding up to the available functionality:

In the physical layer, hardware and networking resources such as compute nodes, storage entities, switches, optical fibre infrastructure, etc., exist, possibly distributed over several datacenters.

In the IaaS (Infrastructure as a Service) layer, resources from the physical layer can be offered to clients on an on-demand basis rather than being purchased. The IaaS layer uses virtualization techniques to enable leasing of resources as opposed to their exclusive usage and billing. These resources are provided on demand via easy-to-use remote interfaces.
In the PaaS/SaaS (Platform/Software as a Service) layer, the resources to actually build applications (PaaS case) or the applications themselves (SaaS case) are provided. The PaaS layer facilitates the development and deployment of applications without the cost and complexity of buying, managing and configuring the underlying hardware, middleware and software layers. SaaS refers to an actual application delivered over a browser. SaaS eliminates the need to install and run applications on the customer's private infrastructure and simplifies maintenance, upgrades and support.

Currently, developers building their application on a specific platform or owners deploying custom-made applications over a cloud infrastructure face specific limitations. While elastic scaling is currently possible for some applications, allowing them for example to increase measurable metrics such as throughput when adding more compute and storage resources, the specific actions to be performed and their effects on performance and cost are unclear to the expert. They are thus forced to manually select the exact number of VMs to be added or the amount of memory to be allocated to a VM; they are also forced to carry out those actions by following a defined, non-trivial series of two steps:
i. Ask for the commissioning (or freeing) of the required resources from the infrastructure provider (IaaS provider).
ii. Manually orchestrate the insertion (or deletion) of these resources into (or from) the current application runtime.

As an example, let us consider a standard, 2-layer application that comprises a Web/Application Server in the first layer and a distributed data store (e.g., HBase [Hbase]) as the storage backend. Users send queries via a user interface (UI) that are received by the application server. The server asks for data from the datastore using the HBase API. Returned data is processed at the application server and the result is returned to the client(s).
Assuming the expert wishes to scale out HBase in order to increase its query throughput, they must (indicatively) decide and act on the following:
i. Relative to the current load on the HBase cluster, they must decide on the number and type of additional HBase RegionServer instances (new VMs) to be added.
ii. Using the provider's tool(s), they must request the specific VMs and wait until they have been commissioned, registered and given an IP address (assuming they were initialized with the right image file or snapshot, including HBase and Hadoop [Hadoop]).
iii. Inject script files/directives into the HBase Master and RegionServers that add the newly added slaves to the cluster.
iv. Possibly restart the HBase cluster.

Therefore, it is currently an expert's task to define the size and type of a scaling action and the time of its occurrence, as well as to ensure correct orchestration of the virtualized resource allocation and its use by the application. While these are obvious burdens, there still exist a number of unaddressed pitfalls: How is the expert made aware of a problem in the application's performance? Are the metrics they are able to watch in real time application-based and connected to performance and cost? How is the expert able to decide which part of their application causes the problem? How does the expert decide whether to add VM instances, storage, etc. for a given component? What is the effect of each action on performance or cost?
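For illustration, the four manual steps above can be sketched as a script. The `iaas` and `hbase_admin` objects below are hypothetical stand-ins for a provider's API and an HBase administration interface; they are not real CELAR, IaaS or HBase interfaces.

```python
# Sketch of the manual scale-out procedure; all interfaces are
# illustrative assumptions, not real provider or HBase APIs.

def scale_out_hbase(iaas, hbase_admin, current_load, per_server_capacity,
                    image_id="hbase-regionserver-snapshot"):
    # Step i: decide how many RegionServer VMs the current load requires.
    needed = -(-current_load // per_server_capacity)  # ceiling division
    extra = max(0, needed - hbase_admin.region_server_count())

    # Step ii: request the VMs from the provider and wait for IP addresses,
    # assuming they boot from an image that already contains HBase/Hadoop.
    new_vms = [iaas.launch(image_id) for _ in range(extra)]
    ips = [iaas.wait_for_ip(vm) for vm in new_vms]

    # Step iii: inject the new slaves into the Master's server list.
    for ip in ips:
        hbase_admin.add_region_server(ip)

    # Step iv: possibly restart the cluster.
    if extra and hbase_admin.needs_restart():
        hbase_admin.rolling_restart()
    return extra
```

Each of these steps is exactly what CELAR aims to automate: the sizing decision (step i), the IaaS interaction (step ii) and the application orchestration (steps iii and iv).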
CELAR's vision is to provide a unified framework that automates all these decisions for any cloud-based application. CELAR plans to provide a set of methods, implemented via open-source tools, that enable Application Experts to intelligently describe, deploy, monitor and scale their application.

3.2 CELAR Architecture

In contrast to the current state as depicted in Figure 7, CELAR enhances the functionality provided by current cloud infrastructures in order to provide automated, multi-grained, elastic resource provisioning for cloud-based applications. CELAR contributions and the respective modules can be categorized along three areas (in one-to-one correspondence with the research areas defined in the submitted CELAR DoW):
i. Application Management: Modules and methods that enable intelligent, application- and user-aware description and deployment of cloud-based applications. Moreover, this layer exposes real-time, application-based performance and cost metrics, an overview of the current and past status of the application, as well as the available resources (software and hardware) from the underlying IaaS.
ii. Monitoring: A scalable, distributed, real-time monitoring framework allowing the remote collection and storage of statistics that come from the infrastructure, platform and application layers. Monitoring will be cost-evaluated by considering different cost factors at the application, resource and infrastructure levels.
iii. Elasticity: This basic layer consists of all the algorithms and modules that are necessary in order to provide automatic resource allocation based on the application characteristics, the user-defined optimization and the incoming load. Moreover, the elasticity platform is responsible for maintaining all necessary information about past and current application deployments, for the orchestration of added or removed resources, as well as for methods ensuring robustness and availability of the elastic operations.
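As a minimal illustration of the cost-evaluated monitoring in area ii, per-layer metrics could be folded into a single cost figure. The metric names and cost factors below are assumptions made for the sketch, not the CELAR monitoring model.

```python
def evaluate_deployment(metrics, cost_factors):
    """Combine per-layer metrics into one cost figure.

    metrics: {layer: {metric_name: value}} with entries from the
    infrastructure, platform and application layers.
    cost_factors: {(layer, metric_name): cost_per_unit}; metrics
    without a cost factor contribute nothing.
    """
    total = 0.0
    for layer, layer_metrics in metrics.items():
        for name, value in layer_metrics.items():
            total += value * cost_factors.get((layer, name), 0.0)
    return total

# Example inputs: 10 VM-hours plus 2000 application-level I/O calls.
sample = {"infrastructure": {"vm_hours": 10},
          "application": {"io_calls": 2000}}
factors = {("infrastructure", "vm_hours"): 0.12,
           ("application", "io_calls"): 0.0001}
```

With these sample figures, `evaluate_deployment(sample, factors)` yields 10 × 0.12 + 2000 × 0.0001 = 1.4 cost units.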
This architecture is depicted in Figure 8. The application management modules will be developed and provided under the c-eclipse framework and exposed via meaningful, user-friendly UIs to the end-users and Application Experts. It is a goal of the CELAR consortium to provide a scalable, modular and easily deployable solution that is easy for both application developers and IaaS providers to integrate into their IaaS/PaaS platforms. As such, both the monitoring and the elasticity platform modules will be deployed inside the cloud provider's realm, ensuring tight integration and transparency for the users. The use of standard APIs, open-source tools, platform-independent programming languages and wide coverage of underlying platforms will ensure the utilization of CELAR by a wide range of applications and existing cloud platforms and providers.
Figure 8: CELAR System Architecture

3.3 CELAR Components

In this section we provide a detailed description of the modules that comprise the CELAR system along the three areas defined above. Note that detailed descriptions, including data models, APIs, etc., will be given in the respective WP deliverables.

3.3.1 Application Management Platform

3.3.1.1 Application Management Framework

The Application Management Framework will be implemented on top of the reliable Eclipse [Eclipse] platform and will follow its plug-in based software architecture. It will take advantage of the Eclipse GUI that thousands of users around the world are accustomed to and utilize extensively on a daily basis. Therefore, the intuitive and user-friendly GUI through which users will interact with any cloud infrastructure will minimize the complexity of the application description, submission and monitoring processes. In turn, this will result in a low entry barrier for Application Users who are new to the Cloud, while simultaneously improving the workflow efficiency of experienced users.

Nevertheless, choosing to develop c-eclipse over the Eclipse framework has additional benefits. The c-eclipse framework will be platform independent, running on any platform which is supported by Eclipse (Windows, Linux, Sun Solaris, Mac OS X and others). By utilizing graphical libraries such as SWT, the Eclipse GUI is not only platform independent but also always has the look and feel of a native application. Furthermore, the availability of high-quality tooling, such as the Eclipse Modelling and Communication frameworks, as well as the vast supplementary
resources (documentation, tutorials, examples, etc.) will ease the work of the development team and enable them to focus on the delivery of a high-quality end product. Importantly, thanks to the modular Eclipse architecture that adheres to the OSGi [OSGI] framework, the c-eclipse functionality and GUI can be easily extended and customized to support either new Cloud-related technologies or additional requirements. Finally, by integrating c-eclipse into the Eclipse ecosystem, we will be able to: i) guarantee its long-term sustainability through a strong international community of users and developers and ii) increase the CELAR project's visibility and thereby further improve its rate of adoption.

3.3.1.2 Application Description Tool

The Application Description Tool is a c-eclipse component that requires input from the Application Expert in order to facilitate the process of application deployment, as well as other critical CELAR decisions. In general, the Application Description Tool will provide an intuitive graphical user interface (GUI) that enables the Application Expert to describe the application in an efficient way. Consequently, the graphical application description will be translated into a formal specification, useful to the decision-making and orchestration modules. Specifically, the OASIS TOSCA specification [TOSCA] will be utilized for this purpose. The TOSCA specification provides a language for describing services in an interoperable manner, so that they can be implemented in different cloud environments without much extra effort. TOSCA descriptions hide the intricacies of hardware and simplify application development on alternative Clouds. There are two main parts in a TOSCA service description: the service topology and the orchestration processes. The service topology specifies the service components and their relationships, while the orchestration processes describe the service's management procedures.
Thus, a single TOSCA document combines all the required information for deploying and managing an application throughout its lifecycle, over different cloud environments. However, in order to utilize TOSCA in CELAR, we need to extend the specification to provide abstractions for describing elastic cloud services. Such services adhere to specific elasticity models and usually require custom scaling plans in order to scale elastically throughout their lifecycle. It is a priority of the project to contribute to subsequent versions of the Open TOSCA Standard by conveying the experience collected. The information that will be included in the extended TOSCA application description is:

Application topology: The different components (nodes) that the application consists of, and the relationships or dependencies between these components. For example, an application may consist of a Web Server that depends on a database.

Application components' elasticity models: The Application Expert must specify which components of the application can be elastically adapted and which types of elasticity each component adheres to. For example, a Hadoop cluster is an elastic component which can be elastically adapted by adding new slave nodes to the cluster.

Application load/data hints: The Application Expert can give hints about the load of each component, or even the load of the entire application. For example, an application component can be read- or write-heavy, or the incoming requests to a component might follow a specific pattern. Similarly, he can provide insights on the type, size, location, etc. of the data utilized inside the application.
Elasticity actions: For each elastic component, the Application Expert must define the resizing actions that can be applied to it (e.g., add VM, add storage space, etc.). The Application Expert must also provide the scripts to be executed for each action.

3.3.1.3 Application Submission Tool

The Application Submission Tool is a c-eclipse component that requires the application description created by the Application Description Tool. It enriches the application description with additional information related to the current submission. For each new submission, the Submission Tool retrieves the respective application description file from the CELAR DataBase. According to the application description, the Submission Tool gathers the following information from the Application Expert:

Optimization policy: The Application Expert needs to specify the optimization strategy of his application. For example, he might want to improve application-side metrics (throughput, response time, etc.), or he can try to minimize the application's deployment cost (e.g., the maximum deployment cost must not exceed 500 per month).

Deployment artefacts: References to the necessary executable files for materializing instances of an application component. Some of these artefacts can be custom (a reference to a custom machine image), while other artefacts can be chosen from a list provided by the information system.

Implementation artefacts: References to the necessary executable files for a component to operate after its deployment, for example references to custom scripts and .jar files.

After receiving this input, the Application Submission Tool creates a new TOSCA file that contains both the description and the submission information. The new TOSCA file is saved to the CELAR DataBase through the CELAR Manager; the application is now ready to be deployed.
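Putting the description and submission information together, the resulting TOSCA file would carry roughly the following content, sketched here as plain Python structures rather than actual TOSCA XML. All component names, script paths and figures are made-up examples.

```python
# Hypothetical content of an extended description plus submission;
# illustrative only, not actual TOSCA syntax.
description = {
    "topology": {                      # components and their dependencies
        "web_server": {"depends_on": ["database"]},
        "database": {"service": "Apache Cassandra"},
    },
    "elasticity_models": {             # which components scale, and how
        "database": {"elastic": True, "model": "add/remove data nodes"},
    },
    "load_hints": {                    # expected workload characteristics
        "database": {"workload": "read-heavy"},
    },
    "elasticity_actions": {            # allowed resize actions and scripts
        "database": [{"action": "add VM", "script": "scripts/add_node.sh"}],
    },
}
submission = {
    "optimization_policy": {"objective": "minimize deployment cost",
                            "max_cost_per_month": 500},
    "deployment_artefacts": {"database": "custom-cassandra-image"},
    "implementation_artefacts": {"database": ["scripts/add_node.sh"]},
}
```

The Submission Tool's job, in these terms, is to merge the two dictionaries into one document and store it in the CELAR DataBase for deployment.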
3.3.1.4 CELAR Information System

The role of the CELAR Information System (IS) is to provide an interface for c-eclipse users to inspect their current and previous deployments. Users can also utilize the Information System to compare different IaaS providers, as well as search for the resources offered per IaaS. The c-eclipse information tool provides an interface for the Application Expert to query and access the following types of information residing in the CELAR DataBase:

User profile (authentication type, metadata, etc.) and history (past submissions and history of actions)

Pricing schemes of the provider

Resizing actions that the provider allows

VM image formats (i.e., VMware, AMIs, etc.) that the provider offers/accepts. These are images available through the provider's marketplace (e.g., Ubuntu Server 12.04, Debian Squeeze 6.05, etc.)

Services available from CELAR's software repository (e.g., Apache Hadoop, Apache Server, MySQL, Apache Cassandra, etc.)

Monitoring metrics the provider offers through the CELAR Monitoring System
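The queries above could be exposed through an interface along these lines. The class and method names are illustrative assumptions for the sketch, not the actual CELAR IS API.

```python
class InformationSystem:
    """Hypothetical sketch of the c-eclipse information tool's queries,
    backed by records held in the CELAR DataBase."""

    def __init__(self, db):
        # db is a plain mapping standing in for the CELAR DataBase,
        # e.g. {"providers": {...}, "users": {...}}.
        self.db = db

    def user_history(self, user):
        return self.db["users"][user]["history"]

    def pricing_schemes(self, provider):
        return self.db["providers"][provider]["pricing"]

    def allowed_resize_actions(self, provider):
        return self.db["providers"][provider]["resize_actions"]

    def offered_images(self, provider):
        return self.db["providers"][provider]["images"]
```

Comparing IaaS providers then amounts to issuing the same query for each provider key and contrasting the results.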
3.3.2 Elasticity Platform

The elasticity provisioning platform is a middleware that hosts all components central to CELAR, namely:

Decision Module: makes real-time, informed decisions on the type and quantity of resources that need to be added to or removed from a running application.

Resource Provisioner: a module that undertakes the task of automated, on-demand creation of multi-machine runtime environments. This also encapsulates:
o The interaction with the underlying cloud infrastructure for resource request and release, i.e., the translation of high-level get/add/remove resource commands to low-level (IaaS-specific) commands (Cloud Orchestration).
o The task of updating the application configuration in order for it to employ the newly committed resources or detach those freed (Application Orchestration).

CELAR DataBase: a central storage module that maintains information useful to other CELAR components.

CELAR Manager: handles action orchestration in the elasticity platform, operates upon the CELAR DataBase and provides fault-tolerance.

Application Profiler: a module that will allow the monitoring and measurement-based characterization of the application's behaviour over a number of representative resource provisioning and load scenarios.

We now proceed to describe these modules. A more detailed explanation of their functionality and dependencies can be found in D3.1.

3.3.2.1 Decision Module

The role of the Decision Module is to analyze and control the elasticity of cloud applications. The Decision Module considers elasticity as a complex multi-dimensional property with three main dimensions: quality, cost and resources, which are further decomposed into sub-dimensions (e.g., the quality dimension can be decomposed into quality of data (QoD) and performance, and the quality of data can be further decomposed into data completeness, accuracy, etc.).
Considering this multi-perspective quality, cost and resources view on elasticity, CELAR supports high-level requirement specification over elasticity metrics. The Decision Module decomposes these metrics, mapping high-level elasticity requirements to restrictions over low-level metrics (e.g., cost per application is decomposed into the cost resulting from the number of allocated virtual machines and the cost resulting from I/O calls). The Decision Module requires information related to the respective application, such as the application structure, elasticity models, application profile and the available elasticity actions for each of the application's logical components. The application structure information contains the application topology, defining a logical hierarchy of components. Using this information and real-time application monitoring information, the Decision Module analyzes and selects elasticity adaptation actions for each application logical component.

At the base of the application structure, the component concept represents a functional or data unit of the application which can be deployed on one or more underlying virtual machines (e.g., a Cassandra node, an Application Server, a Hadoop master or slave, etc.). Components are grouped together into complex components, enabling elasticity control over groups of components (e.g., Cassandra clusters, Hadoop clusters, etc.). Based on this logical structure, we define two levels of elasticity control: component-level control, focusing on horizontal component scaling by managing physical resources such as virtual machines, disk volumes or networks (e.g., instantiating a new Cassandra data node), and complex-component-level control, focusing on component management instead of basic resource management (e.g., adding entire Cassandra or Hadoop clusters). In terms of cost-related decisions, the Decision Module will also consider factors influencing the cost of running cloud applications, such as available pricing schemes or availability regions. Quality is also included in the decision process, as the Decision Module generates action plans that ensure the user-requested quality (such as response time, I/O performance, network bandwidth or data quality), within a given budget and with a minimum set of required resources.

Figure 9 shows the types of input data used by the Decision Module to generate action plans. At application deployment time, the Decision Module generates an initial application deployment configuration and a cost estimation for running the target application. The smart application deployment can be bypassed if the Application Expert has provided a deployment configuration. Targeting runtime control of cloud application elasticity, the Decision Module generates an action plan aimed at enforcing the specified elasticity requirements, the application elasticity analysis report on which the action plan is based, and a refined cost evaluation.

Figure 9: Decision Module overview

The Decision Module contains three main components: the learning engine, the application elasticity analysis engine and the planning engine. The learning engine uses data coming from the Application Profiler, the monitoring module and historical decisions to predict the effect an action will have upon the application in terms of elasticity, as well as metric correlations (e.g., how cost correlates with quality, how CPU usage correlates with response time, etc.).
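The decomposition of high-level requirements into low-level metric restrictions might look as follows. The decomposition table and the threshold semantics are illustrative assumptions, not the Decision Module's actual model.

```python
# A high-level metric is assumed here to be the sum of its low-level
# parts, mirroring the cost example above (VM cost plus I/O cost).
DECOMPOSITION = {
    "cost": ["vm_cost", "io_cost"],
}

def violated_requirements(requirements, measurements):
    """Return the high-level metrics whose limits are exceeded.

    requirements: {high_level_metric: upper_limit}
    measurements: {low_level_metric: current_value}
    """
    violated = []
    for metric, limit in requirements.items():
        value = sum(measurements.get(m, 0.0) for m in DECOMPOSITION[metric])
        if value > limit:
            violated.append(metric)
    return violated
```

A report of such violations is, roughly, what the analysis engine would hand to the planning engine in order to trigger an action plan.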
The application elasticity analysis engine analyses cloud services elasticity capabilities in order to detect services that can fulfill the required elasticity requirements, and cloud application behavior as to detect violations of elasticity restrictions. The planning engine uses the report compiled by the analysis engine and information from the learning engine for selecting cloud services to be used and elasticity actions to be applied on running cloud application to build an action plan that would enforce the requested elasticity restrictions. The plan is sent to the Resource Provisioner via the CELAR Manager for enforcement. 3.3.2.2 Resource Provisioner The Resource Provisioner is a module that allows automated provisioning and creation of cloud resources. The Provisioner module provides simple access to a cloud infrastructure. Specifically, it provides repeatable deployment of multi-component distributed applications on any cloud infrastructure. As such, it is a module that must encapsulate but also intelligently handle and automate the communication between application deployment and the underlying IaaS as well as between resource reallocation (elastic actions) and IaaS. The following are a set of high level steps that the Provisioner needs to take in order to deploy a cloud application: Configure Cloud accounts: Set required parameters for supported Cloud providers for the users of the system. Define Deployment: In order to achieve that, the module should give its user the opportunity to pick one of the base VM images available from the IaaS provider, configure it by installing his custom application and create a snapshot image that can be used for automated deployment. 
To define the application deployment, users attach image snapshots to application components; add deployment scripts to the images; define the initially required resources (CPU/RAM/Network/Extra disks) and default multiplicities; and define input/output parameters on the images and between application components.
- Launch Deployment.
- Follow progress of the Deployment: On a successful launch of a Deployment, real-time updates on the states of the Deployment should be kept.

The launch-deployment task above requires both IaaS- and application-specific actions. In detail: the Cloud Orchestration is required in order to translate higher-level elasticity commands from the decision-making module into specific IaaS resource-allocation commands. The Application Orchestration module will ensure that newly committed resources are identified and used by applications, and that deleted ones are no longer utilized (if/where applicable). As soon as the Provisioner has successfully reserved and launched the required IaaS resources, the Application Orchestration will be responsible for incorporating these resources into the running application. These actions can be performed by utilizing deployment tools like Chef [Chef] or Puppet [Puppet]. In any case, a set of well-defined steps in the form of a workflow of scripts (possibly custom-made on a per-application basis and stored in the CELAR DataBase) and elasticity actions needs to be clearly defined so that the module can perform the appropriate actions. At this phase, CELAR plans on utilizing the SlipStream [Slipstream] application to play the role of the Resource Provisioner. SlipStream will soon be released under an open-source license. The work done by the Cloud and Application Orchestration modules will then be integrated into the SlipStream environment: the higher-level elasticity commands can be compiled into SlipStream scripts that will directly interact with the IaaS layer (using hooks with the most popular APIs).
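The deployment-definition steps above can be sketched as a simple in-memory data model. The class and field names below are illustrative assumptions only, not the SlipStream or CELAR schema.

```python
# Hedged sketch: a plausible shape for a Provisioner deployment definition,
# mirroring the steps listed above (attach snapshots, add scripts, set
# resources and multiplicities). All names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class ComponentImage:
    snapshot_id: str                    # snapshot of base VM image + installed component
    deploy_scripts: list = field(default_factory=list)
    cpu: int = 1                        # initially required resources
    ram_gb: int = 2
    multiplicity: int = 1               # default number of instances

@dataclass
class Deployment:
    cloud_account: str                  # configured Cloud-provider account
    components: dict = field(default_factory=dict)  # name -> ComponentImage

    def launch_order(self):
        """Placeholder launch ordering; a real Provisioner would order
        components by their dependency graph."""
        return sorted(self.components)

dep = Deployment(cloud_account="demo-account")
dep.components["db"] = ComponentImage(snapshot_id="snap-db", multiplicity=3)
dep.components["web"] = ComponentImage(snapshot_id="snap-web",
                                       deploy_scripts=["install.sh"])
```

A real definition would additionally carry the input/output parameters wired between components, which the orchestration scripts would resolve at launch time.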
3.3.2.3 CELAR DataBase

In order to avoid the creation of many small databases that store component-specific information, we propose the creation of the CELAR DataBase, a central repository that can be used to store information from CELAR components other than monitoring. The DataBase may consist of multiple datastore technologies, such as NoSQL databases (for large-scale, high-throughput data storage) along with relational DBMSs (for efficient aggregate queries on multiple data dimensions). We anticipate hosting two types of information, namely static information (i.e., information that does not change regularly) and dynamic information (i.e., information that is incrementally updated or changed over time), regarding the following:

Static information:
- Application topology
- Application components' elasticity models
- Application components' resizing actions
- Policy information
- Deployment artefacts
- Cloud resources available per IaaS (e.g., VM sizes, pricing schemes, etc.)
- Resizing actions allowed by the providers (e.g., add/remove storage, VM, etc.)

Dynamic information:
- Deployed application structure
- Historical elastic actions/decisions
- Profiling/Monitoring data

3.3.2.4 CELAR Manager

The CELAR Manager is a component whose role is to provide general orchestration and synchronization among the Elasticity Platform components. Moreover, it will also supply fault tolerance and assist in application deployment. More specifically, the CELAR Manager's functions are to:

- Synchronize and orchestrate messages among CELAR components.
- Provide the required data-processing capabilities to the CELAR DataBase. Examples of required computation include aggregation of historical data, data movement to secondary storage, and optimization of data entry and retrieval relative to incoming queries, types of data and the storage technologies available.
- Provide fault-tolerance mechanisms, such as checkpointing, for failing modules.
- Orchestrate and execute the deployment of the CELAR Application Orchestrator during application deployment (see Section 3.4).

3.3.2.5 Application Profiler

The Application Profiler is a process, rather than a module, that allows the monitoring and measurement of the application's behavior over representative resource-provisioning and load scenarios. This way, CELAR will form a clear model of the elasticity, cost and performance properties of each application module. The collected information will form the basis of the learning knowledge required by the Decision module.
The profiling process will be compiled using information gathered by the Application Description tool, as given by the Application Expert. In general, according to the different modules and their possible operational statuses, a fine-grained exploration of the application's performance under a representative set of elasticity actions will be performed. The description tool will explicitly define the different module interactions, along with the resizing actions per module that affect its performance. The results of the profiling process will be stored in the CELAR DataBase. The Decision module will constantly utilize and augment the initially observed data with actual responses to decision actions. An important aspect to be investigated during the module's development is whether the process of deployment creation, load generation and execution will be fully automated or aided by the CELAR Expert (as currently shown in Figure 6). We envision that limited human intervention to evaluate the cost and time constraints of a possibly large number of deployments and their execution will be valuable.

As an example of the use of the Profiler, let us assume an HBase component that supports an add-worker-VM resizing action. The Profiler investigates the impact of the resizing action on the module's throughput, latency, CPU/memory usage, etc. We can start from the high-level elasticity models provided by the expert (e.g., adding a VM results in higher throughput) and refine them by running automatic tests with different configurations (i.e., numbers of worker VMs). The output will be a detailed elasticity model that can help the Decision module learn faster, avoid initial errors (and thus financial costs) in scaling decisions, and suggest an initial deployment setup.
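The HBase example above can be sketched as a small refinement loop: measure throughput for each worker count, record the results as a data-point model, and read a recommended configuration off it. The benchmark function is a synthetic stub standing in for a real profiling deployment; all names are hypothetical.

```python
# Illustrative only: refining a coarse elasticity model
# ("add worker VM -> higher throughput") into measured data points.

def measure_throughput(workers):
    # Stub for an actual benchmark run of the deployed component (e.g. HBase)
    # at the given worker-VM count; returns ops/s. Purely synthetic here,
    # with diminishing (then negative) returns as workers are added.
    return 1000 * workers - 100 * (workers - 1) ** 2

def profile_add_worker(min_workers, max_workers):
    """Measure throughput for each configuration, forming an elasticity model."""
    model = {}
    for w in range(min_workers, max_workers + 1):
        model[w] = measure_throughput(w)
    return model

def best_config(model):
    """Smallest worker count that delivers the peak measured throughput."""
    peak = max(model.values())
    return min(w for w, t in model.items() if t == peak)

model = profile_add_worker(1, 8)
```

The resulting model (a throughput figure per worker count) is exactly the kind of per-action data point the Decision module could consult before committing to a resizing action.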
3.3.3 Cloud Information and Performance Monitor

3.3.3.1 Monitoring System (MS)

The role of the Monitoring System is to collect, process and distribute monitoring metrics to interested CELAR components and to subscribed users. CELAR adds a monitoring layer to the cloud infrastructure for monitoring virtual resources and application performance. While an application is up and running, Monitoring Agents collect data via probes from the virtual-machine, virtual-cluster, cloud and application levels. The collected monitoring data is then forwarded to the MS Server. To reduce network traffic, metric aggregation, rules and filtering can be applied when requested by interested entities via the MS Server. Rules can be time-based (e.g., report CPU usage every 5 seconds) or event-based (e.g., only notify me if allocated memory usage is over 70%). Monitoring data is distributed to users and interested CELAR System components using one of the three available delivery mechanisms:

- Pub/Sub mechanism: entities subscribe to a stream of events. Metrics are pushed to the entity when they become available.
- Query/Response mechanism: entities request metrics and the MS responds with one single response.
- Notification mechanism: entities request to be notified only when a threshold has been violated.

3.3.3.1.1 MS Main Components

The main components of the Monitoring System used to gather, distribute and receive monitoring metrics are:
- Probes are small programs/scripts that are used to gather raw metrics and generate timestamped monitoring events. Probes can be implemented using a library that will be provided (the CELAR Probe Interface). Probes will be utilized in the CELAR System to collect metrics concerning resource allocation and usage at the virtual-cluster and virtual-machine levels. Probes will also be used to collect application performance metrics at the application level.
- Monitoring Agents are the entities responsible for adding (or removing) probes when requested, collecting raw metrics from probes and forwarding them to the corresponding MS Server. Probes are implemented separately from Agents, allowing them to be added to (or removed from) an Agent dynamically at runtime, without interfering with the monitoring process.
- MS Servers are the entities responsible for collecting metrics from Agents and distributing the acquired metrics to the interested entities. The basic functionality of an MS Server can be extended to provide Aggregation, Caching, Filtering and also Pre-Processing (e.g., cost evaluation).
- Clients are entities that request and receive metrics from MS Servers. A Client Interface can be implemented to create many types of clients. The Decision Module of each application will implement a Client interface to receive metrics from MS Server(s), combine the received information and perform analysis in order to make decisions. The c-eclipse MS Visualization tool will also implement a Client interface to receive metrics and present the data visually (e.g., as graphs).
- Metric Storage: After raw-data processing, the MS Server stores fresh monitoring metrics locally in an application monitoring repository, to be consumed mainly by the decision-making module via pull requests. This allows the Decision Module to quickly retrieve requested metrics when needed, without being constantly flooded with new metrics.
In addition, it reduces query latency to a minimum, since a central monitoring database does not need to be queried for metrics. When the local repository is either full or a time interval has expired, it forwards the stored metrics to an Event Processor, the entity responsible for Filtering, Compressing and Storing the processed data in the CELAR DataBase for historical purposes.
- MS Visualization Tool: Application Users can view monitoring data for their current deployment via an intuitive, easy-to-use graphical user interface embedded in the c-eclipse Platform.

3.3.3.1.2 Metrics

The CELAR MS will adopt a hybrid monitoring approach. It will utilize the passive collection mechanisms of existing open-source monitoring tools such as Ganglia [Ganglia] and/or Nagios [Nagios] to collect raw monitoring data from the infrastructure components. Active monitoring will also be supported, with MS probes gathering any metrics that the aforementioned monitoring tools cannot provide. As CELAR progresses, the MS will be enhanced with more and more monitoring probes, gradually removing the need to use pre-existing monitoring probes from the above systems. The CELAR MS will be extensible and can expand to include new features and metrics. If a cloud provider already has a Monitoring System implemented in the virtualization layer and wishes to override the default probes (or a portion of them), he may substitute them with custom-made probes that implement the CELAR Probe Interface. In a similar way, the MS can be extended to provide custom application performance metrics by giving the application developer the ability to write MS probes and deploy them through the c-eclipse platform. The following list presents the most important default metrics that the Monitoring System will provide:
Table 4: Default Monitoring System Metrics

Group   | Metric                  | Description                                                                                          | Unit
CPU     | Load                    | X-minute load average, X ∈ {1, 5, 15} min                                                            | %
CPU     | Idle                    | Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk IO request | %
CPU     | User                    | Percentage of CPU utilization that occurred while executing at the user level                        | %
CPU     | System                  | Percentage of CPU utilization that occurred while executing at the system level                      | %
CPU     | Wait IO                 | Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk IO request | %
CPU     | Num of CPUs             | Total number of CPUs (collected once at Agent init)                                                  | Num
CPU     | CPU Speed               | CPU speed in MHz (collected once at Agent init)                                                      | MHz
Disk    | Disk Used               | Maximum percent used over all partitions                                                             | %
Disk    | Disk Total              | Total available disk space, aggregated over all partitions                                           | GB
Disk    | Disk Free               | Total free disk space, aggregated over all partitions                                                | GB
Disk    | Reads (Writes) per sec  | Number of reads (writes) to disk per second                                                          | ops/s
Disk    | Reads (Writes) Transfer | KB of data transferred to disk per second                                                            | KB/s
Memory  | Memory Total            | Total amount of memory                                                                               | KB
Memory  | Swap Total              | Total amount of swap space                                                                           | KB
Memory  | Memory Free             | Amount of available memory                                                                           | KB
Memory  | Memory Cache            | Amount of cached memory                                                                              | KB
Memory  | Memory Shared           | Amount of shared memory                                                                              | KB
Memory  | Swap Free               | Amount of available swap memory                                                                      | KB
Process | Processes Total         | Total number of processes                                                                            | Num
Process | Processes Running       | Total number of running processes                                                                    | Num
Network | Packets In (Out)        | Packets in (out) per second                                                                          | Pckt/s
Network | Bytes In (Out)          | Number of bytes in (out) per second                                                                  | B/s
System  | OS Name                 | Operating system name                                                                                | String
System  | OS Release              | Operating system release date                                                                        | String
System  | Machine Type            | System architecture                                                                                  | String
System  | Machine Location        | Location of the machine                                                                              | String
System  | Boot Time               | The last time that the system was started                                                            | Time
System  | System Time             | Time as reported by the system clock                                                                 | Time
System  | Heartbeat               | Last heartbeat                                                                                       | Time

The CELAR MS will also be enhanced with application probes in order to gather application performance metrics for widely used applications such as Apache Server, Cassandra, Hadoop,
MySQL, as well as any other metrics needed by the developers of the pilot applications in the CELAR consortium.

3.3.3.2 Interceptor

The Interceptor is a component devised to accurately monitor intra-application performance for applications with multiple layers that communicate during runtime. The Interceptor module will be responsible for providing application performance metrics (latency/throughput for basic intra-module operations) to the Decision Module. This is important: for applications with multiple layers, CELAR needs to know not only the end-to-end performance but also exactly which layer is the bottleneck. To facilitate the Interceptor module, each edge of the dependency graph defined in the application topology (i.e., each communication channel between application layers) can be annotated with information. We will investigate different means of providing such annotations. Some of them will be optional, as the Application Expert might not have full access to the code or the implementation details of each module. Specifically, we plan to investigate:

- Basic API requests/responses (optional): The Application Expert can specify the basic types of API requests/responses that are used for module communication. For example, if the module is HBase, the basic API calls would be get(), put() and delete(). This information can be used by a packet sniffer in order to estimate a module's throughput.
- Fake test API requests (optional): The Application Expert can define basic test API calls that will help estimate each module's latency. For example, if the module is HBase, we can have the Interceptor send fake random get() requests and measure the latency deviations. In this scenario, the Interceptor must know the average size of each request/response in order to give realistic estimations of latency and throughput between application tiers.
- Automatically generated checkpoints (optional, for open-source modules): For each API call given by the expert, we can automatically inject checkpoint code that sends information to the Interceptor or the monitoring module. For example, if we have the source code of a web-server module that communicates with HBase, as well as the basic API calls, we can automatically scan the code and inject start/stop checkpoint code around each (or a sample of) get() operations.

3.4 CELAR Workflows

Figure 10 gives a graphical overview of the CELAR deployment topology, consisting of the following containers:

- CELAR Client: Contains all CELAR components that run on client machines, such as the c-eclipse modules.
- CELAR Server: A server-side CELAR deployment that contains the main CELAR process (the CELAR Manager), the CELAR DataBase and the Resource Provisioner. The CELAR Server can be deployed inside or outside the cloud provider, yet our choice at this point is to have a uniform CELAR deployment within the realm of a cloud provider.
- CELAR Application Orchestrator: For each application deployment, the CELAR Manager will launch instances of various CELAR modules inside an orchestrator VM that will be responsible for deploying, monitoring, resizing, etc., the current application. The CELAR Application Orchestrator consists of:
o The CELAR Orchestrator and the Provisioner Orchestrator processes. The CELAR Orchestrator contains the execution and data logic of the CELAR Manager for this specific application. The Provisioner Orchestrator is a process that eases the deployment of the application; at this point, we envision it implementing the Cloud Orchestration functionality (i.e., requesting and freeing specific resources from the underlying IaaS).
o The Decision module process, which implements the module's logic and stores the respective data for this specific application.
o The Monitoring Server process, responsible for gathering, processing and storing metrics for this specific application.
o The Profiler process, which creates smart benchmarking scenarios for this application to assist decision-making.

- Application: This layer contains all the application VMs that are launched, controlled and monitored by CELAR.

Based on this deployment scheme, we now describe the basic workflows of the CELAR system.

Figure 10: CELAR Deployment Overview
3.4.1 Application Description-Submission Workflows

The following steps describe the series of actions taken in order for an Application Expert to describe his application and submit it for deployment and later execution in the CELAR system. They are pictorially described in Figure 11.

3.4.1.1 Description Workflow

i. The Application Expert describes his application using the graphical user interface of the Application Description Tool. He can drag and drop application components into the application description editor and then add the required information for each component, as well as for the entire application (this information is described in subsection 3.3.1).
ii. The Application Description Tool validates the Application Expert's graphical input and either confirms the correctness of the input or provides hints for rectifying any errors.
iii. The Application Description Tool translates the valid input into the extended TOSCA specification language.
iv. The TOSCA application description is now in a state where it can be passed back to the Application Expert, in order to be enriched with the necessary deployment, elasticity and orchestration information.
v. In turn, the Application Expert provides the aforementioned elasticity and orchestration properties in the TOSCA application description.

3.4.1.2 Submission Workflow

i. The Application Submission Tool parses the TOSCA application description created as stated previously in the Description Workflow.
ii. The Application Submission Tool reads the TOSCA application description from the CELAR DataBase (where it was stored during the Application Description). It then parses the TOSCA Application Description and enhances it with the deployment, elasticity and orchestration information given by the Application Expert at this step.
iii.
The Application Submission Tool uses a series of CELAR API calls to send the enhanced TOSCA Application Description (which now includes the application's topology along with the elasticity, deployment and orchestration information) to the CELAR Manager.
iv. The CELAR Manager then sends the deployment and orchestration information to the Resource Provisioner Server. The Resource Provisioner will use this information to deploy the Provisioner Orchestrator and manage the application throughout its lifecycle.
v. The CELAR Manager also stores the original TOSCA application description in the CELAR DataBase.
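The submission steps above can be condensed into a small sketch. The CelarManager class, its method names, and the dictionary shapes are entirely hypothetical stand-ins for the actual CELAR API, used only to make the data flow concrete.

```python
# Hedged sketch of the Submission Workflow. All names are illustrative
# assumptions, not a specification of the real CELAR Manager API.

class CelarManager:
    def __init__(self):
        self.database = {}          # stands in for the CELAR DataBase
        self.provisioner_log = []   # stands in for calls to the Provisioner Server

    def submit(self, app_id, enhanced_tosca):
        # Step iv: forward deployment/orchestration info to the Provisioner.
        self.provisioner_log.append((app_id, enhanced_tosca["deployment"]))
        # Step v: archive the application description.
        self.database[app_id] = enhanced_tosca["topology"]

def submission_workflow(manager, app_id, tosca, extra):
    # Steps i-ii: parse the stored description and enhance it with the
    # deployment, elasticity and orchestration information given now.
    enhanced = dict(tosca)
    enhanced.update(extra)
    # Step iii: send the enhanced description to the CELAR Manager.
    manager.submit(app_id, enhanced)
    return enhanced

mgr = CelarManager()
doc = submission_workflow(
    mgr, "demo-app",
    {"topology": ["web", "db"]},
    {"deployment": {"web": 2, "db": 1}, "elasticity": {"max_vms": 10}},
)
```

The point of the sketch is the separation of concerns: the Submission Tool only enhances and forwards, while the Manager fans the information out to the Provisioner and the DataBase.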
Figure 11: Application Description and Submission Workflows

3.4.2 Application Deployment Workflow

When the CELAR Manager receives a deployment request, a new CELAR Application Orchestrator VM is initialized. This VM will be responsible for the newly deployed application, as it hosts the following modules:

i. the CELAR Orchestrator, which contains all CELAR Manager functionality and data relative to this application only;
ii. the Resource Provisioner Orchestrator module;
iii. the Application Profiler module;
iv. the Decision Module for this application, along with the corresponding data;
v. the MS Server.

During the initialization of the orchestrator, the CELAR Manager transfers all the relevant information from the main CELAR DataBase to the application-specific databases that reside inside the orchestrator. The information transferred to the orchestrator contains the deployment details and resizing scripts of the application, as well as the application description and the elasticity/decision information archived from previous deployments or from the profiling of the application.
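The orchestrator initialization described above is, in essence, a selective copy of application-relevant records from the main CELAR DataBase into the orchestrator's local stores. The sketch below illustrates this with hypothetical record keys; the real data shapes are not specified here.

```python
# Hedged sketch of orchestrator initialisation: the CELAR Manager copies
# the application-relevant records into the application-specific database.
# Record keys are illustrative assumptions.

def init_orchestrator(celar_db, app_id):
    """Return the local, application-specific view of the main DataBase."""
    relevant = ("deployment", "resizing_scripts", "description",
                "elasticity_history")
    record = celar_db[app_id]
    return {key: record[key] for key in relevant if key in record}

main_db = {
    "demo-app": {
        "deployment": {"web": 2},
        "resizing_scripts": ["add_worker.sh"],
        "description": "tosca-doc",
        "unrelated_entry": "not copied",   # stays only in the main DataBase
    }
}
local = init_orchestrator(main_db, "demo-app")
```

Keeping the copy selective matters: the orchestrator VM only needs this application's data, while the main DataBase keeps the cross-application history used for future deployments.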
The Provisioner Orchestrator retrieves (in pull mode) the deployment plan from the Provisioner Server in order to request resources and VMs from the IaaS, configure them and deploy the application according to its description. Application VMs are injected with Monitoring Agents that use probes in order to monitor resource utilization. The Monitoring Agents are connected to, and report data to, the monitoring master inside the Application Orchestrator VM. At this point, the application is deployed, monitored and ready to be elastically scaled.

3.4.3 Profiling Workflow

Profiling is executed before a new application is first deployed. During profiling, a number of different deployment configurations are created and monitored in order to identify the relationship between a specific configuration and the application's behaviour. Performance is measured and evaluated identically to normal deployments. That way, CELAR creates a knowledge base for the Decision Module with partial statistics on the application's expected behaviour when a resizing action is decided. This knowledge is helpful in two cases:

i. in decision-making when the application is initially executed, as the Decision Module does not yet have enough information on the application and its behaviour;
ii. in initial deployment (or application bootstrapping), as the collected data, together with the Decision Module, can guide towards a good initial recommendation on deployment resource allocation.

Before profiling is executed, the application must have been described using c-eclipse; this description is then stored in the CELAR DataBase. Profiling can be initiated automatically (for instance, after the application is described) or manually (for instance, by a CELAR Engineer creating a request for Application Profiling).
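One plausible way to enumerate profiling scenarios, where each scenario pairs a deployment configuration with an application load and the scenario count depends on the number of tiers and available resizing levels, is a simple cross product. Everything below is an illustrative assumption, not the Profiler's actual algorithm.

```python
# Hedged sketch: generating profiling scenarios as
# (deployment configuration, load) pairs. Names are hypothetical.

from itertools import product

def profiling_scenarios(tiers, resize_levels, loads):
    """One deployment configuration per combination of resize level per tier,
    each paired with every load profile."""
    configs = [dict(zip(tiers, combo))
               for combo in product(resize_levels, repeat=len(tiers))]
    return [(cfg, load) for cfg, load in product(configs, loads)]

scenarios = profiling_scenarios(
    tiers=["web", "db"],
    resize_levels=[1, 2],   # e.g. candidate VM counts per tier
    loads=["low", "peak"],
)
# 2 tiers x 2 levels -> 4 configurations, each with 2 loads -> 8 scenarios
```

The combinatorial growth here is exactly why the Application Expert's hints (and possibly CELAR Expert intervention) matter: they prune the scenario space to something affordable in cost and time.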
When profiling is initiated, the Profiler module inside the CELAR Application Orchestrator generates a profiling scenario, which consists of a deployment configuration and an application load. The deployment configuration and application load differ in each loop of the Profiler process, depending on the hints provided by the Application Expert. The application load is generated using a Load Generator submodule of the Profiler. Thus, different configurations are created, their number depending on the number of application tiers and the number of available resizing actions. After one profiling scenario is created, the deployment configuration is stored and used for deployment (see the previous workflow). The process is repeated for each profiling scenario.

3.4.4 Monitoring Workflow

In order to better demonstrate the communication and data exchange between the components of the Monitoring System and other CELAR modules, a dataflow diagram (Figure 12) is presented below:

i. Metrics are received by the MS Server from Monitoring Agents that reside on the VMs of the deployed application.
ii. Pre-processing and aggregation can be performed on the received metrics before they are distributed and stored in the MS local repository.
iii. Metrics are:
a. pulled by the Decision Module in order to take a resizing decision and act upon it;
b. pushed to the c-eclipse monitoring visualization tool of subscribed users.

The Decision Module can change the configuration options of the MS at any time. Such options are:

i. the frequency at which a metric is collected;
ii. adding/removing a metric to/from the interested metric sets;
iii. the requested aggregation type (sum, average, max, etc.).

Finally, metrics from the local metrics repository are asynchronously pushed to the CELAR DataBase to be filtered and stored for historical purposes.

Figure 12: Monitoring Workflow

3.4.5 Decision-making Workflow

This workflow describes in more detail when and how the elasticity decisions are going to be taken. We assume that an application is deployed and monitored according to the previously discussed workflows. The Decision Module is placed inside the CELAR Application Orchestrator VM and is responsible for taking elasticity actions throughout the lifetime of the application deployment. The information on the elasticity requirements and the Application Topology is stored in a local database inside the Decision Module. This database contains archived information from previous deployments or from the profiling of the application, as well as the user-provided optimization policy for the current deployment. The Decision Module's database is initially populated with cloud descriptions and application profiling information from the CELAR DataBase during the
deployment of the application. The decisions taken, as well as their impact on the application's performance, will be stored inside the Decision Module's database. Newly created elasticity knowledge will be propagated to and archived in the main CELAR DataBase in order to be used in future deployments. The actual decision process takes place in the following ways:

- In time intervals: the Decision Module will run at predefined time intervals (e.g., every ten minutes).
- Event-driven: the decision process can be triggered by events. For example, if the Monitoring System detects a sudden spike in the users' request rate, it will launch the Decision Module to handle it.

Figure 13: Decision-making Workflow

There are two main decision flows: (i) smart deployment of new applications based on elasticity characteristics and (ii) application elasticity control.

Smart deployment of new applications based on elasticity characteristics: The smart deployment workflow follows the general deployment workflow discussed in Section 3.4.2. The only difference is that the initial deployment configuration is generated by the Decision Module. After the initialization of the orchestrator VM, the Decision Module retrieves from the CELAR Orchestrator the application's logical structure/topology and the application elasticity requirements. Using this information, the Decision Module generates a Deployment Action Plan, which is sent for enforcement to the CELAR Orchestrator. This data is also shared with the CELAR Manager and the Provisioner Server in order to apply the action plan and inform any interested party, such as the c-eclipse UI.

Application elasticity control: The Decision Module evaluates the application's elasticity on a periodic and/or event-driven basis. When a new evaluation process starts, the Decision Module retrieves newly gathered application monitoring data from the Monitoring System.
As we can see in Figure 13, this information is used for devising an Action Plan that elastically controls the application so as to fulfill all the elasticity requirements given during Application Submission. The Action Plan, together with the Cost Estimation and Elasticity Analysis for the current application, is sent to the CELAR Orchestrator, which in turn sends it both to the Provisioner Server (in order to apply the Action
Plan) and to the CELAR DataBase (so that this information can be used by the Information System to inform any interested party).

4 Conclusions

This document described the first version of the CELAR use cases, requirements and architecture. CELAR is a visionary system that plans to efficiently integrate research and business innovation in order to provide intelligent, fully customizable and automated resource allocation for cloud-based applications. The user partners have provided a first description of the envisioned applications to be run on the CELAR system. Using these descriptions, the respective functionality and use cases were defined. The overall system architecture has been compiled accordingly, reflecting the consortium's experience as well as the main user requirements. As the project moves from the analysis and design phase to the implementation and integration phases, both the requirements and the resulting architecture will be refined. The updated user requirements and system architecture deliverable (D1.2) is planned for Month 20 of the project.
5 Citations and References

[Armbrust2009] M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing," Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
[Cassandra] Apache Cassandra Project, http://cassandra.apache.org/
[Chef] Opscode Chef, http://www.opscode.com/chef/
[Cloudoutage] http://www.computerworld.com/s/article/9216064/amazon_gets_black_eye_from_cloud_outage
[Dean2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
[Dustdar2011] S. Dustdar, Y. Guo, B. Satzger and H. L. Truong, "Principles of Elastic Processes," IEEE Internet Computing 15(5): 66-71, 2011.
[Eclipse] Eclipse Foundation, http://www.eclipse.org
[Feroldi2009] http://www.slideshare.net/federicof/cloudify-scalability-on-demand
[forester] http://forrester.com/rb/research/sizing_cloud/q/id/58161/t/2
[Ganglia] Ganglia Monitoring System, http://ganglia.sourceforge.net/
[Hbase] Apache HBase, http://hbase.apache.org/
[Hadoop] Apache Hadoop, http://hadoop.apache.org/
[Horowitz2010] E. Horowitz, "Foursquare Outage Post Mortem," http://bit.ly/c4gnv0, 2010.
[Nagios] Nagios, http://www.nagios.org/
[OSGI] OSGi Alliance, http://www.osgi.org
[Puppet] Puppet Labs, https://puppetlabs.com/solutions/cloud-management/
[Slipstream] SixSq SlipStream, http://sixsq.com/products/slipstream.html
[TOSCA] OASIS TOSCA Technical Committee, https://www.oasis-open.org/committees/tosca/