Analysis of the state of the art and defining the scope
Grant Agreement N° FP7
Title: Analysis of the state of the art and defining the scope
Authors: Danilo Ardagna (POLIMI), Giuliano Casale (IMPERIAL), Ciprian Craciun (IEAT), Michele Ciavotta (POLIMI), Emanuele Della Valle (POLIMI), Elisabetta Di Nitto (POLIMI), Mozhdeh Gholibeigi (POLIMI), Peter Matthews (CA), Marco Miglierina (POLIMI), Juan F. Pérez (IMPERIAL), Cosmin Septimiu Nechifor (SIEMENS), Craig Sheridan (FLEXI), Weikun Wang (IMPERIAL)
Editor: Victor Muntés (CA)
Reviewers: Francesco d'Andria (ATOS) and Franck Chauvel (SINTEF)
Identifier: Deliverable # D6.1
Nature: Report
Version: 1
Date: March 29th, 2013
Status: Final
Diss. level: Public

Executive Summary

This deliverable presents a state-of-the-art analysis of cloud monitoring techniques, of the tools available for managing QoS in the cloud, and of the techniques available for automatically deploying applications on the cloud. It also reviews performance models that can be used to quickly estimate the runtime performance of cloud systems, and resource management techniques for optimizing the use of cloud resources from the software provider's perspective. This deliverable also specifies the requirements for the MODAClouds run-time environment developed within WP3 and defines a road-map for the work package.

Copyright 2012 by the MODAClouds consortium. All rights reserved. The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/ ] under grant agreement n (MODAClouds).
Members of the MODAClouds consortium:
- Politecnico di Milano (Italy)
- Stiftelsen Sintef (Norway)
- Institute E-Austria Timisoara (Romania)
- Imperial College of Science, Technology and Medicine (United Kingdom)
- SOFTEAM (France)
- Siemens Program and System Engineering (Romania)
- BOC Information Systems GMBH (Austria)
- Flexiant Limited (United Kingdom)
- ATOS Spain S.A. (Spain)
- CA Technologies Development Spain S.A. (Spain)

Published MODAClouds documents: these documents are all available from the project website.

Public, Final version 1.0, March 29th
Contents

1. Introduction
   - Context and Objectives
   - MODAClouds Run-Time Environment Vision (Goals and Technical Assumptions; MAPE-K Loop; Conceptual Architecture)
   - Overview
2. State-of-the-Art: Run-Time Platforms
   - Preamble (Criteria)
   - Deployment Approaches (PaaS solutions; Deployment and execution solutions; Deployment solutions; FP7 projects; Infrastructures)
   - Resource Allocation, Load-Balancing (Load balancing and scheduling algorithms; Load balancing in production systems)
   - Application Data Management and Migration
3. State-of-the-Art: Cloud Monitoring
   - Preamble
   - General Monitoring Approaches (the COMponent Performance Assurance Solutions, COMPAS [Mos02]; TestEJB [Mey04]; Performance Anti-pattern Detection, PAD [Par08]; an elastic multi-layer monitoring approach [Kon12]; RMCM, a runtime-model-based monitoring approach for cloud [Sha10]; the Multi-layer Collection and Constraint Language, mlCCL [Bar12]; infrastructure monitoring in the current IoT landscape)
   - Infrastructure-Level Monitoring (Guest VM monitoring; Host machine monitoring; Cloud infrastructure-level monitoring; Cloud-specific monitoring)
   - Application-Level Monitoring (a Run-time Correlation Engine, RTCE-based approach [Hol10]; CASViD, an SNMP-based monitoring approach for SLA violation [Eme12]; a multi-layer approach for cloud application monitoring [Gon11]; cloud application monitoring, the mOSAIC approach [Rak11]; M4Cloud, a generic application-level monitoring approach [Mas11]; REMO, a resource-aware application state monitoring approach [Men08]; Cloud4SOA)
   - Cost Monitoring and Measurement Indexes (Unified Service Description Language; Service Measurement Index)
4. State-of-the-Art: QoS Management
   - Preamble
   - QoS Data Analysis and Forecasting (Problem; Overview of data analysis, forecasting techniques, and queueing models; Statistical inference for QoS model parameterization; Workload forecasting methods)
   - Run-Time QoS Models (Statistical learning models; Control theory models; Product-form queueing networks; Layered queueing networks; Summary)
   - SLA Management (Problem; Solution; Discipline; State of the art; Summary tables; Criteria for evaluation)
5. MODAClouds Run-Time Platform
   - Overview (Actors; Requirement Sets; Requirement Elicitation Methodology)
   - Execution Requirements (Context and system overview; use case specifications for the Run Application, Deploy Application, and Start/Stop Application Sandbox use cases)
   - Monitoring Requirements (Context and system overview; use case specifications for the Install Monitoring Rule, Activate/Deactivate Monitoring Rule, Add/Remove Observer, Collect Monitoring Data, and Distribute Data use cases)
   - Analysis of Requirements (Context and system overview; use case specifications for the Correlate Monitoring Data, Estimate Measure, Forecast Measure, and Feedback Measure use cases)
   - Self-Adaptivity Requirements (Context and system overview; use case specifications for the Define/Undefine QoS Constraints, Start/Stop Feedback of Self-Adaptivity Data, and Define/Undefine Cost Constraints use cases)
6. Roadmap
7. References
Appendix A: Run-Time Platform Evaluation Criteria
Acronyms

API: Application Programmable Interface
ARIMA: Autoregressive Integrated Moving Average
ARMA: Autoregressive Moving Average
AWS: Amazon Web Service
CDN: Content Delivery Network
CEP: Complex Event Processing
CLI: Command Line Interface
CUI: Console User Interface
DSS: Decision Support System
ECA: Event Condition Action
EJB: Enterprise Java Beans
FA: First Available
GAE: Google App Engine
HTTP: Hypertext Transfer Protocol
IaaS: Infrastructure as a Service
IDE: Integrated Development Environment
IoT: Internet of Things
IP: Internet Protocol
JDO: Java Data Objects
JMX: Java Management Extensions
JNI: Java Native Interface
JSON: JavaScript Object Notation
JSQ: Join-the-shortest-queue
LC: Least-Connections scheduling
LLC: Locality-based Least-Connections scheduling
LLCR: Locality-based Least-Connections scheduling with replications
LVS: Linux Virtual Server
MAPE-K: Monitor, Analyse, Plan, Execute, Shared Knowledge
OS: Operating System
OVF: Open Virtualization Format
PaaS: Platform as a Service
QoS: Quality of Service
REST: Representational State Transfer
RHEL: Red Hat Enterprise Linux
ROI: Return on Investment
RR: Round Robin
SaaS: Software as a Service
SLA: Service Level Agreement
SLO: Service Level Objective
SMI: Service Measurement Index
SOA: Service-oriented Architecture
SRC: Source
SRPT: Shortest-remaining-processing-time
USDL: Unified Service Description Language
VM: Virtual Machine
WLC: Weighted Least-Connections scheduling
WP: Work Package
WRR: Weighted Round Robin
WS: Web Service
WSGI: Web Server Gateway Interface
WUI: Web User Interface
XML: extensible Markup Language
XMPP: extensible Messaging and Presence Protocol
1. Introduction

1.1. Context and objectives

The main goal of MODAClouds is to deliver methods, a Decision Support System (DSS), and an open source IDE and run-time environment for the high-level design, early prototyping, semi-automatic code generation, and automatic deployment of applications on multi-clouds with guaranteed QoS.

The present document is Deliverable 6.1, "Analysis of the state of the art and defining the scope" (henceforth referred to as D6.1), of the MODAClouds project. The main objectives of this document are to analyse the state of the art regarding cloud monitoring, management APIs, automatic deployment and run-time performance models, and to establish the general requirements that will define the scope of the MODAClouds project.

Figure 1.1.a depicts the methodology followed to define the requirements presented in D6.1. In order to collect the requirements that will drive the development of the MODAClouds framework, the following steps have been carried out:
i. collection and structured review of the relevant state of the art;
ii. elicitation of requirements from (i); and
iii. prioritization of the requirements.

Figure 1.1.a. The MODAClouds requirements elicitation.

We have carried out an extended review of the state of the art in the following research fields covered by MODAClouds, in order to identify gaps, deficiencies, needs and problems, and thus elicit requirements:
- run-time platforms for cloud environments;
- monitoring of cloud infrastructures and applications;
- quality of service in cloud-based environments.

The QoS metrics of interest in this deliverable are identical to the ones reviewed in D5.1 for the design-time evaluation. Thus, they are mainly concerned with quantifying application responsiveness, scalability, and dependability. We point to D5.1 for an extensive description.
The state-of-the-art analysis is based on an initial search in the major research databases of computer science, i.e. ACM Digital Library, IEEE Xplore, SpringerLink, ScienceDirect and Google Scholar, using keywords such as cloud computing, multi-clouds, risk, quality management, interoperability, architectures, deployment, monitoring, etc. We gave priority to publications dated from 2007 (as, according to Google Trends search and news reference volume data, the term "cloud computing" started becoming popular in 2007 [GT13]) until February. However, older references are used when necessary. From these papers, references were checked and additional papers were found. Initiatives coming from standardization bodies, leading vendors and funded projects were also included in this survey. This resulted in a collection of several hundred publications that included (a) conference, workshop and symposium papers, (b) journal articles, (c) electronic articles and (d) technical reports and white papers. Around 200 publications were finally selected as the most relevant.
1.2. MODAClouds Run-Time Environment Vision

Before presenting the analysis of the state of the art, in this subsection we describe the overall goals of the runtime environment, the general approach, and the high-level conceptual architecture. With this we aim at providing a general framework in which to discuss self-adaptive architectures. This also provides relevant information that may be useful when discussing monitoring platforms and reviewing existing QoS management solutions.

Goals and Technical Assumptions

MODAClouds aims at defining a runtime environment specifically tailored to the execution of applications developed with its design-time tools. The high-level objectives of the MODAClouds runtime environment are several and can be summarized as follows:
1. Define a monitoring platform to characterize the state of applications developed and deployed using MODAClouds.
2. Define self-adaptive policies to manage application QoS at runtime. These policies will rely on models, shipped with the runtime environment, to perform predictions on application performance and scalability, as well as to track or estimate its current status and resource demands.
3. Develop an execution platform for managing application deployment, configuration, and run-time execution. This platform will be utilized by the self-adaptive policies to manage application QoS.
4. Define data synchronization and load balancing mechanisms to support the execution of an application that is replicated over multiple clouds to ensure high availability.

The above goals are quite broad and their scope therefore requires more precise clarification. The main underlying technical assumptions of the runtime environment are as follows:

Supported application classes. For an application developed with the MODAClouds design-time methodology, the runtime environment will offer its functionalities on the supported IaaS and PaaS platforms.
For these applications, the run-time environment aims at maximizing automation, while still keeping the human in the loop, especially in situations where this is highly desirable, for example incident management. We will focus on Linux applications, with a specific emphasis on applications running inside the Java virtual machine. Extension to other application classes will be evaluated and defined after the requirement phase for the main case studies is completed.

Single-Cloud Deployment. MODAClouds will support public cloud deployments where an application is entirely deployed on a given IaaS platform or entirely deployed on a given PaaS platform. The choice of the target platform is delegated to the software developer, who should base this choice on the MODAClouds decision support system.

Multi-Cloud Deployment. MODAClouds applications may require simultaneous deployment on two cloud platforms, or on two different availability zones, in order to guarantee continuous availability in the face of a cloud outage. In the context of the runtime environment, we therefore need to ensure that the failure of a cloud platform A will not also result in unavailability of the backup copy deployed on cloud platform B. Based on this observation, we will assume that the two applications will have independent runtime environments. A special-purpose software component will be developed to ensure that, upon a failure of one of the two platforms, the incoming application traffic can be gracefully redirected to the other platform. Data will be replicated on the two clouds, and the runtime environment will offer a stream-based mechanism to synchronize the data across clouds and therefore minimise the downtime. Notice that the migration process underlying the deployment of an application running on cloud A onto a new cloud B is intended as a process that will be fully done at design-time; it is not intended as code movement or live migration.
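To make the behaviour of the special-purpose redirection component concrete, the following is a minimal sketch in Python. All names (FailoverController, probe, redirect_traffic) are ours, introduced purely for illustration; they are not part of any MODAClouds API, and a real implementation would hinge on the concrete DNS or load-balancer mechanism chosen.

```python
class FailoverController:
    """Toy sketch of a traffic-redirection component for a two-cloud
    deployment: it probes the currently active deployment and, when the
    probe fails while the standby is healthy, redirects incoming
    application traffic to the standby."""

    def __init__(self, primary, backup, probe, redirect_traffic):
        self.primary, self.backup = primary, backup
        self.probe = probe                        # callable: endpoint -> bool
        self.redirect_traffic = redirect_traffic  # callable: endpoint -> None
        self.active = primary

    def check_once(self):
        """Run one health check; returns the endpoint now serving traffic."""
        if not self.probe(self.active):
            standby = self.backup if self.active == self.primary else self.primary
            if self.probe(standby):
                # e.g., a DNS update or a load-balancer reconfiguration
                self.redirect_traffic(standby)
                self.active = standby
        return self.active
```

In practice such a controller would run outside both clouds (or replicated on each), so that the failure of one platform cannot also take down the failover logic itself.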
The only migration operation that will be performed at runtime is the data synchronization that will be used to migrate the data.

Application Deployment and Execution. MODAClouds will manage the execution of applications on a set of target IaaS and PaaS public clouds. To avoid making the scientific and technical approach of the runtime environment too dependent on the chosen target clouds, we intend to use the work done in the mOSAIC and Cloud4SOA FP7 projects as a starting point for defining the MODAClouds runtime execution platform. In particular, we intend to reuse part of the execution mechanisms implemented in mOSAIC for IaaS platforms, and in Cloud4SOA for PaaS platforms, to manage the MODAClouds application lifecycle (e.g., application
start/stop/initial configuration, etc.). This will provide a runtime environment that exploits and enhances technical outputs of recent FP7 projects, thus leveraging code maturity and avoiding solving issues already addressed by other EU projects.

Application-Oriented. The MODAClouds runtime will make it possible to monitor and adapt the deployment of a cloud application with the goal of meeting the QoS goals decided at design-time. This will require operating, in parallel to the application, a number of software services, controllers, and tools that will be instrumental to achieving this goal. However, as part of the runtime environment, these systems will themselves consume physical resources and, in principle, may be subject to their own set of QoS and cost constraints. In general, to avoid excessive complexity in the design of the runtime and to avoid focusing on secondary aspects, these concerns will be treated as minor compared to the QoS and cost constraints specified by the developer for the running application.

Legacy software. The runtime environment will also provide the ability to execute applications shipped with legacy software components. One example could be legacy business logic written in C++. However, limitations may apply to the ability of the framework to automate all the operations for these applications. Therefore, a substantially higher level of involvement of human administrators could be required for management in this case.

Monitoring approach. The monitoring approach adopted in MODAClouds is focused on addressing the needs of all stakeholders involved in the operation and management of cloud applications.
In particular, it aims at satisfying the following important requirements: i) be able to collect data at various levels, from the hypervisor level, when exposed by the cloud platform, up to the application level; ii) be able to cope with the cases of single-cloud and multi-cloud deployments; iii) be extensible in terms of the information to be collected and of the filtering and correlation actions to be performed on such data. In order to address the last point, the architecture of the monitoring approach will be highly composable and will allow for the installation of special-purpose data collectors, in charge of acquiring data depending on the needs of the stakeholders interested in performing monitoring, and of data analysers. Moreover, it will feature a proper language to support different kinds of filtering and composition of the acquired information.

MAPE-K Loop

The overall paradigm adopted by the MODAClouds runtime environment is inspired by the research area of Autonomic Computing, which over the course of the last ten years has greatly increased the common understanding of how to realize systems with self-managing capabilities. The MODAClouds runtime is inspired in its high-level design by the MAPE-K loop, which is one key conceptual aspect of the Autonomic Computing field. The MAPE-K autonomic loop (Monitor, Analyze, Plan, Execute, Knowledge) represents a blueprint for the design of autonomic systems where a managed element is coordinated by a loop structured in four phases and a common knowledge (see Figure 1.2.a).
Figure 1.2.a: MAPE-K loop

The MAPE-K loop is structured in four consecutive phases:

Monitoring. The monitoring component is responsible for managing the different sensors that provide information regarding the performance of the system. In the MODAClouds context, for example, sensors can capture the current consumption of critical node resources (such as CPU and memory), but also other performance metrics (such as the number of processed requests in a time window and the request processing latency). The monitoring granularity is specified by rules. Sensors can also raise notifications when changes to the system configuration happen.

Analysis. The analysis component is responsible for processing the information captured by the monitoring component and for generating high-level events. For instance, it may combine the values of CPU and memory utilization to signal an overload condition in the platform.

Planning. The planning component is responsible for selecting the actions that need to be applied in order to correct some deviation from the desired operational envelope. The planning component relies on a high-level policy that describes an adaptation plan for the system. These policies may be described, for example, using Event Condition Action (ECA) rules defined in a high-level language. An ECA rule describes, for a specific event and a given condition, what action should be executed. In the context of MODAClouds, the actions may affect the deployed artifacts and the bindings among them.

Execution. The execution component applies the actions selected by the planning component to the target components.

Additionally, the shared knowledge includes information to support the remaining components. In the context of MODAClouds, it maintains information about managed elements.
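As an illustration of the Planning phase described above, an ECA rule can be encoded as an (event, condition, action) triple and evaluated against the shared knowledge. This is only a sketch under our own naming (EcaRule, plan, the utilization thresholds); MODAClouds does not prescribe this particular representation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EcaRule:
    """An Event-Condition-Action rule: when `event` occurs and
    `condition` holds on the shared knowledge, `action` is selected."""
    event: str
    condition: Callable[[dict], bool]
    action: str

def plan(event, knowledge, rules):
    """Planning step: return the actions triggered by `event`
    given the current shared knowledge."""
    return [r.action for r in rules
            if r.event == event and r.condition(knowledge)]

# Illustrative rules combining CPU and memory utilization, in the
# spirit of the overload example given for the Analysis phase.
rules = [
    EcaRule("utilization_report",
            lambda k: k["cpu"] > 0.9 and k["mem"] > 0.8,
            "scale_out"),
    EcaRule("utilization_report",
            lambda k: k["cpu"] < 0.2,
            "scale_in"),
]
```

The selected actions (here plain strings) would then be handed to the Execution component, which applies them to the deployed artifacts.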
Models of the MAPE management processes, set within the context of a generalized system management meta-model, have also been developed within a few relevant projects such as Auto-I [AUT13], ANA [ANA13], and CASCADAS [CAS13]. In addition, the topic has been analysed by Calinescu by means of formal methods [GPA13].

Conceptual Architecture

The MAPE-K loop only represents a design blueprint that purposely leaves lower-level details of the architecture unspecified (i.e., it does not impose constraints on the implementation). In these initial months of the project, the MODAClouds consortium has defined a reference conceptual architecture for the runtime platform, which we describe here and which follows the MAPE-K loop design approach. The details and implementation of this conceptual architecture will be specified in more detail in follow-up Year 1 deliverables, which will iterate on this initial description to produce a concrete implementation plan. The goal of this section is to provide a
high-level intuition of the systems that will compose the architecture, which is required in order to identify the actors that are involved in the requirement specification.

Figure 1.2.b: Conceptual architecture of MODAClouds runtime environment

Figure 1.2.b shows the main components of the MODAClouds runtime platform. We here consider a runtime-centric perspective; therefore, the runtime subsystems are exploded, while the design-time systems are left as a single component (the MODAClouds IDE). In more detail, the runtime environment will be composed of three main platforms:

1. The monitoring platform, responsible for data collection, analysis, and definition of summary measures (application health, QoS, etc.). This platform will be in charge of collecting information from various sources (cloud monitoring interface, application-level components, PaaS containers) and of filtering, correlating, and analysing such information with the purpose of executing a set of monitoring rules that will be defined by the design-time tools provided by MODAClouds. For example, a monitoring rule could check every 60 seconds, over a time window of 3600 seconds, whether response time has been greater than 1 second more than a given number of times. The violation of such rules triggers actions in the self-adaptation platform.

2. The self-adaptation platform. This subsystem is responsible for state tracking of the currently deployed application and for the decision-making needed to identify changes to the current configuration of the cloud-based application so as to satisfy SLA requirements. The self-adaptation platform is activated by the monitoring platform in response to some alarming event, or performs predefined analyses periodically without being explicitly triggered by the monitoring platform.
The self-adaptation platform will be developed jointly across WP4 and WP6, the former focusing more on the underlying software engineering aspects (e.g., the Models@Runtime approach described in deliverable D4.1), while the latter will focus more on the runtime reasoning. We point to deliverable D4.1 for a description of the self-adaptive platform, and we focus here only on specifying those of its requirements that do not overlap with those in D4.1.

3. The execution platform. The execution platform provisions all services that are needed to deploy on the clouds both the application and the corresponding monitoring platform. The initial deployment decisions are provided by the MODAClouds IDE and translated to the target platform. The execution platform also offers the services and the API needed to maintain and change the operational state of the deployment at runtime, typically under triggers of the self-adaptation platform.
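The example monitoring rule given for the monitoring platform (evaluated every 60 seconds over a 3600-second window, firing when response time exceeds 1 second more than a given number of times) can be sketched as a sliding-window check. The class and parameter names below are ours, not a MODAClouds interface.

```python
from collections import deque

class ResponseTimeRule:
    """Illustrative sliding-window monitoring rule: count how many
    observations in the last `window` seconds exceeded `threshold`
    seconds, and report a violation when that count passes
    `max_violations`. `evaluate` is meant to be invoked periodically,
    e.g., every 60 seconds."""

    def __init__(self, threshold=1.0, window=3600, max_violations=10):
        self.threshold = threshold
        self.window = window
        self.max_violations = max_violations
        self.samples = deque()          # (timestamp, response_time) pairs

    def observe(self, t, response_time):
        """Record one response-time measurement taken at time t."""
        self.samples.append((t, response_time))

    def evaluate(self, now):
        """True means the rule is violated at time `now`."""
        # Drop observations that fell out of the sliding window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        slow = sum(1 for _, rt in self.samples if rt > self.threshold)
        return slow > self.max_violations
```

A violation returned by evaluate would be the kind of event that triggers an action in the self-adaptation platform.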
Each platform therefore has a clearly identified role and will interact with the other platforms, whenever possible, via RESTful services or data streams that will hide implementation details. This will allow the runtime platform to more easily integrate contributions written in different languages and styles.

1.3. Overview

The rest of the document is divided into four sections. Section 2 presents a description of the state of the art on run-time platforms for cloud environments, specifically those handling the execution and deployment of applications; with this section we aim at analysing tools that could potentially be used to create the MODAClouds run-time environment. Section 3 describes previous work on monitoring cloud infrastructures and applications; this is an important aspect in order to be able to analyse the system and evaluate its performance in different regards. Section 4 reviews methods proposed during the last decade to provide system administrators with tools to manage the QoS of online applications; QoS is essential for the optimal delivery of cloud services. Finally, Section 5 presents the use cases and the requirements of the platform.
2. State-of-the-Art: Run-Time Platforms

2.1. Preamble

The purpose of the current section is to highlight the potential solutions that could be used to develop the MODAClouds run-time environment, specifically those handling the execution and deployment of applications. Thus the technical solutions surveyed herein are grouped in the following broad categories:
- PaaS --- solutions featuring an integrated life-cycle for the application and its required services or resources --- from packaging, through deployment, to execution and monitoring --- usually available as hosted services from various cloud providers;
- deployment and execution --- independent software artifacts that focus mostly on the deployment and execution phases; these could be seen as lightweight PaaS replacements, especially suitable for applications requiring a more conservative environment, such as customized software packages, full access to the OS or the network, etc.;
- deployment only --- similar to the above but focusing only on the deployment aspects, such as installation and configuration.

This survey is structured as follows: we start with a section (Criteria) describing the characteristics and capabilities we are interested in finding in the proposed solutions; then in the next one (Solutions) we systematically take the exponents of the categories mentioned earlier, and for each one we provide the following information:
- the "Overview" section, presenting a very high-level picture, together with a small dictionary describing the terms that are either unique or have a particular interpretation in the solution's context, and, if needed, an architectural view or short descriptions of various important mechanisms;
- the "Criteria" section, summarizing characteristics pertaining to the surveyed solution; this is materialized as highly structured text, such as lists or tables;
- followed by the "Notes" and "Limitations" sections, which provide a subjective --- from the MODAClouds perspective --- critique of
the proposed solution, highlighting advantages, disadvantages, future developments, and, if possible, comparisons with other solutions in the same category;
- then a short "MODAClouds integration" section that highlights how the project would benefit by integrating the proposed solution;
- and last a "References" section pertaining solely to the analyzed solution --- references pertaining to multiple solutions are placed in the global references section.

However, before proceeding we must make a few observations about the purpose of these solutions, and the distinction between the user's application and the MODAClouds services supporting that application:
- in the context of the following sections, by application we understand the user's application; that is, the software artifacts written and provided by the user himself which implement the desired functionality, plus the required services or resources, such as databases, middleware or other generic components, that are required by the user's components; such an application could target either a PaaS or an IaaS cloud provider;
- however, in order to enhance the user's application with features such as monitoring, QoS enforcement, automatic scalability, or cloud migration, we are required to run certain support services alongside the user's application; unfortunately, due to the complexity and requirements of such support services, we are uncertain whether they would be able to run inside a PaaS; as such, it is our intention to run these on an IaaS alongside the PaaS that hosts the user's application (for an IaaS-only solution things are much simpler);
- we focus on the deployment and execution solutions for the user's application, although we keep an eye open for the possibility of managing the support services alongside the application; but, as stated, this is not a main requirement;
- moreover, the current survey assesses only the technical aspects related to deployment and execution, while
other parts of this deliverable tackle individual traits such as QoS, monitoring or load-balancing; non-technical aspects such as cost and pricing models are discussed in deliverable D2.1.
We reiterate that, except for the "Overview" section, the other sections of the survey are a subjective take on the solutions, the perspective being that of MODAClouds' own focus and requirements.

Criteria

Although many of the surveyed, or other existing, solutions are production-ready --- or, even better, backed by powerful companies in the IT sector --- and offer many features, we must focus our effort on determining whether they are a good match for the MODAClouds requirements, described in a later section. Such a goal implies two separate conditions:
- first of all, they should be suitable for our industrial partners' case study applications, which in turn implies matching the supported programming languages, the palette of available resources and middleware, and not least the security requirements;
- and, in order to fulfill our project's goal, they must provide a certain flexibility, to allow our run-time environment to integrate with, and provide support for, the user's application.

Therefore, we are especially interested in the following aspects --- due to space constraints, the possible values are described in Appendix A:

type: One of the categories mentioned at the beginning of the section, which broadly describes the purpose of the solution and the range of features it offers.
suitability: Shortly, how mature, or production-ready, is the solution? Does it have a supportive community built around it?
application domain: What would be the "main flavour" of targeted applications?
application architecture: Broadly matching a targeted application architecture.
application restrictions: What constraints would the application (and part of our run-time environment) be subjected to?
programming languages and programming frameworks: Some solutions target (or at least are focused on) a particular framework (such as Servlets for Google App Engine's Java environment, or Capistrano, tightly focused on Ruby on Rails deployment).
Thus it would prove useful to know in advance which are the officially sanctioned or preferred frameworks.
scalability: How can scalability be achieved?
session affinity: Usually PaaSs offer HTTP request routers (or dispatchers); how do they load-balance clients between the multiple available service instances? (How each client is identified depends on the internals of the PaaS and could range from source IP address to cookies.)
interaction: How can we pragmatically interact with the proposed solution?
hosting type: How would we be able to use the proposed solution?
portability: If a developer uses a particular solution, how easy is it for him to move to another solution having the same role?
services:
15 Especially in the case of PaaS, what additional resources or services (such as databases, middleware, etc.) are available and managed directly by the solution, and thus integrated with the application life-cycle? monitoring coverage Especially in the case of PaaS, how much do the monitoring facilities cover and expose to the operator? monitoring level From which perspective, or at which level of the software and infrastructure stack, are the metrics provided? monitoring interface What technique --- such as standard, API, library, etc. --- is used to expose the monitoring information to the operator? Resource providers Most of the PaaS do not also have their own hardware resources, but instead are built on top of other publicly accessible IaaS providers. Thus if the user needs services not offered by the PaaS itself, it could use that IaaS to host the missing functionality himself. multi-tenancy This characteristic pertains mainly to PaaS or PaaS-like solutions, and tries to assess if multiple applications can share the same instance of the PaaS. resource sharing This characteristic pertains mainly to PaaS or PaaS-like solutions, and tries to assess how are the application components or services mapped on the provisioned VMs. limitations Most of the solutions impose quantitative limitations (such as memory, bandwidth, storage, etc.) on the running applications, which could be of interest especially in determining the suitability for our case studies Deployment Approaches PaaS solutions In the following subsections we survey the most important PaaS solutions, either hosted, like Heroku or Windows Azure, either deployable on self-provisioned IaaS, such as Cloud Foundry. 
Although there are countless other technologies fitting into the PaaS category, especially emerging products from various startups, we have limited our survey to the ones most likely to be used inside MODAClouds, because they provide a wide degree of flexibility, or are popular choices amongst developers. Moreover, this list is not exclusive: if during the implementation phase of the project we find other suitable candidates, we can use them as well.

Heroku

Overview

Heroku is a classical PaaS solution, featuring a large degree of flexibility for the targeted application, ranging from the largest set of supported programming languages to the availability of third-party integrated services. What sets it apart from other PaaS solutions is the simplicity of developing and deploying applications that target this platform, the only requirement being to respect the "generally accepted best practices", as summarized in [HER1] and detailed in [HER11]. For example, while Google App Engine requires the developer to choose exactly one of the three supported languages, then to strictly adhere to a reduced API and use Google's customized data stores, in contrast Heroku allows the developer to run almost any "well behaved" web application, and exposes access to resources ranging from classical SQL databases to distributed search indexes. Regarding the tooling, the deployment is almost completely driven via the Git distributed versioning system, but at the same time there are CLI tools --- based on web services --- that allow full control of the application.

Characteristics

type: PaaS
suitability: production
application domain: web applications [HER1]
application architecture: n-tier applications [HER1]
application restrictions: container [HER2]
programming languages: Ruby, Python, NodeJS, Java [HER3]
programming frameworks: any
scalability: manual [HER4]
session affinity: non-deterministic [HER7]
interaction: WUI, CLI, WS, API [HER9]
hosting type: hosted
portability: out-of-the-box
services: large palette of managed services [HER6]
monitoring coverage: none (however there are add-ons)
backing provider: Amazon EC2 [HER2]

Limitations

OS resources:
memory per dyno: 512 MB (soft limit) and 1.5 GB (hard limit) [HER2]
disk per dyno: unspecified
CPUs per dyno: unspecified
package size: 200 MB [HER8] [HER10]

Networking accessibility:
inbound: HTTP exclusively [HER7] [HER10]
outbound: allowed (with exceptions)
internal: disallowed [HER10]

Notes

Although Heroku has official support for some programming languages, it can support practically any application that can be run on Linux, via its "buildpack" feature [HER5]. As noted above, out-of-the-box Heroku does not provide any type of monitoring --- except for the existence of processes --- but there are various third-party add-ons available that monitor the running application from within. Unfortunately, even with these add-ons we cannot get any data from "within" the platform.
With regard to network accessibility, for inbound network connections only HTTP --- for the domain under which the application was registered --- is forwarded to the processes of the web type, while any other access from the exterior does not seem to be supported; on the other hand, it is hinted that the HTTP router does support the CONNECT verb --- a unique feature among existing PaaS solutions --- thus enabling the proxying of arbitrary protocols. For outbound connections there do not seem to be any constraints except the "best practice" policies, and the fact that the source IP address might change at any time. Connections between the various processes, thus internal ones, seem to be disallowed [HER10]. (The documentation is not definitive in any of these regards, especially about the existence of any quota.) The cost seems to be higher when compared with VMs (with higher capacity) from the underlying provider (Amazon), and especially when using add-ons. On the positive side, for each application there is an amount of free time which would allow a user to run a single process for a month (or multiple processes for a fraction of a month).
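The Git-driven deployment and CLI-based control described above can be sketched as a short shell session; the application name and buildpack URL are hypothetical, and the commands assume the Heroku client tools are installed and the application code lives in a Git repository:

```shell
# create an application, optionally with a custom buildpack [HER5]
heroku create my-app --buildpack https://github.com/heroku/heroku-buildpack-python

# deploying is a plain Git push; the buildpack compiles the deployable "slug"
git push heroku master

# manual scaling: change the number of dynos per process type [HER4]
heroku ps:scale web=2 worker=1

# inspect the application logs streamed by the platform
heroku logs --tail
```

The same operations are also available through the web-service API [HER9], which is what our run-time environment would use programmatically.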
MODAClouds integration

Heroku provides unique possibilities, not found in other commercial PaaS solutions, for integration with MODAClouds. We could provide customized "buildpacks" [HER5] that would augment the user's application code with our probes, without impacting his development, packaging and deployment experience. Moreover, because Heroku allows the user to run multiple types of processes, and because we can customize the deployment, we could run the support services that MODAClouds needs directly inside the same application instance. On the other hand, because the platform itself is hosted on Amazon EC2, we could easily deploy our services there. Finally, the API exposed by Heroku [HER9], although simple, allows fine-grained control over the application, from changing the number of instances of a certain process type, to attaching add-ons or accessing logs.

Cloud Foundry

Overview

At a high level Cloud Foundry can also be seen as part of the classical PaaS family, similarly to Heroku. Cloud Foundry allows the developer to run almost any "conventional" application without changes --- one that uses the most common frameworks and respects the common "best practices" --- also providing support for the most common resources (such as relational databases). Two important highlights of Cloud Foundry are the fact that its source code is released under an open-source license, and that there is a "Micro Cloud Foundry" [CFY4] solution that enables the developer to have a local deployment and testing environment that simulates the hosted platform. Because Cloud Foundry is at the same time both a hosted PaaS (operated by VMware) and an open-source product, in this survey we focus especially on the hosted platform, because many of the limitations and constraints depend solely on the choices made by the hosting provider; meanwhile, the open-source variant allows anyone who deploys it to add support for new programming languages or services, or to raise limitations.
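Interaction with the platform happens mainly through its command-line client (vmc at the time of writing); a minimal, hypothetical session --- the application and service names are invented, and exact command names may differ between client versions --- could look like:

```shell
# point the client at the hosted platform and authenticate
vmc target api.cloudfoundry.com
vmc login

# upload and start the application (the client prompts for memory, URL, etc.)
vmc push my-app

# provision one of the managed services and bind it to the application
vmc create-service mysql my-app-db
vmc bind-service my-app-db my-app

# manual scaling: set the number of application instances
vmc instances my-app 3
```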
Unfortunately, unlike Heroku, it is not yet ready for production, allowing only limited resources to the applications, providing a very small set of additional resources or services, and constraining the supported programming languages and frameworks. However, compared with Heroku and other PaaS solutions, it promises to offer more flexibility regarding the possible application architectures and the HTTP routing layer. To sum up, Cloud Foundry offers three different forms of a PaaS:

CloudFoundry.com --- Public instance of the open Cloud Foundry PaaS operated by VMware. CloudFoundry.com is now in beta and can be used for free.
Micro Cloud Foundry --- Complete Cloud Foundry instance contained within a virtual machine. This product is intended to be installed directly on a developer's computer in order to simulate the interaction with a real Cloud Foundry-based private or public cloud, ensuring that applications that run locally will run in the cloud too.
CloudFoundry.org --- Open-source project hosting the Cloud Foundry technology. With the tools within this project a private PaaS can be built and deployed on top of any IaaS. Different configurations can be built, achieving different supported languages, frameworks and application services.

Characteristics

type: PaaS
suitability: emerging
application domain: web applications
application architecture: 2-tier applications (but see the notes)
application restrictions: container
programming languages: Java, Ruby, NodeJS [CFY3]
programming frameworks: popular frameworks (Spring, Java Servlets, Rails) plus "standalone"
scalability: manual
session affinity: sticky sessions [CFY1]
interaction: CLI, WS, API
hosting type: hosted, simulated, deployable open-source
portability: out-of-the-box
services: MySQL, PostgreSQL, MongoDB, Redis, RabbitMQ [CFY2]
monitoring coverage: basic
monitoring level: container
backing provider: VMware (private solution)

Limitations

OS resources:
memory: 2 GB [CFY1]
disk: 2 GB [CFY1]
descriptors: 256 [CFY1]
CPUs: 4 [CFY1]

Networking accessibility:
inbound: HTTP exclusively
outbound: allowed (with exceptions) [CFY1]
internal: unspecified

Notes

Much of what can be said about Cloud Foundry was already written in the overview above, and can be summarized as follows: it could be considered an alternative to Heroku; its major advantage over the other PaaS solutions is its open-source license; there is a solution that offers the developer an "emulator" of the hosted platform that can be run on a local machine; overall it is a promising solution, but currently still in beta status. Regarding the supported programming languages and frameworks, it is stricter than Heroku: only the available ones can be used, and there is no option to customize the build and packaging process (without the developers' intervention). The limitations are clearly described, and on par with Heroku. Unfortunately, due to its current beta status, it cannot host any real-world application, because the total memory summed over all the applications (or application instances) cannot exceed 2 GB [CFY1], and there does not seem to be support for domain names other than *.cloudfoundry.com. Moreover, the limitation of 256 file descriptors could be worrying, because it implies that each instance cannot have more than 256 concurrent HTTP requests, which for real-time web applications (such as those using web-sockets) would be a show-stopper. Although the deployed applications must fit the 2-tier model (i.e. a monolithic process plus a database or middleware layer), Cloud Foundry has a unique feature that allows different applications to share the same services or resources, thus allowing the user to obtain an n-tier model by splitting his application into multiple ones when deploying to Cloud Foundry.

MODAClouds integration

Again we stress the fact that the following statements apply mainly to the hosted solution (operated by VMware), because in a self-deployed Cloud Foundry environment the operator has many other choices. From the MODAClouds perspective, Cloud Foundry currently has less to offer than Heroku, as it has the following major disadvantages, especially when thinking about how to run the additional support services that MODAClouds requires:
it is backed by VMware's IaaS solution (presumably VMware vSphere), but in a private cloud; this implies that we will not be able to provision any VM for our support services that would be able to interact with the running application;
the set of currently supported services and resources is very small, and there is no (practical and cost-effective) way to add others;
the limitation imposed on the applications --- the total amount of memory used by the entire application is only 2 GB --- makes it difficult for any real-world application to be run;
the limited number of programming languages imposes large restrictions on what services we are able to effectively use.

On the positive side it does offer some advantages, although marginal compared with the drawbacks:

it does provide basic monitoring information (memory, CPU, disk) at instance level;
it allows the user to upgrade the application without interrupting the service --- through manipulation of the HTTP routes, and only if the old and new versions of the application are capable of handling it.

All in all, we could still manage to use it as a run-time platform for the applications, provided that we host the support services with another provider, though not without performance degradation or cost increases.

AppFog

Overview

AppFog is one of the commercial PaaS solutions that have been built upon the open-source licensed Cloud Foundry code base. AppFog is limited similarly to the Cloud Foundry platform hosted by VMware. As such, in the current section we shall focus mainly on what differs from Cloud Foundry.
Characteristics

type: PaaS
suitability: production
application domain: web applications
application architecture: 2-tier applications
application restrictions: container
programming languages: Java, Ruby, Python, NodeJS, PHP [APF2] [APF4]
programming frameworks: popular frameworks plus "standalone"
scalability: manual
session affinity: sticky sessions (presumably)
interaction: WUI, CLI, WS, API
hosting type: hosted
portability: out-of-the-box
services: MySQL, PostgreSQL, MongoDB, Redis, RabbitMQ [APF3] [APF4]
monitoring coverage: basic
monitoring level: container
backing provider: Amazon, HP [APF4]

Notes and Limitations

Most are presumably the same as in the case of Cloud Foundry --- there is no clear documentation stating the constraints --- with the following notable exceptions:

the limit of 2 GB total memory for an application can be raised if the user pays; however, the user can create an unlimited number of applications for free, within that 2 GB limit;
there is a way to access some resources (like the relational databases) backing the applications [APF5];
it is backed by public IaaS offerings.

MODAClouds integration

As stated in the introduction, AppFog inherits some of the drawbacks of Cloud Foundry, especially when compared with Heroku. For example, the number of programming languages is slightly larger than with Cloud Foundry, but we cannot run arbitrary applications as we can on Heroku. However, because AppFog is backed by Amazon and other public cloud providers, we have the ability to deploy our support services on VMs residing in the same cloud. Moreover, through the "tunneling" feature [APF5] we could support the migration of the application data, at least for the limited set of supported resources.

AWS Elastic Beanstalk

Overview

In order to ease the deployment of applications over its IaaS solution, Amazon provides a simple wrapper service, namely AWS Elastic Beanstalk, which, given a deployable software artifact --- either in compiled form or as source code, depending on the target platform --- automates the following aspects of the application life-cycle [ABS1]:

provisioning the required VMs from EC2, with the right image needed for the targeted run-time;
configuring both the VM-related aspects, like security groups, and the complementary required services, like the AWS Elastic Load Balancer, or AWS CloudWatch;
deploying the software artifact inside the run-time environment;
managing complementary services like the Elastic Load Balancer solution.

Concepts:

application --- An umbrella concept for all the entities that belong to the same logical "application".
version --- A deployable software artifact. Each application can at any time have multiple versions, each prepared for immediate execution, thus enabling the operator to roll back to previous versions if a particular deployment manifests issues. [ABS4]
environment --- The run-time instance of a particular version; again there can be multiple concurrent environments, possibly of the same version. The environment also specifies the characteristics of the VM to be deployed on. [ABS4]
resources --- Any external database, middleware, etc., that the application needs, and which is completely out of the control of the platform (although the various web-based wizards do allow the operator to create an Elastic Load Balancer or an RDS instance).

Characteristics

type: application deployment and execution
suitability: emerging
application domain: web applications
application architecture: 2-tier applications
application restrictions: none
programming languages: Java, PHP, Python, Ruby, .Net [ABS1]
programming frameworks: a selected set, specific to each supported language [ABS1]
scalability: none (delegated to AWS Auto Scaling)
session affinity: none (delegated to AWS Elastic Load Balancer)
interaction: WUI, CLI, WS, API
hosting type: hosted
portability: out-of-the-box
services: none (manual provisioning of any AWS service)
monitoring: none (delegated to AWS CloudWatch)
providers: Amazon
multi-tenancy: single application
resource sharing: 1:1

Notes and Limitations

Right from the start, the AWS Elastic Beanstalk documentation [ABS1] clearly states that this is a solution meant to get developers or operators to quickly adopt cloud-based deployments. Furthermore, it states that once the user has a deeper understanding of the principles governing AWS, he should migrate towards using AWS CloudFormation. However, for the simple case of 1-tier web applications, or 2-tier ones where the second tier is the database, it proves a perfect match by providing a simple API to manage the application life-cycle. From a technical perspective it can be seen as a parallel to the Windows Azure cloud services solution, in that although it exposes PaaS-like functionality, each component is deployed on an individual VM, granting the code full access to the underlying OS. This enables several features --- most of which are unique to Elastic Beanstalk --- such as:

each application version can have dependencies on native packages; [ABS5]
moreover, the user can choose a customized VM image for each environment; [ABS5]
the operator is granted full SSH access to the underlying VM;
there is the option to attach AWS EBS volumes or take snapshots.

On the downside, it does not automatically handle any other services or resources, like AWS S3, DynamoDB, etc.; these are left to the user to provision and properly configure, this being one of the reasons why AWS CloudFormation is a better choice in this respect.

MODAClouds integration

As stated, AWS Elastic Beanstalk could be used in order to deploy simple web applications on top of AWS EC2, without handling the VM provisioning and container deployment ourselves.
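At the time of writing, the life-cycle automation described above can be driven from the command line; the sketch below assumes the eb/Git integration shipped with the AWS developer tools (the repository is hypothetical, and the exact commands vary between tool versions):

```shell
# initialise Elastic Beanstalk metadata inside the application's Git repository
eb init

# create the environment: provisions the EC2 instances, load balancer, etc.
eb start

# deploy the current Git HEAD as a new application version
git aws.push

# inspect the environment's health and the running version
eb status
```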
Although, in terms of functionality, it does not offer more than a hosted PaaS, like Heroku, or a deployable one, like Cloud Foundry, it does prove useful in cases where the application requires more resources than a PaaS offers; for example, in the case of demanding Java-based web applications, which require VMs from the top of the EC2 offering. Moreover, it provides an API as simple as those of the other PaaS solutions for managing the application instance.

Google App Engine

Overview

Google App Engine (GAE) is one of the first commercial PaaS solutions, and from a technical point of view it is the closest to the PaaS philosophy, that is, it completely alleviates the programmer's concerns related to the infrastructure, either software or hardware, upon which the application runs. This comes at the price of flexibility, because the developer has to adhere to a very strict set of rules. A GAE application is mainly composed of:

request handlers [GAE1] --- These are the "normal" instances available in GAE, which must conform to the Java Servlet API --- or the equivalents in other languages --- and towards which requests are routed. However, these handlers must obey some strict limitations, like response time, available hardware resources, etc.
backend handlers [GAE3] --- From a programmer's point of view they are identical to the "normal" handlers, except that some limitations are relaxed, like the possibility of running background threads, more memory, etc.; moreover, the billing model is similar to that of a classical IaaS provider.
services --- Various services already provided by Google, which are exposed to the applications via dedicated APIs in the targeted language.
versions --- Each application can be deployed and re-deployed multiple times, each deployment mapping to a particular version, which is individually accessible.

Characteristics

type: PaaS
suitability: production
application domain: web applications
application architecture: 2-tier applications
application restrictions: limited
programming languages: Java, Python, Go
programming frameworks: Java Servlets, Python WSGI
scalability: automatic
session affinity: non-deterministic [GAE1]
interaction: CLI, WUI, WS
hosting type: hosted
portability: locked
services: object store, memcache, mail, HTTP fetching, user management, XMPP
monitoring coverage: extensive
monitoring level: application
resource providers: Google
multi-tenancy: multiple organizations
resource sharing: n:1

Limitations

Access limitations:

sockets are completely disallowed, both for listening and connecting; the only way to communicate with the outside is through the APIs that Google provides (there is, however, an API for HTTP resource fetching or sending); [GAE1]
the interaction with the clients must be confined to HTTP only; [GAE1]
the application cannot mutate the local file-system; [GAE1]
(for request or task handlers) threads cannot outlive the request life-span; moreover, the threading API is custom to GAE; [GAE1]
most system- or native-related calls, like JNI, or interpreter-related calls are forbidden; [GAE1]

Quantitative limitations:

the maximum size of an HTTP request or response is 32 MB; [GAE1] [GAE2]
each HTTP request must be resolved in at most 60 seconds; [GAE1]
there is a maximum of 50 threads for each request; [GAE1]
the maximum size of a file in the package is 32 MB; [GAE1] [GAE2]
the maximum memory available for a normal instance is 128 MB; [GAE3]
the maximum memory available for a backend instance is 1 GB; [GAE3]
other quotas are high enough, especially for paid applications, that even a medium-sized [1] application should not worry. [GAE2]

Although the previous paragraphs listed only the most important limitations, GAE has a very complex quota and QoS model [GAE2]: on the temporal axis there are daily quotas and per-minute quotas referring to resource usage; on the cost axis there are the billable quotas, on a daily basis, which ensure that the application's operational budget is limited, and the safety quotas, which ensure that no application depletes the available resources.
Notes

Although GAE has serious limitations in terms of development flexibility, it compensates through a high integration with other Google products, especially Google Accounts, GMail, Google Drive, etc., allowing the developer to leverage all those additional services and easily integrate them in his applications. As such, GAE would be a prime candidate for a PaaS hosting an application based on Google's services. It seems that GAE is tuned towards applications with small response times, because the automatic scaling feature is available only for those applications where "most" requests are handled in under a second [GAE1]. Moreover, the backend handlers are not automatically scaled; however, the "dynamic" backends are automatically "woken" when needed, and then "disabled" when idle [GAE3]. An interesting feature of GAE is the support for multiple application versions, easily accessible by altering the URL's host name [GAE1]; thus an older, un-updated client is able to use an older variant of the service. Another interesting feature is the availability of the SPDY protocol --- a proposed replacement for HTTP --- which makes GAE the single PaaS currently supporting it. Related to data access, GAE provides its own set of data stores tailored for scalable applications; in the case of the Java environment the developer has the choice of either JDO or JPA compatible interfaces, together with a limited SQL-like query language, or a set of low-level interfaces to interact directly with the data store and its native data model [GAE4]. Related to additional services, again GAE provides a wide variety of services integrated by Google [GAE5]. However, if the developer needs a data store, middleware or service not part of Google's offering, the only solution is to host it outside GAE (for example inside Google Compute Engine) and access it via HTTP-based requests, which unfortunately rules out most database systems and middleware.
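The development and deployment workflow builds on the command-line tools bundled with the SDK; a minimal sketch for the Python environment (the application directory name is hypothetical):

```shell
# run the application locally, simulating the GAE services
dev_appserver.py ./my-app

# upload the application as a new version
appcfg.py update ./my-app

# download the request logs for offline analysis
appcfg.py request_logs ./my-app logs.txt
```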
MODAClouds integration

Because GAE is a very peculiar PaaS --- whose feature set is not matched by other PaaS solutions, although there are prototypes or projects cloning it --- it stands out as a unique development and deployment target, and, coupled with its access restrictions, it will require more work on our part to integrate than other hosted PaaS solutions. On the other side, the monitoring capabilities of GAE are very fine-grained, surpassing those of the other hosted PaaS solutions, making it a good fit for the monitoring platform (even if CPU utilization cannot be gathered easily, but only estimated through code instrumentation). However, it lacks the ability to directly control the scalability of the normal instances, making it unsuitable for the self-adaptation platform.

Microsoft Azure

Overview

Windows Azure is Microsoft's cloud computing solution, an umbrella for various offerings ranging from IaaS (the VM roles), through PaaS (the cloud services), to SaaS. However, in the current section we focus only on the PaaS aspects, that is, the cloud services offer. Because the Windows Azure model for cloud services closely resembles that of Google App Engine, we shall often compare features or assimilate concepts between the two solutions. Thus in Windows Azure we have [WAZ2]:

web roles --- These handle external HTTP requests, usually covering the presentation and user interaction, possibly by delegating lengthy work items to the worker roles.
worker roles --- The counterpart of Google App Engine's backend handlers, mapping to the application logic layer.

Characteristics

type: PaaS
suitability: production
application domain: web applications
application architecture: n-tier applications
application restrictions: container
programming languages: .Net, NodeJS, PHP, Java, Python
programming frameworks: any (compatible with the desired role)
scalability: manual
session affinity: sticky sessions, non-deterministic
interaction: WUI, CLI, WS
hosting type: hosted
portability: locked
services: SQL, key-value, blobs, caching, CDN, user management [WAZ1]
monitoring coverage: basic
monitoring level: container
resource providers: Windows Azure
multi-tenancy: multiple organizations
resource sharing: 1:1

Notes and Limitations

An interesting aspect of Windows Azure is how the various roles are provided: unlike in other hosted PaaS solutions, each role has its own VM, thus increasing isolation and offering more resources per instance. As in the case of Google App Engine, the various application components should exchange information via queues; however, there is also the possibility to host other types of data stores or middleware on VM roles --- the IaaS facet of Windows Azure --- thus allowing greater flexibility.

MODAClouds integration

Proving less strict than Google App Engine, and allowing similar types of applications as the other considered PaaS solutions, Windows Azure could serve as a target for deploying classical web applications. Moreover, it is currently one of the few available PaaS solutions for running .Net applications.

Deployment and execution solutions

In the previous section we have described the PaaS solutions most likely to be used in the project. However, there are applications for which even a deployable PaaS imposes unsuitable constraints; to address these issues we need to rely on even more flexible solutions, which give finer-grained control over the deployed VMs, while still providing most of the PaaS features, like deployment and scalability automation.

Juju

Overview

Juju is a Canonical-sponsored open-source project that describes itself as a solution for service orchestration [JUJ3], taking care of tasks from deployment, through configuration, up to custom tasks such as backups.
Moreover, it focuses mainly on cloud-based environments, although it currently supports only EC2-compatible APIs [JUJ1] [JUJ4]. Juju's concepts are quite straightforward:

service --- A logical construct, which can be seen as an application layer or tier, such as a MySQL cluster, or a set of application servers handling the logic, etc. [JUJ5]
unit --- A software component, a single instance of potentially multiple ones, that fulfills the tasks of a particular service. For example, it is one MySQL database instance, part of a potential cluster thereof, or a single web server that receives a portion of the load. [JUJ5]
relation --- Relations can be established between two services --- actually between all the units of one service and all the units of the other --- and are used to denote functional dependency. The matching of relation partners is done based on interfaces which one service requires and the other provides. For example, an application server might need a database, thus a database relation is formed between the two. Moreover, the same service might be in relation with multiple other service types; in our example the application might also need a cache and a messaging middleware. [JUJ5]
charm --- All service units are deployed, configured and establish relations as prescribed in a set of files that comprise a charm. [JUJ6]
hook --- The actual active components of a charm are the hooks, nothing more than executables, potentially written in any programming language, which are called by Juju at different stages of the service's life-cycle. [JUJ6]
environment --- Everything that Juju manages is in the context of an environment, which equates to a cloud provider and credentials [JUJ4]. Thus it allows the same cloud provider to be used in different environments by having a separate set of credentials (access and secret keys). Technically this requires an EC2-compatible and S3-compatible deployment, or it can be a testing environment based on LXC.

Compared with Puppet (Section ), Juju is more focused on the application life-cycle, offering more automation out-of-the-box, such as the relations between different services, and being more flexible by allowing its hooks to be written in virtually any programming language, as compared with Ruby-only for the former solution. [JUJ6] Juju's architecture and philosophy are quite simple: it starts one VM for management purposes, which then provisions and delegates actions to the other VMs where the service units run; then, to ensure the persistence of state, it uses an S3-like store to commit snapshots of the running service units, their configurations and relations, and the running VMs.
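The concepts above map directly onto the juju CLI; a minimal sketch (the charm names are illustrative examples from the public charm collection):

```shell
# start the management VM in the configured environment
juju bootstrap

# deploy two services; each unit gets its own VM
juju deploy mysql
juju deploy wordpress

# establish the database relation between them
juju add-relation wordpress mysql

# open the front-end service to inbound traffic
juju expose wordpress

# manual scaling: add another wordpress unit
juju add-unit wordpress
```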
Characteristics
- type: application deployment and execution
- suitability: emerging [JUJ1]
- application domain: generic applications
- application architecture: n-tier applications
- application restrictions: none
- scalability: manual scalability
- interaction: CLI, WS
- hosting type: deployable (open-sourced), simulated
- portability: out-of-the-box
- services: web servers, databases, middlewares [JUJ2]
- monitoring: none
- multi-tenancy: single application
- resource sharing: 1:1 (n:1 on the roadmap) [JUJ1]
- providers: EC2-compatible [JUJ1] [JUJ4]

Notes and limitations
- currently it supports only EC2-compatible APIs [JUJ1] [JUJ4], although there is an option to run a simulated version based on LXC on a single machine;
- it also requires an S3-compatible permanent storage facility [JUJ4], which is used to store the state of the cluster;
- it supports deploying only one service on each VM [JUJ1] (thus a 1:1 mapping), although there are workarounds and plans to alleviate this limitation;
- the management VM is currently a single point of failure, although the cluster state is committed at regular intervals to an S3-compatible store;
- it is tightly tied to the Ubuntu Linux distribution [JUJ9]; however, Ubuntu is currently the most popular distribution, and the default choice for most developers and infrastructure providers;
To a certain extent Juju can be seen as a very lightweight PaaS, similar to solutions like Amazon Elastic Beanstalk, Cloudify, and others. On the other hand, it focuses not on the development aspects of the application, but on the operational ones.
A nice feature of Juju is that at the time of instantiation of a particular service the operator is able to customize its behaviour by providing a set of parameters that alter the behaviour of the charm's hooks --- obviously these parameters must be implemented by the hooks themselves. Building on that, and again provided the hooks support it, the operator is able to change some of these parameters, and all the service's units then react by reconfiguring themselves accordingly. [JUJ7] Juju goes even further and allows related services to influence one another, for example by exchanging endpoints or access credentials, through configurations scoped at the relation level, which, when changed, trigger hooks for each of the partner service units. [JUJ6]
Related to the backing infrastructure, as stated, Juju requires an EC2- and S3-compatible environment. And although it depends on Ubuntu as the Linux distribution [JUJ9], it can work on plain, non-customized variants, thus it could work even for those providers that do not allow customized images. As expected, the operator is also able to specify the exact VM type and, if the provider supports it, the image or availability zone, either for the entire environment or for a particular service. [JUJ8] On the downside, Juju does not automatically destroy VMs, as they are kept to allow the operator to inspect their state. However, Juju is able to deploy a different type of unit on an available VM.

MODAClouds integration
As previously stated, we could see Juju as a very thin PaaS, thus we could use it to deploy the more conservative applications with more elaborate requirements, for which even a PaaS like Cloud Foundry would be too limiting.
The strong point of Juju in our case is that it provides very flexible and dynamic service reconfiguration and interaction. Thus it would enable us to more easily implement the dynamic scalability and reconfiguration features. Even if we do not intend to reuse Juju itself, due to its close ties with EC2-compatible providers and Ubuntu, we could reuse its charms, especially the hooks, because they already implement all the functionality needed to configure the managed service.

Cloudify

Overview
Coming from GigaSpaces, Cloudify is another open-source project that fits into the same category as Ubuntu's Juju, handling application deployment and execution on top of cloud or virtualized infrastructures. Although it fits into the same category as Juju, it sets itself apart by having explicit support for managing multiple applications, working on more cloud infrastructures, and providing basic monitoring and scalability features. The concepts are the following [CDF1]:
- application --- Defined as a set of services that, by working together, provide the required functionality. As previously stated, Cloudify is able to explicitly manage multiple distinct applications within the same infrastructure.
- service --- Just as with Juju, it is a logical construct that contains all service instances which provide the same functionality. For example, a database service might be a cluster of VMs that run the same distributed database management system, or a web front-end might be composed of multiple VMs sharing the load.
- service instance --- An individual VM that runs the required software for the respective service type.
- application recipe --- Prescribes global characteristics, such as configuration, provisioning, or scalability rules, related to the application as a whole, and lists the set of services that compose it, together with their dependencies.
- service recipe --- Prescribes the characteristics of an individual service layer, again including details such as configuration or provisioning.

Characteristics
- type: application deployment and execution
- suitability: production
- application domain: generic applications
- application architecture: n-tier applications
- application restrictions: none
- scalability: manual and automatic scalability [CDF2]
- interaction: WUI, CLI, WS
- hosting type: deployable (open-sourced), simulated
- portability: out-of-the-box
- services: web servers, databases, middleware
- monitoring level: container, application [CDF9]
- monitoring coverage: basic [CDF9]
- multi-tenancy: single organization [CDF6]
- resource sharing: 1:1 (n:1 on the roadmap) [CDF5]
- providers: EC2, OpenStack, Azure, manual [CDF3] [CDF10]

Notes and limitations
As said in the overview, Cloudify fits into the same category as Juju, therefore most observations made about Juju also hold for Cloudify; we summarize them here:
- the focus is on the operational aspects and not on the development ones;
- it allows reuse of recipes between multiple applications, and customization through recipe parameters (but see below for the drawbacks); [CDF4]
- it maps each service instance to an individual VM (thus a 1:1 mapping), although there is the intention to implement a more complex sharing mechanism; [CDF5]
- it requires the same bootstrapping through a management VM. [CDF3]
On the other hand, it also sets itself apart from Juju in the following respects:
- although it seems to be only slightly older than Juju (2011 vs 2012), the developers of Cloudify state that it is production-ready (this, however, might not correctly reflect the maturity and stability of the solution);
- it interacts with a few more IaaS providers, including manual provisioning of VMs, and offers the opportunity to add other providers by implementing drivers for them; [CDF3] [CDF10]
- it features a more complex security model, including roles and groups for users and applications; [CDF6]
- it has explicit support for running multiple applications managed through the same Cloudify instance;
- it allows basic monitoring and scalability rules; [CDF2]
- it is not tied to a particular operating system distribution. [CDF5]
However, it has major drawbacks, especially in terms of service reconfigurability and interaction, as we describe in the following paragraphs.
There is no global repository for recipes, and each application must provide within its file system hierarchy all the needed recipes [CDF4]. On one hand this is good, because we have in one place all the files needed to inspect and re-deploy another application instance with the exact same configuration, thus enabling deterministic configurability. On the other hand, it makes maintaining and updating the recipes quite difficult, requiring manual overwriting of the files. Moreover, we cannot commission new services at run-time that were not described in the application recipe, which means that any application upgrade cannot be done in place, because it requires a complete stop-start cycle. (The documentation does not explicitly state this issue, but after a careful review of the available documentation, CLI commands and API operations, there does not seem to be support for such a feature.)
The way in which services are able to influence each other, either to transmit configuration parameters (like endpoints, credentials, etc.) or to trigger custom behaviours, is rather cumbersome. Information exchange is achieved by attaching attributes at various scopes to various entities (the application, service layer, service instance, etc.), but the way in which this reconfiguration is observed is either through continuous polling, or by manually calling commands on the dependent services. [CDF7] [CDF8]
On the upside, the documentation does seem to cover all important aspects, including the API and other technical details. Another important feature is monitoring: Cloudify provides out-of-the-box basic metrics at service instance level (CPU, memory, paging activity, etc.) and support for JMX; in addition, the operator is able to write customized scripts or to install plugins that provide other metrics. [CDF9] On top of all this, there is also support for basic scalability inside the same service layer [CDF2]. However, the rules can use metrics belonging only to the current service layer, and cannot capture global behaviours that involve multiple service layers. There is also basic integration with Puppet, which allows a stand-alone variant of that deployment solution to run on each machine, thus enabling the deployment of services described through Puppet recipes.

MODAClouds integration
In MODAClouds we could leverage Cloudify to deploy applications over IaaS, just as we have described in the case of Juju. However, it would provide less flexibility in terms of application reconfiguration, although it does provide basic monitoring out-of-the-box and supports a few more cloud providers.

Deployment solutions
As stated in the previous two section introductions, some applications require various degrees of freedom, and in this section we focus on those applications that require an IaaS-based deployment.
Thus we look at a few tools that help automate the tasks of provisioning, deployment and management of VMs.

AWS CloudFormation

Overview
AWS CloudFormation is a deployment service, offered free of charge by Amazon, that allows an operator to automate the deployment of any complex application that relies on services or resources provided by Amazon. The operator must define a stack --- an application, or class of applications with a common architecture and requirements --- in terms of a template --- which can then be parameterized and instantiated multiple times --- that describes all the required resources, such as VMs, buckets, databases, and other services. Then, once pushed to the dedicated AWS CloudFormation service, the instantiation and initialization of the resources happens asynchronously, the operator needing only to watch for updates. Moreover, the operator is even able to change the template after the deployment was done, and AWS mediates the adaptation of the provisioned stack to match the new requirements [ACF1].

Characteristics
- type: application deployment
- suitability: production
- application domain: any
- application architecture: any
- interaction: WUI, CLI, WS, API
- hosting type: hosted
- portability: locked
- services: almost all AWS services [ACF2]
- multi-tenancy: multiple organizations
- resource sharing: 1:1
- providers: Amazon
Notes and limitations
The CloudFormation service provided by Amazon is unique among production-ready IaaS providers, allowing the operator to offload the provisioning step and eliminating the need for various cloud deployment tools or libraries. The downside is its locked-in nature, being tailored to Amazon's infrastructure; however, the format is simple and generic enough that with some effort it could be implemented for other providers, although then the operator would have to rely on his own implementation of the service.
Although it is focused on infrastructure provisioning, having only basic or limited support for application deployment --- like installing and configuring certain packages --- it does seem to integrate well with Puppet, thus yielding an integrated solution [ACF3].
As with other AWS products, the documentation [ACF2] is quite thorough --- both from an operator's and from a tool developer's or integrator's point of view --- and comes with sufficient examples [ACF4]. However, note that the expressive power of the templates is limited --- both in terms of syntax (JSON) and in terms of semantics --- allowing only limited control, such as single-valued parameters or simple conditionals; thus more complex decisions have to be taken outside of this framework, and a compiled template must be used instead. (For example, one can take a closer look at how the JSON syntax is twisted in the user-data properties of a VM in order to convey information about the provisioned resources.)
From a certain point of view, this approach of describing the application resources and delegating their provisioning to a dedicated service is not new, as it is captured in other standards, like OVF; research projects, like mosaic and Contrail; or application deployment systems, like Cloudify.
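To make the template structure concrete, the following minimal sketch assembles a CloudFormation-style template programmatically and serializes it to JSON. The logical names ("InstanceTypeParam", "WebServer") and the AMI id are illustrative placeholders, not working values.

```python
import json

# Minimal sketch of a CloudFormation template built in Python. The
# {"Ref": ...} construct is resolved by the service at instantiation
# time; the logical names and AMI id below are placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "InstanceTypeParam": {
            "Type": "String",
            "Default": "t1.micro",
            "Description": "EC2 instance type for the web tier",
        }
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                # Parameterized: resolved against the stack's parameters
                "InstanceType": {"Ref": "InstanceTypeParam"},
                "ImageId": "ami-00000000",
            },
        }
    },
    "Outputs": {
        "WebServerId": {"Value": {"Ref": "WebServer"}},
    },
}

body = json.dumps(template, indent=2)  # the template body pushed to the service
print(len(body) > 0)
```

The limited expressiveness discussed above shows here: a parameter can only be substituted via `Ref`; any richer logic must be performed by whatever program generates this dictionary before it is pushed.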
MODAClouds integration
If every targeted cloud provider offered a similar service, then all the MODAClouds IaaS deployment tasks could be completely delegated. Unfortunately, as only Amazon provides it, we can take it as a source of inspiration and design and prototype a similar technology that allows us to describe all the deployment steps. Then, depending on the particular provider in question, we could either translate the description into native template variants and delegate the task, or simulate a similar service ourselves.

Puppet

Overview
Puppet, developed by Puppet Labs, is an open-source configuration management solution [PUP1]. Puppet is distributed both as an open-source edition and as a commercial product with additional features. As stated in [PUP2], Puppet uses a declarative, model-based approach to IT automation:
- Define the desired state of the managed resources. This is done using Puppet's own domain-specific language or the Ruby language.
- Simulate the changes before applying them.
- Realize the model by applying the changes required to meet the defined model, restoring any configuration not conforming to it.
- Report the actions performed to enforce the model.
Puppet provides [PUP4]:
- System configuration: a declarative, domain-specific language for expressing the relations between servers, the services they host and the parts that form the services.
- Tools: clients and servers for distributing the system configuration.
- A way of realizing the configuration.
Basically, Puppet can be summarized as a backward-chaining expert system. It provides a mixture of:
- goals (in Puppet terminology: node definitions, classes)
- rules (in Puppet: resource type definitions)
- facts (information related to the managed resources)
Puppet tries to reach a specific node state (goal) by applying the rules defined in the manifests, with respect to the facts of the nodes. It should be noted that Puppet is an idempotent system: it guarantees that rules can be applied multiple times obtaining the same outcome. Puppet can be run either as a centralized service managed by a puppet master or decentralized using only the agent. The decentralized approach requires an external (user-provided) method for distributing the manifests. See [PUP5] for more information regarding these concepts:
- manifest --- A file in the Puppet language, contained in a module. A manifest generally provides a single class definition, defined type or node definition.
- module --- An organized collection/package of classes, templates and files. A module is generally specialized in performing a specific task (e.g., deploying an application).
- class --- A collection of related resources which can be declared as a single unit. Generally a class contains all the information needed for configuring a specific service.
- resource --- A unit of configuration managed by Puppet. Examples of resources: files, services, packages.
- exported resource --- Specifies a desired state for a resource; it does not manage the resource on the target system, but publishes it for use by other nodes [PUP5].
This feature is used for sharing information between nodes.
- node --- A collection of classes, variables and resources which are applied to a node (identified by FQDN or an FQDN regex).
- catalog --- A compilation of all the rules that should be applied to a node.
- fact --- A piece of information regarding a node.

Characteristics
- type: application deployment and execution
- suitability: production
- application domain: generic applications
- application architecture: generic
- application restrictions: none
- interaction: CLI, REST API, Puppet Dashboard
- hosting type: deployable
- portability: most Linux distributions (requires Ruby), Solaris and Windows [10]
- services: any (as long as a module/manifest is available)
- monitoring: no direct application monitoring; it allows monitoring of Puppet itself using the REST API, CLI, or Dashboard
- resource sharing: n:1 --- any number of classes can be applied to a node
- providers: EC2; others available in Puppet Enterprise

Notes and Limitations
- Some Linux distributions provide older Puppet versions that lack some features. An example is the RHEL-provided package (Puppet version 2.6).
- Exported resources require additional services (e.g., PuppetDB) for storing node information. [PUP8]
- Dependencies between resources deployed on different nodes can be expressed using the exported resources feature: resources that depend on resources exported by a different node are realized only after the required resource is realized on the source node. [PUP6], [PUP8]
- In some areas the open-source version of Puppet is lacking features; examples are the interfaces to cloud providers. [PUP1], [PUP9]
- There is no simple way to revert the effects of applying rules on a node. For example, if we apply the MySQL Server role to a node, we cannot easily revert the operations performed by Puppet.
- The methods for extending Puppet (e.g., adding new facts) tend to be complex, requiring Ruby modules for implementing the extensions. [PUP6]
- A major limitation is the way catalog updates are performed: by default every 30 minutes in a puppetd-based implementation. This can be worked around with additional messaging middleware, or by triggering manual updates using the provided API.

MODAClouds integration
As it provides flexible resource definition, resource management (including reconfiguration) and inventory, Puppet can be used for deploying and controlling applications. Due to its idempotency, Puppet is well suited for application deployment and resource configuration, providing a predictable configuration based on a defined model.

FP7 projects
In this section the approaches implemented by related EU projects are overviewed.

mosaic

Overview
Overall, mosaic is an FP7 research project that touches multiple subjects revolving around cloud computing, from cloud brokering and interoperability up to deployment and execution. For the purposes of the MODAClouds project we shall focus mainly on the PaaS-related outcomes of mosaic. The main concepts, detailed in [MOS1] and [MOS2], are:
- component --- The basic building block of a cloud application, the atomic deployment and execution unit, which is materialized as one OS process, or a set of tightly coupled OS processes, running in an isolated environment. There are many types of components, each type mapping to an application tier, but they are treated the same by the platform.
In general, components fit in one of the following categories: "user" components --- which embody the code developed by the user and implement the needed logic; resource or middleware components --- which provide common generic services, like data storage (Riak, CouchDB, MySQL), message brokering (like RabbitMQ), etc.; specialized components --- which are of particular use in the mosaic platform or in a cloud environment, like the HTTP gateway serving as a load balancer, or the credentials service that could hold and mediate access to sensitive information, like the cloud provider access keys.
- controller --- The orchestration service that initiates the deployment and controls the execution of the components. [MOS3]
- hub --- A bus-like or RPC-like system that allows components to discover each other, or exchange configuration messages. [MOS4]

Characteristics
- type: PaaS
- suitability: prototype
- application domain: web applications, generic applications
- application architecture: n-tier applications
- programming languages: any
- programming frameworks: any
- scalability: manual
- session affinity: dependent on the load balancer
- interaction: WUI, WS
- hosting type: deployable (open-source)
- portability: portable
- services: RabbitMQ, Riak, CouchDB, MySQL, custom
- monitoring: none
- resource providers: Amazon EC2, Eucalyptus, custom
- multi-tenancy: single application
- resource sharing: n:1

Notes and Limitations
Regarding limitations, mosaic imposes few constraints on the running components. For example, any component is able to listen on ports --- provided that it requests access beforehand --- or receive inbound requests from the Internet, not necessarily limited to the HTTP protocol. Moreover, the resources allocated to a particular component are configured by the operator, and could range up to the entire VM's resources. The only real limitations are that the component must run on Linux and must not require root access. Thus virtually any programming language or framework that runs on Linux is supported; the only customization needed is the interaction with the component controller and component hub to gain access to their services. On top of that, the components have access to native OS libraries and tools, either provided by the currently used distribution, or prepared by the developer before deploying the application.
Although from a certain perspective it offers the same functionality as other PaaS solutions, especially Heroku or Cloud Foundry, it differs in that one instance of the mosaic platform is dedicated to only one instance of a particular application, thus providing a good solution for private PaaS scenarios. However, unlike the aforementioned PaaS solutions, although mosaic usually shields the operator from low-level details such as VMs, it does offer full access when needed; thus the operator is able to pinpoint the VM where a certain component should be deployed.
Finally, the software artifacts related to the PaaS aspects of mosaic are fully open source [MOS5], and are split into many quasi-independent software services that could be reused even independently of the PaaS.
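The discovery role of the hub described above can be illustrated with the following sketch: components register the endpoints of the services they provide, and other components look them up. This is a hypothetical illustration of the concept, not mosaic's actual API; the group names and endpoints are invented.

```python
# Hypothetical sketch of a hub-style discovery registry, in the spirit
# of mosaic's component hub. Not mosaic's actual API.

class Hub:
    def __init__(self):
        self._registry = {}

    def register(self, group, endpoint):
        """A component announces an endpoint under a logical group."""
        self._registry.setdefault(group, []).append(endpoint)

    def discover(self, group):
        """Another component retrieves all endpoints for a group."""
        return list(self._registry.get(group, []))

hub = Hub()
hub.register("mq", "amqp://10.0.0.5:5672")
hub.register("store", "riak://10.0.0.7:8087")
print(hub.discover("mq"))  # prints: ['amqp://10.0.0.5:5672']
```

In mosaic the hub additionally mediates configuration exchanges between components; the sketch covers only the discovery aspect.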
MODAClouds integration
Because many of the features of mosaic map onto the requirements of MODAClouds --- such as no multi-tenancy, minimal restrictions on the running application, and fine-grained control --- it could fit two different usage scenarios, both related to IaaS deployments: first, it can be used to deploy and execute the user's application when targeting an IaaS solution; and second, it could be used to deploy and execute the MODAClouds support services.

Cloud4SOA

Overview
The Cloud4SOA project helps to empower a multi-cloud paradigm at the Platform as a Service (PaaS) level, providing an open, semantic, interoperable framework for PaaS developers and providers, capitalizing on Service Oriented Architecture (SOA), lightweight semantics and user-centric design and development principles. The system supports Cloud-based application developers with multi-platform matchmaking, management, monitoring and migration by interconnecting heterogeneous PaaS offerings across different providers that share the same technology. The Cloud4SOA project introduces a broker-based reference architecture consisting of five layers:
- The Front-end layer supports the user-centric focus of Cloud4SOA and the easy access of both Cloud-based application developers and Cloud PaaS providers to Cloud4SOA's functionalities, exposed via widgetized services which are adaptable to the user's context.
- The Semantic layer is the backbone of the architecture that puts in place the Cloud semantic interoperability framework (CSIF) and facilitates the formal representation of information (i.e. PaaS offerings, applications and user profiles). It spans the entire architecture, resolving interoperability conflicts and providing a common basis for publishing and searching different PaaS offerings.
- The SOA layer implements the core functionalities offered by the Cloud4SOA system, such as PaaS offering discovery and recommendation (matchmaking), PaaS offering and application publication (profile management), application deployment, monitoring and migration.
- The Governance layer implements the business-centric focus of Cloud4SOA, where Cloud PaaS providers and consumers (Cloud-based application developers) can establish business relationships through Service Level Agreements (SLA). Specifically, it enables the lifecycle execution and management of Cloud-based applications taking into account monitoring information, SLAs and scalability issues.
- The Repository layer acts as an intermediary between the Cloud4SOA system and the various PaaS offerings, allowing the applications to be independent from specific PaaS offering implementations. Moreover, it provides a harmonized API that enables the seamless interconnection and management of applications across different Cloud PaaS offerings, using PaaS-specific adapters deployed in each platform.
Cloud4SOA provides four core capabilities implemented by the reference architecture:
Matchmaking. The matchmaking component allows searching among the existing PaaS offerings for those that best match the developer's needs. To achieve this, the matchmaking algorithm heavily capitalizes on the Semantic layer and, especially, on the PaaS and Application models, while it distinguishes the user's needs into application requirements and user preferences. The degree of relation is computed based on the similarity of the semantic descriptions between PaaS offerings and an application profile, taking also into account the target user's preferences. In addition, the matchmaking algorithm is designed to resolve the semantic conflicts between diverse PaaS offerings and to allow matching of concepts between different PaaS providers that may use different naming or even different measurement units. The outcome of the matchmaking algorithm is a list of PaaS offerings that satisfy the user's needs, ranked according to the number of satisfied user preferences.
Management. The management component supports the efficient deployment and governance of applications on a specific PaaS offering chosen by the application developer. The module performs an analysis of the application requirements to build a specific application deployment descriptor. This descriptor is created according to the format defined by the PaaS offering that the user has selected.
It then checks whether a valid SLA contract has been previously agreed between the specific PaaS offering and the application, finally initiating the deployment process using the Cloud4SOA standard API exposed by every Cloud4SOA PaaS platform adapter. Moreover, this component provides functionality to manage the life-cycle of the application by delegating its lower-level functionality to the Governance layer.
Monitoring. In a multi-cloud scenario, it is important to continually monitor business-critical applications hosted on a variety of Cloud providers, to ensure that their performance consistently meets expectations and that Cloud resources are being used effectively. Cloud providers typically present very diverse architectures, providing dissimilar resource-level metrics used to offer fine-grained Quality of Service (QoS) guarantees. As a consequence, Cloud users are not able to compare the offerings they are adopting. In order to account for the heterogeneity of different Cloud architectures, Cloud4SOA provides a PaaS monitoring functionality based on unified, platform-independent metrics, such as latency and application status, to allow application developers to proactively monitor the health and performance of business-critical applications hosted on multiple Cloud environments.
Migration. Moving an application between PaaS offerings is a difficult operation, where several issues could arise related to the different modelling and notation of the same features across different providers. The Cloud4SOA framework aims to support seamless migration by tackling semantic interoperability conflicts. Moving an application between PaaS offerings consists of two main steps: i) moving the application data and ii) moving the application itself.
During the first step, all the application data is retrieved from the PaaS offering where the application is running and moved to the new PaaS offering; in order to avoid dirty or inconsistent states, the application is stopped before the data is moved. After the data structures have been created and initialized, the application code is deployed on the new PaaS provider.

MODAClouds integration
Cloud4SOA includes various components developed with a modular approach based on a SOA architecture and standard technologies, and exposed through public REST APIs and interfaces. This approach simplifies the integration with MODAClouds, depending on which core capabilities the MODAClouds solution wants to integrate.
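The ranking step of the Cloud4SOA matchmaking described earlier --- filtering offerings by hard requirements, then ordering them by the number of satisfied user preferences --- can be sketched as follows. The offering attributes and the flat key-value preference format are hypothetical simplifications; the real component performs semantic matching over the PaaS and Application models.

```python
# Simplified sketch of matchmaking-style ranking: keep offerings that
# satisfy all hard requirements, then rank by number of satisfied user
# preferences. The attribute names below are invented for illustration.

def rank_offerings(offerings, requirements, preferences):
    eligible = [
        o for o in offerings
        if all(o.get(k) == v for k, v in requirements.items())
    ]

    def satisfied(offering):
        return sum(1 for k, v in preferences.items() if offering.get(k) == v)

    # Most satisfied preferences first; sort is stable for ties.
    return sorted(eligible, key=satisfied, reverse=True)

offerings = [
    {"name": "paas-a", "language": "java", "db": "mysql", "region": "eu"},
    {"name": "paas-b", "language": "java", "db": "postgres", "region": "eu"},
    {"name": "paas-c", "language": "python", "db": "mysql", "region": "us"},
]
ranked = rank_offerings(
    offerings,
    requirements={"language": "java"},
    preferences={"db": "mysql", "region": "eu"},
)
print([o["name"] for o in ranked])  # prints: ['paas-a', 'paas-b']
```

Cloud4SOA additionally resolves naming and unit conflicts between providers before comparing attributes, which this flat-dictionary sketch deliberately omits.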
OPTIMIS

Overview
OPTIMIS aims at optimizing IaaS cloud services by producing an architectural framework and a development toolkit. The optimization covers the full cloud service lifecycle (service construction, cloud deployment and operation). OPTIMIS gives service providers the capability to easily orchestrate cloud services from scratch, run legacy applications on the cloud, and make intelligent deployment decisions based on their preferences regarding trust, risk, eco-efficiency and cost (TREC). It supports end-to-end security and compliance with data protection and green legislation. It also gives service providers the choice of developing once and deploying services across all types of cloud environments --- private, hybrid, federated or multi-cloud. OPTIMIS simplifies the management of infrastructures by automating most processes while retaining control over decision-making. The various management features of the OPTIMIS toolkit make infrastructures adaptable, reliable and scalable, which altogether leads to an efficient and optimized use of resources. By using the OPTIMIS toolkit, organizations can easily provision on multi-cloud and federated cloud infrastructures, and can optimize the use of resources from multiple providers in a transparent, interoperable and architecture-independent fashion.

MODAClouds integration
The unique OPTIMIS TREC solution involves choosing an optimal target platform based on trust, risk, eco-efficiency and cost data. This dynamic data, extracted from the target platforms, is used for the benefit of service providers and end users, and can be weighted to fit their needs, enabling an automated runtime solution. It can be used at runtime to dynamically manage the optimal platform for a service run, providing cross-platform scalability; even when using platforms with different hypervisors and cloud software, the OPTIMIS tools provide a common approach and methodology to achieve this.
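The idea of weighting TREC data to fit a provider's needs can be illustrated with a simple weighted score. The weights, the normalized provider figures and the sign convention (trust and eco-efficiency as benefits, risk and cost as penalties) are invented for illustration and do not reflect the actual OPTIMIS algorithms.

```python
# Illustrative weighted TREC score; not the actual OPTIMIS algorithm.
# Trust and eco-efficiency count as benefits (higher is better), risk
# and cost as penalties (higher is worse), hence the negative signs.

def trec_score(provider, weights):
    return (weights["trust"] * provider["trust"]
            + weights["eco"] * provider["eco"]
            - weights["risk"] * provider["risk"]
            - weights["cost"] * provider["cost"])

def best_provider(providers, weights):
    """Pick the target platform with the highest weighted TREC score."""
    return max(providers, key=lambda p: trec_score(p, weights))

# Hypothetical normalized TREC figures for two candidate platforms.
providers = [
    {"name": "cloud-a", "trust": 0.9, "risk": 0.2, "eco": 0.6, "cost": 0.8},
    {"name": "cloud-b", "trust": 0.7, "risk": 0.1, "eco": 0.8, "cost": 0.4},
]
# Weights express the user's priorities among the four factors.
weights = {"trust": 0.4, "risk": 0.2, "eco": 0.1, "cost": 0.3}

print(best_provider(providers, weights)["name"])  # prints: cloud-b
```

A decision engine such as the one envisioned in MODAClouds could re-evaluate such a score at runtime as fresh TREC data is extracted from the candidate platforms.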
This solution has been developed to operate within a fully managed, SLA-based OPTIMIS system, with the legalities of handling data being analysed at every stage of operation and the related business models being published. The MODAClouds decision engine could use this data when deciding on the optimal target platform to run an application.

2.2. Infrastructures
Although MODAClouds has to support applications deployed both on PaaS and on IaaS solutions, the main focus is on PaaS-based deployments, especially for those applications perfectly matching the PaaS constraints. However, IaaS deployment is still an important topic because, even when the application targets a PaaS deployment, there are cases where the PaaS itself is self-hosted on a deployable IaaS. Moreover, regardless of how the application is deployed and run, there are a few MODAClouds-provided services that have to exist alongside the application and that, due to their requirements, must be deployed on VMs. Thus there is a three-fold interest in IaaS technologies that we must be aware of: for the deployment of the application components; for the deployment of the hosted PaaS, on top of which the application runs; and for the deployment of the MODAClouds support services. Therefore, in the current section we focus our survey on those IaaS solutions that are the most promising in the contexts described above.
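As a concrete illustration of this kind of TREC-weighted selection, the following minimal sketch ranks candidate platforms by a weighted score. All platform names, scores and weights below are hypothetical and are not part of the OPTIMIS or MODAClouds APIs.

```python
# Illustrative multi-criteria ranking of candidate platforms using
# TREC-style (trust, risk, eco-efficiency, cost) scores.
# All names, scores and weights are hypothetical.

def rank_platforms(platforms, weights):
    """Return platforms sorted by weighted score (higher is better).

    Scores are assumed normalised to [0, 1]; risk and cost are
    'lower is better', so they enter the score negatively.
    """
    def score(p):
        return (weights["trust"] * p["trust"]
                + weights["eco"] * p["eco"]
                - weights["risk"] * p["risk"]
                - weights["cost"] * p["cost"])
    return sorted(platforms, key=score, reverse=True)

candidates = [
    {"name": "provider-a", "trust": 0.9, "risk": 0.2, "eco": 0.6, "cost": 0.8},
    {"name": "provider-b", "trust": 0.7, "risk": 0.1, "eco": 0.9, "cost": 0.3},
]
# Weights reflect how much the service provider cares about each criterion.
weights = {"trust": 0.4, "risk": 0.3, "eco": 0.1, "cost": 0.2}
best = rank_platforms(candidates, weights)[0]["name"]
```

The weights can be tuned by the provider or end user, mirroring how TREC data "can be weighted to fit their needs".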
Because most providers improve their offerings by adding new advanced features, or new VM types with various capacities, we will not focus on these aspects, but on the following ones, highlighted in Table 2.2.a, which heavily impact the deployment procedures and execution: hypervisor -- although the choice of hypervisor does not directly influence the deployment, it does influence the execution performance, and the kernel capabilities and requirements; custom images -- some providers do not offer support for uploading customized images, and thus the deployment process must compensate for this lack of functionality; user data -- although a seemingly unimportant feature, it provides a way to convey some minimal configuration information from the provisioning to the execution; again, the lack of this feature must be compensated by other means;
API -- how to interact with the VM provisioning process.

Provider | Hosting type | Hypervisor | Custom images | User data | API
Amazon EC2 | public, hosted | Xen | yes | yes | EC2
Windows Azure | public, hosted | unspecified | yes | no | custom
Rackspace | public, hosted | Xen | no | yes | custom
Eucalyptus | deployable, open-source | Xen, KVM, VMware | yes | yes | EC2-compatible
OpenStack | deployable, open-source | Xen, KVM, VMware, pluggable | yes | yes | custom, EC2-compatible
OpenNebula | deployable, open-source | Xen, KVM, VMware, pluggable | yes | no | custom, EC2-compatible
Flexiant | deployable, public, hosted | Xen, KVM, VMware, Hyper-V | yes | yes | custom
GoGrid | public, hosted | Xen | no | no | custom
Joyent | public, hosted | Solaris Zones, KVM | no | no | custom
Table 2.2.a IaaS characteristics

2.3. Resource allocation, load-balancing
The ability to operate a large number of resources is central to the performance of a cloud platform and to achieving high QoS levels. However, distributing a large number of incoming heterogeneous requests among a large number of heterogeneous VMs is a challenging problem in the context of IaaS platforms. Furthermore, in MODAClouds we aim at extending the QoS management capabilities of an existing deployment and execution runtime environment, mosaic, with new load balancing mechanisms. To support this activity, in this section we describe the general problem of load balancing and review some of the scheduling algorithms available for this problem. We then provide an overview of how this problem is approached in production systems.

2.3.1. Load balancing and scheduling algorithms
We consider the problem of a centralized dispatcher that receives requests for resources from a set of users. These requests are heterogeneous in the amount and the type of resources they require, as well as in the priority with which they are handled. By means of a scheduling algorithm, the dispatcher decides to which of a set of servers a newly incoming request should be assigned.
To this end, the scheduling algorithm may require more or less information on the state of the servers and the characteristics of the requests. While more information may improve the behaviour of the system (better QoS, lower cost), it also demands more resources and thus a larger overhead. We now describe a set of load balancing scheduling algorithms, considering their requirements and objectives. Random: probably the simplest scheduling policy; a server is selected randomly without additional considerations. Round-robin (RR): in this case the requests are assigned to the servers in a cyclic fashion, such that a particular server out of N will receive one request for every N requests received by the dispatcher. This policy is commonly used thanks to its low overhead and its ability to distribute the requests uniformly among the servers. Weighted round-robin (WRR): this policy generalizes the round-robin algorithm by assigning different weights to the servers, thus allowing servers with a larger weight to receive a proportionally larger number of requests. The weights can be set from information such as the server resources (e.g.
CPU clock speed), or its current load. If the weight changes dynamically (as with the assigned load), this policy imposes a larger overhead. Join-the-shortest-queue (JSQ): as its name indicates, under this policy the incoming requests are assigned to the server with the least assigned workload. The need to keep track of the servers' current load implies a larger overhead than the simple round-robin. Shortest-remaining-processing-time (SRPT): similar to the previous policy, but in this case the remaining processing time before completion, instead of the queue length, is used as the criterion to determine the server in charge of the next incoming request. Content-aware: this type of policy allocates a request by considering information such as the requested file sizes, location, and other characteristics. Among these, locality-aware policies aim at avoiding the unnecessary replication of content among many servers by clustering the servers according to the content they maintain, and routing the requests accordingly [Che00a] [Ris02]. Cycle stealing: this category of policies starts from a clustering of the servers, e.g., among those that serve short jobs and those that serve long jobs. The policy then permits a request to be assigned to a different cluster than the one it would normally join (e.g., a short job assigned to the long-job cluster) when a server there is idle. This policy aims to take advantage of idle periods in one cluster to serve jobs from another cluster, thus reducing the number of idling resources [Har03]. Many other policies have been introduced in recent years in order to consider the many factors that arise in a cloud environment. Load balancing policies used in production systems are surveyed in the following section.

2.3.2. Load balancing in production systems
In this section, we describe how load balancing is achieved on IaaS platforms and traditional non-virtualized applications by considering some of the different deployment solutions reviewed in Section 2.2.
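The generic policies described above can be sketched in a few lines. The following toy dispatcher (server names and weights are hypothetical) illustrates round-robin, weighted round-robin and join-the-shortest-queue:

```python
import itertools

class Dispatcher:
    """Toy centralized dispatcher illustrating three scheduling policies.
    Server names and weights are hypothetical."""

    def __init__(self, servers, weights=None):
        self.servers = list(servers)
        self.weights = weights or {s: 1 for s in self.servers}
        self.queue_len = {s: 0 for s in self.servers}
        # Weighted round-robin: repeat each server according to its weight.
        expanded = [s for s in self.servers for _ in range(self.weights[s])]
        self._wrr_cycle = itertools.cycle(expanded)
        self._rr_cycle = itertools.cycle(self.servers)

    def round_robin(self):
        return next(self._rr_cycle)

    def weighted_round_robin(self):
        return next(self._wrr_cycle)

    def join_shortest_queue(self):
        # Assign to the server with the least pending work and record it.
        target = min(self.servers, key=lambda s: self.queue_len[s])
        self.queue_len[target] += 1
        return target

d = Dispatcher(["s1", "s2"], weights={"s1": 2, "s2": 1})
assignments = [d.weighted_round_robin() for _ in range(6)]
# Under WRR, s1 (weight 2) receives twice as many requests as s2.
```

Note that only JSQ needs per-server state (the queue lengths), which is exactly the extra overhead the text attributes to it.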
In a production system, other considerations have to be taken into account in addition to the ones mentioned in the previous section. One of special interest is session persistence or stickiness. While for performance purposes it might be preferable to allocate a set of requests to different servers, all these requests may belong to a single user session, and all of them may need to be processed by the same server in order to successfully complete the requested service. For instance, the user may first provide some credentials to validate their access, and then select a set of products before proceeding to checkout. Once the first request is assigned to a given server, the remaining ones should be processed by the same server, which keeps all the information for that user session. The ability to handle this is called session persistence, session affinity or stickiness, and it is relevant for the operation of the load balancer. In the following we describe how load balancing is handled in production systems, where issues such as session persistence must be taken care of. Linux Virtual Server (LVS): The LVS load balancer can run IP-level load balancing (IPVS) or application-level load balancing (KTCPVS), implemented in the Linux kernel. Under IPVS every server must provide the same services, and the selection of the servers is done by the scheduling algorithm, which can be selected among the several supported. IPVS supports:
o Round-robin
o Weighted round-robin, with weights assigned according to the servers' capacity
o Shortest expected delay (similar to shortest-remaining-processing-time)
o Least-connections scheduling (LC) (similar to join-the-shortest-queue), where the request is assigned to the server with the least number of current connections.
o Weighted least-connections scheduling (WLC), a weighted version of LC
o Locality-based least-connections scheduling (LLC), where each server has a number of destination IP addresses assigned, and a request for an address is allocated to the server with that address, unless it is overloaded (in which case the least loaded server is chosen)
o Locality-based least-connections scheduling with replications (LLCR), where the destination IP address is mapped to a set of servers, and a request for that address is assigned to the server with the smallest number of connections from this set.
Session affinity is handled by means of persistent ports: when a persistent port is used, a connection template is created for the client-server pair and an entry is added to the template for the current connection. In this manner the traffic related to this connection will be handled by the same server. mosaic: mosaic is a deployment and execution environment that allows the developer to run any solution for load balancing. However, it provides the mosaic HTTP Gateway, which listens for HTTP
connections and pushes them to a RabbitMQ broker. This broker distributes the messages to the consumers according to a round-robin strategy, provided that all consumers pull messages at the same rate. In case a consumer pulls messages at a faster rate, the broker will dynamically adjust the rate. Heroku: In Heroku, load balancing is performed automatically by means of round-robin. Further, if the application is scaled up or down, the nodes are (de-)registered with the routing infrastructure automatically. However, Heroku does not support session affinity. Cloud Foundry: Cloud Foundry handles load balancing by means of the Apache web server. It supports three scheduling algorithms: request counting, weighted traffic counting, and pending request counting. The first two can be considered as weighted round-robin approaches, where the weights are assigned according to the capacity of the server. The difference between them is that in the first method the capacity is measured in terms of requests, while in the second in terms of traffic (I/O operations). The last method is equivalent to join-the-shortest-queue. Cloud Foundry supports session affinity by means of cookies, that is, assigning a cookie to a new connection in order to recognize it in the future and route it to the proper server. Amazon Elastic Load Balancing (ELB): Amazon ELB performs load balancing using the round-robin method. It also provides sticky sessions by utilising cookies to indicate the destination server. Rackspace Cloud Load Balancer: Currently Rackspace supports the following load balancing algorithms:
o Random
o Round-robin
o Weighted round-robin
o Least connections
o Weighted least connections
Rackspace also provides session persistence (sticky sessions) in two modes: one inserts an HTTP cookie in the message; the other keeps track of the source IP address and maps it to the corresponding server. Puppet: Puppet proposes four ways to distribute agent load.
o Manually distribute the load evenly among the servers.
o Use the DNS round-robin method to distribute requests to different servers according to the DNS names (multiple IP addresses are associated with a single domain name) in a round-robin manner.
o Use a load balancer to distribute the load.
o Use DNS SRV records to achieve load balancing. Each node is allocated an SRV domain name, and the SRV record directs it to the puppet master with the corresponding domain name. This requires a DNS service supporting SRV records.
Google App Engine (GAE): Google App Engine uses the DNS round-robin load balancing algorithm. Currently it does not provide sticky sessions; only replicated sessions are supported. Windows Azure: Windows Azure provides round-robin load balancing of network traffic to the publicly defined ports of a cloud service, but it does not support other load balancing algorithms and does not provide sticky sessions natively. However, Windows Azure provides sticky-session functions through its Eclipse tooling. GoGrid: GoGrid supports the following load balancing algorithms:
o Weighted round-robin
o Weighted least connect
o Source address hashing
GoGrid also provides sticky sessions, with three supported modes:
o None: the default option; no sticky session functionality is utilised.
o Session Cookie: a cookie is used to achieve sticky sessions.
o IP Subnet: requests from the same /24 IP subnet are allocated to the same server for a limited period of time.
HAProxy: HAProxy supports the following load balancing algorithms:
o Weighted round-robin
o Static weighted round-robin (static means that the weight will not change on the fly)
o Least connections
o First available (FA)
o Source (SRC): the server receiving the request is chosen from the value of the hashed source IP address divided by the total weight of the running servers.
o URI: the URI is hashed and the hash value is divided by the total weight of the running servers to obtain the server that will receive the request.
HAProxy supports sticky sessions by using cookies. Crossroads (XR): Crossroads supports:
o Round-robin
o Least connections
o First available
Crossroads provides sticky sessions by inserting custom headers in HTTP messages. Piranha: Piranha, also known as IP Load Balancing, is based on LVS and provides load balancing among a cluster of servers. It has two main features:
o Heartbeating between active and backup load balancers.
o Checking the availability of the services on each of the real servers.
Table 2.4.a summarizes the load-balancing characteristics of the deployment solutions reviewed. The scheduling algorithms are referred to with the acronyms introduced in their definitions, and a legend is supplied below for easy reference. Further, the last column shows if the deployment solution supports session affinity.

Table 2.4.a Summary of Load Balancing Methods
Provider      | R | RR | WRR | SWRR | LC | WLC | LLC | LLCR | FA | SRC | URI | SRV | Aff
LVS           |   | x  | x   |      | x  | x   | x   | x    |    |     |     |     | x
mosaic        |   | x  |     |      |    |     |     |      |    |     |     |     |
Cloud Foundry |   |    | x   |      | x  |     |     |      |    |     |     |     | x
Heroku        |   | x  |     |      |    |     |     |      |    |     |     |     |
ELB           |   | x  |     |      |    |     |     |      |    |     |     |     | x
Rackspace     | x | x  | x   |      | x  | x   |     |      |    |     |     |     | x
Puppet        |   | x  |     |      |    |     |     |      |    |     |     | x   |
GAE           |   | x  |     |      |    |     |     |      |    |     |     |     |
Azure         |   | x  |     |      |    |     |     |      |    |     |     |     | x
GoGrid        |   |    | x   |      |    | x   |     |      |    | x   |     |     | x
HAProxy       |   |    | x   | x    | x  |     |     |      | x  | x   | x   |     | x
XR            |   | x  |     |      | x  |     |     |      | x  |     |     |     | x
Piranha       |   | x  | x   |      | x  | x   | x   | x    |    |     |     |     | x
*R: random, RR: round-robin, WRR: weighted round-robin, SWRR: static weighted round-robin, LC: least connections, WLC: weighted least connections, LLC: locality-based least connections, LLCR: locality-based least connections with replications, FA: first available, SRC: source-based, URI: URI hashing, SRV: DNS SRV records, Aff: session affinity.

2.4. Application data management and migration
Data migration is the process of transferring data between storage types, formats, or computer systems [Wik13]. This process encompasses issues concerning mapping the schema of the source storage to the destination one and transforming the data from the original format to the target one (these two issues are handled in deliverable D4.1 and therefore are not further discussed here). In many cases data migration is performed offline with respect to the normal operation of the applications using such data. In some cases, however, such an operation has to happen online within guaranteed time constraints. In [Lu02] the authors propose Aqueduct, an approach to support online migration that uses a control-theoretical technique to statistically guarantee that data are transferred in the shortest possible time without significantly impacting the performance of the application being executed in the foreground. In [Kar11] the authors study the problem of migrating data across various disks, assuming that each data node can handle more than one data transfer at a time. The authors demonstrate that the problem of minimizing the number of rounds needed for transferring the data is NP-hard, but they also propose an efficient algorithm that still offers very good guarantees. In the context of MODAClouds we plan to develop a data migration approach based on reliable streams of data. This allows us to handle the case of highly dynamic data that need to be kept fully synchronized across different replicas.
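As a rough illustration (not the actual MODAClouds design), migrating live data as a reliable stream can be thought of as replaying an ordered log of change events against the target store, so the replica converges to the source state while migration proceeds:

```python
# Minimal sketch of stream-based data migration: the source emits
# ordered change events, the target replays them. Event shape
# (op, key, value) is an illustrative assumption.

def apply_event(store, event):
    """Apply one change event to a key-value store."""
    op, key, value = event
    if op == "put":
        store[key] = value
    elif op == "delete":
        store.pop(key, None)
    return store

def replay(events, target=None):
    """Replay a reliable, ordered stream of events onto the target store."""
    target = target if target is not None else {}
    for e in events:
        apply_event(target, e)
    return target

# The source's change log, in arrival order.
source_log = [("put", "a", 1), ("put", "b", 2),
              ("delete", "a", None), ("put", "b", 3)]
replica = replay(source_log)
```

Because the stream is ordered and reliable, replaying it deterministically reproduces the source state, which is what keeps highly dynamic data synchronized across replicas.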
In many real-world applications data take the form of continuous streams instead of the form of finite data sets stored in a traditional repository. This is the case of monitoring network traffic, telecommunications management, clickstreams, manufacturing, sensor networks, and many other domains. In such applications, instead of classical one-shot queries, clients need to register continuously running queries, which return new results as new data arrive on the streams. A data stream is a sequence of items received continuously and in real-time, ordered either implicitly, by arrival time, or explicitly, by means of timestamps. Not only is it typically impossible to control the order in which items arrive, but, even more important, it is not feasible to locally store a stream in its entirety [Gol03]. Due to all these peculiar characteristics, traditional database systems and data processing algorithms are not suitable for handling numerous and complex continuous queries over data streams. Ad-hoc data management systems have been studied and developed since the late nineties. One of the first proposed models for data streams was the Chronicle data model [Jag95]. It introduced the concept of chronicles as append-only ordered sequences of tuples, together with a restricted view definition language and an algebra that operates over chronicles as well as over traditional relations. OpenCQ [Liu99] and NiagaraCQ [Che00] addressed continuous queries for monitoring persistent data sets spread over wide-area networks. Another data stream management system is Aurora [Bal04], which in turn evolved into the Borealis project [Aba05], which addresses the distribution issues. In [Bab01], Babu et al. tackle the problem of continuous queries over data streams addressing semantic issues as well as efficiency concerns. They specify a general and flexible architecture for query processing in the presence of data streams. 
This research evolved into the specification and development of a query language tailored for data streams, named CQL [Ara03, Ara06]. Further optimizations are discussed in [Mun07]. A different perspective on the same issue led Law et al. [Law04] to put particular emphasis on the problem of mining data streams [Law05]. Another DSMS is Stream Mill [Bai06], which extensively considered and addressed data mining issues, specifically with respect to the problem of online data aggregation and to the distinguishing notion of blocking and non-blocking operators. Its query language (ESL) efficiently supports physical and logical windows (with optional slides and tumbles) on both built-in aggregates and user-defined aggregates. The constructs introduced in ESL extend the power and generality of Data Stream Management Systems.
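To give a flavour of such continuous queries, the following minimal sketch (not tied to any of the cited systems) computes a running average over a logical sliding window, emitting a new result as each stream item arrives:

```python
from collections import deque

class SlidingWindowAvg:
    """Continuous average over the last `size` items of a stream, in the
    spirit of a CQL-style logical window: a new result is produced as
    each new item arrives, and old items fall out of the window."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # deque drops the oldest item

    def push(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

q = SlidingWindowAvg(size=3)
results = [q.push(v) for v in [3, 6, 9, 12]]
# windows: [3], [3,6], [3,6,9], [6,9,12]
```

Unlike a one-shot query over a stored relation, the query here is long-lived: it is registered once and keeps emitting results for as long as the stream produces items.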
The problem of processing delays is one of the most critical issues and at the same time a strong quality requirement for many data stream applications, since the value of query results decreases dramatically over time as the delays sum up. In [Che06], the authors address the problem of keeping delays below a desired threshold in situations of overload, which are common in data stream systems. The framework described in the paper is built on top of the Borealis platform. As for joins over data streams, rewriting techniques are proposed in [Gol08] for streaming aggregation queries, studying the conditions under which joins can be optimized and providing error bounds for results of the rewritten queries. The basis for the optimization is a theory in which constraints over data streams can be formulated and the result error bounds are specified as functions of the boundary effects incurred during query rewriting. More recently, research has focused on exploiting parallelism in stream processing and on performing it in an elastic way (i.e., with the ability to scale out when the rate of data elements in the stream increases, and to scale in when the rate decreases). Apache S4 [Neu10] and Twitter Storm [Twi13] allow stating queries as directed acyclic graphs with parallel operators. S4 instantiates parallel computations of operators but allows controlling neither the parallelism nor the state. Storm lets users set a parallelization level and offers a way to partition a stream based on key intervals. Still, it does not offer facilities to manage the operator state and it does not support elasticity. Run-time parallelization of stateful queries is supported in StreamCloud [Gul12]. It uses a query compiler to transform high-level queries into a graph of relational algebra operators to be executed in parallel thanks to hash-based parallelization.
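Hash-based (key-based) stream partitioning of the kind used for this parallelization can be sketched as follows; the worker count and keys are illustrative:

```python
import hashlib

def partition(key, n_workers):
    """Route a stream element to a worker by hashing its key.
    A stable hash (md5 here, rather than Python's per-process hash())
    keeps all elements with the same key on the same worker, which is
    what lets a stateful operator keep its per-key state local."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_workers

workers = {i: [] for i in range(4)}
for user_id in ["u1", "u2", "u1", "u3", "u1"]:
    workers[partition(user_id, 4)].append(user_id)
# All "u1" events land on the same worker.
```

Note that adding or removing workers changes the modulus and therefore the key-to-worker mapping, which is exactly why elastic scaling of stateful operators also requires moving (or repartitioning) operator state, as discussed next.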
Some proposals for scalable stream processing systems (e.g., [Mar11]) adapt the map/reduce paradigm to low-latency stream processing, but they allow expressing only very simple continuous queries. [Fer13] is one of the most promising approaches. It exposes internal operator state explicitly through a set of state management primitives. Based on them, it describes an integrated approach for dynamic scale-out and recovery of stateful operators. Externalized operator state is check-pointed periodically and backed up to upstream VMs. The system then identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state from upstream VMs. At any point, failed operators are recovered by restoring check-pointed state on a new VM and replaying the unprocessed tuples not yet reflected in the state.
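The checkpoint-and-replay recovery scheme just described can be illustrated with a minimal, much-simplified sketch (a per-key counter stands in for arbitrary operator state):

```python
import copy

class StatefulOperator:
    """Simplified sketch of checkpoint/replay recovery: state is
    checkpointed periodically; on failure, a fresh instance restores
    the last checkpoint and replays the tuples not yet reflected in it."""

    def __init__(self):
        self.state = {}          # per-key counters stand in for operator state
        self.checkpoint = {}
        self.unprocessed = []    # tuples received since the last checkpoint

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.unprocessed.append(key)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.unprocessed = []    # these tuples are now reflected in the checkpoint

    def recover(self):
        # Restore the checkpoint on a "new VM", then replay the backlog.
        fresh = StatefulOperator()
        fresh.state = copy.deepcopy(self.checkpoint)
        for key in self.unprocessed:
            fresh.state[key] = fresh.state.get(key, 0) + 1
        return fresh

op = StatefulOperator()
for k in ["a", "b", "a"]:
    op.process(k)
op.take_checkpoint()
op.process("a")              # not yet checkpointed
restored = op.recover()      # checkpoint plus replayed backlog
```

In the real system the backlog is buffered at upstream VMs rather than at the failed operator itself; the sketch only shows why checkpoint-plus-replay reconstructs the exact pre-failure state.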
3. State-of-the-Art: Cloud Monitoring

3.1. Preamble
In the context of cloud management, monitoring cloud infrastructures and applications is a critical task. It allows us to i) gain insight into the system and the health of the running applications, and ii) perform capacity planning with scalability in mind, thus enabling adaptation decisions based on monitoring data. In general, monitoring approaches differ from various points of view, in particular: The monitoring actor (who) can be the application/service provider, the cloud provider, or a third party. Who performs monitoring has an impact on the aspects it is possible to monitor. For instance, an application/service provider usually has full control over the execution of its application components/services and can easily probe their internals. In contrast, it cannot easily access infrastructure-level information that the cloud provider may hide. What objects and properties are monitored. The monitored objects can be an application, cloud resources or some specific services such as queues. As for the monitored properties, different approaches can be distinguished from one another along the following two aspects: Types of monitored properties. Monitored properties can be functional (proper functionality) or non-functional (quality aspects) such as execution cost, response time, and throughput. Punctual vs. history-based monitoring. The monitoring is punctual if it concentrates on values collected at particular instants of the execution. The term history-based monitoring refers to the case when the analysis considers the history of the system in a certain time window in order to discover the presence (or absence) of sequences of values or events. When monitoring happens. This concerns the timing of the monitoring process with respect to the execution of the monitored system. How the monitoring system is built.
This refers to the monitoring mechanism, the expressiveness of the language and its abstraction level, the validation approach, the capability of diagnosing and handling deviations, and the runtime support. The language expressiveness. The type of monitored properties and the capability of predicating on single values and histories determine the expressiveness of the monitoring language. After deciding what we want to monitor, we need a way to express monitoring directives. Usually services are monitored through special-purpose constraints that must be checked during execution, such as compliance with promised SLAs. History-based constraints require a temporal logic to relate the values belonging to a sequence, while monitoring QoS properties requires a language allowing for their suitable representation. Abstraction level. Monitoring properties can be expressed at various abstraction levels. There is a distinction between the level at which monitoring works and the level at which the user is required to define such properties. This distinction helps characterize what the user specifies with respect to what the execution environment must cope with. Abstraction level refers to the first aspect and does not consider what the runtime support executes; the latter aspect is taken into account when considering the degree of automation intrinsic to each approach. Architecture of the monitoring environment. Monitoring constraints must be specified and then evaluated while the system executes. The support could be in the form of a modelling environment, meaning what the approach offers to the user to specify the monitoring constraints (e.g., Palladio component modelling), and an execution environment, meaning what the approach offers/requires to check directives at runtime. Usually, specification environments propose proprietary solutions, while execution environments can be based on standard technology, proprietary solutions, or suitable customizations of existing engines.
The execution environment may also include the mechanisms deployed for generating the monitoring information, such as instrumentation, reflection, and the interception of events/information exchanged in the execution environment. Filtering. Pure monitoring is in charge of detecting possible discrepancies between what is stated by the monitoring constraints and what is actually collected from the execution. Filtering and reasoning abilities enable the analysis of complex properties by combining raw data collected by the monitoring infrastructure. Monitoring output. The monitoring environment can offer its output through a specific, proprietary GUI, or it can offer APIs enabling the retrieval of monitoring data. In general, the richness of the information offered as an outcome of the monitoring activity is of paramount importance.
Derivation of monitoring constraints. Monitoring directives can be either programmed explicitly or derived (semi-)automatically from other artifacts, e.g., a design specification containing QoS information. Reactive/proactive monitoring. Reactive monitoring takes actions to solve problems in response to one or more incidents, after a problem has occurred. It is designed to analyse the direct and root causes of the problems and then take corrective actions to fix them. Optionally, it can collect data for comparison with past and future events and allow related risk assessment. For example, reactive monitoring comprises the reduction, correlation, sequencing, notification and reporting of events, automated actions and responses, and the implementation of special-purpose policies to constrain problems. Proactive monitoring implies the definition of monitoring actions that try to identify and solve problems before they occur, such as the verification of SLAs, capacity planning and the treatment of statistics to measure how the system behaves.

3.2. General Monitoring Approaches
In this section some approaches generally applicable to web service monitoring are presented, together with some cloud-specific ones that deal with all layers of the cloud stack. These are useful as a starting point to get ideas about aspects to be considered in building monitoring architectures. The COMponent Performance Assurance Solutions (COMPAS) [Mos02]: The COMPAS approach is worth considering as a framework encompassing a complete loop of performance identification for component-oriented distributed systems. It is a performance monitoring approach for J2EE systems in which components are EJBs. The framework is divided into three parts: monitoring, modelling, and prediction. Java Management Extensions (JMX) is used in the monitoring part. An EJB application is augmented with one proxy for each component EJB. Such proxies mirror in the monitoring system each component in the original application.
Timestamps for EJB life-cycle events are sent through these proxies to a central dispatcher. The performance metrics of the application can be visualized with a graphical tool. UML models with SPT annotations can be generated from the measured performance indices. During runtime, a feedback loop connecting the monitoring and modelling modules allows the monitoring to be refined by the modelling process in order to increase the efficiency of monitoring/modelling. The COMPAS architecture is depicted in Figure 3.2.a. Figure 3.2.a. The COMPAS Architecture TestEJB [Mey04]: The TestEJB approach deals with the QoS specification of components by introducing a novel measurement architecture. The target of the TestEJB framework, which is an extension to the JBoss application server, is the performance monitoring of J2EE systems, implementing an application-independent profiling technique. The framework focuses on response time by monitoring the execution times of EJBs and also traces of user calls. Implemented interceptors log invocations of EJBs and augment calls with identifiers to construct call graphs from execution traces. TestEJB uses a bytecode instrumentation library to modify components at deployment time. The approach relies on the Java Virtual Machine Profiler Interface (JVMPI) and records events for the construction and destruction of EJBs. This method allows tracing memory consumption back to individual components, but it introduces a significant overhead (in the range of seconds) and should therefore only be used in a testing environment. The monitoring architecture is shown in Figure 3.2.b.
Figure 3.2.b. TestEJB Monitoring Architecture Several possible instrumentation layers for capturing response times, with their pros and cons, are identified in this work. The bottom layer is the operating system: monitoring the entire network traffic is possible by instrumenting the kernel, so the J2EE traffic could be filtered out, but this is not portable across operating systems and EJB containers. Above the OS there is the Java Virtual Machine layer. Using interfaces such as the JVM Debugger or the JVM Profiler, it is possible to measure timing information, although filtering must be applied. Further, instrumenting the java.net classes is possible at this layer [Cza98]. The next layer for instrumentation is the J2EE layer (i.e., the EJB container), by augmenting the source code of the container, but this is not portable to other container implementations. Instrumenting the EJBs themselves is another possibility, but it captures response times only from the server point of view, not the client side. The most promising approach, which considers both client and server sides, is the non-intrusive instrumentation of the EJB container through callback-like interceptors. Performance Anti-pattern Detection (PAD) [Par08]: Monitoring is not limited to just obtaining raw data; it also concerns analyzing the data in order to detect functionality and performance issues that then trigger ameliorative adaptation decisions. The objective of PAD, which is based on the COMPAS framework, is the automatic detection of performance anti-patterns [Par06] in EJB component-based systems. The framework includes three modules: performance monitoring, reconstruction of a design model, and anti-pattern detection. Monitoring is performed at the component level and is portable across different middleware implementations as a result of using standard JEE mechanisms. There are proxies for each EJB in order to collect timestamps and call sequences.
Bytecode instrumentation is used so that a running system can be instrumented dynamically at runtime, without redeployment. The design model is then reconstructed from the measurements and the EJB deployment descriptors, similarly to the RMCM approach, capturing structural and behavioural information from the executing system [Par07]. Anti-pattern detection is performed on the reconstructed design model by means of predefined rules implemented with the JESS rule engine. Anti-patterns across or within runtime paths, inter-component relationship anti-patterns, anti-patterns related to component communication patterns, and data-tracking anti-patterns can all be distinguished. This approach has advantages compared to some common alternatives. Approaches based on Java profilers, which monitor at the JVM level, collect information on every class loaded by the JVM; the huge mass of loaded classes makes it difficult to distinguish the code of interest from lower-level middleware and library calls. Furthermore, the information is raw, and it is not trivial to determine the cause of performance issues. PAD, on the contrary, collects data at the correct level of abstraction (e.g., the component level for JEE systems) and provides sufficient
runtime context for the collected data (e.g., run-time paths [Par07], dynamic call traces [Jer97]). The framework is shown in Figure 3.2.c.

Figure 3.2.c. The PAD framework architecture

An elastic multi-layer monitoring approach [Kon12]: A peer-to-peer, scalable, distributed monitoring system is proposed in [Kon12]. It enables the deployment of long-living monitoring queries (query framework) across the cloud stack to collect metrics, and triggers policies to automate management (policy framework). The monitoring architecture is composed of three layers, data, processing and distribution, interfacing with the cloud stack at different levels, as shown in Figure 3.2.d.
Figure 3.2.d. Three-layered monitoring architecture

The data layer provides extensible adaptors to cope with resource heterogeneity. The processing layer describes complex queries (in an SQL-like syntax) over the data and also defines policy rules to be triggered when needed. The distribution layer automatically deploys the processing operators in the correct places, relying on services such as SmartFrog [Gol09], SLIM [Kir10] or Puppet [Tur07]. The monitoring framework operates in three sequential steps:
1. Metadata definition: define the available data sources.
2. Query & policy definition: evaluate information and define reactions to certain conditions.
3. Distribution & execution: deploy queries and policies using a placement strategy.

RMCM, a Runtime Model Based Monitoring Approach for Cloud [Sha10]: The Runtime Model for Cloud Monitoring (RMCM) represents a running cloud through an intuitive model. It organizes monitoring data, hiding the heterogeneity of the underlying infrastructures and platforms and presenting the system at a high level of abstraction. The objective is to provide an intuitive, operable profile of a running cloud and to use it to implement a flexible monitoring framework with adaptive monitoring capabilities, based on a trade-off between monitoring overhead and monitoring capability. The model constantly mirrors the system state [Bla09]. Adaptive actions are taken on the model rather than directly on the running cloud, so as to avoid low-level operation mistakes. There are three types of roles in the cloud, cloud operators, service providers and end users, and RMCMs are presented from the point of view of each of these roles. The integration of these views provides a comprehensive runtime model for cloud monitoring, as shown in Figure 3.2.e.

Figure 3.2.e. Overview of the RMCM approach
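RMCM's central idea, checking pre-defined rules against a runtime model rather than against the running cloud itself, can be sketched minimally as follows. The entities, attributes and thresholds are invented for illustration:

```python
class Entity:
    """A node in a minimal RMCM-style runtime model: monitored
    attributes live in the model, which mirrors the running cloud."""
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = attrs
        self.children = []

# Hypothetical model fragment mirroring one VM of a running cloud
vm = Entity("vm-1", cpu_util=0.92, mem_util=0.40)
app = Entity("app-1", avg_response_ms=310)
vm.children.append(app)

# Pre-defined rules are evaluated against the model, not the cloud itself
rules = [
    ("high-cpu", lambda e: e.attrs.get("cpu_util", 0) > 0.9),
    ("slow-app", lambda e: e.attrs.get("avg_response_ms", 0) > 500),
]

def check(entity, violations=None):
    """Walk the model and collect (entity, rule) pairs that fire."""
    violations = [] if violations is None else violations
    for rule_name, predicate in rules:
        if predicate(entity):
            violations.append((entity.name, rule_name))
    for child in entity.children:
        check(child, violations)
    return violations

print(check(vm))  # [('vm-1', 'high-cpu')]
```

Because adaptation decisions are derived from the model, an incorrect rule or threshold can be tested safely against the mirrored state before any action touches the actual cloud.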
Interaction behaviour, application, middleware and infrastructure are the categories of entities to be modelled. Infrastructure monitoring focuses mainly on resource utilization, while applications are monitored from the design and performance points of view. The monitoring is based on a server-agent architecture, as shown in Figure 3.2.f.
Figure 3.2.f. Server-agent architecture

A monitoring agent is deployed on each virtual machine and is in charge of collecting runtime information about all the entities on that VM. These entities are equipped with various monitoring mechanisms that collect runtime information at each level. The collected information is used to instantiate the corresponding RMCM, which is then checked against pre-defined rules. Administrators can view and query the monitoring information from the database, and can also modify the monitoring configuration of each agent.

The Multi-layer Collection and Constraint Language (mlccl) [Bar12]: mlccl is an event-based, multi-level service monitoring approach which defines the runtime data to be collected and how to collect, aggregate (to build higher-level knowledge) and analyze (to discover undesired behaviours) it in a multi-layered system such as the Cloud. The approach is extensible and can be used in a cross-layer manner; for instance, it is possible to monitor a platform as a service or the hypervisor by adding new indicators. Furthermore, ECoWare (Event Correlation Middleware) [Bar10], a framework for event correlation and aggregation initially developed for the monitoring of BPEL processes, has been extended to support mlccl. Data in mlccl is described through Service Data Objects (SDOs) [Ope07], a language-agnostic data structure used to facilitate communication between diverse service-based entities. The SDOs to be collected may contain two kinds of data collections: messages and indicators. Messages are used to obtain the request or response messages exchanged during service invocations. When a new service invocation occurs for which a message collection is defined, the mlccl tool produces a new SDO and outputs it to an event bus so that the designer can make further use of it.
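The message-collection mechanism just described can be sketched as follows: an intercepted service message is wrapped in an SDO-like structure and published on a simplified event bus. The real system uses SDOs [Ope07] and a Siena publish/subscribe bus; the bus, service and field names below are simplified stand-ins:

```python
import time
import uuid

class EventBus:
    """Tiny publish/subscribe bus standing in for the Siena bus
    used by ECoWare; every subscriber receives every published SDO."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, fn):
        self.subscribers.append(fn)
    def publish(self, sdo):
        for fn in self.subscribers:
            fn(sdo)

def collect_message(bus, service, payload):
    """Wrap an intercepted service message as an SDO-like dict and
    push it on the bus, as mlccl does for message collections."""
    sdo = {
        "service": service,
        "message": payload,
        "timestamp": time.time(),        # NTP-synchronised in the real system
        "instanceId": str(uuid.uuid4()), # identifies the specific service call
    }
    bus.publish(sdo)
    return sdo

bus = EventBus()
received = []
bus.subscribe(received.append)  # a downstream processor, e.g. an aggregator
collect_message(bus, "orderService", {"request": "placeOrder"})
```

A downstream ECoWare-style processor would subscribe in the same way and correlate SDOs by instanceId to rebuild per-invocation views.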
In addition to the collected messages and their location, an SDO contains a timestamp indicating when the message was sent or received by the service runtime (NTP (Network Time Protocol) timestamps are used for clock synchronization when sampling sources on different computers, with a precision in the order of 10 ms) and an instanceId, a unique ID identifying the specific service call. Indicators, in contrast, are not triggered by any particular service call; they collect periodic information about a service. An indicator can be a Key Performance Indicator (KPI), such as average response time or throughput, or a Resource Indicator (RI), such as the amount of available memory or idle CPU in a virtual machine. When mlccl's runtime calculates a new indicator value, it is wrapped in an SDO and output to an event bus for further use. The ECoWare framework encompasses three kinds of components: the execution environments, which collect runtime data through appropriate probes; the processors, which aggregate and analyze the data; and the data visualization dashboard. These components collaborate through a Siena [Car01] Publish and Subscribe (P/S) event bus.

Infrastructure monitoring in the current IoT landscape: The domain of smart objects, or IoT, is characterized by an ever increasing number and diversity of smart objects, together with a growing volume of produced data and events. The world of IoT is currently evolving from a collection of simple sensors and actuators, controlled by rather rudimentary services, into a truly smart environment with interacting objects. While current research has achieved impressive results for different aspects of IoT technologies, it is apparent that current solutions and research activities fail to address some of the key issues required for the full deployment of a secure and reliable IoT world. A smart object is essentially a building block for services and should be treated as such.
A fully fledged vision of IoT amplifies the requirements on openness and scalability, universal access and security, performance, and accountability. All of these requirements need to be addressed in a systematic way, rather than by ad-hoc, tailored solutions confined to specific provider or vendor silos. Commercially available IoT systems are often unable to inter-operate, even when deployed in the same physical environment. Clearly, such limitations need to be overcome in order to bridge the IoT, services, people and business worlds. Several running FP7 projects, such as IoT-A, have as their main objective to break this vertical silo structure and allow all technologies to be used by all domains. Because IoT eco-systems typically run in low-reliability environments, different degrees of autonomic capability are being developed, relying intensively on the monitoring of key parameters and performance indicators. Since the current approach is to view things as service exposures, infrastructure monitoring additionally supports automatic steps such as service discovery and repair. A key technology in this field is Complex Event Processing (CEP), and consequently all architectural approaches follow event-driven principles. A special case of IoT services in which monitoring is very relevant is self-healing, since most scenarios require critical safety procedures to be in place, as in power grids, for example.
Another consideration is that, due to the highly distributed model, a centralized monitoring infrastructure would be an unacceptable single point of failure, in addition to suffering the inherent delays and noise of networked environments. Such systems make use of techniques like the CEP previously mentioned and publish/subscribe strategies, and go as far as full-scale data fusion, an area strongly connected with the so-called Situation Awareness or Context concepts. Extended information gathering can be labelled action information fusion, because an assessment of possible and plausible actions can be considered. The goal of any fusion system is to provide the user with a set of refined information for functional action. Taken together, the user and the fusion system form a functional sensor, whereas the traditional fusion model is just an observational sensor. The user refinement block determines not only who wants the data and whether they can refine the information, but also how they process the information into knowledge. Bottom-up processing of information can take the form of physically sensed quantities that support the higher-level functions. To include machine automation as well, it is important to provide an interface through which the user can interact with the system (i.e., the sensors): the top-down user can query the information necessary to plan and decide, while the automated system works in the background to track the changing environment. Finally, the functional fusion system is concerned with the entire situation, while a user is typically interested in only a subsection of the environment.
It should also be considered that a system capable of publishing and enforcing its own policies over a certain infrastructure can be assimilated to a user.

Infrastructure-Level Monitoring: Infrastructure-level monitoring involves collecting metrics related to CPU, memory, disk and network from an IaaS platform, either via monitoring probes deployed inside VMs or using services provided by the cloud platform itself (e.g., Amazon CloudWatch). MODAClouds will consider both types of sources to acquire this information. Therefore, in this section we review current standard monitoring tools and cloud-specific tools that may be appropriate for this purpose. A specification of the monitoring system that will be adopted in MODAClouds is available in deliverable D

Guest VM monitoring: Common tools for monitoring performance metrics inside VMs include collectl, sar and SIGAR. These tools are meant to be used inside individual VMs to acquire performance data; they are not meant to provide a solution for monitoring distributed systems, whether applications or machine clusters.

collectl. collectl collects system information about CPU, memory, disk and network. It does not generate graphs itself, but other programs can be used to graph its output, and only basic alerting functionality is available, currently limited to simple notification and threshold control. In the related collectd daemon, everything is done through plugins, so there are no external dependencies, although most plugins are Linux-only; collectd also provides plugins to integrate with other projects, for instance receiving data from the gmond daemon of the Ganglia project. These tools are mainly used on Linux systems.

System Activity Reporter (sar). sar can monitor metrics related to overall system performance, such as CPU, disk, I/O and processor information. Data gathered by sar can be presented graphically using sag (system activity grapher). sar can monitor only Linux systems.

System Information Gatherer And Reporter (SIGAR).
SIGAR is a free software library that provides a cross-platform, cross-language programming interface to low-level information on computer hardware and operating system activity. The core API is implemented in C, with bindings for Java, Perl, Ruby, Python, Erlang, PHP and C#. It gathers system information such as CPU, memory, network and file system data (e.g., the usage of a mounted file system). SIGAR runs on Linux and Windows.

Host machine monitoring: esxtop. esxtop is used to analyse real-time performance data inside a VMware ESX or ESXi server. It can collect CPU, interrupt, memory, network, disk adapter, disk interface, disk VM, and power management information.
xentop. xentop is included in XenServer and displays real-time information about the server system. It can collect CPU, network, memory and disk information.

Cloud infrastructure-level monitoring: Ganglia. Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. The Ganglia monitoring daemon (gmond) collects dozens of system metrics: CPU, memory, disk, network and process data. The Ganglia web frontend provides a view of the gathered information via real-time dynamic web pages. Ganglia runs on Linux and Windows.

Nagios. Nagios offers the ability to monitor applications, services, operating systems, network protocols, system metrics and infrastructure components with a single tool. Nagios can respond at the first sign of a problem and automatically fix problems when they are detected, for example restarting a failed service when predefined conditions are met. Nagios runs on Linux and on Windows (using add-ons).

MonALISA. MonALISA is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of information gathering and processing tasks. MonALISA can collect local host information such as CPU, memory, network traffic, disk I/O, processes and sockets in each state, LM sensors, and APC UPSs. The MonALISA service is mainly used on Linux.

JASMINe. JASMINe Monitoring is a set of tools for supervising a distributed infrastructure. JASMINe can collect information such as CPU usage, memory usage and JDBC connections, and offers a web console to visualize the monitoring data. JASMINe can run on any platform.

Zabbix. Zabbix is open source performance monitoring software for enterprise environments.
Zabbix can monitor the performance of CPU, memory, network, disk space and processes, as well as hardware information. It can also monitor network devices (routers and switches), databases, Java application servers and web services. Zabbix is available for Linux, Unix and Windows and does not require a specific runtime environment such as Java. Zabbix has flexible network discovery functionality based on IP ranges, the availability of external services (FTP, SSH, WEB, POP3, TCP, etc.), information from the Zabbix agent and information from SNMP agents. These functionalities could be considered when developing dynamic behaviour for monitoring data collectors. Zabbix is used by the BonFIRE infrastructure developed in FP7.

Cloud-specific monitoring: Amazon CloudWatch. CloudWatch enables monitoring of Amazon cloud resources and services. It provides monitoring metrics for CPU, disk, network and system status checks, as well as custom metrics such as memory usage and error rates. CloudWatch supports alarms that trigger notifications on predefined metric values, and provides graphs and statistics for selected monitoring metrics. CloudWatch is neither open source nor free.

Cloudify. Cloudify is an open source PaaS for business-critical applications, enabling on-boarding and scaling to any cloud. Cloudify supports monitoring through a web management console or through the Cloudify shell. By default, Cloudify monitors CPU and memory; it also supports monitoring probes and plugins, such as JMX, to conduct monitoring. Alerts can be set inside the web management console.

Rackspace Cloud Monitoring. Rackspace Cloud Monitoring is a service provided by Rackspace to monitor applications in the cloud (not limited to the Rackspace cloud). It supports various monitoring metrics, including CPU, disk, memory, network, processes and custom metrics.

CopperEgg Reveal*.
CopperEgg is a cloud computing systems management company whose products provide monitoring for websites, web applications, servers, and systems deployed in the cloud. RevealCloud is a tool for server monitoring; RevealUpTime for website and web application monitoring; RevealMetrics for custom monitoring metrics. CopperEgg products support almost all the major public cloud providers and platforms. CopperEgg products are not free.
New Relic. New Relic is an application performance management company whose products provide user, application and server monitoring. Real user monitoring provides real-time performance metrics such as page load time, page views and Apdex score 1, identifies poor performance patterns, and supports alerts and notifications. Application monitoring provides metrics such as throughput, response time and Apdex score, together with transaction tracking and reporting; it also supports alerts and capacity analysis. Server monitoring provides performance metrics such as CPU, memory, network and I/O. New Relic products are not free.

AppDynamics. AppDynamics is an application performance management company whose products focus on performance management in the cloud. AppDynamics provides real-time monitoring and end-user monitoring, and supports troubleshooting by identifying bottlenecks, detecting transaction anomalies and diagnosing code. It provides detailed reports and a visualised dashboard for statistics and for comparisons between releases. AppDynamics products are not free.

Application-Level Monitoring: This section is devoted to application-level Cloud monitoring, which is of significant importance, especially in the context of Cloud application SLA management. Since virtualization is the basis for resource sharing, multiple virtual machines (VMs) can run on a single physical machine, and even multiple applications can run on a single VM. As a result, per-application monitoring in such a shared environment is essential to keep applications healthy and to guarantee QoS: it is not enough to monitor only a physical machine or a VM in order to measure application resource consumption, detect SLA violations and manage resources efficiently.

A Run-time Correlation Engine (RTCE-based approach) [Hol10]: Log analysis is a common monitoring task used to gain insight into system behaviour.
Especially in the context of application-level monitoring, it provides information about application health. However, monitoring and analysing the behaviour of the various components in a Cloud environment is challenging, due to the inherent complexity and large scale of Clouds and the large amount of data to be managed. The approach introduced here, built on top of the RTCE (Run Time Correlation Engine), provides a scalable log correlation, analysis and symptom matching architecture that can perform real-time correlation for large volumes of log data. The huge mass of log files produced by software components calls for logging facilities to analyse them; however, both the facilities and the log files are usually application- or vendor-specific, and additional problems arise in heterogeneous environments. Coherent, runtime extraction of meaningful information from these log files is therefore a challenging issue. The RTCE framework encompasses four functionalities, also depicted in Figure 3.4.a: (a) automatic data collection, (b) data normalization into a common format, (c) runtime correlation and analysis of the data to give a coherent view of system behaviour at run-time, and (d) a symptom matching mechanism identifying errors in the correlated data [Hol09]. The log data of each application is read by monitoring agents (MA), which are deployed on the different hardware components, and routed in the form of events to the Event Correlation Engine (ECE) via TCP/IP connections. The ECE processes these events and presents the data on the web server (Tomcat); by interacting with the web server, users can obtain different views of the information held in the RTCE core, such as a single view of correlated data from multiple sources, log statistics, and reports on automated real-time matching of known symptoms. Logs are converted to a single common format.
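The normalization into a common format and the subsequent symptom matching can be illustrated with a small sketch. The two log formats and the symptom rule below are invented for illustration and are not RTCE's actual formats:

```python
import re

# Two hypothetical vendor-specific log formats, each with a parser that
# normalises a line into a "common format" (here, a plain dict).
PARSERS = [
    # e.g. "2013-03-01 12:00:01 ERROR OrderService timeout"
    re.compile(r"(?P<ts>\S+ \S+) (?P<level>\w+) (?P<src>\w+) (?P<msg>.+)"),
    # e.g. "[OrderService] 2013-03-01T12:00:05 - timeout (ERROR)"
    re.compile(r"\[(?P<src>\w+)\] (?P<ts>\S+) - (?P<msg>.+) \((?P<level>\w+)\)"),
]

def normalise(line):
    """Map a vendor-specific log line to the common format, or None."""
    for pattern in PARSERS:
        m = pattern.match(line)
        if m:
            return m.groupdict()
    return None

def match_symptom(events, source, level="ERROR", threshold=2):
    """A known symptom: the same source repeatedly logging ERROR events."""
    hits = [e for e in events if e and e["src"] == source and e["level"] == level]
    return len(hits) >= threshold

raw = [
    "2013-03-01 12:00:01 ERROR OrderService timeout",
    "[OrderService] 2013-03-01T12:00:05 - timeout (ERROR)",
]
events = [normalise(line) for line in raw]
print(match_symptom(events, "OrderService"))  # True
```

Once every line is in the common format, correlation across heterogeneous sources reduces to queries over uniform records, which is what makes real-time symptom matching feasible at scale.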
One limitation of the RTCE framework is scalability in cloud computing environments, which may consist of hundreds or even thousands of applications whose logs the current RTCE module may not be able to handle. The long-term solution presented in this approach is the design of a new distributed and scalable ECE module: a master ECE would be responsible for correlating all the distributed ECE instances, and each distributed ECE would need one extra capability, namely the ability to send its statistics data to the master ECE via TCP/IP.

1 The Application Performance Index (Apdex) is a standard method to report the performance of software applications, in particular the extent to which a given application meets its users' expectations.
Figure 3.4.a. The RTCE Current Architecture

CASViD, an SNMP-based monitoring approach for SLA violation detection [Eme12]: Application-level monitoring is a challenging task, as infrastructure and platform layer metrics need to be mapped to the required metrics at the application layer for the purpose of SLA management. CASViD (Cloud Application SLA Violation Detection) aims at monitoring and detecting SLA violations at the application layer, and includes tools for resource allocation, scheduling, and deployment. The architecture, shown in Figure 3.4.b, works as follows. Customers place their service requests through a defined interface at the front-end node (step 1), which acts as the management node of the Cloud environment. The VM configurator sets up the Cloud environment by deploying preconfigured VM images on physical machines (step 2), making them accessible for service provisioning. A request is received by the service interface and delivered to the SLA management framework for validation (step 3), which ensures that the request comes from the right customer. The request is then passed to the application deployer (step 4), which allocates resources for the service and deploys it in the Cloud environment (step 5). After the service application has been deployed, CASViD monitors its execution and sends the monitored information to the SLA management framework (step 6) for processing and for the detection of SLA violations.

Figure 3.4.b. The CASViD Architecture

The VM configurator and the application deployer are components for allocating resources and deploying applications in the Cloud testbed, included in the architecture to provide a complete solution. The Application Deployer is responsible for managing the execution of user applications, similarly to brokers in the Grid literature [1], [16], [26], [30], and supports parameter sweeping executions [11]. It simplifies the tasks of transferring the application input data to each VM, managing the execution, and collecting the results. The mapping of application tasks to VMs is performed dynamically by a scheduler, with slave processes consuming tasks whenever their VM is idle; further details on this component can be found in previous work [17]. The architecture is generic in its usage, as it is not designed for a particular set of applications: the service interface supports the provisioning of transactional as well as computational applications, and the SLA management framework can handle the provisioning of all application types based on the pre-negotiated SLAs (the negotiation process and its components are discussed by Brandic et al. [8]). Service provisioning management and the detection of SLA objective violations are performed by the SLA management framework, a central component that interacts with the service interface, the Application Deployer, and the CASViD monitor.

CASViD contains a flexible monitoring framework based on the SNMP (Simple Network Management Protocol) standard [12]. It receives instructions to monitor applications from the SLA management framework and delivers the monitored information back. The monitor follows the traditional manager/agent model used in network management: the manager, located in the management node, periodically polls each agent in the cluster to obtain the monitored information and, to enhance scalability, uses asynchronous communication with all cluster agents. The monitor is composed of a library and an agent. The agent implements the methods to capture each metric defined in the CASViD monitor MIB (Management Information Base). On the manager side, the monitor library provides methods to configure which metrics should be captured and which nodes should be included in the monitoring; this library is used by the SLA management framework to configure the monitoring process and retrieve the desired metrics, which can be gathered from application or operating system log files. Like other monitoring systems [21], [29], CASViD is general purpose and supports the acquisition of application metrics (SLA parameters) as well as system metrics such as CPU and memory utilization; the metrics to be monitored depend on the application type and on how its performance is to be ensured. The strategy for detecting SLA violations is based on predefined threat thresholds, which are more
restrictive than the violation thresholds. With this information the system can react quickly to avert the violation threat and save the Cloud provider from costly SLA violation penalties.

A multi-layer approach for cloud application monitoring [Gon11]: Hierarchical monitoring and analysis is a methodology for refining monitoring data and analysis results in order to achieve higher precision while reducing the amount of data to be analysed. In the context of Cloud computing it can be exploited to lighten the load (i.e., the amount of data to be analysed) and to reason on monitoring data. This work proposes a three-dimensional approach for cloud application monitoring, encompassing the Local Application Surveillance (LAS), Intra Platform Surveillance (IPS) and Global Application Surveillance (GAS) dimensions, whose interconnections and subcomponents are shown in Figure 3.4.c. LAS monitors the application instance to check for rule violations. For further analysis, the output of the LAS is sent to the assigned IPS, an additional monitoring mechanism at the level of one particular VE, which analyses data from the different VMs running on the same machine, looking for issues arising from the interaction between VMs and between the applications running on the same VM. The filtered results are then sent to the GAS components for further analysis.

Figure 3.4.c. LAS, IPS and GAS layers

The aim of the optional GAS component, of which one is assigned per application (not per instance), is to monitor the software and to detect modelling and implementation problems by analysing data referring to the same application from different machines (i.e., from several IPS components).
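The hierarchical LAS to IPS to GAS refinement can be sketched as three filtering and aggregation stages, each forwarding less data than it receives. The rule (a CPU threshold) and all values below are invented for illustration:

```python
def las(samples, threshold=0.8):
    """Local Application Surveillance: keep only rule violations
    observed on one application instance."""
    return [s for s in samples if s["cpu"] > threshold]

def ips(per_vm_reports):
    """Intra Platform Surveillance: aggregate the filtered reports
    coming from the VMs of one physical machine."""
    return {vm: len(violations)
            for vm, violations in per_vm_reports.items() if violations}

def gas(per_machine_summaries):
    """Global Application Surveillance: combine summaries from several
    IPS components referring to the same application."""
    total = sum(n for summary in per_machine_summaries
                for n in summary.values())
    return {"total_violations": total}

vm1 = las([{"cpu": 0.95}, {"cpu": 0.30}])   # one violation kept
vm2 = las([{"cpu": 0.50}])                  # nothing forwarded upward
machine = ips({"vm1": vm1, "vm2": vm2})     # {'vm1': 1}
print(gas([machine]))                       # {'total_violations': 1}
```

Only violations cross the LAS boundary and only per-VM counts cross the IPS boundary, which is precisely how the hierarchy reduces the volume of data that the global layer must analyse.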
The global view of the GAS components reveals the behaviour of the software in different virtualized environments, allowing proper conclusions to be drawn for both the application's users and its developers.

Cloud Application Monitoring: the mosaic Approach [Rak11]

The mosaic API facilitates the building of custom monitoring systems for Cloud applications. The mosaic approach as a whole comprises four modules: the API, the framework (i.e., the platform), the provisioning system, and the semantic engine. The API and the framework aim at the development of portable and provider-independent applications. The provisioning system works at the IaaS level and takes care of resource management; its functionality is part of the Cloud agency [Ven11]. The framework is a collection of predefined Cloud components used to build complex applications; it constitutes a PaaS enabling the execution of complex services with predefined interfaces. The mosaic SLA management
components are also part of the framework. The API offers the implementation of a programming model in a given language (currently Java, with Python planned) to build applications. The API provides new concepts (e.g., the Cloudlet or the Connector) that let developers focus on Cloud resources and communications instead of resource-access or communication details. The mosaic architecture is depicted in Figure 3.4.d.

Figure 3.4.d. The mosaic Architecture

Regarding the monitoring of Cloud-based applications, two separate levels can be distinguished: infrastructure-level and application-level monitoring. Infrastructure-level resource monitoring aims at measuring and reporting system parameters related to the real or virtual infrastructure offered to the user (e.g., CPU, RAM, or data storage parameters); in mosaic it is provided by the Lattice framework (also utilized in the RESERVOIR project), which has a minimal run-time footprint and is not intrusive, so as not to adversely affect the performance of the system itself or of any running applications. Lattice defines a system of data sources, data consumers, and control strategies that influence the collection of monitoring data; the data can be transported over IP multicast, an Event Service Bus, or publish/subscribe mechanisms. At this level, some of the better advertised Cloud-monitoring solutions include Nimsoft Monitor, Monitis, or the Tap In Cloud Management Service; these services cover different subsets of Cloud-service providers. At the application level, instead, the nature of the monitored parameters and the way their values should be retrieved depend on the actual software being monitored, and not on the underlying Cloud infrastructure. VMware vFabric Hyperic, for instance, specializes in web-application monitoring and is capable of collecting information from common monitoring systems (e.g., Ganglia, Nagios, SNMP-based applications). A more general approach is to utilize the JMX Java framework, employed by most Java application containers, which can provide information on the status of the applications running in the container; this, however, requires that the application is written in Java and is prepared to publish information through the JMX subsystem.

In mosaic, resource monitoring is implemented by the Cloud agency. The Archiver, a monitoring agent offered by the agency, collects resource-related monitoring information from the agents running on the monitored resources (e.g., the CPU usage at given time intervals) and stores the messages in a storage system. The Observer component generates events on the (resource) event bus by accessing the storage filled by the Archiver; integrated in the Cloudlet, it is able to retrieve all the monitored events. The Application Monitoring connector is responsible for application component monitoring and for generating events on the connected buses; application components that need to share monitoring information publish and manage the related events through it. The mosaic developer has the role of developing the monitoring Cloudlet, i.e., the applications that collect and process the information provided by the monitored components.

The mosaic monitoring API offers a set of connectors representing an abstraction of resource monitoring and a set of drivers implementing different ways of acquiring monitoring data (from different techniques). It therefore supports monitoring by (i) offering a way to collect data directly from any of the components of a mosaic application, (ii) offering a way to collect data for any monitored, provider-related resource, and (iii) providing the mosaic monitoring tools (called the m/w, monitoring/warning, system), in order to access data regardless of the technology of the acquired Cloud resources and of the way they are monitored. The aim of the set of mosaic monitoring tools, offered by the mosaic framework, is to give the ability to build up a dedicated monitoring system.

M4Cloud, a Generic Application Level Monitoring approach [Mas11]

This model-driven approach classifies and monitors application-level metrics in shared environments such as the Cloud. The basis for the implementation of the monitoring phase is the Cloud Metric Classification (CMC). CMC identifies the following four models: application based (e.g., generic/specific), measurement based (e.g., direct/calculable), implementation based (e.g., shared/individual) and nature based (e.g., quantity/quality). The application based model supports the distinction of the metrics on the basis of the application they belong to; the measurement based model defines the formulas from which metrics can be calculated; the implementation based model defines, for each metric, the corresponding measurement mechanisms
coherently with the formulas defined at the previous step; finally, the nature based model defines the nature of the metrics and their definition within SLAs. More information on the models can be found in the original article [Mas11]. CMC is part of the M4Cloud framework, shown in Figure 3.4.e. In this framework, the FoSII infrastructure [Bra10] is used as a Cloud Management System (CMS). Monitored data is analysed and stored within a knowledge database and then used for planning actions. Moreover, monitored data is also acquired and analysed after the execution of such actions, for the purpose of efficiency evaluation.

Figure 3.4.e. Architecture of M4Cloud

REMO, a Resource-Aware Application State Monitoring approach [Men08]

Cost effectiveness and scalability are among the main criteria in developing monitoring infrastructures for large-scale distributed applications. REMO addresses the challenge of constructing monitoring overlays from the cost and scalability points of view, jointly considering inter-task cost-sharing opportunities and node-level resource constraints. Processing overhead is modelled in this approach on a per-message basis. The approach deploys a forest of optimized monitoring trees through iterations of two phases, exploring cost-sharing opportunities between tasks and refining the trees with resource-sensitive construction schemes. In each iteration a partition augmentation procedure generates a list of the most promising augmentations for improving the current distribution of workload among trees, using cost estimation to limit the length of the list. These augmentations are then further refined through a resource-aware evaluation procedure, and the monitoring trees are built accordingly (through the resource-aware tree construction algorithm).
An adaptive algorithm is also considered for balancing the cost and benefits of the overlay, which is especially useful for large-scale systems with dynamic monitoring tasks. Planning the monitoring topology and the collection frequency are important factors in keeping a balance between monitoring scalability and cost effectiveness. The drawback of the approaches proposed to date is that they either build monitoring topologies for each individual monitoring task (e.g., TAG [Mad02], SDIMS [Yal04], PIER [Hue05], join aggregations [Cor05], REED [Aba05], operator placement [Sri05]) or use a static topology for all monitoring tasks [Sri05]; neither option is optimal. For instance, two monitoring tasks may collect data over the same nodes; in such a case it is more efficient to use a single monitoring tree for data transmission, as nodes can merge updates for both tasks and reduce the per-message processing overhead. It is therefore important to consider topology optimization at the multi-monitoring-task level for the sake of monitoring scalability. Load management is another important factor to be considered in monitoring data collection, especially for data-intensive environments, meaning that the
monitoring topology should be able to control the amount of resources spent to collect and deliver the data; ignoring this aspect may lead to overloading and, consequently, to loss of data. The REMO approach addresses all these issues by considering node-level resources in building the monitoring topology, optimizing the topology for scalability, and ensuring that no node is assigned more monitoring workload than its available resources can support. The three main contributions of this approach are as follows. First, it identifies three critical requirements of large-scale application monitoring: sharing message processing cost among attributes, meeting node-level resource constraints, and efficient adaptation to monitoring task changes. Second, it proposes a monitoring framework that optimizes the monitoring topologies and addresses the above-mentioned requirements. Finally, techniques for runtime efficiency and support are developed as well. Figure 3.4.f shows the high-level model of REMO, encompassing four components: the task manager, the management core, the data collector and the result processor. The functionalities of each of these components are summarized in the figure.

Figure 3.4.f. The REMO Architecture

Cloud4SOA

Cloud4SOA monitoring offers a unified, platform-independent mechanism to monitor the health and performance of business-critical applications hosted on multiple Cloud environments, in order to ensure that their performance consistently meets the expectations defined by the SLA. To cope with the heterogeneity of different PaaS offerings, Cloud4SOA provides a monitoring functionality based on unified, platform-independent metrics.
The Cloud4SOA monitoring functionality leverages a range of standardized and unified metrics of a different nature (resource/infrastructure level, container level, application level, etc.) that, mapped onto the disparate underlying cloud providers, allow the runtime monitoring of distributed applications so as to enforce end-to-end QoS regardless of the PaaS across which they are deployed. In the scope of Cloud4SOA several metrics have been defined (Table 3.4.a), from the cloud resource as well as the business application perspective, but not all of them are enforced at runtime, since some only provide useful information about the status of the application.
Table 3.4.a. Cloud4SOA Metrics

Metric | Description | App. | Cloud
CPU load | The amount of computational work that the application performs | X |
Memory Load | The amount of memory consumed by the application | X |
HTTP Response Code | Includes custom status messages to understand the health of the application, but also the performance of the cloud | X | X
Application and DB Response Time | Time that measures the efficiency and speed with which servers deliver requested web content to end users | |
Application Container Response Time | The elapsed time between the end of an inquiry or demand on a cloud system and the beginning of a response | |
Cloud Response Time | The time the Cloud needs to process and forward to the application the incoming call | |

3.5. Cost Monitoring and Measurement Indexes

A common measure of the cost of IT is the Return on Investment (ROI) calculation. Analysts such as Daryl Plummer from Gartner have reviewed the use of ROI calculations in the monitoring and measurement of cloud services; his discussions focus on industrial monitoring and measurement of cloud services. Other analysts, such as Trevor Pott, look at the complexity of calculating metrics such as ROI in a cloud environment with different infrastructures and legacy operations all in the mix. The single issue that strikes any reviewer of the state of measurement and monitoring of cloud services is that there is no agreed way of measuring or comparing cloud services. This lack of agreement or commonality has led to initiatives from commercial organizations and standards bodies. The World Wide Web Consortium (W3C) initiated an incubator project on the Unified Service Description Language (USDL) as a way of generating consensus. The lack of a public standard for monitoring cloud services is not a problem for private cloud implementations: in-house staff or service providers in a private cloud environment can facilitate interoperable monitoring without concerning themselves about external service measurement.
Private clouds are the predominant environments in corporations at the present time; however, there is a growth in rogue IT and in the use of external services, both of which demand a more standardized means of monitoring and measuring cloud services. A few initiatives are active in the description of cloud services in index form. Their goal is to create a standard way of comparing services during the selection process. W3C's USDL incubator is one such initiative. The Cloud Service Measurement Index Consortium (CSMIC) is another, formed by a number of organizations to describe measures for use in the comparison of service behaviour at service selection time.

Unified Service Description Language

USDL is the name given to an incubator project of the W3C. USDL extends the state of the art in many fields of service description and is seen as an extension of work done on the semantic web in general and linked data in particular. USDL is seen as a language-based method of aligning business services by using a common description. The incubation group has completed its work and has delivered a report that contains its recommendations. It is clear from the report that USDL requires additional work to make it valid for use with cloud services. In particular, there are requirements to create module-specific processes as well as descriptions. A good example of this is the legal module, which will need different processes for each jurisdiction. Another identified extension to USDL is a specific query language. Little work appears to have been completed since the report was published.

Service Measurement Index

The Service Measurement Index (SMI) is a standardization initiative managed by the Cloud Services Measurement Index Consortium, led by Carnegie Mellon University.
The consortium has a current membership of 17 organisations from all areas of IT, including universities, benchmarking specialists, software houses and
systems integrators. All have an interest in defining a common method for describing services. The consortium meets regularly to develop definitions of service attributes and measures. These are not measures for the continuous monitoring of cloud service performance, but for service selection; some of them, however, can be modified to describe continuous monitoring measures. SMI is defined as a set of measures, each of which describes an attribute that is part of a service category. Table 3.5.a contains the list of attributes that have been prioritized for the definition of measures. This is the current list and will be expanded later in the exercise. The financial category is of particular interest to MODAClouds as a source of information and potential cost management information.

Category | Selected Attributes
Accountability | Compliance, Ease of Doing Business, Provider Certifications, Provider contract/SLA verification
Agility | Elasticity, Portability, Scalability
Assurance | Availability, Reliability, Resiliency/fault tolerance
Financial | Acquisition, On-going cost, Transition costs
Performance | Functionality, Interoperability, Service response time
Security and Privacy | Access control & privilege management, Data integrity, Data privacy and data loss
Usability | Accessibility, Learnability, Suitability

Table 3.5.a. List of CSMIC Prioritised Attributes by category

Each attribute has one or more measures defined, and for each measure there is a template for the description that contains a number of fields: the usual identification and meta-data, plus the measure description. The measure description contains information about how the measure is expressed, the frequency and units of measure, and the formula for calculating the measure. Some formulae are simple yes/no questions, for example "Is the service supplier Sarbanes-Oxley certified?"; other formulae are more complex. Measures are weighted based on their importance to the selector of services.
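The weighting idea can be sketched as a simple roll-up calculation. The category names below follow Table 3.5.a, but the weights, scores, and the function itself are invented for illustration and are not part of the SMI specification.

```python
# Hypothetical sketch of SMI-style weighted scoring: each category score
# (normalised to 0..1) is weighted by its importance to the service
# selector, then rolled up into a single comparable score per service.

def smi_score(scores, weights):
    """Weighted average of normalised category scores."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Invented example values -- not real CSMIC data.
weights = {"Financial": 0.3, "Performance": 0.3, "Security and Privacy": 0.4}
service_a = {"Financial": 0.8, "Performance": 0.7, "Security and Privacy": 0.4}
service_b = {"Financial": 0.6, "Performance": 0.6, "Security and Privacy": 0.9}

print(round(smi_score(service_a, weights), 2))  # 0.61
print(round(smi_score(service_b, weights), 2))  # 0.72
```

With a high weight on security and privacy, the service with the weaker security score loses despite scoring well elsewhere, which mirrors the selection logic described above.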
Data is gathered to allow calculation of a measure, either by benchmarking or through the contribution of service suppliers. In a prototype exercise in 2011, several cloud service suppliers donated performance, security and quality metrics, and a leading university benchmarked several public services to gather additional data. This formed the basis of examples of service selection that validated the approach. Since those early days the basic QARCCS (Quality, Agility, Risks, Cost, Capability, Security) model has been modified into the current list of categories. The SMI is now in the final stages of internal review of the attribute measure definitions, and the next stage is for those measures to be passed to the wider community for external review. The review should be completed in the second quarter of the year; during this review period the measures will undergo further refinement. A user scenario and a demonstration tool for calculating service selection heat-maps are also under development. Defined measures and data will allow service selection and the further development of a standard measure of cloud services from SMI. The use of SMI to select services is illustrated in Figure 3.5.a.
Figure 3.5.a. Service selection heat-map for a dummy service.

If the user placed a high priority on the security and privacy category, then despite a high score for usability this service would not be acceptable. Even if a service does not provide an adequate score in a category, this does not mean the service is always unacceptable: it may become acceptable with suitable risk mitigation.
Table 3.5.b. Monitoring components in EU projects

4CaaST
- Reuses frameworks from the state of the art: the JMX framework (to monitor applications that expose MBeans), publish/subscribe middleware (SilboPS), and JASMINe (which provides the MbeanCmd collector).
- To access monitoring services inside 4CaaSt, a TCloud REST-based API is proposed.
- Designs a new environment which can use all the different available generic monitoring systems (e.g., Ganglia, collectd, mbeancmd, etc.).

Cloud4SOA
- Front-end: visual information about the status (life cycle) of the deployed applications.
- Back-end: collects data from the platforms.

Cloud-TM
- The skeleton of the prototype implementation of WPM (Workload and Performance Monitor) is based on the Lattice framework.
- The prototypal implementation of the data-platform-oriented probes is extensively based on the JMX framework.

Contrail
- Monitoring is based on the monitoring solution (architecture) developed in the SLA@SOI project.
- The web hosting service uses Ganglia for monitoring several application-specific parameters.

OPTIMIS
- Java, RESTful Web Services: interfaces to downstream components querying data from the Monitoring Infrastructure (e.g., TREC components and the monitoring website) and to upstream components inserting data into the Monitoring Infrastructure (e.g., probes and scripts).
- MySQL: storage of monitoring data.
- Google Web Toolkit: monitoring website.

Vision Cloud
- The monitoring system collects cluster-level usage records and aggregates them to generate cloud-level usage records. The generated records are pushed to the accounting system via a RESTful web interface.

mosaic
- Regarding the monitoring of cloud-based applications, two separate levels are distinguished: infrastructure-level and application-level monitoring.
- Connector API for monitoring: uses JSON (JavaScript Object Notation) for data exchange.
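Several of the projects in the table exchange monitoring records as JSON, as mosaic does through its Connector API. The snippet below is a minimal, generic illustration of such an exchange; the field names are invented and do not reflect the actual mosaic message format.

```python
import json

# Hypothetical monitoring event serialised for exchange between a
# producer and a consumer over an event bus; field names are
# illustrative only, not the mosaic Connector API schema.
event = {
    "resource": "vm-42",
    "metric": "cpu.utilization",
    "value": 0.73,
    "timestamp": "2013-03-29T12:00:00Z",
}

payload = json.dumps(event)     # what a producer would publish
received = json.loads(payload)  # what a consumer would decode
print(received["metric"], received["value"])
```

The appeal of JSON here is that both ends stay language-independent: any component able to parse JSON can consume the records, regardless of the platform it runs on.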
4. State-of-the-Art: QoS Management

4.1. Preamble

Quality of Service (QoS) plays a central role in the optimal delivery of web services, and as more applications are deployed on public clouds, the task of handling QoS becomes harder. As more applications share the same infrastructure, their demand for resources may create contention that reduces the QoS perceived by the user. In this section we review methods that have been proposed during the last decade to give system administrators tools to manage the QoS of web applications. To provide a better view of this problem we have divided this section into three parts:

Data analysis and forecasting. The first part deals with the need to accurately describe the system workload, as this drives the demand for resources. The high variability and auto-correlation present in web application workloads call for advanced modelling approaches for prediction and evaluation. These include statistical regressions, autoregressive models, and machine learning techniques.

Runtime QoS models. This part presents recent advances in QoS runtime models, which are tools to evaluate the performance of an application under a given workload mix, resource availability, and resource management policy. These models attempt to predict the effect of a reconfiguration on system performance, allowing an administrator to predict the benefits of a reconfiguration before applying it, and to consider future changes needed to cope with a potential change in the workload. The models used in this part are based on statistical inference, control theory, and queueing theory.

SLA management. This part describes methods to determine resource management policies that cope with Service Level Agreements (SLAs). SLAs exist between the application provider and the end users, and also between the application provider and the cloud provider. We focus here on the first kind, i.e., agreements with the end users.
We consider policies for application placement, admission control and capacity allocation. To determine the most appropriate policy, optimization and game-theoretic methods are reviewed.

QoS Data Analysis and Forecasting

Problem

MODAClouds will offer a data analysis platform whose main purposes include parameterizing models of cloud applications in order to deliver predictions of their QoS metrics. Classical QoS data analysis involves service demand estimation and traffic forecasting. Service demand estimation approximates the service demand of different classes of requests by analysing log files or streaming data; metrics of typical interest include response time and server utilisation. The traffic forecasting problem is to forecast the incoming workload by analysing historical data to obtain future trends. This brings the need for methods that keep predictive models consistent with observations, so as to maximize predictive accuracy. A similar concept is adopted, for example, in [Shi06], [Coh04], [Des12], where the runtime engine features statistical learning methods, classification, regression and adaptive re-learning. Machine learning methods can be more flexible than stochastic models in capturing dependencies in empirical data. However, they can be less accurate in what-if analysis, since they take a black-box view of the system. For example, since they do not model scheduling mechanisms, it is difficult to predict the effects of changes in request priorities, which can, instead, be simple to predict with a stochastic model. Unsupervised methods may also be inapplicable for predicting metrics that are unobservable due to overhead concerns (e.g., threading levels). While the definition of machine learning methods is usually embedded in the modelling technique itself (e.g., training algorithms for neural networks), less standardized data analysis methods are required to parameterize QoS stochastic models.
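As a toy sketch of the traffic-forecasting idea, the snippet below fits a simple autoregressive model by least squares, a minimal stand-in for the ARIMA-style models reviewed later in this section; the workload series and the model order are invented for illustration.

```python
import numpy as np

# Synthetic request-rate history (requests/s); illustrative only.
workload = np.array([100, 120, 130, 125, 140, 150, 160, 155, 170, 180],
                    dtype=float)

# Fit an AR(2) model y_t = a1*y_{t-1} + a2*y_{t-2} + c by least squares.
Y = workload[2:]
X = np.column_stack([workload[1:-1], workload[:-2], np.ones(len(Y))])
coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
a1, a2, c = coeffs

# One-step-ahead forecast of the next workload value.
forecast = a1 * workload[-1] + a2 * workload[-2] + c
print(round(float(forecast), 1))
```

In practice one would select the model order and difference the series as ARIMA prescribes, and refit periodically to keep the model consistent with the latest observations.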
Since WP5 aims at leveraging such models for design-time predictions, the runtime environment will take advantage of the WP5 models and will therefore need mechanisms to parameterize them. QoS model parameterization can be broadly divided into direct measurement techniques and statistical inference methods. Direct measurement parameterization, as used for instance in [Urg05], is usually expensive in terms of overhead, because it instruments the code or tracks the requests to see what they do. Statistical inference methods, instead, allow an estimation problem to be formulated and reapplied periodically. In the next sections we focus on statistical inference methods for the data analysis and forecasting of QoS model parameters. As
the methods share similarities in the techniques used, we provide a brief review of these techniques in the following section.

Overview of data analysis, forecasting techniques, and queueing models

Before describing recent works in this area, we overview some of the main techniques used for data analysis and forecasting.

Regression analysis

Regression is a statistical technique to estimate the relationship among observed variables. Regression models can be formulated as Y = f(X, β), where X is the independent variable, Y is the dependent variable, both X and Y are observed, and β is the parameter to estimate. The simplest regression method is linear regression, which assumes a linear relationship Y = Xβ + ε between the variables. Regression approaches are widely used for prediction and forecasting. Classical approaches include ordinary least squares linear regression, which finds the linear relation that minimises the sum of squared residuals, and non-linear methods such as SVM regression, which uses a Support Vector Machine (SVM) to obtain a non-linear relation between the variables.

Autoregressive models

Autoregressive models are mathematical models describing time-varying processes. Classical methods include autoregressive moving average (ARMA) models and their generalization, the autoregressive integrated moving average (ARIMA) model. The ARMA models form a class of linear time series models; by adjusting the order of the model, any linear time series model can be approximated with the desired accuracy. Autoregressive models can also be used to forecast time series, and have mainly been applied in economics and the natural sciences.

Kalman filter

The Kalman filter is a technique to estimate the states of a running system by analysing the system input and noisy, incomplete observed data. The Kalman filter works in two steps. First, the algorithm estimates the current system state and its uncertainty.
Then, once the observed measurement of the system is obtained, the Kalman filter updates the previous estimate using a weighted average, giving higher weight to estimates with lower uncertainty. The Kalman filter is a recursive estimator that only requires the current measurement and the previous state; it is therefore suitable for online parameter estimation to achieve adaptive management of the system.

Machine learning methods

Machine learning algorithms have been studied extensively over the last century, and their application to queueing systems has raised much interest in this decade. Machine learning has the advantage that it requires no knowledge of the internal structure of the system, treating it as a black box. Such methods are therefore more flexible than stochastic models in capturing dependencies in the data. Techniques like Support Vector Machines (SVM), Artificial Neural Networks (ANN), clustering and Bayesian models have been studied and applied to recognise workload patterns and to predict and forecast future events.

Throughout this section we will repeatedly make reference to queueing systems and queueing networks, which are among the most important modelling tools for QoS management. To make this document self-contained, we provide a brief overview of these techniques, pointing to deliverable D5.1 for additional details.

Queueing systems

A queueing system is a mathematical model that consists of one or several servers that deliver a time-consuming service to a population of clients/requests. A queueing system can be described using the Kendall notation A/B/c/k, where A describes the request arrival process, B describes the service process, c is the number of servers in the system, and k is the number of spaces in the system, including service and waiting spaces. For instance, the most traditional queue is the M/M/1/∞ queue, where the inter-arrival times and the service times follow an exponential distribution, there is one server, and there is infinite room for holding waiting requests.
Other usual values for the A and B components include a general distribution G and the Erlang distribution Er, among others. Another relevant aspect of a queueing system is its service discipline, which determines how the server/resource is allocated among the incoming requests. Common service disciplines include First-Come-First-Serve (FCFS), Last-Come-First-Serve (LCFS), Processor Sharing (PS) and Generalized Processor Sharing (GPS), among others.

Queueing networks

A queueing network is a collection of queueing systems (each one a node in the network) that interact through their arrival and departure processes. When a request finishes being served at a node, it may move to another node in the network, or leave the network, according to a probabilistic routing matrix. Further, the requests can be classified into different classes, depending on the probability laws that govern their external arrivals, services and routing. Queueing networks are ideal for analyzing systems where several resources are accessed by external requests.

Layered queueing networks

Layered queueing networks are an extension of queueing networks that allows the representation of computer systems composed of several layers of software servers sharing hardware resources. The layers play an important role in software applications, as they capture the blocking and waiting that a software server (in a layer) experiences when it requests a service from another server (in a lower layer) in order to complete its own service. When performing such a request, the calling server is blocked and unable to provide any service, a feature that product-form queueing network models are not able to capture.

Statistical Inference for QoS model parameterization

Statistical inference techniques differ from direct data measurement techniques because they aim at calibrating QoS model parameters from aggregate statistics, such as CPU utilization or response time measurements. In [Men94], a standard model calibration technique is introduced. The technique is based on comparing the performance metrics (e.g., response time, throughput and resource utilization) predicted by a performance model against measurements collected in a controlled experimental environment while varying the system workload and configuration.
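As a toy illustration of how aggregate measurements calibrate one of the queueing models overviewed above, the sketch below applies the Utilization Law (U = X·D) to recover the per-request service demand from measured throughput and utilization, and then predicts the mean response time with the standard M/M/1 formula R = D/(1 − U); all the numbers are invented.

```python
# Toy calibration of an M/M/1 queue from aggregate measurements.
# Measurements (invented): throughput X in req/s, CPU utilization U.
X = 40.0   # requests per second
U = 0.8    # measured utilization (80%)

# Utilization Law: U = X * D  =>  service demand per request.
D = U / X                 # 0.02 s = 20 ms

# M/M/1 mean response time at utilization U.
R = D / (1.0 - U)         # 0.1 s = 100 ms

print(f"demand={D * 1000:.0f} ms, response={R * 1000:.0f} ms")
```

The same two aggregate statistics, which are cheap to collect from standard OS counters and access logs, are exactly the inputs used by the inference techniques surveyed in the rest of this section.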
Given the lack of control over the system workload and configuration during operation, techniques of this type may not be applicable for online model calibration. In [Rol95], linear regression is used for parameter estimation and is found to be accurate, with less than 10% error with respect to simulation data. However, regression fails when there is not enough variability in the observed data. [Rol98] studies the precision of linear regression using simulations with different service time distributions, showing that precision decreases as the service time variance grows. In [Liu05], performance models are calibrated by means of application-independent synthetic benchmarks. The approach uses middleware benchmarking to extract performance profiles of the underlying component-based middleware; however, application-specific behaviour is not modelled. The study in [Zha07] presents a regression-based approximation of the CPU demand of customer transactions, which is later used to parameterize a queueing network model where each queue represents a tier of the web application. It is shown that such an approximation is effective for modeling different types of workloads whose transaction mix changes over time. Moreover, [Cas08a] presents an optimization-based inference technique, formulated as a robust linear regression problem, that can be used with both closed and open queueing network performance models. It uses aggregate measurements (i.e., system throughput and utilization of the servers), commonly retrieved from log files, in order to estimate service times. The work in [Pac08] considers the problem of dynamically estimating the CPU demands of diverse types of requests using CPU utilization and throughput measurements. The problem is formulated as a multivariate linear regression problem and accounts for multiple effects, such as data aging. In [Kal11], an online resource demand estimation approach is presented.
The approach evaluates three regression techniques: Least Squares (LSQ), Least Absolute Deviations (LAD) and Support Vector Regression (SVR). Experiments with different workloads show the importance of tuning the parameters, hence the authors propose an online method to tune the regression parameters. In [Kal12], a novel resource demand estimation approach for multi-tier systems, Demand Estimation with Confidence (DEC), is proposed. DEC effectively overcomes the multicollinearity problem that affects regression methods, and can be applied iteratively to improve accuracy. A thorough evaluation demonstrates the effectiveness of the algorithm.
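The regression-based demand estimation idea underlying several of the works above ([Zha07], [Pac08], [Kal11]) can be sketched with the utilization law U ≈ Σ_c D_c·X_c: per-class service demands D_c are fitted to measured per-class throughputs X_c and aggregate CPU utilization U by least squares. This is an illustrative toy example on synthetic, noise-free data for two request classes; all names and values are our own, not taken from the cited papers.

```python
# Least-squares estimation of two per-class CPU demands from throughput
# and utilization samples, via the 2x2 normal equations (illustrative sketch).
def estimate_demands(throughputs, utilizations):
    """throughputs: list of (X1, X2) per-class throughput samples;
    utilizations: matching CPU utilization samples U ~ D1*X1 + D2*X2."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), u in zip(throughputs, utilizations):
        a11 += x1 * x1; a12 += x1 * x2; a22 += x2 * x2
        b1 += x1 * u;   b2 += x2 * u
    det = a11 * a22 - a12 * a12          # assumes samples are not collinear
    d1 = (b1 * a22 - b2 * a12) / det
    d2 = (a11 * b2 - a12 * b1) / det
    return d1, d2

# Synthetic workload with true demands D1 = 0.02 s and D2 = 0.05 s per request.
samples = [(10.0, 5.0), (20.0, 3.0), (5.0, 12.0), (15.0, 8.0)]
utils = [0.02 * x1 + 0.05 * x2 for x1, x2 in samples]   # noise-free for clarity
d1, d2 = estimate_demands(samples, utils)
print(round(d1, 4), round(d2, 4))   # 0.02 0.05
```

With noisy measurements the same machinery applies, but the fit is then only approximate and robustness issues such as multicollinearity (addressed by DEC [Kal12]) become relevant.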
Other approaches to model calibration are presented in [Wu08] and [Zhe08]; both use the Extended Kalman Filter for parameter tracking. While [Wu08] proposes a calibration framework based on fixed test configurations, [Zhe08] applies tracking filters to time-varying systems. [Zhe08] extends [Zhe05], where the use of an extended Kalman filter is investigated to adjust the estimated parameters based on utilization and response time measurements. The above approaches to model calibration, however, have not yet been validated in scenarios of realistic size and complexity, and it is currently not clear if they can be used as a basis for online model calibration. The study in [Cre10] proposes a method based on clustering to estimate the service time. The authors employ density-based clustering to obtain clusters and then use a clusterwise regression algorithm to estimate the service time. A refinement process is conducted between clustering and regression to obtain accurate clusters, by removing outliers and merging the clusters that fit the same model. This approach proves to be computationally efficient and robust to outliers. [Cre12] proposes an algorithm to estimate the service demands for different system configurations. A time-based linear clustering algorithm is used to identify different linear clusters for each service demand. This approach proves to be robust to noisy data, and extensive validation on generated and real datasets shows the effectiveness of the algorithm. [Sha08] explores the problem of inferring workload classes automatically from high-level measurements of resources (e.g., request rate, total CPU and network usage) using a machine learning technique known as independent component analysis (ICA). In [Sut08], the authors propose an inference method to estimate the parameters of a queueing network; the method addresses the limitation of queueing models that require distributional assumptions.
From the perspective of graphical models, a Gibbs sampler and a stochastic EM algorithm for M/M/1 FIFO queues are proposed to estimate the parameters of the queueing network from incomplete data. The work in [Liu06] proposes instead service demand estimation from utilization and end-to-end response times: the problem is formulated as quadratic optimization programs based on M/G/1/PS formulas, and the results are in good agreement with experimental data. The work in [Spi11] presents a thorough investigation of the state of the art in resource demand estimation techniques. The techniques are analysed and compared within the same environment; by adjusting the parameters of the environment, the accuracy of the algorithms can be compared and possible directions for future research can be identified. Overall, regression analysis tends to be the simplest method for model parameterization; however, it requires an assumption on the form of the hidden relation between variables, such as linearity. The Kalman filter is suitable for online model parameterization because it recursively adapts the parameters, but this may introduce significant overhead to the system; it is therefore suitable for closing the feedback loop in WP5 in combination with layered queueing networks, when no short-time execution is required. Machine learning techniques have the advantage of not requiring knowledge of the internal structure of the system; however, when it comes to what-if analysis, they cannot provide much useful information. Queueing-based inference is able to provide useful insight into the system, but it requires assumptions on the queueing distributions. The reviewed approaches are classified in Table 4.2.a, according to the techniques used.
Table 4.2.a Summary of QoS model parameterization methods

Method type               | References
Regression analysis       | [Rol95] [Rol98] [Zha07] [Kal11] [Kal12]
Kalman filter             | [Zhe05] [Zhe08] [Wu08]
Machine learning          | [Cre10] [Cre12] [Sha08] [Sut08]
Queueing-based inference  | [Liu06] [Sut08]
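Among the methods in the table above, the Kalman-filter family ([Zhe05], [Wu08], [Zhe08]) can be illustrated in heavily simplified scalar form: a single time-varying CPU demand is tracked from noisy utilization samples U_t = X_t·D_t + noise, where the throughput X_t is known. The cited works use extended, multivariate filters; the parameter values, data and names below are illustrative assumptions only.

```python
import random

# Minimal scalar Kalman filter tracking a CPU demand D_t from noisy
# utilization samples (simplified stand-in for the cited EKF approaches).
def kalman_track(observations, q=1e-6, r=1e-4, d0=0.0, p0=1.0):
    """observations: list of (throughput X_t, utilization U_t) pairs.
    q: process noise variance (random-walk demand), r: measurement noise variance."""
    d, p = d0, p0
    estimates = []
    for x, u in observations:
        p = p + q                      # predict: demand follows a random walk
        k = p * x / (x * x * p + r)    # Kalman gain; observation matrix H = x
        d = d + k * (u - x * d)        # correct with the innovation U - X*D
        p = (1.0 - k * x) * p
        estimates.append(d)
    return estimates

random.seed(1)
true_d = 0.03                          # true demand: 30 ms per request
obs = [(x, x * true_d + random.gauss(0.0, 0.01))
       for x in [random.uniform(5, 25) for _ in range(200)]]
print(round(kalman_track(obs)[-1], 3))   # estimate close to 0.03
```

The recursive correction step is what makes this style of estimator attractive for online use, at the cost of repeated execution, as noted above.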
The above model parameterization methods can be selected, compared and then employed in MODAClouds; if none of them proves efficient enough, a novel approach may be developed. For QoS model parameterization, accuracy is the most important criterion, and the above methods demonstrate their effectiveness under different circumstances. For example, the Kalman filter can be accurate, but it requires recursive execution, which leads to a significant overhead. Another issue for runtime model parameterization is that the approach must execute within a short time period. Regression is typically the fastest; however, it requires an assumption on the form of the relation between variables and may lose accuracy if the real relation is different.

Workload forecasting methods

There are many approaches to predicting future workload. These approaches require extensive profiling and log data about the running system, and then use these data to extract relevant information with the help of techniques such as machine learning and data mining. Autoregressive methods, also known as Box-Jenkins algorithms, have been proposed to forecast workload time series. In [Lu09] the Box-Jenkins algorithms are combined with simulation technologies to incorporate risk and uncertainty analysis. [Ver07] proposes a hierarchical framework to forecast both short-term and long-term web server workload. The authors use Dynamic Harmonic Regression (DHR) to model the long-term workload and an autoregressive model to predict the short-term workload; the parameters of both methods are estimated using Sequential Monte Carlo (SMC) algorithms. Experimental results show that the framework is robust to outliers and non-stationarity in the data. An interesting approach is to combine autoregressive methods with machine learning techniques. For instance, the work in [Zha03] proposes a forecasting technique combining both ARIMA and Artificial Neural Network (ANN) models.
This approach combines the advantages of ARIMA and ANN in linear and nonlinear modelling of time series data. Experiments with real data show that the hybrid model improves forecasting performance compared to either model used separately. Also, [Pow05] explores several machine learning and data mining algorithms, such as auto-regressive models, multivariate regression models and Bayesian network classifiers, to predict the short-term performance of enterprise systems. The authors treat prediction as a classification question: whether the system will meet the target performance objective within a short time period. Besides the accuracy of the different methods, they also assess whether the methods qualify as stand-alone tools in a real system. For example, a model should adapt to different systems and workloads and be able to predict with incomplete data; moreover, the gain in accuracy should outweigh the cost of the model's complexity. Another example is [Wu10], which proposes the use of the Kalman filter and the Savitzky-Golay filter to predict grid performance. The authors use a confidence-window approach to restrict the workload prediction to a certain tolerable range, so as to avoid large workload fluctuations. They also present an adaptive hybrid model that extends the classic auto-regression model to take the confidence windows into consideration and adaptively improve the prediction accuracy, and use real data to demonstrate its effectiveness against existing workload forecasting techniques. Other works based on machine learning methods include [Di12], where a workload prediction algorithm based on a Bayes model is proposed. The objective is to predict the long-term workload and its pattern. The authors design nine key features of the workload and use a Bayesian classifier to estimate the posterior probability of each feature. The experiments are based on a large dataset collected from a Google data center with thousands of machines.
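The autoregressive family discussed above can be illustrated with a minimal AR(2) forecaster fitted by least squares: coefficients of y_t = a·y_{t-1} + b·y_{t-2} are estimated from a workload series and then used to predict the next value. This is a toy sketch on synthetic data, not a reimplementation of any cited method; all names are our own.

```python
# Fit an AR(2) model by solving the 2x2 normal equations, then forecast
# one step ahead (illustrative sketch of the autoregressive approach).
def fit_ar2(series):
    """Fit y_t = a*y_{t-1} + b*y_{t-2} by least squares; return (a, b)."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(series)):
        y1, y2, y = series[t - 1], series[t - 2], series[t]
        s11 += y1 * y1; s12 += y1 * y2; s22 += y2 * y2
        r1 += y1 * y;   r2 += y2 * y
    det = s11 * s22 - s12 * s12          # assumes the series is not degenerate
    return (r1 * s22 - r2 * s12) / det, (s11 * r2 - s12 * r1) / det

def forecast_next(series):
    a, b = fit_ar2(series)
    return a * series[-1] + b * series[-2]

# Synthetic request-rate series generated by a known AR(2) recursion,
# so the fit should recover the coefficients 0.6 and 0.3 exactly.
series = [100.0, 110.0]
for _ in range(100):
    series.append(0.6 * series[-1] + 0.3 * series[-2])
print(round(fit_ar2(series)[0], 3), round(fit_ar2(series)[1], 3))   # 0.6 0.3
```

Real workload traces contain noise and non-stationarity, which is precisely why the surveyed works combine such models with ANN, SMC or confidence-window mechanisms.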
Non-Bayesian machine learning methods have also received significant attention, as in [Wan05], where a web traffic trend prediction model is proposed. The neuro-fuzzy model analyses web log data and extracts useful information from it. The authors build a pattern analysis and fuzzy inference system to predict the chaotic trend of both short-term and long-term web traffic with the help of cluster information obtained from a Self-Organising Map (SOM). Empirical results demonstrate its efficiency in predicting the future trend of web traffic. Also, in [Kha12], the authors propose a method to characterise and predict workload in cloud environments in order to efficiently provision cloud resources. They develop a co-clustering algorithm to find servers that have similar workload patterns, where the patterns are identified by studying the performance correlations of applications on different servers. They then present a Hidden Markov Model (HMM) method to identify the temporal correlations between different clusters and use this information to predict future workload variations.
Methods based on trend and pattern recognition technologies are used in [Gma07] to propose a workload demand prediction algorithm. The objective of this approach is to find a way to efficiently use a resource pool when allocating servers to different workloads. The pattern and trend of the workloads are first analysed, and synthetic workloads are then created to reflect the future behaviour of the workloads. From the synthetic workloads, a placement of the workloads among the servers can be suggested so as to minimise the number of servers used and balance the load. A related approach is taken in [Hol10], where a periodicity detection approach is proposed. The objective is to predict workload changes in enterprise database systems, which often exhibit periodic patterns. Two methods for detecting periodic patterns are proposed: the discrete Fourier transform method and the interval analysis method. An algorithm is presented to relate the knowledge of periodic patterns to workload changes. Table 4.2.b presents a classification of the methods reviewed according to the underlying technique.

Table 4.2.b Summary of workload forecasting methods

Method type                      | References
Autoregressive model             | [Lu09] [Zha03] [Ver07] [Pow05] [Wu10]
Regression model                 | [Pow05] [Ver07]
Kalman filter                    | [Wu10]
Machine Learning (Bayesian)      | [Pow05] [Ver07] [Di12]
Machine Learning (Non-Bayesian)  | [Zha03] [Wan05] [Kha12]
Pattern Analysis (Recognition)   | [Gma07] [Hol10]

4.3. Run-Time QoS Models

Common approaches used by system administrators to characterize the runtime execution of complex software systems include direct measurement techniques, such as bytecode instrumentation via aspect-oriented programming [Mar12]. These monitoring approaches focus on acquiring extensive profiling and log data about the offered QoS, and then apply statistical analysis and data mining methods to extract relevant information about the system in execution.
While this procedure is in general very important to understand the properties of a system at runtime, it does not per se provide mechanisms to help reason about how such a system could be optimized. Such mechanisms include, for example, the ability to condense the collected information into mathematical models that can be integrated within numerical optimization programs in order to find the best choice for a decision parameter. Another example is determining the correlation between a request's resource consumption on one server and the resource consumption it requires on a different server. While footprinting methods exist to track the identity of a transaction across a distributed system, they are not always adopted and, furthermore, they do not allow one to clearly map the resource consumption of a request across all the software and hardware layers that contribute to its processing. Hence, statistical reasoning is needed to infer such correlations from monitoring data. Several works have attempted to use statistical learning methods, such as classification, regression and adaptive re-learning, to characterize a system in execution at runtime and make predictions on its QoS. Others have focused on the estimation and tracking of the system state by means of control-theoretic methods such as Kalman filters. Yet another set of works have adopted models that describe the inner structure of the modelled system and/or the architecture of the deployed software application; these models are typically queueing networks and layered queueing networks. In the following sections we describe recent developments in each of these directions. The techniques involved are closely related to the ones presented in Section 4.2, including statistical inference, control-theoretic and queueing-based methods.
For a brief review of the main techniques mentioned in the following sections we refer the reader to Section 4.2. Furthermore, product-form queueing networks and layered queueing networks are also used for design-time analyses. Research works focusing on design time are
discussed in MODAClouds deliverable D5.1 (Sections 5.2 and 7). Here, the works focusing on run-time problems are considered.

4.3.1. Statistical learning models

[Aga07] presents E2EProf, a toolkit capable of tracking the end-to-end behaviour of requests in a distributed enterprise application, such as those that are commonly migrated to the cloud. The approach inspects network packet traces to non-intrusively reconstruct the path of a high-level request across a distributed system. A time series approach is employed in which the cross-correlation between events in the traces drives the inference of which software components have been utilized by a transaction. The authors report that the system has been applied to production systems. The works in [Coh04, Coh05] illustrate a methodology to predict correlations among system states based on Tree-Augmented Networks, an efficient class of Bayesian networks. Given a monitoring trace, the approach involves defining an ensemble of models that is continuously learned. These models attempt to describe the probabilistic law that relates a QoS metric (e.g., CPU utilization, memory consumption, etc.) to an SLO compliance state (service objective achieved, service objective violated). Scoring methods are used to select the best submodel in the ensemble for estimating the SLO state over a moving window; the required sample size can be obtained from a learning surface. More recently, [Gam12] proposes runtime QoS models in which a controller maintains a Kriging model for each target SLO. A Kriging model describes the correlation among errors in a prediction model; it thus differs from regression methods, which instead focus on providing a prediction under modelling assumptions, not a description of the resulting error. The Kriging approach is based on radial basis functions, data interpolators that have been used in pattern recognition for many years.
Essentially, they are useful in situations where errors are correlated, a circumstance that is more problematic to handle for regression models. Initial results of this approach indicate that Kriging models can lead to controllers delivering very low, even negligible, SLO violation errors. vPerfGuard [Xio13] is a controller capable of automatically identifying predictive metrics for application performance and of adapting dynamically to changes in such metrics. Compared to other controllers, this approach aims at identifying the metrics most important for prediction using a machine learning approach; correlations across metrics are considered for feature selection. Subsequently, modelling is performed using methods such as linear regression, k-nearest neighbour (k-NN), regression trees and boosting, which are compared for their predictive capabilities. Similar methods are adopted, for example, in [Shi06], [Coh04] and [Des12], where a runtime engine is proposed that features statistical learning, classification, regression and adaptive re-learning. IRONModel [The08] is a performance management system that maintains a model of a distributed system by dynamically analyzing its traces and automatically discovering new correlations between performance metrics and system attributes. The model built by the system designer is incorporated into the system. The underlying modelling approach is based on zero-training classification and regression trees (Z-CART). The underlying models rely in part on operational analysis and bound analysis laws developed in the context of queueing theory; however, the approach combines these formulas within a machine learning framework. Compared to other approaches in this section, IRONModel also features active probing to accelerate training. Reinforcement learning has also been proposed as a method to build run-time QoS models [Tes05].
Although these methods may provide good results without specifying an underlying traffic model, they also require significant online training, which can be very expensive in production systems. To mitigate this, hybrid methods [Tes06] have been considered, where the initial policy is provided by an analytic model and is afterwards improved by solutions found by a reinforcement learning algorithm trained offline on previously collected information. [Tan12] introduces PREPARE, an online anomaly prediction and virtualization-based prevention system. Its anomaly detection module consists of a 2-state Markov model to predict the future values of relevant attributes, and a tree-augmented Bayesian network model to classify those future states as normal or abnormal. In addition, it provides a module to determine the faulty VMs causing the anomaly, as well as an actuator module that performs preventive actions to avoid SLO violation states.
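The 2-state Markov prediction idea used by PREPARE's anomaly detection module can be sketched as follows: transition probabilities between a "normal" and an "abnormal" state are estimated from an observed state sequence, and then used to predict the probability of being abnormal k steps ahead. This is a heavily simplified illustration; the trace, state labels and prediction horizon are our own assumptions, not data or code from [Tan12].

```python
# Estimate a 2-state transition matrix from a 0/1 state trace and use it
# to predict the probability of the abnormal state k steps ahead.
def fit_transition_matrix(states):
    """states: sequence of 0 (normal) / 1 (abnormal). Returns a 2x2
    row-stochastic matrix; an unobserved state yields an all-zero row."""
    counts = [[0, 0], [0, 0]]
    for s, t in zip(states, states[1:]):
        counts[s][t] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]

def prob_abnormal(p, current_state, k):
    """Probability of being in state 1 after k steps from current_state."""
    dist = [1.0 - current_state, float(current_state)]
    for _ in range(k):
        dist = [dist[0] * p[0][0] + dist[1] * p[1][0],
                dist[0] * p[0][1] + dist[1] * p[1][1]]
    return dist[1]

trace = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
P = fit_transition_matrix(trace)               # [[0.75, 0.25], [0.4, 0.6]]
print(round(prob_abnormal(P, current_state=0, k=3), 3))   # 0.368
```

In PREPARE the predicted attribute states additionally feed a Bayesian network classifier; the Markov chain above only captures the temporal-prediction half of that pipeline.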
Many of the statistical learning methods mentioned so far, such as [Tan12, Coh05, Dua09], rely on labelled training data, i.e., data from the production system that includes monitoring metrics together with annotations of whether the system is violating an SLO or not. As such data are not readily available in most systems, [Dea12] introduces an unsupervised learning algorithm that is able to predict anomalies in a virtualized data centre without the need for training data. To this end, the authors rely on the Self-Organizing Map method, which can describe complex system behaviours at a smaller computational cost than other unsupervised methods. [Mal11] employs a multi-model for n-tier application control, where an empirical model learns the best decisions for each possible configuration and workload. As the system under control initially has no logs to learn from, decisions are first taken based on another model, in this case a Horizontal Scale Model. Once a decision is known for a given configuration, the empirical model takes over and applies the decision already known as the best for that configuration. Although initially proposed as relying on the Horizontal Scale Model, this meta-model can actually operate with any of the models proposed in this or the following sections.

4.3.2. Control theory models

Some works have instead attempted to use modelling techniques based on control theory, such as Kalman filters [Kal09] and Linear Parametrically Varying (LPV) models [Tan10]. Control theory has also provided a framework to analyse the behaviour of policies for autonomic control. [Dut10] uses this framework to analyse the challenges of threshold-based and reinforcement learning approaches, considering aspects that affect the stability of an autonomic system, such as the latency and power of the controller, and oscillations in the input variables. Kalman filters have been applied to control resource consumption in runtime web applications in works such as [Zhe05] and [Kal09].
Here we discuss the underlying resource consumption models. [Zhe05] uses a modelling methodology based on layered queueing models, which are reviewed in Section 4.3.4. Conversely, [Kal09] illustrates the application of feedback-loop models used in control theory to distributed systems. It proposes three Kalman filters to model the dynamics of a software application and applies them to the control problem, showing good accuracy. The filters are respectively based on a Single Input Single Output (SISO) model relating input workload and CPU utilization, a Multiple Input Multiple Output (MIMO) model relating covariances between VM utilizations, and an adaptive version of the latter that is self-configured. LPV models are a class of control-theoretic methods that describe the dynamics of a complex system in terms of an input and a set of so-called scheduling variables, i.e., variables describing the operational condition of the system [Lee99], [Nem95], [Lov98], [Bam99], [Ver02]. An LPV model is linear in the parameters, and a vector of scheduling variables enters the system matrices in an affine or linear-fractional way. Both single-input single-output (SISO) and multiple-input multiple-output (MIMO) state-space LPV models have been considered in the literature. For example, LPV methods have been investigated in [Ver02], [Van09], and their performance assessed on experimental data measured on a custom implementation of a workload generator and micro-benchmarked Web service applications. The results show that the LPV framework is very general, since it allows describing the performance of an IT system by exploiting all of the available technical parameters to manage QoS. [Tan10] introduces an LPV model to identify the dynamics of a web service, and defines an optimal control problem based on this model.
The solution of this optimal control problem is then used to define an optimal policy to manage the trade-off that arises between QoS guarantees and energy consumption. In [Gia11] the stability properties of an LPV-based proportional controller are analysed; the controller is designed for admission control in web services, and the LPV model is used in the controller design. [Lim10] proposes a proportional threshold controller for elastic storage in cloud platforms. The controller explicitly considers resources as discrete quantities, which is in line with per-instance pricing in platforms such as EC2; it also accounts for the actuator lag generated by the delay of redistributing data to new storage servers. Other approaches include the use of fuzzy logic [Xu07] to design a two-level controller for resource allocation in a virtualized datacentre. Fuzzy logic differs from Boolean logic in that the membership of an element in a set is not restricted to 0 or 1, but can be any real number in the interval [0,1]. With this generalization, [Xu07] proposes a model that learns the relationship between workload and resource demand for a given QoS level. From this model, inference functions are derived that determine the appropriate resource allocation for a given workload.
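The discrete, threshold-based control style discussed above for elastic resource allocation can be illustrated with a toy proportional threshold rule: the (integer) number of servers is increased when utilization exceeds a high threshold and decreased below a low one, with a step size proportional to the deviation. This is a simplified sketch in the spirit of [Lim10], not its actual controller; thresholds, gain and names are illustrative assumptions.

```python
# Toy proportional threshold controller for a discrete server pool.
def control_step(servers, utilization, lo=0.4, hi=0.7, gain=4.0, min_servers=1):
    """Return the new server count given the current count and average
    utilization; no change while utilization stays inside [lo, hi]."""
    if utilization > hi:
        servers += max(1, round(gain * (utilization - hi) * servers))
    elif utilization < lo:
        servers -= max(1, round(gain * (lo - utilization) * servers))
    return max(servers, min_servers)

print(control_step(4, 0.9))   # load spike -> scale out to 7
print(control_step(4, 0.2))   # idle pool  -> scale in to 1
print(control_step(4, 0.5))   # in band    -> stays at 4
```

The dead band between the two thresholds is what prevents oscillation; [Dut10] analyses exactly this kind of stability concern for threshold-based policies.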
4.3.3. Product-form queueing networks

Queueing networks have been among the first methods used for runtime control of software systems. Their distinguishing feature compared to the models described above is their ability to incorporate white-box information about a system in the runtime prediction. Often, this does not imply major limitations from a computational point of view, since efficient iterative algorithms and fluid methods exist to approximate the solution of such models in a short amount of time. Recently, [Cas08] showed how such models can be extended to integrate a more realistic description of the application workloads, including burstiness and fluctuations in the surrounding operating environment (e.g., network bandwidth fluctuations in the cloud). Early work in [Men03, Ben04] focuses on e-commerce sites and shows how queueing network models can be used with combinatorial search techniques to determine an optimal system configuration; periodic execution allows adaptation at runtime. Variants of such models have subsequently been studied in works such as [Ben05, Men07, Men05] in various application areas, including data centers. Urgaonkar et al. validated a basic product-form queueing network for the RUBiS and RUBBoS [Rub] open-source benchmark multi-tier applications [Urg05]. They also considered various non-product-form extensions to the model to better account for several important features of the applications under study, e.g., an imbalance of load across multiple application servers. Chen et al. represent the TPC-W [Tpc] and RUBiS benchmark multi-tier applications as multi-station queues, where the multiplicity refers to the number of server processes in each tier [Che08]. They use an approximation [Sei87] that transforms a multi-station queueing network model into an equivalent single-station product-form queueing network model, which can be solved using MVA. Lu et al.
used simple product-form models in conjunction with a feedback controller to perform runtime optimization of a single-tier Apache Web server system [Lu03]. [Zha07] presents a queueing network model where each queue represents a tier of a web application, parameterized by means of a regression-based approximation of the CPU demand of customer transactions. It is shown that such an approximation is effective for modeling different types of workloads whose transaction mix changes over time.

4.3.4. Layered queueing networks

The main limitation of ordinary queueing network models is that they describe the resource consumption mechanisms of the software, but they do not explicitly take into account known information about the software architecture. Layered queueing models (LQMs) [Rol95, Woo95] are an extension of queueing networks that allows the representation of computer systems composed of several layers of software servers sharing hardware resources, and they have therefore been extensively applied in software system research. LQMs were developed starting in the 1980s to consider the performance impact of contention for software resources (e.g., server threads) and the interactions between software entities at various system layers (e.g., messaging between an application server and a database server). The approach decomposes an LQM into a hierarchy of queueing network models. Each model in the hierarchy is solved using approximate mean value analysis, and the solution process is repeated until the individual estimates of the models are all consistent with each other. Approximate mean value analysis [Cha82, Cre02] is a technique that allows queueing network models to be solved iteratively in a very efficient manner, thereby permitting the study of larger systems and the solution of models at runtime. However, the technique relies on product-form assumptions, which restrict its applicability.
In particular, behaviour commonly observed in complex enterprise systems, such as contention for software resources, synchronous and asynchronous request-reply relationships between software entities, and priority-based resource access, all violate product-form assumptions [Alt06]. As mentioned in Section 4.3.2, [Zhe05] uses a modelling methodology based on layered queueing models, together with an extended Kalman filter for parameter estimation. The authors consider a time-varying web application, which is modelled as an LQM due to the interdependencies of its components (web server, database). Parameters such as the clients' think time and the CPU and disk demands vary with time, and their values are estimated by the Kalman filter. With these estimated values, the LQM is parameterized and (SLA-driven) performance results are obtained. These results can then be used by an autonomic controller to make decisions regarding resource allocation to prevent SLA violations. Other works in this area include [Lit05, Woo05]. [Jun09] presents a runtime adaptation engine that allows the automatic reconfiguration of multi-tier web applications. The engine first evaluates the potential benefits of a reconfiguration based on an LQM, together with its associated costs; based on these, the engine chooses the optimal sequence of reconfigurations to be applied to the web application. The engine is evaluated with the RUBiS benchmark [Rub] and shows a significant
reduction in SLA violations. This work has been extended in [Jun10] to consider power costs, including those caused by the transient behaviour generated by the reconfiguration.

4.3.5. Summary

From the previous discussion, it is clear that the area of QoS runtime modeling has received significant interest in the last decade. The contributions reviewed in this section are summarized in the following table. When considering the different options available for QoS runtime models, it is important to mention that the selection of the modeling technique is closely related to the information available to the performance analyst. Statistical learning and most of the control-theory-based methods assume a black-box approach, where little or nothing is known about the application's inner workings and architecture. As a result, these methods may be able to capture the result of a reconfiguration that has been observed in the past, but may not be able to adequately predict the performance implications of a completely new configuration. On the other hand, methods relying on queueing networks and layered queueing networks consider more information about the specifics of the application, and can therefore better predict the results of a new configuration. However, this additional information may not always be available, especially for the owners of the cloud infrastructure, for whom the application may indeed be a black box. Some methods, such as [Zhe05], actually combine both approaches, using control theory models to parameterize layered queueing networks that describe the underlying architecture of the application.
Statistical learning: [Aga07] [Coh04] [Coh05] [Gam12] [Xio13] [SBC06] [CCGTS04] [DWSPV12] [The08]
Control theory: [Tes05] [Tes06] [Tan12] [Dua09] [Dea12] [Mal11] [Kal09] [Tan10] [Dut10] [Zhe05] [Son12] [Vaq08] [Gia11] [Lim10] [Xu07]
Queueing networks: [Cas08] [Men03] [Ben04] [Ben05] [Men07] [Men05] [Urg05] [Che08] [Sei87] [Lu03] [Zha07]
Layered queueing networks: [Cha82] [Cre02] [Alt06] [Zhe05] [Lit05] [Woo05] [Jun09] [Jun10]
Table 4.3.a: Summary of run-time modelling methods (method type and references).
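To make the Kalman-filter-based parameter tracking surveyed above concrete, the sketch below tracks a single time-varying CPU demand D from throughput and utilization measurements via the utilization law U = lambda * D. This is only an illustration: the noise variances, initial values, and the random-walk state model are assumptions made here, and [Zhe05] applies an extended Kalman filter to a full layered queueing model rather than to a single scalar parameter.

```python
# Sketch: scalar Kalman filter tracking a drifting per-request CPU demand D
# from measured throughput (lambda) and utilization (U), with U = lambda * D.
# q and r (process/measurement noise variances) are illustrative assumptions.

def kalman_track_demand(arrival_rates, utilizations,
                        d0=0.05, p0=1.0, q=1e-6, r=1e-4):
    """Return per-interval demand estimates (seconds per request)."""
    d, p = d0, p0                       # state estimate and its variance
    estimates = []
    for lam, u in zip(arrival_rates, utilizations):
        p = p + q                       # predict: demand follows a random walk
        h = lam                         # linear measurement model: U = lam * D
        k = p * h / (h * h * p + r)     # Kalman gain
        d = d + k * (u - h * d)         # correct using the innovation
        p = (1 - k * h) * p
        estimates.append(d)
    return estimates

# Synthetic check: the true demand drifts from 20 ms to about 30 ms per request.
true_d = [0.020 + 0.00002 * k for k in range(500)]
lam = [40.0] * 500
util = [l * d for l, d in zip(lam, true_d)]   # noise-free here, for brevity
est = kalman_track_demand(lam, util)
```

Once the filter has converged, the estimated demand can be fed into the LQM at each control interval, exactly in the spirit of the autonomic loop described above.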
4.4. SLA Management

Many solutions have been proposed for the management of Cloud services at run-time, each seeking to meet application requirements while controlling the underlying infrastructure. Five main problem areas have been considered in the design of resource management policies: 1) provider selection, 2) application/VM placement, 3) admission control, 4) capacity allocation, and 5) load balancing. The following discussion aims to describe how these problems are addressed and to classify the proposed approaches according to theoretical or applied criteria, in line with the related research developed by the scientific community. Figure 4.4.a summarizes the classification criteria we adopt, which will be examined in detail in the next three subsections. A similar approach is followed in the state-of-the-art review presented in Deliverable D5.1 which, unlike this document focusing on run-time techniques, surveys the literature on design-time approaches to Cloud-related problems.

[Figure 4.4.a. Classification criteria for SLA run-time management solutions: three categories, Problem (Perspective, Quality Attribute, Dimensionality, Constraint), Solution (Type, Decision Variables, Architecture Representation, Optimization Strategy, Constraint Handling, Timescale), and Discipline (Type, Quality Model).]

Problem

The first category we consider is related to the problem the approaches aim to solve in the real world. Every approach tries to achieve a certain goal in a specific context. As a first classification of the literature we consider the perspective, i.e., the actor optimizing the use of resources. Many proposals take the perspective of the Cloud providers, whose goal is to determine the optimal configuration of the underlying infrastructure in order to satisfy incoming requests from the end-users while minimizing some cost metric (e.g., energy).
In the opposite perspective, the actor involved in resource management optimization is the Cloud end-user, who performs Cloud resource allocations according to application needs, minimizing the cost of using Cloud resources. This latter perspective is the one that will be pursued within the MODAClouds project. Most of the works aim to minimize costs, others seek to ensure high performance or high availability of the system, and some others try to guarantee these goals simultaneously. Since the nature and the architecture of a system are concepts difficult to define, it is useful to categorize some quantifiable quality attributes, such as performance, cost, availability, reliability, safety, security, or energy consumption.
Furthermore, the set of optimized quality attributes can be aggregated into a single mathematical function or kept as separate, possibly conflicting, objectives: the first approach optimizes a single quality attribute only (single-objective optimization, SOO), while the second optimizes multiple quality attributes at once (multi-objective optimization, MOO). Often, for a nontrivial multi-objective optimization problem, there does not exist a single solution that simultaneously optimizes each objective; in that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) set of Pareto-optimal solutions. Some approaches encode priority criteria by collapsing the multiple objectives into a single mathematical function (multi-objective weighted, MOW); others use specifically designed functions. Besides the dimensionality, each problem can be further characterized by the quality constraints that represent additional attributes or other system properties. Constraints include structural constraints and performance constraints, such as a minimum throughput for the applications or available memory, limits on the overall resource costs, a fixed budget on the energy costs of the infrastructure, or response time constraints. In some cases constraints are not present.

Solution

The problems faced at run-time can be further analyzed on the basis of the solution category. We classify the approaches according to how they achieve the optimization goal, and thus describe the main steps of the optimization process. First, solutions can be classified as centralized or distributed according to the framework and to the interplay between the system factors; alternatively, there are hierarchical solutions, where the resources are managed by introducing multiple decision points (e.g., a high-level controller assigns applications to clusters of physical servers, while a second-layer controller determines the optimal capacity allocation among applications within the same cluster).
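To make the notion of Pareto optimality introduced above concrete, the sketch below filters a set of made-up candidate configurations, each scored on two objectives to be minimized (cost and response time), down to the non-dominated subset. The candidate values are purely illustrative.

```python
# Sketch: Pareto-front filtering for a bi-objective (MOO) problem where both
# objectives -- cost and response time -- are to be minimized.

def pareto_front(points):
    """points: list of (cost, response_time) tuples; returns the non-dominated subset."""
    front = []
    for p in points:
        # p is dominated if some other point is no worse on both objectives
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(10, 2.0), (8, 3.0), (12, 1.5), (9, 2.5), (11, 2.2)]
front = pareto_front(candidates)
# (11, 2.2) is dominated by (10, 2.0); the other four are Pareto-optimal.
```

A weighted-sum (MOW) scalarization would instead pick a single point by minimizing w1*cost + w2*response_time for fixed weights, trading the whole front for one solution.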
Within each problem, the solution is characterized by the Decision Variables (DVs) available (e.g., provider selection, application placement, capacity allocation, load balancing, admission control). In other words, the DVs indicate which changes to the system are considered by the underlying optimization process. Furthermore, approaches can be characterized according to the representation of the system under study. Firstly, the architecture representation classifies the solutions based on the information used to describe the problem structure and configuration: according to the input required, there can be architectural models, expressed for instance in UML (Unified Modeling Language) or an ADL (Architecture Description Language), or optimization models (linear or non-linear). Secondly, concerning the solution technique, two main categories of optimization strategies can be pointed out: those using exact methods and those providing approximate solutions. Exact methods include standard methods (such as branch-and-bound or dynamic programming) and problem-specific methods. Among approximate methods, heuristic methods require problem- or domain-specific information to perform the search, while meta-heuristic methods apply high-level search strategies; the latter might exploit, for example, local search, Evolutionary Algorithms such as Genetic Algorithms, Simulated Annealing, or bio-inspired techniques. Another characteristic that differentiates the various solution methods is constraint handling, which describes the strategies used to handle constraints; more precisely, this category distinguishes whether constraints are treated as hard constraints or as soft constraints with associated penalties.
Finally, solutions are classified according to the time scale used, which can range from a daily or hourly scale down to the granularity of minutes, or in some cases even seconds.

Discipline

Finally, the techniques used to solve these run-time service management problems take advantage of various disciplines, ranging from mathematics to computer science. Among the most used we find control theory methods, machine learning, and utility-based methods that combine performance models and optimization models. For a detailed discussion and analysis of these disciplines see also [Ard12c, Ard08]. Furthermore, as an orthogonal classification, we can distinguish between pure optimization and game theory based approaches. In pure optimization approaches a single actor optimizes, with various techniques and objectives, his own goals without interacting with other actors. Vice versa, in game theory approaches the interaction across different actors is non-negligible and, while pursuing his own goal, each actor (e.g., a cloud end-user) can be affected by the actions of other actors (e.g., other end-users of the same cloud provider), not only by his own actions.
State of the art

The next two sections present some of the most significant works that have been carried out in the last few years on SLA management for Cloud services. First, pure optimization approaches are discussed; later, game theory based solutions are considered.

Pure optimization approaches

The literature has been reviewed according to the taxonomy depicted in Figure 4.4.a. Many categories and sub-categories have been considered. Tables 4.4.a to 4.4.l provide a useful and direct way to partition the state-of-the-art literature from specific points of view. In what follows, a brief description of the most important works published in the last few years is presented. The papers are grouped according to the Decision Variables category. Notice that, although many works could appear several times because they exploit several decision variables, we report each of them only once; the other decision variables are, however, mentioned.

Provider Selection

The works listed below have in common the fact that the methods they propose consider the selection of a different provider at run-time. In [Dut12] the authors present SmartScale, an autoscaling framework that uses a combination of vertical (adding more resources to existing VM instances) and horizontal (adding more VM instances) scaling mechanisms, together with the selection of the most suitable provider. This method ensures that each application is scaled so as to optimize both resource usage and the reconfiguration costs incurred by the scaling process itself. In a similar way, [Xia12] describes the implementation of a system that provides automatic scaling for Internet applications. Each application is encapsulated in a single VM and the system scales up and down, minimizing costs and energy consumption and maximizing the throughput, deciding also the application placement and load distribution thanks to a color set algorithm.
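As an illustration of the kind of combined horizontal/vertical scaling logic discussed above, the sketch below encodes a simple threshold-based controller. The thresholds, the doubling of per-instance vCPUs, and the preference for vertical scaling before scaling out are all assumptions made for this example; they are not the actual SmartScale [Dut12] policy, which optimizes resource usage and reconfiguration cost jointly.

```python
# Sketch: one step of a hypothetical threshold-based autoscaling controller
# combining vertical scaling (more vCPUs per instance) and horizontal scaling
# (more instances). All thresholds and preferences are illustrative.

def scaling_decision(cpu_util, n_instances, vcpus_per_instance,
                     high=0.80, low=0.30, max_vcpus=8):
    """Return (action, new_instance_count, new_vcpus_per_instance)."""
    if cpu_util > high:
        if vcpus_per_instance < max_vcpus:
            # vertical scaling is tried first (assumed cheaper to reconfigure)
            return ("scale-up", n_instances, vcpus_per_instance * 2)
        return ("scale-out", n_instances + 1, vcpus_per_instance)
    if cpu_util < low and n_instances > 1:
        return ("scale-in", n_instances - 1, vcpus_per_instance)
    return ("hold", n_instances, vcpus_per_instance)
```

In a real controller this decision would run once per monitoring interval, with hysteresis or cool-down periods to avoid oscillation; the cited works replace these fixed thresholds with optimization or control-theoretic policies.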
Finally, [Xio11] addresses the challenge of minimizing the total amount of resources while meeting end-to-end performance requirements for N-tier web applications. Open and closed workloads are considered as input for an adaptive PI controller. An SLA-based control method leads to an exact solution minimizing the average response time.

Application placement

The application placement problem, together with the dynamic resource allocation problem, is addressed and optimally solved in [Had12], where a minimum cost maximum flow algorithm is proposed. The solution is based on a Bin-Packing algorithm combined with a prediction mechanism. An opportunistic scheduling approach, instead, is proposed in [He12], where parallel tasks are considered and low-priority tasks are allocated to underutilized computation resources left by high-priority tasks. A model representing tasks as ON/OFF Markov chains is presented. In [Cap10], the SOS Cloud project is presented. The project aims at providing robust and scalable solutions for service deployment and resource provisioning in a cloud infrastructure, with a double objective: meeting the service level agreements and minimizing the required cloud resources. The algorithms developed have the additional benefit of taking advantage of cloud elasticity, allocating and deallocating resources to help the services respect their contractual SLAs. Lastly, a bio-inspired cost minimization mechanism for data-intensive service provision is proposed in [Wan12]. The mechanism uses bio-inspired concepts to manage data application services, to create a large services cluster, and to produce optimal composition solutions. The authors propose a multi-objective genetic algorithm capable of returning a set of Pareto-optimal solutions.

Capacity allocation

As far as the capacity allocation decision variable is concerned, the literature abounds with works considering it as part of the proposed solution.
In [Bjo12], for instance, the authors discuss an opportunistic service replication policy that leverages the VM workload and performance variability, as well as on-demand billing pricing models, to ensure response time constraints while achieving a target system utilization for the underlying resources.
Alternatively, one can mention [Gou11], where a Force-directed Resource Assignment (FRA) heuristic is used to optimize the total expected profit obtained from processing, memory, and communication resources. Moreover, the results of the proposed approach are compared with those attained by relaxing the capacity constraints, which represent upper bounds for the original problem. Furthermore, in [Zam12] the authors show today's limitations for Cloud Computing providers in allocating their VMs with off-line mechanisms based on fixed prices or auctions. Improvements have been demonstrated by implementing an on-line mechanism that aims at maximizing the profit of each provider. A model for applying revenue management to on-demand IT services has been presented in [Liu10]. The model uses a nonlinear objective function to determine the optimal price over different system capacities and multiple classes with different SLAs. In [Lin12] a branch-and-bound approach, together with an adjusting recursive procedure, is proposed to evaluate and maximize the reliability of a computer network in a Cloud Computing environment; the algorithm devised as a solution considers budget, time, and stochastic capacity constraints. Similarly, the problem of minimizing the use of resources while meeting, at the same time, performance requirements under given financial budget and time constraints has been investigated in [Tia11] for MapReduce applications.

Load Balancing

In [Ard12b] the authors take the perspective of a Web service provider which offers multiple transactional Web Services. They provide a non-linear model of the capacity allocation and load redirection problem for multiple classes of requests, which is solved with decomposition techniques, exploiting predictive models of the incoming workload at each physical site. A heuristic solution method for the same problem is, instead, presented in [Ard11b].
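The flavour of load redirection across sites can be illustrated with a simple greedy policy that routes each incoming request to the site with the lowest resulting load-to-capacity ratio. This is only a sketch under assumed unit-sized requests and static capacities, not the decomposition-based optimization of [Ard12b, Ard11b].

```python
# Sketch: greedy load redirection across physical sites. Each request goes to
# the site whose relative utilization would be lowest after accepting it.

def redirect(requests, capacities):
    """requests: list of load units; capacities: per-site capacity.
    Returns (assignment, per-site load)."""
    load = [0.0] * len(capacities)
    assignment = []
    for r in requests:
        i = min(range(len(capacities)),
                key=lambda s: (load[s] + r) / capacities[s])
        load[i] += r
        assignment.append(i)
    return assignment, load

# Site 0 has twice the capacity of site 1, so it should receive twice the load.
assignment, load = redirect([1.0] * 6, [2.0, 1.0])
```

Running the example splits the six requests 4/2 between the two sites, proportionally to their capacities; the surveyed works replace this myopic rule with workload prediction and multi-class optimization.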
The decentralized load balancing problem, as opposed to the traditional centralized version, has also been the subject of recent works. [Ala09] proposes a decentralized load-balancing mechanism that considers heterogeneous resources. The server state information is exchanged so as to minimize the communication overhead required by a decentralized approach. A bio-inspired algorithm for the load balancing problem is discussed in [Val11], which investigates an alternative for a decentralized service network, based on an unstructured overlay network, in which the nodes that host instances of many different service types self-organize into virtual clusters. The authors present a framework focusing on the load balancing problem, because nodes must be able to efficiently balance the incoming requests among themselves. The proposed approach combines and exploits the synergies between clustering techniques and super-peer topologies. Moreover, it inherits the typical benefits of bio-inspired self-organization, such as scalability with respect to the number of peers, and dynamism and robustness with respect to unexpected behaviour.

Admission control

In [Wu12], cost-effective admission control and scheduling algorithms for SaaS providers are proposed in order to maximize profits while improving the customer satisfaction level. In [Kon12], instead, a probabilistic approach is used to perform admission control and to find the optimal allocation of VMs on physical servers; the multi-objective weighted function incorporates business rules in terms of trust and cost, and it is associated with constraints representing real factors that affect the Cloud services, including the provider selection, the number of users varying over time, and different workload patterns.

Game Theory approaches

Game theory has found applications in numerous fields such as Economics, Social Science, Political Science, and Evolutionary Biology.
Over the last years this branch of applied mathematics has also found applications in problems arising in the ICT industry. For example, resource or QoS allocation problems, pricing, and load shedding cannot always be handled with classical pure optimization approaches. Indeed, in a general complex system the interaction across different players is non-negligible: each player can be affected by the actions of all players, not only by his own actions. Non-cooperative Game Theory tools can capture this aspect perfectly. In this setting, a natural modeling framework involves seeking an equilibrium, or stable operating point, for the system. More precisely, non-cooperative Game Theory is the study of problems of conflict and cooperation among multiple independent decision-makers, that is, the study of the ways in which strategic interactions among economic agents produce outcomes with respect to the preferences (or utilities) of those agents, where the outcomes in question might have been intended by none of the agents. Each agent pursues his own interests
working independently and without assuming anything about what the other players are doing. Moreover, he has to follow certain rules while making his choices, and each agent is supposed to behave rationally. In the language of Game Theory, rationality implies that every player is motivated by maximizing his own utility (or payoff), irrespective of what the other players are doing. Given a game, which strategies will the rational players adopt? Intuitively, a player pursues the case in which his payoff is maximized. Since the payoff function depends also on the strategies of the other players, which in turn are maximizing their own payoffs, a conflict situation is created and it is not easy to characterize the best choice for every player. In other words, when rational players correctly forecast the strategies of their opponents, they are not merely playing best responses to their beliefs about their opponents' play; they are playing best responses to the actual play of their opponents. Indeed, the notion of a solution is more tenuous in game theory than in other fields; it concerns optimality, feasibility, and equilibria. In the fifties a solution concept, due to John Forbes Nash (see [74]), emerged as the most appropriate and effective. When all players correctly forecast their opponents' strategies, and play best responses to these forecasts, the resulting strategy profile is a Nash equilibrium. Formally, a non-cooperative game Γ in strategic form is a tuple {N, {X_i}_{i∈N}, {Θ_i}_{i∈N}} that consists of: a finite set of players N = {1, 2, ..., n}, where n ∈ ℕ; a set of strategies X_i for every player i ∈ N, also called the feasible set of player i; and a payoff function Θ_i : X_1 × X_2 × ... × X_n → ℝ for each player i ∈ N. Moreover, we indicate with X := X_1 × X_2 × ... × X_n ⊆ ℝ^M the common strategy set, called the feasible set or strategy space of the game Γ; every point x ∈ X represents a feasible strategy profile of the game.
Let us denote by x_{-i} the vector of all the players' strategies except the i-th one, x_{-i} := (x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n), so that we can write x = (x_i, x_{-i}). A vector x ∈ X is called a Nash equilibrium for the game if:

Θ_i(x) ≥ Θ_i(x'_i, x_{-i}), for all x'_i ∈ X_i and for all i ∈ N.

Equivalently, x is a Nash equilibrium if and only if, for all i ∈ N, x_i solves the maximization problem max Θ_i(x'_i, x_{-i}) subject to x'_i ∈ X_i, i.e., if and only if no player can improve his payoff by unilaterally changing his strategy. Many approaches have been used to represent, model, and manage Cloud services at run-time through Game Theory tools. In [Fen11] the authors present a methodical, in-depth game-theoretic study of price competition, moving progressively from a monopoly market to a duopoly market, and finally to an oligopoly Cloud market. They characterize the nature of non-cooperative competition in a Cloud market with multiple competing Cloud service providers, derive algorithms that capture the influence of resource capacity and operating costs on the solution, and prove the existence of a Nash equilibrium. Regarding the dynamics of the market, a model of competitive equilibrium in e-commerce to solve the problem of pricing and outsourcing can be found in [Dub07]; here the analysis of pricing choices and decisions to outsource IT capability leads to a representation of Internet competition and derives the maximum-profit solution. Studies of the maximization of social welfare as a long-term social utility are discussed in [Men11]. Considering relevant queueing aspects in a centralized setting, under appropriate convexity assumptions on the operating costs and individual utilities, the work establishes existence and uniqueness of the social optimum. Furthermore, other studies based on non-cooperative game theory are presented in [Wan12], where the authors employ a bidding model to solve the resource allocation problem in virtualized servers with multiple instances competing for resources.
A unique equilibrium point is obtained. A similar discussion can be found in [Wei10], where a QoS-constrained parallel-task resource allocation problem is considered. [Abh12] considers two simple pricing schemes for selling Cloud instances and studies the trade-off between them. Exploiting the Bayes-Nash equilibrium, the authors provide theoretical and simulation-based evidence suggesting that fixed prices generate a higher expected revenue than hybrid schemes. Using Bellman equations and a dynamic bidding policy, in [Zaf12] an optimal strategy under a Markov spot-price evolution is found in order to complete jobs with deadline and availability constraints. The performance of the model is evaluated by considering uniformly distributed spot prices and EC2 spot prices. Another work regarding spot bidding is proposed in [Son12]. The authors propose a profit-aware dynamic bidding algorithm,
which observes the current spot price and selects bids adaptively to maximize the average profit of a Cloud service broker while minimizing its costs in a spot instance market. Finally, a Generalized Nash game for the service provisioning problem has been formulated in [Ard12] and [Ard11], where the perspective of SaaS providers hosting their applications at an IaaS provider is taken. Each SaaS needs to comply with the SLAs of its end-user applications and, at the same time, maximize its own revenue while minimizing the cost of the resources supplied by the IaaS. On the other hand, the IaaS wants to maximize the revenues obtained by providing on-spot resources.

Summary tables

A summary of the classification proposed here is reported in Tables 4.4.a to 4.4.l. Tables 4.4.a to 4.4.d relate to the Problem category, while the Solution category is detailed through Tables 4.4.e to 4.4.j. Finally, Tables 4.4.k and 4.4.l represent the Discipline category, following the classification depicted in Figure 4.4.a.

Problem

The first table, Table 4.4.a, represents the partitioning of the reviewed literature according to the Perspective sub-category. Each work can face a specific problem from two distinct points of view, focusing either on the Cloud provider or on the Cloud end-user. The Quality Attributes are summarized in Table 4.4.b. Four specific attributes are considered (Performance, Cost, Availability and Reliability); other, less common attributes are grouped under the label Others. As is clearly shown, the vast majority of the reviewed papers deal with the Performance, Cost and Availability attributes. Table 4.4.c, instead, considers the Dimensionality sub-category. It classifies the considered approaches into single-objective (SOO) and multi-objective (MOO). In this case, one can see that the methodologies presented in the literature mainly follow the single-objective approach.
Finally, the considered taxonomy categorizes the Constraint sub-category into five possible attributes (Table 4.4.d), namely the Cost, Performance, Availability, Throughput, and Memory usage constraints considered by the state-of-the-art works. The literature is almost evenly distributed among these attributes.

Solution

The type of approach implemented (Centralized, Distributed or Hierarchical) is one of the fundamental characteristics of a solution. Table 4.4.e details how the reviewed papers are subdivided according to this attribute. Notice that the vast majority of them adopt a distributed architecture. Table 4.4.f reports the Decision Variables (DVs) exploited by the various solution methods in order to effectively explore the design space. A DV is the set of possible actions that can be taken upon a current design alternative in order to create new alternatives with, possibly, higher quality. It can easily be noticed that most of the literature leverages Capacity Allocation as a DV. The Architecture representation is reported in Table 4.4.g. Clearly, the state-of-the-art solutions prefer Optimization over Architecture-based models. As far as the Optimization strategy is concerned, the proposed techniques are grouped into two main categories: exact methods and meta-heuristics. Table 4.4.h shows that the literature is evenly distributed between these two approaches. Finally, only a few papers include information about the Time scale and the Constraint handling approach; they are reported and classified in Tables 4.4.i and 4.4.j.

Discipline

The last two tables, namely Tables 4.4.k and 4.4.l, group the considered works with respect to their Discipline. A discipline is fully described by means of a certain Type and Quality model. Table 4.4.k addresses the Type sub-category. Three typologies are considered: Utility based, Control theory, and Bio-inspired.
The Utility based and Bio-inspired approaches are dominant, whereas only a few works fall within the Control theory field.
In Table 4.4.l, instead, the references to the considered works are reported according to the underlying Quality model.

Cloud provider: [Xia12], [Dut12], [Dou12], [LinC12], [Sri08], [Maz12], [Kon12], [Zam12], [Xio11], [NeeV11], [Gou11], [Bjo12], [Liu10], [Wu12], [Fen11], [Men11], [WanDJ12], [Abh12], [Ard11], [Ard12], [Ala09]
Cloud end-user: [Tia11], [Had12], [HE12], [Zaf12], [Son12], [Ard11], [Ard12], [Cap10], [Ard12b], [Ard11b]
Table 4.4.a: Problem category: perspective.

Performance: [Gou11], [Tia11], [Bjo12], [Wu12], [Xia12], [Zaf12], [Cap10], [Ard12b], [Ard11b]
Cost: [Liu10], [Zaf12], [Son12], [Men11], [Had12], [Dub07], [Fen11], [Wan12], [Kon12], [Ard12b], [Ard11b]
Availability: [Wei10], [Zam12], [Dut12], [Had12], [HE12], [Val11]
Reliability: [LinC12], [Xio11]
Others: [Ala09], [Sri08], [Maz12], [Dou12], [Cap10]
Table 4.4.b: Problem category: quality attributes.

Single-objective optimization: [Gou11], [Tia11], [Bjo12], [Fen11], [Liu10], [Zaf12], [Wu12], [Xia12], [Dub07], [Meh12], [Maz12], [Son12], [Men11], [Had12], [Wei10], [Zam12], [Dut12], [HE12]
Multi-objective optimization: [LinC12], [Xio11], [Dou12], [Sri08], [Wan12], [Cap10], [Kon12], [Ard12b], [Ard11b], [Ala09]
Table 4.4.c: Problem category: dimensionality.

Cost: [Tia11], [Liu10], [Dub07], [Son12], [Fen11], [Wei10], [LinC12], [Ard12b], [Ard11b]
Performance: [Wu12], [Gou11], [Tia11], [Fen11], [Had12], [HE12], [LinC12], [Dou12], [Ard12b], [Ard11b]
Availability: [Zaf12], [Wei10], [Zam12], [Had12]
Throughput: [Son12], [Dut12]
Memory: [Gou11], [Liu10], [Meh12], [Had12], [Xio11]
Table 4.4.d: Problem category: constraints.

Centralized: [Bjo12], [Liu10], [Men11], [Tia11], [Dub07], [Meh12], [Wan12]
Distributed: [Wei10], [Sri08], [Cap10], [Val11], [Ard12b], [Ard11b], [Ala09]
Hierarchical: [Meh12]
Table 4.4.e: Solution category: type.

Provider selection: [Xia12], [Dut12], [Xio11], [Kon12]
Application placement: [Xia12], [Had12], [HE12], [Sri08], [Cap10], [Kon12], [Wan12], [Ala09]
Capacity allocation: [Wu12], [Gou11], [Tia11], [Bjo12], [Liu10], [Zaf12], [Son12], [Men11], [Fen11], [Had12], [Wei10], [Zam12], [LinC12], [Xio11], [Dou12], [Maz12], [Sri08], [Ard12b], [Ard11b]
Load balancing: [Xia12], [Val11], [Ard12b], [Ard11b], [Ala09]
Admission control: [Wu12], [Kon12], [Ala09]
Table 4.4.f: Solution category: degrees of freedom.

Architecture models: [Xia12]
Optimization model: [Gou11], [Tia11], [Wu12], [Fen11], [Liu10], [Men11], [Zam12], [Xio11], [Dut12], [Dou12], [Wan12], [Kon12], [Ard12b], [Ard11b], [Ala09]
Table 4.4.g: Solution category: architecture representation.

Exact: [Bjo12], [Liu10], [Zaf12], [Son12], [Men11], [Fen11], [Had12], [Wei10], [Xio11], [Dou12], [Ard11b], [Ala09]
Meta-heuristic: [Wu12], [Gou11], [Wei10], [HE12], [LinC12], [Maz12], [Sri08], [Wan12], [Ard12b]
Table 4.4.h: Solution category: optimization strategy.

Not present: [Dub07]
Hard: [Fen11], [HE12], [Ard12b], [Ard11b]
Penalty: [Liu10]
Table 4.4.i: Solution category: constraints handling.

Minute: [Wu12], [Ard12b], [Ard11b]
Hour: [Maz12], [Ard12b], [Ard11b]
Day: [Fen11]
Table 4.4.j: Solution category: time scale.
Utility based: [Gou11], [Tia11], [Dub07], [Men11], [Fen11], [Wu12], [Wei10], [Ard12b], [Ard11b], [Ala09]
Control theory: [Zaf12], [Had12], [Kon12], [Xio11]
Bio-inspired: [Cap10], [Val11], [Wan12]
Table 4.4.k: Discipline category: type.

Markov chain: [Zaf12], [Son12]
Queuing network: [Gou11], [Tia11], [Dub07], [Wu12]
State based model: [Men11], [Fen11], [Wu12], [Wei10], [Ard12b], [Ard11b], [Ala09]
Table 4.4.l: Discipline category: quality model.

Criteria for evaluation

In order to assess the quality of the solution methods proposed in the literature, several evaluation criteria can be considered. Given the run-time constraints, the time required to find a solution, or the maximum size of the problem instance that can be solved in a given time horizon, needs to be considered. These measures depend on practical and physical limitations, on the specific application under study, on the industrial or research purpose, as well as on the tools and resources available. Another important evaluation criterion is scalability, i.e., the ability of the solution method to handle problems of growing size or to enlarge the optimization scope (e.g., adding additional quality metrics or constraints). Another important aspect is the accuracy achievable by the underlying quality evaluation model, that is, the accuracy obtained when comparing the QoS metrics evaluated through the QoS model with the figures measured in the real system. In [Sri08] four simulations are compared, keeping the number of applications constant while varying disk and CPU utilizations, showing that the energy used by the proposed heuristic is on average about 5.4% above the optimal value, with a 20% tolerance. No information about scalability is reported. To evaluate the scalability of the resource allocation algorithm they propose, the authors of [Ard11] considered a very large set of randomly generated instances.
Such instances have been created by varying the number of SaaS providers between 10 and 100, and the number of applications from 1000 upwards. They showed that the problem can be solved, in the worst case, in less than twenty minutes. In [Dut12] the authors varied the number of servers in an emulated data center and observed the performance, demonstrating that the total cost of their approach increases linearly with the number of servers. They also demonstrated that the running time is statistically independent of the number of servers. A large-scale simulation demonstrates that the algorithm presented in [Xia12] is extremely scalable: the decision time remains under 4 seconds for systems with a large number of servers and applications. A complete scalability study is reported in [Had12]. The deviation from the optimal value is shown to be consistently small and tends to zero as the number of physical machines (PMs) increases. This means that the proposed algorithm is capable of finding solutions very close to the optimum for a large number of PMs and for a big Cloud provider with many data centers. The proposed method scales much better than common Bin-Packing algorithms, which encounter scalability problems and take longer to find the optimal solution to the problem. Finally, in [Ard12] the inefficiency of the two algorithms presented is calculated in terms of Price of Anarchy (PoA) and Individual Worst Case (IWC). A very large number of randomly generated instances is considered, with the number of SaaS providers varying between 10 and 100, and the number of applications from 100 upwards. Furthermore, the article also addresses scalability, showing that the algorithms scale linearly with the cardinality of the set of SaaS providers.
5. MODAClouds Run-Time Platform

5.1. Overview

The aim of this section is to define the requirements for the MODAClouds runtime platform that will be developed in the project. After the introduction to the overall goals of the runtime environment, the general approach, and the high-level conceptual architecture described in Section 1, we define the actors (Section 5.1.1) that are referenced in the requirements specifications (Sections ) and the requirement sets (Section 5.1.2). The requirements elicitation methodology that we have adopted is overviewed in Section 5.1.3. Finally, Section 5.6 provides a roadmap for WP6, focusing in particular on Year 1 of the project.

5.1.1. Actors

In this section, we consider the three platforms defined in the conceptual architecture as actors included in the requirements specifications. In addition to these, we consider in the requirements specifications a set of common actors that are referenced also in the other WPs' requirements specifications:

Cloud app developer: A developer who designs, implements, and tests cloud-based applications.
Cloud app: the cloud application developed by the Cloud app developer using the MODAClouds IDE.
Application cloud: the cloud platform where the Cloud app is (or will be, or was) running.
Service cloud: the cloud platform where the runtime services offered by the runtime platform are (or will be, or were) running. A service is not part of the Cloud app; rather, it is part of the execution platform (e.g., the discovery service).
MODAClouds IDE: this is the envisioned technical output of WP4 and WP5, a design-time environment that will implement the MODACloudML language and that will provide the application code and the initial deployment decisions that are needed by the runtime platform to instantiate the application.
Cloud app admin: An administrator who configures, deploys, operates, and tests cloud-based applications on cloud platforms.
Cloud app provider: A provider that provides cloud-based applications.
QoS engineer: An engineer who specifies quality of service (QoS) constraints and alternatives for design-time exploration and run-time adaptation.

Throughout the requirements elicitation, we use the notation <A> to indicate actor A, e.g., <Cloud app admin>. Furthermore, we refer generically to QoS constraints to mean any hard or soft constraints regarding QoS (e.g., imposed by an SLA) and specified in the MODAClouds IDE.

5.1.2. Requirement Sets

In the following sections, we describe the requirements for the runtime platform. The requirements have been grouped into four categories inspired by the conceptual architecture. The main difference from the conceptual architecture mapping is that the requirements for the monitoring platform are split into two further sets: monitoring requirements and analysis requirements. The former set mainly deals with monitoring data collection and distribution, while the latter emphasizes the analysis of the acquired monitoring data to extract knowledge. The sets of requirements elicited in the rest of this section are as follows:

Execution Requirements (Section 5.2): this group provides requirements for application deployment, initial testing, execution, and runtime management. Management functionalities include runtime services (e.g., discovery, logging, application health controllers) and data management (archival and synchronization).
Monitoring Requirements (Section 5.3): this group provides requirements for the part of the monitoring platform that deals with data collection, preprocessing, distribution, and consumption by means of monitoring data observers.
Analysis Requirements (Section 5.4): this group provides a list of requirements for the data analysis part. These requirements deal with high-level aggregation and processing of the monitoring data and characterize the analysis step of the MAPE-K loop.
Self-adaptivity Requirements (Section 5.5): this group provides requirements for the subsystems that will implement the runtime models and runtime policies developed in WP6.
5.1.3. Requirement Elicitation Methodology

For each group of requirements, we use the guidelines provided in D3.1.1 to define use case scenarios. For the sake of readability, unused entries in the tables are omitted. Furthermore, qualitative requirements that provide more details about a use case and the environment with which it interacts are given in the "Other requirements" subsection. These additional requirements also form necessary requirements for the WP6 runtime architecture. To help readability, we express these "Other requirements" using the keywords proposed in the Internet Engineering Task Force RFC 2119, which are briefly summarized here and related to the "Priority of accomplishment" keywords indicated in D3.1.1 (i.e., Must/Should/Could have):

"MUST"/"MUST NOT"/"REQUIRED"/"SHALL"/"SHALL NOT": equivalent expressions to indicate a Must have priority of accomplishment.
"SHOULD"/"SHOULD NOT"/"RECOMMENDED"/"NOT RECOMMENDED": equivalent keywords to indicate a Should have priority of accomplishment.
"MAY"/"OPTIONAL": equivalent expressions to indicate a Could have priority of accomplishment.

We point to for further details.
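As an illustration only (not project code), the keyword-to-priority mapping above can be captured in a small lookup table; the function name priority_of and the string representation are our own assumptions for this sketch:

```python
# Illustrative mapping (hypothetical, not MODAClouds code) from RFC 2119
# keywords to the "Priority of accomplishment" levels defined in D3.1.1.
PRIORITY_BY_KEYWORD = {}
for kw in ("MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT"):
    PRIORITY_BY_KEYWORD[kw] = "Must have"
for kw in ("SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED"):
    PRIORITY_BY_KEYWORD[kw] = "Should have"
for kw in ("MAY", "OPTIONAL"):
    PRIORITY_BY_KEYWORD[kw] = "Could have"

def priority_of(requirement_text):
    """Return the priority implied by the first RFC 2119 keyword found.

    Longer keywords are tried first so that e.g. "MUST NOT" is not
    matched as a bare "MUST".
    """
    for kw in sorted(PRIORITY_BY_KEYWORD, key=len, reverse=True):
        if kw in requirement_text:
            return PRIORITY_BY_KEYWORD[kw]
    return None
```

Such a lookup could, for instance, be used to cross-check that each "Other requirements" entry carries a recognized keyword.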
5.2. Execution Requirements

Context and System Overview

Context

Category name: Execution
Description: The scope of the following use case specification is to elicit requirements for application deployment and execution. In the context of the MODAClouds reference architecture, this falls primarily within the Execution Platform.

System Boundary Model

Figure 5.2.a: Execution Requirements

Use case specification for the Run Application use case

Use case name: Run application
Use case ID: UC-MC.wp6.Execution.Run Application.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Run application use case of the system boundary model in Section
Goal: The goal of the Run application use case is to start, stop, query the status of, and manage a <Cloud app> instance on the <Application Cloud>.
Main Actors: <Cloud app admin>, <Cloud app>, <MODAClouds IDE>

Use case scenarios
Main success scenarios:
1. The <MODAClouds IDE> requests to start or stop <Cloud app>. Alternatively, the <Cloud app admin> requests via a web-based UI to start or stop <Cloud app>. The <Execution Platform> automatically starts or stops the application on the target <Application Cloud>.
2. The <Cloud app admin> requests via a web-based UI that the <Execution Platform> show the configuration and the logs of the running <Cloud app>.
3. The <Execution Platform> feeds back information to the caller about the status of the application.

Preconditions:
1. The application is compliant with the restrictions of the <Application Cloud>, and uses the appropriate APIs, packaging, etc.

Postconditions:
1. The application has been successfully deployed on the <Application Cloud>.

Other requirements:
1. An instance of the <Execution Platform> can start or stop a single instance of <Cloud app> and be deployed on a single <Application cloud>. Therefore, separate <Cloud app> instances MUST have separate <Execution Platform> instances.
2. The <Execution Platform> and the <Cloud app> MUST be treated as independent software artifacts. They MAY run on different clouds, preferably within (network) topological proximity to reduce latency. Therefore, they SHOULD rely as much as possible on services and protocols that can operate in any cloud environment (e.g., HTTP-based RESTful services).

Use case specification for the Deploy Application use case

Use case name: Deploy Application
Use case ID: UC-MC.wp6.Execution.Deploy Application.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Deploy application use case of the system boundary model in Section
Goal: To deploy the packaged <Cloud app> to the targeted <Application cloud>.
Main Actors: <Cloud app>, <Cloud app admin>, <MODAClouds IDE>

Use case scenarios
Main success scenarios:
1. The <Execution Platform> is instructed by the <MODAClouds IDE>, or by the <Cloud app admin>, to deploy the application.
2. The <Execution Platform> provisions the required resources from the <Application cloud> on behalf of the <Cloud app admin>.
3. The <Execution Platform> then deploys all the software artifacts needed to run the <Cloud app>, which include the <Cloud app> itself and the other MODAClouds services needed by the application.

Preconditions:
1. The <Cloud app> was packaged properly for the <Application cloud>.
2. The <Cloud app admin> has the proper credentials to access the <Application cloud>.
3. The <Cloud app admin> has delegated the credentials to the <Execution Platform>.

Postconditions:
1. The <Cloud app> has been successfully deployed on the <Application cloud>.

Other requirements:
1. Deploying the application includes deploying the necessary <MODAClouds services>.

Use case specification for the Start/Stop Application Sandbox use case

Use case name: Start/Stop Application Sandbox
Use case ID: UC-MC.wp6.Execution.Start/Stop Application Sandbox.-V01
Priority of accomplishment: Should Have

Use case description

Use case diagram: See the Start/stop application sandbox use case of the system boundary model in Section
Goal: Start/stop the application in a controlled container for the purpose of application testing or calibration of the services and their internal data structures (e.g., runtime models).
Main Actors: <Cloud app>, <MODAClouds IDE>, <Cloud app admin>

Use case scenarios

Main success scenarios:
1. The <Cloud app admin> or the <MODAClouds IDE> requests the <Execution Platform> to create a sandbox environment for <Cloud app>.
2. The <Execution Platform> creates a sandbox environment and configures the services for executing in this special environment.

Postconditions:
1. The <Execution Platform> accepts the same requests as in a normal environment (e.g., Deploy Application, etc.), but these are all performed in the sandbox environment.

Use case specification for the Synchronise Application Data use case

Use case name: Synchronise Application Data
Use case ID: UC-MC.wp6.Execution.Synchronise Application Data.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Synchronise Application Data use case of the system boundary model in Section
Goal: To allow the Cloud app developer or the QoS engineer to specify data replicas and the synchronization requirements for them.
Main Actors: <QoS engineer>, <Cloud app developer>

Use case scenarios

Main success scenarios:
1. The QoS engineer or the Cloud app developer selects, for a portion of the database or for the whole database, the synchronization requirements between replicas. These can be: consistent or eventually consistent.
2. The execution platform examines the deployment configuration of the database and creates and activates the proper synchronization connectors between the replicas.

Preconditions:
The application has already been deployed on the execution platform.

Postconditions:
The execution platform is ready to keep the data replicas synchronized according to the selected synchronization requirements.

Other requirements:
1. The execution platform MAY offer the possibility to check that the synchronization choices made by the user are consistent with the way the system is deployed.
2. The execution platform MAY offer the possibility to change the synchronization requirements dynamically.
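To make the Synchronise Application Data scenario concrete, the following sketch shows how a mapping from database portions to synchronization modes could be validated against the deployed replicas before connectors are created, in the spirit of the consistency check named in the other requirements. This is a hypothetical illustration: the function plan_synchronization, the mode names, and the pairwise-connector layout are assumptions, not MODAClouds APIs.

```python
# Hypothetical sketch: each database portion is tagged with a synchronization
# requirement; the platform validates the choice against the deployment and
# then plans the connectors that keep the replicas synchronized.
VALID_MODES = {"consistent", "eventual"}

def plan_synchronization(requirements, deployed_replicas):
    """requirements: {portion: mode}; deployed_replicas: {portion: [replica, ...]}.

    Returns a list of (replica_a, replica_b, mode) connector tuples, or raises
    ValueError when a choice is inconsistent with the deployment.
    """
    connectors = []
    for portion, mode in requirements.items():
        if mode not in VALID_MODES:
            raise ValueError("unknown synchronization mode: %s" % mode)
        replicas = deployed_replicas.get(portion, [])
        if len(replicas) < 2:
            # Nothing to synchronize: the user's choice does not match
            # the way the system is deployed.
            raise ValueError("portion %r has fewer than two replicas" % portion)
        # One connector per adjacent pair of replicas (a chain topology,
        # chosen here only for illustration).
        for a, b in zip(replicas, replicas[1:]):
            connectors.append((a, b, mode))
    return connectors
```

Changing the synchronization requirements dynamically would then amount to recomputing this plan with an updated requirements map.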
5.3. Monitoring Requirements

Context and System Overview

Context

Category name: Monitoring
Description: The scope of the following use case specifications is to detail the main functionalities offered by the monitoring platform.

System boundary model

Figure 5.3.a: Monitoring Requirements

Use case specification for the Install Monitoring Rule use case

Use case name: Install Monitoring Rule
Use case ID: UC-MC.wp6.Monitoring.Install Monitoring Rule.-V01
Priority of accomplishment: Must Have

Use case description
Use case diagram: See the Install Monitoring Rule use case of the system boundary model in Section
Goal: Monitoring rules are produced at design time by WP5 and define the object to be monitored, the measures to be gathered, the time window in which monitoring should happen, and the frequency of monitoring. The goal of this use case is to allow new rules to be installed in the monitoring platform.
Main Actors: <MODAClouds IDE>, <Cloud app admin>

Use case scenarios

Main success scenarios:
1. The Cloud app developer, through the <MODAClouds IDE>, or the <Cloud app admin>, through a direct interface, requests the installation of a new rule.
2. The <Monitoring Platform> checks that the rule has not been previously installed.
3. If the previous check is successful, then the <Monitoring Platform> installs the rule and puts it in the Inactive state.

Preconditions:
The <Monitoring Platform> is ready to start its service.

Postconditions:
The monitoring rule is properly installed in the monitoring platform.

Other requirements:
1. The <Monitoring Platform> MUST allow for the installation of new monitoring rules before monitoring starts.
2. The <Monitoring Platform> MUST allow for the installation of multiple monitoring rules.
3. The <Monitoring Platform>, upon installation of a monitoring rule, MAY check that it can actually be executed in the current <Monitoring Platform> configuration, i.e., that the associated data can be gathered and the corresponding computations/filtering/compositions can be executed.
4. The <Monitoring Platform> MAY allow for the installation of new monitoring rules during its execution.

Use Case Specification for the Activate/Deactivate Monitoring Rule use case

Use case name: Activate/Deactivate Monitoring Rule
Use case ID: UC-MC.wp6.Monitoring.Activate/Deactivate Monitoring Rule.-V01
Priority of accomplishment: Must Have
Use case description

Use case diagram: See the Activate/Deactivate Monitoring Rule use case of the system boundary model in Section
Goal: Monitoring rules are produced at design time by WP5 and define the object to be monitored, the measures to be gathered, the time window in which monitoring should happen, and the frequency of monitoring. The goal of this use case is to activate a monitoring rule already installed in the <Monitoring Platform> or to deactivate an activated one.
Main Actors: <Cloud app admin>

Use case scenarios

Main success scenarios:

Activation scenario
1. The <Cloud app admin>, through a specific user interface, requests the activation of a rule that is installed and in the Inactive state.
2. The <Monitoring Platform> checks that it can collect the required measures based on its current internal configuration.

Deactivation scenario
1. The <Cloud app admin>, through a specific user interface, requests the deactivation of a rule that is in the Active state.
2. The <Monitoring Platform> stops the execution of the monitoring rule and puts it in the Inactive state.
3. If the deactivated rule is the last active one in the monitoring platform, then the platform stops collecting monitoring data.

Preconditions:
The <Monitoring Platform> is ready to start its service or it is already running.

Postconditions:
Upon activation of a monitoring rule, the <Monitoring Platform> starts executing it. Upon deactivation of a monitoring rule, the <Monitoring Platform> stops executing it.

Other requirements:
1. The <Monitoring Platform> MUST allow for the activation and deactivation of monitoring rules during execution.
2. The <Monitoring Platform> MUST execute all Active rules.
3.
The <Monitoring Platform>, upon activation of a monitoring rule, MAY check that it can actually be executed in the current <Monitoring Platform> configuration, i.e., that the associated data can be gathered and the corresponding computations/filtering/compositions can be executed.

Use case specification for the Add/Remove Observer use case

Use case name: Add/Remove Observer
Use case ID: UC-MC.wp6.Monitoring.Add/Remove Observer.-V01
Priority of accomplishment: Must Have

Use case description
Use case diagram: See the Add/Remove Observer use case of the system boundary model in Section
Goal: An observer is any software component that needs to receive information from the monitoring platform. The objective of Add Observer is to allow new components to subscribe to the monitoring platform. Upon subscription they start receiving a specific stream of data. Such a stream is specified, as part of the Add Observer operation, in terms of an RDF query. The Remove Observer operation simply detaches an observer from a stream.
Main Actors: <MODAClouds IDE>, <Cloud app admin>, <Self-adaptation platform>

Use case scenarios

Main success scenarios:

Add Observer scenario
1. The <MODAClouds IDE>, the <Cloud app admin> or the <Self-adaptation platform> (generically called the Observer) requests the Add Observer operation, passing an RDF query and a reference to itself as parameters.
2. The <Monitoring Platform> checks whether it can fulfil the specified RDF query.
3. If yes, it adds the observer to its list and returns a reference to the proper stream.

Remove Observer scenario
1. The Observer requests the Remove Observer operation, passing a reference to itself as parameter.
2. The <Monitoring Platform> checks whether the observer is in the list.
3. If yes, it removes the observer from the list.

Preconditions:
The <Monitoring Platform> is up and running.

Postconditions:
The list of observers remains in a consistent state.

Use case specification for the Collect Monitoring Data use case

Use case name: Collect Monitoring Data
Use case ID: UC-MC.wp6.Monitoring.Collect Monitoring Data.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Collect Monitoring Data use case of the system boundary model in Section
Goal: The successful collection of the required metrics from both the application level and the cloud level (PaaS and IaaS containers), based on the monitoring rules in the Active state.
Main Actors: <Cloud app> (here the <Cloud app> generically represents any monitorable resource; it may also include the Application cloud, if this provides proper monitoring mechanisms).
Use case scenarios

Main success scenarios:

Collect monitoring data in pull mode
1. Periodically, the <Monitoring Platform> checks whether the assigned monitoring cost constraint is still positive.
2. If not, the <Monitoring Platform> closes the connection with the Application cloud or the <Cloud app>.
3. If yes, it queries the Application cloud or the <Cloud app> in order to receive monitoring information.
4. If the query is well formed and the Application cloud or <Cloud app> interface is running, the Application cloud or <Cloud app> provides the required data.
5. The <Monitoring Platform> executes the Active monitoring rules on the collected information.
6. It then gives control to the Distribute Data use case.

Collect monitoring data in push mode
7. The <Monitoring Platform> periodically receives data from the Application cloud or the <Cloud app>.
8. The <Monitoring Platform> executes the Active monitoring rules on the collected information.
9. It then gives control to the Distribute Data use case.
10. Periodically, the <Monitoring Platform> checks whether the assigned monitoring cost constraints are still positive.
11. If one is negative, the <Monitoring Platform> closes the connection with the Application cloud or the <Cloud app>.

Exceptions:
In pull mode, if data do not arrive within the expected (configurable) time frame, the <Monitoring Platform> raises an alarm to the <Cloud app admin>.

Preconditions:
At least one monitoring rule is active in the monitoring platform. The Application cloud and <Cloud app> components that are able to provide the required data are known to the <Monitoring Platform>, and a connection with them has already been established.

Postconditions:
Data are acquired and passed to the Distribute Data use case.

Other requirements:
1.
The <Monitoring Platform> MUST acquire at runtime QoS metrics (the performance, availability, and health metrics specified in deliverable D6.2) from the <Cloud app> and, if exposed by the cloud provider, from its Application cloud (either IaaS or PaaS).
2. The <Monitoring Platform> MUST acquire historical and current information about the resource usage costs incurred to run the <Cloud app>.
3. For cloud platforms offering resources at spot prices (e.g., EC2 spot instances), the <Monitoring Platform> MAY be able to acquire spot prices also relative to a custom time horizon.
4. The <Monitoring Platform> MAY rely on existing standard monitoring APIs (e.g., JMX), tools (e.g., SIGAR, sar), and cloud provider monitoring APIs.
5. Each monitoring cost constraint MUST be configured within the <Monitoring Platform> probe by the Execution Platform at deployment time of the <Cloud app>.
6. The monitoring cost constraint value MAY be updated at runtime by the Self-Adaptation Platform for cost or overhead management purposes. If a metric can be acquired at no cost, then its cost constraint will be infinite.
7. The <Monitoring Platform> MUST offer the ability to activate and deactivate the acquisition of certain information at application runtime.
8. The <Monitoring Platform> MAY offer the ability to adjust the sampling rate at which data is acquired from the Application cloud or the <Cloud app>. This adjustment is requested by the Self-Adaptation Platform.
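The pull-mode scenario and the cost-constraint requirements above can be summarized as a single collection cycle: check the remaining monitoring budget, query the provider, run the Active rules, and hand the result to the Distribute Data use case. The sketch below is illustrative only; pull_cycle, its parameters, and the (data, cost) provider contract are hypothetical names, not the platform implementation.

```python
def pull_cycle(budget, query_provider, execute_rules, distribute):
    """One pull-mode collection cycle (hypothetical sketch).

    budget: remaining monitoring cost allowance; collection stops when it
    is no longer positive (the cost-constraint check in steps 1-2).
    query_provider: callable returning (raw_data, cost) from the
    Application cloud or <Cloud app> (step 3-4).
    execute_rules: callable applying the Active monitoring rules (step 5).
    distribute: callable handing the result to Distribute Data (step 6).

    Returns (remaining_budget, connection_open).
    """
    if budget <= 0:
        # Budget exhausted: close the connection with the provider.
        return budget, False
    raw_data, cost = query_provider()
    processed = execute_rules(raw_data)
    distribute(processed)
    return budget - cost, True
```

A push-mode variant would differ only in that the provider delivers data spontaneously and the budget check runs in a separate periodic task.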
Use case specification for the Distribute Data use case

Use case name: Distribute Data
Use case ID: UC-MC.wp6.Monitoring.Distribute Data.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Distribute Data use case of the system boundary model in Section
Goal: The successful distribution of information to the observers connected to the monitoring platform.
Main Actors: <MODAClouds IDE>, <Cloud app admin>, <Self-adaptation platform>

Use case scenarios

Main success scenarios:
1. The <Monitoring Platform> executes the queries defined by the observers on the data collected through the Collect Data use case.
2. The <Monitoring Platform> sends to the observers all the data that match the queries associated with them (these data are sent through a stream).

Preconditions:
The <Monitoring Platform> is acquiring data through the Collect Data use case.

5.4. Analysis Requirements

Context and System Overview

Context

Category name: Analysis
Description: The scope of the following use case specification is to define the analysis and measurement functionalities of the Execution Platform. These functionalities have the goal of receiving monitoring data and extracting aggregate metrics and knowledge from it.
System Boundary Model

Figure 5.4.a: Analysis Requirements

Use case specification for the Detect Violation use case

Use case name: Detect Violation
Use case ID: UC-MC.wp6.Analysis.Detect Violation.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Detect Violation use case of the system boundary model in Section
Goal: Detect a violation of a QoS constraint on a measured metric and raise a trigger to the <Self-Adaptation Platform>.
Main Actors: <Self-Adaptation Platform>

Use case scenarios

Main success scenarios:
1. A monitoring rule of the <Monitoring Platform> detects a violation in the value of one or more QoS metrics.
2. The <Monitoring Platform> automatically raises a trigger to all registered observers.

Preconditions:
1. One or more monitoring rules are installed and active on the <Monitoring Platform>.
2. There exist one or more observers registered to the triggers.

Postconditions:
1. Triggers are raised in the presence of QoS violations.

Other requirements:
1. Detection rules SHOULD be specified as part of the monitoring queries installed in the <Monitoring Platform>. These MAY be either SLA requirements or soft QoS constraints that are requested by the developer.

Use case specification for the Correlate Monitoring Data use case

Use case name: Correlate Monitoring Data
Use case ID: UC-MC.wp6.Analysis.Correlate Monitoring Data.-V01
Priority of accomplishment: Should Have

Use case description

Use case diagram: See the Correlate Monitoring Data use case of the system boundary model in Section
Goal: The goal is to establish a relationship between measurements collected on different components of the application, with the aim of generating measures that summarize component runtime execution correlations.
Main Actors: <Self-Adaptation Platform>

Use case scenarios

Main success scenarios:

Correlation in deterministic mode
1. The <Monitoring Platform> collects monitoring data from a set of streams.
2. Based on the timestamps, the <Monitoring Platform> outputs on a new stream a measure that pairs events from different sources as being related to each other.

Correlation in statistical mode (black box)
1. The <Monitoring Platform> collects monitoring data from a set of streams.
2. Within a time window, the <Monitoring Platform> runs a statistical correlation algorithm.
3. The <Monitoring Platform> outputs on a new stream a measure that describes the statistical similarities between the metrics coming on the different streams.

Correlation in statistical mode (white box)
1. The <Monitoring Platform> collects monitoring data from a set of streams.
2. For each monitoring metric to be correlated, the <Monitoring Platform> determines the associated streams and checks that the metric is supported for correlation, returning an error if not.
3. The <Monitoring Platform> periodically obtains a description of the current application topology from the <Execution Platform>.
4.
Within a time window, the <Monitoring Platform> runs a statistical correlation algorithm that exploits the application model available to the <Monitoring Platform> (see the preconditions) to find statistical similarities between the metrics coming on the different streams.
5. The <Monitoring Platform> outputs on a new stream a measure that describes the statistical similarities between the metrics coming on the different streams.

Preconditions:
The <Monitoring Platform> exposes a set of high-level measures that it can provide to any observer. One or more observers are registered to receive data from these measures, most likely components of the <Self-Adaptation Platform>. The <Execution Platform> maintains time synchronization for monitoring data collected from different sources. The <MODAClouds IDE> has provided to the <Monitoring Platform> information on the dependencies between the application components. The <Self-Adaptation Platform> has provided to the <Monitoring Platform> information on the current topology of the application. The <Self-Adaptation Platform> has registered as an observer on the output streams.

Postconditions:
Correlation measures are provided in output by the <Monitoring Platform> on one or more streams.

Other requirements:
1. Correlation in deterministic mode MAY be offered via a standard data stream aggregator solution programmed to account for the specificity of the MODAClouds Execution Platform.
2. The Monitoring Platform SHOULD be capable of correlating events based only on information independent of the specific target cloud being considered.

Use case specification for the Estimate Measure use case

Use case name: Estimate Measure
Use case ID: UC-MC.wp6.Analysis.Estimate Measure.-V01
Priority of accomplishment: Must Have

Use case description

Use case diagram: See the Estimate Measure use case of the system boundary model in Section
Goal: Estimate QoS metrics of the system, within a time horizon specified by the <MODAClouds IDE>, that are not directly observable by the data collectors or that cannot be observed due to overhead concerns.
Main Actors: <MODAClouds IDE>, <Self-Adaptation Platform>

Use case scenarios

Main success scenarios:

Estimation in blackbox mode
1. The <Monitoring Platform> parses the estimation requirements specification provided by the <MODAClouds IDE>.
2. For each metric to be estimated, the <Monitoring Platform> determines the streams needed to estimate that metric and returns an error if one or more are unavailable.
3. The <Monitoring Platform> continuously runs estimation algorithms to estimate the value of the metrics that cannot be directly observed by the monitoring probes.
4. The results of the estimation algorithms are put in output on streams consumed by observers of the <Self-Adaptation Platform> and by the <MODAClouds IDE> via the feedback loop.

Estimation in white-box mode
5. The <Monitoring Platform> parses the estimation requirements specification provided by the <MODAClouds IDE>.
6. For each metric to be estimated, the <Monitoring Platform> determines the streams needed to estimate that metric and returns an error if one or more are unavailable.
7. The <Monitoring Platform> periodically obtains a description of the current application topology from the <Execution Platform>.
8. The <Monitoring Platform> continuously runs estimation algorithms to estimate the value of the metrics that cannot be directly observed by the monitoring probes. The results of the estimation algorithms are put in output on streams consumed by observers of the <Self-Adaptation Platform> and by the <MODAClouds IDE> via the feedback loop.

Preconditions:
The <Monitoring Platform> exposes a set of high-level measures that it can provide to any observer. One or more observers of the <Self-Adaptation Platform> are registered to receive data from these measures. The <MODAClouds IDE> has provided to the <Monitoring Platform> indications of which metrics should be estimated.

Postconditions:
Estimated measures are provided in output by the <Monitoring Platform> on one or more streams.

Other requirements:
1.
For a given application, the Monitoring Platform MUST be able to estimate, if requested by the monitoring queries and if the monitoring data is available, at least the mean value of traffic arrival rates, service demands, number of active users, throughputs, failure events, and startup times/uptimes/downtimes.
2. For the same performance indicators, the Monitoring Platform MAY be able to provide an estimate of the variance and of percentiles over a reference time window.
3. If requested by the monitoring queries, the Monitoring Platform MUST be able to differentiate the estimation across workload classes and different resources.
4. The estimation MAY depend on the runtime models, when this dependence does not introduce a circular dependence that cannot be resolved.
5. The estimation MAY return confidence information on the estimates. The estimation MUST support timeouts for the algorithms and MUST cope with abnormal termination and infeasibilities in the solutions without cascading errors into the dependent systems.

Use case specification for the Forecast Measure use case

Use case name: Forecast Measure
Use case ID: UC-MC.wp6.Analysis.Forecast Measure.-V01
Priority of accomplishment: Should Have

Use case description

Use case diagram: See the Forecast Measure use case of the system boundary model in Section
Goal: These services will forecast, using statistical methods, some of the metrics needed by the <Self-Adaptation Platform> to manage the application QoS.
Main Actors: <MODAClouds IDE>, <Self-Adaptation Platform>

Use case scenarios

Main success scenarios:

Forecasting in blackbox mode
1. The <Monitoring Platform> parses the forecasting requirements specification provided by the <MODAClouds IDE>.
2. For each monitoring metric to be forecast, the <Monitoring Platform> determines the associated streams and checks that the metric is supported for forecasting, returning an error if not.
3. The <Monitoring Platform> continuously runs forecasting algorithms to predict the value of the metrics on the input streams.
4. The results of the blackbox forecasting algorithms are put in output on streams consumed by observers of the <Self-Adaptation Platform>.

Forecasting in white-box mode
1. The <Monitoring Platform> parses the specification provided by the <MODAClouds IDE>.
2. For each monitoring metric to be forecast, the <Monitoring Platform> determines the associated streams and checks that the metric is supported for forecasting, returning an error if not.
3.
<Monitoring Platform> periodically obtains a description of the current application topology from <Execution Platform>
4. <Monitoring Platform> continuously runs white-box forecasting algorithms to predict the values of the metrics on the input streams based
on the Application Models and the topology information
5. The results of the white-box forecasting algorithms are put in output on streams consumed by observers of the <Self-Adaptation Platform>

Preconditions:
1. <Monitoring Platform> exposes a set of high-level measures that it can provide to any observer
2. One or more observers are registered to receive data from these measures, normally from <Self-Adaptation Platform>

Postconditions:
1. Forecasted measures are given in output on one or more output streams

Other requirements:
1. The Monitoring Platform MUST be able to carry out forecasting at predefined times or periodically, with a given period included in the <MODAClouds IDE> specification.
2. The forecasting MAY depend on the application models, when this dependence does not introduce a circular dependence that cannot be resolved.
3. The forecasting MUST support timeouts to provide forecasts and MUST cope with abnormal termination and infeasibilities in the predictions without generating errors in the dependent systems.

Use case specification for the Feedback Measure use case

Use case heading
- Use case name: Feedback Measure
- Use case ID: UC-MC.wp6.Analysis.Feedback Measure.-V01
- Priority of accomplishment: Must Have

Use case description
- Use case diagram: See the Feedback Measure use case of the system boundary model in Section
- Goal: Return a measure to <MODAClouds IDE> to support the design-runtime feedback loop.
- Main Actors: 1. <MODAClouds IDE>

Use case scenarios
Main success scenarios:
1. <MODAClouds IDE> requests <Monitoring Platform> to provide feedback on a set of raw metrics or high-level measures.
2. <Monitoring Platform> creates feedback streams for the data to be pushed to <MODAClouds IDE>.
3. <Monitoring Platform>, following the input specification provided by <MODAClouds IDE>, binds either raw metrics streams or measures to the feedback streams.
4.
<Monitoring Platform> deactivates the feedback streams upon request of <MODAClouds IDE> or when <Cloud app> is not running.

Preconditions:
1. <Monitoring Platform> running
2. <Cloud app> deployed, not necessarily running
Postconditions:
1. Feedback streams push raw metrics and measurements to <MODAClouds IDE>

5.5. Self-Adaptivity Requirements

Context and System Overview

Context (use case template)
- Category name: Self-Adaptivity
- Description: The scope of the following use case specifications is to define the Self-Adaptation management services of the MODAClouds runtime environment.

System Boundary Model

Figure 5.5.a: Self-Adaptivity Requirements

Use case specification for the Define/Undefine QoS Constraints use case

Use case heading
- Use case name: Define/Undefine QoS Constraints
- Use case ID: UC-MC.wp6.Self-Adaptivity.Define/Undefine QoS Constraints.-V01
- Priority of accomplishment: Must Have

Use case description
- Use case diagram: See the Define/Undefine QoS Constraints use case of the system boundary model in
Section
- Goal: These services will allow defining or undefining in the <Self-Adaptation Platform> a set of QoS constraints for <Cloud app> specified by the <QoS engineer> in the <MODAClouds IDE>.
- Main Actors: <MODAClouds IDE>, <Self-Adaptation Platform>, <Cloud app admin>

Use case scenarios
Main success scenarios:
1. <MODAClouds IDE> requests <Self-Adaptation Platform> to define/undefine a set of QoS Constraints for <Cloud app>.
2. <Self-Adaptation Platform> checks the correctness of the information received, stores it, and adds a log entry for the operation.
3. <Self-Adaptation Platform> returns a success or failure code to <MODAClouds IDE>.

Preconditions:
1. <Cloud app> is deployed on <Application Cloud>
2. <Self-Adaptation Platform> is deployed and running on <Service Cloud>

Postconditions:
1. <Self-Adaptation Platform> updated its internal information to define/undefine the QoS Constraints.

Other requirements:
1. The correctness of the QoS Constraints specification SHOULD also be checked by <MODAClouds IDE>.
2. QoS Constraints MUST be specified in a parsable interchange format, e.g., an SLA specification in XML.
3. Define QoS Constraints SHOULD be automatically invoked by <Execution Platform> when running the <Deploy Application> use case.

Use case specification for the Start/Stop Feedback of Self-Adaptivity Data use case

Use case heading
- Use case name: Start/Stop Feedback of Self-Adaptivity Data
- Use case ID: UC.MC.wp6.Self-Adaptivity.Start/Stop Feedback of Self-Adaptivity Data-V01
- Priority of accomplishment: Could Have

Use case description
- Use case diagram: See the Start/Stop Feedback Self-Adaptivity Data use case of the system boundary model in Section
- Goal: Return detailed data on the actions taken by the <Self-Adaptation Platform> in a reference time horizon and their outcomes.
- Main Actors: <MODAClouds IDE>, <Monitoring Platform>, <Execution Platform>
Use case scenarios
Main success scenarios:
1. <MODAClouds IDE> requests <Self-Adaptation Platform> to start/stop feedback of self-adaptivity data in a reference time horizon.
2. <Self-Adaptation Platform> configures the runtime models to record/stop recording data.
3. <Self-Adaptation Platform> registers/deregisters with <Monitoring Platform> to start/stop a Data Collector of the self-adaptivity data.
4. <Self-Adaptation Platform> returns to <MODAClouds IDE> the success or failure of the operation.

Preconditions:
1. <Self-Adaptation Platform> deployed and running on <Service Cloud>
2. Runtime models running for <Cloud app>.

Postconditions:
1. Data collectors for self-adaptivity data started/stopped

Use case specification for the Define/Undefine Cost Constraints use case

Use case heading
- Use case name: Define/Undefine Cost Constraints
- Use case ID: UC.MC.wp6.Self-Adaptivity.Define/Undefine Cost Constraints-V01
- Priority of accomplishment: Must Have

Use case description
- Use case diagram: See the Define/Undefine Cost Constraints use case of the system boundary model in Section
- Goal: These services will allow defining or undefining in the <Self-Adaptation Platform> a set of Cost Constraints for <Cloud app> or <Monitoring Platform> specified by the <QoS engineer> in <MODAClouds IDE>.
- Main Actors: <MODAClouds IDE>, <Self-Adaptation Platform>, <Execution Platform>, <Cloud app admin>

Use case scenarios
Main success scenarios:
1. <MODAClouds IDE> requests <Self-Adaptation Platform> to define/undefine a set of Cost Constraints for <Cloud app>.
2. <Self-Adaptation Platform> checks the correctness of the information received, stores it, and adds a log entry for the operation.
3. <Self-Adaptation Platform> returns a success or failure code to <MODAClouds IDE>.

Preconditions:
1. <Cloud app> is deployed on <Application Cloud>
2. <Self-Adaptation Platform> is deployed and running on <Service Cloud>

Postconditions:
1.
<Self-Adaptation Platform> updated its internal information to define/undefine the Cost Constraints.

Other requirements:
1. The correctness of the Cost Constraints specification SHOULD also be checked by <MODAClouds IDE>.
2. Define Cost Constraints SHOULD be automatically invoked by <Execution Platform> when running the <Deploy Application> use case.

Roadmap

In this section, we describe the roadmap, focusing on Year 1 activities. In this first year of the project, the MODAClouds consortium will focus on realizing the initial prototypes for the <Monitoring Platform> and for the execution platform. The Self-Adaptation Platform and the multi-cloud deployment components will be included in the workplan for the following years; interfaces between these components will be specified in Year 1. In terms of target clouds, in Year 1 at least the Amazon EC2 and the Flexiscale IaaS platforms will be considered. The IaaS focus will continue in Year 2, when initial support for PaaS will be provided. We envision at this stage that the focus will shift more towards PaaS in the last year of the project. In terms of timelines for the implementation of the requirements, the following table outlines the general roadmap:

#  | Group           | Use case scenarios (UC-MC.wp6.*)            | Priority    | Year(s)
1  | Execution       | Run Application                             | Must Have   | 1,2
2  | Execution       | Start/Stop Application Sandbox              | Should Have | 2,3
3  | Execution       | Synchronise Application Data                | Must Have   | 2,3
4  | Execution       | Deploy Application                          | Must Have   | 1,2
5  | Monitoring      | Collect Monitoring Data                     | Must Have   | 1,2
6  | Monitoring      | Distribute Data                             | Must Have   | 1
7  | Monitoring      | Install Monitoring Rule                     | Must Have   | 1,2
8  | Monitoring      | Activate/Deactivate Monitoring Rule         | Must Have   | 1,2
10 | Monitoring      | Add/Remove Observer                         | Must Have   | 1,2
11 | Analysis        | Detect Violation                            | Must Have   | 2
12 | Analysis        | Correlate Monitoring Data                   | Should Have | 2
13 | Analysis        | Estimate Measure                            | Must Have   | 1,2,3
14 | Analysis        | Forecast Measure                            | Must Have   | 2,3
15 | Analysis        | Feedback Measure                            | Must Have   | 2
16 | Self-Adaptivity | Define/Undefine QoS Constraints             | Must Have   | 2
17 | Self-Adaptivity | Start/Stop Feedback of Self-Adaptivity Data | Could Have  | 3
18 | Self-Adaptivity | Define/Undefine
Cost Constraints | Must Have | 2,3
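To make the Estimate Measure and Forecast Measure scenarios scheduled above more concrete, the following is a minimal sketch of the kind of processing involved: windowed statistics (mean, percentile) of a metric over a reference time window, and a one-step-ahead black-box forecast via simple exponential smoothing. The names (`WindowedEstimator`, `forecast_ses`, the `alpha` parameter) and the nearest-rank percentile rule are illustrative assumptions, not part of the MODAClouds specification.

```python
from collections import deque
import math

class WindowedEstimator:
    """Illustrative estimator (assumed design, not the MODAClouds API):
    mean and percentiles of a metric over a sliding reference window,
    as in the Estimate Measure requirements."""

    def __init__(self, window_size):
        # deque with maxlen evicts the oldest sample automatically,
        # giving a fixed-size reference time window.
        self.samples = deque(maxlen=window_size)

    def observe(self, value):
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples)

    def percentile(self, p):
        # Nearest-rank percentile over the current window contents.
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]

def forecast_ses(history, alpha=0.5):
    """Black-box one-step-ahead forecast by simple exponential smoothing
    (one of many statistical methods the Forecast Measure use case allows)."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# Example: throughput samples pushed by a hypothetical data collector.
est = WindowedEstimator(window_size=5)
for v in [10, 12, 11, 13, 14, 100]:   # the last sample evicts the first
    est.observe(v)
print(est.mean())                      # → 30.0 (mean of the last 5 samples)
print(est.percentile(50))              # → 13 (median of the window)
print(forecast_ses([10, 12, 11, 13, 14]))  # → 13.0
```

In a real deployment these computations would run continuously on the input streams, with the timeout and error-isolation requirements listed in the Estimate/Forecast use cases applied around them.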
References

[Aba05] Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. The Design of the Borealis Stream Processing Engine. In Proc. Intl. Conf. on Innovative Data Systems Research (CIDR 2005).
[Aba05b] Abadi, D. J., Madden, S., and Lindner, W. REED: Robust, efficient filtering and event detection in sensor networks. In VLDB, 2005.
[Abh12] Abhishek, V., Kash, I., and Key, P. Fixed and market pricing for cloud services. International Conference on Computer Communications Workshops (INFOCOM WKSHPS).
[ABS1] AWS Elastic Beanstalk -- Developer Guide -- What Is AWS Elastic Beanstalk and Why Do I Need It? (accessed in 2013);
[ABS2] AWS Elastic Beanstalk -- Developer Guide -- How Does AWS Elastic Beanstalk Work? (accessed in 2013);
[ABS3] AWS Elastic Beanstalk -- Developer Guide -- Components (accessed in 2013);
[ABS4] AWS Elastic Beanstalk -- Developer Guide -- Managing and Configuring Applications and Environments Using the Console, CLI, and APIs (accessed in 2013);
[ABS5] AWS Elastic Beanstalk -- Developer Guide -- Customizing and Configuring AWS Elastic Beanstalk Environments (accessed in 2013);
[ACF1] AWS CloudFormation (accessed in 2013);
[ACF2] AWS CloudFormation -- User Guide (accessed in 2013);
[ACF3] AWS CloudFormation -- FAQ (accessed in 2013);
[ACF4] AWS CloudFormation -- Templates (accessed in 2013);
[Aga07] Agarwala, S., Alegre, F., Schwan, K., and Mehalingham, J. E2EProf: Automated End-to-End Performance Management for Enterprise Systems. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[Ala09] Al-Azzoni, I., and Down, D. Decentralized Load Balancing for Heterogeneous Grids. In Proceedings of the 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns (COMPUTATIONWORLD '09).
[Alt06] Altman, E., Boulogne, T., Azouzi, R., Jiménez, T., and Wynter, L.
A survey on networking games in telecommunications. Computers & Operation Research [ANA13] Ana Network architecture. Last visited on March 16, [APF2] AppFog Documentation -- Languages (accessed in 2013); [APF3] AppFog Documentation -- Services (accessed in 2013); [APF4] AppFog Documentation -- Feature Roadmap (accessed in 2013); [APF5] AppFog Documentation -- Tunneling (accessed in 2013); [Ara03] Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., and Widom, J. STREAM: The Stanford Stream Data Manager (Demonstration Description). In Proc. ACM Intl. Conf. on Management of data (SIGMOD 2003), page 665, [Ara06] Arasu, A., Babu, S., and Widom, J. The CQL Continuous Query Language: Semantic Foundations and Query Execution. The VLDB Journal, 15(2): , [Ard08] D. Ardagna, C. Ghezzi, R. Mirandola. Rethinking the use of models in software architecture. QoSA 2008 Proceedings, 1-27, Karlsruhe, Germany, October [Ard11] Ardagna, D, Casolari, S, Panicucci, B. Flexible distributed capacity allocation and load redirect algorithms for cloud systems. IEEE International Conference on Cloud Computing (CLOUD) Public Final version 1.0, March 29 th
[Ard11b] Ardagna, D., Casolari, S., and Panicucci, B. Flexible distributed capacity allocation and load redirect algorithms for cloud systems. IEEE International Conference on Cloud Computing (CLOUD), 2011.
[Ard11c] Ardagna, D., Panicucci, B., and Passacantando, M. A game theoretic formulation of the service provisioning problem in cloud systems. International Conference on World Wide Web.
[Ard12] Ardagna, D., Panicucci, B., and Passacantando, M. Generalized Nash equilibria for the service provisioning problem in cloud systems. IEEE Transactions on Services Computing.
[Ard12b] Ardagna, D., Casolari, S., Colajanni, M., and Panicucci, B. Dual Time-scale Distributed Capacity Allocation and Load Redirect Algorithms for Cloud Systems. Journal of Parallel and Distributed Computing, Elsevier, 72(6).
[Ard12c] Ardagna, D., Panicucci, B., Trubian, M., and Zhang, L. Energy-aware autonomic resource allocation in multi-tier virtualized environments. IEEE Transactions on Services Computing.
[AUT13] The autonomic Internet. Last visited on March 16, 2013.
[Bab01] Babu, S., and Widom, J. Continuous Queries over Data Streams. SIGMOD Rec., 30(3).
[Bai06] Bai, Y., Thakkar, H., Wang, H., Luo, C., and Zaniolo, C. A Data Stream Language and System Designed for Power and Extensibility. In Proc. Intl. Conf. on Information and Knowledge Management (CIKM 2006).
[Bal04] Balakrishnan, H., Balazinska, M., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Galvez, E., Salz, J., Stonebraker, M., Tatbul, N., Tibbetts, R., and Zdonik, S. Retrospective on Aurora. The VLDB Journal, 13(4).
[Bam99] Bamieh, B., and Giarré, L. Identification of linear parameter varying models. IEEE Conference on Decision and Control.
[Bar04] Barham, P., Donnelly, A., Isaacs, R., and Mortier, R. Using Magpie for request extraction and workload modelling. USENIX Symposium on Operating Systems Design & Implementation (OSDI).
[Bar10] Baresi, L., Caporuscio, M., Ghezzi, C., and Guinea, S. Model-Driven Management of Services.
Proceedings of the Eighth European Conference on Web Services (ECOWS). IEEE Computer Society, 2010.
[Bar12] Baresi, L., and Guinea, S. Event-based Multi-level Service Monitoring.
[Ben04] Bennani, M., and Menasce, D. Assessing the robustness of self-managing computer systems under highly variable workloads. International Conference on Autonomic Computing (ICAC).
[Ben05] Bennani, M., and Menasce, D. Resource allocation for autonomic data centers using analytic performance models. International Conference on Autonomic Computing (ICAC).
[Bjo12] Björkqvist, M., Chen, L., and Binder, W. Opportunistic service provisioning in the cloud. International Conference on Cloud Computing.
[Bla09] Blair, G., Bencomo, N., and France, R. Models@run.time. Computer, vol. 42, no. 10, Oct. 2009.
[Bra10] Brandic, I. FoSII Project: Autonomic Resource Management in Clouds Considering Cloud-based Resource Monitoring and Knowledge Management. Seoul National University, Seoul, South Korea, July 15th.
[Cal12] Calcavecchia, N., Caprarescu, B., Di Nitto, E., Dubois, D., and Petcu, D. DEPAS: A decentralized probabilistic algorithm for auto-scaling. Computing.
[Cap10] Caprarescu, B. A., Calcavecchia, N. M., Di Nitto, E., and Dubois, D. J. SOS Cloud: Self-organizing services in the cloud. In BIONETICS, pages 48-55.
[Car01] Carzaniga, A., Rosenblum, D. S., and Wolf, A. L. Design and Evaluation of a Wide-Area Event Notification Service. ACM Transactions on Computer Systems, vol. 19, no. 3, August 2001.
[Cas08a] Casale, G., Cremonesi, P., and Turrin, R. Robust workload estimation in queueing network performance models. Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP).
[Cas08b] Casale, G., Mi, N., Cherkasova, L., and Smirni, E. How to parameterize models with bursty workloads. ACM SIGMETRICS Performance Evaluation Review.
[CAS13] CASCADAS. Last visited on March 16, 2013.
[CDF1] Cloudify documentation -- Anatomy of a recipe (accessed in 2013);
[CDF2] Cloudify documentation -- Scaling rules (accessed in 2013);
[CDF3] Cloudify documentation -- Bootstrapping any cloud (accessed in 2013);
[CDF4] Cloudify documentation -- Application recipe (accessed in 2013);
[CDF5] Cloudify documentation -- Service recipe (accessed in 2013);
[CDF6] Cloudify documentation -- Configuring security (accessed in 2013);
[CDF7] Cloudify documentation -- Attributes API (accessed in 2013);
[CDF8] Cloudify documentation -- Custom commands (accessed in 2013);
[CDF9] Cloudify documentation -- Probes (accessed in 2013);
[CDF10] Cloudify documentation -- Cloud driver (accessed in 2013);
[CFY1] Cloud Foundry -- FAQ (accessed in 2013);
[CFY2] Cloud Foundry -- Services (accessed in 2013);
[CFY3] Cloud Foundry -- Frameworks (accessed in 2013);
[CFY4] Micro Cloud Foundry (accessed in 2013);
[Cha82] Chandy, K., and Neuse, D. Linearizer: A heuristic algorithm for queuing network models of computing systems. Communications of the ACM.
[Cha12] Rong Chang, editor. IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, USA, June 24-29, 2012. IEEE.
[Che00a] Cherkasova, L., DeSouza, M., and Ponnekanti, S. Performance Analysis of "Content-Aware" Load Balancing Strategy FLEX: Two Case Studies. In Proceedings of the Thirty-Fourth Hawaii International Conference on System Sciences (HICSS-34), Software Technology Track, January 3-6.
[Che00b] Chen, J., DeWitt, D. J., Tian, F., and Wang, Y. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proc. ACM Intl. Conf. on Management of Data (SIGMOD 2000).
[Che06] Tu, Y.-C., Liu, S., Prabhakar, S., and Yao, B. Load Shedding in Stream Databases: A Control-based Approach. In Proc. Intl. Conf. on Very Large Data Bases (VLDB 2006).
[Che08] Chen, Y., Iyer, S., Liu, X., Milojicic, D., and Sahai, A.
Translating Service Level Objectives to lower level policies for multi-tier services. Cluster Computing [Coh04] Cohen, I, Goldszmidt, M, Kelly, T, Symons, J, Chase, J. Correlating instrumentation data to system states: A building block for automated diagnosis and control. USENIX Symposium on Operating Systems Design & Implementation (OSDI) [Coh05] Cohen, I, Zhang, S, Goldszmidt, M, Symons, J, Kelly, T, Fox, A. Capturing, indexing, clustering, and retrieving system history. ACM symposium on Operating systems principles (SOSP) [Cor05]: Cormode, G., Garofalakis, M. N. Sketching streams through the net: Distributed approximate query tracking. In VLDB, 2005, pp [Cre02] Cremonesi, P, Schweitzer, P, Serazzi, G. A unifying framework for the approximate solution of closed multiclass queuing networks. IEEE Transactions on Computers Public Final version 1.0, March 29 th
103 [Cre10] Cremonesi, P, Dhyani, K, Sansottera, A. Service time estimation with a refinement enhanced hybrid clustering algorithm. International Conference on Analytical and Stochastic Modeling Techniques and Applications (ASMTA) [Cre12] Cremonesi, P, Sansottera, A. Indirect Estimation of service demands in the presence of structural changes. International Coference on Quantitative Evaluation of Systems (QEST) [Cza98]: Czajkowski, G., Eicken, T. V. JRes: A Resource Accounting Interface for Java. Proceedings of the 13th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, [Dea12] Dean, D, Nguyen, H, Gu, X. UBL: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. International Conference on Autonomic Computing (ICAC) [Des12] Desnoyers, P, Wood, T, Shenoy, P, Patil, S, Vin, H. Modellus: Automated modeling of complex data center applications. ACM Transactions on Internet Technology [Di12] Di, S, Kondo, D, Cirne, W. Host load prediction in a Google compute cloud with a Bayesian model. International Conference on High Performance Computing, Networking, Storage and Analysis (SC) [Dou12] Brian Dougherty, Jules White, and Douglas C. Schmidt. Model-driven auto-scaling of green cloud computing infrastructure. Future Generation Comp. Syst., 28(2): , [Dua09] Duan, S, Babu, S, Munagala, K. Fa: A system for automating failure diagnosis. IEEE International Conference on Data Engineering (ICDE) [Dub07] Parijat Dube, Zhen Liu, Laura Wynter, and Cathy H. Xia. Competitive equilibrium in e-commerce: Pricing and outsourcing. Computers & OR, 34(12): , [Dut10] Dutreilh, X, Rivierre, N, Moreau, A, Malenfant, J, Truck, I. From data center resource allocation to control theory and back. International Conference on Cloud Computing [Dut12] Sourav Dutta, Sankalp Gera, Akshat Verma, and Balaji Viswanathan. Smartscale: Automatic application scaling in enterprise clouds. 
In Chang [Cha12], pages [Eme12]: Emeakaroha, V. C., Ferreto, T. C., Netto, M. A. S., Brandic, I., Rose, De C. A. F. CASViD: Application Level Monitoring for SLA Violation Detection in Clouds [Fen11] Yuan Feng, Baochun Li, and Bo Li. Price competition in an oligopoly cloud market [Fer13] Fernandez,R.C., Migliavacca,M., Kalyvianaki,E., and Pietzuch,P. Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. In Sigmod, TO APPEAR [GAE1] GAE Documentation --- The Java Servlet Environment (accessed in 2013); [GAE2] GAE Documentation --- Quotas (accessed in 2013); [GAE3] GAE Documentation --- Backends and Java API Overview (accessed in 2013); [GAE4] GAE Documentation --- Datastore Overview (accessed in 2013); [GAE5] GAE Documentation --- Java Service APIs (accessed in 2013); [Gam12] Gambi, A, Toffetti, G. Modeling cloud performance with Kriging. International Conference on Software Engineering (ICSE) [Gia11] Giani, P, Tanelli, M, Lovera, M. Controller design and closed-loop stability analysis for admission control in Web service systems. World Congress [Gma07] Gmach, D, Rolia, J, Cherkasova, L, Kemper. A. Workload analysis and demand prediction of enterprise data center applications. International Symposium on Workload Characterization (IISWC) [Gou11] Hadi Goudarzi and Massoud Pedram. Multi-dimensional sla-based resource allo- cation for multi-tier cloud computing systems. In Liu and Parashar [Liu11], pages Public Final Version 1.0, March 29 th
104 [Gol03] Golab, L., DeHaan, D., Demaine, E. D., Lopez-Ortiz, A., and Munro, J. I. Identifying Frequent Items in Sliding Windows over On-line Packet Streams. In Proc. Intl. Conf. on Internet Measurement (IMC 2003), pages , [Gol08] Golab,L., Johnson,T., Koudas, N., Srivastava, D., and Toman. D., Optimizing Away Joins on Data Streams. In Proc. Intl. Workshop on Scalable Stream Processing System (SSPS 2008), pages 48 57, [Gol09]: Goldsack, P., Guijarro, J., Loughran, S., et al. The SmartFrog configuration management framework. ACM SIGOPS Oper. Syst. Rev., 2009, 43, pp [Gon11]: Gonzalez, J., Munoz, A., Mana, A. Multi-layer Monitoring for Cloud Computing. IEEE 13th International Symposium on High-Assurance Systems Engineering [GPA13] General purpose autonomic computing. Last visited on March 16, [GT13] G. Trends, "Results for "cloud computing", DOI: [Gul12] Gulisano, R. Jimenez-Peris, et al. StreamCloud: An Elastic and Scalable Data Streaming System. TPDS, 99(PP), [Had12] Makhlouf Hadji and Djamal Zeghlache. Minimum cost maximum flow algorithm for dynamic resource allocation in clouds. In Chang [Cha12], pages [Has11] Hassan, M, Song, B, Huh, E. Distributed resource allocation games in horizontal dynamic cloud federation platform. International Conference on High Performance Computing and Communications (HPCC) [Has12] Hassan, M, Hossain, M, Sarkar, A, Huh, E. Cooperative game-based distributed resource allocation in horizontal dynamic cloud federation platform. Information Systems Frontiers [He12] Ting He, Shiyao Chen, Hyoil Kim, Lang Tong, and Kang-Won Lee. Scheduling parallel tasks onto opportunistically available cloud resources. 
In Chang [Cha12], pages [HER1] Heroku DevCenter -- The Process Model (accessed in 2013); [HER2] Heroku DevCenter -- Dynos and the Dyno Manifold (accessed in 2013); [HER3] Heroku DevCenter -- Languages (accessed in 2013); [HER4] Heroku DevCenter -- Buildpacks (accessed in 2013); [HER5] Heroku DevCenter -- Scaling Your Process Formation (accessed in 2013); [HER6] Heroku Add-ons (accessed in 2013); [HER7] Heroku DevCenter -- HTTP Routing and the Routing Mesh (accessed in 2013); [HER8] Heroku DevCenter -- Slug Compiler (accessed in 2013); [HER9] Heroku API (accessed in 2013); [HER10] Heroku DevCenter -- Frequently Asked Questions about Java (accessed in 2013); [HER11] The Twelve-Factor App (accessed in 2013); [Hol09] Holub, V., Parsons, T., O'Sullivan, P., Murphy, J. Runtime correlation engine for system monitoring and testing. In ICAC-INDST '09 Proceedings of the 6th international conference industry session on Autonomic computing and communications industry session, pages 9-18, New York, NY, USA, ACM. [Hol10] Holze, M, Haschimi, A, Ritter, N. Towards workload-aware self-management: Predicting significant workload shifts. International Conference on Data Engineering Workshops (ICDEW) [Hol10b]: Holub, V., Parsons, T., O'Sullivan, P. Run-Time Correlation Engine for System Monitoring and Testing (RTCE) Public Final version 1.0, March 29 th
105 [Hue05]: Huebsch, R., Chun, B. N., Hellerstein, J. M., Loo, B. T., Maniatis, P., Roscoe, T., Shenker, S., Stoica, I., Yumerefendi, A. R. The architecture of pier: an internet-scale query processor. In CIDR, [Jag95] Jagadish, H. V., Mumick, I. S., and Silberschatz, A. View Maintenance Issues for the Chronicle Data Model. In Proc. ACM Symp. on Principles of Database Systems (PODS 1995), pages , [Jer97]: Jerding, D. F., Stasko, J. T., Ball, T. Visualizing Interactions in Program Executions. Proceedings of the International Conference on Software Engineering, [JUJ1] Frequently Asked Questions (accessed in 2013); [JUJ2] Juju Charms (accessed in 2013); [JUJ3] Juju Documentation (accessed in 2013); [JUJ4] Juju Documentation -- Getting started (accessed in 2013); [JUJ5] Juju Documentation -- User tutorial (accessed in 2013); [JUJ6] Juju Documentation -- Charms (accessed in 2013); [JUJ7] Juju Documentation -- Service configuration (accessed in 2013); [JUJ8] Juju Documentation -- Machine constraints (accessed in 2013); [JUJ9] Juju Documentation -- Operating systems (accessed in 2013); [Jun09] Jung, G, Joshi, K, Hiltunen, M, Schlichting, R, Pu, C. A Cost-sensitive adaptation engine for server consolidation of multitier applications. Middleware [Jun10] Jung, G, Hiltunen, M, Joshi, K, Schlichting, P, Pu, C. Mistral: Dynamically managing power, performance, and adaptation cost in cloud infrastructures. International Conference on Distributed Computing Systems (ICDCS) [Kal09] Kalyvianaki, E, Charalambous, T, Hand, S. Self-adaptive and self-configured CPU resource provisioning for virtualized servers using Kalman filters. International Conference on Autonomic Computing (ICAC) [Kal11] Kalbasi, A, Krishnamurthy, D, Rolia, J, Richter. MODE: mix driven on-line resource demand estimation. International Conference on Network and Services Management (CNSM) [Kal12] Kalbasi, A, Krishnamurthy, D, Rolia, J, Dawson, S. DEC: service demand estimation with confidence. 
IEEE Transactions on Software Engineering [Kar11] Kari, C.; Yoo-Ah Kim; Russell, A., "Data Migration in Heterogeneous Storage Systems," Distributed Computing Systems (ICDCS), st International Conference on, vol., no., pp.143,150, June [Kel79] Kelly, F. Reversibility and Stochastic Networks. Cambridge University Press [Kha12] Khan, A, Yan, X, Tao, S, Anerousis, N. Workload characterization and prediction in the cloud: A multiple time series approach. IEEE/IFIP International Workshop on Cloud Management (Cloudman) [Kir10]: Kirschnick, J., Calero, J. A. M., Wilcock, L., Edwards, N. Towards an architecture for the automated provisioning of cloud services. IEEE Commun. Mag., 2010, 48, (12), pp [Kle75] Kleinrock, L. Queueing Systems. Wiley-Interscience [Kon12] Kleopatra Konstanteli, Tommaso Cucinotta, Konstantinos Psychas, and Theodora A. Varvarigou. Admission control for elastic cloud services. In Chang [Cha12], pages [Kon12b]: König, B., Calero, J. A. M., Kirschnick, J. Elastic monitoring framework for cloud infrastructures. Communications, IET, vol. 6, num. 10, pp , July, [Law04] Law, Y.-N., Wang, H., and Zaniolo, C. Query Languages and Data Models for Database Sequences and Data Streams. In Proc. Intl. Conf. on Very Large Data Bases (VLDB 2004), pages , Public Final Version 1.0, March 29 th
106 [Law05] Law, Y.-N., and Zaniolo, C., An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. In Proc. Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), pages , [Lee99] Lee, L, Poolla, K. Identification of linear parameter-varying systems using nonlinear programming. Transactions-American Society Of Mechanical Engineers Journal Of Dynamic Systems Measurement And Control [Lim10] Lim, H, Babu, S, Chase, J. Automated control for elastic storage. International Conference on Autonomic Computing (ICAC) [Lin12] Yi-Kuei Lin and Ping-Chen Chang. Reliability evaluation of a computer network in cloud computing environment subject to maintenance budget. Applied Mathematics and Computation, 219(8): , [Lit05] Litoiu, M, Woodside, C, Zheng, T. Hierarchical model-based autonomic control of software systems. ACM SIGSOFT Software Engineering Notes [Liu99] Liu, L., Pu, C., and Tang W. Continual Queries for Internet Scale Event-Driven Information Delivery. IEEE Trans. Knowl. Data Eng., 11(4): , [Liu05] Liu, Y, Gorton, I, Fekete, K. Design-level performance prediction of component-based applications. IEEE Transactions on Software Engineering [Liu06] Liu, Z, Wynter, L, Xia, C, Zhang, F. Parameter inference of queueing models for IT systems using endto-end measurements. ACM SIGMETRICS Performance Evaluation Review [Liu10] Liu, T, Methapatara, C, Wynter, L. Revenue management model for on-demand it services. European Journal of Operational Research [Liu11] Ling Liu and Manish Parashar, editors. IEEE International Conference on Cloud Computing, CLOUD 2011, Washington, DC, USA, 4-9 July, IEEE, [Lov98] Lovera, M, Verhaegen, M, Chou, C. State space identification of MIMO linear parameter varying models. International Symposium on the Mathematical Theory of Networks and Systems [Lu02] Lu, C., Alvarez, G. A. and Wilkes, J Aqueduct: online data migration with performance guarantees. 
In Proceedings of the 1st USENIX conference on File and storage technologies (FAST'02). USENIX Association, Berkeley, CA, USA, [Lu03] Lu, Y, Abdelzaher, T, Lu, C, Sha, L, Liu, X. Feedback control with queueing-theoretic prediction for relative delay guarantees in web servers. IEEE Real-Time and Embedded Technology and Applications Symposium [Lu09] Lu, Y, AbouRizk, S. Automated Box Jenkins forecasting modelling. Automation in Construction [Mad02]: Madden, S., Franklin, M. J., Hellerstein, J. M., Hong, W. Tag: A tiny aggregation service for ad-hoc sensor networks. In OSDI, [Mal11] Malkowski, S, Hedwig, M, Li, J, Pu, C, Neumann, D. Automated control for elastic n-tier workloads based on empirical modeling. International Conference on Autonomic Computing (ICAC) [Mar11] Martin,A., Knauth,T., et al. Scalable and Low-Latency Data Processing with Stream MapReduce. In CLOUDCOM, [Mar12] Marek, L, Zheng, Y, Ansaloni, D, Sarimbekov, A, Binder, W, Tuma, P. Java bytecode instrumentation made easy: The DiSL framework for dynamic program analysis [Mas11]: Mastelic, T., Emeakaroha, V. C., Maurer, M., Brandic, I. M4CLOUD - Generic Application Level Monitoring For Resource-Shared Cloud Environments [Maz12] Michele Mazzucco and Dmytro Dyachuk. Optimizing cloud providers revenues via energy efficient server allocation. Sustainable Computing: Informatics and Systems, 2(1):1 12, [Men94] Capacity Planning and Performance Modeling: from mainframes to client-server systems, D. Menascé, V. Almeida, and L. Dowdy, Prentice Hall, [Men03] Menasce, D, Bennani, M. On the use of performance models to design self-managing computer systems. Computer Measurement Group Conference [Men05] Menasce, D, Bennani, M, Ruan, H. On the use of online analytic performance models in self-managing and self-organizing computer systems. Self-star Properties in Complex Information Systems Public Final version 1.0, March 29 th
[Men07] Menasce, D., Ruan, H., Gomaa, H. QoS management in service-oriented architecture. Performance Evaluation.
[Men08] Meng, S., Kashyap, S. R., Venkatramani, C., Liu, L. Resource-Aware Application State Monitoring (REMO). IEEE Transactions on Parallel and Distributed Systems.
[Men11] Ishai Menache, Asuman Ozdaglar, and Nahum Shimkin. Socially optimal pricing of cloud computing resources. In Proceedings of the 5th International ICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS '11. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium.
[Mey04] Meyerhöfer, M., Neumann, C. TESTEJB - A Measurement Framework for EJBs. In Proceedings of the 7th International Symposium on Component-Based Software Engineering (CBSE 2004), Edinburgh, UK, May 24-25, 2004.
[Mos02] Mos, A., Murphy, J. A framework for performance monitoring, modelling and prediction of component oriented distributed systems. In Proceedings of the 3rd International Workshop on Software and Performance (WOSP '02).
[MOS1] Dana Petcu, Ciprian Craciun, Massimiliano Rak. Towards a Cross Platform Cloud API -- Components for Cloud Federation. 2011.
[MOS2] Ciprian Craciun. Building blocks of scalable applications. Master's thesis. 2012.
[MOS3] mOSAIC notes -- Component controller (accessed in 2013).
[MOS4] mOSAIC notes -- Component hub (accessed in 2013).
[MOS5] mOSAIC BitBucket repositories (accessed in 2013).
[Mun07] Munagala, K., Srivastava, U., and Widom, J. Optimization of Continuous Queries with Shared Expensive Filters. In Proc. ACM Intl. Symp. on Principles of Database Systems (PODS 2007).
[Nash54] J. Nash. Non-cooperative games. The Annals of Mathematics, 54(2).
[Nee11] Neelakanta, G., Veeravalli, B. On the resource allocation and pricing strategies in compute clouds using bargaining approaches. International Conference on Networks (ICON).
[Nem95] Nemani, M., Ravikanth, R., Bamieh, B. Identification of linear parametrically varying systems. IEEE Conference on Decision and Control.
[Neu10] Neumeyer, L., Robbins, B., et al. S4: Distributed Stream Computing Platform. In ICDMW.
[Ope07] OpenSOA. Service Data Objects Specification.
[Pac08] Pacifici, G., Segmuller, W., Spreitzer, M., Tantawi, A. CPU demand for web serving: Measurement analysis and dynamic estimation. ACM SIGMETRICS Performance Evaluation Review.
[Par06] Parsons, T., Murphy, J. The 2nd International Middleware Doctoral Symposium: Detecting Performance Antipatterns in Component-Based Enterprise Systems. IEEE Distributed Systems Online, vol. 7, no. 3, March.
[Par07] Parsons, T. Automatic Detection of Performance Design and Deployment Antipatterns in Component Based Enterprise Systems. Ph.D. Thesis, University College Dublin, 2007.
[Par08] Parsons, T., Murphy, J. Detecting Performance Antipatterns in Component Based Enterprise Systems. Journal of Object Technology, vol. 7, no. 3.
[Pou10] Poussot-Vassal, C., Tanelli, M., Lovera, M. Linear parametrically varying MPC for combined quality of service and energy management in web service systems. American Control Conference.
[Pow05] Powers, R., Goldszmidt, M., Cohen, I. Short term performance forecasting in enterprise systems. International Conference on Knowledge Discovery and Data Mining (SIGKDD).
[PUP1] Puppet Labs (accessed in 2013).
[PUP2] Puppet Labs -- What is Puppet? (accessed in 2013).
[PUP3] Puppet Labs -- Big Picture (accessed in 2013).
[PUP4] Puppet Labs -- What is Puppet? (slides) (accessed in 2013).
[PUP5] Puppet Labs -- Glossary (accessed in 2013).
[PUP6] Puppet Labs -- Reference Manual (accessed in 2013).
[PUP7] Puppet Labs -- Tools (accessed in 2013).
[PUP8] Puppet Labs -- Exported Resources (accessed in 2013).
[PUP9] Puppet Labs -- Compare Puppet Enterprise (accessed in 2013).
[PUP10] Puppet Labs -- System Requirements (accessed in 2013).
[Ran06] P. Ranganathan, P. Leech, D. Irwin, and J. Chase. Ensemble-level Power Management for Dense Blade Servers. SIGARCH Comput. Archit. News, 34.
[Ris02] A. Riska. Aggregate Matrix-Analytic Techniques and their Applications. PhD thesis, Computer Science, College of William & Mary, Williamsburg, VA.
[Rol95] Rolia, J., Vetland, V. Parameter estimation for performance models of distributed application systems. Conference of the Centre for Advanced Studies on Collaborative Research (CASCON).
[Rol98] Rolia, J., Vetland, V. Correlating resource demand information with ARM data for application services. International Workshop on Software and Performance (WOSP).
[Rub] RUBiS: Rice University Bidding System.
[Sei87] Seidmann, A., Schweitzer, P., Shalev-Oren, S. Computerized closed queueing network models of flexible manufacturing systems. Large Scale Systems.
[Sha08] Sharma, A., Bhagwan, R., Choudhury, M., Golubchik, L., Govindan, R., Voelker, G. Automatic request categorization in internet services. ACM SIGMETRICS Performance Evaluation Review.
[Sha10] Shao, J., Wei, H., Wang, Q., Mei, H. A Runtime Model Based Monitoring Approach for Cloud (RMCM). IEEE 3rd International Conference on Cloud Computing.
[Shi06] Shivam, P., Babu, S., Chase, J. Learning application models for utility resource planning. International Conference on Autonomic Computing (ICAC).
[Son12] Yang Song, Murtaza Zafer, and Kang-Won Lee. Optimal bidding in spot instance market. In Albert G. Greenberg and Kazem Sohraby, editors, INFOCOM. IEEE.
[Spi11] Spinner, S. Evaluating approaches to resource demand estimation. Master's thesis, Karlsruhe Institute of Technology.
[Sri05] Srivastava, U., Munagala, K., Widom, J. Operator placement for in-network stream query processing. In PODS, 2005.
[Sri08] Shekhar Srikantaiah, Aman Kansal, and Feng Zhao. Energy aware consolidation for cloud computing. In Proceedings of the 2008 Conference on Power Aware Computing and Systems, HotPower '08, Berkeley, CA, USA. USENIX Association.
[Sut08] Sutton, C., Jordan, M. Probabilistic inference in queueing networks. In Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML).
[Tan12] Tan, Y., Nguyen, H., Shen, Z., Gu, X., Venkatramani, C., Rajan, D. PREPARE: Predictive performance anomaly prevention for virtualized cloud systems. International Conference on Distributed Computing Systems (ICDCS).
[Tes05] Tesauro, G., Das, R., Walsh, W., Kephart, J. Utility-function driven resource allocation in autonomic systems. International Conference on Autonomic Computing (ICAC).
[Tes06] Tesauro, G., Jong, N., Das, R., Bennani, M. A hybrid reinforcement learning approach to autonomic resource allocation. International Conference on Autonomic Computing (ICAC).
[The08] Thereska, E., Ganger, G. IRONModel: Robust performance models in the wild. ACM SIGMETRICS Performance Evaluation Review.
[Tia11] Fengguang Tian and Keke Chen. Towards optimal resource provisioning for running MapReduce programs in public clouds. In Liu and Parashar [Liu11].
[Tpc] Transaction Processing Performance Council. TPC-W.
[Tur07] Turnbull, J. Pulling Strings with Puppet. FirstPress, 2007, 1st edn.
[Twi13] Twitter Storm. github.com/nathanmarz/storm/wiki, 2013.
[Urg05] Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A. An analytical model for multitier internet services and its applications. ACM SIGMETRICS Performance Evaluation Review.
[Vaq08] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break in the clouds: towards a cloud definition. SIGCOMM Comput. Commun. Rev., 39(1):50-55.
[Val11] Giuseppe Valetto, Paul L. Snyder, Daniel J. Dubois, Elisabetta Di Nitto, and Nicolo Maria Calcavecchia. A self-organized load-balancing algorithm for overlay-based decentralized service networks. In SASO.
[Ven11] Venticinque, S., Di Martino, B., Petcu, D. Agent-based cloud provisioning and management, design and prototypal implementation. In F. Leymann, I. Ivanov, M. van Sinderen, and B. Shishkov, editors, 1st International Conference on Cloud Computing and Services Science (CLOSER 2011). SciTePress.
[Ver02] Verdult, V. Nonlinear system identification: A state-space approach. Ph.D. dissertation, Twente University Press.
[Ver07] Vercauteren, T., Aggarwal, P., Wang, X., Li, T. Hierarchical forecasting of web server workload using sequential Monte Carlo training. IEEE Transactions on Signal Processing.
[Wan03] W. Zhang and W. Zhang. Linux Virtual Server Clusters. Linux Magazine, November 2003.
[Wan05] Wang, X., Abraham, A., Smith, K. Intelligent web traffic mining and analysis. Journal of Network and Computer Applications.
[Wan12] Jian Wan, Dechuan Deng, and Congfeng Jiang. Non-cooperative gaming and bidding model based resource allocation in virtual machine environment. In IPDPS Workshops. IEEE Computer Society.
[Wan12b] Lijuan Wang and Jun Shen. Towards bio-inspired cost minimisation for data-intensive service provision. In Services Economics (SE), 2012 IEEE First International Conference on, pages 16-23, June 2012.
[WAZ1] Windows Azure Documentation -- Introducing Windows Azure (accessed in 2013).
[WAZ2] Windows Azure Documentation -- Windows Azure Execution Models (accessed in 2013).
[Wei10] Guiyi Wei, Athanasios V. Vasilakos, Yao Zheng, and Naixue Xiong. A game-theoretic method of fair resource allocation for cloud computing services. The Journal of Supercomputing, 54(2).
[Wik13] Wikipedia. Data Migration.
[Win09] Wingerden, J. Control of wind turbines with smart rotors: Proof of concept & LPV subspace identification. Ph.D. dissertation, Delft University of Technology.
[Woo95] Woodside, C., Neilson, J., Petriu, D., Majumdar, S. The stochastic rendezvous network model for performance of synchronous Client-Server-like distributed software. IEEE Transactions on Computers.
[Woo06] Woodside, C., Zheng, T., Litoiu, M. Service system resource management based on a tracked layered performance model. International Conference on Autonomic Computing (ICAC).
[Wu08] Wu, X., Woodside, M. A calibration framework for capturing and calibrating software performance models. European Performance Engineering Workshop on Computer Performance Engineering (EPEW).
[Wu10] Wu, Y., Hwang, K., Yuan, Y., Zheng, W. Adaptive workload prediction of grid performance in confidence windows. IEEE Transactions on Parallel and Distributed Systems.
[Wu12] Linlin Wu, Saurabh Kumar Garg, and Rajkumar Buyya. SLA-based admission control for a software-as-a-service provider in cloud computing environments. J. Comput. Syst. Sci., 78(5).
[Xia12] Z. Xiao, Q. Chen, and H. Luo. Automatic scaling of internet applications for cloud computing services. IEEE Transactions on Computers, PP(99):1.
[Xio11] PengCheng Xiong, Zhikui Wang, Simon Malkowski, Qingyang Wang, Deepal Jayasinghe, and Calton Pu. Economical and robust provisioning of n-tier cloud workloads: A multi-level control approach. In ICDCS. IEEE Computer Society.
[Xio13] Xiong, P., Pu, C., Zhu, X., Griffith, R. vPerfGuard: an automated model-driven framework for application performance diagnosis in consolidated cloud environments. International Conference on Performance Engineering (ICPE), 2013.
[Xu07] Xu, J., Zhao, M., Fortes, J., Carpenter, R., Yousif, M. On the use of fuzzy modeling in virtualized data center management. International Conference on Autonomic Computing (ICAC).
[Yal04] Yalagandula, P., Dahlin, M. A scalable distributed information management system. In SIGCOMM, 2004.
[Zaf12] Murtaza Zafer, Yang Song, and Kang-Won Lee. Optimal bids for spot VMs in a cloud for deadline constrained jobs. In Chang [Cha12].
[Zam12] Sharrukh Zaman and Daniel Grosu. An online mechanism for dynamic VM provisioning and allocation in clouds. In Chang [Cha12].
[Zha03] Zhang, G. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing.
[Zha07] Zhang, Q., Cherkasova, L., Smirni, E. A regression-based analytic model for dynamic resource provisioning of multi-tier applications. International Conference on Autonomic Computing (ICAC).
[Zhe05] Zheng, T., Yang, J., Woodside, M., Litoiu, M., Iszlai, G. Tracking time-varying parameters in software systems with extended Kalman filters. Conference of the Centre for Advanced Studies on Collaborative Research (CASCON).
[Zhe08] Zheng, T., Woodside, M., Litoiu, M. Performance model estimation and tracking using optimal filters. IEEE Transactions on Software Engineering.
Appendix A Run-time platform evaluation criteria

Although many of the surveyed (or other existing) solutions are production-ready --- some even backed by powerful companies in the IT sector --- and offer many features, we must focus our effort on determining whether they are a good match for the MODAClouds requirements, described in a later section. This goal implies two separate conditions: first, they should be suitable for our industrial partners' case study applications, which in turn implies matching the supported programming languages, the palette of available resources and middleware, and, not least, the security requirements; second, in order to fulfil our project's goal, they must provide a certain flexibility, allowing our run-time environment to integrate with them and to provide enhanced services and support for the user's application.

Therefore, we are especially interested in the following aspects:

type
One of the categories mentioned at the beginning of the section, broadly describing the purpose of the solution and the range of features it offers.
PaaS --- a fully integrated solution that abstracts away all low-level details of deployment and execution;
application execution --- suitable only for application execution, meaning that it does not manage the host environment it runs in (operating system, machine, etc.); classical examples are Tomcat and its derivatives, Ruby on Rails, etc.;
application deployment --- as above, suitable only for application deployment, implying that the environment must be provided by other means; classical examples are package managers, Capistrano for Ruby, etc.;
server deployment --- suitable for deploying the entire host environment, possibly even including the application deployment, but still requiring an application execution solution; classical examples are Chef or Puppet;
task automation --- low-level tools that, if required, would allow us to quickly implement our own solution fitting one of the above categories; classical examples are Ant for Java, Fabric for Python, etc.;
library --- the solution is actually a library to be used inside our programs; here we also include platforms or frameworks which, although more complex than libraries, are still used only to develop applications;
service --- solutions that are stand-alone services which on their own do not provide direct benefits, but which are either used as dependencies of our environment, or, if integrated, would provide added value to it and thus to our users (for example database systems, various middleware, logging or monitoring systems, SaaS, etc.);
standard --- although not a ready-to-use solution, this could be a protocol, data format, set of guidelines, or other kind of specification that could prove useful to implement or follow ourselves.

suitability
In short, how mature, or production-ready, is the solution, and does it have a supportive community built around it?
production --- mature and ready for production use;
emerging --- usually either a very popular solution, or one backed by a large company, but not yet reaching or surpassing beta status;
prototype --- maybe not the best solution to adopt, but it could have important features that we could leverage or re-implement;
legacy --- although not a choice for most new developments, it could prove important to address, because it either has a large deployment base or is mandated by one of the case studies.

application domain
What is the main flavour of the targeted applications?
web applications;
map-reduce applications;
generic compute-, data-, or network-intensive applications.

application architecture
The targeted application architecture, broadly.
2-tier applications --- monolithic applications that, besides the data storage or communication layer, have a single layer handling all the concerns from user interface to logic;
n-tier applications --- SOA-inspired applications where parts of the application are clearly identified as independent layers, and deployed accordingly.

application restrictions
What constraints would the application (and part of our run-time environment) be subjected to?
none --- the application is able to use all the features of the targeted programming language and framework, including full control over the run-time environment; moreover, the application is able to interact with other OS artifacts (file-system, processes, sockets, etc.); (e.g. Amazon Beanstalk;)
container --- as in the case of no restrictions, except that interactions with the run-time or the OS are limited;
limited --- the application is able to use only some features of the targeted language or framework, and most likely interactions with the run-time and the OS are limited (i.e. native libraries are forbidden, file-system access is restricted, etc.); (e.g. Google App Engine.)

programming languages
Self-explanatory.

programming frameworks
Some solutions target a particular framework (such as Servlets for Google App Engine's Java environment, or Capistrano, tightly focused on Ruby on Rails deployment). It therefore proves useful to know in advance which frameworks are officially sanctioned or preferred.

scalability
How can scalability be achieved?
automatic scalability --- based on user-defined policies, the platform is able to provision and commit new computing resources (i.e. the platform decides and executes);
manual scalability --- the user is able to control, via a high-level UI or CLI, the amount of provisioned and committed computing resources (i.e. the operator decides, the platform executes); this implies that the platform is able to provision new resources by itself;
passive scalability --- the platform itself is able to scale if computing resources are manually provided by the operator (i.e. the operator decides and executes, the platform only takes notice and reacts); this implies that the platform is not able to provision resources by itself.

session affinity
PaaS solutions usually offer HTTP request routers (or dispatchers); how do they load-balance clients between the multiple available service instances?
transparent --- the solution provides automatic session replication between the multiple instances (most likely through a shared database);
sticky-sessions --- all the requests originating from the same client are routed to the same instance;
non-deterministic --- (self-explanatory).

interaction
How can we pragmatically interact with the proposed solution?
WS (Web Service) --- the interaction can be made through HTTP calls (either SOAP+WSDL or RESTful); this implies that there is a public specification of such calls, or that they are easily reverse-engineered;
WUI (Web User Interface) --- although this interface is provided remotely through HTTP, it is suitable for human operators and cannot easily be consumed by an automated tool;
CLI (Command Line Interface) --- there are command-line tools that interact with the solution (most likely through HTTP or some form of RPC); this implies that the input / output formats are easy to parse by another tool and that, as above, a specification is available;
CUI (Console User Interface) --- the provided command-line tools are not suitable for being invoked by other tools, because, for example, the input / output are human-centric and difficult to parse;
API (Application Programmable Interface) --- the solution also provides a library that abstracts one of the previous interaction methods.

hosting type
How would we be able to use the proposed solution?
hosted --- the proper meaning of the term PaaS;
deployable (closed-source) --- available for deployment in a private cloud, but the code is closed-source;
deployable (open-source) --- available for deployment in a private cloud, and the code is available as open-source, thus enabling modifications;
simulated --- there is an option to deploy locally a similar solution for development and debugging purposes.

portability
If a developer uses a particular solution, how easy is it for him to move to another solution having the same role?
services locked --- moving to a different solution would require massive rewriting of the application;
portable --- possible with minor updates to the application;
out-of-the-box --- the solution uses existing standards, thus portability is guaranteed.

services
Especially in the case of PaaS, what additional resources or services (such as databases, middleware, etc.) are available and managed directly by the solution, and thus integrated with the application life-cycle?

monitoring coverage
Especially in the case of PaaS, how much do the monitoring facilities cover and expose to the operator?
none --- the solution provides no monitoring options (except maybe the listing of running processes, logging, etc.);
basic --- the usual information: CPU, memory and disk usage;
extensive --- many other metrics besides the ones above.

monitoring level
From which perspective, or at which level of the software and infrastructure stack, are the metrics provided?
application --- the data is collected from within the application itself (for example by using NewRelic, etc.);
container --- the data is collected from within the VM or the container; it may refer to the VM or container itself, or to the whole running application;
hypervisor --- the data is collected by the virtualization solution;
fabric --- the data is collected at the infrastructure layer (for example raw disks, load balancers, routers, switches, etc.).

monitoring interface
What technique --- standard, API, library, etc. --- is used to expose the monitoring information to the operator?

resource providers
Most PaaS solutions do not own hardware resources, but are instead built on top of other publicly accessible IaaS providers. Thus, if the user needs services not offered by the PaaS itself, he can use that IaaS to host the missing functionality himself.

multi-tenancy
This characteristic pertains mainly to PaaS or PaaS-like solutions, and tries to assess whether multiple applications can share the same instance of the PaaS.
single application --- the entire PaaS instance is dedicated to only one application (some deployable PaaS solutions fit into this category);
single organization --- the PaaS is able to host multiple independent applications, but they should belong to the same organization, mainly because the security model is restricted or the scheduling model assumes fair behaviour (almost all other deployable PaaS solutions fit into this category);
multiple organizations --- the PaaS is shared between multiple parties, each possibly with multiple applications (all hosted PaaS's fit into this category).

resource sharing
This characteristic pertains mainly to PaaS or PaaS-like solutions, and tries to assess how the application's components or services are mapped onto the provisioned VMs.
1:1 --- each component or service (from each application, where applicable) is deployed on its own VM; such a usage pattern better fits heavy-weight applications that have few component or service types featuring constantly high load; one instance thus does not interfere with another through shared resource consumption;
n:1 --- more than one component or service (potentially from different applications, in case of multi-tenancy) can be deployed on the same VM, thus sharing its resources; this usage pattern allows cost savings, especially in development or initial deployments, until the product gains traction and increased load, at which point a 1:1 pattern would prove more efficient.

limitations
Most of the solutions impose quantitative limitations (such as memory, bandwidth, storage, etc.) on the running applications, which could be of interest especially in determining the suitability for our case studies.

We should observe that not all of these properties or capabilities apply to all the surveyed solutions.
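To make the comparison systematic, the criteria above can be captured as a machine-checkable profile per solution and matched against the case studies' requirements. The following Python sketch illustrates the idea; the solution names, attribute values, and the `matches` helper are hypothetical examples, not descriptions of any actual surveyed platform:

```python
# Sketch: encoding the evaluation criteria as per-solution profiles and
# filtering candidate platforms against case-study requirements.
# All solution names and attribute values below are hypothetical.

profiles = {
    "ExamplePaaS": {
        "type": "PaaS",
        "suitability": "production",
        "languages": {"java", "python"},
        "restrictions": "limited",
        "scalability": "automatic",
        "session_affinity": "sticky-sessions",
        "hosting": "hosted",
        "multi_tenancy": "multiple organizations",
    },
    "ExampleDeployer": {
        "type": "server deployment",
        "suitability": "production",
        "languages": {"java", "ruby", "python"},
        "restrictions": "none",
        "scalability": "passive",
        "session_affinity": "non-deterministic",
        "hosting": "deployable (open-source)",
        "multi_tenancy": "single organization",
    },
}

def matches(profile, requirements):
    """A profile matches if every required value is met; set-valued
    attributes (e.g. languages) must contain all required items."""
    for key, wanted in requirements.items():
        have = profile.get(key)
        if isinstance(wanted, set):
            if not wanted <= (have or set()):
                return False
        elif have != wanted:
            return False
    return True

# A case study needing a production-ready, Java-capable platform
# that provisions resources by itself:
requirements = {
    "suitability": "production",
    "languages": {"java"},
    "scalability": "automatic",
}
candidates = [name for name, p in profiles.items()
              if matches(p, requirements)]
print(candidates)  # only ExamplePaaS satisfies all three requirements
```

Encoding the survey this way keeps the selection reproducible: when a requirement changes (for example, a case study adds a new language), re-running the filter immediately shows which platforms remain viable.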
PaaS - Platform as a Service Google App Engine
PaaS - Platform as a Service Google App Engine Pelle Jakovits 14 April, 2015, Tartu Outline Introduction to PaaS Google Cloud Google AppEngine DEMO - Creating applications Available Google Services Costs
DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2
DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing Slide 1 Slide 3 A style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
An enterprise- grade cloud management platform that enables on- demand, self- service IT operating models for Global 2000 enterprises
agility PLATFORM Product Whitepaper An enterprise- grade cloud management platform that enables on- demand, self- service IT operating models for Global 2000 enterprises ServiceMesh 233 Wilshire Blvd,
Logentries Insights: The State of Log Management & Analytics for AWS
Logentries Insights: The State of Log Management & Analytics for AWS Trevor Parsons Ph.D Co-founder & Chief Scientist Logentries 1 1. Introduction The Log Management industry was traditionally driven by
Ensuring High Service Levels for Public Cloud Deployments Keys to Effective Service Management
Ensuring High Service Levels for Public Cloud Deployments Keys to Effective Service Management Table of Contents Executive Summary... 3 Introduction: Cloud Deployment Models... 3 Private Clouds...3 Public
PROSPHERE: DEPLOYMENT IN A VITUALIZED ENVIRONMENT
White Paper PROSPHERE: DEPLOYMENT IN A VITUALIZED ENVIRONMENT Abstract This white paper examines the deployment considerations for ProSphere, the next generation of Storage Resource Management (SRM) from
Cloud Federations in Contrail
Cloud Federations in Contrail Emanuele Carlini 1,3, Massimo Coppola 1, Patrizio Dazzi 1, Laura Ricci 1,2, GiacomoRighetti 1,2 " 1 - CNR - ISTI, Pisa, Italy" 2 - University of Pisa, C.S. Dept" 3 - IMT Lucca,
Introduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
How AWS Pricing Works May 2015
How AWS Pricing Works May 2015 (Please consult http://aws.amazon.com/whitepapers/ for the latest version of this paper) Page 1 of 15 Table of Contents Table of Contents... 2 Abstract... 3 Introduction...
Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida
Amazon Web Services Primer William Strickland COP 6938 Fall 2012 University of Central Florida AWS Overview Amazon Web Services (AWS) is a collection of varying remote computing provided by Amazon.com.
Scalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
Apache Stratos Building a PaaS using OSGi and Equinox. Paul Fremantle CTO and Co- Founder, WSO2 CommiCer, Apache Stratos
Apache Stratos Building a PaaS using OSGi and Equinox Paul Fremantle CTO and Co- Founder, WSO2 CommiCer, Apache Stratos @pzfreo #wso2 #apache [email protected] [email protected] 1 About me CTO and Co- Founder
CHAPTER 2 THEORETICAL FOUNDATION
CHAPTER 2 THEORETICAL FOUNDATION 2.1 Theoretical Foundation Cloud computing has become the recent trends in nowadays computing technology world. In order to understand the concept of cloud, people should
WHY SERVICE PROVIDERS NEED A CARRIER PaaS SOLUTION cpaas for Network
WHY SERVICE PROVIDERS NEED A CARRIER PaaS SOLUTION cpaas for Network Functions Virtualization White Paper Carrier PaaS provides the tools service providers need to transform their current network operational
Cloud computing - Architecting in the cloud
Cloud computing - Architecting in the cloud [email protected] 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices
Cloud Computing: Meet the Players. Performance Analysis of Cloud Providers
BASEL UNIVERSITY COMPUTER SCIENCE DEPARTMENT Cloud Computing: Meet the Players. Performance Analysis of Cloud Providers Distributed Information Systems (CS341/HS2010) Report based on D.Kassman, T.Kraska,
Last time. Today. IaaS Providers. Amazon Web Services, overview
Last time General overview, motivation, expected outcomes, other formalities, etc. Please register for course Online (if possible), or talk to CS secretaries Course evaluation forgotten Please assign one
Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings
Solution Brief Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings Introduction Accelerating time to market, increasing IT agility to enable business strategies, and improving
SeaClouds Project D6.2 - Case Study test-beds and key features mapping
SeaClouds Project D6.2 - Case Study test-beds and key features mapping Project Acronym Project Title Call identifier Grant agreement no. 610531 Start Date 1 st October 2013 Ending Date 31 st March 2016
VMware vrealize Automation
VMware vrealize Automation Reference Architecture Version 6.0 and Higher T E C H N I C A L W H I T E P A P E R Table of Contents Overview... 4 What s New... 4 Initial Deployment Recommendations... 4 General
Performance Management for Cloudbased STC 2012
Performance Management for Cloudbased Applications STC 2012 1 Agenda Context Problem Statement Cloud Architecture Need for Performance in Cloud Performance Challenges in Cloud Generic IaaS / PaaS / SaaS
Networks and Services
Networks and Services Dr. Mohamed Abdelwahab Saleh IET-Networks, GUC Fall 2015 TOC 1 Infrastructure as a Service 2 Platform as a Service 3 Software as a Service Infrastructure as a Service Definition Infrastructure
WHITE PAPER OCTOBER 2014. CA Unified Infrastructure Management: Solution Architecture
WHITE PAPER OCTOBER 2014 CA Unified Infrastructure Management: Solution Architecture 2 WHITE PAPER: CA UNIFIED INFRASTRUCTURE MANAGEMENT: SOLUTION ARCHITECTURE ca.com Table of Contents Introduction 3 The
Research Paper Available online at: www.ijarcsse.com A COMPARATIVE STUDY OF CLOUD COMPUTING SERVICE PROVIDERS
Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A COMPARATIVE STUDY OF CLOUD
Overview. The Cloud. Characteristics and usage of the cloud Realities and risks of the cloud
Overview The purpose of this paper is to introduce the reader to the basics of cloud computing or the cloud with the aim of introducing the following aspects: Characteristics and usage of the cloud Realities
Architecture Statement
Architecture Statement Secure, cloud-based workflow, alert, and notification platform built on top of Amazon Web Services (AWS) 2016 Primex Wireless, Inc. The Primex logo is a registered trademark of Primex
CA ARCserve Replication and High Availability Deployment Options for Hyper-V
Solution Brief: CA ARCserve R16.5 Complexity ate my budget CA ARCserve Replication and High Availability Deployment Options for Hyper-V Adding value to your Hyper-V environment Overview Server virtualization
Sentinet for BizTalk Server SENTINET
Sentinet for BizTalk Server SENTINET Sentinet for BizTalk Server 1 Contents Introduction... 2 Sentinet Benefits... 3 SOA and APIs Repository... 4 Security... 4 Mediation and Virtualization... 5 Authentication
Running Oracle Applications on AWS
Running Oracle Applications on AWS Bharath Terala Sr. Principal Consultant Apps Associates LLC June 09, 2014 Copyright 2014. Apps Associates LLC. 1 Agenda About the Presenter About Apps Associates LLC
Platform as a Service (PaaS) Demystified
A P P L I C A T I O N S A WHITE PAPER SERIES IN THE FOLLOWING PAGES, WE WILL DISCUSS THE VARIOUS IMPLEMENTATIONS OF PAAS AND HOW THE RIGHT OPTION WILL ENSURE PAAS SUCCESS. WE WILL ALSO DISCUSS THE BENEFITS
PLATFORM-AS-A-SERVICE: ADOPTION, STRATEGY, PLANNING AND IMPLEMENTATION
PLATFORM-AS-A-SERVICE: ADOPTION, STRATEGY, PLANNING AND IMPLEMENTATION White Paper May 2012 Abstract Whether enterprises choose to use private, public or hybrid clouds, the availability of a broad range
OPEN DATA CENTER ALLIANCE Usage Model: Guide to Interoperability Across Clouds
sm OPEN DATA CENTER ALLIANCE Usage Model: Guide to Interoperability Across Clouds SM Table of Contents Legal Notice... 3 Executive Summary... 4 Purpose... 5 Overview... 5 Interoperability... 6 Service
VMware vcloud Automation Center 6.1
VMware vcloud Automation Center 6.1 Reference Architecture T E C H N I C A L W H I T E P A P E R Table of Contents Overview... 4 What s New... 4 Initial Deployment Recommendations... 4 General Recommendations...
Why Does CA Platform Use OpenShift?
Why Does CA Platform Use OpenShift? The Problem Let s consider an application with a back-end web service. HTTP The service could be Tomcat serving HTML, Jetty serving OData, Node.js serving plain REST
STeP-IN SUMMIT 2013. June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)
10 th International Conference on Software Testing June 18 21, 2013 at Bangalore, INDIA by Sowmya Krishnan, Senior Software QA Engineer, Citrix Copyright: STeP-IN Forum and Quality Solutions for Information
MODACloudML development Initial version
Grant Agreement N FP7-318484 Title: Authors: Editor: Reviewers: Identifier: Nature: MODACloudML development Initial version Marcos Almeida (SOFTEAM), Danilo Ardagna (POLIMI), Elisabetta Di Nitto (POLIMI),
Cloud Service Model. Selecting a cloud service model. Different cloud service models within the enterprise
Cloud Service Model Selecting a cloud service model Different cloud service models within the enterprise Single cloud provider AWS for IaaS Azure for PaaS Force fit all solutions into the cloud service
Amazon Web Services. 18.11.2015 Yu Xiao
Amazon Web Services 18.11.2015 Yu Xiao Agenda Introduction to Amazon Web Services(AWS) 7 Steps to Select the Right Architecture for Your Web Applications Private, Public or Hybrid Cloud? AWS Case Study
Cloud Computing An Introduction
Cloud Computing An Introduction Distributed Systems Sistemi Distribuiti Andrea Omicini [email protected] Dipartimento di Informatica Scienza e Ingegneria (DISI) Alma Mater Studiorum Università di
WHITE PAPER June 2014. CA Nimsoft Monitor. Delivering a Unified Monitoring Architecture
WHITE PAPER June 2014 CA Nimsoft Monitor Delivering a Unified Monitoring Architecture 2 White Paper: CA Nimsoft Monitor: Delivering a Unified Monitoring Architecture ca.com Table of Contents Executive
zen Platform technical white paper
zen Platform technical white paper The zen Platform as Strategic Business Platform The increasing use of application servers as standard paradigm for the development of business critical applications meant
Assignment # 1 (Cloud Computing Security)
Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual
Dynamic Deployment and Scalability for the Cloud. Jerome Bernard Director, EMEA Operations Elastic Grid, LLC.
Dynamic Deployment and Scalability for the Cloud Jerome Bernard Director, EMEA Operations Elastic Grid, LLC. Speaker s qualifications Jerome Bernard is a committer on Rio, Typica, JiBX and co-founder of
Cloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH
Cloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH CONTENTS Introduction... 4 System Components... 4 OpenNebula Cloud Management Toolkit... 4 VMware
VMware vrealize Automation
VMware vrealize Automation Reference Architecture Version 6.0 or Later T E C H N I C A L W H I T E P A P E R J U N E 2 0 1 5 V E R S I O N 1. 5 Table of Contents Overview... 4 What s New... 4 Initial Deployment
Platforms in the Cloud
Platforms in the Cloud Where Will Your Next Application Run? Jazoon, Zurich June 2011 Copyright 2011 Chappell & Associates An Organization without Cloud Computing Users A A A VM VM VM A A A Application
Building Platform as a Service for Scientific Applications
Building Platform as a Service for Scientific Applications Moustafa AbdelBaky [email protected] Rutgers Discovery Informa=cs Ins=tute (RDI 2 ) The NSF Cloud and Autonomic Compu=ng Center Department
PERFORMANCE ANALYSIS OF PaaS CLOUD COMPUTING SYSTEM
PERFORMANCE ANALYSIS OF PaaS CLOUD COMPUTING SYSTEM Akmal Basha 1 Krishna Sagar 2 1 PG Student,Department of Computer Science and Engineering, Madanapalle Institute of Technology & Science, India. 2 Associate
