Project Acronym: OPTIMIS
Project Title: Optimized Infrastructure Services
Project Number: 257115
Instrument: Integrated Project
Thematic Priority: ICT-2009.1.2 Internet of Services, Software and Virtualisation

The OPTIMIS Project: Optimized Infrastructure Services, Scientific Results

Publication Date: 28/06/2013
Start Date of Project: 01/06/2010
Duration of Project: 36 months
Authors: The OPTIMIS consortium (www.optimis-project.eu, www.optimistoolkit.com)
Editors: Johan Tordsson (UMU), Ahmed Ali-Eldin (UMU), Daniel Espling (UMU), Karim Djemame
Table of Contents

1 Introduction
2 The OPTIMIS Vision
3 OPTIMIS Toolkit Overview
4 OPTIMIS Toolkit Scientific Outcomes
  4.1 Service Construction
    4.1.1 Programming model
    4.1.2 Integrated Development Environment
    4.1.3 Image Construction Service
    4.1.4 Service manifest
  4.2 Service Deployment
    4.2.1 Service Deployment Optimizer
    4.2.2 Admission Control
    4.2.3 Virtual Machine Contextualizer
    4.2.4 Data Manager
    4.2.5 Cloud Quality of Service
    4.2.6 License Management
  4.3 Service Operation
    4.3.1 Cloud Optimizer
    4.3.2 Virtual Machine Manager
    4.3.3 Elasticity Engine
    4.3.4 Fault Tolerance Engine
    4.3.5 Service Manager
  4.4 Fundamental Tools
    4.4.1 Monitoring Infrastructure
    4.4.2 Trust
    4.4.3 Risk
    4.4.4 Eco-efficiency
    4.4.5 Cost
    4.4.6 Security
5 Conclusions

Index of Figures

Figure 1 - OPTIMIS Toolkit overview
Figure 2 - Image Creation Service Architecture
Figure 3 - Service manifest structure
Figure 4 - SDO architecture overview
Figure 5 - Execution time for Permutation algorithm
Figure 6 - Execution time for approximation algorithms
Figure 7 - Validation of Arsys trace
Figure 8 - Deployment of a trusted instance in the cloud
Figure 9 - Secure license delegation mechanism
Figure 10 - Frequency histogram of SLA fulfillment
Figure 11 - Comparing power consumption
Figure 12 - System model for elasticity controller design
Figure 13 - Overview of Workload Analysis and Classification (WAC) tool
Figure 14 - Two-tier architecture of the ICVPN
Figure 15 - Mean latency for wide-area ICVPN deployment
Figure 16 - Mean throughput for wide-area ICVPN deployment

Index of Tables

Table 1 - Performance results for all algorithms
Table 2 - Controller performance for the FIFA world cup traces
Table 3 - Types of risk
1 Introduction

The OPTIMIS project has an ambitious research agenda with diverse topics, ranging from low-level concepts such as hardware fault tolerance and Virtual Machine (VM) configuration to higher-level concepts such as assessments of cloud operational costs, provider reputation, and risk of Service Level Agreement (SLA) violations. The main result of the OPTIMIS project is the OPTIMIS Toolkit, a set of software tools for cloud providers and users that can be used to incarnate various multi-cloud architectures, optimize the cloud service lifecycle, simplify cloud self-management, etc. In the following, the core concepts and ideas of OPTIMIS are summarized and the toolkit is outlined. The main part of this report is a summary of the scientific work performed as part of the development and validation of the toolkit. Additional information about the OPTIMIS Toolkit can be found in project deliverables D1.1.1.3, D1.2.1.3, and D1.2.2.3, describing the requirements engineering process, the high-level architecture, and the detailed design of the toolkit, respectively. Full details about the scientific results of the project can be found in the accompanying documentation in D6.1.3.3.

2 The OPTIMIS Vision

The main stakeholders in OPTIMIS are Infrastructure Providers (IPs) and Service Providers (SPs):

Service providers offer economically efficient services using hardware resources provisioned by infrastructure providers. SPs participate in all phases of the service lifecycle, by implementing the service, deploying it, and overseeing its operation.

Infrastructure providers offer the physical infrastructure resources required for hosting services. Their goal is typically to maximize profit by making efficient use of the infrastructure and by possibly outsourcing partial workloads to partner IPs. The application logic of a service is transparent to the IPs, which instead use VMs as black-box management units.

The element of focus in OPTIMIS is the service. A service can provide any sort of functionality to an end-user. This functionality is delivered by one or more VMs, each of which runs applications that deliver the service logic. Each service is associated with a number of functional and non-functional requirements specified in a service manifest. The functional requirements include the requested performance profile, both in terms of hardware configuration, e.g., amount of physical memory and CPU characteristics for VMs, as well as application-level Key Performance Indicators (KPIs) such as transaction rate, number of concurrent users, and response time, which are used for elastic auto-scaling. The non-functional requirements include, e.g., service and security level policies and ecological profile. All of these requirements in the manifest configure the service for deployment and operation. The service manifest is thus the principal means of interaction between the SP and the IP.

The scientific ambitions of the OPTIMIS project can be summarized in five innovations:

1. Optimized Service Construction, Deployment, and Execution. The OPTIMIS Toolkit makes it simpler for SPs to develop new services, make informed deployment decisions for them, and monitor their execution. Similarly, IPs can use OPTIMIS tools to decide whether to accept additional services and to optimize provisioning for the already hosted ones.

2. Dependable Sociability (TREC). The OPTIMIS TREC tools allow SPs and IPs to base all decision making not only on lower-level functional properties, but also on non-functional factors relating to trust (including reputation), risk (likelihood-consequence analysis for
valuable assets), eco-efficiency (power consumption and ecological factors), and cost (economic models for service and infrastructure provisioning expenses).

3. Adaptive and Eco-Aware Self-Preservation. The OPTIMIS Toolkit enables dynamic and proactive management of cloud infrastructures and services. This ensures adaptability, reliability, and scalability of infrastructures that handle predicted and unforeseen changes in services and continuously optimize operation.

4. Provisioning on Multi-Clouds, Federations, etc. With the OPTIMIS Toolkit, composing cloud providers in novel and complex ways is significantly simplified. The toolkit can be used to instantiate private clouds, set up multi-clouds, enable cloud bursting, have IPs collaborate in federations, and even support cloud brokering (reselling of IP capacity).

5. Cloud-nomics (business and legal contributions). This non-technical research direction of the OPTIMIS project investigates legal aspects of cloud usage, including data legislation issues and regulatory aspects of cloud operation. Another topic of interest is prediction of the evolution of the cloud market. The results of this work have greatly influenced the design of the toolkit.

3 OPTIMIS Toolkit Overview

For readability purposes, this section briefly introduces the OPTIMIS Toolkit, which can be grouped into three main toolsets, illustrated in Figure 1.

Figure 1 - OPTIMIS Toolkit overview.

The Basic Toolkit is a set of fundamental tools for monitoring and assessing cloud services and infrastructures, as well as for interconnecting these securely. The Monitoring infrastructure allows the runtime state of physical infrastructure, virtual infrastructure, and applications to be captured, stored, and analyzed. The assessment tools for the TREC factors allow SPs and IPs to take self-* management decisions based on these non-functional properties:

- The Trust framework enables SPs and IPs to assess each other's reputation prior to engaging in a business relationship.
- The Risk management tools allow SPs and IPs to reason about the assets of service deployment and operation, the risk factors associated with these, and to estimate the potential consequences.
- The Eco-efficiency tools enable an IP to assess power consumption, carbon emissions, etc. in order to achieve certain ecological goals.
- The Cost tools can be used to assess and predict service operation costs, from both an SP and an IP perspective.

The Security framework provides a set of access and credential management capabilities that facilitate the interconnection of multiple clouds, as well as services that operate across these.

The SP tools allow OPTIMIS SPs to create, deploy, and run services. The OPTIMIS Programming Model simplifies the construction of service applications that orchestrate existing services, new code, and remote method invocations. The Image Creation Service allows the construction of VM images that embed the applications (developed with the programming model). The Integrated Development Environment provides a user-friendly way to implement and prepare services for deployment, using the programming model and the image creation service. The License Management tools allow SPs to incorporate license-protected software in services, and also provide the runtime support needed to manage software licenses for services deployed in remote IPs.

The Service Deployment Optimizer (SDO) coordinates the service deployment process; it ranks and selects suitable IPs, negotiates SLAs for the service, prepares the VM images and transfers them, and ensures that the service is launched properly. The VM Contextualization tools prepare VM images with the information needed for services to self-contextualize once launched, by attaching boot devices to the images. This can be virtually any data; propagation of network parameters, security credentials, etc. is one common usage area. The Service Manager extracts from the monitoring system the runtime state of deployed services, allowing the SP to keep track of its services and manage them, e.g., migrate a service upon unacceptable behavior.

The IP tools allow an OPTIMIS IP to optimize the provisioning of services according to its Business-Level Objectives (BLOs). The Admission Control component is used to determine whether or not to accept a new service request, thus handling the tradeoff between increased revenue and increased workload (with potential SLA violations as a consequence). The SLA management tools, CloudQoS, are used to model, negotiate, and monitor SLAs between the IP and the SPs whose services the IP runs. The Data Manager provides on-demand storage clusters linked to the services, thus acting as a data repository and provider of shared storage space. It also provides mechanisms to transfer service data (most notably, VM images) between the SP and the IP(s). The VM Manager handles the VM lifecycle (launches VMs, monitors VMs, shuts VMs down) and performs VM placement in order to maximize the utility of the IP's infrastructure. The Elasticity Engine proposes adjustments to the capacity (in terms of number of VMs) allocated to a service in order to meet SLAs upon quick changes in the workload of the VMs that constitute the service.
The Fault Tolerance Engine provides a VM re-starting service, thus contributing to a self-healing cloud infrastructure. The Cloud Optimizer combines the monitoring system, the IP-level TREC assessment tools, and the above-listed management engines to create a self-managed cloud infrastructure driven by BLOs.

4 OPTIMIS Toolkit Scientific Outcomes

The scientific results section is organized according to the cloud service lifecycle: service construction, service deployment, and service operation. These parts are followed by a section about the fundamental tools, including monitoring and TREC assessment.

4.1 Service Construction

4.1.1 Programming model

The main goal of the programming model defined in OPTIMIS is to enable the creation of multiple services and their execution on a dynamic pool of resources. These services offer web methods named Orchestration Elements (OEs), which are compositions of Core Elements (CEs): new services developed by the Service Developer, existing services already published by other parties, and pieces of code that are not intended to be published as services. The programming model is in charge of detecting the CEs that compose an OE and orchestrating their execution on the resources of the pool, taking into account the data dependencies between them, their constraints, and the features of each resource.

The design of the programming model is driven by a set of design principles or requirements that we have identified as key:

- Transparency of the infrastructure: primitives, directives, or calls that require any specific knowledge about the cloud are omitted from the programming model.
- Use of well-known existing programming languages with an easy programming paradigm (sequential programming), leading to a short learning curve.

The programming model is not only a language to declare service compositions; it also contains a runtime that is able to orchestrate the defined composition in the cloud, controlling dependencies between core elements. From the execution of the defined service composition, the runtime creates a workflow that describes the dependencies between the core elements that form the composition. In this way, a core element is not invoked until all its preceding dependencies are resolved (preceding CEs have been executed). These dependencies between services are detected by analyzing the objects or files used in each service invocation as callee, input parameter, or return value. This workflow is the basic structure for achieving a correct orchestration: parts of the workflow that do not have dependencies can be invoked in parallel.

Using the programming model, developers need only write sequential Java code and identify which parts of the code will run remotely. The Programming Model Runtime will modify the code in order to detect the different core elements that compose the service and orchestrate their execution.
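To illustrate the style of development this enables, the following is a minimal hypothetical sketch in the spirit of the model; the @CoreElement annotation and all class and method names are illustrative, not the actual OPTIMIS API. The developer writes plain sequential Java, and the runtime builds the dependency workflow from parameters and return values:

```java
// Hypothetical sketch in the spirit of the OPTIMIS programming model; the
// @CoreElement annotation and class names are illustrative, not the real API.
public class RenderService {

    @interface CoreElement {}   // stand-in marker for "runs remotely"

    // Core Element: ordinary Java; the runtime decides where it executes.
    @CoreElement
    static double renderFrame(double[] scene, int frameNo) {
        double sum = 0;
        for (double v : scene) sum += v * frameNo;   // placeholder computation
        return sum;
    }

    // Core Element wrapping, e.g., an existing service or legacy binary.
    @CoreElement
    static double encode(double[] frames) {
        double total = 0;
        for (double f : frames) total += f;
        return total;
    }

    // Orchestration Element: plain sequential code, no cloud-specific calls.
    public static double renderMovie(double[] scene, int frames) {
        double[] rendered = new double[frames];
        for (int i = 0; i < frames; i++) {
            // No data dependency between iterations, so the runtime is free
            // to run all renderFrame invocations in parallel on pool resources.
            rendered[i] = renderFrame(scene, i);
        }
        // encode consumes every rendered frame; the runtime-built workflow
        // delays it until all preceding Core Elements have finished.
        return encode(rendered);
    }
}
```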
4.1.2 Integrated Development Environment

The Integrated Development Environment (IDE) is provided as the main tool for Service Developers to create services for the OPTIMIS framework. The IDE provides a Graphical User Interface that assists service developers in implementing their own services from scratch, or in creating service compositions from existing services or legacy code using the OPTIMIS Programming Model. Once the service is implemented, the IDE also assists users in deploying the service on the OPTIMIS infrastructure: creating the service packages, installing these packages in the images, and creating the Service Manifest, where the service developer specifies the non-functional requirements needed to achieve a deployment in the cloud that satisfies the user's needs.

The OPTIMIS IDE is designed as an Eclipse plugin that contains several wizards and actions. Service development starts in the Service Implementation stage, where the IDE guides the developer through the definition of Orchestration and Core Elements and generates parts of the code required to implement them. This part has been extended to support integrating legacy software and existing code into the compositions implemented for the new cloud service. The IDE offers wizards for creating new service elements and wrappers to execute existing services, binaries, libraries, commands, etc.

Once the service implementation is finished, the IDE provides a set of actions to launch the different service building processes, requesting additional information from the user when required. In these actions, the IDE compiles, instruments, and packages the service code together with the legacy software libraries. Afterwards, the IDE's Image Creation Action facilitates the creation of the VM images for the developed service by using the Image Creation Service. In this step, the generated service packages and their software dependencies are installed in images that fulfill the service element requirements. In case the Service Developer has introduced a license requirement in one of the Core Element definitions, the request-license-token action allows the developer to obtain the license tokens needed to execute the legacy software and to introduce them in the service manifest. The IDE provides a further action to generate the Service Manifest according to the requirements and properties of the implemented service, the created images, and the non-functional requirements (TREC requirements, legal constraints, etc.) provided by the service developer.

Finally, once the service images and license tokens have been created, the IDE also offers an action to deploy the recently built service on the OPTIMIS infrastructure, integrating the results of the Service Construction phase with the actions performed in the Service Deployment phase. In this action, the IDE also offers the service developer the possibility of defining non-functional requirements such as TREC requirements or legal constraints.

4.1.3 Image Construction Service

The Image Creation Service (ICS) is used to create and manipulate VM images. These images embed the applications that are developed with the programming model. The ICS supports creation, update, finalization, and deletion of VM images. Images can also be defined to comply with requirements specified in the service manifest. The ICS is an important tool for interoperability at the VM management layer.
The ICS can handle VM templates from different types of IPs, including native OPTIMIS IPs as well as clouds based on OpenNebula and OpenStack. An outline of the ICS architecture is shown in Figure 2.
Figure 2 - Image Creation Service Architecture.

4.1.4 Service manifest

The SP writes the specification and configuration of the service in the service manifest, describing the functional and non-functional parameters of the service. Information relevant to the service manifest includes: VM images, thresholds for the TREC factors the SP requests, location and cost constraints, capacity and elasticity requirements, KPIs to monitor, etc. The OPTIMIS service manifest essentially describes the requirements of the service provider for an infrastructure service provisioning process. The Service Manifest is therefore an abstract definition of the infrastructure services as expected by the service provider. All aspects of these infrastructure services must be described in detail in the manifest. Figure 3 shows a high-level overview of the service manifest structure. Each element in the service manifest may have an associated identifier, allowing service developers to cross-reference elements, e.g., to specify that an elasticity rule relates to a particular component.

Figure 3 - Service manifest structure.
In the Service Description Section, each VM that constitutes the service is described using OVF, along with any allocation constraints (cardinality for elasticity), rules for (anti-)affinity to other VMs, etc. The TREC Section allows the service developer to specify constraints with respect to trust, risk, eco-efficiency, and cost. These constraints must be met by any IP where service components are to be deployed. The Elasticity Section allows the SP to define scaling rules for auto-scaling of the service components (changing the number of VM instances). Rules are specified by a KPI (performance metric), along with specifications of when to scale up or down as this metric changes. Finally, in the Data Section, various constraints can be defined related to valid data locations (eligible countries) and data protection, including encryption.

4.2 Service Deployment

4.2.1 Service Deployment Optimizer

Upon a cloud service deployment request, the SDO, comprised of two sub-components, the Service Deployer (SD) and the Deployment Optimizer (DO), is responsible for generating optimal deployment solutions for the request and for coordinating the deployment process in order to provision the service according to the deployment plan. Notably, to provide a complete solution for cloud deployment, the SDO interacts with external components for service contextualization, data management, service management, and IP assessment, as illustrated in Figure 4.

Figure 4 - SDO architecture overview.

The SD orchestrates the whole deployment process, including calling external components and invoking the DO to generate an optimal deployment solution for a service. Balancing execution time, quality of solution, and complexity, five algorithms are implemented in the DO for deployment optimization: Random, Round-robin, Greedy, Permutation, and First-fit. An experimental evaluation using 2940 simulated deployments on cloud providers with dynamic pricing schemes was performed. The algorithms are evaluated in terms of deployment cost, rounds of negotiations, and the number of feasible placement solutions found.
Figure 5 - Execution time for Permutation algorithm.

Figure 6 - Execution time for approximation algorithms.

Table 1 - Performance results for all algorithms.

Combining the results presented in Figure 5, Figure 6, and Table 1, we conclude that Permutation is slow and does not scale well, while the other algorithms are fast and scale very well. Greedy finds very good solutions; however, just like Random and Round-robin, it sometimes fails to find a solution. In contrast, the Permutation algorithm always finds a solution (if there is one). These results can guide users in configuring the SDO with the most suitable algorithm when performing cloud service deployment in different scenarios.
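The tradeoff above follows from the structure of the algorithms. As a rough illustration (a sketch of a greedy strategy in general, not the actual DO code), a greedy placement commits each component to the cheapest feasible provider in one pass, which is fast but can hit a dead end that exhaustive Permutation would avoid:

```java
import java.util.*;

// Rough illustration of a greedy deployment strategy (not the actual DO code):
// place each component at the cheapest provider that can still host it.
// A locally cheapest choice can exhaust a provider and leave later components
// without any feasible host, which is why Greedy sometimes fails where the
// exhaustive Permutation algorithm succeeds.
public class GreedyDeployment {

    record Offer(String provider, int freeVms, double pricePerVm) {}

    /** Returns component -> provider, or empty if the greedy pass hits a dead end. */
    static Optional<Map<String, String>> place(Map<String, Integer> componentVms,
                                               List<Offer> offers) {
        Map<String, Integer> free = new HashMap<>();
        offers.forEach(o -> free.put(o.provider(), o.freeVms()));

        List<Offer> sorted = new ArrayList<>(offers);
        sorted.sort(Comparator.comparingDouble(Offer::pricePerVm)); // cheapest first

        Map<String, String> plan = new HashMap<>();
        for (var component : componentVms.entrySet()) {
            Offer chosen = sorted.stream()
                    .filter(o -> free.get(o.provider()) >= component.getValue())
                    .findFirst().orElse(null);
            if (chosen == null) return Optional.empty();   // greedy dead end
            free.merge(chosen.provider(), -component.getValue(), Integer::sum);
            plan.put(component.getKey(), chosen.provider());
        }
        return Optional.of(plan);
    }
}
```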
4.2.2 Admission Control

When admitting a new service to the cloud, the IP must consider not only the basic computational and networking requirements of the service but also the extra ones that may need to be added at runtime, defined as elastic requirements. In many cases, the elastic requirements may be quite large compared to the basic ones. For example, given a service with a high expected variability in the number of users, the number of VMs that may need to be deployed at runtime may vary significantly over time in order to meet the agreed quality of service (QoS) level for the users. Therefore, the elastic requirements play a significant role in the cost of hosting the service, and the IP has a strong interest in investigating the possibility of reducing the resources that need to be booked for elasticity reasons when accepting the service. However, at the same time, such an approach may increase the possibility of deviating from the agreed QoS level, and the imposed penalties may outweigh its advantages.

The AC is a family of service-oriented components with distinct tasks, which work together to provide the overall admission control functionality in OPTIMIS. The components are the Workload Analyzer, the TREC Analyzer, the Admission Controller, and the AC Gateway. Their tasks cover workload analysis, Trust-Risk-Eco-Cost (TREC) information retrieval, optimization modelling and admission control, and web service (WS) access to the aforementioned functionality, respectively. The AC has two modes of operation: a single mode that enables private and multi-cloud deployments, and a federated mode that enables federated and hybrid cloud deployments.

In the background, the Admission Control mathematical optimization model uses a probabilistic approach to the problem of optimally allocating services on virtualized physical resources when horizontal elasticity is required. The formulated optimization model constitutes a probabilistic admission control test. It exploits statistical knowledge about the elastic workload requirements of horizontally scalable services to reduce the correspondingly booked resources. The proposed model also allows for proper trade-offs among business-level TREC objectives, and takes into account extra affinity, anti-affinity, and TREC constraints the services might have. In order to provide a strong assessment of the effectiveness of the proposed technique, the problem is modelled in GAMS and solved under realistic provider settings.
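One way to formalize the idea of probabilistic booking is sketched below; the notation is illustrative and not taken from the model itself. Instead of reserving the worst-case elastic capacity of every service, the IP books a quantile of the aggregate elastic demand, exploiting statistical multiplexing across services:

```latex
% Illustrative formalization of probabilistic booking (not the exact model).
% Service s has a fixed base requirement b_s and a random elastic demand E_s.
% Instead of reserving sum_s max(E_s), the IP books a quantile of the
% aggregate elastic demand:
\[
  \text{admit } s' \quad\Longleftrightarrow\quad
  \sum_{s \in S \cup \{s'\}} b_s
  \;+\; F^{-1}_{\sum_{s \in S \cup \{s'\}} E_s}(\alpha)
  \;\le\; C,
\]
% where F^{-1} is the quantile function of the aggregated elastic demand,
% alpha is the target probability of honoring all SLAs, and C is the
% provider's physical capacity.
```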
Finally, an ad-hoc heuristic solver, able to produce solutions of very good quality (compared to the global solutions) at a fraction of the cost of a commercial global solver, was developed and evaluated in terms of quality and cost against the BARON global solver.

4.2.3 Virtual Machine Contextualizer

The capacity of a cloud service is adapted during runtime by increasing or decreasing the number of instances running for each type of service component. Each instance is started from the same component template, and therefore each instance is absolutely identical when it is
initially started. This is problematic, since some customization is needed for each instance to avoid conflicts with already running instances. The VMC performs contextualization of each instance to configure unique properties, such as IP addresses or VPN configuration, during VM boot time. Contextualization is performed in the VMC by dynamically creating a customized virtual ISO image that contains the settings that are unique to the instance about to be started. The instance contains scripts, automatically executed during boot time, that read and process the instance-specific data. Contextualization can also be used to adapt a booting instance to the underlying infrastructure on which it is about to run, allowing a generically designed base template to be dynamically adapted to the execution environment.

4.2.4 Data Manager

The DM has the purpose of creating on-demand storage clusters (clusters of VMs containing the OPTIMIS version of HDFS, the Hadoop Distributed File System), thus acting as a Data Analytics as a Service (DAaaS) provider. These clusters are linked to the service VMs in order to also act as a data repository and shared storage space provider. Due to their nature, the storage clusters also implement the MapReduce programming model, through which the services may submit distributed data analysis tasks. To this end, after a successful negotiation, user spaces must be prepared to which the users will have access for the necessary actions. The DM thus acts as a service layer on top of the storage clusters, offering management and regulation capabilities. The aim is to be consistent with the SLA constraints (e.g., legal requirements) with regard to data location and security levels, but also to offer extended functionality such as usage of federated resources and risk-based management.

The OPTIMIS Distributed File System (ODFS) solution is heavily based on the Apache Hadoop framework, and mainly on its underlying file system, HDFS. The purpose of the ODFS is to offer virtualized Hadoop clusters as a service, to accompany regular service VMs and to act as a storage and data analytics cluster. In order to offer it as a service, the two main node types, the Master Node (Hadoop Namenode) and the Slave Node (Hadoop Datanode), have been incorporated in virtual machine (VM) templates. A RESTful gateway (the OPTIMIS DM) has been implemented to enable the creation and management of the data clusters during startup and runtime. The DM is responsible for receiving requests from the OPTIMIS Infrastructure Provider (IP) to launch a group of these data VMs for a specific service. Following these requests, the DM launches for this service ID a Namenode, accompanied by a suitable number of Datanodes, which constitute the virtualized distributed file system and data analytics framework. Afterwards, the regular VM servers that host the applications (service VMs) are contextualized, and a suitable client is installed so that the users can mount and use the ODFS data VMs.

In a multi-provider environment, the ability to federate from one IP to another is crucial in terms of achieving availability obligations or enabling probabilistic overbooking strategies. Furthermore, in bursting strategies (from a private cloud to a public one), this can also apply for meeting peaks in demand.
However, in these actions, data would be kept in-house, either for confidentiality reasons (bursting of a private cloud) or for performance reasons (increased delay in moving large amounts of data from one provider to the other, for a federation that may potentially last only a short time to meet specific loads). Thus, by using the presented mechanism, a provider may rank the candidate service VMs for federation based on their
anticipated activity towards the virtual Hadoop data cluster that will remain in-house. Through this, the VMs that will exhibit the least activity can be selected, minimizing the performance overhead of accessing data over a remote network location (from the federated or bursted provider).

Knowing the demand for data in advance may also enable optimal replication management schemes. These may aid in two areas. First, an increased number of replicas for anticipated hot data results in more sources being able to accommodate incoming requests, thus improving performance and response time (especially for read operations). Second, a reduced number of replicas for predicted cold data may result in reduced disk usage. Following these predictions, suitable management schemes can be applied to rearrange aspects such as the number and location of replicas.

Data activity patterns are predicted using time series methods. The available input is the compressed time series of average I/O values in a predefined time interval, in terms of MB/second. This compression is applied both to reduce the size of the logs and to hide large variations in the time series itself. The output (prediction) values can be fed back to the input in order to iterate the prediction and thus extend it to multiple time steps ahead. A typical time step ranges between 15 minutes and 1 hour in the time series we have investigated.

Because the examined data displayed a periodic nature, a classic method from the signal analysis domain, Fourier series analysis, was selected. Analysis using the Fourier coefficients is used to capture the effect of various frequencies in the overall signal function. This way, the main signal can be separated from the noise. In our case, it was used as a way of identifying different workload patterns (with different frequencies) that may affect the overall data demand (for example, the main period of the data demand signal can be considered the working week, while larger periods of influence may indicate specifics such as summer/Christmas holidays, etc.). The periodic demand signal is modelled as an eighth-order Fourier series:

\[
f(x) = a_0 + \sum_{k=1}^{8} \left[ a_k \cos(k \omega x) + b_k \sin(k \omega x) \right]
\]

This proposed method was evaluated using two real data traces, one from Wikipedia and a second gathered from the OPTIMIS partner Arsys. The Wikipedia trace is a 3-month web server trace counting the number of site hits per hour, whereas the Arsys logs constitute 8 months of traffic to a mail service. In both cases, 70% of the trace was used as the training set and the remaining 30% for validation. Performing prediction using the Fourier method on the Wikipedia data set, the accuracy (mean absolute error) was 14.29% for 10 steps ahead and 12.37% for 100 steps ahead. The corresponding numbers for the Arsys trace were 14.34% for 10-step prediction and 53.08% for 100-step prediction. Notably, the periodic nature of the Wikipedia trace results in very good predictions also for longer time periods (100 hours in this case). Figure 7 shows the overall results for the Arsys trace.

Figure 7 - Validation of Arsys trace: overall validation on the Arsys dataset for a variety of step-ahead predictions (15-minute values).
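As a concrete illustration of how such a fitted model extrapolates demand, a minimal sketch follows; the coefficient values would come from fitting the training 70% of the trace, and the class and method names are illustrative, not the actual DM code. Since the Fourier series is an explicit function of time, the fitted model can be evaluated directly at future steps:

```java
// Minimal sketch of multi-step-ahead prediction with a fitted Fourier series.
// The coefficients a[0..8], b[1..8] and the base frequency w are assumed to
// come from fitting the training part of the I/O trace; names are illustrative.
public class FourierDemandPredictor {

    private final double[] a;   // a[0] is the constant term a0
    private final double[] b;   // b[0] is unused
    private final double w;     // base angular frequency

    public FourierDemandPredictor(double[] a, double[] b, double w) {
        this.a = a; this.b = b; this.w = w;
    }

    /** Predicted average I/O (e.g., MB/s) at time step x. */
    public double predict(double x) {
        double f = a[0];
        for (int k = 1; k < a.length; k++) {
            f += a[k] * Math.cos(k * w * x) + b[k] * Math.sin(k * w * x);
        }
        return f;
    }

    /** Extend the forecast: predict the next `steps` intervals after t0. */
    public double[] predictAhead(double t0, int steps) {
        double[] out = new double[steps];
        for (int i = 0; i < steps; i++) {
            out[i] = predict(t0 + i + 1);   // one value per step (15 min to 1 h)
        }
        return out;
    }
}
```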
4.2.5 Cloud Quality of Service

The Cloud QoS component can be used to create Service Level Agreements between an OPTIMIS SP and an OPTIMIS IP. An agreement includes the Service Manifest as the main part of the contract. The Service Manifest describes the requirements that the Service Provider has for the deployment of the service. It also describes the cost, trust, eco-efficiency, and risk factors that the IP will comply with.

The Cloud QoS component was implemented using the WSAG4J framework. The WS-Agreement for Java framework is a tool to create and manage service level agreements (SLAs) in distributed systems. It is an implementation of the OGF WS-Agreement and WS-Agreement Negotiation proposed standards. WSAG4J helps in designing and implementing SLAs for a service and automates typical SLA management tasks such as SLA offer validation, service level monitoring, persistence, and accounting.

The WSAG-based protocol uses an XML-based language for specifying the nature of an agreement, and agreement templates to facilitate the discovery of compatible agreement parties. The typical WS-Agreement life cycle has four phases: 1) Exploration: the IP provides templates describing possible agreement parameters; 2) Creation: the SP fills in parameters and makes an SLA offer; 3) Operation: the agreement state is available for monitoring; and 4) Termination: the agreement is destroyed explicitly or via soft state (termination time).

4.2.6 License Management

In OPTIMIS clouds, additional requirements for license management arise due to the multi-cloud capabilities (e.g., cloud bursting or cloud federation) and the extended support for elasticity. In case a risk of insufficient resources leads to underperforming services, the OPTIMIS Toolkit allows resources to be dynamically added from an additional cloud provider (if the current cloud provider cannot provide more resources). Moreover, in case of temporary or permanent problems, the OPTIMIS Toolkit allows services to be transparently migrated to a new cloud provider. In both cases, licenses for the license-protected application have to be available in the new cloud environment too. Since it cannot be assumed that the new cloud provider can provide all the licenses needed to continue the execution of the application, this
requires either a token-based approach, where the token can travel with the application to the new cloud environment, or a dynamically deployable license service that has the necessary licenses installed.

The rationale behind the trusted instance is twofold:

- Providing a secure container for tokens that, in addition, is able to communicate with the API and deliver all the information the API requires to authorize a user's request to launch an application. The trusted instance verifies that the token is valid.
- Providing a secure channel for communicating with the license server located at the user's premises (e.g., to verify the status of a token prior to and during the operation of the license-protected application). The same channel is used by the API to trigger a renegotiation of the license usage terms (e.g., duration, number of processors, etc.) with the license server installed at the user's premises.

The trusted instance resides in a VM that can be deployed in the cloud together with the application VM(s). The OPTIMIS VMC is responsible for setting up and configuring the trusted instance in the VM at deployment. This includes network address configuration, an initial set of tokens, etc. The contextualisation information is provided in the service manifest. Using a trusted instance does not require changes in the processing of the authorisation in the API, because the only difference is the source of the information, namely the trusted instance instead of a local token file read by the API. Similar to the license delegation described below, the cloud info service provides an additional feature to secure the license tokens: it allows the trusted instance and the API to check whether the execution environment is the one foreseen. In case of a mismatch, the application will not be executed.

Figure 8 - Deployment of a trusted instance in the cloud.

The trusted instance operates on tokens that are created beforehand, when it is known in advance which applications will be used and which licenses are required. The tokens can be created beforehand at the user's premises. The license server deployed on the IP's cloud infrastructure supports dynamic, on-the-fly creation of tokens that are needed for running the applications. This process is outlined in Figure 8.
The deployed license service (Server B) is a copy of the license service that runs locally at the premises of the user (Server A). The license service is part of the VM, so it can be deployed on the designated cloud infrastructure together with the applications and data. However, the total number of licenses and features initially procured from the ISV must remain the same when a license service is running at the premises of the user and in the cloud. To achieve this, the licenses and features made available to the license service in Server B in the cloud are blocked in Server A and cannot be used locally anymore. Figure 9 depicts the process of preparing a subset of the licenses and features available at Server A to be added to Server B in the cloud (license delegation).

Figure 9 - Secure license delegation mechanism.

As shown above, the initial authorisation issued by the ISV for Server A to install and use a procured license can be delegated by Server A to Server B. This delegated authorisation includes the authorisation of the ISV and the certificate of Server B, and is signed by Server A. Server B includes the authorisation of Server A in each created token. This allows the API to validate the entire chain of trust up to the ISV when a token is processed. The authorisation for using the application is rejected when the chain is broken, i.e., when the token was not created by Server B but by another copy of the license service running on another server. Moreover, the API can check at runtime whether the licenses and features contained in the token are blocked for local use at Server A.

As an additional security feature for the token, we added provider-specific information to the token, which is verified once the license-protected application has started and the API reads the token. In case of a mismatch, the application quits. While this limits the token's ability to be created once and used in different environments when the load situation of a cloud requires bursting or federation, the provider information can be used in environments where the ISV is not 100% confident. Clearly, the ISV has to allow license delegation. The ISV can do this implicitly by providing an extended API that is able to validate the delegation chain. Any non-enhanced API would reject tokens that were created based on a delegated license.
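To make the chain-of-trust check concrete, the following is a minimal hypothetical sketch of how an extended API might validate a token created from a delegated license; the class layout, field names, and signature scheme are illustrative assumptions, not the actual OPTIMIS license API:

```java
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical sketch of delegation-chain validation (illustrative names).
// A token is valid only if every link holds: the ISV authorized Server A,
// Server A delegated to Server B, and Server B signed the token itself.
public class TokenChainValidator {

    public static boolean validate(byte[] token, byte[] tokenSig,
                                   byte[] delegation, byte[] delegationSig,
                                   byte[] isvAuthorization, byte[] isvAuthorizationSig,
                                   PublicKey isvKey, PublicKey serverAKey,
                                   PublicKey serverBKey) throws Exception {
        // 1) The ISV signed the authorisation for Server A.
        if (!verify(isvAuthorization, isvAuthorizationSig, isvKey)) return false;
        // 2) Server A signed the delegation, which embeds Server B's certificate.
        if (!verify(delegation, delegationSig, serverAKey)) return false;
        // 3) Server B (and only Server B) signed this token; a token created
        //    by any other copy of the license service breaks the chain here.
        return verify(token, tokenSig, serverBKey);
    }

    private static boolean verify(byte[] data, byte[] sig, PublicKey key)
            throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(key);
        s.update(data);
        return s.verify(sig);
    }
}
```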
4.3 Service Operation

4.3.1 Cloud Optimizer

The CO combines the monitoring and assessment tools in the OPTIMIS Basic Toolkit with various management engines in order to create a self-managed cloud infrastructure driven by the provider's business-level objectives. The CO consists of two main components: the Holistic Manager (HM) and the Cloud Manager (CM).

The HM is responsible for harmonizing the operation of the IP's low-level managers in order to fulfill high-level BLOs, i.e., to set the overall IP TREC values for the provider. BLOs represent how an IP wants to manage the infrastructure, e.g., reducing costs of operation at the expense of trust. The HM configures the low-level managers and engines at boot time and every time the BLOs are changed. It translates BLOs into TREC-based objective functions and constraints. Managers supporting TREC factors are configured using TREC weights; managers not supporting TREC factors are configured using modes, so the HM maps the objective function to one of these modes. The HM enables the IP to optimize the overall service execution while maintaining eco-awareness.

The CM is responsible for the actual deployment/release of the different VMs based on the recommendations given by the different low-level components and the TREC assessors. In addition, the CM arbitrates between private and external deployment of the VMs, i.e., between different cloud deployment models. The CM receives a notification from the TREC assessors when a TREC factor goes above/below a given threshold. When this occurs, the CM forwards the notification to the low-level managers to try to solve the issue. If the CM receives subsequent notifications, it initiates more drastic actions, such as cancelling a running service, or elastic VMs within services. To this end, the CM repeats the process above, but calculating the forecast of the overall {trust, risk, eco, cost} for the IP when cancelling the corresponding VMs.

4.3.2 Virtual Machine Manager

The VMM is responsible for the efficient management of infrastructure resources by managing physical nodes and the VMs running on top of them during their whole life cycle. The VMM has two subcomponents: the placement optimizer and the infrastructure optimizer. The placement optimizer optimizes how VMs are placed on physical resources so that the IP's BLOs are fulfilled. At any given moment, it can re-organize the mapping of VMs to physical resources according to the IP's goals. The infrastructure optimizer turns physical resources on or off, depending on the load handled by the IP. The decision-making processes within the VMM are governed by a policy management framework that incorporates both BLOs (based on TREC factors) and traditional parameters (hardware and resource requirements).
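As an illustration of how such a policy framework can combine the two kinds of inputs, the following is a hypothetical scoring sketch, not the actual VMM policy; the names, normalizations, and scoring form are all assumptions:

```java
import java.util.Map;

// Hypothetical sketch of a placement policy that scores a candidate host for
// a VM by combining TREC-derived BLO weights with traditional resource fit.
// Names and the scoring form are illustrative, not the actual VMM policy.
public class PlacementPolicySketch {

    /** Higher score = better host for this VM under the current BLOs. */
    static double score(Map<String, Double> bloWeights,
                        double hostEnergyEfficiency,   // eco assessor, normalized 0..1
                        double hostFailureRisk,        // risk assessor, normalized 0..1
                        double hostCostPerHour,        // cost assessor, normalized 0..1
                        double cpuFit, double memFit) {// fraction of CPU/RAM left after placement
        if (cpuFit < 0 || memFit < 0) {
            return Double.NEGATIVE_INFINITY;           // hard constraint: the VM must fit
        }
        return bloWeights.getOrDefault("eco", 0.0) * hostEnergyEfficiency
             - bloWeights.getOrDefault("risk", 0.0) * hostFailureRisk
             - bloWeights.getOrDefault("cost", 0.0) * hostCostPerHour
             + 0.1 * Math.min(cpuFit, memFit);         // mild preference for balanced hosts
    }
}
```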
The scheduling algorithm used for the allocation of VMs in the IP considers multiple facets to optimize the provider's profit. In particular, it considers energy efficiency, virtualization overheads, and SLA violation penalties, and supports outsourcing to external providers. We refer to this policy as CDSP.

Figure 10 - Frequency histogram of SLA fulfillment.

Comparing our proposed algorithm to dynamic backfilling, Figure 10 demonstrates that our approach achieves high SLA fulfillment for most of the tasks, while dynamic backfilling provides an SLA fulfillment level below 0.5 for more than 50% of the tasks. Figure 11 also shows that CDSP reduces energy consumption compared to dynamic backfilling.

Figure 11 - Comparing power consumption.

4.3.3 Elasticity Engine

Elasticity is the ability of the cloud to rapidly vary the resource capacity allocated to a service according to the current load, in order to meet the quality of service (QoS) requirements specified in the SLA agreements. During the project, we designed and implemented various elasticity controllers to be used in the EE.

In designing our controllers, we view the cloud as a control system and model a service deployed in the cloud as a closed-loop control system. Thus, the horizontal elasticity problem can be stated as follows: the elasticity controller output μ(t) should add or remove VMs to ensure that the number of service requests served, C(t), is equal to the total number of requests
received, R(t) + D(t), at any time unit t, with an error tolerance specified in the SLA, irrespective of the change in the demand D(t), while keeping the number of VMs at a minimum. Figure 12 shows the cloud system model.

Figure 12 - System model for elasticity controller design.

Table 2 - Controller performance for the FIFA world cup traces.

Table 2 shows the performance of one of our controllers, C_hybrid, versus a reactive controller, C_reactive, using the FIFA 1998 world cup server workload traces when the workload size is increased by multiplying it by different factors F. The average service rate of one server is λ; OP and UP denote over-provisioning and under-provisioning, respectively, while the barred OP and UP are their averages.
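For intuition, a reactive controller in the spirit of C_reactive can be sketched in a few lines. This is a simplification under stated assumptions (perfect knowledge of the current arrival rate, instantaneous VM start-up, no hysteresis), not the actual EE controller:

```java
// Simplified sketch of a reactive horizontal elasticity controller.
// Assumes the current request arrival rate is known exactly, that each VM
// serves `lambda` requests per time unit, and that VMs start instantly.
// The actual EE controllers are more elaborate (e.g., the hybrid controller
// adds proactive prediction of future demand).
public class ReactiveElasticityController {

    private final double lambda;       // average service rate of one VM
    private final int minVms;          // never scale below this floor

    public ReactiveElasticityController(double lambda, int minVms) {
        this.lambda = lambda;
        this.minVms = minVms;
    }

    /** Number of VMs to add (positive) or remove (negative) this time unit. */
    public int control(double arrivalRate, int currentVms) {
        int needed = (int) Math.ceil(arrivalRate / lambda);
        return Math.max(needed, minVms) - currentVms;
    }
}
```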
We tested our controllers with different workloads [2]. The controllers showed varying performance across the different workloads. In order to improve performance, we designed WAC, a Workload Analysis and Classification tool that analyzes workloads and assigns them to the most suitable elasticity controllers based on the workload characteristics and a set of BLOs. The rationale behind this tool is illustrated in Figure 13.

Figure 13 - Overview of Workload Analysis and Classification (WAC) tool.

Different experiments show that WAC correctly assigns between 92% and 98.3% of the workloads to the most suitable elasticity controller. This allows the IP to better optimize resource usage while optimizing performance according to the predefined BLOs.

4.3.4 Fault Tolerance Engine

The FTE is responsible for the monitoring and alerting parts of self-healing infrastructure operation. As such, it explicitly requests or implicitly receives periodic updates from the monitoring system about the state of physical hardware hosts and virtual IT infrastructure (i.e., virtual machines, VMs). Based on these, the engine decides whether any corrective action is required during service operation, such as restarting a recently failed VM. Since OPTIMIS is only concerned with IaaS provisioning, the FTE does not handle software faults or checkpointing, as the SP should handle these issues.

The proposed fault tolerance mechanisms enhance reactive fault tolerance mechanisms, which allow recovering the execution of services upon infrastructure failures, with proactive capabilities. The proactive component makes use of the risk assessment tools to apply preventive measures before the actual failure occurs. If the proactive component fails to predict the failure, the FTE reverts to the reactive recovery mechanism. The FTE suggests recovery actions, which are evaluated by the VM Manager with respect to the overall IP objectives, since some recovery actions might not be compliant with the provider's
goals. For instance, some recovery actions can increase the cost, while the IP objective is to reduce the overall cost.

Reactive Fault Tolerance. To detect the failure of virtual machines, the engine periodically checks whether virtual machines are online by pinging them. When a virtual machine does not respond, the engine also checks its CPU consumption and its state. When a VM failure is detected, the CM is asked to restart the failing VM. To detect the failure of physical hosts, the engine periodically checks whether physical hosts are online by pinging them. If a physical resource does not respond within a given time, it is considered to be down, and the CM is asked to restart the VMs running on that host on other hosts.

Proactive Fault Tolerance. The engine is also able to anticipate failures in a proactive way. This is done in collaboration with the risk assessor. The Fault Tolerance Engine can configure the risk assessor so as to receive proactive notifications when the risk level of a VM or host failure is above a given threshold. The rationale used to set this risk level is twofold. On the one hand, it can be set for all the VMs running in the provider and for all its physical hosts, if the provider's business strategy includes a risk level constraint or an aim to optimize the overall risk level during its operation. On the other hand, it can be set individually for the VMs composing a specific service, if a risk level constraint has been specified in the service manifest. Once the engine receives an alert from the risk assessor that a physical host is likely to fail, it informs the CM about this situation by suggesting the migration of all the VMs running on that host to other hosts. If the engine is alerted of a potential VM failure, the process is the same as in the reactive behavior: the CM is asked to restart that VM.

4.3.5 Service Manager

The Service Manager manages services throughout their entire lifecycle, from service creation until execution is cancelled. The Service Manager interacts with many other components and performs a wide range of tasks, including to:

- act as an information repository for services deployed on the VMs of infrastructure providers;
- expose deploy and undeploy methods that can be used to control the deployment of a service programmatically;
- monitor the execution of services via a link with the SLA Management component and the TREC Monitoring component to determine whether services must be redeployed;
- store rules used to check whether a service must be redeployed (based on information coming from the SLA Management component and the TREC Monitoring component);
- send orders for service redeployment whenever necessary;
- store elasticity rules defined by the end-user and communicate the elasticity rules to the Elasticity Engine.

In effect, the Service Manager is responsible for service management at the SP level. The component is also integrated with GUIs, where users and administrators can inspect and manage the execution of a service.
4.4 Fundamental tools

4.4.1 Monitoring Infrastructure

OPTIMIS monitoring provides aggregation of data from multiple information providers, management of that data, and a data model designed for flexible post-processing of the monitoring data. The goal is to provide a holistic approach that introduces an abstraction layer on top of the various sources of information in the cloud paradigm and effectively stores, manages, and offers that data to other components or actors of the framework. The main challenges are due to the use of virtualization technologies, the distribution of the infrastructure, and the scalability of the applications. The system also contains support for energy consumption metrics. The resulting system spans several levels of the infrastructure, providing monitoring data from the application, the virtual and physical infrastructure, as well as energy-efficiency-related parameters. The different layers are:

- Information Providers: the different sources from which monitoring data is collected, as well as the components in charge of collecting it (known as Collectors). The Monitoring Infrastructure is designed so that this layer is scalable, allowing the incorporation of additional sources through corresponding collectors.
- Management and storage of the data: the components in charge of managing the data collected from the monitoring information providers and storing it in an aggregated database.

There are two main components involved in the monitoring infrastructure. The Monitoring Manager component serves as the orchestrator of the whole monitoring process. The role of this component is twofold: controlling the monitoring process (start/stop actions) and providing the necessary interfaces towards other internal components of the cloud infrastructure. The role of the Aggregator component is to collect, aggregate, and store the monitoring information coming from the various information providers. The challenge, but at the same time the objective, when implementing this component is the development of a lightweight mechanism that effectively combines information from different levels (from high-level application monitoring to low-level energy-related data).

4.4.2 Trust

In OPTIMIS, trust is seen as the evaluation of the reliability perceived in IPs (and the way they manage resources and the services deployed) and in SPs (and the behavior of the services they have developed). The main entities (in our case: services, Service Providers, Infrastructure Providers, and resources) are long-lived, and interactions among them are continuous. Deployment requests and resource usage are stored and used for recalculating trust and storing it, keeping historical evaluation data (which can be accessed through a standardized API). Past values are included in the trust calculations, and the obtained ratings are used later by other components, influencing their behavior.
The process for calculating trust is iterative and is repeated periodically (the period depends on the frequency of input updates and on the needs of users). The steps are the following: data is gathered from the different input sources (monitoring, service manifests, other tools, etc.); individual parameters are calculated according to the corresponding model (VM formation, resource usage, legal aspects, etc.); the calculations are aggregated (taking into account historical data and aggregation mechanisms through fuzzy models); all data is stored; and, finally, internal thresholds are checked in order to determine proactive actions to be performed.

IP assessment: The trustworthiness of an IP is modelled based on the following characteristics of the IP: runtime execution gap (REG), VM formation (VMF), IP reaction time (IRT), SLA compliance (SC), and legal evaluation (LE). The coefficient of variation is given as C_v = σ/μ, where μ is the mean and σ is the standard deviation computed over the measurements of the metric parameters.

SP assessment: The trustworthiness of an SP is modelled based on the following characteristics of the SP: risk assessment (RA), legal and security (LS), service reliability (SR), and elasticity tightness (ET).

4.4.3 Risk

Risk is a concept that relates to human expectations about some future event. It denotes a potentially negative impact on an asset, depreciating some of its characteristic value. Risk arises from, and is triggered by, some present process or future event. Generally, risk is understood to represent the probability of a loss implied by a threat. In professional risk assessment, the notion of risk combines the probability of events with the impact of those events. Thus, mathematically, risk is equal to the weighted average of the expected losses with respect to their chances of occurring.

A fundamental issue in the characterization and representation of risk is to properly and appropriately carry out the following steps:

i. Analyze the triggering events of the risk and, by breaking down those events, formulate their adequately accurate structure.
ii. Estimate the losses associated with each event in case of its realization.
iii. Forecast the probabilities or possibilities of the events by using either
   iii.1. statistical methods with probabilistic assessments, or
   iii.2. subjective judgments with approximate reasoning.

After the possible risks have been identified, they are assessed in terms of their potential severity of loss and probability or possibility of occurrence. This process is called Risk Assessment (RA). The input quantities for risk assessment can range from simple and measurable (when estimating the value of a lost asset or a contracted penalty associated with non-delivery) to impossible to know for certain (when trying to quantify the probability of a very unlikely event). Risk Management (RM) is the process of measuring or assessing risk and, on the basis of the results, developing strategies to manage that risk and control its implications. Managing a type of risk includes determining whether an action, or a set of actions, is required, and if so, finding the optimal strategy of actions to deal with the risk.
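The "weighted average of expected losses" above can be written out explicitly. This is a standard formulation, stated here for concreteness rather than quoted from the OPTIMIS risk model:

```latex
% Risk as the expected loss over the identified triggering events:
% p_i is the probability of event i occurring, L_i the loss if it does.
\[
  R \;=\; \sum_{i} p_i \, L_i
\]
% For a single event, this reduces to the probability-times-consequence
% form r = P * C used in the OPTIMIS risk assessor described below.
```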
The risk value (r) is a function of both the probability (P) that an incident occurs and the consequence (C), i.e., the harm the incident causes to the asset: r = P × C. The risk level is represented on a scale [1..7]: 1: trivial, 2: minor (-), 3: minor (+), 4: significant (-), 5: significant (+), 6: major, and 7: catastrophic.

In OPTIMIS, risk assessment is performed at the following stages:

Infrastructure Provider Assessment: The SP, before sending an SLA request to an IP, assesses the risk of dealing with all known IPs.
Service Provider Assessment: An IP receives an SLA request and assesses the risk of dealing with the SP from which the request came.
SLA Request Risk Assessment: The IP assesses the risk of accepting the SLA request from the SP, e.g., the risk of physical hosts failing and the impact on current services.
SLA Offer Risk Assessment: The SP assesses the risk associated with deploying a service in an IP (entering an SLA with the IP).
Risk assessment at service operation: this refers to continuously assessing the risk of failure of physical hosts, VMs, services, and the entire IP infrastructure. Another risk assessment task consists of monitoring the risk associated with cloud bursting.

Table 3 Types of risk.
Risk | Associated Vulnerabilities / Threats
IP Failure | Aggregation: SLA 1 Failure Risk, SLA 2 Failure Risk, ...
SLA Failure | Aggregation: VM 1 Failure Risk, VM 2 Failure Risk, ...; Data Management threats; Security threats; Legal threats
VM Failure | Virtual host cpu_speed; Virtual host Vm_state; Virtual host Disk_total; Virtual host Mem_total; Virtual host OS_version
Physical Host Failure | Physical Disk; Physical Memory; Physical CPU; Network
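A minimal sketch of the risk value computation described above is given below. The mapping from the continuous product P × C onto the seven-level scale uses uniform bins chosen here purely for illustration; the thresholds are not taken from the OPTIMIS risk models.

```python
RISK_LEVELS = ["trivial", "minor (-)", "minor (+)", "significant (-)",
               "significant (+)", "major", "catastrophic"]

def risk_value(probability, consequence):
    """r = P * C, with P in [0, 1] and C a normalized consequence score in [0, 1]."""
    return probability * consequence

def risk_level(r):
    """Map a risk value in [0, 1] onto the seven-point scale [1..7].

    The uniform binning below is a hypothetical choice for illustration only.
    """
    level = min(7, int(r * 7) + 1)
    return level, RISK_LEVELS[level - 1]

# Example: a fairly likely incident (P = 0.3) with severe impact (C = 0.8).
print(risk_level(risk_value(0.3, 0.8)))  # -> (2, 'minor (-)')
```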
4.4.4 Eco-efficiency

The model for eco-efficiency is based on several aspects, e.g., the energy efficiency (amount of useful work per Watt) and the ecological efficiency (amount of useful work per kg of emitted carbon). The model distinguishes between different levels (the VM level, node level, infrastructure level, and service level) in order to pinpoint and rate the eco-efficiency of each provider and use this as a factor during scheduling.

An example of an energy efficiency calculation at the node level follows. In this case, U_VMj,i represents the CPU consumption of VM j in node i. The maximum computing power that can be delivered by a node is represented as P_Node and can be measured as the number of computing units delivered by the node when its processors are running at full speed with 100% CPU utilization. By multiplying P_Node by the sum of the CPU utilizations (normalized to 1) incurred by the VMs running in the node, we obtain the amount of computing power delivered to the VMs, assuming that the distribution of computing power among the VMs is proportional to their CPU utilization. These assumptions are valid when running compute-intensive applications. The power overhead needed to run the node (cooling, lighting, and power delivery losses) also needs to be taken into account. This is normally calculated using a PUE factor, and the total power consumed is assumed to be the measured power consumption of the server (R_i) multiplied by the PUE of the datacenter. Taking these aspects into consideration, the energy efficiency of a node from the IaaS point of view can be calculated using Equation 1. Note that the CPU utilization values range from 0 to 100 independently of the number of processors present in the node. Therefore, U_VMj,i refers to the total amount of CPU time consumed by the whole server for a given time period, and not by a single core of it.

Equation 1 Energy efficiency of a node.

The full model contains corresponding equations for energy efficiency at all levels of the infrastructure. Using this eco model as the basis for VM placement has been evaluated using simulation, and the results show a 14% improvement in energy efficiency.
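The prose above determines the node-level calculation: delivered computing power divided by total consumed power. The sketch below follows that description; the function and parameter names (p_node, vm_cpu_utilizations, r_i, pue) are chosen here for illustration and are an interpretation of, not a copy of, Equation 1.

```python
def node_energy_efficiency(p_node, vm_cpu_utilizations, r_i, pue):
    """Node-level energy efficiency: useful computing power per Watt.

    p_node              -- computing power of the node at 100% CPU utilization
    vm_cpu_utilizations -- per-VM CPU utilization values in [0, 100]
    r_i                 -- measured power consumption of the server (Watt)
    pue                 -- Power Usage Effectiveness of the datacenter

    Delivered power is P_Node times the sum of VM utilizations normalized to 1;
    total consumed power is the server draw multiplied by the PUE overhead factor.
    """
    delivered = p_node * sum(u / 100.0 for u in vm_cpu_utilizations)
    consumed = r_i * pue
    return delivered / consumed

# Hypothetical node: three VMs at 40%, 30% and 20% CPU, 200 W draw, PUE of 1.5.
print(node_energy_efficiency(p_node=100.0, vm_cpu_utilizations=[40, 30, 20],
                             r_i=200.0, pue=1.5))
```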
4.4.5 Cost

In determining the cost on the infrastructure provider side, the starting point is to develop a model for the Total Cost of Ownership (TCO). The focus here is not to reproduce all TCO calculations used, but rather to illustrate how low-level service resource usage and service configuration parameters are mapped to the TCO in order to determine the cost per service. A high-level TCO model, referred to below as Equation (1), breaks the TCO down into, among other parts, an energy-related cost part and an operational cost part.

Using the TCO value from Equation (1), the objective of the following cost models is to understand the cost to the infrastructure provider of hosting a specific service. The assumption is made that the hosting of services is the only revenue generator for the infrastructure provider, so the hosted services consume 100% of the TCO when calculating the cost. In order to generate a profit, the revenue from the hosted services must exceed the TCO. Depending on the resource usage of the VMs constituting a service, some services can cost more to host than others. In order to determine the cost of a specific service, the resource usage is monitored over time per VM and used to represent the cost of hosting the service (which consists of multiple VMs). The TCO is monitored over the same time period, so the breakdown of cost per service can be determined as a percentage of the TCO.

We model the TCO influence of a running service based on (a) its consumed energy and (b) the number of CPU cores it uses. The consumed energy is an influencing factor that directly affects the energy-related cost part of Eqn. (1), while the number of cores used captures the influence on the operational cost part of Eqn. (1). This is a reasonable assumption, as an increase in the number of cores used shortens the lifespan of a larger number of physical CPUs and thus affects operational costs (e.g., amortisation). It can be argued that the number of cores also affects the overall energy cost, as an increase in core usage could complicate VM consolidation. In the current models, only CPU and energy are taken into account, but future models will consider more monitored metrics (such as hardware configuration per VM, memory, disk, and network), forming a more sophisticated partial-weighting model. By introducing weighting factors for the energy and CPU shares, service costs are related to the TCO model in Eqn. (1); this leads to the TCO cost percentage that should be applied to a service and thereby its cost (Equations (3) and (4), not reproduced here). The per-service energy and core usage are calculated by accumulating the consumed energy and CPU usage of all VMs that the service is running on for a particular monitoring interval.

4.4.6 Security

The main result of the security research is the inter-cloud VPN architecture (ICVPN). The architecture consists of two main components, namely (a) the peer-to-peer overlay and (b) the secure virtual private connections.
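To make the mapping concrete, the sketch below computes a service's share of the TCO from its energy and core usage. It is a simplified illustration: the symbol names, the 50/50 weighting split, and the example figures are assumptions, not the project's Equations (3) and (4).

```python
def service_cost(tco, service_energy, total_energy, service_cores, total_cores,
                 energy_weight=0.5, cores_weight=0.5):
    """Attribute a share of the TCO to one service.

    The service's fractions of the total consumed energy and of the allocated
    CPU cores are combined with weighting factors into a TCO percentage, which
    is then applied to the TCO to obtain the cost of hosting the service.
    """
    energy_share = service_energy / total_energy
    cores_share = service_cores / total_cores
    tco_percentage = energy_weight * energy_share + cores_weight * cores_share
    return tco_percentage * tco

# Hypothetical figures for one monitoring interval: the service's VMs consumed
# 120 kWh of the 2000 kWh total and used 16 of 400 cores.
print(service_cost(tco=10000.0, service_energy=120.0, total_energy=2000.0,
                   service_cores=16, total_cores=400))
```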
The core technique employed by the ICVPN concept is the use of two tiers of P2P overlays. A universal P2P overlay provides a scalable and secure service infrastructure to initiate and bind multiple VPN overlays to different cloud services. The universal overlay itself can be initiated by the service owner, the cloud broker, or the cloud service providers. It helps with the bootstrapping of VPN peers and also provides other functionality such as service advertisement, service discovery mechanisms, and service code provisioning, with minimal requirements for manual configuration and administration. This approach acts as an aggregation service for the eventual peered overlay resources (in this case virtual machines) spanning multiple cloud domains, which together form a virtual private network. The peers of the universal overlay act as super peers for the nodes of the underlying overlays and let new nodes enroll, authenticate, bootstrap, and join a particular VPN overlay based on the cloud service requiring a VPN service. As depicted in Figure 14, the service owner/provider or the cloud broker can itself be a peer in the universal overlay, and a subset of the universal overlay peers can act as super-peers for the peer nodes of the VPN overlay of a particular cloud service. The universal overlay peers can join and leave the system dynamically, and additional VMs from the cloud providers can be provisioned to act as universal overlay peers as well. As both the universal and the VPN overlay nodes are essentially VMs provisioned from different cloud providers, they can be promoted to or demoted from these overlays based on parameters such as performance and availability.

Figure 14 Two-tier architecture of the ICVPN.

A set of experiments was conducted to evaluate the effect of our prototype ICVPN solution on the network performance of a service deployed on two different cloud IaaS providers. We use a 3-tier web service comprising database, business logic, and presentation components deployed on nine virtual machines hosted on the clouds of British Telecom Ltd. and Flexiant Ltd., our partners in the EU OPTIMIS project. The purpose of these experiments is to evaluate the proposed architecture, in terms of service latency and service throughput, in a practical scenario with a service deployed over a real wide-area network, with the BT cloud geographically located in Ipswich, England, and the Flexiant cloud located in Livingston, Scotland. We define service latency as the inter-cloud round-trip time taken by an HTTP request, issued by a service component on one cloud, to get a response from the target service component on a different cloud. Similarly, service throughput is the inter-cloud
network throughput between service components deployed on different clouds. The experimental results are summarized in Figure 15 and Figure 16 and suggest that the ICVPN introduces acceptable overhead in terms of latency and throughput. The overhead is not bidirectional, though, and the differences between ingoing and outgoing traffic are a topic for future studies.

Figure 15 Mean latency for wide-area ICVPN deployment.

Figure 16 Mean throughput for wide-area ICVPN deployment.
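As an illustration of the latency metric defined above, a minimal measurement sketch is shown below. The endpoint URL and request count are hypothetical; the actual OPTIMIS experiments measured the deployed 3-tier service rather than using a standalone probe like this one.

```python
import statistics
import time
import urllib.request

def mean_service_latency(url, requests=50):
    """Mean inter-cloud round-trip time (seconds) of HTTP requests to `url`."""
    samples = []
    for _ in range(requests):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()  # wait for the full response before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Hypothetical endpoint of a service component hosted on the other cloud.
print(mean_service_latency("http://business-logic.example.org/api/ping"))
```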
5 Conclusions

Based on the scientific work performed in the OPTIMIS project, we relate back to our five innovations and conclude that:

1. Cloud management is achievable based not only on functional properties, but also on higher-level concepts, in this case TREC. OPTIMIS provides infrastructure and service owners with tools to easily handle aspects of cost and eco-efficiency during cloud provisioning. Furthermore, the OPTIMIS Toolkit enables decision-making based on abstract criteria (e.g., risk minimization and trust maximization) and the translation of these goals all the way down to, e.g., VM management policies, significantly raising the level of abstraction for cloud management.

2. Different multi-cloud architectures are possible with a single, carefully designed set of tools. With OPTIMIS, the selection of the desired cloud architecture (private cloud, federation, bursting, multi-clouds, or brokerage) is made at configuration time, instead of having to be planned when developing the cloud management software.

3. Holistic management is a promising approach to handle complexity in IP operation. The coordination and harmonization of lower-level management such as admission control, VM placement, elasticity, and fault tolerance has been shown to be key for optimized IP provisioning.

4. Service lifecycle management matters. The OPTIMIS innovations start at service construction. Unless instrumented to do so in the service manifest, an OPTIMIS service cannot benefit from the proactive self-management actions provided by the service deployment and operation tools. OPTIMIS also provides a rich set of capabilities for service construction. Means for constructing an OPTIMIS service include implementation, aggregation, and inclusion of legacy software, significantly reducing the threshold for adoption.

5. Legal gaps and necessary actions have been identified. The OPTIMIS legal research has resulted in tools for automatic checks of legal compliance with respect to data location, SLA terms, etc. Furthermore, OPTIMIS has closed the gap by designing and implementing an automated process for checking legislative compliance for clouds.

The OPTIMIS Toolkit has been developed over three years and has been extensively validated in three use cases: cloud programming model, cloud bursting, and cloud brokerage. To facilitate this validation, five testbeds with OPTIMIS IP and SP installations have been deployed across Europe, with nodes in Spain, Sweden, and the UK. The OPTIMIS Toolkit is available from www.optimistoolkit.com.