Dynamic Resource Provisioning in IaaS Cloud Environment
Aalto University
School of Science
Degree Programme of Computer Science and Engineering

Ramasivakarthik Mallavarapu

Dynamic Resource Provisioning in IaaS Cloud Environment

Master's Thesis
Espoo, August 9, 2012

Supervisor: Professor Antti Ylä-Jääski
Instructor: Yrjö Raivio Lic.Sc. (Tech.)
Aalto University
School of Science
Degree Programme of Computer Science and Engineering

ABSTRACT OF MASTER'S THESIS

Author: Ramasivakarthik Mallavarapu
Title: Dynamic Resource Provisioning in IaaS Cloud Environment
Date: August 9, 2012
Pages:
Professorship: Data Communication Software
Code: T-110
Supervisor: Professor Antti Ylä-Jääski
Instructor: Yrjö Raivio Lic.Sc. (Tech.)

Elasticity is one of the key enablers of cloud systems, minimizing the cost of resource provisioning while meeting critical Quality of Service (QoS) requirements of a service level agreement (SLA). Most internet based services have SLAs that demand stringent performance requirements. Automated resource provisioning (AutoScaling) is an effective way of dealing with workload fluctuations by allocating resources based on the current demand. Simple reactive approaches to AutoScaling can have an adverse effect on performance, while over-provisioning substantially increases the costs. To tackle these challenges, there is a need for intelligent resource provisioning mechanisms that can model, analyze and predict the resource demand. This thesis outlines the key practical issues involved in AutoScaling in an Infrastructure as a Service (IaaS) cloud environment and provides tangible solutions. We study a few prediction models and make a comparative analysis of their strengths and weaknesses. We then present the predictive elastic resource controller that addresses the issues in AutoScaling by using modeling techniques from statistical analysis. The research also identifies issues relating to resource demand and capacity estimation in a multi-tenant cloud environment. A prototype of the predictive resource controller was implemented on an OpenNebula based cluster. Real world and artificial workload traces were used to test the efficiency of the model. We have also made a comparative analysis of our proposed model with a simple, reactive resource controller. Simulation results show that our model outperforms a simple, reactive resource controller in terms of prediction error, QoS and number of SLA violations.

Keywords: cloud, autoscaling, IaaS cloud, workload modeling, SLA, time series analysis
Language: English
Acknowledgements

I would like to thank Professor Antti Ylä-Jääski for supervising my thesis and giving me the opportunity to work in the Data Communications Software research group, School of Science, Aalto University. I owe my greatest gratitude to my instructor, Yrjö Raivio. Yrjö's patience and guidance helped me a lot during the research and writing of this thesis. I would also like to thank Koushik Annapureddy for many insightful conversations that improved my thesis. Last but not least, I would like to thank my family for supporting me throughout.

Espoo, August 9, 2012
Ramasivakarthik Mallavarapu
Abbreviations & Acronyms

API     Application Programming Interface
AR      Autoregressive
ARIMA   Autoregressive Integrated Moving Average
ARMA    Autoregressive Moving Average
AWS     Amazon Web Services
CAPEX   Capital Expenditure
CBMG    Customer Behavior Model Graph
CSP     Communication Services Provider
EC2     Elastic Compute Cloud
ES      Exponential Smoothing
GAE     Google App Engine
GUI     Graphical User Interface
HA      High Availability
IaaS    Infrastructure-as-a-Service
IT      Information Technology
KPI     Key Performance Indicator
KVM     Kernel based Virtual Machine
MA      Moving Average
MPE     Mean Percentage Error
MSE     Mean Squared Error
NIST    National Institute of Standards and Technology
OPEX    Operational Expenditure
PaaS    Platform-as-a-Service
QoS     Quality of Service
RPC     Remote Procedure Call
RUBiS   Rice University Bidding System
SaaS    Software-as-a-Service
SLA     Service Level Agreement
VM      Virtual Machine
Contents

Abstract
Acknowledgements
Abbreviations & Acronyms
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Targets and Methodology
  1.4 Contribution
  1.5 Document Structure

2 Background
  2.1 Cloud Computing
  2.2 Cloud Service Models
    2.2.1 Infrastructure as a Service
    2.2.2 Platform as a Service
    2.2.3 Software as a Service
  2.3 IaaS Cloud Deployment Models
    2.3.1 Public Cloud
    2.3.2 Private Cloud
    2.3.3 Hybrid Cloud
  2.4 Multi-tier Architectures
  2.5 Rice University Bidding System

3 Design Overview
  3.1 Dynamic Resource Provisioning Framework
  3.2 Predictive Models for AutoScaling
    3.2.1 Mean
    3.2.2 Max
    3.2.3 Exponential Smoothing
    3.2.4 ARIMA
  3.3 Traffic Monitoring and Fault Detection
  3.4 Algorithm

4 Implementation
  4.1 Test Cloud Setup
  4.2 Workload Generation
  4.3 Resource Provisioning Framework

5 Performance Evaluation
  5.1 Workload Traces
    5.1.1 Google Trace
    5.1.2 Artificial Trace
  5.2 Workload Prediction
    5.2.1 Mean
    5.2.2 Max
    5.2.3 Exponential Smoothing
    5.2.4 ARIMA
    5.2.5 Prediction Accuracy
  5.3 RUBiS Simulation
    5.3.1 SLA Violations

6 Discussion
  6.1 Limitations
  6.2 Cloud Reliability
  6.3 Future Work

7 Conclusions

Bibliography
List of Tables

3.1 ARMA parameters
3.2 List of variables used in Algorithms 1 and 2
5.1 Google workload trace
5.2 Artificial workload trace
5.3 RUBiS Simulation
List of Figures

1.1 Average traffic distribution [7]
1.2 Google cluster workload trace [37]
1.3 Animoto AutoScaling [34]
1.4 Resource provisioning approaches
2.1 Cloud Service Models
2.2 Multi-tier Web Architecture
3.1 Predictive Elastic Resource Controller
4.1 Test setup
5.1 Google cluster workload trace [37]
5.2 Artificial workload trace
5.3 Google workload trace - Mean 16 prediction
5.4 Artificial workload trace - Mean 8 prediction
5.5 Google workload trace - Max 16 prediction
5.6 Artificial workload trace - Max 8 prediction
5.7 Google workload trace - Exponential smoothing prediction
5.8 Artificial workload trace - Exponential smoothing prediction
5.9 Google workload trace - ARIMA prediction
5.10 Artificial workload trace - ARIMA prediction
5.11 Predictive and Reactive error comparison
Chapter 1

Introduction

This chapter begins with a look at the motivation, which forms the basis for choosing this area and exploring it further. The motivation also highlights the inefficiency of traditional models of resource provisioning, while pointing out the major challenges in simple reactive resource provisioning approaches. The problem statement defines the exact problem that this thesis aims to solve. We outline the major goals of this thesis and explain why these goals are important in the context of resource provisioning. Finally, we discuss the most important contributions of this thesis and the document structure.

1.1 Motivation

Large internet based enterprises such as eBay, Amazon and Netflix must adhere to very strict Service Level Agreement (SLA) requirements in terms of high availability, high throughput and low latencies. Performance degradation in any of the above mentioned metrics may result in loss of user base and hence revenue. Traditionally, enterprises tend to overprovision resources to contain the demand fluctuations in the workload. The problem with this approach is that peak workloads can vary from the average workload by a large factor. Provisioning resources for the average case means that the resources are not sufficient to handle the excess workload, which has an adverse effect on the Quality of Service (QoS). Provisioning for peak workloads results in gross underutilization, as excess resources remain idle most of the time. More often than not, internet based services experience seasonal demand, with fluctuations during the course of the day, week, month and year. Optimizing infrastructure to the varying workload demand is a complex process [29]. Figure 1.1 depicts the
daily distribution of a typical internet application in the USA, which is similar to the average SMS traffic distribution [7, 27].

Figure 1.1: Average traffic distribution [7]

Infrastructure as a Service (IaaS) cloud services have provided a paradigm shift in the way resources are provisioned and utilized. AutoScaling enables cloud subscribers to scale out their infrastructure only when the workload demand spikes up and to deprovision excess resources when the demand lowers. This is especially vital for enterprises that experience workload variations on an hourly, daily or weekly basis. Unexpected workload variations should also be handled seamlessly without a loss in QoS. Figure 1.2 is a real world workload trace from Google, depicting the variations in workload demand [28, 37].
Figure 1.2: Google cluster workload trace [37]

Figure 1.3 shows a real world scenario where AutoScaling has been successfully used [34]. Animoto, a video animation company, successfully employed AutoScaling in order to deal with a sudden surge of user demand.
Figure 1.3: Animoto AutoScaling [34]

Unfortunately, the above stated approach has some inherent flaws. The simple reactive approach to AutoScaling, as stated above, manages to optimize resource utilization and reduces the cost of the resources used, but fails to maintain the QoS during a period of increased workload demand. The following are some of the major challenges involved in the process of AutoScaling:

- Non-trivial launch times of virtual machines
- Accurate capacity estimation of virtual resources
- Unexpected workload variations

In order to address the above mentioned issues, there is a need for proactive mechanisms that can forecast changes in the workload demand and preemptively scale the infrastructure, which forms the basis for pursuing this thesis.
1.2 Problem Statement

We seek to address the following research questions in this thesis. In order to successfully utilize the many services offered by the cloud computing paradigm, we need to understand the relevant issues and address them. In this thesis work, we aim to highlight some of the most important problems in the process of AutoScaling, suggest pertinent models and frameworks that solve these problems, and demonstrate the efficiency of the suggested solutions through simulations. As mentioned earlier, the major challenges in successfully applying AutoScaling to enterprise grade applications are unexpected workload variations and the non-trivial launch times of virtual machines. Figure 1.4 shows the need for dynamic on-demand resource provisioning mechanisms as opposed to the traditional model of fixed infrastructure resources.

Figure 1.4: Resource provisioning approaches (average fixed provisioning, peak fixed provisioning and on-demand provisioning plotted against the workload)

A simple reactive approach fails in this aspect because the unexpected
workload variations may cause a sudden spike in the demand, forcing the infrastructure to scale out. While the resource controller attempts to provision new resources, the existing infrastructure must cater to the excess workload demand, affecting the QoS of the system. Typically, a virtual machine with an enterprise grade application will require about 7-10 minutes before it becomes operational. The existing infrastructure that handles the excess load during the launch time of new resources may suffer performance degradation, and the excess workload may also have a ripple effect on the first few minutes after the newly launched resources become operational. So in order to tackle this issue, there has to be a proactive resource provisioning approach that can successfully detect workload variations and spawn new resources well before the demand spike occurs. In this thesis, we chose a stochastic model to forecast workload variations and make scaling decisions well in advance, so that the overall system performance is not affected.

1.3 Targets and Methodology

The major goal of this thesis is to design an automated predictive elastic resource controller that optimizes the infrastructure according to the workload variations, preventing a loss in QoS while eliminating the need for overprovisioning. We identify issues in AutoScaling and address them with the help of our model. The other important aspect of this thesis is to build a framework for AutoScaling and perform simulations in order to compare different approaches to AutoScaling. The Rice University Bidding System (RUBiS) is employed as the use case for carrying out the simulations. RUBiS represents a complex web service, which can be deployed as a distributed multi-tier web service. Apart from predictive AutoScaling, we attempt to extend the validity of our model to address the issues in scaling distributed multi-tier applications that may deal with heterogeneous requests. The research methodology adopted in this thesis is literature study, implementation, simulation and analysis.

1.4 Contribution

This thesis outlines the most important practical issues involved in AutoScaling and explores solutions to address those issues. We have attempted to practically demonstrate the issues and the subsequent effect on the QoS by implementing a simple reactive resource controller that scales resources on certain predefined conditions. The major contribution of this thesis is the
design and implementation of an automated predictive resource controller that utilizes prediction models from statistical analysis. RUBiS was used as the experimental use case in all the simulations. We describe in detail a variety of workload prediction models and make a comparative analysis of the prediction accuracy of the chosen approaches using artificial and real world workload traces. In our evaluation, we found that the Autoregressive Integrated Moving Average (ARIMA) model outperformed the other models described in this thesis work. We implemented an ARIMA based automated predictive resource controller on an OpenNebula cloud deployment and conducted simulations to demonstrate the effectiveness of our approach. We also point out the performance variations that may occur in a multi-tenant cloud environment and provide a roadmap to address such issues. We argue that online capacity estimation methods and tools such as Customer Behavior Model Graphs (CBMG) could be vital in understanding the intricacies involved in scaling a complex service.

1.5 Document Structure

The thesis is organized as follows. Chapter 2 covers the relevant background information on cloud computing and its service models, and briefly introduces the use case of this thesis. In Chapter 3, the proposed dynamic resource provisioning framework is explained in detail, including various models for workload prediction; the algorithm for AutoScaling is also illustrated in this chapter. Chapter 4 discusses in detail the experimental setup and simulation methodology, and provides the relevant implementation details. Chapter 5 compares different prediction models with the help of artificial and real world workload traces, and also includes a comparative analysis of predictive and reactive approaches to AutoScaling. Chapter 6 provides an in-depth discussion of the issues involved in AutoScaling and extends the validity of the chosen approach to complex scenarios. Chapter 7 concludes the thesis.
Chapter 2

Background

This chapter explains the background information about cloud computing and its service models, and gives relevant details about the simulation use case.

2.1 Cloud Computing

Cloud computing broadly refers to resources and resource enablers delivered as services over the internet. In the context of cloud computing, resources could be software applications, while resource enablers could be the hardware and software stack required for running/building the applications [3]. According to the National Institute of Standards and Technology (NIST), U.S. Department of Commerce, there are five essential characteristics of a cloud deployment model [24].

On-demand self-service: The capability to provision computational resources automatically as and when the need arises.

Broad network access: Network access through standard mechanisms to allow heterogeneous clients to make use of the resources.

Resource pooling: The cloud provider's resources are pooled to serve cloud subscribers in a multi-tenant model where free resources are dynamically assigned to subscribers requesting resources. The subscriber is generally abstracted from details like the datacenter location, specific configuration, failover mechanisms etc.

Rapid elasticity: Cloud subscribers must be able to perform on-demand provisioning/deprovisioning of cloud resources. The cloud provider may
provide an illusion of unlimited resources, so that the cloud subscriber may request resources at any point in time.

Measured service: Cloud resource usage must be transparent to the cloud provider as well as the cloud subscriber. There must be provisions to monitor, control and report on cloud resource usage.

The advancements in virtualization technology, coupled with the pricing model adopted by several cloud providers, have significantly changed the dynamics of infrastructure investment. The pay-as-you-go model is a first step in reducing upfront capital expenses and converting them into operating expenses. Although the pricing model of certain public cloud providers may seem expensive, they bring additional benefits such as elasticity and scalability, thus providing ways to efficiently utilize the infrastructure, reducing the burden of overprovisioning and mitigating the risks of underprovisioning [3, 10, 33].

2.2 Cloud Service Models

Cloud services may be classified into different service models depending on the services offered to the cloud subscriber. Based on the delivery model, cloud computing can broadly be divided into three service models. Figure 2.1 presents the cloud computing stack and the three service models [19].

2.2.1 Infrastructure as a Service

Infrastructure-as-a-Service (IaaS) clouds provide computing, storage or network resources as services, delivered usually over the internet. The cloud subscribers, without an upfront commitment or capital expenditure, may get access to IaaS cloud resources within a matter of a few minutes or hours. Thus the subscribers may get rid of huge Capex costs and operational expenses, as the IaaS cloud providers offer a usage based pricing model. The most prominent player in the IaaS cloud market is Amazon Web Services (AWS), followed by Rackspace, Gogrid etc. IaaS cloud providers may offer services as dedicated or multi-tenant physical or virtual infrastructure resources. AWS's Elastic Compute Cloud, for instance, offers virtualized resources abstracting the physical and virtualization layers. The cloud subscribers get complete access to the virtual machines, from the choice of operating system to the application software installation. Although the IaaS cloud provider pricing is slightly higher than the private infrastructure pricing calculated over a sufficiently long period of time,
the distinct advantage of the IaaS cloud model is the elimination of the initial capital expense and a significant reduction in operating expenses. Also, because of the pay-as-you-go pricing model and the efficient and fast infrastructure scaling provisions offered by the cloud providers, the subscribers may choose to dynamically adapt the infrastructure to their fluctuating needs. This is extremely crucial especially for small and medium enterprises that cannot afford huge initial investment costs, and it also enables efficient utilization of cloud services, using the resources only when they are absolutely needed [14, 15].

Figure 2.1: Cloud Service Models

2.2.2 Platform as a Service

Platform-as-a-Service (PaaS) provides an application development environment, abstracting the underlying hardware, virtualization and operating system layers. PaaS is essentially an aggregation of the set of development tools that developers need for building applications and services in the cloud. PaaS further simplifies the job of cloud subscribers by hiding the complexities of hardware and software resource management, application deployment and
dynamic scaling of infrastructure to cater to the growing application needs. Developers may use PaaS services to build applications that are hosted by the PaaS provider and offered as a service to the end users, usually over the internet. The PaaS model enables enterprises to focus only on the software development cycle involved in building the application, as other aspects such as infrastructure management and dynamic scaling mechanisms are made transparent by the PaaS provider. Google App Engine (GAE), Microsoft Azure and Force.com are some of the most important PaaS cloud providers [5].

2.2.3 Software as a Service

Software-as-a-Service (SaaS) clouds provide on-demand software applications as a service offering, where the subscriber may be allowed a need based usage policy and thus a usage based pricing model. In this model, the subscriber has limited control over the physical hardware, software stack, application execution environment and other factors, unlike in IaaS or PaaS clouds. One of the most widely used applications, web-based e-mail, is conceptually software delivered as a service. Google Docs is another example of application software delivered as a service over the internet. SaaS represents a paradigm shift in the software service model, especially in the end user sector, as it reduces the need for client side software installation and demands less of the client system than the traditional software service model [3].

2.3 IaaS Cloud Deployment Models

IaaS clouds offer lower level resources such as computation, network and storage as a service to the subscribers. IaaS clouds may be categorized primarily into three deployment models.

2.3.1 Public Cloud

Public cloud providers are organizations that provide IaaS services to one or more subscribers on a pay-per-use principle. The emergence of public clouds considerably reduces the Capex involved in purchasing and setting up the infrastructure for enterprises. Also, most of the public clouds provide a REST based API for automatic provisioning and deprovisioning of resources. This gives subscribers the freedom to dynamically adapt the resources, provisioning them only when the workload demand requires and deprovisioning them when the demand subsides. With the arrival of public clouds, the subscribers
can afford to make short term plans for decisions on resource allocation and optimization. The most prominent public cloud providers are Amazon Web Services, Rackspace, Gogrid, IBM etc. The key enablers of such a model are the advancement of virtualization, where virtual resources can be spawned in a matter of a few minutes, the usage-based pricing model and cloud provider APIs for automatic control of the infrastructure. The downside of the public cloud deployment model is that the subscribers are generally not in control of the underlying hardware and software stacks involved in infrastructure management, the location of datacenters, the security implications of multi-tenancy and country specific laws that may impose restrictions on sensitive information crossing geographical borders [3] [20].

2.3.2 Private Cloud

Conceptually a private cloud is very similar to a public cloud, except that the infrastructure services offered are internal to the organization owning the infrastructure. The private cloud deployment model is an efficient way of managing large infrastructures. A private cloud essentially is an infrastructure management software solution that enables organizations to deploy resources effectively, monitor the health and performance of the infrastructure, and eases the administrative tasks involved in managing large scale infrastructures [31]. Eucalyptus, OpenNebula and OpenStack are the three most visible players in the private cloud space. The most distinct advantage of a private cloud is that the organization has total control over the infrastructure, unlike in a public cloud. The private cloud infrastructure is fixed and dedicated and hence cannot cater to workloads beyond a certain limit, while public clouds provide an illusion of unlimited resources. Although private clouds do not bring a lot of value to an organization apart from providing ways and means for efficient infrastructure management, private cloud adoption eases the future migration to a public cloud or the usage of a hybrid architecture utilizing the private infrastructure as well as the infrastructure services offered by a public cloud provider.

2.3.3 Hybrid Cloud

A hybrid cloud is a composition of the private and public cloud deployment models, enabling data and application portability, and seeking ways to overcome the shortcomings of the private cloud while addressing the risks attached with
the public clouds [24]. An ideal deployment scenario would work in such a way that the private cloud is designed to handle the average workload, while public cloud resources are provisioned dynamically to deal with the excess demand. The main advantage of a hybrid cloud is the flexibility of using a public cloud only when the demand grows beyond a certain threshold and switching back to private-only mode to handle the average case. This approach tries to address the major hindrances in using a public cloud, such as security implications, restrictions imposed by country specific laws, multi-tenancy etc., while also taking into account the burden of overprovisioning a private cloud. The main challenge of using a hybrid cloud is to strike the right balance between the private and public cloud components. Enterprises dealing with sensitive information, such as medical records, telecommunication operators and government data, must also explore ways in which they can offload some of their non-critical computational tasks to the public clouds [11].

2.4 Multi-tier Architectures

Multi-tier architectures have become the industry standard in designing scalable client-server applications. Traditionally, enterprises with in-house infrastructure deployments used to have large scale dedicated servers to handle each tier of the application. But of late, large dedicated servers have given way to smaller distributed virtual servers. Figure 2.2 represents a two tier architecture with a web front end and a database.

2.5 Rice University Bidding System

Rice University Bidding System (RUBiS) is an auction site benchmark that implements the core functionality of an auction site. It has provisions for selling, browsing and bidding on items. RUBiS also implements different sessions for different types of users in the form of visitor, buyer and seller. A registered user can sell, buy and browse through the website, while a visitor is only allowed to browse through the different sections of the website. RUBiS is modeled after ebay.com, with a web front end and a database that maintains the records of users, items, bids etc. A client with a web browser can perform in total 26 interactions that include browsing items by region or category, bidding, buying and selling of items. RUBiS represents a dynamic web application with several web pages that require interactions with the database. It is often used to study application design patterns and
evaluate performance bottlenecks. In this thesis, we use RUBiS as the use case for all simulations [2].

Figure 2.2: Multi-tier Web Architecture (clients, a web load balancer in front of the web tier, and a database load balancer in front of the database tier)
Chapter 3

Design Overview

In this chapter, we discuss the resource provisioning framework and the predictive models used in the framework, and also touch upon different scaling algorithms.

3.1 Dynamic Resource Provisioning Framework

The process of AutoScaling in the cloud involves dealing with challenges on many fronts. A dynamic resource provisioning framework (resource controller henceforth) consists of several individual components collaborating with each other. A resource controller may have the following basic functionalities:

- Monitor: Traffic monitoring for detecting workload changes. The monitor function may also include monitoring the infrastructure in order to detect faults.
- Analyze: Workload modeling and subsequent decisions on infrastructure scaling.
- Act: Implementing the scaling decision prompted by the analyze function.

Designing an automated elastic resource controller that ensures high availability and carrier grade performance is a complex task. Considering the relatively long startup time of a virtual machine (VM), it is essential that the resource controller provisions resources well before the need arises. In this thesis, we consider two kinds of resource controllers: reactive and proactive. A reactive resource controller implements the basic functionalities of
monitoring and analysis, and based on certain preconditions decides on the scaling. On the other hand, a proactive resource controller includes a prediction function to predict future workload changes and implements the scaling decisions based on the predicted changes. Figure 3.1 depicts the resource controller framework [32].

Figure 3.1: Predictive Elastic Resource Controller

3.2 Predictive Models for AutoScaling

Workload demand data obtained as a result of periodic monitoring can be considered as time series data. Analyzing the workload data with the help of predictive models for time series analysis helps us make a short term prediction of the workload demand. A carefully constructed predictive model, with an ability to accurately predict future workload changes, is essential for mitigating the problems of reactive resource provisioning. In this thesis, we have studied and analysed the following prediction models.
3.2.1 Mean

According to this scheme, the predicted demand is the mean value of all the samples in a given window. In this method, also called the moving averages method, the prediction is the result of averaging the most recent t values of the series. The main drawback of this method is that the peaks are dampened, resulting in an underestimation error. The relative accuracy of the prediction depends on the window size considered and also on the pattern of the workload trace. The following equation provides a mathematical representation [13].

X(n+1) = (1/t) · Σ_{i=n+1−t}^{n} X(i)    (3.1)

3.2.2 Max

In this method, the prediction value is the maximum value among the samples in a given window. Given a reasonable window size, this method tries to provide a safe estimation by choosing the maximum value observed in the recent past. In the context of workload prediction, the goal is to estimate the peak values accurately, as the emphasis is on ensuring QoS. The method is not efficient in terms of resource usage and at best prepares the system for the worst case scenario that occurred in the recent past [13].

3.2.3 Exponential Smoothing

The moving averages model can be classified as a naive prediction model mainly used for smoothing the curve. Essentially, a moving average model dampens the peaks and elevates the troughs. In the context of AutoScaling through workload prediction, over-estimation errors and slightly over allocated resources are tolerable, while underestimation of peaks may result in under provisioning, taking a toll on the performance and QoS of the system. The weighted moving averages model, a slight improvement on the moving averages model, assigns weights to observations that occur in the recent past. The most recent observations get more weight than older observations. Based on the underlying pattern of the time series, the weights can be estimated to reflect the pattern in future observations as well. Exponential smoothing is similar to the weighted moving averages approach, as it assigns exponentially decreasing weights to observations, giving preference to the most recent observation [16]. This approach is also a simplistic estimation as it tries to create a trend based on the most recent observation. The mathematical representation of
the model is provided in equation 3.2.

X'(t) = α X(t−1) + (1 − α) X'(t−1)    (3.2)

where 0 < α ≤ 1 and t ≥ 3. The process of smoothing begins with the assumption that X'(2) = X(1). Iteratively substituting the smoothed observations of the past in terms of past observations of the time series, the smoothed observation is expressed as past observations with exponentially decreasing weights.

X'(t) = α [X(t−1) + (1 − α) X(t−2) + (1 − α)^2 X(t−3) + ... + (1 − α)^(t−3) X(2)] + (1 − α)^(t−2) X(1)    (3.3)

3.2.4 ARIMA

The autoregressive moving average (ARMA) model is well suited for modeling stationary time series data. In order to fit non-stationary time series data, the ARMA model is further generalized to include differencing, which helps deal with both stationary and non-stationary time series data. This class of models is referred to as the autoregressive integrated moving average (ARIMA) model. An ARIMA model is essentially made up of an autoregressive (AR) component of lagged observations and a moving average (MA) component of past error terms. The AR component expresses the current observation as a function of past observations of the same time series, whereas the MA component is influenced by past and current errors [6]. An ARIMA(p,d,q) model is usually represented in terms of the number of AR parameters, differences and MA parameters used in building the model. The order, in the case of the AR and MA components, represents the number of lags in terms of past observations or past errors. According to the principle of parsimony, simplistic models with fewer parameters are always preferred to complex models with several parameters [23]. Equation 3.4 represents a time series expressed in terms of an AR(n) model.

X'(t) = µ + α_1 X(t−1) + α_2 X(t−2) + ... + α_n X(t−n)    (3.4)
Equation 3.5 represents a time series expressed in terms of moving averages of white noise or error terms.

X'(t) = µ + β_1 ɛ(t−1) + β_2 ɛ(t−2) + ... + β_n ɛ(t−n)    (3.5)

In this thesis, we have chosen a single order ARMA model, altering the number of differencing operations between 0 and 2. Equation 3.6 represents a single order ARMA(1,1) model.

X'(t) = µ + α X(t−1) + β ɛ(t−1) + ɛ(t)    (3.6)

where µ = Mean × (1 − α), 0 < α ≤ 1 and 0 < β ≤ 1.

Table 3.1: ARMA parameters
Parameter   Description
α, β        ARMA parameters
ɛ           White noise
µ           Constant

3.3 Traffic Monitoring and Fault Detection

Monitoring is essential in estimating the performance of virtual servers and observing changes in the workload traffic pattern. Monitoring, as a function, can be applied at the level of the frontend load balancers in order to observe workload changes, and at the infrastructure level, to estimate the performance of resources and detect faults. Most public clouds today are multi-tenant, allowing different subscribers, possibly competing subscribers, to run their virtual machines on the same physical host. While multi-tenancy certainly has some security implications, recent studies have suggested the possibility of performance implications as well. An overloaded virtual machine may have an adverse effect on other virtual machines running on the same host. According to Barker and Shenoy [4], the performance of a virtual machine varies quite significantly when observed over longer periods of time. Hence infrastructure monitoring and fault detection are essential in maintaining high availability and a stable QoS.
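The simple predictors of Section 3.2 can be written in a few lines. The sketch below is a minimal Python illustration, not the prototype code used in this thesis: it implements the mean, max and single exponential smoothing forecasts of equations 3.1 and 3.2, and the window size t, the smoothing coefficient α and the sample trace values are arbitrary choices made here purely for illustration.

# Illustrative sketch of the simple one-step-ahead predictors from Section 3.2.
# Not taken from the thesis prototype; a minimal Python rendering of
# equations 3.1 and 3.2 for a list of workload observations.

def predict_mean(history, t):
    """Equation 3.1: average of the most recent t observations."""
    window = history[-t:]
    return sum(window) / len(window)

def predict_max(history, t):
    """Worst-case estimate: maximum of the most recent t observations."""
    return max(history[-t:])

def predict_exponential_smoothing(history, alpha):
    """Equation 3.2: single exponential smoothing, seeded with X'(2) = X(1)."""
    forecast = history[0]                      # X'(2) = X(1)
    for observation in history[1:]:            # X(2) ... X(n)
        forecast = alpha * observation + (1 - alpha) * forecast
    return forecast                            # X'(n+1)

# Example: requests/sec sampled at five-minute intervals (hypothetical values).
trace = [420, 450, 480, 520, 610, 700, 690, 720]
print(predict_mean(trace, 4), predict_max(trace, 4),
      predict_exponential_smoothing(trace, 0.5))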
3.4 Algorithm

In order to effectively handle the workload variations, the first and foremost step is to accurately estimate the load handling capacity of a server. Typically, a complex web service like RUBiS must deal with heterogeneous web requests that include browsing through regions, reading bids, posting new items for auction etc. In other words, the requests could be HTTP GET or POST, which may or may not involve a call to the database. In such a scenario, one obvious way is to estimate capacity as a function of the number of online users. However, for the simulations carried out in this thesis work, only HTTP GET requests are considered, hence HTTP requests/sec is sufficient as a metric for capacity estimation. One other important aspect of AutoScaling is that, in order to maintain the QoS in trying conditions, the system must scale up early, while the scale down operation can be slower and less critical than the scale up operation [26]. The AutoScaling approach employed in this thesis work is outlined in Algorithm 1.

Algorithm 1 Resource Controller [27]
1: start VM in C_pr
2: VM_cur <- VM_pr
3: while true do
4:   monitor W_a(n);
5:   W_ft(n+1) <- Forecast(W_a(n));
6:   VM_λ <- map(W_ft(n+1));
7:   if VM_cur < VM_λ then
8:     scale up;
9:   else if VM_cur > VM_λ then
10:    scale down;
11:  end if
12:  sleep k;
13: end while

Algorithm 2 explains the workload prediction, while Table 3.2 lists the parameters and variables used in the algorithms. Non-stationary time series data must be stationarized in order to work with prediction models such as ARMA. Hence, a stationarize block is included in the algorithm, which comes into play in case the time series data is non-stationary. The resource controller actively monitors the infrastructure for changes in the workload demand as well as the currently running virtual machine instances. Given the workload demand prediction, the resource controller calculates the appropriate
number of virtual machine instances required to handle the demand. If the required number is more than the current number of virtual machine instances, the resource controller launches new instances based on the predicted workload demand. Similarly, if the required number is less than the current number, the resource controller scales down the infrastructure.

Algorithm 2 Forecast [27]
1: W_s(n) <- stationarize(W_a(n));
2: µ = M × (1 − α);
3: ɛ(n) = W_at(n−1) − β ɛ(n−1);
4: W_st(n+1) = µ + α W_st(n) + β ɛ(n);
5: W_ft(n+1) <- destationarize(W_st(n+1));
Table 3.2: List of variables used in Algorithms 1 and 2.

Variable      Description
W_a(n)        Actual workload till the n-th instant
W_at(n)       Actual workload at the n-th instant
W_st(n)       Stationarized workload at the n-th instant
W_ft(n+1)     Forecasted workload at the (n+1)-th instant
C_pr          Private cloud
C_pu          Public cloud
map           Returns VM_λ
VM_cur        Number of running VMs
VM_pr         Fixed number of running VMs in the private cloud
VM_λ          Optimum number of VMs
k             Monitoring interval
W_s(n)        Stationarized actual workload values
α             AR coefficient
β             MA coefficient
µ             ARMA constant
ɛ             White noise
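To show how Algorithms 1 and 2 fit together, the following Python sketch gives a minimal rendering of the controller loop and of an ARMA(1,1) forecast applied to a first-order differenced (stationarized) series. It is not the prototype implementation used in this thesis: the capacity per VM, the α and β coefficients, the monitoring interval, and the get_workload_history / launch_vm / terminate_vm callbacks are all hypothetical placeholders chosen for illustration.

# Illustrative Python sketch of Algorithms 1 and 2 (not the thesis prototype).
import time

def forecast(workload, alpha, beta):
    """Algorithm 2: one-step ARMA(1,1) forecast on the first-differenced series."""
    if len(workload) < 3:
        return workload[-1]                        # too little history: repeat last value
    diffs = [b - a for a, b in zip(workload, workload[1:])]   # stationarize
    mu = (sum(diffs) / len(diffs)) * (1 - alpha)               # µ = mean · (1 − α)
    eps = 0.0
    forecast_d = diffs[0]
    for d in diffs[1:]:
        eps = d - forecast_d                       # error of the previous forecast
        forecast_d = mu + alpha * d + beta * eps   # equation 3.6 on the differences
    return workload[-1] + forecast_d               # destationarize

def resource_controller(get_workload_history, launch_vm, terminate_vm,
                        capacity_per_vm=300, k=60, alpha=0.8, beta=0.3,
                        vm_current=1):
    """Algorithm 1: monitor, forecast, map to a VM count, then scale."""
    while True:
        history = get_workload_history()           # W_a(n): requests/sec samples
        predicted = forecast(history, alpha, beta) # W_ft(n+1)
        vm_needed = max(1, -(-int(predicted) // capacity_per_vm))  # ceiling map
        if vm_current < vm_needed:
            launch_vm(vm_needed - vm_current)      # scale up
        elif vm_current > vm_needed:
            terminate_vm(vm_current - vm_needed)   # scale down
        vm_current = vm_needed
        time.sleep(k)                              # monitoring interval k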
Chapter 4

Implementation

This chapter deals with the practical setup and implementation details used in the simulations. Specifically, this chapter covers the test setup architecture, the implementation of the dynamic resource provisioning framework, and the workload modeling and generation tools.

4.1 Test Cloud Setup

In order to implement and test the hypothesis about efficient AutoScaling using prediction models, a private cloud was set up for conducting simulations. We used the private cluster at Aalto University and installed OpenNebula to manage and administer it. OpenNebula, a stable open source cloud project, was chosen as the private cloud management software because of its clean installation process, ease of use and simplified cloud API that lets administrators and developers programmatically control the infrastructure. Although the concepts advocated in this thesis largely apply to public IaaS cloud platforms, they are platform agnostic. A public IaaS cloud platform also offers a simple API for programmatic access to control the resources, and hence it is relatively simple to apply these concepts in practice on a public cloud platform like Amazon EC2. One major reason for conducting the experiments on a private cloud is that a private cloud operating in a private network offers an environment to conduct experiments with different configurations. For instance, with a private cloud setup, it is possible to eliminate the side effects that may occur in a multi-tenant environment, while the same cannot be said for a public cloud platform. One of the key factors for efficient AutoScaling is API access to the infrastructure that enables automatic control of the resources. Figure 4.1 shows the practical test setup used for the simulations. To simulate variable rate
workload requests, JMeter, a workload generation tool, was used. JMeter in this case represents the clients, sending out multiple concurrent requests. The requests are directed at a common interface, the load balancer, which then distributes the requests to the virtual servers running in the cloud. HAProxy is used as the load balancer and reverse proxy, collecting all the requests and distributing them to the private virtual servers in the cloud. The virtual servers running RUBiS process the requests and send the responses back to the clients through the load balancer. Tcpstat is used for periodic monitoring of the workload request rate at the load balancer.

Figure 4.1: Test setup (JMeter clients, the HAProxy frontend with tcpstat monitoring, the AutoScaler issuing XML-RPC requests to the OpenNebula frontend, and VMs on the OpenNebula node)

4.2 Workload Generation

Testing the performance of a web service distributed over several virtual servers requires a comprehensive suite of load generation tools. To simulate
several hundreds of concurrent requests, a complex multi-threaded application is required. Thankfully, several such comprehensive applications already exist. Apache JMeter is one of the most prominent load testing applications, with provisions to test the functional behavior and performance of static and dynamic network based applications. JMeter is a complete Java based load generation tool with a multi-threaded framework providing the ability to generate several concurrent requests. JMeter, besides providing a user-friendly GUI to build test plans, also has provisions for scheduling test runs, controlling request rates etc. [12]. In this thesis, the goal was to generate random, variable rate workload requests and check the system performance. JMeter, in addition to a large number of attractive features, also allows the test clients to be distributed over multiple machines. This is especially helpful in scenarios where the system must be tested for extremely high load rates, generating several thousands of requests per second. It is also possible to pre-define a workload trace and let JMeter follow the trace by generating requests according to the rates defined by the trace. One extremely useful open source JMeter plugin to achieve such a task is the Throughput Shaping Timer. This plugin allows for user defined traces, where a user can define the request rate for any given period of time and construct a workload trace accordingly. JMeter can then be used to generate the requests according to the trace defined by the user. One such workload trace was used for the simulations with the reactive and predictive resource controllers. Chapter 5 illustrates the specific details.

4.3 Resource Provisioning Framework

The resource controller is the focal point that controls all the components on the cloud side. The resource controller has access to the load balancer, the monitoring data and the OpenNebula frontend for requesting additional resources on demand. The OpenNebula frontend runs an XML-RPC server as an interface for querying and requesting infrastructure changes. A simple reactive resource controller analyses the data obtained from tcpstat and decides on the scaling action. To implement the scaling decision, the resource controller makes an XML-RPC request to the frontend, requesting additional virtual servers. The predictive resource controller, in addition to the basic functionalities, predicts the future workloads based on a recent history of data obtained from the monitoring software. Based on the predicted value and pre-defined threshold limits that dictate the workload handling capacity of virtual servers, the resource con-
troller preemptively makes a request to launch more instances, if needed. The resource controller program aggregates the data collected from the load balancer and the monitoring data from the OpenNebula frontend, and also implements an XML-RPC client that can query and request resources from the frontend XML-RPC server. Besides this, the predictive resource controller also implements a prediction function that predicts future workload values based on the history of workload data [21].
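To make the controller-to-frontend interaction concrete, the sketch below shows how such an XML-RPC request could look in Python. It is an illustration rather than the code used in this work: the endpoint, credentials and VM template are hypothetical, and the exact signature and reply format of OpenNebula's one.vm.allocate call should be verified against the deployed OpenNebula version.

# Minimal sketch of how the resource controller can request a new virtual
# server from the OpenNebula frontend over XML-RPC. Not the thesis prototype:
# endpoint URL, credentials and VM template are placeholders, and the exact
# one.vm.allocate signature can vary between OpenNebula versions.
import xmlrpc.client

ENDPOINT = "http://frontend.example.org:2633/RPC2"   # hypothetical frontend
SESSION = "oneadmin:password"                        # hypothetical credentials

VM_TEMPLATE = """
NAME   = rubis-web
CPU    = 1
MEMORY = 1024
DISK   = [ IMAGE = "rubis-web-image" ]
"""

def launch_vm(count=1):
    """Ask the OpenNebula frontend to instantiate 'count' RUBiS web servers."""
    server = xmlrpc.client.ServerProxy(ENDPOINT)
    vm_ids = []
    for _ in range(count):
        # OpenNebula replies with an array whose first element is a success
        # flag and whose second is the new VM id (or an error message).
        response = server.one.vm.allocate(SESSION, VM_TEMPLATE)
        if not response[0]:
            raise RuntimeError("VM allocation failed: %s" % response[1])
        vm_ids.append(response[1])
    return vm_ids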
Chapter 5

Performance Evaluation

In order to evaluate and compare the performance of the different prediction models described in Chapter 3, we chose to test the models with real world and artificial workload traces. In this thesis, we used two such traces: a real world open source workload trace and an artificial workload trace. We evaluate the performance of each prediction model in terms of the prediction error, taking into account both under-estimation and over-estimation errors. Under-estimation error refers to an error scenario wherein the predicted value is less than the actual value. Similarly, over-estimation error is an error condition where the predicted value is greater than the actual value. We compare the efficiency by considering the mean squared error (MSE) and the mean percentage error (MPE).

5.1 Workload Traces

In this section, we present the pictorial representation of the real world and artificial workload traces and other relevant details. In the scope of this thesis, we have modeled the workload demand in terms of the number of requests per second. However, there are several other metrics, such as the number of active users, processor utilization etc. The following subsections provide more details about the individual traces.

5.1.1 Google Trace

Google recently open sourced two real workload traces from a Google production cluster. The first trace was a relatively short one, about 7 hours long. The second trace was much longer and covered a much bigger cluster. The second trace is 29 days long and includes a variety
of information about resource requests, request failures, priorities, request constraints etc. [37]. In order to apply this workload trace in the context of this thesis, we have extracted the provided data into requests per second collected over five minute intervals. The priorities, failures and other irrelevant aspects of the trace were ignored. In the gamut of data provided, the most interesting part of the trace is the first week, as there is a sudden spike occurring within a few minutes. So we considered the first week of trace data for the experiments with the prediction models. Figure 5.1 shows the workload trace used in the thesis work.

Figure 5.1: Google cluster workload trace [37]

5.1.2 Artificial Trace

The artificial workload trace was created with the help of JMeter and its associated plugins. In this thesis, this particular workload trace was also used for comparing the performance of the reactive and predictive elastic resource controllers. As mentioned earlier, the test use case was RUBiS and JMeter was used for workload generation. The Throughput Shaping Timer plugin is an open source JMeter plugin that allows, among other things, the creation of workloads according to the needs of the user. The user can decide on
the amount of load that JMeter should generate at any given instant of time, thus simplifying the process of variable rate workload generation. Figure 5.2 presents the workload trace. The x-axis represents the time in minutes, while the y-axis denotes the number of HTTP requests generated on a per second basis.

Figure 5.2: Artificial workload trace

5.2 Workload Prediction

The four prediction models considered in this thesis are tested for prediction accuracy and prediction error. The real world and artificial traces are used for simulating one-step ahead prediction, and a comparative analysis of the prediction models is performed. In this section, we present the figures depicting the workload prediction for the two traces and a table comparing the prediction errors.

5.2.1 Mean

Mean is an estimation procedure that dampens the crests and elevates the troughs. A window size of 16 was used for the Google workload trace, while a window size of 8 was used for the artificial trace. The window
size usually depends on the length of the trace as well as the fluctuations in the trace. Figures 5.3 and 5.4 represent the prediction curves.

Figure 5.3: Google workload trace - Mean 16 prediction

Figure 5.4: Artificial workload trace - Mean 8 prediction
5.2.2 Max

Max is also a simplistic estimation procedure that prepares for the worst case scenario that occurred in the recent past. Given a window size, max always chooses the maximum occurring observation. A window size of 16 was used for the Google workload trace, while a window size of 8 was used for the artificial trace. The window size usually depends on the length of the trace as well as the fluctuations in the trace. Figures 5.5 and 5.6 represent the prediction curves.

Figure 5.5: Google workload trace - Max 16 prediction
Figure 5.6: Artificial workload trace - Max 8 prediction

5.2.3 Exponential Smoothing

In this thesis, the single exponential smoothing technique was employed. Exponential smoothing assigns weights to the observations from the recent past in an exponentially decreasing manner. Hence the most recent observations get more weight than the less recent ones. Separate smoothing coefficients were used for the Google and the artificial traces. Figures 5.7 and 5.8 depict the prediction curves with exponential smoothing as the prediction model.
Figure 5.7: Google workload trace - Exponential smoothing prediction

Figure 5.8: Artificial workload trace - Exponential smoothing prediction
5.2.4 ARIMA

The Google workload trace provides significant data which can be used for model building and the parameterization of the ARIMA constants. An ARIMA(1,1,1) model was used, as first order differencing was sufficient in this case; the α and β parameters were estimated from this data.

Figure 5.9: Google workload trace - ARIMA prediction

The artificial workload trace is shorter, and hence the first few minutes of the trace are utilized for model building and ARIMA parameterization. In this case an ARIMA(1,2,1) model was used, where the order of differencing was 2, and the α and β parameters were estimated accordingly. Figures 5.9 and 5.10 represent the prediction curves with ARIMA.
Figure 5.10: Artificial workload trace - ARIMA prediction

5.2.5 Prediction Accuracy

Tables 5.1 and 5.2 list the MSE and MPE errors for the four chosen prediction models tested with the real world and artificial workload traces. MPE(OE) and MPE(UE) stand for MPE over-estimation and MPE under-estimation. Over-estimation error occurs when the predicted value is more than the actual or measured value. Similarly, under-estimation error is an error condition that occurs when the predicted value is less than the actual or measured value.

Table 5.1: Google workload trace
Model    MSE    MPE(OE) in %    MPE(UE) in %
Mean
Max
ES
ARIMA

MSE is usually a better indicator of performance for workloads with a lot of variations. However, MPE does highlight the overall deviation in terms of
relative percentages, but it need not necessarily provide an accurate measure of a model's performance. When the observed value is small, even a tolerable deviation is highlighted in an alarming fashion, because the percentage is always calculated relative to the actual observed value. Consider, for example, that an error of 5 when the observed value is 10 results in a 50% error, while an error of 50 when the observed value is 500 results in a 10% error. Since MPE averages all the error percentages, tolerable errors may elevate the overall error percentage. Hence the error in absolute numbers can complement the MPE and provide a comprehensive picture of the performance of a model.

Table 5.2: Artificial workload trace
Model    MSE    MPE(OE) in %    MPE(UE) in %
Mean
Max
ES
ARIMA

From Tables 5.1 and 5.2 it is quite clear that ARIMA and exponential smoothing are by far the most consistent and efficient prediction models. Expectedly, Max-8 and Max-16 produced huge over-estimation errors and comparatively lower under-estimation errors. This is mainly due to the fact that Max-8 and Max-16 provide worst case estimates for the future workloads by assigning the highest occurring observation from the recent past. Similarly, Mean-8 and Mean-16 perform only slightly better than Max, resulting in higher MSE and MPE. ARIMA, on the other hand, outperforms exponential smoothing in terms of lower MSE values and lower under-estimation errors. It is a common notion in workload forecasting that slightly over-estimated forecasts are usually tolerable, but under-estimation errors are more critical. ARIMA produces lower under-estimation errors while maintaining lower MSE values.

5.3 RUBiS Simulation

The workload trace depicted in Figure 5.2 is simulated with JMeter clients to compare the performance implications of predictive AutoScaling in the cloud. RUBiS is configured as a distributed web service running across multiple virtual servers in the cloud. The configuration is tested with two different
resource controllers: reactive and predictive. JMeter was configured with the Throughput Shaping Timer to follow the request rate pattern in Figure 5.2, and it reports several metrics like the average response time, average throughput etc. One crucial factor for internet based services is the response time during service demand peaks. In order to have better points of comparison, we have defined a threshold for the acceptable response time of a request. For this simulation, any web request with a response time of more than 1000 milliseconds is considered a service level violation. Based on the comparative analysis of the prediction models, the ARIMA(1,2,1) model was chosen as the best fit for the simulations. Table 5.3 illustrates the performance impact of using prediction models in comparison with a simple reactive approach.

Table 5.3: RUBiS Simulation
Metric                      Reactive    Predictive
Throughput (msg/s)
Avg. response time (msec)
Violations (%)

5.3.1 SLA Violations

The predictive approach performed better than the reactive one in all of the metrics mentioned in the table. The throughput was clearly higher, the average response time was lower and, most importantly, the number of violations was significantly lower than in the reactive approach. Considering that there were in total about 2 million requests during the course of the simulation, a 1.45 % improvement in SLA violations speaks volumes about the efficiency of the predictive approach.
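As an illustration of how the SLA violation percentage can be derived from raw measurements, the short Python sketch below counts requests whose response time exceeds the 1000 ms threshold defined above. It is not the measurement script used in the simulations, and the sample values are hypothetical.

# Illustrative counting of SLA violations from per-request response times,
# using the 1000 ms threshold defined for this simulation. Not the actual
# measurement script; the sample values are hypothetical.

SLA_THRESHOLD_MS = 1000

def sla_violation_rate(response_times_ms):
    """Return the percentage of requests whose response time breaks the SLA."""
    if not response_times_ms:
        return 0.0
    violations = sum(1 for rt in response_times_ms if rt > SLA_THRESHOLD_MS)
    return 100.0 * violations / len(response_times_ms)

# Example: response times in milliseconds collected during a load test.
samples = [220, 340, 1250, 980, 1600, 450, 300, 1100]
print("SLA violations: %.2f %%" % sla_violation_rate(samples))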
In order to reiterate the significance of SLA violations, Figure 5.11 depicts the per minute error percentage during the steepest scale up in the simulation. The graph covers the violations for about 20 minutes, from 36 minutes to 56 minutes into the simulation, where the curve hits the peak demand. Clearly, the predictive resource controller outperformed the reactive controller and brought a significant improvement in the overall system performance.

Figure 5.11: Predictive and Reactive error comparison
Chapter 6

Discussion

In this thesis, we presented a dynamic resource provisioning framework capable of predicting future workloads and optimizing the overall system performance. We chose RUBiS as the use case for the simulations, as it represents a web service similar to Amazon, eBay and other SLA critical services. RUBiS can also be configured as a multi-tier web service, separating the web frontend from the database backend. In order to evaluate the relative efficiency of the models in comparison, we chose average response time, average throughput and service level violations as the critical performance indicators. A service level violation is defined by the acceptable response time for a request, which in the context of this thesis is 1000 msec. Based on the experimental evaluation, the predictive resource controller excelled in all of the critical performance indicators. We observed that the predictive approach produced very few service violations overall, and the performance of the system during the scale-up period was tolerable and superior to the performance obtained with the reactive resource controller.

6.1 Limitations

The research work presented in this thesis has certain limitations. The practical simulations were conducted under the assumption that the workloads are homogeneous by nature. Ideally, a reasonable metric for measuring the load on a system such as the one used in this work would be the number of active users. Beyond the number of active users, the heterogeneous user requests could be further segregated to evaluate their true impact on the system under test. Often, for websites such as eBay and Amazon, a CBMG is constructed from historical data to effectively estimate the future load on the individual tiers of the system [22]. Workload prediction models in the context of multi-tier application architectures, coupled with the estimates provided by CBMG graphs, could form a potent framework for dealing with variable workloads, as the sketch below illustrates.
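As a hedged illustration of the CBMG idea, the sketch below treats customer behavior as an absorbing Markov chain over page types and uses its fundamental matrix to turn a predicted session arrival rate into per-page-type request rates. The states, transition probabilities and arrival rate are invented for illustration and are not taken from this thesis or from [22].

import numpy as np

# Hypothetical CBMG for an auction-style site; 'exit' is the absorbing state.
states = ["home", "browse", "bid"]        # transient states
Q = np.array([                            # transitions among transient states
    [0.0, 0.7, 0.1],                      # remaining probability leads to 'exit'
    [0.0, 0.5, 0.3],
    [0.0, 0.4, 0.2],
])

def expected_visits_per_session(Q, start=0):
    """Expected visits to each transient state per session, via the
    fundamental matrix N = (I - Q)^-1 of the absorbing Markov chain."""
    N = np.linalg.inv(np.eye(len(Q)) - Q)
    return N[start]                       # row for the entry state

# Turn a predicted arrival rate of 200 new sessions/min into per-page-type load:
visits = expected_visits_per_session(Q)
print({s: round(200 * v, 1) for s, v in zip(states, visits)})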
For the ease of simulation, however, we confined ourselves to HTTP GET requests and measured the load on the system as a function of requests per second.

The other major limitation is the capacity estimation process followed in this thesis. Capacity estimation is an extremely important step in designing and implementing a dynamic resource provisioning framework. In this work we followed a static capacity estimation procedure, which defines fixed thresholds based on which auto-scaling is performed. For instance, given a VM configuration, we tested the RUBiS VM performance at different request rates and arrived at an optimal threshold value at which the VM performed reasonably well, with few service violations. Unfortunately, applying this approach in a realistic scenario may not be efficient because of the performance variations that can occur in a multi-tenant public IaaS cloud environment.

6.2 Cloud Reliability

Several research results [8, 9, 17, 35, 36] show the effect of multi-tenant hosts running VMs of different cloud subscribers. As argued above, static capacity estimation techniques may not be efficient under such performance variations, and there is a need for dynamic capacity estimation techniques that take the under-performance of a VM into account [18, 30]. One effective approach to addressing this issue could be to mathematically model the correlation between throughput, service violations, average response time and VM performance metrics such as processor and memory utilization. In this way, a badly performing VM could be identified, and the workload handled by that particular VM could be reduced, or additional resources provisioned, in order to maintain the QoS. A small sketch of this idea follows.
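The following is a minimal sketch, under assumed metric names and thresholds, of how per-VM monitoring data could be combined to flag an under-performing VM: each VM's throughput per unit of CPU is compared against the fleet average, and its violation rate is checked against the SLA target. None of the field names or thresholds come from the thesis.

from dataclasses import dataclass
from statistics import mean

@dataclass
class VmStats:
    name: str
    throughput: float      # requests/s served
    cpu_util: float        # average CPU utilization, 0..1
    violation_rate: float  # fraction of requests over the 1000 ms threshold

def underperforming(vms, efficiency_margin=0.7, max_violation_rate=0.02):
    """Flag VMs whose throughput-per-CPU falls well below the fleet average
    or whose violation rate is too high (illustrative sketch)."""
    efficiency = {v.name: v.throughput / max(v.cpu_util, 1e-6) for v in vms}
    fleet_avg = mean(efficiency.values())
    return [v.name for v in vms
            if efficiency[v.name] < efficiency_margin * fleet_avg
            or v.violation_rate > max_violation_rate]

vms = [VmStats("vm-1", 420, 0.62, 0.005),
       VmStats("vm-2", 180, 0.60, 0.031),   # possible noisy-neighbour victim
       VmStats("vm-3", 410, 0.58, 0.004)]
print(underperforming(vms))                 # flags 'vm-2'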
6.3 Future Work

Auto-scaling in a real-world scenario would require taking into account several complex aspects such as multi-tier application architectures and longer VM startup times. Auto-scaling multi-tier applications is the next logical problem to solve, while accommodating longer VM startup times requires a detailed study of multi-step prediction models. We would also like to evaluate the impact of multi-tenancy on the performance of an application and explore ways to mitigate its effects. The possible performance degradation of sensitive applications in multi-tenant cloud environments could be addressed in two ways. Firstly, from a cloud provider perspective, there is a need for novel VM scheduling and placement algorithms that consider the issue of multi-tenancy [25]. The second approach would be to find alternative ways to mitigate the effect from a cloud subscriber's point of view. Unfortunately, traditional monitoring methods may not reveal an under-performing VM, and hence we need to explore ways to estimate the health of a VM based on several performance metrics such as throughput and response time. Another interesting topic of research is to investigate the advantages of a hybrid approach to resource provisioning. A hybrid resource provisioning approach includes both proactive and reactive components, in order to deal effectively with the possibility that the prediction model fails to predict the future workload [1].
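To illustrate the hybrid idea, the sketch below combines a proactive decision based on the forecast with a reactive safety net that overrides it whenever the observed load already exceeds the capacity of the current pool. The per-VM capacity figure and function names are assumptions for illustration, not the mechanism proposed in [1] or implemented in this thesis.

import math

CAPACITY_PER_VM = 250   # assumed requests/s one VM can serve within the SLA

def proactive_target(predicted_load):
    """VMs needed for the predicted load (proactive component)."""
    return math.ceil(predicted_load / CAPACITY_PER_VM)

def hybrid_target(predicted_load, observed_load, current_vms):
    """Proactive target, overridden reactively when the observed load already
    exceeds what the current pool can serve (illustrative sketch)."""
    target = proactive_target(predicted_load)
    if observed_load > current_vms * CAPACITY_PER_VM:
        # The prediction missed the spike: fall back to reactive scaling.
        target = max(target, math.ceil(observed_load / CAPACITY_PER_VM))
    return target

# The forecast underestimates (500 req/s) while 900 req/s are actually arriving:
print(hybrid_target(predicted_load=500, observed_load=900, current_vms=2))  # -> 4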
Chapter 7

Conclusions

In this thesis, we propose a dynamic resource provisioning framework for tackling the most important issues of auto-scaling in IaaS cloud environments. The research work begins by highlighting the problems involved in a simplistic reactive approach to auto-scaling. We provide the motivation for intelligent resource provisioning mechanisms and study a few prediction models. The relative performance of the chosen models is evaluated with real world and artificial workload traces, based on the prediction accuracy of each of the four models. We then present an ARIMA based resource provisioning model which predicts the future workload demand and allocates resources preemptively. RUBiS was used as the test use case for conducting simulations and comparing the efficiency of the ARIMA based resource controller against a simple reactive resource controller. Even though the proactive approach to auto-scaling works better than the reactive one, there are some inherent limitations in the approach followed during the course of the simulations, the most important ones being the assumed homogeneity of the workloads and the static capacity estimation of virtual resources. We argue that there is a need for dynamic capacity estimation techniques and seek ways to address the issues that may arise from the multi-tenant nature of a public IaaS cloud environment.
Bibliography

[1] Ali-Eldin, A., Kihl, M., Tordsson, J., and Elmroth, E. Efficient provisioning of bursty scientific workloads on the cloud using adaptive elasticity control. In Proceedings of the 3rd Workshop on Scientific Cloud Computing Date (New York, NY, USA, 2012), ScienceCloud '12, ACM.

[2] Amza, C., Cecchet, E., Ch, A., Cox, A. L., Elnikety, S., Gil, R., Marguerite, J., Rajamani, K., and Zwaenepoel, W. Bottleneck characterization of dynamic web site benchmarks. Tech. rep.

[3] Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G., Patterson, D. A., Rabkin, A., Stoica, I., and Zaharia, M. Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS, EECS Department, University of California, Berkeley, Feb.

[4] Barker, S. K., and Shenoy, P. Empirical evaluation of latency-sensitive application performance in the cloud. In Proceedings of the First Annual ACM SIGMM Conference on Multimedia Systems (New York, NY, USA, 2010), MMSys '10, ACM.

[5] Beimborn, D., Miletzki, T., and Wenzel, S. Platform as a Service (PaaS). Business & Information Systems Engineering 3, 6 (2011).

[6] Box, G. E. P., and Jenkins, G. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated.

[7] Charlie Oppenheimer, M. P. Which is less expensive: Amazon or self-hosted?, Feb.
[8] Garraghan, P., Townend, P., and Xu, J. Realtime fault-tolerance in federated cloud environments. In Object/Component/Service-Oriented Real-Time Distributed Computing Workshops (ISORCW), IEEE International Symposium on (April 2012).

[9] Ghoshal, D., Canon, R. S., and Ramakrishnan, L. I/O performance of virtualized cloud environments. In Proceedings of the Second International Workshop on Data Intensive Computing in the Clouds (New York, NY, USA, 2011), DataCloud-SC '11, ACM.

[10] Greenberg, A., Hamilton, J., Maltz, D. A., and Patel, P. The cost of a cloud: research problems in data center networks. SIGCOMM Comput. Commun. Rev. 39, 1 (Dec. 2008).

[11] Hajjat, M., Sun, X., Sung, Y.-W. E., Maltz, D., Rao, S., Sripanidkulchai, K., and Tawarmalani, M. Cloudward bound: planning for beneficial migration of enterprise applications to the cloud. In Proceedings of the ACM SIGCOMM 2010 Conference (New York, NY, USA, 2010), SIGCOMM '10, ACM.

[12] Halili, E. Apache JMeter. Packt Publishing.

[13] Hamilton, J. D. Time Series Analysis. Princeton University Press.

[14] Harms, R., and Yamartino, M. The economics of the cloud. Microsoft whitepaper, Microsoft Corporation, Redmond, WA, USA, November.

[15] Hill, Z., and Humphrey, M. A quantitative analysis of high performance computing with Amazon's EC2 infrastructure: The death of the local cluster? In Grid Computing, IEEE/ACM International Conference on (Oct 2009).

[16] Holt, C. C. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20, 1 (2004).

[17] Hovestadt, M., Kao, O., Kliem, A., and Warneke, D. Evaluating adaptive compression to mitigate the effects of shared I/O in clouds. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (Washington, DC, USA, 2011), IPDPSW '11, IEEE Computer Society.
[18] Isci, C., Hanson, J., Whalley, I., Steinder, M., and Kephart, J. Runtime demand estimation for effective dynamic resource management. In Network Operations and Management Symposium (NOMS), 2010 IEEE (April 2010).

[19] Lenk, A., Klems, M., Nimis, J., Tai, S., and Sandholm, T. What's inside the cloud? An architectural map of the cloud landscape. In Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing (Washington, DC, USA, 2009), CLOUD '09, IEEE Computer Society.

[20] Li, A., Yang, X., Kandula, S., and Zhang, M. CloudCmp: comparing public cloud providers. In Proceedings of the 10th Annual Conference on Internet Measurement (New York, NY, USA, 2010), IMC '10, ACM.

[21] Mao, M., and Humphrey, M. Auto-scaling to minimize cost and meet application deadlines in cloud workflows. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (2011), SC '11, ACM, pp. 49:1-49:12.

[22] Mark, K., and Csaba, L. Analyzing customer behavior model graph (CBMG) using Markov chains. In Intelligent Engineering Systems (INES), International Conference on (July).

[23] McLeod, A. I. Parsimony, model adequacy and periodic correlation in time series forecasting.

[24] Mell, P., and Grance, T. The NIST definition of cloud computing. National Institute of Standards and Technology 53, 6 (2009), 50.

[25] Mishra, M., and Sahoo, A. On theory of VM placement: Anomalies in existing methodologies and their mitigation using a novel vector based approach. In Proceedings of the 2011 IEEE 4th International Conference on Cloud Computing (Washington, DC, USA, 2011), CLOUD '11, IEEE Computer Society.

[26] Orzell, G., and Becker, J. Auto scaling in the Amazon cloud, January.
[27] Raivio, Y., Mazhelis, O., Annapureddy, K., Mallavarapu, R., and Tyrväinen, P. Hybrid cloud architecture for short message services. In CLOSER (2012).

[28] Reiss, C., Wilkes, J., and Hellerstein, J. L. Google cluster-usage traces: format + schema. Technical report, Google Inc., Mountain View, CA, USA, Nov. Revised. Posted at URL http://code.google.com/p/googleclusterdata/wiki/traceversion2.

[29] Roy, N., Dubey, A., and Gokhale, A. Efficient autoscaling in the cloud using predictive models for workload forecasting. In Proceedings of the 2011 IEEE 4th International Conference on Cloud Computing (Washington, DC, USA, 2011), CLOUD '11, IEEE Computer Society.

[30] Roy, N., Dubey, A., Gokhale, A., and Dowdy, L. A capacity planning process for performance assurance of component-based distributed systems. SIGSOFT Softw. Eng. Notes 36, 5 (Sept. 2011).

[31] Sempolinski, P., and Thain, D. A comparison and critique of Eucalyptus, OpenNebula and Nimbus. In Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on (Dec.).

[32] Vaquero, L. M., Rodero-Merino, L., and Buyya, R. Dynamically scaling applications in the cloud. SIGCOMM Comput. Commun. Rev. 41, 1.

[33] Vaquero, L. M., Rodero-Merino, L., Caceres, J., and Lindner, M. A break in the clouds: towards a cloud definition. SIGCOMM Comput. Commun. Rev. 39, 1 (Dec. 2008).

[34] von Eicken, T. Animoto's Facebook scale-up, April.

[35] Wang, G., and Ng, T. S. E. The impact of virtualization on network performance of Amazon EC2 data center. In Proceedings of the 29th Conference on Information Communications (Piscataway, NJ, USA, 2010), INFOCOM '10, IEEE Press.

[36] Wang, J., Varman, P. J., and Xie, C. Avoiding performance fluctuation in cloud storage. In HiPC (2010), pp. 1-9.
[37] Wilkes, J. More Google cluster data. Available at http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html, 2011.