IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 10, NO. 5, SEPTEMBER/OCTOBER 2013

A Hierarchical Approach for the Resource Management of Very Large Cloud Platforms

Bernardetta Addis, Danilo Ardagna, Barbara Panicucci, Mark S. Squillante, Fellow, IEEE, and Li Zhang

Abstract: Worldwide interest in the delivery of computing and storage capacity as a service continues to grow at a rapid pace. The complexities of such cloud computing centers require advanced resource management solutions that are capable of dynamically adapting the cloud platform while providing continuous service and performance guarantees. The goal of this paper is to devise resource allocation policies for virtualized cloud environments that satisfy performance and availability guarantees and minimize energy costs in very large cloud service centers. We present a scalable distributed hierarchical framework, based on a mixed-integer nonlinear optimization of resource management, acting at multiple timescales. Extensive experiments across a wide variety of configurations demonstrate the efficiency and effectiveness of our approach.

Index Terms: Performance attributes, performance of systems, quality concepts, optimization of cloud configurations, QoS-based scheduling and load balancing, virtualized system QoS-based migration policies

B. Addis is with the Dipartimento di Informatica, Università degli Studi di Torino, C.so Svizzera 185, Torino 10149, Italy. E-mail: addis@di.unito.it.
D. Ardagna is with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Golgi 42, Milano 20133, Italy. E-mail: ardagna@elet.polimi.it.
B. Panicucci is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, and the Dipartimento di Scienze e Metodi dell'Ingegneria, Università di Modena e Reggio Emilia, Via Amendola 2, Reggio Emilia 42122, Italy. E-mail: barbara.panicucci@unimore.it.
M.S. Squillante and L. Zhang are with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY. E-mail: {mss, zhangli}@us.ibm.com.

Manuscript received 15 July 2012; revised 16 Nov. 2012; accepted 2 Jan. 2013; published online 9 Jan. 2013. For information on obtaining reprints of this article, please send e-mail to tdsc@computer.org and reference IEEECS Log Number TDSCSI.

1 INTRODUCTION

MODERN cloud computing systems operate in a new and dynamic world, characterized by continual changes in the environment and in the system and performance requirements that must be satisfied. Continuous changes occur without warning and in an unpredictable manner, outside the control of the cloud provider. Therefore, advanced solutions need to be developed that manage the cloud system in a dynamically adaptive fashion, while continuously providing service and performance guarantees. In particular, recent studies [18], [9], [25] have shown that the main challenges faced by cloud providers are to 1) reduce costs, 2) improve levels of performance, and 3) enhance availability and dependability.

Within this framework, it should first be noted that, at large service centers, the number of servers is growing significantly and the complexity of the network infrastructure is also increasing. This leads to an enormous spike in electricity consumption: IT analysts predict that by the end of 2012, up to 40 percent of the budgets of cloud service centers will be devoted to energy costs [13], [14]. Energy efficiency is, therefore, one of the main concerns of resource management.
In addition, providers need to comply with service-level agreement (SLA) contracts that determine the revenues gained and penalties incurred on the basis of the level of performance achieved. Quality-of-service (QoS) guarantees have to be satisfied despite workload fluctuations, which can span several orders of magnitude within the same business day [23], [14]. Moreover, if end-user data and applications are moved to the cloud, infrastructures need to be as reliable as phone systems [18]: Although 99.9 percent uptime is advertised, several hours of system outages have recently been experienced by some cloud market leaders (e.g., the Amazon EC2 Easter outage and the Microsoft Azure outage on 29 February 2012 [9], [44]). Currently, infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS) providers include only availability in their SLA contracts, while performance guarantees are neglected. In the case of availability violations (with current figures, even for large providers, being around 96 percent, much lower than the values stated in their contracts [12]), customers are refunded with credits to use the cloud system for free.

The concept of virtualization, an enabling technology that allows multiple end-user applications to share the same physical machine with performance guarantees, is fundamental to developing appropriate policies for the management of current cloud systems. From a technological perspective, the consolidation of multiple user workloads on the same physical machine helps to reduce costs, but it also translates into higher utilization of the physical resources. Hence, unforeseen load fluctuations or hardware failures can have a greater impact across multiple applications, making cost-efficient, dependable cloud infrastructures with QoS guarantees of paramount importance for the broad adoption of cloud systems. A number of self-managing solutions have been developed that support the dynamic allocation of resources among running applications on the basis of workload predictions.

Most of these approaches (e.g., [15]) have implemented centralized solutions: a central controller manages all resources comprising the infrastructure, usually acting on an hourly basis. This timescale is often adopted [8] because it represents a compromise in the tradeoff between the frequency of decisions and the overhead of those decisions. For example, self-managing frameworks [35], [8] may decide to shut down or power up a server, or to move virtual machines (VMs) from one server to another by exploiting the VM live-migration mechanism. However, these actions consume a significant amount of energy and cannot be performed too frequently. As previously noted, cloud systems are continuously growing in size and complexity: today, cloud service centers include up to 10,000 servers [9], [25], and each server hosts several VMs. In this context, centralized solutions are subject to critical design limitations [9], [7], [53], including a lack of scalability and expensive monitoring communication costs, and cannot provide effective control within a 1-hour time horizon.

In this paper, we take the perspective of a PaaS provider that manages the transactional service applications of its customers to satisfy response time and availability guarantees and to minimize energy costs in very large cloud service centers. We propose a distributed hierarchical framework [39] based on a mixed-integer nonlinear optimization of resource management across multiple timescales. At the root of the hierarchy, a central manager (CM) partitions the workload (applications) and resources (physical servers) over a 24-hour horizon to obtain clusters with homogeneous workload profiles one day in advance. For each cluster, an application manager (AM) determines the optimal resource allocation in a centralized manner across a subsystem of the cloud platform on an hourly basis. More precisely, AMs decide the set of applications executed by each server of the cluster (i.e., application placement), the request volumes at the various servers inside the cluster (i.e., load balancing), and the capacity devoted to the execution of each application at each server (i.e., capacity allocation). AMs can also decide to switch servers into a low-power sleep state depending on the cluster load (i.e., server switching) or to reduce the operating frequency of servers (i.e., frequency scaling). Furthermore, because load balancing, capacity allocation, and frequency scaling require only a small amount of time to perform, they are executed by AMs every few (5-15) minutes. The periodicity of the placement decisions is investigated within the experimental analysis in Section 6.

Extensive experiments across a wide variety of configurations demonstrate the effectiveness of our approach: our solutions are very close to those achieved by a centralized method [3], [8] that requires orders of magnitude more time to solve the global resource allocation problem based on a complete view of the cloud platform. Our solution further yields substantial net-revenue improvements over other heuristics from the literature [41], [50], [61], [57] that are currently deployed by IaaS providers [6]. Moreover, our algorithms are suitable for parallel implementations that achieve nearly linear speedup.
Finally, we have also investigated the mutual influences and interrelationships among the multiple timescales (24 hours/1 hour/5-15 minutes) by varying the characteristics of the prediction errors and identifying the best fine-grained timescale for runtime control.

The remainder of the paper is organized as follows: Section 2 reviews previous work from the literature, while Section 3 introduces our hierarchical framework. The optimization formulations and their solutions are presented in Sections 4 and 5, respectively. Section 6 is dedicated to experimental results, with concluding remarks in Section 7.

2 RELATED WORK

Many solutions have been proposed for the self-management of cloud service centers, each seeking to meet application requirements while controlling the underlying infrastructure. Five main problem areas have been considered in system allocation policy design:

1. application/VM placement,
2. admission control,
3. capacity allocation,
4. load balancing, and
5. energy consumption.

While each area has often been addressed separately, it is noteworthy that the solutions to these problems are closely related. For example, the request volume determined for a given application at a given server depends on the capacity allocated to that application on the server. A primary contribution of this paper is to integrate all five problem areas within a unifying framework, providing very efficient and robust solutions at multiple timescales.

Three main approaches have been developed: 1) control-theoretic feedback loop techniques, 2) adaptive machine learning approaches, and 3) utility-based optimization techniques. A main advantage of control-theoretic feedback loops is their system stability guarantees. Upon workload changes, these techniques can also accurately model transient behavior and adjust system configurations within a transitory period, which can be fixed at design time. A previous study [35] implemented a limited look-ahead controller that determines the servers in the active state, their operating frequencies, and the allocation of VMs to physical servers. However, this implementation considers the VM placement and capacity allocation problems separately, and the scalability of the proposed approach is not considered. Recent studies [55], [53] have proposed hierarchical control solutions, in particular a cluster-level control architecture that coordinates multiple server power controllers within a virtualized server cluster [55]. The higher layer controller determines capacity allocation and VM migration within a cluster, while the inner controllers determine the power levels of individual servers. However, application and VM placement within each service center cluster is assumed to be given. Similarly, coordination among multiple power controllers acting at rack enclosures and at the server level has been performed without providing performance guarantees for the running applications [53].

Machine learning techniques are based on live system learning sessions, without a need for analytical models of the applications and the underlying infrastructure. A previous study [30] applied machine learning to coordinate multiple

autonomic managers with different goals, integrating a performance manager with a power manager to satisfy performance constraints while minimizing energy expenses by exploiting server frequency scaling. Recent studies provide solutions for server provisioning and VM placement [36] and propose an overall framework for autonomic cloud management [59]. An advantage of machine learning techniques is that they accurately capture system behavior without any explicit performance or traffic model and with little built-in system-specific knowledge. However, training sessions tend to extend over several hours [30], retraining is required for evolving workloads, and existing techniques are often restricted to separately applying actuation mechanisms to a limited set of managed applications.

Fig. 1. Hierarchical resource management framework and interrelationships among the timescales $T_1$, $T_2$, and $T_3$.

Utility-based approaches have been introduced to optimize the degree of user satisfaction by expressing goals in terms of user-level performance metrics. Typically, the system is modeled by means of a performance model embedded within an optimization framework. Optimization can provide globally optimal solutions or suboptimal solutions by means of heuristics, depending on the complexity of the optimization model. Optimization is typically applied to each of the five problems separately. Some research studies address admission control for overload protection of servers [28]. Capacity allocation is typically viewed as a separate optimization activity that operates under the assumption that servers are protected from overload. VM placement has recently been widely studied [15], [13], [20]. A previous study [61] presented a multilayer, multiple-timescale solution for the management of virtualized systems, but the tradeoff between performance and system costs is not considered. A hierarchical framework for maximizing cloud provider profits, taking into account the server provisioning and VM placement problems, has similarly been proposed [14], but only very small systems have been analyzed. The problem of VM placement within an IaaS provider has been considered [31], providing a novel stochastic performance model for estimating the time required for VM startup. However, each VM is considered a black box, and performance guarantees at the application level are not provided. Finally, frameworks for the colocation of VMs into clusters based on an analysis of 24-hour application workload profiles have been proposed [26], [23], but only the server provisioning and capacity allocation problems are solved, without considering the frequency-scaling mechanism available with current technology.

In this paper, we propose a utility-based optimization approach, extending our previous works [39], [3], [8] through a scalable hierarchical implementation across multiple timescales. In particular, [39] provided a theoretical framework investigating the conditions under which a hierarchical approach can provide the same solution as a centralized one. The hierarchical implementation we propose here is built on that framework and includes the CM and the AMs, whereas in [3], [8], only a CM was considered, which required more time to solve the global resource allocation problem as well as a complete view of the cloud platform.
To the best of our knowledge, this is the first integrated framework that also provides availability guarantees to running applications, because all foregoing work considered only performance and energy management problems.

3 HIERARCHICAL RESOURCE MANAGEMENT

This section provides an overview of our approach for hierarchical resource management across multiple timescales in very large scale cloud platforms. The main components of our runtime management framework are described in Section 3.1, while our performance and power models are discussed in Section 3.2.

3.1 Runtime Framework

We assume the PaaS provider supports the execution of multiple transactional services, each representing a different customer application. The hosted services can be heterogeneous with respect to resource demands, workload intensities, and QoS requirements. Services with different quality and workload profiles are categorized into independent request classes, where the system overall serves a set $\mathcal{R}$ of request classes. Fig. 1 depicts the architecture of the service center under consideration. The system includes a set $\mathcal{S}$ of heterogeneous servers, each running a virtual machine monitor (VMM) (such as the IBM POWER Hypervisor, VMware, or Xen) configured in non-work-conserving mode (i.e., computing resources are capped and reserved for the execution of individual VMs). The physical resources (e.g., CPU, disks, communication network) of a server are partitioned among

multiple VMs, each running a dedicated application. Among the many physical resources, we focus on the CPU and RAM as representative resources for the resource allocation problem, consistent with [47], [50], [15]. Any class can be supported by multiple application tiers (e.g., front end, application logic, and data management). Each VM hosts a single application tier, and multiple VMs implementing the same application tier can run in parallel on different hosts. Under light load conditions, servers may be shut down or placed in a stand-by state and moved to a free-server pool to reduce energy costs. These servers may subsequently be moved back to an active state and reintroduced into the running infrastructure during load peaks.

In more detail, we assume that each server has a single CPU supporting dynamic voltage and frequency scaling (DVFS), choosing both its supply voltage and operating frequency from a limited set of values; this single-CPU assumption is without loss of generality in heavy traffic [56] and can easily be relaxed in general. The adoption of DVFS is very promising because it does not introduce any system overhead, while hibernating and restoring a server both require time and energy. Following [43], [46], we adopt full-system power models and assume that the power consumption of a server depends on its operating frequency/voltage as well as the current CPU utilization. Moreover, as in [35], [30], [24] but without limiting generality, server performance is assumed to be linear in the operating frequency. Finally, each physical server s has availability $a_s$, and to provide availability guarantees, the VMs running the application tiers are replicated on multiple physical servers and work under the so-called load-sharing configuration [10].

Our resource management framework is based on a hierarchical architecture [39]. At the highest level of the hierarchy, a CM acts on a timescale of $T_1$ (i.e., every 24 hours) and is responsible for partitioning the classes and servers into clusters one day in advance (long-term problem). The objective is to obtain a set of clusters $\mathcal{C}$, each element of which has a homogeneous profile, to reduce server switching at finer-grained timescales. Furthermore, we denote by $\mathcal{R}_c$ the set of request classes assigned to cluster $c \in \mathcal{C}$ and by $\mathcal{S}^t_c$ the set of physical servers in cluster c at time interval t. At a lower level of the hierarchy, AMs centrally manage individual clusters acting on a timescale of $T_2$ (i.e., on an hourly basis). Each AM can decide (medium-term problem):

1. application placement within the cluster,
2. load balancing of requests among parallel VMs supporting the same application tier,
3. capacity allocation of competing VMs running on each server,
4. switching servers into active or low-power sleep states, and
5. increasing/decreasing the operating frequency of the CPU of a server.

Furthermore, because the load-balancing, capacity allocation, and frequency-scaling optimizations do not add significant system overhead, they are performed by AMs more frequently (short-term problem) on a timescale of $T_3$ (i.e., every 5-15 minutes). The decisions at each timescale are made according to a prediction of the incoming workloads, as in [61], [4], [16], [39], [7].
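To make the multiple-timescale operation concrete, the following minimal sketch nests the three control loops just described. The cm/am objects, their methods, and the forecast interface are hypothetical stand-ins, not part of the paper:

```python
# A structural sketch, assuming hypothetical `cm`, `ams`, and `forecast`
# objects; T1/T2/T3 follow the timescales defined in the text.

T1_HOURS = 24      # CM horizon: one day (long-term problem)
T2_HOURS = 1       # AM medium-term period: one hour
T3_MINUTES = 15    # AM short-term period: every 5-15 minutes

def run_one_day(cm, ams, forecast):
    """One T1 period: partition once, then run the per-cluster AM loops."""
    # Long-term (CM): partition classes/servers one day in advance.
    clusters = cm.partition(forecast.daily_profiles())
    for hour in range(T1_HOURS // T2_HOURS):
        for am, cluster in zip(ams, clusters):
            # Medium-term (AM): placement, server switching, frequency
            # scaling, plus initial capacity allocation and load balancing.
            plan = am.solve_medium_term(cluster, forecast.hourly(hour))
            # Short-term (AM): only load balancing, capacity allocation,
            # and frequency scaling are re-optimized.
            for slot in range(60 // T3_MINUTES):
                plan = am.solve_short_term(plan, forecast.short_term(hour, slot))
```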
Concerning this prediction, several methods have been adopted in many fields for decades [16] (e.g., ARMA models, exponential smoothing, and polynomial interpolation), making them suitable for forecasting the seasonal workloads common at timescale $T_1$, as well as the runtime, nonstationary request arrivals characterizing timescale $T_3$. In general, each prediction mechanism is characterized by several alternative implementations, among which the choice of whether to filter the input data (a runtime measure of the metric to be predicted) and whether to set the model parameters statically or dynamically are the most significant (a minimal forecasting example is sketched at the end of this section).

By considering multiple timescales, we are able to capture two types of information related to cloud sites [61]. The fine-grained workload traces at timescale $T_3$ (see also the following section) exhibit high variability due to the short-term variations of typical web-based workloads [16]. For this reason, the finer-grained timescale provides useful information for reacting quickly to workload fluctuations. At the coarser-grained timescales $T_1$ and $T_2$, the workload traces are smoother and not characterized by the instantaneous peaks typical of the finer-grained timescale. These characteristics allow us to use the medium- and long-timescale algorithms to predict the workload trend, which provides useful information for resource management. Note that a hierarchical approach is also beneficial in reducing the monitoring data required to compute workload predictions. Indeed, a centralized solution would require the central node to collect application workloads at timescale $T_3$. With a hierarchical solution, in contrast, AMs need to provide only aggregated data (i.e., the overall incoming workload served for the applications) of the individually managed partitions to the CM at timescale $T_1$.

Dependability and fault tolerance are also important issues in developing resource managers for very large infrastructures. There has been research employing peer-to-peer designs [2] or using gossip protocols [58] to increase dependability. In [2], a peer-to-peer overlay is constructed to disseminate routing information and retrieve states/statistics. Wuhib et al. [58] propose a gossip protocol to compute, in a distributed and continuous fashion, a heuristic solution to a resource allocation problem under dynamically changing resource demand. In our solution, to provide better dependability for the management infrastructure, each cluster maintains a primary and a backup AM. The primary and the backup do not need to be tightly synchronized (e.g., to the second), as the decisions they make are at a timescale of 5-15 minutes. Similarly, a backup CM is also maintained in our solution. We will investigate a peer-to-peer management infrastructure as part of future work.
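As an illustration of the forecasting step discussed at the beginning of this section, the sketch below applies simple exponential smoothing, one of the standard predictors cited above, to a synthetic workload trace; the function, parameter value, and trace are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def exp_smoothing_forecast(samples, alpha=0.3):
    """One-step-ahead forecast by simple exponential smoothing.

    `samples` holds past arrival-rate measurements (e.g., req/s at
    timescale T3); `alpha` weights recent observations more heavily.
    """
    level = samples[0]
    for x in samples[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# Example: a noisy workload trace with an upward trend.
rng = np.random.default_rng(0)
trace = 100 + 5 * np.arange(20) + rng.normal(0, 10, 20)
print(round(exp_smoothing_forecast(trace), 1))  # forecast for the next slot
```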

Finally, an SLA contract, associated with each service class, is established between the PaaS provider and its customers. This contract specifies the QoS levels that the PaaS provider must satisfy when responding to customer requests for a given service class, as well as the corresponding pricing scheme. Following [52], [11] (but without limiting the generality of our approach), for each class $r \in \mathcal{R}$, a linear utility function specifies the per-request revenue (or penalty) $U_r$ incurred when the average end-to-end response time $R_r$ attains a given value: $U_r = -\nu_r\,R_r + u_r$ (see Fig. 2a). The slope of the utility function is $-\nu_r$, with $\nu_r = u_r/\bar{R}_r > 0$, where $u_r$ is the y-axis intercept of the linear utility function and $\bar{R}_r$ is the threshold identifying the revenue/penalty regions, i.e., if $R_r > \bar{R}_r$, the SLA is violated and penalties are incurred. SLA contracts also involve multiclass average response time metrics for virtualized systems (whose evaluation is discussed in the next section) and the minimum level of availability $A_r$ required for each service class r.

Fig. 2. Request utility functions, system performance, and power models.

3.2 System Performance and Power Models

The cloud service center is modeled as a queuing network composed of a set of multiclass single-server queues and a delay center (see Fig. 2b). Each layer of queues represents the collection of applications supporting the execution of requests at each tier. The delay center represents the client-based delay, or think time, between the completion of one request and the arrival of the subsequent request within a session. A different set of application tiers $\mathcal{A}_r$ supports the execution of class r requests. This section focuses on the medium-term timescale $T_2$, noting that similar considerations can be drawn for the timescales $T_1$ and $T_3$. The relationships among the timescales $T_1$, $T_2$, and $T_3$ are illustrated in Fig. 1. We denote by $n_{T_2}$ ($n_{T_3}$) the number of time intervals at granularity $T_2$ ($T_3$) within the time period $T_1$ ($T_2$), which can be expressed as $n_{T_2} = T_1/T_2$ ($n_{T_3} = T_2/T_3$) and is assumed to be an integer.

In a given time interval, user sessions begin with a class r request arriving at the service center from an exogenous source with average rate $\lambda^t_r(T_2)$. Upon completion, a request either returns to the system as a class $r'$ request with probability $p_{r,r'}$ or departs with probability $1 - \sum_{l\in\mathcal{R}} p_{r,l}$. The aggregate arrival rate for class r requests, $\Lambda^t_r(T_2)$, is given by the exogenous arrival rates and the transition probabilities: $\Lambda^t_r(T_2) = \lambda^t_r(T_2) + \sum_{r'\in\mathcal{R}} \Lambda^t_{r'}(T_2)\,p_{r',r}$.

Let $\mu_{r,\tau}$ denote the maximum service rate of a unit-capacity server dedicated to the execution of an application of class r at tier $\tau$. We assume that requests on every server at the different application tiers within each class are executed according to the processor-sharing scheduling discipline (a common model of time sharing among web servers and application containers [4]), where the service time of class r requests at server s follows a general distribution with mean $(C_{s,f}\,\mu_{r,\tau})^{-1}$, with $C_{s,f}$ denoting the capacity of server s at frequency f [32], [19]. The service rates $\mu_{r,\tau}$ can be estimated at runtime by monitoring application execution (see, e.g., [40]). They can also be estimated at very low overhead by a combination of the inference and filtering techniques developed in [60], [34], [33]. Let $\lambda^t_{s,r,\tau}$ denote the execution rate for class r requests at tier $\tau$ on server s in time interval t, and let $\phi^t_{s,r,\tau}$ denote the fraction of VM capacity for class r requests at tier $\tau$ on server s in time interval t. Once a capacity assignment has been made, VMMs guarantee that a given VM and hosted application tier receive their assigned resources (CPU capacity and RAM allotment), regardless of the load imposed by other applications.
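The traffic equation above is linear in the aggregate rates, so $\Lambda = (I - P^T)^{-1}\lambda$ can be computed directly. A toy instance with hypothetical numbers:

```python
import numpy as np

# |R| = 3 classes. lam_exo holds the exogenous arrival rates; P[r, r2]
# is the probability that a completed class-r request re-enters the
# system as class r2 (row sums below 1 leave room for departures).
lam_exo = np.array([10.0, 5.0, 2.0])
P = np.array([[0.0, 0.3, 0.1],
              [0.2, 0.0, 0.2],
              [0.0, 0.1, 0.0]])

# Lambda = lam_exo + P^T Lambda  =>  (I - P^T) Lambda = lam_exo
Lam = np.linalg.solve(np.eye(3) - P.T, lam_exo)
print(Lam.round(2))  # aggregate arrival rates per class
```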
Under the non-work-conserving mode, each multiclass single-server queue associated with a server application can be decomposed into multiple independent single-class single-server M/G/1 queues with capacity equal to $C_{s,f}\,\phi^t_{s,r,\tau}$. Then, the average response time for class r requests, $R_r$, is obtained from the average response times $R^f_{s,r,\tau}$ of class r requests executed at tier $\tau$ on server s working at frequency f, which can be evaluated as

$$R^f_{s,r,\tau} = \frac{1}{C_{s,f}\,\mu_{r,\tau}\,\phi^t_{s,r,\tau} - \lambda^t_{s,r,\tau}}, \qquad R_r = \frac{1}{\Lambda^t_r(T_2)} \sum_{s\in\mathcal{S},\,\tau\in\mathcal{A}_r} \lambda^t_{s,r,\tau}\,R^f_{s,r,\tau}. \qquad (1)$$

Although system performance metrics have been evaluated by adopting alternative analytical models (e.g., [41], [1], [54]), and accurate performance models exist in the literature for web systems (e.g., [45], [28]), there is a fundamental tradeoff between the accuracy of the models and the time required to estimate system performance metrics. Given the strict time constraints for evaluating performance metrics in our present setting, the high complexity of analyzing even small-scale instances of existing performance models has prevented us from exploiting such results here. It is important to note, however, that highly accurate approximations for general queuing networks [22], [21], having functional forms similar to those above, can be directly incorporated into our modeling and optimization frameworks.

Energy-related costs are computed using the full-system power models adopted in [43], [46], which yield server energy consumption estimates with errors within 10 percent. Servers in a low-power sleep state still consume energy (cost parameter $c_s$) [35], whereas the dynamic operating cost of a running server consists of a constant base cost $c_s + p_{s,f}$ (a function of the operating frequency f) and an additional component that varies linearly with the CPU utilization $U_s$ with slope $b_{s,f}$ [43] (see Fig. 2c). Following [35], [50], we further consider the switching cost $cs_s$, which includes the power consumption costs incurred when starting up the server from a low-power sleep state and, possibly, the revenue lost while the computer is unavailable to perform any useful service. Finally, to limit the number of VM movements in the system, we add a cost associated with VM movement, $cm$, which avoids system instability and minimizes the total number of application starts and stops [35], [50]. The cost $cm$ has been evaluated by considering the cost of using a server to suspend, move, and start up a VM according to the values reported in [49], [29]. Table 1 summarizes the notation adopted in this paper.
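As a worked illustration of (1) and of the power model of Fig. 2c, the sketch below evaluates the per-VM M/G/1 response time and the utilization-linear power cost on hypothetical numbers:

```python
def response_time(lam, C, mu, phi):
    """M/G/1-PS mean response time of one VM: 1 / (C*mu*phi - lam).

    Requires lam < C*mu*phi (cf. Constraint (7) below); C is the server
    capacity at the selected frequency, mu the unit-capacity service
    rate, and phi the CPU fraction reserved for the VM.
    """
    assert lam < C * mu * phi, "VM would be saturated"
    return 1.0 / (C * mu * phi - lam)

def power_cost(p_f, b_f, utilization):
    """Full-system power model: constant term p_f at frequency f plus a
    component linear in CPU utilization with slope b_f."""
    return p_f + b_f * utilization

# Hypothetical numbers: 40 req/s routed to a VM holding 60% of a
# capacity-2 server whose unit-capacity service rate is 50 req/s.
print(response_time(lam=40.0, C=2.0, mu=50.0, phi=0.6))  # 0.05 s
print(power_cost(p_f=120.0, b_f=80.0, utilization=0.4))  # 152.0
```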

TABLE 1. Multitime-Scale Resource Management Framework Parameters.

4 FORMULATION OF OPTIMIZATION PROBLEMS

In this section, we introduce the resource management optimization problems at the multiple timescales ($T_1$, $T_2$, $T_3$). The general objective is to maximize profit, namely the difference between the revenues from SLA contracts and the costs associated with server switching and VM migrations, while providing availability guarantees. We formulate the optimization problem at timescale $T_2$ (the medium-term problem), from which, at the end of this section, the formulations at the other timescales will follow straightforwardly.

As previously explained, the CM determines one day in advance the class partition $\mathcal{R}_c$ and the server assignment $\mathcal{S}^t_c$ for each cluster c and every time interval $t = 1,\dots,n_{T_2}$. The AM controlling cluster c has to determine at timescale $T_2$:

1. the servers in the active state, introducing for each server s a binary variable $x^t_s$ equal to 1 if server s is running in interval t and 0 otherwise;
2. the operating frequency of the CPU of each server, given by the binary variable $y^t_{s,f}$ equal to 1 if server s is working at frequency f in interval t and 0 otherwise;
3. the set of applications executed by each server, introducing the binary variable $z^t_{s,r,\tau}$ equal to 1 if the application tier $\tau$ for class r is assigned to server s in interval t and 0 otherwise;
4. the rate of execution for class r at tier $\tau$ on server s in interval t (i.e., $\lambda^t_{s,r,\tau}$);
5. the fraction of capacity devoted to the VM executing tier $\tau$ for class r at server s in interval t (i.e., $\phi^t_{s,r,\tau}$).

We also introduce an auxiliary binary variable $w^t_{s,r}$ equal to 1 if at least one application tier for class r is assigned to server s in interval t and 0 otherwise (one possible layout of these variables is sketched below). The revenue for executing class r requests is given by $T_2\,U_r\,\Lambda^t_r(T_2) = T_2\,(u_r - \nu_r R_r)\,\Lambda^t_r(T_2)$, whereas the total cost is given by the sum of the costs for servers in the low-power sleep state, the costs for using servers according to their operating frequency, and the penalties incurred when switching servers from the low-power sleep state to the active state and when migrating VMs.
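For concreteness, the decision variables just listed could be laid out as follows; this is a hypothetical representation for illustration only, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ClusterDecision:
    """Decision variables of (RAP_c^T2) for one time interval t.

    Index conventions follow the text: s = server, f = frequency,
    r = request class, tau = application tier.
    """
    x: dict = field(default_factory=dict)    # x[s] in {0,1}: server active
    y: dict = field(default_factory=dict)    # y[s, f] in {0,1}: frequency choice
    z: dict = field(default_factory=dict)    # z[s, r, tau] in {0,1}: placement
    lam: dict = field(default_factory=dict)  # lam[s, r, tau] >= 0: execution rate
    phi: dict = field(default_factory=dict)  # phi[s, r, tau] >= 0: capacity share

    def w(self, s, r, tiers):
        # Auxiliary w[s, r]: 1 iff at least one tier of class r is on s.
        return int(any(self.z.get((s, r, tau), 0) for tau in tiers))
```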

Since the utilization of server s when it is operating at frequency f is

$$U_s = \sum_{r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} \frac{\lambda^t_{s,r,\tau}}{\sum_{f\in\mathcal{F}_s} C_{s,f}\,y^t_{s,f}\,\mu_{r,\tau}},$$

the profit (i.e., the objective function) is given by

$$-T_2 \sum_{r\in\mathcal{R}_c}\sum_{s\in\mathcal{S}^t_c,\,\tau\in\mathcal{A}_r} \nu_r\,\frac{\lambda^t_{s,r,\tau}}{\sum_{f\in\mathcal{F}_s} C_{s,f}\,y^t_{s,f}\,\mu_{r,\tau}\,\phi^t_{s,r,\tau} - \lambda^t_{s,r,\tau}} + T_2 \sum_{r\in\mathcal{R}_c} u_r\,\Lambda^t_r(T_2) - T_2 \sum_{s\in\mathcal{S}^t_c} c_s - T_2 \sum_{s\in\mathcal{S}^t_c,\,f\in\mathcal{F}_s} \left( p_{s,f}\,y^t_{s,f} + b_{s,f}\,y^t_{s,f} \sum_{r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} \frac{\lambda^t_{s,r,\tau}}{\sum_{f'\in\mathcal{F}_s} C_{s,f'}\,y^t_{s,f'}\,\mu_{r,\tau}} \right) - \sum_{s\in\mathcal{S}^t_c} cs_s \max\!\left(0,\, x^t_s - x^{t-1}_s\right) - \sum_{s\in\mathcal{S}^t_c,\,r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} cm \max\!\left(0,\, z^t_{s,r,\tau} - z^{t-1}_{s,r,\tau}\right).$$

The first and second terms of the objective function are the SLA revenues, the third is the cost of servers in the low-power sleep state, the fourth is the cost of using servers at their operating frequencies, the fifth comprises the penalties for switching servers from the low-power sleep state to the active state, and the sixth comprises the penalties associated with moving VMs. Note that the terms $T_2\sum_{r\in\mathcal{R}_c} u_r\,\Lambda^t_r(T_2)$ and $T_2\sum_{s\in\mathcal{S}^t_c} c_s$ can be ignored because they are independent of the decision variables.

The medium-term resource allocation problem $(RAP^{T_2}_c)$ that an AM solves for its cluster c (every hour, as described in Section 1) can therefore be formulated, after dividing by $T_2$, as follows:

$$\max F^t_c = -\sum_{r\in\mathcal{R}_c,\,s\in\mathcal{S}^t_c,\,\tau\in\mathcal{A}_r} \nu_r\,\frac{\lambda^t_{s,r,\tau}}{\sum_{f\in\mathcal{F}_s} C_{s,f}\,y^t_{s,f}\,\mu_{r,\tau}\,\phi^t_{s,r,\tau} - \lambda^t_{s,r,\tau}} - \sum_{s\in\mathcal{S}^t_c,\,f\in\mathcal{F}_s} \left( p_{s,f}\,y^t_{s,f} + b_{s,f}\,y^t_{s,f} \sum_{r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} \frac{\lambda^t_{s,r,\tau}}{\sum_{f'\in\mathcal{F}_s} C_{s,f'}\,y^t_{s,f'}\,\mu_{r,\tau}} \right) - \sum_{s\in\mathcal{S}^t_c} \frac{cs_s}{T_2}\,\max\!\left(0,\, x^t_s - x^{t-1}_s\right) - \sum_{s\in\mathcal{S}^t_c,\,r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} \frac{cm}{T_2}\,\max\!\left(0,\, z^t_{s,r,\tau} - z^{t-1}_{s,r,\tau}\right)$$

subject to:

$$\sum_{s\in\mathcal{S}^t_c} \lambda^t_{s,r,\tau} = \Lambda^t_r(T_2) \quad \forall r\in\mathcal{R}_c,\ \tau\in\mathcal{A}_r, \qquad (2)$$

$$\sum_{r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r} \phi^t_{s,r,\tau} \le 1 \quad \forall s\in\mathcal{S}^t_c, \qquad (3)$$

$$\sum_{f\in\mathcal{F}_s} y^t_{s,f} = x^t_s \quad \forall s\in\mathcal{S}^t_c, \qquad (4)$$

$$z^t_{s,r,\tau} \le x^t_s \quad \forall s\in\mathcal{S}^t_c,\ r\in\mathcal{R}_c,\ \tau\in\mathcal{A}_r, \qquad (5)$$

$$\lambda^t_{s,r,\tau} \le \Lambda^t_r(T_2)\,z^t_{s,r,\tau} \quad \forall s\in\mathcal{S}^t_c,\ r\in\mathcal{R}_c,\ \tau\in\mathcal{A}_r, \qquad (6)$$

$$\lambda^t_{s,r,\tau} < \left(\sum_{f\in\mathcal{F}_s} C_{s,f}\,y^t_{s,f}\right)\mu_{r,\tau}\,\phi^t_{s,r,\tau} \quad \forall s\in\mathcal{S}^t_c,\ r\in\mathcal{R}_c,\ \tau\in\mathcal{A}_r, \qquad (7)$$

$$\sum_{r\in\mathcal{R}_c}\sum_{\tau\in\mathcal{A}_r} RAM_{r,\tau}\,z^t_{s,r,\tau} \le RAM_s \quad \forall s\in\mathcal{S}^t_c, \qquad (8)$$

$$1 - \prod_{s\in\mathcal{S}^t_c} \left(1 - a_s\right)^{w^t_{s,r}} \ge A_r \quad \forall r\in\mathcal{R}_c, \qquad (9)$$

$$\sum_{\tau\in\mathcal{A}_r} z^t_{s,r,\tau} \le |\mathcal{A}_r|\,w^t_{s,r} \quad \forall s\in\mathcal{S}^t_c,\ r\in\mathcal{R}_c, \qquad (10)$$

$$\lambda^t_{s,r,\tau},\ \phi^t_{s,r,\tau} \ge 0; \quad x^t_s,\ y^t_{s,f},\ z^t_{s,r,\tau},\ w^t_{s,r} \in \{0,1\} \quad \forall s\in\mathcal{S}^t_c,\ r\in\mathcal{R}_c,\ \tau\in\mathcal{A}_r.$$

Constraint (2) ensures that the traffic assigned to the individual servers at every application tier equals the overall load predicted for class r jobs. Note that, for a given request class r, the overall load $\Lambda^t_r(T_2)$ is the same at every tier. Constraint (3) expresses bounds on the VM capacity allocation (i.e., at most 100 percent of the server capacity can be used). Constraint (4) guarantees that if server s is in the active state, then exactly one frequency is selected in the set $\mathcal{F}_s$ (i.e., only one variable $y^t_{s,f}$ is set to 1). Indeed, if $x^t_s = 0$, then no frequency can be selected and every $y^t_{s,f}$ equals 0; vice versa, if the server is in the active state and $x^t_s = 1$, exactly one variable $y^t_{s,f}$ is set to 1. Hence, the extra cost of a server operating at frequency f and its corresponding capacity are given by $\sum_{f\in\mathcal{F}_s} b_{s,f}\,y^t_{s,f}$ and $\sum_{f\in\mathcal{F}_s} C_{s,f}\,y^t_{s,f}$, respectively. Constraint (5) states that application tiers can only be assigned to servers in the active state. Indeed, if the server is in the active state, $x^t_s = 1$ and $z^t_{s,r,\tau}$ can be raised to 1; vice versa, if the server is in the low-power sleep state, $x^t_s = 0$ and all variables $z^t_{s,r,\tau}$ are forced to 0.
With the same arguments, Constraint (6) allows requests to be executed only at a server to which the corresponding application tier has been assigned, Constraint (7) guarantees that resources are not saturated, Constraint (8) ensures that the memory available at a given server is sufficient to support all assigned VMs, and Constraint (9) forces the availability obtained by assigning servers to class r requests to be at least equal to the minimum level according to the load-sharing fault-tolerance scheme [10] (we plan to consider more advanced availability models, along the lines of [37], [42], in future work). Finally, Constraint (10) relates the variables $z^t_{s,r,\tau}$ and the auxiliary variables $w^t_{s,r}$: if any application tier of class r is assigned to server s (i.e., at least one variable $z^t_{s,r,\tau}$ equals 1), then $w^t_{s,r}$ must also equal 1 (see the spot-check sketch below).

The medium-term problem $(RAP^{T_2}_c)$ is a mixed-integer nonlinear programming problem that is particularly difficult to solve when the number of variables is large, as in this case. Even if we consider the set of servers in the active state $x^t = (x^t_s)_{s\in\mathcal{S}^t_c}$, with the relative frequencies $y^t = (y^t_{s,f})_{s\in\mathcal{S}^t_c,\,f\in\mathcal{F}_s}$, and the assignment of applications to each server $z^t = (z^t_{s,r,\tau})_{s\in\mathcal{S}^t_c,\,r\in\mathcal{R}_c,\,\tau\in\mathcal{A}_r}$ to be fixed, the problem is still difficult to solve because the objective function is not concave (see [8] for further details). To overcome this difficulty, we propose the efficient and effective heuristic approach described in Section 5.
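Building on the hypothetical ClusterDecision layout sketched earlier, a heuristic could spot-check a candidate solution against Constraints (2)-(4) as follows; a real solver would of course enforce all of (2)-(10):

```python
def check_core_constraints(dec, servers, classes, tiers, freqs, Lam):
    """Return violation messages for Constraints (2)-(4).

    `tiers[r]` lists the tiers of class r, `freqs[s]` the available
    frequencies of server s, and `Lam[r]` the predicted class load.
    """
    errors = []
    for r in classes:
        for tau in tiers[r]:
            # (2): per-tier traffic must equal the predicted class load.
            total = sum(dec.lam.get((s, r, tau), 0.0) for s in servers)
            if abs(total - Lam[r]) > 1e-6:
                errors.append(f"(2) violated for class {r}, tier {tau}")
    for s in servers:
        # (3): capacity fractions on a server sum to at most 1.
        used = sum(v for (s2, _, _), v in dec.phi.items() if s2 == s)
        if used > 1 + 1e-9:
            errors.append(f"(3) violated on server {s}")
        # (4): exactly one frequency is selected iff the server is active.
        if sum(dec.y.get((s, f), 0) for f in freqs[s]) != dec.x.get(s, 0):
            errors.append(f"(4) violated on server {s}")
    return errors
```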

Let us now turn to consider the short-term $(RAP^{T_3}_c)$ and long-term $(RAP^{T_1})$ resource allocation problems. For every cluster c, $(RAP^{T_3}_c)$ results straightforwardly from $(RAP^{T_2}_c)$ upon fixing the set of servers in the active state ($x^t_s$) and the assignment of applications to each server ($z^t_{s,r,\tau}$, $w^t_{s,r}$). The decision variables are the operating frequencies of the servers $y^t_{s,f}$, the request volumes at each server $\lambda^t_{s,r,\tau}$, and the fraction of capacity for executing each application at each server $\phi^t_{s,r,\tau}$. The objective function and the constraints are as in $(RAP^{T_2}_c)$ upon replacing $\Lambda^t_r(T_2)$ by the workload at timescale $T_3$, $\Lambda^t_r(T_3)$, in Constraints (2) and (6). For the long-term $(RAP^{T_1})$ problem, the CM determines the optimal class partition $\mathcal{R}_c$ and cluster-server assignment $\mathcal{S}^t_c$ for each time interval $t = 1,\dots,n_{T_2}$. In this case, all clusters c and all time intervals t contribute to the objective function $\sum_{t=1}^{n_{T_2}}\sum_{c\in\mathcal{C}} F^t_c$, where $F^t_c$ is the objective function of the medium-term $(RAP^{T_2}_c)$ problem. It is worth noting that, as established in [39], because the CM objective function (i.e., the aggregation function for the entire system) is obtained as the sum of the objective functions of the individual AMs, the decentralized optimal solution obtained by our hierarchical distributed algorithm is equally as good as the optimal solution obtained by a corresponding centralized optimization approach.

5 SOLUTION OF OPTIMIZATION PROBLEMS

In this section, we describe an efficient and effective approach for solving the computationally hard optimization problems of the previous section. We first consider the medium-term $(RAP^{T_2}_c)$ and short-term $(RAP^{T_3}_c)$ resource allocation problems, then turn to the long-term $(RAP^{T_1})$ optimization problem, and end by briefly discussing the interactions among them.

5.1 Medium- and Short-Term Solutions

In this section, we discuss the optimization techniques used for solving the medium-term $(RAP^{T_2}_c)$ and short-term $(RAP^{T_3}_c)$ resource allocation problems. Computational results demonstrate that only small instances of these problems can be solved to optimality with commercial nonlinear optimization packages. To handle representative cloud service center problem sizes, we develop a heuristic approach (see Algorithm 1). Three basic components comprise our overall solution approach:

Step 1. An initial solution for the medium-term problem is found by applying a greedy algorithm (Initial_Solution() in Algorithm 1) whose output is the set of active servers $x$, their corresponding operating frequencies $y$, the assignment of tiers to servers $z$, an initial capacity allocation $\phi$, and an initial load-balancing vector $\lambda$. The details are provided in Section 5.1.1.

Step 2. A fixed-point iteration (FPI) is used to search for improvements on the initial capacity allocation and load-balancing decisions. As described in Section 5.1.2, the FPI solves the load-balancing and capacity allocation problems by changing $\lambda$ while keeping $\phi$ fixed, and vice versa.

Step 3. A local search procedure iteratively improves the latest solution by exploring its neighborhood and using so-called moves. The details are discussed in Section 5.1.3.

Algorithm 1. ResourceAllocation() optimization procedure.
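Because the pseudocode figure for Algorithm 1 is not reproduced here, the following skeleton reconstructs its control flow from the description in this section; all procedures are hypothetical stand-ins bundled in `steps`, and `profit()` evaluates the objective $F^t_c$:

```python
def resource_allocation(sol, steps):
    """Control-flow sketch of Algorithm 1 (Steps 1-12) as described in
    the text; not the authors' actual implementation."""
    sol = steps.initial_solution(sol)              # Step 1: greedyCore()
    sol = steps.fpi(sol)                           # Step 2: fixed-point iteration
    while True:
        sol = steps.ls_with_availability(sol)      # Step 7: intensification
        better = steps.fpi(sol)                    # Step 8: optimal lambda, phi
        if steps.profit(better) > steps.profit(sol):
            sol = better                           # Step 9: keep improving
            continue
        relaxed = sol
        for _ in range(steps.max_iterations):      # Step 10: fixed number of
            relaxed = steps.ls_without_availability(relaxed)  # Step 11: diversify
        repaired = steps.remove_violations(relaxed)  # Step 12: reenter feasibility
        if steps.profit(repaired) <= steps.profit(sol):
            return sol                             # accept the current result
        sol = repaired
```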
5.1.1 Initial Solution

The process for building an initial solution (Algorithm 1, Step 1) is based on the greedy procedure greedyCore(), illustrated in Fig. 3, whose rationale is to assign application tiers to servers so that the CPU utilization of each server stays below a given threshold $\bar{U}$. The value of $\bar{U}$ is set between 50 and 60 percent, in line with literature proposals (e.g., [17], [47]). Let $C_s$ and $b_s$, respectively, denote the values of $C_{s,f}$ and $b_{s,f} + p_{s,f}$ for $y_{s,f} = 1$. Initially, greedyCore() orders the application tiers according to nondecreasing values of $W_{r,\tau} = res_{r,\tau}\,\mu_{r,\tau}^{-1}\sum_{\tau'\in\mathcal{A}_r}\mu_{r,\tau'}$, where $res_{r,\tau}$ is the remaining workload to be allocated, and also orders the servers according to nondecreasing values of the ratio $C_s/b_s$. Then, the procedure iteratively selects a (partially allocated) application tier and assigns class r requests at tier $\tau$ to a server s. The process is repeated until $res_{r,\tau}$ is completely allocated, while guaranteeing that all server utilizations are at most $\bar{U}$ and that the RAM constraint (8) and the availability constraint (9) are satisfied. A server is set to operate at the maximum frequency upon switching into the active state (Fig. 3, Steps 8-11). If possible, the workload $res_{r,\tau}$ is allocated entirely on server s and, after updating the server utilization, the next tier is selected (Fig. 3, Step 16). However, if $res_{r,\tau}$ cannot be allocated completely on the single server s, then the workload is partially assigned to the server so as to reach the maximum utilization $\bar{U}$, and the remaining load (computed in Step 15) is allocated on subsequent servers.

The construction of the initial solution starts at $t = 1$ by applying greedyCore() with all servers in cluster c set to the low-power sleep state and all remaining workloads to be allocated set to the workload prediction for the time interval. Then, for $t = 2,\dots,n_{T_2}$, the initial solution is constructed starting from the solution of the medium-term problem for time interval $t - 1$, used as input to greedyCore(), apart from the load balancing, which is computed

Fig. 3. greedyCore(): application-to-server allocation.

proportionally to the previous solution. When this saturates a VM, the algorithm first tries to set the physical server to the maximum frequency and then, if this is not sufficient, it removes from the server a workload $res_{r,\tau}$ such that the maximum utilization of the corresponding VM is $\bar{U}$. The computational complexity of the greedyCore() procedure is

$$O\left(\max\left\{|\mathcal{S}_c|\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|,\ \sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|\log\Big(\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|\Big),\ |\mathcal{S}_c|\log(|\mathcal{S}_c|)\right\}\right),$$

where the first term is the number of assignments performed in a worst-case scenario, i.e., when at each iteration the first server with sufficient RAM and residual computing capacity is last in the server ordering. The remaining terms, $O(\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|\log(\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|))$ and $O(|\mathcal{S}_c|\log(|\mathcal{S}_c|))$, are the computational complexities for sorting the ordered lists of application tiers and servers, respectively.

5.1.2 The Fixed-Point Iteration

Once the application tiers and operating frequencies are assigned to servers (i.e., $x$, $y$, and $z$ are fixed), the capacity allocation at each server and the load balancing can be improved through an FPI technique (Algorithm 1, Step 2). The idea is to iteratively identify the optimal value of one set of variables ($\lambda$ or $\phi$) while fixing the value of the other ($\phi$ or $\lambda$). Both problems, capacity allocation and load balancing, are separable into subproblems that can be solved independently. The solutions of all these subproblems can be obtained in closed form via the Karush-Kuhn-Tucker conditions (see [8] for the details). It is also shown in [8] that 1) even though the FPI procedure is not guaranteed to converge to a global optimum, it is asymptotically convergent, and 2) the number of iterations required for the FPI to converge within a precision of $10^{-5}$ is typically between 6 and 10.
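A generic sketch of the alternation just described, with the two closed-form KKT-based subproblem solvers of [8] abstracted as callables (`best_lam` and `best_phi` are hypothetical stand-ins):

```python
def fixed_point_iteration(lam0, phi0, best_lam, best_phi,
                          tol=1e-5, max_iter=100):
    """Alternate load balancing (lambda given phi) and capacity
    allocation (phi given lambda) until the vectors stabilize.

    `lam0` and `phi0` are lists; convergence is asymptotic and not
    guaranteed to reach a global optimum (see [8]).
    """
    lam, phi = lam0, phi0
    for _ in range(max_iter):
        new_lam = best_lam(phi)       # load-balancing subproblem
        new_phi = best_phi(new_lam)   # capacity allocation subproblem
        delta = max(abs(a - b) for a, b in zip(new_lam + new_phi, lam + phi))
        lam, phi = new_lam, new_phi
        if delta < tol:
            break
    return lam, phi
```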
5.1.3 Neighborhood Exploration

The FPI yields a feasible solution for the problem of interest (medium or short term) that represents a local optimum with respect to the decision variables $\lambda$ and $\phi$. Since both the medium-term and short-term problems are not convex, we can improve further by applying a local search procedure that identifies a local optimum with respect to the integer variables $x$, $y$, and $z$; we then update the continuous variables $\lambda$ and $\phi$ in a greedy manner. Given a solution $\sigma$, the local search procedure explores its neighborhood $N(\sigma)$ to find a better solution $\sigma'$ and then iterates using $N(\sigma')$ as the new neighborhood until no further improvement can be found. The local search neighborhood $N(\sigma)$ is defined as the set of solutions that can be obtained from solution $\sigma$ via a set of moves. Clearly, the choice of moves is crucial to achieving good results with a local search procedure. The moves we consider are based on those proposed in [8], properly modified to take the availability constraints into account. We first describe the original moves and the modifications necessary to tackle the availability constraints. We then explain how the neighborhood search can be modified to improve the exploration properties of our local search strategy. The moves are: setting a server into the active or low-power sleep state, scaling the server frequency, and reallocating tiers to servers.

Move M1: Set servers in low-power sleep state. Let $S_{active}$ denote the set of servers in the active state. A server $s \in S_{active}$ with utilization $U_s \in [\min_s U_s,\ \alpha \min_s U_s]$ (with $\alpha \in [1.1, 1.2]$ set experimentally) is considered to be lightly loaded and a candidate for switching into the low-power sleep state. The load of such a candidate server, say $\hat{s}$, is allocated over the remaining servers proportionally to their spare

capacity, where the spare capacity for tier $\tau$ of class r requests at server s is $SC_{s,r,\tau} = C_s\,\mu_{r,\tau}\,\phi_{s,r,\tau} - \lambda_{s,r,\tau}$. Let $SC_{r,\tau} = \sum_{s\in S_{active}\setminus\{\hat{s}\}} SC_{s,r,\tau}$ be the overall spare capacity available at the remaining active servers for tier $\tau$ of class r. The load at server $\hat{s}$ is (suboptimally) assigned to the remaining servers by setting $\lambda_{s,r,\tau} \leftarrow \lambda_{s,r,\tau} + \lambda_{\hat{s},r,\tau}\,SC_{s,r,\tau}/SC_{r,\tau}$, and this assignment is then improved by solving an instance of the load-balancing problem over only the subset of involved servers. The computational complexity of the neighborhood exploration is $O(|\mathcal{S}_c|\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|)$ in the worst-case scenario, where all servers are in the active state with the same utilization and every application tier is allocated on every server. Switching a server into the low-power sleep state and moving its load to the remaining servers reduces availability. If this reduction causes a violation of the availability constraint, then the server is not switched into the low-power sleep state, even if it is lightly loaded.

Move M2: Set servers in active state. A server $s \in S_{active}$ with $U_s \in [\beta\,\max_s U_s,\ \max_s U_s]$ (with $\beta \in [0.9, 0.95]$) is considered to be a bottleneck server. To reduce the load at a bottleneck server, say $s_1$, a server in the low-power sleep state, say $s_2$, with $RAM_{s_2}$ greater than or equal to the memory allocated on server $s_1$, is set into the active state. Server $s_2$ applies the same capacity allocation adopted by $s_1$ and runs at the maximum frequency. The optimal load balancing is identified by solving an instance of the load-balancing problem considering only servers $s_1$ and $s_2$. The computational complexity of the neighborhood exploration is $O(|\mathcal{S}_c|^2\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|)$, reached when the system is running with $|\mathcal{S}_c|/2$ servers in the active state with almost the same utilization.

Move M3: Frequency scaling. This move updates the server frequency by one step: a bottleneck server's frequency is increased, and a lightly loaded server's frequency is decreased (only if Constraint (7) is not violated). The rationale is that the increased cost of using resources can be balanced by improved system performance, while energy savings can be counterbalanced by worsened response times. Since this move does not reallocate VMs, the availability constraints cannot be violated and no modification is needed.

Move M4: Reallocation of application tiers to servers. This move seeks to reallocate application tiers to servers in the active state to find more profitable solutions. Since the FPI cannot allocate new application tiers on servers (see [8]), the goal is to look for application tiers that can be allocated on a different set of servers. If an application tier is deallocated from server s, the corresponding z variable is set to zero, and a tier that previously could not be executed on server s due to a memory constraint can now be allocated to it, exploiting these new degrees of freedom to allocate new application tiers on such servers. To this end, we identify a destination server $s_1$ among all the active servers with sufficient memory capacity such that $U_{s_1} < \bar{U}$ (with $\bar{U} \in [50\%, 60\%]$). We then identify a candidate tier $(r_1, \tau_1)$ to be moved, and the search continues until an improvement is identified, where we look for a source server $s_2$ with $U_{s_2} > \bar{U}$.
Since the sum of the $\phi$ variables on each server equals 1 in the optimal solution of the capacity allocation problem (see [8]), the spare capacity required to allocate $(r_1, \tau_1)$ is created by setting $\phi_{s_1,r,\tau} \leftarrow \lambda_{s_1,r,\tau}/(C_{s_1}\,\mu_{r,\tau}\,\bar{U})$ for all application tiers currently executed on server $s_1$ (i.e., guaranteeing $U_{s_1} = \bar{U}$). The capacity available at server $s_1$ for the new application tier is then $C_{s_1}\,\mu_{r_1,\tau_1}\,\phi_{s_1,r_1,\tau_1}$, and the load of application tier $(r_1, \tau_1)$ is balanced between $s_1$ and $s_2$ by solving the corresponding restricted instance of the load-balancing subproblem. The computational complexity of the neighborhood exploration is $O(|\mathcal{S}_c|^2 (\sum_{r\in\mathcal{R}_c}|\mathcal{A}_r|)^2)$ in the worst-case scenario, where $|\mathcal{S}_c|$ servers are in the active state, $|\mathcal{S}_c|/2$ servers have a utilization less than $\bar{U}$, and memory requirements allow every application tier to be allocated on every server. If a tier is moved to a server with a lower availability, or the number of physical servers supporting request class r is increased, this move may lead to an availability constraint violation. In this case, the move is discarded. (A toy sketch of Move M1's load redistribution is given below.)
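As announced above, a toy sketch of Move M1's proportional load redistribution for a single (class, tier) pair, using the spare-capacity formula $SC_{s,r,\tau} = C_s\,\mu_{r,\tau}\,\phi_{s,r,\tau} - \lambda_{s,r,\tau}$; all numbers are hypothetical:

```python
def redistribute_load(lam, C, phi, mu, victim, active):
    """Spread the victim server's rate over the remaining active
    servers in proportion to their spare capacities (Move M1).

    `lam`, `C`, and `phi` map a server id to its current rate,
    capacity, and reserved CPU fraction for this (class, tier).
    """
    rest = [s for s in active if s != victim]
    spare = {s: C[s] * mu * phi[s] - lam[s] for s in rest}
    total_spare = sum(spare.values())
    assert total_spare > lam[victim], "not enough spare capacity"
    new_lam = dict(lam)
    for s in rest:
        new_lam[s] += lam[victim] * spare[s] / total_spare
    new_lam[victim] = 0.0
    return new_lam

# Switch server 'a' (10 req/s) off, spreading its load over 'b' and 'c'.
lam = {"a": 10.0, "b": 30.0, "c": 20.0}
C = {"a": 1.0, "b": 2.0, "c": 2.0}
phi = {"a": 0.5, "b": 0.5, "c": 0.5}
print(redistribute_load(lam, C, phi, mu=50.0, victim="a",
                        active=["a", "b", "c"]))
# {'a': 0.0, 'b': 34.0, 'c': 26.0}: proportional to spares 20 and 30.
```

As in the text, this proportional assignment is only a starting point; the result would then be refined by solving the restricted load-balancing problem over the involved servers.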

5.1.4 Local Search Implementation

We are now ready to describe the overall local search procedure. Given the current solution $\sigma$, the Local_Search_withAvailability() procedure (Algorithm 1, Step 7) explores its neighborhood to find the best solution in $N(\sigma)$, constructed using all moves for the medium-term problem (modified, as already discussed, for availability feasibility) and only the frequency-scaling moves for the short-term problem. If no further improvements can be found, Local_Search_withAvailability() stops at a local optimum. We then perform an FPI step (Algorithm 1, Step 8) seeking to improve the current solution: we try to escape from the current local optimum by optimally updating $\lambda$ and $\phi$, instead of approximating their values as done in the neighborhood exploration. If the FPI succeeds in finding an improved solution, the Local_Search_withAvailability() procedure starts again; otherwise, we accept the current result (Algorithm 1, Step 9).

The overall procedure, however, is not finished. Note that the availability constraints can be very difficult to satisfy because they conflict with the load-balancing constraints and the objective function. In fact, the availability constraints seek a more fragmented load, while the objective function seeks a reduction in the number of servers. The introduction of availability constraints fragments the feasible set and reduces the likelihood of moving from one feasible solution to another, possibly preventing us from reaching a good solution. Therefore, when the Local_Search_withAvailability() procedure fails to find an improvement, we consider a more complex process that allows a wider exploration: Local_Search_withoutAvailability() (Algorithm 1, Step 11). The current solution $\sigma$ represents the best feasible solution found thus far by Local_Search_withAvailability(), and the neighborhood $N(\sigma)$ is now constructed using all moves while ignoring the availability constraints (hence, some moves might be infeasible). All violations are recorded in a list, which will be useful in subsequent steps. By relaxing the availability constraints and creating a larger search space, we have more chances of moving toward better solutions. The Local_Search_withoutAvailability() procedure is repeated for a fixed number of iterations (Algorithm 1, Step 10), after which we must find a feasible solution through the procedure Remove_Violations() (Algorithm 1, Step 12) to reenter the feasible region. Remove_Violations() switches additional servers into the active state to handle the request classes whose availability constraints are violated. We then repeat Local_Search_withAvailability() followed by the FPI procedure. We observe that switching additional servers into the active state (to remove violations) worsens the objective function value by at most the cost of those servers. However, because switching a server into the active state improves system performance and, therefore, SLA revenues, the worsening of the objective function value will generally be less than the cost of the additional servers. As shown in Section 6, when the current solution returns to the feasible region, Algorithm 1 does not get stuck in a local optimum, and the current solution can be further improved by the moves in Local_Search_withAvailability().

Our strategy combines the two main features of a classical local search strategy: intensification and diversification. Local_Search_withAvailability() performs the intensification step, refining the current feasible solution through exploration of its neighborhood. Local_Search_withoutAvailability() performs the diversification step, greedily improving the solution quality by allowing solutions to become infeasible.

For the short-term problem, we simply determine the load-balancing, capacity allocation, and frequency-scaling results because the additional system overhead is insignificant. The overall optimization procedure (Algorithm 1) starts from the current solution, and the FPI is applied to locally improve the load-balancing and capacity allocation results. Then the local search Local_Search_withAvailability() (Algorithm 1, Step 7) is started, with moves based solely on frequency scaling, followed by an FPI step (Algorithm 1, Step 8) to possibly improve the current solution, as for the medium-term problem. However, because frequency scaling cannot introduce violations, we eliminate the Local_Search_withoutAvailability() and Remove_Violations() procedures. Hence, our solution scheme for the short-term problem is much simpler than that for the medium-term problem, thus supporting the decision to solve the short-term problem at a finer-grained timescale.

5.2 Long-Term Solution

Our hierarchical optimization framework is composed of a CM and a set of AMs (see Section 3.1). Each AM controls a subsystem of the cloud service center, operating on a cluster that includes a subset of the applications and resources (physical servers). Each AM optimally assigns the available servers within its cluster to the given subset of applications, employing the algorithm described in the previous section. The CM determines the partition of applications and the assignment of servers to each AM, where the algorithm operates at a timescale of $T_1$ (24 hours) and plans the resources for $n_{T_2}$ time intervals. More precisely, 1) at the beginning of each day, all request classes are partitioned (i.e., $\mathcal{R}_c$ is determined) using a heuristic procedure based on a prediction of the workload profile for the entire day, where the class partitioning considers $n_{T_2}$ workload forecasts (one for each hour); and 2) computing capacity is assigned to each partition at a timescale of $T_2$ based on the workload of each individual time interval (i.e., $\mathcal{S}^t_c$ is determined).

The rationale behind this class-partitioning approach is to obtain an overall workload profile that is as homogeneous as possible. In this way, the number of server switches and VM migrations at timescale $T_2$ can be reduced by assigning to the same partition classes characterized by complementary profiles, namely classes such that when the workload of one class is at a peak, the other is at a minimum (see Fig. 4). Another factor that needs to be carefully taken into account is the number of clusters $|\mathcal{C}|$ generated (or, alternatively, the maximum number of classes M included in a partition). This factor influences both the time required to find good partitions and the quality of the solutions achieved by the overall framework. In fact, introducing a large number of partitions reduces the time to solve the individual medium-term problems, but each cluster then has a limited amount of computing capacity, making the sharing of resources among classes more difficult at finer timescales. On the other hand, with a small number of partitions, resource sharing among different classes is possible at the expense of increased execution time and complexity for the medium-term algorithm (with a limited number of clusters, the framework behaves like the centralized approach, whose large execution times are not suitable for very large-scale service centers). The most profitable size for individual clusters has been determined experimentally, and results have shown that our solution is robust with respect to this dimension (see the next section).

Our partitioning algorithm is an iterative procedure that starts from the largest possible set of partitions (i.e., where each partition contains a single class) and then iteratively tries to aggregate them two by two (see Algorithm 2). Before presenting the details of the partitioning algorithm, some preliminary definitions are needed. In particular, we need to introduce a similarity measure among workload profiles to implement the intuitive idea of aggregating classes with complementary profiles.

Algorithm 2. ClassPartitioning(): The class partitioning algorithm.

Definition 1 (Area Under Workload Profile). Since all $n_{T_2}$ workload sampling steps have the same length, we define the area under the predicted workload of request class $r$ as the area under the function $\lambda_r^t(T_2)$: $A(r) := \frac{1}{T_2} \sum_{t=1}^{n_{T_2}} \lambda_r^t(T_2)$.

Definition 2 (Maximum Request Function). For a given set of classes $P_n$, this function gives, at every time interval $t$, the maximum arrival rate among the classes $r \in P_n$ in the partition: $g(P_n, t) := \max_{r \in P_n} \lambda_r^t(T_2)$. Let $A(g(P_n))$ denote the area under this function, omitting the time argument for simplicity.

Definition 3 (Area Summation). The area summation of class partition $P_n$, denoted $A(P_n)$, at the current iteration of the partitioning algorithm is the sum of the areas under the workload profiles of the classes assigned to $P_n$ in the previous iterations. We therefore have $A(P_n) := \sum_{r \in P_n} A(r)$ and $A(P_i \cup P_j) := \sum_{r \in P_i \cup P_j} A(r) = A(P_i) + A(P_j)$, where the latter equation follows from Definition 3.

Upon clustering two classes into a partition, we seek to overlay the peaks of one class with the valleys of the other, so that resources unused by the first class can serve requests of the second, and vice versa. More generally, we need to decide whether to merge two subsets of classes into a partition. To quantify the desirability of merging two sets of classes, we define the area ratio criterion as follows:

Definition 4 (Area Ratio). Let $P_i$ and $P_j$ be two sets of request classes. Then, the area ratio is given by $AR(P_i, P_j) := A(g(P_i \cup P_j)) / A(P_i \cup P_j)$.

The area ratio relates the maximum request function to the area summation. If we evaluate whether to merge two sets of classes $P_i$ and $P_j$ whose workload profiles are similarly distributed, then the maximum request function is only slightly modified, while the area summation increases by $A(P_j)$; the ratio therefore decreases. If, instead, two partitions of classes with complementary distributions are evaluated, the maximum request function of their union is (approximately) the sum of the maximum request functions of the two partitions, while the area summation increases by the same value, thus maximizing the ratio.

Consider the example workload profiles reported in Fig. 5. Intuitively, clustering classes 1 and 3 is less desirable than clustering classes 1 and 2 (which are complementary). It is easy to verify that the area ratio in the former case equals 0.5, while in the latter case it equals 1.

Fig. 5. Example of workloads for the evaluation of the AR criterion.
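As a concrete illustration of Definitions 1-4, the following Python fragment recomputes the two area ratios of the Fig. 5 example on hypothetical four-sample profiles. The constant $1/T_2$ cancels in the ratio and is therefore omitted.

    def area(profile):
        # A(r) up to the constant 1/T_2, which cancels in the area ratio
        return sum(profile)

    def area_ratio(classes_i, classes_j):
        union = classes_i + classes_j
        g = [max(vals) for vals in zip(*union)]       # maximum request function g
        return area(g) / sum(area(c) for c in union)  # A(g(Pi U Pj)) / A(Pi U Pj)

    class1 = [10, 0, 10, 0]
    class2 = [0, 10, 0, 10]   # complementary to class 1
    class3 = [10, 0, 10, 0]   # same shape as class 1

    print(area_ratio([class1], [class3]))  # 0.5: similar profiles, undesirable merge
    print(area_ratio([class1], [class2]))  # 1.0: complementary profiles, desirable merge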
Algorithm 2. ClassPartitioning(): The class partitioning algorithm.

We can now describe the partitioning procedure outlined in Algorithm 2. First, the list of current partitions ($L_1$) is constructed and initialized with partitions formed by a single class (Step 3). Then, a second list ($L_2$) is built with all pairs of partitions formed by elements of $L_1$, and their area ratios are evaluated (Steps 4-5). Elements are inserted into $L_2$ in nonincreasing area ratio order (Step 5), and the iterative merging procedure begins. A pair of partitions is merged when two criteria hold: $AR$ is above a threshold $T$ ($T = 0.7$, chosen experimentally), and the cardinality of the new partition is less than $M$. Hence, we avoid creating partitions that are too large or not profitable with respect to $AR$. By choosing pairs of partitions following the list order, we merge the partitions with the best $AR$ first. Every time a new partition is formed, it is inserted into $L_1$, and the new pairs obtained with this partition are added to $L_2$, recalculating their $AR$ values (Steps 11-15). The algorithm stops when no more partitions can be merged. The final partition set determined at the end of Algorithm 2 identifies the request class partition $\{R_c\}$ used as input by the medium-term algorithm.

The computational complexity of Algorithm 2 is $O(|R|^3)$. In the worst case, the repeat-until cycle is performed $|R|/M$ times (recall that initially we have $|R|$ partitions, each including a single class) and only one class is merged into the new partition at each iteration. The inner cycles consider $|R|^2$ pairs in the first iteration, $(|R|-1)^2$ in the second, and $(|R| - |R|/M + 1)^2$ in the last.
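The greedy merging loop of Algorithm 2 can be summarized by the sketch below, which reuses area_ratio() from the previous fragment. The list handling is simplified (pairs are rebuilt after every merge instead of being updated incrementally as in Steps 11-15), so this is a readable approximation rather than the paper's exact procedure.

    def class_partitioning(profiles, threshold=0.7, max_size=10):
        """profiles: list of per-class workload vectors. Returns a list of
        partitions, each a list of class indices."""
        partitions = [[r] for r in range(len(profiles))]   # L1: singletons
        while True:
            pairs = [(area_ratio([profiles[r] for r in Pi],
                                 [profiles[r] for r in Pj]), i, j)
                     for i, Pi in enumerate(partitions)
                     for j, Pj in enumerate(partitions) if i < j]
            pairs.sort(reverse=True)                        # best AR first
            for ar, i, j in pairs:
                # merge only if AR exceeds T and the partition stays bounded by M
                if ar > threshold and len(partitions[i]) + len(partitions[j]) <= max_size:
                    merged = partitions[i] + partitions[j]
                    partitions = [P for k, P in enumerate(partitions) if k not in (i, j)]
                    partitions.append(merged)
                    break                                   # rebuild the pair list
            else:
                return partitions                           # no pair can be merged

Rebuilding the pair list after each merge keeps the sketch short while preserving the stated worst-case behavior of roughly $|R|^2$ pair evaluations per merge.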

After assigning classes to partitions, the CM determines the set of servers $S_c^t$ assigned to each cluster $c$ and time interval $t$, optimizing the resource allocation across successive time intervals. An initial solution is found using the Initial_Solution() procedure of Algorithm 1, and this solution is then improved through a local search phase. The local search tries to add or remove capacity from partitions, using servers from the Free Server Pool and/or moving servers from one partition to another. The CM considers three types of moves and neighborhoods for this local search phase, and each neighborhood is explored in succession to find additional solution improvements. The neighborhoods are explored in a circular fashion (after the last neighborhood, the first one is explored again) until no further improvement can be found. The quality of a solution is based on its expected revenue, which is calculated by the medium-term algorithm (AM) under the workload profile forecast for the given time interval. To reduce the computational time, a simplified version of the medium-term algorithm is used in which only server on/off moves are allowed. Several AMs are run in parallel, thus exploiting current multicore architectures.

The first and second types of moves increase and decrease the computing capacity of a partition, respectively, by moving servers from the Free Server Pool to partitions or from partitions back to the Free Server Pool. The third type of move focuses on resource sharing, allowing different partitions to exchange some of their allocated capacity. The sets of servers identified in this way, $S_c^t$, are assigned to each cluster $c$ for each time interval $t$ of the following day.
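A minimal sketch of the circular neighborhood exploration used by the CM is given below. Here, neighborhoods is assumed to hold three generator functions (add a server from the Free Server Pool, return a server to the pool, exchange servers between two partitions), and revenue() stands for the simplified medium-term evaluation with only server on/off moves; both are hypothetical names.

    def cm_local_search(solution, neighborhoods, revenue):
        """Circularly explore the CM neighborhoods until a full cycle
        yields no improvement (hypothetical signatures)."""
        idx, since_improvement = 0, 0
        while since_improvement < len(neighborhoods):
            neighbors = list(neighborhoods[idx % len(neighborhoods)](solution))
            best = max(neighbors, key=revenue, default=None)
            if best is not None and revenue(best) > revenue(solution):
                solution, since_improvement = best, 0   # restart the no-improvement count
            else:
                since_improvement += 1                  # this neighborhood is exhausted
            idx += 1
        return solution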
5.3 Combined Solution

Fig. 6. Interactions among individual solution components at multiple timescales.

In Fig. 6, we summarize how the different components of our hierarchical solution interact. The CM, using predictions of the workload one day ahead, determines the request class partitions ($R_c$) and assigns them to the different AMs. The sets of servers that will be controlled by the individual AMs for each time interval ($S_c^t$ with $t = 1, \ldots, n_{T_2}$) are also identified. The ClassPartitioning() procedure and the ResourceAllocation() procedure exploit only a subset of moves (Moves M1 and M2) in this phase. The decisions are actuated during the following day. Then, at runtime, each AM solves a series of optimization problems of type $(RAP_c^{T_2})$, one for each time interval in $1, \ldots, n_{T_2}$, using as workload estimates the predictions that can be obtained from runtime monitoring data one hour in advance. At timescale $T_2$, the full ResourceAllocation() algorithm is executed, and the active servers and application placements chosen at this stage are held fixed for the following hour. Finally, at timescale $T_3$, each AM solves the problem $(RAP_c^{T_3})$, considering as decision variables only capacity allocation, load balancing, and CPU frequencies, because these introduce a very small system overhead. Control decisions are obtained considering the short-term workload predictions $\lambda_r(T_3)$ and executing a simplified version of the local search algorithm (i.e., Local_Search_withAvailability() limited to frequency scaling operations and the FPI() procedure).

Note that the AMs execute the medium- and short-term algorithms possibly with more accurate predictions of the incoming workload, $\lambda_r^t(T_2)$ and $\lambda_r^t(T_3)$, as input, relative to the same estimates evaluated one day or one hour ahead, respectively. If an AM acting at timescale $T_2$ or $T_3$ cannot find a feasible solution, then a recovery algorithm executing the full ResourceAllocation() procedure is run with an additional set of servers. We assume that additional servers are always available in the Free Server Pool and that a feasible solution can always be found, i.e., the service center is overprovisioned. (In the opposite case, our algorithms can provide valuable information to service center operators on the clusters and request classes that lead to infeasibility; i.e., our solution would help to plan investments in new servers or to identify request classes with critical SLAs.)
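The recovery logic can be summarized as follows; this is a sketch under the overprovisioning assumption stated above, and am, free_pool, and the method names are hypothetical.

    def am_control_step(am, forecast, free_pool, batch=1):
        """Run an AM at timescale T2 or T3; on infeasibility, fall back to the
        full ResourceAllocation() with extra servers from the Free Server Pool."""
        solution = am.solve(forecast)                 # medium- or short-term algorithm
        while solution is None:                       # infeasible: trigger recovery
            am.servers.extend(free_pool.take(batch))  # assumes the pool is never empty
            solution = am.resource_allocation(forecast)
        return solution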

6 NUMERICAL ANALYSIS

Our hierarchical framework has been implemented in Java, and the bytecode has been optimized by the Sun Java 1.6 just-in-time compiler. The resource management algorithms have been evaluated under a variety of system and workload configurations. Section 6.1 presents the settings for our experiments, while Section 6.2 reports the scalability results for our algorithms. Section 6.3 evaluates the impact of availability constraints on the net revenues for the PaaS provider and compares the solution proposed in this paper with our previous work [8]. Section 6.4 analyzes the interrelationships among the multiple timescales considered in this paper with respect to the incoming workload prediction error. Finally, Section 6.5 presents a cost-benefit evaluation of our solution in comparison with other heuristics and state-of-the-art techniques [50], [41], [61], [57], [6].

6.1 Experimental Settings

Experimental analyses have been performed considering synthetic workloads built from the trace of requests registered at an IBM data center and, for the analysis in Section 6.3, at the website of a large university in Italy. The IBM data center traces record the memory and CPU utilization of over 10,000 customer VMs running development, test, and production servers of multitier Internet-based systems, gathered over three months during 2010 at a timescale of 15 minutes. From the CPU utilization, we derived the incoming load $\lambda_r$ by applying the utilization law [19], i.e., by assuming that the workload is proportional to the CPU usage. The university website trace contains the number of sessions registered over 100 physical servers, on an hourly basis, for a one-year period in 2006. Data collected from the logs have been interpolated linearly and oversampled, also adding some noise, to obtain workload traces varying every 5, 10, and 15 minutes, as in [35]. Realistic workloads are built by assuming that the request arrivals follow nonhomogeneous Poisson processes and by extracting the request traces corresponding to the days with the highest workloads experienced during the observation periods.

All tests have been performed on an Intel Nehalem dual-socket quad-core system with 24 GB of RAM running Ubuntu Linux, considering a large set of randomly generated instances. The performance parameters of the applications and the infrastructural resource costs have been randomly generated uniformly in the ranges reported in Table 2, as in [4], [8], [3], [35], according to the commercial fees applied by IaaS/PaaS cloud providers [5], [38] and the energy costs available in the literature [27].

TABLE 2 Performance Parameters and Time Unit Cost Ranges

The number of servers $|S|$ has been varied between 20 and 7,200. The number of servers is scaled with the workload to obtain an average server utilization equal to 80 percent when the aggregate workload of all classes reaches its peak over the 24 hours. The number of request classes $|R|$ has been varied between 20 and 720, and the maximum number of tiers $|A_r|$ has been varied between 2 and 4. In each test case, the number of tiers $|A_r|$ may differ across classes. In this way, we can consider the execution of both the front end and back end of simple applications deployed in the cloud (e.g., Web and Worker roles in Microsoft Azure terminology [38]) as well as more complex systems exploiting advanced middleware solutions available in PaaS offerings (e.g., Windows Azure Queues and Amazon SQS [5]), involving a larger number of tiers.
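For reference, the load derivation mentioned above follows directly from the utilization law $U = X \cdot D$. A minimal sketch (the 20 ms service demand is an arbitrary illustrative value, not taken from the traces):

    def arrival_rate_from_utilization(cpu_utilization, service_demand_s):
        """Utilization law: U = X * D, hence the throughput (and, in a stable
        open system, the arrival rate) is X = U / D."""
        return cpu_utilization / service_demand_s

    # a VM at 60% CPU whose requests each need 20 ms of CPU serves ~30 req/s
    print(arrival_rate_from_utilization(0.60, 0.020))  # 30.0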
The value ranges adopted for the SLA revenue parameters allow us to obtain revenues for running a service application on a single CPU for one hour varying between $1 and $10, in line with current commercial fees (see, e.g., the Amazon RDS service based on Oracle hosted on Amazon EC2). $R_r$ is proportional to the number of tiers $|A_r|$ and to the overall demanding time of class $r$ requests at the various tiers (i.e., $R_r = 10 \sum_{i \in A_r} 1/\mu_{i,r}$). Servers with six P-states (voltage and frequency settings) have been considered; the operating frequencies are 2.8, 2.4, 2.2, 2.0, 1.8, and 1.0 GHz, which imply 95-, 80-, 66-, 55-, 51-, and 34-W power consumption, according to the energy models provided in [43]. The overhead of the cooling system has been included through the data center PUE, according to the values reported in [27], and the cost of energy has been varied between 0.05 and 0.25 $/kWh. The switching cost of servers $cs_s$ has been evaluated as in [35], considering the reboot cost in terms of consumed energy, while the cost associated with VM migration $cm$ has been obtained as in [49].

The comparison of different solutions is based on the evaluation of the response times through M/G/1 models and on the objective function values at multiple timescales. The validation of the medium- and short-term techniques in a real prototype environment has been reported in [8].
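The response-time evaluation relies on standard M/G/1 results. The sketch below implements the Pollaczek-Khinchine mean-value formula; the numeric values are illustrative, not taken from the experiments.

    def mg1_response_time(lam, es, es2):
        """Pollaczek-Khinchine mean response time for an M/G/1 queue.
        lam: arrival rate; es: E[S]; es2: E[S^2]. Requires rho = lam*es < 1."""
        rho = lam * es
        assert rho < 1, "queue must be stable"
        return es + lam * es2 / (2.0 * (1.0 - rho))

    # exponential service (E[S^2] = 2 E[S]^2) reduces to E[S] / (1 - rho):
    print(mg1_response_time(30.0, 0.020, 2 * 0.020**2))  # 0.05 s = 0.02 / (1 - 0.6)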

6.2 Algorithms Performance

Table 3 reports the average optimization time required by our resource management algorithms for problem instances of different sizes. The average values reported in this section have always been computed over 10 instances of the same size. The results show that the short-term algorithm is very efficient and can solve problem instances with up to 120 classes and 1,200 servers in less than one minute.

TABLE 3 Algorithms Execution Time

Conversely, considering the one-hour time constraint that characterizes the medium-term control implemented in modern cloud service centers, the medium-term algorithm can solve systems with up to 100 classes and 1,000 servers, motivating the implementation of a hierarchical approach. As for the long-term solution, problem instances including 720 classes, 7,200 servers, and around 60,000 VMs can be solved in about 16 hours; hence, the long-term algorithm is suitable for one-day-ahead planning.

To estimate the speedup and the accuracy of our hierarchical framework with respect to the centralized approach, we have compared the solution achieved by the long-term algorithm with the solution of the medium-term algorithm run without any optimization time limit. Results are reported in Table 4 and show that for small systems (up to 40 classes and 400 servers) the hierarchical algorithm introduces a significant overhead, because the speedup (evaluated as the ratio between the optimization times of the medium-term and long-term algorithms) is less than or equal to 1. As the size of the system increases, the speedup also increases, reaching a value close to 8, which is the theoretical speedup achievable on the system supporting our experiments. Furthermore, the gap between the objective function of the centralized solution computed over the 24-hour period and that of the long-term algorithm is about 10 percent. This gap can be explained by the fact that in the hierarchical framework the physical servers are partitioned among the clusters, while in the centralized solution without any time limit the physical servers can be shared among multiple classes; hence, the centralized solution has more degrees of freedom in performing the search. However, this loss of precision is significantly counterbalanced by the speedup achieved by the hierarchical framework, which makes it suitable for the resource management of very large-scale cloud service centers.

TABLE 4 Performance of the Hierarchical Algorithm

We have also investigated the optimal number of classes $M$ to be included in the individual clusters managed by the AMs. In particular, we have considered clusters with 10, 20, and 30 request classes per AM. Figs. 7 and 8 report the normalized objective function value and the optimization times over the 24 hours for some representative test cases with 20, 40, and 100 classes and 200, 400, and 1,000 servers. The results show that our solution is robust to variations of the parameter $M$. In the following sections, we set $M = 10$, because with this value the medium-term algorithm is slightly more efficient and can compute an optimal solution in less than one minute (see Table 3), so it can be used effectively at runtime as the recovery algorithm. Hence, if the short-term algorithm cannot find a feasible solution, the medium-term algorithm can be triggered at runtime by the AMs, and a new cluster configuration can be determined by exploiting the full set of actuation mechanisms available to the system (i.e., including server switches and VM migrations) in case of workload fluctuations or to compensate for workload prediction errors. As a final consideration, in Fig. 7 we report a zero optimization time for the centralized approach whenever the medium-term algorithm without any time limit cannot identify a feasible solution.
This plot shows that even for small-sized systems (with 40 classes and 400 servers) under heavy workload conditions, the centralized approach sometimes cannot find a feasible solution. Conversely, across the entire set of experiments we performed, the hierarchical framework always identified a feasible solution. Hence, the partitioning approach is also effective in identifying feasible solutions under critical workload conditions.

Fig. 7. Normalized objective function value for a service center with 20 classes and 200 servers (left) and 40 classes and 400 servers (right).

Fig. 8. Medium-term algorithm optimization time for a service center with 20 classes and 200 servers (left) and 100 classes and 1,000 servers (right).

Fig. 9. Execution trace of the medium-term algorithm with (right) and without (left) availability violations.

6.3 Availability Impact Analysis

In this section, we compare our medium-term algorithm with our solution in [8]. The algorithm presented in this paper extends our previous work by providing availability guarantees for the running applications and by implementing the diversification strategy that periodically relaxes availability Constraint (9) and then recovers a feasible solution. To show the effectiveness of our diversification procedure, Fig. 9 reports the execution trace of a representative run of our algorithms for a system including 50 classes and 400 servers. The x-axis reports the execution time, while the y-axis reports the objective function value. The plot on the left has been obtained with the previous solution in [8] and shows that the standard Local_Search_withAvailability() procedure gets stuck in a local optimum after around 25 seconds. In the plot on the right, Solution A depicts the best feasible solution found so far, while Solution B is the current solution of the new Algorithm 1. This plot shows that after almost 25 seconds, Algorithm 1 finds the first local optimum and Solution A is not improved further. Then, Constraint (9) is relaxed and Algorithm 1 can continue to explore the solution space. After about 35 seconds, Algorithm 1 finds a better feasible local optimum, and the plot of Solution A coincides with that of Solution B. Note that during the search the value of Solution B can increase or decrease. Solution B usually drops suddenly when Algorithm 1 returns to the feasible solution set: in that case, additional servers are turned on to improve the system availability, and additional energy costs are incurred accordingly. As a final consideration, the plot shows that Algorithm 1 is effective because the transition through the infeasible region allows a feasible local optimum to be found after about 55 seconds whose value is almost 30 percent greater than that of the first local optimum found at 25 seconds.

To evaluate the impact of availability constraints on the net revenues, we have also considered a case study based on realistic workloads created from the logs of the website of a large university in Italy. From these logs, we have extracted 50 request classes corresponding to the days with the highest workloads experienced during the year. The workload follows a bimodal distribution with two peaks (see the left plot in Fig. 10). Results are illustrated in Fig. 10, which reports the number of online servers and the net revenues for a service center including 400 servers, with and without availability constraints. As the plots show, under light load conditions between 1.00 and 9.00, the solution with availability constraints requires almost 100 percent additional servers with respect to the solution without availability constraints. Conversely, under heavy load conditions the system naturally adopts a larger number of servers, and in that case the solution with availability constraints requires only 30 percent additional servers. Furthermore, inspection of the obtained solutions has shown that Algorithm 1 is able to exploit the DVFS capability and to manage resources effectively, because the net revenues of the solution with availability constraints are, on average over the 24 hours, only 20 percent lower.

Fig. 10. Case study: incoming workload, number of online servers, and net revenues.
6.4 Timescales and Workload Prediction Error

To take into account the effects of workload prediction error and to analyze the interrelationships among the multiple timescales, we have evaluated the performance of our resource management framework on a large set of experiments, adding Gaussian noise to the workload prediction at the finer-grained timescale, $\lambda_r(T_3)$. In particular, noise with zero mean and a standard deviation equal to $p \cdot E[\lambda_r(T_2)]$ has been considered; i.e., for each class $r$, the standard deviation of the noise is proportional to the expectation of the incoming arrival rate at the coarse-grained timescale $T_2$, via the spread parameter $p$. Analyses have been performed by setting $p = 0.1$, 0.2, and 0.3.
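The perturbation can be reproduced with a few lines of Python. Truncation at zero (arrival rates cannot be negative) is our assumption, as the text does not state how negative samples are handled.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    def real_workload(prediction_T3, expected_T2, p):
        """Add zero-mean Gaussian noise with std p * E[lambda_r(T2)] to the
        fine-grained prediction lambda_r(T3)."""
        noise = rng.normal(0.0, p * expected_T2, size=np.shape(prediction_T3))
        return np.maximum(np.asarray(prediction_T3) + noise, 0.0)  # clip at zero (assumption)

    # p = 0.1, 0.2, 0.3 as in the experiments
    trace = real_workload([100.0, 110.0, 95.0], expected_T2=100.0, p=0.3)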

Several workload prediction models have been proposed in the literature (e.g., ARMA and exponential smoothing models [48], [16]). Predictions based on multiple timescales typically perform well in practice. It has been observed that the workload at fine timescales follows Gaussian distributions, and the prediction accuracy in terms of mean square error can be about 10 percent [48]. Hence, imposing Gaussian noise on the workload prediction $\lambda_r(T_3)$ allows us to obtain results that are independent of the choice of the prediction model.

We have considered multiple values for the finer-grained timescale, namely $T_3 = 5$, 10, and 15 minutes. In our evaluations, we have considered the following metrics: 1) the average number of server switches per hour, 2) the average number of VM migrations per hour, and 3) the objective function value of the cloud data center over the 24 hours. The metrics have been evaluated by considering both the workload prediction $\lambda_r(T_3)$ and the real incoming workload $\tilde{\lambda}_r(T_3)$ obtained by adding the Gaussian noise. The response times of individual VMs have been evaluated through M/G/1 queues fed by $\tilde{\lambda}_r(T_3)$ and $\lambda_r(T_3)$, sampled every 5 minutes.

TABLE 5 Medium- and Short-Term Timescales Comparison

Results are reported in Tables 5 and 6 for a system with 240 classes and 2,400 servers (the same considerations hold for larger systems). The first column of Table 5 also reports the results for the workload prediction at timescale $T_2$. Considering the prediction-based results at timescale $T_2$, the objective function value (i.e., the net revenues for the system) is overestimated, while the numbers of server switches and VM migrations planned by our algorithms are underestimated with respect to the same figures evaluated at timescale $T_3$. If we consider the different values of the timescale $T_3$, we can draw similar conclusions: a larger value of the timescale leads to overestimating the net revenues and underestimating the numbers of VM migrations and server switches. This is to be expected, since a larger time interval implies smoother values for the incoming workload predictions $\lambda_r(T_3)$. However, when the real workload is considered, coarse-grained values of $T_3$ worsen the net revenues and require more frequent triggering of the recovery algorithm because infeasible conditions arise (see Table 6, where the objective function values are normalized with respect to the Table 5 results). This happens more frequently as $p$ increases, as shown in Fig. 11, where we report the number of online servers of a representative cluster for different values of $T_3$ and $p$ over 24 hours. Since the number of active servers is determined at the $T_2$ timescale, this number should change at every $T_2$ time unit (hour). However, because of the smoothing behavior of the mean at coarse-grained timescales, and because of prediction error, the AM sometimes cannot find a feasible solution and needs to run the recovery algorithm. The same figure shows the behavior of our framework for the three different choices of $T_3$, i.e., 5, 10, and 15 minutes.
Every variation in the number of servers at timescale $T_3 = 5$ represents a call to the recovery algorithm; in fact, if the recovery algorithm is not invoked (the solution is still feasible), the number of servers is kept constant. For the sake of readability, for the lines representing $T_3 = 10$ and $T_3 = 15$, we highlight the calls to the recovery algorithm by setting the number of servers to zero. As a final consideration, we can state that a smaller value of the timescale $T_3$ is the best choice and, contrary to what one might expect (because the short-term algorithm is executed more frequently), it does not increase the optimization overhead of the framework, because the recovery algorithm has to be run in any case to react to resource saturation due to the uncertainty in the workload predictions, even when the workload prediction is accurate ($p = 0.1$).

TABLE 6 Short-Term Timescale Runtime Analysis

Fig. 11. Number of online servers for a representative cluster with $p = 0.1$ (left) and $p = 0.3$ (right).

6.5 Comparison with Other Heuristic Solutions

To evaluate the effectiveness of our joint resource allocation approach, we have performed an in-depth comparison with the solutions presented in [41], [50], [51]. These techniques have been developed in a coherent framework and have been implemented as part of the IBM Tivoli middleware [50]. In the remainder of this section, we refer to this set of techniques as the alternative solution. More specifically, the alternative solution is determined as follows:

- The workload is evenly shared among the VMs supporting the same application tier, as in [41].
- The computing requirements of individual VMs are determined according to the utilization principle, i.e., server utilization is set equal to a given threshold $U = 0.6$ [61], [57]. Note that the utilization principle is currently adopted also by an IaaS provider for dynamic resource management (see, e.g., AWS Elastic Beanstalk [6]).
- The set of active servers and the VM placement are computed by globally solving the optimization model proposed in [50] with CPLEX 12.2 (a state-of-the-art mixed-integer linear programming solver).
- If availability constraints are violated, additional servers are introduced iteratively, starting from the tiers with the lowest availability.

The university web system described in Section 6.3 has been analyzed by setting $T_3 = 5$ minutes and $p = 0.3$, and by oversampling the original log trace as in [35]. The number of request classes has been varied by considering the days with the highest workloads experienced during the year. Two reference case studies have been built. In Case 1, the workload is generated according to the log traces. In Case 2, half of the request classes are shifted by 12 hours to obtain a more homogeneous overall incoming workload with lower peaks. The resulting workloads have more heterogeneous profiles, with peaks in different time periods, thus offering more opportunities for resource sharing. Note that this is representative of a service provider with customers geographically distributed across different time zones. Systems with up to 2,400 servers and 240 classes have been considered, because for larger systems CPLEX runs out of memory in solving the optimization model proposed in [50].

TABLE 7 Net Revenues Percentage Improvement with Respect to the Alternative Method

Results are reported in Table 7 (mean values computed over ten runs). Overall, on average, the net revenues that can be achieved by our solution over the 24 hours are about 24 percent higher than those of the alternative solution in Case 1, while in Case 2 the partitioning approach is even more effective: being able to reduce server switching at finer-grained timescales, we improve the net revenues on average by around 30 percent.

7 CONCLUSIONS

In this paper, we have proposed resource allocation policies for the management of multitier virtualized cloud systems with the aim of maximizing the profits associated with multiple classes of SLAs while minimizing the infrastructure energy costs. A hierarchical framework acting at multiple timescales and providing availability guarantees for the running applications has been developed.
Cloud service centers including up to 7,200 servers, 720 request classes, and 60,000 VMs can be managed efficiently by a parallel implementation of a local search algorithm achieving an almost linear speedup. The effectiveness of our approach has been assessed by considering realistic workloads. The interrelationships among the timescales that govern the management of very large-scale systems have also been analyzed.

In our current work, we are extending the resource allocation problem by adopting pure control-theoretic models at finer-grained timescales (on the order of seconds). At a larger time granularity, we are considering the management of resources among multiple distributed service centers, exploiting also the local availability of green energy sources. Finally, more advanced availability models for characterizing cloud system behavior and a peer-to-peer architecture for the AMs will also be investigated.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Xiaoqiao Meng for his valuable comments and for analyzing the IBM data logs. Thanks are also expressed to Dr. Marco Caldirola for development and experimental activities. The work of Danilo Ardagna and Barbara Panicucci was partially supported by the PARVIS project funded by IBM. The work of Danilo Ardagna was also supported by the European Commission in the context of the Programme IDEAS-ERC, Project SMScom.

REFERENCES

[1] B. Abrahao, V. Almeida, J. Almeida, A. Zhang, D. Beyer, and F. Safai, Self-Adaptive SLA-Driven Capacity Management for Internet Services, Proc. IEEE/IFIP 10th Network Operations and Management Symp. (NOMS 06).
[2] C. Adam and R. Stadler, Service Middleware for Self-Managing Large-Scale Systems, IEEE Trans. Network and Service Management, vol. 4, no. 3.
[3] B. Addis, D. Ardagna, B. Panicucci, and L. Zhang, Autonomic Management of Cloud Service Centers with Availability Guarantees, Proc. IEEE Third Int'l Conf. Cloud Computing (CLOUD).
[4] J. Almeida, V. Almeida, D. Ardagna, I. Cunha, C. Francalanci, and M. Trubian, Joint Admission Control and Resource Allocation in Virtualized Servers, J. Parallel and Distributed Computing, vol. 70, no. 4, 2010.

[5] Amazon, Inc., Amazon Elastic Cloud, ec2/.
[6] Amazon, Inc., AWS Elastic Beanstalk, com/elasticbeanstalk/.
[7] D. Ardagna, S. Casolari, M. Colajanni, and B. Panicucci, Dual Time-Scale Distributed Capacity Allocation and Load Redirect Algorithms for Cloud Systems, J. Parallel and Distributed Computing, vol. 72, no. 6.
[8] D. Ardagna, B. Panicucci, M. Trubian, and L. Zhang, Energy-Aware Autonomic Resource Allocation in Multitier Virtualized Environments, IEEE Trans. Services Computing, vol. 5, no. 1, pp. 2-19.
[9] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, Above the Clouds: A Berkeley View of Cloud Computing, technical report, Univ. of California, Berkeley.
[10] A. Avizienis, J.C. Laprie, B. Randell, and C. Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Trans. Dependable and Secure Computing, vol. 1, no. 1.
[11] M. Bennani and D. Menascé, Resource Allocation for Autonomic Data Centers Using Analytic Performance Models, Proc. IEEE Int'l Conf. Autonomic Computing.
[12] Bitcurrent, Cloud Performance from the End User Perspective.
[13] N.M. Calcavecchia, O. Biran, E. Hadad, and Y. Moatti, VM Placement Strategies for Cloud Scenarios, Proc. IEEE Fifth Int'l Conf. Cloud Computing (CLOUD).
[14] J. Cao, K. Hwang, K. Li, and A.Y. Zomaya, Optimal Multiserver Configuration for Profit Maximization in Cloud Computing, IEEE Trans. Parallel and Distributed Systems, preprint, no. 99.
[15] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguadé, Autonomic Placement of Mixed Batch and Transactional Workloads, IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 2.
[16] S. Casolari and M. Colajanni, Short-Term Prediction Models for Server Management in Internet-Based Contexts, Decision Support Systems, vol. 48, no. 1.
[17] L. Cherkasova and P. Phaal, Session-Based Admission Control: A Mechanism for Peak Load Management of Commercial Web Sites, IEEE Trans. Computers, vol. 51, no. 6.
[18] M.D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, and A. Vakali, Cloud Computing: Distributed Internet Computing for IT and Scientific Research, IEEE Internet Computing, vol. 13, no. 5.
[19] E.D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall.
[20] E. Feller, C. Rohr, D. Margery, and C. Morin, Energy Management in IaaS Clouds: A Holistic Approach, Proc. IEEE Fifth Int'l Conf. Cloud Computing (CLOUD).
[21] D. Gamarnik, Y. Lu, and M.S. Squillante, Fundamentals of Stochastic Modeling and Analysis for Self-* Properties in Autonomic Computing Systems, technical report, IBM Research.
[22] D. Gamarnik, Y. Lu, and M.S. Squillante, Workload Management Based on the Heavy-Traffic Theory of Queueing Systems, technical report, IBM Research.
[23] A. Gandhi, Y. Chen, D. Gmach, M. Arlitt, and M. Marwah, Minimizing Data Center SLA Violations and Power Consumption via Hybrid Resource Provisioning, Proc. Int'l Green Computing Conf. and Workshops (IGCC).
[24] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, Optimal Power Allocation in Server Farms, Proc. 11th Int'l Joint Conf. Measurement and Modeling of Computer Systems (SIGMETRICS).
[25] Gartner, 2012 Cloud Computing Planning Guide, my.gartner.com/portal/server.pt?open=512&objid=249&mode=2&PageID= &resId= &ref=Browse.
[26] Z. Gong and X. Gu, PAC: Pattern-Driven Application Consolidation for Efficient Cloud Computing, Proc. IEEE/ACM 18th Ann. Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS).
[27] Greenpeace, How Clean Is Your Cloud? greenpeace.org/international/global/international/publications/climate/2012/icoal/howcleanisyourcloud.pdf.
[28] V. Gupta and M. Harchol-Balter, Self-Adaptive Admission Control Policies for Resource-Sharing Systems, Proc. 11th Int'l Joint Conf. Measurement and Modeling of Computer Systems (SIGMETRICS).
[29] Q. Huang, F. Gao, R. Wang, and Z. Qi, Power Consumption of Virtual Machine Live Migration in Clouds, Proc. Third Int'l Conf. Comm. and Mobile Computing (CMC 11).
[30] J. Kephart, H. Chan, R. Das, D. Levine, G. Tesauro, F. Rawson, and C. Lefurgy, Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs, Proc. Fourth Int'l Conf. Autonomic Computing (ICAC).
[31] H. Khazaei, J. Misic, V.B. Misic, and S. Rashwand, Analysis of a Pool Management Scheme for Cloud Computing Centers, IEEE Trans. Parallel and Distributed Systems, preprint, no. 99.
[32] L. Kleinrock, Queueing Systems. John Wiley & Sons.
[33] D. Kumar, A. Tantawi, and L. Zhang, Estimating Model Parameters of Adaptive Software Systems in Real-Time, Autonomic Systems, D. Ardagna and L. Zhang, eds., Springer.
[34] D. Kumar, L. Zhang, and A. Tantawi, Enhanced Inferencing: Estimation of a Workload Dependent Performance Model, Proc. Fourth Int'l ICST Conf. Performance Evaluation Methodologies and Tools (VALUETOOLS).
[35] D. Kusic, J.O. Kephart, N. Kandasamy, and G. Jiang, Power and Performance Management of Virtualized Computing Environments via Lookahead Control, Proc. Int'l Conf. Autonomic Computing (ICAC).
[36] H. Li and S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proc. ACM Eighth Int'l Conf. Autonomic Computing (ICAC).
[37] F. Longo, R. Ghosh, V.K. Naik, and K.S. Trivedi, A Scalable Availability Model for Infrastructure-as-a-Service Cloud, Proc. IEEE/IFIP 41st Int'l Conf. Dependable Systems and Networks (DSN).
[38] Microsoft, Windows Azure, en-us/library/windowsazure/dd163896.
[39] T. Nowicki, M.S. Squillante, and C.W. Wu, Fundamentals of Dynamic Decentralized Optimization in Autonomic Computing Systems, Self-Star Properties in Complex Information Systems, Springer-Verlag.
[40] G. Pacifici, W. Segmuller, M. Spreitzer, and A. Tantawi, CPU Demand for Web Serving: Measurement Analysis and Dynamic Estimation, Performance Evaluation, vol. 65, no. 6/7.
[41] G. Pacifici, M. Spreitzer, A.N. Tantawi, and A. Youssef, Performance Management for Cluster-Based Web Services, IEEE J. Selected Areas in Comm., vol. 23, no. 12.
[42] H. Qian, D. Medhi, and K.S. Trivedi, A Hierarchical Model to Evaluate Quality of Experience of Online Services Hosted by Cloud Computing, Proc. IFIP/IEEE Int'l Symp. Integrated Network Management (IM).
[43] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, No Power Struggles: Coordinated Multi-Level Power Management for the Data Center, SIGARCH Computer Architecture News, vol. 36, no. 1.
[44] The Register, Microsoft's Azure Cloud Down and Out for 8 Hours, windows_azure_outage/.
[45] A. Riska, M. Squillante, S.Z. Yu, Z. Liu, and L. Zhang, Matrix-Analytic Analysis of a MAP/PH/1 Queue Fitted to Web Server Data, Matrix-Analytic Methods: Theory and Applications, G. Latouche and P. Taylor, eds., World Scientific.
[46] S. Rivoire, P. Ranganathan, and C. Kozyrakis, A Comparison of High-Level Full-System Power Models, Proc. Conf. Power Aware Computing and Systems (HotPower).
[47] J. Rolia, L. Cherkasova, and C. McCarthy, Configuring Workload Manager Control Parameters for Resource Pools, Proc. IEEE 10th Network Operations and Management Symp. (NOMS).
[48] S. Casolari and M. Colajanni, On the Selection of Models for Runtime Prediction of System Resources, Autonomic Systems, D. Ardagna and L. Zhang, eds., Springer, 2010.

[49] M. Steinder, I. Whalley, and D. Chess, Server Virtualization in Autonomic Management of Heterogeneous Workloads, SIGOPS Operating Systems Rev., vol. 42, no. 1.
[50] C. Tang, M. Steinder, M. Spreitzer, and G. Pacifici, A Scalable Application Placement Controller for Enterprise Data Centers, Proc. 16th Int'l Conf. World Wide Web (WWW 07).
[51] B. Urgaonkar, G. Pacifici, P.J. Shenoy, M. Spreitzer, and A.N. Tantawi, Analytic Modeling of Multitier Internet Applications, ACM Trans. Web, vol. 1, no. 1, article 2.
[52] B. Urgaonkar and P. Shenoy, SHARC: Managing CPU and Network Bandwidth in Shared Clusters, IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 1, pp. 2-17.
[53] X. Wang, M. Chen, C. Lefurgy, and T.W. Keller, SHIP: A Scalable Hierarchical Power Control Architecture for Large-Scale Data Centers, IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 1.
[54] X. Wang, Z. Du, Y. Chen, and S. Li, Virtualization-Based Autonomic Resource Management for Multi-Tier Web Applications in Shared Data Center, J. Systems and Software, vol. 81, no. 9.
[55] X. Wang and Y. Wang, Coordinating Power Control and Performance Management for Virtualized Server Clusters, IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 2.
[56] W. Whitt, Stochastic-Process Limits. Springer-Verlag.
[57] A. Wolke and G. Meixner, Twospot: A Cloud Platform for Scaling Out Web Applications Dynamically, Proc. European Conf. ServiceWave.
[58] F. Wuhib, R. Stadler, and M. Spreitzer, A Gossip Protocol for Dynamic Resource Management in Large Cloud Environments, IEEE Trans. Network and Service Management, vol. 9, no. 2.
[59] C.Z. Xu, J. Rao, and X. Bu, URL: A Unified Reinforcement Learning Approach for Autonomic Cloud Management, J. Parallel and Distributed Computing, vol. 72, no. 2.
[60] L. Zhang, C.H. Xia, M.S. Squillante, and W.N. Mills, Workload Service Requirements Analysis: A Queueing Network Optimization Approach, Proc. IEEE 10th Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecomm. Systems (MASCOTS).
[61] X. Zhu, D. Young, B. Watson, Z. Wang, J. Rolia, S. Singhal, B. McKee, C. Hyser, D. Gmach, R. Gardner, T. Christian, and L. Cherkasova, 1000 Islands: An Integrated Approach to Resource Management for Virtualized Data Centers, J. Cluster Computing, vol. 12, no. 1.

Bernardetta Addis received the master's degree in computer engineering and the PhD degree in operations research from the Università di Firenze. She is a research fellow at the Dipartimento di Informatica, Università degli Studi di Torino. Her work focuses on operations research, algorithms, and modeling for applied and theoretical optimization problems.

Barbara Panicucci received the master's degree and the PhD degree in mathematics from the Università di Pisa. She is a temporary researcher at the Faculty of Engineering, Università di Modena e Reggio Emilia. Her research interests include models and solution methods for equilibrium problems and the development of algorithms for operations research problems.

Mark S. Squillante received the PhD degree from the University of Washington. He is a research staff member in the Mathematical Sciences Department at the IBM T.J. Watson Research Center, where he leads the Stochastic Processes and Optimization Department. His research interests concern the mathematical foundations of the analysis, modeling, and optimization of complex stochastic systems.
He is the author of more than 200 technical papers and more than 30 issued/filed patents. His work has been recognized by the Daniel H. Wagner Prize (INFORMS), eight best paper awards, nine keynote/plenary presentations, and nine major IBM technical awards. He is an elected fellow of the ACM and the IEEE.

Li Zhang received the bachelor's degree in mathematics from Beijing University, the MS degree in mathematics from Purdue University, and the PhD degree in operations research from Columbia University. He is the manager of the System Analysis and Optimization group at the IBM T.J. Watson Research Center. His research interests include control and performance analysis of computer systems, statistical techniques for traffic modeling and prediction, and scheduling and resource allocation in parallel and distributed systems. He has also been working on measurement-based clock synchronization algorithms. He has coauthored more than 90 technical articles and more than 30 patents. He received IBM's prestigious Corporate Award and Outstanding Technical Achievement Award for developing advanced clock synchronization mechanisms for mainframe computers.

Danilo Ardagna received the master's and PhD degrees in computer engineering from the Politecnico di Milano. He is an assistant professor at the Dipartimento di Elettronica e Informazione, Politecnico di Milano. His work focuses on the design, prototyping, and evaluation of optimization algorithms for resource management and planning of self-managing and cloud computing systems.


More information

CA Cloud Overview Benefits of the Hyper-V Cloud

CA Cloud Overview Benefits of the Hyper-V Cloud Benefits of the Hyper-V Cloud For more information, please contact: Email: sales@canadianwebhosting.com Ph: 888-821-7888 Canadian Web Hosting (www.canadianwebhosting.com) is an independent company, hereinafter

More information

Capacity Planning for Virtualized Servers 1

Capacity Planning for Virtualized Servers 1 Capacity Planning for Virtualized Servers 1 Martin Bichler, Thomas Setzer, Benjamin Speitkamp Department of Informatics, TU München 85748 Garching/Munich, Germany (bichler setzer benjamin.speitkamp)@in.tum.de

More information

INCREASING SERVER UTILIZATION AND ACHIEVING GREEN COMPUTING IN CLOUD

INCREASING SERVER UTILIZATION AND ACHIEVING GREEN COMPUTING IN CLOUD INCREASING SERVER UTILIZATION AND ACHIEVING GREEN COMPUTING IN CLOUD M.Rajeswari 1, M.Savuri Raja 2, M.Suganthy 3 1 Master of Technology, Department of Computer Science & Engineering, Dr. S.J.S Paul Memorial

More information

Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs

Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs Profit Maximization and Power Management of Green Data Centers Supporting Multiple SLAs Mahdi Ghamkhari and Hamed Mohsenian-Rad Department of Electrical Engineering University of California at Riverside,

More information

Improved Hybrid Dynamic Load Balancing Algorithm for Distributed Environment

Improved Hybrid Dynamic Load Balancing Algorithm for Distributed Environment International Journal of Scientific and Research Publications, Volume 3, Issue 3, March 2013 1 Improved Hybrid Dynamic Load Balancing Algorithm for Distributed Environment UrjashreePatil*, RajashreeShedge**

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

5 Performance Management for Web Services. Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology. stadler@ee.kth.

5 Performance Management for Web Services. Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology. stadler@ee.kth. 5 Performance Management for Web Services Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se April 2008 Overview Service Management Performance Mgt QoS Mgt

More information

RESOURCE MANAGEMENT IN CLOUD COMPUTING ENVIRONMENT

RESOURCE MANAGEMENT IN CLOUD COMPUTING ENVIRONMENT RESOURCE MANAGEMENT IN CLOUD COMPUTING ENVIRONMENT A.Chermaraj 1, Dr.P.Marikkannu 2 1 PG Scholar, 2 Assistant Professor, Department of IT, Anna University Regional Centre Coimbatore, Tamilnadu (India)

More information

A Power Efficient QoS Provisioning Architecture for Wireless Ad Hoc Networks

A Power Efficient QoS Provisioning Architecture for Wireless Ad Hoc Networks A Power Efficient QoS Provisioning Architecture for Wireless Ad Hoc Networks Didem Gozupek 1,Symeon Papavassiliou 2, Nirwan Ansari 1, and Jie Yang 1 1 Department of Electrical and Computer Engineering

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Grid Computing Approach for Dynamic Load Balancing

Grid Computing Approach for Dynamic Load Balancing International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-1 E-ISSN: 2347-2693 Grid Computing Approach for Dynamic Load Balancing Kapil B. Morey 1*, Sachin B. Jadhav

More information

Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints

Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints Olivier Beaumont,, Paul Renaud-Goud Inria & University of Bordeaux Bordeaux, France 9th Scheduling for Large Scale Systems

More information

Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers

Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers Íñigo Goiri, J. Oriol Fitó, Ferran Julià, Ramón Nou, Josep Ll. Berral, Jordi Guitart and Jordi Torres

More information

Dynamic Resource Allocation in Software Defined and Virtual Networks: A Comparative Analysis

Dynamic Resource Allocation in Software Defined and Virtual Networks: A Comparative Analysis Dynamic Resource Allocation in Software Defined and Virtual Networks: A Comparative Analysis Felipe Augusto Nunes de Oliveira - GRR20112021 João Victor Tozatti Risso - GRR20120726 Abstract. The increasing

More information

Payment minimization and Error-tolerant Resource Allocation for Cloud System Using equally spread current execution load

Payment minimization and Error-tolerant Resource Allocation for Cloud System Using equally spread current execution load Payment minimization and Error-tolerant Resource Allocation for Cloud System Using equally spread current execution load Pooja.B. Jewargi Prof. Jyoti.Patil Department of computer science and engineering,

More information

Load Balancing and Maintaining the Qos on Cloud Partitioning For the Public Cloud

Load Balancing and Maintaining the Qos on Cloud Partitioning For the Public Cloud Load Balancing and Maintaining the Qos on Cloud Partitioning For the Public Cloud 1 S.Karthika, 2 T.Lavanya, 3 G.Gokila, 4 A.Arunraja 5 S.Sarumathi, 6 S.Saravanakumar, 7 A.Gokilavani 1,2,3,4 Student, Department

More information

Cloud Computing Architectures and Design Issues

Cloud Computing Architectures and Design Issues Cloud Computing Architectures and Design Issues Ozalp Babaoglu, Stefano Ferretti, Moreno Marzolla, Fabio Panzieri {babaoglu, sferrett, marzolla, panzieri}@cs.unibo.it Outline What is Cloud Computing? A

More information

SERVER VIRTUALIZATION IN MANUFACTURING

SERVER VIRTUALIZATION IN MANUFACTURING SERVER VIRTUALIZATION IN MANUFACTURING White Paper 2 Do s and Don ts for Your Most Critical Manufacturing Systems Abstract While the benefits of server virtualization at the corporate data center are receiving

More information

An Energy-aware Multi-start Local Search Metaheuristic for Scheduling VMs within the OpenNebula Cloud Distribution

An Energy-aware Multi-start Local Search Metaheuristic for Scheduling VMs within the OpenNebula Cloud Distribution An Energy-aware Multi-start Local Search Metaheuristic for Scheduling VMs within the OpenNebula Cloud Distribution Y. Kessaci, N. Melab et E-G. Talbi Dolphin Project Team, Université Lille 1, LIFL-CNRS,

More information

CLOUD computing has emerged as a new paradigm for

CLOUD computing has emerged as a new paradigm for IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2014 465 SLA-Based Resource Provisioning for Hosted Software-as-a-Service Applications in Cloud Computing Environments Linlin Wu,

More information

An Autonomic Auto-scaling Controller for Cloud Based Applications

An Autonomic Auto-scaling Controller for Cloud Based Applications An Autonomic Auto-scaling Controller for Cloud Based Applications Jorge M. Londoño-Peláez Escuela de Ingenierías Universidad Pontificia Bolivariana Medellín, Colombia Carlos A. Florez-Samur Netsac S.A.

More information

A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems

A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, and Albert Zomaya Present by Leping Wang 1/25/2012 Outline Background

More information

Server Virtualization in Manufacturing

Server Virtualization in Manufacturing W H I T E P A P E R Server Virtualization in Manufacturing Do s and Don ts for Business- and Mission-critical Manufacturing Systems Uptime. All the time. 1 Abstract While the benefits of server virtualization

More information

Environments, Services and Network Management for Green Clouds

Environments, Services and Network Management for Green Clouds Environments, Services and Network Management for Green Clouds Carlos Becker Westphall Networks and Management Laboratory Federal University of Santa Catarina MARCH 3RD, REUNION ISLAND IARIA GLOBENET 2012

More information

Relational Databases in the Cloud

Relational Databases in the Cloud Contact Information: February 2011 zimory scale White Paper Relational Databases in the Cloud Target audience CIO/CTOs/Architects with medium to large IT installations looking to reduce IT costs by creating

More information

A Middleware Strategy to Survive Compute Peak Loads in Cloud

A Middleware Strategy to Survive Compute Peak Loads in Cloud A Middleware Strategy to Survive Compute Peak Loads in Cloud Sasko Ristov Ss. Cyril and Methodius University Faculty of Information Sciences and Computer Engineering Skopje, Macedonia Email: sashko.ristov@finki.ukim.mk

More information

How To Provide Qos Based Routing In The Internet

How To Provide Qos Based Routing In The Internet CHAPTER 2 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 22 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 2.1 INTRODUCTION As the main emphasis of the present research work is on achieving QoS in routing, hence this

More information

Web Hosting Service Level Agreements

Web Hosting Service Level Agreements Chapter 5 Web Hosting Service Level Agreements Alan King (Mentor) 1, Mehmet Begen, Monica Cojocaru 3, Ellen Fowler, Yashar Ganjali 4, Judy Lai 5, Taejin Lee 6, Carmeliza Navasca 7, Daniel Ryan Report prepared

More information

Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads

Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads Advanced Load Balancing Mechanism on Mixed Batch and Transactional Workloads G. Suganthi (Member, IEEE), K. N. Vimal Shankar, Department of Computer Science and Engineering, V.S.B. Engineering College,

More information

2. Research and Development on the Autonomic Operation. Control Infrastructure Technologies in the Cloud Computing Environment

2. Research and Development on the Autonomic Operation. Control Infrastructure Technologies in the Cloud Computing Environment R&D supporting future cloud computing infrastructure technologies Research and Development on Autonomic Operation Control Infrastructure Technologies in the Cloud Computing Environment DEMPO Hiroshi, KAMI

More information

Managed Virtualized Platforms: From Multicore Nodes to Distributed Cloud Infrastructures

Managed Virtualized Platforms: From Multicore Nodes to Distributed Cloud Infrastructures Managed Virtualized Platforms: From Multicore Nodes to Distributed Cloud Infrastructures Ada Gavrilovska Karsten Schwan, Mukil Kesavan Sanjay Kumar, Ripal Nathuji, Adit Ranadive Center for Experimental

More information

Task Scheduling for Efficient Resource Utilization in Cloud

Task Scheduling for Efficient Resource Utilization in Cloud Summer 2014 Task Scheduling for Efficient Resource Utilization in Cloud A Project Report for course COEN 241 Under the guidance of, Dr.Ming Hwa Wang Submitted by : Najuka Sankhe Nikitha Karkala Nimisha

More information

A Survey Of Various Load Balancing Algorithms In Cloud Computing

A Survey Of Various Load Balancing Algorithms In Cloud Computing A Survey Of Various Load Balancing Algorithms In Cloud Computing Dharmesh Kashyap, Jaydeep Viradiya Abstract: Cloud computing is emerging as a new paradigm for manipulating, configuring, and accessing

More information

Power Management in Cloud Computing using Green Algorithm. -Kushal Mehta COP 6087 University of Central Florida

Power Management in Cloud Computing using Green Algorithm. -Kushal Mehta COP 6087 University of Central Florida Power Management in Cloud Computing using Green Algorithm -Kushal Mehta COP 6087 University of Central Florida Motivation Global warming is the greatest environmental challenge today which is caused by

More information

On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

More information

HRG Assessment: Stratus everrun Enterprise

HRG Assessment: Stratus everrun Enterprise HRG Assessment: Stratus everrun Enterprise Today IT executive decision makers and their technology recommenders are faced with escalating demands for more effective technology based solutions while at

More information

International Journal of Computer & Organization Trends Volume21 Number1 June 2015 A Study on Load Balancing in Cloud Computing

International Journal of Computer & Organization Trends Volume21 Number1 June 2015 A Study on Load Balancing in Cloud Computing A Study on Load Balancing in Cloud Computing * Parveen Kumar * Er.Mandeep Kaur Guru kashi University,Talwandi Sabo Guru kashi University,Talwandi Sabo Abstract: Load Balancing is a computer networking

More information

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila

More information

International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 6, Issue 4, April-2015 36 An Efficient Approach for Load Balancing in Cloud Environment Balasundaram Ananthakrishnan Abstract Cloud computing

More information

Understanding Data Locality in VMware Virtual SAN

Understanding Data Locality in VMware Virtual SAN Understanding Data Locality in VMware Virtual SAN July 2014 Edition T E C H N I C A L M A R K E T I N G D O C U M E N T A T I O N Table of Contents Introduction... 2 Virtual SAN Design Goals... 3 Data

More information

Hadoop in the Hybrid Cloud

Hadoop in the Hybrid Cloud Presented by Hortonworks and Microsoft Introduction An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure. Big

More information

SLA-based Admission Control for a Software-as-a-Service Provider in Cloud Computing Environments

SLA-based Admission Control for a Software-as-a-Service Provider in Cloud Computing Environments SLA-based Admission Control for a Software-as-a-Service Provider in Cloud Computing Environments Linlin Wu, Saurabh Kumar Garg, and Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Laboratory

More information

GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR

GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR GUEST OPERATING SYSTEM BASED PERFORMANCE COMPARISON OF VMWARE AND XEN HYPERVISOR ANKIT KUMAR, SAVITA SHIWANI 1 M. Tech Scholar, Software Engineering, Suresh Gyan Vihar University, Rajasthan, India, Email:

More information

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER Table of Contents Capacity Management Overview.... 3 CapacityIQ Information Collection.... 3 CapacityIQ Performance Metrics.... 4

More information

Who needs humans to run computers? Role of Big Data and Analytics in running Tomorrow s Computers illustrated with Today s Examples

Who needs humans to run computers? Role of Big Data and Analytics in running Tomorrow s Computers illustrated with Today s Examples 15 April 2015, COST ACROSS Workshop, Würzburg Who needs humans to run computers? Role of Big Data and Analytics in running Tomorrow s Computers illustrated with Today s Examples Maris van Sprang, 2015

More information

Performance Management for Cloudbased STC 2012

Performance Management for Cloudbased STC 2012 Performance Management for Cloudbased Applications STC 2012 1 Agenda Context Problem Statement Cloud Architecture Need for Performance in Cloud Performance Challenges in Cloud Generic IaaS / PaaS / SaaS

More information

Profit-Maximizing Resource Allocation for Multi-tier Cloud Computing Systems under Service Level Agreements

Profit-Maximizing Resource Allocation for Multi-tier Cloud Computing Systems under Service Level Agreements Profit-Maximizing Resource Allocation for Multi-tier Cloud Computing Systems under Service Level Agreements Hadi Goudarzi and Massoud Pedram University of Southern California Department of Electrical Engineering

More information

OpenStack Neat: a framework for dynamic and energy-efficient consolidation of virtual machines in OpenStack clouds

OpenStack Neat: a framework for dynamic and energy-efficient consolidation of virtual machines in OpenStack clouds CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com)..3314 OpenStack Neat: a framework for

More information

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration 1 Harish H G, 2 Dr. R Girisha 1 PG Student, 2 Professor, Department of CSE, PESCE Mandya (An Autonomous Institution under

More information

Heuristic policies for SLA provisioning in Cloud-based service providers

Heuristic policies for SLA provisioning in Cloud-based service providers Heuristic policies for SLA provisioning in Cloud-based service providers L.Silvestri, E. Casalicchio, V. Cardellini, V. Grassi, F. Lo Presti DISP, Università degli studi di Roma Tor Vergata InfQ2010 Agenda

More information

Discrete-Event Simulation

Discrete-Event Simulation Discrete-Event Simulation Prateek Sharma Abstract: Simulation can be regarded as the emulation of the behavior of a real-world system over an interval of time. The process of simulation relies upon the

More information

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

More information

supported Application QoS in Shared Resource Pools

supported Application QoS in Shared Resource Pools Supporting Application QoS in Shared Resource Pools Jerry Rolia, Ludmila Cherkasova, Martin Arlitt, Vijay Machiraju HP Laboratories Palo Alto HPL-2006-1 December 22, 2005* automation, enterprise applications,

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems

A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems A Model-driven Approach to Predictive Non Functional Analysis of Component-based Systems Vincenzo Grassi Università di Roma Tor Vergata, Italy Raffaela Mirandola {vgrassi, mirandola}@info.uniroma2.it Abstract.

More information