Cloud Management: Knowing is Half The Battle

Raouf BOUTABA
David R. Cheriton School of Computer Science, University of Waterloo
Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph L. Hellerstein (Google Inc.)
NOMS 2014, Krakow (Poland), May 5-9, 2014

Outline
- Introduction to Cloud computing
- The heterogeneity challenge
- Google Cluster Data Set
- Research Questions/Opportunities
- Dynamic Capacity Provisioning with Harmony
- Conclusions

The rise of Internet-scale Applications

Infrastructure/Data Scale
- Large-scale infrastructure
  - Google: 200+ clusters, hundreds of thousands of machines
  - Facebook: 2000+ machines
  - Yahoo: 34000+ machines
- Huge volume of data (a.k.a. big data)
  - Google: 20 PB of data per day (2008)
  - Facebook: 36 PB of stored data, processing 80-90 TB per day (2010)
  - Yahoo: 170 PB of stored data spread across the globe, processing 3 PB per day (2010)

Cloud Computing
- A model designed for running large applications in a scalable and cost-efficient manner
  - Harnessing massive resource capacities in the computing platforms, e.g. data centers
  - Sharing resources among applications based on usage in an on-demand fashion
- Roles in a cloud computing environment
  - Cloud providers (a.k.a. infrastructure providers)
  - Service providers
  - End users

Benefits of Cloud Computing
- Economical
  - Cheap, commodity hardware
  - Leveraging economies of scale
- Highly scalable
  - Illusion of infinite resources on demand
  - Start small, then scale resources up/down as needed
- Highly flexible
  - Customizable CPU, memory, storage & networking capabilities
  - Customizable software stack
- Easy access
  - Access resources from any machine connected to the Internet
  - Deploy applications from anywhere at any time

Resource Management
- Resource management is a central activity of any cloud computing environment
- Service-level management
  - Dynamic (i.e., on-demand) performance management and service provisioning
- Infrastructure-level management
  - Monitoring
  - Scheduling and resource allocation
  - Fault detection and management
  - Energy management

The Heterogeneity Challenge
- Cloud resource management is difficult!
- A key reason: both Cloud resources and applications are heterogeneous
  - Machines have heterogeneous processing capacities and capabilities
    - Different processor architectures, hardware features, processor speeds, memory sizes and energy consumption models
  - Applications have heterogeneous sizes, durations, priorities and performance objectives

Google's Case Study
- Google's compute clusters execute millions of tasks on a daily basis
- Carrying out management activities requires an understanding of their performance impact
  - Evaluating the performance of a new scheduling algorithm
  - Capacity upgrade: what type of machines do we need?
- Current solution: sophisticated simulations
  - High overhead
  - Difficult to understand evaluation results
  - Difficult to analyze what-if scenarios
- Characterizing the heterogeneity can improve resource management effectiveness and lower maintenance overhead

Google Data Set
- Workload traces collected from a production compute cluster at Google over 29 days
  - ~12,000 machines
  - ~2,012,242 jobs
  - 25,462,157 tasks
- Applications are represented by jobs
  - Each job consists of one or more tasks
- 12 priorities divided into 3 priority groups
  - Gratis (0-1): low-priority batch jobs (e.g., MapReduce jobs)
  - Other (2-8): medium-priority jobs (e.g., monitoring)
  - Production (9-11): high-priority applications (e.g., user-facing)
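
The three priority groups on this slide can be expressed as a small lookup. This is a minimal sketch: the group boundaries come from the slide, while the function name and the lowercase group labels are hypothetical choices made for illustration.

```python
def priority_group(priority: int) -> str:
    """Map a task priority (0-11) to its priority group.

    Boundaries follow the slide: Gratis (0-1), Other (2-8), Production (9-11).
    The function name and label strings are illustrative, not part of the trace.
    """
    if not 0 <= priority <= 11:
        raise ValueError(f"priority must be in 0..11, got {priority}")
    if priority <= 1:
        return "gratis"
    if priority <= 8:
        return "other"
    return "production"


if __name__ == "__main__":
    for p in (0, 5, 10):
        print(p, priority_group(p))
```

Grouping by this label before computing statistics is one way to produce per-group views such as the task-size plots for Gratis, Other and Production tasks.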

Machine Heterogeneity
[Figure: Histogram of machine capacities]
[Figure: Machine availability over 24 hours]
- Machines in production data centers often consist of multiple types
  - E.g. multiple generations of machines purchased over time
- Machine failures are common in the compute cluster

Application Heterogeneity: Job Priority & Size
[Figure: Percentage of jobs per priority group]
[Figure: CDF of number of tasks per job]
- Most of the jobs have low priority
- Almost 50% of the jobs consist of <10 tasks, but a few of them have more than 1000 tasks

Application Heterogeneity: Task Size
[Figures: Task size (Gratis), Task size (Other), Task size (Production)]
- Tasks in production compute clusters are very heterogeneous in size

Task Duration and Scheduling Delay
[Figure: CDF of task duration]
[Figure: CDF of scheduling delay]
- Most of the tasks are short (<10 min); a few tasks are really long
- More than 30% of the tasks are scheduled immediately; other tasks, however, can wait for days to be scheduled
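
Statements such as "most of the tasks are short (<10 min)" are read off an empirical CDF. The sketch below shows how such a CDF can be computed from raw samples. The function name and the sample durations are made up for illustration and are not taken from the trace.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return the empirical CDF F(x) = fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")

    def cdf(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return cdf


if __name__ == "__main__":
    # Hypothetical task durations in seconds (not real trace data).
    durations = [30, 120, 300, 450, 7200]
    cdf = ecdf(durations)
    print("fraction of tasks lasting at most 10 min:", cdf(600))
```

The same function applies unchanged to scheduling delays or to the number of tasks per job.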

Job Arrival Rate
- Arrival rate of jobs varies highly from time to time
- Inter-arrival time exhibits an on-off pattern according to the time of the day
  - During the day, job arrivals can be quite intense: around 40% of job inter-arrival times are less than 10 s
  - At night, job inter-arrival times can be very long
- The task arrival rate can be very spiky
  - Due to the uneven distribution of both job size and arrival rate
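
A figure like "around 40% of job inter-arrival times are less than 10 s" can be derived from job submission timestamps as sketched here. The helper names and the example timestamps are hypothetical; real values would come from the job events in the trace.

```python
def inter_arrival_times(timestamps):
    """Gaps between consecutive arrivals, after sorting the timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    if not values:
        raise ValueError("need at least one value")
    return sum(v < threshold for v in values) / len(values)


if __name__ == "__main__":
    # Hypothetical job submission times in seconds (not real trace data).
    arrivals = [0, 4, 9, 40, 45, 200]
    gaps = inter_arrival_times(arrivals)
    print("gaps:", gaps)
    print("fraction of gaps under 10 s:", fraction_below(gaps, 10))
```

Computing the same fraction separately for day and night windows would expose the on-off pattern described on the slide.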

Outline
- Introduction to Cloud computing
- The heterogeneity challenge
- Google Cluster Data Set
- Research Questions/Opportunities
  - If Knowing is Half the Battle, What is the Other Half?
- Dynamic Capacity Provisioning with Harmony
- Conclusions

Research Questions/Opportunities
- Performance modeling for heterogeneous workloads
  - How to capture task and job performance characteristics (e.g. queuing delay, preemption rate) when both workload and machines are heterogeneous?
- Scheduling algorithms for heterogeneous workloads
  - How to design scheduling algorithms that consider workload and machine heterogeneity?
    - MapReduce jobs and user-facing jobs have completely different performance objectives, thus different scheduling policies should be used
  - How can we take job performance objectives (e.g. deadlines for MapReduce jobs) into account when making scheduling decisions?
  - Are there good bin-packing algorithms for task scheduling, given the distribution of task sizes?
  - How to avoid frequent preemption of long-running tasks?

Research Questions/Opportunities (cont)
- Optimizing workload performance and resource efficiency using migration
  - Live migration is a well-known technique for online workload management
    - Reduce resource contention (e.g., network hot spots)
    - Reduce resource fragmentation
    - Minimize energy consumption (i.e., cost)
  - How to use migration effectively given heterogeneous workload and machine characteristics?
- Energy management
  - How to leverage machine heterogeneity and job arrival patterns to save energy, while meeting job performance objectives?

HARMONY: Dynamic Heterogeneity-Aware Capacity Provisioning
- Energy cost is an important concern in data centers
  - Accounts for 12% of data center operational cost [Gartner Report 2010]
  - Government policies for building energy-efficient (i.e. "Green") ICT
- Minimize energy cost by turning off servers
  - An idle server consumes as much as 60% of its peak energy demand
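
To see why turning off idle servers matters, the sketch below compares two ways of serving the same load. It assumes a linear power model in which an idle server draws 60% of peak (the figure from the slide) and power grows linearly with utilization. The linear shape, the 200 W peak and the function names are assumptions made for illustration, not the energy model used in the experiments.

```python
def server_power(utilization, peak_watts=200.0, idle_fraction=0.6):
    """Power draw of one powered-on server under an assumed linear model.

    idle_fraction=0.6 reflects the slide's "as much as 60% of peak";
    peak_watts=200.0 is a hypothetical value.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    idle_watts = idle_fraction * peak_watts
    return idle_watts + (peak_watts - idle_watts) * utilization


def cluster_power(utilizations):
    """Total draw of the powered-on servers; switched-off servers are omitted."""
    return sum(server_power(u) for u in utilizations)


if __name__ == "__main__":
    # Ten servers, of which only three carry load.
    always_on = cluster_power([1.0] * 3 + [0.0] * 7)
    idle_switched_off = cluster_power([1.0] * 3)
    saving = 1.0 - idle_switched_off / always_on
    print(f"always on: {always_on:.0f} W")
    print(f"idle servers off: {idle_switched_off:.0f} W")
    print(f"saving: {saving:.1%}")
```

Under these assumptions the seven idle servers account for more than half of the total draw, which is the opportunity that dynamic capacity provisioning targets.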

Resource Demand - Google's Data Set
- Fluctuation of resource demand in data centers creates opportunities for dynamically turning servers on and off
[Figure: Total resource demand in Google's Cluster Data Set (panels: CPU demand over 30 days, Memory demand over 30 days)]

Important Factors
- To dynamically control data center capacity, one must consider the following factors:
  - Heterogeneity of machines
  - Heterogeneity of task size and duration
  - Variability of task arrival rate
  - Workload performance requirement
    - Scheduling delay
  - Cost of turning servers on and off
    - Wear-and-tear effect
  - Fluctuating energy prices

Solution Approach
- Classify tasks based on their size and duration using the k-means clustering algorithm
- Capture the run-time workload composition in terms of arrival rate for each task class
- Predict the arrival rate of each type of task
- Define a container as a logical allocation of resources to a task that belongs to a task class
- Use containers to reserve resources for each task class
- Use the task arrival rate to estimate the number of required containers for each type of task
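
The first and last steps of this approach can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Harmony implementation. It runs plain k-means (Lloyd's algorithm) on hypothetical (CPU, memory, duration) feature tuples, then sizes each class with a simple offered-load estimate (arrival rate times mean duration, in the spirit of Little's law). That estimate is a stand-in chosen for this example, since the slides do not give the actual estimator. All function names, features and sample values are assumptions.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct input points
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # New centroid is the per-dimension mean of the members.
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def containers_needed(arrival_rate_per_s, mean_duration_s, headroom=1.0):
    """Assumed estimate: mean number of concurrently running tasks, times headroom."""
    return math.ceil(arrival_rate_per_s * mean_duration_s * headroom)


if __name__ == "__main__":
    # Hypothetical tasks: (CPU share, memory share, log10 of duration in seconds).
    tasks = [(0.01, 0.01, 1.0), (0.02, 0.01, 1.2), (0.01, 0.02, 0.9),
             (0.50, 0.40, 8.0), (0.60, 0.50, 8.5), (0.55, 0.45, 9.0)]
    centroids, labels = kmeans(tasks, k=2)
    print("class of each task:", labels)
    print("containers for a class with 2 tasks/s lasting 30 s:",
          containers_needed(2.0, 30.0))
```

In practice the features would need scaling so that no single dimension dominates the distance, and the number of classes k would be chosen from the data.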

System Architecture

Optimization
- Optimal Capacity Provisioning can be formulated as the following integer program:
  [Equation: objective function]
- Where:
  [Equations: performance objective, energy cost, switching cost]
- Subject to constraints:
  [Equations: machine state constraint, capacity constraint]
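
The equations of this slide are not available in the text. As an aid to reading, here is one generic way an integer program with these three cost terms and two constraints could be written. It is an illustrative sketch, not the authors' formulation, and every symbol is an assumption: x_t^m is the number of active machines of type m in time slot t, y_t^{mn} the number of containers of task class n placed on type-m machines, \lambda_t^n the predicted arrival rate of class n, d_n a delay penalty function, p_t the energy price, E^m the energy drawn by an active type-m machine per slot, q^m the cost of switching a type-m machine on or off, N^m the number of type-m machines, c^{nr} the demand of a class-n container for resource r, and C^{mr} the capacity of a type-m machine for resource r.

```latex
\begin{aligned}
\min_{x,\,y}\quad
  & \sum_{t=1}^{T}\Big[
      \underbrace{\sum_{n} d_n\Big(\lambda_t^{n},\ \sum_{m} y_t^{mn}\Big)}_{\text{performance objective}}
    + \underbrace{\sum_{m} p_t\,E^{m}\,x_t^{m}}_{\text{energy cost}}
    + \underbrace{\sum_{m} q^{m}\,\bigl|x_t^{m}-x_{t-1}^{m}\bigr|}_{\text{switching cost}}
    \Big] \\
\text{s.t.}\quad
  & x_t^{m}\in\{0,1,\dots,N^{m}\}
    && \forall m,t \quad \text{(machine state constraint)} \\
  & \sum_{n} c^{nr}\,y_t^{mn} \le C^{mr}\,x_t^{m},\qquad y_t^{mn}\in\mathbb{Z}_{\ge 0}
    && \forall m,r,t \quad \text{(capacity constraint)}
\end{aligned}
```

In this sketch the capacity constraint is aggregated per machine type, which ignores how containers pack onto individual machines. A formulation that tracks individual machines would need per-machine placement variables.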

Optimization (cont)
- Optimal Capacity Provisioning is NP-hard
- We relax the integer program, then devise two solutions
  - Container-Based Scheduling (CBS)
    - Statically allocate containers in physical machines
    - At run-time, schedule tasks into containers
  - Container-Based Provisioning (CBP)
    - Use the estimated number of containers to provision machines
    - At run-time, schedule tasks using existing VM scheduling algorithms such as first-fit (FF)
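
First-fit, named above as an example run-time scheduler for CBP, places each task on the first machine that still has room. The sketch below is a simplified illustration: it assumes identical machines with two resources (CPU and memory) and opens a new machine whenever nothing fits, whereas CBP provisions machines ahead of time from the container estimate. The function name and the sample task sizes are hypothetical.

```python
def first_fit(tasks, machine_capacity):
    """Place (cpu, mem) tasks on identical machines using first-fit.

    Returns (placement, machines_used), where placement[i] is the index of
    the machine that received task i.
    """
    cap_cpu, cap_mem = machine_capacity
    free = []        # remaining (cpu, mem) on each opened machine
    placement = []
    for cpu, mem in tasks:
        if cpu > cap_cpu or mem > cap_mem:
            raise ValueError("task is larger than an empty machine")
        for i, (free_cpu, free_mem) in enumerate(free):
            if cpu <= free_cpu and mem <= free_mem:
                # First machine with enough room in both dimensions wins.
                free[i] = (free_cpu - cpu, free_mem - mem)
                placement.append(i)
                break
        else:
            # No opened machine fits: open a new one.
            free.append((cap_cpu - cpu, cap_mem - mem))
            placement.append(len(free) - 1)
    return placement, len(free)


if __name__ == "__main__":
    # Hypothetical task sizes as fractions of one machine.
    tasks = [(0.5, 0.5), (0.75, 0.25), (0.5, 0.25), (0.25, 0.5)]
    placement, used = first_fit(tasks, machine_capacity=(1.0, 1.0))
    print("placement:", placement)
    print("machines used:", used)
```

First-fit is simple and fast but ignores task class and machine type, which is the gap the container-based approaches are meant to address.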

Experiment Set Up
- Task classification
  - Classify tasks based on size
  - Categorize into short and long tasks
[Figures: Number of tasks (gratis), Task size (gratis), Task duration (gratis)]

Experiment Set Up (cont)
[Figures: Machine energy consumption model, Aggregated task arrival rates, Number of required containers]

Experiment Results
[Figures: Number of machines (the baseline), Number of machines (CBS and CBP), Comparison of energy consumption]

Experiment Results (cont)
[Figures: Baseline, CBP, CBS]

Take Away Message
- Cloud computing is becoming an integral part of today's IT infrastructure
- Heterogeneity is a major yet overlooked challenge for resource management in Cloud computing environments
  - Machines have heterogeneous capacities and capabilities
  - Applications have diverse resource characteristics, priorities and performance objectives
- We have presented a characterization of the workload found in production cloud environments
  - Traces can be downloaded at: http://rboutaba.cs.uwaterloo.ca/download.html
- Many research opportunities exist for designing heterogeneity-aware resource management schemes, with higher potential for practical impact

Questions