Cloud Management: Knowing is Half The Battle


Raouf BOUTABA
David R. Cheriton School of Computer Science, University of Waterloo
Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph L. Hellerstein (Google Inc.)
NOMS 2014, Krakow (Poland), May 5-9, 2014

Outline
- Introduction to Cloud computing
- The heterogeneity challenge
- Google Cluster Data Set
- Research Questions/Opportunities
- Dynamic Capacity Provisioning with Harmony
- Conclusions

The rise of Internet-scale Applications

Infrastructure/Data Scale
- Large-scale infrastructure
  - Google: 200+ clusters, hundreds of thousands of machines
  - Facebook: 2000+ machines
  - Yahoo: 34000+ machines
- Huge volume of data (a.k.a. big data)
  - Google: 20 PB of data per day (2008)
  - Facebook: 36 PB of stored data, processing 80-90 TB per day (2010)
  - Yahoo: 170 PB of stored data spread across the globe, processing 3 PB per day (2010)

Cloud Computing
- A model designed for running large applications in a scalable and cost-efficient manner
  - Harnessing massive resource capacities in the computing platforms, e.g. data centers
  - Sharing resources among applications based on usage in an on-demand fashion
- Roles in a cloud computing environment
  - Cloud providers (a.k.a. infrastructure providers)
  - Service providers
  - End users

Benefits of Cloud Computing
- Economical
  - Cheap, commodity hardware
  - Leveraging economies of scale
- Highly scalable
  - Illusion of infinite resources on demand
  - Start small, then scale resources up/down as needed
- Highly flexible
  - Customizable CPU, memory, storage and networking capabilities
  - Customizable software stack
- Easy access
  - Access resources from any machine connected to the Internet
  - Deploy applications from anywhere at any time

Resource Management
- Resource management is a central activity of any cloud computing environment
- Service-level management
  - Dynamic (i.e., on-demand) performance management and service provisioning
- Infrastructure-level management
  - Monitoring
  - Scheduling and resource allocation
  - Fault detection and management
  - Energy management

The Heterogeneity Challenge
- Cloud resource management is difficult!
- A key reason: both Cloud resources and applications are heterogeneous
  - Machines have heterogeneous processing capacities and capabilities: different processor architectures, hardware features, processor speeds, memory sizes and energy consumption models
  - Applications have heterogeneous sizes, durations, priorities and performance objectives

Google's Case Study
- Google's compute clusters execute millions of tasks on a daily basis
- Carrying out management activities requires an understanding of their performance impact
  - Evaluating the performance of a new scheduling algorithm
  - Capacity upgrade: what type of machines do we need?
- Current solution: sophisticated simulations
  - High overhead
  - Difficult to understand evaluation results
  - Difficult to analyze what-if scenarios
- Characterizing the heterogeneity can improve resource management effectiveness and lower maintenance overhead

Google Data Set
- Workload traces collected from a production compute cluster at Google over 29 days
  - About 12,000 machines
  - 2,012,242 jobs
  - 25,462,157 tasks
- Applications are represented by jobs
  - Each job consists of one or more tasks
- 12 priorities divided into 3 priority groups
  - Gratis (0-1): low-priority batch jobs (e.g., MapReduce jobs)
  - Other (2-8): medium-priority jobs (e.g., monitoring)
  - Production (9-11): high-priority applications (e.g., user-facing)
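As a small illustration of the priority grouping above, the mapping from a numeric priority to its group can be sketched as follows. The function name `priority_group` and the lowercase group labels are hypothetical; only the three ranges come from the slide.

```python
def priority_group(priority: int) -> str:
    """Map a trace priority (0-11) to one of the three groups on the slide.

    The function name and the returned label strings are illustrative
    assumptions; only the numeric ranges are taken from the slide.
    """
    if not 0 <= priority <= 11:
        raise ValueError(f"priority must be in 0..11, got {priority}")
    if priority <= 1:
        return "gratis"       # low-priority batch jobs
    if priority <= 8:
        return "other"        # medium-priority jobs
    return "production"       # high-priority, user-facing applications


if __name__ == "__main__":
    for p in (0, 5, 11):
        print(p, priority_group(p))
```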

Machine Heterogeneity
Figures: Histogram of machine capacities; Machine availability over 24 hours
- Machines in production data centers often consist of multiple types
  - E.g. multiple generations of machines purchased over time
- Machine failures are common in the compute cluster

Application Heterogeneity: Job Priority & Size
Figures: Percentage of jobs per priority group; CDF of number of tasks per job
- Most of the jobs have low priority
- Almost 50% of the jobs consist of <10 tasks, but a few of them have more than 1000 tasks

Application Heterogeneity: Task Size
Figures: Task size (Gratis); Task size (Other); Task size (Production)
- Tasks in production compute clusters are very heterogeneous in size

Task Duration and Scheduling Delay
Figures: CDF of task duration; CDF of scheduling delay
- Most of the tasks are short (<10 min); a few tasks are really long
- More than 30% of the tasks are scheduled immediately; however, other tasks can wait for days to be scheduled

Job Arrival Rate
- The arrival rate of jobs varies greatly over time
- Inter-arrival time exhibits an on-off pattern according to the time of day
  - During the day, job arrivals can be quite intense: around 40% of job inter-arrival times are less than 10 s
  - At night, job arrival intervals can be very long
- The task arrival rate can be very spiky
  - Due to the uneven distribution of both job size and arrival rate
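A statistic such as "around 40% of inter-arrival times are under 10 s" is one point of the empirical CDF of the gaps between consecutive job submissions. A minimal sketch, assuming arrival timestamps are given in seconds and sorted in ascending order (the function name and the sample values are made up for illustration, not taken from the trace):

```python
from typing import Sequence


def fraction_of_gaps_below(arrival_times: Sequence[float], threshold: float) -> float:
    """Return the fraction of inter-arrival gaps strictly below `threshold`.

    Assumes `arrival_times` is sorted ascending and uses the same time unit
    as `threshold`. This is one point of the empirical inter-arrival CDF.
    """
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        raise ValueError("need at least two arrival times")
    return sum(1 for gap in gaps if gap < threshold) / len(gaps)


if __name__ == "__main__":
    # Hypothetical timestamps in seconds, not real trace data.
    sample = [0, 5, 8, 30, 100]
    print(fraction_of_gaps_below(sample, threshold=10))
```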

If Knowing is Half the Battle, What is the Other Half?

Research Questions/Opportunities
- Performance modeling for heterogeneous workloads
  - How to capture task and job performance characteristics (e.g. queuing delay, pre-emption rate) when both workload and machines are heterogeneous?
- Scheduling algorithms for heterogeneous workloads
  - How to design scheduling algorithms that consider workload and machine heterogeneity?
  - MapReduce jobs and user-facing jobs have completely different performance objectives, so different scheduling policies should be used
  - How can we take job performance objectives (e.g. deadlines for MapReduce jobs) into account when making scheduling decisions?
  - Are there good bin-packing algorithms for task scheduling, given the distribution of task sizes?
  - How to avoid frequent preemption of long-running tasks?

Research Questions/Opportunities (cont)
- Optimizing workload performance and resource efficiency using migration
  - Live migration is a well-known technique for online workload management
    - Reduce resource contention (e.g., network hot spots)
    - Reduce resource fragmentation
    - Minimize energy consumption (i.e., cost)
  - How to use migration effectively given heterogeneous workload and machine characteristics?
- Energy management
  - How to leverage machine heterogeneity and job arrival patterns to save energy, while meeting job performance objectives?

HARMONY: Dynamic Heterogeneity-Aware Capacity Provisioning
- Energy cost is an important concern in data centers
  - Accounts for 12% of data center operational cost [Gartner Report 2010]
  - Government policies for building energy-efficient (i.e., "green") ICT
- Minimize energy cost by turning off servers
  - An idle server consumes as much as 60% of its peak energy demand
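To see why switching idle servers off matters, consider a simple linear power model in which a server draws a fixed idle power plus a part proportional to utilization. The linear shape and the 200 W peak are illustrative assumptions, not figures from the talk; only the 60% idle fraction comes from the slide.

```python
def server_power(utilization: float,
                 peak_watts: float = 200.0,    # hypothetical peak power
                 idle_fraction: float = 0.6) -> float:
    """Power draw in watts under an assumed linear model.

    P(u) = P_idle + (P_peak - P_idle) * u, with P_idle = idle_fraction * P_peak.
    The linear form and the default peak are assumptions for illustration.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    idle_watts = idle_fraction * peak_watts
    return idle_watts + (peak_watts - idle_watts) * utilization


if __name__ == "__main__":
    # An idle but powered-on server still draws 60% of peak in this model,
    # which is the energy saved by turning it off instead.
    print(server_power(0.0), server_power(0.5), server_power(1.0))
```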

Resource Demand - Google's Data Set
- Fluctuation of resource demand in data centers creates opportunities for dynamically turning servers on and off
Figure: Total resource demand in Google's Cluster Data Set (panels: CPU demand over 30 days; Memory demand over 30 days)

Important Factors
To dynamically control data center capacity, one must consider the following factors:
- Heterogeneity of machines
- Heterogeneity of task size and duration
- Variability of task arrival rate
- Workload performance requirement
  - Scheduling delay
- Cost of turning servers on and off
  - Wear-and-tear effect
- Fluctuating energy prices

Solution Approach
- Classify tasks based on their size and duration using the k-means clustering algorithm
- Capture the run-time workload composition in terms of the arrival rate for each task class
- Predict the arrival rate of each task class
- Define a container as a logical allocation of resources to a task that belongs to a task class
- Use containers to reserve resources for each task class
- Use the task arrival rate to estimate the number of containers required for each task class
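The classification and container-estimation steps above can be sketched in a few lines. This is a minimal illustration, not the Harmony implementation: the k-means routine is plain Lloyd's algorithm with caller-supplied initial centroids, and the container estimate uses Little's law (arrival rate times mean duration) as a stand-in, since the slides do not state the estimator used. All names and sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]  # e.g. (size, duration); rescale features in practice


def kmeans(points: Sequence[Point], init: Sequence[Point],
           iters: int = 50) -> Tuple[List[Point], List[int]]:
    """Plain Lloyd's k-means with explicit initial centroids (deterministic)."""
    centroids = [tuple(c) for c in init]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each task goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def containers_needed(arrival_rate: float, mean_duration: float) -> int:
    """Assumed estimator: Little's law, tasks in system = rate * duration."""
    return math.ceil(arrival_rate * mean_duration)


if __name__ == "__main__":
    # Hypothetical tasks as (normalised size, duration in minutes).
    tasks = [(0.10, 1.0), (0.12, 1.2), (0.80, 10.0), (0.90, 9.0)]
    centroids, labels = kmeans(tasks, init=[tasks[0], tasks[2]])
    print(labels)
    for label, centroid in enumerate(centroids):
        # Hypothetical arrival rate of 0.5 tasks per minute for every class.
        print(label, containers_needed(0.5, centroid[1]))
```

Because the duration axis has a much larger range than the size axis in this toy data, it dominates the Euclidean distance; real use would normalise both features first.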

System Architecture

Optimization
Optimal capacity provisioning can be formulated as an integer program (the formulas on this slide did not survive transcription).
- Terms of the objective:
  - Performance objective
  - Energy cost
  - Switching cost
- Subject to constraints:
  - Machine state constraint
  - Capacity constraint

Optimization (cont)
- Optimal capacity provisioning is NP-hard
- We relax the integer program, then devise two solutions
  - Container-Based Scheduling (CBS)
    - Statically allocate containers in physical machines
    - At run-time, schedule tasks into containers
  - Container-Based Provisioning (CBP)
    - Use the estimated number of containers to provision machines
    - At run-time, schedule tasks using existing VM scheduling algorithms such as first-fit (FF)
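For readers unfamiliar with first-fit, the heuristic places each task on the first machine that still has room and opens a new machine only when none does. The sketch below is a simplified illustration, assuming a single resource dimension and identical machine capacities; the real setting is multi-dimensional and heterogeneous, and the function name is hypothetical.

```python
from typing import List, Sequence


def first_fit(task_sizes: Sequence[int], machine_capacity: int) -> List[int]:
    """Return, for each task in order, the index of the machine it lands on.

    Simplifying assumptions: one resource dimension, identical machines.
    """
    free: List[int] = []       # remaining capacity of each opened machine
    placement: List[int] = []
    for size in task_sizes:
        if size > machine_capacity:
            raise ValueError(f"task of size {size} exceeds machine capacity")
        for index, room in enumerate(free):
            if size <= room:
                free[index] -= size          # first machine with enough room
                placement.append(index)
                break
        else:
            free.append(machine_capacity - size)  # no fit: open a new machine
            placement.append(len(free) - 1)
    return placement


if __name__ == "__main__":
    # Hypothetical task sizes and capacity, in arbitrary resource units.
    print(first_fit([5, 4, 3, 4], machine_capacity=8))
```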

Experiment Set Up
- Task classification
  - Classify tasks based on size
  - Categorize into short and long tasks
Figures: Number of tasks (gratis); Task size (gratis); Task duration (gratis)

Experiment Set Up (cont)
Figures: Machine energy consumption model; Aggregated task arrival rates; Number of required containers

Experiment Results
Figures: Number of machines (the baseline); Number of machines (CBS and CBP); Comparison of energy consumption

Experiment Results (cont)
Figure panels: Baseline; CBP; CBS

Take Away Message
- Cloud computing is becoming an integral part of today's IT infrastructure
- Heterogeneity is a major yet overlooked challenge for resource management in Cloud computing environments
  - Machines have heterogeneous capacities and capabilities
  - Applications have diverse resource characteristics, priorities and performance objectives
- We have presented a characterization of the workload found in production cloud environments
  - Traces can be downloaded at: http://rboutaba.cs.uwaterloo.ca/download.html
- Many research opportunities exist for designing heterogeneity-aware resource management schemes, with higher potential for practical impact

Questions