Towards an understanding of oversubscription in cloud



Similar documents
Cloud SLAs: Present and Future. Salman Baset

Cloud SLAs: Present and Future

VirtualclientTechnology 2011 July

Dynamic Resource allocation in Cloud

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER

Part 1: Price Comparison Among The 10 Top Iaas Providers

Simplifying Storage Operations By David Strom (published 3.15 by VMware) Introduction

Green Cloud Computing 班 級 : 資 管 碩 一 組 員 : 黃 宗 緯 朱 雅 甜

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

What s New with VMware Virtual Infrastructure

Comparison of Memory Balloon Controllers

Energy Constrained Resource Scheduling for Cloud Environment

Black-box and Gray-box Strategies for Virtual Machine Migration

Multifaceted Resource Management for Dealing with Heterogeneous Workloads in Virtualized Data Centers

How To Build A Cloud Server For A Large Company

Virtualization. Dr. Yingwu Zhu

Cloud Vendor Benchmark Price & Performance Comparison Among 15 Top IaaS Providers Part 1: Pricing. April 2015 (UPDATED)

Dimension Data Enabling the Journey to the Cloud

Cutting Costs with Red Hat Enterprise Virtualization. Chuck Dubuque Product Marketing Manager, Red Hat June 24, 2010

vrealize Operations Management Pack for vcloud Air 2.0

Philips IntelliSpace Critical Care and Anesthesia on VMware vsphere 5.1

Hyper-V vs ESX at the datacenter

INCREASING SERVER UTILIZATION AND ACHIEVING GREEN COMPUTING IN CLOUD

BridgeWays Management Pack for VMware ESX

How To Manage Cloud Service Provisioning And Maintenance

WHITE PAPER 1

Technical Investigation of Computational Resource Interdependencies

Top 5 Reasons to choose Microsoft Windows Server 2008 R2 SP1 Hyper-V over VMware vsphere 5

Best Practices for Optimizing Your Linux VPS and Cloud Server Infrastructure

Flash Storage Roles & Opportunities. L.A. Hoffman/Ed Delgado CIO & Senior Storage Engineer Goodwin Procter L.L.P.

The best platform for building cloud infrastructures. Ralf von Gunten Sr. Systems Engineer VMware

White Paper. Recording Server Virtualization

VI Performance Monitoring

How To Use Vsphere On Windows Server 2012 (Vsphere) Vsphervisor Vsphereserver Vspheer51 (Vse) Vse.Org (Vserve) Vspehere 5.1 (V

Towards Predictable Datacenter Networks

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems

VMware Virtual Infrastucture From the Virtualized to the Automated Data Center

Server Virtualization A Game-Changer For SMB Customers

1. Simulation of load balancing in a cloud computing environment using OMNET

Run-time Resource Management in SOA Virtualized Environments. Danilo Ardagna, Raffaela Mirandola, Marco Trubian, Li Zhang

Profit-driven Cloud Service Request Scheduling Under SLA Constraints

RED HAT ENTERPRISE VIRTUALIZATION FOR SERVERS: COMPETITIVE FEATURES

FOR SERVERS 2.2: FEATURE matrix

Scaling Database Performance in Azure

The Benefits of POWER7+ and PowerVM over Intel and an x86 Hypervisor

FlashSoft Software from SanDisk : Accelerating Virtual Infrastructures

Cloud Computing Capacity Planning. Maximizing Cloud Value. Authors: Jose Vargas, Clint Sherwood. Organization: IBM Cloud Labs

Stratusphere Solutions

CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms

An Approach to Load Balancing In Cloud Computing

I/O virtualization. Jussi Hanhirova Aalto University, Helsinki, Finland Hanhirova CS/Aalto

Figure 1. The cloud scales: Amazon EC2 growth [2].

Service Level Agreement Comparison in Cloud Computing

vsphere 6.0 Advantages Over Hyper-V

Nutanix Tech Note. Configuration Best Practices for Nutanix Storage with VMware vsphere

Cost Effective Automated Scaling of Web Applications for Multi Cloud Services

Lecture 2 Cloud Computing & Virtualization. Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Microsoft Hyper-V chose a Primary Server Virtualization Platform

Enabling Technologies for Distributed and Cloud Computing

Cloud Optimize Your IT

The next step in Software-Defined Storage with Virtual SAN

Understanding Data Locality in VMware Virtual SAN

Before we can talk about virtualization security, we need to delineate the differences between the

Deploying and Optimizing SQL Server for Virtual Machines

Avoiding Performance Bottlenecks in Hyper-V

An Energy-aware Multi-start Local Search Metaheuristic for Scheduling VMs within the OpenNebula Cloud Distribution

Diablo and VMware TM powering SQL Server TM in Virtual SAN TM. A Diablo Technologies Whitepaper. May 2015

Windows Server 2008 R2 Hyper-V Live Migration

Storage I/O Control: Proportional Allocation of Shared Storage Resources

Revealing the MAPE Loop for the Autonomic Management of Cloud Infrastructures

Real- Time Mul,- Core Virtual Machine Scheduling in Xen

Windows Server 2008 R2 Hyper-V Live Migration

MaxDeploy Ready. Hyper- Converged Virtualization Solution. With SanDisk Fusion iomemory products

Virtualization with the Intel Xeon Processor 5500 Series: A Proof of Concept

Desktop Virtualization with VMware Horizon View 5.2 on Dell EqualLogic PS6210XS Hybrid Storage Array

VDI Optimization Real World Learnings. Russ Fellows, Evaluator Group

Data Centers and Cloud Computing

Virtualized Disaster Recovery (VDR) Overview Detailed Description... 3

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Comparison of Windows IaaS Environments

App App App App App App App App. VMware vcenter Suite. VMware vsphere 4. Availability Security Scalablity. vshield Zones VMSafe

VDI Without Compromise with SimpliVity OmniStack and Citrix XenDesktop

VDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance

Enabling Technologies for Distributed Computing

Program Summary. Criterion 1: Importance to University Mission / Operations. Importance to Mission

VMware vcloud Automation Center 6.0

Transcription:

IBM Research Towards an understanding of oversubscription in cloud Salman A. Baset, Long Wang, Chunqiang Tang sabaset@us.ibm.com IBM T. J. Watson Research Center Hawthorne, NY

Outline Oversubscription background - Airlines and cloud - What are typical overload symptoms for CPU, memory, disk, and network? - Isn t managing oversubscribed cloud the same as regular cloud? Mitigating overload: mechanism vs. policy Contributions - Theoretical basis for oversubscription problem - A Markov model for oversubscription - SLAs and oversubscription - Results on increasing oversubscription in cloud by terminating or live migrating a VM while meeting SLAs Ongoing work 2

Motivation 10 seat capacity Plans changed last minute Airline boss: my planes are not flying full. Overbook the seats!

Motivation 10 seat capacity 12 people book seats, 2 cancel. Airplane flies full

Motivation 10 seat capacity 12 people book seats, 12 show up PROBLEM!!!!!!! Refund, vouchers etc

Cloud motivation Studies indicate that VMs do not fully utilize the provisioned resources Definitions - Provisioned resources e.g., the resources with which a VM is configured. EC2 small instance (1.7 GB memory, 160 GB disk) - Used resources e.g., the resources used by a VM at a point time (1 GB memory, 50 GB disk) - Overcommitted, oversubscribed Can we oversubscribe the resources of a physical machine while meeting the SLAs promised to a customer?

Regular cloud 8 GB RAM 1 TB disk Quad core Xeon 8 GB RAM 1 TB disk Quad core Xeon VM: 2 GB RAM 500 GB 1 CPU 4 VMs per physical machine Black box indicates provisioned resources per VM

Oversubscribed cloud 8 GB RAM 1 TB disk Quad core Xeon 8 GB RAM 1 TB disk Quad core Xeon VM: 2 GB RAM 500 GB 1 CPU 8 VMs per physical machine Black box indicates provisioned resources per VM

Oversubscribed cloud 8 GB RAM 1 TB disk Quad core Xeon 8 GB RAM 1 TB disk Quad core Xeon VM: 2 GB RAM 500 GB 1 CPU 8 VMs per physical machine Black box indicates provisioned resources per VM Green box indicates used resources per VM

Overload! 8 GB RAM 1 TB disk Quad core Xeon 8 GB RAM 1 TB disk Quad core Xeon VM: 2 GB RAM 500 GB 1 CPU VMs requesting more memory than available in physical server. 8 VMs per physical machine Black box indicates provisioned resources per VM Green box indicates used resources per VM

What are overload symptoms for CPU, memory, network, disk? CPU Memory Disk Network

What are overload symptoms for CPU, memory, network, disk? CPU - less CPU share per VM, long run queues Memory - Swapping to hypervisor disk, thrashing Are symptoms related? Disk (spinning) - Increased r/w latency, decreased throughput Network - Link fully utilized

What are overload symptoms for CPU, memory, network, disk? CPU - less CPU share per VM, long run queues Memory - Swapping to hypervisor disk, thrashing Disk (spinning) - Increased r/w latency, decreased throughput Network - Link fully utilized Locally attached disks Increased disk traffic Network attached disks Increased network traffic

What are overload symptoms for CPU, memory, network, disk? CPU - less CPU share per VM Monitoring agents within VMs and hypervisor may not get a chance to run as per their schedule Memory - Swapping to hypervisor disk, thrashing Disk (spinning) - Increased r/w latency, decreased throughput Network - Link fully utilized

What are overload symptoms for CPU, memory, network, disk? CPU - less CPU share per VM Memory - Swapping to hypervisor disk, thrashing Disk (spinning) - Increased r/w latency, decreased throughput Network - Link fully utilized If work of all VMs is I/O bound, a fully utilized link (for one VM) may cause other VMs to sit idle, wasting CPU and memory resources.

Isn t managing oversubscribed cloud the same as regular cloud? Regular cloud - Only network and disk are susceptible to overload - CPU and memory are never oversubscribed Oversubscribed cloud - CPU, disk, memory, and network are oversubscribed

Mitigating overload Mechanism vs. policy Mechanisms - Stealing Borrow resources from one VM and give it to other - Quiescing Terminate a VM. Which VMs to terminate? - Migrate Live migration - Shared vs. local disk storage - VMware VMotion - Streaming disks Offline migration Which VMs to live / offline migrate? - Network memory Swap space is over network. May work for transient workloads.

Handling overload Overload detection - Detect that overload is occurring (within VMs or physical server) - Hard or adaptive thresholds Overload mitigation - Mitigate overload by terminating a VM, live migrating it, or using network memory It is hard!

Overload mitigation policy Factors to consider - Performance - Useful work done - Cost - Fairness - Minimal impact to VMs - SLAs An optimization problem

Oversubscription and classical problems Multiple-constraints single knapsack (FPTAS polynomial in n and 1/e for e > 0) - Given n items and one bin (single knapsack) - Each item and bin has d dimensions, and each item has profit p(i) - Find a packing of n items into this bin which maximizes profit, while meeting bins dimensions Multiple knapsacks (bin packing) (PTAS polynomial in 1/e for e > 0) - Given n items, and m bins (knapsacks) - Each item has a profit, p(i), and size(i) - Find items with maximum profit that fit in n bins Vector bin packing (no-aptas cannot find a PTAS for every constant e > 0) - Given n items and m bins - Each item and bin has d dimensions - Find a packing of n into m which minimizes m, while meeting bins dimensions Online vector bin packing - Same as above - but also minimize the total number of moves across bins or VM terminations

The underlying theoretical problem of oversubscription Online multiple constraints multiple knapsack problem with costs of moving between knapsacks - Given n items (VMs), and m bins (servers) - Each VM and server has d dimensions, and each VM has utility u(i) - Moving a VM from server i to j has a cost M ij - Terminating a VM k has a cost T k - lambda is the rate of arrival of workloads within VMs (iid) - Utility of a VM and PM, U VM, U PM, respectively - State space: resource consumption of PMs and VMs resources - PM resources: CPU, memory, disk, network - state tuple: (PM i CPU, PM i disk, PM i mem, PM i network ) - state space explosion probability of being in that state, given workload distributions Utility of a state Given workload distributions, find argmax number of VMs s.t. - Total utility (profit) is maximized

SLAs and overload Overload must be precisely defined as part of SLAs What are the SLAs of public cloud providers? - None provide any performance guarantees for compute - Uptime guarantees, typically only for data center and not for VMs. 22

Compute SLA comparison Amazon EC2 Azure Compute Rackspace Cloud Servers Terremark vcloud Express Storm on Demand Service guarantee Availability (99.95%) 5 minute interval Role uptime and availability, 5 minute interval Availability Availability Availability Granularity Data center Aggregate across all role Per instance and data center + mgmt. stack Data center + management stack Per instance Scheduled maintenance Unclear if excluded Includ. in service guarantee calc. Excluded Unclear if excluded Excluded Patching N/A Excluded Excluded if managed N/A Excluded Guarantee time period 365 days or since last claim Service credit 10% if < 99.95% 10% if < 99.95% 25% if < 99% Violation report respon. Per month Per month Per month Unclear Uptime guarantees on a data center (very weak) Implicit uptime guarantees on a VM 5% to 100% $1 for 15 minute downtime up to 50% of customer bill 1000% for every hour of downtime Customer Customer Customer Customer Customer Reporting time period N/A 5 days of occurrence N/A N/A N/A Claim filing timer period 30 business days of last reported incident in claim Within 1 billing month of incident Within 30 days of downtime Within 30 days of the last reported incident in claim Within 5 days of incident in question Credit only for future payments Yes No No Yes No Cloud SLAs: Present and Future. To appear in ACM Operating System Review

Questions investigated in this paper Overload detection interval and request inter-arrival within VM Mitigating overload by terminating VMs over a do nothing approach Mitigating overload by live migrating a VM, over terminating VMs and do nothing. Simulations - Setup 40 PMs (rack of physical machines), each has 64 GB of RAM Only memory overload 30 days of simulated time Number of VMs fixed Request interarrival rate exponentially distributed Request size exponential and pareto (real data set in progress) Live migration: 1 VM per minute at most (mig-1) or all VMs until overload alleviated (mig-all). - Overload definition - Metrics If memory consumption exceeds 95% of physical server memory for five contiguous minutes, overload occurs. Percentage of VMs not experiencing overload for given workload arrival rate Number of VMs terminated and migrated 24

Preliminary results uptime > 99.9% % of VMs killed Max. # VM killed 100 50 quiesce no quiesce 0 32.5 35 37.5 40 42.5 45 47.5 50 100 50 100 0 32.5 35 37.5 40 42.5 45 47.5 50 Load on VMs as a function of their provisioned capacity. Overcommit factor is 2. uptime > 99.9% migs / min 100 50 mig all mig 1 0 32.5 35 37.5 40 42.5 45 47.5 50 Overcommit factor is 2. 0 All VMs have same provisioned memory, i.e., 2 GB. 5Physical server has 64 GB memory. 32.5 35 37.5 40 42.5 45 47.5 50 200 Average load on VMs as a function of provisioned capacity. E.g., 32.5% of 2 GB = 650 MB When average load on all VMs is 50% of provisioned capacity, the physical server memory is exhausted. 15 10 0 32.5 35 37.5 40 42.5 45 47.5 50 Load on VMs as a function of their provsioned capacity. Overcommit factor is Migration strategy: Select the VM with the largest memory consumption and terminate or live migrate it Insights: - Terminating a VM improves the uptime performance of all VMs by more than a factor of 2 over a do nothing approach. - Mig-1 (at most one migration per minute results in a step function like reduction in uptime) 25

Preliminary results uptime > 99.9% % of VMs killed Max. # VM killed 100 50 quiesce no quiesce 0 32.5 35 37.5 Percentage 40 of VMs 42.5 killed 45 47.5 50 100 50 200 100 0 32.5 35 37.5 40 42.5 45 47.5 50 Insights: 0 32.5 35 37.5 40 42.5 45 47.5 50 Load on VMs as a function of their provisioned capacity. Overcommit factor is 2. uptime > 99.9% migs / min 100 50 mig all mig 1 0 32.5 35 37.5 40 42.5 45 47.5 50 Percentage of VMs migrated 15 10 mig-all on a of 2 5 mig-1 0 32.5 Load 35 VMs as 37.5 function 40 their provsioned 42.5 capacity. 45 47.5 Overcommit 50 factor is - One or more VMs killed as aggregate memory consumption of all VMs approach physical server memory - mig-all can overly stress the network - Always selecting the VM with highest memory consumption for terminating or live migrating is not a good idea! 26

Questions under investigation To what extent a combination of VM quiescing and live migration schemes perform better than the individual schemes? Does asymmetry in oversubscription levels across PMs (within the same rack) and workload distributions lead to a higher overall overcommit level? When identical or asymmetric capacity VMs have different SLAs, which overload mitigation scheme gives the best results? When the available SLAs are defined per VM group instead of per VM, can it be leveraged to improve the performance of underlying overload mitigation scheme? How are the results affected when other resources such as CPU, network, and disk are oversubscribed? What is the best strategy for selecting VMs to terminate or live migrate? How the SLAs should be defined for oversubscribed environments? How can we answer all of the above questions for real workloads in a test-bed or deployed environment? 27