Inciting Cloud Virtual Machine Reallocation With Supervised Machine Learning and Time Series Forecasts. Eli M. Dow IBM Research, Yorktown NY




What is this talk about? This is a 30-minute technical talk about building an autonomous cloud management system that determines whether virtual machines are unhappy where they reside. My intention is to give enough insight that anyone so inclined could build a system based on these recommendations.

Motivation The state of the art in dynamic VM placement is not sufficient to meet the needs of certain types of cloud computing data centers. Most research focuses on the initial admittance problem or requires application-level instrumentation. Cloud computing needs to be inexpensive: low consumer cost leads to a need for automation and lights-out data centers. The provider goal is to use hardware resources effectively. What do we mean by effectively? It depends on your point of view: consolidation vs. load balancing sit at opposite ends of the spectrum. Having people manage VM placement is a poor idea (explicit constraints? hidden constraints?). Automated scheduling is NP-hard for long-lived processes, and you can think of VMs as long-lived processes, especially in the KVM context where the mapping is literally 1 VM to 1 process. The broader research objective that subsumes the work in this talk is to build a data center that uses live guest migration to move cloud servers around as rapidly as we move local processes around a single multi-core server. Note: migration times are sub-second for most VMs when using RDMA-style networks, so we can almost achieve this if only we can automate the migration decision-making process.

Automating VM Placement We want to automate VM placement according to some policy (load-balanced? consolidated? both?) and have the flexibility to adapt with agility to runtime changes in workload and compute resource availability. Seems easy enough [1], so let's try to automate this! [1] But seriously, this is hard, so let's look at what people have done before.

Prior Work There is a great deal of work on load balancing and consolidation strategies; each is an entire discipline of its own [2]. The paper associated with this presentation has a good survey of the literature over the last several decades. Open problems: the VM colocation problem says that not all VMs may co-reside peacefully. Examples: privacy mandates, High-Availability (HA) peers. [2] I have my own take on high-performance consolidation for VMs in a paper currently under review.

To do the management part, we first need a monitoring part! To do the monitoring required to load balance or consolidate, there are really two choices: 1. In-band (agent-based): put an agent into each of the VMs to collect data. (Not always ideal, because it can be resource wasteful and requires custom configuration in each VM; a non-starter for IaaS.) 2. Out-of-band (agent-less): put one agent per hypervisor instance. (Also a terrible idea, but only because it is harder to do!)

Is vigilance the key to success? An IBM Research paper titled Vigilant [3] looked at detecting faults inside virtual machines from a completely black-box perspective. They effectively did some monitoring from the hypervisor perspective by instrumenting QEMU and other components to do machine learning about the state of the VMs without altering them directly. What did they measure? The number of times a guest VM was scheduled; utilization of CPU by the guest VM; time spent in the runnable state; time spent waiting on events. What were their results? They found faults with high skill (accuracy), but the results were not very generic, as they used a custom Matlab machine learning implementation to learn failure states vs. nominal states. Can we use their general approach to learn more about the VMs being managed autonomously? [3] Dan Pelleg, Muli Ben-Yehuda, Rick Harper, Lisa Spainhower, and Tokunbo Adeshiyan. 2008. Vigilant: out-of-band detection of failures in virtual machines. SIGOPS Oper. Syst. Rev. 42, 1 (January 2008), 26-31.

Beyond Vigilance? What can we measure out of band (at the hypervisor level) for KVM VMs that might let us place VMs intelligently and automatically? (A lot.) But can we do this efficiently enough to be responsive (pseudo real time)?
CPU consumption as a percentage of the host
Characters read in memory
Characters written to memory
Bytes read to memory
Bytes written to memory
Bytes received over a net connection
Packets received over a net connection
Read packet errors encountered on a net connection
Read packets dropped on a net connection
Bytes sent over a net connection
Packets sent over a net connection
Written packet errors encountered on a net connection
Written packets dropped on a net connection
Page faults incurred during the interval
Time spent blocked on the scheduler run queue

Glenda (Space Bunny) to the rescue! (The /proc file system [4]) [4] Glenda is the evil creation of Renée French for the purposes of total and utter world domination as part of Plan 9 from Bell Labs.

To do the management part, we need a monitoring part. In KVM, each virtual machine maps to a single process, just like your email client or web browser. Also note that the KVM kernel and the Linux kernel are the same thing. We can read from the /proc file system the data values about each running process, and by extension information about each virtual machine. An implementation in C allows us to measure values every 3 seconds with no measurable overhead on modern hardware. Note that /proc file system reads are effectively memory reads from the Linux kernel data structures for managing processes.
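To make the mechanism concrete, here is a minimal C sketch of this kind of /proc sampling. It is not the Incite monitor itself: the guest pid, the particular fields pulled from /proc/<pid>/stat and /proc/<pid>/io, and the fixed 3-second loop are illustrative assumptions.

/* Minimal sketch (not the actual monitor): sample a KVM guest's QEMU
 * process via /proc. Field positions follow proc(5); assumes the comm
 * field contains no spaces. */
#include <stdio.h>
#include <unistd.h>

/* CPU time (utime+stime, in clock ticks) and major page faults
 * from /proc/<pid>/stat (fields 12, 14, 15). */
static int read_stat(pid_t pid, unsigned long *cpu_ticks, unsigned long *majflt)
{
    char path[64];
    unsigned long utime, stime;
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int n = fscanf(f,
        "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %lu %*u %lu %lu",
        majflt, &utime, &stime);
    fclose(f);
    if (n != 3)
        return -1;
    *cpu_ticks = utime + stime;
    return 0;
}

/* Characters and bytes read/written from /proc/<pid>/io. */
static int read_io(pid_t pid, unsigned long long io[4])
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/io", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int n = fscanf(f, "rchar: %llu wchar: %llu syscr: %*u syscw: %*u "
                      "read_bytes: %llu write_bytes: %llu",
                   &io[0], &io[1], &io[2], &io[3]);
    fclose(f);
    return n == 4 ? 0 : -1;
}

int main(void)
{
    pid_t qemu_pid = 12345;              /* hypothetical QEMU/KVM guest pid */
    unsigned long cpu, flt;
    unsigned long long io[4];
    for (;;) {
        if (read_stat(qemu_pid, &cpu, &flt) == 0 && read_io(qemu_pid, io) == 0)
            printf("cpu_ticks=%lu majflt=%lu rchar=%llu wchar=%llu "
                   "read_bytes=%llu write_bytes=%llu\n",
                   cpu, flt, io[0], io[1], io[2], io[3]);
        sleep(3);                        /* the talk's 3-second interval */
    }
}

Per-interval rates (e.g. CPU share of the host) follow by differencing successive samples and dividing by the interval length.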

We have data about VMs, now what? Let's take a page from the Vigilant playbook and try to learn what it is like for virtual machines to be happy (nominal) or sad (constrained). I have no idea from looking at the raw data what constitutes a happy or sad VM. So rather than guessing, we can use a machine learning approach to infer those salient characteristics from the data.

What Machine Learning Implementation Should We Use? We are effectively looking for binary classification here, and we are seeking supervised learning techniques, running benchmarks on the VMs to establish their nominal and constrained behavior. We defer to the literature on supervised machine learning classification: Support Vector Machines recently caught the world on fire; they work well on lots of problems but are sort of magic (hyperplanes and such). Decision tree classifiers are not as in vogue presently, but have the elegant property that their derived classification strategy can be explained to people in English. What should we do? Which do we pick? I am wracked with indecision.

Do Both? We opted to do both! In previous work [5] I published a means for taking a single data stream and constructing trained SVM models with LibSVM, while constructing a binary decision tree classifier model in parallel. That work also compiled the binary decision tree output into a small shared-object code embodiment that could be rapidly evaluated at runtime. These modules can be swapped out at runtime as newer models are learned and uploaded to each hypervisor. We recycled this more generic work and built a VM runtime classification system. Our monitor obtains the runtime parameters on a per-VM basis, then invokes the trained classifier periodically to make sure the VM is still in a happy state. [5] The journal article was entitled Transplanting Binary Decision Trees.
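For readers unfamiliar with LibSVM's C API, the sketch below shows the shape of the train-then-classify loop. The two features, the tiny hand-labeled training set, and the RBF parameters are made-up placeholders, not the models or features used in Incite.

/* Sketch of "train offline, classify at runtime" with LibSVM's C API. */
#include <stdio.h>
#include "svm.h"   /* LibSVM header */

#define NFEAT 2    /* e.g. CPU share and run-queue wait (illustrative) */

static void fill_node(struct svm_node *n, const double *feat)
{
    for (int i = 0; i < NFEAT; i++) {
        n[i].index = i + 1;
        n[i].value = feat[i];
    }
    n[NFEAT].index = -1;               /* LibSVM end-of-vector terminator */
}

int main(void)
{
    /* Labeled samples: +1 = nominal ("happy"), -1 = constrained ("sad"). */
    double feats[4][NFEAT] = { {0.10, 0.02}, {0.20, 0.05},
                               {0.95, 0.60}, {0.90, 0.75} };
    double labels[4] = { +1, +1, -1, -1 };

    static struct svm_node nodes[4][NFEAT + 1];
    struct svm_node *rows[4];
    for (int i = 0; i < 4; i++) {
        fill_node(nodes[i], feats[i]);
        rows[i] = nodes[i];
    }

    struct svm_problem prob = { .l = 4, .y = labels, .x = rows };
    struct svm_parameter param = {
        .svm_type = C_SVC, .kernel_type = RBF,
        .gamma = 1.0 / NFEAT, .C = 1.0,
        .cache_size = 16, .eps = 1e-3,
        .shrinking = 1, .probability = 0, .nr_weight = 0,
    };
    if (svm_check_parameter(&prob, &param)) {
        fprintf(stderr, "bad SVM parameters\n");
        return 1;
    }

    struct svm_model *model = svm_train(&prob, &param);

    /* Classify one freshly measured VM sample. */
    double sample[NFEAT] = { 0.88, 0.55 };
    struct svm_node query[NFEAT + 1];
    fill_node(query, sample);
    printf("VM state: %s\n",
           svm_predict(model, query) > 0 ? "nominal" : "constrained");

    svm_free_and_destroy_model(&model);
    svm_destroy_param(&param);
    return 0;
}

In the real system the decision tree branch would run alongside this, compiled into a shared object so the per-VM classification call stays cheap enough to invoke at every monitoring interval.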

Anticipation A reactive approach to problems as they occur is a poor strategy in general. So we can detect problems happening right now, but maybe we should anticipate them? It sounds like we need some predictive analytics to forecast future VM unhappiness.

Predictive Analytics In this case, analytics = signal processing + math. There have been attempts at sophisticated resource forecasting systems that are very complicated to explain and build. So why not simplify a bit and see how well we can do? New math vs. old math (I opt to go with the math from the 1960s). The resource measurements for each VM can be seen as a signal or time series. For instance, CPU consumption over time makes a line graph that you can think of as a signal. Let us apply some univariate time series forecasting techniques.

Smooth Criminal Harmonic means (for averaging rates). Linear regression (do we need anything more than good old y = mx + b?). The exponential smoothing family of algorithms (to deal with trends, periodicity, and periodic trends when lines won't do!). A neural network (it sounds great, doesn't it?).

How To Predict The Future [6] We mine the VMs at runtime as described before and, in addition to classifying for happiness right now, we forecast each parameter and then classify again using the estimated future state to ensure the VM will remain happy for some period of time. How far do we need to forecast? The answer is that you must forecast over the time it would take to move the longest-migrating VM you have (which is a function of the paging rate and memory size, assuming no persistent disk migration). Generally we only need to forecast some small interval (5 minutes or so). A sketch of one such forecaster follows below. [6] The formulas needed are included in the referenced paper, along with explanations of when each is applicable.
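As a rough illustration of one option from the exponential smoothing family, here is Holt's double exponential smoothing (level plus trend) applied to a single VM metric in C. The smoothing constants, the sample data, and the 5-minute horizon are assumptions for illustration; see the paper for the formulas actually used and when they apply.

/* Illustrative sketch: Holt's double exponential smoothing forecast
 * for one VM metric sampled at a fixed interval. */
#include <stdio.h>

/* Forecast h steps ahead from a series of n >= 2 observations. */
static double holt_forecast(const double *y, int n,
                            double alpha, double beta, int h)
{
    double level = y[0];
    double trend = y[1] - y[0];           /* simple initial trend estimate */
    for (int t = 1; t < n; t++) {
        double prev_level = level;
        level = alpha * y[t] + (1.0 - alpha) * (level + trend);
        trend = beta * (level - prev_level) + (1.0 - beta) * trend;
    }
    return level + h * trend;             /* linear extrapolation of the trend */
}

int main(void)
{
    /* e.g. CPU share sampled every 3 seconds (made-up values). */
    double cpu[] = { 0.42, 0.45, 0.47, 0.52, 0.55, 0.58, 0.61, 0.66 };
    int n = sizeof(cpu) / sizeof(cpu[0]);
    int horizon = (5 * 60) / 3;           /* ~5 minutes at a 3-second interval */

    double predicted = holt_forecast(cpu, n, 0.5, 0.3, horizon);
    printf("forecast CPU share in ~5 min: %.2f\n", predicted);
    /* The forecast values for all metrics would then be fed back through
     * the same trained classifier to check whether the VM stays "happy". */
    return 0;
}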

Recap: Contributions to the Art In summary, the Incite work presented here provides the design methodology and technical material necessary for cloud developers and practitioners to replicate our first-of-a-kind system, which uses runtime classification of virtual machines as over-constrained to trigger workload balancing, and which is suitable for deployment in an IaaS cloud computing environment. The combination of exceptionally lightweight data mining, optional hierarchical time series forecasting, and high-performance runtime classification allows for rapid indication of potential over-constraint situations, suitable for triggering remedial action or initiating system operator alerts. The goal is autonomous, frequent reallocation of VM resources in the cloud data center that respects explicit and implicit constraints.

What Next?

Thank You!