Black-box and Gray-box Strategies for Virtual Machine Migration
Wood et al. (UMass), NSDI '07
Context: Virtual Machine Migration
Introduction
Want agility in server farms to reallocate the resources devoted to applications. Look at building on virtualization, specifically the capability of migrating virtual machines from one physical machine to another. Focus of the paper is how to detect the need for such migrations (hotspots) and how to determine where a VM should be migrated. Examines black-box (only observe the VM from outside) vs. gray-box (some OS-level knowledge) strategies. System is called Sandpiper, after a migratory bird. The authors note, but do not investigate, the possibility of assigning more resources to an existing VM.
System Overview
Work is built on Xen and its existing VM migration capabilities. Tracks processor, network and memory usage to determine hotspots. Determines which VMs to migrate, where to move them, and how many resources to allocate to each migrated VM.
Black-Box Monitoring
CPU: XenMon tracks CPU usage of resident VMs. CPU spent on disk and network I/O is charged to Domain-0 and must be apportioned among the VMs.
Network: Domain-0 implements the network interface driver, so it can track network usage. Use /proc/net/dev to monitor activity on each virtual interface.
Memory: Usage is known only to the OS within each VM. Can indirectly track swap activity in Domain-0. Most problematic part of the black-box approach.
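The network part of black-box monitoring can be sketched as follows. This is a minimal illustration, not Sandpiper's code: the function names and the sample counters are made up, and the interface name vif1.0 assumes Xen's usual vif<domain>.<device> naming for backend interfaces in Domain-0. The parser relies only on the standard /proc/net/dev layout, where the first value after the colon is received bytes and the ninth is transmitted bytes.

```python
# Hypothetical snapshots of /proc/net/dev taken 10 seconds apart (invented data).
SAMPLE_T0 = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1234 10 0 0 0 0 0 0 1234 10 0 0 0 0 0 0
vif1.0: 5000 50 0 0 0 0 0 0 7000 70 0 0 0 0 0 0
"""
SAMPLE_T1 = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1234 10 0 0 0 0 0 0 1234 10 0 0 0 0 0 0
vif1.0: 6000 60 0 0 0 0 0 0 10000 90 0 0 0 0 0 0
"""


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def byte_rates(before, after, interval_s):
    """Per-interface (rx, tx) bytes per second between two snapshots."""
    return {
        name: ((after[name][0] - rx0) / interval_s,
               (after[name][1] - tx0) / interval_s)
        for name, (rx0, tx0) in before.items()
        if name in after
    }


if __name__ == "__main__":
    rates = byte_rates(parse_net_dev(SAMPLE_T0), parse_net_dev(SAMPLE_T1), 10.0)
    print(rates["vif1.0"])
```

On a real Domain-0 the text would come from reading /proc/net/dev at a fixed interval instead of from the sample strings.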
Gray-Box Monitoring
Use a lightweight monitoring daemon that runs inside each VM. Gathers CPU, network and memory usage stats from the /proc interface in Linux.
Profiling
Profiling engine periodically obtains a resource usage report from each nucleus. Periodic reports are combined over a time window W to track trends. Maintains both a distribution and a time series over this window.
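One way to keep both a time series and a distribution over the window W is a fixed-length queue of the most recent reports. The sketch below is an assumption about how such a profile could be stored, not the paper's data structure; the class name, the use of percent utilization, and the 10-point histogram bins are all hypothetical choices.

```python
from collections import deque


class ResourceProfile:
    """Usage reports for one resource over the last `window` observations."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest report once the window is full.
        self.samples = deque(maxlen=window)

    def add(self, percent_used):
        self.samples.append(percent_used)

    def time_series(self):
        """Reports in arrival order, oldest first."""
        return list(self.samples)

    def distribution(self, bin_width=10):
        """Histogram as {bin_start: fraction of the window's reports in that bin}."""
        counts = {}
        for value in self.samples:
            bin_start = int(value // bin_width) * bin_width
            counts[bin_start] = counts.get(bin_start, 0) + 1
        total = len(self.samples)
        return {b: c / total for b, c in sorted(counts.items())}


if __name__ == "__main__":
    profile = ResourceProfile(window=4)
    for report in [12, 18, 35, 71, 78]:
        profile.add(report)
    print(profile.time_series())
    print(profile.distribution())
```

The time series preserves ordering for trend prediction, while the distribution discards ordering and summarizes how often each usage level occurred.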
Hotspot Detection
Performed on a per-physical-server basis. Look for cases where the usage of a resource exceeds a threshold. Also look for SLA violations (response time) on a per-VM basis, and for memory utilization exceeding a threshold. Detect a hotspot if k out of the last n observations, as well as the next predicted value, exceed the threshold. n = k = 1 is the most aggressive detection approach. Flexible, descriptive approach. Uses a time-series prediction technique.
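The k-out-of-n rule combined with a predicted next value can be sketched as below. The slides only say that a time-series prediction technique is used; this sketch assumes a simple AR(1)-style predictor (mean plus lag-1 autocorrelation times the latest deviation), and the function names are hypothetical.

```python
def predict_next(series):
    """AR(1)-style one-step prediction (an assumed predictor, see lead-in)."""
    n = len(series)
    mu = sum(series) / n
    den = sum((u - mu) ** 2 for u in series)
    if den == 0:
        return mu  # constant series: predict the same value again
    num = sum((series[i] - mu) * (series[i + 1] - mu) for i in range(n - 1))
    phi = num / den  # lag-1 autocorrelation estimate
    return mu + phi * (series[-1] - mu)


def is_hotspot(series, threshold, k, n):
    """True if at least k of the last n observations AND the prediction exceed threshold."""
    recent = series[-n:]
    if len(recent) < n:
        return False  # not enough observations yet
    exceeded = sum(1 for u in recent if u > threshold)
    return exceeded >= k and predict_next(series) > threshold


if __name__ == "__main__":
    rising = [0.50, 0.60, 0.80, 0.85, 0.90]    # load still climbing
    subsided = [0.90, 0.90, 0.90, 0.30, 0.20]  # spike that has already passed
    print(is_hotspot(rising, threshold=0.75, k=3, n=5))
    print(is_hotspot(subsided, threshold=0.75, k=3, n=5))
```

Both example series have three observations above the threshold, so the k-out-of-n count alone cannot tell them apart; the prediction term is what keeps a spike that has already subsided from triggering a migration.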
Resource Provisioning
If a hotspot is detected, need to determine how much of each resource the overloaded VM actually needs. In the black-box approach, the observed CPU and network bandwidth of a VM may be constrained when other VMs are using their fair share. In such cases the observations will underestimate the actual peak need, and it may only be possible to guess the actual need. Simpler for memory, as each VM has a fixed amount of physical memory assigned to it; memory is not flexible like CPU and network. Simply add a fixed amount of memory to determine the peak. With gray-box, scale CPU by λ_peak/λ_cap, where λ_peak is the estimated peak arrival rate. λ_peak is also used to determine peak network needs (along with mean request file size).
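The provisioning arithmetic above is small enough to write out. This is a sketch under stated assumptions: the function names are hypothetical, λ_cap is taken to be the request rate the VM's current CPU allocation can sustain, and peak network need is taken to be λ_peak multiplied by the mean request file size.

```python
def graybox_cpu_need(current_cpu, lam_peak, lam_cap):
    """Scale the current CPU allocation by lam_peak / lam_cap."""
    if lam_cap <= 0:
        raise ValueError("lam_cap must be positive")
    return current_cpu * lam_peak / lam_cap


def graybox_network_need(lam_peak, mean_file_bytes):
    """Peak bandwidth in bytes/second: peak request rate times mean response size."""
    return lam_peak * mean_file_bytes


def blackbox_memory_need(current_mb, increment_mb):
    """Black-box memory estimate: current allocation plus a fixed increment."""
    return current_mb + increment_mb


if __name__ == "__main__":
    # Invented numbers: half a CPU today, capacity of 100 req/s, estimated peak of 150 req/s.
    print(graybox_cpu_need(0.5, lam_peak=150.0, lam_cap=100.0))
    print(graybox_network_need(lam_peak=150.0, mean_file_bytes=20000))
    print(blackbox_memory_need(current_mb=256, increment_mb=32))
```

The ratio λ_peak/λ_cap is above 1 exactly when the estimated peak arrival rate exceeds what the current allocation can serve, so the scaled value grows in proportion to the shortfall.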
Hotspot Mitigation
Determine which VMs to migrate and where to move them. Try to minimize migration overhead (amount of data transferred). Apply a greedy algorithm to move VMs from the most loaded to the least loaded physical servers. Use the combined CPU-network-memory volume of a physical or virtual machine as the load metric. Pick the VM with the largest VSR (volume/size ratio), where size is the memory footprint; that is, try to move the VM with the most volume and the smallest memory footprint. Alternate approach is to swap VMs, which is going to be more expensive.
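A single step of the greedy selection can be sketched as follows. The slides do not give the volume formula; this sketch assumes the product form 1/((1-cpu)(1-net)(1-mem)) with utilizations as fractions. It also assumes identical physical servers, so a VM's utilization fractions can simply be added to a target's. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field

RESOURCES = ("cpu", "net", "mem")


def volume(cpu, net, mem):
    """Assumed volume metric; utilizations are clamped to avoid division by zero."""
    cpu, net, mem = (min(u, 0.99) for u in (cpu, net, mem))
    return 1.0 / ((1 - cpu) * (1 - net) * (1 - mem))


@dataclass
class VM:
    name: str
    cpu: float
    net: float
    mem: float
    footprint_mb: int  # memory that must be copied during migration

    def vsr(self):
        """Volume-to-size ratio: load removed per megabyte transferred."""
        return volume(self.cpu, self.net, self.mem) / self.footprint_mb


@dataclass
class PM:
    name: str
    vms: list = field(default_factory=list)


def usage(pm):
    """Total (cpu, net, mem) utilization of a physical machine."""
    return tuple(sum(getattr(vm, r) for vm in pm.vms) for r in RESOURCES)


def fits(pm, vm, threshold):
    """True if adding vm keeps every resource on pm at or below threshold."""
    return all(u + getattr(vm, r) <= threshold for u, r in zip(usage(pm), RESOURCES))


def plan_move(pms, threshold):
    """Return (vm, source, target) names for one greedy move, or None."""
    by_volume = sorted(pms, key=lambda p: volume(*usage(p)), reverse=True)
    source = by_volume[0]  # most loaded server
    for vm in sorted(source.vms, key=VM.vsr, reverse=True):  # highest VSR first
        for target in reversed(by_volume[1:]):  # least loaded server first
            if fits(target, vm, threshold):
                return vm.name, source.name, target.name
    return None


if __name__ == "__main__":
    pms = [
        PM("pm1", [VM("a", 0.5, 0.1, 0.2, 512), VM("b", 0.4, 0.1, 0.1, 128)]),
        PM("pm2", [VM("c", 0.3, 0.1, 0.1, 256)]),
        PM("pm3", [VM("d", 0.1, 0.1, 0.1, 256)]),
    ]
    print(plan_move(pms, threshold=0.75))
```

In the example, VM "b" carries slightly less load than "a" but has a quarter of the memory footprint, so it has the higher VSR and is tried first. The sketch plans only one move and does not model swaps.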
Implementation
Based on Xen. Use 3 out of 5 observations and a 75% threshold to determine a hotspot. Nucleus is written in Python. Gray-box gathers stats from /proc. Testbed: 20 servers running Linux 2.6.16 and Xen 3.0 with at least 1GB RAM. Apache servers serving dynamic PHP web pages. Cluster of Linux servers generates load using httperf.
Migration Effectiveness
Try to move the highest-VSR VM to the least loaded PM. Maximizes the amount of load displaced from the hotspot per megabyte of data transferred.
Other Tests
VM swaps incur more overhead, but increase the chances of mitigating hotspots in clusters with high average utilization. Can handle mixed resource hotspots. Gray-box approach can better infer memory usage.
Prototype Data Center Evaluation
35 VMs (running a mix of applications) across 16 physical servers. LAMP: Linux, Apache, MySQL, PHP. Use RUBiS as a test application to implement an eBay-like auction web site and workload generator. Relative to a static approach, Sandpiper performs much better in resolving hotspot situations (not surprising!).
Additional Tests
Sandpiper itself has negligible impact on performance. Primary scaling issue is the placement algorithm. Sandpiper limits instability by only initiating migrations when it has found a better solution. Need to find the right thresholds.
Related Work
Process migration in the 1980s; network connections not really considered. VM migration provides a mechanism, and the authors have built a framework on top of it. Shared hosting environments. Estimating resource needs.
Summary
Extensive amount of work. Lots of decisions that look reasonable. Built a working system with hotspot alleviation in 20s to minutes. Compares effectiveness of black-box and gray-box strategies. Good engineering work. Little to compare the work with.