Bridgeways White Paper: Management Pack for VMware ESX

BridgeWays Management Pack for VMware ESX
Ensuring smooth virtual operations while maximizing your ROI.

Published: July 2009
For the latest information, please visit www.bridgeways.ca
Introduction
The Art of Capacity Planning
Providing the Canvas
Balancing Resources while Protecting the User Experience
Monitoring the Tipping Point
Processing Time
Memory Commitment
Network and Storage
Where to Go from Here
Introduction

Virtualization technology has improved dramatically over the last few years. As more and more companies look to virtualize their environments, additional vendors come to market, spurring competition and accelerating feature growth. Today's hypervisors open even more opportunities for server consolidation and create a new (some would argue a return to the old) operational paradigm in which resources are centralized and workloads are distributed to the hardware that can best accommodate their resource requirements.

Monitoring virtualized environments has become critical to optimizing hardware performance and achieving maximum ROI. This requires a consolidated view that covers both the hypervisor and the individual workloads, so that bottlenecks can be identified and problems corrected before they have a major impact on the performance of the virtual environment. Without such a consolidated view it is hard to place emerging issues in their appropriate context. This in turn can lead to a frustrating and time-consuming trial-and-error search for the root causes of problems.

For example, if an application server is handling fewer requests, we need to examine several possible causes: Is there an increased load on the server? Have patches been applied to the operating system or application? Have new VMs been added to the resource pool? How are they configured? Are they SMP virtual machines? Is the resource pool marked as expandable? Has the resource pool been making heavy demands on the parent pool to meet its own needs? Have new pools been added which reduce the resources available from the parent pool? Has the VM moved to a new host with different bandwidth characteristics? Is there IO contention on the SAN? Any of these issues can result in slower response times, making it imperative to quickly identify the underlying issue.
Otherwise, unexpected system downtime, service degradation, or extended maintenance periods can result in lost revenues and reduced customer satisfaction.

Microsoft System Center Operations Manager is an excellent platform from which to perform in-depth monitoring of complex environments. The addition of BridgeWays Management Packs allows administrators to drill down into the hypervisor, operating system, and workloads to do deep root cause analysis and quickly resolve underlying problems.

The BridgeWays Management Pack for VMware ESX models the entire virtual environment from the data center, through the hosts, to the individual virtual machines. By modelling the entire ESX environment, it presents administrators with a detailed view of what loads are being placed on the hypervisor, allowing them to see how the various components are interacting. A single misconfigured virtual machine can have a significant impact on the host, which, in turn, can reduce the efficiency of an entire resource pool or cluster.
Today, many administrators are still monitoring virtual machines through guest operating system metrics. This approach can send them off on tangents and cause them to solve problems that don't exist. For example, if a multi-processor VM using a non-SMP HAL stays in its idle loop for long periods of time, the guest operating system is going to report CPU usage as 0%, because it sees the idle loop and considers the CPU to be available. Monitoring via BridgeWays through the host will show that this VM is actually using 100% of its allocated CPU time as the idle loop executes. If the VM is not sending the proper halt instruction due to an incorrect HAL, the host will waste a large amount of resources supporting an idle VM. It is crucial to monitor both the ESX host and the guest OS to see the discrepancy in the metrics and correctly diagnose the system's health.

The Art of Capacity Planning

Capacity planning is the art of balancing what is requested against what can be delivered, and then creating a plan to provide the necessary resources. When expanding a virtual environment, administrators need to know how the current virtual capacity is allocated, and to what extent it is actively used. The current resources could be overcommitted, undercommitted, or perfectly balanced. However, without proactive monitoring and historical reporting it is impossible to determine their current state.

Determining historical usage trends has typically been difficult for many IT organizations, because they are using disparate tools to gather the metrics they need. A tool that gathers information from the hypervisor typically provides no information on workloads running on guest operating systems. A tool that monitors the workload, or provides metrics from the operating system, is generally unaware of the hypervisor. Failure to weave the various strands of data into an overall context can lead to planning mistakes.
Providing the Canvas

The BridgeWays Management Pack for VMware ESX provides a comprehensive view of the virtual capacity to help ensure that the environment is operating at the levels for which it is designed. Operations Manager allows BridgeWays to pull together information from the hypervisor, operating systems, and various workloads into a single management console where the administrator can see how changes to one workload's resource allocations impact the others.

This enables administrators to take an iterative approach to capacity management in which they cautiously reduce the resources available to a system, monitor the impact, and then reallocate those resources to a new system coming on line. In the absence of a comprehensive view, it is difficult to determine whether the system from which resources are taken was over-allocated, and therefore whether it will suffer degraded performance when those resources are transferred.
For example, when the heap memory usage and garbage collection characteristics of an application server are compared to the actual memory used and the amount being swapped out by the hypervisor, it is possible to see whether the application server is using the available memory efficiently. Consistently high levels of swapped memory indicate that a virtual machine may have too much memory allocated, because the hypervisor is swapping it out to disk. If the hypervisor is not constantly swapping that memory into and out of physical memory, and if the heap and garbage collection metrics are healthy, the virtual machine has more memory allocated to it than its workload requires, and some of the memory can be safely taken away.

The consolidated view provided by BridgeWays enables administrators to do the same for CPU usage, network bandwidth, and storage capacity. By monitoring how much of the available resources are being used by the virtual machines, administrators can easily identify and reallocate idle resources. The holistic overview of the available resources in the virtual environment that is provided by the BridgeWays Management Pack for VMware ESX enables administrators to ensure that the capacity plan for the virtual environment is handling the load efficiently.

Balancing Resources while Protecting the User Experience

Capacity management is used to keep an environment balanced between over- and under-allocated states. By monitoring actual resource usage, it is possible to find the point at which the available resources are being overstretched and individual workloads start to suffer.

Monitoring the Tipping Point

Underutilization of hardware exists in both the virtual and physical world. For example, a server having 32GB of memory, 1TB of drive space, and two quad-core CPUs may be running a database workload that uses only half of the physical memory, leverages a SAN for the data files, and uses around 5% of the available CPU time.
Servers like this represent a significantly underutilized investment and are good candidates for virtualization. The problem faced by many administrators is that when they try to virtualize a server like this, the application owners insist on having the same specs in the virtual environment as they had on the physical server. This is where an administrator must get creative in managing limited resources that are not being fully utilized. Hypervisor features such as memory and CPU limits can help keep control over resources by showing one value to the virtual machine, for example 32GB of memory, while limiting the physical memory provided to 16GB.

The key to using such advanced features lies in monitoring their effects on workloads and fine-tuning resource allocations based on the measurements obtained. For example, a database application could be fine-tuned by allocating 32GB of memory at the start, and then monitoring memory usage over the course of a week. If it is found that the server is only using half of the memory allocated, the amount of physical memory allocated to the application could be reduced to 20GB. It is only by proactively monitoring the environment that it is possible to make these kinds of changes while minimizing risk to the workloads.

Processing Time

The first place to look for underutilized processors is on virtual machines equipped with multiple processors. The problem with multiple processors is that when the hypervisor schedules CPU time for a virtual machine, it needs to synchronize the available cores to ensure that a core is available for each vCPU. The result is that the hypervisor locks multiple cores while waiting for enough to become available, and other single-processor virtual machines are stuck in line behind the multi-processor virtual machine, waiting to use resources that are locked but idle.

To monitor this kind of scenario, administrators can use BridgeWays to look at a pair of metrics. First, they can monitor the CPU Ready Time % for the individual virtual machines. This metric shows how long a virtual machine waits for a core to become available before it can execute instructions. If there are virtual machines with CPU Ready Time % above 5%, the administrator can then examine the Host CPU Usage % to see if it is high or low. If the Host CPU Usage % is low, it is likely that several virtual machines with more than one vCPU are running on that host. These virtual machines are causing CPU contention and should be split across multiple hosts or, if possible, have their number of vCPUs reduced.

CPU Reservation is another possible cause of underutilization. By reserving a specific amount of time on the CPU, a virtual machine may be blocking another VM that is trying to use that CPU time. To ensure that a virtual machine that is given reserved CPU time actually needs it, the administrator can monitor its CPU Guaranteed and CPU Usage metrics. If CPU Usage falls below CPU Guaranteed, CPU time that could be used productively by other VMs is being wasted.
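The two checks described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Management Pack: the `VmSample` record, the `triage_cpu` function, and the 50% cut-off for "low" host usage are hypothetical, and the busy-host branch is an inference rather than a rule stated in this paper. Only the 5% CPU Ready Time threshold and the comparison of CPU Usage against CPU Guaranteed come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

READY_THRESHOLD_PCT = 5.0   # from the text: CPU Ready Time % above 5% warrants a look
LOW_HOST_USAGE_PCT = 50.0   # assumption: what counts as "low" Host CPU Usage %


@dataclass
class VmSample:
    """Hypothetical record of averaged metrics for one virtual machine."""
    name: str
    vcpus: int
    cpu_ready_pct: float       # CPU Ready Time %
    cpu_usage_mhz: float       # CPU Usage
    cpu_guaranteed_mhz: float  # CPU Guaranteed (reserved); 0 means no reservation


def triage_cpu(host_cpu_usage_pct: float, vms: List[VmSample]) -> List[Tuple[str, str]]:
    """Return (vm name, finding) pairs for the ready-time and reservation checks."""
    findings = []
    for vm in vms:
        if vm.cpu_ready_pct > READY_THRESHOLD_PCT:
            if host_cpu_usage_pct < LOW_HOST_USAGE_PCT:
                # High ready time on a lightly loaded host points at co-scheduling.
                findings.append((vm.name, "ready time high on a lightly loaded host: "
                                          "look for multi-vCPU VMs on this host"))
            else:
                # Assumed interpretation: the host itself is short of CPU.
                findings.append((vm.name, "ready time high on a busy host: "
                                          "host CPU may be saturated"))
        if vm.cpu_guaranteed_mhz > 0 and vm.cpu_usage_mhz < vm.cpu_guaranteed_mhz:
            findings.append((vm.name, "usage below reservation: "
                                      "reserved CPU time is going unused"))
    return findings
```

Calling `triage_cpu(20.0, samples)` with averaged samples for one host returns a list of findings that an administrator could review before deciding whether to move VMs, reduce vCPU counts, or lower a reservation.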
CPU Shares are the preferred way to provide some virtual machines with more processor time than others. Increasing the shares available to a virtual machine enables the hypervisor to more intelligently schedule the extra time. This sharing mechanism also provides a useful metric for locating extra CPU time in Resource Pools. By watching CPU Extra Time, it is possible to identify virtual machines that are being given more time on the processor than their share allocation reserves for them. This indicates the potential for adding a new virtual machine to the resource pool without having a significant impact on the existing virtual machines.

Memory Commitment

Memory overcommitment is a powerful feature of the Virtual Infrastructure architecture that allows capacity planners to allocate more memory than is actually available. It works by taking physical memory away from one virtual machine when it is required by another. Which VM loses physical memory is determined by the current load, share ratios, limits, and reserves of the virtual machines running on the host providing the memory. Since most virtual machines do not run at 100% capacity, the hypervisor is able to reclaim memory from one VM and give it to another.

Through monitoring, an administrator can ensure that memory reclamation is not causing unintended resource limits for the various workloads. By watching a Resource Pool's Memory Active (memory that is actively touched and used), Memory Consumed (how much memory has been allocated to virtual machines), and Memory Overhead (the amount of memory lost to the hypervisor to handle the scheduling of access to physical memory), it is possible to see how high the demand for physical memory is. By digging a little deeper and looking at how the individual hosts are reclaiming memory resources and how virtual machines are affected by the reclamation, administrators can find areas where capacity may be over- or underutilized. This information can be used to tune environments to better reflect the actual needs of each workload, allowing the hypervisor to do less work to balance those requirements.

There are several mechanisms available to the hypervisor to handle the reclamation of unused memory from virtual machines:

Shared Memory

Shared Memory applies when the host contains similar guest operating systems with a large number of common components. The hypervisor scans for identical memory pages among VMs, and allocates a single read-only version of the page in physical memory. The duplicate memory pages are released and made available for other purposes. This is the preferred way to reclaim physical memory because it has the least impact on the virtual machines.

When monitoring Shared Memory, the administrator needs to watch the historical values. If Shared Memory usage suddenly drops, that may indicate that some of the virtual machines were patched while others were not.
This leads to fewer memory pages that can be shared, and reduces the overall capacity of the environment.

Balloon Driver

The balloon driver is installed along with VMware Tools on a guest operating system. The balloon driver is controlled by the hypervisor and is used to pin memory pages on a guest, forcing the operating system to page out to disk. The hypervisor determines which VMs use the balloon driver based on the activity level of the virtual machine. This is the second best way to reclaim memory from virtual machines, because the guest operating system gets the chance to choose which memory pages to swap to disk.

When monitoring the Memory Balloon Usage, administrators need to be conscious of the usage pattern that the driver is exhibiting. From the host level, if the memory reclaimed by the driver is constant, either there are virtual machines on which the balloon is always inflated, or the host has hit the limit of the amount of memory it can reclaim through the balloon driver. In the first instance, the balloon drivers on individual virtual machines must be monitored to find which ones are consistently inflated. A consistently inflated balloon is a sign that the virtual machine has more memory allocated to it than it needs. Reducing the committed memory to more reasonable levels will reduce the resource overhead lost to the hypervisor as it maintains the balloon driver. In the second instance, if the balloon driver has reached the limit of how fast it can reclaim memory, the Swapped Memory metrics can be examined to determine how swapped memory is being used.

Swapped Memory

Swapped Memory is physical memory that the hypervisor swaps to disk directly, rather than allowing the guest operating system to do its own paging. This is the least desirable choice for memory reclamation because, while the hypervisor can make a best guess about which pages to swap to disk based on calculations like the memory tax, the guest operating system will generally make better choices.

When monitoring Swapped Memory, it is not enough to watch the amount of memory swapped to disk. The swap rate must also be watched to see how actively the hypervisor is moving memory pages from disk back to physical memory because of hard faults. If the swap rate is high, the host is struggling to meet its memory commitment levels and action needs to be taken to increase the overall memory available.
If the swap rate is low and the amount swapped to disk is consistent, then there are virtual machines that are not making good use of allocated memory. This is a good way to find workloads on which the allocated memory can be reduced.

Network and Storage

The network and storage backbone is often a limiting factor for virtual environments. Both bandwidth saturation and IO contention tend to cause ripple effects throughout the environment, reducing the overall effectiveness of virtualization. Capacity managers need to constantly monitor the load being placed on the network to ensure that it is not exceeding capacity. There are several ways in which monitoring the hypervisor can help with this task.

From the host level, it is possible to monitor the Network Bandwidth Usage to ensure that it is not hitting the maximum capacity of the network interface cards installed on the system. As network traffic increases, administrators need to either add more NICs and team them for busy vSwitches, or investigate the possibility of moving complementary workloads to the same hosts. In addition to monitoring the bandwidth, it is important to monitor the connection states of those NICs. It is not uncommon for a switch to be taken offline and replaced temporarily, and the replacement switch may auto-negotiate to a lower connection rate than the production switch.

Hypervisor routing can help reduce overall network usage. In the case of N-tier or client/server architectures, it is common for components to be on different virtual machines, and for communication to occur across the network. If two virtual machines are on the same host, network communication will be re-routed through the hypervisor, rather than through the physical network. There are two advantages to this: reduced network load and increased transfer speeds. The data flow in this case occurs at the speed of the memory modules as opposed to the speed of the network.

When monitoring underlying storage, administrators must be aware of the storage devices and how their load is being handled. The physical devices themselves can handle a finite number of IOPS. The hypervisor handles spikes in IO by queuing up IO requests while the device is busy, and sending them once capacity is available. By monitoring both Device and Kernel Latencies, administrators are able to see where bottlenecks are forming. If the Device Latency is increasing, there could be a problem with the physical disks. If the Kernel Latency is increasing, then there may be too much traffic sent to the LUNs, in which case more capacity needs to be added.
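The Device and Kernel Latency reasoning above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `is_rising` and `triage_storage`, the half-versus-half comparison, and the 1.5x growth factor are all hypothetical choices, not thresholds taken from the Management Pack.

```python
from statistics import mean
from typing import List, Sequence


def is_rising(samples_ms: Sequence[float], factor: float = 1.5) -> bool:
    """Assumed heuristic: latency is 'increasing' when the mean of the newer
    half of the samples exceeds the mean of the older half by `factor`."""
    half = len(samples_ms) // 2
    if half == 0:
        return False  # not enough history to compare
    earlier = mean(samples_ms[:half])
    recent = mean(samples_ms[half:])
    return recent > earlier * factor


def triage_storage(device_ms: Sequence[float], kernel_ms: Sequence[float]) -> List[str]:
    """Map rising Device or Kernel Latency to the causes suggested in the text."""
    findings = []
    if is_rising(device_ms):
        findings.append("device latency rising: check the physical disks")
    if is_rising(kernel_ms):
        findings.append("kernel latency rising: IO is queuing in the hypervisor; "
                        "the LUNs may need more capacity")
    return findings
```

Both arguments are latency samples in milliseconds, ordered oldest to newest, for the same device over the same period. In practice the comparison window and growth factor would be tuned against the historical data collected for each environment.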
Where to Go from Here

Once administrators are able to use the BridgeWays Management Packs to monitor and measure available capacity, they can take capacity planning to the final phase by proactively projecting resource requirements three months, six months, or even years down the road. By analyzing performance views, historical data reports, and trending reports, administrators can forecast resource utilization as the organization grows and greater demands are placed on the virtual infrastructure. This allows administrators to implement just-in-time procurement to cut rack space, power consumption, cooling, and other costs associated with running high-end hardware.

In the past, in order to avoid adverse impact on systems already in place, administrators had to resort to using disjointed tools and crystal balls when allocating resources to new virtual systems as they were brought online. Today, the BridgeWays Management Pack for VMware ESX provides both the high-level and detailed views that help take the guesswork out of tuning virtual environments. Proactive monitoring with the BridgeWays Management Pack for VMware ESX helps ensure smooth virtual operations while maximizing hardware ROI.

BridgeWays
301 Moodie Dr., Suite 200
Ottawa, Ontario K2H 9C4
Canada
tel: 1.613.842.3494
fax: 1.613.842-3499
www.bridgeways.ca