APM Experts. Bernd Harzog CEO APM Experts. White Paper: Infrastructure Performance Management for Virtualized Systems. June 2010.

Abstract

© 2010 APM Experts. All Rights Reserved. All other marks are property of their respective owners.

Most enterprises that have deployed virtualization have realized substantial hard-dollar savings from it, driven primarily by the benefits of server consolidation. These enterprises have also discovered that management activities like provisioning, recovering from server failure, or providing for disaster recovery can be accomplished in a more agile and consistent manner across a variety of systems when virtualization is the underlying technology.

However, most large enterprises are only 30% virtualized, with virtualization occurring only on the application systems that are under the direct control of IT Operations. To extend virtualization to the business-critical and performance-critical applications that are owned by dedicated application support teams, the IT Operations group that owns the virtual environment must provide accurate and credible performance assurances for the virtual infrastructure that will support these applications. Without these assurances, the application support teams and their business constituents have the political power to prevent these systems from being virtualized, and they will exercise that power.

The IT Operations group needs to put tools in place that can measure the performance of the virtual environment and provide verifiable service-level data to the owners of these key applications. With the right tools collecting the right metrics, application owners can in turn offer assurances to end-users. The management tools selected to support virtualization are essential to IT's ability to grow the virtual environment without proportionately increasing the staff required to manage all of the new physical host servers and their guest VMs.

Table of Contents

I. Introduction: Barriers to Virtualization
    Collapsed and Centralized Application Infrastructure
    Business Demand for Service Level Management
II. Current Approach to Virtualization Management
III. Flaws in Existing Approaches to Performance Management
    Time-Based Metric Measurement
    Density-Based Interactions
    Dynamic Operations
    Resource Utilization Is No Longer a Reliable Indicator of Performance
IV. The Need for New Tools for Virtualized Systems
V. What is Virtualization Infrastructure Performance Management?
    Infrastructure Response Time Defined
    Key Requirements for Infrastructure Performance Management Solutions
VI. Comparison of IPM Solutions
VII. Importance of Virtualization Infrastructure Performance Management
VIII. CA Virtual Assurance
IX. CA Virtual Assurance Use Case
X. Defining a Vision of IPM in the Virtualized Environment
XI. About APM Experts

I. Introduction: Barriers to Virtualization

Most of the enterprises that have adopted virtualization have realized substantial hard-dollar savings from it, driven primarily by the benefits of server consolidation. These enterprises have also discovered that many management activities like provisioning, recovering from server failure, providing for disaster recovery, backup, and security can be accomplished in a more agile and consistent manner across a variety of systems when virtualization is the underlying technology.

However, many barriers remain that prevent large enterprises from virtualizing key systems and applications, including the political skirmishes between separate Applications and Infrastructure groups within the IT organization and the lack of reliable management data. As a result, most large enterprises are only 30% virtualized, with virtualization occurring only on the application systems that are under the direct control of IT Operations.

Collapsed and Centralized Application Infrastructure

Prior to virtualization, most business-critical applications typically ran on over-provisioned, dedicated servers, with massively over-provisioned LANs handling the traffic between these servers. When these servers are virtualized, this dedicated infrastructure is collapsed into a shared pool of server and network resources. IT is responsible for the technology that enables this sharing: the virtualization platform. When physical servers are consolidated into guests on a shared host, the IT Operations group thus becomes responsible for any application performance problems that arise and that are perceived, rightly or wrongly, to have been caused by this higher degree of sharing. In the absence of accurate knowledge of the root cause of application performance issues, IT Operations is considered guilty until proven innocent.
Business Demand for Service Level Management

As application owners and their business constituents lose direct control of the physical resources that support their applications, they will in turn demand that the IT organization be able to prove that the shared infrastructure is not at fault for performance issues. For example, the owner of a business-critical, revenue-generating application will insist that the virtualized infrastructure is at fault when the application performance management system reports response time issues. To meet these demands for performance assurance, and to maintain support for further virtualization projects, IT must have tools that accurately measure how the performance of the virtual infrastructure is impacting each application.

II. Current Approach to Virtualization Management

For every virtualization platform, including those from VMware, Microsoft, and Citrix, performance management tools are available from both the virtualization platform vendors and a variety of third-party performance and capacity management vendors. In the case of the market-leading platform, VMware vSphere, most of the vendors leverage the vCenter Server API, which provides a robust set of resource utilization statistics about the virtual environment. Resource utilization statistics provide insight into the availability and capacity of the virtualized system, but they do not provide a granular or comprehensive way to monitor the actual performance of the virtual infrastructure.

As shown below, many vendors provide resource and availability management for the VMware vSphere platform. Some of these vendors, like Veeam and Vizioncore, monitor only the VMware environment, while others, like eG Innovations, up.time Software, and NetIQ, have been monitoring a broad array of physical server and network resources and have added the collection of VMware vCenter API data to their product suites. These vendors collect resource utilization data, such as CPU, memory, network, and disk utilization, and use this data to try to infer the performance of the virtualized environment. But as we demonstrate in the next section, this approach is fundamentally flawed.

[Figure: Virtualization Resource and Capacity Management Solutions and their Use Cases]

III. Flaws in Existing Approaches to Performance Management

Multiple products and solutions are available to provide basic monitoring of resource utilization in a virtual environment. While it is important to know how key resources like CPU, memory, network, SAN, and the storage array are utilized relative to their capacity, this information does not provide an accurate picture of virtual infrastructure performance. And although it is possible to approximate the performance of a physical infrastructure based on how its resources are being used, this is not possible in a virtual environment. A much more comprehensive approach is needed.

Time-Based Metric Measurement

The service-level agreements that bind applications teams to the larger enterprise of end-users are based on deviations from a set of normal performance metrics. To define and codify normal operating patterns, many management solutions rely on either an agent running inside a guest virtual machine (VM) or instrumentation built into the operating system, such as WMI, to collect resource utilization statistics. However, the accuracy of all time-based metrics collected inside of guests, including CPU utilization, network or disk I/O rates, page fault rates, and context switch rates, is susceptible to a time-keeping issue between the guests and their host. As a result of this timing issue, the data that was used to infer infrastructure performance in a physical environment can no longer be used to report on infrastructure performance in a virtual environment.

In physical environments, many performance management solutions calculate resource utilization baselines for time-of-day or workload. Deviations from these baselines were indicative of a performance problem. However, when a server is virtualized, resource baselines are not a reliable predictor because the resources allocated to a guest at any given time are variable and dynamic in nature. Therefore, baseline deviations are no longer reliable indicators of infrastructure or application performance.

Density-Based Interactions

The hard-dollar ROI from consolidation comes from achieving a higher utilization rate for server and network resources. As a result, virtualization raises the prospect that isolated workload peaks can now cause resource conflicts which can, in turn, create performance issues. Effective Infrastructure Performance Management solutions are required to allow the IT Operations group to demonstrate that resource peaks in certain workloads are not causing performance issues for critical business applications. The monitoring approach that generates alarms based on deviations from baseline utilization values is too limited to tackle the complex factors that contribute to workload peaks in a virtualized environment.

Dynamic Operations

The VMware vSphere platform contains several features, such as VMotion, HA, and DRS, that enable the movement of workloads among physical hosts. Application support teams are not automatically comfortable with this level of automatic and dynamic operation. To facilitate a productive partnership between these teams and IT Operations, effective infrastructure response time tools must be in place, and they must be able to demonstrate that dynamic operations are not interfering with committed application response times.

Resource Utilization Is No Longer a Reliable Indicator of Performance

When it comes to assuring infrastructure or application performance in a virtualized environment, all of the reasons cited above argue that the existing methods of collecting resource utilization statistics to create patterns of normal and abnormal usage are no longer effective. The rest of this paper discusses a new approach to infrastructure performance management that is optimized for virtualized infrastructures and that monitors application performance from the end-user's perspective.

IV. The Need for New Tools for Virtualized Systems

Systems Management and Performance Management have existed since the days when the mainframe was the sole computing resource in the enterprise. Since then, multiple waves of innovation in the fundamental architecture of computing have swept the industry, including:

- Minicomputer-based systems
- Standalone personal computers
- Personal computers networked around a file-sharing model
- Personal computers networked around a two-tier client/server model
- N-Tier computing, with presentation servers, application servers, and database servers
- The Web server becoming the dominant presentation server, and the rise of the Internet as a ubiquitous method to access computing resources and applications
- Virtualization as a way to reduce over-provisioning and inject standardized dynamic operations into vertical silos of hardware, software, and applications
- Cloud Computing as a way to economically and technically separate users and applications from the infrastructure that supports them

Each wave of innovation has ushered in a new set of management tools designed to address these architectural changes. Virtualization likewise requires a new set of tools because most of the legacy systems management and performance management tools were not designed for an environment where previously separate workloads are dynamically shared in a resource pool.

V. What is Virtualization Infrastructure Performance Management?

Infrastructure Performance Management (IPM) is a new approach to managing the performance of the physical and virtual hardware and software resources that comprise virtualized and cloud-based computing environments. This new approach is based on a new performance metric for the infrastructure: Infrastructure Response Time. The vendors listed in the graphic in Section II do not measure or present Infrastructure Response Time; hence the need for new approaches and new vendors.

Infrastructure Response Time Defined

IPM is a superset of the Resource and Availability Management category. When optimized for a virtualized environment, solutions in this category collect vCenter data, but build on this data by collecting unique data of their own, which allows them to provide a response time-based perspective on infrastructure performance. This perspective takes the end-user experience of interacting with the application into account, an approach that offers a vast improvement over the resource-based view of performance.

Infrastructure Response Time (IRT) is defined as the time it takes for any workload (application) to place a request for work on the virtual environment and for the virtual environment to complete the request. The request could be as simple as a bi-directional exchange of data between two guest VMs on one host over the vSwitch. Or the request could comprise multiple hops among various VMs on multiple hosts and then include a database transaction, which ultimately requires a write to a storage array and a confirmation back to the original requesting component of the application. Each separate portion of the request and the associated responses must be timed so that the actual experience of the end-user who initiated the request can be evaluated.

Key Requirements for Infrastructure Performance Management Solutions

Prior to virtualization, IPM was not a separate category of solutions; it was a component of IT resource and availability management. However, due to the dynamic and shared nature of virtualized systems, neither infrastructure nor application performance can be accurately gauged from resource utilization or availability metrics gathered from the virtualized environment. Infrastructure performance management exists as a separate category, not subject to the limitations of the resource utilization approach.
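The Infrastructure Response Time definition above (time each portion of a request, then evaluate the whole) can be illustrated with a minimal sketch. The `Hop` record, the hop names, and the millisecond timings are hypothetical values chosen for illustration; they are not taken from any vendor's product or data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One timed portion of a request (hypothetical record layout)."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def elapsed_ms(self) -> float:
        return self.end_ms - self.start_ms


def infrastructure_response_time(hops):
    """Return (total_ms, slowest_hop) for a single request.

    total_ms spans from the first request placed on the virtual
    environment to the final confirmation; slowest_hop is the portion
    that contributed the most time.
    """
    if not hops:
        raise ValueError("at least one timed hop is required")
    total_ms = max(h.end_ms for h in hops) - min(h.start_ms for h in hops)
    slowest_hop = max(hops, key=lambda h: h.elapsed_ms)
    return total_ms, slowest_hop


# Illustrative multi-hop request ending in a storage array write.
request = [
    Hop("web VM -> app VM (vSwitch)", 0.0, 2.0),
    Hop("app VM -> database VM", 2.0, 7.0),
    Hop("database VM -> storage array write", 7.0, 19.0),
    Hop("confirmation back to web VM", 19.0, 22.0),
]
total_ms, slowest = infrastructure_response_time(request)
print(f"IRT = {total_ms} ms, slowest portion: {slowest.name}")
```

Timing each hop separately, rather than only the end-to-end total, is what allows a slow request to be attributed to a specific layer such as the storage write in this example.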
The key requirements of an infrastructure performance management solution are:

- Support any applications running in the virtual environment (take an application-agnostic approach). To the team supporting the virtual environment, Infrastructure Response Time is relevant across that entire environment. To provide assurance to the application teams that the virtual infrastructure is performing well, IRT must be calculated and reported for every application running on that infrastructure. This IRT strategy in turn requires the IPM solution to automatically identify the applications and their topologies in a manner that is independent of the application architecture.

- Continually discover the topology of the infrastructure supporting each application. Once you understand how applications are communicating with each other, it becomes necessary to dynamically identify the chain of virtual and physical resources that are supporting an application at a given moment in time, based on continual discovery.

- Do not rely on resource utilization statistics collected from the infrastructure components of a virtualized system to infer infrastructure and application performance. Instead, take a clean-sheet-of-paper approach toward calculating IRT.

- Calculate IRT from the guests to the spindle and back. Infrastructure Response Time must be calculated across the breadth and depth of the virtual environment. For example, many scaled-out applications have multiple tiers that run within many different guests across a virtual infrastructure, and a significant portion of those guests may make database or other I/O calls that are serviced by a physical disk array. When calculating IRT metrics, the full scope of the virtualized environment must be considered.

- Provide both performance and capacity-based analytics around Infrastructure Response Time. IRT is more than just the standard by which virtual infrastructure performance should be understood and the metric on which performance troubleshooting should be based. IRT is also the basis for a new understanding of capacity planning. When an increase in workload is contemplated, the first question that needs to be answered is whether this increase will raise IRT beyond acceptable levels. If the metrics indicate that it will, it must be possible to determine whether the increase signals a capacity issue or is caused by some other factor.

- Provide out-of-the-box value. Virtualized environments change too rapidly for an approach that requires extensive manual configuration to provide value. New applications, and new versions of existing applications, are constantly being added to the environment. These must be automatically discovered, and the management product should start providing IRT information on these new applications immediately after discovery. Similarly, the addition of new servers, new vSwitches, new VLANs, or a new storage array should not require manual reconfiguration of the IPM solution; the management system should simply discover the additions and changes and adapt accordingly.

- Be prepared for the multi-hypervisor case, such as VMware and Hyper-V, by 2011. While every virtualization platform vendor provides performance monitoring solutions (all of which are based on resource utilization), their customers should retain the ability to run multiple virtualization platforms with a consistent set of performance management products to manage them. While VMware vSphere is the clear market leader in large enterprises today, especially for business-critical workloads, other virtualization platforms are gaining enough technical maturity and market traction that support for them should be forthcoming in the near future within market-leading IPM solutions.
The following diagram shows the flow of an IRT transaction from a guest to the spindle and back, through the layers of the virtual environment, and lists four vendors of IPM solutions who are profiled in the next section.

[Figure: Infrastructure Performance Management Solutions and their Use Cases]

VI. Comparison of IPM Solutions

The table below compares the major Infrastructure Performance Management solutions available on the market today. The CA NetQoS solution is unique in the following respects:

- NetQoS has a long history as a successful vendor in the Network Performance Management market and was considered a leader of that market before being acquired by CA in the fall of 2009. NetQoS brings deep TCP/IP technical and market expertise to the IPM market.

- The CA Virtual Assurance solution (which ports aspects of NetQoS) is the only solution on the market that can provide an Infrastructure Response Time (IRT) metric for every TCP/IP application running in every guest in a VMware virtual environment. The IRT metrics are combined with NetFlow data from the switches carrying traffic systemwide to provide both traffic volume metrics and response time data.

- The comprehensive data collection performed by the CA Virtual Assurance solution occurs in real time. This approach avoids the problems associated with typical 5- or 15-minute polling intervals, which can easily miss transient but repetitive problems.
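The point about polling intervals missing transient but repetitive problems can be made concrete with a small simulation. This is a sketch under stated assumptions: the latency series is synthetic (a 30-second spike that starts two minutes into every 10-minute window), the function names are hypothetical, and the poller is assumed to take an instantaneous sample at each interval rather than an average.

```python
BASELINE_MS = 20.0   # assumed normal response time
SPIKE_MS = 400.0     # assumed response time during a transient problem


def response_time_ms(second: int) -> float:
    """Synthetic latency: a 30-second spike starting 2 minutes into
    every 10-minute window, baseline otherwise."""
    return SPIKE_MS if 120 <= second % 600 < 150 else BASELINE_MS


def worst_observed(sample_every_s: int, duration_s: int = 3600) -> float:
    """Worst response time seen when sampling once per interval."""
    return max(response_time_ms(t) for t in range(0, duration_s, sample_every_s))


polled_15_min = worst_observed(900)   # samples at 0, 900, 1800, 2700 s
polled_5_min = worst_observed(300)    # samples every 300 s
continuous = worst_observed(1)        # one sample per second

print(polled_15_min, polled_5_min, continuous)
```

In this synthetic hour the spike recurs six times, yet neither polling schedule lands on it, while per-second collection sees it every time. A poller that averages over its interval would register the spikes only in diluted form.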

Feature/Capability comparison: Akorri, CA Virtual Assurance, Virtual Instruments, Xangati

Data Collection Methods
    Akorri: vCenter APIs, direct instrumentation of SAN and storage arrays
    CA Virtual Assurance: vCenter APIs, NetFlow data from physical switches, application performance data from virtual and physical switches via mirror ports
    Virtual Instruments: vCenter APIs, proprietary taps into SAN fabric
    Xangati: vCenter APIs, NetFlow data from physical and virtual switches

Breadth and Depth of Infrastructure Response Time Data Collected
    Akorri: Infrastructure Response Time is collected end-to-end (from guest to spindle on storage array)
    CA Virtual Assurance: Infrastructure Response Time is collected for each application identified via port and protocol from the guest through the entire IP network (LAN, WAN, and IP storage)
    Virtual Instruments: Measures the response time of individual Fibre Channel frames and maps them to LUNs
    Xangati: Infrastructure Response Time is collected for each application identified via port and protocol for the IP network (LAN, WAN, and IP storage)

Storage Performance Visibility
    Akorri: Has specific instrumentation for storage arrays. Captures IOPS and storage latency to physical spindles. Maps guests and workloads to spindles
    CA Virtual Assurance: Only for IP-attached storage devices using iSCSI
    Virtual Instruments: Taps the SAN data directly for latency and load information for all Fibre Channel traffic
    Xangati: Only for IP-attached storage devices using iSCSI

LAN and WAN Performance Visibility
    Akorri: No visibility into the LAN or the WAN
    CA Virtual Assurance: Deep visibility into all IP traffic (LAN and WAN)
    Virtual Instruments: No visibility into the LAN or WAN
    Xangati: Deep visibility into all IP traffic (LAN and WAN)

Server Performance Visibility
    Akorri: Direct calculation of IRT impacts on a per-guest and per-host basis
    CA Virtual Assurance: Sees server impacts from the perspective of end-to-end application response time
    Virtual Instruments: Relies on vCenter data to infer server-level performance issues from resource utilization data
    Xangati: Sees server impacts from the perspective of the network

Visibility into Performance Issues between Guests on one Host
    Akorri: No
    CA Virtual Assurance: Virtual appliance on the mirror port of the vSwitch sees interactions between guests on one host
    Virtual Instruments: No
    Xangati: Virtual appliance on the mirror port of the vSwitch sees interactions between guests on one host

Level of Application Identification
    Akorri: Pulls process list from guests via WMI. Able to identify certain key applications and workloads
    CA Virtual Assurance: Identifies applications based upon ports and protocols
    Virtual Instruments: No ability to tie applications to infrastructure slowdowns
    Xangati: Identifies applications via Cisco IP SLA protocols

Data Collection Interval
    Akorri: Polls the entire virtual infrastructure every 15 minutes
    CA Virtual Assurance: Real-time
    Virtual Instruments: Real-time
    Xangati: Real-time

Built-in Analytics
    Akorri: Automatically calculates a performance index which compares IRT against capacity utilization
    CA Virtual Assurance: Automatic baselines and thresholds, Top-N reporting. Optional investigations and notifications when performance degrades. Integrated reporting with a full IT management suite.
    Virtual Instruments: None
    Xangati: None

Deployment Model
    Akorri: Deployed as one subnet-attached virtual appliance in the VMware resource pool
    CA Virtual Assurance: Deployed as one virtual appliance on the vSwitch in each VMware host, physical appliances on the physical mirror ports on the LAN switches, and one management appliance
    Virtual Instruments: Deployed as a physical tap on the Fibre Channel SAN
    Xangati: Deployed as a virtual appliance in each VMware host, and as a separate virtual appliance for management

VII. Importance of Virtualization Infrastructure Performance Management

Deploying a Virtualization Performance Management solution built around a broad and deep understanding of individual application Infrastructure Response Time represents the only credible approach to virtualizing the production environment. This approach allows the teams that own and support the virtual environment to virtualize business-critical and performance-critical applications. Without adequate performance monitoring, application owners will effectively resist virtualization because it increases their perceived risk of performance issues and reduces their ability to over-provision the resources that are assigned to these applications.

The conflict between the IT group, which wants to deliver more cost savings and technical or business agility to the enterprise, and the application teams, which want to fulfill service level agreements, is natural from the point of view of each constituent. To resolve this conflict, IT must be able to prove that the portion of the infrastructure that is serving each application at each moment in time is actually performing at a level that does not impede application performance. Resource utilization metrics offered as proof will not be accepted by the application owners, who cannot be expected to mentally translate a variety of component infrastructure metrics into a number that is meaningful for their application. Only a response time-based metric like Infrastructure Response Time can serve as the key point of agreement between these two constituencies. Once an agreement is established, the benefits of virtualization can be extended to additional applications in the organization.

VIII. CA Virtual Assurance

CA Virtual Assurance offers a unique solution that helps enterprises maintain the performance of a virtualized infrastructure. CA Virtual Assurance works by combining unique, application-specific infrastructure response time metrics collected by a virtual collector on each VMware host with resource utilization data collected from the vCenter APIs. This combination allows CA Virtual Assurance to correlate infrastructure response time with the root-cause resource contention issues.

IX. CA Virtual Assurance Use Case

Problem: The network and virtual teams at a manufacturing company have been tasked with assuring consistent application delivery of the Siebel business service. Siebel is currently deployed as a multi-tier application (three tiers: Front-End, Middle Tier, Back-End) on a VMware ESX host. VMware Distributed Resource Scheduler (DRS) has been enabled on the cluster. Based on CPU contention, DRS VMotions the Middle Tier of the Siebel application.

Approach: Through the CA Virtual Assurance console, the virtual administrator is able to see the topology update in real time after the VMotion is complete.

Shortly after, the virtual administrator receives a performance-based alert indicating that Server Response Time is higher than normal for the Middle Tier of the Siebel application. The virtual administrator decides to investigate further and drills into the Performance View.

The virtual administrator is now looking at the VM Performance tab, which indicates that the Server Response Time (SRT) is well above the normal baseline for the middle tier of the Siebel application. The virtual administrator notices that SRT was high even before the VMotion occurred, but it actually got worse after the migration was complete. Equally important, the rise in unresponsive session percentage signifies that end-users are being impacted.

In addition, the VM Resources tab indicates performance degradation after the VMotion event occurred. The virtual administrator sees that the CPU load of the ESX host is normal, but that CPU Ready has spiked upward. A rise in CPU Ready typically occurs when a VM (or resource pool) must wait to use the physical CPU of the ESX host, which affects server response time. VM CPU Ready charts show spikes during business hours that correspond to the degraded application performance. Further investigation shows that the ESX Host which now contains the Middle Tier of the Siebel application is oversubscribed. Rather than strictly relying on DRS, the virtual administrator evaluates the performance-based metrics and makes the decision to manually VMotion the Middle-Tier resource pool to an ESX host that has additional CPU capacity.
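The CPU Ready spike described in this use case is easier to interpret as a percentage of the sampling interval than as a raw counter. The sketch below assumes that CPU Ready is reported as a summation in milliseconds over a sampling interval (20 seconds is assumed here as the real-time interval) and normalizes per vCPU; the function name and the 5% warning threshold are hypothetical and should be checked against your own platform's documentation and norms.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (milliseconds spent waiting for a
    physical CPU during one sampling interval) into a percentage of
    that interval, averaged per vCPU.

    interval_s=20.0 is an assumed real-time sampling interval.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


READY_WARN_PERCENT = 5.0  # hypothetical threshold, not a vendor-defined limit

# Illustrative samples for a 2-vCPU VM before and after a migration.
before = cpu_ready_percent(ready_ms=400.0, vcpus=2)
after = cpu_ready_percent(ready_ms=4000.0, vcpus=2)

for label, value in (("before", before), ("after", after)):
    status = "investigate" if value > READY_WARN_PERCENT else "ok"
    print(f"{label}: {value:.1f}% CPU Ready ({status})")
```

A host can show normal CPU load while a VM's ready percentage climbs, which is the pattern in the use case: the wait for a physical CPU, not the utilization figure, is what tracks the degraded response time.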

X. Defining a Vision of IPM in the Virtualized Environment

The ability to detect dynamic resource allocation, identify dependencies and logical relationships between the virtual components, and automatically update the topology in real time is of vital importance in a virtualized environment. Business continuity and the ability to maintain visibility of the topology are both critical operational requirements. With new VMs being provisioned to run business-critical applications and VMotion happening on a regular basis, the monitoring solution must be able to accurately report on the entire lifecycle of a VM, both before and after a VMotion event occurs. Heterogeneous environments and geographically dispersed end-users necessitate mapping out the physical infrastructure, such as the upstream Layer 2 switches and routers, alongside the virtual environment.

Before virtualization, capacity planning was important and used heavily in the physical data center. Because more traditional networks tended to be static in nature, the typical approach to such planning worked, and it usually involved forecasting based on volume and utilization trends and baseline data. In a virtualized environment, the rules have changed because of the dynamic nature of virtual machines. Now, a real-time approach to capacity management that involves continuous monitoring is needed. Every time a new VM is provisioned or a VMotion event occurs, a new baseline must be established. To find acute performance issues that require remediation, Operations needs dynamically generated definitions of normal performance. From an operational perspective, the supporting data cannot be collected manually; collection must be automatic.

With virtualization, application owners have the same expectations for levels of service. Upper-level managers look at SLAs over time and set service levels that allow them to determine, from month to month, whether their IT infrastructure investment has brought objective improvement. As a result, monitoring tools have had to prove their value and to quantify the areas of the infrastructure that need greater investment. In the future, enterprise customers will seek solutions that perform many of the same functions as VMware DRS but that base their decisions on performance-related metrics rather than resource-based metrics like CPU and memory.

Vendors are still taking a resource-based approach to performance management. But when end-users call to complain about a performance issue, they usually report something like "the application is taking forever to load," not high CPU utilization or memory usage. Utilization data is important, but it must be accompanied by additional metrics, such as application response time, that reflect the end-user's experience. In a virtualized environment, end-to-end application response time remains the most important performance metric to monitor.

A best-practice approach requires a solution that spans the full lifecycle of virtualization management needs and provides infrastructure monitoring from the perspective of application response time. When troubleshooting a business-critical process, the need to perform root-cause analysis and see beyond the symptomatic alarms also becomes very relevant. Those tasked with virtualizing the infrastructure should seek a single pane of glass in which to visualize that infrastructure in its entirety. They should demand tools that not only manage and report from that single window, but that also help them leverage their current investment in fault management solutions. CA Virtual Assurance comprises all of these features and takes a uniquely effective approach to managing performance in a virtual environment.
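The requirement in this section that a new baseline be established whenever a VM is provisioned or a VMotion event occurs can be sketched as follows. This is an illustration only: the class name, the mean-plus-three-standard-deviations rule, and the minimum sample count are assumptions made for the example and do not describe how any product mentioned in this paper computes its baselines.

```python
from statistics import mean, pstdev


class DynamicBaseline:
    """Dynamically generated definition of normal response time that is
    discarded and relearned after each topology change (hypothetical)."""

    def __init__(self, min_samples: int = 5, k: float = 3.0):
        self.min_samples = min_samples  # samples needed before judging
        self.k = k                      # allowed standard deviations above the mean
        self.samples = []

    def reset(self) -> None:
        """Call when a VM is provisioned or a VMotion completes."""
        self.samples.clear()

    def observe(self, value_ms: float) -> bool:
        """Return True if value_ms is abnormal against the current
        baseline; normal values are added to the baseline."""
        abnormal = False
        if len(self.samples) >= self.min_samples:
            limit = mean(self.samples) + self.k * pstdev(self.samples)
            abnormal = value_ms > limit
        if not abnormal:
            self.samples.append(value_ms)
        return abnormal


baseline = DynamicBaseline()
learning = [baseline.observe(v) for v in (10.0, 11.0, 9.0, 10.0, 10.0)]
flagged_before_reset = baseline.observe(30.0)  # judged against the learned baseline

baseline.reset()                               # e.g. a VMotion event just completed
flagged_after_reset = baseline.observe(30.0)   # no baseline yet, so not flagged
print(learning, flagged_before_reset, flagged_after_reset)
```

Resetting on every event trades a short blind period for a baseline that always reflects the VM's current host and neighbors; a fuller design might keep the old baseline for comparison while the new one is learned.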

XI. About APM Experts

Bernd Harzog is the CEO of APM Experts, a consulting and analysis firm focusing on the virtualization infrastructure and application performance management markets, vendor strategies in these markets, and customer use cases in these markets. Bernd was formerly a Gartner Group Research Director focusing on the Windows Server operating system, CEO of RTO Software, and VP of Products at Netuitive, and has been involved in vendor and IT strategy since 1980.