Best Practices for Managing Virtualized Environments




WHITE PAPER

Contents:
- Introduction
- Reduce Tool and Process Sprawl
- Control Virtual Server Sprawl
- Effectively Manage Network Stress
- Reliably Deliver Application Services
- Comprehensively Manage Performance and Availability
- Zenoss Best Practice Support
- Summary

Virtualized environments are taking IT datacenters by storm. As datacenter virtualization increases, it is critical to understand and adopt best practices for managing it. This white paper presents best practices you can use to manage datacenter virtualization more effectively.

Introduction

Virtualization is rapidly transforming the datacenter landscape. Enterprise Management Associates (EMA) defines virtualization as a technique for abstracting (or hiding) the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. This includes making a single physical resource (such as a server, an operating system, an application, or a storage device) appear to function as multiple logical resources, or making multiple physical resources (such as storage devices or servers) appear as a single logical resource.

According to several studies, server virtualization technologies deliver the following benefits to organizations:

- IT becomes more efficient. Virtualization optimizes infrastructure costs, increases operational efficiency, and reduces hardware expense. Organizations implementing virtualization technologies are able to reclaim existing datacenter capacity, such as space, power, and cooling, and postpone costly datacenter expansion plans.
- IT provides faster business application support. Virtualization speeds server and application deployment, simplifies the staging of applications across test and development environments, and makes configurations more consistent.
- IT delivers more predictable service. Virtualization makes failure recovery faster, easier, and less expensive.

The desire to harvest these benefits is rapidly driving datacenter virtualization initiatives.
However, virtualization can shackle, rather than free, IT operations staff if you do not follow these best practices when managing your virtualized environment:

- Reduce tool and process sprawl
- Control virtual server sprawl
- Effectively manage network stress
- Reliably deliver application services
- Comprehensively manage performance and availability

Reduce Tool and Process Sprawl

Effectively managing the administrative complexity that virtualization can bring to the datacenter is a challenge. Datacenter virtualization typically introduces many new systems, management tools, and processes. According to EMA research, typical virtualization deployments include approximately 11 different platforms, technologies, and vendors. Administrators use multiple tools, often from multiple vendors, to perform basic virtual infrastructure management functions, and IT operations staff must install, learn, configure, maintain, back up, and integrate each new tool into their existing processes.

Additionally, most organizations already have tools and processes designed to manage their existing physical datacenter infrastructure. With the arrival of virtualization, new management components are introduced that duplicate or overlap with the existing tools and processes, resulting in tool and process sprawl. For example, you probably have well-defined tools and processes for provisioning physical servers; with virtualization, you must also identify appropriate tools and processes for deploying virtual computers. Likewise, you probably already have well-defined tools and processes for monitoring

and reporting on your physical servers. With virtualization, you now also need tools and processes for monitoring and reporting on virtual computers.

When virtualizing components of your datacenter, the best ways to avoid tool and process sprawl are the following:

- Minimize the number of tools and processes IT operations uses to manage datacenter infrastructure components.
- Consolidate management of both physical and virtual resources.
- Use tools that you can quickly deploy and adapt.

Control Virtual Server Sprawl

Software developers, testers, and department heads all like virtual computers because they can be pressed into service almost immediately, often in fifteen minutes or less. However, this short deployment timeframe can lead to uncontrolled virtual computer deployment and virtually guarantees virtual computer sprawl. If your datacenter is like most, you can never be certain how many virtual computers are running, where they are, or even whether they are online and actively consuming resources or offline and passively consuming disk space. Because virtual computer deployment in most organizations circumvents the standard processes used for deploying physical servers, you cannot certify that all deployed virtual computers meet your organization's policies and procedures the way you can for physical computers. These new virtual computers can compete with existing business-critical applications for resources, quickly deplete datacenter CPU and memory capacity, and create security risks. You must identify new virtual computers, inventory them, associate them with an appropriate business service, track them when they move to new hosts, monitor their performance and availability, manage their events, report on them, and fix them when they break.
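The identify-and-inventory step can be sketched as a simple diff between successive inventory scans. This is an illustrative sketch only, not Zenoss code: the snapshot data and VM names are hypothetical, and in practice the snapshots would come from your virtualization platform's inventory API.

```python
def diff_inventory(previous, current):
    """Compare two VM inventory snapshots (collections of VM names)
    and report which virtual computers appeared or disappeared."""
    created = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    return created, deleted

# Hypothetical scans taken a day apart.
scan_monday = {"web-01", "db-01", "app-01"}
scan_tuesday = {"web-01", "db-01", "test-vm-jsmith"}

created, deleted = diff_inventory(scan_monday, scan_tuesday)
print(created)  # a VM IT operations never approved
print(deleted)  # verify its disk files were erased, not just the VM
```

A scanner running this comparison on a schedule gives IT operations the creation and deletion notifications the best practices below call for, without relying on each administrator to remember to report.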
You are probably also required to put a management agent on all of your datacenter computers, including virtual computers, and to gather and forward management information to a centralized infrastructure management tool. Performing all of these activities takes time, creates bottlenecks, and runs up administrative costs.

It is very common for IT system administrators to create a virtual computer for a specific purpose, use it for a short period, such as a couple of days or a week, and then completely forget about it as they move on to new projects. As a result, you can end up with virtual computers running silently in the background for weeks or months at a time, consuming resources with no one knowing they are there. Even when a virtual computer is switched off, it can continue to quietly consume datacenter disk space for years.

The following best practices help control virtual server sprawl in your datacenter:

- Employ strict controls over who creates virtual computers, who reviews them for proper configuration and security, and who deploys them.
- Put a process in place that ensures IT operations always knows when a virtual computer is created. Either require each virtual computer creator to notify IT operations every time a new virtual computer is created, or deploy a tool that can automatically identify and track virtual computers.
- Put a process in place that ensures IT operations always knows when a virtual computer is deleted, and ensure the disk files associated with the virtual machine are erased at the same time.

- Put a process in place that ensures you can track virtual computers not only when they are online, but also when they are offline. For example, you need to be able to track virtual computers that have not been powered on for weeks, months, or even years.
- Define and enforce parameters for a virtual computer's lifecycle, covering both online and offline states.

Effectively Manage Network Stress

Server virtualization solutions can be very effective in addressing current datacenter cost and operational challenges. However, the widespread deployment of virtual computers can stress datacenter network infrastructures designed to support a traditional server model, where one application runs on one physical server. Traditionally, datacenter networks were designed using the following premises:

- A server has a single identity: one MAC address, one IP address, and one World Wide Name (WWN).
- Each application requires its own server.
- Network segmentation required for regulatory, security, or political reasons is accomplished through physical separation and dedicated hardware.

Initially, it may seem that virtualization would have little impact on network management: the same applications still run in the same datacenter, just on fewer pieces of hardware. However, consolidating systems onto fewer physical devices concentrates and intensifies network traffic on those devices. With virtualization, more network traffic is centralized on fewer, larger computers rather than spread across a large number of smaller ones. For example, a physical computer with one, two, or three network interface cards (NICs) may host 10 to 20 guest virtual computers. These guests can quickly overwhelm the throughput capacity of the physical computer, impacting application performance and the end-user experience.
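The arithmetic behind that saturation risk is easy to sketch. The figures below are hypothetical; in practice, per-guest throughput would come from your monitoring data.

```python
def nic_utilization(nic_capacities_gbps, guest_demands_gbps):
    """Sum guest traffic against the host's aggregate NIC capacity and
    return the utilization ratio, so overloaded hosts can be flagged."""
    demand = sum(guest_demands_gbps)
    capacity = sum(nic_capacities_gbps)
    return demand / capacity

# Hypothetical host: two 1 Gbps NICs, twelve guests averaging 150 Mbps each.
utilization = nic_utilization([1.0, 1.0], [0.15] * 12)
print(round(utilization, 2))  # 0.9 -> 90% of aggregate NIC capacity in use
```

A threshold alert on this ratio, evaluated per host, catches the "10 to 20 guests behind three NICs" scenario before end users notice it.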
VMotion events can also overwhelm NICs and stress networks. VMotion is a key VMware technology: it continuously monitors pooled server resources and uses rules to intelligently and dynamically load-balance virtual computers across physical host computers. VMotion is powerful, but it can also increase network stress. For example, suppose you determine that guest virtual computer throughput is overwhelming the NICs on one of your ESX host computers. To address the issue, you move one or more guests to a second ESX host with enough NICs to provide sufficient throughput capacity. However, after a VMotion event places one or more additional guests on that second host, you discover that its NICs are now overwhelmed as well.

Datacenter virtualization also adds network stress through changes in datacenter architecture and design. Virtualization typically works best with detached, rather than attached, storage. This requires very fast, dependable network connectivity between servers and the storage devices on the storage area network (SAN), and it increases network traffic. All access between applications and data now runs over the network, and even small delays can create issues for many applications.

Understanding network segmentation is also critical to understanding and reducing network stress. You must understand both the logical and physical layout of your network, including its segments, to manage it effectively in a virtualized environment. If you do not clearly understand network segments and their capacities, you may overload a segment and degrade performance. You also need to understand segmentation from a security perspective, including which devices belong on one network segment but not on another.
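A segment-placement rule of this kind can be made machine-checkable. The sketch below is illustrative only; the policy table, VM classes, and segment names are all hypothetical.

```python
# Hypothetical policy: the network segment each class of VM must live on.
SEGMENT_POLICY = {
    "finance": "private-finance",  # firewalled off; some clear-text traffic
    "web": "dmz",
    "internal": "corp-lan",
}

def misplaced_vms(vm_segments, vm_classes):
    """Flag VMs whose actual segment violates the policy for their class."""
    flagged = []
    for vm, actual in vm_segments.items():
        expected = SEGMENT_POLICY.get(vm_classes.get(vm))
        if expected is not None and expected != actual:
            flagged.append(vm)
    return flagged

# A finance VM accidentally placed on the general corporate LAN.
print(misplaced_vms({"fin-app-01": "corp-lan", "web-01": "dmz"},
                    {"fin-app-01": "finance", "web-01": "web"}))
```

Run after every deployment or migration, a check like this turns tribal knowledge of "which devices belong where" into an enforceable control.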
For example, your organization may have a private network segment for financial applications. This private network segment is firewalled off from the rest of the company because some of the applications exchange information in clear text. However, if you virtualize the financial applications on this segment

without maintaining the separate network segment, another IT administrator with less knowledge of the network design could accidentally install an inappropriate device on the segment and cause a security breach.

In a virtualized environment, you must track the impact virtualization has on network usage and proactively identify and manage any areas of network stress. The following best practices help you do so:

- Continuously monitor loads on NICs and Ethernet SAN adapters, and ensure they do not become overloaded.
- Measure and account for increased network traffic on physical host computers that support multiple virtual computers.
- Ensure you have enough throughput capacity not only for the applications running on guest computers, but also for guest movements triggered by VMotion events.
- Proactively track VMotion events to determine whether they overwhelm NICs on physical host computers.
- Understand and manage the increased network stress caused by the greater rate and volume of communication and data transfer between servers and remote storage.
- Track all guest virtual computers and always know which network segment each guest resides on.

Reliably Deliver Application Services

Effective application service delivery requires a continuous understanding of the end-to-end IT infrastructure that supports each application, along with a clear understanding of each application's performance and availability requirements. In non-virtualized environments, application service delivery managers can easily put their arms around the various IT infrastructure components responsible for providing application service, and application managers and IT operations staff can use standard tools to measure and report on the performance and availability of the physical computers and other infrastructure components that support applications.
Virtualized environments, however, introduce more complexity around understanding and managing application service delivery. In virtualized environments, workloads and resource allocations change dynamically, hour to hour and even minute to minute. For example, VMware VMotion technology, used in production by up to 70% of VMware customers, continuously and automatically optimizes virtual computers within resource pools. As a result, VMotion can move an entire running virtual computer from one server to another almost instantaneously, and such moves can occur multiple times a day. When a virtualized IT environment dynamically and automatically relocates infrastructure components while optimizing itself, understanding and assuring application service delivery becomes a challenge.

With the amount of dynamic change inherent in virtualized environments, it is difficult for application managers and IT operations to know, at any point in time, where all of the IT infrastructure components that deliver application services reside. It can also be difficult to determine when a problem with application service delivery actually exists. For example, application managers may initially understand, based on their documented application service delivery model, that a specific ESX host computer or set of ESX host computers hosts the group of virtual computers that run their application. Armed with this information, they may configure management systems to alert when the performance or availability of an ESX physical computer hosting those virtual computers degrades. This approach can work well in a non-virtualized environment. However, in a dynamic, virtualized environment, such an approach quickly proves inadequate.
In a virtualized environment, if a VMotion event occurs, the guest virtual computers running the application may instantaneously move to one or more different ESX host computers. Application service is still effectively delivered, but the IT infrastructure supporting the application service has

changed. Further, the IT infrastructure supporting the application service may continue to change throughout the day as VMotion events move guest virtual computers around while optimizing the virtualized environment. When such an event occurs, the application team receives an alert warning them that the ESX computer hosting the virtual computers that run their application is offline. This alert is misleading: even though the original ESX host is offline, the application is still available; it is simply hosted on a different ESX host as a result of VMotion events.

To address these challenges, use the following best practices to effectively deliver application services:

- Require that the IT operations team responsible for supporting each application track VMotion events and update the application service delivery model every time a VMotion event occurs.
- Ensure that the application management team always has a clear view into, and understanding of, the IT infrastructure components that support their applications at any point in time.
- Implement a single system that can automatically update and maintain your application service delivery model. If you document your model with paper and pencil today, replace that with a system that automatically updates the model each time an application moves. In a virtualized environment, the infrastructure components that support a service move too fast, and the environment is too complex, to rely on manual processes.
- If you are using different tools with different service models, find one tool that can document your application service model.
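An automatically maintained service delivery model can be sketched in a few lines, assuming the monitoring layer can feed it migration events (all names here are hypothetical):

```python
class ServiceDeliveryMap:
    """Track which host currently runs each guest VM so that alerts can be
    evaluated against the application service, not against a fixed host."""

    def __init__(self, vm_to_host, service_to_vms):
        self.vm_to_host = dict(vm_to_host)
        self.service_to_vms = service_to_vms

    def on_migration(self, vm, new_host):
        # Called once per VMotion-style event reported by the monitoring feed.
        self.vm_to_host[vm] = new_host

    def hosts_for(self, service):
        return {self.vm_to_host[vm] for vm in self.service_to_vms[service]}

smap = ServiceDeliveryMap({"bill-vm1": "esx-01", "bill-vm2": "esx-01"},
                          {"billing": ["bill-vm1", "bill-vm2"]})
smap.on_migration("bill-vm1", "esx-02")   # a VMotion event moves one guest
print(sorted(smap.hosts_for("billing")))  # ['esx-01', 'esx-02']
```

Because the map is updated on every migration event, an alert rule can ask whether any host currently serving the application is down, rather than pinning the alert to the host recorded when the model was first written.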
Otherwise, when the IT components that support an application service change, the IT operations team may have to update several different tools to track and document the change.

Comprehensively Manage Performance and Availability

One of the most vexing challenges in a virtualized environment is managing virtual computer performance and availability and understanding its relationship to physical computer performance and availability. Understanding performance and availability in a virtualized environment is typically a two-step process. First, you use one set of tools to look at the physical ESX host computer and its processor and memory utilization. Then you use a second set of tools to examine the guest virtual computers and their allocated processor and memory. You may even need a third set of tools to assess the performance and availability of the guest operating system and the applications it supports.

Comprehensively managing performance and availability across these three layers is critical, but difficult. For example, a physical computer may report 20% CPU utilization while one of its virtual computers is throttled at 100% of its allocated virtual CPU, degrading performance and diminishing the service the end user experiences. Similarly, a virtual computer may report using only 50% of its virtual memory while memory over-commitment across multiple virtual computers drives the host to 100% memory utilization, causing intensive memory page swapping and hurting the end-user application experience. To manage performance and availability effectively, you must understand not only physical computer performance, but also virtual computer performance, and how the relationship between the two affects the end-user application experience.
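The 20%-host, 100%-guest example can be expressed as a simple correlation rule. This is an illustrative sketch with made-up thresholds and figures, not a real monitoring policy:

```python
def constrained_guests(host_cpu_pct, guest_cpu_pct):
    """Flag guests saturating their own vCPU allocation even when the host
    looks idle -- the host-level number alone hides the problem.
    guest_cpu_pct maps VM name -> percent of the guest's allocation in use."""
    if host_cpu_pct >= 90:
        # The host itself is the bottleneck; every guest is suspect.
        return sorted(guest_cpu_pct)
    return sorted(vm for vm, pct in guest_cpu_pct.items() if pct >= 95)

# Host reports a comfortable 20% CPU, yet one guest is pinned at 100%.
print(constrained_guests(20, {"vm-a": 100, "vm-b": 35, "vm-c": 12}))
# ['vm-a'] -> vm-a's users are suffering despite the apparently idle host
```

The same shape of rule works for the memory case: compare each guest's reported utilization of its allocation against the host's aggregate committed memory before concluding that everything is healthy.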

Unfortunately, there are few ways to obtain both physical and virtual performance and availability data simultaneously so you can easily normalize and correlate that data across the physical, virtual, and application layers. Using a variety of disparate tools to manage physical and virtual performance and availability inevitably leads to inefficiencies and process gaps, and makes it hard to identify the root cause of an issue. In a virtualized environment, you must understand performance at every layer. To address these challenges in hybrid virtualized environments, use the following best practices:

- Monitor virtual computers like physical computers. Do not treat virtual and physical computers differently; operate as if IT operations is equally responsible for both.
- Track not only how much physical processor and memory each physical host computer uses, but also how much of its allocated processor and memory each virtual guest computer uses.
- Track and correlate performance and availability information for the virtual and physical infrastructure together to ensure a complete picture of application performance and availability levels.
- Prevent excessive load on physical host computers as a result of VMotion events by carefully tracking those events and verifying that the CPU and memory utilization changes they trigger on physical computers do not require adjustments to CPU and memory allocations on virtual computers.

Zenoss Best Practice Support

Zenoss was developed specifically to support the implementation of best practices for managing virtualized environments.
Reduce tool and process sprawl

Zenoss helps you control tool and process sprawl by unifying the monitoring and management of physical and virtual resources, rather than relying on separate products, each with its own user interfaces, processes, databases, and agents. Zenoss deploys in days, not months or years. Zenoss also gives you flexibility, delivering a tightly integrated toolset with an open source design that lets you adapt your Zenoss configuration to your exact enterprise datacenter needs.

Control virtual server sprawl

Zenoss helps you control virtual server sprawl by automatically discovering and categorizing virtual infrastructure components, including virtual computers, ESX computers, clusters, and datastores. Zenoss automatically tracks all guest virtual computers and detects in real time when virtual computers move from one ESX host to another. Because Zenoss is agentless, you do not have to spend time deploying agents in order to monitor virtualized components. Finally, Zenoss automatically tracks where each virtual computer is and whether it is running; even when a virtual computer is not running, Zenoss knows it is there and how much disk space it currently consumes.

Effectively manage network stress

Zenoss helps you manage network stress in a virtualized environment by monitoring loads on NICs and Ethernet SAN adapters, helping you understand and manage the increased network stress caused by greater communication with external storage and by increases in the rate and volume of data transfer between servers and remote storage. Zenoss also monitors and records VMotion events and automatically tracks guest virtual computers, showing which segments they reside on.

Reliably deliver application services

Using the Zenoss data model, guest virtual computers that support an application service can be logically grouped together. Zenoss clearly displays information about the ESX physical computer currently hosting each guest virtual computer. When a VMotion event occurs, Zenoss automatically and immediately updates its database to show which ESX server now hosts the guest computers. With this information, IT operations and application service delivery managers can see at a glance which infrastructure components support service delivery for a specific application, without manually updating their service delivery model and without using multiple tools. They can also drill down at any time to view additional information about the performance and availability of the ESX physical host, each guest virtual computer, and even the applications running on the guest computers.

Comprehensively manage performance and availability

Zenoss gathers data from physical and virtual environments simultaneously and allows IT operations to manage and view both together. Zenoss automatically correlates performance and availability information for the virtual and physical infrastructure, helping IT operations understand end-to-end application performance requirements and performance at every layer. Zenoss also continuously measures dynamic workloads and resource allocations to establish performance requirements.

Summary

Zenoss makes it easy to apply best practices for managing virtualized environments. Zenoss allows you to eliminate tool and process sprawl by holistically monitoring your physical and virtual infrastructure through a single pane of glass.
You can manage virtual server sprawl with tools that automatically discover and inventory all of your virtualized and physical components, and monitor network stress in your environment using tools that combine synthetic end-user transactions with centralized infrastructure performance measurements. Zenoss also allows you to automatically and consistently manage services, prevent service disruption, and report on service levels for both physical and virtual resources. Finally, Zenoss allows you to comprehensively monitor resource performance and availability, including applications, databases, middleware, and Web servers, whether physical or virtual.