SIZING-UP IT OPERATIONS MANAGEMENT: THREE KEY POINTS OF COMPARISON

MOVING BEYOND MONITORING, ALERTING AND REPORTING TO DELIVER AN AUTOMATED CONTROL SYSTEM FOR TODAY'S VIRTUALIZED DATA CENTERS

VMTurbo is a company founded on the belief that IT operations management needs to change fundamentally if companies are to unlock the full value of virtualized infrastructure and cloud services. The approach must shift from bottom-up to top-down, and the emphasis must move from manual intervention to prescriptive analytics that identify the actions needed to hold the environment in an optimal state: one where applications get the resources they require to meet business goals while making the most efficient use of network, storage and compute resources.

Virtualization is a key enabler, but existing management tools fail to embrace and leverage this new architecture effectively. With traditional approaches, too much time is spent collecting thousands of data points and triggering alerts on perceived anomalies, leaving IT operators with the difficult task of figuring out what to do to return the system to acceptable performance. These approaches, by definition, lead to unpredictable service, are OPEX-intensive, and become increasingly complex as shared-everything environments change constantly due to workload fluctuations, virtual machine movement, and all the self-service benefits that accompany virtual infrastructure.

To do this properly, the operations management framework must have three important characteristics in its design:

1. The approach must be oriented around a top-down view of the infrastructure, with a layer of abstraction that allows the environment to be modeled rapidly and removes the data collection burden.

2. The model must be capable of understanding the entire environment and all of the constraints and interdependencies that exist within it.

3. The analytic engine must be built with the goal of prescribing the specific actions that will maintain the infrastructure in the desired state based on the business rules in effect, automating the decision-making process for IT operations.
RETHINKING THE APPROACH TO IT OPERATIONS MANAGEMENT

Conceptually, virtualization provides an incredible opportunity to change the way IT is managed: software controls can dynamically change resource allocation and workload configurations for applications operating across a shared physical infrastructure. Examples of decisions that can be executed through these software controls include altering workload placement on server and storage infrastructure; changing virtual machine resource limits and reservations; starting and stopping virtual machines; or cloning application instances (a minimal sketch of this action vocabulary appears at the end of this section). In effect, virtualization enables the environment to be horizontally scaled (to meet specific workload demand) and vertically scaled (to reallocate resources based on existing workload requirements) in rapid fashion.

These controls give IT operators an agile way to optimize resource usage or address performance bottlenecks reported by monitoring systems. However, because of the shared nature of virtualized infrastructure and the dynamic fluctuation of workloads, decisions about how to orchestrate these controls must be taken with great care so that one action does not degrade the performance and efficiency of other IT services. For example, reconfiguring the placement of virtual machines on servers in a cluster, or resizing a virtual machine on a host, might solve a specific resource bottleneck but create other resource constraints across the environment. This knock-on effect can impact applications or workloads that are more latency-sensitive or critical to the business.

The number of variables, constraints, and dependencies that must be considered and modeled to make effective decisions is immense. Combined with the dynamic nature of the virtual infrastructure itself, this creates a problem of exponential complexity. And as the environment grows beyond a handful of physical hosts, the analysis becomes impossible to complete before the underlying data has changed again.

Fundamentally, the traditional approach of collecting data, generating alerts and manually troubleshooting cannot scale to meet the requirements of this new architecture. IT operators struggle to keep pace, and the collection mechanism itself imposes a significant tax on the overall system. Moreover, this approach is not designed to leverage the inherent fluidity of virtual data centers to control the environment based on workload demands, resource capacity, and configuration and business constraints.

Key benefits of a new approach:
- Provides a more stable and consistent user experience by assuring the quality of service of virtualized applications
- Lowers operational costs by reducing the number of problems and incidents IT must handle
- Improves the ROI of compute and storage assets by driving higher levels of utilization across the environment
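The controls described above amount to a small vocabulary of actions against the virtualization layer. The following is a minimal sketch of how that vocabulary might be represented in software; the action types and the dispatcher are hypothetical illustrations for this paper, not an actual hypervisor API.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical action vocabulary mirroring the software controls described above:
# move a workload, resize its limits, change its power state, or clone it.

@dataclass
class MoveVM:
    vm: str
    target: str            # a target host, or a target datastore for storage moves

@dataclass
class ResizeVM:
    vm: str
    vcpu: int              # new vCPU count
    vmem_mb: int           # new memory limit in MB

@dataclass
class SetPowerState:
    vm: str
    powered_on: bool

@dataclass
class CloneVM:
    vm: str
    clone_name: str

Action = Union[MoveVM, ResizeVM, SetPowerState, CloneVM]

def execute_action(action: Action) -> None:
    """Dispatch an action to the virtualization layer.

    In a real system each branch would call the hypervisor's management API
    (a migration task, a reconfigure task, a power operation or a clone task);
    here we only print what would be done.
    """
    if isinstance(action, MoveVM):
        print(f"Migrate {action.vm} to {action.target}")
    elif isinstance(action, ResizeVM):
        print(f"Resize {action.vm} to {action.vcpu} vCPU / {action.vmem_mb} MB")
    elif isinstance(action, SetPowerState):
        print(f"{'Start' if action.powered_on else 'Stop'} {action.vm}")
    elif isinstance(action, CloneVM):
        print(f"Clone {action.vm} as {action.clone_name}")
```

Representing the controls as data rather than ad hoc scripts is what later allows a decision engine to reason about, sequence and automate them.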
When determining a strategy for performance assurance of virtualized services, it is important to distinguish between vendors based on several key differences in how their solutions approach this complex set of challenges.

A BOTTOM-UP COLLECTION MODEL VS. A TOP-DOWN ABSTRACTION MODEL

Solutions with a heritage in visibility and alerting often incorporate data analysis engines that examine thousands of performance metrics to identify abnormal patterns in the data and infer actual or potential impact to service performance. In many cases these analytic engines focus on thresholds and correlate events to identify anomalies based on learned behavior. This is problematic, as threshold-driven events and learned behavior can be misleading if the environment is irregular, changes frequently, or, as is often the case, is not configured optimally. More importantly, these are bottom-up methods that are not designed with the goal of determining the actions required to systematically control resource allocation and workload performance. Because they focus exclusively on the myriad individual metrics at the infrastructure layer, they lack the understanding of topological relationships and dependencies required to drive intelligent decisions (and actions) across the IT environment and keep the infrastructure healthy. At best, they present operators with huge amounts of event data and require them to drill into it in the hope of determining what actions will address the anomaly.

A better approach is a top-down one that understands the control points that can be leveraged to tune the environment and uses only the data it needs to prescribe the actions necessary to maintain the system in the optimal operating state. Doing this properly requires a layer of abstraction across the environment through which the analytic model can be run to determine the right actions based on business rules and system interdependencies. This addresses the problem of data collection at scale that manifests in larger environments and ensures that the analysis engine prescribes actions with a full understanding of the topological relationships in the infrastructure. By focusing specifically on prescriptive analytics, this type of solution approaches operations management with the goal of preventing performance constraints based on service-level priorities and determining the specific actions that will allocate resources appropriately. A simplified sketch of such a prescriptive loop follows below.
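To make the contrast concrete, the following sketch outlines the shape of a top-down, prescriptive control loop at its simplest: model the environment through an abstraction, evaluate it against a desired state, and emit actions rather than alerts. The entity model and the utilization bands are assumptions made for illustration; this is not VMTurbo's engine.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical abstraction of the environment: each entity (VM, host, datastore)
# exposes only the utilization data the decision model actually needs.
@dataclass
class Entity:
    name: str
    kind: str                       # "vm", "host" or "datastore"
    cpu_util: float                 # 0.0 - 1.0
    mem_util: float                 # 0.0 - 1.0
    consumers: List[str] = field(default_factory=list)  # entities hosted here

# Assumed business rule: keep every provider inside a desired-state band so
# workloads have headroom without leaving capacity idle.
DESIRED_MAX = 0.75
DESIRED_MIN = 0.20

def prescribe(entities: List[Entity]) -> List[str]:
    """Return prescribed actions (not alerts) that move the system toward the
    desired state. A real engine would also weigh constraints, affinities and
    business priority before emitting an action."""
    actions = []
    providers = [e for e in entities if e.kind in ("host", "datastore")]
    spare = [p for p in providers if max(p.cpu_util, p.mem_util) < DESIRED_MIN]
    for p in providers:
        if max(p.cpu_util, p.mem_util) > DESIRED_MAX and p.consumers:
            target = spare[0].name if spare else "additional capacity"
            actions.append(f"Move {p.consumers[0]} from {p.name} to {target}")
    return actions

if __name__ == "__main__":
    env = [
        Entity("host-1", "host", cpu_util=0.88, mem_util=0.70, consumers=["vm-a", "vm-b"]),
        Entity("host-2", "host", cpu_util=0.15, mem_util=0.10, consumers=["vm-c"]),
    ]
    for action in prescribe(env):
        print(action)
```

Note that the output is a set of prescribed actions rather than an event stream: deciding what to do is the engine's job, not the operator's.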
ELEMENT- VS. ENVIRONMENT-CENTRIC RESOURCE OPTIMIZATION

Resource optimization is a key benefit marketed by vendors across the IT operations management landscape. However, it is important to assess what each vendor is actually delivering in this regard. Does the solution focus on individual metrics at the component level and optimize based on a narrow view of each element? Or is the solution more comprehensive, understanding the constraints and interdependencies across the environment?

Element-centric optimization is fairly straightforward in that it focuses on specific requirements and constraints, on an individual-metric basis, for a given workload or physical resource. The most common application of this in virtual environments is virtual machine rightsizing. For example, it is possible to look at an individual virtual machine and conclude that its allocated vMem, vCPU or vDisk should be increased because usage has exceeded a threshold. However, taking these actions could create larger issues in the environment if they are not considered in the context of the other workloads sharing those resources. Before increasing resource allocation, virtual machines may need to be moved to different hosts or datastores to create headroom so the change does not impact the performance of other workloads. If there is simply not enough capacity in the environment to meet the increased demand, then physical resources may need to be added before more virtual machine resources are allocated. And if no capacity is available and resources are constrained, the service levels or business priority of this workload relative to others in the system must be understood before the need is addressed. The sketch below illustrates this decision path.
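The reasoning above can be captured as a simple decision path: an element-centric check looks only at the virtual machine's own usage, while an environment-centric check also asks whether the host, or the cluster, can absorb the change before acting. The thresholds, the headroom check and the priority comparison below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class VM:
    name: str
    host: str
    mem_mb: int            # currently allocated memory
    mem_used_mb: int       # observed usage
    priority: int          # higher = more business-critical (assumed scale)

@dataclass
class Host:
    name: str
    mem_capacity_mb: int
    mem_allocated_mb: int  # sum of allocations of the VMs placed here

def element_centric(vm: VM) -> Optional[str]:
    """Narrow view: resize whenever usage crosses a threshold."""
    if vm.mem_used_mb > 0.9 * vm.mem_mb:
        return f"Increase memory of {vm.name}"
    return None

def environment_centric(vm: VM, hosts: List[Host], extra_mb: int = 1024) -> str:
    """Wider view: confirm the environment can absorb the resize first."""
    if vm.mem_used_mb <= 0.9 * vm.mem_mb:
        return f"No action for {vm.name}"
    current = next(h for h in hosts if h.name == vm.host)
    if current.mem_capacity_mb - current.mem_allocated_mb >= extra_mb:
        return f"Increase memory of {vm.name} by {extra_mb} MB on {current.name}"
    # Not enough headroom locally: look for a host that can take the resized VM.
    for h in hosts:
        if h.name != vm.host and h.mem_capacity_mb - h.mem_allocated_mb >= vm.mem_mb + extra_mb:
            return f"Move {vm.name} to {h.name}, then increase memory by {extra_mb} MB"
    # No capacity anywhere: the answer depends on business priority, not on one metric.
    return (f"Cluster is memory-constrained: add capacity or weigh "
            f"{vm.name}'s priority ({vm.priority}) against other workloads")
```

Even this toy version shows why the element-centric answer ("just resize it") is incomplete: the right action depends on state that lives outside the element being optimized.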
PROCESS VS. DECISION AUTOMATION

As with resource optimization, the term automation is used extensively in the marketing lexicon of every IT operations management vendor, and with good reason: manual tasks are labor-intensive and prone to error. When actions can be automated, they should be, and IT process or run-book automation solutions do just that, automating many of the discrete tasks associated with running the virtual data center. However, they do not solve the complex decision-making requirements that most IT operators face in maintaining the environment. These solutions are well suited to automation where the individual steps in a process can be clearly defined, programmed, and executed in a workflow engine. Unfortunately, determining the actions required to maximize performance and efficiency across the virtualized infrastructure is not an easy task, as each workload has its own personality and consumes resources differently from its neighbors. Very different results may therefore be achieved depending on how workloads are combined on different server and storage resources and on how physical and virtual resources are sized. Decision automation requires a deeper level of understanding than how to procedurally execute a set of tasks: to effectively ensure performance, the solution must be capable of determining what tasks to carry out.

In effect, the process automation itself is the easy part. To solve the workload performance management challenge, a decision analysis engine must determine and prescribe resource allocations and workload configurations on an ongoing basis, based on the assessment of multiple criteria: individual workload demand and patterns, the capacity of allocated physical and virtual resources, the environmental and business constraints that limit which decisions can actually be taken, and a full understanding of the systemic effect of executing those decisions across the environment. Once the actions have been identified, the automation capabilities are readily available in the virtualization layer via APIs or through comprehensive run-book automation solutions.

Continuously ensuring workload performance while maximizing the utilization of the underlying infrastructure is a complex problem to solve. It requires a highly sophisticated decision analysis engine, a holistic view of the environment built on an abstraction layer that reduces complexity, and a top-down understanding of the control points in the virtualized infrastructure so that the right actions can be taken.

CONCLUSION

At VMTurbo, our operations management solution focuses specifically on applying this new approach to planning, onboarding, and controlling virtualized data centers. By automating the decision-making process in software, VMTurbo Operations Manager maximizes utilization of the physical infrastructure, ensures critical applications have the resources they require, and reduces the operational costs of running virtual data centers. To do it, the product employs an economic abstraction of the IT infrastructure and uses a market-based approach driven by pricing principles to derive the specific actions that tune the environment for optimal performance and utilization (a simplified illustration of this pricing idea appears below).

VMTurbo is the only vendor that provides a closed-loop management system capable of holistically assuring workload QoS while maximizing infrastructure efficiency. Our solution continuously identifies inefficiencies, resource contention and bottlenecks in the system, and determines and automates the actions necessary to keep the environment in the optimal operating zone. It changes the economics of managing virtualized data centers, delivers operational savings and productivity gains across the organization, and is a better approach to IT operations management in today's virtualized data center.
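By way of closing illustration, the sketch below shows one way a utilization-based price signal could drive a placement decision: resources that approach saturation become expensive, so workloads naturally shop for the provider with the most headroom. The pricing curve and the shopping function are assumptions made for this sketch, not VMTurbo's actual algorithm.

```python
# Illustrative only: a toy "market" in which the price of a resource rises
# steeply as its utilization approaches saturation, so buyers (workloads)
# migrate toward underused sellers (hosts).

def price(utilization: float) -> float:
    """Assumed pricing curve: cheap when idle, effectively infinite near 100%."""
    utilization = min(utilization, 0.999)
    return 1.0 / (1.0 - utilization) ** 2

def cheapest_host(vm_demand: float, hosts: dict) -> str:
    """Pick the host whose price, after adding this workload's demand, is lowest.

    hosts maps host name -> current CPU utilization (0.0 - 1.0);
    vm_demand is the additional utilization the workload would add.
    """
    return min(hosts, key=lambda h: price(hosts[h] + vm_demand))

if __name__ == "__main__":
    cluster = {"host-1": 0.85, "host-2": 0.40, "host-3": 0.55}
    print(cheapest_host(vm_demand=0.10, hosts=cluster))   # -> host-2
```

The appeal of a price signal is that it folds demand, capacity and contention into a single comparable number, which is what makes environment-wide decisions tractable.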