WHITE PAPER

Intelligent Tracking of Performance Storms in Complex Cloud Infrastructures

by Jagan Jagannathan, Founder and CTO, Xangati

© 2014 Xangati, Inc. All rights reserved.
As enterprises, service providers, healthcare organizations, government agencies and educational institutions migrate their data centers to virtual and cloud infrastructures, management solutions have not kept pace: they do not provide the fine-grained, relevant information that this dynamic, complex and volatile environment demands. Critical resources and applications in such environments are shared and are therefore subject to spontaneous storms that degrade performance for applications and end users.

This white paper explores common storms affecting virtualization and cloud environments and the key Infrastructure Performance Management (IPM) requirements for intelligently identifying, capturing and managing performance storms.

Identifying Performance Storms in Cloud Infrastructures

Performance storms are created by unintended, toxic interactions among cross-silo shared resources in the converged data center. A storm entangles multiple objects (VMs, hosts, end users, applications, etc.) even if they are otherwise unrelated. The entanglement often has a dramatically adverse effect on infrastructure performance. Some of the most common performance storms include:

Storage storms typically occur when applications unknowingly and excessively share a datastore, which causes storage performance to deteriorate, often dramatically and spontaneously.

Memory storms usually occur when multiple VMs try to share an insufficient amount of memory or, in other cases, when one VM hogs memory and does not leave enough for the others, even with ballooning in place.

CPU storms typically occur when there aren't enough CPU cycles or virtual CPUs to go around in the sharing of processing resources, leaving some VMs with more and others with less.

Network storms usually occur when too many VMs attempt to communicate at the same time on a specific interface, or when a few VMs hog a specific interface with traffic, limiting the ability of other VMs to send or receive data.
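The memory and network descriptions above each name two contention patterns: many VMs together oversubscribing a shared resource, or a few VMs hogging it. The following minimal sketch shows one way to tell the two apart. It is an illustration only, not Xangati's method; the function name, the VM names and the 50% `hog_share` cut-off are all assumptions.

```python
def classify_contention(usage_by_vm, capacity, hog_share=0.5):
    """Classify contention on one shared resource (hypothetical helper).

    usage_by_vm: mapping of VM name -> demand, in the same unit as capacity.
    capacity:    what the shared resource can deliver.
    hog_share:   assumed fraction of capacity above which one VM counts as a hog.
    """
    total = sum(usage_by_vm.values())
    if total <= capacity:
        return "ok", []
    # Demand exceeds capacity: look for individual VMs taking an outsized share.
    hogs = sorted(vm for vm, used in usage_by_vm.items() if used > hog_share * capacity)
    if hogs:
        return "hog", hogs
    # No single VM dominates, so the contention comes from the crowd as a whole.
    return "crowd", sorted(usage_by_vm)


# Hypothetical 1000 Mbps interface shared by several VMs.
hog_case = classify_contention({"vm-a": 700, "vm-b": 200, "vm-c": 200}, capacity=1000)
crowd_case = classify_contention({f"vm-{i}": 200 for i in range(6)}, capacity=1000)
ok_case = classify_contention({"vm-a": 100, "vm-b": 200}, capacity=1000)
```

In the first case a single VM is singled out; in the second, every VM contributes roughly equally, which points toward adding capacity or rebalancing rather than throttling one consumer.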
Legacy Infrastructure (Pre-Cloud Era) Can't Deal With Performance Storms

With current performance management solutions, cloud performance storms can take anywhere from several hours to days or even weeks to isolate, analyze and resolve, according to a recent IT survey we conducted with ZK Research. Why does it take that long? There are two important reasons. First, existing solutions have, at best, a real-time fidelity of multiple minutes, which is fundamentally incompatible with performance storms that may start and finish within seconds. Second, current solutions focus only on silo-specific metrics that merely help generate alerts. Unfortunately, alerts identify only the effects of storms; they leave the all-important and often daunting root cause analysis to administrators to figure out on their own.

Identifying Causes of Performance Storms

Even in the best-run cloud infrastructures, performance storms are part of the new reality, and one must be able to accurately identify, track and resolve these disruptive and spontaneous occurrences in a timely and effective manner. To get to the root cause of the problem, you need:

1. Insight into real-time (second-by-second) interactions;
2. Visibility into both consumptive and interactive object behaviors; and
3. Integration with capacity management.

#1 Real-time (Second-by-Second) Insight Into Interactions

Because the cloud is constantly in flux, it is critical to visualize interactions on a real-time (second-by-second) basis in order to capture everything that is occurring within the environment. Equally important is tracking these second-by-second interactions at scale. Given the scale, complexity and behavior of cloud infrastructures, this live, continuous and scalable monitoring and management is essential to accurately identify performance storms and can only be achieved through an in-memory analytics architecture.
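To illustrate why multi-minute fidelity is incompatible with storms that last seconds, the following minimal sketch compares a five-minute average with a per-second view of the same data. The latency values, the 10-second storm and the 30 ms threshold are hypothetical, chosen only to make the effect visible.

```python
from statistics import mean


def window_averages(samples, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


def storm_seconds(samples, threshold_ms):
    """Return the indices (seconds) at which latency exceeds the threshold."""
    return [t for t, value in enumerate(samples) if value > threshold_ms]


# Hypothetical data: 300 seconds of datastore latency at 5 ms,
# with a 10-second storm at 80 ms starting at second 120.
latency_ms = [5.0] * 300
for t in range(120, 130):
    latency_ms[t] = 80.0

THRESHOLD_MS = 30.0  # assumed alert threshold

five_minute_view = window_averages(latency_ms, window=300)
per_second_view = storm_seconds(latency_ms, THRESHOLD_MS)

print(five_minute_view)  # one averaged value for the whole five minutes
print(per_second_view)   # the individual seconds that crossed the threshold
```

With these numbers the five-minute average stays far below the threshold, so an averaged view would raise no alert, while the per-second view pinpoints exactly when the storm began and ended.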
An in-memory analytics architecture allows the system to track and analyze what is happening at a precise moment on a second-by-second basis rather than just averaging data over a five- or ten-minute period. The analytics architecture enables visualization of the multitude of simultaneous, fine-grained interactions that are responsible for surges or spikes in performance. In effect, it provides the critical context and understanding needed to identify the trends and patterns that characterize storms. How else would you find the source of a datastore latency storm unless you know which VMs are actually using that datastore at that exact moment in time?

#2 Visibility into Both Consumptive and Interactive Object Behaviors

To see what is causing a performance storm, you need visibility not only into how objects are consuming cloud resources but also into how objects are interacting with one another within the infrastructure, the latter being much more critical in determining the cause of the problem. Consumptive silo-specific alerts (using a combination of system-learned and best-practice thresholds) point to the effects of performance storms (an impacted application or VM, for example), while interactional cross-silo alerts give details that enable one to accurately identify and resolve the source of the problem.

To deliver these interactional alerts and reveal the toxic (heavy resource usage) interactions that may be occurring between different objects, you must have a cross-silo analysis of the infrastructure, cutting across network, server and storage tiers as well as applications and end users, to provide context. Furthermore, this analysis needs to scale so that you can easily view the distant and proximate areas of impact for a given storm, as well as the source of contention and the resources affected. Only by visualizing and analyzing the cross-silo interactions can you accurately identify the trends and patterns of interactions that are causing the storm.

#3 Integration with Capacity Management

The most common culprit for performance storms is conservative provisioning or under-provisioning of the cloud, whether for cost reasons or because capacity requirements are poorly understood or unknown.
Considering this, it seems logical that one would integrate performance and capacity management. Yet today's virtualization management solutions do not, perhaps because they are unable to connect the two. They thereby ignore the intrinsic connection that exists and treat capacity management as a completely separate and distinct entity.

Xangati uniquely believes that infrastructure performance management must expressly inform capacity analytics; otherwise, you can't identify the impact of performance storms and their intensity on capacity utilization and saturation. This linkage leads to recommendations on how to resolve problems that cause storms, typically by either increasing resource capacity or by targeted resource load balancing.

To operate your cloud in an efficient and effective manner, you need the right infrastructure performance management solution to tackle the highly disruptive and hard-to-detect performance storms that are intrinsic to your cloud. The three capabilities discussed in this paper allow you to effectively monitor and manage your infrastructure. To summarize, they are:

(1) Real-time, live and continuous insights that you need to instantly recognize spontaneous and transient storms;
(2) Cross-silo visibility into interactional metrics that you need to help identify root causes of storms instead of just chasing the effects (a.k.a. consumptive metric alerts); and
(3) Linkage between performance and capacity management to appropriately add or reallocate infrastructure resources to mitigate or avoid future storms.

About Xangati

Xangati is the recognized leader for cloud and workload performance management solutions. Over 300 customers among enterprises, government agencies, healthcare organizations, educational systems and cloud providers use Xangati's solutions to gain unprecedented performance management of their massive, heterogeneous, consumer-scale cloud and VDI environments. Xangati's solutions, built on patented technology, proactively track the health of key IT metrics that impact the performance of applications and users, accurately diagnose the cause of any performance bottleneck and recommend remedial action when a bottleneck is discovered.
Organizations like eBay, Comcast, British Gas, Guess, Colliers International, Univita Health, DTCC, Harvard University and the US Army use the Xangati Management Dashboard suite of solutions, with its massively scalable live and continuous recording ability, to ensure their business-critical applications perform at optimal levels. Xangati is headquartered in Silicon Valley and can be found online at www.xangati.com.