Solving the Real Problems with Performance Monitoring
True Proactive Monitoring using Predictive Analytics

Written by: Douglas A. Brown, MVP, CTP
President & Chief Technology Officer, DABCC, Inc.
www.dabcc.com
Executive Summary

In today's IT-based world, end users want their business resources to be available and performing well 24/7. But when problems occur, administrators spend hours tracking down what happened. As a result, the business as a whole suffers a loss of productivity, revenue, and customer satisfaction. This happens to every business. To prove my point, think about how many meetings the IT department holds that deal with problem troubleshooting. Do you attend one a week, two a week, or more? Are these meetings productive, or do they turn into finger-pointing sessions? How many include a manager, director, or higher who was brought in to make sure the problems go away? I would also point out that these meetings tend to be reactive and very rarely proactive. In fact, to say that this is a huge problem is an understatement.

To solve this problem, IT departments have invested heavily in performance monitoring solutions, ranging from free utilities such as Windows Performance Monitor to big-boy monitoring frameworks that spit out tons of information about each piece of hardware and software in an environment. I'm not going to bash these types of solutions, but I am going to ask a question: are they proactive? Do they inform you about a problem before it happens? Do they do the analytical correlation for you, or is that what your meetings are for? In your meetings, aren't you taking all the information these monitoring solutions spit out and doing the analytics manually? If not, then why are you still having these meetings? They are obviously not so you can pat each other on the back for a job well done.
In this white paper I will discuss, in more detail, the barriers to effective troubleshooting, along with a solution that addresses the problem in a proactive way, reducing the need for meetings and, more importantly, reducing the cost to the organization.
The Problem

We all know that nothing is perfect, and today's enterprise-class IT systems exemplify this better than most. For years I've been preaching that a modern enterprise is made up of much more than just a Citrix server or a web server: it consists of routers, switches, load balancers, VPNs, workstations, laptops, and rack upon rack of physical servers that tend to be partitioned into many logical or virtualized Citrix servers, mail servers, domain controllers, application servers, and so on. This maze of logical and physical components has many interdependencies, and most problems arise from breakdowns between linked components of this mix of technology. When one component experiences issues, even the slightest problem, the other components in the chain may experience them too, resulting in the classic alert storm. More insidious is the case when the individual components are doing fine and yet the end-to-end system is slowed down or offline because of a problem in the handoffs between components. These issues only get worse as IT adds more servers, more virtual machines, more security, and more users.

IT has been fighting the performance and availability battle for years. It has spent countless dollars on monitoring solutions that do a great job of reporting the end user's experience (endpoint analysis) and/or the health of a particular component, but most do nothing to correlate the data from the different components in the maze that we collectively call an enterprise service or application. To address this, we need a holistic view of the application environment and a solution that correlates this data. Better yet, we need a solution that will do this work in order to predict a problem before it occurs. Part of the problem IT faces with its existing monitoring solutions is deciding where to set thresholds.
Most monitoring tools allow you to configure alerts for items such as CPU, memory, network latency, and nearly every other aspect of an environment. However, the alert thresholds are static, while the level that actually indicates trouble differs from environment to environment and component to component, and varies with how busy the overall environment is. For years I've received emails asking for my recommendations on where to set the alert thresholds in Citrix's Resource Manager, and my response has always been, "it depends." The big problem with static alerts is that the IT administrator is being asked to define what's abnormal, while abnormal at 8:00 AM is completely different from abnormal at 3:00 PM or 2:00 AM. Because of this, administrators have to choose between being alerted to death or having systems die without being alerted.

Think about it. At 8:00 AM, when everyone comes to work and logs on to the Citrix server or web-based application, the CPU is taxed way beyond what it is at 3:00 PM, the routers are taking a hit from the increased traffic, and the domain controllers are expected to authenticate a slew of users all at the same time. Because this behavior at 8:00 AM is really normal for that time of day, administrators tend to set alert thresholds a bit higher so they are not warned every day about what is, after all, perfectly normal. Consequently, a much lower reading at 3:00 PM or 2:00 AM that is actually abnormal, and an indication of a potential problem, will be missed. No matter how you look at it, static thresholds are a problem that needs to be addressed.
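To make the static-threshold problem concrete, here is a minimal Python sketch (my own illustration, not anything from a real monitoring product; all numbers are hypothetical). A single static CPU threshold, set high enough to survive the 8:00 AM rush, stays silent at 3:00 PM even when the reading is double that hour's norm:

```python
# Illustrative sketch only: one static CPU threshold vs. a naive
# time-of-day-aware check, using made-up numbers.

STATIC_THRESHOLD = 85  # percent CPU, tuned high to survive the 8 AM rush

# (hour, observed cpu %, typical cpu % for that hour)
readings = [
    (8, 80, 75),   # morning logon storm: high, but normal for 8 AM
    (15, 60, 30),  # afternoon: double the usual load -> genuine trouble
]

results = []
for hour, cpu, typical in readings:
    static_alert = cpu > STATIC_THRESHOLD   # fires in neither case
    out_of_normal = cpu > typical * 1.5     # compares against that hour's norm
    results.append((hour, static_alert, out_of_normal))

for hour, static_alert, out_of_normal in results:
    print(f"{hour:02d}:00  static_alert={static_alert}  out_of_normal={out_of_normal}")
```

The static check misses both readings, while even this crude hour-aware comparison flags the 3:00 PM reading as trouble.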
The Solution

The problems defined above are just a few of the many issues we face when monitoring an IT environment. What we need is a solution that addresses these problems proactively, so that business processes are not affected by degraded performance and/or downtime. I was recently introduced to an integrity management tool called Integrien Alive. I'm very happy to say that Alive is truly much more than just another monitoring tool. It has the ability to address the problems I detailed above and much more. Alive brings an intelligent, proactive tool to the market, not just another if/then conditional monitoring solution that requires the IT or Citrix administrator to spend hours learning to understand the information the tool is spitting out, only to be overwhelmed by alerts. Alive, as its name suggests, is the smartest monitoring solution I've reviewed to date.

As shown in Figure 1, Alive allows you to resolve problems before they occur. First, Alive learns what is normal and abnormal in your environment. This means you don't have to try to define the single threshold that will catch abnormal behavior, an impossible task because what's normal is constantly changing. Second, Alive looks for the time series of abnormal events that precede problems in order to warn you before a problem arrives. In the next section you will learn how Integrien Alive solves the problems I laid out and how it complements the existing monitoring tools you already have in place.

How Integrien Alive Works

As described above, there is much more to solving problems than just looking at one of the components. We need to look at the end-to-end set of components that together deliver a critical IT service.
For example, in a Citrix world many components must interoperate as designed in order to provide the best possible experience to the end user, including logical services such as the SQL Server holding the Citrix IMA data store, the XML Service that is critical for logging in through a Web Interface server, and the Active Directory servers responsible for authenticating users at login, as well as load balancers, IIS servers, routers, switches, firewalls, and VPN software or hardware. When a problem occurs in the linked components or subsystems that together deliver a service, the end user is affected. To gain an understanding of what is really happening in an environment, you need a holistic view of all the components and how each one affects the others during both normal and abnormal behavior.

Integrien Alive can look at an environment as an IT service with an expected service level, from the top down. In addition, Alive presents the path and components underlying the service, giving users a view of both the end-to-end experience and the health and real-time metrics of each underlying component. In itself, this real-time view is a powerful capability when it comes to manually troubleshooting an issue.
In addition, a key insight behind Alive's patent-pending integrity management approach is that outages and slowdowns are preceded by a pattern of out-of-normal events that builds up to the problem. That leads us to Alive's Dynamic Thresholding and Problem Fingerprint engines.

Dynamic Thresholds

Alive's dynamic thresholding learns what's normal for a performance metric down to the day of the week and hour of the day, then looks in real time for metrics outside that normal range. The issue of administrators being bothered by meaningless violations, or setting static thresholds so high they are never bothered, goes away. Instead, dynamic threshold violations indicate out-of-normal behavior, which often enables the appropriate administrator to adjust the handoffs between linked components, or to tune components to deal with a resource constraint that is impacting the overall service level. Administrators also get trending information useful for capacity planning. But the biggest value of detecting out-of-normal behavior does not come from passing violations through to administrators, but rather from using the overall pattern as an indicator of a problem ahead.

Figure: Dynamic thresholding learns normal metric levels down to the hour of the day.
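The idea of learning a per-hour baseline can be sketched in a few lines of Python. This is a toy illustration of the concept only; Alive's actual analytics (described in the next section) go well beyond a simple mean-and-deviation model, and all the metric values here are hypothetical:

```python
# Toy sketch of hour-of-day dynamic thresholding: learn a mean and spread
# per hour of the day, then flag readings outside that hour's learned band.
from statistics import mean, stdev

def learn_baselines(history):
    """history: list of (hour, value). Returns {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(vs), stdev(vs)) for h, vs in by_hour.items()}

def is_abnormal(baselines, hour, value, k=2.0):
    """Out of normal if more than k standard deviations from that hour's mean."""
    mu, sigma = baselines[hour]
    return abs(value - mu) > k * sigma

# Hypothetical CPU history: 8 AM runs hot, 2 AM runs quiet
history = [(8, v) for v in (70, 75, 72, 78)] + [(2, v) for v in (5, 6, 4, 5)]
baselines = learn_baselines(history)

print(is_abnormal(baselines, 8, 74))  # high reading, but normal for 8 AM
print(is_abnormal(baselines, 2, 30))  # far lower reading, yet alarming at 2 AM
```

Note how the 2:00 AM reading of 30% triggers while the 8:00 AM reading of 74% does not, which is exactly the behavior no single static threshold can deliver.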
Dynamic Thresholding actually learns what is normal by tracking and analyzing the performance history of each individual metric. To accomplish this, Alive applies elements of chaos theory to ensure that the performance patterns in an environment can be recognized despite any noise in the data that might be introduced by intermittent sampling. In addition, Alive does not make the erroneous assumption that IT performance data always follows a normal, bell-shaped distribution. Instead, Alive applies a variety of analytical techniques, including:

- Linear behavior analysis
- Logarithmic behavior analysis
- Distribution-independent non-linear behavior analysis
- Rate-of-change analysis
- Pattern-change analysis (including frequency and fractal analysis)

Don't worry, though. You don't have to go back to college to earn a degree in advanced statistical analysis. You won't be asked how many standard deviations you care about, although you will be able to tune sensitivity. The idea is that all of this rocket-science math working in the background makes life simpler for you by automating the process of detecting when your environment is starting to go south. The analytics in this module and Problem Fingerprints are truly the brains behind the solution.

Fingerprints

When you first install Alive, it spends a few weeks learning your system's performance metrics, in terms of what is normal and, more importantly, abnormal, and tracks any problems as they occur. Once enough data points make the statistics meaningful, Alive establishes dynamic thresholds and tracks and stores what Integrien calls Problem Fingerprints. A Problem Fingerprint is the time series of out-of-normal events that builds up to a problem. Just as everyone's fingerprint is unique, so is every class of IT problem. The first time a class of problem occurs, Alive's Problem Fingerprint lays out the application-specific evidence leading up to it for the team charged with resolving the issue.
Contrast that neat package of pre-correlated data with having individual admins on a bridge call trying to figure out which components are involved, then manually correlating what they're seeing in the associated reports, element manager screens, and log files. Once the evidence in the Problem Fingerprint takes the problem-resolution team back in time to the first few abnormal events leading to the problem, and clarifies how those events impact linked components, conclusions on the cause and resolution tend to be reached rapidly. Your recommended action is then associated with the Problem Fingerprint. The next time that class of problem starts to develop, Alive's real-time comparison of incoming event patterns against all of the patterns stored in the Problem Fingerprint library will trigger a match, often on the first arriving "ridges," or out-of-normal events, that compose the Fingerprint. Alive then issues an alert to the appropriate admin predicting the time and location of the problem occurrence, the events associated with it, and your recommended resolution.
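To give a feel for the matching idea, here is a deliberately simplified Python sketch. Alive's fingerprint engine is proprietary, so everything below, including the event names and the library structure, is my own hypothetical illustration: a "fingerprint" is modeled as an ordered list of out-of-normal events, and a match fires as soon as the leading events of a stored fingerprint appear, in order, in the incoming stream:

```python
# Hypothetical sketch of fingerprint matching (not Alive's actual engine):
# match when the first few events of a stored signature appear, in order,
# within the incoming out-of-normal event stream.

# Made-up fingerprint library: problem name -> ordered event signature
library = {
    "sql_datastore_stall": ["db_latency_high", "ima_queue_growth", "logon_slow"],
    "switch_fault": ["net_latency_high", "app_response_slow"],
}

def match_fingerprint(incoming, library, min_prefix=2):
    """Return problems whose first min_prefix events occur, in order, in incoming."""
    hits = []
    for name, signature in library.items():
        it = iter(incoming)
        # ordered-subsequence check: each 'ev in it' consumes the iterator
        if all(ev in it for ev in signature[:min_prefix]):
            hits.append(name)
    return hits

events = ["db_latency_high", "cpu_blip", "ima_queue_growth"]
print(match_fingerprint(events, library))  # ['sql_datastore_stall']
```

Because only the leading "ridges" of the signature are required, the match can fire before the final events, and thus the outage itself, ever arrive.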
For example, there is a characteristic pattern of out-of-normal metrics that arrives in a time series when the Citrix load-balancing solution sends all new users to a zombie server because of its low load. Or a problem with a faulty switch might show up first as out-of-normal latency, followed by declining performance of all the published applications that use that switch. Each problem has a unique fingerprint and, when that fingerprint is recognized, the silo administrators responsible for supporting the application can see the big picture and act in a timely, proactive manner. A Fingerprint alert might read: "SAP tablespace lockup problem might impact all SAP users (62.8% probability) in 30 minutes. Service desk guidance: have SAP admin double work processes and Oracle DBA clear SAP tablespace."

Figure: Problem Fingerprint alerts provide the location, symptoms, and ETA of the problem, along with your resolution plan.

Fingerprints are recorded automatically for critical abnormalities such as an authentication failure, an SLA violation, or crossing a domain-specific metric limit that indicates a meltdown has already occurred. Additionally, no one can better define what constitutes a problem than the owner of a domain, and the power of Fingerprinting taps that expertise by allowing guided Fingerprint creation for a specific IT service.

An Example

The following are just some commonly known problems in a Citrix environment whose repercussions Alive's Problem Fingerprints would pick up, notifying the appropriate administrator before they became significant:

- Application memory leak (not releasing memory until a session is terminated): early indicators are an increase in memory usage without an increase in concurrent users.
- Worm or virus attack: early indicators are an increase in CPU, I/O, or disk read/write activity without an increase in concurrent users.
- Unintended change in AD or elsewhere in the infrastructure: early indication is an increase in login time.
- Black-hole problem: early indications are an increase in logins to the same Citrix server along with an increase in traffic.
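The first indicator in the list above, rising memory without rising concurrency, is easy to express as a check. The following Python sketch is my own hypothetical illustration of that single indicator, with made-up thresholds, not how Alive actually encodes it:

```python
# Hypothetical early-warning check for the memory-leak indicator described
# above: memory climbing steadily while concurrent user counts stay flat.

def leak_suspected(samples, mem_growth=1.2, user_growth=1.05):
    """samples: chronological list of (memory_mb, concurrent_users).

    Returns True when memory grew more than mem_growth-fold while the user
    count stayed within user_growth-fold of its starting value.
    """
    first_mem, first_users = samples[0]
    last_mem, last_users = samples[-1]
    mem_up = last_mem > first_mem * mem_growth             # memory clearly rising
    users_flat = last_users <= first_users * user_growth   # users roughly flat
    return mem_up and users_flat

# Memory grows 50% while the user count barely moves: classic leak signature
print(leak_suspected([(2000, 100), (2400, 101), (3000, 102)]))  # True
```

The same shape of check, with different metrics, covers the worm/virus indicator (CPU, I/O, or disk activity up without a matching rise in users).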
The Benefits

The benefits of this technology are clearly illustrated in the before-and-after view of the problem management process shown in the figure below.

Figure: Problem Resolution Before and After Alive
To sum it up, the benefits are:

- First problem occurrence: significantly reduced duration of impact, thanks to informed detective work around the time series of out-of-normal events.
- Recurrences: near-zero duration of impact, thanks to problem prediction.
- Troubleshooting: significantly reduced IT labor and disruption.
- Bottom-line impact: what is your cost of slowdowns and downtime?

Alive's sophisticated analytics engine even learns and refines the Fingerprint model with each recurrence, throwing out incidental data in order to statistically hone the Fingerprint's viability. As a result, the time Alive needs to recognize the first arriving ridges of a Problem Fingerprint improves over time.

Complementary Nature of Existing Monitoring Tools with Alive

As you can see in the Before and After figure above, Alive can take in performance data from existing tools and pass events and alerts to an existing problem management process. For example, Integrien customers who wish to get the most from Tivoli tools can use Integrien's Common Base Event adapter to take advantage of Alive Fingerprint alerts and other events, which it passes in the standard format to the IBM Tivoli Enterprise Console, making that environment predictive for the first time. On the intake side, Integrien Alive's advanced analytics can integrate with performance metrics captured by IBM Tivoli Monitoring 6.1 to add integrity management prediction and prevention capabilities. Similarly, performance data captured by the new Citrix EdgeSight monitoring tool could be imported through Alive's API to inform the underlying analytics.
Summary

To summarize, we have learned that Integrien Alive gives the IT department the ability to identify and correct problems proactively. Alive's Fingerprinting technology warns an administrator before a problem has time to manifest itself as degraded performance and/or downtime for the end-user base. With Alive's script library, it can automatically take action against a pending problem and, in many cases, apply a solution before the problem has time to affect end users. All in all, with Integrien Alive, IT has the ability to take back control of the IT infrastructure.
Additional Resources

To learn more about the problems we face and the solution, Integrien Alive, please refer to the following resources:

- Integrien Alive web site: http://www.integrien.com/alive.cfm
- DABCC Integrien Alive industry news and resources page: http://www.dabcc.com/ialive
About the Author

Douglas A. Brown (DABCC, Inc.)

DABCC specializes in the design and development of techniques, methodologies, authoring, education, training, outsourcing, and software products that add immediate value to the server-based and on-demand virtual application computing world. The company was formed in 2004 to reduce complexity in corporate computing by developing and teaching a proven methodology for providing seamless, real-time access to strategic company information.

Douglas Brown worked at Citrix Systems, Inc. as a Senior Systems Engineer from 2001 to 2004, during which time he was voted Systems Engineer of the Year 2002 by his peers and management at Citrix. He was awarded the Microsoft MVP (Most Valuable Professional) award by Microsoft Corporation in 2005 and 2006 for his contributions to the industry, and was a charter award winner in the Citrix Technology Professional (CTP) program for his continued support of the Citrix community. Mr. Brown has earned worldwide recognition for his dedication to providing server-based computing professionals with proven solutions for implementation, infrastructure design, time-saving utilities, performance tips, and best practices. DABCC.com is one of the most frequently visited sites internationally for server-based computing information and networking opportunities.