Proactive Monitoring using Predictive Analytics



Solving the Real Problems with Performance Monitoring
True Proactive Monitoring using Predictive Analytics

Written by: Douglas A. Brown, MVP, CTP
President & Chief Technology Officer, DABCC, Inc.
www.dabcc.com

Executive Summary

In today's IT-based world, end users expect their business resources to be available and performing well 24/7. But when problems occur, administrators spend hours tracking down what happened, and the business as a whole suffers a loss of productivity, revenue, and customer satisfaction. This happens to every business. To prove my point, think about how many meetings your IT department holds that deal with troubleshooting. Do you attend one a week, two a week, or more? Are these meetings productive, or do they turn into finger-pointing sessions? How many include a manager, director, or someone higher who was brought in to make sure the problems go away? I would also point out that these meetings tend to be reactive and very rarely proactive. To say that this is a huge problem is an understatement.

To solve it, IT departments have invested heavily in performance monitoring solutions, ranging from free utilities such as Windows Performance Monitor to big-boy monitoring frameworks that spit out tons of information about every piece of hardware and software in an environment. I'm not going to bash these solutions, but I am going to ask a question: are they proactive? Do they inform you about a problem before it happens? Do they do the analytical correlation for you, or is that what your meetings are for? In your meetings, aren't you taking all the information these monitoring solutions produce and doing the analytics manually? If not, why are you still having these meetings? They are obviously not so you can pat each other on the back for a job well done.

In this white paper I will discuss, in more detail, the barriers to effective troubleshooting, along with a solution that addresses the problem in a proactive way, reducing the need for these meetings and, more importantly, reducing the cost to the organization.

The Problem

We all know that nothing is perfect, and today's enterprise-class IT systems exemplify this better than most. For years I've been preaching that a modern enterprise is made up of much more than just a Citrix server or a web server: it consists of routers, switches, load balancers, VPNs, workstations, laptops, and rack upon rack of physical servers that tend to be partitioned into many logical or virtualized Citrix servers, mail servers, domain controllers, application servers, and so on. This maze of logical and physical components has many interdependencies, and most problems arise from breakdowns between linked components. When one component experiences even the slightest issue, the other components in the chain may experience it too, resulting in the classic alert storm. More insidious is the case when the individual components are doing fine and yet the end-to-end system is slowed down or offline because of a problem in the handoffs between components. These issues only get worse as IT adds more servers, more virtual machines, more security, and more users.

IT has been fighting the performance and availability battle for years. Departments have spent countless dollars on monitoring solutions that do a great job of reporting the end user's experience (end-point analysis) and/or the health of a particular component, but most do nothing to correlate the data from the different components in the maze we collectively call an enterprise service or application. To address this we need a holistic view of the application environment and a solution that correlates this data. Better yet, we need a solution that will do this work in order to predict a problem before it occurs.

Part of the problem IT faces with existing monitoring solutions is deciding where to set thresholds. Most monitoring tools let you configure alerts for items such as CPU, memory, network latency, and just about every other aspect of an environment. However, the alert thresholds are static, they differ from environment to environment and component to component, and the level that indicates trouble depends on how busy the overall environment is. For years I've received emails asking for my recommendations on where to set the alert thresholds in Citrix's Resource Manager, and my response has always been, "it depends." The big problem with static alerts is that the IT administrator is being asked to define what's abnormal, while abnormal at 8:00 AM is completely different from abnormal at 3:00 PM or 2:00 AM. Because of this, administrators have to choose between being alerted to death or having systems die without any alert at all.

Think about it. At 8:00 AM, when everyone comes to work and logs on to the Citrix server or web-based application, the CPU is taxed far beyond what it is at 3:00 PM, the routers are taking a hit from the increased traffic, and the domain controllers are expected to authenticate a slew of users all at once. Because this behavior is perfectly normal for 8:00 AM, administrators tend to set alert thresholds a bit higher so they are not warned every day about what is, after all, business as usual. Consequently, a much lower reading at 3:00 PM or 2:00 AM that is actually abnormal, and an indication of a potential problem, will be missed. No matter how you look at it, static thresholds are a problem that needs to be addressed; the small sketch below makes the dilemma concrete.
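To illustrate the point, here is a minimal Python sketch comparing one static CPU threshold against a crude time-of-day expectation. The readings, the 85% cutoff, and the per-hour "typical" values are numbers I made up for the example; they are not vendor defaults or recommendations.

    # Illustrative only: why one static CPU threshold cannot fit every hour of the day.
    STATIC_THRESHOLD = 85  # percent CPU; a typical "set it high to stop the noise" value

    # (hour, cpu_percent) samples from a hypothetical Citrix server
    samples = [(8, 92), (8, 95), (15, 35), (2, 60)]

    # What the administrator actually knows about this box:
    # the 08:00 logon storm routinely pushes CPU past 90% (normal),
    # while 02:00 should be nearly idle, so 60% suggests a runaway process (abnormal).
    typical_by_hour = {8: 90, 15: 40, 2: 5}

    for hour, cpu in samples:
        static_alert = cpu > STATIC_THRESHOLD
        out_of_normal = cpu > typical_by_hour[hour] + 20  # crude time-aware check
        print(f"{hour:02d}:00  cpu={cpu}%  static_alert={static_alert}  out_of_normal={out_of_normal}")

    # The static rule alerts twice on the healthy 08:00 logon storm and stays silent
    # on the suspicious 02:00 reading; the time-aware check does the opposite.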

The Solution

The problems described above are just a few of the many issues we face when monitoring an IT environment. What we need is a solution that addresses them proactively, so that business processes are not affected by degraded service and/or downtime. I was recently introduced to an integrity management tool called Integrien Alive, and I'm very happy to say that Alive is truly much more than just another monitoring tool. It addresses the problems I detailed above and much more. Alive brings an intelligent, proactive tool to the market, not just another if/then conditional monitoring solution that requires the IT or Citrix administrator to spend hours learning to interpret the information the tool spits out, only to be overwhelmed by alerts. Alive, as its name suggests, is the smartest monitoring solution I've reviewed to date.

As shown in Figure 1, Alive lets you resolve problems before they occur. First, Alive learns what is normal and abnormal in your environment, which means you don't have to try to define the single threshold that will catch abnormal behavior, an impossible task because what's normal is constantly changing. Second, Alive looks for the time series of abnormal events that precedes a problem in order to warn you before the problem arrives. In the next section you will learn how Integrien Alive solves the problems I laid out and how it complements the existing monitoring tools you already have in place.

How Integrien Alive Works

As described above, there is much more to solving problems than just looking at one of the components. We need to look at the end-to-end set of components that together deliver a critical IT service. For example, in a Citrix world many components must interoperate as designed in order to provide the best possible experience to the end user: logical services such as the SQL Server holding the Citrix IMA data store, the XML Service that is critical for logging in through a Web Interface server, and the Active Directory servers responsible for authenticating users at login, as well as load balancers, IIS servers, routers, switches, firewalls, VPN software or hardware, and so on. When a problem occurs anywhere in the linked components or subsystems that together deliver a service, the end user is affected.

To understand what is really happening in an environment, you need a holistic view of all the components and of how each one affects the others during both normal and abnormal behavior. Integrien Alive can look at an environment as an IT service with an expected service level, from the top down. In addition, Alive presents the path and components underlying the service, giving users a view of both the end-to-end experience and the health and real-time metrics of each underlying component. By itself, this real-time view is a powerful capability for manually troubleshooting an issue; the sketch below illustrates the basic shape of that top-down view.
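The following is my own minimal sketch of the holistic, top-down idea: a hypothetical chain of Citrix delivery components whose health rolls up to a single service status. The component names, health values, and latencies are invented for illustration and do not reflect Alive's actual data model.

    # A minimal sketch of the "holistic service view" idea: model the chain of
    # components behind one IT service and roll their health up to a single status.
    from dataclasses import dataclass

    @dataclass
    class Component:
        name: str
        healthy: bool
        latency_ms: float  # latest measurement for this hop

    # The end-to-end path a Citrix session depends on, in delivery order (hypothetical).
    citrix_service = [
        Component("Web Interface (IIS)",  healthy=True,  latency_ms=40),
        Component("XML Service",          healthy=True,  latency_ms=15),
        Component("Active Directory",     healthy=True,  latency_ms=25),
        Component("SQL (IMA data store)", healthy=False, latency_ms=900),
        Component("XenApp server farm",   healthy=True,  latency_ms=60),
    ]

    def service_status(components):
        """The service is only as healthy as the weakest link in the chain."""
        broken = [c.name for c in components if not c.healthy]
        end_to_end_latency = sum(c.latency_ms for c in components)
        return ("DEGRADED" if broken else "OK", broken, end_to_end_latency)

    status, broken, latency = service_status(citrix_service)
    print(f"Citrix service: {status}, end-to-end latency {latency:.0f} ms, suspects: {broken}")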

In addition, a key insight behind Alive's patent-pending integrity management approach is that outages and slowdowns are preceded by a pattern of out-of-normal events that builds up to the problem. That leads us to Alive's Dynamic Thresholding and Problem Fingerprint engines.

Dynamic Thresholds

Alive's dynamic thresholding learns what is normal for a performance metric down to the day of the week and hour of the day, then looks in real time for metrics outside that normal range. The issue of administrators being bothered by meaningless violations, or setting static thresholds so high they are never bothered at all, goes away. Instead, dynamic threshold violations indicate out-of-normal behavior, which often enables the appropriate administrator to adjust the handoffs between linked components or to tune a component to deal with a resource constraint that is impacting the overall service level. Administrators also get trending information that is useful for capacity planning. But the biggest value of detecting out-of-normal behavior does not come from passing violations through to administrators; it comes from using the overall pattern as an indicator of a problem ahead.

Figure: Dynamic thresholding learns normal metric levels down to the hour of the day.
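As a rough sketch of what "learning normal by day of week and hour of day" could look like, the Python below buckets a metric's history by (weekday, hour) and flags readings outside a learned percentile band. It is a simplified stand-in, not Integrien's algorithm; the 5th-95th percentile band and the 20-sample minimum are arbitrary choices of mine.

    # Minimal sketch of dynamic thresholding: learn what is "normal" per
    # (day-of-week, hour-of-day) bucket and flag readings outside the learned band.
    # Percentile bands avoid any bell-curve assumption; Alive's analytics are far richer.
    from collections import defaultdict
    from datetime import datetime
    from statistics import quantiles

    history = defaultdict(list)   # (weekday, hour) -> observed metric values

    def learn(timestamp: datetime, value: float):
        """Record a historical sample in its day-of-week / hour-of-day bucket."""
        history[(timestamp.weekday(), timestamp.hour)].append(value)

    def is_out_of_normal(timestamp: datetime, value: float, min_samples: int = 20) -> bool:
        """Flag a reading outside the learned 5th-95th percentile band for its bucket."""
        bucket = history[(timestamp.weekday(), timestamp.hour)]
        if len(bucket) < min_samples:
            return False                 # not enough history yet to call anything abnormal
        cuts = quantiles(bucket, n=20)   # 5%, 10%, ..., 95% cut points
        low, high = cuts[0], cuts[-1]
        return value < low or value > high

    # Usage: feed a few weeks of samples through learn(), then test live readings.
    # A 90% CPU reading on Monday at 08:00 may fall inside its band and stay quiet,
    # while the same 90% on Monday at 02:00 falls far outside its band and is flagged.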

Dynamic Thresholding learns what is normal by tracking and analyzing the performance history of each individual metric. To accomplish this, Alive applies elements of chaos theory to ensure that the performance patterns in an environment can be recognized despite any noise introduced into the data by intermittent sampling. In addition, Alive does not make the erroneous assumption that IT performance data always follows a normal, bell-shaped distribution. Instead, Alive applies a variety of analytical techniques, including:

- Linear behavior analysis
- Logarithmic behavior analysis
- Distribution-independent non-linear behavior analysis
- Rate-of-change analysis
- Pattern-change analysis (including frequency and fractal analysis)

Don't worry, though: you won't have to go back to college for a degree in advanced statistical analysis. You won't be asked how many standard deviations you care about, although you will be able to tune sensitivity. The idea is that all of this rocket-science math working in the background makes life simpler for you by automating the process of detecting when your environment is starting to go south. The analytics in this module and in Problem Fingerprints are truly the brains behind the solution.

Fingerprints

When you first install Alive, it spends a few weeks learning your system's performance metrics, in terms of what is normal and, more importantly, abnormal, and tracking any problems as they occur. Once enough data points make the statistics meaningful, Alive establishes dynamic thresholds and tracks and stores what Integrien calls Problem Fingerprints. A Problem Fingerprint is the time series of out-of-normal events that builds up to a problem. Just as everyone's fingerprint is unique, so is every class of IT problem.

The first time a class of problem occurs, Alive's Problem Fingerprint lays out the application-specific evidence leading up to it for the team charged with resolving the issue. Contrast that neat package of pre-correlated data with having individual admins on a bridge call trying to figure out which components are involved and then manually correlating what they see in the associated reports, element manager screens, and log files. Once the evidence in the Problem Fingerprint takes the problem resolution team back in time to the first few abnormal events leading to the problem, and clarifies how those events affected linked components, conclusions on the cause and resolution tend to be reached rapidly. Your recommended action is then associated with the Problem Fingerprint.

The next time that class of problem starts to develop, Alive's real-time comparison of incoming event patterns to all of the patterns stored in the Problem Fingerprint library triggers a match, often on the first arriving ridges, the initial out-of-normal events that compose the Fingerprint. Alive then issues an alert to the appropriate admin predicting the time and location of the problem, the events associated with it, and your recommended resolution. A minimal sketch of this matching idea appears below.
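To illustrate the matching idea, here is a deliberately simplified sketch in which a stored fingerprint is an ordered list of out-of-normal events and an incoming event stream is scored against it as the first ridges arrive. The event names, the 50% match threshold, and the scoring are all hypothetical; Alive's real-time pattern matching is far more sophisticated.

    # Minimal sketch of fingerprint matching: each stored problem signature is an
    # ordered series of out-of-normal events; incoming events are matched in order.
    PROBLEM_FINGERPRINTS = {
        "Zombie Citrix server": {
            "pattern": ["server_load_drops_to_zero",
                        "logins_pile_onto_one_server",
                        "session_launch_failures"],
            "resolution": "Remove the server from the load balancer and restart IMA.",
        },
        "Faulty switch": {
            "pattern": ["switch_latency_out_of_normal",
                        "published_app_response_degrades",
                        "user_logins_slow"],
            "resolution": "Fail traffic over to the standby switch.",
        },
    }

    def match_fingerprints(incoming_events, min_fraction=0.5):
        """Return fingerprints whose leading events already appear, in order,
        in the incoming stream of out-of-normal events."""
        alerts = []
        for name, fp in PROBLEM_FINGERPRINTS.items():
            matched, position = 0, 0
            for event in incoming_events:
                if position < len(fp["pattern"]) and event == fp["pattern"][position]:
                    matched += 1
                    position += 1
            fraction = matched / len(fp["pattern"])
            if fraction >= min_fraction:
                alerts.append((name, fraction, fp["resolution"]))
        return alerts

    # Two early ridges of the "zombie server" pattern are enough to raise a prediction.
    print(match_fingerprints(["server_load_drops_to_zero", "logins_pile_onto_one_server"]))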

For example, there is a characteristic pattern of out-of-normal metrics that arrives in a time series when the Citrix load-balancing solution sends all new users to a zombie server because of its low load. Or a problem with a faulty switch might show up first as out-of-normal latency, followed by declining performance of all the published applications that use that switch. Each problem has a unique fingerprint, and when that fingerprint is recognized, the silo administrators responsible for supporting the application can see the big picture and act in a timely, proactive manner.

Figure: "SAP tablespace lockup problem might impact all SAP users (62.8% probability) in 30 minutes. Service desk guidance: have SAP admin double work processes and Oracle DBA clear SAP tablespace." Problem Fingerprint alerts provide the location, symptoms, and ETA of the problem, along with your resolution plan.

Fingerprints are recorded automatically for critical abnormalities such as an authentication failure, an SLA violation, or the crossing of a domain-specific metric limit that indicates a meltdown has already occurred. Additionally, no one can define what constitutes a problem better than the owner of a domain, so Fingerprinting taps that expertise by allowing guided Fingerprint creation for a specific IT service.

An Example

The following are just a few commonly known problems in a Citrix environment whose repercussions Alive's Problem Fingerprints would pick up, notifying the appropriate administrator before they became significant (a minimal sketch of one such early-indicator check follows the list):

- Application memory leak (memory not released until a session is terminated): the early indicator is an increase in memory usage without an increase in concurrent users.
- Worm or virus attack: early indicators are an increase in CPU, I/O, or disk read/write activity without an increase in concurrent users.
- Unintended change in Active Directory or elsewhere in the infrastructure: the early indication is an increase in login time.
- Black-hole problem: early indications are an increase in logins to the same Citrix server along with an increase in traffic.
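As promised above, here is a minimal sketch of the memory-leak early indicator from the list: memory climbing over a window while the concurrent-user count stays flat. The metric names, the 30% growth cutoff, and the 5% user-growth tolerance are illustrative values of mine, not product defaults.

    # Hypothetical early-indicator check for the memory-leak case above:
    # memory usage grows while concurrent users do not.
    def leak_suspected(samples, growth_cutoff=0.30):
        """samples: chronological list of (memory_mb, concurrent_users) tuples."""
        first_mem, first_users = samples[0]
        last_mem, last_users = samples[-1]
        mem_growth = (last_mem - first_mem) / first_mem
        user_growth = (last_users - first_users) / max(first_users, 1)
        return mem_growth > growth_cutoff and user_growth <= 0.05

    # Memory climbs roughly 50% over the window while the user count barely moves.
    window = [(2000, 120), (2300, 118), (2700, 121), (3000, 119)]
    print(leak_suspected(window))   # True -> warn the Citrix admin before sessions crash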

The Benefits

The benefits of this technology are clearly illustrated in the before-and-after view of the problem management process shown in the figure below.

Figure: Problem resolution before and after Alive.

To sum it up, the benefits are:

- First problem occurrence: significantly reduced duration of impact, thanks to informed detective work around the time series of out-of-normal events.
- Recurrences: near-zero duration of impact, thanks to problem prediction.
- Troubleshooting: significantly reduced IT labor and disruption.
- Bottom-line impact: what is your cost of slowdowns and downtime?

Alive's sophisticated analytics engine even learns and refines the Fingerprint model with each recurrence, throwing out incidental data in order to statistically hone the Fingerprint's accuracy. As a result, the time Alive needs to recognize the first arriving ridges of a Problem Fingerprint improves over time.

Complementary Nature of Existing Monitoring Tools with Alive

As you can see in the Before and After figure above, Alive can take in performance data from existing tools and pass events and alerts to an existing problem management process. For example, Integrien customers who want to get the most from their Tivoli tools can use Integrien's Common Base Event adaptor to pass Alive Fingerprint Alerts and other events, in the standard format, to the IBM Tivoli Enterprise Console, making that environment predictive for the first time. On the intake side, Integrien Alive's advanced analytics can integrate with performance metrics captured by IBM Tivoli Monitoring 6.1 to add integrity management prediction and prevention capabilities. Similarly, performance data captured by the new Citrix EdgeSight monitoring tool could be imported through Alive's API to inform the underlying analytics. The sketch below shows the general shape of such an integration.
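As a purely illustrative sketch of the integration idea, the Python below normalizes a predictive fingerprint alert into a neutral event record and hands it to a stand-in forwarding function. The field names and the forward_event() stub are hypothetical; they do not represent Integrien's Common Base Event adaptor, the Tivoli Enterprise Console interface, or any EdgeSight API.

    # Hedged sketch only: package a predictive alert in a console-agnostic format
    # and hand it to whatever transport the downstream console actually expects.
    import json
    from datetime import datetime, timedelta

    def to_generic_event(fingerprint_name, probability, eta_minutes, resolution):
        """Build a neutral event record from a predicted Problem Fingerprint match."""
        return {
            "source": "predictive-monitor",
            "severity": "warning",
            "summary": f"{fingerprint_name} predicted ({probability:.0%} probability)",
            "expected_at": (datetime.now() + timedelta(minutes=eta_minutes)).isoformat(),
            "recommended_action": resolution,
        }

    def forward_event(event: dict):
        """Stand-in for the real adaptor; here it just prints the payload."""
        print(json.dumps(event, indent=2))

    forward_event(to_generic_event(
        "SAP tablespace lockup", 0.628, 30,
        "Have SAP admin double work processes and Oracle DBA clear SAP tablespace."))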

Summary

To summarize, Integrien Alive gives the IT department the ability to identify and correct problems proactively. Alive's Fingerprinting technology warns an administrator before a problem has time to manifest itself as degraded performance and/or downtime for the end-user base. With Alive's script library, it can automatically take action against a pending problem and, in many cases, apply a solution before the problem has time to affect end users. All in all, with Integrien Alive, IT has the ability to take back control of the IT infrastructure.

Additional Resources

To learn more about the problems we face and the solution, Integrien Alive, please refer to the following resources:

Integrien Alive web site: http://www.integrien.com/alive.cfm
DABCC Integrien Alive industry news and resources page: http://www.dabcc.com/ialive

About the Author

Douglas A. Brown (DABCC, Inc.)

DABCC specializes in the design and development of techniques, methodologies, authoring, education, training, outsourcing, and software products that add immediate value to the server-based computing and on-demand virtual application computing world. The company was formed in 2004 to reduce complexity in corporate computing by developing and teaching a proven methodology for providing seamless, real-time access to strategic company information.

Douglas Brown worked at Citrix Systems, Inc. as a Senior Systems Engineer from 2001 to 2004, during which time he was voted Systems Engineer of the Year 2002 by his peers and management at Citrix. He was awarded the Microsoft MVP (Most Valuable Professional) designation by Microsoft Corporation in 2005 and 2006 for his contributions to the industry. He was also a charter award winner in the Citrix Technology Professional (CTP) program for his continued support of the Citrix community. Mr. Brown has earned worldwide recognition for his dedication to providing server-based computing professionals with proven solutions for implementation, infrastructure design, time-saving utilities, performance tips, and best practices. DABCC.com is one of the most frequently visited sites internationally for server-based computing information and networking opportunities.