WHITE PAPER
Customer Evaluation Report On Incident.MOOG
(Real Data Provided by a Fortune 100 Company)
For information about Moogsoft and Incident.MOOG, visit www.moogsoft.com.
http://moogsoft.com 2011-2015 Moogsoft Inc. All rights reserved.
Synopsis:
A Fortune 100 Company (the Customer) conducted an extensive evaluation of Incident.MOOG, a next-generation manager of managers (MoM) providing early warning and collaborative remediation for IT incidents. Within 30 working days, Incident.MOOG had transformed the Customer's IT operations command center to operate faster and more effectively. Here's a summary of the key results:

Key Metrics                                                          Before                         After        % of Improvement
Average Managed Events Per Day                                       ~115,000,000                   ~1,000,000   99.1%
Real Situations, i.e. Cleaned, Clustered and Contextualized Alerts   Almost zero rolled up alerts   ~250         99.998%
Time it took to create corresponding actionable tickets              >=24 hours                     1-2 hours    95.8%

This report summarizes:
- The benefits provided by Incident.MOOG
- The Customer's data center environment, processes and tools before and after the evaluation, and where and how Incident.MOOG fitted in
- The scope and procedure of the evaluation.
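The "% of Improvement" column can be checked with simple arithmetic. The sketch below reproduces the two rows that have a numeric value on both sides. It assumes that ">=24 hours" is read as 24 and "1-2 hours" as 1 (the lower bound); the function name is illustrative and not part of Incident.MOOG.

```python
def percent_improvement(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Values taken from the Key Metrics table.
events_per_day = percent_improvement(115_000_000, 1_000_000)

# Assumption: ">=24 hours" is read as 24 and "1-2 hours" as 1.
ticket_time = percent_improvement(24, 1)

print(f"Average managed events per day: {events_per_day:.1f}%")
print(f"Time to create actionable tickets: {ticket_time:.1f}%")
```

The Situations row is left out because its "Before" value ("almost zero rolled up alerts") is not a number.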
Executive Summary
Moogsoft engaged with the Customer over a 30-day period to demonstrate the value of Incident.MOOG vs. existing fault and incident management tools and processes.

Definitions of words used in this document
Incident: An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
Event: A change of state that has significance for the management of a Configuration Item or IT Service.
Alert: A unique warning message indicating that a threshold has been reached, something has changed, or a Failure has occurred. Alerts are often created and managed by System Management tools and are managed by the Event Management Process.
Issue: A fault or impact behavior that is actionable but has not yet been categorized as an Incident.
Situation: The indication of some form of issue, as a cluster of related alerts that should be acted upon by appropriate stakeholders. The cluster of alerts may contain Causal Indicators and Collateral Indicators. Situation stakeholders are the Service Owners whose Alerts are clustered within a given Situation, e.g. Storage, Database and Application. A Situation may represent an Incident or the impact of an Incident.
Causal Indicator: An Alert or set of Alerts indicating the source of a fault or failure of some kind.
Collateral Indicator: An Alert or set of Alerts indicating the collateral damage or impact of a service disruption.

What is Incident.MOOG?
Created by the inventors of IBM Tivoli Netcool, Incident.MOOG is a next-generation manager of managers (MoM), providing an incident early warning and collaborative remediation platform for IT Ops and DevOps teams. It uses machine learning to reduce the number of incidents that IT Ops and DevOps teams must handle. It then applies social collaborative technologies to enable cross-domain teams to work together and remediate problems faster.
Incident.MOOG is designed to:
Clean: Automatically remove event noise and detect anomalies from input events and log sources in real time, without relying on rules or an up-to-date Configuration Management Database (CMDB).
Contextualize: Automatically cluster the resulting anomalies into Situations, each a set of related alerts with indicators of causes and impacts, allowing IT operational teams to diagnose the condition more quickly.
Collaborate: Automatically host a single-pane-of-glass situation room, which invites the relevant experts to triage the same Situations and orchestrates the entire remediation process.

What were the goals of this evaluation?
To demonstrate that Incident.MOOG can substantially and automatically reduce the time and effort IT operational staff spend on troubleshooting issues:
1. Substantially and automatically reduce the total actionable work by Network Operations Center (NOC) operators and service owners;
2. Detect all of the Situations that are identified by the existing tools and processes;
3. Detect real actionable Situations not indicated by the existing tools and processes;
4. Detect Situations either at the same time as or earlier than the existing tools and processes.

What did the Customer environment look like?
For this evaluation, over 30 working days, the live production environment comprised 2 large data centers and five customer event feeds:
- Application Performance Monitoring (APM)
- Systems Performance Dashboard
- Oracle Enterprise Manager
- Java Virtual Machine (JVM) Application Logs
- Server / OS Syslog.

TABLE 1: Event Sources
Data Source                                              Count of Distinct Devices (Physical or Virtual)
Application Performance Management (APM) Availability    3,172
Systems Performance Dashboard (SPD)                      3,531
Oracle Enterprise Manager (OEM)                          511
Java Virtual Machine App Logs (SystemErr.log)            171
Server/OS syslog                                         1,747
Total Distinct Entities                                  9,132

In total, these event sources produce approximately 115 million events per day.

What were the results of this evaluation?
During the evaluation process, Incident.MOOG:
- Automatically reduced the number of managed Events by 99.998%;
- Created actionable Situations from these events for IT Ops and DevOps teams to better manage incidents (without rules, topology or historic behavior to determine anomalies);
- Didn't miss any incidents or problems: zero false negatives;
- Detected Situations, in some cases more than 24 hours before the corresponding Incident tickets were created;
- Automatically identified that >30% of the Situations were recurring issues.
Additional Details about the Results
More findings during the measurement processes were as follows:
- 100% of the Moogsoft Situations that were assessed reflected genuine Fault, Collateral or Impact conditions.
- 90.7% of the assessed Situations were flagged as actionable. The remaining 9.3% of Situations represented recurring or transient issues. There were no non-actionable Situations.
- Of the actionable Situations assessed during the measurement period, 23% were recorded as Incidents with corresponding tickets created by the existing processes. The remaining 77% reflect Incident.MOOG's ability to detect important lower severity, transient and recurring problems, which were directed to the appropriate stakeholder service owner(s).
- 30.23% of the measured Situations detected by Incident.MOOG were indicated as Recurring Situations in the Situation Room Knowledge

Scope of the Evaluation
This has been an extremely large evaluation carried out in a very short amount of time. Incident.MOOG processed over 100 million raw events per day from the 5 different production Event sources across two Customer data centers, represented by ~9,100 Managed Entities. The daily volume of Events after cleaning by Incident.MOOG algorithms and processes was approximately 1 million, which were de-duplicated to approximately 13k unique Alerts, all computed by a single Incident.MOOG server VM instance running 5 Link Access Modules (LAM), one for each of the 5 event feeds.

Moogsoft worked with Customer Operations to configure how Incident.MOOG clusters Alerts into Situations. In 30 working days, the Customer and Moogsoft were able to aggregate the raw event feeds, clean the incoming data (unifying timestamps and node names/IP addresses, enriching Application tags) and calibrate the system across single and multiple Event feeds.

The basic metrics for the 2 live data centers are: 1,080,298 Events/day (after cleaning) with an average rate of 800/second and a peak rate of 10,000/second.
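To illustrate what de-duplicating cleaned Events into unique Alerts means, here is a minimal sketch. It is not Incident.MOOG's algorithm, which this report does not describe. It assumes that two Events belong to the same Alert when they share a source and a signature; the `Alert` class, the `deduplicate` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One unique Alert aggregating repeated Events (hypothetical model)."""
    source: str
    signature: str
    count: int
    first_seen: float
    last_seen: float


def deduplicate(events):
    """Collapse (timestamp, source, signature) Events into unique Alerts.

    Assumption: the (source, signature) pair is the de-duplication key.
    """
    alerts = {}
    for timestamp, source, signature in events:
        key = (source, signature)
        alert = alerts.get(key)
        if alert is None:
            alerts[key] = Alert(source, signature, 1, timestamp, timestamp)
        else:
            alert.count += 1
            alert.first_seen = min(alert.first_seen, timestamp)
            alert.last_seen = max(alert.last_seen, timestamp)
    return list(alerts.values())


# Made-up sample Events: three repeats of one condition plus one other.
events = [
    (100.0, "db-host-1", "tablespace nearly full"),
    (101.0, "db-host-1", "tablespace nearly full"),
    (102.0, "app-host-7", "jvm heap exhausted"),
    (105.0, "db-host-1", "tablespace nearly full"),
]
alerts = deduplicate(events)
for alert in alerts:
    print(alert.source, alert.signature, alert.count)
```

Keeping a count and first/last timestamps on each Alert preserves the information that a condition is repeating, even though the repeats no longer appear as separate items.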
For sizing purposes, we can assume an equal volume from each Customer data center, therefore 0.5 million/day/datacenter with an average rate of 400/second/datacenter and a peak of 5,000/second/datacenter.

An Incident.MOOG Situation is one or more related Alerts that should constitute an Incident requiring actionable effort (i.e. further investigation or corrective action). An Alert is an object aggregating a unique raw input Event. Situations contain Alerts that are de-duplicated to reduce noise and make it easier to contextualize the relationship of one Alert with others around it.

Network configuration and services data (locations, customer numbers etc.) was taken from the same data sources used to enrich events, and this data was used in both the clustering configuration and the enrichment of the resulting Situations with operationally useful data. This enrichment included:
- IpAddress to Host: To enrich syslog Alerts arriving with only an IpAddress with a Hostname
- Host/JVM/DB to Application: To enrich Alerts with one or more Application names associated with the source entity
- Application to Service: To further enrich Alerts with one or more Services associated with the source entity
- Entity to Environment: To allow partitioning of production from non-production
- Entity to Priority: To allow labeling of Alerts with the Customer's P1 to P6 prioritization scheme
- Additional labeling-only meta-data: To allow labeling entities with names of support teams, etc.
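The enrichment steps listed above amount to a chain of lookups. The following sketch shows one way such a chain could work; it is an illustration under stated assumptions, not the product's configuration. All table names, field names and values are hypothetical stand-ins for the Customer's network configuration and services data.

```python
# Hypothetical lookup tables standing in for the Customer's reference data.
IP_TO_HOST = {"10.0.0.5": "db-host-1"}
HOST_TO_APPS = {"db-host-1": ["billing"]}
APP_TO_SERVICES = {"billing": ["payments"]}
HOST_TO_ENVIRONMENT = {"db-host-1": "production"}
HOST_TO_PRIORITY = {"db-host-1": "P2"}
HOST_TO_SUPPORT_TEAM = {"db-host-1": "database-ops"}


def enrich(alert):
    """Return a copy of `alert` with the enrichment fields filled in."""
    enriched = dict(alert)

    # IpAddress to Host: syslog Alerts may arrive with only an IP address.
    host = enriched.get("host") or IP_TO_HOST.get(enriched.get("ip"))
    enriched["host"] = host

    # Host/JVM/DB to Application, then Application to Service.
    apps = HOST_TO_APPS.get(host, [])
    enriched["applications"] = list(apps)
    enriched["services"] = sorted(
        {svc for app in apps for svc in APP_TO_SERVICES.get(app, [])}
    )

    # Entity to Environment and Entity to Priority (P1 to P6 scheme).
    enriched["environment"] = HOST_TO_ENVIRONMENT.get(host, "unknown")
    enriched["priority"] = HOST_TO_PRIORITY.get(host)

    # Labeling-only meta-data, such as the support team name.
    enriched["support_team"] = HOST_TO_SUPPORT_TEAM.get(host)
    return enriched


syslog_alert = {"ip": "10.0.0.5", "message": "disk latency high"}
enriched_alert = enrich(syslog_alert)
print(enriched_alert)
```

Unknown entities fall through with empty or "unknown" values rather than raising an error, so an Alert from an unmapped source is still kept.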
Measurement Process and Results

What are the goals of the measurements?
The aim of the Customer was to compare the performance of the Incident.MOOG system with the existing tools and processes over a statistically significant time period (2 average working days). The measurements to be recorded were defined by the Customer as:
- Detection time: Did Incident.MOOG identify a meaningful Situation earlier than the existing tools and processes?
- Accuracy: Did Incident.MOOG actually identify something which is either Actionable or that the NOC should be aware of?
- Potential for Reduction in Effort: How many Network Operations Center (NOC) resources would have been disrupted when compared to the information contained in the Incident.MOOG Situation?

Measurement Results
Moogsoft has successfully completed the key deliverables that were identified in the original scope of the evaluation. By implementing Incident.MOOG as an automated Situation-Centric Aggregation solution above all of the Customer's monitoring fabric and integrating Incident.MOOG into the wider Service Management framework, the Customer could achieve additional operational benefits.

Select Screenshots
Screen Shot 1: 15 million raw events clustered into 250 real situations that contained 625 impacted services. Note that critical services are greyed out to protect privacy for the Customer.
Screen Shot 2: Situation room. Note that domain experts' identities are greyed out to protect privacy for the Customer.
Screen Shot 3: Each Situation is compared mathematically for similarity with past Situations, so that knowledge can be captured and reused. Note that domain experts' identities are greyed out to protect privacy for the Customer.
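This report does not say which similarity measure Incident.MOOG uses to compare a Situation with past Situations. As an illustration only, the sketch below uses the Jaccard index over the set of alert signatures in each Situation, which is one common way to compare two sets. The identifiers, signatures and the 0.5 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def most_similar(current, past, threshold=0.5):
    """Return (situation_id, score) of the best past match.

    The id is None when no past Situation reaches `threshold`.
    """
    best_id, best_score = None, 0.0
    for situation_id, signatures in past.items():
        score = jaccard(current, signatures)
        if score > best_score:
            best_id, best_score = situation_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Made-up alert signatures for a new Situation and two past ones.
current = {
    "db-host-1|tablespace nearly full",
    "app-host-7|jvm heap exhausted",
    "app-host-7|connection timeout",
}
past = {
    "S-101": {"db-host-1|tablespace nearly full", "app-host-7|jvm heap exhausted"},
    "S-102": {"switch-3|port flap"},
}
match = most_similar(current, past)
print(match)
```

A match above the threshold is what would let operators reuse the resolution notes recorded against the earlier Situation.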
Before and After Incident.MOOG

Before Moogsoft Evaluation
The Customer had reached the tipping point at which the volume of raw events was causing too many business-impacting situations:

FIGURE 1: Customer Problems

What were the symptoms?
- On average, ~115 million raw events were emitted per day by the above 5 data sources. This completely overwhelmed the limited support staff of 100+, working around the world and around the clock.
- Service quality is suffering: Up to 74% of incidents were spotted by end users first, Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR) are lengthy, and the Customer Satisfaction Score (CSAT) is going in the wrong direction.
- There will be no staff increase, for OpEx control reasons.
- Business agility is held back: Time to market, time to new services, and migrations to DevOps are all suffering.

What are the problems with legacy tools and processes?

FIGURE 2: Customer legacy incident management processes
[Figure labels: Infrastructure Elements; Domain Specific Monitoring Tools (Application Performance Management (APM), Java Virtual Machine Application Logs, System Performance Dashboard, Oracle Enterprise Manager, Server / OS Syslogs); Millions of Events and Alerts; SME: Parse Logs, Check Metrics, Run Diagnostic; War Room: Unclear Causes & Impacts; Only P1 alarms receive attention. Axes: Resources, Time (Day 1 to Day X).]
The Customer relied on a reactive Fault and Incident Management model. There is no single pane of glass to review all Event data (see Figure 1 above). A group of War Room Operators monitor the APM application state for service interruptions, while other groups of Subject Matter Experts (SMEs) use their individual experience and multiple monitoring tools, ranging from the standard Customer toolset (APM Availability, Systems Performance Dashboard, Oracle Enterprise Manager [OEM] and JVM App Logs) to domain-specific tools, in order to identify whether a service interruption is occurring.

When a service interruption is identified, the respective SMEs attempt to work out whether their domain is the cause of the issue or is suffering collateral impact. Often this leads to dead-end or spam workload, after going around in circles between running diagnostics, checking performance and reading log files.

Issues are raised to the service desk by individual SMEs, Users (via the trouble ticketing system) or the War Room Operators. The service desk staff then create an Incident and escalate, setting up war room calls for the appropriate stakeholders. These calls are then used by the SMEs to determine who the causal party is.

General observations:
- The Customer acknowledges that the volume of currently ignored operational data (syslogs, app/jvm logs, etc.) is too high to be manually processed and that many issues are simply missed until end users report them.
- SMEs generally only look at P1 and P2 priority Alerts, leaving P3 through P6 Alerts exposed. Many of these P3 through P6 Alerts were initially small transient issues that came and went, but eventually developed into P1 and P2 issues. Seeing these transient issues early, as they unfold, could have prevented many P1 incidents.
- While working with Customer SMEs, several SMEs commented that they don't care to see Alerts falling outside their domain (e.g. Storage SMEs only care about Storage Alerts).

This method of operations can increase the MTTR significantly because of a lack of situational awareness between responder parties (War Room Operators and SMEs). Firstly, they may all be looking at different tools and so have a different perspective on the scope, scale and impact of a given issue. Secondly, several parties may spend an unpredictable amount of time investigating the same issue in parallel, with varying degrees of understanding, before the issue is escalated or before the War Room Operators become aware of it.

After Moogsoft
FIGURE 3: Moogsoft Automated Situation Management
Moogsoft offers situational awareness across all the SMEs participating in the service assurance processes. Situations indicate that an issue is occurring in real time, often before any application or service disruption. Situations contain clusters of related alerts with indicators of causes and impacts, making it easier to determine whether a stakeholder to the issue is responsible for the causality, or is suffering collateral damage from the original fault.

For example: A slow-performing ESX cluster A is a collateral (impacted) party, while database B and network switch port C are both causal parties for the impact. The SMEs for these specific hosts and this switch will therefore be called upon. For the causal parties, the cluster of alerts often gives an indication of the likely cause of the issue, directing the right SME to the appropriate resolution. Where known faults are identified, real-time automation can trigger resolution scripts or processes.

This situational awareness ultimately leads to increased operational efficiency and a reduced mean time to detect and diagnose the cause. It enables proactive feedback to End Users, correspondingly reducing the number of User Tickets raised, and it reduces the mean time to resolve the issue through less finger pointing and better use of social knowledge.

During the Customer evaluation, Incident.MOOG was producing, on average, 250 Situations per day. These represented causal and collateral indicators. In real-time use, the system would serve a large number of Customer domain SMEs (compute, storage, network, database, etc.), and it is estimated that well over 100 Level 1 and Level 2 War Room Operators cover these SMEs. Using Incident.MOOG as a source of early situational awareness could therefore significantly reduce their actionable workload.
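The ESX cluster example above can be expressed as a small data model to show how causal and collateral indicators map to the stakeholders who are called upon. This is a hypothetical sketch, not Incident.MOOG's data model: the class names, the `role` field and the owner team names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndicatorAlert:
    """An Alert inside a Situation, tagged with its role (hypothetical)."""
    entity: str
    owner: str  # service owner team responsible for the entity
    role: str   # "causal" or "collateral"


@dataclass
class Situation:
    """A cluster of related Alerts with causal and collateral indicators."""
    alerts: list = field(default_factory=list)

    def owners(self, role):
        """Service owners whose Alerts play the given role."""
        return sorted({a.owner for a in self.alerts if a.role == role})

    def stakeholders(self):
        """All service owners with at least one Alert in the Situation."""
        return sorted({a.owner for a in self.alerts})


# The example from the text, with assumed owner team names.
situation = Situation(
    alerts=[
        IndicatorAlert("ESX cluster A", "compute", "collateral"),
        IndicatorAlert("database B", "database", "causal"),
        IndicatorAlert("network switch port C", "network", "causal"),
    ]
)
print("Causal parties:", situation.owners("causal"))
print("Collateral parties:", situation.owners("collateral"))
print("Invite to situation room:", situation.stakeholders())
```

Separating the causal owners from the full stakeholder list reflects the distinction drawn in the text: everyone affected is informed, but only the causal parties are directed toward a resolution.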
Conclusions
Incident.MOOG has proved that it can:
- Automatically contextualize ~115 million raw events into ~250 real actionable Situations per day, a 99.998% reduction of managed events for IT Ops and DevOps;
- Detect Situations up to 24 hours before corresponding tickets were created;
- Keep up with frequent infrastructure changes, without relying on rules and models.

Incident.MOOG has accurately identified significantly more customer-impacting real incidents than the existing systems and processes. These incidents have tended to be transient, short-duration service interruptions, primarily in the local distribution network. Several of them reflected patterns of repeating problems that were not evident from the existing tools and processes.

Transient failures (a quick fail and clear cycle) present a difficult operational issue because they appear to be noise rather than real service-impacting alerts. It is often only after a repeated P2 or P3 transient incident has turned into a major P1 incident that these patterns are noted, often during a post mortem or wash-up exercise. Providing situational visibility of P2 or P3 incidents and facilitating repeat fault analysis, and so preventing them from escalating to P1, is a key benefit of the Incident.MOOG solution.

The Situations that Incident.MOOG identified represent exactly the kind of service interruptions and outages that impact customer perception of Quality of Service and therefore impact business continuity and customer satisfaction. By using Incident.MOOG to generate Incidents for these smaller, more transient outages, the Customer Services Staff would have a far more complete picture of disruption.

These results were achieved using the standard Incident.MOOG product configuration, with no use of partial topology to optimize the Situation clustering.

For more information, visit www.moogsoft.com.

U.S.: 140 Geary Street, Office 1000, San Francisco, CA 94108, +1 415 738 2299
U.K.: The Sanctuary, 23 Oakhill Grove, Surbiton KT6 6DU, +44 208 399 8266
NY: +1 646 843 0455
Singapore: +65 3158 4393