Customer Evaluation Report On Incident.MOOG




WHITE PAPER

Customer Evaluation Report On Incident.MOOG
(Real Data Provided by a Fortune 100 Company)

For information about Moogsoft and Incident.MOOG, visit www.moogsoft.com.

http://moogsoft.com 2011-2015 Moogsoft Inc. All rights reserved.

Synopsis

A Fortune 100 Company (the Customer) conducted an extensive evaluation of Incident.MOOG, a next-generation manager of managers (MoM) providing early warning and collaborative remediation for IT incidents. Within 30 working days, Incident.MOOG had transformed the Customer's IT operations command center to operate faster and more effectively. Here's a summary of the key results:

Key Metrics | Before | After | % of Improvement
Average Managed Events Per Day | ~115,000,000 | ~1,000,000 | 99.1%
Real Situations, i.e. Cleaned, Clustered and Contextualized Alerts | Almost zero rolled up alerts | ~250 | 99.998%
Time it took to create corresponding actionable tickets | >=24 hours | 1-2 hours | 95.8%

This report summarizes the:
- Benefits provided by Incident.MOOG
- Customer's data center environment, processes and tools before and after the evaluation, and where and how Incident.MOOG fitted in
- Scope and procedure of the evaluation.
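The two percentage figures in the Key Metrics table that have numeric Before and After values are consistent with a plain relative-reduction calculation. The sketch below reproduces them; it assumes the lower bounds of the quoted ranges (24 hours before, 1 hour after), which the report does not state explicitly, and the function name is illustrative only.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, expressed as a percentage."""
    return (before - after) / before * 100.0


# Average managed events per day: ~115,000,000 before, ~1,000,000 after.
events = percent_improvement(115_000_000, 1_000_000)

# Time to create an actionable ticket, assuming 24 hours before and
# 1 hour after (the lower ends of ">=24 hours" and "1-2 hours").
ticket_time = percent_improvement(24, 1)

print(f"Managed events: {events:.1f}%")        # Managed events: 99.1%
print(f"Ticket creation: {ticket_time:.1f}%")  # Ticket creation: 95.8%
```

The Real Situations row is left out because its Before value ("almost zero rolled up alerts") is not a number that can be plugged into the same formula.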

Executive Summary

Moogsoft engaged with the Customer over a 30-day period to demonstrate the value of Incident.MOOG vs. existing fault and incident management tools and processes.

Definitions of words used in this document

Incident: An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.

Event: A change of state that has significance for the management of a Configuration Item or IT Service.

Alert: A unique warning message indicating that a threshold has been reached, something has changed, or a Failure has occurred. Alerts are often created and managed by System Management tools and are managed by the Event Management Process.

Issue: Some fault or impact behavior that is actionable but has not yet been categorized as an Incident.

Situation: The indication of some form of issue, as a cluster of related Alerts that should be acted upon by appropriate stakeholders. The cluster of Alerts may contain Causal Indicators and Collateral Indicators. Situation stakeholders are the Service Owners whose Alerts are clustered within a given Situation, e.g. Storage, Database and Application. A Situation may represent an Incident or the impact of an Incident.

Causal Indicator: An Alert or set of Alerts indicating the source of a fault or failure of some kind.

Collateral Indicator: An Alert or set of Alerts indicating the collateral damage or impact of a service disruption.

What is Incident.MOOG?

Created by the inventors of IBM Tivoli Netcool, Incident.MOOG is a next-generation manager of managers (MoM): an incident early warning and collaborative remediation platform for IT Ops and DevOps teams. It uses machine learning to reduce the number of incidents that IT Ops and DevOps teams must handle. It then applies social collaborative technologies to enable cross-domain teams to work together and remediate problems faster.
Incident.MOOG is designed to:
- Clean: Automatically remove event noise and detect anomalies from input events and log sources in real time, without relying on rules or an up-to-date Configuration Management Database (CMDB).
- Contextualize: Automatically cluster the resulting anomalies into Situations (sets of related alerts with indicators of causes and impacts), allowing IT operational teams to diagnose the condition more quickly.
- Collaborate: Automatically host a "single pane of glass" situation room, which invites the relevant experts to triage the same situations, and orchestrates the entire remediation process.

What were the goals of this evaluation?

To demonstrate that Incident.MOOG can substantially and automatically reduce the time and effort IT operational staff spend on troubleshooting issues:
1. Substantially and automatically reduce the total actionable work by Network Operations Center (NOC) operators and service owners;
2. Detect all of the Situations that are identified by the existing tools and processes;
3. Detect real actionable Situations not indicated by the existing tools and processes;
4. Detect Situations either at the same time as or earlier than the existing tools and processes.

What did the Customer environment look like?

For this evaluation, over 30 working days, the live production environment included 2 large data centers across five customer event feeds:
- Application Performance Monitoring (APM)
- Systems Performance Dashboard
- Oracle Enterprise Manager
- Java Virtual Machine (JVM) Application Logs
- Server / OS Syslog.

TABLE 1: Event Sources

Data Source | Count of Distinct Devices (Physical or Virtual)
Application Performance Management (APM) Availability | 3,172
Systems Performance Dashboard (SPD) | 3,531
Oracle Enterprise Manager (OEM) | 511
Java Virtual Machine App Logs (SystemErr.log) | 171
Server/OS syslog | 1,747
Total Distinct Entities | 9,132

In total these event sources produce approximately 115 million events per day.

What were the results of this evaluation?

During the evaluation process, Incident.MOOG:
- Automatically reduced the number of managed Events by 99.998%;
- Created actionable Situations for IT Ops and DevOps teams to better manage incidents from these events (without rules, topology or historic behavior to determine anomalies);
- Didn't miss any incidents or problems: zero "false negatives";
- Detected Situations up to >24 hours before the corresponding Incident tickets were created;
- Automatically identified that >30% of the Situations were recurring issues.

Additional Details about the Results

More findings during the measurement processes were as follows:
- 100% of the Moogsoft Situations that were assessed reflected genuine Fault, Collateral or Impact conditions.
- 90.7% of the assessed Situations were flagged as actionable. The remaining 9.3% of Situations represented recurring or transient issues. There were no non-actionable Situations.
- Of the actionable Situations assessed during the measurement period, 23% were recorded as Incidents with corresponding tickets created by the existing processes, reflecting Incident.MOOG's ability to detect important lower-severity, transient and recurring problems, which were directed to the appropriate stakeholder service owner(s).
- 30.23% of the measured Situations detected by Incident.MOOG were indicated as Recurring Situations in the Situation Room Knowledge.

Scope of the Evaluation

This has been an extremely large evaluation carried out in a very short amount of time. Incident.MOOG was processing over 100 million raw events per day from the 5 different production Event sources across two Customer data centers, represented by ~9,100 Managed Entities. The daily volume of Events after cleaning by Incident.MOOG algorithms and processes was approximately 1 million, which were de-duplicated to approximately 13k unique Alerts, all computed by a single Incident.MOOG server VM instance running 5 Link Access Modules (LAMs) taking in the 5 event feeds.

Moogsoft worked with Customer Operations to configure how Incident.MOOG clusters Alerts into Situations. In 30 working days, Customer and Moogsoft were able to aggregate the raw event feeds, clean the incoming data (unifying timestamps and node names/IP addresses, enriching Application tags) and calibrate the system across single and multiple Event feeds.

The basic metrics for the 2 live data centers are: 1,080,298 Events/day (after "cleaning") with an average rate of 800/second and a peak rate of 10,000/second.
For sizing purposes, we can assume an equal amount of alerts from each Customer data center, therefore 0.5 million/day/datacenter with an average rate of 400/second/datacenter and a peak of 5,000/second/datacenter.

An Incident.MOOG Situation is one or more related Alerts that should constitute an Incident requiring actionable effort (i.e. further investigation or corrective action). An Alert is an object aggregating a unique raw input Event. Situations contain Alerts that are de-duplicated to reduce noise and make it easier to contextualize the relationship of one Alert with others around it.

Network configuration and services data (locations, customer numbers etc.) was taken from the same data sources used to enrich events, and this data was used in both the clustering configuration and the enrichment of the resulting Situations with operationally useful data. This enrichment included:
- IpAddress to Host: To enrich syslog Alerts arriving with only an IpAddress with a Hostname
- Host/JVM/DB to Application: To enrich Alerts with one or more Application names associated with the source entity
- Application to Service: To further enrich Alerts with one or more Services associated with the source entity
- Entity to Environment: To allow partitioning of production from non-production
- Entity to Priority: To allow labeling of Alerts with the Customer's P1-P6 prioritization scheme
- Additional labeling-only meta-data: To allow labeling entities with names of support teams, etc.
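The de-duplication and enrichment steps described above can be pictured with a small sketch. This is not Incident.MOOG code or its API: the lookup tables, the `Alert` class, the choice of `(source, message)` as the de-duplication key and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables standing in for the Customer's enrichment data.
IP_TO_HOST = {"10.0.0.5": "db-host-01"}
HOST_TO_APPS = {"db-host-01": ["billing"]}
APP_TO_SERVICES = {"billing": ["payments"]}
HOST_TO_ENVIRONMENT = {"db-host-01": "production"}
HOST_TO_PRIORITY = {"db-host-01": "P2"}


@dataclass
class Alert:
    source: str
    message: str
    count: int = 0  # number of raw events folded into this alert
    tags: dict = field(default_factory=dict)


def enrich(alert: Alert) -> Alert:
    """Attach host, application, service, environment and priority tags."""
    host = IP_TO_HOST.get(alert.source, alert.source)
    apps = HOST_TO_APPS.get(host, [])
    services = sorted({svc for app in apps for svc in APP_TO_SERVICES.get(app, [])})
    alert.tags = {
        "host": host,
        "applications": apps,
        "services": services,
        "environment": HOST_TO_ENVIRONMENT.get(host, "unknown"),
        "priority": HOST_TO_PRIORITY.get(host, "unknown"),
    }
    return alert


def deduplicate(events):
    """Fold repeated (source, message) events into one enriched Alert each."""
    alerts = {}
    for source, message in events:
        key = (source, message)  # assumed notion of "unique" for this sketch
        if key not in alerts:
            alerts[key] = enrich(Alert(source, message))
        alerts[key].count += 1
    return list(alerts.values())


raw_events = [("10.0.0.5", "disk latency high")] * 3 + [("10.0.0.9", "link down")]
alerts = deduplicate(raw_events)
for alert in alerts:
    print(alert.count, alert.tags["host"], alert.tags["services"])
```

Entities missing from a lookup table keep their raw identifier and are tagged "unknown", so gaps in the enrichment data stay visible rather than being silently dropped.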

Measurement Process and Results

What are the goals of the measurements?

The aim of the Customer was to compare the performance of the Incident.MOOG system with the existing tools and processes over a statistically significant time period (2 average working days). The measurements to be recorded were defined by the Customer as:
- Detection time: Did Incident.MOOG identify a meaningful Situation earlier than the existing tools and processes?
- Accuracy: Did Incident.MOOG actually identify something which is either Actionable or that the NOC should be aware of?
- Potential for Reduction in Effort: How many Network Operations Center (NOC) resources would have been disrupted when compared to the information contained in the Incident.MOOG Situation?

Measurement Results

Moogsoft has successfully completed the key deliverables that were identified in the original scope of evaluation. By implementing Incident.MOOG as an automated Situation-Centric Aggregation solution above all of the Customer monitoring fabric and integrating Incident.MOOG into the wider Service Management framework, Customer could achieve additional operational benefits.

Select Screenshots

Screen Shot 1: 15 million raw events clustered into 250 real situations that contained 625 impacted services. Note that critical services are greyed out to protect privacy for the Customer.

Screen Shot 2: Situation room. Note that domain experts' identities are greyed out to protect privacy for the Customer.

Screen Shot 3: Each situation is compared mathematically for similarity with past situations, so that knowledge can be captured and reused. Note that domain experts' identities are greyed out to protect privacy for the Customer.
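The report does not say which similarity measure lies behind Screen Shot 3. As one possible illustration only, the sketch below compares Situations by the Jaccard similarity of their alert signatures; the signature strings, the Situation identifiers and the function names are all hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def most_similar(current: set, history: dict):
    """Return (situation_id, score) of the closest past Situation, if any."""
    best_id, best_score = None, 0.0
    for situation_id, signature in history.items():
        score = jaccard(current, signature)
        if score > best_score:
            best_id, best_score = situation_id, score
    return best_id, best_score


# Hypothetical alert signatures of the form "entity:symptom".
current = {"db-b:latency", "switch-c:port-flap", "esx-a:slow"}
history = {
    "S-101": {"db-b:latency", "switch-c:port-flap"},
    "S-102": {"web-d:http-500"},
}

match_id, match_score = most_similar(current, history)
print(match_id, round(match_score, 2))  # S-101 0.67
```

A high score against a resolved past Situation is the kind of signal that could let operators reuse earlier diagnostic notes, which is the knowledge reuse the screenshot caption describes.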

Before and After Incident.MOOG

Before Moogsoft Evaluation

The Customer had reached a tipping point at which the volume of raw events was causing too many business-impacting situations:

FIGURE 1 Customer Problems: What were the symptoms?
- On average, ~115 million raw events were emitted per day by the above 5 data sources. This completely overwhelmed the limited support staff of 100+, around the world and around the clock.
- Service quality is suffering: up to 74% of incidents were spotted by end users first, with lengthy Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR), and Customer Satisfaction Score (CSAT) going in the wrong direction.
- There will be no staff increase, for OpEx control reasons.
- Business agility is held back: time to market, time to new services and migrations to DevOps are all suffering.

What are the problems with legacy tools and processes?

FIGURE 2 Customer legacy incident management processes
[Figure labels: Resources; Infrastructure Elements; Domain Specific Monitoring Tools (Application Performance Management (APM), Java Virtual Machine Application Logs, System Performance Dashboard, Oracle Enterprise Manager, Server / OS Syslogs); Millions of Events and Alerts; SME: Parse Logs, Check Metrics, Run Diagnostic; War Room; Unclear Causes & Impacts; Only P1 alarms receive attention; Time axis: Day 1 to Day X]

Customer relied on a reactive Fault and Incident Management model. There is no "single pane of glass" to review all Event data (see Figure 1 above). A group of War Room Operators monitor the APM application state for service interruptions, while other groups of Subject Matter Experts (SMEs) use their individual experience and multiple monitoring tools, ranging from the standard Customer toolset (APM Availability, Systems Performance Dashboard, Oracle Enterprise Manager [OEM] and JVM App Logs) to domain-specific tools, in order to identify whether a service interruption is occurring.

When a service interruption is identified, the respective SMEs will attempt to work out whether they are the causal or collateral impact of the issue. Often this leads to dead-end or "spam" workload, after going around in circles between running diagnostics, checking performance and reading log files. Issues are raised to the service desk by individual SMEs, Users (via the trouble ticketing system) or by the War Room Operators. The service desk staff then create an Incident and escalate, setting up war room calls for the appropriate stakeholders. These calls are then used by the SMEs to determine who the causal party is.

General observations:
- Customer acknowledges that the volume of currently ignored operational data (syslogs, app/JVM logs, etc.) is too high to be manually processed and that many issues are simply missed until end users report them.
- SMEs generally only look at P1 and P2 priority Alerts, leaving P3 through P6 Alerts exposed. Many of these P3 through P6 Alerts were initially small transient issues that came and went, but eventually developed into P1 and P2. Seeing these transient issues as they unfold early on could have prevented many P1 incidents.
- While working with Customer SMEs, several SMEs commented that they don't care to see Alerts falling outside their domain (i.e. Storage SMEs only care about Storage Alerts).

This method of operations can increase the MTTR significantly because of a lack of situational awareness between responder parties (War Room Operators and SMEs). Firstly, they may all be looking at different tools and so have a different perspective on the scope, scale and impact of a given issue. Secondly, a non-deterministic amount of time may be spent by several parties in parallel investigating the same issue with varying degrees of understanding, before the issue is escalated or before the War Room Operators become aware of it.

After Moogsoft

FIGURE 3: Moogsoft Automated Situation Management

Moogsoft offers situational awareness across all the SMEs participating in the service assurance processes. Situations indicate that an issue is occurring in real time, often before any application or service disruption. Situations contain clusters of related alerts with indicators of causes and impacts, making it easier to determine whether a stakeholder to the issue is responsible for the causality, or is suffering collateral damage from the original fault.

For example: a slow-performing ESX cluster A is a collateral (impacted) party, while database B and network switch port C are both causal parties for the impact. So the SMEs for these specific hosts and the switch will be called upon. For the causal parties, the cluster of alerts often gives an indication of the likely cause of the issue, directing the right SME to the appropriate resolution. Where known faults are identified, real-time automation can trigger resolution scripts or processes.

This situational awareness ultimately leads to increased operational efficiency and a reduced mean time to detect and diagnose the cause. It enables proactive feedback to End Users, correspondingly reducing the number of User Tickets raised, and reduces the mean time to resolve the issue through less finger pointing and better use of social knowledge.

During the Customer evaluation, Incident.MOOG was producing on average 250 Situations per day. These represented causal and collateral indicators. If the system was being utilized in real time across a large number of Customer domain SMEs (compute, storage, network, database, etc.), then it is estimated that there are well over 100 Level 1 and Level 2 War Room Operators covering these SMEs. This means that using Incident.MOOG as a source of early situational awareness could significantly reduce their actionable workload.
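The ESX cluster / database / switch port example above can be written down as a small data structure. This is a hypothetical sketch, not the product's data model: the `Indicator` class, the domain names and the routing function are assumptions chosen to show how causal and collateral indicators might map to the SME domains to involve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    entity: str  # the affected configuration item
    domain: str  # owning SME domain, e.g. "compute", "database", "network"
    role: str    # "causal" or "collateral"


def stakeholders_by_role(indicators):
    """Group the SME domains of a Situation by their role in it."""
    grouped = {"causal": set(), "collateral": set()}
    for indicator in indicators:
        grouped[indicator.role].add(indicator.domain)
    return {role: sorted(domains) for role, domains in grouped.items()}


# The Situation from the example in the text.
situation = [
    Indicator("ESX cluster A", "compute", "collateral"),
    Indicator("database B", "database", "causal"),
    Indicator("network switch port C", "network", "causal"),
]

routing = stakeholders_by_role(situation)
print(routing)
# {'causal': ['database', 'network'], 'collateral': ['compute']}
```

Keeping the role on each indicator, rather than on the Situation as a whole, matches the text's point that a single Situation can contain both the source of a fault and the parties suffering from it.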

Conclusions

Incident.MOOG has proved that it can:
- Automatically contextualize ~115 million raw events into ~250 real actionable Situations per day, a 99.998% reduction of managed events for IT Ops and DevOps;
- Detect Situations up to 24 hours before corresponding tickets were created;
- Keep up with frequent infrastructure changes, without relying on rules and models.

Incident.MOOG has accurately identified significantly more customer-impacting real incidents than the existing systems and processes. These incidents have tended to be transient, short-duration service interruptions, primarily in the local distribution network. Several of them reflected patterns of repeating problems that were not evident from the existing tools and processes.

Transient failures (a quick fail and clear cycle) present a difficult operational issue, as they appear to be noise rather than real service-impacting alerts. It is often only after a repeated P2 or P3 transient incident has turned into a major P1 incident that these patterns are noted, often during a post mortem or wash-up exercise. Providing situational visibility of P2 or P3 incidents and facilitating repeat fault analysis, and so preventing them from escalating to P1, is a key benefit of the Incident.MOOG solution.

The Situations that Incident.MOOG identified represent exactly the kind of service interruptions and outages that impact customer perception of Quality of Service and therefore impact business continuity and customer satisfaction. By using Incident.MOOG to generate Incidents for these smaller, more transient outages, the Customer Services Staff would have a far more complete picture of disruption.

These results were achieved using standard Incident.MOOG product configuration with no use of partial topology in order to optimize the Situation clustering.

For more information, visit www.moogsoft.com.

U.S.: 140 Geary Street, Office 1000, San Francisco, CA 94108, +1 415 738 2299
U.K.: The Sanctuary, 23 Oakhill Grove, Surbiton KT6 6DU, +44 208 399 8266
NY: +1 646 843 0455
Singapore: +65 3158 4393