WHITE PAPER
Lean MSP Operations Needs Lean Machine Learning
A White Paper About Real-time War Room Situation Management
For information about Moogsoft and ServiceNow, visit www.moogsoft.com.
http://moogsoft.com 2011-2015 Moogsoft Inc. All rights reserved.
Executive Summary

All MSPs today must maintain their traditional hosting revenue, grow their cloud business, and contain OpEx. But MSPs also need to invest in IT operations, not cut back, to sustain faster service delivery without sacrificing service quality. Where exactly do you cut back and invest at the same time? The answer is to think strategically and go after the most significant inefficiency that is holding back your business. As described in Figure 1, every MSP operations owner must address the same challenge: how to reduce the workload and time spent handling an explosive volume of events and alerts, while improving service quality?

This paper considers three types of Managed Service Provider (MSP):
1. Cloud-enabled managed hosting service providers with roots in the traditional managed hosting market within managed services. Examples: Rackspace, HCL.
2. Managed Service Providers that are part of a Telco Service Provider. Examples: AT&T, Verizon.
3. Traditional resellers and system integrators of technology vendor products, with significant experience in break/fix and remote management. Example: CSC.

FIGURE 1: BYOD, Cloud and DevOps amplify MSP operational inefficiency

This process chronically requires far too many manual steps that can take hours or days to complete. The process starts with tons of noise and ends with fewer incidents/situations, but in the middle is a big gap with no automation and no situational awareness among the war room staff.

This white paper looks at a strategic solution that can automate the noise-to-situation workflow: extend existing monitoring and event management systems with Real-time Situation Management software. There is no need to rip and replace, although you can.

"MSP success is highly dependent upon increased margins by virtue of reduced operating expenditures. An IT specialist full time equivalent (FTE) costs $150k/year, on average, while an IT generalist costs $75k/year, on average."
Ashar Baig, Research Director, Gigaom
Why Is Incident Detection and Remediation So Reactive and Lengthy?

Simply put, most NOCs today still rely on a "noise in, noise out" event and incident management approach to detect and remediate outages (Figure 2).

FIGURE 2: Noise in, noise out alert management

In modern IT environments, outages are more often the result of simultaneous, cascading, and transient events and faults across multiple technology domains, exacerbated by virtualization, mobility and cloud. The true culprits of outages are often buried deep among the millions of events and thousands of alarms generated daily, without context. Yet the increasing pace of IT complexity and change instantly leaves any infrastructure-to-services mapping inaccurate, most notably the Configuration Management Database (CMDB). This renders ineffective any event management system that depends on static rules and requires a 100% accurate topology model. When you rely on these outdated models and rules to triangulate outages that are full of noise, you get noise in, noise out.

"Our enterprise command center handled 9 million events in the past 30 days."
Kalyan Kumar, Senior Vice President and Chief Technologist for IT Operations at HCL Technologies

Note: HCL is a global MSP brand, recognized as a Gartner Magic Quadrant Leader in Data Center Outsourcing and Infrastructure Utility Services, along with CSC, IBM and others.
Figure 3 depicts the resulting workflow, spanning from event collection and processing, to incident management and problem remediation, and ultimately to service restoration and RCA - all while the Service Desk team and customers wait.

FIGURE 3: Your Current Workflow

The sheer volume of events often obscures the problem source. Therefore, IT ops and traditional event management systems process only priority 1 alerts based on SLAs, or they use aggressive filtering to make the event volume manageable. But this often hides important events, including severity 2+ events that contain early warnings.

NOC generalists escalate still-voluminous alerts to experts operating in different silos, without context. Multiple experts often troubleshoot separately rather than collaborating to solve the same problem.

"80% of the mean time to resolve is wasted on trying to locate the issue." - Gartner

There is no way of seeing how alerts are related. This leads to multiple tickets being raised off multiple critical alerts, all pointing to the same problem.

"74% of end user problems are not detected by IT." - Forrester Research

After an outage has occurred, tickets are often merged into a master ticket, a manual, time-consuming process and a poor use of any domain expert's time.

Once operations is working on an incident and domain experts have been called in, the Service Desk lacks visibility into what is going on.

Finally, after an incident has been resolved, there is no easy, automatic way to update a knowledge article and make it available for correlation with future incidents.
Pros and Cons of Various Approaches

The industry is aware of the lack-of-context problem, and there have been various attempts to solve it. Traditional event management tools do offer some automatic filtering and correlation capabilities. This might work if your environments are static, with an always up-to-date Configuration Management Database (CMDB) and topology models. See Table 1.

TABLE 1: Pros and Cons of Traditional Event Management Tools

Traditional Event Management Tools
  Examples: IBM Tivoli Netcool, BMC TrueSight, CA Spectrum, EMC Smarts
  Pros:
    - Cover all technology domains
    - Proven to work in static and stable application architectures
  Cons:
    - No longer effective in handling complex and dynamic IT
    - Require extensive rules and models
    - Filter out important early warnings
    - Allow too much noise
    - Proprietary programming

Another approach is to unify or modernize monitoring. This is an important step for collecting key metrics and graphing them in real-time to identify spikes of resource utilization. It can be a substitute for finding trouble spots if your NOC operator knows exactly what to look for. But the operator still has to manually define the metrics and thresholds, and create the right charts and dashboards to spot unusual spikes. It is virtually impossible to troubleshoot in real-time, as events and alerts are still coming in. See Table 2.

TABLE 2: Pros and Cons of Monitoring Tools

Traditional Monitoring
  Examples: IBM, HP, CA, BMC
  Pros:
    - You need it to collect metrics, events and logs
    - Cover all technology domains
    - Dashboard and reports
    - All-in-one
    - Proven to work in static and stable application architectures
  Cons:
    - Fallen behind modern monitoring
    - Monolithic, not best-of-breed
    - Expensive

Modern Monitoring (Composable Monitoring)
  Examples: AppDynamics, Zenoss, Solarwinds, Nagios
  Pros:
    - You need it to collect metrics, events and logs
    - Cover all technology domains
    - Real-time metrics and charts
    - Based on many open source technologies
    - Faster, simpler and cheaper than traditional monitoring
  Cons:
    - Lack of real-time, cross-domain correlation and contextualization
Log analyzers are a more recent technology that provides the important "Google for Logs" approach. It is great for forensic analysis, as no one can predict which logs need to be kept and which can be thrown away at any moment. So it is better to keep most of them and meet audit requirements. But there are two issues here: (a) you need large log files to arrive at a powerful server cluster first, which often takes hours; and (b) what search phrases should your NOC staff use? How do they know which technology silo to start digging into? What if there are cascading issues across multiple domains? What are their causal and collateral relationships? How did causes and impacts look over the timelines?

TABLE 3: Pros and Cons of Log Analyzers

Log Analyzers
  Examples: Splunk, Sumo Logic, Elastic Search, Loggly
  Pros:
    - Cover all technology domains
    - Great if your NOC operators know what search phrase to use and which domain silo to dig into
  Cons:
    - Limited real-time, cross-domain analytics
    - Mainly historic search
    - Delay in getting logs
    - High licensing cost

Some newer tools claim to support correlation between a change management event and a potential service disruption. But again, what if your topology model and CMDB are not up-to-date? The correlation will fail, because an inaccurate model of the relationship between changes and services leads to noisy data in, noisy analysis out.
What An Ideal Solution Looks Like

If you had it your way, an ideal solution would include all of these capabilities:

- See not only events across your entire stack (apps, private and public clouds, and deep into all infrastructure domains), but also the dynamically formed correlation among them - no need to wait for a perfectly up-to-date Configuration Management Database (CMDB) or topology model.
- See real, incident-triggering warnings earlier, much earlier: like seeing how and where a hurricane will land in 24-48 hours. Your L1 staff should be push-notified of severity 2-4 warnings well before they trigger severity 1 incidents.
- Triage faster, much faster: in minutes, your L1 staff should already know where and how multiple issues are cascading - there is no time to stitch together a dizzying array of charts and log analyses. They should have already notified the right L2-L4 experts, without spamming every expert.
- Work on a few situations, not 1000s of alerts: your L2-L4 experts should immediately see fewer, higher-level incidents, with related alerts already grouped together and a narrative of causes and impacts. Situations, not individual alerts, can drastically cut Mean-Time-to-Restoration.
- Know instantly which past remediation can restore services: have a machine-calculated situation score reflecting its similarity with past situations. Give your teams automated access to past situations with successful RCA and remediation.
- Close the loop automatically: automate remediation and knowledge recycling for future situations.

Essentially, you need a solution that can transform your Serial Alert Workflow into Parallel, Real-time Situation Management, described next.

"Situational-aware machine learning and socialized workflows is the future of service assurance. We need these essential innovations to support more customers and ensure our service quality, while keeping operational cost low and efficiency high."
Kalyan Kumar, Senior Vice President and Chief Technologist for IT Operations at HCL Technologies.
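To make the idea of a machine-calculated situation score more concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product computes its score: it assumes each situation is summarized by a short text description and uses plain Jaccard similarity over word tokens as a stand-in. The situation IDs, descriptions, and function names are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a description and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_past_situations(current: str, past: dict[str, str]) -> list[tuple[str, float]]:
    """Return (situation_id, score) pairs, most similar past situation first."""
    cur = tokens(current)
    scored = [(sid, jaccard(cur, tokens(desc))) for sid, desc in past.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical past situations that already have a documented remediation.
past = {
    "SIT-1": "database connection pool exhausted on db01",
    "SIT-2": "link down on core switch sw-core-2",
}
ranking = rank_past_situations("connection pool exhausted on db02", past)
print(ranking[0][0])  # ID of the most similar past situation
```

In this toy example the new situation ranks closest to the earlier connection-pool situation, so its remediation notes would be the first ones offered to the team. A real scoring scheme would likely weigh far more than shared words.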
Real-time Situation Management Workflow Detects Incidents 24 Hours Earlier

You need to capture the entire narrative of an incident and present it as a higher-level situation. Because this data-driven approach can analyze a situation from multiple angles, IT teams have fewer, more actionable alerts to deal with, and hence can cope with far greater event volumes.

A - Clean: Remove noise, de-duplicate, and blacklist events, despite a partial and inaccurate CMDB underneath. Real-time machine learning and natural language processing algorithms replace hard-coded rules and models. They examine loosely defined, text-rich events across domain silos, from applications and clouds to infrastructure. This is well suited to dynamic IT environments with software-defined infrastructure, virtualization and cloud, and to the migration toward continuous application delivery (i.e. DevOps).

B - Contextualize: Eliminate troubleshooting triage by showing the resulting alerts in context - clustering related alerts into situations, which are then decorated with service-specific details. This is a data-driven approach: change- and error-tolerant algorithms automatically identify clusters of related alerts, rather than always assuming a single root cause.

C - Collaborate: Use social collaboration technology to orchestrate push notifications to relevant domain experts, bringing them together in virtual war rooms known as Situation Rooms. Here, the experts can log communications, query other tools, and capture the remediation process, automating knowledge recycling and keeping in sync with the Service Desk team.

FIGURE 4: Real-time Situational Aware Workflow
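The Clean and Contextualize steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the machine-learning approach the paper describes: duplicates are detected by an exact match on source and message, and "related" alerts are simply those arriving within a fixed time window of each other. The `Event` fields, the 300-second window, and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary start time
    source: str       # host or component that emitted the event
    message: str      # free-text event description


def deduplicate(events: list[Event]) -> list[Event]:
    """Clean: keep only the earliest event for each (source, message) pair."""
    seen: set[tuple[str, str]] = set()
    unique: list[Event] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.message)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique


def cluster_by_time(alerts: list[Event], window: float = 300.0) -> list[list[Event]]:
    """Contextualize: group alerts that arrive within `window` seconds
    of the previous alert in the same group."""
    situations: list[list[Event]] = []
    for alert in sorted(alerts, key=lambda e: e.timestamp):
        if situations and alert.timestamp - situations[-1][-1].timestamp <= window:
            situations[-1].append(alert)
        else:
            situations.append([alert])
    return situations


raw = [
    Event(0.0, "db01", "connection refused"),
    Event(10.0, "db01", "connection refused"),  # duplicate of the first event
    Event(60.0, "app01", "request timeout"),
    Event(4000.0, "sw1", "link down"),
]
alerts = deduplicate(raw)
situations = cluster_by_time(alerts)
print(len(raw), len(alerts), len(situations))
```

Here four raw events become three alerts and then two situations. Time proximity alone is a crude proxy for relatedness; it is used only to show the shape of the noise-to-situation pipeline without requiring a CMDB or topology model.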
A Real-Time Situation Management Solution: Incident.MOOG

Incident.MOOG is Real-time Situation Management software. Using machine-learning and social collaboration technologies, it detects warnings earlier, provides situational awareness to IT operations teams, and enables faster cross-domain remediation.

Despite its breakthrough accuracy and speed in detecting anomalies, Incident.MOOG is remarkably simple to use. When installed on top of existing monitoring tools or traditional event management software, Incident.MOOG gives your operations team an immediate, 360° contextual view across all IT domains (e.g. app, DB, server, storage, network, and private and public clouds).

The software uses open standards, such as JavaScript, to specify how events are ingested into the JSON data format, so there is virtually no learning curve. Your NOC staff does not need to learn any proprietary, vendor-supplied programming language.

After reducing event noise and simplifying alerts into fewer situations, Incident.MOOG sends 99% fewer tickets to your Service Desk software (ServiceNow, BMC Remedy, or any other).

The entire Proof-of-Concept (POC) typically takes about 15 days, including installation, data ingestion, virtually automatic tuning, and results presentation. The screenshot on the right shows how a global MSP was able to reduce more than 9 million raw events to around 1,900 situations over a 30-day production period.

FIGURE 5: From Individual Alerts to Situation Management
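As a quick worked check on the figures quoted above, the following Python snippet computes the reduction implied by going from roughly 9 million raw events to roughly 1,900 situations. Both inputs are the rounded figures from the text ("more than 9 million" and "around 1,900"), so the result is approximate; the function and variable names are only illustrative.

```python
def reduction_percent(raw_events: int, situations: int) -> float:
    """Percentage by which the item count shrinks from raw events to situations."""
    return 100.0 * (1 - situations / raw_events)


raw_events = 9_000_000  # "more than 9 million", treated here as a lower bound
situations = 1_900      # "around 1,900"

pct = reduction_percent(raw_events, situations)
per_situation = raw_events / situations

print(f"{pct:.3f}% reduction, about {per_situation:,.0f} raw events per situation")
```

With these rounded inputs the reduction comes to roughly 99.98%, or on the order of 4,700 raw events folded into each situation.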
Conclusions

Every operations leader at a cloud-aspiring MSP must innovate her monitoring and service assurance workflow, or risk being marginalized by public cloud service providers.

Real-time Situation Management, a type of IT Operations Analytics (ITOA) tool, can detect warnings for incidents earlier, automatically and in real-time, so your team can be more proactive and remediate faster.

Make sure your entire environment is covered with 360° visibility and context. Make sure your blended apps, hybrid clouds and heterogeneous infrastructures are all covered. Make sure your one, resource-limited NOC can scale to support more customers.

Transform your Enterprise Command Center today. To find out more, read this most recent MSP case study -> Link to HCL case study. To try it yourself, visit www.moogsoft.com and contact us (http://moogsoft.com/contact-us).

The following table helps you understand the capabilities and features needed to achieve situation management. Use it to guide your decision process.

TABLE 6: Evaluation Table for ITOA Technologies
For more information, visit www.moogsoft.com.

U.S.: 140 Geary Street, Office 1000, San Francisco, CA 94108, +1 415 738 2299
U.K.: The Sanctuary, 23 Oakhill Grove, Surbiton KT6 6DU, +44 208 399 8266
NY: +1 646 843 0455
Singapore: +65 3158 4393