INCIDENT LIFECYCLE. It's more than just ACKing





ITIL calls it the Incident Lifecycle. Others know it as the process of solving IT issues in real time, as they happen. You might call it a long night without sleep. Whatever the term, following an alert from the moment it comes in through to resolution can be a long process. Fortunately, the right tools make this challenge far less daunting. Once you break the lifecycle down into its parts, fighting the good fight becomes a matter of choosing a solution that works with each part.

[Figure: the six phases of the incident lifecycle: 01 Alerting, 02 Triage, 03 Investigation, 04 Identification, 05 Resolution, 06 Documentation]

ALERTING 01

"I received notification that a critical alert has come in."

If you break down a typical incident resolution into phases, you see that the smallest portion of time is generally spent being alerted to the problem. On average, our VictorOps data shows that only 5% of total time to resolution (TTR) has anything to do with alerting or escalation of problems. There are incidents where a team member does not respond, but this is generally more about the team member than about the platform finding him or her.

Historically, the alerting phase was a longer portion of TTR, back when teams actually carried pagers, as those systems were quite slow. Now that team members have smartphones, human behavior has changed to be more always-on and engaged with that device. Alerting has come along for the ride on WAN, LAN and SMS data. Nonetheless, even a perfect zero-time alerting platform that could find people instantly can only affect average TTR by a very small percentage, even in the best-case scenario.

Push, SMS and phone notifications (with customizable ringtones) will make sure that no one misses an alert. Additionally, rich alerts provide context around the incident, and baked-in solutions make the on-call person's job easier.
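The point about zero-time alerting is easy to check with a little arithmetic. The sketch below uses the 5% alerting share quoted above; the 60-minute incident and the function name minutes_saved are illustrative assumptions, not figures or tooling from our data.

```python
# Sketch only: the 5% alerting share comes from the text above; the
# 60-minute incident and the function name are illustrative assumptions.


def minutes_saved(total_ttr_minutes: float, phase_share: float,
                  reduction: float = 1.0) -> float:
    """Return the minutes removed from TTR when one phase is shortened.

    phase_share: fraction of TTR spent in the phase (0.05 for alerting).
    reduction:   fraction of that phase eliminated (1.0 means the phase
                 takes no time at all).
    """
    if not 0.0 <= phase_share <= 1.0 or not 0.0 <= reduction <= 1.0:
        raise ValueError("phase_share and reduction must be between 0 and 1")
    return total_ttr_minutes * phase_share * reduction


ALERTING_SHARE = 0.05  # share of TTR attributed to alerting and escalation

if __name__ == "__main__":
    total = 60.0  # hypothetical incident length, in minutes
    saved = minutes_saved(total, ALERTING_SHARE)
    print(f"Instant alerting saves about {saved:.1f} of {total:.0f} minutes")
```

Under these assumptions, even perfectly instant alerting trims only a few minutes from an hour-long incident, which is why the later phases deserve most of the attention.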

TRIAGE 02

"I know there's a problem, but I have no idea who or what is affected."

This is the phase of the incident lifecycle that can cause the most stress. For someone new to the on-call process, finding out exactly what is wrong by calling someone who may or may not be awake, and may or may not have the right answer, is anxiety-inducing, and even more so if the incident happens at 3 a.m. Based on our internal customer data, 18% of TTR is spent simply getting an initial person (or subsequent team member) up to speed with what is happening. This information is rarely contained completely in the alert metadata; it requires seeing other markers in the system as well. We call this situational awareness, and having situational awareness in the platform can have a big impact on TTR. The faster you can get the right eyes on the problem, the faster you can solve it.

The incident timeline provides a single view of all activity surrounding the incident, including alerts, paging and chat messages, along with the ability to reroute the alert to a different person or team. And with intelligent routing that can send an alert to the person who knows how to fix the problem, triage becomes less stressful.
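To make the routing idea concrete, here is a minimal sketch of rule-based routing. It is not how any particular platform implements intelligent routing; the Rule class, the route function, the alert field names and the team names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: if alert[field] == equals, page this team."""
    field: str
    equals: str
    team: str


def route(alert: dict, rules: list, default_team: str) -> str:
    """Return the team for the first matching rule, else the default team."""
    for rule in rules:
        if alert.get(rule.field) == rule.equals:
            return rule.team
    return default_team


if __name__ == "__main__":
    # Illustrative rules and alert; none of these names come from a real system.
    rules = [
        Rule(field="service", equals="mysql", team="database"),
        Rule(field="service", equals="nginx", team="web"),
    ]
    alert = {"service": "mysql", "severity": "critical"}
    print(route(alert, rules, default_team="ops"))
```

The first-match-wins order keeps the behavior predictable, and the default team means that an alert nobody planned for still reaches a human.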

INVESTIGATION 03

"I need help digging into the issue."

The largest share of TTR, a full 40%, falls into what we call the Investigation phase of incident resolution. Investigation requires the on-call person to play detective, following up on possible leads while ruling out the usual suspects. This phase includes logging into the system, tailing logs and consulting performance monitoring tools. It also involves consulting internal documentation resources such as wikis or ticketing systems. Anyone can triage, but it takes a higher level of thinking to figure out the one thing that's broken and may be causing everything else to break.

If triage has been successful, the on-call engineer has already figured out who else needs to be involved in this phase. Getting other people involved early means that they can help look at the series of events and possibly recognize a pattern you may have missed.

Our timeline allows for easy collaboration. You can chat, send private messages or mention an entire team in a question posted to the timeline. Annotations attached to alerts can provide much-needed direction as to how the problem was solved last time or whom to contact to find an answer.

IDENTIFICATION 04

"Everything will be better if I fix this one thing."

Once you know what the problem is, you just need to find the answer. This is much easier said than done. Depending on what your company's documentation is like, getting your hands on an up-to-date protocol might just be the hardest part of solving the problem. Remediation docs may be stored in an internal wiki, a spreadsheet or, in some cases, someone's head.

Today's systems are so complex and ever-changing that they require new ways of maintaining them. Add to that the fact that many teams now have a varied group of individuals taking part in the on-call rotation, so a database team member handling an alert may need to solve problems outside their domain expertise.

Imagine if every alert automatically surfaced contextual information and suggested solutions to resolve the problem. Knowing exactly how to solve the problem is one surefire way of reducing TTR. That means having real-time remediation data right where you need it, right when you need it. The ability to annotate alerts with links to internal runbooks, graphs and notes about how the problem was solved means that the on-call person can solve the problem much faster. Knowing that the solution will be easy to find is a major game changer when it comes to making on-call suck less.
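One way to picture annotated alerts is as a lookup from an alert key to the runbooks, graphs and notes attached to it. The sketch below is a simplified illustration under that assumption; the AnnotationStore class, the alert key format and the example.com URL are hypothetical and do not describe a real product API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    kind: str  # "runbook", "graph" or "note"
    text: str  # a link or a free-form note


@dataclass
class AnnotationStore:
    """Hypothetical in-memory store that maps alert keys to annotations."""
    _by_key: dict = field(default_factory=dict)

    def annotate(self, alert_key: str, kind: str, text: str) -> None:
        """Attach one annotation to every future alert with this key."""
        self._by_key.setdefault(alert_key, []).append(Annotation(kind, text))

    def context_for(self, alert_key: str) -> list:
        """Return the annotations to show alongside an incoming alert."""
        return list(self._by_key.get(alert_key, []))


if __name__ == "__main__":
    store = AnnotationStore()
    # Illustrative key and link; both are made up for this example.
    store.annotate("db.replication_lag", "runbook",
                   "https://wiki.example.com/runbooks/replication-lag")
    store.annotate("db.replication_lag", "note",
                   "Last time: restarting the replica cleared the lag.")
    for annotation in store.context_for("db.replication_lag"):
        print(f"[{annotation.kind}] {annotation.text}")
```

Keying annotations by the alert rather than by the person means that whoever is on call, database expert or not, sees the same remediation notes.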

RESOLUTION 05

"I'm fixing it."

10% of TTR falls into a category called Problem Resolution. This is represented by team members performing system actions to fix the problem that started the incident. Unfortunately, it also means waiting for systems to recover and verifying that the root cause was found, which often extends team involvement longer than desired.

The Problem Resolution phase is perhaps the largest potential lever in a truly collaborative system. To reduce TTR in the Resolution phase, you need a feature set that self-documents what teams do to solve the problem. This is, in a sense, the heart of collaboration: the ability not only to reduce TTR during the current resolution cycle, but also to capture that knowledge and pay it forward next time. Fix the one thing. Watch the other things get better. Update documentation so you can easily fix that one thing again in the future.

With bidirectional integrations with HipChat and Slack, it's easy to collaborate and manage many aspects of your infrastructure from the comfort of your firefighting chat room.

DOCUMENTATION 06

"I don't want to have to deal with that again."

After an incident is resolved, on-call best practices mandate that a post-mortem, or retrospective, take place. An accurate, comprehensive post-mortem is an essential tool for communicating with internal and external stakeholders. More importantly, it helps you prepare for similar incidents and, ideally, prevent them from occurring again.

The ideal report would pull together everything that happened during the entire incident lifecycle, with a single authoritative clock that gives context to the event and includes all relevant communications. That report would also be customizable, allowing you to edit out unimportant details, add notes where applicable and provide a high-level snapshot of exactly what the incident entailed.

Want to know even more about your on-call process? Add reporting around incident frequency, incident metrics and user metrics, and you have a much larger picture of what's working with your alerting and what's not. Our reporting gives you the ability to improve your process around incident resolution and helps to facilitate documentation cleanup by letting you make notes about the accuracy and helpfulness of annotated alerts while in the moment.
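The "single authoritative clock" idea can be sketched as merging events from every source onto one UTC timeline. This is an illustration only, assuming each source supplies timezone-aware timestamps; the Event class, the build_timeline function and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # must be timezone-aware
    source: str    # e.g. "alert", "page", "chat"
    message: str


def build_timeline(*streams):
    """Merge per-source event lists into one list ordered on a UTC clock."""
    merged = []
    for stream in streams:
        for event in stream:
            if event.at.tzinfo is None:
                raise ValueError("timestamps must be timezone-aware")
            # Normalize every timestamp to UTC so all sources share one clock.
            merged.append(Event(event.at.astimezone(timezone.utc),
                                event.source, event.message))
    return sorted(merged, key=lambda event: event.at)


if __name__ == "__main__":
    utc = timezone.utc
    denver = timezone(timedelta(hours=-7))  # chat logged in local time
    # Made-up sample events for illustration.
    alerts = [Event(datetime(2015, 1, 1, 3, 0, tzinfo=utc), "alert", "disk full")]
    pages = [Event(datetime(2015, 1, 1, 3, 1, tzinfo=utc), "page", "paged on-call")]
    chat = [Event(datetime(2014, 12, 31, 20, 5, tzinfo=denver), "chat", "looking now")]
    for event in build_timeline(chat, alerts, pages):
        print(event.at.isoformat(), event.source, event.message)
```

Normalizing before sorting is the important step: without it, a chat message logged in local time can appear hours out of order relative to the alerts it discusses.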
