Root Cause Analysis for IT Incidents Investigation



Similar documents
Infasme Support. Incident Management Process. [Version 1.0]

EFFECTIVE ROOT CAUSE ANALYSIS AND CORRECTIVE ACTION PROCESS

Problem Management Fermilab Process and Procedure

With Windows, Web and Mobile clients Richmond SupportDesk is accessible to Service Desk operators wherever they are.

Problem Management Overview HDI Capital Area Chapter September 16, 2009 Hugo Mendoza, Column Technologies

The purpose of Capacity and Availability Management (CAM) is to plan and monitor the effective provision of resources to support service requirements.

Service Improvement. Part 1 The Frontline. Robert.Gormley@ed.ac.uk

INTERMEDIATE QUALIFICATION

Empowering the Enterprise Through Unified Communications & Managed Services Solutions

Service Desk/Helpdesk Metrics and Reporting : Getting Started. Author : George Ritchie, Serio Ltd george dot- ritchie at- seriosoft.

Problem Management Why and how? Author : George Ritchie, Serio Ltd george dot- ritchie at- seriosoft.com

Blend Approach of IT Service Management and PMBOK for Application Support Project

Incident Investigation Procedure

Technical Support. Technical Support. Customer Manual v1.1

Managed Desktop Support Services

T141 Computer Systems Technician MTCU Code Program Learning Outcomes

ITIL Event Management in the Cloud

Module 1 Study Guide

Learning from Our Mistakes with Defect Causal Analysis. April Copyright 2001, Software Productivity Consortium NFP, Inc. All rights reserved.

ITIL A guide to problem management

Exhibit F. VA CAI - Staff Aug Job Titles and Descriptions Effective 2015

Root Cause Analysis Concepts and Best Practices for IT Problem Managers

SysAidTM ITIL Package Guide. Change Management, Problem Management and CMDB

ReMilNet Service Experience Overview

IT Service Provider and Consumer Support Engineer Position Description

Karen Ferris MACANTA CONSULTING PTY LTD

Outsourcing BI Maintenance Services Version 3.0 January With SourceCode Inc.

ISO 9000 Introduction and Support Package: Guidance on the Concept and Use of the Process Approach for management systems

Effective Project Management of Team Based Business Improvement Projects

Knowledge Management 101 Better Support Decisions, Faster

ITSM Roles. 1.0 Overview

ITIL Introducing service operation

A Framework for Incident and Problem Management

MEASURING FOR PROBLEM MANAGEMENT

Wendell Bosen-MBA, CPCU, ARM-E Director, Risk Management Management & Training Corporation. Kristina Narvaez-MBA President & CEO ERM Strategies, LLC

Project Risk Management

itsmf USA Problem Management Special Interest Group

Root Cause Analysis for Customer Reported Problems. Topics

Introduction to ITIL: A Framework for IT Service Management

Higher Certificate in Information Systems (Network Engineering) * (1 year full-time, 2½ years part-time)

Service Desk Recruitment Strategy

Problem Management: A CA Service Management Process Map

Commonwealth of Massachusetts IT Consolidation Phase 2. ITIL Process Flows

White paper. Corrective action: The closed-loop system

ITIL Essentials Study Guide

ITSM Maturity Model. 1- Ad Hoc 2 - Repeatable 3 - Defined 4 - Managed 5 - Optimizing No standardized incident management process exists

Support and Service Management Service Description

Occupational safety risk management in Australian mining

Introduction to ITIL. Author : George Ritchie, Serio Ltd george dot- ritchie at- seriosoft.com

Use product solutions from IBM Tivoli software to align with the best practices of the Information Technology Infrastructure Library (ITIL).

ITIL and Data Center Migration

-Blue Print- The Quality Approach towards IT Service Management

Article 6 IT Physician Heal Thyself Building Bridges and Breaking Boundaries

How to save money with Document Control software

Accident / Incident Investigation Participants Guide

w w w. s t r a t u s. c o m

Integrating Project Management and Service Management

IT Service Management

Punjab National Bank Achieves 165% ROI with CA Technologies IT Management Solutions

Business Continuity Position Description

DOCUMENT HISTORY LOG. Description

Information Technology Division. Customer Service Support Center

Accident Investigation

Managing EHS Incidents Using Integrated Managment Systems. By Matt Noth

Problem Management. Contents. Introduction

INTERNAL AUDIT SOFTWARE BUYER S GUIDE

Siebel HelpDesk Guide. Version 8.0, Rev. C March 2010

An example ITIL -based model for effective Service Integration and Management. Kevin Holland. AXELOS.com

Effective Implementation of Problem Management in ITIL Service Management

Punjab National Bank achieves 165% ROI with CA Technologies IT management solutions

Operational Change Control Best Practices

A Symptom Extraction and Classification Method for Self-Management

HP Service Manager. Software Version: 9.34 For the supported Windows and UNIX operating systems. Processes and Best Practices Guide

Backlog Management Index (BMI) Evaluation and Improvement An ITIL Approach

Process Description Incident/Request. HUIT Process Description v6.docx February 12, 2013 Version 6

Training and Development (T & D): Introduction and Overview

Physicians are fond of saying Treat the problem, not the symptom. The same is true for Information Technology.

POSITION PROFILE Payroll Systems Administrator. Position Summary. Position Statement. Corporate Vision. Constructive Culture

WHITE PAPER. iet ITSM Enables Enhanced Service Management

Process for reporting and learning from serious incidents requiring investigation

Project Quality Management. Project Management for IT

Human Resource Services PO Box Classification and Compensation Gainesville, FL Fax

Policies, Procedures, Guidelines and Protocols

TIME MANAGEMENT TOOLS AND TECHNIQUES FOR PROJECT MANAGEMENT. Hazar Hamad Hussain *

Implementation of ITIL in a Moroccan company: the case of incident management process

White Paper Case Study: How Collaboration Platforms Support the ITIL Best Practices Standard

An Introduction to the PRINCE2 project methodology by Ruth Court from FTC Kaplan

ACHIEVING BUSINESS VALUE THROUGH A SUPERIOR CUSTOMER EXPERIENCE

Application Support Solution

SDI SD Service Desk Manager Qualification Exam.

WHITEPAPER. 10 Simple Steps to ITIL Network Compliance

Leveraging a Help Desk as part of a Hyperion Center of Excellence

ITIL A guide to incident management

White Paper. A guide to managing food safety successfully. Executive Summary

Program Title: Advanced Project Management Knowledge, Skills & Software Program ID: # Program Cost: $4,690 Duration: 52.

Kent State University s Cloud Strategy

HR Support Services Labor Categories. Attachment D

quality of service Screenshots

Teaching the Business Management Study Design

ITIL & PROCESSES. Basic Training

Transcription:

Root Cause Analysis for IT Incidents Investigation Still trying to figure out what went wrong? Even IT shops with formal incident management processes still rely on developers and/or support specialists to figure out based on experience and personal expertise what went wrong with the system. Executives and users are therefore entitled to ask: how do you know this is indeed the cause of our problem? This article provides an answer: formal root cause analysis employing proven techniques. Operational systems maintenance represents by far the bulk workload for any IT department, except for organisations where IT products and services is their core business line. However, in the glamorous world of projects, once a system is deployed in production it becomes unattractive both for specialists and for management. There is no success to be achieved anymore, but only maintenance headaches fixing operational issues to keep the lights on. Considering the above, it is not surprising that in most organisations areas like IT incident investigation and resolution are still very fuzzy. The incident may be captured, monitored and the results reported using standardised forms, most of the time even using a help-desk or trouble tickets software system to automate it and sometimes even a formal process methodology like ITIL. But the core activity is still representing by a technical specialist nosing around the system trying to figure out what is wrong based on previous experience and personal expertise. On the other hand, the incidents / accidents investigation topic in other areas has quite an extensive literature, describing a large number of analytical techniques. However, as the main streams are nuclear, electrical, workplace safety and so on, most techniques described are quite exhaustive. And in the IT world where systems downtime is counted in minutes, who has time for a weeks / months long investigation? This article s intent is to bridge the gap and indicate a suitable methodology for the actual investigation of IT production support incidents (not for the entire incidents handling). It introduces the Root Cause Analysis Methodology and several techniques that can be used for the investigation of any IT incident or support requests, from major production down to nice-to-have enhancements, without exhaustive details. Further reading is required to grasp the details of chosen techniques, but it will provide the IT specialist with a starting point in sorting out relevant material from the multitude of books and papers available for general incident analysis. 1. Root Cause Analysis Overview Root Cause Analysis (RCA) is a methodology for identifying and correcting the most important reasons for functional and operational problems. Root cause analysis uncovers the fundamental issues (root causes) that generate a problem, as opposed to troubleshooting and problem solving that seek immediate solutions to resolve the user visible symptoms. A root cause is usually defined as a specific reason or group of reasons that can be logically identified, are under management control to fix and effective recommendations for preventing their recurrences can be generated. For example, the user spilling coffee on the desk, which damaged the workstation, is not a root cause because it is no efficient recommendation that can be made to prevent it from happening again. Most problems identified during IT systems operation have multiple approaches to resolution, generally requiring different levels of effort to apply. In many shops it is a common tendency to choose the solution that is the most expedient in terms of dealing with the situation and keep the system operational. In doing this, the symptoms are usually treated, rather than the underlying cause actually responsible for the problem, so in most cases the issue reoccurs and have to be dealt with repeatedly.

With the goal of minimising the operational costs in mind, and correspondingly the Total Cost of Ownership (TCO), a much more significant emphasis should be given to root cause analysis and implementation of longlasting solutions rather than temporary fixes. It is conceivable that emergency situations such as production stoppage will still be approached from a troubleshooting perspective, leading to quick fixes. However, the incident should not be closed before a proper root cause analysis is performed and a long-lasting solution identified and applied to prevent re-occurrence. The Root-Cause Analysis as employed in engineering disciplines uses specific terminology, completely portable to Information Technology discipline. The literature regarding Root Cause Analysis usually describes the main RCA dictionary in terms of: Occurrence: An event or condition that is not within normal system functionality or expected behavior. Event: A real-time factual occurrence that could seriously impact the system operation. Condition: Any system state, whether precursor or resulting from an event, that may have adverse implications for the normal system s functionality. Cause (also Causal Factor): A condition or an event that results in or participates in the occurrence of an effect. They can be classified as: o o o Direct Cause: A cause that resulted in the occurrence. Contributing Cause: A cause that contributed to an occurrence but would not have caused it by itself. Root Cause: The cause that, if corrected, would prevent recurrence of this and similar occurrences. The root cause usually has generic implications to a broad group of possible occurrences, and it is the most fundamental aspect of the cause that can logically be identified and corrected. Causal Factor Chain (Sequence of Events and Causal Factors): A cause and effect sequence in which a specific action creates a condition that contributes to or results in an event. This creates new conditions that, in turn, result in another event. Earlier events or conditions in a sequence are called upstream factors. 2. Procedural Steps The basic reason for investigating and reporting the causes of occurrences is to enable the identification of corrective actions adequate to prevent similar recurrences. An added benefit of an effective RCA is that, over time, the root causes identified across the population of occurrences can be used to target major opportunities for improvement. To achieve maximum efficiency, the Root-Cause Analysis should be performed immediately following the event occurrence, when all significant information can be collected to support the investigation. However, practical limitations such as resources availability may require later scheduling of less critical events investigation. In such case, it is necessary to gather all available information regarding the system state and user actions and safeguard it for later analysis. Root Cause investigation and reporting process typically includes five distinct phases, even if some overlapping activities might occur between phases: Phase I. Investigation: The Investigation Phase is focused to discover, in a value-neutral manner, facts that show how an incident occurred, what actually happened, without any judgement of value. Investigation deals with pure facts, not with interpretations. It is important to begin the data collection as soon as possible after the event occurrence to ensure that as much data as possible is available. The information that should be collected consists of conditions before, during, and after the occurrence, personnel involvement (including actions taken), environmental factors and any other information relevant to the occurrence.

Phase II. Analysis: The main goal is to discover reasons that explain why an incident occurred, by placing the purely factual representation of the incident within the context of the IT system to compare what actually happened against what should have happened, at any point during the incident. Any root cause analysis method may be used, some of them being described in Section 3. All RCA techniques include the following basic steps: Identify the problem and its impact Identify the causes (conditions or actions) immediately preceding and surrounding the problem Identify the reasons why the causes in the preceding step existed, working back to the root cause (the final point in the assessment phase). There is a broad range of methods to perform Root Cause Analysis, more or less applicable to IT-type incidents. Section 3 RCA Methods presents four of the most common analytical methods, suitable for a broad range of IT incidents, concerning both software and infrastructure. Regardless of the method employed, the Analysis Phase will conclude with recommendations that could range from user training to new development or updates to existing system components, as well as an order of magnitude effort estimate. Phase III. Decision: The objective of Decision Phase is to identify actions and lessons learned to correct or eliminate the root causes of an incident, in order to achieve long-term, effective results. Implementing effective corrective actions for each cause reduces the probability that a problem will recur and improves system reliability and stability. The summary table identifies the actions to be taken for each Root Cause identified in the Analysis Phase. It includes both short-term actions (workarounds) and definitive solutions to permanently eliminate the root causes that created the incident occurrence. A sample Root Cause Analysis Summary table is presented in the following figure: Root Cause Analysis Summary Problem Statement: Root Cause Contributing Short-Term Definitive Resolution Next payrun date not entered in system "No payrun date yet" not communicated to staff N/A Update policy to establish deadline to define payrun date "No record retrieved" error not handled by calling Incident Causes Cannot access page in Payroll User access to Payroll -> page Module not conform with Detail Corrective and Preventive Actions Manually enter in the system next payroll job execution date # Modify to validate if a record has been returned as 'next payment' # Include 'What If' analysis in Detail review # Review test checklist for Detail compliance Error not handled in presentation page User access to Payroll -> page Page not conform with Detail Manually enter in the system next payroll job execution date # Modify page to display meaningfull message if no 'next payment' is available # Include 'What If' analysis in Detail review # Review test checklist for Detail compliance Figure 1 Sample RCA Summary Phase IV. Communication: All interested parties should be notified regarding the investigation conclusions. This includes discussing and explaining the results of the analysis, including corrective actions proposed for

implementation, with management and personnel involved in the occurrence. In addition, consideration should be given to providing information of interest to other stakeholders and users regarding the occurrence and recommended course of action. Phase V. Implementation: The Implementation Phase includes the execution of the actions identified during decision phase, as well as follow-up activities (e.g. effectiveness review) to determine if the corrective actions were effective in preventing subsequent occurrences of the event. 3. RCA Methods This section summarily presents 4 of the most commonly used root cause analysis methods described in specialized incident investigation literature. The selection criterion of the presented methods was their applicability to a wide range of IT incident investigation. For example, Barrier Analysis method very useful for security penetration incidents was not included due to its limited benefits for software related incidents. For each method a brief description and a sample are presented in the following paragraphs. Cause-Effect Analysis: The cause-effect analysis uses fishbone (Ishikawa) diagrams to illustrate how various causes can be linked to an identified effect. There may be a series of causes that can be identified, one leading to another. This series should be pursued until the fundamental, correctable cause has been identified. An example of Ishikawa diagram is presented below: Human / Process Security Middleware Network 'No payrun date yet' was not communicated to staff Next payrun date not entered in system Client request to find out next payment date User access to Payroll - > page Next_payrun record not created in system 'No record retrieved' error not handled by calling Error not handled in presentation page Cannot access page in Payroll System did not save next_payrun record Page not conform with Detail Error / help messages not explicit Module not conform with Detail Batch Scheduling Database Tier Business Logic Tier Presentation Tier Figure 2 - Sample Ishikawa Diagram Events and Causal Factor Analysis: Events and Causal Factor Analysis consists of the identification of a series of tasks and/or actions in time sequence, as well as the environmental conditions of the tasks leading to an incident occurrence. The resulting Events and Causal Factor chart provides a graphical representation of the timeline and relationships of the events and causal factors, including more details than Cause-Effect Analysis. An example of an ECF diagram is presented in the following figure:

High year-end communication volume Compressed development timeline 'No payrun date yet' not communicated Error / help messages not explicit Next payrun date not entered in system scheduling table Next_payrun record not created in system User access to Payroll -> Next Payment page 'No record retrieved' error not handled by calling Error not handled in presentation page Cannot access page in Payroll System did not save next_payrun record Need to see next payment date Module not conform with Detail Page not conform with Detail Detail not applied Module testing not Detail not applied Page testing not Figure 3 - Sample ECF Diagram Fault Tree Analysis: FTA involves backward reasoning through successive refinements from general to specific. As a deductive methodology it examines preceding events leading to failure in a time-driven relational sequencing. The resulting fault tree is a graphical representation of the potential combinations of failures that generated the incident, as shown in the following figure. The tree starts with a top event representing the analysed incident and decomposes it into contributory events and their relationships until the root causes are identified.

Cannot access page in Payroll AND Next_payrun record not created in system Error not handled in presentation page 'No record retrieved' error not handled by calling User access to page for ODSP cases OR OR OR Next payrun date not entered in system scheduling table System did not save next_payrun record Page not conform with Detail Module not conform with Detail Client request to find out next payment date AND AND 'No payrun date yet' not communicated to staff Developer did not follow Detail Page testing not Developer did not follow Detail Module testing not Figure 4 - Sample FTA Diagram Causal Factor Charting: CFC provides a structure for investigators to organize and analyze the information gathered during the investigation as a sequence diagram with logic tests that describes the events leading up to the incident occurrence. A sample Causal Factor Chart is presented in the following figure.

Was it saved before? Recent release? Recent release? System did not save next_payrun record Need to see next payment date Next_payrun record not created in system User access to Payroll -> Next Payment page 'No record retrieved' error not handled by calling Error not handled in presentation page Cannot access page in Payroll Next payrun date not entered in system scheduling table Page not conform with Detail Module not conform with Detail Need better procedures? Was part of the test scenario? Was test executed? Was part of the test scenario? Was test executed? 'No payrun date yet' not communicated to staff Page testing not Module testing not Did the TL provide guidance? Did the TL provide guidance? Developer did not follow Detail Developer did not follow Detail Figure 5 - Sample CFC Diagram 4. Summary Employing a formal Root Cause Analysis process is an up-front investment in reducing the overall Cost of Ownership of the IT system. By spending the extra effort to identify and correct the root causes of an incident, not only the visible causes, avoids future occurrences requiring investigation and correction effort (on top of potential revenue loss from systems unavailability). Each investigation method adopted by the organisation to perform Root Cause Analysis is likely to require customisation to perfectly fit the IT environment specifics. However, once a palette of analytical tools is available to (and applied by) the support staff the results are almost immediately visible through increased system stability, less overall support effort and increased users satisfaction. Root Cause Analysis still relies heavily on personal experience and expertise, but employing formal techniques proves to users and executives that the root cause, not just a cause, has been discovered and fixed so the incident will nor reoccur. This article only provides a jump-start in formal Root Cause Analysis for IT incidents by introducing the main process and four suitable techniques, but the reader is strongly advised to consult further references for details.

5. References N/A Root Cause Analysis Guidance Document (U.S. Department of Energy, February 1992) N/A - Issues Management Guidance Handbook (Los Alamos National Laboratory, August 2004) A D Livingston, G Jackson & K Priestley - Root causes analysis: Literature review (WS Atkins Consultants Ltd, Contract Research Report for Britain's Health and Safety Executive, 2001) J.R. Buys & J.L. Clark - Events and Causal Factors Analysis (SCIENTECH, Inc., Technical Research and Analysis Center, August 1995) James J. Rooney & Lee N. Vanden Heuvel - Root Cause Analysis For Beginners (American Society for Quality, Quality Progress, July 2004) Ted S. Ferry - Modern Accident Investigation and Analysis, second edition, (John Wiley and Sons, 1988). About the author: George Jucan, MSc, PMP, OCP is the founder and acting CEO of Open Data Systems Inc., a consulting services company based in Toronto, Ontario. He has over 12 years of progressive technical and management experience, specialized in project and software development methodologies, as well as process and organizational (re)engineering. A regular author of technical and methodological articles, George Jucan is also a member of PMI s Standards Committee for the Update to the Government Extension of the PMBOK Guide. He can be contacted through the Open Data Systems web site http://www.opendatasys.com or directly at gjucan@opendatasys.com.