Incident Management Best Practices
Chris Pope
Global Service Delivery Manager, Global Managed Services
Column Technologies
February 2009

Agenda & Objectives
  1. Incident Management Overview
  2. Changes in Incident Management & Service Operation
  3. Real World Incident Management and what really happens
  4. Improvement
  5. Use Case
  6. Metrics
  7. Documentation Examples
  8. Tips
  9. Questions

Incident Management
What does ITIL say?
Definition
  An unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.
  Incident management is the process for dealing with all incidents. This can include failures, questions or queries reported by the users (usually via a telephone call to the service desk), reported by technical staff, or automatically detected and reported by event monitoring tools.
Goal / Objective
  The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained.

ITIL V3 and Service Operation
New terms and definitions added that define inputs and triggers to Incident Management
Event Management
  An event can be defined as any detectable or discernible occurrence that has significance for the management of the IT infrastructure or the delivery of IT services, and evaluation of the impact a deviation might cause to the services. Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.
Types of Events
  Informational: an event that does not require any action and does not represent an exception (a user logs on successfully, a batch job completes, a device comes online).
  Warning: an event that is generated when a service or device is approaching a threshold (memory utilization, network collision rate is x).
  Exception: an event indicating that a service or device is currently operating abnormally (however that has been defined). Typically this means an SLA/OLA has been breached and the business is being impacted.

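As a rough illustration of the three event types, the sketch below classifies a single metric reading (for example memory utilization in percent). The function name, the parameter names and the idea of a separate "warning level" below the threshold are assumptions made for this example; they are not taken from ITIL or from any monitoring tool.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify_event(value: float, warning_level: float, threshold: float) -> EventType:
    """Classify one metric reading into an event type.

    warning_level and threshold are hypothetical, locally agreed values:
    a reading at or above threshold is treated as abnormal operation,
    a reading at or above warning_level as approaching the threshold.
    """
    if warning_level > threshold:
        raise ValueError("warning_level must not exceed threshold")
    if value >= threshold:
        return EventType.EXCEPTION
    if value >= warning_level:
        return EventType.WARNING
    return EventType.INFORMATIONAL


# Example: memory utilization with an assumed warning level of 80% and threshold of 95%
print(classify_event(85.0, warning_level=80.0, threshold=95.0))
```

In practice only warnings and exceptions would normally be candidates for raising an incident; informational events are simply logged.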
ITIL V3 and Service Operation
Request Fulfillment
  The term Service Request is used as a generic description for many varying types of demands that are placed upon the IT department by the users. Many are actually small changes (low risk, frequently occurring, low cost, etc.), such as a password reset or installing additional software, or maybe just a question requesting information.
Goal / Objective
  To provide a channel for users to request and receive standard services for which a pre-defined approval and qualification process exists.
  To provide information to users and customers about the availability of services and the procedure for obtaining them.
  To source and deliver the components of requested standard services.
  To assist with general information, complaints and comments.

Day to Day Incident Management
  Structured, documented, repeatable process
  Clear guidelines on what to do and when
  Everybody knows what they are doing and their role
  Everybody has been trained on the latest version of the tools
  The boss is in his office waiting for updates
  The business understands what is happening and that IT is working to restore service as soon as possible

Real Day to Day Incident Management
  Emails, phone calls, network events, desktop issues
  Day to day pressures of projects, tasks, changes, new initiatives
  Events Incidents Ticket? Phone calls?
  Unable to identify impact
  Multiple events
  Time of day?
  Difficult to focus on service restoration
  Poorly integrated tools consume time and resources
  Data quality challenges hinder impact and risk analysis
  Poor communication processes consume time and resources
  Increased MTTR and service restoration time

How can we improve the process?
Training
  Major focus for process improvement
  Roles and responsibilities clearly defined
  People know what to do and when
  Continuous improvement
Communication
  Clearly define communication tools
  Escalation mechanism
  Ensuring your customer is aware at all times
  Full contact, full exposure
Tools, Documentation
  Ease of use
  Focused and relevant
  Fit for use and purpose
  Simple, reliable, readily available
  Up to date
  Has clear ownership
  Quality & consistency
Integrations
  Drive decisions more intelligently
  Single pane of glass
  What can help you reduce MTTR and restore service quickly
Data Quality
  Is the data accurate?
  Clear ownership
  Refresh cycle and maintenance

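Since reducing MTTR is the stated aim, it helps to be explicit about how the figure is calculated. The following is a minimal sketch that takes MTTR to mean the mean time to restore service, computed from (opened, restored) timestamp pairs. The function name and the record layout are assumptions for illustration; a real ticketing tool will expose these fields in its own way.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_restore(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average the opened-to-restored duration over resolved incidents.

    Each item is a hypothetical (opened, restored) pair of timestamps.
    """
    if not incidents:
        raise ValueError("at least one resolved incident is required")
    total = timedelta()
    for opened, restored in incidents:
        if restored < opened:
            raise ValueError("restored time precedes opened time")
        total += restored - opened
    return total / len(incidents)


# Example with made-up timestamps: one 30-minute and one 90-minute incident
sample = [
    (datetime(2009, 2, 2, 9, 0), datetime(2009, 2, 2, 9, 30)),
    (datetime(2009, 2, 3, 14, 0), datetime(2009, 2, 3, 15, 30)),
]
print(mean_time_to_restore(sample))
```

The result is only as good as the timestamps behind it, which is why data accuracy and clear ownership matter for this metric.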
Use Case
Background
  Global financial company
  25000+ employees
  3500+ applications
  20000+ servers
Specifics
  IT division with 1100+ developers
  300+ applications support the trading floor
  24x7 globally operational, trading on all markets
  Focused on raising the bar; new functionality supersedes existing bugs/issues
  Little to no incident management structure in place
Action Taken
  1. Identified key stakeholders by function for both the business and IT
  2. Empowered stakeholders to be able to make decisions and resolve conflicts
  3. Instigated a global training plan, mandatory attendance
  4. Agreed on what constitutes a Low, Medium, High, Critical incident
  5. Integrated tools focusing on usability/simplicity
  6. Established a robust program to Manage IT by Metrics
  7. Defined clear escalation paths
  8. Instigated a culture change: it's OK to have a High/Critical Incident

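One common way to make an agreed Low / Medium / High / Critical scale repeatable is to derive it from impact and urgency, as ITIL suggests. The sketch below is purely illustrative: the 1 to 3 scales and the mapping from scores to priorities are assumptions, not the definitions agreed in this use case.

```python
# Hypothetical mapping from (impact + urgency) score to priority label.
# Both inputs are assumed to be rated 1 (lowest) to 3 (highest).
_PRIORITY_BY_SCORE = {
    2: "Low",
    3: "Medium",
    4: "Medium",
    5: "High",
    6: "Critical",
}


def incident_priority(impact: int, urgency: int) -> str:
    """Return an illustrative priority label for an incident."""
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must each be 1, 2 or 3")
    return _PRIORITY_BY_SCORE[impact + urgency]


# Example: highest impact with medium urgency
print(incident_priority(impact=3, urgency=2))
```

Whatever mapping is chosen, the point from the action list stands: it has to be agreed with the stakeholders, not just configured in a tool.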
The Metrics
The Metrics (cont)

How can I do this?
  This is not the first time it's been done!
  Utilize ITIL where it makes sense; if it doesn't, don't use it!
  Training is the big ticket: establish buy-in and value, you then have accountability
  Concentrate on 3-4 key measures or metrics and focus on them, driving the increase in incidents being recorded; it's OK to have a High/Critical Incident
  Be a wingman
  Establish accountability and manage it
  Celebrate the success, learn from the errors, don't criticize if someone gets it wrong
  Distribute guides, wallet cards, desk reminders, screensavers to enforce the message and the how
  Don't hesitate, escalate

Escalation Criteria
Incident Timeline

Start the Process Early
  Don't wait for services or CIs to break in production before figuring out what you need to know or do with them
  Start early in the lifecycle, before services are in a Production status
  Establish standards for monitoring, platform, management and process
  Permit To Operate (PTO)
  Encourage escalation
  Empower people to make a decision; if it's the wrong one, review and follow up

Questions