Closed Loop Incident Process
From fault detection to closure

Andreas Gutzwiller, Presales Consultant, Hewlett-Packard (Schweiz), HP Software and Solutions

© 2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Closed Loop Incident Process Solution
The CLIP solution is:
- A highly automated fault-detection-to-recovery solution
- Focused on end-to-end service availability and performance
- Designed to reduce mean time to recovery (MTTR) and improve mean time between system failures (MTBF)
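To make the MTTR/MTBF goal concrete, here is a minimal Python sketch using the standard steady-state relation availability = MTBF / (MTBF + MTTR). The function name and the hour figures are illustrative assumptions, not numbers from this presentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 500 hours on average.
before = availability(mtbf_hours=500, mttr_hours=4)  # 4 hours to recover
after = availability(mtbf_hours=500, mttr_hours=1)   # 1 hour to recover
print(f"availability before: {before:.4%}, after: {after:.4%}")
```

With MTBF held constant, shortening recovery time alone raises availability, which is why the solution targets both measures.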
Agenda
1. Event and Incident Processes
2. Closing the Loop
3. Architecture
4. Why CLIP
ITILv3 Linkage of Event & Incident Management
Neither process can stand alone in today's IT environments.

Definitions:
- Event: a change of state or alert that has significance for the management of a Configuration Item (CI) or IT service
- Incident: an unplanned interruption, or reduction of quality, of an IT service
- IT Service: a deliverable of people, processes and technology that supports a customer's business processes

Event Management:
- Responsible for managing events throughout their lifecycle
- Main activity of IT Operations
- Flow: Event -> Filtered/Correlated -> Resolve or forward to Incident -> Close

Incident Management:
- Includes any event which disrupts, or could disrupt, a service
- Incidents are raised by users or IT staff
- Flow: Incident -> Categorize/Prioritize -> Diagnose -> Resolve -> Close
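The incident flow on this slide (Incident -> Categorize/Prioritize -> Diagnose -> Resolve -> Close) can be read as a strictly ordered state machine. The sketch below is one possible way to model it in Python; the class and state names are hypothetical and are not taken from any HP product.

```python
from enum import Enum


class IncidentState(Enum):
    LOGGED = "logged"
    CATEGORIZED = "categorized/prioritized"
    DIAGNOSED = "diagnosed"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Each state may only advance to the next step of the flow on the slide.
NEXT_STATE = {
    IncidentState.LOGGED: IncidentState.CATEGORIZED,
    IncidentState.CATEGORIZED: IncidentState.DIAGNOSED,
    IncidentState.DIAGNOSED: IncidentState.RESOLVED,
    IncidentState.RESOLVED: IncidentState.CLOSED,
}


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.state = IncidentState.LOGGED
        self.history = [self.state]

    def advance(self) -> IncidentState:
        """Move to the next lifecycle step; a closed incident cannot advance."""
        if self.state not in NEXT_STATE:
            raise ValueError(f"incident is already {self.state.value}")
        self.state = NEXT_STATE[self.state]
        self.history.append(self.state)
        return self.state


incident = Incident("Funds transfer is slow")
while incident.state is not IncidentState.CLOSED:
    incident.advance()
print([state.value for state in incident.history])
```

Keeping the full history on the object is a design choice that makes it easy to audit how long each step took once timestamps are added.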
ITIL Areas Involved in CLIP

Operations Bridge (aka NOC):
- Central coordination point
- Manages various classes of events
- Detects incidents
- Manages routine operational activities
- Reports on status and performance
- May provide first-level support for those events which generate an incident

Service Desk:
- Single central point of contact for all users of IT
- Logs and manages all incidents, service requests and access requests
- Provides the interface to all other Service Operation processes and activities

The Service Desk is not typically involved in Event Management unless the Service Desk and Operations Bridge have been combined.
Traditional Incident Management
From diagnosis to resolution

Process steps:
1. Identify service performance degradation
2. Troubleshoot problem to isolate root cause
3. Identify actionable condition / changes to be implemented
4. Create TT/RFC to implement change
5. Implement and automate change to close RFC
6. Update CMS (Federated CMDB)

Diagram callouts:
1. Service performance notification
2. Gather data to assign SME (subject matter expert)
3. Bouncing the incident
4. Ticket is finally assigned to the correct SME
5. Impact analysis and change management
6. Update CMDB: timely and correctly?

Diagram labels: End User, Help Desk, Fire Storms, CMDB

Result: multiple un-integrated systems and data stores, manually coordinated hand-offs, inconsistent troubleshooting, high MTTR.
From Fault Detection to Recovery & Closure
Closed Loop Incident Process solution for ITIL Event and Incident Management

Stages of the loop:
1. Event Generation & Detection
2. Event Correlation & Business Impact
3. Incident Submission
4. Investigation & Diagnosis
5. Resolution
6. Recovery & Closure

ITIL processes covered: Event Management, Incident Management
Event Generation & Detection
The operations bridge console collects events and alerts from servers, networks, applications and third parties.

Challenge:
- Bottom-up alert and event overload
- Lack of qualitative, cross-domain, actionable and causal event data

Solution:
- All events come to one place, correlated and enriched against an auto-updated service model

User example: events to a single console
- End user experience slow
- SQL slow query performance alert
- J2EE DB connection pool issue
Event Correlation & Business Impact
Business services, business impact relationships and SLAs are determined.

Challenge:
- Struggle to link causal events to top-down end-user experience and business impact

Solution:
- Proactive end-user experience linked to business process and business transaction flow, to identify impact on high-revenue-generating services

User example: cause from symptoms and impact
- Oracle database is the cause (topology-based correlation)
- Critical funds transfer business service impacted
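The idea of topology-based correlation in the user example can be sketched as follows: among the CIs that are alerting, the probable cause is one that does not itself depend on another alerting CI, and the business impact is everything that depends on that cause. This is a simplified Python illustration only. The dependency map and CI names are hypothetical, and it is not a description of how the HP products implement correlation.

```python
# Hypothetical service model: each CI maps to the CIs it depends on.
DEPENDS_ON = {
    "funds-transfer-service": ["j2ee-app"],
    "j2ee-app": ["db-connection-pool"],
    "db-connection-pool": ["oracle-db"],
    "oracle-db": [],
}


def probable_causes(alerting, depends_on):
    """Return alerting CIs that have no alerting CI anywhere beneath them."""
    def has_alerting_dependency(ci, seen):
        for dep in depends_on.get(ci, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(ci for ci in alerting if not has_alerting_dependency(ci, set()))


def impacted_by(cause, depends_on):
    """Return every CI that depends, directly or indirectly, on the cause."""
    impacted = set()
    frontier = [cause]
    while frontier:
        current = frontier.pop()
        for ci, deps in depends_on.items():
            if current in deps and ci not in impacted:
                impacted.add(ci)
                frontier.append(ci)
    return sorted(impacted)


alerting_cis = {"j2ee-app", "db-connection-pool", "oracle-db"}
causes = probable_causes(alerting_cis, DEPENDS_ON)
impact = impacted_by(causes[0], DEPENDS_ON)
print("probable cause:", causes)
print("impacted:", impact)
```

In this toy model the three symptom events collapse to a single cause, and the business service at the top of the chain is flagged as impacted even though it raised no event of its own.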
Incident Submission
Automatic submission to the service desk with annotations and cause area.

Challenge:
- Quality and enrichment of data
- Siloed, broken service lifecycle
- Duplication of effort wasting time

Solution:
- Better collaboration
- Automation and integration of the event-to-incident process lifecycle

User example: automatic incident ticket creation
- Ticket visible to ops bridge
- Assignment to subject matter expert
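One way to picture automatic submission without duplicated effort is a service desk interface that opens a ticket for a cause CI only once and attaches later events to it as annotations. The Python sketch below assumes that behavior. The `ServiceDesk` and `Ticket` classes are hypothetical stand-ins, not the Service Manager API.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Ticket:
    ticket_id: int
    cause_ci: str
    annotations: list = field(default_factory=list)
    status: str = "open"


class ServiceDesk:
    """Hypothetical stand-in for a service desk that receives events."""

    def __init__(self):
        self._ids = count(1)
        self.open_by_ci = {}

    def submit(self, cause_ci: str, annotation: str) -> Ticket:
        """Create a ticket for the cause CI, or annotate the existing one."""
        ticket = self.open_by_ci.get(cause_ci)
        if ticket is None:
            ticket = Ticket(next(self._ids), cause_ci)
            self.open_by_ci[cause_ci] = ticket
        ticket.annotations.append(annotation)
        return ticket


desk = ServiceDesk()
first = desk.submit("oracle-db", "SQL slow query performance alert")
second = desk.submit("oracle-db", "J2EE DB connection pool issue")
print(first.ticket_id, second.ticket_id, len(first.annotations))
```

Both events land on the same ticket, so the subject matter expert sees one incident with the full context rather than two competing ones.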
Investigation & Diagnosis
Problem isolation, SME tools and KM are used to determine the root cause.

Challenge:
- Significant problem resolution time is spent pinpointing the problem in a dynamic, heterogeneous IT universe
- Incident assigned and reassigned to multiple silos

Solution:
- Cross-domain data visualization and analysis

User example: diving deeper to find the root cause
- Expert sees corrupt DB tables
- Finds runbook automation fix in the knowledge base
Resolution
Change request with attached runbook automation to repair CIs.

Challenge:
- Little or no automation leads to increased manual effort, impacting quality and efficiency

Solution:
- Expert-created and expert-authorized runbook automation to empower lower-level teams
- Manage the change, configuration and release process

User example: processing the change
- Get change request approval
- Use runbook to reindex database tables
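The user example implies an ordering constraint: the runbook runs only after the change request is approved. A minimal Python sketch of that gate follows. The dictionary layout, the `run_approved_change` helper and the `reindex_tables` runbook are all hypothetical names chosen for illustration.

```python
def reindex_tables(ci: str) -> str:
    """Hypothetical runbook step; a real one would call the database."""
    return f"reindexed tables on {ci}"


def run_approved_change(change: dict, runbook) -> str:
    """Run the attached runbook only if the change request is approved."""
    if change["status"] != "approved":
        raise PermissionError("change request is not approved")
    result = runbook(change["ci"])
    change["status"] = "implemented"
    return result


change_request = {"id": "RFC-1", "ci": "oracle-db", "status": "approved"}
outcome = run_approved_change(change_request, reindex_tables)
print(outcome, "/", change_request["status"])
```

Putting the approval check inside the execution helper, rather than leaving it to the caller, means a lower-level team cannot run the fix by accident before the change is authorized.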
Recovery & Closure
Automatically close the incident and related incidents, acknowledging related events.

Challenge:
- Struggle to improve the speed of restoration, recovery and closure of incidents, and to verify SLA/OLA compliance afterwards

Solution:
- Automate all notifications and updates, continuously monitor SLA/OLA compliance

User example: verify the change worked
- User, DB and connection pool OK
- Ticket and events closed
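Closing the loop means that closing one incident also closes its related incidents and acknowledges the events linked to them. The Python sketch below shows that propagation on plain dictionaries. The record layout and the `close_incident` helper are assumptions made for illustration.

```python
def close_incident(incident_id, incidents, events):
    """Close an incident and its related incidents, then acknowledge
    every event linked to any of them."""
    to_close = {incident_id} | set(incidents[incident_id]["related"])
    for iid in to_close:
        incidents[iid]["status"] = "closed"

    acknowledged = []
    for event in events:
        if event["incident"] in to_close and not event["acknowledged"]:
            event["acknowledged"] = True
            acknowledged.append(event["id"])
    return sorted(to_close), acknowledged


# Hypothetical records: IM2 was raised by a user for the same fault as IM1.
incidents = {
    "IM1": {"status": "open", "related": ["IM2"]},
    "IM2": {"status": "open", "related": []},
    "IM3": {"status": "open", "related": []},
}
events = [
    {"id": "E1", "incident": "IM1", "acknowledged": False},
    {"id": "E2", "incident": "IM2", "acknowledged": False},
    {"id": "E3", "incident": "IM3", "acknowledged": False},
]

closed, acked = close_incident("IM1", incidents, events)
print("closed:", closed, "acknowledged:", acked)
```

The unrelated incident IM3 and its event stay open, which is the behavior an operator would expect when only one fault has been verified as fixed.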
Closed Loop Incident Process Integration Points
Integrated ITIL event and incident management process optimizing MTTR and MTBF

Components: Monitoring, Integrated CMDB, Automation, Service Desk

Integration points:
1. Sharing CIs, topology and state information
2. Creating and updating incidents
3. Updating events
4. Incident, problem and change management
5. Runbook automation to remediate
HP's Closed Loop Incident Process Solution
Integrated ITIL event and incident management process optimizing MTTR and MTBF

Components:
- BSM, fed with CIs, topology, events and status from Net, Ops, App and other sources
- UCMDB
- Operations Orchestration
- NA, SA, CA, SE and other
- Service Manager

Integration points:
1. CIs, topology, events and status measurements flowing into BSM
2. Sharing events and topology
3. Creating and updating incidents
4. Accessing the Business Impact View for a CI
5. Runbook automation to enrich, diagnose and remediate
6. Sharing CIs and state information
7. Runbook automation to remediate
Closed-Loop Incident Mgmt Process
Incident management from diagnosis to automated resolution

Process steps:
1. Identify service performance degradation
2. Troubleshoot problem to isolate root cause
3. Identify changes to be implemented
4. Create TT/RFC to implement change
5. Implement and automate change to close RFC
6. Update CMS (Federated CMDB)

Diagram callouts:
1. Identify service performance issue
2. Gather data to identify root cause
3. Create RFC to make change
4a. Initiate change
4b. Review, assess, plan and govern change
5a. Implement change
5b. Close change request?
6. Update Configuration Management System

Domains in the diagram: Business service management, IT service management, Business service automation, Configuration Management System (Federated CMDB)

Key points:
- The key processes (incident, change and configuration) need to be tightly linked
- Seamless process linkage requires tools to be consistently service-oriented
Closed Loop Incident Process Key Benefits
Drive innovation value of IT

Dimensions: Cost, Quality, Transparency, Agility, Business risk

Benefits:
- Drive efficiency through automation
- Optimize service lifecycle process efficiency
- Eliminate error-prone manual tasks
- Predict and prevent negative business impact
- The cost/value ratio of delivered services is understood by the business
- Any service from everywhere
- Saved labor can be spent on innovation
- Measure and optimize time to develop and successfully deploy new services
- Reduce risk of failure when deploying changes
- Enable compliance

Figures quoted on the slide:
- 72% lower maintenance cost
- 2.5x increased availability and performance
- 99.5% availability via integrated delivery
- 30% faster time to market for new apps
- 70% fewer bad changes