SERVICE DESK EFFECTIVENESS
ELIMINATE RECURRING INCIDENTS
Situation: A Service Desk typically handles about fifty tickets per day per thousand end users, with a staff of about twenty analysts per thousand end users. But there are often a significant number of recurring tickets - in some companies as many as 10-20 percent. Aside from the wasted effort of people doing the same work over and over again, unnecessary tickets increase the risk of mishandling, with the associated escalations and reduction in customer satisfaction. Even teams that have already implemented problem management often leave many opportunities undiscovered.

Solution: If the Service Desk is inundated with incidents, job #1 is to get control of the situation. Create a repeatable process for registering incidents and resolving them quickly. Once that is in place, with a basic service level defined for incident resolution, there is still a lot more that can be done to streamline incident handling. Establishing a discipline for Root Cause Analysis using the ITIL Problem Management process provides a repeatable method to identify and eliminate recurring failures.

With the right reports, IT can establish baselines and demonstrate the benefits of process improvements. Simple reports can do little more than track incident counts and the number of new problem records. Advanced analytics give deeper insight into root causes by highlighting terms common to many incidents and by tracking progress over time.

[Figure: Total Incidents vs. Recurring Incidents]
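A recurring-ticket share such as the 10-20 percent mentioned above only becomes useful once it is measured against a baseline. The sketch below is one possible way to compute that share, not a prescribed method. It assumes each incident has already been reduced to a "signature" string (for example a category plus a normalized summary); that signature scheme and the sample data are hypothetical.

```python
from collections import Counter

def recurring_share(signatures):
    """Return the fraction of incidents whose signature occurs more than once.

    `signatures` is a hypothetical list with one string per incident,
    e.g. "category/normalized summary". How signatures are derived is an
    assumption and will differ between ticketing tools.
    """
    counts = Counter(signatures)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence of a repeated signature counts as recurring.
    recurring = sum(n for n in counts.values() if n > 1)
    return recurring / total

# Illustrative sample data only, not taken from any real Service Desk.
baseline = [
    "email/mailbox full", "vpn/token expired", "email/mailbox full",
    "printer/paper jam", "vpn/token expired", "email/mailbox full",
    "sap/password reset", "laptop/cracked screen", "wifi/no signal",
    "erp/timeout",
]
print(f"Recurring share: {recurring_share(baseline):.0%}")
```

Running the same calculation for each reporting period gives a simple trend line against which later improvements can be compared.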
Getting the most from Problem Management

The goal for Problem Management is quite simple - to reduce the number of recurring incidents that are wasting time and effort. But when it succeeds, and common incidents occur less frequently, the result is a non-event, and non-events are not easy to detect. Unless the IT environment is particularly chaotic, it's unlikely that there will be wholesale drops in ticket volume. Reasons for this include new users, bring your own device, new releases, and the increasing complexity and pace of change in the IT infrastructure. With all that said, there are ways to tell if problem management is succeeding:

Availability of services: regularly meeting business SLAs
Backlog of problems: staffed appropriately
Average age of problems: showing progress

Really showing the benefits of implementing Problem Management requires a detailed baseline of incident rates. With a baseline in place, it is possible to show improvements in specific categories, classes of resolution and other attributes. To assess the value of preventative work, estimate the effort needed to resolve the incidents that were avoided and assign a cost based on the FTE hourly rate. In a business that depends on the availability of certain IT services, for example an e-commerce web site, calculate the revenue saved from higher availability using historical norms of revenue generation.

When a fridge unexpectedly falls from the sky and hits someone... it's an incident.
When the fridge bounces and hits another person... it's a multi-user incident.
When multiple fridges start falling from the sky... it's a recurring incident.
When we investigate why fridges keep falling from the sky... it's problem management.
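The two value estimates described above (effort avoided priced at the FTE hourly rate, and revenue protected by higher availability) amount to simple arithmetic. The following is a minimal sketch under stated assumptions: the function names, parameters, and all sample figures are hypothetical and chosen only for illustration.

```python
def avoided_effort_cost(baseline_count, current_count,
                        hours_per_incident, fte_hourly_rate):
    """Estimate the cost of effort no longer spent on a class of incident.

    Assumes a fixed average handling time per incident; a negative
    difference (more incidents than the baseline) is treated as zero savings.
    """
    avoided_incidents = max(baseline_count - current_count, 0)
    return avoided_incidents * hours_per_incident * fte_hourly_rate

def revenue_protected(downtime_hours_avoided, revenue_per_hour):
    """Estimate revenue saved by higher availability.

    `revenue_per_hour` stands for the historical norm of revenue
    generation for the service, e.g. an e-commerce web site.
    """
    return downtime_hours_avoided * revenue_per_hour

# Illustrative figures only: 120 incidents per month at baseline, 90 now,
# half an hour of handling each, at an assumed hourly rate of 40.
print(avoided_effort_cost(120, 90, 0.5, 40))   # 600.0
print(revenue_protected(2, 5000))              # 10000
```

Both estimates are only as good as the baseline behind them, which is why the baseline of incident rates needs to be captured before the preventative work starts.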
BEST PRACTICE APPROACH

1. Establish the need.

Recurring incidents are a waste of resources, preventing the most valuable IT resources from working on activities that increase the pace, quality and quantity of delivered IT innovations. The incident process should restore service as quickly as possible. Documenting resolution activities in the incident record creates a treasure trove for subsequent analysis of common modes of failure and restoration. Problem management can then analyze the top categories of incidents to identify common incidents that could be prevented rather than handled reactively through the Service Desk.

top-n incidents by category
Top categories for incident volume (category). A high volume of incidents indicates a higher risk of impact on the business and more effort being applied to maintain service. Once the top categories are identified, use text analytics to determine the top keywords for each category. Search the incident repository using the top keywords to quickly find common requests or recurring incidents. If an incident can reasonably be handled by self-service or by a Level 1 Service Desk analyst, have a domain expert write a knowledge article explaining how to diagnose and resolve the issue.

[Figure: Most Incidents by Group. Groups shown: Service Desk, Desktop Support, Productivity App Support, SAP Support, Windows Server, Mobile Support, Onsite s...]

Tip: ensure ownership of Problem Management by assigning a Service Desk leader to the process team.

For more information on the best use of analytics for implementing self-service, please read the white paper Self Service Reigns Supreme.
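The "top categories, then top keywords per category" analysis described above can be approximated with a simple word count. This is a rough sketch rather than a full text-analytics approach: the `category` and `summary` field names, the stop-word list, and the sample incidents are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "to", "is", "on", "in", "of",
              "and", "for", "not", "cannot", "my"}

def top_categories(incidents, n=3):
    """Return the n categories with the highest incident volume."""
    return Counter(i["category"] for i in incidents).most_common(n)

def top_keywords(incidents, category, n=3):
    """Return the n most frequent non-stop-words in one category's summaries."""
    words = Counter()
    for incident in incidents:
        if incident["category"] == category:
            tokens = re.findall(r"[a-z0-9]+", incident["summary"].lower())
            words.update(t for t in tokens if t not in STOP_WORDS)
    return words.most_common(n)

# Illustrative incident records only.
incidents = [
    {"category": "Email", "summary": "Mailbox full, cannot send"},
    {"category": "Email", "summary": "Mailbox full again"},
    {"category": "Email", "summary": "Outlook mailbox quota exceeded"},
    {"category": "VPN", "summary": "VPN token expired"},
    {"category": "VPN", "summary": "Token expired on VPN login"},
    {"category": "Printing", "summary": "Paper jam"},
]

for category, volume in top_categories(incidents, n=2):
    print(category, volume, top_keywords(incidents, category, n=2))
```

The resulting keywords can then be used as search terms against the incident repository to find candidates for knowledge articles or self-service.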
2. Build the practice.

The next step is to implement proactive steps to eliminate recurring incidents. The tactic so far has been to minimize the impact of incidents by enabling faster resolution at the Service Desk. The second phase of attack seeks to understand why incidents are recurring. Problem management introduces Root Cause Analysis to identify the "why" of an incident, and develops proposed solutions to prevent such incidents from happening in the first place. Any incident has the potential to be a recurring incident. Without a more structured approach, a problem manager can easily get overloaded.

Incidents with a temporary workaround
Incidents with the temporary workaround flag set (#). This analysis of the incident repository identifies incidents for which a temporary workaround is being applied. These are ideal candidates for problem management, and finding them does not require a broad review of all incidents.

Multi-user incidents with no related event
Incidents with no related event (#). Event management aims to warn of failures before they become critical or user impacting. But inadequate coverage and poorly chosen rule thresholds can give a false sense of security. This analysis identifies incidents that are being detected by end users without prior warning from the NOC of the impending failure.

[Figure: Recurring Incident Patterns. Groups shown: Desktop Support, Exchange Services]

Tip: An incident should be registered and assigned to the NOC for each failure of monitoring coverage or threshold rules.
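Both analyses above are filters over the incident repository. The sketch below shows one possible shape for them; the field names (`temporary_workaround`, `affected_users`, `related_event_id`) and the two-user threshold are hypothetical and would need to be mapped to whatever the ticketing tool actually records.

```python
def workaround_candidates(incidents):
    """Incidents resolved with a temporary workaround (flag set)."""
    return [i for i in incidents if i.get("temporary_workaround")]

def undetected_multi_user(incidents, min_users=2):
    """Multi-user incidents that have no related monitoring event.

    These were reported by end users without prior warning from the NOC,
    which suggests a gap in monitoring coverage or rule thresholds.
    """
    return [
        i for i in incidents
        if i.get("affected_users", 1) >= min_users
        and not i.get("related_event_id")
    ]

# Illustrative incident records only.
incidents = [
    {"id": "INC-1", "temporary_workaround": True,
     "affected_users": 1, "related_event_id": None},
    {"id": "INC-2", "temporary_workaround": False,
     "affected_users": 40, "related_event_id": None},
    {"id": "INC-3", "temporary_workaround": False,
     "affected_users": 25, "related_event_id": "EVT-9"},
]

print([i["id"] for i in workaround_candidates(incidents)])
print([i["id"] for i in undetected_multi_user(incidents)])
```

Counting the results of each filter per reporting period gives the two "(#)" metrics described above.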
3. Measure the success.

Show the value of problem management by tracking progress. Once root cause analysis has been completed, a proposed solution should be defined. Solutions range from software updates, application configuration improvements, and memory upgrades to the elimination of single points of failure. Tracking the reduction in incidents for this class of failure will highlight the savings compared with the earlier, reactive restoration of service.

problem backlog
Number of open problem records (#). Track the number of open problem investigations awaiting completion. The backlog provides an indication of the risk associated with the current infrastructure and processes. Once the process is well established, a sudden increase in backlog should give cause for concern and may warrant the assignment of additional resources.

problem aging by time bucket
Percentage of problem records whose Root Cause Analysis activities have not yet completed, by time bucket (%). It is good practice to assign an SLA for RCA. For simple incidents, consider an RCA time of 3 days; for more complex incidents, RCA may take 5-10 days.

[Figure: Problem Management Performance. Series: new, closed (values shown: 9, 8). Aging buckets: <5 days, 5-10 days, >10 days]

Tip: for more advanced analytics see the Best Practice Identifying Problem Configuration Items.
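The backlog and aging metrics above can be derived from a list of problem records. This is a minimal sketch assuming each record carries an `opened` date and an `rca_completed` date that is `None` while RCA is still in progress; those field names are hypothetical. Ages are counted in calendar days here, which is also an assumption, and the bucket boundaries follow the <5, 5-10 and >10 day labels in the chart.

```python
from datetime import date

def open_backlog(problems):
    """Problem records whose Root Cause Analysis has not yet completed."""
    return [p for p in problems if p["rca_completed"] is None]

def aging_buckets(problems, today):
    """Percentage of open problem records per age bucket (calendar days)."""
    buckets = {"<5 days": 0, "5-10 days": 0, ">10 days": 0}
    backlog = open_backlog(problems)
    for problem in backlog:
        age = (today - problem["opened"]).days
        if age < 5:
            buckets["<5 days"] += 1
        elif age <= 10:
            buckets["5-10 days"] += 1
        else:
            buckets[">10 days"] += 1
    total = len(backlog)
    return {k: (100.0 * v / total if total else 0.0)
            for k, v in buckets.items()}

# Illustrative problem records only.
today = date(2024, 3, 20)
problems = [
    {"opened": date(2024, 3, 18), "rca_completed": None},
    {"opened": date(2024, 3, 12), "rca_completed": None},
    {"opened": date(2024, 3, 1), "rca_completed": None},
    {"opened": date(2024, 3, 5), "rca_completed": None},
    {"opened": date(2024, 2, 20), "rca_completed": date(2024, 2, 23)},
]

print("Open problem records:", len(open_backlog(problems)))
print(aging_buckets(problems, today))
```

A growing share in the oldest bucket is the kind of signal that might justify assigning additional resources to problem investigations.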