A White Paper. The Best Practices Guide to Developing and Monitoring SLAs

Best Practices for Meeting End-User Demand: Put SLAs and Service Level Monitoring to Work for You

Information technology departments are increasingly finding that service level agreements (SLAs) and service level monitoring do much more than document that written agreements are being met. When defined and applied properly, they can help stretched IT staff meet end-user demand by providing a more coherent picture of the performance of distributed business applications.

Five Fundamentals of SLA Monitoring: Define, Collect, Interpret, Resolve, Present

With an approach that covers the fundamentals well, your organization can spend less time and money on the mechanics (monitoring tools, data collection, data analysis, and performance reporting) and concentrate on making sure that IT services align with the needs of the business. Next-generation agentless monitoring software is making service level monitoring practical and affordable for organizations of all sizes, not only those that can afford to license and maintain high-end systems management frameworks.

These basics apply to a wide range of situations: whether your organization follows well-known best practices for service management such as the Information Technology Infrastructure Library (ITIL) framework, whether you stick to your own in-house principles, or even when your organization needs to demonstrate good IT performance but doesn't use formal SLAs.

Define the business service. Service level management (SLM) starts with an agreed-upon definition of the service being delivered and the expectations for its performance. To clarify the terms used in this discussion, SLM is the process of managing IT services to meet the expectations of users and the business on a sustained basis. The written, agreed-upon conditions are known as an SLA, or service level agreement. In some cases the SLA represents an understanding between an IT department and internal end users.
In others, the SLA is a legally binding contract that specifies penalties for failure to perform. This second type of SLA is common between a company and an external service provider.

The essence of a well-crafted SLA has been described by journalist Tim Wilson using a familiar example. Writing in Network Computing, he cited the once-famous guarantee of Domino's Pizza. The company specified a service (pizza delivery), a metric for measuring performance (30 minutes from the time of the call), and a penalty for failure to meet the SLA (free pizza). [1] Both service provider and customer shared a clear understanding of what was expected and when.

Traditional performance monitoring doesn't support the SLA view of the world well because it is intended to keep tabs on the health of individual components, such as application software, a server, or a router. The pizza delivery analogy can be used to explain why: traditional performance monitoring might tell you the driver arrived 10 minutes late. Yet you might never be able to correlate that to a delay in getting the order to the kitchen, or to the fact that the oven was too full to bake the pizza immediately.

A business service usually depends on several such components at once. For example, the performance of enterprise resource planning (ERP) order management might depend upon a Web page, Windows CPU usage, and a UNIX database running on one or more sets of servers. A three-tier e-commerce application may rely on Web servers, middleware, and a back-end database. Or you might have load balancing that distributes application processing across a group of servers to improve response time, provide redundancy, or both.

Effective service level management benefits from having a method of grouping the resources that deliver the business service governed by an SLA. The ability to define and monitor what constitutes acceptable performance is similarly useful. A best-of-breed product allows you to set up the necessary business service profiles in just minutes, ideally with a point-and-click Web browser interface.

Example: View of IT Resources Delivering a Business Service
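To make the idea of a business service profile more concrete, the grouping described above can be pictured as a small data structure. The following Python sketch is illustrative only: the class names, host names, metrics, and threshold values are all hypothetical, and it assumes that higher readings are worse for every metric. It does not represent the interface of any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One monitored component, such as a Web server or a database host."""
    name: str
    tier: str          # e.g. "web", "middleware", "database"
    metric: str        # e.g. "response_ms", "cpu_percent"
    warn_above: float  # readings above this count as degraded (assumed)
    fail_above: float  # readings above this count as unacceptable (assumed)


@dataclass
class BusinessService:
    """A named group of resources governed by a single SLA."""
    name: str
    resources: list[Resource] = field(default_factory=list)

    def tiers(self) -> dict[str, list[str]]:
        """Return resource names grouped by tier, in insertion order."""
        grouped: dict[str, list[str]] = {}
        for r in self.resources:
            grouped.setdefault(r.tier, []).append(r.name)
        return grouped


def classify(resource: Resource, reading: float) -> str:
    """Judge one reading against the resource's thresholds."""
    if reading > resource.fail_above:
        return "unacceptable"
    if reading > resource.warn_above:
        return "degraded"
    return "good"


# Hypothetical three-tier profile for an order management service.
order_mgmt = BusinessService(
    name="ERP order management",
    resources=[
        Resource("web01", "web", "response_ms", 500, 2000),
        Resource("app01", "middleware", "cpu_percent", 80, 95),
        Resource("db01", "database", "cpu_percent", 70, 90),
    ],
)
```

With a profile like this in place, a reading such as 85 percent CPU on the hypothetical app01 host can be judged against the thresholds of the service it belongs to, instead of being read as an isolated number.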

You will notice this differs from stress-testing with synthetic transactions, which some organizations use to simulate the user experience. The focus of this paper is on establishing a view of the business service in order to monitor the behavior of an IT production environment, and on getting to the root cause of any issues.

Collect and correlate to reveal service performance. IT resources are very often monitored with collections of separate and independent tools that do not provide a cohesive representation of status. As soon as you take a business-level view, using automation to collect and correlate data across systems brings advantages.

Scanning a Windows event log or looking at perfmon counters can be reasonable methods to monitor one server, for instance. But collecting, aggregating, and correlating the data becomes far more challenging and time-consuming when you are responsible for multiple systems or a three-tier application infrastructure that must function within certain parameters.

Without automation, the monitoring data that reflects the entire service delivery chain is not usable in real time, either. It's certainly possible to aggregate and correlate the collected data manually. By the time you finish, however, the opportunity to take preventive steps or confine a crisis will almost certainly be gone. Lacking real-time data to reflect service performance, you are likely to miss the chance to improve SLA compliance through proactive measures.

Here it's worth noting that compared with agent-based monitoring used for this purpose, an agentless approach can save time and money if designed with SLA monitoring in mind. All agentless software by definition eliminates the need to install software agents on monitored systems because data is collected using standard underlying technologies, such as Windows Management Instrumentation (WMI) and Secure Shell (SSH).
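As a rough illustration of what automated correlation involves, the sketch below takes samples that have already been collected (for instance over WMI or SSH, which is outside the scope of this example) and groups them by the business service each host supports. The host-to-service mapping, the sample fields, and the choice to summarize each metric by its worst value in a time window are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    host: str
    metric: str
    value: float
    timestamp: int  # seconds since some reference point


# Hypothetical mapping of hosts to the business services they support.
SERVICE_OF_HOST = {
    "web01": "storefront",
    "app01": "storefront",
    "db01": "storefront",
    "mail01": "email",
}


def correlate(samples, window_start, window_end):
    """Group samples taken in [window_start, window_end) by service.

    Each (host, metric) pair is summarized by its highest value in the
    window, on the assumption that higher readings are worse.
    """
    worst = defaultdict(dict)
    for s in samples:
        if not (window_start <= s.timestamp < window_end):
            continue  # outside the window of interest
        service = SERVICE_OF_HOST.get(s.host)
        if service is None:
            continue  # host is not part of any defined service
        key = (s.host, s.metric)
        current = worst[service].get(key)
        if current is None or s.value > current:
            worst[service][key] = s.value
    return {service: dict(values) for service, values in worst.items()}


samples = [
    Sample("web01", "response_ms", 320.0, 10),
    Sample("web01", "response_ms", 910.0, 40),
    Sample("db01", "cpu_percent", 88.0, 35),
    Sample("mail01", "cpu_percent", 12.0, 20),
    Sample("db01", "cpu_percent", 97.0, 75),  # falls outside the window below
]
by_service = correlate(samples, 0, 60)
```

The result is one view per business service, so a slow Web response and a busy database on the same service appear side by side instead of in two separate tools.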
The most capable agentless products not only use mechanisms such as WMI and SSH to collect and aggregate data from links in the chain of service delivery, but also correlate the monitoring data to the business services being provided.

Interpret the business impact. Traditional performance monitoring provides only raw numbers that must be analyzed in order to determine the business impact. Look for service level monitoring capable of automatically offering a judgment on the data based on parameters you specify: is performance good, degraded, or unacceptable? (You still want easy access to basic metrics and statistics as needed.) Also make sure you can designate scheduled or ad hoc maintenance times, during which SLA requirements are set aside.

Say that four servers run the same application. When one goes out of service for some reason, the service might still qualify as good under the desired service level. The status might change to degraded if a second server went out of service, and so on. From the business and user perspectives, the service level as a whole matters more than the condition of any individual server.

An immediate understanding of business impact is important both for management reporting purposes (non-technical managers care about business, not about memory or CPU load) and to help direct IT staff to what really needs attention. In the preceding scenario, a server that is completely down might be less important than a slowdown in an underlying database that has an impact on all servers.

Where resources are grouped (or clustered) together, achieving satisfactory service levels is sometimes possible even when an individual component has failed. Here, the SLA is met when at least three of four servers are functioning, degraded when two are functioning, and failed if fewer than two are functioning.

Resolve quickly; prevent when possible. When a problem condition (an event, in monitoring terms) does occur, you need the ability to find out where, when, and for how long the condition has existed. You must be able to identify where the problem lies within the service delivery chain, and assess its impact on the business service. The desired result is reduced time-to-resolution or, when possible, containment of adverse events and trends before they have a noticeable effect on your users and applications.

The capability to drill down to investigate an event is essential for the rapid problem identification and resolution that benefits SLA performance. Well-designed agentless monitoring tools ought to provide options for drilling down that instantly supply the type of information you need. Perhaps a senior IT administrator prefers to see raw numbers. Other colleagues may be able to address root causes faster by examining event summaries that are complemented by details about individual SLA failures, including recommended corrective actions.
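The four-server rule described earlier (met with at least three servers functioning, degraded with two, failed with fewer than two) is simple enough to express directly. The sketch below is one possible encoding, not a description of how any product implements it. The function name, the parameter names, and the separate "maintenance" state used when SLA requirements are set aside are assumptions made for illustration.

```python
def cluster_sla_state(servers_up: int,
                      met_min: int = 3,
                      degraded_min: int = 2,
                      in_maintenance: bool = False) -> str:
    """Classify a clustered service by how many servers are functioning.

    Defaults follow the four-server example: met with at least three
    servers up, degraded with two, failed with fewer than two.
    """
    if servers_up < 0:
        raise ValueError("servers_up cannot be negative")
    if in_maintenance:
        # SLA requirements are set aside during maintenance windows.
        return "maintenance"
    if servers_up >= met_min:
        return "met"
    if servers_up >= degraded_min:
        return "degraded"
    return "failed"


# State of the example cluster as servers drop out one by one.
states = [cluster_sla_state(n) for n in (4, 3, 2, 1, 0)]
```

Keeping the thresholds as parameters means the same judgment can be reused for clusters of other sizes, for example a six-server group that tolerates two failures.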
Example: Event Monitoring Showing SLA State Transitions. This event summary display shows transitions in service level agreement compliance.
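An event summary of this kind can be derived from a time-ordered series of state observations. The following sketch shows one way to turn such a series into intervals that answer "when, and for how long" a condition existed. The observation format (a timestamp in seconds paired with a state name) and the function name are hypothetical.

```python
def state_intervals(observations):
    """Collapse time-ordered (timestamp, state) pairs into intervals.

    Returns a list of (start, end, state) tuples. A new interval begins
    whenever the state changes. The final interval ends at the timestamp
    of the last observation, because nothing is known beyond it.
    """
    intervals = []
    if not observations:
        return intervals
    start, current = observations[0]
    for timestamp, state in observations[1:]:
        if state != current:
            intervals.append((start, timestamp, current))
            start, current = timestamp, state
    intervals.append((start, observations[-1][0], current))
    return intervals


# Hypothetical observations taken once a minute.
observed = [
    (0, "met"),
    (60, "met"),
    (120, "degraded"),
    (180, "degraded"),
    (240, "met"),
]
intervals = state_intervals(observed)
degraded_seconds = sum(end - start for start, end, state in intervals
                       if state == "degraded")
```

In this example the service is reported as degraded from second 120 until it returns to a met state at second 240, which is the kind of detail needed when drilling down into an individual SLA failure.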

SLA-oriented monitoring will also account for developing trends. Perhaps you observe what appears to be an isolated spike in CPU usage during off-peak hours. Along with a real-time view, you need the means to investigate historical data to reveal whether the upswing is a recurring condition so you can plan accordingly.

Present; don't just tell. Regular reporting on service level compliance is becoming an obligation in many organizations. Even if you don't have formal SLAs to uphold, it is likely that you need to demonstrate IT's value to the business by showing how well service is being delivered. Extracting the necessary data and preparing this documentation can consume many hours of IT staff time.

Agentless monitoring with rich reporting capabilities can lighten the load. Its cost and convenience leave little reason not to take advantage of batch reports that can be scheduled to run automatically and e-mailed to recipients on a regular basis.

When it comes to reporting, showing gets the point across more effectively than telling. Crisp, informative graphics let executives and users see service level performance at a glance. The best tools can generate this style of report automatically and display it through a Web browser interface or distribute it automatically via e-mail.

Example: Executive Summary SLA Report. Charts and graphs document SLA compliance at a glance, and can be distributed via e-mail or a Web portal.

Taking a cue from best practices for SLM, use strong reporting as your ally to forecast trends months in advance. Trend analysis supports capacity management and IT financial management. For example, you can budget for capital expense and negotiate better prices from vendors when you can anticipate when CPU or memory capacity will reach its limit.
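One simple way to anticipate when capacity will reach its limit is to fit a straight line to historical utilization and see where it crosses the limit. The sketch below uses ordinary least squares for that purpose. It assumes roughly linear growth, which real workloads may not follow, and the sample figures and function name are invented for illustration.

```python
def forecast_limit(samples, limit):
    """Estimate when a linear trend reaches `limit`.

    samples: list of (period_index, utilization) pairs, e.g. one per month.
    Fits utilization = intercept + slope * period by ordinary least
    squares and returns the period at which the fitted line reaches
    `limit`, or None if the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one period")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None  # utilization is not growing
    return (limit - intercept) / slope


# Hypothetical monthly peak CPU utilization, in percent.
cpu_history = [(0, 50.0), (1, 55.0), (2, 60.0), (3, 65.0)]
months_until_limit = forecast_limit(cpu_history, limit=90.0)
```

With these invented figures, utilization grows five points a month from a base of 50 percent, so the fitted line reaches the 90 percent limit at month 8. That gives several months of notice for budgeting and vendor negotiations.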

Example: Trend Report. Trend lines based on monitoring data inform IT planning and budgeting.

Conclusion

Making do with a patchwork of separate tools and writing your own tools and scripts are no longer the only choices for IT departments that seek to streamline service level monitoring without an enterprise framework. Next-generation agentless monitoring can help get the job done efficiently and affordably. By choosing a commercial monitoring product from a company with an established reputation, you also avoid pitfalls associated with unsupported shareware that is not robust enough for the task.

Because agentless products have been limited in capabilities until recently, evaluate them thoroughly to find functionality that approaches higher-end, agent-based products. Informed selection can lead to an SLA monitoring solution that is cost-effective, useful right out of the box, backed by technical support, and maintained with ongoing updates. When the fundamentals are taken care of, your organization will have more resources to make sure that IT remains a strategic asset to the business.