WHITE PAPER
SQL Server Performance Intelligence
MARCH 2009
Confio Software
www.confio.com
+1-303-938-8282
By: Consortio Services & Confio Software
Have you ever known that your application was being slowed down by SQL Server, but not known what to do about it? In the past, SQL Server performance management has been reactive and focused on server health. Database Administrators (DBAs) could only respond to trouble, not avoid it. And their visibility was largely restricted to watching the database server rather than understanding how SQL Server was directly affecting their application users.

Performance Intelligence (PI) is a method of improving the service and performance of SQL Server databases. PI is Confio Software's way of applying Business Intelligence principles to database performance management: it applies proven principles of business intelligence, historical data mining and trend analysis. Rather than monitoring system health, it focuses on application Wait Time in the SQL Server database. The result is an analysis technique that can quickly answer the key question: Why is my database causing application users to wait, and what can be done to improve response?

PI uses wait time to find and resolve SQL Server query bottlenecks. It prioritizes actions based on what is most important to application response. And it shows developers exactly what the reason is for query problems.

Figure 1: Performance Intelligence measures wait time spent at each wait type in the SQL Server database.

The wait time approach to analysis is now practical due to lightweight monitoring techniques and agentless architectures. It takes advantage of new instrumentation in SQL Server to expose wait types, the individual steps that accumulate delays as SQL Server processes queries.
This article describes examples of how PI can be used to identify the most important problem query, quantify the impact of locking sessions, and expose the critical SQL Server resource responsible for a database bottleneck.

For the IT organization, the result is reduced cost of database operation and improved IT service. DBAs can do more with fewer servers. Migrations from SQL Server 2000 to 2005 to 2008 become quicker. And the development cycles are
shortened. For IT groups tasked with providing better service with fewer resources, PI is a cost-effective answer.

Why a New Analysis Technique Is Needed

DBAs are often in a tough spot. They are accountable for database response to application users, but they have no visibility into why the database is slow. Often the issue is not in their database at all, coming instead from the application code, the network, or the system architecture. To get application code changed, DBAs must bring evidence to developers, who meanwhile are suspicious because, to them, the database is a poorly understood black box. Just get a faster server? Wrong!

The problems above are a symptom of relying on old server health monitoring techniques to understand what is happening inside SQL Server. PI is a new analysis technique, derived from successful business intelligence methods, that gives DBAs the insight needed to respond to their users.

Performance Intelligence Explained

PI uses proven business intelligence (BI) techniques to analyze wait type data and improve application performance. It allows DBAs, developers and even managers to make sense of something as obscure as wait time data. Key concepts include:

Measure Time, Don't Count Operations. For the application user, the number of I/O operations or logical reads means nothing. All that counts is how long the application takes to respond. To optimize for this user perspective, focus on the time taken in the database. Wait types are a method of doing this.

Focus on Queries. Key to PI is measurement at the level of SQL queries and individual sessions. Tools that measure wait across an entire instance or database without breaking it down further do not give actionable information. PI captures data at the smallest level of granularity, measuring wait time for each individual query.

Continuous Capture.
Keep your eyes open all the time. By watching all sessions, all of the time, PI captures the occurrence of any problem. When a user calls asking for help on a slow application, the data is already available. Systems that depend on tracing intermittently will miss problems when they happen.

Historical View. To know what to fix, you must look at trends and changes in the database, not just instantaneous results. PI takes a historical view to compare current wait type statistics with the past, and to see what is different that could be the source of a new problem.
Performance Intelligence finds trends, identifies anomalies, and exposes relationships across 13 different dimensions of performance data.

Figure 2: PI analyzes millions of data points along multiple dimensions to expose the true conditions impacting database end-users.

SQL Server Wait Types

Awareness of SQL Server wait types is an important step in understanding PI. Wait time analysis is a relatively new method of identifying and resolving database performance problems. While new to SQL Server, it has established itself as best practice in other database environments. Advances in the SQL Server instrumentation available to third-party developers in the 2005 and 2008 releases give visibility that now makes it possible to capture accurate wait type data at a low performance cost.

Any statement running against a SQL Server will experience some form of wait as SQL Server accesses the resources the statement needs in order to complete. A request will wait for data to be retrieved, for data to be written to disk, or for an entry to be written to the SQL Server log. When watching an instance closely, you will notice that it experiences a number of waits throughout a given time period. When waits become chronic or excessive, you may begin to see a performance problem.

Common Wait Types

SQL Server records information about the type and duration of the waits that a process experiences. While there are over 100 different wait types in SQL Server, you will likely only ever encounter a handful of these as problems.

Any wait type beginning with LCK_ means that a task was waiting to acquire a lock. For example, a wait type of LCK_M_IX means the process was waiting to acquire an Intent Exclusive lock. Over 20 of the wait types are lock waits, which is fitting as most work being performed in SQL Server requires some sort of lock.

The next most common wait types are ASYNC_IO_COMPLETION and ASYNC_NETWORK_IO. The first means a process was waiting for an I/O operation to complete.
The second means that a task is waiting for I/O to complete over the network, which often indicates that the client application is slow to consume the results.

Finally, keep an eye out for the CXPACKET wait state. This occurs when a process is trying to synchronize the query processor exchange iterator, and it can indicate an issue with a server's parallelism setting.

Working out what all the potential wait states mean can be time consuming. On average, about 20 of the potential wait states show up in 80 percent of problems. After doing wait type analysis for a while, you will get used to seeing certain wait types, including the ones looked at here.
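The naming patterns described above lend themselves to a simple grouping step. The following is a minimal sketch, assuming wait samples have already been collected as (wait type, milliseconds) pairs; the function names, the category labels and the sample numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical helper: groups SQL Server wait type names into broad categories.
# The category labels ("lock", "io", ...) are illustrative, not SQL Server terms.

def classify_wait_type(wait_type: str) -> str:
    name = wait_type.upper()
    if name.startswith("LCK_"):
        return "lock"          # any LCK_ wait is a wait to acquire a lock
    if name == "ASYNC_IO_COMPLETION":
        return "io"            # waiting for an I/O operation to complete
    if name == "ASYNC_NETWORK_IO":
        return "network"       # waiting for I/O over the network
    if name == "CXPACKET":
        return "parallelism"   # exchange iterator synchronization
    return "other"


def summarize(samples):
    """Sum wait time in milliseconds per category from (wait_type, wait_ms) pairs."""
    totals = {}
    for wait_type, wait_ms in samples:
        category = classify_wait_type(wait_type)
        totals[category] = totals.get(category, 0) + wait_ms
    return totals


# Made-up sample data, for illustration only.
samples = [
    ("LCK_M_IX", 1200),
    ("ASYNC_NETWORK_IO", 300),
    ("CXPACKET", 450),
    ("LCK_M_U", 800),
    ("ASYNC_IO_COMPLETION", 150),
]

for category, total_ms in sorted(summarize(samples).items(), key=lambda kv: -kv[1]):
    print(f"{category:12s} {total_ms:6d} ms")
```

Grouping by category first keeps the handful of frequently seen wait types visible without having to memorize the full list.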
Methods to Capture Wait Type Data

SQL Server has offered views of wait types for quite some time now, but unfortunately those views have been snapshots of current statistics, vague and for the most part unhelpful in understanding the performance picture. Starting in SQL Server 7.0 and 2000, DBAs could use Enterprise Manager (EM) to view wait types. The problem was that all EM provided was the name of the wait type and the length of time a given process had been waiting. When SQL Server Management Studio (SSMS) was introduced with SQL Server 2005, the views of active queries and sessions remained similar. Again, DBAs were given a wait type and duration, but not much else.

Dynamic Management Views (DMVs), new in SQL Server 2005, offer a newer and more efficient view of wait type statistics. The most pertinent DMVs for looking at wait statistics are sys.dm_exec_requests, sys.dm_exec_query_stats, and sys.dm_os_wait_stats, as described below. Note that the aggregate DMVs report counter values accumulated since system start, so to make them useful you need to poll them and calculate the deltas between polls. A far better way to access this information is to use a performance solution built specifically to capture and analyze it.

sys.dm_exec_requests

This DMV offers information about each request that is executing on a given SQL Server. When looking at wait states, you care about only a few of the columns that this view provides; specifically sql_handle, wait_type, wait_time, last_wait_type, and wait_resource. These columns provide information about the statement being executed and the request's current wait state.

sys.dm_exec_query_stats

This view returns aggregate performance statistics for cached queries. By using the sql_handle value from sys.dm_exec_requests to join to a row in this view, you can start to get a picture of how often the waits you see might be occurring.
Keep in mind that this view doesn't give more wait detail; everything here is just an aggregated statistic for a given sql_handle.

sys.dm_os_wait_stats

This view provides an aggregate picture of all wait states on a SQL Server. It lists all the different wait types along with the number of waits recorded for each type, the total wait time, and the maximum wait time; an average wait time can be derived by dividing the total by the count. This detail is good for the big picture, or to get a quick idea of the types of waits occurring, but most of the real diagnosis and tuning will occur at the statement level.

The Problem with Raw Wait Type Statistics

Trying to perform any sort of analysis of wait statistics using the built-in tools has several inherent drawbacks. DBAs have to spend a great deal of time researching what certain wait types mean, because there is no direct reference from within the views. They also have to spend time devising ways to trap wait types and durations during high activity periods. The other big problem is that waits are transient: just because a query is experiencing a wait right now does not mean it will be waiting in the same state 10 seconds from now.

While these views give adventurous DBAs a way to see what wait types exist, as well as some more detailed statistics about the duration of waits for a given process, this is unfortunately still not enough. In order to identify any trends, you must have a historical perspective. This means a lot of work to trap the wait state detail on an ongoing basis for later analysis.

In general, while wait state analysis can be extraordinarily helpful, coming up with a methodology for performing this analysis is the biggest problem. This is where PI, built into a packaged solution, comes in.
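The poll-and-calculate-deltas step mentioned for the DMVs can be sketched as follows. This is an illustration under stated assumptions, not the method used by any particular tool: each snapshot is assumed to be a dictionary built from the waiting_tasks_count and wait_time_ms columns of sys.dm_os_wait_stats, and the function name, snapshot names and all numbers are hypothetical.

```python
# Minimal sketch: turn two cumulative wait-stat snapshots into per-interval deltas.
# Each snapshot maps wait_type -> (waiting_tasks_count, wait_time_ms).

def wait_deltas(previous, current):
    """Return what happened between two polls, per wait type."""
    deltas = {}
    for wait_type, (count_now, ms_now) in current.items():
        count_before, ms_before = previous.get(wait_type, (0, 0))
        d_count = count_now - count_before
        d_ms = ms_now - ms_before
        if d_count < 0 or d_ms < 0:
            # Counters went backwards, so they were reset between polls
            # (for example by a restart); use the current values as the delta.
            d_count, d_ms = count_now, ms_now
        if d_ms > 0:
            avg_ms = d_ms / d_count if d_count else 0.0
            deltas[wait_type] = {"waits": d_count, "wait_ms": d_ms, "avg_ms": avg_ms}
    return deltas


# Made-up snapshots taken five minutes apart.
snap_10_00 = {"LCK_M_U": (100, 50_000), "CXPACKET": (400, 20_000), "ASYNC_NETWORK_IO": (900, 9_000)}
snap_10_05 = {"LCK_M_U": (120, 80_000), "CXPACKET": (450, 21_000), "ASYNC_NETWORK_IO": (900, 9_000)}

for wait_type, d in sorted(wait_deltas(snap_10_00, snap_10_05).items(),
                           key=lambda kv: -kv[1]["wait_ms"]):
    print(wait_type, d)
```

Storing each interval's deltas with a timestamp is one way to build the historical perspective the section calls for, since the raw counters alone only show totals since startup.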
Problem Resolution Scenarios

In order to understand how PI wait type analysis can help DBAs accomplish everyday problem resolution, here are a few typical scenarios to consider: identifying the problem query, resolving locking problems, and finding hardware bottlenecks.

Problem Resolution Scenario 1: Identifying the Problem Query

One of the most frustrating scenarios a DBA faces is the problem query. Often, this is a query that a developer has identified as particularly slow running. DBAs will usually hear that the query ran fine in development or has been running fine for several weeks. Other times, repeated complaints of performance problems will lead DBAs to begin looking for the problem query in an attempt to increase performance.

In either case, the traditional methods of researching the problem usually involve opening several tools, such as SQL Server Profiler and Windows Performance Monitor, and trying to capture problems in real time. Specifically, most DBAs look at queries that have high durations, queries with high numbers of reads and/or writes, and queries that are being rerun frequently. In all of these cases, however, the base numbers can be misleading. For example, queries that are being rerun frequently, but very quickly, may or may not be causing a bottleneck. If the base query runs quickly and efficiently, with very low wait times, there probably isn't a problem. If, however, the given query is constantly experiencing the same wait type, such as ASYNC_IO_COMPLETION, there may be a bottleneck. Determining the difference is what wait type analysis is all about.

Figure 3: Example of a problem SQL query, "Get State", exposed with excessive wait time
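The idea of ranking queries by how long they wait, rather than by how often they run or how many reads they perform, can be sketched in a few lines. The example below assumes that wait samples have already been collected and attributed to named queries as (query, wait type, milliseconds) tuples; the function name, the query names other than the "Get State" label used in the figure, and all numbers are hypothetical.

```python
from collections import defaultdict

# Minimal sketch: rank queries by total wait time and report each query's
# dominant wait type. Input is a list of (query, wait_type, wait_ms) samples.

def rank_queries_by_wait(samples):
    per_query = defaultdict(lambda: defaultdict(int))
    for query, wait_type, wait_ms in samples:
        per_query[query][wait_type] += wait_ms

    ranking = []
    for query, by_type in per_query.items():
        total_ms = sum(by_type.values())
        top_type = max(by_type, key=by_type.get)  # wait type with the most time
        ranking.append((query, total_ms, top_type))

    # Largest total wait first: that is the query users feel the most.
    ranking.sort(key=lambda row: row[1], reverse=True)
    return ranking


# Made-up samples, for illustration only.
samples = [
    ("Get State", "ASYNC_IO_COMPLETION", 4000),
    ("Get State", "LCK_M_U", 9000),
    ("Lookup Order", "ASYNC_NETWORK_IO", 20),
    ("Lookup Order", "ASYNC_NETWORK_IO", 30),
    ("Get State", "LCK_M_U", 6000),
    ("Nightly Report", "CXPACKET", 2500),
]

for query, total_ms, top_type in rank_queries_by_wait(samples):
    print(f"{query:15s} {total_ms:6d} ms  mostly {top_type}")
```

In this made-up data the frequently rerun "Lookup Order" query falls to the bottom of the list because its waits are short, which is the distinction the scenario draws between a busy query and a bottleneck.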
Problem Resolution Scenario 2: Resolving Locking Problems

SQL Server locking is often a very confusing subject. However, with wait type analysis, figuring out what locks are being acquired, and how those locks may be blocking other processes, is much easier. Throughout the day, most SQL Servers will experience split-second locking and blocking conditions. Only when these locks result in long-term blocking is there a problem.

Wait types that name a lock type, such as LCK_M_SCH_M (a schema modification lock), identify exactly what the process is waiting for. In this case, a process waiting for that lock needs to modify the schema of the table or view, and therefore has to wait for any preceding processes that are inserting, updating, or deleting data to finish.

Another potential problem is the natural extension of a single blocking process: the blocking chain. Once one process is waiting for a resource and blocking another process, it's very likely that a third process will end up waiting for the process that is waiting for the original process, and so on. The resolution is to find the head of the chain. Once the wait type of the head of the chain has been identified and resolved, the rest of the blocking chain should be freed up.

Figure 4: LCK_M_U wait, shown by blue bar, causes most wait time for "Get State"

Problem Resolution Scenario 3: Finding Hardware Bottlenecks

Identifying hardware resource bottlenecks may be the most complicated scenario. While there are a number of symptoms that can point to a bottleneck, there is almost no way to identify a hardware problem other than using wait type analysis. In this case, the key is to look for wait types related to either the disk subsystem (such as the PAGEIOLATCH_* wait types), the CPU (CXPACKET, for example) or
the general memory system (RESOURCE_* wait types). These wait types, when experienced for more than a few seconds, generally point to hardware problems.

For example, assume there is a query that usually runs for about 20 minutes and uses three table joins to determine the updates for a fourth table. The developer has reported that the query has started randomly taking upwards of 4 hours; there is no discernible pattern to when the query runs fast versus when it runs slow. A DBA can identify what wait type is occurring most frequently for that query, and what the duration is for each wait type during its run. If the wait type falls into one of the hardware-related categories, it's time to look at other queries on the system that are experiencing greater than expected durations in similar wait types.

Figure 5: A DBA can identify what wait type is occurring most frequently for each query.

Conclusion

Wait types are now instrumented in SQL Server 2000, 2005, and 2008, and the detail improves with each release. Because they are complex, a purpose-built performance solution using the PI principles is the most realistic method of taking advantage of this information. A focus on end-user waiting time is the key to better application service, and SQL Server wait types are the clues that expose this information for DBAs and development teams.

About Confio Ignite for SQL Server

Ignite for SQL Server is a low-impact, easy-to-implement solution that delivers the benefits of PI described in this paper. Built from the bottom up to focus on wait time, Ignite is the most comprehensive and understandable tool for DBAs, developers and IT management to find and resolve critical database performance bottlenecks.
The Ignite agentless architecture allows wait type data to be collected with no agent installed on the monitored SQL Server. Performance data is analyzed in the Ignite PI server, with historical statistics saved in a separate, non-production SQL Server instance. All user access is browser based, with no client to install. For enterprises needing more than SQL Server monitoring, the same principles are applied to monitoring of Oracle, DB2 and Sybase, all from the same Ignite PI installation.
Figure 6: Ignite for SQL Server Agentless Architecture

For a free trial of Ignite for SQL Server, visit www.confio.com, download Ignite, and be up and running in 20 minutes.

About the authors

Joshua Jones is a Database Systems Consultant with Consortio Services in Colorado Springs, CO. There he provides training, administration, analysis, and design support for customers utilizing SQL Server 2000, 2005 and 2008. Josh speaks at numerous events about SQL Server topics such as performance tuning, high-availability design, and business intelligence design. He is the co-author of "A Developer's Guide to Data Modeling with SQL Server", available from Addison-Wesley.

Don Bergal is the COO at Confio Software in Boulder, CO. For the past 5 years he and his team have helped customers improve the performance of thousands of databases, and during this time they have developed the Ignite Performance Intelligence methods.

About Confio Software

Confio Software develops performance management solutions for Oracle, SQL Server, DB2, and Sybase databases. Confio Ignite PI, which applies business intelligence analysis to IT operations, improves service levels and reduces costs for
database and application infrastructure. The Confio Igniter Suite PI is an open, multi-vendor, agentless monitoring solution that gives DBAs and management the ability to detect problems, analyze trends, and resolve bottlenecks impacting database response time. Built on an industry best-practice Wait-Time methodology, Confio's Igniter Suite improves service levels for IT end-users and reduces the total cost of operating IT infrastructure.

Confio Software products today are used by customers in North America, Europe, South America, Africa and Asia whose mission includes getting the most value out of their business-critical IT systems. Customers are reached directly through the Confio sales force and through a network of partners in the US and internationally.

Confio Software
Boulder, Colorado, USA
+1.303.938.8282
info@confio.com
www.confio.com