Users are Complaining that the System is Slow What Should I Do Now? Part 1 Jeffry A. Schwartz July 15, 2014 SQLRx Seminar jeffrys@isi85.com
Overview Most of you have had to deal with vague user complaints quoting slowness or hangs Obviously, this information is not very helpful with getting to the source of the problem Remember that erratic response times Proven to be more frustrating and counterproductive than consistently slow response times People usually remember the 90 th or 95 th percentile response times as average Database is often the first place to be blamed Dilemmas Many times analysts do not know where to start - Process especially time-consuming when analysts inexperienced in performance 2
Overview DBAs need techniques for determining Which hardware or software components, including servers, are in trouble? Causes of poor performance - What is to blame? - Is the software itself the problem? - Are the hardware s problems caused by Inefficient software? Simply too much work? For SQL Server servers, which queries are most troublesome? How to develop appropriate solutions 3
Today s Session First of several 30-minute presentations Discusses methodology for pinpointing the sources of problems using PerfMon data alone Subsequent presentations will delve into more SQL Server-centric metrics and tools Discusses high-level hardware-related and preliminary SQL Server performance analysis Describes information and techniques applicable to 2005+ SQL Server and all versions of Windows 4
Analysis Objectives Use measurements to corroborate or discredit user perceptions of performance Capture performance data on ALL servers involved in user experience - IIS, Application, Database, etc. - If SAN is shared, then collect on ALL servers that use it If perceptions are valid - When? - How bad? - Duration? - How do users know it is bad from 11:00 noon? If hardware in trouble, prove whether application is the cause or just the amount of work Too many database trips per transaction? Too many transactions? 5
Analysis Methodology Use Windows Performance Monitor, a.k.a. PerfMon, to determine When problems occur and their duration Which servers and hardware components could be involved Are the physical servers or SQL Server short of memory? Use graphs to visually correlate problem periods on servers Using PerfMon to focus analyses is highly recommended by Microsoft 6
Analysis Techniques Covered Interpretation and usage of informative performance counters - Processor - Memory - Physical I/O - SQL Server Expand upon and clarify PerfMon explanations Provide Useful graphical techniques Insights acquired from over 1,000 customer engagements Possible courses of action 7
PerfMon Hardware and SQL Server Metrics Invaluable hardware and SQL Server metrics % User Time (Processor) % Privileged Time (Processor) % Interrupt Time (Processor) % DPC Time (Processor) % Idle Time (each Disk) Avg. Disk sec/transfer (each Disk) Avg. Disk sec/write (each Disk) Available Bytes (Memory) Page Life Expectancy (SQLServer:Buffer Manager) Page Reads/sec (Memory & SQLServer:Buffer Manager) 8
Assessing Your System Potential problems exist if consistently Counter Criterion % Processor Time > 70% % Privileged Time > 30% (Processor) % Interrupt Time > 20% (Processor) % DPC Time > 25% (Processor) % Idle Time < 40% for any Disk LUN and especially SQL LUNs Avg. Disk sec/transfer > 0.040 seconds (40 ms) Avg. Disk sec/write > 0.040 seconds (40 ms) Available Bytes < 1 GB (Memory) Page Life Expectancy < 300 seconds (SQLServer:Buffer Manager) 9
Useful Processor Performance Counters Processor Object Interrupt counters are isolated to specific processors and frequently ignored (often single-threaded through interrupt processor) Windows only reports % Privileged Time (contains kernel mode and interrupt times) and interrupt times - Actual kernel (Windows) time must be computed manually - REAL kernel mode time is difference between % Privileged Time and the sum of the interrupt times (%Interrupt and %DPC) System Object Processor Queue Length (waiting list length ONLY) - Can be affected by application design Context Switches/sec 10
Processor Usage Overview Percentage used 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% Interrupts or DPCs/Second 9,000 8,100 7,200 6,300 5,400 4,500 3,600 2,700 1,800 900 0% Fri Jul 21 6:00 7:30 9:00 10:30 12:00 1:30 3:00 4:30 6:00 Mon Jul 24 6:30 8:00 9:30 11:00 12:30 2:00 3:30 5:00 6:30 Tue Jul 25 7:00 8:30 10:00 11:30 1:00 2:30 4:00 5:30 DPC Interrupt Kernel User Deferred Procedure Calls Queued Hardware Interrupts - 11
Physical I/O Measurements Critical for SQL Server systems because they are often I/O constrained I/O time measured directly by disk driver, which provides transfer times to Windows I/O time = service time + queue time due to driver s location in I/O path Disk response time Queuing not always the cause of large I/O times May not be possible to improve large service (working) times because of physical or financial constraints 12
Physical I/O Measurements Most frequently (and badly) misunderstood Windows metrics Data reported for Each physical disk or LUN (PhysicalDisk) - When RAID implemented in hardware, many actual disk drives can appear as one physical disk or LUN - Use disk idle and disk transfer times to detect contention within RAID itself (see I/O Performance Counters Incomplete slide) Each logical disk partition (LogicalDisk) 13
Useful Disk Performance Counters Physical Disk Avg. Disk sec/transfer - Should be 0.020 seconds (20 ms) at most unless I/O size huge % Idle Time - Surprising how often this is ignored! - Once this reaches zero, no more I/Os can be processed - Performance usually degrades as it approaches zero - Easier to represent as utilization, i.e., 100 - % Idle Time Disk Transfers/sec, Disk Bytes/sec - Beware of disk specs because they usually cite very large I/Os - Beware of cliffs, even on SSDs Read and Write-specific counters also valuable, especially when a read/write performance disparity exists or using RAID 5 Logical Disk Same counters available plus space-related ones Useful when multiple logical drives reside on one physical LUN 14
Interpreting Performance Counters Disk Queue lengths Unlike processor queue length, this INCLUDES I/Os in progress By far, most commonly quoted and used disk performance measurement - Assumes relationships among transfer times, utilizations, and queue lengths - Actually least useful, except when outrageously high or from a VM guest on a busy host (> 50% processor utilization) - Use Transfer times and % Idle instead Interpretation very difficult because # of physical disks in a LUN is usually unknown unless the values are obscenely high 15
Utilization versus Queue Depth Graph 16
Misunderstood PerfMon Counters Many PerfMon counters misunderstood, e.g., % Disk Time Many people continue in 2014 to believe that this metric is a utilization metric! It most definitely is NOT! PerfMon explanation does not help! % Disk Time is the percentage of elapsed time that the selected disk drive was busy servicing read or write requests. % Disk Time actually = 100 * Avg. Disk Queue Length Artificially constrained to 100% by PerfMon Actually useless, but frequently referenced and interpreted as disk busy times Actual busy = 100 - % Idle Time Can indicate a capacity constraint even when performance is excellent 17
Disk LUN Driver Activity % Disk busy 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Mon Jul 31 6:00 8:00 10:00 12:00 2:00 4:00 6:00 Tue Aug 1 7:00 9:00 11:00 1:00 3:00 5:00 Wed Aug 2 6:00 Disk 00 C Disk 05 D Disk 04 E Disk 03 E 8:00 10:00 12:00 2:00 4:00 6:00 Disk 03 E Disk 01 Disk 04 E Disk 03 F Disk 05 D Disk 02 F Disk 00 C Disk 06 D 18
Disk LUN Driver Activity % Disk busy 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Mon Mar 1 2:49 2:52 2:55 2:58 3:19 3:22 3:25 3:28 3:31 3:34 3:37 3:54 3:57 4:00 4:03 4:06 4:09 4:12 Disk 02 T Disk 00 C D Disk 00 C D Disk 01 R S V Disk 02 T 19
Disk LUN I/O Times Milliseconds per I/O 300 250 200 150 100 50 Disk 01 R 0 Mon Mar 1 2:49 2:52 2:55 2:58 3:19 3:22 3:25 3:28 3:31 3:34 3:37 3:54 3:57 4:00 4:03 4:06 4:09 4:12 Disk 00 C D Disk 00 C D Disk 02 T Disk 01 R S V 20
Disk I/O Byte Traffic Overview Disk Bytes/second 40,000,000 35,000,000 30,000,000 25,000,000 20,000,000 15,000,000 10,000,000 5,000,000 0 Mon Jul 17 5:00 Tue Jul 18 6:00 8:00 10:00 12:00 2:00 4:00 6:00 Wed Jul 19 7:00 9:00 11:00 1:00 3:00 5:00 Thu Jul 20 6:00 8:00 10:00 12:00 2:00 4:00 6:00 Disk 07 Disk 12 Disk 13 Disk 13 Disk 11 Disk 10 Disk 12 Disk 08 Disk 09 Disk 07 21
I/O Performance Counters Incomplete Some important metrics not measured directly Avg. Disk Service Time per Transfer Avg. Disk Queuing Time per Transfer Missing values can be computed using the Utilization Law 22
Utilization Law U = X * S U => utilization of a resource X => completion rate S => average service time required of a resource System assumed to be in steady state Not perfect, but an excellent tool 23
Using Utilization Law to Compute Missing I/O-Related Times Restate the Utilization Law S = U / X All calculations use PhysicalDisk counters LogicalDisk counters can be used, if necessary Italicized entities are PerfMon counters Disk Utilization = (100 - % Idle Time) / 100 Disk service time (sec) = Disk Utilization / Disk Transfers/sec Disk queue time (sec) = Avg. Disk sec/transfer - Disk service time (sec) 24
RAID Example Calculations #1 and #2 Disk Utilization 36.57% Disk Transfers/sec 0.65 Avg. Disk sec/transfer 2.0095 seconds! LUN #1 LUN #2 Disk service time 0.3657 / 0.65 = 0.563 seconds or 563 milliseconds Disk queue time 2.0095 0.563 = 1.447 seconds or 1,447 milliseconds Bytes/Transfer 1,307 Disk Utilization 77.67% Disk Transfers/sec 30.89 Avg. Disk sec/transfer 2.4424 seconds! Disk service time 0.7767 / 30.89 = 0.025 seconds or 25 milliseconds Disk queue time 2.4424 0.025 = 2.4174 seconds or 2,417 milliseconds Bytes/Transfer 22,437 25
RAID Example #1 vs. #2 I/O times (2.0095 vs. 2.4424) outrageously high Queuing occurred on both disks Low I/O rate of Disk #1 appears to contribute to high service times 1,307 bytes should not require 563 milliseconds How could this happen? 26
RAID Example #1 vs. #2 Problems began when faster processor complex attached to an existing disk subsystem Customer blamed new processor for poor performance Customer wanted vendor to take it back because architecture was supposedly defective and slower than original In reality, it was MUCH faster and it was swamping the disk subsystem! Solution was to reconfigure disk drives Customer refused to state exactly what they changed Probably multiple LUNs shared same physical drives - Becoming a more common problem as many LUNs are spread across the same physical disks Great in theory, but once hot spots develop, it is very difficult to unravel and correct 27
Database I/O Counters Page reads/sec and Page writes/sec counters Measures physical I/Os, not logical I/Os SQL Trace measures logical I/Os May indicate Insufficient database memory Applications improperly accessing database Improper database table implementation Plot reads and writes on same graph Highlights changes in workload behavior Heavy write activity may coincide with periods of poor performance, especially when RAID 5 disks involved Add Full Scans/sec and Forwarded Recs/sec for better view 28
I/O Activity vs. Scans and Forwarded Records Graph 29
Detecting Insufficient SQL Memory Use Page Life Expectancy Measures time unlocked buffers allowed to remain in buffer pool Aged number that tends to drop quickly and increase slowly If Page Life Expectancy too low (< 300) Allocate more memory to SQL Server or optimize queries Malformed queries that read inappropriate amounts of data can cause low Page Life Expectancy because of data churn - Page reuse is very low 30
Page Lookups/sec Counter Measures number of times SQL Server attempted to find page in buffer pool Logical read 31
Page Life Expectancy vs. Page Lookups Graph 250,000 Lookups per second SQL Server Page Life Expectancy & Page Lookups Seconds 1,000 900 200,000 800 700 150,000 600 500 100,000 400 300 50,000 200 100 0 Fri Mar 9 9:30 11:15 1:00 2:45 4:30 6:15 8:00 9:45 11:30 Sat Mar 10 1:15 3:00 4:45 6:30 8:15 10:00 11:45 1:30 3:15 5:00 6:45 8:30 10:15 Sun Mar 11 12:00 1:45 4:30 6:15 8:00 9:45 0 11:30 Lookups Page Life Expectancy Page Life Expectancy Threshold 32
Batch Requests/sec Number of select, insert, and delete statements Each of these statements triggers a batch event, which increments the counter Note: Also includes each of these statement types executed within a stored procedure 33
Batch Requests vs. Page Lookups Graph 34
Conclusions PerfMon should always be used to focus application troubleshooting and tuning efforts Extremely important to combine Windows system performance for ALL servers used by the application Especially true for processor, memory, and I/O Include SQL Server PerfMon and internal metrics when applicable 35
Follow-Up Please attend next session Will discuss SQL Server Dynamic Management Views (DMVs) Please complete the four question evaluation to let us know How we can help What topics you would like us to cover in future webinars 36