THE COE PERFORMANCE METHOD




FROM THE CENTER OF EXPERTISE
A PERFORMANCE METHODOLOGY FOR ENTERPRISE-WIDE INFORMATION SYSTEMS
Roger Snowden, Center of Expertise, Oracle Corporation
© 2002 Oracle Corporation, all rights reserved

ABSTRACT
A holistic method for diagnosing performance problems in a complex information system is presented. The COE Performance Method (CPM) relies on proven techniques and offers a simple, end-to-end approach to isolating performance bottlenecks based on evidence of their actual causes. There are many excellent Oracle resources that treat single, individual technology components in greater depth than this paper; the purpose of this document is to provide a complete method of end-to-end performance analysis for an entire application made up of perhaps many cooperating components. While the approach is shown in the context of a networked enterprise database application, CPM can be readily applied to any computing environment. An explicit goal of the COE Performance Method described here is to help meet the performance commitments of Service Level Agreements and to quickly diagnose variances from those SLAs.

TRADITIONAL APPROACHES TO PERFORMANCE TUNING
Although Oracle's relational database products have never been exactly simple, the software's complexity has grown significantly, particularly in recent versions. With increased complexity has come a great deal of confusion and misinformation regarding performance management of the server and related technologies. Often, performance issues are treated as though they were concealed in magic and heralded by mystery. Bookshelves are full of offerings of "tips 'n' tricks" and secret knowledge about squeezing performance from an Oracle database. Authors with many years of database experience suggest various parameter settings and configurations with little expository justification. While there is some excellent material on the subject of Oracle performance, making the best use of that information requires a methodology based on facts and logic rather than guesswork. The problem with the common best-guess approach becomes apparent when the database administrator encounters a situation where parameters are adjusted and expensive memory or disk is added (changes made as the experts recommended), yet performance is still abysmal and Oracle appears to be the problem. What to do? Throwing memory or CPU at a problem may not address the underlying issues at all. In some cases, such a blanket approach may simply make things worse, until the system in question grinds to an unimpressive halt. Therefore, it is reasonable and proper for us to seek a rational, comprehensive approach to managing the performance of an Oracle database without reliance on guesswork. We need a consistent, uncomplicated method of finding and relieving bottlenecks in a complex enterprise information architecture.

CPM: THE SOLUTION FOR AN ORDERED APPROACH
The movement toward network architectures has significantly added to the complexity of the computing environment. Years ago it was only necessary to manage a single, unified set of technology components to achieve optimal performance. Now we have to manage multiple technology stacks: the database and its host platform, application servers, varied client workstations and operating systems, as well as the network that glues it all together. It is no longer possible to examine a single component and perform effective diagnostics for the system at large. The methodology for diagnosing and analyzing performance put forth here not only encompasses all technology stacks in the realm of an application system, but does so in an orderly manner, quickly leading the diagnostician toward a positive result.
Moreover, it takes into account the disparate pieces of the computing puzzle that other, purely component-based approaches ignore.

A SYSTEM IS NO FASTER THAN ITS WORST BOTTLENECK
The study of performance and capacity analysis of complex systems relies on a mathematical discipline known as queuing theory. Queuing theory relies on statistical methods to make possible an effective analysis of the behaviour of systems of processes, particularly as interrelated processes affect one another. While this description suggests a level of complexity that might discourage the non-mathematician, it is not necessary to have a mathematics background to develop a reasoned understanding of the principles involved. The fundamental equation we need to understand is this:

Response Time = Service Time + Wait Time

Response time refers to the total time a process consumes, start to finish. In a rush hour traffic example, response time would be measured from the time a car entered a freeway to the time it left an off-ramp. In a retail service scenario, it might be from the time a customer gets into a bank teller's line (to cash a check, perhaps) to the moment cash is in hand. Service time is the amount of time consumed by the process itself (the teller's busy time). Wait time refers to the time spent in line waiting for service. Optimal processes have minimal service and wait times. The target in the performance method discussed here is overall response time. For the most part, the focus will be on the causes of wait time, but by no means will service time be ignored. Most of us already understand these concepts, and we only need to observe the events of our daily lives to reinforce this understanding. Consider the commuter driving to work during rush hour on a typical morning. If traffic is moving rapidly, but congestion is heavy and cars are close together, a simple near miss caused by one car stopping suddenly can create instant havoc. As following cars are forced to brake suddenly, even more cars further back are affected and are forced to slam on their brakes. The effect ripples backward through the highway, perhaps for miles. Even if the original incident involves no actual damage and traffic at the initial site begins moving again immediately, the delaying after-effects are likely to continue for perhaps an additional hour. Once congestion has set in, it seems to feed on itself long after the cause of the bottleneck is removed. It may be impractical to attempt to solve all of the mathematical equations describing the various events and their collective consequences, but certainly the rush hour driving experience reinforces the conclusion that a relatively small event can have severe performance consequences. As with traffic jams, computer systems suffer similar congestion. Service time deserves some consideration. In the case of a database application, a session's process might be found to spend too much service time, in the form of CPU time, processing extra data blocks because of the lack of a proper index on a particular table. That is, the server performs a full table scan instead of an index range scan, retrieving many more data blocks than otherwise would be necessary. While this additional work might initially be regarded as service time (indeed, each block retrieval operation consumes some CPU processing time), the operation will involve even more I/O wait time, as the user's process must wait for each additional block's disk read to complete. So, while the full table scan certainly incurs additional CPU service time, the symptom of poor performance will most obviously be exhibited by excessive wait time (disk I/O) rather than service (CPU) time.
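In an Oracle instance, this split can be approximated directly from the dynamic performance views. The sketch below assumes timed_statistics is enabled and excludes only a few common idle events; the full list of idle events varies by release, and times are reported in centiseconds.

-- Approximate service time: CPU consumed by user sessions since instance startup.
SELECT value AS cpu_centiseconds
  FROM v$sysstat
 WHERE name = 'CPU used by this session';

-- Approximate wait time: time waited on non-idle events since instance startup.
-- The idle-event list here is illustrative, not exhaustive.
SELECT SUM(time_waited) AS wait_centiseconds
  FROM v$system_event
 WHERE event NOT IN ('SQL*Net message from client', 'rdbms ipc message',
                     'pmon timer', 'smon timer');

Taking the difference of each value between two points in time, for example between two Statspack snapshots, shows whether response time over that interval was dominated by service time or by wait time.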
Consider another example from daily life: the junk food lunch. We drop by our favorite hamburger restaurant for a quick bite and are faced with three lines of people waiting to order food from three employees acting as servers. Which line do we choose? Almost automatically, we choose the shortest line available. After several minutes, we notice someone who arrived after us is being served before us. It dawns on us that the person serving our line might still be in training. It takes that person about twice as long to fill an order as the more experienced workers. So, we intuitively understand that service time, the time it takes to actually take and fill an order, is a vital component of response time. Response time in this case is the time it takes to get our food in hand, starting from the moment we step into line in the restaurant. Another example of the importance of wait time as a primary measure of poor performance is the CPU time consumed by excess SQL parsing operations. A well-designed application will not only make use of sharable SQL and avoid hard parses, but will also avoid soft parses by keeping frequently used cursors open for immediate execution without reparsing at all, neither hard nor soft. A poorly designed application will certainly exhibit a high percentage of parse time CPU, but will probably also incur a disproportionate amount of time waiting for latches, most notably the library cache latch. As such, even a highly CPU-consumptive process is likely to cause measurable, disproportionate waits.
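Parse overhead of this kind is easy to quantify. A small sketch against v$sysstat, using statistic names as they appear in Oracle 8i and 9i (times in centiseconds):

SELECT name, value
  FROM v$sysstat
 WHERE name IN ('parse count (total)', 'parse count (hard)',
                'parse time cpu', 'parse time elapsed', 'execute count');

A hard parse count growing nearly as fast as the execute count, or parse time elapsed well above parse time cpu, points to exactly the combination of CPU consumption and latch waits described above.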

So, while service time must be monitored, performance problems are more likely to be quickly spotted by focusing on wait time. CPM as presented here takes a holistic approach to performance analysis and encourages the analyst to concentrate on service time or wait time as appropriate for the situation at hand. If the real problem is service-time related rather than wait-time related, it will be indicated by CPM and its cause corrected. Although the earlier automobile traffic example is easy to understand, the importance of wait time is all too easy to forget when dealing with the abstractions of computer software. However, that example can highlight how a database server might have a buffer cache hit ratio of ninety-nine percent and at the same time exhibit abysmal response time. Or, how a large parallel query might take too long to complete while CPU consumption mysteriously drops to near-idle levels. When the CPU is not working, it is waiting.
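At such a moment, the wait event interface can show what the instance is waiting on. A quick sketch against v$session_wait; the idle-event filter is illustrative and should be adjusted for your release:

SELECT event, COUNT(*) AS sessions_waiting
  FROM v$session_wait
 WHERE wait_time = 0   -- zero means the session is in the wait right now
   AND event NOT IN ('SQL*Net message from client', 'rdbms ipc message',
                     'pmon timer', 'smon timer')
 GROUP BY event
 ORDER BY sessions_waiting DESC;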
VARIANCE, UTILIZATION AND CAPACITY
Queuing analysis is helpful in understanding resource utilization and for optimizing service levels. In queuing analysis, the exact timing of an event is not always known. Customer arrivals, or computer users clicking the submit button to invoke a database request, tend not to be uniformly timed, and often come in groups. This is a common statistical phenomenon known as variance. It is simpler and more effective to deal instead with the aggregation of events and construct a mathematical model based on the probability of each event. Since customer arrival times and hamburger preparation times vary, a model can take the form of a graph illustrating the effects of congestion, or busy-ness. From that model, an analysis can be performed of response time, throughput, or the nature of a bottleneck. The manager of the hamburger restaurant knows from experience that people arrive at random intervals. That is, while there might be an average of three customers per minute during the mid-morning hours, people don't actually arrive at exactly twenty-second intervals. They come in groups and as individuals at unpredictable times. Thus, variances in arrival rates may have an effect on our response time. An idle resource, like an employee or a CPU, is often seen as wasted capacity. However, having an occasionally idle resource may be the price one pays to provide the level of service needed to be competitive. Similarly, the freeway we use to drive to work during rush hour may have several lanes idle at two o'clock in the morning. During rush hour, all lanes may be full and backed up. Extra slack, or capacity, is traded off for busy-time response and throughput. In computing systems, congestion can be experienced as either idle CPU time or growing process run queues; unused memory or swapping virtual memory; idle or busy disk. We may not be able to determine precisely how many users will be logged on at one time or exactly what the workload will be, so we may have to provide some margin of extra capacity in order to get our business completed on time. In a large enterprise, the queuing model presents itself within the measure of end-to-end application response. A user pressing a mouse button in an office may be physically and logically miles from the data of business interest. The total time a user waits before the screen responds with a display of data is the sum of the time spent in each system component between that mouse and the distant repository of data, as well as the return trip. Each component of technology has its own process model and is a potential contributor to response delay. We will refer to these interconnected technology components as technology stacks. Examples include the network, database server, application server, the underlying hardware machines, and their operating systems. With a basic understanding of queuing theory, we need to develop a way to apply it to the technology problem at hand. We need access to information which tells us when system components are busy, how busy they are, and what they are waiting for when they are not busy. Fortunately, there are numerous sources for this information. All we need is to identify them and to find a cohesive way to bring this information together in an understandable manner. Although each of these stacks consists of sub-processes, each with their own queuing models, we can view the overall stack as an aggregate process and consider its response as a unit. For the Oracle Database Server there exists a set of statistical values available for reporting, called wait events, indicating the presence or absence of internal bottlenecks. Measuring changes in the performance of an Oracle database involves viewing these wait events by time waited and comparing these wait times to the same measure from a different time period. Other stacks involved in the end-to-end application view typically have tools that provide similar information. We will discuss some of those tools in more detail later. Let's now forge on to the practical details of diagnosing performance issues.

THE ENGINEERING APPROACH
Certainly, the need for engineering discipline in the deployment and management of mission-critical applications is well understood. Such discipline may currently be less widely applied toward performance management than other areas of enterprise technology, but an engineering approach to the performance of an application is equally as important as engineering an initial deployment. While practices vary from enterprise to enterprise, certain key practices have been identified by Oracle's Center of Expertise as essential to effective performance management. First among these is the establishment of a Service Level Agreement (SLA). It is beyond the scope of this paper to fully define the nature of such an agreement. Nevertheless, it is clear that in order to declare a particular aspect of system performance as "bad", one must first have a clear definition of "good". One goal of the COE Performance Method described here is to achieve the performance commitments of the SLA and to diagnose variances from that SLA.

SERVICE LEVEL AGREEMENT
Since an SLA is an agreement between a technology service provider and a user, it tends to be a bottom-line document. That is, the agreement is for a particular specification of availability and performance for a technology-based service. As such, it tends to focus on end-to-end service and does not bother with the interconnected details in the middle. It is up to the technology provider to understand, define and support the interconnected components (stacks). Technology stacks in a contemporary information environment will include database servers, application servers, hardware and operating system platforms on which to run those servers, network components such as routers, hubs, gateways and firewalls, and workstations with user interface software for end users. Each stack has its own set of support issues and available tools for management. In order to respond effectively to reactive performance issues, the service provider should take a proactive approach. The tools and techniques needed to diagnose wait time versus service time for each technology stack must be implemented and in place, and they should be well understood by the service provider prior to any actual performance diagnostic engagement. This deployment includes not only the tools, but also the engineering training and support to use them. The Oracle Database Server from version 8.1.6 onward has shipped with a tool called Statspack. Statspack is specifically designed to monitor server performance and offers a high-level view of server wait events, the key to tracking down database performance bottlenecks. Operating system tools such as sar, netstat, glance, vmstat and iostat, among others, are also available on most UNIX platforms and are quite effective in combination with Statspack for overall proactive diagnostic monitoring.
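As a sketch of basic Statspack use, assuming it has already been installed with spcreate.sql and you are connected as the PERFSTAT user, two snapshots bracket the period of interest and the standard report is generated between them:

EXECUTE statspack.snap
-- ... let the workload of interest run, then take a second snapshot:
EXECUTE statspack.snap
-- Generate a report; the script prompts for the two snapshot ids:
@?/rdbms/admin/spreport.sql

The supplied spauto.sql script can schedule snapshots at regular intervals through DBMS_JOB, which is convenient for maintaining the baseline discussed below.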
Windows NT and its successors, Windows 2000 and Windows XP, also come packaged with performance monitoring tools. Third-party tools are also available and many are quite effective, although they generally have a price tag associated with them. Statspack is available free of charge, as is usually the case with the operating system tools mentioned above.

PERFORMANCE BASELINE REFERENCE
Whatever our toolset choices, we need to use those tools to establish and maintain a performance metric baseline. This takes the form of actual performance data gathered at appropriate times, using tools such as those already mentioned, to establish some measurable norm. A baseline might consist of an elaborate set of gathered data, or may be as simple as a benchmark timing of a standard query. The important characteristic of the baseline is that it is consistent and offers a reasonable basis of comparison. Data gathered should represent actual system performance taken during one or more periods of busy activity. A baseline of data gathered while the system is idle is of little use.
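At its simplest, the query-timing form of a baseline can be captured in SQL*Plus. The table and predicate below are placeholders standing in for a query your users actually run:

SET TIMING ON
SELECT COUNT(*)
  FROM orders                      -- placeholder: a representative application query
 WHERE order_date >= SYSDATE - 1;
SET TIMING OFF

Recording the elapsed time along with the date, workload and configuration provides the measurable norm against which later complaints can be compared.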

The baseline will need to be maintained as the system evolves with respect to workload, functionality and configuration. If you add new application features, upgrade the database version, or add or replace CPUs or other hardware, the environment has changed and therefore performance may have changed. If the baseline is not reestablished, any understanding of a future performance complaint by the user community will be compromised and blurred: one will not be able to know whether a performance change is due to a configuration issue or is a bug introduced with a new application feature. The baseline is established for this system in this environment and enables a comparative analysis to be made to diagnose a specific problem. The issue of the performance complaint itself is worthy of some note. One of the problems inherent in managing complex systems is the uncertainty of the performance metric. Performance is largely a matter of perception. A user may decide one day that a two-second response for the execution of a particular form is acceptable, but unacceptable the next day, depending on issues like how hurried or relaxed the user feels on a particular day. This suggests the information used for the reference baseline needs to be coordinated with the metrics used for the SLA. Even though performance complaints may still be lodged, at least the system or database administrator has either a defense to offer or a starting point from which to diagnose the issue.

ENGINEERING A SIMPLE METHOD
One of the best features of the COE Performance Methodology is that it lends itself to performance analysis of large systems of interconnected technology stacks. Since our premise is that a system is no faster than its worst bottleneck, it is obviously important to be able to identify the location of that bottleneck. Moreover, although Oracle tends to be the common denominator from the perspective of users and management alike, we know from experience that bottlenecks can just as well reside in the network, the application server, or an operating system. In order to identify the problem technology stack, and ultimately the actual problem itself, we need a systematic approach. The essential steps of the CPM approach, illustrated in Figure 1, will now be discussed briefly.

The COE Performance Methodology, in a nutshell:
1. Problem Statement
2. Information Gathering / Stack Identification
3. Stack Drill-Down
4. Fix the Stack
5. Test Against Baseline
6. Repeat Until Complete

Figure 1

As illustrated, the basic steps of the COE Performance Methodology are straightforward. By starting at a high-level, broad view of the enterprise system and rigorously following the steps in an orderly manner, positive results are achieved simply, quickly and without expensive and time-consuming guesswork.

PROBLEM STATEMENT
A clear and unambiguous definition of both good and bad behaviour is essential. The problem statement is more than half of the battle for a solution and defines success for us. Moreover, the discipline of stating the problem clearly and concisely often uncovers possible solutions. There is a strong temptation to gloss over this step, but it must be resisted so that misunderstandings and inefficiencies are avoided. If you think you are solving one problem and the customer or user has a different expectation, valuable time will be wasted addressing misguided issues. An example of a weak problem definition would be, "Online queries are slow and need to be much faster", while a good problem statement might be, "The Customer Service Name Lookup screen normally returns query results in 3-4 seconds, but has been taking more than 20 seconds since the open of business this morning." Define the problem specifically and concisely, establish the measure of success with the customer, and make certain you have agreement. The agreed goal must, of course, be reasonable and realistic. The definition needs to be quantifiable in terms conforming to the SLA metrics. The weak problem statement example above is harmfully vague. How would we know when we have succeeded in finding a solution? In our good example, if the SLA requires specific response times for the application function in question, we at least have a target for success and therefore a greater probability of success. Sometimes a clear problem statement is elusive. When things go wrong, often during critical business hours, tempers flare and communication lines break down. Sometimes the issue is obvious, while at other times we wonder if we are simply imagining a problem that does not exist. When in doubt, ask yourself the simple question, "What makes you think there is a problem here?", then demand of yourself a very specific answer based on symptomatic behavior. As Winston Churchill said, "Never overlook the obvious." It may well be that the cause of the problem is already understood or suspected. A clear description of what the problem is and isn't will go a long way toward quickly resolving both obvious and obscure problems. Take the time to clearly define the nature of the performance symptom and the time and circumstances of its appearance or disappearance, and to establish a valid test. Say what is known about the problem, and describe what is not known. A previously developed test case is ideal, and if one does not exist in advance, now is the time to create one. A test case can be as simple as the execution of a procedure through SQL*Plus and then also through the web server, with a measurement of response times. The result of the test needs to be compared to the baseline, so the importance of a valid and current baseline is apparent. If a baseline was not established in advance, get one now so that you at least have the current bad performance captured and have something against which to measure the impact of changes. Not all changes are good.

INFORMATION GATHERING / STACK IDENTIFICATION
Execute the test case and record the result. Gather associated performance data from all technology stacks defined earlier, using appropriate tools. Compare the test results for each stack to the baseline for that stack and identify the most probable stack as the source of the bottleneck.
What is needed for this critical stack identification step is a cursory check of each stack potentially involved in the problem. For hardware platforms, it may be a straightforward tool such as sar, iostat or vmstat. Network tools include netstat and ping. For the Oracle database server, a quick review of the alert log or error trace files will frequently turn up critical evidence for the trained database administrator. The ideal test is the one that yields the most information with the least effort, so proceed accordingly. Ideally, we will gather overall system resource data as well as service and wait times for each individual stack in order to determine which stack is the biggest bottleneck. This is one of the biggest challenges: getting a coherent, end-to-end measure of response time through each stack. Some organizations prefer to develop and maintain their own monitoring tools, and there are plenty of open source and freeware resources available for use, including various scripting languages such as perl and tcl. A common practice is to use operating system command utilities such as vmstat and iostat (shown in Figures 2 and 3), and to use a scripting language such as perl to analyze the text output. The tool can then "phone home" when exceptions are encountered or predefined thresholds are exceeded. Having an integrated monitoring environment will facilitate rapid and accurate stack identification during a performance crisis. While elaborate third-party tools are available for such an infrastructure, off-the-shelf and freeware tools are often entirely adequate, although any tools you choose will have to be integrated into your environment. For example, each UNIX platform in the enterprise might have a scheduled process to gather sar and netstat statistics at regular intervals. If Statspack snapshots are also collected at similar times, it is a simple matter to analyze reports from those tools for a period of concern and compare the available data to reports from, say, exactly one week or one month earlier. If the application workload is similar for both periods, but the performance problem did not exist in the earlier period, we have a fast way to compare bad performance data to baseline data. If the problem is with the underlying UNIX platform or the network, it should be apparent immediately. Even without the baseline, a trained technician will recognize symptoms of constraint: a high percentage of CPU wait time or process swapping activity, for example. See Figure 2 for an example of vmstat output. If no obvious starting point presents itself, we recommend you start with the database server itself. One obvious reason is that the database administrator understands that stack best. Another advantage is that the Oracle server gathers and provides information offering clues to problems across other stacks. For example, network problems often show up as a specific Oracle wait event, 'SQL*Net more data to client'. Knowing the response time through the database stack will allow you to determine whether most of the overall response time is spent in the database or not. This in turn will direct your attention to the database itself or to another stack.
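As a sketch of that kind of cursory database-side check, the cumulative SQL*Net-related waits can be read straight from v$system_event; event names differ slightly between releases:

SELECT event, total_waits, time_waited, average_wait
  FROM v$system_event
 WHERE event LIKE 'SQL*Net more data%'
 ORDER BY time_waited DESC;

Large or rapidly growing figures here shift suspicion toward the network stack rather than the database itself.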

$ vmstat 5 5
 procs     memory             page                            disk          faults           cpu
 r b w   swap     free     re   mf  pi  po  fr    de  sr   s0 s1 s2 s3    in    sy    cs  us sy id
 0 1 0   6968     20376    49 1775  37 399 577 87248  27    8  0 37 36  4520  6649  7044  17  6 77
 1 1 0   31807664 443376   10 1037   0   0   0 51536   0    1  0 49 49  5251 61709  8798  29  6 64
 1 4 0   31798856 443008    1  934  68   0   0 30456   0    1  2 54 55  5277 77163 17295  32  8 60
 0 1 0   31807744 441872    0 1051 164   0   0 18000   0    4  8 49 49  4755 74029 11738  31  7 62
 0 0 0   31808072 441376    0  986   0   0   0 10640   0   10  0 47 47  4720 48430  5518  23  5 72

This is a vmstat sample taken from a 32-processor Sun system for five intervals of five seconds each. The first line of vmstat reports averages since boot rather than a fresh sample, so we ignore it. A quick glance under the procs section tells us there is some process run queue wait time (r is either 0 or 1 in this example) and some resource waiting (b > 0 for most interval samples). This is generally considered good, non-bottlenecked performance, although the b value indicates a process blocked by an I/O wait, so disk may need balancing if that b value grows. Run queues are averaged across all CPUs on Solaris. Memory paging and swapping are not the same. Paging, even with these seemingly large numbers, is quite normal. The sr column tells you how often the page scanner daemon is looking for memory pages to reclaim, shown in pages scanned per second. Consistently high numbers here (> 200) are a good indication of real (not virtual) memory shortage. The fields displayed are:

procs    number of processes in each of three states:
  r      in run queue
  b      blocked for resources (I/O, paging, and so forth)
  w      runnable but swapped
memory   usage of virtual and real memory:
  swap   amount of swap space currently available (Kbytes)
  free   size of the free list (Kbytes)
page     page faults and paging activity, in units per second:
  re     page reclaims
  mf     minor faults
  pi     kilobytes paged in
  po     kilobytes paged out
  fr     kilobytes freed
  de     anticipated short-term memory shortfall (Kbytes)
  sr     pages scanned by clock algorithm
disk     number of disk operations per second, per disk unit shown
faults   trap/interrupt rates (per second):
  in     (non-clock) device interrupts
  sy     system calls
  cs     CPU context switches
cpu      breakdown of percentage usage of CPU time (on MP systems, an average across all processors):
  us     user time
  sy     system time
  id     idle time

Figure 2

TIMING IS EVERYTHING
An important consideration when evaluating third-party tools or rolling your own is to gather and analyze data in a meaningful manner. For the most part, we are dealing with statistical samples when we monitor hardware and software resources, so sampling techniques must be sensible with respect to sample size and interval. The vmstat report shown in Figure 2 was taken at five-second intervals. While short intervals show performance spikes quite well, they also tend to exaggerate variances in values and therefore contain statistical noise. A better method is to take concurrent short and long samples in order to analyze both averages and variances and get a meaningful picture of performance.

$ iostat -xtc
                     extended device statistics                 tty          cpu
 device  r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b  tin  tout  us sy wt id
 sd0     1.7   6.1   34.5   46.9   0.0   0.2   26.1   0   4    0    48  18  6  5 70
 sd1     0.1   0.0    1.1    1.7   0.0   0.0    7.6   0   0
 sd2     2.0  35.9   24.0  416.1   0.0   0.2    4.9   0  18
 sd3     1.2  35.9    9.5  416.1   0.0   0.2    4.9   0  18
 sd4     0.3   1.5   15.8   13.5   0.0   0.0   17.7   0   1
 sd5     0.4   7.8   28.4   17.1   0.0   0.1   10.2   0   6
 sd15    8.8   2.3   97.5   11.8   0.0   0.2   14.0   0   5
 sd16   14.1   6.7  140.1   46.7   0.0   0.3   14.6   0  12
 ...

This is an abbreviated iostat report from the same 32-processor system as shown in Figure 2. The svc_t column is actually the response time for the disk device, however misleading the name. When looking for input/output bottlenecks on disks, a rule of thumb is to look for response time greater than 30 milliseconds for any single device. A well-buffered and managed disk system can show response times under 10 milliseconds. Here are the field names and their meanings:

device   name of the disk
r/s      reads per second
w/s      writes per second
kr/s     kilobytes read per second
kw/s     kilobytes written per second
wait     average number of transactions waiting for service (queue length)
actv     average number of transactions actively being serviced
svc_t    average service time, in milliseconds
%w       percent of time there are transactions waiting for service (queued)
%b       percent of time the disk is busy (transactions in progress)

Figure 3

A sudden burst of activity might cause a single disk drive to be so busy as to cause process queuing, yet this may not be of any real concern unless it becomes chronic. On the other hand, long iostat samples will average disk service time, tend to hide frequent spikes, and could possibly mask a real problem. See Figure 4 for an example of a CPU resource measurement illustrating how large variances in reported data can be misleading. If you look at the data for too short an interval, you might conclude CPU idle time is nearly seventy percent, or nearly as low as twenty percent. If you are trying to analyze a performance anomaly during a period of high or low CPU usage, such a narrow slice of data can be quite helpful. On the other hand, taken as an indication of the norm, such a microscopic view could be completely misleading. The first priority at this early juncture is to eliminate obvious problems that can skew performance data and blur the analysis. We are concerned with quickly ascertaining the overall health of the components of each technology stack to make sure we know where the possible problem both is and isn't. We do this by looking for exceptions to what we know to be normal behavior.

[Chart: CPU Idle Time, %idle plotted against time of day from 6:51:30 to 9:36:32 with a trend line; one interval is marked Low and another High.]
CPU idle times extracted from a sar report. The jagged line represents samples taken at fifteen-minute intervals. The trend line is shown to illustrate the degree to which variances among individual samples can be distracting and misleading. You need both average and variance information to get a true picture of what is happening at the hardware and operating system levels. The interval marked Low is entirely different from the interval marked High. A narrow peek at a performance variation can be useful for analyzing bottlenecks, but can be misleading if taken as an indication of the norm.

Figure 4

For example, perhaps we received a report that an Oracle server had severe 'latch free' wait events during a period of bad performance. If we respond directly to that symptom without adequate high-level analysis of the overall platform/database technology stack, we might overlook heavy process queuing at the operating system level. That is, the Oracle database might appear to be the problem, when the real issue is a lack of capacity. Reports from vmstat or iostat would indicate chronic process run queues, so we would know that the Oracle database itself is probably not the culprit, at least not the primary culprit. Once the resource limit is addressed, by tuning the application, rescheduling processes or adding more or faster processors, we can proceed once again with the stack analysis and identify server constraints in their proper context.

tracert mail12
Tracing route to mail12.us.snerdley.com [148.22.88.200] over a maximum of 30 hops:
  1   <10 ms   <10 ms    10 ms  whq17jones-rtr-755f1-0-a.us.snerdley.com [148.22.216.1]
  2   <10 ms   <10 ms   <10 ms  whq4op3-rtr-714-f0-0.us.snerdley.com [148.22.252.23]
  3   220 ms   210 ms   231 ms  mail12.us.snerdley.com [148.22.88.200]
Trace complete.

Sample tracert output used to identify potential network problems. Coupled with ping, a number of common issues can be quickly identified. Ping each device shown in the tracert, with the don't-fragment bit set and a large packet size, to isolate individual segment performance. Although tracert shows timing information, it is for very small packets and may not isolate bottlenecks, so ping is used in conjunction with tracert.

Figure 5

STACK DRILL-DOWN
Each technology stack is then analyzed in detail to ascertain the source of the bottleneck. Since this effort is specific to each stack, the exact drill-down techniques are beyond the scope of this introductory paper. A network analysis, for example, might involve the services of a network administrator and the use of a network sniffer. Figure 5 shows an example of using the tracert utility to analyze network performance. The specific techniques for each stack differ greatly and have to be developed and supported specifically for each environment. For the Oracle database server, a tool such as Statspack or Oracle Enterprise Manager can provide a focused accounting of wait events. Sample the data for a narrow, busy period. One of the most common errors in database statistics gathering is to assume more is better and to sample for too long a period. If the performance symptoms appear for fifteen minutes each hour, then an hour-long sample of data only averages the wait events and hides the real cause of the problem. Data gathered should represent actual activity during the most pronounced performance symptoms for the clearest picture. Create the associated report, such as a Statspack report for the database server, and review it for the top wait events, in order of time waited. Each of those events will provide evidence of the cause of the problem and a path for further drill-down. Much information is available on the significance of each wait event in the context of Oracle's internal operations, and it is up to the individual performance analyst to learn how to interpret and respond to wait event statistics. Although many tools purport to offer tuning advice, there is no substitute for individual knowledge and training. A good source of information on database wait events is Anjo Kolk's YAPP paper referenced in the bibliography.
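The top-waits view that a Statspack report presents can also be sketched directly from v$system_event. Unlike a Statspack report, this shows cumulative figures since instance startup rather than the delta for a chosen interval, and the idle-event filter is again only illustrative:

SELECT event, total_waits, time_waited
  FROM v$system_event
 WHERE event NOT IN ('SQL*Net message from client', 'rdbms ipc message',
                     'pmon timer', 'smon timer')
 ORDER BY time_waited DESC;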
A note of caution is due here. There is a blurry distinction between capacity planning and performance management; the two subjects are tightly intertwined. One of the important skills required when engaging in performance analysis is to properly distinguish between a capacity problem and an actual performance problem. If a problem crops up slowly over time in the form of gradual performance degradation as workload grows naturally, the problem is a matter of capacity, not performance per se. A performance issue is a technical matter to be dealt with in a primarily technical manner, while a capacity problem quickly turns into a business decision. If a server needs additional capacity, such capacity must be purchased or done without.

FIX THE STACK
Having identified the worst bottleneck, it is now time to apply an appropriate remedy. Again, as an introduction to the COE Performance Methodology, it is beyond our scope to attempt to list possible fixes here. You may have identified a bug, a matter of human error, or a hardware failure. Whatever the cause, use your engineering, and perhaps diplomatic, skills to get it fixed.

TEST AGAINST BASELINE
Now that the single bottleneck has been identified and relieved, it is time to rerun the test case and compare the result to the baseline and SLA to establish relative success. We use the term "relative" here to suggest the problem might not be altogether solved. It is common to find the relief of one bottleneck only serves to reveal another. If you have achieved success, document that fact, stop tuning and go home. You do get to go home, don't you? Performance management is, of course, an ongoing process. This is not meant to suggest the diagnostician will walk away and not continue to monitor performance. On the contrary, proactive monitoring is the best way to avoid emergencies. It is important, however, to distinguish between reactive and proactive efforts and not to be caught in the trap of managing one crisis into the next. After the crisis is resolved, review performance against the baseline and update the baseline if hardware or software configurations have changed. Continue to monitor proactively.

REPEAT UNTIL COMPLETE
If success, as defined by the agreement established in the problem statement, is not yet declared, go back to the second step above and rerun the analysis to identify the stack now containing the worst bottleneck. Consider the possibility that the bottleneck has moved to another stack. It is also possible there is no ready relief for the problem. This may be a case where a performance problem is actually a capacity issue, in which case an investment decision may need to be made. Alternatively, the root of the problem may be a bug or a hardware failure for which there is no immediate solution. Often one symptom will mask another. It is not uncommon for multiple, unrelated problems to manifest themselves at the same time. In a recent engagement involving a sudden and dramatic increase in response time in a production database, heavy contention was discovered within the file system. Once several large objects were moved to other, less busy disk drives, throughput increased fourfold, but response time for individual users was still slow. Further investigation from the top down revealed certain SQL statements did not properly use an index. Both issues surfaced at the same time because the introduction of a new business transaction type caused a concentration of activity on the affected disk objects, while at the same time invoking SQL statements not previously executed. Once the SQL statement was corrected to be more selective, performance returned to normal, acceptable levels and the engagement ended. Performance problems are like onions: you peel them one layer at a time.
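In that engagement, the file-system hot spots and the offending SQL could each have been located with simple dictionary queries. A sketch, using 8i/9i view and column names; the disk_reads threshold is arbitrary:

-- Rank datafiles by cumulative I/O time to find contended files.
SELECT df.file_name, fs.phyrds, fs.phywrts, fs.readtim, fs.writetim
  FROM v$filestat fs, dba_data_files df
 WHERE fs.file# = df.file_id
 ORDER BY fs.readtim + fs.writetim DESC;

-- Rank SQL statements by physical reads to spot statements missing an index.
SELECT sql_text, executions, disk_reads, buffer_gets
  FROM v$sqlarea
 WHERE disk_reads > 10000
 ORDER BY disk_reads DESC;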
TOOLS TO DO THE JOB
In order to perform the multiple levels of diagnostics required for each stack, a number of tools will be needed. Commercial software and hardware products are available from various vendors, and free software tools abound. It is beyond the scope of this paper to attempt to identify all such tools, but some obvious sources are hardware and software vendors as well as the various open source consortia. Commonly used diagnostic tools mentioned already include sar, iostat, vmstat, netstat and ping for UNIX platforms. Tools offer varying degrees of comprehensiveness and integration. Naturally, an integrated tool is likely to be more convenient to implement than a set of point-solution tools. For Oracle servers, obvious choices include Oracle Enterprise Manager (EM), the utlbstat/utlestat scripts, and Statspack. EM has features incorporating the basic methodology described here. Utlbstat/utlestat and Statspack have the virtue of being included with the server at no extra charge. Statspack has shipped with Oracle database servers since 8.1.6 and is intended as a replacement for the utlbstat/utlestat scripts. It offers excellent and comprehensive features for ongoing monitoring of the database. All of these tools report data for selected intervals and provide a view of the wait event interface built into the Oracle server kernel.

A MEASURE OF DIPLOMACY
Besides tools to cover the technology spectrum under your domain, you will also need occasional cooperation from other experts. One of the more common problems of the contemporary enterprise is a direct outgrowth of the integration of disparate technologies: communication barriers. Often, the administrators of the database, hardware platform and network belong to entirely different management structures. While a performance methodology such as this cannot address political turf, cooperation is necessary to quickly diagnose potentially complex problems.

BACK TO THE CONCEPTS MANUAL
An understanding of Oracle concepts is fundamental to effective performance analysis. Have you read the Concepts Manual lately? An understanding of all components of the Oracle server is contained in that material, including Buffer Cache operations, enqueues, latches, the Library Cache, the Shared Pool, redo, undo, and the lgwr, dbwr and smon background processes. Oracle9i documentation includes Oracle9i Database Performance Methods, which along with Oracle9i Database Performance Guide and Reference provides an in-depth discussion of server and application tuning. For technology stacks other than the database, there is a wealth of material to read. Some excellent sources are listed in the bibliography below. Bear in mind some of them are written from the perspective of a particular operating system, but contain concepts applicable to all brands and flavors of platform. Documents are available on the Oracle Technology Network site providing an understanding of the wait events Oracle records, which supply the queuing analysis perspective you need to apply this methodology and to tune the database product effectively. There is a discussion of Oracle wait events in some detail, as well as an introduction to wait event analysis known as Yet Another Performance Profiling Method (YAPP), by Anjo Kolk. Also, Oracle9i Database Performance Methods applies the holistic approach to the database in particular. Both are well worth reading. See the Bibliography for details and additional reading.

ACKNOWLEDGEMENTS
The Center of Expertise Performance Methodology has been a collaborative work of many individuals. Current and former members of COE, including Jim Viscusi, Ray Dutcher, Kevin Reardon and others, provided much of the early research. Cary Millsap offered the theoretical foundation for this effort.

BIBLIOGRAPHY
Practical Queueing Analysis, Mike Tanner, McGraw-Hill Book Company (out of print in the United States, but a classic worth finding, available at Amazon's United Kingdom site)
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Raj Jain, John Wiley & Sons
Capacity Planning for Web Performance, Daniel A. Menasce, Virgilio A. F. Almeida, Prentice Hall
Oracle8i Designing and Tuning for Performance, Release 2 (8.1.6), Oracle Corporation, part A76992-01
Oracle9i Database Performance Methods, Oracle Corporation, part A87504-02
Oracle9i Database Performance Guide and Reference, Oracle Corporation, part A87503-02
Sun Performance and Tuning: Java and the Internet, Adrian Cockcroft, Richard Pettit, Sun Microsystems Press, a Prentice Hall title
Oracle Performance Tuning 101, Gaja Krishna Vaidyanatha, Kirtikumar Deshpande, John A. Kostelac, Jr., Oracle Press, Osborne/McGraw-Hill
Oracle Applications Performance Tuning Handbook, Andy Tremayne, Oracle Press, Osborne/McGraw-Hill
Yet Another Performance Profiling Method (YAPP), Anjo Kolk, http://metalink.oracle.com