How To Diagnose A Virtual Memory Fault On Linux (Amd64) On A Windows Xp (Amd86) On An Ubuntu) (Amd66) (For Acedo) (Orchestra) (Powerpoint) (Windows) (

Introduction Tuning Linux VM on Kernel 2.6 For Oracle RDBMS What does the phrase tuning kernel mean? It actually conveys a dangerous message that it is possible to somehow make database to run faster, by tweaking certain arcane and undocumented Linux parameters. The knowledge of the charms giving the ability to perform such a dangerous black magic is highly priced possession of the privileged few, the real stuff that separates boys from men, or so the legend goes. The truth is that there is no such thing as tuning database or tuning Linux. The usual meaning of the phrase is to minimize the application response time, for our given application. This document concentrates on the application systems that utilize Oracle RDBMS. On Linux, there are very few things one can tune to make things run faster. The purpose of this document is to provide the insight into relatively few tools and parameters that can be tuned and adjusted to provide temporary relief to the OS. Linux, in contrast to fully fledged Unix systems, is oriented to the mythical home user and surprisingly poor when it comes to the OS diagnostics and tuning. When tuning an application system, we must always start from the application itself and see where it spends its time. Unfortunately Oracle RDBMS doesn't have any statistics or wait event names waiting for an OS mess. The first indication that there is a problem with the OS is a long duration of other, normally encountered wait events like scattered or sequential reads, free buffer search or IPC message waits. It is important to know and have reference of the usual values of RDBMS statistics, in order to be able to notice sudden spike. The best way to collect such statistics is to take regular STATSPACK snapshots or use Activity & Workload History (AWH, available as of Oracle 10g is a separately licensed option, available as a part of OEM Diagnostics and Tuning pack). The information collected using these tools can help us correlate the periods of strain on the system with the periods of activity withing the Oracle RDBMS. When such a spike is noticed, it is time to look into OS. Causes of increased overhead are many and can be caused by all of the OS subsystems: I/O, device management, virtual memory management, network or, simply, a bad application. The rest of this document concentrates on the case of virtual memory subsystem causing problems. Virtual memory was chosen because of popular demand: several people that I know of asked me how to diagnose VM problems on Linux and how to respond. The problem is that documentation is scant and inaccurate, while rumors and recipes for run my stuff faster button abound. Linux itself also isn't very helpful in such situations. The rest of this document deals with the following questions: How to diagnose VM shortage and what diagnostic tools do we have at our disposal? What should we do? What is the goal of tuning VM? What means do we have at our disposal? This is, of course, the part about what everybody

wants to know the most Linux parameters. How to diagnose VM problems? Before we start diagnosing the problem, let's clarify the purpose and the basic features of the Linux virtual memory subsystem. The purpose of any virtual memory subsystem, including the one on Linux, is to emulate much larger memory then physically available by saving the least needed parts of memory to disk. The award winning question is: what are the least needed parts of memory? The answer to that question is simple: the least needed parts of memory are those that will not be used for the longest time. Unfortunately, Linux doesn't have a crystal ball which would allow it to look into the future so guesses must be made about the parts of memory should be written to the disk and replaced by new contents. Different operating systems make different guesses. VMS has an implementation of the least recently used (LRU) mechanism for page replacement. Linux does not have LRU mechanism, Linux uses the standard two handed clock mechanism, found on all other Unix systems. This mechanism works like this: special process, called page update daemon or pdflush, goes through all eligible pages, that is the pages that are not locked in memory, and turns the in use bit in the page header off. The process then goes to sleep. It wakes up after period of time, determined by a configuration parameter, and writes down to disk all modified pages which do not have the in use bit turned back to on, as a result of access by some process. Those two functions of the page replacement process are called hands. This is a high level overview of Linux (and Unix) page replacement process. Linux virtual memory system maintains certain quantity of free memory. What is free memory? Free memory is comprised of pages having valid backup on disk which are not being used by any process. The in use bit for those pages is set to 0. Those pages can be re allocated to another process at any time. Process will then allocate a page from the free memory pool of pages, discard its contents, which it can safely do, as the page has a valid backup, and read the new contents from the disk. This process is called a page fault and happens as a result of CPU hardware. Large number of page faults results in system spending large parts of time in kernel mode and is one of the best indicators of problems with the VM subsystem. At this point, the further descriptions of the virtual memory system would become too complicated for an article like this and I'll have to direct the reader to one of the numerous classic texts like Linux Kernel Development by Robert Love, UNIX Internals: The New Frontiers, by Uresh Vahalia or Design of the UNIX Operating System by Maurice J. Bach. The usual tools for diagnosing all kinds of performance problems, including the ones with virtual memory, are top, sar, vmstat, iostat and atop. The first four tools are very common and their descriptions can be found in many places. Those tools can give us information about the memory and CPU usage per process and various system statistics on the global level, but not on the per process level. The last tool, atop, is an excellent and phenomenally useful tool which is not that well known but is very much appreciated among the connoisseurs. The reason for the relative obscurity and insufficient popularity lies, quite paradoxically, in the problem of the Linux operating system. Linux doesn't do per

process accounting of disk requests or network requests, while atop is useful for allowing the administrator to monitor precisely those aspects: per process page faults, disk and network requests. So, how can it be? Linux doesn't do per process accounting of I/O requests and atop monitors precisely that? The answer is in the kernel patch. To monitor I/O and network requests issued by various processes, the administrator has to install two kernel patches: $ ls /usr/src redhat linux kernels atoppatch kernel 2.6.16 linux 2.6.16.20 $ ls /usr/src/atoppatch kernel 2.6.16 README.patch 02patch 2.6.16_atopacct 01patch 2.6.16_atopcnt After that, the new kernel has to be linked and installed, which is not a very simple procedure, especially not with packaged kernels like RH EL or SuSE Linux distributions which no longer ship the current kernel source. When this is done, atop can be used pretty much in same way the top is used. Atop, along with the necessary kernel patches and documentation can be downloaded from: http://www.atcomputing.nl/tools/atop On the beginning of the next page is a screen shot of atop monitor, showing the processes responsible for the most of I/O activity. It shows processes responsible for the disk activity, together with percentages and the disk activity broken into the number of self explanatory calls. This is just one of the monitoring screens atop can provide to system administrator. It can also break down paging and networking on the per process basis. This is, by far, my favorite tool for investigating Oracle performance problems on Linux. In case that atop proves to be too much of an issue, there are also sar and vmstat which, unfortunately, do not allow drilling to the process level and finding out the process responsible for all the commotion. The vmstat tool is an archaic tool used primarily by the people from the BSD world. On Linux, it has been mostly replaced by sar, originally a System V tool for monitoring systems. The output looks like this: $ sar B 3 4 Linux 2.6.17 1.2142_FC4 (medo.noip.com) 08/13/2006 10:12:33 PM pgpgin/s pgpgout/s fault/s majflt/s 10:12:36 PM 0.00 18.67 187.33 0.00 10:12:39 PM 0.00 27.91 94.35 0.00 10:12:42 PM 0.00 16.00 88.67 0.00 10:12:45 PM 0.00 43.19 89.04 0.00 Average: 0.00 26.46 114.81 0.00 As you can see, this only shows the overall picture, not enabling the administrator to drill down to the process level. This report contains two columns: fault/s and majflt/s. What is the difference between a fault, sometimes known as a soft fault, and a major fault (aka hard fault )? Soft fault happens when the process needs a page that is already in memory, but was freed by the page replacement process. Major or hard fault happens when the page needs to be brought into memory from disk. Major faults are, of course, much more expensive and take much longer to complete then

the soft ones. Large number of major page faults can slow the system down to the crawl. On an average system, major page faults are responsible for the vast majority of the CPU time spent in the kernel mode. Overall picture of the system memory can be obtained by using r ( report ) option and looks like this: $ sar r 3 3 Linux 2.6.17 1.2142_FC4 (medo.noip.com) 08/13/2006 10:17:47 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad 10:17:50 PM 58640 456864 88.62 8292 230496 1347644 225212 14.32 31352 10:17:53 PM 58640 456864 88.62 8300 230496 1347644 225212 14.32 31352 10:17:56 PM 58656 456848 88.62 8308 230496 1347644 225212 14.32 31352 Average: 58645 456859 88.62 8300 230496 1347644 225212 14.32 31352 The columns in the display are more or less self explanatory. There are also two other tools, used for investigating processes in detail, by following their execution path through exposing the system and library calls they make. Those tools are called strace and ltrace. They trace system and library calls, respectively. Those tools are not tools for monitoring virtual memory per se, but do deserve being honorary mentioned as they are otherwise extremely useful. Our problem at hand, however, was virtual memory system misbehaving. How can we tell that virtual memory is misbehaving? There are several strong indicators: Our system is spending much of the time in kernel mode, executing system calls. That, in itself, is not a sufficient indication, as there may be many reasons for that. Another major reasons for system spending large amount of time in kernel mode include faulty equipment which sends interrupt requests uncontrollably, PIO (Programmed I/O, character by character is an extremely expensive operation as each separate character requires a separate interrupt) or a kernel bug. This is not a 100% certain sign of a virtual memory problems but is a sure sign of some problem. There are many page faults ( majflts from the sar output above), pages are being constantly brought in and out. This will cause significant system strain and is a definite sign of a virtual memory problem. This is where atop comes in handy: other tools will not help you in identifying the culprit, they will only tell you that something is paging like crazy on the system. The page stealing process pdflush is very active and constantly shows up among the processes responsible for the most disk I/O activity. Swap consumption continues to increase.

Atop Monitor All of these OS aspects can be monitored and diagnosed using atop. Other monitoring tools do not provide as clean and readable per process information which could help the administrator ascertain that the interaction of Oracle and VM is the problem. Of course, standard tools, usable without kernel patches, can also help diagnose problems with virtual memory subsystem. The best tool for monitoring the overall health of the VM subsystem is sar, which is a part of the sysstat package for Red Hat Fedora and EL operating systems. Programs in this package, sar and vmstat will provide you with the global state of affairs as far as virtual memory is concerned, but will not make it possible to drill down to the process level. The Goal Now, the problem has been diagnosed as overly active virtual memory subsystem. What should we do about it and how can we do it? The best thing to do would be to tune the culprit program and make it behave nicely with respect to virtual memory. That, alas, is not an option with commercial programs like Oracle RDBMS which do not provide the source code. If the program itself cannot be tuned, we have to tune the system, which means to adjust the operating mode of the system in such a way that our special program, in this case Oracle RDBMS, can operate as smoothly as possible.

The goal of the tuning effort is to bring the system in the stable state, where it will not spend time maintaining virtual memory but will devote that time to work on user's problems. Please, note that this goal is not equivalent to eliminating paging. Some people try with usual locking the SGA, and that, in most of the cases, makes things worse. Why? Look at the monitor above: there is web browser (firefox), word processor (swriter.bin) and printing subsystem (eggcups), all active along with Oracle instance. If I allow Oracle to take the SGA out of the control of virtual memory subsystem and prevent paging and swapping on 50% of the system memory which comprises the SGA, all these processes will have to make do with just 50% of memory, competing for the resources and paging like crazy. Will my Oracle instance work faster as a result? Definitely not. Page fault is implemented as a system trap and its execution prevents other processes from running for a short period of time. Oracle processes will have the whole SGA in memory, even with the parts that are rarely used, but will be unable to run because the kernel will be constantly running in the kernel mode, bringing pages needed by the word processor, printer and web browser in and out of memory. The whole system will slow down to the crawl. The solution, in this case, would be to refrain from doing things like writing articles about virtual memory on database servers. The other thing to do would be decreasing the SGA and leaving more memory for everything else. If the system is busy paging, trimming down the buffer cache will actually result in increased performance, despite the lower BCHR (buffer Cache Hit Ratio, an old system performance metrics, rarely used today) Opposite of that is purchasing more memory, also known as throwing money at the problem. That may or may not solve the problem, depending on what uses the memory. If the memory is used for file system buffers, which are allocated dynamically, and paged in and out, as needed, then purchasing more memory will not be extremely helpful. Oracle RDBMS keeps data in the OS files. If I/O operations referencing those files use system buffers, the memory consumption can potentially be enormous, as database files may be enormous. That is known as the double buffering. Oracle blocks are being cached by both Oracle and operating system, in two separate memory areas, with two different cache management systems. This situation is not without its benefits, either, but, generally speaking, tends to stimulate exorbitant memory consumption. Benefits for such situation come from pre fetch features of most of modern file systems, which will bring in many more blocks then just the requested one, and make full table scans and fast full index scans perform better. Unfortunately, server processes will have to compete for memory with the file system, instead of just allocating free memory, thus increasing swap consumption and slowing things down. Operations that will be badly affected are operations which use lot of memory: hash joins, large sorts or PL/SQL objects with the internal arrays and tables. The solution here is to make Oracle do I/O bypassing the system buffer cache. In other words, Oracle should use direct I/O or raw devices. What else, except SGA, can make Oracle start using large quantities of memory? Oracle processes allocate PGA for doing sorts, hash joins, storing session variables, PL/SQL tables and cursor statuses.

Note that P in PGA stands for Private, which means that every process allocates its own PGA (except when using MTS, which is outside of the scope of this article) and that means that locking SGA in memory is unlikely to help. That is the classic situation every DBA faces from time to time. SGA is fixed in memory, it size never increases and never decreases. If the DBA doesn't go overboard with the initial allocation and doesn't use BCHR as the most important tuning metrics, SGA is unlikely to ever cause a serious headache. Shared memory is always the last one to be paged out, so it will almost never happen. The problem is, therefore, aggressive allocation of PGA areas. How to speed it up? The best way to do that is to ensure that there is always enough memory for an Oracle process to allocate an average PGA and to control the overall PGA allocation through the Oracle instance parameters. Automatic PGA allocation is described in many papers and is quite easy to set up and control. What needs to be done is to ensure that there is always enough of free memory for an average allocation to be completed quickly and without performance problems. Now, that is a well defined goal that Linux kernel parameters can help us achieve. Of course, before we proceed with parameter changes, one should always try tuning SQL and reduce the need for memory. That should always be the first line of defense. Creating an index and turning hash join into a nested loops join is definitely less painful option than changing VM parameters for the whole system or purchasing more memory. Sometimes, though, tuning SQL cannot help us. Big quarterly report just has to go through all of the data for the quarter and summarize them, accessing several dimension tables along the way. In this case, we can only help Oracle manage PGA. Automated PGA management will keep overall memory consumption under control, but for each process, PGA allocation can be a major hurdle. Here is what needs to happen, so that the process can allocate enough memory: 1. Process needs to allocate enough space on the swap partition to which the newly allocated memory will be written, should the need arise. 2. Process needs to allocate sufficient number of clean memory pages. It will first try to find sufficient number of pages which are not in use and have a valid backup on disk (aka free pages ) but if if fails, it will try to back up some pages to disk, fill them with zeros (aka clean them ) and allocate to the requesting process. In doing so, the process will activate page stealing process pdflush and, of course, wait for the process to free enough pages so that the allocation can proceed. 3. Page tables will have to be extended and segment sizes updated. Again, that might require some help from pdflush process. 4. Process statistics about memory usage has to be updated. Backing pages up and cleaning them can result in several I/O requests. The requesting process cannot proceed before those pages are allocated, so there will be significant wait here. Time needed for disk access is measured in milliseconds, while the time needed for memory access is measured in nanoseconds. Memory access is approximately 100,000 times faster then the disk access. Impact of

hard page faults will severely impact all users of the system. What we want to do is to instruct the Linux virtual memory system to always maintain enough memory so that the PGA allocation, should the need arise, can be completed quickly. Let's have Linux kernel do preventive maintenance, so that Oracle never has to wait for pages to be written down. Now, that is the well defined goal that can be achieved by tweaking parameters. The Means In order to help VM system to maintain enough free memory, there are the following kernel parameters: Min_free_kbytes defines how much free memory should be maintained. Dirty_background_ratio defines the percentage of dirty pages that will activate pdflush. Dirty_expire_centisecs defines how long should the data be in memory before becoming eligible for being written out by pdflush. Dirty_writeback_centisecs defines the period between pdflush hands. Swapiness composite parameter which defines inclination of the system for swapping. These parameters regulate several things: When do pages get flushed to disk Which pages will be flushed to disk. How many pages will be written to disk. The goal is to maintain sufficient quantity of free memory while not causing page replacement process to become overly active and the whole system to become I/O bound.. So far, we have been talking about page replacement, page stealing and page freeing, but we haven't mentioned the word swapping. What is swapping? Swapping does not operate on pages, it operates on processes. Swapper is the process which, when memory is at the very low point, writes the address space of a whole process to the swap partition and frees the memory. Usually, on other Unix systems, that is a desperate measure which kicks in only when the amount of free memory falls beneath the number of pages defined by the parameter minfree. Unfortunately, Linux doesn't have minfree parameter which would even remotely resemble the usual culprits on Solaris or HP UX. Linux 2.6 has swappiness, which is defined as tendency of system to swap. It is a dashboard indicator, in the best Microsoft style, which can take values between 0 and 100, with the default of 60.

[root@medo ~]# cat /proc/sys/vm/swappiness 60 [root@medo ~]# Of course, this parameter is not defined in kernel documentation that comes with the system and there are no good recommendations. Personally, I turned it off by setting it to 0, without any ill effects, but without any noticeable improvements, either. I want my system to start swapping only in times of crisis, and Linux doesn't provide any type of control that makes sense. Linux doesn't provide many means for system administrator to tune VM, asking you instead to rely on the intelligence of the system. Despite all the search there is no proof of intelligence within Linux kernel yet. This undocumented parameter is rumored to be the famous run faster button. When I set it to 100, my system started to swap frequently [root@medo ~]# echo "100">/proc/sys/vm/swappiness [root@medo ~]# cat /proc/sys/vm/swappiness 100 [root@medo ~]# Below is the picture of my system, with Oracle instance started and swappiness set to 100: As is visible from the picture, swapper is active, but the amount of free memory is not particularly large. This parameter has just activated the swapper, without any beneficial effect on performance. I haven't noticed any bad effects from setting it to 0, for as long as page replacement mechanism is able

to cope with the amount of available memory. To maintain the necessary amount of free memory, it is much better to set min_free_kbytes which will cause the page replacement process to maintain at least that many free kilobytes of memory. It is also worth noticing that Linux misses parameters for fine regulation of the virtual memory, defining when will the page replacement process be activated, deactivated, sped up or slowed down. What I do not want to do is to give numbers and advise maintaining at least 10% or 15% or 20% of your RAM as free memory, as the number is very dependent on the behavior and type of the system. Linux parameters are not very well documented and I've seen systems that behaved well without a single one of the parameters above being changed. The parameters described earlier are the tools that database and system administrators have at their disposal in case of urgent problems. These parameters can be used to patch situation quickly until the main problem is resolved. Sudden and rapid growth in the memory consumption is always a sign of change and changing the virtual memory parameters usually doesn't resolve the problem but it can alleviate the pain. There is no one size fit all best strategy, capable of determining the values of those parameters in advance. It is important to underline the fact that changing these parameters should be done as a reaction to a problem, not as an preventive measure which will prevent demons from possessing our little Linux box. As you can see, Linux is rather limited in its tuning capabilities and tools and rather poorly instrumented. When an operating system becomes fully instrumented and stable, it usually also becomes obsolete. Judging by that, Linux is light years away from becoming obsolete.