Monitoring Linux with native tools




Robert Andresen
30th Annual International Conference of The Computer Measurement Group, Inc.
December 5-10, 2004, Las Vegas, Nevada, USA

Linux is gaining interest as a solution across many hardware platforms: Intel-based machines, Sun and Apple proprietary hardware, and IBM zSeries. But once applications are ported to an open source operating system, what options are available to monitor their performance and availability? This presentation covers native Linux solutions for monitoring performance and collecting statistics for capacity planning. We will look at tools ranging from real-time monitors to those that can build a database of historical system performance.

Introduction

Linux has caught the interest of many different groups of people. It has always had a following among die-hard techies, especially those who dislike Microsoft. As it has become more stable and feature-rich, more and more shops are using it in their server farms. Lower perceived costs are attractive: some smaller companies or not-for-profit organizations need to save every penny they can, and many larger shops use it as a negotiating ploy to keep their Microsoft contract costs down.

Now that Linux is available on zSeries hardware, a whole new group of people is getting interested: the traditional IBM mainframers. Some are interested in the cost savings available by moving traditional mainframe applications to Linux on an IFL, either reducing the traditional MIPS needed or avoiding costly upgrades. Others find the new technology exciting; some look at Linux as a way to extend the life of the mainframe at their organization, and with it, perhaps their jobs. Before Linux can run production applications, however, there will be a requirement to monitor its performance.

Reasons to Monitor

There are ultimately two different purposes behind monitoring any computer system, and they map onto two different systems management functions.
Systems administrators and systems programmers care more about the installation, tuning, and troubleshooting of the systems under their control. Capacity planners care more about building a performance database to analyze and predict resource consumption over time, looking to predict growth and the upgrades required to sustain it. Because of this difference of purpose, these groups will require different tools, though there are definite areas of overlap. Systems management generally requires tools that show what is happening right now, whereas capacity planning tends to be more concerned with trending resource usage over time to recognize growth and future bottlenecks. There is overlap, of course, since you cannot do systems management troubleshooting without understanding acceptable ranges for system performance metrics. So where the capacity planning function needs historical metrics to predict future growth bottlenecks, the performance management function needs similar historical data to know what to look for in the metrics and which values indicate a performance problem.

Metrics to Measure

It feels a bit presumptuous to advise CMG as to which metrics need to be measured for any system, but the topic needs to be addressed briefly before we jump into the different tools available. Ultimately, the metrics we care about are much the same for any computer system. We need to measure use of all physical resources, the usual suspects for x86 and zSeries:

CPU
Memory
Disk devices and controllers
Network devices

We also need to measure use of system-level resources that may impact performance and capacity:

Paging
Swapping
Inter-process communication constructs

There may be more system-level resources depending on the applications in use, e.g. database locks, but those monitors tend to be part of application support packages.

CPU Utilization

In Linux, we care not only about the percentage of available CPU used, but also about the use by different states. The four CPU states in Linux are:

User: application use of CPU
System: CPU used by kernel functions such as I/O or networking
Nice: user CPU use where the process has voluntarily lowered its priority to allow higher-priority work to run
Idle: CPU not being used, available for additional work

Virtualization

Linux may not be running on the bare metal of the machine. It may be running in a virtual machine managed by VMware (x86 platforms) or z/VM (IBM zSeries platforms). This can make the CPU percentages somewhat misleading. Linux may think it is using 80% of the CPU, but the virtual machine it is in may only be given 10% of the real CPU by the virtual machine manager. In that case the Linux instance would be consuming 8% of the real CPU. For this reason it becomes necessary to measure the CPU percentages allowed by the virtualization manager, and then prorate the CPU percentages reported in each Linux by this amount. How you obtain the percentage with which to prorate depends on the virtualization manager. VMware can run on the bare metal of the machine, or under another operating system such as Windows or Linux. z/VM provides this data, which is then available to a number of monitoring packages.

Memory

Linux uses both real memory and swap files, similar to virtual memory. On an x86 platform the swap file is usually defined as twice the amount of real memory.
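The proration just described is simple arithmetic. A minimal sketch using the illustrative figures from the text (Linux reports itself 80% busy while the hypervisor grants the guest a 10% share of a real CPU; both numbers are examples, not measurements):

```shell
# Prorate guest-reported CPU by the share of real CPU the hypervisor
# grants the virtual machine. Both inputs are illustrative values
# taken from the text, not measured figures.
guest_pct=80    # what Linux believes it is using
share_pct=10    # real-CPU share granted by VMware/z/VM
awk -v g="$guest_pct" -v s="$share_pct" \
    'BEGIN { printf "effective real CPU: %.1f%%\n", g * s / 100 }'
# prints: effective real CPU: 8.0%
```

The same proration must be applied to every guest on the box before their CPU numbers can be compared or summed.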
On zSeries it turns out to be a mistake to over-allocate memory to the Linux systems under z/VM, as the memory will be used up and cause additional paging at the z/VM level. The recommendation is to give each Linux instance as little memory as it can get by on.

If a Linux system needs more real memory than is available, it will use its swap file to free up memory so that another process's memory can be resident in real storage. At some point this swapping can turn into thrashing, where a process gets swapped in, cannot finish what it needs to do, and is swapped out again. Attempting to run X-Windows on a machine without enough memory is a classic example of this thrashing: X-Windows spawns a number of processes that use up all the real memory and get swapped out by Linux. None of these processes gets to do any of the work it intends to do, as Linux is using the entire CPU reading pages in and writing pages out. One of our goals is to be able to measure this activity and have enough real resources to match the anticipated application workload.

Types of Tools

In this paper we will look at four different types of tools:

Real-time displays automatically refresh system performance metrics
Static commands display snapshots of system performance metrics
The /proc filesystem is a pseudo-filesystem that contains these metrics
The sysstat project is a Linux project to display and collect these metrics

Real-Time Displays

top

The first example is the top command. top is an auto-refreshing list of the processes using the most CPU; by default it sorts them in descending order by CPU use. You may change the sort order, fields displayed, and refresh rate either interactively or by configuration. An example is shown (Fig. 1).

8:28pm up 28 min, 2 users, load average: 0.00, 0.06, 0.16
102 processes: 99 sleeping, 2 running, 1 zombie, 0 stopped
CPU states: 0.7% user, 1.5% system, 0.0% nice, 97.6% idle
Mem: 1031408K av, 577164K used, 454244K free, 0K shrd, 55344K buff
Swap: 514072K av, 0K used, 514072K free              265008K cached

 PID USER  PRI NI  SIZE  RSS SHARE STAT %CPU %MEM TIME COMMAND
7060 rda   15   0 11876  11M 10476 R     0.5  1.1 0:00 kdeinit
1483 root  15   0 55744  11M  3064 S     0.3  1.1 0:06 X
1615 rda   15   0  9864 9860  9048 S     0.3  0.9 0:01 kdeinit
7093 rda   15   0   996  996   784 R     0.3  0.0 0:00 top
6957 rda   15   0 50812  49M 40272 S     0.1  4.9 0:06 soffice.bin
   1 root  15   0   480  480   428 S     0.0  0.0 0:04 init
   2 root  15   0     0    0     0 SW    0.0  0.0 0:00 keventd
   3 root  15   0     0    0     0 SW    0.0  0.0 0:00 kapmd
   4 root  34  19     0    0     0 SWN   0.0  0.0 0:00 ksoftirqd_cpu0
   5 root  15   0     0    0     0 SW    0.0  0.0 0:00 kswapd
   6 root  25   0     0    0     0 SW    0.0  0.0 0:00 bdflush
   7 root  15   0     0    0     0 SW    0.0  0.0 0:00 kupdated
   8 root  25   0     0    0     0 SW    0.0  0.0 0:00 mdrecoveryd
  12 root  15   0     0    0     0 SW    0.0  0.0 0:00 kjournald
  91 root  15   0     0    0     0 SW    0.0  0.0 0:00 khubd
 651 root  16   0   736  736   632 S     0.0  0.0 0:00 dhcpcd
 780 root  15   0   540  540   460 S     0.0  0.0 0:00 syslogd

Figure 1: top command

Important fields to look at in top output include:

Load average: These three numbers show the number of runnable processes in the CPU queue over the past minute, five minutes, and fifteen minutes. These numbers need to be compared to the number of CPUs available to that Linux.

CPU states: This line shows the percentage utilization for each of the four possible CPU states.

Mem & Swap: These two lines show how much real memory and swap space is available, in use, and free. Also shown are the memory used for buffers, cache, and shared memory (one of the inter-process communication options).

Process information: For currently runnable processes, each line shows the process number, owning user, priority, nice value, size (code + data + stack space), RSS (total amount of physical memory), shared memory in use, status (Running, Sleeping, Nice value, sWapped), %CPU being used, %Memory being used, CPU time since the task started, and the command issued to start the process.

A common debugging use of top is when the CPU spikes: watch the output of top to see which process is causing the spike. That process may then be stopped with a kill command.

xload

xload is a graphic representation of CPU load. It is more of an operational warning than a performance tool. You may set the colors and refresh rate.

Figure 2: xload

vmstat

This command gives information about processes, memory, paging, block IO, traps, and CPU activity. Do not confuse this output with z/VM performance metrics; no native Linux tool knows about z/VM performance.

[rda@localhost rda]$ vmstat -n 5
   procs               memory      swap        io    system       cpu
 r  b  w  swpd   free   buff  cache  si  so  bi  bo  in   cs  us sy id
 0  0  0     0 536828  61192 254416   0   0  22  13 155  313   2  1 97
 1  0  0     0 536828  61192 254416   0   0   0   7 279  224   1  0 99
 0  0  0     0 536828  61192 254416   0   0   0   3 279  222   0  0 99
 0  0  0     0 536828  61192 254416   0   0   0   3 278  221   0  1 99
 0  0  0     0 536828  61192 254416   0   0   0  19 349  492   1  1 98

Figure 3: vmstat

In this example (Fig. 3), the -n 5 parameter says refresh every 5 seconds. The first line produced gives averages since the last reboot. Additional lines give information on a sampling period of the requested length. The process and memory reports are instantaneous in either case. The fields mean:

procs: r - number of processes waiting to run; b - number in uninterruptible sleep; w - number swapped out, but otherwise runnable
memory: swap used, free, buffers, cache
swap: swap-ins, swap-outs
io: blocks in, blocks out
system: interrupts per second (including the clock), context switches per second
cpu: user, system, idle

vmstat excludes itself from the statistics it presents.

Static Commands

Static commands only give a snapshot of what is happening at the time the command is issued; they do not refresh. They may be re-issued if you want to see how the metrics are changing, or they may be invoked from shell scripts and the output stored in files for historical analysis later.

uptime

uptime shows the same information as the first line of top: how long the system has been running and the load average numbers (see Fig. 4).

8:40pm up 40 min, 2 users, load average: 0.03, 0.06, 0.10

Figure 4: uptime

free

free displays similar information to the Mem and Swap sections of the top command (see Fig. 5).

             total    used    free  shared  buffers  cached
Mem:       1031408  503616  527792       0    69076  242780
-/+ buffers/cache:  191760  839648
Swap:       514072       0  514072

Figure 5: free

ps

This command displays the running processes, according to the authority of the issuer and the parameters used. It is more of a performance diagnostic tool than a performance reporting tool. This example (with the -f option) shows: user, process id, parent process id, child count, start time, associated terminal device, CPU time, and the command.

rda 2315 2313 0 09:04 pts/3 00:00:00 /bin/bash
rda 2838 2315 0 10:28 pts/3 00:00:00 xload -fg red -bg white
rda 2910 2315 0 10:46 pts/3 00:00:00 ps -f

Figure 6: ps

netstat

netstat displays network statistics, with many different options to choose from. Common parameters are routes, interfaces, and statistics. This is also more of a diagnostic tool than a performance reporting tool.

[root@localhost init.d]# netstat -r
Kernel IP routing table
Destination   Gateway       Genmask        Flags  MSS Window irtt Iface
172.25.167.0  *             255.255.255.0  U       40 0         0 eth0
127.0.0.0     *             255.0.0.0      U       40 0         0 lo
default       172.25.167.1  0.0.0.0        UG      40 0         0 eth0
[root@localhost init.d]# netstat -i
Kernel Interface table
Iface  MTU   Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500  0   30080 0      0      0      16393 0      0      0      BMNR
lo     16436 0   10563 0      0      0      10563 0      0      0      LRU

Figure 7: netstat displays

In the example (see Fig. 7), some of the fields shown are RX: received and TX: transmitted, each counted as OK, Error, Dropped, or OVR (unable to transmit).

The /proc Filesystem

This is a pseudo-filesystem used to access kernel performance metrics as well as to update some kernel parameters. Scripts or programs may access this data merely by reading the appropriate files. For example, if we do a directory listing of /proc, we see an entry for every running process, as well as entries for system-level metrics (see Fig. 8).

[root@localhost proc]# pwd
/proc
[root@localhost proc]# ls
1     1282  1333  1538  1591  1667  2262  91           ide          mtrr
1015  1283  1354  1541  1592  1668  2345  945          interrupts   net
1037  1284  1381  1544  1595  1669  2353  948          iomem        partitions
1067  1285  1400  1546  1598  1670  3     apm          ioports      pci
1086  1286  1426  1557  1637  1671  4     bus          irq          scsi
1104  1287  1427  1558  1644  1672  5     cmdline      kcore        self
1160  1288  1428  1568  1646  1989  6     cpuinfo      kmsg         slabinfo
1184  1289  1429  1569  1659  2     650   devices      ksyms        stat
12    1290  1430  1571  1660  2000  7     dma          loadavg      swaps
1217  1291  1431  1572  1661  2013  737   dri          locks        sys
1232  1292  1432  1574  1662  2014  742   driver       mdstat       sysvipc
1255  1293  1439  1579  1663  2015  763   execdomains  meminfo      tty
1278  1300  1441  1580  1664  2016  791   fb           misc         uptime
1280  1301  1442  1587  1665  2024  8     filesystems  modules      version
1281  1315  1459  1589  1666  2259  881   fs           mounts

Figure 8: /proc contents

Each number represents a running process; the names are system metrics from the kernel. If we look at /proc/stat we see the relevant values (see Fig. 9).
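Because the counters in /proc/stat are cumulative jiffies, a script can derive CPU utilization by reading the aggregate cpu line twice and differencing. A minimal sketch, assuming the 2.4-era field order shown here (user, nice, system, idle); later kernels append further fields, which this sketch simply ignores:

```shell
# Sample the aggregate cpu line from /proc/stat twice and report the
# CPU split over the interval. Field order (user nice system idle)
# follows the 2.4-era format; any extra fields on newer kernels land
# in the trailing "rest" variable and are ignored.
read -r _ u1 n1 s1 i1 rest < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 rest < /proc/stat
awk -v du=$((u2 - u1)) -v dn=$((n2 - n1)) \
    -v ds=$((s2 - s1)) -v di=$((i2 - i1)) \
    'BEGIN { t = du + dn + ds + di
             printf "user %.1f%%  nice %.1f%%  system %.1f%%  idle %.1f%%\n",
                    100*du/t, 100*dn/t, 100*ds/t, 100*di/t }'
```

This is exactly the differencing that top, vmstat, and the sysstat tools perform internally.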

[root@localhost 2]# cat /proc/stat
cpu 12412 70 2968 506115
cpu0 12412 70 2968 506115
page 232980 82619
swap 1 0
intr 749494 521565 8314 0 17 6 2 6 3 1 3 2 96434 72512 0 25308 25321
disk_io: (3,0):(25524,19303,465378,6221,165216) (11,0):(18,18,72,0,0)
ctxt 2121146
btime 1076535165
processes 2475

Figure 9: /proc/stat

The man page for proc documents all the values, file by file, within /proc. An example is shown for /proc/stat (see Fig. 9.5):

cpu 3357 0 4313 1362393
    The number of jiffies (1/100ths of a second) that the system spent in user mode, user mode with low priority (nice), system mode, and the idle task, respectively. The last value should be 100 times the second entry in the uptime pseudo-file.
page 5741 1808
    The number of pages the system paged in and the number that were paged out (from disk).
swap 1 0
    The number of swap pages that have been brought in and out.
intr 1462898
    The number of interrupts received since system boot.
disk_io: (2,0):(31,30,5764,1,2) (3,0):...
    (major,minor):(noinfo, read_io_ops, blks_read, write_io_ops, blks_written)
ctxt 115315
    The number of context switches that the system underwent.
btime 769041601
    Boot time, in seconds since the epoch (January 1, 1970).
processes 86031
    Number of forks since boot.

As you can see, these files are not user-friendly displays for diagnostic purposes. They are better suited to being read by programs or scripts that strip out the needed metrics; they are where the other tools discussed in this paper get their data. Here is what is available for a running process (see Fig. 10).

[root@localhost 2]# cat /proc/2/status
Name:      keventd
State:     S (sleeping)
Tgid:      2
Pid:       2
PPid:      1
TracerPid: 0
Uid:       0 0 0 0
Gid:       0 0 0 0
FDSize:    32
Groups:
SigPnd:    0000000000000000
SigBlk:    fffffffffffeffff
SigIgn:    0000000000010000
SigCgt:    0000000000000000
CapInh:    0000000000000000
CapPrm:    00000000ffffffff
CapEff:    00000000fffffeff

Figure 10: /proc filesystem process information

The sysstat Project

Thankfully, there is a Linux project to mine the raw data out of the /proc filesystem and make it available for display as well as for building a historical database. The sysstat project is led by Sébastien Godard of France. Web links to project information:

http://freshmeat.net/projects/sysstat/
http://perso.wanadoo.fr/sebastien.godard/

The project includes:

iostat: monitors system input and output device loading by comparing the time the devices are active in relation to their average transfer rates
mpstat: monitors CPU activity, aggregate and per CPU
sar: collects, saves, and reports system activity metrics

iostat

iostat generates two reports: CPU activity, followed by device utilization (see Fig. 11).

Linux 2.4.18-3 (dhcp64-134-114-41.hhwh.hou.wayport.net)  02/23/2004

avg-cpu:  %user  %nice  %sys  %idle
          10.26   0.00  2.82  86.91

Device:     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
dev3-0    22.84      420.82       93.67    312066     69464
dev11-0    0.02        0.10        0.00        72         0

Figure 11: iostat

Again, notice the percentages of the four CPU states. The device section shows transfers per second (tps) by device and blocks read or written per second, as well as the total count of blocks read and written. Giving -d an interval n and a count c causes iostat to display a new report every n seconds, c times.

mpstat

mpstat displays processor utilization: the percentage of time in each CPU state and the number of interrupts per second (see Fig. 12).

Linux 2.4.18-3 (localhost.localdomain)  02/24/2004

10:51:55 AM  CPU  %user  %nice  %system  %idle  intr/s
10:51:55 AM  all   2.26   0.01     0.58  97.15  154.79

Figure 12: mpstat

An interval and a count may optionally be specified to make mpstat redisplay every n seconds, c times.

sar

sar provides three major functions:

Creates daily performance files with all system metrics
Displays metrics from the current or a previous day's file
Extracts data from saved performance files in formats suitable for loading into spreadsheets or databases

File creation is based on the -o option. The data is saved in binary format in files named, by default, /var/log/sa/sadd. Example files are shown (see Fig. 13).

[root@localhost sa]# pwd
/var/log/sa
[root@localhost sa]# ls -al
total 88
drwxr-xr-x  2 root root  4096 Feb 25 10:40 .
drwxr-xr-x  9 root root  4096 Feb 25 10:56 ..
-rw-r--r--  1 root root 21701 Feb 23 21:20 sa23
-rw-r--r--  1 root root 42197 Feb 24 15:20 sa24
-rw-r--r--  1 root root  9989 Feb 25 12:10 sa25

Figure 13: sar files
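The daily sadd files are produced by cron entries that the sysstat package installs to run its sa1 and sa2 collectors. A sketch of what such a crontab looks like on Red Hat-style systems of the period; the paths and intervals here are illustrative and vary by distribution, so treat this as a sketch rather than the shipped file:

```shell
# /etc/cron.d/sysstat (illustrative sketch; exact paths vary by distro)
# Take one sample every ten minutes, appending to /var/log/sa/saDD
*/10 * * * * root /usr/lib/sa/sa1 1 1
# Near midnight, write the daily summary report for the day's file
53 23 * * * root /usr/lib/sa/sa2 -A
```

Shortening the sa1 interval gives finer-grained history at the cost of larger binary files.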

Notice that with the default naming convention only a month of data is kept in the binary files. You may override this with the -o option, or extract the data into another format with these parameters:

-e hh:mm:ss  Set the ending time of the report
-f filename  Extract records from filename
-h  When reading data from a file, print its contents in a format that can easily be handled by pattern processing commands like awk
-H  When reading data from a file, print its contents in a format that can easily be ingested by a relational database system
-i interval  Select data records at seconds as close as possible to the number specified by the interval parameter
-s hh:mm:ss  Set the starting time of the data
-t  When reading data from a daily data file, display the timestamps in the original local time of the data file creator

And what kind of data can be extracted? These parameters show what is saved in the binary files:

-b  Report I/O and transfer rate statistics
-B  Report paging statistics
-c  Report process creation activity
-d  Report activity for each block device (kernels 2.4 and later only)
-I { irq | SUM | PROC | ALL | XALL }  Report statistics for a given interrupt
-n { DEV | EDEV | SOCK | FULL }  Report network statistics
-q  Report queue length and load averages
-r  Report memory and swap space utilization statistics
-R  Report memory statistics
-u  Report CPU utilization

What does this look like in practice? Let's see CPU activity starting at 10:00 a.m. (see Fig. 14).
[rda@localhost rda]$ sar -s 10:00:00
Linux 2.4.18-3 (localhost.localdomain)  02/24/2004

10:00:00 AM  CPU  %user  %nice  %system  %idle
10:10:00 AM  all   1.11   0.00     0.38  98.51
10:20:00 AM  all   4.18   0.00     0.85  94.97
10:30:00 AM  all   1.39   0.03     0.37  98.21
10:40:00 AM  all   1.61   0.00     0.44  97.95
10:50:00 AM  all   2.14   0.00     0.44  97.43
11:00:00 AM  all   1.87   0.00     0.45  97.68
11:10:00 AM  all   3.52   0.00     0.40  96.09
11:20:00 AM  all   2.77   0.00     0.42  96.80
11:30:00 AM  all   0.58   0.00     0.36  99.05
11:40:00 AM  all   3.99   0.00     0.62  95.39
11:50:00 AM  all   4.00   0.00     1.13  94.86
12:00:00 PM  all   5.05   0.00     0.52  94.43
12:10:00 PM  all   5.64   0.00     0.43  93.94
Average:     all   2.91   0.00     0.52  96.56

Figure 14: sar -s output
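Report output like Fig. 14 is whitespace-delimited and easy to post-process with pattern tools. As an illustration, this sketch flags any interval whose %idle falls below an arbitrary 95% threshold, run against lines copied from the figure:

```shell
# Flag sar CPU intervals whose %idle (the last field) falls below 95%.
# The sample records are copied from Fig. 14; field 3 is the CPU id
# and the 95 threshold is an arbitrary illustration.
awk '$3 == "all" && $NF + 0 < 95 { print $1, $2, "idle " $NF "%" }' <<'EOF'
10:10:00 AM all 1.11 0.00 0.38 98.51
10:20:00 AM all 4.18 0.00 0.85 94.97
12:00:00 PM all 5.05 0.00 0.52 94.43
EOF
# prints: 10:20:00 AM idle 94.97%
#         12:00:00 PM idle 94.43%
```

In practice the heredoc would be replaced by a pipe from sar itself.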

Or perhaps we would like to see paging data from February 24th, ending at 9:30 a.m. (see Fig. 15).

[root@localhost init.d]# sar -B -f /var/log/sa/sa24 -e 09:30:00
Linux 2.4.18-3 (localhost.localdomain)  02/24/2004

08:20:00 AM  pgpgin/s  pgpgout/s  activepg  inadtypg  inaclnpg  inatarpg
08:30:00 AM      0.57       7.60     52313       698     10596     12721
08:40:00 AM      2.48      28.25     58419       976     10717     14022
08:50:00 AM      4.34       9.55     61104       993     10602     14539
09:00:00 AM     35.70       5.78     71026      1005     10616     16529
09:10:00 AM    114.21      21.06     95643      2583     10627     21770
09:20:00 AM      2.09       6.03     95654      2584     10627     21773
09:30:00 AM      2.39       4.63     96261      2584     10774     21923
Average:        26.57      13.04     75774      1632     10651     17611

Figure 15: sar -B output

Hmmm, this is starting to remind me of SMF data from those old extinct IBM mainframes (remember them?). The system writes performance metrics to an internal file at a specified interval; I can extract the records I want and load them into a database or even a spreadsheet. We can see what the extract output looks like with the -H option for a relational database (see Fig. 16).

[root@localhost init.d]# sar -B -f /var/log/sa/sa24 -H
localhost.localdomain;600;2004-02-24 14:30:00 UTC;0.57;7.60;52313;698;10596;12721
localhost.localdomain;599;2004-02-24 14:40:00 UTC;2.48;28.25;58419;976;10717;14022
localhost.localdomain;600;2004-02-24 14:50:00 UTC;4.34;9.55;61104;993;10602;14539
localhost.localdomain;600;2004-02-24 15:00:00 UTC;35.70;5.78;71026;1005;10616;16529
localhost.localdomain;600;2004-02-24 15:10:00 UTC;114.21;21.06;95643;2583;10627;21770
localhost.localdomain;600;2004-02-24 15:20:00
localhost.localdomain;600;2004-02-24 18:10:00 UTC;0.05;4.58;82698;4861;11113;19734
localhost.localdomain;600;2004-02-24 18:20:00 UTC;0.13;5.04;83208;4862;11086;19831
localhost.localdomain;600;2004-02-24 18:30:00 UTC;0.11;8.31;80067;4812;11224;19220

Figure 16: sar -B with -H for semicolons

If we save this as a text file, both Excel and OpenOffice allow a semicolon to be specified as the field delimiter (see Fig. 17).
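For loaders that expect commas rather than the semicolons sar -H emits, conversion is a one-liner. A sketch using a record copied from Fig. 16:

```shell
# Turn sar -H semicolon-delimited records into CSV for a spreadsheet
# or a database bulk loader; the sample line is copied from Fig. 16.
tr ';' ',' <<'EOF'
localhost.localdomain;600;2004-02-24 14:30:00 UTC;0.57;7.60;52313;698;10596;12721
EOF
# prints: localhost.localdomain,600,2004-02-24 14:30:00 UTC,0.57,7.60,52313,698,10596,12721
```

From there a bulk-load statement in whatever database is at hand can ingest the file directly.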

Figure 17: import text dialog box

Once we load our data into a spreadsheet or a database, we can generate our performance reports and graphs (see Fig. 18).

[Figure 18 is a line chart of swap activity over time: swaps, pages out, and pages in, plotted from 14:30 to 15:50 on a 0-140 scale.]

Figure 18: swap activity graph

Now we have a tool to track Linux system performance over time and make capacity planning predictions.