Advanced System Monitoring with Nagios, PNP4Nagios and NConf Josh Malone Systems Administrator National Radio Astronomy Observatory Charlottesville, VA

Nagios is great
- It checks your servers
- It tells you when there are problems
But Services keep expanding
We work in larger teams
We all want to work on things at the same time
Management demands data
You need the right tools
We Need to Engineer a Monitoring Solution That Goes to 11!

The Right Addons
- PNP4Nagios: Graph the data from your service checks
  https://github.com/lingej/pnp4nagios
  https://docs.pnp4nagios.org/pnp-0.6/
- NConf: Web-based Nagios configurator
  http://www.nconf.org/dokuwiki/doku.php
  https://github.com/nconf/nconf

The Right Plugins
- Online plugin repositories
  - Nagios Exchange
  - Icinga Exchange
  - Monitoring Plugins
- But... if you want something done right, write it yourself! And write it RIGHT!
PNP4Nagios Performance Data + Graphing

Nagios Performance Data
- Check plugins can optionally return performance data (perfdata)
- Perfdata is just any metric associated with a check
  - Response time (seconds, ms)
  - Web page size (bytes, KB)
  - Network throughput (bits/sec, kb/sec, Mb/sec)
  - Room temperature (F, C)

Perfdata Output
    ./check_ping -H 184.6.0.1 -w 100,2% -c 200,5%
    PING OK - Packet loss = 0%, RTA = 56.56 ms|rta=56.563000ms;100.000000;200.000000;0.000000 pl=0%;2;5;0
- All output is on STDOUT
- Vertical bar separates screen output from performance data
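
To make the layout of a perfdata string concrete, here is a minimal Python sketch that splits a plugin's output at the vertical bar and breaks each metric into its fields. The function names are hypothetical, and the field order (label=value[UOM];warn;crit;min;max) is assumed from the Nagios plugin guidelines. It handles only simple numeric values, not every case the guidelines allow.

```python
import re

# One perfdata item: label=value[UOM][;warn[;crit[;min[;max]]]]
# (assumed field order; labels containing spaces are single-quoted)
PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?[\d.]+)"
    r"(?P<uom>[^;\s]*)(?P<rest>(?:;[^;\s]*)*)"
)


def split_output(line):
    """Split plugin output into (screen text, perfdata) at the first '|'."""
    text, _, perf = line.partition("|")
    return text.strip(), perf.strip()


def parse_perfdata(perf):
    """Return {label: {value, uom, warn, crit, min, max}} for each metric."""
    metrics = {}
    for item in PERF_ITEM.finditer(perf):
        fields = item.group("rest").split(";")[1:]  # warn, crit, min, max
        fields += [""] * (4 - len(fields))          # pad missing fields
        metrics[item.group("label").strip("'")] = {
            "value": float(item.group("value")),
            "uom": item.group("uom"),
            "warn": fields[0],
            "crit": fields[1],
            "min": fields[2],
            "max": fields[3],
        }
    return metrics


if __name__ == "__main__":
    sample = ("PING OK - Packet loss = 0%, RTA = 56.56 ms|"
              "rta=56.563000ms;100.000000;200.000000;0.000000 pl=0%;2;5;0")
    screen, perf = split_output(sample)
    print(screen)
    print(parse_perfdata(perf))
```

An add-on such as PNP4Nagios does its own parsing; this sketch only illustrates how the fields are laid out.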

Support By Plugins
- Not all plugins report performance data
- Some plugins require a command-line flag to activate perfdata output
- Some plugins output things that could be perfdata, but they do it in the screen output
  - Wrap these plugins in a script to parse screen output and reformat it as proper perfdata
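
The wrapping approach can be sketched in Python as follows. This is a rough illustration and not a tested production wrapper. The wrapped plugin, its "TEMP OK - 72.5 F" output, and the names `add_perfdata` and `run_wrapped` are all hypothetical.

```python
import re
import subprocess


def add_perfdata(output, pattern, label, uom=""):
    """Append label=value[uom] perfdata to the first line of plugin output.

    `pattern` must contain one capture group matching the numeric value.
    Output that already has perfdata, or does not match, is left alone.
    """
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if "|" in first_line:
        return first_line
    match = re.search(pattern, first_line)
    if match is None:
        return first_line
    return f"{first_line}|{label}={match.group(1)}{uom}"


def run_wrapped(argv, pattern, label, uom=""):
    """Run the real plugin, print the reformatted line, return its exit code."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    print(add_perfdata(proc.stdout, pattern, label, uom))
    return proc.returncode  # the wrapped plugin's state is passed through


# Example (hypothetical plugin path):
#   sys.exit(run_wrapped(["/usr/lib/nagios/plugins/check_room_temp"],
#                        r"(-?\d+(?:\.\d+)?)\s*F\b", "temp", "F"))
```

The wrapper passes the original exit code through unchanged, so the check state stays exactly what the wrapped plugin reported.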

Performance Data Handling
- Nagios does not natively do much with performance data
- Perfdata must be passed to an add-on for it to be useful
- Nagios comes with sample commands for processing perfdata
  - process-host-perfdata
  - process-service-perfdata

Getting Perfdata into PNP
misccommands.cfg - redefine perfdata commands
    define command {
        command_name    process-service-perfdata
        command_line    /usr/local/nagios/libexec/process_perfdata.pl
    }
    define command {
        command_name    process-host-perfdata
        command_line    /usr/local/nagios/libexec/process_perfdata.pl -d HOSTPERFDATA
    }

Understanding RRDs
- RRD is a Round Robin Database
- Data in an RRD is stored as sets of averages
  - 1 minute, 5 min, 15 min, 1 hr, 6 hr, 12 hr, etc.
- File never grows, but resolution is lost with time
- Maximum time to hold data is set when the RRD is created (number of slots for each time "bin")
- PNP4Nagios holds enough data for 4 years by default
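
The round-robin idea can be illustrated with a toy model in Python. It is not the RRD file format or the rrdtool API, and the class names are made up for this sketch. Each archive averages a fixed number of raw samples into one slot and keeps only a fixed number of slots, so storage never grows while older data survives only at coarser resolution.

```python
from collections import deque


class Archive:
    """Averages `steps` raw samples into one slot and keeps `rows` slots."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.slots = deque(maxlen=rows)  # oldest slot is overwritten when full
        self._pending = []

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.slots.append(sum(self._pending) / self.steps)
            self._pending = []


class TinyRRD:
    """A set of archives that all receive every sample."""

    def __init__(self, layout):
        # layout: list of (steps, rows) pairs, fixed at creation time
        self.archives = [Archive(steps, rows) for steps, rows in layout]

    def update(self, value):
        for archive in self.archives:
            archive.update(value)


if __name__ == "__main__":
    rrd = TinyRRD([(1, 5), (5, 3)])  # fine-grained and coarse archives
    for sample in range(1, 21):
        rrd.update(sample)
    for archive in rrd.archives:
        print(archive.steps, list(archive.slots))
```

After 20 samples the fine archive holds only the five most recent values, while the coarse archive holds three five-sample averages that reach further back in time.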

Multi-value graphs
- Graphs can overlay multiple values from one RRD

Perfdata Processing Modes
Easy: Synchronous
- The PNP processor is invoked after each and every service check
- RRDs are updated immediately after each service check
- Number of perl execs can cause high load
Not as Easy: Bulk mode
- Perfdata is accumulated in a flat file after each service check
- PNP processor is called every 30 seconds and handles all data from file
- Reduced PNP load

Increase Graph Data Age
- PNP4Nagios shows graphs out to 1 year by default
- The default RRDs hold data for 4 years
- All that's missing is some links for older data
- Defined in the $views array in config_local.php
    $views[] = array('title' => 'Two Years', 'start' => (3600*24*740));
  - 740: days in 2 years (approximately)
Using PNP4Nagios
PNP4Nagios Overview

PNP4Nagios Menus
- Switch to a different host right from PNP screen
- Select date range
- Create PDF export

Using the Basket
- Basket can be used to combine graphs from multiple hosts into a single page
- Use in combination with PDF export to generate printable/mailable summaries for others
  - Management, vendors, etc.

Templates
- Templates define how the perfdata is displayed
- PNP4Nagios looks for a template with the same name as the check command
  - Falls back to a default if not found
- Define how to present values from the RRDs
- Written in PHP so you can do any kind of processing you like (scaling, coloring, etc.)

Using templates to tune graphs
- Define command line options to rrdtool
    $opt[$key] = "-X 0 --height 200 --vertical-label \"foo\" --title \"Graph Title\" ";
  - Tells rrdgraph not to power-scale the Y axis, sets the Y axis label and graph title, and makes graphs taller
- Divide a value by 1024 and call the result "gb"
    $def[$key] .= "CDEF:gb=var1,1024,/ ";
  - Converts MB to GB
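
The CDEF expression var1,1024,/ is written in reverse Polish notation (RPN): operands come first, then the operator. The Python sketch below shows how such an expression evaluates. It is a toy evaluator covering only the four arithmetic operators, not rrdtool's actual RPN engine, and `eval_rpn` is a hypothetical name.

```python
import operator

# Only the four basic operators; rrdtool's RPN language has many more.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr, variables):
    """Evaluate a comma-separated RPN expression such as 'var1,1024,/'."""
    stack = []
    for token in expr.split(","):
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError(f"malformed RPN expression: {expr!r}")
    return stack[0]


if __name__ == "__main__":
    # 2048 MB divided by 1024 gives the value in GB
    print(eval_rpn("var1,1024,/", {"var1": 2048.0}))
```

Reading left to right, var1 and 1024 are pushed on the stack, then / pops both and pushes var1 / 1024.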
NConf Web-based GUI configurator

NConf
- Web-based GUI configurator for Nagios
- Stores config objects in MySQL database
- Generates Nagios config files from DB for deployment to Nagios servers
- Deployment is scriptable (SCP, rsync, etc.)
- NConf need not run on the Nagios server itself

Installation: Pre-requisites
- MySQL with InnoDB
- OS packages
    apt-get install libdbi-perl php5-mysql gcc
    yum install perl-dbi perl-dbd-mysql
- PHP
    short_open_tag = On
    register_globals = Off
    magic_quotes_gpc = Off

Install
- Un-tar files into web server document area
- config/mysql.php for database server/user/pass
- config/authentication.php - AD, SQL, file or basic auth
- config/deployment.ini - How to deploy conf files to Nagios instance

Local Deployment
    [local deployment]
    type = local
    source_file = /etc/nconf/output/nagiosconfig.tgz
    target_file = /etc/nagios
    action = extract
    reload_command = sudo /etc/init.d/nagios reload

Importing Existing Configs
- NConf can import existing config files, but the process must be done in multiple steps
- Each type of object (hosts, services, commands, contacts, etc.) must be imported separately and in the correct order (contacts before contact groups)
- Nagios object cache lists all objects sorted by type
- See the Import Guide

Extending the Schema
- Some Nagios configuration attributes aren't supported by NConf out of the box
- Luckily, the configuration schema/data model used by NConf is extensible
  - Administration > Attributes > Add
- Back up your database before changing the schema!

Extending the Schema
[Screenshot: example attribute form. Values shown: contacts / Contacts / People to notify about this host / host / assign-many / contact]
Check Plug-Ins

Must-have plugins
- check_openmanage - Monitor Dell servers with OMSA

Must-have plugins
- check_netappfiler.py
  - Old, but still works great
  - Uses SNMP, compatible with OnTap 7-Mode
  - Comes with PNP templates
  - https://github.com/wampire/check_netappfiler

Must-have plugins
- check_logfiles
  - https://github.com/lausser/check_logfiles
  - Scans logfiles for patterns indicating Warning, Critical or OK states
  - Handles rotated logfiles
  - Detects recovery strings as well
  - Can use external config files for complex checks

Must-have plugins
- check-cisco.pl
  - Cisco router / switch CPU, PSU, temp
  - https://github.com/ranl/monitor-utils
- Synology status (check_snmp_synology)
  - Check health, RAID, disk temps, storage
  - Available on Nagios Exchange
Writing Check Plug-ins Have no fear - Write exactly the plugin you need

Custom Plugins
- Nagios can monitor anything you can write a script to check
- Simple API
- You can write plugins in ANY language you choose!
  - bash, python, tcl, expect
  - perl (Nagios has embedded perl interpreter for speed)
  - C, C++

Plugin API
- Exit code determines check state
  - 0 - OK
  - 1 - Warning
  - 2 - Critical
  - 3 - Unknown
- Stdout is for human-readable notices; Nagios does not use it to determine the check state
- Perfdata written on stdout, after vertical bar
- Multiple lines allowed - up to 4 KB
- http://nagios.sourceforge.net/docs/3_0/pluginapi.html
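
As an illustration of this API, here is a minimal plugin skeleton in Python. The metric ("queue"), its value, and the function names are hypothetical, and the sketch assumes a higher-is-worse metric. It shows only the contract: the exit code carries the state, and stdout carries the text, a vertical bar, and the perfdata.

```python
import sys

# Exit codes defined by the plugin API
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING",
               CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a higher-is-worse metric to a plugin state."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(label, value, warn, crit, uom=""):
    """Return (exit code, 'text|perfdata') for one metric."""
    state = evaluate(value, warn, crit)
    text = f"{label.upper()} {STATE_NAMES[state]} - {label} is {value}{uom}"
    perfdata = f"{label}={value}{uom};{warn};{crit}"
    return state, f"{text}|{perfdata}"


def main():
    # A real plugin would measure something here; 12 is a placeholder value.
    state, line = format_result("queue", 12, warn=10, crit=20)
    print(line)
    sys.exit(state)


if __name__ == "__main__":
    main()
```

Running it prints one line and exits with status 1 (Warning), since the placeholder value 12 sits between the two thresholds.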

Writing plugins in Perl
- Nagios provides utils.pm
  - Provides %ERRORS hash
  - Maps status names to exit codes: $ERRORS{'CRITICAL'}
- You can use my template as a starting point
  - https://github.com/48kram/nagios-plugins/tree/master/template
  - Command-line parsing, threshold parsing, output formatting

Writing Good Plugins
- Keep default output short and to the point
  - Suitable for SMS messages, pagers, etc.
  - Easy to parse in a time-critical situation
  - Remember: Nagios should help you fix the problem!
- Call external binaries by their full path
  - Make it configurable on the cmdline or at the top of the script in a variable

Writing Good Plugins
- Watch out for long runtimes or hung processes
  - Perl: Use alarm (standard function)
  - Bash/Sh: Use timeout (coreutils)
- Avoid temp files in case your disk is full, out of file handles, etc.
- Validate your cmdline arguments
  - Is it legal for warn to be higher than crit?
  - Are numeric arguments really numeric?
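
The same precautions can be sketched in Python: validate thresholds, insist on a full path for external binaries, and bound their runtime. The helper names are hypothetical. Reporting a timeout as UNKNOWN (exit code 3) is an assumption of this sketch, and a given check may reasonably prefer Critical.

```python
import os
import subprocess

UNKNOWN = 3  # plugin exit code for errors in the plugin itself


def parse_thresholds(warn, crit, lower_is_worse=False):
    """Return (warn, crit) as floats, or raise ValueError if they make no sense."""
    try:
        warn_value, crit_value = float(warn), float(crit)
    except (TypeError, ValueError):
        raise ValueError("thresholds must be numeric")
    if lower_is_worse and warn_value <= crit_value:
        raise ValueError("warning must be greater than critical")
    if not lower_is_worse and warn_value >= crit_value:
        raise ValueError("warning must be less than critical")
    return warn_value, crit_value


def run_external(argv, timeout_seconds=10):
    """Run an external binary by full path with a hard time limit."""
    if not os.path.isabs(argv[0]):
        raise ValueError(f"call external binaries by full path: {argv[0]!r}")
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"UNKNOWN - {argv[0]} timed out after {timeout_seconds}s"
    return proc.returncode, proc.stdout.strip()
```

A plugin would typically catch the ValueError from either helper, print a short message, and exit with UNKNOWN.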

Writing Safe Plugins
- Nagios's embedded Perl (epn) requires special care
  - Plugins must work under use strict
  - Perl should be run with -w
  - Close all opened files (epn never exits)
  - Initialize all variables before using them (epn caches)
  - Don't use global variables in subroutines

When to Use Unknown
- Unknown is a special exit status in Nagios for when an error occurred in the plugin itself
  - Missing Perl module or client binary, etc.
  - Also illegal command line options
- Do not use Unknown to indicate that the service is in an unknown state, hostname unknown, etc.
  - Use Warning or Critical for this state because the service is not OK!

Minimal Nagios Check Plugin
    #!/usr/bin/perl -w
    # Check runtime on APC Symmetra UPS
    use strict;
    use Net::SNMP qw(ticks_to_time);
    use lib qw(. /usr/lib/nagios/libexec);
    use utils qw(%ERRORS);    # maps status names to exit codes
    use Getopt::Long qw(:config no_ignore_case);

    my ($host, $community, $warn, $crit);
    GetOptions(
        "H|host=s"      => \$host,
        "C|community=s" => \$community,
        "w|warning=s"   => \$warn,
        "c|critical=s"  => \$crit
    ) or print_help();    # print_help() is not shown on the slides

    # Less runtime is worse, so warning must be higher than critical
    if ( $warn <= $crit ) {
        print "Error: Warning must be > critical!\n";
        exit $ERRORS{'UNKNOWN'};
    }

    my $runtimeoid = ".1.3.6.1.4.1.318.1.1.1.2.2.3.0";
    my ($s, $error) = Net::SNMP->session(
        -hostname  => $host,
        -community => $community,
        -timeout   => 10,
        -version   => 1,
        -translate => [ -timeticks => 0x0 ]    # return raw timeticks
    );
    if ( !defined $s ) {
        print "SNMP Error: $error\n";
        exit $ERRORS{'UNKNOWN'};
    }
    my $res = $s->get_request( -varbindlist => [$runtimeoid] );
    # Check for errors in request here - omitted
    my $runminutes = $res->{$runtimeoid} / 100 / 60;    # timeticks are 1/100 s

    # Begin plugin logic
    my $status = "OK";
    if ( $runminutes <= $warn ) { $status = "WARNING"; }
    if ( $runminutes <= $crit ) { $status = "CRITICAL"; }

    my $screenout = sprintf( "%s: %d minutes runtime", $status, $runminutes );
    my $perfdata  = sprintf( "runtime=%dminutes;%d;%d", $runminutes, $warn, $crit );
    print "$screenout|$perfdata\n";
    exit $ERRORS{$status};

Minimal Nagios Check Plugin
    root@host# ./check_apc_run -H 10.1.63.34 -C public -w 40 -c 20
    OK: 64 minutes estimated runtime|runtime=64minutes;40;20
    root@host# echo $?
    0
One Final Word

Only You Can Change the Culture of Systems Administration
No service is truly production-ready until it is
- Acceptance-tested
- Backed-up
- Monitored
- Documented

Credits
- Nagios, the Nagios logo, and Nagios graphics are the servicemarks, trademarks, or registered trademarks owned by Nagios Enterprises.
- APC and Symmetra are registered trademarks of American Power Conversion Corporation. This project is not affiliated with American Power Conversion Corporation.
- People image by netalloy. Public Domain. Courtesy openclipart.org
- Clock images by hypocore. Public Domain. Courtesy openclipart.org
- Profit Chart Curve by simpletutorials.net. Public Domain
- Tools image by sev. Public Domain. Courtesy openclipart.org
- Some images by unknown authors taken from http://clipart-finder.com/