Managing Monitoring in Distributed Environments The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Introduction Motivation Background Implementation Conclusions
Why Monitor? A distributed system is one that stops you from getting any work done when a machine you've never even heard of crashes. Leslie Lamport
Reliability In a replicated system, you only know when the last server fails Sometimes you need to know that a system is down before the first user notices
Why Monitor? Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital Aaron Levenstein
Performance Metrics Do we have five 9s uptime? Are there servers which regularly fall over? Are there patterns behind our service outages? Does Jim s team manage their systems better than Susan s?
Monitoring Regularly check if all services are working correctly Tell people if they re not
Complications Image credit: fliepsiebieps http://www.etsy.com/view_transaction.php?transaction_id=6780862
Completeness Monitoring systems are rapidly taken for granted I haven t been paged - there can t be anything wrong Maintaining the monitoring system is often forgotten Easy to forget to add new services Omissions only noticed when things go wrong Although... harder to forget when decommissioning!
Complications Image credit: rustybrick : http://www.flickr.com/photos/rustybrick/2100356981
Dependencies (and message storms) One small failure can have dramatic expenses Receiving 1,000 messages when a router fails can be both annoying and expensive Large scale monitoring configurations have to understand Network topology Service dependencies Modelling these is hard. Maintaining the model is harder still
Complications Image credit: Guacamole Goalie : http://www.fickr.com/photos/jeopei
Communication : Passive vs active Not everything you care about is network visible Some things can t be Some things shouldn t be An event occurring is as notable as a service outage Have to be able to handle Services reporting their own state Reporting of events (SNMP traps &c)
Complications Quis custodiet ipsos custodes? Juvenal Image credit: Ley_photograhy http://www.flickr.com/photos/lostindevon/
Self monitoring Silence isn t always golden Failure of a component of the monitoring system can hide everything Important that the monitoring system itself is robustly observed
Complexity and redundancy Satisifying all of these requirements leads to configurations that are complex fragile hard to understand full of redundant duplicated information Removing redundancy is highly beneficial. It lets us... make changes in one place instead of 20... avoid inconsistency between multiple copies... speed up configuration tasks... reduce errors!
Aside on redundancy System configuration is full of redundant information Take a web server configuration - information shared with DNS DHCP Firewall Certificate Server Console Server and your Monitoring System Removing redundancy dramatically reduces the potential for errors across all of these components
Lost in translation Very hard to handle raw configuration files Rapidly becomes an O(n 2 ) problem (translate all of these configuration formats to all of these other formats) Instead, use a set of abstract configuration resources Translate from this to Service configuration Monitoring configuration Everything else...
Web server example This machine is an ssl capable web server It runs two virtual hosts One http service on port 80, answering all requests, and redirecting to the https service One https service on port 443, answering to www.inf.ed.ac.uk, serving documents from /var/www/html The https service should be given a signed certificate The web-service monitoring group is notified on failure
Implementation - Configuration (other configuration systems are available)
Implementation - Monitoring
Example - Apache resources #include <options/apache.h> #include <options/apache-ssl.h> #include <options/x509-client.h> apache.vhosts apache.vhostname_redir apache.vhostverbatim_redir apache.vhostname_ssl apache.vhostssl_ssl apache.vhostroot_ssl apache.nagios_groups redir ssl _default_ Redirect / https://www.inf.ed.ac.uk www.inf.ed.ac.uk true /var/www/html nagios/web-service X509_CERTIFICATE(ssl, /etc/pki/tls/certs) Small
Example - Apache configuration ServerType standalone ServerRoot /etc/httpd LockFile /var/run/httpd.locak PidFile /var/run/httpd.pid ScoreBoardFile logs/apache_runtime_status Timeout 300 KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 15 MinSpareServers 5 MaxSpareServers 20 StartServers 8 MaxClients 150 MaxRequestsPerChild1000 LoadModule env_module /usr/lib/apache/mod_env.so LoadModule config_log_module /usr/lib/apache/mod_log_config.so LoadModule mime_module /usr/lib/apache/mod_mime.so LoadModule negotiation_module /usr/lib/apache/mod_negotiation.so LoadModule status_module /usr/lib/apache/mod_status.so LoadModule includes_module /usr/lib/apache/mod_include.so LoadModule autoindex_module /usr/lib/apache/mod_autoindex.so LoadModule dir_module /usr/lib/apache/mod_dir.so LoadModule cgi_module /usr/lib/apache/mod_cgi.so LoadModule asis_module /usr/lib/apache/mod_asis.so LoadModule imap_module /usr/lib/apache/mod_imap.so LoadModule action_module /usr/lib/apache/mod_actions.so LoadModule userdir_module /usr/lib/apache/mod_userdir.so LoadModule alias_module /usr/lib/apache/mod_alias.so LoadModule access_module /usr/lib/apache/mod_access.so LoadModule auth_module /usr/lib/apache/mod_auth.so LoadModule digest_module /usr/lib/apache/mod_digest.so LoadModule expires_module /usr/lib/apache/mod_expires.so LoadModule setenvif_module /usr/lib/apache/mod_setenvif.so LoadModule headers_module /usr/lib/apache/mod_headers.so Port 80 Listen 443 Listen 80 User apache Group apache ServerAdmin sxw ServerName duffus.inf.ed.ac.uk DocumentRoot /var/www/html <Directory /> Options FollowSymLinks AllowOverride None </Directory> AccessFileName.htaccess <Files ~ "^\.ht"> Order allow,deny Deny from all Satisfy All </Files> UseCanonicalName On TypesConfig /etc/mime.types DefaultType text/html ErrorLog logs/error_log LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Refereri\" \"%{User-Agenti\"" combined LogLevel warn CustomLog logs/access_log combined SSLPassPhraseDialogbuiltin SSLSessionCache shm:logs/ssl_scache(512000) SSLSessionCacheTimeout 300 SSLMutex file:logs/ssl_mutex SSLRandomSeed startup builtin SSLRandomSeed connect builtin SSLLog logs/ssl_engine_log SSLLogLevel warn AddType application/x-httpd-php.php NameVirtualHost 129.215.165.30:80 <VirtualHost _default_:80> ServerName _default_ Redirect / https://www.inf.ed.ac.uk/ </VirtualHost> <VirtualHost www.inf.ed.ac.uk:443> ServerName www.inf.ed.ac.uk SSLEngine On SSLCertificateFile /etc/pki/tls/certs/ssl.crt SSLCertificateKeyFile /etc/pki/tls/certs/ssl.key SSLCertificateChainFile /etc/pki/tls/certs/ssl.chain DocumentRoot /var/www/html </VirtualHost> Medium
Example - Apache monitoring define command { command_line $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$ command_name check_apacheconf_main define command { command_line command_name $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C$ARG4$ check_apacheconf_cert define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S command_name check_apacheconf_https define command { command_line command_name define contactgroup { alias nagios/web-service contactgroup_namenagios/web-service members sxw define contact { use default-contact alias Simon Wilkinson contact_name sxw email sxw@inf.ed.ac.uk define hostgroup { alias DICE/Web/Servers hostgroup_name DICEWebServers define host { use default-host address 129.215.165.30 contact_groups nagios/web-service host_name duffus hostgroups DICEWebServers $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ check_apacheconf_http define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups nagios/web-service host_name duffus passive_checks_enabled 1 service_description Profile Translation service_description Apache define service { use default-service check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/ contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:80 HTTP define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_https!129.215.165.30!www.inf.ed.ac.uk!443!/ contact_groups nagios/web-server host_name curlew service_description Apache www.inf.ed.ac.uk:443 HTTPS define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_cert!129.215.202.30!www.inf.ed.ac.uk!443!18 contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:443 Certificate define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_main!80 contact_groups nagios/web-service host_name duffus Large
Example - Apache monitoring 2 log_file=/var/log/nagios/nagios.log object_cache_file=/var/log/nagios/objects.cache resource_file=/etc/nagios/private/resource.cfg status_file=/var/log/nagios/status.dat nagios_user=nagios nagios_group=nagios check_external_commands=1 command_check_interval=-1 command_file=/var/spool/nagios/cmd/nagios.cmd comment_file=/var/log/nagios/comments.dat downtime_file=/var/log/nagios/downtime.dat lock_file=/var/run/nagios/nagios.pid temp_file=/var/log/nagios/nagios.tmp event_broker_options=-1 log_rotation_method=d log_archive_path=/var/log/nagios/archives use_syslog=1 log_notifications=1 log_service_retries=1 log_host_retries=1 log_event_handlers=1 log_initial_states=0 log_external_commands=1 log_passive_checks=1 service_inter_check_delay_method=s max_service_check_spread=30 service_interleave_factor=s host_inter_check_delay_method=s max_host_check_spread=30 max_concurrent_checks=0 service_reaper_frequency=10 auto_reschedule_checks=0 auto_rescheduling_interval=30 auto_rescheduling_window=180 sleep_time=0.25 service_check_timeout=60 host_check_timeout=30 event_handler_timeout=30 notification_timeout=30 ocsp_timeout=5 perfdata_timeout=5 retain_state_information=1 state_retention_file=/var/log/nagios/retention.dat retention_update_interval=60 use_retained_program_state=1 use_retained_scheduling_info=1 interval_length=60 use_aggressive_host_checking=0 execute_service_checks=1 accept_passive_service_checks=1 execute_host_checks=1 accept_passive_host_checks=1 enable_notifications=1 enable_event_handlers=1 process_performance_data=0 obsess_over_services=0 check_for_orphaned_services=1 check_service_freshness=1 service_freshness_check_interval=60 check_host_freshness=0 host_freshness_check_interval=60 aggregate_status_updates=1 status_update_interval=15 enable_flap_detection=0 low_service_flap_threshold=5.0 high_service_flap_threshold=20.0 low_host_flap_threshold=5.0 high_host_flap_threshold=20.0 date_format=euro p1_file=/usr/sbin/p1.pl illegal_object_name_chars=`~!$%^&* '"<>?,()= illegal_macro_output_chars=`~$& '"<> use_regexp_matching=0 use_true_regexp_matching=0 admin_email=nagios-admin@inf.ed.ac.uk admin_pager=empty daemon_dumps_core=0 define timeperiod{ timeperiod_name 24x7 alias 24 Hours A Day, 7 Days A Week sunday 00:00-24:00 monday 00:00-24:00 tuesday 00:00-24:00 wednesday 00:00-24:00 thursday 00:00-24:00 friday 00:00-24:00 saturday 00:00-24:00 define contact{ name default-contact service_notification_period 24x7 host_notification_period 24x7 service_notification_options w,u,c,r host_notification_options d,r service_notification_commands service_jnotify host_notification_commands host_jnotify register 0 define host{ name default-host ; The name of this host template active_checks_enabled 1 ; Active host checks are enabled notifications_enabled 1 ; Host notifications are enabled event_handler_enabled 1 ; Host event handler is enabled flap_detection_enabled 1 ; Flap detection is enabled failure_prediction_enabled 1 ; Failure prediction is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain nonstatus information across program restarts notification_period 24x7 ; Send host notifications at any time check_period 24x7 max_check_attempts 10 notification_interval 10 notification_options d,u,r check_command check-host-alive register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE! define service{ name default-service ; The 'name' of this service template active_checks_enabled 1 ; Active service checks are enabled passive_checks_enabled 1 ; Passive service checks are enabled/accepted parallelize_check 1 ; Active service checks should be parallelized (disabling this ca n lead to major performance problems) obsess_over_service 1 ; We should obsess over this service (if necessary) check_freshness 0 ; Default is to NOT check service 'freshness' notifications_enabled 1 ; Service notifications are enabled event_handler_enabled 1 ; Service event handler is enabled flap_detection_enabled 1 ; Flap detection is enabled failure_prediction_enabled 1 ; Failure prediction is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain non-status information across program restarts is_volatile 0 ; The service is not volatile check_period 24x7 max_check_attempts 4 normal_check_interval 5 retry_check_interval 1 notification_options w,u,c,r notification_interval 10 notification_period 24x7 register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE! define command { command_name command_line check results not recently received" check_passive $USER1$/check_dummy 2 "Passive define command { command_name host_jnotify command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n \nnotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\n State: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/ Time: $LONGDATETIME$\n" /usr/bin/notify-by-jnotify --number $NOTIFICATIONNUMBER$ --subject "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$ define command { command_name service_jnotify command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n \nnotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" /usr/bin/notifyby-jnotify --number $NOTIFICATIONNUMBER$ --subject "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$ define command { command_line command_name define host { use default-host address 127.0.0.1 alias Self Monitoring Pseudo Host contact_groups ignore host_name Self notification_options n $USER1$/check_dummy 1 "Service checked passively" check_self_active define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups sxw,idurkacz host_name Self passive_checks_enabled 1 service_description Nagios Configuration define command { command_line command_name $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$ check_apache_main define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C $ARG4$ command_name check_apache_cert define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S command_name check_apache_https define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ command_name check_apache_http define contactgroup { alias nagios/web-service contactgroup_name nagios/web-service members sxw define contact { use default-contact alias Simon Wilkinson contact_name sxw email sxw@inf.ed.ac.uk define hostgroup { alias DICE/Web/Servers hostgroup_name DICEWebServers define host { use default-host address 129.215.165.30 contact_groups nagios/web-service host_name duffus hostgroups DICEWebServers define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups nagios/web-service host_name duffus passive_checks_enabled 1 service_description Profile Translation define service { use default-service check_command check_apache_main!80 contact_groups nagios/web-service host_name duffus service_description Apache define service { use default-service check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/ contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:80 HTTP define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_https!129.215.165.30! www.inf.ed.ac.uk!443!/ contact_groups nagios/web-server host_name curlew service_description Apache www.inf.ed.ac.uk:443 HTTPS define servicedependency { dependent_host_name curlew dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_cert!129.215.202.30! www.inf.ed.ac.uk!443!18 contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:443 Certificate define servicedependency { dependent_host_name curlew dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache Extra Large
Translation Have to translate our resources into application specific configuration Traditionally LCFG has components which translate a machine s resources into a local configuration Monitoring introduces more complexity The monitoring configuration is made up of resources from many different machines Introduce translators which convert machine resource fragments into monitoring configuration descriptions
Spanning Maps LCFG models configuration on a per host basis A publish/subscribe model exists for sharing information between hosts We call that model spanning maps We use it a lot: DHCP, firewall, Kerberos, X509 Monitoring extends this, so it can see more, quicker
Monitoring Framework Monitoring system agnostic translation framework Spanning maps tell it which components to monitor Fetch resources for that component from database Run resources through translator Extract user/group information from user database Combine translator results into configuration file All implemented in OO perl Translators are run time loaded perl modules, written to a defined interface
Caching Monitoring reconfiguration is costly Have to cache as much as possible Framework caches Results from configuration database Results from user database Results from translators Only moves performs reconfigurations when required
Why so much? Why is that configuration so complex? Actually monitoring 4 services The default httpd instance The virtual host running on port 80 The SSL service running on port 443 The validity of the certificate These all have dependencies
Dependencies Apache server Redirect server SSL server SSL certificate
Redundant Servers and Clusters Modelling clusters is important I really care if more than 2 of my KDCs are down My LDAP service is dependent on there being a KDC available Nagios makes this hard, the framework makes it simple kerberos.nagios_cluster INF.ED.AC.UK_KDC openldap.nagios_dependency INF.ED.AC.UK_KDC
Notifications Notification storms are still a danger Systems tend to be maintained by more than one person Presence enables us to tell who is available Escalation lets us shout louder and wider Jabber is used for presence, and initial notification Escalations are by email
More on the Jabber bot Written in the python Twisted async framework Bot is permanently connected to our Jabber server, receives notifications by Unix socket Maintains state information for all of its buddies Initially only notifies you if you re available Then notifies you if you re online Then emails you
Statistics Using 2 monitoring machines, we monitor 159 services, across 42 hosts. The monitoring system has 5305 lines of configuration, all automatically generated and maintained. (on 00:50 on 30th March 2008) Image credit: williamhartz : http://www.flickr.com/photos/whartz/
Issues Only checking real world == configuration definition If you remove your web server from your configuration, the monitoring system won t tell you its down Don t have a good way of expressing There must be a www.inf.ed.ac.uk There shall be at least 3 KDCs up at all times Each site must have an AFS database server Need a more descriptive configuration language to solve these problems
Futures Physician, heal thyself Use monitoring events to repair problems Bring up new systems when one goes down Move volumes off AFS fileservers when one becomes too full... Lots of risks, but lots of potential
Conclusions A quick tour of some of the complexities of monitoring Examined how using a central configuration database can simplify these Described an implementation for LCFG and Nagios Looked at some possible ways of extending that implementation in the future
Questions? Image credit: hustvedt : http://commons.wikimedia.org/wiki/image:three_surveillance_cameras.jpg This talk: http://www.dice.inf.ed.ac.uk/publications/ LCFG: http://www.lcfg.org/ Me: simon@sxw.org.uk