Managing Monitoring in Distributed Environments



Similar documents
How To Monitor A Network With Nagios And Other Tools

Présentation de Nagios

Monitoring Systems and Services. Alwin Brokmann DESY-IT March 24 28,2003 CHEP 2003 San Diego

DEPLOYMENT GUIDE Version 1.0. Deploying the BIG-IP LTM with the Nagios Open Source Network Monitoring System

Document d'installation FAN 2.1

NETWORK MONITOR. Some high-end network monitoring. Watching your systems with Nagios COVER STORY. What Is Nagios? Installing the Server and Plugins

Created by : Ashish Shah, J.M. PATEL COLLEGE UNIT-5 CHAP-1 CONFIGURING WEB SERVER

How to setup HTTP & HTTPS Load balancer for Mediator

C:\www\apache2214\conf\httpd.conf Freitag, 16. Dezember :50

Availability Management Nagios overview. TEIN2 training Bangkok September 2005

ViMP 3.0. SSL Configuration in Apache 2.2. Author: ViMP GmbH

Installation von des Netzwerküberwachungssystems Nagios 1.1 auf einem Debian-Rechner

Setting Up A Nagios Monitoring System Warren Block, May 2005

Installing an SSL certificate on the InfoVaultz Cloud Appliance

Web Server: Principles and Configuration Web Programming 8) Web Server

Installing Apache Software

While are you still in Nagios working directory, create a new file for DNS servers monitoring

SecuritySpy Setting Up SecuritySpy Over SSL

Network Monitoring Systems / Nagios. 2/19/08 Michael Miller e mail: mike.mikemiller@gmail.com

Monitoring VoIP Systems. Sebastian Damm

Apache 2.2 on Windows: A Primer

wget

Nagios. Nagios. Jacquelin Charbonnel - Albert Shih. CNRS - Ecole Mathrice Marseille, Novembre / 72

NRPE Documentation CONTENTS. 1. Introduction... a) Purpose... b) Design Overview Example Uses... a) Direct Checks... b) Indirect Checks...

High Availability Configuration of ActiveVOS Central with Apache Load Balancer

Internet Appliance INTERNETpro Enterprise Stack : Performance & failover testing

Nagios. cooler than it looks. Wednesday, 31 October 2007

Monitoring MySQL. Geert Vanderkelen MySQL Senior Support Engineer Sun Microsystems


Real Vision Software, Inc.

TODAY web servers become more and more

This section describes how to use SSL Certificates with SOA Gateway running on Linux.

HP Cloud Service Automation Deployment Architectures

Securing the Apache Web Server

September 21 st, Ethan Galstad.

Eth0 IP Address with Default Gateway Settings:

How to: Install an SSL certificate

Installing and Configuring Apache

DEERFIELD.COM. DNS2Go Update API. DNS2Go Update API

1.0 DHCPD.CONF. option domain-name-servers ; option domain-name "smuth-mru.org.zm"; option broadcast-address

Implementing HTTPS in CONTENTdm 6 September 5, 2012

Creating X.509 Certificates With OpenSSL

Monitoring Software Services registered with science.canarie.ca

heck What the is wrong with my server?!? Josh Malone Systems Administrator National Radio Astronomy Observatory Charlottesville, VA

DEPLOYMENT GUIDE Version 1.0. Deploying the BIG-IP LTM with Apache Tomcat and Apache HTTP Server

Apache Web Server Complete Guide Dedoimedo

esync - Receiving data over HTTPS

User s guide. APACHE SSL Linux. Using non-qualified certificates with APACHE SSL Linux. version 1.3 UNIZETO TECHNOLOGIES S.A.

The course will be run on a Linux platform, but it is suitable for all UNIX based deployments.

Host your websites. The process to host a single website is different from having multiple sites.

APACHE. An HTTP Server. Reference Manual

How To Monitor A Network With Nagios And Rt Software On Linux On A Microsoft Ipad (A2) On A Pc Or Macbook Or Ipad Or Ipa (A3) On An Ipa Or Ipo (

How To Configure Apa Web Server For High Performance

Magento Enterprise Edition White Paper!!"#$%&'()*&(+"'#(,-).#/."'(0%-(1/2$(,"-0%-3)*."("4%33"-."!

Setting Up SSL From Client to Web Server and Plugin to WAS

Configuring MassTransit for the Web Using Apache on Mac OS 10.2 and 10.3

Virtual Host (Web Server)

WebBridge LR Integration Guide

Open Source Management Options

Ganglia & Nagios. Maciej Lasyk 11. Sesja Linuksowa Wrocław, /25. Maciej Lasyk, Ganglia & Nagios

Network Monitoring with Nagios. Matt Gracie, Information Security Administrator Canisius College, Buffalo, NY

Rails Application Deployment. July Philly on Rails

APACHE WEB SERVER. Andri Mirzal, PhD N

Apache & Virtual Hosts & mod_rewrite

Basic Apache Web Services

To enable https for appliance

How To Install Nagios Network Monitoring Program On A Pc Or Macbook Or Ipad (For Mac) With A Web Browser (For A Mac)

An Introduction to Monitoring with Nagios

Wednesday, October 10, 12. Running a High Performance LAMP stack on a $20 Virtual Server

Enterprise Content Management System Monitor. How to deploy the JMX monitor application in WebSphere ND clustered environments. Revision 1.

Best Practices in Hardening Apache Services under Linux

WEB SERVER MONITORING SORIN POPA

McAfee epolicy Orchestrator: Creating an Apache HTTP Repository

Monitoring a Linux Mail Server

GlobalSign Enterprise Solutions Google Apps Authentication User Guide

10gAS SSL / Certificate Based Authentication Configuration

Installing and Configuring Apache

Anexo II Ejemplos de ficheros de configuración de Apache Web Server

Deployment Guide MobileIron Sentry

Web Hosting for Fame and Fortune. A Guide to using Apache as your web-server solution

Install Apache on windows 8 Create your own server

SIG-NOC Meeting - Stuttgart 04/08/2015 Icinga - Open Source Monitoring

Supermicro Server Monitoring with SuperDoctor 5 and Nagios Using SNMP Protocol. Version 1.1b

Step-by-Step guide to setup an IBM WebSphere Portal and IBM Web Content Manager V8.5 Cluster From Zero to Hero (Part 2.)

NetSpective Global Proxy Configuration Guide

Greenstone Documentation

Setting Up SSL on IIS6 for MEGA Advisor

Monitoring Your Enterprise PACS With Nagios, Cacti And Smokeping

WhatsUp Gold v11 Features Overview

Academic Calendar for Faculty

Research Report Security Management with Open Source Tools

Installing Apache as an HTTP Proxy to the local port of the Secure Agent s Process Server

Setting up an Apache Web Server for Greenstone 2 Walkthrough

Maximizing Performance and Scalability with Magento Enterprise Edition

Transcription:

Managing Monitoring in Distributed Environments The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Introduction Motivation Background Implementation Conclusions

Why Monitor? A distributed system is one that stops you from getting any work done when a machine you've never even heard of crashes. Leslie Lamport

Reliability In a replicated system, you only know when the last server fails Sometimes you need to know that a system is down before the first user notices

Why Monitor? Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital Aaron Levenstein

Performance Metrics Do we have five 9s uptime? Are there servers which regularly fall over? Are there patterns behind our service outages? Does Jim s team manage their systems better than Susan s?

Monitoring Regularly check if all services are working correctly Tell people if they re not

Complications Image credit: fliepsiebieps http://www.etsy.com/view_transaction.php?transaction_id=6780862

Completeness Monitoring systems are rapidly taken for granted I haven t been paged - there can t be anything wrong Maintaining the monitoring system is often forgotten Easy to forget to add new services Omissions only noticed when things go wrong Although... harder to forget when decommissioning!

Complications Image credit: rustybrick : http://www.flickr.com/photos/rustybrick/2100356981

Dependencies (and message storms) One small failure can have dramatic expenses Receiving 1,000 messages when a router fails can be both annoying and expensive Large scale monitoring configurations have to understand Network topology Service dependencies Modelling these is hard. Maintaining the model is harder still

Complications Image credit: Guacamole Goalie : http://www.fickr.com/photos/jeopei

Communication : Passive vs active Not everything you care about is network visible Some things can t be Some things shouldn t be An event occurring is as notable as a service outage Have to be able to handle Services reporting their own state Reporting of events (SNMP traps &c)

Complications Quis custodiet ipsos custodes? Juvenal Image credit: Ley_photograhy http://www.flickr.com/photos/lostindevon/

Self monitoring Silence isn t always golden Failure of a component of the monitoring system can hide everything Important that the monitoring system itself is robustly observed

Complexity and redundancy Satisifying all of these requirements leads to configurations that are complex fragile hard to understand full of redundant duplicated information Removing redundancy is highly beneficial. It lets us... make changes in one place instead of 20... avoid inconsistency between multiple copies... speed up configuration tasks... reduce errors!

Aside on redundancy System configuration is full of redundant information Take a web server configuration - information shared with DNS DHCP Firewall Certificate Server Console Server and your Monitoring System Removing redundancy dramatically reduces the potential for errors across all of these components

Lost in translation Very hard to handle raw configuration files Rapidly becomes an O(n 2 ) problem (translate all of these configuration formats to all of these other formats) Instead, use a set of abstract configuration resources Translate from this to Service configuration Monitoring configuration Everything else...

Web server example This machine is an ssl capable web server It runs two virtual hosts One http service on port 80, answering all requests, and redirecting to the https service One https service on port 443, answering to www.inf.ed.ac.uk, serving documents from /var/www/html The https service should be given a signed certificate The web-service monitoring group is notified on failure

Implementation - Configuration (other configuration systems are available)

Implementation - Monitoring

Example - Apache resources #include <options/apache.h> #include <options/apache-ssl.h> #include <options/x509-client.h> apache.vhosts apache.vhostname_redir apache.vhostverbatim_redir apache.vhostname_ssl apache.vhostssl_ssl apache.vhostroot_ssl apache.nagios_groups redir ssl _default_ Redirect / https://www.inf.ed.ac.uk www.inf.ed.ac.uk true /var/www/html nagios/web-service X509_CERTIFICATE(ssl, /etc/pki/tls/certs) Small

Example - Apache configuration ServerType standalone ServerRoot /etc/httpd LockFile /var/run/httpd.locak PidFile /var/run/httpd.pid ScoreBoardFile logs/apache_runtime_status Timeout 300 KeepAlive On MaxKeepAliveRequests 100 KeepAliveTimeout 15 MinSpareServers 5 MaxSpareServers 20 StartServers 8 MaxClients 150 MaxRequestsPerChild1000 LoadModule env_module /usr/lib/apache/mod_env.so LoadModule config_log_module /usr/lib/apache/mod_log_config.so LoadModule mime_module /usr/lib/apache/mod_mime.so LoadModule negotiation_module /usr/lib/apache/mod_negotiation.so LoadModule status_module /usr/lib/apache/mod_status.so LoadModule includes_module /usr/lib/apache/mod_include.so LoadModule autoindex_module /usr/lib/apache/mod_autoindex.so LoadModule dir_module /usr/lib/apache/mod_dir.so LoadModule cgi_module /usr/lib/apache/mod_cgi.so LoadModule asis_module /usr/lib/apache/mod_asis.so LoadModule imap_module /usr/lib/apache/mod_imap.so LoadModule action_module /usr/lib/apache/mod_actions.so LoadModule userdir_module /usr/lib/apache/mod_userdir.so LoadModule alias_module /usr/lib/apache/mod_alias.so LoadModule access_module /usr/lib/apache/mod_access.so LoadModule auth_module /usr/lib/apache/mod_auth.so LoadModule digest_module /usr/lib/apache/mod_digest.so LoadModule expires_module /usr/lib/apache/mod_expires.so LoadModule setenvif_module /usr/lib/apache/mod_setenvif.so LoadModule headers_module /usr/lib/apache/mod_headers.so Port 80 Listen 443 Listen 80 User apache Group apache ServerAdmin sxw ServerName duffus.inf.ed.ac.uk DocumentRoot /var/www/html <Directory /> Options FollowSymLinks AllowOverride None </Directory> AccessFileName.htaccess <Files ~ "^\.ht"> Order allow,deny Deny from all Satisfy All </Files> UseCanonicalName On TypesConfig /etc/mime.types DefaultType text/html ErrorLog logs/error_log LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Refereri\" \"%{User-Agenti\"" combined LogLevel warn CustomLog logs/access_log combined SSLPassPhraseDialogbuiltin SSLSessionCache shm:logs/ssl_scache(512000) SSLSessionCacheTimeout 300 SSLMutex file:logs/ssl_mutex SSLRandomSeed startup builtin SSLRandomSeed connect builtin SSLLog logs/ssl_engine_log SSLLogLevel warn AddType application/x-httpd-php.php NameVirtualHost 129.215.165.30:80 <VirtualHost _default_:80> ServerName _default_ Redirect / https://www.inf.ed.ac.uk/ </VirtualHost> <VirtualHost www.inf.ed.ac.uk:443> ServerName www.inf.ed.ac.uk SSLEngine On SSLCertificateFile /etc/pki/tls/certs/ssl.crt SSLCertificateKeyFile /etc/pki/tls/certs/ssl.key SSLCertificateChainFile /etc/pki/tls/certs/ssl.chain DocumentRoot /var/www/html </VirtualHost> Medium

Example - Apache monitoring define command { command_line $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$ command_name check_apacheconf_main define command { command_line command_name $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C$ARG4$ check_apacheconf_cert define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S command_name check_apacheconf_https define command { command_line command_name define contactgroup { alias nagios/web-service contactgroup_namenagios/web-service members sxw define contact { use default-contact alias Simon Wilkinson contact_name sxw email sxw@inf.ed.ac.uk define hostgroup { alias DICE/Web/Servers hostgroup_name DICEWebServers define host { use default-host address 129.215.165.30 contact_groups nagios/web-service host_name duffus hostgroups DICEWebServers $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ check_apacheconf_http define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups nagios/web-service host_name duffus passive_checks_enabled 1 service_description Profile Translation service_description Apache define service { use default-service check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/ contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:80 HTTP define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_https!129.215.165.30!www.inf.ed.ac.uk!443!/ contact_groups nagios/web-server host_name curlew service_description Apache www.inf.ed.ac.uk:443 HTTPS define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_cert!129.215.202.30!www.inf.ed.ac.uk!443!18 contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:443 Certificate define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_main!80 contact_groups nagios/web-service host_name duffus Large

Example - Apache monitoring 2 log_file=/var/log/nagios/nagios.log object_cache_file=/var/log/nagios/objects.cache resource_file=/etc/nagios/private/resource.cfg status_file=/var/log/nagios/status.dat nagios_user=nagios nagios_group=nagios check_external_commands=1 command_check_interval=-1 command_file=/var/spool/nagios/cmd/nagios.cmd comment_file=/var/log/nagios/comments.dat downtime_file=/var/log/nagios/downtime.dat lock_file=/var/run/nagios/nagios.pid temp_file=/var/log/nagios/nagios.tmp event_broker_options=-1 log_rotation_method=d log_archive_path=/var/log/nagios/archives use_syslog=1 log_notifications=1 log_service_retries=1 log_host_retries=1 log_event_handlers=1 log_initial_states=0 log_external_commands=1 log_passive_checks=1 service_inter_check_delay_method=s max_service_check_spread=30 service_interleave_factor=s host_inter_check_delay_method=s max_host_check_spread=30 max_concurrent_checks=0 service_reaper_frequency=10 auto_reschedule_checks=0 auto_rescheduling_interval=30 auto_rescheduling_window=180 sleep_time=0.25 service_check_timeout=60 host_check_timeout=30 event_handler_timeout=30 notification_timeout=30 ocsp_timeout=5 perfdata_timeout=5 retain_state_information=1 state_retention_file=/var/log/nagios/retention.dat retention_update_interval=60 use_retained_program_state=1 use_retained_scheduling_info=1 interval_length=60 use_aggressive_host_checking=0 execute_service_checks=1 accept_passive_service_checks=1 execute_host_checks=1 accept_passive_host_checks=1 enable_notifications=1 enable_event_handlers=1 process_performance_data=0 obsess_over_services=0 check_for_orphaned_services=1 check_service_freshness=1 service_freshness_check_interval=60 check_host_freshness=0 host_freshness_check_interval=60 aggregate_status_updates=1 status_update_interval=15 enable_flap_detection=0 low_service_flap_threshold=5.0 high_service_flap_threshold=20.0 low_host_flap_threshold=5.0 high_host_flap_threshold=20.0 date_format=euro p1_file=/usr/sbin/p1.pl illegal_object_name_chars=`~!$%^&* '"<>?,()= illegal_macro_output_chars=`~$& '"<> use_regexp_matching=0 use_true_regexp_matching=0 admin_email=nagios-admin@inf.ed.ac.uk admin_pager=empty daemon_dumps_core=0 define timeperiod{ timeperiod_name 24x7 alias 24 Hours A Day, 7 Days A Week sunday 00:00-24:00 monday 00:00-24:00 tuesday 00:00-24:00 wednesday 00:00-24:00 thursday 00:00-24:00 friday 00:00-24:00 saturday 00:00-24:00 define contact{ name default-contact service_notification_period 24x7 host_notification_period 24x7 service_notification_options w,u,c,r host_notification_options d,r service_notification_commands service_jnotify host_notification_commands host_jnotify register 0 define host{ name default-host ; The name of this host template active_checks_enabled 1 ; Active host checks are enabled notifications_enabled 1 ; Host notifications are enabled event_handler_enabled 1 ; Host event handler is enabled flap_detection_enabled 1 ; Flap detection is enabled failure_prediction_enabled 1 ; Failure prediction is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain nonstatus information across program restarts notification_period 24x7 ; Send host notifications at any time check_period 24x7 max_check_attempts 10 notification_interval 10 notification_options d,u,r check_command check-host-alive register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE! define service{ name default-service ; The 'name' of this service template active_checks_enabled 1 ; Active service checks are enabled passive_checks_enabled 1 ; Passive service checks are enabled/accepted parallelize_check 1 ; Active service checks should be parallelized (disabling this ca n lead to major performance problems) obsess_over_service 1 ; We should obsess over this service (if necessary) check_freshness 0 ; Default is to NOT check service 'freshness' notifications_enabled 1 ; Service notifications are enabled event_handler_enabled 1 ; Service event handler is enabled flap_detection_enabled 1 ; Flap detection is enabled failure_prediction_enabled 1 ; Failure prediction is enabled process_perf_data 1 ; Process performance data retain_status_information 1 ; Retain status information across program restarts retain_nonstatus_information 1 ; Retain non-status information across program restarts is_volatile 0 ; The service is not volatile check_period 24x7 max_check_attempts 4 normal_check_interval 5 retry_check_interval 1 notification_options w,u,c,r notification_interval 10 notification_period 24x7 register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE! define command { command_name command_line check results not recently received" check_passive $USER1$/check_dummy 2 "Passive define command { command_name host_jnotify command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n \nnotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\n State: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/ Time: $LONGDATETIME$\n" /usr/bin/notify-by-jnotify --number $NOTIFICATIONNUMBER$ --subject "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$ define command { command_name service_jnotify command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n \nnotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" /usr/bin/notifyby-jnotify --number $NOTIFICATIONNUMBER$ --subject "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$ define command { command_line command_name define host { use default-host address 127.0.0.1 alias Self Monitoring Pseudo Host contact_groups ignore host_name Self notification_options n $USER1$/check_dummy 1 "Service checked passively" check_self_active define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups sxw,idurkacz host_name Self passive_checks_enabled 1 service_description Nagios Configuration define command { command_line command_name $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$ check_apache_main define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C $ARG4$ command_name check_apache_cert define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S command_name check_apache_https define command { command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ command_name check_apache_http define contactgroup { alias nagios/web-service contactgroup_name nagios/web-service members sxw define contact { use default-contact alias Simon Wilkinson contact_name sxw email sxw@inf.ed.ac.uk define hostgroup { alias DICE/Web/Servers hostgroup_name DICEWebServers define host { use default-host address 129.215.165.30 contact_groups nagios/web-service host_name duffus hostgroups DICEWebServers define service { use default-service active_checks_enabled 0 check_command check_self_active contact_groups nagios/web-service host_name duffus passive_checks_enabled 1 service_description Profile Translation define service { use default-service check_command check_apache_main!80 contact_groups nagios/web-service host_name duffus service_description Apache define service { use default-service check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/ contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:80 HTTP define servicedependency { dependent_host_name duffus dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_https!129.215.165.30! www.inf.ed.ac.uk!443!/ contact_groups nagios/web-server host_name curlew service_description Apache www.inf.ed.ac.uk:443 HTTPS define servicedependency { dependent_host_name curlew dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache define service { use default-service check_command check_apache_cert!129.215.202.30! www.inf.ed.ac.uk!443!18 contact_groups nagios/web-service host_name duffus service_description Apache www.inf.ed.ac.uk:443 Certificate define servicedependency { dependent_host_name curlew dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate execution_failure_criteria w,u,c host_name curlew notification_failure_criteria w,u,c service_description Apache Extra Large

Translation Have to translate our resources into application specific configuration Traditionally LCFG has components which translate a machine s resources into a local configuration Monitoring introduces more complexity The monitoring configuration is made up of resources from many different machines Introduce translators which convert machine resource fragments into monitoring configuration descriptions

Spanning Maps LCFG models configuration on a per host basis A publish/subscribe model exists for sharing information between hosts We call that model spanning maps We use it a lot: DHCP, firewall, Kerberos, X509 Monitoring extends this, so it can see more, quicker

Monitoring Framework Monitoring system agnostic translation framework Spanning maps tell it which components to monitor Fetch resources for that component from database Run resources through translator Extract user/group information from user database Combine translator results into configuration file All implemented in OO perl Translators are run time loaded perl modules, written to a defined interface

Caching Monitoring reconfiguration is costly Have to cache as much as possible Framework caches Results from configuration database Results from user database Results from translators Only moves performs reconfigurations when required

Why so much? Why is that configuration so complex? Actually monitoring 4 services The default httpd instance The virtual host running on port 80 The SSL service running on port 443 The validity of the certificate These all have dependencies

Dependencies Apache server Redirect server SSL server SSL certificate

Redundant Servers and Clusters Modelling clusters is important I really care if more than 2 of my KDCs are down My LDAP service is dependent on there being a KDC available Nagios makes this hard, the framework makes it simple kerberos.nagios_cluster INF.ED.AC.UK_KDC openldap.nagios_dependency INF.ED.AC.UK_KDC

Notifications Notification storms are still a danger Systems tend to be maintained by more than one person Presence enables us to tell who is available Escalation lets us shout louder and wider Jabber is used for presence, and initial notification Escalations are by email

More on the Jabber bot Written in the python Twisted async framework Bot is permanently connected to our Jabber server, receives notifications by Unix socket Maintains state information for all of its buddies Initially only notifies you if you re available Then notifies you if you re online Then emails you

Statistics Using 2 monitoring machines, we monitor 159 services, across 42 hosts. The monitoring system has 5305 lines of configuration, all automatically generated and maintained. (on 00:50 on 30th March 2008) Image credit: williamhartz : http://www.flickr.com/photos/whartz/

Issues Only checking real world == configuration definition If you remove your web server from your configuration, the monitoring system won t tell you its down Don t have a good way of expressing There must be a www.inf.ed.ac.uk There shall be at least 3 KDCs up at all times Each site must have an AFS database server Need a more descriptive configuration language to solve these problems

Futures Physician, heal thyself Use monitoring events to repair problems Bring up new systems when one goes down Move volumes off AFS fileservers when one becomes too full... Lots of risks, but lots of potential

Conclusions A quick tour of some of the complexities of monitoring Examined how using a central configuration database can simplify these Described an implementation for LCFG and Nagios Looked at some possible ways of extending that implementation in the future

Questions? Image credit: hustvedt : http://commons.wikimedia.org/wiki/image:three_surveillance_cameras.jpg This talk: http://www.dice.inf.ed.ac.uk/publications/ LCFG: http://www.lcfg.org/ Me: simon@sxw.org.uk