Icinga and Puppet
Dominik Schulz, Head of Datacenter and Operations
Magic Internet / MyVideo, a company of ProSiebenSat.1 Media AG
Berlin, May 2014
Our Stack
- Icinga: 300 hosts and over 4000 services
- Linux (Ubuntu, Debian)
- Managed by Puppet
- Heterogeneous infrastructure
  - Private cloud
  - Public cloud
  - Dedicated servers
Our Environments
- Several environments: development, integration, staging, live
- Several locations: private cloud, on-premise, public clouds, ...
Our History
- Introduced Puppet in 2012
- Introduced Zabbix in 2013 (not in use anymore)
- Introduced PuppetDB in the second half of 2013
- Icinga since the end of 2013
Reflections
- Puppet knows everything about our environment
- Why not let it feed our monitoring?
- GitHub does it
- Puppet has native Nagios resources
- Sounds good?
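For readers unfamiliar with the native Nagios resources mentioned above, here is a minimal sketch of how they are typically used with exported resources. It assumes Puppet 3.x with PuppetDB (stored configs) enabled. The class names, the target path, the `generic-service` template and the `check_ssh` command are illustrative assumptions and are not taken from the talk.

```puppet
# Hypothetical class for a monitored node: export one check using the
# native nagios_service type. Nothing is written on the node itself.
class example_monitoring::node {
  @@nagios_service { "check_ssh_${::fqdn}":
    ensure              => present,
    host_name           => $::fqdn,
    service_description => 'SSH',
    check_command       => 'check_ssh',        # assumed to be defined in Icinga
    use                 => 'generic-service',  # assumed template name
    target              => '/etc/icinga/objects/puppet_services.cfg', # assumed path
  }
}

# Hypothetical class for the monitoring server: collect every exported
# nagios_service from PuppetDB and write it into the target file.
class example_monitoring::server {
  Nagios_service <<| |>>
}
```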
The Plan
- Distributed Puppet / Icinga
- One Puppetmaster per environment
  - Also Icinga satellite master (dubbed "Server")
  - Performing active checks on its nodes
  - Submitting results to the Master
  - And Graphite relay
- One central Icinga server (dubbed "Master")
  - Only passive checks
  - Receiving check results from satellite masters
The Hierarchy
- Icinga Master
  - Icinga Server Live
    - Webserver Live
  - Icinga Server Staging
    - Webserver Staging
  - Icinga Server Integration
    - Webserver Integration
Excursus: Our Puppet Hierarchy
- We use modules, services and roles with Hiera
- The well-known components, profiles, roles pattern with different names
- Each module knows about the package it handles and the supported OS
- No data, no business logic
- Each service uses dumb modules to implement the business logic (wiring, folders, monitoring, backup, ...)
- Each role includes different services (no logic here)
Example Hierarchy
- Roles: Frontend Web, Backend Search
- Services: Nginx, Solr
- Modules: Nginx, Redis, Tomcat
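The three layers can be sketched roughly as follows. This is only an illustration of the pattern, assuming Puppet 3.x. The class bodies, the `$package_name` and `$vhost_name` parameters and their defaults are assumptions. The talk elides the real parameter lists.

```puppet
# Module layer: knows the package and the service, nothing else.
# No data, no business logic.
class module_nginx (
  $package_name = 'nginx', # assumed default
) {
  package { $package_name:
    ensure => installed,
  }

  service { 'nginx':
    ensure  => running,
    enable  => true,
    require => Package[$package_name],
  }
}

# Service layer: uses dumb modules and adds the business logic.
# With Puppet 3 automatic parameter lookup, Hiera can override $vhost_name.
class service_nginx_frontend (
  $vhost_name = 'www.example.com', # assumed default
) {
  include module_nginx
  # vhost wiring, monitoring and backup declarations would go here
}

# Role layer: only includes services, no logic.
class role_frontend_web {
  include service_nginx_frontend
}
```

A node would then be assigned exactly one role, for example with `include role_frontend_web`.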
The Configuration
- Each Puppet service exports its Icinga checks to PuppetDB
- The Master and each Icinga Server realize these resources
[Diagram: Agent, PuppetDB, Icinga Master, Icinga Server]
Example: Distributed Check I
    class service_nginx_frontend ( ... ) {
      ...
      # Declare one HTTP check per vhost; the define exports it to Icinga
      service_monitoring::check { "HTTP-localhost-81-${vhost_name}":
        service_description => "HTTP VHost for ${vhost_name} on Port 81",
        host_name           => $fqdn,
        command             => "check_http_81_${vhost_name}",
        nrpe_command        => "check_http -H ${vhost_name} -I 127.0.0.1 -p 81 -4 -k 'X-SECRET: 42' -u /ping/ -e 'HTTP/1.1 200'",
        process_perf_data   => 0,
      }
    }
Example: Distributed Check II
    define service_monitoring::check (
      $host_name,
      $command,
      $nrpe_command,
      ...
    ) {
      # Server: exported to the Icinga Server of this environment
      @@module_icinga::server::check { "${fqdn}-${name}":
        ensure    => $ensure,
        host_name => $host_name,
        ...
      }
      # Master: exported to the central Icinga Master
      @@module_icinga::master::check { "${fqdn}-${name}":
        ensure    => $ensure,
        host_name => $host_name,
        ...
      }
      # Client: local NRPE command, not exported
      if $nrpe_command {
        module_icinga::client::check { $command:
          ...
        }
      }
    }
Example: Distributed Checks III
    # Master: collects every exported check and host
    class module_icinga::master::config ( ... ) {
      ...
      Module_icinga::Master::Check <<| |>>
      Module_icinga::Master::Host <<| |>>
    }

    # Server: collects only the resources of its own location
    class module_icinga::server::config ( $location, ... ) {
      ...
      Module_icinga::Server::Check <<| location == $location |>>
      Module_icinga::Server::Host <<| location == $location |>>
    }
Example: Checking Backups I
- People don't want backups, they want recovery
- We don't have checks for recoverability, yet
- But making sure backups actually succeed is important, too
- We use pull backups
- Only a few hosts, most hosts don't need backups
- Automated provisioning, deployment and source control
- Idea: Export backup jobs to the backup servers
- Icinga checks should be exported as well
Example: Checking Backups II
    class service_gitlab ( ... ) {
      service_backup::vault { $fqdn:
        path => "/home/git",
      }
    }
- Every service should ensure it is backed up
- Only if necessary
- The backup resource (vault) takes care of backup and monitoring
Example: Checking Backups III
    define service_backup::vault ( ... ) {
      # Export the backup job to the backup server
      @@module_revobackup::vault { "${name}":
        source => "${user}@${host}:${path}",
        server => $backup_server,
        sudo   => 1,
      }
      # Install the check plugin on the client
      module_icinga::client::plugin { 'check_backup':
        file => "puppet:///.../plugins/check_backup",
      }
      # Export the matching Icinga check
      service_monitoring::check { 'Backup':
        nrpe_command => "check_backup -w $warn -c $crit -p $path",
        ...
      }
    }
Awesome!
- Sounds great and easy
- In practice it was a little more difficult
- Let's look at some of the issues we had with this setup
Issues
- Checks becoming stale
- Modeling host- and servicegroups
- Metrics
- Notifications
- Migrating old Nagios checks
- Breaking Icinga
- Removing checks
- Disabling certain hosts / environments
- Puppet / Ruby performance
Checks becoming stale
- NSCA / send_nsca are pretty old
- They do not scale very well
- We often got batches of stale checks
- First we raised the freshness interval
- Once this was maxed out we tried nsca-ng
- Available out of the box with Ubuntu 14.04 LTS
- Since then we have not looked back
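The freshness interval mentioned above is set per service on the instance that only receives passive results. Below is a minimal sketch using the native `nagios_service` type rather than the talk's own `module_icinga` defines. The threshold value, the target path, the `generic-service` template and the `service_is_stale` command are assumptions.

```puppet
# On the central Master: accept passive results only and let Icinga flag
# the service when no result has arrived within the freshness threshold.
nagios_service { "passive_http_${::fqdn}":
  ensure                 => present,
  host_name              => $::fqdn,
  service_description    => 'HTTP',
  use                    => 'generic-service',  # assumed template name
  active_checks_enabled  => 0,
  passive_checks_enabled => 1,
  check_freshness        => 1,
  freshness_threshold    => 900,                # seconds, illustrative value
  # Run only when the result is stale; assumed to be defined elsewhere
  # as a command that returns a non-OK state.
  check_command          => 'service_is_stale',
  target                 => '/etc/icinga/objects/puppet_services.cfg', # assumed path
}
```

Raising `freshness_threshold` hides transport delays but also delays the detection of checks that really stopped reporting, which matches the experience described on the slide.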
Modeling Host- and Servicegroups
- Host- and servicegroups are very nice
- Modeling them in Puppet is a little difficult
- Hint: Use fallback groups which are always defined
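One possible reading of the hint is sketched below: a group that is declared unconditionally on every Icinga instance, so a host never references a group that does not exist. This is an interpretation, not the talk's implementation. The define name, the `all-hosts` group, the target paths and the `generic-host` template are assumptions, and `join` and `concat` come from the puppetlabs-stdlib module.

```puppet
# Declared on every Icinga instance, independent of what nodes export.
nagios_hostgroup { 'all-hosts':
  ensure => present,
  target => '/etc/icinga/objects/puppet_hostgroups.cfg', # assumed path
}

# Hypothetical wrapper used on each node: every exported host joins the
# fallback group plus any additional groups passed in.
define example_monitoring::host (
  $extra_hostgroups = [],
) {
  # Icinga expects a comma-separated list of group names
  $hostgroups = join(concat(['all-hosts'], $extra_hostgroups), ',')

  @@nagios_host { $name:
    ensure     => present,
    address    => $::ipaddress,
    use        => 'generic-host',  # assumed template name
    hostgroups => $hostgroups,
    target     => '/etc/icinga/objects/puppet_hosts.cfg', # assumed path
  }
}
```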
Gathering Metrics
- Started feeding Nagios perfdata to Graphite
- It quickly became clear that we want a finer resolution
- Switched to Diamond + Graphite relays
- Works quite well
- StatsD / collectd may be even better suited
- If we want to switch, Puppet makes this pretty easy
Notifications
- Sending notifications (email / SMS) is still an issue
- A large environment tends to produce quite a lot of false positives, if only for a short period of time
- aNag works quite well, but it's no push notification
- Suggestions?
Migrating old Nagios Checks
- When we introduced Icinga we still had an old Nagios instance running pretty much unattended
- How do you migrate those checks to a Puppet-managed Icinga Master?
- Easy: Add this to your /etc/icinga/icinga.cfg:
    cfg_dir=/etc/icinga/legacy.d/
- Put the old configuration in there
- Some minor adjustments and you're good to go
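Since the Icinga Master is itself managed by Puppet, the legacy directory can be wired in from a manifest as well. The following is a sketch under assumptions: `file_line` comes from the puppetlabs-stdlib module, the main configuration file is assumed to be `/etc/icinga/icinga.cfg` as shipped by the Debian/Ubuntu packages, and the service name `icinga` is assumed.

```puppet
# Directory that holds the hand-written configuration of the old instance.
file { '/etc/icinga/legacy.d':
  ensure => directory,
  owner  => 'root',
  group  => 'root',
  mode   => '0755',
}

# Declared here only so the notify below has a target (assumed service name).
service { 'icinga':
  ensure => running,
}

# Make Icinga read every *.cfg file below the legacy directory.
file_line { 'icinga_legacy_cfg_dir':
  path    => '/etc/icinga/icinga.cfg',  # assumed location of the main config
  line    => 'cfg_dir=/etc/icinga/legacy.d/',
  require => File['/etc/icinga/legacy.d'],
  notify  => Service['icinga'],
}
```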
Removing Checks
- Puppet / PuppetDB is great tooling
- But sometimes it complicates things a bit
- Removing hosts or services is not as easy as it used to be without Puppet
- Removing a host: puppet node deactivate <fqdn>
- Removing a service: Export an Icinga check resource with ensure => absent
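Removing a service by export can be sketched like this, using the native `nagios_service` type instead of the talk's own check define. The resource title, the description and the target path are illustrative assumptions.

```puppet
# Keep exporting the resource, but flip ensure to absent. Each Icinga
# instance that collects it removes the object from the target file on
# its next Puppet run.
@@nagios_service { "check_old_api_${::fqdn}":
  ensure              => absent,
  host_name           => $::fqdn,
  service_description => 'Old API',
  target              => '/etc/icinga/objects/puppet_services.cfg', # assumed path
}
```

Simply deleting the declaration from the manifest would not be enough, because the collecting side would stop managing the entry and leave it in the configuration file. The declaration can presumably be dropped once every collecting instance has applied the removal.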
Disabling Hosts
- Having all hosts in monitoring is great
- But certain hosts don't need to be monitored
- Reduce noise and distraction
- Using $enable flags on exported resources is the key
- May take a few iterations to get it right
- Icinga doesn't like services referencing non-existing hosts
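A sketch of such a flag follows, assuming one wrapper exports the host together with its services, so both always share the same state and no service is left pointing at a missing host. The define name, the templates, the `check_ping_default` command and the target paths are assumptions.

```puppet
# Hypothetical wrapper: a single $enable flag controls the host and all
# services exported for it.
define example_monitoring::node (
  $enable = true,
) {
  $ensure = $enable ? {
    true    => present,
    default => absent,
  }

  @@nagios_host { $name:
    ensure  => $ensure,
    address => $::ipaddress,
    use     => 'generic-host',     # assumed template name
    target  => '/etc/icinga/objects/puppet_hosts.cfg', # assumed path
  }

  @@nagios_service { "check_ping_${name}":
    ensure              => $ensure,
    host_name           => $name,
    service_description => 'PING',
    check_command       => 'check_ping_default', # assumed command name
    use                 => 'generic-service',    # assumed template name
    target              => '/etc/icinga/objects/puppet_services.cfg', # assumed path
  }
}
```

A host that should stay out of monitoring would then declare `example_monitoring::node { $::fqdn: enable => false }`.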
Puppet / Ruby Performance
- As the number of resources grows, things slow down
- Puppet is written in Ruby
- Ruby is optimized for developers
- Ruby is NOT optimized for execution speed
- Puppet tends to get really slow with huge catalogs
- Solution: Raise your Puppet timeouts VERY high
- We're really eager to see how Ruby 2.x will perform
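The talk does not say which timeouts were raised or to what value. As one example, the agent's `configtimeout` setting in Puppet 3.x limits how long the agent waits for the master to compile and return the catalog. The sketch below manages it with `ini_setting` from the puppetlabs-inifile module. The path and the value are assumptions.

```puppet
# Give the master more time to compile and deliver a huge catalog.
ini_setting { 'puppet_agent_configtimeout':
  ensure  => present,
  path    => '/etc/puppet/puppet.conf',  # assumed Puppet 3.x config location
  section => 'agent',
  setting => 'configtimeout',
  value   => '30m',                      # illustrative, not from the talk
}
```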
Questions?
- Questions? Suggestions?