CS615 - Aspects of System Administration Slide 1 CS615 - Aspects of System Administration Monitoring, SNMP Department of Computer Science Stevens Institute of Technology Jan Schaumann jschauma@stevens.edu http://www.cs.stevens.edu/~jschauma/615/
Problem Report CS615 - Aspects of System Administration Slide 2 Something s wrong.
Now what? CS615 - Aspects of System Administration Slide 3
SNMP CS615 - Aspects of System Administration Slide 4 A complete network management system to monitor network-attached devices.
SNMP CS615 - Aspects of System Administration Slide 5 Base concepts: managed devices run an snmp agent or dæmon information about the device is exposed in management information bases parts of a system are made available in read-only mode parts of a system may be made available in write mode certain conditions may trigger actions or traps normally uses UDP 161 for the agent and 162 for the manager
SNMP CS615 - Aspects of System Administration Slide 6 Management Information Bases (MIBs): hierarchical namespace contains Object Identifiers (OIDs) written in Abstract Syntax Notation One (ASN.1) often vendor defined
SNMP Versions CS615 - Aspects of System Administration Slide 7 SNMPv1: de-facto standard poor security ( community strings act as passwords) SNMPv2: improvements in the area of performance (GETBULK instead of GETNEXT) and security comes in the flavors SNMPv2c, SNMPv1.5 and SNMPv2u SNMPv3: official standard adds authentication, privacy and access control
SNMP: An example CS615 - Aspects of System Administration Slide 8 $ snmpwalk -c public -v 1 colt45.cs.stevens.edu SNMPv2-MIB::sysDescr.0 = STRING: HP ETHERNET MULTI-ENVIRONMENT,ROM R.22.01,JETDIRECT,JD95,EEPROM R.24.06,CIDATE 10/17/2002 SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.11.2.3.9.1 DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (276616180) 32 days, 0:22:41.80 [...] RFC1213-MIB::ipRouteNextHop.0.0.0.0 = IpAddress: 155.246.81.1 [...] HOST-RESOURCES-MIB::hrMemorySize.0 = INTEGER: 65536 KBytes [...] SNMPv2-SMI::mib-2.43.16.5.1.2.1.1 = STRING: "Powersave on" [...]
SNMP: An example CS615 - Aspects of System Administration Slide 9 $ snmpwalk -c public -v 1 gw.cc.stevens-tech.edu SNMPv2-MIB::sysDescr.0 = STRING: Cisco IOS Software, s72033_rp Software (s72033_rp-advipservicesk9_wan-m), Version 12.2(33)SXH, RELEASE SOFTWARE (fc5) DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3112108803) 360 days, 4:44:48.03 SNMPv2-MIB::sysContact.0 = STRING: chose@stevens.edu x5457 SNMPv2-MIB::sysName.0 = STRING: gw.cc.stevens-tech.edu SNMPv2-MIB::sysLocation.0 = STRING: campus:sl:0:machineroom SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00 [...] IF-MIB::ifPhysAddress.1 = STRING: 0:17:95:68:d2:dc IF-MIB::ifPhysAddress.2 = STRING: 0:18:74:1c:e3:80 [...] IF-MIB::ifAdminStatus.1 = INTEGER: up(1) IF-MIB::ifAdminStatus.2 = INTEGER: down(2) [...] IF-MIB::ifInOctets.1 = Counter32: 147341347 IF-MIB::ifInOctets.2 = Counter32: 487894092 [...] IF-MIB::ifOutOctets.1 = Counter32: 956876160 IF-MIB::ifOutOctets.2 = Counter32: 1532452749 [...] RFC1213-MIB::ipRouteDest.66.193.255.0 = IpAddress: 66.193.255.0 RFC1213-MIB::ipRouteDest.66.194.0.0 = IpAddress: 66.194.0.0 [...]
SNMP: An example CS615 - Aspects of System Administration Slide 10 $ snmpwalk -Os -c public -v 1 localhost sysdescr.0 = STRING: Linux eva 2.6.32-33-generic #72-Ubuntu SMP Fri Jul 29 21:07:13 U sysobjectid.0 = OID: netsnmpagentoids.10 sysuptimeinstance = Timeticks: (90638910) 10 days, 11:46:29.10 syscontact.0 = STRING: Root <root@localhost> (configure /etc/snmp/snmpd.local.conf) sysname.0 = STRING: eva syslocation.0 = STRING: Unknown (configure /etc/snmp/snmpd.local.conf) [...]
CS615 - Aspects of System Administration Slide 11 Hooray! 5 Minute Break
Symptoms CS615 - Aspects of System Administration Slide 12 The system feels slow.
Symptoms CS615 - Aspects of System Administration Slide 13 The system feels slow. I can t log in.
Symptoms CS615 - Aspects of System Administration Slide 14 The system feels slow. I can t log in. My mail was not delivered.
Symptoms CS615 - Aspects of System Administration Slide 15 The system feels slow. I can t log in. My mail was not delivered. The site is down.
Now what? CS615 - Aspects of System Administration Slide 16
To the logs! CS615 - Aspects of System Administration Slide 17
Answers CS615 - Aspects of System Administration Slide 18 The system feels slow. up 1318 days, 13:46, 1 user, load averages: 993.81, 272.91, 1012.18
Answers CS615 - Aspects of System Administration Slide 19 The system feels slow. up 1318 days, 13:46, 1 user, load averages: 993.81, 272.91, 1012.18 I can t log in. Apr 6 09:25:56 <auth.info>hostname sshd[1624]: Failed password for root from 115.239.231.100 port 1047 ssh2
Answers CS615 - Aspects of System Administration Slide 20 The system feels slow. up 1318 days, 13:46, 1 user, load averages: 993.81, 272.91, 1012.18 I can t log in. Apr 6 09:25:56 <auth.info>hostname sshd[1624]: Failed password for root from 115.239.231.100 port 1047 ssh2 The mail was not delivered. Apr 6 19:04:26 <mail.info>hostname postfix/smtpd[18524]: NOQUEUE: reject: RCPT from a61.amerikadadiploma.com[31.210.91.61]: 450 4.1.8 <noreply-only-bounces-selintokgoz-emailer@amerikadadiploma.com>: Sender address rejected: Domain not found; from=<noreply-only-bounces-selintokgoz-emailer@amerikadadiploma.com> to=<jschauma@netmeister.org> proto=esmtp helo=<a61.amerikadadiploma.com>
Answers CS615 - Aspects of System Administration Slide 21 The site is down. 94.242.252.41 - "" [06/Apr/2014:19:18:47-0400] "GET /secret/ HTTP/1.1" 401 524 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0"
Answers CS615 - Aspects of System Administration Slide 22 The site is down. 94.242.252.41 - "" [06/Apr/2014:19:18:47-0400] "GET /secret/ HTTP/1.1" 401 524 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0"
CS615 - Aspects of System Administration Slide 23 Events Something s wrong. is just an unexpected or undesirable event.
CS615 - Aspects of System Administration Slide 24 Events Something s wrong. is just an unexpected or undesirable event. Events happen all the time.
CS615 - Aspects of System Administration Slide 25 Events Something s wrong. is just an unexpected or undesirable event. Events happen all the time. Being able to identify relevant events allows you to diagnose, predict and even prevent undesirable events.
CS615 - Aspects of System Administration Slide 26 Events In order to be able to identify an event as unexpected, you have to have expected events.
CS615 - Aspects of System Administration Slide 27 Expected Events Know your applications.
Expected Events CS615 - Aspects of System Administration Slide 28 Know your applications. Know your users.
Expected Events CS615 - Aspects of System Administration Slide 29 Know your applications. Know your users. Know your traffic patterns.
Expected Events CS615 - Aspects of System Administration Slide 30 Know your applications. Know your users. Know your traffic patterns. Know your systems.
Events and Metrics CS615 - Aspects of System Administration Slide 31 $ dict event event n 1: something that happens at a given place and time 2: a special set of circumstances; "in that event, the first possibility is excluded"; "it may rain in which case the picnic will be canceled" [syn: {event}, {case}] [...]
Events and Metrics CS615 - Aspects of System Administration Slide 32 $ dict event event n 1: something that happens at a given place and time 2: a special set of circumstances; "in that event, the first possibility is excluded"; "it may rain in which case the picnic will be canceled" [syn: {event}, {case}] [...] $ dict metric [...] 3: a system of related measures that facilitates the quantification of some particular characteristic [syn: {system of measurement}, {metric}]
Events and Metrics CS615 - Aspects of System Administration Slide 33 Event Metric You
Events and Metrics CS615 - Aspects of System Administration Slide 34 Events: may occur rarely / frequently / constantly can be collected in logs may be comprised of other events may be: something happened may be: nothing (new) happened Metrics: correlation of related events may help identify outliers may trigger events may help make (automated or interactive) decisions
Collecting Data CS615 - Aspects of System Administration Slide 35 Counters: easy, numeric data tracking individual events. Example: HTTP status codes Timers: easy, numeric data tracking event duration. Example: Time to send all data for a successful HTTP request. Thresholds: easy, numeric trigger for events; may itself trigger events or metrics. Example: more than N HTTP hits in X seconds yield 404.
SNMP: An example CS615 - Aspects of System Administration Slide 36 $ snmpwalk -c public -v 1 gw.cc.stevens-tech.edu SNMPv2-MIB::sysDescr.0 = STRING: Cisco IOS Software, s72033_rp Software (s72033_rp-advipservicesk9_wan-m), Version 12.2(33)SXH, RELEASE SOFTWARE (fc5) DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (3112108803) 360 days, 4:44:48.03 SNMPv2-MIB::sysContact.0 = STRING: chose@stevens.edu x5457 SNMPv2-MIB::sysName.0 = STRING: gw.cc.stevens-tech.edu SNMPv2-MIB::sysLocation.0 = STRING: campus:sl:0:machineroom SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00 [...] IF-MIB::ifPhysAddress.1 = STRING: 0:17:95:68:d2:dc IF-MIB::ifPhysAddress.2 = STRING: 0:18:74:1c:e3:80 [...] IF-MIB::ifAdminStatus.1 = INTEGER: up(1) IF-MIB::ifAdminStatus.2 = INTEGER: down(2) [...] IF-MIB::ifInOctets.1 = Counter32: 147341347 IF-MIB::ifInOctets.2 = Counter32: 487894092 [...] IF-MIB::ifOutOctets.1 = Counter32: 956876160 IF-MIB::ifOutOctets.2 = Counter32: 1532452749 [...] RFC1213-MIB::ipRouteDest.66.193.255.0 = IpAddress: 66.193.255.0 RFC1213-MIB::ipRouteDest.66.194.0.0 = IpAddress: 66.194.0.0 [...]
Know your Systems CS615 - Aspects of System Administration Slide 37 Profile your application: execution time (for example: time(1)) data sources and destination affect execution strace(1) and friends for more detailed analysis Understand your system performance: CPU load, memory (for example: top(1), vmstat(1)) disk I/O (for example: iostat(1)) user activity (for example: ac(1), lsof(8), sa(8))
Know your Systems CS615 - Aspects of System Administration Slide 38 Network statistics: ports and applications (for example: lsof(8), netstat(8)) packets in and out connection origin NetFlow etc.
Context CS615 - Aspects of System Administration Slide 39 Context lets you find the relevant events in your haystack of metrics.
No context. CS615 - Aspects of System Administration Slide 40 CPU load - 12 hours
No context. CS615 - Aspects of System Administration Slide 41 Disk I/O - 12 hours
No context. CS615 - Aspects of System Administration Slide 42 Load Average - 12 hours
No context. CS615 - Aspects of System Administration Slide 43 Memory - 12 hours
Some context. CS615 - Aspects of System Administration Slide 44 12 hours
With context. CS615 - Aspects of System Administration Slide 45 7 days
With context. CS615 - Aspects of System Administration Slide 46 CPU load - 30 days
Building context CS615 - Aspects of System Administration Slide 47 Define your events and relevant information. Example: Apache Common Log 127.0.0.1 - - [10/Oct/2000:13:55:36-0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Building context CS615 - Aspects of System Administration Slide 48 Define your events and relevant information. Example: Apache Common Log 127.0.0.1 - - [10/Oct/2000:13:55:36-0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 Collect events. Example: syslog(3) Apr 7 14:04:54 <auth.info>hostname sshd[11682]: Accepted publickey for jschauma from 207.38.152.228 port 38012 ssh2 Apr 7 14:04:54 <auth.info>hostname sshd[1759]: Received disconnect from 207.38.152.228: 11: disconnected by user Apr 7 16:27:44 <authpriv.notice>hostname sudo: jschauma : TTY=pts/4 ; PWD=/var/log ; USER=root ; COMMAND=/usr/bin/tail authlog
Turn events into metrics CS615 - Aspects of System Administration Slide 49 Log it! Export counters/timers from within your application. Process logs and produce counters/timers: awk {print $9} /var/log/httpd/access.log sort uniq -c Graph it. http://shouldigraphit.com/
Monitoring/graphing tools CS615 - Aspects of System Administration Slide 50 SNMP based: Cacti: http://www.cacti.net/ MRTG: http://oss.oetiker.ch/mrtg/ Observium: http://demo.observium.org/... Other / complementary: Ganglia: http://monitor.millennium.berkeley.edu/ Munin: http://munin.ping.uio.no/ Nagios: http://nagioscore.demos.nagios.com/ Graphite: http://graphite.wikidot.com/
To the cloud! CS615 - Aspects of System Administration Slide 51 There s a service for that. In the cloud. Consider: support / convenience vs. do-it-yourself integration with your other services confidentiality data lock-in
Monitoring Pitfalls CS615 - Aspects of System Administration Slide 52 Increasing the size of your haystack does not always help in finding the needle.
Monitoring Pitfalls CS615 - Aspects of System Administration Slide 53 Increasing the size of your haystack does not always help in finding the needle. Email is not a scalable network monitoring solution.
Monitoring Pitfalls CS615 - Aspects of System Administration Slide 54 Increasing the size of your haystack does not always help in finding the needle. Email is not a scalable network monitoring solution. Absence of a signal can itself be a signal.
Monitoring Pitfalls CS615 - Aspects of System Administration Slide 55 Increasing the size of your haystack does not always help in finding the needle. Email is not a scalable network monitoring solution. Absence of a signal can itself be a signal. This list is incomplete.
Reading CS615 - Aspects of System Administration Slide 56 SNMP: http://is.gd/1lwosd RFCs 1157, 3411, 3418 and others snmpcmd(1) Monitoring: https://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html http://www.datadoghq.com/ https://www.newrelic.com/ http://logstash.net/ http://www.splunk.com/...