A company introduces itself: Nagios in large environments
Agenda: About ITdesign; Introduction; Customer environments and requirements; Heterogeneous environment: how to get data from end systems?; 350 Servers (> 25 000 measurements): optimized plugin design; 550+ routers and switches: ITdesign solution for interface measurement
About ITdesign: Consulting company founded in 2000 as a spin-off from DEC/COMPAQ, located in Vienna. 38 people in total working on infrastructure and software projects. Focus on high-availability infrastructure (Novell, Microsoft, VMware, Citrix), programming (Perl, C#, Java, AS/400 RPG), etc. 5 people (+ 1 external) working on Nagios. Profitable every year, with a growth rate of 10-20% per year (in people and cash).
Contact information ITdesign Software Projects & Consulting GmbH Anton Freunschlag-Gasse 49 A-1230 Wien Tel.: +43(1)699 33 99-0 Fax: +43(1)699 33 99-33 E-Mail: office@itdesign.at Werner Neunteufl Technical Consultant Mobile +43(664) 230 45 33 werner.neunteufl@itdesign.at
Customer requirements (management): Service level agreements; end-to-end performance monitoring (e.g. SAP client); reports / statistics; service views without technical details; service monitoring. Don't forget management: they pay for this!
Customer requirements (technical): Platform independent: monitor everything, including Windows, UX*, IBM host, VMware, applications, log files, etc. Monitoring must not send too many mails (no notification overload). Handle vacation / attendance of people. Easy maintenance. Nice technical view. No agents (which could negatively influence end systems).
3 cool customer environments: (1) Heterogeneous environment: AS/400, VMware ESX 3.0, end-to-end application performance, Unix, Windows, SLA calculation. (2) 350 Servers (> 25 000 measurements): monitor volumes, loaded NLMs, DirXML, Timesync, SLP, LDAP, etc. (3) 550+ Cisco routers and switches: monitor each interface with all properties; replace Cacti with Nagios.
Heterogeneous environment: Customer requirements. Integration of all end systems: AS/400, i5, iSeries; business applications; end-to-end application performance measurement; environment (UPS, air conditioning, etc.); databases; VMware ESX; backup software; hardware monitoring (e.g. HP Systems Insight Manager).
Heterogeneous environment: How to get data? Active: active checks with plugins (e.g. SNMP, SSH, WMI). Passive: SNMP traps (from any device); nsca (e.g. AIX monitoring); mail (e.g. backup software); FTP (e.g. QSYSOPR messages from AS/400); end-to-end measurement from clients.
Heterogeneous environment: Passive interface design. Generic solution for all network transports. Customizing on the Nagios side and/or on the end system. Handles performance data the way active plugins do. Simplifies parsing of input data. Configuration instead of programming. Modular design for future extensions.
Heterogeneous environment: Passive interface layers. Nagios passive events. Output layer / interface to performance data. Parsers: XML parser, CSV parser, TXT parser, nsca. Transport layer: nsca, mail, SNMP, file, FTP.
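The layering above (transport, then parser, then passive event) can be sketched roughly as follows. This is a minimal illustration and not ITdesign's implementation: the CSV column order, the XML element and attribute names, and the function names are all assumptions made for the example. Only the PROCESS_SERVICE_CHECK_RESULT line format is Nagios' own external command syntax.

```python
import csv
import io
import time
import xml.etree.ElementTree as ET


def parse_csv(payload):
    """Parse lines of host,service,state,output (hypothetical column order)."""
    rows = (row for row in csv.reader(io.StringIO(payload)) if row)
    return [
        {"host": h, "service": s, "state": int(c), "output": o}
        for h, s, c, o in rows
    ]


def parse_xml(payload):
    """Parse <result host= service= state=>text</result> elements (hypothetical schema)."""
    root = ET.fromstring(payload)
    return [
        {
            "host": r.get("host"),
            "service": r.get("service"),
            "state": int(r.get("state")),
            "output": (r.text or "").strip(),
        }
        for r in root.iter("result")
    ]


# Adding a new input format means registering one more parser here:
# configuration instead of programming on the caller's side.
PARSERS = {"csv": parse_csv, "xml": parse_xml}


def to_external_commands(fmt, payload, now=None):
    """Turn a received payload into Nagios passive service check commands."""
    ts = int(time.time() if now is None else now)
    return [
        "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}".format(
            ts, r["host"], r["service"], r["state"], r["output"]
        )
        for r in PARSERS[fmt](payload)
    ]
```

Each returned line would then be written to the Nagios external command file. The transport (mail, FTP, file) only has to hand over the payload and name its format.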
Heterogeneous environment: Example 1: AS/400 integration. The operators do not allow any access to the machine! Software running on the AS/400 reads data with IBM's APIs for collecting performance data. Warning and critical thresholds are set on the AS/400. Data is transferred by FTP from the AS/400 to the Nagios machine. The passive event interface takes the data, processes the performance data and sends a passive event to Nagios.
Heterogeneous environment: Example 2: End-to-end performance monitoring. Measurement is done on dedicated clients. Robot software collects data from applications. We convert the data into XML and CSV and send it by mail to the Nagios server. The passive event interface collects the performance data and triggers events in Nagios.
350 Servers (> 25 000 measurements): Problems. We had never faced such a large environment before and had only one Nagios server available. A host-down problem stopped the scheduling queue. Performance problems everywhere: CPU, network (WAN). Web view is overloaded. Performance data graphs. Solution: design something new, but what?
350 Servers (> 25 000 measurements): Design requirements (general). Perl instead of C: use existing Perl know-how; no embedded Perl (some tests fail); Perl costs performance, so the design must compensate for this. No modification of the Nagios source code. No modification of Nagios plugins. Avoid conflicts with upward compatibility. Avoid conflicts with the GPL (checked with a lawyer). Better views (web output).
350 Servers (> 25 000 measurements): Design requirements (technical). Host down must not stop the scheduling queue (no longer a problem with Nagios 3.0). Plugins are success factor #1. Plugins cooperate. Plugins must reduce network traffic wherever possible. Plugins must cache data on disk. Performance data must not influence CPU load. The graph engine for performance data must not influence CPU load.
350 Servers (> 25 000 measurements): Example: information from a remote system. 1 process running? 10 TCP packets / ~30 ms. 2 processes running? 11 TCP packets / ~30 ms. 3 processes running? 12 TCP packets / ~30 ms. Conclusion: read more information than you need now and store it on disk (plugin caching).
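The caching idea can be illustrated with a small sketch: the first plugin that needs data from a server fetches everything in one round trip and stores it on disk, and later plugins reuse the file while it is fresh. The cache layout, the `max_age` default and the function name `cached_fetch` are assumptions for this example, not the format used by the ITdesign plugins.

```python
import json
import os
import tempfile
import time


def cached_fetch(cache_path, fetch, max_age=60, now=None):
    """Return cached data if it is younger than max_age seconds, else refetch.

    `fetch` is any callable that performs the (expensive) remote query and
    returns JSON-serializable data, e.g. the full process list of a server.
    """
    now = time.time() if now is None else now
    try:
        with open(cache_path) as fh:
            entry = json.load(fh)
        if now - entry["fetched_at"] <= max_age:
            return entry["data"]
    except (OSError, ValueError, KeyError):
        pass  # missing, unreadable or corrupt cache: fall through and refetch

    data = fetch()
    # Write to a temporary file and rename it, so that a plugin running in
    # parallel never reads a half-written cache file.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"fetched_at": now, "data": data}, fh)
    os.replace(tmp_path, cache_path)
    return data
```

A process check and a volume check for the same server could both call `cached_fetch` with the same cache path, so only the first call within the interval generates network traffic.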
350 Servers (> 25 000 measurements): Plugins cooperate. [Diagram labels: Nagios process; disk plugin; mem plugin; server properties; Nagios output > link to CGI program; cache data on disk; collect performance data; write performance data to disk]
350 Servers (> 25 000 measurements): Online demo of optimized plugins on notebook
350 Servers (> 25 000 measurements): Traditional setup: each measurement separately
350 Servers (> 25 000 measurements): ITdesign solution: one plugin does multiple measurements
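A rough sketch of one plugin covering several measurements: each value is compared against its own thresholds, the worst state becomes the plugin's return code, and all values are reported as performance data after the `|` separator. The measurement labels and thresholds below are made up for illustration; the state codes (0 OK, 1 WARNING, 2 CRITICAL) and the `label=value;warn;crit` performance data layout follow the standard Nagios plugin conventions.

```python
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate(value, warn, crit):
    """Map one value to a Nagios state (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def combined_check(measurements):
    """Evaluate many measurements and build a single plugin result.

    measurements: list of (label, value, warn, crit) tuples.
    Returns (return_code, output_line).
    """
    states = [evaluate(v, w, c) for _, v, w, c in measurements]
    worst = max(states, default=OK)
    problems = [m[0] for m, s in zip(measurements, states) if s != OK]
    summary = ", ".join(problems) if problems else "all measurements within thresholds"
    perfdata = " ".join(
        "{}={};{};{}".format(label, v, w, c) for label, v, w, c in measurements
    )
    return worst, "{} - {} | {}".format(STATE_NAMES[worst], summary, perfdata)
```

A real plugin would print the output line and exit with the return code. Nagios then schedules one check per server instead of one per measurement.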
350 Servers (> 25 000 measurements): Drill down into operating system details
350 Servers (> 25 000 measurements): Writing and collecting performance data. Plugins write the current measurement to disk and mark each measurement with an ID. A highly optimized, scheduled cron job takes all measurement data and stores it in the filesystem or an SQL database. To avoid a huge amount of data, only changes (deltas) are stored. Example: process availability of httpstkd.
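Storing only changes can be sketched in a few lines. In this illustration an in-memory dictionary stands in for the filesystem or SQL database, and the measurement ID string is a made-up example. The point is only that a value is written when it differs from the last stored value for the same ID.

```python
def append_if_changed(history, measurement_id, timestamp, value):
    """Store (timestamp, value) only if value differs from the last stored one.

    history: dict mapping measurement ID -> list of (timestamp, value) changes.
    Returns True if a new record was written.
    """
    series = history.setdefault(measurement_id, [])
    if series and series[-1][1] == value:
        return False  # unchanged since the last stored record: skip it
    series.append((timestamp, value))
    return True


# Usage: five polls of a process, only three state changes are kept.
history = {}
polls = [(0, "up"), (300, "up"), (600, "down"), (900, "down"), (1200, "up")]
for ts, state in polls:
    append_if_changed(history, "server1/httpstkd", ts, state)
```

For a process that is almost always running, this reduces thousands of identical samples to a handful of records while still preserving the exact time of every change.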
350 Servers (> 25 000 measurements): Graphing performance data. RRD databases and graphs are created when the user clicks on the appropriate link. We call this feature "RRD graphs on demand". + CPU load only for a very short time. + RRD databases are created on click. + Graphs can be changed on the fly (no need to recreate RRD databases). + Graphs do not lose measurement details. + Zoom in / out implemented on the server side.
350 Servers (> 25 000 measurements): Example: Graphing performance data with zoom in
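When only changes are stored, a graph for a requested time window has to be rebuilt on demand. The sketch below shows one plausible way to do that step: it expands change-only records into evenly spaced samples for the window the user asked for. It deliberately does not call RRDtool, and the function name and record layout are assumptions. In a real setup the resulting samples would be fed into an RRD database or a plotting routine.

```python
from bisect import bisect_right


def expand(changes, start, end, step):
    """Rebuild a regular time series from change-only records.

    changes: list of (timestamp, value), sorted by timestamp.
    Returns [(t, value)] for t = start, start + step, ... up to end, where
    value is the most recent change at or before t (None before the first).
    """
    times = [t for t, _ in changes]
    samples = []
    t = start
    while t <= end:
        i = bisect_right(times, t)  # number of changes at or before t
        samples.append((t, changes[i - 1][1] if i else None))
        t += step
    return samples
```

Zooming in is then just a call with a narrower window and a smaller step. Because the stored changes keep their exact timestamps, no measurement detail is lost at any zoom level.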
550+ routers and switches: Generic solutions do not work because: reducing network traffic is the biggest challenge; caching data on disk is not enough; execution time is a problem (network polling). Sometimes it's easier to write a special plugin. We wrote an application for reading interfaces via SNMP: the interface_table.pl plugin.
550+ routers and switches: No need to know each interface: only the SNMP community string is required, and each interface is monitored automatically (plug and play). Warning and critical thresholds can also be set on throughput to recognize link overload. Finds changes on each interface (e.g. an ISDN backup link goes up, or remote support from the provider dials in). Interfaces can be included or excluded.
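Throughput thresholds and change detection can be sketched as below. This is not the logic of interface_table.pl, whose internals the slides do not show. It assumes two samples of a 32-bit octet counter (such as IF-MIB `ifInOctets`) taken `seconds` apart, a known link speed, and interface states cached from the previous run. The SNMP polling itself is left out, and all function names are hypothetical.

```python
COUNTER32_MAX = 2 ** 32  # 32-bit SNMP counters wrap around at this value


def throughput_bps(prev_octets, curr_octets, seconds):
    """Bits per second between two counter samples, tolerating one wrap."""
    delta = curr_octets - prev_octets
    if delta < 0:  # the counter wrapped between the two samples
        delta += COUNTER32_MAX
    return delta * 8 / seconds


def link_state(prev_octets, curr_octets, seconds, speed_bps, warn_pct, crit_pct):
    """Return (nagios_state, utilization_percent) for one interface."""
    usage = 100.0 * throughput_bps(prev_octets, curr_octets, seconds) / speed_bps
    if usage >= crit_pct:
        return 2, usage
    if usage >= warn_pct:
        return 1, usage
    return 0, usage


def changed_interfaces(previous, current):
    """Names of interfaces whose state differs from the cached previous run.

    previous, current: dicts mapping interface name -> operational state.
    Interfaces that appeared or disappeared are reported as changes too.
    """
    names = set(previous) | set(current)
    return sorted(n for n in names if previous.get(n) != current.get(n))
```

For example, an ISDN backup interface going from "down" to "up" between two runs would show up in `changed_interfaces` without anyone having configured a check for that interface.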
550+ routers and switches: Online demo of optimized plugins on notebook
550+ routers and switches: The interface_table plugin measures a complete network device
550+ routers and switches: The interface_table.pl plugin evolved into the most wanted plugin we have because: some customers use it as an inventory and even as an add-on to their network documentation; it monitors one complete device, so no more checks are required; deployment time is very short. The command line looks like: interface_table.pl -C public -H router1 -w <> -c <>
Summary Nagios
Questions?
Thanks for your attention