System Monitoring Using NAGIOS, Cacti, and Prism.



Similar documents
Network and Server Statistics Using Cacti

Network and Server Statistics Using Cacti

Maintaining Non-Stop Services with Multi Layer Monitoring

USING OPEN SOURCE SOFTWARE IN DAILY ISP OPERATIONS

MONITORING RED HAT GLUSTER SERVER DEPLOYMENTS With the Nagios IT infrastructure monitoring tool

PATROL From a Database Administrator s Perspective

SCF/FEF Evaluation of Nagios and Zabbix Monitoring Systems. Ed Simmonds and Jason Harrington 7/20/2009

CAREN NOC MONITORING AND SECURITY

Grids & networks monitoring - practical approach

Monitoring Tools for Large Scale Systems

Результат запроса: Cacti weathermap

This presentation introduces you to the new call home feature in IBM PureApplication System V2.0.

Patch Management. Module VMware Inc. All rights reserved

MALAYSIAN PUBLIC SECTOR OPEN SOURCE SOFTWARE (OSS) PROGRAMME. COMPARISON REPORT ON NETWORK MONITORING SYSTEMS (Nagios and Zabbix)

A SURVEY ON AUTOMATED SERVER MONITORING

GRNET NOC network monitoring & visualization tools

Network performance overview. TEIN2 Bangkok September 2005

1. INTERFACE ENHANCEMENTS 2. REPORTING ENHANCEMENTS

The new services in nagios: network bandwidth utility, notification and sms alert in improving the network performance

How To Monitor Your Computer With Nagiostee.Org (Nagios)

Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment

Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment

Tools and strategies to monitor the ATLAS online computing farm

Network Monitoring. Review of Software

Proactively Monitoring Departmental Clinical IT Systems with an Open Source Availability System

Zend and IBM: Bringing the power of PHP applications to the enterprise

Monitoring backbone networks

WHITE PAPER GoundWork: Bringing IT Operations Management to Open Source and Beyond

Availability Management Nagios overview. TEIN2 training Bangkok September 2005

There are numerous ways to access monitors:

Network and Server Statistics using Cacti

1. INTERFACE ENHANCEMENTS 2. REPORTING ENHANCEMENTS

Violin Symphony Abstract

TPAf KTl Pen source. System Monitoring. Zenoss Core 3.x Network and

Monitoring MySQL. Presented by, MySQL & O Reilly Media, Inc. A quick overview of available tools

WÜRTHPHOENIX NetEye Version 3

A recipe using an Open Source monitoring tool for performance monitoring of a SaaS application.

Perceptive Software Platform Services

Ensure Continuous Availability and Real-Time Monitoring of your HP NonStop Systems with ESQ Automated Operator

FITB. Network Graphing Done Right. Laurie Denness

How To Use Ibm Tivoli Monitoring Software

Project Orwell: Distributed Document Integrity Verification

FUNCTIONAL OVERVIEW

StruxureWare TM Center Expert. Data

Cacti complete network graphing solution. Oz Melamed E&M Computing Jun 2009

A Scalable Network Monitoring System as a Public Service on Cloud

Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid clouds.

FileMaker 12. ODBC and JDBC Guide

Do you know how your TSM environment is evolving?

Vistara Lifecycle Management

Intellicus Enterprise Reporting and BI Platform

Professional Xen Visualization

End Your Data Center Logging Chaos with VMware vcenter Log Insight

The Evolution of Cray Management Services

AXIGEN Mail Server Reporting Service

Management of VMware ESXi. on HP ProLiant Servers

TECHNOLOGY WHITE PAPER Jan 2016

MEGARAC XMS Sx EXTENDIBLE MANAGEMENT SUITE SERVER MANAGER EDITION

QuickSpecs. HP Integrity Virtual Machines (Integrity VM) Overview. Currently shipping versions:

SEE-GRID-SCI. SEE-GRID-SCI USER FORUM 2009 Turkey, Istanbul December, 2009

Advanced Science and Technology Institute Department of Science and Technology

COMPARING NETWORK AND SERVER MONITORING TOOLS

Using AppMetrics to Handle Hung Components and Applications

MeritPresentationHandout

REQUIREMENTS LIVEBOX.

101 ways to use SysAid

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Managing your Red Hat Enterprise Linux guests with RHN Satellite

Optimization of QoS for Cloud-Based Services through Elasticity and Network Awareness

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM

EMC UNISPHERE FOR VNXe: NEXT-GENERATION STORAGE MANAGEMENT A Detailed Review

Installation documentation for Ulteo Open Virtual Desktop

Ulteo Open Virtual Desktop Installation

The enterprise class monitoring solution for everyone

Details. Some details on the core concepts:

Free Network Monitoring Software for Small Networks

Best Practices for Deploying and Managing Linux with Red Hat Network

Proactively Managing Servers with Dell KACE and Open Manage Essentials

OMNITURE MONITORING. Ensuring the Security and Availability of Customer Data. June 16, 2008 Version 2.0

DEPLOYMENT GUIDE Version 1.0. Deploying the BIG-IP LTM with the Nagios Open Source Network Monitoring System

SMARTcontrol. Dashboard. from

How To Use 1Bay 1Bay From Awn.Net On A Pc Or Mac Or Ipad (For Pc Or Ipa) With A Network Box (For Mac) With An Ipad Or Ipod (For Ipad) With The

DOCUMENT MANAGEMENT SYSTEM

TECHNOLOGY WHITE PAPER Jun 2012

WIALAN Technologies, Inc. Network Monitoring System Overview

MANAGED HOSTING SERVICES

Tue Apr 19 11:03:19 PDT 2005 by Andrew Gristina thanks to Luca Deri and the ntop team

Index terms Management, Measurement, Performance Monitoring, Middleware. Keywords Wi-Fi Networks, Service Level Agreement, Visualization

Unified Batch & Stream Processing Platform

How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations

Simplifying power management in virtualized data centers

MICROSOFT EXCHANGE MAIN CHALLENGES IT MANAGER HAVE TO FACE GSX SOLUTIONS

VIRTUOZZO TM FOR LINUX 2.6.1

Implementing a secure high visited web site by using of Open Source softwares. S.Dawood Sajjadi Maryam Tanha. University Putra Malaysia (UPM)

Hardware Monitoring with the new Nagios IPMI Plugin

Instructions for Access to Summary Traffic Data by GÉANT Partners and other Organisations

Transcription:

System Monitoring Using NAGIOS, Cacti, and Prism. Thomas Davis, NERSC and David Skinner, NERSC ABSTRACT: NERSC uses three Opensource projects and one internal project to provide systems fault and performance monitoring. These three projects are NAGIOS, Cacti, and Prism. 1. Introduction Nagios, Cacti, and Prism are part of the core Operational Monitoring strategy at NERSC. 2. What is Nagios? Nagios is a fault monitoring software package with plugin support. The plugin support allows one to create methods of monitoring that are specific to a system, or business/contract logic. 2.1 Where/How do I get it? You can download it from http://www.nagios.org/. Some Linux distributions (ie, Fedora) already have it in the repositories, allowing a binary install. 2.2 How does it work? By using 'plugins', it can monitor, notify, and acknowledge when faults occur. Can return 3 possible exit states OK, WARNING, and CRITICAL. The plugin can also return a text message, stating what the condition means. 2.2 What can I do with it? You can watch for network, process, hardware and software faults. You can also collect data for processing by other utilities. Notifications can be sent based on date, time, or severity. 3. What is Cacti? Cacti is performance monitoring tool based on a LAMP stack (Linux/Apache/MySQL/PHP) and RRD (Round Robin Database). It can collect, manage and display graphs of collected data. 3.1 Where/How do I get Cacti? Cacti is available from http://cacti.net. Some distributions (ie, Fedora) also supply a version in their repositories. However, and important architecture plugin feature of cacti has to be patched in. 3.2 How does Cacti work? Cacti uses Round Robin Databases (RRD) and MySQL database technologies to store collected information. MySQL and PHP is used to provide a graphical, web based interface to the RRD databases. Rrd database technology was popularized in the widely known MRTG graphing project. Cacti takes MRTG configuration to a whole new level it provides a 'click and point' interface to RRD's. A user can easily define, graph, and track over time many different devices without editing a configuration file using an editor. 3.3 What can I do with it? Cacti is extendable, and user driven. The limits are 4. What is PRISM? Prism is a project that collects data from varios NAGIOS systems using RDF, and then displays the results in an aggregated form. 4.1 How do I get PRISM? 4.2 How does it work? Using RDF/XML/RSS feeds from nagios, it collects, processes, and creates a color coded, simplified display of system status, allowing quick, easy visual notification of faults. CUG 2009 Proceedings 1 of 5

4.2 What can I do with it? Prism allows the grouping of several NAGIOS (in NERSC's case, 9 displays) into one display. This allows 'at a glance' status checks. DDN I/O operations, collected from a DDN controller port. 5. Examples of Nagios/Cacti/PRISM usage 5.1 Nagios fault monitoring of DDN controllers and drives. DDN Read/Write/Verify/Rebuild data rates graph A snapshot from nagios, showing the a DDN drive fault. This information is collected using telnet, driven by a Perl/Expect script. The script is capable of detecting system faults, system temperatures, and other errors. This plugin makes finding and repairing DDN faults easier. 5.2 Cacti data collection, processing and graphing of DDN controller and drive performance. Cacti is used to collect I/O statistics from the controllers. Areas of data collection are I/O operation statistics, I/O read/write/verify/rebuild data rates, and I/O block sizes, and drive temperatures. For each controller, a graph for each port, plus a graph for total is generated. The data is collected using either the DDN api or telnet/command line scraping. Examples of several different graphs are: DDN Write I/O Size, histogram style. 5.3 Thermal monitoring of a Cray XT4 system. Several plugins have been created to monitor the thermal environment in the machine room. The nodes monitored for temperature are Cray cabinets, air handling units (AHU), and DDN controllers and drives. The system has a set point that is used to send a WARNING email message when the temperature rises past a certain point, and a CRITICAL email message at a higher point. These two points are chosen to allow operations/staff to time to react and either fix the temperature problem or do a graceful shutdown. This is to prevent an automatic overtemp shutdown from occuring. The other goal of this monitoring is to help provide information on variances in the machine room floor temperatures, and provide long term information for future planning. CUG 2009 Proceedings 2 of 5

An example of a alert message from NAGIOS, which is monitoring the Cray cabinets using the XTWM service from the SMW: Notice in this email message, that it contains several bits of information: 1) Type of alert. 2) Source of alert 3) Date & Time of alert. 4) Temperature of system The system also collects the temperature information, and stores into a MySQL database. This database is queried by Cacti, to generate a graph of temperatures, like this: When the temperature of the system returns back to normal,, nagios will send another email message notification of recovery: This graph shows the average temperature over a 1 year span, reduced to a 1 day average from a 5 minute average. Nagios and Cacti are also combined, to generate a screen that allows 'at a glance' status thermal information of the system. This image shows that one can see there is no major temperature problems on the machine floor. However, one can see that certain parts of the floor are warmer than other parts. CUG 2009 Proceedings 3 of 5

the NAGIOS plugin looks at the XTCLI information generated on the SMW based on chassis, and faults on a node failure. The plugin works by using design feature of Nagios, in that you can go from an OK state, to an CRITICAL, then to an WARNING state, and only generate one email message. The plugin does the following: 1. A cron job is run on the SMW, that generates a xtcli file. This file is transfered to the NAGIOS monitoring box for processing. 2. A plugin process this file, to find information on a chassis basis. 3. If a node flag state changes, a CRITICAL alert is issued. The down node is recorded into a status file for that chassis. 4. On the next pass. The plugin will now emit a warning state, unless another node flag changes state. The following is an example of a thermal event. Where a chiller had problems. This was caught, and operations/staff had a chance to correct the problem before the system reached an over temperature shutdown. All acknowledgements, recoveries, and warning states email notifications are supressed. Only CRITICAL state emails are sent for this plugin. Using this method allows the operations center to acknowledge the failed node in NAGIOS. Again, this gets back to 'at a glance' state information the Nagios tatical display will show non acknowledged node failures. An example from NAGIOS: 5.4 Monitoring using XTCLI a Cray XT4 system. This plugin was written to watch for down nodes, and then to log them to a database, generate an alert, and then also display the status in NAGIOS. Several problems had to be dealt with, starting with how do you monitor 9k+ nodes? The simplified answer is, you don't. The technique that is used by this plug and appears to work is monitor on a chassis basis. What this means is 5.5 Network Weathermap plugin of Cacti. One of the nagios/cacti systems collects data from several switches/routers in the network, and generates a network weather map. This weather map is useful to CUG 2009 Proceedings 4 of 5

figure out usage patterns ie, is someone/something transfer large amounts of data from the NERSC Global Filesystem (NGF) to the mass storage system? If so, the network weathermap can show this. The weathermap has also been used in the past to show poorly utilized or mis configured routes and links. Acknowledgments The Cray staff at NERSC, for being patient with some of the questions I have asked in trying to get performance and fault monitoring in place. The operations staff, for their adopting of the Cacti thermal page for Franklin before I even completed it, and their patience as I broke and fixed it on several occasions. About the Authors Thomas Davis works for NERSC, E Mail: tdavis@nersc.gov David Skinner works for NERSC, E Mail: dskinner@nersc.gov 5.6 Prism This image shows Prism in a completely acknowledged state. 7. Conclusion Nagios, Cacti, and Prism are 3 tools that are used everyday at NERSC. They provide valuble insight into system's status in a standardized way the reduces staff training. These tools have provided data into areas of NERSC that was assumed to be correctly configured, allowing a more proactive instead of reactive management style of systems to be used. CUG 2009 Proceedings 5 of 5