Supermicro Server Monitoring with SuperDoctor 5 and Nagios Using SNMP Protocol Version 1.1b
Release: v 1.1b
Document release date: 11/15/2013

Copyright 2013 Super Micro Computer, Inc. All Rights Reserved.

Legal Notices
This software and documentation is the property of Super Micro Computer, Inc., and supplied only under a license. Any use or reproduction of this software is not allowed, except as expressly permitted by the terms of said license. Information in this document is subject to change without notice.

Trademark Notice
All trademarks and copyrights referred to are the property of their respective owners.
Revision History

Date          Rev    Description
Jul-4-2011    1.0    1. Initial Document.
Sep-20-2012   1.1    1. Reorganize MIB structures.
Jul-12-2013   1.1a   1. Change product name to SuperDoctor 5 (SD5).
Nov-15-2013   1.1b   1. Changed default install folder of SD5.
Contents

1. Introduction
2. Prerequisites
   2.1 Installing Java Runtime Environment (JRE)
   2.2 Installing the check_snmp_health Plug-in
   2.3 SuperDoctor 5 (SD5)
   2.4 Setting Up SNMP Service in Linux
   2.5 Installing Smartctl Utility
3. Getting Started
   3.1 Defining the Hosts
   3.2 Defining a Command
   3.3 Defining the Services
   3.4 Validating the Nagios Configurations
   3.5 Restarting Nagios Service
   3.6 Connecting to the Nagios Web UI
4. Using check_snmp_health
   4.1 -h or --help
   4.2 -bc
   4.3 -cn
   4.4 -co
   4.5 -d
   4.6 -i
   4.7 -t
   4.8 -to
5. Appendix
   5.1 SD5 FAQ
   5.2 How to Reset Memory Error Status?
   5.3 Can I Disable the SD5 Web?
   5.4 Can I Disable the NRPE Protocol?
   5.5 No Health Information from SNMP Was Fetched
Contacting Supermicro
1. Introduction

This Nagios plug-in, named check_snmp_health, uses SNMP to talk to SuperDoctor 5 and check the health of the following hardware components:

- Fan
- Processor temperature
- System temperature
- DDR3 temperature
- Power supply failure
- Voltage
- Chassis intrusion
- Physical disk failure
- Memory failure (Linux platform only) [1]
- Processor failure (Linux platform only) [2]
- RAID health (LSI MegaRAID 2108 and 2208 controllers only)

The results of executing the check_snmp_health plug-in are shown on the Nagios Web UI.

[1] The memory health check includes CECC and UECC. Both kinds must be supported by the BIOS, and this function is currently only available on Linux platforms.
[2] The processor failure check must be supported by the BIOS and is currently only available on Linux platforms.
2. Prerequisites

2.1 Installing Java Runtime Environment (JRE)
The check_snmp_health plug-in is written in Java. To run the plug-in, install JRE 1.6 or above on your Nagios server and set the JAVA_HOME environment variable to the JRE installation path.

2.2 Installing the check_snmp_health Plug-in
1. Unzip the package file SSMServerPlugin-1.0-build.[xyz].zip to the /usr/local/nagios/libexec/ssmserverplugin folder, assuming Nagios is installed in /usr/local/nagios.
2. Use the command chmod +x check_snmp_health.sh to make the check_snmp_health plug-in executable.
3. Execute the check_snmp_health.sh program without providing any argument. If the JRE and the plug-in are installed correctly, the following error message appears: Invalid options. Three options must be provided for -i (--ip).

2.3 SuperDoctor 5 (SD5)
The check_snmp_health plug-in is designed to work with SuperDoctor 5, which implements an SNMP extension to support Supermicro MIBs (see 5.3 Supermicro MIB in SuperDoctor 5 User's Guide for details). To install SuperDoctor 5, refer to Chapter 2 Setting Up SD5 in SuperDoctor 5 User's Guide. To install multiple SD5s quickly, see 2.1.4 Tips for Deploying a Large Number of SD5s in SuperDoctor 5 User's Guide.

2.4 Setting Up SNMP Service in Linux
To support SNMP, the NET-SNMP service must be installed and configured on your Linux system. For the installation and configuration of the NET-SNMP service, refer to 5.2 Setup SNMP Service in Linux in SuperDoctor 5 User's Guide.
2.5 Installing Smartctl Utility
SD5 uses an open source program named smartctl to check the health of physical disks. To enable this function, you need to install the smartctl program manually. Download the program from:

http://sourceforge.net/apps/trac/smartmontools/wiki/download

Many Linux distributions provide pre-compiled packages to simplify the installation of smartctl. For example, on CentOS 5.x, you can install smartctl with the yum command (smartctl is part of the smartmontools package).
3. Getting Started

3.1 Defining the Hosts
Define a host for each of the SD5s. Suppose that the host is defined in the host1.cfg file.

    define host {
        host_name    10.134.12.36
        alias        10.134.12.36
        address      10.134.12.36
        use          linux-server
    }

Edit the $NAGIOS_HOME$/etc/nagios.cfg file to include host1.cfg.

3.2 Defining a Command
Define a command for check_snmp_health. Suppose that the command is defined in the commands.cfg file.

    define command {
        command_name    check_snmp_health_all
        command_line    /usr/local/nagios/libexec/ssmserverplugin/check_snmp_health.sh -i $HOSTADDRESS$ -t $ARG1$
    }
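To make the role of the two macros in the command definition concrete, the following Python sketch mimics how Nagios fills in $HOSTADDRESS$ and $ARG1$ when a service invokes the command as check_snmp_health_all!a (the service syntax is covered in 3.3). It is an illustration only, not part of the plug-in or of Nagios; the function name expand_command_line is hypothetical, and only the two macro kinds used in this document are handled.

```python
def expand_command_line(command_line: str, host_address: str, check_command: str) -> str:
    """Expand $HOSTADDRESS$ and $ARGn$ macros for one check (illustrative only)."""
    # A check_command has the form name!arg1!arg2...; the name is not an argument.
    _name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    # $ARG1$ receives the first value after the command name, $ARG2$ the second, and so on.
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded


# The command_line from the command definition in this section.
COMMAND_LINE = (
    "/usr/local/nagios/libexec/ssmserverplugin/check_snmp_health.sh "
    "-i $HOSTADDRESS$ -t $ARG1$"
)

if __name__ == "__main__":
    print(expand_command_line(COMMAND_LINE, "10.134.12.36", "check_snmp_health_all!a"))
```

With the host and service used in this chapter, the expanded command ends in check_snmp_health.sh -i 10.134.12.36 -t a, which is the command Nagios would run for that service.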
3.3 Defining the Services
Define the service to be checked by Nagios. Suppose that the service is defined in the host1.cfg file.

    define service {
        use                    local-service
        service_description    check_snmp_health
        host_name              10.134.12.36
        check_command          check_snmp_health_all!a
    }

You can also define a service to check a particular type of monitored item, e.g., fan, disk, or memory, by specifying the type argument:

    a: all (checks all health items)
    w: power
    f: fan
    c: current
    d: disk
    m: memory
    t: temperature
    v: voltage
    s: switch
    p: processor
    r: raid

For example, the service used to check fan status is shown below:

    define service {
        use                    local-service
        host_name              10.134.12.36
        service_description    check fan status
        check_command          check_snmp_health_all!f
    }

You can also check multiple types of monitored items by enumerating each of the types. A service used to check disk, fan, and voltage status is shown below.
    define service {
        use                    local-service
        host_name              10.134.12.36
        service_description    check disk, fan, and voltage status
        check_command          check_snmp_health_all!dfv
    }

3.4 Validating the Nagios Configurations
3.5 Restarting Nagios Service

3.6 Connecting to the Nagios Web UI
The results are shown on the Nagios Web UI.
4. Using check_snmp_health

4.1 -h or --help
The -h or --help option shows the help menu.

4.2 -bc
Use the -bc option to specify user-defined thresholds for memory and processor checking. The argument format is [type][duration][fail count], with multiple thresholds separated by commas.

[type]:
    m: correctable single bit ECC errors.
    M: uncorrectable ECC errors.
    p: processor failures.

[duration]: A number followed by one of the following units:
    d: day
    h: hour
    m: minute
    s: second

[fail count]: The acceptable number of failures. To trigger a critical status, the failure count must be greater than this value.
Example: To allow up to four single bit ECC errors per 1 GB of RAM within one day (24 hours) (i.e., m1d4), and no uncorrectable ECC errors within one hour (i.e., M1h0):

    -bc m1d4,M1h0

Note: To reset the memory error status, refer to 5.2 How to Reset Memory Error Status? for more information.

Example: To allow up to two correctable processor failures within 30 days (i.e., p30d2), and no uncorrectable processor failures within one hour (i.e., P1h0):

    -bc p30d2,P1h0

4.3 -cn
Use the -cn option to specify user-defined thresholds for checking the number of processors, memory modules, and hard disks. The argument format is [type][number], with multiple entries separated by commas.

[type]:
    p: processor.
    m: memory.
    d: hard disk drives.

[number]: The expected number of processors, memory modules, or hard disks. To trigger an OK status, the assigned number must be equal to the number of processors, memory modules, or hard disks installed on the system under monitoring.

Example: The following arguments indicate that the system under monitoring has one processor, four memory DIMMs, and one hard disk drive.

    -cn p1,m4,d1

4.4 -co
Use the -co option to specify an SNMP community string.
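Returning to the -bc and -cn argument formats of sections 4.2 and 4.3, the following Python sketch shows one way to read them. It is not the plug-in's implementation: the names parse_bc, parse_cn, and is_critical are hypothetical, the upper-case P type is accepted only because the processor example uses P1h0, and the "per 1 GB of RAM" scaling mentioned for single bit ECC errors is deliberately ignored.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple

# Duration unit letters from section 4.2, mapped to timedelta keyword names.
UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
# One -bc item: [type][duration][fail count], e.g. m1d4 or M1h0.
BC_ITEM = re.compile(r"^([mMpP])(\d+)([dhms])(\d+)$")
# One -cn item: [type][number], e.g. p1 or m4.
CN_ITEM = re.compile(r"^([pmd])(\d+)$")


class Threshold(NamedTuple):
    kind: str            # m, M, p, or P
    window: timedelta    # the check time period
    fail_count: int      # acceptable number of failures in the window


def parse_bc(argument: str) -> List[Threshold]:
    """Parse a comma-separated -bc argument such as 'm1d4,M1h0'."""
    thresholds = []
    for item in argument.split(","):
        match = BC_ITEM.match(item.strip())
        if match is None:
            raise ValueError(f"invalid -bc threshold: {item!r}")
        kind, amount, unit, fail_count = match.groups()
        window = timedelta(**{UNITS[unit]: int(amount)})
        thresholds.append(Threshold(kind, window, int(fail_count)))
    return thresholds


def is_critical(threshold: Threshold, error_times: List[datetime], now: datetime) -> bool:
    """Critical when the failures inside the window exceed the fail count."""
    recent = [t for t in error_times if now - threshold.window <= t <= now]
    return len(recent) > threshold.fail_count


def parse_cn(argument: str) -> Dict[str, int]:
    """Parse a comma-separated -cn argument such as 'p1,m4,d1'."""
    counts = {}
    for item in argument.split(","):
        match = CN_ITEM.match(item.strip())
        if match is None:
            raise ValueError(f"invalid -cn entry: {item!r}")
        counts[match.group(1)] = int(match.group(2))
    return counts


if __name__ == "__main__":
    now = datetime(2013, 11, 15, 12, 0)
    correctable, uncorrectable = parse_bc("m1d4,M1h0")
    # One uncorrectable error 30 minutes ago exceeds the allowed count of 0.
    print(is_critical(uncorrectable, [now - timedelta(minutes=30)], now))  # True
    print(parse_cn("p1,m4,d1"))  # {'p': 1, 'm': 4, 'd': 1}
```

In this reading, an error keeps counting until it falls out of the window, which is consistent with the behavior described in 5.2, where a critical memory status persists for the whole check period.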
4.5 -d
Use the -d option to show detailed information about the monitoring logic. This option is for debugging purposes only and should not be used in Nagios.

4.6 -i
Use the -i option to specify the host name or IP address to be checked.

4.7 -t
Use the -t option to specify the type of monitored items to be checked. The default value is all. Use the "-t r" option to check the health status of a RAID controller, including the states of its components such as battery backup units, virtual drives, and hard disks.
For example, when one virtual drive and one hard disk are alerted, the health status of the RAID controller is reported as critical. The RAID controller is also reported as critical when the BBU is absent.

4.8 -to
Use the -to option to specify the SNMP timeout value. The default value is 15 seconds. You may need to increase the timeout value if the check_snmp_health plug-in cannot retrieve all MIBs. For example, if a host to be checked has several hard disks, checking its physical disk failures may take longer than 15 seconds and cause the check_snmp_health plug-in to time out. To avoid such a situation, specify a larger timeout value by using the -to option.
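To tie the options of this chapter together, the following Python sketch assembles a check_snmp_health.sh argument list from a host, a type string, an optional community string, and an optional timeout. It is a sketch under stated assumptions rather than a description of the plug-in: the names decode_types and build_argv are hypothetical, the -co and -to options are assumed to take a single leading dash like -i and -t, and the timeout is assumed to be given in seconds.

```python
from typing import List, Optional

# Type letters listed in section 3.3.
TYPE_NAMES = {
    "a": "all", "w": "power", "f": "fan", "c": "current", "d": "disk",
    "m": "memory", "t": "temperature", "v": "voltage", "s": "switch",
    "p": "processor", "r": "raid",
}

# Install location used throughout this document.
DEFAULT_SCRIPT = "/usr/local/nagios/libexec/ssmserverplugin/check_snmp_health.sh"


def decode_types(type_argument: str) -> List[str]:
    """Translate a type string such as 'dfv' into readable item names."""
    unknown = [letter for letter in type_argument if letter not in TYPE_NAMES]
    if unknown:
        raise ValueError(f"unknown type letter(s): {''.join(unknown)}")
    return [TYPE_NAMES[letter] for letter in type_argument]


def build_argv(
    host: str,
    types: str = "a",
    community: Optional[str] = None,
    timeout_seconds: Optional[int] = None,
    script: str = DEFAULT_SCRIPT,
) -> List[str]:
    """Build an argument list for the plug-in (option spellings are assumptions)."""
    decode_types(types)  # reject unknown letters before running anything
    argv = [script, "-i", host, "-t", types]
    if community is not None:
        argv += ["-co", community]  # assumed spelling of the co option
    if timeout_seconds is not None:
        argv += ["-to", str(timeout_seconds)]  # assumed spelling and unit
    return argv


if __name__ == "__main__":
    print(decode_types("dfv"))
    # A host with many disks may need more than the default 15 seconds.
    print(build_argv("10.134.12.36", "d", timeout_seconds=60))
```

Building the arguments as a list rather than a single string avoids shell quoting problems if the list is later passed to a process launcher.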
5. Appendix

5.1 SD5 FAQ
Q: I see some error messages in the [SD5 install folder]/wrapper.log file. Do you have a list of all error messages and solutions?
A: Here are the known error messages.

No.: A0001
Message: HealthInfo initialization error. com.supermicro.ssm.tmhealth.model.MotherboardModelNotExistException: Health information is only available on Supermicro servers.
Root Cause: The SD5 is run on a non-Supermicro server.
Solution: Install SD5 on Supermicro servers.

No.: A0002
Message: Unable to start JVM: No such file or directory
Root Cause: The SD5 cannot find the required Java Virtual Machine (JVM) located in the [SD5 install folder]/jre folder.
Solution: Reinstall the SD5.

5.2 How to Reset Memory Error Status?
Q: An uncorrectable ECC error has been raised on a server and I have manually replaced the problematic memory module. However, the check_snmp_health plug-in still shows a critical status.
A: The check logic for memory errors is based on two conditions:
1. There are memory error logs in the BIOS event log.
2. The log's generated time is within the check time period.

For example, suppose that you use the -bc M1d0 option (i.e., any uncorrectable ECC error occurring within one day causes a critical state) to check memory errors. Once an uncorrectable ECC error has been found, the status remains critical for one day, even after the problematic memory is manually replaced. To get an OK status immediately after manually repairing the memory, you need to follow these steps:
1. Clear the BIOS event logs from the BIOS setup menu.
2. Delete the file [SD5 install folder]/config/bioslogs.txt.

5.3 Can I Disable the SD5 Web?
Q: I only use the check_snmp_health plug-in to check the health of a host and do not use a browser to view the sensor readings via the SD5 Web. Can I disable it?
A: Yes, the SD5 Web can be disabled during installation. At the Setup SuperDoctor 5 Web step, select 2- No to disable the SD5 Web.

You can also manually disable the SD5 Web after installation. Use a text editor to open the [SD5 install folder]/plugins/builtin/web/plugin.cfg file.
Change the enabled attribute from 1 to 0 and save the document. Exit the text editor and restart the SD5 to apply the setting. If the SD5 Web is disabled, the TCP ports 8181 and 8444 are not used.

5.4 Can I Disable the NRPE Protocol?
Q: I only use the check_snmp_health plug-in to check the health of a host and do not use the NRPE protocol to talk with the SD5. Can I disable the support of the NRPE protocol?
A: The SD5 supports three NRPE connection modes:

    Mode A: Plain text with allowed IP (port 5333)
    Mode B: Anonymous SSL connection with allowed IP (port 5666)
    Mode C: SSL encryption with a public key infrastructure (port 5999)

Because the NRPE protocol is the default connection protocol provided by the SD5, it cannot be completely turned off. At least one connection mode must be specified. For modifying the connection mode settings, refer to 3.2 SuperDoctor 5 Connection Modes in SuperDoctor 5 User's Guide.

5.5 No Health Information from SNMP Was Fetched
Q: I execute the command check_snmp_health.sh -i [host_ip] and the result shows No health information from SNMP was fetched. What is the problem?
A: Usually this message indicates that the host to be checked does not support Supermicro MIB. Possible reasons include:
- The operating system's built-in SNMP service (i.e., Net-SNMP) is not started.
- The SD5 is not started.
- The SD5 SNMP extension is not correctly installed.
- The SNMP port is blocked by a firewall.
- The default timeout value is not long enough for a health check.
Contacting Supermicro

Headquarters
Address: Super Micro Computer, Inc.
         980 Rock Ave.
         San Jose, CA 95131 U.S.A.
Tel: +1 (408) 503-8000
Fax: +1 (408) 503-8008
Email: marketing@supermicro.com (General Information)
       support@supermicro.com (Technical Support)
Web Site: www.supermicro.com

Europe
Address: Super Micro Computer B.V.
         Het Sterrenbeeld 28, 5215 ML
         's-Hertogenbosch, The Netherlands
Tel: +31 (0) 73-6400390
Fax: +31 (0) 73-6416525
Email: sales@supermicro.nl (General Information)
       support@supermicro.nl (Technical Support)
       rma@supermicro.nl (Customer Support)

Asia-Pacific
Address: Super Micro Computer, Inc.
         3F, No. 150, Jian 1st Rd.
         Zhonghe Dist., New Taipei City 23511
         Taiwan (R.O.C)
Tel: +886-(2) 8226-3990
Fax: +886-(2) 8226-3992
Web Site: www.supermicro.com.tw
Technical Support:
    Email: support@supermicro.com.tw
    Tel: +886-(2)-8226-3990