Monitoring Software Services registered with .canarie.ca

Revision 1.2 CANARIE 2014

Introduction

The software registry at .canarie.ca monitors each of the contributed services via the API defined in Research Service Support for the CANARIE Registry and Monitoring System. Service owners and interested users can view the results of this monitoring by visiting https://.canarie.ca and clicking on the name of the service of interest on the main page. This brings up a service-specific page that displays information about the service, including various reliability metrics.

This information is useful for understanding the long-term reliability of a service and for determining whether or not a service is currently available, but it is designed primarily for human browsing and not for monitoring by automation systems. To address the need for automated service monitoring, CANARIE has provided the following three capabilities, which are described in detail in this document:

- Email notification of service failure
- A web service interface that can be polled by an automation system to determine the current status of a service
- A plugin that allows a service to be monitored via a Nagios monitoring system

Audience

The audience for this application note is owners and users of services registered with .canarie.ca.

Email Notification

Currently, email notification is only available to owners of services registered with .canarie.ca. When a service is registered, CANARIE adds the email addresses of those responsible for maintaining the service to the notification system. Once this is done, you will get an email if either:

- The reliability of your service drops below 20% in any three-hour period. This email is sent out at the end of the three-hour period.
- The reliability of your service drops below 90% for any day. This email is sent out at the end of the day (UTC).
The service monitoring system on .canarie.ca polls each service every 15 minutes, according to the API defined in Research Service Support for the CANARIE Registry and Monitoring System. Reliability is calculated as the percentage of these polls that did not return an error. Possible errors include:

- HTTP failures when attempting to communicate with the target service
- Malformed or missing information in the response generated by the service
- A valid response that indicates the service is not currently available

Web Service Interface

In addition to the email notification system described above, CANARIE provides a web service interface that allows you to check on the status of a service. Note that this web service returns the last known status from the monitoring system's database and does not poll the target service in real time. Since the monitoring system polls each service at 15-minute intervals, the information returned by the web service interface will be at most 15 minutes old.

To use the web service interface, you must first know the unique numeric identifier of the service of interest. To obtain this value, use your browser to navigate to https://.canarie.ca and choose the service of interest from among those listed on the main page. When the details page for that service appears, note the service's unique identifier from the URL query parameter called serviceid, as displayed by your browser. In the image below, the service's unique identifier is 1.
To retrieve the status of this service via the web service interface, perform a GET on the URL

    https://.canarie.ca/researchmiddleware/rs/service/<id>/status

where <id> is the unique identifier for the service of interest (1, in the image above). The curl command to retrieve this URL would be:

    curl https://.canarie.ca/researchmiddleware/rs/service/1/status

A GET request to this URL returns a JSON packet, formatted as follows:

    {
      "status": "<current status of the service>",
      "lastupdate": "<last time service was checked>",
      "meta": {
        "pollinginterval": "<how often service is polled>"
      }
    }

where:

- The value of status will be one of:
    o OK: the last poll of the service returned no errors
    o ERROR: the last poll of the service resulted in at least one error
    o UNKNOWN: the service has not yet been polled. This will only happen for a newly registered service.
- lastupdate is the time in UTC (ISO 8601 format) at which the service was last polled. For example, 2014-01-17T17:31:29Z
- pollinginterval is a string indicating how often the service is polled. Currently this value is "Every 15 minutes"

Nagios Support

If you already have a Nagios monitoring system in place (www.nagios.org), you may want to consider using the Nagios plugin CANARIE has provided to monitor services via the web service interface described above. This plugin is available from https://github.com/canarie/support_software/tree/master/nagios/plugins/check_research_sw
This plugin is meant to be installed locally on your Nagios server. It is written in Python, so a Python environment is also required. To use this plugin:

1. Download the file check_research_sw.py and install it in the plugins directory on your Nagios server. For CentOS 6, this directory is /usr/lib64/nagios/plugins.

2. Rename this file to check_research_sw

3. Create a Nagios command based on this file in commands.cfg. For CentOS 6, the commands.cfg file is located in /etc/nagios/objects. The entry should look like:

    # 'check_research_sw' command definition
    define command{
            command_name    check_research_sw
            command_line    $USER1$/check_research_sw $ARG1$ $ARG2$
            }

4. Define a Nagios host and Nagios service for the check_research_sw command. Since this command is run locally on your Nagios server, you can use localhost as the host. In this case, you just need to add a service definition to the file localhost.cfg (found in /etc/nagios/objects on CentOS 6). For example,

    define service{
            use                     local-service
            host_name               localhost
            service_description     Research Service 1
            check_command           check_research_sw!service!12
            notifications_enabled   0
            }

   will allow Nagios to check the service on .canarie.ca with id 12.

5. Since the service being polled is not actually running on localhost, you may want to introduce a new Nagios host to keep it separate. To define a new Nagios host, ignore step 4 above and instead create a new .cfg file (in /etc/nagios/objects on CentOS 6) with the following contents:

    define host{
            use         linux-server
            host_name   <name of the new host>
            alias       <alias of the new host>
            }
    define service{
            use                     local-service
            host_name               <name of the new host>
            service_description     Research Service 12
            check_command           check_research_sw!service!12
            notifications_enabled   1
            }

6. Restart your Nagios server. With CentOS 6, the following command will do this:

    /etc/init.d/nagios restart
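If you would rather poll the web service interface from your own automation system than through Nagios, the JSON packet described under Web Service Interface is straightforward to interpret. The following is a minimal Python sketch of one way to do so. It is not the code of the CANARIE plugin. The function names status_url and interpret_status are hypothetical, the base URL must be supplied by the caller, and mapping ERROR to the standard Nagios CRITICAL exit code is an assumption made for illustration.

```python
import json

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


def status_url(base_url, service_id):
    """Build the status URL for a service (hypothetical helper).

    base_url is the registry address, supplied by the caller.
    """
    return "%s/researchmiddleware/rs/service/%d/status" % (
        base_url.rstrip("/"), service_id)


def interpret_status(body):
    """Map a JSON status packet to an (exit_code, message) pair.

    Assumption: ERROR is reported as CRITICAL; anything that cannot be
    parsed, or any status other than OK or ERROR, is reported as UNKNOWN.
    """
    try:
        packet = json.loads(body)
        status = packet["status"]
    except (ValueError, KeyError, TypeError):
        return NAGIOS_UNKNOWN, "UNKNOWN - could not parse status response"

    last = packet.get("lastupdate", "unknown time")
    if status == "OK":
        return NAGIOS_OK, "OK - last polled %s" % last
    if status == "ERROR":
        return NAGIOS_CRITICAL, "CRITICAL - last poll failed at %s" % last
    return NAGIOS_UNKNOWN, "UNKNOWN - service reported status %r" % status
```

The response body can be fetched with any HTTP client (for example, the curl command shown earlier) and passed to interpret_status as a string. Because the web service reports the last known status rather than polling in real time, checking more often than every 15 minutes is unlikely to yield new information.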