Jason Hill, HPC Operations Group, ORNL. Cray User's Group 2011, Fairbanks, AK, 05-25-2011




Determining health of Lustre filesystems at scale

Overview
- Overview of architectures
- Lustre health and importance
- Storage monitoring
- Server monitoring
- Lustre monitoring
- Log monitoring
- Headaches
- Where do we go from here?
- Conclusion

Spider Architecture
- 197 Lustre servers
  - 192 OSS, 4 MDS, 1 MGS
- 192 LNET routers on XT5
- DDR IB connects routers and Lustre servers
[Diagram labels: MGS, MDS[1-4], OSS[1-192], DDN 9900, DDR IB Network, LNET RTR, HSN (SeaStar); Analysis, Development, End-to-End Platform, Data Transfer]

Spider Architecture
- Scalable Units
  - 1 DDN 9900 couplet, 4 OSS nodes
- Scalable Clusters
  - 3 SUs
  - 16 Scalable Clusters in all
[Diagram labels: IB, DDN1, DDN2, OSS1, OSS2, OSS3, OSS4]

Spider Architecture
- Cross connected to single fed Ethernet switches
[Diagram labels: OLCF Ethernet Core, CoreE1 (H), CoreE2 (U), MgmtE1 (H), MgmtE2 (U), each management switch connected to DDN1, DDN2, OSS1, OSS2, OSS3, OSS4]

Monitoring Infrastructure
[Diagram labels: Spider LAN (Mgmt-1, Mgmt-2, Lustre Servers, DDN 9900s), OLCF Ethernet Core, 4 x 1GbE links, Widow LAN, Internal Services (Monitoring Server(s), Syslog Server(s))]

Monitoring Infrastructure
- Nagios Server
  - HP DL 360G6
  - Quad Socket Quad Core
  - 32 GB system memory
  - Bonded GbE (2 links)
  - Gigabit Ethernet backplane from Lustre VLAN to Int-Services VLAN
  - Network utilization is 45 Mbit, so the network isn't a bottleneck
- Nagios
  - Set up parent/child relationships between Lustre servers
  - Parent/child relationships for DDN controllers to Lustre OSSes

Monitoring Infrastructure
- SNMP monitoring
  - Most Nagios plugins/checks use snmpwalk
  - ORNL has registered OID space
  - We use custom OIDs to execute a script on the local machine
    - snmpwalk -v 2c -c mycommstring hostname OID
    - OID name /path/to/script in /etc/snmp/snmpd.local.conf
  - Additionally use an SNMP trap to set downtime in Nagios
    - Relies on parent/child relationships
    - Suppress notifications for known issues

snmpd.local.conf example

    exec 1.3.6.1.4.1.341.49.5.1.3 monitor_multipath /opt/bin/monitor_multipath.pl
    exec 1.3.6.1.4.1.341.49.5.1.4 monitor_ib_health /chexport/bin/monitor_ib_health.sh
    exec 1.3.6.1.4.1.341.49.5.1.17 lustre_health /opt/bin/lustre_healthy.sh
    exec 1.3.6.1.4.1.341.49.5.1.18 lnet_stats /opt/bin/lnet_stats.sh
    exec 1.3.6.1.4.1.341.49.5.1.23 lustre_dev /opt/bin/lustre_device_check.sh widow1
    exec 1.3.6.1.4.1.341.49.5.1.24 lustre_dev /opt/bin/lustre_device_check.sh widow2
    exec 1.3.6.1.4.1.341.49.5.1.25 lustre_dev /opt/bin/lustre_device_check.sh widow3
    exec 1.3.6.1.4.1.341.49.5.1.26 lustre_dev /opt/bin/lustre_device_check.sh widow
    exec 1.3.6.1.4.1.341.49.5.1.28 aacraid /chexport/bin/check-aacraid.py
    exec 1.3.6.1.4.1.341.49.5.1.35 aacraid /opt/bin/aacraid.sh -b
    exec 1.3.6.1.4.1.341.49.5.1.36 aacraid /opt/bin/aacraid.sh -c
    exec 1.3.6.1.4.1.341.49.5.1.37 aacraid /opt/bin/aacraid.sh -d
    exec 1.3.6.1.4.1.341.49.5.1.30 collectl /opt/bin/collectl.sh
    exec 1.3.6.1.4.1.341.49.5.1.34 osirisd /opt/bin/osirisd.sh
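The Nagios side of these SNMP-driven checks might look roughly like the wrapper below. This is a minimal sketch, not the plugin used at ORNL: the function names `classify_output` and `check_via_snmp` are hypothetical, and it assumes the remote script prints a line containing OK, WARNING, or CRITICAL, as the Lustre scripts shown later in this talk do.

```shell
#!/bin/bash
# Hypothetical helper: map check output text to a Nagios exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
classify_output() {
    local text="$1"
    case "$text" in
        *CRITICAL*) echo "$text"; return 2 ;;
        *WARNING*)  echo "$text"; return 1 ;;
        *OK*)       echo "$text"; return 0 ;;
        *)          echo "UNKNOWN: unexpected output: $text"; return 3 ;;
    esac
}

# Hypothetical wrapper: walk the custom OID and classify what comes back.
# -Oqv asks snmpwalk to print values only, without OID names or types.
check_via_snmp() {
    local host="$1" community="$2" oid="$3" out
    if ! out=$(snmpwalk -v 2c -c "$community" -Oqv "$host" "$oid" 2>&1); then
        echo "UNKNOWN: snmpwalk failed: $out"
        return 3
    fi
    classify_output "$out"
}
```

A call such as `check_via_snmp oss1 mycommstring 1.3.6.1.4.1.341.49.5.1.17` would then report the state of the lustre_health entry; the hostname `oss1` is only an illustration.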

Lustre Health
- What is it?
  - Can users access their files?
  - Is it performance related?
  - All Lustre services/devices are online
    - What about failover conditions?
- We use the "online" thought process for our current monitoring
  - Also looking at whether users can access their files
  - Also looking at whether storage resources are online

Storage Monitoring
- DDN 9900s don't have much monitoring capability
  - Ping the IP of the controller
- DDN has a GUI for looking at controllers; it doesn't work well for our setup
  - Serializes connections to DDNs before displaying
  - 96 couplets takes 20-30 minutes to scan
- More about this in log monitoring

Server Monitoring
- Need to establish that the hardware can communicate
  - Ping the OSS
  - sshd alive
  - OSS is alive on the IB network
- Need to establish that the server is running correct things
  - OpenSM to backend storage
  - Configuration monitoring (osiris)
  - Statistic gathering (collectl)
- Need to establish the hardware has no problems
  - Dual PSU connected to House and UPS
  - Input Voltage within range
  - Fans/Temps all within range
  - Plugin from Nagios Exchange for Dell OMSA

Server Monitoring (2)
- Multipathd checker
  - Use IB SRP to connect to DDN 9900 via two paths
  - Look at output of: echo "show multipaths topology" | multipathd -k
  - Making sure that there are 2 good paths
  - Looking for "active" and "ready" paths
[Diagram labels: IB, DDN1, DDN2, OSS1, OSS2, OSS3, OSS4]
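The path count described above could be checked along these lines. This is a sketch under assumptions rather than the actual monitor_multipath.pl: the function names are hypothetical, and the check counts active/ready paths across the whole topology output, whereas a production check would probably count per LUN.

```shell
#!/bin/bash
# Hypothetical helper: count path lines reported as both "active" and "ready"
# in multipathd topology output read from stdin.
count_good_paths() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c -E 'active.*ready' || true
}

# Hypothetical check: expect at least two good paths by default.
check_multipath() {
    local topology="$1" expected="${2:-2}" good
    good=$(printf '%s\n' "$topology" | count_good_paths)
    if [ "$good" -ge "$expected" ]; then
        echo "OK: $good active/ready paths"
        return 0
    fi
    echo "CRITICAL: only $good active/ready paths (expected $expected)"
    return 2
}
```

On a server it might be invoked as `check_multipath "$(echo 'show multipaths topology' | multipathd -k)"`.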

Server Monitoring (3)
- IB health monitor
  - Sources configuration file that gives interface, speed
  - Verifies that state is active, speed correct, interface is up
  - If not correct, return -2 and print text

    # cat monitor_ib_health_oss.conf
    mthca0 1 20
    mthca0 2 20
    mthca1 1 20
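A check of this shape can be sketched against the InfiniBand sysfs tree. This is an illustration, not the monitor_ib_health.sh from the talk: the function name and the `SYSFS_IB` override are hypothetical, each config line is assumed to mean "device port speed-in-Gb/s", and the sketch returns the Nagios CRITICAL code 2 instead of the -2 mentioned on the slide.

```shell
#!/bin/bash
# Hypothetical sketch: verify each "device port speed" line of a config file
# against the InfiniBand sysfs tree. SYSFS_IB is overridable for testing.
SYSFS_IB="${SYSFS_IB:-/sys/class/infiniband}"

check_ib_health() {
    local conf="$1" dev port speed dir state rate problems=""
    while read -r dev port speed; do
        [ -z "$dev" ] && continue
        dir="$SYSFS_IB/$dev/ports/$port"
        state=$(cat "$dir/state" 2>/dev/null || true)   # e.g. "4: ACTIVE"
        rate=$(cat "$dir/rate" 2>/dev/null || true)     # e.g. "20 Gb/sec (4X DDR)"
        case "$state" in
            *ACTIVE*) ;;
            *) problems="$problems $dev/$port:state=${state:-missing}" ;;
        esac
        # Compare only the leading number of the rate string with the config.
        [ "${rate%% *}" = "$speed" ] || problems="$problems $dev/$port:rate=${rate:-missing}"
    done < "$conf"
    if [ -n "$problems" ]; then
        echo "CRITICAL:$problems"
        return 2
    fi
    echo "OK: all configured IB ports active at expected speed"
    return 0
}
```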

Lustre Monitoring
- Do we have all of our devices mounted?
  - Source the configuration file for the filesystem and compare with mounted devices
  - OSTDEV[0]="HOSTNAME:/dev/mpath/LUNNAME"
  - Mounted at /tmp/lustre/fsname/ostdev
- What is the status of /proc/fs/lustre/health_check?

    # A missing health_check file is reported as critical.
    if [ ! -e /proc/fs/lustre/health_check ]; then
        exit 2
    fi

    # Anything other than "healthy" is critical.
    if [[ $(cat /proc/fs/lustre/health_check) != "healthy" ]]; then
        echo "CRITICAL: Lustre is unhealthy. Call Lustre OnCall Admin"
        exit 2
    else
        echo "OK: Lustre is healthy"
        exit 0
    fi
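The device comparison on this slide (configured OST devices versus mounted devices) could be sketched as follows. This is not the lustre_device_check.sh from the talk: the function name and the `MOUNTS` override are hypothetical, and it assumes the mount table lists each OST under the same /dev/mpath path that the configuration file uses.

```shell
#!/bin/bash
# Hypothetical sketch: report configured OST devices for one host that are
# not mounted as type lustre. MOUNTS is overridable for testing.
MOUNTS="${MOUNTS:-/proc/mounts}"

check_lustre_devices() {
    local conf="$1" host="$2" entry dev missing=""
    local -a OSTDEV=()
    # The config file is assumed to set OSTDEV[n]="HOSTNAME:/dev/mpath/LUNNAME".
    . "$conf"
    for entry in "${OSTDEV[@]}"; do
        # Skip devices that belong to other servers.
        [ "${entry%%:*}" = "$host" ] || continue
        dev="${entry#*:}"
        # /proc/mounts lists the device in column 1 and the fs type in column 3.
        awk -v d="$dev" '$1 == d && $3 == "lustre" { found = 1 } END { exit !found }' "$MOUNTS" \
            || missing="$missing $dev"
    done
    if [ -n "$missing" ]; then
        echo "CRITICAL: not mounted:$missing"
        return 2
    fi
    echo "OK: all configured devices for $host are mounted"
    return 0
}
```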

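Lustre Monitoring (2)
- Is LNET performing okay?

    # Without the LNET stats file there is nothing to check.
    if [ ! -e /proc/sys/lnet/stats ]; then
        exit 2
    fi

    # Previous sample; treated as 0 on the first run, before the state file exists.
    last_stat=$(cat /tmp/last_lnet_stat 2>/dev/null || echo 0)
    # Current sample: first field of the LNET stats line.
    curr_stat=$(/bin/awk '{print $1}' /proc/sys/lnet/stats)

    # Now we set the last stat for the next run
    echo "$curr_stat" > /tmp/last_lnet_stat

    if [[ $curr_stat -lt 30000 && $last_stat -lt 30000 ]]; then
        echo "OK: Curr: $curr_stat Last: $last_stat"
        exit 0
    elif [[ $curr_stat -ge 30000 && $last_stat -lt 30000 ]]; then
        # A high current sample alone is still OK; the next run reports
        # CRITICAL because last_stat will then be at or above the threshold.
        echo "OK: Curr: $curr_stat Last: $last_stat"
        exit 0
    else
        echo "CRITICAL: Curr: $curr_stat Last: $last_stat"
        exit 2
    fi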

Lustre Monitoring (3)
- Client side monitoring
  - ls timer
    - If greater than 30s, send mail
    - From XT5 and external sources
  - df
    - Manifested by Nagios built-in filesystem checks
  - lfs osts
    - Can look for inactive OSCs
  - lfs check servers
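The ls timer could be sketched as below. This is an illustration under assumptions, not the script used at OLCF: the function name is hypothetical, the 30-second limit comes from the slide, and a non-zero exit from ls (including a timeout) is treated as a failure.

```shell
#!/bin/bash
# Hypothetical sketch: time a directory listing and flag slow responses.
check_ls_time() {
    local dir="$1" limit="${2:-30}" start end elapsed rc=0
    start=$(date +%s)
    # timeout(1) tries to stop the listing if it runs far past the limit.
    timeout "$((limit * 2))" ls "$dir" > /dev/null 2>&1 || rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$rc" -ne 0 ] || [ "$elapsed" -gt "$limit" ]; then
        echo "CRITICAL: ls $dir took ${elapsed}s (rc=$rc, limit ${limit}s)"
        return 2
    fi
    echo "OK: ls $dir took ${elapsed}s"
    return 0
}
```

To send mail on failure, the result could be piped to a mailer, for example `out=$(check_ls_time /lustre/widow1 30) || echo "$out" | mail -s "Lustre ls slow" admin@example.com`, where both the mount point and the address are placeholders.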

Log Monitoring
- Simple Event Correlator
  - DDN syslog message parsing
  - Lustre syslog message parsing
- Rationalized printk
  - John Hammond et al. from TACC
  - Formatting messages to programmatically parse
- Splunk
  - Use for trending and reporting as we move forward

Headaches
- Snmpwalk times out with 7000-10000 entries in process table
  - Use snmp bulk get and a starting OID to help things
  - Helps with process checks on MDSes where 4000-8000 mdt processes exist
- Current monitoring load is large as a percentage of all OLCF monitoring
  - 30% of hosts in Nagios are Lustre servers (323)
    - Not counting LNET router monitoring
  - 40% of services in Nagios are Lustre services (2441)
  - Static load average is ~10
  - Every new service we monitor adds (min) 210 checks to Nagios
- "Aspirin": Delegate Lustre checks to another Nagios server?

Where do we go from here?
- Several things we want to do
  - LMT
  - Custom DDN 9900 monitoring
    - Verify configuration is as expected, alert if not
  - Parsing server side client stats
    - Not simple with 18k clients
  - Make current monitoring failover aware
    - Could require on-the-fly Nagios configuration
  - Implementing rationalized printk

Conclusion
- Pace of current monitoring won't stand up to next generation
  - As system size grows, problems arise in monitoring
  - Can changes be made to SNMP based monitoring?
  - Explore options that are currently available (NRPE, delegation)
- Monitoring health is critical to delivering a stable storage platform for compute, analysis, visualization, and data transfer to other sites

Questions?

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.

Notice: This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.