Cray Data Management Platform: Cray Lustre File System Monitoring esfsmon




Cray Data Management Platform: Cray Lustre File System Monitoring esfsmon
Jeff Keopp, Cray Inc.
Harold Longley, Cray Inc.

Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. 2

Overview: External Lustre Filesystems
Cray external Lustre filesystems:
- Lustre File System by Cray (CLFS, previously esfs)
- Cray Sonexion
Management and monitoring:
- CLFS: Cray Integrated Management System (CIMS, previously esms)
  - lustre_control - controls the Lustre filesystem
  - esfsmon - monitors and performs automated Lustre failover
  - Lustre Monitoring Tool - gathers data from /proc/fs/lustre
- Cray Sonexion:
  - Cray Sonexion System Manager (CSSM)
  - Unified System Management firmware (USM)
This discussion focuses on monitoring and automated failover of CLFS.

CLFS Overview
[Architecture diagram: a primary and a secondary CIMS, administered via CMGUI/CMSH from a remote server or the CIMS itself; site-admin-net, esmaint-net, ipmi-net, and failover-net networks; a managed Ethernet switch for management/provisioning and IPMI; CLFS MDS and OSS nodes attached over FC8, SAS, or IB to block storage controllers (A & B) holding the MDT and OSTs; an IB switch carrying LNet (ib-net) to the LNet routers on the Cray system; a connection to the site LDAP server.]
CLFS: Lustre File System by Cray
- MDS and OSS nodes (in failover pairs)
- MDS active/passive (active/active in DNE), OSS active/active
- Block storage
Software:
- CentOS
- Cray ESF software with Lustre server built from CLE sources

Cray Lustre Filesystems - Lustre Control
Controlling Lustre - lustre_control
Performs the following filesystem operations:
- Format/Reformat
- Writeconf
- Start/Stop/Tune
- Status
- Manual Failover/Failback
Executed from the CIMS; components reside on both the CIMS and the CLFS nodes.

Cray Lustre Filesystems - Lustre Control
Controlling Lustre - lustre_control
Filesystems are defined in fs_defs files, which are installed into lustre_control and not used directly in operation:
  # lustre_control install fsname.fs_defs
  Example definitions file: /opt/cray/esms/cray-lustre-control-xx/default/etc/example.fs_defs
Filesystem tuning is defined in fs_tune files, which are used directly in operation:
  # lustre_control set_tune -f fsname <path_to_fs_tune_file>
  Example tuning file: /opt/cray/esms/cray-lustre-control-xx/default/etc/example.fs_tune
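
For context, a typical administrative sequence from the CIMS might look like the following. This is only a sketch: the -f filesystem selector mirrors the set_tune example above, but the exact options for each subcommand should be confirmed against the lustre_control help or man page for the installed release.
  # lustre_control start -f scratch
  # lustre_control status -f scratch
  # lustre_control stop -f scratch
Note that esfsmon automatically enters SUSPENDED mode for a filesystem while lustre_control is stopping it (see the operational modes below), so monitoring does not have to be disabled by hand around a planned stop.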

Automated Lustre Failover Monitor - esfsmon
Implemented as a custom health check:
- /cm/local/apps/cmd/scripts/healthchecks/esfsmon_healthcheck
- Executed on the CIMS
- Monitors filesystem nodes via the management daemon on each node
- Monitors multiple CLFS filesystems
- Supports Lustre DNE (Distributed NamespacE) and non-DNE CLFS
Health check failure triggers failover of Lustre assets:
- /cm/local/apps/cmd/scripts/actions/esfsmon_action
- Calls lustre_control to perform the failover of Lustre assets
- Supports multiple failovers (the previous version went into RUNSAFE mode after the first failover)

Automated Lustre Failover Monitor - esfsmon: Operational Modes
NORMAL - monitors and will execute automated failover
- Failures are logged to /var/log/messages
- Failover information is logged to /tmp/esfsmon/fsname.fo_log
RUNSAFE - monitors but will not execute automated failover
- Failure incidents are logged to /var/log/messages
- Set by the existence of /var/esfsmon/esfsmon_runsafe_fsname
SUSPENDED - not monitored
- Set by the existence of /var/esfsmon/esfsmon_suspend_fsname
NOTE: esfsmon sets SUSPENDED mode itself during failover and failback operations, or when lustre_control is stopping the filesystem.
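
Because the modes are driven entirely by the presence of these flag files, an administrator can switch a filesystem between modes from the CIMS with ordinary shell commands. A minimal sketch for the filesystem scratch (see the DMP Administrator's Guide, S-2327-C, for the supported procedure):
  # touch /var/esfsmon/esfsmon_runsafe_scratch     (scratch is now in RUNSAFE mode)
  # rm /var/esfsmon/esfsmon_runsafe_scratch        (scratch returns to NORMAL mode)
  # touch /var/esfsmon/esfsmon_suspend_scratch     (monitoring of scratch is suspended)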

Automated Lustre Failover Monitor - esfsmon: Operation
Checks CLFS nodes every 2 minutes (configurable).
CLFS nodes are grouped by categories per filesystem; checks are performed by category, excluding esfs-failed-fsname (a cmsh sketch of inspecting these categories follows below):
- esfs-even-fsname - even-numbered nodes in filesystem fsname
- esfs-odd-fsname - odd-numbered nodes in filesystem fsname
- esfs-failed-fsname - failed nodes in filesystem fsname
Checks are performed on a category of nodes in parallel:
- Power Status
- Node Status
- TCP Ping
- LNet Ping
- Lustre Mounts
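
The categories above are ordinary Bright Cluster Manager node categories, so they can be inspected from cmsh on the CIMS. A minimal sketch with a hypothetical node name; get and set on the category parameter are standard cmsh device-mode operations:
  esms1:~ # cmsh
  [esms1]% device use esf-oss002
  [esms1->device[esf-oss002]]% get category
  esfs-even-scratch
Nodes are moved into esfs-failed-fsname automatically by esfsmon during a failover and moved back by esfsmon_failback, so the category normally never needs to be changed by hand.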

Automated Lustre Failover Monitor - esfsmon: Failure Determination
- Power Status: a failure triggers a retry to avoid acting on a transient failed IPMI response
- Node Status: a node DOWN status reported by the management daemon must also fail a TCP ping before the node is declared dead
- TCP Ping: a failure triggers a retry to avoid acting on a transient ping failure
- LNet Ping: ibstat is checked - no active IB interfaces triggers failure; the ability to lctl ping at least one other node must succeed; lctl ping will retry for up to 90 seconds (this check can be reproduced by hand, as sketched below)
- Lustre Mounts: a missing mount triggers failover
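
The LNet check can be approximated by hand on a CLFS node when diagnosing a reported failure; the NID below is hypothetical and should be replaced with that of a peer server or LNet router:
  # ibstat | grep -i state          (at least one port should report "State: Active")
  # lctl ping 10.149.0.3@o2ib       (must succeed against at least one other LNet node)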

Failover Logging - syslog on the CIMS
Syslog failover example (part 1 of 2) - /var/log/messages
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: ERROR: esf-oss002 missing lustre mount: Failover initiated.
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: esf-oss002 failed health check
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: In FAILOVER mode.
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting SUSPEND mode for failover action (/var/esfsmon/esfsmon_suspend_scratch)
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting RUNSAFE mode for failover action (/var/esfsmon/esfsmon_runsafe_scratch)
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: OSS node failed: esf-oss002
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting category for esf-oss002 to esfs-failed-scratch
Apr 1 19:07:26 esms1 logger: esms1 esfsmon: Successfully set esf-oss002 category to esfs-failed-scratch

Failover Logging - syslog on the CIMS
Syslog failover example (part 2 of 2) - /var/log/messages
Apr 1 19:07:26 esms1 logger: esms1 esfsmon: Powering off esf-oss002 in 6 minutes, allowing kdump to complete
Apr 1 19:13:35 esms1 logger: esms1 esfsmon: esf-oss002 is powered off
Apr 1 19:13:35 esms1 logger: esms1 esfsmon: INFO: Failover started: esf-oss002's services failing over to partner
Apr 1 19:15:12 esms1 logger: esms1 esfsmon: INFO: Failover of targets completed successfully.
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: Lustre targets started on failover partner
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: I/O tuning of scratch complete.
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: Failover completed

Failover Logging - fo_log on the CIMS
esfsmon failover log (part 1 of 2) - /tmp/esfsmon/scratch.fo_log
Apr- 1-19:07:24 esfsmon: scratch: esf-oss002 failed health check
Apr- 1-19:07:24 esfsmon: scratch: In FAILOVER mode.
Apr- 1-19:07:24 esfsmon: scratch: Setting SUSPEND mode for failover action (/var/esfsmon/esfsmon_suspend_scratch)
Apr- 1-19:07:24 esfsmon: scratch: Setting RUNSAFE mode for failover action (/var/esfsmon/esfsmon_runsafe_scratch)
Apr- 1-19:07:24 esfsmon: scratch: OSS node failed: esf-oss002
Apr- 1-19:07:24 esfsmon: scratch: Node ordinal=2
Apr- 1-19:07:24 esfsmon: scratch: Node is odd=0
Apr- 1-19:07:24 esfsmon: scratch: Node is EVEN

Failover Logging - fo_log on the CIMS
esfsmon failover log (part 2 of 2) - /tmp/esfsmon/scratch.fo_log
Apr- 1-19:07:24 esfsmon: scratch: Setting category for esf-oss002 to esfs-failed-scratch
Apr- 1-19:07:26 esfsmon: scratch: Successfully set esf-oss002 category to esfs-failed-scratch
Apr- 1-19:07:26 esfsmon: scratch: Powering off esf-oss002 in 6 minutes, allowing kdump to complete
Apr- 1-19:13:35 esfsmon: scratch: esf-oss002 is powered off
Apr- 1-19:13:35 esfsmon: scratch: INFO: Failover started: esf-oss002's services failing over to partner
Apr- 1-19:15:12 esfsmon: scratch: INFO: Failover of targets completed successfully.
Apr- 1-19:15:20 esfsmon: scratch: INFO: Lustre targets started on failover partner
Apr- 1-19:15:20 esfsmon: scratch: INFO: I/O tuning of scratch complete.
Apr- 1-19:15:20 esfsmon: scratch: INFO: Failover completed

Automated Lustre Failover Monitor - failback
Restoring a node to service - esfsmon_failback
Places a node back into service:
  # esfsmon_failback lustre01-oss001
- Performs management housekeeping
- Calls lustre_control to fail the Lustre assets back to their primary server
NOTE: Using lustre_control failback instead of esfsmon_failback leaves the node in the failed category, so it is no longer part of the health check.

Automated Lustre Failover Monitor - failback
Restoring a node to service - esfsmon_failback (syslog example)
Apr 1 18:37:50 esms1 root: esms1 esfsmon_failback: Setting esf-mds001 to esfs-odd-scratch category
Apr 1 18:37:51 esms1 root: esms1 esfsmon_failback: INFO: Failback started: failing back Lustre services to esf-mds001
Apr 1 18:39:08 esms1 root: esms1 esfsmon_failback: INFO: Failback completed successfully.
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: Lustre targets started on esf-mds001
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: I/O tuning of esf-mds001 not performed.
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: Reloading Lustre modules on esf-mds002
Apr 1 18:40:14 esms1 root: esms1 esfsmon_failback: INFO: Failback of esf-mds001 completed

Automated Lustre Failover Monitor - esfsmon Configuration File
esfsmon.conf - used by esfsmon_healthcheck, esfsmon_action, and esfsmon_failback
Location: /cm/local/apps/cmd/etc/esfsmon.conf
Includes the following parameters (an illustrative sketch follows below):
- State and data directories (State: /var/esfsmon, Data: /tmp/esfsmon)
- Node categories used by each Lustre filesystem
- LNet networks used by each filesystem
- Base hostname for each filesystem
- Paths to lustre_control tuning files
- Passive MDS (non-DNE)
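
The exact key names in esfsmon.conf are release-specific and documented in the DMP Administrator's Guide (S-2327-C). The fragment below is therefore only a sketch of the kinds of settings listed above; every parameter name and value shown is hypothetical:
  # Hypothetical esfsmon.conf fragment - key names are illustrative only
  ESFSMON_STATE_DIR=/var/esfsmon
  ESFSMON_DATA_DIR=/tmp/esfsmon
  SCRATCH_CATEGORIES="esfs-even-scratch esfs-odd-scratch esfs-failed-scratch"
  SCRATCH_LNET="o2ib"
  SCRATCH_BASE_HOSTNAME="esf"
  SCRATCH_TUNE_FILE="/opt/cray/esms/cray-lustre-control-xx/default/etc/example.fs_tune"
  SCRATCH_PASSIVE_MDS="esf-mds002"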

esfsmon Status - cmsh
Check current status with the latesthealthdata command in cmsh:
esms1:~ # cmsh
[esms1]% device
[esms1->device]% latesthealthdata esms1 -v
Health Check       Severity  Value    Age (sec.)  Info Message
-----------------  --------  -------  ----------  ------------
esfsmon:scratch    10        UNKNOWN  58          Monitor in SUSPEND mode.
esfsmon:scratch1   0         PASS     58          Monitor in ACTIVE mode.
esfsmon:scratch2   0         PASS     58          Monitor in RUNSAFE mode.
esfsmon:scratch3   0         PASS     58          Monitor in DNE ACTIVE mode.
esfsmon:scratch4   10        FAIL     58          ERROR: lustre04-oss002 failed LNET ping healthcheck: Failover initiated. Monitor in DNE ACTIVE mode.
esfsmon:scratch5   10        FAIL     58          ERROR: lustre05-oss010 failed TCP ping healthcheck: Cray intervention required. Monitor in RUNSAFE mode.
esfsmon:scratch6   0         PASS     58          WARN: Failover MDS lustre06-mds002 failed power healthcheck: Cray intervention required. Monitor in ACTIVE mode.
[esms1->device]%

esfsmon Status History - cmsh
Check historical status with the dumphealthdata command in cmsh:
[esms1->device[esms1]]% dumphealthdata -7d now esfsmon:scratch -v
# From Wed Feb 26 11:24:09 2014 to Wed Mar 5 11:24:09 2014
Time                      Value    Info Message
------------------------  -------  ------------
Wed Feb 26 11:24:09 2014  PASS     Monitor in ACTIVE mode.
Wed Feb 26 14:37:00 2014  PASS     WARN: Failover MDS lustre01-mds002 failed LNET ping healthcheck: Cray intervention required. Monitor in ACTIVE mode.
Wed Feb 26 15:04:00 2014  FAIL     ERROR: lustre01-mds001 missing lustre mount: Cray intervention required. Monitor in RUNSAFE mode.
Wed Feb 26 15:05:00 2014  FAIL     ERROR: lustre01-mds001 missing lustre mount: Failover initiated. Monitor in ACTIVE mode.
Wed Feb 26 15:06:18 2014  UNKNOWN  Monitor in SUSPEND mode.
[esms1->device[esms1]]%
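
Both commands can also be run non-interactively from the CIMS shell, which is convenient for scripting or periodic reporting. A sketch using the standard cmsh -c option to pass a semicolon-separated command string:
  # cmsh -c "device; latesthealthdata esms1 -v"
  # cmsh -c "device use esms1; dumphealthdata -7d now esfsmon:scratch -v"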

Lustre Monitoring Tool - LMT
- Monitors Lustre servers using the Cerebro monitoring system
- Maintained by Lawrence Livermore National Laboratory (LLNL)
- Collects statistics published in /proc/fs/lustre every 5 seconds
- Enabled/disabled by starting/stopping the Cerebro daemon (cerebrod) on the CIMS and CLFS nodes (see the sketch below)
On the CIMS:
- lmt-server: Cerebro monitoring system; Cerebro uses the MySQL server on the CIMS to store data
- ltop: live top-like data monitor
- lmtsh: interactive shell for viewing historical LMT data
On the CLFS (MDS/OSS) nodes:
- lmt-server-agent: Cerebro monitor plugin, ltop client, and other utilities for administering LMT
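
Since LMT collection is controlled entirely by cerebrod, enabling or disabling it amounts to managing that service on the CIMS and on each CLFS node. A sketch, assuming the cerebrod SysV init script installed by the LMT packages and the hypothetical node names used elsewhere in this presentation:
  # service cerebrod start                                          (on the CIMS)
  # pdsh -w esf-mds00[1-2],esf-oss00[1-4] service cerebrod start    (on the CLFS nodes)
  # pdsh -w esf-mds00[1-2],esf-oss00[1-4] service cerebrod stop     (disable collection)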

Lustre Monitoring Tool - LMT: OSS/OST Data
Data collected per OSS/OST:
- OSC Status - the MDS's view of the OST
- OSS hostname
- Export count - number of clients mounting the filesystem (includes one for the MDS and one for the OST itself)
- Reconnects per second
- Read and write rates
- Bulk RPCs per second
- OST resource locks - number currently granted, grant rate, and cancellation rate
- CPU and memory usage
- OST storage space used

Lustre Monitoring Tool - LMT: MDS/MDT Data
Data collected per MDS/MDT:
- CPU usage
- KB free and KB used
- inodes free and inodes used
- Rates for the following operations: open and close; mknod; link and unlink; mkdir and rmdir; rename

Lustre Monitoring Tool - LMT
Example ltop data: /usr/bin/ltop -f scratch
Filesystem: scratch
    Inodes: 443.956m total, 49.295m used ( 11%), 394.662m free
     Space: 172.188t total, 129.573t used ( 75%), 42.615t free
   Bytes/s: 0.000g read, 0.000g write, 0 IOPS
  MDops/s: 0 open, 0 close, 0 getattr, 0 setattr
           0 link, 0 unlink, 0 mkdir, 0 rmdir
           0 statfs, 0 rename, 0 getxattr
OST   S  OSS     Exp  CR  rmb/s  wmb/s  IOPS  LOCKS  LGR  LCR  %cpu  %mem  %spc
0000  F  oss001  137  0   0      0      0     0      0    0    1     10    77
0001  F  oss002  137  0   0      0      0     0      0    0    0     9     76
0002  F  oss003  137  0   0      0      0     0      0    0    0     9     76
0003  F  oss004  137  0   0      0      0     0      0    0    0     10    76
0004  F  oss001  137  0   0      0      0     0      0    0    1     10    76
0005  F  oss002  137  0   0      0      0     0      0    0    0     9     76
0006  F  oss003  137  0   0      0      0     0      0    0    0     9     76
0007  F  oss004  137  0   0      0      0     0      0    0    0     10    77

Lustre Monitoring Tool - LMTSH
Example lmtsh data (part 1 of 2): /usr/bin/lmtsh -f scratch
scratch> fs
Available filesystems: scratch
scratch> ost
OST_ID  OST_NAME
1       scratch-ost0000
2       scratch-ost0002
3       scratch-ost0004
4       scratch-ost0006
5       scratch-ost0001
6       scratch-ost0003
7       scratch-ost0005
8       scratch-ost0007

Lustre Monitoring Tool - LMTSH
Example lmtsh data (part 2 of 2): /usr/bin/lmtsh -f scratch
scratch> t
Available tables for scratch:
Table Name                   Row Count
EVENT_DATA                   0
EVENT_INFO                   0
FILESYSTEM_AGGREGATE_DAY     576
FILESYSTEM_AGGREGATE_HOUR    13338
FILESYSTEM_AGGREGATE_MONTH   36
FILESYSTEM_AGGREGATE_WEEK    99
FILESYSTEM_AGGREGATE_YEAR    18
FILESYSTEM_INFO              1
MDS_AGGREGATE_DAY            640
MDS_AGGREGATE_HOUR           14745
MDS_AGGREGATE_MONTH          40
MDS_AGGREGATE_WEEK           110
MDS_AGGREGATE_YEAR           20
MDS_DATA                     2108108
MDS_INFO                     2
MDS_OPS_DATA                 44273061
MDS_VARIABLE_INFO            7
OPERATION_INFO               81
OSS_DATA                     2101471
OSS_INFO                     2
OSS_INTERFACE_DATA           0
OSS_INTERFACE_INFO           0
OSS_VARIABLE_INFO            7
OST_AGGREGATE_DAY            4608
OST_AGGREGATE_HOUR           106596
OST_AGGREGATE_MONTH          288
OST_AGGREGATE_WEEK           792
OST_AGGREGATE_YEAR           144
OST_DATA                     8465265
OST_INFO                     8
OST_OPS_DATA                 0
OST_VARIABLE_INFO            11
ROUTER_AGGREGATE_DAY         0
ROUTER_AGGREGATE_HOUR        0
ROUTER_AGGREGATE_MONTH       0
ROUTER_AGGREGATE_WEEK        0
ROUTER_AGGREGATE_YEAR        0
ROUTER_DATA                  0
ROUTER_INFO                  0
ROUTER_VARIABLE_INFO         4
TIMESTAMP_INFO               1131417
VERSION                      0

Log Files Located on the CIMS
- Syslog - all CLFS nodes forward syslog to the CIMS: /var/log/messages (a quick way to filter the esfsmon entries is sketched below)
- esfsmon failover log: /tmp/esfsmon/fsname.fo_log
- CMDaemon (management daemon) log: /var/log/cmdaemon
- Node-installer log: /var/log/node-installer
- Conman CLFS node console logs: /var/log/conman/
- Software installation logs: /var/adm/cray/logs
- Bright Cluster Manager event log: stored in the Bright Cluster Manager database; accessed via the events command in cmsh or the event viewer in cmgui
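
Because every esfsmon message on the CIMS carries an esfsmon or esfsmon_failback tag (as in the failover examples earlier), recent monitor activity can be reviewed by simply filtering syslog. A minimal sketch for the filesystem scratch:
  # grep esfsmon /var/log/messages | tail -50
  # cat /tmp/esfsmon/scratch.fo_log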

Troubleshooting
Depending on the problem, check the most likely logs first.
Booting issues:
- node-installer log - /var/log/node-installer
Management issues:
- CMDaemon log - /var/log/cmdaemon
- syslog - /var/log/messages
- event log - events command in cmsh, event viewer in cmgui
- relevant monitor histories - dumphealthdata for health checks
Operational issues:
- syslog - /var/log/messages
- event log - events command in cmsh, event viewer in cmgui
- relevant monitor histories - dumphealthdata for health checks

Documentation
- Data Management Platform (DMP) Administrator's Guide - S-2327-C (esfsmon installation and configuration; LMT installation and configuration)
- Installing Lustre(R) File System by Cray(R) (CLFS) Software - S-2521-C
- Installing Cray(R) Integrated Management Services (CIMS) Software - S-2522-E
- LMT - https://github.com/chaos/lmt/wiki

Thank you for your time. Any questions?
Jeff Keopp
Harold Longley

Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2013 Cray Inc.