Cray Data Management Platform: Cray Lustre File System Monitoring esfsmon
Jeff Keopp, Cray Inc.
Harold Longley, Cray Inc.
Safe Harbor Statement

This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets, and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
Overview: External Lustre Filesystems

Cray external Lustre filesystems:
- Lustre File System by Cray (CLFS, previously esfs)
- Cray Sonexion

Management and monitoring:
- CLFS
  - Cray Integrated Management System (CIMS, previously esms)
  - lustre_control - controls the Lustre filesystem
  - esfsmon - monitors and performs automated Lustre failover
  - Lustre Monitoring Tool - gathers data from /proc/fs/lustre
- Cray Sonexion
  - Cray Sonexion System Manager (CSSM)
  - Unified System Management firmware (USM)

This discussion is focused on monitoring and automated failover of CLFS.
CLFS Overview

[Architecture diagram: a primary and a secondary CIMS, managed via CMGUI/CMSH from a remote server or the CIMS itself, connect to the CLFS nodes over site-admin-net, esmaint-net, ipmi-net, and failover-net through a managed Ethernet switch (management/provisioning and IPMI). MDS and OSS nodes are deployed in failover pairs attached to block storage controllers (A & B) over FC8, SAS, or IB: MDS pairs are active/passive (active/active in DNE) serving the MDT; OSS pairs are active/active serving OSTs. An IB switch provides the LNet path (ib-net) to the LNet routers on the Cray system, and the CIMS connects to the site LDAP server.]

CLFS software stack: CentOS, Block Storage Software, and Cray ESF software with the Lustre server built from CLE sources.
Controlling Lustre - lustre_control

Performs the following filesystem operations:
- Format/Reformat
- Writeconf
- Start/Stop/Tune
- Status
- Manual Failover/Failback

Executed from the CIMS, with components on both the CIMS and the CLFS nodes.
Controlling Lustre - lustre_control

Filesystems are defined in fs_defs files, which are installed into lustre_control and not used directly in operation:

# lustre_control install fsname.fs_defs

Example: /opt/cray/esms/cray-lustre-control-xx/default/etc/example.fs_defs

Filesystem tuning is defined in fs_tune files, which are used directly in operation:

# lustre_control set_tune -f fsname <path_to_fs_tune_file>

Example: /opt/cray/esms/cray-lustre-control-xx/default/etc/example.fs_tune
Automated Lustre Failover Monitor - esfsmon

Implemented as a custom health check:
- /cm/local/apps/cmd/scripts/healthchecks/esfsmon_healthcheck
- Executed on the CIMS
- Monitors filesystem nodes via the management daemon on each node
- Monitors multiple CLFS filesystems
- Supports Lustre DNE (Distributed NamespacE) and non-DNE CLFS

A health check failure triggers failover of Lustre assets:
- /cm/local/apps/cmd/scripts/actions/esfsmon_action
- Calls lustre_control to perform the failover of Lustre assets
- Supports multiple failovers (the previous version went into RUNSAFE mode after the first failover)
esfsmon Operational Modes

NORMAL - monitors and will execute automated failover
- Failures are logged to /var/log/messages
- Failover information is logged to /tmp/esfsmon/fsname.fo_log

RUNSAFE - monitors but will not execute automated failover
- Failure incidents are logged to /var/log/messages
- Set by the existence of /var/esfsmon/esfsmon_runsafe_fsname

SUSPENDED - not monitored
- Set by the existence of /var/esfsmon/esfsmon_suspend_fsname

NOTE: esfsmon sets SUSPENDED mode itself during failover and failback operations, or when lustre_control is stopping the filesystem.
esfsmon Operation

Checks CLFS nodes every 2 minutes (configurable). CLFS nodes are grouped by categories per filesystem, and checks are performed by category, excluding esfs-failed-fsname:
- esfs-even-fsname - even-numbered nodes in filesystem fsname
- esfs-odd-fsname - odd-numbered nodes in filesystem fsname
- esfs-failed-fsname - failed nodes in filesystem fsname

Checks are performed on a category of nodes in parallel:
- Power Status
- Node Status
- TCP Ping
- LNet Ping
- Lustre Mounts
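The per-category fan-out above can be sketched as follows. This is an illustrative Python skeleton, not esfsmon's implementation; `run_check` is a hypothetical callable standing in for whatever mechanism actually performs each probe.

```python
from concurrent.futures import ThreadPoolExecutor

# Check names taken from the slide, in the order listed.
CHECKS = ("power", "node_status", "tcp_ping", "lnet_ping", "lustre_mounts")

def check_node(node, run_check):
    """Run each check in turn; return the first failing check name,
    or None if the node is healthy."""
    for check in CHECKS:
        if not run_check(node, check):
            return check
    return None

def check_category(nodes, run_check):
    """Run the health checks across a whole category of nodes in
    parallel, returning {node: first_failed_check_or_None}."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return dict(pool.map(lambda n: (n, check_node(n, run_check)), nodes))
```

Running checks per category (evens, then odds) rather than over all nodes at once means a failover pair is never probed simultaneously with its partner.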
esfsmon Failure Determination

Power Status
- A failure triggers a retry, to avoid acting on a transient failed IPMI response.

Node Status
- A node DOWN status is reported by the management daemon.
- The node must also fail a TCP ping before being declared dead.

TCP Ping
- A failure triggers a retry, to avoid acting on a transient ping failure.

LNet Ping
- ibstat is checked; no active IB interfaces triggers a failure.
- An lctl ping to at least one other node must succeed.
- lctl ping will retry for up to 90 seconds.

Lustre Mounts
- A missing mount triggers failover.
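Two of the rules above (retry-before-fail, and DOWN-plus-failed-ping) can be sketched directly. This is an illustrative Python fragment under our own naming, not esfsmon's code; the retry count and delay are assumptions.

```python
import time

def check_with_retry(check, node, retries=1, delay=2.0):
    """Retry a transient-prone check (power status, TCP ping) before
    declaring failure, as the slide describes (sketch)."""
    for attempt in range(retries + 1):
        if check(node):
            return True
        if attempt < retries:
            time.sleep(delay)
    return False

def node_is_dead(daemon_reports_down, tcp_ping_ok):
    """A DOWN status from the management daemon alone is not enough:
    the node must also fail a TCP ping to be declared dead."""
    return daemon_reports_down and not tcp_ping_ok
```

The double condition avoids failing over a node whose management daemon is merely unreachable while Lustre is still serving I/O.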
Failover Logging - syslog on the CIMS (part 1 of 2)

/var/log/messages:

Apr 1 19:07:24 esms1 logger: esms1 esfsmon: ERROR: esf-oss002 missing lustre mount: Failover initiated.
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: esf-oss002 failed health check
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: In FAILOVER mode.
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting SUSPEND mode for failover action (/var/esfsmon/esfsmon_suspend_scratch)
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting RUNSAFE mode for failover action (/var/esfsmon/esfsmon_runsafe_scratch)
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: OSS node failed: esf-oss002
Apr 1 19:07:24 esms1 logger: esms1 esfsmon: Setting category for esf-oss002 to esfs-failed-scratch
Apr 1 19:07:26 esms1 logger: esms1 esfsmon: Successfully set esf-oss002 category to esfs-failed-scratch
Failover Logging - syslog on the CIMS (part 2 of 2)

/var/log/messages:

Apr 1 19:07:26 esms1 logger: esms1 esfsmon: Powering off esf-oss002 in 6 minutes, allowing kdump to complete
Apr 1 19:13:35 esms1 logger: esms1 esfsmon: esf-oss002 is powered off
Apr 1 19:13:35 esms1 logger: esms1 esfsmon: INFO: Failover started: esf-oss002's services failing over to partner
Apr 1 19:15:12 esms1 logger: esms1 esfsmon: INFO: Failover of targets completed successfully.
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: Lustre targets started on failover partner
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: I/O tuning of scratch complete.
Apr 1 19:15:20 esms1 logger: esms1 esfsmon: INFO: Failover completed
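The logged sequence can be read as an ordered series of management operations. The sketch below only encodes that ordering; the `mgr` object and its methods are hypothetical stand-ins for the cmsh and lustre_control operations the real esfsmon_action script drives, not an actual API.

```python
def failover_sequence(node, fsname, mgr, log):
    """Order of operations visible in the syslog excerpt (sketch only;
    `mgr` methods are invented names, not a real interface)."""
    mgr.set_category(node, "esfs-failed-%s" % fsname)  # drop node from checks
    mgr.power_off(node, delay_minutes=6)               # allow kdump to complete
    mgr.failover_targets(node)                         # Lustre targets to partner
    mgr.apply_io_tuning(fsname)                        # retune after failover
    log("Failover completed")
```

Note the six-minute delay before power-off: the failed node is given time to finish writing a crash dump before it loses power.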
Failover Logging - fo_log on the CIMS (part 1 of 2)

/tmp/esfsmon/scratch.fo_log:

Apr- 1-19:07:24 esfsmon: scratch: esf-oss002 failed health check
Apr- 1-19:07:24 esfsmon: scratch: In FAILOVER mode.
Apr- 1-19:07:24 esfsmon: scratch: Setting SUSPEND mode for failover action (/var/esfsmon/esfsmon_suspend_scratch)
Apr- 1-19:07:24 esfsmon: scratch: Setting RUNSAFE mode for failover action (/var/esfsmon/esfsmon_runsafe_scratch)
Apr- 1-19:07:24 esfsmon: scratch: OSS node failed: esf-oss002
Apr- 1-19:07:24 esfsmon: scratch: Node ordinal=2
Apr- 1-19:07:24 esfsmon: scratch: Node is odd=0
Apr- 1-19:07:24 esfsmon: scratch: Node is EVEN
Failover Logging - fo_log on the CIMS (part 2 of 2)

/tmp/esfsmon/scratch.fo_log:

Apr- 1-19:07:24 esfsmon: scratch: Setting category for esf-oss002 to esfs-failed-scratch
Apr- 1-19:07:26 esfsmon: scratch: Successfully set esf-oss002 category to esfs-failed-scratch
Apr- 1-19:07:26 esfsmon: scratch: Powering off esf-oss002 in 6 minutes, allowing kdump to complete
Apr- 1-19:13:35 esfsmon: scratch: esf-oss002 is powered off
Apr- 1-19:13:35 esfsmon: scratch: INFO: Failover started: esf-oss002's services failing over to partner
Apr- 1-19:15:12 esfsmon: scratch: INFO: Failover of targets completed successfully.
Apr- 1-19:15:20 esfsmon: scratch: INFO: Lustre targets started on failover partner
Apr- 1-19:15:20 esfsmon: scratch: INFO: I/O tuning of scratch complete.
Apr- 1-19:15:20 esfsmon: scratch: INFO: Failover completed
Restoring a Node Back to Service - esfsmon_failback

Places a node back into service:

# esfsmon_failback lustre01-oss001

- Performs management housekeeping
- Calls lustre_control to fail back the Lustre assets to their primary server

NOTE: Using lustre_control failback instead of esfsmon_failback leaves the node in the failed category, so it is no longer part of the health check.
Restoring a Node Back to Service - esfsmon_failback (example log)

Apr 1 18:37:50 esms1 root: esms1 esfsmon_failback: Setting esf-mds001 to esfs-odd-scratch category
Apr 1 18:37:51 esms1 root: esms1 esfsmon_failback: INFO: Failback started: failing back Lustre services to esf-mds001
Apr 1 18:39:08 esms1 root: esms1 esfsmon_failback: INFO: Failback completed successfully.
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: Lustre targets started on esf-mds001
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: I/O tuning of esf-mds001 not performed.
Apr 1 18:39:13 esms1 root: esms1 esfsmon_failback: INFO: Reloading Lustre modules on esf-mds002
Apr 1 18:40:14 esms1 root: esms1 esfsmon_failback: INFO: Failback of esf-mds001 completed
esfsmon Configuration File - esfsmon.conf

Used by esfsmon_healthcheck, esfsmon_action, and esfsmon_failback:
/cm/local/apps/cmd/etc/esfsmon.conf

Includes the following parameters:
- State and data directories (state: /var/esfsmon; data: /tmp/esfsmon)
- Node categories used by each Lustre filesystem
- LNet networks used by each filesystem
- Base hostname for each filesystem
- Paths to lustre_control tuning files
- Passive MDS (non-DNE)
esfsmon Status - cmsh

Check current status with the latesthealthdata command in cmsh:

esms1:~ # cmsh
[esms1]% device
[esms1->device]% latesthealthdata esms1 -v
Health Check      Severity  Value    Age (sec.)  Info Message
esfsmon:scratch   10        UNKNOWN  58          Monitor in SUSPEND mode.
esfsmon:scratch1  0         PASS     58          Monitor in ACTIVE mode.
esfsmon:scratch2  0         PASS     58          Monitor in RUNSAFE mode.
esfsmon:scratch3  0         PASS     58          Monitor in DNE ACTIVE mode.
esfsmon:scratch4  10        FAIL     58          ERROR: lustre04-oss002 failed LNET ping healthcheck: Failover initiated. Monitor in DNE ACTIVE mode.
esfsmon:scratch5  10        FAIL     58          ERROR: lustre05-oss010 failed TCP ping healthcheck: Cray intervention required. Monitor in RUNSAFE mode.
esfsmon:scratch6  0         PASS     58          WARN: Failover MDS lustre06-mds002 failed power healthcheck: Cray intervention required. Monitor in ACTIVE mode.
[esms1->device]%
esfsmon Status History - cmsh

Check historical status with the dumphealthdata command in cmsh:

[esms1->device[esms1]]% dumphealthdata -7d now esfsmon:scratch -v
# From Wed Feb 26 11:24: to Wed Mar 5 11:24:
Time               Value    Info Message
Wed Feb 26 11:24:  PASS     Monitor in ACTIVE mode.
Wed Feb 26 14:37:  PASS     WARN: Failover MDS lustre01-mds002 failed LNET ping healthcheck: Cray intervention required. Monitor in ACTIVE mode.
Wed Feb 26 15:04:  FAIL     ERROR: lustre01-mds001 missing lustre mount: Cray intervention required. Monitor in RUNSAFE mode.
Wed Feb 26 15:05:  FAIL     ERROR: lustre01-mds001 missing lustre mount: Failover initiated. Monitor in ACTIVE mode.
Wed Feb 26 15:06:  UNKNOWN  Monitor in SUSPEND mode.
[esms1->device[esms1]]%
Lustre Monitoring Tool - LMT

- Monitors Lustre servers using the Cerebro monitoring system
- Maintained by Lawrence Livermore National Laboratory (LLNL)
- Collects statistics published in /proc/fs/lustre every 5 seconds
- Enabled/disabled by starting/stopping the Cerebro daemon (cerebrod) on the CIMS and CLFS nodes

On the CIMS:
- lmt-server: Cerebro monitoring system; Cerebro uses the MySQL server on the CIMS to store data
- ltop: live top-like data monitor
- lmtsh: interactive shell for viewing historical LMT data

On the CLFS (MDS/OSS) nodes:
- lmt-server-agent: Cerebro monitor plugin, ltop client, and other utilities for administering LMT
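The counters under /proc/fs/lustre are cumulative, so a collector that samples every 5 seconds derives rates by differencing consecutive snapshots. A minimal sketch of that arithmetic, with invented counter names for illustration (this is not LMT code):

```python
def rates_from_snapshots(before, after, interval=5.0):
    """Convert two snapshots of cumulative counters into per-second
    rates over the sampling interval (illustrative sketch)."""
    return {key: (after[key] - before[key]) / interval for key in after}
```

For example, a counter that advanced by 5 MiB of writes over a 5-second interval yields a 1 MiB/s write rate.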
LMT OSS/OST Data

Data collected per OSS/OST:
- OSC status - the MDS's view of the OST
- OSS hostname
- Export count - the number of clients mounting the filesystem (includes one for the MDS and one for the OST itself)
- Reconnects per second
- Read and write rates
- Bulk RPCs per second
- OST resource locks - number currently granted, grant and cancellation rates
- CPU and memory usage
- OST storage space used
LMT MDS/MDT Data

Data collected per MDS/MDT:
- CPU usage
- KB free and KB used
- inodes free and inodes used
- Rates for the following operations:
  - open and close
  - mknod
  - link and unlink
  - mkdir and rmdir
  - rename
LMT Example - ltop

/usr/bin/ltop -f scratch

Filesystem: scratch
Inodes:   m total, m used ( 11%), m free
Space:    t total, t used ( 75%), t free
Bytes/s:  0.000g read, 0.000g write, 0 IOPS
MDops/s:  0 open, 0 close, 0 getattr, 0 setattr
          0 link, 0 unlink, 0 mkdir, 0 rmdir
          0 statfs, 0 rename, 0 getxattr
OST S OSS Exp CR rmb/s wmb/s IOPS LOCKS LGR LCR %cpu %mem %spc
(per-OST rows for OSTs 0000-0007, each served by an OSS; the numeric column values were lost in transcription)
LMT Example - lmtsh (part 1 of 2)

/usr/bin/lmtsh -f scratch

scratch> fs
Available filesystems: scratch
scratch> ost
OST_ID  OST_NAME
1       scratch-ost0000
2       scratch-ost0001
3       scratch-ost0002
4       scratch-ost0003
5       scratch-ost0004
6       scratch-ost0005
7       scratch-ost0006
8       scratch-ost0007
LMT Example - lmtsh (part 2 of 2)

/usr/bin/lmtsh -f scratch

scratch> t
Available tables for scratch (row counts lost in transcription are left blank):

Table Name                   Row Count
EVENT_DATA                   0
EVENT_INFO                   0
FILESYSTEM_AGGREGATE_DAY     576
FILESYSTEM_AGGREGATE_HOUR
FILESYSTEM_AGGREGATE_MONTH   36
FILESYSTEM_AGGREGATE_WEEK    99
FILESYSTEM_AGGREGATE_YEAR    18
FILESYSTEM_INFO              1
MDS_AGGREGATE_DAY            640
MDS_AGGREGATE_HOUR
MDS_AGGREGATE_MONTH          40
MDS_AGGREGATE_WEEK           110
MDS_AGGREGATE_YEAR           20
MDS_DATA
MDS_INFO                     2
MDS_OPS_DATA
MDS_VARIABLE_INFO            7
OPERATION_INFO               81
OSS_DATA
OSS_INFO                     2
OSS_INTERFACE_DATA           0
OSS_INTERFACE_INFO           0
OSS_VARIABLE_INFO            7
OST_AGGREGATE_DAY            4608
OST_AGGREGATE_HOUR
OST_AGGREGATE_MONTH          288
OST_AGGREGATE_WEEK           792
OST_AGGREGATE_YEAR           144
OST_DATA
OST_INFO                     8
OST_OPS_DATA                 0
OST_VARIABLE_INFO            11
ROUTER_AGGREGATE_DAY         0
ROUTER_AGGREGATE_HOUR        0
ROUTER_AGGREGATE_MONTH       0
ROUTER_AGGREGATE_WEEK        0
ROUTER_AGGREGATE_YEAR        0
ROUTER_DATA                  0
ROUTER_INFO                  0
ROUTER_VARIABLE_INFO         4
TIMESTAMP_INFO
VERSION                      0
Log Files Located on the CIMS

- Syslog: all CLFS nodes forward syslog to the CIMS - /var/log/messages
- esfsmon failover log - /tmp/esfsmon/fsname.fo_log
- CMDaemon (management daemon) log - /var/log/cmdaemon
- Node-installer log - /var/log/node-installer
- Conman CLFS node console logs - /var/log/conman/
- Software installation logs - /var/adm/cray/logs
- Bright Cluster Manager event log - stored in the Bright Cluster Manager database; accessed via the events command in cmsh or the event viewer in cmgui
Troubleshooting

Depending on the problem, check the most likely logs.

Booting issues:
- node-installer log - /var/log/node-installer

Management issues:
- cmdaemon log - /var/log/cmdaemon
- syslog - /var/log/messages
- event log - events command in cmsh, event viewer in cmgui
- Check relevant monitor histories - dumphealthdata for health checks

Operational issues:
- syslog - /var/log/messages
- event log - events command in cmsh, event viewer in cmgui
- Check relevant monitor histories - dumphealthdata for health checks
Documentation

- Data Management Platform (DMP) Administrator's Guide - S-2327-C (esfsmon installation and configuration; LMT installation and configuration)
- Installing Lustre(R) File System by Cray(R) (CLFS) Software - S-2521-C
- Installing Cray(R) Integrated Management Services (CIMS) Software - S-2522-E
Thank you for your time. Any questions?

Jeff Keopp
Harold Longley
Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2013 Cray Inc.
Symantec NetBackup OpenStorage Solutions Guide for Disk
Symantec NetBackup OpenStorage Solutions Guide for Disk UNIX, Windows, Linux Release 7.6 Symantec NetBackup OpenStorage Solutions Guide for Disk The software described in this book is furnished under a
Step-by-Step Guide to Open-E DSS V7 Active-Active Load Balanced iscsi HA Cluster
www.open-e.com 1 Step-by-Step Guide to Open-E DSS V7 Active-Active Load Balanced iscsi HA Cluster (without bonding) Software Version: DSS ver. 7.00 up10 Presentation updated: May 2013 www.open-e.com 2
Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre
Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre University of Cambridge, UIS, HPC Service Authors: Wojciech Turek, Paul Calleja, John Taylor
PARALLELS SERVER 4 BARE METAL README
PARALLELS SERVER 4 BARE METAL README This document provides the first-priority information on Parallels Server 4 Bare Metal and supplements the included documentation. TABLE OF CONTENTS 1 About Parallels
Double-Take AVAILABILITY
Double-Take AVAILABILITY Version 8.0.0 Double-Take Availability for Linux User's Guide Notices Double-Take Availability for Linux User's Guide Version 8.0, Monday, April 25, 2016 Check your service agreement
Enterprise Manager. Version 6.2. Administrator s Guide
Enterprise Manager Version 6.2 Administrator s Guide Enterprise Manager 6.2 Administrator s Guide Document Number 680-017-017 Revision Date Description A August 2012 Initial release to support version
Monthly Specification Update
Monthly Specification Update Intel Server Board S1400FP Family August, 2013 Enterprise Platforms and Services Marketing Enterprise Platforms and Services Marketing Monthly Specification Update Revision
NNMi120 Network Node Manager i Software 9.x Essentials
NNMi120 Network Node Manager i Software 9.x Essentials Instructor-Led Training For versions 9.0 9.2 OVERVIEW This course is designed for those Network and/or System administrators tasked with the installation,
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service Eddie Dong, Yunhong Jiang 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Course Syllabus. At Course Completion
Key Data Product #: Course #: 6231A Number of Days: 5 Format: Certification Exams: 70-432, 70-433 Instructor-Led This course syllabus should be used to determine whether the course is appropriate for the
SIOS Protection Suite for Linux v8.3.0. Postfix Recovery Kit Administration Guide
SIOS Protection Suite for Linux v8.3.0 Postfix Recovery Kit Administration Guide July 2014 This document and the information herein is the property of SIOS Technology Corp. (previously known as SteelEye
Deployment Guide: Transparent Mode
Deployment Guide: Transparent Mode March 15, 2007 Deployment and Task Overview Description Follow the tasks in this guide to deploy the appliance as a transparent-firewall device on your network. This
Intel Service Assurance Administrator. Product Overview
Intel Service Assurance Administrator Product Overview Running Enterprise Workloads in the Cloud Enterprise IT wants to Start a private cloud initiative to service internal enterprise customers Find an
Intel System Event Log (SEL) Viewer Utility
Intel System Event Log (SEL) Viewer Utility User Guide Document No. E12461-005 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS FOR THE GENERAL PURPOSE OF SUPPORTING
Clearswift SECURE Exchange Gateway Installation & Setup Guide. Version 1.0
Clearswift SECURE Exchange Gateway Installation & Setup Guide Version 1.0 Copyright Revision 1.0, December, 2013 Published by Clearswift Ltd. 1995 2013 Clearswift Ltd. All rights reserved. The materials
Installation and configuration of Real-Time Monitoring Tool (RTMT)
Installation and configuration of Real-Time Monitoring Tool (RTMT) How to install and upgrade RTMT, page 1 Services, servlets, and service parameters on server, page 5 Navigation of RTMT, page 6 Nonconfigurable
Barracuda Link Balancer Administrator s Guide
Barracuda Link Balancer Administrator s Guide Version 1.0 Barracuda Networks Inc. 3175 S. Winchester Blvd. Campbell, CA 95008 http://www.barracuda.com Copyright Notice Copyright 2008, Barracuda Networks
Astaro Deployment Guide High Availability Options Clustering and Hot Standby
Connect With Confidence Astaro Deployment Guide Clustering and Hot Standby Table of Contents Introduction... 2 Active/Passive HA (Hot Standby)... 2 Active/Active HA (Cluster)... 2 Astaro s HA Act as One...
Lab 5.5 Configuring Logging
Lab 5.5 Configuring Logging Learning Objectives Configure a router to log to a Syslog server Use Kiwi Syslog Daemon as a Syslog server Configure local buffering on a router Topology Diagram Scenario In
Active-Active and High Availability
Active-Active and High Availability Advanced Design and Setup Guide Perceptive Content Version: 7.0.x Written by: Product Knowledge, R&D Date: July 2015 2015 Perceptive Software. All rights reserved. Lexmark
Syncplicity On-Premise Storage Connector
Syncplicity On-Premise Storage Connector Implementation Guide Abstract This document explains how to install and configure the Syncplicity On-Premise Storage Connector. In addition, it also describes how
Lesson Plans Configuring Exchange Server 2007
Lesson Plans Configuring Exchange Server 2007 (Exam 70-236) Version 2.1 Table of Contents Course Overview... 2 Section 1.1: Server-based Messaging... 4 Section 1.2: Exchange Versions... 5 Section 1.3:
Intel System Event Log (SEL) Viewer Utility
Intel System Event Log (SEL) Viewer Utility User Guide Document No. E12461-007 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS FOR THE GENERAL PURPOSE OF SUPPORTING
Explain how to prepare the hardware and other resources necessary to install SQL Server. Install SQL Server. Manage and configure SQL Server.
Course 6231A: Maintaining a Microsoft SQL Server 2008 Database About this Course Elements of this syllabus are subject to change. This five-day instructor-led course provides students with the knowledge
How To Write A Libranthus 2.5.3.3 (Libranthus) On Libranus 2.4.3/Libranus 3.5 (Librenthus) (Libre) (For Linux) (
LUSTRE/HSM BINDING IS THERE! LAD'13 Aurélien Degrémont SEPTEMBER, 17th 2013 CEA 10 AVRIL 2012 PAGE 1 AGENDA Presentation Architecture Components Examples Project status LAD'13
Cyberoam Multi link Implementation Guide Version 9
Cyberoam Multi link Implementation Guide Version 9 Document version 96-1.0-12/05/2009 IMPORTANT NOTICE Elitecore has supplied this Information believing it to be accurate and reliable at the time of printing,
High Availability Solutions for the MariaDB and MySQL Database
High Availability Solutions for the MariaDB and MySQL Database 1 Introduction This paper introduces recommendations and some of the solutions used to create an availability or high availability environment
Managing Software and Configurations
55 CHAPTER This chapter describes how to manage the ASASM software and configurations and includes the following sections: Saving the Running Configuration to a TFTP Server, page 55-1 Managing Files, page
RedHat (RHEL) System Administration Course Summary
Contact Us: (616) 875-4060 RedHat (RHEL) System Administration Course Summary Length: 5 Days Prerequisite: RedHat fundamentals course Recommendation Statement: Students should have some experience with
AKIPS Network Monitor User Manual (DRAFT) Version 15.x. AKIPS Pty Ltd
AKIPS Network Monitor User Manual (DRAFT) Version 15.x AKIPS Pty Ltd October 2, 2015 1 Copyright Copyright 2015 AKIPS Holdings Pty Ltd. All rights reserved worldwide. No part of this document may be reproduced
High Availability and Disaster Recovery for Exchange Servers Through a Mailbox Replication Approach
High Availability and Disaster Recovery for Exchange Servers Through a Mailbox Replication Approach Introduction Email is becoming ubiquitous and has become the standard tool for communication in many
Isilon OneFS. Version 7.2.1. OneFS Migration Tools Guide
Isilon OneFS Version 7.2.1 OneFS Migration Tools Guide Copyright 2015 EMC Corporation. All rights reserved. Published in USA. Published July, 2015 EMC believes the information in this publication is accurate
Network Contention and Congestion Control: Lustre FineGrained Routing
Network Contention and Congestion Control: Lustre FineGrained Routing Matt Ezell HPC Systems Administrator Oak Ridge Leadership Computing Facility Oak Ridge National Laboratory International Workshop on
Dell OpenManage Mobile Version 1.4 User s Guide (Android)
Dell OpenManage Mobile Version 1.4 User s Guide (Android) Notes, cautions, and warnings NOTE: A NOTE indicates important information that helps you make better use of your computer. CAUTION: A CAUTION
NI Real-Time Hypervisor for Windows
QUICK START GUIDE NI Real-Time Hypervisor Version 2.1 The NI Real-Time Hypervisor provides a platform you can use to develop and run LabVIEW and LabVIEW Real-Time applications simultaneously on a single
Lustre performance monitoring and trouble shooting
Lustre performance monitoring and trouble shooting March, 2015 Patrick Fitzhenry and Ian Costello 2012 DataDirect Networks. All Rights Reserved. 1 Agenda EXAScaler (Lustre) Monitoring NCI test kit hardware
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service. Eddie Dong, Tao Hong, Xiaowei Yang
COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service Eddie Dong, Tao Hong, Xiaowei Yang 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
Active-Active ImageNow Server
Active-Active ImageNow Server Getting Started Guide ImageNow Version: 6.7. x Written by: Product Documentation, R&D Date: March 2014 2014 Perceptive Software. All rights reserved CaptureNow, ImageNow,
User's Guide - Beta 1 Draft
IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Cluster Server Agent vnext User's Guide - Beta 1 Draft SC27-2316-05 IBM Tivoli Composite Application Manager for Microsoft
