Oak Ridge National Laboratory Computing and Computational Sciences Directorate. File System Administration and Monitoring




Oak Ridge National Laboratory Computing and Computational Sciences Directorate
File System Administration and Monitoring
Jesse Hanley, Rick Mohr, Jeffrey Rossiter, Sarp Oral, Michael Brim, Jason Hill, Neena Imam, Joshua Lothian (MELT)
ORNL is managed by UT-Battelle for the US Department of Energy

Outline
Starting/stopping a Lustre file system
Mounting/unmounting clients
Quotas and usage reports
Purging
Survey of monitoring tools

Starting a Lustre file system
The file system should be mounted in the following order (for a normal bringup):
MGT (Lustre will also mount the MDT automatically if the file system has a combined MGT/MDT)
All OSTs
All MDTs
Run any server-side tunings
After this, the file system is up and clients can begin mounting.
Mount clients and run any client-side tunings.
The mount commands share a similar syntax: mount -t lustre $DEVICE $MOUNT_POINT
There is no need to start a service or perform a modprobe.
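As a minimal bringup sketch of the order above: host names, device paths, and mount points below are hypothetical and must be adapted to the site. It assumes passwordless ssh from an admin node and that the mount-point directories already exist.

  #!/bin/bash
  set -e

  # 1. MGT (or combined MGT/MDT) first
  ssh testfs-mds1 'mount -t lustre /dev/mapper/testfs-mgt /mnt/mgt'

  # 2. All OSTs (one mount per OST device on each OSS)
  for oss in testfs-oss1 testfs-oss2 testfs-oss3; do
      ssh "$oss" 'for d in /dev/mapper/testfs-l*; do
                      mount -t lustre "$d" "/mnt/$(basename "$d")"
                  done'
  done

  # 3. All MDTs
  ssh testfs-mds1 'mount -t lustre /dev/mapper/testfs-lun0 /mnt/mdt'

  # 4. Server-side tunings (lctl set_param ...) would go here, then clients mount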

Mount by label / path
Information about a target is encoded into its label:
[root@testfs-oss3 ~]# dumpe2fs -h /dev/mapper/testfs-l28 | grep "^Filesystem volume name"
Filesystem volume name:   testfs-OST0002
These labels also appear under /dev/disk/by-label/
If not using multipathing, this label can be used to mount by label:
testfs-mds1# mount -t lustre -L testfs-MDT0000 /mnt/mdt
Also avoid mounting by label when using snapshots.
If using multipathing, instead use the entry in /dev/mapper/. This can be set up in the multipath bindings file to provide a meaningful name:
testfs-mds1# mount -t lustre /dev/mapper/testfs-lun0 /mnt/mdt
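For illustration, a sketch of what the multipath bindings mentioned above might look like. The placeholders must be replaced with the WWIDs reported by multipath -ll on your own system; the alias names simply mirror the /dev/mapper names used in the examples.

  # /etc/multipath/bindings (sketch only)
  # alias        WWID
  testfs-lun0    <WWID_of_MDT_LUN>
  testfs-l28     <WWID_of_OST_LUN>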

Mounting Strategies
These mounts can be stored in /etc/fstab.
Include the noauto option so the file system will not automatically be mounted at boot.
Include the _netdev option so the file system will not be mounted before the network layer has started.
These targets could then be mounted using: mount -t lustre -a
This process lends itself to automation; see the fstab sketch below.
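A sketch of such fstab entries on an OSS, carrying over the hypothetical /dev/mapper names from the earlier examples (adjust devices and mount points for your own targets):

  # /etc/fstab on an OSS (sketch)
  /dev/mapper/testfs-l28   /mnt/testfs-l28   lustre   noauto,_netdev   0 0
  /dev/mapper/testfs-l29   /mnt/testfs-l29   lustre   noauto,_netdev   0 0

Targets listed this way can also be mounted individually by mount point (for example, mount /mnt/testfs-l28), which makes scripted bringup straightforward.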

Client Mounting
To mount the file system on a client, run the following command:
mount -t lustre MGS_node:/fsname /mount_point
e.g., mount -t lustre 10.0.0.10@o2ib:/testfs /mnt/test_filesystem
As seen above, the mount point does not have to map to the file system name.
After the client is mounted, run any tunings.
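Client mounts can also live in fstab. A sketch using the MGS NID and mount point from the example above (the option list is illustrative; sites often add options such as flock):

  # /etc/fstab on a client (sketch)
  10.0.0.10@o2ib:/testfs   /mnt/test_filesystem   lustre   defaults,_netdev   0 0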

Stopping a Lustre file system
Shutting down a Lustre file system involves reversing the previous procedure. Unmounting all block devices on a host stops the Lustre software.
First, unmount the clients. On each client, run:
umount -a -t lustre   # unmounts all Lustre file systems
umount /mount/point   # unmounts a specific file system
Then, unmount all MDT(s). On the MDS, run:
umount /mdt/mount_point (e.g., /mnt/mdt from the previous example)
Finally, unmount all OST(s). On each OSS, run:
umount -a -t lustre

Quotas
For persistent storage, Lustre supports user and group quotas. Quota support includes soft and hard limits.
As of Lustre 2.4, usage accounting information is always available, even when quotas are not enforced.
The Quota Master Target (QMT) runs on the same node as MDT0 and allocates/releases quota space.
Because of how quota space is managed, and because the smallest allocatable chunk is 1MB (for OSTs) or 1024 inodes (for MDTs), a quota exceeded error can be returned even when the OSTs/MDTs still have free space/inodes.
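For illustration, a hedged sketch of setting limits and enabling enforcement on Lustre 2.4+; the user name, limits, and file system name are examples only. Block limits for lfs setquota are in kilobytes unless a suffix is given.

  # Set soft/hard block and inode limits for one user (run on any client)
  lfs setquota -u myuser -b 10485760 -B 11534336 -i 100000 -I 110000 /lustre/mntpoint

  # Enable quota enforcement for users and groups (run on the MGS)
  lctl conf_param testfs.quota.ost=ug
  lctl conf_param testfs.quota.mdt=ug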

Usage Reports
As previously mentioned, accounting information is always available (unless explicitly disabled). This information can provide a quick overview of user/group usage.
Non-root users can only view the usage for their own user and group(s):
# lfs quota -u myuser /lustre/mntpoint
For more detailed usage, the file system monitoring software Robinhood provides a database that can be directly queried for metadata information.
Robinhood also includes special du and find commands that use this database.
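As a sketch of the Robinhood clones mentioned above (these assume Robinhood is configured and its database is populated for the file system; exact options vary by Robinhood version, so treat these as illustrative):

  rbh-du -u myuser /lustre/mntpoint              # per-user usage from the DB, no file system scan
  rbh-find /lustre/mntpoint -user myuser -size +1G
  rbh-report --top-users                         # summary report straight from the DB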

Purging A common use case for Lustre is as a scratch file system, where files are not intended for long term storage. In this case, purging older files makes sense. Policies will vary per site, but for example, a site may want to remove files that have not been accessed nor modified in the past 30 days.

Purging Tools
An administrator could use a variety of methods to purge data.
The simplest approach uses find (or lfs find) to list files older than x days, then removes them. For example:
lfs find /lustre/mountpoint -mtime +30 -type f
This finds files with a modification time stamp older than 30 days.
A more advanced technique is to use software like Lester to read data directly from an MDT: https://github.com/ornl-techint/lester
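A minimal purge sketch built on the lfs find example above, following the 30-day policy from the previous slide (paths and policy are illustrative; replace rm with echo to dry-run before deleting anything):

  #!/bin/bash
  MOUNT=/lustre/mountpoint
  DAYS=30

  # Only remove regular files that are old by both modification and access time
  lfs find "$MOUNT" -type f -mtime +"$DAYS" -atime +"$DAYS" |
  while IFS= read -r file; do
      rm -f -- "$file"       # or: echo "$file" >> /var/log/purged_files.log
  done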

Handling Full OSTs
One of the most common issues with a Lustre file system is an OST that is close to full, or is full.
To view OST usage, run the lfs df command. An example of finding the highest-usage OSTs:
[root@mgmt ~]# lfs df /lustre/testfs | sort -rnk5 | head -n 5
testfs-OST00dd_UUID 15015657888 12073507580 2183504616 85% /lustre/testfs[ost:221]
Once the index of the OST is found, running lfs quota with the -I argument will report each user's usage on that OST:
for user in $(users); do lfs quota -u $user -I 221 /lustre/testfs; done
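It can also help to list the files that actually live on the full OST. A hedged example using lfs find (the OST name and size threshold are illustrative):

  # Locate large files striped onto the nearly full OST so the owner can be contacted
  lfs find /lustre/testfs --ost testfs-OST00dd_UUID -type f --size +100G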

Handling Full OSTs (cont.)
Typically, an OST imbalance that results in a filled OST is due to a single user with improperly striped files.
The user can be contacted and asked to remove or restripe the files, or the files can be removed by an administrator in order to regain use of the OST.
It is often useful to check for running processes (tar commands, etc.) that might be creating these files.
When trying to locate the file causing the issue, it's often useful to look at recently modified files.

Monitoring
A few metrics are particularly important to monitor on Lustre servers, including high load and memory usage.

Monitoring General Software: Nagios
Nagios is a general-purpose monitoring solution. A system can be set up with host and service checks.
There is native support for host-down checks and various service checks, including file system utilization.
Nagios is highly extensible, allowing for custom checks. This could include checking the contents of the /proc/fs/lustre/health_check file (see the sketch below).
It's an industry standard and has proven to scale to hundreds of checks.
Open source (GPL)
Supports paging on alerts and reports. Includes a multi-user web interface.
https://www.nagios.org
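A sketch of what such a custom plugin might look like, assuming health_check contains the word "healthy" when all targets on the node are fine. Exit codes follow the Nagios convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

  #!/bin/bash
  HEALTH_FILE=/proc/fs/lustre/health_check

  if [ ! -r "$HEALTH_FILE" ]; then
      echo "UNKNOWN: $HEALTH_FILE not present (Lustre modules not loaded?)"
      exit 3
  fi

  status=$(cat "$HEALTH_FILE")
  if [ "$status" = "healthy" ]; then
      echo "OK: Lustre reports healthy"
      exit 0
  else
      echo "CRITICAL: Lustre reports: $status"
      exit 2
  fi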

Monitoring General Software: Ganglia
Gathers system metrics (load, memory, disk utilization, ...) and stores the values in RRD files.
Benefits to RRD (fixed size) vs. downsides (data rolloff)
Provides a web interface for these metrics over time (past 2hr, 4hr, day, week, ...)
Ability to group hosts together
In combination with collectl, can provide usage metrics for InfiniBand traffic and Lustre metrics
http://ganglia.sourceforge.net/
http://collectl.sourceforge.net/tutorial-lustre.html

Monitoring General Software: Splunk (Operational Intelligence)
Aggregates machine data, logs, and other user-defined sources.
From this data, users can run queries. These queries can be scheduled or turned into reports, alerts, or dashboards for Splunk's web interface.
Tiered licensing based on indexed data volume, including a free version.
Useful for generating alerts on Lustre bugs within syslog.
There are open source alternatives such as the ELK stack.
https://www.splunk.com/

Monitoring General Software: Robinhood Policy Engine
File system management software that keeps a copy of metadata in a database.
Provides find and du clones that query this database to return information faster.
Designed to support millions of files and petabytes of data.
Policy-based purging support
Customizable alerts
Additional functionality added for Lustre file systems
https://github.com/cea-hpc/robinhood/wiki

Monitoring Lustre tools
Lustre provides low-level information about the state of the file system. This information lives under /proc/.
For example, to check if any OSTs on an OSS are degraded, check the contents of the files located at /proc/fs/lustre/obdfilter/*/degraded
Another example would be to check if checksums are enabled. On a client, run: cat /proc/fs/lustre/osc/*/checksums
More details can be found in the Lustre manual.
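A small sketch that turns the degraded check above into something an OSS cron job or monitoring agent could run (a value of 1 in the degraded file indicates the backing storage is degraded, e.g. a RAID rebuild is in progress):

  #!/bin/bash
  for f in /proc/fs/lustre/obdfilter/*/degraded; do
      [ -e "$f" ] || continue            # no OSTs mounted on this node
      if [ "$(cat "$f")" -ne 0 ]; then
          echo "WARNING: $(basename "$(dirname "$f")") is degraded"
      fi
  done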

Monitoring Lustre tools
Lustre also provides a set of tools.
The lctl {get,set}_param functions display or set the contents of files under /proc:
lctl get_param osc.*.checksums
lctl set_param osc.*.checksums=0
This command allows for fuzzy matches.
The lfs command can check the health of the servers within the file system:
[root@mgmt ~]# lfs check servers
testfs-MDT0000-mdc-ffff880e0088d000: active
testfs-OST0000-osc-ffff880e0088d000: active
The lfs command has several other possible parameters.

Monitoring Lustre tools
The llstat and llobdstat commands provide a watch-like interface for the various stats files:
llobdstat: /proc/fs/lustre/obdfilter/<ostname>/stats
llstat: /proc/fs/lustre/mds/mds/mdt/stats, etc. Appropriate files are listed in the llstat man page.
Example:
[root@sultan-mds1 lustre]# llstat -i 2 lwp/sultan-MDT0000-lwp-MDT0000/stats
/usr/bin/llstat: STATS on 06/08/15 lwp/sultan-MDT0000-lwp-MDT0000/stats on 10.37.248.68@o2ib1
snapshot_time      req_waittime  req_active  mds_connect  obd_ping
1433768403.74762   1520          1520        2            1518
lwp/sultan-MDT0000-lwp-MDT0000/stats @ 1433768405.75033
Name          Cur.Count  Cur.Rate  #Events  Unit    last  min  avg      max    stddev
req_waittime  0          0         1520     [usec]  0     46   144.53   14808  380.72
req_active    0          0         1520     [reqs]  0     1    1.00     1      0.00
mds_connect   0          0         2        [usec]  0     76   7442.00  14808  10417.10
obd_ping      0          0         1518     [usec]  0     46   134.91   426    57.51
^C

Monitoring Lustre-Specific Software: LMT
The Lustre Monitoring Tool (LMT) is a Python-based, distributed system that provides a top-like display of activity on server-side nodes.
LMT uses Cerebro (software similar to Ganglia) to pull statistics from the /proc/ file system into a MySQL database.
lltop / xltop
A former TACC staff member, John Hammond, created several monitoring tools for Lustre file systems:
lltop - Lustre load monitor with batch scheduler integration
xltop - continuous Lustre load monitor

Monitoring Lustre-Specific Software
Sandia National Laboratories created OVIS to monitor and analyze how applications use resources.
This software aims to monitor more than just the Lustre layer of an application.
https://cug.org/proceedings/cug2014_proceedings/includes/files/pap156.pdf
OVIS must run client-side, whereas most of the other monitoring tools presented here run on the Lustre servers.
Michael Brim and Joshua Lothian from ORNL created the Monitoring Extreme-scale Lustre Toolkit (MELT).
This software uses a tree-based infrastructure to scale out.
It aims to be less resource intensive than solutions like collectl.
http://lustre.ornl.gov/ecosystem/documents/lustreeco2015-brim.pdf

Summary
How to start and stop a Lustre file system
Steps to automate these procedures
Software, both general and specialized, to monitor a file system

Acknowledgements This work was supported by the United States Department of Defense (DoD) and used resources of the DoD-HPC at Oak Ridge National Laboratory.