LUSTRE USAGE MONITORING
What the &#%@ are users doing with my filesystem?

Kilian CAVALOTTI, Thomas LEIBOVICI - CEA/DAM
LAD'13, September 16-17, 2013
MOTIVATION

Lustre monitoring is hard:
- very few tools are available
- a distributed filesystem means distributed stats

Administrators need stats:
- to check that everything runs smoothly
- to analyze problems
- to watch users: users are (sometimes unintentionally) malicious, they need to be watched
- because stats are cool, and graphs are even cooler

At CEA/DAM, we needed tools:
- to monitor filesystem activity (performance, trends...)
- to understand filesystem usage (typology, distribution, working sets...)
LUSTRE MONITORING: WHAT WE HAVE
LUSTRE MONITORING: WHAT WE WANT
OUTLINE

Monitoring filesystem activity:
- tools
- visualizing real-time activity: bandwidth, metadata operations, space usage
- diagnosing slowdowns and unexpected behavior, suggesting improvements for user applications causing slowdowns

Understanding how people use their files:
- showing the filesystem temperature by visualizing working sets and their evolution

Answering typical questions:
- "I don't get the expected GB/sec..."
- "My ls takes too long!"
- "Can you give me the current usage for each user working on this project?"
- "What kind of data is in the filesystem?"

Disclaimer: we use Lustre 2.1 (YMMV...)
MONITORING FILESYSTEM ACTIVITY
TOOLBOX

The Lustre admin tool belt:
- ltop (github.com/chaos/lmt): top-like command providing instant stats about a filesystem
- collectl (collectl.sourceforge.net): sar-like tool for the monitoring of Infiniband traffic, Lustre activity, etc.
- htop, dstat and friends

Open-source tools developed @CEA:
- shine (lustre-shine.sourceforge.net): command-line tool for Lustre filesystem configuration and management
- Robinhood policy engine (github.com/cea-hpc/robinhood): multi-purpose tool for large filesystem management

Custom scripts based on all of the above:
- glue provided by Bash and Python
- we parse /proc/fs/lustre a lot
- nice colors courtesy of RRDtool
ROBINHOOD POLICY ENGINE

Swiss-army knife for your filesystems:
- audit, accounting, alerts, migration, purge, based on policies
- massively parallel
- stores filesystem metadata in a (My)SQL database, allowing lightning-fast usage reports
- provides rbh-du and rbh-find, replacements for du and find
- advanced capabilities for Lustre filesystems:
  - purge on OST usage, list files per OST
  - pool and/or OST based policies
  - changelogs reader (Lustre v2): no scan required
- reporting tools (CLI), web interface (GUI)
- open-source license: http://robinhood.sourceforge.net

@CEA: one instance per Lustre filesystem, using changelogs.
FILESYSTEM ACTIVITY OVERVIEW

Our dashboard: a column view, one column per filesystem (fs1, fs2, fs3), allows event correlation. For each filesystem it shows bandwidth, volume, inodes, MDS RPCs, number of clients, and Robinhood info. A sketch of how one of these graphs is rendered with RRDtool follows.
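The graphs themselves are drawn with RRDtool. A minimal rendering sketch for one dashboard cell, under the assumption that an RRD file fs1_bw.rrd with "read" and "write" data sources is kept up to date by collection scripts like those sketched on the next slides; the output path, title, colors and size are illustrative, not the actual CEA setup:

    #!/usr/bin/env python
    # Hedged sketch: render one dashboard graph (24 hours of fs1 bandwidth)
    # with "rrdtool graph". The RRD file and its "read"/"write" data sources
    # are assumptions.
    import subprocess

    subprocess.call([
        'rrdtool', 'graph', '/var/www/dashboard/fs1_bw.png',
        '--start', '-86400',
        '--title', 'fs1 bandwidth (bytes/s)',
        '--width', '400', '--height', '120',
        'DEF:read=fs1_bw.rrd:read:AVERAGE',
        'DEF:write=fs1_bw.rrd:write:AVERAGE',
        'AREA:write#0000ff:write',
        'LINE1:read#00cc00:read',
    ])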
FILESYSTEM ACTIVITY: BANDWIDTH

Bandwidth: read and write, aggregated on each OST, for each filesystem, from /proc/fs/lustre/obdfilter/${fsname}-ost*/stats (a collection sketch follows the dump below):

    snapshot_time             1377780180.134516 secs.usecs
    read_bytes                178268 samples [bytes] 0 1048576 185094526864
    write_bytes               220770 samples [bytes] 115 1048576 230859704715
    get_page                  399038 samples [usec] 0 6976 69863795 15160124841
    cache_access              45189234 samples [pages] 1 1 45189234
    cache_miss                45189234 samples [pages] 1 1 45189234
    get_info                  5250 samples [reqs]
    set_info_async            25 samples [reqs]
    process_config            1 samples [reqs]
    connect                   6074 samples [reqs]
    disconnect                12 samples [reqs]
    statfs                    181816 samples [reqs]
    create                    30 samples [reqs]
    destroy                   520 samples [reqs]
    setattr                   76 samples [reqs]
    punch                     19 samples [reqs]
    sync                      1092 samples [reqs]
    preprw                    399038 samples [reqs]
    commitrw                  399038 samples [reqs]
    llog_init                 2 samples [reqs]
    quotacheck                1 samples [reqs]
    quotactl                  1060 samples [reqs]
    ping                      10349451 samples [reqs]
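The read_bytes and write_bytes counters above are cumulative, so the bandwidth is the difference between two snapshots divided by the sampling interval. A minimal collection sketch in Python, to be run on each OSS; the filesystem name, 60-second interval and RRD file are illustrative, and fs1_bw.rrd is assumed to have been created beforehand with "read" and "write" data sources:

    #!/usr/bin/env python
    # Hedged sketch: aggregate the cumulative read/write byte counters of all
    # local OSTs of one filesystem, derive bytes/s over a sampling interval,
    # and push the rates into an RRD for graphing.
    import glob
    import subprocess
    import time

    def total_bytes(fsname):
        """Sum the cumulative read/write byte counters over all local OSTs."""
        read = write = 0
        # OST device names are usually uppercase (e.g. fs1-OST0000); adjust
        # the glob to whatever appears under /proc on your OSS nodes
        for stats in glob.glob('/proc/fs/lustre/obdfilter/%s-OST*/stats' % fsname):
            for line in open(stats):
                fields = line.split()
                # format: "<op> <count> samples [bytes] <min> <max> <sum>"
                if fields and fields[0] == 'read_bytes':
                    read += int(fields[-1])
                elif fields and fields[0] == 'write_bytes':
                    write += int(fields[-1])
        return read, write

    if __name__ == '__main__':
        interval = 60                      # seconds between the two snapshots
        r0, w0 = total_bytes('fs1')
        time.sleep(interval)
        r1, w1 = total_bytes('fs1')
        read_bps = (r1 - r0) / float(interval)
        write_bps = (w1 - w0) / float(interval)
        # "N" means "now"; the RRD must have matching "read"/"write" data sources
        subprocess.call(['rrdtool', 'update', 'fs1_bw.rrd',
                         'N:%.0f:%.0f' % (read_bps, write_bps)])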
FILESYSTEM ACTIVITY: METADATA

Metadata operations (MDS RPCs), from /proc/fs/lustre/mdt/${fsname}-mdt0000/mdt*/stats.

Most commonly used counters:
- obd_ping: filesystem health
- mds_(dis)connect: client (u)mounts
- mds_get(x)attr: metadata operations
- mds_statfs: (lfs) df
- mds_readpage: readdir()
- ldlm_ibits_enqueue / mds_close: ~open()/close()

Two filesystems with the same clients (~6000), but different usages:
- long-term storage filesystem: large files, few metadata operations
- scratch-like filesystem: lots of open()

A sketch of the corresponding collection follows.
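The MDT counters share the cumulative "<op> <count> samples [reqs]" format, so the per-operation RPC rates behind the graphs can be derived the same way and pushed into an RRD. A hedged sketch; the filesystem name, interval and operation list are illustrative, and mds_rpc.rrd is assumed to exist with one data source per operation, in the same order as OPS:

    #!/usr/bin/env python
    # Hedged sketch: turn the cumulative MDT RPC counters into per-second
    # rates and push them into an RRD.
    import glob
    import subprocess
    import time

    OPS = ['obd_ping', 'ldlm_ibits_enqueue', 'mds_close',
           'mds_getattr', 'mds_getxattr', 'mds_readpage', 'mds_statfs']

    def rpc_counts(fsname):
        """Read the cumulative count of each operation in OPS on the local MDT."""
        counts = dict.fromkeys(OPS, 0)
        for stats in glob.glob('/proc/fs/lustre/mdt/%s-MDT0000/mdt*/stats' % fsname):
            for line in open(stats):
                fields = line.split()
                # format: "<op> <count> samples [reqs]"
                if fields and fields[0] in counts:
                    counts[fields[0]] += int(fields[1])
        return counts

    if __name__ == '__main__':
        interval = 60
        before = rpc_counts('fs1')
        time.sleep(interval)
        after = rpc_counts('fs1')
        rates = ['%.2f' % ((after[op] - before[op]) / float(interval)) for op in OPS]
        # one value per data source, in the same order as OPS ("N" means "now")
        subprocess.call(['rrdtool', 'update', 'mds_rpc.rrd', 'N:' + ':'.join(rates)])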
DIAGNOSING SLOWDOWNS (1/2)

From graph to clients:

- Peak detection on the RPC graph:
  - initially a manual process: watching the graph
  - now automatic: averaging the last 10 values in the RRD database and comparing them to a threshold (a sketch follows this slide)

- Collection of RPC samples:
  - initially a manual process: enabling debug traces, collecting, disabling debug traces
  - now automatic: background collection for 5 seconds every 5 minutes, keeping 7 days of traces

        echo +rpctrace > /proc/sys/lnet/debug
        echo +rpc > /proc/sys/lnet/subsystem_debug

- Analysis of the collected samples: RPC type breakdown by Lustre client. Example of an mds_getxattr peak with 3 nodes involved:

    GENERAL
      Source nodes: node114 (1)
      Duration: 5.0 sec
      Total RPC: 88105 (17618.5 RPC/sec)
      Ignored ping RPC: 102

    NODENAME       NET    COUNT  %      CUMUL  DETAILS
    10.100.10.137  o2ib6  18567  21.07  21.07  mds_getxattr: 18567
    10.100.10.138  o2ib6  17534  19.90  40.97  mds_getxattr: 17534
    10.100.10.140  o2ib6  14724  16.71  57.69  mds_getxattr: 14724
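A minimal sketch of that automation, assuming the RPC rates live in the mds_rpc.rrd file from the previous sketch; the threshold, dump path and 5-second capture are illustrative, and the debug flags are the ones shown above:

    #!/usr/bin/env python
    # Hedged sketch: average the last 10 points of the RPC-rate RRD and, when
    # a threshold is exceeded, enable rpctrace for 5 seconds and dump the
    # Lustre kernel debug log for later analysis.
    import subprocess
    import time

    THRESHOLD = 10000.0          # RPC/sec considered a peak (site-specific)

    def recent_average(rrd, points=10):
        """Average the total rate over the last `points` samples of the RRD."""
        out = subprocess.check_output(['rrdtool', 'fetch', rrd, 'AVERAGE',
                                       '--start', '-1h']).decode()
        totals = []
        for line in out.splitlines()[2:]:    # skip the DS-name header + blank line
            values = []
            for field in line.split()[1:]:   # field 0 is "<timestamp>:"
                try:
                    v = float(field)
                except ValueError:
                    continue
                if v == v:                   # skip NaN slots
                    values.append(v)
            if values:
                totals.append(sum(values))   # total rate (obd_ping could be excluded)
        totals = totals[-points:]
        return sum(totals) / len(totals) if totals else 0.0

    def collect_rpc_trace(dump_file, duration=5):
        """Enable RPC tracing for `duration` seconds, then dump the debug log."""
        open('/proc/sys/lnet/debug', 'w').write('+rpctrace\n')
        open('/proc/sys/lnet/subsystem_debug', 'w').write('+rpc\n')
        time.sleep(duration)
        subprocess.call(['lctl', 'dk', dump_file])   # dump and clear the log
        open('/proc/sys/lnet/debug', 'w').write('-rpctrace\n')

    if __name__ == '__main__':
        if recent_average('mds_rpc.rrd') > THRESHOLD:
            collect_rpc_trace('/var/tmp/rpctrace-%d.log' % int(time.time()))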
DIAGNOSING SLOWDOWNS (2/2)

From client identification to code improvement:

- Querying the cluster scheduler to get the jobs running on those nodes during the event:

    $ sacct -S startdate -E enddate -N nodename

- Contacting the user: for live events, it is possible to strace the process at fault. We have seen such things as:

    # strace -e stat -p 95095 -r
    Process 95095 attached - interrupt to quit
    0.000000 stat("/path/to/case/workdir/dmp", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000398 stat("/path/to/case/workdir/bf", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000246 stat("/path/to/case/workdir/stop", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.038258 stat("/path/to/case/workdir/dmp", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000338 stat("/path/to/case/workdir/bf", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000275 stat("/path/to/case/workdir/stop", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.038236 stat("/path/to/case/workdir/dmp", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000320 stat("/path/to/case/workdir/bf", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)
    0.000234 stat("/path/to/case/workdir/stop", 0x7fff0c3ce630) = -1 ENOENT (No such file or directory)

- We kindly suggested better ways for the user to stop his run.
FILESYSTEM TEMPERATURE: USER WORKING SETS

Time-lapse of filesystem usage. The working set is the set of files recently written or read.

[Graphs: data production (modification time) vs. data in use (last access), showing cold data, read bursts, and the 1-month and 2-month working sets]

- 70% of the data was produced within the last 2 months
- 80% of the data was accessed less than 1 month ago
FILESYSTEM TEMPERATURE: DETAILS

Implementation:
- for each file, compute the {modification,access} age: now - last_mod / last_access
- aggregate count and volume by age range, based on the Robinhood database:
  - an optimized SQL query (a simplified sketch follows this slide)
  - could be slow for filesystems with hundreds of millions of inodes
- some RRDtool magic involved: the rrdtool graph command line is ~6500 characters long

Benefits:
- visualization of the current working set: age repartition, based on the last modification and last access dates
- allows anticipating filesystem evolution:
  - mostly recently written data: need to add more OSTs
  - mostly old, never-accessed files: could be purged
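A simplified sketch of that aggregation against the Robinhood MySQL database, using the MySQLdb module. The ENTRIES table and the last_mod/size column names are assumptions to check against the local Robinhood schema, the credentials are placeholders, and the production version uses a single optimized query rather than one query per age range:

    #!/usr/bin/env python
    # Hedged sketch: count files and sum volume per modification-age range,
    # straight from a Robinhood-like database. Swap last_mod for last_access
    # to get the "data in use" view.
    import time
    import MySQLdb

    AGE_RANGES = [('1 day', 86400), ('1 week', 7 * 86400),
                  ('1 month', 30 * 86400), ('2 months', 60 * 86400),
                  ('6 months', 180 * 86400), ('older', None)]

    def temperature(db):
        now = int(time.time())
        cur = db.cursor()
        previous = 0
        for label, age in AGE_RANGES:
            if age is None:
                cond = 'last_mod < %d' % (now - previous)
            else:
                cond = 'last_mod >= %d AND last_mod < %d' % (now - age, now - previous)
            cur.execute('SELECT COUNT(*), COALESCE(SUM(size), 0) '
                        'FROM ENTRIES WHERE ' + cond)
            count, volume = cur.fetchone()
            print('%-9s %12d files %18d bytes' % (label, int(count), int(volume)))
            previous = age

    if __name__ == '__main__':
        temperature(MySQLdb.connect(host='localhost', user='robinhood',
                                    passwd='secret', db='rbh_fs1'))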
FILESYSTEM TEMPERATURE: EXAMPLES

Visualization of different filesystem usage patterns:
- significant reads and writes (volume)
- cooling effect after a large read (volume)
- read-mostly filesystem (volume)
- Robinhood DB dump and initial scan (inodes): a nice linear scan, ~2 days for 50M inodes
ANSWERING TYPICAL QUESTIONS
a.k.a. THE ROBINHOOD COOKBOOK
ENFORCING RECOMMENDED FILESYSTEM USAGE

Question: "I don't get the expected GB/sec..."

Different filesystems, different usages:
- the user documentation indicates recommended file sizes for each filesystem, as well as purge policies and quotas
- often, the recommendations are not followed, leading to bad performance
- usually a case of small files: writing large blocks is more efficient, and metadata overhead decreases performance

Implemented solutions:
- filesystem quotas: a limited number of inodes motivates users to write bigger files
- user guidance: file size profiling and small file detection using Robinhood; information provided to users and, if necessary, help to take appropriate measures
USER GUIDANCE

File size profiling, available in the Robinhood web interface:
- file size repartition, global and per user
- per-user summary

User guidance:
- informational emails sent on specific criteria (e.g. average file size < 50 MB, or more than 80% of files < 32 MB), including good I/O practices and stats about the user's account (list of non-compliant directories)
- graduated response system:
  - the first two emails are warnings
  - new job submission is suspended after the 3rd warning
  - without any action or feedback from the user, data is automatically packed
INFORMATIONAL EMAILS: IMPLEMENTATION

Robinhood reports:

- File size profiling: get a list of users sorted by the ratio of their files in a given size range:

    $ rbh-report --top-users --by-szratio=1..31m
    rank, user, spc_used, count, avg_size, ratio(1..31m)
    1, john, 1.23 TB, 26763, 60.59 MB, 91.30%
    2, perrez, 80.77 GB, 27133, 114.71 MB, 87.64%
    3, matthiew, 2.22 TB, 25556, 1.04 GB, 85.76%
    4, vladimir, 62 GB, 14846, 74.18 MB, 78.50%
    5, gino, 30.92 GB, 15615, 77.34 MB, 77.02%

  Tip: with -l FULL, Robinhood logs its SQL requests, for reuse and customization.

- Small file detection: list directories with at least 500 entries, sorted by average file size (smallest first):

    $ rbh-report --topdirs --by-avgsize --reverse --count-min=500

A sketch of the e-mail automation built on top of these reports follows.
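A hedged sketch of that automation: it runs the rbh-report command shown above, parses its output (the column layout is taken from the sample), and e-mails users whose small-file ratio exceeds a threshold. The 80% threshold, mail addresses and SMTP relay are illustrative, and the real criteria also include the average file size:

    #!/usr/bin/env python
    # Hedged sketch: warn users with too high a ratio of files in the
    # 1..31 MB range, based on the rbh-report output format shown above.
    import subprocess
    import smtplib
    from email.mime.text import MIMEText

    THRESHOLD = 80.0             # % of files in the 1..31 MB range

    def offenders():
        out = subprocess.check_output(['rbh-report', '--top-users',
                                       '--by-szratio=1..31m']).decode()
        for line in out.splitlines():
            fields = [f.strip() for f in line.split(',')]
            # "rank, user, spc_used, count, avg_size, ratio(1..31m)"
            if len(fields) == 6 and fields[0].isdigit():
                ratio = float(fields[5].rstrip('%'))
                if ratio > THRESHOLD:
                    yield fields[1], ratio

    def warn(user, ratio):
        msg = MIMEText('%.1f%% of your files are smaller than 32 MB.\n'
                       'Please consider packing small files (see the I/O '
                       'best-practices guide).' % ratio)
        msg['Subject'] = 'Filesystem usage: too many small files'
        msg['From'] = 'lustre-admin@example.com'
        msg['To'] = '%s@example.com' % user
        relay = smtplib.SMTP('localhost')
        relay.sendmail(msg['From'], [msg['To']], msg.as_string())
        relay.quit()

    if __name__ == '__main__':
        for user, ratio in offenders():
            warn(user, ratio)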
LARGE DIRECTORY DETECTION

Question: "My ls takes too long!"

Issue:
- directories with too many entries (100,000s)
- the sequential processing of readdir() takes a long time

Implemented solutions:
- real-time directory alerts, defined in the Robinhood configuration (doesn't work with changelogs, only when scanning):

    Alert large_dir {
        type == directory
        and dircount > 10000
        and last_mod < 1d
    }

- report the directories with the most entries:

    $ rbh-report --topdirs
ACCOUNTING (1/2)

Question: "Can you give me the current usage for each user working on this project?"

Given that:
- some data is in a project directory,
- users have some other data from the same project in their own directories,
- project data can sometimes be identified by a group attribute, and be anywhere else...

[Figure: the randomly organized data of many projects and many users is spread across the project directory, the user directories, and anywhere else]

Issues:
- data belonging to a single project can be distributed over several locations
- the logical attachment to a project can reside in file attributes, not in the location
ACCOUNTING (2/2)

Implemented solution:

- Basic accounting using rbh-report:
  - space used in the whole filesystem, by user, split by group:

        $ rbh-report -u foo* -S
        user, group, type, count, spc_used, avg_size
        foo1, proj001, file, 422367, 71.01 GB, 335.54 KB
        Total: 498230 entries, 77918785024 bytes used (72.57 GB)

  - for a specific group:

        $ rbh-report -g proj001

- Per-directory accounting using rbh-du and its filtering capabilities:
  - by a given user in a given directory:

        $ rbh-du -u theuser /project_dir

  - by a given group in given directories:

        $ rbh-du -g proj001 /users/foo*

  - project data in other locations:

        $ rbh-find -g proj001

These building blocks can be chained per project, as in the sketch below.
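Since one project's data is spread over several locations, the commands above are typically chained. A minimal wrapper sketch; the group name, member list and directories are illustrative, the flags mirror the examples above, and the command output is printed as-is rather than parsed (its exact format depends on the Robinhood version):

    #!/usr/bin/env python
    # Hedged sketch: per-project accounting by chaining the rbh-* commands
    # above over the locations where project data may live.
    import subprocess

    PROJECT_GROUP = 'proj001'
    PROJECT_DIR = '/project_dir'
    MEMBERS = ['foo1', 'foo2']            # users working on the project

    def run(title, cmd):
        print('=== %s: %s' % (title, ' '.join(cmd)))
        subprocess.call(cmd)

    if __name__ == '__main__':
        # each member's data inside the project directory
        for user in MEMBERS:
            run('project directory', ['rbh-du', '-u', user, PROJECT_DIR])
        # project data kept in the members' own directories
        for user in MEMBERS:
            run('user directory', ['rbh-du', '-g', PROJECT_GROUP, '/users/' + user])
        # project data anywhere else, identified by the group attribute
        run('other locations', ['rbh-find', '-g', PROJECT_GROUP])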
CONTENT MONITORING

Question: "What kind of data is in my filesystem?"

Implemented solution: define Robinhood file classes, with arbitrary definitions based on file attributes (name, owner, group, size, extended attributes):

    Fileclass system_log_files {
        definition { name == "*.log" and (owner == "root" or group == "root") }
    }

    Fileclass big_pictures {
        definition { (name == "*.png" or name == "*.jpg") and (size > 10MB) }
    }

    Fileclass flagged {
        definition { xattr.user.flag == "expected value" }
    }

Then get a per-class summary:

    $ rbh-report --classinfo
    class, count, spc_used, volume, min_size, max_size, avg_size
    Documents, 128965, 227.35 TB, 250.69 TB, 8.00 KB, 2.00 TB, 30.03 GB
    System_log_files, 1536, 4.06 TB, 4.06 TB, 684, 200.01 GB, 2.71 GB
    Big_pictures, 621623, 637.99 TB, 638.02 TB, 3, 1.54 TB, 1.05 GB
WRAPPING UP
CONCLUSION

Visualizing filesystem activity is useful:
- we can diagnose abnormal patterns and improve suboptimal code
- we can better understand user behavior and help users get the best I/O performance out of their applications

Solutions exist:
- some assembly required, but all the required information is there:
  - in /proc/fs/lustre
  - in the filesystem metadata (the Robinhood database)
- you need Robinhood and some RRDtool-fu
THANK YOU! QUESTIONS?

Commissariat à l'énergie atomique et aux énergies alternatives
CEA / DAM Ile-de-France - Bruyères-le-Châtel - 91297 Arpajon Cedex
T. +33 (0)1 69 26 40 00
Direction des applications militaires
Département sciences de la simulation et de l'information
Service informatique scientifique et réseaux
Etablissement public à caractère industriel et commercial - RCS Paris B 775 685 019