Lustre tools for ldiskfs investigation and lightweight I/O statistics



Lustre tools for ldiskfs investigation and lightweight I/O statistics
Roland Laifer
STEINBUCH CENTRE FOR COMPUTING - SCC
LAD 15

KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu

Overview
- Lustre systems at KIT
- Short preview of our next Lustre system
- Lessons learned from wrong quota investigation
- Developed tools for ldiskfs investigation
- How to easily provide I/O statistics to users
- Developed tools for lightweight Lustre jobstats usage

Lustre systems at KIT - diagram

[Diagram: new systems at Campus North (ForHLR II cluster and visualization nodes with file systems home, work 1 and work 2) and existing systems at Campus South (clusters 1-4 with Lustre file systems 1-7), connected over a 320 Gbit/s, 35 km InfiniBand link]
- One InfiniBand fabric
- Easy, flexible mount on all clusters
- Same Lustre home directory on all clusters

Lustre systems at KIT - details

System name           | pfs2                     | pfs3                                   | pfs4 (Dec 15)
----------------------|--------------------------|----------------------------------------|---------------------------------------
Users                 | universities, 4 clusters | universities, tier 2 cluster (phase 1) | universities, tier 2 cluster (phase 2)
Lustre server version | DDN Lustre 2.4.3         | DDN based on IEEL 2.2                  | DDN based on IEEL 2.x
# of clients          | 1941                     | 540                                    | 1200
# of servers          | 21                       | 17                                     | 23
# of file systems     | 4                        | 3                                      | 3
# of OSTs             | 2*20, 2*40               | 1*20, 2*40                             | 1*14, 1*28, 1*70
Capacity (TB)         | 2*427, 2*853             | 1*427, 2*853                           | 1*610, 1*1220, 1*3050
Throughput (GB/s)     | 2*8, 2*16                | 1*8, 2*16                              | 1*10, 1*20, 1*50
Storage hardware      | DDN SFA12K               | DDN SFA12K                             | DDN ES7K
# of enclosures       | 20                       | 20                                     | 16
# of disks            | 1200                     | 1000                                   | 1120

Wrong quota investigation - general

How we recognized that quotas are wrong:
1. Difference between "du -hs <user dir>" and "lfs quota -u <user> <filesys>"
2. Perl script sums all user and group quotas per OST
   - Used /proc/fs/lustre/osd-ldiskfs/<ost>/quota_slave/acct_user and acct_group
   - Sums should be very similar, but showed a few per cent deviation
3. Perl script walks through the file system, sums capacities and compares them with the quotas
   - User / group capacity quotas were up to 30 % wrong

Support pointed to LU-4345 (http://review.whamcloud.com/11435):
- UID / GID of an OST object could be set to a random value on ldiskfs
- Capacity quotas are computed from the ldiskfs quotas on the OSTs
- Bug fixed with Lustre 2.5.3, but wrong UID / GID values do not disappear
- Wrong UID / GID of OST objects possibly fixed by the LFSCK of Lustre 2.6
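
The per-OST summing of step 2 can be sketched in shell (the production tool is a Perl script). This assumes the acct_user files use the usual YAML-style layout with "- id: <uid>" headers and "usage: { inodes: N, kbytes: N }" lines; the file system name is illustrative.

    # Sum capacity accounting per UID across all OSTs of one file system,
    # for comparison with "lfs quota -u <uid>" output.
    cat /proc/fs/lustre/osd-ldiskfs/pfs2wor2-OST*/quota_slave/acct_user |
    awk '
        $1 == "-" && $2 == "id:" { id = $3 }      # start of one UID entry
        {
            for (i = 1; i <= NF; i++)             # find the kbytes counter
                if ($i == "kbytes:") {
                    v = $(i + 1); gsub(/[^0-9]/, "", v)
                    sum[id] += v
                }
        }
        END { for (u in sum) printf "uid %-8s %15d kbytes\n", u, sum[u] }
    '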

Wrong quota investigation - basics

Get parent FID and UID/GID of an OST object: debugfs stat
    [root@ost0]# statcmd="stat /O/0/d$((71666856 % 32))/71666856"
    [root@ost0]# debugfs -c -R "$statcmd" /dev/mapper/ost_pfs2wor2_0
    User: 8972 Group: 12345 Size: 5
    fid: parent=[0x200018a62:0x8a4d:0x0] stripe=0

Get FID for file name: lfs path2fid
    [user@client]$ lfs path2fid myfile
    [0x200018a62:0x8a4d:0x0]

Get file name for FID: lfs fid2path
    [root@client]# lfs fid2path pfs2wor2 [0x200018a62:0x8a4d:0x0]
    <path_to_myfile>/myfile

Get OST name for OST index: lctl get_param
    [user@client]$ lctl get_param lov.pfs2wor2-*.target_obd
    0: pfs2wor2-OST0000_UUID ACTIVE

Get object IDs for file name: lfs getstripe
    [user@client]$ lfs getstripe myfile
    lmm_stripe_count: 2
        obdidx     objid        objid      group
             0  71666856    0x4458ca8         0
             2  72574780    0x453673c         0

Get UID/GID for file name: stat
    [user@client]$ stat --format="%u %g" myfile
    8972 12345
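
Tying these commands together, a small client-side sketch that maps a file name to the debugfs commands needed to inspect each of its OST objects; the OST device path is site specific and has to be filled in on the matching OSS.

    #!/bin/bash
    # Print the file's FID and, per stripe, the debugfs command that shows
    # the object's UID/GID and parent FID on the server.
    file=$1
    echo "FID: $(lfs path2fid "$file")"
    lfs getstripe "$file" | awk '$1 ~ /^[0-9]+$/ && NF >= 3 {
        printf "OST index %s: debugfs -c -R \"stat /O/0/d%d/%d\" <OST device>\n",
               $1, $2 % 32, $2
    }'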

Wrong quota investigation - details (1)

Motivation: check whether the bug of LU-4345 caused all quota problems.
Idea: during production, use debugfs to stat all OST objects and compare UID / GID with the values on the file system (MDS).

Get the biggest object ID:
    debugfs -c -R "dump /O/0/LAST_ID /tmp/last_id" <OST device>
    od -Ax -td8 /tmp/last_id

Show object status on ldiskfs:
    debugfs -c -R "stat /O/0/d<object ID modulo 32>/<object ID>" <OST device>

Problem: how to do this fast enough?
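
For completeness, a short sketch of decoding the dumped LAST_ID file (an 8-byte counter stored little-endian, so od -td8 decodes it on x86); the device path is a placeholder.

    # Dump LAST_ID from the OST device and decode the biggest object ID.
    debugfs -c -R "dump /O/0/LAST_ID /tmp/last_id" /dev/mapper/ost_pfs2wor2_0
    last_id=$(od -An -td8 /tmp/last_id | awk 'NR == 1 { print $1 }')
    echo "biggest object ID: $last_id"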

Wrong quota investigation - details (2)

Solution:
- Perl script pipes many stat commands to the same debugfs call
  - Number of commands and when to restart debugfs are configurable
- Another Perl script filters the output
  - If an object belongs to the inspected user, it prints object ID and parent FID
  - OR: it checks whether UID and GID are inside a valid range

Results:
- Investigated only one OST
  - Investigation is still time consuming, i.e. it can take days
- Indeed found lots of objects with wrong UID / GID
- Found a number of orphaned OST objects
  - Unknown reason why they still existed
- Used procedure is also helpful for other investigations
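
A minimal shell sketch of the batching idea (the original is a Perl script, which also restarts debugfs periodically): debugfs reads commands from stdin, so a single long-running instance can stat many objects without per-object fork overhead. Device path and ID range are placeholders.

    #!/bin/bash
    # Stat all objects of one OST through a single debugfs process and keep
    # only the owner and parent-FID lines for later filtering.
    dev=/dev/mapper/ost_pfs2wor2_0      # hypothetical OST device
    last_id=71666856                    # biggest object ID from /O/0/LAST_ID

    for ((id = 1; id <= last_id; id++)); do
        echo "stat /O/0/d$((id % 32))/$id"
    done | debugfs -c "$dev" 2>/dev/null |
    grep -E '(User:|fid:)'              # UID/GID and parent FID lines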

Lightweight I/O statistics - diagram

[Diagram: the user requests I/O statistics during job submission; the batch system starts the batch job on a client, and the client reports the job's I/O usage to the file servers via Lustre jobstats; after job completion the batch system creates a file to request I/O statistics; a file server collects the new request files, collects jobstats from all servers and, when requested, sends the I/O statistics of the job via email]

Lightweight I/O statistics - steps in detail (1)

1) Enable jobstats for all file systems on clients:
       lctl set_param jobid_var=SLURM_JOB_ID
   - Make sure clients have the fix of LU-5179
   - Slurm job IDs are used by Lustre to collect I/O stats
   - On servers, increase the time for holding jobstats, e.g. to 1 hour:
       lctl set_param *.*.job_cleanup_interval=3600
2) User requests I/O statistics with Moab msub options:
       -W lustrestats:<file system name>[,<file system name>]...
   - Optionally: -M <email address>
3) On job completion Moab creates files to request I/O stats
   - File name: lustrestat-<file system name>-<cluster name>-<job ID>
   - File content: account name and optionally email address
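
For reference, this is roughly what the collected data looks like on a server: jobstats are exported per target as YAML via lctl get_param (the job entry below is illustrative, not captured output).

    [root@oss]# lctl get_param obdfilter.*.job_stats
    obdfilter.pfs2wor2-OST0000.job_stats=
    job_stats:
    - job_id:          1141
      snapshot_time:   1442911200
      write:           { samples: 10, unit: bytes, min: 1048576, max: 1048576, sum: 10485760 }
      punch:           { samples: 1, unit: reqs }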

Lightweight I/O statistics - steps in detail (2)

4) Perl script runs hourly on each file system
   - Uses a different config file for each file system
     - Defines the names of the request files and of the batch system servers
       - Allows collecting request files from different clusters
     - Defines which servers are used for the file system
   - Transfers files from the batch systems and deletes the remote files
     - Uses rsync, with rrsync as restricted ssh command for login with a key
   - Reads the data including job IDs and account name
     - If not specified, asks the directory service for the email address of the account
   - Collects and summarizes jobstats from all servers
   - Sends an email for each job
     - Email is a good fit since jobstats are collected asynchronously
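
The per-job collection step can be sketched in shell (the production tool is a Perl script); the host names are placeholders for the servers of one file system.

    #!/bin/bash
    # Gather the jobstats entries of a single job from all servers.
    jobid=$1
    for host in oss1 oss2 mds1; do           # hypothetical server list
        ssh "$host" 'lctl get_param -n "*.*.job_stats"' 2>/dev/null
    done | awk -v id="$jobid" '
        /^- *job_id:/ { keep = ($NF == id) } # entry header: match our job
        keep && /samples:/ { print }         # its per-operation counters
    '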

Lightweight I/O statistics - example email

    Subject: Lustre stats of your job 1141 on cluster xyz

    Hello,

    this is the Lustre IO statistics as requested by user john_doe
    on cluster xyz for file system home.

    Job 1141 has done ...
    ... 1 open operations
    ... 1 close operations
    ... 1 punch operations
    ... 1 setattr operations
    ... 10 write operations and sum of 10,485,760 byte writes
        (min IO size: 1048576, max IO size: 1048576)

Lightweight I/O statistics - experiences

- Users do not care much about their I/O usage
  - Tool has not yet been used frequently
- No negative impact of jobstats activation
  - Has been running for 6 weeks
- Another Perl script checks for high I/O usage per job
  - Collects and summarizes jobstats from all servers
  - Reports job IDs above a high water mark for read/write or metadata operations
  - Extremely useful to identify bad file system usage
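
A minimal sketch of such a high water mark check for one server (again, the original is a Perl script); the threshold and the restriction to written bytes are illustrative.

    #!/bin/bash
    # Report job IDs whose total written bytes on this server's OSTs
    # exceed a high water mark.
    hwm=$((100 * 1024 ** 3))    # 100 GiB, an arbitrary example threshold
    lctl get_param -n obdfilter.*.job_stats | awk -v hwm="$hwm" '
        /^- *job_id:/ { id = $NF }
        /^ *write:/ {
            for (i = 1; i <= NF; i++)
                if ($i == "sum:") { v = $(i + 1); gsub(/[^0-9]/, "", v); sum[id] += v }
        }
        END {
            for (j in sum)
                if (sum[j] > hwm) printf "job %s wrote %d bytes\n", j, sum[j]
        }
    '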

Summary

- Currently our main Lustre problems are related to quotas
  - Tools helped to analyze them on the ldiskfs level
  - New LFSCK features will hopefully fix wrong quotas
  - Quotas on a directory tree would be very helpful
- Lustre jobstats are extremely useful
  - Not available with other file systems
  - It's incredible what users are doing

All my talks about Lustre: http://www.scc.kit.edu/produkte/lustre.php
roland.laifer@kit.edu