Lustre performance monitoring and troubleshooting




Lustre performance monitoring and troubleshooting
March 2015, Patrick Fitzhenry and Ian Costello
2012 DataDirect Networks. All Rights Reserved.

Agenda
- EXAScaler (Lustre) Monitoring
  o NCI test kit hardware details
  o What is it? How does it work?
  o Demo
- Lustre troubleshooting
  o General points
  o 4 examples

Introduction
- Patrick Fitzhenry, Director, Technical Services & Support, South Asia & ANZ
- Ian Costello, Senior Application Support Engineer

Lustre Performance Monitoring

NCI test kit hardware details
- 20 x Fujitsu compute nodes: dual E5-2670 2.60 GHz processors, 32 GB RAM, single-rail FDR
- SFA12KX-40 with 400 x 3 TB NL-SAS
- 4 x OSS: dual E5-2670, 128 GB, CentOS 6.4
- Metadata: 12 x 600 GB 15K SAS
- 2 x MDS: dual E5-2670, 128 GB, CentOS 6.4

Lustre Monitoring Background
- DDN development project
- Uses information from Linux's /proc
- Goals:
  o Collect near real-time data (minimum every 1 s) and visualize it
  o All Lustre statistics are collectable
  o Support Lustre 1.8.x, 2.x and beyond
  o Application-aware monitoring (job stats)
  o Administrators can build custom graphs in the web browser
  o Configurable, intuitive dashboard
  o Scalable and lightweight, with no performance impact; very helpful for debugging and I/O analysis
- Lustre is a distributed, scalable filesystem; the monitoring/analysis tool must be aware of this.
- The Lustre monitoring tool helps in understanding current and past filesystem behavior and prevents performance slowdowns.

ExaScaler Monitoring
- File system, OST pool, OST/MDT stats, etc.
- Job ID, UID/GID, aggregation of application stats, etc.
- Archiving of data by policy
- Lightweight, near real-time, massive scale, customizable
- Architecture: collectd with the DDN monitoring plugin runs on the OSS/MDS servers and Lustre clients, and transfers small text messages over UDP (or TCP)/IP to a monitoring server running the Graphite plugin and Graphite itself.
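The "small text message" transfer mentioned above refers to one-metric-per-line text protocols. As a rough sketch, this is what such messages look like for Graphite's plaintext protocol and OpenTSDB's telnet-style input; the metric names, host names and port numbers here are illustrative, not taken from the slides:

# Graphite plaintext protocol: "<metric.path> <value> <unix-timestamp>"
echo "collectd.oss01.lustre.ost_stats_write 123456 $(date +%s)" | nc graphite-host 2003

# OpenTSDB telnet-style input: "put <metric> <timestamp> <value> <tag>=<value> ..."
echo "put ost_stats_write $(date +%s) 123456 host=oss01 fs=lustre" | nc tsd-host 4242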

OpenTSDB Architecture
The end-to-end OpenTSDB workflow: [architecture diagram]

A new Lustre plugin for collectd
- Uses collectd (http://collectd.org):
  o Running on many enterprise/HPC systems
  o Written in C for performance and portability
  o Includes optimizations and features to handle hundreds of thousands of data sets
  o Comes with over 90 plugins, ranging from standard cases to very specialized and advanced topics
  o Provides powerful networking features and is extensible in numerous ways
  o Actively developed, well supported and well documented
- The Lustre plugin extends collectd to collect Lustre statistics while inheriting these advantages.
- It is possible to port the Lustre plugin to a better framework if necessary.

XML definition of Lustre's /proc information
- Tree-structured descriptions of how to collect statistics from Lustre proc entries
- Modular
  o A hierarchical framework comprising a core logic layer (the Lustre plugin) and a statistics definition layer (XML files)
  o Extendable without updating any source code of the Lustre plugin
  o Easy to maintain the stability of the core logic
- Centralized
  o A single XML file for all definitions of Lustre data collection
  o No need to maintain massive, error-prone scripts
  o Easy to verify correctness
  o Easy to support multiple versions and to update for new versions of Lustre

XML definition of Lustre's /proc information (continued)
- Precise
  o Strict rules using regular expressions can be configured to filter out all but exactly what we want
  o Locations to save collected statistics are explicitly defined and configurable
- Powerful
  o Any statistic can be collected as long as there is a proper regular expression to match it
- Extendable
  o Any newly wanted statistic can be collected in no time by adding a definition to the XML file
- Efficient
  o No matter how many definitions the XML file contains, only the definitions in use are traversed at run time
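For reference, these XML definitions ultimately map regular expressions onto server-side proc files. On a Lustre 2.x server, the raw sources behind the OST and MDT items used in the configuration below can be inspected directly (exact paths vary slightly between Lustre versions):

# OST operation counters (source of the ost_stats_read / ost_stats_write items)
cat /proc/fs/lustre/obdfilter/*/stats
# MDT metadata operation counters (source of the md_stats_* items)
cat /proc/fs/lustre/mdt/*/md_stats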

Example of a collectd.conf
This is an example /etc/collectd.conf from an MDS (tmds1):

[root@tmds1 ~]# cat /etc/collectd.conf
#
# collectd.conf for DDN LustreMon
#
Interval 5
WriteQueueLimitHigh 1000000
WriteQueueLimitLow 800000

LoadPlugin match_regex
LoadPlugin syslog
<Plugin syslog>
    #LogLevel info
    LogLevel err
</Plugin>

LoadPlugin lustre
<Plugin "lustre">
    <Common>
        DefinitionFile "/etc/lustre-ieel-2.5_definition.xml"
    </Common>
    # OST stats
    # <Item>
    #     Type "ost_kbytestotal"
    #     Query_interval 300
    # </Item>
    # <Item>
    #     Type "ost_kbytesfree"
    #     Query_interval 300
    # </Item>
    <Item>
        Type "ost_stats_write"
    </Item>
    <Item>
        Type "ost_stats_read"
    </Item>

Example of a collectd.conf (continued)
    # MDT stats
    # <Item>
    #     Type "mdt_filestotal"
    #     Query_interval 300
    # </Item>
    # <Item>
    #     Type "mdt_filesfree"
    #     Query_interval 300
    # </Item>
    <Item>
        Type "md_stats_open"
    </Item>
    <Item>
        Type "md_stats_close"
    </Item>
    <Item>
        Type "md_stats_mknod"
    </Item>
    <Item>
        Type "md_stats_unlink"
    </Item>
    <Item>
        Type "md_stats_mkdir"
    </Item>
    <Item>
        Type "md_stats_rmdir"
    </Item>
    <Item>
        Type "md_stats_rename"
    </Item>
    <Item>
        Type "md_stats_getattr"
    </Item>
    <Item>
        Type "md_stats_setattr"
    </Item>
    <Item>
        Type "md_stats_getxattr"
    </Item>
    <Item>
        Type "md_stats_setxattr"
    </Item>
    <Item>
        Type "md_stats_statfs"
    </Item>
    <Item>
        Type "md_stats_sync"
    </Item>

Example of a collectd.conf (continued)
    <Item>
        Type "ost_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>
    <Item>
        Type "mdt_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>
    <ItemType>
        Type "mdt_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
    <ItemType>
        Type "ost_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>

Example of a collectd.conf (continued)
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
</Plugin>

LoadPlugin "write_tsdb"
<Plugin "write_tsdb">
    <Node>
        Host "10.10.108.33"
        Port "8500"
    </Node>
</Plugin>

#LoadPlugin "write_graphite"
#<Plugin "write_graphite">
#    <Carbon>
#        Host "172.21.66.181"
#        Port "2003"
#        Prefix "collectd."
#        Protocol "udp"
#    </Carbon>
#</Plugin>
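The jobstats items above assume Lustre jobstats are enabled and that the job identifier is encoded as u<uid>.g<gid>.j<jobid>, matching the Pattern in the config; that PBS-style encoding would come from a site-specific jobid_var integration. A sketch using standard Lustre commands, where the filesystem name and scheme are assumptions:

# Enable jobstats; "procname_uid" is one stock scheme, or an environment
# variable name such as PBS_JOBID can be used for scheduler integration
lctl conf_param testfs.sys.jobid_var=procname_uid
# Inspect the per-job counters that the ost_jobstats/mdt_jobstats items parse
lctl get_param obdfilter.*.job_stats
lctl get_param mdt.*.job_stats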

Demo
- Show the OpenTSDB layout
- Show the Grafana layout
- Show adding an MDT-based stat, then update it with a filter to a job ID
- Show adding an OST-based stat
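Outside the demo GUIs, the collected data can also be pulled straight from OpenTSDB's standard HTTP API, which is handy for scripted checks. A sketch, assuming the TSD's HTTP port (4242 by default, distinct from the write_tsdb port in the config above) and an illustrative metric name:

# Sum of OST write stats over the last 10 minutes, returned as JSON
curl -s 'http://10.10.108.33:4242/api/query?start=10m-ago&m=sum:ost_stats_write'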

Troubleshooting Lustre

Process when Troubleshooting Lustre

Lustre debugging
- Lustre is a complex environment, with lots of tightly coupled moving parts:
  o Storage (data, metadata)
  o OSS
  o MDS
  o Network
  o Lustre server
  o Lustre client
  o Operating systems
- The software resides in kernel space, which makes it more difficult to debug than user-space software.
- It is possible to debug Lustre:
  o Lustre bugs do get resolved; search JIRA (if the issue is Lustre)
  o A lot of tools have been developed specifically for Lustre debugging
  o The Lustre community is very active and provides strong support

What to do when a Lustre issue occurs (1): understand the problem
- What is the failure type? (kernel crash / LBUG / system call failure / stuck process / incorrect result / unexpected behavior / performance regression)
- Which nodes cause the problem?
  o Is it a server-side or a client-side problem?
  o Is it a problem limited to a single client?
  o Is it a metadata or a data access problem?
- How critical is the problem? The impacted services could be:
  o The whole system, e.g. crash or deadlock on MGS/MDS
  o All of the services on a server, e.g. crash or deadlock on OSS
  o A certain service of the whole system, e.g. quota failure on QMT/QSD
  o All of the operations on the client(s), e.g. crash or deadlock on a client

What to do when a Lustre issue occurs (2): find a simple and reliable reproduction method
- Step 1: Confirm which program causes the bug.
- Step 2: Write a simple program which can reproduce the problem repeatedly (see the skeleton below).
- Step 3: Simplify the program as much as possible.
A simple and reliable reproduction method:
  o Simplifies the description of the issue, helping other people understand it quickly
  o Reduces the amount of collected logs, reducing the time needed to analyze them
  o Accelerates the confirmation of candidate fixes, accelerating the fix process
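A minimal reproducer skeleton along these lines, where the directory and the touch command are placeholders for the real failing operation:

#!/bin/bash
# Run the suspect operation in a loop until it fails, recording the iteration
mkdir -p /lustre/testdir
for i in $(seq 1 1000); do
    touch /lustre/testdir/f$i || { echo "failed at iteration $i"; break; }
done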

What to do when a Lustre issue occurs (3): collect logs on the involved nodes
- System logs are always valuable for determining the state of Lustre nodes.
- Use the strace command to collect logs of system calls (see the example below):
  o Which system call returns failure?
  o Which errno does this system call return? The errno is essential for understanding and debugging the issue; e.g. EIO (5) usually means disk I/O has problems.
- Collect the kernel dump file when a crash happens:
  o Kdump should always be enabled on production systems.
  o It is especially useful for NULL pointer dereferences.
- Collect Lustre messages for further analysis. Tips:
  o A few lines of critical messages are much more helpful than any other messages.
  o The first messages after the bug happens are the most important.
  o Massive messages printed days before the bug happens are less valuable.
  o Redundant messages are always better than a lack of messages.
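A typical strace invocation for the questions above (the program name is a placeholder); failed calls appear in the output as "= -1 ERRNO (...)":

# Trace the failing program and its children, with timestamps, to a file
strace -f -tt -o /tmp/app.strace ./failing_app
# Show the last few failing system calls and their errnos
grep -- '= -1 ' /tmp/app.strace | tail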

What to do when a Lustre issue occurs (4): collect Lustre messages
- Command: lctl debug_kernel
- Different masks can be used: trace, inode, super, ext2, malloc, cache, info, ioctl, neterror, net, warning, buffs, other, dentry, nettrace, page, dlmtrace, error, emerg, ha, rpctrace, vfstrace, reada, mmap, config, console, quota, sec, lfsck, hsm
- Default masks are warning, error, emerg and console, but it might be necessary to change the mask to collect the desired messages (see the example after the table).

Mask       Usage
trace      Useful for tracing the process flow of the Lustre software stack. Frequently used.
quota      Useful for debugging quota problems.
dlmtrace   Useful for debugging LDLM problems.
ioctl      Useful for debugging ioctl problems.
malloc     Useful for debugging memory leak problems. Usually used together with leak_finder.pl.
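Adjusting the mask and dumping the buffer might look like this (the mask choice and buffer size are examples):

# Add RPC and LDLM tracing to the current debug mask
lctl set_param debug="+rpctrace +dlmtrace"
# Enlarge the in-memory debug buffer (in MB) so messages are not overwritten
lctl set_param debug_mb=512
# ... reproduce the problem, then dump and decode the buffer to a file
lctl debug_kernel /tmp/lustre-debug.log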

What to do when a Lustre issue occurs (5): fix the issue
- Search whether the same issue has already been fixed in the master branch of the Lustre git repository (see the example below).
  o The Lustre master branch evolves quickly, which means a lot of already-fixed bugs may still exist in older versions.
- Search whether any similar issue has been reported.
  o A fix or workaround might already have proved successful.
- Keep the faith that a fix will show up naturally as soon as the problem is fully understood.
- Compromise if you have to:
  o Find a temporary way to recover the service of the production system quickly, e.g. reboot/e2fsck.
  o If it is impossible to understand or fix the root cause of the issue right now, try to find a way to work around it.
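Searching the upstream tree for an existing fix might look like the following; the repository location is the commonly used Whamcloud one, and the search term is an example:

# Clone the upstream Lustre source and search commit messages for the symptom
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git log --oneline --grep='ldiskfs_fill_super'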

Real examples of fixing Lustre bugs (1): RM-135/LU-4478
- Problem description: when formatting a Lustre OST, the kernel crashes.
- Reproduction steps:
  o Apply a debug patch which returns failure from ldiskfs_acct_on()
  o Formatting a Lustre OST will then trigger the crash
- Collected log: kernel dump file collected by Kdump
- Analysis:
  o The log shows that the kernel crashes in ext4_get_sb()/get_sb_bdev()/kill_block_super()/generic_shutdown_super()/iput()/clear_inode() because of "BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0"
  o Using crash commands, it is confirmed that EXT4_SB((inode)->i_sb) is NULL
  o Further analysis shows that the failure of ldiskfs_acct_on() in ldiskfs_fill_super() is not handled correctly
- Fix: add code to handle the failure of ldiskfs_acct_on() in ldiskfs_fill_super() (http://review.whamcloud.com/10938)
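The "crash commands" step could look like the following sketch; the vmcore path and structure address are placeholders:

# Open the vmcore with the matching debug vmlinux
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
# Inside crash: get the backtrace of the panicking task, then inspect the
# superblock info structure that turned out to be NULL
crash> bt
crash> struct ext4_sb_info <address>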

Real examples of fixing Lustre bugs (2): RM-185/LU-5054
- Problem description: creating and setting a pool name of length 16 on a directory succeeds; however, creating a file under that directory fails.
- Reproduction steps:
  o [root@penguin1 ~]# lfs setstripe -p aaaaaaaaaaaaaaaa /lustre/dir2
  o [root@penguin1 ~]# touch /lustre/dir2/a
    touch: cannot touch `/lustre/dir2/a': Argument list too long
- Errno: E2BIG (7)
- Collected log: trace log of Lustre, to check which function returns the E2BIG errno.
- Analysis: the log shows that lod_generate_and_set_lovea() returns E2BIG, because the pool name inherited from the parent directory is longer than the length limit.
- Fix: clean up all related code to enforce a consistent length limit on pool names (http://review.whamcloud.com/10306)

Real examples of fixing Lustre bugs (3): LU-5808
- Problem description: when using one MGT to manage two file systems named 'lustre' and 'lustre2t', it is impossible to mount their MDTs on different servers because parsing of the MGS llog fails.
- Reproduction steps:
  o mkfs.lustre --mgs --reformat /dev/sdb1
  o mkfs.lustre --fsname lustre --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb2
  o mkfs.lustre --fsname lustre2t --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb3
  o mount -t lustre /dev/sdb1 /mnt/mgs
  o mount -t lustre /dev/sdb2 /mnt/mdt-lustre
  o mount -t lustre /dev/sdb3 /mnt/mdt-lustre2t
  o lctl conf_param lustre.quota.ost=ug
  o mount -t ldiskfs /dev/sdb1 /mnt/ldiskfs
  o llog_reader /mnt/ldiskfs/configs/lustre2t-mdt0000 | grep quota.ost
- The output of the last command is:
  #10 (224)marker 8 (flags=0x01, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014-
  #11 (088)param 0:lustre 1:quota.ost=ug
  #12 (224)marker 8 (flags=0x02, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014-
- Collected logs:
  o Trace log of Lustre, to check which function returns the failure when mounting the MDTs
  o Trace log of Lustre, to check how the MGS handles llog names
- Analysis: the log shows that the MGS matches the llog of lustre2t even when it tries to update the llog of lustre.
- Fix: update the MGS code to match llog names strictly, to avoid invalid records (http://review.whamcloud.com/12437)

Performance Issue during commissioning (1)
- Background:
  o Lustre system being commissioned in Asia
  o DDN storage, white-box servers, DDN Lustre; hardware assembled by a third-party contractor
  o No pre- or post-installation documentation
- Problem statement:
  o Low OSS performance
  o Failing performance acceptance tests

Performance Issue during commissioning (2)
- The local team spent many hours trying to resolve it
- Escalated to the (remote) DDN APAC Lustre support team
- Steps to resolve: determine what the problem is in the first place
  o Multiple tests to confirm where the problem is occurring:
    ior and iozone
    obdfilter-survey
    lnet_selftest
    raw IB test utilities ib_[write,read]_bw (make sure to specify the correct HCA you want to test)
- Based on the results of the above testing, investigate the hardware; lspci -vv was our friend (see the sketch below)
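The raw IB and PCIe checks might look like this; the HCA device name and PCI slot address are examples for a Mellanox FDR card:

# Raw InfiniBand bandwidth, selecting the HCA to test explicitly
ib_write_bw -d mlx4_0              # on the server
ib_write_bw -d mlx4_0 <server>     # on the client, pointing at the server
# Compare the negotiated PCIe link (LnkSta) against the card's capability (LnkCap)
lspci -vv -s 81:00.0 | grep -E 'LnkCap|LnkSta'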

Performance Issue during commissioning (3)
- Resolution: the onsite engineer moved one HCA to an 8-lane PCIe slot on all servers.
- Tests were rerun to confirm the fix, which it did: the system achieved the 10 GB/s read/write performance profile.

Performance Issue during commissioning (4)
- 20/20 hindsight is a beautiful thing: the problem is obvious once the issue is known.
- Lessons learned: detailed installation documentation is needed; the issue would have been resolved easily had it been available.

What makes Lustre debugging easier? (difficulty to debug: Easy / Middle / Hard)
- Ability to reproduce: every time / sometimes / never
- Time to reproduce: seconds / minutes / hours
- Program to reproduce: a few system calls / single-node application / parallel application
- Condition to reproduce: a certain condition of a single process / race condition with multiple processes / uncertain or unknown condition
- Involved nodes: client / MDS or OSS / client & MDS & OSS
- Involved software components: single component / multiple components on a single node / multiple components on multiple nodes with RPCs
- Ways of failing: omission failure (crash, request loss, or no reply) / commission failure (wrong processing of a request, incorrect reply, corrupted state) / arbitrary (Byzantine) failure (unpredictable result)
- Types of error: syntax error (compile error) / semantic defect (unintended result) / design deficiency
- Problem description: clear description with reproduction steps / clear text description / ambiguous description
- Collected logs: precise logs since the bug occurred / massive unfiltered logs / not enough logs

Fini. Questions?

Lustre debugging
- Lustre is a very complex piece of software which is hard to debug:
  o It has a lot of software components with tightly coupled interfaces.
  o It is a distributed file system with multiple types of nodes connected by a network.
  o The software resides in kernel space, which makes it more difficult to debug than user-space software.
- It is possible to debug Lustre:
  o Most Lustre bugs get fixed eventually; search JIRA.
  o A lot of tools have been developed specifically for Lustre debugging.
  o The Lustre community is very active and provides strong support.

Lustre DDN branch: client performance optimization

Genomic Analysis Application
- It's a standardized job set (pipeline), but more than 2000 jobs run in a single pipeline:
  o Alignment and mapping against genomics reference databases
  o Annotations: adding references (metadata) to data
  o Analysis by each application
- There are 100+ analysis applications, but no MPI applications: a lot of single jobs!
- Each application has a lot of options/libraries.
- All jobs are handled by the job scheduler and allocated very efficiently.
- A lot of analysis pipelines run on the same HPC cluster simultaneously.

Complex, Complex and Complex...
[Diagram: a single pipeline's job dependency graph; jobs (job1 through job306) linked by "after finish" dependencies, with further jobs waiting in the queue]

Pipeline-aware I/O performance monitoring
- Developed a Lustre performance monitoring tool (ExaScaler Monitor):
  o Near real-time data point collection (every second)
  o Any type of I/O monitoring is possible (UID/GID/JOBID or any type of custom ID)
- Performance monitoring is NOT only a daily/hourly report; it is really critical for performance optimization.
[Chart: aggregate I/O broken down into Total and Pipeline1 through Pipeline4]

Problem at MMBK
- A pipeline job's elapsed time on the lustre-2.5 client system is longer than on the lustre-1.8 client system. One analysis takes 2.5 days!
[Chart: job timeline; both systems start the job together, and the lustre-1.8 client system finishes about 10 hours before the lustre-2.5 client system]

Lustre performance optimization for genomic applications
Worked exclusively with Intel to optimize the current Lustre 2.5 client code for better I/O performance with genomic applications:
- mmap() I/O performance improvements
  o Bug fixes, optimization and improvements (by the way, there is a crucial issue with mmap() in GPFS)
- Performance improvements for a single shared file
  o Parallel reads of the same region of a file from a single client
- CPU/memory resource reduction
  o A lot of CPU-intensive applications; CPU usage is always high
- Large bulk I/O size support and enhancement
  o Support for I/O sizes up to 16 MB (4 MB was the limit)
  o Aggressive readahead engine for large I/O (see the sketch below)
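On a client with the large-bulk-I/O patches, the larger RPC size and a wider readahead window would be tuned through standard Lustre parameters. A sketch with example values (4096 pages x 4 KiB = 16 MB; this assumes the patched client, since stock clients of that era were capped lower):

# Raise the per-RPC bulk size to 16 MB (4096 x 4 KiB pages)
lctl set_param osc.*.max_pages_per_rpc=4096
# Widen the client readahead window (in MB) for large sequential I/O
lctl set_param llite.*.max_read_ahead_mb=256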

Fix mmap() performance problems and improvements
Several applications call mmap() a lot: 10%+ of open() calls come with mmap()!

# cat /proc/fs/lustre/llite/*/stats
llite.share1-ffff881067f9b800.stats=
snapshot_time        1408263676.546716 secs.usecs
read_bytes           589388 samples [bytes] 0 2147479552 258867698600
write_bytes          1025093126 samples [bytes] 1 4194304 637173439272
osc_read             3880442 samples [bytes] 8 1048576 3667025741928
osc_write            640640 samples [bytes] 5 1048576 637252863026
ioctl                17938 samples [regs]
open                 90267 samples [regs]
close                90239 samples [regs]
mmap                 10523 samples [regs]
seek                 6997546 samples [regs]
fsync                16 samples [regs]
readdir              48874 samples [regs]
setattr              252 samples [regs]
truncate             12 samples [regs]
getattr              2097773 samples [regs]
create               3465 samples [regs]
link                 1 samples [regs]
unlink               2890 samples [regs]
statfs               2069 samples [regs]
alloc_inode          8423 samples [regs]
getxattr             1025105141 samples [regs]
inode_permission     229899278 samples [regs]

[Charts: "mmap() read performance improvements" and "mmap() read performance (1 MB block size)", comparing lustre-1.8.9, lustre-2.5.2 and the fixed DDN branch across 32K/128K/512K/1024K block sizes; after the rework, a 2.5x speedup over the 1.8 client]

Performance improvements for the same region of a shared file
- A single client's processes all read one reference database file: the application is not MPI, but a lot of single applications refer to a reference file and perform mapping operations against it.
[Chart: fix and optimization for parallel read (no cache); 2x to 12x speedups for 4 KB and 1 MB single and parallel reads, comparing lustre-1.8.9, lustre-2.5.2 and the fixed DDN branch]
- The Sanger Institute in the UK hit similar performance regressions with the lustre-2.5.2 client. After they applied our patches, jobs' elapsed time dropped significantly: 24 hours (fixed DDN Lustre branch) from 40 hours (lustre-2.5.2).

Optimization of performance under heavy CPU loads
- All clients' CPU utilization is quite high, and the job scheduler allocates the next jobs very efficiently.
- Found Lustre 2.5 performance regressions under heavy CPU loads.
- A lot of Java applications seem not to be doing good memory management, and the Lustre client also consumes memory.
- Several applications are implemented on old architectural assumptions (assuming everything fits in the cache?).
- The reduced buffer cache available to Lustre led to more disk accesses rather than cache hits.

Large bulk I/O size support
As far as the server-side I/O stats show, a lot of large sequential I/O is coming in:

# cat /proc/fs/lustre/obdfilter/*/brw_stats
snapshot_time: 1406696961.271996 (secs.usecs)
                         read                      write
pages per bulk r/w     rpcs    %  cum %          rpcs    %  cum %
1:                  1091416    1      1        681741    2      2
2:                    62166    0      1        164562    0      2
4:                    96568    0      1         60799    0      2
8:                   115945    0      1         10054    0      2
16:                  170813    0      1         11361    0      2
32:                  242152    0      1         18944    0      2
64:                  444827    0      2         37609    0      2
128:                 861561    0      3        107677    0      3
256:               99436837   96    100      32549912   96    100

                         read                      write
discontiguous pages    rpcs    %  cum %          rpcs    %  cum %
0:                102060933   99     99      33641331   99     99
1:                   177850    0     99          1196    0     99
2:                    27307    0     99            39    0     99
3:                    10447    0     99            27    0     99
4:                     5502    0     99            16    0     99
- snip -

                         read                      write
discontiguous blocks   rpcs    %  cum %          rpcs    %  cum %
0:                102029460   99     99      31615681   93     93
1:                   208894    0     99       2026762    6     99
2:                    27592    0     99           131    0     99
3:                    10511    0     99            25    0     99
4:                     5549    0     99             9    0     99
- snip -

[Charts: SFA12K/Lustre write and read performance with the large bulk I/O patches, comparing 1 MB, 4 MB and 16 MB I/O sizes on 320 x NL-SAS and 400 x NL-SAS configurations]

Performance results after reworking all improvements (1/3-scale test case)
[Chart: job timeline comparing Lustre-1.8.9 and the fixed Lustre branch; after the rework, jobs finish 5 hours faster than on lustre-1.8]

Summary
- Learned the I/O patterns of genomic analysis applications: each job's I/O access pattern is not difficult, but the genomic analysis pipeline adds complexity.
- We've done performance monitoring, analysis and optimization of Lustre: real-time Lustre performance monitoring helps performance analysis and optimization.
- There are still many areas we can optimize:
  o A lot of legacy and old system architecture remains.
  o Changing the applications is really hard (researchers are busy and I/O optimization is not their main work), but adapting and optimizing Lustre for their applications is possible.

Troubleshooting
Using two real examples to discuss and illustrate troubleshooting Lustre:
1. A performance issue during commissioning
2. Three bugs in mature, running systems

Generic Grafana graphing

Grafana IOR run

OpenTSDB web interface