Lustre performance monitoring and troubleshooting
1 Lustre performance monitoring and troubleshooting
March 2015
Patrick Fitzhenry and Ian Costello
2012 DataDirect Networks. All Rights Reserved.
2 Agenda
EXAScaler (Lustre) monitoring: NCI test kit hardware details; what is it?; how does it work; demo
Lustre troubleshooting: general points; 4 examples
3 Introduction
Patrick Fitzhenry: Director, Technical Services & Support, South Asia & ANZ
Ian Costello: Senior Application Support Engineer
4 Lustre Performance Monitoring
5 NCI test kit hardware details
20 x Fujitsu compute nodes: dual E5-2670 2.60GHz processors, 32GB, single-rail FDR
SFA12KX: x 3TB NL-SAS
4 x OSSs: dual E GB, CentOS 6.4
Metadata: 12 x 600GB 15K SAS
2 x MDSs: dual E GB, CentOS
6 Lustre Monitoring Background
DDN development project using information from Linux's /proc.
Goals:
Collect near real-time data (at minimum every 1 sec) and visualize it
All Lustre statistics information can be collected
Support Lustre 1.8.x, 2.x and beyond
Application-aware monitoring (job stats)
Administrators can build any custom graph in the web browser
Configurable, intuitive dashboard
Scalable, lightweight, with no performance impact; very helpful for debugging and I/O analysis
Lustre is a distributed, scalable filesystem; the monitoring/analysis tool must be aware of this. A Lustre monitoring tool helps in understanding current and past filesystem behavior and prevents performance slowdowns.
7 ExaScaler Monitoring
File system, OST pool, OST/MDT stats, etc.
Job ID, UID/GID, aggregation of application stats, etc.
Archive of data by policy. Lightweight, near real-time, massive scale, customizable.
Components (from the architecture diagram): OSS/MDS servers and Lustre clients run collectd with the DDN monitoring plugin and Graphite plugin; a monitoring server runs graphite; transfer is via UDP(TCP)/IP based small text messages.
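The "small text message" transfer described above can be illustrated with Graphite's plaintext protocol, where each sample is a single line of the form "path value timestamp". The metric name and timestamp below are invented for illustration; a minimal sketch, assuming a carbon listener on the default plaintext port 2003:

```python
import socket
import time

def graphite_line(metric, value, ts=None):
    """Format one sample in Graphite's plaintext protocol: 'path value timestamp'."""
    ts = int(ts if ts is not None else time.time())
    return "%s %s %d\n" % (metric, value, ts)

# Build a small text message (hypothetical metric path, fixed timestamp).
msg = graphite_line("collectd.tmds1.lustre.md_stats_open", 42, ts=1425168000)

# Fire-and-forget over UDP, matching the UDP(TCP)/IP transfer in the diagram.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(msg.encode(), ("127.0.0.1", 2003))
sock.close()
print(msg.strip())  # collectd.tmds1.lustre.md_stats_open 42 1425168000
```

The same one-line-per-sample idea applies to the write_tsdb path, which uses OpenTSDB's similar text format with appended tags.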
8 OpenTSDB Architecture
The end-to-end OpenTSDB workflow:
9 A new Lustre plugin for collectd
Using collectd:
Running on many enterprise/HPC systems
Written in C for performance and portability
Includes optimizations and features to handle hundreds of thousands of data sets
Comes with over 90 plugins, ranging from standard cases to very specialized and advanced topics
Provides powerful networking features and is extensible in numerous ways
Actively developed, supported and well documented
The Lustre plugin extends collectd to collect Lustre statistics while inheriting these advantages.
It is possible to port the Lustre plugin to a better framework if necessary.
10 XML definition of Lustre's /proc information
Tree-structured descriptions of how to collect statistics from Lustre proc entries.
Modular: a hierarchical framework comprising a core logic layer (the Lustre plugin) and a statistics definition layer (XML files); extendable without updating any source code of the Lustre plugin; easy to maintain the stability of the core logic.
Centralized: a single XML file for all definitions of Lustre data collection; no need to maintain massive error-prone scripts; easy to verify correctness; easy to support multiple versions and to update for new versions of Lustre.
11 XML definition of Lustre's /proc information
Precise: strict rules using regular expressions can be configured to filter out all but exactly what we want; locations to save collected statistics are explicitly defined and configurable.
Powerful: any statistic can be collected as long as there is a proper regular expression to match it.
Extendable: any newly wanted statistic can be collected in no time by adding a definition to the XML file.
Efficient: no matter how many definitions are predefined in the XML file, only the definitions in use are traversed at run time.
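The regex-based filtering described above can be sketched as follows. The sample proc text and its counter values are invented for illustration (real entries live under /proc/fs/lustre/); the point is that a strict pattern extracts exactly the named counters and nothing else:

```python
import re

# Hypothetical sample of a Lustre 'stats' proc entry (values made up).
SAMPLE = """\
snapshot_time             1425168000.123456 secs.usecs
read_bytes                1024 samples [bytes] 4096 1048576 52428800
write_bytes               512 samples [bytes] 4096 1048576 20971520
"""

# Strict rule: keep only the counters we exactly want, as the XML definitions do.
# Groups: name, sample count, min, max, sum.
RULE = re.compile(r"^(read_bytes|write_bytes)\s+(\d+) samples \[bytes\] "
                  r"(\d+) (\d+) (\d+)$", re.M)

for name, samples, vmin, vmax, vsum in RULE.findall(SAMPLE):
    print(name, "sum =", vsum)
```

Lines that do not match the rule (snapshot_time here) are simply ignored, which is how the definition file filters out everything but the requested statistics.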
12 Example of a collectd.conf
This is an example of /etc/collectd.conf from an MDS (tmds1):
[root@tmds1 ~]# cat /etc/collectd.conf
#
# collectd.conf for DDN LustreMon
#
Interval 5
WriteQueueLimitHigh
WriteQueueLimitLow
LoadPlugin match_regex
LoadPlugin syslog
<Plugin syslog>
    #LogLevel info
    LogLevel err
</Plugin>
LoadPlugin lustre
<Plugin "lustre">
    <Common>
        DefinitionFile "/etc/lustre-ieel-2.5_definition.xml"
    </Common>
    # OST stats
    #<Item>
    #    Type "ost_kbytestotal"
    #    Query_interval 300
    #</Item>
    #<Item>
    #    Type "ost_kbytesfree"
    #    Query_interval 300
    #</Item>
    <Item>
        Type "ost_stats_write"
    </Item>
    <Item>
        Type "ost_stats_read"
    </Item>
13 Example of a collectd.conf (continued)
    # MDT stats
    #<Item>
    #    Type "mdt_filestotal"
    #    Query_interval 300
    #</Item>
    #<Item>
    #    Type "mdt_filesfree"
    #    Query_interval 300
    #</Item>
    <Item>
        Type "md_stats_open"
    </Item>
    <Item>
        Type "md_stats_close"
    </Item>
    <Item>
        Type "md_stats_mknod"
    </Item>
    <Item>
        Type "md_stats_unlink"
    </Item>
    <Item>
        Type "md_stats_mkdir"
    </Item>
    <Item>
        Type "md_stats_rmdir"
    </Item>
    <Item>
        Type "md_stats_rename"
    </Item>
    <Item>
        Type "md_stats_getattr"
    </Item>
    <Item>
        Type "md_stats_setattr"
    </Item>
    <Item>
        Type "md_stats_getxattr"
    </Item>
    <Item>
        Type "md_stats_setxattr"
    </Item>
    <Item>
        Type "md_stats_statfs"
    </Item>
    <Item>
        Type "md_stats_sync"
    </Item>
14 Example of a collectd.conf (continued)
    <Item>
        Type "ost_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>
    <Item>
        Type "mdt_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>
    <ItemType>
        Type "mdt_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
    <ItemType>
        Type "ost_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>
15 Example of a collectd.conf (continued)
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
</Plugin>
LoadPlugin "write_tsdb"
<Plugin "write_tsdb">
    <Node>
        Host " "
        Port "8500"
    </Node>
</Plugin>
#LoadPlugin "write_graphite"
#<Plugin "write_graphite">
#    <Carbon>
#        Host " "
#        Port "2003"
#        Prefix "collectd."
#        Protocol "udp"
#    </Carbon>
#</Plugin>
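The ExtendedParse blocks in the config split a PBS-style jobstats job_id of the form uUID.gGID.jJOBID into three tags using the POSIX regex u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+). The same parse can be sketched in Python ([[:digit:]] becomes \d; the sample job_id is invented):

```python
import re

# Python translation of the pattern used in the collectd.conf above.
PATTERN = re.compile(r"u(\d+)\.g(\d+)\.j(\d+)")

def parse_job_id(job_id):
    """Split a PBS-style Lustre jobstats job_id into uid/gid/jobid tags."""
    m = PATTERN.fullmatch(job_id)
    if not m:
        return None
    return {
        "pbs_job_uid": m.group(1),
        "pbs_job_gid": m.group(2),
        "pbs_job_id": m.group(3),
    }

tags = parse_job_id("u500.g500.j12345")
print(tags)  # {'pbs_job_uid': '500', 'pbs_job_gid': '500', 'pbs_job_id': '12345'}
```

These three tags are exactly what the TsdbTags line forwards to OpenTSDB, which is what makes per-job aggregation in the dashboards possible.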
16 Demo
Show the OpenTSDB layout
Show the Grafana layout
Show adding an MDT-based stat, then update with a filter to a job ID
Show adding an OST-based stat
17 Troubleshooting Lustre
18 Process when Troubleshooting Lustre
19 Lustre debugging
Lustre is a complex environment with lots of tightly coupled moving parts: storage (data, metadata), OSS, MDS, network, Lustre server, Lustre client, operating systems.
The software resides in kernel space, which makes it difficult to debug compared with user-space software.
It is possible to debug Lustre:
Lustre bugs do get resolved; search Jira (if the issue is in Lustre)
A lot of tools have been developed specifically for Lustre debugging
The Lustre community is very active and provides strong support
20 What to do when a Lustre issue occurs (1)
Understand the problem.
What is the failure type? (kernel crash / LBUG / system call failure / stuck process / incorrect result / unexpected behavior / performance regression)
Which nodes cause the problem?
o Is it a server-side or a client-side problem?
o Is it a problem limited to a single client?
o Is it a metadata or a data access problem?
How critical is the problem? The impacted services could be:
o The whole system, e.g. crash or deadlock on MGS/MDS
o All of the services on a server, e.g. crash or deadlock on OSS
o A certain service of the whole system, e.g. quota failure on QMT/QSD
o All of the operations on the client(s), e.g. crash or deadlock on a client
21 What to do when a Lustre issue occurs (2)
Find a simple and reliable reproduction method:
Step 1: Confirm which program causes the bug
Step 2: Write a simple program which can reproduce the problem repeatedly
Step 3: Simplify the program as much as possible
A simple and reliable reproduction method:
o Simplifies the description of the issue, helping other people understand it quickly
o Reduces the collected logs, reducing the time needed to analyze them
o Accelerates the confirmation of possible fixes, accelerating the fix process
22 What to do when a Lustre issue occurs (3)
Collect logs on the involved nodes.
System logs are always valuable for determining the states of Lustre nodes.
Use the strace command to collect logs of system calls:
o Which system call returns failure?
o Which errno does this system call return? The errno is essential for understanding and debugging the issue, e.g. EIO(5) usually means disk I/O has some problem.
Collect a kernel dump file when a crash happens:
o Kdump should always be enabled on production systems
o It is especially useful for NULL pointer dereferences
Collect Lustre messages for further analysis. Tips:
o A few lines of critical messages are much more helpful than other messages
o The first messages when the bug happens are the most important
o Massive messages printed days before the bug happens are less valuable
o Redundant messages are always better than a lack of messages
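The errno values mentioned above can be checked quickly with Python's errno module (an assumption of convenience; the deck itself talks about C-level errnos as seen in strace output):

```python
import errno
import os

# EIO is errno 5; seeing it from read()/write() usually points at disk I/O trouble.
print(errno.EIO)               # 5
print(errno.errorcode[5])      # 'EIO'
print(os.strerror(errno.EIO))  # 'Input/output error' on Linux

# E2BIG(7) is the errno seen in the pool-name bug later in this deck.
print(errno.E2BIG)             # 7
```

This is handy when reading strace output such as "write(3, ...) = -1 EIO", where the symbolic name tells you immediately which failure class to investigate.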
23 What to do when a Lustre issue occurs (4)
Collect Lustre messages. Command: lctl debug_kernel
Different masks can be used: trace, inode, super, ext2, malloc, cache, info, ioctl, neterror, net, warning, buffs, other, dentry, nettrace, page, dlmtrace, error, emerg, ha, rpctrace, vfstrace, reada, mmap, config, console, quota, sec, lfsck, hsm
Default masks are warning, error, emerg, console, but it might be necessary to change the mask to collect the desired messages.
trace: useful for tracing the process flow of the Lustre software stack; frequently used
quota: useful for debugging quota problems
dlmtrace: useful for debugging LDLM problems
ioctl: useful for debugging ioctl problems
malloc: useful for debugging memory leak problems; usually used together with leak_finder.pl
24 What to do when a Lustre issue occurs (5)
Fix the issue.
Search whether the same issue has been fixed in the master branch of the Lustre git repository:
o The Lustre master branch is evolving quickly, which means a lot of already-fixed bugs might still exist in older versions.
Search whether any similar issue has been reported:
o A fix or workaround might already have proved successful.
Keep the faith that a fix will show up naturally as soon as the problem is fully understood.
Compromise if you have to:
o Find a temporary way to recover the service of the production system quickly, e.g. reboot/e2fsck.
o If it is impossible to understand or fix the root cause of the issue right now, try to find a way to work around it.
25 Real examples of fixing Lustre bugs (1): RM-135/LU-4478
Problem description: when formatting a Lustre OST, the kernel crashes.
Reproduction steps:
o Apply a debug patch which returns failure from ldiskfs_acct_on()
o Formatting a Lustre OST will then trigger the crash
Collected log: kernel dump file collected by Kdump.
Analysis:
o The log shows that the kernel crashes in ext4_get_sb()/get_sb_bdev()/kill_block_super()/generic_shutdown_super()/iput()/clear_inode() because of "BUG: unable to handle kernel NULL pointer dereference at e0"
o Using crash commands, it is confirmed that EXT4_SB((inode)->i_sb) is NULL
o Further analysis found that the failure of ldiskfs_acct_on() in ldiskfs_fill_super() is not handled correctly
Fix: add code to handle the failure of ldiskfs_acct_on() in ldiskfs_fill_super().
26 Real examples of fixing Lustre bugs (2): RM-185/LU-5054
Problem description: creating and setting a pool name of length 16 on a directory succeeds; however, creating a file under that directory fails.
Reproduction steps:
o [root@penguin1 ~]# lfs setstripe -p aaaaaaaaaaaaaaaa /lustre/dir2
o [root@penguin1 ~]# touch /lustre/dir2/a
  touch: cannot touch `/lustre/dir2/a': Argument list too long
Errno: E2BIG(7)
Collected log: trace log of Lustre to check which function returns the E2BIG errno.
Analysis: the log shows that lod_generate_and_set_lovea() returns E2BIG, because the pool name inherited from the parent directory is longer than the length limit.
Fix: clean up all related code to enforce a consistent length limit on pool names.
27 Real examples of fixing Lustre bugs (3): LU-5808
Problem description: when using one MGT to manage two file systems named 'lustre' and 'lustre2t', it is impossible to mount their MDTs on different servers because parsing of the MGS llog fails.
Reproduction steps:
o mkfs.lustre --mgs --reformat /dev/sdb1
o mkfs.lustre --fsname lustre --mdt --reformat --mgsnode= @tcp --index=0 /dev/sdb2
o mkfs.lustre --fsname lustre2t --mdt --reformat --mgsnode= @tcp --index=0 /dev/sdb3
o mount -t lustre /dev/sdb1 /mnt/mgs
o mount -t lustre /dev/sdb2 /mnt/mdt-lustre
o mount -t lustre /dev/sdb3 /mnt/mdt-lustre2t
o lctl conf_param lustre.quota.ost=ug
o mount -t ldiskfs /dev/sdb1 /mnt/ldiskfs
o llog_reader /mnt/ldiskfs/configs/lustre2t-mdt0000 | grep quota.ost
The output of the last command is:
#10 (224)marker 8 (flags=0x01, v ) lustre 'quota.ost' Mon Oct 27 21:26:
#11 (088)param 0:lustre 1:quota.ost=ug
#12 (224)marker 8 (flags=0x02, v ) lustre 'quota.ost' Mon Oct 27 21:26:
Collected logs:
o Trace log of Lustre to check which function returns the failure when mounting the MDTs
o Trace log of Lustre to check how the MGS handles llog names
Analysis: the log shows that the MGS matches the llog of lustre2t even when it tries to update the llog of lustre.
Fix: update the MGS code to match llog names strictly, to avoid invalid records.
28 Performance issue during commissioning (1)
Background: a Lustre system being commissioned in Asia. DDN storage, white-box servers, DDN Lustre. Hardware assembled by a third-party contractor. No pre- or post-installation documentation.
Problem statement: low OSS performance; failing performance acceptance tests.
29 Performance issue during commissioning (2)
The local team spent many hours trying to resolve it, then escalated to the (remote) DDN APAC Lustre support team.
Steps to resolve: determine what the problem is in the first place.
o Multiple tests to confirm where the problem is occurring: ior and iozone; obdfilter-survey; lnet_selftest; raw IB test utils ib_[write,read]_bw (make sure to specify the correct HCA you want to test)
Based on the results from the above testing, investigate the hardware: lspci -vv was our friend.
30 Performance issue during commissioning (3)
Resolution: the onsite engineer moved one HCA to an 8-lane PCI slot on all servers.
Tests were re-run to confirm the fix, which it did, achieving the 10GB/s read/write performance profile.
31 Performance issue during commissioning (4)
20/20 hindsight is a beautiful thing: obvious once the issue is known.
Lessons learned: detailed documentation of the installation is needed; the issue would have been resolved easily had it been available.
32 What makes Lustre debugging easier?
Difficulty to debug: Easy / Middle / Hard
Ability to reproduce: every time / sometimes / never
Time to reproduce: seconds / minutes / hours
Program to reproduce: a few system calls / single-node application / parallel application
Condition to reproduce: a certain condition of a single process / race condition with multiple processes / uncertain or unknown condition
Involved nodes: client / MDS or OSS / client & MDS & OSS
Involved software components: single component / multiple components on a single node / multiple components on multiple nodes with RPCs
Ways of failing: omission failure (crash, request loss, or no reply) / commission failure (wrong processing of a request, incorrect reply, corrupted state) / arbitrary or Byzantine failure (unpredictable result)
Types of error: syntax error (compile error) / semantic defect (unintended result) / design deficiency
Problem description: clear description with reproduction steps / clear text description / ambiguous description
Collected logs: precise logs since the bug occurred / massive unfiltered logs / not enough logs
33 Fini
Questions?
34 Lustre debugging
Lustre is a very complex piece of software which is hard to debug:
It has a lot of software components with tightly coupled interfaces.
It is a distributed file system with multiple types of nodes connected together by a network.
The software resides in kernel space, which makes it difficult to debug compared with user-space software.
It is possible to debug Lustre:
Most Lustre bugs get fixed eventually; search Jira.
A lot of tools have been developed specifically for Lustre debugging.
The Lustre community is very active and provides strong support.
35 Lustre DDN branch: client performance optimization
36 Where ideas become reality: genomic analysis application
It's a standardized job set (pipeline), but more than 2000 jobs run in a single pipeline:
o Alignment and mapping against genomic reference databases
o Annotation, adding references (metadata) to data
o Analysis by each application
There are 100+ analysis applications, but no MPI applications: a lot of single jobs. Each application has a lot of options/libraries.
All jobs are associated with the job scheduler and allocated very efficiently. A lot of analysis pipelines run on the same HPC cluster simultaneously.
Engineering Technical Conference
37 Where ideas become reality: complex, complex and complex...
(Diagram of a single pipeline: jobs job1..job6, job101..job107, job201..job206 and job301..job306 linked by dependencies; after a job finishes, its waiting jobs are released.)
38 Pipeline-aware I/O performance monitoring
Developed a Lustre performance monitoring tool (ExaScaler Monitor):
Near-realtime data point collection (every second)
Any type of I/O monitoring is possible (UID/GID/JOBID or any type of custom ID)
Performance monitoring is NOT only for daily/hourly reports; it is really critical for performance optimization.
(Graph: total throughput broken down by Pipeline1, Pipeline2, Pipeline3, ...)
39 Where ideas become reality: problem at MMBK
Pipeline job elapsed time on a lustre-2.5 client system is longer than on the lustre-1.8 client system. One analysis takes 2.5 days!
(Timeline: after the job starts, the lustre-1.8 client system finishes about 10 hours before the lustre-2.5 client system.)
40 Lustre performance optimization for genomic applications
Worked exclusively with Intel to optimize the current Lustre 2.5 client code for better I/O performance for genomic applications:
mmap() I/O performance improvements: bug fixes, optimization and improvements (BTW, there is a crucial issue with mmap() in GPFS)
Performance improvements for a single shared file: parallel reads of the same region of a file from a single client
CPU/memory resource reduction: a lot of CPU-intensive applications; CPU usage is always high
Large bulk I/O size support and enhancement: support for up to 16MB I/O size (4MB was the limit); aggressive readahead engine for large I/O
41 Fix mmap() performance problems and improvements
Several applications call mmap() a lot: 10%+ of open() calls come with mmap()!
# cat /proc/fs/lustre/llite/*/stats
llite.share1-ffff881067f9b800.stats=
snapshot_time        secs.usecs
read_bytes           samples [bytes]
write_bytes          samples [bytes]
osc_read             samples [bytes]
osc_write            samples [bytes]
ioctl                samples [regs]
open                 samples [regs]
close                samples [regs]
mmap                 samples [regs]
seek                 samples [regs]
fsync                16 samples [regs]
readdir              samples [regs]
setattr              252 samples [regs]
truncate             12 samples [regs]
getattr              samples [regs]
create               3465 samples [regs]
link                 1 samples [regs]
unlink               2890 samples [regs]
statfs               2069 samples [regs]
alloc_inode          8423 samples [regs]
getxattr             samples [regs]
inode_permission     samples [regs]
(Chart: mmap() read performance at 1MB block size, over 32K/128K/512K/1024K block sizes: after the rework, the fixed DDN branch is 2.5x faster than the lustre-1.8 client, while stock lustre-2.5 lagged behind.)
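The "10%+ of open() calls come with mmap()" observation can be derived from the llite stats counters. The sample below uses invented counts purely for illustration; the real data comes from /proc/fs/lustre/llite/*/stats:

```python
# Hypothetical llite stats sample (counter values invented for illustration).
SAMPLE = """\
open     100000 samples [regs]
close    100000 samples [regs]
mmap      12500 samples [regs]
"""

# Parse 'name count samples [regs]' lines into a dict of counters.
counters = {}
for line in SAMPLE.splitlines():
    fields = line.split()
    counters[fields[0]] = int(fields[1])

# Ratio of mmap() calls to open() calls, as cited on the slide.
ratio = counters["mmap"] / counters["open"]
print("mmap/open = %.1f%%" % (ratio * 100))  # mmap/open = 12.5%
```

Watching this ratio over time is how one notices that mmap() is on the hot path and worth optimizing for this workload.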
42 Performance improvements for the same region of a shared file
A single client's processes read one reference database file: the application is not MPI, but a lot of single applications refer to a reference file and do mapping operations with it.
Fix and optimization for parallel read (no cache).
(Chart comparing lustre-1.8, lustre-2.5 and the fixed DDN branch for 4KB single, 4KB parallel, 1MB single and 1MB parallel reads, with speedups of roughly 2x to 12x for the fixed branch.)
The Sanger Institute in the UK hit similar performance regressions with the lustre-2.5.2 client. After they applied our patches, jobs' elapsed time reduced significantly: 24 hours (fixed DDN Lustre branch) from 40 hours (lustre-2.5.2).
43 Optimization of performance under heavy CPU loads
All clients' CPU utilization is quite high, and the job scheduler allocates the next jobs very efficiently. Found Lustre 2.5 performance regressions under heavy CPU loads.
A lot of Java applications seem not to be doing good memory management, and the Lustre client also consumes memory.
Several applications are implemented on old architecture assumptions (assuming everything fits in the cache?).
Reduced buffer cache available to Lustre led to more disk access rather than cache hits.
44 Where ideas become reality: large bulk I/O size support
As far as server-side I/O stats show, a lot of large sequential I/O is arriving.
# cat /proc/fs/lustre/obdfilter/*/brw_stats
snapshot_time: (secs.usecs)
(Histograms: "pages per bulk r/w", "discontiguous pages" and "discontiguous blocks", each with read and write rpcs, % and cum % columns.)
(Charts: SFA12K/Lustre write and read performance with the large bulk I/O patches, for 320 x NLSAS and 400 x NLSAS, at 1MB, 4MB and 16MB I/O sizes.)
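The brw_stats "pages per bulk r/w" histogram above is what tells you how much of the traffic would benefit from larger bulk RPCs. A minimal sketch of that analysis, with invented RPC counts (real ones come from /proc/fs/lustre/obdfilter/*/brw_stats) and the usual 4KB page size, so 256 pages is a 1MB RPC and 4096 pages is 16MB:

```python
# Hypothetical 'pages per bulk r/w' histogram: bucket size in pages -> rpc count.
HIST = {1: 120, 32: 340, 256: 5200, 1024: 800, 4096: 9100}

PAGE = 4096  # 4KB pages
total = sum(HIST.values())
# Count RPCs of 4MB or more, the ones enabled by the large bulk I/O patches.
large = sum(n for pages, n in HIST.items() if pages * PAGE >= 4 * 1024 * 1024)
print("RPCs >= 4MB: %.1f%%" % (100.0 * large / total))  # RPCs >= 4MB: 63.6%
```

When a large share of RPCs sits in the top buckets, raising the bulk I/O limit from 4MB toward 16MB directly reduces RPC count for the same data volume.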
45 Performance results after reworking all improvements (1/3-scale test case)
(Timeline: after the job starts, the fixed Lustre branch finishes 5 hours faster than lustre-2.5.)
46 Summary
Learned the I/O patterns of genomic analysis applications: each job's I/O access pattern is not difficult, but the genomic analysis pipeline creates complexity.
We've done performance monitoring, analysis and optimization of Lustre. Realtime Lustre performance monitoring helps performance analysis and performance optimization.
There are still many areas we can optimize: a lot of legacy, old-architecture systems remain. Changing the applications is really hard (researchers are busy, and I/O optimization is not their main work), but adapting and optimizing for their applications is possible.
47 Troubleshooting
Using two real examples to discuss and illustrate troubleshooting Lustre:
1. A performance issue during commissioning
2. 3 bugs in a mature running system
48 Generic Grafana graphing
49 Grafana IOR run
50 OpenTSDB web interface
More informationEXAScaler. Product Release Notes. Version 2.0.1. Revision A0
EXAScaler Version 2.0.1 Product Release Notes Revision A0 December 2013 Important Information Information in this document is subject to change without notice and does not represent a commitment on the
More informationLustre* Testing: The Basics. Justin Miller, Cray Inc. James Nunez, Intel Corporation LAD 15 Paris, France
Lustre* Testing: The Basics Justin Miller, Cray Inc. James Nunez, Intel Corporation LAD 15 Paris, France 1 Legal Disclaimer Information in this document is provided in connection with Cray Inc. products.
More informationSpectrum Scale. Problem Determination. Mathias Dietz
Spectrum Scale Problem Determination Mathias Dietz Please Note IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM s sole discretion.
More informationMcAfee Web Gateway 7.4.1
Release Notes Revision B McAfee Web Gateway 7.4.1 Contents About this release New features and enhancements Resolved issues Installation instructions Known issues Find product documentation About this
More informationIBRIX Fusion 3.1 Release Notes
Release Date April 2009 Version IBRIX Fusion Version 3.1 Release 46 Compatibility New Features Version 3.1 CLI Changes RHEL 5 Update 3 is supported for Segment Servers and IBRIX Clients RHEL 5 Update 2
More informationwww.thinkparq.com www.beegfs.com
www.thinkparq.com www.beegfs.com KEY ASPECTS Maximum Flexibility Maximum Scalability BeeGFS supports a wide range of Linux distributions such as RHEL/Fedora, SLES/OpenSuse or Debian/Ubuntu as well as a
More informationHP OpenView Smart Plug-in for Microsoft SQL Server
HP OpenView Smart Plug-in for Microsoft SQL Server Product brief The HP OpenView Smart Plug-in (SPI) for Microsoft (MS) SQL Server is the intelligent choice for managing SQL Server environments of any
More informationRed Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment
Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment WHAT IS IT? Red Hat Network (RHN) Satellite server is an easy-to-use, advanced systems management platform
More informationRed Hat Satellite Management and automation of your Red Hat Enterprise Linux environment
Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment WHAT IS IT? Red Hat Satellite server is an easy-to-use, advanced systems management platform for your Linux infrastructure.
More informationMaintaining Non-Stop Services with Multi Layer Monitoring
Maintaining Non-Stop Services with Multi Layer Monitoring Lahav Savir System Architect and CEO of Emind Systems lahavs@emindsys.com www.emindsys.com The approach Non-stop applications can t leave on their
More information2 Purpose. 3 Hardware enablement 4 System tools 5 General features. www.redhat.com
A Technical Introduction to Red Hat Enterprise Linux 5.4 The Enterprise LINUX Team 2 Purpose 3 Systems Enablement 3 Hardware enablement 4 System tools 5 General features 6 Virtualization 7 Conclusion www.redhat.com
More informationFile Systems Management and Examples
File Systems Management and Examples Today! Efficiency, performance, recovery! Examples Next! Distributed systems Disk space management! Once decided to store a file as sequence of blocks What s the size
More informationThe Complete Performance Solution for Microsoft SQL Server
The Complete Performance Solution for Microsoft SQL Server Powerful SSAS Performance Dashboard Innovative Workload and Bottleneck Profiling Capture of all Heavy MDX, XMLA and DMX Aggregation, Partition,
More informationFebruary, 2015 Bill Loewe
February, 2015 Bill Loewe Agenda System Metadata, a growing issue Parallel System - Lustre Overview Metadata and Distributed Namespace Test setup and implementation for metadata testing Scaling Metadata
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationHeapStats: Your Dependable Helper for Java Applications, from Development to Operation
: Technologies for Promoting Use of Open Source Software that Contribute to Reducing TCO of IT Platform HeapStats: Your Dependable Helper for Java Applications, from Development to Operation Shinji Takao,
More informationPADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute
PADS GPFS Filesystem: Crash Root Cause Analysis Computation Institute Argonne National Laboratory Table of Contents Purpose 1 Terminology 2 Infrastructure 4 Timeline of Events 5 Background 5 Corruption
More informationAgenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.
Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance
More informationArchitecting a High Performance Storage System
WHITE PAPER Intel Enterprise Edition for Lustre* Software High Performance Data Division Architecting a High Performance Storage System January 2014 Contents Introduction... 1 A Systematic Approach to
More informationLUSTRE USAGE MONITORING What the &#%@ are users doing with my filesystem?
LUSTRE USAGE MONITORING What the &#%@ are users doing with my filesystem? Kilian CAVALOTTI, Thomas LEIBOVICI CEA/DAM LAD 13 SEPTEMBER 16-17, 2013 CEA 25 AVRIL 2013 PAGE 1 MOTIVATION Lustre monitoring is
More informationMonitoring Tools for Large Scale Systems
Monitoring Tools for Large Scale Systems Ross Miller, Jason Hill, David A. Dillow, Raghul Gunasekaran, Galen Shipman, Don Maxwell Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory
More informationPTC System Monitor Solution Training
PTC System Monitor Solution Training Patrick Kulenkamp June 2012 Agenda What is PTC System Monitor (PSM)? How does it work? Terminology PSM Configuration The PTC Integrity Implementation Drilling Down
More informationStorage Management. in a Hybrid SSD/HDD File system
Project 2 Storage Management Part 2 in a Hybrid SSD/HDD File system Part 1 746, Spring 2011, Greg Ganger and Garth Gibson 1 Project due on April 11 th (11.59 EST) Start early Milestone1: finish part 1
More informationDiskPulse DISK CHANGE MONITOR
DiskPulse DISK CHANGE MONITOR User Manual Version 7.9 Oct 2015 www.diskpulse.com info@flexense.com 1 1 DiskPulse Overview...3 2 DiskPulse Product Versions...5 3 Using Desktop Product Version...6 3.1 Product
More informationCisco Performance Visibility Manager 1.0.1
Cisco Performance Visibility Manager 1.0.1 Cisco Performance Visibility Manager (PVM) is a proactive network- and applicationperformance monitoring, reporting, and troubleshooting system for maximizing
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationCompute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005
Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005 Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005... 1
More informationNCI National Facility
NCI National Facility Outline NCI-NF site background Root on Lustre Speeding up Metadata Dr Robin Humble Dr David Singleton NCI National Facility For many years have been Australia's premier open supercomputing
More informationPractices on Lustre File-level RAID
Practices on Lustre File-level RAID Qi Chen chenqi.jn@gmail.com Jiangnan Institute of Computing Technology Agenda Background motivations practices on client-driven file-level RAID Server-driven file-level
More informationVirtual Private Systems for FreeBSD
Virtual Private Systems for FreeBSD Klaus P. Ohrhallinger 06. June 2010 Abstract Virtual Private Systems for FreeBSD (VPS) is a novel virtualization implementation which is based on the operating system
More informationNetwork File System (NFS) Pradipta De pradipta.de@sunykorea.ac.kr
Network File System (NFS) Pradipta De pradipta.de@sunykorea.ac.kr Today s Topic Network File System Type of Distributed file system NFS protocol NFS cache consistency issue CSE506: Ext Filesystem 2 NFS
More informationHPC Software Requirements to Support an HPC Cluster Supercomputer
HPC Software Requirements to Support an HPC Cluster Supercomputer Susan Kraus, Cray Cluster Solutions Software Product Manager Maria McLaughlin, Cray Cluster Solutions Product Marketing Cray Inc. WP-CCS-Software01-0417
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu
More informationAlso on the Performance tab, you will find a button labeled Resource Monitor. You can invoke Resource Monitor for additional analysis of the system.
1348 CHAPTER 33 Logging and Debugging Monitoring Performance The Performance tab enables you to view the CPU and physical memory usage in graphical form. This information is especially useful when you
More informationPLUMgrid Toolbox: Tools to Install, Operate and Monitor Your Virtual Network Infrastructure
Toolbox: Tools to Install, Operate and Monitor Your Virtual Network Infrastructure Introduction The concept of Virtual Networking Infrastructure (VNI) is disrupting the networking space and is enabling
More informationJUROPA Linux Cluster An Overview. 19 May 2014 Ulrich Detert
Mitglied der Helmholtz-Gemeinschaft JUROPA Linux Cluster An Overview 19 May 2014 Ulrich Detert JuRoPA JuRoPA Jülich Research on Petaflop Architectures Bull, Sun, ParTec, Intel, Mellanox, Novell, FZJ JUROPA
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
More informationRapidly Growing Linux OS: Features and Reliability
Rapidly Growing Linux OS: Features and Reliability V Norio Kurobane (Manuscript received May 20, 2005) Linux has been making rapid strides through mailing lists of volunteers working in the Linux communities.
More informationSysPatrol - Server Security Monitor
SysPatrol Server Security Monitor User Manual Version 2.2 Sep 2013 www.flexense.com www.syspatrol.com 1 Product Overview SysPatrol is a server security monitoring solution allowing one to monitor one or
More informationRecoveryVault Express Client User Manual
For Linux distributions Software version 4.1.7 Version 2.0 Disclaimer This document is compiled with the greatest possible care. However, errors might have been introduced caused by human mistakes or by
More informationInformatica Corporation Proactive Monitoring for PowerCenter Operations Version 3.0 Release Notes May 2014
Contents Informatica Corporation Proactive Monitoring for PowerCenter Operations Version 3.0 Release Notes May 2014 Copyright (c) 2012-2014 Informatica Corporation. All rights reserved. Installation...
More informationOnline Backup Client User Manual
Online Backup Client User Manual Software version 3.21 For Linux distributions January 2011 Version 2.0 Disclaimer This document is compiled with the greatest possible care. However, errors might have
More informationPATROL Console Server and RTserver Getting Started
PATROL Console Server and RTserver Getting Started Supporting PATROL Console Server 7.5.00 RTserver 6.6.00 February 14, 2005 Contacting BMC Software You can access the BMC Software website at http://www.bmc.com.
More informationHow To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 7 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx
More informationFlexArray Virtualization
Updated for 8.2.1 FlexArray Virtualization Installation Requirements and Reference Guide NetApp, Inc. 495 East Java Drive Sunnyvale, CA 94089 U.S. Telephone: +1 (408) 822-6000 Fax: +1 (408) 822-4501 Support
More informationImproved metrics collection and correlation for the CERN cloud storage test framework
Improved metrics collection and correlation for the CERN cloud storage test framework September 2013 Author: Carolina Lindqvist Supervisors: Maitane Zotes Seppo Heikkila CERN openlab Summer Student Report
More informationHow To Write A Libranthus 2.5.3.3 (Libranthus) On Libranus 2.4.3/Libranus 3.5 (Librenthus) (Libre) (For Linux) (
LUSTRE/HSM BINDING IS THERE! LAD'13 Aurélien Degrémont SEPTEMBER, 17th 2013 CEA 10 AVRIL 2012 PAGE 1 AGENDA Presentation Architecture Components Examples Project status LAD'13
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationAppResponse Xpert RPM Integration Version 2 Release Notes
AppResponse Xpert RPM Integration Version 2 Release Notes RPM Integration provides additional functionality to the Riverbed OPNET AppResponse Xpert real-time application performance monitoring solution.
More informationHigh Performance, Open Source, Dell Lustre Storage System. Dell /Cambridge HPC Solution Centre. Wojciech Turek, Paul Calleja July 2010.
High Performance, Open Source, Dell Lustre Storage System Dell /Cambridge HPC Solution Centre Wojciech Turek, Paul Calleja July 2010 Dell Abstract The following paper was produced by the Dell Cambridge
More informationVistara Lifecycle Management
Vistara Lifecycle Management Solution Brief Unify IT Operations Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid
More informationMonitoring Remedy with BMC Solutions
Monitoring Remedy with BMC Solutions Overview How does BMC Software monitor Remedy with our own solutions? The challenge is many fold with a solution like Remedy and this does not only apply to Remedy,
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
More informationLessons learned from parallel file system operation
Lessons learned from parallel file system operation Roland Laifer STEINBUCH CENTRE FOR COMPUTING - SCC KIT University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
More informationGPFS Storage Server. Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " 4 April 2013"
GPFS Storage Server Concepts and Setup in Lemanicus BG/Q system" Christian Clémençon (EPFL-DIT)" " Agenda" GPFS Overview" Classical versus GSS I/O Solution" GPFS Storage Server (GSS)" GPFS Native RAID
More informationOnline Backup Linux Client User Manual
Online Backup Linux Client User Manual Software version 4.0.x For Linux distributions August 2011 Version 1.0 Disclaimer This document is compiled with the greatest possible care. However, errors might
More informationPartek Flow Installation Guide
Partek Flow Installation Guide Partek Flow is a web based application for genomic data analysis and visualization, which can be installed on a desktop computer, compute cluster or cloud. Users can access
More informationMonitoring the Lustre* file system to maintain optimal performance. Gabriele Paciucci, Andrew Uselton
Monitoring the Lustre* file system to maintain optimal performance Gabriele Paciucci, Andrew Uselton Outline Lustre* metrics Monitoring tools Analytics and presentation Conclusion and Q&A 2 Why Monitor
More informationChapter 3: Operating-System Structures. Common System Components
Chapter 3: Operating-System Structures System Components Operating System Services System Calls System Programs System Structure Virtual Machines System Design and Implementation System Generation 3.1
More informationOnline Backup Client User Manual
For Linux distributions Software version 4.1.7 Version 2.0 Disclaimer This document is compiled with the greatest possible care. However, errors might have been introduced caused by human mistakes or by
More informationPOSIX and Object Distributed Storage Systems
1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome
More informationDeveloping High-Performance, Scalable, cost effective storage solutions with Intel Cloud Edition Lustre* and Amazon Web Services
Reference Architecture Developing Storage Solutions with Intel Cloud Edition for Lustre* and Amazon Web Services Developing High-Performance, Scalable, cost effective storage solutions with Intel Cloud
More informationRelease Notes for Epilog for Windows Release Notes for Epilog for Windows v1.7/v1.8
Release Notes for Epilog for Windows v1.7/v1.8 InterSect Alliance International Pty Ltd Page 1 of 22 About this document This document provides release notes for Snare Enterprise Epilog for Windows release
More informationOnline Backup Client User Manual
For Mac OS X Software version 4.1.7 Version 2.2 Disclaimer This document is compiled with the greatest possible care. However, errors might have been introduced caused by human mistakes or by other means.
More informationGlusterFS Distributed Replicated Parallel File System
GlusterFS Distributed Replicated Parallel File System SLAC 2011 Martin Alfke Agenda General Information on GlusterFS Architecture Overview GlusterFS Translators GlusterFS
More informationInvestigation of storage options for scientific computing on Grid and Cloud facilities
Investigation of storage options for scientific computing on Grid and Cloud facilities Overview Context Test Bed Lustre Evaluation Standard benchmarks Application-based benchmark HEPiX Storage Group report
More informationEMC ISILON AND ELEMENTAL SERVER
Configuration Guide EMC ISILON AND ELEMENTAL SERVER Configuration Guide for EMC Isilon Scale-Out NAS and Elemental Server v1.9 EMC Solutions Group Abstract EMC Isilon and Elemental provide best-in-class,
More informationJason Hill HPC Operations Group ORNL Cray User s Group 2011, Fairbanks, AK 05-25-2011
Determining health of Lustre filesystems at scale Jason Hill HPC Operations Group ORNL Cray User s Group 2011, Fairbanks, AK 05-25-2011 Overview Overview of architectures Lustre health and importance Storage
More informationHigh-Availability and Scalable Cluster-in-a-Box HPC Storage Solution
Intel Solutions Reference Architecture High-Availability and Scalable Cluster-in-a-Box HPC Storage Solution Using RAIDIX Storage Software Integrated with Intel Enterprise Edition for Lustre* Audience and
More information