LLNL's Parallel I/O Testing Tools and Techniques for ASC Parallel File Systems




UCRL-CONF-203489
LAWRENCE LIVERMORE NATIONAL LABORATORY

LLNL's Parallel I/O Testing Tools and Techniques for ASC Parallel File Systems

W. E. Loewe, R. M. Hedges, T. T. McLarty, and C. J. Morrone

April 16, 2004

2004 IEEE Cluster Computing Conference, San Diego, CA, September 20-23, 2004

DISCLAIMER

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.

LLNL's Parallel I/O Testing Tools and Techniques for ASC Parallel File Systems

Bill Loewe, Richard Hedges, Tyce McLarty, Chris Morrone
Lawrence Livermore National Laboratory
April 16, 2004

ABSTRACT

Livermore Computing is an early and aggressive adopter of parallel file systems, including, for example, GPFS from IBM and Lustre for our present Linux systems. As such, we have acquired more than our share of battle scars from encountering bugs in "bleeding edge" file systems that we have pressed into production to serve our customers' massive I/O requirements. A major role of the Scalable I/O Project is to detect errors before our end users do. To do this, we have developed highly parallel test codes to stress and probe potentially weak areas of file system behavior. This paper describes those test programs and how we make use of them.

Keywords: MPI, File-I/O, Disk-I/O, Bandwidth, Testing, Performance.

INTRODUCTION

The Department of Energy's (DOE) Advanced Simulation & Computing (ASC) program has extreme high-performance computational demands that are ushering in increasingly bigger and faster computing systems for scientific research. To provide a means of transfer and storage for the correspondingly vast amounts of data generated as computational rates approach a petaflop, high-performance parallel file systems are a critical system requirement. The Scalable I/O Project (SIOP) at Lawrence Livermore National Laboratory (LLNL) supports ASC-scale applications generating many terabytes of data during an execution. We are engaged in testing the parallel file systems in all phases of their development, in the hope that we will find problems before our users do, and in working with application codes to ensure efficient I/O use of the parallel file system. An important element in the comprehensive support solution is to develop tools and techniques for testing the reliability and performance of the scalable file systems to meet the needs of the ASC program.

TESTING OBJECTIVES

The objective of our testing is to determine the stability, integrity, and performance of our parallel file systems. Our testing is used in the procurement phase of selecting a parallel file system to determine our future file system requirements. During the acceptance phase of bringing in a new parallel file system, we perform tests to determine whether a file system can pass its acceptance tests and fulfill its contract. We also investigate the behavior of the file system so that we may suggest improvements to our ASC partners developing the file system and consider approaches to the file system's efficient use by our customers. And we maintain regression testing of our parallel file systems to document performance and stability after system upgrades and changes, as well as file system aging.

The level of stability is sought by putting a file system under load while scaling the number of tasks accessing either shared or independent files. With shared files, all participating tasks have access to a single, common file, whereas with independent files, each task can only access its own unique file. In our experience, we have seen that though a parallel file system may handle low counts of multiple tasks for either of these approaches, significantly increasing the task count to production levels can uncover issues of stability and robustness.

The correctness and data integrity of a file system is absolutely expected by users. While a file system should be stable and rarely down or unavailable to users, the expectation that the data stored on it will not be corrupted is a fundamental requirement. This must be monitored closely, particularly in the development and stabilization phases of bringing in a new parallel file system. During these phases, the file system software is often temporarily broken in the process of forward progress.

Upon stabilizing a file system and finding it to be robust and correct, the inevitable concern for performance and tuning is addressed. We are interested in the performance of the file system operations involved with metadata activity and data movement. These performance rates can be measured for a parallel file system using direct POSIX operations or through higher-level libraries such as MPI-IO, HDF5, and Parallel-NetCDF. The performance tests used for benchmarking are generally the most used of our testing tools.

With the goal of high-performance I/O for our users, further effort is then made in working with users' applications to ensure proper I/O models. For example, the higher-level libraries of MPI-IO, HDF5, or netCDF can improve the I/O approach for certain applications using large, noncontiguous datasets. A poor approach to I/O in an application can give even the best-performing file system substandard performance and become a bottleneck in the application's use. Unfortunately, I/O does not have high priority for code developers, as they often feel the return on investment is marginal.

TESTING TOOLS

The SIOP uses several tools for its testing objectives. While such serial testing tools as Bonnie++ [1], dbench [2], fsx [3], and IOzone [5] are useful, tools for parallel file systems need process synchronization for large task counts to ensure accurate testing. To aid us in testing parallel file systems, our project has developed the following parallel tools. simul [7] is used to test metadata integrity and stability by performing simultaneous metadata operations. mdtest [6] is used for testing and measuring metadata performance. IOR [4] is a workhorse that tests data performance rates for several access patterns and APIs.

simul

To test simultaneous file system operations from many nodes and processes, the application simul was developed. It is an MPI-coordinated test of parallel file system system calls and library functions that tests the correctness and coherence of a parallel file system. It accomplishes this by having one or more tasks perform standard POSIX operations on either shared or independent files. In total, there are forty tests that simul performs to stress simultaneous metadata operations. For both the shared and independent access modes, there are operations on files and directories. For files, the operations open, close, file stat, lseek, read, write, unlink, rename, creat, truncate, symlink, readlink, link to one file, and link to file-per-process are performed. For directories, the chdir, directory stat, readdir, mkdir, and rmdir operations are tested. simul provides a high instantaneous load of metadata operations, providing a rigorous test of a parallel file system.

mdtest

mdtest allows multiple MPI tasks in parallel to create, stat, and remove shared or independent files. The rate of these operations is reported by mdtest and can be used to determine whether file system performance problems are related to metadata performance. For example, mdtest was used on a large parallel file system at LLNL to test whether the observed sluggishness of the file system was due to aging as it filled near capacity. It appeared that as the parallel file system grew to over 90% full, the time it took to allocate inodes for creating files slowed to the point where it was a tenth of what the system was previously capable of. mdtest allowed the diagnosis and understanding of the file system slowdown, helping us to determine the issues with the metadata performance.

mdtest has several parameters that may be set in testing. It has the option to perform its tests with tasks working in the same or independent directories. Also, it allows the number of files or directories that each task accesses to be set, which can simulate the behavior of various user application codes. Further, it has an option to prevent reading locally cached file or directory information before performing a stat on a newly created file by using a read-your-neighbor node approach. In addition, mdtest has other settings that may be used to test metadata performance behavior.
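The sketch below gives a rough, minimal illustration of the kind of measurement mdtest automates; the directory path, file count, and reported metric are invented for the example and are not taken from mdtest itself. Each MPI task creates, stats, and unlinks its own files in a common directory, and rank 0 reports an aggregate create rate.

    /* Minimal mdtest-style sketch (hypothetical path and file count): each MPI
     * task creates, stats, and unlinks its own files in a shared directory,
     * and rank 0 reports the aggregate create rate.                          */
    #include <mpi.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    #define NFILES  1000
    #define TESTDIR "/p/scratch/mdtest-sketch"     /* placeholder directory   */

    int main(int argc, char **argv)
    {
        int rank, size;
        char name[256];
        struct stat sb;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);               /* start all tasks together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < NFILES; i++) {         /* create phase             */
            snprintf(name, sizeof(name), TESTDIR "/f.%d.%d", rank, i);
            int fd = open(name, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0)
                close(fd);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        for (int i = 0; i < NFILES; i++) {         /* stat and remove phases   */
            snprintf(name, sizeof(name), TESTDIR "/f.%d.%d", rank, i);
            stat(name, &sb);
            unlink(name);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("aggregate create rate: %.0f creates/sec\n",
                   (double)NFILES * size / (t1 - t0));
        MPI_Finalize();
        return 0;
    }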

IOR

IOR is a file system bandwidth testing code developed and used at LLNL. Employing MPI for process synchronization, IOR can be used for testing the bandwidth performance of parallel file systems using various interfaces and access patterns. These interfaces and patterns try to simulate the file system usage of ASC applications. In our arsenal of testing tools, IOR is the most heavily used for placing a high, sustained load on a parallel file system. Its interfaces include the POSIX, MPI-IO, HDF5, and Parallel-NetCDF APIs. Though many ASC applications use the POSIX API for their I/O, some codes are beginning to employ the capabilities of the high-level libraries. IOR attempts to follow the native access patterns inherent in each of these interfaces.

IOR has the capability of either shared or independent file access. As the task count for parallel applications increases, using a file-per-process approach becomes untenable due to the large number of files. Attempting to list a directory containing thousands of files becomes a challenge. To alleviate this overhead of large numbers of files, users' applications are beginning to follow a single, shared-file paradigm. IOR offers this mode to test the performance of shared file access.

IOR also checks for data integrity and reports incidences and locations of data corruption. Upon writing or reading, a check may be performed on the file to determine whether any corruption occurred during the write or read phase of file access. So that the performance results are not skewed, the write and read data checks are not performed during the data transfer measurements. Further, to assure that stored rather than cached data is checked, there is an option to have a different process perform the check than the one that performed the data transfer. These checks have been useful in finding many problems in the file system.

Named for the acronym of "interleaved or random," IOR allows for different access patterns to a shared file. The primary distinction between the two major shared-file patterns is whether each task's data is contiguous or noncontiguous. For the segmented pattern, each task (tasks A and B, for example) stores its blocks of data in a contiguous region of the file. This aaabbb pattern is generally less challenging to a file system, as larger data buffers can be transferred and there are fewer requests and revocations of locks for byte ranges in the file. With the strided access pattern, each task's data blocks are spread out through the file and are noncontiguous. While this ababab pattern may be more natural for storing a multi-dimensional mesh, it can be inefficient from the viewpoint of the file system due to additional locking and smaller data transfers.
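As a rough sketch of the two shared-file patterns (with invented transfer sizes and a placeholder file name), the offset arithmetic below shows how a segmented (aaabbb) and a strided (ababab) layout differ when each task writes through the POSIX interface.

    /* Sketch of the two shared-file patterns through the POSIX interface.
     * Transfer size, transfer count, and file name are illustrative only.    */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        const size_t xfer      = 1 << 20;     /* 1 MiB per transfer            */
        const int    nxfers    = 16;          /* transfers per task            */
        const int    segmented = 1;           /* 1 = aaabbb, 0 = ababab        */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *buf = malloc(xfer);
        memset(buf, 'A' + rank % 26, xfer);   /* rank-specific fill pattern    */

        int fd = open("/p/scratch/ior-sketch.dat", O_CREAT | O_WRONLY, 0644);
        for (int i = 0; i < nxfers; i++) {
            off_t off;
            if (segmented)                    /* task's data is contiguous     */
                off = ((off_t)rank * nxfers + i) * xfer;
            else                              /* tasks interleave transfers    */
                off = ((off_t)i * size + rank) * xfer;
            pwrite(fd, buf, xfer, off);
        }
        close(fd);
        free(buf);
        MPI_Finalize();
        return 0;
    }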

For the MPI-IO, HDF5, and Parallel-NetCDF interfaces, the access may be either collective or independent. With collective operations, multiple tasks participate in a file operation, gathering their I/O requests into a single request. With independent operations, each task's I/O request is independent of the others' requests. IOR allows both the collective and independent operations of the I/O libraries to be exercised, helping to determine which model may be more effective for a particular case. Generally, the collective model offers the greatest benefit for the strided access pattern, wherein data from multiple tasks may be aggregated into a larger access. For the segmented pattern in large files, the collective access may serve only to add communication overhead, and the independent I/O model can be more efficient.
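To make the distinction concrete, the following minimal sketch (file name, transfer size, and transfer count are placeholders) issues the same strided transfers either independently with MPI_File_write_at or collectively with MPI_File_write_at_all; this is roughly the kind of comparison IOR's MPI-IO options allow, not an excerpt of IOR itself.

    /* Sketch of independent vs. collective MPI-IO writes to a strided shared
     * file.  File name, transfer size, and count are placeholders, and error
     * checking is omitted for brevity.                                        */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count      = 1 << 18;       /* ints per transfer (1 MiB)     */
        const int nxfers     = 8;             /* transfers per task            */
        const int collective = 1;             /* 1 = write_at_all, 0 = write_at */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *buf = malloc((size_t)count * sizeof(int));
        for (int i = 0; i < count; i++)
            buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/p/scratch/mpiio-sketch.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        for (int i = 0; i < nxfers; i++) {
            /* strided (ababab) placement; offsets are in bytes */
            MPI_Offset off = ((MPI_Offset)i * size + rank)
                             * (MPI_Offset)count * sizeof(int);
            if (collective)
                /* all tasks enter together; the library may aggregate requests */
                MPI_File_write_at_all(fh, off, buf, count, MPI_INT,
                                      MPI_STATUS_IGNORE);
            else
                /* each task's request is serviced on its own */
                MPI_File_write_at(fh, off, buf, count, MPI_INT,
                                  MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }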

Among the other features of the testing code is the ability to pass hints to the MPI-IO library and make ioctl() calls as necessary for specific file systems. Also, IOR may be executed from the command line for ease in scripting or with a GUI for interactivity. One of its most important features is its ability to be modified quickly for an immediate, one-off test. The code has become a powerful Swiss Army knife with more features available than mentioned here.

In addition to testing file system behavior and simulating the applications' I/O access patterns, IOR has even been used to find bugs in I/O library implementations. For example, in an implementation of the MPI-IO library, IOR helped to diagnose a problem in accessing MPI elemental datatypes (etypes). The MPI standard declares that units for counting file offsets should be in etypes, but the implementation was using only the byte unit for calculating all offsets, as done in POSIX. IOR noticed that the finished file size was inaccurate for the test and was able to help track down the cause.

TESTING TECHNIQUES

As the goal of the parallel file system is to keep I/O from being in the way of the applications, approaches to understanding the interaction of the file system and the user code have been considered. With the understanding gained, the SIOP can suggest improvements to the file system design as application needs are clarified and also uncover inefficient I/O models in the user code that are inherently difficult for a file system.

A simple technique to monitor all I/O activity on an I/O node of a parallel file system is with UNIX's iostat. It can be used to determine the load on an I/O device to find any bottlenecks and diagnose file system behavior. For example, were the I/O device showing a high block write or read rate, it would be reasonable to assume that the file system was working near capacity and only slowed by the physical hardware. But were this rate lower than expected, the bottleneck in performance would likely be elsewhere, either in the file system design or in the application's I/O usage.

For a detailed look into an application's I/O calls, Linux's strace or AIX's truss can be used. These tools provide a trace of all system calls and their parameters. While the amount of trace data can be quite large, the I/O behavior of an application can be learned by data mining the output. As an example of strace's use by our project, an investigation of an application with a code developer turned up thousands of 32-byte write calls. It had not been clear that the application was making so many and such small transfers. This type of inefficient I/O behavior can reduce file system performance significantly.

In another situation, where a particular user's code was triggering file system errors, we used strace to estimate input parameters for IOR to model that code's I/O behavior. We knew that the total file size was ~20 GB, that 49 processes were active in the I/O, and that the transfers were interleaved from the processes. The tracing showed that most of the data was transferred in ~1.8 MB transfers. That information was sufficient to specify a set of input parameters to the IOR program in order to model the user's I/O. When we were able to reproduce the file system error with the IOR test code, we had considerably simplified subsequent testing of that problem in the following ways: 1) the bug could now be demonstrated with a test program that was well known to the test and development teams; 2) the problems in dealing with any sensitivities of the original program or resultant data were eliminated.

ISSUES

CACHE

One of the major issues affecting our strategy in testing a file system is how to account for caching. In particular, as the Linux OS is aggressive in its use of cache, obtaining accurate measurements for write/read performance or reliable results for a data check can be challenging. For example, data that is expected to have been written to storage media is often cached locally or in the device's cache. Actually syncing all the data to storage can be difficult. We have used different approaches in our attempts to defeat cache.

The first approach works for local cache. Quite simply, it is to reduce the amount of memory the operating system can access. By setting mem=512m, for example, much of the unused memory that the cache could use is made inaccessible. A drawback, however, is that this approach requires rebooting the nodes with the lower setting for available memory. What is gained is that the measured write performance is closer to the bandwidth to storage than to the memory bandwidth. Since the cache is small, testing the read performance after the write performance is more accurate, too.

Another approach is to have a neighbor node read back the data, such that the written data resides only in the original node's cache, not in the neighbor's. This read-your-neighbor approach works well for reads, but it does little for the cached data on writes.
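A minimal sketch of the read-your-neighbor idea follows; the file name, segment size, and fill pattern are invented for the example. Each task writes its own segment of a shared file, and after a barrier verifies the segment written by the next task, so the verification read is unlikely to be satisfied from the reader's own cache.

    /* "Read your neighbor" sketch: each task writes its own segment of a
     * shared file, then verifies the segment written by the next task, so the
     * verification read is unlikely to come from the reader's own cache.
     * File name, segment size, and fill pattern are illustrative.             */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        const size_t seg = 4UL << 20;          /* 4 MiB segment per task       */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *wbuf = malloc(seg), *rbuf = malloc(seg);
        memset(wbuf, 'A' + rank % 26, seg);    /* rank-specific pattern        */

        int fd = open("/p/scratch/check-sketch.dat", O_CREAT | O_RDWR, 0644);
        pwrite(fd, wbuf, seg, (off_t)rank * seg);   /* write own segment       */
        fsync(fd);
        MPI_Barrier(MPI_COMM_WORLD);                /* all writes completed    */

        int neighbor = (rank + 1) % size;           /* shift reader by one     */
        pread(fd, rbuf, seg, (off_t)neighbor * seg);
        char expected = 'A' + neighbor % 26;
        for (size_t i = 0; i < seg; i++) {
            if (rbuf[i] != expected) {
                fprintf(stderr, "task %d: corruption at file offset %zu\n",
                        rank, (size_t)neighbor * seg + i);
                break;
            }
        }

        close(fd);
        free(wbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }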

Since we usually cannot reset the available memory and reboot a node on a production system, and we generally cannot access and reduce the cache on the devices (be they I/O nodes, object storage targets, etc.), the most common approach to reducing the effect of caching is to increase the test file size to the point where the data cannot be cached either locally or remotely. While this works, it has the disadvantage of increasing the test time by moving more data. Since testing time is always limited, our testing becomes less comprehensive and more likely to miss file system problems. Therefore, handling caching incorrectly can have an impact on file system performance testing.

Of course, the value of these approaches to the issue of caching is debatable as well. One line of thought is that if the testing code is designed to simulate the application's I/O behavior and caching improves the performance, then the testing code using caching is capturing the true performance as seen by the application. While this has merit, the testing code and the application are likely to use significantly different amounts of memory, thereby leaving different amounts of cache available. Disabling caching to the extent possible levels the playing field.

POSIX vs. MPI-IO

The data stream model of a file used by the POSIX standard can deliver very high performance if the access pattern used by the application looks like a stream of data. However, many applications, especially parallel ones, access data in smaller chunks in a strided pattern through the file. I/O performance with this type of access pattern is usually much slower than the file system capability unless some means is used to coordinate the data flow to or from the file system. System caching can reduce the strided effect in some cases, but if each or most I/O operations cause a disk seek, caching is of little help.

MPI-IO provides a way to do the data coordination for a parallel application in a standard, portable way. When MPI datatypes and collective operations are used, a much greater fraction of the file system capability is delivered by aggregating smaller chunks of data over the high-speed communication structure and doing larger, well-aligned I/O operations. The IOR program includes an MPI-IO option so that the data coordination capabilities of a particular MPI-IO implementation can be tested systematically.
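As a rough illustration of the data coordination described above, the sketch below (chunk count, chunk size, and file name are invented for the example) uses an MPI_Type_vector file view to describe an ababab layout and a collective MPI_File_write_all so the MPI-IO layer is free to aggregate the small chunks. This is a generic MPI-IO idiom, not necessarily the exact mechanism of any particular implementation or of IOR.

    /* Sketch of describing a strided (ababab) layout with an MPI datatype
     * file view so a collective write lets the MPI-IO layer aggregate the
     * small per-task chunks.  Chunk count, chunk size, and file name are
     * illustrative.                                                           */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int chunk   = 64 * 1024;         /* ints per chunk (256 KiB)     */
        const int nchunks = 32;                /* chunks per task              */
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each task owns every size-th chunk of the file */
        MPI_Datatype filetype;
        MPI_Type_vector(nchunks, chunk, chunk * size, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        int *buf = malloc((size_t)nchunks * chunk * sizeof(int));
        for (int i = 0; i < nchunks * chunk; i++)
            buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/p/scratch/strided-sketch.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* the displacement places this task's first chunk; the file view then
         * repeats with a stride of one chunk per task                         */
        MPI_Offset disp = (MPI_Offset)rank * chunk * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* collective write: the library may gather the chunks into larger,
         * well-aligned requests to the file system                            */
        MPI_File_write_all(fh, buf, nchunks * chunk, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }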

FUTURE WORK

In general, our I/O work is driven by the problems and issues found as they present themselves. This can range from procurement to regression testing and from file system debugging to user application issues. Despite having a tendency to be interrupt-driven, however, our project also makes sure to prepare preemptively for coming issues as we foresee them. This can include impending file system design changes and improving our knowledge of how applications perform I/O.

CONCLUSION

The parallel testing tools developed in the SIOP offer a significant improvement in scaling the testing of ASC parallel file systems over the existing serial tests. As we are on the forefront of parallel file system testing, our understanding of the issues these file systems face on large clusters is still expanding, and new problems are surfacing. The tools and techniques we have thus far developed are an excellent start to solving some of the problems in parallel file system testing, but only a beginning.

The testing tools IOR, mdtest, and simul have been released under the GNU General Public License and can be found at: http://www.llnl.gov/icc/lc/siop/downloads/download.html

Our project website is located at: http://www.llnl.gov/icc/lc/siop/

ACKNOWLEDGMENTS

We thank Brian Behlendorf, Ryan Braby, Matt Davis, Dave Fox, Terry Heidelberg, Linda Stanberry, and Andrew Uselton for their design considerations, contributions, usage, bug fixes, feedback, and sage advice with these testing tools and strategies.

AUSPICES STATEMENT

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.

REFERENCES

[1] Bonnie++ [2001]. http://www.coker.com.au/bonnie++/.
[2] dbench [2001]. http://freshmeat.net/projects/dbench.
[3] fsx [1991]. http://www.codemonkey.org.uk/cruft/fsx-linux.c.
[4] IOR [2003]. UCRL-CODE-2003-016. http://www.llnl.gov/icc/lc/siop/downloads/download.html.
[5] IOzone [2003]. http://www.gnu.org/directory/sysadmin/monitor/iozone.html.
[6] mdtest [2003]. UCRL-CODE-155800. http://www.llnl.gov/icc/lc/siop/downloads/download.html.
[7] simul [2003]. UCRL-CODE-2003-019. http://www.llnl.gov/icc/lc/siop/downloads/download.html.