Technical Paper

A Survey of Shared File Systems: Determining the Best Choice for Your Distributed Applications

Updated 22 Oct 2014


Table of Contents

Introduction
Retaining File System Data in Memory
Maintenance of File System Metadata
Implications for Physical Resources
Workloads
Large Sorts
Demand Model Forecasting and Calibration
Configuration
File System Configuration
SAS System Options
IBM Elastic Storage (aka General Parallel File System or GPFS)
IBM General Parallel File System File Placement Optimizer (GPFS FPO)
NFS
Global File System 2 (Red Hat Resilient Storage)
Common Internet File System (CIFS)
Quantum StorNext
Intel Enterprise Edition for Lustre
Veritas Cluster File System


Introduction

A shared file system is a required and integral component of all SAS Grid Manager deployments, Enterprise Business Intelligence deployments with load-balanced servers on multiple systems, and other types of distributed SAS applications. In order to determine which shared file system is the best choice for a given deployment, it is important to understand how SAS software interacts with the file system and how a shared file system might behave differently compared to a non-shared file system. The important characteristics of shared file systems with respect to SAS performance are:

- whether the file system data is retained in memory in a local file cache
- handling of file system metadata
- implications for the physical resources

This paper briefly describes these aspects and examines the behavior of several different shared file systems in the context of performance of a representative deployment.

Retaining File System Data in Memory

Data read from storage such as a solid-state device or disk is referred to as a physical read. When the data is retained in and read from the local file cache, the reads are called logical reads. Programs that perform logical reads from data that has already been read physically into memory perform substantially better than programs that must perform physical reads. SAS programs often read the same data multiple times. If the data is retained in a local file cache, the program can perform logical reads from the local file cache much faster and more efficiently than re-reading it from the physical device. Operations that access small files tend to get a substantial performance benefit from the local file cache. Files that are larger than the size of the cache memory cannot be cached in their entirety and may not be cached at all. Most non-shared file systems will retain data in the local file cache until the memory is needed for some other purpose. In comparison, shared file systems implement a wide variety of rules that govern the retention of data in the local file cache. Some file systems allow tuning of the local file cache, providing options for limiting the amount of memory that can be used for file cache or the number of files kept in the local file cache, for example.

Maintenance of File System Metadata

File system metadata includes such information as lists of files in a directory, file attributes such as file permissions or file creation date, and other information about the physical data. Shared file systems differ in how they maintain file system metadata; they are required to make information about file system metadata and file locking available to all systems participating in the shared file system. File system metadata is updated whenever a file is created, modified, deleted, or extended, when a lock is obtained or dropped, and, on some file systems, when a file is accessed. The sizes of files created by SAS programs are not known when the files are first created. As the file is written, additional file system blocks, or extents, are requested. Each extent request results in an update to file system metadata.

In a shared file system, file system metadata changes are coordinated across systems, so there is more overhead associated with file system metadata with a shared file system compared to a non-shared file system. Due to the differences in local file cache behavior and the additional overhead required to maintain file system metadata, programs that performed well with a non-shared file system may behave quite differently with a shared file system. To minimize overhead from the shared file system, a general recommendation is to maintain SAS WORK directories in a non-shared file system and place only permanent data in a shared file system. (A shared file system may still be necessary for GRIDWORK.) SAS WORK files are not usually shared and are considered to be private to the process that created the directory, so there is no benefit for SAS WORK files to reside in a shared file system. SAS WORK files also tend to generate a higher number of file system metadata updates because files are continually created, extended, and deleted, which can greatly increase the overhead of coordinating file system metadata in the shared file system. In addition, a non-shared file system is more likely to retain SAS WORK files in the local file cache.

Implications for Physical Resources

Shared file systems require coordination between multiple host systems and place more demand on the network and storage devices. This means that the network and storage devices must be provisioned appropriately. In addition, less efficient file cache management may result in more data transfer and more metadata requests, both of which may result in more physical I/O. The storage subsystem may need to support both a higher number of I/O operations and a higher throughput rate compared to the needs of a non-shared file system.

Workloads

In order to characterize different shared file systems and storage devices, we developed a small set of internal benchmarks to exercise typical SAS software scenarios. The following is a description of two of the workloads that are being used to study a variety of shared file systems. The outcomes for each of those shared file systems are described later in the paper.

Large Sorts

Large sorting workloads measure throughput of the file/storage subsystem. This workload is intended to push the I/O subsystem bandwidth to its limit. The workload launches multiple simultaneous sorts, typically 1 to 2 per CPU core. The files used by this workload are typically larger than system memory and will not be retained in the local file cache. File system metadata is exercised lightly.

Demand Model Forecasting and Calibration

This workload runs a configurable number of concurrent SAS processes that execute a demand model calibration process. The workload emulates a SAS solution deployed in a distributed environment using shared input data and a shared output analytic data mart. The SAS programs are written using the SAS macro language and create many small to moderate datasets.

There are 750 models in the workload, and each model is calibrated by an individual SAS process. Each of these processes creates about 100 permanent output datasets and 7,000 temporary files. Many solutions, including SAS Drug Development, SAS Warranty Analysis, SAS Enterprise Miner, and others behave in a similar manner. This workload exercises the management of file system metadata, and some shared file systems have failed with what we consider a representative load. This workload tends to benefit a great deal from the file cache, because there are many small files that are read multiple times.

Configuration

It is important to configure your system appropriately before running SAS applications. Below we discuss the file system and SAS options used for our shared file system studies.

File System Configuration

Some shared file systems are available as NAS (network attached storage) rather than SAN (storage area network). With 10 GbE and other high-bandwidth network connectivity, NAS devices have become much more attractive because much higher throughput rates can be achieved compared to lower-bandwidth connectivity such as 1 GbE. However, even with the higher-bandwidth networks, throughput may not be sufficient for systems with a high core count. Systems with 12 or more cores may not be fully utilized due to constraints and latency of the network.

There are file system options available on most UNIX-based and Windows-based systems that enable or disable recording when a file is accessed. Unless the workload specifically requires information about when a file was accessed or modified, disabling this metadata update with shared file systems reduces some of the load on the file system. On UNIX-based systems, mounting file systems with the noatime option disables the recording of file access times. The noatime option improved the performance of all benchmarks and is a recommended setting if accurate file access time is not required for correct program execution. On Microsoft Windows operating systems, setting the registry value HKLM\System\CurrentControlSet\Control\FileSystem\NtfsDisableLastAccessUpdate to 1 also disables this type of file system metadata update.
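As a minimal sketch of applying these access-time settings (the /sasdata mount point is a hypothetical placeholder, and the exact mount procedure varies by shared file system):

  # UNIX example: include noatime in the mount options for the SAS data file system
  mount -o remount,noatime /sasdata

  # Windows example: disable NTFS last-access-time updates (takes effect after a restart)
  reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v NtfsDisableLastAccessUpdate /t REG_DWORD /d 1 /f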

SAS System Options

Where possible, the SAS options are set to coordinate well with the file system page size and device stripe size. The option ALIGNSASIOFILES aligns all datasets, including SAS WORK datasets, on the specified boundary. It performs the same alignment as the LIBNAME option OFFSET= specified for non-WORK libraries. Other options align SAS I/O requests to the boundary specified by the buffer size. I/O alignment may reduce the number of physical I/O operations. For shared file systems, these options are particularly useful because many shared file systems will not coalesce unaligned I/O requests. Misalignment may cause a much higher number of requests, either over a network or to the I/O device, compared to aligned requests. The following is a description of the SAS Foundation 9.3 options used to tune performance of the benchmarks:

CPUCOUNT=2
These benchmarks all ran multiple SAS processes concurrently. The option CPUCOUNT was set to 2 in order to minimize contention for CPU resources.

MEMSIZE=1G
The option MEMSIZE was set to 1G to minimize contention for memory resources.

SORTSIZE=768m
The option SORTSIZE was set to 75% of MEMSIZE.

ALIGNSASIOFILES
The option ALIGNSASIOFILES aligns access to SAS datasets on a specific boundary. The boundary is defined by BUFSIZE. This allows a file system to align read and write requests and avoid split I/O operations.

BUFSIZE=128k
The option BUFSIZE sets the read/write size of each request made to SAS datasets.

UBUFSIZE=128k
The option UBUFSIZE sets the read/write size of SAS utility file requests.

IBUFSIZE=32767
The option IBUFSIZE sets the read/write size of SAS index file requests.

IBUFNO=10
Some of the workloads utilize indexes. Setting IBUFNO to 10 reduced the number of file system requests for index traversal.

FILELOCKS=NONE
By default, on UNIX systems the SAS option FILELOCKS=FAIL is set. This setting obtains read and write file locks on non-WORK files to protect programs from making concurrent writes to a file or from reading data that may be in an inconsistent state. File systems make metadata requests in order to manage locks. The benchmarks were run with both FILELOCKS=FAIL and FILELOCKS=NONE in order to determine the overhead associated with file locks.

In order to reduce metadata requests, there are SAS options that eliminate file lock requests. For data that is not shared, is only read, or if the application is designed not to need file locks, setting the option FILELOCKS=NONE will reduce file system metadata requests. Also, some file systems will flush files from memory when any file attribute changes. This means files accessed with a lock may be forced out of the local file cache and re-read from storage. The overhead from managing file locks and the change in file cache retention can be substantial, so limiting the use of file locks to just those files and libraries that require concurrent read/write access is recommended for best performance. File locks are necessary to avoid corrupting data or accessing data in an inconsistent state. Examples of situations that require the use of file locks are:

- concurrent processes attempting to update the same file
- a single process updating a file while other processes attempt to read it concurrently
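Collected into a SAS configuration file such as sasv9.cfg, the benchmark options above might look like the following sketch. This is only an illustration of the values used for the benchmarks; adjust them to your own environment, and leave FILELOCKS at its default of FAIL for libraries that are updated and read concurrently.

  /* benchmark SAS system options described above (sketch) */
  -CPUCOUNT 2
  -MEMSIZE 1G
  -SORTSIZE 768M
  -ALIGNSASIOFILES
  -BUFSIZE 128K
  -UBUFSIZE 128K
  -IBUFSIZE 32767
  -IBUFNO 10
  -FILELOCKS NONE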

We are in the process of running these workloads on a variety of shared file systems. The following sections describe the file systems benchmarked to date.

File System Performance Studies

In this section we discuss the shared file systems studied to date. We plan to study additional file systems and report those results as well.

IBM Elastic Storage (aka General Parallel File System or GPFS)

As of the release of version 4.1, IBM is rebranding GPFS as IBM Elastic Storage. For the purposes of this paper, IBM Elastic Storage and GPFS are synonymous. Working together, SAS and IBM have found that, due to certain bug fixes and performance improvements, SAS-based deployments benefit from running a current GPFS release level; we advise installing at or moving to such a level when possible.

The IBM General Parallel File System (GPFS) performed well on both Red Hat Enterprise Linux (RHEL) and Microsoft Windows operating systems. Both permanent data and SAS WORK files were managed by GPFS with excellent throughput and low overhead for file system metadata management. GPFS requires dedicated system memory; the amount of memory is defined by the file system option PAGEPOOL=. The client systems used in the benchmarks had 96 GB of memory, and 32 GB of it was dedicated to the page pool. It is likely that client computer systems that utilize both a non-shared file system and GPFS will require additional memory to allow both file systems to perform adequately.

The workloads were run with several GPFS configurations on RHEL 5.7 and 6.1 and Windows Server 2008 R2. GPFS is also deployed by many customers with SAS software on AIX. For RHEL, two configurations were run: a NAS configuration and a SAN configuration. The NAS configuration used a DataDirect Networks SFA10K-E GridScaler Appliance connected via 10 GbE with multiple GPFS clients running on RHEL 6.2 and Windows Server 2008 R2. The SAN configuration had SAS software running on the same systems as the GPFS NSD (network shared disk) servers, using fibre channel connections to a storage device. Both configurations performed well. There was no measurable overhead using the default setting FILELOCKS=FAIL.

For both the SAN and NAS configurations, GPFS was able to transmit data at a rate limited only by the throughput rate of the physical device. This rate was sustained during the large sort workload and achieved periodically for the calibration workload. The calibration workload ran 100 to 250 concurrent SAS processes spread over six 12-core systems. Maximum throughput was achieved and sustained, and file system metadata overhead stayed at a manageable and responsive level.

SAS data access tends to be sequential. The GPFS file system was created with the "cluster" allocation method so that data blocks would be allocated close to one another. The documentation for GPFS specifies that cluster allocation should be used for moderate configurations, but it behaved much better for both workloads compared to the "scatter" allocation method. GPFS allows separation of file system metadata from file system data. For SAS software, the access pattern to file system metadata is dramatically different from the access pattern for data. File system metadata access tends to be random access of small blocks, while file system data access tends to be sequential access of larger blocks.
We found that creating separate volumes for file system metadata and placing them on solid-state devices (SSDs) provided a significant improvement in the responsiveness of the file system. The file system was created with a block size of 256 KB. Because the SAS system options BUFSIZE=128k and ALIGNSASIOFILES were specified, each dataset occupies a minimum of 256 KB, with the first block containing SAS header information and the next block containing data.
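As a sketch of how such a file system might be created, metadata-only NSDs on SSD can be combined with a 256 KB block size and the cluster allocation method. The NSD names, devices, and file system name below are hypothetical, and the exact stanza format should be confirmed against the mmcrnsd and mmcrfs documentation for your GPFS release.

  # contents of sas_nsd.stanza (sketch): SSD NSDs hold metadata only, disk NSDs hold data only
  #   %nsd: device=/dev/sdb nsd=ssd_meta01 usage=metadataOnly
  #   %nsd: device=/dev/sdc nsd=hdd_data01 usage=dataOnly
  mmcrnsd -F sas_nsd.stanza

  # create the file system with a 256 KB block size and the "cluster" allocation method
  mmcrfs sasfs -F sas_nsd.stanza -B 256K -j cluster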

The following GPFS options were changed from their default values:

maxmbps=
The GPFS option MAXMBPS= was set to approximately twice the bandwidth available to any single system. For example, for the systems connected via 10 GbE, the value was set to 2000 (2 GB/sec).

pagepool=
The SAS software and all data resided in GPFS, so system memory was primarily used for either SAS processes or the GPFS file cache. The GPFS option PAGEPOOL= determines the size of the file cache and was set to 32 GB.

seqdiscardthreshold=
The GPFS option SEQDISCARDTHRESHOLD= limits the size of a file that is read sequentially and will be retained in the file cache. This option was set to approximately 4 GB. This value was chosen because the input data for the calibration workload is shared among many processes. If one process has already read the file, other processes can consume the same data without requiring physical I/O. However, if processes are not sharing data, a smaller value is recommended so that a few large files will not dominate the file cache.

prefetchpct=
Because SAS software usually reads and writes files sequentially, we increased the GPFS option PREFETCHPCT= from the default 20 percent to 40 percent. This option controls the amount of the page pool that is used for read-ahead and write-behind operations.

maxfilestocache=
The calibration workload creates a high number of relatively small SAS WORK files. The GPFS option MAXFILESTOCACHE= was increased from its default of 1000. The system should be monitored using a program like mmpmon to determine the appropriate value for this option.
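These are cluster configuration parameters and, as a minimal sketch, might be applied with the mmchconfig command. The parameter spellings and the maxFilesToCache value shown here are assumptions for illustration; confirm the exact names with mmlsconfig on your GPFS release and size maxFilesToCache from mmpmon monitoring.

  # apply the GPFS tuning described above (sketch); -i makes the change immediately
  # where the parameter allows it, otherwise restart GPFS on the affected nodes
  mmchconfig pagepool=32G,maxMBpS=2000,prefetchPct=40 -i
  mmchconfig seqDiscardThreshold=4294967296   # approximately 4 GB, in bytes; spelling varies by release
  mmchconfig maxFilesToCache=60000            # example value only; size this from mmpmon data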

Overall, this file system behaved very well, with the file cache benefiting performance and little overhead for maintaining file system metadata. The measured workloads had both permanent and SAS WORK files managed by GPFS. For a more detailed treatment of IBM Elastic Storage, see the paper SAS Grid Manager I/O: Optimizing SAS Application Data Availability for the Grid.

IBM General Parallel File System File Placement Optimizer (GPFS FPO)

IBM introduced the File Placement Optimizer (FPO) functionality with GPFS release 3.5. FPO is not a product distinct from Elastic Storage/GPFS; rather, as of version 3.5, it is an included feature that supports the ability to aggregate storage across multiple servers into one or more logical file systems across a distributed shared-nothing architecture. Use of FPO brings along all the benefits of GPFS and additionally provides (1) a favorable licensing model and (2) the ability to deploy SAS Grid Manager in a shared-nothing architecture, reducing the need for expensive enterprise-class SAN infrastructure. GPFS FPO provides block-level replication at the file system level, allowing continuity of operations in the event of a failed disk drive or a failed node. GPFS FPO extends the flexibility of GPFS with configuration options that provide fine control over the level of replication. FPO also allows a level of control over how the initial and replica blocks are written across the cluster. For example, data written to SAS WORK will be read by the same node where it was written; therefore, the ability to specify that the first block is written to the requesting node benefits SAS performance.

There are two additional performance-related concerns with FPO beyond those of GPFS without the FPO option:

- Ensure that each node has a sufficient quantity of disks to meet SAS recommended I/O throughput requirements based upon the number of cores, plus the overhead needed to write replica blocks from other nodes in the environment.
- Ensure that sufficient connectivity exists so that replica blocks can be written to other nodes without throttling I/O throughput.

SAS has found that, when coupled with properly provisioned hardware, GPFS FPO successfully meets the I/O demands of a SAS Grid Manager implementation. For a more robust treatment of GPFS FPO functionality and SAS, please see the SAS Grid Manager A Building Block Approach white paper available in the Grid Computing Papers section of the Scalability and Performance focus area of support.sas.com.

NFS

The needs of SAS Grid Manager customers vary widely based upon workload, budget, processing windows, and so on. Because of these variances, it is not possible to make a blanket best-practice recommendation for a shared file system. NFS behaviors described in this document make the choice of NFS less desirable in demanding implementations; however, there are SAS customers who use NFS to their satisfaction. Certainly the use of well-engineered and properly sized NFS appliances for SASDATA is an option used by customers and for some implementations within SAS. The use of NFS for SASWORK where performance is among the concerns is strongly discouraged, due to the number of customers who have found this configuration unacceptable for performance reasons.

NFS client and server implementations show a wide variety of behaviors that affect performance. For that reason, if a customer chooses to use NFS, the specific client and server should be measured to ensure performance goals can be met. The performance and scalability of NFS are limited by the NFS protocol itself. If a customer implements with NFS and subsequently finds that the performance does not meet business needs, there is little that SAS can do to help. In this situation the customer will need to migrate to a shared file system that better aligns with their business needs.

The NFS client maintains a cache of file and directory attributes. The default settings will not ensure that files created or modified on one system will be visible on another system within a minute of file creation or modification. The default settings may cause software to malfunction if multiple computer systems are accessing data that is created or modified on other computer systems. For example, if a workspace server on system A creates a new SAS dataset, that file may not be visible on system B within one minute of its creation. In order to ensure a consistent view of the file system, the mount option ACTIMEO= (attribute cache timeout) should be set to 0. This setting will increase the number of requests to the NFS server for directory and file attribute information, but it will ensure that the NFS client systems have a consistent view of the file system.
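As a sketch of an NFS mount that applies this guidance (the server name, export path, and mount point are hypothetical, and the NFS version and other options should match your environment):

  # /etc/fstab entry (sketch): actimeo=0 gives a consistent view across clients at the
  # cost of more attribute requests; noatime reduces metadata load
  nfsserver:/export/sasdata  /sasdata  nfs  rw,hard,noatime,actimeo=0  0 0
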
File data modifications may not be visible on any NFS client system other than the one where the modifications are being made until an NFS commit is executed. Most NFS clients will issue a commit as part of closing a file. If multiple systems are reading files that may be modified, file system locking should be used. This is controlled by the SAS system option FILELOCKS=, the SAS library option FILELOCKS=, or the SAS LOCK statement.

In order to ensure file data consistency, when an NFS client detects a change in a file system attribute, any data in the local file cache is invalidated. The next time the file is accessed, its data will be retrieved from the NFS server. This means that retention in the file cache may behave much differently with an NFS file system compared to other file systems, and the file system storage devices and network must be provisioned to handle a larger demand compared to either a local file system or a shared file system that uses a different strategy for cache coherency.

Some NFS clients treat locks as file attributes. Therefore, obtaining any lock, including a read/share lock, may invalidate any data that is in the local file cache and force the NFS client to request file data from the server. On UNIX systems, files in the SAS WORK directory are accessed without file locks, because those files are assumed to be private to a process. By default, the SAS system requests read and write file locks for all other files. For best performance, it is important to use locks only when necessary in order to avoid invalidating the file cache and making additional file data requests of the NFS server. File locks are needed when data can be updated by multiple concurrent processes or updated by one process and read by other concurrent processes. File locks prevent data corruption and keep processes from reading data in an inconsistent state. The use of file locks with NFS may have a much greater effect on overall system performance compared to other file systems. This behavior should be considered when sizing the I/O subsystem.

NFS performed reasonably well on RHEL 5.7 and 6.1 in a variety of configurations and with a variety of storage devices. The clients and servers were connected with 10 GbE, which provided sufficient bandwidth and throughput for reasonably high CPU utilization. The large sort workload gets very little benefit from the local file cache, so its performance did not change with either the SAS FILELOCKS default setting or with FILELOCKS=NONE. The calibration workload gets a large benefit from the local file cache, and the overall elapsed time for the job increased by 25% or more with the SAS default setting for file locks compared to running with FILELOCKS=NONE. Please note that file locks were only requested for permanent data and that SAS WORK files were accessed with no file locks. For SAS programs that do not rely on file system locks to prevent multiple concurrent read/write access to a file, we recommend FILELOCKS=NONE to improve NFS's local file cache behavior. With FILELOCKS=NONE, file cache retention is improved.

The benchmarks were run on a variety of devices, including EMC Isilon and Oracle Exalogic. In general, there are few tuning options for NFS, but NFS benefits from isolating file system metadata from file system data. Some storage devices like Isilon allow this isolation. Within the continuum of NFS performance, better results were obtained from devices that both isolate file system metadata and utilize SSDs for file system metadata. Some NFS devices generated errors under the load generated by the calibration workload. SAS R&D is working with several vendors to resolve these issues. SAS technical support is aware of these issues and can be contacted to determine if there are any known issues with any particular device.

NFS can be a reasonable choice for small SAS Grid Manager implementations and/or when performance is less of a concern for the customer. NFS appliances have differing architectures and therefore differing performance characteristics. In general, it is not a good idea to place SASWORK on an NFS file system where performance is a concern. Customers who are concerned about performance are strongly advised to consider more performant options, such as the other shared file system choices outlined in this paper. If you are not taking advantage of checkpoint/restart functionality, using NFS for SASDATA and local SSDs for SASWORK, for example, can provide performance advantages; however, such an arrangement comes with the tradeoff of needing to manage the mix of jobs and the concurrent demand for file space on the local storage.
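As noted above, lock requests can be limited to the data that actually needs them. A small sketch (the library names and paths are hypothetical) using the library-level FILELOCKS= option mentioned earlier:

  /* shared input data that is only read: skip lock requests to preserve the NFS file cache */
  libname refdata '/sasdata/reference' access=readonly filelocks=none;

  /* output data mart that several processes update concurrently: keep the default locking */
  libname mart '/sasdata/mart' filelocks=fail;
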
Global File System 2 (Red Hat Resilient Storage)

Global File System 2 (GFS2) is a cluster file system supported by Red Hat. GFS2 is provided in the RHEL Resilient Storage Add-On. Note that Red Hat strongly recommends that an architecture review be conducted by Red Hat prior to the implementation of a RHEL cluster. At the time of this update, the cost of the architecture review is covered by the GFS2 support subscription with Red Hat. To initiate an architecture review, open a support ticket with Red Hat requesting an architecture review for Resilient Storage.

A previous release of this paper indicated performance concerns with GFS2; however, improvements delivered with RHEL errata patches through mid-March 2013 have provided marked enhancements to the performance of GFS2. If SAS WORK directories need to be shared, the recommended configuration is to place those directories on a GFS2 file system distinct from the GFS2 file system used for shared permanent files. GFS2 is limited to 16 nodes in a cluster.

The workloads were run using the tuned utility with the enterprise-storage profile. This profile configures the system to use the deadline I/O scheduler and sets the kernel dirty page ratio (vm.dirty_ratio) to 40. These settings greatly improved the performance of the workloads. To improve the behavior of the Distributed Lock Manager (DLM), the following parameters were set to 16384:

- DLM_LKBTBL_SIZE
- DLM_RSBTBL_SIZE
- DLM_DIRTBL_SIZE

Change these values by editing /etc/init.d/cman so that the values are retained upon each reboot of the system. The benchmark environment used the lvchange -r command (with a read-ahead value appropriate for the workload) to set read-ahead for the file system. We recommend setting read-ahead for SAS workloads; other commands may also be used to set read-ahead.

For optimal GFS2 performance, run RHEL 6.4 and ensure that all patches are applied at least through the mid-March 2013 errata. These errata provide fixes to the tuned service as well as address a concern with irqbalance. With RHEL 6.4 and the mid-March 2013 errata, SAS found that the workload performed very well on GFS2. This version of GFS2 has not displayed the rapid performance degradation observed with prior versions.

The other shared file systems measured on the RHEL platform were GPFS, StorNext, and NFS. GPFS and StorNext are excellent choices when there is a need for extreme performance or for more than 16 nodes in the cluster. GFS2 performance has improved markedly, and it is now a very good choice for clusters of up to 16 nodes. NFS is an option when performance is not a primary concern.
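A minimal sketch of the host-level tuning described above (the volume group and logical volume names are hypothetical, and the read-ahead value should be sized for your workload and storage):

  # select the RHEL 6 enterprise-storage profile (deadline scheduler, vm.dirty_ratio=40)
  tuned-adm profile enterprise-storage

  # set read-ahead on the logical volume backing the GFS2 file system
  # (value is in sectors; 8192 sectors = 4 MB and is only an example)
  lvchange -r 8192 /dev/vg_sas/lv_sasdata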

Common Internet File System (CIFS)

Common Internet File System (CIFS) is the native shared file system provided with Windows operating systems. With recent patches, CIFS can be used for workloads with moderate levels of concurrency, and it works best for workloads that get limited benefit from the local file cache. The recommended configuration is to place SAS WORK directories on a non-CIFS file system and use CIFS to manage shared permanent files. With the release of the Windows Vista operating system in 2006, many improvements were made to address performance issues, and connectivity via 10 GbE greatly improves the throughput and responsiveness of the CIFS file system. With changes made both to SAS Foundation 9.3 software and the Windows Server 2008 R2 operating system, CIFS is functionally stable. However, benchmark results showed relatively poor retention of data in the local file cache. Workloads that reuse data from the local file cache will not perform nearly as well with CIFS compared to a local file system.

The benchmark configuration had three systems running the Windows Server 2008 R2 operating system, one acting as file server and two as clients, all connected via 10 GbE. In order to function properly and perform optimally, the following registry settings, Microsoft patches, and environment variable settings used by SAS 9.3 software should be used:

[HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Executive]
AdditionalDelayedWorkerThreads = 0x28
AdditionalCriticalWorkerThreads = 0x28

[HKLM\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters]
DirectoryCacheLifetime = 0x0
FileNotFoundCacheLifetime = 0x0

Hotfix referenced by
Hotfix referenced by

Environment variable SAS_NO_RANDOM_ACCESS = 1

The hotfix disabled file read-ahead under some circumstances; Microsoft will release an additional patch in April 2012 to address this issue. Comparing CIFS to other shared file systems measured on the Windows platform, GPFS had much better throughput, local file cache retention, and metadata management than CIFS. This translated to better scalability, and GPFS can serve a larger number of client systems than CIFS.

Quantum StorNext

StorNext is a file system and tiered data management family of products provided by Quantum Corporation. It is supported on a variety of operating systems, including Red Hat Enterprise Linux and Microsoft Windows. Benchmarks were run with both Quantum StorNext 4.3 and a pre-release copy of StorNext 5 on Red Hat Enterprise Linux (RHEL) 6.4. Both versions performed well, but StorNext 5 contains additional performance and scalability improvements as well as updates that address storage capacity issues encountered with heavy file metadata environments in the earlier release. (StorNext 5 is scheduled to be available in StorNext appliances in CY2013; availability of the software-only solution is targeted for early 2014.)

The configuration utilized separate StorNext file systems for permanent data and SAS WORK files, with excellent throughput. The demand model forecasting and calibration workload creates and deletes a high number of files in a relatively short period of time. The 4.3 version required more temporary storage capacity because artifacts from deleted files were not completely eliminated from the system as quickly as new files were created. StorNext 5 pre-release software solved this issue by significantly improving file creation and deletion performance as well as achieving low overhead for file system metadata management. The configuration utilized 8 systems running the SAS workloads, with 2 of these systems also acting as StorNext metadata controllers. Both file systems had a block size of 16k. The file system for permanent data had a journal size of 64M, and the SAS WORK file system had a journal size of 512M.

The following options were set in the cfgx file for each of the two file systems:

- filelocks true (default false)
- globalsuperuser true (default false)
- maxconnections 64 (default 10)

These options were increased from their default values to improve performance:

- inodecachesize 512k (default 32k)
- buffercachesize 4G (default 32M)
- threadpoolsize 1024 (default 16)

The SAS WORK file system also used ASR (Allocation Session Reservation) set to 10G; the default is 0, which is off. The following mount options were changed from their default values:

- nodiratime and noatime are set (default is not set)
- threads 32 (default 12)
- cachebufsize 128k (default 32k)
- buffercachecap 30730M (default varies by hardware; it would be 256M for a 64-bit system with more than 2G of memory)
- buffercache_readahead 24 (default 16)
- buffercache_iods 32 (default 8)
- dircachesize 64M (default 10M)
- auto_dma_read_length and auto_dma_write_length 2048, but testing defaulted to a minimum value of cachebufsize+1 (128k+1)

Overall, the StorNext file systems behaved very well, with excellent usage of the buffer cache and low overhead for file system metadata operations.

Intel Enterprise Edition for Lustre

Intel Enterprise Edition for Lustre Version 1.0 (IEEL V1.0) is a commercial version of the parallel file system Lustre supported by Intel. Earlier benchmarks run with other versions of Lustre encountered several performance problems, but IEEL V1.0 performed well, with excellent throughput and low overhead for file system metadata management. The benchmark configuration utilized one Metadata Server (MDS), three Object Storage Servers (OSS), and 12 RHEL 6.4 client systems. There was 30 Gb of network bandwidth available between the servers and clients. The Object Storage Targets were constructed from a disk array of rotating disks, and the Metadata Target was constructed from 4 SSDs configured as RAID 10. Intel recommends 128 GB of memory for each server; however, the benchmark configuration had only 24 GB for each server.

The demand model forecasting and calibration workload generates a high number of metadata requests. For each mount point on a client system, there is a single metadata request queue, and this was found to be a bottleneck for the initial configuration. A single Lustre file system can be mounted multiple times. To create additional metadata queues, the file system was mounted four times, and three of the mount points were used for SAS work directories.
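A sketch of this multiple-mount arrangement on a client (the MGS node, file system name, and mount points are hypothetical placeholders):

  # mount the same Lustre file system at several mount points; each client mount
  # point gets its own metadata request queue
  mount -t lustre mgs01@tcp0:/sasfs /lustre/sasdata
  mount -t lustre mgs01@tcp0:/sasfs /lustre/saswork1
  mount -t lustre mgs01@tcp0:/sasfs /lustre/saswork2
  mount -t lustre mgs01@tcp0:/sasfs /lustre/saswork3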

The SAS configuration file was modified so that the WORK option was set to a file containing a list of three directories, and the method for selecting a work directory was set to random. The additional metadata queues eliminated the metadata bottleneck.

The file system was formatted using default options. Several file system settings were modified for best performance of the benchmarks. On the OSS, the option readcache_max_filesize controls the maximum size of a file that both the read cache and write-through cache will try to keep in memory. Files larger than readcache_max_filesize will not be kept in cache for either reads or writes. This was set to 2M on each OSS via the command:

lctl set_param obdfilter.*.readcache_max_filesize=2m

On the client systems, the maximum amount of dirty data that can be written and queued was increased from the default to 256 MB. Also, the maximum number of in-flight RPCs was increased from the default to 16. These changes were made using the following commands:

lctl set_param osc.\*.max_dirty_mb=256
lctl set_param osc.\*.max_rpcs_in_flight=16

Overall, the Intel Enterprise Edition for Lustre Version 1.0 file systems behaved very well, with excellent usage of the local file system cache and low overhead for file system metadata operations.

Veritas Cluster File System

The SAS Performance Analysis lab has not done first-hand experiments with Veritas products (owned and maintained by Symantec since 2005); however, we have experience with this file system in the field via customer engagements. The following are best practices documented by SAS employees working with customer implementations of SAS and the Veritas file system in the field:

1. Use an 8 KB block size when creating a VxFS file system.
2. Use multipathing.
3. Separate permanent SAS data and temporary SASWORK data on different file systems.
4. When creating a Veritas volume, use a RAID-0 stripe across multiple LUNs (assuming the LUNs are RAID protected).
5. Depending upon the underlying LUN characteristics, make the VxVM stripe size equal to a full stripe width on the underlying LUN. For example, if a LUN is a RAID-5 volume created as 4+1 with a 64 KB strip size, the stripe size for the host should be 256 KB.
6. In the above case, the SAS BUFSIZE parameter should be set to 256 KB in the SAS configuration file.
7. If the default SAS BUFSIZE is 256 KB or larger, use vxtunefs to increase the value of discovery_direct_iosz.
8. If SAS BUFSIZE is 256 KB or less, change the vxtunefs parameters Read_perf_io and Write_perf_io to match the SAS BUFSIZE; otherwise, make the values an integer divisor of the SAS BUFSIZE.
9. Check that read_nstream and write_nstream values are set appropriately based on the VxVM volume configuration.
10. Mount the SASWORK temporary file system(s) with tmplog and mincache=tmpcache.

In SAS 9.1 and 9.2, have users modify the LIBNAME statement to add offset=value, where value is the SAS BUFSIZE. Using the example in item 5 above, add offset=256k to the LIBNAME statement. In SAS 9.3 and above, put ALIGNSASIOFILES in the sasv9 configuration file.
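As a brief sketch of item 10 above (the disk group, volume, and mount point are hypothetical):

  # mount the SASWORK file system with the recommended VxFS options
  mount -t vxfs -o tmplog,mincache=tmpcache /dev/vx/dsk/sasdg/saswork /saswork

On the SAS side, the corresponding LIBNAME statement for items 5 and 6 would include the offset, for example: libname sasperm '/sasdata/perm' offset=256k; (the library name and path are placeholders).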

By default, a SAS data set contains 512 bytes of header data; the actual SAS data begins in the file at an 8K offset, and the file is written to with BUFSIZE-sized I/O requests. When offset= is set, the first data page begins at that offset in the file. For very large files that are accessed by a single SAS job at any point in time, consider using direct I/O.

Conclusion

This paper contains only the file systems that we have tuned and measured in our performance lab. Other shared file systems have been used successfully in distributed deployments, such as Veritas Cluster File System. We plan to continue running benchmarks against other configurations and will publish the results as they become available.

Resources

Windows is a registered trademark of Microsoft Corporation in the United States and other countries. Windows Vista is either a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries. DDN, SFA, 10K-E, and In-Storage Processing are trademarks owned by DataDirect Networks.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright 2010 SAS Institute Inc., Cary, NC, USA. All rights reserved.


More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

VIDEO SURVEILLANCE WITH SURVEILLUS VMS AND EMC ISILON STORAGE ARRAYS

VIDEO SURVEILLANCE WITH SURVEILLUS VMS AND EMC ISILON STORAGE ARRAYS VIDEO SURVEILLANCE WITH SURVEILLUS VMS AND EMC ISILON STORAGE ARRAYS Successfully configure all solution components Use VMS at the required bandwidth for NAS storage Meet the bandwidth demands of a 2,200

More information

AIX NFS Client Performance Improvements for Databases on NAS

AIX NFS Client Performance Improvements for Databases on NAS AIX NFS Client Performance Improvements for Databases on NAS October 20, 2005 Sanjay Gulabani Sr. Performance Engineer Network Appliance, Inc. gulabani@netapp.com Diane Flemming Advisory Software Engineer

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel Data Direct I/O Technology (Intel DDIO): A Primer > Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Q & A From Hitachi Data Systems WebTech Presentation:

Q & A From Hitachi Data Systems WebTech Presentation: Q & A From Hitachi Data Systems WebTech Presentation: RAID Concepts 1. Is the chunk size the same for all Hitachi Data Systems storage systems, i.e., Adaptable Modular Systems, Network Storage Controller,

More information

Configuring RAID for Optimal Performance

Configuring RAID for Optimal Performance Configuring RAID for Optimal Performance Intel RAID Controller SRCSASJV Intel RAID Controller SRCSASRB Intel RAID Controller SRCSASBB8I Intel RAID Controller SRCSASLS4I Intel RAID Controller SRCSATAWB

More information

A Shared File System on SAS Grid Manager * in a Cloud Environment

A Shared File System on SAS Grid Manager * in a Cloud Environment white paper A Shared File System on SAS Grid Manager * in a Cloud Environment Abstract The performance of a shared file system is a critical component of implementing SAS Grid Manager* in a cloud environment.

More information

WHITE PAPER Optimizing Virtual Platform Disk Performance

WHITE PAPER Optimizing Virtual Platform Disk Performance WHITE PAPER Optimizing Virtual Platform Disk Performance Think Faster. Visit us at Condusiv.com Optimizing Virtual Platform Disk Performance 1 The intensified demand for IT network efficiency and lower

More information

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2

Using Synology SSD Technology to Enhance System Performance. Based on DSM 5.2 Using Synology SSD Technology to Enhance System Performance Based on DSM 5.2 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD Cache as Solution...

More information

Accelerating Server Storage Performance on Lenovo ThinkServer

Accelerating Server Storage Performance on Lenovo ThinkServer Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER

More information

Netapp @ 10th TF-Storage Meeting

Netapp @ 10th TF-Storage Meeting Netapp @ 10th TF-Storage Meeting Wojciech Janusz, Netapp Poland Bogusz Błaszkiewicz, Netapp Poland Ljubljana, 2012.02.20 Agenda Data Ontap Cluster-Mode pnfs E-Series NetApp Confidential - Internal Use

More information

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card

More information

An Oracle White Paper May 2011. Exadata Smart Flash Cache and the Oracle Exadata Database Machine

An Oracle White Paper May 2011. Exadata Smart Flash Cache and the Oracle Exadata Database Machine An Oracle White Paper May 2011 Exadata Smart Flash Cache and the Oracle Exadata Database Machine Exadata Smart Flash Cache... 2 Oracle Database 11g: The First Flash Optimized Database... 2 Exadata Smart

More information

Optimizing LTO Backup Performance

Optimizing LTO Backup Performance Optimizing LTO Backup Performance July 19, 2011 Written by: Ash McCarty Contributors: Cedrick Burton Bob Dawson Vang Nguyen Richard Snook Table of Contents 1.0 Introduction... 3 2.0 Host System Configuration...

More information

Configuring EMC CLARiiON for SAS Business Intelligence Platforms

Configuring EMC CLARiiON for SAS Business Intelligence Platforms Configuring EMC CLARiiON for SAS Business Intelligence Platforms Applied Technology Abstract Deploying SAS applications optimally with data stored on EMC CLARiiON systems requires a close collaboration

More information

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup Technical white paper Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup Table of contents Executive summary... 2 Introduction... 2 What is NDMP?... 2 Technology overview... 3 HP

More information

High Performance SQL Server with Storage Center 6.4 All Flash Array

High Performance SQL Server with Storage Center 6.4 All Flash Array High Performance SQL Server with Storage Center 6.4 All Flash Array Dell Storage November 2013 A Dell Compellent Technical White Paper Revisions Date November 2013 Description Initial release THIS WHITE

More information

Systems Architecture. Paper 271-26 ABSTRACT

Systems Architecture. Paper 271-26 ABSTRACT Paper 271-26 Findings and Sizing Considerations of an Integrated BI Software Solution in an AIX Environment and a Windows NT Environment E. Hayes-Hall, IBM Corporation, Austin, TX ABSTRACT This paper presents

More information

Taking Linux File and Storage Systems into the Future. Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated

Taking Linux File and Storage Systems into the Future. Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated Taking Linux File and Storage Systems into the Future Ric Wheeler Director Kernel File and Storage Team Red Hat, Incorporated 1 Overview Going Bigger Going Faster Support for New Hardware Current Areas

More information

Technology Insight Series

Technology Insight Series Evaluating Storage Technologies for Virtual Server Environments Russ Fellows June, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved Executive Summary

More information

HPSA Agent Characterization

HPSA Agent Characterization HPSA Agent Characterization Product HP Server Automation (SA) Functional Area Managed Server Agent Release 9.0 Page 1 HPSA Agent Characterization Quick Links High-Level Agent Characterization Summary...

More information

Storage Management for the Oracle Database on Red Hat Enterprise Linux 6: Using ASM With or Without ASMLib

Storage Management for the Oracle Database on Red Hat Enterprise Linux 6: Using ASM With or Without ASMLib Storage Management for the Oracle Database on Red Hat Enterprise Linux 6: Using ASM With or Without ASMLib Sayan Saha, Sue Denham & Lars Herrmann 05/02/2011 On 22 March 2011, Oracle posted the following

More information

Protect Microsoft Exchange databases, achieve long-term data retention

Protect Microsoft Exchange databases, achieve long-term data retention Technical white paper Protect Microsoft Exchange databases, achieve long-term data retention HP StoreOnce Backup systems, HP StoreOnce Catalyst, and Symantec NetBackup OpenStorage Table of contents Introduction...

More information

Implementing an Automated Digital Video Archive Based on the Video Edition of XenData Software

Implementing an Automated Digital Video Archive Based on the Video Edition of XenData Software Implementing an Automated Digital Video Archive Based on the Video Edition of XenData Software The Video Edition of XenData Archive Series software manages one or more automated data tape libraries on

More information

Partition Alignment Dramatically Increases System Performance

Partition Alignment Dramatically Increases System Performance Partition Alignment Dramatically Increases System Performance Information for anyone in IT that manages large storage environments, data centers or virtual servers. Paragon Software Group Paragon Alignment

More information

Cray DVS: Data Virtualization Service

Cray DVS: Data Virtualization Service Cray : Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with

More information

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores

More information

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010 Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010 This document is provided as-is. Information and views expressed in this document, including URL and other Internet

More information

GiantLoop Testing and Certification (GTAC) Lab

GiantLoop Testing and Certification (GTAC) Lab GiantLoop Testing and Certification (GTAC) Lab Benchmark Test Results: VERITAS Foundation Suite and VERITAS Database Edition Prepared For: November 2002 Project Lead: Mike Schwarm, Director, GiantLoop

More information

Technical Paper. Performance and Tuning Considerations for SAS on Fusion-io ioscale Flash Storage

Technical Paper. Performance and Tuning Considerations for SAS on Fusion-io ioscale Flash Storage Technical Paper Performance and Tuning Considerations for SAS on Fusion-io ioscale Flash Storage Release Information Content Version: 1.0 May 2014. Trademarks and Patents SAS Institute Inc., SAS Campus

More information

EMC VFCACHE ACCELERATES ORACLE

EMC VFCACHE ACCELERATES ORACLE White Paper EMC VFCACHE ACCELERATES ORACLE VFCache extends Flash to the server FAST Suite automates storage placement in the array VNX protects data EMC Solutions Group Abstract This white paper describes

More information

Introduction to NetApp Infinite Volume

Introduction to NetApp Infinite Volume Technical Report Introduction to NetApp Infinite Volume Sandra Moulton, Reena Gupta, NetApp April 2013 TR-4037 Summary This document provides an overview of NetApp Infinite Volume, a new innovation in

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

SAS Business Analytics. Base SAS for SAS 9.2

SAS Business Analytics. Base SAS for SAS 9.2 Performance & Scalability of SAS Business Analytics on an NEC Express5800/A1080a (Intel Xeon 7500 series-based Platform) using Red Hat Enterprise Linux 5 SAS Business Analytics Base SAS for SAS 9.2 Red

More information

June 2009. Blade.org 2009 ALL RIGHTS RESERVED

June 2009. Blade.org 2009 ALL RIGHTS RESERVED Contributions for this vendor neutral technology paper have been provided by Blade.org members including NetApp, BLADE Network Technologies, and Double-Take Software. June 2009 Blade.org 2009 ALL RIGHTS

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_WP_ 20121112 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges... 3 SSD

More information

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Agenda Introduction Database Architecture Direct NFS Client NFS Server

More information

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers White Paper rev. 2015-11-27 2015 FlashGrid Inc. 1 www.flashgrid.io Abstract Oracle Real Application Clusters (RAC)

More information

HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010

HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010 White Paper HIGHLY AVAILABLE MULTI-DATA CENTER WINDOWS SERVER SOLUTIONS USING EMC VPLEX METRO AND SANBOLIC MELIO 2010 Abstract This white paper demonstrates key functionality demonstrated in a lab environment

More information

Ultimate Guide to Oracle Storage

Ultimate Guide to Oracle Storage Ultimate Guide to Oracle Storage Presented by George Trujillo George.Trujillo@trubix.com George Trujillo Twenty two years IT experience with 19 years Oracle experience. Advanced database solutions such

More information

EMC Virtual Infrastructure for Microsoft Applications Data Center Solution

EMC Virtual Infrastructure for Microsoft Applications Data Center Solution EMC Virtual Infrastructure for Microsoft Applications Data Center Solution Enabled by EMC Symmetrix V-Max and Reference Architecture EMC Global Solutions Copyright and Trademark Information Copyright 2009

More information

Understanding Disk Storage in Tivoli Storage Manager

Understanding Disk Storage in Tivoli Storage Manager Understanding Disk Storage in Tivoli Storage Manager Dave Cannon Tivoli Storage Manager Architect Oxford University TSM Symposium September 2005 Disclaimer Unless otherwise noted, functions and behavior

More information

NetApp High-Performance Computing Solution for Lustre: Solution Guide

NetApp High-Performance Computing Solution for Lustre: Solution Guide Technical Report NetApp High-Performance Computing Solution for Lustre: Solution Guide Robert Lai, NetApp August 2012 TR-3997 TABLE OF CONTENTS 1 Introduction... 5 1.1 NetApp HPC Solution for Lustre Introduction...5

More information

TU04. Best practices for implementing a BI strategy with SAS Mike Vanderlinden, COMSYS IT Partners, Portage, MI

TU04. Best practices for implementing a BI strategy with SAS Mike Vanderlinden, COMSYS IT Partners, Portage, MI TU04 Best practices for implementing a BI strategy with SAS Mike Vanderlinden, COMSYS IT Partners, Portage, MI ABSTRACT Implementing a Business Intelligence strategy can be a daunting and challenging task.

More information