PADS GPFS Filesystem: Crash Root Cause Analysis
Computation Institute, Argonne National Laboratory

Table of Contents
Purpose
Terminology
Infrastructure
Timeline of Events
    Background
    Corruption
    Attempted Recovery
Disaster Recovery
    Transferring Data to Temporary Filesystem
    Rebuilding the Filesystem
    Tape Restoration
Lessons Learned
Changelog
Purpose
On June 25, 2010 the PADS cluster's GPFS filesystem experienced a catastrophic and ultimately fatal corruption. This document's goal is to explain the root cause of the crash, what was done to attempt to recover from it, the lessons learned, and the changes made to prevent a recurrence.

Figure 1. Timeline
Terminology
The following terms are used throughout this document and are provided here for better understanding.

8+2 RAID6: RAID level 6 consisting of 8 data disks and 2 distributed parity disks.
Active-active Controllers: A SAN configuration of 2 controllers where either controller can service I/O for any LUN at any time. Provides higher throughput than an active-passive configuration.
Active-passive Controllers: A SAN configuration of 2 controllers where only 1 controller can service I/O for a given LUN at a time. The other controller takes over only if the primary controller fails.
Clustered Filesystem: A cluster of servers that work together to provide a single filesystem. Clustered filesystems allow higher performance, by spreading load and I/O across many servers, and greater resilience to server failures.
Controller: The piece of the SAN storage array responsible for servicing I/O, maintaining RAID integrity, and monitoring the health of the storage array.
Data NSD: An NSD that contains the actual data portion of files on the GPFS filesystem.
DDN: DataDirect Networks. We use DDN to mean the disk storage array used, a DataDirect Networks S2A9550 storage array.
Disaster Recovery: The plan and procedure to follow when a catastrophic and fatal disaster has been encountered. Also referred to as DR.
DS4400: IBM's DS4400 disk storage array.
Failure Group: GPFS NSDs are placed in the same failure group if they have the same points of failure. For instance, all LUNs on the same storage array should be in the same failure group. Failure groups affect how GPFS replicates blocks.
FC: Fiber Channel. A network technology that is primarily used to transport SCSI commands in a SAN. It currently supports speeds of 1 Gbps, 2 Gbps, 4 Gbps, and 8 Gbps.
Filesystem Manager: A GPFS server delegated to coordinate filesystem operations between the various GPFS servers.
fsck: Filesystem check program. Checks the integrity of the filesystem.
GPFS: IBM's General Parallel File System. A clustered, parallel filesystem.
HBA: Host Bus Adapter. The client-side FC interconnect card.
HCA: Host Channel Adapter. The client-side IB interconnect card.
IB: InfiniBand. A high-speed, low-latency network interconnect. IB topologies are created from lanes (1 lane (1X) or 4 lanes (4X)) and the data rate (single (SDR), double (DDR), or quad (QDR)) of those lanes. 1X SDR is 2.5 Gbps, 1X DDR is 5 Gbps, and 1X QDR is 10 Gbps.
LUN: Logical Unit Number. Used to refer to a SCSI logical unit, a device that performs storage operations such as read and write. A tier can be carved into multiple LUNs.
LUN Presentation: Defining which LUNs a Fiber Channel host can see over specific Fiber Channel ports. Presentations are defined on the DDN.
Metadata NSD: An NSD that contains the metadata (inode, link references, creation time, modification time, etc.) of files on the GPFS filesystem.
Multipathing: Presenting the same LUN over multiple Fiber Channel paths, either to achieve more resilience against port or cable failures or to achieve higher throughput by balancing I/O across multiple ports.
NSD: Network Shared Disk. A GPFS abstraction to uniquely identify disks in the GPFS filesystem. NSDs allow GPFS to know that two local disks may, in fact, be the same LUN presented using multipathing. NSDs can be data only, metadata only, or data and metadata.
Parallel Filesystem: A clustered filesystem that allows multiple clients to read and write, in parallel, the same files or the same areas of a file at the same time. Data is striped across multiple storage devices in the filesystem.
RAID0: A RAID that stripes data blocks across all disks in the RAID set. Provides high throughput but has no fault tolerance to disk failures in the RAID set.
RAID5: A RAID that stripes data blocks across disks in the RAID set and maintains 1 parity disk.
RAID6: A RAID that stripes data blocks across disks in the RAID set and maintains 2 distributed parity disks. This provides added protection over RAID level 5 when a disk fails.
RDMA: Remote Direct Memory Access. Access from the memory of one computer to that of another without OS intervention. RDMA can be used over InfiniBand for high-throughput, low-latency networking.
Replication: Placing the same data or metadata block on multiple devices for fault tolerance and high availability reasons.
SAN: Storage Area Network. A network architecture that presents remote storage devices, such as disks or tape drives, to servers such that they appear as local devices to the operating system.
Tier: The DDN term for a RAID volume.
TSM: IBM's Tivoli Storage Manager. The backup software we use.
Verbs: InfiniBand functions.
Infrastructure
The PADS GPFS filesystem is built on top of several hardware components:

DDN S2A9550. Consists of 2 active-active controllers with 8 total 4 Gbps FC connections and 480 1 TB SATA disk drives, providing a peak of 3.2 GB/s throughput. There are 48 tiers in an 8+2 RAID6 configuration, and each tier provides 1 LUN, for a total of 48 LUNs. All LUNs are presented to all 8 FC ports.
IBM SAN32B-3. A 32-port 4 Gbps FC switch. This is the switch connecting the storage servers and the DDN.
10 IBM x3550 storage servers. Each server has 4 GB of DDR2 RAM, a single dual-core 2.00 GHz Intel Xeon 5130 64-bit CPU, a single-port QLogic QLx2460 4 Gbps FC HBA, and a Mellanox 4X DDR IB HCA.
GPFS. We are running GPFS version 3.3.0.

Figure 2. PADS Interconnect
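As a back-of-the-envelope check, the capacity and throughput figures above follow from the component counts. This sketch assumes the conventional ~400 MB/s usable payload rate of a 4 Gbps FC port (due to 8b/10b line encoding); everything else comes straight from the numbers listed above.

```python
# Sanity-check of the infrastructure figures above (a sketch; the
# 400 MB/s per-port payload rate of 4 Gbps FC is an assumption).
tiers = 48
disks_per_tier = 10           # 8 data + 2 parity (8+2 RAID6)
data_disks_per_tier = 8
disk_tb = 1                   # 1 TB SATA drives

total_disks = tiers * disks_per_tier
raw_data_tb = tiers * data_disks_per_tier * disk_tb

fc_ports = 8
mb_per_port = 400             # approx. usable rate of one 4 Gbps FC port
peak_gb_s = fc_ports * mb_per_port / 1000

print(total_disks)   # 480 drives, as listed above
print(raw_data_tb)   # 384 TB of data capacity before filesystem overhead
print(peak_gb_s)     # 3.2 GB/s, matching the DDN's quoted peak
```

The 3.2 GB/s peak is therefore set by the 8 FC ports, not by the disks behind them.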
Timeline of Events

Background
When we were configuring the PADS GPFS filesystem, we consulted with both IBM and DDN for guidelines and suggestions on the most scalable, highest-performance configuration to use. We were provided a Best Practices document that recommended separating the metadata NSDs from the data NSDs to obtain the best performance, and this is what we did: we made the DDN LUNs data-only NSDs, and the unused local SATA disk in each storage server was made a metadata-only NSD. This configuration is fully supported by GPFS.

However, we quickly realized that this was not an optimal configuration. Because metadata was now kept on disks accessible to only one server, when that server rebooted or crashed the filesystem would go offline, because those metadata blocks could not be accessed. We developed a plan to enable metadata replication so that when one server went offline, the replica server could take over. We went over this change plan with IBM developers and, at their request, changed it so that the metadata disks would be placed in only 2 failure groups. This suggestion was a core reason for our lack of resilience and eventually led, indirectly, to the metadata corruption that crashed the filesystem. Because disks that had different points of failure were in the same failure group, GPFS made assumptions that were not true, which led to performance and scalability problems.

We realized we needed to transition the metadata to SAN disks, but we could not use the DDN because its LUNs were already configured as data-only. With the UC TeraGrid RP site being decommissioned, an IBM DS4400 storage array was no longer in use and would serve this purpose perfectly. We racked, configured, and extensively tested this hardware to make sure there were no performance or stability issues that needed solving beforehand.
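The failure-group mechanics behind the problem described above can be illustrated with a toy placement rule (a sketch, not GPFS's real allocator; disk names and groupings are hypothetical): GPFS places the two copies of a replicated block on disks in different failure groups, so the grouping is GPFS's only knowledge of which disks fail together. Collapsing ten servers' independent disks into two groups hands GPFS a coarse and partly false failure map.

```python
# Toy model of replica placement across failure groups (a sketch,
# not GPFS's allocator; all names are hypothetical).
import itertools

def place_replicas(disks):
    """Pick two replica disks from different failure groups.
    disks: dict of disk name -> failure group number."""
    for a, b in itertools.combinations(disks, 2):
        if disks[a] != disks[b]:
            return a, b
    raise ValueError("replication needs disks in at least two failure groups")

# One failure group per server: replicas always land on different servers.
per_server = {"server1_disk": 1, "server2_disk": 2, "server3_disk": 3}
a, b = place_replicas(per_server)
assert per_server[a] != per_server[b]

# Ten servers' disks collapsed into two groups: placement still succeeds,
# but the groups no longer describe which disks actually fail together.
collapsed = {f"server{i}_disk": (1 if i <= 5 else 2) for i in range(1, 11)}
a, b = place_replicas(collapsed)
assert collapsed[a] != collapsed[b]   # different *groups* is all GPFS knows
```

In the collapsed layout, any reasoning GPFS does from failure groups (placement, recovery, load spreading) rests on a failure map that does not match the hardware.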
We added the DS4400 into the SAN and further tested that the servers were compatible with it and handled failures, such as FC links going down and disk failures, gracefully. After all of these tests passed, we added the DS4400 LUNs into the GPFS filesystem as metadata disks and let them passively participate for two weeks. We continued stability tests during this time with no interruption to the filesystem or its operations.

Corruption
On June 23, 2010 we started the process of migrating the metadata off of the local SATA disks in each storage server and onto the DS4400 storage array. Almost all of the metadata, more than 99%, had been successfully migrated when, on June 25, 2010, the migration crashed. We believe the metadata was left in an unknown and corrupted state at this point. After investigation, and after observing behavior during the attempted recovery, we believe the GPFS filesystem manager (fsmgr) node ran out of memory while performing a metadata consistency check.

Attempted Recovery
On June 25, 2010 we opened a severity 2 ticket with IBM and were directed to run a no-repair fsck on the filesystem. We also announced the emergency outage to the user community and offered to restore any needed data from tape to a temporary location. About 5-6 users asked for portions of projects to be restored, which we did. The fsck was run in no-repair mode so as to only report errors, not attempt to fix them. Once the fsck completed, the results were sent to IBM, and we were advised to run fsck in repair mode. We started this but were unable to get the fsck to complete. On June 27 we had the ticket escalated to severity 1. On June 29 we discovered that the fsmgr was running out of memory during the fsck, so we increased the RAM on that server from 4 GB to 12 GB. The fsck continued to fail by running the fsmgr out of memory. We then added a server with 24 GB of RAM and 32 GB of swap to the cluster and forced it to be the fsmgr.
With the new fsmgr we were able to get the fsck to complete and fix some problems, but others remained. After several fscks, some inconsistencies persisted and would never be repaired. On July 2, IBM advised that the filesystem was irreparable and that we should implement our disaster recovery procedure.

Disaster Recovery
On July 2, 2010 we announced our disaster recovery procedure to the user community. We had two goals for the recovery:

1. Recover as much, if not all, of the data on the filesystem.
2. Provide read-only access to the current data during the restoration.

To meet these goals we planned the following steps:

1. Transfer the current data to a temporary filesystem (approximately 5 days).
2. Make the data on the temporary filesystem available read-only.
3. Rebuild the filesystem on the DDN array.
4. Start the restore process from tape (approximately 2-3 weeks).
5. Transfer files from the temporary filesystem that were created or modified after the last backup was taken.
6. Release the filesystem and cluster back into operation.

Transferring Data to Temporary Filesystem
The PADS compute cluster nodes were already in a GPFS cluster, there was a high-speed IB interconnect between them and the storage nodes, and each compute node had roughly 2.5 TB of usable disk capacity, so we converted the compute node GPFS cluster into a GPFS filesystem. Each compute node contributed its local RAID0 volume to the filesystem. Because RAID0 is not tolerant of even a single disk failure, we enabled replication and ensured each disk was in its own failure group. We opted not to rebuild the compute nodes' RAID volumes as something more fault tolerant, such as RAID5, because of the time it would take: roughly 2-3 days for all 48 RAID volumes to initialize.

On July 2, 2010 we started copying as much data as we could from the now-corrupt filesystem to the temporary GPFS filesystem. We monitored the health of the cluster nodes and their disks during this time, with no failures.
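The fault-tolerance trade-off of this layout can be sketched with a toy model (not GPFS itself): with two-way replication and every node's disk in its own failure group, each block's two copies sit on two distinct nodes, so any single node loss is survivable; but across many blocks, essentially every pair of nodes jointly holds both copies of some block, so two concurrent node losses destroy data. The worst case is modeled here as one block per node pair.

```python
# Toy model of two-way replication over 48 single-disk nodes, each disk
# in its own failure group (a sketch, not GPFS's actual block placement).
import itertools

nodes = [f"c{i:02d}" for i in range(1, 49)]      # 48 compute nodes

# Worst case across many blocks: some block's two copies land on every
# possible pair of nodes; model one block per pair.
blocks = list(itertools.combinations(nodes, 2))  # 1128 placements

def lost(failed):
    """Blocks whose every replica sits on a failed node."""
    failed = set(failed)
    return [b for b in blocks if set(b) <= failed]

assert lost(["c05"]) == []                        # one node down: no block lost
assert lost(["c05", "c12"]) == [("c05", "c12")]   # two nodes down: data gone
```

This is exactly the exposure accepted by keeping the compute-node volumes as RAID0: replication absorbs one node failure, not two.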
The data copy completed, and with the appropriate firewall holes in place, the temporary filesystem was made available read-only to users on July 6. On July 7, there were hardware failures in two separate nodes: node c05 suffered a disk failure, taking its RAID set offline, and node c12's RAID controller failed, taking its RAID set offline. Taken separately, these failures would not have been fatal, but combined they destroyed the temporary GPFS filesystem.

Rebuilding the Filesystem
On July 6 we started the process of recreating the GPFS filesystem. There were 2 tiers in the DDN that still needed to be upgraded to 1 TB drives, so we replaced the drives and rebuilt those tiers, which took about 1.5 days. While the tiers were building, we researched how to configure the new filesystem for the highest possible availability, the best possible performance, and the largest amount of usable capacity. We discovered several parameters to
modify. These parameter changes are detailed in the Changelog section below. On July 7, the tier rebuild finished, and we created the new GPFS filesystem and recreated the project filesystem structure.

Tape Restoration
After the filesystem was created we attempted to start tape restorations, but encountered bugs in our version of the TSM server. We worked with IBM support to develop workarounds until we could upgrade, and on July 8 started restorations from tape. Initially things looked good, with the first node restoring at around 300 MB/s, but as more nodes started restoring we noticed that 300 MB/s was an aggregate limit. After investigating, we discovered that multipathing was incorrectly configured, and we corrected it. We restarted the restore on July 9 and averaged approximately 450 MB/s, with peaks up to 600 MB/s. See the Changelog section for details of the multipath issue. The Argonne Leadership Computing Facility (LCF) division loaned us 6 tape drives, bringing our total drive count to 10. Because of their generosity, we were able to have all 10 storage servers performing restores concurrently. Excluding the two largest projects, all projects were restored by July 14, and we released the filesystem back for full use on July 15.

Lessons Learned
We have known for some time that placing the metadata on host-local disks in two failure groups with replication is a non-standard and sub-optimal configuration, and we had been working towards a more standard configuration. We were able to apply that knowledge in the creation of the new filesystem. In addition, we learned how GPFS accesses data when a node has direct access to the NSDs, and we have designed the new filesystem to exploit this (see Changelog). We learned in more depth how multipathing works and how to configure and optimize it (see Changelog). We learned that some filesystem operations require more memory on the fsmgr node.
Because any of the nodes in the cluster can be delegated as the fsmgr, we are increasing the memory on each node from 4 GB to 12 GB. While this is still not enough memory to perform a fsck in one pass, it should prevent running out of memory during all other operations. The extra memory will also allow us to increase the amount of memory GPFS can pin for certain cached operations, increasing performance in some cases. Lastly, we discovered that our current backup strategy is optimized for backups but not for DR restores. In the coming weeks we will be analyzing how to organize the data on tape and in TSM so that we can back up efficiently, perform accurate accounting and reporting, and restore projects or the whole filesystem as quickly as possible.

Changelog
The configuration of GPFS, the OS, and the DDN have all been heavily modified based on knowledge gained prior to this outage and during the reconfiguration of the new filesystem. Below we detail these changes.

Consolidate data and metadata. Both data and metadata are now on the same LUNs on the DDN. While this is not the highest-performing configuration, it is the most reliable and should still provide very good performance.

Fixed multipathing. Because each LUN is presented to all eight ports of the DDN, a server sees the same LUN 8 times, resulting in what looks like 8 different disks (/dev/sdc, /dev/sdd, /dev/sde, etc.). Multipathing knows that these 8 presentations are all the same LUN and groups them together into one logical disk (e.g., /dev/mpath0). The multipath software is responsible for determining which disk (/dev/sdc, /dev/sdd, /dev/sde, etc.) to send I/O to, and thereby which port on the DDN the I/O is sent over. Previously the multipath software was misconfigured and was
sending I/O only to 2 ports on one controller for all LUNs. This meant that 3/4 of our available bandwidth to disk was not being utilized, and in fact it was causing contention on those two ports. We've fixed this so that I/O for odd-numbered multipath disks (/dev/mpath1, etc.) is sent in a round-robin fashion to all 4 ports of controller 1, and I/O for even-numbered multipath disks (/dev/mpath0, etc.) is sent in a round-robin fashion to all 4 ports of controller 2. If a path or controller fails, I/O is sent to the secondary controller. This means that all I/O is now spread evenly over all 8 ports of the DDN and no one controller does too much work (see Figure 3).

Enabled InfiniBand RDMA verbs. When the storage array moved physically close to the PADS compute cluster, we connected the storage servers to the cluster IB fabric. We thought we had enabled GPFS to use IB RDMA when we did this, but a missing package was silently turning this feature off, effectively halving the available bandwidth between the storage servers and to the rest of the compute cluster. RDMA verbs support is now on and fully functional.

Present LUNs only to NSD owners. We discovered that if a server can see all the NSDs in the filesystem, that server will perform I/O directly to the NSDs regardless of whether it is the NSD owner or not. This meant that for operations that happen directly on a server, such as GridFTP transfers or restoration, I/O was not striped across all servers but was instead performed only on that server, so the maximum available bandwidth for those operations was that of the server's FC connection: 4 Gbps. To fix this, we present only those LUNs that a server is primary or secondary for, thereby forcing I/O to be striped across all nodes. See Figures 4 and 5 for a graphical representation. Up until Wednesday the 14th, all servers could see all LUNs, and the servers on ports 0/1, 0/2, and 0/3 were performing restores.
You can clearly see that those ports were performing all of the I/O, with some nodes doing nothing. After the 14th we enabled LUN presentation, and you can see that I/O is spread almost uniformly across all 10 servers.

Disable read-ahead prefetch. The DDN can perform read-ahead prefetching in an effort to anticipate the next read request; however, with a parallel filesystem such as GPFS it rarely predicts correctly, so this option can actually be a performance drag. We disabled it and enabled block-level OS settings (see below) to allow GPFS to do the read-ahead prefetching instead.

Tune DDN write cache size. We aligned the write cache size to match the RAID stripe and GPFS block size. This should provide a minor performance increase, as write operations should all be aligned on the same block.

Increase block device read-ahead size. We enabled and increased the default OS block device read-ahead size to allow GPFS to fetch a larger chunk of data for read-ahead prefetch and caching.

Increase block device request size. We increased the OS block I/O request size to allow GPFS to read and write in larger chunks.

Tuned FC HBA queue depth. Each port of the DDN has a transaction queue depth of 256. This means that under heavy load, or in an effort to bundle I/O requests together, the DDN can queue 256 transaction requests before denying further transactions while the queue drains. We applied a formula to prevent the storage servers from overrunning the DDN port transaction queues.

GPFS block size now matches RAID stripe and write cache size. The GPFS block size now matches the DDN tier RAID stripe size. This means writes are aligned on the same byte boundaries and allows the write cache to perform better.

Aligned LUN ownership to match multipath rules. Even though the DDN is an active-active configuration, LUNs are still owned by one of the controllers, and a small hand-off happens when the other controller accesses the LUN.
To prevent this very minor performance hit, we updated LUN ownership to match the multipath rules: odd LUNs are owned by controller 1 and even LUNs by controller 2. Now the hand-off should only occur when there is a problem with one of the controllers or the FC fabric.

Increased the number of SSH connections. GPFS uses SSH for communication between nodes. In some cases with the default settings, SSH could deny further connection attempts until others completed, causing timeouts and misbehavior of GPFS operations. We increased the number of allowed SSH connections to prevent this.

Set a higher amount of reserved virtual memory (VM). GPFS can make use of VM under heavy load. By default the OS reserves some portion of VM from being used by applications, but the default value is too low. We increased this reserved amount to keep GPFS from running the OS out of VM.

Figure 3. DDN Throughput per Port
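For reference, the odd/even multipath policy described in the Changelog can be sketched as a toy mapping (a sketch only, not our actual multipath configuration; the port names here are hypothetical, while the odd-to-controller-1 / even-to-controller-2 split and the 4-port round-robin are as described above):

```python
# Toy sketch of the multipath path policy from the Changelog (not the
# real multipath config). Port names are hypothetical.
CONTROLLER1_PORTS = ["c1p0", "c1p1", "c1p2", "c1p3"]
CONTROLLER2_PORTS = ["c2p0", "c2p1", "c2p2", "c2p3"]

def path_groups(mpath_index):
    """Primary and failover port groups for /dev/mpathN."""
    if mpath_index % 2 == 1:                  # odd LUNs -> controller 1
        return CONTROLLER1_PORTS, CONTROLLER2_PORTS
    return CONTROLLER2_PORTS, CONTROLLER1_PORTS  # even LUNs -> controller 2

def port_for(mpath_index, request_number):
    """Round-robin each LUN's I/O across its primary controller's ports."""
    primary, _ = path_groups(mpath_index)
    return primary[request_number % len(primary)]

# /dev/mpath1 spreads requests over controller 1's four ports...
assert [port_for(1, i) for i in range(4)] == CONTROLLER1_PORTS
# ...and /dev/mpath0 over controller 2's, so all 8 DDN ports carry I/O.
assert [port_for(0, i) for i in range(4)] == CONTROLLER2_PORTS
```

With LUN ownership aligned the same way, the failover group of ports is touched only when a controller or fabric problem forces the hand-off.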
Figure 4. Before LUN Presentation
Figure 5. After LUN Presentation