PADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute




Argonne National Laboratory

Table of Contents

Purpose
Terminology
Infrastructure
Timeline of Events
    Background
    Corruption
    Attempted Recovery
Disaster Recovery
    Transferring Data to Temporary Filesystem
    Rebuilding the Filesystem
    Tape Restoration
Lessons Learned
Changelog

Purpose

On June 25, 2010 the PADS cluster's GPFS filesystem experienced a catastrophic and fatal corruption. This document's goal is to explain the root cause of the crash, what was done to attempt to recover from it, the lessons learned, and the changes made to prevent a recurrence.

Figure 1. Timeline

Terminology

The following terms are used throughout this document and are provided here for a better understanding.

8+2 RAID6: RAID level 6 that consists of 8 data disks and 2 distributed parity disks.

Active-active Controllers: A SAN configuration of 2 controllers where either controller can service I/O for any LUN at any time. Provides higher throughput than an active-passive configuration.

Active-passive Controllers: A SAN configuration of 2 controllers where only one controller can service I/O for a given LUN at a time. The other controller takes over only if the primary controller fails.

Clustered Filesystem: A cluster of servers that work together to provide a single filesystem. Clustered filesystems allow for higher performance, by spreading the load and I/O across many servers, and for greater resilience to server failures.

Controller: The piece of the SAN storage array responsible for servicing I/O, maintaining RAID integrity, and monitoring the health of the storage array.

Data NSD: An NSD that contains the actual data portion of files on the GPFS filesystem.

DDN: DataDirect Networks. We use DDN to mean the disk storage array used - a DataDirect Networks S2A9550 storage array.

Disaster Recovery: The plan and procedure to follow when a catastrophic and fatal disaster has been encountered. Also referred to as DR.

DS4400: IBM's DS4400 disk storage array.

Failure Group: GPFS NSDs are placed in the same failure group if they have the same points of failure. For instance, all LUNs on the same storage array should be in the same failure group. Failure groups affect how GPFS replicates blocks.

FC: Fibre Channel. A network technology that is primarily used to transport SCSI commands in a SAN. It currently supports speeds of 1 Gbps, 2 Gbps, 4 Gbps and 8 Gbps.

Filesystem Manager: A GPFS server delegated to coordinate filesystem operations between the various GPFS servers.

fsck: Filesystem check program. Checks the integrity of the filesystem.

GPFS: IBM's General Parallel File System. A clustered, parallel filesystem.

HBA: Host Bus Adapter. The client-side FC interconnect card.

HCA: Host Channel Adapter. The client-side IB interconnect card.

IB: InfiniBand. A high speed, low latency network interconnect. IB topologies are created from lanes - 1 lane (1X) or 4 lanes (4X) - and the data rate - single (SDR), double (DDR), quad (QDR) - of those lanes. 1X SDR is 2.5 Gbps, 1X DDR is 5 Gbps and 1X QDR is 10 Gbps.

LUN: Logical Unit Number. Used to refer to a SCSI logical unit, a device that performs storage operations such as read and write. A tier can be carved into multiple LUNs.

LUN Presentation: Defining what LUNs a Fibre Channel host can see over specific Fibre Channel ports. Presentations are defined on the DDN.

Metadata NSD: An NSD that contains the metadata - inode, link references, creation time, modification time, etc. - of files on the GPFS filesystem.

Multipathing: Presenting the same LUN over multiple Fibre Channel paths, either to achieve more resilience against Fibre Channel port or cable failures, or to achieve higher throughput by balancing I/O across multiple Fibre Channel ports.

NSD: Network Shared Disk. A GPFS abstraction to uniquely define disks in the GPFS filesystem. NSDs allow GPFS to know that two local disks may, in fact, be the same LUN presented using multipathing. NSDs can be data only, metadata only, or data and metadata.

Parallel Filesystem: A clustered filesystem that allows multiple clients to read and write, in parallel, the same files or the same areas of a file at the same time. Data is striped across multiple storage devices in the filesystem.

RAID0: A RAID that stripes data blocks across all disks in the RAID set. Provides high throughput but has no fault tolerance to disk failures in the RAID set.

RAID5: A RAID that stripes data blocks across disks in the RAID set and maintains 1 parity disk.

RAID6: A RAID that stripes data blocks across disks in the RAID set and maintains 2 distributed parity disks. This provides added protection over RAID level 5 when a disk fails.

RDMA: Remote Direct Memory Access. Access from the memory of one computer to that of another without OS intervention. RDMA can be used over InfiniBand for high-throughput and low-latency networking.

Replication: Placing the same data or metadata block on multiple devices for fault tolerance and high availability reasons.

SAN: Storage Area Network. A network architecture that presents remote storage devices, such as disks or tape drives, to servers such that they appear as local devices to the operating system.

Tier: The DDN term for a RAID volume.

TSM: IBM's Tivoli Storage Manager. The backup software we use.

Verbs: InfiniBand functions.

Infrastructure

The PADS GPFS filesystem is built on top of several hardware components:

DDN S2A9550. Consists of 2 active-active controllers with 8 total 4 Gbps FC connections and 480 1 TB SATA disk drives, providing a peak of 3.2 GB/s throughput. There are 48 tiers in an 8+2 RAID6 configuration, and each tier provides 1 LUN, for a total of 48 LUNs. All LUNs are presented to all 8 FC ports.

IBM SAN32B-3. A 32-port 4 Gbps FC switch. This is the switch connecting the storage servers and the DDN.

10 IBM x3550 storage servers. Each server has 4 GB of DDR2 RAM, a single dual-core 2.00 GHz Intel Xeon 5130 64-bit CPU, a single-port QLogic QLx2460 4 Gbps FC HBA and a Mellanox 4X DDR IB HCA.

GPFS. We are running GPFS version 3.3.0.

Figure 2. PADS Interconnect
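The capacity implied by this layout is simple arithmetic, shown below as a sanity check on the figures above (the derived numbers are ours, not vendor specifications):

```python
# Sanity-check the DDN S2A9550 layout described above. The tier count,
# 8+2 RAID6 geometry, and 1 TB drive size come from this section; the
# derived totals are plain arithmetic.

TIERS = 48
DATA_DISKS_PER_TIER = 8      # the "8" in 8+2 RAID6
PARITY_DISKS_PER_TIER = 2    # the "+2"
DRIVE_TB = 1

total_drives = TIERS * (DATA_DISKS_PER_TIER + PARITY_DISKS_PER_TIER)
usable_tb = TIERS * DATA_DISKS_PER_TIER * DRIVE_TB
parity_overhead = PARITY_DISKS_PER_TIER / (DATA_DISKS_PER_TIER + PARITY_DISKS_PER_TIER)

print(total_drives)              # 480, matching the drive count above
print(usable_tb)                 # 384 TB of usable capacity
print(f"{parity_overhead:.0%}")  # 20% of raw capacity spent on parity
```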

Timeline of Events

Background

When we were configuring the PADS GPFS filesystem, we consulted with both IBM and DDN for guidelines and suggestions on the most scalable, highest-performing configuration to use. We were provided a Best Practices document, which recommended separating the metadata NSDs from the data NSDs to obtain the best performance. This is what we did: we made the DDN LUNs data-only NSDs, and the unused local SATA disk in each storage server was made a metadata-only NSD. This configuration is fully supported by GPFS.

However, we quickly realized that this was not an optimal configuration. Because metadata was now being kept on disks accessible to only one server, when that server rebooted or crashed the filesystem would go offline, because those metadata blocks could not be accessed. We developed a plan to enable metadata replication so that when one server went offline, the replica server could take over. We went over this change plan with IBM developers and, at their request, changed it so that the metadata disks would be placed in only 2 failure groups. This suggestion was a core reason for our lack of resilience and eventually led, indirectly, to the metadata corruption that crashed the filesystem. Because disks that had different points of failure were in the same failure group, GPFS made assumptions that were not true. This led to performance and scalability problems.

We realized we needed to transition the metadata to SAN disks, but could not use the DDN because its LUNs were already configured as data-only. With the UC TeraGrid RP site being decommissioned, an IBM DS4400 storage array was no longer in use and would serve this purpose perfectly. We racked, configured and extensively tested this hardware to make sure there were no performance or stability issues that needed solving beforehand. We added the DS4400 into the SAN and further tested that the servers were compatible with it and handled failures, such as FC links going down and disk failures, gracefully. After all of these tests passed, we added the DS4400 LUNs into the GPFS filesystem as metadata disks and let them passively participate for two weeks. We continued stability tests during this time with no interruption to the filesystem or its operations.

Corruption

On June 23, 2010 we started the process of migrating the metadata off of the local SATA disks in each storage server to the DS4400 storage array. Almost all metadata, >99%, had successfully been migrated to the DS4400 when, on June 25, 2010, the migration crashed. It is believed the metadata was left in an unknown and corrupted state at this point. After investigation and observing behavior during the attempted recovery, we believe the GPFS filesystem manager (fsmgr) node ran out of memory while performing a metadata consistency check.

Attempted Recovery

On June 25, 2010 we opened a severity 2 ticket with IBM and were directed to run a no-repair fsck on the filesystem. We also announced the emergency outage to the user community and offered to restore any data needed from tape to a temporary location. About 5-6 users asked for portions of projects to be restored, which we did. The fsck was run in no-repair mode so as to only report errors, not attempt to fix them. Once the fsck completed, the results were sent to IBM and we were advised to run fsck in repair mode. We started this but were unable to get the fsck to complete. On June 27 we had the ticket escalated to severity 1. On June 29 we discovered that the fsmgr was running out of memory during the fsck and increased the RAM on that server from 4 GB to 12 GB. The fsck continued to fail by running the fsmgr out of memory. We then added a server with 24 GB of RAM and 32 GB of swap to the cluster and forced it to be the fsmgr.
With the new fsmgr we were able to have the fsck complete and fix some problems, but some problems still remained. After several fscks, some inconsistencies remained and would never be repaired. On July 2, IBM advised that the filesystem was irreparable and that we should implement our disaster recovery procedure.

Disaster Recovery

On July 2, 2010 we announced our disaster recovery procedure to the user community. We had two goals for the recovery:

1. Recover as much, if not all, of the data on the filesystem.
2. Provide read-only access to the current data during the restoration.

To meet these goals we had the following steps:

1. Transfer the current data to a temporary filesystem. (approximately 5 days)
2. Make the data on the temporary filesystem available read-only.
3. Rebuild the filesystem on the DDN array.
4. Start the restore process from tape. (approximately 2-3 weeks)
5. Transfer files from the temporary filesystem that were created or modified after the last backup.
6. Release the filesystem and cluster back into operation.

Transferring Data to Temporary Filesystem

Because the PADS compute cluster nodes were already in a GPFS cluster, there was a high-speed IB interconnect between them and the storage nodes, and each compute node has roughly 2.5 TB of usable disk capacity, we converted the compute node GPFS cluster into a GPFS filesystem. Each compute node contributed its local RAID0 volume to the filesystem. Because RAID0 is not tolerant of even a single disk failure, we enabled replication and ensured each disk was in its own failure group. We opted not to rebuild the compute nodes' RAID volumes as something more fault tolerant, like RAID5, because of the time to do so - roughly 2-3 days for all 48 RAID volumes to initialize.

On July 2, 2010 we started copying as much data as we could from the now-corrupt filesystem to the temporary GPFS filesystem. We monitored the health of the cluster nodes and their disks during this time with no failures.
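Replication with each disk in its own failure group protects against any single node loss, but two copies spread across many node pairs cannot survive two concurrent node failures: some blocks will have had both copies on exactly the failed pair. A toy placement simulation illustrates this (the node count comes from this section; block count and placement are randomized for illustration, not taken from the real filesystem):

```python
import itertools
import random

random.seed(0)
NODES = 48            # compute nodes contributing RAID0 volumes
BLOCKS = 100_000      # toy block count; a real filesystem holds far more

# 2-way replication: each block's copies land on two distinct nodes.
placement = [tuple(random.sample(range(NODES), 2)) for _ in range(BLOCKS)]

# Any single node failure is survivable: the second copy lives elsewhere.
failed_one = {7}
lost_one = sum(set(p) <= failed_one for p in placement)

# Two concurrent failures lose every block whose replica pair is exactly
# those two nodes (as happened with nodes c05 and c12).
failed_two = {5, 12}
lost_two = sum(set(p) <= failed_two for p in placement)

print(lost_one)  # 0
print(lost_two)  # > 0: some blocks lost both copies
```

With blocks spread over essentially all node pairs, any two simultaneous node losses destroy some data, which is why the July 7 double failure was fatal to the temporary filesystem despite replication.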
The data migration completed, appropriate firewall holes were in place on July 6, and the temporary filesystem was made available read-only to users. On July 7, there were hardware failures in two separate nodes: node c05 suffered a disk failure, taking its RAID set offline, and node c12's RAID controller failed, taking its RAID set offline. Taken separately, these failures would not have been fatal, but combined they destroyed the temporary GPFS filesystem.

Rebuilding the Filesystem

On July 6 we started the process of recreating the GPFS filesystem. There were 2 tiers in the DDN that still needed to be upgraded to 1 TB drives, so we replaced and rebuilt those tiers. It took about 1.5 days to build the new tiers. While the tiers were building, we researched how to configure the new filesystem for the highest availability, best performance, and largest usable capacity possible. We discovered several parameters to modify. These parameter changes are detailed in the Changelog section below. On July 7, the tier building finished and we created the new GPFS filesystem and recreated the project filesystem structure.

Tape Restoration

After the filesystem was created we attempted to start tape restorations, but encountered bugs in our version of the TSM server. We worked with IBM support to develop workarounds until we could upgrade, and on July 8th started restorations from tape. Initially things looked good, with the first node restoring around 300 MB/s, but as more nodes started restoring we noticed that 300 MB/s was an aggregate limit. After investigating, we discovered that multipathing was incorrectly configured and corrected it. We restarted the restore on July 9 and averaged approximately 450 MB/s with peaks up to 600 MB/s. See the Changelog section for details of the multipath issue. The Argonne Leadership Computing Facility (LCF) division loaned us 6 tape drives, bringing our total drive count to 10. Because of their generosity, we were able to have all 10 storage servers performing restores concurrently. Excluding the two largest projects, all projects were restored by July 14 and we released the filesystem back for full use on July 15.

Lessons Learned

We have known for some time that placing the metadata on host-local disks in two failure groups with replication is a non-standard and sub-optimal configuration, and we had been working towards a more standard configuration. We were able to apply that knowledge in the creation of the new filesystem. In addition, we learned how GPFS accesses data when a node has direct access to the NSDs, and we have designed the new filesystem to exploit this (see Changelog). We learned better how multipathing works and how to configure and optimize it (see Changelog). We learned that some filesystem operations require more memory on the fsmgr node.
Because any of the nodes in the cluster could be delegated as the fsmgr, we are increasing the memory on each node from 4 GB to 12 GB. While this is still not enough memory to perform an fsck in one pass, it should prevent running out of memory during all other operations. The extra memory will also allow us to increase the amount of memory GPFS can pin for certain cached operations, increasing performance in some cases.

Lastly, we discovered that our current backup strategy is optimized for backups but not for DR restores. In the coming weeks we will be analyzing how to organize the data on tape and in TSM so that we can back up efficiently, perform accurate accounting and reporting, and restore projects or the whole filesystem as quickly as possible.

Changelog

The configuration of GPFS, the OS, and the DDN have all been heavily modified based on knowledge we gained prior to this outage and during the reconfiguration of the new filesystem. Below we detail these changes.

Consolidated data and metadata. Both data and metadata are now on the same LUNs on the DDN. While this is not the highest-performing configuration, it is the most reliable and should still provide very good performance.

Fixed multipathing. Because each LUN is presented to all eight ports of the DDN, a server sees the same LUN 8 times, resulting in what looks like 8 different disks (/dev/sdc, /dev/sdd, /dev/sde, etc.). Multipathing knows that these 8 presentations are all the same LUN and groups them together into one logical disk (e.g., /dev/mpath0). The multipath software is responsible for determining which disk (/dev/sdc, /dev/sdd, /dev/sde, etc.) to send I/O to, and thereby determining which port on the DDN the I/O is sent over. Previously the multipath software was misconfigured and was sending I/O to only 2 ports on one controller for all LUNs. This meant that 3/4 of our available bandwidth to disk was not being utilized, and it was in fact causing contention on those two ports. We have fixed this so that odd-numbered multipath disk (/dev/mpath1, etc.) I/O is sent in a round-robin fashion to all 4 ports of controller 1, and even-numbered multipath disk (/dev/mpath0, etc.) I/O is sent in a round-robin fashion to all 4 ports of controller 2. If a path or controller fails, I/O is sent to the secondary controller. This means that all I/O is now spread evenly over all 8 ports of the DDN and no one controller does too much work (see Figure 3).

Enabled InfiniBand RDMA verbs. When the storage array moved physically close to the PADS compute cluster, we connected the storage servers to the cluster IB fabric. We thought we had enabled GPFS to use IB RDMA when we did this, but a missing package was silently turning this feature off, effectively halving the available bandwidth between the storage servers and to the rest of the compute cluster. RDMA verbs support is now on and fully functional.

Present LUNs only to NSD owners. We discovered that if a server can see all the NSDs in the filesystem, that server will perform I/O directly to the NSDs regardless of whether it is the NSD owner or not. This meant that for operations that happen directly on a server, like GridFTP or restoration, I/O was not being striped across all servers, but instead was only being performed on that server. The maximum available bandwidth for those operations was therefore that of the server's FC connection, which is 4 Gbps. To fix this, we present only those LUNs that a server is primary or secondary for, thereby forcing I/O to be striped across all nodes. See Figures 4 and 5 for a graphical representation. Up until Wednesday the 14th, all servers could see all LUNs and the servers on ports 0/1, 0/2, and 0/3 were performing restores.
You can clearly see that those ports were performing the only I/O, with some nodes doing nothing. After the 14th we enabled LUN presentation, and you can see that I/O is almost uniformly spread across all 10 servers.

Disabled read-ahead prefetch. The DDN can perform read-ahead prefetching in an effort to anticipate the next read request; however, with a parallel filesystem such as GPFS it succeeds very rarely, so this option can actually be a performance drag. We disabled it and enabled block-level OS settings (see below) to allow GPFS to do the read-ahead prefetching.

Tuned DDN write cache size. We aligned the write cache size to match the RAID stripe and GPFS block size. This should provide a minor performance increase, as write operations should all be aligned on the same block.

Increased block device read-ahead size. We enabled and increased the default size of the OS block device read-ahead to allow GPFS to fetch a larger chunk of data for read-ahead prefetching and caching.

Increased block device request size. We increased the OS block I/O request size to allow GPFS to read and write in larger chunks.

Tuned FC HBA queue depth. Each port of the DDN has a transaction queue depth of 256. This means that under heavy load, or in an effort to bundle I/O requests together, the DDN can queue 256 transaction requests per port before denying further transactions while the queue drains. We applied a formula to prevent the storage servers from overrunning the DDN port transaction queues.

GPFS block size now matches RAID stripe and write cache size. The GPFS block size now matches the DDN tier RAID stripe size. This means writes are aligned on byte boundaries and allows the write cache to perform better.

Aligned LUN ownership to match multipath rules. Even though the DDN is an active-active configuration, LUNs are still owned by one of the controllers, and a small hand-off happens when the other controller accesses a LUN. To prevent this very minor performance hit, we updated LUN ownership so that it matches the multipath rules: odd LUNs owned by controller 1 and even LUNs owned by controller 2. Now the hand-off should only occur when there is a problem with one of the controllers or the FC fabric.

Increased the number of SSH connections. GPFS uses SSH for communication between nodes. In some cases with the default settings, SSH could deny new connection attempts until others completed, causing timeouts and misbehavior of GPFS operations. We increased the number of allowed SSH connections to prevent this.

Set a higher amount of reserved virtual memory (VM). GPFS can make use of VM under heavy load. By default the OS reserves some portion of this from being used by applications, but the default value is too low. We increased this reserved amount to keep GPFS from running the OS out of VM.

Figure 3. DDN Throughput per Port
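Two of the Changelog items reduce to small, checkable rules: the odd/even multipath policy behind Figure 3, and the HBA queue-depth sizing. The sketch below illustrates both; the port names are invented for illustration, and since the document does not state the exact queue-depth formula it applied, the one shown is a common rule of thumb (divide each port's queue among the hosts and LUNs sharing it), not necessarily theirs:

```python
# Sketch of the odd/even multipath policy and queue-depth sizing described
# above. Port names and the sizing numbers in the example call are
# illustrative; the 256-entry port queue comes from this document.

CONTROLLER_PORTS = {
    1: ["c1p0", "c1p1", "c1p2", "c1p3"],  # controller 1's four FC ports
    2: ["c2p0", "c2p1", "c2p2", "c2p3"],  # controller 2's four FC ports
}

def preferred_controller(mpath_index: int) -> int:
    """Odd mpath devices prefer controller 1; even devices, controller 2."""
    return 1 if mpath_index % 2 else 2

def port_for_io(mpath_index: int, io_number: int) -> str:
    """Round-robin a device's I/O across its preferred controller's ports."""
    ports = CONTROLLER_PORTS[preferred_controller(mpath_index)]
    return ports[io_number % len(ports)]

def max_hba_queue_depth(port_queue: int, hosts_per_port: int,
                        luns_per_host: int) -> int:
    """Largest per-LUN queue depth that cannot overrun a port's queue
    (rule-of-thumb sizing, assumed here, not quoted from the document)."""
    return port_queue // (hosts_per_port * luns_per_host)

# One odd and one even device together exercise all eight DDN ports:
used = {port_for_io(dev, i) for dev in (0, 1) for i in range(4)}
print(len(used))  # 8

# Example sizing: 2 hosts sharing a 256-entry port, 12 LUNs each.
print(max_hba_queue_depth(256, 2, 12))  # 10
```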

Figure 4. Before LUN Presentation

Figure 5. After LUN Presentation