IBM System x GPFS Storage Server: Bright Prospects for HPC Storage
ZKI-Arbeitskreis Paderborn, 15.03.2013
Karsten Kutzer, Client Technical Architect, Technical Computing, IBM Systems & Technology Group
Agenda IBM General Parallel File System overview GPFS Native RAID technology GPFS Storage Server
IBM General Parallel File System (GPFS): History and Evolution
1998: GPFS, first called "General File Serving": standards (POSIX semantics), large block size, directory and small-file performance, data management; early workloads: HPC, Virtual Tape Server (VTS)
2002: GPFS 2.1-2.3: 32-bit/64-bit interoperability (IBM AIX and Linux), GPFS Multicluster, GPFS over wide area networks (WAN), large-scale clusters with thousands of nodes; platforms: Linux clusters (multiple architectures), IBM AIX loose clusters; workloads: HPC, research, visualization, digital media, seismic exploration, weather, life sciences
2005-2008: GPFS 3.1-3.2: information lifecycle management (ILM) with storage pools, file sets and a policy engine; ease of administration; multiple networks/RDMA; distributed token management; Windows support; multiple NSD servers; NFS v4 support; small-file performance
2009: GPFS 3.3: restricted admin functions, improved installation, new license model, improved snapshot and backup, improved ILM policy engine
2010: GPFS 3.4: enhanced Windows cluster support (homogeneous Windows Server clusters), performance and scaling improvements, enhanced migration and diagnostics support
2012: GPFS 3.5: caching via Active File Management (AFM), GPFS Storage Server, GPFS File Placement Optimizer (FPO)
Agenda IBM General Parallel File System overview GPFS Native RAID technology GPFS Storage Server
PERSEUS/GNR Motivation: Hard Disk Rates Are Lagging
- There have been recent inflection points in disk technology, in the wrong direction.
- In spite of these trends, peta- and exascale projects aim to maintain performance increases.
[Charts: disk areal density trend 2000-2010 in Gb/sq.in., with growth slowing from ~100% CAGR to 25-35% CAGR; media data rate (MB/s) vs. year; disk drive latency (ms) by year]
GPFS Native RAID Technology
- Declustered RAID: data and parity stripes are uniformly partitioned and distributed across a disk array; arbitrary number of disks per array (not constrained to an integral number of RAID stripe widths)
- 2-fault and 3-fault tolerance (RAID-D2, RAID-D3): Reed-Solomon parity encoding, with 2- or 3-fault-tolerant stripes of 8 data strips + 2 or 3 parity strips, or 3- or 4-way mirroring
- End-to-end checksum, from the disk surface to the GPFS user/client: detects and corrects off-track and lost/dropped disk writes
- Asynchronous error diagnosis while affected I/Os continue: if media error, verify and restore if possible; if path problem, attempt alternate paths; if unresponsive or misbehaving disk, power-cycle the disk
- Supports service of multiple disks on a carrier: I/O operations continue for tracks whose disks have been removed during carrier service
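As a rough illustration of the end-to-end checksum idea above, the following Python sketch appends a version number and a CRC32 checksum to each block on write and verifies both on read. The trailer layout and names are assumptions made for illustration only; GNR's actual on-disk format is different and also protects the full path from disk surface to client.

```python
# Illustrative sketch only: the trailer format below is hypothetical, not
# GNR's. The idea is end-to-end checksumming: a checksum and version travel
# with the data, so the reader can detect media errors and lost/dropped
# writes that a bare RAID controller would silently miss.
import struct
import zlib

TRAILER = struct.Struct("<IQ")  # CRC32 (4 bytes) + write version (8 bytes)

def write_block(data: bytes, version: int) -> bytes:
    """Append the checksum trailer before the block is sent toward disk."""
    return data + TRAILER.pack(zlib.crc32(data), version)

def read_block(block: bytes, expected_version: int) -> bytes:
    """Verify the trailer on read; a stale version means a dropped write."""
    data, trailer = block[:-TRAILER.size], block[-TRAILER.size:]
    crc, version = TRAILER.unpack(trailer)
    if zlib.crc32(data) != crc:
        raise IOError("checksum mismatch: corrupted block")
    if version != expected_version:
        raise IOError("stale version: lost or dropped write detected")
    return data

if __name__ == "__main__":
    blk = write_block(b"payload" * 512, version=7)
    assert read_block(blk, expected_version=7) == b"payload" * 512
```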
De-clustering: Bringing Parallel Performance to Disk Maintenance
- Traditional RAID (narrow data+parity arrays): 20 disks, 5 disks per traditional RAID array, 4x4 RAID stripes (data plus parity). Rebuild after a disk failure uses the I/O capacity of only the array's 4 surviving disks; with striping across all arrays, all file accesses are throttled by array 2's rebuild overhead.
- Declustered RAID (data+parity distributed over all disks): 20 disks in one declustered array, 16 RAID stripes (data plus parity). Rebuild after a disk failure uses the I/O capacity of all 19 surviving disks; the load on file accesses is reduced by 4.8x (19/4) during an array rebuild.
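To make the slide's 19-versus-4 arithmetic concrete, here is a minimal Python sketch using the same example (20 disks, 5-disk-wide traditional arrays) to compare how many surviving disks share the rebuild work in each layout.

```python
# Back-of-the-envelope model of the slide's numbers: a traditional rebuild
# can only read from the failed array's surviving disks, while a declustered
# rebuild spreads the same work across every surviving disk in the array.
def rebuild_fanout(total_disks: int, stripe_width: int) -> tuple[int, int]:
    traditional = stripe_width - 1   # surviving disks inside the failed array
    declustered = total_disks - 1    # all surviving disks share the rebuild
    return traditional, declustered

trad, decl = rebuild_fanout(total_disks=20, stripe_width=5)
print(f"traditional rebuild reads from {trad} disks")            # 4
print(f"declustered rebuild reads from {decl} disks")            # 19
print(f"per-disk rebuild load reduced by {decl / trad:.1f}x")    # 4.8x
```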
Low-Penalty Disk Rebuild Overhead
[Diagram: rebuild read (Rd) and write (Wr) activity over time after a disk failure, traditional vs. declustered layout]
Declustering reduces rebuild overhead by 3.5x.
Declustered RAID 6 Example
- Traditional: 14 physical disks as 3 traditional RAID 6 arrays plus 2 spares.
- Declustered: the same 14 physical disks as 1 declustered RAID 6 array plus 2 spares, with data, parity and spare space all declustered.
[Diagram: per-stripe fault counts for the red, green and blue stripe groups after two disk failures]
- With two failed disks, the traditional layout has 7 stripes with 2 faults, while the declustered layout has only 1.
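The effect on this slide can be checked with a small counting exercise: given a mapping from stripes to the disks holding their strips, count how many stripes lose two strips when two disks fail. The tiny layouts below are hypothetical stand-ins, not the slide's 14-disk configuration, but they show why declustering leaves far fewer doubly degraded stripes.

```python
# Minimal sketch with made-up placements: count stripes that suffer two
# faults when a given pair of disks fails.
def stripes_with_two_faults(placement: dict[int, set[int]],
                            failed: set[int]) -> int:
    """placement maps stripe id -> the set of disks holding its strips."""
    return sum(1 for disks in placement.values() if len(disks & failed) >= 2)

# Traditional RAID 6: every stripe of an array sits on the same narrow set of
# disks, so a double failure inside that array degrades all its stripes twice.
traditional = {s: {0, 1, 2, 3} for s in range(4)}            # one 4-disk array
print(stripes_with_two_faults(traditional, failed={0, 1}))   # -> 4

# Declustered: the same stripes are scattered across 8 disks, so the same
# double failure rarely hits the same stripe twice.
declustered = {0: {0, 2, 4, 6}, 1: {1, 3, 5, 7},
               2: {0, 3, 4, 7}, 3: {1, 2, 5, 6}}
print(stripes_with_two_faults(declustered, failed={0, 1}))   # -> 0
```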
Non-Intrusive Disk Diagnostics: the Disk Hospital
- Background determination of problems: while a disk is in the hospital, GNR non-intrusively and immediately returns data to the client using the error correction code; for writes, GNR non-intrusively marks the write data and reconstructs it later in the background, after problem determination is complete.
- Advanced fault determination: statistical reliability and SMART monitoring, neighbor check, drive power cycling.
- Media error detection and correction.
- Supports concurrent disk firmware updates.
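A minimal sketch of the read path while a disk is in the hospital: instead of waiting on the sick disk, the missing strip is reconstructed from the surviving strips and parity. Simple XOR parity stands in here for GNR's Reed-Solomon code, and the function names are hypothetical; the real hospital state machine is internal to GPFS.

```python
# Hypothetical sketch: serve a read without touching the hospitalized disk by
# rebuilding its strip from the surviving strips plus parity (XOR used here
# as a stand-in for the real Reed-Solomon erasure code).
from functools import reduce

def xor(blocks: list[bytes]) -> bytes:
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def read_strip(stripe: list[bytes], parity: bytes, sick_index: int) -> bytes:
    """Reconstruct the strip stored on the disk that is in the hospital."""
    survivors = [s for i, s in enumerate(stripe) if i != sick_index]
    return xor(survivors + [parity])

data = [b"\x01\x01", b"\x02\x02", b"\x04\x04"]
parity = xor(data)                                   # b"\x07\x07"
assert read_strip(data, parity, sick_index=1) == b"\x02\x02"
```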
Mean Time to Data Loss: 8+2 vs. 8+3 Parity

         50 disks             200 disks            50,000 disks
  8+2    200,000 years        50,000 years         200 years
  8+3    250 billion years    60 billion years     230 million years

These figures assume uncorrelated failures and hard read errors. Simulation assumptions: disk capacity = 600 GB, MTTF = 600,000 hours, hard error rate = 1 in 10^15 bits, 47-HDD declustered arrays, uncorrelated failures. These MTTDL figures are due to hard errors; AFR (2-fault-tolerant) = 5 x 10^-6, AFR (3-fault-tolerant) = 4 x 10^-12.
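The MTTDL numbers above come from a reliability simulation that is not reproduced here. As a hedged back-of-the-envelope companion using the stated assumptions (600 GB disks, a hard error rate of 1 in 10^15 bits), the sketch below estimates the probability of hitting at least one unrecoverable read error while re-reading eight data strips' worth of disk during a rebuild, which is the kind of event the second and third parity strips protect against.

```python
# Rough estimate, not the slide's simulation: probability of at least one
# hard (unrecoverable) read error while reading n bits, with per-bit error
# probability p, is 1 - (1 - p)^n.
import math

def p_hard_error_during_rebuild(data_disks: int, disk_bytes: float,
                                bit_error_rate: float) -> float:
    bits_read = data_disks * disk_bytes * 8
    # computed via log1p/expm1 to stay numerically stable for tiny p
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))

p = p_hard_error_during_rebuild(data_disks=8, disk_bytes=600e9,
                                bit_error_rate=1e-15)
print(f"P(>=1 hard read error during an 8+2 rebuild) ~ {p:.1%}")   # ~3.8%
```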
Agenda IBM General Parallel File System overview GPFS Native RAID technology GPFS Storage Server
GPFS Native RAID (GNR) on Power 775
- GA on 11/11/11.
- See Chapter 10 of the GPFS 3.5 Advanced Administration Guide (SC23-5182-05) or the GPFS 3.4 Native RAID Administration and Programming Reference (SA23-1354-00).
- First customers: weather agencies, government agencies, universities.
Workload Evolution
- Before: compute cluster connected to NSD file servers 1 and 2 (x3650), which sit in front of dedicated disk controllers.
- After (migrate RAID to commodity file servers): compute cluster connected over FDR InfiniBand or 10 GigE to NSD file servers 1 and 2 running GPFS Native RAID, which drive the disk enclosures directly.
[Diagram: the same compute cluster before and after, with the dedicated disk controllers replaced by directly attached disk enclosures]
Introducing IBM GPFS Storage Server
What's new: replaces the external hardware controller with software-based RAID; modular upgrades improve TCO; non-intrusive disk diagnostics.
Client business value: integrated and ready to go for big data applications; 3 years of maintenance and support; improved storage affordability; end-to-end data integrity; faster rebuild and recovery times (reduces rebuild overhead by 3.5x).
Key features: declustered RAID (8+2p, 8+3p); 2- and 3-fault-tolerant erasure codes; end-to-end checksum; protection against lost writes; off-the-shelf JBODs; standardized in-band SES management; built-in SSD acceleration.
A Scalable Building Block Approach to Storage
Complete storage solution: data servers (x3650 M4), disk (SSD and NL-SAS) in twin-tailed JBOD disk enclosures, software, InfiniBand and Ethernet.
- Model 24 (light and fast): 4 enclosures, 20U, 232 NL-SAS drives, 6 SSDs.
- Model 26 (HPC workhorse): 6 enclosures, 28U, 348 NL-SAS drives, 6 SSDs.
- High-density HPC options: 18 enclosures in two 42U standard racks, 1044 NL-SAS drives, 18 SSDs.
Agenda IBM General Parallel File System overview GPFS Native RAID technology GPFS Storage Server
Doing Big Data Since 1998 IBM GPFS