IBM General Parallel File System (GPFS) 3.5 File Placement Optimizer (FPO)
Rick Koopman, IBM Technical Computing Business Development Benelux
Rick_koopman@nl.ibm.com
An Enterprise-Class Replacement for HDFS: GPFS 3.5 vs. HDFS
- Performance: Terasort (large reads); HBase (small writes); metadata-intensive workloads
- Enterprise readiness: POSIX compliance; metadata replication; distributed name node
- Protection & Recovery: snapshots; asynchronous replication; backup
- Security & Integrity: access control lists
- Ease of Use: policy-based ingest
A Typical HDFS Environment
- Filers (NFS), users, and jobs feeding a MapReduce cluster running Map and Reduce tasks over HDFS
- Uses disk local to each server
- Aggregates the local disk space into a single, redundant shared file system
- HDFS is the open-source standard file system used with Hadoop MapReduce
A MapReduce Environment Using GPFS-FPO (File Placement Optimizer)
- Same layout: filers (NFS), users, and jobs feeding a MapReduce cluster, with GPFS-FPO in place of HDFS
- Uses disk local to each server
- Aggregates the local disk space into a single, redundant shared file system
- Designed for MapReduce workloads
- Unlike HDFS, GPFS-FPO is POSIX-compliant, so data maintenance is easy
- Intended as a drop-in replacement for open-source HDFS (the IBM BigInsights product may be required)
GPFS-FPO: Advanced Storage for MapReduce Data
Hadoop HDFS vs. the IBM GPFS advantages:
- HDFS NameNode is a single point of failure → no single point of failure, distributed metadata
- Large block sizes, poor support for small files → variable block sizes suited to multiple types of data and data-access patterns
- Non-POSIX file system, obscure commands → POSIX file system, easy to use and manage
- Difficult to ingest data, special tools required → policy-based data ingest
- Single-purpose, Hadoop MapReduce only → versatile, multi-purpose
- Not recommended for critical data → enterprise-class advanced storage features
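A minimal sketch of what POSIX compliance buys in practice: standard file APIs (and therefore cp, rsync, find, and so on) operate directly on a GPFS-FPO mount, whereas HDFS ingest requires special tools such as `hadoop fs -put`. The mount path is hypothetical here, so a temporary directory stands in for it.

```python
import pathlib
import shutil
import tempfile

# A temp dir stands in for a hypothetical GPFS-FPO mount point (e.g. /gpfs/fpo0);
# on a real cluster you would use the actual mount path.
gpfs_mount = pathlib.Path(tempfile.mkdtemp())

# Ordinary POSIX file operations are all that is needed to ingest and manage
# data -- no equivalent of "hadoop fs -put" is required.
staging = gpfs_mount / "staging.csv"
staging.write_text("sensor,reading\n42,3.14\n")
shutil.copy(staging, gpfs_mount / "ingested.csv")

print(sorted(p.name for p in gpfs_mount.iterdir()))
```

The same property is why existing backup, archive, and data-maintenance tooling works unchanged against GPFS-FPO.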
IBM Storage Next Generation Archiving Solutions LTFS Storage Platforms
Data Protection, Operational Technical Computing: Powerful. Comprehensive. Intuitive.
The Problem: Network Disk Growth, Manageability, Cost
- Data mix: rich media, databases, etc., combining active, time-sensitive access with static, immutable data
- A single user-defined namespace that is large and growing bigger
- Difficult to protect/back up: backup windows, time to recovery
- The data mix reduces the effectiveness of compression/dedupe
The Solution: Tiered Network Storage
- Single file system view over a user-defined namespace
- High-use data (databases, email, etc.) on disk; policy-based tier migration moves static data, rich media, unstructured and archive data to LTFS tape
- Disk tier stays smaller and scalable: easier to protect, faster time to recovery, smaller backup footprint
- Time-critical applications/data remain on disk; lower-cost, scalable tape storage holds the static data and rich media
- Replication and backup strategies can also use tape
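The policy-based tier migration above is driven by GPFS ILM policy rules. A sketch of one such rule follows; the pool names ('system' for the disk tier, 'ltfs' for the tape tier) and the thresholds are illustrative assumptions, not taken from the slide.

```
/* Migrate files not accessed for 30 days from the disk pool to the LTFS
   tape pool once the disk pool passes 90% full, stopping at 70%.
   Pool names and thresholds are hypothetical. */
RULE 'toTape' MIGRATE FROM POOL 'system'
     THRESHOLD(90,70)
     TO POOL 'ltfs'
     WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
```

Rules like this are evaluated by the GPFS policy engine, so data moves between tiers without application involvement.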
Smarter Storage: A Distributed Data Namespace (Los Angeles, London, Tokyo, each accessed via NFS/CIFS)
- Single-namespace file view with load balancing and policy migration
- Storage distributed across sites, reducing storage cost and enabling data monetization
- Each node (Node 1 through Node 4) runs GPFS, DSM and LTFS LE over SSD, disk and LTFS tape tiers
IBM System x GPFS Storage Server A Revolution in HPC Intelligent Cluster Management!
A Scalable Building-Block Approach to Storage
A complete storage solution: data servers (x3650 M4), disk (SSD and NL-SAS) in twin-tailed JBOD enclosures, software, InfiniBand and Ethernet.
- Model 24 (light and fast): 4 enclosures, 20U, 232 NL-SAS + 6 SSD, 10 GB/s
- Model 26 (HPC workhorse): 6 enclosures, 28U, 348 NL-SAS + 6 SSD, 12 GB/s
- High-density HPC option: 18 enclosures in two 42U standard racks, 1044 NL-SAS + 18 SSD, 36 GB/s
Mean Time to Data Loss (MTTDL): 8+2 vs. 8+3 Parity

Parity   50 disks           200 disks          50,000 disks
8+2      200,000 years      50,000 years       200 years
8+3      250 billion years  60 billion years   230 million years

These figures assume uncorrelated failures and hard read errors. Simulation assumptions: disk capacity = 600 GB, MTTF = 600k hours, hard error rate = 1 in 10^15 bits, 47-HDD declustered arrays, uncorrelated failures. The MTTDL figures are due to hard errors: AFR (2-FT) = 5 x 10^-6, AFR (3-FT) = 4 x 10^-12.
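The table's scaling can be sanity-checked with back-of-envelope arithmetic: if each 47-HDD declustered array loses data independently at the stated annual rate, the system MTTDL is the reciprocal of the summed per-array rates. A sketch, where the per-array AFRs are the slide's own figures and the independence model is an assumption; the results match the table after rounding:

```python
# Annual data-loss rate per 47-disk declustered array (figures from the slide).
AFR_PER_ARRAY = {"8+2": 5e-6, "8+3": 4e-12}

def mttdl_years(total_disks: int, parity: str, disks_per_array: int = 47) -> float:
    """MTTDL assuming independent arrays: failure rates add, so MTTDL = 1/rate."""
    n_arrays = total_disks / disks_per_array
    return 1.0 / (n_arrays * AFR_PER_ARRAY[parity])

for parity in ("8+2", "8+3"):
    for disks in (50, 200, 50_000):
        print(f"{parity} {disks:>6} disks: {mttdl_years(disks, parity):.3g} years")
```

Note the linearity: 1,000 times more disks means 1,000 times shorter MTTDL, which is exactly the 200,000-year to 200-year drop in the 8+2 row.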
De-clustering: Bringing Parallel Performance to Disk Maintenance
Traditional RAID: narrow data+parity arrays
- 20 disks, 5 disks per traditional RAID array; 4x4 RAID stripes (data plus parity)
- Rebuild uses the I/O capacity of only the failed array's 4 surviving disks
- With striping across all arrays, all file accesses are throttled by the rebuilding array's overhead
Declustered RAID: data+parity distributed over all disks
- 20 disks in 1 declustered array; 16 RAID stripes (data plus parity)
- Rebuild uses the I/O capacity of all 19 surviving disks
- Load on file accesses is reduced by 4.8x (= 19/4) during array rebuild
Low-Penalty Disk Rebuild Overhead
[Figure: read/write rebuild activity over time after a disk failure, traditional vs. declustered]
Reduces rebuild overhead by 3.5x