Storage Challenges for Petascale Systems
Dilip D. Kandlur, Director, Storage Systems Research, IBM Research Division
Outline
- Storage technology trends
- Implications for high performance computing
- Achieving petascale storage performance
- Manageability of petascale systems
- Organizing and finding information
Extreme Scaling
- There have been recent inflection points in the CAGR of processing and storage, in the wrong direction!
- Programs like HPCS are aimed at maintaining throughput at or above the CAGR of Moore's Law in spite of these technology trends
[Chart: CPU frequency (GHz) vs. initial ship date, 2000-2007, for Pentium 4 (180 nm), Pentium 4 (130 nm), and Prescott (90 nm); the 2002 roadmap projected ~35% yr/yr growth, revised in 2003 to 10-15% yr/yr]
[Chart: maximum internal disk bandwidth (MB/s), 1998-2010]
[Chart: disk areal density trend (Gb/sq.in.), 2000-2010; growth slowing from 100% CAGR to 25-35% CAGR]
Peta-scale Systems: DARPA HPCS, NSF Track 1
- HPCS goal: double value every 18 months in the face of flattening technology curves
- NSF Track 1 goal: at least a sustained petaflop for actual science applications
- New technologies like multi-core will keep processing power on the rise, but will make storage relatively more expensive
- Maintaining balanced system scaling constants for storage will be expensive: storage bandwidth of 0.001 byte/second/flop, capacity of 20 bytes/flop
- Cost per drive will stay the same order of magnitude, so proportionally the same amount of storage will be a higher fraction of total system cost
- How do you make reliable a system with 10x today's number of moving parts?

System                   Year   TF     GB/s   Nodes   Cores    Storage    Disks
Blue P                   1998   3      3      1464    5856     43 TB      5040
White                    2000   12     9      512     8192     147 TB     8064
Purple/C                 2005   100    122    1536    12288    2000 TB    11000
NSF Track 1 (possible)   2011   2000   2000   10000   300000   40000 TB   50000
HPCS Storage
[Charts: CPU performance, file system throughput, and number of disk drives vs. year (1995-2015); data points: 4 TF / 3.6 GB/s / 5,000 drives; 100 TF / 120 GB/s / 11,000 drives; 6 PF / 6 TB/s / 165,000 drives. HPCS target scale: 300,000 processors, 150,000 disk drives]
Fast
- 5 TB/sec sequential bandwidth
- 30,000 file creates/sec on one node
- Capable of running fsck on 1 trillion files
Robust
- Fix 3 or more concurrent errors
- Detect undetected errors
- Only minor slowing during disk rebuild
- Detect and manage slow disks
Manageable
- Unified manager for files and storage
- End-to-end discovery, metrics, events
- Managing system changes, problem fixes
- GUI scaled to large clusters
GPFS Parallel File System
- Cluster: thousands of nodes, fast reliable communication, common admin domain
- Shared disk: all data and metadata on disk accessible from any node, coordinated by a distributed lock service
- Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks
[Diagram: GPFS file system nodes connected over a data/control IP network to GPFS disk server nodes (VSD on AIX, NSD on Linux; RPC interface to raw disks), which attach to the disks over an FC network]
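To make the striping idea concrete, here is a minimal Python sketch (not GPFS code; the block size and disk count are assumptions) of mapping a file's blocks round-robin across all disks so that transfers proceed in parallel:

    # Minimal sketch (not GPFS code): round-robin striping of a file's
    # blocks across all disks in a pool.
    BLOCK_SIZE = 4 * 1024 * 1024   # assumed 4 MiB file system block size
    NUM_DISKS = 20                 # assumed number of disks in the pool

    def block_location(file_offset: int, first_disk: int = 0):
        """Map a byte offset in a file to (disk index, block number on that disk)."""
        block = file_offset // BLOCK_SIZE
        disk = (first_disk + block) % NUM_DISKS   # stripe round-robin across disks
        return disk, block // NUM_DISKS

    # Example: consecutive blocks land on consecutive disks.
    for off in range(0, 5 * BLOCK_SIZE, BLOCK_SIZE):
        print(off, block_location(off))

Because consecutive blocks land on different disks, a large sequential read or write keeps every disk and every server busy at once.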
Scaling GPFS
HPCS file system performance and scaling targets:
- Balanced system DOE metrics (0.001 B/s/F, 20 B/F): this means 2-6 TB/s throughput and 40-120 PB of storage!!
- Other performance goals: 30 GB/s from a single node to a single file for data ingest; 30K file opens per second on a single node; 1 trillion files in a single file system; scaling to 32K nodes (OS images)
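As a sanity check, applying the balance metrics to a hypothetical 2 PF machine (the low end of the range) gives:

    \mathrm{bandwidth} = 10^{-3}\,\tfrac{\mathrm{B/s}}{\mathrm{flop/s}} \times 2\times10^{15}\,\mathrm{flop/s} = 2\ \mathrm{TB/s},
    \qquad
    \mathrm{capacity} = 20\,\tfrac{\mathrm{B}}{\mathrm{flop/s}} \times 2\times10^{15}\,\mathrm{flop/s} = 40\ \mathrm{PB}.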
Extreme Scaling: Metadata
Metadata: the on-disk data structures that represent hierarchical directories, storage allocation maps, etc.
Why is it a problem?
- Structural integrity requires proper synchronization
- Performance is sensitive to the latency of these (small) I/Os
Techniques for scaling metadata (see the sketch after this list):
- Scaling synchronization (distributing the lock manager)
- Segregating metadata from data to reduce queuing delays: separate disks, separate fabric ports, and different RAID levels for metadata to reduce latency, or solid-state memory
- Adaptive metadata management (centralized vs. distributed)
GPFS provides all of these to some degree, and work is always ongoing. Sensible application design can make a big difference!
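One way to picture "distributing the lock manager" is the following sketch (purely illustrative Python, not the GPFS token protocol; the node names are hypothetical): lock responsibility is sharded across several token-server nodes by hashing the resource, e.g. the inode number, so no single node serializes all metadata traffic.

    # Minimal sketch (not GPFS code) of a distributed lock manager:
    # shard lock tokens across token-server nodes by hashing the inode.
    from hashlib import blake2b

    TOKEN_SERVERS = ["node01", "node02", "node03", "node04"]  # hypothetical names

    def token_server_for(inode: int) -> str:
        """Pick the token server responsible for a given inode's lock."""
        h = blake2b(inode.to_bytes(8, "little"), digest_size=4)
        return TOKEN_SERVERS[int.from_bytes(h.digest(), "little") % len(TOKEN_SERVERS)]

    # Lock requests for different inodes fan out to different servers.
    for ino in (101, 102, 103, 104):
        print(ino, "->", token_server_for(ino))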
Data Loss in Petascale Systems
- Petaflop systems require tens to hundreds of petabytes of storage
- Evidence exists that manufacturer MTBF specs may be optimistic (Schroeder & Gibson)
- Evidence exists that failure statistics may not be as favorable as a simple exponential distribution
- A hard error rate of 1 in 10^15 bits means one rebuild in 30 will hit an error: rebuilding an 8+P array of 500 GB drives reads 4 TB, or 3.2 x 10^13 bits
- RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
- Simulations over file system size, drive MTBF, and failure probability distribution show a 4%-28% chance of data loss over a five-year lifetime for an 8+2P code
- Stronger RAID (8+3P) increases MTTDL by 3-4 orders of magnitude for an extra 10% overhead; stronger RAID is sufficiently reliable even for unreliable (commodity) disk drives
[Chart: MTTDL in years (log scale, 1 to 10^7) for a 20 PB system, by configuration: 8+3P vs. 8+2P, 600K hr vs. 300K hr MTBF, exponential vs. Weibull failure distributions; the five-year data loss probabilities for the 8+2P configurations range from 4% to 28%]
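The one-rebuild-in-30 figure follows directly from the slide's numbers:

    P(\text{rebuild hits an error}) = 1 - (1 - 10^{-15})^{3.2\times10^{13}} \approx 1 - e^{-0.032} \approx 3.1\% \approx \tfrac{1}{31}.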
GPFS Software RAID
Implement software RAID in the GPFS NSD server.
Motivations:
- Better fault tolerance
- Reduce the performance impact of rebuilds and slow disks
- Eliminate costly external RAID controllers and storage fabric
- Use the processing cycles now being wasted in the storage node
- Improve performance by file-system-aware caching
Approach:
- Storage node (NSD server) manages disks as JBOD
- Use stronger RAID codes as appropriate (e.g., triple parity for data and multi-way mirroring for metadata)
- Always check parity on read: increases reliability and prevents performance degradation from slow drives
- Checksum everything! (see the sketch after this list)
- Declustered RAID for better load balancing and non-disruptive rebuild
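A minimal sketch of "checksum everything" (illustrative Python, not the GPFS implementation; the in-memory block store stands in for a disk): store a checksum with each block on write and verify it on every read, so silent corruption is detected instead of being returned to the application.

    # Minimal sketch (not GPFS code): checksum on write, verify on read.
    import zlib

    store = {}  # hypothetical block store: block_id -> (checksum, data)

    def write_block(block_id: int, data: bytes) -> None:
        store[block_id] = (zlib.crc32(data), data)

    def read_block(block_id: int) -> bytes:
        checksum, data = store[block_id]
        if zlib.crc32(data) != checksum:   # always verify on read
            raise IOError(f"checksum mismatch on block {block_id}; reconstruct from parity")
        return data

    write_block(7, b"payload")
    assert read_block(7) == b"payload"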
Declustered RAID
[Diagram: partitioned RAID vs. declustered RAID; 16 logical tracks mapped onto 20 physical disks in each case]
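The declustered placement can be sketched as follows (illustrative Python; the 8+P strip width of 9 is an assumption): each logical track's strips are scattered pseudo-randomly across all physical disks rather than being confined to one fixed array.

    # Minimal sketch (not GPFS code) of declustered placement.
    import random

    NUM_DISKS = 20        # physical disks, as in the slide's diagram
    STRIPS_PER_TRACK = 9  # e.g. an 8+P track has 9 strips (assumed width)

    def declustered_placement(track: int) -> list[int]:
        """Choose the disks holding this logical track's strips."""
        rng = random.Random(track)  # deterministic per-track choice
        return rng.sample(range(NUM_DISKS), STRIPS_PER_TRACK)

    for t in range(4):
        print(f"track {t}: disks {sorted(declustered_placement(t))}")

Because each track lands on a different subset of all 20 disks, a failed disk's contents are scattered across nearly all survivors, which is what makes the balanced rebuild on the next slides possible.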
Rebuild Work Distribution
[Diagram: after a disk failure in a declustered array, rebuild work is spread across all surviving disks; relative read and write throughput per disk during rebuild is shown]
Rebuild (2)
Upon the first failure, begin rebuilding the tracks affected by the failure. Many disks are involved in performing the rebuild, so the work is balanced, avoiding hot spots.
Declustered vs. Partitioned RAID
[Chart: simulated data losses per year per 100 PB (log scale, 10^-3 to 10^5) vs. failure tolerance (1, 2, or 3 concurrent failures), for partitioned and declustered RAID]
Autonomic Storage Management: Making Complex Tasks Simple
IBM TotalStorage Productivity Center Standard Edition: a single application with modular components (Disk, Data, Fabric).
Business resiliency:
- Integrated Replication Manager
- Metro, Global, and Cascaded Disaster Recovery
- Application Disaster Recovery
Console enhancements:
- End-to-end Data Path Explorer
- Integrated Storage Planner
- Configuration Change Rover
- Configuration Checker
- Personalization
- TSM integration
Ease of use:
- Streamlined installation and packaging
- Single user interface, single database, and a single set of services for consistent administration and operations
Policy-based storage management:
- SAN best practices and SAN configuration validation
- Storage subsystem planning
- Fabric security planning
- Host planning (multi-path)
Integrated Management
Seamlessly integrate systems management across servers, storage, and network, and provide end-to-end problem determination and analytics capabilities.
- Integrated Web 2.0 GUI
- Best practices deployment
- Systems knowledge DB
- Orchestration, analytics, discovery, monitoring, reporting, configuration
[Diagram: management stack spanning applications, middleware, operating systems, virtualization software, and hardware, across file system, server, network, and storage]
PERCS Management
A unified, standards-based management system for GPFS and PERCS storage, with a GUI designed for large-scale clusters and supporting PERCS-scale GPFS.
The PERCS UI will support:
- Information collection: asset tracking, end-to-end discovery, metrics, events
- Management: system changes, problem fixes, configuration changes
- Rich visualizations to help administrators maintain situational awareness of system status: essential for large systems, and also enables GPFS to satisfy commercial customers requiring ease of use
[Diagram: the PERCS GUI (CIM client) talks to a CIMOM server; its CIM provider retrieves data from the GPFS file system and PERCS storage using a CIM model; the CIMOM uses a CIM repository and a systems DB, with a simulator available]
Analytics
Problem determination and impact analysis:
- Root cause analysis: discover the finest-grain events that indicate the root cause of the problem
- Symptom suppression: correlate alarms/symptoms caused by a common cause across the integrated infrastructure
- Bottleneck analysis: post-mortem, live, and predictive analysis
Workload and virtualization management:
- Automatically monitor multi-tiered, distributed, heterogeneous or homogeneous workloads
- Migrate virtual machines to satisfy performance goals
Integrated server, storage, and network allocation and migration:
- Integrated allocation accounting for connectivity, affinity, flows, and ports, based on performance workloads
Disaster management:
- Integrated server/storage disaster recovery support
Visualization
Integrated management is centered around Topology Viewer capabilities based on Web 2.0 technologies:
- Data Path Viewer for applications, servers, networks, and storage
- Progressive information disclosure
- Semantic zooming
- Information overlays
- Mixed graphical and tabular views
- Integrated historical and real-time reporting
The Changing Nature of Archive
Current archive: data landfill
- Store and forget
- Not easily accessible; typically offline and offsite, with access time measured in days
- Not organized for usage; retained just in case it is needed
Emerging archive: leverage information for business advantage
- Readily accessible; access time measured in seconds
- Indexed for effective discovery
- Mined for business value
Building Storage Systems Targeted at Archive
Scalability:
- Scale to huge capacity; exploit tiered storage with disk and tape; leverage commodity disk storage
- Handle an extremely large number of objects; support high ingest rates
- Effect data management actions in a scalable fashion
Functionality:
- Consistently handle multiple kinds of objects
- Manage and retrieve based on data semantics, e.g. logical groupings of objects
- Support effective search and discovery
- Provide for compliance with regulations
Reliability:
- Ensure data integrity and protection
- Provide media management and rejuvenation
- Support long-term retention
GPFS Information Lifecycle Management (ILM)
GPFS ILM abstractions:
- Storage pool: a group of LUNs
- Fileset: a subtree of a file system namespace
- Policy: a rule for file placement, retention, or movement among pools
ILM scenarios:
- Tiered storage: fast storage for frequently used files, slower storage for infrequently used files
- Project storage: separate pools for each project, each with separate policies, quotas, etc.
- Differentiated storage: e.g. place media files on media-friendly storage (QoS)
[Diagram: GPFS clients (applications using the POSIX interface, with placement policy applied) connect via the GPFS RPC protocol over a storage network to the system pool and the data pools (gold, silver, pewter) that make up the GPFS file system (volume group); a GPFS manager node runs the cluster, lock, quota, allocation, and policy managers]
GPFS 3.1 ILM Policies
- Placement policies, evaluated at file creation (example below)
- Migration policies, evaluated periodically
- Deletion policies, evaluated periodically
[Diagram: the same cluster architecture as the previous slide, with placement policy on the GPFS clients and the policy manager on the GPFS manager node]
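The slide promises an example; a plausible set of rules in the GPFS policy language (the pool, fileset, and rule names here are hypothetical, and the exact clauses should be checked against the GPFS 3.1 documentation) might read:

    /* Placement: new files in fileset 'mediaProj' go to the gold pool;
       everything else defaults to silver. */
    RULE 'media' SET POOL 'gold' FOR FILESET ('mediaProj')
    RULE 'default' SET POOL 'silver'

    /* Migration: when the gold pool passes 90% full, migrate files not
       accessed in 30 days down to silver until occupancy drops to 70%. */
    RULE 'cooloff' MIGRATE FROM POOL 'gold' THRESHOLD (90,70) TO POOL 'silver'
         WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '30' DAYS

    /* Deletion: purge files untouched for a year from the slowest pool. */
    RULE 'purge' DELETE FROM POOL 'pewter'
         WHERE (CURRENT_TIMESTAMP - MODIFICATION_TIME) > INTERVAL '365' DAYS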
GPFS Policy Engine
Migrate and delete rules scan the file system to identify candidate files. Conventional backup and HSM systems also do this, usually implemented with readdir() and stat(). This is slow, random small-record reads plus distributed locking, and can take hours or days for a large file system.
The GPFS Policy Engine uses an efficient sort-merge rather than slow readdir()/stat():
- A directory walk builds a list of path names (readdir(), but no stat()!)
- The list is sorted by inode number, merged with the inode file, then evaluated
- Both list building and policy evaluation are done in parallel on all nodes
- More than 10^5 files/sec per node! (See the sketch after this list.)
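A minimal Python sketch of the sort-merge idea (not GPFS code; the inode file is modeled as an in-memory map from inode number to attributes): the walk collects inode numbers straight from directory entries, avoiding a per-file stat(), and the subsequent merge reads attributes in inode order, i.e. sequentially.

    # Minimal sketch (not GPFS code) of the sort-merge policy scan.
    import os

    def walk_names(root):
        """readdir()-only walk: directory entries already carry the inode
        number, so no per-file stat() is needed."""
        stack = [root]
        while stack:
            d = stack.pop()
            with os.scandir(d) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        yield entry.inode(), entry.path

    def policy_scan(root, inode_table, rule):
        """Sort candidates by inode number, then merge with a sequential pass
        over the inode table (modeled here as a dict: inode -> attributes)."""
        candidates = sorted(walk_names(root))      # sort by inode number
        for ino, path in candidates:               # merge: in-order attribute lookups
            attrs = inode_table.get(ino)
            if attrs is not None and rule(path, attrs):
                yield path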
Storage Hierarchies: the Old Way
Normally implemented one of two ways:
- Explicit control: an archive command (IBM TSM, UniTree), copying into a special archive file system (IBM HPSS), or copying to an archive server (HPSS, UniTree), all of which are troublesome and error-prone for the user
- Implicit control through an interface like DMAPI: the file system sends events to the HSM system (create/delete, low space); the archive system moves data and punches holes in files to manage space; an access miss generates an event, and the HSM system transparently brings the file back
[Diagram: HPSS 6.2 API architecture. (1) A client in the client domain issues an HPSS write or put to the HPSS core server over an IP network; (2) the client transfers the file to HPSS disk or tape over a TCP/IP LAN or WAN using an HPSS mover, or via moverless SAN data transfers over an FC SAN; the HPSS cluster runs the core server and movers with DB2 metadata disks, disk arrays, and tape libraries]
[Diagram: GPFS 3.1 and HPSS 6.2 DMAPI architecture. A GPFS cluster (session node, I/O nodes, disk arrays) runs the HPSS interface and HSM processes; HSM control information flows over an IP LAN to the HPSS cluster (core server, movers, DB2, disk arrays, tape libraries), with data transfers over the LAN or moverless SAN and tape-disk transfers within HPSS]
DMAPI Problems
- Namespace events (create, delete, rename) are synchronous and recoverable; each is multiple database transactions, which slows down the file system
- Directory scans: DMAPI low-space events trigger directory scans to determine what to archive, which can take hours or days on a large FS; scans have little information upon which to make archiving decisions (what you get from ls -l), so data movement policies are usually hard-coded and primitive
- Read/write managed regions block the user program while data is brought back from the HSM system
- Parallel data movement isn't in the spec, but everyone implements it anyway; data movement is actually the one thing about DMAPI worth saving
[Diagram: the GPFS 3.1 and HPSS 6.2 DMAPI architecture from the previous slide]
GPFS Approach: External Pools
External pools are really interfaces to external storage managers, e.g. HPSS or TSM.
An external pool rule defines the script to call to migrate/recall/etc. files:
    RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [ OPTS 'Options' ]
The GPFS policy engine builds candidate lists and passes them to the external pool scripts; the external storage manager actually moves the data, using DMAPI managed regions (read/write invisible, punch hole) or conventional POSIX APIs.
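For illustration, here is a sketch of what such an interface script could look like in Python. The invocation convention (command name plus the path of a file list) and the file-list format are assumptions, not the documented GPFS contract, and hsm_archive/hsm_recall are hypothetical HSM command-line tools standing in for real HPSS or TSM calls.

    #!/usr/bin/env python3
    # Sketch of an external-pool interface script. Assumed invocation:
    #   InterfaceScript <command> <filelist> [opts]
    import subprocess, sys

    def main():
        command, filelist = sys.argv[1], sys.argv[2]
        with open(filelist) as f:
            # Assume the last whitespace-separated field of each line is the path.
            paths = [line.rsplit(None, 1)[-1] for line in f if line.strip()]
        if command == "MIGRATE":
            for p in paths:
                subprocess.run(["hsm_archive", p], check=True)  # hypothetical HSM CLI
        elif command == "RECALL":
            for p in paths:
                subprocess.run(["hsm_recall", p], check=True)   # hypothetical HSM CLI
        elif command == "TEST":
            pass  # report readiness by exiting 0
        return 0

    if __name__ == "__main__":
        sys.exit(main())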
GPFS ILM Demonstration
[Diagram: SC'06 demo. GPFS at SC'06 in Tampa, FL (1M active files; FC and SATA disks) connected via a 10 Gb link to an HPSS archive at NERSC in Oakland, CA (tapes with disk buffering); high-bandwidth, parallel data movement across all devices and networks]
Nearline Information: Conceptual View
[Diagram: NFS/CIFS clients, the TSM archive client/API, and admin/search tools sit in front of NFS/CIFS servers on a scale-out archiving engine (GPFS cluster); migration to TSM deep storage goes via the TSM archive client, over DMAPI and the TSM archive API; a global index and search capability spans the archive]
- Provides the capability to handle extended metadata; metadata may be derived from data content
- Extended attributes: integrity code, retention period, retention hold status, and any application metadata
- Global index on content and EA metadata
- Allows for application-specific parsers (e.g., DICOM)
Summary
- Storage environments are moving from petabytes to exabytes, in traditional HPC and in new archive environments
- Significant challenges for reliability, resiliency, and manageability
- Metadata becomes key for information organization and discovery