Storage Challenges for Petascale Systems
Dilip D. Kandlur, Director, Storage Systems Research, IBM Research Division
Outline
- Storage technology trends
- Implications for high performance computing
- Achieving petascale storage performance
- Manageability of petascale systems
- Organizing and finding information
Extreme Scaling
- There have been recent inflection points in the CAGR of processing and storage, in the wrong direction!
- Programs like HPCS are aimed at maintaining throughput at or above the CAGR of Moore's Law in spite of these technology trends
[Chart: CPU frequency (GHz) vs. initial ship date, 2000-2007, for Pentium 4 (180 nm), Pentium 4 (130 nm), and Prescott (90 nm); the 2002 roadmap projected ~35% yr/yr growth, revised in 2003 to 10-15% yr/yr]
[Chart: maximum internal disk bandwidth (MB/s), 1998-2010]
[Chart: disk areal density trend (Gb/sq.in.), 2000-2010; growth slowing from 100% CAGR to 25-35% CAGR]
Peta-scale Systems: DARPA HPCS, NSF Track 1
- HPCS goal: double value every 18 months in the face of flattening technology curves
- NSF Track 1 goal: at least a sustained petaflop for actual science applications
- New technologies like multi-core will keep processing power on the rise, but will make storage relatively more expensive
- Maintaining balanced system scaling constants for storage will be expensive: storage bandwidth of 0.001 byte/second/flop, capacity of 20 bytes/flop
- Cost per drive will stay the same order of magnitude, so proportionally the same amount of storage will be a higher fraction of total system cost
- How do you make reliable a system with 10x today's number of moving parts?

System                   Year   TF     GB/s   Nodes   Cores    Storage    Disks
Blue P                   1998   3      3      1464    5856     43 TB      5040
White                    2000   12     9      512     8192     147 TB     8064
Purple/C                 2005   100    122    1536    12288    2000 TB    11000
NSF Track 1 (possible)   2011   2000   2000   10000   300000   40000 TB   50000
HPCS Storage
[Charts: CPU performance, file system throughput, and number of disk drives vs. year (1995-2015); data points: 4 TF / 3.6 GB/s / 5,000 drives; 100 TF / 120 GB/s / 11,000 drives; 6 PF / 6 TB/s / 165,000 drives. HPCS target scale: 300,000 processors, 150,000 disk drives]
Fast
- 5 TB/sec sequential bandwidth
- 30,000 file creates/sec on one node
- Capable of running fsck on 1 trillion files
Robust
- Fix 3 or more concurrent errors
- Detect undetected errors
- Only minor slowing during disk rebuild
- Detect and manage slow disks
Manageable
- Unified manager for files and storage
- End-to-end discovery, metrics, events
- Managing system changes, problem fixes
- GUI scaled to large clusters
GPFS Parallel File System
- Cluster: thousands of nodes, fast reliable communication, common admin domain
- Shared disk: all data and metadata on disk accessible from any node, coordinated by a distributed lock service
- Parallel: data and metadata flow to/from all nodes from/to all disks in parallel; files striped across all disks
[Diagram: GPFS file system nodes connected over a data/control IP network to GPFS disk server nodes (VSD on AIX, NSD on Linux; RPC interface to raw disks), which attach to the disks over an FC network]
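To make the striping idea concrete, here is a minimal Python sketch (not GPFS code; the block size and disk count are assumptions) of mapping a file's blocks round-robin across all disks so that transfers proceed in parallel:

    # Minimal sketch (not GPFS code): round-robin striping of a file's
    # blocks across all disks in a pool.
    BLOCK_SIZE = 4 * 1024 * 1024   # assumed 4 MiB file system block size
    NUM_DISKS = 20                 # assumed number of disks in the pool

    def block_location(file_offset: int, first_disk: int = 0):
        """Map a byte offset in a file to (disk index, block number on that disk)."""
        block = file_offset // BLOCK_SIZE
        disk = (first_disk + block) % NUM_DISKS   # stripe round-robin across disks
        return disk, block // NUM_DISKS

    # Example: consecutive blocks land on consecutive disks.
    for off in range(0, 5 * BLOCK_SIZE, BLOCK_SIZE):
        print(off, block_location(off))

Because consecutive blocks land on different disks, a large sequential read or write keeps every disk and every server busy at once.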
Scaling GPFS
HPCS file system performance and scaling targets:
- Balanced system DOE metrics (0.001 B/s/F, 20 B/F): this means 2-6 TB/s throughput and 40-120 PB of storage!!
- Other performance goals: 30 GB/s from a single node to a single file for data ingest; 30K file opens per second on a single node; 1 trillion files in a single file system; scaling to 32K nodes (OS images)
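As a sanity check, applying the balance metrics to a hypothetical 2 PF machine (the low end of the range) gives:

    \mathrm{bandwidth} = 10^{-3}\,\tfrac{\mathrm{B/s}}{\mathrm{flop/s}} \times 2\times10^{15}\,\mathrm{flop/s} = 2\ \mathrm{TB/s},
    \qquad
    \mathrm{capacity} = 20\,\tfrac{\mathrm{B}}{\mathrm{flop/s}} \times 2\times10^{15}\,\mathrm{flop/s} = 40\ \mathrm{PB}.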
Extreme Scaling: Metadata
Metadata: the on-disk data structures that represent hierarchical directories, storage allocation maps, etc.
Why is it a problem?
- Structural integrity requires proper synchronization
- Performance is sensitive to the latency of these (small) I/Os
Techniques for scaling metadata (see the sketch after this list):
- Scaling synchronization (distributing the lock manager)
- Segregating metadata from data to reduce queuing delays: separate disks, separate fabric ports, and different RAID levels for metadata to reduce latency, or solid-state memory
- Adaptive metadata management (centralized vs. distributed)
GPFS provides all of these to some degree, and work is always ongoing. Sensible application design can make a big difference!
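One way to picture "distributing the lock manager" is the following sketch (purely illustrative Python, not the GPFS token protocol; the node names are hypothetical): lock responsibility is sharded across several token-server nodes by hashing the resource, e.g. the inode number, so no single node serializes all metadata traffic.

    # Minimal sketch (not GPFS code) of a distributed lock manager:
    # shard lock tokens across token-server nodes by hashing the inode.
    from hashlib import blake2b

    TOKEN_SERVERS = ["node01", "node02", "node03", "node04"]  # hypothetical names

    def token_server_for(inode: int) -> str:
        """Pick the token server responsible for a given inode's lock."""
        h = blake2b(inode.to_bytes(8, "little"), digest_size=4)
        return TOKEN_SERVERS[int.from_bytes(h.digest(), "little") % len(TOKEN_SERVERS)]

    # Lock requests for different inodes fan out to different servers.
    for ino in (101, 102, 103, 104):
        print(ino, "->", token_server_for(ino))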
Data Loss in Petascale Systems
- Petaflop systems require tens to hundreds of petabytes of storage
- Evidence exists that manufacturer MTBF specs may be optimistic (Schroeder & Gibson)
- Evidence exists that failure statistics may not be as favorable as a simple exponential distribution
- A hard error rate of 1 in 10^15 bits means one rebuild in 30 will hit an error: rebuilding an 8+P array of 500 GB drives reads 4 TB, or 3.2 x 10^13 bits
- RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
- Simulations over file system size, drive MTBF, and failure probability distribution show a 4%-28% chance of data loss over a five-year lifetime for an 8+2P code
- Stronger RAID (8+3P) increases MTTDL by 3-4 orders of magnitude for an extra 10% overhead; stronger RAID is sufficiently reliable even for unreliable (commodity) disk drives
[Chart: MTTDL in years (log scale, 1 to 10^7) for a 20 PB system, by configuration: 8+3P vs. 8+2P, 600K hr vs. 300K hr MTBF, exponential vs. Weibull failure distributions; the five-year data loss probabilities for the 8+2P configurations range from 4% to 28%]
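The one-rebuild-in-30 figure follows directly from the slide's numbers:

    P(\text{rebuild hits an error}) = 1 - (1 - 10^{-15})^{3.2\times10^{13}} \approx 1 - e^{-0.032} \approx 3.1\% \approx \tfrac{1}{31}.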
GPFS Software RAID
Implement software RAID in the GPFS NSD server.
Motivations:
- Better fault tolerance
- Reduce the performance impact of rebuilds and slow disks
- Eliminate costly external RAID controllers and storage fabric
- Use the processing cycles now being wasted in the storage node
- Improve performance by file-system-aware caching
Approach:
- Storage node (NSD server) manages disks as JBOD
- Use stronger RAID codes as appropriate (e.g., triple parity for data and multi-way mirroring for metadata)
- Always check parity on read: increases reliability and prevents performance degradation from slow drives
- Checksum everything! (see the sketch after this list)
- Declustered RAID for better load balancing and non-disruptive rebuild
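A minimal sketch of "checksum everything" (illustrative Python, not the GPFS implementation; the in-memory block store stands in for a disk): store a checksum with each block on write and verify it on every read, so silent corruption is detected instead of being returned to the application.

    # Minimal sketch (not GPFS code): checksum on write, verify on read.
    import zlib

    store = {}  # hypothetical block store: block_id -> (checksum, data)

    def write_block(block_id: int, data: bytes) -> None:
        store[block_id] = (zlib.crc32(data), data)

    def read_block(block_id: int) -> bytes:
        checksum, data = store[block_id]
        if zlib.crc32(data) != checksum:   # always verify on read
            raise IOError(f"checksum mismatch on block {block_id}; reconstruct from parity")
        return data

    write_block(7, b"payload")
    assert read_block(7) == b"payload"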
Declustered RAID
[Diagram: partitioned RAID vs. declustered RAID; 16 logical tracks mapped onto 20 physical disks in each case]
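The declustered placement can be sketched as follows (illustrative Python; the 8+P strip width of 9 is an assumption): each logical track's strips are scattered pseudo-randomly across all physical disks rather than being confined to one fixed array.

    # Minimal sketch (not GPFS code) of declustered placement.
    import random

    NUM_DISKS = 20        # physical disks, as in the slide's diagram
    STRIPS_PER_TRACK = 9  # e.g. an 8+P track has 9 strips (assumed width)

    def declustered_placement(track: int) -> list[int]:
        """Choose the disks holding this logical track's strips."""
        rng = random.Random(track)  # deterministic per-track choice
        return rng.sample(range(NUM_DISKS), STRIPS_PER_TRACK)

    for t in range(4):
        print(f"track {t}: disks {sorted(declustered_placement(t))}")

Because each track lands on a different subset of all 20 disks, a failed disk's contents are scattered across nearly all survivors, which is what makes the balanced rebuild on the next slides possible.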
Rebuild Work Distribution
[Diagram: after a disk failure in a declustered array, rebuild work is spread across all surviving disks; relative read and write throughput per disk during rebuild is shown]
Rebuild (2)
Upon the first failure, begin rebuilding the tracks affected by the failure. Many disks are involved in performing the rebuild, so the work is balanced, avoiding hot spots.
Declustered vs. Partitioned RAID
[Chart: simulated data losses per year per 100 PB (log scale, 10^-3 to 10^5) vs. failure tolerance (1, 2, or 3 concurrent failures), for partitioned and declustered RAID]
Autonomic Storage Management: Making Complex Tasks Simple
IBM TotalStorage Productivity Center Standard Edition: a single application with modular components (Disk, Data, Fabric).
Business resiliency:
- Integrated Replication Manager
- Metro, Global, and Cascaded Disaster Recovery
- Application Disaster Recovery
Console enhancements:
- End-to-end Data Path Explorer
- Integrated Storage Planner
- Configuration Change Rover
- Configuration Checker
- Personalization
- TSM integration
Ease of use:
- Streamlined installation and packaging
- Single user interface, single database, and a single set of services for consistent administration and operations
Policy-based storage management:
- SAN best practices and SAN configuration validation
- Storage subsystem planning
- Fabric security planning
- Host planning (multi-path)
Integrated Management
Seamlessly integrate systems management across servers, storage, and network, and provide end-to-end problem determination and analytics capabilities.
- Integrated Web 2.0 GUI
- Best practices deployment
- Systems knowledge DB
- Orchestration, analytics, discovery, monitoring, reporting, configuration
[Diagram: management stack spanning applications, middleware, operating systems, virtualization software, and hardware, across file system, server, network, and storage]
PERCS Management
A unified, standards-based management system for GPFS and PERCS storage, with a GUI designed for large-scale clusters and supporting PERCS-scale GPFS.
The PERCS UI will support:
- Information collection: asset tracking, end-to-end discovery, metrics, events
- Management: system changes, problem fixes, configuration changes
- Rich visualizations to help administrators maintain situational awareness of system status: essential for large systems, and also enables GPFS to satisfy commercial customers requiring ease of use
[Diagram: the PERCS GUI (CIM client) talks to a CIMOM server; its CIM provider retrieves data from the GPFS file system and PERCS storage using a CIM model; the CIMOM uses a CIM repository and a systems DB, with a simulator available]
Analytics
Problem determination and impact analysis:
- Root cause analysis: discover the finest-grain events that indicate the root cause of the problem
- Symptom suppression: correlate alarms/symptoms caused by a common cause across the integrated infrastructure
- Bottleneck analysis: post-mortem, live, and predictive analysis
Workload and virtualization management:
- Automatically monitor multi-tiered, distributed, heterogeneous or homogeneous workloads
- Migrate virtual machines to satisfy performance goals
Integrated server, storage, and network allocation and migration:
- Integrated allocation accounting for connectivity, affinity, flows, and ports, based on performance workloads
Disaster management:
- Integrated server/storage disaster recovery support
Visualization
Integrated management is centered around Topology Viewer capabilities based on Web 2.0 technologies:
- Data Path Viewer for applications, servers, networks, and storage
- Progressive information disclosure
- Semantic zooming
- Information overlays
- Mixed graphical and tabular views
- Integrated historical and real-time reporting
The Changing Nature of Archive
Current archive: data landfill
- Store and forget
- Not easily accessible; typically offline and offsite, with access time measured in days
- Not organized for usage; retained just in case it is needed
Emerging archive: leverage information for business advantage
- Readily accessible; access time measured in seconds
- Indexed for effective discovery
- Mined for business value
Building Storage Systems Targeted at Archive
Scalability:
- Scale to huge capacity; exploit tiered storage with disk and tape; leverage commodity disk storage
- Handle an extremely large number of objects; support high ingest rates
- Effect data management actions in a scalable fashion
Functionality:
- Consistently handle multiple kinds of objects
- Manage and retrieve based on data semantics, e.g. logical groupings of objects
- Support effective search and discovery
- Provide for compliance with regulations
Reliability:
- Ensure data integrity and protection
- Provide media management and rejuvenation
- Support long-term retention
GPFS Information Lifecycle Management (ILM)
GPFS ILM abstractions:
- Storage pool: a group of LUNs
- Fileset: a subtree of a file system namespace
- Policy: a rule for file placement, retention, or movement among pools
ILM scenarios:
- Tiered storage: fast storage for frequently used files, slower storage for infrequently used files
- Project storage: separate pools for each project, each with separate policies, quotas, etc.
- Differentiated storage: e.g. place media files on media-friendly storage (QoS)
[Diagram: GPFS clients (applications using the POSIX interface, with placement policy applied) connect via the GPFS RPC protocol over a storage network to the system pool and the data pools (gold, silver, pewter) that make up the GPFS file system (volume group); a GPFS manager node runs the cluster, lock, quota, allocation, and policy managers]
GPFS 3.1 ILM Policies
- Placement policies, evaluated at file creation (example below)
- Migration policies, evaluated periodically
- Deletion policies, evaluated periodically
[Diagram: the same cluster architecture as the previous slide, with placement policy on the GPFS clients and the policy manager on the GPFS manager node]
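The slide promises an example; a plausible set of rules in the GPFS policy language (the pool, fileset, and rule names here are hypothetical, and the exact clauses should be checked against the GPFS 3.1 documentation) might read:

    /* Placement: new files in fileset 'mediaProj' go to the gold pool;
       everything else defaults to silver. */
    RULE 'media' SET POOL 'gold' FOR FILESET ('mediaProj')
    RULE 'default' SET POOL 'silver'

    /* Migration: when the gold pool passes 90% full, migrate files not
       accessed in 30 days down to silver until occupancy drops to 70%. */
    RULE 'cooloff' MIGRATE FROM POOL 'gold' THRESHOLD (90,70) TO POOL 'silver'
         WHERE (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '30' DAYS

    /* Deletion: purge files untouched for a year from the slowest pool. */
    RULE 'purge' DELETE FROM POOL 'pewter'
         WHERE (CURRENT_TIMESTAMP - MODIFICATION_TIME) > INTERVAL '365' DAYS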
GPFS Policy Engine
Migrate and delete rules scan the file system to identify candidate files. Conventional backup and HSM systems also do this, usually implemented with readdir() and stat(). This is slow, random small-record reads plus distributed locking, and can take hours or days for a large file system.
The GPFS Policy Engine uses an efficient sort-merge rather than slow readdir()/stat():
- A directory walk builds a list of path names (readdir(), but no stat()!)
- The list is sorted by inode number, merged with the inode file, then evaluated
- Both list building and policy evaluation are done in parallel on all nodes
- More than 10^5 files/sec per node! (See the sketch after this list.)
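A minimal Python sketch of the sort-merge idea (not GPFS code; the inode file is modeled as an in-memory map from inode number to attributes): the walk collects inode numbers straight from directory entries, avoiding a per-file stat(), and the subsequent merge reads attributes in inode order, i.e. sequentially.

    # Minimal sketch (not GPFS code) of the sort-merge policy scan.
    import os

    def walk_names(root):
        """readdir()-only walk: directory entries already carry the inode
        number, so no per-file stat() is needed."""
        stack = [root]
        while stack:
            d = stack.pop()
            with os.scandir(d) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        yield entry.inode(), entry.path

    def policy_scan(root, inode_table, rule):
        """Sort candidates by inode number, then merge with a sequential pass
        over the inode table (modeled here as a dict: inode -> attributes)."""
        candidates = sorted(walk_names(root))      # sort by inode number
        for ino, path in candidates:               # merge: in-order attribute lookups
            attrs = inode_table.get(ino)
            if attrs is not None and rule(path, attrs):
                yield path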
Storage Hierarchies: the Old Way
Normally implemented one of two ways:
- Explicit control: an archive command (IBM TSM, UniTree), copying into a special archive file system (IBM HPSS), or copying to an archive server (HPSS, UniTree), all of which are troublesome and error-prone for the user
- Implicit control through an interface like DMAPI: the file system sends events to the HSM system (create/delete, low space); the archive system moves data and punches holes in files to manage space; an access miss generates an event, and the HSM system transparently brings the file back
[Diagram: HPSS 6.2 API architecture. (1) A client in the client domain issues an HPSS write or put to the HPSS core server over an IP network; (2) the client transfers the file to HPSS disk or tape over a TCP/IP LAN or WAN using an HPSS mover, or via moverless SAN data transfers over an FC SAN; the HPSS cluster runs the core server and movers with DB2 metadata disks, disk arrays, and tape libraries]
[Diagram: GPFS 3.1 and HPSS 6.2 DMAPI architecture. A GPFS cluster (session node, I/O nodes, disk arrays) runs the HPSS interface and HSM processes; HSM control information flows over an IP LAN to the HPSS cluster (core server, movers, DB2, disk arrays, tape libraries), with data transfers over the LAN or moverless SAN and tape-disk transfers within HPSS]
DMAPI Problems
- Namespace events (create, delete, rename) are synchronous and recoverable; each is multiple database transactions, which slows down the file system
- Directory scans: DMAPI low-space events trigger directory scans to determine what to archive, which can take hours or days on a large FS; scans have little information upon which to make archiving decisions (what you get from ls -l), so data movement policies are usually hard-coded and primitive
- Read/write managed regions block the user program while data is brought back from the HSM system
- Parallel data movement isn't in the spec, but everyone implements it anyway; data movement is actually the one thing about DMAPI worth saving
[Diagram: the GPFS 3.1 and HPSS 6.2 DMAPI architecture from the previous slide]
GPFS Approach: External Pools
External pools are really interfaces to external storage managers, e.g. HPSS or TSM.
An external pool rule defines the script to call to migrate/recall/etc. files:
    RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [ OPTS 'Options' ]
The GPFS policy engine builds candidate lists and passes them to the external pool scripts; the external storage manager actually moves the data, using DMAPI managed regions (read/write invisible, punch hole) or conventional POSIX APIs.
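For illustration, here is a sketch of what such an interface script could look like in Python. The invocation convention (command name plus the path of a file list) and the file-list format are assumptions, not the documented GPFS contract, and hsm_archive/hsm_recall are hypothetical HSM command-line tools standing in for real HPSS or TSM calls.

    #!/usr/bin/env python3
    # Sketch of an external-pool interface script. Assumed invocation:
    #   InterfaceScript <command> <filelist> [opts]
    import subprocess, sys

    def main():
        command, filelist = sys.argv[1], sys.argv[2]
        with open(filelist) as f:
            # Assume the last whitespace-separated field of each line is the path.
            paths = [line.rsplit(None, 1)[-1] for line in f if line.strip()]
        if command == "MIGRATE":
            for p in paths:
                subprocess.run(["hsm_archive", p], check=True)  # hypothetical HSM CLI
        elif command == "RECALL":
            for p in paths:
                subprocess.run(["hsm_recall", p], check=True)   # hypothetical HSM CLI
        elif command == "TEST":
            pass  # report readiness by exiting 0
        return 0

    if __name__ == "__main__":
        sys.exit(main())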
GPFS ILM Demonstration
[Diagram: SC'06 demo. GPFS at SC'06 in Tampa, FL (1M active files; FC and SATA disks) connected via a 10 Gb link to an HPSS archive at NERSC in Oakland, CA (tapes with disk buffering); high-bandwidth, parallel data movement across all devices and networks]
Nearline Information: Conceptual View
[Diagram: NFS/CIFS clients, the TSM archive client/API, and admin/search tools sit in front of NFS/CIFS servers on a scale-out archiving engine (GPFS cluster); migration to TSM deep storage goes via the TSM archive client, over DMAPI and the TSM archive API; a global index and search capability spans the archive]
- Provides the capability to handle extended metadata; metadata may be derived from data content
- Extended attributes: integrity code, retention period, retention hold status, and any application metadata
- Global index on content and EA metadata
- Allows for application-specific parsers (e.g., DICOM)
Summary
- Storage environments are moving from petabytes to exabytes, in traditional HPC and in new archive environments
- Significant challenges for reliability, resiliency, and manageability
- Metadata becomes key for information organization and discovery