SciDAC Petascale Data Storage Institute
Advanced Scientific Computing Advisory Committee Meeting, October 29, 2008, Gaithersburg, MD
Garth Gibson, Carnegie Mellon University and Panasas Inc.
SciDAC Petascale Data Storage Institute (PDSI), www.pdsi-scidac.org
w/ LANL (Gary Grider), LBNL (William Kramer), SNL (Lee Ward), ORNL (Phil Roth), PNNL (Evan Felix), UCSC (Darrell Long), U. Mich (Peter Honeyman)
Charting a Storage Path through Peta- to Exa-scale
- Top500.org performance is scaling 100% per year; storage is required to keep up
- This is hard for disks: MB/sec grows only ~20% per year, accesses/sec only ~5% per year
- So the number of disks must increase much faster than the number of processor chips or nodes
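A back-of-the-envelope sketch (illustrative, in Python) of the gap this slide describes: the annual growth rates are the slide's, while the ten-year horizon and the relative starting point are assumptions for illustration.

```python
# If aggregate storage must track compute (+100%/yr) while a single disk's
# bandwidth grows only +20%/yr and its accesses/sec only +5%/yr, the number
# of disks must grow much faster than the machine itself.

compute_growth = 2.00    # Top500 aggregate performance roughly doubles yearly
disk_bw_growth = 1.20    # per-disk MB/s grows ~20% per year
disk_iops_growth = 1.05  # per-disk accesses/sec grows ~5% per year

disks_for_bw = 1.0       # relative disk count needed to keep bandwidth balanced
disks_for_iops = 1.0     # relative disk count needed to keep accesses/sec balanced
for year in range(1, 11):
    disks_for_bw *= compute_growth / disk_bw_growth
    disks_for_iops *= compute_growth / disk_iops_growth
    print(f"year {year:2d}: {disks_for_bw:7.1f}x disks (bandwidth), "
          f"{disks_for_iops:7.1f}x disks (accesses/sec)")
```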
SciDAC Petascale Data Storage Institute: High Performance Storage Expertise & Experience
Carnegie Mellon University, Garth Gibson, lead PI
U. of California, Santa Cruz, Darrell Long
U. of Michigan, Ann Arbor, Peter Honeyman
Lawrence Berkeley National Lab, William Kramer
Oak Ridge National Lab, Phil Roth
Pacific Northwest National Lab, Evan Felix
Los Alamos National Lab, Gary Grider
Sandia National Lab, Lee Ward
SciDAC Petascale Data Storage Institute: Efforts divided into three primary thrusts
Outreach and leadership
- Community building: e.g., PDSW @ SC08, FAST, FSIO
- APIs & standards: e.g., Parallel NFS, POSIX extensions
- SciDAC collaborations: applications, centers, institutes
Data collection and dissemination
- Failure data collection and analysis: e.g., cfdr.usenix.org
- Performance trace collection & benchmark publication
Mechanism innovation
- Scalable metadata, archives, wide-area storage, etc.
- IT automation applied to HEC systems & problems
Outreach: Sponsored Workshops
SC08: PDSW (Petascale Data Storage Workshop), Mon Nov 17, 8am-5pm
Standards: Multi-vendor, Scalable Parallel NFS
- Persistent investment in scalability; share costs with commercial R&D
- Next-generation NFS is parallel (pNFS), responding to the growing role of clusters
- Open source & competitive offerings: NetApp, Sun, IBM, EMC, Panasas, BlueArc, DESY/dCache
- NFSv4 extended with orthogonal layout metadata attributes
[Diagram: client apps go through a pNFS layout driver, which speaks one of three layout types, 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files), directly to storage; the pNFS server, backed by a local filesystem, grants and revokes layout metadata]
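The diagram's data path can be sketched schematically as below. This is not real client code: the toy MetadataServer and its lowercase methods stand in for the NFSv4.1 protocol operations they are named after (LAYOUTGET, GETDEVICEINFO, LAYOUTCOMMIT), and the layout contents and device names are made up for illustration.

```python
class MetadataServer:
    """Toy stand-in for a pNFS metadata server; real clients speak NFSv4.1 RPCs."""
    def layoutget(self, path):
        # The layout says which data servers hold which parts of the file and in
        # what form: 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files).
        return {"type": "objects", "stripe": ["ds0", "ds1", "ds2"]}

    def getdeviceinfo(self, layout):
        return layout["stripe"]                 # how to reach each data server

    def layoutcommit(self, path, layout):
        print(f"commit new size/mtime of {path} at the metadata server")

def pnfs_write(mds, path, chunks):
    layout = mds.layoutget(path)                # layout metadata is granted (and revocable)
    devices = mds.getdeviceinfo(layout)
    for i, chunk in enumerate(chunks):          # the client's layout driver stripes writes
        dev = devices[i % len(devices)]         # directly to storage, in parallel,
        print(f"write chunk {i} ({len(chunk)} bytes) to {dev}")  # bypassing the metadata server
    mds.layoutcommit(path, layout)              # make results visible through the namespace

pnfs_write(MetadataServer(), "/ckpt/step42", [b"aa", b"bb", b"cc", b"dd"])
```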
Dissemination: Fault Data
- Los Alamos root cause logs: 22 clusters & 5,000 nodes, covering 9 years & continuing
- Published at cfdr.usenix.org, plus data from PNNL, NERSC, Sandia, PSC, ...
[Chart: failures per year per processor (failure counts normalized by processor count) for LANL systems ranging from 128 procs / 32 nodes to 6,152 procs / 49 nodes, spanning 2-way to 256-way nodes installed between 1996 and 2004]
Projections: More Failures (con't)
- top500.org: 2X annually; 1 PF Roadrunner, May 2008
- Cycle time is flat, but there are more of them; Moore's law: 2X cores/chip in 18 months
- So # sockets, and with it the failure rate (1/MTTI), rises 25%-50% per year
- Optimistic assumption: 0.1 failures per year per socket (vs. the historic 0.25)
[Charts: projected socket counts and failure rates through 2018 under 18-, 24-, and 30-month doubling scenarios]
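The slide's growth arithmetic and its failure-rate consequence can be reproduced in a few lines; the socket count and per-socket failure rates below are the slide's illustrative figures, not measurements.

```python
# Aggregate FLOPS doubles yearly (Top500) while cores per chip double every
# 18-30 months (Moore's law), so socket count -- and with it the failure rate
# 1/MTTI -- must grow roughly 25%-50% per year.
for core_doubling_months in (18, 24, 30):
    cores_growth_per_year = 2 ** (12 / core_doubling_months)
    socket_growth_per_year = 2.0 / cores_growth_per_year
    print(f"cores double every {core_doubling_months} mo -> "
          f"sockets and failure rate grow {socket_growth_per_year - 1:.0%}/yr")

sockets = 20_000                    # illustrative petascale-era socket count
fails_per_socket_per_year = 0.1     # the slide's optimistic rate (historic ~0.25)
failures_per_year = sockets * fails_per_socket_per_year
mtti_hours = 365 * 24 / failures_per_year
print(f"{sockets} sockets -> {failures_per_year:.0f} failures/yr, MTTI ~ {mtti_hours:.1f} h")
```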
Fault Tolerance Challenges
- Periodic (every p hours) pause to checkpoint (taking t hours): a major need for storage bandwidth
- Balanced systems: storage speed tracks FLOPS and memory, so checkpoint capture time t stays constant
- AppUtilization ≈ 1 - t/p - p/(2*MTTI), maximized when p^2 = 2*t*MTTI
- ... but dropping MTTI kills app utilization!
[Chart: "Everything Must Scale with Compute" balanced-system targets, tracking computing speed (TFLOP/s), memory (TB), disk (PB), parallel I/O (GB/s), metadata inserts/sec, network speed (Gb/s), and archival storage bandwidth (GB/s) growing together through ~2012]
[Chart: projected application utilization through 2018 under 18-, 24-, and 30-month doubling scenarios]
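The utilization and optimal-period formulas on this slide can be exercised directly; the checkpoint capture time and MTTI values below are illustrative assumptions, and the approximation only holds while t is small relative to MTTI.

```python
import math

def app_utilization(t, p, mtti):
    # Fraction of wall-clock time doing useful work: lose t/p to checkpoint
    # capture and, on average, half a period of rework per failure.
    return 1.0 - t / p - p / (2.0 * mtti)

t = 0.5                                   # hours to capture one checkpoint (assumed constant)
for mtti in (24.0, 12.0, 4.0, 1.0):       # MTTI shrinks as socket counts grow
    p_opt = math.sqrt(2.0 * t * mtti)     # optimal period from p^2 = 2*t*MTTI
    print(f"MTTI {mtti:4.1f} h -> checkpoint every {p_opt:4.2f} h, "
          f"utilization {app_utilization(t, p_opt, mtti):5.1%}")
```

With these numbers, utilization falls from roughly 80% at a 24-hour MTTI toward zero as MTTI approaches the checkpoint capture time, which is the slide's point about dropping MTTI.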
Bolster HEC Fault Tolerance
[Charts: projections through 2018 under 18-, 24-, and 30-month doubling scenarios]
Alternative Approaches
[Diagram: the compute cluster does a FAST WRITE of its checkpoint into an intermediate checkpoint memory, which then does a SLOW WRITE to the disk storage devices in the background]
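One way to read the diagram is as a sizing problem: the checkpoint-memory tier must absorb a full checkpoint quickly, while the disk tier only has to drain it before the next checkpoint arrives. The numbers below are illustrative assumptions, not figures from the slide.

```python
mem_tb = 500.0            # application memory image to checkpoint (TB, assumed)
capture_seconds = 300.0   # how fast the FAST WRITE into checkpoint memory must finish
period_seconds = 3600.0   # checkpoint period; the SLOW WRITE to disk must finish within it

fast_write_gb_s = mem_tb * 1000.0 / capture_seconds   # burst bandwidth into checkpoint memory
slow_write_gb_s = mem_tb * 1000.0 / period_seconds    # sustained drain bandwidth to disk
print(f"checkpoint memory tier: {fast_write_gb_s:,.0f} GB/s for {capture_seconds:.0f} s")
print(f"disk tier: {slow_write_gb_s:,.0f} GB/s sustained "
      f"({fast_write_gb_s / slow_write_gb_s:.0f}x less than the burst)")
```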
Storage Suffers Failures Too

System                       Type of drive                   Count    Duration
HPC1 (Supercomputing X)      18GB & 36GB 10K RPM SCSI         3,400   5 yrs
HPC2                         36GB 10K RPM SCSI                  520   2.5 yrs
HPC3                         15K RPM SCSI & 7.2K RPM SATA    14,208   1 yr
HPC4 (Various HPCs)          250GB, 500GB & 400GB SATA       13,634   3 yrs
COM1                         10K RPM SCSI                    26,734   1 month
COM2                         15K RPM SCSI                    39,039   1.5 yrs
COM3 (Internet services Y)   10K RPM FC-AL                    3,700   1 yr
Storage Failure Recovery is On-the-fly
- Scalable performance = more disks, but disks are getting bigger
- Recovery work per failure is increasing: hours to days on disk arrays
- Consider the number of concurrent disk recoveries: e.g., 10,000 disks at a ~3% per year replacement rate, with 1+ day of recovery each
- Constant state of recovering? Maybe soon 100s of concurrent recoveries (at all times!)
- Design the normal case for many failures (a huge challenge!)
[Chart: projected concurrent SATA disk recoveries through 2018, comparing datasheet annual replacement rates (ARR = 0.58%, 0.88%) with the ~3% average seen in field data]
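The arithmetic behind the "constant state of recovering" worry, using the slide's example figures plus a projected future system whose size and rebuild time are assumptions for illustration.

```python
def concurrent_rebuilds(disks, annual_replacement_rate, repair_days):
    # Little's law: average rebuilds in flight = arrival rate * repair time.
    failures_per_day = disks * annual_replacement_rate / 365.0
    return failures_per_day * repair_days

# The slide's example: 10,000 disks, ~3%/yr replacement, 1+ day per rebuild.
today = concurrent_rebuilds(10_000, 0.03, 1.5)
print(f"today: ~{today:.1f} rebuilds in flight on average")

# An assumed late-2010s system: far more disks, multi-day rebuilds of big SATA drives.
future = concurrent_rebuilds(300_000, 0.03, 6.0)
print(f"projected: ~{future:.0f} rebuilds in flight, i.e. 100s of concurrent recoveries")
```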
Parallel Scalable Repair
- Defer the problem by making failed-disk repair a parallel app
- File replication and, more recently, object RAID can scale repair:
  - decluster redundancy groups over all disks (mirror or RAID)
  - use all disks for every repair; faster repair means less time exposed to a second failure
- Object (chunk of a file) storage architecture is dominating at scale: PanFS, Lustre, PVFS, GFS, HDFS, Centera, ...
[Chart: rebuild MB/sec vs. number of shelves for one volume vs. N volumes, with 1GB and 100MB files; rebuild bandwidth grows with the number of shelves]
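The scaling argument behind declustered (object RAID) repair can be seen with a toy model; the per-disk rebuild bandwidth and capacity below are assumptions, and this is not PanFS/Lustre code.

```python
disk_capacity_gb = 750.0
per_disk_rebuild_mb_s = 10.0      # rebuild I/O each surviving disk contributes (assumed)

def rebuild_hours(surviving_disks):
    # Time to reconstruct one failed disk's worth of data when the redundancy
    # groups are declustered so that `surviving_disks` disks share the reads.
    aggregate_mb_s = surviving_disks * per_disk_rebuild_mb_s
    return disk_capacity_gb * 1024.0 / aggregate_mb_s / 3600.0

print(f"one 10-disk RAID group : {rebuild_hours(9):5.2f} h")
print(f"declustered, 100 disks : {rebuild_hours(99):5.2f} h")
print(f"declustered, 1000 disks: {rebuild_hours(999):5.2f} h")
```

Shorter rebuilds shrink the window in which a second failure can cause data loss, which is why faster repair is less vulnerable.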
PDSI Collaborations
- Primary partner: Scientific Data Management (SDM) center: PVFS, metadata, checkpoint, failure diagnosis
- Storage I/O & Performance Engineering (PERI): Ocean (POP), Combustion (S3D), P. Roth
- Storage troubleshooting with trace tools: Climate Change (CCSM), L. Ward
- Vendor partnerships & centers of excellence: pNFS, IBM GPFS, Sun Lustre, Panasas, EMC
- Leadership computing facility partnerships: Roadrunner, Red Storm, Jaguar, Franklin, ...
- Base program cooperation: FASTOS I/O Forwarding
It's All About Data, Scale & Failure
- Continual gathering of data on data storage: failures, distributions, traces, workloads
- Nurturing of file systems to HPC scale and requirements: pNFS standards, benchmarks, testing clusters, academic codes
- Checkpoint specializations: app-compressed state, special devices & representations
- Failure as the normal case? Risking 100s of concurrent disk rebuilds (need faster rebuild); quality of service (performance) during rebuild as a design goal
- Correctness at increasing scale? Testing with virtual machines to simulate larger machines; formal verification of correctness (and performance?) at scale
- HPC vs. cloud storage architecture: common software? Share costs with the new technology wave
Backup
Tools: Understanding IO in Apps
sourceforge.net/projects/libsysio
Dissemination: Parallel Workloads
sourceforge.net/projects/libsysio
Dissemination: Parallel Workloads (con't)
Dissemination: Parallel Workloads (con't)
Dissemination: Statistics
Mechanisms for Scalable Metadata (1)
Mechanisms for Scalable Metadata (2)
- Billion+ files in a directory
- Eliminate serialization: all servers grow the directory independently, in parallel, without any coordinator
- No synchronization or consistency bottlenecks
- Servers keep only a local view, no shared state
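A toy sketch, in the spirit of this slide but not the institute's actual code, of how metadata servers can grow a huge directory with no coordinator and no shared state: every party hashes the filename to pick a server, so inserts are purely local operations. The server count and the stand-in dictionaries are illustrative; real systems also split partitions incrementally as the directory grows.

```python
import hashlib

NUM_SERVERS = 8
stores = [dict() for _ in range(NUM_SERVERS)]   # one local key/value store per metadata server

def server_for(name):
    # Any client or server can compute this locally, so directory inserts never
    # serialize through a central coordinator.
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h % NUM_SERVERS

def create(name, inode):
    stores[server_for(name)][name] = inode      # a purely local insert on one server

def lookup(name):
    return stores[server_for(name)].get(name)

# Many clients could run these creates in parallel against different servers.
for i in range(100_000):
    create(f"file.{i:06d}", inode=i)
print("entries per server:", [len(s) for s in stores])
print("lookup:", lookup("file.012345"))
```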