SciDAC Petascale Data Storage Institute
Advanced Scientific Computing Advisory Committee Meeting, October 29, 2008, Gaithersburg, MD
Garth Gibson, Carnegie Mellon University and Panasas Inc.
SciDAC Petascale Data Storage Institute (PDSI), www.pdsi-scidac.org
w/ LANL (Gary Grider), LBNL (William Kramer), SNL (Lee Ward), ORNL (Phil Roth), PNNL (Evan Felix), UCSC (Darrell Long), U. Mich (Peter Honeyman)
Charting a Storage Path through Peta- to Exa-scale
- Top500.org performance is scaling 100% per year; storage is required to keep up
- This is hard for disks: MB/sec grows only ~20% per year, accesses/sec only ~5% per year
- So the number of disks must increase much faster than the number of processor chips or nodes
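A back-of-the-envelope sketch (illustrative, in Python) of the gap this slide describes: the annual growth rates are the slide's, while the ten-year horizon and the relative starting point are assumptions for illustration.

```python
# If aggregate storage must track compute (+100%/yr) while a single disk's
# bandwidth grows only +20%/yr and its accesses/sec only +5%/yr, the number
# of disks must grow much faster than the machine itself.

compute_growth = 2.00    # Top500 aggregate performance roughly doubles yearly
disk_bw_growth = 1.20    # per-disk MB/s grows ~20% per year
disk_iops_growth = 1.05  # per-disk accesses/sec grows ~5% per year

disks_for_bw = 1.0       # relative disk count needed to keep bandwidth balanced
disks_for_iops = 1.0     # relative disk count needed to keep accesses/sec balanced
for year in range(1, 11):
    disks_for_bw *= compute_growth / disk_bw_growth
    disks_for_iops *= compute_growth / disk_iops_growth
    print(f"year {year:2d}: {disks_for_bw:7.1f}x disks (bandwidth), "
          f"{disks_for_iops:7.1f}x disks (accesses/sec)")
```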
SciDAC Petascale Data Storage Institute: High Performance Storage Expertise & Experience
Carnegie Mellon University, Garth Gibson, lead PI
U. of California, Santa Cruz, Darrell Long
U. of Michigan, Ann Arbor, Peter Honeyman
Lawrence Berkeley National Lab, William Kramer
Oak Ridge National Lab, Phil Roth
Pacific Northwest National Lab, Evan Felix
Los Alamos National Lab, Gary Grider
Sandia National Lab, Lee Ward
SciDAC Petascale Data Storage Institute: Efforts divided into three primary thrusts
Outreach and leadership
- Community building: e.g., PDSW @ SC08, FAST, FSIO
- APIs & standards: e.g., Parallel NFS, POSIX extensions
- SciDAC collaborations: applications, centers, institutes
Data collection and dissemination
- Failure data collection and analysis: e.g., cfdr.usenix.org
- Performance trace collection & benchmark publication
Mechanism innovation
- Scalable metadata, archives, wide-area storage, etc.
- IT automation applied to HEC systems & problems
Outreach: Sponsored Workshops
SC08: PDSW (Petascale Data Storage Workshop), Mon Nov 17, 8am-5pm
Standards: Multi-vendor, Scalable Parallel NFS
- Persistent investment in scalability; share costs with commercial R&D
- Next-generation NFS is parallel (pNFS), responding to the growing role of clusters
- Open source & competitive offerings: NetApp, Sun, IBM, EMC, Panasas, BlueArc, DESY/dCache
- NFSv4 extended with orthogonal layout metadata attributes
[Diagram: client apps go through a pNFS layout driver, which speaks one of three layout types, 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files), directly to storage; the pNFS server, backed by a local filesystem, grants and revokes layout metadata]
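The diagram's data path can be sketched schematically as below. This is not real client code: the toy MetadataServer and its lowercase methods stand in for the NFSv4.1 protocol operations they are named after (LAYOUTGET, GETDEVICEINFO, LAYOUTCOMMIT), and the layout contents and device names are made up for illustration.

```python
class MetadataServer:
    """Toy stand-in for a pNFS metadata server; real clients speak NFSv4.1 RPCs."""
    def layoutget(self, path):
        # The layout says which data servers hold which parts of the file and in
        # what form: 1. SBC (blocks), 2. OSD (objects), or 3. NFS (files).
        return {"type": "objects", "stripe": ["ds0", "ds1", "ds2"]}

    def getdeviceinfo(self, layout):
        return layout["stripe"]                 # how to reach each data server

    def layoutcommit(self, path, layout):
        print(f"commit new size/mtime of {path} at the metadata server")

def pnfs_write(mds, path, chunks):
    layout = mds.layoutget(path)                # layout metadata is granted (and revocable)
    devices = mds.getdeviceinfo(layout)
    for i, chunk in enumerate(chunks):          # the client's layout driver stripes writes
        dev = devices[i % len(devices)]         # directly to storage, in parallel,
        print(f"write chunk {i} ({len(chunk)} bytes) to {dev}")  # bypassing the metadata server
    mds.layoutcommit(path, layout)              # make results visible through the namespace

pnfs_write(MetadataServer(), "/ckpt/step42", [b"aa", b"bb", b"cc", b"dd"])
```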
Dissemination: Fault Data
- Los Alamos root cause logs: 22 clusters & 5,000 nodes, covering 9 years & continuing
- Published at cfdr.usenix.org, plus data from PNNL, NERSC, Sandia, PSC, ...
[Chart: failures per year per processor (failure counts normalized by processor count) for LANL systems ranging from 128 procs / 32 nodes to 6,152 procs / 49 nodes, spanning 2-way to 256-way nodes installed between 1996 and 2004]
Projections: More Failures (con't)
- top500.org: 2X annually; 1 PF Roadrunner, May 2008
- Cycle time is flat, but there are more of them; Moore's law: 2X cores/chip in 18 months
- So # sockets, and with it the failure rate (1/MTTI), rises 25%-50% per year
- Optimistic assumption: 0.1 failures per year per socket (vs. the historic 0.25)
[Charts: projected socket counts and failure rates through 2018 under 18-, 24-, and 30-month doubling scenarios]
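The slide's growth arithmetic and its failure-rate consequence can be reproduced in a few lines; the socket count and per-socket failure rates below are the slide's illustrative figures, not measurements.

```python
# Aggregate FLOPS doubles yearly (Top500) while cores per chip double every
# 18-30 months (Moore's law), so socket count -- and with it the failure rate
# 1/MTTI -- must grow roughly 25%-50% per year.
for core_doubling_months in (18, 24, 30):
    cores_growth_per_year = 2 ** (12 / core_doubling_months)
    socket_growth_per_year = 2.0 / cores_growth_per_year
    print(f"cores double every {core_doubling_months} mo -> "
          f"sockets and failure rate grow {socket_growth_per_year - 1:.0%}/yr")

sockets = 20_000                    # illustrative petascale-era socket count
fails_per_socket_per_year = 0.1     # the slide's optimistic rate (historic ~0.25)
failures_per_year = sockets * fails_per_socket_per_year
mtti_hours = 365 * 24 / failures_per_year
print(f"{sockets} sockets -> {failures_per_year:.0f} failures/yr, MTTI ~ {mtti_hours:.1f} h")
```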
Fault Tolerance Challenges
- Periodic (every p hours) pause to checkpoint (taking t hours): a major need for storage bandwidth
- Balanced systems: storage speed tracks FLOPS and memory, so checkpoint capture time t stays constant
- AppUtilization ≈ 1 - t/p - p/(2*MTTI), maximized when p^2 = 2*t*MTTI
- ... but dropping MTTI kills app utilization!
[Chart: "Everything Must Scale with Compute" balanced-system targets, tracking computing speed (TFLOP/s), memory (TB), disk (PB), parallel I/O (GB/s), metadata inserts/sec, network speed (Gb/s), and archival storage bandwidth (GB/s) growing together through ~2012]
[Chart: projected application utilization through 2018 under 18-, 24-, and 30-month doubling scenarios]
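The utilization and optimal-period formulas on this slide can be exercised directly; the checkpoint capture time and MTTI values below are illustrative assumptions, and the approximation only holds while t is small relative to MTTI.

```python
import math

def app_utilization(t, p, mtti):
    # Fraction of wall-clock time doing useful work: lose t/p to checkpoint
    # capture and, on average, half a period of rework per failure.
    return 1.0 - t / p - p / (2.0 * mtti)

t = 0.5                                   # hours to capture one checkpoint (assumed constant)
for mtti in (24.0, 12.0, 4.0, 1.0):       # MTTI shrinks as socket counts grow
    p_opt = math.sqrt(2.0 * t * mtti)     # optimal period from p^2 = 2*t*MTTI
    print(f"MTTI {mtti:4.1f} h -> checkpoint every {p_opt:4.2f} h, "
          f"utilization {app_utilization(t, p_opt, mtti):5.1%}")
```

With these numbers, utilization falls from roughly 80% at a 24-hour MTTI toward zero as MTTI approaches the checkpoint capture time, which is the slide's point about dropping MTTI.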
Bolster HEC Fault Tolerance
[Charts: projections through 2018 under 18-, 24-, and 30-month doubling scenarios]
Alternative Approaches
[Diagram: the compute cluster does a FAST WRITE of its checkpoint into an intermediate checkpoint memory, which then does a SLOW WRITE to the disk storage devices in the background]
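One way to read the diagram is as a sizing problem: the checkpoint-memory tier must absorb a full checkpoint quickly, while the disk tier only has to drain it before the next checkpoint arrives. The numbers below are illustrative assumptions, not figures from the slide.

```python
mem_tb = 500.0            # application memory image to checkpoint (TB, assumed)
capture_seconds = 300.0   # how fast the FAST WRITE into checkpoint memory must finish
period_seconds = 3600.0   # checkpoint period; the SLOW WRITE to disk must finish within it

fast_write_gb_s = mem_tb * 1000.0 / capture_seconds   # burst bandwidth into checkpoint memory
slow_write_gb_s = mem_tb * 1000.0 / period_seconds    # sustained drain bandwidth to disk
print(f"checkpoint memory tier: {fast_write_gb_s:,.0f} GB/s for {capture_seconds:.0f} s")
print(f"disk tier: {slow_write_gb_s:,.0f} GB/s sustained "
      f"({fast_write_gb_s / slow_write_gb_s:.0f}x less than the burst)")
```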
Storage Suffers Failures Too

System                       Type of drive                   Count    Duration
HPC1 (Supercomputing X)      18GB & 36GB 10K RPM SCSI         3,400   5 yrs
HPC2                         36GB 10K RPM SCSI                  520   2.5 yrs
HPC3                         15K RPM SCSI & 7.2K RPM SATA    14,208   1 yr
HPC4 (Various HPCs)          250GB, 500GB & 400GB SATA       13,634   3 yrs
COM1                         10K RPM SCSI                    26,734   1 month
COM2                         15K RPM SCSI                    39,039   1.5 yrs
COM3 (Internet services Y)   10K RPM FC-AL                    3,700   1 yr
Storage Failure Recovery is On-the-fly
- Scalable performance = more disks, but disks are getting bigger
- Recovery work per failure is increasing: hours to days on disk arrays
- Consider the number of concurrent disk recoveries: e.g., 10,000 disks at a ~3% per year replacement rate, with 1+ day of recovery each
- Constant state of recovering? Maybe soon 100s of concurrent recoveries (at all times!)
- Design the normal case for many failures (a huge challenge!)
[Chart: projected concurrent SATA disk recoveries through 2018, comparing datasheet annual replacement rates (ARR = 0.58%, 0.88%) with the ~3% average seen in field data]
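The arithmetic behind the "constant state of recovering" worry, using the slide's example figures plus a projected future system whose size and rebuild time are assumptions for illustration.

```python
def concurrent_rebuilds(disks, annual_replacement_rate, repair_days):
    # Little's law: average rebuilds in flight = arrival rate * repair time.
    failures_per_day = disks * annual_replacement_rate / 365.0
    return failures_per_day * repair_days

# The slide's example: 10,000 disks, ~3%/yr replacement, 1+ day per rebuild.
today = concurrent_rebuilds(10_000, 0.03, 1.5)
print(f"today: ~{today:.1f} rebuilds in flight on average")

# An assumed late-2010s system: far more disks, multi-day rebuilds of big SATA drives.
future = concurrent_rebuilds(300_000, 0.03, 6.0)
print(f"projected: ~{future:.0f} rebuilds in flight, i.e. 100s of concurrent recoveries")
```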
Parallel Scalable Repair
- Defer the problem by making failed-disk repair a parallel app
- File replication and, more recently, object RAID can scale repair:
  - decluster redundancy groups over all disks (mirror or RAID)
  - use all disks for every repair; faster repair means less time exposed to a second failure
- Object (chunk of a file) storage architecture is dominating at scale: PanFS, Lustre, PVFS, GFS, HDFS, Centera, ...
[Chart: rebuild MB/sec vs. number of shelves for one volume vs. N volumes, with 1GB and 100MB files; rebuild bandwidth grows with the number of shelves]
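The scaling argument behind declustered (object RAID) repair can be seen with a toy model; the per-disk rebuild bandwidth and capacity below are assumptions, and this is not PanFS/Lustre code.

```python
disk_capacity_gb = 750.0
per_disk_rebuild_mb_s = 10.0      # rebuild I/O each surviving disk contributes (assumed)

def rebuild_hours(surviving_disks):
    # Time to reconstruct one failed disk's worth of data when the redundancy
    # groups are declustered so that `surviving_disks` disks share the reads.
    aggregate_mb_s = surviving_disks * per_disk_rebuild_mb_s
    return disk_capacity_gb * 1024.0 / aggregate_mb_s / 3600.0

print(f"one 10-disk RAID group : {rebuild_hours(9):5.2f} h")
print(f"declustered, 100 disks : {rebuild_hours(99):5.2f} h")
print(f"declustered, 1000 disks: {rebuild_hours(999):5.2f} h")
```

Shorter rebuilds shrink the window in which a second failure can cause data loss, which is why faster repair is less vulnerable.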
PDSI Collaborations
- Primary partner: Scientific Data Management (SDM) center: PVFS, metadata, checkpoint, failure diagnosis
- Storage I/O & Performance Engineering (PERI): Ocean (POP), Combustion (S3D), P. Roth
- Storage troubleshooting with trace tools: Climate Change (CCSM), L. Ward
- Vendor partnerships & centers of excellence: pNFS, IBM GPFS, Sun Lustre, Panasas, EMC
- Leadership computing facility partnerships: Roadrunner, Red Storm, Jaguar, Franklin, ...
- Base program cooperation: FASTOS I/O Forwarding
It's All About Data, Scale & Failure
- Continual gathering of data on data storage: failures, distributions, traces, workloads
- Nurturing of file systems to HPC scale and requirements: pNFS standards, benchmarks, testing clusters, academic codes
- Checkpoint specializations: app-compressed state, special devices & representations
- Failure as the normal case? Risking 100s of concurrent disk rebuilds (need faster rebuild); quality of service (performance) during rebuild as a design goal
- Correctness at increasing scale? Testing with virtual machines to simulate larger machines; formal verification of correctness (and performance?) at scale
- HPC vs. cloud storage architecture: common software? Share costs with the new technology wave
Backup
Tools: Understanding IO in Apps
sourceforge.net/projects/libsysio
Dissemination: Parallel Workloads
sourceforge.net/projects/libsysio
Dissemination: Parallel Workloads (con't)
Dissemination: Parallel Workloads (con't)
Dissemination: Statistics
Mechanisms for Scalable Metadata (1)
Mechanisms for Scalable Metadata (2)
- Billion+ files in a directory
- Eliminate serialization: all servers grow the directory independently, in parallel, without any coordinator
- No synchronization or consistency bottlenecks
- Servers keep only a local view, no shared state
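A toy sketch, in the spirit of this slide but not the institute's actual code, of how metadata servers can grow a huge directory with no coordinator and no shared state: every party hashes the filename to pick a server, so inserts are purely local operations. The server count and the stand-in dictionaries are illustrative; real systems also split partitions incrementally as the directory grows.

```python
import hashlib

NUM_SERVERS = 8
stores = [dict() for _ in range(NUM_SERVERS)]   # one local key/value store per metadata server

def server_for(name):
    # Any client or server can compute this locally, so directory inserts never
    # serialize through a central coordinator.
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h % NUM_SERVERS

def create(name, inode):
    stores[server_for(name)][name] = inode      # a purely local insert on one server

def lookup(name):
    return stores[server_for(name)].get(name)

# Many clients could run these creates in parallel against different servers.
for i in range(100_000):
    create(f"file.{i:06d}", inode=i)
print("entries per server:", [len(s) for s in stores])
print("lookup:", lookup("file.012345"))
```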