Parallel File Systems for Linux and Solaris. Roland Rambau, Principal Engineer GSE, Sun Microsystems GmbH
Agenda: brief introduction, QFS, Lustre, pNFS ( Sorry... ) Roland.Rambau@Sun.Com
Some Critical Qualitative Trends
[Qualitative graph: bandwidth over the years for the Human Visual System, Video Workstation, Computer Networks, Simulated Data, and Acquired Data, with a "WE ARE HERE" marker]
Now is the time to start sending images rather than data; soon it will be time to forget about just moving all the data.
This is only a qualitative graph.
InfiniBand DDR 12x is about 3.5 PB/week per direction; GbE is about 3 PB/year.
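A rough sanity check of those two numbers, assuming about 6 GB/s of effective payload bandwidth per direction for a 12x DDR InfiniBand link and about 100 MB/s effective for Gigabit Ethernet (assumed values, not figures from the slide):

    $6\,\mathrm{GB/s} \times 604{,}800\,\mathrm{s/week} \approx 3.6\,\mathrm{PB/week}$
    $0.1\,\mathrm{GB/s} \times 3.15\times10^{7}\,\mathrm{s/year} \approx 3.2\,\mathrm{PB/year}$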
High Performance Computing Is Data Centric
[Diagram: petabytes of data, f(t) PB, at the center, surrounded by Visualize, Secure, Search, Share, Transact, InterOperate, Compute, Submit, Access]
Sun HPC Storage Solutions
[Diagram: a compute engine data cache (computer cluster, IB network, load) feeding a scalable storage cluster with data movers over LAN and SAN; Tier 1 archive and home directories; automated migration and archive to a Tier 2 near-line archive and a Tier 2 fixed-content archive for long-term retention and archive]
Basic distributed file system diagram: shared file service; many clients (100s-100,000s)
Survey of Distributed File Systems
There are 3 kinds of file systems for clusters:
Proxy file systems
> Name services, space management, and data services brokered through a single server.
> This is therefore not a parallel file system.
SAN file systems
> Name services and space management on the metadata server; direct parallel access to the data devices.
Object storage file systems
> Name services on the metadata server(s); space management on the storage servers; direct parallel access to the storage servers.
PROXY File Systems
Name services, space management, and data services brokered through a single server.
For example: NFS, CIFS, Sun PxFS, etc.
[Diagram: many clients (100s-100,000s)]
Basic NFS numbers
Some people claim NFS tops out at ~40 MB/s (on GbE); that used to be true a very long time ago.
We already saw 235 MB/s years ago, from a single file, untuned (on 10GbE).
Today we see 980 MB/s in the lab using a new NFS-over-RDMA implementation.
Basic parallel file system diagram: multiple storage devices (10s-100s); metadata service; parallel file data transfers; many clients (100s-100,000s)
SAN File Systems
Name services and space management on the metadata server; clients directly access the data devices.
For example:
> Sun QFS Shared File System
> IBM GPFS and IBM SANergy
> ADIC StorNext
> SGI CxFS
> Redhat GFS, Polyserve, IBRIX, etc.
SAN file systems with HSM:
> SAM-FS on Sun Shared QFS
> StorNext HSM on ADIC
> DMF on SGI CxFS
Sun StorageTek QFS Shared File System
QFS is a high-performance shared file system. Its metadata can be stored on devices separate from the data devices, for greater performance and easier recovery, or combined on a single device, depending on the needs of the application.
The file system may be integrated with the SAM software to provide storage archiving services in the shared file system environment.
It supports Solaris and Linux clients; the metadata server always runs on Solaris.
SAN File Systems Metadata over LAN
Sun StorageTek QFS SAN File Sharing Innovation
Data Consolidation
> SAN file sharing
> Name services and space management on the metadata server
> Direct access to the data devices
Performance & Scalability
> Tune the file system to the application
> Predictable performance
> File system performance scales linearly with the hardware
Parallel Processing
> Multi-node read/write access
> 128 nodes supported
> 256 nodes with IB support
SAM-QFS 4.6 New Features
> Simplified Install & Setup
> Resource Mgt. Info GUI
> Data Integrity Verification
> Archive Copy Retention
> HA-SAM
> Snaplock compatibility for ECMs
> Directory Lookup Performance Phase 1
> Honeycomb Integration as a disk archive target
> Sun Cluster support: Shared QFS nodes outside the cluster
> Linux Updates: SLES 10 support
Basic SAN file system diagram: multiple block storage devices (10s-100s); metadata service doing block allocation; parallel block transfers; many clients (100s-100,000s)
Basic object file system diagram: multiple object storage servers (10s-100s); metadata service doing object allocation; parallel object transfers; many clients (100s-100,000s)
Object Storage File Systems
Name services on the metadata server(s); space management on object storage devices (OSDs); direct access to the OSDs.
Currently available object storage file systems:
> Lustre, from Sun (Cluster File Systems, Inc.)
> Panasas PanFS
> IBM SanFS
No HSM support on the object storage file systems.
IBM and Panasas adhere to the T10/OSD standard; T10/OSD is a SCSI protocol. Lustre uses a proprietary protocol.
Lustre Cluster File System: World's Largest Network-Neutral Data Storage and Retrieval System
The world's most scalable parallel file system: 10,000s of clients
Proven technology at major HPC installations:
> Tokyo Tech, TACC (Sun), LANL, LLNL, Sandia, PNNL, NCSA, etc.
50% of the Top30 run Lustre; 15% of the Top500 run Lustre.
The First Sun Constellation System Implementation
The world's largest general-purpose compute cluster based on the Sun Constellation System, more than 500 Tflops:
> 82 Sun ultra-dense blade platforms
> 2 Sun ultra-dense switches
> 72 Sun X4500 storage servers
> X4600 frontend and service nodes
Sun is the sole HW supplier. Based on Opteron Barcelona; started operation February 4th, 2008.
What is Lustre?
Lustre is a storage architecture for clusters
> Open source software for Linux, licensed under the GNU GPL
> Standard POSIX-compliant UNIX file system interface
> Complete software solution, runs on commodity hardware
Key characteristics
> Unparalleled scalability
> Production-quality stability and failover
> Object-based architecture
> Open (multi-vendor and multi-platform) roadmap
Understanding Lustre
Clients: 1-100,000; a good target is a couple of 10,000s
Object Storage Servers: 1-1,000; in reality up to 400-500
Metadata servers: 1-100; in reality only a couple
Operation (see the sketch below):
> The client goes to the metadata server and gets a handle describing where the data is
> A file can be striped over many object servers
> The client does the I/O
> There are no locks and no additional info on the metadata servers!
> Max file size: 1.2 PB
> Max file system size: 32 PB
> Max number of files: 2B
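Not from the slides: a minimal client-side sketch assuming the liblustreapi C interface (llapi_file_create) and an illustrative mount point and stripe parameters. It shows the point the slide makes: striping is decided once when the file is created, and all data I/O afterwards is plain POSIX I/O that goes straight to the object servers.

    /* Sketch: create a file striped across 8 OSTs, then write to it with
     * ordinary POSIX calls. The MDS hands out the layout at create/open time;
     * all subsequent data I/O goes directly to the OSSes in parallel. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <lustre/lustreapi.h>   /* llapi_file_create() */

    int main(void)
    {
        const char *path = "/mnt/lustre/demo_file";   /* illustrative path */

        /* 1 MB stripe size, default starting OST (-1), 8 stripes, default RAID-0 pattern (0) */
        int rc = llapi_file_create(path, 1 << 20, -1, 8, 0);
        if (rc != 0) {
            fprintf(stderr, "llapi_file_create failed: %s\n", strerror(-rc));
            return 1;
        }

        /* Regular POSIX I/O from here on; the client writes straight to the OSTs. */
        int fd = open(path, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char msg[] = "hello, striped world\n";
        if (write(fd, msg, sizeof(msg) - 1) < 0)
            perror("write");
        close(fd);
        return 0;
    }

The same layout can also be requested from the shell with lfs setstripe; either way, once the client holds the layout the metadata server is out of the data path.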
TACC users already reported:
"Our dataset comprises 3 single real variables per grid point at 4096³, [which] is 768 GB. Our code took about 22 secs to write the files (each processor writes a file), which means a parallel performance of ~35 GB/sec."
> (The TACC acceptance test measured 45 GB/s on a benchmark)
Red Storm test: 160-wide stripe (not recommended; rather use odd, or even prime, stripe counts!), 10,000 clients, 40 GB/sec I/O, nice and constant performance for a single-file activity
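The quoted figures are self-consistent, taking "single real" to mean 4-byte single-precision floats (an interpretation, not stated on the slide):

    $3 \times 4\,\mathrm{B} \times 4096^{3} = 768\,\mathrm{GiB}$, and $768\,\mathrm{GiB} / 22\,\mathrm{s} \approx 35\,\mathrm{GB/s}$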
Lustre today
#Clients (world record): 25,000 (Red Storm); 200,000 processes (BlueGene/L); can have Lustre root file systems
#Servers (world record): metadata servers: 1 + failover; OSS servers: up to 450, OSTs up to 4,000
Capacity: number of files: 2 billion; file system size: 32 PB; max file size: 1.2 PB
Performance (world record): single client or server: 2 GB/s+; BlueGene/L first week: 74M files, 175 TB written; aggregate I/O (one FS): ~130 GB/s (PNNL); pure MD operations: ~15,000 ops/second
Stability: software reliability on par with hardware reliability; increased failover resiliency
Networks: native support for many different networks, with routing
Features: quota, failover, POSIX, POSIX ACL, secure ports
A Lustre Cluster with everything in it
[Diagram] Pool of clustered Metadata Servers (MDS), 1-100: MDS 1 (active) and MDS 2 (standby) with MDS disk storage containing the Metadata Targets (MDT); shared storage enables failover.
Object Storage Servers (OSS), 1-1000s: OSS 1-7 with OSS storage holding Object Storage Targets (OSTs), on commodity storage or enterprise-class storage arrays and SAN fabric.
Lustre clients, 1-100,000: simultaneous support of multiple network types (Elan, Myrinet, InfiniBand, GigE), connected via routers.
Recent Improvements
> Lustre clients require no Linux kernel patches (1.6)
> Dramatically simpler configuration (1.6)
> Online server addition (1.6)
> Space management (1.8)
> Metadata performance improvements (1.4.7 & 1.6)
> Recovery improvements (1.6)
> Snapshots & backup solutions (1.6)
> CISCO, OpenFabrics IB (up to 1.5 GB/sec!) (1.4.7)
> Much improved statistics for analysis (1.6)
> Backup tools (1.6.1)
Linux
> Large ext4 partition support (1.4.7)
> Very powerful new ext4 disk allocator (1.6.1)
> Dramatic Linux software RAID5 performance improvements
Other
> pCIFS client in beta today
Lustre's Intergalactic Strategy
[Roadmap chart] Releases: Lustre v1.4; v1.6 (Q1 2007); v1.8 (Q2 2008); v2.0 (Q4 2008); v3.0 (2009)
Features called out along the roadmap: simple configuration, patchless client, run with Linux RAID, online server addition, clustered MDS (5-10X the MD performance of 1.4), pools, ZFS OSS & MDS, Solaris, Kerberos, Windows pCIFS, WB caches, small files, snapshots, backups, migration, HSM, proxy servers, network RAID, disconnected operation, enterprise data management
HPC scalability targets: 1 PFlop systems, 1 trillion files, 1M file creates/sec, 30 GB/s mixed files, 1 TB/s, eventually 10 TB/sec
Sun Lustre Solutions: Sun Customer Ready Scalable Storage Cluster
Small: 48 Terabytes*, 1.6 GB/sec
Large: 144 Terabytes*, 4.8 GB/sec
Expansion: 192 Terabytes*, 6.4 GB/sec
* with 500 GB drives
Conclusion
Lustre is almost 9 years old and the #1 file system at scale.
The originally elusive numbers are in the bag:
> >100 GB/sec
> >10,000 clients
Cluster growth was seriously underestimated:
> Lustre will scale to 10 TB/sec and 1,000,000 clients
What is pNFS?
Parallel extensions to NFS to improve bandwidth
> Designed by the HPC community
> Adopted by the IETF NFS WG as part of NFSv4.1
> Draft standard expected to reach consensus in 2008 (presently draft 21); final content due next week, complete approvals later this year
Major features
> Parallel transfer from multiple servers to a single client
> Global name space
> Horizontal scaling of data storage
> Exists within the mental model of the existing NFS community
> Three types of implementation: files, objects, blocks; all offer identical file semantics to the client
Basic pNFS diagram: multiple data servers (10s-100s); NFS metadata service providing a layout to pNFS clients, or traditional NFS service to older clients; parallel file, object, or block transfers; many pNFS clients (100s-100,000s)
Architecture
Classical cluster file system architecture
> Metadata is handled by a centralized server (MDS)
> User data is distributed to many data servers (DS)
Clients open files by asking the MDS
> The MDS verifies permissions and sends back a layout
> The layout describes how the file data is organized and where
> The standard supports RAID-0 and replicated data
Striping is done at the file level; files can have custom organizations, even in the same directory (see the sketch below):
> /pnfs/file1 might be 5-way striped on s1, s99, s4, s12, s04
> /pnfs/file2 might be a singleton file on s4
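A minimal sketch of the RAID-0 arithmetic such a layout implies on the client. This is not the NFSv4.1 wire format; the struct, field names, and stripe parameters are made up for illustration.

    /* Given a byte offset in a file, compute which data server holds it and
     * at what offset within that server's portion of the file (RAID-0 layout). */
    #include <stdint.h>
    #include <stdio.h>

    struct layout {
        uint32_t stripe_unit;     /* bytes per stripe unit, e.g. 1 MB       */
        uint32_t stripe_count;    /* number of data servers in the stripe   */
        const char **servers;     /* data server names, servers[0..count-1] */
    };

    static void locate(const struct layout *l, uint64_t offset,
                       const char **server, uint64_t *server_offset)
    {
        uint64_t unit   = offset / l->stripe_unit;   /* which stripe unit   */
        uint64_t stripe = unit / l->stripe_count;    /* which full stripe   */
        *server        = l->servers[unit % l->stripe_count];
        *server_offset = stripe * l->stripe_unit + offset % l->stripe_unit;
    }

    int main(void)
    {
        /* /pnfs/file1: 5-way striped, as in the example above */
        const char *ds[] = { "s1", "s99", "s4", "s12", "s04" };
        struct layout file1 = { 1 << 20, 5, ds };

        const char *srv; uint64_t off;
        locate(&file1, 7ULL << 20, &srv, &off);      /* byte 7 MB of the file */
        printf("offset 7 MB -> server %s, offset %llu\n",
               srv, (unsigned long long)off);        /* -> s4, offset 1 MB */
        return 0;
    }

With replicated data instead of RAID-0, the layout would simply list every data server as holding the full byte range; the client-side lookup degenerates to picking one of them.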
Orthogonal Features
Exists within the NFSv4.0 context
> Mount, showmount, sharemgr, GSS security, Kerberos, TX are all the same as for traditional NFS
> The data transfer protocol is the same, too
NFS-over-RDMA
> Remote DMA allows client/server communication without the XDR/TCP/IP protocol overhead
> Applies to both non-parallel NFS and pNFS
> New Solaris implementation underway in onnv, Q4 CY07; fully standards-compliant (ignore the old version in Solaris 10)
> Fast and efficient: 980 MB/sec vs. 235 MB/sec (10G Ethernet); the 980 MB/sec was hardware-limited on an x2100m1
pNFS Implementation
pNFS is optional for NFSv4.1 clients
> The MDS must be prepared to proxy data from the DS to the client if the client cannot handle layouts
> Not expected to be common; however, this required capability may prove useful
Data on all DSes shares a global name space
> Could be hundreds or thousands of data servers
pNFS implementation status
> Code is in an advanced prototype stage, both Linux and Solaris
> Source released to OpenSolaris; estimated putback late '08
> Server code is also already in Lustre
Sun HPC Open Software Stack (with Sun CRS, Support, Architectural, and Professional Services)
> Developer Tools: Sun Studio 12
> Distributed Applications: Sun HPC ClusterTools
> Workload Management: Sun Grid Engine Software
> Cluster Management: Sun Connection, ROCKS, Ganglia
> Distributed IO / File System: Sun Lustre, QFS, NFS, pNFS et al.
> Visualization: Sun Visualization System
> Nodes, Processors and Kernels: open, free
> Interconnect: Gigabit Ethernet, Myrinet, InfiniBand, and Sun's 3456-port non-blocking IB switch
Thank you very much! Roland Rambau roland.rambau@sun.com