The Lustre File System. Eric Barton Lead Engineer, Lustre Group Sun Microsystems




Agenda Lustre Today: What is Lustre, Deployments, Community. Topics: Lustre Development, Industry Trends, Scalability Improvements.

Lustre File System The world's fastest, most scalable file system: a parallel, shared POSIX file system. Scalable: high performance, petabytes of storage, tens of thousands of clients. Coherent: single namespace, strict concurrency control. Heterogeneous networking. High availability. GPL open source, multi-platform, multi-vendor.

Lustre File System Major components (diagram): many clients; the MGS (configuration); MDSs (namespace); and object storage servers (data).

Lustre Networking Simple: message queues; RDMA; active (get/put) and passive (attach); asynchronous events; error handling via unlink. Layered: LNET / LND, multiple networks, routers. RPC: queued requests, RDMA bulk, RDMA reply. Recovery: resend, replay. A toy sketch of the messaging model follows.
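
To make the active/passive, get/put and asynchronous-event model concrete, here is a purely illustrative sketch; the class and method names are invented for this example and are not the real LNET API.

```python
# Illustrative only: a toy message queue with async completion events and
# unlink, loosely mirroring the active (put) vs. passive (attach) split and
# the event-driven error handling described on the slide.
from collections import deque

class ToyNet:
    def __init__(self):
        self.queues = {}          # portal id -> deque of attached buffers

    def attach(self, portal, buf, on_event):
        """Passive side: post a receive buffer and a completion callback."""
        self.queues.setdefault(portal, deque()).append((buf, on_event))

    def put(self, portal, data, on_event):
        """Active side: deliver data into the next attached buffer."""
        q = self.queues.get(portal)
        if not q:
            on_event({"type": "error", "reason": "no buffer attached"})
            return
        buf, receiver_cb = q.popleft()
        n = min(len(buf), len(data))
        buf[:n] = data[:n]
        receiver_cb({"type": "put", "nob": n})      # passive-side event
        on_event({"type": "send_done", "nob": n})   # active-side event

    def unlink(self, portal):
        """Error handling: withdraw any still-attached buffers."""
        for buf, cb in self.queues.pop(portal, deque()):
            cb({"type": "unlinked"})

net = ToyNet()
inbuf = bytearray(16)
net.attach(7, inbuf, lambda ev: print("server event:", ev))
net.put(7, b"hello lustre", lambda ev: print("client event:", ev))
net.unlink(7)  # nothing left attached; no events fire
```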

A Lustre Cluster (diagram): metadata servers (MDS 1 active, MDS 2 standby) and object storage servers (OSS 1-7) with shared storage enabling failover; Lustre clients (10s to 10,000s) reach them over multiple networks (TCP/IP, QsNet, Myrinet, InfiniBand, iWARP, Cray SeaStar) through LNET routers; back-end storage ranges from commodity storage servers to enterprise-class storage arrays and SAN fabrics.

Lustre Today Lustre is the leading HPC file system: 7 of the Top 10 and over 40% of the Top 100 systems. Demonstrated scalability and performance: 190 GB/sec I/O, 26,000 clients, many systems with over 1,000 nodes.

Livermore Blue Gene/L SCF: 3.5 PB storage, 52 GB/s I/O throughput, 131,072 processor cores. TACC Ranger: 1.73 PB storage, 40 GB/s I/O throughput, 62,976 processor cores. Sandia Red Storm: 340 TB storage, 50 GB/s I/O throughput, 12,960 multi-core compute sockets. ORNL Jaguar: 10.5 PB storage, 240 GB/s I/O throughput goal, 265,708 processor cores.

Center-wide File System Spider will provide a shared, parallel file system for all systems, based on the Lustre file system. Demonstrated bandwidth of over 190 GB/s; over 10 PB of RAID-6 capacity; 13,440 1 TB SATA drives; 192 storage servers; 3 terabytes of memory. Available from all systems via our high-performance, scalable I/O network: over 3,000 InfiniBand ports, over 3 miles of cables, and it scales as storage grows. Undergoing system checkout, with deployment expected in summer 2009.

Future LCF Infrastructure (diagram): the Spider file system on the SION I/O network (192x / 48x / 192x links) connecting the XT5 and XT4 systems, login nodes, the Everest powerwall, the remote visualization, end-to-end, and application development clusters, and a 25 PB data archive.

Lustre Success - Media Customer challenges: eliminate data storage bottlenecks resulting from scalability issues NFS can't handle; increase system performance and reliability. Lustre value: doubled data storage at a third of the cost of competing solutions; the ability to provide a single file system namespace to its production artists; easy-to-install open source software with great flexibility on storage and server hardware. "While we were working on The Golden Compass, we faced the most intensive I/O requirements of any project to date. Lustre played a vital role in helping us to deliver this project." - Daire Byrne, senior systems integrator, Framestore

Lustre Success - Telecommunications Customer challenges: provide scalable service; ensure continuous availability; control costs. NBC broadcast the 2008 Summer Olympics live online over the Level 3 network using Lustre. Lustre value: the ability to scale easily; works well with commodity equipment from multiple vendors; high performance and stability. "With Lustre, we can achieve that balancing act of maintaining a reliable network with less-costly equipment. It allows us to replace servers and expand the network quickly and easily." - Kenneth Brookman, Level 3 Communications

Lustre Success - Energy Customer challenges: process huge and growing volumes of data; keep hardware costs manageable; scale the existing cluster easily. Lustre value: ability to handle exponential growth in data; capability to scale computer clusters easily; reduced hardware costs; reduced maintenance costs. More success stories follow.

Open Source Community Lustre OEM Partners

Open Source Community Resources Web: http://www.lustre.org for news and information, the Operations Manual, and detailed technical documentation. Mailing lists: lustre-discuss@lists.lustre.org for general/operational issues; lustre-devel@lists.lustre.org for architecture and features. Bugzilla: https://bugzilla.lustre.org for defect tracking and the patch database. Also: CVS repository and Lustre Internals training material.

HPC Trends Processor performance and RAM are growing faster than I/O, so the relative number of I/O devices must grow to compensate. Storage component reliability is not increasing with capacity: failure is not an option, it's guaranteed. Trend toward shared file systems: multiple compute clusters and direct access from specialized systems make storage scalability critical.

DARPA HPCS Capacity: 1 trillion files per file system; 10 billion files per directory; 100 PB system capacity; 1 PB single file size; >30k client nodes; 100,000 open files. Reliability: end-to-end data integrity; no performance impact during RAID rebuild. Performance: 40,000 file creates/sec from a single client node; 30 GB/sec streaming data from a single client node; 240 GB/sec aggregate I/O, both file-per-process and shared-file.

Lustre and the Future Continued focus on extreme HPC. Capacity: exabytes of storage; trillions of files; many client clusters, each with 100,000s of clients. Performance: TBs/sec of aggregate I/O; 100,000s of aggregate metadata ops/sec. Community-driven tools and interfaces for management and performance analysis.

HPC Center of the Future (diagram): a capability system (500,000 nodes), capacity systems 1-3 (250,000 / 150,000 / 50,000 nodes), a test system (25,000 nodes), two visualization clusters, and WAN access, all sharing a storage network that carries 10 TB/sec of user data and metadata spread over 1000 MDTs on 25 MDSs, backed by a Lustre storage cluster and an HPSS archive.

Lustre Scalability Definition: performance and capacity grow nearly linearly with hardware, and component failure does not have a disproportionate impact on availability. Requirements: scalable I/O and metadata performance; expanded component size/count limits; increased robustness to component failure; overhead that grows sub-linearly with system size; timely failure detection and recovery.

Lustre Scaling

Architectural Improvements Clustered Metadata (CMD): 10s to 100s of metadata servers. Distributed inodes: files are local to their parent directory entry; subdirectories may be non-local. Distributed directories: hashing and striping (sketched below). Distributed operation resilience/recovery: cross-directory rename is an uncommon HPC workload; short term, sequenced cross-MDS operations; longer term, transactional (ACID), non-blocking with deeper pipelines, which is hard (cascading aborts, synchronous operations).
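
A minimal sketch of the hashed-placement idea: directory entries are spread over metadata targets by hashing the entry name, so a single huge directory no longer lives on one MDS. The hash function and MDT count below are assumptions for illustration, not Lustre's actual layout policy.

```python
# Illustrative sketch: distribute directory entries over several MDTs by
# hashing the entry name. Hash choice and MDT count are assumptions.
import zlib

N_MDT = 4  # hypothetical number of metadata targets

def mdt_for_entry(parent_dir: str, name: str) -> int:
    """Pick the MDT that stores the entry (parent_dir, name)."""
    key = f"{parent_dir}/{name}".encode()
    return zlib.crc32(key) % N_MDT

entries = [f"frame{i:04d}.exr" for i in range(10000)]
placement = {}
for name in entries:
    placement.setdefault(mdt_for_entry("/scratch/render", name), []).append(name)

for mdt, names in sorted(placement.items()):
    print(f"MDT{mdt}: {len(names)} entries")   # roughly 2500 per MDT
```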

Epochs (diagram): clients and servers track local oldest volatile epochs for their updates and operations; a reduction network computes the current globally known oldest volatile epoch; epochs range from oldest (stable, committed) to newest (unstable, uncommitted), with uncommitted updates kept for redo.
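
The diagram's labels suggest the following scheme; this is a hedged sketch of it, with all data structures and values invented for the example: each server tracks its local oldest volatile (uncommitted) epoch, a reduction over the servers yields the globally known oldest volatile epoch, and only updates in strictly older epochs are treated as stable.

```python
# Illustrative sketch of epoch-based recovery, assuming the scheme implied by
# the diagram. Hypothetical per-server state: epoch -> updates not yet stable.
volatile = {
    "server1": {12: ["mkdir a"], 13: ["rename a b"]},
    "server2": {11: ["create f"], 13: ["link b/f"]},
    "server3": {14: ["setattr f"]},
}

def local_oldest_volatile(epochs):
    return min(epochs) if epochs else float("inf")

# Reduction step: the globally known oldest volatile epoch.
global_oldest_volatile = min(local_oldest_volatile(v) for v in volatile.values())
print("globally known oldest volatile epoch:", global_oldest_volatile)  # 11

# Updates in epochs strictly older than this are stable everywhere and can be
# dropped from redo logs; everything newer must be kept for replay.
for server, epochs in volatile.items():
    redo = {e: ops for e, ops in epochs.items() if e >= global_oldest_volatile}
    print(server, "must keep for redo:", sorted(redo))
```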

Architectural Improvements Fault Detection Today RPC timeout: timeouts must scale O(n) to distinguish death from congestion. Pinger: no aggregation across clients or servers; O(n) ping overhead. Routed networks: router failure can be confused with end-to-end peer failure. Fully automatic failover scales with the slowest time constant, many tens of minutes on large clusters; failover could be much faster if useless waiting were eliminated.

Architectural Improvements Scalable Health Network The burden of monitoring clients is distributed, not replicated (ORNL: 35,000 clients, 192 OSSs, 7 OSTs per OSS). A fault-tolerant status reduction/broadcast network spans servers and LNET routers, with LNET high-priority small-message support so the health network stays responsive. Prompt, reliable detection: time constants in seconds; failed servers, clients and routers; recovering servers and routers. Interfaces with existing RAS infrastructure to receive and deliver status notifications. A sketch of the reduction idea follows.
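
A brief sketch of why a reduction/broadcast tree keeps detection time at O(log n): statuses are aggregated up a k-ary tree of monitors instead of every node pinging every server. The fan-out and the reduction record format are assumptions chosen for the example; the client count is taken from the slide.

```python
# Illustrative sketch: aggregate health status up a k-ary tree of monitors.
import math

FANOUT = 16
CLIENTS = 35000          # ORNL-scale client count from the slide

# Depth of the reduction tree: each level aggregates FANOUT reports.
depth = math.ceil(math.log(CLIENTS, FANOUT))
print(f"tree depth for {CLIENTS} clients at fan-out {FANOUT}: {depth}")  # 4

def reduce_status(reports):
    """One reduction step: a monitor summarises its children's reports."""
    return {
        "alive": sum(r["alive"] for r in reports),
        "dead": [n for r in reports for n in r["dead"]],
    }

# Leaf monitors each watch FANOUT clients; one client has failed.
leaves = [{"alive": FANOUT, "dead": []} for _ in range(4)]
leaves[2] = {"alive": FANOUT - 1, "dead": ["client-42"]}
print(reduce_status(leaves))   # {'alive': 63, 'dead': ['client-42']}
```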

Health Monitoring Network (diagram): clients reporting to a primary health monitor, with a failover health monitor standing by.

Architectural Improvements Metadata Writeback Cache Avoids unnecessary server communication: operations are logged and cached locally, giving the performance of a local file system when uncontended. Aggregated distributed operations: server updates are batched and transferred using bulk protocols (RDMA), reducing network and service overhead (see the sketch below). Sub-tree locking: lock aggregation, where a single lock protects a whole subtree, reduces lock traffic and server load.
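
A hedged sketch of the batching idea: operations are logged locally and shipped to the server in one bulk transfer instead of one RPC each. The flush threshold, record format and "RPC" counter are assumptions made for this illustration.

```python
# Illustrative sketch: log metadata operations locally and flush them to the
# server in batches, so n operations cost ~n/BATCH round trips rather than n.
BATCH = 64
rpc_count = 0

class WritebackLog:
    def __init__(self):
        self.pending = []

    def log(self, op, *args):
        self.pending.append((op, args))
        if len(self.pending) >= BATCH:
            self.flush()

    def flush(self):
        global rpc_count
        if self.pending:
            rpc_count += 1            # one bulk RPC carries the whole batch
            self.pending.clear()

wbc = WritebackLog()
for i in range(1000):
    wbc.log("create", f"/scratch/out/file{i}")
wbc.flush()
print("bulk RPCs issued:", rpc_count)          # 16 instead of 1000
```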

Architectural Improvements Current - flat communications model: a stateful client/server connection is required for coherence and performance; every client connects to every server; O(n) lock conflict resolution. Future - hierarchical communications model: aggregate connections, locking, I/O and metadata ops. Caching clients aggregate local processes (cores), and I/O forwarders scale another 32x or more. Caching proxies aggregate whole clusters, with implicit broadcast for scalable conflict resolution. The connection-count arithmetic is sketched below.
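
The scaling benefit of aggregation can be made concrete with a back-of-the-envelope calculation; the node counts are assumptions chosen for illustration, with the 32x aggregation factor taken from the slide. In the flat model every client holds a connection to every server, while I/O forwarders multiply that down.

```python
# Back-of-the-envelope sketch of connection counts, flat vs. hierarchical.
clients = 500_000          # compute processes wanting Lustre access (assumed)
servers = 1_000            # MDS + OSS nodes (assumed)
forwarding = 32            # processes aggregated per I/O forwarder (slide: "32x or more")

flat = clients * servers                          # every client talks to every server
hierarchical = (clients // forwarding) * servers  # only forwarders hold connections

print(f"flat model:         {flat:>15,} connections")
print(f"hierarchical model: {hierarchical:>15,} connections "
      f"({flat // hierarchical}x fewer)")
```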

Hierarchical Communications (diagram): user processes reach Lustre through the normal Lustre client, through I/O forwarding clients and servers (each I/O forwarder hosting a WBC client), or through proxy clusters whose proxy servers and WBC clients bridge a WAN / security domain; all paths converge on the Lustre storage cluster (multiple MDSs).

ZFS End-to-end data integrity: checksums in block pointers, ditto blocks, transactional mirroring/RAID. Removes ldiskfs size limits: immense capacity (128-bit), no limits on files, dirents, etc. Copy-on-write, transactional, snapshots.
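
A minimal sketch of end-to-end integrity via checksums kept in the parent block pointer: a corrupted block cannot vouch for itself because its checksum lives one level up. The structures below are invented for the example; real ZFS block pointers are far richer.

```python
# Illustrative sketch: keep each block's checksum in its *parent* pointer.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# "Disk": block id -> contents.
disk = {1: b"file data, version 1"}

# Parent block pointer records the child's id and expected checksum.
blkptr = {"child": 1, "cksum": checksum(disk[1])}

def read_verified(ptr):
    data = disk[ptr["child"]]
    if checksum(data) != ptr["cksum"]:
        raise IOError("checksum mismatch: read ditto/mirror copy instead")
    return data

print(read_verified(blkptr))           # ok
disk[1] = b"file data, verzion 1"      # silent corruption on media
try:
    read_verified(blkptr)
except IOError as e:
    print("detected:", e)
```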

Performance Improvements SMP Scaling Improve MDS performance and small-message handling through CPU affinity and finer-granularity locking. (Charts: RPC throughput vs. total client processes for varying numbers of client nodes.)
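
A hedged sketch of the finer-granularity-locking idea: instead of one lock serialising all incoming small requests, requests are hashed to per-CPU partitions, each with its own lock and queue. The partitioning scheme and counts are assumptions for this illustration, not the actual MDS implementation.

```python
# Illustrative sketch: partition the request queue per CPU so small-RPC
# handling does not serialise on one global lock.
import threading
from collections import deque

N_PARTITIONS = 8   # e.g. one per CPU core (assumed)

class PartitionedService:
    def __init__(self, nparts=N_PARTITIONS):
        self.parts = [{"lock": threading.Lock(), "queue": deque()}
                      for _ in range(nparts)]

    def enqueue(self, client_id, request):
        # Hash the client to a partition: requests from different clients
        # contend on different locks, and a client keeps CPU affinity.
        part = self.parts[hash(client_id) % len(self.parts)]
        with part["lock"]:
            part["queue"].append(request)

svc = PartitionedService()
for c in range(1000):
    svc.enqueue(f"client{c}", {"op": "getattr"})
print([len(p["queue"]) for p in svc.parts])   # load spread over partitions
```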

Load (Im)Balance (chart: request queue depth over time, per server).

Network Request Scheduler A much larger working set than a disk elevator, with higher-level information: client, object, offset, job/rank. Prototype: initial development on a simulator; scheduling strategies (quanta, offset, fairness, etc.); testing at ORNL pending. Future: exchange global information (gang scheduling); QoS with real-time / bandwidth reservation (min/max). A toy scheduling policy is sketched below.
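
A hedged sketch of the scheduling idea: the server reorders its queue using client, object and offset information that a disk elevator never sees. The specific policy shown (round-robin quanta per client, offset-ordered dispatch within each object) is an assumption for illustration, not the actual NRS policy set.

```python
# Illustrative sketch of a network request scheduler: group queued I/O by
# client, give each client a fixed quantum per round (fairness), and within a
# round dispatch by (object, offset) to keep per-object I/O sequential.
from collections import defaultdict

QUANTUM = 2   # requests a client may dispatch per round (assumed)

def schedule(requests):
    """requests: list of dicts with client, obj, offset. Returns dispatch order."""
    per_client = defaultdict(list)
    for r in requests:
        per_client[r["client"]].append(r)
    for q in per_client.values():
        q.sort(key=lambda r: (r["obj"], r["offset"]))

    order = []
    while any(per_client.values()):
        for client in sorted(per_client):              # round-robin over clients
            take, per_client[client] = (per_client[client][:QUANTUM],
                                        per_client[client][QUANTUM:])
            order.extend(take)
    return order

reqs = [
    {"client": "c1", "obj": 7, "offset": 4 << 20},
    {"client": "c2", "obj": 3, "offset": 0},
    {"client": "c1", "obj": 7, "offset": 0},
    {"client": "c2", "obj": 3, "offset": 1 << 20},
    {"client": "c1", "obj": 7, "offset": 8 << 20},
]
for r in schedule(reqs):
    print(r["client"], r["obj"], r["offset"])
```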

Metadata Protocol Improvements Size on MDT (SOM): avoid multiple RPCs for attributes derived from the OSTs; the OSTs remain definitive while the file is open; compute on close and cache on the MDT (sketched below). Readdir+: aggregation of directory I/O, getattrs and locking.
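
A small sketch of the size-on-MDT idea: while a file is open the OSTs remain authoritative for its size, and the aggregate size is computed once at close and cached on the MDT so a later stat needs a single RPC. The class and field names are assumptions for this example.

```python
# Illustrative sketch of Size-on-MDT (SOM): cache the file size on the MDT at
# close; while the file is open, fall back to querying the OSTs holding stripes.
class File:
    def __init__(self, stripe_sizes):
        self.stripe_sizes = stripe_sizes   # bytes held by each OST stripe
        self.open_count = 0
        self.som_size = None               # size cached on the MDT

    def open(self):
        self.open_count += 1
        self.som_size = None               # OSTs become definitive again

    def write(self, stripe, nbytes):
        self.stripe_sizes[stripe] += nbytes

    def close(self):
        self.open_count -= 1
        if self.open_count == 0:           # compute on close, cache on MDT
            self.som_size = sum(self.stripe_sizes)

    def stat_size(self):
        if self.som_size is not None:
            return self.som_size, "1 RPC to the MDT"
        return sum(self.stripe_sizes), f"{len(self.stripe_sizes)} OST RPCs"

f = File([0, 0, 0, 0])
f.open()
f.write(0, 1 << 20)
print(f.stat_size())   # OSTs queried while the file is open
f.close()
print(f.stat_size())   # served from the MDT cache after close
```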

Lustre Scalability

  Attribute             Today                      Future
  Number of clients     10,000s (flat comms)       1,000,000s (hierarchical comms)
  Server capacity       ext3, 8 TB                 ZFS, petabytes
  Metadata performance  Single MDS                 CMD, SMP scaling
  Recovery time         RPC timeout, O(n)          Health network, O(log n)

THANK YOU Eric Barton eeb@sun.com lustre-discuss@lists.lustre.org lustre-devel@lists.lustre.org