Sun Storage Perspective & Lustre Architecture. Dr. Peter Braam, VP, Sun Microsystems


Agenda
> Future of storage: Sun's vision
> Lustre: vendor-neutral architecture and roadmap

Sun's view on storage: introduction

The IT Infrastructure: Backoffice, HPC, Web

Big Changes
Now:
> Everything is a cluster
> Open source everywhere (compute, network, storage)
> Fully virtualized processing, I/O, and storage
> Integration: the datacenter as a design center
Coming:
> COMPUTE: many cores, many threads, open platforms
> STORAGE: open platforms, $/performance, $/gigabyte
> NETWORKING: huge bandwidth, open platforms

What's Ahead
Open Servers
> Leverage innovative product design and packaging, common components, open source software, and wide interoperability to deliver breakthrough economics
Open Storage
> A storage architecture that leverages open software, an open architecture, common components, and open interoperability to create innovative storage products
> Delivers breakthrough economics
Open Networks
> A unified datacenter network that uses common components and open source software
> Seamless integration with existing environments
> Delivers breakthrough economics

ZFS: the central component of Open Storage

What is ZFS? A new way to manage data
> End-to-end data integrity: checksumming and copy-on-write transactions (sketched below)
> Immense data capacity: the world's first 128-bit file system
> Easier administration: a pooled storage model, no volume manager
> Huge performance gains: architected for speed
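To make the end-to-end integrity idea concrete, here is a minimal Python sketch (illustrative only, not ZFS code): every write allocates a fresh block, copy-on-write style, and the caller keeps that block's checksum, so corruption anywhere in the data path is caught at read time.

    import hashlib

    class BlockStore:
        """Copy-on-write store: a write never overwrites a live block."""
        def __init__(self):
            self.blocks = {}     # block address -> bytes
            self.next_addr = 0

        def write(self, data):
            addr = self.next_addr
            self.next_addr += 1
            self.blocks[addr] = data                       # fresh block, COW style
            return addr, hashlib.sha256(data).hexdigest()  # parent keeps the checksum

        def read(self, addr, expected):
            data = self.blocks[addr]
            if hashlib.sha256(data).hexdigest() != expected:
                raise IOError(f"silent corruption detected in block {addr}")
            return data

    store = BlockStore()
    addr, csum = store.write(b"application data")
    assert store.read(addr, csum) == b"application data"
    store.blocks[addr] = b"flipped bits"   # simulate corruption in the data path
    # store.read(addr, csum) would now raise instead of returning bad data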

Trouble with Existing File Systems? Good for the time they were designed, but...
> No defense against silent data corruption: any defect in the data path can corrupt data, undetected
> Difficult to administer, need a volume manager: volumes, labels, partitions, provisioning, and lots of limits
> Older/slower data management techniques: fat locks, fixed block size, naive pre-fetch, dirty region logging

Storage software features: getting out of the controller
> Storage management: redundancy, snapshots, replication, monitoring, management, NAS exports
> Solaris + ZFS replace RAID controllers; the foundation for Lustre / pnfs
> Lustre: horizontal scaling for HPC and Web 2.0

ZFS re-usability
> Storage controller: iSCSI or IB volume exports, with the enterprise goodies
> Local file system
> NAS server
> Storage layer for clustered storage: pnfs, Lustre, others

I know you're out there. I can feel you now. I know that you're afraid... you're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. Neo, The Matrix (1999)

Lustre introduction

World's Fastest and Most Scalable Storage
Lustre is the leading cluster file system
> 7 of the Top 10 HPC systems
> Half of the Top 30 HPC systems
Demonstrated scalability and performance
> 100 GB/sec I/O
> 25,000 clients
> Many systems with 1000s of nodes

Lustre scalable file system
Lustre is a shared file system
> Software-only solution, no hardware ties
> Developed as a company / government-lab collaboration
> Open source, modifiable, many partners
> Extraordinary network support
> Smoking performance and scalability
> POSIX compliance and high availability
Lustre is for extreme storage
> Horizontal scaling of I/O over all servers: parallelizes I/O, block allocation, and locking
> Similarly for metadata over MDS servers
> Add capacity by adding servers
> Example: week 1 of the LLNL BG/L system: 75M files, 175 TB

What kind of deployments?
Extremely large clusters
> Deployment: extremely high node count, performance
> Where: government labs, DoD
> Strengths: modifiability, special networking, scalability
Medium and large clusters
> Deployment: 32 to low thousands of nodes
> Where: everywhere
> Strengths: POSIX features, HA
Very large scale data centers
> Deployment: combine many extremely large clusters
> Where: LLNL, ISPs, DoD
> Strengths: security, networking, modifiability, WAN features

A Lustre Cluster (diagram)
> Pool of clustered MDS servers (1-100), e.g. MDS1 active and MDS2 standby, with disk storage holding the metadata targets (MDTs)
> OSS servers (1-1000s) with storage holding object storage targets (OSTs), on commodity storage or enterprise-class storage arrays and SAN fabric
> Lustre clients (1-100,000), with simultaneous support of multiple network types (Elan, Myrinet, InfiniBand, GigE) joined by routers
> Shared storage enables failover between OSS pairs

How does it work? (diagram)
> Clients (through the LOV) send directory operations, file open/close, metadata, and concurrency traffic to the MDS
> Clients send file I/O and file locking traffic to the OSSs
> The MDS and OSSs exchange recovery, file status, and file creation traffic

Lustre Stripes Files with Objects
> Currently, objects are simply files on OSS-resident file systems
> Striping enables parallel I/O to one file (see the sketch below)
> Lustre scales that to 100 GByte/sec to a single file
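A minimal sketch of the striping arithmetic, assuming a 1 MiB stripe size and a stripe count of 4 (both values are illustrative): consecutive stripes land on different objects, which is what lets many OSSs serve a single file in parallel.

    # Illustrative RAID-0-style striping, not Lustre source code.
    STRIPE_SIZE = 1 << 20     # 1 MiB stripes (assumed)
    STRIPE_COUNT = 4          # file striped over 4 objects / OSTs (assumed)

    def locate(offset):
        """Map a file offset to (object index, offset inside that object)."""
        stripe = offset // STRIPE_SIZE        # which stripe the offset falls in
        obj = stripe % STRIPE_COUNT           # objects are used round-robin
        obj_off = (stripe // STRIPE_COUNT) * STRIPE_SIZE + offset % STRIPE_SIZE
        return obj, obj_off

    # Consecutive stripes hit different objects, so clients can read or
    # write them through different OSSs at the same time.
    for off in (0, 1 << 20, 2 << 20, 4 << 20):
        print(off, "->", locate(off))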

Vision

Facet                 Activity                                           Difficulty  Priority  Timeframe
Product Quality       Major work is needed, except on networking         High        High      2008
Performance fixes     Systematic benchmarking & tuning                   Low         Medium    2009
More HPC Scalability  Clustered MDS, flash cache, WB cache, request      Medium      Medium    2009-2012
                      scheduling, resource management, ZFS
Wide area features    Security, WAN performance, proxies, replicas       Medium      Medium    2009-2012
Broad adoption        Combined pnfs / Lustre exports                     High        Low       2009-2012

Note: These are visions, not commitments

Lustre ZFS-DMU

Lustre & ZFS
User space!
> The DMU talks to block devices
> The OSS / MDS talks to the DMU
> ztest and FUSE work similarly
> LNET: user space or kernel
OSS / MDS
> Will write ZFS formats on disk, like we currently write ext3
> Uses the DMU APIs for transactions (see the sketch below)
> Already ported to Linux and OS X
(Diagrams: ZFS use of the DMU - Solaris kernel, Solaris VFS, ZPL, ZFS file system, DMU, zpool, storage; Lustre use of the DMU - OSS/MDS in any kernel, DMU, zpool, block device, storage)
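As a rough illustration of why a transactional API matters here, the hypothetical sketch below (not the real DMU interface) groups related object updates so they become visible atomically; this is the property the OSS/MDS get from DMU transactions rather than from ext3 journaling.

    # Hypothetical transactional object store, illustrating the idea only.
    class Transaction:
        def __init__(self, store):
            self.store = store
            self.pending = []                 # buffered updates, not yet visible

        def write(self, obj, key, value):
            self.pending.append((obj, key, value))

        def commit(self):
            for obj, key, value in self.pending:   # all updates land together
                self.store.setdefault(obj, {})[key] = value

    store = {}
    tx = Transaction(store)
    tx.write("object-12", "data", b"payload")
    tx.write("object-12", "size", 7)   # data and metadata change as one unit
    tx.commit()
    print(store)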

Lustre pnfs

pnfs & Lustre
pnfs integration
> Soon: pnfs exports from Lustre on Linux
> First participation in a Bakeathon by Lustre!
Longer term possibilities
> Let Lustre servers offer both the pnfs & Lustre protocols
> Requires an interesting Lustre storage layer
> Make LNET an RDMA transport for NFS?
> Offer proven Lustre features to the NFS standards efforts

Layered & direct pnfs (diagram)
> pnfs layered on Lustre clients: pnfs clients using the pnfs file layout talk to pnfsd / NFSD instances running on Lustre client nodes, which mount the global namespace served by the Lustre servers (MDS & OSS)
> pnfs and Lustre servers on a Lustre / DMU storage system: the MD node runs the MDS and pnfsd, each stripe node runs an OSS and NFSD, and pnfs clients talk to the storage nodes directly

Lustre flash cache

Flash cache
Exploit the storage hardware revolution
> Very high bandwidth available from flash
> Add Flash Cache OSTs with capacity ~ the RAM of the cluster
> Cost: a small fraction of the cost of the cluster's RAM
Fast I/O goes from compute node memory to flash; the flash then drains to disk storage, ~5x slower (see the arithmetic below)
> E.g. the cluster finishes its I/O in 10 mins, and the data is on disk in 50 mins
> Need 5x fewer disks
Lustre manages file system coherency
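The slide's numbers follow from simple bandwidth arithmetic; the sketch below reproduces them with assumed figures (the checkpoint size and flash bandwidth are illustrative, only their 5x ratio comes from the slide).

    checkpoint_tb = 30.0          # assumed checkpoint size, TB
    flash_bw_tb_per_min = 3.0     # assumed aggregate flash bandwidth
    disk_bw_tb_per_min = flash_bw_tb_per_min / 5   # disk tier ~5x slower

    burst_minutes = checkpoint_tb / flash_bw_tb_per_min  # 10 min: compute nodes done
    drain_minutes = checkpoint_tb / disk_bw_tb_per_min   # 50 min: flash drains to disk

    print(f"compute nodes blocked for {burst_minutes:.0f} min, "
          f"data on disk after {drain_minutes:.0f} min")
    # Sizing the disk tier for the drain rate instead of the burst rate is
    # where the "5x fewer disks" claim comes from.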

Flash Cache interactions (diagram)
> WRITE: all writes go to the Flash Cache OSS
> READ: the OSS forces a Flash Cache flush first
> DRAIN: data moves from the Flash Cache OSS to the traditional OSSs, into its final position

Lustre client write-back cache

Metadata WBC & replication
Goal & problem
> Disk file systems make updates in memory
> Network file systems do not: metadata ops require RPCs
> The Lustre WBC should only require synchronous RPCs for cache misses
Key elements of the design (see the sketch below)
> Clients can determine file identifiers for new files
> A change log is maintained on the client
> Parallel reintegration of the log to clustered MD servers
> Sub-tree locks enlarge lock granularity
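A minimal sketch of the design, with hypothetical names (this is not Lustre code): the client assigns file identifiers from a private range and appends to a local change log, so a create needs no RPC; reintegration later replays the log to the metadata servers, and different servers could be fed in parallel.

    import itertools

    class WBCClient:
        def __init__(self, client_id, mds_count):
            self.mds_count = mds_count
            self.changelog = []                             # local change log
            self._fids = itertools.count(client_id << 32)   # client-private FID range

        def create(self, parent_fid, name):
            fid = next(self._fids)       # the client assigns the FID itself: no RPC
            self.changelog.append(("create", parent_fid, name, fid))
            return fid

        def reintegrate(self, send):
            # Replay the log; records bound for different MDSs can go in parallel.
            for rec in self.changelog:
                send(rec[1] % self.mds_count, rec)   # pick an MDS by parent FID
            self.changelog.clear()

    c = WBCClient(client_id=7, mds_count=4)
    c.create(parent_fid=2, name="out.dat")   # purely local, no synchronous RPC
    c.reintegrate(lambda mds, rec: print(f"MDS{mds} <- {rec}"))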

Uses of the WBC
HPC
> I/O forwarding makes Lustre clients into I/O call servers
> These servers can run on WBC clients
Exa-scale clusters
> The WBC enables last-minute resource allocation
WAN Lustre
> Eliminates the latency of wide-area use for updates
HPCS
> Dramatically increases small-file performance

Lustre with I/O forwarding (diagram)
> Forwarding clients talk to FW servers that are also Lustre clients, which talk to the Lustre servers
> FW servers should be Lustre WBC-enabled clients

Lustre data migration & file system replication

Migration: many uses
> Between ext3 / ZFS servers
> For space rebalancing
> To empty servers and replace them
> In conjunction with HSM
> To manage caches & replicas
> For basic server network striping

Migration (diagram; see the sketch below)
> A migration coordinator drives data-moving agents
> A virtual migrating MDS pool spans MDS pool A and MDS pool B
> A virtual migrating OSS pool spans OSS pool A and OSS pool B
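A hedged sketch of the coordinator / agent split (names and structure are hypothetical): the coordinator hands objects to data-moving agents, while a virtual-pool lookup keeps presenting one namespace for the duration of the move.

    from concurrent.futures import ThreadPoolExecutor

    pool_a = {"obj1": b"aa", "obj2": b"bb", "obj3": b"cc"}   # source OSS pool
    pool_b = {}                                              # target OSS pool

    def agent(obj):
        pool_b[obj] = pool_a[obj]    # one data-moving agent copies one object
        return obj

    def lookup(obj):
        # The virtual migrating pool: reads go to B once the object has moved.
        return pool_b.get(obj, pool_a.get(obj))

    # The coordinator dispatches migration work to several agents in parallel.
    with ThreadPoolExecutor(max_workers=2) as agents:
        for done in agents.map(agent, list(pool_a)):
            print("migrated", done)

    print(lookup("obj2"))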

General purpose replication
Driven by major content distribution networks
> DoD, ISPs
> Keep multi-petabyte file systems in sync
Implementing scalable synchronization (see the sketch below)
> Changelog based
> Works on live file systems
> No scanning, immediate resume, parallel
Many other applications
> Search, basic server network striping
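A minimal sketch of changelog-driven synchronization (the record format is invented for illustration): the replica tracks the index of the last record it applied, which is what gives immediate resume with no scanning of the live file system.

    changelog = []                 # appended to by the live file system

    def log(op, path, data=None):
        changelog.append((len(changelog), op, path, data))

    class Replica:
        def __init__(self):
            self.tree = {}
            self.applied = 0       # resume cursor: last applied record + 1

        def sync(self):
            for idx, op, path, data in changelog[self.applied:]:
                if op == "write":
                    self.tree[path] = data
                elif op == "unlink":
                    self.tree.pop(path, None)
                self.applied = idx + 1   # persist this to survive restarts

    log("write", "/a", b"1"); log("write", "/b", b"2")
    r = Replica(); r.sync()        # catches up on the backlog
    log("unlink", "/a"); r.sync()  # resumes from the cursor, no rescan
    print(r.tree, "resume at", r.applied)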

Lustre server caches & proxies

Caches / proxies
Many variants
> HSM: a Lustre cluster is a proxy cache for 3rd-tier storage
> Collaborative read cache: bit-torrent style reading, or, when concurrency increases, use other OSSs as proxies
> Wide area cache: repeated reads come from the cache
Technical elements (see the sketch below)
> Migrate data between storage pools
> Re-validate cached data with versions
> Hierarchical management of consistency
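For the version-based revalidation element, a hedged sketch (the scheme is hypothetical): a proxy serves its cached copy only while the copy's version still matches the master's, and re-fetches otherwise, so unchanged data never crosses the wire twice.

    master = {"fileA": (3, b"contents")}   # path -> (version, data) on the master

    def master_version(path):
        return master[path][0]             # stands in for a cheap version-check RPC

    def master_fetch(path):
        return master[path]                # stands in for a full data fetch

    class ProxyCache:
        def __init__(self):
            self.cache = {}                # path -> (version, data)

        def read(self, path):
            ver = master_version(path)     # re-validate the cached copy by version
            cached = self.cache.get(path)
            if cached and cached[0] == ver:
                return cached[1]           # still current: serve from the cache
            self.cache[path] = master_fetch(path)   # stale or missing: re-fetch
            return self.cache[path][1]

    proxy = ProxyCache()
    proxy.read("fileA")                    # first read populates the cache
    master["fileA"] = (4, b"new contents") # master changes, version bumps
    print(proxy.read("fileA"))             # mismatch forces a re-fetch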

Collaborative cache (diagram)
Scenario: all clients read one file
> Initial reads go to the primary OSS
> With increasing load, reads are redirected
> Other OSSs act as caches (see the sketch below)
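A minimal sketch of the redirection logic (the threshold and names are illustrative): the primary OSS serves reads itself until concurrency passes a limit, then points further clients at peer OSSs acting as caches.

    import itertools

    REDIRECT_THRESHOLD = 2       # assumed concurrency limit for the example
    peers = ["oss2", "oss3"]     # peer OSSs willing to act as read caches
    rr = itertools.cycle(peers)  # spread redirected clients round-robin
    active = 0                   # concurrent reads served by the primary

    def primary_read(client):
        global active
        if active >= REDIRECT_THRESHOLD:
            return ("redirect", next(rr))   # client re-issues its read there
        active += 1
        return ("data", f"bytes for {client}")

    for c in range(5):
        print(c, primary_read(c))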

Proxy clusters (diagram)
> Remote clients talk to proxy Lustre servers, which talk to the master Lustre servers; local clients talk to the masters directly
> Result: local performance after the first read

peter.braam@sun.com