DAOS: An Architecture for Extreme Scale Storage
Eric Barton, High Performance Data Division, Intel



Legal Information
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html. Intel may make changes to specifications and product descriptions at any time, without notice. Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user.
Intel, Intel Inside, the Intel logo and Xeon Phi are trademarks of Intel Corporation in the United States and other countries. Material in this presentation is intended as product positioning and not approved end user messaging. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation

Emerging Trends
- Increased computational power, achieved through parallelism
  - 100,000s of nodes with 10s of millions of cores
  - More frequent hardware & software failures
- Huge expansion of simulation data volume & metadata
  - Complex to manage and analyze
- Tiered storage architectures
  - High performance fabric & solid state storage on-cluster
  - Low performance, high capacity disk-based storage off-cluster

Disruptive Change
- NVRAM + integrated fabric
  - Byte-granular storage access
  - Sub-µs storage access latency; µs network latency
- Conventional storage software
  - Block-granular access limits scaling
  - High overhead I/O stack: 10s of µs lost to communications software, 100s of µs lost to filesystem & block I/O stack
- Requirements
  - Minimal software overhead: OS-bypass communications, latency-sensitive I/O
  - Fail-out resilience
- Integrated fabric with persistent memory for filesystem & application metadata and hot data; block storage for warm data (SSD) and lukewarm data (disk)

Storage Architecture
- Compute nodes (NVRAM): hot data
  - High data valence & velocity; brute-force, ad-hoc analysis
  - Extreme scale-out at full fabric bandwidth: O(1PB/s) to O(10PB/s)
  - Extremely low fabric & NVRAM latency; extreme fine grain; new programming models
- I/O nodes (NVRAM, SSD): semi-hot data / staging buffer
  - Fractional fabric bandwidth: O(10TB/s) to O(100TB/s)
- Indexed parallel filesystem (NVRAM, SSD, disk): site-wide shared warm storage
  - SAN-limited: O(1TB/s) to O(10TB/s)
- Tiers linked by the compute fabric on-cluster and a site-wide storage fabric

On-Cluster (Hot) Storage Requirements
- Scalable capacity: O(10x) system DRAM
- Scalable throughput: significant fraction of fabric bandwidth; significant fraction of fabric injection rate
- Data integrity & consistency: tunable resilience/availability; no silent failures; safe overwrite; minimal usage constraints
- Global namespaces
  - System namespace shared across jobs: connected workflows
  - Object namespaces shared across processes: encapsulating entire simulation sets
- Fine-grained, random, massively concurrent access
- Minimal interference from data movement by unrelated workflows
- Security: authorized/authenticated access

Global Namespaces
- Containers
  - Shared system namespace ("where's my stuff?")
  - Private namespaces ("my stuff"): entire simulation sets
- Multiple schemas
  - POSIX*: shared (system) & private (legacy datasets); no discontinuity for application developers
  - Scientific: HDF5*, ADIOS*, SciDB*, ...
  - Big Data: HDFS*, Spark*, graph analytics, ...
[Slide diagram: a system POSIX namespace (/projects, /posix, /Particle, /Climate) spanning hot datasets on the compute nodes and warm/staging datasets (HDF5, ADIOS) on the I/O nodes]

Distributed Application Object Storage
- Exascale I/O stack serving applications and tools through multiple top-level APIs
  - Extreme scalability, ultra fine grain
  - Data integrity, availability, resilience
  - Unified data model over all storage tiers site-wide
  - Domain-specific APIs and high-level data models
- DAOS-CT (Caching and Tiering)
  - Data migration over storage tiers
  - Guided by usage metadata; driven by the system resource manager
- DAOS-SR (Sharding and Resilience)
  - Scaling throughput over storage nodes
  - Fail-out resilience across storage nodes
- DAOS-M (Persistent Memory Object Storage)
  - Ultra-low latency, fine-grain I/O
  - Fine-grain versioning & global consistency
  - Location (latency & fault domain) aware

Transactions
Why:
- Simplify application development: safe update in-place; guaranteed data model consistency; concurrent producer/consumer workflows
- Support resilience schemas: guaranteed consistency for redundant distributed data
- Support tiered storage: preserve data integrity on migration
How:
- Multi-version concurrency control: snapshot consistency on read; maximize concurrency/asynchrony
- Arbitrary numbers of collaborating processes and storage targets
- Leader processes drive commit/snapshot/migrate
[Slide diagram: epochs at or below the highest committed epoch are committed (immutable), serving consistent readers and named snapshots; epochs from each process's lowest open epoch upward are uncommitted (volatile), serving writers & inconsistent readers]
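The epoch model above can be sketched in a few lines. This is a minimal illustration, not the DAOS API: all class and method names are assumptions. Writes land in open (volatile) epochs, committing an epoch makes it immutable, and consistent readers only ever see data at or below the highest committed epoch.

```python
# Minimal sketch of epoch-based multi-version concurrency control.
# Illustrative only: names and structures are not DAOS internals.

class EpochStore:
    def __init__(self):
        self.versions = {}          # epoch -> {key: value}
        self.highest_committed = 0  # epochs <= this are immutable

    def write(self, epoch, key, value):
        # Writers target an open epoch; updates may arrive in any order.
        if epoch <= self.highest_committed:
            raise ValueError("epoch already committed (immutable)")
        self.versions.setdefault(epoch, {})[key] = value

    def commit(self, epoch):
        # Commit makes the epoch globally visible and immutable.
        self.highest_committed = max(self.highest_committed, epoch)

    def read(self, key, epoch=None):
        # Consistent readers sample a committed snapshot: the newest
        # value written at or below the requested (committed) epoch.
        at = self.highest_committed if epoch is None else epoch
        for e in sorted(self.versions, reverse=True):
            if e <= at and key in self.versions[e]:
                return self.versions[e][key]
        return None

store = EpochStore()
store.write(1, "x", "old")
store.commit(1)
store.write(2, "x", "new")   # uncommitted: invisible to readers
print(store.read("x"))       # -> old
store.commit(2)
print(store.read("x"))       # -> new
```

Because committed epochs are immutable, a named snapshot is simply a retained epoch number: reading at `epoch=1` always returns the same data, which is what makes concurrent producer/consumer workflows safe.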

DAOS-M Object Storage
- Multiple independent object address spaces: a versioned object PGAS
  - Container = {container shards} + metadata
  - Container shard = {objects}
  - Object = KV store or byte array, sparsity exposed
  - Metadata = {shard list, commit state}: minimal, resilient (replicated state machine)
- Maximum concurrency
  - Byte-granular MVCC
  - Deferred integration of mutable data: writes eagerly accepted in arbitrary order; reads sample the requested version snapshot
- Distributed transactions
  - Prepare: send updates tagged with version t
  - Commit: mark version t committed in container metadata; version t is now immutable and globally consistent
  - Abort: discard version t updates everywhere
- Low latency
  - End-to-end OS bypass: persistent memory server, userspace fabric drivers
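The prepare/commit/abort flow above can be sketched as follows. This is a hedged illustration of the idea, not the DAOS wire protocol: `Shard`, `Container`, and the hash-based shard mapping are all assumptions. The key point it shows is that the commit state lives in one place (the container metadata), so flipping version t to "committed" atomically makes all of its scattered shard updates globally consistent.

```python
# Sketch of version-tagged distributed updates with a single
# commit mark in container metadata. Illustrative names only.

class Shard:
    def __init__(self):
        self.staged = {}   # version -> {key: value} (volatile)
        self.objects = {}  # key -> value (committed state)

    def prepare(self, version, updates):
        # Writes are eagerly accepted, in arbitrary order, per version.
        self.staged.setdefault(version, {}).update(updates)

    def apply(self, version):
        self.objects.update(self.staged.pop(version, {}))

    def discard(self, version):
        self.staged.pop(version, None)

class Container:
    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]
        self.committed = set()  # commit state: container metadata

    def shard_for(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def prepare(self, version, updates):
        for key, value in updates.items():
            self.shard_for(key).prepare(version, {key: value})

    def commit(self, version):
        # Marking the version committed makes it immutable; shards
        # can integrate the staged updates lazily afterwards.
        self.committed.add(version)
        for shard in self.shards:
            shard.apply(version)

    def abort(self, version):
        for shard in self.shards:
            shard.discard(version)

c = Container(4)
c.prepare(7, {"a": 1, "b": 2})
c.commit(7)
c.prepare(8, {"a": 99})
c.abort(8)   # version 8 updates discarded everywhere
```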

DAOS-M Latency-Sensitive Server Operations
- Byte array objects
  - Write (log data): allocate an extent buffer in NVRAM; copy immediate data / RDMA READ remote data; insert into the persistent extent.version index
  - Read (data integration): extent.version index traversal yields a gather descriptor; RDMA WRITE remote data
- Key-value objects
  - Insert/remove/retrieve a value in the key.version table
[Slide diagram: request queues of (write, version, extent, data) and (read, version, extent, RMA) descriptors; index inserts for extents being written, index traversal over committed extents]
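The extent.version index above can be illustrated with a toy byte-array object. This is a sketch of the log-structured idea only (the real index and RDMA paths are not modeled): writes append versioned extents without overwriting anything, and a read at version v traverses the index and gathers, for each offset, the newest extent at or below v.

```python
# Toy versioned extent log. Illustrative only: not DAOS internals.

class ByteArrayObject:
    def __init__(self):
        self.extents = []  # list of (version, offset, bytes)

    def write(self, version, offset, data):
        # Log-structured: never overwrite, just index the new extent.
        self.extents.append((version, offset, data))

    def read(self, version, offset, length):
        # Newest-wins traversal of extents up to the requested version.
        out = bytearray(length)  # unwritten bytes read as zero
        for ver, off, data in sorted(self.extents):
            if ver > version:
                continue
            for i, b in enumerate(data):
                pos = off + i - offset
                if 0 <= pos < length:
                    out[pos] = b
        return bytes(out)

obj = ByteArrayObject()
obj.write(1, 0, b"aaaa")
obj.write(2, 2, b"BB")
print(obj.read(1, 0, 4))   # -> b'aaaa'
print(obj.read(2, 0, 4))   # -> b'aaBB'
```

Note how the version-2 read sees the version-1 bytes underneath the newer extent: that is the "reads sample the requested version snapshot" behavior, with sparsity (unwritten bytes) exposed as zeros here for simplicity.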

Sharding & Resilience
- Multiple mixed schemas
  - Performance schemas: scale IOPS & bandwidth
  - Resilience schemas: data integrity (checksums computed and stored separately); N-way replication (high performance for shared-write objects); erasure codes (high efficiency for non-shared objects)
- Asynchronous refactor, scrub & repair
  - Exploit immutability of committed data
  - Leverage DAOS-M distributed consistency and sparsity
- Scaling requirements (onerous!)
  - Aliasing of access & distribution patterns: a bulk synchronous workload is an Amdahl's-law vector
  - Extreme object size dynamic range: megaliths vs. grains of sand
  - Declustering: rebalance on node addition; distributed rebuild on node failure
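The replication-versus-erasure-code trade-off above can be made concrete with the simplest possible erasure code: k data shards plus one XOR parity shard. This is purely illustrative (real systems use stronger codes such as Reed-Solomon), but it shows why erasure coding is space-efficient for non-shared objects.

```python
# Simplest erasure code: k data shards + 1 XOR parity shard.
# Illustrative only; not the codes a production system would use.

def xor_parity(shards):
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving, parity):
    # XOR of the parity with the surviving shards recovers the
    # single lost shard.
    return xor_parity(list(surviving) + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data shards
p = xor_parity(data)

lost = data[1]
recovered = reconstruct([data[0], data[2]], p)
assert recovered == lost
```

Storage overhead here is k+1 shards for k shards of data (4/3), versus 3x for 3-way replication; the price is that every update must also update parity, which is why replication stays attractive for shared-write objects and erasure codes suit non-shared, largely immutable ones.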

Sharding & Resilience (cont.)
- Algorithmic (O(0)) layout metadata
  - Consistent hash of the object ID randomizes placement; replicas placed adjacently
  - The hash must guarantee minimum separation of nodes in the same fault domain
  - Multiple hash rings for declustering
- Explicit (O(n)) layout metadata
  - Layout responsive to usage: preserve locality / feed the hungriest mouth
  - O(0) structures used to store the layout
[Slide diagram: hash(object ID) mapped onto a node ring, with fault-domain separation between replicas]
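A minimal sketch of the algorithmic O(0) layout above, assuming a single hash ring (the hashing scheme and class names are assumptions, not the DAOS algorithm): object placement is computed from the object ID alone, replicas occupy adjacent ring positions, and nodes sharing a fault domain are skipped so that adjacent replicas never co-fail.

```python
# Consistent-hash placement with fault-domain separation.
# Illustrative only; real systems also use virtual nodes and
# multiple rings for declustering.

import hashlib

def h(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # nodes: list of (name, fault_domain)
        self.points = sorted((h(name), name, dom) for name, dom in nodes)

    def place(self, obj_id, n_replicas):
        # Start at the object's hash position, walk the ring,
        # and skip nodes whose fault domain is already used.
        start = h(obj_id) % len(self.points)
        chosen, domains = [], set()
        i = start
        while len(chosen) < n_replicas:
            if i - start >= len(self.points):
                raise ValueError("not enough distinct fault domains")
            _, name, dom = self.points[i % len(self.points)]
            if dom not in domains:
                chosen.append(name)
                domains.add(dom)
            i += 1
        return chosen

ring = Ring([("n0", "rack0"), ("n1", "rack0"),
             ("n2", "rack1"), ("n3", "rack2")])
layout = ring.place("object-42", 3)
```

Because `place` is a pure function of the object ID and the node list, any client can compute any object's layout with zero per-object metadata, which is what "O(0)" means here.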

Caching & Tiering
- Metadata
  - Residence maps: whole-object state maintained directly; sub-object state leverages lower layers
  - Access patterns: historical; explicit notification by upper levels; data colouring
- Data migration
  - Resharding between tiers
  - Maintain distributed object semantics; maximize performance on subsequent access
  - Select appropriate resiliency schemas
- Explicit control
  - Persist & prestage APIs / JCL
  - System resource manager driven migration: rebalance & minimize interference
- Transparent caching
  - Write-back & demand cache
  - Prefetch guided by usage metadata
[Slide diagram: near storage holds residence maps (dirty/clean/miss) and tiering metadata; prefetch & write-back move HDF5 datasets to and from far storage]
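The transparent write-back behavior above can be sketched with a toy two-tier cache. All names here are illustrative assumptions: a residence map tracks each cached object as dirty or clean, a read miss demand-fetches from the far tier, and write-back to far storage is deferred, e.g. until the resource manager schedules it.

```python
# Toy write-back cache with a residence map. Illustrative only.

class TieredCache:
    def __init__(self, far):
        self.far = far    # far tier: dict-like warm storage
        self.near = {}    # near tier: cached values
        self.state = {}   # residence map: key -> "clean" | "dirty"

    def read(self, key):
        if key not in self.near:       # miss: demand-fetch
            self.near[key] = self.far[key]
            self.state[key] = "clean"
        return self.near[key]

    def write(self, key, value):
        # Write-back: the far tier is stale until writeback() runs.
        self.near[key] = value
        self.state[key] = "dirty"

    def writeback(self):
        # E.g. driven by the system resource manager at job boundaries,
        # to rebalance and minimize interference with other workflows.
        for key, st in self.state.items():
            if st == "dirty":
                self.far[key] = self.near[key]
                self.state[key] = "clean"

far = {"a": 1}
cache = TieredCache(far)
cache.write("a", 2)
assert far["a"] == 1   # dirty data not yet written back
cache.writeback()
assert far["a"] == 2
```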

Top-Level I/O APIs
- POSIX containers
  - POSIX namespace over DAOS-HA objects; dynamically sharded directories & files
  - Private POSIX namespaces: a library for parallel applications and middleware targeting POSIX
  - System POSIX namespace: a parallel server exporting the shared namespace
- DAOS for application programmers
  - Simplified APIs; distributed persistent memory
- High-level HPC object databases
  - Complex application data types & metadata: HDF5 + derivatives / ADIOS / SciDB etc.
- Big Data
  - HDFS compatibility layer: Hadoop ecosystem, Spark / graph analytics etc.

NVRAM Storage Revolution
- Cost-effective storage & fabric integration
  - Challenge: extreme scale-out (Amdahl's law, fault tolerance)
  - Reward: storage at full fabric bandwidth, an O(1000) increase in data velocity; byte-granular access at µs latency
- Delivering the benefit to applications
  - Challenge: software overhead of conventional storage & communications stacks
  - Reward: ultra fine-grain access; remove constraints on applications; enable new programming models