Research Data Storage Infrastructure (RDSI) Project. DaSh Straw-Man


Recap from the Node Workshop (Cherry-picked)
* Higher-tiered DCs cost roughly twice as much as lower-tiered DCs.
* However, co-operating lower-tiered DCs can provide a robust, higher-tiered-like service:
  * with distributed and/or replicated mechanisms;
  * if a service (partially) fails, another DC can temporarily provide it;
  * if a DC fails, other DCs can provide its services temporarily.
* Loss of service is pardonable. Loss of data is unforgivable.
* Need to provide concrete assurances to the end user.

* What's DaSh all about?
  * Developing sufficient elements of potential technical architectures for data interoperability and sharing,
  * so that their use can be appropriately specified in the call-for-nodes proposal.
  * A mile-high view of the technical architectures for getting data into and out of the RDSI node(s).
* Ensure (meta)data durability and curation.
  * Loss of (meta)data is a capital offence.
* Ensure data scalability.
  * Storage capacity, and moving data into and out of node(s).
* Ensure end-user usability.
  * Provide a good end-user experience.
* The DaSh straw-man seeks community opinion on the various possible architectures.

Building blocks (diagram): re-exported file systems (NFS, CIFS, WebDAV, FUSE); HSM, tiers and storage classes; protocol negotiation (SRM); wide-area transfers (gsiftp, https, dcap, DPM, xrootd); REST/S3 interfaces to clouds and grids.

* irods and Federation
* Federation is a feature by which separate irods zones (irods instances) can be integrated.
  * When zones 'A' and 'B' are federated, they work together.
  * Each zone continues to be separately administered.
  * Users in the federated zones, if given permission, can access data and metadata in the other zones.
  * No user passwords are exchanged.
  * Zone admins set up trust relationships with other zones.
  * (A brief illustration follows below.)
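As a concrete illustration, the sketch below uses the python-irodsclient to list a collection that lives in a federated zone. The host names, zone names, credentials and paths are hypothetical, and it assumes the zone admins have already established the federation and granted the relevant ACLs.

```python
# Hypothetical example: a user registered in zone "nodeA" browsing data shared
# from the federated zone "nodeB". Hosts, zones, users and paths are placeholders.
from irods.session import iRODSSession

with iRODSSession(host="icat.nodea.example.org", port=1247,
                  user="alice", password="secret", zone="nodeA") as session:
    # Federated collections appear under the remote zone's name in the logical
    # namespace; access is governed by ACLs granted in nodeB.
    coll = session.collections.get("/nodeB/home/shared_project")
    for obj in coll.data_objects:
        print(obj.path, obj.size)
```

The point of the example is that the user authenticates only against their home zone; the remote zone simply appears as another branch of the logical namespace.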

ARCS Data Fabric (diagram): an ICAT-only server hosted on the NeCTAR NSP; irods servers at multiple sites, several backed by tape.

Node's Eye View (N=6). No federation.

Node's Eye View (N=6). Too much federation. Too much confusion!!

Node's Eye View (N=6). Just-right federation (diagram): one master ICAT, with a slave ICAT at each of the other nodes.

Distributed vs Federated (diagram): a distributed fault-tolerant parallel FS spanning the N=6 nodes, underneath the same building blocks as before: re-exported file systems (NFS, CIFS, WebDAV, FUSE); HSM, tiers and storage classes; protocol negotiation (SRM); wide-area transfers (gsiftp, https, dcap, DPM, xrootd); REST/S3 interfaces to clouds and grids.

Distributed Pros and Cons
* Distributed over a larger number of nodes.
  * Geographic scaling as well as node scaling.
  * Inherent data replication.
* Fault tolerant.
  * A storage brick takes a licking but the service keeps on ticking.
  * A node takes a licking but the service keeps on ticking.
* Parallel I/O.
  * All nodes can participate in moving data: high aggregate bandwidth.
* Single global namespace.
  * Rather than separate logical namespaces.
* Cost effective.
  * Use cheap hardware: big disks over fast disks.
  * Design to expect failures.

File Replication
* Whole file
  * Duplicated and stored on multiple bricks.
* Slices of a file
  * The file is sliced and diced, and the slices are stored on multiple bricks.
  * A single brick may not contain the whole file.
* Erasure codes
  * Parity blocks (as used in RAID).
  * Reed-Solomon: an over-sampled polynomial constructed from the data.
  * Add erasure codes and slice the file: any M of the N pieces recover the file (M < N).
  * A slice can also be stored on multiple bricks for extra redundancy.
  * (A minimal code sketch of the M-of-N idea follows below.)
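To make the M-of-N recovery idea concrete, here is a minimal Python sketch using single-parity (XOR) erasure coding, i.e. M = N - 1: any one missing slice can be rebuilt from the survivors. Real systems use Reed-Solomon codes, which allow arbitrary M < N; the function names and sizes here are purely illustrative.

```python
# Illustration only: single-parity (XOR) erasure coding, where N = m + 1 slices
# are produced and any m of them suffice to rebuild the file (M = N - 1).
# Production systems use Reed-Solomon codes for arbitrary M < N.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, m: int) -> list:
    """Split data into m equal data slices plus one XOR parity slice."""
    padded = data + b"\0" * ((-len(data)) % m)      # pad to a multiple of m
    size = len(padded) // m
    slices = [padded[i * size:(i + 1) * size] for i in range(m)]
    parity = slices[0]
    for s in slices[1:]:
        parity = xor_bytes(parity, s)
    return slices + [parity]                        # N pieces, one per brick

def decode(slices: list, m: int, orig_len: int) -> bytes:
    """Rebuild the file even if one slice (data or parity) is missing (None)."""
    missing = [i for i, s in enumerate(slices) if s is None]
    if missing:
        known = [s for s in slices if s is not None]
        rebuilt = known[0]
        for s in known[1:]:
            rebuilt = xor_bytes(rebuilt, s)         # XOR of survivors restores it
        slices = list(slices)
        slices[missing[0]] = rebuilt
    return b"".join(slices[:m])[:orig_len]

# Example: five bricks hold the pieces; losing brick 2 is survivable.
data = b"research data that must survive a brick failure"
pieces = encode(data, m=4)
pieces[2] = None                                    # simulate a failed brick
assert decode(pieces, m=4, orig_len=len(data)) == data
```

Single parity only tolerates one lost piece; adding more parity (or Reed-Solomon) raises the number of simultaneous brick failures the file can survive.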

SurfNET Survey of Wide Area Distributed Storage (circa 2010) [1/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
Requirements:
* Scalable.
  * Capacity, performance and concurrent access.
  * Storage expandable without degrading performance.
* High availability.
  * Keeps data available to apps and clients,
  * even in the event of a malfunction
  * or a system reconfiguration.
  * Needs to replicate data to multiple locations.

SurfNET Survey of Wide Area Distributed Storage (circa 2010) [2/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Durability.
  * No data is lost from a single software or hardware failure.
  * Automatically maintain a minimum number of replicas.
  * Support backup to tape.
* Performance at traditional SAN/NAS level.
  * Performance comparable to a traditional non-distributed SAN/NAS.
* Dynamic operation.
  * Availability, durability and performance configurable per application.
  * Reduces costs by not running at the highest support level all the time.
  * Allow users, apps and sysadmins to balance cost vs features.
  * The system should be self-configurable and self-tunable.
  * Support data movement between different storage technologies.
  * Tiered functionality: classes of storage.

SurfNET Survey of Wide Area Distributed Storage (circa 2010) [3/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Cost effective.
  * Must be possible to build, configure, run and maintain in a cost-effective manner.
  * Must work with commodity hardware,
  * which may not be as reliable as high-end hardware.
  * Configuration and maintenance of the system must be easy and straightforward.
  * Operation of the system is energy efficient.
  * Licence fees for software, where applicable, must be limited.
* Generic interfaces.
  * The system offers generic interfaces to apps and clients:
  * POSIX interface, POSIX/NFSv4.1 semantics;
  * block device (iSCSI, etc.).

SurfNET Survey of Wide Area Distributed Storage (circa 2010) [4/4] http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
* Protocols based on open standards.
  * System built using open protocols.
  * Reduces vendor lock-in.
  * More economical in the long run.
* Multi-party access.
  * The system must support access by multiple geographically dispersed parties at the same time.
  * Promotes collaboration between these parties.

SurfNET Survey of Wide Area Distributed Storage (circa 2010) http://www.surfnet.nl/nl/innovatieprogramma's/gigaport3/documents/eds-3r%20open-storage-scouting-v1.0.pdf
Candidates: Lustre, GlusterFS, GPFS, Ceph (+ dcache).
Non-candidates: XtreemFS, MogileFS, NFSv4.1 (pNFS), ZFS, VERITAS FS, Parascale, CAStor, Tahoe-LAFS, DRBD.

Nordic DataGrid Facility (dcache)

The DEISA Global File System at European Scale (Multi-Cluster General Parallel File System)

TeraGrid (GPFS & Lustre)

SurfNET Survey of Wide Area Distributed Storage + dcache
* Owner: Lustre: Oracle; GlusterFS: Gluster; GPFS: IBM; Ceph: Newdream; dcache: dcache.org.
* Licence: Lustre: GNU GPL; GlusterFS: GNU GPL; GPFS: commercial; Ceph: GNU GPL; dcache: DESY.
* Data primitive: Lustre: object (file); GlusterFS: object (file); GPFS: block; Ceph: object (file); dcache: object (file).
* Data placement: Lustre: round robin + free-space heuristics; GlusterFS: different strategies via modules; GPFS: distribute over storage servers; Ceph: placement groups, random mappings; dcache: policy based.
* Metadata: Lustre: max 2 metadata servers; GlusterFS: stored with the file; GPFS: unknown; Ceph: multiple metadata servers; dcache: pnfs (postgresql).
* Storage tiers: Lustre: pools of object targets; GlusterFS: policy based; GPFS: policy defined; Ceph: CRUSH rules; dcache: policy defined.

SurfNET Survey of Wide Area Distributed Storage + dcache (continued)
* Failure handling: Lustre: assuming reliable nodes; GlusterFS: assuming unreliable nodes; GPFS: assuming reliable nodes, failure groups; Ceph: assuming unreliable nodes; dcache: assuming reliable nodes.
* Replication: Lustre: server side (failover pairs); GlusterFS: client side; GPFS: server side; Ceph: server side; dcache: server side.
* WAN deployment example: Lustre: TeraGrid; GlusterFS: City Cloud (Swedish IaaS provider); GPFS: TeraGrid, DEISA; Ceph: unknown; dcache: Fermilab, Swegrid, NDGF.
* Client interface: Lustre: native client, FUSE, CIFS, NFS; GlusterFS: native client, FUSE; GPFS: native client, exports NFSv3, CIFS, pcifs, WebDAV, SRM (StoRM); Ceph: native client, FUSE; dcache: NFSv4.1, HTTP, WebDAV, GridFTP, xrootd, SRM, dcap.
* Node types: Lustre: clients, metadata, objects; GlusterFS: client, data; GPFS: client, data; Ceph: clients, metadata, objects; dcache: clients, metadata, objects.

WAN Data Caching and Performance: bringing data closer to where it is consumed.
* Researchers are naturally distributed over the city and country.
* Some may not benefit from the high-speed networks provided by AARNet and the NRN because of their location.
* Can RDSI help these spatially disenfranchised? Yes (sort of).
* Take the model of Content Delivery Networks (e.g. Akamai, Amazon CloudFront):
  * web content, videos, etc. are cached close to the end user.
* But focus on data caching rather than content caching.
* It may not provide the same experience as the spatially franchised enjoy,
  * but every bit helps!

WAN Data Caching with GPFS.

WAN Data Caching Continued
* dcache is a distributed cache system.
  * Locate a dcache pool close to the spatially disenfranchised.
  * A dcache admin can populate the required data collections to the spatially disenfranchised using standard SRM processes.
  * Potentially a (reasonably) fast parallel transfer.
* BioTorrents <http://www.biotorrents.net>
  * Allows scientists to rapidly share their results, datasets and software using the popular BitTorrent file-sharing technology.
  * All data is open access, and illegal file sharing is not allowed on BioTorrents.
  * Alternatively, the RDSI nodes could themselves provide BitTorrent seeders.
  * Ignoring the bad press, BitTorrent is very good at what it does.

Data Durability. Things that go bump in the night (or not!)
* Data durability is an absolute necessity.
* RDSI must provide a safe and enduring home for research data.
  * This may be more difficult than it appears!
* The enemy is physics.
  * The world is a complex quantum/probabilistic system,
  * and so is all of your computing and storage infrastructure.
* Random events in your infrastructure will create bit rot and silent corruptions.
* But you can engineer around the laws of physics.

Data Durability. Sources of Bit Rot and Silent Corruptions (diagram)
* Layers where corruption can arise: user space, VM, memory, filesystems, block layer, SCSI layer, low-level drivers, controller firmware, storage firmware, disk mechanics and the physical magnetic media, plus all interconnecting cables.
* Error types: ECC errors, corrupted metadata, corrupted data, inter-op issues, bugs in firmware, wear-out, flipped bits, latent sector errors, lost writes, torn writes, misdirected writes.
* External causes: cosmic rays/sun spots, EM radiation, etc.
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. Expected Background Bit Error Rate (BER)
* NIC/link/HBA: 10^-10 (1 bit in ~1.1 GB).
  * Check-summed; retransmit if necessary.
* Memory: 10^-12 (1 bit in ~116 GB).
  * ECC.
* SATA disk: 10^-14 (1 bit in ~11.3 TB).
  * Various error-correction codes.
* Enterprise disk: 10^-15 (1 bit in ~113 TB).
  * Various error-correction codes.
* Tape: 10^-18 (1 bit in ~1.11 PB).
  * Various error-correction codes.
* Data may be encoded up to five or more times as it travels between user space and the physical disk/tape.
* At petascale, incredibly infrequent events happen all the time.
From Silent Corruptions, Peter Kelemen, CERN.
(A back-of-the-envelope conversion from BER to data volume is sketched below.)
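The per-device figures follow from a simple conversion: the expected number of bits between errors is 1/BER. The small Python sketch below reproduces the ~1.1 GB, ~116 GB and ~11.3 TB values quoted for links, memory and SATA disks when binary units (GiB/TiB) are used; the device list is just an example.

```python
# Back-of-the-envelope: expected data volume between single bit errors for a
# given bit error rate (BER). Binary units (GiB) reproduce the slide's figures
# for links (~1.1 GB), memory (~116 GB) and SATA disks (~11.3 TB).

def gib_per_bit_error(ber: float) -> float:
    """Expected GiB transferred per uncorrected bit error."""
    bits_between_errors = 1.0 / ber
    return bits_between_errors / 8.0 / 2**30      # bits -> bytes -> GiB

for name, ber in [("NIC/link/HBA", 1e-10),
                  ("ECC memory", 1e-12),
                  ("SATA disk", 1e-14)]:
    print(f"{name:14s} BER {ber:.0e}: ~{gib_per_bit_error(ber):,.1f} GiB per bit error")
# NIC/link/HBA   BER 1e-10: ~1.2 GiB per bit error
# ECC memory     BER 1e-12: ~116.4 GiB per bit error
# SATA disk      BER 1e-14: ~11,641.5 GiB per bit error  (~11.4 TiB)
```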

Data Durability. The errors you know; the errors you don't know.
There are known errors; there are errors we know we know. We also know there are known unknown errors; that is to say, we know there are some things we do not know. But there are also unknown unknown errors; the ones we don't know we don't know. (Paraphrased from Donald Rumsfeld.)
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. The errors you know; the errors you don't know.
* There are data errors that you will know about:
  * log messages;
  * SMART messages;
  * detection: SW/HW level, with error messages;
  * correction: SW/HW level, with warnings;
  * if you're really lucky your kernel will panic, so you'll know something happened.
* There are data errors that you will never know about.
  * As far as your storage infrastructure is concerned, that write/read was executed perfectly.
  * In reality you will probably never know the data has been corrupted
  * (unless you design for this eventuality).
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. How to discover the unknown unknowns.
* Checksums (CRC32, MD5, SHA1, ...).
  * Checksum the (meta)data.
  * Transport the checksum with the (meta)data for later comparison.
* Error detection and correction codings.
  * Detect errors caused by noise, etc. (see checksums).
  * Correct detected errors and reconstruct the original, error-free data.
  * Backward error correction: automatic retransmit on error detection.
  * Forward error correction: encode extra redundant data and regenerate data from the forward error codes.
* Multiple copies with quorum.
From Silent Corruptions, Peter Kelemen, CERN.
(A minimal end-to-end checksumming sketch follows below.)
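As a minimal sketch of the end-to-end checksumming idea, the snippet below computes a digest when a file is stored, keeps it alongside the file, and verifies it on read. The ".sha1" sidecar naming convention is purely illustrative; real systems store checksums in their metadata catalogue.

```python
# Minimal end-to-end checksumming sketch: store a SHA-1 digest next to each
# file and verify it on every read. A mismatch means the data changed somewhere
# between the original write and this read, i.e. a (possibly silent) corruption.
import hashlib
import pathlib

def sha1_of(path: pathlib.Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha1()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def store_with_checksum(path: pathlib.Path) -> None:
    # Illustrative sidecar convention: "<file>.sha1" holds the expected digest.
    path.with_name(path.name + ".sha1").write_text(sha1_of(path))

def verify(path: pathlib.Path) -> bool:
    expected = path.with_name(path.name + ".sha1").read_text().strip()
    return sha1_of(path) == expected
```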

Data Durability. Silent Corruptions and CERN.
* Circa 2007: 9 PB tape, 4 PB disk, 6000 nodes, 20000 drives, 1200 RAID.
* Probabilistic storage integrity check (fsprobe) on 4000 nodes:
  * write a known bit pattern;
  * read it back;
  * compare and alert when a mismatch is found;
  * 6 cycles of about 1 hour each;
  * low I/O footprint for background operation on a 2 GB file;
  * keep complexity to a minimum (use static buffers);
  * attempt to preserve details about detected corruptions for further analysis.
From Silent Corruptions, Peter Kelemen, CERN.
(A toy probe in the same spirit is sketched below.)
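The sketch below is a toy probe in the spirit of fsprobe, not the actual CERN tool: it writes a known pattern, reads it back and reports mismatching offsets. The file path, size and cycle count are illustrative, and without O_DIRECT (or dropping the page cache) the reads mostly exercise memory and the filesystem rather than the disk.

```python
# Toy integrity probe inspired by CERN's fsprobe (illustrative only):
# write a known pattern, read it back, report any mismatching offsets.
# Caveat: unless the page cache is bypassed or dropped, reads are likely
# served from RAM, so this tests memory and the FS more than the disk.
import os

PATTERN = bytes(range(256)) * 4096          # 1 MiB repeating pattern
CYCLES = 6                                  # number of write/read/compare cycles

def probe(path: str, blocks: int = 64) -> list:
    """Write `blocks` copies of PATTERN, read them back, return bad offsets."""
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(PATTERN)
        f.flush()
        os.fsync(f.fileno())                # push data down the storage stack
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(len(PATTERN)) != PATTERN:
                bad.append(i * len(PATTERN))
    return bad

if __name__ == "__main__":
    for cycle in range(CYCLES):
        mismatches = probe("/tmp/probe-test.dat")
        if mismatches:
            print(f"cycle {cycle}: corruption at byte offsets {mismatches}")
```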

Data Durability. Silent Corruptions and CERN.
* 2000 incidents reported over 97 PB of traffic.
  * 6 per day observed on average!
  * ~192 MB of silently corrupted data.
* 320 nodes affected, across 27 hardware types.
* Multiple types of corruption.
* Some corruptions are transient.
* Overall BER, considering all the links in the chain: ~3x10^-7.
  * Not the 10^-12 to 10^-14 spec'd rates.
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. Types of silent corruption.
* Type I
  * Single/double bit-flip errors; usually persistent.
  * Usually bad memory (RAM, cache, etc.).
  * Happens with expensive ECC memory too.
* Type II
  * Small, 2^n-sized random chunks (128-512 bytes) of unknown origin.
  * Usually transient.
  * Possibly the OOM killer or a corrupted SLAB/SLUB allocator.
* Type III
  * Multiple large 64K chunks of old file data; I/O command timeouts.
  * Usually persistent.
* Type IV
  * Various-sized chunks of zeros.
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. What Can Be Done?
* Self-examining/healing hardware.
* WRITE-READ cycles before ACK.
* Check-summing, though not necessarily enough.
* End-to-end check-summing.
* Store multiple copies.
* Regular scrubbing of RAID arrays.
* Data refresh: re-read cycles on tapes.
* Generally, accept and prepare for corruptions.
From Silent Corruptions, Peter Kelemen, CERN.

Data Durability. The solutions: ZFS. The Good.
* Developed by Sun (now Oracle) on Solaris.
* Designed from the ground up with a focus on data integrity.
* Combined filesystem and logical volume manager.
* RAID-Z, RAID-Z2, RAID-Z3 or mirrored.
* Copy-on-write; transactional operation.
* Built-in end-to-end data integrity.
  * Data/metadata checksums all the way to the root.
  * Always consistent on disk: no fsck or journaling.
* Automatic self-healing.
* Intelligent online scrubbing and resilvering.
* Very large filesystem limits (max. 256 ZB per filesystem).
* Deduplication.
* Snapshots, and much, much more.

Data Durability. The solutions: ZFS. The Bad.
* Supported on Solaris only.
  * OpenSolaris is no more.
* Kernel ports for FreeBSD and NetBSD,
  * using OpenSolaris kernel source code.
* Linux port via ZFS-FUSE.
  * Kernel space good; user space not so good.
* ZFS on Linux.
  * Supported by Lawrence Livermore National Laboratory.
  * Issues with CDDL and GPL licence compatibility in the kernel.
  * The Solaris Portability Layer/shim to the rescue.
  * Currently v0.6.0-rc4: it worked for me, but it is not production grade yet.

Data Durability. The solutions: ZFS for Lustre.
* 1999: Peter Braam of CMU creates Lustre,
  * a GPL massively parallel distributed file system.
* 2003: Braam creates Cluster File Systems Inc. to continue the work.
* 2007: Sun acquires Cluster File Systems Inc.
  * Works to combine ZFS and Lustre:
  * a high-performance parallel FS with end-to-end data integrity,
  * but only supported on Solaris.
* 2009: LLNL starts porting the ZFS kernel code to Linux.
  * Oracle acquires Sun.
* 2010: Oracle announces ZFS/Lustre will be Solaris-only.
* 2011: LLNL starts the ZFS/Lustre port for Linux.
* Late 2011: LLNL plans a ZFS/Lustre filesystem:
  * 50 PB, 512 GB/s to 1 TB/s bandwidth.

Data Durability. The solutions: DataDirect Networks S2A technology.
* SATA storage with:
  * enterprise-class performance;
  * reliability and data integrity;
  * automatic self-healing;
  * anomaly detection that begins journaling all writes while recovery operations run.
* Dynamic MAID (D-MAID):
  * saves additional power and cooling by powering down the platters,
  * where over 80% of the power is consumed;
  * DC friendly.

Community Input Time.
* Are we barking up the right tree?
* Are we barking up the wrong tree?
* Is there even a tree in the first place?
* You decide.

Building Blocks
* Are the base building blocks sufficient?
  * If not, what should be added?
* Is there a need for additional data transfer protocols?
  * If so, what should be added?
* Is there a need for additional file system protocols?
  * If so, what should be added?
* What additional public cloud storage infrastructure should RDSI consider?
* What additional private cloud storage infrastructure should RDSI consider?

Federated vs Distributed.
* Should RDSI continue to embrace the federated irods model?
* Should RDSI embrace the distributed FS model?
* Should RDSI embrace both the federated and the distributed model?

Distributed Fault-Tolerant Parallel Filesystems.
* If RDSI chooses to use a distributed fault-tolerant parallel filesystem component, are there such systems that we have not yet considered?

WAN Data Caching
There are always going to be researchers who cannot benefit from the high-speed networks provided by AARNet and the NRN. WAN data caching may partially eliminate their disadvantage, but at a cost.
* Should RDSI consider the use of WAN data caches?
* If so, which sites would benefit from these data caches?

Data Durability.
Data durability is one of the foremost challenges for RDSI. However, it seems impossible to entirely eliminate the various issues of bit rot and silent corruption.
* Given this fact of nature, what level of data durability is the research community willing to accept?