GRAU DATA: Scalable Open Source Storage with CEPH, LIO and OPENARCHIVE
GRAU DATA: More than 20 years of experience in storage (timeline 1992-2009: Mainframe Tape Libraries, Open System Tape Libraries, Archive Systems, FMA software, HP OEM of FSE & FMA, ARCHIVEMANAGER, OPENARCHIVE)
OPENARCHIVE - Homepage
OPENARCHIVE: 2nd generation of archiving software. Start of development in 2000; 100 man-years of development. Scalable archive solution from 1 TB up to several PB. Common code base for Windows and Linux. HSM approach: data gets migrated to tapes or disks. Filesystem interface (NTFS, POSIX) simplifies integration. Support for all kinds of SCSI backend devices (FC, iSCSI). An API is only necessary for special purposes. More than 150 customers worldwide. References: Bundesarchiv, Charite, many DMS vendors.
OPENARCHIVE: internal architecture (diagram): CLI, GUI and API connect via HSMnet to the Event Manager and Management Interface of the HSMnet Server; CIFS and NFS client file systems access an HSMnet interface and Partition Manager; a Resource Manager, a Library Agent driving the tape library, client file system disks and Back End Agents form the back end.
Current limitations of OPENARCHIVE: The number of files and the performance are limited by the native filesystems (ext3/ext4, NTFS). Archiving and recalling many files in parallel is slow. Cluster file systems are not supported. Performance is not appropriate for HPC environments. Scalability might not be appropriate for cloud environments.
High performance SCSI target: LIO. LIO is the new standard SCSI target in Linux since 2.6.38, developed and supported by RisingTide Systems. Completely kernel-based SCSI target engine. Support for cluster systems (PR, ALUA). Fabric modules for: iSCSI (TCP/IP), FCoE, FC (QLogic), InfiniBand (SRP). Transparent block-level caching for SSD. High speed iSCSI initiator (MP for performance and HA). RTSadmin for easy administration.
LIO performance comparison (IOPS). Based on 2 processes: 25% read, 75% write. iSCSI numbers measured with the standard Debian client; the RTS iSCSI initiator plus kernel patches gives much higher results.
Next generation ClusterFS: CEPH. Scalable storage system: 1 to 1000s of nodes, gigabytes to exabytes. Reliable: no single points of failure, all data is replicated, self-healing. Self-managing: automatically (re)distributes stored data. Developed by Sage Weil (Dreamhost) based on his Ph.D. thesis. Upstream in Linux since 2.6.34 (client), 2.6.37 (RBD).
Ceph design goals. Avoid traditional system designs: single server bottlenecks, single points of failure, symmetric shared disk (SAN). Avoid manual workload partitioning: data sets and usage grow over time, and data migration is tedious. Avoid passive storage devices: storage servers have CPUs; use them.
Object storage. Objects: alphanumeric name, data blob (bytes to gigabytes), named attributes (foo=bar). Object pools: separate flat namespaces; a cluster of servers stores all objects. RADOS: Reliable Autonomic Distributed Object Store, the low-level storage infrastructure underneath librados, radosgw, RBD and the Ceph distributed file system.
Ceph data placement. Files are striped over objects (4 MB objects by default). Objects are mapped to placement groups (PGs): pgid = hash(object) & mask. PGs are mapped to sets of OSDs: crush(cluster, rule, pgid) = [osd2, osd3], with ~100 PGs per node. Pseudo-random, statistically uniform distribution: files -> objects -> PGs -> OSDs (grouped by failure domain). Fast: O(log n) calculation, no lookups. Reliable: replicas span failure domains. Stable: adding or removing OSDs moves few PGs. A toy placement sketch follows below.
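To make the two-step mapping concrete, here is a toy sketch in C of the placement described above. The hash function and the PG-to-OSD step are deliberate simplifications (Ceph uses an rjenkins hash and the CRUSH algorithm); object name, PG count and OSD count are made-up example values.
#include <stdio.h>
#include <stdint.h>

#define PG_NUM   128          /* power of two, so "& mask" works          */
#define REPLICAS 2
#define NUM_OSDS 6

static uint32_t hash_name(const char *name)   /* stand-in for Ceph's hash */
{
    uint32_t h = 2166136261u;                 /* FNV-1a, illustration only */
    for (; *name; name++)
        h = (h ^ (uint8_t)*name) * 16777619u;
    return h;
}

/* stand-in for crush(cluster, rule, pgid): here just consecutive OSDs */
static void place_pg(uint32_t pgid, int osds[REPLICAS])
{
    for (int i = 0; i < REPLICAS; i++)
        osds[i] = (int)((pgid + i) % NUM_OSDS);
}

int main(void)
{
    const char *object = "myfile.0000000000000001";    /* one 4 MB stripe object */
    uint32_t pgid = hash_name(object) & (PG_NUM - 1);  /* pgid = hash(object) & mask */
    int osds[REPLICAS];
    place_pg(pgid, osds);
    printf("object %s -> pg %u -> osd%d, osd%d\n",
           object, (unsigned)pgid, osds[0], osds[1]);
    return 0;
}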
Ceph storage servers. Ceph storage nodes (OSDs): the cosd object storage daemon provides the object interface on top of a btrfs volume of one or more disks. OSDs actively collaborate with their peers: they replicate data (n times, the admin can choose), consistently apply updates, detect node failures and migrate data.
It's all about object placement. OSDs act intelligently: everyone knows where objects are located. They coordinate writes with their replica peers and copy or migrate objects to the proper location. The OSD map completely specifies data placement: OSD cluster membership and state (up/down etc.) plus the CRUSH function mapping objects -> PGs -> OSDs. cosd will peer on startup or on a map change: it contacts the other replicas of the PGs it stores and ensures PG contents are in sync and stored on the correct nodes. The same robust process handles any map change: node failure, cluster expansion or contraction, change in replication level.
Why btrfs? Featureful: copy on write, snapshots, checksumming, multi-device support. Ceph leverages its internal transactions and snapshots (OSDs need consistency points for sane recovery) and hooks into the copy-on-write infrastructure to clone data content between files (objects). ext[34] can also work, but with inefficient snapshots and journaling.
Object storage interfaces: librados. Direct, parallel access to the entire OSD cluster; for PaaS and SaaS applications, when objects are more appropriate than files. C, C++, Python, Ruby, Java and PHP bindings.
rados_pool_t pool;
rados_connect(...);
rados_open_pool("mydata", &pool);
rados_write(pool, "foo", 0, buf1, buflen);
rados_read(pool, "bar", 0, buf2, buflen);
rados_exec(pool, "baz", class, method, inbuf, inlen, outbuf, outlen);
rados_snap_create(pool, "newsnap");
rados_set_snap(pool, "oldsnap");
rados_read(pool, "bar", 0, buf2, buflen); /* old! */
rados_close_pool(pool);
rados_deinitialize();
Object storage interfaces: radosgw. A proxy: no direct client access to the storage nodes. HTTP RESTful gateway speaking the S3 and Swift protocols (diagram: HTTP clients -> radosgw -> ceph cluster). A minimal client sketch follows below.
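Because the gateway is plain HTTP, any HTTP client can talk to it. A minimal sketch with libcurl fetching one object; the gateway host, bucket and object names are hypothetical, and S3 request signing is omitted (so this would only work for anonymously readable objects).
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    /* GET an object through the S3-compatible gateway (example URL) */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://radosgw.example.com/mybucket/myobject");
    CURLcode res = curl_easy_perform(curl);  /* body is written to stdout */
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}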
RBD: RADOS block device. A virtual disk image striped over objects. Reliable shared storage with centralized management; enables VM migration between hosts. Thinly provisioned: consumes disk only when the image is written to. Per-image snapshots. Layering (WIP): a copy-on-write overlay over an existing 'gold' image for fast creation or migration.
RBD: RADOS block device
Native Qemu/KVM (and libvirt) support:
$ qemu-img create -f rbd rbd:mypool/myimage 10G
$ qemu-system-x86_64 --drive format=rbd,file=rbd:mypool/myimage
Linux kernel driver (2.6.37+):
$ echo 1.2.3.4 name=admin mypool myimage > /sys/bus/rbd/add
$ mke2fs -j /dev/rbd0
$ mount /dev/rbd0 /mnt
Simple administration via CLI or librbd:
$ rbd create foo --size 20G
$ rbd list
foo
$ rbd snap create --snap=asdf foo
$ rbd resize foo --size=40g
$ rbd snap create --snap=qwer foo
$ rbd snap ls foo
2 asdf 20971520
3 qwer 41943040
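For programmatic administration, a minimal sketch using the librbd C library mentioned on the slide alongside the CLI. The call names follow the current librados/librbd C API (rados_create, rados_ioctx_create, rbd_create, ...), which may differ in detail from the version available at the time; pool and image names are examples.
#include <rados/librados.h>
#include <rbd/librbd.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    int order = 0;                       /* 0 = default object size */

    rados_create(&cluster, NULL);        /* connect as the default client   */
    rados_conf_read_file(cluster, NULL); /* read the default ceph.conf      */
    rados_connect(cluster);
    rados_ioctx_create(cluster, "mypool", &io);

    /* create a thinly provisioned 20 GB image, analogous to "rbd create" */
    rbd_create(io, "foo", 20ULL << 30, &order);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}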
Object classes. Start with basic object methods: {read, write, zero} extent; truncate; {get, set, remove} attribute; delete. Dynamically loadable object classes implement new methods based on the existing ones, e.g. calculate a SHA1 hash, rotate an image, invert a matrix. This moves computation to the data and avoids a read/modify/write cycle over the network; e.g., the MDS uses simple key/value methods to update objects containing directory content. A sketch of invoking such a method is shown below.
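A hypothetical illustration of calling such a method from a client, reusing the rados_exec() call from the librados example above; the class name "hashclass" and the method "sha1" are made up for the sketch.
/* Run a loaded object class method on the OSD that stores object "baz";
 * the digest comes back in the output buffer, the object data itself
 * never crosses the network. */
char digest[20];
rados_exec(pool, "baz", "hashclass", "sha1",
           NULL, 0,                 /* no input payload             */
           digest, sizeof(digest)); /* output buffer for the result */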
POSIX filesystem. Creates a file system hierarchy on top of objects. A cluster of cmds daemons with no local storage: all metadata is stored in objects. Lots of RAM: the MDS cluster functions as a large, distributed, coherent cache arbitrating file system access. Dynamic cluster: new daemons can be started up dynamically and are automagically load balanced.
POSIX example (client and MDS cluster):
fd = open("/foo/bar", O_RDONLY)
  Client: requests open from MDS
  MDS: reads directory /foo from the object store
  MDS: issues capability for file content
read(fd, buf, 1024)
  Client: reads data from the object store
close(fd)
  Client: relinquishes capability to MDS
The MDS stays out of the I/O path: object locations are well known, calculated from the object name. See the sketch below.
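What the client runs is plain POSIX; a minimal sketch of the open/read/close sequence above against a mounted Ceph file system (the path /mnt/ceph/foo/bar is just an example).
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[1024];
    int fd = open("/mnt/ceph/foo/bar", O_RDONLY); /* open: client asks the MDS, gets a read capability */
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = read(fd, buf, sizeof(buf));       /* read: data comes straight from the OSDs            */
    printf("read %zd bytes\n", n);
    close(fd);                                    /* close: capability is returned to the MDS           */
    return 0;
}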
Dynamic subtree partitioning (diagram: the root hierarchy is partitioned across MDS 0 through MDS 4; a busy directory is fragmented across many MDSs). Scalable: arbitrarily partition metadata across 10s to 100s of nodes. Adaptive: move work from busy to idle servers and replicate popular metadata on multiple nodes.
Workload adaptation. Extreme shifts in workload result in redistribution of metadata across the cluster; metadata initially managed by mds0 is migrated to other nodes (diagrams: workload spread over many directories vs. within the same directory).
Metadata scaling: up to 128 MDS nodes and 250,000 metadata ops/second; I/O rates of potentially many terabytes/second; file systems containing many petabytes of data.
Recursive accounting
Subtree-based usage accounting: recursive file, directory and byte counts, mtime
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
Fine-grained snapshots
Snapshot arbitrary directory subtrees; volume or subvolume granularity is cumbersome at petabyte scale
Simple interface:
$ mkdir foo/.snap/one      # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776         # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one      # remove snapshot
Efficient storage: leverages copy-on-write at the storage layer (btrfs)
POSIX filesystem client. POSIX with strong consistency semantics: processes on different hosts interact as if on the same host; the client maintains consistent data/metadata caches.
Linux kernel client:
# modprobe ceph
# mount -t ceph 10.3.14.95:/ /mnt/ceph
# df -h /mnt/ceph
Filesystem    Size  Used Avail Use% Mounted on
10.3.14.95:/   95T   29T   66T  31% /mnt/ceph
Userspace clients: cfuse (FUSE-based client), the libceph library (ceph_open(), etc.), and Hadoop/Hypertable client modules (libceph). A userspace sketch follows below.
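A minimal sketch of using the userspace library instead of the kernel mount. The call names follow today's libcephfs C API (ceph_create, ceph_mount, ceph_open, ...), which may differ in detail from the libceph version referenced on the slide; the path is an example.
#include <cephfs/libcephfs.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct ceph_mount_info *cmount;
    char buf[1024];

    ceph_create(&cmount, NULL);         /* default client id                 */
    ceph_conf_read_file(cmount, NULL);  /* read the default ceph.conf        */
    ceph_mount(cmount, "/");            /* mount the root of the file system */

    int fd = ceph_open(cmount, "/foo/bar", O_RDONLY, 0);
    int n = ceph_read(cmount, fd, buf, sizeof(buf), 0);  /* read from offset 0 */
    printf("read %d bytes\n", n);
    ceph_close(cmount, fd);

    ceph_unmount(cmount);
    ceph_shutdown(cmount);
    return 0;
}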
Integration of CEPH, LIO and OpenArchive. CEPH implements complete POSIX semantics; OpenArchive plugs into the VFS layer of Linux. An integrated setup of CEPH, LIO and OpenArchive provides: a scalable cluster filesystem, a scalable HSM for cloud and HPC setups, and scalable, highly available SAN storage (iSCSI, FC, IB).
I have a dream... (diagram of the combined stack: OpenArchive archives the CEPH filesystem to tape and disk; CEPH MDS and OSD nodes provide the cluster filesystem; LIO targets export RBD volumes at block level to SAN initiators).
Contact Thomas Uhl Cell: +49 170 7917711 thomas.uhl@graudata.com tlu@risingtidesystems.com www.twitter.com/tuhl de.wikipedia.org/wiki/thomas_uhl