Cooling Down Ceph. Exploration and Evaluation of Cold Storage Techniques




Outline: Cold Storage; Ceph; Cooling Down Ceph: Object Stubs and Striper Prefix Hashing

Facebook photos [1]: 82% of reads go to 8% of the stored data. Scientific data system [2]: 50% of reads go to 5% of the stored data.
[1] T. P. Morgan. Facebook Rolls Out New Web and Database Server Designs. http://www.theregister.co.uk/2013/01/17/open_compute_facebook_servers/, 2013
[2] M. Grawinkel, L. Nagel, M. Mäsker, F. Padua, A. Brinkmann, and L. Sorth. Analysis of the ECMWF Storage Landscape. Proc. of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015

Cold storage is an operational mode or method of operation of a data storage device or system for inactive data where an explicit trade-off is made, resulting in data retrieval response times beyond what may be considered normally acceptable to online or production applications, in order to achieve significant capital and operational savings.
IDC Technology Assessment: Cold Storage Is Hot Again - Finding the Frost Point (2013)

Distributed storage system: no single point of failure, horizontal scaling, runs on commodity hardware.


[Diagram: a Ceph cluster of OSDs and monitors (MONs), accessed by a Client]

Monitor (MON): keeps the Cluster Map; distributed consensus; not in the data path.

Client: computes placement based on the Cluster Map; directly accesses OSDs and MONs.

OSD: stores objects; manages replication and placement.

[Diagram: a Pool made up of placement groups (PGs)]

Pool: Buckets (Rack, Server, Disk, OSD); Pool Type (Replicated or Erasure Coded); Placement Groups.

Placement Groups (PGs): abstraction for placement computation; roughly 100 PGs per OSD.

Object: Name, Data (4 MiB), Object Map; placed into a PG within a Pool.

OSD (Object Storage Device): roughly one per disk; handles replication and erasure coding.

Placement: an object is hashed onto a PG; CRUSH then maps the PG onto OSDs using the Cluster Map.
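
The two-step mapping can be sketched in a few lines. This is only a rough illustration, not Ceph code: std::hash stands in for the RJenkins string hash, a seeded pseudo-random pick stands in for CRUSH, and the pool and cluster sizes are made-up toy values.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct PlacementSketch {
    uint32_t pg_num;     // placement groups in the pool
    uint32_t num_osds;   // OSDs in the cluster
    uint32_t replicas;   // copies per object

    // Step 1: hash the object name onto a placement group.
    uint32_t object_to_pg(const std::string& name) const {
        uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(name));
        return h % pg_num;
    }

    // Step 2: map the PG onto OSDs. Real CRUSH walks the cluster map and
    // honours failure domains; a deterministic pseudo-random sequence
    // stands in for it here.
    std::vector<uint32_t> pg_to_osds(uint32_t pg) const {
        std::vector<uint32_t> osds;
        uint32_t x = pg;
        while (osds.size() < replicas) {
            x = x * 1664525u + 1013904223u;          // LCG step
            uint32_t osd = x % num_osds;
            if (std::find(osds.begin(), osds.end(), osd) == osds.end())
                osds.push_back(osd);                 // never pick the same OSD twice
        }
        return osds;
    }
};

int main() {
    PlacementSketch sketch{64, 12, 3};               // toy pool and cluster sizes
    uint32_t pg = sketch.object_to_pg("10000000abc.00000001");
    std::cout << "object -> pg " << pg << " -> osds";
    for (uint32_t osd : sketch.pg_to_osds(pg)) std::cout << " " << osd;
    std::cout << "\n";
}

Because a client can evaluate this mapping locally from the Cluster Map, no central lookup service sits in the data path.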

Cooling Down Ceph

Ceph's Cold Storage Features: Cache Tiering; Erasure Coding; Metadata-aware Clients; Semantic Pool Selection; Metadata for Later Placement; Striper Prefix Hashing; Extra Placement Information; Redirection and Stubbing; Object Redirects; Object Stubs; Object Store Backend to Archive System; Journal Cache.


Object Stubs [Diagram: objects replaced by stubs that point into an external Archive System]

Implementation: implemented as part of the OSD, transparent to the clients; new RADOS operations Stub and Unstub; implicit unstub.

New RADOS Ops: Stub, Unstub. [Diagram: stubbing moves the object data out to a stub server and sets an xattr holding its URL; unstubbing fetches the data back in and removes the xattr.]

Implementation: Stub, Unstub
STUB: Read Object -> HTTP PUT -> TRUNCATE 0 -> SETXATTR "stub_uri" = url
UNSTUB: GETXATTR "stub_uri" -> HTTP GET url -> WRITEFULL object -> RMXATTR "stub_uri" -> HTTP DELETE url
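
A toy, in-memory walk-through of these two sequences. Nothing here comes from the actual patch: the two maps stand in for the OSD's object store and for the HTTP archive, and all names (ToyObject, osd_store, archive, stub, unstub) are invented for illustration.

#include <cassert>
#include <map>
#include <string>

struct ToyObject {
    std::string data;                               // object payload
    std::map<std::string, std::string> xattrs;      // extended attributes
};

std::map<std::string, ToyObject> osd_store;         // stands in for the OSD's object store
std::map<std::string, std::string> archive;         // stands in for the HTTP archive

// STUB: Read Object -> "HTTP PUT" -> TRUNCATE 0 -> SETXATTR "stub_uri"
void stub(const std::string& oid, const std::string& url) {
    ToyObject& o = osd_store[oid];
    archive[url] = o.data;            // push the data to the archive
    o.data.clear();                   // drop the local copy
    o.xattrs["stub_uri"] = url;       // remember where the data went
}

// UNSTUB: GETXATTR "stub_uri" -> "HTTP GET" -> WRITEFULL -> RMXATTR -> "HTTP DELETE"
void unstub(const std::string& oid) {
    ToyObject& o = osd_store[oid];
    std::string url = o.xattrs.at("stub_uri");
    o.data = archive.at(url);         // fetch the data back
    o.xattrs.erase("stub_uri");       // the object is a regular object again
    archive.erase(url);               // remove the archived copy
}

int main() {
    osd_store["obj.00000000"] = {"payload", {}};
    stub("obj.00000000", "http://archive.example/obj.00000000");
    assert(osd_store["obj.00000000"].data.empty());
    unstub("obj.00000000");
    assert(osd_store["obj.00000000"].data == "payload");
}

The implicit unstub on the next slide simply means the OSD runs the unstub sequence itself before executing any operation that needs the object data.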

Implicit Unstub: Table B.1 (appendix) classifies all RADOS operations by whether they need the object data. Data-touching operations such as read, sync_read, sparse-read, mapext, write, writefull, append, truncate, zero, clonerange, copy-from, copy-get, the cache-flush/evict family and scrub trigger an implicit unstub on a stubbed object; metadata-only operations such as stat, getxattr(s), setxattr(s), rmxattr, cmpxattr, the omap operations, watch, notify, list-snaps and list-watchers do not.

Benefits: supports links to external storage systems; stubbed snapshots => backup.

Striper Prefix Hashing [Diagram: files striped into objects]

Implementation: the placement code converts an object ID (object_t oid) and an object locator (object_locator_t loc, which contains the pool ID) into a placement group (pg_t pg). The full source code of the implementation, including the simulator, is published on GitHub. Listing C.1 shows the original hash_key implementation; Listing C.2 shows the implementation with Striper Prefix Hashing.

Listing C.1: Original hash_key implementation

uint32_t pg_pool_t::hash_key(const string& key,
                             const string& ns) const
{
  string n = make_hash_str(key, ns);
  return ceph_str_hash(object_hash, n.c_str(), n.length());
}

The changed method cuts off the object's name inkey at the first dot found in the string, which happens to be the sequence number part for object names generated by the Ceph Striper.

Listing C.2: Changed hash_key implementation

uint32_t pg_pool_t::hash_key(const string& inkey,
                             const string& ns) const
{
  string key(inkey);
  if (flags & FLAG_HASHPSONLYPREFIX) {
    string::size_type n = inkey.find(".");
    if (n != string::npos) {
      key = inkey.substr(0, n);
    }
  }
  string n = make_hash_str(key, ns);
  return ceph_str_hash(object_hash, n.c_str(), n.length());
}
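
To make the effect visible, here is a small stand-alone demonstration (not Ceph code): std::hash replaces ceph_str_hash, a plain modulo replaces Ceph's mapping onto pg_num, and the object names are made up in the Striper's "<prefix>.<sequence number>" form. With the prefix cut applied, every stripe object of one file hashes to the same PG; without it, the stripes scatter.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Simplified stand-in for hash_key: optionally cut the name at the first dot.
uint32_t toy_hash_key(const std::string& inkey, bool prefix_only, uint32_t pg_num) {
    std::string key = inkey;
    if (prefix_only) {
        std::string::size_type n = inkey.find('.');
        if (n != std::string::npos)
            key = inkey.substr(0, n);
    }
    return static_cast<uint32_t>(std::hash<std::string>{}(key)) % pg_num;
}

int main() {
    const uint32_t pg_num = 64;                      // toy pool size
    for (int seq = 0; seq < 4; ++seq) {
        char name[64];
        std::snprintf(name, sizeof(name), "10000000abc.%08x", seq);
        std::printf("%s  default pg %2u  prefix pg %2u\n", name,
                    (unsigned)toy_hash_key(name, false, pg_num),
                    (unsigned)toy_hash_key(name, true, pg_num));
    }
}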

Methodology: ECMWF ECFS HPSS dump [1] (137 million files, 14.8 PiB); 10% random sample; simulator.
[1] M. Grawinkel, L. Nagel, M. Mäsker, F. Padua, A. Brinkmann, and L. Sorth. Analysis of the ECMWF Storage Landscape. Proc. of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015

Workload: ECFS HPSS 10% random sample.

Table 5.1: ECFS HPSS Dump Properties
  Total #files:              14,950,112
  Total used capacity:       1.595 PiB
  Max file size:             32 GiB
  Size of objects ≤ 4 MiB:   5.495 TiB
  Size of objects > 4 MiB:   1.590 PiB

[Figure 5.2: Histogram of the file size distribution (#files per bucket; buckets are inclusive at the upper bound, e.g. 0 < x ≤ 4 MiB). Roughly 47% of all files fall into the 0-4 MiB bucket.]

Simulator: 600 OSDs, 38,400 PGs; a full run takes 45 minutes on 32 cores.

Table 5.2: Benchmark settings
  #  Hash Algorithm  Prefix Hash Enabled?
  1  RJenkins        No
  2  Linux           No
  3  RJenkins        Yes
  4  Linux           Yes

Simulating Ceph's placement decisions by running a whole Ceph cluster with CephFS was quickly found to be infeasible in terms of runtime. For this test, OSD, MON and MDS components ran on a single machine; the OSDs ran on top of a FUSE overlay filesystem called blackholefs, specially designed to intercept writes. A special simulator running only the bare minimum of Ceph ...

Distinct OSDs per File, or: does it work? Table 5.3 shows statistics for the number of distinct primary OSDs of all files in the workload that are larger than 4 MiB, because the number of OSDs per file is always 1 for files with only one object. (Appendix A contains more detailed data and the values for the plots shown here.)

Table 5.3: Distinct primary OSDs per file for files larger than 4 MiB
  Statistic  RJenkins  Linux dcache  RJenkins+Prefix  Linux+Prefix
  Min.       1         1             1                1
  Q1         3         3             1                1
  Median     9         9             1                1
  Q3         35        35            1                1
  Max.       600       600           1                1

As evident from the table, the Striper Prefix Hash runs result in 1 OSD per file.

Balance: in an ideal, balanced Ceph cluster all OSDs store the same amount of data and objects. Figures 5.4 and 5.5 illustrate the number of objects per primary OSD and the size of data per primary OSD, respectively. Both measures are similar, but the size plot also takes objects not filled to the maximum into account.

[Figure 5.4: Number of objects per OSD. Figure 5.5: Combined object size per OSD in bytes. Both shown for the RJenkins, Linux dcache, RJenkins+Prefix and Linux+Prefix runs.]

Recap: Ceph; Cooling Down Ceph; Object Stubs and Striper Prefix Hashing: implementation and evaluation. [Diagrams: stubbed objects pointing to an Archive System; a file striped into objects.]

Cooling Down Ceph Exploration and Evaluation of Cold Storage Techniques

Bonus Slides

Striper: a file, block device image, etc. is split into blocks (stripe units), which are grouped into stripes and distributed round-robin over the objects of an object set. [Diagram: blocks 0-8 mapped via stripes 0-4 onto objects 0-3 of an object set.]
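
The block-to-object mapping the diagram illustrates can be written out as a short arithmetic sketch. The layout values below (1 MiB stripe unit, 2 objects per object set, 3 MiB objects) are illustrative assumptions only, not values from the talk.

#include <cstdint>
#include <cstdio>

struct StripeLayout {
    uint64_t stripe_unit;    // bytes per stripe unit ("block")
    uint64_t stripe_count;   // objects per object set
    uint64_t object_size;    // bytes per object (a multiple of stripe_unit)
};

// Compute which object a file offset lands in, and where inside that object.
void locate(const StripeLayout& l, uint64_t off) {
    uint64_t su_per_object = l.object_size / l.stripe_unit;
    uint64_t blockno   = off / l.stripe_unit;            // which stripe unit overall
    uint64_t stripepos = blockno % l.stripe_count;       // column inside the object set
    uint64_t stripeno  = blockno / l.stripe_count;       // which stripe (row)
    uint64_t objectset = stripeno / su_per_object;       // object sets fill up in order
    uint64_t objectno  = objectset * l.stripe_count + stripepos;
    uint64_t obj_off   = (stripeno % su_per_object) * l.stripe_unit
                       + off % l.stripe_unit;
    std::printf("block %llu -> object %llu, offset %llu\n",
                (unsigned long long)blockno,
                (unsigned long long)objectno,
                (unsigned long long)obj_off);
}

int main() {
    StripeLayout l{1 << 20, 2, 3 << 20};   // 1 MiB units, 2 objects per set, 3 MiB objects
    for (uint64_t block = 0; block < 9; ++block)
        locate(l, block * l.stripe_unit);
}

The resulting objects are named "<prefix>.<object number>", which is exactly the name pattern the Striper Prefix Hashing from the main part keys on.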