Replication and Consistency in Cloud File Systems




Replication and Consistency in Cloud File Systems
Alexander Reinefeld and Florian Schintke, Zuse Institute Berlin
Cloud Computing Day at the IKMZ, BTU Cottbus, 14 April 2011

Let's start with a little quiz
Who invented Cloud Computing? a) Werner Vogels, b) Ian Foster, c) Konrad Zuse. The correct answer is c): "Schließlich werden auch Rechenzentren über Fernmeldeleitungen miteinander vernetzt werden" ("Eventually, computing centers too will be interconnected via telecommunication lines"), Konrad Zuse in "Rechnender Raum" (1969). Konrad Zuse, 22.06.1910 - 18.12.1995.

Zuse Institute Berlin
Research institute for applied mathematics and computer science. Peter Deuflhard, chair for scientific computing, FU Berlin; Martin Grötschel, chair for discrete mathematics, TU Berlin; Alexander Reinefeld, chair for computer science, HU Berlin.

HPC Systems @ ZIB
1984: Cray 1M, 160 MFlops
1987: Cray X-MP, 471 MFlops
1994: Cray T3D, 38 GFlops
1997: Cray T3E, 486 GFlops
2002: IBM p690, 2.5 TFlops
2008/09: SGI ICE, XE, 150 TFlops
A 1,000,000-fold performance increase in 25 years (1984 to 2009).

HLRN: 2 sites, 98 computer racks, 26,112 CPU cores, 128 TB memory, 1,620 TB disk, 300 TFlops peak performance.

Storage: 3 SL8500 robots, 39 tape drives, 19,000 slots.

What is Cloud Computing?
Cloud Computing = Grid Computing on datacenters? It is not that simple. Cloud and Grid both abstract resources through interfaces. Grid: via new middleware; requires Grid APIs. Cloud: via virtualization; allows legacy APIs.
Software as a Service (SaaS): applications, application services. Platform as a Service (PaaS): programming environment, execution environment. Infrastructure as a Service (IaaS): infrastructure services, resource set.

Why Cloud?
Pros: It scales, because they are their resources, not yours. It is simple, because they operate it. Pay for what you need; don't pay for empty spinning disks.
Cons: It is expensive: Amazon S3 charges $0.15/GB/month, i.e. $0.15 × 1000 GB × 12 months = $1800/TB/year. It is not 100% secure: S3 now allows you to bring your own RSA key pair, but would you put your bank account into the cloud? It is not 100% available: S3 provides service credits if availability drops (10% for 99.0-99.9% availability).

File System Landscape
PC, local system: ext3, ZFS, NTFS. Network FS / centralized: NFS, SMB, AFS/Coda. Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, ... Cloud/Grid: grid file systems such as GFarm, GDM, "gridftp".

Consistency, Availability, Partition tolerance: pick two of three!
Consistency: all clients have the same view of the data. Availability: each client can always read and write. Partition tolerance: operations will complete, even if individual components are unavailable.
C + A: single server, Linux HA (one data center). A + P: Amazon S3, Mercurial, Coda/AFS. C + P: distributed databases, distributed file systems.
Brewer, Eric: Towards Robust Distributed Systems. PODC Keynote, 2004.

Which semantics do you expect?
Distributed file systems should provide C + P. But the recent hype has been about A + P plus eventual consistency (e.g. Amazon S3).

Grid File Systems
They provide access to heterogeneous storage resources, but the middleware causes additional complexity and vulnerability. They require explicit file transfer; for whole files this means latency to first access, bandwidth, and disk storage, though partial file access (gridftp) and pattern access (FALLS) exist. There is no consistency among replicas, so the user must take care; and there is no access control on replicas.

Cloud File System: XtreemFS
Focus: data distribution, data replication, object-based storage. Key features: MRCs are separated from OSDs, and a fat client is the link. MRC = metadata and replica catalogue; OSD = object storage device; Client = file system interface.

A closer look at XtreemFS
Features: a distributed, replicated, POSIX-compliant file system. The server software (Java) runs on Linux, OS X, and Solaris; the client software (C++) runs on Linux, OS X, and Windows. Secure: X.509 and SSL. Open source (GPL).
Assumptions: synchronous clocks with a maximum time drift (needed for OSD lease negotiation; a reasonable assumption in clouds); an upper limit on round-trip time; no need for FIFO channels (runs on either TCP or UDP).

XtreemFS Interfaces (interface diagram)

File access protocol
The user application (Linux VFS) talks to the XtreemFS client (FUSE), which talks to the MRC and the OSDs. Example from the slide: after a write, the client sends Update(Cap, FileSize=128k) and the MRC records FileSize = 128k.
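
The message exchange on the slide suggests the following shape for the size-update step. A minimal Java sketch; all names (Capability, MrcStub, updateFileSize) are hypothetical illustrations, not the actual XtreemFS API:

    // Minimal sketch of the size-update step in the file access protocol.
    // Capability, MrcStub, updateFileSize are hypothetical names, not the
    // actual XtreemFS API.
    public class SizeUpdateSketch {

        // A capability: a token signed by the MRC that authorizes access to one file.
        record Capability(String fileId, String signature) {}

        // Stand-in for the RPC stub that talks to the MRC.
        interface MrcStub {
            void updateFileSize(Capability cap, long newSize);
        }

        public static void main(String[] args) {
            Capability cap = new Capability("volume/file42", "sig-by-mrc");
            MrcStub mrc = (c, size) ->
                System.out.println("MRC records size " + size + " for " + c.fileId());

            // After writing past the old end of file, the client (not the OSD)
            // reports the new size to the MRC, e.g. 128 KiB:
            mrc.updateFileSize(cap, 128 * 1024);
        }
    }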

Client
The client gets the list of OSDs from the MRC and a capability (signed by the MRC) per file, then selects the best OSD(s) for parallel I/O. Various striping policies: scatter/gather, RAIDx, erasure codes. This gives scalable and fast access: no communication between OSD and MRC is needed; the client is the missing link.
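
To make the striping idea concrete, here is a minimal RAID0-style sketch in Java of how a byte offset maps to an OSD and object number; the chunk size and round-robin placement are illustrative assumptions, not the exact XtreemFS policy:

    // Sketch of RAID0-style striping: map a byte offset to (OSD, object number).
    import java.util.List;

    public class StripingSketch {
        static final int CHUNK = 1 << 20; // 1 MiB stripe unit, as in the benchmark slide

        record Location(int osdIndex, long objectNumber, int offsetInObject) {}

        static Location locate(long byteOffset, int numOsds) {
            long object = byteOffset / CHUNK;          // global object number
            int osd = (int) (object % numOsds);        // round-robin over OSDs
            return new Location(osd, object, (int) (byteOffset % CHUNK));
        }

        public static void main(String[] args) {
            List<String> osds = List.of("osd0", "osd1", "osd2");
            Location loc = locate(3L * CHUNK + 4096, osds.size());
            System.out.println("offset 3 MiB + 4 KiB -> " + osds.get(loc.osdIndex())
                + ", object " + loc.objectNumber() + ", offset " + loc.offsetInObject());
        }
    }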

MRC: Metadata and Replica Catalogue
Provides open(), close(), readdir(), rename(), ... Per file it stores attributes: size, last access, access rights, location (OSDs). It issues a capability (file handle) to authorize a client to access objects on OSDs. Implemented with a key/value store (BabuDB): fast index, append-only DB, allows snapshots.
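
As an illustration of file metadata in a key/value store, the following sketch uses a sorted map in place of BabuDB's index; the key layout ("path:attribute") is an assumption for illustration only:

    // Sketch of file metadata kept as key/value pairs, as the MRC does with BabuDB.
    import java.util.TreeMap;

    public class MetadataStoreSketch {
        public static void main(String[] args) {
            // A sorted map stands in for BabuDB's index; sorted keys make
            // prefix scans (e.g. for readdir) cheap.
            TreeMap<String, String> db = new TreeMap<>();
            db.put("vol1/home/alice:size", "131072");
            db.put("vol1/home/alice:mode", "rw-r--r--");
            db.put("vol1/home/alice:osds", "osd0,osd1");

            // Prefix scan over one file's attributes (';' sorts just after ':'):
            db.subMap("vol1/home/alice:", "vol1/home/alice;")
              .forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }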

OSD: Object Storage Device
Serves file content operations: read(), write(), truncate(), flush(), ... It implements object replication, including partial replicas for read access, where data is filled on demand; it gets the OSD list from the MRC. A slave OSD redirects to the master OSD: write operations run only on the master, and since POSIX requires linearizable reads, reads are also redirected.
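
A minimal sketch of the redirect behavior, assuming the master is known via the lease service; the types (Osd, Reply, Redirect) are hypothetical, not the XtreemFS code:

    // Sketch of master/slave request handling on an OSD: only the current
    // master (lease holder) serves reads and writes; slaves redirect.
    public class ReplicaRedirectSketch {

        sealed interface Reply permits Data, Redirect {}
        record Data(byte[] bytes) implements Reply {}
        record Redirect(String masterAddress) implements Reply {}

        static class Osd {
            final String address;
            String currentMaster; // learned via the lease service

            Osd(String address, String master) {
                this.address = address;
                this.currentMaster = master;
            }

            // POSIX linearizability: even reads must go to the master replica.
            Reply read(long objectNumber) {
                if (!address.equals(currentMaster)) {
                    return new Redirect(currentMaster);
                }
                return new Data(new byte[0]); // would read the object locally
            }
        }

        public static void main(String[] args) {
            Osd slave = new Osd("osd1", "osd0");
            System.out.println(slave.read(7)); // Redirect[masterAddress=osd0]
        }
    }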

OSD: Object Storage Device
Which OSD to select? Criteria include the object list, bandwidth, rarest-first, network coordinates, a datacenter map, and prefetching (for partial replicas).

OSD: Object Storage Device
Implements concurrency control for replica consistency: POSIX-compliant master/slave replication with failover. The group membership service is provided by the MRC. The lease service, Flease, is distributed, scalable, and failure tolerant: 50,000 leases/sec with 30 OSDs, based on quorum consensus (Paxos).
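
One way to see why the bounded-clock-drift assumption from the earlier slide matters for leases: the holder must stop using a lease early, and others must wait past its end, by the drift bound, so no two OSDs ever believe they hold the same lease. A sketch under that assumption; the constant and structure are illustrative, not Flease's actual implementation:

    // Sketch of a drift-aware lease validity check.
    public class LeaseSketch {
        static final long EPSILON_MS = 500; // assumed maximum clock drift

        record Lease(String holder, long expiresAtMs) {
            // The holder checks conservatively, subtracting the drift bound ...
            boolean validForHolder(long nowMs) {
                return nowMs < expiresAtMs - EPSILON_MS;
            }
            // ... while others must wait out the full drift bound before
            // trying to acquire the lease for themselves.
            boolean surelyExpired(long nowMs) {
                return nowMs > expiresAtMs + EPSILON_MS;
            }
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            Lease lease = new Lease("osd0", now + 10_000);
            System.out.println(lease.validForHolder(now)); // true
            System.out.println(lease.surelyExpired(now));  // false
        }
    }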

Quorum consensus
Basic algorithm: when a majority is informed, every other majority has at least one member with up-to-date information; a minority may crash at any time.
Paxos consensus, step 1: check whether a consensus c was already established. Step 2: re-establish c, or try to establish one's own proposal x.
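
Why any two majorities intersect, as a one-line calculation (the standard quorum argument, stated here for completeness): with N replicas, a majority has at least \lfloor N/2 \rfloor + 1 members, so

    |Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2| \;\ge\; 2\left(\left\lfloor \tfrac{N}{2} \right\rfloor + 1\right) - N \;\ge\; 1.

For example, with N = 5 any two majorities of size 3 share at least 3 + 3 - 5 = 1 member.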

Paxos roles in detail (translated from the German slide):

Proposer:
    init: r = 1               // local round number
          r_latest = 0        // number of the highest acknowledged round
          latest_v = ⊥        // value of the highest acknowledged round
    // send a new proposal
    ack_num = 0               // number of valid acknowledgements
    send prepare(r) to all acceptors
    on receiving ack(r_ack, v_i, r_i) from acceptor i:
        if r == r_ack: ack_num++
        if r_i > r_latest:    // a more recently accepted round
            r_latest = r_i
            latest_v = v_i    // a more recent value
        if ack_num >= maj:    // end of phase 1
            if latest_v == ⊥: propose an own value as latest_v
            send accept(r, latest_v) to all acceptors

Acceptor:
    init: r_ack = 0           // last acknowledged round
          r_accepted = 0      // last accepted round
          v = ⊥               // current local value
    on receiving prepare(r) from a proposer:
        if r > r_ack and r > r_accepted:   // higher round
            r_ack = r
        send ack(r_ack, v, r_accepted) to the proposer
    on receiving accept(r, w):
        if r >= r_ack and r > r_accepted:
            r_accepted = r
            v = w
            send accepted(r_accepted, v) to the learners

Learner:
    num_accepted = 0          // number of collected accepts
    on receiving accepted(r, v) from acceptor i:
        if r increased: num_accepted = 0
        num_accepted++
        if num_accepted == maj:
            decide v; inform the client    // v is the consensus

Striping Performance on a Cluster
Striping: parallel transfer from/to many OSDs. Benchmark setup for both READ and WRITE: one client writes/reads a single 4 GB file using asynchronous writes, read-ahead, a 1 MB chunk size, and 29 OSDs; nodes are connected with IP over InfiniBand (1.2 GB/s). Read bandwidth scales with the number of OSDs; the client is the bottleneck (slower reads are caused by a TCP ingress problem).

Snapshots & Backups
Metadata snapshots (MRC): need an atomic operation without service interruption; consolidation runs asynchronously in the background; granularity is subdirectories or volumes; implemented with BabuDB or Scalaris.
File snapshots (OSD): taken implicitly when a file is idle, or explicitly on close or fsync(); file objects are versioned via copy-on-write.
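
A minimal sketch of copy-on-write versioning of file objects: a write appends a new version instead of overwriting, so a snapshot keeps referring to the old one. The in-memory layout is an illustrative stand-in for the OSD's on-disk object store:

    // Sketch of copy-on-write object versioning on an OSD.
    import java.util.ArrayList;
    import java.util.List;

    public class CowVersionSketch {
        // All versions of one file object; the index is the version number.
        static final List<byte[]> versions = new ArrayList<>();

        static int write(byte[] newContent) {
            versions.add(newContent.clone()); // copy-on-write: append, never overwrite
            return versions.size() - 1;       // new version number
        }

        public static void main(String[] args) {
            int v0 = write("hello".getBytes());
            int v1 = write("hello, world".getBytes());
            // A snapshot taken at v0 still reads the old content:
            System.out.println(new String(versions.get(v0)) + " / "
                             + new String(versions.get(v1)));
        }
    }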

Atomic Snapshots in the MRC
Implemented with the BabuDB backend, a large-scale DB for data that exceeds the system's main memory. Two components: small mutable overlay trees (LSM trees) in memory, and a large immutable memory-mapped index on disk. It is a non-transactional key/value store with prefix and range queries. Primary design goal: performance! 300,000 lookups/sec (30M entries), fast crash recovery, fast start-up.

Log-Structured Merge Trees
A lookup takes O(s log n), with s = number of snapshots and n = number of files: each overlay tree (one per snapshot) is probed in turn, and each probe is logarithmic.
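
A sketch of the lookup path that yields this bound: probe the s overlay trees newest-first, then the immutable on-disk index, each probe being O(log n):

    // Sketch of an LSM-style lookup across s overlay trees plus a base index.
    import java.util.List;
    import java.util.TreeMap;

    public class LsmLookupSketch {
        static String lookup(List<TreeMap<String, String>> overlays,
                             TreeMap<String, String> diskIndex, String key) {
            for (TreeMap<String, String> overlay : overlays) { // newest first
                String v = overlay.get(key);                   // O(log n) probe
                if (v != null) return v;
            }
            return diskIndex.get(key);                         // immutable base index
        }

        public static void main(String[] args) {
            TreeMap<String, String> disk = new TreeMap<>();
            disk.put("/a", "old");
            TreeMap<String, String> overlay = new TreeMap<>();
            overlay.put("/a", "new"); // shadows the on-disk value
            System.out.println(lookup(List.of(overlay), disk, "/a")); // prints "new"
        }
    }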

Replicating MRCs and OSDs
Master/slave scheme. Pros: fast local reads, no distributed transactions, easy to implement. Cons: the master is a performance bottleneck, and service is interrupted when the master fails, which requires stable master election.
Replicated state machine (Paxos). Pros: no master, no single point of failure, no extra latency on failure. Cons: slower (two round trips per operation), needs distributed consensus.

XtreemFS Features
Release 1.2.1 (current): RAID and parallel I/O, POSIX compatibility, read-only replication, partial replicas (on demand), security (SSL, X.509), Internet-ready, checksums. Extensions: OSD and replica selection (Vivaldi, datacenter maps), asynchronous MRC backups, metadata caching, graphical admin console, Hadoop file system driver (experimental).
Release 1.3 (very soon): DIR and MRC replication with automatic failover, read/write replication.
Release 2.x: consistent backups, snapshots, automatic replica creation, deletion, and maintenance.

Source Code
XtreemFS: http://code.google.com/p/xtreemfs, 35,000 lines of C++ and Java code, GNU GPL v2 license.
BabuDB: http://code.google.com/p/babudb, 10,000 lines of Java code, new BSD license.
Scalaris: http://code.google.com/p/scalaris, 28,214 lines of Erlang and C++ code, Apache 2.0 license.

Summary
Cloud file systems require replication, for availability and for fast access (striping). Replication requires a consistency algorithm: when crashes are rare, use master/slave replication; with frequent crashes, use Paxos. Of the CAP theorem's three properties, this gives Consistency + Partition tolerance only. Our next step: faster high-level data services for MapReduce, Dryad, key/value stores, SQL, ...