BookKeeper overview




Table of contents

1 BookKeeper overview
  1.1 BookKeeper introduction
  1.2 In slightly more detail...
  1.3 Bookkeeper elements and concepts
  1.4 Bookkeeper initial design
  1.5 Bookkeeper metadata management
  1.6 Closing out ledgers

1. BookKeeper overview

1.1. BookKeeper introduction

BookKeeper is a replicated service for reliably logging streams of records. In BookKeeper, servers are "bookies", log streams are "ledgers", and each unit of a log (i.e., a record) is a "ledger entry". BookKeeper is designed to be reliable: bookies, the servers that store ledgers, can crash, corrupt data, or discard data, but as long as enough bookies behave correctly the service as a whole behaves correctly.

The initial motivation for BookKeeper comes from the namenode of HDFS. Namenodes have to log operations reliably so that recovery is possible in the case of crashes. We have found that the applications for BookKeeper extend far beyond HDFS, however. Essentially, any application that requires append-only storage can replace its own implementation with BookKeeper. BookKeeper has the advantage of scaling throughput with the number of servers.

At a high level, a BookKeeper client receives entries from a client application and stores them on sets of bookies. There are a few advantages in having such a service:

- We can use hardware that is optimized for such a service. We currently believe that such a system has to be optimized only for disk I/O;
- We can have a pool of servers implementing such a log system, shared among a number of servers;
- We can have a higher degree of replication with such a pool, which makes sense if the hardware necessary for it is cheaper compared to the hardware the application uses.

1.2. In slightly more detail...

BookKeeper implements highly available logs, and it has been designed with write-ahead logging in mind. Besides the high availability that comes from the replicated nature of the service, it provides high throughput due to striping. Because we write entries to a subset of the bookies of an ensemble and rotate writes across the available quorums, we are able to increase throughput with the number of servers for both reads and writes.
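To make the striping idea concrete, here is a small sketch. It is not BookKeeper code: the function name and the round-robin placement rule are assumptions for illustration only.

```python
def write_quorum(entry_id, ensemble, quorum_size):
    """Return the bookies that receive a given entry under a simple
    round-robin striping rule: entry n goes to the quorum_size bookies
    starting at position n mod |ensemble|. Illustrative only."""
    size = len(ensemble)
    return [ensemble[(entry_id + i) % size] for i in range(quorum_size)]

# An ensemble of 5 bookies with a write quorum of 3: successive entries
# land on rotating, overlapping subsets, so no single bookie stores
# every entry and aggregate throughput grows with the ensemble size.
ensemble = ["bookie1", "bookie2", "bookie3", "bookie4", "bookie5"]
for n in range(3):
    print(n, write_quorum(n, ensemble, 3))
```

Under a rotation like this, each bookie stores only a fraction (quorum size over ensemble size) of the ledger, which is why striping lets both read and write throughput scale with the number of servers.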
Scalability is possible in this case because of the use of quorums; other replication techniques, such as state-machine replication, do not enable this property. An application first creates a ledger before writing to bookies, through a local BookKeeper client instance. Upon creating a ledger, the BookKeeper client writes metadata about the ledger to ZooKeeper. Each ledger currently has a single writer. This writer has to execute a close

ledger operation before any other client can read from it. If the writer of a ledger does not close the ledger properly, for example because it crashed before having the opportunity to close it, then the next client that tries to open the ledger executes a procedure to recover it. As closing a ledger consists essentially of writing the last entry of the ledger to ZooKeeper, the recovery procedure simply finds the last correctly written entry and writes it to ZooKeeper. Note that this recovery procedure is currently executed automatically upon trying to open a ledger, and no explicit action is necessary. Although two clients may try to recover a ledger concurrently, only one will succeed: the first one able to create the close znode for the ledger.

1.3. Bookkeeper elements and concepts

BookKeeper uses four basic elements:

- Ledger: a ledger is a sequence of entries, and each entry is a sequence of bytes. Entries are written sequentially to a ledger, and at most once. Consequently, ledgers have append-only semantics;
- BookKeeper client: a client runs along with a BookKeeper application, and it enables applications to execute operations on ledgers, such as creating a ledger and writing to it;
- Bookie: a bookie is a BookKeeper storage server. Bookies store the content of ledgers. For any given ledger L, we call the group of bookies storing the content of L an ensemble. For performance, each bookie of an ensemble stores only a fragment of the ledger. That is, we stripe when writing entries to a ledger, such that each entry is written to a sub-group of bookies of the ensemble;
- Metadata storage service: BookKeeper requires a metadata storage service to store information related to ledgers and available bookies. We currently use ZooKeeper for this task.

1.4. Bookkeeper initial design

A set of bookies implements BookKeeper, and we use a quorum-based protocol to replicate data across the bookies.
There are basically two operations on an existing ledger: read and append. Here is the complete API list (more detail here):

- Create ledger: creates a new, empty ledger;
- Open ledger: opens an existing ledger for reading;
- Add entry: adds a record to a ledger, either synchronously or asynchronously;
- Read entries: reads a sequence of entries from a ledger, either synchronously or asynchronously.

There is only a single client that can write to a ledger. Once that ledger is closed or the client fails, no more entries can be added. (We take advantage of this behavior to provide our strong guarantees.) There will not be gaps in the ledger. Fingers get broken, people get roughed up or end up in prison when books are manipulated, so there is no deleting or changing of entries.

A simple use of BookKeeper is to implement a write-ahead transaction log. A server maintains an in-memory data structure (with periodic snapshots, for example) and logs changes to that structure before it applies them. The application server creates a ledger at startup and stores the ledger id and password in a well-known place (ZooKeeper, maybe). When it needs to make a change, the server adds an entry with the change information to the ledger and applies the change once BookKeeper has added the entry successfully. The server can even use asyncAddEntry to queue up many changes for high change throughput. BookKeeper meticulously logs the changes in order and calls the completion functions in order.

When the application server dies, a backup server comes online, gets the last snapshot, and then opens the ledger of the old server and reads all the entries written since the snapshot was taken. (Since it doesn't know the last entry number, it uses MAX_INTEGER.) Once all the entries have been processed, it closes the ledger and starts

a new one for its use.

A client library takes care of communicating with bookies and managing entry numbers. An entry has the following fields:

Field                 Type    Description
Ledger number         long    The id of the ledger of this entry
Entry number          long    The id of this entry
Last confirmed (LC)   long    The id of the last recorded entry
Data                  byte[]  The entry data (supplied by the application)
Authentication code   byte[]  A message authentication code that covers all other fields of the entry

Table 2: Entry fields

The client library generates a ledger entry. None of the fields are modified by the bookies, and only the first three fields are interpreted by the bookies.

To add to a ledger, the client generates the entry above using the ledger number. The entry number will be one more than the last entry generated. The LC field contains the last entry that has been successfully recorded by BookKeeper. If the client writes entries one at a time, LC is the id of the last entry. But if the client is using asyncAddEntry, there may be many entries in flight. An entry is considered recorded when both of the following conditions are met:

- the entry has been accepted by a quorum of bookies;
- all entries with a lower entry id have been accepted by a quorum of bookies.

LC seems mysterious right now, but it is too early to explain how we use it; just smile and move on.

Once all the other fields have been filled in, the client generates an authentication code over all of the previous fields. The entry is then sent to a quorum of bookies to be recorded. Any failures will result in the entry being sent to a new quorum of bookies.

To read, the client library initially contacts a bookie and starts requesting entries. If an entry is missing or invalid (a bad MAC, for example), the client makes a request to a different bookie. By using quorum writes, as long as enough bookies are up we are guaranteed to eventually be able to read an entry.
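The "recorded" condition above can be made concrete with a small sketch (hypothetical names, not the client library's actual code): given the number of acknowledgements each in-flight entry has received, the last confirmed entry is the highest id such that it and every lower entry have reached a quorum.

```python
def last_confirmed(acks, quorum_size):
    """acks[i] = number of bookies that acknowledged entry i (ids 0,1,2,...).
    An entry is recorded once it *and all lower entries* reach a quorum;
    LC is the highest recorded id, or -1 if nothing is recorded yet."""
    lc = -1
    i = 0
    while acks.get(i, 0) >= quorum_size:
        lc = i
        i += 1
    return lc

# With asyncAddEntry there may be gaps among in-flight entries: here
# entry 2 has not reached a quorum yet, so entry 3 is not recorded
# either, even though it already has enough acks of its own.
print(last_confirmed({0: 3, 1: 2, 2: 1, 3: 3}, quorum_size=2))  # LC = 1
```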

1.5. Bookkeeper metadata management

There is some metadata that needs to be made available to BookKeeper clients:

- the available bookies;
- the list of ledgers;
- the list of bookies that have been used for a given ledger;
- the last entry of a ledger.

We maintain this information in ZooKeeper. Bookies use ephemeral nodes to indicate their availability. Clients use znodes to track ledger creation and deletion, and also to learn the end of a ledger and the bookies that were used to store it. Bookies also watch the ledger list so that they can clean up ledgers that get deleted.

1.6. Closing out ledgers

The process of closing out a ledger and finding its last entry is difficult because of the durability guarantees of BookKeeper:

- if an entry has been successfully recorded, it must be readable;
- if an entry is read once, it must always be available to be read.

If the ledger was closed gracefully, ZooKeeper will have the last entry and everything will work well. But if the BookKeeper client that was writing the ledger dies, some recovery needs to take place.

The problematic entries are the ones at the end of the ledger. There can be entries in flight when a BookKeeper client dies. If an entry reached only one bookie, it should not be readable, since it will disappear if that bookie fails. On the other hand, an entry being present on only one bookie does not mean it was not recorded successfully; the other bookies that recorded it might simply have failed. The trick to making everything work is to have a correct idea of the last entry. We do it in roughly three steps:

1. Find the entry with the highest last recorded entry, LC;
2. Find the highest consecutively recorded entry, LR;
3. Make sure that all entries between LC and LR are on a quorum of bookies.
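As a rough sketch of these three steps, here is a toy model. All names are hypothetical, and it ignores MAC verification and the ZooKeeper interaction; it only shows how the last entry of an unclosed ledger can be pinned down.

```python
def recover_ledger(bookie_entries, quorum_size):
    """Toy model of closing out an unclosed ledger. bookie_entries is a
    list of sets: the entry ids each bookie of the ensemble holds."""
    present = set().union(*bookie_entries)

    def on_quorum(e):
        return sum(e in held for held in bookie_entries) >= quorum_size

    # Step 1: LC, the highest entry already known to be on a quorum.
    lc = max((e for e in present if on_quorum(e)), default=-1)

    # Step 2: LR, the highest entry reachable from LC without a gap
    # (each successive id exists on at least one bookie).
    lr = lc
    while lr + 1 in present:
        lr += 1

    # Step 3: re-replicate every entry in (LC, LR] until it sits on a
    # quorum; LR can then be recorded as the last entry of the ledger.
    for e in range(lc + 1, lr + 1):
        for held in bookie_entries:
            if on_quorum(e):
                break
            held.add(e)
    return lr

# Three bookies, write quorum of 2. Entries 2 and 3 were in flight when
# the writer died, each reaching a single bookie; recovery copies them
# onto a quorum and closes the ledger at entry 3.
bookies = [{0, 1, 2}, {0, 1}, {0, 3}]
print(recover_ledger(bookies, 2))
```

Note the asymmetry this resolves: an entry on a single bookie must not be exposed to readers as-is (it could vanish), but it also cannot simply be discarded (it may have been recorded on bookies that are currently down), so recovery re-replicates it before declaring the end of the ledger.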