Bigtable - A Distributed Storage System for Structured Data

Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Presented by Ripul Jain

What is Bigtable?
[Figure: an example table. Source: Wikipedia]

Bigtable
- A distributed storage system for managing structured data
- Developed by Google to deal with massive amounts of data
- Not a relational database
- Designed for wide applicability, scalability, high performance, and high availability

Data Model
A table in Bigtable is a sparse, distributed, persistent, multidimensional sorted map
The map is indexed by a row key, column key, and a timestamp:
(row:string, column:string, time:int64) → string
Figure 1: A slice of an example table that stores web pages
Source: Chang et al., OSDI 2006
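
As a rough illustration, this map can be mimicked with a single in-memory Python dict keyed by (row, column, timestamp); the sketch below is a toy stand-in, not the real distributed, persistent implementation, and all names are illustrative:

```python
# Toy model of Bigtable's data model: a map from
# (row, column, timestamp) to an uninterpreted string value.
table = {}  # {(row, column, timestamp): value}

def put(row: str, column: str, timestamp: int, value: str) -> None:
    table[(row, column, timestamp)] = value

def get(row: str, column: str) -> str:
    """Return the most recent version of a cell."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    if not versions:
        raise KeyError((row, column))
    return max(versions)[1]  # highest timestamp wins

put("com.cnn.www", "contents:", 3, "<html>...</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(get("com.cnn.www", "contents:"))  # <html>...</html>
```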

Rows and Columns
Row key is an arbitrary string
- Access to data under a single row key is atomic
- Rows are maintained in sorted lexicographic order
Columns are grouped into column families
- Column key = family:qualifier
- Data within a column family is usually of the same type
- A table can have an unbounded number of columns
Timestamps are used to store different versions of the same data in a cell
- Clients can specify that only the last n versions of a cell be kept
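
A small sketch of the last-n-versions setting (hypothetical helper; Bigtable applies this per-column-family policy when garbage-collecting cell versions):

```python
def keep_last_n_versions(cell_versions: dict[int, str], n: int) -> dict[int, str]:
    """Keep only the n most recent timestamped values of one cell.
    Toy sketch of Bigtable's per-column-family version GC setting."""
    newest = sorted(cell_versions, reverse=True)[:n]
    return {ts: cell_versions[ts] for ts in newest}

versions = {3: "v1", 9: "v2", 12: "v3"}
print(keep_last_n_versions(versions, 2))  # {12: 'v3', 9: 'v2'}
```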

Tablets
Row ranges in the table are dynamically partitioned into tablets
- Each tablet is about 100-200 MB in size
- The tablet is the unit of distribution and load balancing
Each machine is responsible for 10 to 1000 tablets
- Not necessarily a contiguous range of rows/tablets
- Master can migrate tablets away from overloaded or failed machines
- Fast recovery: 100 machines can each pick up 1 tablet from a failed machine that served 100 tablets
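
A sketch of how a row key is mapped to a tablet once the sorted row space has been split at boundary keys; the boundary values here are invented, and real splits are driven by the ~100-200 MB size target rather than chosen by hand:

```python
import bisect

# End key of each tablet's row range, in sorted order (invented values).
tablet_end_keys = ["com.cnn.www", "com.nytimes.www", "org.apache.www"]

def tablet_for_row(row_key: str) -> int:
    """Index of the tablet whose row range contains row_key.
    Keys past the last boundary are clamped into the final tablet."""
    return min(bisect.bisect_left(tablet_end_keys, row_key),
               len(tablet_end_keys) - 1)

print(tablet_for_row("com.bbc.www"))  # 0: before the first end key
print(tablet_for_row("net.example"))  # 2: between the 2nd and 3rd end keys
```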

API
The Bigtable API provides functions for creating and deleting tables and column families
Supports single-row transactions, which can be used to perform atomic read-modify-write sequences under a single row key
Can retrieve data from a single row or iterate over all rows
Can retrieve data for all columns, certain column families, or specific columns
Supports execution of client-supplied scripts on the tablet servers
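
The paper gives its API examples in C++; below is a loose, hypothetical Python rendering of the single-row mutation idea, with a toy in-memory table standing in for the real client library (all class and method names are invented; only the single-row atomicity follows the paper):

```python
import threading

class RowMutation:
    """Batch of edits to one row; applied all at once."""
    def __init__(self, row: str):
        self.row = row
        self.ops = []  # list of (column, value-or-None)

    def set(self, column: str, value: str) -> None:
        self.ops.append((column, value))

    def delete(self, column: str) -> None:
        self.ops.append((column, None))

class Table:
    def __init__(self):
        self.rows = {}                 # row -> {column: value}
        self._lock = threading.Lock()  # coarse lock, for the toy only

    def apply(self, m: RowMutation) -> None:
        # All edits under one row key commit atomically, as in Bigtable.
        with self._lock:
            cells = self.rows.setdefault(m.row, {})
            for column, value in m.ops:
                if value is None:
                    cells.pop(column, None)
                else:
                    cells[column] = value

t = Table()
m = RowMutation("com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
t.apply(m)
print(t.rows["com.cnn.www"])  # {'anchor:www.c-span.org': 'CNN'}
```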

Bigtable Building Blocks
Google File System (GFS)
- Used to store log and data files
SSTable
- Persistent, ordered, immutable map from keys to values
- Sequence of 64 KB blocks plus an index for block lookup
- Can be mapped into main memory
- Multiple SSTables form a tablet
Chubby
- A highly available, persistent, distributed lock service
- Stores the location of Bigtable data (the root tablet)
- Stores the Bigtable schema
- Ensures there is only one Bigtable master
- Keeps track of tablet servers
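
A sketch of the SSTable lookup path implied above: binary-search the in-memory block index, then read a single 64 KB block; the keys and offsets are invented for illustration:

```python
import bisect

BLOCK_SIZE = 64 * 1024  # SSTables are split into 64 KB blocks

# In-memory index loaded when the SSTable is opened: the first key of
# each block, plus each block's file offset (invented example values).
block_first_keys = ["apple", "mango", "pear"]
block_offsets = [0, BLOCK_SIZE, 2 * BLOCK_SIZE]

def block_offset_for(key: str) -> int:
    """Binary search the index for the block that may contain key."""
    i = bisect.bisect_right(block_first_keys, key) - 1
    return block_offsets[max(i, 0)]

# At most one disk seek per lookup: read just this block, then
# search for the key inside it.
print(block_offset_for("orange"))  # 65536: falls in the 'mango' block
```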

System Structure
Bigtable master
- Assigns tablets to tablet servers
- Detects the addition and expiration of tablet servers
- Performs load balancing
- Handles garbage collection of files in GFS
- Handles schema changes such as table and column family creation
Bigtable tablet servers
- Each manages a set of tablets
- Handles read and write requests to loaded tablets
- Splits tablets that have grown too large

Locating Tablets
Three-level hierarchical lookup scheme for tablets
Figure 2: Tablet location hierarchy
Source: Chang et al., OSDI 2006
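
A toy sketch of the three lookups a client performs, plus the client-side location cache that usually short-circuits them (all structures below are invented stand-ins for Chubby, the root tablet, and the METADATA tablets):

```python
# Three-level lookup: Chubby file -> root tablet -> METADATA tablet
# -> user tablet location.
chubby_root_location = "root-tablet@server-a"

# Root tablet: METADATA tablet key -> location of that METADATA tablet.
root_tablet = {"meta1": "server-b", "meta2": "server-c"}

# METADATA tablets: user tablet id -> tablet server holding it.
metadata_tablets = {
    "server-b": {"users:0-m": "server-d"},
    "server-c": {"users:n-z": "server-e"},
}

location_cache = {}  # client-side cache; avoids repeating the lookups

def locate(user_tablet: str, meta_key: str) -> str:
    if user_tablet in location_cache:
        return location_cache[user_tablet]        # cache hit: zero lookups
    _ = chubby_root_location                      # lookup 1: Chubby file
    meta_server = root_tablet[meta_key]           # lookup 2: root tablet
    location = metadata_tablets[meta_server][user_tablet]  # lookup 3
    location_cache[user_tablet] = location
    return location

print(locate("users:n-z", "meta2"))  # server-e, now cached for next time
```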

Tablet Assignment
Each tablet is assigned to one tablet server at a time
- Tablet server holds an exclusive lock on a newly created, uniquely named file in a specific Chubby directory
Master scans the servers directory in Chubby to discover tablet servers
- Communicates with tablet servers to find existing tablet assignments
- If the root tablet is not assigned, adds it to the set of unassigned tablets
- Periodically asks each tablet server for the status of its lock
- Keeps track of unassigned tablets and handles their assignment
Changes to tablet structure
- Master initiates table creation/deletion
- Master initiates tablet merges
- Tablet server initiates tablet splits, updates the METADATA table, and notifies the master
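
A hedged sketch of the master's periodic lock check, with Chubby mocked as a set of currently held lock files (names and structures are invented):

```python
chubby_live_locks = {"servers/ts1", "servers/ts2"}     # mocked Chubby state
assignments = {"tablet-17": "ts1", "tablet-42": "ts3"} # tablet -> server
unassigned = set()

def check_tablet_servers() -> None:
    """Unassign tablets whose server no longer holds its Chubby lock."""
    for tablet, server in list(assignments.items()):
        if f"servers/{server}" not in chubby_live_locks:
            del assignments[tablet]
            unassigned.add(tablet)   # the master will reassign these later

check_tablet_servers()
print(unassigned)  # {'tablet-42'}: ts3 lost its lock, so its tablet is freed
```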

Tablet Representation
Figure 3: Tablet representation
Source: Chang et al., OSDI 2006

Tablet Serving
Write operation
- Tablet server checks that the client has sufficient privileges for the write
- The mutation is first appended to the commit log on GFS
- After the write has been committed, its contents are inserted into an in-memory sorted buffer called the memtable
- When the memtable grows to a certain size, it is written out to GFS
Read operation
- Tablet server checks that the client has sufficient privileges for the read
- Reads are executed on a merged view of the sequence of SSTables and the memtable (the presenter notes the paper does not say how often this is done)
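
A toy sketch of both paths, assuming in-memory stand-ins for the GFS commit log and the SSTables:

```python
import heapq

# Toy tablet state: a commit log, an in-memory memtable, frozen SSTables.
commit_log = []                       # would be an append-only file on GFS
memtable = {}                         # sorted buffer of recent writes
sstables = [{"a": "1", "c": "3"}]     # older, immutable data

def write(key: str, value: str) -> None:
    commit_log.append((key, value))   # 1. append to the commit log first
    memtable[key] = value             # 2. then insert into the memtable

def read(key: str):
    # Newest data wins: check the memtable, then SSTables newest-first.
    if key in memtable:
        return memtable[key]
    for sst in reversed(sstables):
        if key in sst:
            return sst[key]
    return None

def scan():
    """Merged, sorted view over the memtable and all SSTables."""
    views = [sorted(memtable.items())] + \
            [sorted(s.items()) for s in reversed(sstables)]
    seen = set()
    for k, v in heapq.merge(*views, key=lambda kv: kv[0]):
        if k not in seen:             # earlier (newer) views shadow older ones
            seen.add(k)
            yield k, v

write("b", "2")
print(read("c"), list(scan()))        # 3 [('a', '1'), ('b', '2'), ('c', '3')]
```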

Compactions
Minor compaction
- When the memtable grows too big, it is frozen, converted to an SSTable, and written to GFS
- Reduces the memory usage of the tablet server
- Reduces the amount of data that has to be read from the commit log during recovery
Merging compaction
- Reads the contents of a few SSTables and the memtable, and writes out a new SSTable
- Applied periodically to bound the number of SSTables
Major compaction
- A merging compaction that rewrites all SSTables into exactly one SSTable
- The resulting SSTable contains no deletion entries or deleted data
- Applied regularly to all tablets
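
A sketch of the minor-compaction trigger; the threshold and structures are invented for the toy (Bigtable thresholds on memtable memory size, not entry count):

```python
MEMTABLE_LIMIT = 4   # entries, for the toy; the real threshold is bytes

memtable = {}
sstables = []        # immutable outputs of past compactions

def maybe_minor_compact() -> None:
    """Freeze a full memtable, write it out as an SSTable, start a new one."""
    global memtable
    if len(memtable) >= MEMTABLE_LIMIT:
        frozen = dict(sorted(memtable.items()))  # SSTables are key-sorted
        sstables.append(frozen)                  # would be written to GFS
        memtable = {}        # commit log before this point is now redundant

for i in range(5):
    memtable[f"k{i}"] = str(i)
    maybe_minor_compact()

print(len(sstables), len(memtable))  # 1 SSTable flushed, 1 key still buffered
```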

Refinements
Locality groups
- Column families can be assigned to a locality group
- A separate SSTable is generated for each locality group
- Used to better organize data for performance
- A locality group can be declared to be in-memory
Compression
- SSTables for a locality group can be compressed
- Clients can specify which compression format to use
Caching for read performance
- Scan cache: caches the key-value pairs returned by the SSTable interface to the tablet server code
- Block cache: caches SSTable blocks read from GFS
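
A small sketch of the locality-group idea: column families map to groups, and each group is stored separately so scans touch only the data they need (family and group names are invented):

```python
# Each column family is assigned to a locality group, and each group
# is stored in its own SSTables.
locality_groups = {
    "contents:": "large",    # bulky page contents, scanned rarely
    "anchor:": "small",      # small metadata; could be declared in-memory
    "language:": "small",
}

def group_for(column: str) -> str:
    """Which locality group's SSTables hold this column's data."""
    family = column.split(":", 1)[0] + ":"
    return locality_groups[family]

# A scan over anchors never has to read the big 'contents:' SSTables.
print(group_for("anchor:cnnsi.com"))  # small
```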

Refinements (continued)
Bloom filters for SSTables
- A read operation may otherwise need to read many SSTables
- Answers: does this (row, column) pair possibly exist in the SSTable?
- Reduces disk seeks for read operations
Exploiting immutability
- SSTables are immutable
- Master removes obsolete SSTables via mark-and-sweep garbage collection
- The memtable is the only mutable data structure
- Copy-on-write on the memtable allows reads and writes to proceed in parallel
Shared commit log
- A large number of log files written concurrently to GFS leads to poor performance
- Idea: a single physical log file per tablet server
- Complicates recovery (addressed on the next slide)
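
A minimal Bloom-filter sketch of the membership test described above (the hash scheme is simplified; Bigtable configures filters per locality group):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false positives."""
    def __init__(self, m_bits: int = 1024, k_hashes: int = 3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        return all(self.bits & (1 << p) for p in self._positions(item))

# One filter per SSTable: skip the disk read when the answer is "no".
bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")                   # (row, column) pair
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain("org.example/contents:"))         # False (almost surely)
```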

Shared Log Recovery
Idea: partition and sort the commit log entries by key <table, row, log sequence number>
- Sorting makes each tablet's mutations contiguous in the log
- Tablet servers notify the master of the log chunks they need to read
- Tablet servers do the sorting, and the master coordinates which sorted chunks are needed by which tablet servers
- Each tablet server then reads the sorted data for its tablets directly from its peers
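
A sketch of the recovery-time sort, with invented log entries; sorting by <table, row, sequence number> makes each tablet's mutations a contiguous run that can be replayed with one sequential read:

```python
# Commingled commit log: entries from many tablets, interleaved.
log = [  # (table, row, seq, mutation)
    ("webtable", "com.cnn.www", 7, "set contents:"),
    ("users", "alice", 3, "set name:"),
    ("webtable", "com.bbc.www", 5, "set contents:"),
    ("users", "alice", 9, "delete email:"),
]

sorted_log = sorted(log, key=lambda e: (e[0], e[1], e[2]))

def entries_for_tablet(table: str, row_start: str, row_end: str):
    """Replay only this tablet's contiguous slice of the sorted log."""
    return [e for e in sorted_log
            if e[0] == table and row_start <= e[1] < row_end]

print(entries_for_tablet("webtable", "com.a", "com.z"))
```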

Performance Evaluation
Setup
- 1,786 machines with two 400 GB hard drives each and large physical memory
- Each tablet server uses 1 GB of memory
- Number of clients = number of tablet servers
Figure 4: Number of 1000-byte values read/written per second per tablet server

Performance Evaluation (continued)
Figure 5: Aggregate rate

Applications
Figure 6: Characteristics of a few tables in use at Google

Lessons
- Large distributed systems are vulnerable to many types of failure
- Large systems need proper system-level monitoring
- Value simple designs
- Delay adding new features until it is clear how they will be used

Critique
Strengths
- Flexible data model
- Able to handle mixed read-write workloads
- Demonstrated that single-master systems can avoid bottlenecks and scale well
Weaknesses
- No support for multi-row transactions (added in 2008)
- Stability of the system depends on Chubby
- No performance comparison with other databases
- Bigtable is difficult for new users to adapt to, especially if they are accustomed to relational databases

Competition
- A public version of Bigtable, Google Cloud Bigtable, was released in 2015
- Apache HBase is modelled after Bigtable
- IBM DB2 Parallel Edition
- Sybase IQ
- C-Store
- Apache Cassandra
- LevelDB

Conclusion
- An excellent example of how to design a storage system for large-scale data that needs to be frequently read or scanned
- Self-managing: servers can be added or removed dynamically
- Can be used as MapReduce input or output
- Tried and tested extensively in production
- Can be extended to offer Bigtable as a service
- Leaves room for future work, such as support for secondary indices