Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems





HBase and BigTable
- HBase is Hadoop's counterpart of Google's BigTable.
- BigTable meets the need for a highly scalable storage system for structured data.
- Provides random and (almost) real-time data access.
- Works on top of the Google File System (GFS).
- Data is structured into entities (records), aggregated into a few huge files, and indexed.
- Not a relational database: it offers the typical create, read, update and delete (CRUD) operations, plus scans over keys.
- No general ACID guarantees (operations are atomic only within a single row).
- Used by many applications at Google.

Hadoop's and Google's stacks

Layer                 Hadoop                                   Google
Structured storage    HBase                                    BigTable
Batch processing      Hadoop MapReduce                         Google MapReduce
Coordination          ZooKeeper                                Chubby
File system           Hadoop Distributed File System (HDFS)    Google File System (GFS)

HBase/BigTable Tables
A Bigtable is a sparse, distributed, persistent, multidimensional sorted map:
- Map: associates keys to values.
- Sorted: ordered by key (efficient look-ups).
- Multidimensional: the key is formed by several values.
- Persistent: once written, data is there until removed.
- Distributed: stored across different nodes.
- Sparse: many (most) values are not defined.
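The definition above can be made concrete with a small in-memory model. This is only a sketch using nested `TreeMap`s from the Java standard library; the class `SortedMapSketch` and its methods are illustrative names, not part of the HBase or Bigtable API, and persistence and distribution are left out.

```java
import java.util.TreeMap;

// Sketch only: nested TreeMaps stand in for the "sparse, multidimensional sorted map".
// Names are illustrative; nothing here is HBase or Bigtable API.
class SortedMapSketch {
    // row key -> column ("family:qualifier") -> timestamp -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(ts, value);
    }

    // Newest version of a cell, or null if the cell was never written (sparse).
    String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> cols = rows.get(row);
        if (cols == null) return null;
        TreeMap<Long, String> versions = cols.get(column);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Row keys are returned in sorted order, which is what makes range scans cheap.
    Iterable<String> rowKeys() {
        return rows.keySet();
    }

    public static void main(String[] args) {
        SortedMapSketch t = new SortedMapSketch();
        t.put("row-2", "info:name", 1L, "Bob");
        t.put("row-1", "info:name", 1L, "Alice");
        t.put("row-1", "info:name", 2L, "Alicia");
        System.out.println(t.rowKeys());                 // [row-1, row-2]
        System.out.println(t.get("row-1", "info:name")); // Alicia
        System.out.println(t.get("row-2", "info:age"));  // null
    }
}
```

Undefined cells take no space in this model because nothing is stored for them, which is the sense in which the map is sparse.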

HBase/BigTable Datamodel
- Rows are composed of columns, which are grouped into column families.
- Column families group semantically related values.
- Each column family is stored in its own files (HFiles) in HDFS.
- One column family can have millions of columns.
- Each column is referenced as family:qualifier.
- A row key is an array of bytes. Keys are ordered lexicographically.
- A column value is called a cell. Cells have timestamps, and old values are kept.

BigTable vs. Relational Datamodels
Example: a students/courses database.

Relational model
- Students: id (primary key), name, gender, age
- St_Co_Rel: st_id, co_id, mark
- Course: id (primary key), title, description

BigTable model (identifiers are handled by the application programmers; there is no referential integrity)
- Student table: row key <student_id>; column families info: (info:name, info:gender, info:age) and course: (course:<course_id> = mark)
- Course table: row key <course_id>; column families info: (info:title, info:descript) and student: (student:<student_id> = mark)

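HBase/BigTable Datamodel View
[Figure: a table with row keys r_key_1, r_key_22, r_key_3, ... down the side and the columns cf1:col_a, cf1:col_b, cf2:col_y, cf2:col_z across the top, grouped into two column families. Cells hold values with timestamps: aa (ts 3), val1 (ts 5), val2 (ts 9), val3 (ts 10).]
- Row keys: arbitrary arrays of bytes, lexicographically ordered.
- Column family: human-readable name.
- Column qualifier: arbitrary array of bytes.
- Cell values are versioned over time: (Table, RowKey, Family, Column, Timestamp) → Value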
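The row keys in the figure (r_key_1, r_key_22, r_key_3) illustrate a common surprise: keys are compared byte by byte, not as numbers. The following sketch shows that ordering with a hand-written comparator; `RowKeyOrderSketch` and its helpers are hypothetical names for illustration and do not come from the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeSet;

// Sketch only: lexicographic ordering of byte-array row keys.
class RowKeyOrderSketch {
    // Compare byte arrays lexicographically, treating each byte as unsigned.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Comparator<byte[]> cmp = RowKeyOrderSketch::compareKeys;
        TreeSet<byte[]> keys = new TreeSet<>(cmp);
        keys.add(key("r_key_3"));
        keys.add(key("r_key_22"));
        keys.add(key("r_key_1"));
        // Prints r_key_1, r_key_22, r_key_3: "22" sorts before "3" because the
        // comparison stops at the first differing byte ('2' < '3').
        for (byte[] k : keys) {
            System.out.println(new String(k, StandardCharsets.UTF_8));
        }
    }
}
```

One practical consequence is that numeric identifiers used in row keys are often zero-padded so that byte order and numeric order agree.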

HBase/BigTable Data Storage
[Figure: the same table as it is stored, with cells grouped per column family: r_key_1 → val1 (ts 5); r_key_22 → aa (ts 3), val2 (ts 9); r_key_3 → val3 (ts 10); ...]
- Columns with no value are not stored.
- Each column family is kept in its own HFiles.
- HFiles are stored in HDFS, split into blocks (64 KB per block by default), with a block index at the end of each HFile.

HBase Regions
- For scalability, tables are split into regions by row key.
- Each region is assigned to a region server.
[Figure: the sorted row keys of a table divided into regions, each hosted by a Region Server.]
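Because regions partition the sorted key space, finding the region for a row key amounts to finding the greatest region start key that is not larger than the row key. The sketch below shows this with `TreeMap.floorEntry`. The class name, the server names and the split points are made up for illustration; this is a model of the idea rather than how HBase implements it.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: mapping row keys to regions by start key.
class RegionLookupSketch {
    // start key of each region -> name of the server hosting it (names are made up)
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void assign(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The region owning a key is the one with the greatest start key <= that key.
    String serverFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        if (e == null) {
            throw new IllegalStateException("no region covers key " + rowKey);
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.assign("", "rs-1");        // first region starts at the empty key
        table.assign("row-300", "rs-2");
        table.assign("row-600", "rs-1"); // one server can host several regions
        System.out.println(table.serverFor("row-123")); // rs-1
        System.out.println(table.serverFor("row-520")); // rs-2
        System.out.println(table.serverFor("row-999")); // rs-1
    }
}
```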

Region Server Internal Architecture
[Figure: a Region Server with one HLog (write-ahead log) and several HRegions. Each HRegion contains Stores, for example the Store for table t2, column family cf2. Each Store has a MemStore, an in-memory map from keys to data.]
- A Store keeps the keys of a certain region of some column family of some table.

Region Server Internal Architecture
[Figure: the same Region Server, now showing HFiles under each Store (table t1, column family cf1; table t2, column family cf2). Each HFile consists of data blocks followed by a block index.]
- 2) When the MemStore (in-memory map) gets full, its data is written to an HFile. Several HFiles can be present per Store.
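The write path described on these two slides (write-ahead log, MemStore, flush to an HFile) can be modelled in a few lines. This is a rough sketch under simplifying assumptions: the flush threshold counts entries rather than bytes, "files" are in-memory maps, and the class and field names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch only: one Store with a write-ahead log, a MemStore and flushed files.
class StoreWriteSketch {
    final List<String> wal = new ArrayList<>();                      // stands in for the HLog
    TreeMap<String, String> memStore = new TreeMap<>();              // sorted in-memory map
    final List<TreeMap<String, String>> hFiles = new ArrayList<>();  // flushed files, oldest first
    private final int flushThreshold;                                // assumption: counted in entries

    StoreWriteSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value); // log first, so the write can be replayed after a crash
        memStore.put(key, value);   // then update the in-memory map
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        hFiles.add(memStore);       // the full MemStore becomes a new sorted file
        memStore = new TreeMap<>(); // writes continue in a fresh MemStore
    }

    public static void main(String[] args) {
        StoreWriteSketch store = new StoreWriteSketch(2);
        store.put("row-1", "a");
        store.put("row-2", "b"); // reaches the threshold and triggers a flush
        store.put("row-3", "c");
        System.out.println(store.hFiles.size());   // 1
        System.out.println(store.memStore.size()); // 1
        System.out.println(store.wal.size());      // 3
    }
}
```

Each flush produces one more file, which is why a Store ends up with several HFiles over time.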

HBase over HDFS
[Figure: two Region Servers, each with an HLog and several HFiles, accessing HDFS through HDFS client middleware. The HDFS layer consists of an HDFS NameNode and HDFS DataNodes.]

Reading Data
- When reading data by row key, the RegionServer serving the corresponding region is queried.
- All Stores of that region that hold a column family requested by the query must be checked.
- For that, the in-memory map and the HBase files must all be read (merged read).
- HBase files can use Bloom filters to speed up reads.
- Unless otherwise specified, only the latest version of each value is returned.
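A merged read can be pictured as asking every source (the MemStore plus each file) for the row and keeping the version with the highest timestamp. The sketch below does exactly that over in-memory maps. `MergedReadSketch`, `source` and `readLatest` are hypothetical names, and Bloom filters, block indexes and caching are deliberately omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: merged read across a MemStore and several files.
class MergedReadSketch {
    // Builds a source holding one row with one version: row key -> (timestamp -> value).
    static TreeMap<String, TreeMap<Long, String>> source(String row, long ts, String value) {
        TreeMap<String, TreeMap<Long, String>> s = new TreeMap<>();
        s.computeIfAbsent(row, r -> new TreeMap<>()).put(ts, value);
        return s;
    }

    // Checks every source and returns the value with the newest timestamp, or null.
    static String readLatest(String rowKey, List<TreeMap<String, TreeMap<Long, String>>> sources) {
        Long bestTs = null;
        String best = null;
        for (TreeMap<String, TreeMap<Long, String>> src : sources) {
            TreeMap<Long, String> versions = src.get(rowKey);
            if (versions == null || versions.isEmpty()) continue;
            Map.Entry<Long, String> newest = versions.lastEntry();
            if (bestTs == null || newest.getKey() > bestTs) {
                bestTs = newest.getKey();
                best = newest.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<TreeMap<String, TreeMap<Long, String>>> sources = new ArrayList<>();
        sources.add(source("row-1", 3L, "from-old-file"));
        sources.add(source("row-1", 9L, "from-new-file"));
        sources.add(source("row-2", 10L, "from-memstore"));
        System.out.println(readLatest("row-1", sources)); // from-new-file
        System.out.println(readLatest("row-2", sources)); // from-memstore
        System.out.println(readLatest("row-7", sources)); // null
    }
}
```

The cost of such a read grows with the number of files in the Store, which is one reason files are periodically merged.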

Writing, Updating, and Deleting Data
- Rows are written to the in-memory map of the corresponding RegionServer.
- Update operations just write a new version of the data.
- Row deletion depends on where the row is located:
  - Rows in the in-memory map are just deleted.
  - HBase files are immutable! Deletion markers are used instead.
- To prevent HBase files from using too much space, they are merged.
  - Rows marked as deleted and old versions of data are removed during the merge.
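The merge step can be sketched as follows: immutable files are combined from oldest to newest so that newer entries win, and rows whose newest entry is a deletion marker are dropped. This simplified model keeps a single value per row and merges all files at once; the `TOMBSTONE` string is an illustrative stand-in, not HBase's on-disk marker format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch only: merging immutable files and dropping rows marked as deleted.
class CompactionSketch {
    static final String TOMBSTONE = "__deleted__"; // illustrative deletion marker

    // Files are ordered oldest first; each maps row key -> value (or TOMBSTONE).
    static TreeMap<String, String> compact(List<TreeMap<String, String>> files) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file); // entries from newer files overwrite older ones
        }
        merged.values().removeIf(TOMBSTONE::equals); // drop rows marked as deleted
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> older = new TreeMap<>();
        older.put("row-1", "v1");
        older.put("row-2", "v1");
        TreeMap<String, String> newer = new TreeMap<>();
        newer.put("row-1", TOMBSTONE); // row-1 was deleted after the first flush
        newer.put("row-2", "v2");      // row-2 was updated
        newer.put("row-3", "v1");

        List<TreeMap<String, String>> files = new ArrayList<>();
        files.add(older);
        files.add(newer);
        System.out.println(compact(files)); // {row-2=v2, row-3=v1}
    }
}
```

The input files are never modified; the merge produces a new file that replaces them, which is consistent with files being immutable.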

HBase -ROOT- and .META. Tables
- Special system tables used to organize data in HBase.
- Clients use them to locate which RegionServer serves a certain key of a certain table.
- They are partitioned into regions and served by RegionServers like normal tables.

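Example of Lookup for Data in HBase
[Figure: the chain of lookups a client follows to reach row-520 of tablea.]
- Lookup #1: the client asks ZooKeeper (/hbase/root-region-server) which RegionServer holds the -ROOT- region (RS R).
- Lookup #2: the -ROOT- region on Region Server R lists the .META. regions (.META.,,1; .META.,tableN,123456789; ...) and the servers holding them (RS M1, RS M2, ...).
- Lookup #3: the .META. region on Region Server M1 lists the user regions of tablea (tablea,,98766789; tablea,row-520,99995432; ...) and their servers (RS U1, RS U2, ...). The .META. region on Region Server M2 does the same for tablen (tablen,,12344321; tablen,row-333,12345791; ... on RS U3, RS U4, ...).
- Lookup #4: the client reads row-520 from the user region starting at tablea,row-520 on Region Server U2. Row-1 would be read from region tablea,,98766789 on Region Server U1.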
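The lookup chain in this example can be sketched as two successive floor searches over sorted catalog entries. This is a toy model that follows the figure's server names (M1, M2, U1 to U4); the `CatalogLookupSketch` class, the simplified "table,startRow" key format without region ids, and the omission of the ZooKeeper step are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: resolving a row to a RegionServer through -ROOT- and .META.
class CatalogLookupSketch {
    // -ROOT-: first "table,startRow" key covered by each .META. region -> server holding it
    final TreeMap<String, String> root = new TreeMap<>();
    // One .META. region per server: first key of each user region -> server holding it
    final Map<String, TreeMap<String, String>> metaByServer = new HashMap<>();

    void addMetaRegion(String firstKey, String server) {
        root.put(firstKey, server);
        metaByServer.put(server, new TreeMap<>());
    }

    void addUserRegion(String metaServer, String firstKey, String server) {
        metaByServer.get(metaServer).put(firstKey, server);
    }

    String locate(String table, String row) {
        String key = table + "," + row;
        String metaServer = root.floorEntry(key).getValue();            // lookup #2
        return metaByServer.get(metaServer).floorEntry(key).getValue(); // lookup #3
    }

    public static void main(String[] args) {
        CatalogLookupSketch catalog = new CatalogLookupSketch();
        catalog.addMetaRegion("", "M1");
        catalog.addMetaRegion("tablen,", "M2");
        catalog.addUserRegion("M1", "tablea,", "U1");
        catalog.addUserRegion("M1", "tablea,row-520", "U2");
        catalog.addUserRegion("M2", "tablen,", "U3");
        catalog.addUserRegion("M2", "tablen,row-333", "U4");
        System.out.println(catalog.locate("tablea", "row-1"));   // U1
        System.out.println(catalog.locate("tablea", "row-520")); // U2
        System.out.println(catalog.locate("tablen", "row-400")); // U4
    }
}
```

A real client would typically cache the answers so that most reads and writes go straight to the right RegionServer without repeating the catalog lookups.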

ZooKeeper
- ZooKeeper is used to coordinate the members of distributed systems.
- It implements a hierarchical namespace where clients write and read to share state information. Each element in a path can store data and contain other elements.
- A typical ZooKeeper deployment has several servers, grouped in an ensemble. The middleware takes care of consistency, load balancing, crash recovery...
[Figure: a namespace tree with the paths /, /app1, /app2, /app2/info1 and /app2/info2; an ensemble of three ZooKeeper servers, one of them the leader; and several clients, which are the members of the distributed system being coordinated.]

HMaster
- Splits the key space of all tables and assigns the resulting regions to the available RegionServers.
- Balances load by re-assigning regions.
- Handles metadata change requests from clients.
- It is NOT involved in read/write operations.
- It uses ZooKeeper to keep track of RegionServers and to publish information for clients (such as which RegionServer holds the -ROOT- table).

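HBase Architecture
[Figure: the user creates/deletes tables and column families through the Master, and reads/writes data directly on the Region Servers. The Master manages tables and regions and coordinates through ZooKeeper. Each Region Server holds an HLog and HRegions; each HRegion contains Stores, whose HFiles are kept in HDFS.]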

HBase Client API
HBaseAdmin is used to:
- Create/delete tables: createTable(); deleteTable()

    HBaseAdmin hbadm = new HBaseAdmin(HBaseConfiguration.create());
    hbadm.createTable(new HTableDescriptor("TestTable"));

- Add/delete column families to tables: addColumn(); deleteColumn()

    hbadm.addColumn("TestTable", new HColumnDescriptor("TestColFamily"));

HBase Client API (2)
The HTable class is the basic data access entity:
- Read data with get(); getScanner(scan)

    HTable testTable = new HTable(HBaseConfiguration.create(), "TestTable");
    Get get = new Get(Bytes.toBytes("testrow"));
    Result result = testTable.get(get);
    // family -> qualifier -> timestamp -> value
    NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map = result.getMap();
    for (byte[] family : map.keySet())
        for (byte[] qual : map.get(family).keySet())
            for (Long ts : map.get(family).get(qual).keySet()) {
                String val = Bytes.toString(map.get(family).get(qual).get(ts));
            }

- Write data with put(); checkAndPut()

    Put put = new Put(Bytes.toBytes("testrow"));
    put.add(Bytes.toBytes("testfam"), Bytes.toBytes("testqual"), Bytes.toBytes("value"));
    testTable.put(put);

- Delete data with delete(); checkAndDelete()

HBase Advanced Features
- Filters: when scanning tables, Filter instances refine the results returned to the client. [Figure: the user's Scan carries a Filter to the Region Servers.]
- Counters: support for atomic read-and-update operations.
- Coprocessors: extend HBase functionality with users' custom code that is run by the framework. Examples: secondary indexes, access control...
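The point of counters is that a read-and-update happens as one atomic step, so concurrent increments are never lost. The sketch below illustrates that property with `ConcurrentHashMap.merge` from the Java standard library; it is an analogy for the behaviour, not HBase code, and the class and cell names are invented.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

// Sketch only: atomic read-and-update counters, modelled with a concurrent map.
class AtomicCounterSketch {
    private final ConcurrentHashMap<String, Long> counters = new ConcurrentHashMap<>();

    // Adds amount to the counter in one atomic step and returns the new value.
    long increment(String cell, long amount) {
        return counters.merge(cell, amount, Long::sum);
    }

    long get(String cell) {
        return counters.getOrDefault(cell, 0L);
    }

    public static void main(String[] args) {
        AtomicCounterSketch c = new AtomicCounterSketch();
        // Many threads increment the same cell; no update is lost.
        IntStream.range(0, 1000).parallel().forEach(i -> c.increment("page:hits", 1L));
        System.out.println(c.get("page:hits")); // 1000
    }
}
```

Without an atomic operation, each client would have to read the value, add to it and write it back, and two clients doing so at the same time could overwrite each other's update.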


(Example with HBase)