HBase: A Comprehensive Introduction. James Chin, Zikai Wang. Monday, March 14, 2011. CS 227 (Topics in Database Management), CIT 367

Overview

Overview: History
- Began as a project by Powerset to process massive amounts of data for natural language search
- Open-source implementation of Google's BigTable
  - Lots of semi-structured data
  - Commodity hardware
  - Horizontal scalability
  - Tight integration with MapReduce
- Developed as part of Apache's Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem)
- Provides a fault-tolerant way of storing large quantities of sparse data

Overview: What is HBase?
- Non-relational, distributed database
- Column-oriented
- Multi-dimensional
- High availability
- High performance

Data Model & Operators

Data Model
- A sparse, multi-dimensional, sorted map: {row, column, timestamp} -> cell
- Column = Column Family : Column Qualifier
- Rows are sorted lexicographically by row key
- Region: a contiguous set of sorted rows
- HBase tables: a large number of columns, a small number of column families (2-3)
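As a rough illustration of the map described above, the following Python sketch models a table as a nested mapping keyed by row, column, and timestamp. It is a toy in-memory model, not the HBase client API; the class `ToyTable` and its method names are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of {row, column, timestamp} -> cell (hypothetical, not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> timestamp -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the cell at `timestamp`, or the newest version if omitted."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None  # sparse: a missing cell is simply absent
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def row_keys(self):
        # Row keys are bytes, so sorted() gives lexicographic byte order.
        return sorted(self._rows)


table = ToyTable()
table.put(b"row2", "info:name", 1, b"Kaman, Chris")
table.put(b"row1", "info:name", 1, b"Nowitzki, Dirk")
table.put(b"row1", "info:name", 2, b"Dirk Nowitzki")
newest = table.get(b"row1", "info:name")
```

Only cells that were written occupy space, which is what makes the model suitable for sparse data: a row with three populated columns stores three entries regardless of how many columns other rows use.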

Operators
- Operations are based on row keys
- Single-row operations: Put, Get, Scan
- Multi-row operations: Scan, MultiPut
- No built-in joins (use MapReduce)
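The row-key-based operations above can be sketched with a small sorted store. This is a minimal Python illustration under stated assumptions, not the HBase client: `SortedRows` and its methods are hypothetical names, and the scan treats the start row as inclusive and the stop row as exclusive.

```python
import bisect


class SortedRows:
    """Toy row store that keeps row keys in byte order (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []  # sorted list of row keys (bytes)
        self._data = {}  # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self._data:
            bisect.insort(self._keys, row)
            self._data[row] = {}
        self._data[row][column] = value

    def get(self, row):
        return self._data.get(row)

    def scan(self, start_row=b"", stop_row=None):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        i = bisect.bisect_left(self._keys, start_row)
        while i < len(self._keys):
            key = self._keys[i]
            if stop_row is not None and key >= stop_row:
                break
            yield key, self._data[key]
            i += 1


rows = SortedRows()
for key in (b"user3", b"user1", b"video1", b"user2"):
    rows.put(key, "info:seen", b"1")
scanned = [key for key, _ in rows.scan(b"user1", b"user3")]
```

Because every operation is addressed by row key, a scan is just a walk over a contiguous key range; anything that needs to combine rows from different tables has to be done outside this interface.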

Physical Structures

Physical Structures: Data Organization
- Region: unit of distribution and availability
- Regions are split when they grow too large
- Max region size is a tuning parameter
  - Too low: prevents parallel scalability
  - Too high: makes things slow
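The splitting rule above can be sketched as follows. This is a simplified Python illustration: it measures region size in rows rather than bytes and splits at the middle key, both of which are assumptions made for the example, and `split_region` is a hypothetical helper.

```python
def split_region(rows, max_rows):
    """Split a sorted list of row keys into contiguous regions of at most max_rows rows.

    A region larger than `max_rows` is cut at its middle key, and each half
    is split again if it is still too large.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    return split_region(rows[:mid], max_rows) + split_region(rows[mid:], max_rows)


keys = sorted(b"row%02d" % i for i in range(10))
regions = split_region(keys, max_rows=3)
region_sizes = [len(region) for region in regions]
```

Each resulting region is still a contiguous, sorted key range, so regions can be assigned to different servers independently. Lowering `max_rows` yields more, smaller regions; raising it yields fewer, larger ones.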

Physical Structures: Need for Indexes
- HBase has no built-in support for secondary indexes
- The API only exposes operations by row key

  Row Key  Name             Position  Nationality
  1        Nowitzki, Dirk   PF        Germany
  2        Kaman, Chris     C         Germany
  3        Gasol, Pau       PF        Spain
  4        Fernandez, Rudy  SG        Spain

- Find all players from Spain?
  - With the built-in API, scan the entire table
  - Or manually build a secondary index table
  - Exploit the fact that rows are sorted lexicographically by row key, in byte order

Physical Structures: Secondary Index
Data table:

  Row Key  Name             Position  Nationality
  1        Nowitzki, Dirk   PF        Germany
  2        Kaman, Chris     C         Germany
  3        Gasol, Pau       PF        Spain
  4        Fernandez, Rudy  SG        Spain

Index table on the nationality column:

  Row Key    Dummy
  Germany 1  Germany 1
  Germany 2  Germany 2
  Spain 3    Spain 3
  Spain 4    Spain 4

- Query the index with a scan operation
  - Start row = "Spain"
  - To stop scanning, set a RowFilter with a BinaryPrefixComparator on the end value ("Spain")
- Range queries are also supported
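The index lookup can be sketched in Python as a prefix scan over sorted index keys. This is a minimal in-memory illustration, not HBase client code: the index row key here is the nationality, a 0x00 separator byte, and the data row key, which is an assumed encoding (the slide writes the keys as "Spain 3"), and `prefix_scan` and `rows_with_nationality` are hypothetical helpers.

```python
import bisect

# Data table reduced to the indexed column: data row key -> nationality.
nationality_by_row = {b"1": "Germany", b"2": "Germany", b"3": "Spain", b"4": "Spain"}

SEP = b"\x00"  # assumed separator between the indexed value and the data row key

# Index table: only the sorted row keys matter, the cell value is a dummy.
index_keys = sorted(nat.encode() + SEP + row for row, nat in nationality_by_row.items())


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix`, relying on the sort order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def rows_with_nationality(nationality):
    """Look up data row keys through the index instead of scanning the data table."""
    prefix = nationality.encode() + SEP
    return [key[len(prefix):] for key in prefix_scan(index_keys, prefix)]


spanish_rows = rows_with_nationality("Spain")
```

The scan starts at the first key with the wanted prefix and stops at the first key without it, so the cost depends on the number of matches rather than the size of the table. The separator keeps a prefix such as "Spain" from also matching a longer value that merely begins with the same letters.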

Physical Structures: Secondary Index (cont.)
- Find all power forwards from Spain?
  - Use a composite index
- Row keys are plain byte arrays
  - Is byte order the same as your desired order?
  - Convert strings, integers, floats, and decimals to bytes carefully
  - Default sorting is ascending; if descending indexes are needed, invert the bits of the encoded value

Physical Structures: More Indexing
- Lily's HBase Indexing Library
  - Aids in building and querying indexes in HBase
  - Hides the details of working with byte[] row keys
- HBase + full-text indexing and search systems
  - Apache Lucene (Apache Solr, Elasticsearch)
  - Lily, HAvroBase (HBase + Solr), HBasene (HBase + Lucene)
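The byte-ordering concerns above can be made concrete with a short Python sketch. The encodings shown (fixed-width big-endian integers, a 0x00 field separator, bitwise inversion for descending order) are one common approach and are assumptions of this example; the helper names are hypothetical and do not come from any HBase library.

```python
import struct

SEP = b"\x00"  # assumed separator between fields of a composite key


def encode_int(value):
    """Encode a signed 64-bit integer so that byte order matches numeric order.

    Adding 2**63 is equivalent to flipping the sign bit, which makes negative
    numbers sort before positive ones in unsigned big-endian byte order.
    """
    return struct.pack(">Q", value + (1 << 63))


def descending(encoded):
    """Invert every bit of a fixed-width encoding to reverse its sort order."""
    return bytes(b ^ 0xFF for b in encoded)


def composite_key(nationality, position, row_key):
    """Build an index row key of the form <nationality> 0x00 <position> 0x00 <row key>."""
    return nationality.encode() + SEP + position.encode() + SEP + row_key


values = [-5, 3, 10, -1, 0]
ascending_order = sorted(values, key=encode_int)
descending_order = sorted(values, key=lambda v: descending(encode_int(v)))

# Naive decimal strings do not sort numerically in byte order: b"10" < b"9".
naive_order = sorted([b"9", b"10"])

spain_pf_prefix = "Spain".encode() + SEP + "PF".encode() + SEP
key = composite_key("Spain", "PF", b"3")
```

With a composite key, "all power forwards from Spain" becomes a prefix scan on `spain_pf_prefix`. The bit inversion is only shown for fixed-width values; variable-length strings need extra care, since a string and its own prefix keep their relative order after inversion.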

System Architecture

System Architecture: Overview

System Architecture: Write-Ahead-Log Flow

System Architecture: WAL (cont.)

System Architecture: HFile and KeyValue

APIs

APIs: Overview
- Java
  - Get, Put, Delete, Scan
  - IncrementColumnValue
  - TableInputFormat - MapReduce source
  - TableOutputFormat - MapReduce sink
- REST
- Thrift
- Scala
- Jython
- Groovy DSL
- Ruby shell
- Java MR, Cascading, Pig, Hive

ACID Properties

ACID Properties
- HBase is not ACID-compliant, but does guarantee certain specific properties
- Atomicity
  - All mutations are atomic within a row. Any put will either wholly succeed or wholly fail.
  - APIs that mutate several rows will not be atomic across the multiple rows.
  - Mutations are seen to happen in a well-defined order for each row, with no interleaving.
- Consistency and Isolation
  - All rows returned via any access API will consist of a complete row that existed at some point in the table's history.

ACID Properties (cont.)
- Consistency of Scans
  - A scan is not a consistent view of a table. Scans do not exhibit snapshot isolation.
  - Those familiar with relational databases will recognize this isolation level as "read committed".
- Durability
  - All visible data is also durable data. That is to say, a read will never return data that has not been made durable on disk.
  - Any operation that returns a "success" code (e.g. does not throw an exception) will be made durable.
  - Any operation that returns a "failure" code will not be made durable (subject to the Atomicity guarantees above).
  - All reasonable failure scenarios will not affect any of the listed ACID guarantees.

Users

Users: Just to name a few

Users: Facebook - Messaging System

Users: Facebook - Messaging System (cont.)
- Previous solution: Cassandra
- Current solution: HBase
- Why? Cassandra's replication behavior

Users: Twitter - People Search

Users: Twitter - People Search (cont.)
- Customer Indexing
- Previous solution: offline process at a single node
- Current solution:
  - Import user data into HBase
  - A periodic MapReduce job reads from HBase
    - Hits FlockDB and other internal services in the mapper
    - Writes data to a sharded, replicated, horizontally scalable, in-memory, low-latency Scala service
- Vs. others:
  - HDFS: data is mutable
  - Cassandra: OLTP vs. OLAP?

Users: Mozilla - Socorro

Users: Mozilla - Socorro (cont.)
- Socorro, Mozilla's crash reporting system (https://crashstats.mozilla.com/products)
- Catches, processes, and presents crash data for Firefox, Thunderbird, Fennec, Camino, and SeaMonkey
- 2.5 million crash reports per week, 320GB per day
- Previous solution: NFS (raw data), PostgreSQL (analysis results)
  - 15% of crash reports are processed
- Current solution: Hadoop (processing) + HBase (storage)

HBase vs. RDBMS

HBase vs. RDBMS

  HBase                                    RDBMS
  Column-oriented                          Row-oriented (mostly)
  Flexible schema, add columns on the fly  Fixed schema
  Good with sparse tables                  Not optimized for sparse tables
  No query language                        SQL
  Wide tables                              Narrow tables
  Joins using MR, not optimized            Optimized for joins (small, fast ones too!)
  Tight integration with MR                Not really...

HBase vs. RDBMS (cont.)

  HBase                                    RDBMS
  De-normalize your data                   Normalize as you can
  Horizontal scalability: just add         Hard to shard and scale
    hardware
  Consistent                               Consistent
  No transactions                          Transactional
  Good for semi-structured data as well    Good for structured data
    as structured data

Questions?

Thanks!