
Big Data Processing, 2014/15
Lecture 10: HBase
Claudia Hauff (Web Information Systems)
ti2736b-ewi@tudelft.nl

Course content
- Introduction
- Data streams 1 & 2
- The MapReduce paradigm
- Looking behind the scenes of MapReduce: HDFS & Scheduling
- Algorithm design for MapReduce
- A high-level language for MapReduce: Pig Latin 1 & 2
- MapReduce is not a database, but HBase nearly is!
- Let's iterate a bit: Graph algorithms & Giraph
- How does all of this work together? ZooKeeper/Yarn

Learning objectives
- Explain how HBase differs from Hadoop and RDBMSs
- Decide for which use cases HBase is suitable
- Explain the HBase organisation
- Use HBase from the terminal

Let's start with the last point

HBase from the shell

hbase shell
create 'food', 'local'
put 'food', 'row1', 'local:source', 'Delft'
put 'food', 'row1', 'local:name', 'tomato'
put 'food', 'row1', 'local:price', '1.20'
put 'food', 'row1', 'local:price', '1.29'
put 'food', 'row2', 'local:name', 'pineapple'
scan 'food'
scan 'food', {COLUMNS => ['local:source']}
get 'food', 'row1'
count 'food'
describe 'food'
disable 'food'; drop 'food'
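The same operations can also be issued programmatically. Below is a minimal sketch using the HBase Java client API (HBase 1.x-style calls; it assumes a reachable cluster, hbase-client on the classpath, and the 'food' table with column family 'local' created as in the shell session above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FoodTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("food"))) {

      // put 'food', 'row1', 'local:name', 'tomato'
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("local"), Bytes.toBytes("name"), Bytes.toBytes("tomato"));
      table.put(put);

      // get 'food', 'row1'
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("local"), Bytes.toBytes("name"))));

      // scan 'food'
      try (ResultScanner scanner = table.getScanner(new Scan())) {
        for (Result row : scanner) {
          System.out.println(row);
        }
      }
    }
  }
}

As in the shell, a Put targets exactly one row; the column family ('local') is fixed by the schema, while the qualifier ('name') can be chosen freely at write time.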

Introduction to HBase

HBase

HBase is a distributed, column-oriented database built on top of HDFS. HBase is not ACID compliant.

"HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets." (Tom White)

"HBase tables are like those in an RDBMS, only cells are versioned, rows are sorted, and columns can be added on the fly." (Tom White)

HBase in the Hadoop ecosystem

[Figure: the Hadoop ecosystem stack. HDFS at the bottom; MapReduce (job scheduling/execution) and HBase (column DB) on top of it; Pig (data flow), Hive (SQL) and Sqoop (data transfer) above; ZooKeeper (coordination) and Avro (serialization) alongside. Image source: Cloudera]

Main points
- HBase is not an ACID-compliant database
- HBase does not support a full relational model
- HBase provides clients with a simple data model
- Clients have dynamic control over data layout and format
- Clients can control the locality of their data by creating appropriate schemas

History of HBase
- Started at the end of 2006
- Modelled after Google's Bigtable paper (2006) - a highly recommended read!
- January 2008: Hadoop becomes an Apache top-level project, HBase becomes a subproject
- May 2010: HBase becomes an Apache top-level project
- Contributors from Cloudera, Facebook, Intel, Hortonworks, etc.

Use case: Web tables

[Figure: search-engine architecture. A web crawler, guided by a crawl policy, fetches pages into a table of web documents; an indexer builds web document indices and an ad index, which the retrieval system uses to answer user queries.]

Use case: Web tables
- Table of crawled web pages and their attributes (content, language, anchor text, inlinks, ...) with the web page URL as row key
- Retrieval: [first round] simple approach, [second round] subset of pages ranked by complex machine learning algorithms
- The Webtable contains billions of rows (URLs)
- Batch processes run against the Webtable, deriving statistics (PageRank, etc.) and adding new columns for indexing, ranking, ...
- The Webtable is randomly accessed by many crawlers concurrently running at various rates and updating random rows
- Cached web pages are served to users in real time
- Different versions of a web page are used to compute the crawl frequency

Demands for HBase
- Structured data, scaling to petabytes
- Efficient handling of diverse data
  - wrt. data size (URLs, web pages, satellite imagery)
  - wrt. latency (backend bulk processing vs. real-time data serving)
- Efficient reads and writes of individual records

HBase vs. Hadoop
Hadoop's use case is batch processing:
- Not suitable for single-record lookups
- Not suitable for adding small amounts of data at all times
- Not suitable for making updates to existing records
HBase addresses Hadoop's weaknesses:
- Provides fast lookup of individual records
- Supports insertion of single records
- Supports record updating
- Not all columns are of interest to everyone; each client only wants a particular subset of columns (column-based storage)

HBase vs. Hadoop
HBase is built on top of HDFS.

                      Hadoop                          HBase
  writing             file append only, no updates    random write, updating
  reading             sequential                       random read, small range scan, full scan
  structured storage  up to the user                   sparse column-family data model

HBase vs. RDBMS

                  RDBMS                                 HBase
  use when        small to medium-volume applications   scaling up in terms of dataset size, read/write concurrency; random write, updating
  schema          fixed                                  flexible (columns added on the fly)
  orientation     row-oriented                           column-oriented
  query language  SQL                                    simple data access model
  size            terabytes (at most)                    billions of rows, millions of columns
  scaling up      difficult (workarounds)                add nodes to the cluster

Question: which tool is best suited for which use case?
- Data generated by the Large Hadron Collider is stored and analysed by researchers
- All pages crawled from the Web are stored by a search engine and served to clients via its search interface
- Data generated by the Hubble telescope is used in the SETI@Home project (served on request to users)
- Data generated by the Dutch tax office about tax payers is used to send warning letters ("you are late with your taxes")

How does an RDBMS scale up in practice?

RDBMS scaling story
- Initial public launch of the service
  → remotely hosted MySQL instance with a well-defined schema
- Service becomes popular; too many reads hitting the database
  → add memcached to cache common queries
- Service grows; too many writes hitting the database
  → scale MySQL vertically by buying a new (larger) server

RDBMS scaling story
- New features of the service increase query complexity; now there are too many joins
  → denormalise the data to reduce joins (not good DB practice)
- Rising popularity swamps the server; things are too slow
  → stop doing any server-side computations
- Some queries are still too slow
  → periodically prematerialise the most complex queries
- Reads are OK, but writes are getting slower and slower
  → drop secondary indexes
What now?

HBase data model

A row

[Figure: anatomy of an HBase row. The row name is a reversed URL; two column families, one of them holding incoming hyperlinks; three cells, each with a timestamp - different columns are written at different times; three versions of the web page are stored, since table cells are versioned.]

HBase vs. RDBMS

HBase is a sparse, distributed, persistent, multi-dimensional sorted map; nulls are free of cost.

[Image source: Tom White's Hadoop: The Definitive Guide]

Table rows
- Row keys are byte arrays, i.e. anything can be a row key
- Row keys are unique
- Rows are sorted lexicographically by row key (similar to a primary key index in an RDBMS), e.g.:
  row-1
  row-11
  row-111
  row-2
  row-22
  row-3
- Rows are composed of columns
- Read/write access to row data is atomic
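Because ordering is lexicographic on bytes, 'row-11' sorts before 'row-2', as the list above shows. Here is a self-contained sketch (no cluster needed, hypothetical keys) that mimics this ordering with a plain TreeMap and shows zero-padding as the usual fix when row keys encode numbers:

import java.util.TreeMap;

public class RowKeyOrdering {
  public static void main(String[] args) {
    // A TreeMap over Strings approximates HBase's lexicographic byte ordering
    TreeMap<String, String> rows = new TreeMap<>();
    for (String key : new String[] {"row-1", "row-11", "row-111", "row-2", "row-22", "row-3"}) {
      rows.put(key, "");
    }
    System.out.println(rows.keySet()); // [row-1, row-11, row-111, row-2, row-22, row-3]

    // Common fix: zero-pad numeric parts so lexicographic order matches numeric order
    TreeMap<String, String> padded = new TreeMap<>();
    for (int i : new int[] {1, 11, 111, 2, 22, 3}) {
      padded.put(String.format("row-%03d", i), "");
    }
    System.out.println(padded.keySet()); // [row-001, row-002, row-003, row-011, row-022, row-111]
  }
}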

Table columns & column families
- Columns are grouped into column families
  - semantic or topical boundaries
  - useful for compression, caching
  - millions of columns can live in a single column family
- Column family members share a common prefix, e.g. anchor:cnnsi.com, anchor:bbc.co.uk/sports
- A table's column families must be specified as part of the table schema definition: few families, few updates!
- Column family members can be added on demand, e.g. anchor:www.twitter.com/bbc_sports can be added to a table having column family anchor

Table columns & column families
- Tuning and storage specifications happen at the column-family level
- For performance reasons, a column family's members should share the same general access patterns and size characteristics
- All columns in a column family are stored together in the same low-level storage file: the HFile
- Columns can be written to at different times
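Since families are fixed at schema-definition time, creating a table means declaring them up front. A sketch using the HBase Java Admin API (class names follow the HBase 1.x client; the shell's create 'food', 'local' achieves the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("food"));
      // Tuning happens per column family: here, keep up to 3 cell versions
      HColumnDescriptor family = new HColumnDescriptor("local");
      family.setMaxVersions(3);
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}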

Table cells

(Table, RowKey, Family, Column, Timestamp) → Value

Conceptually, a table is a sorted map of sorted maps:
SortedMap<RowKey, SortedMap<Column, SortedMap<Timestamp, Value>>>

- Cells are indexed by a row key, a column key and a timestamp
- Table cells are versioned
  - default: auto-assigned timestamp (insertion time)
  - the timestamp can also be set explicitly by the user
- Cells are stored in decreasing timestamp order
- The user can specify how many versions to keep
- Cell values are uninterpreted arrays of bytes: clients need to know what to do with the data
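To make the nested-map view concrete, here is a self-contained sketch that models one row of the 'food' table with plain Java TreeMaps (timestamps in descending order, as HBase stores them). This models the concept only; it is not the client API:

import java.util.Comparator;
import java.util.TreeMap;

public class CellModel {
  public static void main(String[] args) {
    // column -> (timestamp -> value), newest timestamp first
    TreeMap<String, TreeMap<Long, String>> row = new TreeMap<>();

    TreeMap<Long, String> priceVersions = new TreeMap<>(Comparator.reverseOrder());
    priceVersions.put(1000L, "1.20"); // older version
    priceVersions.put(2000L, "1.29"); // newer version
    row.put("local:price", priceVersions);

    // The "current" value of a cell is the one with the highest timestamp
    System.out.println(row.get("local:price").firstEntry()); // 2000=1.29
  }
}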

Table cells
- The API provides a coherent view of rows as a combination of all columns and their most current versions
- By default the API returns the value with the most recent timestamp
- The user can query HBase for values before/after a specific timestamp, or for more than one version at a time
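For instance, with the Java client a Get can ask for several versions or restrict the time range. A sketch, assuming the 'food' table from the shell session (where 'local:price' was written twice) and the HBase 1.x API:

import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedGetExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("food"))) {

      Get get = new Get(Bytes.toBytes("row1"));
      get.setMaxVersions(2);  // return up to 2 versions per cell
      // get.setTimeRange(minStamp, maxStamp) would restrict results by timestamp

      Result result = table.get(get);
      // Both stored prices (1.29 and 1.20) come back, newest first
      List<Cell> cells = result.getColumnCells(Bytes.toBytes("local"), Bytes.toBytes("price"));
      for (Cell cell : cells) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    }
  }
}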

Summary
- HBase shell
- HBase vs. RDBMS/Hadoop
- HBase organisation

THE END