Scaling Up HBase, Hive, Pegasus

Similar documents

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Big Data Analytics Process & Building Blocks

How To Scale Out Of A Nosql Database

Comparing SQL and NOSQL databases

Hadoop Ecosystem B Y R A H I M A.

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Big Data and Scripting Systems build on top of Hadoop

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data and Scripting Systems build on top of Hadoop

Open source large scale distributed data management with Google s MapReduce and Bigtable

Big Data Course Highlights

Xiaoming Gao Hui Li Thilina Gunarathne

Big Data Too Big To Ignore

Hadoop IST 734 SS CHUNG

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Introduction to Hbase Gkavresis Giorgos 1470

Hadoop Job Oriented Training Agenda

CMU SCS Large Graph Mining Patterns, Tools and Cascade analysis

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems

Text Analytics (Text Mining)

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ITG Software Engineering

Hadoop and Map-Reduce. Swati Gore

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Integration of Apache Hive and HBase

Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Similarity Search in a Very Large Scale Using Hadoop and HBase

Graph Processing and Social Networks

How To Improve Performance In A Database

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

COURSE CONTENT Big Data and Hadoop Training

Workshop on Hadoop with Big Data

Implement Hadoop jobs to extract business value from large and varied data sets

A Brief Outline on Bigdata Hadoop

American International Journal of Research in Science, Technology, Engineering & Mathematics

Internals of Hadoop Application Framework and Distributed File System

Large scale processing using Hadoop. Ján Vaňo

Integrating Big Data into the Computing Curricula

How To Handle Big Data With A Data Scientist

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Cloud Computing at Google. Architecture

Cloudera Certified Developer for Apache Hadoop

Big Data and Apache Hadoop s MapReduce

Open source Google-style large scale data analysis with Hadoop

Oracle Big Data SQL Technical Update

A programming model in Cloud: MapReduce

Map Reduce & Hadoop Recommended Text:

Apache HBase. Crazy dances on the elephant back

The Hadoop Eco System Shanghai Data Science Meetup

Deploying Hadoop with Manager

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

Big Table A Distributed Storage System For Data

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec

Apache HBase: the Hadoop Database

HDFS. Hadoop Distributed File System

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Move Data from Oracle to Hadoop and Gain New Business Insights

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Hadoop implementation of MapReduce computational model. Ján Vaňo

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Peers Techno log ies Pv t. L td. HADOOP

NoSQL and Hadoop Technologies On Oracle Cloud

Cloud Scale Distributed Data Storage. Jürmo Mehine

Hadoop: The Definitive Guide

Data storing and data access

Estimating PageRank Values of Wikipedia Articles using MapReduce

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Big Data and Data Science: Behind the Buzz Words

Hadoop Parallel Data Processing

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Big Data With Hadoop

Sentimental Analysis using Hadoop Phase 2: Week 2

Transcription:

CSE 6242 A / CS 4803 DVA Mar 7, 2013 Scaling Up HBase, Hive, Pegasus Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

Lecture based on these two books. http://goo.gl/yncwn http://goo.gl/svztv 2

Last Time... Introduction to Pig Showed you a simple example run using Grunt (interactive shell) Find maximum temperature by year Easy to program, understand and maintain You write a few lines of Pig Latin, Pig turns them into MapReduce jobs Illustrate command creates sample dataset (from your full data) Helps you debug your Pig Latin Started with HBase 3

http://hbase.apache.org Built on top of HDFS Supports real-time read/write random access Scale to very large datasets, many machines Not relational, does NOT support SQL ( NoSQL = not only SQL ) Supports billions of rows, millions of columns Written in Java; can work with many languages Where does HBase come from? 4

HBase s history Hadoop & HDFS based on... Not designed for random access 2003 Google File System (GFS) paper 2004 Google MapReduce paper HBase based on... 2006 Google Bigtable paper This fixes that 5

How does HBase work? Column-oriented Column is the most basic unit (instead of row) Columns form a row A column can have multiple versions, each version stored in a cell Rows form a table Row key locates a row Rows sorted by row key lexicographically 6

Row key is unique Think of row key as the index of the table You look up a row using its row key Only one index per table (via row key) HBase does not have built-in support for multiple indices; support enabled via extensions 7

Rows sorted lexicographically (=alphabetically) hbase(main):001:0> scan 'table1' ROW COLUMN+CELL row-1 column=cf1:, timestamp=1297073325971... row-10 column=cf1:, timestamp=1297073337383... row-11 column=cf1:, timestamp=1297073340493... row-2 column=cf1:, timestamp=1297073329851... row-22 column=cf1:, timestamp=1297073344482... row-3 column=cf1:, timestamp=1297073333504... row-abc column=cf1:, timestamp=1297073349875... 7 row(s) in 0.1100 seconds row-10 comes before row-2. How to fix? 8

Rows sorted lexicographically (=alphabetically) hbase(main):001:0> scan 'table1' ROW COLUMN+CELL row-1 column=cf1:, timestamp=1297073325971... row-10 column=cf1:, timestamp=1297073337383... row-11 column=cf1:, timestamp=1297073340493... row-2 column=cf1:, timestamp=1297073329851... row-22 column=cf1:, timestamp=1297073344482... row-3 column=cf1:, timestamp=1297073333504... row-abc column=cf1:, timestamp=1297073349875... 7 row(s) in 0.1100 seconds row-10 comes before row-2. How to fix? Pad row-2 with a 0. i.e., row-02 8

Columns grouped into column families Column family is a new concept from HBase Why? Helps with organization, understanding, optimization, etc. In details... Columns in the same family stored in same file called HFile (inspired by Google s SSTable = large map whose keys are sorted) Apply compression on the whole family... 9

More on column family, column Column family Each table only supports a few families (e.g., <10) Due to shortcomings in implementation Family name must be printable Should be defined when table is created Shouldn t be changed often Each column referenced as family:qualifier Can have millions of columns Values can be anything that s arbitrarily long 10

Cell Value Timestamped Implicitly by system Or set explicitly by user Let you store multiple versions of a value = values over time Values stored in decreasing time order Most recent value can be read first 11

Time-oriented view of a row 12

Concise way to describe all these? HBase data model (= Bigtable s model) Sparse, distributed, persistent, multidimensional map Indexed by row key + column key + timestamp (Table, RowKey, Family, Column, Timestamp)! Value 13

... and the geeky way SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>> (Table, RowKey, Family, Column, Timestamp)! Value 14

An exercise How would you use HBase to create a webtable store snapshots of every webpage on the planet, over time? Also want to store each webpage s incoming and outgoing links, metadata (e.g., language), etc. 15

Details: How does HBase scale up storage & balance load? Automatically divide contiguous ranges of rows into regions Start with one region, split into two when getting too large 16

Details: How does HBase scale up storage & balance load? 17

How to use HBase Interactive shell Will show you an example, locally (on your computer, without using HDFS) Programmatically e.g., via Java, C++, Python, etc. 18

Example, using interactive shell Start HBase Start Interactive Shell Check HBase is running 19

Example: Create table, add values 20

Example: Scan (show all cell values) 21

Example: Get (look up a row) Can also look up a particular cell value, with a certain timestamp, etc. 22

Example: Delete a value 23

Example: Disable & drop table 24

RDBMS vs HBase RDBMS (=Relational Database Management System) MySQL, Oracle, SQLite, Teradata, etc. Really great for many applications Ensure strong data consistency, integrity Supports transactions (ACID guarantees)... 25

RDBMS vs HBase How are they different? Hbase when you don t know the structure/schema HBase supports sparse data (many columns, most values are not there) Use hbase if you only work with a small number of columns Relational databases good for getting whole rows HBase: Multiple versions of data RDBMS support multiple indices, minimize duplications Generally a lot cheaper to deploy HBase, for same size of data (petabytes) 26

Advanced topics to learn about Other ways to get, put, delete... (e.g., programmatically via Java) Doing them in batch Maintaining your cluster Configurations, specs for master and slaves? Administrating cluster Monitoring cluster s health Key design bad keys can decrease performance Integrating with MapReduce More... 27

Hive http://hive.apache.org Use SQL to run queries on large datasets Developed at Facebook Similar to Pig, Hive runs on your computer You write HiveQL (Hive s query language), which gets converted into MapReduce jobs 28

Example: starting Hive 29

Example: create table, load data Specify that data file is tab-separated Overwrite old file This data file will be copied to Hive s internal data directory 30

Example: Query So simple and boring! Or is it? 31

Same thing done with Pig records = LOAD 'input/ ncdc/ micro-tab/ sample.txt' AS (year:chararray, temperature:int, quality:int); filtered_records = FILTER records BY temperature!= 9999 AND (quality = = 0 OR quality = = 1 OR quality = = 4 OR quality = = 5 OR quality = = 9); grouped_records = GROUP filtered_records BY year; max_temp = FOREACH grouped_records GENERATE group, MAX( filtered_records.temperature); DUMP max_temp; 32

Graph Mining using Hadoop 33

www.cs.cmu.edu/~pegasus/

Generalized Iterated Matrix Vector Multiplication (GIM-V) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. ICDM 2009, Miami, Florida, USA. Best Application Paper (runner-up). 35

Generalized Iterated Matrix Vector Multiplication (GIMV) details PageRank proximity (RWR) Diameter Connected components Matrix vector Multiplication (iterated) (eigenvectors, Belief Prop. ) 36

Example: GIM-V At Work Connected Components 4 observations: Count Size 37

Example: GIM-V At Work Connected Components Count 1) 10K x larger than next Size 38

Example: GIM-V At Work Connected Components Count 2) ~0.7B singleton nodes Size 39

Example: GIM-V At Work Connected Components Count 3) SLOPE! Size 40

Example: GIM-V At Work Connected Components Count 300-size cmpt X 500. 1100-size cmpt Why? X 65. Why? 4) Spikes! Size 41

Example: GIM-V At Work Connected Components Count 4) Spikes! Size 41

Example: GIM-V At Work Connected Components Count suspicious financial-advice sites (not existing now) Size 42

Example: GIM-V At Work Connected Components Count Size 42

GIM-V At Work Connected Components over Time LinkedIn: 7.5M nodes and 58M edges Stable tail slope after the gelling point 43