Xiaoming Gao Hui Li Thilina Gunarathne
Outline

HBase and Bigtable storage
HBase use cases
HBase vs. RDBMS
Hands-on: load a CSV file into an HBase table with MapReduce
Motivation

Lots of semi-structured data
Horizontal scalability
Commodity hardware
Tight integration with MapReduce
RDBMSs don't scale
Google BigTable (paper)
Apache HBase

Open-source, distributed, column-oriented datastore
Hosts very large tables atop clusters of commodity hardware
For when you need random, realtime read/write access to your Big Data
Automatic partitioning
Linear scalability
HBase and Hadoop

HBase is built on Hadoop: fault tolerance, scalability
HBase uses HDFS for storage
Adds random read/write capability
Batch processing with MapReduce
Data Model: A Big Sorted Map

A big, sorted, sparse map
Tables consist of rows, each of which has a primary key (row key)
Each row can have any number of columns
Multidimensional: a map of maps
Versioned
Physical View

RowKey            Column key        Timestamp   Value
aaa@indiana.edu   BasicInfo:Name    t0          aaa
aaa@indiana.edu   BasicInfo:Office  t2          IE339
aaa@indiana.edu   BasicInfo:Office  t1          LH201
bbb@indiana.edu   BasicInfo:Name    t3          bbb
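The minimal sketch below (not from the original slides; the table name "userinfo" and the old-style HTable client are assumptions) shows how the cells in this physical view could be written and read back with the Java client API:

// Assumed imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.HTable/Put/Get/Result, org.apache.hadoop.hbase.util.Bytes
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "userinfo");            // hypothetical table with family "BasicInfo"

Put put = new Put(Bytes.toBytes("aaa@indiana.edu"));    // row key
put.add(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"), Bytes.toBytes("aaa"));
put.add(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Office"), Bytes.toBytes("LH201"));
table.put(put);

put = new Put(Bytes.toBytes("aaa@indiana.edu"));        // second write to the same column
put.add(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Office"), Bytes.toBytes("IE339"));
table.put(put);                                         // stored as a newer version (t2 > t1)

Result result = table.get(new Get(Bytes.toBytes("aaa@indiana.edu")));
// A Get returns the newest version of each column by default: prints IE339
System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Office"))));
table.close();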
HBase Cluster Architecture

Tables are split into regions and served by region servers
Regions are vertically divided by column families into stores
Stores are saved as files on HDFS
HBase vs. RDBMS

                        RDBMS                                      HBase
Data layout             Row-oriented                               Column-family-oriented
Schema                  Fixed                                      Flexible
Sparse data             Not good                                   Good
Query language          SQL (Join, Group)                          Get/Put/Scan (& Hive)
Hardware requirement    Large arrays of fast and expensive disks   Designed for commodity hardware
Max data size           TBs                                        ~1 PB
Read/write throughput   1000s of queries/second                    Millions of queries/second
Ease of use             Relational data modeling, easy to learn    A sorted map; significant learning curve, but communities and tools are increasing
When to Use HBase

Dataset scale: data sizes that cannot fit in a single-node RDBMS
High throughput: reads/writes are distributed as tables are distributed across nodes
Batch analysis: massive and convoluted queries can be executed in parallel via MapReduce jobs
Large cache
Sparse data
Random read/write
Use Cases

Facebook Analytics: real-time counters of URLs shared and preferred links
Twitter: 25 TB of messages every month
Mozilla: stores crash reports, 2.5 million per day
Programming with HBase

1. HBase shell: Scan, List, Create, Get, Put, Delete
2. Java API: Get, Put, Delete, Scan
3. Non-Java clients: Thrift, REST
4. HBase MapReduce API: hbase.mapreduce.TableMapper, hbase.mapreduce.TableReducer
5. High-level interfaces: Pig, Hive
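As a sketch of interface 2 (illustrative only; the table "userinfo" and the BasicInfo:Name column reuse the earlier data-model example and are not from the original slides), a scan over one column family looks like this:

// Assumed imports: org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.client.HTable/Scan/Result/ResultScanner, org.apache.hadoop.hbase.util.Bytes
HTable table = new HTable(HBaseConfiguration.create(), "userinfo");
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("BasicInfo"));             // only read this column family
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result r : scanner) {                            // rows come back in row-key order
    System.out.println(Bytes.toString(r.getRow()) + " => "
        + Bytes.toString(r.getValue(Bytes.toBytes("BasicInfo"), Bytes.toBytes("Name"))));
  }
} finally {
  scanner.close();                                      // always release the scanner
  table.close();
}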
HBase with Hadoop MapReduce

Works with MapReduce
TableInputFormat & TableOutputFormat
Provides Mapper and Reducer base classes
Utility methods: TableMapReduceUtil
Writable types
HBase with Hadoop

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleSummary");
job.setJarByClass(MySummaryJob.class);   // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);         // 1 is the default, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);   // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(sourceTable, scan, MyMapper.class, Text.class, IntWritable.class, job);
TableMapReduceUtil.initTableReducerJob(targetTable, MyTableReducer.class, job);
job.setNumReduceTasks(1);
boolean b = job.waitForCompletion(true);

More info: http://hbase.apache.org/book.html#mapreduce
HBase with Hadoop

public static class MyMapper extends TableMapper<Text, IntWritable> {
  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    text.set(val);
    context.write(text, ONE);
  }
}

More info: http://hbase.apache.org/book.html#mapreduce
HBase with Hadoop

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));
    context.write(null, put);
  }
}

More info: http://hbase.apache.org/book.html#mapreduce
Hands-on HBase MapReduce Programming*

HBase MapReduce API

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

* Developed by Li Hui
Hands-on: load CSV file into HBase table with MapReduce

CSV stands for comma-separated values
CSV files are common in many scientific fields
Hands-on: load CSV file into HBase table with MapReduce

Main entry point of the program

public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Wrong number of arguments: " + otherArgs.length);
    System.err.println("Usage: <csv file> <hbase table name>");
    System.exit(-1);
  } // end if
  Job job = configureJob(conf, otherArgs);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
} // main
Hands-on: load CSV file into HBase table with MapReduce

Configure the HBase MapReduce job

public static Job configureJob(Configuration conf, String[] args) throws IOException {
  Path inputPath = new Path(args[0]);
  String tableName = args[1];
  Job job = new Job(conf, tableName);
  job.setJarByClass(csv2hbase.class);
  FileInputFormat.setInputPaths(job, inputPath);
  job.setInputFormatClass(TextInputFormat.class);
  job.setMapperClass(csv2hbase.class);
  TableMapReduceUtil.initTableReducerJob(tableName, null, job);
  job.setNumReduceTasks(0);
  return job;
} // public static Job configureJob
Hands-on: load CSV file into HBase table with MapReduce

The map function

public void map(LongWritable key, Text line, Context context) throws IOException {
  // Input is a CSV file; each map() handles a single line, where the key is the line number
  // Each line is comma-delimited: row,family,qualifier,value
  String[] values = line.toString().split(",");
  if (values.length != 4) {
    return;
  }
  byte[] row = Bytes.toBytes(values[0]);
  byte[] family = Bytes.toBytes(values[1]);
  byte[] qualifier = Bytes.toBytes(values[2]);
  byte[] value = Bytes.toBytes(values[3]);
  Put put = new Put(row);
  put.add(family, qualifier, value);
  try {
    context.write(new ImmutableBytesWritable(row), put);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  // count and checkpoint are fields of the enclosing class
  if (++count % checkpoint == 0) {
    context.setStatus("Emitting Put " + count);
  }
}
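For concreteness, a hypothetical input.csv in the row,family,qualifier,value layout parsed above (the rows and values here are made up; the family f1 matches the table created in the next slide) could look like:

aaa@indiana.edu,f1,Name,aaa
aaa@indiana.edu,f1,Office,IE339
bbb@indiana.edu,f1,Name,bbb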
Hands-on: steps to load CSV file into HBase table with MapReduce

1. Check the HBase installation in the Ubuntu sandbox
   1. http://salsahpc.indiana.edu/sciencecloud/virtualbox_appliance_guide.html
   2. echo $HBASE_HOME
2. Start the Hadoop and HBase cluster
   1. cd $HADOOP_HOME
   2. . MultiNodesOneClickStartUp.sh $JAVA_HOME nodes
   3. start-hbase.sh
3. Create the HBase table with the specified data schema
   1. hbase shell
   2. create 'csv2hbase', 'f1'
   3. quit
4. Compile the program with Ant
   1. cd ~/hbasetutorial
   2. ant
5. Upload input.csv into HDFS
   1. hadoop dfs -mkdir input
   2. hadoop dfs -copyFromLocal input.csv input/input.csv
6. Run the program:
   hadoop jar dist/lib/cglhbasesummerschool.jar iu.pti.hbaseapp.csv2hbase input/input.csv csv2hbase
7. Check the inserted records in the HBase table
   1. hbase shell
   2. scan 'csv2hbase'
Hands-on: load CSV file into HBase table with MapReduce

HBase's built-in importtsv command also supports importing CSV (see the sketch below)
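For reference, a sketch of such an invocation (the column mapping f1:c1,f1:c2 and the input path are placeholders, not from the original slides): importtsv can read a comma-separated file by overriding its separator and mapping each input column to a fixed HBase column.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,f1:c1,f1:c2 \
  csv2hbase input/

Note that, unlike the MapReduce program above, which reads the family and qualifier from each line, importtsv assigns every input column to a column fixed on the command line.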
Other Features

Bulk loading (HFileOutputFormat)
Multiple masters
In-memory column families
Block cache
Bloom filters
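As an illustration of one of these features (a minimal sketch using the Java admin API; the table name "cachedtable" and family "f1" are assumptions), a column family can be pinned in memory and given extra versions when the table is created:

// Assumed imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.HBaseConfiguration,
// org.apache.hadoop.hbase.HTableDescriptor, org.apache.hadoop.hbase.HColumnDescriptor,
// org.apache.hadoop.hbase.client.HBaseAdmin
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("cachedtable");   // hypothetical table name
HColumnDescriptor family = new HColumnDescriptor("f1");
family.setInMemory(true);     // keep this family's data blocks in the block cache
family.setMaxVersions(3);     // retain up to three versions of each cell
desc.addFamily(family);
admin.createTable(desc);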
Coming up

Introduction to Pig
Demo: Search Engine System with MapReduce Technologies (Hadoop/HDFS/HBase/Pig)
Questions