Advanced Java Client API




© 2012 coreservlets.com and Dima May

Advanced Java Client API: Advanced Topics

Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/
Also see the customized Hadoop training courses (onsite or at public venues): http://courses.coreservlets.com/hadoop-training.html
For live customized Hadoop training (including prep for the Cloudera certification exam), please email info@coreservlets.com. Taught by a recognized Hadoop expert who spoke on Hadoop several times at JavaOne, and who uses Hadoop daily in real-world apps. Courses developed and taught by Marty Hall and coreservlets.com experts.

Agenda
- Scan API
- Scan Caching
- Scan Batching
- Filters

Scan Data Retrieval
Scans utilize HBase's sequential storage model: row ids are stored in sorted order. A scan allows you to read:
- An entire table
- A subset of a table, by specifying a start and/or stop key
The server transfers a limited number of rows at a time (1 row by default; this can be increased). You can stop the scan at any time and evaluate results at each row. Scans are similar to iterators.

Scan Rows
1. Construct HTable instance
2. Create and initialize Scan
3. Retrieve ResultScanner from HTable
4. Scan through rows
5. Close ResultScanner
6. Close HTable
** We are already familiar with HTable usage, so let's focus on steps 2 through 5. An end-to-end sketch of all six steps appears below.
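The sketch below walks the six steps against the "HBaseSamples" table used throughout these examples; the row keys are illustrative, and it assumes the 0.94-era client API that these slides use (HTable, Scan(byte[], byte[])):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanWorkflowSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable htable = new HTable(conf, "HBaseSamples");   // 1: construct HTable
            Scan scan = new Scan(Bytes.toBytes("row-01"),       // 2: create and initialize Scan
                                 Bytes.toBytes("row-05"));      //    (start inclusive, stop exclusive)
            ResultScanner scanner = htable.getScanner(scan);    // 3: retrieve ResultScanner
            try {
                for (Result result : scanner) {                 // 4: scan through rows
                    System.out.println(Bytes.toString(result.getRow()));
                }
            } finally {
                scanner.close();                                // 5: close ResultScanner
            }
            htable.close();                                     // 6: close HTable
        }
    }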

2: Create and Initialize Scan
The Scan class is a means to specify what you want to scan. Scan is very similar to Get, but allows you to scan through a range of keys. You provide start and stop keys; the start key is inclusive while the stop key is exclusive. If a start row id is NOT provided, the scan begins at the start of the table; if a stop row is NOT provided, the scan runs to the very end.

Construction options:
- new Scan(): scans through the entire table
- new Scan(startRow): begins the scan at the provided row, scans to the end of the table
- new Scan(startRow, stopRow): begins the scan at the provided startRow, stops when a row id is equal to or greater than the provided stopRow
- new Scan(startRow, filter): begins the scan at the provided row, scans to the end of the table, applies the provided filter

Once a Scan is constructed you can narrow it down further (very similar to Get):
- scan.addFamily(family)
- scan.addColumn(family, column)
- scan.setTimeRange(minStamp, maxStamp)
- scan.setMaxVersions(maxVersions)
- scan.setFilter(filter): to be covered later

For example:
    Scan scan = new Scan(toBytes(startRow), toBytes(stopRow));
    scan.addColumn(toBytes("metrics"), toBytes("counter"));
    scan.addFamily(toBytes("info"));
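As a brief illustration of the time-range and version narrowing (the timestamp values here are purely illustrative, and setTimeRange declares throws IOException, so this belongs in a method that throws or handles it):

    Scan scan = new Scan();
    scan.setTimeRange(1325376000000L, 1356998400000L); // only cells whose timestamps fall in [min, max)
    scan.setMaxVersions(3);                            // return up to 3 versions of each cell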

3: Retrieve ResultScanner
Retrieve a scanner from an existing HTable instance:
    ResultScanner scanner = htable.getScanner(scan);

4: Scan Through Rows
Use the result scanner by calling:
- Result next() throws IOException: same Result class as in the Get operation
- Result[] next(int nbRows) throws IOException: returns an array of up to nbRows Result objects; may contain fewer than nbRows

ResultScanner also implements the Iterable interface, so we can do something like this:
    ResultScanner scanner = htable.getScanner(scan);
    for (Result result : scanner) {
        // do stuff with result
    }
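The slides use the iterator style throughout; as a sketch, the array-returning variant drains the scanner in chunks instead (the chunk size of 10 is arbitrary):

    Result[] chunk = scanner.next(10);   // up to 10 rows; may return fewer
    while (chunk.length > 0) {
        for (Result result : chunk) {
            // do stuff with result
        }
        chunk = scanner.next(10);        // an empty array signals the end of the scan
    }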

5: Close ResultScanner
A scanner holds on to resources on the server. As soon as you are done with the scanner, call close(); it is required to release all the resources. Always use a try/finally block:
    ResultScanner scanner = htable.getScanner(scan);
    try {
        // do stuff with scanner
    } finally {
        scanner.close();
    }
Most of the examples omit the try/finally usage just to make them more readable.

ScanExample.java
    private static void scan(HTable htable, String startRow, String stopRow)
            throws IOException {
        System.out.println("Scanning from " +
            "[" + startRow + "] to [" + stopRow + "]");
        Scan scan = new Scan(toBytes(startRow), toBytes(stopRow));
        scan.addColumn(toBytes("metrics"), toBytes("counter"));
        ResultScanner scanner = htable.getScanner(scan);
        for (Result result : scanner) {
            byte[] value = result.getValue(
                toBytes("metrics"), toBytes("counter"));
            System.out.println("  " + Bytes.toString(result.getRow()) +
                " => " + Bytes.toString(value));
        }
        scanner.close();
    }

ScanExample.java
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");
        scan(htable, "row-03", "row-05");
        scan(htable, "row-10", "row-15");
        htable.close();
    }

ScanExample.java Output
    $ yarn jar $PLAY_AREA/HadoopSamples.jar hbase.ScanExample
    ...
    Scanning from [row-03] to [row-05]
      row-03 => val2
      row-04 => val3
    Scanning from [row-10] to [row-15]
      row-10 => val9
      row-11 => val10
      row-12 => val11
      row-13 => val12
      row-14 => val13

ResultScanner Lease
HBase protects itself from scanners that may hang indefinitely by implementing a lease-based mechanism. Scanners are given a configured lease; if a client doesn't report back within the lease time, HBase considers the client to be dead. The scanner is then expired on the server side and is no longer usable. The default lease is 60 seconds. To change the lease, modify hbase-site.xml:
    <property>
        <name>hbase.regionserver.lease.period</name>
        <value>120000</value>
    </property>
The same property drives the lease-based mechanism for both row locks and scanners, so make sure the value works well for both. A sketch of what an expired lease looks like to the client follows.
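In the 0.94-era client, an expired lease typically surfaces as a ScannerTimeoutException (an IOException subclass) on the next read; this is a sketch, assuming you are willing to simply reopen the scanner and resume:

    try {
        Result result;
        while ((result = scanner.next()) != null) {
            // slow per-row processing here can exceed the lease period
        }
    } catch (ScannerTimeoutException e) {
        // lease expired on the server side; the old scanner is unusable,
        // so reopen and resume from after the last row successfully read
        scanner = htable.getScanner(scan);
    }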

Scanner Caching
By default, each next() call results in one RPC (Remote Procedure Call) per row, even in the case of next(int nbRows):
    int numOfRPCs = 0;
    for (Result result : scanner) {
        numOfRPCs++;    // one remote call per row by default
    }
    System.out.println("Remote Calls: " + numOfRPCs);
This results in poor performance for small cells. Use scanner caching to fetch more than a single row per RPC.

Scanner Caching: Three Levels of Control
- HBase cluster: change the default for ALL scans
- HTable instance: configure caching per table instance; affects all the scans created from this table
- ResultScanner instance: configure caching per Scan instance; only affects the configured scan
You can configure at multiple levels if you require the precision. For example, certain tables may have really big cells, so you would lower the caching size for scans of those tables.

1: Configure Scanner Caching per HBase Cluster
Edit <hbase_home>/conf/hbase-site.xml:
    <property>
        <name>hbase.client.scanner.caching</name>
        <value>20</value>
    </property>
Restart the cluster to pick up the change. This changes caching to 20 for ALL scans, but can still be overridden per HTable or Scan instance.

2: Configure Scanner Caching per HTable Instance
Call htable.setScannerCaching(10) to change caching per HTable instance. This overrides the caching configured for the entire HBase cluster and affects every scan opened from this HTable instance. It can still be overridden at the Scan level.

3: Configure Scanner Caching per ResultScanner Instance
Set caching on the Scan instance and use it to retrieve the scanner:
    scan.setCaching(10);
    ResultScanner scanner = htable.getScanner(scan);
This only applies to this scanner, and overrides both the cluster and table based caching configurations.

Scanner Caching Considerations
Balance a low number of RPCs against memory usage:
- Consider the size of the data retrieved (cell size)
- Consider the available memory on the client and on the Region Server
Setting a higher caching number usually improves performance, but setting caching too high can have a negative effect: each remote call takes longer to transfer its data, and you can run out of the client's or Region Server's heap space and cause an OutOfMemoryError.

ScanCachingExample.java
    private static void printResults(HTable htable, Scan scan)
            throws IOException {
        // print caching attributes
        System.out.println("\nCaching table=" + htable.getScannerCaching() +
            ", scanner=" + scan.getCaching());
        // scan through the results
        ResultScanner scanner = htable.getScanner(scan);
        for (Result result : scanner) {
            byte[] value = result.getValue(
                toBytes("metrics"), toBytes("counter"));
            System.out.println("  " + Bytes.toString(result.getRow()) +
                " => " + Bytes.toString(value));
        }
        scanner.close();
    }

ScanCachingExample.java
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        Scan scan = new Scan();
        scan.addColumn(toBytes("metrics"), toBytes("counter"));
        printResults(htable, scan);    // caching not set: uses default

        htable.setScannerCaching(5);
        printResults(htable, scan);    // set at table level: overrides default

        scan.setCaching(10);
        printResults(htable, scan);    // set at Scan level: overrides default and table

        htable.close();
    }

ScanCachingExample.java Output
    $ yarn jar $PLAY_AREA/HadoopSamples.jar hbase.ScanCachingExample
    Caching table=1, scanner=-1
      row-01 => val0
      row-02 => val1
      ...
      row-16 => val15
    Caching table=5, scanner=-1
      row-01 => val0
      row-02 => val1
      ...
      row-16 => val15
    Caching table=5, scanner=10
      row-01 => val0
      row-02 => val1
      ...
      row-16 => val15
In the first run, the table defaulted to the setting of 1 and scanner caching is not set (-1), so 1 row is pulled per RPC. In the second, the table level was updated to 5, overriding the default, so 5 rows are pulled per RPC. In the third, the Scan level was set to 10, overriding both the default and the table level, so 10 rows are pulled per RPC.

Scanner Batching
A single row with lots of columns may not fit in memory. HBase batching allows you to page through columns on a per-row basis: it limits the number of columns retrieved by each ResultScanner.next() call, so a row with more columns than the batch size is split across multiple Result instances. Set the batch on the Scan instance; there is no option on a per-table or cluster basis:
    Scan scan = new Scan();
    scan.setBatch(10);

ScanBatchingExample.java
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        Scan scan = new Scan();
        scan.addFamily(toBytes("columns"));
        printResults(htable, scan);    // default batch: loads the entire row

        scan.setBatch(2);
        printResults(htable, scan);    // print results with batch=2

        htable.close();
    }

ScanBatchingExample.java
    private static void printResults(HTable htable, Scan scan)
            throws IOException {
        System.out.println("\n------------------");
        // display the batch size of this Scan instance
        System.out.println("Batch=" + scan.getBatch());
        ResultScanner scanner = htable.getScanner(scan);
        for (Result result : scanner) {
            System.out.println("Result: ");
            // for each result, print all the cells/KeyValues
            for (KeyValue keyVal : result.list()) {
                System.out.println("  " + Bytes.toString(keyVal.getFamily()) +
                    ":" + Bytes.toString(keyVal.getQualifier()) +
                    " => " + Bytes.toString(keyVal.getValue()));
            }
        }
        scanner.close();
    }

ScanBatchingExample.java Output
    ------------------
    Batch=-1
    Result:
      columns:col1 => colrow1val1
      columns:col2 => colrow1val2
      columns:col3 => colrow1val3
      columns:col4 => colrow1val4
    Result:
      columns:col1 => colrow2val1
      columns:col3 => colrow2val2
      columns:col4 => colrow2val3
    ------------------
    Batch=2
    Result:
      columns:col1 => colrow1val1
      columns:col2 => colrow1val2
    Result:
      columns:col3 => colrow1val3
      columns:col4 => colrow1val4
    Result:
      columns:col1 => colrow2val1
      columns:col3 => colrow2val2
    Result:
      columns:col4 => colrow2val3
With the default batch, the entire row is loaded into each Result instance; with batch=2, each Result instance carries at most 2 columns.

Caching and Batching
Caching and batching can be combined when scanning a set of rows, to balance memory usage against the number of RPCs. Batching creates multiple Result instances per row; caching specifies how many Result instances to return per RPC. To estimate the total number of RPCs:

    Total # of RPCs = (# of rows) * (columns per row)
                      / min(batch size, columns per row)
                      / (caching size)

Caching and Batching Example
[Figure: Batch = 2 and Caching = 9. Three rows (row1-row3) of six columns (c1-c6) each are split into two-column Results; the nine resulting Results fill one RPC, and a second RPC completes the scan. Source: Lars George, HBase: The Definitive Guide, O'Reilly Media, 2011.]
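Applying the estimate to the figure's numbers as a sanity check: 3 rows * 6 columns / min(2, 6) = 9 Results, and 9 Results / caching of 9 = 1 RPC, plus one more RPC for the client to discover the scan is complete, matching the two RPCs shown. A tiny helper (hypothetical, not part of any example in these slides) makes the arithmetic explicit:

    // Hypothetical helper illustrating the slide's RPC estimate.
    static long estimateRpcs(long rows, long colsPerRow, int batch, int caching) {
        long resultsPerRow = (long) Math.ceil((double) colsPerRow
                                              / Math.min(batch, colsPerRow));
        long totalResults = rows * resultsPerRow;
        // +1 for the final RPC that discovers the scan is complete
        return (long) Math.ceil((double) totalResults / caching) + 1;
    }

    // estimateRpcs(3, 6, 2, 9) == 2, matching the figure above.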

Filters
get() and scan() can already limit the data retrieved/transferred back to the client via column families, columns, timestamps, row ranges, etc. Filters add further control over the data returned; for example, selecting by key or value via regular expressions. A filter is optionally added to the Get or Scan parameter. Filters implement org.apache.hadoop.hbase.filter.Filter; you can use HBase's provided concrete implementations, or implement your own.

Filter Usage
1. Create/initialize an instance of a filter
2. Add it to a Scan or Get instance
3. Use the Scan or Get as before
The slides below demonstrate this with Scan; a Get-based sketch follows.
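Since the remaining examples all attach filters to a Scan, here is a brief sketch of the same three steps against a Get (the row key and filter choice are illustrative; CompareOp is the nested enum org.apache.hadoop.hbase.filter.CompareFilter.CompareOp in this API generation):

    Get get = new Get(Bytes.toBytes("row-04"));
    get.setFilter(new ValueFilter(CompareOp.EQUAL,      // 1: create/initialize the filter
        new SubstringComparator("3")));                 // 2: add it to the Get instance
    Result result = htable.get(get);                    // 3: use the Get as before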

1: Create/Initialize an Instance of a Filter
There are a lot of filters provided by HBase: ValueFilter, RowFilter, FamilyFilter, QualifierFilter, etc. There are 20+ today, and the list is growing. For example, ValueFilter lets you include only columns that have specific values. It uses an expression syntax built from a comparison operator and a comparator:
    Scan scan = new Scan();
    scan.setFilter(
        new ValueFilter(CompareOp.EQUAL,          // comparison operator
            new SubstringComparator("3")));       // comparator

ValueFilterExample.java
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable htable = new HTable(conf, "HBaseSamples");

        // only get cells whose value contains the string "3"
        Scan scan = new Scan();
        scan.setFilter(
            new ValueFilter(CompareOp.EQUAL,
                new SubstringComparator("3")));

        ResultScanner scanner = htable.getScanner(scan);
        for (Result result : scanner) {
            byte[] value = result.getValue(
                toBytes("metrics"), toBytes("counter"));
            System.out.println("  " + Bytes.toString(result.getRow()) +
                " => " + Bytes.toString(value));
        }
        scanner.close();
        htable.close();
    }

ValueFilterExample.java Output
    $ yarn jar $PLAY_AREA/HadoopSamples.jar hbase.ValueFilterExample
      row-04 => val3
      row-14 => val13

Filters
Filters are applied on the server side, reducing the amount of data transmitted over the wire. They still involve scanning rows, however; for example, a filter is not as efficient as using start/stop rows in the scan. Execution of a request with filters: the filter is constructed on the client side, serialized and transmitted to the server, and executed on the server side. Filter classes must therefore exist on both the client's and the server's CLASSPATH.

Execution of a Request with Filter(s)
1. Constructed on the client side
2. Serialized and transmitted to the server
3. Applied on the server side
[Figure: a Scan with its Filter is built on the client, shipped to each region server, and applied by the server-side Scanners.]

Sampling of HBase Provided Filters (descriptions from the HBase API)
- ColumnPrefixFilter: selects only those keys with columns that match a particular prefix
- FilterList: an implementation of Filter that represents an ordered list of filters
- FirstKeyOnlyFilter: returns only the first KV from each row
- KeyOnlyFilter: returns only the key component of each KV
- PrefixFilter: passes only rows whose row key matches a particular prefix
- QualifierFilter: filters based on the column qualifier
- RowFilter: filters based on the row key
- SkipFilter: a wrapper filter that filters an entire row if any of the KeyValue checks do not pass
- ValueFilter: filters based on column value
A sketch of two of these in use appears below.
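A brief sketch of two filters from the table above, in isolation (the row-key prefix is illustrative, matching the sample table's keys):

    // PrefixFilter: only rows whose key starts with "row-1"
    // (e.g., row-10 through row-16 in the sample table)
    Scan prefixScan = new Scan();
    prefixScan.setFilter(new PrefixFilter(Bytes.toBytes("row-1")));

    // FirstKeyOnlyFilter: at most one KeyValue per row; handy for row counting
    Scan countScan = new Scan();
    countScan.setFilter(new FirstKeyOnlyFilter());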

To Apply Multiple Filters
1. Create a FilterList and specify an operator:
   - Operator.MUST_PASS_ALL: a value is included if and only if all filters pass
   - Operator.MUST_PASS_ONE: a value is included if any of the specified filters pass
2. Add filters to the FilterList
3. Add it to a Scan or Get instance
4. Use the Scan or Get as before

FilterListExample.java
    // only load row ids, by chaining KeyOnlyFilter and FirstKeyOnlyFilter
    Scan scan = new Scan();
    FilterList filters = new FilterList(Operator.MUST_PASS_ALL);
    filters.addFilter(new KeyOnlyFilter());
    filters.addFilter(new FirstKeyOnlyFilter());
    scan.setFilter(filters);

    ResultScanner scanner = htable.getScanner(scan);
    for (Result result : scanner) {
        byte[] value = result.getValue(
            toBytes("metrics"), toBytes("counter"));
        System.out.println("  " + Bytes.toString(result.getRow()) +
            " => " + Bytes.toString(value));
    }
    scanner.close();

FilterListExample.java Output
    $ yarn jar $PLAY_AREA/HadoopSamples.jar hbase.FilterListExample
    anotherrow => null
    row-01 =>
    row-02 =>
    row-03 =>
    row-04 =>
    row-05 =>
    row-06 =>
    row-07 =>
    row-08 =>
    row-09 =>
    row-10 =>
    row-11 =>
    row-12 =>
    row-13 =>
    row-14 =>
    row-15 =>
    row-16 =>
    row1 => null
Only row ids were retrieved, because KeyOnlyFilter and FirstKeyOnlyFilter were applied to the Scan request.

© 2012 coreservlets.com and Dima May
Wrap-Up

Summary
We learned about:
- Scan API
- Scan Caching
- Scan Batching
- Filters

© 2012 coreservlets.com and Dima May
Questions?
More info:
- http://www.coreservlets.com/hadoop-tutorial/ Hadoop programming tutorial
- http://courses.coreservlets.com/hadoop-training.html Customized Hadoop training courses, at public venues or onsite at your organization
- http://courses.coreservlets.com/course-materials/java.html General Java programming tutorial
- http://www.coreservlets.com/java-8-tutorial/ Java 8 tutorial
- http://www.coreservlets.com/jsf-tutorial/jsf2/ JSF 2.2 tutorial
- http://www.coreservlets.com/jsf-tutorial/primefaces/ PrimeFaces tutorial
- http://coreservlets.com/ JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training