HBase Key Design. 2012 coreservlets.com and Dima May




2012 coreservlets.com and Dima May

HBase Key Design

Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/
Also see the customized Hadoop training courses (onsite or at public venues): http://courses.coreservlets.com/hadoop-training.html

Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location.

For live customized Hadoop training (including prep for the Cloudera certification exam), please email info@coreservlets.com. Taught by a recognized Hadoop expert who spoke on Hadoop several times at JavaOne and who uses Hadoop daily in real-world apps. Available at public venues, or customized versions can be held on-site at your organization.

Courses developed and taught by Marty Hall: JSF 2.2, PrimeFaces, servlets/JSP, Ajax, jQuery, Android development, Java 7 or 8 programming, or a custom mix of topics. Courses available in any state or country; Maryland/DC-area companies can also choose afternoon/evening courses.

Courses developed and taught by coreservlets.com experts (edited by Marty): Spring, Hibernate/JPA, GWT, Hadoop, HTML5, RESTful Web Services. Contact info@coreservlets.com for details.

Agenda
- Storage Model
- Querying Granularity
- Table Design: Tall-Narrow Tables and Flat-Wide Tables

Column Family Storage Unit
- The column family is how data is separated on disk, not individual columns (unlike a typical column-oriented database).
- For a given family, cells are stored together in the same file (HFile) on HDFS.
- Cells that are not set are not stored: nulls are never persisted, and data must be explicitly set in order to be saved (see the sparse Put sketch below).
- The coordinates of each cell are persisted along with its value.
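To make the "family is the storage unit" point concrete, here is a minimal sketch that is not from the slides: it writes one row with a cell in each of two hypothetical families ("info" and "stats") of a hypothetical table ("demo"). Only the two explicitly set cells are persisted, each in its own family's store files; nothing is written for columns that are never set.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SparsePutExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a table "demo" with families "info" and "stats" already exists
        HTable table = new HTable(conf, "demo");

        Put put = new Put(Bytes.toBytes("row1"));
        // Only the cells set here are stored: one per family, each in that
        // family's own HFile; columns that are never set simply do not exist
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("visits"), Bytes.toBytes(1L));
        table.put(put);

        table.close();
    }
}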

Column Family Storage Unit
- Rows are stored as a set of cells.
- Cells are stored with their coordinates, ordered by row id and then by column qualifier.
- Each cell is represented by the KeyValue class (see the sketch below).

KeyValues in one HFile (ColumnFamily/abcdef):
Row1:ColumnFamily:column1:timestamp1:value1
Row1:ColumnFamily:column2:timestamp1:value2
Row1:ColumnFamily:column3:timestamp2:value3
Row2:ColumnFamily:column1:timestamp1:value4
Row3:ColumnFamily:column1:timestamp1:value5
Row3:ColumnFamily:column3:timestamp1:value6
Row4:ColumnFamily:column2:timestamp3:value7

HTable = Set of Column Families
A table is made up of a set of column families, and each family is stored in its own file; for example, the cells below would all live in the HFile Family1/abcdefU:
Row1:ColumnFamily:column1:timestamp1:value1
Row1:ColumnFamily:column2:timestamp1:value2
Row1:ColumnFamily:column3:timestamp2:value3
Row2:ColumnFamily:column1:timestamp1:value4
Row3:ColumnFamily:column1:timestamp1:value5
Row3:ColumnFamily:column3:timestamp1:value6
Row4:ColumnFamily:column2:timestamp3:value7
Row4:ColumnFamily:column3:timestamp4:value8
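A quick way to see these coordinates from the client is to look at the raw KeyValue objects in a Result. The sketch below is not from the slides and the table name is a placeholder; it prints the row, family, qualifier, and timestamp of every cell in one row.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueCoordinatesExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo");   // placeholder table name

        Result result = table.get(new Get(Bytes.toBytes("Row1")));
        // Each cell comes back as a KeyValue carrying its full coordinates
        for (KeyValue kv : result.raw()) {
            System.out.println(Bytes.toString(kv.getRow()) + ":"
                    + Bytes.toString(kv.getFamily()) + ":"
                    + Bytes.toString(kv.getQualifier()) + ":"
                    + kv.getTimestamp() + ":"
                    + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}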

Querying Data From HBase
- By row id or a range of row ids: only the rows in the provided range are loaded, and only the region servers holding those ids are touched. If no further criteria are provided, all cells for all families are loaded, which means reading multiple files from HDFS.
- By column family: eliminates the need to load the other families' store files. Strongly suggested whenever you know which families you need.
- By timestamp/version: entire store files can be skipped, because the timestamp range contained in each file is considered.
- By column name/qualifier: does not eliminate the need to load the store file, but does eliminate the need to transfer the unwanted data.
- By value: cells can be skipped by using filters. This is the slowest selection criterion, since every cell is examined.
- Filters can be applied to each selection criterion: rows, families, columns, timestamps, and values.
- Multiple query criteria can be combined, and most of the time they should be (the sketch below shows how each criterion maps onto the client API).
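As a rough illustration (not from the slides; the table, family, and qualifier names are placeholders), the sketch below combines all of the above criteria on a single Scan: a row range, a column family, a column qualifier, a time range, and a value filter.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCriteriaExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "demo");            // placeholder table

        // By row id range: only rows in [rowA, rowZ) are read
        Scan scan = new Scan(Bytes.toBytes("rowA"), Bytes.toBytes("rowZ"));

        // By column family: other families' store files are never opened
        scan.addFamily(Bytes.toBytes("info"));

        // By column qualifier: unwanted cells are not transferred to the client
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));

        // By timestamp: store files whose time range falls outside can be skipped
        scan.setTimeRange(0L, System.currentTimeMillis());

        // By value: the slowest criterion; every remaining cell is examined
        scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("Alice")));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(result);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}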

Querying Granularity vs. Performance

Cell coordinate:        Row | Column Family | Column Name | Timestamp | Value
Can skip rows:           x  |       x       |      x      |     x     |
Can skip store files:       |       x       |             |     x     |

Granularity and control increase from Row toward Value, while performance decreases in the same direction.

Source: Lars George, HBase: The Definitive Guide, O'Reilly Media, 2011.

Designing to Store Data in Tables
Two options: Tall-Narrow or Flat-Wide.
- Tall-Narrow: a small number of columns and a large number of rows.
- Flat-Wide: a small number of rows and a large number of columns.
Tall-Narrow is generally recommended:
- Store parts of the cell data in the row id, which makes data retrieval faster.
- HBase splits on row boundaries.

Real World Problem: Blog App
Let's say we are implementing a blog management application:
- Users will enter blogs; some users will have many blogs and some only a few.
- You'll want to query by user and by date range.
- We will implement a simple DataFacade to store and access Blog objects.

Real World Problem: Blog App

import java.util.Date;

// Encapsulates the username, the blog entry, and the date the blog was created
public class Blog {
    private final String username;
    private final String blogEntry;
    private final Date created;

    public Blog(String username, String blogEntry, Date created) {
        this.username = username;
        this.blogEntry = blogEntry;
        this.created = created;
    }

    public String getUsername() { return username; }
    public String getBlogEntry() { return blogEntry; }
    public Date getCreated() { return created; }

    @Override
    public String toString() {
        return this.username + "(" + created + "): " + blogEntry;
    }
}

Real World Problem: Blog App
There are two ways to arrange the data:
- Flat-Wide: each row represents a single user.
- Tall-Narrow: each row represents a single blog, so multiple rows represent a single user.

Let's assume a simple interface:

import java.io.IOException;
import java.util.Date;
import java.util.List;

public interface DataFacade {
    public List<Blog> getBlogs(String userId, Date startDate, Date endDate) throws IOException;
    public void save(Blog blog) throws IOException;
    public void close() throws IOException;
}

1: Flat-Wide Solution
- All of a user's blogs are stored in a single row and a single family.
- Each blog is stored in its own column, whose qualifier encodes the blog's timestamp (t1 ... tn).

Row id   | entry:t1             | entry:t2             | entry:tn
userid1  | Blah blah blah...    | Important news...    | Interesting stuff...
userid2  | Interesting stuff... | Blah blah blah...    | Important news...
userid3  | Important news...    | Interesting stuff... | Blah blah blah...

The table is flat (few rows) and wide (many columns). Columns are ordered, so a filter can select a subset of columns or page through them. (A sketch of creating the two example tables follows below.)
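The slides assume the two tables already exist. The following is a minimal sketch, not shown in the original slides, of how they might be created with the same generation of the client API used in the examples; the table names ("Blog_FlatAndWide", "Blog_TallAndNarrow") and the single family name ("entry") match the facades below, and everything else is an assumption.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateBlogTables {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Both designs use a single column family named "entry"
        for (String name : new String[] { "Blog_FlatAndWide", "Blog_TallAndNarrow" }) {
            if (!admin.tableExists(name)) {
                HTableDescriptor descriptor = new HTableDescriptor(name);
                descriptor.addFamily(new HColumnDescriptor("entry"));
                admin.createTable(descriptor);
            }
        }
        admin.close();
    }
}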

1: Flat-Wide Solution
FlatAndWideTableDataFacade.java

import static org.apache.hadoop.hbase.util.Bytes.toBytes;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.util.Bytes;

public class FlatAndWideTableDataFacade implements DataFacade {
    private final HTable table;
    private final static byte[] ENTRY_FAMILY = toBytes("entry");

    public FlatAndWideTableDataFacade(Configuration conf) throws IOException {
        table = new HTable(conf, "Blog_FlatAndWide");
    }

    @Override
    public void close() throws IOException {
        table.close();
    }

    // entry:<timestamp> column format: each blog is stored in its own column;
    // for a given user, all blogs are stored in the same row
    @Override
    public void save(Blog blog) throws IOException {
        Put put = new Put(toBytes(blog.getUsername()));
        put.add(ENTRY_FAMILY,
                toBytes(dateToColumn(blog.getCreated())),
                toBytes(blog.getBlogEntry()));
        table.put(put);
    }

    // Encode and decode the date as a column name; columns are ordered, so
    // reversing the timestamp and zero-padding it forces the latest records
    // to be retrieved first
    private String dateToColumn(Date date) {
        String reversedDateAsStr = Long.toString(Long.MAX_VALUE - date.getTime());
        StringBuilder builder = new StringBuilder();
        for (int i = reversedDateAsStr.length(); i < 19; i++) {
            builder.append('0');
        }
        builder.append(reversedDateAsStr);
        return builder.toString();
    }

    public Date columnToDate(String column) {
        long reverseStamp = Long.parseLong(column);
        return new Date(Long.MAX_VALUE - reverseStamp);
    }

1: Flat-Wide Solution
FlatAndWideTableDataFacade.java (continued)

    @Override
    public List<Blog> getBlogs(String userId, Date startDate, Date endDate)
            throws IOException {
        Get get = new Get(toBytes(userId));   // get the row by user id

        // Use the filter mechanism to retrieve only the requested date range;
        // this works because the timestamp is encoded in the column name
        FilterList filters = new FilterList();
        filters.addFilter(new QualifierFilter(CompareOp.LESS_OR_EQUAL,
                new BinaryComparator(toBytes(dateToColumn(startDate)))));
        filters.addFilter(new QualifierFilter(CompareOp.GREATER_OR_EQUAL,
                new BinaryComparator(toBytes(dateToColumn(endDate)))));
        get.setFilter(filters);

        Result result = table.get(get);
        Map<byte[], byte[]> columnValueMap = result.getFamilyMap(ENTRY_FAMILY);

        List<Blog> blogs = new ArrayList<Blog>();
        for (Map.Entry<byte[], byte[]> entry : columnValueMap.entrySet()) {
            Date createdOn = columnToDate(Bytes.toString(entry.getKey()));
            String blogEntry = Bytes.toString(entry.getValue());
            blogs.add(new Blog(userId, blogEntry, createdOn));
        }
        return blogs;
    }
}

2: Tall-Narrow Solution
- One blog per row.
- The row id is composed of the user id and a timestamp.

Row id       | entry:blog
userid1_t1   | Blah blah blah...
userid1_t2   | Interesting stuff...
...          | ...
userid1_tn   | Important news...
...          | ...
userid3_500  | Blah blah blah...
userid3_501  | Blah blah blah...

The table is tall (many rows) and narrow (few columns).

2: Tall-Narrow Solution
TallAndNarrowTableDataFacade.java

import static org.apache.hadoop.hbase.util.Bytes.toBytes;
import static org.apache.hadoop.hbase.util.Bytes.toLong;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TallAndNarrowTableDataFacade implements DataFacade {
    private final HTable table;
    private final static byte[] ENTRY_FAMILY = toBytes("entry");
    private final static byte[] BLOG_COLUMN = toBytes("blog");
    private final static byte[] CREATED_COLUMN = toBytes("created");
    private final static byte[] USER_COLUMN = toBytes("user");
    private final static char KEY_SPLIT_CHAR = '_';

    public TallAndNarrowTableDataFacade(Configuration conf) throws IOException {
        table = new HTable(conf, "Blog_TallAndNarrow");
    }

    @Override
    public void close() throws IOException {
        table.close();
    }

    // Save one blog per row
    @Override
    public void save(Blog blog) throws IOException {
        // Encode the username and creation timestamp into the row key
        Put put = new Put(toBytes(blog.getUsername()
                + KEY_SPLIT_CHAR + convertForId(blog.getCreated())));

        // Store the blog entry and the other metadata in columns
        put.add(ENTRY_FAMILY, USER_COLUMN, toBytes(blog.getUsername()));
        put.add(ENTRY_FAMILY, BLOG_COLUMN, toBytes(blog.getBlogEntry()));
        put.add(ENTRY_FAMILY, CREATED_COLUMN, toBytes(blog.getCreated().getTime()));
        table.put(put);
    }

2: Tall-Narrow Solution
TallAndNarrowTableDataFacade.java (continued)

    private String convertForId(Date date) {
        return convertForId(date.getTime());
    }

    // Encode the date into a string used in the row id; ids are ordered, so
    // reversing the timestamp and zero-padding it forces the latest records
    // to be retrieved first
    private String convertForId(long timestamp) {
        String reversedDateAsStr = Long.toString(Long.MAX_VALUE - timestamp);
        StringBuilder builder = new StringBuilder();
        for (int i = reversedDateAsStr.length(); i < 19; i++) {
            builder.append('0');
        }
        builder.append(reversedDateAsStr);
        return builder.toString();
    }

    @Override
    public List<Blog> getBlogs(String userId, Date startDate, Date endDate)
            throws IOException {
        List<Blog> blogs = new ArrayList<Blog>();

        // Carefully construct the start and stop keys of the scan;
        // recall that the row id is userId + reversed timestamp
        Scan scan = new Scan(
                toBytes(userId + KEY_SPLIT_CHAR + convertForId(endDate)),
                toBytes(userId + KEY_SPLIT_CHAR + convertForId(startDate.getTime() - 1)));
        scan.addFamily(ENTRY_FAMILY);

        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {   // each row is one blog
            String user = Bytes.toString(result.getValue(ENTRY_FAMILY, USER_COLUMN));
            String blog = Bytes.toString(result.getValue(ENTRY_FAMILY, BLOG_COLUMN));
            long created = toLong(result.getValue(ENTRY_FAMILY, CREATED_COLUMN));
            blogs.add(new Blog(user, blog, new Date(created)));
        }
        return blogs;
    }
}

TableDesignExample.java

import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TableDesignExample {
    private final static DateFormat DATE_FORMAT =
            new SimpleDateFormat("yyyy.MM.dd hh:mm:ss");

    public static void main(String[] args) throws IOException {
        List<Blog> testBlogs = getTestBlogs();
        Configuration conf = HBaseConfiguration.create();

        // Try both facades by running them through the same code
        DataFacade facade = new FlatAndWideTableDataFacade(conf);
        printTestBlogs(testBlogs);
        exerciseFacade(facade, testBlogs);
        facade.close();

        facade = new TallAndNarrowTableDataFacade(conf);
        exerciseFacade(facade, testBlogs);
        facade.close();
    }

    private static void printTestBlogs(List<Blog> testBlogs) {
        System.out.println("----------------");
        System.out.println("Test set has [" + testBlogs.size() + "] Blogs:");
        System.out.println("----------------");
        for (Blog blog : testBlogs) {
            System.out.println(blog);
        }
    }

    private static List<Blog> getTestBlogs() {
        List<Blog> blogs = new ArrayList<Blog>();
        blogs.add(new Blog("user1", "Blog1", new Date(11111111)));
        blogs.add(new Blog("user1", "Blog2", new Date(22222222)));
        blogs.add(new Blog("user1", "Blog3", new Date(33333333)));
        blogs.add(new Blog("user2", "Blog4", new Date(44444444)));
        blogs.add(new Blog("user2", "Blog5", new Date(55555555)));
        return blogs;
    }

TableDesignExample.java (continued)

    public static void exerciseFacade(DataFacade facade, List<Blog> testBlogs)
            throws IOException {
        System.out.println("----------------");
        System.out.println("Running against facade: " + facade.getClass());
        System.out.println("----------------");
        for (Blog blog : testBlogs) {
            facade.save(blog);
        }
        getBlogs(facade, "user1", new Date(22222222), new Date(33333333));
        getBlogs(facade, "user2", new Date(55555555), new Date(55555555));
    }

    private static void getBlogs(DataFacade facade, String user,
            Date start, Date end) throws IOException {
        System.out.println("Selecting blogs for user [" + user + "] "
                + "between [" + DATE_FORMAT.format(start) + "] "
                + "and [" + DATE_FORMAT.format(end) + "]");
        for (Blog blog : facade.getBlogs(user, start, end)) {
            System.out.println("    " + blog);
        }
    }
}

Run TableDesignExample

$ yarn jar $PLAY_AREA/HadoopSamples.jar \
      hbase.tabledesign.TableDesignExample

TableDesignExample's Output

Test set has [5] Blogs:
----------------
[Blog1] by [user1] on [1969.12.31 10:05:11]
[Blog2] by [user1] on [1970.01.01 01:10:22]
[Blog3] by [user1] on [1970.01.01 04:15:33]
[Blog4] by [user2] on [1970.01.01 07:20:44]
[Blog5] by [user2] on [1970.01.01 10:25:55]
----------------
Running against facade: class hbase.tabledesign.FlatAndWideTableDataFacade
----------------
Selecting blogs for user [user1] between [1970.01.01 01:10:22] and [1970.01.01 04:15:33]
    [Blog3] by [user1] on [1970.01.01 04:15:33]
    [Blog2] by [user1] on [1970.01.01 01:10:22]
Selecting blogs for user [user2] between [1970.01.01 10:25:55] and [1970.01.01 10:25:55]
    [Blog5] by [user2] on [1970.01.01 10:25:55]
----------------
Running against facade: class hbase.tabledesign.TallAndNarrowTableDataFacade
----------------
Selecting blogs for user [user1] between [1970.01.01 01:10:22] and [1970.01.01 04:15:33]
    [Blog3] by [user1] on [1970.01.01 04:15:33]
    [Blog2] by [user1] on [1970.01.01 01:10:22]
Selecting blogs for user [user2] between [1970.01.01 10:25:55] and [1970.01.01 10:25:55]
    [Blog5] by [user2] on [1970.01.01 10:25:55]

Flat-Wide vs. Tall-Narrow
- Tall-Narrow has superior query granularity: querying by row id can skip rows and store files. Flat-Wide has to query by column name, which skips neither rows nor store files.
- Tall-Narrow scales better: HBase splits at row boundaries, so it will shard at blog boundaries. The Flat-Wide solution only works if all of a single user's blogs fit into one row.
- Flat-Wide provides atomic operations: operations are atomic on a per-row basis, and there is no built-in concept of a transaction spanning multiple rows or tables (see the sketch below).

2012 coreservlets.com and Dima May

Wrap-Up
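As a hedged illustration of the per-row atomicity point (not from the slides): because the flat-wide layout keeps all of a user's blogs in one row, a single-row operation such as checkAndPut can update that user's data atomically, whereas nothing comparable spans the many rows of the tall-narrow layout. The "latest" qualifier below is hypothetical and used only for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAtomicityExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Blog_FlatAndWide");

        byte[] row = Bytes.toBytes("userid1");
        byte[] family = Bytes.toBytes("entry");
        byte[] marker = Bytes.toBytes("latest");    // hypothetical qualifier

        Put put = new Put(row);
        put.add(family, marker, Bytes.toBytes("t1"));

        // Atomic within the row: the put is applied only if "entry:latest"
        // is currently absent (expected value null); there is no multi-row
        // or multi-table equivalent
        boolean applied = table.checkAndPut(row, family, marker, null, put);
        System.out.println("applied = " + applied);
        table.close();
    }
}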

Summary
We learned:
- How data is stored
- Querying granularity
- Tall-Narrow tables
- Flat-Wide tables

2012 coreservlets.com and Dima May

Questions?
More info:
- http://www.coreservlets.com/hadoop-tutorial/ (Hadoop programming tutorial)
- http://courses.coreservlets.com/hadoop-training.html (customized Hadoop training courses, at public venues or onsite at your organization)
- http://courses.coreservlets.com/course-materials/java.html (general Java programming tutorial)
- http://www.coreservlets.com/java-8-tutorial/ (Java 8 tutorial)
- http://www.coreservlets.com/jsf-tutorial/jsf2/ (JSF 2.2 tutorial)
- http://www.coreservlets.com/jsf-tutorial/primefaces/ (PrimeFaces tutorial)
- http://coreservlets.com/ (JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training)