Integration of Apache Hive and HBase



Similar documents
Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

Big Data and Scripting Systems build on top of Hadoop

How to Install and Configure EBF15328 for MapR or with MapReduce v1

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Qsoft Inc

The Hadoop Eco System Shanghai Data Science Meetup

Data storing and data access

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Hive Development. (~15 minutes) Yongqiang He Software Engineer. Facebook Data Infrastructure Team

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

COURSE CONTENT Big Data and Hadoop Training

Constructing a Data Lake: Hadoop and Oracle Database United!

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

How To Scale Out Of A Nosql Database

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Comparing SQL and NOSQL databases

Introduction to Hbase Gkavresis Giorgos 1470

Apache Sentry. Prasad Mujumdar

ITG Software Engineering

Complete Java Classes Hadoop Syllabus Contact No:

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Hadoop Ecosystem B Y R A H I M A.

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Apache HBase. Crazy dances on the elephant back

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Using distributed technologies to analyze Big Data

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Scaling Up HBase, Hive, Pegasus

Big Data and Scripting Systems build on top of Hadoop

I 3 Map Reduce for mining Big Data

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Cloudera Certified Developer for Apache Hadoop

Certified Big Data and Apache Hadoop Developer VS-1221

American International Journal of Research in Science, Technology, Engineering & Mathematics

Workshop on Hadoop with Big Data

Hadoop IST 734 SS CHUNG

Integrate Master Data with Big Data using Oracle Table Access for Hadoop

Can the Elephants Handle the NoSQL Onslaught?

Apache HBase: the Hadoop Database

Real World Hadoop Use Cases

Implement Hadoop jobs to extract business value from large and varied data sets

Xiaoming Gao Hui Li Thilina Gunarathne

Media Upload and Sharing Website using HBASE

Lofan Abrams Data Services for Big Data Session # 2987

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Best Practices for Hadoop Data Analysis with Tableau

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data Course Highlights

Secure Your Hadoop Cluster With Apache Sentry (Incubating) Xuefu Zhang Software Engineer, Cloudera April 07, 2014

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Apache HBase 0.96 and What s Next Headline Goes Here Jonathan

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

Hadoop implementation of MapReduce computational model. Ján Vaňo

Open source large scale distributed data management with Google s MapReduce and Bigtable

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Hadoop Distributed File System. -Kishan Patel ID#

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Hive Interview Questions

Olivier Renault Solu/on Engineer Hortonworks. Hadoop Security

Open Source Technologies on Microsoft Azure

SQL on NoSQL (and all of the data) With Apache Drill

docs.hortonworks.com

HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

Hadoop: The Definitive Guide

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Trafodion Operational SQL-on-Hadoop

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Efficient Processing of XML Documents in Hadoop Map Reduce

HBase Schema Design. NoSQL Ma4ers, Cologne, April Lars George Director EMEA Services

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Hive A Petabyte Scale Data Warehouse Using Hadoop

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hadoop/Hive General Introduction

Copyright. Albert Haque

CS 564: DATABASE MANAGEMENT SYSTEMS

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

APACHE HADOOP JERRIN JOSEPH CSU ID#

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

Architecting the Future of Big Data

HBase. Lecture BigData Analytics. Julian M. Kunkel. University of Hamburg / German Climate Computing Center (DKRZ)

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Real-time Streaming Analysis for Hadoop and Flume. Aaron Kimball odiago, inc. OSCON Data 2011

Transcription:

Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1

About Me User and committer of Hadoop since 2007 Contributor to Apache Hadoop, HBase, Hive and Gora Joined Hortonworks as Member of Technical Staff Twitter: @enissoz Page 2

Agenda Overview of Hive and HBase Hive + HBase Features and Improvements Future of Hive and HBase Q&A Page 3

Apache Hive Overview Apache Hive is a data warehouse system for Hadoop SQL-like query language called HiveQL Built for PB scale data Main purpose is analysis and ad hoc querying Database / table / partition / bucket DDL Operations SQL Types + Complex Types (ARRAY, MAP, etc) Very extensible Not for : small data sets, low latency queries, OLTP Page 4

Apache Hive Architecture JDBC/ODBC CLI Hive Thrift Server Hive Web Interface Driver Parser Execution Planner Optimizer M S C l i e n t Metastore MapReduce HDFS RDBMS Page 5

Overview of Apache HBase Apache HBase is the Hadoop database Modeled after Google s BigTable A sparse, distributed, persistent multi- dimensional sorted map The map is indexed by a row key, column key, and a timestamp Each value in the map is an un-interpreted array of bytes Low latency random data access Page 6

Overview of Apache HBase Logical view: From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al. Page 7

Apache HBase Architecture Client HMaster Region server Region server Region server Zookeeper Region Region Region Region Region Region HDFS Page 8

Hive + HBase Features and Improvements Page 9

Hive + HBase Motivation Hive and HBase has different characteristics: High latency Structured Analysts vs. Low latency Unstructured Programmers Hive datawarehouses on Hadoop are high latency Long ETL times Access to real time data Analyzing HBase data with MapReduce requires custom coding Hive and SQL are already known by many analysts Page 10

Use Case 1: HBase as ETL Data Sink From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Page 11

Use Case 2: HBase as Data Source From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Page 12

Use Case 3: Low Latency Warehouse From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Page 13

Example: Hive + Hbase (HBase table) hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME=>'s'} hbase(main):014:0> scan 'short_urls' ROW COLUMN+CELL bit.ly/aaaa column=s:hits, value=100 bit.ly/aaaa column=u:url, value=hbase.apache.org/ bit.ly/abcd column=s:hits, value=123 bit.ly/abcd column=u:url, value=example.com/foo Page 14

Example: Hive + HBase (Hive table) CREATE TABLE short_urls( short_url string, ) url string, hit_count int STORED BY 'org.apache.hadoop.hive.hbase.hbasestoragehandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits") TBLPROPERTIES ("hbase.table.name" = short_urls"); Page 15

Storage Handler Hive defines HiveStorageHandler class for different storage backends: HBase/ Cassandra / MongoDB/ etc Storage Handler has hooks for Getting input / output formats Meta data operations hook: CREATE TABLE, DROP TABLE, etc Storage Handler is a table level concept Does not support Hive partitions, and buckets Page 16

Apache Hive + HBase Architecture CLI Hive Thrift Server Hive Web Interface Driver Parser Execution Planner Optimizer M S C l i e n t Metastore StorageHandler MapReduce HBase HDFS RDBMS Page 17

Hive + HBase Integration For Input/OutputFormat, getsplits(), etc underlying HBase classes are used Column selection and certain filters can be pushed down HBase tables can be used with other(hadoop native) tables and SQL constructs Hive DDL operations are converted to HBase DDL operations via the client hook. All operations are performed by the client No two phase commit Page 18

Schema / Type Mapping Page 19

Schema Mapping Hive table + columns + column types <=> HBase table + column families (+ column qualifiers) Every field in Hive table is mapped in order to either The table key (using :key as selector) A column family (cf:) -> MAP fields in Hive A column (cf:cq) Hive table does not need to include all columns in HBase CREATE TABLE short_urls( short_url string, url string, hit_count int, props, map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits, p:") Page 20

Type Mapping Recently added to Hive (0.9.0) Previously all types were being converted to strings in HBase Hive has: Primitive types: INT, STRING, BINARY, DATE, etc ARRAY<Type> MAP<PrimitiveType, Type> STRUCT<a:INT, b:string, c:string> HBase does not have types Bytes.toBytes() Page 21

Type Mapping Table level property "hbase.table.default.storage.type = binary Type mapping can be given per column after # Any prefix of binary, eg u:url#b Any prefix of string, eg u:url#s The dash char -, eg u:url#- CREATE TABLE short_urls( short_url string, url string, hit_count int, props, map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s") Page 22

Type Mapping If the type is not a primitive or Map, it is converted to a JSON string and serialized Still a few rough edges for schema and type mapping: No Hive BINARY support in HBase mapping No mapping of HBase timestamp (can only provide put timestamp) No arbitrary mapping of Structs / Arrays into HBase schema Page 23

Bulk Load Steps to bulk load: Sample source data for range partitioning Save sampling results to a file Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner Import Hfiles into HBase table Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT. Page 24

Filter Pushdown Page 25

Filter Pushdown Idea is to pass down filter expressions to the storage layer to minimize scanned data To access indexes at HDFS or HBase Example: CREATE EXTERNAL TABLE users (userid LONG, email STRING, ) STORED BY 'org.apache.hadoop.hive.hbase.hbasestoragehandler WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, ") SELECT... FROM users WHERE userid > 1000000 and email LIKE %@gmail.com ; -> scan.setstartrow(bytes.tobytes(1000000)) Page 26

Filter Decomposition Optimizer pushes down the predicates to the query plan Storage handlers can negotiate with the Hive optimizer to decompose the filter x > 3 AND upper(y) = 'XYZ Handle x > 3, send upper(y) = XYZ as residual for Hive Works with: key = 3, key > 3, etc key > 3 AND key < 100 Only works against constant expressions Page 27

Security Aspects Towards fully secure deployments Page 28

Security Big Picture Security becomes more important to support enterprise level and multi tenant applications 5 Different Components to ensure / impose security HDFS MapReduce HBase Zookeeper Hive Each component has: Authentication Authorization Page 29

HBase Security Closer look Released with HBase 0.92 Fully optional module, disabled by default Needs an underlying secure Hadoop release SecureRPCEngine: optional engine enforcing SASL authentication Kerberos DIGEST-MD5 based tokens TokenProvider coprocessor Access control is implemented as a Coprocessor: AccessController Stores and distributes ACL data via Zookeeper Sensitive data is only accessible by HBase daemons Client does not need to authenticate to zk Page 30

Hive Security Closer look Hive has different deployment options, security considerations should take into account different deployments Authentication is only supported at Metastore, not on HiveServer, web interface, JDBC Authorization is enforced at the query layer (Driver) Pluggable authorization providers. Default one stores global/ table/partition/column permissions in Metastore GRANT ALTER ON TABLE web_table TO USER bob; CREATE ROLE db_reader GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader Page 31

Hive Deployment Option 1 Client CLI Driver Parser Execution Authorization Planner Optimizer M S C l i e n t Authentication Metastore A/A MapReduce A/A HBase HDFS A12n/A11N A12n/A11N RDBMS Page 32

Hive Deployment Option 2 Client CLI Driver Parser Execution Authorization Planner Optimizer M S C l i e n t Authentication Metastore A/A MapReduce A/A HBase HDFS A12n/A11N A12n/A11N RDBMS Page 33

Hive Deployment Option 3 Client JDBC/ODBC CLI Hive Thrift Server Hive Web Interface Driver Parser Execution Authorization Planner Optimizer M S C l i e n t Authentication Metastore A/A MapReduce A/A HBase HDFS A12n/A11N A12n/A11N RDBMS Page 34

Hive + HBase + Hadoop Security Regardless of Hive s own security, for Hive to work on secure Hadoop and HBase, we should: Obtain delegation tokens for Hadoop and HBase jobs Ensure to obey the storage level (HDFS, HBase) permission checks In HiveServer deployments, authenticate and impersonate the user Delegation tokens for Hadoop are already working Obtaining HBase delegation tokens are released in Hive 0.9.0 Page 35

Future of Hive + HBase Improve on schema / type mapping Fully secure Hive deployment options HBase bulk import improvements Sortable signed numeric types in HBase Filter pushdown: non key column filters Hive random access support for HBase https://cwiki.apache.org/hcatalog/random-accessframework.html Page 36

References Security https://issues.apache.org/jira/browse/hive-2764 https://issues.apache.org/jira/browse/hbase-5371 https://issues.apache.org/jira/browse/hcatalog-245 https://issues.apache.org/jira/browse/hcatalog-260 https://issues.apache.org/jira/browse/hcatalog-244 https://cwiki.apache.org/confluence/display/hcatalog/hcat+security +Design Type mapping / Filter Pushdown https://issues.apache.org/jira/browse/hive-1634 https://issues.apache.org/jira/browse/hive-1226 https://issues.apache.org/jira/browse/hive-1643 https://issues.apache.org/jira/browse/hive-2815 https://issues.apache.org/jira/browse/hive-1643 Page 37

Other Resources Hadoop Summit June 13-14 San Jose, California www.hadoopsummit.org Hadoop Training and Certification Developing Solutions Using Apache Hadoop Administering Apache Hadoop Online classes available US, India, EMEA http://hortonworks.com/training/ Hortonworks Inc. 2012 Page 38

Thanks Questions? Page 39