Hive Development (~15 minutes). Yongqiang He, Software Engineer, Facebook Data Infrastructure Team



3 Agenda
  1. Introduction
  2. New Features
  3. Future

4 What is Hive?
A system for managing and querying structured data, built on top of Hadoop:
  Large-scale execution (Map-Reduce/others?)
  Massive storage (HDFS/HBase)
  Metadata
Key building principles:
  SQL as a familiar data warehousing tool
  Extensibility: types, functions, formats, scripts
  Scalability and performance

5 Simple Example
Create table:
    CREATE TABLE src(key STRING, value STRING)
    PARTITIONED BY (ds STRING)
    STORED AS TEXTFILE
    LOCATION '/hive/src';
Query the table:
    SELECT key, count(DISTINCT value) FROM src GROUP BY key;
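To make the semantics of the query on this slide concrete, here is a minimal Python sketch of what `count(DISTINCT value) ... GROUP BY key` computes. The function name and the sample rows are hypothetical, and the sketch says nothing about how Hive actually plans or executes the query.

```python
from collections import defaultdict


def count_distinct_by_key(rows):
    """Mimic SELECT key, count(DISTINCT value) FROM src GROUP BY key.

    rows is an iterable of (key, value) pairs; None stands in for SQL NULL.
    """
    distinct = defaultdict(set)
    for key, value in rows:
        values = distinct[key]      # make sure every key gets a group
        if value is not None:       # count(DISTINCT ...) ignores NULLs
            values.add(value)
    return {key: len(values) for key, values in distinct.items()}


# Hypothetical sample data, not taken from the slides.
sample_rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("c", None)]
print(count_distinct_by_key(sample_rows))
```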

6 Hive Query Language
  SQL
  Group by
  Equi-joins
  Semi join
  Map join / bucket map join / sort-merge map join
  UDF/UDAF/UDTF
  Lateral view
  Subqueries in FROM clause
  Multi-table insert
  Multi-group-by
  Sampling

7 Hive Query Language (continued)
  Extensibility
  Pluggable Map-Reduce scripts
  Pluggable UDF/UDAF/UDTF
  Complex object types
  Columnar storage support
  Pluggable formats / storage handler
  Database schema support
  Concurrency model
  Dynamic partitioning


8 New Features

9 Concurrency Model
Use case: support concurrent readers and writers
Locks:
  Shared lock
  Exclusive lock
Implementation: ZooKeeper
Reference: https://issues.
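The shared/exclusive rule behind this slide can be sketched as a small compatibility check. This is an illustration only: the in-memory `LockTable` class is hypothetical and stands in for the ZooKeeper-backed implementation the slide mentions, whose details are not given in the deck.

```python
SHARED = "SHARED"
EXCLUSIVE = "EXCLUSIVE"


class LockTable:
    """Hypothetical in-memory lock table (not Hive's actual lock manager)."""

    def __init__(self):
        self._held = {}  # object name -> list of (owner, mode)

    def try_acquire(self, owner, obj, mode):
        """Grant the lock if it is compatible with locks held by other owners."""
        holders = self._held.setdefault(obj, [])
        others = [m for o, m in holders if o != owner]
        if mode == SHARED:
            # Readers can coexist with other readers only.
            compatible = all(m == SHARED for m in others)
        else:
            # A writer needs the object to itself.
            compatible = not others
        if compatible:
            holders.append((owner, mode))
        return compatible

    def release(self, owner, obj):
        """Drop every lock that owner holds on obj."""
        self._held[obj] = [(o, m) for o, m in self._held.get(obj, []) if o != owner]


locks = LockTable()
print(locks.try_acquire("reader1", "page_view", SHARED))
print(locks.try_acquire("writer", "page_view", EXCLUSIVE))
```

Two readers may hold the table at once, while a writer has to wait until every reader has released its lock.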

10 HBase Integration & Storage Handler
Example:
    CREATE TABLE users (userid int, name string, string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = "small:name,small: ,large:notes")
    TBLPROPERTIES ("hbase.table.name" = "user_list");
Status (testing):
  20-node test cluster
  Bulk-loaded 6 TB of gzip-compressed data from Hive into HBase in about 30 hours
  Incrementally loaded from Hive into HBase at 30 GB/hr (with write-ahead logging disabled)
  Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)


11 Dynamic Partitioning
Example: a query without dynamic partitioning (DP)
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='us')
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'US'
    INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='ca')
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'CA';
DP query:
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country)
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip, country;
Reference:
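The difference between the two queries on this slide is that the DP form lets the trailing `country` column of each selected row pick the target partition. A minimal sketch of that routing follows, assuming rows are plain tuples; the function name, the sample rows, and the `dt` value are hypothetical (the date on the slide is not recoverable).

```python
from collections import defaultdict


def dynamic_partition_insert(rows, static_spec, dynamic_cols):
    """Group rows by partition, taking dynamic partition values from the
    trailing columns of each row, in the order given by dynamic_cols."""
    partitions = defaultdict(list)
    for row in rows:
        split = len(row) - len(dynamic_cols)
        data, dyn_values = row[:split], row[split:]
        spec = list(static_spec.items()) + list(zip(dynamic_cols, dyn_values))
        # Render the partition spec the way a partition directory is named.
        name = "/".join(f"{col}={val}" for col, val in spec)
        partitions[name].append(data)
    return dict(partitions)


# Hypothetical staging rows: (viewtime, userid, country).
staged = [(1, "u1", "US"), (2, "u2", "CA"), (3, "u3", "US")]
result = dynamic_partition_insert(staged, {"dt": "2010-01-01"}, ["country"])
for name, data in sorted(result.items()):
    print(name, data)
```

One pass over the staging data fills every country partition, where the static form needs one INSERT clause per country.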


12 Local Mode
Use case:
  Avoid the JobTracker scheduler when a job is small enough to run on the local machine (the job runs on the same machine from which the user submits it)
  Reduce the latency of small jobs
Example:
    SET hive.exec.mode.local.auto=true;
    Query
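The slide does not say how "small enough" is decided. One plausible reading is a simple threshold check, sketched below. The function, its parameters, and the default limits are all assumptions chosen for illustration; consult the configuration reference of your Hive version for the real properties and defaults.

```python
def should_run_locally(input_bytes, num_map_tasks, num_reduce_tasks,
                       max_input_bytes=128 * 1024 * 1024, max_map_tasks=4):
    """Hypothetical heuristic: run locally only if the job is small on
    every axis (input size, map tasks, and at most one reducer)."""
    return (input_bytes <= max_input_bytes
            and num_map_tasks <= max_map_tasks
            and num_reduce_tasks <= 1)


# A 10 MB, single-mapper job qualifies; a 10 GB job does not.
print(should_run_locally(10 * 1024 * 1024, 1, 1))
print(should_run_locally(10 * 1024 ** 3, 80, 4))
```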


13 Archiving
Use case: archive the files inside one partition directory, reducing the number of small files and alleviating NameNode pressure.
Example:
    ALTER TABLE srcpart ARCHIVE PARTITION (ds='2008-04-08', hr='12');
    ALTER TABLE srcpart UNARCHIVE PARTITION (ds=' ', hr='12');

14 Indexing
Use case: avoid scanning the whole base table (narrow down the data location)
Create index:
    CREATE INDEX src_index ON TABLE src(key) AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE;
Update index:
    ALTER INDEX src_index ON src REBUILD;
Use index:
    INSERT OVERWRITE DIRECTORY "/tmp/index_result" SELECT `_bucketname`, `_offsets` FROM default__srcpart_rc_srcpart_rc_index__ WHERE key=100;
    SET hive.index.compact.file=/tmp/index_result;
    SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
    SELECT key, value FROM srcpart_rc WHERE key=100 ORDER BY key;
Reference:
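The `_bucketname` and `_offsets` columns in the index query suggest the idea behind a compact index: for each key, remember which files and which offsets hold matching data, so a later query reads only those locations. The sketch below illustrates that idea under simplifying assumptions: files are modelled as dictionaries from offset to a single row, and every name and path in it is hypothetical. It is not Hive's on-disk index format.

```python
from collections import defaultdict


def build_compact_index(files):
    """Map each key to {bucketname: [offsets]} for the rows that contain it.

    files: {bucketname: {offset: (key, value)}}, a toy stand-in for
    data files on HDFS.
    """
    index = defaultdict(lambda: defaultdict(list))
    for bucketname, rows in files.items():
        for offset, (key, _value) in sorted(rows.items()):
            index[key][bucketname].append(offset)
    return {key: dict(buckets) for key, buckets in index.items()}


def lookup(files, index, key):
    """Read only the locations the index lists for key."""
    hits = []
    for bucketname, offsets in sorted(index.get(key, {}).items()):
        hits.extend(files[bucketname][offset] for offset in offsets)
    return hits


# Hypothetical files and rows.
files = {
    "/data/part-0": {0: (100, "val_100"), 64: (7, "val_7")},
    "/data/part-1": {0: (9, "val_9"), 32: (100, "val_100b")},
}
index = build_compact_index(files)
print(index[100])
print(lookup(files, index, 100))
```

The index has to be rebuilt when the base data changes, which is what the `ALTER INDEX ... REBUILD` step on the slide is for.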


15 Future Work
  More indexing support
  More generalized execution framework support
  Nested columnar storage support
  Integration with BI tools (through JDBC/ODBC)
  Real-time streaming
  Partial results
  Open source workflow integration
  More coming from *YOU*
Apache TOP LEVEL PROJECT


16 Q & A
