Hive Development (~15 minutes) Yongqiang He Software Engineer Facebook Data Infrastructure Team
Agenda 1 Introduction 2 New Features 3 Future
What is Hive?
A system for managing and querying structured data built on top of Hadoop:
- Large-scale execution (MapReduce/others?)
- Massive storage (HDFS/HBase)
- Metadata
Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance
Simple Example
Create a table:
CREATE TABLE src (key STRING, value STRING)
PARTITIONED BY (ds STRING)
STORED AS TEXTFILE
LOCATION '/hive/src';
Query the table:
SELECT key, count(DISTINCT value) FROM src GROUP BY key;
Hive Query Language
SQL:
- GROUP BY
- Equi-joins, semi join
- Map join / bucket map join / sort-merge map join
- UDF/UDAF/UDTF
- Lateral view
- Subqueries in FROM clause
- Multi-table insert
- Multi-group-by
- Sampling
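Multi-table insert and multi-group-by let one scan of the source feed several outputs. A minimal sketch (the destination table names here are illustrative, not from the deck):

```sql
-- One pass over src populates two tables; each INSERT may have
-- its own WHERE filter and GROUP BY (multi-group-by).
FROM src
INSERT OVERWRITE TABLE dest1
  SELECT key, value WHERE key < 100
INSERT OVERWRITE TABLE dest2
  SELECT key, count(value) GROUP BY key;
```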
Hive Query Language (continued)
Extensibility:
- Pluggable map-reduce scripts
- Pluggable UDF/UDAF/UDTF
- Complex object types
- Pluggable formats / storage handlers (including columnar storage)
- Database schema support
- Concurrency model
- Dynamic partitioning
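Map-reduce scripts are plugged in through TRANSFORM, which streams rows through an external program. A minimal sketch, assuming a user-supplied script (the script name is illustrative):

```sql
-- Ship the script to the cluster, then stream (key, value) rows
-- through it; the script reads tab-separated rows on stdin and
-- writes tab-separated rows on stdout.
ADD FILE my_script.py;
SELECT TRANSFORM (key, value)
       USING 'python my_script.py'
       AS (key, new_value)
FROM src;
```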
New Features
Concurrency Model
Use case: support concurrent readers and writers
Locks:
- Shared lock
- Exclusive lock
Implementation: ZooKeeper
References:
https://issues.apache.org/jira/browse/hive-1293
http://wiki.apache.org/hadoop/hive/locking
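The locks currently held can be inspected from the CLI; a minimal sketch:

```sql
-- List all current locks, or the locks on one table;
-- EXTENDED adds details such as the holder and query id.
SHOW LOCKS;
SHOW LOCKS src EXTENDED;
```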
HBase Integration & Storage Handler
Example:
CREATE TABLE users (userid INT, name STRING, email STRING, notes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = 'small:name,small:email,large:notes'
)
TBLPROPERTIES (
  'hbase.table.name' = 'user_list'
);
Status (testing):
- 20-node test cluster
- Bulk-loaded 6 TB of gzip-compressed data from Hive into HBase in about 30 hours
- Incremental-loaded from Hive into HBase at 30 GB/hr (with write-ahead logging disabled)
- Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)
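Once created, the HBase-backed table is queried like any other Hive table; a minimal sketch:

```sql
-- Reads go through the storage handler; the predicate is
-- evaluated by Hive over rows fetched from HBase.
SELECT name, email FROM users WHERE userid = 100;
```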
Dynamic Partitioning
Example: a query without DP
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
  SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'CA';
The same query with DP:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
  SELECT viewtime, userid, page_url, referrer_url, null, null, ip, country;
Reference: http://wiki.apache.org/hadoop/hive/tutorial#dynamicpartition_insert
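Dynamic partition inserts are gated by configuration properties; a minimal sketch of the usual session settings before running a DP query:

```sql
-- Enable dynamic partitions; 'nonstrict' allows every partition
-- column to be dynamic (no static partition value required).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Cap the number of partitions one query may create.
SET hive.exec.max.dynamic.partitions=1000;
```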
Local Mode
Use case:
- Avoid the JobTracker scheduler when a job is small enough to execute on the local machine (the job runs on the same machine from which the user submitted it)
- Reduce latency for small jobs
Example:
SET hive.exec.mode.local.auto=true;
then run the query as usual.
Archiving
Use case: archive the files inside one partition directory, reducing the number of small files and alleviating NameNode pressure.
Example:
ALTER TABLE srcpart ARCHIVE PARTITION (ds='2008-04-08', hr='12');
ALTER TABLE srcpart UNARCHIVE PARTITION (ds='2008-04-08', hr='12');
Indexing
Use case: avoid scanning the whole base table (narrow down the data location).
Create an index:
CREATE INDEX srcpart_rc_index ON TABLE srcpart_rc(key)
AS 'COMPACT' WITH DEFERRED REBUILD
STORED AS RCFILE;
Update the index:
ALTER INDEX srcpart_rc_index ON srcpart_rc REBUILD;
Use the index:
INSERT OVERWRITE DIRECTORY "/tmp/index_result"
SELECT `_bucketname`, `_offsets` FROM default__srcpart_rc_srcpart_rc_index__ WHERE key=100;
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT key, value FROM srcpart_rc WHERE key=100 ORDER BY key;
Reference: http://wiki.apache.org/hadoop/hive/indexdev
Future Work
- More indexing support
- More generalized execution framework support
- Nested columnar storage support
- Integration with BI tools (through JDBC/ODBC)
- Real-time streaming
- Partial results
- Open-source workflow integration
- More coming from *YOU*
Apache Top-Level Project
Q & A