Hive Development. (~15 minutes) Yongqiang He Software Engineer. Facebook Data Infrastructure Team




Agenda
1. Introduction
2. New Features
3. Future

What is Hive?
- A system for managing and querying structured data, built on top of Hadoop
  - Large-scale execution (Map-Reduce/others?)
  - Massive storage (HDFS/HBase)
  - Metadata
- Key building principles:
  - SQL as a familiar data warehousing tool
  - Extensibility: types, functions, formats, scripts
  - Scalability and performance

Simple Example
Create the table (in Hive DDL, PARTITIONED BY and STORED AS come before LOCATION):
    CREATE TABLE src(key STRING, value STRING)
    PARTITIONED BY (ds STRING)
    STORED AS TEXTFILE
    LOCATION '/hive/src';
Query the table:
    SELECT key, count(DISTINCT value) FROM src GROUP BY key;
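To make the semantics of the query above concrete, here is a minimal Python sketch of what `count(distinct value) ... GROUP BY key` computes. It is not how Hive executes the query (Hive compiles it into Map-Reduce jobs); the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def count_distinct_by_key(rows):
    """Return {key: number of distinct values} for an iterable of (key, value) pairs."""
    seen = defaultdict(set)
    for key, value in rows:
        seen[key].add(value)  # a set keeps each value once per key
    return {key: len(values) for key, values in seen.items()}


# Hypothetical rows standing in for the contents of table src.
sample_rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
distinct_counts = count_distinct_by_key(sample_rows)
print(distinct_counts)
```

The duplicate ("a", "x") row is counted once, which is the difference between `count(distinct value)` and a plain `count(value)`.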

Hive Query Language
- SQL
  - Group by
  - Equi-joins
  - Semi join
  - Mapjoin / bucket mapjoin / sort-merge mapjoin
  - UDF/UDAF/UDTF
  - Lateral view
  - Subqueries in the FROM clause
  - Multi-table insert
  - Multi-group-by
  - Sampling

Hive Query Language (continued)
- Extensibility
  - Pluggable Map-Reduce scripts
  - Pluggable UDF/UDAF/UDTF
  - Complex object types
  - Columnar storage support
  - Pluggable formats / storage handlers
- Database/schema support
- Concurrency model
- Dynamic partition

New Features

Concurrency Model
- Use case: support concurrent readers and writers
- Locks:
  - Shared lock
  - Exclusive lock
- Implementation: ZooKeeper
- Reference:
  https://issues.apache.org/jira/browse/hive-1293
  http://wiki.apache.org/hadoop/hive/locking
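The shared/exclusive rule on this slide can be sketched as follows. This is an in-memory illustration of lock compatibility only, not Hive's ZooKeeper-based implementation; the `LockTable` class and its method names are hypothetical.

```python
SHARED, EXCLUSIVE = "S", "X"


class LockTable:
    """Tracks which lock modes are currently held on each resource."""

    def __init__(self):
        self._held = {}  # resource name -> list of held modes

    def try_acquire(self, resource, mode):
        """Grant the lock if it is compatible with those already held."""
        held = self._held.setdefault(resource, [])
        if mode == SHARED:
            compatible = EXCLUSIVE not in held  # readers coexist with readers
        else:
            compatible = not held  # a writer needs the resource to itself
        if compatible:
            held.append(mode)
        return compatible

    def release(self, resource, mode):
        self._held[resource].remove(mode)


locks = LockTable()
first_reader = locks.try_acquire("page_view", SHARED)
second_reader = locks.try_acquire("page_view", SHARED)
writer_blocked = locks.try_acquire("page_view", EXCLUSIVE)
locks.release("page_view", SHARED)
locks.release("page_view", SHARED)
writer_granted = locks.try_acquire("page_view", EXCLUSIVE)
```

Two readers hold the table at once, the writer is refused while they do, and it is granted once both shared locks are released.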

HBase integration & Storage Handler
Example:
    CREATE TABLE users (userid int, name string, email string, notes string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES ("hbase.table.name" = "user_list");
Status (testing):
- 20-node test cluster
- Bulk-loaded 6 TB of gzip-compressed data from Hive into HBase in about 30 hours
- Incremental load from Hive into HBase at 30 GB/hr (with write-ahead logging disabled)
- Full-table scan queries: currently 5x slower than against native Hive tables (no tuning or optimization yet)

Dynamic Partitioning
Example: a query without DP (one INSERT clause per target partition)
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='us')
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'US'
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='ca')
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip WHERE country = 'CA';
DP query (the country partition is taken from the last selected column):
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
      SELECT viewtime, userid, page_url, referrer_url, null, null, ip, country;
Reference: http://wiki.apache.org/hadoop/hive/tutorial#dynamicpartition_insert
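The effect of the DP query can be sketched in Python: the static part of the partition spec (dt) is fixed by the statement, while the dynamic part (country) is read from each row, so a single pass fans rows out to as many partitions as there are distinct values. This is a conceptual model, not Hive's implementation; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, static_spec, dynamic_cols):
    """Group rows by partition spec; dynamic column values come from the row."""
    partitions = defaultdict(list)
    for row in rows:
        spec = dict(static_spec)
        for col in dynamic_cols:
            spec[col] = row[col]
        # Partition columns live in the partition spec, not in the stored row.
        data = {k: v for k, v in row.items() if k not in dynamic_cols}
        partitions[tuple(spec.items())].append(data)
    return dict(partitions)


# Hypothetical staging rows standing in for page_view_stg.
staging_rows = [
    {"userid": 1, "page_url": "/home", "country": "US"},
    {"userid": 2, "page_url": "/home", "country": "CA"},
    {"userid": 3, "page_url": "/about", "country": "US"},
]
result = dynamic_partition_insert(staging_rows, {"dt": "2008-06-08"}, ["country"])
us_partition = result[(("dt", "2008-06-08"), ("country", "US"))]
```

No per-country INSERT clause is needed: a new country value in the staging data simply produces one more key in `result`.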

Local mode
- Use case:
  - Bypass the JobTracker scheduler when a job is small enough to run on the local machine (the job runs on the same machine the user submitted it from)
  - Reduce latency for small jobs
- Example:
    SET hive.exec.mode.local.auto=true;
    then run the query as usual

Archiving
- Use case:
  - Archive the files inside one partition directory
  - Reduce the number of small files and alleviate NameNode pressure
- Example:
    ALTER TABLE srcpart ARCHIVE PARTITION (ds='2008-04-08', hr='12');
    ALTER TABLE srcpart UNARCHIVE PARTITION (ds='2008-04-08', hr='12');

Indexing
- Use case: avoid scanning the whole base table (narrow down where the data is located)
- Create the index:
    CREATE INDEX src_index ON TABLE src(key) AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE;
- Update the index:
    ALTER INDEX src_index ON src REBUILD;
- Use the index:
    INSERT OVERWRITE DIRECTORY "/tmp/index_result"
      SELECT `_bucketname`, `_offsets` FROM default__srcpart_rc_srcpart_rc_index__ WHERE key=100;
    SET hive.index.compact.file=/tmp/index_result;
    SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
    SELECT key, value FROM srcpart_rc WHERE key=100 ORDER BY key;
- Reference: http://wiki.apache.org/hadoop/hive/indexdev
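The idea behind the `_bucketname` and `_offsets` columns can be sketched as follows: the index maps each key to the files, and the offsets within them, where that key occurs, so a lookup only has to touch those locations. This is a simplified model under that assumption, not Hive's on-disk index format; the function names, file names, and offsets are hypothetical.

```python
from collections import defaultdict


def build_compact_index(files):
    """files: {file_name: [(offset, key), ...]} -> {key: {file_name: [offsets]}}"""
    index = defaultdict(lambda: defaultdict(list))
    for file_name, records in files.items():
        for offset, key in records:
            index[key][file_name].append(offset)
    return index


def lookup(index, key):
    """Return the files and offsets to read for a key; empty if the key is absent."""
    return {name: list(offsets) for name, offsets in index.get(key, {}).items()}


# Hypothetical base-table files with (offset, key) pairs.
base_files = {
    "part-00000": [(0, 100), (64, 7), (128, 100)],
    "part-00001": [(0, 42)],
}
index = build_compact_index(base_files)
hits = lookup(index, 100)
misses = lookup(index, 999)
```

For key 100 only `part-00000` needs to be read, and only at two offsets; `part-00001` is skipped entirely.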

Future Work
- More indexing support
- More generalized execution framework support
- Nested columnar storage support
- Integration with BI tools (through JDBC/ODBC)
- Real-time streaming partial results
- Open source workflow integration
- More coming from *YOU*
- Apache top-level project

Q & A