HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team



Similar documents
Hadoop/Hive General Introduction

Introduction to Apache Hive

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, Presented at Hadoop World, New York October 2, 2009

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Introduction to Apache Hive

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

Hive A Petabyte Scale Data Warehouse Using Hadoop

Advanced SQL Query To Flink Translator

Hive User Group Meeting August 2009

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

CASE STUDY OF HIVE USING HADOOP 1

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Hadoop Architecture and its Usage at Facebook

Hadoop and Hive. Introduction,Installation and Usage. Saatvik Shah. Data Analytics for Educational Data. May 23, 2014

Complete Java Classes Hadoop Syllabus Contact No:

Hive Interview Questions

Hadoop Distributed File System. -Kishan Patel ID#

Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis

Big Data and Scripting Systems build on top of Hadoop

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Data Warehousing and Analytics Infrastructure at Facebook

Data Warehouse Overview. Namit Jain

Big Data and Scripting Systems build on top of Hadoop

Integration of Apache Hive and HBase

INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Hadoop & its Usage at Facebook

Hadoop Job Oriented Training Agenda

Hadoop & its Usage at Facebook

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

CREDIT CARD DATA PROCESSING AND E-STATEMENT GENERATION WITH USE OF HADOOP

Operations and Big Data: Hadoop, Hive and Scribe. Zheng 铮 9 12/7/2011 Velocity China 2011

CSE-E5430 Scalable Cloud Computing Lecture 2

NetFlow Analysis with MapReduce

Introduction To Hive

Workshop on Hadoop with Big Data

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

Alternatives to HIVE SQL in Hadoop File Structure

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hive Development. (~15 minutes) Yongqiang He Software Engineer. Facebook Data Infrastructure Team

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Database Scalability and Oracle 12c

Data Domain Profiling and Data Masking for Hadoop

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

HDFS. Hadoop Distributed File System

Apache Sentry. Prasad Mujumdar

Using distributed technologies to analyze Big Data

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

APACHE HADOOP JERRIN JOSEPH CSU ID#

Big Data Operations Guide for Cloudera Manager v5.x Hadoop

COURSE CONTENT Big Data and Hadoop Training

LANGUAGES FOR HADOOP: PIG & HIVE

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Oracle Big Data SQL Technical Update

Analysis of Big data through Hadoop Ecosystem Components like Flume, MapReduce, Pig and Hive

Qsoft Inc

Big Data Time Series Analysis

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Big Data Course Highlights

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Important Notice. (c) Cloudera, Inc. All rights reserved.

How To Scale Out Of A Nosql Database

the missing log collector Treasure Data, Inc. Muga Nishizawa

Large Scale Text Analysis Using the Map/Reduce

Big Data and Market Surveillance. April 28, 2014

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle

Best Practices for Hadoop Data Analysis with Tableau

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Big Data? Definition # 1: Big Data Definition Forrester Research

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

BIG DATA HADOOP TRAINING

American International Journal of Research in Science, Technology, Engineering & Mathematics

Apache Hadoop FileSystem and its Usage in Facebook

Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce

How to Install and Configure EBF15328 for MapR or with MapReduce v1

HadoopRDF : A Scalable RDF Data Analysis System

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Data-Intensive Information Processing Applications! Session #7. A database perspective on the cloud

Transcription:

HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

Why Another Data Warehousing System? Problem: Data, data and more data 200GB per day in March 2008 back to 1TB compressed per day today The Hadoop Experiment Problem: Map/Reduce is great but every one is not a Map/Reduce expert I know SQL and I am a python and php expert So what do we do: HIVE

What is HIVE? A system for querying and managing structured data built on top of Map/Reduce and Hadoop We had: Structured logs with rich data types (structs, lists and maps) A user base wanting to access this data in the language of their choice A lot of traditional SQL workloads on this data (filters, joins and aggregations) Other non SQL workloads

Data Warehousing at Facebook Today Web Servers Scribe Servers Filers Oracle RAC Hive on Hadoop Cluster Federated MySQL

HIVE: Components Mgmt. Web UI Browsing Hive CLI Queries DDL Map Reduce HDFS Thrift API Parser Planner Execution Hive QL MetaStore Thrift SerDe Jute JSON..

Data Model Schema Library #Buckets=32 Bucketing Info Partitioning Cols Logical Partitioning Hash Partitioning /hive/clicks /hive/clicks/ds=2008-03-25 /hive/clicks/ds=2008-03-25/0 HDFS clicks Tables MetaStore

Dealing with Structured Data Type system Primitive types Recursively build up using Composition/Maps/Lists Generic (De)Serialization Interface (SerDe) To recursively list schema To recursively access fields within a row object Serialization families implement interface Thrift DDL based SerDe Delimited text based SerDe You can write your own SerDe Schema Evolution

MetaStore Stores Table/Partition properties: Table schema and SerDe library Table Location on HDFS Logical Partitioning keys and types Other information Thrift API Current clients in Php (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests) Metadata can be stored as text files or even in a SQL backend

Hive CLI DDL: create table/drop table/rename table alter table add column Browsing: show tables describe table cat table Loading Data Queries

Hive Query Language Philosophy SQL like constructs + Hadoop Streaming Query Operators in initial version Projections Equijoins and Cogroups Group by Sampling Output of these operators can be: passed to Streaming mappers/reducers can be stored in another Hive Table can be output to HDFS files can be output to local files

Hive Query Language Package these capabilities into a more formal SQL like query language in next version Introduce other important constructs: Ability to stream data thru custom mappers/reducers Multi table inserts Multiple group bys SQL like column expressions and some XPath like expressions Etc..

Joins Joins FROM page_view pv JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = 2008-03-03; Outer Joins FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = 2008-03-03;

Aggregations and Multi-Table Inserts FROM pv_users INSERT INTO TABLE pv_gender_uu SELECT pv_users.gender, count(distinct pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO TABLE pv_ip_uu SELECT pv_users.ip, count(distinct pv_users.id) GROUP BY(pv_users.ip);

Running Custom Map/Reduce Scripts FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS(dt, uid) CLUSTER BY(dt)) map INSERT INTO TABLE pv_users_reduced SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);

Inserts into Files, Tables and Local Files FROM pv_users INSERT INTO TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO DIRECTORY /user/facebook/tmp/pv_age_sum.dir SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age) INSERT INTO LOCAL DIRECTORY /home/me/pv_age_sum.dir FIELDS TERMINATED BY, LINES TERMINATED BY \013 SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age);

Hadoop Usage @ Facebook Types of Applications: Summarization Eg: Daily/Weekly aggregations of impression/click counts Ad hoc Analysis Eg: how many group admins broken down by state/country Data Mining (Assembling training data) Eg: User Engagement as a function of user attributes

Hadoop Usage @ Facebook Usage statistics: Total Users: ~140 (about 50% of engineering!) in the last 1 ½ months Hive Data (compressed): 80 TB total, ~1TB incoming per day Job statistics: ~1000 jobs/day ~100 loader jobs/day

Hadoop Improvements @ Facebook Some problems: No Fair Sharing: Big tasks can hog the cluster No snapshots: What if a software bug corrupts the NameNode transaction log Solutions: Simple fair sharing (Matie Zaharia) Investigating Snapshots (Dhrubha Bortharkur)

Conclusion JIRA http://issues.apache.org/jira/browse/hadoop-3601 Soon to be checked into hadoop trunk Release available in hadoop version 0.19 People: Suresh Anthony Zheng Shao Prasad Chakka Pete Wyckoff Namit Jain Raghu Murthy Joydeep Sen Sarma Ashish Thusoo