Introduction to Apache Hive




Introduction to Apache Hive
Pelle Jakovits
14 Oct 2015, Tartu

Outline

- What is Hive
- Why Hive over MapReduce or Pig?
- Advantages and disadvantages
- Running Hive
- The HiveQL language
- User Defined Functions
- Hive vs Pig
- Other projects in the Hadoop ecosystem

Hive

- Initially developed by Facebook
- Data warehousing on top of Hadoop
- Designed for easy data summarization, ad-hoc querying and analysis of large volumes of data
- HiveQL statements are automatically translated into MapReduce jobs

Advantages of Hive

- Higher-level query language
- Simplifies working with large amounts of data
- Lower learning curve than Pig or MapReduce
- HiveQL is much closer to SQL than Pig Latin
- Less trial and error than Pig

Disadvantages

- Updating data is complicated, mainly because the data is stored in HDFS
  - Records can be added
  - Partitions can be overwritten
- No real-time access to data
  - Use other tools such as HBase or Impala instead
- High latency

Running Hive

We will look at this more closely in the practice session, but you can run Hive from:

- The Hive web interface
- The Hive shell
  - $HIVE_HOME/bin/hive for an interactive shell
  - Or run queries directly: $HIVE_HOME/bin/hive -e 'SELECT a.col FROM tab1 a'
- JDBC (Java Database Connectivity)
  - "jdbc:hive://host:port/dbname"
- It is also possible to use Hive directly from Python, C, C++ and PHP

HiveQL

The Hive query language provides the basic SQL-like operations. Some of these operations are:

- CREATE/DROP/ALTER TABLE, SHOW, DESCRIBE
- FILTER, SELECT, JOIN, INSERT, UNION, UPDATE (Hive >= 0.14), DELETE (Hive >= 0.14)

The base HiveQL language is extended with User Defined Functions (UDFs).

Create Tables

CREATE TABLE page_view(
    viewtime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '44'
    LINES TERMINATED BY '12'
STORED AS SEQUENCEFILE;

Load Data

There are multiple ways to load data into Hive tables:

- The user can create an external table that points to a specified location within HDFS.
- The user can copy a file into the specified location in HDFS and create a table pointing to this location, with all the relevant row format information.

Once this is done, the user can transform the data and insert it into any other Hive table, or run queries on it.

Describing a data file

CREATE EXTERNAL TABLE page_view_stg(
    viewtime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING,
    country STRING)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '44'
    LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Loading from a data file

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='us')
SELECT pvs.viewtime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

Directly loading and storing data

Loading data directly:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view;

To write the output to local disk, for example to load it into an Excel spreadsheet later:

FROM pv_users
INSERT OVERWRITE LOCAL DIRECTORY '/user/pelle/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;

Data units

- Databases: namespaces that keep tables and other data units from naming conflicts
- Tables: homogeneous units of data which share the same schema
  - A table consists of the columns specified by its schema
- Partitions: each table can have one or more partition keys, which determine how the data is stored
  - Partitions allow the user to efficiently identify the rows that satisfy certain criteria
  - It is the user's job to guarantee the relationship between partition name and data content!
- Buckets (or clusters): data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table
  - For example, the page_views table may be bucketed by userid to sample the data
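The bucketing idea above can be sketched in a few lines of Python. This is an illustration only, not Hive's actual implementation: the modulo-on-the-raw-value hash below is a simplifying assumption, and the row data is made up.

```python
# Sketch of bucket assignment: bucket index = hash(column value) mod
# number of buckets. For illustration we hash an integer userid to
# itself; Hive's real hash function differs by column type and version.
NUM_BUCKETS = 25

def bucket_for(userid: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket a row lands in, based on its bucketing column."""
    return userid % num_buckets

# Rows with the same userid always land in the same bucket, which is
# what makes bucketed sampling (and bucketed joins) possible.
rows = [(101, "2008-06-08"), (126, "2008-06-08"), (101, "2008-06-09")]
buckets = [bucket_for(uid) for uid, _ in rows]
print(buckets)
```

Because the assignment is deterministic, reading one bucket out of 25 gives a repeatable ~4% sample keyed on userid.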

Partitions and Buckets

CREATE TABLE `order` (      -- ORDER is a reserved word, so the name is quoted
    username STRING,
    orderdate STRING,
    amount DOUBLE,
    tax DOUBLE)
PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;

Types

Types are associated with the columns in the tables. The following primitive types are supported:

- Integers
  - TINYINT: 1-byte integer
  - SMALLINT: 2-byte integer
  - INT: 4-byte integer
  - BIGINT: 8-byte integer
- Boolean type
  - BOOLEAN
- Floating point numbers
  - FLOAT
  - DOUBLE
- String type
  - STRING

Built-in functions

- round(number), floor(number), ceil(number)
- rand(seed)
- concat(string1, string2, ...), substr(string, start, length)
- upper(string), lower(string), trim(string)
- year(date), month(date), day(date)
- count(*), sum(col), avg(col), max(col), min(col)

Display functions

SHOW TABLES;                  -- lists all tables
SHOW PARTITIONS page_view;    -- lists the partitions of a specific table
DESCRIBE EXTENDED page_view;  -- shows all metadata, mainly for debugging

INSERT

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';

JOIN

- JOIN, {LEFT|RIGHT|FULL} OUTER JOIN, LEFT SEMI JOIN, CROSS JOIN
- Can join more than two tables at once
- It is best to put the largest table on the rightmost side of the join to get the best performance
- Only equality expressions are allowed: expression = expression

SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
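The rightmost-table advice comes from how the join is executed: the earlier tables are buffered in memory as hash tables while the last one is streamed through. A minimal Python sketch of that hash-join idea, with made-up user and page_view rows:

```python
# Build a hash table on the smaller table (user), then stream the
# larger table (page_view) through it -- the same reason Hive wants
# the largest table on the rightmost side of the join.
users = {1: ("male", 25), 2: ("female", 31)}      # id -> (gender, age), buffered
page_views = [(1, "a.com", "2008-03-03"),
              (2, "b.com", "2008-03-03"),
              (3, "c.com", "2008-03-03")]         # (userid, url, date), streamed

joined = []
for userid, url, date in page_views:              # one pass over the big table
    if userid in users:                           # equality predicate only
        gender, age = users[userid]
        joined.append((url, gender, age))

print(joined)  # the row with userid 3 is dropped, as in an inner join
```

Streaming the biggest table keeps memory usage bounded by the size of the smaller tables, not the large one.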

Sampling

By number of rows:

SELECT * FROM source_table TABLESAMPLE (1m ROWS);

By sampling rate:

SELECT * FROM source_table TABLESAMPLE (0.1 PERCENT);

Forced random-order sampling:

SELECT * FROM source_table DISTRIBUTE BY rand() SORT BY rand() LIMIT 1000;
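The practical difference between the three variants is easy to see in plain Python (the 10,000-row dataset and the seed are made up for the sketch): the first two just take rows from the front of the input, only the rand() version pays for a shuffle to get a truly random sample.

```python
import random

rows = list(range(10_000))          # stand-in for a table

# TABLESAMPLE(n ROWS): simply takes the first n rows it reads
first_1000 = rows[:1000]

# TABLESAMPLE(p PERCENT): reads roughly that fraction of the input
tenth_percent = rows[:int(len(rows) * 0.001)]

# DISTRIBUTE BY rand() SORT BY rand() LIMIT n: shuffle before picking,
# a full pass over the data but a properly random sample
random.seed(42)                     # seeded only to keep the sketch repeatable
random_1000 = random.sample(rows, k=1000)
```

If the table happens to be sorted (say, by date), the first two methods return a heavily biased sample, which is why the expensive third form exists.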

Union

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    WHERE av.date = '2008-06-03'
    UNION ALL
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid);

Complex types

- Structs: the elements within the type can be accessed using the DOT (.) notation
  - For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
- Maps (key-value tuples): the elements are accessed using ['element name'] notation
  - For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
- Arrays (indexable lists): the elements in the array have to be of the same type
  - Elements can be accessed using the [n] notation, where n is a zero-based index into the array
  - For example, for an array A with the elements ['a', 'b', 'c'], A[1] returns 'b'
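The three access notations map directly onto familiar Python constructs, which may help as a mnemonic (the values below are made up; this is an analogy, not Hive code):

```python
# Python analogues of Hive's complex types:
#   STRUCT -> attribute access   c.a
#   MAP    -> key lookup         M['group']
#   ARRAY  -> zero-based index   A[1]
from types import SimpleNamespace

c = SimpleNamespace(a=1, b=2)   # STRUCT {a INT; b INT}
M = {"group": 500}              # MAP<STRING, INT>, 'group' -> gid
A = ["a", "b", "c"]             # ARRAY<STRING>

print(c.a, M["group"], A[1])    # note: A[1] is 'b', not 'a' (zero-based)
```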

collect_list & sort_array

SELECT city, sort_array(collect_list(balance)) AS balances
FROM bank_accounts
GROUP BY city;

The values inside the field balances can then be addressed as an array: balances[0]
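What the query computes, written out in Python (the bank_accounts rows are invented for illustration): collect_list gathers one group's values into a list, sort_array orders it.

```python
from collections import defaultdict

# Hypothetical (city, balance) rows from bank_accounts
accounts = [("Tartu", 300.0), ("Tallinn", 50.0), ("Tartu", 100.0)]

groups = defaultdict(list)
for city, balance in accounts:
    groups[city].append(balance)            # collect_list(balance) per GROUP BY city

result = {city: sorted(vals) for city, vals in groups.items()}   # sort_array(...)
print(result)
```

Each output row then carries a whole array per city, so balances[0] is that city's smallest balance.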

Complex type operations

- size(array), array_contains(array, val), map_keys(map), map_values(map)
- greatest(val1, val2, ...), least(val1, val2, ...)
- explode(array)
- sentences(string text, string lang, string locale)
- ngrams(sentences, int gram, int num)
- histogram_numeric(col, num_bins)

Hive WordCount

SELECT word, COUNT(*)
FROM input LATERAL VIEW explode(split(text, ' ')) inputTable AS word
GROUP BY word;

A lateral view applies a UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table.
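The same computation in plain Python makes the two stages visible (the input lines are made up): explode(split(text, ' ')) turns each row into one row per word, then GROUP BY counts the words.

```python
from collections import Counter

# Hypothetical rows of the `input` table, one text column each
input_rows = ["hello hive hello", "hive wordcount"]

# LATERAL VIEW explode(split(text, ' ')): one output row per word
words = [w for line in input_rows for w in line.split(" ")]

# GROUP BY word + COUNT(*)
counts = Counter(words)
print(counts)
```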

Java UDF

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toLowerCase());
    }
}

Using UDFs

CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';

hive> SELECT my_lower(title), sum(freq) FROM titles GROUP BY my_lower(title);

Running custom MapReduce

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script.py'
    AS dt, uid
    CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script.py'
AS date, count;

Map Example

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print(','.join([userid, str(weekday)]))
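The reduce side is not shown on the slide; a hypothetical reduce_script.py for the query above could count distinct users per weekday. Everything here is an assumption for illustration: the tab-separated "dt&lt;TAB&gt;uid" input format, and the grouping logic, which relies on CLUSTER BY dt delivering each dt's rows contiguously. A real streaming script would loop over sys.stdin; a function is used so the logic is easy to test.

```python
def reduce_rows(lines):
    """Consume 'dt<TAB>uid' lines clustered by dt; emit (dt, distinct uid count)."""
    results = []
    current_dt, uids = None, set()
    for line in lines:
        dt, uid = line.rstrip("\n").split("\t")
        if dt != current_dt:                      # a new group starts
            if current_dt is not None:
                results.append((current_dt, len(uids)))
            current_dt, uids = dt, set()
        uids.add(uid)
    if current_dt is not None:                    # flush the last group
        results.append((current_dt, len(uids)))
    return results

print(reduce_rows(["1\tu1", "1\tu2", "1\tu1", "2\tu3"]))
```

The single-pass grouping works only because rows arrive clustered by the key, which is exactly what the CLUSTER BY dt clause in the query guarantees.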

DEMO

Hive disadvantages

- Same disadvantages as MapReduce and Pig
  - Slow start-up and clean-up of MapReduce jobs
  - It takes time for Hadoop to schedule MR jobs
- Not suitable for interactive OLAP analytics
  - When results are expected in < 1 sec
- Designed for querying, not for data transformation
  - Limitations of the SQL language
  - Tasks like co-grouping can get complicated

Pig vs Hive

                     Pig                        Hive
Purpose              Data transformation        Ad-hoc querying
Language             Something similar to SQL   SQL-like
Difficulty           Medium (trial and error)   Low to medium
Schemas              Yes (implicit)             Yes (explicit)
Join (distributed)   Yes                        Yes
Shell                Yes                        Yes
Streaming            Yes                        Yes
Web interface        Third-party                Yes
Partitions           No                         Yes
UDFs                 Yes                        Yes

Pig vs Hive

SQL might not be the perfect language for expressing data transformation commands.

- Pig: mainly for data transformations and processing; unstructured data
- Hive: mainly for warehousing and querying data; structured data

Big Picture

- Store large amounts of data in HDFS
- Process raw data: Pig
- Build a schema using Hive
- Data queries: Hive
- Real-time access to data
  - Real-time access with HBase
  - Real-time queries with Impala

Is Hadoop enough?

Why use Hadoop for large-scale data processing?

- It is becoming a de facto standard in Big Data
- Collaboration among top companies instead of vendor tool lock-in
  - Amazon, Apache, Facebook, Yahoo! and others all contribute to open source Hadoop
- There are tools for everything from setting up a Hadoop cluster in minutes and importing data from relational databases to setting up workflows of MR, Pig and Hive jobs

More Hadoop projects

- HBase
  - Open-source distributed database on top of HDFS
  - HBase tables only use a single key
  - Tuned for real-time access to data
- Cloudera Impala
  - Simplified, real-time queries over HDFS
  - Bypasses job scheduling and removes everything else that makes MR slow

That's All

This week's practice session:
- Processing data with Hive
- A similar exercise to the last two weeks, this time using Hive

Next lecture: Spark