Pig vs Hive Big Data 2014
Pig Configuration In the .bash_profile, export all the needed environment variables
Pig Configuration Download a release of Apache Pig: pig-0.11.1.tar.gz
Pig Configuration Go to the conf directory under the Pig home directory and rename the file pig.properties.template to pig.properties
Pig Running Running Pig: $:~pig-*/bin/pig <parameters> Try the following command to get a list of Pig commands: $:~pig-*/bin/pig -help Run modes: local: $:~pig-*/bin/pig -x local mapreduce: $:~pig-*/bin/pig or $:~pig-*/bin/pig -x mapreduce
Pig in Local Running Pig in Local: $:~pig-*/bin/pig -x local
Grunt Shell:
grunt> A = LOAD 'passwd' USING PigStorage(':');
grunt> B = FOREACH A GENERATE $0 AS id;
grunt> dump B;
grunt> STORE B INTO '<myhome>/pigoutput';
Script file: $:~pig-*/bin/pig -x local myscript.pig
Pig in Local: Examples Word Count using Pig Count the words in a text file, separated by lines and spaces. Basic idea: load the file using a loader; for each record, generate word tokens; group by word; count the words in each group; store the result to a file.
words.txt: program program pig pig program pig hadoop pig latin latin
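Before writing the Pig script, the pipeline can be sanity-checked with a plain-Python sketch (not part of the deck; the line layout of words.txt below is assumed, but whitespace tokenization makes the counts independent of it):

```python
from collections import Counter

def word_count(lines):
    # Tokenize each line on whitespace, then count per word:
    # the same TOKENIZE / GROUP BY / COUNT steps as the Pig script.
    words = [w for line in lines for w in line.split()]
    return Counter(words)

# The records of words.txt shown on the slide (line breaks assumed)
lines = ["program program", "pig pig", "program pig", "hadoop pig", "latin latin"]
counts = word_count(lines)
print(sorted(counts.items()))
# [('hadoop', 1), ('latin', 2), ('pig', 4), ('program', 3)]
```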
Pig in Local: Examples Word Count using Pig
$:~pig-*/bin/pig -x local
grunt> myinput = LOAD '<myhome>/words.txt' USING TextLoader() AS (myword:chararray);
grunt> words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));
grunt> grouped = GROUP words BY $0;
grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> STORE counts INTO '<myhome>/pigoutput' USING PigStorage();
Pig in Local: Examples Word Count using Pig
$:~pig-*/bin/pig -x local wordcount.pig
wordcount.pig:
myinput = LOAD '<myhome>/words.txt' USING TextLoader() AS (myword:chararray);
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO '<myhome>/pigoutput' USING PigStorage();
Pig in Local: Examples Word Count using Pig Note: the output directory '<myhome>/pigoutput' must not exist before the script is executed.
Pig in MapReduce: Examples Word Count using Pig
$:~hadoop-*/bin/hadoop dfs -mkdir input
$:~hadoop-*/bin/hadoop dfs -copyFromLocal /tmp/words.txt input
$:~pig-*/bin/pig -x mapreduce wordcountmr.pig
wordcountmr.pig:
myinput = LOAD 'input/words.txt' USING TextLoader() AS (myword:chararray);
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'pigoutput' USING PigStorage();
Pig in MapReduce: Examples Word Count using Pig The script reads its input from the HDFS input directory: $:~pig-*/bin/pig -x mapreduce wordcountmr.pig
Pig in MapReduce: Examples Word Count using Pig The result is written to the HDFS pigoutput directory, in the file part-r-00000.
Pig in Local: Examples Computing average number of page visits by user Each log record of a user visiting a web page consists of (user, url, time). The fields of the log are tab separated and in text format. Basic idea: load the log file; group by the user field; count each group; calculate the average over all users; visualize the result.
visits.log (user, url, time):
Amy www.cnn.com 8:00
Amy www.crap.com 8:05
Amy www.myblog.com 10:00
Amy www.flickr.com 10:05
Fred cnn.com/index.htm 12:00
Fred cnn.com/index.htm 13:00
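The same group-count-average steps can be checked locally with a small Python sketch (a sketch for intuition, not part of the deck), using the visits.log records above:

```python
from collections import defaultdict

def average_visits(records):
    # GROUP BY user, COUNT each group, then AVG over the counts:
    # the same steps as average_visits_log.pig.
    per_user = defaultdict(int)
    for user, _url, _time in records:
        per_user[user] += 1
    return sum(per_user.values()) / len(per_user)

visits = [
    ("Amy", "www.cnn.com", "8:00"),
    ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"),
    ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"),
    ("Fred", "cnn.com/index.htm", "13:00"),
]
print(average_visits(visits))  # 3.0 (Amy: 4 visits, Fred: 2)
```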
Pig in Local: Examples Computing average number of page visits by user
$:~pig-*/bin/pig -x local average_visits_log.pig
average_visits_log.pig:
visits = LOAD '<myhome>/visits.log' AS (user, url, time);
user_visits = GROUP visits BY user;
user_cnts = FOREACH user_visits GENERATE group AS user, COUNT(visits) AS numvisits;
all_cnts = GROUP user_cnts ALL;
avg_cnt = FOREACH all_cnts GENERATE AVG(user_cnts.numvisits);
dump avg_cnt;
Pig in Local: Examples Identify users who visit Good Pages A user visits good pages if the average page rank of the pages they visit is greater than 0.5. Basic idea: join the tables on url; group by user; calculate the average page rank of the pages each user visited; filter the users whose average page rank is greater than 0.5; store the result.
visits.log (user, url, time):
Amy www.cnn.com 8:00
Amy www.crap.com 8:05
Amy www.myblog.com 10:00
Amy www.flickr.com 10:05
Fred cnn.com/index.htm 12:00
Fred cnn.com/index.htm 13:00
pages.log (url, pagerank):
www.cnn.com 0.9
www.flickr.com 0.9
www.myblog.com 0.7
www.crap.com 0.2
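The join-group-average-filter steps can be sketched in plain Python to predict the result on the sample data (a local sketch, not part of the deck). Note the inner join on url silently drops Fred, whose urls do not appear in pages.log:

```python
def good_users(visits, pageranks, threshold=0.5):
    # JOIN on url (visits to unknown pages drop out), GROUP BY user,
    # AVG the page rank per user, then FILTER by the threshold.
    ranks_by_user = {}
    for user, url, _time in visits:
        if url in pageranks:
            ranks_by_user.setdefault(user, []).append(pageranks[url])
    return {user: sum(rs) / len(rs)
            for user, rs in ranks_by_user.items()
            if sum(rs) / len(rs) > threshold}

visits = [
    ("Amy", "www.cnn.com", "8:00"), ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"), ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"), ("Fred", "cnn.com/index.htm", "13:00"),
]
pageranks = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
             "www.myblog.com": 0.7, "www.crap.com": 0.2}
result = good_users(visits, pageranks)
print(result)  # Amy qualifies (average rank 0.675); Fred's urls are not in pages.log
```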
Pig in Local: Examples Identify users who visit Good Pages
$:~pig-*/bin/pig -x local good_users.pig
good_users.pig:
visits = LOAD '<myhome>/visits.log' AS (user:chararray, url:chararray, time:chararray);
pages = LOAD '<myhome>/pages.log' AS (url:chararray, pagerank:float);
visits_pages = JOIN visits BY url, pages BY url;
user_visits = GROUP visits_pages BY user;
user_avgpr = FOREACH user_visits GENERATE group, AVG(visits_pages.pagerank) AS avgpr;
good_users = FILTER user_avgpr BY avgpr > 0.5f;
STORE good_users INTO '<myhome>/pigoutput';
Pig in Local: Examples Identify users who visit Good Pages Load the files for processing with appropriate types:
visits = LOAD '<myhome>/visits.log' AS (user:chararray, url:chararray, time:chararray);
pages = LOAD '<myhome>/pages.log' AS (url:chararray, pagerank:float);
Pig in Local with User Defined Functions: Examples Find all planets similar and close to Earth Planets similar and close to Earth are those with oxygen in their atmosphere and whose distance from Earth is less than 5. Basic idea: define a User Defined Function (UDF); filter the planets using the UDF.
planets.txt (planet, color, atmosphere, distancefromearth):
gallifrey, blue, oxygen, 4.367
skaro, blue, phosphorus, 10.5
krypton, red, oxygen, 2.5
apokolips, white, unknown, 0
klendathu, orange, oxygen, 0.89
asgard, unknown, unknown, 0
mars, yellow, carbon dioxide, 0.00000589429108
thanagar, yellow, oxygen, 3.29
planet x, yellow, unknown, 0.78
warworld, red, phosphorus, 10.1
daxam, red, oxygen, 7.2
oa, blue white, nitrogen, 2.4
Gliese 667Cc, red dwarf, unknown, 22
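The two filter predicates can be sketched in plain Python to check the expected result before writing the Java UDFs (a local sketch, not part of the deck; only the oxygen planets and one counter-example from planets.txt are listed):

```python
def has_oxygen(atmosphere):
    # mirrors PlanetWithOxygen: substring test on the atmosphere field
    return "oxygen" in atmosphere

def close_to_earth(distance, limit=5.0):
    # mirrors DistanceFromEarth: distance strictly below the limit
    return distance < limit

planets = [
    ("gallifrey", "oxygen", 4.367),
    ("krypton", "oxygen", 2.5),
    ("klendathu", "oxygen", 0.89),
    ("thanagar", "oxygen", 3.29),
    ("daxam", "oxygen", 7.2),       # oxygen, but too far
    ("skaro", "phosphorus", 10.5),  # no oxygen
]
result = [p for p, atm, d in planets if has_oxygen(atm) and close_to_earth(d)]
print(result)  # ['gallifrey', 'krypton', 'klendathu', 'thanagar']
```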
Pig in Local with User Defined Functions: Examples Find all planets similar and close to Earth
DistanceFromEarth.java:
package myudfs;
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class DistanceFromEarth extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            Object value = input.get(0);
            if (value instanceof Double)
                return ((Double) value) < 5;
        } catch (Exception ee) {
            throw new IOException("Caught exception processing input row", ee);
        }
        return null;
    }
}
Pig in Local with User Defined Functions: Examples Find all planets similar and close to Earth
PlanetWithOxygen.java:
package myudfs;
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class PlanetWithOxygen extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String value = (String) input.get(0);
            return (value.indexOf("oxygen") >= 0);
        } catch (Exception ee) {
            throw new IOException("Caught exception processing input row", ee);
        }
    }
}
Pig in Local with User Defined Functions: Examples Find all planets similar and close to Earth Compile the two UDF classes and package them into myudfs.jar.
Pig in Local with User Defined Functions: Examples Find all planets similar and close to Earth
$:~pig-*/bin/pig -x local planets.pig
planets.pig:
REGISTER '<myhome>/myudfs.jar';
planets = LOAD '<myhome>/planets.txt' USING PigStorage(',') AS (planet:chararray, color:chararray, atmosphere:chararray, distance:double);
result = FILTER planets BY myudfs.PlanetWithOxygen(atmosphere) AND myudfs.DistanceFromEarth(distance);
STORE result INTO '<myhome>/pigoutput';
Pig in Local with User Defined Functions: Examples Sort employees by department and by stack ranking. Basic idea: define a User Defined Function (UDF); order the employees using the UDF.
employees.txt (name, stackrank, department):
JohnS, 9.5, Accounting
Bill, 6, Marketing
Franklin, 7, Engineering
Marci, 8, Exec
Joe DeAngel, 4.5, Finance
Steve Francis, 9, Accounting
Sam Shade, 6.5, Engineering
Sandi, 9, Exec
Roderick Trevers, 7, Accounting
Terri DeHaviland, 8.5, Exec
Colin McCullers, 8, Marketing
Fay LaMore, 9, Marketing
Pig in Local with User Defined Functions: Examples Sort employees by department and by stack ranking.
rankudf.py:
#!/usr/bin/python
@outputSchema("record: {(rank:int, name:chararray, stackrank:double, department:chararray)}")
def enumerate_bag(input):
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output
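Since the UDF body is plain Python (minus the Jython @outputSchema decorator), its behavior can be checked locally; the sorted bag below mimics the Accounting department of employees.txt after the ORDER ... BY stackrank desc step:

```python
def enumerate_bag(input):
    # Same body as rankudf.py, without the @outputSchema decorator:
    # prepend each tuple of the (already sorted) bag with its rank.
    output = []
    for rank, item in enumerate(input):
        output.append(tuple([rank] + list(item)))
    return output

sorted_bag = [("JohnS", 9.5, "Accounting"),
              ("Steve Francis", 9.0, "Accounting"),
              ("Roderick Trevers", 7.0, "Accounting")]
ranked = enumerate_bag(sorted_bag)
print(ranked)
# [(0, 'JohnS', 9.5, 'Accounting'), (1, 'Steve Francis', 9.0, 'Accounting'), (2, 'Roderick Trevers', 7.0, 'Accounting')]
```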
Pig in Local with User Defined Functions: Examples Sort employees by department and by stack ranking.
$:~pig-*/bin/pig -x local employee.pig
employee.pig:
REGISTER '<myhome>/rankudf.py' USING jython AS myudf;
employees = LOAD '<myhome>/employees.txt' USING PigStorage(',') AS (name:chararray, stackrank:double, department:chararray);
employees_by_department = GROUP employees BY department;
result = FOREACH employees_by_department {
    sorted = ORDER employees BY stackrank DESC;
    ranked = myudf.enumerate_bag(sorted);
    GENERATE FLATTEN(ranked);
};
STORE result INTO '<myhome>/pigoutput';
Hive Configuration In the .bash_profile, export all the needed environment variables
Hive Configuration Hive translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster. [Diagram: the client machine submits HiveQL to Hive; Hive executes the MapReduce jobs on the Hadoop cluster and monitors/reports the results.]
Hive Configuration Download a binary release of Apache Hive: hive-0.11.0-bin.tar.gz
Hive Configuration In the conf directory of the Hive home directory, edit the hive-env.sh file and set HADOOP_HOME:
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/Users/mac/Documents/hadoop-1.2.1
Hive Configuration Hive uses Hadoop. In addition, you must create /tmp and /user/hive/warehouse in HDFS and set them chmod g+w before you can create a table in Hive. Commands to perform this setup:
$:~$HADOOP_HOME/bin/hadoop dfs -mkdir /tmp
$:~$HADOOP_HOME/bin/hadoop dfs -mkdir /user/hive/warehouse
$:~$HADOOP_HOME/bin/hadoop dfs -chmod g+w /tmp
$:~$HADOOP_HOME/bin/hadoop dfs -chmod g+w /user/hive/warehouse
Hive Running Running Hive: $:~hive-*/bin/hive <parameters> Try the following command to access the Hive shell: $:~hive-*/bin/hive
Hive Shell:
Logging initialized using configuration in jar:file:/Users/mac/Documents/hive-0.11.0-bin/lib/hive-common-0.11.0.jar!/hive-log4j.properties
Hive history file=/tmp/mac/hive_job_log_mac_4426@MacBook-Air-di-mac.local_201404091440_1786371657.txt
hive>
Hive Running In the Hive Shell you can call any HiveQL statement: create a table hive> CREATE TABLE pokes (foo INT, bar STRING); OK Time taken: 0.354 seconds hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); OK Time taken: 0.051 seconds browsing through Tables: lists all the tables hive> SHOW TABLES; OK invites pokes Time taken: 0.093 seconds, Fetched: 2 row(s)
Hive Running browsing through Tables: lists all the tables that end with 's':
hive> SHOW TABLES '.*s';
OK
invites
pokes
Time taken: 0.063 seconds, Fetched: 2 row(s)
browsing through Tables: shows the list of columns of a table:
hive> DESCRIBE invites;
OK
foo int None
bar string None
ds string None
# Partition Information
# col_name data_type comment
ds string None
Time taken: 0.191 seconds, Fetched: 8 row(s)
Hive Running altering tables hive> ALTER TABLE events RENAME TO 3koobecaf; hive> ALTER TABLE pokes ADD COLUMNS (new_col INT); hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment'); hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2'); dropping Tables hive> DROP TABLE pokes;
Hive Running DML operations take a file from the local file system:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
or take a file from the HDFS file system:
hive> LOAD DATA INPATH '/user/hive/files/kv1.txt' OVERWRITE INTO TABLE pokes;
SQL query:
hive> SELECT * FROM pokes;
Hive Configuration on the Job Tracker of Hadoop By default, Hive uses the LocalJobRunner; Hive can also use the JobTracker of Hadoop. In the conf directory of the Hive home directory you have to add and edit the hive-site.xml file.
Hive Configuration on the Job Tracker of Hadoop In the conf directory of the Hive home directory you have to add and edit the hive-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/Users/mac/Documents/hive-0.11.0-bin/scratch</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>Location of the JobTracker, so Hive knows where to execute MapReduce jobs</description>
  </property>
</configuration>
Hive Running Running Hive One Shot command: $:~hive-*/bin/hive -e <command> For instance: $:~hive-*/bin/hive -e 'SELECT * FROM mytable LIMIT 3'
Result:
OK
name1 10
name2 20
name3 30
Hive Running Executing Hive queries from file: $:~hive-*/bin/hive -f <file> For instance: $:~hive-*/bin/hive -f query.hql query.hql SELECT * FROM mytable LIMIT 3
Hive Running Executing Hive queries from file inside the Hive Shell $:~ cat /path/to/file/query.hql SELECT * FROM mytable LIMIT 3 $:~hive-*/bin/hive hive> SOURCE /path/to/file/query.hql;
Hive in Local: Examples Word Count using Hive
wordcounts.hql:
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH './exercise/data/words.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
words.txt: program program pig pig program pig hadoop pig latin latin
Hive in Local: Examples Word Count using Hive $:~hive-*/bin/hive -f wordcounts.hql (input words.txt: program program pig pig program pig hadoop pig latin latin)
Hive in Local with User Defined Functions: Examples Convert unixtime to a regular date format Basic idea: define a User Defined Function (UDF); convert the time field using the UDF.
subscribers.txt (name, department, email, time):
Frank Black, 1001, frankdaman@eng.example.com, -72710640
Jolie Guerms, 1006, jguerms@ga.example.com, 1262365200
Mossad Ali, 1001, mali@eng.example.com, 1207818032
Chaka Kaan, 1006, ckhan@ga.example.com, 1130758322
Verner von Kraus, 1007, verner@example.com, 1341646585
Lester Dooley, 1001, ldooley@eng.example.com, 1300109650
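The conversion can be sketched in plain Python before writing the Java UDF (a local sketch, not part of the deck; the function name unix_to_date is illustrative, and the fixed GMT+2 zone matches the UDF on the next slide):

```python
from datetime import datetime, timezone, timedelta

def unix_to_date(text):
    # Seconds since the epoch -> dd-MM-yyyy HH:mm:ss in a fixed GMT+2
    # zone, matching the output format of the Java Unix2Date UDF.
    tz = timezone(timedelta(hours=2))
    dt = datetime.fromtimestamp(int(text), tz)
    return dt.strftime("%d-%m-%Y %H:%M:%S") + " GMT+02:00"

# Jolie Guerms' timestamp from subscribers.txt
print(unix_to_date("1262365200"))  # 01-01-2010 19:00:00 GMT+02:00
```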
Hive in Local with User Defined Functions: Examples Convert unixtime to a regular date format
Unix2Date.java:
package com.example.hive.udf;
import java.util.Date;
import java.util.TimeZone;
import java.text.SimpleDateFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Unix2Date extends UDF {
    public Text evaluate(Text text) {
        if (text == null) return null;
        long timestamp = Long.parseLong(text.toString());
        // timestamp*1000 is to convert seconds to milliseconds
        Date date = new Date(timestamp * 1000L);
        // the format of your date
        SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss z");
        sdf.setTimeZone(TimeZone.getTimeZone("GMT+2"));
        String formattedDate = sdf.format(date);
        return new Text(formattedDate);
    }
}
Hive in Local with User Defined Functions: Examples Convert unixtime to a regular date format Compile the UDF class and package it into unix_date.jar.
Hive in Local with User Defined Functions: Examples Convert unixtime to a regular date format
$:~hive-*/bin/hive -f time_conversion.hql
time_conversion.hql:
CREATE TABLE IF NOT EXISTS subscriber (
  username STRING,
  dept STRING,
  email STRING,
  provisioned STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH './exercise/data/subscribers.txt' INTO TABLE subscriber;
ADD JAR ./exercise/jar_files/unix_date.jar;
CREATE TEMPORARY FUNCTION unix_date AS 'com.example.hive.udf.Unix2Date';
SELECT username, unix_date(provisioned) FROM subscriber;
Hive in Local with User Defined Functions: Examples Convert unixtime to a regular date format
$:~hive-*/bin/hive -f time_conversion.hql
Frank Black 12-09-1967 12:36:00 GMT+02:00
Jolie Guerms 01-01-2010 19:00:00 GMT+02:00
Mossad Ali 10-04-2008 11:00:32 GMT+02:00
Chaka Kaan 31-10-2005 13:32:02 GMT+02:00
Verner von Kraus 07-07-2012 09:36:25 GMT+02:00
Lester Dooley 14-03-2011 15:34:10 GMT+02:00
Time taken: 9.12 seconds, Fetched: 6 row(s)
Quickly Wrap Up! Two ways of doing one thing OR! One way of doing two things
Two ways of doing the same thing Both generate MapReduce jobs from a query written in a higher-level language. Both free users from knowing all the little secrets of MapReduce & HDFS.
Language PigLatin: procedural data-flow language A = LOAD 'mydata'; dump A; HiveQL: declarative SQL-ish language SELECT * FROM mytable;
Different languages = Different users Pig: more popular among programmers and researchers Hive: more popular among analysts
Different users = Different usage pattern Pig: programmers: Writing complex data pipelines researchers: Doing ad-hoc analysis typically employing Machine Learning Hive: analysts: Generating daily reports
Different usage pattern [Diagram: data flows from Data Collection through the Data Factory into the Data Warehouse. Pig serves the Data Factory stage (pipelines, iterative processing, research); Hive serves the Data Warehouse stage (BI tools, analysis).]
Different usage pattern = Different future directions Pig is evolving towards a language of its own Users are asking for a better dev environment: debugger, linker, editor, etc. Hive is evolving towards a data-warehousing solution Users are asking for better integration with other systems (O/JDBC)
Resources