COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Similar documents

Apache Pig Joining Data-Sets

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

American International Journal of Research in Science, Technology, Engineering & Mathematics

Recommended Literature for this Lecture

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

Big Data and Scripting Systems build on top of Hadoop

ITG Software Engineering

Introduction to Apache Hive

Big Data Too Big To Ignore

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Introduction to Apache Hive

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

Hadoop Job Oriented Training Agenda

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Hadoop Tutorial GridKa School 2011

Hadoop Scripting with Jaql & Pig

Cloudera Certified Developer for Apache Hadoop

Data processing goes big

Complete Java Classes Hadoop Syllabus Contact No:

Implement Hadoop jobs to extract business value from large and varied data sets

Big Data and Scripting Systems build on top of Hadoop

Big Data Weather Analytics Using Hadoop

Programming with Pig. This chapter covers

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Integration of Apache Hive and HBase

Hadoop Ecosystem B Y R A H I M A.

The Hadoop Eco System Shanghai Data Science Meetup

Xiaoming Gao Hui Li Thilina Gunarathne

Map Reduce & Hadoop Recommended Text:

Hadoop IST 734 SS CHUNG

Internals of Hadoop Application Framework and Distributed File System

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

Workshop on Hadoop with Big Data

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

How, What, and Where of Data Warehouses for MySQL

Big Data Course Highlights

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

Moving From Hadoop to Spark

Comparing SQL and NOSQL databases

Big Data With Hadoop

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Introduction to Apache Pig Indexing and Search

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Hadoop and Big Data Research

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig

Introduction to Big Data Training

Large scale processing using Hadoop. Ján Vaňo

Big Data: Pig Latin. P.J. McBrien. Imperial College London. P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36

An Insight on Big Data Analytics Using Pig Script

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Hadoop Pig. Introduction Basic. Exercise

COURSE CONTENT Big Data and Hadoop Training

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Integrating VoltDB with Hadoop

Using distributed technologies to analyze Big Data

Alternatives to HIVE SQL in Hadoop File Structure

07/11/2014 Julien! Poorna! Andreas

Integrating Big Data into the Computing Curricula

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

Architecting the Future of Big Data

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Scaling Up HBase, Hive, Pegasus

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

How To Scale Out Of A Nosql Database

On a Hadoop-based Analytics Service System

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Important Notice. (c) Cloudera, Inc. All rights reserved.

Alexander Rubin Principle Architect, Percona April 18, Using Hadoop Together with MySQL for Data Analysis

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Best Practices for Hadoop Data Analysis with Tableau

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Chase Wu New Jersey Ins0tute of Technology

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

SQL Server An Overview

BIG DATA What it is and how to use?

Architecting the Future of Big Data

Sentimental Analysis using Hadoop Phase 2: Week 2

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

Introduction To Hive

Click Stream Data Analysis Using Hadoop

Hadoop Hands-On Exercises

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

CASE STUDY OF HIVE USING HADOOP 1

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

Transcription:

COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt file) explanations to the code answers to questions Deliver electronically to gabriel@cs.uh.edu Expected by Friday, March 31, 11.59pm In case of questions: ask, ask, ask! 1

Part 1: Bulk Import into HBase 1. Given the same stack exchange data sets as used during the 1 st homework assignment. Develop a MapReduce code to perform a bulk import of the data in /bigdata-hw2 (not the large file from hw1) Same format as in the 1 st assignment, just different size ( ~220 MB). you have to decide on the schema, e.g. whether and how to group columns in families etc. Justify your selection in the document! Additional information: New build.xml example available on webpage, since you have to utilize different directories for the hbase jar files Additional argument required when starting a job to tell yarn where the hbase jar files are gabriel@shark> HADOOP_CLASSPATH=`hbase classpath` yarn jar edgarairquality.jar airqualitydemo.airquality /gabriel/input /gabriel/output Requirements: Final table name has to have the format bigd<xy>-hbasesample to avoid name clash of student tables 2

Required operations need to create first a table manually setting up the schema Perform MapReduce job using HFileOutputformat Input data using bulkload tool Document all steps in details Since not everything that you do is in the java file, please make sure that you document precisely how you created the table, how you executed the bulkload etc. Part 2 Write a MapReduce code, which calculate the average number of answers per question using the data from HBase (instead of the file) Compare the performance of the Hbase version of the code with your version from the 1 st homework assignment for the new file 3

Documentation The Documentation should contain (Brief) Problem description Solution strategy Results section Description of resources used Description of measurements performed Results (graphs/tables + findings) The document should not contain Replication of the entire source code that s why you have to deliver the sources Screen shots of every single measurement you made Actually, no screen shots at all. The output files 4

Pig Pig is an abstraction on top of Hadoop Pig is a platform for analyzing large data sets Provides high level programming language designed for data processing Converted into MapReduce and executed on Hadoop Clusters Why using Pig? MapReduce requires programmers Must think in terms of map and reduce functions More than likely will require Java programming Pig provides high-level language that can be used by Analysts and Scientists Does not require know how in parallel programming Pig s Features Join Datasets Sort Datasets Filter Data Types Group By User Defined Functions 5

Pig Components Pig Latin Command based language Designed specifically for data transformation and flow expression Execution Environment The environment in which Pig Latin commands are executed Supporting local and Hadoop execution modes Pig compiler converts Pig Latin to MapReduce Automatic vs. user level optimizations compared to manual MapReduce code Running Pig Script Execute commands in a file $pig scriptfile.pig Grunt Interactive Shell for executing Pig Commands Started when script file is NOT provided Can execute scripts from Grunt via run or exec commands Embedded Execute Pig commands using PigServer class Can have programmatic access to Grunt via PigRunner class 6

Pig Latin concepts Building blocks Field piece of data Tuple ordered set of fields, represented with ( and ) (10.4, 5, word, 4, field1) Bag collection of tuples, represented with { and } { (10.4, 5, word, 4, field1), (this, 1, blah) } Similar to Relational Database Bag is a table in the database Tuple is a row in a table $ pig grunt> cat /input/pig/a.txt a 1 d 4 c 9 k 6 Simple Pig Latin example Load grunt in default map-reduce mode grunt supports file system commands Load contents of text file into a bag called records grunt> records = LOAD '/input/a.txt' as (letter:chararray, count:int); grunt> dump records;... org.apache.pig.backend.hadoop.executionengine.mapreducelayer.mapreducelauncher - 50% complete 2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreducelayer.mapreducelauncher - 100% complete... (a,1) (d,4) (c,9) (k,6) grunt> Display records on screen 7

Simple Pig Latin example No action is taken until DUMP or STORE commands are encountered Pig will parse, validate and analyze statements but not execute them STORE saves results (typically to a file) DUMP displays the results to the screen doesn t make sense to print large arrays to the screen For information and debugging purposes you can print a small sub-set to the screen grunt> records = LOAD '/input/excite-small.log' AS (userid:chararray, timestamp:long, query:chararray); grunt> toprint = LIMIT records 5; grunt> DUMP toprint; Simple Pig Latin example LOAD 'data' [USING function] [AS schema]; data name of the directory or file Must be in single quotes USING specifies the load function to use By default uses PigStorage which parses each line into fields using a delimiter Default delimiter is tab ( \t ) The delimiter can be customized using regular expressions AS assign a schema to incoming data Assigns names and types to fields ( alias:type) (name:chararray, age:int, gpa:float) 8

records = LOAD '/input/excite-small.log USING PigStorage() AS (userid:chararray, timestamp:long, query:chararray); int Signed 32-bit integer 10 long Signed 64-bit integer 10L or 10l float 32-bit floating point 10.5F or 10.5f double 64-bit floating point 10.5 or 10.5e2 or 10.5E2 chararray Character array (string) in Unicode UTF-8 bytearray Byte array (blob) hello world tuple An ordered set of fields (T: tuple (f1:int, f2:int)) bag A collection of tuples (B: bag {T: tuple(t1:int, t2:int)}) Pig Latin Diagnostic Tools Display the structure of the Bag grunt> DESCRIBE <bag_name>; Display Execution Plan Produces Various reports, e.g. logical plan, MapReduce plan grunt> EXPLAIN <bag_name>; Illustrate how Pig engine transforms the data grunt> ILLUSTRATE <bag_name>; 9

FOREACH statement FOREACH <bag> GENERATE <data> Iterate over each element in the bag and produce a result grunt> records = LOAD /input/a.txt' AS (c:chararray, i:int); grunt> dump records; (a,1) (d,4) (c,9) (k,6) grunt> counts = foreach records generate i; grunt> dump counts; (1) (4) (9) (6) Joining Two Data Sets Join Steps Load records into a bag from input #1 Load records into a bag from input #2 Join the 2 data-sets (bags) by provided join key Default Join is Inner Join Rows are joined where the keys match Rows that do not have matches are not included in the result Inner join Set 1 Set 2 10

Simple join example 1. Load records into a bag from input #1 posts = load '/input/user-posts.txt' using PigStorage(',') as (user:chararray, post:chararray, date:long); 2. Load records into a bag from input #2 likes = load '/input/user-likes.txt' using PigStorage(',') as (user:chararray,likes:int,date:long); 3. Join the data sets when a key is equal in both data-sets then the rows are joined into a new single row; In this case when user name is equal userinfo = join posts by user, likes by user; dump userinfo; $ hdfs dfs -cat /input/user-posts.txt user1,funny Story,1343182026191 user2,cool Deal,1343182133839 user4,interesting Post,1343182154633 user5,yet Another Blog,13431839394 $ hdfs dfs -cat /input/user-likes.txt user1,12,1343182026191 user2,7,1343182139394 user3,0,1343182154633 user4,50,1343182147364 $ pig /code/innerjoin.pig (user1,funny Story,1343182026191,user1,12,1343182026191) (user2,cool Deal,1343182133839,user2,7,1343182139394) (user4,interestingpost,1343182154633,user4,50,1343182147364) 11

Join re-uses the names of the input fields and prepends the name of the input bag <bag_name>::<field_name> grunt> describe posts; posts: {user: chararray,post: chararray,date: long} grunt> describe likes; likes: {user: chararray,likes: int,date: long} grunt> describe userinfo; UserInfo: { posts::user: chararray, posts::post: chararray, posts::date: long, likes::user: chararray, likes::likes: int, likes::date: long} Outer Join Records which will not join with the other record-set are still included in the result Left Outer Records from the first data-set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null. Right Outer The opposite of Left Outer Join: Records from the second dataset are included no matter what. Fields from the unmatched (first) bag are set to null. Full Outer Records from both sides are included. For unmatched records the fields from the other bag are set to null. 12

Pig Use cases Loading large amounts of data Pig is built on top of Hadoop -> scales with the number of servers Alternative to manual bulkloading Using different data sources, e.g. collect web server logs, use external programs to fetch geo-location data for the users IP addresses, join the new set of geo-located web traffic to click maps stored Support for data sampling Hive Data Warehousing Solution built on top of Hadoop Provides SQL-like query language named HiveQL Minimal learning curve for people with SQL expertise Data analysts are target audience Early Hive development work started at Facebook in 2007 Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster 13

Hive Ability to bring structure to various data formats Simple interface for ad hoc querying, analyzing and summarizing large amounts of data Access to files on various data stores such as HDFS and HBase Hive does NOT provide low latency or realtime queries Even querying small amounts of data may take minutes Designed for scalability and ease-of-use rather than low latency responses Hive To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database Packaged with Derby, a lightweight embedded SQL DB Default Derby based is good for evaluation an testing Schema is not shared between users as each user has their own instance of embedded Derby Stored in metastore_db directory which resides in the directory that hive was started from Can easily switch another SQL installation such as MySQL 14

Hive Architecture Hive Interface Options Command Line Interface (CLI) Hive Web Interface https://cwiki.apache.org/confluence/display/hive/hivewebinterface Java Database Connectivity (JDBC) https://cwiki.apache.org/confluence/display/hive/hiveclient Re-used from Relational Databases Database: Set of Tables, used for name conflict resolution Table: Set of Rows that have the same schema (same columns) Row: A single record; a set of columns Column: provides value and type for a single value 15

Hive creating a table hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE; OK Time taken: 10.606 seconds creates a table with 3 columns How the underlying file should be parsed hive> show tables; OK posts Time taken: 0.221 seconds hive> describe posts; OK user string post string time bigint Time taken: 0.212 seconds Display schema for posts table Hive Query Data hive> select * from posts where user="user2";...... OK user2 Cool Deal 1343182133839 Time taken: 12.184 seconds hive> select * from posts where time<=1343182133839 limit 2;...... OK user1 Funny Story 1343182026191 user2 Cool Deal 1343182133839 Time taken: 12.003 seconds hive> 16

Partitions To increase performance Hive has the capability to partition data The values of partitioned column divide a table into segments Entire partitions can be ignored at query time Similar to relational databases indexes but not as granular Partitions have to be properly crated by users When inserting data must specify a partition At query time, whenever appropriate, Hive will automatically filter out partitions Bucketing Mechanism to query and examine random samples of data Break data into a set of buckets based on a hash function of a "bucket column" Capability to execute queries on a sub-set of random data Doesn t automatically enforce bucketing User is required to specify the number of buckets by setting # of reducer 17

Joins Hive support outer joins left, right and full joins Can join multiple tables Default Join is Inner Join Rows are joined where the keys match Rows that do not have matches are not included in the result Pig vs. Hive vs. Spark Hive Uses an SQL like query language called HQL Gives non-programmers the ability to query and analyze data in Hadoop. Pig Uses a workflow driven scripting language Don't need to be an expert Java programmer but need a few coding skills. Can be used to convert unstructured data into a meaningful form. Spark Successor to map-reduce in Hadoop, Emphasis on in-memory computing, lower level programming required than in Hive and Spark 18