Apache Pig Joining Data-Sets



2012 coreservlets.com and Dima May

Apache Pig Joining Data-Sets

Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/
Also see the customized Hadoop training courses (onsite or at public venues): http://courses.coreservlets.com/hadoop-training.html

Customized Java EE Training: http://courses.coreservlets.com/
Hadoop, Java, JSF 2, PrimeFaces, Servlets, JSP, Ajax, jQuery, Spring, Hibernate, RESTful Web Services, Android. Developed and taught by well-known author and developer. At public venues or onsite at your location. Contact info@coreservlets.com for details.

Agenda

- Joining data-sets
- User Defined Functions (UDF)

Joins Overview

Joins are a critical tool for data processing and will probably be used in most of your Pig scripts. Pig supports:

- Inner Joins
- Outer Joins
- Full Joins

How to Join in Pig

Join steps:
1. Load records into a bag from input #1
2. Load records into a bag from input #2
3. Join the two data-sets (bags) by the provided join key

The default join is an Inner Join: rows are joined where the keys match, and rows that do not have matches are not included in the result set.

Simple Inner Join Example

    --InnerJoin.pig
    -- 1: Load records into a bag from input #1 (comma as separator)
    posts = load '/training/data/user-posts.txt' using PigStorage(',')
        as (user:chararray,post:chararray,date:long);
    -- 2: Load records into a bag from input #2
    likes = load '/training/data/user-likes.txt' using PigStorage(',')
        as (user:chararray,likes:int,date:long);
    -- 3: Join the two data-sets
    userinfo = join posts by user, likes by user;
    dump userinfo;

When a key is equal in both data-sets, the rows are joined into a new single row; in this case, when the user name is equal.
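The inner-join semantics described above can be sketched in plain Java (a minimal, hypothetical stand-in for what Pig does across the cluster; the class and method names here are ours, not Pig's):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of inner-join semantics: rows pair up only where the
// join key (user) exists in BOTH inputs; unmatched rows are dropped.
public class InnerJoinSketch {

    // Each map models one input, keyed by the join field (user).
    public static List<String> innerJoin(Map<String, String> posts,
                                         Map<String, String> likes) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> p : posts.entrySet()) {
            String like = likes.get(p.getKey());
            if (like != null) {                      // key present in both sets
                joined.add(p.getKey() + "," + p.getValue() + "," + like);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> posts = new LinkedHashMap<>();
        posts.put("user1", "funny Story");
        posts.put("user5", "yet Another Blog");      // no matching like
        Map<String, String> likes = new LinkedHashMap<>();
        likes.put("user1", "12");
        likes.put("user3", "0");                     // no matching post
        System.out.println(innerJoin(posts, likes)); // [user1,funny Story,12]
    }
}
```

Only user1 appears in both inputs, so only one joined row survives, mirroring how Pig drops user5 and user3 in the example runs below.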

Execute InnerJoin.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ hdfs dfs -cat /training/data/user-likes.txt
    user1,12,1343182026191
    user2,7,1343182139394
    user3,0,1343182154633
    user4,50,1343182147364
    $ pig $PLAY_AREA/pig/scripts-samples/InnerJoin.pig
    (user1,funny Story,1343182026191,user1,12,1343182026191)
    (user2,cool Deal,1343182133839,user2,7,1343182139394)
    (user4,interesting Post,1343182154633,user4,50,1343182147364)

user1, user2 and user4 are ids that exist in both data-sets; the values for these records have been joined.

Field Names After Join

Join re-uses the names of the input fields and prepends the name of the input bag: <bag_name>::<field_name>

    grunt> describe posts;
    posts: {user: chararray,post: chararray,date: long}
    grunt> describe likes;
    likes: {user: chararray,likes: int,date: long}
    grunt> describe userinfo;
    userinfo: {posts::user: chararray, posts::post: chararray, posts::date: long,
               likes::user: chararray, likes::likes: int, likes::date: long}

The schema of the resulting bag holds the fields joined from the posts bag followed by the fields joined from the likes bag.

Join By Multiple Keys

You must provide the same number of keys, and each key must be of the same type.

    --InnerJoinWithMultipleKeys.pig
    posts = load '/training/data/user-posts.txt' using PigStorage(',')
        as (user:chararray,post:chararray,date:long);
    likes = load '/training/data/user-likes.txt' using PigStorage(',')
        as (user:chararray,likes:int,date:long);
    -- Only join records whose user and date are equal
    userinfo = join posts by (user,date), likes by (user,date);
    dump userinfo;

Execute InnerJoinWithMultipleKeys.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ hdfs dfs -cat /training/data/user-likes.txt
    user1,12,1343182026191
    user2,7,1343182139394
    user3,0,1343182154633
    user4,50,1343182147364
    $ pig $PLAY_AREA/pig/scripts/InnerJoinWithMultipleKeys.pig
    (user1,funny Story,1343182026191,user1,12,1343182026191)

There is only one record in each data-set where both user and date are equal.

Outer Join

Records which will not join with the other record-set are still included in the result:

Left Outer: Records from the first data-set are included whether they have a match or not; fields from the unmatched (second) bag are set to null.
Right Outer: The opposite of a Left Outer Join: records from the second data-set are included no matter what; fields from the unmatched (first) bag are set to null.
Full Outer: Records from both sides are included; for unmatched records the fields from the other bag are set to null.

Left Outer Join Example

    --LeftOuterJoin.pig
    posts = load '/training/data/user-posts.txt' using PigStorage(',')
        as (user:chararray,post:chararray,date:long);
    likes = load '/training/data/user-likes.txt' using PigStorage(',')
        as (user:chararray,likes:int,date:long);
    userinfo = join posts by user left outer, likes by user;
    dump userinfo;

Records in the posts bag will be in the result-set even if there isn't a match by user in the likes bag.
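Left-outer semantics can likewise be sketched in plain Java (a hypothetical illustration, not Pig's implementation): every left-side row survives, and a missing right-side match yields an empty field, mirroring the `(user5,yet Another Blog,13431839394,,,)` row in the example run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of LEFT OUTER join: every row from the first (left) input is
// kept; when the right side has no match, its fields come out empty/null.
public class LeftOuterJoinSketch {

    public static List<String> leftOuterJoin(Map<String, String> posts,
                                             Map<String, String> likes) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> p : posts.entrySet()) {
            String like = likes.get(p.getKey());     // null when unmatched
            joined.add(p.getKey() + "," + p.getValue() + ","
                       + (like == null ? "" : like));
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> posts = new LinkedHashMap<>();
        posts.put("user1", "funny Story");
        posts.put("user5", "yet Another Blog");      // only in posts
        Map<String, String> likes = new LinkedHashMap<>();
        likes.put("user1", "12");
        // user5 survives with a trailing empty field:
        System.out.println(leftOuterJoin(posts, likes));
    }
}
```

Swapping which map the loop iterates over would give right-outer behavior; iterating the union of keys gives full-outer.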

Execute LeftOuterJoin.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ hdfs dfs -cat /training/data/user-likes.txt
    user1,12,1343182026191
    user2,7,1343182139394
    user3,0,1343182154633
    user4,50,1343182147364
    $ pig $PLAY_AREA/pig/scripts/LeftOuterJoin.pig
    (user1,funny Story,1343182026191,user1,12,1343182026191)
    (user2,cool Deal,1343182133839,user2,7,1343182139394)
    (user4,interesting Post,1343182154633,user4,50,1343182147364)
    (user5,yet Another Blog,13431839394,,,)

user5 is in the posts data-set but NOT in the likes data-set, so its likes fields are empty.

Right Outer and Full Join

    --RightOuterJoin.pig
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',')
        AS (user:chararray,post:chararray,date:long);
    likes = LOAD '/training/data/user-likes.txt' USING PigStorage(',')
        AS (user:chararray,likes:int,date:long);
    userinfo = JOIN posts BY user RIGHT OUTER, likes BY user;
    DUMP userinfo;

    --FullOuterJoin.pig
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',')
        AS (user:chararray,post:chararray,date:long);
    likes = LOAD '/training/data/user-likes.txt' USING PigStorage(',')
        AS (user:chararray,likes:int,date:long);
    userinfo = JOIN posts BY user FULL OUTER, likes BY user;
    DUMP userinfo;

Cogroup

COGROUP joins data-sets while preserving the structure of both sets: it creates a tuple for each key, and the matching tuples from each relation become bags within that tuple.

    --Cogroup.pig
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',')
        AS (user:chararray,post:chararray,date:long);
    likes = LOAD '/training/data/user-likes.txt' USING PigStorage(',')
        AS (user:chararray,likes:int,date:long);
    userinfo = COGROUP posts BY user, likes BY user;
    DUMP userinfo;

Execute Cogroup.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ hdfs dfs -cat /training/data/user-likes.txt
    user1,12,1343182026191
    user2,7,1343182139394
    user3,0,1343182154633
    user4,50,1343182147364
    $ pig $PLAY_AREA/pig/scripts/Cogroup.pig
    (user1,{(user1,funny Story,1343182026191)},{(user1,12,1343182026191)})
    (user2,{(user2,cool Deal,1343182133839)},{(user2,7,1343182139394)})
    (user3,{},{(user3,0,1343182154633)})
    (user4,{(user4,interesting Post,1343182154633)},{(user4,50,1343182147364)})
    (user5,{(user5,yet Another Blog,13431839394)},{})

There is one tuple per key: the first field is the key, the second is a bag that came from the posts bag (first data-set), and the third is a bag from the likes bag (second data-set).
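The per-key output of COGROUP can also be modeled in plain Java (a hypothetical sketch; the nested lists stand in for Pig's bags): one entry per key, holding one bag of matching rows from each input, with an empty bag where a side has no match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of COGROUP: one output entry per key; each entry holds a bag
// (list) of matching rows from each input. Unmatched sides get an empty
// bag rather than being dropped (outer behavior by default).
public class CogroupSketch {

    public static Map<String, List<List<String>>> cogroup(
            Map<String, String> posts, Map<String, String> likes) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        TreeSet<String> keys = new TreeSet<>(posts.keySet());
        keys.addAll(likes.keySet());                 // union of all keys
        for (String k : keys) {
            List<String> postBag = new ArrayList<>();
            if (posts.containsKey(k)) postBag.add(posts.get(k));
            List<String> likeBag = new ArrayList<>();
            if (likes.containsKey(k)) likeBag.add(likes.get(k));
            out.put(k, Arrays.asList(postBag, likeBag));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> posts = new TreeMap<>();
        posts.put("user1", "funny Story");
        posts.put("user5", "yet Another Blog");
        Map<String, String> likes = new TreeMap<>();
        likes.put("user1", "12");
        likes.put("user3", "0");
        // user3 gets an empty posts bag; user5 an empty likes bag.
        System.out.println(cogroup(posts, likes));
    }
}
```

Dropping keys whose postBag or likeBag is empty would mimic the COGROUP ... INNER variant shown on the next slide.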

Cogroup with INNER

Cogroup is an OUTER JOIN by default. You can remove records with empty bags by specifying INNER on each bag, giving INNER JOIN-like functionality.

    --CogroupInner.pig
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',')
        AS (user:chararray,post:chararray,date:long);
    likes = LOAD '/training/data/user-likes.txt' USING PigStorage(',')
        AS (user:chararray,likes:int,date:long);
    userinfo = COGROUP posts BY user INNER, likes BY user INNER;
    DUMP userinfo;

Execute CogroupInner.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ hdfs dfs -cat /training/data/user-likes.txt
    user1,12,1343182026191
    user2,7,1343182139394
    user3,0,1343182154633
    user4,50,1343182147364
    $ pig $PLAY_AREA/pig/scripts/CogroupInner.pig
    (user1,{(user1,funny Story,1343182026191)},{(user1,12,1343182026191)})
    (user2,{(user2,cool Deal,1343182133839)},{(user2,7,1343182139394)})
    (user4,{(user4,interesting Post,1343182154633)},{(user4,50,1343182147364)})

Records with empty bags are removed.

User Defined Function (UDF)

There are times when Pig's built-in operators and functions will not suffice. Pig provides the ability to implement your own:

- Filter — Ex: res = FILTER bag BY udffilter(post);
- Load Function — Ex: res = LOAD 'file.txt' USING udfload();
- Eval — Ex: res = FOREACH bag GENERATE udfeval($1);

You have a choice between several programming languages: Java, Python, JavaScript.

Implement Custom Filter Function

Our custom filter function will remove records where the provided value is longer than 15 characters:

    filtered = FILTER posts BY isShort(post);

Simple steps to implement a custom filter:
1. Extend the FilterFunc class and implement the exec method
2. Register the JAR file that contains your implementation with your Pig script
3. Use the custom filter function in the Pig script

1: Extend FilterFunc

The FilterFunc class extends EvalFunc, customizing it for filter functionality. Implement the exec method:

    public Boolean exec(Tuple tuple) throws IOException

It returns false if the tuple needs to be filtered out and true otherwise. A Tuple is a list of ordered fields indexed from 0 to N. We are only expecting a single field within the provided tuple; to retrieve fields use tuple.get(0).

    import java.io.IOException;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    // Extend FilterFunc and implement the exec method
    public class IsShort extends FilterFunc {
        private static final int MAX_CHARS = 15;

        @Override
        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.isNull() || tuple.size() == 0) {
                return false;
            }
            // Default to a single field within a tuple
            Object obj = tuple.get(0);
            if (obj instanceof String) {
                // Pig's CHARARRAY type will cast to String
                String st = (String) obj;
                // Filter out Strings longer than MAX_CHARS characters
                if (st.length() > MAX_CHARS) {
                    return false;
                }
                return true;
            }
            // Any Object that cannot cast to String will be filtered out
            return false;
        }
    }

2: Register JAR with Pig Script

Compile the class with your filter function and package it into a JAR file, then use the REGISTER operator to supply the JAR file to your script:

    REGISTER HadoopSamples.jar

This is the local path to the JAR file. The path can be either absolute or relative to the execution location, and it must NOT be wrapped in quotes. REGISTER adds the JAR file to Java's CLASSPATH.

3: Use Custom Filter Function in the Pig Script

Pig locates functions by looking on the CLASSPATH for the fully qualified class name:

    filtered = FILTER posts BY pig.IsShort(post);

Pig will properly distribute the registered JAR and add it to the CLASSPATH. You can also create an alias for your function using the DEFINE operator:

    DEFINE isShort pig.IsShort();
    ...
    filtered = FILTER posts BY isShort(post);
    ...

Script with Custom Function

    --CustomFilter.pig
    -- Pig custom functions are packaged in the JAR and can be used in this script
    REGISTER HadoopSamples.jar
    -- Create a short alias for your function
    DEFINE isShort pig.IsShort();

    -- Script defines a schema; post field will be of type chararray
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',')
        AS (user:chararray,post:chararray,date:long);
    filtered = FILTER posts BY isShort(post);
    dump filtered;

Execute CustomFilter.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ pig pig/scripts/CustomFilter.pig
    (user1,funny Story,1343182026191)
    (user2,cool Deal,1343182133839)

Posts whose length exceeds 15 characters have been filtered out.

Filter Function and Schema

What would happen to the pig.IsShort custom filter if the schema was NOT defined in the script?

    --CustomFilter-NoSchema.pig
    REGISTER HadoopSamples.jar
    DEFINE isShort pig.IsShort();

    -- LOAD does not define a schema
    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',');
    -- Since no schema is defined, reference the second field by index
    filtered = FILTER posts BY isShort($1);
    dump filtered;

Execute CustomFilter-NoSchema.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ pig pig/scripts/CustomFilter-NoSchema.pig
    $

Why did CustomFilter-NoSchema.pig produce no results?

Why did CustomFilter-NoSchema.pig Produce no Results?

Recall that the script doesn't define a schema on the LOAD operation:

    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',');
    filtered = FILTER posts BY isShort($1);

When a type is not specified, Pig defaults to bytearray (the DataByteArray class). Recall our custom implementation in IsShort.exec:

    Object obj = tuple.get(0);
    if (obj instanceof String) {
        ...
    }
    return false;

Since the script never defined a schema, obj will be of type DataByteArray, and the filter will remove ALL records.

Make the IsShort Function Type Aware

Override the getArgToFuncMapping method on EvalFunc, the parent of FilterFunc, to specify the expected type of the function's parameter(s). The method returns a List of UDF specifications, FuncSpec objects, where each object represents a parameter field. In our case we just need to provide a single FuncSpec object to describe the field's type:

    filtered = FILTER posts BY isShort($1);

The FuncSpec object will describe the function's parameter.
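The pitfall above can be reproduced outside Pig with a small hypothetical sketch (the DataByteArray here is a stand-in class, not Pig's real org.apache.pig.data.DataByteArray): the instanceof String test fails for a raw-bytes value, so the filter rejects every record.

```java
// Sketch of the no-schema pitfall: when no type is declared, the UDF
// receives a byte-array wrapper rather than a String, so the
// `instanceof String` check fails and every record is filtered out.
public class ByteArrayPitfall {

    // Stand-in for Pig's DataByteArray (the real class lives in
    // org.apache.pig.data); crucially, it is NOT a java.lang.String.
    public static class DataByteArray {
        public final byte[] data;
        public DataByteArray(String s) { data = s.getBytes(); }
    }

    // Mirrors the logic of IsShort.exec on a single extracted field.
    public static boolean isShort(Object obj) {
        if (obj instanceof String) {
            return ((String) obj).length() <= 15;    // keep short posts
        }
        return false;                                // non-String: filtered out
    }

    public static void main(String[] args) {
        System.out.println(isShort("funny Story"));                    // true
        System.out.println(isShort(new DataByteArray("funny Story"))); // false
    }
}
```

The same string is kept when passed as a String but rejected when wrapped in the byte-array type, which is exactly what happens when the Pig script omits the schema.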

getArgToFuncMapping method of IsShortWithSchema.java

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> schemaSpec = new ArrayList<FuncSpec>();
        // First argument is the field alias and is ignored for type
        // conversion; second argument is the type: CHARARRAY will cast
        // to String
        FieldSchema fieldSchema = new FieldSchema(null, DataType.CHARARRAY);
        // First argument is the name of the function; second is the
        // schema for the function, in this case just one field
        FuncSpec fieldSpec = new FuncSpec(this.getClass().getName(),
                new Schema(fieldSchema));
        schemaSpec.add(fieldSpec);
        // Return FuncSpec objects that describe metadata about each field
        return schemaSpec;
    }

CustomFilter-WithSchema.pig

    --CustomFilter-WithSchema.pig
    REGISTER HadoopSamples.jar
    -- Improved implementation of the filter with type specification
    DEFINE isShort pig.IsShortWithSchema();

    posts = LOAD '/training/data/user-posts.txt' USING PigStorage(',');
    filtered = FILTER posts BY isShort($1);
    dump filtered;

This Pig script still does NOT specify the type of the function's parameter.

Execute CustomFilter-WithSchema.pig

    $ hdfs dfs -cat /training/data/user-posts.txt
    user1,funny Story,1343182026191
    user2,cool Deal,1343182133839
    user4,interesting Post,1343182154633
    user5,yet Another Blog,13431839394
    $ pig pig/scripts/CustomFilter-WithSchema.pig
    (user1,funny Story,1343182026191)
    (user2,cool Deal,1343182133839)

The improved implementation specified the parameter type to be CHARARRAY, which will then cast to the String type.

Wrap-Up

Summary

We learned about:
- Joining data-sets
- User Defined Functions (UDF)

Questions? More info:

http://www.coreservlets.com/hadoop-tutorial/ — Hadoop programming tutorial
http://courses.coreservlets.com/hadoop-training.html — Customized Hadoop training courses, at public venues or onsite at your organization
http://courses.coreservlets.com/course-materials/java.html — General Java programming tutorial
http://www.coreservlets.com/java-8-tutorial/ — Java 8 tutorial
http://www.coreservlets.com/jsf-tutorial/jsf2/ — JSF 2.2 tutorial
http://www.coreservlets.com/jsf-tutorial/primefaces/ — PrimeFaces tutorial
http://coreservlets.com/ — JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training