Hadoop Pig: Introduction, Basics, Exercise




A set of files
A database
A single file

Modern systems have to deal with far more data than was the case in the past. Yahoo! has over 170 PB of data; Facebook has over 30 PB. We need a system to handle large-scale computation!

Characteristics: distributed systems are groups of networked computers, allowing developers to use multiple machines for a single job. At compute time, data is copied to the compute nodes. Programming for traditional distributed systems is complex:
Data exchange requires synchronization.
Only finite bandwidth is available.
Temporal dependencies are complicated.
It is difficult to deal with partial failures of the system.

Hadoop's approach: distribute the data as it is initially stored in the system, so individual nodes can work on data local to those nodes. Users can focus on developing applications.

Open source software + commodity hardware:
Zookeeper, Sqoop/Oozie, Pig, Hive, Hue, Mahout, Flume, MapReduce, HBase, Ganglia/Nagios
Hadoop Distributed File System (HDFS)


HDFS (Hadoop Distributed File System): distributed, scalable, fault tolerant, high performance.
MapReduce: a software framework for parallel computation over distributed data, written mainly in Java.

Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis, called Pig Latin, combined with the power of Hadoop and the MapReduce framework.

Q: Find users who tend to visit good pages, given two data sets: Visits and Pages.

Load Visits(user, url, time)
-> Canonicalize urls
-> Join (url = url) with Load Pages(url, pagerank)
-> Group by user
-> Compute average pagerank
-> Filter avgpr > 0.5
-> Result

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

Low-level programming.

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

Requires more effort to understand and maintain; you write the join code yourself.

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

Effort to manage multiple MapReduce jobs.

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';

Reference (Twitter): typically a Pig script is 5% of the code of native MapReduce, written in about 5% of the time.
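To make the dataflow concrete, the same computation can be sketched in plain Python over in-memory lists. The data and the `canonicalize` helper below are made up for illustration; this is only the logical semantics, not how Pig or MapReduce executes it.

```python
from collections import defaultdict

# Hypothetical stand-ins for /data/visits and /data/pages.
visits = [("amy", "WWW.Apache.ORG/pig", 1), ("amy", "www.apache.org/hadoop", 2),
          ("bob", "www.example.com", 3)]
pages = [("www.apache.org/pig", 0.9), ("www.apache.org/hadoop", 0.8),
         ("www.example.com", 0.1)]

def canonicalize(url):
    # Made-up canonicalization: lowercase the url.
    return url.lower()

# join Visits by url, Pages by url
ranks = dict(pages)
joined = [(user, canonicalize(url), ranks[canonicalize(url)])
          for user, url, _ in visits if canonicalize(url) in ranks]

# group VP by user, compute AVG(pagerank), filter avgpr > 0.5
by_user = defaultdict(list)
for user, _, pr in joined:
    by_user[user].append(pr)
good_users = {u for u, prs in by_user.items() if sum(prs) / len(prs) > 0.5}
print(good_users)  # {'amy'}
```

Each Pig statement above maps onto one of these steps: join, group, aggregate, filter.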

Ease of programming: a scripting language (Pig Latin). Note that Pig is not the same as SQL.
Increased productivity: in one test, 10 lines of Pig Latin replaced 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin.
Provides common functionality (such as join, group, sort, etc.).
Extensibility: user-defined functions (UDFs).

Pig Basic

You can run Pig (execute Pig Latin statements and Pig commands) using various modes:

                  Local Mode   MapReduce Mode
Interactive Mode  Yes          Yes
Batch Mode        Yes          Yes

Local mode. Definition: all files are installed and run using your local host and file system. How: use the -x flag: pig -x local.
MapReduce mode. Definition: requires access to a Hadoop cluster and an HDFS installation. How: this is the default mode; you can, but don't need to, specify it using the -x flag: pig or pig -x mapreduce.

By Pig command:
/* local mode */ $ pig -x local ...
/* mapreduce mode */ $ pig ... or $ pig -x mapreduce ...
By Java command:
/* local mode */ $ java -cp pig.jar org.apache.pig.Main -x local ...
/* mapreduce mode */ $ java -cp pig.jar org.apache.pig.Main ... or $ java -cp pig.jar org.apache.pig.Main -x mapreduce ...

Definition: using the Grunt shell. How: invoke the Grunt shell using the "pig" command ($ pig -x local, $ pig, or $ pig -x mapreduce), then enter your Pig Latin statements and Pig commands interactively at the command line. Example:
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;

Definition: run Pig in batch mode. How: Pig scripts + the Pig command. Example:
/* id.pig */
a = LOAD 'passwd' USING PigStorage(':');
b = FOREACH a GENERATE $0 AS id;
STORE b INTO 'id.out';
$ pig -x local id.pig

Pig scripts place Pig Latin statements and Pig commands in a single file. It is not required, but good practice, to identify the file using the *.pig extension. Scripts allow parameter substitution and support comments:
For single-line comments, use --
For multi-line comments, use /* ... */

Pig Basic

exec: run a Pig script.
Syntax: exec [-param param_name = param_value] [-param_file file_name] [script]
grunt> cat myscript.pig
a = LOAD 'student' AS (name, age, gpa);
b = ORDER a BY name;
STORE b INTO '$out';
grunt> exec -param out=myoutput myscript.pig;

run: run a Pig script (aliases defined in the script remain visible in the shell).
Syntax: run [-param param_name = param_value] [-param_file file_name] script
grunt> cat myscript.pig
b = ORDER a BY name;
c = LIMIT b 10;
grunt> a = LOAD 'student' AS (name, age, gpa);
grunt> run myscript.pig
grunt> d = LIMIT c 3;
grunt> DUMP d;
(alice,20,2.47)
(alice,27,1.95)

File commands: cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf, exec, run
Utility commands: help, quit, kill jobid, set debug [on|off], set job.name 'jobname'

Pig Basic

Type       Description                                Example
int        Signed 32-bit integer                      10
long       Signed 64-bit integer                      10L
float      32-bit floating point                      10.5F
double     64-bit floating point                      10.5
chararray  Character array (string), Unicode UTF-8    Hello world
bytearray  Byte array (blob)
boolean    Boolean                                    true/false (case insensitive)

Type: tuple. Description: an ordered set of fields. Example: (20, true). Syntax: ( field [, field ...] ). A tuple is just like a row with one or more fields; each field can be any data type, and any field may or may not have data.

Type: bag. Description: a collection of tuples. Example: {(1, 2), (3, 4)}. Syntax: { tuple [, tuple ...] }. A bag contains one or more tuples and can be an outer bag or an inner bag.

Type: map. Description: a set of key/value pairs. Example: [open#trendmicro]. Syntax: [ key#value [, key#value ...] ]. As in other languages, keys must be unique within a map.

Relation (outer bag) -- each row is a tuple, each column a field:
Tom Male 10
Mary Female 20
John Male 30
Susan Female
Sam Male 60

Conclusion: a field is a piece of data. A tuple is an ordered set of fields. A bag is a collection of tuples. A relation is a bag (more specifically, an outer bag). Pig Latin statements work with relations!
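The data model above has rough Python analogues, which can help build intuition. This is only an illustrative mapping (Pig's types are richer; the values here are made up):

```python
# Rough Python analogues of Pig's data model:
field = 20                        # a field: a single piece of data
tup = ("Tom", "Male", 20)         # a tuple: an ordered set of fields
bag = [("Tom", "Male", 20),       # a bag: a collection of tuples
       ("Mary", "Female", 20)]
m = {"open": "trendmicro"}        # a map: unique keys -> values
relation = bag                    # a relation is an outer bag

print(len(relation), relation[0][0])  # 2 Tom
```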

Case sensitive: names (aliases) of relations and fields; built-in function names.
Case insensitive: parameter names; keywords such as LOAD and STORE in Pig Latin (they can also be written as load, store).

Schemas are defined using the AS clause with the LOAD, STREAM, and FOREACH operators. They enable better parse-time error checking and more efficient code execution. They are optional, but their use is encouraged.

Legal formats: include both the field name and field type; include the field name only (no field type); or do not define a schema at all.

John 18 true
Mary 19 false
Bill 20 true
Joe 18 true
a = LOAD 'data' AS (name:chararray, age:int, male:boolean);
a = LOAD 'data' AS (name, age, male);
a = LOAD 'data';

(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
a = LOAD 'data' AS (F:tuple(f1:int, f2:int, f3:int), T:tuple(t1:chararray, t2:int));
a = LOAD 'data' AS (F:(f1:int, f2:int, f3:int), T:(t1:chararray, t2:int));

{(3,8,9)} {(1,4,7)} {(2,5,8)} a = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)}); a = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});

What if the field name or field type is not defined? If the field name is missing, you can only refer to the field using positional notation ($0, $1, ...). If the field type is missing, the value is treated as bytearray; you can change the default type using the cast operators.
John 18 true
Mary 19 false
Bill 20 true
Joe 18 true
grunt> a = LOAD 'data';
grunt> b = FOREACH a GENERATE $0;

Pig Basic

Loading Data Working with Data Storing Intermediate Results Storing Final Results Debugging Pig Latin

LOAD 'data' [USING function] [AS schema];
Use the LOAD operator to load data from the file system. The default load function is PigStorage. Example:
a = LOAD 'data' AS (name:chararray, age:int);
a = LOAD 'data' USING PigStorage(':') AS (name, age);

Transform data into what you want. Major operators: FOREACH : work with columns of data FILTER : work with tuples or rows of data GROUP : group data in a single relation JOIN : Performs an inner join of two or more relations based on common field values. UNION : merge the contents of two or more relations SPLIT : Partitions a relation into two or more relations.
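The semantics of the two most common transforms, FOREACH (per-column projection) and FILTER (per-row selection), can be sketched in plain Python over made-up rows; this is only the logical behavior, not Pig's execution:

```python
rows = [("John", 18), ("Mary", 19), ("Bill", 20)]

# FOREACH a GENERATE name: per-row projection over columns
names = [(name,) for name, age in rows]

# FILTER a BY age >= 19: keep only rows satisfying the condition
adults = [r for r in rows if r[1] >= 19]

print(names)   # [('John',), ('Mary',), ('Bill',)]
print(adults)  # [('Mary', 19), ('Bill', 20)]
```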

Default location is /tmp Could be configured by property pig.temp.dir

STORE relation INTO 'directory' [USING function];
Use the STORE operator to write output to the file system. The default store function is PigStorage. If the directory already exists, the STORE operation will fail. The output files are named part-nnnnn. Example:
STORE a INTO 'myoutput' USING PigStorage('*');

DUMP: display results on your terminal screen (useful).
DESCRIBE: review the schema of a relation (useful).
EXPLAIN: view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE: view the step-by-step execution of a series of statements.

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
Goal: find couple(s) where both partners are alive and age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband IS NOT NULL) AND (couple.husband != '') AND (couple.wife IS NOT NULL) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
DESCRIBE a;
a: {age: int, couple: (husband: chararray, wife: chararray)}

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
Goal: find couple(s) where both partners are alive and age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband IS NOT NULL) AND (couple.husband != '') AND (couple.wife IS NOT NULL) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
DUMP a;
(18,(John, Mary))
(22,(Austin, Sunny))
(78,(Richard, Alice))
(87,(Johnny,))

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
Goal: find couple(s) where both partners are alive and age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband IS NOT NULL) AND (couple.husband != '') AND (couple.wife IS NOT NULL) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
DUMP b;
(78,(Richard, Alice))

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
Goal: find couple(s) where both partners are alive and age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband IS NOT NULL) AND (couple.husband != '') AND (couple.wife IS NOT NULL) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
DESCRIBE c;
c: {lucky_couple: (old_man: chararray, old_women: chararray)}

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
Goal: find couple(s) where both partners are alive and age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband IS NOT NULL) AND (couple.husband != '') AND (couple.wife IS NOT NULL) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
Folder 'population':
(Richard, Alice)

Pig Basic

Operator        Symbol
addition        +
subtraction     -
multiplication  *
division        /
modulo          %
bincond         ?:
x = FOREACH a GENERATE f2, f1 + f2;
x = FOREACH a GENERATE f2, (f2 == 1 ? 1 : COUNT(b));

Operator AND OR NOT Symbol AND OR NOT x = FILTER a BY (f1==8) OR (NOT (f2+f3 > f1));

Operator                      Symbol
cast to a simple data type    (data_type) src_field
cast to tuple                 (tuple(data_type)) src_field
cast to bag                   (bag{tuple(data_type)}) src_field
cast to map                   (map[]) src_field
b = FOREACH a GENERATE (int)$0 + 1;
x = FOREACH b GENERATE group, (chararray)COUNT(a) AS total;

Operator                      Symbol
cast to a simple data type    (data_type) src_field
cast to tuple                 (tuple(data_type)) src_field
cast to bag                   (bag{tuple(data_type)}) src_field
cast to map                   (map[]) src_field
(1,2,3)
(4,2,1)
a = LOAD 'data' AS fld;
b = FOREACH a GENERATE (tuple(int,int,float))fld;
DESCRIBE a;
DESCRIBE b;
DUMP b;

Operator                      Symbol
cast to a simple data type    (data_type) src_field
cast to tuple                 (tuple(data_type)) src_field
cast to bag                   (bag{tuple(data_type)}) src_field
cast to map                   (map[]) src_field
{(1234L)}
{(5678L)}
{(9012)}
a = LOAD 'data' AS fld:bytearray;
b = FOREACH a GENERATE (bag{tuple(long)})fld;
DESCRIBE a;
DESCRIBE b;
DUMP b;

Operator                      Symbol
cast to a simple data type    (data_type) src_field
cast to tuple                 (tuple(data_type)) src_field
cast to bag                   (bag{tuple(data_type)}) src_field
cast to map                   (map[]) src_field
[open#apache]
[apache#hadoop]
[hadoop#pig]
a = LOAD 'data' AS fld;
b = FOREACH a GENERATE (map[])fld;
DESCRIBE a;
DESCRIBE b;
DUMP b;

Operator Symbol equal == not equal!= greater than > greater than or equal to >= less than < less than or equal to <= pattern matching matches x = FILTER a BY (f2 == 'apache'); x = FILTER a BY (f1 matches '.*apache.*');
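Note that Pig's `matches` operator tests the whole value against a regular expression, which is why the example uses '.*apache.*' rather than 'apache'. In Python terms this corresponds to a full match rather than a substring search (illustrative values below are made up):

```python
import re

values = ["apache", "www.apache.org", "hadoop"]

# f1 matches '.*apache.*'  ->  the entire string must match the pattern
hits = [v for v in values if re.fullmatch(r".*apache.*", v)]
print(hits)  # ['apache', 'www.apache.org']

# A bare 'apache' pattern would only match the exact string "apache".
exact = [v for v in values if re.fullmatch(r"apache", v)]
print(exact)  # ['apache']
```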

Operator           Symbol  Note
tuple constructor  ()      Equivalent to TOTUPLE
bag constructor    {}      Equivalent to TOBAG
map constructor    []      Equivalent to TOMAP

Joe 20
Amy 22
Leo 18
a = LOAD 'students' AS (name:chararray, age:int);
b = FOREACH a GENERATE (name, age);
STORE b INTO 'results';
(Joe,20)
(Amy,22)
(Leo,18)

Joe 20 3.5
Amy 22 3.2
Leo 18 2.1
a = LOAD 'data' AS (name:chararray, age:int, gpa:float);
b = FOREACH a GENERATE {(name, age)}, {name, age};
STORE b INTO 'results';
{(Joe,20)} {(Joe),(20)}
{(Amy,22)} {(Amy),(22)}
{(Leo,18)} {(Leo),(18)}

Joe 20 3.5
Amy 22 3.2
Leo 18 2.1
a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
b = FOREACH a GENERATE [name, gpa];
STORE b INTO 'results';
[Joe#3.5]
[Amy#3.2]
[Leo#2.1]

Operator is null is not null Symbol is null is not null x = FILTER a BY f1 is not null;

Loads data from the file system. The default load function is PigStorage; data can also be loaded by a customized UDF.
LOAD 'data' [USING function] [AS schema];
a = LOAD 'myfile.txt' USING PigStorage('\t');
a = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

Saves results to the file system. The default store function is PigStorage; it can be customized with a UDF. If the output directory already exists, the operation will fail.
STORE alias INTO 'directory' [USING function];
STORE a INTO 'myoutput' USING PigStorage('*');

Groups the data in one or more relations. The result of a GROUP operation is a relation that includes one tuple per group. The tuple contains two fields: the first field is named "group" and has the same type as the group key; the second field takes the name of the original relation and is of type bag.

alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression ...];
John 18 4.0F
Mary 19 3.8F
Bill 20 3.9F
Joe 18 3.8F
a = LOAD 'data' AS (name:chararray, age:int, gpa:float);
b = GROUP a BY age;
c = FOREACH b GENERATE group, COUNT(a);
DESCRIBE b;
DUMP b;
DUMP c;
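The shape of the GROUP result (one tuple per group: the key plus a bag of the original tuples) can be sketched in plain Python; this mirrors the logical semantics only, using the example's data:

```python
from collections import defaultdict

a = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

# b = GROUP a BY age: one (group, bag) tuple per distinct key
groups = defaultdict(list)
for t in a:
    groups[t[1]].append(t)  # t[1] is the group key (age)
b = sorted(groups.items())

# c = FOREACH b GENERATE group, COUNT(a)
c = [(key, len(bag)) for key, bag in b]
print(c)  # [(18, 2), (19, 1), (20, 1)]
```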

Basically, the GROUP and COGROUP operators are identical. For readability GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations.

alias = COGROUP alias { ALL | BY expression } [, alias ALL | BY expression ...];
data1:
Alice turtle
Alice goldfish
Alice cat
Bob dog
Bob cat
data2:
Cindy Alice
Mark Alice
Paul Bob
Paul Jane
a = LOAD 'data1' AS (owner:chararray, pet:chararray);
b = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
x = COGROUP a BY owner, b BY friend2;
DESCRIBE x;
DUMP x;

Selects tuples from a relation based on some condition. alias = FILTER alias BY expression; x = FILTER a BY (f1 == 8) OR (NOT (f2+f3 > f1));

Removes duplicate tuples in a relation. You can not use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT. alias = DISTINCT alias; (8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3) x = DISTINCT a; (1,2,3) (4,3,3) (8,3,4)

Sorts a relation based on one or more fields. Currently, ordering is supported on fields with simple types or by the tuple designator (*). You cannot order on fields with complex types or by expressions.
alias = ORDER alias BY { * [ASC|DESC] | field [ASC|DESC] [, field [ASC|DESC] ...] };
x = ORDER a BY a3 DESC;

Performs an inner join of two or more relations based on common field values. Inner joins ignore null keys, so it makes sense to filter them out before the join.

alias = JOIN alias BY field (, alias BY field);
data1:
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
data2:
2 4
8 9
1 3
2 7
2 9
4 6
4 9
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
b = LOAD 'data2' AS (b1:int, b2:int);
x = JOIN a BY a1, b BY b1;
DESCRIBE x;
DUMP x;
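The inner-join semantics, including the rule that null keys never match, can be sketched in plain Python (small made-up relations; not Pig's actual hash/merge join implementation):

```python
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (None, 9, 9)]
b = [(1, 3), (2, 4), (8, 9)]

# x = JOIN a BY a1, b BY b1: concatenate matching rows;
# rows whose join key is null are dropped, as in Pig.
x = [ra + rb
     for ra in a if ra[0] is not None
     for rb in b if ra[0] == rb[0]]
print(x)  # [(1, 2, 3, 1, 3), (8, 3, 4, 8, 9)]
```

This is why the slide suggests filtering null keys out before the join: they can never contribute to the result.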

Un-nests a tuple or bag, changing its structure.
FLATTEN(tuple | bag)

18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FOREACH a GENERATE age, couple;
c = FOREACH a GENERATE age, FLATTEN(couple);
DESCRIBE b;
DUMP b;
DESCRIBE c;
DUMP c;

18 (John, Mary)
22 (Austin, Sunny)
22 (Tom, Rosa)
78 (Richard, Alice)
87 (Johnny,)
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = GROUP a BY age;
c = FOREACH b GENERATE FLATTEN(a);
DESCRIBE b;
DUMP b;
DESCRIBE c;
DUMP c;
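The effect of FLATTEN on a nested tuple can be sketched in plain Python: the nested fields are promoted to top-level fields of the enclosing tuple (illustrative data; not Pig's implementation):

```python
a = [(18, ("John", "Mary")), (78, ("Richard", "Alice"))]

# FOREACH a GENERATE age, couple: the nested tuple stays nested
b = [(age, couple) for age, couple in a]

# FOREACH a GENERATE age, FLATTEN(couple): un-nest into top-level fields
c = [(age,) + couple for age, couple in a]

print(b)  # [(18, ('John', 'Mary')), (78, ('Richard', 'Alice'))]
print(c)  # [(18, 'John', 'Mary'), (78, 'Richard', 'Alice')]
```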

Computes the union of two or more relations. It does not eliminate duplicate tuples, and it does not ensure (as databases do) that all tuples adhere to the same schema. In the simplest case, you should ensure that the tuples in the input relations have the same schema.

UNION alias, alias [, alias ...];
data1:
1 2 3
4 5 6
7 8 9
data2:
1 2 3
1 3 5
2 4 6
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
b = LOAD 'data2' AS (b1:int, b2:int, b3:int);
x = UNION a, b;
DESCRIBE x;
DUMP x;

Partitions a relation into two or more relations. Result : A tuple may be assigned to more than one relation. A tuple may not be assigned to any relation.

SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...] [, alias OTHERWISE];
1 2 3
4 5 6
7 8 9
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
SPLIT a INTO x IF a1 == 7, y IF (a1 < 5 AND a2 < 6), z OTHERWISE;
DUMP x;
DUMP y;
DUMP z;
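SPLIT's semantics (a tuple may land in several output relations, or in none without an OTHERWISE clause) can be sketched in plain Python using the example's data and conditions:

```python
a = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# SPLIT a INTO x IF a1 == 7, y IF (a1 < 5 AND a2 < 6), z OTHERWISE;
x = [t for t in a if t[0] == 7]
y = [t for t in a if t[0] < 5 and t[1] < 6]
# OTHERWISE: tuples matching none of the IF conditions
z = [t for t in a if not (t[0] == 7) and not (t[0] < 5 and t[1] < 6)]

print(x, y, z)  # [(7, 8, 9)] [(1, 2, 3), (4, 5, 6)] []
```

Note that the conditions are evaluated independently per output relation, which is why a single tuple could satisfy more than one of them.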

alias = LIMIT alias n;
Limits the number of output tuples. Note: there is no guarantee which n tuples will be returned, and the tuples that are returned can change from one run to the next.
1 2 3
4 5 6
7 8 9
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
x = LIMIT a 2;
DUMP x;

alias = FOREACH alias GENERATE expression [AS schema] [, expression [AS schema] ...];
Generates data transformations based on columns of data.

1 2 3
4 5 6
7 8 9
a = LOAD 'data' AS (a1:int, a2:int, a3:int);
x = FOREACH a GENERATE a1 + a2 AS f1:int, (a3 == 3 ? TOTUPLE(1,1) : TOTUPLE(0,0));
DUMP x;

FOREACH can also work with nested blocks. FOREACH statements can be nested to two levels only. Valid nested operators: CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY. The GENERATE keyword must be the last statement within the nested block.

MARY,22,female
RICHARD,26,male
AUSTIN,28,male
JOHN,52,male
TOM,40,male
JIJI,38,female
a = LOAD 'data' USING PigStorage(',') AS (name:chararray, age:int, gender:chararray);
b = GROUP a BY gender;
c = FOREACH b {
    c0 = FILTER a BY (age > 20) AND (age < 30);
    GENERATE group, COUNT(c0) AS young;
}
DESCRIBE b;
DUMP b;
DESCRIBE c;
DUMP c;
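The nested FOREACH above (filter each group's bag, then count) can be sketched in plain Python using the same data; again this is the logical semantics only:

```python
from collections import defaultdict

a = [("MARY", 22, "female"), ("RICHARD", 26, "male"), ("AUSTIN", 28, "male"),
     ("JOHN", 52, "male"), ("TOM", 40, "male"), ("JIJI", 38, "female")]

# b = GROUP a BY gender
groups = defaultdict(list)
for name, age, gender in a:
    groups[gender].append((name, age, gender))

# c = FOREACH b { c0 = FILTER a BY (age > 20) AND (age < 30);
#                 GENERATE group, COUNT(c0) AS young; }
c = {g: len([t for t in bag if 20 < t[1] < 30]) for g, bag in groups.items()}
print(sorted(c.items()))  # [('female', 1), ('male', 2)]
```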

MARY,female,LA
RICHARD,male,Taipei
AUSTIN,male,Taipei
TOM,male,Taichiung
JOHN,male
JIJI,female
a = LOAD 'data' USING PigStorage(',') AS (name:chararray, gender:chararray, location:chararray);
b = GROUP a BY gender;
c = FOREACH b {
    c0 = FILTER a BY (location IS NOT NULL) AND (location != '');
    c1 = ORDER c0 BY name;
    GENERATE group, c1;
}
DESCRIBE c;
DUMP c;

Goal: become familiar with basic Pig Latin usage, including:
Be able to run a Pig script.
Be able to load/store data using Pig.
Be able to parse data using Pig.
Be able to apply simple operations on data.