Advanced Pig (or "we're not in Kansas anymore") Set operations in Map/Reduce How to parameterize an operation The oxymoron called "Pig Efficiency"



Similar documents
Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Hadoop Hands-On Exercises

Introduction to Apache Pig Indexing and Search

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Data Management in the Cloud

Apache Pig Joining Data-Sets

Big Data: Pig Latin. P.J. McBrien. Imperial College London. P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36

Hadoop Scripting with Jaql & Pig

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

American International Journal of Research in Science, Technology, Engineering & Mathematics

Hadoop Job Oriented Training Agenda

SQL Injection. SQL Injection. CSCI 4971 Secure Software Principles. Rensselaer Polytechnic Institute. Spring

Chapter 9 Joining Data from Multiple Tables. Oracle 10g: SQL

Relational Algebra. Basic Operations Algebra of Bags

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

ITG Software Engineering

Using Ad-Hoc Reporting

Hive Interview Questions

Hadoop Pig. Introduction Basic. Exercise

Relational Database: Additional Operations on Relations; SQL

Building Data Products using Hadoop at Linkedin. Mitul Tiwari Search, Network, and Analytics (SNA) LinkedIn

An Insight on Big Data Analytics Using Pig Script

Instant SQL Programming

Zebra and MapReduce. Table of contents. 1 Overview Hadoop MapReduce APIs Zebra MapReduce APIs Zebra MapReduce Examples...

REP200 Using Query Manager to Create Ad Hoc Queries

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Pig vs Hive. Big Data 2014

Course 20461C: Querying Microsoft SQL Server Duration: 35 hours

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Improving SQL Server Performance

Information Systems SQL. Nikolaj Popov

Chapter 3. Cartesian Products and Relations. 3.1 Cartesian Products

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

Introducing Microsoft SQL Server 2012 Getting Started with SQL Server Management Studio

Offline Image Viewer Guide

Querying Microsoft SQL Server


Lecture 9-10: Advanced Pig Latin! Claudia Hauff (Web Information Systems)!

HW3: Programming with stacks

Other Map-Reduce (ish) Frameworks. William Cohen

Big Data and Scripting Systems build on top of Hadoop

Querying Microsoft SQL Server 2012

SQL Server Administrator Introduction - 3 Days Objectives

Course 10774A: Querying Microsoft SQL Server 2012

Course ID#: W 35 Hrs. Course Content

Course 10774A: Querying Microsoft SQL Server 2012 Length: 5 Days Published: May 25, 2012 Language(s): English Audience(s): IT Professionals

Introduction To Hive

Efficiently Identifying Inclusion Dependencies in RDBMS

Querying Microsoft SQL Server 20461C; 5 days

Object Oriented Databases (OODBs) Relational and OO data models. Advantages and Disadvantages of OO as compared with relational

Using Excel in Research. Hui Bian Office for Faculty Excellence

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Using distributed technologies to analyze Big Data

Pig Laboratory. Additional documentation for the laboratory. Exercises and Rules. Tstat Data

MOC 20461C: Querying Microsoft SQL Server. Course Overview

Databases 2011 The Relational Model and SQL

Querying Microsoft SQL Server Course M Day(s) 30:00 Hours

Big Data Technology Pig: Query Language atop Map-Reduce

DATABASE NORMALIZATION

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Relational Algebra. Module 3, Lecture 1. Database Management Systems, R. Ramakrishnan 1

MOC QUERYING MICROSOFT SQL SERVER

Introduction to Querying & Reporting with SQL Server

Click Stream Data Analysis Using Hadoop

Security Test s i t ng Eileen Donlon CMSC 737 Spring 2008

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Toad for Data Analysts, Tips n Tricks

IBM Tivoli Software. Maximo Business Intelligence Work Packs. Workorder, Asset Failure, Asset and Inventory Management

Big Data Weather Analytics Using Hadoop

Using an Access Database

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Now that you have a fleet management system, what should you do with it?

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

>

How To Contact The Author Introduction What You Will Need The Basic Report Adding the Parameter... 9

Alarm Display. Tips & Tricks DIANA SMIT. Tel: 0861-WONDER Fax: (011)

Real SQL Programming 1

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

A Distributed Network Security Analysis System Based on Apache Hadoop-Related Technologies. Jeff Springer, Mehmet Gunes, George Bebis

Database Design Basics

Business Application Services Testing

7.0 BW Budget Formulation Report Tips and Tricks

More on SQL. Juliana Freire. Some slides adapted from J. Ullman, L. Delcambre, R. Ramakrishnan, G. Lindstrom and Silberschatz, Korth and Sudarshan

Big Data and Scripting Systems build on top of Hadoop

Transcription:

Advanced_Pig Page 1 Advanced Pig 2:25 PM Advanced Pig (or "we're not in Kansas anymore") Set operations in Map/Reduce How to parameterize an operation The oxymoron called "Pig Efficiency"

Advanced_Pig Page 2 Set operations 2:28 PM Set operations in Map/Reduce UNION is easy Intersection and Difference rely upon COGROUP operation

Advanced_Pig Page 3 COGROUP 2:35 PM COGROUP Input: two bags with a pair of comparable columns. Output: One bag in which the two bags are grouped together by the column. Sort of "half of a join". Example of COGROUP if s1 has schema {name:chararray, hits:int} and data (John,3) (Harry,4) (George,2) and s2 has schema {name:chararray, errors:int} and data (John,2) (John,3) (George,0) (Sue,1) Then foo = COGROUP s1 BY name, s2 BY name; has schema {group:chararray, s1:{(name:chararray, hits:int)}, s2:{(name: chararray, errors:int)}} and returns (John,{(John, 3)},{(John,2),(John,3)}) (Harry,{(Harry,4)},{}) (George,{(George,2)},{(George,0)}) (Sue, {},{(Sue,1)}) Note: Something is in the intersection of s1 and s2 if

Note: Something is in the intersection of s1 and s2 if there are no {}'s in the cogroup. Something is in the difference between s1 and s2 if there are no non-empty second sets. Advanced_Pig Page 4

Advanced_Pig Page 5 Set intersection 2:45 PM Set intersection If we have s1:{thing:chararray} and s2:{thing:chararray} then we can form their intersection via COGROUP and FILTER Example: grps = COGROUP s1 BY thing, s2 BY thing; -- cogroup by common grp2 = FILTER grps by NOT(IsEmpty(s1)) AND NOT(IsEmpty(s2)); -- throw away non-compliant things inter = FOREACH grp2 GENERATE group as thing; -- strip the co-group

Advanced_Pig Page 6 Set difference 2:48 PM Set difference Use COGROUP to determine whether sets are empty or not. USE FILTER to strike elements that are present in the second set: Example: if s1:{thing:chararray} and s2:{thing:chararray} then grps = COGROUP s1 BY thing, s2 BY thing; -- it's in the difference if it is in the LHS, but not in the RHS grp2 = FILTER grps by IsEmpty(s2); diff = FOREACH grp2 GENERATE group as thing;

Advanced_Pig Page 7 Parameters and CROSS 2:51 PM There are no true variables in Pig. Often, we want to set a parameter, e.g., how many things constitute a threshold. We can't do this in the script itself. How do we parameterize an operation? Example: want to be able to change the number of friends someone should have in order to "count" in a query. Step 1: store the parameter in a file. Step 2: distribute the parameter via CROSS Step 3: do the distributed operation. Step 4: strip the distributed parameter.

Advanced_Pig Page 8 CROSS 2:52 PM Suppose s1 and s2 are bags. Then foo = CROSS s1,s2 contains all tuples built from one tuple in s1 and another in s2 Example: if foo contains (1,2) (3,4) and bar contains (amy,fred) (george, jack) then CROSS foo,bar contains (1,2,amy,fred) (1,2,george,jack) (3,4,amy,fred) (3,4,george,jack)

Advanced_Pig Page 9 Example: parameterize an attribute of a FILTER 3:00 PM Suppose we have a relation friends: {name:chararray, friend:chararray} (Amy,George) (George,Fred) (Fred, Anne) (George,Joe) (George,Harry) and want to select people who have a certain number of friends. We create a parameters file 'params.dat' containing nfriends 2 and load it via params = LOAD 'params.dat' USING PigStorage(' ') AS (name:chararray, value:int); Then we group the pairs by first friend groups = GROUP friends BY name; to get: (Amy,{(Amy,George)}) (Fred, {(Fred, Anne)}) (George,{(George,Joe),(George,Harry),(George,Fred)}) and count the number of friends group2 = FOREACH groups GENERATE group as name, COUNT(friends) as count; to get (Amy, 1L)

Advanced_Pig Page 10 (Amy, 1L) (Fred, 1L) (George, 3L) Now we need to filter by the number of friends. We select the parameter of interest: nfriends = FILTER params BY name=='nfriends'; nfriend2 = FOREACH nfriends GENERATE value; This results in the relation nfriend2: (value: int) containing (2) Now we CROSS that relation with the group2 relation group3 = CROSS group2, nfriend2; to get (Amy, 1L, 2) (Fred, 1L, 2) (George, 3L, 2) And finally, filter by the parameter group4 = FILTER group3 by group2::count>=nfriend2::value; to get (George, 3L, 2) After which we can strip the parameter from the row: group5 = FOREACH group4 GENERATE group2::name AS name, group2::count AS count; To get the one tuple we want, i.e., (George, 3L)

Advanced_Pig Page 11 Pig "Efficiency" 3:13 PM What is "Pig Efficiency"? Pig script takes 1.7x java program time to do the same thing Contributions include: need to distribute data and code. dynamically, as computation progresses Runtime in Pig is affected by Data size: how much you have to deal with. Data distribution: is data you need where you need it? How to write "efficient" Pig scripts: FILTER as early as possible. PROJECT out useless attributes as early as possible. minimize Map/Reduce phases.

Advanced_Pig Page 12 Dirty Pig tricks 4:48 PM Dirty Pig Tricks canonicalization: if you have lots of variants of a thing, choose one A symmetric relation (s1,s2) in R => (s2,s1) in R. FILTER r by s1<s2; -- canonicalizes the pairs parameter distribution via CROSS/flatten you can move FILTERs upwards from the bottom, without changing the output of the script. s = SOMETHING(r); t = FILTER s BY s1<s2; can be reversed.