Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Introduction to Pig

Agenda
- What is Pig?
- Key Features of Pig
- The Anatomy of Pig
- Pig on Hadoop
- Pig Philosophy
- Pig Latin Overview
- Pig Latin Statements
- Pig Latin: Identifiers
- Pig Latin: Comments
- Data Types in Pig: Simple Data Types, Complex Data Types
- Running Pig
- Execution Modes of Pig
- Relational Operators
- Eval Function
- Piggy Bank
- When to use Pig?
- When NOT to use Pig?
- Pig versus Hive

What is Pig?
Apache Pig is a platform for data analysis. It is an alternative to MapReduce programming.

Features of Pig
- It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
- It provides a language called Pig Latin to express data flows.
- Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
- It allows users to develop their own functions (User Defined Functions) for reading, processing, and writing data.

The Anatomy of Pig
The main components of Pig are as follows:
1. Data flow language (Pig Latin).
   - Ease of programming: easy to write, understand, and maintain (1/20th the lines of code and 1/16th the development time).
   - Optimization opportunities: the user can focus on semantics rather than efficiency.
   - Extensibility: users can create their own functions to do special-purpose processing.
2. Interactive shell where you can type Pig Latin statements (Grunt).
3. Pig interpreter and execution engine.

Pig on Hadoop
Pig runs on Hadoop. It uses both the Hadoop Distributed File System (HDFS) and MapReduce programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data (data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input from and place output to other sources.

Pig Philosophy
- Pigs Eat Anything: Pig can operate on relational, nested, or unstructured data.
- Pigs Live Anywhere: Pig processes files not only from HDFS.
- Pigs are Domestic Animals: Pig is easily controlled and modified by its users.
- Pigs Fly: Pig processes data quickly.

Pig Latin Statements
- Pig Latin statements are the basic constructs used to process data with Pig.
- An operator in Pig Latin takes a relation as input and yields another relation as output.
- Pig Latin statements include schemas and expressions to process data.
- Statements end with a semicolon.

Pig Latin statements are generally ordered as follows:
1. A LOAD statement that reads data from the file system.
2. A series of statements that perform transformations.
3. A DUMP or STORE statement to display or store the result.

    A = load 'student' as (rollno, name, gpa);
    A = filter A by gpa > 4.0;
    A = foreach A generate UPPER(name);
    STORE A INTO 'myreport';

Here A is a relation.
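The LOAD, transform, DUMP/STORE ordering can be mimicked in plain Python to see what each statement does to a relation. This is only a rough sketch: the rows are made-up sample data, the relation names B and C are chosen for illustration, and Pig would run these steps in parallel on a cluster rather than in one process.

```python
# Made-up sample rows standing in for the 'student' file (rollno, name, gpa).
student = [
    (1001, "john", 3.0),
    (1002, "jack", 4.5),
    (1003, "smith", 4.2),
]

# LOAD: each row becomes a tuple in relation A.
A = list(student)

# FILTER ... BY gpa > 4.0: keep only the tuples that satisfy the condition.
B = [row for row in A if row[2] > 4.0]

# FOREACH ... GENERATE UPPER(name): emit one new tuple per input tuple.
C = [(name.upper(),) for (_rollno, name, _gpa) in B]

# DUMP: print the result (STORE would write it to a file instead).
for row in C:
    print(row)
```

Each step takes a relation and yields a new relation, which is why the statements chain naturally.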

Pig Latin Identifiers
An identifier begins with a letter and is followed by letters, numbers, and underscores.
Valid identifiers:   Y, A1, A1_2014, Sample
Invalid identifiers: 5, Sales$, Sales%, _Sales

Pig Latin Comments
In Pig Latin two types of comments are supported:
1. Single-line comments that begin with --.
2. Multiline comments that begin with /* and end with */.

Pig Latin Case Sensitivity
- Keywords are not case sensitive.
- Relations, paths, and function names are case sensitive.
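The identifier rule can be checked mechanically. The following Python sketch encodes only the rule as stated on the slide (a letter first, then letters, digits, or underscores); the function name is hypothetical and the pattern is not taken from Pig's own grammar.

```python
import re

# A letter first, then any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_identifier(name: str) -> bool:
    """Return True if name follows the identifier rule stated on the slide."""
    return IDENTIFIER.fullmatch(name) is not None

# The slide's own examples: the first four are valid, the last four are not.
for candidate in ["Y", "A1", "A1_2014", "Sample", "5", "Sales$", "Sales%", "_Sales"]:
    print(candidate, is_valid_identifier(candidate))
```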

Operators in Pig Latin
Arithmetic   Comparison   Null          Boolean
+            ==           IS NULL       AND
-            !=           IS NOT NULL   OR
*            <                          NOT
/            >
%            <=
             >=

Data Types in Pig Latin

Simple Data Types
Name        Description
int         Whole numbers
long        Large whole numbers
float       Decimals
double      Very precise decimals
chararray   Text strings
bytearray   Raw bytes
datetime    Datetime
boolean     true or false

Complex Data Types
Name        Description
Tuple       An ordered set of fields. Example: (2,3)
Bag         A collection of tuples. Example: {(2,3),(7,5)}
map         Key, value pair. Example: [open#Apache]
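For readers who know Python, the complex types have rough counterparts there. These are analogies only, not exact equivalents, and the variable names are made up for illustration.

```python
# Rough Python stand-ins for Pig's complex types (analogies, not exact equivalents).
a_tuple = (2, 3)              # tuple: an ordered set of fields, written (2,3) in Pig
a_bag = [(2, 3), (7, 5)]      # bag: a collection of tuples, written {(2,3),(7,5)} in Pig
a_map = {"open": "Apache"}    # map: key/value pairs, written [open#Apache] in Pig

# Pig's m#'key' lookup corresponds to a dictionary lookup.
print(a_map["open"])
```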

Running Pig
Pig can run in two ways:
1. Interactive Mode.
2. Batch Mode (by storing the script in a .pig file).

Execution Modes of Pig
You can execute Pig in two modes:
1. Local Mode: pig -x local filename
2. MapReduce Mode: pig filename

Copy the input file into HDFS:

    hdfs dfs -mkdir /user/cloudera/pigdemo
    hdfs dfs -put /home/cloudera/downloads/students.txt /user/cloudera/pigdemo

Filter
Find the tuples of those students whose GPA is greater than 8.0.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = filter A by gpa > 8.0;
    DUMP B;

FOREACH
Display the names of all students in uppercase.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = foreach A generate UPPER(name);
    DUMP B;

Group
Group the tuples of students based on their GPA.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = GROUP A BY gpa;
    DUMP B;
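The shape of a grouped relation is easy to miss: each output tuple holds the grouping key followed by a bag of all input tuples that share that key. A minimal Python sketch of that shape, using made-up sample rows (Pig itself makes no promise about the order of groups; sorting here is only for readable output):

```python
from collections import defaultdict

# Made-up sample rows: (rollno, name, gpa).
students = [(1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.0)]

# Collect the tuples that share a gpa value into one bag per key.
groups = defaultdict(list)
for row in students:
    groups[row[2]].append(row)

# Each output tuple is (group key, bag of matching input tuples).
B = [(gpa, bag) for gpa, bag in sorted(groups.items())]
for key, bag in B:
    print(key, bag)
```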

Distinct
Remove duplicate tuples of students.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = foreach A generate UPPER(name);
    C = DISTINCT B;
    DUMP C;

Join
Join two relations, student and department, based on the values contained in the rollno column.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = load '/user/cloudera/pigdemo/departments.txt' as (rollno:int, deptno:int, deptname:chararray);
    C = JOIN A BY rollno, B BY rollno;
    DUMP C;
    DUMP B;
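A JOIN without further keywords is an inner join: only rollno values present in both relations produce output, and each output tuple carries the fields of both inputs. The Python sketch below illustrates that behaviour on made-up rows; it is a single-process illustration, not how Pig executes the join.

```python
# Made-up sample rows.
students = [(1001, "John", 3.0), (1002, "Jack", 4.0), (1003, "Smith", 4.5)]
departments = [(1001, 101, "B.E."), (1002, 102, "B.Tech"), (1004, 103, "M.Tech")]

# Index the second relation by its join key (rollno is field 0).
by_rollno = {}
for dept in departments:
    by_rollno.setdefault(dept[0], []).append(dept)

# Inner join: concatenate each student tuple with every matching department tuple.
C = [s + d for s in students for d in by_rollno.get(s[0], [])]
for row in C:
    print(row)
```

Rollno 1003 has no department row and 1004 has no student row, so neither appears in the result.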

Split
Partition a relation based on the GPAs of the students: a tuple with GPA equal to 8.0 goes into relation X, and a tuple with GPA less than 8.0 goes into relation Y.

    A = load '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    SPLIT A INTO X IF gpa==8.0, Y IF gpa<8.0;
    DUMP X;
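One consequence of these two conditions is that a tuple with a GPA above 8.0 matches neither and so lands in neither X nor Y. A small Python sketch with made-up rows makes this visible (the name unassigned is introduced here only for illustration):

```python
# Made-up sample rows: (rollno, name, gpa).
students = [(1001, "John", 8.0), (1002, "Jack", 7.5), (1003, "Smith", 9.0)]

# SPLIT evaluates every condition for every tuple.
X = [row for row in students if row[2] == 8.0]
Y = [row for row in students if row[2] < 8.0]

# Tuples that satisfy no condition are simply not placed in any output relation.
unassigned = [row for row in students if row not in X and row not in Y]

print("X:", X)
print("Y:", Y)
print("unassigned:", unassigned)
```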

Avg
Calculate the average marks for each student.

    A = load '/user/cloudera/pigdemo/students.csv' USING PigStorage(',') as (studname:chararray, marks:int);
    B = GROUP A BY studname;
    -- 'group' holds the grouping key (studname); A.studname would return a bag of repeated names
    C = FOREACH B GENERATE group, AVG(A.marks);
    DUMP C;

Max
Calculate the maximum marks for each student.

    A = load '/user/cloudera/pigdemo/students.csv' USING PigStorage(',') as (studname:chararray, marks:int);
    B = GROUP A BY studname;
    -- 'group' holds the grouping key (studname); A.studname would return a bag of repeated names
    C = FOREACH B GENERATE group, MAX(A.marks);
    DUMP C;
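Both the Avg and Max examples follow the same pattern: group by student name, then apply an aggregate to the bag of marks in each group. A Python sketch of that pattern, on made-up rows standing in for students.csv:

```python
from collections import defaultdict

# Made-up sample rows: (studname, marks).
marks_rows = [("John", 80), ("Jack", 70), ("John", 90), ("Jack", 60)]

# GROUP ... BY studname: one bag of marks per student.
grouped = defaultdict(list)
for studname, marks in marks_rows:
    grouped[studname].append(marks)

# AVG and MAX are then computed over each bag.
averages = {name: sum(bag) / len(bag) for name, bag in grouped.items()}
maximums = {name: max(bag) for name, bag in grouped.items()}

print(averages)
print(maximums)
```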

Map
MAP represents a key/value pair. The following example depicts the complex data type map.

Input (studentsmap.txt):
    John    [city#bangalore]
    Jack    [city#pune]
    James   [city#chennai]

    A = load '/user/cloudera/pigdemo/studentsmap.txt' USING PigStorage() as (studname:chararray, m:map[chararray]);
    B = foreach A generate m#'city' as CityName:chararray;
    DUMP B;

Word Count using Pig

    lines = load '/user/cloudera/pigdemo/word_count_text.txt' AS (line:chararray);
    words = foreach lines generate flatten(TOKENIZE(line)) as word;
    grouped = group words by word;
    wordcount = foreach grouped generate COUNT(words), group;
    dump wordcount;
    store wordcount into '/user/cloudera/pigdemo/pig_wordcount';
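The same word-count data flow can be traced in a few lines of Python, which may help in reading the script: tokenize and flatten each line into words, group identical words, then count each group. This sketch uses made-up input lines and splits on whitespace only, which is a simplification of what TOKENIZE does.

```python
from collections import Counter

# Made-up input lines standing in for word_count_text.txt.
lines = ["hello pig", "hello hadoop"]

# TOKENIZE + FLATTEN: every word of every line becomes its own row.
words = [word for line in lines for word in line.split()]

# GROUP BY word + COUNT: number of occurrences of each distinct word.
wordcount = Counter(words)

for word, count in wordcount.items():
    print(count, word)
```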

User defined function

    package myudfs;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UPPER extends EvalFunc<String> {
        // Called once per input tuple; returns the first field in uppercase.
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
                return null;
            try {
                String str = (String) input.get(0);
                return str.toUpperCase();
            } catch (Exception e) {
                throw new IOException("Caught exception processing input row ", e);
            }
        }
    }

UDF
To use the UPPER function:

    REGISTER /home/cloudera/desktop/myudfs.jar;
    A = LOAD '/user/cloudera/pigdemo/students.txt' as (rollno:int, name:chararray, gpa:float);
    B = FOREACH A GENERATE myudfs.UPPER(name);
    DUMP B;

When to use Pig?
Pig can be used in the following situations:
1. When data loads are time sensitive. Loading large volumes of data can become a problem as the volume of data increases: the more data there is, the longer it takes to load. Instead of buying a faster server, data professionals can use Pig to build simple, easily understood scripts to process and analyze massive quantities of data in a massively parallel environment.
2. When processing various data sources. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users' IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.
3. When analytical insights are required through sampling. Pig allows sampling with a random distribution of data. This can reduce the amount of data that needs to be analyzed and still deliver meaningful results.

When NOT to use Pig?
Pig should not be used in the following situations:
1. When data is completely unstructured, such as video, text, and audio.
2. When there is a time constraint, because Pig is slower than MapReduce jobs (e.g., for a cross join).

Pig Vs. Hive
Features            Pig                           Hive
Used By             Programmers and Researchers   Analyst
Used For            Programming                   Reporting
Language            Procedural data flow          SQL-like language
Suitable For        Semi-Structured               Structured
Schema / Types      Explicit                      Implicit
UDF Support         YES                           YES
Join / Order / Sort YES                           YES
DFS Direct Access   YES (Explicit)                YES (Implicit)
Web Interface       No                            Yes (HWI)
Partitions          No                            Yes
Shell               YES                           YES

Answer a few quick questions

Fill in the blanks
1. Pig is a ______ language.
2. In Pig, ______ is used to specify data flow.
3. Pig provides an ______ to execute data flow.
4. ______, ______ are execution modes of Pig.
5. Pig is used in ______ process.

References

Further Readings
http://pig.apache.org/docs/r0.12.0/index.html
http://www.edureka.co/blog/introduction-to-pig/
