Hadoop Scripting with Jaql & Pig

Similar documents

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

ITG Software Engineering

Big Data and Scripting Systems build on top of Hadoop

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Hadoop Hands-On Exercises

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Big Data: Pig Latin. P.J. McBrien. Imperial College London. P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

American International Journal of Research in Science, Technology, Engineering & Mathematics

On a Hadoop-based Analytics Service System

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

COURSE CONTENT Big Data and Hadoop Training

Click Stream Data Analysis Using Hadoop

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

An Insight on Big Data Analytics Using Pig Script

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Complete Java Classes Hadoop Syllabus Contact No:

How To Scale Out Of A Nosql Database

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Workshop on Hadoop with Big Data

Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig

Scalable Network Measurement Analysis with Hadoop. Taghrid Samak and Daniel Gunter Advanced Computing for Sciences, LBNL

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Hadoop IST 734 SS CHUNG

Introduction to Apache Pig Indexing and Search

Open Source Technologies on Microsoft Azure

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Apache Sentry. Prasad Mujumdar

Comparing SQL and NOSQL databases

Map Reduce & Hadoop Recommended Text:

Implement Hadoop jobs to extract business value from large and varied data sets

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Hadoop Big Data for Processing Data and Performing Workload

Peers Techno log ies Pv t. L td. HADOOP

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Sentimental Analysis using Hadoop Phase 2: Week 2

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Moving From Hadoop to Spark

Big Data and Scripting Systems build on top of Hadoop

A Brief Introduction to Apache Tez

An Open Source NoSQL solution for Internet Access Logs Analysis

Manifest for Big Data Pig, Hive & Jaql

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Using distributed technologies to analyze Big Data

Data processing goes big

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

An Approach to Implement Map Reduce with NoSQL Databases

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Other Map-Reduce (ish) Frameworks. William Cohen

Big Data Training - Hackveda

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Xiaoming Gao Hui Li Thilina Gunarathne

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source Google-style large scale data analysis with Hadoop

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

Hadoop: The Definitive Guide

Scaling Up HBase, Hive, Pegasus

ASAM ODS, Peak ODS Server and openmdm as a company-wide information hub for test and simulation data. Peak Solution GmbH, Nuremberg

Big Data Course Highlights

Secure Your Hadoop Cluster With Apache Sentry (Incubating) Xuefu Zhang Software Engineer, Cloudera April 07, 2014

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

Introduction to Apache Hive

HDFS. Hadoop Distributed File System

Hadoop: The Definitive Guide

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

A Performance Analysis of Distributed Indexing using Terrier

BIG DATA What it is and how to use?

Forensic Clusters: Advanced Processing with Open Source Software. Jon Stewart Geoff Black

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

Internals of Hadoop Application Framework and Distributed File System

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Comparing Scalable NOSQL Databases

Large Scale Text Analysis Using the Map/Reduce

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data for the JVM developer. Costin Leau,

INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS

The Hadoop Eco System Shanghai Data Science Meetup

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Transcription:

Hadoop Scripting with Jaql & Pig Konstantin Haase und Johan Uhle 1

Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 2

Introduction Goal: Compare two high level scripting languages for Hadoop Contestants: Pig Jaql Hive 3

Markov Chain "Random" Text Generator Scan Text for Succeeding Word Tuple Apache Hadoop is a free Java software framework that Apache Hadoop is -> a Hadoop is a -> free is a free -> Java... Generate a new Text 4

Markov Chain Generated Senseless Text Can anyone think of myself as a third sex. Yes, I am expected to have. People often get used to me knowing these things and then a cover is placed over all of them. Along the side of the $$ are spent by (or at least for ) the girls. You can't settle the issue. It seems I've forgotten what it is, but I don't. I know about violence against women, and I really doubt they will ever join together into a large number of jokes. It showed Adam, just after being created. He has a modem and an autodial routine. He calls my number 1440 times a day. So I will conclude by saying that I can well understand that she might soon have the time, it makes sense, again, to get the gist of my argument, I was in that (though it's a Republican administration). ( by Mark V Shaney on http://en.wikipedia.org/wiki/mark_v_shaney) 5

Markov Chain How to implement? Read Transform SQL Database Ruby Generator Source Wikipedia 6

Markov Chain 7

Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 8

Jaql Based on JavaScript Object Notation Superset of JSON, subset of YAML and JavaScript { hash: ["with", "an", "array", 42] } Inspired by Unix Pipes read -> transform -> count -> write; Auto-Optimise Plan 9

Jaql Different Input/Output Sources: hdfs, hbase, local, http Cluster Usage depends on Input Source Interactive Shell or prepacked JAR Not turing complete 10

Jaql Apache License 2.0 Developed mainly by IBM Lack of documentation Mailing list is dead Latest Release missed lots of features Used SVN head 11

Jaql registerfunction("strsplit", "com.acme.extensions.expr.splititerexpr"); $markoventry = fn($words, $pos) ( $words -> slice($pos - 3, $pos) -> transform serialize($) -> strjoin(", ")); $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 12

Jaql [ ] "Apache Hadoop is a free Java software framework that...",... read(file("input.dat")) -> $markov() -> write(file("output.dat")); 13

Jaql $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 14

Jaql "Apache Hadoop is a free Java software framework that supports data intensive distributed applications..." $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 15

Jaql $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 16

Jaql ["Apache", "Hadoop", "is", "a", "free", "Java",...] $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 17

Jaql [3, 4, 5,...] $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 18

Jaql $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 19

Jaql $markoventry = fn($words, $pos) ( $words -> slice($pos - 3, $pos) -> transform serialize($) -> strjoin(", ")); $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 20

Jaql ["Apache", "Hadoop", "is", "a",...] $markoventry = fn($words, $pos) ( $words -> slice($pos - 3, $pos) -> transform serialize($) -> strjoin(", ")); $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 21

Jaql ["\"Apache\"", "\"Hadoop\"", "\"is\"", "\"a\""] $markoventry = fn($words, $pos) ( $words -> slice($pos - 3, $pos) -> transform serialize($) -> strjoin(", ")); $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 22

Jaql "\"Apache\", \"Hadoop\", \"is\", \"a\"" $markoventry = fn($words, $pos) ( $words -> slice($pos - 3, $pos) -> transform serialize($) -> strjoin(", ")); $markovline = fn($line) ( $words = $line -> strsplit(" "), range(3, count($words) - 1) -> transform $markoventry($words, $)); $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 23

Jaql $w = "\"Apache\", \"Hadoop\", \"is\", \"a\"" $ = ["\"Apache\", \"Hadoop\", \"is\", \"a\""] $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 24

Jaql "INSERT INTO jaqltest.splitsuc VALUES (\"Apache\", \"Hadoop\", \"is\", \"a\", 1);" $markov = fn($lines) ( $lines -> expand each $line $markovline($line) -> group by $w = ($) into "INSERT INTO jaqltest.splitsuc VALUES (" + $w + ", " + serialize(count($)) + ");"); read(file("input.dat")) -> $markov() -> write(file("output.dat")); 25

Jaql "INSERT INTO jaqltest.splitsuc VALUES (\"Apache\", \"Hadoop\", \"is\", \"a\", 1);" read(file("input.dat")) -> $markov() -> write(file("output.dat")); 26

Pig Hadoop Subproject in Apache Incubator since 2007 Mainly developed by Yahoo Active Mailinglist and average documentation High-level language for data transformation Apache License 27

Pig Key Aspects: Ease of Programming Optimisation Opportunities Extensibility Similar to SQL but data flow programming language Similar to the DBMS query planner Auto Optimisation Plan is also visible Step by Step set of operations on an input relation, in which each step is a single transformation. 28

Pig Text: chararray Numeric: int, long, float, double Tuple: ( 1, 'element2', 300L ) sequence of fields of any type Bag: { (1),(1, 'element2') } unordered collection of tuples 29

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray); tuples = FOREACH wikipedia GENERATE FLATTEN(splitsuc.SPLITSUC()); grouped = GROUP tuples by (keytuple, successortuple); grouped_counted = FOREACH grouped GENERATE group, COUNT(tuples); STORE final INTO 'wikipedia.sql' USING splitsuc.storesql(); 30

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray);... The capital of Algeria is Algiers... wikipedia: row1, row2, row3,... 31

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray); tuples = FOREACH wikipedia GENERATE FLATTEN(splitsuc.SPLITSUC()); (The,capital,of) (Algeria) (capital,of,algeria) (is) (of,algeria,is) (Algiers.) tuples: ((word1,word2,word3),(successor)) ( keytuple, successortuple) 32

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray); tuples = FOREACH wikipedia GENERATE FLATTEN(splitsuc.SPLITSUC()); grouped = GROUP tuples by (keytuple, successortuple); ((of,algeria,is),(algiers.)) {((of,algeria,is),(algiers.)), ((of,algeria,is),(algiers.))} grouped: {(keytuple, successortuple), { (keytuple, successortuple),... } } { group, { group,... } } 33

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray); tuples = FOREACH wikipedia GENERATE FLATTEN(splitsuc.SPLITSUC()); grouped = GROUP tuples by (keytuple, successortuple); grouped_counted = FOREACH grouped GENERATE group, COUNT(tuples); ((of,algeria,is),(algiers.)) 2 grouped_counted: (keytuple, successortuple), Count 34

Pig register splitsuc.jar; wikipedia = LOAD 'corpera' USING PigStorage() AS (row:chararray); tuples = FOREACH wikipedia GENERATE FLATTEN(splitsuc.SPLITSUC()); grouped = GROUP tuples by (keytuple, successortuple); grouped_counted = FOREACH grouped GENERATE group, COUNT(tuples); STORE final INTO 'wikipedia.sql' USING splitsuc.storesql(); INSERT INTO table VALUES... ('of', 'Algeria', 'is', 'Algiers.', '2'),... Show Output to File and Output to MySQL 35

Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 36

Test Scenario Both Need Hadoop 0.18 Will Run On Cluster Week After Exams Building Markov Chain Index Of Four with JAQL and PIG Benchmarking Minimising error by multiple tests per setup Input: Wikipedia Corpus Size 1 GB, 10 GB, 20 GB Generating Random Text with a Ruby Script from DB 37

Conclusion Rapid Development compared to native Java Hadoop Less thinking in Map/Reduce tasks Both offer tight Java integration Pig Jaql declarative evolving ecosystem functional / procedural small ecosystem 38

Conclusion Both still immature but promising Bad Documentation Bad Debugging Lack of Built-In Functionality Lack of Ecosystem No Community Inhouse Development We are looking forward to cluster tests 39

Sources http://en.wikipedia.org/wiki/markov_chain http://www.jaql.org/ http://code.google.com/p/jaql/ http://hadoop.apache.org/pig/ "Hadoop: The Definitive Guide" by Tom White O Reilly Media, Inc. 2009 40

End Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 41