Large-scale Information Processing, Summer 2014
Introduction to Apache Pig: Indexing and Search
Emmanouil Tzouridis, Knowledge Mining & Assessment
Includes slides from Ulf Brefeld: LSIP 2013
Organizational
1st tutorial: Tuesday the 29th, on Pig and indexing
Every person marks which exercises he/she solved
For every exercise, one person is picked to present his/her solution
At least 50% of the exercises marked to qualify for the exam

Name     Last name     Ex 1.1   Ex 1.2   Ex 1.3
Max      Mustermann    X        X
Daniel   Winkler       X        X
What is Pig?
Apache Pig is a high-level platform for big data analysis:
A compiler that generates MapReduce programs
A high-level language: PigLatin
Why Pig?
Writing MapReduce jobs can be painful:
Difficult to build abstractions
Verbose
Joins are difficult
Linking many MapReduce jobs is cumbersome
Pig aims to solve these problems
Why Pig?
High level
Supports many relational features: join, group by, user-defined functions
Makes chaining multiple MapReduce jobs easy
Why Pig? Motivation by example
Assume two data sources: user data and website data
We need: the top 10 visited URLs for people between 25 and 45
Pipeline: load users, load sites, filter by age, join by user id,
group by url, count clicks, sort by clicks, take the top 10
Example in MapReduce
(Figure not reproduced: the equivalent hand-written Java MapReduce program; cf. the "10 lines of Pig vs. 200 lines of Java" comparison in the conclusion.)
Example in PigLatin
Users = load 'users' as (name, age);
filtered_users = filter Users by age >= 25 and age <= 45;
Pages = load 'pages' as (user, url);
Joined = join filtered_users by name, Pages by user;
Grouped = group Joined by url;
Counted = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Counted by clicks desc;
Top10 = limit Sorted 10;
store Top10 into 'topten';
Usage
Run modes: local mode, MapReduce mode
Run ways: interactive (grunt shell), script, embedded in another program
(example invocations below)
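For example (the script name is made up for illustration):
pig -x local                  # grunt shell in local mode
pig                           # grunt shell in MapReduce mode (the default)
pig -x local wordcount.pig    # run a script in local mode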
Running
Pig executes PigLatin statements in two steps:
1) Validation of the syntax/semantics of the statements
grunt> Employees = load 'employees' as (name, address);
grunt> EmployeesF = filter Employees by address == 'Berlin';
2) If a DUMP or STORE is reached, execute the statements
grunt> dump EmployeesF;
(Thomas, Berlin)
(Maria, Berlin)
(Jan, Berlin)
...
Data types
Simple data types: int, long, float, double, chararray, bytearray
Complex data types:
Tuple: ordered, fixed-length set of values, accessed by index
Bag: unordered collection of tuples, all of the same type
Map: chararray keys mapped to values of any type
Example
Schema:
employees = load 'department1' as (
    firstname:chararray,
    lastname:chararray,
    salary:float,
    subordinates:bag{t:(firstname:chararray, lastname:chararray)},
    deductions:map[float],
    address:tuple(street:chararray, city:chararray, state:chararray, zip:int));
File format:
Patrik Peters 50000.0 {(Jan,Roberts),(Fritz,Karls)} [Federal Taxes#0.2,State Taxes#0.05,Insurance#0.1] (Zeughausstrasse 30,Darmstadt,Hessen,64289)
Fields separated by '\t'
Tuples: (field1, field2, ...)
Bags: {(tuple1), (tuple2), ...}
Maps: [key1#value1, key2#value2, ...]
I/O operations
Load:  X = load '/data/customers.tsv' as (id:int, name:chararray, age:int);
Store: store X into '/data/customers.tsv';
Dump:  dump X;
Relational operations
Foreach (projection):
Y = foreach X generate $1, $3;
I = foreach employees generate lastname, deductions#'Insurance';
C = foreach employees generate lastname, address.city;
Filter:
Y = filter X by age > 30;
Y = filter X by name matches 'Ja.*';
Relational operations
Group:
Y = group X by age;
count_by_age = foreach Y generate group, COUNT(X);
Order:
O = order X by age;
Join:
Z = join X by id, Y by id;
Other operations
Flatten: removes nesting
A = foreach employees generate lastname, flatten(address);
Sample:
S = sample employees 0.10;
Describe: displays the schema of a relation
Explain: displays execution plans
Word count example
A = load './input.txt' as (sentence:chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
Word count example
input.txt:
the cat and the dog
the dog eats
he eats bananas
i have a dog

dump A;
(the cat and the dog)
(the dog eats)
(he eats bananas)
(i have a dog)
Word count example
dump B;
(the) (cat) (and) (the) (dog)
(the) (dog) (eats)
(he) (eats) (bananas)
(i) (have) (a) (dog)
Word count example
dump C;
(the, {(the), (the), (the)})
(cat, {(cat)})
(and, {(and)})
(dog, {(dog), (dog), (dog)})
(eats, {(eats), (eats)})
(he, {(he)})
(bananas, {(bananas)})
(i, {(i)})
(have, {(have)})
(a, {(a)})
Word count example
dump D;
(the, 3)
(cat, 1)
(and, 1)
(dog, 3)
(eats, 2)
(he, 1)
(bananas, 1)
(i, 1)
(have, 1)
(a, 1)
Built-in functions
Math: MIN, MAX, SUM, ...
String manipulation: CONCAT, REPLACE, ...
Others: e.g. SIZE, IsEmpty, TOTUPLE, TOBAG, ...
(a small usage example follows)
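For example (a sketch; the relation and its fields are made up for illustration):
grades = load 'grades' as (student:chararray, points:int);
all_grades = group grades all;
stats = foreach all_grades generate MAX(grades.points), SUM(grades.points);
shout = foreach grades generate CONCAT(student, '!');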
User Defined Functions
Support for UDFs, for things that cannot be done in pure PigLatin:
custom load, column transformation, filtering, aggregation
Can be written in Java, Python or JavaScript
PiggyBank: a collection of Java UDFs (from users, for users)
UDF example in Java
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Calling the UDF
register myudfs.jar;
U = foreach employee generate myudfs.UPPER(name);
dump U;
UDF example in Python
@outputSchema("word:chararray")
def UPPER(word):
    return word.upper()

Registering and calling:
register 'myudfs.py' using jython as myudfs;
U = foreach employee generate myudfs.UPPER(name);
dump U;
Is Pig fast?
PigMix: a set of queries to test Pig's efficiency
On average 1.1x the runtime of a raw MapReduce program
https://cwiki.apache.org/pig/pigmix.html
Pig: Conclusion
Pig opens the MapReduce system to more people (non-Java experts)
Provides common (relational) operations
Increases productivity: 10 lines of Pig vs. 200 lines of Java
Only slightly slower than a native Java implementation
Searching
How can we search for a specific keyword/query over many documents?
Searching
Search for a specific keyword/query over many documents:
Do it sequentially? Not efficient!
Build data structures for indexing/searching:
Inverted index
Tries
Inverted indexes
Map tokens to documents
Extra information can be considered: HTML tags, typesetting, etc.

Term   Occurrences
T1     D1, D3
T2     D2, D9
...    ...
Inverted index for documents
(figure not reproduced)
Inverted index for text position
(figure not reproduced)
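Since the two figures are not preserved, here is a minimal Java sketch of the underlying mapping (all names are illustrative): a positional index maps each token to the positions where it occurs; replacing positions by document ids gives the document-level variant of the previous slide.

import java.util.*;

public class PositionalIndex {
    // token -> positions at which the token occurs in the text
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addToken(String token, int position) {
        postings.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
    }

    public List<Integer> lookup(String token) {
        return postings.getOrDefault(token, Collections.emptyList());
    }
}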
Empirical memory analysis
Text size = n
Vocabulary: roughly n^0.4 to n^0.6 (Heaps' law; the exponent depends on the text)
Storing occurrences: roughly 0.3n to 0.4n (omitting stop words)
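A rough worked example (Heaps' law is V = K * n^beta; the constant K is omitted here, as on the slide): for a text of size n = 10^8 and beta = 0.5, the vocabulary grows only like n^0.5 = 10^4, while the occurrence lists still need on the order of 0.3 * 10^8 to 0.4 * 10^8 entries.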
Naive construction
Straightforward computation:
1) Loop over all documents
2) Loop over all tokens
3) insert(token, position)
Problem: insertions need an efficient data structure
Workaround: use an intermediate data structure and derive the index at the end
Tries
Tree-based structure for building indexes
Inner nodes indicate potential splits for unseen tokens
Leaves are labeled with token and position
Edges are labeled with characters
Tries: construction example
(Figure sequence not reproduced: a trie is built step by step over the text "curiosity kills the cat ...", inserting one token at a time:
curiosity:1, kills:11, the:17, cat:21, unfortunate:26, for:38.
Repeated tokens extend the position list at an existing leaf (the:17,42 and cat:21,46);
inserting "caramba" splits the edge shared with "cat" at the common prefix "ca".)
Constructing a Trie
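The slide's construction pseudocode is not preserved; below is a minimal Java sketch of insertion and lookup (a plain character trie storing a position list per node, rather than the edge-compressed trie drawn in the figures; all names are illustrative):

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    List<Integer> positions = new ArrayList<>(); // non-empty iff a token ends here
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Insert one occurrence of a token at the given text position.
    void insert(String token, int position) {
        TrieNode node = root;
        for (char c : token.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.positions.add(position); // repeated tokens extend the position list
    }

    // Return all recorded positions of a token (empty if unseen).
    List<Integer> lookup(String token) {
        TrieNode node = root;
        for (char c : token.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        return node.positions;
    }
}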
Inverted indexes using tries
1) Loop over all documents
2) Loop over all tokens
3) insert(token, position) into the trie
4) Extract the index from the trie
If out of memory:
Save the trie and load one subtree
Only consider tokens of that subtree
Save, then loop again over all documents using the next subtree
Inverted index in Hadoop
Similar to the word count example:
Mapper: compute (token, occurrence) pairs
Reducer: sort/merge the output of the mapper
In the tutorial, Pig can be used for simplicity (a sketch follows)
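A minimal Pig sketch of this idea (the file name, layout and field names are assumptions for illustration):
docs = load 'docs.tsv' as (docid:chararray, text:chararray);
tokens = foreach docs generate docid, flatten(TOKENIZE(text)) as token;
grouped = group tokens by token;
index = foreach grouped generate group as token, tokens.docid as postings;
store index into 'inverted_index';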
More about indexing
Pig resources
Programming Pig: http://ofps.oreilly.com/titles/9781449302641/
Cloudera's introduction: http://vimeo.com/29733324
IBM tutorial: http://www.ibm.com/developerworks/linux/library/l-apachepigdataquery/
Thanks for your attention!