Introduction to Apache Pig Indexing and Search



1 Large-scale Information Processing, Summer 2014
    Introduction to Apache Pig
    Indexing and Search
    Emmanouil Tzouridis, Knowledge Mining & Assessment
    Includes slides from Ulf Brefeld: LSIP 2013

2 Organizational
    1st Tutorial: Tuesday 29th, Pig and Indexing
    Every person marks with a cross the exercises he/she did
    For every exercise, one person is picked and he/she solves it
    At least 50% crosses to qualify for the exam
    Name    Last name    Ex 1.1  Ex 1.2  Ex 1.3
    Max     Muestermann  X  X
    Daniel  Winkler      X  X


3 What is Pig?
    Apache Pig is a high-level platform for big data analysis:
        Compiler: generates MapReduce programs
        High-level language: PigLatin

4 Why Pig?
    Writing MapReduce jobs can be painful:
        Difficult to make abstractions
        Verbose
        Joins are difficult
        Linking many MapReduce jobs can be difficult
    Pig aims to solve these problems

5 Why Pig?
    High level
    Supports many relational features:
        Join, Group by, User defined functions
    Multiple MapReduce jobs are easy

6 Why Pig? Motivation by example
    Assume two data sources:
        User data
        Website data
    We need: the top 10 visited urls for people between 18 and 45
    Data flow:
        Load users, Load sites
        Filter by age
        Join by user id
        Group by url
        Count clicks
        Sort by clicks
        Take top 10

7 Example in MapReduce

8 Example in PigLatin
    -- Load users
    Users = load 'users' as (name, age);
    -- Filter by age
    filtered_users = filter Users by age > 18 and age < 45;
    -- Load sites
    Pages = load 'pages' as (user, url);
    -- Join by user id
    Joined = join filtered_users by name, Pages by user;
    -- Group by url
    Grouped = group Joined by url;
    -- Count clicks
    Counted = foreach Grouped generate group, COUNT(Joined) as clicks;
    -- Sort by clicks
    Sorted = order Counted by clicks desc;
    -- Take top 10
    Top10 = limit Sorted 10;
    store Top10 into 'topten';
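To make the data flow of this script concrete, here is a rough plain-Python sketch of the same steps (filter, join, group, count, sort, limit). It is not how Pig executes the script. The sample records and the function name `top_urls` are made up for illustration, and the sketch assumes user names are unique.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("alice", 25), ("bob", 17), ("carol", 40), ("dave", 60)]
pages = [("alice", "a.com"), ("carol", "a.com"), ("alice", "b.com"),
         ("bob", "b.com"), ("dave", "c.com"), ("carol", "a.com")]

def top_urls(users, pages, low=18, high=45, k=10):
    """Return the k most-clicked urls among users with low < age < high."""
    # Filter by age (same bounds as the filter statement in the script).
    selected = {name for name, age in users if low < age < high}
    # Join by user name, then group by url and count clicks in one pass.
    clicks = Counter(url for user, url in pages if user in selected)
    # Sort by clicks (descending) and take the top k.
    return clicks.most_common(k)

print(top_urls(users, pages))
```

With the sample data above, only alice and carol pass the age filter, so only their page visits are counted.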


9 Usage
    Run modes:
        Local Mode
        MapReduce Mode
    Ways to run:
        Interactive (grunt shell)
        Script
        Embedded in another program

10 Running
    Pig executes PigLatin statements in two steps:
    1) Validation of the syntax/semantics of the statements
        grunt> Employees = load 'employees' as (name, address);
        grunt> EmployeesF = filter Employees by address == 'Berlin';
    2) Execution of the statements, only once a 'DUMP' or 'STORE' is reached
        grunt> dump EmployeesF;
        (Thomas, Berlin)
        (Maria, Berlin)
        (Jan, Berlin)
        ...


11 Data types
    Simple data types:
        Int, long
        Float, double
        Chararray
        Bytearray
    Complex data types:
        Tuple
            Ordered, fixed length of values
            Accessed by index
        Bag
            Unordered collection of tuples
            All of the same type
        Map
            Chararray as key and any type for value


12 Example
    Schema:
        employees = load 'department1' as (
            firstname:chararray,
            lastname:chararray,
            salary:float,
            subordinates:bag{t:(firstname:chararray, lastname:chararray)},
            deductions:map[float],
            address:tuple(street:chararray, city:chararray, state:chararray, zip:int));
    File format:
        Patrik Peters {(Jan, Roberts),(Fritz, Karls)} [Federal Taxes#0.2,State Taxes#0.05,Insurance#0.1] (Zeughausstrasse 30, Darmstadt, Hessen,64289)
        Fields separated by '\t'
        Tuples: (field1, field2,...)
        Bags: {(tuple1), (tuple2),...}
        Maps: [key1#value1, key2#value2,...]


13 I/O operations
    Load
        X = load '/data/customers.tsv' as (id:int, name:chararray, age:int);
    Store
        store X into '/data/customers.tsv';
    Dump
        dump X;

14 Relational operations
    Foreach: projection operation
        Y = foreach X generate $1, $3;
        I = foreach employees generate lastname, deductions#'Insurance';
        C = foreach employees generate lastname, address.city;
    Filter
        Y = filter X by age > 30;
        Y = filter X by name matches 'Ja.*';


15 Relational operations
    Group
        Y = group X by age;
        count_by_age = foreach Y generate group, COUNT(X);
    Order
        O = order X by age;
    Join
        Z = join X by id, Y by id;

16 Other operations
    Flatten: removes nesting
        A = foreach employees generate lastname, flatten(address);
    Sample
        S = sample employees 0.10;
    Describe: displays the schema of a relation
    Explain: displays execution plans

17 Word count example
    A = load './input.txt' as (sentence:chararray);
    B = foreach A generate flatten(TOKENIZE(sentence)) as word;
    C = group B by word;
    D = foreach C generate group, COUNT(B);

18 Word Count Example
    input.txt:
        the cat and the dog
        the dog eats
        he eats bananas
        i have a dog
    Dump A;
        (the cat and the dog)
        (the dog eats)
        (he eats bananas)
        (i have a dog)

19 Word Count Example
    Dump B;
        (the) (cat) (and) (the) (dog)
        (the) (dog) (eats)
        (he) (eats) (bananas)
        (i) (have) (a) (dog)

20 Word Count Example
    Dump C;
        (the, {(the), (the), (the)})
        (cat, {(cat)})
        (and, {(and)})
        (dog, {(dog), (dog), (dog)})
        (eats, {(eats), (eats)})
        (he, {(he)})
        (bananas, {(bananas)})
        (i, {(i)})
        (have, {(have)})
        (a, {(a)})

21 Word Count Example
    Dump D;
        (the, 3)
        (cat, 1)
        (and, 1)
        (dog, 3)
        (eats, 2)
        (he, 1)
        (bananas, 1)
        (i, 1)
        (have, 1)
        (a, 1)
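The four relations of the word count example can be imitated in plain Python to see what each step produces. This is only an illustrative sketch. Splitting on whitespace is a simplifying assumption that matches this input, while Pig's TOKENIZE also splits on some other delimiters.

```python
from collections import defaultdict

lines = ["the cat and the dog", "the dog eats", "he eats bananas", "i have a dog"]

# A: one single-field tuple per input line.
A = [(sentence,) for sentence in lines]

# B: tokenize each sentence and flatten the result into one tuple per word.
B = [(word,) for (sentence,) in A for word in sentence.split()]

# C: group B by word, giving (group key, bag of matching tuples).
groups = defaultdict(list)
for (word,) in B:
    groups[word].append((word,))
C = list(groups.items())

# D: for each group, emit the key and the size of its bag.
D = [(word, len(bag)) for word, bag in C]

for row in D:
    print(row)
```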

22 Built-in functions
    Math: MIN, MAX, SUM,...
    String manipulation: CONCAT, REPLACE,...
    Others: e.g. SIZE, IsEmpty, TOTUPLE, TOBAG,...

23 User Defined Functions
    Support for UDFs
        For things that cannot be done in pure PigLatin
        Custom load, column transformation, filtering, aggregation
        Can be written in Java, Python or JavaScript
    PiggyBank
        Java UDFs (from users to users)

24 UDF example in Java
    package myudfs;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.WrappedIOException;

    // Converts its first argument to upper case.
    public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // Return null for missing input instead of failing.
            if (input == null || input.size() == 0)
                return null;
            try {
                String str = (String) input.get(0);
                return str.toUpperCase();
            } catch (Exception e) {
                throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
        }
    }

25 Calling UDF
    register myudfs.jar;
    U = foreach employee generate myudfs.UPPER(name);
    dump U;

26 UDF example in Python
    def UPPER(word):
        return word.upper()
    Registering:
        register 'myudfs.py' using jython as myudfs;
        U = foreach employee generate myudfs.UPPER(name);
        dump U;

27 Is Pig fast?
    PigMix: a set of queries to test Pig's efficiency
    On average 1.1x the time of a MapReduce program

28 Pig Conclusion
    Pig opens the MapReduce system to more people (non-Java experts)
    Pig provides common (relational) operations
    Increases productivity: 10 lines of Pig vs. 200 lines of Java
    Only slightly slower than a Java implementation

29 Searching Search for a specific keyword/query over many documents?

30 Searching
    Search for a specific keyword/query over many documents
        Do it sequentially: not efficient!!!
        Build data structures for indexing
    Data structures for searching:
        Inverted index
        Tries

31 Inverted indexes
    Map tokens to documents
    Extra information can be considered: HTML tags, typesetting, etc.
    Terms    Occurrences
    T1       D1, D3
    T2       D2, D
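A minimal Python sketch of a document-level inverted index, mapping each token to the documents that contain it. The function name `build_inverted_index`, the sample documents and the whitespace tokenization are assumptions for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every token to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical example documents.
docs = {
    "D1": "curiosity kills the cat",
    "D2": "the dog eats",
    "D3": "the cat eats",
}
index = build_inverted_index(docs)
print(index["cat"])
```

A keyword query then becomes a dictionary lookup instead of a scan over all documents.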

32 Inverted index for documents

33 Inverted index for text position

34 Empirical memory analysis
    Text size = n
    Vocabulary: grows sublinearly with n, as O(n^β) with 0 < β < 1
        Heaps' law (depends on the text)
    Storing occurrences: 0.3n to 0.4n
        Omitting stop words

35 Naive Construction
    Straightforward computation:
        1) Loop over all documents
        2) Loop over all tokens
        3) Insert(token, position)
    Problem:
        Insertions need an efficient data structure
    Workaround:
        Use an intermediate data structure and derive the index in the end
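The three steps above can be sketched in Python with a hash map as the intermediate data structure. The function name `naive_index` and the sample sentence are assumptions. The sentence was chosen so that the 1-based character positions agree with the positions shown on the trie slides (curiosity:1, kills:11, the:17, cat:21, ...).

```python
import re
from collections import defaultdict

def naive_index(docs):
    """Build a positional index: token -> list of (doc_id, 1-based char position)."""
    occurrences = defaultdict(list)  # intermediate structure with cheap insertions
    for doc_id, text in docs.items():                    # 1) loop over all documents
        for match in re.finditer(r"\w+", text.lower()):  # 2) loop over all tokens
            # 3) Insert(token, position)
            occurrences[match.group()].append((doc_id, match.start() + 1))
    # Derive the final index at the end, sorted by token.
    return dict(sorted(occurrences.items()))

# Hypothetical single-document collection.
docs = {"D1": "curiosity kills the cat. unfortunate for the cat"}
index = naive_index(docs)
print(index["the"])
```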

36 Tries
    Tree-based structure for building indexes
    Inner nodes indicate potential splits for unseen tokens
    Leaves are labeled with token and position
    Edges are labeled with characters

37-46 Tries (step-by-step construction; figures not reproduced)
    Starting from an empty root, tokens are inserted one at a time:
        curiosity:1
        kills:11
        the:17
        cat:21 (the 'c' branch splits into 'a' for cat and 'u' for curiosity)
        Unfortunate:26
        for:38
        the:42 (leaf becomes the:17,42)
        cat:46 (leaf becomes cat:21,46)
        caramba (the 'ca' branch splits into 'r' for caramba and 't' for cat)

47 Constructing a Trie

48 Inverted indexes using Tries
    1) Loop over all documents
    2) Loop over all tokens
    3) Insert(token, position) into Trie
    4) Extract index from Trie
    If out of memory:
        Save Trie and load subtree
        Only consider tokens of subtree
        Save and loop again over all documents using next subtree
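Steps 1 to 4 can be sketched in Python as follows. Note that this is a simplified variant, not the trie drawn on the slides: it uses one node per character and stores positions at the node where a token ends, whereas the slides split a branch only when needed and keep whole tokens in the leaves. The names `TrieNode`, `insert` and `extract_index`, and the sample sentence, are illustrative assumptions. The out-of-memory handling is not covered.

```python
import re

class TrieNode:
    def __init__(self):
        self.children = {}   # edge character -> child TrieNode
        self.positions = []  # positions of the token that ends at this node

def insert(root, token, position):
    """Walk or create the path for token and record one occurrence."""
    node = root
    for ch in token:
        node = node.children.setdefault(ch, TrieNode())
    node.positions.append(position)

def extract_index(node, prefix=""):
    """Yield (token, positions) pairs in lexicographic order of the tokens."""
    if node.positions:
        yield prefix, node.positions
    for ch in sorted(node.children):
        yield from extract_index(node.children[ch], prefix + ch)

# Hypothetical text; positions are 1-based character offsets.
text = "curiosity kills the cat. unfortunate for the cat"
root = TrieNode()
for match in re.finditer(r"\w+", text):
    insert(root, match.group(), match.start() + 1)

index = list(extract_index(root))
for token, positions in index:
    print(token, positions)
```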

49 Inverted Index in Hadoop
    Similar to the WordCount example
    Mapper: compute (token, occurrence)
    Reducer: sort/merge output of Mapper
    Tutorial: Pig can be used for simplicity
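The mapper/reducer split can be simulated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop code. The names `mapper`, `reducer` and `run_job` are hypothetical, and an occurrence is assumed to be a (document id, 0-based word position) pair.

```python
from collections import defaultdict

def mapper(doc_id, text):
    """Emit one (token, occurrence) pair per word of the document."""
    for position, token in enumerate(text.lower().split()):
        yield token, (doc_id, position)

def reducer(token, occurrences):
    """Sort/merge all occurrences collected for one token."""
    return token, sorted(occurrences)

def run_job(docs):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    shuffled = defaultdict(list)
    for doc_id, text in docs.items():
        for token, occurrence in mapper(doc_id, text):
            shuffled[token].append(occurrence)
    return dict(reducer(token, occs) for token, occs in sorted(shuffled.items()))

# Hypothetical input documents.
docs = {"D1": "the cat and the dog", "D2": "the dog eats"}
index = run_job(docs)
print(index["the"])
```

The structure mirrors WordCount: the only difference is that the reducer keeps the list of occurrences instead of counting them.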

50 More about indexing


51 Pig resources
    Programming Pig
    Cloudera's introduction
    IBM tutorial

52 Thanks for your attention!
