Large-scale Information Processing, Summer 2014
Introduction to Apache Pig: Indexing and Search
Emmanouil Tzouridis, Knowledge Mining & Assessment
Includes slides from Ulf Brefeld: LSIP 2013
Organizational
1st tutorial: Tuesday the 29th, on Pig and indexing
Every person marks which exercises he/she solved
For every exercise, one person is picked to present his/her solution
At least 50% of the exercises marked to qualify for the exam

Name     Last name     Ex 1.1   Ex 1.2   Ex 1.3
Max      Mustermann    X        X
Daniel   Winkler       X        X
What is Pig?
Apache Pig is a high-level platform for big data analysis:
A compiler that generates MapReduce programs
A high-level language: PigLatin
Why Pig?
Writing MapReduce jobs can be painful:
Difficult to build abstractions
Verbose
Joins are difficult
Linking many MapReduce jobs is cumbersome
Pig aims to solve these problems
Why Pig?
High level
Supports many relational features: join, group by, user-defined functions
Makes chaining multiple MapReduce jobs easy
Why Pig? Motivation by example
Assume two data sources: user data and website data
We need: the top 10 visited URLs for people between 25 and 45
Pipeline: load users, load sites, filter by age, join by user id,
group by url, count clicks, sort by clicks, take the top 10
Example in MapReduce
(Figure not reproduced: the equivalent hand-written Java MapReduce program; cf. the "10 lines of Pig vs. 200 lines of Java" comparison in the conclusion.)
Example in PigLatin
Users = load 'users' as (name, age);
filtered_users = filter Users by age >= 25 and age <= 45;
Pages = load 'pages' as (user, url);
Joined = join filtered_users by name, Pages by user;
Grouped = group Joined by url;
Counted = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Counted by clicks desc;
Top10 = limit Sorted 10;
store Top10 into 'topten';
Usage
Run modes: local mode, MapReduce mode
Run ways: interactive (grunt shell), script, embedded in another program
(example invocations below)
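For example (the script name is made up for illustration):
pig -x local                  # grunt shell in local mode
pig                           # grunt shell in MapReduce mode (the default)
pig -x local wordcount.pig    # run a script in local mode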
Running
Pig executes PigLatin statements in two steps:
1) Validation of the syntax/semantics of the statements
grunt> Employees = load 'employees' as (name, address);
grunt> EmployeesF = filter Employees by address == 'Berlin';
2) If a DUMP or STORE is reached, execute the statements
grunt> dump EmployeesF;
(Thomas, Berlin)
(Maria, Berlin)
(Jan, Berlin)
...
Data types
Simple data types: int, long, float, double, chararray, bytearray
Complex data types:
Tuple: ordered, fixed-length set of values, accessed by index
Bag: unordered collection of tuples, all of the same type
Map: chararray keys mapped to values of any type
Example
Schema:
employees = load 'department1' as (
    firstname:chararray,
    lastname:chararray,
    salary:float,
    subordinates:bag{t:(firstname:chararray, lastname:chararray)},
    deductions:map[float],
    address:tuple(street:chararray, city:chararray, state:chararray, zip:int));
File format:
Patrik Peters 50000.0 {(Jan,Roberts),(Fritz,Karls)} [Federal Taxes#0.2,State Taxes#0.05,Insurance#0.1] (Zeughausstrasse 30,Darmstadt,Hessen,64289)
Fields separated by '\t'
Tuples: (field1, field2, ...)
Bags: {(tuple1), (tuple2), ...}
Maps: [key1#value1, key2#value2, ...]
I/O operations
Load:  X = load '/data/customers.tsv' as (id:int, name:chararray, age:int);
Store: store X into '/data/customers.tsv';
Dump:  dump X;
Relational operations
Foreach (projection):
Y = foreach X generate $1, $3;
I = foreach employees generate lastname, deductions#'Insurance';
C = foreach employees generate lastname, address.city;
Filter:
Y = filter X by age > 30;
Y = filter X by name matches 'Ja.*';
Relational operations
Group:
Y = group X by age;
count_by_age = foreach Y generate group, COUNT(X);
Order:
O = order X by age;
Join:
Z = join X by id, Y by id;
Other operations
Flatten: removes nesting
A = foreach employees generate lastname, flatten(address);
Sample:
S = sample employees 0.10;
Describe: displays the schema of a relation
Explain: displays execution plans
Word count example
A = load './input.txt' as (sentence:chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
Word count example
input.txt:
the cat and the dog
the dog eats
he eats bananas
i have a dog

dump A;
(the cat and the dog)
(the dog eats)
(he eats bananas)
(i have a dog)
Word count example
dump B;
(the) (cat) (and) (the) (dog)
(the) (dog) (eats)
(he) (eats) (bananas)
(i) (have) (a) (dog)
Word count example
dump C;
(the, {(the), (the), (the)})
(cat, {(cat)})
(and, {(and)})
(dog, {(dog), (dog), (dog)})
(eats, {(eats), (eats)})
(he, {(he)})
(bananas, {(bananas)})
(i, {(i)})
(have, {(have)})
(a, {(a)})
Word count example
dump D;
(the, 3)
(cat, 1)
(and, 1)
(dog, 3)
(eats, 2)
(he, 1)
(bananas, 1)
(i, 1)
(have, 1)
(a, 1)
Built-in functions
Math: MIN, MAX, SUM, ...
String manipulation: CONCAT, REPLACE, ...
Others: e.g. SIZE, IsEmpty, TOTUPLE, TOBAG, ...
(a small usage example follows)
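For example (a sketch; the relation and its fields are made up for illustration):
grades = load 'grades' as (student:chararray, points:int);
all_grades = group grades all;
stats = foreach all_grades generate MAX(grades.points), SUM(grades.points);
shout = foreach grades generate CONCAT(student, '!');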
User Defined Functions
Support for UDFs, for things that cannot be done in pure PigLatin:
custom load, column transformation, filtering, aggregation
Can be written in Java, Python or JavaScript
PiggyBank: a collection of Java UDFs (from users, for users)
UDF example in Java
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Calling the UDF
register myudfs.jar;
U = foreach employee generate myudfs.UPPER(name);
dump U;
UDF example in Python
@outputSchema("word:chararray")
def UPPER(word):
    return word.upper()

Registering and calling:
register 'myudfs.py' using jython as myudfs;
U = foreach employee generate myudfs.UPPER(name);
dump U;
Is Pig fast?
PigMix: a set of queries to test Pig's efficiency
On average 1.1x the runtime of a raw MapReduce program
https://cwiki.apache.org/pig/pigmix.html
Pig: Conclusion
Pig opens the MapReduce system to more people (non-Java experts)
Provides common (relational) operations
Increases productivity: 10 lines of Pig vs. 200 lines of Java
Only slightly slower than a native Java implementation
Searching
How can we search for a specific keyword/query over many documents?
Searching
Search for a specific keyword/query over many documents:
Do it sequentially? Not efficient!
Build data structures for indexing/searching:
Inverted index
Tries
Inverted indexes
Map tokens to documents
Extra information can be considered: HTML tags, typesetting, etc.

Term   Occurrences
T1     D1, D3
T2     D2, D9
...    ...
Inverted index for documents
(figure not reproduced)
Inverted index for text position
(figure not reproduced)
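Since the two figures are not preserved, here is a minimal Java sketch of the underlying mapping (all names are illustrative): a positional index maps each token to the positions where it occurs; replacing positions by document ids gives the document-level variant of the previous slide.

import java.util.*;

public class PositionalIndex {
    // token -> positions at which the token occurs in the text
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addToken(String token, int position) {
        postings.computeIfAbsent(token, t -> new ArrayList<>()).add(position);
    }

    public List<Integer> lookup(String token) {
        return postings.getOrDefault(token, Collections.emptyList());
    }
}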
Empirical memory analysis
Text size = n
Vocabulary: roughly n^0.4 to n^0.6 (Heaps' law; the exponent depends on the text)
Storing occurrences: roughly 0.3n to 0.4n (omitting stop words)
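A rough worked example (Heaps' law is V = K * n^beta; the constant K is omitted here, as on the slide): for a text of size n = 10^8 and beta = 0.5, the vocabulary grows only like n^0.5 = 10^4, while the occurrence lists still need on the order of 0.3 * 10^8 to 0.4 * 10^8 entries.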
Naive construction
Straightforward computation:
1) Loop over all documents
2) Loop over all tokens
3) insert(token, position)
Problem: insertions need an efficient data structure
Workaround: use an intermediate data structure and derive the index at the end
Tries
Tree-based structure for building indexes
Inner nodes indicate potential splits for unseen tokens
Leaves are labeled with token and position
Edges are labeled with characters
Tries: construction example
(Figure sequence not reproduced: a trie is built step by step over the text "curiosity kills the cat ...", inserting one token at a time:
curiosity:1, kills:11, the:17, cat:21, unfortunate:26, for:38.
Repeated tokens extend the position list at an existing leaf (the:17,42 and cat:21,46);
inserting "caramba" splits the edge shared with "cat" at the common prefix "ca".)
Constructing a Trie
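The slide's construction pseudocode is not preserved; below is a minimal Java sketch of insertion and lookup (a plain character trie storing a position list per node, rather than the edge-compressed trie drawn in the figures; all names are illustrative):

import java.util.*;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    List<Integer> positions = new ArrayList<>(); // non-empty iff a token ends here
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Insert one occurrence of a token at the given text position.
    void insert(String token, int position) {
        TrieNode node = root;
        for (char c : token.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.positions.add(position); // repeated tokens extend the position list
    }

    // Return all recorded positions of a token (empty if unseen).
    List<Integer> lookup(String token) {
        TrieNode node = root;
        for (char c : token.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        return node.positions;
    }
}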
Inverted indexes using tries
1) Loop over all documents
2) Loop over all tokens
3) insert(token, position) into the trie
4) Extract the index from the trie
If out of memory:
Save the trie and load one subtree
Only consider tokens of that subtree
Save, then loop again over all documents using the next subtree
Inverted index in Hadoop
Similar to the word count example:
Mapper: compute (token, occurrence) pairs
Reducer: sort/merge the output of the mapper
In the tutorial, Pig can be used for simplicity (a sketch follows)
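A minimal Pig sketch of this idea (the file name, layout and field names are assumptions for illustration):
docs = load 'docs.tsv' as (docid:chararray, text:chararray);
tokens = foreach docs generate docid, flatten(TOKENIZE(text)) as token;
grouped = group tokens by token;
index = foreach grouped generate group as token, tokens.docid as postings;
store index into 'inverted_index';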
More about indexing
Pig resources
Programming Pig: http://ofps.oreilly.com/titles/9781449302641/
Cloudera's introduction: http://vimeo.com/29733324
IBM tutorial: http://www.ibm.com/developerworks/linux/library/l-apachepigdataquery/
Thanks for your attention!