Big Data Technology
Map-Reduce Motivation: Indexing in Search Engines
Edward Bortnikov & Ronny Lempel
Yahoo Labs, Haifa
23 March 2014, course 236620 Big Data Technology

Indexing in Search Engines
- Information Retrieval has two main stages:
  - Indexing process: pre-processing a collection of documents and storing a representation of it in an index
  - Retrieval/runtime process: upon a user issuing a query, accessing the index to find documents relevant to the query
- To process queries, search engines need quick access to all documents containing a set of search terms
  - Billions of documents, sparsely containing millions of terms
- The Inverted Index: a mapping from terms to the documents containing them, with additional location-specific details
Inverted Index Example
- Doc #1: "The good, the bad and the ugly" (tokens 1-7)
- Doc #2: "As good as it gets, and more" (tokens 1-7)
- Doc #3: "Is it ugly and bad? It is, and more!" (tokens 1-9)

Term | DF | Postings (doc; offsets)
The  |  1 | (1; 1,3,6)
Good |  2 | (1; 2) (2; 2)
Bad  |  2 | (1; 4) (3; 5)
And  |  3 | (1; 5) (2; 6) (3; 4,8)
Ugly |  2 | (1; 7) (3; 3)
As   |  1 | (2; 1,3)
It   |  2 | (2; 4) (3; 2,6)
Gets |  1 | (2; 5)
More |  2 | (2; 7) (3; 9)
Is   |  1 | (3; 1,7)

Occurrences are sorted by increasing doc id and location.

Inverted Index Structure
- An inverted index consists of two elements:
  - The lexicon (AKA dictionary)
  - The inverted file (AKA postings file)
- The inverted file is a set of postings lists, one list per term; each list consists of posting elements
  - The list of term t holds the locations (documents + offsets) where t appears
  - Encoded in compressed form; many variations and degrees of freedom
- The lexicon is the set of all indexing units (terms) in a given collection
  - The entry of a term typically holds its frequency and points to the corresponding postings list
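To make the structure concrete, here is a minimal in-memory sketch (not from the slides) that builds a lexicon and postings lists from the three example documents, with punctuation stripped and terms lowercased; the function and variable names are illustrative only:

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: term -> list of (doc_id, [offsets]).
    docs is a list of token lists; doc ids and offsets are 1-based,
    matching the slide's example."""
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs, start=1):
        offsets = defaultdict(list)
        for pos, tok in enumerate(tokens, start=1):
            offsets[tok.lower()].append(pos)
        for term, locs in offsets.items():
            postings[term].append((doc_id, locs))
    # Lexicon: term -> document frequency (length of its postings list)
    lexicon = {term: len(plist) for term, plist in postings.items()}
    return lexicon, dict(postings)

docs = [
    "The good the bad and the ugly".split(),
    "As good as it gets and more".split(),
    "Is it ugly and bad It is and more".split(),
]
lexicon, postings = build_index(docs)
print(lexicon["and"])   # 3 -- "and" occurs in all three documents
print(postings["the"])  # [(1, [1, 3, 6])]
```

A real index would store the postings in compressed, disk-resident form; this sketch only mirrors the logical content of the table above.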
(Traditional) Indexing: Assumptions
- Computational assumptions:
  - Sequential scans of RAM are faster than accessing random main-memory addresses
  - RAM access is much faster than disk I/O
  - Sequential I/O is much faster than random-access I/O
  - I/O reads/writes data from/to disk in block-size units
- Scale assumptions:
  - Both the input (token stream) and the output (inverted file) are too large to fit in main memory (RAM)
- Mission: efficiently transform a stream of tokens into an inverted index

Indexing - First Approximation
The tokenizer turns the example documents into a stream of (term, doc, offset) triples, in document order:
(The,1,1) (Good,1,2) (The,1,3) (Bad,1,4) (And,1,5) (The,1,6) (Ugly,1,7)
(As,2,1) (Good,2,2) (As,2,3) (It,2,4) (Gets,2,5) (And,2,6) (More,2,7)
(Is,3,1) (It,3,2) (Ugly,3,3) ...
A stable lexicographic sort by term then yields:
(And,1,5) (And,2,6) (And,3,4) (And,3,8) (As,2,1) (As,2,3) (Bad,1,4) (Bad,3,5) (Gets,2,5) (Good,1,2) (Good,2,2) (Is,3,1) (Is,3,7) (It,2,4) (It,3,2) (It,3,6) (More,2,7) ...
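The tokenize-then-sort step can be sketched in a few lines of Python (an illustration, not the slides' implementation); note that a stable sort keyed on the term alone preserves the original (doc, offset) order among equal terms, which is exactly the "stable lexicographic sort" the slide relies on:

```python
# Two of the example documents, keyed by doc id (assumed names/values).
docs = {1: "The good the bad and the ugly",
        2: "As good as it gets and more"}

# Tokenizer output: (term, doc, offset) triples in document order.
stream = [(tok.lower(), doc, off)
          for doc, text in sorted(docs.items())
          for off, tok in enumerate(text.split(), start=1)]

# Python's sort is stable, so keying on the term keeps ties ordered
# by (doc, offset) -- the stream's original order.
sorted_stream = sorted(stream, key=lambda t: t[0])
print(sorted_stream[:4])
# [('and', 1, 5), ('and', 2, 6), ('as', 2, 1), ('as', 2, 3)]
```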
Indexing - First Approximation (cont.)
Grouping the sorted triples by term produces the lexicon and the inverted file:

Term | DF | Postings
And  |  3 | (1;5)-(2;6)-(3;4,8)
As   |  1 | (2;1,3)
Bad  |  2 | (1;4)-(3;5)
Gets |  1 | (2;5)
Good |  2 | (1;2)-(2;2)
Is   |  1 | (3;1,7)
It   |  2 | (2;4)-(3;2,6)
More |  2 | (2;7)-(3;9)
The  |  1 | (1;1,3,6)
Ugly |  2 | (1;7)-(3;3)

- Note that the lexicon is created in parallel to the inverted file
- At query time, lookup in the lexicon is logarithmic in its size

First Problem: Cannot Work in RAM
- Due to the scale of the data: the token stream does not fit in RAM
- Solution: work in runs
  - Allocate a RAM buffer, fill it with as many tokens as possible
  - Sort the buffer, write it to disk
- Once all runs (say, k) have been written to disk, perform a k-way merge to build the inverted file and lexicon
  - Merge key: term, document, offset
  - The merge reads full blocks from each run into a RAM buffer
- Each of the k runs is sorted by increasing term, and within a term by increasing documents/offsets
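The k-way merge plus group-by can be sketched with the standard library's lazy heap merge; the three tiny in-memory "runs" below stand in for sorted on-disk runs (their contents are illustrative):

```python
import heapq
from itertools import groupby

# Hypothetical runs, each already sorted by (term, doc_id, offset),
# standing in for the on-disk runs produced by sorting each RAM buffer.
run1 = [("and", 1, 5), ("good", 1, 2), ("the", 1, 1)]
run2 = [("and", 2, 6), ("as", 2, 1), ("good", 2, 2)]
run3 = [("and", 3, 4), ("and", 3, 8), ("bad", 3, 5)]

inverted_file = {}
merged = heapq.merge(run1, run2, run3)  # lazy k-way merge by (term, doc, offset)
for term, group in groupby(merged, key=lambda t: t[0]):
    # Collapse this term's (term, doc, offset) triples into a postings list
    inverted_file[term] = [(doc, off) for _, doc, off in group]

print(inverted_file["and"])  # [(1, 5), (2, 6), (3, 4), (3, 8)]
```

`heapq.merge` streams from each run rather than loading them whole, which is the same reason the real merge only needs one block per run in RAM.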
Second Problem: Handling Variable-Length Sort Keys
- The sorting of each run, and the k-way merge of all runs, require handling keys of the form (word, docnum, offset)
  - The first component is variable-length, and requires costly string comparisons
- Solution: work with term identifiers
  - E.g., hash each string into an integer (complexity is linear in the length of the string)
  - Keys become fixed-length; sorting can then be done in linear time via radix sort
- Example, with 1490 = h("the"), 5792 = h("good"), 2614 = h("bad"), 5837 = h("and"):
  (The,1,1) (Good,1,2) (The,1,3) (Bad,1,4) (And,1,5) (The,1,6)
  becomes
  (1490,1,1) (5792,1,2) (1490,1,3) (2614,1,4) (5837,1,5) (1490,1,6)
- What about hash collisions?

Issues with Hashing Terms
- Any hash function from strings to integers introduces a probability of collisions
  - What will this cause at query time?
- The probability of a collision can be decreased by increasing the number of bits in the hash function
  - Birthday paradox: need roughly twice as many bits as needed for simply counting the distinct terms, i.e. 2*log2(vocabulary size)
  - However, widening the hash function slows down sorting/merging
- Solution: assign terms consecutive (ordinal) numbers, through maintenance of a lexicon in real time
  - Previously unseen terms are added to the lexicon with the next available ordinal number and an initial count of 1
  - The lexicon must be maintained globally, i.e. the same lexicon is used throughout all runs (why?)
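A minimal sketch of the ordinal-id solution (class and method names are illustrative): term ids are collision-free, fixed-width, and assigned in first-seen order, and the same lexicon object must be shared across all runs so each term maps to the same id everywhere:

```python
class Lexicon:
    """Assign consecutive ordinal ids to terms as they are first seen,
    while tracking per-term occurrence counts."""
    def __init__(self):
        self.term_to_id = {}
        self.counts = {}

    def lookup(self, term):
        if term not in self.term_to_id:
            # Previously unseen term: next available ordinal number
            self.term_to_id[term] = len(self.term_to_id)
            self.counts[term] = 0
        self.counts[term] += 1
        return self.term_to_id[term]

lex = Lexicon()
ids = [lex.lookup(t) for t in ["the", "good", "the", "bad"]]
print(ids)                # [0, 1, 0, 2] -- "the" keeps its id on reuse
print(lex.counts["the"])  # 2
```

This answers the slide's "why?": if each run kept its own lexicon, the same term could receive different ids in different runs, and the k-way merge by term id would interleave unrelated postings.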
Third Problem: Scale of Data
- Order of magnitude: more than 10^10 documents of average length over 10^4 bytes - petabyte scale (10^15 bytes)
  - Requires lots of storage, lots of I/O bandwidth, and lots of sorting
- The index cannot be built on a single machine
- Solution: the computation must be distributed
  - However, writing distributed business logic is difficult!

Segmented Inverted Indices
- So far we assumed the index cannot be built on a single machine; in reality, it also cannot be stored on a single machine
- Consequently, indexes of large-scale search engines are distributed across multiple machines
  - This mainly addresses data scale; usage scale (query throughput) is mostly addressed by replication
- Two basic architectures:
  - Local index organization: index partitioned by documents; each machine inverts a disjoint set of documents
  - Global index organization: index partitioned by terms; each machine holds the postings lists of a disjoint set of terms
- Query processing becomes a distributed task, where the choice of partitioning scheme affects the query processing algorithm
Segmented Inverted Indices - Example
Four documents: Doc 1 = {A, B, C}, Doc 2 = {A, B, D}, Doc 3 = {A, C, D}, Doc 4 = {B, C, D}

Global index organization (partitioned by terms):
  Segment 1:  A: 1,2,3   B: 1,2,4
  Segment 2:  C: 1,3,4   D: 2,3,4

Local index organization (partitioned by documents):
  Segment 1:  A: 1,2   B: 1,2   C: 1     D: 2
  Segment 2:  A: 3     B: 4     C: 3,4   D: 3,4
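The two organizations can be sketched as two routing rules over the same documents (the range-based doc routing and the explicit term-to-segment map are assumptions for illustration; real systems choose routing differently):

```python
def partition_local(doc_terms, num_segments):
    """Local organization: each segment indexes a disjoint set of documents,
    routed here by range partitioning on the doc id (a toy rule)."""
    n = len(doc_terms)
    segments = [{} for _ in range(num_segments)]
    for doc_id in sorted(doc_terms):
        seg = segments[(doc_id - 1) * num_segments // n]
        for t in doc_terms[doc_id]:
            seg.setdefault(t, []).append(doc_id)
    return segments

def partition_global(doc_terms, term_to_segment, num_segments):
    """Global organization: each segment holds the full postings lists of a
    disjoint set of terms, given here by an explicit term -> segment map."""
    segments = [{} for _ in range(num_segments)]
    for doc_id in sorted(doc_terms):
        for t in doc_terms[doc_id]:
            segments[term_to_segment[t]].setdefault(t, []).append(doc_id)
    return segments

# The slide's four documents:
docs = {1: ["A", "B", "C"], 2: ["A", "B", "D"],
        3: ["A", "C", "D"], 4: ["B", "C", "D"]}

glob = partition_global(docs, {"A": 0, "B": 0, "C": 1, "D": 1}, 2)
local = partition_local(docs, 2)
print(glob[0])   # {'A': [1, 2, 3], 'B': [1, 2, 4]}
print(local[0])  # {'A': [1, 2], 'B': [1, 2], 'C': [1], 'D': [2]}
```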
Local Index - Runtime
- A Query Integrator (QI) receives the user's top-n query and sends all m partitions (S_1 .. S_m) a top-k query (the same query sent by the user)
- Each partition returns its top-k results
- The QI merges the k*m results and returns the top-n to the user
- Query latency depends on the latency of the slowest partition
  - Partition latency depends on the number of its documents that match the query, and on the overall size of its index

Abstracting the Distributed Indexing Problem
- We care about our business logic:
  1. We want to process lots of data, specifically tuples [token streams]
  2. We want to group them by some key [token]
  3. We want to sort within each group [by doc-id and position]
  4. We want to process each group somehow [encode posting list and output]
- We want to utilize many machines in parallel, without having to worry about:
  - Data partitioning
  - Inter-machine communication
  - RAM limitations, e.g. dealing with out-of-core sorting
  - Fault tolerance of machines, disks, network, ...
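The query integrator's merge step for the local-index runtime can be sketched as follows (the scores and doc names are hypothetical; real engines merge richer result records):

```python
import heapq

def merge_top_n(partition_results, n):
    """QI merge for a local-index deployment: each of the m partitions
    returns its own top-k results as (score, doc_id) pairs; the integrator
    merges the k*m candidates and keeps the global top-n by score."""
    candidates = (r for results in partition_results for r in results)
    return heapq.nlargest(n, candidates)

# Hypothetical top-3 lists from m = 2 partitions
s1 = [(0.9, "d1"), (0.5, "d7"), (0.4, "d2")]
s2 = [(0.8, "d9"), (0.7, "d4"), (0.1, "d6")]
print(merge_top_n([s1, s2], 3))  # [(0.9, 'd1'), (0.8, 'd9'), (0.7, 'd4')]
```

The QI can only return its answer once the slowest partition has responded, which is why per-partition latency dominates overall query latency.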
Scalable Indexing - Logical Steps
1. Start from a (virtually) huge token stream
2. Data partitioning: split it into multiple token streams
3. Processing - define groups: each stream yields sets of {t, d, o} (term, document, offset) triples, e.g. {t1,d,o}*, {t2,d,o}*, ...
4. Group-by, and sort within each group
5. Processing - encode the groups
6. Output

Map-Reduce
- An overloaded term: refers to (1) a programming paradigm and (2) a realizing system for distributed computation
- The combination of the system and paradigm was first introduced by Google in a paper in 2004
  - Actually built and utilized a few years before
- Hadoop, the open-source implementation of Map-Reduce, was initiated in 2005
- Today, Hadoop is used by dozens of Big Data companies; Google and Microsoft are known to use their own proprietary platforms
  - Distributed indexing was a main use-case for both
Map-Reduce - Programming Paradigm
- A flow for processing key-value pairs that consists of two computational functions (per round):
  1. Mapper: transforms input key-value pairs to output key-value pairs, thereby defining how to next group the data (by keys)
  2. Reducer: consumes streams of (potentially sorted) values associated with the same key and performs some computation/aggregation on them; ultimately emits output of the key-value form as well

Map-Reduce - the System
- Runs on large clusters of shared-nothing machines
- Operates over a Distributed File System (DFS)
- Spawns multiple mapper tasks to run in parallel on multiple machines over multiple partitions of the input
- Shuffles the outputs of the mappers around, grouping by output keys and further sorting if required
- Spawns multiple reducer tasks to run in parallel on multiple machines, and routes one or more groups to each reducer
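The paradigm can be simulated on one machine in a few lines, which makes the mapper/shuffle/reducer contract concrete (the driver function and the word-count example are illustrative sketches, not the system itself):

```python
from itertools import groupby

def run_map_reduce(inputs, mapper, reducer):
    """Single-round, single-machine simulation: map every input pair,
    shuffle (group by key, sorting values), then reduce each group."""
    intermediate = [kv for pair in inputs for kv in mapper(*pair)]
    intermediate.sort()                  # stands in for the shuffle & sort phase
    output = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reducer(key, [v for _, v in group]))
    return output

# Classic word count: the mapper emits (word, 1); the reducer sums counts.
def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

docs = [(1, "the good the bad"), (2, "the ugly")]
print(run_map_reduce(docs, wc_map, wc_reduce))
# [('bad', 1), ('good', 1), ('the', 3), ('ugly', 1)]
```

The real system differs in that mappers and reducers run on many machines and the shuffle moves data over the network, but the two user-supplied functions have exactly this shape.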
Map-Reduce - the System (cont.)
The system takes care of:
- Physical data partitioning and replication (DFS)
- Distributed processing of mappers and reducers
- Inter-machine communication for shuffling & sorting data by keys
- Task management that overcomes hardware and software failures
(more on the system's internals in the next lecture)
Distributed Indexing, First Step Input: [URL, {token array}]* Goal: create index partitions by routing documents uniformly at random (good for load balancing) Number documents densely [1..N] per partition Mapper: [URL, {token array}] [hash(url), URL, token, offset]* Group key: hash(url), i.e. partition# Sort within group: URL (primary), offset (secondary) Reducer: [hash(url), URL, token, offset] [partition#, token, doc#, offset] Increment doc# whenever URL changes 23 March 2014 236620 Big Data Technology 23 Distributed Indexing, Second Step Input: [partition#, token, doc#, offset] Goal: create inverted lists per token per partition Mapper: identity, i.e. [partition#, token, doc#, offset] [partition#, token, doc#, offset]* Group key: {partition#, token} Note: many more - and much smaller groups than in first step Sort within group: doc# (primary), offset (secondary) Reducer: [partition#, token, doc#, offset] [partition#, token, encoded inverted list] 23 March 2014 236620 Big Data Technology 24 12
Distributed Indexing, Third Step
- Input: [partition#, token, encoded inverted list]*
- Goal: create an index per partition
- Mapper: identity, i.e. [partition#, token, encoded inverted list] → [partition#, token, encoded inverted list]*
  - Group key: {partition#}
  - Sort within group: N/A, or by token, depending on the implementation
- Reducer: [partition#, token, encoded inverted list] → [partition#, inverted index]
- Q: why didn't we group by partition (pushing token into the sort key) to finish the indexing task in the second step?