Big Data Technology Map-Reduce Motivation: Indexing in Search Engines




Edward Bortnikov & Ronny Lempel, Yahoo Labs, Haifa (236620 Big Data Technology, 23 March 2014)

Indexing in Search Engines

Information Retrieval's two main stages:
- Indexing process: involves pre-processing a collection of documents and storing a representation of it in an index
- Retrieval/runtime process: upon a user issuing a query, involves accessing the index to find documents relevant to the query

To process queries, search engines need quick access to all documents containing a set of search terms: billions of documents, sparsely containing millions of terms.

The Inverted Index: a mapping from terms to the documents containing them, with additional location-specific details.

Inverted Index Example

  Doc #1: "The good the bad and the ugly"        (tokens 1-7)
  Doc #2: "As good as it gets, and more"         (tokens 1-7)
  Doc #3: "Is it ugly and bad? It is, and more!" (tokens 1-9)

Lexicon (term, DF) and postings lists (docid; locations), with occurrences sorted by increasing docid and location:

  The   1   (1; 1,3,6)
  Good  2   (1; 2) (2; 2)
  Bad   2   (1; 4) (3; 5)
  And   3   (1; 5) (2; 6) (3; 4,8)
  Ugly  2   (1; 7) (3; 3)
  As    1   (2; 1,3)
  It    2   (2; 4) (3; 2,6)
  Gets  1   (2; 5)
  More  2   (2; 7) (3; 9)
  Is    1   (3; 1,7)

Inverted Index Structure

An Inverted Index consists of 2 elements:
- The lexicon (AKA dictionary)
- The inverted file (AKA postings file)

The inverted file is a set of postings lists, one list per term. Each list consists of posting elements: the list of term t holds the locations (documents + offsets) where t appears, encoded in compressed form. There are many variations and degrees of freedom.

The lexicon is the set of all indexing units (terms) in a given collection. The entry of a term typically holds its frequency and points to the corresponding postings list.
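To make the structure concrete, here is a minimal sketch (Python, not from the slides) that builds a lexicon and postings lists of this shape from the three example documents; tokenization is deliberately simplified and nothing is compressed.

    import re
    from collections import defaultdict

    docs = {
        1: "The good the bad and the ugly",
        2: "As good as it gets, and more",
        3: "Is it ugly and bad? It is, and more!",
    }

    # postings[term] maps docid -> list of 1-based token offsets
    postings = defaultdict(lambda: defaultdict(list))

    for docid, text in docs.items():
        for offset, token in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
            postings[token][docid].append(offset)

    # lexicon: term -> document frequency (number of documents containing the term)
    lexicon = {term: len(doclists) for term, doclists in postings.items()}

    print(lexicon["and"])         # 3
    print(dict(postings["and"]))  # {1: [5], 2: [6], 3: [4, 8]}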

(Traditional) Indexing: Assumptions

Computational assumptions:
- Sequential scans of RAM are faster than accessing random main-memory addresses
- RAM access is much faster than disk I/O
- Sequential I/O is much faster than random-access I/O
- I/O reads/writes data from/to disk in block-size units

Scale assumptions:
- Both the input (token stream) and the output (inverted file) are too large to fit in main memory (RAM)

Mission: efficiently transform a stream of tokens into an inverted index.

Indexing, First Approximation

The tokenizer scans the documents in order and emits one (term, doc#, offset) triple per token; for the three example documents:

  The 1 1, Good 1 2, The 1 3, Bad 1 4, And 1 5, The 1 6, Ugly 1 7,
  As 2 1, Good 2 2, As 2 3, It 2 4, Gets 2 5, And 2 6, More 2 7,
  Is 3 1, It 3 2, Ugly 3 3, ...

A stable lexicographic sort by term then brings all occurrences of each term together; stability preserves the original (document, offset) order within each term:

  And 1 5, And 2 6, And 3 4, And 3 8, As 2 1, As 2 3, Bad 1 4, Bad 3 5,
  Gets 2 5, Good 1 2, Good 2 2, Is 3 1, Is 3 7, It 2 4, It 3 2, It 3 6,
  More 2 7, ...
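A minimal sketch of this first approximation (Python, illustrative only; at this point the slides still assume everything fits in RAM):

    import re

    def tokenize(docs):
        """Yield (term, doc_id, offset) triples in document order."""
        for doc_id, text in docs.items():
            for offset, term in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
                yield (term, doc_id, offset)

    docs = {
        1: "The good the bad and the ugly",
        2: "As good as it gets, and more",
        3: "Is it ugly and bad? It is, and more!",
    }

    stream = list(tokenize(docs))
    # Python's sort is stable, so sorting by term alone keeps the original
    # (doc, offset) order of each term's occurrences.
    stream.sort(key=lambda t: t[0])
    print(stream[:4])  # [('and', 1, 5), ('and', 2, 6), ('and', 3, 4), ('and', 3, 8)]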

Indexing, First Approximation (cont.)

Grouping the sorted triples by term yields the lexicon (term, DF) and the inverted file side by side:

  And   3   (1;5) (2;6) (3;4,8)
  As    1   (2;1,3)
  Bad   2   (1;4) (3;5)
  Gets  1   (2;5)
  Good  2   (1;2) (2;2)
  Is    1   (3;1,7)
  It    2   (2;4) (3;2,6)
  More  2   (2;7) (3;9)
  The   1   (1;1,3,6)
  Ugly  2   (1;7) (3;3)

Note that the lexicon is created in parallel to the inverted file. At query time, lookup in the lexicon is logarithmic in its size.

First Problem: Cannot Work in RAM

Due to the scale of the data, the token stream does not fit in RAM.

Solution: work in runs.
- Allocate a RAM buffer and fill it with as many tokens as possible
- Sort the buffer and write it to disk
- Once all runs (say, k) have been written to disk, perform a k-way merge to build the inverted file and the lexicon
  - Merge key: term, then document, then offset
  - The merge reads full blocks from each run into a RAM buffer

[Figure: k sorted runs on disk (terms increasing downward, documents/offsets increasing to the right) feeding a k-way merge that produces the lexicon and inverted file.]
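A rough sketch of the run-based approach (Python, illustrative only; the run size and on-disk format are placeholders, and a real indexer streams blocks from each run rather than loading runs whole):

    import heapq
    import pickle
    import tempfile

    def write_runs(triples, run_size=1_000_000):
        """Sort fixed-size batches of (term, doc, offset) triples and spill each to disk."""
        runs, batch = [], []
        for t in triples:
            batch.append(t)
            if len(batch) == run_size:
                runs.append(_spill(sorted(batch)))
                batch = []
        if batch:
            runs.append(_spill(sorted(batch)))
        return runs

    def _spill(sorted_batch):
        f = tempfile.TemporaryFile()
        pickle.dump(sorted_batch, f)
        f.seek(0)
        return f

    def merge_runs(runs):
        """k-way merge of the sorted runs; the merge key is (term, doc, offset)."""
        iters = [iter(pickle.load(f)) for f in runs]
        return heapq.merge(*iters)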

Second Problem: Handling Variable-Length Sort Keys

The sorting of each run, and the k-way merge of all runs, require handling keys of the form (word, docnum, offset). The first component is variable-length and requires costly string comparisons.

Solution: work with term identifiers, e.g. hash each string into an integer (complexity is linear in the length of the string). Keys become fixed-length, and sorting can be done in linear complexity through radix sort. For example:

  The  1 1        1490 1 1        1490 = h("the")
  Good 1 2        5792 1 2        5792 = h("good")
  The  1 3   =>   1490 1 3        2614 = h("bad")
  Bad  1 4        2614 1 4        5837 = h("and")
  And  1 5        5837 1 5
  The  1 6        1490 1 6

What about hash collisions?

Issues with Hashing Terms

Any hash function from strings to integers introduces a probability of collisions. What will this cause at query time?

The probability of a hash collision can be decreased by increasing the number of bits in the hash function. By the "Birthday Paradox", we need roughly twice as many bits as needed for simply counting the number of distinct terms (i.e. 2 log|vocabulary|). However, widening the hash function slows down sorting/merging.

Solution: assign terms consecutive (ordinal) numbers, through the maintenance of a lexicon in real time. Previously unseen terms get added to the lexicon with the next available ordinal number and an initial count of 1. The lexicon needs to be maintained globally, i.e. the same lexicon is used throughout all runs (why?).
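A minimal sketch (Python, illustrative) of the ordinal-numbering scheme: a single lexicon shared across all runs assigns each previously unseen term the next integer id, so equal terms always map to the same fixed-length key and collisions cannot occur.

    class GlobalLexicon:
        """Maps terms to consecutive integer ids; the same instance serves all runs."""

        def __init__(self):
            self.term_to_id = {}
            self.count = {}

        def term_id(self, term):
            tid = self.term_to_id.get(term)
            if tid is None:
                tid = len(self.term_to_id)   # next available ordinal number
                self.term_to_id[term] = tid
                self.count[tid] = 1          # previously unseen term: initial count of 1
            else:
                self.count[tid] += 1
            return tid

    lexicon = GlobalLexicon()
    key = (lexicon.term_id("the"), 1, 1)   # fixed-length key replacing ("the", 1, 1)
    print(key)                             # (0, 1, 1)
    print(lexicon.term_id("good"))         # 1
    print(lexicon.term_id("the"))          # 0 again: same term, same id, in every run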

Third Problem: Scale of Data

Order of magnitude: more than 10^10 documents of average length over 10^4 bytes, i.e. petabyte scale (10^15 bytes). This requires lots of storage, lots of I/O bandwidth, and lots of sorting. The index cannot be built on a single machine.

Solution: the computation must be distributed. However, writing distributed business logic is difficult!

Segmented Inverted Indices

So far we assumed the index cannot be built on a single machine; in reality, it also cannot be stored on a single machine. Consequently, indexes of large-scale search engines are distributed across multiple machines. This addresses mainly data scale; usage scale (query throughput) is mostly addressed by replication.

Two basic architectures:
- Local index organization: index partitioned by documents. Each machine inverts a disjoint set of documents.
- Global index organization: index partitioned by terms. Each machine holds postings lists for a disjoint set of terms.

Query processing becomes a distributed task, where the choice of the partitioning scheme affects the query processing algorithm.

Segmented Inverted Indices: Example

Four documents over terms A, B, C, D:

  Doc 1: A B C
  Doc 2: A B D
  Doc 3: A C D
  Doc 4: B C D

Global index organization (index partitioned by terms):

  Segment 1:  A: 1,2,3   B: 1,2,4
  Segment 2:  C: 1,3,4   D: 2,3,4

Local index organization (index partitioned by documents):

  Segment 1 (docs 1-2):  A: 1,2   B: 1,2   C: 1     D: 2
  Segment 2 (docs 3-4):  A: 3     B: 4     C: 3,4   D: 3,4
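A small sketch (Python, illustrative) contrasting the two organizations on the four-document example; the partition assignment here is the simplest possible (terms split alphabetically, documents split by id) and is only a stand-in for real routing policies.

    from collections import defaultdict

    docs = {1: "A B C", 2: "A B D", 3: "A C D", 4: "B C D"}

    def invert(doc_ids):
        """Build term -> sorted doc-id postings over a subset of the documents."""
        index = defaultdict(list)
        for d in sorted(doc_ids):
            for term in docs[d].split():
                index[term].append(d)
        return dict(index)

    # Local organization: each segment inverts a disjoint set of documents.
    local_segments = [invert({1, 2}), invert({3, 4})]

    # Global organization: invert everything, then split postings lists by term.
    full = invert(docs)
    global_segments = [
        {t: p for t, p in full.items() if t in ("A", "B")},
        {t: p for t, p in full.items() if t in ("C", "D")},
    ]

    print(local_segments[0])   # {'A': [1, 2], 'B': [1, 2], 'C': [1], 'D': [2]}
    print(global_segments[1])  # {'C': [1, 3, 4], 'D': [2, 3, 4]}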

Local Index: Runtime

A user issues a top-n query to a query integrator (QI):
- The QI sends all m partitions (S_1 ... S_m) a top-k query (the same query sent by the user)
- Each partition returns its top-k results
- The QI merges the k*m results and returns the top-n to the user

Query latency depends on the latency of the slowest partition. Partition latency depends on the number of its documents that match the query, and on the overall size of its index.

Abstracting the Distributed Indexing Problem

We care about our business logic:
1. We want to process lots of data, specifically tuples [token streams]
2. We want to group them by some key [token]
3. We want to sort within each group [by doc-id and position]
4. We want to process each group somehow [encode posting list and output]

We want to utilize many machines in parallel, without having to worry about:
- Data partitioning
- Inter-machine communication
- RAM limitations, e.g. dealing with out-of-core (external) sorting
- Fault tolerance of machines, disks, network, ...
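A minimal sketch (Python, illustrative; the per-partition scoring and the partition RPC are placeholders of my own, not part of the slides) of the query integrator's merge step for the local organization:

    import heapq

    def query_partition(segment, query_terms, k):
        """Placeholder top-k search in one segment: score = number of query terms a doc contains."""
        scores = {}
        for term in query_terms:
            for doc in segment.get(term, []):
                scores[doc] = scores.get(doc, 0) + 1
        return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

    def integrate(segments, query_terms, n, k=10):
        """QI: fan the query out to all m partitions, merge the k*m results, keep the best n."""
        merged = []
        for segment in segments:                 # in reality, m parallel RPCs
            merged.extend(query_partition(segment, query_terms, k))
        return heapq.nlargest(n, merged, key=lambda kv: kv[1])

    # Using the local segments from the previous sketch:
    # print(integrate(local_segments, ["A", "C"], n=2))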

Scalable Indexing: Logical Steps

[Figure: a (virtually) huge token stream undergoes data partitioning into multiple token streams; each stream is processed to define groups of {term, doc, offset} triples; the groups are grouped-by and sorted, each group is processed (encoded), and the results are output.]

Map-Reduce

Map-Reduce is an overloaded term: it refers to (1) a programming paradigm and (2) a realizing system for distributed computation. The combination of the system and paradigm was first introduced by Google in a paper in 2004, though it was actually built and utilized a few years before. Hadoop, the open-source implementation of Map-Reduce, was initiated in 2005.

Today, Hadoop is used by dozens of Big Data companies; Google and Microsoft are known to use their own proprietary platforms. Distributed indexing was a main use-case for both.

Map-Reduce Programming Paradigm

A flow for processing key-value pairs that consists of two computational functions (per round):
1. Mapper: transforms input key-value pairs into output key-value pairs, thereby defining how to next group the data (by keys)
2. Reducer: consumes streams of (potentially sorted) values associated with the same key and performs some computation/aggregation on them, ultimately emitting output in key-value form as well

Map-Reduce the System

- Runs on large clusters of shared-nothing machines
- Operates over a Distributed File System (DFS)
- Spawns multiple mapper tasks to run in parallel on multiple machines over multiple partitions of the input
- Shuffles the outputs of the mappers around, grouping by output keys and further sorting if required
- Spawns multiple reducer tasks to run in parallel on multiple machines, and routes one or more groups to each reducer
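To make the paradigm concrete, here is a tiny single-machine simulation of one map-reduce round (Python, illustrative only; the real system distributes the map, shuffle and reduce phases across machines). The mapper/reducer shown simply count term occurrences; they are not the indexing jobs described later.

    from collections import defaultdict

    def map_reduce(records, mapper, reducer):
        """One in-memory map-reduce round: map, shuffle (group by key, sort values), reduce."""
        groups = defaultdict(list)
        for key, value in records:
            for out_key, out_value in mapper(key, value):
                groups[out_key].append(out_value)      # "shuffle": group by mapper output key
        output = []
        for out_key in sorted(groups):
            output.extend(reducer(out_key, sorted(groups[out_key])))
        return output

    # Example round: count term occurrences.
    def term_mapper(doc_id, text):
        for term in text.lower().split():
            yield term, 1

    def count_reducer(term, counts):
        yield term, sum(counts)

    records = [(1, "the good the bad"), (2, "as good as it gets")]
    print(map_reduce(records, term_mapper, count_reducer))
    # [('as', 2), ('bad', 1), ('gets', 1), ('good', 2), ('it', 1), ('the', 2)]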

Map-Reduce the System (cont.)

The system takes care of:
- Physical data partitioning and replication (DFS)
- Distributed processing of mappers and reducers
- Inter-machine communication for shuffling and sorting data by keys
- Task management that overcomes hardware and software failures

(More on the system's internals in the next lecture.)

Distributed Indexing via Map-Reduce

Input: [URL, {token array}]*

We will produce a locally partitioned distributed index by three map-reduce jobs. The description is at a high level and hides many details; it is one of several possible implementations.

Distributed Indexing, First Step

Input: [URL, {token array}]*
Goal: create index partitions by routing documents uniformly at random (good for load balancing), numbering documents densely [1..N] per partition.

- Mapper: [URL, {token array}] -> [hash(URL), URL, token, offset]*
- Group key: hash(URL), i.e. partition#
- Sort within group: URL (primary), offset (secondary)
- Reducer: [hash(URL), URL, token, offset] -> [partition#, token, doc#, offset], incrementing doc# whenever the URL changes

Distributed Indexing, Second Step

Input: [partition#, token, doc#, offset]
Goal: create inverted lists per token per partition.

- Mapper: identity, i.e. [partition#, token, doc#, offset] -> [partition#, token, doc#, offset]*
- Group key: {partition#, token}. Note: many more, and much smaller, groups than in the first step
- Sort within group: doc# (primary), offset (secondary)
- Reducer: [partition#, token, doc#, offset] -> [partition#, token, encoded inverted list]
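A sketch of the first step's mapper and reducer (Python, illustrative; it reuses the toy map_reduce driver above, the number of partitions and the hash function are placeholders, and a real job would write to the DFS rather than return lists):

    NUM_PARTITIONS = 4  # placeholder

    def step1_mapper(url, tokens):
        partition = hash(url) % NUM_PARTITIONS        # route documents (pseudo-)randomly
        for offset, token in enumerate(tokens, start=1):
            # group key first; (url, offset) is the within-group sort key
            yield partition, (url, offset, token)

    def step1_reducer(partition, values):
        """Values arrive sorted by (url, offset); assign a dense doc# per partition."""
        doc_num, prev_url = 0, None
        for url, offset, token in values:
            if url != prev_url:
                doc_num += 1                          # increment doc# whenever the URL changes
                prev_url = url
            yield partition, (token, doc_num, offset)

    # Using the toy driver from the earlier sketch:
    # docs = [("http://a.example/1", ["the", "good", "the", "bad"]),
    #         ("http://b.example/2", ["as", "good", "as", "it", "gets"])]
    # print(map_reduce(docs, step1_mapper, step1_reducer))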

Distributed Indexing, Third Step

Input: [partition#, token, encoded inverted list]*
Goal: create an index per partition.

- Mapper: identity, i.e. [partition#, token, encoded inverted list] -> [partition#, token, encoded inverted list]*
- Group key: {partition#}
- Sort within group: N/A, or by token, depending on the implementation
- Reducer: [partition#, token, encoded inverted list] -> [partition#, inverted index]

Q: why didn't we group by partition# (pushing token into the sort key) to finish the indexing task in the second step?
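For completeness, a sketch of the remaining two reducers in the same style (Python, illustrative; the "encoding" here is just a plain Python structure standing in for a real compressed postings format):

    from itertools import groupby

    def step2_reducer(key, values):
        """key = (partition, token); values = (doc#, offset) pairs sorted by doc#, then offset."""
        partition, token = key
        postings = [(doc, [off for _, off in group])
                    for doc, group in groupby(values, key=lambda v: v[0])]
        yield partition, (token, postings)            # "encoded" inverted list

    def step3_reducer(partition, values):
        """values = (token, postings) pairs for one partition; assemble its index."""
        index = {token: postings for token, postings in values}
        yield partition, index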