EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling


EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze

Announcements Project proposals due tonight Sign up for project meetings

Sec. 1.2 Last time: Inverted index. Dictionary and postings lists: Brutus → 2 4 8 16 32 64 128; Calpurnia → 1 2 3 5 8 13 21 34; Caesar → 13 16.

How Does Web Search Work? Inverted indices (they enable fast IR); tokenization & linguistic issues; handling phrase queries; scalable, distributed construction & deployment; web crawlers; ranking algorithms.

Today's Outline Index compression Distributed Indexing MapReduce Distributed Searching

Sec. 1.2 Index Compression. Dictionary and postings lists: Brutus → 2 4 8 16 32 64 128; Calpurnia → 1 2 3 5 8 13 21 34; Caesar → 13 16.

Postings compression The postings file is much larger than the dictionary, by a factor of at least 10. Key goal: store each posting compactly. A posting for our purposes is a docid. Naïve: use ⌈log₂ 20,000,000,000⌉ ≈ 34 bits per docid. Our goal: use a lot less than 34 bits per docid.

Postings: two conflicting forces. Common terms (e.g. the): 34 bits/posting is too expensive; prefer a 0/1 document incidence vector in this case. If the appears in half the docs, the incidence vector gives a savings of 17x! Rare terms (e.g. arachnocentric): say the term occurs once every million docs on average; a 0/1 document incidence vector is a disaster here, and 34 bits/occurrence uses ~29,000x less space (34 bits vs. roughly one million bits per occurrence).

Postings file entry We store the list of docs containing a term in increasing order of docid. computer: 33, 47, 154, 159, 202. Consequence: it suffices to store gaps: 33, 14, 107, 5, 43. Hope: most gaps can be encoded/stored with far fewer than 34 bits.
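A minimal sketch of gap encoding for the "computer" postings list above; to_gaps and from_gaps are illustrative helper names, not part of any IR library.

```python
# Minimal sketch of gap (delta) encoding for a sorted postings list.

def to_gaps(docids):
    """Convert a sorted list of docids into gaps (first entry kept as-is)."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Reconstruct docids by keeping a running sum of the gaps."""
    docids, total = [], 0
    for g in gaps:
        total += g
        docids.append(total)
    return docids

postings = [33, 47, 154, 159, 202]                  # "computer" from the slide
assert to_gaps(postings) == [33, 14, 107, 5, 43]
assert from_gaps(to_gaps(postings)) == postings
```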

Example: three postings entries. Gap encoding preserves the linear-time merge: just keep a running sum of the gaps as you traverse the postings.

Variable length encoding Aim: for arachnocentric, use ~20 bits per gap entry; for the, use ~1 bit per gap entry. More generally: to encode a gap with integral value G, use ~log₂ G bits. A standard, fixed n-bit int won't work here. Solution: variable length codes.

Variable Byte (VB) codes For a gap value G: begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c. If G ≤ 127, binary-encode it in the 7 available bits and set c = 1. Else encode G's lowest-order 7 bits in the rightmost byte, and encode G's higher-order bits in additional bytes to the left. When finished, set the continuation bits: last byte c = 1; other bytes c = 0.

Example:
docids:   824                  829         215406
gaps:                          5           214577
VB code:  00000110 10111000    10000101    00001101 00001100 10110001
Postings are stored as the byte concatenation 000001101011100010000101000011010000110010110001. Key property: VB-encoded postings are uniquely prefix-decodable. For a small gap (5), VB uses a whole byte.
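Below is a minimal Python sketch of the VB scheme just described, checked against the worked example above; vb_encode_number and vb_decode are illustrative names, not from any particular IR library.

```python
# Variable Byte (VB) encoding: 7 payload bits per byte; the continuation bit c
# (the high-order bit) is set to 1 only on the last byte of each gap.

def vb_encode_number(g):
    bytes_ = []
    while True:
        bytes_.insert(0, g % 128)   # lowest-order 7 bits go in the rightmost byte
        if g < 128:
            break
        g //= 128                   # higher-order bits go in bytes to the left
    bytes_[-1] += 128               # set c = 1 on the last byte
    return bytes(bytes_)

def vb_decode(stream):
    gaps, g = [], 0
    for byte in stream:
        if byte < 128:              # c = 0: more bytes of this gap follow
            g = 128 * g + byte
        else:                       # c = 1: last byte of this gap
            gaps.append(128 * g + (byte - 128))
            g = 0
    return gaps

gaps = [824, 5, 214577]             # gaps for docids 824, 829, 215406
encoded = b"".join(vb_encode_number(g) for g in gaps)
# 00000110 10111000 | 10000101 | 00001101 00001100 10110001
assert encoded.hex() == "06b8850d0cb1"
assert vb_decode(encoded) == gaps
```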

Benefits from index compression Shrink the index to ~15% of the size of the original text! ...but we ignored positional information. In practice savings are more limited (about 30%-50% of the original text). The techniques are substantially the same.

Today's Outline Index compression Distributed Indexing MapReduce Distributed Searching

Sec. 1.2 Indexing at Web-scale. Core indexing step: sort the (term, docid) pairs by term, and we have to sort the postings too. How do we do this with 20 billion documents? (The slide's example: the unsorted pairs (I,1), (did,1), (enact,1), (julius,1), (caesar,1), (I,1), (was,1), (killed,1), (so,2), (let,2), (it,2), (be,2), (with,2), (caesar,2), (the,2), ... become, after sorting by term, (ambitious,2), (be,2), (brutus,1), (brutus,2), (capitol,1), (caesar,1), (caesar,2), (caesar,2), (did,1), (enact,1), (hath,1), (I,1), (I,1), (i',1), (it,2), (julius,1), (killed,1), (killed,1), ...)

Data center photos: www.rice.edu and http://www.sabahan.com/2006/08/03/picture-of-google-data-center/

Distributed indexing For web-scale indexing: use a distributed computing cluster. Individual machines are fault-prone: they can unpredictably slow down or fail. How do we exploit such a pool of machines?

Data centers Mainly commodity machines Distributed around the world Estimates for Google: At least 1 million servers, est. 2% of worldwide supply (datacenterknowledge.com, 2010) Approximately 200,000 new servers each quarter, based on expenditures of $1.6B/year http://www.circleid.com/posts/20101021_googles_spending_spree_24_million_servers_and_counting/ According to Microsoft research chief Rick Rashid, around 20 per cent of all the servers sold around the world each year are now being bought by a small handful of internet companies - he named Microsoft, Google, Yahoo and Amazon. http://blogs.ft.com/techblog/2009/03/how-many-computers-does-the-world-need/

Google data centers If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system? Answer: 0.999^1000 ≈ 37%; equivalently, at any given moment there is a ~63% chance that at least one node is down.

Distributed indexing Maintain a master machine directing the indexing job; it is considered "safe". Break up indexing into sets of (parallel) tasks. The master machine assigns each task to an idle machine from a pool.

Parallel tasks We will use two sets of parallel tasks: parsers and inverters. Break the input document corpus into splits; each split is a subset of documents.

Parsers The master assigns a split to an idle parser machine. The parser reads one document at a time and emits (term, doc) pairs. The parser writes the pairs into j partitions; each partition covers a range of terms' first letters (e.g., a-f, g-p, q-z; here j = 3).

Inverters An inverter collects all (term, doc) pairs (= postings) for one term-partition, sorts them, and writes them to postings lists. Important: each partition must be small enough that this sort can be done on a single machine, so in practice we'd use a large j (many partitions), e.g., al-aq rather than a-j.

Data flow: the master assigns splits to parsers and term-partitions to inverters. Each parser writes segment files for the term ranges a-f, g-p, and q-z; each inverter reads the segment files for its own range (a-f, g-p, or q-z) and produces the postings.

MapReduce The index construction algorithm we just described is an instance of MapReduce, a robust and conceptually simple framework for distributed computing [Dean and Ghemawat 2004]. MapReduce handles the "distributed" part. The Google indexing system (ca. 2002) consists of a number of phases, each until recently implemented in MapReduce.

Schema for index construction in MapReduce Schema of map and reduce functions: map: input → list(<k, v>); reduce: (k, list(v)) → output. Instantiation of the schema for index construction: map: web collection → list(<termID, docID>); reduce: (<termID1, list(docID)>, <termID2, list(docID)>, ...) → (postings list 1, postings list 2, ...)

MapReduce Example for Index Construction map: d2: "C died.", d1: "C came, C c'ed." → <C,d2>, <died,d2>, <C,d1>, <came,d1>, <C,d1>, <c'ed,d1> reduce: (<C,(d2,d1,d1)>, <c'ed,(d1)>, <came,(d1)>, <died,(d2)>) → (<C,(d1:2,d2:1)>, <c'ed,(d1:1)>, <came,(d1:1)>, <died,(d2:1)>), in key-sorted order. (Here C abbreviates Caesar and c'ed abbreviates conquered.)
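A minimal single-machine sketch of this example, assuming whitespace tokenization with punctuation dropped; map_fn, reduce_fn, and the in-memory group-by are illustrative stand-ins for what a real MapReduce runtime provides.

```python
# Single-machine sketch of the index-construction schema above.
from collections import Counter, defaultdict

def map_fn(docid, text):
    """map: document -> list of (term, docid) pairs."""
    return [(term, docid) for term in text.split()]

def reduce_fn(term, docids):
    """reduce: (term, list of docids) -> postings list with term frequencies."""
    counts = Counter(docids)
    return term, sorted(counts.items())            # [(docid, tf), ...]

# Punctuation dropped for simplicity.
docs = {"d1": "C came C c'ed", "d2": "C died"}

# "Shuffle/sort": group map output by key (term), then reduce in key-sorted order.
grouped = defaultdict(list)
for docid, text in docs.items():
    for term, d in map_fn(docid, text):
        grouped[term].append(d)
index = dict(reduce_fn(t, ds) for t, ds in sorted(grouped.items()))

assert index["C"] == [("d1", 2), ("d2", 1)]        # matches the slide's <C,(d1:2,d2:1)>
```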

Gee, it's easy You write two methods: map: takes a data object and emits (key, value) pairs; reduce: takes the (key, value) pairs for one key and merges them. You specify: a partition function for reduce (which keys go where), and the parameters M (number of map tasks) and R (number of reduce tasks, the same as j above).

Data flow, revisited: the parser stage is the Map phase and the inverter stage is the Reduce phase. The master assigns splits to parsers and term-partitions (a-f, g-p, q-z) to inverters, with segment files in between, producing the postings.

MapReduce Code Example Word Counting In a set of a million documents, how often does each word appear? Map: for a given doc, emit <word, 1> for each word that appears Reduce: Output sum

(Slides show the map and reduce pseudocode for word counting from Dean and Ghemawat, 2004.)
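The paper presents word counting in C++ pseudocode; the following is a hedged single-machine Python sketch of the same idea, with whitespace tokenization and an in-memory group-by standing in for the shuffle phase.

```python
# Single-machine sketch of MapReduce word counting.
from collections import defaultdict

def map_fn(docid, text):
    # Emit <word, 1> for each word that appears in the document.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Output the sum of all counts emitted for this word.
    return word, sum(counts)

docs = {"d1": "to be or not to be", "d2": "to index is to count"}

grouped = defaultdict(list)
for docid, text in docs.items():
    for word, one in map_fn(docid, text):
        grouped[word].append(one)
word_counts = dict(reduce_fn(w, c) for w, c in grouped.items())

assert word_counts["to"] == 4 and word_counts["be"] == 2
```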

Fun with MapReduce (1 of 4) Distributed grep In a set of files, which lines include the word arachnocentric? Map: scan text and output matching lines Reduce: Copy the input

Fun with MapReduce (2 of 4) Monte Carlo simulation E.g., given computer backgammon players A and B, how often will A beat B? Map: play a game and emit <A, 1> or <B, 1> based on who wins. Reduce: sum. Post-processing: compute the ratio.

Fun with MapReduce (3 of 4) Boolean Satisfiability Given a boolean formula in 3-CNF, e.g. (x1 v -x3 v x7) ^ (x4 v x5 v -x6) ^ ..., how many distinct assignments to the variables (i.e. xi = true/false) make the formula true? Create M = 2^k splits, one for each assignment of the k variables. Map: output <whatever, 1> iff the assignment makes the formula true. Reduce: sum.
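A small sketch of this counting scheme for a toy 3-CNF formula; the formula, the variable count k, and the helper names are invented for illustration, and the loop over all 2^k assignments stands in for the M splits.

```python
# #SAT by brute force over assignments, phrased as map + reduce.
from itertools import product

# (x1 v -x3 v x2) ^ (x2 v x3 v -x1): each clause is a list of (variable index, polarity).
formula = [[(0, True), (2, False), (1, True)], [(1, True), (2, True), (0, False)]]
k = 3  # number of variables

def map_fn(assignment):
    # Output <"count", 1> iff this assignment satisfies every clause.
    satisfied = all(any(assignment[i] == pol for i, pol in clause) for clause in formula)
    return [("count", 1)] if satisfied else []

def reduce_fn(key, values):
    return key, sum(values)

pairs = [p for a in product([False, True], repeat=k) for p in map_fn(a)]
print(reduce_fn("count", [v for _, v in pairs]))   # ('count', 6) satisfying assignments
```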

Fun with MapReduce (4 of 4) Link Graph Index Build an index to answer "which pages link to me" queries (i.e. Google's link: operator). Map: for document d1, output <dj, d1> for all <a href=dj> tags therein. Reduce: sort the values into posting lists. Partition function (which map keys are grouped together): sorted, e.g. ranges of domains (www.aa*-www.ai*).
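A sketch of the link-graph index under the assumption that documents are plain HTML strings; the regex-based href extraction, the example URLs, and the in-memory group-by are simplifications for illustration.

```python
# Link-graph ("who links to me") index, phrased as map + reduce.
import re
from collections import defaultdict

def map_fn(docid, html):
    # For document d1, emit <dj, d1> for every <a href="dj"> tag in the page.
    return [(target, docid) for target in re.findall(r'<a\s+href\s*=\s*"([^"]+)"', html)]

def reduce_fn(target, sources):
    # Postings list of pages that link to `target`.
    return target, sorted(sources)

docs = {
    "www.aa.example/1": '<a href="www.ab.example/x">x</a>',
    "www.ai.example/2": '<a href="www.ab.example/x">x</a> <a href="www.zz.example/y">y</a>',
}
grouped = defaultdict(list)
for docid, html in docs.items():
    for target, source in map_fn(docid, html):
        grouped[target].append(source)
link_index = dict(reduce_fn(t, s) for t, s in grouped.items())
# link_index["www.ab.example/x"] lists both pages that link to it.
```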

Today's Outline Index compression Distributed Indexing MapReduce Distributed Searching

Distributed Searching We described indexing into a term-partitioned structure: each machine handles some fraction of the terms (e.g. al-aq). Another possibility is document-partitioned: each machine handles a subset of the documents.

Back to MapReduce How do we change our implementation to be document-partitioned? One possibility: map: output <(doc-group, word-id), docid>; reduce: sum and output the posting list as before; partition function: by doc-group.
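A sketch of this document-partitioned variant; taking doc-group to be docid modulo the number of reduce tasks is purely an illustrative choice, and the names are hypothetical.

```python
# Document-partitioned index construction: route by doc-group instead of by term.
from collections import defaultdict

R = 2  # number of reduce tasks / document partitions

def map_fn(docid, text):
    group = docid % R                          # doc-group for this document
    return [((group, term), docid) for term in text.split()]

def partition_fn(key):
    group, _term = key
    return group                               # route by doc-group, not by term

def reduce_fn(key, docids):
    return key, sorted(set(docids))            # postings list within this partition

docs = {1: "caesar came", 2: "caesar died", 3: "brutus came"}
shards = [defaultdict(list) for _ in range(R)]
for docid, text in docs.items():
    for key, value in map_fn(docid, text):
        shards[partition_fn(key)][key].append(value)

# Each shard now holds a complete index over its own subset of documents.
partitioned_index = [dict(reduce_fn(k, v) for k, v in shard.items()) for shard in shards]
```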

Next: Web Crawlers, Link Analysis, Information Extraction