Inverted Indexes: Trading Precision for Efficiency



Similar documents
Inverted Files for Text Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Binary Heap Algorithms

MyOra 3.0. User Guide. SQL Tool for Oracle. Jayam Systems, LLC

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Automated SEO. A Market Brew White Paper

D B M G Data Base and Data Mining Group of Politecnico di Torino

This presentation explains how to monitor memory consumption of DataStage processes during run time.

Homework 2. Page 154: Exercise Page 145: Exercise 8.3 Page 150: Exercise 8.9

Curriculum Map. Discipline: Computer Science Course: C++

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

1 Formulating The Low Degree Testing Problem

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Infrastructures for big data

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Similarity Search in a Very Large Scale Using Hadoop and HBase

SQL Query Evaluation. Winter Lecture 23

TECHNICAL UNIVERSITY OF CRETE DATA STRUCTURES FILE STRUCTURES

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Performance Analysis and Optimization on Lucene

Search and Information Retrieval

Storage Management for Files of Dynamic Records

Electronic Document Management Using Inverted Files System

Vector storage and access; algorithms in GIS. This is lecture 6

Data Deduplication in Slovak Corpora

Textbook and References

DNA Sequencing Data Compression. Michael Chung

Turing Machines: An Introduction

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

Big Data & Scripting Part II Streaming Algorithms

Search Engines. Stephen Shaw 18th of February, Netsoc

Covariance and Correlation

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

CSI 333 Lecture 1 Number Systems

A Time Efficient Algorithm for Web Log Analysis

Improving SQL Server Performance

Introduction to Database Management Systems

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables

Introduction to Information Retrieval

Database Applications Microsoft Access

DYNAMIC QUERY FORMS WITH NoSQL

Principles of Data Mining by Hand&Mannila&Smyth

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

System Copy GT Manual 1.8 Last update: 2015/07/13 Basis Technologies

Developing MapReduce Programs

SRC Technical Note. Syntactic Clustering of the Web

Outline. mass storage hash functions. logical key values nested tables. storing information between executions using DBM files

Unifying Search for the Desktop, the Enterprise and the Web

A Searching Strategy to Adopt Multi-Join Queries

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

4.2 Sorting and Searching

Data storage Tree indexes

Unique column combinations

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

TF-IDF. David Kauchak cs160 Fall 2009 adapted from:

IBM Campaign & Interact Access external customer profile data to augment and enhance segmentation in marketing campaigns

Information Retrieval System Assigning Context to Documents by Relevance Feedback

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

NCPC 2013 Presentation of solutions

Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs

COURSE DESCRIPTION. Queries in Microsoft Access. This course is designed for users with a to create queries in Microsoft Access.

Introduction to Information Retrieval

INTRODUCTION IMAGE PROCESSING >INTRODUCTION & HUMAN VISION UTRECHT UNIVERSITY RONALD POPPE

Rise of the Machines: An Internet-Wide Analysis of Web Bots in 2014

1 Approximating Set Cover

Data Warehousing With DB2 for z/os... Again!

PostgreSQL Concurrency Issues

Factoring & Primality

Operating Systems. Virtual Memory

CSC148 Lecture 8. Algorithm Analysis Binary Search Sorting

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

In and Out of the PostgreSQL Shared Buffer Cache

Large-Scale Data Cleaning Using Hadoop. UC Irvine

FOUNDATIONS OF ALGEBRAIC GEOMETRY CLASS 22

A Simple Inventory System

Analysis of Binary Search algorithm and Selection Sort algorithm

Query Processing C H A P T E R12. Practice Exercises

Chapter 2 Data Storage

RFM Analysis: The Key to Understanding Customer Buying Behavior

(Refer Slide Time: 2:03)

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

Transcription:

Inverted Indexes: Trading Precision for Efficiency Yufei Tao KAIST April 1, 2013

After compression, an inverted index is often small enough to fit in memory. This benefits query processing because it avoids disk I/Os. In this lecture, we will discuss several additional tricks to make a query algorithm even faster. As we will see, some of these tricks trade efficiency for precision, namely, they do not guarantee returning the same results as our current algorithm, but the differences are usually minor in practice, and hardly affect the usefulness of a search engine.

First, let us recall the score definition we have been using to judge the importance of a webpage D with respect to a query Q that consists of terms t 1,..., t m : score(d, Q) = A(D) tf (D, t j ) idf (t j ) 2 where t j Q A(D) = rank(d)/ p where p is the point converted from D.

One effective heuristic in practice is to modify the score function to score (D, Q) = A(D) max{trim(tf (D, t j ) idf (t j ) 2, τ)} t j Q where τ is a system parameter and function trim(x, τ) is defined as: trim(x, τ) = { 0 if x < τ x otherwise The intuition is that, if the contribution tf (D, t i ) idf (t j ) 2 of a term t j is too small, we may ignore it without harming the quality of the query result significantly.

The new score definition has an important implication: for each term t j, it suffices to consider only those pairs (i, tf (D i, t j )) in its inverted list such that tf (D i, t j ) τ/idf (t j ) 2 Hence, we can sort the pairs (i, tf (D i, t j )) of an inverted list in descending order of tf (D i, t j ), so that in query processing we can scan the list from the beginning, and stop as soon as hitting a pair that does not satisfy the above inequality. See the next slide for an example.

Example (Adapted from [Zobel and Moffat, 2006]) term w inverted list for w and (6, 2) big (2, 2), (3, 1) dark (6, 1) did (4, 1) gown (2, 1) had (3, 1) house (2, 1), (3, 1) in (2, 2), (6, 2), (1, 1), (3, 1), (5, 1), keep (1, 1), (3, 1), (5, 1) keeper (1, 1), (4, 1), (5, 1) keeps (1, 1), (5, 1), (6, 1) light (6, 1) never (4, 1) night (5, 2), (1, 1), (4, 1) old (1, 1), (2, 2), (3, 1), (4, 1) sleep (4, 1) sleeps (6, 1) the (1, 3), (3, 3), (5, 3), (2, 2), (6, 2), (4, 1) town (1, 1), (3, 1) where (4, 1)

Query Algorithm A query with term set Q = {t 1,..., t m } can be answered as follows: algorithm query(q) 1. score (D, Q) = 0 for all D S 2. for each term t j Q 3. for each pair (i, tf (D i, t j )) of list(t j ) in descending order of tf (D i, t j ) 4. if tf (D i, t j ) > τ/idf (t j ) 2 then 5. score (D i, Q) = score (D i, Q) + tf (D i, t j ) idf (t j ) 2 6. else break 7. for each D S 8. score (D, Q) = score (D, Q)/A(D) 9. sort the documents by score 10. return the k documents with the highest scores

Compression Since the sorting method has changed, we also need to modify the way we compress an inverted list. Recall that, previously, the pairs (i, tf (D i, t)) of the inverted list of term t are sorted in ascending order of document id i. Now, they are first sorted by frequency tf (D i, t), and then, by document id i. For example: the (1, 3), (3, 3), (5, 3), (2, 2), (6, 2), (4, 1) Let us refer to the sequence of pairs with the same frequency as a segment. For example, (1, 3), (3, 3), (5, 3) form a segment, (2, 2), (6, 2) form another, and so does (4, 1).

Compression (cont.) In general, a segment (f, id 1 ), (f, id 2 ),..., (f, id x ) is stored as: (f, id 1, id 2 id 1, id 3 id 2,..., id x id x 1 ) where each number is represented as Elia s gamma or delta code. For example, the inverted list of the previous slide is stored as: the (3, 1, 2, 2), (2, 2, 4), (1, 4)

Remarks So far we have learned two types of inverted indexes: Id-sorted: where the pairs of an inverted list are sorted by document id. Frequency-sorted: where the pairs of an inverted list are sorted by term frequency. Both types of inverted indexes are collectively referred to as document-level inverted indexes.