Indexing, cont d PHRASE QUERIES AND POSITIONAL INDEXES

Similar documents
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Inverted Indexes: Trading Precision for Efficiency

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

WRITING PROOFS. Christopher Heil Georgia Institute of Technology

Search and Information Retrieval

Chapter 2 The Information Retrieval Process

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Introduction to Information Retrieval

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

Electronic Document Management Using Inverted Files System

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Large-Scale Data Cleaning Using Hadoop. UC Irvine

Web Document Clustering

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

A Comparative Approach to Search Engine Ranking Strategies

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

An Information Retrieval using weighted Index Terms in Natural Language document collections

Understanding Video Lectures in a Flipped Classroom Setting. A Major Qualifying Project Report. Submitted to the Faculty

The University of Lisbon at CLEF 2006 Ad-Hoc Task

CS 103X: Discrete Structures Homework Assignment 3 Solutions

Large-Scale Test Mining

Basic indexing pipeline

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value

1 Boolean retrieval. Online edition (c)2009 Cambridge UP

Optimization of Internet Search based on Noun Phrases and Clustering Techniques

Sentence Blocks. Sentence Focus Activity. Contents

C H A P T E R Regular Expressions regular expression

Static vs. Dynamic. Lecture 10: Static Semantics Overview 1. Typical Semantic Errors: Java, C++ Typical Tasks of the Semantic Analyzer

Isotope distributions

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

Analysis of Compression Algorithms for Program Data

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

KEYWORD SEARCH IN RELATIONAL DATABASES

3. Mathematical Induction

Data Structures Fibonacci Heaps, Amortized Analysis

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

Collecting Polish German Parallel Corpora in the Internet

Computational Geometry. Lecture 1: Introduction and Convex Hulls

HOMEWORK 5 SOLUTIONS. n!f n (1) lim. ln x n! + xn x. 1 = G n 1 (x). (2) k + 1 n. (n 1)!

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Oracle8i Spatial: Experiences with Extensible Databases

Metasearch Engines. Synonyms Federated search engine

International Journal of Advance Research in Computer Science and Management Studies

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Chapter 13: Query Processing. Basic Steps in Query Processing

Information Retrieval Elasticsearch

Log Analysis of Academic Digital Library: User Query Patterns

Solutions to Problem Set 1

Structure for String Keys

Standard for Software Component Testing

Inverted Files for Text Search Engines

Storage Management for Files of Dynamic Records

International Journal of Advanced Research in Computer Science and Software Engineering

Testing LTL Formula Translation into Büchi Automata

A Rule-Based Short Query Intent Identification System

A Survey on Product Aspect Ranking

MFL skills map. Year 3 Year 4 Year 5 Year 6 Develop understanding of the sounds of Individual letters and groups of letters (phonics).

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Data Structures and Algorithms Written Examination

Performance Analysis and Optimization on Lucene

INTRUSION PROTECTION AGAINST SQL INJECTION ATTACKS USING REVERSE PROXY

Glossary of translation tool types

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Risk Decomposition of Investment Portfolios. Dan dibartolomeo Northfield Webinar January 2014

CSCI 5417 Information Retrieval Systems Jim Martin!

Natural Language Database Interface for the Community Based Monitoring System *

Cassandra A Decentralized, Structured Storage System


A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

Online Tutoring System For Essay Writing

Sequential Data Structures

w ki w kj k=1 w2 ki k=1 w2 kj F. Aiolli - Sistemi Informativi 2007/2008

Transcription:

Indexing, cont d PHRASE QUERIES AND POSITIONAL INDEXES

Sec. 2.4 Phrase Queries Want to be able to answer queries such as dalhousie university as a phrase Thus the sentence I went to university at dalhousie is not a match. The concept of phrase queries has proven easily understood by users; one of the few advanced search ideas that works Many more queries are implicit phrase queries For this, it no longer suffices to store only <term : docs> entries

Sec. 2.4.1 A First Attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text Friends, Romans, Countrymen would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate.

One Motive Could be that most search queries are 2.4 words long. This applies to Web search only.

Sec. 2.4.1 Longer Phrase Queries Longer phrases are processed as Boolean biword queries: dalhousie university halifax canada can be broken into the following Boolean query on biwords: dalhousie university AND university halifax AND halifax canada Without the documents, we cannot verify that the documents matching the above Boolean query do contain the phrase. Can have false positives!

Sec. 2.4.1 Extended biwords Parse the indexed text and perform part-of-speech-tagging (POST). Bucket the terms into (say) Nouns (N) and articles/prepositions (X). Call any string of terms of the form NX*N an extended biword. Each such extended biword is now made a term in the dictionary. Example: catcher in the rye N X X N Query processing: parse it into N s and X s Segment query into enhanced biwords Look up in index: catcher rye

Sec. 2.4.1 Issues for biword indexes False positives, as noted before Index blowup due to bigger dictionary Infeasible for more than biwords, big even for them Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Sec. 2.4.2 Solution 2: Positional Indexes In the postings, store, for each term the position(s) in which tokens of it appear: <term, number of docs containing term; doc1: position1, position2 ; doc2: position1, position2 ; etc.>

Sec. 2.4.2 Positional index example Which of docs 1,2,4,5 could contain to be or not to be? For phrase queries, we use a merge algorithm recursively at the document level But we now need to deal with more than just equality

Sec. 2.4.2 Processing a phrase query Extract inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions with to be or not to be. to: be: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191;... 1:17,19; 4:17,191,291,430,434; 5:14,19,101;... Same general method for proximity searches

Sec. 2.4.2 Proximity queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT Again, here, /k means within k words of. Clearly, positional indexes can be used for such queries; biword indexes cannot. Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k? This is a little tricky to do correctly and efficiently See Figure 2.12 of IIR There s likely to be a problem on it!

Figure 2.12 Text Book Please run it through this example: to: be: 4: 8,16,190,429,433; 4: 17,191,291,430,434; Proximity on both sides of the term

Sec. 2.4.2 Positional index size You can compress position values/offsets. Nevertheless, a positional index expands postings storage substantially Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries whether used explicitly or implicitly in a ranking retrieval system.

Sec. 2.4.2 Positional index size Need an entry for each occurrence, not just once per document Index size depends on average document size Average web page has <1000 terms Books and some epic poems easily 100,000 terms Consider a term with frequency 0.1% Document size 1000 100,000 Postings 1 1 Positional postings 1 100

Sec. 2.4.2 Rules of thumb A positional index is 2 4 as large as a nonpositional index Positional index size 35 50% of volume of original text Caveat: all of this holds for English-like languages

Sec. 2.4.3 Combination schemes These two approaches can be profitably combined For particular phrases ( Michael Jackson, Britney Spears ) it is inefficient to keep on merging positional postings lists Even more so for phrases like The Who Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme A typical web query mixture was executed in ¼ of the time of using just a positional index It required 26% more space than having a positional index alone

Resources for today s lecture IIR Chapter 2: http://nlp.stanford.edu/irbook/html/htmledition/positional-postingsand-phrase-queries-1.html H.E. Williams, J. Zobel, and D. Bahle. 2004. Fast Phrase Querying with Combined Indexes, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4 D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.