Lecture 1: Introduction and the Boolean Model

Similar documents

1 Boolean retrieval. Online edition (c)2009 Cambridge UP

Introduction to Information Retrieval

Search and Information Retrieval

This lecture. Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis.

types of information systems computer-based information systems

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

TF-IDF. David Kauchak cs160 Fall 2009 adapted from:

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling

Introduction to Information Retrieval

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007

Introduction to Information Retrieval

Search Result Optimization using Annotators

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

PowerPoint Presentation to Accompany. Chapter 3. File Management. Copyright 2014 Pearson Educa=on, Inc. Publishing as Pren=ce Hall

Similarity Search in a Very Large Scale Using Hadoop and HBase

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Lecture 5: Evaluation

1 o Semestre 2007/2008

Brunswick, NJ: Ometeca Institute, Inc., pp. $ ( ).

Large-Scale Test Mining

ifinder ENTERPRISE SEARCH

Get to Grips with SEO. Find out what really matters, what to do yourself and where you need professional help

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

The Data Mining Process

CS3220 Lecture Notes: QR factorization and orthogonal transformations

CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise

Text Analytics. Introduction to Information Retrieval. Ulf Leser

Cover Page. "Assessing the Agreement of Cognitive Space with Information Space" A Research Seed Grant Proposal to the UNC-CH Cognitive Science Program

Information Management course

Search Engines. Stephen Shaw 18th of February, Netsoc

Analyzing survey text: a brief overview

SEO for Web 2.0 Enterprise Search Summit West September 2008

Active Learning SVM for Blogs recommendation

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Distributed Computing and Big Data: Hadoop and MapReduce

The University of Lisbon at CLEF 2006 Ad-Hoc Task

Removing Web Spam Links from Search Engine Results

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

Flattening Enterprise Knowledge

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Taxonomy Enterprise System Search Makes Finding Files Easy

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Intelligent Search for Answering Clinical Questions Coronado Group, Ltd. Innovation Initiatives

Mining Text Data: An Introduction

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

Information Need Assessment in Information Retrieval

Query 4. Lesson Objectives 4. Review 5. Smart Query 5. Create a Smart Query 6. Create a Smart Query Definition from an Ad-hoc Query 9

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Identifying SPAM with Predictive Models

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Search engine optimisation (SEO)

F. Aiolli - Sistemi Informativi 2007/2008

Big Data: Rethinking Text Visualization

BIG DATA What it is and how to use?

Text Analytics. Introduction to Information Retrieval. Ulf Leser

Help Sheet on the Use of RIA-Checkpoint Database

Add external resources to your search by including Universal Search sites in either Basic or Advanced mode.

DATABASDESIGN FÖR INGENJÖRER - 1DL124

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

IT services for analyses of various data samples

Introduction. A. Bellaachia Page: 1

Linear Algebra Methods for Data Mining

Integration of a Multilingual Keyword Extractor in a Document Management System

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 2 The Information Retrieval Process

ATLAS.ti 6 Distinguishing features and functions

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Facilitating Business Process Discovery using Analysis

Data processing goes big

Data Discovery on the Information Highway

Indexed vs. Unindexed Searching:

OvidSP Quick Reference Guide

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

Search Engine Design understanding how algorithms behind search engines are established

Big Data Text Mining and Visualization. Anton Heijs

CSCI 5417 Information Retrieval Systems Jim Martin!

Mining event log patterns in HPC systems

Electronic Document Management Using Inverted Files System

ECM Governance Policies

TIM 50 - Business Information Systems

Role of Social Networking in Marketing using Data Mining

Search engine ranking

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Inge Os Sales Consulting Manager Oracle Norway

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

QAD Business Intelligence Dashboards Demonstration Guide. May 2015 BI 3.11

Technical challenges in web advertising

Table and field properties Tables and fields also have properties that you can set to control their characteristics or behavior.

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1

Transcription:

Lecture 1: Introduction and the Boolean Model Information Retrieval Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group Simone.Teufel@cl.cam.ac.uk 1

Overview 1 Motivation Definition of Information Retrieval IR: beginnings to now 2 First Boolean Example Term-Document Incidence matrix The inverted index Processing Boolean Queries Practicalities of Boolean Search

What is Information Retrieval? Manning et al, 2008: Information retrieval (IR) is finding material... of an unstructured nature...that satisfies an information need from within large collections... 2

What is Information Retrieval? Manning et al, 2008: Information retrieval (IR) is finding material... of an unstructured nature...that satisfies an information need from within large collections... 3

Document Collections 4

Document Collections IR in the 17th century: Samuel Pepys, the famous English diarist, subject-indexed his treasured 1000+ books library with key words. 5

Document Collections 6

What we mean here by document collections Manning et al, 2008: Information retrieval (IR) is finding material (usually documents) of an unstructured nature...that satisfies an information need from within large collections (usually stored on computers). Document Collection: text units we have built an IR system over. Usually documents But could be memos book chapters paragraphs scenes of a movie turns in a conversation... Lots of them 7

IR Basics Document Collection Query IR System Set of relevant documents 8

IR Basics web pages Query IR System Set of relevant web pages 9

What is Information Retrieval? Manning et al, 2008: Information retrieval (IR) is finding material (usually documents) of an unstructured nature...that satisfies an information need from within large collections (usually stored on computers. 10

Structured vs Unstructured Data Unstructured data means that a formal, semantically overt, easy-for-computer structure is missing. In contrast to the rigidly structured data used in DB style searching (e.g. product inventories, personnel records) SELECT * FROM business catalogue WHERE category = florist AND city zip = cb1 This does not mean that there is no structure in the data Document structure (headings, paragraphs, lists...) Explicit markup formatting (e.g. in HTML, XML...) Linguistic structure (latent, hidden) 11

Information Needs and Relevance Manning et al, 2008: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). An information need is the topic about which the user desires to know more about. A query is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if the user perceives that it contains information of value with respect to their personal information need. 12

Types of information needs Manning et al, 2008: Information retrieval (IR) is finding material... of an unstructured nature...that satisfies an information need from within large collections... Known-item search Precise information seeking search Open-ended search ( topical search ) 13

Information scarcity vs. information abundance Information scarcity problem (or needle-in-haystack problem): hard to find rare information Lord Byron s first words? 3 years old? Long sentence to the nurse in perfect English?...when a servant had spilled an urn of hot coffee over his legs, he replied to the distressed inquiries of the lady of the house, Thank you, madam, the agony is somewhat abated. [not Lord Byron, but Lord Macaulay] Information abundance problem (for more clear-cut information needs): redundancy of obvious information What is toxoplasmosis? 14

Relevance Manning et al, 2008: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Are the retrieved documents about the target subject up-to-date? from a trusted source? satisfying the user s needs? How should we rank documents in terms of these factors? More on this in a lecture soon 15

How well has the system performed? The effectiveness of an IR system (i.e., the quality of its search results) is determined by two key statistics about the system s returned results for a query: Precision: What fraction of the returned results are relevant to the information need? Recall: What fraction of the relevant documents in the collection were returned by the system? What is the best balance between the two? Easy to get perfect recall: just retrieve everything Easy to get good precision: retrieve only the most relevant There is much more to say about this lecture 5 16

IR today Web search ( ) Search ground are billions of documents on millions of computers issues: spidering; efficient indexing and search; malicious manipulation to boost search engine rankings Link analysis covered in Lecture 8 Enterprise and institutional search ( ) e.g company s documentation, patents, research articles often domain-specific Centralised storage; dedicated machines for search. Most prevalent IR evaluation scenario: US intelligence analyst s searches Personal information retrieval (email, pers. documents; ) e.g., Mac OS X Spotlight; Windows Instant Search Issues: different file types; maintenance-free, lightweight to run in background 17

A short history of IR 0 1 recall precision no items retrieved 1945 1950s 1960s 1970s 1980s 1990s 2000s memex Term IR coined by Calvin Moers Literature searching systems; evaluation by P&R (Alan Kent) precision/ recall Cranfield experiments Boolean IR SMART Salton; VSM TREC pagerank Multimedia Multilingual (CLEF) Recommendation Systems 18

IR for non-textual media 19

Similarity Searches 20

Areas of IR Ad hoc retrieval (lectures 1-5) web retrieval (lecture 8) Support for browsing and filtering document collections: Clustering (lecture 6) Classification; using fixed labels (common information needs, age groups, topics; lecture 7) Further processing a set of retrieved documents, e.g., by using natural language processing Information extraction Summarisation Question answering 21

Overview 1 Motivation Definition of Information Retrieval IR: beginnings to now 2 First Boolean Example Term-Document Incidence matrix The inverted index Processing Boolean Queries Practicalities of Boolean Search

Boolean Retrieval In the Boolean retrieval model we can pose any query in the form of a Boolean expression of terms i.e., one in which terms are combined with the operators and, or, and not. Shakespeare example 22

Brutus AND Caesar AND NOT Calpurnia Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia? Naive solution: linear scan through all text grepping In this case, works OK (Shakespeare s Collected works has less than 1M words). But in the general case, with much larger text colletions, we need to index. Indexing is an offline operation that collects data about which words occur in a text, so that at search time you only have to access the precompiled index. 23

The term-document incidence matrix Main idea: record for each document whether it contains each word out of all the different words Shakespeare used (about 32K). Antony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0... Matrix element (t,d) is 1 if the play in column d contains the word in row t, 0 otherwise. 24

Query Brutus AND Caesar AND NOT Calpunia We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): Antony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0... This returns two documents, Antony and Cleopatra and Hamlet. 25

Query Brutus AND Caesar AND NOT Calpunia We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): Antony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 1 0 1 1 1 1 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0... This returns two documents, Antony and Cleopatra and Hamlet. 26

Query Brutus AND Caesar AND NOT Calpunia We compute the results for our query as the bitwise AND between vectors for Brutus, Caesar and complement (Calpurnia): Antony Julius The Hamlet Othello Macbeth and Caesar Tempest Cleopatra Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 1 0 1 1 1 1 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 AND 1 0 0 1 0 0 Bitwise AND returns two documents, Antony and Cleopatra and Hamlet. 27

The results: two documents Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to Dominitus Enobarbus]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring, and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i the Capitol; Brutus killed me. 28

Bigger collections Consider N=10 6 documents, each with about 1000 tokens 10 9 tokens at avg 6 Bytes per token 6GB Assume there are M=500,000 distinct terms in the collection Size of incidence matrix is then 500,000 10 6 Half a trillion 0s and 1s 29

Can t build the Term-Document incidence matrix Observation: the term-document matrix is very sparse Contains no more than one billion 1s. Better representation: only represent the things that do occur Term-document matrix has other disadvantages, such as lack of support for more complex query operators (e.g., proximity search) We will move towards richer representations, beginning with the inverted index. 30

The inverted index The inverted index consists of a dictionary of terms (also: lexicon, vocabulary) and a postings list for each term, i.e., a list that records which documents the term occurs in. Brutus 1 2 4 11 31 45 173 174 Caesar 1 2 4 5 6 16 57 132 179 Calpurnia 2 31 54 101 31

Processing Boolean Queries: conjunctive queries Our Boolean Query Brutus AND Calpurnia Locate the postings lists of both query terms and intersect them. Brutus 1 2 4 11 31 45 173 174 Calpurnia 2 31 54 101 Intersection 2 31 Note: this only works if postings lists are sorted 32

Algorithm for intersection of two postings INTERSECT (p1, p2) 1 answer <> 2 while p1 NIL and p2 NIL 3 do if docid(p1) = docid(p2) 4 then ADD (answer, docid(p1)) 5 p1 next(p1) 6 p2 next(p2) 7 if docid(p1) < docid(p2) 8 then p1 next(p1) 9 else p2 next(p2) 10 return answer Brutus 1 2 4 11 31 45 173 174 Calpurnia 2 31 54 101 Intersection 2 31 33

Complexity of the Intersection Algorithm Bounded by worst-case length of postings lists Thus officially O(N), with N the number of documents in the document collection But in practice much, much better than linear scanning, which is asymptotically also O(N) 34

Query Optimisation: conjunctive terms Organise order in which the postings lists are accessed so that least work needs to be done Brutus AND Caesar AND Calpurnia Process terms in increasing document frequency: execute as (Calpurnia AND Brutus) AND Caesar Brutus 8 1 2 4 11 31 45 173 174 Caesar 9 1 2 4 5 6 16 57 132 179 Calpurnia 4 2 31 54 101 35

Query Optimisation: disjunctive terms (maddening OR crowd) AND (ignoble OR strife) AND (killed OR slain) Process the query in increasing order of the size of each disjunctive term Estimate this in turn (conservatively) by the sum of frequencies of its disjuncts 36

Practical Boolean Search Provided by large commercial information providers 1960s-1990s Complex query language; complex and long queries Extended Boolean retrieval models with additional operators proximity operators Proximity operator: two terms must occur close together in a document (in terms of certain number of words, or within sentence or paragraph) Unordered results... 37

Example: Westlaw Largest commercial legal search service 500K subscribers Boolean Search and ranked retrieval both offered Document ranking only wrt chronological order Expert queries are carefully defined and incrementally developed 38

Westlaw Queries/Information Needs trade secret /s disclos! /s prevent /s employe! Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company. disab! /p access! /s work-site work-place (employment /3 place) Information need: Requirements for disabled people to be able to access a workplace. host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest Information need: Cases about a host s responsibility for drunk guests. 39

Comments on WestLaw Proximity operators: /3= within 3 words, /s=within same sentence /p =within a paragraph Space is disjunction, not conjunction (This was standard in search pre-google.) Long, precise queries: incrementally developed, unlike web search Why professional searchers like Boolean queries: precision, transparency, control. 40

Does Google use the Boolean Model? On Google, the default interpretation of a query [w 1 w 2... w n ] is w 1 AND w 2 AND... AND w n Cases where you get hits which don t contain one of the w i: Page contains variant of w i (morphology, misspelling, synonym) long query (n is large) Boolean expression generates very few hits w i was in the anchor text Google also ranks the result set Simple Boolean Retrieval returns matching documents in no particular order. Google (and most well-designed Boolean engines) rank hits according to some estimator of relevance 41

Reading Manning, Raghavan, Schütze: Introduction to Information Retrieval (MRS), chapter 1 42