Information Retrieval. Lecture 8: Relevance feedback and query expansion. Wintersemester 2007




Information Retrieval
Lecture 8: Relevance feedback and query expansion
Seminar für Sprachwissenschaft, International Studies in Computational Linguistics
Wintersemester 2007

Introduction

The same information need may be expressed with different keywords (synonymy), which has an impact on recall. Examples: ship vs. boat, aircraft vs. airplane.

Solutions: refining queries manually, or expanding queries (semi-)automatically.

Semi-automatic query expansion uses either:
- local methods, based on the retrieved documents and the query (e.g. relevance feedback)
- global methods, independent of the query and its results (e.g. thesauri, spelling corrections)

Relevance Feedback

Relevance feedback is feedback given by the user about the relevance of the documents in the initial set of results.

It is based on the idea that:
(i) defining a good query is difficult when the collection is (partly) unknown
(ii) judging particular documents is easy

It also handles situations where the user's information need evolves while inspecting the retrieved documents.

Example: the image search engine at http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Relevance feedback example 1: an image search refined over successive feedback rounds (a sequence of screenshots, not reproduced in the transcript).

Relevance feedback example 2

Query: "New space satellite applications". Initial results ("+" marks the documents the user judged relevant):

+ 1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
+ 2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
  3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

Term weights of the query expanded from this feedback:

2.074 new, 15.106 space, 30.816 satellite, 5.660 application, 5.991 nasa, 5.196 eos, 4.196 launch, 3.972 aster, 3.516 instrument, 3.446 arianespace, 3.004 bundespost, 2.806 ss, 2.790 rocket, 2.053 scientist, 2.003 broadcast, 1.172 earth, 0.836 oil, 0.646 measure

Results for the expanded query:

+ 1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
+ 2. 0.500, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
  3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4. 0.493, 07/31/89, NASA Uses Warm Superconductors For Fast Circuit
+ 5. 0.492, 12/02/87, Telecommunications Tale of Two Companies
  6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million

Standard algorithm for relevance feedback (SMART, 1970s)

It integrates a measure of relevance feedback into the Vector Space Model. Idea: find the query vector $\vec{q}_{opt}$ that maximizes the similarity with the set of relevant documents $C_r$ while minimizing the similarity with the set of non-relevant documents $C_{nr}$:

$$\vec{q}_{opt} = \arg\max_{\vec{q}} \left[ sim(\vec{q}, C_r) - sim(\vec{q}, C_{nr}) \right]$$

With cosine similarity, this is the difference of the two centroids:

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$$

Problem with the above: the full sets $C_r$ and $C_{nr}$ are unknown. Instead, we produce a modified query $\vec{q}_m$ from the documents judged so far:

$$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$

where:
- $\vec{q}_0$ is the original query vector
- $D_r$ is the set of known relevant documents
- $D_{nr}$ is the set of known non-relevant documents
- $\alpha$, $\beta$, $\gamma$ are balancing weights (trust in the user's judgments vs. the system)

Remarks:
- Negative term weights are usually ignored (set to 0)
- Rocchio-based relevance feedback improves both recall and precision
- Reaching high recall requires many iterations
- Empirically determined values for the balancing weights: $\alpha = 1$, $\beta = 0.75$, $\gamma = 0.15$
- Positive feedback is usually more valuable than negative feedback, hence $\beta > \gamma$

Rocchio algorithm: exercise

Consider the following collection (one document per line):

good movie trailer shown
trailer with good actor
unseen movie

a dictionary made of the words movie, trailer and good, and an IR system using the standard tf-idf weighting (without normalisation). Assuming the user judges the first two documents relevant for the query "movie trailer", what would be the Rocchio-revised query?

Probabilistic alternative to the Rocchio algorithm: use document classification instead of Vector Space Model-based retrieval, with the estimates

$$P(x_t = 1 \mid R = 1) = \frac{|VR_t|}{|VR|} \qquad P(x_t = 1 \mid R = 0) = \frac{n_t - |VR_t|}{N - |VR|}$$

where:
- $N$ is the total number of documents
- $n_t$ is the number of documents containing term $t$
- $VR$ is the set of known relevant documents
- $VR_t$ is the set of known relevant documents containing $t$

Problem: this approach keeps no memory of the original query.
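A minimal Python sketch working through the exercise above. Assumptions not fixed by the slide: tf-idf is raw term frequency times log10(N/df), the query vector is weighted like a document, and the default Rocchio weights α = 1, β = 0.75, γ = 0.15 from the remarks are used.

```python
import math

# Vocabulary and collection from the exercise (one document per line).
vocab = ["movie", "trailer", "good"]
docs = [
    "good movie trailer shown",
    "trailer with good actor",
    "unseen movie",
]
N = len(docs)

def tf_idf_vector(text):
    """tf-idf weights over the vocabulary, without length normalisation."""
    tokens = text.split()
    vec = []
    for term in vocab:
        tf = tokens.count(term)
        df = sum(1 for d in docs if term in d.split())
        vec.append(tf * math.log10(N / df))
    return vec

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-revised query; negative components are clipped to 0."""
    qm = []
    for i in range(len(q0)):
        cr = sum(d[i] for d in relevant) / len(relevant)
        cnr = sum(d[i] for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        qm.append(max(0.0, alpha * q0[i] + beta * cr - gamma * cnr))
    return qm

d = [tf_idf_vector(t) for t in docs]
q0 = tf_idf_vector("movie trailer")
qm = rocchio(q0, relevant=d[:2], nonrelevant=d[2:])
print({t: round(w, 3) for t, w in zip(vocab, qm)})
```

Under these assumptions the revised query comes out to roughly 0.216 for movie, 0.308 for trailer and 0.132 for good: the judged-relevant documents pull weight onto good, while the non-relevant document slightly demotes movie.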

When to use Relevance Feedback

Relevance feedback does not work when:
- the query is misspelled
- we want cross-language retrieval
- the vocabulary is ambiguous
- the users do not have sufficient initial knowledge
- the query concerns an instance of a general concept (e.g. felines)
- the documents are gathered into subsets, each using a different vocabulary
- the query has disjunctive answer sets (e.g. "the pop star that worked at KFC")
- there exist several prototypes of relevant documents

Practical problem: refining leads to longer queries, which take more time to process.

Few web IR systems use relevance feedback:
- it is hard to explain to users
- users are mainly interested in fast retrieval (i.e. no iterations)
- users are usually not interested in high recall

Nowadays:
- clickstream-based feedback (which links are clicked on by users)
- implicit feedback from the writer rather than feedback from the reader

Evaluating relevance feedback

Note that the improvements brought by relevance feedback decrease with the number of iterations; usually one round already gives good results.

Several evaluation strategies:

(a) comparative evaluation
- compare the precision/recall graph of the original query q_0 with that of the modified query q_m
- usually around +50% in mean average precision
- this partly comes from the fact that the known relevant documents are now ranked higher

(b) residual collection
- the same technique as above, but computed on the set of retrieved documents minus the set of assessed relevant documents
- the performance measure drops (a code sketch follows after this list)

(c) using two similar collections
- collection #1 is used for querying and giving relevance feedback
- collection #2 is used for comparative evaluation: q_0 and q_m are compared on collection #2
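A minimal sketch of the residual-collection measure from strategy (b), assuming documents are plain hashable ids and relevance judgments are sets; the function names and the toy example are illustrative, not from the lecture.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks where relevant docs appear."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def residual_average_precision(ranking, relevant, assessed_relevant):
    """Residual-collection evaluation: remove the documents already assessed
    as relevant during the feedback round, then score what remains."""
    residual = [doc for doc in ranking if doc not in assessed_relevant]
    return average_precision(residual, relevant - assessed_relevant)

# Toy example: documents 1 and 2 were judged relevant during feedback.
ranking, relevant, assessed = [1, 2, 3, 4, 5], {1, 2, 4}, {1, 2}
print(average_precision(ranking, relevant))                     # 0.917
print(residual_average_precision(ranking, relevant, assessed))  # 0.5
```

Because the easy, already-assessed documents are excluded, the residual score is lower, which matches the remark that the performance measure drops.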

(d) user studies
- e.g. time-based comparison of retrieval, user satisfaction, etc.
- user utility is a fair evaluation, as it corresponds to real system usage

Pseudo Relevance Feedback

Also known as blind relevance feedback. It needs no extended interaction between the user and the system.

Method (a code sketch follows at the end of this section):
- run normal retrieval to find an initial set of most relevant documents
- assume that the top k documents are relevant
- apply relevance feedback accordingly

It works well on the TREC Ad Hoc task: with lnc.ltc weighting, precision at k = 50 rises from 62.5% without relevance feedback to 72.7% with it.

Problem: the distribution of the documents may influence the results.

Indirect Relevance Feedback

Uses indirect evidence rather than explicit feedback, for example the number of clicks on a given retrieved document. It is not user-specific, and it is more suitable for web IR since it requires no extra action from the user.
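A sketch of the pseudo relevance feedback loop, reusing the rocchio function (and the math import) from the Rocchio sketch above; the cosine helper and the choice of k are illustrative assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pseudo_relevance_feedback(q0, doc_vectors, k=5, rounds=1):
    """Blind feedback: rank, assume the top k documents are relevant,
    and revise the query with Rocchio (no non-relevant set is used)."""
    q = q0
    for _ in range(rounds):
        ranking = sorted(doc_vectors, key=lambda d: cosine(q, d), reverse=True)
        q = rocchio(q, relevant=ranking[:k], nonrelevant=[])
    return q
```

Note the assumption built into the method: if the top k documents happen to be off-topic, the revised query drifts away from the information need, which is exactly the document-distribution problem mentioned above.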

Vocabulary tools for query reformulation

Tools displaying:
- a list of close terms belonging to the dictionary
- information about the query words that were omitted (cf. the stop list)
- the results of stemming
- a debugging environment

Query logs and thesauri

Users select among query suggestions that are built either from query logs or from a thesaurus. Replacement words are extracted from the thesaurus according to their proximity to the initial query word.

A thesaurus can be developed:
- manually (e.g. in biomedicine)
- automatically (see below)

NB: query expansion (i) increases recall, and (ii) may need the user's relevance judgments on query terms rather than on documents.

Automatic thesaurus generation

The collection is analyzed to build the thesaurus automatically:
1. using word co-occurrences (co-occurring words are more likely to belong to the same query field); this may produce false positives (example: apple)
2. using a shallow grammatical analysis to find relations between words (example: cooked, eaten, digested → food)

Note that co-occurrence-based thesauri are more robust, while grammatical-analysis-based thesauri are more accurate. A sketch of the co-occurrence approach follows; the next slide gives its matrix formulation.
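A minimal sketch of the co-occurrence approach using numpy, with a tiny made-up term-document matrix; it computes the term-term similarity matrix C = A·A^T defined on the next slide and reads the nearest neighbour of each term off it.

```python
import numpy as np

# Toy term-document matrix A with A[t, d] = weight of term t in document d
# (e.g. normalised tf-idf); the terms and weights here are invented.
terms = ["satellite", "space", "launch", "rocket", "movie"]
A = np.array([
    [0.8, 0.7, 0.0, 0.6, 0.0],  # satellite
    [0.6, 0.8, 0.1, 0.5, 0.0],  # space
    [0.5, 0.0, 0.2, 0.7, 0.0],  # launch
    [0.4, 0.1, 0.0, 0.8, 0.0],  # rocket
    [0.0, 0.0, 0.9, 0.0, 0.9],  # movie
])

# C[i, j] = similarity between terms i and j (dot product over documents).
C = A @ A.T

# Nearest neighbour of each term in the automatically built thesaurus.
for i, term in enumerate(terms):
    sims = C[i].copy()
    sims[i] = -np.inf  # ignore self-similarity
    print(term, "->", terms[int(np.argmax(sims))])
```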

Building a co-occurrence-based thesaurus

We build a term-document matrix $A$ where $A[t, d] = w_{t,d}$ (e.g. normalized tf-idf), with $m$ terms and $n$ documents. We then calculate the $m \times m$ term-term matrix $C = A A^T$:

$$C = \begin{pmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & \ddots & \vdots \\ c_{m1} & \cdots & c_{mm} \end{pmatrix}$$

where $c_{ij}$ is the similarity score between terms $i$ and $j$.

Automatically built thesaurus: the slide shows an example of such an automatically generated thesaurus (table not reproduced in the transcript).

Conclusion

Query expansion can use local methods:
- the Rocchio algorithm for relevance feedback
- pseudo relevance feedback
- indirect relevance feedback

or global ones:
- query logs
- thesauri

Thesaurus-based query expansion increases recall but may decrease precision (cf. ambiguous terms), and thesaurus development and maintenance are costly. Thesaurus-based query expansion is less effective than Rocchio relevance feedback, but may be as good as pseudo relevance feedback.

References

C. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval. http://nlp.stanford.edu/ir-book/pdf/chapter09-queryexpansion.pdf

Chris Buckley, Gerard Salton and James Allan, The effect of adding relevance information in a relevance feedback environment (1994). http://citeseer.ist.psu.edu/buckley94effect.html

Ian Ruthven and Mounia Lalmas, A survey on the use of relevance feedback for information access systems (2003). http://personal.cis.strath.ac.uk/~ir/papers/ker.pdf