Text Analytics. Models for Information Retrieval 1. Ulf Leser


Content of this Lecture
- IR Models
- Boolean Model
- Vector Space Model
- Relevance Feedback in the VSM
- Probabilistic Model
- Latent Semantic Indexing
- Other IR Models

Ulf Leser: Text Analytics, Winter Semester 2012/2013

Information Retrieval Core
- The core question in IR: Which of a given set of (normalized) documents is relevant for a given query?
- Ranking: How relevant is each document of a given set for a given query?
- [Diagram: the query and the document base are both normalized, then matched and relevance-scored to produce the result]

How can Relevance be Judged?
Taxonomy of IR models [BYRN99], organized by user task:
- Retrieval (ad hoc, filtering)
  - Classic models: Boolean, Vector Space, Probabilistic
  - Set-theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  - Probabilistic: Inference Network, Belief Network
  - Structured models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext

Notation
All of the models we discuss use the bag-of-words view.
Definition:
- Let K be the set of all terms; k ∈ K is a term
- Let D be the set of all documents; d ∈ D is a document
- Let w be the function that maps a document to its set of distinct terms in an arbitrary, yet fixed order (bag of words), after pre-processing: stemming, stop-word removal, etc.
- Let v_d be a vector for d (or a query q) with
  - v_d[i] = 0 if the i-th term k_i is not contained in w(d)
  - v_d[i] = 1 if the i-th term k_i is contained in w(d)
- Often, we shall use weights instead of 0/1 memberships
  - Let w_ij ≥ 0 be the weight of term k_i in document d_j (w_ij = v_j[i]); w_ij = 0 if k_i ∉ d_j


Boolean Model
- Simple model based on set theory
- Queries are specified as Boolean expressions
  - Terms are atoms
  - Terms are connected by AND, OR, NOT (XOR, ...)
- No weights; all terms contribute equally to relevance
- Relevance of a document is either 0 or 1
- Evaluation:
  - Compute all atoms (keywords) of the query (q = <k_1, k_2, ...>)
  - An atom k_i evaluates to true on d iff v_d[i] = 1
  - Compute the values of all atoms for each d
  - Compute the value of q as a logical expression over these values
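This evaluation scheme can be sketched in a few lines of Python. The document representation (term sets), the toy documents, and the query encoding as a predicate are illustrative choices, not part of the slides:

```python
# Minimal sketch of the Boolean model: each document is reduced to its set
# of (normalized) terms, and a query is a Boolean function over term atoms.

def matches(doc_terms, query):
    """Evaluate a Boolean query, given as a predicate over a term set."""
    return query(doc_terms)

docs = {
    "d1": {"verkauf", "haus", "italien"},
    "d2": {"haus", "gart", "miet"},
}

# Query: haus AND (gart OR italien) AND NOT verkauf
q = lambda t: "haus" in t and ("gart" in t or "italien" in t) and "verkauf" not in t

hits = [d for d, terms in docs.items() if matches(terms, q)]  # only d2 matches
```

Note that the relevance of each document is indeed 0 or 1: a document either satisfies the expression or it does not; no ranking is possible.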

Properties
- Simple, clear semantics; widely used in (early) systems
- Disadvantages:
  - No partial matching: for the query k_1 AND k_2 AND ... AND k_9, a doc containing k_1, ..., k_8 is as irrelevant as one containing none of the terms
  - No ranking
  - Query terms cannot be weighted
  - Average users don't like (understand) Boolean expressions
- Results are often unsatisfactory: too many documents (too few restrictions) or too few documents (too many restrictions)
- Several extensions exist, but they are generally not satisfactory

A Note on Implementation
- An implementation searches for the k_i in D; it does not iterate over D
- Searching is supported by an index: assume a fast operation find: K → P(D)
- Search each k_i of the query, resulting in a set D_i ⊆ D
- Evaluate the query in the given order using set operations on the D_i:
  - k_i AND k_j: D_i ∩ D_j
  - k_i OR k_j: D_i ∪ D_j
  - NOT k_i: D \ D_i
  - Parentheses are treated in the usual way
- Improvements: cost-based evaluation
  - Evaluate sub-expressions first that (hopefully) yield small results
  - Smaller intermediate results, faster intersections, ...
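The index-based scheme can be sketched as follows; `build_index` and the toy collection are hypothetical names for illustration:

```python
from collections import defaultdict

# Sketch of index-based Boolean evaluation: build an inverted index
# (term -> set of doc ids), then evaluate AND/OR/NOT via set operations.

def build_index(docs):
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for k in terms:
            index[k].add(doc_id)
    return index

docs = {
    "d1": {"verkauf", "haus", "italien"},
    "d2": {"haus", "gart", "miet"},
    "d3": {"haus", "italien"},
}
idx = build_index(docs)

# haus AND italien  ->  D_haus ∩ D_italien
conj = idx["haus"] & idx["italien"]      # {"d1", "d3"}

# haus AND NOT verkauf, evaluated as D_haus \ D_verkauf
# (avoids materializing the very large complement D \ D_verkauf)
neg = idx["haus"] - idx["verkauf"]       # {"d2", "d3"}
```

The second expression anticipates the negation slide below: evaluating `NOT k_j` on its own would produce almost all of D, while the set difference stays small.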

Negation in the Boolean Model
- Evaluating NOT k_i is expensive
  - The result is often very, very large (|D \ D_i| ≈ |D|) if k_i is not a stop word (which we removed anyway)
  - Most other terms appear in almost no documents; recall Zipf's law: the tail of the distribution
- Solution 1: Disallow negation
  - This is what many web search engines do
- Solution 2: Allow it only in the form k_i AND NOT k_j
  - Cannot use the previous implementation scheme (D_{NOT k_j} would be very large)
  - Better: D := D_i \ D_j


Vector Space Model
- Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11): 613-620.
- A breakthrough in IR; still the most popular model today
- General idea:
  - Fix a vocabulary K
  - View each doc and query as a point in a |K|-dimensional space
  - Rank docs according to their distance from the query in that space
- Main advantages:
  - Ability to rank docs (according to distance)
  - Naturally supports partial matching

Vector Space
- [Figure: documents as points in a space with dimensions such as Star, Astronomy, Movie stars, Mammals, Diet; source: http://www.vilceanu.ne]
- Each term is one dimension
- Different suggestions exist for determining the co-ordinates, i.e., the weights
- The closest docs are the most relevant ones
  - Rationale: vectors correspond to themes, which are loosely related to sets of terms
  - Distance between vectors ~ distance between themes
- Different suggestions exist for defining distance

Computing the Angle between Two Vectors
Recall: the scalar product between two vectors v and w of equal dimension n is defined as

  v · w = |v| * |w| * cos(v, w),  with  v · w = Σ_{i=1..n} v_i * w_i  and  |v| = sqrt(Σ_{i=1..n} v_i²)

This gives us the angle:

  cos(v, w) = (v · w) / (|v| * |w|)

Distance as Angle
Distance = cosine of the angle between doc d and query q:

  sim(d, q) = cos(v_d, v_q) = (v_d · v_q) / (|v_d| * |v_q|)
            = Σ_i (v_d[i] * v_q[i]) / (sqrt(Σ_i v_d[i]²) * sqrt(Σ_i v_q[i]²))

- The factor 1/|v_q| is the same for every doc and can be dropped for ranking
- The factor 1/|v_d| performs length normalization
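A sketch of this ranking score in Python, with the constant query norm dropped as on the slide; the function name and the use of plain lists are illustrative. The vectors are two of the binary example vectors from the lecture's German toy collection:

```python
import math

# Sketch of the slides' ranking score:
# sim(d, q) = (v_q · v_d) / |v_d|   (query norm dropped for ranking)

def sim(v_q, v_d):
    dot = sum(q * d for q, d in zip(v_q, v_d))
    norm = math.sqrt(sum(d * d for d in v_d))
    return dot / norm if norm else 0.0

# Binary vectors over the terms (verkauf, haus, italien, gart, miet, blüh, woll)
q  = [0, 1, 1, 1, 1, 0, 1]   # "Wir wollen ein Haus mit Garten in Italien mieten"
d2 = [0, 1, 0, 1, 1, 0, 0]   # "Häuser mit Gärten zu vermieten"

score = sim(q, d2)   # 3 / sqrt(3) ≈ 1.73, matching the ranking example
```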

Example Data
Assume stop-word removal, stemming, and binary weights.

      Text                                               verkauf  haus  italien  gart  miet  blüh  woll
  d1  Wir verkaufen Häuser in Italien                       1      1      1
  d2  Häuser mit Gärten zu vermieten                               1              1     1
  d3  Häuser: In Italien, um Italien, um Italien herum             1      1
  d4  Die italienschen Gärtner sind im Garten                             1       1
  d5  Der Garten in unserem italienschen Haus blüht                1      1       1            1
  Q   Wir wollen ein Haus mit Garten in Italien mieten             1      1       1     1            1

Ranking
Using sim(d, q) = Σ_i (v_q[i] * v_d[i]) / sqrt(Σ_i v_d[i]²) on the binary vectors:
- sim(d1, q) = (1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1) / √3 ≈ 1.15
- sim(d2, q) = (1 + 1 + 1) / √3 ≈ 1.73
- sim(d3, q) = (1 + 1) / √2 ≈ 1.41
- sim(d4, q) = (1 + 1) / √2 ≈ 1.41
- sim(d5, q) = (1 + 1 + 1) / √4 = 1.5

Ranking for Q: "Wir wollen ein Haus mit Garten in Italien mieten"
1. d2: Häuser mit Gärten zu vermieten (1.73)
2. d5: Der Garten in unserem italienschen Haus blüht (1.5)
3. d4: Die italienschen Gärtner sind im Garten (1.41, tied)
3. d3: Häuser: In Italien, um Italien, um Italien herum (1.41, tied)
5. d1: Wir verkaufen Häuser in Italien (1.15)

Introducing Term Weights
Definition: Let D be a document collection, K the set of all terms in D, d ∈ D and k ∈ K.
- The term frequency tf_dk is the frequency of k in d
- The document frequency df_k is the number of docs in D containing k
  - Often defined differently as the number of occurrences of k in D; that variant should rather be called corpus frequency
  - Both definitions are valid and both are used
- The inverse document frequency is defined as idf_k = |D| / df_k
  - This should rather be called inverse relative document frequency
  - In practice, one usually uses idf_k = log(|D| / df_k)
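These definitions translate directly into code. The toy corpus and names below are illustrative; idf uses the log variant mentioned as the practical choice:

```python
import math
from collections import Counter

# Sketch of tf_dk, df_k and idf_k = log(|D| / df_k) over a toy corpus.

docs = {
    "d1": ["haus", "italien", "haus"],
    "d2": ["haus", "gart"],
    "d3": ["italien"],
}

# tf_dk: frequency of term k in document d
tf = {d: Counter(terms) for d, terms in docs.items()}

# df_k: number of docs containing k (note set(), so repeats count once)
df = Counter(k for terms in docs.values() for k in set(terms))

def idf(k):
    return math.log(len(docs) / df[k])

tf["d1"]["haus"]   # 2 occurrences in d1
df["haus"]         # appears in 2 of the 3 docs
```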

Ranking with TF Scoring
With term frequencies as weights (d3: italien = 3; d4: gart = 2; all other entries as before):
- sim(d1, q) = (1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1) / √3 ≈ 1.15
- sim(d2, q) = (1 + 1 + 1) / √3 ≈ 1.73
- sim(d3, q) = (1 + 3) / √10 ≈ 1.26
- sim(d4, q) = (1 + 2) / √5 ≈ 1.34
- sim(d5, q) = (1 + 1 + 1) / √4 = 1.5

Ranking for Q: "Wir wollen ein Haus mit Garten in Italien mieten"
1. d2: Häuser mit Gärten zu vermieten (1.73)
2. d5: Der Garten in unserem italienschen Haus blüht (1.5)
3. d4: Die italienschen Gärtner sind im Garten (1.34)
4. d3: Häuser: In Italien, um Italien, um Italien herum (1.26)
5. d1: Wir verkaufen Häuser in Italien (1.15)

Improved Scoring: TF*IDF
- One obvious problem: the longer a document, the more non-null entries and the higher the tf values, and thus the higher the chances of being ranked high
  - Solution: normalize by document length (this also yields 0 ≤ w_ij ≤ 1):

    tf'_dk = tf_dk / |d|,  with |d| = Σ_i tf_di

- Second problem: some terms appear everywhere in D, don't help to discriminate, and should be scored less
  - Solution: use IDF scores:

    w_ij = (tf_ij / |d_j|) * idf_i
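A sketch of this weighting, under the assumption that |d| is the token count of d and that the practical log variant of IDF is used; corpus and names are illustrative:

```python
import math
from collections import Counter

# Sketch of TF*IDF weighting:
# w = tf'_dk * idf_k  with  tf'_dk = tf_dk / |d|  and  idf_k = log(|D| / df_k)

docs = {
    "d1": ["haus", "italien", "haus"],
    "d2": ["haus", "gart"],
}
df = Counter(k for terms in docs.values() for k in set(terms))

def tf_idf(k, terms):
    tf_norm = Counter(terms)[k] / len(terms)   # length-normalized term frequency
    idf = math.log(len(docs) / df[k])          # log-idf
    return tf_norm * idf

w = tf_idf("italien", docs["d1"])   # (1/3) * log(2/1): rare term, moderate tf
```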

Example TF*IDF
With w_ij = (tf_ij / |d_j|) * idf_i and idf_i = |D| / df_i (|D| = 5; query with binary weights):

  IDF:  verkauf 5, haus 5/4, italien 5/4, gart 5/3, miet 5, blüh 5, woll – (df = 0)

  Normalized tf:
  d1: verkauf 1/3, haus 1/3, italien 1/3
  d2: haus 1/3, gart 1/3, miet 1/3
  d3: haus 1/4, italien 3/4
  d4: italien 1/3, gart 2/3
  d5: haus 1/4, italien 1/4, gart 1/4, blüh 1/4

Scores (normalizing by the length of the tf vector):
- sim(d1, q) = (5/4·1/3 + 5/4·1/3) / √(1/3) ≈ 1.44
- sim(d2, q) = (5/4·1/3 + 5/3·1/3 + 5·1/3) / √(1/3) ≈ 4.57
- sim(d3, q) = (5/4·1/4 + 5/4·3/4) / √0.625 ≈ 1.58
- sim(d4, q) = (5/4·1/3 + 5/3·2/3) / √0.556 ≈ 2.05
- sim(d5, q) = (5/4·1/4 + 5/4·1/4 + 5/3·1/4) / √0.25 ≈ 2.08

Ranking for Q: "Wir wollen ein Haus mit Garten in Italien mieten"
1. d2: Häuser mit Gärten zu vermieten (4.57)
2. d5: Der Garten in unserem italienschen Haus blüht (2.08)
3. d4: Die italienschen Gärtner sind im Garten (2.05)
4. d3: Häuser: In Italien, um Italien, um Italien herum (1.58)
5. d1: Wir verkaufen Häuser in Italien (1.44)

TF*IDF in Short
- Give terms in a doc d high weights which are frequent in d and infrequent in D
- IDF deals with the consequences of Zipf's law:
  - The few very frequent (and unspecific) terms get lower scores
  - The many infrequent (and specific) terms get higher scores

Further Improvements
- Score the query in the same way as the documents
  - Note: repeating words in a query then makes a difference
- Empirically estimated best scoring function [BYRN99]:
  - Different length normalization: tf''_ij = tf_ij / maxtf_j, where maxtf_j is the largest term frequency in d_j
  - Use the log of the IDF with a slightly different measure; rare terms no longer totally distort the score:

    w_ij = (tf_ij / maxtf_j) * log(|D| / df_i)
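This [BYRN99] variant can be sketched as follows; corpus and names are again illustrative:

```python
import math
from collections import Counter

# Sketch of the [BYRN99] weighting:
# w_ij = (tf_ij / maxtf_j) * log(|D| / df_i)

docs = {
    "d1": ["haus", "italien", "haus"],
    "d2": ["haus", "gart"],
}
n_docs = len(docs)
df = Counter(k for terms in docs.values() for k in set(terms))

def weight(k, terms):
    tf = Counter(terms)
    return (tf[k] / max(tf.values())) * math.log(n_docs / df[k])

w = weight("italien", docs["d1"])   # (1/2) * log(2/1): maxtf in d1 is 2 ("haus")
```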

Distance Measure
- Why not use Euclidean distance?
  - The length of the vectors would be much more important
  - Since queries usually are very short, very short documents would always win
- The cosine measure normalizes by the length of both vectors

Shortcomings
- We assume that all terms are independent, i.e., that their vectors are orthogonal
  - Clearly wrong: some terms are semantically closer than others
  - The appearance of "red" in a doc with "wine" doesn't mean much; but "wine" is an important match for "drink", while "red" is not
  - Extension: topic-based vector space model (LSI)
- No treatment of synonyms (query expansion, ...)
- No treatment of homonyms
  - Different senses = different dimensions
  - We would need to disambiguate words into their senses (later)
- Term-order independent
  - But order carries semantic meaning (object? subject?)

A First Note on Implementation
Assume we want to retrieve the top-r docs:
- Look up all terms k_i of the query in an index
- Build the union of all documents which contain at least one k_i
- Hold them in a list sorted by current score (initialized with 0)
- Walk through the terms k_i in order of decreasing IDF weights
  - Go through the docs in order of current score
  - For each document d_j: add w_ji * idf_i to its current score s_j
- Early stop: compare s at rank r and rank r+1: can the doc at rank r+1 still overtake the doc at rank r?
  - Can be estimated from the distribution of TF and IDF values
  - Early stop might produce the wrong order of the top-r docs, but cannot produce wrong docs. Why not?
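The accumulator-based evaluation above can be sketched as follows. This exhaustive variant omits the early-stop test; the postings layout, names, and weights are illustrative:

```python
import heapq
from collections import defaultdict

# Sketch of term-at-a-time top-r evaluation: walk query terms in order of
# decreasing IDF, accumulate per-document scores, then take the r best.

def top_r(query_terms, postings, idf, r):
    scores = defaultdict(float)
    for k in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        for doc_id, w in postings.get(k, []):   # (doc id, tf weight) pairs
            scores[doc_id] += w * idf[k]
    return heapq.nlargest(r, scores.items(), key=lambda item: item[1])

postings = {"haus": [("d1", 1.0), ("d2", 1.0)], "gart": [("d2", 1.0)]}
idf = {"haus": 0.4, "gart": 1.1}

best = top_r(["haus", "gart"], postings, idf, 1)   # best match: d2
```

An early-stopping version would check after each term whether the gap between the current rank-r and rank-(r+1) scores exceeds the maximum score mass still to come, which is why the membership of the top-r set is safe even if its internal order is not.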

A Different View
- Query evaluation actually searches for the top-r nearest neighbors (for some similarity measure)
- Can be achieved using multidimensional index structures: kd-trees, grid files, etc.
- Hope: no sequential scan of (all, many) docs, but a directed search according to distances
- Severe problem: high dimensionality

A Concrete (and Popular) VSM Model: Okapi BM25
- Okapi: the first system which used it (1980s)
- BM25: "Best Match", version 25 (roughly)
- Good results in several TREC evaluations

  sim(d, q) = Σ_{k ∈ q} IDF(k) * (tf_dk * (k1 + 1)) / (tf_dk + k1 * (1 − b + b * |d| / a))

  IDF(k) = log((|D| − df_k + 0.5) / (df_k + 0.5))

- k1, b are constants (often b = 0.75, k1 = 0.2)
- a is the average document length in D
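A sketch of this formula in Python. The default constants below are common choices rather than taken from the slide, and the per-document inputs (tf map, length) are an assumed representation:

```python
import math

# Sketch of Okapi BM25:
# sim(d,q) = sum over query terms k of
#   IDF(k) * tf_dk*(k1+1) / (tf_dk + k1*(1 - b + b*|d|/a))
# with IDF(k) = log((|D| - df_k + 0.5) / (df_k + 0.5)).

def bm25(query, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    score = 0.0
    for k in query:
        tf = doc_tf.get(k, 0)
        if tf == 0:
            continue  # terms absent from d contribute nothing
        idf = math.log((n_docs - df[k] + 0.5) / (df[k] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

# Toy call: one query term, tf 2, in a doc of average length
s = bm25(["haus"], {"haus": 2}, doc_len=4, avg_len=4, df={"haus": 1}, n_docs=10)
```

Note the saturation behavior: unlike raw TF*IDF, the contribution of a term grows sublinearly in tf_dk and is bounded by IDF(k) * (k1 + 1).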


Interactive IR
Recall: IR is a process, not a query.
Relevance feedback:
- User poses a query
- System answers
- User judges the relevance of the top-k returned docs
- System considers the feedback and generates new (improved) ranked answers
  - The user never needs to pose another query; the new query is generated by the system
- Loop until satisfaction
[Figure: user and IR system exchanging query and (ranked) results]

Relevance Feedback
Basic assumptions:
- Relevant docs are somewhat similar to each other: the common core should be emphasized
- Irrelevant docs are different from relevant docs: the differences should be deemphasized
The system may adapt to feedback by:
- Query expansion: add new terms to the query, taken from the relevant documents
  - More aggressive: add NOT with terms from irrelevant docs
- Term re-weighting: assign new weights to terms
  - Up-weight terms from the relevant documents, down-weight other terms

Formal Approach
Assume we know the set R of all relevant docs, and N = D \ R. How would a good query for R look?

  v_q = (1/|R|) * Σ_{d ∈ R} v_d  −  (1/|N|) * Σ_{d ∈ N} v_d

- Weight terms high that are in R
- Weight terms low that are not in R, i.e., that are in N
Remark:
- It is not clear whether this is the best of all queries
- Relevance feedback is a heuristic, especially when only a sample of R is known or the corpus changes

Rocchio Algorithm
- Usually we do not know the real R
- Best bet: let R (N) be the set of docs marked as relevant (irrelevant) by the user
- Still: do not forget the original query
- Rocchio: adapt the query vector after each feedback round:

  v_q_new = α * v_q + β * (1/|R|) * Σ_{d ∈ R} v_d − γ * (1/|N|) * Σ_{d ∈ N} v_d

- Implicitly performs query expansion and term re-weighting

Example
Let α = 0.5, β = 0.5, γ = 0, K = {information, science, retrieval, system}
- d1 = "information science" = (0.2, 0.8, 0, 0)
- d2 = "retrieval systems" = (0, 0, 0.8, 0.2)
- q = "retrieval of inf" = (0.6, 0, 0.2, 0)
- If d1 were marked relevant: q' = ½q + ½d1 = (0.4, 0.4, 0.1, 0)
- If d2 were marked relevant: q' = ½q + ½d2 = (0.3, 0, 0.5, 0.1)
[Figure: the vectors plotted in the information/retrieval plane; source: A. Nürnberger, IR]
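The Rocchio update can be sketched and checked against this example; function and variable names are mine:

```python
# Sketch of the Rocchio update:
# q_new = α·q + β·(1/|R|)·Σ_{d∈R} d  −  γ·(1/|N|)·Σ_{d∈N} d

def rocchio(q, relevant, nonrelevant, alpha=0.5, beta=0.5, gamma=0.0):
    dim = len(q)
    def centroid(vs):
        return [sum(v[i] for v in vs) / len(vs) if vs else 0.0 for i in range(dim)]
    r, n = centroid(relevant), centroid(nonrelevant)
    return [alpha * q[i] + beta * r[i] - gamma * n[i] for i in range(dim)]

# K = (information, science, retrieval, system)
q  = [0.6, 0.0, 0.2, 0.0]   # "retrieval of inf"
d1 = [0.2, 0.8, 0.0, 0.0]   # "information science"
d2 = [0.0, 0.0, 0.8, 0.2]   # "retrieval systems"

q_rel_d1 = rocchio(q, [d1], [])   # (0.4, 0.4, 0.1, 0) as on the slide
q_rel_d2 = rocchio(q, [d2], [])   # (0.3, 0,   0.5, 0.1)
```

Note how the update performs query expansion implicitly: the term "science" had weight 0 in q but is non-zero in q_rel_d1.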

Choices for N
How can we determine N?
- Naïve: N = D \ R — but then N is very large
- Let the user explicitly define it by negative feedback (or not)
  - More work for the user; might be a by-product of looking at promising, yet non-relevant docs
- Choose the docs presented for assessment and not rated relevant
  - Implicitly assumes that the user looked at all of them
Generally: a large N makes things slow
- The query after the first round has ~|K| dimensions with non-null values
- Computing the new score is very slow due to the third term

Variations
- How to choose α, β, γ?
  - Tuning with gold-standard sets is difficult
  - Educated guess
- Alternative treatment for N
  - Intuition: non-relevant docs are heterogeneous and tear in every direction; better to take only the worst one instead of all of them:

    v_q_new = α * v_q + β * (1/|R|) * Σ_{d ∈ R} v_d − γ * v_d',  where d' = argmin_{d ∈ N} sim(v_q, v_d)

  - Also faster due to much shorter query vectors

Effects of Relevance Feedback
Advantages:
- Improved results (many studies) compared to single queries
- Comfortable: users need not generate new queries themselves
- Iterative process converging to the best possible answer
- Especially helpful for increasing recall (due to query expansion)
Disadvantages:
- Still requires some work by the user
  - Excite: only 4% used relevance feedback (the "more of this" button)
- Writing a new query based on returned results might be faster (and easier and more successful) than classifying results
- Based on the assumption that relevant docs are similar
  - What if the user searches for all meanings of "jaguar"?
- The query is very long already after one iteration: slow retrieval

Collaborative Filtering
More inputs for improving IR performance:
- Collaborative filtering: give the user what other, yet similar users liked
  - "Customers who bought this book also bought ..."
- In IR: find users posing similar queries and look at what they did with the answers
  - In e-commerce: which products did they buy? (very reliable)
- In IR, we need to approximate:
  - Documents a user clicked on (if recordable)
  - Did the user look at the second page? (low credit for the first results)
  - Did the user pose a refinement query next?
- All these measures are not very reliable; we need many users

Thesaurus-based Query Expansion [M07, CS276]
- Expand the query with synonyms and hyponyms of each term
  - feline → feline cat
- One may weight added terms less than original query terms
- Often used in scientific IR systems (Medline)
- Requires a high-quality thesaurus
- General observation:
  - Increases recall
  - May significantly decrease precision
    - interest rate → interest rate fascinate evaluate
  - Do synonyms really exist?
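A sketch of this expansion with a toy thesaurus; the dictionary contents and the down-weighting factor are illustrative, and the "interest rate" entries deliberately reproduce the precision problem from the slide:

```python
# Sketch of thesaurus-based query expansion: add synonyms/hyponyms of each
# query term, weighting added terms less than the original ones.

THESAURUS = {
    "feline": ["cat"],
    "interest": ["fascinate"],   # wrong sense for "interest rate" ...
    "rate": ["evaluate"],        # ... illustrating the precision loss
}

def expand(query_weights, added_weight=0.5):
    expanded = dict(query_weights)
    for term, w in query_weights.items():
        for syn in THESAURUS.get(term, []):
            if syn not in expanded:            # never override original terms
                expanded[syn] = added_weight * w
    return expanded

expanded = expand({"interest": 1.0, "rate": 1.0})
# adds "fascinate" and "evaluate" at half weight: recall up, precision down
```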