Introduction to Information Retrieval
http://informationretrieval.org

IIR 6&7: Vector Space Model
Hinrich Schütze
Institute for Natural Language Processing, University of Stuttgart
2011-08-29
Schütze: Vector space model 1 / 37

Models and Methods
1. Boolean model and its limitations
2. Vector space model
3. Probabilistic models
4. Language model-based retrieval
5. Latent semantic indexing
6. Learning to rank

Take-away
- tf-idf weighting: quick review of tf-idf weighting.
- Vector space model: represents queries and documents in a high-dimensional space.
- Pivot normalization (or "pivoted document length normalization"): an alternative to cosine normalization that removes a bias inherent in standard length normalization.

Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

Binary incidence matrix

            Anthony and  Julius  The      Hamlet  Othello  Macbeth  ...
            Cleopatra    Caesar  Tempest
Anthony          1          1       0        0       0        1
Brutus           1          1       0        1       0        0
Caesar           1          1       0        1       1        1
Calpurnia        0          1       0        0       0        0
Cleopatra        1          0       0        0       0        0
mercy            1          0       1        1       1        1
worser           1          0       1        1       1        0
...

Each document is represented as a binary vector in {0,1}^|V|.

Count matrix

            Anthony and  Julius  The      Hamlet  Othello  Macbeth  ...
            Cleopatra    Caesar  Tempest
Anthony        157         73      0        0       0        1
Brutus           4        157      0        2       0        0
Caesar         232        227      0        2       1        0
Calpurnia        0         10      0        0       0        0
Cleopatra       57          0      0        0       0        0
mercy            2          0      3        8       5        8
worser           2          0      1        1       1        5
...

Each document is now represented as a count vector in N^|V|.

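A count matrix like the one above can be built directly from tokenized text. A minimal sketch on a toy two-document corpus (the helper name `count_matrix` and the corpus are illustrative, not from the slides):

```python
from collections import Counter

def count_matrix(docs):
    """Build a term-document count matrix as {term: [count per document]}."""
    counts = [Counter(d.split()) for d in docs]
    vocab = sorted(set(t for c in counts for t in c))
    # Counter returns 0 for absent terms, giving the zero entries of the matrix.
    return {t: [c[t] for c in counts] for t in vocab}

docs = ["caesar brutus caesar", "brutus mercy"]
m = count_matrix(docs)
```

Each row of `m` is one term's count vector across the collection, e.g. `m["caesar"]` is `[2, 0]`.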
Term frequency tf
- The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
- We want to rank documents according to query-document matching scores and use tf as a component in these matching scores.
- But how? Raw term frequency is not what we want, because:
  - a document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term...
  - ...but not 10 times more relevant;
  - relevance does not increase proportionally with term frequency.

Instead of raw frequency: log frequency weighting

The log frequency weight of term t in d is defined as:

    w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
    w_t,d = 0                   otherwise

tf_t,d → w_t,d: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Matching score for a document-query pair, summed over the terms t occurring in both q and d:

    tf-matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)
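The log frequency weight and the resulting matching score can be sketched in a few lines (function names are illustrative):

```python
import math

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def tf_matching_score(query_terms, doc_tf):
    """Sum log-tf weights over terms occurring in both query and document."""
    return sum(log_tf(doc_tf[t]) for t in query_terms if doc_tf.get(t, 0) > 0)
```

This reproduces the mapping on the slide: `log_tf(0)` is 0, `log_tf(1)` is 1, `log_tf(2)` is about 1.3, `log_tf(10)` is 2, `log_tf(1000)` is 4.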

Frequency in document vs. frequency in collection

In addition to term frequency (the frequency of the term in the document), we also want to use the frequency of the term in the collection for weighting and ranking.

idf weight
- df_t is the document frequency: the number of documents that t occurs in.
- df_t is an inverse measure of the informativeness of term t.
- Inverse document frequency, idf_t, is a direct measure of the informativeness of the term.
- The idf weight of term t is defined as:

    idf_t = log10(N / df_t)

  (N is the number of documents in the collection.)
- We use log10(N/df_t) instead of N/df_t to dampen the effect of idf.

Examples for idf

idf_t = log10(1,000,000 / df_t)

term            df_t    idf_t
calpurnia           1     6
animal            100     4
sunday          1,000     3
fly            10,000     2
under         100,000     1
the         1,000,000     0
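The table above can be reproduced with a one-line function, assuming (as on the slide) a collection of N = 1,000,000 documents and base-10 logs:

```python
import math

def idf(df, N=1_000_000):
    """Inverse document frequency: idf_t = log10(N / df_t)."""
    return math.log10(N / df)
```

For example, `idf(1)` gives 6 and `idf(1_000_000)` gives 0, matching the table.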

Effect of idf on ranking
- idf gives high weights to rare terms like "arachnocentric".
- idf gives low weights to frequent words like "good", "increase", and "line".
- idf affects the ranking of documents for queries with at least two terms. For example, in the query "arachnocentric line", idf weighting increases the relative weight of "arachnocentric" and decreases the relative weight of "line".
- idf has little effect on ranking for one-term queries.

Summary: tf-idf weighting

Assign a tf-idf weight to each term t in each document d:

    w_t,d = (1 + log10 tf_t,d) · log10(N / df_t)

The tf-idf weight...
- ...increases with the number of occurrences within a document (term frequency component);
- ...increases with the rarity of the term in the collection (inverse document frequency component).
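Putting the two components together, a sketch of the tf-idf weight as defined above:

```python
import math

def tfidf(tf, df, N):
    """tf-idf weight: (1 + log10(tf)) * log10(N / df); zero if the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)
```

For instance, with N = 1000 and df = 10, a single occurrence gets weight 2.0 and ten occurrences get weight 4.0: the weight grows with tf, but only logarithmically.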

Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

(Recap: the binary incidence matrix and the count matrix shown above.)

Binary → count → weight matrix

            Anthony and  Julius  The      Hamlet  Othello  Macbeth  ...
            Cleopatra    Caesar  Tempest
Anthony       5.25       3.18    0.0      0.0     0.0      0.35
Brutus        1.21       6.10    0.0      1.0     0.0      0.0
Caesar        8.59       2.54    0.0      1.51    0.25     0.0
Calpurnia     0.0        1.54    0.0      0.0     0.0      0.0
Cleopatra     2.85       0.0     0.0      0.0     0.0      0.0
mercy         1.51       0.0     1.90     0.12    5.25     0.88
worser        1.37       0.0     0.11     4.15    0.25     1.95
...

Each document is now represented as a real-valued vector of tf-idf weights in R^|V|.

Documents as vectors
- Each document is now represented as a real-valued vector of tf-idf weights in R^|V|.
- So we have a |V|-dimensional real-valued vector space.
- Terms are axes of the space; documents are points or vectors in this space.
- Very high-dimensional: tens of millions of dimensions when you apply this to web search engines.
- Each vector is very sparse: most entries are zero.

Queries as vectors
- Key idea 1: do the same for queries: represent them as vectors in the same high-dimensional space.
- Key idea 2: rank documents according to their proximity to the query.
- proximity = similarity; proximity ≈ negative distance.
- Recall: we're doing this because we want to get away from the you're-either-in-or-out, feast-or-famine Boolean model.
- Instead: rank relevant documents higher than nonrelevant documents.

How do we formalize vector space similarity?
- First cut: (negative) distance between two points (that is, the distance between the end points of the two vectors).
- Euclidean distance? Euclidean distance is a bad idea, because it is large for vectors of different lengths.

Why distance is a bad idea

[Figure: the query q = "rich poor" plotted in the poor/rich plane together with d1 ("Ranks of starving poets swell"), d2 ("Rich poor gap grows"), and d3 ("Record baseball salaries in 2010").]

The Euclidean distance of q and d2 is large, although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance
- Rank documents according to their angle with the query.
- The following two notions are equivalent:
  - rank documents according to the angle between query and document, in increasing order;
  - rank documents according to cosine(query, document), in decreasing order.
- Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°].
- So: do the ranking according to cosine.

Cosine similarity between query and document

    cos(q, d) = sim(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )

- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- |q| and |d| are the lengths of q and d.
- This is the cosine similarity of q and d... or, equivalently, the cosine of the angle between q and d.
- Cosine similarity = dot product of length-normalized vectors.
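A sketch of cosine similarity over sparse term-weight vectors (dicts of term → weight are an illustrative representation). Note that proportional vectors score 1.0, which is exactly the length-invariance that Euclidean distance lacks:

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))  # |q|
    nd = math.sqrt(sum(w * w for w in d.values()))  # |d|
    return dot / (nq * nd) if nq and nd else 0.0
```

A short document and a long document with the same term distribution get the same similarity to the query: doubling every weight of d leaves cosine(q, d) unchanged.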

Cosine similarity illustrated

[Figure: the length-normalized vectors v(d1), v(q), v(d2), and v(d3) in the poor/rich plane; θ is the angle between v(q) and v(d2).]

Components of tf-idf weighting

Term frequency:
  n (natural)     tf_t,d
  l (logarithm)   1 + log(tf_t,d)
  a (augmented)   0.5 + 0.5 · tf_t,d / max_t(tf_t,d)
  b (boolean)     1 if tf_t,d > 0, 0 otherwise
  L (log ave)     (1 + log(tf_t,d)) / (1 + log(ave_{t∈d}(tf_t,d)))

Document frequency:
  n (no)          1
  t (idf)         log(N / df_t)
  p (prob idf)    max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / sqrt(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1/CharLength^α, α < 1

The best known combination of weighting options picks one letter from each column (e.g. lnc, ltn).

tf-idf example
- We often use different weightings for queries and documents.
- Notation: ddd.qqq
- Example: lnc.ltn
  - document: logarithmic tf, no df weighting, cosine normalization
  - query: logarithmic tf, idf, no normalization
- Isn't it bad not to idf-weight the document?
- Example query: "best car insurance"
- Example document: "car insurance auto insurance"

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 best 1 1 50000 1.3 1.3 0 0 car 1 1 10000 2.0 2.0 1 1 insurance 1 1 1000 3.0 3.0 2 1.3 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight Schütze: Vector space model 28 / 37

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 best 1 1 50000 1.3 1.3 0 0 car 1 1 10000 2.0 2.0 1 1 insurance 1 1 1000 3.0 3.0 2 1.3 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight Schütze: Vector space model 28 / 37

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 1 best 1 1 50000 1.3 1.3 0 0 0 car 1 1 10000 2.0 2.0 1 1 1 insurance 1 1 1000 3.0 3.0 2 1.3 1.3 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight Schütze: Vector space model 28 / 37

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 1 0.52 best 1 1 50000 1.3 1.3 0 0 0 0 car 1 1 10000 2.0 2.0 1 1 1 0.52 insurance 1 1 1000 3.0 3.0 2 1.3 1.3 0.68 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight 1 2 + 0 2 + 1 2 + 1.3 2 1.92 1/1.92 0.52 1.3/1.92 0.68 Schütze: Vector space model 28 / 37

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 1 0.52 0 best 1 1 50000 1.3 1.3 0 0 0 0 0 car 1 1 10000 2.0 2.0 1 1 1 0.52 1.04 insurance 1 1 1000 3.0 3.0 2 1.3 1.3 0.68 2.04 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight Schütze: Vector space model 28 / 37

tf-idf example: lnc.ltn Query: best car insurance. Document: car insurance auto insurance. word query document product tf-raw tf-wght df idf weight tf-raw tf-wght weight n lized auto 0 0 5000 2.3 0 1 1 1 0.52 0 best 1 1 50000 1.3 1.3 0 0 0 0 0 car 1 1 10000 2.0 2.0 1 1 1 0.52 1.04 insurance 1 1 1000 3.0 3.0 2 1.3 1.3 0.68 2.04 Key to columns: tf-raw: raw (unweighted) term frequency, tf-wght: logarithmically weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n lized: document weights after cosine normalization, product: the product of final query weight and final document weight Final similarity score between query and document: i w qi w di = 0 + 0 + 1.04 + 2.04 = 3.08 Schütze: Vector space model 28 / 37
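The slide's arithmetic can be reproduced in a few lines. This is a sketch, not part of the original deck; it assumes a collection size of N = 1,000,000, which is the value implied by the idf figures on the slide (e.g. idf(insurance) = log10(N/1000) = 3.0).

```python
import math

N = 1_000_000  # assumed collection size, implied by the slide's idf values

df       = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"auto": 0, "best": 1, "car": 1, "insurance": 1}
doc_tf   = {"auto": 1, "best": 0, "car": 1, "insurance": 2}

def log_tf(tf):
    # "l" in SMART notation: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query uses ltn: log tf, times idf, no normalization
q_weight = {t: log_tf(query_tf[t]) * math.log10(N / df[t]) for t in df}

# Document uses lnc: log tf, no idf, cosine normalization
d_raw = {t: log_tf(doc_tf[t]) for t in df}
norm = math.sqrt(sum(w * w for w in d_raw.values()))   # sqrt(1 + 0 + 1 + 1.3^2) ≈ 1.92
d_weight = {t: w / norm for t, w in d_raw.items()}

score = sum(q_weight[t] * d_weight[t] for t in df)
print(round(score, 2))  # ≈ 3.07; the slide's 3.08 comes from rounding intermediate weights
```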

Outline
1. tf-idf weighting
2. Vector space model
3. Pivot length normalization

A problem for cosine normalization

Query q: anti-doping rules Beijing 2008 olympics
Compare three documents:
d1: a short document on anti-doping rules at the 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics
What ranking do we expect in the vector space model?
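A small, purely illustrative computation (the term counts below are invented, not from the slides) makes the problem concrete: d2 contains a verbatim copy of d1, so it matches every query term d1 matches, yet cosine normalization can push it below even d3.

```python
import math

def cosine_score(q_tf, d_tf):
    # Dot product of raw query term counts with the cosine-normalized doc vector
    norm = math.sqrt(sum(tf * tf for tf in d_tf.values()))
    return sum(q_tf.get(t, 0) * tf / norm for t, tf in d_tf.items())

q  = {"anti-doping": 1, "rules": 1, "beijing": 1, "2008": 1, "olympics": 1}
d1 = {"anti-doping": 2, "rules": 2, "2008": 1, "olympics": 2}         # short, on-topic
d2 = {**d1, **{f"offtopic{i}": 3 for i in range(40)}}                 # copy of d1 + unrelated stories
d3 = {"anti-doping": 2, "rules": 2, "2004": 1, "athens": 1, "olympics": 2}

# d2 matches every query term d1 does, yet its long tail of off-topic
# terms inflates its norm and drops it below d3 in the ranking.
print(cosine_score(q, d1) > cosine_score(q, d3) > cosine_score(q, d2))  # True
```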

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by a linear adjustment: "turning" the average normalization on the pivot.
Effect: similarities of short documents with the query decrease; similarities of long documents with the query increase.
This removes the unfair advantage that short documents have.
Singhal's study is also interesting from the point of view of methodology.
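The linear adjustment can be sketched as follows. This is not from the slides; the document weights and the slope value are illustrative assumptions (in Singhal et al.'s pivoted cosine normalization, the pivot is the collection's average cosine norm and the slope is tuned empirically, with slope < 1).

```python
import math

def cosine_norm(d):
    return math.sqrt(sum(w * w for w in d.values()))

def pivoted_norm(d, pivot, slope):
    # Linear adjustment of the cosine norm: the normalization line is
    # "turned" around the pivot. With slope < 1, short documents get a
    # larger divisor (lower similarity) and long documents a smaller one.
    return (1 - slope) * pivot + slope * cosine_norm(d)

docs = [
    {"car": 1.0, "insurance": 1.3},                              # short document
    {"car": 3.0, "insurance": 2.5, "auto": 2.0, "best": 1.5},    # long document
]
pivot = sum(cosine_norm(d) for d in docs) / len(docs)  # average cosine norm
slope = 0.3  # tunable; chosen here only for illustration

for d in docs:
    print(round(cosine_norm(d), 2), "->", round(pivoted_norm(d, pivot, slope), 2))
```

Dividing term weights by the pivoted norm instead of the cosine norm produces exactly the effect described above: the short document's divisor grows, the long document's shrinks.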

Predicted and true probability of relevance
source: Lillian Lee

Pivot normalization
source: Lillian Lee

Pivoted normalization: Amit Singhal's experiments
(relevant documents retrieved and (change in) average precision)

Summary: Ranked retrieval in the vector space model

Represent each document as a weighted tf-idf vector.
Represent the query as a weighted tf-idf vector.
Compute the cosine similarity between the query vector and each document vector.
Alternatively, use pivot normalization.
Rank documents with respect to the query.
Return the top K (e.g., K = 10) to the user.
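The steps above can be condensed into a minimal end-to-end sketch. This is illustrative, not from the slides; for simplicity it applies log-tf · idf weighting with cosine normalization to both query and documents, rather than the lnc.ltn scheme of the earlier worked example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    # log tf * idf, then cosine normalization
    vec = {t: (1 + math.log10(tf)) * math.log10(N / df[t])
           for t, tf in Counter(tokens).items() if t in df}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def rank(query_tokens, docs, K=10):
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))       # document frequencies
    doc_vecs = [tfidf_vector(d, df, N) for d in docs]
    q = tfidf_vector(query_tokens, df, N)
    # Cosine similarity reduces to a dot product of the normalized vectors
    scores = [(sum(q[t] * v.get(t, 0) for t in q), i)
              for i, v in enumerate(doc_vecs)]
    return sorted(scores, reverse=True)[:K]             # top K (score, doc id)

docs = [
    "car insurance auto insurance".split(),
    "best car rental deals".split(),
    "olympics anti-doping rules".split(),
]
print(rank("best car insurance".split(), docs))  # highest-scoring documents first
```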

Take-away

tf-idf weighting: quick review of tf-idf weighting
Vector space model: represents queries and documents in a high-dimensional space
Pivot normalization (or "pivoted document length normalization"): an alternative to cosine normalization that removes a bias inherent in standard length normalization

Resources

Chapters 6 and 7 of Introduction to Information Retrieval
Resources at http://informationretrieval.org/essir2011
Gerard Salton (main proponent of the vector space model in the 70s, 80s, and 90s)
Exploring the similarity space (Moffat and Zobel, 2005)
Pivot normalization (original paper)