Information Retrieval. Lecture 3: Evaluation methodology


Information Retrieval
Lecture 3: Evaluation methodology
Computer Science Tripos Part II
Simone Teufel
Natural Language and Information Processing (NLIP) Group
sht25@cl.cam.ac.uk

Today
1. General concepts in IR evaluation
2. The TREC competitions
3. IR evaluation metrics

Evaluation: difficulties
- An IR system: in: a query; out: relevant documents
- Evaluation of IR systems. Goal: predict future performance from past experience
- Reasons why IR evaluation is hard:
  - Large variation in human information needs and queries
  - The precise contributions of each component are hard to disentangle:
    - Collection coverage
    - Document indexing
    - Query formulation
    - Matching algorithm

Evaluation: the laboratory model
- Test only system parameters:
  - Index language devices for description and search
  - Methods of term choice for documents
  - Matching algorithm
  - Type of user interface
- Ignore environment variables:
  - Properties of documents: use many documents
  - Properties of users: use many queries

What counts as acceptable test data?
- In the 60s and 70s, very small test collections, arbitrarily different, one per project
  - in the 60s: 35 queries on 82 documents
  - in 1990: still only 35 queries on 2000 documents
  - test and training data were not always kept apart, as so many environment factors were tested
- TREC-3: 742,000 documents; TREC Web track: a small snapshot of the web
- Large test collections are needed:
  - to capture user variation
  - to support claims of statistical significance in results
  - to demonstrate that systems scale up
  - for commercial credibility
- Practical difficulties in obtaining data; non-balanced nature of the collection

Today's test collections
A test collection consists of:
- Document set: large, in order to reflect diversity of subject matter, literary style, and noise such as spelling errors
- Queries/Topics:
  - short description of an information need
  - TREC "topics": longer descriptions detailing the relevance criteria
  - frozen, reusable
- Relevance judgements:
  - binary
  - done by the same person who created the query

TREC
- Text REtrieval Conference
- Run by NIST (the US National Institute of Standards and Technology)
- Marks a new phase in retrieval evaluation:
  - common task and data set
  - many participants
  - continuity
- Large test collection: text, queries, relevance judgements
- Queries devised and judged by an information specialist (the same person)
- Relevance judgements done only for up to 1000 documents per query
- 2003 was the 12th year; 87 commercial and research groups participated in 2002

Sample TREC query

  <num> Number: 508
  <title> hair loss is a symptom of what diseases
  <desc> Description: Find diseases for which hair loss is a symptom.
  <narr> Narrative: A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, thinning hair and hair loss are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy.
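The topic fields use a simple SGML-like markup. As an illustration only (not part of the original lecture, and simplified with respect to the real topic files), a few lines of Python can pull the fields out of a topic such as the one above:

```python
import re

def parse_trec_topic(text):
    """Extract the num, title, desc and narr fields from one TREC topic.

    Simplification: each field is assumed to run until the next tag, and the
    'Number:'/'Description:'/'Narrative:' labels are stripped if present.
    """
    pattern = (r"<(num|title|desc|narr)>\s*"
               r"(?:Number:|Description:|Narrative:)?\s*"
               r"(.*?)(?=<(?:num|title|desc|narr)>|$)")
    return {tag: " ".join(body.split())
            for tag, body in re.findall(pattern, text, flags=re.S)}

topic = """<num> Number: 508
<title> hair loss is a symptom of what diseases
<desc> Description: Find diseases for which hair loss is a symptom.
<narr> Narrative: A document is relevant if it positively connects the loss
of head hair in humans with a specific disease."""

print(parse_trec_topic(topic)["title"])  # hair loss is a symptom of what diseases
```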

TREC Relevance Judgements
- Humans decide which document-query pairs are relevant.

Evaluation metrics

                  Relevant   Non-relevant   Total
  Retrieved       A          B              A+B
  Not retrieved   C          D              C+D
  Total           A+C        B+D            A+B+C+D

- Recall: proportion of retrieved items amongst the relevant items: R = A/(A+C)
- Precision: proportion of relevant items amongst the retrieved items: P = A/(A+B)
- Accuracy: proportion of items correctly classified as relevant/irrelevant: Acc = (A+D)/(A+B+C+D)
- Recall: [0..1]; Precision: [0..1]; Accuracy: [0..1]
- Accuracy is not a good measure for IR, as it conflates performance on the relevant items (A) with performance on the irrelevant (uninteresting) items (D)
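As a quick illustration of these definitions (my own sketch, not from the slides), the three measures can be computed straight from the contingency-table counts; the numbers used below are the System 1 figures from the worked example on the following slides:

```python
def set_based_metrics(A, B, C, D):
    """Recall, precision and accuracy from the retrieval contingency table.

    A: relevant and retrieved        B: non-relevant but retrieved
    C: relevant but not retrieved    D: non-relevant and not retrieved
    """
    recall = A / (A + C)
    precision = A / (A + B)
    accuracy = (A + D) / (A + B + C + D)
    return recall, precision, accuracy

# System 1 of the worked example: 130 documents, 28 relevant,
# 25 retrieved, 16 of them relevant (so B=9, C=12, D=93).
R, P, Acc = set_based_metrics(A=16, B=9, C=12, D=93)
print(f"R={R:.2f}  P={P:.2f}  Acc={Acc:.2f}")  # R=0.57  P=0.64  Acc=0.84
```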

Recall and Precision
- All documents: A+B+C+D = 130
- Relevant documents for a given query: A+C = 28

Recall and Precision: System 1
- System 1 retrieves 25 items: (A+B)_1 = 25
- Relevant and retrieved items: A_1 = 16
- R_1 = A_1/(A+C) = 16/28 = .57
- P_1 = A_1/(A+B)_1 = 16/25 = .64
- Acc_1 = (A_1+D_1)/(A+B+C+D) = (16+93)/130 = .84

Recall and Precision: System 2
- System 2 retrieves (A+B)_2 = 15 items
- Relevant and retrieved items: A_2 = 12
- R_2 = 12/28 = .43
- P_2 = 12/15 = .80
- Acc_2 = (12+99)/130 = .85

Recall-precision curve
- Plotting precision and recall against the number of documents retrieved shows the inverse relationship between precision and recall
- The precision/recall cross-over can be used as a combined evaluation measure
- Plotting precision against recall gives the recall-precision curve
- The area under the normalised recall-precision curve can be used as an evaluation measure

Recall-criticality and precision-criticality
- The inverse relationship between precision and recall forces general systems to go for a compromise between them
- But some tasks particularly need good precision, whereas others need good recall:

  Precision-critical task
  - Little time available
  - A small set of relevant documents answers the information need
  - Potentially many documents might fill the information need (redundantly)
  - Example: web search for factual information

  Recall-critical task
  - Time matters less
  - One cannot afford to miss a single document
  - Need to see each relevant document
  - Example: patent search

The problem of determining recall
- Recall problem: for a collection of non-trivial size, it becomes impossible to inspect each document
- It would take 6500 hours to judge 800,000 documents for one query (at 30 seconds per document)
- Pooling addresses this problem

Pooling
- Pooling (Sparck Jones and van Rijsbergen, 1975)
- The pool is constructed by putting together the top N retrieval results from a set of n systems (TREC: N = 100)
- Humans judge every document in this pool
- Documents outside the pool are automatically considered to be irrelevant
- There is overlap in the returned documents: the pool is smaller than the theoretical maximum of N * n documents (around 1/3 of the maximum size)
- Pooling works best if the approaches used are very different
- Large increase in pool quality from manual runs, which are recall-oriented, to supplement the pools

F-measure
- Weighted harmonic mean of P and R (van Rijsbergen, 1979):

  F_\alpha = \frac{PR}{(1-\alpha)P + \alpha R}

- High \alpha: precision is more important
- Low \alpha: recall is more important
- Most commonly used with \alpha = 0.5:

  F_{0.5} = \frac{2PR}{P + R}

- The maximum value of the F_{0.5} measure (or F-measure for short) is a good indication of the best P/R compromise
- The F-measure is an approximation of the cross-over point of precision and recall
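A small sketch of both ideas (my own code, not from the slides): building a judgement pool as the union of the top-N documents of several runs, and the weighted F-measure with the alpha convention used above (high alpha favours precision). The value in the comment is simply F_0.5 for the System 1 precision/recall figures from the earlier example.

```python
def make_pool(runs, depth=100):
    """Judgement pool: union of the top-`depth` documents of each ranked run."""
    pool = set()
    for ranking in runs:              # each ranking: list of doc ids, best first
        pool.update(ranking[:depth])
    return pool                       # overlap makes this < depth * len(runs)

def f_measure(precision, recall, alpha=0.5):
    """F_alpha = PR / ((1-alpha)*P + alpha*R); alpha=0.5 gives 2PR/(P+R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (precision * recall) / ((1 - alpha) * precision + alpha * recall)

print(round(f_measure(0.64, 0.57), 3))  # 0.603 for System 1 above
```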

Precision and recall in ranked IR engines
- With a ranked list of returned documents there are many P/R data points
- The sensible P/R data points are those after each new relevant document has been seen (the rows marked X)

  Query 1
  Rank  Relevant  Recall  Precision
   1    X         0.20    1.00
   2                      0.50
   3    X         0.40    0.67
   4                      0.50
   5                      0.40
   6    X         0.60    0.50
   7                      0.43
   8                      0.38
   9                      0.33
  10    X         0.80    0.40
  11                      0.36
  12                      0.33
  13                      0.31
  14                      0.29
  15                      0.27
  16                      0.25
  17                      0.24
  18                      0.22
  19                      0.21
  20    X         1.00    0.25

Summary IR measures
- Precision at a certain rank: P(100)
- Precision at a certain recall value: P(R=.2)
- Precision at the last relevant document: P(last relev)
- Recall at a fixed rank: R(100)
- Recall at a certain precision value: R(P=.1)
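The P/R columns of this table can be reproduced mechanically from the ranked list. A minimal sketch (mine, assuming binary relevance judgements and hypothetical document ids):

```python
def pr_points(ranking, relevant):
    """Recall and precision after each rank of a ranked result list.

    ranking:  list of doc ids, best first
    relevant: set of relevant doc ids for this query
    Returns (rank, is_relevant, recall, precision) tuples.
    """
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        rel = doc in relevant
        hits += rel
        points.append((rank, rel, hits / len(relevant), hits / rank))
    return points

# Query 1 of the table: 5 relevant documents, retrieved at ranks 1, 3, 6, 10, 20.
ranking = list(range(1, 21))          # hypothetical doc ids 1..20
relevant = {1, 3, 6, 10, 20}
for rank, rel, r, p in pr_points(ranking, relevant):
    if rel:                           # the "sensible" data points
        print(rank, round(r, 2), round(p, 2))
# 1 0.2 1.0 | 3 0.4 0.67 | 6 0.6 0.5 | 10 0.8 0.4 | 20 1.0 0.25
```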

Summary IR measures over several queries
- We want to average over queries
- Problem: queries have differing numbers of relevant documents
- We cannot use one single cut-off level for all queries
  - This would not allow systems to achieve the theoretically possible maximal values in all conditions
  - Example: if a query has 10 relevant documents:
    - if the cutoff is > 10, P < 1 for all systems
    - if the cutoff is < 10, R < 1 for all systems
- Therefore, more complicated joint measures are required

11-point average precision

  P_{11pt} = \frac{1}{11}\sum_{j=0}^{10}\frac{1}{N}\sum_{i=1}^{N} P_i(r_j)

  with P_i(r_j) the precision at the jth recall point in the ith query (out of N queries)

- Define standard recall points r_j = j/10: r_0 = 0, r_1 = 0.1, ..., r_10 = 1
- We need P_i(r_j), i.e. the precision at our recall points
- P_i(R = r) can be measured: the precision at each point when recall changes (because a new relevant document is retrieved)
- Problem: unless the number of relevant documents per query is divisible by 10, P_i(r_j) does not coincide with a measurable data point r
- Solution: interpolation:

  P_i(r_j) = \max_{r_j \le r < r_{j+1}} P_i(R = r)   if a measured P_i(R = r) exists in that interval,
  P_i(r_j) = P_i(r_{j+1})                            otherwise

- Note that P_i(R = 1) can always be measured.
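The interpolation rule and the averaging step translate directly into code. The sketch below is my own (it assumes the measured (recall, precision) pairs per query are already available, e.g. from the pr_points function above) and reproduces the result of the worked example that follows:

```python
def interpolate_11pt(measured):
    """Interpolated precision at the standard recall points r_j = j/10.

    measured: (recall, precision) pairs, one per newly retrieved relevant doc.
    Rule: P(r_j) is the best measured precision with recall in [r_j, r_{j+1});
    if no measurement falls in that interval, fall back on P(r_{j+1}).
    """
    interp = [0.0] * 12                       # extra slot simplifies the fallback
    for j in range(10, -1, -1):               # work backwards from r_10 = 1.0
        lo, hi = j / 10, (j + 1) / 10
        in_bin = [p for r, p in measured if lo <= r < hi or (j == 10 and r >= 1.0)]
        interp[j] = max(in_bin) if in_bin else interp[j + 1]
    return interp[:11]

def eleven_point_average(queries):
    """Average the interpolated curves over the queries, then over the 11 points."""
    curves = [interpolate_11pt(m) for m in queries]
    means = [sum(curve[j] for curve in curves) / len(curves) for j in range(11)]
    return sum(means) / 11

q1 = [(0.2, 1.00), (0.4, 0.67), (0.6, 0.50), (0.8, 0.40), (1.0, 0.25)]
q2 = [(1/3, 1.00), (2/3, 0.67), (1.0, 0.20)]
print(round(eleven_point_average([q1, q2]), 2))   # 0.61
```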

11-point average precision: measured data points, Q1
[Recall-precision plot for Query 1 (blue). Bold circles mark the measured data points; five of the r_j coincide with a measured point. Axes: recall (x), precision (y).]

11-point average precision: interpolation, Q1
[The same plot for Query 1; bold circles are measured points, thin circles are the interpolated values at the standard recall points.]

11-point average precision: measured data points, Q2
[Recall-precision plot for Query 2 (red). Bold circles mark the measured data points; only r_10 coincides with a measured point.]

11-point average precision: interpolation, Q2
[The same plot for Query 2; bold circles measured, thin circles interpolated.]

11-point average precision: averaging
[Now average the interpolated precision at each recall point r_j over the N queries.]

11-point average precision: area/result
[End result: the 11-point average precision, a representation of the area under the averaged recall-precision curve.]

11-point average precision: the same example as a table

Measured precision values:
- Query 1 (5 relevant documents, retrieved at ranks 1, 3, 6, 10, 20): P_1(R=.2)=1.00, P_1(R=.4)=0.67, P_1(R=.6)=0.50, P_1(R=.8)=0.40, P_1(R=1.0)=0.25
- Query 2 (3 relevant documents, retrieved at ranks 1, 3, 15): P_2(R=.33)=1.00, P_2(R=.67)=0.67, P_2(R=1.0)=0.20

Interpolated precision at the standard recall points:

  r_j        P_1(r_j)   P_2(r_j)   mean over queries
  r_0 = 0.0  1.00       1.00       1.00
  r_1 = 0.1  1.00       1.00       1.00
  r_2 = 0.2  1.00       1.00       1.00
  r_3 = 0.3  0.67       1.00       0.84
  r_4 = 0.4  0.67       0.67       0.67
  r_5 = 0.5  0.50       0.67       0.59
  r_6 = 0.6  0.50       0.67       0.59
  r_7 = 0.7  0.40       0.20       0.30
  r_8 = 0.8  0.40       0.20       0.30
  r_9 = 0.9  0.25       0.20       0.23
  r_10 = 1.0 0.25       0.20       0.23

P_11pt = 0.61 (the mean of the rightmost column)

P_i(r_j) is the (interpolated) precision of the ith query at the jth recall point. The entries that coincide with exactly measured precision values P_i(R = r_j) are P_1(r_2), P_1(r_4), P_1(r_6), P_1(r_8), P_1(r_10) and P_2(r_10).

Mean Average Precision (MAP)
- Also called "average precision at seen relevant documents"
- Determine the precision at each point when a new relevant document gets retrieved
- Use P = 0 for each relevant document that was not retrieved
- Determine the average for each query, then average over queries:

  MAP = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{|Q_j|}\sum_{i=1}^{|Q_j|} P(doc_i)

  with:
  - |Q_j|: number of relevant documents for query j
  - N: number of queries
  - P(doc_i): precision at the ith relevant document

Mean Average Precision: example

  Query 1 (20 documents retrieved)
  Rank  Relevant  P(doc_i)
   1    X         1.00
   3    X         0.67
   6    X         0.50
  10    X         0.40
  20    X         0.25
  (ranks 2, 4-5, 7-9, 11-19: not relevant)
  AVG: 0.564

  Query 2 (15 documents retrieved)
  Rank  Relevant  P(doc_i)
   1    X         1.00
   3    X         0.67
  15    X         0.20
  (ranks 2, 4-14: not relevant)
  AVG: 0.623

- MAP = (0.564 + 0.623) / 2 = 0.594
- MAP favours systems which return relevant documents fast: it is precision-biased

Relevance Judgements and Subjectivity
- Relevance is subjective: judgements differ across judges
- Relevance is situational: judgements also differ across time (same judge!)
- Problem: systems are not comparable if the metrics are compiled from different judges or at different times
- Countermeasure, part A: use guidelines
  - Relevance is defined independently of novelty
  - Then relevance decisions are independent of each other
- Countermeasure, part B: counteract natural variation by extensive sampling: large populations of users and information needs
  - Then relative success measurements on systems are stable across judges, but not necessarily absolute ones (Voorhees, 2000)
  - This is fine if all you want to do is compare systems
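MAP as defined above is equally easy to compute; the sketch below (my own, with hypothetical document ids equal to their ranks) reproduces the two per-query averages and the overall figure, up to the rounding used on the slide:

```python
def average_precision(ranking, relevant):
    """Average precision for one query: mean of the precision at each relevant
    document, counting relevant documents that were never retrieved as 0."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    precisions += [0.0] * (len(relevant) - hits)  # missed relevant documents
    return sum(precisions) / len(relevant)

def mean_average_precision(runs):
    """runs: one (ranking, relevant_set) pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# The two queries of the example, with hypothetical doc ids equal to their rank.
q1 = (list(range(1, 21)), {1, 3, 6, 10, 20})      # AP ~ 0.563
q2 = (list(range(1, 16)), {1, 3, 15})             # AP ~ 0.622
print(round(mean_average_precision([q1, q2]), 3))
# 0.593 (the slide's 0.594 averages the already-rounded per-query values)
```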

TREC: IR system performance, future
- TREC-7 and 8: P(30) between .40 and .45, using long queries and narratives (one team even for short queries); P(10) = .5 even with short queries, > .5 with medium-length queries
- Systems must have improved since TREC-4, 5, and 6: manual performance (the sanity check) remained on a plateau of around .6
- The best TREC-8 ad hoc systems are not statistically significantly different: has a plateau been reached?
- The ad hoc track was discontinued after TREC-8.
- New tasks: filtering, web, QA, genomics, interactive, novelty, robust, video, cross-lingual, ...
- 2006 is TREC-15. Latest tasks: spam, terabyte, blog, web, legal

TREC tracks 1992-2005
[Table in the original slide: the number of participating groups per track and per year (1992-2005), for the tracks Ad Hoc, Routing, Interactive, Spanish, Confusion, Database Merging, Filtering, Chinese, NLP, Speech, Cross-Language, High Precision, Very Large Corpus, Query, Question Answering, Web, Video, Novelty Detection, Genomics, HARD, Robust, Terabyte, Enterprise and Spam, plus the total number of participating groups per year.]

Summary
- IR evaluation as currently performed (TREC) covers only one small part of the spectrum:
  - system performance in batch mode
  - laboratory conditions, not directly involving real users
  - precision and recall measured on large, fixed test collections
- However, this evaluation methodology is very stable and mature:
  - a host of elaborate performance metrics is available, e.g. MAP
  - the relevance problem is solvable (in principle) by query sampling, guidelines, and relative system comparisons
  - the recall problem is solvable (in practice) by pooling methods
  - it is provable that these methods produce stable evaluation results

Literature
- Teufel (2006, to appear): "An Overview of Evaluation Methods in TREC Ad-hoc Information Retrieval and TREC Question Answering". In: L. Dybkjaer, H. Hemsen, W. Minker (eds.), Evaluation of Text and Speech Systems. Springer, Dordrecht, The Netherlands.
- Copyrighted: please go to Student Administration for a copy.