Source: K. Hofmann, Fast and reliable online learning to rank for information retrieval. PhD thesis, University of Amsterdam, Faculty of Science (FNWI), Informatics Institute (II), 2013. Chapter 1: Introduction. UvA-DARE: http://hdl.handle.net/11245/2.122992 (full bibliographic details: http://hdl.handle.net/11245/1.394453).

1 Introduction

The goal of the research area of information retrieval (IR) is to develop the insights and technology needed to provide access to data collections. The most prominent applications, web search engines like Bing, Google, or Yahoo!, provide instant and easy access to vast and constantly growing collections of web pages. A user looking for information submits a query to the search engine, receives a ranked list of results, and follows links to the most promising ones. To address the flood of data available on the web, today's web search engines have developed into very complex systems. They combine hundreds of ranking features (properties of the query, a document, and the relationship between the two) with the goal of creating the best possible search results for all their users at all times (see http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/, retrieved December 29, 2012).

A ranker (or ranking function) is the part of a search engine that determines the order in which documents retrieved for a given user query should be presented. Until recently, most rankers were developed manually, based on expert knowledge. Developing a good ranker may be easy for some search tasks, but in many cases what constitutes a good ranking depends on the search context, such as users' background knowledge, age, or location, or their specific search goals and intents (Besser et al., 2010; Hofmann et al., 2010a; Rose and Levinson, 2004; Shen et al., 2005). And even though there is an enormous variety in the tasks and goals encountered in web search, web search engines are only the tip of the iceberg. More specialized systems are everywhere: search engines for companies' intranets, local and national libraries, universities' course catalogues, and users' personal documents (e.g., photos, emails, and music) all provide access to different, more or less specialized, document collections, and cater to different users with different search goals and expectations. Addressing each of these settings manually is infeasible. Instead, we need to look for scalable methods that can learn good rankings without expensive, and necessarily limited, manual or semi-manual tuning.

For automatically tuning the parameters of a ranking function, machine learning algorithms are invaluable (Liu, 2009). Most methods employ supervised learning to rank, i.e., algorithms are trained on examples of relevant and non-relevant documents for particular queries. Data to train these approaches is typically obtained from experts who label document-query pairs, which is time-consuming and expensive.
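To make the notions of ranking features and a ranker concrete, the sketch below shows a minimal linear ranking function in Python. It is purely illustrative and not the formulation used in this thesis; the features, document fields, and names are invented for the example.

    import math

    def extract_features(query, document):
        # Toy ranking features: properties of the query, the document,
        # and the relationship between the two.
        terms = query.lower().split()
        text = document["text"].lower()
        matches = sum(text.count(t) for t in terms)
        return [
            matches / (len(terms) + 1e-9),        # query-document term overlap
            math.log(1 + len(text.split())),      # document length
            document.get("popularity", 0.0),      # query-independent signal
        ]

    def score(weights, query, document):
        # A linear ranker: the weighted sum of feature values.
        return sum(w * f for w, f in zip(weights, extract_features(query, document)))

    def rank(weights, query, documents):
        # Present documents in decreasing order of score.
        return sorted(documents, key=lambda d: score(weights, query, d), reverse=True)

Learning to rank then amounts to choosing the feature weights: supervised approaches fit them to expert-labeled query-document pairs, whereas the online approaches studied in this thesis adjust them based on interactions with users.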

In many settings, such as personalized or localized search, or when deploying a search engine for a company's intranet or a library catalogue, collecting the large amounts of training data required for supervised learning is usually not feasible (Sanderson, 2010). Even in environments where training data is available, it may not capture typical information needs and user preferences perfectly (Radlinski and Craswell, 2010), and cannot anticipate future changes in user needs.

In this thesis we follow an alternative approach, called online learning to rank. This technology can enable self-learning search engines that learn directly from natural interactions with their users. Such systems promise to be able to continuously adapt and improve their rankings to the specific setting they are deployed in, and continue to learn for as long as they are being used.

Learning directly from user interactions is fundamentally different from the currently dominant supervised learning to rank approaches for IR, where training data is assumed to be randomly sampled from some underlying distribution, and where absolute and reliable labels are provided by professional annotators. In an online learning setting, feedback for learning is a by-product of natural user interactions. This strongly affects what kind of feedback can be obtained, and the quality of that feedback. For example, users expect to be presented with useful results at all times, so trying out new rankings (called exploration) can have a high cost in user satisfaction and needs to be balanced against possible future learning gains. Also, feedback inferred from user interactions can be noisy, and it may be affected by how search results are presented (one example of such an effect is caption bias). Learning from such lower-quality feedback may result in degraded learning, unless we can design learning to rank algorithms that are robust against these effects. In this thesis we investigate the principles that allow effective online learning to rank for IR, and translate our insights into new algorithms for fast and reliable online learning.

1.1 Research Outline and Questions

The broad question that motivates the research for this thesis is: Can we build search engines that automatically learn good ranking functions by interacting with their users? Individual components towards solving this problem already exist (see Chapter 2 for an overview), but other aspects, such as how to learn from noisy and relative feedback, have not yet been investigated. This thesis aims to close some of these gaps, contributing to the long-term goal of a complete online learning to rank solution for IR.

We start our investigation by focusing on the type and quality of feedback that can be obtained for learning to rank in an online setting. Extracting reliable and useful feedback for learning to rank from natural user interactions is difficult, because user interactions are noisy and context-dependent. The most effective techniques identified so far focus on extracting relative information, i.e., they infer user preferences between documents or whole result rankings. In this thesis we focus on these relative feedback techniques, and particularly on so-called interleaved comparison methods, which infer preferences between rankings using click data. Besides their use in online learning to rank, these methods are also used for online evaluation in search engine research and development in general.
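To illustrate how an interleaved comparison method can infer such a preference from clicks, the sketch below implements a simplified form of team-draft interleaving, one of the existing methods analyzed later in this thesis. It is a hedged illustration of the general idea, not the probabilistic method developed in Chapter 4; all function and variable names are our own.

    import random

    def team_draft_interleave(ranking_a, ranking_b):
        # Build a combined result list by letting the two rankers take turns
        # picking their highest-ranked document that is not yet in the list.
        total = len(set(ranking_a) | set(ranking_b))
        interleaved, teams = [], []
        while len(interleaved) < total:
            order = [("a", ranking_a), ("b", ranking_b)]
            random.shuffle(order)  # randomize which ranker picks first each round
            for team, ranking in order:
                pick = next((d for d in ranking if d not in interleaved), None)
                if pick is not None:
                    interleaved.append(pick)
                    teams.append(team)
        return interleaved, teams

    def infer_preference(teams, clicked_positions):
        # Credit each click to the ranker that contributed the clicked document;
        # the ranker with more credited clicks wins this comparison.
        wins = {"a": 0, "b": 0}
        for pos in clicked_positions:
            wins[teams[pos]] += 1
        if wins["a"] == wins["b"]:
            return "tie"
        return "a" if wins["a"] > wins["b"] else "b"

Aggregated over many queries, such per-impression outcomes estimate which ranker users prefer. The probabilistic interleave methods developed in Chapter 4 replace the deterministic assignment steps above with an explicit probabilistic model of the interleaving process.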
Given that three interleaved comparison methods have been developed previously, we first aim to understand how these methods compare to each other, i.e., how can we decide which method to use for an online learning to rank system, or to evaluate a given retrieval system?

We formalize criteria for analyzing interleaved comparison methods, to answer the following questions:

RQ 1 What criteria should an interleaved comparison method satisfy to enable reliable online learning to rank for IR?
RQ 2 Do current interleaved comparison methods satisfy these criteria?

In answering these questions, we identify three minimal criteria that interleaved comparison methods should satisfy: fidelity, soundness, and efficiency. An interleaved comparison method has fidelity if the quantity it measures, i.e., the expected outcome of ranker comparisons, properly corresponds to the true relevance of the ranked documents. It is sound if its estimates of that quantity are statistically sound, i.e., unbiased and consistent. It is efficient if those estimates are accurate with only little data. Analyzing previously developed interleaved comparison methods, we find that none of them exhibits fidelity. To address this shortcoming, we develop a new interleaved comparison method, probabilistic interleave (PI), that is based on a probabilistic interpretation of the interleaving process. An extension of PI, PI-MA, is then derived to increase the method's efficiency by marginalizing over known variables instead of using noisy estimates. Regarding these new methods, we address the following questions:

RQ 3 Do PI and its extension PI-MA exhibit fidelity and soundness?
RQ 4 Is PI-MA more efficient than previous interleaved comparison methods? Is it more efficient than PI?

While previous interleaved comparison methods required collecting new data for each ranker comparison, our probabilistic framework enables the reuse of previously collected data. Intuitively, previously collected result lists and user clicks should provide some information about the relative quality of new target rankers. However, the source distribution under which the data was collected may differ from the target distribution under which samples would be collected if the new target rankers were compared with live data. This can result in biased estimates of comparison outcomes. To address this problem, we design a second extension of PI, PI-MA-IS. It uses importance sampling to compensate for differences between the source and target distribution, and marginalization to maintain high efficiency. Investigating this method analytically and experimentally allows us to address the following questions:

RQ 5 Can historical data be reused to compare new ranker pairs?
RQ 6 Does PI-MA-IS maintain fidelity and soundness?
RQ 7 Can PI-MA-IS reuse historical data effectively?

We then turn to more practical issues of using interleaved comparisons in a web search setting. In this setting, user clicks may be affected by aspects of result pages other than true result relevance, such as how results are presented. If such visual aspects affect user clicks, the question becomes what click-based evaluation really measures. We address the following questions:

RQ 8 (How) does result presentation affect user clicks (caption bias)?
RQ 9 Can we model caption bias, and compensate for it in interleaving experiments?
RQ 10 (How) does caption bias affect interleaving experiments?
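The importance-sampling correction used by PI-MA-IS, mentioned above, follows a generic pattern; the notation below is illustrative rather than the exact formulation of later chapters. Suppose historical interaction data consists of n interleaved result lists l_i with observed clicks c_i, collected under a source distribution P_S over lists, and let o(l_i, c_i) denote the comparison outcome inferred for a new target ranker pair (for example, +1, 0, or -1). The expected outcome under the target distribution P_T can then be estimated as

    \hat{E}_{P_T}[o] = \frac{1}{n} \sum_{i=1}^{n} o(l_i, c_i) \, \frac{P_T(l_i)}{P_S(l_i)},

i.e., each historical observation is reweighted by how likely the shown list would be under the target comparison, relative to how likely it was under the conditions where the data was collected.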

After addressing the above questions regarding the quality of feedback that can be obtained in online learning to rank for IR settings, we turn to the principles of online learning to rank for IR. Given the characteristics and restrictions of online learning to rank, we investigate how to perform as effectively as possible in this setting. One central challenge that we formulate is the exploration-exploitation dilemma. In an online setting, a search engine learns continuously, while interacting with its users. To satisfy users' expectations as well as possible at any point in time, the system needs to exploit what it has learned up to this point. It also needs to explore possibly better solutions, to ensure continued learning and improved performance in the future. We hypothesize that addressing this dilemma by balancing exploration and exploitation can improve online performance. We design two algorithms for achieving such a balance in pairwise and listwise online learning to rank, and ask:

RQ 11 Can balancing exploration and exploitation improve online performance in online learning to rank for IR?
RQ 12 How are exploration and exploitation affected by noise in user feedback?
RQ 13 How does the online performance of different types (pairwise and listwise) of online learning to rank for IR approaches relate to balancing exploration and exploitation?

Finally, we return to the question of how to learn as quickly and effectively as possible in an online learning to rank for IR setting. We hypothesize that click data collected during earlier learning steps can be reused to gain additional information about the relative quality of new rankers. Based on our PI-MA-IS method for reusing historical data, we develop two algorithms for learning with historical data reuse. The first, RHC, reuses historical data to make ranker comparisons during learning more reliable. The other, CPS, uses historical data for more effective exploration of the solution space. The research questions addressed in this part of the thesis are:

RQ 14 Can previously observed (historical) interaction data be used to speed up online learning to rank?
RQ 15 Is historical data more effective when used to make comparisons more reliable (as in RHC), or when used to increase local exploration (as in CPS)?
RQ 16 How does noise in user feedback affect the reuse of historical interaction data for online learning to rank?
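To make the exploration-exploitation trade-off introduced above concrete, the sketch below shows the simplest possible way of mixing an exploitative ranking (from the current best ranker) with an exploratory one (from a candidate ranker) when composing a result list. It is a generic illustration with invented names, not the pairwise or listwise algorithms developed in Chapter 6.

    import random

    def balanced_result_list(exploit_ranking, explore_ranking, k, epsilon=0.2):
        # Fill each of the k result slots from the exploratory ranking with
        # probability epsilon, and from the current best (exploitative)
        # ranking otherwise. A larger epsilon means more exploration, which
        # can speed up learning but risks showing users worse results now.
        result, seen = [], set()
        for _ in range(k):
            source = explore_ranking if random.random() < epsilon else exploit_ranking
            pick = next((d for d in source if d not in seen), None)
            if pick is None:  # fall back to the other list if this one is exhausted
                other = exploit_ranking if source is explore_ranking else explore_ranking
                pick = next((d for d in other if d not in seen), None)
            if pick is None:
                break
            result.append(pick)
            seen.add(pick)
        return result

Clicks on the resulting list then provide feedback about the exploratory candidate while most of the list remains exploitative; Chapter 6 investigates how such a balance affects online performance, and how it interacts with noise in user feedback.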

1.2 Main Contributions

In this section we summarize the main algorithmic, theoretical, and empirical contributions of this thesis.

Algorithmic contributions:

- A probabilistic interleaved comparison method, called probabilistic interleave (PI), that exhibits fidelity, and an extension of PI, called PI-MA, that increases the efficiency of PI by marginalizing over known variables instead of using noisy observations.
- The first interleaved comparison method that allows reuse of historical interaction data (called PI-MA-IS, an extension of PI-MA).
- An approach for integrating models of caption bias with interleaved comparison methods in order to compensate for caption bias in interleaving experiments.
- The first two online learning to rank for IR algorithms (one pairwise, one listwise approach) that can balance exploration and exploitation.
- The first two online learning to rank algorithms that can utilize previously observed (historical) interaction data: reliable historical comparisons (RHC), and candidate preselection (CPS).

Theoretical contributions:

- A framework for analyzing interleaved comparison methods in terms of fidelity, soundness, and (previously proposed) efficiency.
- Analysis of the interleaved comparison methods balanced interleave, team draft, and document constraints, showing that none exhibits fidelity.
- Two proofs showing that our proposed extensions of PI, PI-MA and PI-MA-IS, maintain soundness.
- A general-purpose probabilistic model of caption bias in user click behavior that can combine document-pairwise and pointwise features.
- A formalization of online learning to rank for IR as a contextual bandit problem, and a formulation of the exploration-exploitation dilemma in this setting.

Empirical contributions:

- An experimental framework that allows for the assessment of online evaluation and online learning to rank methods, in terms of their online performance, using annotated learning to rank data sets and click models.
- An empirical evaluation of PI, PI-MA, and all existing interleaved comparison methods, showing that PI-MA is the most efficient method.
- An empirical evaluation of interleaved comparison methods under historical data, showing that PI-MA-IS is the only method that can effectively reuse historical data.
- A large-scale assessment of our caption-bias model with pairwise and pointwise feature sets using real click data in a web search setting, showing that pointwise features are most important, but that combinations of pointwise and pairwise features are most accurate for modeling caption bias.

- Results of applying caption bias models to interleaving experiments in a web search setting, indicating that caption bias can affect interleaved comparison outcomes.
- The first empirical evidence showing that balancing exploration and exploitation in online learning to rank for IR can significantly improve online performance.
- The first empirical evidence showing that reusing historical data for online learning to rank can substantially and significantly improve online performance.

In addition to the contributions listed above, the software developed for running online learning to rank experiments following our experimental setup is made freely available (Appendix A). This software package includes reference implementations of the developed interleaved comparison methods (PI, PI-MA, and PI-MA-IS) and of the online learning to rank approaches (balancing exploration and exploitation in pairwise and listwise online learning, and online learning to rank with historical data reuse).

1.3 Thesis Overview

This section gives an overview of the content of each chapter of this thesis. The next chapter (Chapter 2) introduces background for all subsequent chapters. Chapter 3 details the problem formulation used throughout this thesis, and introduces the experimental setup that forms the basis of the empirical evaluations in Chapters 4, 6, and 7.

The next four chapters are the main research chapters of this thesis. Each focuses on a specific aspect of online learning to rank for IR. We start with the most central component, the feedback mechanism, in Chapter 4. This chapter develops a framework for analyzing interleaved comparison methods, and proposes a new probabilistic interleaved comparison method (PI) together with two extensions: one for more efficient comparisons (PI-MA), and one for comparisons that reuse historical interaction data (PI-MA-IS). Chapter 5 investigates interleaved comparison methods in a web search setting, and develops models for compensating for caption bias in interleaved comparisons. Chapter 6 focuses on approaches for online learning, and investigates how exploration and exploitation can be balanced in online learning to rank for IR, and whether such a balance can improve the online performance of such systems. Finally, Chapter 7 integrates the interleaved comparison methods developed in Chapter 4 with an online learning to rank algorithm to investigate whether and in what way historical data reuse can speed up online learning to rank for IR. We draw conclusions and give an outlook on future work in Chapter 8.

All research chapters build on background introduced in Chapter 2, and all but Chapter 5 use the experimental setup detailed in Chapter 3. Although ideas developed in earlier chapters are referred to in later chapters, each chapter is relatively self-contained (assuming knowledge of the background material provided in Chapters 2 and 3). An exception is Chapter 7, which builds on the interleaved comparison methods developed in Chapter 4. Readers familiar with existing online evaluation and online learning to rank approaches can skip over Chapter 2. Also, readers primarily interested in the main ideas and theoretical contributions of this thesis may skip Chapter 3 and only revisit it to understand empirical results as needed.

1.4 Origins

The following publications form the basis of chapters in this thesis. Chapter 4 is based on (Hofmann et al., 2011c, 2012b, 2013c). Chapter 5 is based on (Hofmann et al., 2012a). Chapter 6 is based on (Hofmann et al., 2011a, 2013b). Chapter 7 is based on (Hofmann et al., 2013a). In addition, Chapter 3 combines material from (Hofmann et al., 2011a,c, 2012b, 2013a,b,c). Finally, this thesis draws from insights and experiences gained in (Besser et al., 2010; Hofmann et al., 2008, 2009a,b, 2010a,b, 2011b; Lubell-Doughtie and Hofmann, 2012; Tsivtsivadze et al., 2012).