Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (http://hdl.handle.net/11245/2.122992).

Source: Chapter 1 (Introduction) of the PhD thesis "Fast and reliable online learning to rank for information retrieval" by K. Hofmann, Faculty FNWI: Informatics Institute (II), University of Amsterdam, 2013. Full bibliographic details: http://hdl.handle.net/11245/1.394453.

Copyright: it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content licence (such as Creative Commons).
1 Introduction

The goal of the research area of information retrieval (IR) is to develop the insights and technology needed to provide access to data collections. The most prominent applications, web search engines like Bing, Google, or Yahoo!, provide instant and easy access to vast and constantly growing collections of web pages. A user looking for information submits a query to the search engine, receives a ranked list of results, and follows links to the most promising ones. To address the flood of data available on the web, today's web search engines have developed into very complex systems. They combine hundreds of ranking features (properties of the query, a document, and the relationship between the two) with the goal of creating the best possible search results for all their users at all times.[1]

A ranker (or ranking function) is the part of a search engine that determines the order in which documents retrieved for a given user query should be presented. Until recently, most rankers were developed manually, based on expert knowledge. Developing a good ranker may be easy for some search tasks, but in many cases what constitutes a good ranking depends on the search context, such as users' background knowledge, age, or location, or their specific search goals and intents (Besser et al., 2010; Hofmann et al., 2010a; Rose and Levinson, 2004; Shen et al., 2005). And even though there is an enormous variety in the tasks and goals encountered in web search, web search engines are only the tip of the iceberg. More specialized systems are everywhere: search engines for companies' intranets, local and national libraries, universities' course catalogues, and users' personal documents (e.g., photos, emails, and music) all provide access to different, more or less specialized, document collections, and cater to different users with different search goals and expectations. Addressing each of these settings manually is infeasible.
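The weighted-feature view of a ranker described above can be sketched in a few lines of code. The feature names, values, and weights below are invented purely for illustration; real systems combine hundreds of features with carefully tuned or learned weights.

```python
# Illustrative sketch of a linear ranking function: each document is scored
# as a weighted combination of query-document features, then documents are
# sorted by descending score. All names and numbers are invented.

def score(features, weights):
    """Combine feature values into a single relevance score."""
    return sum(weights[name] * value for name, value in features.items())

def rank(documents, weights):
    """Order documents by descending score."""
    return sorted(documents, key=lambda d: score(d["features"], weights),
                  reverse=True)

weights = {"bm25": 1.0, "pagerank": 0.5, "url_length": -0.1}
documents = [
    {"id": "d1", "features": {"bm25": 2.1, "pagerank": 0.3, "url_length": 4.0}},
    {"id": "d2", "features": {"bm25": 1.2, "pagerank": 2.0, "url_length": 1.0}},
]
ranking = [d["id"] for d in rank(documents, weights)]  # → ["d2", "d1"]
```

Manually tuning such weights is exactly what becomes infeasible as the number of features and deployment settings grows, which motivates the learning approaches discussed next.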
Instead, we need to look for scalable methods that can learn good rankings without expensive, and necessarily limited, manual or semi-manual tuning. For automatically tuning the parameters of a ranking function, machine learning algorithms are invaluable (Liu, 2009). Most methods employ supervised learning to rank, i.e., algorithms are trained on examples of relevant and non-relevant documents for particular queries. Data to train these approaches is typically obtained from experts who label document-query pairs, which is time-consuming and expensive. In many settings, such as personalized or localized search, or when deploying a search engine for a company's intranet or a library catalogue, collecting the large amounts of training data required for supervised learning is usually not feasible (Sanderson, 2010). Even in environments where training data is available, it may not capture typical information needs and user preferences perfectly (Radlinski and Craswell, 2010), and cannot anticipate future changes in user needs.

[1] http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/, retrieved December 29, 2012.

In this thesis we follow an alternative approach, called online learning to rank. This technology can enable self-learning search engines that learn directly from natural interactions with their users. Such systems promise to continuously adapt and improve their rankings in the specific setting they are deployed in, and to continue learning for as long as they are being used. Learning directly from user interactions is fundamentally different from the currently dominant supervised learning to rank approaches for IR, where training data is assumed to be randomly sampled from some underlying distribution, and where absolute and reliable labels are provided by professional annotators. In an online learning setting, feedback for learning is a by-product of natural user interactions. This strongly affects what kind of feedback can be obtained, and the quality of that feedback. For example, users expect to be presented with useful results at all times, so trying out new rankings (called exploration) can have a high cost in user satisfaction and needs to be balanced against possible future learning gains. Also, feedback inferred from user interactions can be noisy, and it may be affected by how search results are presented (one example of such an effect is caption bias). Learning from such lower-quality feedback may result in degraded learning, unless we can design learning to rank algorithms that are robust against these effects.
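The noisy, position-biased character of click feedback described above can be illustrated with a toy simulation. The examination and click probabilities below are invented for illustration and do not come from this thesis; the point is only that clicks depend on both relevance and rank position, so any single session is unreliable evidence.

```python
import random

# Toy simulation of noisy, position-biased click feedback: users examine
# each rank with decreasing probability, and click an examined document with
# a probability that depends on its (hidden) relevance grade.
# All probabilities are invented for illustration.

def simulate_clicks(ranking, relevance, rng,
                    p_examine=(0.9, 0.6, 0.3),       # by rank position
                    p_click=(0.1, 0.8)):             # by relevance grade 0/1
    clicks = []
    for rank, doc in enumerate(ranking):
        if rng.random() < p_examine[rank] and rng.random() < p_click[relevance[doc]]:
            clicks.append(doc)
    return clicks

rng = random.Random(42)
relevance = {"d1": 1, "d2": 0, "d3": 1}

# Aggregated over many simulated queries, relevant documents attract more
# clicks, but a highly ranked non-relevant document still draws some clicks,
# and a relevant document at a low rank is under-clicked.
counts = {d: 0 for d in relevance}
for _ in range(1000):
    for d in simulate_clicks(["d1", "d2", "d3"], relevance, rng):
        counts[d] += 1
```

In this sketch, the relevant document at rank 3 ("d3") receives far fewer clicks than the equally relevant document at rank 1 ("d1"), which is precisely why naive click counting is a poor learning signal.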
In this thesis we investigate the principles that allow effective online learning to rank for IR, and translate our insights into new algorithms for fast and reliable online learning.

1.1 Research Outline and Questions

The broad question that motivates the research for this thesis is: Can we build search engines that automatically learn good ranking functions by interacting with their users? Individual components towards solving this problem already exist (see Chapter 2 for an overview), but other aspects, such as how to learn from noisy and relative feedback, have not yet been investigated. This thesis aims to close some of these gaps, contributing to the long-term goal of a complete online learning to rank solution for IR.

We start our investigation by focusing on the type and quality of feedback that can be obtained for learning to rank in an online setting. Extracting reliable and useful feedback for learning to rank from natural user interactions is difficult, because user interactions are noisy and context-dependent. The most effective techniques identified so far focus on extracting relative information, i.e., they infer user preferences between documents or whole result rankings. In this thesis we focus on these relative feedback techniques, and particularly on so-called interleaved comparison methods that infer preferences between rankings using click data. Besides their use in online learning to rank, these methods are employed for online evaluation in search engine research and development in general. Given that three interleaved comparison methods have been developed previously, we first aim to understand how these methods compare to each other, i.e., how can we
decide which method to use for an online learning to rank system, or to evaluate a given retrieval system? We formalize criteria for analyzing interleaved comparison methods, to answer the following questions:

RQ 1 What criteria should an interleaved comparison method satisfy to enable reliable online learning to rank for IR?

RQ 2 Do current interleaved comparison methods satisfy these criteria?

In answering these questions, we identify three minimal criteria that interleaved comparison methods should satisfy: fidelity, soundness, and efficiency. An interleaved comparison method has fidelity if the quantity it measures, i.e., the expected outcome of ranker comparisons, properly corresponds to the true relevance of the ranked documents. It is sound if its estimates of that quantity are statistically sound, i.e., unbiased and consistent. It is efficient if those estimates are accurate even with little data. Analyzing previously developed interleaved comparison methods, we find that none of them exhibits fidelity. To address this shortcoming, we develop a new interleaved comparison method, probabilistic interleave (PI), that is based on a probabilistic interpretation of the interleaving process. An extension of PI, PI-MA, is then derived to increase the method's efficiency by marginalizing over known variables instead of using noisy estimates. Regarding these new methods, we address the following questions:

RQ 3 Do PI and its extension PI-MA exhibit fidelity and soundness?

RQ 4 Is PI-MA more efficient than previous interleaved comparison methods? Is it more efficient than PI?

While previous interleaved comparison methods required collecting new data for each ranker comparison, our probabilistic framework enables the reuse of previously collected data. Intuitively, the previously collected result lists and user clicks should provide some information about the relative quality of new target rankers.
However, the source distribution under which the data was collected may differ from the target distribution under which samples would be collected if the new target rankers were compared with live data. This can result in biased estimates of comparison outcomes. To address this problem, we design a second extension of PI, PI-MA-IS. It uses importance sampling to compensate for differences between the source and target distributions, and marginalization to maintain high efficiency. Investigating this method analytically and experimentally allows us to address the following questions:

RQ 5 Can historical data be reused to compare new ranker pairs?

RQ 6 Does PI-MA-IS maintain fidelity and soundness?

RQ 7 Can PI-MA-IS reuse historical data effectively?

We then turn to more practical issues of using interleaved comparisons in a web search setting. In this setting, user clicks may be affected by aspects of result pages other than true result relevance, such as how results are presented. If such visual aspects affect user clicks, the question becomes what click-based evaluation really measures. We address the following questions:

RQ 8 (How) does result presentation affect user clicks (caption bias)?
RQ 9 Can we model caption bias, and compensate for it in interleaving experiments?

RQ 10 (How) does caption bias affect interleaving experiments?

After addressing the above questions regarding the quality of feedback that can be obtained in online learning to rank for IR settings, we turn to the principles of online learning to rank for IR. Given the characteristics and restrictions of online learning to rank, we investigate how to perform as effectively as possible in this setting. One central challenge that we formulate is the exploration-exploitation dilemma. In an online setting, a search engine learns continuously, while interacting with its users. To satisfy users' expectations as well as possible at any point in time, the system needs to exploit what it has learned up to this point. It also needs to explore possibly better solutions, to ensure continued learning and improved performance in the future. We hypothesize that addressing this dilemma by balancing exploration and exploitation can improve online performance. We design two algorithms for achieving such a balance in pairwise and listwise online learning to rank, and address the following questions:

RQ 11 Can balancing exploration and exploitation improve online performance in online learning to rank for IR?

RQ 12 How are exploration and exploitation affected by noise in user feedback?

RQ 13 How does the online performance of different types (pairwise and listwise) of online learning to rank for IR approaches relate to balancing exploration and exploitation?

Finally, we return to the question of how to learn as quickly and effectively as possible in an online learning to rank for IR setting. We hypothesize that click data collected during earlier learning steps can provide additional information about the relative quality of new rankers. Based on our PI-MA-IS method for reusing historical data, we develop two algorithms for learning with historical data reuse.
The first, RHC, reuses historical data to make ranker comparisons during learning more reliable. The other, CPS, uses historical data for more effective exploration of the solution space. This part of our research addresses the following questions:

RQ 14 Can previously observed (historical) interaction data be used to speed up online learning to rank?

RQ 15 Is historical data more effective when used to make comparisons more reliable (as in RHC), or when used to increase local exploration (as in CPS)?

RQ 16 How does noise in user feedback affect the reuse of historical interaction data for online learning to rank?

1.2 Main Contributions

In this section we summarize the main algorithmic, theoretical, and empirical contributions of this thesis.
Algorithmic contributions:

- A probabilistic interleaved comparison method, called probabilistic interleave (PI), that exhibits fidelity, and an extension of PI, called PI-MA, that increases the efficiency of PI by marginalizing over known variables instead of using noisy observations.
- The first interleaved comparison method that allows reuse of historical interaction data (called PI-MA-IS, an extension of PI-MA).
- An approach for integrating models of caption bias with interleaved comparison methods in order to compensate for caption bias in interleaving experiments.
- The first two online learning to rank for IR algorithms (one pairwise, one listwise approach) that can balance exploration and exploitation.
- The first two online learning to rank algorithms that can utilize previously observed (historical) interaction data: reliable historical comparisons (RHC), and candidate preselection (CPS).

Theoretical contributions:

- A framework for analyzing interleaved comparison methods in terms of fidelity, soundness, and (previously proposed) efficiency.
- An analysis of the interleaved comparison methods balanced interleave, team draft, and document constraints, showing that none exhibits fidelity.
- Two proofs that show that our proposed extensions of PI, PI-MA and PI-MA-IS, maintain soundness.
- A general-purpose probabilistic model of caption bias in user click behavior that can combine document-pairwise and pointwise features.
- A formalization of online learning to rank for IR as a contextual bandit problem, and a formulation of the exploration-exploitation dilemma in this setting.

Empirical contributions:

- An experimental framework that allows for the assessment of online evaluation and online learning to rank methods, using annotated learning to rank data sets and click models, in terms of their online performance.
- An empirical evaluation of PI, PI-MA, and all existing interleaved comparison methods, showing that PI-MA is the most efficient method.
- An empirical evaluation of interleaved comparison methods under historical data, showing that PI-MA-IS is the only method that can effectively reuse historical data.
- A large-scale assessment of our caption-bias model with pairwise and pointwise feature sets using real click data in a web search setting, showing that pointwise features are most important, but combinations of pointwise and pairwise features are most accurate for modeling caption bias.
- Results of applying caption bias models to interleaving experiments in a web search setting, indicating that caption bias can affect interleaved comparison outcomes.
- The first empirical evidence showing that balancing exploration and exploitation in online learning to rank for IR can significantly improve online performance.
- The first empirical evidence showing that reusing historical data for online learning to rank can substantially and significantly improve online performance.

In addition to the contributions listed above, the software developed for running online learning to rank experiments following our experimental setup is made freely available (Appendix A). This software package includes reference implementations of the developed interleaved comparison methods (PI, PI-MA, and PI-MA-IS) and online learning to rank approaches (balancing exploration and exploitation in pairwise and listwise online learning, and online learning to rank with historical data reuse).

1.3 Thesis Overview

This section gives an overview of the content of each chapter of this thesis. The next chapter (Chapter 2) introduces background for all subsequent chapters. Chapter 3 details the problem formulation used throughout this thesis, and introduces the experimental setup that forms the basis of the empirical evaluations in Chapters 4, 6, and 7.

The next four chapters are the main research chapters of this thesis. Each focuses on a specific aspect of online learning to rank for IR. We start with the most central component, the feedback mechanism, in Chapter 4. This chapter develops a framework for analyzing interleaved comparison methods, and proposes a new probabilistic interleaved comparison method (PI) together with two extensions, one for more efficient comparisons (PI-MA) and one for comparisons that reuse historical interaction data (PI-MA-IS).
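To give a concrete flavor of what an interleaved comparison involves, the following is a simplified sketch in the spirit of the team-draft method, one of the baseline methods analyzed in Chapter 4. It is an illustrative toy, not the probabilistic method developed in this thesis, and all identifiers are invented: the two rankers' lists are merged by alternating "team" picks, and clicks are credited to the team that contributed each clicked document.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Simplified team-draft interleaving: in each round a coin flip decides
    which ranker contributes first; each ranker then adds its highest-ranked
    document that is not yet in the interleaved list."""
    interleaved, teams = [], {}
    n_docs = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < n_docs:
        first = rng.choice(["a", "b"])
        for team in (first, "b" if first == "a" else "a"):
            source = ranking_a if team == "a" else ranking_b
            doc = next((d for d in source if d not in teams), None)
            if doc is not None:
                interleaved.append(doc)
                teams[doc] = team
    return interleaved, teams

def infer_preference(clicked_docs, teams):
    """Credit each click to the team that contributed the clicked document;
    the team with more clicks wins the comparison."""
    a = sum(1 for d in clicked_docs if teams[d] == "a")
    b = sum(1 for d in clicked_docs if teams[d] == "b")
    return "a" if a > b else "b" if b > a else "tie"

rng = random.Random(0)
interleaved, teams = team_draft_interleave(["d1", "d2", "d3"],
                                           ["d3", "d1", "d2"], rng)
```

Repeating this over many queries yields a preference between the two rankers; the probabilistic methods of Chapter 4 replace the deterministic team assignment with a distribution over interleaved lists, which is what makes marginalization and historical data reuse possible.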
Chapter 5 investigates interleaved comparison methods in a web search setting, and develops models for compensating for caption bias in interleaved comparisons. Chapter 6 focuses on approaches for online learning, and investigates how exploration and exploitation can be balanced in online learning to rank for IR, and whether such a balance can improve the online performance of such systems. Finally, Chapter 7 integrates the interleaved comparison methods developed in Chapter 4 with an online learning to rank algorithm to investigate whether and in what way historical data reuse can speed up online learning to rank for IR. We draw conclusions and give an outlook on future work in Chapter 8.

All research chapters build on background introduced in Chapter 2, and all but Chapter 5 use the experimental setup detailed in Chapter 3. Although ideas developed in earlier chapters are referred to in later chapters, each chapter is relatively self-contained (assuming knowledge of the background material provided in Chapters 2 and 3). An exception is Chapter 7, which builds on the interleaved comparison methods developed in Chapter 4. Readers familiar with existing online evaluation and online learning to rank approaches can skip over Chapter 2. Also, for readers primarily interested in the main
ideas and theoretical contributions of this thesis, it is recommended to skip Chapter 3 and only revisit it to understand empirical results as needed.

1.4 Origins

The following publications form the basis of chapters in this thesis. Chapter 4 is based on (Hofmann et al., 2011c, 2012b, 2013c). Chapter 5 is based on (Hofmann et al., 2012a). Chapter 6 is based on (Hofmann et al., 2011a, 2013b). Chapter 7 is based on (Hofmann et al., 2013a). In addition, Chapter 3 combines material from (Hofmann et al., 2011a,c, 2012b, 2013a,b,c). Finally, this thesis draws from insights and experiences gained in (Besser et al., 2010; Hofmann et al., 2008, 2009a,b, 2010a,b, 2011b; Lubell-Doughtie and Hofmann, 2012; Tsivtsivadze et al., 2012).