A Decision Theoretic Framework for Ranking using Implicit Feedback

Onno Zoeter, Michael Taylor, Ed Snelson, John Guiver, Nick Craswell, Martin Szummer
Microsoft Research Cambridge
7 J J Thomson Avenue
Cambridge, United Kingdom
{onnoz,mitaylor,esnelson,joguiver,nickcr,szummer}@microsoft.com

ABSTRACT

This paper presents a decision theoretic ranking system that incorporates both explicit and implicit feedback. The system has a model that predicts, given all available data at query time, different interactions a person might have with search results. Possible interactions include relevance labelling and clicking. We define a utility function that takes as input the outputs of the interaction model to provide a real valued score to the user's session. The optimal ranking is the list of documents that, in expectation under the model, maximizes the utility for a user session. The system presented is based on a simple example utility function that combines both click behavior and labelling. The click prediction model is a Bayesian generalized linear model. Its notable characteristic is that it incorporates both weights for explanatory features and weights for each query-document pair. This allows the model to generalize to unseen queries but makes it at the same time flexible enough to keep in memory where the model should deviate from its feature based prediction. Such a click-predicting model could be particularly useful in an application such as enterprise search, allowing on-site adaptation to local documents and user behaviour. The example utility function has a parameter that controls the tradeoff between optimizing for clicks and optimizing for labels. Experimental results in the context of enterprise search show that a balance in the tradeoff leads to the best NDCG and good (predicted) clickthrough.

Categories and Subject Descriptors
H.. [Information Systems Applications]:

Keywords
clickthrough, learning, ranking, metrics.
1. INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Presented at LR4IR Workshop at SIGIR '08, Singapore. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

This paper presents a system for learning to rank in a decision-theoretic framework. In such a framework each potential top-k ranking is thought of as an action that could be made by the search engine. Retrieval is then a decision procedure: choosing an optimal action according to a given utility function. The decision theoretic view of IR has a long-standing tradition (see e.g. [, 4, 8] for successful uses). In this paper we explore the idea of using it to learn a ranker based on multiple streams of feedback. The utility function is then not only based on judge labels, but also on characteristics of a user's session. A model is learned on historic data to predict the user's interaction with a result list. Although many characteristics of the user's session could be incorporated in such a utility function, we will mainly concentrate on one particular and important one, namely clicks.

The reason to consider both labels and clicks in the utility function is that each provides a different sort of relevance information:

Quantity and cost. Click information is available at zero cost as long as the system has some users, and the quantity depends on the level of user activity. Relevance judges are usually paid, so the quantity of labels depends on budget.

Explicitness. Judges give explicit relevance labels. With clicks, dwell-time, and abandonment, relevance information must be inferred.

Real user population. Clicks come from the true user population, so may reflect real relevance. Relevance judges in laboratory conditions may disagree with the real users.

Deep/negative judgments.
Relevance judges can be paid to label a large pool of documents per query, including many bad documents. Clicks tend to happen only on top-ranked documents, and gathering negative click information has a detrimental effect on users, because the bad documents must be retrieved near the top.

The question is how to build a model that works well, incorporating explicit and implicit relevance information. One approach (Figure 1a) is to choose an evaluation measure as the gold standard for relevance, such as the label-based metric DCG [6], and build a model to optimize it such as LambdaRank [2] or SoftRank []. The inputs may be features characterizing the quality of the query-document

match. Historical implicit feedback can be incorporated as additional input features [1]. The output of the model gives a scalar-valued score by which documents are ranked via a sort. Note here that the value of an individual document's score has no practical interpretation.

Figure 1: Two different approaches to the incorporation of implicit feedback into a ranker; a) uses historic user behavior as input to predict a single relevance score R. Approach b), proposed in this paper, constructs the best possible model to explain outputs y from inputs x and separately defines a utility function U that puts a preference ordering on possible explicit and implicit behaviors.

Our approach (Figure 1b) is different and based on an extension of the decision theoretic framework for IR, as described in [4, 8]. The inputs and outputs of the model are all observable: inputs are query-document features and outputs are implicit/explicit relevance information. The sole task of the model is predicting outputs. We then define a utility U which is a function of these predicted outputs, namely both implicit and explicit judgments and behaviors. Ranking is then a decision procedure, to find the results list with maximum utility.

The specific contributions of this paper are as follows.

We propose to construct rankers that combine many sources of information using the decision theoretic framework for IR. We discuss what an ideal setup would look like and how it would add diversity to result lists, correctly incorporate real world characteristics such as position bias, balance authoritativeness and popularity, and more.

As an initial implementation of the approach, we present a Bayesian logistic regression model that predicts both relevance judgments and click rates. The model has one weight per query-document pair that acts as a memory of the historic click rate that is not already explained by the other features. We combine it with a crude utility function that is far from the ideal sketched setting, but already introduces many of the potential benefits the combination of two datastreams can bring.
We evaluate the decision theoretic system in an enterprise search scenario, demonstrating that the click-predicting part of the model can adapt to a new enterprise.

2. RANKING AS A DECISION THEORY PROBLEM

Decision theory is a very well established field which dates back at least to the works of Daniel Bernoulli in the 18th century. The information retrieval problem of presenting a list of results given a specific user query has been interpreted as a decision theory problem in several studies in the past. The probability ranking principle [] for instance can be motivated from such a view. Interesting and successful applications can also be found in, amongst others, [8, 4].

In this section we first review the abstract decision problem in its general form, and then move on to describe how it can be applied to incorporate both explicit and implicit feedback in a common framework.

At the basis of the decision theoretic view is a utility function. It represents user satisfaction in a single scalar, larger being better. Formally it is a mapping of all relevant quantities of interest (searcher characteristics, query, clicks, dwell time, etc.) to the real line. In the remainder we will make a distinction between two sets of information: inputs and outputs. Inputs are those quantities that are available before a result list needs to be compiled; outputs are those quantities that have become available in the user session after the result list is presented to the user. This includes clicks, dwell time, click backs, etc., but also explicit labels if we ask the user to act as a human judge.

The ideal utility function could be very complex, incorporating detailed characteristics of a user, intent of the query, etc. It would increase if interesting results were found, and decrease as more and more effort is needed to find them. We will discuss some of the potential properties of an ideal utility function in Section 2.1. In real world use we will have to make simplifications, such as is done in Section 2.2.
If we knew ahead of time exactly how a user would interact with a particular search result list it would be easy to select the optimal one. It would simply be the result list that maximizes user satisfaction. Since at query time we do not know the user's response, we need to construct a model that predicts user behavior. The optimal decision (the optimal list) is then the list that in expectation under the model maximizes the user's utility.

In summary and formally we can represent the decision theoretic view of IR as follows. Given a set of inputs (query-document features) $x \in X$ the ranker is asked to select an action (result list) $a \in A$. After performing the action we observe outputs (judgments, user behaviour etc.) $y \in Y$. A utility function $U : X \times A \times Y \to \mathbb{R}$ assigns a scalar utility to the observed $(x, a, y)$-triple. (Note that alternatively we could include $x$ and $a$ into the observation $y$, but this notation emphasizes the flexibility of the approach.) The outputs $y$ in general do not follow deterministically from the inputs $x$ and action $a$. A model $p(y \mid x, a)$ gives the probability of observing $y$ after selecting $a$ when $x$ is observed. The optimal action $a^*$ is the action that leads to the maximal expected utility

$$a^* = \arg\max_{a} \; E_{p(y \mid x, a)}\left[ U(x, a, y) \right]. \tag{1}$$

We propose to use the traditional decision theoretic interpretation of IR to combine multiple sources of data in a principled way. We treat the different sources of implicit feedback as extra dimensions in the output vector $y$.

2.1 Utility functions

The utility function gives a real valued score to a user session that represents the user's satisfaction. Thinking about what the ideal utility function would look like can easily be dazzling. For a well defined navigational query such as "What is the next train connection between Cambridge and London Kings Cross?" we might argue that finding the answer gives a fixed utility and any work that needs to be done to get to that point (reading snippets, clicking on potential answer pages, clicking back, etc.) will lead to deductions. But what about informational queries? What is the utility for one, two, or three interesting documents in the result list? Do three interesting documents have three times as much utility as a single one, or is there a law of diminishing returns? What is the cost of a misleading snippet? Some sources are very authoritative, some have very fresh content. How should these two properties be traded off? Should that be done in the same way in all contexts? A small amount of time spent thinking about these things leads easily to an extremely complex function.

Even coming up with a procedure for constructing a utility function is a difficult problem. Here we discuss briefly two approaches.

A first approach would be to conduct lab experiments with users where they are asked to explicitly score their satisfaction with a session. The experiments in [5] form a fascinating approach in that direction, for instance. Assume for simplicity that we have a binary satisfaction signal $t \in \{\text{thumbs up}, \text{thumbs down}\}$, and a simple utility function $U(t = \text{thumbs up}) = 1$ and $U(t = \text{thumbs down}) = 0$. In daily use the explicit satisfaction scores $t$ are not available. To overcome this we could learn, based on $\{t, y\}$-pairs, a special model $p(t \mid y)$ (not to be confused with the output prediction model) and work with a learned utility

$$\tilde{U}(x, y, a) = E_{p(t \mid x, y, a)}\left[ U(t) \right]. \tag{2}$$

Combined with an output prediction model $p(y \mid x, a)$ we could then use Equation (1) for ranking.

In a second approach we ask experts to craft a simple utility $U(x, y, a)$ and iteratively improve it.
Perhaps in the first version only a few sources of feedback are modelled in $y$ and this is expanded in the next, or we find that certain tradeoffs looked good on paper but, used in practice, lead to complaints from real users.

In both approaches constructing the model $p(y \mid x, a)$ is a classical machine learning problem. Using historical $\{x, y, a\}$-triplets we can train and select the appropriate user behavior prediction model. An important benefit in practice is then that the problem of designing a reasonable utility function and constructing a good prediction model can be decoupled. The prediction can be tested on historic data. Adjusting and tuning the utility function can be done incrementally over time without the need of retraining the model with each new attempt.

To summarize: constructing a utility function is a very difficult problem and can leave one with the awkward feeling that a golden standard or ground truth is not available. We would argue that the IR problem simply is this complex. Any choice in a real world system will make some approximation and is likely to require changes and improvements over time.

In Section 2.1.3 we introduce what arguably is the simplest possible utility function that combines both a signal stream of explicit label feedback and a stream of implicit user clicks. It is a simple convex combination of a label based utility and a click based utility, introduced in Sections 2.1.1 and 2.1.2 respectively.

2.1.1 Discounted Cumulative Gain

In some approaches to ranking the aim is to maximize a function of the labels in the result set. It is easy to see that these approaches form a special case of the framework considered here. If we look at the discounted cumulative gain (DCG) [6] for instance we see that it is an example of a utility function that only takes into account the human relevance judgments at every position. It is based on a discount function $d(p)$ over positions $p \in \{1, \dots, n\}$, and a gain function $g(s)$ over human relevance judgments, e.g. $s \in \{1, \dots, 5\}$.
The position discount function is monotonically decreasing from the top position $p = 1$ to the bottom position $p = n$: $d(1) > d(2) > \dots > d(n)$, and the gain function $g(s)$ is increasing for better relevance judgments: $g(1) \le g(2) \le \dots \le g(5)$. If $s[1], \dots, s[n]$ are the scores received for the documents selected by $a$, the discounted cumulative gain is given by

$$\mathrm{DCG}(s[1], \dots, s[n]) = \sum_{p=1}^{n} d(p)\, g(s[p]). \tag{3}$$

To maximize the DCG we would select and rank such that the expected DCG is highest. The expectation is then with respect to the observation model $p(y \mid x, a) = p(s[1], \dots, s[n] \mid x, a)$, which represents the best estimate of the human relevance judgments for the documents selected by $a$ given $x$:

$$a^* = \arg\max_{a} \; E_{p(s[1], \dots, s[n] \mid x, a)}\left[ \sum_{p=1}^{n} d(p)\, g(s[p]) \right].$$

Different choices of $g(s)$ lead to different ranking principles (decision rules). If $g(s)$ is convex in $s$ the resulting principle is risk seeking: for two documents with the same expected judgment but different variances the document with the larger variance is preferred. This is because a larger than expected judgment leads to a bigger rise in utility than the decrease in utility that results if a lower than expected judgment is encountered. We could say that such a convex gain function leads to a "going for the jackpot" effect. The often used exponential function $g(s) = 2^s$ has this effect. It is important to realize that this is not a conservative ranking principle.

If we have a linear gain $g(s) = s$, the expected utility only involves the expectations of judgments:

$$a^* = \arg\max_{a} \; E_{p(s[1], \dots, s[n] \mid x, a)}\left[ \sum_{p=1}^{n} d(p)\, g(s[p]) \right] = \arg\max_{a} \sum_{p=1}^{n} d(p)\, E_{p(s[1], \dots, s[n] \mid x, a)}\left[ s[p] \right],$$

hence we get a ranking principle that simply orders documents according to their expected human relevance judgment:

$$a^* = \arg\max_{a} \sum_{p=1}^{n} d(p)\, E_{p(s[p] \mid x, a)}\left[ s[p] \right].$$
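The contrast between a convex and a linear gain can be checked numerically. The following minimal sketch compares two hypothetical documents whose predicted judgment distributions have the same mean but different variance. The distributions and function names are assumptions made for illustration; only the gain functions $g(s) = 2^s$ and $g(s) = s$ are taken from the text.

```python
def expected_gain(judgment_probs, gain):
    """Expected gain E[g(s)] for a single document.

    judgment_probs maps a relevance judgment s in {1, ..., 5} to its
    predicted probability (a hypothetical model output).
    """
    return sum(prob * gain(s) for s, prob in judgment_probs.items())


def exponential_gain(s):
    return 2 ** s  # convex gain, g(s) = 2^s


def linear_gain(s):
    return s  # linear gain, g(s) = s


# Two hypothetical documents with the same expected judgment (3.0).
safe_doc = {3: 1.0}            # always judged 3
risky_doc = {1: 0.5, 5: 0.5}   # judged 1 or 5 with equal probability

for name, doc in [("safe", safe_doc), ("risky", risky_doc)]:
    print(name,
          expected_gain(doc, linear_gain),
          expected_gain(doc, exponential_gain))
```

Under the linear gain the two documents are tied, whereas the exponential gain prefers the high-variance document, which is the risk-seeking behaviour described above.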

This utility function is an example where the optimal action can be found in $O(|D|)$ time, where $|D|$ is the number of documents in the corpus. This is despite the fact that the space of all possible selections and rankings is $|D|^n$. This is due to the fact that the judgment probability $p(s[p] \mid x, a)$ is not explicitly a function of position (the judge is presented with each document independently). This means that the expected judgment can be computed for each document and the documents simply sorted to obtain the optimal ranking. There are many interesting utility functions that lead to $O(|D|)$ ranking principles, but in general approximations might be necessary.

Note that, since there is no element in the utility function that encourages diversity in the results, we need to explicitly add the constraint that links to documents cannot be replicated. Otherwise $a^*$ would be $n$ duplications of the link with the highest expected relevance judgment.

2.1.2 Clicks

An analogous utility function that only takes into account whether or not a user clicked on a document could be given by a click-DCG utility

$$U_{\text{clicks}}(c[1], \dots, c[n]) = \sum_{p=1}^{n} d(p)\, c[p]. \tag{4}$$

If $p(c[p] = 1 \mid x, a)$ (the probability of a click on the document that was put in position $p$ by $a$) is modeled based on a link specific and a position specific contribution it will in general not simplify to an $O(|D|)$ ordering rule. This is because now $p(c[p] \mid x, a)$ is explicitly a function of $p$: any given document will be clicked with different probability depending on where it is placed. It can be that position and link effects combine in complex non-linear ways. However there are suitable heuristics for ordering in $O(|D|)$, e.g. compute the probability a document will be clicked if it were placed in position 1, and order by that.

This click-DCG assigns a positive utility to the act of clicking itself. Philosophically this is not really sound, since the act of clicking is actually a nuisance, and only from the actual reading of an interesting document is utility obtained.
So in order to motivate (4) we need to appeal to an argument along the lines of the learned utility in (2): because we have established in the past that the act of clicking on an interesting link leads to an interesting page we can assign an (expected) utility to the act of clicking. However, from a more practical point of view (4) then still has problems. If we motivate the value of a click from an apparent interest in the result page, we assume that all interesting snippets point to interesting landing pages. This will unfortunately not always be the case in practice. To overcome this the utility can be extended by incorporating a minimal dwell time as a proxy for an endorsement of the landing page.

To encourage diversity, one simple approach would be to introduce a concave function $f$ of the simple DCG-like sum of clicks:

$$U_{\text{clicks page}}(c[1], \dots, c[n]) = f\left( \sum_{p=1}^{n} d(p)\, c[p] \right). \tag{5}$$

This captures the notion that the step from 0 clicks to 1 click on a page is bigger than that from 1 to 2. The transformed utility would penalize systems with click-DCG near zero. For an ambiguous query with several types of result, a ranking optimized to avoid zero click-DCG could potentially present results of each type, hedging its bets by giving a more diverse results list.

To take advantage of this type of diversity-encouraging utility, one must combine it with a model that can capture correlations between click events on different documents on a page. For instance, for ambiguous queries, clicks on links to two different interpretations will in general be anti-correlated: someone clicking on a link of one type will be less likely to also click on a link of the other type, presuming they have one interpretation of the query in mind when searching. To do this requires a model for the joint distribution $p(c[1], \dots, c[n] \mid x, a)$, which is in general a difficult modeling task. An independence assumption $p(c[1], \dots, c[n] \mid x, a) = \prod_{p=1}^{n} p(c[p] \mid x, a)$ does not capture these correlations, but is a reasonable simplifying modeling assumption if one is using the more straightforward click-DCG utility of (4).
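The interplay between a concave transform and anti-correlated clicks can be illustrated with a toy joint click model. In the sketch below, the discount values, the intent probabilities, the document names and the choice $f = \sqrt{\cdot}$ are all hypothetical; each simulated user has exactly one interpretation of the query in mind and clicks only on matching documents, which is the extreme case of anti-correlation.

```python
import math

DISCOUNT = [1.0, 0.5]  # hypothetical d(1), d(2)

# Hypothetical joint click model for an ambiguous query: each user has
# exactly one interpretation in mind and clicks only matching documents.
INTENT_PROBS = {"A": 0.6, "B": 0.4}
DOC_INTENT = {"a1": "A", "a2": "A", "b1": "B"}


def click_dcg(clicks):
    """Click-DCG of Equation (4) for one observed click vector."""
    return sum(d * c for d, c in zip(DISCOUNT, clicks))


def expected_utility(ranking, f):
    """E[f(click-DCG)] of a ranking under the joint click model above."""
    total = 0.0
    for intent, prob in INTENT_PROBS.items():
        clicks = [1 if DOC_INTENT[doc] == intent else 0 for doc in ranking]
        total += prob * f(click_dcg(clicks))
    return total


def identity(u):
    return u  # plain click-DCG, Equation (4)


same_type = ["a1", "a2"]  # both results serve interpretation A
diverse = ["a1", "b1"]    # one result per interpretation

for name, f in [("linear", identity), ("concave", math.sqrt)]:
    print(name, expected_utility(same_type, f), expected_utility(diverse, f))
```

With these numbers the plain click-DCG favours the single-interpretation list, while the concave utility favours the diverse list, because it penalizes the sessions in which nothing is clicked.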
2.1.3 Combinations of basic utilities

The decision theoretic framework allows for a principled trade-off between desired behavior of the searcher and relevance cues from a selected set of human judges. In general the utility function should depend on both. A straightforward scheme is to take a weighted combination of the basic utility functions presented in Sections 2.1.1 and 2.1.2.

2.2 Properties of the basic click-label utility

In the experiments in Section 3 we will use a utility function as sketched in Section 2.1.3:

$$U(y) = (1 - \lambda)\, U_{\mathrm{DCG}}(y) + \lambda\, U_{\text{clicks}}(y). \tag{6}$$

The parameter $\lambda$ is still a design choice in this parametric form. As argued in Section 2.1.2 the click part in the utility function (6) is only weakly motivated by the guiding principles, but it forms a good starting point since it already captures a few interesting characteristics from the two data streams.

If there is noise in the labeled set, or if the model makes poor label predictions for a query, a suboptimal ordering can be corrected by clicks.

If the model correctly predicts labels but there are ties, a top three of only good documents say, users effectively vote with their mouse which one they prefer.

Since the framework consists of a model that predicts clicks based on features, an improvement in the ranking for popular queries also extends to unseen queries. For instance if Excel documents prove to be popular in a particular search context, they can be boosted for all queries in that context.

Effectively the click based component in the utility will boost results that are predicted to be popular. If judges are instructed to label according to authority, the $\lambda$ parameter allows us to trade off popularity and authority. For instance, in experiments with web search data we found that for the query "adobe" the url www.adobe.com is predicted to get the highest label, but the acrobat reader download page is the most popular. One could argue that the ideal result list has www.adobe.com at the top position and the link to the reader as the second link. This was the list returned in our experiments with $\lambda = 0.5$.
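A minimal sketch of how the trade-off in Equation (6) changes an ordering is given below. It assumes a linear gain and uses the position-1 click probability heuristic of Section 2.1.2, so that both terms share the same discount and the ranking reduces to sorting by a per-document score. The document names, the predicted values and the rescaling of judgments to [0, 1] are assumptions of this sketch, not part of the paper.

```python
def rank_by_mixed_utility(docs, lam):
    """Order documents by (1 - lam) * label part + lam * click part.

    docs maps a document name to (expected_judgment, click_prob_at_top),
    both hypothetical model predictions. Judgments on the 1..5 scale are
    rescaled to [0, 1] so that the two terms are comparable; that
    rescaling is an assumption of this sketch, not part of Equation (6).
    """
    def score(item):
        expected_judgment, click_prob = item[1]
        label_part = (expected_judgment - 1.0) / 4.0
        return (1.0 - lam) * label_part + lam * click_prob

    ordered = sorted(docs.items(), key=score, reverse=True)
    return [name for name, _ in ordered]


# Hypothetical predictions: doc_a is authoritative, doc_c is popular.
predictions = {
    "doc_a": (4.0, 0.20),
    "doc_b": (3.5, 0.60),
    "doc_c": (2.0, 0.65),
}

for lam in (0.0, 0.5, 1.0):
    print(lam, rank_by_mixed_utility(predictions, lam))
```

With these numbers the label-only and click-only orderings are exact reverses of each other, and the balanced setting promotes the document that does reasonably well on both signals.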

By having the position of a document as part of the inputs $x$ and fitting appropriate weights in the model, position bias is automatically accounted for.

If $x$ contains characteristics of the user, the ranker automatically gives a personalized result.

If the model is sophisticated enough such that it captures the interaction between documents, e.g. predicting that the probability of being clicked for near duplicate documents is anti-correlated, the ranker will, with a click-utility component from (5), automatically diversify the result list.

Figure 2: The model implemented in this paper sets out to predict two things: namely the probability of a click event $p(y_c \mid x, w)$ and the probability of a particular relevance judgment $p(y_j \mid x, w)$. The GLM model implies that the two sub-models factorize, and thus can be learned independently. (Panels: judgment model $p(l \mid x, w)$ and click model $p(c \mid x, w)$, sharing the input $x$.)

3. ON-SITE ADAPTATION OF INTRANET SEARCH SYSTEMS

An interesting application of the decision theoretic framework is in enterprise search. When a search system is installed out-of-the-box its ranker is based on a generic training set. Since intranets and their user bases can be quite diverse, it makes sense to use implicit feedback to adapt the ranker to the specific site for which it will be used.

It is generally difficult to obtain judged queries complete with clickthrough data from external organizations. Hence for this work, we were obliged to test the adaptation framework using an artificial corpus split created from data obtained from the intranet of a single large multinational software corporation. To reflect a significant change from the train set to the adaptation set, we created a split of our queries. For training, we use all queries and documents concerning the general areas of administration and marketing. For the adaptation set, simulating a potentially very different intranet site, we use queries and documents of a technical nature.

The admin/marketing dataset used to train the out-of-the-box model consists of 546 queries.
For each query, documents from the top of a ranked list from a baseline ranker were judged, and some of them had click information. The click-prediction part of the model is further trained using the adaptation query set, consisting of technical queries. This simulates the on-site adaptation of the system to the users' clicks in the enterprise. In this case, the explicit judgments are not used for adapting the model, but instead used for evaluation only.

The click data we use is noteworthy in the following sense. We record not only the clicked documents, but also the documents that are skipped, or passed over, on the way to a click. In this work, inspired by [7], we assume a sequential scan of the result list, and as a consequence, that any document that is above the last click on the list is examined. In this way, we can aggregate the number of clicks and the number of examinations for a given query-document pair: a document which is clicked each time it is examined is intuitively a good result, and a document that is rarely clicked having been examined is probably a poor result. Importantly, we cannot infer much about the relevance of documents that have few examinations. This can happen if a result is either low in the ranking, or near the top yet just below a very good result.

3.1 A Bayesian generalized linear model

In this first illustration we use a generalized linear model (GLM) [9] for $p(y \mid x, a)$. A GLM consists of a likelihood model $p(y \mid \theta)$, a linear combination of inputs $x$ and model weights $w$: $x^\top w$, and a link function $g(\theta)$ that maps the parameter $\theta$ to the real line. In this section we will use building blocks that have a binomial likelihood model and a probit link function. In a generative model interpretation the inverse probit link function

$$g^{-1}(s) = \Phi\left(s;\, 0,\, \tfrac{1}{\pi}\right)$$

plays a central role. This inverse link function is the well known cumulative normal function that maps the outcome of the inner product $x^\top w \in \mathbb{R}$ to the $[0, 1]$ space of the success parameter $\theta$ in the binomial. The inverse precision $\pi^{-1}$ can be set to an arbitrary number to fix the scale of the $s$-space.
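The click and examination counts described earlier in this section can be aggregated along the following lines. This is a minimal sketch under stated assumptions: the session format, the function name and the example log are hypothetical, the last clicked document itself is counted as examined, and sessions without any click are taken to contribute no examinations.

```python
from collections import defaultdict


def aggregate_clicks(sessions):
    """Count clicks and examinations per (query, document) pair.

    Each session is (query, ranked_docs, clicked_positions) with 0-based
    positions. Following the sequential-scan assumption, every document
    at or above the last click counts as examined; sessions without any
    click contribute nothing (an assumption of this sketch).
    """
    clicks = defaultdict(int)
    exams = defaultdict(int)
    for query, ranked_docs, clicked_positions in sessions:
        if not clicked_positions:
            continue
        last_click = max(clicked_positions)
        for pos, doc in enumerate(ranked_docs[: last_click + 1]):
            exams[(query, doc)] += 1
            if pos in clicked_positions:
                clicks[(query, doc)] += 1
    return clicks, exams


# Hypothetical click log for a single query.
log = [
    ("visual studio", ["d1", "d2", "d3"], {1}),     # d1 skipped, d2 clicked
    ("visual studio", ["d1", "d2", "d3"], {0, 1}),  # d1 and d2 clicked
    ("visual studio", ["d1", "d2", "d3"], set()),   # no click at all
]
clicks, exams = aggregate_clicks(log)
print(dict(clicks))
print(dict(exams))
```

In this toy log d3 is never examined, so nothing can be inferred about it, which mirrors the caveat about documents with few examinations.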
Here we will put a Gamma prior on $\pi$ and integrate it out to obtain a robust model. If we have $N$ examples in our training set for which the inputs have value $x$, and we observe $c$ positive outcomes, the likelihood becomes:

$$p(c \mid x, w) = \mathrm{Bin}\left(c;\, g^{-1}\left(x^\top w\right),\, N\right). \tag{7}$$

In Figure 2 we show a more detailed version of Figure 1b, where we are more explicit about what we set out to predict with the model. In this initial implementation, the output $y$ in the model describes for each position $p = 1, \dots, n$ a single implicit feedback: the click event $y_c$, and a single explicit feedback: the relevance judgment $y_j$.

Figure 4 shows the ordinal regression submodel $p(l \mid x, w)$, which is a generalization of the click model. Instead of one of two outcomes it has one of five possible label values it can output. Along with the other weights we therefore also learn four boundary values $b_1, \dots, b_4$ that mark the edges in $s$-space of the five categories. Each has a Gaussian prior. The IsBetween factor in the figure represents two step functions that bound the interval for label $l$. Added to the sum

$s$ is a Gaussian disturbance with inverse precision $\pi^{-1}$. This disturbance can be interpreted as a softening of the step function such that some noise in the label is supported by the model. It is the direct analog of the choice of a probit link function instead of a hard step function in the discrete click prediction case.

Figure 3: Indicator binary features and the GLM. This is a specific example of the click model shown on the right in Figure 2. Here the inputs $x$ are made explicit as one real-valued feature (BM25) and five bags of binary features (file types, URL lengths, ranking positions, queries, and query-document pairs). The output is the predicted probability of a click.

3.2 Features

Figure 3 takes a more detailed look at the inputs (features) used for just the click model GLM. The input $x$ contains parts that are query specific, parts that are document specific, and parts that are derived from the query-document pair. A BM25F score [] is used as a general input that indicates the match between query and document. Document specific features include the document file type (e.g. Html, Pdf, Excel etc.) and the length of the url. Apart from these basic descriptive features, the vector $x$ includes binary indicators for the query ID and the query-document ID, and also the rank (position) of the document in the list for which the click event was observed or is to be predicted. The descriptive features give the model the ability to generalize between queries and documents, and the identifier (ID) weights effectively serve as an instance-specific memory. For frequently seen documents for popular queries the model can store, using the identifier weights, very accurate click predictions, even if they are far from the general trend predicted by the descriptive features. A bias term that is always 1 is included to capture a grand average.
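The way descriptive features and identifier weights combine in the click model can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the feature names and weight values are hypothetical, the weights are point estimates rather than the Gaussian distributions maintained by the Bayesian procedure, and the precision is fixed instead of being integrated out.

```python
import math


def probit_inverse_link(s, precision=1.0):
    """Cumulative normal Phi(s; 0, 1/precision), mapping s to [0, 1]."""
    return 0.5 * (1.0 + math.erf(s * math.sqrt(precision) / math.sqrt(2.0)))


def click_probability(active_features, weights, bm25f, precision=1.0):
    """Predicted click probability for one query-document-position triple.

    active_features lists the names of the binary indicators that are 1;
    indicators without a learned weight (e.g. a new query-document ID)
    contribute 0 to the sum.
    """
    s = weights["bias"] + weights["bm25f"] * bm25f
    s += sum(weights.get(name, 0.0) for name in active_features)
    return probit_inverse_link(s, precision)


# Hypothetical point estimates of the weights.
weights = {
    "bias": -1.0,               # grand average (bias input is always 1)
    "bm25f": 0.05,              # weight of the real-valued match score
    "type:excel": 0.3,          # file type indicator
    "pos:1": 0.5,               # ranking position indicator
    "query_doc:q17|d42": 1.2,   # memory for one frequently seen pair
}

seen = click_probability(
    ["type:excel", "pos:1", "query_doc:q17|d42"], weights, bm25f=10.0)
unseen = click_probability(
    ["type:excel", "pos:1", "query_doc:q99|d42"], weights, bm25f=10.0)
print(seen, unseen)
```

For the unseen pair the prediction falls back on the descriptive features alone, while the identifier weight shifts the prediction for the pair that has been observed often.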
3.3 Training

To learn the distributions of $w$ we use the approximate Bayesian inference procedure from [5] with a factorized Gaussian prior. The ordinal regression part is treated as in [3], with the difference that here we do not resort to an ML-II approximation of $\pi$ but integrate it out. The main benefit of the Bayesian procedure is that with each individual weight in $w$ a notion of the uncertainty about its optimal value is maintained. This results in a learning algorithm that correctly updates imprecisely determined weights more than precisely determined ones, which is essential for our model. The weights for descriptive features effectively see a lot more data than the query and document specific identifier weights. The Bayesian update rules ensure that each gets updated at the right rate; in particular, a small number of examinations will not change the weight distributions nearly as much as a large number. This is something that could not easily be handled in, for example, maximum likelihood approaches.

3.4 Results

Before any implicit feedback data is available the ranker is based on a model that predicts clicks and labels. The utility we used in the experiments is a simple weighted combination between DCG and click-DCG as given in Equation (6). The specific setting of $\lambda$ in this utility is a design choice. The dotted line in Figure 5 shows, for the out-of-the-box model, the NDCG@ on the adaptation set as a function of $\lambda$. We see that using the click utility ($\lambda = 1$) actually reduces the NDCG@ score. This is to be expected, since the NDCG score does not depend on observed clicks. The utility in (6)

with $\lambda = 0$ is equivalent to DCG, and setting $\lambda$ to another value encourages the ranker to optimize a different metric than the NDCG@ shown on the y-axis in Figure 5.

If we use two months of adaptation data, i.e. the site specific click feedback, we get the analogous solid/crossed curve in Figure 5. Here we see that incorporating clicks leads to an improvement of NDCG@. A value for $\lambda$ other than 0 and 1 leads effectively to the combination of the two datasets (the train set and the adaptation set). This improves performance, even if we measure the performance of the resulting system with NDCG, an evaluation metric that does not reward clicks.

The lower dashed line in Figure 5 represents a BM25F baseline, with no click data. We note that it is a horizontal line since ranking by BM25F does not involve a $\lambda$ parameter. We see a gain in NDCG@ from the features alone ($\lambda = 0$) and an additional two-point gain if we set $\lambda = 0.5$.

Figure 4: This is a specific example of the label model shown on the left in Figure 2. The inputs are the same as for the click model. However the output is handled differently. First noise is added to the variable $s$; the result is then constrained to lie between the two threshold variables which correspond to the observed label. Thresholds and noise precision are learnt in addition to the weights.

3.4.1 A proxy for a click-metric

The NDCG metric reported in Figure 5 is a well-known metric, but if $\lambda \neq 0$ it is strictly speaking not the metric that the ranker seeks to optimize. If it is decided that (6) with a particular value for $\lambda$ is the utility that represents end user satisfaction the best, then that utility should be the final evaluation metric. Ideally we would like to test a ranker in an on-line setting where it can control the results lists. In such a setting we could monitor the accumulated utility by the ranker.
However, since we only have historic data available we use the following proxy click metric:

$$S_{\text{clicks}} = \frac{\sum_{p=1}^{n} d(p)\, N_c(p)}{\sum_{p=1}^{n} N_e(p)} \tag{8}$$

where we denote the total number of clicks for the document on position $p$ with $N_c(p)$ and the number of examinations with $N_e(p)$. The numerator in (8) uses the same discount function $d(p)$ as used in (3). Extra in this proxy evaluation metric is the normalization represented by the denominator. This ensures that documents that were not shown to users in the dataset (and hence have 0 clicks and 0 examinations) are properly disregarded. The score above is for a single query, and the total score would be the average over all queries.

Figure 5: NDCG@ scores for the different rankers as a function of $\lambda$, the relative weight given to the click-part in a combined utility function. (Vertical axis: NDCG@ on test set; horizontal axis: $\lambda$, from judgement utility to click utility; curves: BM25F, no test set clicks, adapted with clicks.)

Figure 6: The click based scores from Equation (8) for the different rankers as a function of $\lambda$. (Vertical axis: click metric on hold-out set; horizontal axis: $\lambda$, from judgement utility to click utility.)

Figure 6 shows a plot analogous to Figure 5, but now with the click-based evaluation metric from (8). We note that this new metric gets better with increasing $\lambda$. This is to be expected: a ranking formed from a utility based predominantly upon predicted click rate should do better when evaluated with a click-based metric. This provides further orthogonal evidence that combining implicit and explicit feedback improves search results.

To get a feel for the qualitative changes that the different choices of utility function imply it is instructive to look at

a specific example.

Table: Reorderings of the top-ranked positions for the "Visual Studio" query.

Relevance Judgments Utility (DCG)
  1. http://vsts
  2. http://develop/vs5field
  3. http://msdnprod/vstudio

Click Utility
  1. http://devdiv
  2. http://msdnprod/vstudio
  3. http://infoweb/c6/visulstudiodotnet

Mixed Utility (λ = 0.5)
  1. http://msdnprod/vstudio
  2. http://vsts
  3. http://devdiv

The table shows the top three results for the query "Visual studio" for λ = 0 (DCG ranking), λ = 1 (click ranking) and λ = 0.5 (balanced ranking). In this example the DCG-based top three are all documents that could claim to be a definitive result for searchers interested in using the Visual Studio product. They were all labeled "good" by human assessors.

If the ranker is using the click-only utility (λ = 1) we see that the top three changes. Of the three good results in the DCG list, the msdnprod link and snippet, containing technical information, is apparently the most appealing to the users in the adaptation phase. The other two documents that have entered the top three reflect different interpretations of the query "Visual studio": the devdiv page gives information about the team that creates Visual Studio, and the infoweb page provides marketing data.

This example demonstrates that there is no unique definition of relevance. If we deem the most popular page to be the most relevant, we should pick the click utility. However, if we want the result list to be more authoritative, a utility based upon explicit judgments might promote pages that are more likely to have been overlooked in a straight snippet-based popularity contest. This advantage of an increased reliability of explicit judgment usually comes with the disadvantage of a single user interpretation of relevance: there is a natural tradeoff between judgment accuracy and result diversity. As the table shows, a mixed utility allows us to find a balance between these two extremes.
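The proxy click metric of Equation (8), i.e. the discounted click counts summed over positions and divided by the summed examination counts, can be illustrated with a minimal Python sketch. The logarithmic discount, the function names, and the choice to score a query without any examinations as zero are assumptions made for illustration only; the paper's own discount d(p) is defined in an earlier equation that is not reproduced here.

```python
import math
from typing import Callable, Sequence, Tuple


def log_discount(p: int) -> float:
    """Assumed DCG-style discount for a 1-based position p (hypothetical choice)."""
    return 1.0 / math.log2(1 + p)


def click_score(
    clicks: Sequence[int],
    exams: Sequence[int],
    discount: Callable[[int], float] = log_discount,
) -> float:
    """Per-query proxy click metric: sum_p d(p) * N_c(p) / sum_p N_e(p).

    clicks[i] and exams[i] are the click and examination counts of the
    document ranked at position i + 1.
    """
    if len(clicks) != len(exams):
        raise ValueError("clicks and exams must have the same length")
    total_exams = sum(exams)
    if total_exams == 0:
        # Assumption: a query whose documents were never shown scores zero.
        return 0.0
    weighted_clicks = sum(
        discount(p) * c for p, c in enumerate(clicks, start=1)
    )
    return weighted_clicks / total_exams


def mean_click_score(
    queries: Sequence[Tuple[Sequence[int], Sequence[int]]],
    discount: Callable[[int], float] = log_discount,
) -> float:
    """Average the per-query score over all queries."""
    scores = [click_score(c, e, discount) for c, e in queries]
    return sum(scores) / len(scores) if scores else 0.0
```

A document that was never shown contributes zero clicks and zero examinations, so appending it to a ranking leaves the score unchanged, which is the effect of the normalization described for (8).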
Including click feedback has had two qualitative effects for the Visual Studio query: (i) a rearrangement, according to local preferences, of documents that are equally good from the external perspective, and (ii) a promotion of alternative interpretations of the query that are common at the specific intranet. Although we present a single example here, we have seen these effects in many other queries, together with a third major effect: (iii) the correction of erroneous human judgments.

4. SUMMARY

In this paper we have explored the decision theoretic framework for IR and studied how it can be used to combine several sources of feedback into a single ranker. The approach is based on a utility function that describes the user satisfaction after a search session, and a model that predicts user actions such as clicking and labelling based on quantities known at query time. Constructing a model and formulating a utility function are both difficult problems. But we observe that in the case of label and click stream data the simplest possible utility function and a reasonable prediction model already give many of the potential benefits that the combination of the two streams can bring.

In experiments in an enterprise search setting, we see that the approach leads to increased performance. Qualitatively we see that mislabelled queries get filtered out, result lists for ambiguous queries change to better reflect the interpretation most often intended by users, and tie breaking of identically labelled results is done according to population preference. In terms of NDCG@, including the click stream in the decision theoretic ranker leads to a two point gain. This is despite the fact that the ranker does not aim to optimize this metric. Given the qualitative results we expect that end user experience improves even more than the two point NDCG gain indicates.

5. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, 2006.
[2] C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
[3] W. Chu and Z. Ghahramani.
Gaussian processes for ordinal regression. JMLR, 6:1019-1041, 2005.
[4] I. J. Cox, M. L. Miller, T. P. Minka, T. Papathomas, and P. N. Yianilos. The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, Special Issue on Image and Video Processing for Digital Libraries, 9(1):20-37, 2000.
[5] S. Fox, K. Karnawat, M. Mydland, S. T. Dumais, and T. White. Evaluating implicit measures to improve the search experience. ACM Transactions on Information Systems, 23:147-168, 2005.
[6] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR, 2000.
[7] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of Knowledge Discovery in Databases, 2002.
[8] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR, pages 111-119, 2001.
[9] P. McCullagh and J. A. Nelder. Generalized Linear Models. CRC Press, 2nd edition, 1989.
[10] S. Robertson, H. Zaragoza, and M. Taylor. A simple BM25 extension to multiple weighted fields. In CIKM, pages 42-49, 2004.
[11] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33:294-304, 1977.
[12] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In WSDM 2008, pages 77-86. ACM, 2008.
[13] S. K. M. Wong, P. Bollmann, and Y. Y. Yao. Information retrieval based on axiomatic decision theory. General Systems, 19(2):101-117, 1991.
[14] O. Zoeter. Bayesian generalized linear models in a terabyte world. In IEEE Conference on Image and Signal Processing and Analysis, 2007.