Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising


Journal of Machine Learning Research 14 (2013). Submitted 9/12; Revised 3/13; Published 11/13.

Léon Bottou
Microsoft, 1 Microsoft Way, Redmond, WA 98052, USA

Jonas Peters
Max Planck Institute, Spemannstraße, Tübingen, Germany

Joaquin Quiñonero-Candela
Denis X. Charles
D. Max Chickering
Elon Portugaly
Dipankar Ray
Patrice Simard
Ed Snelson
Microsoft, 1 Microsoft Way, Redmond, WA 98052, USA

Abstract

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.

Keywords: causation, counterfactual reasoning, computational advertising

1. Introduction

Statistical machine learning technologies in the real world are never without a purpose. Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified the system design in the first place.

Such contradictions appear very clearly in the case of the learning systems that power web scale applications such as search engines, ad placement engines, or recommendation systems. For instance, the placement of advertisements on the result pages of Internet search engines depends on the bids of advertisers and on scores computed by statistical machine learning systems. Because the scores affect the contents of the result pages proposed to the users, they directly influence the occurrence of clicks and the corresponding advertiser payments. They also have important indirect effects.
Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this web site in the future. They also impact the return on investment observed by the advertisers and therefore their future bids. Finally they change the nature of the data collected for training the statistical models in the future.

Current address: Jonas Peters, ETH Zürich, Rämistraße 101, 8092 Zürich, Switzerland.
Current address: Joaquin Quiñonero-Candela, Facebook, 1 Hacker Way, Menlo Park, CA 94025, USA.
© 2013 Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard and Ed Snelson.

These complicated interactions are clarified by important theoretical works. Under simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful account of the advertiser feedback loop (Varian, 2007; Edelman et al., 2007). Under simplified assumptions, multiarmed bandits theory (Robbins, 1952; Auer et al., 2002; Langford and Zhang, 2008) and reinforcement learning (Sutton and Barto, 1998) describe the exploration/exploitation dilemma associated with the training feedback loop. However, none of these approaches gives a complete account of the complex interactions found in real-life systems.

This contribution proposes a novel approach: we view these complicated interactions as manifestations of the fundamental difference that separates correlation and causation. Using the ad placement example as a model of our problem class, we therefore argue that the language and the methods of causal inference provide flexible means to describe such complex machine learning systems and give sound answers to the practical questions facing the designer of such a system. Is it useful to pass a new input signal to the statistical model? Is it worthwhile to collect and label a new training set? What about changing the loss function or the learning algorithm? In order to answer such questions and improve the operational performance of the learning system, one needs to unravel how the information produced by the statistical models traverses the web of causes and effects and eventually produces measurable performance metrics.

Readers with an interest in causal inference will find in this paper (i) a real world example demonstrating the value of causal inference for large-scale machine learning applications, (ii) causal inference techniques applicable to continuously valued variables with meaningful confidence intervals, and (iii) quasi-static analysis techniques for estimating how small interventions affect certain causal equilibria.
Readers with an interest in real-life applications will find (iv) a selection of practical counterfactual analysis techniques applicable to many real-life machine learning systems. Readers with an interest in computational advertising will find a principled framework that (v) explains how to soundly use machine learning techniques for ad placement, and (vi) conceptually connects machine learning and auction theory in a compelling manner.

The paper is organized as follows. Section 2 gives an overview of the advertisement placement problem which serves as our main example. In particular, we stress some of the difficulties encountered when one approaches such a problem without a principled perspective. Section 3 provides a condensed review of the essential concepts of causal modeling and inference. Section 4 centers on formulating and answering counterfactual questions such as "how would the system have performed during the data collection period if certain interventions had been carried out on the system?" We describe importance sampling methods for counterfactual analysis, with clear conditions of validity and confidence intervals. Section 5 illustrates how the structure of the causal graph reveals opportunities to exploit prior information and vastly improve the confidence intervals. Section 6 describes how counterfactual analysis provides essential signals that can drive learning algorithms. Assume that we have identified interventions that would have caused the system to perform well during the data collection period. Which guarantees can we obtain on the performance of these same interventions in the future? Section 7 presents counterfactual differential techniques for the study of equilibria. Using data collected when the system is at equilibrium, we can estimate how a small intervention displaces the equilibrium. This provides an elegant and effective way to reason about long-term feedback effects. Various appendices complete the main text with information that we think more relevant to readers with specific backgrounds.

2. Causation Issues in Computational Advertising

After giving an overview of the advertisement placement problem, which serves as our main example, this section illustrates some of the difficulties that arise when one does not pay sufficient attention to the causal structure of the learning system.

2.1 Advertisement Placement

All Internet users are now familiar with the advertisement messages that adorn popular web pages. Advertisements are particularly effective on search engine result pages because users who are searching for something are good targets for advertisers who have something to offer. Several actors take part in this Internet advertisement game:

- Advertisers create advertisement messages, and place bids that describe how much they are willing to pay to see their ads displayed or clicked.
- Publishers provide attractive web services, such as, for instance, an Internet search engine. They display selected ads and expect to receive payments from the advertisers. The infrastructure to collect the advertiser bids and select ads is sometimes provided by an advertising network on behalf of its affiliated publishers. For the purposes of this work, we simply consider a publisher large enough to run its own infrastructure.
- Users reveal information about their current interests, for instance, by entering a query in a search engine. They are offered web pages that contain a selection of ads (Figure 1). Users sometimes click on an advertisement and are transported to a web site controlled by the advertiser where they can initiate some business.

A conventional bidding language is necessary to precisely define under which conditions an advertiser is willing to pay the bid amount.
In the case of Internet search advertisement, each bid specifies (a) the advertisement message, (b) a set of keywords, (c) one of several possible matching criteria between the keywords and the user query, and (d) the maximal price the advertiser is willing to pay when a user clicks on the ad after entering a query that matches the keywords according to the specified criterion.

Whenever a user visits a publisher web page, an advertisement placement engine runs an auction in real time in order to select winning ads, determine where to display them in the page, and compute the prices charged to advertisers, should the user click on their ad. Since the placement engine is operated by the publisher, it is designed to further the interests of the publisher. Fortunately for everyone else, the publisher must balance short term interests, namely the immediate revenue brought by the ads displayed on each web page, and long term interests, namely the future revenues resulting from the continued satisfaction of both users and advertisers.

Auction theory explains how to design a mechanism that optimizes the revenue of the seller of a single object (Myerson, 1981; Milgrom, 2004) under various assumptions about the information available to the buyers regarding the intentions of the other buyers. In the case of the ad placement problem, the publisher runs multiple auctions and sells opportunities to receive a click. When nearly identical auctions occur thousands of times per second, it is tempting to consider that the advertisers have perfect information about each other. This assumption gives support to the popular generalized second price rank-score auction (Varian, 2007; Edelman et al., 2007):

- Let $x$ represent the auction context information, such as the user query, the user profile, the date, the time, etc. The ad placement engine first determines all eligible ads $a_1 \dots a_n$ and the corresponding bids $b_1 \dots b_n$ on the basis of the auction context $x$ and of the matching criteria specified by the advertisers.
- For each selected ad $a_i$ and each potential position $p$ on the web page, a statistical model outputs the estimate $q_{i,p}(x)$ of the probability that ad $a_i$ displayed in position $p$ receives a user click. The rank-score $r_{i,p}(x) = b_i \, q_{i,p}(x)$ then represents the purported value associated with placing ad $a_i$ at position $p$.
- Let $L$ represent a possible ad layout, that is, a set of positions that can simultaneously be populated with ads, and let $\mathcal{L}$ be the set of possible ad layouts, including of course the empty layout. The optimal layout and the corresponding ads are obtained by maximizing the total rank-score

  $$\max_{L \in \mathcal{L}} \; \max_{i_1, i_2, \dots} \; \sum_{p \in L} r_{i_p,p}(x), \qquad (1)$$

  subject to reserve constraints

  $$\forall p \in L, \quad r_{i_p,p}(x) \ge R_p(x),$$

  and also subject to diverse policy constraints, such as, for instance, preventing the simultaneous display of multiple ads belonging to the same advertiser. Under mild assumptions, this discrete maximization problem is amenable to computationally efficient greedy algorithms (see Appendix A).
- The advertiser payment associated with a user click is computed using the generalized second price (GSP) rule: the advertiser pays the smallest bid that it could have entered without changing the solution of the discrete maximization problem, all other bids remaining equal. In other words, the advertiser could not have manipulated its bid and obtained the same treatment for a better price.

Figure 1: Mainline and sidebar ads on a search result page. Ads placed in the mainline are more likely to be noticed, increasing both the chances of a click if the ad is relevant and the risk of annoying the user if the ad is not relevant.
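The rank-score auction and the GSP rule can be illustrated with a deliberately simplified sketch. It assumes a single ordered column of slots, click probability estimates that do not depend on the position, one strictly positive reserve shared by all slots, and no policy constraints. None of these simplifications come from the text, and the names `Ad` and `run_auction` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float         # b_i: maximal price the advertiser accepts per click
    click_prob: float  # q_i: estimated click probability (position-independent here)

    @property
    def rank_score(self) -> float:
        # r_i = b_i * q_i
        return self.bid * self.click_prob


def run_auction(ads, num_slots, reserve):
    """Return (ad name, slot index, price per click) for each winning ad.

    Simplified setting (an assumption of this sketch): ordered slots,
    one positive reserve on the rank-score, no policy constraints.
    """
    # Keep only the ads whose rank-score clears the reserve, best first.
    eligible = sorted(
        (ad for ad in ads if ad.rank_score >= reserve),
        key=lambda ad: ad.rank_score,
        reverse=True,
    )
    results = []
    for slot, ad in enumerate(eligible[:num_slots]):
        # Rank-score of the next competitor in the ranking, if any.
        if slot + 1 < len(eligible):
            next_score = eligible[slot + 1].rank_score
        else:
            next_score = 0.0
        # Smallest rank-score that would keep this slot, converted back
        # into a bid: this is the generalized second price.
        threshold = max(next_score, reserve)
        results.append((ad.name, slot, threshold / ad.click_prob))
    return results


example_ads = [
    Ad("A", bid=2.0, click_prob=0.10),
    Ad("B", bid=1.0, click_prob=0.15),
    Ad("C", bid=3.0, click_prob=0.02),
    Ad("D", bid=0.5, click_prob=0.05),  # rank-score below the reserve
]
example_result = run_auction(example_ads, num_slots=2, reserve=0.05)
```

In this toy setting a winner never pays more than its own bid, because its rank-score is at least as large as the threshold it must beat.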

Under the perfect information assumption, the analysis suggests that the publisher simply needs to find which reserve prices $R_p(x)$ yield the best revenue per auction. However, the total revenue of the publisher also depends on the traffic experienced by its web site. Displaying an excessive number of irrelevant ads can train users to ignore the ads, and can also drive them to competing web sites. Advertisers can artificially raise the rank-scores of irrelevant ads by temporarily increasing the bids. Indelicate advertisers can create deceiving advertisements that elicit many clicks but direct users to spam web sites. Experience shows that the continued satisfaction of the users is more important to the publisher than it is to the advertisers.

Therefore the generalized second price rank-score auction has evolved. Rank-scores have been augmented with terms that quantify the user satisfaction or the ad relevance. Bids receive adaptive discounts in order to deal with situations where the perfect information assumption is unrealistic. These adjustments are driven by additional statistical models. The ad placement engine should therefore be viewed as a complex learning system interacting with both users and advertisers.

2.2 Controlled Experiments

The designer of such an ad placement engine faces the fundamental question of testing whether a proposed modification of the ad placement engine results in an improvement of the operational performance of the system.

The simplest way to answer such a question is to try the modification. The basic idea is to randomly split the users into treatment and control groups (Kohavi et al., 2008). Users from the control group see web pages generated using the unmodified system. Users of the treatment groups see web pages generated using alternate versions of the system. Monitoring various performance metrics for a couple of months usually gives sufficient information to reliably decide which variant of the system delivers the most satisfactory performance.
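The random split into treatment and control groups can be sketched as follows. The hash-based assignment and the normal-approximation confidence interval are assumptions chosen for illustration, not procedures described in the text, and the names `assign_group` and `difference_in_means` are hypothetical.

```python
import hashlib
import math


def assign_group(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Pseudo-randomly but reproducibly assign a user to a group.

    Hashing the user id together with the experiment name (an assumption
    of this sketch) keeps each user in the same group for the whole
    experiment while making the split independent of user attributes.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-looking value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


def difference_in_means(treatment, control):
    """Return (difference, half-width of an approximate 95% interval)."""

    def mean_and_variance(values):
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        return mean, variance, n

    mean_t, var_t, n_t = mean_and_variance(treatment)
    mean_c, var_c, n_c = mean_and_variance(control)
    # Normal approximation; reasonable only for large groups.
    half_width = 1.96 * math.sqrt(var_t / n_t + var_c / n_c)
    return mean_t - mean_c, half_width
```

A metric such as clicks per page would be collected separately for each group and compared with `difference_in_means`; the modification is only credited with an effect when the interval excludes zero.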
Modifying an advertisement placement engine elicits reactions from both the users and the advertisers. Whereas it is easy to split users into treatment and control groups, splitting advertisers into treatment and control groups demands special attention because each auction involves multiple advertisers (Charles et al., 2012). Simultaneously controlling for both users and advertisers is probably impossible.

Controlled experiments also suffer from several drawbacks. They are expensive because they demand a complete implementation of the proposed modifications. They are slow because each experiment typically demands a couple of months. Finally, although there are elegant ways to efficiently run overlapping controlled experiments on the same traffic (Tang et al., 2010), they are limited by the volume of traffic available for experimentation.

It is therefore difficult to rely on controlled experiments during the conception phase of potential improvements to the ad placement engine. It is similarly difficult to use controlled experiments to drive the training algorithms associated with click probability estimation models. Cheaper and faster statistical methods are needed to drive these essential aspects of the development of an ad placement engine. Unfortunately, interpreting cheap and fast data can be very deceiving.

2.3 Confounding Data

Assessing the consequence of an intervention using statistical data is generally challenging because it is often difficult to determine whether the observed effect is a simple consequence of the intervention or has other uncontrolled causes.

                                            Overall          Patients with     Patients with
                                                             small stones      large stones
Treatment A: Open surgery                   78% (273/350)    93% (81/87)       73% (192/263)
Treatment B: Percutaneous nephrolithotomy   83% (289/350)    87% (234/270)     69% (55/80)

Table 1: A classic example of Simpson's paradox. The table reports the success rates of two treatments for kidney stones (Charig et al., 1986, Tables I and II). Although the overall success rate of treatment B seems better, treatment B performs worse than treatment A on both patients with small kidney stones and patients with large kidney stones. See Section 2.3.

For instance, the empirical comparison of certain kidney stone treatments illustrates this difficulty (Charig et al., 1986). Table 1 reports the success rates observed on two groups of 350 patients treated with respectively open surgery (treatment A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83% success). Although treatment B seems more successful, it was more frequently prescribed to patients suffering from small kidney stones, a less serious condition. Did treatment B achieve a high success rate because of its intrinsic qualities or because it was preferentially applied to less severe cases? Further splitting the data according to the size of the kidney stones reverses the conclusion: treatment A now achieves the best success rate for both patients suffering from large kidney stones and patients suffering from small kidney stones. Such an inversion of the conclusion is called Simpson's paradox (Simpson, 1951).

The stone size in this study is an example of a confounding variable, that is, an uncontrolled variable whose consequences pollute the effect of the intervention. Doctors knew the size of the kidney stones, chose to treat the healthier patients with the least invasive treatment B, and therefore caused treatment B to appear more effective than it actually was.
If we now decide to apply treatment B to all patients irrespective of the stone size, we break the causal path connecting the stone size to the outcome, we eliminate the illusion, and we will experience disappointing results.

When we suspect the existence of a confounding variable, we can split the contingency tables and reach improved conclusions. Unfortunately we cannot fully trust these conclusions unless we are certain to have taken into account all confounding variables. The real problem therefore comes from the confounding variables we do not know.

Randomized experiments arguably provide the only correct solution to this problem (see Stigler, 1992). The idea is to randomly choose whether the patient receives treatment A or treatment B. Because this random choice is independent of all the potential confounding variables, known and unknown, they cannot pollute the observed effect of the treatments (see also Section 4.2). This is why controlled experiments in ad placement (Section 2.2) randomly distribute users between treatment and control groups, and this is also why, in the case of an ad placement engine, we should be somewhat concerned by the practical impossibility of randomly distributing both users and advertisers.
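The reversal in the kidney stone data can be checked numerically from the published counts. The sketch below recomputes the overall and per-stratum success rates, then adds a stone-size-standardized rate. That last step is an addition for illustration: it is only meaningful under the assumption, not made by the text, that stone size is the sole confounder. The function names are hypothetical.

```python
# (successes, patients) per treatment and stone size, from Charig et al. (1986).
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}
SIZES = ("small", "large")


def stratum_rate(treatment, size):
    successes, patients = COUNTS[treatment][size]
    return successes / patients


def overall_rate(treatment):
    successes = sum(COUNTS[treatment][size][0] for size in SIZES)
    patients = sum(COUNTS[treatment][size][1] for size in SIZES)
    return successes / patients


def standardized_rate(treatment):
    """Success rate reweighted to the pooled stone-size distribution.

    Illustration only: assumes stone size is the only confounder.
    """
    total = sum(COUNTS[t][size][1] for t in COUNTS for size in SIZES)
    rate = 0.0
    for size in SIZES:
        weight = sum(COUNTS[t][size][1] for t in COUNTS) / total
        rate += weight * stratum_rate(treatment, size)
    return rate


if __name__ == "__main__":
    for t in ("A", "B"):
        print(t, round(overall_rate(t), 3),
              [round(stratum_rate(t, s), 3) for s in SIZES],
              round(standardized_rate(t), 3))
```

Treatment B wins on the overall rate, while treatment A wins within each stratum and on the standardized rate, which is the inversion described above.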

              Overall            $q_2$ low         $q_2$ high
$q_1$ low     6.2% (124/2000)    5.1% (92/1823)    18.1% (32/176)
$q_1$ high    7.5% (149/2000)    4.8% (71/1500)    15.6% (78/500)

Table 2: Confounding data in ad placement. The table reports the click-through rates and the click counts of the second mainline ad. The overall counts suggest that the click-through rate of the second mainline ad increases when the click probability estimate $q_1$ of the top ad is high. However, if we further split the pages according to the click probability estimate $q_2$ of the second mainline ad, we reach the opposite conclusion. See Section 2.4.

2.4 Confounding Data in Ad Placement

Let us return to the question of assessing the value of passing a new input signal to the ad placement engine click prediction model. Section 2.1 outlines a placement method where the click probability estimates $q_{i,p}(x)$ depend on the ad and the position we consider, but do not depend on other ads displayed on the page. We now consider replacing this model by a new model that additionally uses the estimated click probability of the top mainline ad to estimate the click probability of the second mainline ad (Figure 1). We would like to estimate the effect of such an intervention using existing statistical data.

We have collected ad placement data for Bing search result pages served during three consecutive hours on a certain slice of traffic. Let $q_1$ and $q_2$ denote the click probability estimates computed by the existing model for respectively the top mainline ad and the second mainline ad. After excluding pages displaying fewer than two mainline ads, we form two groups of 2000 pages randomly picked among those satisfying the conditions $q_1 < 0.15$ for the first group and $q_1 \ge 0.15$ for the second group. Table 2 reports the click counts and frequencies observed on the second mainline ad in each group.
Although the overall numbers show that users click more often on the second mainline ad when the top mainline ad has a high click probability estimate $q_1$, this conclusion is reversed when we further split the data according to the click probability estimate $q_2$ of the second mainline ad.

Despite superficial similarities, this example is considerably more difficult to interpret than the kidney stone example. The overall click counts show that the actual click-through rate of the second mainline ad is positively correlated with the click probability estimate on the top mainline ad. Does this mean that we can increase the total number of clicks by placing regular ads below frequently clicked ads?

Remember that the click probability estimates depend on the search query which itself depends on the user intention. The most likely explanation is that pages with a high $q_1$ are frequently associated with more commercial searches and therefore receive more ad clicks on all positions. The observed correlation occurs because the presence of a click and the magnitude of the click probability estimate $q_1$ have a common cause: the user intention. Meanwhile, the click probability estimate $q_2$ returned by the current model for the second mainline ad also depends on the query and therefore on the user intention. Therefore, assuming that this dependence has comparable strength, and assuming that there are no other causal paths, splitting the counts according to the magnitude of $q_2$ factors out the effects of this common confounding cause. We then observe a negative correlation which now suggests that a frequently clicked top mainline ad has a negative impact on the click-through rate of the second mainline ad.

If this is correct, we would probably increase the accuracy of the click prediction model by switching to the new model. This would decrease the click probability estimates for ads placed in the second mainline position on commercial search pages. These ads are then less likely to clear the reserve and therefore more likely to be displayed in the less attractive sidebar. The net result is probably a loss of clicks and a loss of money despite the higher quality of the click probability model. Although we could tune the reserve prices to compensate for this unfortunate effect, nothing in this data tells us where the performance of the ad placement engine will land. Furthermore, unknown confounding variables might completely reverse our conclusions.

Making sense out of such data is just too complex!

2.5 A Better Way

It should now be obvious that we need a more principled way to reason about the effect of potential interventions. We provide one such more principled approach using the causal inference machinery (Section 3). The next step is then the identification of a class of questions that are sufficiently expressive to guide the designer of a complex learning system, and sufficiently simple to be answered using data collected in the past using adequate procedures (Section 4).

A machine learning algorithm can then be viewed as an automated way to generate questions about the parameters of a statistical model, obtain the corresponding answers, and update the parameters accordingly (Section 6). Learning algorithms derived in this manner are very flexible: human designers and machine learning algorithms can cooperate seamlessly because they rely on similar sources of information.

3. Modeling Causal Systems

When we point out a causal relationship between two events, we describe what we expect to happen to the event we call the effect, should an external operator manipulate the event we call the cause.
Manipulability theories of causation (von Wright, 1971; Woodward, 2005) raise this commonsense insight to the status of a definition of the causal relation. Difficult adjustments are then needed to interpret statements involving causes that we can only observe through their effects, "because they love me," or that are not easily manipulated, "because the earth is round."

Modern statistical thinking makes a clear distinction between the statistical model and the world. The actual mechanisms underlying the data are considered unknown. The statistical models do not need to reproduce these mechanisms to emulate the observable data (Breiman, 2001). Better models are sometimes obtained by deliberately avoiding reproducing the true mechanisms (Vapnik, 1982, Section 8.6). We can approach the manipulability puzzle in the same spirit by viewing causation as a reasoning model (Bottou, 2011) rather than a property of the world. Causes and effects are simply the pieces of an abstract reasoning game. Causal statements that are not empirically testable acquire validity when they are used as intermediate steps when one reasons about manipulations or interventions amenable to experimental validation.

This section presents the rules of this reasoning game. We largely follow the framework proposed by Pearl (2009) because it gives a clear account of the connections between causal models and probabilistic models.

$x = f_1(u, \varepsilon_1)$          Query context $x$ from user intent $u$.
$a = f_2(x, v, \varepsilon_2)$       Eligible ads $(a_i)$ from query $x$ and inventory $v$.
$b = f_3(x, v, \varepsilon_3)$       Corresponding bids $(b_i)$.
$q = f_4(x, a, \varepsilon_4)$       Scores $(q_{i,p}, R_p)$ from query $x$ and ads $a$.
$s = f_5(a, q, b, \varepsilon_5)$    Ad slate $s$ from eligible ads $a$, scores $q$ and bids $b$.
$c = f_6(a, q, b, \varepsilon_6)$    Corresponding click prices $c$.
$y = f_7(s, u, \varepsilon_7)$       User clicks $y$ from ad slate $s$ and user intent $u$.
$z = f_8(y, c, \varepsilon_8)$       Revenue $z$ from clicks $y$ and prices $c$.

Figure 2: A structural equation model for ad placement. The sequence of equations describes the flow of information. The functions $f_k$ describe how effects depend on their direct causes. The additional noise variables $\varepsilon_k$ represent independent sources of randomness useful to model probabilistic dependencies.

3.1 The Flow of Information

Figure 2 gives a deterministic description of the operation of the ad placement engine. Variable $u$ represents the user and his or her intention in an unspecified manner. The query and query context $x$ is then expressed as an unknown function of the user intent $u$ and of a noise variable $\varepsilon_1$. Noise variables in this framework are best viewed as independent sources of randomness useful for modeling a nondeterministic causal dependency. We shall only mention them when they play a specific role in the discussion. The set of eligible ads $a$ and the corresponding bids $b$ are then derived from the query $x$ and the ad inventory $v$ supplied by the advertisers. Statistical models then compute a collection of scores $q$ such as the click probability estimates $q_{i,p}$ and the reserves $R_p$ introduced in Section 2.1. The placement logic uses these scores to generate the ad slate $s$, that is, the set of winning ads and their assigned positions. The corresponding click prices $c$ are computed. The set of user clicks $y$ is expressed as an unknown function of the ad slate $s$ and the user intent $u$. Finally the revenue $z$ is expressed as another function of the clicks $y$ and the prices $c$.

Such a system of equations is called a structural equation model (Wright, 1921).
Each equation asserts a functional dependency between an effect, appearing on the left hand side of the equation, and its direct causes, appearing on the right hand side as arguments of the function. Some of these causal dependencies are unknown. Although we postulate that the effect can be expressed as some function of its direct causes, we do not know the form of this function. For instance, the designer of the ad placement engine knows functions $f_2$ to $f_6$ and $f_8$ because he has designed them. However, he does not know the functions $f_1$ and $f_7$ because whoever designed the user did not leave sufficient documentation.

Figure 3 represents the directed causal graph associated with the structural equation model. Each arrow connects a direct cause to its effect. The noise variables are omitted for simplicity. The structure of this graph reveals fundamental assumptions about our model. For instance, the user clicks $y$ do not directly depend on the scores $q$ or the prices $c$ because users do not have access to this information.

We hold as a principle that causation obeys the arrow of time: causes always precede their effects. Therefore the causal graph must be acyclic. Structural equation models then support two fundamental operations, namely simulation and intervention.
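The two operations can be made concrete with a toy version of the structural equation model of Figure 2. Only the argument lists of the functions follow the figure. Every function body below is a hypothetical stand-in chosen to keep the sketch short, including the stand-ins for the unknown functions $f_1$ and $f_7$, and the helper names `simulate` and `intervene` are likewise assumptions.

```python
# Toy mechanisms: hypothetical bodies, argument lists as in Figure 2.
def f1(u, eps):                      # query context x from user intent u
    return "shoes" if u > 0.5 else "weather"

def f2(x, v, eps):                   # eligible ads a
    return [ad for ad in v if ad["keyword"] == x]

def f3(x, v, eps):                   # corresponding bids b
    return [ad["bid"] for ad in v if ad["keyword"] == x]

def f4(x, a, eps):                   # click probability estimates q
    return [ad["ctr"] for ad in a]

def f5(a, q, b, eps):                # ad slate s: index of the single winner
    if not a:
        return None
    return max(range(len(a)), key=lambda i: b[i] * q[i])

def f6(a, q, b, eps):                # click price c: second price on rank-scores
    winner = f5(a, q, b, eps)
    if winner is None:
        return 0.0
    others = [b[i] * q[i] for i in range(len(a)) if i != winner]
    return max(others, default=0.0) / q[winner]

def f7(s, u, eps):                   # user click y from slate s and intent u
    return int(s is not None and eps < 0.2 * u)

def f8(y, c, eps):                   # revenue z
    return y * c

# Equations listed in their natural time sequence: (effect, function, direct causes).
MODEL = [
    ("x", f1, ("u",)),
    ("a", f2, ("x", "v")),
    ("b", f3, ("x", "v")),
    ("q", f4, ("x", "a")),
    ("s", f5, ("a", "q", "b")),
    ("c", f6, ("a", "q", "b")),
    ("y", f7, ("s", "u")),
    ("z", f8, ("y", "c")),
]

def simulate(model, exogenous, noise):
    """Compute every variable from the exogenous variables and the noise."""
    values = dict(exogenous)
    for name, fn, parents in model:
        values[name] = fn(*(values[p] for p in parents), noise[name])
    return values

def intervene(model, name, new_fn):
    """Derive a new model by rewriting the right-hand side of one equation."""
    return [(n, new_fn if n == name else fn, parents) for n, fn, parents in model]
```

For example, `intervene(MODEL, "q", lambda x, a, eps: [0.05, 0.15])` clamps the scores to constants. Simulating the derived model with the same exogenous variables and noise leaves the upstream variables $x$, $a$ and $b$ unchanged and only alters $q$ and its descendants.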

Figure 3: Causal graph associated with the structural equation model of Figure 2. The mutually independent noise variables $\varepsilon_1$ to $\varepsilon_8$ are implicit. The variables $a$, $b$, $q$, $s$, $c$, and $z$ depend on their direct causes in known ways. In contrast, the variables $u$ and $v$ are exogenous and the variables $x$ and $y$ depend on their direct causes through unknown functions.

- Simulation: Let us assume that we know both the exact form of all functional dependencies and the value of all exogenous variables, that is, the variables that never appear in the left hand side of an equation. We can compute the values of all the remaining variables by applying the equations in their natural time sequence.
- Intervention: As long as the causal graph remains acyclic, we can construct derived structural equation models using arbitrary algebraic manipulations of the system of equations. For instance, we can clamp a variable to a constant value by rewriting the right-hand side of the corresponding equation as the specified constant value.

The algebraic manipulation of the structural equation models provides a powerful language to describe interventions on a causal system. This is not a coincidence. Many aspects of the mathematical notation were invented to support causal inference in classical mechanics. However, we no longer have to interpret the variable values as physical quantities: the equations simply describe the flow of information in the causal model (Wiener, 1948).

3.2 The Isolation Assumption

Let us now turn our attention to the exogenous variables, that is, variables that never appear in the left hand side of an equation of the structural model. Leibniz's principle of sufficient reason claims that there are no facts without causes. This suggests that the exogenous variables are the effects of a network of causes not expressed by the structural equation model. For instance, the user intent $u$ and the ad inventory $v$ in Figure 3 have temporal correlations because both users and advertisers worry about their budgets when the end of the month approaches.
Any structural equation model should then be understood in the context of a larger structural equation model potentially describing all things in existence.

Ads served on a particular page contribute to the continued satisfaction of both users and advertisers, and therefore have an effect on their willingness to use the services of the publisher in the future. The ad placement structural equation model shown in Figure 2 only describes the causal dependencies for a single page and therefore cannot account for such effects. Consider however a very large structural equation model containing a copy of the page-level model for every web page ever served by the publisher. Figure 4 shows how we can thread the page-level models corresponding to pages served to the same user. Similarly we could model how advertisers track the performance and the cost of their advertisements and model how their satisfaction affects their future bids. The resulting causal graphs can be very complex. Part of this complexity results from time-scale differences. Thousands of search pages are served in a second. Each page contributes a little to the continued satisfaction of one user and a few advertisers. The accumulation of these contributions produces measurable effects after a few weeks.

Figure 4: Conceptually unrolling the user feedback loop by threading instances of the single page causal graph (Figure 3). Both the ad slate $s_t$ and user clicks $y_t$ have an indirect effect on the user intent $u_{t+1}$ associated with the next query.

Many of the functional dependencies expressed by the structural equation model are left unspecified. Without direct knowledge of these functions, we must reason using statistical data. The most fundamental statistical data is collected from repeated trials that are assumed independent. When we consider the large structural equation model of everything, we can only have one large trial producing a single data point. [1] It is therefore desirable to identify repeated patterns of identical equations that can be viewed as repeated independent trials. Therefore, when we study a structural equation model representing such a pattern, we need to make an additional assumption to express the idea that the outcome of one trial does not affect the other trials. We call such an assumption an isolation assumption by analogy with thermodynamics. [2] This can be achieved by assuming that the exogenous variables are independently drawn from an unknown but fixed joint probability distribution. This assumption cuts the causation effects that could flow through the exogenous variables.
The noise variables are also exogenous variables acting as independent sources of randomness. The noise variables are useful to represent the conditional distribution $P(\text{effect} \mid \text{causes})$ using the equation $\text{effect} = f(\text{causes}, \varepsilon)$. Therefore, we also assume joint independence between all the noise variables and any of the named exogenous variables.³ For instance, in the case of the ad placement model shown in Figure 2, we assume that the joint distribution of the exogenous variables factorizes as

\[ P(u, v, \varepsilon_1, \dots, \varepsilon_8) = P(u, v)\, P(\varepsilon_1) \cdots P(\varepsilon_8). \]

1. See also the discussion on reinforcement learning.

2. The concept of isolation is pervasive in physics. An isolated system in thermodynamics (Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and Lifshitz, 1969, §5) evolves without exchanging mass or energy with its surroundings. Experimental trials involving systems that are assumed isolated may differ in their initial setup and therefore have different outcomes. Assuming isolation implies that the outcome of each trial cannot affect the other trials.

3. Rather than letting two noise variables display measurable statistical dependencies because they share a common cause, we prefer to name the common cause and make the dependency explicit in the graph.

Since an isolation assumption is only true up to a point, it should be expressed clearly and remain under constant scrutiny. We must therefore measure additional performance metrics that reveal how well the isolation assumption holds. For instance, the ad placement structural equation model and the corresponding causal graph (Figures 2 and 3) do not take user feedback or advertiser feedback into account. Measuring the revenue is not enough because we could easily generate revenue at the expense of the satisfaction of the users and advertisers. When we evaluate interventions under such an isolation assumption, we also need to measure a battery of additional quantities that act as proxies for the user and advertiser satisfaction. Noteworthy examples include ad relevance estimated by human judges, and advertiser surplus estimated from the auctions (Varian, 2009).

\[
\begin{aligned}
P(u,v,x,a,b,q,s,c,y,z) ={}& P(u,v) && \text{Exogenous vars.} \\
\times{}& P(x \mid u) && \text{Query.} \\
\times{}& P(a \mid x,v) && \text{Eligible ads.} \\
\times{}& P(b \mid x,v) && \text{Bids.} \\
\times{}& P(q \mid x,a) && \text{Scores.} \\
\times{}& P(s \mid a,q,b) && \text{Ad slate.} \\
\times{}& P(c \mid a,q,b) && \text{Prices.} \\
\times{}& P(y \mid s,u) && \text{Clicks.} \\
\times{}& P(z \mid y,c) && \text{Revenue.}
\end{aligned}
\]

Figure 5: Markov factorization of the structural equation model of Figure 2.

Figure 6: Bayesian network associated with the Markov factorization shown in Figure 5.

3.3 Markov Factorization

Conceptually, we can draw a sample of the exogenous variables using the distribution specified by the isolation assumption, and we can then generate values for all the remaining variables by simulating the structural equation model.

This process defines a generative probabilistic model representing the joint distribution of all variables in the structural equation model. The distribution readily factorizes as the product of the joint probability of the named exogenous variables, and, for each equation in the structural equation model, the conditional probability of the effect given its direct causes (Spirtes et al., 1993; Pearl, 2000). As illustrated by Figures 5 and 6, this Markov factorization connects the structural equation model that describes causation, and the Bayesian network that describes the joint probability distribution followed by the variables under the isolation assumption.⁴

Structural equation models and Bayesian networks appear so intimately connected that it could be easy to forget the differences. The structural equation model is an algebraic object. As long as the causal graph remains acyclic, algebraic manipulations are interpreted as interventions on the causal system. The Bayesian network is a generative statistical model representing a class of joint probability distributions, and, as such, does not support algebraic manipulations. However, the symbolic representation of its Markov factorization is an algebraic object, essentially equivalent to the structural equation model.

3.4 Identification, Transportation, and Transfer Learning

Consider a causal system represented by a structural equation model with some unknown functional dependencies. Subject to the isolation assumption, data collected during the operation of this system follows the distribution described by the corresponding Markov factorization. Let us first assume that this data is sufficient to identify the joint distribution of the subset of variables we can observe. We can intervene on the system by clamping the value of some variables. This amounts to replacing the right-hand side of the corresponding structural equations by constants.
The joint distribution of the variables is then described by a new Markov factorization that shares many factors with the original Markov factorization. Which conditional probabilities associated with this new distribution can we express using only conditional probabilities identified during the observation of the original system? This is called the identifiability problem.

More generally, we can consider arbitrarily complex manipulations of the structural equation model, and we can perform multiple experiments involving different manipulations of the causal system. Which conditional probabilities pertaining to one experiment can be expressed using only conditional probabilities identified during the observation of other experiments? This is called the transportability problem.

Pearl's do-calculus completely solves the identifiability problem and provides useful tools to address many instances of the transportability problem (see Pearl, 2012). Assuming that we know the conditional probability distributions involving observed variables in the original structural equation model, do-calculus allows us to derive conditional distributions pertaining to the manipulated structural equation model.

Unfortunately, we must further distinguish the conditional probabilities that we know (because we designed them) from those that we estimate from empirical data. This distinction is important because estimating the distribution of continuous or high cardinality variables is notoriously difficult. Furthermore, do-calculus often combines the estimated probabilities in ways that amplify estimation errors. This happens when the manipulated structural equation model exercises the variables in ways that were rarely observed in the data collected from the original structural equation model.

4. Bayesian networks are directed graphs representing the Markov factorization of a joint probability distribution: the arrows no longer have a causal interpretation.
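The simulation described in Section 3.3 and the clamping intervention described in Section 3.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three equations, their coefficients, and the names `simulate` and `clamp_q` are hypothetical and are not the ad placement model of Figure 2.

```python
import random

def simulate(n, rng, clamp_q=None):
    """Draw n samples from a toy structural equation model.

    Hypothetical equations (not the ad placement model):
        x = f1(u, e1),  q = f2(x, e2),  y = f3(q, u, e3)
    where u is a named exogenous variable and e1, e2, e3 are
    independent noise variables (the isolation assumption).
    Passing clamp_q replaces the right-hand side of the q equation
    by a constant, which is the intervention described in the text.
    """
    samples = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        e1, e2, e3 = (rng.gauss(0.0, 0.1) for _ in range(3))
        x = 2.0 * u + e1
        q = 0.5 * x + e2 if clamp_q is None else clamp_q
        y = q + u + e3
        samples.append((u, x, q, y))
    return samples

def mean_y(samples):
    """Empirical expectation of the outcome y."""
    return sum(s[3] for s in samples) / len(samples)

rng = random.Random(0)
observed = simulate(20000, rng)                 # original system
clamped = simulate(20000, rng, clamp_q=1.0)     # manipulated system
print("E[y], original system:", round(mean_y(observed), 3))
print("E[y], q clamped to 1: ", round(mean_y(clamped), 3))
```

Only the factor attached to `q` changes between the two runs; every other equation, and therefore every other Markov factor, is shared by the original and the manipulated model.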

Therefore we prefer to use much simpler causal inference techniques (see Sections 4.1 and 4.2). Although these techniques do not have the completeness properties of do-calculus, they combine estimation and transportation in a manner that facilitates the derivation of useful confidence intervals.

3.5 Special Cases

Three special cases of causal models are particularly relevant to this work.

In the multi-armed bandit (Robbins, 1952), a user-defined policy function $\pi$ determines the distribution of action $a \in \{1 \dots K\}$, and an unknown reward function $r$ determines the distribution of the outcome $y$ given the action $a$ (Figure 7). In order to maximize the accumulated rewards, the player must construct policies $\pi$ that balance the exploration of the action space with the exploitation of the best action identified so far (Auer et al., 2002; Audibert et al., 2007; Seldin et al., 2012).

The contextual bandit problem (Langford and Zhang, 2008) significantly increases the complexity of multi-armed bandits by adding one exogenous variable $x$ to the policy function $\pi$ and the reward function $r$ (Figure 8).

Both multi-armed bandits and contextual bandits are special cases of reinforcement learning (Sutton and Barto, 1998). In essence, a Markov decision process is a sequence of contextual bandits where the context is no longer an exogenous variable but a state variable that depends on the previous states and actions (Figure 9). Note that the policy function $\pi$, the reward function $r$, and the transition function $s$ are independent of time. All the time dependencies are expressed using the states $s_t$.

These special cases have increasing generality. Many simple structural equation models can be reduced to a contextual bandit problem using appropriate definitions of the context $x$, the action $a$ and the outcome $y$. For instance, assuming that the prices $c$ are discrete, the ad placement structural equation model shown in Figure 2 reduces to a contextual bandit problem with context $(u, v)$, actions $(s, c)$ and reward $z$.
Similarly, given a sufficiently intricate definition of the state variables $s_t$, all structural equation models with discrete variables can be reduced to a reinforcement learning problem. Such reductions lose the fine structure of the causal graph. We show in Section 5 how this fine structure can in fact be leveraged to obtain more information from the same experiments.

Modern reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the assumption that the policy function, the reward function, the transition function, and the distributions of the corresponding noise variables, are independent from time. This invariance property provides great benefits when the observed sequences of actions and rewards are long in comparison with the size of the state space. Only Section 7 in this contribution presents methods that take advantage of such an invariance. The general question of leveraging arbitrary functional invariances in causal graphs is left for future work.

\[
\begin{aligned}
a &= \pi(\varepsilon) && \text{Action } a \in \{1 \dots K\} \\
y &= r(a, \varepsilon') && \text{Reward } y \in \mathbb{R}
\end{aligned}
\]

Figure 7: Structural equation model for the multi-armed bandit problem. The policy $\pi$ selects a discrete action $a$, and the reward function $r$ determines the outcome $y$. The noise variables $\varepsilon$ and $\varepsilon'$ represent independent sources of randomness useful to model probabilistic dependencies.

\[
\begin{aligned}
a &= \pi(x, \varepsilon) && \text{Action } a \in \{1 \dots K\} \\
y &= r(x, a, \varepsilon') && \text{Reward } y \in \mathbb{R}
\end{aligned}
\]

Figure 8: Structural equation model for the contextual bandit problem. Both the action and the reward depend on an exogenous context variable $x$.

\[
\begin{aligned}
a_t &= \pi(s_{t-1}, \varepsilon_t) && \text{Action} \\
y_t &= r(s_{t-1}, a_t, \varepsilon'_t) && \text{Reward } y_t \in \mathbb{R} \\
s_t &= s(s_{t-1}, a_t, \varepsilon''_t) && \text{Next state}
\end{aligned}
\]

Figure 9: Structural equation model for reinforcement learning. The above equations are replicated for all $t \in \{0, \dots, T\}$. The context is now provided by a state variable $s_{t-1}$ that depends on the previous states and actions.

4. Counterfactual Analysis

We now return to the problem of formulating and answering questions about the value of proposed changes of a learning system. Assume for instance that we consider replacing the score computation model $M$ of an ad placement engine by an alternate model $M^*$. We seek an answer to the conditional question:

How will the system perform if we replace model $M$ by model $M^*$?

Given sufficient time and sufficient resources, we can obtain the answer using a controlled experiment (Section 2.2). However, instead of carrying out a new experiment, we would like to obtain an answer using data that we have already collected in the past.

How would the system have performed if, when the data was collected, we had replaced model $M$ by model $M^*$?

The answer to this counterfactual question is of course a counterfactual statement that describes the system performance subject to a condition that did not happen.

Counterfactual statements challenge ordinary logic because they depend on a condition that is known to be false. Although assertion $A \Rightarrow B$ is always true when assertion $A$ is false, we certainly do not mean for all counterfactual statements to be true. Lewis (1973) navigates this paradox using a modal logic in which a counterfactual statement describes the state of affairs in an alternate world that resembles ours except for the specified differences.
Counterfactuals indeed offer many subtle ways to qualify such alternate worlds. For instance, we can easily describe isolation assumptions (Section 3.2) in a counterfactual question:

How would the system have performed if, when the data was collected, we had replaced model $M$ by model $M^*$ without incurring user or advertiser reactions?

The fact that we could not have changed the model without incurring the user and advertiser reactions does not matter any more than the fact that we did not replace model $M$ by model $M^*$ in the first place. This does not prevent us from using counterfactual statements to reason about causes and effects. Counterfactual questions and statements provide a natural framework to express and share our conclusions.

The remaining text in this section explains how we can answer certain counterfactual questions using data collected in the past. More precisely, we seek to estimate performance metrics that can be expressed as expectations with respect to the distribution that would have been observed if the counterfactual conditions had been in force.⁵

Figure 10: Causal graph for an image recognition system. We can estimate counterfactuals by replaying data collected in the past.

Figure 11: Causal graph for a randomized experiment. We can estimate certain counterfactuals by reweighting data collected in the past.

4.1 Replaying Empirical Data

Figure 10 shows the causal graph associated with a simple image recognition system. The classifier takes an image $x$ and produces a prospective class label $\hat{y}$. The loss measures the penalty associated with recognizing class $\hat{y}$ while the true class is $y$.

To estimate the expected error of such a classifier, we collect a representative data set composed of labelled images, run the classifier on each image, and average the resulting losses. In other words, we replay the data set to estimate what (counterfactual) performance would have been observed if we had used a different classifier. We can then select in retrospect the classifier that would have worked the best and hope that it will keep working well. This is the counterfactual viewpoint on empirical risk minimization (Vapnik, 1982).

Replaying the data set works because both the alternate classifier and the loss function are known.
More generally, to estimate a counterfactual by replaying a data set, we need to know all the functional dependencies associated with all causal paths connecting the intervention point to the measurement point. This is obviously not always the case.

5. Although counterfactual expectations can be viewed as expectations of unit-level counterfactuals (Pearl, 2009, Definition 4), they elude the semantic subtleties of unit-level counterfactuals and can be measured with randomized experiments (see Section 4.2).
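The replay procedure of Section 4.1 amounts to averaging a known loss over logged examples for each candidate classifier. The sketch below is a toy illustration: the data set, the two threshold classifiers, and the function names are made up for the example and do not come from the paper.

```python
def replay_risk(classifier, loss, dataset):
    """Average loss the classifier would have incurred on the logged data."""
    return sum(loss(classifier(x), y) for x, y in dataset) / len(dataset)

def zero_one_loss(predicted, true):
    """Penalty for recognizing class `predicted` while the true class is `true`."""
    return 0.0 if predicted == true else 1.0

# Hypothetical logged data: each "image" is a single number, labels are 0 or 1.
dataset = [(0.1, 0), (0.4, 0), (0.45, 1), (0.6, 1), (0.9, 1), (0.2, 0)]

# Two candidate classifiers; the thresholds are illustrative assumptions.
candidates = {
    "threshold_0.50": lambda x: int(x > 0.5),
    "threshold_0.42": lambda x: int(x > 0.42),
}

# Replay the same data once per candidate, then select in retrospect.
risks = {name: replay_risk(f, zero_one_loss, dataset)
         for name, f in candidates.items()}
best = min(risks, key=risks.get)
print(risks, "-> selected:", best)
```

Replay is possible here only because both `classifier` and `loss` are known functions; no logged outcome depends on an unknown mechanism downstream of the intervention.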

4.2 Reweighting Randomized Trials

Figure 11 illustrates the randomized experiment suggested in Section 2.3. The patients are randomly split into two equally sized groups receiving respectively treatments A and B. The overall success rate for this experiment is therefore $Y = (Y_A + Y_B)/2$ where $Y_A$ and $Y_B$ are the success rates observed for each group. We would like to estimate which (counterfactual) overall success rate $Y^*$ would have been observed if we had selected treatment A with probability $p$ and treatment B with probability $1 - p$.

Since we do not know how the outcome depends on the treatment and the patient condition, we cannot compute which outcome $y$ would have been obtained if we had treated patient $x$ with a different treatment $u$. Therefore we cannot answer this question by replaying the data as we did in Section 4.1.

However, observing different success rates $Y_A$ and $Y_B$ for the treatment groups reveals an empirical correlation between the treatment $u$ and the outcome $y$. Since the only cause of the treatment $u$ is an independent roll of the dice, this correlation cannot result from any known or unknown confounding common cause.⁶ Having eliminated this possibility, we can reweight the observed outcomes and compute the estimate $Y^* \approx p\,Y_A + (1 - p)\,Y_B$.

4.3 Markov Factor Replacement

The reweighting approach can in fact be applied under much less stringent conditions. Let us return to the ad placement problem to illustrate this point.

The average number of ad clicks per page is often called click yield. Increasing the click yield usually benefits both the advertiser and the publisher, whereas increasing the revenue per page often benefits the publisher at the expense of the advertiser. Click yield is therefore a very useful metric when we reason with an isolation assumption that ignores the advertiser reactions to pricing changes.

Let $\omega$ be a shorthand for all variables appearing in the Markov factorization of the ad placement structural equation model,

\[
P(\omega) = P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P(q \mid x,a)\,P(s \mid a,q,b)\,P(c \mid a,q,b)\,P(y \mid s,u)\,P(z \mid y,c). \tag{2}
\]

Variable $y$ was defined in Section 3.1 as the set of user clicks. In the rest of the document, we slightly abuse this notation by using the same letter $y$ to represent the number of clicks. We also write the expectation $Y = \mathbb{E}_{\omega \sim P(\omega)}[y]$ using the integral notation

\[ Y = \int_\omega y\, P(\omega). \]

We would like to estimate what the expected click yield $Y^*$ would have been if we had used a different scoring function (Figure 12). This intervention amounts to replacing the actual factor $P(q \mid x,a)$ by a counterfactual factor $P^*(q \mid x,a)$ in the Markov factorization,

\[
P^*(\omega) = P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P^*(q \mid x,a)\,P(s \mid a,q,b)\,P(c \mid a,q,b)\,P(y \mid s,u)\,P(z \mid y,c). \tag{3}
\]

6. See also the discussion of Reichenbach's common cause principle and of its limitations in Spirtes et al. (1993) and Spirtes and Scheines (2004).
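The reweighting argument of Section 4.2 can be checked numerically, and the same computation previews the factor replacement of this section: changing the treatment probability replaces one factor, $P(u)$, by $P^*(u)$. The sketch below is a toy simulation; the success rates 0.7 and 0.4 and all function names are assumptions made for the example.

```python
import random

def run_trial(n, rng):
    """Simulate a 50/50 randomized trial; returns (treatment, outcome) pairs.

    The success rates 0.7 (treatment A) and 0.4 (treatment B) are made-up
    values; the estimators below never look at them.
    """
    data = []
    for _ in range(n):
        u = "A" if rng.random() < 0.5 else "B"   # independent randomization
        success_rate = 0.7 if u == "A" else 0.4
        data.append((u, 1 if rng.random() < success_rate else 0))
    return data

def estimate_by_groups(data, p):
    """Y* ~ p * Y_A + (1 - p) * Y_B from the observed group success rates."""
    ya = [y for u, y in data if u == "A"]
    yb = [y for u, y in data if u == "B"]
    return p * sum(ya) / len(ya) + (1 - p) * sum(yb) / len(yb)

def estimate_by_weights(data, p):
    """Same counterfactual, using per-sample weights P*(u) / P(u)."""
    weights = {"A": p / 0.5, "B": (1 - p) / 0.5}
    return sum(weights[u] * y for u, y in data) / len(data)

rng = random.Random(0)
data = run_trial(50000, rng)
print("group estimate: ", round(estimate_by_groups(data, 0.9), 3))
print("weight estimate:", round(estimate_by_weights(data, 0.9), 3))
```

Both functions estimate the same quantity. The per-sample form is the one that carries over to general Markov factorizations, where only the ratio of the replaced factors has to be evaluated for each logged sample.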

Figure 12: Estimating which average number of clicks per page would have been observed if we had used a different scoring model.

Let us assume, for simplicity, that the actual factor $P(q \mid x,a)$ is nonzero everywhere. We can then estimate the counterfactual expected click yield $Y^*$ using the transformation

\[
Y^* = \int_\omega y\, P^*(\omega) = \int_\omega y\, \frac{P^*(q \mid x,a)}{P(q \mid x,a)}\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} y_i\, \frac{P^*(q_i \mid x_i,a_i)}{P(q_i \mid x_i,a_i)}, \tag{4}
\]

where the data set of tuples $(a_i, x_i, q_i, y_i)$ is distributed according to the actual Markov factorization instead of the counterfactual Markov factorization. This data could therefore have been collected during the normal operation of the ad placement system. Each sample is reweighted to reflect its probability of occurrence under the counterfactual conditions.

In general, we can use importance sampling to estimate the counterfactual expectation of any quantity $\ell(\omega)$:

\[
Y^* = \int_\omega \ell(\omega)\, P^*(\omega) = \int_\omega \ell(\omega)\, \frac{P^*(\omega)}{P(\omega)}\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, w_i \tag{5}
\]

with weights

\[
w_i = w(\omega_i) = \frac{P^*(\omega_i)}{P(\omega_i)} = \frac{\prod \text{factors appearing in } P^*(\omega_i) \text{ but not in } P(\omega_i)}{\prod \text{factors appearing in } P(\omega_i) \text{ but not in } P^*(\omega_i)}. \tag{6}
\]

Equation (6) emphasizes the simplifications resulting from the algebraic similarities of the actual and counterfactual Markov factorizations. Because of these simplifications, the evaluation of the weights only requires the knowledge of the few factors that differ between $P(\omega)$ and $P^*(\omega)$. Each data sample $\omega_i$ needs to provide the value of $\ell(\omega_i)$ and the values of all variables needed to evaluate the factors that do not cancel in the ratio (6).

In contrast, the replaying approach (Section 4.1) demands the knowledge of all factors of $P^*(\omega)$ connecting the point of intervention to the point of measurement $\ell(\omega)$. On the other hand, it does not require the knowledge of factors appearing only in $P(\omega)$.

Importance sampling relies on the assumption that all the factors appearing in the denominator of the reweighting ratio (6) are nonzero whenever the factors appearing in the numerator are nonzero.
Since these factors represent conditional probabilities resulting from the effect of an independent noise variable in the structural equation model, this assumption means that the data must be collected with an experiment involving active randomization. We must therefore design cost-effective randomized experiments that yield enough information to estimate many interesting counterfactual expectations with sufficient accuracy. This problem cannot be solved without answering the confidence interval question: given data collected with a certain level of randomization, with which accuracy can we estimate a given counterfactual expectation?

4.4 Confidence Intervals

At first sight, we can invoke the law of large numbers and write

\[
Y^* = \int_\omega \ell(\omega)\, w(\omega)\, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, w_i. \tag{7}
\]

For sufficiently large $n$, the central limit theorem provides confidence intervals whose width grows with the standard deviation of the product $\ell(\omega)\,w(\omega)$.

Unfortunately, when $P(\omega)$ is small, the reweighting ratio $w(\omega)$ takes large values with low probability. This heavy tailed distribution has annoying consequences because the variance of the integrand could be very high or infinite. When the variance is infinite, the central limit theorem does not hold. When the variance is merely very large, the central limit convergence might occur too slowly to justify such confidence intervals. Importance sampling works best when the actual distribution and the counterfactual distribution overlap.

When the counterfactual distribution has significant mass in domains where the actual distribution is small, the few samples available in these domains receive very high weights. Their noisy contribution dominates the reweighted estimate (7). We can obtain better confidence intervals by eliminating these few samples drawn in poorly explored domains. The resulting bias can be bounded using prior knowledge, for instance with an assumption about the range of values taken by $\ell(\omega)$,

\[
\forall \omega \quad \ell(\omega) \in [0, M]. \tag{8}
\]

Let us choose the maximum weight value $R$ deemed acceptable for the weights. We have obtained very consistent results in practice with $R$ equal to the fifth largest reweighting ratio observed on the empirical data.⁷ We can then rely on clipped weights to eliminate the contribution of the poorly explored domains,

\[
\bar{w}(\omega) = \begin{cases} w(\omega) & \text{if } P^*(\omega) < R\, P(\omega) \\ 0 & \text{otherwise.} \end{cases}
\]

The condition $P^*(\omega) < R\,P(\omega)$ ensures that the ratio has a nonzero denominator $P(\omega)$ and is smaller than $R$. Let $\Omega_R$ be the set of all values of $\omega$ associated with acceptable ratios:

\[ \Omega_R = \{ \omega : P^*(\omega) < R\, P(\omega) \}. \]

We can decompose $Y^*$ in two terms:

\[
Y^* = \int_{\omega \in \Omega_R} \ell(\omega)\, P^*(\omega) + \int_{\omega \in \Omega \setminus \Omega_R} \ell(\omega)\, P^*(\omega) = \bar{Y}^* + (Y^* - \bar{Y}^*). \tag{9}
\]

7. This is in fact a slight abuse because the theory calls for choosing $R$ before seeing the data.

The first term of this decomposition is the clipped expectation $\bar{Y}^*$. Estimating the clipped expectation $\bar{Y}^*$ is much easier than estimating $Y^*$ from (7) because the clipped weights $\bar{w}(\omega)$ are bounded by $R$.

\[
\bar{Y}^* = \int_{\omega \in \Omega_R} \ell(\omega)\, P^*(\omega) = \int_\omega \ell(\omega)\, \bar{w}(\omega)\, P(\omega) \approx \hat{Y}^* = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}(\omega_i). \tag{10}
\]

The second term of Equation (9) can be bounded by leveraging assumption (8). The resulting bound can then be conveniently estimated using only the clipped weights.

\[
Y^* - \bar{Y}^* = \int_{\omega \in \Omega \setminus \Omega_R} \ell(\omega)\, P^*(\omega) \in \big[\, 0,\ M\, P^*(\Omega \setminus \Omega_R) \,\big] = \big[\, 0,\ M (1 - \bar{W}^*) \,\big]
\]

with

\[
\bar{W}^* = P^*(\Omega_R) = \int_{\omega \in \Omega_R} P^*(\omega) = \int_\omega \bar{w}(\omega)\, P(\omega) \approx \hat{W}^* = \frac{1}{n} \sum_{i=1}^{n} \bar{w}(\omega_i). \tag{11}
\]

Since the clipped weights are bounded, the estimation errors associated with (10) and (11) are well characterized using either the central limit theorem or empirical Bernstein bounds (see Appendix B for details). Therefore we can derive an outer confidence interval of the form

\[
\mathbb{P}\big\{\, \hat{Y}^* - \epsilon_R \le \bar{Y}^* \le \hat{Y}^* + \epsilon_R \,\big\} \ge 1 - \delta \tag{12}
\]

and an inner confidence interval of the form

\[
\mathbb{P}\big\{\, \bar{Y}^* \le Y^* \le \bar{Y}^* + M (1 - \hat{W}^* + \xi_R) \,\big\} \ge 1 - \delta. \tag{13}
\]

The names inner and outer are in fact related to our preferred way to visualize these intervals (e.g., Figure 13). Since the bounds on $Y^* - \bar{Y}^*$ can be written as

\[
\bar{Y}^* \le Y^* \le \bar{Y}^* + M (1 - \bar{W}^*), \tag{14}
\]

we can derive our final confidence interval,

\[
\mathbb{P}\big\{\, \hat{Y}^* - \epsilon_R \le Y^* \le \hat{Y}^* + M (1 - \hat{W}^* + \xi_R) + \epsilon_R \,\big\} \ge 1 - 2\delta. \tag{15}
\]

In conclusion, replacing the unbiased importance sampling estimator (7) by the clipped importance sampling estimator (10) with a suitable choice of $R$ leads to improved confidence intervals. Furthermore, since the derivation of these confidence intervals does not rely on the assumption that $P(\omega)$ is nonzero everywhere, the clipped importance sampling estimator remains valid when the distribution $P(\omega)$ has a limited support. This relaxes the main restriction associated with importance sampling.

4.5 Interpreting the Confidence Intervals

The estimation of the counterfactual expectation $Y^*$ can be inaccurate because the sample size is insufficient or because the sampling distribution $P(\omega)$ does not sufficiently explore the counterfactual conditions of interest.

By construction, the clipped expectation $\bar{Y}^*$ ignores the domains poorly explored by the sampling distribution $P(\omega)$.
The difference $Y^* - \bar{Y}^*$ then reflects the inaccuracy resulting from a lack of exploration. Therefore, assuming that the bound $R$ has been chosen competently, the relative sizes of the outer and inner confidence intervals provide precious cues to determine whether we can continue collecting data using the same experimental setup or should adjust the data collection experiment in order to obtain a better coverage.
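The clipped estimator (10), the mass estimate (11), and the combined interval (15) can be sketched as follows. This is a minimal illustration rather than the procedure of Appendix B: the half-widths $\epsilon_R$ and $\xi_R$ are taken from a plain normal approximation, the toy distributions (actual $N(0,1)$, counterfactual $N(1,1)$) and the logistic quantity are assumptions, and all names are hypothetical.

```python
import math
import random

def clipped_importance_estimate(losses, weights, R, M, z=1.96):
    """Clipped importance sampling estimate with a rough confidence interval.

    losses[i] is l(omega_i), assumed to lie in [0, M]; weights[i] is the
    ratio P*(omega_i) / P(omega_i). The half-widths eps and xi rely on a
    normal approximation, which is an assumption of this sketch.
    """
    n = len(losses)
    clipped = [w if w < R else 0.0 for w in weights]   # clipped weights
    terms = [l * w for l, w in zip(losses, clipped)]
    y_hat = sum(terms) / n       # estimates the clipped expectation, as in (10)
    w_hat = sum(clipped) / n     # estimates the mass of Omega_R, as in (11)

    def half_width(values, mean):
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        return z * math.sqrt(variance / n)

    eps = half_width(terms, y_hat)     # sample-size uncertainty (outer)
    xi = half_width(clipped, w_hat)
    # Bias bound from lack of exploration (inner); kept nonnegative.
    bias_bound = M * max(0.0, 1.0 - w_hat + xi)
    return y_hat, w_hat, (y_hat - eps, y_hat + bias_bound + eps)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]    # drawn from P = N(0, 1)
weights = [math.exp(m - 0.5) for m in samples]           # density ratio N(1, 1) / N(0, 1)
losses = [1.0 / (1.0 + math.exp(-m)) for m in samples]   # quantity with values in [0, 1]
R = sorted(weights)[-5]                                  # fifth largest observed ratio
y_hat, w_hat, interval = clipped_importance_estimate(losses, weights, R, M=1.0)
print(round(y_hat, 3), round(w_hat, 3), tuple(round(b, 3) for b in interval))
```

The interval is asymmetric by construction: clipping can only remove nonnegative contributions, so the lack-of-exploration term widens the upper bound only.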

The inner confidence interval (13) witnesses the uncertainty associated with the domain $\Omega \setminus \Omega_R$ insufficiently explored by the actual distribution. A large inner confidence interval suggests that the most practical way to improve the estimate is to adjust the data collection experiment in order to obtain a better coverage of the counterfactual conditions of interest.

The outer confidence interval (12) represents the uncertainty that results from the limited sample size. A large outer confidence interval indicates that the sample is too small. To improve the result, we simply need to continue collecting data using the same experimental setup.

4.6 Experimenting with Mainline Reserves

We return to the ad placement problem to illustrate the reweighting approach and the interpretation of the confidence intervals. Manipulating the reserves $R_p(x)$ associated with the mainline positions (Figure 1) controls which ads are prominently displayed in the mainline or displaced into the sidebar.

We seek in this section to answer counterfactual questions of the form:

How would the ad placement system have performed if we had scaled the mainline reserves by a constant factor $\rho$, without incurring user or advertiser reactions?

Randomization was introduced using a modified version of the ad placement engine. Before determining the ad layout (see Section 2.1), a random number $\varepsilon$ is drawn according to the standard normal distribution $N(0,1)$, and all the mainline reserves are multiplied by $m = \rho\, e^{-\sigma^2/2 + \sigma\varepsilon}$. Such multipliers follow a log-normal distribution⁸ whose mean is $\rho$ and whose width is controlled by $\sigma$. This effectively provides a parametrization of the conditional score distribution $P(q \mid x, a)$ (see Figure 5).

The Bing search platform offers many ways to select traffic for controlled experiments (Section 2.2). In order to match our isolation assumption, individual page views were randomly assigned to traffic buckets without regard to the user identity.
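The multiplier randomization described above is easy to reproduce in isolation. The sketch below draws multipliers $m = \rho\, e^{-\sigma^2/2 + \sigma\varepsilon}$ and checks empirically that their mean is close to $\rho$; the function names and the parameter values $\rho = 1$, $\sigma = 0.3$ used in the demonstration are illustrative, and nothing here touches an actual ad placement engine.

```python
import math
import random

def draw_multiplier(rho, sigma, rng):
    """Draw m = rho * exp(-sigma^2 / 2 + sigma * eps) with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return rho * math.exp(-sigma ** 2 / 2 + sigma * eps)

def central_range(rho, sigma, k=2.0):
    """Multipliers reached when eps lies within +/- k standard deviations."""
    low = rho * math.exp(-sigma ** 2 / 2 - k * sigma)
    high = rho * math.exp(-sigma ** 2 / 2 + k * sigma)
    return low, high

rng = random.Random(0)
rho, sigma = 1.0, 0.3
multipliers = [draw_multiplier(rho, sigma, rng) for _ in range(50000)]
empirical_mean = sum(multipliers) / len(multipliers)
low, high = central_range(rho, sigma)
inside = sum(low <= m <= high for m in multipliers) / len(multipliers)
print("mean multiplier:", round(empirical_mean, 3))
print("range for |eps| <= 2:", round(low, 2), round(high, 2), "fraction inside:", round(inside, 3))
```

The $-\sigma^2/2$ term in the exponent is what keeps the mean multiplier equal to $\rho$ regardless of the chosen $\sigma$.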
The main treatment bucket was processed with mainline reserves randomized by a multiplier drawn as explained above with $\rho = 1$ and $\sigma = 0.3$. With these parameters, the mean multiplier is exactly 1, and 95% of the multipliers are in range [0.52, 1.74]. Samples describing 22 million search result pages were collected during five consecutive weeks.

We then use this data to estimate what would have been measured if the mainline reserve multipliers had been drawn according to a distribution determined by parameters $\rho^*$ and $\sigma^*$. This is achieved by reweighting each sample $\omega_i$ with

\[
w_i = \frac{P^*(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)} = \frac{p(m_i;\, \rho^*, \sigma^*)}{p(m_i;\, \rho, \sigma)},
\]

where $m_i$ is the multiplier drawn for this sample during the data collection experiment, and $p(t;\, \rho, \sigma)$ is the density of the log-normal multiplier distribution.

8. More precisely, $\ln\mathcal{N}(\mu, \sigma^2)$ with $\mu = -\sigma^2/2 + \log \rho$.

Figure 13 reports results obtained by varying $\rho^*$ while keeping $\sigma^* = \sigma$. This amounts to estimating what would have been measured if all mainline reserves had been multiplied by $\rho^*$ while keeping the same randomization. The curves bound 95% confidence intervals on the variations of the average number of mainline ads displayed per page, the average number of ad clicks per page, and the average revenue per page, as functions of $\rho^*$. The inner confidence intervals, represented by the filled areas, grow sharply when $\rho^*$ leaves the range explored during the data collection experiment. The average revenue per page has more variance because a few very competitive queries command high prices.

[Figure 13 has three panels: average mainline ads per page, average clicks per page, and average revenue per page. Each is plotted as a relative variation against the mainline reserve variation, from -50% to +100%.]

Figure 13: Estimated variations of three performance metrics in response to mainline reserve changes. The curves delimit 95% confidence intervals for the metrics we would have observed if we had increased the mainline reserves by the percentage shown on the horizontal axis. The filled areas represent the inner confidence intervals. The hollow squares represent the metrics measured on the experimental data. The hollow circles represent metrics measured on a second experimental bucket with mainline reserves reduced by 18%. The filled circles represent the metrics effectively measured on a control bucket running without randomization.

In order to validate the accuracy of these counterfactual estimates, a second traffic bucket of equal size was configured with mainline reserves reduced by about 18%. The hollow circles in Figure 13 represent the metrics effectively measured on this bucket during the same time period. The effective measurements and the counterfactual estimates match with high accuracy.

Finally, in order to measure the cost of the randomization, we also ran the unmodified ad placement system on a control bucket. The brown filled circles in Figure 13 represent the metrics effectively measured on the control bucket during the same time period. The randomization caused a small but statistically significant increase of the number of mainline ads per page. The click yield and average revenue differences are not significant.

This experiment shows that we can obtain accurate counterfactual estimates with affordable randomization strategies. However, this nice conclusion does not capture the true practical value of the counterfactual estimation approach.

4.7 More on Mainline Reserves

The main benefit of the counterfactual estimation approach is the ability to use the same data to answer a broad range of counterfactual questions. Here are a few examples of counterfactual questions that can be answered using data collected using the simple mainline reserve randomization scheme described in the previous section:

Different variances. Instead of estimating what would have been measured if we had increased the mainline reserves without changing the randomization variance, that is, letting $\sigma^* = \sigma$, we can use the same data to estimate what would have been measured if we had also changed $\sigma^*$.
This provides the means to determine which level of randomization we can afford in future experiments.

Pointwise estimates. We often want to estimate what would have been measured if we had set the mainline reserves to a specific value without randomization. Although computing estimates for small values of $\sigma^*$ often works well enough, very small values lead to large confidence intervals. Let $Y_\nu(\rho)$ represent the expectation we would have observed if the multipliers $m$ had mean $\rho$ and variance $\nu$. We have then $Y_\nu(\rho) = \mathbb{E}_m[\,\mathbb{E}[y \mid m]\,] = \mathbb{E}_m[Y_0(m)]$. Assuming that the pointwise value $Y_0$ is smooth enough for a second order development,

\[
Y_\nu(\rho) \approx \mathbb{E}_m\big[\, Y_0(\rho) + (m - \rho)\, Y_0'(\rho) + (m - \rho)^2\, Y_0''(\rho)/2 \,\big] = Y_0(\rho) + \nu\, Y_0''(\rho)/2.
\]

Although the reweighting method cannot estimate the pointwise value $Y_0(\rho)$ directly, we can use the reweighting method to estimate both $Y_\nu(\rho)$ and $Y_{2\nu}(\rho)$ with acceptable confidence intervals and write $Y_0(\rho) \approx 2\,Y_\nu(\rho) - Y_{2\nu}(\rho)$ (Goodwin, 2011).

Query-dependent reserves. Compare for instance the queries "car insurance" and "common cause principle" in a web search engine. Since the advertising potential of a search varies considerably with the query, it makes sense to investigate various ways to define query-dependent reserves (Charles and Chickering, 2012). The data collected using the simple mainline reserve randomization can also be used to estimate what would have been measured if we had increased all the mainline reserves by a query-dependent multiplier $\rho^*(x)$. This is simply achieved by reweighting each sample $\omega_i$ with

\[
w_i = \frac{P^*(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)} = \frac{p(m_i;\, \rho^*(x_i), \sigma)}{p(m_i;\, \rho, \sigma)}.
\]

Considerably broader ranges of counterfactual questions can be answered when data is collected using randomization schemes that explore more dimensions. For instance, in the case of the ad placement problem, we could apply an independent random multiplier for each score instead of applying a single random multiplier to the mainline reserves only. However, the more dimensions we randomize, the more data needs to be collected to effectively explore all these dimensions. Fortunately, as discussed in Section 5, the structure of the causal graph reveals many ways to leverage a priori information and improve the confidence intervals.

4.8 Related Work

Importance sampling is widely used to deal with covariate shifts (Shimodaira, 2000; Sugiyama et al., 2007). Since manipulating the causal graph changes the data distribution, such an intervention can be viewed as a covariate shift amenable to importance sampling. Importance sampling techniques have also been proposed without causal interpretation for many of the problems that we view as causal inference problems. In particular, the work presented in this section is closely related to the Monte-Carlo approach of reinforcement learning (Sutton and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit policies (Li et al., 2010, 2011).

Reinforcement learning research traditionally focuses on control problems with relatively small discrete state spaces and long sequences of observations. This focus reduces the need for characterizing exploration with tight confidence intervals.
For istace, Sutto ad Barto suggest to ormalize the importace samplig estimator by 1/ i w( i ) istead of 1/. This would give erroeous results whe the data collectio distributio leaves parts of the state space poorly explored. Cotextual badits are traditioally formulated with a fiite set of discrete actios. For istace, Li s (2011) ubiased policy evaluatio assumes that the data collectio policy always selects a arbitrary policy with probability greater tha some small costat. This is ot possible whe the actio space is ifiite. Such assumptios o the data collectio distributio are ofte impractical. For istace, certai ad placemet policies are ot worth explorig because they caot be implemeted efficietly or are kow to elicit fraudulet behaviors. There are may practical situatios i which oe is oly iterested i limited aspects of the ad placemet policy ivolvig cotiuous parameters such as click prices or reserves. Discretizig such parameters elimiates useful a priori kowledge: for istace, if we slightly icrease a reserve, we ca reasoable believe that we are goig to show slightly less ads. Istead of makig assumptios o the data collectio distributio, we costruct a biased estimator (10) ad boud its bias. We the iterpret the ier ad outer cofidece itervals as resultig from a lack of exploratio or a isufficiet sample size. 3230
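The reweighting ratio for query-dependent reserves and the two normalizations contrasted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that $p(m; \mu, \sigma)$ is a log-normal density parameterized by the mean and standard deviation of $\log m$, and the function names and the toy data are hypothetical.

```python
import math

def lognormal_pdf(m, mu, sigma):
    """Density at m > 0 of a variable whose logarithm is N(mu, sigma^2).
    The parameterization is an assumption made for this sketch."""
    z = (math.log(m) - mu) / sigma
    return math.exp(-0.5 * z * z) / (m * sigma * math.sqrt(2.0 * math.pi))

def importance_weight(m, mu_actual, mu_counterfactual, sigma):
    """Ratio of the counterfactual and actual densities of the observed multiplier m."""
    return lognormal_pdf(m, mu_counterfactual, sigma) / lognormal_pdf(m, mu_actual, sigma)

def estimate(outcomes, weights, self_normalized=False):
    """Importance sampling estimate of a counterfactual expectation.
    self_normalized=False divides by n; True divides by the sum of the weights."""
    total = sum(y * w for y, w in zip(outcomes, weights))
    denominator = sum(weights) if self_normalized else len(weights)
    return total / denominator

# Hypothetical observations: multipliers drawn during data collection and an outcome per page.
multipliers = [0.7, 0.9, 1.0, 1.2, 1.6]
outcomes = [3.0, 3.0, 2.0, 2.0, 1.0]
weights = [importance_weight(m, 0.0, 0.2, 0.3) for m in multipliers]

plain = estimate(outcomes, weights)                        # divides by n
normalized = estimate(outcomes, weights, self_normalized=True)  # divides by sum of weights
mean_weight = sum(weights) / len(weights)                  # far below 1 signals poor exploration
```

When the average weight is far below one, the counterfactual distribution puts mass where little data was collected. Dividing by the sum of the weights rescales the estimate and hides this symptom, which is the concern raised above.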

Finally, the causal framework allows us to easily formulate counterfactual questions that pertain to the practical ad placement problem and yet differ considerably in complexity and exploration requirements. We can address specific problems identified by the engineers without incurring the risks associated with a complete redesign of the system. Each of these incremental steps helps demonstrate the soundness of the approach.

5. Structure

This section shows how the structure of the causal graph reveals many ways to leverage a priori knowledge and improve the accuracy of our counterfactual estimates. Displacing the reweighting point (Section 5.1) improves the inner confidence interval and therefore reduces the need for exploration. Using a prediction function (Section 5.2) essentially improves the outer confidence interval and therefore reduces the sample size requirements.

5.1 Better Reweighting Variables

Many search result pages come without eligible ads. We then know with certainty that such pages will have zero mainline ads, receive zero clicks, and generate zero revenue. This is true for the randomly selected value of the reserve, and this would have been true for any other value of the reserve. We can exploit this knowledge by pretending that the reserve was drawn from the counterfactual distribution $P^*(q \mid x_i, a_i)$ instead of the actual distribution $P(q \mid x_i, a_i)$. The ratio $w(\omega_i)$ is therefore forced to unity. This does not change the estimate but reduces the size of the inner confidence interval. The results of Figure 13 were in fact helped by this little optimization.

There are many circumstances in which the observed outcome would have been the same for other values of the randomized variables. This prior knowledge is encoded in the structure of the causal graph and can be exploited in a more systematic manner. For instance, we know that users make click decisions without knowing which scores were computed by the ad placement engine, and without knowing the prices charged to advertisers. The ad placement causal graph encodes this knowledge by showing the clicks $y$ as direct effects of the user intent $u$ and the ad slate $s$. This implies that the exact value of the scores $q$ does not matter to the clicks $y$ as long as the ad slate $s$ remains the same.

Because the causal graph has this special structure, we can simplify both the actual and counterfactual Markov factorizations (2) and (3) without eliminating the variable $y$ whose expectation is sought. Successively eliminating variables $z$, $c$, and $q$ gives:

$$ P(u,v,x,a,b,s,y) = P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P(s \mid x,a,b)\,P(y \mid s,u), $$
$$ P^*(u,v,x,a,b,s,y) = P(u,v)\,P(x \mid u)\,P(a \mid x,v)\,P(b \mid x,v)\,P^*(s \mid x,a,b)\,P(y \mid s,u). $$

The conditional distributions $P(s \mid x,a,b)$ and $P^*(s \mid x,a,b)$ did not originally appear in the Markov factorization. They are defined by marginalization as a consequence of the elimination of the variable $q$ representing the scores:

$$ P(s \mid x,a,b) = \int_q P(s \mid a,q,b)\,P(q \mid x,a), \qquad P^*(s \mid x,a,b) = \int_q P(s \mid a,q,b)\,P^*(q \mid x,a). $$
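The marginalization above can be illustrated in the single-multiplier setting of Section 4.6, where each ad slate corresponds to an interval of mainline reserve multipliers. The sketch below is hedged: it assumes a log-normal multiplier parameterized by the mean and standard deviation of $\log m$, and the helper names are hypothetical rather than taken from the paper.

```python
import math

def lognormal_cdf(m, mu, sigma):
    """Cumulative distribution at m of a variable whose logarithm is N(mu, sigma^2)."""
    if m <= 0.0:
        return 0.0
    if math.isinf(m):
        return 1.0
    return 0.5 * (1.0 + math.erf((math.log(m) - mu) / (sigma * math.sqrt(2.0))))

def slate_probability(m_min, m_max, mu, sigma):
    """Probability that the random multiplier falls in the interval producing a given slate.
    This is the marginal P(s | x, a, b) when the slate is a deterministic function of m."""
    return lognormal_cdf(m_max, mu, sigma) - lognormal_cdf(m_min, mu, sigma)

def slate_weight(m_min, m_max, actual, counterfactual):
    """Ratio of the slate probabilities under the counterfactual and actual distributions.
    `actual` and `counterfactual` are (mu, sigma) pairs."""
    p_actual = slate_probability(m_min, m_max, *actual)
    if p_actual <= 0.0:
        raise ValueError("the observed slate has zero probability under the actual distribution")
    return slate_probability(m_min, m_max, *counterfactual) / p_actual

# A page without eligible ads shows the same (empty) slate for every multiplier,
# so its interval is (0, infinity) and its weight is exactly one.
weight_no_ads = slate_weight(0.0, math.inf, (0.0, 0.3), (0.2, 0.3))

# A slate observed for multipliers between 0.8 and 1.5 (hypothetical thresholds).
weight_slate = slate_weight(0.8, 1.5, (0.0, 0.3), (0.2, 0.3))
```

Because an interval of multipliers carries more probability mass than a single multiplier value, these ratios tend to stay closer to one than the ratios of score densities, which is what tightens the inner confidence interval.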

[Figure 14 plots: average mainline ads per page and average clicks per page, each shown as a function of the mainline reserve variation.]

Figure 14: Estimated variations of two performance metrics in response to mainline reserve changes. These estimates were obtained using the ad slates $s$ as reweighting variable. Compare the inner confidence intervals with those shown in Figure 13.

We can estimate the counterfactual click yield $Y^*$ using these simplified factorizations:

$$ Y^* = \int_\omega y\, P^*(u,v,x,a,b,s,y) = \int_\omega y\, \frac{P^*(s \mid x,a,b)}{P(s \mid x,a,b)}\, P(u,v,x,a,b,s,y) \approx \frac{1}{n} \sum_{i=1}^{n} y_i\, \frac{P^*(s_i \mid x_i,a_i,b_i)}{P(s_i \mid x_i,a_i,b_i)} . \qquad (16) $$

We have reproduced the experiments described in Section 4.6 with the counterfactual estimate (16) instead of (4). For each example $\omega_i$, we determine which range $[m_i^{\min}, m_i^{\max}]$ of mainline reserve multipliers could have produced the observed ad slate $s_i$, and then compute the reweighting ratio using the formula:

$$ w_i = \frac{P^*(s_i \mid x_i,a_i,b_i)}{P(s_i \mid x_i,a_i,b_i)} = \frac{\Psi(m_i^{\max};\, \rho^*, \sigma^*) - \Psi(m_i^{\min};\, \rho^*, \sigma^*)}{\Psi(m_i^{\max};\, \rho, \sigma) - \Psi(m_i^{\min};\, \rho, \sigma)} , $$

where $\Psi(m; \rho, \sigma)$ is the cumulative of the log-normal multiplier distribution. Figure 14 shows counterfactual estimates obtained using the same data as Figure 13. The obvious improvement of the

inner confidence intervals significantly extends the range of mainline reserve multipliers for which we can compute accurate counterfactual expectations using this same data.

Figure 15: The reweighting variable(s) must intercept all causal paths from the point of intervention to the point of measurement.

Figure 16: A distribution on the scores $q$ induces a distribution on the possible ad slates $s$. If the observed slate is slate2, the reweighting ratio is 34/22.

Comparing (4) and (16) makes the difference very clear: instead of computing the ratio of the probabilities of the observed scores under the counterfactual and actual distributions, we compute the ratio of the probabilities of the observed ad slates under the counterfactual and actual distributions. As illustrated by Figure 15, we now distinguish the reweighting variable (or variables) from the intervention. In general, the corresponding manipulation of the Markov factorization consists of marginalizing out all the variables that appear on the causal paths connecting the point of intervention to the reweighting variables and factoring all the independent terms out of the integral. This simplification works whenever the reweighting variables intercept all the causal paths connecting the point of intervention to the measurement variable. In order to compute the new reweighting ratios, all the factors remaining inside the integral, that is, all the factors appearing on the causal paths connecting the point of intervention to the reweighting variables, have to be known.

Figure 14 does not report the average revenue per page because the revenue $z$ also depends on the scores $q$ through the click prices $c$. This causal path is not intercepted by the ad slate variable $s$ alone. However, we can introduce a new variable $\tilde{c} = f(c, y)$ that filters out the click prices computed

for ads that did not receive a click. Markedly improved revenue estimates are then obtained by reweighting according to the joint variable $(s, \tilde{c})$.

Figure 16 illustrates the same approach applied to the simultaneous randomization of all the scores $q$ using independent log-normal multipliers. The weight $w(\omega_i)$ is the ratio of the probabilities of the observed ad slate $s_i$ under the counterfactual and actual multiplier distributions. Computing these probabilities amounts to integrating a multivariate Gaussian distribution (Genz, 1992). Details will be provided in a forthcoming publication.

5.2 Variance Reduction with Predictors

Although we do not know exactly how the variable of interest $\ell(\omega)$ depends on the measurable variables and is affected by interventions on the causal graph, we may have strong a priori knowledge about this dependency. For instance, if we augment the slate $s$ with an ad that usually receives a lot of clicks, we can expect an increase in the number of clicks.

Let the invariant variables $\upsilon$ be all observed variables that are not direct or indirect effects of variables affected by the intervention under consideration. This definition implies that the distribution of the invariant variables is not affected by the intervention. Therefore the values $\upsilon_i$ of the invariant variables sampled during the actual experiment are also representative of the distribution of the invariant variables under the counterfactual conditions.

We can leverage a priori knowledge to construct a predictor $\zeta(\omega)$ of the quantity $\ell(\omega)$ whose counterfactual expectation $Y^*$ is sought. We assume that the predictor $\zeta(\omega)$ depends only on the invariant variables or on variables that depend on the invariant variables through known functional dependencies. Given sampled values $\upsilon_i$ of the invariant variables, we can replay both the original and manipulated structural equation models as explained in Section 4.1 and obtain samples $\zeta_i$ and $\zeta_i^*$ that respectively follow the actual and counterfactual distributions. Then, regardless of the quality of the predictor,

$$ Y^* = \int_\omega \ell(\omega)\,P^*(\omega) = \int_\omega \zeta(\omega)\,P^*(\omega) + \int_\omega \big(\ell(\omega) - \zeta(\omega)\big)\,P^*(\omega) \approx \frac{1}{n}\sum_{i=1}^{n} \zeta_i^* + \frac{1}{n}\sum_{i=1}^{n} \big(\ell(\omega_i) - \zeta_i\big)\,w(\omega_i) . \qquad (17) $$

The first term in this sum represents the counterfactual expectation of the predictor and can be accurately estimated by averaging the simulated counterfactual samples $\zeta_i^*$ without resorting to potentially large importance weights. The second term in this sum represents the counterfactual expectation of the residuals $\ell(\omega) - \zeta(\omega)$ and must be estimated using importance sampling. Since the magnitude of the residuals is hopefully smaller than that of $\ell(\omega)$, the variance of $(\ell(\omega) - \zeta(\omega))\,w(\omega)$ is reduced and the importance sampling estimator of the second term has improved confidence intervals. The more accurate the predictor $\zeta(\omega)$, the more effective this variance reduction strategy.

This variance reduction technique is in fact identical to the doubly robust contextual bandit evaluation technique of Dudík et al. (2012). Doubly robust variance reduction has also been extensively used for causal inference applied to biostatistics (see Robins et al., 2000; Bang and Robins, 2005). We subjectively find that viewing the predictor as a component of the causal graph (Figure 17) clarifies how a well designed predictor can leverage prior knowledge. For instance, in order to estimate the counterfactual performance of the ad placement system, we can easily use a predictor that runs the ad auction and simulates the user clicks using a click probability model trained offline.
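The two-term estimate (17) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the replayed predictor samples and the importance weights are taken as given, and the function names and toy values are hypothetical.

```python
def plain_estimate(outcomes, weights):
    """Ordinary importance sampling estimate: average of l(omega_i) * w(omega_i)."""
    return sum(l * w for l, w in zip(outcomes, weights)) / len(outcomes)

def predictor_assisted_estimate(outcomes, predictions, counterfactual_predictions, weights):
    """Estimate (17): counterfactual average of the predictor plus the
    importance-weighted average of the residuals.

    outcomes[i]                   observed l(omega_i)
    predictions[i]                zeta_i, predictor replayed under the actual model
    counterfactual_predictions[i] zeta*_i, predictor replayed under the manipulated model
    weights[i]                    importance weight w(omega_i)
    """
    n = len(outcomes)
    predictor_term = sum(counterfactual_predictions) / n
    residual_term = sum((l - z) * w for l, z, w in zip(outcomes, predictions, weights)) / n
    return predictor_term + residual_term

# Hypothetical values for three pages.
outcomes = [1.0, 0.0, 2.0]
weights = [0.5, 2.0, 1.0]

# With a perfect predictor the residuals vanish and the weights play no role.
perfect = predictor_assisted_estimate(outcomes, outcomes, [1.5, 0.5, 2.5], weights)

# With a predictor that is identically zero we recover plain importance sampling.
trivial = predictor_assisted_estimate(outcomes, [0.0] * 3, [0.0] * 3, weights)
```

The two extreme cases show why the quality of the predictor affects only the variance: whatever the predictor, the large weights multiply residuals rather than raw outcomes.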

Figure 17: Leveraging a predictor. Yellow nodes represent known functional relations in the structural equation model. We can estimate the counterfactual expectation $Y^*$ of the number of clicks per page as the sum of the counterfactual expectations of a predictor $\zeta$, which is easy to estimate by replaying empirical data, and $y - \zeta$, which has to be estimated by importance sampling but has reduced variance.

Figure 18: The two plots show the hourly click yield for two variants of the ad placement engine. The daily variations dwarf the differences between the two treatments.

5.3 Invariant Predictors

In order to evaluate which of two interventions is most likely to improve the system, the designer of a learning system often seeks to estimate a counterfactual difference, that is, the difference $Y^+ - Y^*$ of the expectations of the same quantity $\ell(\omega)$ under two different counterfactual distributions $P^+(\omega)$ and $P^*(\omega)$. These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have very large effects on the number of ad clicks (Figure 18) but affect $Y^+$ and $Y^*$ in similar ways.

Substantially better confidence intervals on the difference $Y^+ - Y^*$ can be obtained using an invariant predictor, that is, a predictor function that depends only on invariant variables $\upsilon$ such as the time of the day. Since the invariant predictor $\zeta(\upsilon)$ is not affected by the interventions under consideration,

$$ \int_\omega \zeta(\upsilon)\,P^*(\omega) = \int_\omega \zeta(\upsilon)\,P^+(\omega) . \qquad (18) $$

Therefore

$$ Y^+ - Y^* = \int_\omega \zeta(\upsilon)\,P^+(\omega) + \int_\omega \big(\ell(\omega) - \zeta(\upsilon)\big)\,P^+(\omega) - \int_\omega \zeta(\upsilon)\,P^*(\omega) - \int_\omega \big(\ell(\omega) - \zeta(\upsilon)\big)\,P^*(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \big(\ell(\omega_i) - \zeta(\upsilon_i)\big)\, \frac{P^+(\omega_i) - P^*(\omega_i)}{P(\omega_i)} . $$

This direct estimate of the counterfactual difference $Y^+ - Y^*$ benefits from the same variance reduction effect as (17) without the need to estimate the expectations (18). Appendix C provides details on the computation of confidence intervals for estimators of the counterfactual differences. Appendix D shows how the same approach can be used to compute counterfactual derivatives that describe the response of the system to very small interventions.

6. Learning

The previous sections deal with the identification and the measurement of interpretable signals that can justify the actions of human decision makers. These same signals can also justify the actions of machine learning algorithms. This section explains why optimizing a counterfactual estimate is a sound learning procedure.

6.1 A Learning Principle

We consider in this section interventions that depend on a parameter $\theta$. For instance, we might want to know what the performance of the ad placement engine would have been if we had used different values for the parameter $\theta$ of the click scoring model. Let $P^\theta(\omega)$ denote the counterfactual Markov factorization associated with this intervention. Let $Y^\theta$ be the counterfactual expectation of $\ell(\omega)$ under distribution $P^\theta$.

Figure 19 illustrates our simple learning setup. Training data is collected from a single experiment associated with an initial parameter value $\theta^0$ chosen using prior knowledge acquired in an unspecified manner. A preferred parameter value $\theta^*$ is then determined using the training data and loaded into the system. The goal is of course to observe a good performance on data collected during a test period that takes place after the switching point.

Figure 19: Single design. A preferred parameter value $\theta^*$ is determined using randomized data collected in the past. Test data is collected after loading $\theta^*$ into the system.

The isolation assumption introduced in Section 3.2 states that the exogenous variables are drawn from an unknown but fixed joint probability distribution. This distribution induces a joint distribution $P^\theta(\omega)$ on all the variables appearing in the structural equation model associated with the parameter $\theta$. Therefore, if the isolation assumption remains valid during the test period, the test data follows the same distribution $P^{\theta^*}(\omega)$ that would have been observed during the training data collection period if the system had been using parameter $\theta^*$ all along.

We can therefore formulate this problem as the optimization of the expectation $Y^\theta$ of the reward $\ell(\omega)$ with respect to the distribution $P^\theta(\omega)$,

$$ \max_\theta \; Y^\theta = \int_\omega \ell(\omega)\, P^\theta(\omega) , $$

on the basis of a finite set of training examples $\omega_1, \dots, \omega_n$ sampled from $P(\omega)$.

However, it would be unwise to maximize the estimates obtained using approximation (5) because they could reach a maximum for a value of $\theta$ that is poorly explored by the actual distribution. As explained in Section 4.5, the gap between the upper and lower bound of inequality (14) reveals the uncertainty associated with insufficient exploration. Maximizing an empirical estimate $\hat{Y}^\theta$ of the lower bound $\bar{Y}^\theta$ ensures that the optimization algorithm finds a trustworthy answer

$$ \theta^* = \arg\max_\theta \; \hat{Y}^\theta . \qquad (19) $$

We shall now discuss the statistical basis of this learning principle (see footnote 9).

9. The idea of maximizing the lower bound may surprise readers familiar with the UCB algorithm for multi-armed bandits (Auer et al., 2002). UCB performs exploration by maximizing the upper confidence interval bound and updating the confidence intervals online. Exploration in our setup results from the active system randomization during the offline data collection. See also Section

6.2 Uniform Confidence Intervals

As discussed in Section 4.4, inequality (14),

$$ \bar{Y}^\theta \le Y^\theta \le \bar{Y}^\theta + M\,(1 - \bar{W}^\theta) , $$

where

$$ \bar{Y}^\theta = \int_\omega \ell(\omega)\, \bar{w}(\omega)\, P(\omega) \approx \hat{Y}^\theta = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}(\omega_i) , $$
$$ \bar{W}^\theta = \int_\omega \bar{w}(\omega)\, P(\omega) \approx \hat{W}^\theta = \frac{1}{n} \sum_{i=1}^{n} \bar{w}(\omega_i) , $$

leads to confidence intervals (15) of the form

$$ \forall \delta > 0,\ \forall \theta: \quad P\Big\{\, \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\Big\} \ge 1 - \delta . \qquad (20) $$

Both $\epsilon_R$ and $\xi_R$ converge to zero in inverse proportion to the square root of the sample size $n$. They also increase at most linearly in $\log\delta$ and depend on both the capping bound $R$ and the parameter $\theta$ through the empirical variances (see Appendix B).

Such confidence intervals are insufficient to provide guarantees for a parameter value $\theta^*$ that depends on the sample. In fact, the optimization procedure (19) is likely to select values of $\theta$ for which the inequality is violated. We therefore seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously valid for all values of $\theta$.

When the parameter $\theta$ is chosen from a finite set $F$, applying the union bound to the ordinary intervals (20) immediately gives the uniform confidence interval:

$$ P\Big\{\, \forall \theta \in F,\ \ \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\Big\} \ge 1 - |F|\,\delta . $$

Following the pioneering work of Vapnik and Chervonenkis, a broad range of mathematical tools has been developed to construct uniform confidence intervals when the set $F$ is infinite. For instance, Appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil, 2009) and obtains the uniform confidence interval

$$ P\Big\{\, \forall \theta \in F,\ \ \hat{Y}^\theta - \epsilon_R \le Y^\theta \le \hat{Y}^\theta + M\,(1 - \hat{W}^\theta + \xi_R) + \epsilon_R \,\Big\} \ge 1 - \mathcal{M}(n)\,\delta , \qquad (21) $$

where the growth function $\mathcal{M}(n)$ measures the capacity of the family of functions

$$ \big\{\, f_\theta : \omega \mapsto \ell(\omega)\,\bar{w}(\omega),\ \ g_\theta : \omega \mapsto \bar{w}(\omega),\ \ \forall \theta \in F \,\big\} . $$

Many practical choices of $P^*(\omega)$ lead to functions $\mathcal{M}(n)$ that grow polynomially with the sample size. Because both $\epsilon_R$ and $\xi_R$ are $O(n^{-1/2} \log\delta)$, they converge to zero with the sample size when one maintains the confidence level $1 - \mathcal{M}(n)\,\delta$ equal to a predefined constant.

The interpretation of the inner and outer confidence intervals (Section 4.5) also applies to the uniform confidence interval (21). When the sample size is sufficiently large and the capping bound $R$ chosen appropriately, the inner confidence interval reflects the upper and lower bound of inequality (14). The uniform confidence interval therefore ensures that $Y^{\theta^*}$ is close to the maximum of the lower bound of inequality (14), which essentially represents the best performance that can be guaranteed using training data sampled from $P(\omega)$. Meanwhile, the upper bound of this same inequality reveals which values of $\theta$ could potentially offer better performance but have been insufficiently probed by the sampling distribution (Figure 20).

6.3 Tuning Ad Placement Auctions

We now present an application of this learning principle to the optimization of auction tuning parameters in the ad placement engine. Despite increasingly challenging engineering difficulties,

comparable optimization procedures can obviously be applied to larger numbers of tunable parameters.

Figure 20: The uniform inner confidence interval reveals where the best guaranteed $Y^\theta$ is reached and where additional exploration is needed.

Figure 21: Level curves associated with the average number of mainline ads per page (red curves, labelled from -6% to +10%) and the average estimated advertisement value generated per page (black curves, labelled with arbitrary units ranging from 164 to 169) that would have been observed for a certain query cluster if we had changed the mainline reserves by the multiplicative factor shown on the horizontal axis, and if we had applied a squashing exponent $\alpha$ shown on the vertical axis to the estimated click probabilities $q_{i,p}(x)$.

Lahaie and McAfee (2011) propose to account for the uncertainty of the click probability estimation by introducing a squashing exponent $\alpha$ to control the impact of the estimated probabilities on the rank scores. Using the notations introduced in Section 2.1, and assuming that the estimated probability of a click on ad $i$ placed at position $p$ after query $x$ has the form $q_{i,p}(x) = \gamma_p\,\beta_i(x)$ (see

Appendix A), they redefine the rank-score $r_{i,p}(x)$ as:

$$ r_{i,p}(x) = \gamma_p\, b_i\, \beta_i(x)^\alpha . $$

Using a squashing exponent $\alpha < 1$ reduces the contribution of the estimated probabilities and increases the reliance on the bids $b_i$ placed by the advertisers.

Because the squashing exponent changes the rank-score scale, it is necessary to simultaneously adjust the reserves in order to display a comparable number of ads. In order to estimate the counterfactual performance of the system under interventions affecting both the squashing exponent and the mainline reserves, we have collected data using a random squashing exponent following a normal distribution, and a mainline reserve multiplier following a log-normal distribution as described in Section 4.6. Samples describing 12 million search result pages were collected during four consecutive weeks.

Following Charles and Chickering (2012), we consider separate squashing coefficients $\alpha_k$ and mainline reserve multipliers $\rho_k$ per query cluster $k \in \{1..K\}$, and, in order to avoid negative user or advertiser reactions, we seek the auction tuning parameters $\alpha_k$ and $\rho_k$ that maximize an estimate of the advertisement value (see footnote 10) subject to a global constraint on the average number of ads displayed in the mainline. Because maximizing the advertisement value instead of the publisher revenue amounts to maximizing the size of the advertisement pie instead of the publisher slice of the pie, this criterion is less likely to simply raise the prices without improving the ads. Meanwhile the constraint ensures that users are not exposed to excessive numbers of mainline ads.

We then use the collected data to estimate bounds on the counterfactual expectations of the advertiser value and the counterfactual expectation of the number of mainline ads per page. Figure 21 shows the corresponding level curves for a particular query cluster. We can then run a simple optimization algorithm and determine the optimal auction tuning parameters for each cluster subject to the global mainline footprint constraint.
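The effect of the squashing exponent on the rank-scores defined above can be checked with a small worked example. The two ads, their bids, and their click probability factors are hypothetical values chosen for illustration, and the position factor $\gamma_p$ is set to one.

```python
def rank_score(bid, beta, alpha, gamma=1.0):
    """Squashed rank-score gamma_p * b_i * beta_i(x) ** alpha."""
    return gamma * bid * beta ** alpha

def ranking(ads, alpha):
    """Ad names sorted by decreasing rank-score; `ads` maps a name to (bid, beta)."""
    return sorted(ads, key=lambda name: rank_score(*ads[name], alpha), reverse=True)

# Hypothetical ads: A has a low bid and a high click probability factor, B the opposite.
ads = {"A": (1.0, 0.10), "B": (2.5, 0.03)}

order_unsquashed = ranking(ads, alpha=1.0)  # scores 0.100 for A and 0.075 for B
order_squashed = ranking(ads, alpha=0.5)    # scores about 0.316 for A and 0.433 for B
```

With $\alpha = 1$ ad A ranks first, whereas with $\alpha = 0.5$ the higher bid of ad B dominates. The scores themselves also grow when $\alpha$ decreases, since the probability factors are below one, which is why the reserves must be adjusted together with the exponent.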
Appendix D describes how to estimate off-policy counterfactual derivatives that greatly help the numerical optimization.

The obvious alternative (see Charles and Chickering, 2012) consists of replaying the auctions with different parameters and simulating the user using a click probability model. However, it may be unwise to rely on a click probability model to estimate the best value of a squashing coefficient that is expected to compensate for the uncertainty of the click prediction model itself. The counterfactual approach described here avoids the problem because it does not rely on a click prediction model to simulate users. Instead it estimates the counterfactual performance of the system using the actual behavior of the users collected under moderate randomization.

6.4 Sequential Design

Confidence intervals computed after a first randomized data collection experiment might not offer sufficient accuracy to choose a final value of the parameter $\theta$. It is generally unwise to simply collect additional samples using the same experimental setup because the current data already reveals information (Figure 20) that can be used to design a better data collection experiment. Therefore, it seems natural to extend the learning principle discussed in Section 6.1 to a sequence of data collection experiments. The parameter $\theta_t$ characterizing the $t$-th experiment is then determined using samples collected during the previous experiments (Figure 22).

10. The value of an ad click from the point of view of the advertiser. The advertiser payment then splits the advertisement value between the publisher and the advertiser.

Figure 22: Sequential design. The parameter $\theta_t$ of each data collection experiment is determined using data collected during the previous experiments.

Although it is relatively easy to construct convergent sequential design algorithms, reaching the optimal learning performance is notoriously difficult (Wald, 1945) because the selection of parameter $\theta_t$ involves a trade-off between exploitation, that is, the maximization of the immediate reward $Y^{\theta_t}$, and exploration, that is, the collection of samples potentially leading to better $Y^\theta$ in the more distant future.

The optimal exploration-exploitation trade-off for multi-armed bandits is well understood (Gittins, 1989; Auer et al., 2002; Audibert et al., 2007) because an essential property of multi-armed bandits makes the analysis much simpler: the outcome observed after performing a particular action brings no information about the value of other actions. Such an assumption is both unrealistic and pessimistic. For instance, the outcome observed after displaying a certain ad in response to a certain query brings very useful information about the value of displaying similar ads on similar queries.

Refined contextual bandit approaches (Slivkins, 2011) account for similarities in the context and action spaces but do not take advantage of all the additional opportunities expressed by structural equation models. For instance, in the contextual bandit formulation of the ad placement problem outlined in Section 3.5, actions are pairs $(s, c)$ describing the ad slate $s$ and the corresponding click prices $c$, policies select actions by combining individual ad scores in very specific ways, and actions determine the rewards through very specific mechanisms.

Meanwhile, despite their suboptimal asymptotic properties, heuristic exploration strategies perform surprisingly well during the time span in which the problem can be considered stationary. Even in the simple case of multi-armed bandits, excellent empirical results have been obtained using Thompson sampling (Chapelle and Li, 2011) or fixed strategies (Vermorel and Mohri, 2005; Kuleshov and Precup, 2010). Leveraging the problem structure seems more important in practice than perfecting an otherwise sound exploration strategy.

Therefore, in the absence of sufficient theoretical guidance, it is both expedient and practical to maximize $\hat{Y}^\theta$ at each round, as described in Section 6.1, subject to additional ad-hoc constraints ensuring a minimum level of exploration.

7. Equilibrium Analysis

All the methods discussed in this contribution rely on the isolation assumption presented in Section 3.2. This assumption lets us interpret the samples as repeated independent trials that follow the pattern defined by the structural equation model and are amenable to statistical analysis.

The isolation assumption is in fact a component of the counterfactual conditions under investigation. For instance, in Section 4.6, we model single auctions (Figure 3) in order to empirically

determine how the ad placement system would have performed if we had changed the mainline reserves without incurring a reaction from the users or the advertisers.

Since the future publisher revenues depend on the continued satisfaction of users and advertisers, lifting this restriction is highly desirable.

We can in principle work with larger structural equation models. For instance, Figure 4 suggests threading single auction models with additional causal links representing the impact of the displayed ads on the future user goodwill. However, there are practical limits on the number of trials we can consider at once. For instance, it is relatively easy to simultaneously model all the auctions associated with the web pages served to the same user during a thirty minute web session. On the other hand, it is practically impossible to consider several weeks worth of auctions in order to model their accumulated effect on the continued satisfaction of users and advertisers.

We can sometimes use problem-specific knowledge to construct alternate performance metrics that anticipate the future effects of the feedback loops. For instance, in Section 6.3, we optimize the advertisement value instead of the publisher revenue. Since this alternative criterion takes the advertiser interests into account, it can be viewed as a heuristic proxy for the future revenues of the publisher.

This section proposes an alternative way to account for such feedback loops using the quasi-static equilibrium method familiar to physicists: we assume that the publisher changes the parameter $\theta$ so slowly that the system remains at equilibrium at all times. Using data collected while the system was at equilibrium, we describe empirical methods to determine how an infinitesimal intervention $d\theta$ on the model parameters would have displaced the equilibrium:

How would the system have performed during the data collection period if a small change $d\theta$ had been applied to the model parameter $\theta$ and the equilibrium had been reached before the data collection period?

A learning algorithm can then update $\theta$ to improve selected performance metrics.

7.1 Rational Advertisers

The ad placement system is an example of a game where each actor furthers his or her interests by controlling some aspects of the system: the publisher controls the placement engine parameters, the advertisers control the bids, and the users control the clicks.

As an example of the general quasi-static approach, this section focuses on the reaction of rational advertisers to small changes of the scoring functions driving the ad placement system. Rational advertisers always select bids that maximize their economic interests. Although there are more realistic ways to model advertisers, this exercise is interesting because the auction theory approaches also rely on the rational advertiser assumption (see Section 2.1). This analysis seamlessly integrates the auction theory and machine learning perspectives.

As illustrated in Figure 23, we treat the bid vector $b_\star = (b_1 \dots b_A) \in [0, b_{\max}]^A$ as the parameter of the conditional distribution $P^{b_\star}(b \mid x, v)$ of the bids associated with the eligible ads (see footnote 11). The variables $y_a$ in the structural equation model represent the number of clicks received by ads associated with bid $b_a$. The variables $z_a$ represent the amount charged for these clicks to the corresponding advertiser. The advertisers select their bids $b_a$ according to their anticipated impact on the number of resulting clicks $y_a$ and on their cost $z_a$.

11. Quantities measured when a feedback causal system reaches equilibrium often display conditional independence patterns that cannot be represented with directed acyclic graphs (Lauritzen and Richardson, 2002; Dash, 2003). Treating the feedback loop as parameters instead of variables works around this difficulty in a manner that appears sufficient to perform the quasi-static analysis.

Figure 23: Advertisers select the bid amounts $b_a$ on the basis of the past number of clicks $y_a$ and the past prices $z_a$ observed for the corresponding ads.

Figure 24: Advertisers control the expected number of clicks $Y_a$ and expected prices $Z_a$ by adjusting their bids $b_a$. Rational advertisers select bids that maximize the difference between the value they see in the clicks and the price they pay.

Following the pattern of the perfect information assumption (see Section 2.1), we assume that the advertisers eventually acquire full knowledge of the expectations

$$ Y_a(\theta, b_\star) = \int_\omega y_a\, P^{\theta, b_\star}(\omega) \quad \text{and} \quad Z_a(\theta, b_\star) = \int_\omega z_a\, P^{\theta, b_\star}(\omega) . $$

Let $V_a$ denote the value of a click for the corresponding advertiser. Rational advertisers seek to maximize the difference between the value they see in the clicks and the price they pay to the publisher, as illustrated in Figure 24. This is expressed by the utility functions

$$ U_a^\theta(b_\star) = V_a\, Y_a(\theta, b_\star) - Z_a(\theta, b_\star) . $$

38 BOTTOU, PETERS, ET AL. Followig Athey ad Nekipelov (2010), we argue that the ijectio of smooth radom oise ito the auctio mechaism chages the discrete problem ito a cotiuous problem ameable to stadard differetial methods. Mild regularity assumptio o the desities probability P b (b x,v) ad P θ (q x,a) are i fact sufficiet to esure that the expectatios Y a (θ,b ) ad Z a (θ,b ) are cotiuously differetiable fuctios of the distributio parameters b ad θ. Further assumig that utility fuctios U θ a(b ) are diagoally quasicocave, Athey ad Nekipelov establish the existece of a uique Nash equilibrium a b a ArgMax b U θ a(b 1,...,b a 1,b,b a+1,...,b A ) characterized by its first order Karush-Kuh-Tucker coditios a V a Y a b a Z a b a 0 if b a = 0, 0 if b a = b max, = 0 if 0<b a < b max. (22) We use the first order equilibrium coditios (22) for two related purposes. Sectio 7.2 explais how to complete the advertiser model by estimatig the values V a. Sectio 7.3 estimates how the equilibrium bids ad the system performace metrics respod to a small chage dθ of the model parameters. Iterestigly, this approach remais sesible whe key assumptios of the equilibrium model are violated. The perfect iformatio assumptio is ulikely to hold i practice. The quasi-cocavity of the utility fuctios is merely plausible. However, after observig the operatio of the statioary ad placemet system for a sufficietly log time, it is reasoable to assume that the most active advertisers have tried small bid variatios ad have chose locally optimal oes. Less active advertisers may leave their bids uchaged for loger time periods, but ca also update them brutally if they experiece a sigificat chage i retur o ivestmet. Therefore it makes sese to use data collected whe the system is statioary to estimate advertiser values V a that are cosistet with the first order equilibrium coditios. 
We then hope to maintain the conditions that each advertiser had found sufficiently attractive, by first estimating how a small change $d\theta$ displaces this posited local equilibrium, then by using performance metrics that take this displacement into account.

7.2 Estimating Advertiser Values

We first need to estimate the partial derivatives appearing in the equilibrium condition (22). These derivatives measure how the expectations $Y_a$ and $Z_a$ would have been changed if each advertiser had placed a slightly different bid $b_a$. Such quantities can be estimated by randomizing the bids and computing on-policy counterfactual derivatives as explained in Appendix D. Confidence intervals can be derived with the usual tools.

Unfortunately, the publisher is not allowed to directly randomize the bids because the advertisers expect to pay prices computed using the bid they have specified and not the potentially higher bids resulting from the randomization. However, the publisher has full control over the estimated click probabilities $q_{i,p}(x)$. Since the rank-scores $r_{i,p}(x)$ are the products of the bids and the estimated click probabilities (see Section 2.1), a random multiplier applied to the bids can also be interpreted as a random multiplier applied to the estimated click probabilities. Under these two interpretations, the same ads are shown to the users, but different click prices are charged to the advertisers. Therefore, the publisher can simultaneously charge prices computed as if the multiplier had been applied to the estimated click probabilities, and collect data as if the multiplier had been applied to the bid. This data can then be used to estimate the derivatives.

Solving the first order equilibrium equations then yields estimated advertiser values $V_a$ that are consistent with the observed data.¹²

$$V_a \approx \frac{\partial Z_a}{\partial b_a} \Big/ \frac{\partial Y_a}{\partial b_a}$$

There are however a couple of caveats:

- The advertiser bid $b_a$ may be too small to cause ads to be displayed. In the absence of data, we have no means to estimate a click value for these advertisers.
- Many ads are not displayed often enough to obtain accurate estimates of the partial derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$. This can be partially remedied by smartly aggregating the data of advertisers deemed similar.
- Some advertisers attempt to capture all the available ad opportunities by placing extremely high bids and hoping to pay reasonable prices thanks to the generalized second price rule. Both partial derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$ are equal to zero in such cases. Therefore we cannot recover $V_a$ by solving the equilibrium Equation (22). It is however possible to collect useful data by selecting for these advertisers a maximum bid $b_{\max}$ that prevents them from monopolizing the eligible ad opportunities. Since the equilibrium condition is an inequality when $b_a = b_{\max}$, we can only determine a lower bound of the values $V_a$ for these advertisers.

These caveats in fact underline the limitations of the advertiser modelling assumptions. When their ads are not displayed often enough, advertisers have no more chance to acquire a full knowledge of the expectations $Y_a$ and $Z_a$ than the publisher has a chance to determine their value. Similarly, advertisers that place extremely high bids are probably underestimating the risk of occasionally experiencing a very high click price. A more realistic model of the advertiser information acquisition is required to adequately handle these cases.
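The way the equilibrium conditions (22) turn derivative estimates into a click value can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate_click_value`, its arguments, and the tolerance are hypothetical, and the derivative estimates are assumed to have been obtained beforehand (for instance with the randomized multipliers described above). The sketch also assumes that a higher bid never decreases the expected number of clicks.

```python
from typing import Optional, Tuple


def estimate_click_value(
    dY_db: float,
    dZ_db: float,
    at_max_bid: bool = False,
    tol: float = 1e-12,
) -> Optional[Tuple[float, bool]]:
    """Solve the first order condition V * dY/db - dZ/db = 0 for V.

    dY_db, dZ_db: estimated partial derivatives of the expected clicks
    and expected price with respect to the advertiser's own bid.
    at_max_bid: True when the bid sits at b_max, where condition (22)
    is an inequality and the ratio is only a lower bound of the value.

    Returns (value, is_lower_bound), or None when dY/db is (numerically)
    zero or negative, in which case the value cannot be recovered.
    """
    if dY_db <= tol:
        # Covers ads that are never shown and bids so high that
        # both derivatives vanish.
        return None
    return dZ_db / dY_db, at_max_bid
```

For example, `estimate_click_value(0.5, 1.0)` returns `(2.0, False)`: one extra unit of bid buys half a click and costs one unit of price, so the advertiser behaves as if a click were worth 2.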
7.3 Estimating the Equilibrium Response

Let $\mathcal{A}$ be the set of the active advertisers, that is, the advertisers whose value can be estimated (or lower bounded) with sufficient accuracy. Assuming that the other advertisers leave their bids unchanged, we can estimate how the active advertisers adjust their bids in response to an infinitesimal change $d\theta$ of the scoring model parameters. This is achieved by differentiating the equilibrium equations (22):

$$\forall a \in \mathcal{A}, \quad 0 = \left( V_a \frac{\partial^2 Y_a}{\partial b_a \, \partial \theta} - \frac{\partial^2 Z_a}{\partial b_a \, \partial \theta} \right) d\theta + \sum_{a' \in \mathcal{A}} \left( V_a \frac{\partial^2 Y_a}{\partial b_a \, \partial b_{a'}} - \frac{\partial^2 Z_a}{\partial b_a \, \partial b_{a'}} \right) db_{a'}. \tag{23}$$

The partial second derivatives must be estimated as described in Appendix D. Solving this linear system of equations then yields an expression of the form

$$db_a = \Xi_a \, d\theta.$$

12. This approach is of course related to the value estimation method proposed by Athey and Nekipelov (2010) but strictly relies on the explicit randomization of the scores. In contrast, practical considerations force Athey and Nekipelov to rely on the apparent noise and hope that the noise model accounts for all potential confounding factors.

This expression can then be used to estimate how any counterfactual expectation $Y$ of interest changes when the publisher applies an infinitesimal change $d\theta$ to the scoring parameter $\theta$ and the active advertisers $\mathcal{A}$ rationally adjust their bids $b_a$ in response:

$$dY = \left( \frac{\partial Y}{\partial \theta} + \sum_{a} \Xi_a \frac{\partial Y}{\partial b_a} \right) d\theta. \tag{24}$$

Although this expression provides useful information, one should remain aware of its limitations. Because we can only estimate the reaction of active advertisers, expression (24) does not include the potentially positive reactions of advertisers who did not bid but could have. Because we can only estimate a lower bound of their values, this expression does not model the potential reactions of advertisers placing unrealistically high bids. Furthermore, one needs to be very cautious when the system (23) approaches singularities. Singularities indicate that the rational advertiser assumption is no longer sufficient to determine the reactions of certain advertisers. This happens for instance when advertisers cannot find bids that deliver a satisfactory return. The eventual behavior of such advertisers then depends on factors not taken into consideration by our model.

To alleviate these issues, we could alter the auction mechanism in a manner that forces advertisers to reveal more information, and we could enforce policies ensuring that the system (23) remains safely nonsingular. We could also design experiments revealing the impact of the fixed costs incurred by advertisers participating in new auctions. Although additional work is needed to design such refinements, the quasistatic equilibrium approach provides a generic framework to take such aspects into account.

7.4 Discussion

The rational advertiser assumption is the cornerstone of seminal works describing simplified variants of the ad placement problem using auction theory (Varian, 2007; Edelman et al., 2007).
More sophisticated works account for more aspects of the ad placement problem, such as the impact of click prediction learning algorithms (Lahaie and McAfee, 2011), the repeated nature of the ad auctions (Bergemann and Said, 2010), or the fact that advertisers place bids valid for multiple auctions (Athey and Nekipelov, 2010). Despite these advances, it seems technically very challenging to use these methods and account for all the effects that can be observed in practical ad placement systems.

We believe that our counterfactual reasoning framework is best viewed as a modular toolkit that lets us apply insights from auction theory and machine learning to problems that are far more complex than those studied in any single paper. For instance, the quasi-static equilibrium analysis technique illustrated in this section extends naturally to the analysis of multiple simultaneous causal feedback loops involving additional players:

- The first step consists in designing ad-hoc experiments to identify the parameters that determine the equilibrium equation of each player. In the case of the advertisers, we have shown how to use randomized scores to reveal the advertiser values. In the case of the user feedback, we must carefully design experiments that reveal how users respond to changes in the quality of the displayed ads.
- Differentiating all the equilibrium equations yields a linear system of equations linking the variations of the parameter under our control, such as $d\theta$, and all the parameters under the control of the other players, such as the advertiser bids, or the user willingness to visit the site and click on ads.
- Solving this system and writing the total derivative of the performance measure gives the answer to our question.

Although this programme has not yet been fully realized, the existence of a principled framework to handle such complex interactions is remarkable. Furthermore, thanks to the flexibility of the causal inference frameworks, these techniques can be infinitely adapted to various modeling assumptions and various system complexities.

8. Conclusion

Using the ad placement example, this work demonstrates the central role of causal inference (Pearl, 2000; Spirtes et al., 1993) for the design of learning systems interacting with their environment. Thanks to importance sampling techniques, data collected during randomized experiments gives precious cues to assist the designer of such learning systems and useful signals to drive learning algorithms.

Two recurrent themes structure this work. First, we maintain a sharp distinction between the learning algorithms and the extraction of the signals that drive them. Since real world learning systems often involve a mixture of human decision and automated processes, it makes sense to separate the discussion of the learning signals from the discussion of the learning algorithms that leverage them. Second, we claim that the mathematical and philosophical tools developed for the analysis of physical systems appear very effective for the analysis of causal information systems and of their equilibria. These two themes are in fact a vindication of cybernetics (Wiener, 1948).

Acknowledgments

We would like to acknowledge extensive discussions with Susan Athey, Miroslav Dudík, Patrick Jordan, John Langford, Lihong Li, Sébastien Lahaie, Shie Mannor, Chris Meek, Alex Slivkins, and Paul Viola. We also thank the Microsoft adcenter RnR team for giving us the invaluable opportunity to deploy these ideas at scale and prove their worth.
Finally we gratefully acknowledge the precious comments of our JMLR editor and reviewers.

Appendix A. Greedy Ad Placement Algorithms

Section 2.1 describes how to select and place ads on a web page by maximizing the total rank-score (1). Following Varian (2007) and Edelman et al. (2007), we assume that the click probability estimates are expressed as the product of a positive position term $\gamma_p$ and a positive ad term $\beta_i(x)$. The rank-scores can therefore be written as $r_{i,p}(x) = \gamma_p \, b_i \, \beta_i(x)$. We also assume that the policy constraints simply state that a web page should not display more than one ad belonging to any given advertiser. The discrete maximization problem is then amenable to computationally efficient greedy algorithms.

Let us fix a layout $L$ and focus on the inner maximization problem. Without loss of generality, we can renumber the positions such that $L = \{1, 2, \dots, N\}$ and $\gamma_1 \ge \gamma_2 \ge \cdots$ and write the inner maximization problem as

$$\max_{i_1, \dots, i_N} \; R_L(i_1, \dots, i_N) = \sum_{p \in L} r_{i_p, p}(x)$$

subject to the policy constraints and reserve constraints $r_{i,p}(x) \ge R_p(x)$.

Let $S_i$ denote the advertiser owning ad $i$. The set of ads is then partitioned into subsets $I_s = \{ i : S_i = s \}$ gathering the ads belonging to the same advertiser $s$. The ads that maximize the product $b_i \beta_i(x)$ within set $I_s$ are called the best ads for advertiser $s$. If the solution of the discrete maximization problem contains one ad belonging to advertiser $s$, then it is easy to see that this ad must be one of the best ads for advertiser $s$: were it not the case, replacing the offending ad by one of the best ads would yield a higher $R_L$ without violating any of the constraints. It is also easy to see that one could select any of the best ads for advertiser $s$ without changing $R_L$.

Let the set $I^\star$ contain exactly one ad per advertiser, arbitrarily chosen among the best ads for this advertiser. The inner maximization problem can then be simplified as:

$$\max_{i_1, \dots, i_N \in I^\star} \; R_L(i_1, \dots, i_N) = \sum_{p \in L} \gamma_p \, b_{i_p} \, \beta_{i_p}(x)$$

where all the indices $i_1, \dots, i_N$ are distinct, and subject to the reserve constraints.

Assume that this maximization problem has a solution $i_1^\star, \dots, i_N^\star$, meaning that there is a feasible ad placement solution for the layout $L$. For $k = 1 \dots N$, let us define $I_k^\star \subset I^\star$ as

$$I_k^\star = \operatorname*{ArgMax}_{i \,\in\, I^\star \setminus \{ i_1^\star, \dots, i_{k-1}^\star \}} b_i \, \beta_i(x).$$

It is easy to see that $I_k^\star$ intersects $\{ i_k^\star, \dots, i_N^\star \}$ because, were it not the case, replacing $i_k^\star$ by any element of $I_k^\star$ would increase $R_L$ without violating any of the constraints. Furthermore it is easy to see that $i_k^\star \in I_k^\star$ because, were it not the case, there would be $h > k$ such that $i_h^\star \in I_k^\star$, and swapping $i_k^\star$ and $i_h^\star$ would increase $R_L$ without violating any of the constraints.

Therefore, if the inner maximization problem admits a solution, we can compute a solution by recursively picking $i_1^\star, \dots, i_N^\star$ from $I_1^\star, I_2^\star, \dots, I_N^\star$.
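The recursive selection above can be sketched in Python for a single fixed layout. This is a minimal illustration, not the production algorithm: the function name `greedy_placement`, the tuple layout of `ads`, and the decision to return `None` as soon as the sorted assignment violates a reserve are all assumptions of this sketch (it does not search for alternative assignments). The score of an ad stands for the product $b_i \beta_i(x)$.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def greedy_placement(
    ads: Sequence[Tuple[str, str, float]],
    gammas: Sequence[float],
    reserves: Sequence[float],
) -> Optional[List[str]]:
    """Fill the positions of one layout greedily.

    ads: (ad_id, advertiser, score) triples, score = bid * ad term.
    gammas: position terms, assumed sorted in decreasing order.
    reserves: reserve threshold on the rank-score of each position.
    Returns the ad ids in position order, or None if the sorted
    assignment is infeasible for this layout.
    """
    # Keep one best ad per advertiser (policy constraint).
    best: Dict[str, Tuple[str, float]] = {}
    for ad_id, advertiser, score in ads:
        if advertiser not in best or score > best[advertiser][1]:
            best[advertiser] = (ad_id, score)

    # Sort the retained ads by decreasing score.
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    if len(ranked) < len(gammas):
        return None  # not enough distinct advertisers for this layout

    placement: List[str] = []
    for (ad_id, score), gamma, reserve in zip(ranked, gammas, reserves):
        if gamma * score < reserve:
            return None  # rank-score below the reserve of this position
        placement.append(ad_id)
    return placement
```

Calling this function once per candidate layout and keeping the feasible layout with the highest total rank-score mirrors the outer maximization.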
This can be done efficiently by first sorting the $b_i \beta_i(x)$ in decreasing order, and then greedily assigning ads to the best positions subject to the reserve constraints. This operation has to be repeated for all possible layouts, including of course the empty layout.

The same analysis can be carried out for click prediction estimates expressed as an arbitrary monotone combination of a position term $\gamma_p(x)$ and an ad term $\beta_i(x)$, as shown, for instance, by Graepel et al. (2010).

Appendix B. Confidence Intervals

Section 4.4 explains how to obtain improved confidence intervals by replacing the unbiased importance sampling estimator (7) by the clipped importance sampling estimator (10). This appendix provides details that could have obscured the main message.

B.1 Outer Confidence Interval

We first address the computation of the outer confidence interval (12) which describes how the estimator $\hat{Y}^*$ approaches the clipped expectation $\bar{Y}^*$.

$$\bar{Y}^* = \int_\omega \ell(\omega) \, \bar{w}(\omega) \, P(\omega) \;\approx\; \hat{Y}^* = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i) \, \bar{w}(\omega_i).$$

Since the samples $\ell(\omega_i) \bar{w}(\omega_i)$ are independent and identically distributed, the central limit theorem (e.g., Cramér, 1946, Section 17.4) states that the empirical average $\hat{Y}^*$ converges in law to a normal distribution of mean $\bar{Y}^* = \mathbb{E}[\ell(\omega) \bar{w}(\omega)]$ and variance $V/n$, where $V = \operatorname{var}[\ell(\omega) \bar{w}(\omega)]$. Since this convergence usually occurs quickly, it is widely accepted to write

$$P\left\{ \hat{Y}^* - \varepsilon_R \le \bar{Y}^* \le \hat{Y}^* + \varepsilon_R \right\} \ge 1 - \delta, \quad\text{with}\quad \varepsilon_R = \operatorname{erf}^{-1}(1 - \delta) \sqrt{2V/n} \tag{25}$$

and to estimate the variance $V$ using the sample variance $\hat{V}$

$$V \approx \hat{V} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \ell(\omega_i) \, \bar{w}(\omega_i) - \hat{Y}^* \right)^2.$$

This approach works well when the ratio ceiling $R$ is relatively small. However the presence of a few very large ratios makes the variance estimation noisy and might slow down the central limit convergence.

The first remedy is to bound the variance more rigorously. For instance, the following bound results from (Maurer and Pontil, 2009, Theorem 10).

$$P\left\{ \sqrt{V} > \sqrt{\hat{V}} + (M - m) R \sqrt{\frac{2 \log(2/\delta)}{n-1}} \right\} \le \delta$$

Combining this bound with (25) gives a confidence interval valid with probability greater than $1 - 2\delta$. Although this approach eliminates the potential problems related to the variance estimation, it does not address the potentially slow convergence of the central limit theorem.

The next remedy is to rely on empirical Bernstein bounds to derive rigorous confidence intervals that leverage both the sample mean and the sample variance (Audibert et al., 2007; Maurer and Pontil, 2009).

Theorem 1 (Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4) Let $X, X_1, X_2, \dots, X_n$ be i.i.d. random variables with values in $[a, b]$ and let $\delta > 0$. Then, with probability at least $1 - \delta$,

$$\mathbb{E}[X] - M_n \le \sqrt{\frac{2 V_n \log(2/\delta)}{n}} + (b - a) \frac{7 \log(2/\delta)}{3(n-1)},$$

where $M_n$ and $V_n$ respectively are the sample mean and variance

$$M_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad V_n = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M_n)^2.$$

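Applying this theorem to both $\ell(\omega_i) \bar{w}(\omega_i)$ and $-\ell(\omega_i) \bar{w}(\omega_i)$ provides confidence intervals that hold for the worst possible distribution of the variables $\ell(\omega)$ and $\bar{w}(\omega)$.

$$P\left\{ \hat{Y}^* - \varepsilon_R \le \bar{Y}^* \le \hat{Y}^* + \varepsilon_R \right\} \ge 1 - 2\delta \quad\text{where}\quad \varepsilon_R = \sqrt{\frac{2 \hat{V} \log(2/\delta)}{n}} + M R \, \frac{7 \log(2/\delta)}{3(n-1)}. \tag{26}$$

Because they hold for the worst possible distribution, confidence intervals obtained in this way are less tight than confidence intervals based on the central limit theorem. On the other hand, thanks to the Bernstein bound, they remain reasonably competitive, and they provide a much stronger guarantee.

B.2 Inner Confidence Interval

Inner confidence intervals are derived from inequality (14) which bounds the difference between the counterfactual expectation $Y^*$ and the clipped expectation $\bar{Y}^*$:

$$0 \le Y^* - \bar{Y}^* \le M (1 - \bar{W}^*).$$

The constant $M$ is defined by assumption (8). The first step of the derivation consists in obtaining a lower bound of $\bar{W}^* - \hat{W}^*$ using either the central limit theorem or an empirical Bernstein bound. For instance, applying theorem 1 to $-\bar{w}(\omega_i)$ yields

$$P\left\{ \bar{W}^* \ge \hat{W}^* - \sqrt{\frac{2 \hat{V}_w \log(2/\delta)}{n}} - R \, \frac{7 \log(2/\delta)}{3(n-1)} \right\} \ge 1 - \delta$$

where $\hat{V}_w$ is the sample variance of the clipped weights

$$\hat{V}_w = \frac{1}{n-1} \sum_{i=1}^{n} \left( \bar{w}(\omega_i) - \hat{W}^* \right)^2.$$

Replacing in inequality (14) gives the inner confidence interval

$$P\left\{ \bar{Y}^* \le Y^* \le \bar{Y}^* + M (1 - \hat{W}^* + \xi_R) \right\} \ge 1 - \delta \quad\text{with}\quad \xi_R = \sqrt{\frac{2 \hat{V}_w \log(2/\delta)}{n}} + R \, \frac{7 \log(2/\delta)}{3(n-1)}. \tag{27}$$

Note that $1 - \hat{W}^* + \xi_R$ can occasionally be negative. This occurs in the unlucky cases where the confidence interval is violated, with probability smaller than $\delta$.

Putting together the inner and outer confidence intervals,

$$P\left\{ \hat{Y}^* - \varepsilon_R \le Y^* \le \hat{Y}^* + M (1 - \hat{W}^* + \xi_R) + \varepsilon_R \right\} \ge 1 - 3\delta,$$

with $\varepsilon_R$ and $\xi_R$ computed as described in expressions (26) and (27).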
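The combined interval of this appendix can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the losses are assumed to lie in $[0, M]$, the importance ratios are assumed to be precomputed, and clipping follows the convention in which ratios at or above the ceiling $R$ are set to zero. The margins follow the empirical Bernstein expressions (26) and (27).

```python
import math
from typing import Sequence, Tuple


def sample_variance(values: Sequence[float]) -> float:
    """Unbiased sample variance with the 1/(n-1) normalization."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def bernstein_margin(variance: float, value_range: float, n: int, delta: float) -> float:
    """Right-hand side of the empirical Bernstein bound (theorem 1)."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + value_range * 7.0 * log_term / (3.0 * (n - 1))


def counterfactual_interval(
    losses: Sequence[float],
    weights: Sequence[float],
    M: float,
    R: float,
    delta: float,
) -> Tuple[float, float]:
    """Interval containing the counterfactual expectation w.p. >= 1 - 3*delta."""
    n = len(losses)
    if n < 2 or n != len(weights):
        raise ValueError("need at least two paired samples")
    # Clipped weights: ratios at or above the ceiling R are zeroed.
    clipped = [w if w < R else 0.0 for w in weights]
    products = [l * w for l, w in zip(losses, clipped)]
    y_hat = sum(products) / n          # clipped estimator
    w_hat = sum(clipped) / n           # average clipped weight
    eps = bernstein_margin(sample_variance(products), M * R, n, delta)  # (26)
    xi = bernstein_margin(sample_variance(clipped), R, n, delta)        # (27)
    return y_hat - eps, y_hat + M * (1.0 - w_hat + xi) + eps
```

With small samples the range terms dominate and the interval is very wide, which is the price of a guarantee that holds for the worst possible distribution.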

Appendix C. Counterfactual Differences

We now seek to estimate the difference $Y^+ - Y^*$ of the expectations of the same quantity $\ell(\omega)$ under two different counterfactual distributions $P^+(\omega)$ and $P^*(\omega)$. These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have very large effects on the number of ad clicks. When these variables affect both $Y^+$ and $Y^*$ in similar ways, we can obtain substantially better confidence intervals for the difference $Y^+ - Y^*$.

In addition to the notation $\omega$ representing all the variables in the structural equation model, we use notation $\upsilon$ to represent all the variables that are not direct or indirect effects of variables affected by the interventions under consideration.

Let $\zeta(\upsilon)$ be a known function believed to be a good predictor of the quantity $\ell(\omega)$ whose counterfactual expectation is sought. Since $P^*(\upsilon) = P(\upsilon)$, the following equality holds regardless of the quality of this prediction:

$$\begin{aligned} Y^* = \int_\omega \ell(\omega) \, P^*(\omega) &= \int_\upsilon \zeta(\upsilon) \, P^*(\upsilon) + \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] P^*(\omega) \\ &= \int_\upsilon \zeta(\upsilon) \, P(\upsilon) + \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w(\omega) \, P(\omega). \end{aligned} \tag{28}$$

Decomposing both $Y^+$ and $Y^*$ in this way and computing the difference,

$$Y^+ - Y^* = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \Delta w(\omega) \, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i),$$

$$\text{with}\quad \Delta w(\omega) = \frac{P^+(\omega)}{P(\omega)} - \frac{P^*(\omega)}{P(\omega)} = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)}.$$

The outer confidence interval size is reduced if the variance of the residual $\ell(\omega) - \zeta(\upsilon)$ is smaller than the variance of the original variable $\ell(\omega)$. For instance, a suitable predictor function $\zeta(\upsilon)$ can significantly capture the seasonal click yield variations regardless of the interventions under consideration. Even a constant predictor function can considerably change the variance of the outer confidence interval. Therefore, in the absence of a better predictor, we still can (and always should) center the integrand using a constant predictor.

The rest of this appendix describes how to construct confidence intervals for the estimation of counterfactual differences. Additional bookkeeping is required because both the weights $\Delta w(\omega_i)$ and the integrand $\ell(\omega) - \zeta(\upsilon)$ can be positive or negative.
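The centered difference estimator above can be sketched as follows. The function name `difference_estimate` and its argument layout are hypothetical; the two sets of importance ratios and the predictions $\zeta(\upsilon_i)$ are assumed to be available for each logged sample. The function returns both the estimate and the sample variance of the summands, which is the quantity that drives the size of the outer confidence interval.

```python
from typing import Sequence, Tuple


def difference_estimate(
    losses: Sequence[float],
    predictions: Sequence[float],
    w_plus: Sequence[float],
    w_star: Sequence[float],
) -> Tuple[float, float]:
    """Estimate Y+ - Y* from logged samples.

    losses: observed values l(omega_i).
    predictions: invariant predictor values zeta(upsilon_i);
        pass zeros to disable centering.
    w_plus, w_star: importance ratios P+/P and P*/P for each sample.
    Returns (estimate, sample variance of the summands).
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two samples")
    terms = [
        (l - z) * (wp - ws)
        for l, z, wp, ws in zip(losses, predictions, w_plus, w_star)
    ]
    mean = sum(terms) / n
    variance = sum((t - mean) ** 2 for t in terms) / (n - 1)
    return mean, variance
```

Comparing the returned variance with and without a predictor on the same data gives a quick check of whether centering helps for a given pair of interventions.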
We use the notation $\upsilon$ to represent the variables of the structural equation model that are left unchanged by the intervention under consideration. Such variables satisfy the relations $P^*(\upsilon) = P(\upsilon)$ and $P^*(\omega) = P^*(\omega \setminus \upsilon \mid \upsilon) \, P(\upsilon)$, where we use notation $\omega \setminus \upsilon$ to denote all remaining variables in the structural equation model. An invariant predictor is then a function $\zeta(\upsilon)$ that is believed to be a good predictor of $\ell(\omega)$. In particular, it is expected that $\operatorname{var}[\ell(\omega) - \zeta(\upsilon)]$ is smaller than $\operatorname{var}[\ell(\omega)]$.

C.1 Inner Confidence Interval with Dependent Bounds

We first describe how to construct finer inner confidence intervals by using more refined bounds on $\ell(\omega)$. In particular, instead of the simple bound (8), we can use bounds that depend on invariant variables:

$$\forall \omega \quad m \le m(\upsilon) \le \ell(\omega) \le M(\upsilon) \le M.$$

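The key observation is the equality

$$\mathbb{E}[w^*(\omega) \mid \upsilon] = \int_{\omega \setminus \upsilon} w^*(\omega) \, P(\omega \setminus \upsilon \mid \upsilon) = \int_{\omega \setminus \upsilon} \frac{P^*(\omega \setminus \upsilon \mid \upsilon) \, P(\upsilon)}{P(\omega \setminus \upsilon \mid \upsilon) \, P(\upsilon)} \, P(\omega \setminus \upsilon \mid \upsilon) = 1.$$

We can then write

$$\begin{aligned} Y^* - \bar{Y}^* &= \int_\omega \left[ w^*(\omega) - \bar{w}^*(\omega) \right] \ell(\omega) \, P(\omega) \le \int_\upsilon \mathbb{E}[w^*(\omega) - \bar{w}^*(\omega) \mid \upsilon] \, M(\upsilon) \, P(\upsilon) \\ &= \int_\upsilon \left( 1 - \mathbb{E}[\bar{w}^*(\omega) \mid \upsilon] \right) M(\upsilon) \, P(\upsilon) = \int_\omega \left( 1 - \bar{w}^*(\omega) \right) M(\upsilon) \, P(\omega) = B_{hi}. \end{aligned}$$

Using a similar derivation for the lower bound $B_{lo}$, we obtain the inequality

$$B_{lo} \le Y^* - \bar{Y}^* \le B_{hi}.$$

With the notations

$$\hat{B}_{lo} = \frac{1}{n} \sum_{i=1}^{n} (1 - \bar{w}^*(\omega_i)) \, m(\upsilon_i), \qquad \hat{B}_{hi} = \frac{1}{n} \sum_{i=1}^{n} (1 - \bar{w}^*(\omega_i)) \, M(\upsilon_i),$$

$$\hat{V}_{lo} = \frac{1}{n-1} \sum_{i=1}^{n} \left[ (1 - \bar{w}^*(\omega_i)) \, m(\upsilon_i) - \hat{B}_{lo} \right]^2, \qquad \hat{V}_{hi} = \frac{1}{n-1} \sum_{i=1}^{n} \left[ (1 - \bar{w}^*(\omega_i)) \, M(\upsilon_i) - \hat{B}_{hi} \right]^2,$$

$$\xi_{lo} = \sqrt{\frac{2 \hat{V}_{lo} \log(2/\delta)}{n}} + |m| R \, \frac{7 \log(2/\delta)}{3(n-1)}, \qquad \xi_{hi} = \sqrt{\frac{2 \hat{V}_{hi} \log(2/\delta)}{n}} + |M| R \, \frac{7 \log(2/\delta)}{3(n-1)},$$

two applications of theorem 1 give the inner confidence interval:

$$P\left\{ \bar{Y}^* + \hat{B}_{lo} - \xi_{lo} \le Y^* \le \bar{Y}^* + \hat{B}_{hi} + \xi_{hi} \right\} \ge 1 - 2\delta.$$

C.2 Confidence Intervals for Counterfactual Differences

We now describe how to leverage invariant predictors in order to construct tighter confidence intervals for the difference of two counterfactual expectations.

$$Y^+ - Y^* \approx \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i) \quad\text{with}\quad \Delta w(\omega) = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)}.$$

Let us define the reweighting ratios $w^+(\omega) = P^+(\omega)/P(\omega)$ and $w^*(\omega) = P^*(\omega)/P(\omega)$, their clipped variants $\bar{w}^+(\omega)$ and $\bar{w}^*(\omega)$, and the clipped centered expectations

$$\bar{Y}_c^+ = \int_\omega [\ell(\omega) - \zeta(\upsilon)] \, \bar{w}^+(\omega) \, P(\omega) \quad\text{and}\quad \bar{Y}_c^* = \int_\omega [\ell(\omega) - \zeta(\upsilon)] \, \bar{w}^*(\omega) \, P(\omega).$$

The outer confidence interval is obtained by applying the techniques of Section B.1 to

$$\bar{Y}_c^+ - \bar{Y}_c^* = \int_\omega [\ell(\omega) - \zeta(\upsilon)] \, [\bar{w}^+(\omega) - \bar{w}^*(\omega)] \, P(\omega).$$

Since the weights $\bar{w}^+ - \bar{w}^*$ can be positive or negative, adding or removing a constant to $\ell(\omega)$ can considerably change the variance of the outer confidence interval. This means that one should always use a predictor. Even a constant predictor can vastly improve the outer confidence interval of the difference.

The inner confidence interval is then obtained by writing the difference

$$\left( Y^+ - Y^* \right) - \left( \bar{Y}_c^+ - \bar{Y}_c^* \right) = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^+(\omega) - \bar{w}^+(\omega) \right] P(\omega) - \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^*(\omega) - \bar{w}^*(\omega) \right] P(\omega)$$

and bounding both terms by leveraging $\upsilon$-dependent bounds on the integrand:

$$\forall \omega \quad -M \le -\zeta(\upsilon) \le \ell(\omega) - \zeta(\upsilon) \le M - \zeta(\upsilon) \le M.$$

This can be achieved as shown in Section C.1.

Appendix D. Counterfactual Derivatives

We now consider interventions that depend on a continuous parameter $\theta$. For instance, we might want to know what the performance of the ad placement engine would have been if we had used a parametrized scoring model. Let $P^\theta(\omega)$ represent the counterfactual Markov factorization associated with this intervention. Let $Y^\theta$ be the counterfactual expectation of $\ell(\omega)$ under distribution $P^\theta$. Computing the derivative of (28) immediately gives

$$\frac{\partial Y^\theta}{\partial \theta} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w'_\theta(\omega) \, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] w'_\theta(\omega_i)$$

$$\text{with}\quad w_\theta(\omega) = \frac{P^\theta(\omega)}{P(\omega)} \quad\text{and}\quad w'_\theta(\omega) = \frac{\partial w_\theta(\omega)}{\partial \theta} = w_\theta(\omega) \, \frac{\partial \log P^\theta(\omega)}{\partial \theta}. \tag{29}$$

Replacing the expressions $P(\omega)$ and $P^\theta(\omega)$ by the corresponding Markov factorizations gives many opportunities to simplify the reweighting ratio $w'_\theta(\omega)$. The term $w_\theta(\omega)$ simplifies as shown in (6). The derivative of $\log P^\theta(\omega)$ depends only on the factors parametrized by $\theta$. Therefore, in order to evaluate $w'_\theta(\omega)$, we only need to know the few factors affected by the intervention.

Higher order derivatives can be estimated using the same approach. For instance,

$$\frac{\partial^2 Y^\theta}{\partial \theta_i \, \partial \theta_j} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w''_{ij}(\omega) \, P(\omega) \approx \frac{1}{n} \sum_{k=1}^{n} \left[ \ell(\omega_k) - \zeta(\upsilon_k) \right] w''_{ij}(\omega_k)$$

$$\text{with}\quad w''_{ij}(\omega) = \frac{\partial^2 w_\theta(\omega)}{\partial \theta_i \, \partial \theta_j} = w_\theta(\omega) \, \frac{\partial \log P^\theta(\omega)}{\partial \theta_i} \, \frac{\partial \log P^\theta(\omega)}{\partial \theta_j} + w_\theta(\omega) \, \frac{\partial^2 \log P^\theta(\omega)}{\partial \theta_i \, \partial \theta_j}.$$

The second term in $w''_{ij}(\omega)$ vanishes when $\theta_i$ and $\theta_j$ parametrize distinct factors in $P^\theta(\omega)$.

D.1 Infinitesimal Interventions and Policy Gradient

Expression (29) becomes particularly attractive when $P(\omega) = P^\theta(\omega)$, that is, when one seeks derivatives that describe the effect of an infinitesimal intervention on the system from which the data was collected.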
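The special case $P(\omega) = P^\theta(\omega)$ can be illustrated with a toy Python sketch. Everything specific here is an assumption made for illustration only: the data is drawn from a unit-variance Gaussian with mean $\theta$, so that $\partial \log P^\theta(\omega) / \partial \theta = \omega - \theta$; the function name `on_policy_derivative` and the constant `baseline` (playing the role of a constant predictor $\zeta$) are hypothetical.

```python
import random
from typing import Callable


def on_policy_derivative(
    loss: Callable[[float], float],
    theta: float,
    n: int,
    baseline: float = 0.0,
    seed: int = 0,
) -> float:
    """Estimate dY/dtheta for Y = E[loss(omega)], omega ~ N(theta, 1).

    Samples are drawn from the distribution being differentiated, so the
    importance ratio equals one and the weight reduces to the derivative
    of the log-density, here (omega - theta).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        omega = rng.gauss(theta, 1.0)
        score = omega - theta  # d/dtheta of log N(omega; theta, 1)
        total += (loss(omega) - baseline) * score
    return total / n
```

With `loss = lambda w: w * w` the true expectation is $\theta^2 + 1$, so the estimate should approach $2\theta$ as `n` grows; a constant baseline leaves the expectation of the estimate unchanged because the weights have zero mean, and only affects its variance.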
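The resulting expression is then identical to the celebrated policy gradient (Aleksandrov et al., 1968; Glynn, 1987; Williams, 1992) which expresses how the accumulated rewards in a reinforcement learning problem are affected by small changes of the parameters of the policy function.

$$\frac{\partial Y^\theta}{\partial \theta} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w'_\theta(\omega) \, P^\theta(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] w'_\theta(\omega_i)$$

$$\text{where the } \omega_i \text{ are sampled i.i.d. from } P^\theta \text{ and } w'_\theta(\omega) = \frac{\partial \log P^\theta(\omega)}{\partial \theta}.$$

Sampling from $P^\theta(\omega)$ eliminates the potentially large ratio $w_\theta(\omega)$ that usually plagues importance sampling approaches. Choosing a parametrized distribution that depends smoothly on $\theta$ is then sufficient to contain the size of the weights $w'_\theta(\omega)$. Since the weights can be positive or negative, centering the integrand with a prediction function $\zeta(\upsilon)$ remains very important. Even a constant predictor $\zeta$ can substantially reduce the variance

$$\begin{aligned} \operatorname{var}[(\ell(\omega) - \zeta) \, w'_\theta(\omega)] &= \operatorname{var}[\ell(\omega) w'_\theta(\omega) - \zeta w'_\theta(\omega)] \\ &= \operatorname{var}[\ell(\omega) w'_\theta(\omega)] - 2 \zeta \operatorname{cov}[\ell(\omega) w'_\theta(\omega), \, w'_\theta(\omega)] + \zeta^2 \operatorname{var}[w'_\theta(\omega)] \end{aligned}$$

whose minimum is reached for

$$\zeta = \frac{\operatorname{cov}[\ell(\omega) w'_\theta(\omega), \, w'_\theta(\omega)]}{\operatorname{var}[w'_\theta(\omega)]} = \frac{\mathbb{E}[\ell(\omega) \, w'_\theta(\omega)^2]}{\mathbb{E}[w'_\theta(\omega)^2]},$$

where the last equality uses the fact that $\mathbb{E}[w'_\theta(\omega)] = 0$ when the samples are drawn from $P^\theta$.

We sometimes want to evaluate expectations under a counterfactual distribution that is too far from the actual distribution to obtain reasonable confidence intervals. Suppose, for instance, that we are unable to reliably estimate which click yield would have been observed if we had used a certain parameter $\theta^*$ for the scoring models. We still can estimate how quickly and in which direction the click yield would have changed if we had slightly moved the current scoring model parameters $\theta$ in the direction of the target $\theta^*$. Although such an answer is not as good as a reliable estimate of $Y^{\theta^*}$, it is certainly better than no answer.

D.2 Off-Policy Gradient

We assume in this subsection that the parametrized probability distribution $P^\theta(\omega)$ is regular enough to ensure that all the derivatives of interest are defined and that the event $\{ w_\theta(\omega) = R \}$ has probability zero. Furthermore, in order to simplify the exposition, the following derivation does not leverage an invariant predictor function.

Estimating derivatives using data sampled from a distribution $P(\omega)$ different from $P^\theta(\omega)$ is more challenging because the ratios $w_\theta(\omega_i)$ in Equation (29) can take very large values.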
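However it is comparatively easy to estimate the derivatives of lower and upper bounds using a slightly different way to clip the weights. Using notation $\mathbb{1}(x)$ to represent the indicator function, equal to one if condition $x$ is true and zero otherwise, let us define respectively the clipped weights $\bar{w}^Z_\theta$ and the capped weights $\bar{w}^M_\theta$:

$$\bar{w}^Z_\theta(\omega) = w_\theta(\omega) \, \mathbb{1}\{ P^\theta(\omega) < R \, P(\omega) \} \quad\text{and}\quad \bar{w}^M_\theta(\omega) = \min\{ w_\theta(\omega), \, R \}.$$

Although Section 4.4 illustrates the use of clipped weights, the confidence interval derivation can be easily extended to the capped weights. Defining the capped quantities

$$\bar{Y}^\theta = \int_\omega \ell(\omega) \, \bar{w}^M_\theta(\omega) \, P(\omega) \quad\text{and}\quad \bar{W}^\theta = \int_\omega \bar{w}^M_\theta(\omega) \, P(\omega)$$

and writing

$$\begin{aligned} 0 \le Y^\theta - \bar{Y}^\theta &= \int_{\omega \in \Omega \setminus \Omega_R} \ell(\omega) \left( P^\theta(\omega) - R \, P(\omega) \right) \\ &\le M \left( 1 - P^\theta(\Omega_R) - R \, P(\Omega \setminus \Omega_R) \right) = M \left( 1 - \int_\omega \bar{w}^M_\theta(\omega) \, P(\omega) \right) \end{aligned}$$

yields the inequality

$$\bar{Y}^\theta \le Y^\theta \le \bar{Y}^\theta + M (1 - \bar{W}^\theta). \tag{30}$$

In order to obtain reliable estimates of the derivatives of these upper and lower bounds, it is of course sufficient to obtain reliable estimates of the derivatives of $\bar{Y}^\theta$ and $\bar{W}^\theta$. By separately considering the cases $w_\theta(\omega) < R$ and $w_\theta(\omega) > R$, we easily obtain the relation

$$\bar{w}^{M\prime}_\theta(\omega) = \frac{\partial \bar{w}^M_\theta(\omega)}{\partial \theta} = \bar{w}^Z_\theta(\omega) \, \frac{\partial \log P^\theta(\omega)}{\partial \theta} \quad\text{when } w_\theta(\omega) \ne R$$

and, thanks to the regularity assumptions, we can write

$$\frac{\partial \bar{Y}^\theta}{\partial \theta} = \int_\omega \ell(\omega) \, \bar{w}^{M\prime}_\theta(\omega) \, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i) \, \bar{w}^{M\prime}_\theta(\omega_i),$$

$$\frac{\partial \bar{W}^\theta}{\partial \theta} = \int_\omega \bar{w}^{M\prime}_\theta(\omega) \, P(\omega) \approx \frac{1}{n} \sum_{i=1}^{n} \bar{w}^{M\prime}_\theta(\omega_i).$$

Estimating these derivatives is considerably easier than using approximation (29) because they involve the bounded quantity $\bar{w}^Z_\theta(\omega)$ instead of the potentially large ratio $w_\theta(\omega)$. It is still necessary to choose a sufficiently smooth sampling distribution $P(\omega)$ to limit the magnitude of $\partial \log P^\theta / \partial \theta$.

Such derivatives are very useful to drive optimization algorithms. Assume for instance that we want to find the parameter $\theta$ that maximizes the counterfactual expectation $Y^\theta$ as illustrated in Section 6.3. An estimate obtained using approximation (5) could reach its maximum for a value of $\theta$ that is poorly explored by the actual distribution. Maximizing an estimate of the lower bound (30) ensures that the optimization algorithm finds a trustworthy answer.

Appendix E. Uniform Empirical Bernstein Bounds

This appendix reviews the uniform empirical Bernstein bound given by Maurer and Pontil (2009) and describes how it can be used to construct the uniform confidence interval (21). The first step consists of characterizing the size of a family $\mathcal{F}$ of functions mapping a space $\mathcal{X}$ into the interval $[a, b] \subset \mathbb{R}$. Given $n$ points $\mathbf{x} = (x_1 \dots x_n) \in \mathcal{X}^n$, the trace $\mathcal{F}(\mathbf{x}) \subset \mathbb{R}^n$ is the set of vectors $\left( f(x_1), \dots, f(x_n) \right)$ for all functions $f \in \mathcal{F}$.

Definition 2 (Covering numbers, etc.) Given $\varepsilon > 0$, the covering number $\mathcal{N}(\mathbf{x}, \varepsilon, \mathcal{F})$ is the smallest possible cardinality of a subset $C \subset \mathcal{F}(\mathbf{x})$ satisfying the condition

$$\forall v \in \mathcal{F}(\mathbf{x}) \;\; \exists c \in C \quad \max_{i = 1 \dots n} |v_i - c_i| \le \varepsilon,$$

and the growth function $\mathcal{N}(n, \varepsilon, \mathcal{F})$ is

$$\mathcal{N}(n, \varepsilon, \mathcal{F}) = \sup_{\mathbf{x} \in \mathcal{X}^n} \mathcal{N}(\mathbf{x}, \varepsilon, \mathcal{F}).$$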

Thanks to a famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971; Sauer, 1972), for many usual parametric families $\mathcal{F}$, the growth function $\mathcal{N}(n, \varepsilon, \mathcal{F})$ increases at most polynomially¹³ with both $n$ and $1/\varepsilon$.

Theorem 3 (Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6) Let $\delta \in (0, 1)$, $n \ge 16$. Let $X, X_1, \dots, X_n$ be i.i.d. random variables with values in $\mathcal{X}$. Let $\mathcal{F}$ be a set of functions mapping $\mathcal{X}$ into $[a, b] \subset \mathbb{R}$ and let $\mathcal{M}(n) = 10 \, \mathcal{N}(2n, 1/n, \mathcal{F})$. Then with probability at least $1 - \delta$,

$$\forall f \in \mathcal{F}, \quad \mathbb{E}[f(X)] - M_n \le \sqrt{\frac{18 \, V_n \log(\mathcal{M}(n)/\delta)}{n}} + (b - a) \, \frac{15 \log(\mathcal{M}(n)/\delta)}{n - 1},$$

where $M_n$ and $V_n$ respectively are the sample mean and variance

$$M_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \qquad V_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( f(X_i) - M_n \right)^2.$$

The statement of this theorem emphasizes its similarity with the non-uniform empirical Bernstein bound (theorem 1). Although the constants are less attractive, the uniform bound still converges to zero when $n$ increases, provided of course that $\mathcal{M}(n) = 10 \, \mathcal{N}(2n, 1/n, \mathcal{F})$ grows polynomially with $n$.

Let us then define the family of functions

$$\mathcal{F} = \left\{ f_\theta : \omega \mapsto \ell(\omega) \, \bar{w}^M_\theta(\omega), \;\; g_\theta : \omega \mapsto \bar{w}^M_\theta(\omega), \;\; \forall \theta \in F \right\},$$

and use the uniform empirical Bernstein bound to derive an outer inequality similar to (26) and an inner inequality similar to (27). The theorem implies that, with probability $1 - \delta$, both inequalities are simultaneously true for all values of the parameter $\theta$. The uniform confidence interval (21) then follows directly.

13. For a simple proof of this fact, slice $[a, b]$ into intervals $S_k$ of maximal width $\varepsilon$ and apply the lemma to the family of indicator functions $(x_i, S_k) \mapsto \mathbb{1}\{ f(x_i) \in S_k \}$.

References

V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic optimization. Engineering Cybernetics, 5:11-16, 1968.
Susan Athey and Denis Nekipelov. A structural model of sponsored search advertising. Working paper, 2010. URL Search.pdf.
Jean-Yves Audibert, Remi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proc. 18th International Conference on Algorithmic Learning Theory (ALT 2007), 2007.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3).

Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61.
Dirk Bergemann and Maher Said. Dynamic auctions: a survey. Discussion Paper 1757R, Cowles Foundation for Research in Economics, Yale University, 2010.
Léon Bottou. From machine learning to machine reasoning. v3, Feb.
Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3).
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24. NIPS Foundation.
C. R. Charig, D. R. Webb, S. R. Payne, and J. E. A. Wickham. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal (Clin Res Ed), 292(6254).
Denis X. Charles and D. Max Chickering. Optimization for paid search auctions. Manuscript in preparation.
Denis X. Charles, D. Max Chickering, and Patrice Simard. Micro-market experimentation for paid search. Manuscript in preparation.
Harald Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.
Denver Dash. Caveats for Causal Reasoning with Equilibrium Models. PhD thesis, University of Pittsburgh.
Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Sample-efficient nonstationary-policy evaluation for contextual bandits. In Proceedings of Uncertainty in Artificial Intelligence (UAI).
Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review, 97(1), 2007.
Alan Genz. Numerical computation of multivariate normal probabilities. Journal Computation of Multivariate Normal Probabilities, 1.
John C. Gittins. Bandit Processes and Dynamic Allocation Indices. Wiley.
Peter W. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 1987 Winter Simulation Conference, 1987.
John Goodwin. Microsoft adcenter. Personal communication.
Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track. Omnipress, 2010.

Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1), July.
Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems, October.
Sébastien Lahaie and R. Preston McAfee. Efficient ranking in sponsored search. In Proc. 7th International Workshop on Internet and Network Economics (WINE 2011). LNCS 7090, Springer, 2011.
Lev Landau and Evgeny Lifshitz. Course in Theoretical Physics, Volume 1: Mechanics. Pergamon Press. 2nd edition.
John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA.
Steffen L. Lauritzen and Thomas S. Richardson. Chain graph models and their causal interpretation. Journal of the Royal Statistical Society, Series B, 64.
David K. Lewis. Counterfactuals. Harvard University Press. 2nd edition: Wiley-Blackwell.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010). ACM.
Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM International Conference on Web Search and Data Mining (WSDM 2011).
Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proc. The 22nd Conference on Learning Theory (COLT 2009), 2009.
Paul Milgrom. Putting Auction Theory to Work. Cambridge University Press.
Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58-73.
Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000. 2nd edition.
Judea Pearl. Causal inference in statistics: an overview. Statistics Surveys, 3:96-146.
Judea Pearl. The do-calculus revisited. In Proc. Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI-2012), pages 3-11.
Linda E. Reichl. A Modern Course in Statistical Physics, 2nd Edition. Wiley.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5): ,

James M. Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5): , Sep

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13: ,

Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12): ,

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log likelihood function. Journal of Statistical Planning and Inference, 90(2): ,

Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Ser. B, 13: ,

Aleksandrs Slivkins. Contextual bandits with similarity information. JMLR Conference and Workshop Proceedings, 19: ,

Peter Spirtes and Richard Scheines. Causal inference of ambiguous manipulations. Philosophy of Science, 71(5): , Dec

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction and Search. Springer Verlag, New York, 2nd edition: MIT Press, Cambridge (Mass.),

Stephen M. Stigler. A historical view of statistical concepts in psychology and educational research. American Journal of Education, 101(1):60–70, Nov

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8: ,

Rich S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA,

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings 16th Conference on Knowledge Discovery and Data Mining (KDD 2010), pages 17–26,

Vladimir N. Vapnik. Estimation of Dependences based on Empirical Data. Springer Series in Statistics. Springer Verlag, Berlin, New York,

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. Uniform convergence of the frequencies of occurrence of events to their probabilities. Proc. Academy of Sciences of the USSR, 181(4), English translation: Soviet Mathematics - Doklady, 9: ,

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2): ,

Hal R. Varian. Position auctions. International Journal of Industrial Organization, 25: ,

Hal R. Varian. Online ad auctions. American Economic Review, 99(2): ,

Joannes Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evaluation. In Proc. European Conference on Machine Learning, pages ,

Georg H. von Wright. Explanation and Understanding. Cornell University Press,

Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2): ,

Norbert Wiener. Cybernetics, or Control and Communication in the Animal and the Machine. Hermann et Cie (Paris), MIT Press (Cambridge, Mass.), Wiley and Sons (New York), 2nd Edition (expanded): MIT Press, Wiley and Sons,

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8( ),

James Woodward. Making Things Happen. Oxford University Press,

Sewall S. Wright. Correlation and causation. Journal of Agricultural Research, 20: ,