Stochastic Bandits with Side Observations on Networks

Swapna Buccapatnam, Atilla Eryilmaz
Department of ECE, The Ohio State University, Columbus, OH
buccapat@ece.osu.edu, eryilmaz@osu.edu

Ness B. Shroff
Departments of ECE and CSE, The Ohio State University, Columbus, OH
shroff@osu.edu

ABSTRACT
We study the stochastic multi-armed bandit (MAB) problem in the presence of side-observations across actions. In our model, choosing an action provides additional side observations for a subset of the remaining actions. One example of this model occurs in the problem of targeting users in online social networks, where users respond to their friends' activity, thus providing information about each other's preferences. Our contributions are as follows. We derive an asymptotic (with respect to time) lower bound (as a function of the network structure) on the regret (loss) of any uniformly good policy that achieves the maximum long-term average reward. We propose two policies, a randomized policy and a policy based on the well-known upper confidence bound (UCB) policies, both of which explore each action at a rate that is a function of its network position. We show that these policies achieve the asymptotic lower bound on the regret up to a multiplicative factor independent of network structure. The upper bound guarantees on the regret of these policies are better than those of existing policies. Finally, we use numerical examples on a real-world social network to demonstrate the significant benefits obtained by our policies against other existing policies.

Categories and Subject Descriptors
H.1.m [Models and Principles]: Miscellaneous; I.2.m [Artificial Intelligence]: Miscellaneous

Keywords
Multi-armed bandits; Side observations; Social networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SIGMETRICS'14, June 16-20, 2014, Austin, Texas, USA. Copyright 2014 ACM.

1. INTRODUCTION
Multi-armed bandit (MAB) problems have received renewed interest over the past decade because of the emergence of content recommendation, online advertising, and social networks. In the classical MAB setting, at each time, a policy must choose an action from a set of $K$ actions with unknown probability distributions. Choosing an action $i$ at time $t$ gives a random reward $X_i(t)$ drawn from the distribution of action $i$. The regret of any policy is defined as the difference between the total reward obtained from the action with the highest average reward and the given policy's total reward. The goal is to find policies that minimize the expected regret over a given time horizon.

In our work, we consider an important MAB setting, similar to that of [ ] and [6], where choosing an action $i$ not only generates a reward from action $i$, but also reveals observations for a subset of the remaining actions. An example of such a scenario is as follows: a decision maker must choose one user at each time in an online social network (Facebook, etc.) to offer a promotion [6]. Each time the decision maker offers a promotion to a user, he also has an opportunity to survey the user's neighbors in the network regarding their potential interest in a similar offer (see Figure 1).¹ Users are found to be more responsive to such surveys using social network information compared to generic surveys [3], and this effect can be leveraged to construct side-observations. In this example, choosing an action in the multi-armed bandit problem corresponds to choosing a user in the network, and side-observations across actions are captured by the links in the social network. Another example is when the actions in the MAB problem are advertisements: the decision maker constructs a graph of different vacation places (Hawaii, Caribbean, Paris, etc.), where links capture similarities between different places. When a customer shows interest in one
of the places, he is also asked to provide his opinion about the neighboring places in the graph.

For the setting of side-observations, the authors in [ ] consider adversarial bandits, while [6] considers stochastic bandits as in our current work. In [6], the authors propose modified upper-confidence bound (UCB) based policies and show that the regret of these policies is at most $O(\chi(G)\log(t))$, where $\chi(G)$ is the clique partition number (see Definition 1) of the side-observation network $G(\mathcal{K})$. However, it is possible to achieve a lower regret. For example, in the star network with one central action linked to all other actions, exploring the central action yields sufficient exploration for the rest of the network. In this case, the optimal regret is at most $O(\log(t))$, while $\chi(G)$ scales as $O(K)$ for the star network.

¹ This is possible when the online network has an additional survey feature that generates side observations. For example, when user $i$ is offered a promotion, her neighbors may be queried as follows: "User $i$ was recently offered a promotion. Would you also be interested in the offer?"
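The star-network example can be checked by brute force on a small instance. The sketch below is only an illustration: the helper names `star_graph`, `domination_number` and `clique_partition_number` are hypothetical and not from the paper, and the exhaustive searches are practical only for a handful of nodes. It compares the smallest number of actions whose neighborhoods cover the whole network (the domination number introduced later in the paper) with the clique partition number.

```python
from itertools import combinations


def star_graph(k):
    """Neighbor sets (each node includes itself) of a star; node 0 is the centre."""
    nbrs = {0: set(range(k))}
    for leaf in range(1, k):
        nbrs[leaf] = {leaf, 0}
    return nbrs


def domination_number(nbrs):
    """Smallest number of nodes whose neighborhoods cover every node (brute force)."""
    nodes = list(nbrs)
    for size in range(1, len(nodes) + 1):
        for cand in combinations(nodes, size):
            covered = set().union(*(nbrs[v] for v in cand))
            if len(covered) == len(nodes):
                return size


def clique_partition_number(nbrs):
    """Smallest number of cliques that partition the nodes (backtracking search)."""
    nodes = list(nbrs)

    def can_partition(idx, cliques, limit):
        if idx == len(nodes):
            return True
        v = nodes[idx]
        for clique in cliques:  # try to add v to an existing clique
            if all(u in nbrs[v] for u in clique):
                clique.append(v)
                if can_partition(idx + 1, cliques, limit):
                    return True
                clique.pop()
        if len(cliques) < limit:  # or open a new clique for v
            cliques.append([v])
            if can_partition(idx + 1, cliques, limit):
                return True
            cliques.pop()
        return False

    for limit in range(1, len(nodes) + 1):
        if can_partition(0, [], limit):
            return limit


star = star_graph(6)
print(domination_number(star), clique_partition_number(star))  # prints: 1 5
```

For a star with six nodes the centre alone covers the network, whereas any clique partition needs five cliques (one edge plus four single leaves), which is the gap the example points to.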

Motivated by this observation, in our work, we aim to characterize the asymptotic lower bound on the regret for a general stochastic multi-armed bandit problem in the presence of side-observations and investigate policies that achieve this lower bound by taking the network structure into account. Our main contributions are as follows:

• We model the MAB problem in the presence of side-observations and derive an asymptotic (with respect to time) lower bound (as a function of the network structure) on the regret of any uniformly good policy which achieves the maximum long-term average reward. This lower bound is presented in terms of the optimal value of a linear program (LP), namely, $P_1$.
• Motivated by LP $P_1$, we propose and investigate the performance of a randomized policy, which we call the $\epsilon_t$-greedy-LP policy, as well as an upper confidence bound based policy, which we call the UCB-LP policy. Both of these policies explore each action at a rate that is a function of its network position. We show that these policies are optimal in the sense that they achieve the asymptotic lower bound on the regret up to a multiplicative constant independent of the network structure under mild assumptions.
• We also show that the upper bound on the regret of our policies scales as $O(\gamma(G)\log(t))$, where $\gamma(G)$ is the size of the minimum dominating set of the network of actions. We show that the regret performance of our policies can be strictly better than those proposed in [6] for some important network structures.
• Finally, we use numerical results on a social network dataset obtained from Flixster² to empirically compare the performance of our policies against those in [6].

The model considered in our work can be viewed as a first step in the direction of more general models of interdependence across actions. For this model, we show that as the number of actions becomes large, significant benefits can be obtained from policies that explicitly take network structure into account. While the $\epsilon_t$-greedy-LP policy explores actions at a rate proportional to their network position, its exploration is oblivious to the average rewards of the suboptimal actions. On the
other hand, the UCB-LP policy takes into account both the upper confidence bounds on the mean rewards and the network position of different actions at each time.

The rest of the paper is organized as follows. In Section 2, we briefly discuss the related existing literature. We introduce the model in Section 3 and we present our main results in Sections 4, 5, and 6. Finally, we present some numerical results in Section 7, and the proofs of our theoretical results are given in Section 8. We conclude our work in Section 9.

² Flixster is a movie recommendation network with a social graph. Datasets from this network are made publicly available by the authors of [9] at datasets/

2. RELATED WORK
The seminal work of [0] shows that the asymptotic lower bound on the regret of any uniformly good policy scales logarithmically with time with a multiplicative constant which is a function of the distributions of actions. Further, the authors of [0] provide constructive policies called Upper Confidence Bound (UCB) policies, based on the concept of optimism in the face of uncertainty, that asymptotically achieve the lower bound. More recently, the authors in [ ] consider the case of bounded rewards and propose simpler sample-mean based UCB policies and a decreasing-$\epsilon_t$-greedy policy that achieve logarithmic regret uniformly over time, rather than only asymptotically as in the previous works. The UCB index of any action $i$ at time $t$ introduced in [ ] is given below:

$$\bar{x}_i(t) + \sqrt{\frac{2\log(t)}{s_i(t)}}, \qquad (1)$$

where $\bar{x}_i(t)$ is the average reward and $s_i(t)$ is the total number of observations available for action $i$ at time $t$.

The traditional multi-armed bandit policies incur a regret that is linear in the number of suboptimal arms. This makes them unsuitable in settings such as content recommendation, advertising, etc., where the action space is typically very large. To overcome this difficulty, richer models specifying additional information across reward distributions of different actions have been studied, such as dependent bandits [3], $\mathcal{X}$-armed bandits [5], linear bandits [4], contextual side information in bandit problems, etc. More recently, [ ] and [6] propose to handle the large number of actions by assuming that
choosing an action reveals observations for a larger set of actions. The policies proposed in [ ] achieve the best possible regret in the adversarial setting (see [4] for a survey of adversarial MABs) with side-observations, and the regret bounds of these policies are in terms of the independence number of the network.

For the setting of stochastic bandits with side-observations, it can be easily shown that, except in the trivial case where all actions have a neighboring action which is optimal, the regret due to any uniformly good policy is lower bounded by $\Omega(\log(t))$. A formal proof is given in [6]. Further, the authors in [6] propose two modified UCB policies, namely, UCB-N and UCB-MaxN, and show that the regret of these policies is at most $O(\chi(G)\log(t))$, where $\chi(G)$ is the clique partition number (see Definition 1).

In our work, we consider the stochastic MAB problem with side-observations similar to [6] and characterize the asymptotic lower bound on the regret as a function of the network structure. Further, we propose two policies, the $\epsilon_t$-greedy-LP policy and the UCB-LP policy, which achieve the asymptotic regret lower bound up to a multiplicative constant independent of the network structure. Further, the regret of our policies is at most $O(\gamma(G)\log(t))$, where $\gamma(G)$ is the size of the minimum dominating set of the network of actions. Since $\gamma(G) \le \chi(G)$ for any network $G$, the regret obtained by our policies can be better than that of UCB-N and UCB-MaxN proposed in [6].

3. MODEL
In this section, we formally define the $K$-armed bandit problem in the presence of side observations across actions. Let $\mathcal{K} = \{1, \dots, K\}$ denote the set of actions. A decision maker must choose an action $i \in \mathcal{K}$ at each time $t$. Let $X_i(t)$ denote the reward obtained by the decision maker on choosing action $i$ at time $t$. The random variable $X_i(t)$ has an unknown probability distribution $F_i$. Let $\mu_i$ be the mean of the random variable $X_i(t)$. We assume that $\{X_i(t), t \ge 0\}$ are i.i.d. for each $i$ and $\{X_i(t), i \in \mathcal{K}\}$ are independent for each time $t$. We further assume that the distribution $F_i$ has a bounded support in $[0, b]$ for each $i$. We let $b = 1$ for simplicity of exposition in our work.

3.1 Side-observation model: The actions $\mathcal{K}$ form nodes in a network $G(\mathcal{K})$, represented by the adjacency matrix $G = [g(i,j)]_{i,j \in \mathcal{K}}$, where $g(i,j) \in \{0, 1\}$ and $g(i,i) = 1$ for all $i$. Let $\mathcal{K}_i$ be the set of neighbors of action $i$ (including $i$), i.e., $g(j,i) = 1$ for all $j \in \mathcal{K}_i$. While all the results in our work can be easily extended to directed networks, we assume that $G(\mathcal{K})$ is undirected for simplicity of notation. We assume that when the decision maker chooses an action $i$ at time $t$, he receives a reward $X_i(t)$ and also receives observations $X_j(t)$, for all $j \in \mathcal{K}_i$, such that $E[X_j(t)] = \mu_j$.

In general, not all neighboring actions are equally responsive in providing side-observations. This scenario can be modeled by assuming that each action $i$ has a known probability $p_i$ of providing side-observations when any of its neighbors are actually chosen. We let $p_i = 1$ for all $i$ for the sake of clarity. Our results can be easily extended to the setting of imperfect side-observations with $p_i \le 1$. We will discuss this extension in Remark 5 in Section 6.

Figure 1: At time $t$, suppose the decision maker chooses user $i$ to offer a promotion. He then receives a response $X_i(t)$ from user $i$. Using the social interconnections, he also observes responses $X_j(t)$ and $X_k(t)$ of $i$'s neighbors $j$ and $k$.

Figure 1 illustrates the side-observation model for the example of targeting users in an online social network. Such side observations are made possible in settings of online social networks like Facebook by surveying or tracking a user's neighbors' reactions (likes, dislikes, no opinion, etc.) to the user's activity. This is possible when the online social network has a survey or a like/dislike indicator that generates side observations. For example, when user $i$ is offered a promotion, her neighbors may be queried as follows: "User $i$ was recently offered a promotion. Would you also be interested in the offer?"³
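One round of the side-observation model above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' simulator: rewards are taken to be Bernoulli (which satisfies the bounded-support assumption with $b = 1$), side observations always arrive ($p_i = 1$), and the function name `play` and the three-user line network are hypothetical.

```python
import random


def play(action, neighbors, means, rng):
    """Play `action` once and return (reward, observations).

    observations[j] is a draw X_j(t) for every j in the neighborhood of
    `action` (which includes `action` itself); only observations[action]
    counts as reward. Bernoulli rewards are an assumption of this sketch.
    """
    observations = {
        j: 1.0 if rng.random() < means[j] else 0.0
        for j in sorted(neighbors[action])
    }
    return observations[action], observations


# Hypothetical line network 0 - 1 - 2; every neighbor set includes the node itself.
neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
means = {0: 0.3, 1: 0.9, 2: 0.5}
rng = random.Random(0)

reward, obs = play(1, neighbors, means, rng)
# Choosing user 1 yields a reward from user 1 and side observations for users 0 and 2.
```

Keeping the reward separate from the observation dictionary mirrors the model: side observations inform the estimates of neighboring actions but do not add to the total reward.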
3.2 Objective: An allocation strategy or policy $\phi$ chooses the action to be played at each time. Formally, $\phi$ is a sequence of random variables $\{\phi(t), t \ge 0\}$, where $\phi(t) \in \mathcal{K}$ is the action chosen by policy $\phi$ at time $t$. Let $X^{\phi}(t)$ be the reward and side-observations obtained by the policy $\phi$ at time $t$. Then, the event $\{\phi(t) = i\}$ belongs to the $\sigma$-field generated by $\{\phi(m), X^{\phi}(m), m \le t-1\}$.

³ Since the neighbors do not have any information on whether the user accepted the promotion, they act independently according to their own preferences in answering this survey. The network itself provides a better way for surveying and obtaining side observations.

Let $T_i^{\phi}(t)$ be the total number of times action $i$ is chosen up to time $t$ by policy $\phi$. For each action, rewards are only obtained when the action is chosen by the policy (side-observations do not contribute to the total reward). Then, the regret of policy $\phi$ at time $t$ for a fixed $\mu = (\mu_1, \dots, \mu_K)$ is defined by

$$R_\mu^{\phi}(t) = \mu^* t - \sum_{i \in \mathcal{K}} \mu_i E[T_i^{\phi}(t)] = \sum_{i \in \mathcal{K}} \Delta_i E[T_i^{\phi}(t)],$$

where $\Delta_i = \mu^* - \mu_i$ and $\mu^* = \max_{i \in \mathcal{K}} \mu_i$. Henceforth, we drop the superscript $\phi$ unless it is required. The objective is to find policies that minimize the rate at which the regret grows as a function of time for every fixed network $G(\mathcal{K})$. We focus our investigation on the class of uniformly good policies [0] defined below:

Uniformly good policies: An allocation rule $\phi$ is said to be uniformly good if for every fixed $\mu$, the following condition is satisfied as $t \to \infty$:

$$R_\mu(t) = o(t^b), \text{ for every } b > 0.$$

The above condition implies that uniformly good policies achieve the optimal long-term average reward of $\mu^*$. Next, we define two structures that will be useful later to bound the performance of allocation strategies in terms of the network structure $G(\mathcal{K})$.

Definition 1. A clique covering $\mathcal{C}$ of a network $G(\mathcal{K})$ is a partition of all its nodes into sets $C \in \mathcal{C}$ such that the sub-network formed by each $C$ is a clique. Let $\chi(G)$ be the smallest number of cliques into which the nodes of the network $G(\mathcal{K})$ can be partitioned, also called the clique partition number.

Definition 2. A dominating set $D$ of a network $G(\mathcal{K})$ is such that every node in the network is either in $D$ or has at least one neighbor in $D$. Let $\gamma(G)$ denote the size of the
minimum dominating set of network $G(\mathcal{K})$, also called the domination number.

Note that $\gamma(G) \le \chi(G)$ for any network $G(\mathcal{K})$, since picking one node from each clique of a minimum clique partition yields a dominating set. In the next section, we obtain an asymptotic lower bound on the regret of uniformly good policies for the setting of MABs with side-observations. This lower bound is expressed as the optimal value of a linear program (LP), where the constraints of the LP capture the connectivity of each action in the network.

4. REGRET LOWER BOUND
In order to derive a lower bound on the regret, we need some mild regularity assumptions (Assumptions 1, 2, and 3) on the distributions $F_i$ that are similar to the ones in [0]. Let the probability distribution $F_i$ have a univariate density function $f(x; \theta_i)$ with unknown parameters $\theta_i$. Let $D(\theta \| \sigma)$ denote the Kullback-Leibler (KL) distance between distributions with density functions $f(x; \theta)$ and $f(x; \sigma)$ and with means $\mu(\theta)$ and $\mu(\sigma)$, respectively.

Assumption 1 (Finiteness). We assume that $f(\cdot\,; \cdot)$ is such that $0 < D(\theta \| \sigma) < \infty$ whenever $\mu(\sigma) > \mu(\theta)$.

Assumption 2 (Continuity). For any $\epsilon > 0$ and $\theta, \sigma$ such that $\mu(\sigma) > \mu(\theta)$, there exists $\eta > 0$ for which $|D(\theta \| \sigma) - D(\theta \| \rho)| < \epsilon$ whenever $\mu(\sigma) < \mu(\rho) < \mu(\sigma) + \eta$.

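Assumption 3 (Denseness). For each $i \in \mathcal{K}$, $\theta_i \in \Theta$, where the set $\Theta$ satisfies: for all $\theta \in \Theta$ and for all $\eta > 0$, there exists $\theta' \in \Theta$ such that $\mu(\theta) < \mu(\theta') < \mu(\theta) + \eta$.

The following proposition is obtained using a theorem in [0]. It provides an asymptotic lower bound on the regret of any uniformly good policy under the model described in Section 3:

Proposition 1. Suppose Assumptions 1, 2, and 3 hold. Let $U = \{i : \mu_i < \mu^*\}$ be the set of suboptimal actions. Also, let $\Delta_i = \mu^* - \mu_i$. Recall that $\mathcal{K}_i$ is the set of neighbors of $i$, including $i$, in the network $G(\mathcal{K})$. Then, under any uniformly good policy $\phi$, the expected regret is asymptotically bounded below as follows:

$$\liminf_{t \to \infty} \frac{R_\mu(t)}{\log(t)} \ge c_\mu, \qquad (2)$$

where $c_\mu$ is the optimal value of the following linear program (LP) $P_1$:

$$P_1: \ \min \sum_{i \in U} \Delta_i w_i,$$
$$\text{subject to: } \sum_{j \in \mathcal{K}_i} w_j \ge \frac{1}{D(\theta_i \| \theta^*)}, \ \forall i \in U, \qquad w_i \ge 0, \ \forall i \in \mathcal{K}.$$

Proof (Sketch). Let $S_i(t)$ be the total number of observations corresponding to action $i$ available at time $t$. Then, by modifying the proof of a theorem of [0], we have that, for $i \in U$,

$$\liminf_{t \to \infty} \frac{E[S_i(t)]}{\log(t)} \ge \frac{1}{D(\theta_i \| \theta^*)}.$$

An observation is received for action $i$ whenever any action in $\mathcal{K}_i$ is chosen. Hence, $S_i(t) = \sum_{j \in \mathcal{K}_i} T_j(t)$. These two facts give us the constraints in LP $P_1$. See Section 8 for the full proof.

The linear program given in $P_1$ contains the graphical information that governs the lower bound. However, it requires the knowledge of $\theta_i$ and $\theta^*$, which are unknown. This motivates the construction of the following linear program, LP $P_2$, which preserves the graphical structure while eliminating the distributional dependence on $\theta_i$ and $\theta^*$:

$$P_2: \ \min \sum_{i \in \mathcal{K}} z_i$$
$$\text{subject to: } \sum_{j \in \mathcal{K}_i} z_j \ge 1, \ \forall i \in \mathcal{K}, \quad \text{and } z_i \ge 0, \ \forall i \in \mathcal{K}.$$

Let $z^* = (z_i^*)_{i \in \mathcal{K}}$ be the optimal solution of LP $P_2$. In Sections 5 and 6, we use the above LP $P_2$ to modify the $\epsilon$-greedy policy in [ ] and the UCB policy in [ ] for the setting of side-observations. We provide regret guarantees of these modified policies in terms of the optimal value $\sum_{i \in \mathcal{K}} z_i^*$ of LP $P_2$. We note that the linear program $P_2$ is, in fact, the LP relaxation of the minimum dominating set problem on network $G(\mathcal{K})$. Since any dominating set of network $G(\mathcal{K})$ is a feasible solution to the LP $P_2$, we have that the optimal value of the LP satisfies $\sum_{i \in \mathcal{K}} z_i^* \le \gamma(G) \le \chi(G)$. As we will see in Remark 2, for a rich network structure, it is possible to have $\gamma(G) \ll \chi(G)$. In the next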
proposition, we provide a lower bound on $c_\mu$ in Equation (2) using the optimal solution $z^* = (z_i^*)_{i \in \mathcal{K}}$ of LP $P_2$.

Proposition 2. Let $U = \{i : \mu_i < \mu^*\}$ be the set of suboptimal actions. Let $O = \{i : \mu_i = \mu^*\}$ be the set of optimal actions. Then,

$$\frac{\max_{i \in U} D(\theta_i \| \theta^*)}{\min_{i \in U} \Delta_i}\, c_\mu + |O| \ \ge\ \sum_{i \in \mathcal{K}} z_i^* \ \ge\ \frac{\min_{i \in U} D(\theta_i \| \theta^*)}{\max_{i \in U} \Delta_i}\, c_\mu. \qquad (3)$$

Proof (Sketch). Let $I = \{i \in U : \mathcal{K}_i \cap O \ne \emptyset\}$ be the set of suboptimal actions with neighbors in $O$. Using the optimal solution of LP $P_1$, we construct a feasible solution satisfying constraints in LP $P_2$ for actions in $U \setminus I$. In order to satisfy the constraints for actions in $I \cup O$, we use $z_i = 1$ for all $i$ in $O$. The feasible solution constructed in this way gives an upper bound on the optimal value of LP $P_2$ in terms of the optimal value of LP $P_1$. For the lower bound, any feasible solution of $P_2$, in particular $z^*$, can be used to construct a feasible solution of $P_1$. See Section 8 for the full proof.

$\sum_{i \in \mathcal{K}} z_i^* = \Theta(c_\mu)$ completely captures the time dependence of regret on network structure under the following assumption:

Assumption 4. The quantities $|O|$, $\min_{i \in U} \Delta_i$, and $\min_{i \in U} D(\theta_i \| \theta^*)$ are constants that are independent of network size $K$.

Note that the constants in the above assumption are unknown to the decision maker. In the next section, we propose the $\epsilon_t$-greedy-LP policy, which achieves the regret lower bound of $c_\mu \log(t)$ up to a multiplicative constant factor that is independent of the network structure and time.

5. EPSILON-GREEDY-LP POLICY
Motivated by the LPs $P_1$ and $P_2$, we propose a network-aware randomized policy called the $\epsilon_t$-greedy-LP policy. We provide an upper bound on the regret of this policy and show that it achieves the asymptotic lower bound up to a constant multiplier, independent of network structure. Let $\bar{x}_i(t)$ be the empirical average of observations (rewards and side-observations combined) available for action $i$ up to time $t$. The $\epsilon_t$-greedy-LP policy is described in Algorithm 1. The policy consists of two phases, exploitation and exploration, where the exploration probability decreases as $1/t$, similar to the $\epsilon_t$-greedy policy proposed in [ ]. However, in our policy, we choose the exploration probability for action $i$ to be proportional to $z_i^*/t$, where $z_i^*$ is the optimal solution of LP
$P_2$, while in the original policy in [ ], the exploration probability is uniform over all actions.

Algorithm 1: $\epsilon_t$-greedy-LP
Input: $c > 0$, $0 < d < 1$, optimal solution $z^*$ of LP $P_2$.
for each time $t$ do
  Let $\epsilon(t) = \min\left(1, \frac{c \sum_{i \in \mathcal{K}} z_i^*}{d^2 t}\right)$ and $a^* = \arg\max_{i \in \mathcal{K}} \bar{x}_i(t)$.
  With probability $1 - \epsilon(t)$, pick action $\phi(t) = a^*$.
  With probability $\epsilon(t)$, pick action $\phi(t) = i$ with probability $\frac{z_i^*}{\sum_{j \in \mathcal{K}} z_j^*}$ for all $i \in \mathcal{K}$.
  Update average rewards $\bar{x}_v(t+1)$, $\forall v \in \mathcal{K}_{\phi(t)}$.
end for

The following proposition provides performance guarantees on the $\epsilon_t$-greedy-LP policy:

Proposition 3. For $0 < d < \min_{i \in U} \Delta_i$, any $c > 0$, and $\alpha > 1$, the probability with which a suboptimal action $i$ is selected by the $\epsilon_t$-greedy-LP policy, described in Algorithm 1, for all $t > t' = \frac{c \sum_{i \in \mathcal{K}} z_i^*}{d^2}$ is at most

$$\left(\frac{c}{d^2 t}\right) z_i^* + \frac{4}{d^2}\left(\frac{e t'}{t}\right)^{c/2\alpha} + \frac{2\delta c}{\alpha d^2}\left(\frac{e t'}{t}\right)^{cr/\alpha d^2} \log\left(\frac{e t}{t'}\right), \qquad (4)$$

where $r = \frac{3(\alpha - 1)^2}{8\alpha - 2}$, and $\delta$ is the maximum degree in the network.

Proof (Sketch). Since $z^*$ satisfies the constraints in LP $P_2$, there is sufficient exploration within each suboptimal action's neighborhood. The proof is then a combination of this fact and the proof of Theorem 3 in [ ]. See Section 8 for the full proof.

In the above proposition, for large enough $c$, we see that the second and third terms are $O(1/t^{1+\epsilon})$ for some $\epsilon > 0$. Using this fact, the following corollary bounds the expected regret of the $\epsilon_t$-greedy-LP policy:

Corollary 1. Choose parameters $c$ and $d$ such that $0 < d < \min_{i \in U} \Delta_i$ and $c > \max(\alpha d^2 / r, 2\alpha)$, for any $\alpha > 1$. Then, the expected regret at time $t$ of the $\epsilon_t$-greedy-LP policy described in Algorithm 1 is at most

$$\left(\frac{c}{d^2}\right) \sum_{i \in U} \Delta_i z_i^* \log(t) + O(K), \qquad (5)$$

where the $O(K)$ term captures constants independent of time but dependent on the network structure.

Remark 1. Under Assumption 4, we can see from Proposition 2 and Corollary 1 that the $\epsilon_t$-greedy-LP algorithm is order optimal, achieving the lower bound $O\left(\sum_{i \in U} z_i^* \log(t)\right) = O(c_\mu \log(t))$ as the network and time scale.

While the $\epsilon_t$-greedy-LP policy is network aware, its exploration is oblivious to the average rewards of the sub-optimal actions. Further, its performance guarantees depend on the knowledge of $\min_{i \in U} \Delta_i$, which is the difference between the mean rewards of the best and the second-best actions. On the other hand, the UCB-LP policy proposed in the next section is network-aware and also takes into account the average rewards of suboptimal actions. This could lead to better performance compared to the $\epsilon_t$-greedy-LP policy in certain situations, for example, when the highly connected action is also highly suboptimal.

6. UCB-LP POLICY
In this section we propose the UCB-LP policy defined in Algorithm 2 and obtain upper bounds on its regret.
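Returning to Algorithm 1, the $\epsilon_t$-greedy-LP loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: rewards are Bernoulli, side observations always arrive, the solution `z` of LP $P_2$ is supplied by the caller rather than computed, and the names `is_feasible` and `epsilon_greedy_lp` are hypothetical.

```python
import random


def is_feasible(z, neighbors):
    """Check the constraints of LP P2: z >= 0 and each neighborhood sums to >= 1."""
    return all(z[i] >= 0 for i in neighbors) and all(
        sum(z[j] for j in neighbors[i]) >= 1 for i in neighbors
    )


def epsilon_greedy_lp(neighbors, means, z, c, d, horizon, seed=0):
    """Run the epsilon_t-greedy-LP loop and return how often each action was chosen."""
    rng = random.Random(seed)
    actions = sorted(neighbors)
    weights = [z[i] for i in actions]
    z_sum = sum(weights)
    totals = {i: 0.0 for i in actions}   # sum of all observations per action
    counts = {i: 0 for i in actions}     # number of observations per action
    pulls = {i: 0 for i in actions}      # number of times each action was chosen

    def empirical_mean(i):
        return totals[i] / counts[i] if counts[i] else 0.0

    for t in range(1, horizon + 1):
        eps = min(1.0, c * z_sum / (d * d * t))
        if rng.random() < eps:
            # Exploration: sample action i with probability z_i / sum_j z_j.
            chosen = rng.choices(actions, weights=weights)[0]
        else:
            # Exploitation: action with the highest empirical average.
            chosen = max(actions, key=empirical_mean)
        pulls[chosen] += 1
        # Rewards and side observations both update the empirical averages.
        for j in neighbors[chosen]:
            x = 1.0 if rng.random() < means[j] else 0.0  # Bernoulli draw (assumption)
            totals[j] += x
            counts[j] += 1
    return pulls


# Hypothetical star network: node 0 is the centre, so z = (1, 0, 0, 0, 0) solves LP P2.
neighbors = {0: {0, 1, 2, 3, 4}, 1: {0, 1}, 2: {0, 2}, 3: {0, 3}, 4: {0, 4}}
z = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
means = {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0}
pulls = epsilon_greedy_lp(neighbors, means, z, c=1.0, d=0.5, horizon=200)
```

In this degenerate example all exploration goes to the centre, whose side observations reveal every leaf, so the suboptimal leaves 2, 3 and 4 are never chosen. This is the behaviour the LP-weighted exploration is designed to produce.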
The UCB-LP policy is based on the improved UCB policy proposed in [ ], which can be summarized as follows: the policy estimates the values of $\Delta_i$ in each round by a value $\tilde{\Delta}_m$, which is initialized to 1 and halved in each round $m$. By each round $m$, the policy draws $n(m)$ observations for each action in the set of actions not eliminated by round $m$, where $n(m)$ is determined by $\tilde{\Delta}_m$. Then, it eliminates those actions whose UCB indices perform poorly. Our policy differs from the one in [ ] by accounting for the presence of side-observations: this is achieved by choosing each action according to the optimal solution of LP $P_2$, while ensuring that $n(m)$ observations are available for each action not eliminated by round $m$.

Algorithm 2: UCB-LP policy
Input: Set of actions $\mathcal{K}$, time horizon $T$, and optimal solution $z^*$ of LP $P_2$.
Initialization: Let $\tilde{\Delta}_0 := 1$, $S_0 := \mathcal{K}$, and $B_0 := \mathcal{K}$.
for round $m = 0, 1, 2, \dots, \left\lfloor \frac{1}{2} \log_2 \frac{T}{e} \right\rfloor$ do
  Action Selection: Let $n(m) := \left\lceil \frac{2 \log(T \tilde{\Delta}_m^2)}{\tilde{\Delta}_m^2} \right\rceil$.
    If $|B_m| = 1$: choose the single action in $B_m$ until time $T$.
    Else if $\sum_{i \in \mathcal{K}} z_i^* \le 2 |B_m| \tilde{\Delta}_m$: for each action $i$ in $S_m$, choose $i$ $\lceil z_i^* (n(m) - n(m-1)) \rceil$ times.
    Else: for each action $i$ in $B_m$, choose $i$ $[n(m) - n(m-1)]$ times.
    Update the average rewards of all actions in $B_m$.
  Action Elimination: To get $B_{m+1}$, delete all actions $i$ in $B_m$ for which
    $$\bar{x}_i(m) + \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 s_i(m)}} < \max_{a \in B_m} \left\{ \bar{x}_a(m) - \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 s_a(m)}} \right\},$$
    where $\bar{x}_i(m)$ is the empirical average reward of action $i$, and $s_i(m)$ is the total number of observations for action $i$ up to round $m$.
  Reset: The set $S_{m+1}$ of actions with neighbors in $B_{m+1}$ is given as $S_{m+1} = \{i : \mathcal{K}_i \cap B_{m+1} \ne \emptyset\}$. Note that $B_{m+1} \subseteq S_{m+1}$. Let $\tilde{\Delta}_{m+1} = \tilde{\Delta}_m / 2$.
end for

The following proposition provides performance guarantees on the expected regret due to the UCB-LP policy:

Proposition 4. For action $i$, define round $m_i$ as follows:

$$m_i := \min\left\{ m : \tilde{\Delta}_m < \frac{\Delta_i}{2} \right\}.$$

Define $\tilde{m} = \min\left\{ m : \sum_{i \in \mathcal{K}} z_i^* > \sum_{i : m_i > m} 2^{-m+1} \right\}$ and the set $B = \{i \in U : m_i > \tilde{m}\}$. Then, the expected regret due to the UCB-LP policy described in Algorithm 2 is at most

$$\sum_{i \in U \setminus B} \Delta_i z_i^* \frac{32 \log(T \hat{\Delta}_i^2)}{\hat{\Delta}_i^2} + \sum_{i \in B} \frac{32 \log(T \Delta_i^2)}{\Delta_i} + O(K\delta), \qquad (6)$$

where $\hat{\Delta}_i = \max\{2^{-\tilde{m}+2}, \min_{j \in \mathcal{K}_i} \{\Delta_j\}\}$ and $(z_i^*)$ is the solution of LP $P_2$. $\delta$ is the maximum degree in the network. The $O(K\delta)$ term captures constants independent of time. Further, under Assumption 4, the regret is also at most

$$O\left( \sum_{i \in \mathcal{K}} z_i^* \log(T) \right) + O(K\delta), \qquad (7)$$

where $(z_i^*)$ entirely captures the time dependence on network structure.

Proof (Sketch). The $\log(T)$ term in the regret follows from the fact that, with high probability, each suboptimal action $i$ is eliminated (from the set $B_m$) on or before the first round $m$ such that $\tilde{\Delta}_m < \Delta_i / 2$. See Section 8 for the full proof.

Next, we briefly describe the policies UCB-N and UCB-MaxN proposed in [6]. In the UCB-N policy, at each time, the action with the highest UCB index (see (1)) is chosen, similar to the UCB policy in [ ]. In the UCB-MaxN policy, at each time $t$, the action $i$ with the highest UCB index (1) is identified and its neighboring action $j \in \mathcal{K}_i$ with the highest empirical average reward at time $t$ is chosen.

Remark 2. The regret upper bound of the UCB-N policy is

$$\inf_{\mathcal{C}} \sum_{C \in \mathcal{C}} \frac{8 \max_{i \in C} \Delta_i}{\min_{i \in C} \Delta_i^2} \log(t) + O(K),$$

where $\mathcal{C}$ is a clique covering of the sub-network of suboptimal actions. The regret upper bound for UCB-MaxN is the same as that for UCB-N with an $O(|\mathcal{C}|)$ term instead of the time-invariant $O(K)$ term. We show a better regret performance for the UCB-LP and $\epsilon_t$-greedy-LP policies with respect to the $\log(t)$ term because $\sum_{i \in \mathcal{K}} z_i^* \le \gamma(G) \le \chi(G)$. However, the time-invariant term in our policies is $O(K)$ and $O(K\delta)$, which can be worse than the time-invariant term $O(|\mathcal{C}|)$ in UCB-MaxN.

To see the possible gap between $\gamma(G)$ and $\chi(G)$, consider an Erdos-Renyi random graph $G(K, p)$. As noted in [ ], as $K \to \infty$, for any fixed $p > 0$, $\gamma(G)$ is at most $O(\log(K))$ while $\chi(G)$ is at least $O(K / \log(K))$. On the other hand, it has been shown [7] that for power law graphs, both $\gamma(G)$ and $\chi(G)$ scale linearly with $N$, although $\gamma(G)$ has a lower slope. We note that the social network of interest may or may not display a power law behavior. We find that the subgraphs of the Flixster network have a degree distribution
that is a straight line on a log-log plot, indicating a power law distribution, while the authors in [5] show that the degree distribution of the global Facebook network is not a straight line on a log-log plot. Our numerical results in Section 7 show that our policies outperform existing policies even for the Flixster network.

Remark 3. All uniformly good policies that ignore side-observations incur a regret that is at least $\Omega(|U| \log(t))$ [0], where $|U|$ is the number of suboptimal actions. This could be significantly higher than the guarantees on the regret of both the $\epsilon_t$-greedy-LP policy and the UCB-LP policy for a rich network structure, as discussed in Remark 2.

Remark 4. While $\epsilon_t$-greedy-LP does not require knowledge of the time horizon $T$, the UCB-LP policy requires the knowledge of $T$. The UCB-LP policy can be extended to the case of an unknown time horizon similar to the suggestion in [ ]. Start with $T_0 = 2$ and at the end of each $T_l$, set $T_{l+1} = T_l^2$. The regret bound for this case is expected to be similar to the one in Proposition 4.

Remark 5. In our work, we assumed that the side observations are always available. However, in reality, side observations may only be obtained sporadically. Suppose that when action $i$ is chosen, side-observations are obtained for each neighboring action $j$ with a known probability $p_j$. In this case, Proposition 1 holds with the replacement of LP $P_1$ with LP $P_1'$ as follows:

$$P_1': \ \min \sum_{i \in U} \Delta_i w_i,$$
$$\text{subject to: } w_i + p_i \sum_{j \in \mathcal{K}_i \setminus \{i\}} w_j \ge \frac{1}{D(\theta_i \| \theta^*)}, \ \forall i \in U, \qquad w_i \ge 0, \ \forall i \in \mathcal{K}.$$

Both of our policies work for this setting by changing the LP $P_2$ to $P_2'$ as follows:

$$P_2': \ \min \sum_{i \in \mathcal{K}} z_i$$
$$\text{subject to: } z_i + p_i \sum_{j \in \mathcal{K}_i \setminus \{i\}} z_j \ge 1, \ \forall i \in \mathcal{K}, \quad \text{and } z_i \ge 0, \ \forall i \in \mathcal{K}.$$

The regret bounds of our policies will now depend on the optimal solution of LP $P_2'$.

7. NUMERICAL RESULTS
We consider the Flixster network dataset for the numerical evaluation of our algorithms. The authors in [9] collected this social network data, which contains about million users and 4 million links. We use graph clustering [8] to identify two strongly clustered sub-networks of sizes 000 and 000 nodes. Both these sub-networks have a degree distribution that is a straight line on a log-log plot, indicating a power law distribution commonly observed in social networks. Our
empirical setup is as follows. Each user $i$ in the network is offered a promotion at each time, and accepts the promotion with probability $\mu_i \in [0.3, 0.9]$. The decision maker receives a random reward of 1 if a user accepts the promotion or 0 reward otherwise. $\mu_i$ is chosen uniformly at random from $[0.3, 0.8]$ and there are 50 randomly chosen users with optimal $\mu_i = 0.9$. Figures 2 and 3 show the regret performance as a function of time for the two sub-networks of sizes 000 and 000 respectively. For the $\epsilon_t$-greedy-LP policy, we let $c = 5$ and $d = 0$. For both networks, we see that our policies outperform the UCB-N and UCB-MaxN policies.
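The reward setup described above can be mocked up as follows. The sketch is only an illustration of how such an experiment might be initialized and scored: the function names `make_means` and `regret`, the small network size in the usage lines, and the use of realized pull counts in place of expectations are assumptions of this example, not details taken from the paper.

```python
import random


def make_means(num_users, num_optimal, rng, low=0.3, high=0.8, best=0.9):
    """Draw acceptance probabilities uniformly from [low, high], then set
    `num_optimal` randomly chosen users to the optimal value `best`."""
    means = [rng.uniform(low, high) for _ in range(num_users)]
    for i in rng.sample(range(num_users), num_optimal):
        means[i] = best
    return means


def regret(means, pulls):
    """Regret sum_i Delta_i * T_i for realized pull counts T_i."""
    best = max(means)
    return sum((best - mu) * n for mu, n in zip(means, pulls))


rng = random.Random(1)
means = make_means(num_users=100, num_optimal=5, rng=rng)  # hypothetical sizes
pulls = [0] * len(means)
pulls[means.index(0.9)] = 10      # ten plays of an optimal user cost nothing
zero_regret = regret(means, pulls)
```

Averaging `regret` over many independent runs of a policy would approximate the expected regret curves that the figures report.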

7 We also observe that the mprovement obtaned by UCB-N polcy over the baselne UCB polcy s margnal Expected Regret 4 x UCB wth no sde observatons UCB N UCB MaxN UCB LP ε t greedy LP Tme Fgure : Regret of all the polces for a network of sze 000 Expected Regret x 05 0 UCB wth no sde observatons UCB N UCB MaxN UCB LP ε t greedy LP Tme Fgure 3: Regret of all the polces for a network of sze PROOFS In what follows, we gve the proofs of all propostons stated n the earler sectons These proofs make use of Lemmas,, and 3, and Proposton 5 gven n the Appendx Proof of Proposton Let U { : μ <μ } be the set of suboptmal actons Also, let Δ μ μ Recall that K s the set of neghbors of, ncludng, n the network G(K Also, T (t s the total number of tmes acton s chosen up to tme t by polcy φ Let S (t be the total number of observatons correspondng to acton avalable at tme t From Proposton 5 gven n the Appendx, we have, lm nf t ES (t log(t, U (8 D(θ θ An observaton s receved for acton whenever any acton n K s chosen Hence, S (t T j(t (9 Now, from Equatons (8 and (9, for each U, j K lm nf ET j(t t log(t D(θ θ (0 Usng (0, we get the constrants of LP P Further, we have from defnton of regret that, lm nf t R μ(t log(t lm nf t U Δ ET (t log(t The above equaton along wth the constrants of the LP P obtaned from (0 gves us the requred lower bound on regret Proof of Proposton Let I { U: K O } be the set of suboptmal actons wth neghbors n O Let (z K be the optmal soluton of LP P We wll frst prove the upper bound n Equaton 3 Usng the optmal soluton (w K of LP P, we construct a feasble soluton satsfyng constrants ( n LP P n the followng way: For actons U, let z max U D(θ θ w Then (z U satsfy constrants for all actons U\Ibecause w satsfy constrants of LP P In order to satsfy the constrants for actons n I O, we use z forall n O The feasble soluton constructed n ths way gves an upper bound on the optmal value of LP P Hence, z + O K z U U ( max U D(θ θ w + O max U D(θ θ mn U Δ Δ w + O max U D(θ θ c μ + O 
mn U Δ For the lower bound, any feasble soluton of P, n partcular z, can be used to construct a feasble soluton of P For z actons K, let w Then mn U D(θ θ (w K satsfes the constrants of LP P and hence gves an upper bound on ts optmal value Therefore, we have c μ U Δ w, U Δ z mn U D(θ K θ max U Δ z mn U D(θ K θ whch gves us the requred lower bound Proof of Proposton 3 Snce z satsfes the constrants n LP P, there s suffcent exploraton wthn each suboptmal acton s neghborhood The proof s then a combnaton of ths fact and the proof of Theorem 3 n Let X (t be the random varable denotng the sample mean of all observatons avalable for

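action $i$ at time $t$. Let $\bar{X}^*(t)$ be the random variable denoting the sample mean of all observations available for an optimal action at time $t$. Let $\bar{X}_{i,m}$ denote the sample mean of $m$ random variables drawn from the distribution $F_i$. Fix a suboptimal action $i$. For some $\alpha > 1$, define $m_i$ as follows:

$$m_i = \frac{1}{\alpha}\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*}\,\sum_{m=1}^{t} \epsilon(m).$$

Let $\phi(t)$ be the action chosen by the $\epsilon_t$-greedy-LP policy at time $t$. Then,

$$P[\phi(t) = i] \le \epsilon(t)\,\frac{z_i^*}{\sum_{k \in \mathcal{K}} z_k^*} + (1 - \epsilon(t))\,P\left[\bar{X}_i(t) \ge \bar{X}^*(t)\right].$$

We also have that

$$P\left[\bar{X}_i(t) \ge \bar{X}^*(t)\right] \le P\left[\bar{X}_i(t) \ge \mu_i + \tfrac{\Delta_i}{2}\right] + P\left[\bar{X}^*(t) \le \mu^* - \tfrac{\Delta_i}{2}\right].$$

The analysis of both terms on the right-hand side of the above expression is similar. Let $S_i^R(t)$ be the total number of observations available for action $i$ from the exploration phase of the policy up to time $t$. Let $S_i(t)$ be the total number of observations available for action $i$ up to time $t$. Hence, we have

$$
\begin{aligned}
P\left[\bar{X}_i(t) \ge \mu_i + \tfrac{\Delta_i}{2}\right]
&= \sum_{m=1}^{t} P\left[S_i(t) = m;\ \bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] \\
&= \sum_{m=1}^{t} P\left[S_i(t) = m \,\middle|\, \bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] P\left[\bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] \\
&\le \sum_{m=1}^{t} P\left[S_i(t) = m \,\middle|\, \bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] e^{-\Delta_i^2 m / 2}
\quad \text{(Chernoff-Hoeffding bound in Lemma 1)} \\
&\le \sum_{m=1}^{\lfloor m_i \rfloor} P\left[S_i(t) = m \,\middle|\, \bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] + \frac{2}{\Delta_i^2}\, e^{-\Delta_i^2 \lfloor m_i \rfloor / 2}
\quad \Big(\text{since } \sum_{m=u+1}^{\infty} e^{-km} \le \tfrac{1}{k} e^{-ku}\Big) \\
&\le \sum_{m=1}^{\lfloor m_i \rfloor} P\left[S_i^R(t) \le m \,\middle|\, \bar{X}_{i,m} \ge \mu_i + \tfrac{\Delta_i}{2}\right] + \frac{2}{\Delta_i^2}\, e^{-\Delta_i^2 \lfloor m_i \rfloor / 2} \\
&\le m_i\, P\left[S_i^R(t) \le m_i\right] + \frac{2}{d^2}\, e^{-d^2 \lfloor m_i \rfloor / 2}.
\end{aligned}
$$

In the above, the last inequality follows from the fact that $S_i^R(t)$, which is the total number of observations for action $i$ available from the exploration phase of the policy up to time $t$, is independent of the sample means of all actions. Now,

$$E\left[S_i^R(t)\right] = \sum_{m=1}^{t} \epsilon(m)\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*} = \alpha m_i,$$

$$\mathrm{var}\left[S_i^R(t)\right] = \sum_{m=1}^{t} \epsilon(m)\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*}\left(1 - \epsilon(m)\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*}\right) \le E\left[S_i^R(t)\right].$$

Now, using Bernstein's inequality given in Lemma 2, we have

$$P\left[S_i^R(t) \le m_i\right] = P\left[S_i^R(t) \le E\left[S_i^R(t)\right] - (\alpha - 1) m_i\right] \le \exp(-r m_i), \quad \text{where } r = \frac{3(\alpha - 1)^2}{8\alpha - 2}.$$

Now, we will obtain upper and lower bounds on $m_i$. For the upper bound, for any $t > t' = c \sum_{k \in \mathcal{K}} z_k^* / d^2$,

$$
\begin{aligned}
m_i &= \frac{1}{\alpha}\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*}\,\sum_{m=1}^{t} \epsilon(m) \\
&\le \frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\alpha \sum_{k \in \mathcal{K}} z_k^*}\, t' + \frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\alpha \sum_{k \in \mathcal{K}} z_k^*} \sum_{m=t'+1}^{t} \frac{c \sum_{k \in \mathcal{K}} z_k^*}{d^2 m} \\
&\le \frac{\delta c}{\alpha d^2} + \frac{\delta c}{\alpha d^2} \sum_{m=t'+1}^{t} \frac{1}{m} \\
&\le \frac{\delta c}{\alpha d^2} \log\left(\frac{e^2 t}{t'}\right),
\end{aligned}
$$

where $\delta$ is the maximum degree in the network. In the above, $\sum_{j \in \mathcal{K}_i} z_j^* \le \delta$ because $z_j^* \le 1$ for all $j$, which is due to the fact that $(z_i^*)_{i \in \mathcal{K}}$ is the optimal solution of LP P. Next, for the lower bound, we use the fact that $\sum_{j \in \mathcal{K}_i} z_j^* \ge 1$ for all $i$ because $(z_i^*)_{i \in \mathcal{K}}$ satisfies the constraints of LP P. Thus

$$m_i = \frac{1}{\alpha}\,\frac{\sum_{j \in \mathcal{K}_i} z_j^*}{\sum_{k \in \mathcal{K}} z_k^*}\,\sum_{m=1}^{t} \epsilon(m) \ge \frac{1}{\alpha \sum_{k \in \mathcal{K}} z_k^*} \sum_{m=t'+1}^{t} \frac{c \sum_{k \in \mathcal{K}} z_k^*}{d^2 m} \ge \frac{c}{\alpha d^2} \log\left(\frac{t}{e t'}\right).$$

Hence, combining the inequalities above,

$$
\begin{aligned}
P\left[\bar{X}_i(t) \ge \mu_i + \tfrac{\Delta_i}{2}\right]
&\le m_i\, P\left[S_i^R(t) \le m_i\right] + \frac{2}{d^2}\, e^{-d^2 \lfloor m_i \rfloor / 2} \\
&\le m_i \exp(-r m_i) + \frac{2}{d^2}\, e^{-d^2 \lfloor m_i \rfloor / 2} \\
&\le \frac{\delta c}{\alpha d^2} \left(\frac{e t'}{t}\right)^{cr/(\alpha d^2)} \log\left(\frac{e^2 t}{t'}\right) + \frac{2}{d^2} \left(\frac{e t'}{t}\right)^{c/(2\alpha)}.
\end{aligned}
$$

Now, similarly for the optimal action, we have, for all $t > t'$,

$$P\left[\bar{X}^*(t) \le \mu^* - \tfrac{\Delta_i}{2}\right] \le \frac{\delta c}{\alpha d^2} \left(\frac{e t'}{t}\right)^{cr/(\alpha d^2)} \log\left(\frac{e^2 t}{t'}\right) + \frac{2}{d^2} \left(\frac{e t'}{t}\right)^{c/(2\alpha)}.$$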
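The Bernstein step used above, $P[S_i^R(t) \le m_i] \le \exp(-r m_i)$, can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes that in each round $m$ action $i$ receives an exploration observation independently with probability $w\,\epsilon(m)$, where `w` is a hypothetical stand-in for $\sum_{j \in \mathcal{K}_i} z_j^* / \sum_{k \in \mathcal{K}} z_k^*$. All function names and parameter values are illustrative.

```python
import math
import random


def epsilon(t, c, d, z_sum):
    """Exploration probability eps(t) = min(1, c * sum(z*) / (d^2 * t))."""
    return min(1.0, c * z_sum / (d * d * t))


def bernstein_rate(alpha):
    """Exponent r = 3 (alpha - 1)^2 / (8 alpha - 2) from the proof."""
    return 3.0 * (alpha - 1.0) ** 2 / (8.0 * alpha - 2.0)


def tail_bound(t, c, d, z_sum, w, alpha):
    """Return (m_i, exp(-r * m_i)), with m_i = E[S_i^R(t)] / alpha.

    w is an assumed value for sum_{j in K_i} z_j* / sum_k z_k*.
    """
    expected = w * sum(epsilon(m, c, d, z_sum) for m in range(1, t + 1))
    m_i = expected / alpha
    return m_i, math.exp(-bernstein_rate(alpha) * m_i)


def empirical_tail(t, c, d, z_sum, w, alpha, trials=1000, seed=0):
    """Monte Carlo estimate of P[S_i^R(t) <= m_i] under the assumed model."""
    rng = random.Random(seed)
    probs = [w * epsilon(m, c, d, z_sum) for m in range(1, t + 1)]
    m_i, _ = tail_bound(t, c, d, z_sum, w, alpha)
    hits = 0
    for _ in range(trials):
        # Number of exploration observations of action i up to time t.
        s = sum(1 for p in probs if rng.random() < p)
        if s <= m_i:
            hits += 1
    return hits / trials
```

With, for instance, `t=1000, c=1.0, d=0.5, z_sum=3.0, w=0.5, alpha=2.0`, the estimate returned by `empirical_tail` should sit below the second value returned by `tail_bound`. The gap between the two gives a rough sense of how conservative the Bernstein bound is for a particular parameter choice.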

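Combining everything, we have for any suboptimal action $i$, for all $t > t'$,

$$P[\phi(t) = i] \le \frac{c}{d^2 t}\, z_i^* + \frac{2 \delta c}{\alpha d^2} \left(\frac{e t'}{t}\right)^{cr/(\alpha d^2)} \log\left(\frac{e^2 t}{t'}\right) + \frac{4}{d^2} \left(\frac{e t'}{t}\right)^{c/(2\alpha)}.$$

Proof of Proposition 4

The proof technique is similar to that in [2]. We will analyze the regret by conditioning on two disjoint events. The first event is that each suboptimal action $a$ is eliminated by an optimal action on or before the first round $m$ such that $\tilde{\Delta}_m < \Delta_a / 2$. This happens with high probability and leads to logarithmic regret. The complement of the first event yields linear regret in time but occurs with probability proportional to $1/T$. The main difference from the proof in [2] is that on the first event, the number of times we choose each action is proportional to $z_i^* \log(T)$ in the exploration phase of the policy. This gives us the required upper bound in terms of the optimal solution $z^*$ of LP P.

Let $*$ denote any optimal action. Let $m^*$ denote the round in which the last optimal action $*$ is eliminated. For each suboptimal action $i$, define round $m_i := \min\{m : \tilde{\Delta}_m < \Delta_i / 2\}$. For an optimal action, $m_i = \infty$ by convention. Then, by the definition of $m_i$, for all rounds $m < m_i$, $\Delta_i \le 2 \tilde{\Delta}_m$, and

$$\frac{2}{\Delta_i} < 2^{m_i} = \frac{1}{\tilde{\Delta}_{m_i}} \le \frac{4}{\Delta_i} < \frac{1}{\tilde{\Delta}_{m_i + 1}} = 2^{m_i + 1}. \quad (11)$$

From Lemma 3 in the Appendix, the probability that action $i$ is not eliminated in round $m_i$ by $*$ is at most $\frac{2}{T \tilde{\Delta}_{m_i}^2}$. Let $\mathcal{U}_i = \mathcal{K}_i \cap \mathcal{U}$ be the set of suboptimal neighbors of action $i$. Let $I(t)$ be the action chosen at time $t$ by the UCB-LP policy. Let $E_{m^*}$ be the event that all suboptimal actions with $m_i \le m^*$ are eliminated by $*$ on or before their respective $m_i$. Then, the complement of $E_{m^*}$, denoted as $E_{m^*}^c$, is the event that there exists some suboptimal action $i$ with $m_i \le m^*$ which is not eliminated in round $m_i$. Let $E_i^c$ be the event that action $i$ is not eliminated by round $m_i$ by $*$. Let $m_f = \lfloor \tfrac{1}{2} \log_2 \tfrac{T}{e} \rfloor$. Recall that regret is denoted by $R_\mu(T)$. Let $P[m^* = m]$ be denoted by $p_m$. Hence,

$$
\begin{aligned}
E[R_\mu(T)] &= \sum_{m=0}^{m_f} E\left[R_\mu(T) \mid \{m^* = m\}\right] P[m^* = m] \\
&= \sum_{m=0}^{m_f} \sum_{t=1}^{T} \sum_{j \in \mathcal{U}} \Delta_j\, P\left[I(t) = j \mid \{m^* = m\}\right] p_m \\
&= \sum_{m=0}^{m_f} \sum_{t=1}^{T} \sum_{j \in \mathcal{U}} \Delta_j\, P\left[\{I(t) = j\} \cap E_{m^*} \mid \{m^* = m\}\right] p_m
 + \sum_{m=0}^{m_f} \sum_{t=1}^{T} \sum_{j \in \mathcal{U}} \Delta_j\, P\left[\{I(t) = j\} \cap E_{m^*}^c \mid \{m^* = m\}\right] p_m \\
&= \text{(i)} + \text{(ii)}.
\end{aligned}
$$

Next we will show that term (i) leads to logarithmic regret while term (ii) leads to a constant regret with time.

First, consider the term (ii) of the regret expression. Recall that $\mathcal{U}_j = \mathcal{K}_j \cap \mathcal{U}$ is the set of suboptimal neighbors of $j$. For each $j \in \mathcal{U}$, we have

$$
\begin{aligned}
\sum_{t=1}^{T} \sum_{m=0}^{m_f} P\left[\{I(t) = j\} \cap E_{m^*}^c \mid \{m^* = m\}\right] P[m^* = m]
&= \sum_{m=0}^{m_f} \sum_{t=1}^{T} P\Big[\{I(t) = j\} \cap \Big(\bigcup_{i \in \mathcal{U}:\, m_i \le m} E_i^c\Big) \,\Big|\, \{m^* = m\}\Big] p_m \\
&= \sum_{m=0}^{m_f} \sum_{t=1}^{T} P\Big[\{I(t) = j\} \cap \Big(\bigcup_{i \in \mathcal{U}_j:\, m_i \le m} E_i^c\Big) \,\Big|\, \{m^* = m\}\Big] p_m \\
&\qquad \text{(because } \{I(t) = j\} \text{ depends only on neighbors of } j\text{)} \\
&\le T \sum_{m=0}^{m_f} P\Big[\bigcup_{i \in \mathcal{U}_j:\, m_i \le m} E_i^c \,\Big|\, \{m^* = m\}\Big] p_m \\
&\le \sum_{i \in \mathcal{U}_j} T\, P\left[E_i^c \mid \{m^* \ge m_i\}\right] \\
&\le \sum_{i \in \mathcal{U}_j} T\, \frac{2}{T \tilde{\Delta}_{m_i}^2}
\quad \Big(\text{using Lemma 3, } P\left[E_i^c \mid \{m^* \ge m_i\}\right] \le \frac{2}{T \tilde{\Delta}_{m_i}^2}\Big) \\
&\le \sum_{i \in \mathcal{U}_j} \frac{32}{\Delta_i^2},
\end{aligned}
$$

where the last inequality follows from Equation (11). Hence, the term (ii) of regret is

$$\sum_{m=0}^{m_f} \sum_{t=1}^{T} \sum_{j \in \mathcal{U}} \Delta_j\, P\left[\{I(t) = j\} \cap E_{m^*}^c \mid \{m^* = m\}\right] p_m \le \sum_{j \in \mathcal{U}} \sum_{i \in \mathcal{U}_j} \frac{32 \Delta_j}{\Delta_i^2} = O(K\delta), \quad (12)$$

where $\delta$ is the maximum degree in the network.

Next, we consider the term (i). Recall that, in this term, we consider the case that all suboptimal actions $i$ with $m_i \le m^*$ are eliminated by $*$ on or before $m_i$.

$$
\begin{aligned}
\sum_{m=0}^{m_f} \sum_{t=1}^{T} \sum_{j \in \mathcal{U}} \Delta_j\, P\left[\{I(t) = j\} \cap E_{m^*} \mid \{m^* = m\}\right] p_m
&= \sum_{m=0}^{m_f} E\left[R_\mu(T) \mid \{m^* = m\}, E_{m^*}\right] P\left[E_{m^*} \mid \{m^* = m\}\right] p_m \\
&\le \sum_{m=0}^{m_f} \Big( E\left[\text{Regret from } \{i : m_i \le m^*\} \mid \{m^* = m\}, E_{m^*}\right] \\
&\qquad\quad + E\left[\text{Regret from } \{i : m_i > m^*\} \mid \{m^* = m\}, E_{m^*}\right] \Big) p_m \\
&\le E\left[\text{Regret from } \{i : m_i \le m_f\} \mid \{m^* = m_f\}, E_{m_f}\right] \\
&\qquad\quad + \sum_{m=0}^{m_f} E\left[\text{Regret from } \{i : m_i > m^*\} \mid \{m^* = m\}, E_{m^*}\right] p_m \\
&= E\left[R_\mu(T) \mid \{m^* = m_f\}, E_{m_f}\right] + \sum_{m=0}^{m_f} E\left[\text{Regret from } \{i : m_i > m^*\} \mid \{m^* = m\}, E_{m^*}\right] p_m \\
&= \text{(ia)} + \text{(ib)}.
\end{aligned}
$$

Once again, we will consider the above two terms separately. For the term (ia), under the event $E_{m_f}$, each suboptimal action $i$ is eliminated by $*$ by round $m_i$. Define round $\bar{m}$ and the set $B$ as follows:

$$\bar{m} = \min\Big\{m : \sum_{i \in \mathcal{K}} z_i^* > \sum_{i:\, m_i > m} 2^{-m+1}\Big\}, \qquad B = \{i \in \mathcal{U} : m_i > \bar{m}\}.$$

After round $\bar{m}$, the UCB-LP algorithm chooses only those actions with $m_i > \bar{m}$. Also, by the definition of the Reset phase of the algorithm, any suboptimal action $i \notin B$ is chosen (i.e., appears in the set $S_m$ at round $m$) only until all actions in its neighborhood are eliminated or until $\bar{m}$, whichever happens first. Define $n_i = \min\{\bar{m}, \max_{j \in \mathcal{K}_i} \{m_j\}\}$ for each suboptimal action $i$. Then any suboptimal action $i \notin B$ is chosen for at most $n_i$ rounds.

$$
\begin{aligned}
\text{(ia)} = E\left[R_\mu(T) \mid \{m^* = m_f\}, E_{m_f}\right]
&\le \sum_{i \in \mathcal{U} \setminus B} \Delta_i z_i^*\, \frac{2 \log(T \tilde{\Delta}_{n_i}^2)}{\tilde{\Delta}_{n_i}^2} + \sum_{i \in B} \Delta_i\, \frac{2 \log(T \tilde{\Delta}_{m_i}^2)}{\tilde{\Delta}_{m_i}^2} \\
&\le \sum_{i \in \mathcal{U} \setminus B} \Delta_i z_i^*\, \frac{32 \log(T \hat{\Delta}_i^2)}{\hat{\Delta}_i^2} + \sum_{i \in B} \Delta_i\, \frac{32 \log(T \Delta_i^2)}{\Delta_i^2}, \quad (13)
\end{aligned}
$$

where $\hat{\Delta}_i = \max\{2^{-\bar{m}+2}, \min_{j \in \mathcal{K}_i} \{\Delta_j\}\}$ and $(z_i^*)$ is the solution of LP P.

Finally, we consider the term (ib). An optimal action $*$ is not eliminated in round $m^*$ if (16) holds for $m = m^*$. Hence, using (17) and (18), the probability $p_m$ that $*$ is eliminated by a suboptimal action in any round $m^*$ is at most $\frac{2}{T \tilde{\Delta}_{m^*}^2}$. Hence, term (ib) is bounded as:

$$
\begin{aligned}
\sum_{m=0}^{m_f} E\left[\text{Regret from } \{i : m_i > m^*\} \mid \{m^* = m\}, E_{m^*}\right] p_m
&\le \sum_{m=0}^{m_f} \sum_{i \in \mathcal{U}:\, m_i > m} \frac{2}{T \tilde{\Delta}_m^2}\; T \max_{j \in \mathcal{U}:\, m_j > m} \Delta_j \\
&\le \sum_{i \in \mathcal{U}} \sum_{m=0}^{m_i} \frac{4}{\tilde{\Delta}_m}
 = \sum_{i \in \mathcal{U}} \sum_{m=0}^{m_i} 2^{m+2} \\
&< \sum_{i \in \mathcal{U}} 2^{m_i + 3} \le \sum_{i \in \mathcal{U}} \frac{32}{\Delta_i} = O(K). \quad (14)
\end{aligned}
$$

Now we get the result (6) by combining the bounds in (12), (13), and (14). Further, the definition of set $B$ ensures that we have $\sum_{i \in B} \Delta_i \le \sum_{i \in \mathcal{K}} z_i^*$, since $\Delta_i \le 2^{-\bar{m}+1}$ for every $i \in B$. Also, using Assumption 4, $\frac{32 \Delta_i \log(T \hat{\Delta}_i^2)}{\hat{\Delta}_i^2}$ and $\frac{32 \log(T \Delta_i^2)}{\Delta_i^2}$ are bounded by $C \log(T)$, where $C$ is a constant independent of network structure. Hence, (13) can be bounded as:

$$
\begin{aligned}
\sum_{i \in \mathcal{U} \setminus B} \Delta_i z_i^*\, \frac{32 \log(T \hat{\Delta}_i^2)}{\hat{\Delta}_i^2} + \sum_{i \in B} \Delta_i\, \frac{32 \log(T \Delta_i^2)}{\Delta_i^2}
&\le \sum_{i \in \mathcal{U} \setminus B} C \log(T)\, z_i^* + \sum_{i \in B} \Delta_i\, C \log(T) \\
&\le \sum_{i \in \mathcal{U} \setminus B} C \log(T)\, z_i^* + \sum_{i \in B} 2^{-\bar{m}+1}\, C \log(T) \\
&\le 2 \sum_{i \in \mathcal{K}} z_i^*\, C \log(T). \quad (15)
\end{aligned}
$$

Hence, we get (7) from (15), (12), and (14).

9 CONCLUSION

In this work, we studied the stochastic multi-armed bandit problem in the presence of side-observations across actions that are embedded in a network. We obtained an asymptotic (with respect to time) lower bound as a function of the network structure on the regret of any uniformly good policy. Further, we proposed two policies: the $\epsilon_t$-greedy-LP policy, and the UCB-LP policy, both of which are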
optimal in the sense that they achieve the asymptotic lower bound on the regret, up to a multiplicative constant that is independent of the network structure. These policies can have a better regret performance than existing policies for some important network structures. The $\epsilon_t$-greedy-LP policy is a network-aware any-time policy, but its exploration is oblivious to the average rewards of the suboptimal actions. On the other hand, UCB-LP considers both the network structure and the average rewards of actions. Finally, using numerical examples on the Flixster network dataset, we confirmed the significant benefits obtained by our policies against other existing policies.

Acknowledgments

This work is supported by the NSF grants: CAREER-CNS , CCF , CNS-0700, CNS-06536, and by grants from the Army Research Office: W9NF and W9NF . Also, the work of A. Eryilmaz was supported in part by the QNRF grant number NPRP .

REFERENCES

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.
[2] P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
[3] R. M. Bond, C. J. Fariss, J. J. Jones, A. D. I. Kramer, C. Marlow, J. E. Settle, and J. H. Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 489:295–298, 2012.
[4] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

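[5] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 2011.
[6] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat. Leveraging side observations in stochastic bandits. In UAI, pages 142–151. AUAI Press, 2012.
[7] C. Cooper, R. Klasing, and M. Zito. Lower bounds and algorithms for dominating sets in web graphs. Internet Mathematics, 2:275–300.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11).
[9] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 135–142. ACM, 2010.
[10] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[11] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10. ACM, 2010.
[12] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In NIPS, 2011.
[13] S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-armed bandit problems with dependent arms. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 721–728, New York, NY, USA, 2007. ACM.
[14] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395–411, 2010.
[15] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The anatomy of the Facebook social graph. CoRR, abs/1111.4503, 2011.

APPENDIX

Notation: $S_n = \frac{1}{n} \sum_{j=1}^{n} X_j$ is called the sample mean of the random variables $X_1, \dots, X_n$. The first two lemmas below state the Chernoff-Hoeffding inequality and Bernstein's inequality.

Lemma 1. Let $X_1, \dots, X_n$ be a sequence of random variables with support $[0, 1]$ and $E[X_t] = \mu$ for all $t \le n$. Let $S_n = \frac{1}{n} \sum_{j=1}^{n} X_j$. Then, for all $\epsilon > 0$, we have

$$P[S_n \ge \mu + \epsilon] \le e^{-2 n \epsilon^2}, \qquad P[S_n \le \mu - \epsilon] \le e^{-2 n \epsilon^2}.$$

Lemma 2. Let $X_1, \dots, X_n$ be a sequence of random variables with support $[0, 1]$ and $\sum_{k=1}^{t} \mathrm{var}[X_k \mid X_1, \dots, X_{k-1}] \le \sigma^2$ for all $t \le n$. Let $S_n = \sum_{j=1}^{n} X_j$. Then, for all $\epsilon > 0$, we have

$$P[S_n \ge E[S_n] + \epsilon] \le \exp\left\{-\frac{\epsilon^2}{2\sigma^2 + \frac{2}{3}\epsilon}\right\}, \qquad P[S_n \le E[S_n] - \epsilon] \le \exp\left\{-\frac{\epsilon^2}{2\sigma^2 + \frac{2}{3}\epsilon}\right\}.$$

The next lemma is used in the proof of Proposition 4.

Lemma 3. The probability that action $i$ is not eliminated in round $m_i$ by $*$ is at most $\frac{2}{T \tilde{\Delta}_{m_i}^2}$.

Proof. Let $\bar{X}_i(m)$ be the sample mean of all observations for action $i$ available in round $m$. Let $\bar{X}^*(m)$ be the sample mean of the optimal action. The constraints of LP P ensure that at the end of each round $m$, for all actions in $B_m$, we have $n(m) := \left\lceil \frac{2 \log(T \tilde{\Delta}_m^2)}{\tilde{\Delta}_m^2} \right\rceil$ observations. Now, for $m = m_i$, if we have

$$\bar{X}_i(m) \le \mu_i + \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 n(m)}} \quad \text{and} \quad \bar{X}^*(m) \ge \mu^* - \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 n(m)}}, \quad (16)$$

then action $i$ is eliminated by $*$ in round $m_i$. In fact, in round $m_i$, we have

$$\sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}} \le \frac{\tilde{\Delta}_{m_i}}{2} < \frac{\Delta_i}{4}.$$

Hence, in the elimination phase of the UCB-LP policy, if (16) holds for action $i$ in round $m_i$, we have

$$
\begin{aligned}
\bar{X}_i(m_i) + \sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}}
&\le \mu_i + 2 \sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}} \\
&< \mu_i + \Delta_i - 2 \sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}}
 = \mu^* - 2 \sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}} \\
&\le \bar{X}^*(m_i) - \sqrt{\frac{\log(T \tilde{\Delta}_{m_i}^2)}{2 n(m_i)}},
\end{aligned}
$$

and action $i$ is eliminated. Hence, the probability that action $i$ is not eliminated in round $m_i$ is the probability that either one of the inequalities in (16) does not hold. Using the Chernoff-Hoeffding bound (Lemma 1), we can bound this as follows:

$$P\left[\bar{X}_i(m) > \mu_i + \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 n(m)}}\right] \le \frac{1}{T \tilde{\Delta}_m^2}, \quad (17)$$

$$P\left[\bar{X}^*(m) < \mu^* - \sqrt{\frac{\log(T \tilde{\Delta}_m^2)}{2 n(m)}}\right] \le \frac{1}{T \tilde{\Delta}_m^2}. \quad (18)$$

Summing the above two inequalities for $m = m_i$ gives us that the probability that action $i$ is not eliminated in round $m_i$ by $*$ is at most $\frac{2}{T \tilde{\Delta}_{m_i}^2}$.

The next proposition is a modified version of the lower-bound theorem in [10]. We use it to obtain the regret lower bound established earlier.

Proposition 5. Suppose Assumptions 1, 2, and 3 hold. Then, under any uniformly good policy $\phi$, we have that, for each action $i$ with $\mu_i < \mu^*$,

$$\liminf_{t \to \infty} \frac{E[S_i(t)]}{\log(t)} \ge \frac{1}{D(\theta_i \| \theta^*)}. \quad (19)$$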

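Proof. This proof follows from the proof of the lower-bound theorem in [10]. To fix ideas, suppose $i = 1$ is a suboptimal action and suppose action 2 is optimal. Let the parameters of the reward distributions be $\theta = (\theta_1, \dots, \theta_K)$ and the associated means be $\mu = (\mu_1, \dots, \mu_K)$. Then, for any $0 < \delta < 1$, due to Assumptions 1, 2, 3, there exists a parameter $\lambda$ and mean $\mu_\lambda$ associated with the density function $f(\cdot, \lambda)$ such that $\mu_\lambda > \mu_2$ and

$$\left| D(\theta_1 \| \theta_2) - D(\theta_1 \| \lambda) \right| \le \delta\, D(\theta_1 \| \theta_2). \quad (20)$$

Now, consider the new set of parameters $\eta = (\lambda, \theta_2, \dots, \theta_K)$, where the mean rewards are changed to $(\mu_\lambda, \mu_2, \dots, \mu_K)$. For this set of parameters, action 1 is the unique optimal action. Then, for any uniformly good policy, for $0 < b < \delta$, $E_\eta[t - T_1(t)] = o(t^b)$ and therefore

$$P_\eta\left[T_1(t) < \frac{(1 - \delta) \log(t)}{D(\theta_1 \| \lambda)}\right] = o(t^{b-1}),$$

similar to the asymptotic lower bound proof in [10]. Now, using the fact that $S_1(t) \ge T_1(t)$, we have

$$P_\eta\left[S_1(t) < \frac{(1 - \delta) \log(t)}{D(\theta_1 \| \lambda)}\right] = o(t^{b-1}).$$

Now the rest of the proof of the theorem in [10] applies directly to $S_1(t)$. We will repeat it below for completeness. Let $(Y_1(k))_{k \ge 1}$ be the observations drawn from distribution $F_1$ and define

$$L_m = \sum_{k=1}^{m} \log\left(\frac{f(Y_1(k); \theta_1)}{f(Y_1(k); \lambda)}\right).$$

Now, we have that $P_\eta[C_t] = o(t^{b-1})$, where

$$C_t = \left\{ S_1(t) < \frac{(1 - \delta) \log(t)}{D(\theta_1 \| \lambda)} \ \text{and} \ L_{S_1(t)} \le (1 - b) \log(t) \right\}.$$

Now,

$$
\begin{aligned}
&P_\eta\left[T_1(t) = t_1, \dots, T_K(t) = t_K,\ L_s \le (1 - b) \log(t)\right] \\
&\quad = E_\theta\left[\prod_{k=1}^{s} \frac{f(Y_1(k); \lambda)}{f(Y_1(k); \theta_1)}\ \mathbb{1}\left\{T_1(t) = t_1, \dots, T_K(t) = t_K,\ L_s \le (1 - b) \log(t)\right\}\right] \\
&\quad \ge \exp\left(-(1 - b) \log(t)\right) P_\theta\left[T_1(t) = t_1, \dots, T_K(t) = t_K,\ L_s \le (1 - b) \log(t)\right]. \quad (21)
\end{aligned}
$$

In the above, we used the definition $s = \sum_{j \in \mathcal{K}_1} t_j$. Also, $C_t$ is a disjoint union of events of the form $\{T_1(t) = t_1, \dots, T_K(t) = t_K,\ L_s \le (1 - b) \log(t)\}$ with $t_1 + \dots + t_K = t$ and $s < (1 - \delta) \log(t) / D(\theta_1 \| \lambda)$. Hence, using (21),

$$P_\theta[C_t] \le t^{1-b}\, P_\eta[C_t] \to 0. \quad (22)$$

By the strong law of large numbers, $L_m / m \to D(\theta_1 \| \lambda)$ as $m \to \infty$ and $\max_{k \le m} L_k / m \to D(\theta_1 \| \lambda)$ almost surely. Now, since $1 - b > 1 - \delta$, it follows that as $t \to \infty$,

$$P_\theta\left[L_k > (1 - b) \log(t) \ \text{for some} \ k < \frac{(1 - \delta) \log(t)}{D(\theta_1 \| \lambda)}\right] \to 0. \quad (23)$$

Hence, from (22) and (23),

$$P_\theta\left[S_1(t) < \frac{(1 - \delta) \log(t)}{D(\theta_1 \| \lambda)}\right] \to 0.$$

Now using the above equation with (20) gives us the asymptotic lower bound in (19).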
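To make the quantity in (19) concrete, the short sketch below evaluates $\log(t) / D(\theta_i \| \theta^*)$ for Bernoulli rewards. The Bernoulli choice is an assumption made here for illustration only (the proposition is stated for a general parametric family), and the function names are hypothetical. Since (19) is an asymptotic statement, the value should be read as a rough guide to how many observations of a suboptimal action any uniformly good policy eventually needs, not as a finite-time guarantee.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q), 0 < q < 1."""
    def term(a, b):
        # Convention: 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)


def asymptotic_observation_guide(t, mu_i, mu_star):
    """Evaluate log(t) / D(theta_i || theta*) for Bernoulli means mu_i < mu_star."""
    return math.log(t) / bernoulli_kl(mu_i, mu_star)
```

For example, `asymptotic_observation_guide(1000, 0.4, 0.6)` is larger than `asymptotic_observation_guide(1000, 0.2, 0.6)`: an action whose mean is closer to the optimal one is harder to distinguish and must be observed more often.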