Learning to Search Better than Your Teacher


Kai-Wei Chang, University of Illinois at Urbana-Champaign, IL
Akshay Krishnamurthy, Carnegie Mellon University, Pittsburgh, PA
Alekh Agarwal, Microsoft Research, New York, NY
Hal Daumé III, University of Maryland, College Park, MD
John Langford, Microsoft Research, New York, NY

Abstract

Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor? We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.

1. Introduction

In structured prediction problems, a learner makes joint predictions over a set of interdependent output variables and observes a joint loss. For example, in a parsing task, the output is a parse tree over a sentence. Achieving optimal performance commonly requires the prediction of each output variable to depend on neighboring variables. One approach to structured prediction is learning to search (L2S) (Collins & Roark, 2004; Daumé III & Marcu, 2005; Daumé III et al., 2009; Ross et al., 2011; Doppa et al., 2014; Ross & Bagnell, 2014), which solves the problem by:

1. converting structured prediction into a search problem with specified search space and actions;
2. defining structured features over each state to capture the interdependency between output variables;
3. constructing a reference policy based on training data;
4. learning a policy that imitates the reference policy.

Empirically, L2S approaches have been shown to be competitive with other structured prediction approaches both in accuracy and running time (see e.g. Daumé III et al. (2014)). Theoretically, existing L2S algorithms guarantee that if the learning step performs well, then the learned policy is almost as good as the reference policy, implicitly assuming that the reference policy attains good performance. Good reference policies are typically derived using labels in the training data, such as assigning each word to its correct POS tag. However, when the reference policy is suboptimal, which can arise for reasons such as computational constraints, nothing can be said for existing approaches.

This problem is most obviously manifest in a structured contextual bandit¹ setting. For example, one might want to predict how the landing page of a high profile website should be displayed.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

¹ The key difference from (1) contextual bandits is that the action space is exponentially large (in the length of trajectories in the search space); and from (2) reinforcement learning is that a baseline reference policy exists before learning starts.
This involves many interdependent predictions: items to show, position and size of those items, font, color, layout, etc. It may be plausible to derive a quality signal for the displayed page based on user feedback, and we may have access to a reasonable reference policy (namely the existing rule-based system that renders the current web page). But applying L2S techniques results in nonsense: learning something almost as good as the existing policy is useless, as we can just keep using the current system and obtain that guarantee. Unlike the full feedback settings, label information is not even available during learning to define a substantially better reference. The goal of learning here is to improve upon the current system, which is most likely far from optimal.

This naturally leads to the question: is learning to search useless when the reference policy is poor? This is the core question of the paper, which we address first with a new L2S algorithm, LOLS (Locally Optimal Learning to Search), in Section 2. LOLS operates in an online fashion and achieves a bound on a convex combination of regret-to-reference and regret-to-own-one-step-deviations. The first part ensures that good reference policies can be leveraged effectively; the second part ensures that even if the reference policy is very suboptimal, the learned policy is approximately locally optimal in a sense made formal in Section 3.

LOLS operates according to a general schematic that encompasses many past L2S algorithms (see Section 2), including Searn (Daumé III et al., 2009), DAgger (Ross et al., 2011) and AggreVaTe (Ross & Bagnell, 2014). A secondary contribution of this paper is a theoretical analysis of both good and bad ways of instantiating this schematic under a variety of conditions, including: whether the reference policy is optimal or not, and whether the reference policy is in the hypothesis class or not. We find that, while past algorithms achieve good regret guarantees when the reference policy is optimal, they can fail rather dramatically when it is not.
LOLS, on the other hand, has superior performance to other L2S algorithms when the reference policy performs poorly but local hill-climbing in policy space is effective. In Section 5, we empirically confirm that LOLS can significantly outperform the reference policy in practice on real-world datasets. In Section 4 we extend LOLS to address the structured contextual bandit setting, giving a natural modification to the algorithm as well as the corresponding regret analysis. The proofs of our main results, and the details of the cost-sensitive classifier used in experiments, are deferred to the appendix. The algorithm LOLS, the new kind of regret guarantee it satisfies, the modifications for the structured contextual bandit setting, and all experiments are new here.

Figure 1. An illustration of the search space of a sequential tagging example that assigns a part-of-speech tag sequence to the sentence "John saw Mary". Each state represents a partial labeling (e.g. [ ], [V ], [N ], [N N ], [N V ]). The start state is b = [ ] and the set of end states is E = {[N V N], [N V V], ...}; each end state is associated with a loss (e.g. [N V N] has loss 0). A policy chooses an action at each state in the search space to specify the next state.

2. Learning to Search

A structured prediction problem consists of an input space X, an output space Y, a fixed but unknown distribution D over X × Y, and a non-negative loss function ℓ(y*, ŷ) ∈ R≥0 which measures the distance between the true (y*) and predicted (ŷ) outputs. The goal of structured learning is to use N samples (x_i, y_i), i = 1, ..., N, to learn a mapping f : X → Y that minimizes the expected structured loss under D.

In the learning to search framework, an input x ∈ X induces a search space, consisting of an initial state b (which we will take to also encode x), a set of end states, and a transition function that takes state/action pairs (s, a) and deterministically transitions to a new state s′. For each end state e, there is a corresponding structured output y_e, and for convenience we define the loss ℓ(e) = ℓ(y, y_e), where y will be clear from context.
We further define a feature generating function Φ that maps states to feature vectors in R^d. The features express both the input x and previous predictions (actions). Fig. 1 shows an example search space². An agent follows a policy π ∈ Π, which chooses an action a ∈ A(s) at each non-terminal state s. An action specifies the next state from s. We consider policies that only access state s through its feature vector Φ(s), meaning that π(s) is a mapping from R^d to the set of actions A(s). A trajectory is a complete sequence of state/action pairs from the starting state b to an end state e. Trajectories can be generated by repeatedly executing a policy π in the search space. Without loss of generality, we assume the lengths of trajectories are fixed and equal to T. The expected loss J(π) of a policy is the expected loss ℓ(e_π) of the end state e_π ∈ E reached by following the policy³. Throughout, expectations are taken with respect to draws of (x, y) from the training distribution, as well as any internal randomness in the learning algorithm.

² Doppa et al. (2014) discuss several approaches for defining a search space. The theoretical properties of our approach do not depend on which search space definition is used.
³ Some imitation learning literature (e.g., (Ross et al., 2011; He et al., 2012)) defines the loss of a policy as an accumulation of the costs of states and actions in the trajectory generated by the policy. For simplicity, we define the loss only based on the end state. However, our theorems can be generalized.

Figure 2. An example search space. The exploration begins at the start state s and chooses the middle among three actions by the roll-in policy twice. Grey nodes are not explored. At state r the learning algorithm considers the chosen action (middle) and both one-step deviations from that action (top and bottom). Each of these deviations is completed using the roll-out policy until an end state is reached, at which point the loss is collected (end-state losses 0.8, 0.0 and 0.2 in the figure). Here, we learn that deviating to the top action (instead of middle) at state r decreases the loss by 0.2.

An optimal policy chooses the action leading to the minimal expected loss at each state. For losses decomposable over the states in a trajectory, generating an optimal policy is trivial given y (e.g., the sequence tagging example in (Daumé III et al., 2009)). In general, finding the optimal action at states not in the optimal trajectory can be tricky (e.g., (Goldberg & Nivre, 2013; Goldberg et al., 2014)).

Finally, like most other L2S algorithms, LOLS assumes access to a cost-sensitive classification algorithm. A cost-sensitive classifier predicts a label ŷ given an example x, and receives a loss c_x(ŷ), where c_x is a vector containing the cost for each possible label. In order to perform online updates, we assume access to a no-regret online cost-sensitive learner, which we formally define below.

Definition 1. Given a hypothesis class H : X → [K], the regret of an online cost-sensitive classification algorithm which produces hypotheses h_1, ..., h_M on a cost-sensitive example sequence {(x_1, c_1), ..., (x_M, c_M)} is

  Regret^CS_M = Σ_{m=1}^M c_m(h_m(x_m)) − min_{h∈H} Σ_{m=1}^M c_m(h(x_m)).   (1)

An algorithm is no-regret if Regret^CS_M = o(M).
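To make Definition 1 concrete, the short sketch below computes this regret on a logged run; the toy cost vectors and the constant-predictor comparator class standing in for H are invented for illustration.

```python
# Cost-sensitive classification regret (Definition 1) on a logged sequence.
# Each cost vector c_m is indexed by label; hypotheses map examples to labels.
def cs_regret(hypotheses, examples, costs, labels):
    # Cumulative cost suffered by the online learner's hypotheses h_1, ..., h_M.
    learner = sum(c[h(x)] for h, x, c in zip(hypotheses, examples, costs))
    # Best fixed comparator in hindsight; here H is the set of constant
    # predictors, a toy stand-in for a real hypothesis class.
    best = min(sum(c[y] for c in costs) for y in labels)
    return learner - best

# Invented data: 3 examples, 2 labels.
costs = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
examples = [0, 1, 2]
hypotheses = [lambda x: 1] * 3   # the learner keeps predicting label 1
```

Here the learner accumulates cost 2 while the best constant predictor pays 1, giving regret 1; a no-regret learner drives this gap to o(M) as the sequence grows.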
Such no-regret guarantees can be obtained, for instance, by applying the SECOC technique (Langford & Beygelzimer, 2005) on top of any importance weighted binary classification algorithm that operates in an online fashion, examples being the perceptron algorithm or online ridge regression.

Algorithm 1 Locally Optimal Learning to Search (LOLS)
Require: Dataset {x_i, y_i}, i = 1, ..., N, drawn from D, and β ≥ 0: a mixture parameter for roll-out.
 1: Initialize a policy π̂_0.
 2: for all i ∈ {1, 2, ..., N} (loop over each instance) do
 3:   Generate a reference policy π^ref based on y_i.
 4:   Initialize Γ = ∅.
 5:   for all t ∈ {0, 1, 2, ..., T − 1} do
 6:     Roll in by executing π^in_i = π̂_i for t rounds and reach s_t.
 7:     for all a ∈ A(s_t) do
 8:       Let π^out_i = π^ref with probability β, otherwise π̂_i.
 9:       Evaluate the cost c_{i,t}(a) by rolling out with π^out_i for T − t − 1 steps.
10:     end for
11:     Generate a feature vector Φ(x_i, s_t).
12:     Set Γ = Γ ∪ {⟨c_{i,t}, Φ(x_i, s_t)⟩}.
13:   end for
14:   π̂_{i+1} ← Train(π̂_i, Γ) (Update).
15: end for
16: Return the average policy across π̂_0, π̂_1, ..., π̂_N.

LOLS (see Algorithm 1) learns a policy π̂ ∈ Π to approximately minimize J(π),⁴ assuming access to a reference policy π^ref (which may or may not be optimal). The algorithm proceeds in an online fashion, generating a sequence of learned policies π̂_0, π̂_1, π̂_2, .... At round i, a structured sample (x_i, y_i) is observed, and the configuration of a search space is generated along with the reference policy π^ref. Based on (x_i, y_i), LOLS constructs T cost-sensitive multiclass examples using a roll-in policy π^in_i and a roll-out policy π^out_i. The roll-in policy is used to generate an initial trajectory and the roll-out policy is used to derive the expected loss. More specifically, for each decision point t ∈ [0, T), LOLS executes π^in_i for t rounds, reaching a state s_t ∼ π^in_i. Then, a cost-sensitive multiclass example is generated using the features Φ(s_t). Classes in the multiclass example correspond to available actions in state s_t. The cost c(a) assigned to action a is the difference in loss between taking action a and the best action:

  c(a) = ℓ(e(a)) − min_{a′} ℓ(e(a′)),   (2)

where e(a) is the end state reached with a roll-out by π^out_i after taking action a in state s_t. LOLS collects the T examples from the different roll-out points and feeds the set of examples Γ into an online cost-sensitive multiclass learner, thereby updating the learned policy from π̂_i to π̂_{i+1}. By default, we use the learned policy π̂_i for roll-in and a mixture policy for roll-out.

⁴ We can parameterize the policy π̂ using a weight vector w ∈ R^d such that a cost-sensitive classifier can be used to choose an action based on the features at each state. We do not consider using different weight vectors at different states.
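The roll-in/roll-out mechanics of Algorithm 1 can be sketched on a toy search space. Everything below (the two-step space, the loss table, the tabular policies) is invented for illustration and stands in for the real feature-based cost-sensitive learner; only the control flow mirrors the algorithm.

```python
import random

# Toy search space: a state is the tuple of actions taken so far; trajectories
# have length T = 2 with two actions per state, and losses sit on end states.
T, ACTIONS = 2, (0, 1)
LOSS = {(0, 0): 0.8, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 1.0}  # invented

def rollout(state, policy):
    """Complete the trajectory from `state` with `policy` and return its loss."""
    while len(state) < T:
        state = state + (policy(state),)
    return LOSS[state]

def lols_round(learned, reference, beta, rng):
    """One LOLS round: roll in with the learned policy; at each decision point,
    cost every one-step deviation with a mixture roll-out, as in Eq. (2)."""
    examples, state = [], ()
    for t in range(T):
        losses = []
        for a in ACTIONS:
            # Mixture roll-out: reference with probability beta, else learned.
            out = reference if rng.random() < beta else learned
            losses.append(rollout(state + (a,), out))
        m = min(losses)
        examples.append((state, [x - m for x in losses]))  # cost vector, Eq. (2)
        state = state + (learned(state),)                  # continue the roll-in
    return examples

learned = lambda s: 0                          # always picks action 0 (a poor policy)
reference = lambda s: 0 if len(s) == 0 else 1  # reaches the loss-0 end state
examples = lols_round(learned, reference, 1.0, random.Random(0))
```

With β = 1 every roll-out uses the reference, so the cost vectors are deterministic: at the root, deviating to action 1 costs 1.0 extra; after the poor roll-in step, the learner sees that action 1 beats action 0 by 0.8.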
For each roll-out, the mixture policy either executes π^ref to an end state with probability β, or π̂_i with probability 1 − β. LOLS converts into a batch algorithm with a standard online-to-batch conversion, where the final model π̄ is generated by averaging π̂_i across all rounds (i.e., picking one of π̂_1, ..., π̂_N uniformly at random).

Table 1. Effect of different roll-in and roll-out policies (rows are roll-in, columns are roll-out):

  roll-in \ roll-out:  Reference        | Mixture | Learned
  Reference:           Inconsistent (for any roll-out)
  Learned:             Not locally opt. | Good    | RL

The strategies marked "Inconsistent" might generate a learned policy with large structured regret, and the strategy marked "Not locally opt." could be much worse than its one-step deviations. The strategy marked "RL" reduces the structured learning problem to a reinforcement learning problem, which is much harder. The strategy marked "Good" is favored.

3. Theoretical Analysis

In this section, we analyze LOLS and answer the questions raised in Section 1. Throughout this section we use π̄ to denote the average policy obtained by first choosing n ∈ [1, N] uniformly at random and then acting according to π̂_n. We begin by discussing the choices of roll-in and roll-out policies. Table 1 summarizes the results of using different strategies for roll-in and roll-out.

3.1. The Bad Choices

An obvious bad choice is to roll in and roll out with the learned policy, because the learner is then blind to the reference policy. It reduces the structured learning problem to a reinforcement learning problem, which is much harder. To build intuition, we show two other bad cases.

Roll-in with π^ref is bad. Rolling in with the reference policy causes the state distribution to be unrealistically good. As a result, the learned policy never learns to correct for previous mistakes, performing poorly when testing. A related discussion can be found at Theorem 2.1 in (Ross & Bagnell, 2010). We show a theorem below.

Theorem 1. For π^in_i = π^ref, there is a distribution D over (x, y) such that the induced cost-sensitive regret Regret^CS_M = o(M) but J(π̄) − J(π^ref) = Ω(1).

Proof.
We demonstrate examples where the claim is true. We start with the case where π^out_i = π^in_i = π^ref. In this case, suppose we have one structured example, whose search space is defined as in Figure 3(a).

Figure 3. Counterexamples for π^in_i = π^ref and π^out_i = π^ref: (a) π^in_i = π^out_i = π^ref; (b) π^in_i = π^ref, representation constrained; (c) π^out_i = π^ref. All three examples have 7 states. In (a) and (b), actions a and b lead from s_1 to s_2 and s_3; from s_2, actions c and d reach end states e_1 and e_2 with losses 0 and 10; from s_3, actions e and f reach end states e_3 and e_4 with losses 100 and 0. In (c), from s_2 actions c and d reach losses 1 and 1 − ε, and from s_3 losses 1 + ε and 0. A policy chooses actions to traverse through the search space until it reaches an end state. Legal policies are bit vectors, so that a policy with weight on a goes up in s_1 of Figure 3(a), while weight on b sends it down. Since features uniquely identify actions of the policy in this case, we just mark the edges with corresponding features for simplicity. The reference policy is boldfaced. In Figure 3(b), the features are the same on either branch from s_1, so that the learned policy can do no better than pick randomly between the two. In Figure 3(c), states s_2 and s_3 share the same feature set (i.e., Φ(s_2) = Φ(s_3)); therefore, a policy chooses the same set of actions at states s_2 and s_3. Please see the text for details.

From state s_1, there are two possible actions: a and b (we will use actions and features interchangeably, since features uniquely identify actions here); the (optimal) reference policy takes action a. From state s_2, there are again two actions (c and d); the reference takes c. Finally, even though the reference policy would never visit s_3, from that state it chooses action f. When rolling in with π^ref, cost-sensitive examples are generated only at state s_1 (if we take a one-step deviation at s_1) and s_2, but never at s_3 (since that would require two deviations, one at s_1 and one at s_3). As a result, we can never learn how to make predictions at state s_3. Furthermore, under roll-out with π^ref, both actions from state s_1 lead to a loss of zero.
The learner can therefore learn to take action c at state s_2 and b at state s_1, and achieve zero cost-sensitive regret, thereby thinking it is doing a good job. Unfortunately, when this policy is actually run, it performs as badly as possible (by taking action e half the time in s_3), which results in the large structured regret.

Next we consider the case where π^out_i is either the learned policy or a mixture with π^ref. When applied to the example in Figure 3(b), our feature representation is not expressive enough to differentiate between the two actions at state s_1, so the learned policy can do no better than pick randomly between the top and bottom branches from this state. The algorithm either rolls in with π^ref on s_1 and generates a cost-sensitive example at s_2, or generates a cost-sensitive example on s_1 and then completes a roll-out with π^out_i. Crucially, the algorithm still never generates a cost-sensitive example at the state s_3 (since it would have already taken a one-step deviation to reach s_3 and is constrained to do a roll-out from s_3). As a result, if the learned policy were to choose the action e in s_3, it leads to zero cost-sensitive regret but large structured regret.

Despite these negative results, rolling in with the learned policy is robust to both of the above failure modes. In Figure 3(a), if the learned policy picks action b in state s_1, then we can roll in to the state s_3, then generate a cost-sensitive example and learn that f is a better action than e. Similarly, we also observe a cost-sensitive example in s_3 in the example of Figure 3(b), which clearly demonstrates the benefits of rolling in with the learned policy as opposed to π^ref.

Roll-out with π^ref is bad if π^ref is not optimal. When the reference policy is not optimal or the reference policy is not in the hypothesis class, rolling out with π^ref can make the learner blind to compounding errors. The following theorem holds. We state this in terms of "local optimality": a policy is locally optimal if changing any one decision it makes never improves its performance.

Theorem 2. For π^out_i = π^ref, there is a distribution D over (x, y) such that the induced cost-sensitive regret Regret^CS_M = o(M) but π̄ has arbitrarily large structured regret to one-step deviations.

Proof. Suppose we have only one structured example, whose search space is defined as in Figure 3(c), and the reference policy chooses a or c depending on the node. If we roll out with π^ref, we observe expected losses 1 and 1 + ε for actions a and b at state s_1, respectively. Therefore, the policy with zero cost-sensitive classification regret chooses actions a and d depending on the node. However, a one-step deviation (a → b) does radically better, and can be learned by instead rolling out with a mixture policy.

The above theorems show the bad cases and motivate a good L2S algorithm which generates a learned policy that competes with the reference policy and deviations from the learned policy. In the following section, we show that Algorithm 1 is such an algorithm.

3.2. Regret Guarantees

Let Q^π(s_t, a) represent the expected loss of executing action a at state s_t and then executing policy π until reaching an end state.
T is the number of decisions required before reaching an end state. For notational simplicity, we use Q^π(s_t, π′) as shorthand for Q^π(s_t, π′(s_t)), where π′(s_t) is the action that π′ takes at state s_t. Finally, we use d^t_π to denote the distribution over states at time t when acting according to the policy π. The expected loss of a policy is

  J(π) = E_{s∼d^t_π}[Q^π(s, π)],   (3)

for any t ∈ [0, T]. In words, this is the expected cost of rolling in with π up to some time t, taking π's action at time t and then completing the roll-out with π.

Our main regret guarantee for Algorithm 1 shows that LOLS minimizes a combination of regret to the reference policy π^ref and regret to its own one-step deviations. In order to concisely present the result, we present an additional definition which captures the regret of our approach:

  δ_N = (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T E_{s∼d^t_{π̂_i}}[ Q^{π^out_i}(s, π̂_i) − ( β min_a Q^{π^ref}(s, a) + (1 − β) min_a Q^{π̂_i}(s, a) ) ],   (4)

where π^out_i = β π^ref + (1 − β) π̂_i is the mixture policy used to roll out in Algorithm 1. With these definitions in place, we can now state our main result for Algorithm 1.

Theorem 3. Let δ_N be as defined in Equation (4). The averaged policy π̄ generated by running N steps of Algorithm 1 with mixing parameter β satisfies

  β (J(π̄) − J(π^ref)) + (1 − β) Σ_{t=1}^T ( J(π̄) − min_{π∈Π} E_{s∼d^t_{π̄}}[Q^{π̄}(s, π)] ) ≤ T δ_N.

It might appear that the LHS of the theorem combines one term which is constant with another scaling with T. We point the reader to Lemma 1 in the appendix to see why the terms are comparable in magnitude. Note that the theorem does not assume anything about the quality of the reference policy, and it might be arbitrarily suboptimal. Assuming that Algorithm 1 uses a no-regret cost-sensitive classification algorithm (recall Definition 1), the first term in the definition of δ_N converges to

  ℓ* = min_{π∈Π} (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T E_{s∼d^t_{π̂_i}}[Q^{π^out_i}(s, π)].

This observation is formalized in the next corollary.

Corollary 1. Suppose we use a no-regret cost-sensitive classifier in Algorithm 1.
As N → ∞, δ_N → δ_class, where

  δ_class = ℓ* − (1/(NT)) Σ_{i,t} E_{s∼d^t_{π̂_i}}[ β min_a Q^{π^ref}(s, a) + (1 − β) min_a Q^{π̂_i}(s, a) ].

When we have β = 1, so that LOLS becomes almost identical to AggreVaTe (Ross & Bagnell, 2014), δ_class arises solely due to the policy class Π being restricted. For other values of β ∈ (0, 1), the asymptotic gap does not always vanish even if the policy class is unrestricted, since ℓ* amounts to obtaining min_a Q^{π^out_i}(s, a) in each state. This corresponds to taking a minimum of an average rather than the average of the corresponding minimum values. In order to avoid this asymptotic gap, it seems desirable to have the regrets to the reference policy and to one-step deviations controlled individually, which is equivalent to having the guarantee of Theorem 3 for all values of β in [0, 1] rather than a specific one. As we show in the next section, guaranteeing a regret bound to one-step deviations when the reference policy is arbitrarily bad is rather tricky and can take an exponentially long time. Understanding structures where this can be done more tractably is an important question for future research. Nevertheless, the result of Theorem 3 has interesting consequences in several settings, some of which we discuss next.

1. The second term on the left in the theorem is always non-negative by definition, so the conclusion of Theorem 3 is at least as powerful as the existing regret guarantee to the reference policy when β = 1. Since the previous works in this area (Daumé III et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014) have only studied regret guarantees to the reference policy, the quantity we are studying is strictly more difficult.

2. The asymptotic regret incurred by using a mixture policy for roll-out might be larger than that using the reference policy alone, when the reference policy is near-optimal. How the combination of these factors manifests in practice is empirically evaluated in Section 5.

3. When the reference policy is optimal, the first term is non-negative. Consequently, the theorem demonstrates that our algorithm competes with one-step deviations in this case. This is true irrespective of whether π^ref is in the policy class Π or not.

4. When the reference policy is very suboptimal, the first term can be negative. In this case, the regret to one-step deviations can be large despite the guarantee of Theorem 3, since the first, negative term allows the second term to be large while the sum stays bounded. However, when the first term is significantly negative, the learned policy has already improved upon the reference policy substantially! This ability to improve upon a poor reference policy by using a mixture policy for rolling out is an important distinction for Algorithm 1 compared with previous approaches.
Overall, Theorem 3 shows that the learned policy is either competitive with the reference policy and nearly locally optimal, or improves substantially upon the reference policy.

3.3. Hardness of local optimality

In this section we demonstrate that the process of reaching a local optimum (under one-step deviations) can be exponentially slow when the initial starting policy is arbitrary. This reflects the hardness of learning to search problems when equipped with a poor reference policy, even if local rather than global optimality is considered the yardstick. We establish this lower bound for a class of algorithms substantially more powerful than LOLS.

We start by defining a search space and policy class. Our search space consists of trajectories of length T, with 2 actions available at each step of the trajectory. We use 0 and 1 to index the two actions. We consider policies whose only feature in a state is the depth of the state in the trajectory, meaning that the action taken by any policy π in a state s_t depends only on t. Consequently, each policy can be indexed by a bit string of length T. For instance, the policy 010···0 executes action 0 in the first step of any trajectory, action 1 in the second step, and 0 at all other levels. It is easily seen that two policies are one-step deviations of each other if the corresponding bit strings have a Hamming distance of 1.

To establish a lower bound, consider the following powerful algorithmic pattern. Given a current policy π, the algorithm examines the cost J(π′) for all the one-step deviations π′ of π. It then chooses the policy with the smallest cost as its new learned policy. Note that access to the actual costs J(π) makes this algorithm more powerful than existing L2S algorithms, which can only estimate costs of policies through roll-outs on individual examples. Suppose this algorithm starts from an initial policy π̂_0. How long does it take for the algorithm to reach a policy π̂_i which is locally optimal compared with all its one-step deviations? We next present a lower bound for algorithms of this style.
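The algorithmic pattern in this lower-bound argument is easy to state in code. The sketch below greedily moves to the cheapest one-step deviation (Hamming distance 1) of a bit-string policy under an explicit cost function J; the cost function here is a small invented example, not the Ω(2^T) construction from the proof.

```python
# Policies over trajectories of length T are bit strings; one-step deviations
# are exactly the strings at Hamming distance 1.
T = 3

def deviations(policy):
    """All one-step deviations of a bit-string policy (tuples of 0/1)."""
    return [policy[:i] + (1 - policy[i],) + policy[i + 1:] for i in range(T)]

def local_search(J, policy):
    """Greedily move to the cheapest one-step deviation until locally optimal;
    return the final policy and the number of updates made."""
    updates = 0
    while True:
        best = min(deviations(policy), key=J)
        if J(best) >= J(policy):
            return policy, updates  # no deviation improves: locally optimal
        policy, updates = best, updates + 1

J = lambda p: sum(p)  # invented cost: (0, 0, 0) is the unique optimum
policy, updates = local_search(J, (1, 1, 1))
```

Under this easy cost function the search converges in T updates; the point of Theorem 4 is that a cost function can be constructed which forces exponentially many updates.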
Theorem 4. Consider any algorithm which updates policies only by moving from the current policy to a one-step deviation. Then there is a search space, policy class and cost function where any such algorithm must make Ω(2^T) updates before reaching a locally optimal policy. Specifically, the lower bound also applies to Algorithm 1.

The result shows that competing with the seemingly reasonable benchmark of one-step deviations may be very challenging from an algorithmic perspective, at least without assumptions on the search space, policy class, loss function, or starting policy. For instance, the construction used to prove Theorem 4 does not apply to Hamming loss.

4. Structured Contextual Bandit

We now show that a variant of LOLS can be run in a structured contextual bandit setting, where only the loss of a single structured label can be observed. As mentioned, this setting has applications to webpage layout, personalized search, and several other domains. At each round, the learner is given an input example x, makes a prediction ŷ and suffers a structured loss ℓ(y, ŷ). We assume that the structured losses lie in the interval [0, 1], that the search space has depth T and that there are at most K actions available at each state. As before, the algorithm has access to a policy class Π, and also to a reference policy π^ref. It is important to emphasize that the reference policy does not have access to the true label.
Algorithm 2 Structured Contextual Bandit Learning
Require: Examples {x_i}, i = 1, ..., N, a reference policy π^ref, an exploration probability ε and a mixture parameter β ≥ 0.
 1: Initialize a policy π̂_0, and set I = ∅.
 2: for all i = 1, 2, ..., N (loop over each instance) do
 3:   Obtain the example x_i; set explore = 1 with probability ε; set n_i = |I|.
 4:   if explore then
 5:     Pick a random time t ∈ {0, 1, ..., T − 1}.
 6:     Roll in by executing π^in_i = π̂_{n_i} for t rounds and reach s_t.
 7:     Pick a random action a_t ∈ A(s_t); let K = |A(s_t)|.
 8:     Let π^out_i = π^ref with probability β, otherwise π̂_{n_i}.
 9:     Roll out with π^out_i for T − t − 1 steps to evaluate ĉ(a) = K ℓ(e(a_t)) 1[a = a_t].
10:     Generate a feature vector Φ(x_i, s_t).
11:     π̂_{n_i+1} ← Train(π̂_{n_i}, ĉ, Φ(x_i, s_t)).
12:     Augment I = I ∪ {π̂_{n_i+1}}.
13:   else
14:     Follow the trajectory of a policy π drawn randomly from I to an end state e, and predict the corresponding structured output y_{i,e}.
15:   end if
16: end for

The goal is to improve upon the reference policy. Our approach is based on the ε-greedy algorithm, which is a common strategy in partial feedback problems. Upon receiving an example x_i, the algorithm randomly chooses whether to explore or exploit on this example. With probability 1 − ε, the algorithm chooses to exploit and follows the recommendation of the current learned policy. With the remaining probability, the algorithm performs a randomized variant of the LOLS update. A detailed description is given in Algorithm 2.

We assess the algorithm's performance via a measure of regret, where the comparator is a mixture of the reference policy and the best one-step deviation. Let π̄_i be the averaged policy based on all policies in I at round i, and let y_{i,e} be the predicted label in either step 9 or step 14 of Algorithm 2. The average regret is defined as

  Regret = (1/N) Σ_{i=1}^N ( E[ℓ(y_i, y_{i,e})] − β E[ℓ(y_i, y_{i,e_ref})] − (1 − β) (1/T) Σ_{t=1}^T min_{π∈Π} E_{s∼d^t_{π̄_i}}[Q^{π̄_i}(s, π)] ).

Recalling our earlier definition of δ_i (Equation (4)), we bound the regret of Algorithm 2, with proof in the appendix.

Theorem 5.
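Step 9 of Algorithm 2 replaces the full cost vector of LOLS with a one-point estimate, ĉ(a) = K ℓ(e(a_t)) 1[a = a_t], built from the single explored action. A quick numeric check of its unbiasedness over the uniform draw of a_t (the roll-out losses below are invented):

```python
# One-point importance-weighted cost estimate from the bandit setting:
# only the uniformly drawn action a_t is rolled out, and its observed
# loss is scaled up by the number of actions K.
def c_hat(K, observed_loss, a_t, a):
    return K * observed_loss * (1.0 if a == a_t else 0.0)

losses = [0.8, 0.0, 0.2]   # invented roll-out losses for K = 3 actions
K = len(losses)

# Exact expectation over the uniform draw of a_t recovers the true loss
# of every action, even though each round only one loss is observed.
expected = [sum(c_hat(K, losses[a_t], a_t, a) for a_t in range(K)) / K
            for a in range(K)]
```

Since E[ĉ(a)] = (1/K) · K · ℓ(e(a)) = ℓ(e(a)), the estimate is unbiased, at the price of a variance that scales with K; this is the usual exploration/variance trade-off of ε-greedy schemes.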
Algorithm 2 with parameter ε satisfies

  Regret ≤ ε + (1/N) Σ_{i=1}^N δ_{n_i}.

With a no-regret learning algorithm, we expect

  δ_i ≤ δ_class + c √(K log|Π| / i),   (5)

where |Π| is the cardinality of the policy class. This leads to the following corollary, with proof in the appendix.

Corollary 2. In the setup of Theorem 5, suppose further that the underlying no-regret learner satisfies (5). Then with probability at least 1 − 2/(N⁵ K² T² log(N|Π|))^{1/3},

  Regret = O( (KT)^{2/3} (log(N|Π|)/N)^{1/3} + T δ_class ).

5. Experiments

This section shows that LOLS is able to improve upon a suboptimal reference policy, and provides empirical evidence to support the analysis in Section 3. We conducted experiments on the following three applications.

Cost-sensitive multiclass classification. For each cost-sensitive multiclass sample, each choice of label has an associated cost. The search space for this task is a binary search tree. The root of the tree corresponds to the whole set of labels. We recursively split the set of labels in half, until each subset contains only one label. A trajectory through the search space is a path from root to leaf in this tree. The loss of the end state is defined by the cost. An optimal reference policy can lead the agent to the end state with the minimal cost. We also show results of using a bad reference policy, which arbitrarily chooses an action at each state. The experiments are conducted on the KDDCup 99 dataset⁵, generated from a computer network intrusion detection task. The dataset contains 5 classes, with 4,898,431 training and 311,029 test instances.

Part of speech tagging. The search space for POS tagging is left-to-right prediction. Under Hamming loss, the trivial optimal reference policy simply chooses the correct part of speech for each word. We train on 38k sentences and test on 11k from the Penn Treebank (Marcus et al., 1993). One can construct suboptimal or even bad reference policies, but under Hamming loss these are all equivalent to the optimal policy, because roll-outs by any fixed policy will incur exactly the same loss and the learner can immediately learn from one-step deviations.
⁵ kddcup99/kddcup99.html
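The label-tree search space used in the cost-sensitive multiclass experiment can be sketched directly: each state is a subset of labels, each action keeps one half, and a trajectory is a root-to-leaf path. The halving scheme and the target-seeking policy below are invented details for illustration.

```python
# A trajectory in a binary tree over labels: start from the full label set
# and repeatedly keep the left (action 0) or right (action 1) half.
def trajectory(labels, policy):
    """Follow `policy` (label set -> 0/1) from the root to a leaf label."""
    path = []
    while len(labels) > 1:
        mid = (len(labels) + 1) // 2
        action = policy(labels)
        path.append(action)
        labels = labels[:mid] if action == 0 else labels[mid:]
    return labels[0], path

def toward(target):
    """An 'optimal reference'-style policy that descends toward `target`."""
    def policy(labels):
        mid = (len(labels) + 1) // 2
        return 0 if target in labels[:mid] else 1
    return policy

label, path = trajectory(tuple(range(5)), toward(3))
```

With 5 labels the trajectory has length ⌈log₂ 5⌉; the loss of the reached leaf is just the cost of the corresponding label, so an optimal reference always reaches a minimum-cost leaf.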
Table 2. The average cost on the cost-sensitive classification dataset; columns are roll-out (Reference, Mixture, Learned) and rows are roll-in (Reference, Learned), for an optimal and for a bad reference policy. The best result is in bold. SEARN achieves […] and […] when the reference policy is optimal and bad, respectively. LOLS is Learned/Mixture and highlighted in green.

Table 3. The accuracy on POS tagging; columns are roll-out and rows are roll-in. The best result is in bold. SEARN achieves […]. LOLS is Learned/Mixture and highlighted in green.

Table 4. The UAS score on the dependency parsing data set; columns are roll-out (Reference, Mixture, Learned) and rows are roll-in (Reference, Learned), for an optimal, a suboptimal and a bad reference policy. The best result is in bold. SEARN achieves 84.0, 81.1, and 63.4 when the reference policy is optimal, suboptimal, and bad, respectively. LOLS is Learned/Mixture and highlighted in green.

Dependency parsing. A dependency parser learns to generate a tree structure describing the syntactic dependencies between words in a sentence (McDonald et al., 2005; Nivre, 2003). We implemented the hybrid transition system (Kuhlmann et al., 2011), which parses a sentence from left to right with three actions: SHIFT, REDUCE-LEFT and REDUCE-RIGHT. We used the non-deterministic oracle (Goldberg & Nivre, 2013) as the optimal reference policy, which leads the agent to the best end state reachable from each state. We also designed two suboptimal reference policies. A bad reference policy chooses an arbitrary legal action at each state. A suboptimal policy applies greedy selection and chooses the action which leads to a good tree when it is obvious; otherwise, it arbitrarily chooses a legal action. (This suboptimal reference was the default reference policy used prior to the work on non-deterministic oracles.) We used data from the Penn Treebank Wall Street Journal corpus: the standard data split for training (sections 02-21) and test (section 23).
The loss is evaluated in UAS (unlabeled attachment score), which measures the fraction of words that pick the correct parent. For each task and each reference policy, we compare 6 different combinations of roll-in (learned or reference) and roll-out (learned, mixture or reference) strategies. We also include SEARN in the comparison, since it has notable differences from LOLS. SEARN rolls in and out with a mixture where a different policy is drawn for each state, while LOLS draws a policy once per example. SEARN uses a batch learner, while LOLS uses an online learner. The policy in SEARN is a mixture over the policies produced at each iteration; for LOLS, it suffices to keep just the most recent one. It is an open research question whether an analogous theoretical guarantee to Theorem 3 can be established for SEARN. Our implementation is based on Vowpal Wabbit, a machine learning system that supports online learning and L2S. For LOLS's mixture policy, we set β = 0.5. We found that LOLS is not sensitive to β, and setting β to 0.5 works well in practice. For SEARN, we set the mixture parameter to 1 - (1 - α)^t, where t is the number of rounds and α = . Unless stated otherwise, all the learners take 5 passes over the data. Tables 2, 3 and 4 show the results on cost-sensitive multiclass classification, POS tagging and dependency parsing, respectively. The empirical results qualitatively agree with the theory. Rolling in with reference is always bad. When the reference policy is optimal, doing roll-outs with reference is a good idea. However, when the reference policy is suboptimal or bad, rolling out with reference is a bad idea, and mixture roll-outs perform substantially better. LOLS also significantly outperforms SEARN on all tasks.

Acknowledgements. Part of this work was carried out while Kai-Wei, Akshay and Hal were visiting Microsoft Research.

6 vw/
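The SEARN/LOLS distinction described above (a fresh policy drawn at every state versus one policy drawn per example) can be sketched as follows. This is not the paper's code; the `reference`/`learned` callables and the `beta` weight are illustrative stand-ins.

```python
# Sketch of mixture-policy rollouts: LOLS flips the coin once per
# example, SEARN flips it at every state.
import random

def lols_mixture(reference, learned, beta, rng):
    """Draw once per example: every state in this rollout sees the same policy."""
    return reference if rng.random() < beta else learned

def searn_mixture(reference, learned, beta, rng):
    """Draw per state: successive calls may interleave the two policies."""
    def policy(state):
        chosen = reference if rng.random() < beta else learned
        return chosen(state)
    return policy
```

With β = 0.5 (the setting used in the experiments above), a LOLS rollout is entirely reference or entirely learned, while a SEARN rollout mixes the two state by state.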
References

Abbott, H. L. and Katchalski, M. On the snake in the box problem. Journal of Combinatorial Theory, Series B, 45(1):13-24, 1988.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Collins, Michael and Roark, Brian. Incremental parsing with the perceptron algorithm. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), 2004.

Daumé III, Hal and Marcu, Daniel. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the International Conference on Machine Learning (ICML), 2005.

Daumé III, Hal, Langford, John, and Marcu, Daniel. Search-based structured prediction. Machine Learning Journal, 2009.

Daumé III, Hal, Langford, John, and Ross, Stéphane. Efficient programmable learning to search. arXiv preprint, 2014.

Doppa, Janardhan Rao, Fern, Alan, and Tadepalli, Prasad. HC-Search: A learning framework for search-based structured prediction. Journal of Artificial Intelligence Research (JAIR), 50, 2014.

Goldberg, Yoav and Nivre, Joakim. Training deterministic parsers with non-deterministic oracles. Transactions of the ACL, 1, 2013.

Goldberg, Yoav, Sartorio, Francesco, and Satta, Giorgio. A tabular method for dynamic oracles in transition-based parsing. Transactions of the ACL, 2, 2014.

He, He, Daumé III, Hal, and Eisner, Jason. Imitation learning by coaching. In Neural Information Processing Systems (NIPS), 2012.

Kuhlmann, Marco, Gómez-Rodríguez, Carlos, and Satta, Giorgio. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011.

Langford, John and Beygelzimer, Alina. Sensitive error correcting output codes. In Learning Theory. Springer, 2005.

Marcus, Mitch, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 1993.

McDonald, Ryan, Pereira, Fernando, Ribarov, Kiril, and Hajic, Jan. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.

Nivre, Joakim. An efficient algorithm for projective dependency parsing. In International Workshop on Parsing Technologies (IWPT), 2003.

Ross, Stéphane and Bagnell, J. Andrew. Efficient reductions for imitation learning. In Proceedings of the Workshop on Artificial Intelligence and Statistics (AIStats), 2010.

Ross, Stéphane and Bagnell, J. Andrew. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint, 2014.

Ross, Stéphane, Gordon, Geoff J., and Bagnell, J. Andrew. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Workshop on Artificial Intelligence and Statistics (AIStats), 2011.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML), 2003.
More information