Real-Time Scheduling via Reinforcement Learning


Robert Glaubius, Terry Tidwell, Christopher Gill, and William D. Smart
Department of Computer Science and Engineering
Washington University in St. Louis

Abstract

Cyber-physical systems, such as mobile robots, must respond adaptively to dynamic operating conditions. Effective operation of these systems requires that sensing and actuation tasks are performed in a timely manner. Additionally, execution of mission-specific tasks such as imaging a room must be balanced against the need to perform more general tasks such as obstacle avoidance. This problem has been addressed by maintaining relative utilization of shared resources among tasks near a user-specified target level. Producing optimal scheduling strategies requires complete prior knowledge of task behavior, which is unlikely to be available in practice. Instead, suitable scheduling strategies must be learned online through interaction with the system. We consider the sample complexity of reinforcement learning in this domain, and demonstrate that while the problem state space is countably infinite, we may leverage the problem's structure to guarantee efficient learning.

1 Introduction

In cyber-physical systems such as mobile robots, setting and enforcing a utilization target for shared resources is a useful mechanism for striking a balance between general and mission-specific goals while ensuring timely execution of tasks. However, classical scheduling approaches are inapplicable to tasks in the domains we consider. First, some tasks are not efficiently preemptable: for example, actuation tasks involve moving a physical resource, such as a robotic arm or pan-tilt unit. Restoring the actuator state after a preemption would be essentially the same as restarting that task. Therefore, once an instance of a task acquires the resource, it should retain the resource until completion. Second, the duration for which a task holds the resource may be stochastic. This is true for actuation tasks, which often involve one or more variable mechanical processes.
Classical real-time scheduling approaches model tasks deterministically by treating a task's worst-case execution time (WCET) as its execution budget. This is inappropriate in our domain, as a task's WCET may be many orders of magnitude larger than its typical duration. To account for this variability, we assume that each task's duration obeys some underlying but unknown stationary distribution. Behaving optimally under these conditions requires that we account for this uncertainty in order to anticipate common events while exploiting early resource availability and hedging against delays.

In previous work (Glaubius et al., 2008, 2009), we have proposed methods for solving scheduling problems with these concerns, provided that accurate task models are available. One straightforward approach for employing these methods is via certainty equivalence: constructing and solving an approximate model from observations of the system. However, this is less effective than interleaving modeling and solution with execution, since interleaving learning allows the controller to adapt to conditions observed during execution, which may differ from conditions observed in a distinct modeling phase.

Interleaving modeling and execution raises the exploration/exploitation dilemma (Kaelbling et al., 1996): the controller must balance optimal behavior with respect to available information against the long-term benefit of choosing apparently suboptimal exploratory actions that will improve that information. This dilemma is particularly relevant in the real-time systems domain, as sustained suboptimal behavior translates directly into poor quality of service.

In this paper we consider the problem of learning near-optimal schedules when the system model is not known

in advance. We provide PAC bounds on the computational complexity of learning a near-optimal policy using balanced wandering. Our result is novel, as it extends established methods for learning in finite Markov decision processes to a domain with a countably infinite state space with unbounded costs. We also provide an empirical comparison of several exploration methods, and observe that the structure of the task scheduling problem enforces effective exploration.

2 Background

2.1 System Model

As in Glaubius et al. (2008, 2009), the task scheduling model consists of n tasks (T_i)_{i=1}^n that require mutually exclusive use of a single common resource. Each task T_i consists of an infinite sequence of jobs (T_{i,j})_{j=0}^infinity. Job T_{i,0} is available at time 0, while each job T_{i,j+1} becomes available immediately upon completion of job T_{i,j}. Jobs cannot be preempted, so whenever a job is granted the resource, it occupies that resource for some stochastic duration until completion. Two simplifying assumptions are made regarding the distribution of job durations:

(A1) Inter-task job durations are independently distributed.

(A2) Intra-task job durations are independently and identically distributed.

When A1 holds, the duration of job T_{i,j} always obeys the same distribution regardless of what job preceded it. This means that the system history is not necessary to predict the behavior of a particular job. When A2 holds, consecutive jobs of the same task obey the same distribution. Thus, every task T_i has a duration distribution P(.|i) from which the duration of every job of T_i is drawn. The actuator example in the previous section does not immediately satisfy these assumptions, since a job's duration depends on the state of the actuator when the job starts executing. These may be enforced in actuator-sharing, however, by requiring that each job leaves the actuator in a static reference position before relinquishing control.
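To make assumptions A1 and A2 concrete, the following sketch (the class and method names are hypothetical, not from the paper) models each task as a fixed duration distribution with bounded integer support, and samples each job's duration independently of system history:

```python
import random

class Task:
    """A task with a stationary job-duration distribution (A1/A2).

    `durations` maps integer durations (quanta) to probabilities; the
    support is bounded, so the task has a finite WCET as required later.
    """
    def __init__(self, durations):
        assert abs(sum(durations.values()) - 1.0) < 1e-9
        self.durations = durations
        self.wcet = max(durations)

    def sample_duration(self, rng):
        # A2: every job of this task draws from the same distribution;
        # A1: the draw ignores which jobs ran before it.
        r, acc = rng.random(), 0.0
        for t, p in sorted(self.durations.items()):
            acc += p
            if r <= acc:
                return t
        return self.wcet

rng = random.Random(0)
task = Task({1: 0.7, 2: 0.2, 5: 0.1})
samples = [task.sample_duration(rng) for _ in range(1000)]
assert all(1 <= t <= task.wcet for t in samples)
```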
In addition to the assumptions stated above, each duration distribution must have bounded support on the positive integers: that is, every task T_i has an integer-valued WCET W_i such that Σ_{t=1}^{W_i} P(t|i) = 1. For simplicity, W denotes the maximum among all W_i, and the WCETs of individual tasks are ignored.

Our goal is to schedule jobs in order to preserve temporal isolation (Srinivasan and Anderson, 2005) among tasks. We specify some target utilization u_i for each task that describes its intended resource share at any temporal resolution. More specifically, let x_i(t) denote the number of quanta during which task T_i held the resource in the interval [0, t). Our objective is to keep |(t' - t)u_i - (x_i(t') - x_i(t))| as small as possible over every time interval [t, t') for each task T_i. We require that each task's utilization target u_i is rational and that the resource is completely divided among all tasks, so that Σ_{i=1}^n u_i = 1.

2.2 MDP Formulation

Following Glaubius et al. (2008, 2009), this problem is modeled as a Markov decision process (MDP) (Puterman, 1994). An MDP consists of a set of states X, a set of actions A, a transition system P, and a cost function C. At each discrete decision epoch k, a controller observes the current MDP state x_k and selects an action i_k. The MDP then transitions to state x_{k+1} distributed according to P(.|x_k, i_k) and incurs cost c_k = C(x_{k+1}). The value V^π of a policy π is the expected long-term γ-discounted cost of following π, where γ is a discount factor in (0, 1). V^π satisfies the recurrence

V^π(x) = Σ_{y in X} P(y|x, π(x))[γV^π(y) - C(y)].

It is often convenient to compare alternative actions using the state-action value function Q^π(x, i),

Q^π(x, i) = Σ_{y in X} P(y|x, i)[γV^π(y) - C(y)].

The objective is to find an optimal policy π* such that V^{π*}(x) ≥ V^π(x) among all states x and policies π. For brevity, V and Q are used to denote V^{π*} and Q^{π*}. V satisfies the Bellman equation (Puterman, 1994)

V(x) = max_{i in A} Σ_{y in X} P(y|x, i)[γV(y) - C(y)],    (1)

or equivalently V(x) = max_i {Q(x, i)}. An optimal policy is obtained by behaving greedily with respect to Q, π*(x) in argmax_i {Q(x, i)}.
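Equation 1 can be made concrete with a minimal value-iteration sketch. The MDP below is a hypothetical two-state example, not the scheduling MDP, and the value convention follows the recurrence above: successor costs are subtracted, so values are non-positive and maximized.

```python
# Value iteration for Equation 1 on a tiny generic MDP (illustrative only).
def value_iteration(states, actions, P, C, gamma, iters=500):
    """P[(x, i)] is a list of (prob, y); V(x) = max_i sum_y p*(gamma*V(y) - C(y))."""
    V = {x: 0.0 for x in states}
    for _ in range(iters):
        V = {x: max(sum(p * (gamma * V[y] - C[y]) for p, y in P[(x, i)])
                    for i in actions)
             for x in states}
    return V

states, actions = [0, 1], [0, 1]
# Action 0 keeps state 0 in place; state 1 is absorbing and costly.
P = {(0, 0): [(1.0, 0)], (0, 1): [(1.0, 1)],
     (1, 0): [(1.0, 1)], (1, 1): [(1.0, 1)]}
C = {0: 0.0, 1: 1.0}
V = value_iteration(states, actions, P, C, gamma=0.5)
assert abs(V[0]) < 1e-9   # staying in the zero-cost state is optimal
assert V[1] < 0           # the costly absorbing state has negative value
```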
Thus, computing the optimal control can be reduced to computing the optimal value function. Several dynamic programming and linear programming approaches have been developed to solve such problems when X and A are finite (Puterman, 1994).

The task scheduling problem is modeled as an MDP over a set of utilization states X = N^n. Each state x is an n-vector x = (x_1, ..., x_n) where each component x_i

is the total number of quanta during which task T_i occupied the shared resource since system initialization. τ(x) denotes the total elapsed time in state x,

τ(x) = Σ_{i=1}^n x_i.    (2)

Each action i in this MDP corresponds to the decision to run task T_i. Transitions are determined according to task duration distributions, so that

P(y|x, i) = P(t|i) if y = x + t e_i, and 0 otherwise,    (3)

where e_i is the zero vector except that component i is equal to one, i.e., executing task T_i alters just one dimension of the system state. The cost of a state is its L1-distance from target utilization within the hyperplane of states with equal elapsed time τ(x),

C(x) = Σ_{i=1}^n |x_i - τ(x)u_i|.    (4)

Figure 1: The utilization state model for a two-task problem instance. T_1 (grey, open arrowheads) stochastically transitions to the right, while T_2 (black, closed arrowheads) deterministically transitions upward. The dashed ray indicates the utilization target.

Figure 1 illustrates the utilization state model for a problem with two tasks and a target utilization u = (1, 2)/3 (that is, task T_1 should receive 1/3 of the processor, and task T_2 should receive the rest). The target utilization defines a target utilization ray {λu : λ ≥ 0}. When the components of u are rational, this ray regularly passes through many utilization states. In Figure 1, for example, the utilization ray passes through integer multiples of (1, 2). Every state on this ray has zero cost, and states with the same displacement from the target utilization ray have equal cost.

This task scheduling MDP has an infinite state space and unbounded costs, but because of repeated transition and cost structure, states that are collinear along rays parallel to the utilization ray may be aggregated. The resulting problem still has infinitely many states, but an optimal policy can be estimated accurately using a finite state approximation (Glaubius et al., 2008). Applying this model minimization approach (Givan et al., 2003) does require prior knowledge of the task parameters, which is often unavailable in practice. In this paper, we use reinforcement learning to integrate model and policy estimation. An important question is how much experience is necessary before we can trust learned policies. We address this question by deriving a PAC bound on the sample complexity of obtaining a near-optimal policy. To the best of our knowledge, this is the first such guarantee for problems with infinite state spaces and unbounded costs.

2.3 Related Work

A principle that unifies many successful methods for efficient exploration is optimism in the face of uncertainty (Kaelbling et al., 1996; Szita and Lőrincz, 2008). When presented with a choice between two actions with similar estimated value, methods using this principle tend to select the action that has been tried less frequently. Optimism can take the form of optimistic initialization (Even-Dar and Mansour, 2001), i.e., bootstrapping initial approximations of the value function with large values (Brafman and Tennenholtz, 2003; Strehl and Littman, 2008). Interval estimation techniques instead bias action selection towards exploration by maintaining confidence intervals on model parameters (Strehl and Littman, 2008; Auer et al., 2009) or value estimates (Even-Dar et al., 2002). Interval estimation techniques have been developed for solving single-state bandit problems (Auer et al., 2002; Even-Dar et al., 2002; Mannor and Tsitsiklis, 2004; Mnih et al., 2008), as they can be extended to the general MDP setting by treating each state as a distinct bandit problem.

Heuristic exploration strategies are often employed due to their relative simplicity. ε-greedy exploration and Boltzmann action selection methods (Kaelbling et al., 1996) are randomization strategies that bias action selection toward exploitation. Perhaps the most commonly used strategy, ε-greedy exploration, simply chooses an action uniformly at random with probability ε_k at epoch k, and otherwise it selects the apparent best action. By decaying ε_k appropriately this strategy asymptotically approaches the optimal policy (Even-Dar et al., 2002).
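The utilization-state dynamics and cost in Equations 2 through 4 are straightforward to express in code. This sketch uses hypothetical helper names; exact rational targets via `Fraction` keep the zero-cost checks exact:

```python
from fractions import Fraction

def tau(x):
    """Total elapsed time in state x (Equation 2)."""
    return sum(x)

def cost(x, u):
    """Equation 4: C(x) = sum_i |x_i - tau(x)*u_i|, the L1 distance from
    the target utilization ray at equal elapsed time."""
    t = tau(x)
    return sum(abs(xi - t * ui) for xi, ui in zip(x, u))

def step(x, i, t):
    """Equation 3: running task i for t quanta changes only component i."""
    y = list(x)
    y[i] += t
    return tuple(y)

# Two tasks with target utilization u = (1, 2)/3, as in Figure 1.
u = (Fraction(1, 3), Fraction(2, 3))
assert cost((1, 2), u) == 0               # states on the target ray are free
assert cost(step((1, 2), 0, 3), u) == 4   # over-running T_1 incurs cost
```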
We are interested in quantifying the sample complexity of learning good policies in terms of the number of observations necessary to compute a near-optimal policy with high probability, i.e., probably

approximately correct (PAC) learning (Valiant, 1984). Kakade (2003) has considered the question of PAC learning in MDPs in detail. Several PAC reinforcement learning algorithms have been developed, including E^3 (Kearns and Singh, 2002), R-Max (Brafman and Tennenholtz, 2003), MBIE (Strehl and Littman, 2008), and OIM (Szita and Lőrincz, 2008). These algorithms are limited to the finite state case, and assume bounded rewards. Metric E^3 (Kakade et al., 2003) is a PAC learner for MDPs with continuous but compact state spaces.

3 Online Learning Results

We consider the difficulty of learning good scheduling policies in this section. We approach this question both analytically and empirically. In Section 3.1, we derive a PAC bound (Valiant, 1984) on a balanced wandering approach to exploration (Kearns and Singh, 2002; Even-Dar et al., 2002; Brafman and Tennenholtz, 2003) in the scheduling domain. Our result is novel, as it extends results derived for the finite-state, bounded cost setting to a domain with a countably infinite state space and unbounded costs. These results rely on a specific Lipschitz-like condition that restricts the growth rate of the value function under our cost function (see Lemmas 3 and 4 in the appendix), and finite support of the duration distributions, i.e., finite worst-case execution times of tasks. In Section 3.2, we present results from simulations comparing alternative exploration strategies.

We estimate task duration distributions using the empirical probability measure. We suppose a collection of m observations {(i_k, t_k) : k = 1, ..., m}, where task T_{i_k} ran for t_k ~ P(.|i_k) quanta at decision epoch k. Then let ω_m(i) be the number of observations involving task T_i, and let ω_m(i, t) be the number of those observations in which T_i ran for t quanta,

ω_m(i) = Σ_{k=1}^m I{i_k = i},    (5)

ω_m(i, t) = Σ_{k=1}^m I{i_k = i and t_k = t},    (6)

where I{.} is the indicator function. Then our task duration model P_m(.|i) is just

P_m(t|i) = ω_m(i, t)/ω_m(i).    (7)

Since cost is completely determined by the system state, the transition model is the sole source of uncertainty in this problem.
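The empirical model of Equations 5 through 7 is simple counting. A minimal sketch (hypothetical function name):

```python
from collections import Counter

def empirical_model(observations):
    """Empirical duration model from observations [(i_k, t_k), ...]
    (Equations 5-7): P_m(t|i) = omega_m(i, t) / omega_m(i)."""
    per_task = Counter(i for i, _ in observations)   # omega_m(i)
    per_pair = Counter(observations)                 # omega_m(i, t)
    return {(i, t): n / per_task[i] for (i, t), n in per_pair.items()}

obs = [(0, 2), (0, 2), (0, 4), (0, 4), (1, 1)]
P_m = empirical_model(obs)
assert P_m[(0, 2)] == 0.5 and P_m[(0, 4)] == 0.5
assert P_m[(1, 1)] == 1.0
```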
3.1 Analytical PAC Bound

We consider the sample complexity of estimating a near-optimal policy with high confidence by bounding the number of low value exploratory actions taken (Kakade, 2003). Our analysis proceeds in three parts. First, we derive bounds on the value estimation error as a function of the model accuracy. Next, we determine the number of observations needed to guarantee that model accuracy. Finally, we use these results to determine how many observations suffice to arrive at a near-optimal policy with high certainty. We focus on estimating the state-action value function Q,

Q(x, i) = Σ_{t=1}^W P(t|i)[γV(x + t e_i) - C(x + t e_i)].    (8)

We use V_m to denote the optimal state value function and Q_m to denote the state-action value function of the estimated MDP with transition dynamics P_m. To establish our main result constraining the sample complexity of learning in our scheduling domain, we first provide the following simulation lemma, which is proven in the appendix.

Lemma 1. If there is a constant β such that for all tasks T_i,

Σ_{t=1}^W |P_m(t|i) - P(t|i)| ≤ β,    (9)

where the worst-case execution time W is finite, then

||Q_m - Q||_inf ≤ 2Wβ/(1 - γ)^2.    (10)

This result serves an identical role to the Simulation Lemma of Kearns and Singh (2002), relating model estimation error to value estimation error. Our bound replaces the quadratic dependence on the number of states in that result with a dependence on the WCET W. This is consistent with observations indicating that the sample complexity of obtaining a good approximation should depend polynomially on the number of parameters of the transition model (Kakade, 2003; Leffler et al., 2007), which is O(|X|^2 |A|) for general MDPs, but is O(W|A|) in this scheduling domain.

Theorem 1 provides a PAC bound on the number of observations needed to arrive at an accurate estimate of the value function. For the sake of simplicity we assume balanced wandering here, as this result can be easily used to guide offline modeling as well as employed during online learning.

Theorem 1.
Under balanced exploration, if

m ≥ (32W^3 n / (ε^2 (1 - γ)^4)) log(2Wn/δ),    (11)

then ||Q_m - Q||_inf ≤ ε with probability at least 1 - δ.

Proof. According to Lemma 1, model accuracy β ≤ ε(1 - γ)^2/(2W) is sufficient to guarantee that ||Q_m - Q||_inf ≤ ε. Thus, demonstrating the bound in Equation 11 is a matter of guaranteeing with high certainty that P_m is near P; specifically, we require that

P{ ∪_{i=1}^n ( Σ_{t=1}^W |P_m(t|i) - P(t|i)| > β ) } ≤ δ,

which we can enforce using the union bound by requiring P{ Σ_{t=1}^W |P_m(t|i) - P(t|i)| ≤ β } ≥ 1 - δ/n for every task. By a lemma from Kakade's dissertation (Kakade, 2003), ω_m(i) ≥ (8W/β^2) log(2Wn/δ) is sufficient to guarantee with probability 1 - δ/n that P_m(.|i) is accurate. If we assume balanced wandering, so that ω_m(i) = m/n for each task T_i, then we require

m ≥ (8Wn/β^2) log(2Wn/δ)    (12)

observations. Substituting the least accuracy β = ε(1 - γ)^2/(2W) that will still guarantee an ε-approximation to Q produces the stated result,

m ≥ (32W^3 n / (ε^2 (1 - γ)^4)) log(2Wn/δ).

Theorem 1 provides a PAC bound on the number of observations needed to learn an ε-approximation to Q. However, we are principally interested in discovering the number of observations we need to trust our learned policies. Corollary 1 establishes the sample complexity of using balanced wandering to learn good scheduling policies.

Corollary 1. Assuming each action is tried an equal number of times, if

m ≥ (128W^3 γ^2 n / (ε^2 (1 - γ)^6)) log(2Wn/δ),

then the optimal policy π_m of the estimated task scheduling MDP is within ε of the optimal policy π* with probability at least 1 - δ.

A classical result due to Singh and Yee (1994) demonstrates that, in general, a policy π̂ that is greedy with respect to a value function approximation V̂ is within 2γ||V̂ - V||_inf/(1 - γ) of optimal. Corollary 1 follows by noting that ||V̂ - V||_inf ≤ ||Q̂ - Q||_inf, so we require that 2γ||Q̂ - Q||_inf/(1 - γ) ≤ ε. Substituting this constraint on Q_m into Theorem 1 establishes the corollary.

As with existing bounds, the sample complexity scales polynomially in the parameters 1/(1 - γ), 1/δ, 1/ε, and the number of actions. Unlike bounds for general MDPs, there is no dependence on the number of states; instead, the complexity of learning is determined by the worst-case execution time W.
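As a quick numeric sanity check (not from the paper), the bounds in Theorem 1 and Corollary 1 can be evaluated directly:

```python
from math import ceil, log

def theorem1_bound(W, n, eps, delta, gamma):
    """Observations sufficient for ||Q_m - Q|| <= eps w.p. >= 1 - delta
    (Equation 11)."""
    return ceil(32 * W**3 * n / (eps**2 * (1 - gamma)**4)
                * log(2 * W * n / delta))

def corollary1_bound(W, n, eps, delta, gamma):
    """Observations sufficient for an eps-optimal greedy policy (Corollary 1)."""
    return ceil(128 * W**3 * gamma**2 * n / (eps**2 * (1 - gamma)**6)
                * log(2 * W * n / delta))

m_q = theorem1_bound(W=16, n=2, eps=1.0, delta=0.05, gamma=0.95)
m_pi = corollary1_bound(W=16, n=2, eps=1.0, delta=0.05, gamma=0.95)
assert m_pi > m_q                                     # policy guarantee needs more data
assert theorem1_bound(32, 2, 1.0, 0.05, 0.95) > m_q   # cost grows with W
```

As the code makes plain, the bounds are loose for practical discount factors: the (1 - γ) powers dominate.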
This result is similar to bounds for relocatable action models (Leffler et al., 2007), in which the state space can be partitioned into a relatively small number of classes. Transition models can be generalized among states in the same class, so the sample complexity of learning depends on the number of classes rather than the number of states. Our scheduling MDP is a special case of the relocatable action model in which there is only one class of states. While relocatable action models have been used to address infinite state spaces (Brunskill et al., 2009), existing sample complexity results do not address the unbounded reward case. We are able to handle unbounded costs here by taking advantage of the slow growth rate of the value function relative to the discount factor. Specifically, the distance between consecutive states is bounded, so while costs grow polynomially with distance from the resource share target (cf. Lemma 3 in the appendix), since costs are exponentially discounted the value of any particular state is finite. These observations enable the bound in Lemma 1, suggesting that sample complexity bounds may be possible in general for infinite state, unbounded cost models as long as the number of classes is finite and individual state values can be bounded. Of course, for these results to be useful good policies must be represented compactly, which is possible for the scheduling domain considered here (Glaubius et al., 2008), but is not generally the case.

3.2 Empirical Evaluation

The PAC bound in the previous section gives a sense of the finite sample performance for learning a good policy; however, it requires several simplifying assumptions, such as balanced wandering, so the bound may not be tight. In practice, alternative exploration strategies may yield better performance than our bound would indicate. We compare the performance of several exploration strategies in the context of the task scheduling problem by conducting experiments comparing ε-greedy, balanced wandering, and an interval-based exploration strategy.
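The compared strategies reduce to simple action-selection rules. The sketch below uses hypothetical function names; the decay schedule ε_k = ε_0/k and the observation-count-adjusted interval width follow the forms described in the text:

```python
import math, random

def epsilon_greedy(Q, x, actions, k, eps0, rng):
    """ε-greedy with the decay schedule eps_k = eps0 / k."""
    if rng.random() < eps0 / k:
        return rng.choice(actions)
    return max(actions, key=lambda i: Q[(x, i)])

def optimistic(Q, x, actions, omega, n, c):
    """Interval-based optimistic selection: argmax_i Q(x, i) + alpha_i,
    with the interval width adjusted for each action's observation count."""
    def alpha(i):
        return math.sqrt(math.log(n * omega[i]**2 * c) / omega[i])
    return max(actions, key=lambda i: Q[(x, i)] + alpha(i))

Q = {((0, 0), 0): 1.0, ((0, 0), 1): 0.9}
rng = random.Random(1)
assert epsilon_greedy(Q, (0, 0), [0, 1], k=10, eps0=0.0, rng=rng) == 0
# A rarely tried action can win on optimism despite a lower estimate:
assert optimistic(Q, (0, 0), [0, 1], omega={0: 500, 1: 5}, n=2, c=4.0) == 1
```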
For interval-based optimistic exploration, we use the confidence intervals derived for the multi-armed bandit case by Even-Dar et al. (2002) for the Successive Elimination algorithm. That algorithm constructs intervals of the form α_k = sqrt(log(n k^2 c)/k) about the expected cost of each action at decision epoch k, then eliminates actions that appear worse than the apparent best using an overlap test. The parameter c controls the sensitivity of the intervals.

Figure 2: Simulation comparison of exploration techniques: (a) Optimistic, (b) ε-greedy, (c) Balanced. Note the differing scales on the vertical axes.

We use these intervals to select actions optimistically according to

argmax_{i in A} {Q_m(x, i) + α_{k,i}},

where we have adjusted the confidence intervals according to the potentially different number of observations of each task,

α_{k,i} = sqrt(log(n ω_k(i)^2 c)/ω_k(i)).

We vary c to control the chance of taking exploratory actions. As c shrinks, these intervals narrow, increasing the tendency to exploit the estimated model. In our experiments with ε-greedy, we set the random selection rate at decision epoch k, ε_k = ε_0/k, for varying values of ε_0; this strategy always exploits when ε_0 = 0. Balanced wandering simply executes each task a fixed number of times m prior to exploiting. We vary this parameter to determine its impact on the learning rate. When m = 0, this strategy always exploits its current model knowledge.

To compare the performance of these exploration strategies, we generated 400 random problem instances with two tasks. Duration distributions for these tasks were generated by first selecting a worst-case execution time W uniformly at random from the interval [8, 32], then choosing a normal distribution with mean and variance selected uniformly at random from the respective intervals [1, W] and [1, 4]; this distribution was then truncated and discretized over the interval [1, W]. Utilization targets for each task were chosen according to u = (u'_1, u'_2)/(u'_1 + u'_2), where u'_1 and u'_2 were integers selected uniformly at random between [1, 64]. We used a discount factor of γ = 0.95 in our tests.

We conducted experiments by initializing the model in the state x = (0, 0). The controller simulated a single trajectory over 20,000 decision epochs in each problem instance with each exploration strategy. In order to avoid enumerating arbitrarily large numbers of states, we reinitialized the state whenever a state with cost greater than 50 was encountered.
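The instance-generation procedure above can be sketched as follows (hypothetical helper; the parameter ranges follow the experimental setup, and the truncated, discretized normal is formed by normalizing Gaussian weights over {1, ..., W}):

```python
import math, random

def random_instance(rng):
    """Generate one random two-task instance as described in the text."""
    tasks = []
    for _ in range(2):
        W = rng.randint(8, 32)        # WCET ~ U[8, 32]
        mu = rng.uniform(1, W)        # mean ~ U[1, W]
        var = rng.uniform(1, 4)       # variance ~ U[1, 4]
        # Truncate and discretize the normal over {1, ..., W}.
        weights = [math.exp(-(t - mu)**2 / (2 * var)) for t in range(1, W + 1)]
        z = sum(weights)
        tasks.append({t: w / z for t, w in zip(range(1, W + 1), weights)})
    u1, u2 = rng.randint(1, 64), rng.randint(1, 64)
    target = (u1 / (u1 + u2), u2 / (u1 + u2))
    return tasks, target

tasks, target = random_instance(random.Random(0))
assert all(abs(sum(d.values()) - 1) < 1e-9 for d in tasks)
assert abs(sum(target) - 1) < 1e-9
```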
These high cost states were treated as absorbing states in the approximate model to avoid degenerate policies that exploit the reset. We report the number of mistakes: the number of times the exploration strategy chooses a suboptimal action i that has value V(x) - Q(x, i) > 10^-6. The results of these experiments are shown in Figure 2.

3.3 Evaluation Results

In Figure 2, we report 90% confidence intervals on the mean number of mistakes each exploration strategy makes, averaged across the problem instances described above. Note that these plots have different scales due to the variation in mistake rates among exploration strategies.

Figure 2(a) compares the performance of interval-based optimistic action selection to that of Exploit, the policy that greedily follows the optimal policy of the approximate model at each decision epoch. All of the interval-based exploration settings we considered exhibited statistically similar performance. Interestingly, the exploitative strategy yields better performance than the explorative strategies despite its lack of an explicit exploration mechanism. This observation holds true for ε-greedy exploration and balanced wandering as well.

Figure 2(b) illustrates the performance of ε-greedy exploration. Notice that the mistake rate decreases along with the likelihood of taking exploratory actions, that is, as ε_0 approaches zero. Explicit exploration may not improve performance in this domain. This is further supported by our results for balanced wandering. The theory behind balanced wandering is that making a few initial mistakes early on will pay off in the long run due to more uniformly accurate models. Figure 2(c) shows that this is not the case in our scheduling domain, as

a purely exploitative strategy (m = 0) outperforms each balanced wandering approach. These results suggest that the exploitative strategy may be the best available exploration method in our task scheduling problem domain. One plausible explanation is that the environment itself enforces rational exploration: if some task is never dispatched, the system will enter progressively more costly states as that task becomes more and more underused. Thus, eventually the estimated benefit of running that task will be substantial enough that the exploitative strategy must use it. It is interesting to note that all of the explorative policies considered have quite low mistake rates despite the tight threshold of 10^-6 used to distinguish suboptimal actions.

4 Conclusions

In this paper we have considered the problem of learning near-optimal schedules when the system model is not fully known in advance. We have presented analytical results that bound the number of suboptimal actions taken prior to arriving at a near-optimal policy with high certainty. Interestingly, the transition system's portability results in bounds that are similar to those for estimating the underlying model in a single state. This naturally leads to a comparison to the multi-armed bandit model (see, for example, Even-Dar et al. (2002)), in which there is a single state with several available actions. Each action causes the emission of a reward according to a corresponding unknown, stationary random process. However, a bandit model does not appear to apply directly because while the duration distributions are stationary processes that are invariant between states, the payoff associated with each action is state-dependent.

We have focused on the PAC model of learning rather than deriving bounds on regret, the loss in value incurred due to suboptimal behavior during learning (Auer et al., 2009). Regret bounds may translate more readily into guarantees about transient real-time performance effects during learning, as guarantees regarding cost (and hence value) translate into guarantees about task timeliness.
We have presented empirical results which suggest that a learner that always exploits its current information outperforms agents that explicitly encourage exploration in this domain. This occurs because any policy that consistently ignores some action will get progressively farther from the utilization target, resulting in arbitrarily large costs. Thus the domain itself appears to enforce an appropriate level of exploration, perhaps obviating the need for an explicit exploration mechanism. It is an open question whether a more general class of MDPs that exhibit this behavior can be identified.

Acknowledgements

This research has been supported in part by NSF grants CNS (Cybertrust) and CCF (CAREER).

Appendix: Proof of Lemma 1

Lemma 1 states that the error in approximating Q is bounded, ||Q_m - Q||_inf ≤ 2Wβ/(1 - γ)^2, when the transition model estimation error is bounded by β (cf. Equation 9), where W is the maximum worst-case execution time among all tasks. We introduce lemmas prior to demonstrating this result. The first provides a bound on the expected successor state value of a function with a Lipschitz-like speed limit on its growth. Subsequent lemmas establish that both costs and values exhibit this property.

Lemma 2. Suppose p and p̂ are distributions over {1, ..., W} that satisfy Σ_{t=1}^W |p(t) - p̂(t)| ≤ β, and that for any i, the function f : X → R satisfies |f(x + t e_i) - f(x)| ≤ λt for some λ ≥ 0. Then

| Σ_{t=1}^W [p(t) - p̂(t)] f(x + t e_i) | ≤ λWβ.

Proof. Since we can decompose f(x + t e_i) into an f(x) term and a λt term, we have

| Σ_t [p(t) - p̂(t)] f(x + t e_i) | ≤ | Σ_t [p(t) - p̂(t)] f(x) | + λ Σ_t t |p(t) - p̂(t)|.

Since f(x) does not depend on t, the first term on the right-hand side vanishes. Since t ≤ W, we have λ Σ_t t |p(t) - p̂(t)| ≤ λWβ.

We now show that the cost function C and the optimal value V satisfy the conditions of Lemma 2.

Lemma 3. For any state x, task T_i, and duration t,

|C(x) - C(x + t e_i)| ≤ C(t e_i).    (13)

Proof. Since C(x) is the L1-norm between x and τ(x)u (cf. Equation 4), we can use the triangle inequality and scalability to derive the upper bound C(x + t e_i) ≤ C(x) + C(t e_i). We can also use the triangle inequality to obtain the lower bound, since C(x) ≤ C(x + t e_i) + C(t e_i); rearranging the terms yields the intended result.

It is straightforward to show that C(t e_i) < 2t for any task T_i. We make use of this fact and Lemma 3 to derive a related limit on the growth of V.

Lemma 4. For any state x, task T_i, and duration t, |V(x + t e_i) - V(x)| ≤ 2t/(1 - γ).

Proof. Let y = x + t e_i. We can bound the difference in values at x and y in terms of the difference in Q-values, since

|V(y) - V(x)| ≤ max_j |Q(y, j) - Q(x, j)|.    (14)

By expanding Q according to Equation 8 and rearranging terms,

|Q(y, j) - Q(x, j)|
= | Σ_s P(s|j)(γV(y + s e_j) - γV(x + s e_j) - C(y + s e_j) + C(x + s e_j)) |
≤ γ Σ_s P(s|j) |V(y + s e_j) - V(x + s e_j)| + Σ_s P(s|j) |C(y + s e_j) - C(x + s e_j)|
≤ 2t + γ Σ_s P(s|j) |V(y + s e_j) - V(x + s e_j)|.

Recurring this argument on the absolute value in the right-hand side results in accumulating a residual γ^k C(t e_i) ≤ γ^k 2t for the k-th repetition. Therefore,

|V(x + t e_i) - V(x)| ≤ Σ_{k=0}^inf γ^k 2t = 2t/(1 - γ).

We are ready now to prove Lemma 1.

Proof of Lemma 1. We begin bounding |Q(x, i) - Q_m(x, i)| by expanding according to Equation 8, rearranging terms to group costs and values, then decomposing the sum by using the superadditivity of the absolute value:

|Q(x, i) - Q_m(x, i)| ≤ γ | Σ_t P(t|i)V(x + t e_i) - Σ_t P_m(t|i)V_m(x + t e_i) | + | Σ_t [P(t|i) - P_m(t|i)] C(x + t e_i) |.    (15)

Applying Lemmas 2 and 3, we have

| Σ_t [P(t|i) - P_m(t|i)] C(x + t e_i) | ≤ 2Wβ.

We can apply the triangle inequality to obtain

| Σ_t P(t|i)V(x + t e_i) - Σ_t P_m(t|i)V_m(x + t e_i) | ≤ | Σ_t [P(t|i) - P_m(t|i)] V(x + t e_i) | + Σ_t P_m(t|i) |V(x + t e_i) - V_m(x + t e_i)|.

Using Lemmas 2 and 4 yields

| Σ_t [P(t|i) - P_m(t|i)] V(x + t e_i) | ≤ 2Wβ/(1 - γ).

Substituting back into Equation 15 allows us to write

|Q(x, i) - Q_m(x, i)| ≤ 2Wβ + γ 2Wβ/(1 - γ) + γ Σ_t P_m(t|i) |V(x + t e_i) - V_m(x + t e_i)|.
Finally, we can use Equation 14 to express |V(x + t e_i) - V_m(x + t e_i)| in terms of Q, then recur this argument to produce the stated bound,

|Q(x, i) - Q_m(x, i)| ≤ Σ_{k=0}^inf γ^k 2Wβ/(1 - γ) = 2Wβ/(1 - γ)^2.

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 2002.

P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21, pages 89-96, 2009.

R. I. Brafman and M. Tennenholtz. R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 2003.

E. Brunskill, B. R. Leffler, L. Li, M. L. Littman, and N. Roy. Provably efficient learning with typed parametric models. Journal of Machine Learning Research, 10, 2009.

E. Even-Dar and Y. Mansour. Convergence of optimistic and incremental Q-learning. In Advances in Neural Information Processing Systems, volume 13, 2001.

E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002.

R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2), 2003.

R. Glaubius, T. Tidwell, W. D. Smart, and C. Gill. Scheduling design and verification for open soft real-time systems. In RTSS '08: Proceedings of the 2008 Real-Time Systems Symposium, 2008.

R. Glaubius, T. Tidwell, C. Gill, and W. D. Smart. Scheduling policy design for autonomic systems. International Journal on Autonomous and Adaptive Communications Systems, 2(3), 2009.

L. P. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.

S. M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK, 2003.

S. M. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003.

M. J. Kearns and S. P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3), 2002.

B. R. Leffler, M. L. Littman, and T. Edmunds. Efficient reinforcement learning with relocatable action models. In AAAI '07: Proceedings of the 22nd National Conference on Artificial Intelligence, 2007.

S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5, 2004.

V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, 2008.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.

S. P. Singh and R. C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3), 1994.

A. Srinivasan and J. H. Anderson. Efficient scheduling of soft real-time applications on multiprocessors. Journal of Embedded Computing, 1(2), 2005.

A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), 2008.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, 2008.

L. G. Valiant. A theory of the learnable. In STOC '84: Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984.

More information

Present Value Methodology

Present Value Methodology Presen Value Mehodology Econ 422 Invesmen, Capial & Finance Universiy of Washingon Eric Zivo Las updaed: April 11, 2010 Presen Value Concep Wealh in Fisher Model: W = Y 0 + Y 1 /(1+r) The consumer/producer

More information

Capacitors and inductors

Capacitors and inductors Capaciors and inducors We coninue wih our analysis of linear circuis by inroducing wo new passive and linear elemens: he capacior and he inducor. All he mehods developed so far for he analysis of linear

More information

Table of contents Chapter 1 Interest rates and factors Chapter 2 Level annuities Chapter 3 Varying annuities

Table of contents Chapter 1 Interest rates and factors Chapter 2 Level annuities Chapter 3 Varying annuities Table of conens Chaper 1 Ineres raes and facors 1 1.1 Ineres 2 1.2 Simple ineres 4 1.3 Compound ineres 6 1.4 Accumulaed value 10 1.5 Presen value 11 1.6 Rae of discoun 13 1.7 Consan force of ineres 17

More information

Understanding the Profit and Loss Distribution of Trading Algorithms

Understanding the Profit and Loss Distribution of Trading Algorithms Undersanding he Profi and Loss Disribuion of Trading Algorihms Rober Kissell Vice Presiden, JPMorgan Rober.Kissell@JPMChase.com Robero Malamu, PhD Vice Presiden, JPMorgan Robero.Malamu@JPMChase.com February

More information