Efficient Reinforcement Learning in Factored MDPs




Michael Kearns, AT&T Labs, mkearns@research.att.com
Daphne Koller, Stanford University, koller@cs.stanford.edu

Abstract

We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN). Our algorithm generalizes the recent E^3 algorithm of Kearns and Singh, and assumes that we are given both an algorithm for approximate planning and the graphical structure (but not the parameters) of the DBN. Unlike the original E^3 algorithm, our new algorithm exploits the DBN structure to achieve a running time that scales polynomially in the number of parameters of the DBN, which may be exponentially smaller than the number of global states.

1 Introduction

Kearns and Singh (1998) recently presented a new algorithm for reinforcement learning in Markov decision processes (MDPs). Their E^3 algorithm (for Explicit Explore or Exploit) achieves near-optimal performance in a running time and a number of actions which are polynomial in the number of states and a parameter T, which is the horizon time in the case of discounted return, and the mixing time of the optimal policy in the case of infinite-horizon average return. The E^3 algorithm makes no assumptions on the structure of the unknown MDP, and the resulting polynomial dependence on the number of states makes E^3 impractical in the case of very large MDPs. In particular, it cannot be easily applied to MDPs in which the transition probabilities are represented in the factored form of a dynamic Bayesian network (DBN). MDPs with very large state spaces, and such DBN-MDPs in particular, are becoming increasingly important as reinforcement learning methods are applied to problems of growing difficulty [Boutilier et al., 1999].

In this paper, we extend the E^3 algorithm to the case of DBN-MDPs. The original E^3 algorithm relies on the ability to find optimal strategies in a given MDP, that is, to perform planning. This ability is readily provided by algorithms such as value iteration in the case of small state spaces. While the general planning problem is intractable in large MDPs, significant progress has been made recently on approximate solution algorithms, both for DBN-MDPs in particular [Boutilier et al., 1999] and for large state spaces in general [Kearns et al., 1999; Koller and Parr, 1999]. Our new DBN-E^3 algorithm therefore assumes the existence of a procedure for finding approximately optimal policies in any given DBN-MDP. Our algorithm also assumes that the qualitative structure of the transition model is known, i.e., the underlying graphical structure of the DBN. This assumption is often reasonable, as the qualitative properties of a domain are often understood.

Using the planning procedure as a subroutine, DBN-E^3 explores the state space, learning the parameters it considers relevant. It achieves near-optimal performance in a running time and a number of actions that are polynomial in T and the number of parameters in the DBN-MDP, which in general is exponentially smaller than the number of global states. We further examine conditions under which the mixing time T of a policy in a DBN-MDP is polynomial in the number of parameters of the DBN-MDP. The anytime nature of DBN-E^3 allows it to compete with such policies in total running time that is bounded by a polynomial in the number of parameters.

2 Preliminaries

We begin by introducing some of the basic concepts of MDPs and factored MDPs. A Markov decision process (MDP) is defined as a tuple $(S, A, R, P)$ where: $S$ is a set of states; $A$ is a set of actions; $R$ is a reward function $R : S \to [0, R_{\max}]$, such that $R(s)$ represents the reward obtained by the agent in state $s$ (a reward function is sometimes associated with (state, action) pairs rather than with states; our assumption that the reward depends only on the state is made purely to simplify the presentation, and has no effect on our results); and $P$ is a transition model, where $P(s' \mid s, a)$ represents the probability of landing in state $s'$ if the agent takes action $a$ in state $s$.

Most simply, MDPs are described explicitly, by writing down a set of transition matrices and reward vectors, one for each action $a$. However, this approach is impractical for describing complex processes. Here, the set of states is typically described via a set of random variables $\mathbf{X} = \{X_1, \ldots, X_n\}$, where each $X_i$ takes on values in some finite domain $Val(X_i)$. In general, for a set of variables $\mathbf{Y} \subseteq \mathbf{X}$, an instantiation $\mathbf{y}$ assigns a value $x \in Val(X)$ for every $X \in \mathbf{Y}$; we use $Val(\mathbf{Y})$ to denote the set of possible instantiations to $\mathbf{Y}$. A state in this MDP is an assignment $\mathbf{x} \in Val(\mathbf{X})$; the total number of states is therefore exponentially large in the number of variables. Thus, it is impractical to represent the transition model explicitly using transition matrices.
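As a concrete illustration of this gap in representation size, the following back-of-the-envelope sketch (in Python; the model sizes are illustrative assumptions, not numbers from the paper) counts global states versus CPT entries for a small factored model:

```python
# Global state count vs. DBN parameter count for a hypothetical model with
# n variables of v values each, at most two parents per node, and |A| actions.
n, v, max_parents, num_actions = 20, 3, 2, 4

num_states = v ** n                              # explicit representation
rows_per_cpt = v ** max_parents                  # one CPT row per parent setting
num_cpt_entries = num_actions * n * rows_per_cpt * v

print(num_states)       # 3486784401 global states
print(num_cpt_entries)  # 2160 CPT entries
```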

The framework of dynamic Bayesian networks (DBNs) allows us to describe a certain important class of such MDPs in a compact way. Processes whose state is described via a set of variables typically exhibit a weak form of decoupling: not all of the variables at time $t$ directly influence the transition of a variable $X_i$ from time $t$ to time $t+1$. For example, in a simple robotics domain, the location of the robot at time $t+1$ may depend on its position, velocity, and orientation at time $t$, but not on what it is carrying, or on the amount of paper in the printer. DBNs are designed to represent such processes compactly.

Let $a \in A$ be an action. We first want to specify the transition model $P(\mathbf{x}' \mid \mathbf{x}, a)$. Let $X_i$ denote the variable $X_i$ at the current time and $X_i'$ denote the variable at the next time step. The transition model for action $a$ will consist of two parts: an underlying transition graph associated with $a$, and parameters associated with that graph. The transition graph is a 2-layer directed acyclic graph whose nodes are $\{X_1, \ldots, X_n, X_1', \ldots, X_n'\}$. All edges in this graph are directed from nodes in $\{X_1, \ldots, X_n\}$ to nodes in $\{X_1', \ldots, X_n'\}$; note that we are assuming that there are no edges between variables within a time slice. We denote the parents of $X_i'$ in the graph by $Pa_a(X_i')$. Intuitively, the transition graph for $a$ specifies the qualitative nature of probabilistic dependences in a single time step, namely, the new setting of $X_i$ depends only on the current setting of the variables in $Pa_a(X_i')$. To make this dependence quantitative, each node $X_i'$ is associated with a conditional probability table (CPT) $P_a(X_i' \mid Pa_a(X_i'))$. The transition probability $P(\mathbf{x}' \mid \mathbf{x}, a)$ is then defined to be $\prod_i P_a(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the setting in $\mathbf{x}$ of the variables in $Pa_a(X_i')$.

We also need to provide a compact representation of the reward function. As in the transition model, explicitly specifying a reward for each of the exponentially many states is impractical. Again, we use the idea of factoring the representation of the reward function into a set of localized reward functions, each of which only depends on a small set of variables. In our robot example, our reward might be composed of several subrewards: for example, one associated with location (for getting too close to a wall), one associated with the printer status (for letting paper run out), and so on. More precisely, let $\mathcal{R}$ be a set of functions $R_1, \ldots, R_k$; each function $R_i$ is associated with a cluster of variables $C_i \subseteq \{X_1, \ldots, X_n\}$, such that $R_i$ is a function from $Val(C_i)$ to $\mathbb{R}$. Abusing notation, we will use $R_i(\mathbf{x})$ to denote the value that $R_i$ takes for the part of the state vector corresponding to $C_i$. The reward function associated with the DBN-MDP at a state $\mathbf{x}$ is then defined to be $R(\mathbf{x}) = \sum_{i=1}^{k} R_i(\mathbf{x}) \in [0, R_{\max}]$.

The following definitions for finite-length paths in MDPs will be of repeated technical use in the analysis. Let $M$ be a Markov decision process, and let $\pi$ be a policy in $M$. A $T$-path in $M$ is a sequence $p$ of $T+1$ states (that is, $T$ transitions) of $M$: $p = \mathbf{x}_1, \ldots, \mathbf{x}_T, \mathbf{x}_{T+1}$. The probability that $p$ is traversed in $M$ upon starting in state $\mathbf{x}_1$ and executing policy $\pi$ is denoted $P^\pi_M[p] = \prod_{k=1}^{T} P(\mathbf{x}_{k+1} \mid \mathbf{x}_k, \pi(\mathbf{x}_k))$.

There are three standard notions of the expected return enjoyed by a policy in an MDP: the asymptotic discounted return, the asymptotic average return, and the finite-time average return. Like the original E^3 algorithm, our new generalization will apply to all three cases, and to convey the main ideas it suffices for the most part to concentrate on the finite-time average return. This is because our finite-time average return result can be applied to the asymptotic returns through either the horizon time $1/(1-\gamma)$ for the discounted case, or the mixing time of the optimal policy in the average case. (We examine the properties of mixing times in a DBN-MDP in Section 5.)

Let $M$ be a Markov decision process, let $\pi$ be a policy in $M$, and let $p$ be a $T$-path in $M$. The average return along $p$ in $M$ is $U_M(p) = (1/T)(R(\mathbf{x}_1) + \cdots + R(\mathbf{x}_{T+1}))$. The $T$-step (expected) average return from state $\mathbf{x}$ is $U^\pi_M(\mathbf{x}, T) = \sum_p P^\pi_M[p]\, U_M(p)$, where the sum is over all $T$-paths $p$ in $M$ that start at $\mathbf{x}$. Furthermore, we define the optimal $T$-step average return from $\mathbf{x}$ in $M$ by $U^*_M(\mathbf{x}, T) = \max_\pi \{U^\pi_M(\mathbf{x}, T)\}$.

An important problem in MDPs is planning: finding the policy that achieves optimal return in a given MDP. In our case, we are interested in achieving the optimal $T$-step average return. The complexity of all exact MDP planning algorithms depends polynomially on the number of states; this property renders all of these algorithms impractical for DBN-MDPs, where the number of states grows exponentially in the size of the representation. However, there has been recent progress on algorithms for approximately solving MDPs with large state spaces [Kearns et al., 1999], particularly on ones represented in a factored way as a DBN-MDP [Boutilier et al., 1999; Koller and Parr, 1999]. The focus of our work is on the reinforcement learning task, so we simply assume that we have access to a black box that performs approximate planning for a DBN-MDP.

Definition 2.1: A $\beta$-approximation $T$-step planning algorithm for a DBN-MDP is one that, given a DBN-MDP $M$, produces a (compactly represented) policy $\pi$ such that $U^\pi_M(\mathbf{x}, T) \geq (1-\beta)\, U^*_M(\mathbf{x}, T)$.

We will charge our learning algorithm a single step of computation for each call to the assumed approximate planning algorithm. One way of thinking about our result is as a reduction of the problem of efficient learning in DBN-MDPs to the problem of efficient planning in DBN-MDPs.

Our goal is to perform model-based reinforcement learning. Thus, we wish to learn an approximate model from experience, and then exploit it (or explore it) by planning given the approximate model. In this paper, we focus on the problem of learning the model parameters (the CPTs), assuming that the model structure (the transition graphs) is given to us. It is therefore useful to consider the set of parameters that we wish to estimate. As we assumed that the rewards are deterministic, we can focus on the probabilistic parameters. (Our results easily extend to the case of stochastic rewards.) We define a transition component of the DBN-MDP to be a distribution $P_a(X_i' \mid \mathbf{u})$ for some action $a$ and some particular instantiation $\mathbf{u}$ to the parents $Pa_a(X_i')$ in the transition model. Note that the number of transition components is at most $\sum_{i,a} |Val(Pa_a(X_i'))|$, but may be much lower when a variable's behavior is identical for several actions.
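The following sketch makes the factored transition and reward definitions above concrete. The toy model, all names, and all numbers are illustrative assumptions, not the paper's; the two functions directly implement $P(\mathbf{x}' \mid \mathbf{x}, a) = \prod_i P_a(x_i' \mid \mathbf{u}_i)$ and $R(\mathbf{x}) = \sum_i R_i(\mathbf{x})$.

```python
from itertools import product

# Toy DBN-MDP over two binary variables. For each action, node i carries the
# parent list Pa_a(X_i') and one CPT row per parent instantiation.
dbn = {
    "a0": {
        0: {"parents": (0,), "cpt": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}},
        1: {"parents": (0, 1), "cpt": {(0, 0): [0.7, 0.3], (0, 1): [0.4, 0.6],
                                       (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}},
    },
}

def transition_prob(dbn, a, x, x_next):
    """P(x' | x, a) = prod_i P_a(x_i' | u_i), where u_i is the setting in x
    of the parents Pa_a(X_i')."""
    p = 1.0
    for i, node in dbn[a].items():
        u = tuple(x[j] for j in node["parents"])
        p *= node["cpt"][u][x_next[i]]
    return p

# Factored reward R(x) = sum_i R_i(x restricted to C_i): here one localized
# reward on variable 0 and one on the cluster (0, 1).
rewards = [((0,), {(0,): 0.0, (1,): 1.0}),
           ((0, 1), {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 0.5})]

def factored_reward(rewards, x):
    return sum(table[tuple(x[j] for j in cluster)] for cluster, table in rewards)

# Sanity check: transition probabilities out of each state sum to one.
for x in product([0, 1], repeat=2):
    assert abs(sum(transition_prob(dbn, "a0", x, y)
                   for y in product([0, 1], repeat=2)) - 1.0) < 1e-9
```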

3 Overview of the Original E^3

Since our algorithm for learning in DBN-MDPs will be a direct generalization of the E^3 algorithm of Kearns and Singh (hereafter abbreviated KS), we begin with an overview of that algorithm and its analysis. It is important to bear in mind that the original algorithm is designed only for the case where the total number of states $N$ is small, and the algorithm runs in time polynomial in $N$.

E^3 is what is commonly referred to as an indirect or model-based algorithm: rather than maintaining only a current policy or value function, the algorithm maintains a model for the transition probabilities and the rewards for some subset of the states of the unknown MDP $M$. Although the algorithm maintains a partial model of $M$, it may choose to never build a complete model of $M$, if doing so is not necessary to achieve high return.

The algorithm starts off by doing balanced wandering: the algorithm, upon arriving in a state, takes the action it has tried the fewest times from that state (breaking ties randomly). At each state it visits, the algorithm maintains the obvious statistics: the reward received at that state, and for each action, the empirical distribution of next states reached (that is, the estimated transition probabilities).

A crucial notion is that of a known state: a state that the algorithm has visited so many times that the transition probabilities for that state are very close to their true values in $M$. This definition is carefully balanced so that "so many times" is still polynomially bounded, yet "very close" suffices to meet the simulation requirements below. An important observation is that we cannot do balanced wandering indefinitely before at least one state becomes known: by the Pigeonhole Principle, we will soon start to accumulate accurate statistics at some state.

The most important construction of the analysis is the known-state MDP. If $S$ is the set of currently known states, the known-state MDP is simply an MDP $M_S$ that is naturally induced on $S$ by the full MDP $M$. Briefly, all transitions in $M$ between states in $S$ are preserved in $M_S$, while all other transitions in $M$ are redirected in $M_S$ to lead to a single new, absorbing state that intuitively represents all of the unknown and unvisited states. Although E^3 does not have direct access to $M_S$, by virtue of the definition of the known states, it does have a good approximation $\hat{M}_S$.

The KS analysis hinges on two central technical lemmas. The first is called the Simulation Lemma, and it establishes that $\hat{M}_S$ has good simulation accuracy: that is, the expected $T$-step return of any policy in $\hat{M}_S$ is close to its expected $T$-step return in $M_S$. Thus, at any time, $\hat{M}_S$ is a useful partial model of $M$, for that part of $M$ that the algorithm knows very well. The second central technical lemma is the Explore or Exploit Lemma. It states that either the optimal ($T$-step) policy in $M$ achieves its high return by staying (with high probability) in the set $S$ of currently known states, or the optimal policy has significant probability of leaving $S$ within $T$ steps. Most importantly, the algorithm can detect which of these two is the case: in the first case, it can simulate the behavior of the optimal policy by finding a high-return exploitation policy in the partial model $\hat{M}_S$, and in the second case, it can replicate the behavior of the optimal policy by finding an exploration policy that quickly reaches the additional absorbing state of the partial model $\hat{M}_S$. Thus, by performing two off-line planning computations on $\hat{M}_S$, the algorithm is guaranteed to find either a way to get near-optimal return for the next $T$ steps, or a way to improve the statistics at an unknown or unvisited state within the next $T$ steps. KS show that this algorithm ensures near-optimal return in time polynomial in $N$.

4 The DBN-E^3 Algorithm

Our goal is to derive a generalization of E^3 for DBN-MDPs, and to prove for it a result analogous to that of KS, but with a polynomial dependence not on the number of states $N$, but on the number of CPT parameters $\ell$ in the DBN model. Our analysis closely mirrors the original, but requires a significant generalization of the Simulation Lemma that exploits the structure of a DBN-MDP, a modified construction of $\hat{M}_S$ that can be represented as a DBN-MDP, and a number of alterations of the details.

Like the original E^3 algorithm, DBN-E^3 will build a model of the unknown DBN-MDP on the basis of its experience, but now the model will be represented in a compact, factored form. More precisely, suppose that our algorithm is in state $\mathbf{x}$, executes action $a$, and arrives in state $\mathbf{x}'$. This experience will be used to update all the appropriate CPT entries of our model, namely, all the estimates $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ are updated in the obvious way, where as usual $\mathbf{u}_i$ is the setting of $Pa_a(X_i')$ in $\mathbf{x}$. We will also maintain counts $C_a(x_i', \mathbf{u}_i)$ of the number of times $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ has been updated.

Recall that a crucial element of the original E^3 analysis was the notion of a known state. In the original analysis, it was observed that if $N$ is the total number of states, then after $O(N)$ experiences some state must become known by the Pigeonhole Principle. We cannot hope to use the same logic here, as we are now in a DBN-MDP with an exponentially large number of states. Rather, we must pigeonhole not on the number of states, but on the number of parameters required to specify the DBN-MDP. Towards this goal, we will say that the CPT entry $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ is known if it has been visited enough times to ensure that, with high probability, $|P_a(x_i' \mid \mathbf{u}_i) - \hat{P}_a(x_i' \mid \mathbf{u}_i)| \leq \alpha$.

We now would like to establish that if, for an appropriate choice of $\alpha$, all CPT entries are known, then our approximate DBN-MDP can be used to accurately estimate the expected return of any policy in the true DBN-MDP. This is the desired generalization of the original Simulation Lemma. As in the original analysis, we will eventually apply it to a generalization of the induced MDP $M_S$, in which we deliberately restrict attention to only the known CPT entries.
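A minimal sketch of the bookkeeping just described, under assumed interfaces (none of this code is from the paper): empirical CPT estimates $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ and counts $C_a(x_i', \mathbf{u}_i)$ for one action, with a placeholder visit threshold standing in for the "known" bound derived below.

```python
from collections import defaultdict

class CPTEstimates:
    """Per-action empirical CPT estimates with visit counts. The threshold
    is a stand-in for the O((1/alpha^2) log(1/delta)) bound; the constants
    hidden by the O(.) are not pinned down here."""

    def __init__(self, parents_by_node, known_threshold):
        self.parents = parents_by_node   # node index i -> tuple of parent indices
        self.threshold = known_threshold
        self.counts = defaultdict(lambda: defaultdict(int))  # (i, u) -> {x_i': n}

    def record(self, x, x_next):
        """Update every per-variable count for an observed step x -> x_next."""
        for i, parents in self.parents.items():
            u = tuple(x[j] for j in parents)
            self.counts[(i, u)][x_next[i]] += 1

    def estimate(self, i, u, value):
        """Empirical estimate of P_a(X_i' = value | u), or None if unvisited."""
        row = self.counts[(i, u)]
        total = sum(row.values())
        return row[value] / total if total else None

    def component_known(self, i, u):
        """A transition component P_a(X_i' | u) is known once its visit count
        exceeds the threshold; checkable locally, unlike a known *state*."""
        return sum(self.counts[(i, u)].values()) >= self.threshold
```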

4.1 The DBN-MDP Simulation Lemma

Let $M$ and $\hat{M}$ be two DBN-MDPs over the same state space, with the same transition graphs for every action $a$, and with the same reward functions. Then we say that $\hat{M}$ is an $\alpha$-approximation of $M$ if for every action $a$ and node $X_i'$ in the transition graphs, for every setting $\mathbf{u}$ of $Pa_a(X_i')$, and for every possible value $x_i'$ of $X_i'$, $|P_a(x_i' \mid \mathbf{u}) - \hat{P}_a(x_i' \mid \mathbf{u})| \leq \alpha$, where $P_a(\cdot \mid \cdot)$ and $\hat{P}_a(\cdot \mid \cdot)$ are the CPTs of $M$ and $\hat{M}$, respectively.

Lemma 4.1: Let $M$ be any DBN-MDP over $n$ state variables with $\ell$ CPT entries in the transition model, and let $\hat{M}$ be an $\alpha$-approximation of $M$, where $\alpha = O((\epsilon/(T^2 \ell R_{\max}))^2)$. Then for any policy $\pi$, and for any state $\mathbf{x}$, $|U^\pi_M(\mathbf{x}, T) - U^\pi_{\hat{M}}(\mathbf{x}, T)| \leq \epsilon$.

Proof: (Sketch) Let us fix a policy $\pi$ and state $\mathbf{x}$. Recall that for any next state $\mathbf{x}'$ and any action $a$, the transition probability factorizes via the CPTs as $P(\mathbf{x}' \mid \mathbf{x}, a) = \prod_i P_a(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the setting of $Pa_a(X_i')$ in $\mathbf{x}$. Let us say that $P(\mathbf{x}' \mid \mathbf{x}, a)$ contains a $\tau$-small factor if any of its CPT factors $P_a(x_i' \mid \mathbf{u}_i)$ is smaller than $\tau$. Note that a transition probability may actually be quite small itself (exponentially small in $n$) without necessarily containing a $\tau$-small factor.

Our first goal is to show that trajectories in $M$ and $\hat{M}$ that cross transitions containing a $\tau$-small CPT factor can be thrown away without much error. Consider a random trajectory of $T$ steps in $M$ from state $\mathbf{x}$ following policy $\pi$. It can be shown that the probability that such a trajectory will cross at least one transition $P(\mathbf{x}' \mid \mathbf{x}, a)$ that contains a $\tau$-small factor is at most $T \ell \tau$. Essentially, the probability that at any step, any particular $\tau$-small transition (CPT factor) will be taken by any particular variable $X_i$ is at most $\tau$; a simple union argument over the CPT entries and the $T$ time steps gives the desired bound. Therefore, the total contribution to the difference $|U^\pi_M(\mathbf{x}, T) - U^\pi_{\hat{M}}(\mathbf{x}, T)|$ by these trajectories can be shown to be at most $T^2 R_{\max} \ell (\tau + \alpha)$. We will thus ignore such trajectories for now.

The key advantage of eliminating $\tau$-small factors is that we can convert additive approximation guarantees into multiplicative ones. Let $p$ be any path of length $T$. If all the relevant CPT factors are greater than $\tau$, and we let $\Delta = \alpha/\tau$, it can be shown that

$$(1-\Delta)^{Tn}\, P^\pi_M[p] \leq P^\pi_{\hat{M}}[p] \leq (1+\Delta)^{Tn}\, P^\pi_M[p].$$

In other words, ignoring $\tau$-small CPT factors, the distributions on paths induced by $\pi$ in $M$ and $\hat{M}$ are quite similar. From this it follows that, for the upper bound, $U^\pi_{\hat{M}}(\mathbf{x}, T) \leq (1+\Delta)^{Tn}\, U^\pi_M(\mathbf{x}, T) + T^2 R_{\max} \ell (\tau + 2\alpha)$; the lower bound argument is entirely symmetric. For the choices $\tau = \sqrt{\alpha}$ and $\alpha = O((\epsilon/(T^2 \ell R_{\max}))^2)$, the lemma is obtained.

Returning to the main development, we can now give a precise definition of a known CPT entry. It is a simple application of Chernoff bounds to show that provided the count $C_a(x_i', \mathbf{u}_i)$ exceeds $O((1/\alpha^2) \log(1/\delta))$, $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ has additive error at most $\alpha$ with probability at least $1 - \delta$. We thus say that this CPT entry is known if its count exceeds the given bound for the choice $\alpha = O((\epsilon/(T^2 n v R_{\max}))^2)$ specified by the DBN-MDP Simulation Lemma.

The DBN-MDP Simulation Lemma shows that if all CPT entries are known, then our approximate model $\hat{M}$ can be used to find a near-optimal policy in the true DBN-MDP $M$. Note that we can identify which CPT entries are known via the counts $C_a(x_i', \mathbf{u}_i)$. Thus, if we are at a state $\mathbf{x}$ for which at least one of the associated CPT entries $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ is unknown, by taking action $a$ we then obtain an experience that will increase the corresponding count $C_a(x_i', \mathbf{u}_i)$. Thus, in analogy with the original E^3, as long as we are encountering unknown CPT entries, we can continue taking actions that increase the quality of our model; but now, rather than increasing counts on a per-state basis, the DBN-MDP Simulation Lemma shows why it suffices to increase the counts on a per-CPT-entry basis, which is crucial for obtaining the running time we desire. We can thus show that if we encounter unknown CPT entries for a number of steps that is polynomial in the total number $\ell$ of CPT entries and $1/\epsilon$, there can no longer be any unknown CPT entries, and we know the true DBN-MDP well enough to solve for a near-optimal policy.

However, similar to the original algorithm, the real difficulty arises when we are in a state with no unknown CPT entries, yet there do remain unknown CPT entries elsewhere. Then we have no guarantee that we can improve our model at the next step. In the original algorithm, this was solved by defining the known-state MDP $M_S$ and proving the aforementioned Explore or Exploit Lemma. Duplicating this step for DBN-MDPs will require another new idea.

4.2 The DBN-MDP Explore or Exploit Lemma

In our context, when we construct a known-state MDP, we must satisfy the additional requirement that the known-state MDP preserve the DBN structure of the original problem, so that if we have a planning algorithm for DBN-MDPs that exploits the structure, we can then apply it to the known-state MDP. (Certain approaches to approximate planning in large MDPs do not require any structural assumptions [Kearns et al., 1999], but we anticipate that the most effective DBN-MDP planning algorithms eventually will.) Therefore, we cannot just introduce a new sink state to represent that part of $M$ that is unknown to us; we must also show how this sink state can be represented as a setting of the state variables of a DBN-MDP.

We present a new construction, which extends the idea of known states to the idea of known transitions. We say that a transition component $P_a(X_i' \mid \mathbf{u})$ is known if all of its CPT entries are known. The basic idea is that, while it is impossible to check locally whether a state is known, it is easy to check locally whether a transition component is known. Let $\mathcal{T}$ be the set of known transition components.
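In code form, the accuracy $\alpha$ and the visit-count bound from the Chernoff argument above look as follows. The constants hidden by the $O(\cdot)$ notation are not specified in the paper, so they appear here as an explicitly assumed parameter c:

```python
import math

def required_alpha(eps, T, n, v, r_max, c=1.0):
    """Per-entry accuracy alpha = O((eps / (T^2 n v R_max))^2) specified by
    the Simulation Lemma; c is the unspecified constant."""
    return c * (eps / (T ** 2 * n * v * r_max)) ** 2

def known_count_threshold(alpha, delta, c=1.0):
    """Chernoff-style visit count O((1/alpha^2) log(1/delta)) after which an
    empirical CPT entry is alpha-accurate with probability at least 1 - delta."""
    return math.ceil(c * math.log(1.0 / delta) / alpha ** 2)
```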

We define the known-transition DBN-MDP $M_{\mathcal{T}}$ as follows. The model behaves identically to $M$ as long as only known transitions are taken. As soon as an unknown transition is taken for some variable $X_i'$, the variable $X_i'$ takes on a new wandering value $w$, which we introduce into the model. The transition model is defined so that, once a variable takes on the value $w$, its value never changes. The reward function is defined so that, once at least one variable takes on the wandering value, the total reward is nonpositive. These two properties give us the same overall behavior that KS got by making a sink state for the set of unknown states.

Definition 4.2: Let $M$ be a DBN-MDP and let $\mathcal{T}$ be any subset of the transition components in the model. The induced DBN-MDP on $\mathcal{T}$, denoted $M_{\mathcal{T}}$, is defined as follows:

- $M_{\mathcal{T}}$ has the same set of state variables as $M$; however, in $M_{\mathcal{T}}$, each variable $X_i$ has, in addition to its original set of values $Val_M(X_i)$, a new value $w$.
- $M_{\mathcal{T}}$ has the same transition graphs as $M$. For each $a$, $i$, and $\mathbf{u} \in Val_M(Pa_a(X_i'))$, we have that $P^{M_{\mathcal{T}}}_a(X_i' \mid \mathbf{u}) = P^M_a(X_i' \mid \mathbf{u})$ if the corresponding transition component is in $\mathcal{T}$; in all other cases, $P^{M_{\mathcal{T}}}_a(w \mid \mathbf{u}) = 1$ and $P^{M_{\mathcal{T}}}_a(x \mid \mathbf{u}) = 0$ for all $x \in Val_M(X_i)$.
- $M_{\mathcal{T}}$ has the same set $\mathcal{R}$ as $M$. For each $i = 1, \ldots, k$ and $\mathbf{c} \in Val_M(C_i)$, we have that $R^{M_{\mathcal{T}}}_i(\mathbf{c}) = R^M_i(\mathbf{c})$. For other vectors $\mathbf{c}$, we have that $R^{M_{\mathcal{T}}}_i(\mathbf{c}) = -R_{\max}$.

With this definition, we can prove the analogue to the Explore or Exploit Lemma (details omitted).

Lemma 4.3: Let $M$ be any DBN-MDP, let $\mathcal{T}$ be any subset of the transition components of $M$, and let $M_{\mathcal{T}}$ be the induced MDP on $\mathcal{T}$. For any state $\mathbf{x}$, any $T$, and any $1 > \epsilon > 0$, either there exists a policy $\pi$ in $M_{\mathcal{T}}$ such that $U^\pi_{M_{\mathcal{T}}}(\mathbf{x}, T) \geq U^*_M(\mathbf{x}, T) - \epsilon$, or there exists a policy $\pi$ in $M_{\mathcal{T}}$ such that the probability that a walk of $T$ steps following $\pi$ will take at least one transition not in $\mathcal{T}$ exceeds $\epsilon/((k+1) T R_{\max})$.

This lemma essentially asserts that either there exists a policy that already achieves near-optimal (global) return by staying only in the local model $M_{\mathcal{T}}$, or there exists a policy that quickly exits the local model.

4.3 Putting It All Together

We now have all the pieces to finish the description and analysis of the DBN-E^3 algorithm. The algorithm initially executes balanced wandering for some period of time. After some number of steps, by the Pigeonhole Principle, one or more transition components become known. When the algorithm reaches a known state $\mathbf{x}$ (one where all the transition components are known), it can no longer perform balanced wandering. At that point, the algorithm performs approximate off-line policy computations for two different DBN-MDPs. The first corresponds to attempted exploitation, and the second to attempted exploration.

Let $\mathcal{T}$ be the set of known transitions at this step. In the attempted exploitation computation, the DBN-E^3 algorithm would like to find the optimal policy on the induced DBN-MDP $M_{\mathcal{T}}$. Clearly, this DBN-MDP is not known to the algorithm. Thus, we use its approximation $\hat{M}_{\mathcal{T}}$, where the true transition probabilities are replaced with their current approximation in the model. The definition of $M_{\mathcal{T}}$ uses only the CPT entries of known transition components. The Simulation Lemma now tells us that, for an appropriate choice of $\Delta$ (a choice that will result in a definition of known transition that requires the corresponding count to be only polynomial in $1/\epsilon$, $n$, $v$, and $T$), the return of any policy in $\hat{M}_{\mathcal{T}}$ is within $\Delta$ of its return in $M_{\mathcal{T}}$. We will specify a choice for $\Delta$ later (which in turn sets the choice of $\alpha$ and the definition of known transition).

Let us now consider the two cases in the Explore or Exploit Lemma. In the exploitation case, there exists a policy $\pi$ in $M_{\mathcal{T}}$ such that $U^\pi_{M_{\mathcal{T}}}(\mathbf{x}, T) \geq U^*_M(\mathbf{x}, T) - \epsilon$. (Again, we will discuss the choice of $\epsilon$ below.) From the Simulation Lemma, we have that $U^\pi_{\hat{M}_{\mathcal{T}}}(\mathbf{x}, T) \geq U^*_M(\mathbf{x}, T) - (\epsilon + \Delta)$. Our approximate planning algorithm returns a policy $\pi'$ whose value in $\hat{M}_{\mathcal{T}}$ is guaranteed to be a multiplicative factor of at most $1 - \beta$ away from the optimal policy in $\hat{M}_{\mathcal{T}}$. Thus, we are guaranteed that $U^{\pi'}_{\hat{M}_{\mathcal{T}}}(\mathbf{x}, T) \geq (1-\beta)(U^*_M(\mathbf{x}, T) - (\epsilon + \Delta))$. Therefore, in the exploitation case, our approximate planner is guaranteed to return a policy whose value is close to the optimal value.

In the exploration case, there exists a policy in $M_{\mathcal{T}}$ (and therefore in $\hat{M}_{\mathcal{T}}$) that is guaranteed to take an unknown transition within $T$ steps with some minimum probability. Our goal now is to use our approximate planner to find such a policy. In order to do that, we need to use a slightly different construction $M'_{\mathcal{T}}$ (and its approximation $\hat{M}'_{\mathcal{T}}$). The transition structure of $M'_{\mathcal{T}}$ is identical to that of $M_{\mathcal{T}}$. However, the rewards are now different: for each $i = 1, \ldots, k$ and $\mathbf{c} \in Val_M(C_i)$, we have that $R^{M'_{\mathcal{T}}}_i(\mathbf{c}) = 0$; for other vectors $\mathbf{c}$, we have that $R^{M'_{\mathcal{T}}}_i(\mathbf{c}) = 1$. Now let $\pi'$ be the policy returned by our approximate planner on the DBN-MDP $\hat{M}'_{\mathcal{T}}$. It can be shown that the probability that a $T$-step walk following $\pi'$ will take at least one unknown transition is at least $(1-\beta)(\epsilon/((k+1) T R_{\max}) - \Delta)/(kT)$.

To summarize: our approximate planner either finds an exploitation policy in $\hat{M}_{\mathcal{T}}$ that enjoys actual return $U^{\pi'}_M(\mathbf{x}, T) \geq (1-\beta)(U^*_M(\mathbf{x}, T) - (\epsilon + \Delta))$ from our current state $\mathbf{x}$, or it finds an exploration policy in $\hat{M}'_{\mathcal{T}}$ that has probability at least $p = (1-\beta)(\epsilon/((k+1) T R_{\max}) - \Delta)/(kT)$ of improving our statistics at an unknown transition in the next $T$ steps. Appropriate choices for $\epsilon$ and $\Delta$ yield our main theorem, which we are now finally ready to describe.

Recall that for expository purposes we have concentrated on the case of $T$-step average return. However, as for the original E^3, our main result can be stated in terms of the asymptotic discounted and average return cases. We omit the details of this translation, but it is a simple matter of arguing that it suffices to set $T$ to be either $(1/(1-\gamma)) \log(1/\epsilon)$ (discounted) or the mixing time of the optimal policy (average).
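A sketch of one decision step of DBN-E^3 at a known state, tying the pieces together. The model and planner interfaces here (induced_known_mdp, exploration_mdp, value_of, target_return) are hypothetical stand-ins for the constructions above, and the acceptance test only loosely mirrors the exploitation case of the lemma:

```python
def dbn_e3_step(x, model, planner, eps, T):
    """One off-line decision at a known state x: plan in hat(M)_T for
    exploitation and in hat(M)'_T for exploration, then pick."""
    m_exploit = model.induced_known_mdp()   # hat(M)_T: unknown transitions -> w
    m_explore = model.exploration_mdp()     # hat(M)'_T: reward 1 on w values

    pi_exploit = planner(m_exploit, T)      # beta-approximate planning black box
    pi_explore = planner(m_explore, T)

    # Explore-or-Exploit: if the exploitation policy's simulated T-step return
    # is already near the (assumed known) optimal return, follow it; otherwise
    # the exploration policy reaches an unknown transition within T steps with
    # the guaranteed probability p, improving some transition-component count.
    if m_exploit.value_of(pi_exploit, x, T) >= model.target_return() - 2 * eps:
        return pi_exploit
    return pi_explore
```

As remarked after the main theorem below, the knowledge of the optimal return assumed by target_return can be eliminated via search techniques.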

Theorem 4.4: (Main Theorem) Let $M$ be a DBN-MDP with $\ell$ total entries in the CPTs.

(Undiscounted case) Let $T$ be the mixing time of the policy achieving the optimal average asymptotic return $U^*$ in $M$. There exists an algorithm DBN-E^3 that, given access to a $\beta$-approximation planning algorithm for DBN-MDPs, and given inputs $\epsilon$, $\delta$, $\ell$, $T$ and $U^*$, takes a number of actions and computation time bounded by a polynomial in $1/(1-\beta)$, $1/\epsilon$, $1/\delta$, $\ell$, $T$, and $R_{\max}$, and with probability at least $1 - \delta$, achieves total actual return exceeding $U^* - \epsilon$.

(Discounted case) Let $V^*$ denote the value function for the policy with the optimal expected discounted return in $M$. There exists an algorithm DBN-E^3 that, given access to a $\beta$-approximation planning algorithm for DBN-MDPs, and given inputs $\epsilon$, $\delta$, $\ell$ and $V^*$, takes a number of actions and computation time bounded by a polynomial in $1/(1-\beta)$, $1/\epsilon$, $1/\delta$, $\ell$, the horizon time $T = 1/(1-\gamma)$, and $R_{\max}$, and with probability at least $1 - \delta$, will halt in a state $\mathbf{x}$, and output a policy $\hat{\pi}$, such that $V^{\hat{\pi}}_M(\mathbf{x}) \geq V^*(\mathbf{x}) - \epsilon$.

Some remarks:

- The loss in policy quality induced by the approximate planning subroutine translates into degradation in the running time of our algorithm.
- As with the original E^3, we can eliminate knowledge of the optimal returns in both cases via search techniques.
- Although we have stated our asymptotic undiscounted average return result in terms of the mixing time of the optimal policy, we can instead give an anytime algorithm that competes against policies with longer and longer mixing times the longer it is run. (We omit details, but the analysis is analogous to the original E^3 analysis.) This extension is especially important in light of the results of the following section, where we examine properties of mixing times in DBN-MDPs.

5 Mixing Time Bounds for DBN-MDPs

As in the original E^3 paper, our average-return result depends on the amount of time $T$ that it takes the target policy to mix. This dependence is unavoidable: if some of the probabilities are very small, so that the optimal policy cannot easily reach the high-reward parts of the space, it is unrealistic to expect the reinforcement learning algorithm to do any better. In the context of a DBN-MDP, however, this dependence is more troubling. The size of the state space is exponentially large, and virtually all of the probabilities for transitioning from one state to the next will be exponentially small (because a transition probability is the product of $n$ numbers that are at most 1). Indeed, one can construct very reasonable DBN-MDPs that have an exponentially long mixing time. For example, a DBN representing the Markov chain of an Ising model [Jerrum and Sinclair, 1993] has small parent sets (at most four parents per node), and CPT entries that are reasonably large. Nevertheless, the mixing time of such a DBN can be exponentially large in $n$.

Given that even reasonable DBNs such as this can have exponential mixing times, one might think that this is the typical situation; that is, that most DBN-MDPs have an exponentially long mixing time, reintroducing the exponential dependence on $n$ that we have been trying so hard to avoid. We now show that this is not always the case. We provide a tool for analyzing the mixing time of a policy in a DBN-MDP, which can give us much better bounds on the mixing time. In particular, we demonstrate a class of DBN-MDPs and associated policies for which we can guarantee rapid mixing.

Note that any fixed policy in a DBN-MDP defines a Markov chain whose transition model is represented as a DBN. We therefore begin by considering the mixing time of a pure DBN, with no actions. We then extend that analysis to the mixing rate for a fixed policy in a DBN-MDP.

Definition 5.1: Let $Q$ be a transition model for a Markov chain, and let $\{X^{(t)}\}_{t=1}^{\infty}$ represent the state of the chain. Let $S = \{x_1, \ldots, x_s\}$. Let $\pi_j$ be the stationary probability of $x_j$ in this Markov chain. We say that the Markov chain $Q$ is $\epsilon$-mixed at time $m$ if $\max_{i,j} |P(X^{(m)} = x_j \mid X^{(1)} = x_i) - \pi_j| \leq \epsilon$.

Our bounds on mixing times make use of the coupling method [Lindvall, 1992]. The idea of the coupling method is as follows: we run two copies of the Markov chain in parallel, from different starting points. Our goal is to make the states of the two processes coalesce. Intuitively, the first time the states of the two copies are the same, the initial states have been forgotten, which corresponds to the processes having mixed. More precisely, consider a transition matrix $Q$ over some state space $S$. Let $\bar{Q}$ be a transition matrix over the state space $S \times S$, such that if $\{(Y^{(t)}, Z^{(t)})\}_{t=1}^{\infty}$ is the Markov chain for $\bar{Q}$, then the separated Markov chains $\{Y^{(t)}\}_{t=1}^{\infty}$ and $\{Z^{(t)}\}_{t=1}^{\infty}$ both evolve according to $Q$. Let $\kappa$ be the random variable that represents the coupling time: the smallest $m$ for which $Y^{(m)} = Z^{(m)}$. The following lemma establishes the correspondence between mixing and coupling times.

Lemma 5.2: For any $\epsilon$, let $m$ be such that for any $i, j = 1, \ldots, s$, $P(\kappa > m \mid Y^{(1)} = x_i, Z^{(1)} = x_j) \leq \epsilon$. Then $Q$ is $\epsilon$-mixed at time $m$.

Thus, to show that a Markov chain is $\epsilon$-mixed by some time $m$, we need only construct a coupled chain and show that the probability that this chain has not coupled by time $m$ decreases very rapidly in $m$. The coupling method allows us to construct the joint chain over $(Y^{(t)}, Z^{(t)})$ in any way that we want, as long as each of the two chains in isolation has the same dynamics as the original Markov chain $Q$. In particular, we can correlate the transitions of the two processes, so as to make their states coincide faster than they would if each was picked independently of the other. That is, we choose $Y^{(t+1)}$ and $Z^{(t+1)}$ to be equal to each other whenever possible, subject to the constraints on the transition probabilities. More precisely, let $Y^{(t)} = x_i$ and $Z^{(t)} = x_j$. For any value $x_k \in S$, we can make the event $Y^{(t+1)} = x_k, Z^{(t+1)} = x_k$ have a probability that is the smaller of $P(X' = x_k \mid X = x_i)$ and $P(X' = x_k \mid X = x_j)$. Compare this to the probability of this event if the two processes were independent, which is the product of these two numbers rather than their minimum. Overall, by correlating the two processes as much as possible, and considering the worst case over the current state of the process, we can guarantee that, at every step, the two processes couple with probability at least

$$\min_{i,j} \sum_k \min[P(X' = x_k \mid X = x_i),\; P(X' = x_k \mid X = x_j)].$$

This quantity represents the amount of probability mass that any two transition distributions are guaranteed to have in common. It is called the Dobrushin coefficient, and is the contraction rate for the $L_1$-norm [Dobrushin, 1956] in Markov chains.
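As a small concrete check of this quantity, the sketch below (illustrative, not from the paper) computes the guaranteed per-step coupling probability for an explicit transition matrix:

```python
def dobrushin_coefficient(q):
    """min over state pairs (i, j) of sum_k min[Q(k|i), Q(k|j)]: the shared
    probability mass of any two rows of the row-stochastic matrix q."""
    return min(
        sum(min(a, b) for a, b in zip(row_i, row_j))
        for row_i in q for row_j in q
    )

# A lazy two-state chain: any two rows share at least 0.6 probability mass,
# so each coupled step succeeds with probability at least 0.6.
q = [[0.8, 0.2],
     [0.4, 0.6]]
print(dobrushin_coefficient(q))  # 0.6
```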

Now, consider a DBN over the state variables $\mathbf{X} = \{X_1, \ldots, X_n\}$. As above, we create two copies of the process, letting $Y_1, \ldots, Y_n$ denote the variables in the first component of the coupled Markov chain, and $Z_1, \ldots, Z_n$ denote those in the second component. Our goal is to construct a Markov chain over $\mathbf{Y}, \mathbf{Z}$ such that both $\mathbf{Y}$ and $\mathbf{Z}$ separately have the same dynamics as $\mathbf{X}$ in the original DBN. Our construction of the joint Markov chain is very similar to the one used above, except that we will now choose the transition of each variable pair $Y_i$ and $Z_i$ so as to maximize the probability that they couple (assume the same value). As above, we can guarantee that $Y_i$ and $Z_i$ couple at any time $t$ with probability at least

$$\gamma_i = \min_{\mathbf{u}, \mathbf{u}' \in Val(Pa(X_i'))} \sum_{x_i \in Val(X_i)} \min[P(x_i \mid \mathbf{u}),\; P(x_i \mid \mathbf{u}')].$$

This coefficient was defined by [Boyen and Koller, 1998] in their analysis of the contraction rate of DBNs. Note that $\gamma_i$ depends only on the numbers in a single CPT of the DBN. Assuming that the transition probabilities in each CPT are not too extreme, the probability that any single variable couples will be reasonably high.

Unfortunately, this bound is not enough to show that all of the variable pairs couple within a short time. The problem is that it is not enough for two variables $Y_i^{(t)}$ and $Z_i^{(t)}$ to couple, as process dynamics may force us to decouple them at subsequent time slices. To understand this issue, consider a simple process with two variables $X_1, X_2$, and a transition graph with the edges $X_1 \to X_1'$, $X_2 \to X_2'$, $X_1 \to X_2'$. Assume that at time $t$, the variable pair $Y_2^{(t)}, Z_2^{(t)}$ has coupled with value $x_2$, but $Y_1^{(t)}, Z_1^{(t)}$ has not, so that $Y_1^{(t)} = x_1$ and $Z_1^{(t)} = \bar{x}_1 \neq x_1$. At the next time slice, we must select $Y_2^{(t+1)}$ and $Z_2^{(t+1)}$ from two different distributions, $P(X_2' \mid x_1, x_2)$ and $P(X_2' \mid \bar{x}_1, x_2)$ respectively. Thus, our sampling process may be forced to give them different values, decoupling them again.

As this example clearly illustrates, it is not enough for a variable pair to couple momentarily. In order to eventually couple the two processes as a whole, we need to make each variable pair a stable pair, i.e., we need to guarantee that our sampling process can keep them coupled from then on. In our example, the pair $Y_1, Z_1$ is stable as soon as it first couples. And once $Y_1, Z_1$ is stable, then $Y_2, Z_2$ will also be stable as soon as it couples. However, if $Y_2, Z_2$ couples while $Y_1, Z_1$ is not yet stable, then the sampling process cannot guarantee stability. In general, a variable pair can only be stable if their parents are also stable. So what happens if we add the edge $X_2 \to X_1'$ to our transition model? In this case, neither $Y_1, Z_1$ nor $Y_2, Z_2$ can stabilize in isolation. They can only stabilize if they couple simultaneously. This discussion leads to the following definition.

Definition 5.3: Consider a DBN over the state variables $X_1, \ldots, X_n$. The dependency graph $D$ for the DBN is a directed (possibly cyclic) graph whose nodes are $X_1, \ldots, X_n$ and where there is a directed edge from $X_i$ to $X_j$ if there is an edge in the transition graph of the DBN from $X_i$ to $X_j'$.

Hence, there is a directed path from $X_i$ to $X_j$ in $D$ iff $X_i^{(t)}$ influences $X_j^{(t')}$ for some $t' > t$. We assume that the transition graph of the DBN always has arcs $X_i \to X_i'$, so that every node in $D$ has a self-loop. Let $\Gamma_1, \ldots, \Gamma_l$ be the maximal strongly connected components in $D$, sorted so that if $i < j$, there are no directed edges from $\Gamma_j$ to $\Gamma_i$. Our analysis will be based on stabilizing the $\Gamma_i$'s in succession. (We note that we provide only a rough bound; a more refined analysis is possible.) Let $\gamma = \min_i \gamma_i$ and $g = \max_j |\Gamma_j|$. Assume that $\Gamma_1, \ldots, \Gamma_{i-1}$ have all stabilized by time $t$. In order for $\Gamma_i$ to stabilize, all of its variables need to couple at exactly the same time. This event happens at time $t$ with probability at least $\gamma^g$. As soon as $\Gamma_i$ stabilizes, we can move on to stabilizing $\Gamma_{i+1}$. When all the $\Gamma_i$'s have stabilized, we are done.

Theorem 5.4: For any $\epsilon \geq 0$, the Markov chain corresponding to a DBN as described above is $\epsilon$-mixed at time $m$ provided $m \geq (8l/\gamma^g) \log(1/\epsilon)$.

Thus, the mixing time of a DBN grows exponentially with the size of the largest component in the dependency graph, which may be significantly smaller than the total number of variables in a DBN. Indeed, in two real-life DBNs, BAT [Forbes et al., 1995] with ten state variables and WATER [Jensen et al., 1989] with eight, the maximal cluster size is 3-4.

It remains only to extend this analysis to DBN-MDPs, where we have a policy. Our stochastic coupling scheme must now deal with the fact that the actions taken at time $t$ in the two copies of the process may be different. The difficulty is that different actions at time $t$ correspond to different transition models. If a variable $X_i$ has a different transition model in different transition graphs $P_a$, it will use a different transition distribution if the action is not the same. Hence $X_i$ cannot stabilize until we are guaranteed that the same action is taken in both copies. That is, the action must also stabilize. The action is only guaranteed to have stabilized when all of the variables on which the choice of action can possibly depend have stabilized. Otherwise, we might encounter a pair of states in which we are forced to use different actions in the two copies. We can analyze this behavior by extending the dependency graph to include a new node corresponding to the choice of action. We then see what assumptions allow us to bound the set of incoming and outgoing edges, and we can then use the same analysis described above to bound the mixing time.

The outgoing edges correspond to the effect of an action. In many processes, the action only directly affects the transition model of a small number of state variables in the process. In other words, for many variables $X_i$, we have that $Pa_a(X_i')$ and $P_a(X_i' \mid Pa_a(X_i'))$ are the same for all $a$. In this case, the new action node will only have outgoing edges to the remaining variables (those for which the transition model might differ). We note that such localized influence models have a long history, both for influence diagrams [Howard and Matheson, 1984] and for DBN-MDPs [Boutilier et al., 1999].

Now, consider incoming edges. In general, the optimal policy might well be such that the action depends on every variable. However, the mere representation of such a policy may be very complex, rendering its use impractical in a DBN-MDP with many variables. Therefore, we often want to restrict attention to a simpler class of policies, such as a small finite state machine or a small decision tree. If our target policy is such that the choice of action only depends on a small number of variables, then there will only be a small number of incoming edges into the action node in the dependency graph.

Having integrated the action node into the dependency graph, our analysis above holds unchanged. The only difference from a random variable is that we do not have to include the action node when computing the size of the $\Gamma_i$ that contains it, as we do not have to stochastically make it couple; rather, it couples immediately once its parents have coupled. Finally, we note that this analysis easily accommodates DBN-MDPs where the decision about the action is also decomposed into several independent decisions (e.g., as in [Meuleau et al., 1998]). Different component decisions can influence different subsets of variables, and the choice of action in each one can depend on different subsets of variables. Each decision forms a separate node in the dependency graph, and can stabilize independently of the other decisions.

The analysis above gives us techniques for estimating the mixing rate of policies in DBN-MDPs. In particular, if we want to focus on getting a good steady-state return from DBN-E^3 in a reasonable amount of time, this analysis shows us how to restrict attention to policies that are guaranteed to mix rapidly given the structure of the given DBN-MDP.

6 Conclusions

Structured probabilistic models, and particularly Bayesian networks, have revolutionized the field of reasoning under uncertainty by allowing compact representations of complex domains. Their success is built on the fact that this structure can be exploited effectively by inference and learning algorithms. This success leads one to hope that similar structure can be exploited in the context of planning and reinforcement learning under uncertainty. This paper, together with the recent work on representing and reasoning with factored MDPs [Boutilier et al., 1999], demonstrates that substantial computational gains can indeed be obtained from these compact, structured representations.

This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial.

Acknowledgements

We are grateful to the members of the DAGS group for useful discussions, and particularly to Brian Milch for pointing out a problem in an earlier version of this paper. The work of Daphne Koller was supported by the ARO under the MURI program "Integrated Approach to Intelligent Systems," by ONR contract N66001-97-C-8554 under DARPA's HPKB program, and by the generosity of the Powell Foundation and the Sloan Foundation.

References

[Boutilier et al., 1999] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999. To appear.

[Boyen and Koller, 1998] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, pages 33-42, 1998.

[Dobrushin, 1956] R.L. Dobrushin. Central limit theorem for nonstationary Markov chains. Theory of Probability and its Applications, pages 65-80, 1956.

[Forbes et al., 1995] J. Forbes, T. Huang, K. Kanazawa, and S.J. Russell. The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI, 1995.

[Heckerman, 1995] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.

[Howard and Matheson, 1984] R.A. Howard and J.E. Matheson. Influence diagrams. In R.A. Howard and J.E. Matheson, editors, Readings on the Principles and Applications of Decision Analysis, pages 721-762. Strategic Decisions Group, Menlo Park, California, 1984.

[Jensen et al., 1989] F.V. Jensen, U. Kjærulff, K.G. Olesen, and J. Pedersen. An expert system for control of waste water treatment: a pilot project. Technical report, Judex Datasystemer A/S, Aalborg, 1989. In Danish.

[Jerrum and Sinclair, 1993] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22:1087-1116, 1993.

[Kearns and Singh, 1998] M. Kearns and S.P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In Proc. ICML, pages 260-268, 1998.

[Kearns et al., 1999] M. Kearns, Y. Mansour, and A. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In these proceedings, 1999.

[Koller and Parr, 1999] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In these proceedings, 1999.

[Lindvall, 1992] T. Lindvall. Lectures on the Coupling Method. Wiley, 1992.

[Meuleau et al., 1998] N. Meuleau, M. Hauskrecht, K.-E. Kim, L. Peshkin, L.P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov decision processes. In Proc. AAAI, pages 165-172, 1998.