Solving Factored MDPs with Continuous and Discrete Variables


Carlos Guestrin
Berkeley Research Center, Intel Corporation

Milos Hauskrecht
Department of Computer Science, University of Pittsburgh

Branislav Kveton
Intelligent Systems Program, University of Pittsburgh

Abstract

Although many real-world stochastic planning problems are more naturally formulated by hybrid models with both discrete and continuous variables, current state-of-the-art methods cannot adequately address these problems. We present the first framework that can exploit problem structure for modeling and solving hybrid problems efficiently. We formulate these problems as hybrid Markov decision processes (MDPs with continuous and discrete state and action variables), which we assume can be represented in a factored way using a hybrid dynamic Bayesian network (hybrid DBN). We present a new linear program approximation method that exploits the structure of the hybrid MDP and lets us compute approximate value functions more efficiently. In particular, we describe a new factored discretization of continuous variables that avoids the exponential blow-up of traditional approaches. We provide theoretical bounds on the quality of such an approximation and on its scale-up potential. We support our theoretical arguments with experiments on a set of control problems with up to 28-dimensional continuous state space and 22-dimensional action space.

1 Introduction

Markov decision processes (MDPs) (Bellman 1957; Bertsekas & Tsitsiklis 1996) offer an elegant mathematical framework for representing sequential decision problems in the presence of uncertainty. While standard solution techniques, such as value or policy iteration, scale up well in terms of the total number of states and actions, these techniques are less successful in real-world MDPs. In purely discrete settings, the running time of these algorithms grows exponentially in the number of variables, the so-called curse of dimensionality. Furthermore, many real-world problems include a combination of continuous and discrete state and action variables.
The continuous components are usually discretized, which leads to a state space that grows exponentially in the number of variables.

We present the first framework that exploits problem structure and solves large hybrid MDPs efficiently. The MDPs are modelled by hybrid factored MDPs, where the stochastic dynamics is represented compactly by a probabilistic graphical model, a hybrid dynamic Bayesian network (DBN) (Dean & Kanazawa 1989). The solution of the MDP is approximated by a linear combination of basis functions (Bellman, Kalaba, & Kotkin 1963; Bertsekas & Tsitsiklis 1996). Specifically, we use a factored (linear) value function (Koller & Parr 1999), where each basis function depends on a small number of state variables. We show that the weights of this approximation can be optimized using a convex formulation that we call hybrid approximate linear programming (HALP). The HALP reduces to the approximate linear programming (ALP) formulation (Schweitzer & Seidmann 1985) in purely discrete settings and to the formulation recently proposed by Hauskrecht & Kveton (2003) for the continuous-state settings.

Copyright © 2004, American Association for Artificial Intelligence. All rights reserved.

We present a theoretical analysis of the HALP, providing bounds with respect to the best approximation in the space of the basis functions. Unfortunately, the HALP formulation cannot be solved directly, since it may involve an infinite number of constraints. To address this problem, we formulate a relaxed version of the HALP, an ε-HALP, that uses a finite subset of constraints induced by the ε-grid discretization of continuous components. We provide a bound on the loss in the quality of the ε-HALP solution with respect to the complete HALP formulation. The main advantage of the ε-HALP is that it can be solved efficiently by existing factored ALP methods (Guestrin, Koller, & Parr 2001a; Schuurmans & Patrascu 2002). Therefore, the complexity of our solution does not grow exponentially with the number of variables, and depends only on the structure of the problem and the choice of basis functions.
We illustrate the feasibility of our formulation and its solution algorithm on a sequence of control optimization problems with 28-dimensional continuous state space and 22-dimensional action space. These nontrivial dynamic optimization problems are far out of reach of classic solution techniques.

2 Multiagent hybrid factored MDPs

Factored MDPs (Boutilier, Dearden, & Goldszmidt 1995) allow one to exploit problem structure to represent exponentially large MDPs compactly. We extend this formalism to a multiagent hybrid factored MDP that is defined by a 4-tuple $(\mathbf{X}, \mathbf{A}, P, R)$ consisting of a state space $\mathbf{X}$ represented by a set of state variables $\mathbf{X} = \{X_1, \ldots, X_n\}$, an action space $\mathbf{A}$ defined by a set of action variables $\mathbf{A} = \{A_1, \ldots, A_m\}$, a stochastic transition model $P$ modeling the dynamics of a state conditioned on the previous state and action choice, and a reward model $R$ that quantifies the immediate payoffs associated with a state-action configuration.

State variables: Each state variable is either discrete or continuous. We assume that every continuous variable is bounded to a $[0, 1]$ subspace, and each discrete variable takes on values in some finite domain. A state is defined by a vector $\mathbf{x}$ of value assignments to each state variable, which splits into discrete and continuous components denoted by $\mathbf{x} = (\mathbf{x}_D, \mathbf{x}_C)$.

Actions: The action space is distributed such that every action variable corresponds to one agent. As with state variables, the global action $\mathbf{a}$ is defined by a vector of individual action choices that can be divided into discrete $\mathbf{a}_D$ and continuous $\mathbf{a}_C$ components.

Factored transition: The state transition model is defined by a dynamic Bayesian network (DBN) (Dean & Kanazawa 1989). Let $X_i$ denote a variable at the current time and let $X_i'$ denote the same variable at the successive step. The transition graph of a DBN is a two-layer directed acyclic graph whose nodes are $\{X_1, \ldots, X_n, A_1, \ldots, A_m, X_1', \ldots, X_n'\}$. The parents of $X_i'$ in the graph are denoted by $\mathrm{Par}(X_i')$. For simplicity of exposition, we assume that $\mathrm{Par}(X_i') \subseteq \{\mathbf{X}, \mathbf{A}\}$, i.e.,

all arcs in the DBN are between variables in consecutive time slices. Each node $X_i'$ is associated with a conditional probability function (CPF) $p(X_i' \mid \mathrm{Par}(X_i'))$. The transition probability $p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})$ is then defined to be $\prod_i p(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the value in $\{\mathbf{x}, \mathbf{a}\}$ of the variables in $\mathrm{Par}(X_i')$.

Parameterization of CPFs: The transition model for each variable is local, as each CPF depends only on a small subset of state variables and individual actions. Compact parametric representation of the transitions is achieved by using beta or mixture of beta densities (Hauskrecht & Kveton 2003; Kveton & Hauskrecht 2004) for continuous variables, and by general discriminant functions for discrete variables.

Rewards: The reward function $R$ decomposes as a sum of partial reward functions $R_j$ defined on the subsets of state and action variables.

Policy: The objective is to find a control policy $\pi^* : \mathbf{X} \to \mathbf{A}$ that maximizes the infinite-horizon, discounted reward criterion $E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in [0, 1)$ is a discount factor, and $r_t$ is a reward obtained in step $t$.

Value function: The value of the optimal policy satisfies the Bellman fixed point equation (Bellman 1957; Bertsekas & Tsitsiklis 1996):

$$V^*(\mathbf{x}) = \sup_{\mathbf{a}} \Big[ R(\mathbf{x}, \mathbf{a}) + \gamma \sum_{\mathbf{x}'} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) V^*(\mathbf{x}') \Big], \tag{1}$$

where $V^*$ is the value of the optimal policy, and the sum over $\mathbf{x}'$ stands for a sum over the discrete components and an integral over the continuous components of $\mathbf{x}'$. Given the value function $V^*$, the optimal policy $\pi^*(\mathbf{x})$ is defined by the composite action $\mathbf{a}$ optimizing Equation 1.

3 Approximate linear programming solutions for hybrid MDPs

A standard way of solving complex MDPs is to assume a surrogate value function form with a small set of tunable parameters. Increasingly popular in recent years are the approximations based on linear representations of value functions, where the value function $V(\mathbf{x})$ is expressed as a linear combination of $k$ basis functions $f_i(\mathbf{x})$ (Bellman, Kalaba, & Kotkin 1963; Roy 1998):

$$V(\mathbf{x}) = \sum_{i=1}^{k} w_i f_i(\mathbf{x}).$$

Basis functions are often restricted to small subsets of state variables (Bellman, Kalaba, & Kotkin 1963; Roy 1998), and the goal of the optimization is to fit the set of weights $\mathbf{w} = (w_1, \ldots, w_k)$.
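As a small illustration of the linear value function representation, the sketch below evaluates $V(\mathbf{x}) = \sum_i w_i f_i(\mathbf{x})$ for a hybrid state. The state encoding, the variable names `x1` and `d1`, the four basis functions, and the weights are all hypothetical choices made for this example; they are not taken from the experiments.

```python
from typing import Callable, Dict, List

# Hypothetical state encoding: variable name -> value.
# "x1" is a continuous variable in [0, 1], "d1" is a discrete variable.
State = Dict[str, float]


def linear_value(weights: List[float],
                 basis: List[Callable[[State], float]],
                 x: State) -> float:
    """Evaluate V(x) = sum_i w_i * f_i(x)."""
    return sum(w * f(x) for w, f in zip(weights, basis))


# Each (assumed) basis function depends on at most one state variable.
basis = [
    lambda x: 1.0,                   # constant basis function
    lambda x: x["x1"],               # linear in the continuous variable x1
    lambda x: x["x1"] ** 2,          # quadratic in x1
    lambda x: float(x["d1"] == 1),   # indicator on the discrete variable d1
]
weights = [0.5, 2.0, -1.0, 3.0]

state = {"x1": 0.5, "d1": 1}
value = linear_value(weights, basis, state)
print(value)
```

Only the $k$ weights are tunable, so fitting the value function reduces to choosing a $k$-dimensional vector regardless of how many states exist.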
3.1 Formulation

We generalize approximate linear programming (ALP) for discrete MDPs (Schweitzer & Seidmann 1985) to hybrid settings. Weights $\mathbf{w}$ are optimized by solving a convex optimization problem that we call the hybrid approximate linear program (HALP):

$$\begin{aligned} \text{minimize}_{\mathbf{w}} \quad & \sum_i w_i \alpha_i \\ \text{subject to:} \quad & \sum_i w_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) \ge 0 \quad \forall\, \mathbf{x}, \mathbf{a}; \end{aligned} \tag{2}$$

where $\alpha_i$ denotes the basis function relevance weight given by:

$$\alpha_i = \sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} \psi(\mathbf{x}) f_i(\mathbf{x}) \, d\mathbf{x}_C, \tag{3}$$

where $\psi(\mathbf{x}) > 0$ is a state relevance density function such that $\sum_{\mathbf{x}_D} \int_{\mathbf{x}_C} \psi(\mathbf{x}) \, d\mathbf{x}_C = 1$, allowing us to weight the quality of our approximation differently for different parts of the state space; and $F_i(\mathbf{x}, \mathbf{a})$ denotes:

$$F_i(\mathbf{x}, \mathbf{a}) = f_i(\mathbf{x}) - \gamma \sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) f_i(\mathbf{x}') \, d\mathbf{x}'_C. \tag{4}$$

This formulation reduces to the standard discrete-case ALP (Schweitzer & Seidmann 1985; Guestrin, Koller, & Parr 2001b; de Farias & Van Roy 2003; Schuurmans & Patrascu 2002) if the state space $\mathbf{x}$ is discrete, or to the continuous ALP (Hauskrecht & Kveton 2003) if the state space is continuous.

A number of concerns arise in the context of the HALP approximation. First, the formulation of the HALP appears to be arbitrary, and it is not immediately clear how it relates to the original hybrid MDP problem. Second, the HALP approximation for the hybrid MDP involves complex integrals that must be evaluated. Third, the number of constraints defining the LP is exponential if the state and action spaces are discrete, and infinite if any of the spaces involves continuous components. In the following text, we address and provide solutions for each of these issues.

3.2 Theoretical analysis

Theoretical analysis of the quality of the solution obtained by the HALP follows the ideas of de Farias and Van Roy (2003) for the discrete case. They note that the approximate formulation cannot guarantee a uniformly good approximation of the optimal value function over the whole state space. To address this issue, they define a Lyapunov function that weighs states appropriately: a Lyapunov function $L(\mathbf{x}) = \sum_i w_i^L f_i(\mathbf{x})$ with contraction factor $\kappa \in (0, 1)$ for the transition model $P^{\pi}$ is a strictly positive function such that:

$$\kappa L(\mathbf{x}) \ge \gamma \sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} P^{\pi}(\mathbf{x}' \mid \mathbf{x}) L(\mathbf{x}') \, d\mathbf{x}'_C. \tag{5}$$

This definition allows us to claim:

Proposition 1 Let $\mathbf{w}^*$ be an optimal solution to the HALP in Equation 2. Then, for any Lyapunov function $L(\mathbf{x})$, we have that:

$$\| V^* - H\mathbf{w}^* \|_{1,\psi} \le \frac{2 \psi^{\mathsf{T}} L}{1 - \kappa} \min_{\mathbf{w}} \| V^* - H\mathbf{w} \|_{\infty, 1/L},$$

where $H\mathbf{w}$ represents the function $\sum_i w_i f_i(\cdot)$, $\| \cdot \|_{1,\psi}$ is the $L_1$ norm weighted by $\psi$, and $\| \cdot \|_{\infty, 1/L}$ is the max-norm weighted by $1/L$.

Proof: The proof of this result for the hybrid setting follows the outline of the proof of de Farias and Van Roy's Theorem 4.2 (de Farias & Van Roy 2003) for the discrete case.

4 Factored HALP

Factored MDP models offer, in addition to structured parameterizations of the process, an opportunity to solve the problem more efficiently. The opportunity stems from the structure of constraint definitions that decompose over state and action subspaces. This is a direct consequence of: (1) factorizations, (2) presence of local transitions, and (3) basis functions defined over small state subspaces. This section describes how these properties allow us to compute the factors in the HALP efficiently.
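To make the constraints of Equation 2 concrete, the following sketch builds the terms $F_i(\mathbf{x}, \mathbf{a})$ and $\alpha_i$ for a tiny, purely discrete MDP, where the integrals vanish and only sums remain. The two-state MDP, its transition probabilities, its rewards, and the indicator basis functions are made up for illustration. With indicator bases the weights coincide with state values, so the optimal value function (computed here by plain value iteration rather than by an LP solver) should satisfy every constraint, with one tight constraint per state.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# Hypothetical model: P[a][x][x2] = p(x2 | x, a), R[x][a] = reward.
P = {
    0: [[0.9, 0.1], [0.2, 0.8]],
    1: [[0.1, 0.9], [0.7, 0.3]],
}
R = [[0.0, 1.0], [2.0, 0.5]]


def f(i, x):
    """Indicator basis function f_i(x) = 1 if x == i (an assumption)."""
    return 1.0 if x == i else 0.0


def F(i, x, a):
    """F_i(x, a) = f_i(x) - gamma * sum_x2 p(x2 | x, a) f_i(x2)."""
    backprojection = sum(P[a][x][x2] * f(i, x2) for x2 in STATES)
    return f(i, x) - GAMMA * backprojection


def alpha(i, psi):
    """Relevance weight alpha_i = sum_x psi(x) f_i(x)."""
    return sum(psi[x] * f(i, x) for x in STATES)


def constraint_slack(w, x, a):
    """Left-hand side of the constraint: sum_i w_i F_i(x, a) - R(x, a)."""
    return sum(w[i] * F(i, x, a) for i in range(len(w))) - R[x][a]


def value_iteration(iterations=2000):
    v = [0.0 for _ in STATES]
    for _ in range(iterations):
        v = [max(R[x][a] + GAMMA * sum(P[a][x][x2] * v[x2] for x2 in STATES)
                 for a in ACTIONS)
             for x in STATES]
    return v


w_star = value_iteration()
slacks = {(x, a): constraint_slack(w_star, x, a)
          for x in STATES for a in ACTIONS}
psi = [0.5, 0.5]  # uniform state relevance weights
objective = sum(w_star[i] * alpha(i, psi) for i in range(len(w_star)))
print(slacks, objective)
```

In the hybrid case the same constraint rows exist for every joint $(\mathbf{x}, \mathbf{a})$, which is exactly why they cannot be enumerated directly.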

4.1 Factored hybrid basis function representation

Koller and Parr (1999) show that basis functions with limited scope provide the basis for efficient approximations in the context of discrete factored MDPs. An important issue in hybrid settings is that the problem formulation incorporates integrals, which may not be computable. Hauskrecht and Kveton (2003) propose conjugate transition model and basis function classes that lead to closed-form solutions of all integrals in strictly continuous cases. In our hybrid setting, each basis function $f_i(\mathbf{x}_i)$ is defined over discrete components $\mathbf{x}_{iD}$ and continuous components $\mathbf{x}_{iC}$, and decomposes as a product of two factors:

$$f_i(\mathbf{x}_i) = f_{iD}(\mathbf{x}_{iD}) \, f_{iC}(\mathbf{x}_{iC}), \tag{6}$$

where $f_{iC}(\mathbf{x}_{iC})$ takes the form of polynomials over the variables in $\mathbf{X}_{iC}$, and $f_{iD}(\mathbf{x}_{iD})$ is an arbitrary function over the discrete variables $\mathbf{X}_{iD}$. This basis function representation gives us high flexibility and the ability to solve hybrid planning problems efficiently.

4.2 Hybrid backprojections

Computation of $F_i(\mathbf{x}, \mathbf{a})$, the difference between the basis function $f_i(\mathbf{x})$ and its discounted backprojection $\gamma g_i(\mathbf{x}, \mathbf{a})$, where

$$g_i(\mathbf{x}, \mathbf{a}) = \sum_{\mathbf{x}'_D} \int_{\mathbf{x}'_C} p(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}) f_i(\mathbf{x}') \, d\mathbf{x}'_C,$$

requires us to compute a sum over the exponential number of discrete states $\mathbf{x}'_D$, and integrals over the continuous states $\mathbf{x}'_C$. Based on the results of Koller and Parr (1999) for discrete variables, and Hauskrecht and Kveton (2003) for continuous variables, we can rewrite the backprojection for hybrid basis functions:

$$g_i(\mathbf{x}, \mathbf{a}) = g_{iD}(\mathbf{x}, \mathbf{a}) \, g_{iC}(\mathbf{x}, \mathbf{a}) = \Big( \sum_{\mathbf{x}'_{iD}} p(\mathbf{x}'_{iD} \mid \mathbf{x}, \mathbf{a}) f_{iD}(\mathbf{x}'_{iD}) \Big) \Big( \int_{\mathbf{x}'_{iC}} p(\mathbf{x}'_{iC} \mid \mathbf{x}, \mathbf{a}) f_{iC}(\mathbf{x}'_{iC}) \, d\mathbf{x}'_{iC} \Big) \tag{7}$$

and compute it efficiently. Note that $g_{iD}(\mathbf{x}, \mathbf{a})$ is the backprojection of a discrete basis function and $g_{iC}(\mathbf{x}, \mathbf{a})$ is the backprojection of a continuous basis function.

4.3 Hybrid relevance weights

Computation of basis function relevance weights $\alpha_i$ in Equation 3 requires us to solve exponentially large sums and complex integrals. Guestrin et al. (2001b; 2003) showed that if the state relevance density $\psi(\mathbf{x})$ is represented in a factorized fashion, these weights can be computed efficiently.
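The closed-form continuous backprojection $g_{iC}$ of Section 4.2 can be illustrated for the simplest case: a single continuous variable whose next-step value follows a beta density, combined with a polynomial basis function. The sketch below relies on the standard moment formula $E[X^k] = \prod_{j=0}^{k-1} \frac{a+j}{a+b+j}$ for $X \sim \mathrm{Beta}(a, b)$ and cross-checks it against a midpoint-rule integral. The beta parameters and polynomial coefficients are hypothetical; in a real model they would depend on the parent values $(\mathbf{x}, \mathbf{a})$.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def beta_moment(k, a, b):
    """Closed form E[X^k] for X ~ Beta(a, b)."""
    result = 1.0
    for j in range(k):
        result *= (a + j) / (a + b + j)
    return result


def backproject_closed_form(coeffs, a, b):
    """Integral over [0, 1] of Beta(x'; a, b) * sum_k coeffs[k] * x'^k."""
    return sum(c * beta_moment(k, a, b) for k, c in enumerate(coeffs))


def backproject_numeric(coeffs, a, b, n=20000):
    """Midpoint-rule approximation of the same integral (sanity check)."""
    h = 1.0 / n
    total = 0.0
    for j in range(n):
        x = (j + 0.5) * h
        poly = sum(c * x ** k for k, c in enumerate(coeffs))
        total += beta_pdf(x, a, b) * poly * h
    return total


coeffs = [1.0, -2.0, 3.0]   # hypothetical basis f(x') = 1 - 2x' + 3x'^2
a, b = 2.0, 5.0             # hypothetical beta parameters for one (x, a)
closed = backproject_closed_form(coeffs, a, b)
numeric = backproject_numeric(coeffs, a, b)
print(closed, numeric)
```

The closed form needs only a few multiplications per monomial, whereas the numeric integral requires thousands of density evaluations for a single constraint coefficient.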
This result extends to hybrid settings, and thus we can decompose the computation of $\alpha_i$:

$$\alpha_i = \alpha_{iD} \, \alpha_{iC} = \Big( \sum_{\mathbf{x}_{iD}} \psi(\mathbf{x}_{iD}) f_{iD}(\mathbf{x}_{iD}) \Big) \Big( \int_{\mathbf{x}_{iC}} \psi(\mathbf{x}_{iC}) f_{iC}(\mathbf{x}_{iC}) \, d\mathbf{x}_{iC} \Big), \tag{8}$$

where $\psi(\mathbf{x}_{iD})$ is the marginal of the density $\psi(\mathbf{x})$ to the discrete variables $\mathbf{X}_{iD}$, and $\psi(\mathbf{x}_{iC})$ is the marginal to the continuous variables $\mathbf{X}_{iC}$.

5 Factored ε-HALP formulation

Despite the decompositions and closed-form solutions, factored HALPs remain hard to solve. The formulation includes constraints for each joint state $\mathbf{x}$ and action $\mathbf{a}$, which leads to exponentially many constraints for discrete components, and an uncountably infinite constraint set for continuous components. To address these issues, we propose to transform the factored HALP into an ε-HALP, an approximation of the factored HALP with a finite number of constraints.

The ε-HALP relies on the ε-coverage of the constraint space. In the ε-coverage, each continuous (state or action) variable is discretized into $1/(2\varepsilon) + 1$ equally spaced values. The discretization induces a multidimensional grid $G$, such that any point in $[0, 1]^d$ is at most $\varepsilon$ far from a point in $G$ under the max-norm.

If we directly enumerate each state and action configuration of the ε-HALP, we obtain an LP with exponentially many constraints. However, not all of these constraints define the solution, so not all of them need to be enumerated. This is the same setting as the factored LP decomposition of Guestrin et al. (2001a). We can use the same technique to decompose our ε-HALP into an equivalent LP with exponentially fewer constraints. The complexity of this new problem will only be exponential in the tree-width of a cost network formed by the restricted scope functions in our LP, rather than in the complete set of variables (Guestrin, Koller, & Parr 2001a; Guestrin et al. 2003). Alternatively, we can also apply the approach by Schuurmans and Patrascu (2002) that incrementally builds the set of constraints using a constraint generation heuristic and often performs well in practice.

The ε-HALP offers an efficient approximation of a hybrid factored MDP; however, it is unclear how the discretization affects the quality of the approximation.
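The ε-coverage described above can be sketched in a few lines. The code builds the per-variable grid with spacing at most $2\varepsilon$, forms the product grid $G$ for $d$ variables, and checks empirically that random points of $[0, 1]^d$ lie within $\varepsilon$ of $G$ in the max-norm. Rounding $1/(2\varepsilon)$ up when it is not an integer is an assumption of this sketch, and the function names are illustrative.

```python
import itertools
import math
import random


def grid_axis(eps):
    """Equally spaced values in [0, 1] with spacing at most 2 * eps."""
    n_intervals = math.ceil(1.0 / (2.0 * eps))  # assumption: round up
    return [k / n_intervals for k in range(n_intervals + 1)]


def eps_grid(eps, d):
    """Full product grid G over d continuous variables."""
    return list(itertools.product(grid_axis(eps), repeat=d))


def max_norm_distance_to_grid(point, eps):
    """Max-norm distance from a point to its nearest grid point.

    Because G is a product grid, the nearest grid point can be chosen
    independently in every coordinate.
    """
    axis = grid_axis(eps)
    return max(min(abs(u - g) for g in axis) for u in point)


EPS, D = 0.125, 3
grid = eps_grid(EPS, D)

random.seed(0)
points = [[random.random() for _ in range(D)] for _ in range(1000)]
worst_distance = max(max_norm_distance_to_grid(p, EPS) for p in points)
print(len(grid), worst_distance)
```

The full grid has $(1/(2\varepsilon) + 1)^d$ points, so explicit enumeration is only viable for small $d$; this is the blow-up that the factored decomposition avoids.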
Most discretization approaches require an exponential number of points for a fixed approximation level. In the remainder of this section, we provide a proof that exploits factorization structure to show that our ε-HALP provides a polynomial approximation of the continuous HALP formulation.

5.1 Bound on the quality of ε-HALP

A solution to the ε-HALP will usually violate some of the constraints in the original HALP formulation. We show that if these constraints are violated by a small amount, then the ε-HALP solution is nearly optimal. Let us first define the degree to which a relaxed HALP, that is, a HALP defined over a finite subset of constraints, violates the complete set of constraints.

Definition 1 A set of weights $\mathbf{w}$ is δ-infeasible if:

$$\sum_i w_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) \ge -\delta, \quad \forall\, \mathbf{x}, \mathbf{a}.$$

Now we are ready to show that, if the solution to the relaxed HALP is δ-infeasible, then the quality of the approximation obtained from the relaxed HALP is close to the one in the complete HALP.

Proposition 2 Let $\mathbf{w}^*$ be any optimal solution to the complete HALP in Equation 2, and $\hat{\mathbf{w}}$ be any optimal solution to a relaxed HALP, such that $\hat{\mathbf{w}}$ is δ-infeasible. Then:

$$\| V^* - H\hat{\mathbf{w}} \|_{1,\psi} \le \| V^* - H\mathbf{w}^* \|_{1,\psi} + \frac{2\delta}{1 - \gamma}.$$

Proof: First, by monotonicity of the Bellman operator, any feasible solution $\mathbf{w}$ in the complete HALP satisfies:

$$\sum_i w_i f_i(\mathbf{x}) \ge V^*(\mathbf{x}). \tag{9}$$

Using this fact, we have that:

$$\| H\mathbf{w}^* - V^* \|_{1,\psi} = \psi^{\mathsf{T}} | H\mathbf{w}^* - V^* | = \psi^{\mathsf{T}} ( H\mathbf{w}^* - V^* ) = \psi^{\mathsf{T}} H\mathbf{w}^* - \psi^{\mathsf{T}} V^*. \tag{10}$$

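Next, note that the constraints in the relaxed HALP are a subset of those in the complete HALP. Thus, $\mathbf{w}^*$ is feasible for the relaxed HALP, and we have that:

$$\psi^{\mathsf{T}} H\mathbf{w}^* \ge \psi^{\mathsf{T}} H\hat{\mathbf{w}}. \tag{11}$$

Now, note that if $\hat{\mathbf{w}}$ is δ-infeasible in the complete HALP, then if we add $\frac{\delta}{1-\gamma}$ to $H\hat{\mathbf{w}}$ we obtain a feasible solution to the complete HALP, yielding:

$$\begin{aligned} \Big\| H\hat{\mathbf{w}} + \frac{\delta}{1-\gamma} - V^* \Big\|_{1,\psi} &= \psi^{\mathsf{T}} H\hat{\mathbf{w}} + \frac{\delta}{1-\gamma} - \psi^{\mathsf{T}} V^* \\ &\le \psi^{\mathsf{T}} H\mathbf{w}^* + \frac{\delta}{1-\gamma} - \psi^{\mathsf{T}} V^* \\ &= \| H\mathbf{w}^* - V^* \|_{1,\psi} + \frac{\delta}{1-\gamma}. \end{aligned} \tag{12}$$

The proof is concluded by substituting Equation 12 into the triangle inequality bound:

$$\| H\hat{\mathbf{w}} - V^* \|_{1,\psi} \le \Big\| H\hat{\mathbf{w}} + \frac{\delta}{1-\gamma} - V^* \Big\|_{1,\psi} + \frac{\delta}{1-\gamma}.$$

The above result can be combined with the result in Section 3 to obtain the bound on the quality of the ε-HALP.

Theorem 1 Let $\hat{\mathbf{w}}$ be any optimal solution to the relaxed ε-HALP satisfying the δ-infeasibility condition. Then, for any Lyapunov function $L(\mathbf{x})$, we have:

$$\| V^* - H\hat{\mathbf{w}} \|_{1,\psi} \le \frac{2\delta}{1-\gamma} + \frac{2 \psi^{\mathsf{T}} L}{1-\kappa} \min_{\mathbf{w}} \| V^* - H\mathbf{w} \|_{\infty, 1/L}.$$

Proof: Direct combination of Propositions 1 and 2.

5.2 Resolution of the ε-grid

Our bound for relaxed versions of the HALP formulation, presented in the previous section, relies on adding enough constraints to guarantee at most δ-infeasibility. The ε-HALP approximates the constraints in the HALP by restricting values of its continuous variables to the ε-grid. In this section, we analyze the relationship between the choice of ε and the violation level δ, allowing us to choose the appropriate discretization level for a desired approximation error in Theorem 1.

Our condition in Definition 1 can be satisfied by a set of constraints $C$ that ensures a δ max-norm discretization of $\sum_i \hat{w}_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a})$. In the ε-HALP this condition is met with the ε-grid discretization that assures that for any state-action pair $\mathbf{x}, \mathbf{a}$ there exists a pair $\mathbf{x}_G, \mathbf{a}_G$ in the ε-grid such that:

$$\Big| \sum_i \hat{w}_i F_i(\mathbf{x}, \mathbf{a}) - R(\mathbf{x}, \mathbf{a}) - \sum_i \hat{w}_i F_i(\mathbf{x}_G, \mathbf{a}_G) + R(\mathbf{x}_G, \mathbf{a}_G) \Big| \le \delta.$$

Usually, such bounds are achieved by considering the Lipschitz modulus of the discretized function: Let $h(\mathbf{u})$ be an arbitrary function defined over the continuous subspace $U \subseteq [0, 1]^d$ with a Lipschitz modulus $K$ and let $G$ be an ε-grid discretization of $U$. Then the δ max-norm discretization of $h(\mathbf{u})$ can be achieved with an ε-grid with the resolution $\varepsilon \le \delta / K$.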
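The Lipschitz argument can be checked numerically in one dimension. In the sketch below, $h(u) = \sin(3u)$ is a hypothetical stand-in for a constraint function; its Lipschitz constant on $[0, 1]$ is $K = 3$. Choosing $\varepsilon = \delta / K$ and replacing every point by its nearest grid point should change the function value by at most $\delta$. The grid construction rounds the number of intervals up, which is an assumption of this example.

```python
import math


def h(u):
    """Hypothetical 1-D function on [0, 1] with Lipschitz constant K = 3."""
    return math.sin(3.0 * u)


K = 3.0
delta = 0.05
eps = delta / K  # grid resolution suggested by the bound eps <= delta / K

# Grid with spacing at most 2 * eps, so every u is within eps of the grid.
n_intervals = math.ceil(1.0 / (2.0 * eps))
grid = [k / n_intervals for k in range(n_intervals + 1)]


def nearest_grid_value(u):
    """Value of h at the grid point closest to u."""
    g = min(grid, key=lambda p: abs(p - u))
    return h(g)


# Worst observed discretization error over a fine sample of [0, 1].
samples = [j / 10000 for j in range(10001)]
worst = max(abs(h(u) - nearest_grid_value(u)) for u in samples)
print(len(grid), worst)
```

Halving δ doubles the number of grid points per dimension, which is the cost that motivates bounding $K$ through low-dimensional factors rather than through the complete expression.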
Usually, the Lipschitz modulus of a function rapidly increases with dimension $d$, thus requiring additional points for a desired discretization level.

Each constraint in the ε-HALP is defined in terms of a sum of functions: $\sum_i \hat{w}_i F_i(\mathbf{x}, \mathbf{a}) - \sum_j R_j(\mathbf{x}, \mathbf{a})$, where each function depends only on a small number of variables (and thus has a small dimension). Therefore, instead of using a global Lipschitz constant $K$ for the complete expression, we can express the relation between the factor δ and ε in terms of the Lipschitz constants of individual functions, exploiting the factorization structure. In particular, let $K_{\max}$ be the worst-case Lipschitz constant over both the reward functions $R_j(\mathbf{x}, \mathbf{a})$ and $w_i F_i(\mathbf{x}, \mathbf{a})$. To guarantee that $K_{\max}$ is bounded, we must bound the magnitude of $\hat{\mathbf{w}}$. Typically, if the basis functions have unit magnitude, $\hat{\mathbf{w}}$ will be bounded by $R_{\max}/(1-\gamma)$. Here, we can define $K_{\max}$ to be the maximum of the Lipschitz constants of the reward functions and of $R_{\max}/(1-\gamma)$ times the constant for each $F_i(\mathbf{x}, \mathbf{a})$. By choosing an ε discretization of only:

$$\varepsilon \le \frac{\delta}{M K_{\max}},$$

where $M$ is the number of functions, we guarantee the condition of Theorem 1 for a violation of δ.

Figure 1: a. The topology of an irrigation system. Irrigation channels are represented by links $x_i$ and water regulation devices are marked by rectangles $a_i$. Input and output regulation devices are shown in light and dark gray colors. b. Reward functions for the amount of water $x_i$ in the ith irrigation channel (panels: non-outgoing channels, outgoing channels).

6 Experiments

This section presents an empirical evaluation of our approach, demonstrating the quality of the approximation and the scale-up potential.

6.1 Irrigation network example

An irrigation system consists of a network of irrigation channels that are connected by regulation devices (Figure 1a). Regulation devices are used to regulate the amount of water in the channels, which is achieved by pumping the water from one of the channels to another one.
The goal of the operator of the irrigation system is to keep the amount of water in all channels at an optimal level (determined by the type of planted crops, etc.) by manipulating the regulation devices. Figure 1a illustrates the topology of channels and regulation devices for one of the irrigation systems used in the experiments. To keep the problem formulation simple, we adopt several simplifying assumptions: all channels are of the same size, water flows are oriented, and the control structures operate in discrete modes.

The irrigation system can be formalized as a hybrid MDP, and the optimal behavior of the operator can be found as the optimal control policy for the MDP. The amount of water in the ith channel is naturally represented by a continuous state factor $x_i \in [0, 1]$. Each regulation device can operate in multiple modes: the water can be pumped between any pair of incoming and outgoing channels. These options are represented by discrete action variables $a_i$, one variable per regulation device. The input and output regulation devices (devices with no incoming or no outgoing channels) are special and continuously pump the water in or out of the irrigation system. Transition functions are defined as beta densities that represent water flows depending on the operating modes of the regulation devices. The reward function reflects our preference for the amount of water in the channels (Figure 1b). The reward function is factorized along channels, defined by a linear reward function for the outgoing channels, and a mixture of Gaussians for all other channels. The discount factor is γ = 0.95. To approximate the optimal value function, a combination of linear and piecewise linear feature functions is used at every channel (Figure 2).

Figure 2: Feature functions for the amount of water $x_i$ in the ith irrigation channel.

6.2 Experimental results

The objective of the first set of experiments was to compare the quality of solutions obtained by the ε-HALP for varying grid resolutions ε against other techniques for policy generation and to illustrate the time (in seconds) needed to solve the ε-HALP problem. All experiments are performed on the irrigation network from Figure 1a with 17 dimensional state space and 15 dimensional action space. The results are presented in Figure 3. The quality of policies is measured in terms of the average reward that is obtained via Monte Carlo simulations of the policy on 1 state-action trajectories, each of 1 steps. To ensure the fairness of the comparison, the set of initial states is kept fixed across experiments. Three alternative solutions are used in the comparison: random policy, local heuristic, and global heuristic. The random policy operates regulation devices randomly and serves as a baseline solution. The local heuristic optimizes the one-step expected reward for every regulation device locally, while ignoring all other devices. Finally, the global heuristic attempts to optimize the one-step expected reward for all regulatory devices together.
The parameter of the global heuristic is the number of trials used to estimate the global one-step reward. All heuristic solutions were applied in the on-line mode; thus, their solution times are not included in Figure 3. The results show that the ε-HALP is able to solve a very complex optimization problem relatively quickly and to outperform strawman heuristic methods in terms of solution quality.

Figure 3: Results of the experiments for the irrigation system in Figure 1a. The quality of found policies is measured by the average reward µ for 1 state-action trajectories, where σ denotes the standard deviation of the rewards. [Table: ε-HALP (columns ε, µ, σ, Time [s]) versus alternative solutions (columns Method, µ, σ; rows Random, Local, Global); the numeric entries are missing in the source.]

6.3 Scale-up study

The second set of experiments focuses on the scale-up potential of the ε-HALP method with respect to the complexity of the model. The experiments are performed for n-ring and n-ring-of-rings topologies (Figure 4a). The results, summarized in Figure 4b, show several important trends: (1) the quality of the policy for the ε-HALP improves with higher grid resolution (larger 1/ε), (2) the running time of the method grows polynomially with the grid resolution, and (3) the increase in the running time of the method for topologies of increased complexity is mild and far from exponential in the number of variables n. Graphical examples of each of these trends are given in Figures 4c, 4d, and 4e. In addition to the running time curve, Figure 4e shows a quadratic polynomial fitted to the values for different n. This supports our theoretical findings that the running time complexity of the ε-HALP method for an appropriate choice of basis functions does not grow exponentially in the number of variables.

7 Conclusions

We present the first framework that can exploit problem structure for modeling and approximately solving hybrid problems efficiently. We provide bounds on the quality of the solutions obtained by our HALP formulation with respect to the best approximation in our basis function class.
This HALP formulation can be closely approximated by the (relaxed) ε-HALP, if the resulting solution is near feasible in the original HALP formulation. Although we would typically require an exponentially large discretization to guarantee this near feasibility, we provide an algorithm that can efficiently generate an equivalent guarantee with an exponentially smaller discretization. When combined, these theoretical results lead to a practical algorithm that we have successfully demonstrated on a set of control problems with up to 28-dimensional continuous state space and 22-dimensional action space.

The techniques presented in this paper directly generalize to collaborative multiagent settings, where each agent is responsible for one of the action variables, and they must coordinate to maximize the total reward. The off-line planning stage of our algorithm remains unchanged. However, in the on-line action selection phase, at every time step, the agents must coordinate to choose the action that jointly maximizes the expected value for the current state. We can achieve this by extending the coordination graph algorithm of Guestrin et al. (2001b) to our hybrid setting with our factored discretization scheme. The result will be an efficient distributed coordination algorithm that can cope with both continuous and discrete actions.

Many real-world problems involve continuous and discrete elements. We believe that our algorithms and theoretical results will significantly further the applicability of automated planning algorithms to these settings.

Acknowledgments

Milos Hauskrecht was supported in part by the National Science Foundation under grant ITR and grant. Branislav Kveton acknowledges the fellowship support from the School of Arts and Sciences, University of Pittsburgh.

Figure 4: a. Two irrigation network topologies used in the scale-up experiments: n-ring-of-rings (shown for n = 6) and n-ring (shown for n = 6). b. Average rewards and policy computation times for different ε and various network architectures. c. Average reward as a function of grid resolution ε. d. Time complexity as a function of grid resolution ε. e. Time complexity (solid line) as a function of different network sizes n. Quadratic approximation of the time complexity is plotted as dashed line. [Table b: for the n-ring and n-ring-of-rings topologies, columns µ and Time [s] for n = 6, 9, 12, 15, 18, one row per ε; the numeric entries are missing in the source. Plots c to e: ring-of-rings topology, with expected reward and time plotted against 1/ε, and time plotted against n.]

References

Bellman, R.; Kalaba, R.; and Kotkin, B. 1963. Polynomial approximation - a new computational technique in dynamic programming. Math. Comp. 17(8).
Bellman, R. E. 1957. Dynamic programming. Princeton Press.
Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-dynamic Programming. Athena.
Boutilier, C.; Dearden, R.; and Goldszmidt, M. 1995. Exploiting structure in policy construction. In IJCAI.
de Farias, D. P., and Roy, B. V. 2001. On constraint sampling for the linear programming approach to approximate dynamic programming. Mathematics of Operations Research submitted.
de Farias, D., and Van Roy, B. 2003. The linear programming approach to approximate dynamic programming. Operations Research 51(6).
Dean, T., and Kanazawa, K. 1989. A model for reasoning about persistence and causation. Computational Intelligence 5.
Guestrin, C. E.; Koller, D.; Parr, R.; and Venkataraman, S. 2003. Efficient solution algorithms for factored MDPs. JAIR 19.
Guestrin, C. E.; Koller, D.; and Parr, R. 2001a. Max-norm projections for factored MDPs. In IJCAI-01.
Guestrin, C. E.; Koller, D.; and Parr, R. 2001b. Multiagent planning with factored MDPs. In NIPS-14.
Hauskrecht, M., and Kveton, B. 2003. Linear program approximations to factored continuous-state Markov decision processes. In NIPS-17.
Koller, D., and Parr, R. 1999. Computing factored value functions for policies in structured MDPs. In IJCAI-99.
Kveton, B., and Hauskrecht, M. 2004. Heuristic refinements of approximate linear programming for factored continuous-state Markov decision processes. In ICAPS-14.
Roy, B. V. 1998. Learning and value function approximation in complex decision problems. Ph.D. Dissertation, MIT.
Schuurmans, D., and Patrascu, R. 2002. Direct value-approximation for factored MDPs. In NIPS-14.
Schweitzer, P., and Seidmann, A. 1985. Generalized polynomial approximations in Markovian decision processes. Journal of Math. Analysis and Apps. 11.