Recent Progresses on Linear Programming and the Simplex Method
Yinyu Ye
www.stanford.edu/~yyye
K.T. Li Professor of Engineering
Management Science and Engineering and Institute of Computational and Mathematical Engineering
Stanford University

Linear Programming started...
...with the simplex method

Outline
Counterexamples to the Hirsch conjecture
Linear Programming (LP) and the simplex method
Pivoting rules and their exponential behavior
Simplex and policy-iteration methods for Markov Decision Process (MDP) and Zero-Sum Game with fixed discounts
Simplex method for deterministic MDP with variable discounts
Remarks and comments

Hirsch's Conjecture
Warren Hirsch conjectured in 1957 that the diameter of the graph of a (convex) polyhedron defined by $n$ inequalities in $d$ dimensions is at most $n-d$.
The diameter of the graph is the maximum, over all pairs of vertices, of the length of a shortest path between them.

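To make the definition of diameter concrete, here is a minimal sketch that builds the vertex-edge graph of the $d$-dimensional cube and measures its diameter by breadth-first search. The cube is my own choice of example (it is not on the slide): it is defined by $n = 2d$ inequalities, and its diameter equals $n - d = d$, so it meets the conjectured bound exactly. The function names are hypothetical.

```python
from collections import deque
from itertools import product


def cube_graph(d):
    """Graph of the d-cube: vertices are 0/1 tuples, edges flip one coordinate."""
    graph = {}
    for v in product((0, 1), repeat=d):
        graph[v] = [v[:k] + (1 - v[k],) + v[k + 1:] for k in range(d)]
    return graph


def diameter(graph):
    """Largest shortest-path distance over all vertex pairs (BFS from every vertex)."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        longest = max(longest, max(dist.values()))
    return longest


d = 3
n = 2 * d  # the d-cube is cut out by the 2d inequalities 0 <= x_k <= 1
print(diameter(cube_graph(d)), n - d)
```

Both printed numbers agree for the cube; Santos' polytopes are exactly the cases where the first number exceeds the second.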
Counterexamples to Hirsch's conjecture
Francisco Santos (2010): There is a 43-dimensional polytope with 86 facets and of diameter at least 44.
There is an infinite family of non-Hirsch polytopes with diameter $(1 + \epsilon)n$, even in fixed dimension.
Santos' construction is an extension of a result of Klee and Walkup (1967), where they proved that the Hirsch conjecture could be proved true from just the case $n = 2d$.

LP and the Simplex Method
Optimize a linear objective function over a convex polyhedron

Pivoting rules
The simplex method is governed by a pivot rule, i.e. a method of choosing adjacent vertices with a better objective function value.
Klee and Minty (1972) showed that Dantzig's original greedy pivot rule may require exponentially many steps.
The random edge pivot rule chooses, from among all improving pivoting steps (or edges) from the current basic feasible solution (or vertex), one uniformly at random.
The Zadeh pivot rule chooses the decreasing edge or the entering variable that has been entered least often in the previous pivot steps.

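The three rules differ only in how they pick one entering variable from the set of improving ones. The sketch below illustrates that choice for a minimization LP, assuming the reduced costs at the current vertex are already available and that a negative reduced cost marks an improving edge. The function names, the sample data, and the lowest-index tie-break in the Zadeh rule are my own assumptions, not taken from the slides.

```python
import random


def improving(reduced_costs):
    """Indices of improving entering variables (negative reduced cost)."""
    return [j for j, r in enumerate(reduced_costs) if r < 0]


def greedy_rule(reduced_costs):
    """Dantzig's rule: pick the most negative reduced cost."""
    candidates = improving(reduced_costs)
    return min(candidates, key=lambda j: reduced_costs[j]) if candidates else None


def random_edge_rule(reduced_costs, rng):
    """Pick one improving variable uniformly at random."""
    candidates = improving(reduced_costs)
    return rng.choice(candidates) if candidates else None


def zadeh_rule(reduced_costs, times_entered):
    """Pick the improving variable that has entered least often so far.

    Ties are broken by lowest index, which is an assumption of this sketch.
    """
    candidates = improving(reduced_costs)
    return min(candidates, key=lambda j: (times_entered[j], j)) if candidates else None


reduced_costs = [0.0, -1.0, -3.0, 2.0, -0.5]   # hypothetical values
times_entered = [0, 2, 5, 0, 1]                # hypothetical pivot history
print(greedy_rule(reduced_costs))
print(zadeh_rule(reduced_costs, times_entered))
print(random_edge_rule(reduced_costs, random.Random(0)))
```

Each function returns `None` when no improving variable exists, which is the optimality condition for the current vertex.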
...and they fall as well
No non-polynomial lower bounds were known until now for these two pivot rules.
Friedmann, Hansen and Zwick (2011) gave an example that the random edge pivot rule needs exponentially many steps.
Friedmann (2011) developed an example that the Zadeh pivot rule needs exponentially many steps.
These examples explore the connection of linear programming and Markov Decision Process (MDP), and the close relation between the simplex method for solving linear programs and the policy iteration method for MDP.
(The diameter of MDP polytopes is bounded by $d$.)

Markov Decision Process
A Markov decision process provides a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
MDPs are useful for studying a wide range of optimization problems solved via dynamic programming, and have been known at least since the 1950s (cf. Shapley 1953, Bellman 1957).
Modern applications include dynamic planning, reinforcement learning, social networking, and almost all other dynamic/sequential decision-making problems in Mathematical, Physical, Management, Economics, and Social Sciences.

States and Actions
At each time step, the process is in some state $i = 1, \dots, m$, and the decision maker chooses an action $j \in A_i$ that is available for state $i$, out of a total of $n$ actions.
The process responds at the next time step by randomly moving into a new state $i'$, and giving the decision maker an immediate corresponding cost $c_j$.
The probability that the process enters $i'$ as its new state is influenced by the chosen action $j$. Specifically, it is given by the state transition probability distribution $p_j$.
But given action $j$, the probability is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.

A Simple MDP Problem I

Simplified Representation

Policy and Discount Factor
A policy of MDP is a set function $\pi = \{\pi_1, \pi_2, \dots, \pi_m\}$ that specifies one action $\pi_i \in A_i$ that the decision maker will choose for each state $i$.
The MDP is to find an optimal (stationary) policy to minimize the expected discounted sum over an infinite horizon with a discount factor $0 \le \gamma < 1$.
One can obtain an LP that models the MDP problem in such a way that there is a one-to-one correspondence between policies of the MDP and extreme-point solutions of the (dual) LP, and between improving switches and improving pivots.
de Ghellinck (1960), D'Epenoux (1960) and Manne (1960)

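Cost-to-Go values and LP formulation
Let $y \in \mathbb{R}^m$ represent the expected present cost-to-go values of the $m$ states, respectively, for a given policy.
Then the cost-to-go vector of the optimal policy is a fixed point of
$$y_i = \min\{ c_j + \gamma p_j^T y,\ j \in A_i \},\ \forall i, \qquad \pi_i = \arg\min\{ c_j + \gamma p_j^T y,\ j \in A_i \},\ \forall i.$$
Such a fixed-point computation can be formulated as an LP:
$$\max \sum_{i=1}^{m} y_i \quad \text{s.t.} \quad y_i \le c_j + \gamma p_j^T y,\ \forall j \in A_i;\ \forall i.$$
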
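The fixed-point characterization can be checked numerically by repeatedly applying the update $y_i \leftarrow \min\{c_j + \gamma p_j^T y,\ j \in A_i\}$. The following is a minimal sketch of that iteration (plain value iteration, not the LP or simplex approach discussed in these slides). The two-state MDP, the discount $\gamma = 0.5$, the number of sweeps, and all names are made up for illustration.

```python
def bellman_step(y, actions, gamma):
    """Apply y_i <- min_{j in A_i} (c_j + gamma * p_j^T y) once; also return the argmin."""
    new_y, policy = [], []
    for state_actions in actions:
        value, j = min(
            (c + gamma * sum(p_k * y_k for p_k, y_k in zip(p, y)), j)
            for j, c, p in state_actions
        )
        new_y.append(value)
        policy.append(j)
    return new_y, policy


def value_iteration(actions, gamma, sweeps=200):
    """Iterate the update from y = 0; with gamma < 1 the map is a contraction."""
    y = [0.0] * len(actions)
    policy = []
    for _ in range(sweeps):
        y, policy = bellman_step(y, actions, gamma)
    return y, policy


# Hypothetical MDP: ACTIONS[i] lists (action index j, cost c_j, distribution p_j).
ACTIONS = [
    [(0, 1.0, [1.0, 0.0]), (1, 1.5, [0.0, 1.0])],   # actions available in state 0
    [(2, 0.0, [0.0, 1.0]), (3, 3.0, [0.5, 0.5])],   # actions available in state 1
]

y, policy = value_iteration(ACTIONS, gamma=0.5)
print(y, policy)
```

On this instance the iteration settles on taking action 1 in state 0 and action 2 in state 1, and the resulting `y` satisfies every LP constraint $y_i \le c_j + \gamma p_j^T y$, with equality on the chosen actions.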
Cost-to-Go values
Chosen actions in Red

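The dual of the MDP-LP
$$\min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} (e_{ij} - \gamma p_{ij}) x_j = 1,\ \forall i, \qquad x_j \ge 0,\ \forall j,$$
where $e_{ij} = 1$ if $j \in A_i$ and 0 otherwise.
Dual variable $x_j$ represents the expected action flow or visit-frequency, that is, the expected present value of the number of times action $j$ is used.
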
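For a fixed policy, only the $m$ chosen actions carry flow, so the dual constraints $\sum_j (e_{ij} - \gamma p_{ij}) x_j = 1$ reduce to an $m \times m$ linear system, where $p_{ij}$ denotes the $i$-th entry of $p_j$. The sketch below solves that system exactly with rational arithmetic on a made-up two-state MDP; the data, the helper names, and the small Gauss-Jordan solver are all illustrative assumptions.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(map(Fraction, row)) + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def policy_flows(policy, origin, p, gamma):
    """Flows x_j of the chosen actions: sum_j (e_ij - gamma * p_ij) x_j = 1 for all i."""
    m = len(policy)
    A = [[(1 if origin[j] == i else 0) - gamma * p[j][i] for j in policy]
         for i in range(m)]
    return solve(A, [1] * m)


# Hypothetical data: action j leaves state origin[j], costs cost[j], lands per p[j].
half = Fraction(1, 2)
origin = [0, 0, 1, 1]
cost = [Fraction(1), Fraction(3, 2), Fraction(0), Fraction(3)]
p = [[1, 0], [0, 1], [0, 1], [half, half]]
gamma = half

policy = [1, 2]                      # action 1 in state 0, action 2 in state 1
x = policy_flows(policy, origin, p, gamma)
print(x, sum(x), sum(cost[j] * xj for j, xj in zip(policy, x)))
```

Two checks are worth noting on this instance: the flows sum to $m/(1-\gamma)$, and the dual objective $\sum_j c_j x_j$ equals the sum of the cost-to-go values of the same policy.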
Greedy Simplex Rule
Chosen actions in Red

Lowest-Index Simplex Rule
Chosen actions in Red

Policy Iteration Rule (Howard 1960)
Chosen actions in Red

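As a rough illustration of the policy iteration rule, the sketch below alternates two steps: evaluate the current policy by solving $(I - \gamma P_\pi) y = c_\pi$, then switch every state whose best action strictly improves on its current cost-to-go. It uses exact rational arithmetic so the strict comparison is reliable. The two-state MDP, the starting policy, and all function names are assumptions made for this example.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(map(Fraction, row)) + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def evaluate(policy, cost, p, gamma):
    """Cost-to-go of a fixed policy: solve (I - gamma * P_pi) y = c_pi."""
    m = len(policy)
    A = [[(1 if i == k else 0) - gamma * p[policy[i]][k] for k in range(m)]
         for i in range(m)]
    return solve(A, [cost[j] for j in policy])


def action_value(j, y, cost, p, gamma):
    """One-step lookahead c_j + gamma * p_j^T y."""
    return cost[j] + gamma * sum(pk * yk for pk, yk in zip(p[j], y))


def policy_iteration(policy, actions_of, cost, p, gamma):
    """Switch all strictly improvable states at once until none remains."""
    policy, rounds = list(policy), 0
    while True:
        y = evaluate(policy, cost, p, gamma)
        changed = False
        for i, available in enumerate(actions_of):
            best = min(available, key=lambda j: action_value(j, y, cost, p, gamma))
            if action_value(best, y, cost, p, gamma) < y[i]:
                policy[i], changed = best, True
        if not changed:
            return policy, y, rounds
        rounds += 1


# Hypothetical two-state MDP with four actions.
half = Fraction(1, 2)
actions_of = [[0, 1], [2, 3]]
cost = [Fraction(1), Fraction(3, 2), Fraction(0), Fraction(3)]
p = [[1, 0], [0, 1], [0, 1], [half, half]]

policy, y, rounds = policy_iteration([0, 3], actions_of, cost, p, gamma=half)
print(policy, y, rounds)
```

The simplex method with a given pivot rule would instead switch a single state per step; the evaluation step is the same.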
Efficiency of simplex/policy methods
Melekopoglou and Condon (1990) showed that the simplex method with the smallest index pivot rule needs an exponential number of iterations to compute an optimal policy for a specific MDP problem regardless of discount factors.
Fearnley (2010) showed that the policy-iteration method needs an exponential number of iterations for an undiscounted finite-horizon MDP.
These add to the negative theoretical results mentioned earlier.
In practice, the policy-iteration method, including the simplex method with greedy pivot rule, has been remarkably successful and shown to be most effective and widely used.
Any good news in theory?

Bound on the simplex/policy methods
Ye (2011): The classic simplex and policy iteration methods, with the greedy pivoting rule, terminate in no more than
$$\frac{m(n-m)}{1-\gamma} \cdot \log\left(\frac{m^2}{1-\gamma}\right)$$
pivot steps, where $n$ is the total number of actions in an $m$-state MDP with discount factor $\gamma$.
This is a strongly polynomial-time upper bound when $\gamma$ is bounded above by a constant less than one.
CIPA (Ye, 2005)

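To get a feel for how the bound scales with the discount factor, it can simply be evaluated. The sketch below assumes the bound has the form $\frac{m(n-m)}{1-\gamma}\log\frac{m^2}{1-\gamma}$, that is, $n - m$ non-optimal actions, each eliminated within $\frac{m}{1-\gamma}\log\frac{m^2}{1-\gamma}$ pivot steps, and it assumes the natural logarithm. The function name and the sample sizes are hypothetical.

```python
import math


def greedy_pivot_bound(m, n, gamma):
    """Evaluate m (n - m) / (1 - gamma) * log(m^2 / (1 - gamma)), natural log assumed."""
    per_action = m / (1 - gamma) * math.log(m * m / (1 - gamma))
    return (n - m) * per_action


# The bound is polynomial in m and n for fixed gamma, but grows as gamma -> 1.
for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(greedy_pivot_bound(m=10, n=50, gamma=gamma)))
```

The dependence on $1/(1-\gamma)$ is why the result is strongly polynomial only when $\gamma$ is bounded away from one.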
Roadmap of proof
Define a combinatorial event that cannot repeat more than $n$ times. More precisely, at any step of the pivot process, there exists a non-optimal action $j$ that will never re-enter future policies or bases after
$$\frac{m}{1-\gamma} \cdot \log\left(\frac{m^2}{1-\gamma}\right)$$
pivot steps.
There are at most $(n - m)$ such non-optimal actions to eliminate from appearance in any future policies generated by the simplex or policy-iteration method.
The proof relies on duality: the reduced-cost vector at the current policy and the optimal reduced-cost vector provide a lower and an upper bound for a non-optimal action when the greedy rule is used.

Improvement and extension
Hansen, Miltersen and Zwick (2011): The policy iteration method terminates in no more than
$$\frac{n}{1-\gamma} \cdot \log\left(\frac{m^2}{1-\gamma}\right)$$
steps.
The simplex and policy iteration methods, with the greedy pivoting rule, are strongly polynomial-time algorithms for the Turn-Based Two-Person Zero-Sum Stochastic Game with any fixed discount factor, a problem that cannot even be formulated as an LP.

A Turn-Based Zero-Sum Game

Improvement and extension
Kitahara and Mizuno (2011) extended the bound to solving general non-degenerate LPs:
$$\min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} a_j x_j = b, \qquad x_j \ge 0,\ \forall j.$$
The simplex method terminates in at most
$$O\left(\frac{mn}{\sigma} \log\frac{m}{\sigma}\right)$$
pivot steps, when the ratio of the minimum value over the maximum value, in all basic feasible solution entries, is bounded below by $\sigma$.

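Deterministic MDP with discounts
Distribution vector $p_j \in \mathbb{R}^m$ contains exactly one 1 and 0 everywhere else.
$$y_i = \min\{ c_j + \gamma_j p_j^T y,\ j \in A_i \},\ \forall i, \qquad \pi_i = \arg\min\{ c_j + \gamma_j p_j^T y,\ j \in A_i \},\ \forall i.$$
$$\max \sum_{i=1}^{m} y_i \quad \text{s.t.} \quad y_i \le c_j + \gamma_j p_j^T y,\ \forall j \in A_i;\ \forall i.$$
It has uniform discounts if all $\gamma_j$ are identical.
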
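The dual resembles generalized flow
$$\min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} (e_{ij} - \gamma_j p_{ij}) x_j = 1,\ \forall i, \qquad x_j \ge 0,\ \forall j,$$
where $e_{ij} = 1$ if $j \in A_i$ and 0 otherwise.
Dual variable $x_j$ represents the expected action flow or frequency, that is, the expected present value of the number of times action $j$ is chosen.
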
Efficiency of simplex/policy methods
They are not known to be polynomial-time algorithms for deterministic MDP even with uniform discounts.
There are quadratic lower bounds on these methods for solving MDP with uniform discounts.
Ian Post and Ye (2012): The simplex method with the greedy pivot rule terminates in at most $O(m^3 n^2 \log^2 m)$ pivot steps when discount factors are uniform, or in at most $O(m^5 n^3 \log^2 m)$ pivot steps with non-uniform discounts.
We are not yet able to prove such results hold for the policy iteration method.

Policy structures with uniform factors
Each chosen action can be either a path-edge or a cycle-edge.
$x_j \in [1, m]$ if it is a path-action, and $x_j \in [1/(1-\gamma), m/(1-\gamma)]$ if it is a cycle-action, so that they form two possible polynomial layers.

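The two layers can be observed on a small example. In a deterministic MDP a policy is just a successor map, and the flow on the action leaving a state is the discounted number of visits to that state when one unit starts in every state. The sketch below approximates those flows by a truncated discounted sum and checks the path interval $[1, m]$ and the cycle interval $[1/(1-\gamma), m/(1-\gamma)]$. The three-state policy, the truncation horizon, and the function names are assumptions for illustration.

```python
def action_flows(succ, gamma, horizon=200):
    """Discounted visit counts: one token starts in each state and follows succ."""
    m = len(succ)
    x = [0.0] * m
    tokens = list(range(m))
    weight = 1.0
    for _ in range(horizon):
        for s in tokens:
            x[s] += weight          # the action out of state s is used at this step
        tokens = [succ[s] for s in tokens]
        weight *= gamma
    return x


def on_cycle(succ, i):
    """True if following the policy from state i eventually returns to i."""
    k = succ[i]
    for _ in range(len(succ)):
        if k == i:
            return True
        k = succ[k]
    return False


# Hypothetical policy: 0 -> 1 -> 2 -> 2 (a path feeding a self-loop).
succ = [1, 2, 2]
gamma = 0.5
m = len(succ)
x = action_flows(succ, gamma)

for i, flow in enumerate(x):
    kind = "cycle" if on_cycle(succ, i) else "path"
    print(i, kind, round(flow, 6))
```

Here the two path actions carry flows within $[1, 3]$ and the cycle action carries a flow within $[2, 6]$, and all flows together sum to $m/(1-\gamma)$.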
Roadmap of proof
1. There are two types of pivots: the newly chosen action is either on a path or on a cycle of the new policy.
2. In every $m^2 n \log(m)$ consecutive pivot steps, there must be at least one step that is a cycle pivot.
3. After every $m \log(m)$ cycle pivot steps, there is an action that would never re-enter as a cycle or path action.
4. There are at most $n$ actions for such a downgrade.
The Item 2 result remains true when discounts are not uniform, but the others do not hold.

Policy structures of general factors
The flow value of $x_j$ depends on the smallest discount factor (the dominating factor $\gamma_a$) on the same cycle.
There are $n$ different discount factors, so that there are $n$ possible different polynomial layers of the $x_j$'s.

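Decomposed s-dual of MDP-LP
$$\min \sum_{j=1}^{n} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{n} (e_{sj} - \gamma_j p_{sj}) x_j = 1, \qquad \sum_{j=1}^{n} (e_{ij} - \gamma_j p_{ij}) x_j = 0,\ \forall i \ne s, \qquad x_j \ge 0,\ \forall j.$$
There are $m$ such dual LPs, and the optimal policy is also optimal for each of them.
The flows $x$ of a given policy on each s-dual form a single path+cycle or a single cycle.
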
Roadmap of Proof
Let $(s, \gamma_a)$ denote a policy where the cycle for the s-dual is dominated by $\gamma_a$.
In every $m^2 n \log(m)$ consecutive pivot steps, there must be at least one step that is a cycle pivot.
After every $m^2 \log(m)$ cycle pivot steps, there is an action that would never re-enter to form a $(s, \gamma_a)$ policy.
There are at most $nm$ such combinations, and at most $n$ actions for such a downgrade.
This gives the overall pivot step bound.

Remarks and Open Problems
Is the policy iteration method a strongly polynomial-time algorithm for deterministic MDP?
Is there a strongly polynomial simplex method for the deterministic turn-based stochastic game?
Is there a strongly polynomial-time algorithm for MDP with variable discounts, generalized network flow, or even LP?
Can LPs of huge size (billion-dimension) be solved in practice?

Linear Programming and the Simplex Method: Story Continues