Neuro-Dynamic Programming: An Overview




Transcription:

1 Neuro-Dynamic Programming: An Overview Dimitri Bertsekas, Dept. of Electrical Engineering and Computer Science, M.I.T., September 2006

2 BELLMAN AND THE DUAL CURSES Dynamic Programming (DP) is very broadly applicable, but it suffers from: Curse of dimensionality Curse of modeling We address complexity by using low-dimensional parametric approximations We allow simulators in place of models Unlimited applications in planning, resource allocation, stochastic control, discrete optimization Application is an art, but one guided by substantial theory

3 OUTLINE Main NDP framework Discussion of two classes of methods, based on approximate value/policy iteration: Actor-critic methods / LSPE Rollout algorithms Additional classes of methods (not discussed): approximate linear programming, approximation in policy space References: Neuro-Dynamic Programming (1996, Bertsekas + Tsitsiklis) Reinforcement Learning (1998, Sutton + Barto) Dynamic Programming and Optimal Control, 3rd Edition (2006, Bertsekas) Recent papers with V. Borkar, A. Nedic, and J. Yu Papers and this talk can be downloaded from http://web.mit.edu/dimitrib/www/home.html

4 DYNAMIC PROGRAMMING / DECISION AND CONTROL Main ingredients: Dynamic system; state evolving in discrete time Decision/control applied at each time Cost is incurred at each time There may be noise & model uncertainty There is state feedback used to determine the control [Figure: the decision/control is applied to the system, and the resulting state is returned to the decision maker through a feedback loop]

5 APPLICATIONS Extremely broad range Sequential decision contexts Planning (shortest paths, schedules, route planning, supply chain) Resource allocation over time (maintenance, power generation) Finance (investment over time, optimal stopping/option valuation) Automatic control (vehicles, machines) Nonsequential decision contexts Combinatorial/discrete optimization (break the solution down into stages) Branch and bound / integer programming Applies to both deterministic and stochastic problems

6 ESSENTIAL TRADEOFF CAPTURED BY DP Decisions are made in stages The decision at each stage: Determines the present stage cost Affects the context within which future decisions are made At each stage we must trade off a low present stage cost against the undesirability of high future costs

7 KEY DP RESULT: BELLMAN'S EQUATION The optimal decision at the current state minimizes the expected value of Current stage cost + Future stages cost (starting from the next state, using an optimal policy) Extensive mathematical methodology Applies to both discrete and continuous systems (and hybrids) Dual curses of dimensionality/modeling
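Bellman's equation can be made concrete with a tiny example. Below is a minimal sketch of value iteration on a hypothetical 2-state, 2-action discounted MDP; all costs and transition probabilities are made up for illustration, and the computed cost vector satisfies Bellman's equation to within tolerance.

```python
# A minimal sketch of Bellman's equation at work, on a hypothetical
# 2-state, 2-action discounted MDP (all numbers below are made up).

ALPHA = 0.9                       # discount factor
STATES = (0, 1)
ACTIONS = (0, 1)
g = [[1.0, 2.0],                  # g[i][u]: stage cost at state i, action u
     [0.5, 3.0]]
P = [[[0.8, 0.2], [0.3, 0.7]],    # P[i][u][j]: transition probability i -> j
     [[0.5, 0.5], [0.9, 0.1]]]

def bellman_backup(J):
    """One sweep of J(i) <- min_u [ g(i,u) + ALPHA * E[J(next state)] ]."""
    return [min(g[i][u] + ALPHA * sum(P[i][u][j] * J[j] for j in STATES)
                for u in ACTIONS)
            for i in STATES]

def value_iteration(tol=1e-10):
    """Iterate the backup to the fixed point of Bellman's equation."""
    J = [0.0, 0.0]
    while True:
        J_new = bellman_backup(J)
        if max(abs(a - b) for a, b in zip(J, J_new)) < tol:
            return J_new
        J = J_new

J_star = value_iteration()   # satisfies J* = T(J*) to within tolerance
```

The curse of dimensionality is visible even here: the backup touches every state, which is exactly what becomes infeasible for large state spaces.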

8 KEY NDP IDEA Use one-step lookahead with an approximate cost At the current state select decision that minimizes the expected value of Current stage cost + Approximate future stages cost (starting from the next state) Important issues: How to construct the approximate cost of a state How to understand and control the effects of approximation
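The one-step lookahead idea is easy to sketch. The example below uses a hypothetical 2-state, 2-action discounted MDP (data made up); `J_approx` stands in for whatever approximation architecture supplies the approximate future cost.

```python
# A minimal sketch of one-step lookahead with an approximate cost-to-go,
# on a hypothetical 2-state, 2-action discounted MDP (data made up).

ALPHA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)
g = [[1.0, 2.0], [0.5, 3.0]]               # g[i][u]: stage cost
P = [[[0.8, 0.2], [0.3, 0.7]],             # P[i][u][j]: transition probs
     [[0.5, 0.5], [0.9, 0.1]]]

def lookahead_decision(i, J_approx):
    """Minimize current stage cost + discounted approximate future cost."""
    def q(u):
        return g[i][u] + ALPHA * sum(P[i][u][j] * J_approx(j) for j in STATES)
    return min(ACTIONS, key=q)
```

The quality of the approximate cost controls the decision: at state 0, an approximation that rates state 0 as expensive (say 10 versus 0 for state 1) steers the lookahead to the action that reaches state 1 more often, while the reverse rating keeps the cheap first action.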

9 METHODS TO COMPUTE AN APPROXIMATE COST Parametric approximation algorithms Use a functional approximation to the optimal cost; e.g., a linear combination of basis functions Select the weights of the approximation One possibility: Hand-tuning and trial and error Systematic DP-related policy and value iteration methods (TD(λ), Q-learning, LSPE, LSTD, etc.) Rollout algorithms Use the cost of a heuristic (or a lower bound) as the cost approximation Obtain this cost by simulation, starting from the state of interest

10 SIMULATION AND LEARNING Simulation (learning by experience): used to compute the (approximate) cost-to-go -- a key distinctive aspect of NDP Important advantage: detailed system model not necessary - use a simulator instead In case of parametric approximation: off-line learning In case of a rollout algorithm: on-line learning (we learn only the cost values needed by online simulation)

11 PARAMETRIC APPROXIMATION: CHESS PARADIGM Chess-playing computer programs State = board position Score of position: important features, appropriately weighted [Figure: position evaluator. Feature extraction (material balance, mobility, safety, etc.) feeds a scoring function that outputs the score of the position]

12 TRAINING In chess: Weights are hand-tuned In more sophisticated methods: Weights are determined by simulation-based training algorithms: Temporal Differences TD(λ), Q-Learning, Least Squares Policy Evaluation LSPE(λ), Least Squares Temporal Differences LSTD(λ), etc. All of these methods are based on the DP ideas of policy iteration and value iteration
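As one illustration of simulation-based training, here is a minimal TD(0) sketch with a linear architecture, evaluating a fixed policy on a hypothetical 2-state Markov chain (all numbers made up). With one indicator feature per state, the weights approach the exact policy costs, up to simulation noise from the constant step size.

```python
import random

# A minimal sketch of simulation-based training: TD(0) with a linear
# architecture J(i) ~ phi(i)'r, evaluating a fixed policy on a
# hypothetical 2-state Markov chain (all numbers are made up).
random.seed(0)

ALPHA = 0.9                       # discount factor
STEP = 0.05                       # constant learning step size
P = [[0.8, 0.2], [0.3, 0.7]]      # transition matrix under the policy
g = [1.0, 0.5]                    # stage cost in each state
phi = [[1.0, 0.0], [0.0, 1.0]]    # feature vectors (tabular special case)

def td0(num_steps=20000):
    r = [0.0, 0.0]
    i = 0
    for _ in range(num_steps):
        j = 0 if random.random() < P[i][0] else 1          # simulate a transition
        J_i = sum(f * w for f, w in zip(phi[i], r))        # estimate at i
        J_j = sum(f * w for f, w in zip(phi[j], r))        # estimate at j
        d = g[i] + ALPHA * J_j - J_i                       # temporal difference
        r = [w + STEP * d * f for w, f in zip(r, phi[i])]  # gradient-like update
        i = j
    return r

# Exact policy costs solve J = g + ALPHA * P * J: roughly (8.36, 7.45) here.
weights = td0()
```

The same update with general (non-indicator) features gives the linear-architecture TD(0) discussed in the policy evaluation slides below.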

13 POLICY IMPROVEMENT PRINCIPLE Given a current policy, define a new policy as follows: At each state, minimize Current stage cost + cost-to-go of the current policy (starting from the next state) Policy improvement result: The new policy has improved performance over the current policy If the cost-to-go is approximate, the improvement is approximate Oscillation around the optimum; error bounds apply
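A single application of the policy improvement principle can be sketched as follows, on the same kind of hypothetical 2-state, 2-action discounted MDP (data made up): the current policy is evaluated exactly by solving its linear Bellman equation, then a new policy is formed by the minimization above.

```python
# A minimal sketch of the policy improvement principle on a hypothetical
# 2-state, 2-action discounted MDP (data made up).

ALPHA = 0.9
g = [[1.0, 2.0], [0.5, 3.0]]               # g[i][u]: stage cost
P = [[[0.8, 0.2], [0.3, 0.7]],             # P[i][u][j]: transition probs
     [[0.5, 0.5], [0.9, 0.1]]]

def evaluate(mu):
    """Solve J = g_mu + ALPHA * P_mu J for two states (Cramer's rule)."""
    a11 = 1.0 - ALPHA * P[0][mu[0]][0]
    a12 = -ALPHA * P[0][mu[0]][1]
    a21 = -ALPHA * P[1][mu[1]][0]
    a22 = 1.0 - ALPHA * P[1][mu[1]][1]
    b1, b2 = g[0][mu[0]], g[1][mu[1]]
    det = a11 * a22 - a12 * a21
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det]

def improve(mu):
    """New policy: minimize stage cost + discounted cost-to-go of mu."""
    J = evaluate(mu)
    return [min((0, 1), key=lambda u: g[i][u] + ALPHA * sum(
        P[i][u][j] * J[j] for j in (0, 1))) for i in (0, 1)]
```

With this data, `improve([1, 1])` yields `[0, 0]`, whose evaluated cost is lower in both states, as the policy improvement result guarantees for exact evaluation.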

14 ACTOR/CRITIC SYSTEMS Metaphor for policy evaluation/improvement Actor implements current policy Critic evaluates the performance; passes feedback to the actor Actor changes/improves policy [Figure: the actor (decision generator) drives the system simulator; simulation data go to the critic (performance evaluation), whose approximate scoring function is fed back to the actor]

15 APPROXIMATE POLICY EVALUATION Consider a stationary policy µ with cost function J It satisfies Bellman's equation: J = T(J) = g_µ + αP_µJ (discounted case) Subspace approximation J ~ Φr Φ: matrix of basis functions r: parameter vector

16 DIRECT AND INDIRECT APPROACHES Direct: Approximate the cost. Use simulated cost samples and a least-squares fit, J ~ ΠJ, where Π is projection on S, the subspace spanned by the basis functions [Figure: projection of the cost vector J onto S] Indirect: Approximate the equation. Solve a projected form of Bellman's equation, Φr = ΠT(Φr) [Figure: fixed point of the projected mapping ΠT on S]

17 DIRECT APPROACH Minimize over r the least-squares criterion Σ_i (simulated cost sample of J(i) − (Φr)_i)² Each state is weighted proportionally to its appearance in the simulation Works even with nonlinear function approximation (in place of Φr) Gradient or special least-squares methods can be used Problem: the cost samples have large error variance
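A minimal sketch of the direct approach, on a hypothetical 2-state Markov chain under a fixed policy (data made up): simulate discounted cost samples from each start state, then least-squares fit the linear architecture. With one indicator feature per state, the least-squares minimizer is simply the per-state sample mean.

```python
import random

# Direct approach sketch: fit cost samples by least squares on a
# hypothetical 2-state chain (data made up). With indicator features,
# the minimizer of sum (sample - (Phi r)_i)^2 is the per-state mean.
random.seed(1)

ALPHA = 0.9
P = [[0.8, 0.2], [0.3, 0.7]]      # transition matrix under the policy
g = [1.0, 0.5]                    # stage cost in each state

def cost_sample(i, horizon=200):
    """One simulated discounted-cost sample starting from state i."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):      # ALPHA**200 is negligible
        total += disc * g[i]
        disc *= ALPHA
        i = 0 if random.random() < P[i][0] else 1
    return total

def direct_fit(samples_per_state=2000):
    """Least-squares fit of the samples (here, the sample mean per state)."""
    return [sum(cost_sample(i) for _ in range(samples_per_state))
            / samples_per_state for i in (0, 1)]

r = direct_fit()   # close to the exact costs, roughly (8.36, 7.45)
```

The error-variance issue on the slide is visible here: each sample is a long random discounted sum, so many samples per state are needed for a tight fit.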

18 INDIRECT POLICY EVALUATION The most popular simulation-based methods solve the Projected Bellman Equation (PBE) TD(λ): (Sutton 1988) - stochastic approximation method LSTD(λ): (Bradtke & Barto 1996, Boyan 2002) - solves by matrix inversion a simulation-generated approximation to the PBE; optimal convergence rate (Konda 2002) LSPE(λ): (Bertsekas, Ioffe 1996, Borkar, Nedic 2004, Yu 2006) - uses projected value iteration to find the fixed point of the PBE We now focus on LSPE

19 LEAST SQUARES POLICY EVALUATION (LSPE) Consider an α-discounted Markov decision problem (finite state and control spaces) We want to approximate the solution of Bellman's equation: J = T(J) = g_µ + αP_µJ LSPE solves the projected Bellman equation Φr = ΠT(Φr), with S the subspace spanned by the basis functions [Figure: fixed point of the projected mapping ΠT on S]

20 PROJECTED VALUE ITERATION Value iteration: J_{t+1} = T(J_t) Projected value iteration: Φr_{t+1} = ΠT(Φr_t) where Φ is a matrix of basis functions and Π is projection with respect to some weighted Euclidean norm Norm mismatch issue: Π is nonexpansive with respect to the weighted Euclidean norm, while T is a contraction with respect to the sup norm Key question: When is ΠT a contraction with respect to some norm?
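Projected value iteration can be sketched on a hypothetical 2-state chain with a single basis function (data made up). Here the projection is weighted by the chain's steady-state distribution, under which ΠT is a contraction for discounted problems, so the iteration converges to the unique solution of Φr = ΠT(Φr).

```python
# A minimal sketch of projected value iteration, Phi r_{t+1} = Pi T(Phi r_t),
# on a hypothetical 2-state chain with one basis function (data made up).

ALPHA = 0.9
P = [[0.8, 0.2], [0.3, 0.7]]   # transition matrix under the policy
g = [1.0, 0.5]                 # stage cost in each state
phi = [1.0, 2.0]               # single basis function: (Phi r)(i) = phi[i]*r
dist = [0.6, 0.4]              # steady-state distribution (solves d P = d)

def T(J):
    """Bellman operator for the policy: T(J) = g + ALPHA * P * J."""
    return [g[i] + ALPHA * sum(P[i][j] * J[j] for j in (0, 1)) for i in (0, 1)]

def project(x):
    """Coefficient of the projection of x on span(phi), weighted by dist."""
    num = sum(dist[i] * phi[i] * x[i] for i in (0, 1))
    den = sum(dist[i] * phi[i] ** 2 for i in (0, 1))
    return num / den

def projected_value_iteration(iters=200):
    r = 0.0
    for _ in range(iters):
        r = project(T([phi[0] * r, phi[1] * r]))
    return r
```

In this instance the composite mapping contracts by a factor of about 0.85 per iteration and converges to r ≈ 3.049, the fixed point of the projected Bellman equation.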

21 PROJECTION W/ RESPECT TO DISTRIBUTION NORM Consider the steady-state distribution norm: the ith component is weighted by the steady-state probability of state i in the Markov chain corresponding to the policy being evaluated Projection with respect to this norm can be approximated by simulation-based least squares Remarkable fact: If Π is projection with respect to the distribution norm, then ΠT is a contraction for discounted problems The average-cost case is more complicated (Yu and Bertsekas, 2006)

22 LSPE: SIMULATION-BASED IMPLEMENTATION Simulation-based implementation of Φr_{t+1} = ΠT(Φr_t) with an infinitely long trajectory and least squares: Φr_{t+1} = ΠT(Φr_t) + diminishing simulation noise Interesting convergence theory (see the papers at the web site above) Optimal convergence rate; much better than TD(λ), same as LSTD (Yu and Bertsekas, 2006)

23 SUMMARY OF ACTOR-CRITIC SYSTEMS A lot of mathematical analysis, insight, and practical experience is now available There is solid theory for policy evaluation methods with linear function approximation: TD(λ), LSPE(λ), LSTD(λ) Typically, improved policies are obtained early, then oscillation near the optimum On-line computation is small Training is challenging and time-consuming Less suitable when problem data changes frequently

24 ROLLOUT POLICIES: BACKGAMMON PARADIGM On-line (approximate) cost-to-go calculation by simulation of some base policy (heuristic) Rollout: Use the action with the best simulation results Rollout is one-step policy iteration [Figure: each possible move is scored by its average outcome under Monte-Carlo simulation]
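The backgammon paradigm can be sketched on a hypothetical 2-state, 2-action discounted MDP (data made up): each candidate action is scored by Monte-Carlo simulation of a fixed base policy from the resulting next state, and the best-scoring action is used.

```python
import random

# A minimal rollout sketch on a hypothetical 2-state, 2-action
# discounted MDP (data made up): score each action by Monte-Carlo
# simulation of a fixed base policy, then pick the best average score.
random.seed(2)

ALPHA = 0.9
g = [[1.0, 2.0], [0.5, 3.0]]               # g[i][u]: stage cost
P = [[[0.8, 0.2], [0.3, 0.7]],             # P[i][u][j]: transition probs
     [[0.5, 0.5], [0.9, 0.1]]]
base_policy = [0, 0]                       # the heuristic being rolled out

def step(i, u):
    """Sample the next state after applying u at i."""
    return 0 if random.random() < P[i][u][0] else 1

def simulate_base(i, horizon=100):
    """Discounted cost of following the base policy from state i."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        u = base_policy[i]
        total += disc * g[i][u]
        disc *= ALPHA
        i = step(i, u)
    return total

def rollout_action(i, num_sims=500):
    """Best action by simulation: one step of policy iteration, on-line."""
    def score(u):
        future = sum(simulate_base(step(i, u)) for _ in range(num_sims)) / num_sims
        return g[i][u] + ALPHA * future
    return min((0, 1), key=score)
```

Note the computational burden mentioned on the next slide: every decision costs `num_sims` full simulations per candidate action.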

25 COST IMPROVEMENT PROPERTY Generic result: Rollout improves on base heuristic A special case of policy iteration/policy improvement In practice, substantial improvements over the base heuristic(s) have been observed Major drawback: Extensive Monte-Carlo simulation Extension to multiple heuristics: From each next state, run multiple heuristics Use as value of the next state the best heuristic value Cost improvement: The rollout algorithm performs at least as well as each of the base heuristics Interesting special cases: The classical open-loop feedback control policy (base heuristic is the optimal open-loop policy) Model predictive control (major applications in control systems)

26 STOCHASTIC PROBLEMS Major issue: Computational burden of Monte-Carlo simulation Motivation to use approximate Monte-Carlo Approximate Monte-Carlo by certainty equivalence: Assume future unknown quantities are fixed at some typical values Advantage: Single simulation run per next state, but some loss of optimality Extension to multiple scenarios (see Bertsekas and Castanon, 1997)
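Certainty-equivalent evaluation can be sketched on a hypothetical 2-state chain (data made up): instead of averaging many random trajectories, future transitions are fixed at typical values (here, the most likely next state), so a single deterministic run per start state suffices.

```python
# A minimal certainty-equivalence sketch on a hypothetical 2-state chain
# (data made up): future transitions are fixed at their most likely
# values, so one deterministic run replaces a Monte-Carlo average.

ALPHA = 0.9
P = [[0.8, 0.2], [0.3, 0.7]]   # transition matrix under the base policy
g = [1.0, 0.5]                 # stage cost in each state

def ce_cost(i, horizon=200):
    """One deterministic run with certainty-equivalent transitions."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        total += disc * g[i]
        disc *= ALPHA
        i = max((0, 1), key=lambda j: P[i][j])   # most likely next state
    return total
```

Here `ce_cost` gives about 10.0 and 5.0 for the two states, versus exact expected costs of roughly 8.36 and 7.45: a single run per state, at some loss of accuracy, matching the tradeoff on the slide.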

27 DETERMINISTIC PROBLEMS ONLY ONE simulation trajectory needed Use heuristic(s) for approximate cost-to-go calculation At each state, consider all possible next states, and run the heuristic(s) from each Select the next state with best heuristic cost Straightforward to implement Cost improvement results are sharper (Bertsekas, Tsitsiklis, Wu, 1997, Bertsekas 2005) Extension to constrained problems

28 ROLLOUT ALGORITHM PROPERTIES Forward looking (the heuristic runs to the end) Self-correcting (the heuristic is reapplied at each time step) Suitable for on-line use Suitable for replanning Suitable for situations where the problem data are a priori unknown Substantial positive experience with many types of optimization problems, including combinatorial (e.g., scheduling)

29 RELATION TO MODEL PREDICTIVE CONTROL Motivation: Deal with state/control constraints Basic MPC framework Deterministic discrete-time system x_{k+1} = f(x_k, u_k) Control constraint u_k ∈ U, state constraint x_k ∈ X Quadratic cost per stage: x′Qx + u′Ru MPC operation: At the typical state x Drive the state to 0 in m stages with minimum quadratic cost, while observing the constraints Apply the first component of the m-stage optimal control sequence, discard the rest Repeat at the next state
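The MPC loop can be sketched for a scalar linear system (all numbers made up). As a simplification, the "drive the state to 0 in m stages" requirement is replaced here by a large terminal penalty, and the m-stage problem is solved by crude grid search over constrained control sequences; only the first control is applied before re-solving, exactly as the slide describes.

```python
# A minimal receding-horizon MPC sketch for a scalar linear system
# x_{k+1} = a*x_k + b*u_k, stage cost x^2 + u^2, control constraint
# |u| <= 1 (all numbers made up). A large terminal penalty stands in
# for the terminal constraint x_m = 0; the 3-stage problem is solved
# by crude grid search.

A, B = 1.2, 1.0                               # unstable open-loop dynamics
U_GRID = [i / 10.0 - 1.0 for i in range(21)]  # admissible controls in [-1, 1]

def plan_cost(x, seq):
    """3-stage quadratic cost plus a terminal penalty approximating x_m = 0."""
    cost = 0.0
    for u in seq:
        cost += x * x + u * u
        x = A * x + B * u
    return cost + 100.0 * x * x

def mpc_control(x):
    """First component of the best 3-stage control sequence (grid search)."""
    best_cost, best_u0 = float("inf"), 0.0
    for u0 in U_GRID:
        for u1 in U_GRID:
            for u2 in U_GRID:
                c = plan_cost(x, (u0, u1, u2))
                if c < best_cost:
                    best_cost, best_u0 = c, u0
    return best_u0

def closed_loop(x0, steps=15):
    """Apply the first control, discard the rest, re-solve at the next state."""
    x = x0
    for _ in range(steps):
        x = A * x + B * mpc_control(x)
    return x
```

Even though each plan is only three stages long and the open-loop system is unstable, repeated re-solving keeps the closed-loop state near the origin (up to the coarse control grid), illustrating the MPC stability result on the next slide.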

30 ADVANTAGES OF MPC It can deal explicitly with state and control constraints It can be implemented using standard deterministic optimal control methodology Key result: The resulting (suboptimal) closed-loop system is stable (under a constrained controllability assumption; Keerthi/Gilbert, 1988) Connection with infinite-time reachability Extension to problems with set-membership description of uncertainty

31 CONNECTION OF MPC AND ROLLOUT MPC <==> Rollout with suitable base heuristic Heuristic: Apply the (m-1)-stage policy that drives the state to 0 with minimum cost Stability of MPC <==> Cost improvement of rollout Base heuristic stable ==> Rollout policy is also stable

32 EXTENSIONS The relation with rollout suggests more general MPC schemes: Nontraditional control and/or state constraints Set-membership disturbances The success of MPC should encourage the use of rollout

33 RESTRICTED STRUCTURE POLICIES General suboptimal control scheme At each time step: Impose restrictions on future information or control Optimize the future under these restrictions Use 1st component of the restricted policy Recompute at the next step Special cases: Rollout, MPC: Restrictions on future control Open-loop feedback control: Restrictions on future information Main result for the suboptimal policy so obtained: It has better performance than the restricted policy

34 CONCLUDING REMARKS NDP is a broadly applicable methodology; addresses optimization problems that are intractable in other ways Many off-line and on-line methods to choose from Interesting theory No need for a detailed model; a simulator suffices Computational requirements are substantial Successful application is an art Many questions remain