Chapter 4: Dynamic Programming

Objectives of this chapter:
Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
Show how DP can be used to compute value functions, and hence, optimal policies
Discuss efficiency and utility of DP

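Policy Evaluation
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the state-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{ R \mid s_0 = s \} = E_\pi\left\{ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s \right\}$
Bellman equation for $V^\pi$:
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
This is a system of $|S|$ simultaneous linear equations.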

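Iterative Methods
$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$
Each arrow is a "sweep". A sweep consists of applying a backup operation to each state.
A full policy-evaluation backup:
$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$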

Iterative Policy Evaluation
[Algorithm box; not included in the transcription.]
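The full policy-evaluation backup, swept over every state until the values stop changing, can be sketched in Python as follows. This is a minimal sketch rather than the algorithm box from the slide: the `mdp[s][a]` layout of `(prob, next_state, reward)` triples, the stopping threshold `theta`, and the two-state toy problem are all assumptions made for illustration.

```python
def policy_evaluation(mdp, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place sweeps.

    mdp[s][a]    -> list of (prob, next_state, reward) triples (hypothetical layout)
    policy[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:  # one sweep: back up every state once
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in mdp[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # largest change in this sweep is below the threshold
            return V


# Two-state toy problem (an assumption for illustration, not from the slides).
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
always_stay = {"A": {"stay": 1.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}
V_stay = policy_evaluation(toy_mdp, always_stay, gamma=0.9)
```

Under the always-stay policy, state B collects a reward of 2 forever, so its value approaches 2 / (1 - 0.9) = 20, while state A stays at 0.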

A Small Gridworld
An undiscounted episodic task
Nonterminal states: 1, 2, ..., 14
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is -1 until the terminal state is reached

Iterative Policy Evaluation for the Small Gridworld
$\pi$ = equiprobable random action choices
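The small gridworld is simple enough to evaluate directly. The sketch below assumes a 4x4 grid with the terminal state in the top-left and bottom-right corners, a reward of -1 per step, and no discounting; it uses synchronous sweeps (each sweep reads only the previous sweep's values). The function name and the threshold `theta` are illustrative choices.

```python
def small_gridworld_values(theta=1e-6):
    """Evaluate the equiprobable random policy on a 4x4 gridworld (gamma = 1)."""
    n = 4
    terminals = {(0, 0), (n - 1, n - 1)}  # the single terminal state, drawn twice
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = [[0.0] * n for _ in range(n)]
    while True:
        delta = 0.0
        new_V = [row[:] for row in V]
        for r in range(n):
            for c in range(n):
                if (r, c) in terminals:
                    continue  # terminal value stays at 0
                total = 0.0
                for dr, dc in moves:
                    r2, c2 = r + dr, c + dc
                    if not (0 <= r2 < n and 0 <= c2 < n):
                        r2, c2 = r, c  # off-grid moves leave the state unchanged
                    total += 0.25 * (-1.0 + V[r2][c2])
                new_V[r][c] = total
                delta = max(delta, abs(total - V[r][c]))
        V = new_V
        if delta < theta:
            return V


grid_V = small_gridworld_values()
for row in grid_V:
    print([round(v) for v in row])
```

The rounded values in the top row come out as 0, -14, -20, -22, which can be checked by hand for the state next to the corner: -1 + (-14 - 18 + 0 - 20) / 4 = -14.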

Policy Improvement
Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.
For a given state $s$, would it be better to do an action $a \neq \pi(s)$?
The value of doing $a$ in state $s$ is:
$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
It is better to switch to action $a$ for state $s$ if and only if $Q^\pi(s, a) > V^\pi(s)$.

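Policy Improvement Theorem
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in S$,
$Q^\pi(s, \pi'(s)) \ge V^\pi(s)$
Then the policy $\pi'$ must be as good as, or better than, $\pi$: $V^{\pi'}(s) \ge V^\pi(s)$ for all $s \in S$.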

Policy Improvement Cont.
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Then $V^{\pi'} \ge V^\pi$.
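Greedy improvement is a one-step lookahead followed by an argmax over actions. The sketch below assumes a hypothetical `mdp[s][a]` layout of `(prob, next_state, reward)` triples and a two-state toy problem; the value function passed in is that of the "always stay" policy with $\gamma = 0.9$ (0 for A, and 2 / (1 - 0.9) = 20 for B).

```python
def greedy_policy(mdp, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    def q(s, a):
        # One-step lookahead: expected reward plus discounted next-state value.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

    return {s: max(mdp[s], key=lambda a: q(s, a)) for s in mdp}


# Two-state toy problem (an assumption for illustration, not from the slides).
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_stay = {"A": 0.0, "B": 20.0}  # values of the always-stay policy, gamma = 0.9
greedy = greedy_policy(toy_mdp, V_stay, gamma=0.9)
```

In state A, staying is worth 0 and going is worth 1 + 0.9 * 20 = 19, so the greedy policy switches A to `go` and keeps `stay` in B.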

Policy Improvement Cont.
What if $V^{\pi'} = V^\pi$? That is, for all $s \in S$,
$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$?
But this is the Bellman Optimality Equation.
So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.

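Policy Iteration
$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$
Each arrow from a policy to its value function is a policy evaluation step; each arrow from a value function to the next policy is a policy improvement ("greedification") step.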

Policy Iteration
[Algorithm box; not included in the transcription.]
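Alternating evaluation and greedy improvement until the policy stops changing can be sketched as below. This is a minimal sketch under stated assumptions, not the algorithm box from the slide: the `mdp[s][a]` layout of `(prob, next_state, reward)` triples, the helper names, the tolerance used to break ties, and the toy problem are all hypothetical.

```python
def evaluate(mdp, policy, gamma, theta=1e-10):
    """Iterative evaluation of a deterministic policy (policy[s] is an action)."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def q_value(mdp, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])


def policy_iteration(mdp, gamma=0.9):
    policy = {s: next(iter(mdp[s])) for s in mdp}  # arbitrary initial policy
    while True:
        V = evaluate(mdp, policy, gamma)  # policy evaluation
        stable = True
        for s in mdp:  # policy improvement
            best = max(mdp[s], key=lambda a: q_value(mdp, V, s, a, gamma))
            # Switch only on a strict improvement, so ties cannot cause cycling.
            if q_value(mdp, V, s, best, gamma) > q_value(mdp, V, s, policy[s], gamma) + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Two-state toy problem (an assumption for illustration, not from the slides).
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
pi_star, V_star = policy_iteration(toy_mdp, gamma=0.9)
```

On this toy problem the loop stops after two evaluations: the initial always-stay policy is improved once (A switches to `go`), and the second improvement step finds nothing to change.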

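Value Iteration
Recall the full policy-evaluation backup:
$V_{k+1}(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$
Here is the full value-iteration backup:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$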

Value Iteration Cont.
[Algorithm box; not included in the transcription.]
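Replacing the policy-weighted sum with a max over actions gives value iteration. The sketch below is one possible rendering, not the algorithm box from the slide; the `mdp[s][a]` layout of `(prob, next_state, reward)` triples, the threshold `theta`, and the toy problem are assumptions made for illustration.

```python
def value_iteration(mdp, gamma=0.9, theta=1e-10):
    """Value iteration with in-place sweeps; returns (V, greedy policy)."""
    V = {s: 0.0 for s in mdp}

    def backup(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

    while True:
        delta = 0.0
        for s in mdp:
            v_new = max(backup(s, a) for a in mdp[s])  # backup with a max
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off a deterministic policy that is greedy with respect to the final V.
    policy = {s: max(mdp[s], key=lambda a: backup(s, a)) for s in mdp}
    return V, policy


# Two-state toy problem (an assumption for illustration, not from the slides).
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_vi, pi_vi = value_iteration(toy_mdp, gamma=0.9)
```

Unlike policy iteration, no policy is stored during the sweeps; the policy is extracted only once, at the end.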

Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set.
Asynchronous DP does not use sweeps. Instead it works like this:
Repeat until a convergence criterion is met:
  Pick a state at random and apply the appropriate backup
Still need lots of computation, but does not get locked into hopelessly long sweeps
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
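A minimal sketch of the random-state variant, using the value-iteration backup. For simplicity it runs a fixed budget of backups instead of testing a convergence criterion; that budget, the seed, the `mdp[s][a]` layout of `(prob, next_state, reward)` triples, and the toy problem are all assumptions made for illustration.

```python
import random


def async_value_iteration(mdp, gamma=0.9, n_backups=5000, seed=0):
    """Back up one randomly chosen state at a time instead of sweeping."""
    rng = random.Random(seed)
    states = list(mdp)
    V = {s: 0.0 for s in states}
    for _ in range(n_backups):
        s = rng.choice(states)  # pick a state at random
        V[s] = max(             # apply the value-iteration backup to it alone
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in mdp[s].values()
        )
    return V


# Two-state toy problem (an assumption for illustration, not from the slides).
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_async = async_value_iteration(toy_mdp)
```

Swapping `rng.choice(states)` for the states an agent actually visits is one way to let experience guide which states get backed up.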

Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
A geometric metaphor for convergence of GPI: [figure not included in the transcription]

Linear Programming
Since $\lim_{k \to \infty} T^k V = V^*$ for all $V$, we have
$V \ge TV \Rightarrow V \ge V^*$ for all $V$
Thus $V^*$ is the smallest $V$ that satisfies the constraint $V \ge TV$.
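Written out state by state, the constraint becomes one linear inequality per state-action pair. The formulation below is a sketch that assumes $T$ is the Bellman optimality backup (the slide does not define it) and uses uniform weights on the states in the objective, which is an illustrative choice; any strictly positive weights give the same minimizer.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{s \in S} V(s) \\
\text{subject to}\quad & V(s) \ge \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V(s') \right]
  \quad \text{for all } s \in S,\ a \in A(s)
\end{aligned}
```

The max over actions in the backup is what makes $V \ge TV$ nonlinear; requiring the inequality for every action separately is equivalent and keeps each constraint linear in $V$.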

Efficiency of DP
To find an optimal policy is polynomial in the number of states...
BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Efficiency of DP and LP
Total number of deterministic policies: $|A|^{|S|}$
DP methods are polynomial-time algorithms
VI (each iteration): $O(|S|^2 |A|)$
PI (each iteration) = the cost of policy evaluation + the cost of policy improvement
  Policy evaluation as a linear system of equations: $O(|S|^3)$ or $O(|S|^{2.807})$
  Policy evaluation done iteratively: $O\left( |S|^2 \frac{\log(1/\theta)}{\log(1/\gamma)} \right)$
  Policy improvement: $O(|S|^2 |A|)$
Each iteration of PI is computationally more expensive than each iteration of VI
PI typically requires fewer iterations to converge than VI
Exponentially faster than any direct search in policy space
Number of states often grows exponentially with the number of state variables

Efficiency of LP
LP methods:
Their worst-case convergence guarantees are better than those of DP methods
Become impractical at a much smaller number of states than do DP methods

Summary
Policy evaluation: backups without a max
Policy improvement: form a greedy policy, if only locally
Policy iteration: alternate the above two processes
Value iteration: backups with a max
Full backups (to be contrasted later with sample backups)
Generalized Policy Iteration (GPI)
Asynchronous DP: a way to avoid exhaustive sweeps
Bootstrapping: updating estimates based on other estimates