# TD(0) Leads to Better Policies than Approximate Value Iteration


Benjamin Van Roy
Management Science and Engineering and Electrical Engineering
Stanford University, Stanford, CA

## Abstract

We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.

## 1 Preliminaries

Consider a discrete-time communicating Markov decision process (MDP) with a finite state space $S = \{1, \ldots, |S|\}$. At each state $x \in S$, there is a finite set $U_x$ of admissible actions. If the current state is $x$ and an action $u \in U_x$ is selected, a cost of $g_u(x)$ is incurred, and the system transitions to a state $y \in S$ with probability $p_{xy}(u)$. For any $x \in S$ and $u \in U_x$, $\sum_{y \in S} p_{xy}(u) = 1$. Costs are discounted at a rate of $\alpha \in (0,1)$ per period. Each instance of such an MDP is defined by a quintuple $(S, U, g, p, \alpha)$.

A (stationary deterministic) policy is a mapping $\mu$ that assigns an action $u \in U_x$ to each state $x \in S$. If actions are selected based on a policy $\mu$, the state follows a Markov process with transition matrix $P_\mu$, whose $(x,y)$th entry is $p_{xy}(\mu(x))$. The restriction to communicating MDPs ensures that it is possible to reach any state from any other state.

Each policy $\mu$ is associated with a cost-to-go function $J_\mu \in \Re^{|S|}$, defined by
$$J_\mu = \sum_{t=0}^{\infty} \alpha^t P_\mu^t g_\mu = (I - \alpha P_\mu)^{-1} g_\mu,$$
where, with some abuse of notation, $g_\mu(x) = g_{\mu(x)}(x)$ for each $x \in S$.

A policy $\mu$ is said to be greedy with respect to a function $J$ if
$$\mu(x) \in \operatorname*{argmin}_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \Big) \quad \text{for all } x \in S.$$

The optimal cost-to-go function $J^* \in \Re^{|S|}$ is defined by $J^*(x) = \min_\mu J_\mu(x)$ for all $x \in S$. A policy $\mu$ is said to be optimal if $J_\mu = J^*$. It is well known that an optimal policy exists. Further, a policy $\mu$ is optimal if and only if it is greedy with respect to $J^*$. Hence, given the optimal cost-to-go function, optimal actions can be computed by minimizing the right-hand side of the above inclusion.
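To make these definitions concrete, here is a minimal sketch (ours, not from the paper; Python with NumPy, a toy MDP of our own choosing, and the simplifying assumption that every action is admissible in every state) of policy evaluation and greedy-policy extraction:

```python
import numpy as np

def policy_cost_to_go(P_mu, g_mu, alpha):
    """Cost-to-go of a fixed policy: J_mu = (I - alpha P_mu)^{-1} g_mu."""
    n = len(g_mu)
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def greedy_policy(P, g, alpha, J):
    """A greedy policy w.r.t. J: mu(x) in argmin_u [g_u(x) + alpha sum_y p_xy(u) J(y)].

    P[u] is the |S| x |S| transition matrix of action u and g[u] its cost vector;
    for simplicity every action is assumed admissible in every state.
    """
    Q = np.stack([g[u] + alpha * (P[u] @ J) for u in range(len(P))])  # |U| x |S|
    return Q.argmin(axis=0)

# Toy two-state, two-action instance (numbers are purely illustrative).
P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.1, 0.9], [0.7, 0.3]])]
g = [np.array([1.0, 2.0]), np.array([0.5, 1.5])]
alpha = 0.95
mu = np.array([0, 1])  # a candidate policy
P_mu = np.array([P[mu[x]][x] for x in range(2)])
g_mu = np.array([g[mu[x]][x] for x in range(2)])
J_mu = policy_cost_to_go(P_mu, g_mu, alpha)
print(greedy_policy(P, g, alpha, J_mu))  # greedy w.r.t. J_mu
```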

Value iteration generates a sequence $J_\ell$ converging to $J^*$ according to $J_{\ell+1} = T J_\ell$, where $T$ is the dynamic programming operator, defined by
$$(TJ)(x) = \min_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y) \Big),$$
for all $x \in S$ and $J \in \Re^{|S|}$. This sequence converges to $J^*$ for any initialization of $J_0$.

## 2 Approximate Value Iteration

The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space $S$ into a manageable number $K$ of disjoint subsets $S_1, \ldots, S_K$ and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation: all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix $\Phi \in \Re^{|S| \times K}$ such that each $k$th column is an indicator function for the $k$th partition $S_k$. Hence, for any $r \in \Re^K$, $k$, and $x \in S_k$, $(\Phi r)(x) = r_k$. In this paper, we study variations of value iteration, each of which computes a vector $\tilde r$ so that $\Phi \tilde r$ approximates $J^*$. The use of the policy $\mu_{\tilde r}$ that is greedy with respect to $\Phi \tilde r$ is justified by the following result (see [10] for a proof):

**Theorem 1** If $\mu$ is a greedy policy with respect to a function $J \in \Re^{|S|}$, then
$$\|J_\mu - J^*\|_\infty \le \frac{2\alpha}{1-\alpha} \|J - J^*\|_\infty.$$

One common way of approximating a function $J \in \Re^{|S|}$ with a function of the form $\Phi r$ involves projection with respect to a weighted Euclidean norm $\|\cdot\|_{2,\pi}$, defined by
$$\|J\|_{2,\pi} = \Big( \sum_{x \in S} \pi(x) J^2(x) \Big)^{1/2},$$
where $\pi \in \Re_+^{|S|}$ is a vector of weights that assign relative emphasis among states. The projection $\Pi_\pi J$ is the function $\Phi r$ that attains the minimum of $\|J - \Phi r\|_{2,\pi}$; if there are multiple functions $\Phi r$ that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm $\|\Phi r\|_{2,\pi}$. Note that in our context, where each $k$th column of $\Phi$ is an indicator function for the $k$th partition, for any $\pi$, $J$, and $x \in S_k$,
$$(\Pi_\pi J)(x) = \frac{\sum_{y \in S_k} \pi(y) J(y)}{\sum_{y \in S_k} \pi(y)}.$$

Approximate value iteration begins with a function $\Phi r^{(0)}$ and generates a sequence according to $\Phi r^{(\ell+1)} = \Pi_\pi T \Phi r^{(\ell)}$. It is well known that the dynamic programming operator $T$ is a contraction mapping with respect to the maximum norm. Further, $\Pi_\pi$ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general $\Phi$, but it is true in our context, in which the columns of $\Phi$ are indicator functions for partitions.) It follows that the composition $\Pi_\pi T$ is a contraction mapping. By the contraction mapping theorem, $\Pi_\pi T$ has a unique fixed point $\Phi \tilde r$, which is the limit of the sequence $\Phi r^{(\ell)}$. Further, the following result holds:

**Theorem 2** For any MDP, partition, and weights $\pi$ with support intersecting every partition, if $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$, then
$$\|\Phi \tilde r - J^*\|_\infty \le \frac{2}{1-\alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty,$$
and
$$(1-\alpha) \|J_{\mu_{\tilde r}} - J^*\|_\infty \le \frac{4\alpha}{1-\alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1.
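As an illustrative sketch of the iteration above (ours, not code from the paper; Python/NumPy, with `partition[x]` giving the index of the subset containing state $x$, and $\pi$ assumed to put weight on every partition so each projection denominator is positive):

```python
import numpy as np

def aggregation_matrix(partition, K):
    """Phi with indicator columns: Phi[x, k] = 1 iff state x belongs to S_k."""
    Phi = np.zeros((len(partition), K))
    Phi[np.arange(len(partition)), partition] = 1.0
    return Phi

def project(Phi, pi, J):
    """Pi_pi J for aggregation: the pi-weighted average of J over each partition."""
    w = Phi.T @ (pi * J)   # per-partition weighted sums
    z = Phi.T @ pi         # per-partition total weights (assumed positive)
    return Phi @ (w / z)

def approximate_value_iteration(P, g, alpha, Phi, pi, iters=500):
    """Iterate Phi r^{(l+1)} = Pi_pi T Phi r^{(l)}; Pi_pi T is a max-norm
    contraction in this aggregation setting, so the iterates converge
    to the fixed point Phi r~."""
    J = np.zeros(Phi.shape[0])
    for _ in range(iters):
        TJ = np.min(np.stack([g[u] + alpha * (P[u] @ J) for u in range(len(P))]), axis=0)
        J = project(Phi, pi, TJ)
    return J
```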

Note that $J_{\mu_{\tilde r}}(x) \ge J^*(x)$ for all $x$, so the left-hand side of the performance loss bound is the maximal increase in cost-to-go, normalized by $1-\alpha$. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients $1, \alpha, \alpha^2, \ldots$, which sum to $1/(1-\alpha)$.

Our motivation of the normalizing constant raises the question of whether, for fixed MDP parameters $(S, U, g, p)$ and fixed $\Phi$, $\min_r \|J^* - \Phi r\|_\infty$ also grows with $1/(1-\alpha)$. It turns out that $\min_r \|J^* - \Phi r\|_\infty = O(1)$. To see why, note that for any $\mu$,
$$J_\mu = (I - \alpha P_\mu)^{-1} g_\mu = \frac{1}{1-\alpha} \lambda_\mu + h_\mu,$$
where $\lambda_\mu(x)$ is the expected average cost if the process starts in state $x$ and is controlled by policy $\mu$,
$$\lambda_\mu = \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{t=0}^{\tau-1} P_\mu^t g_\mu,$$
and $h_\mu$ is the discounted differential cost function
$$h_\mu = (I - \alpha P_\mu)^{-1} (g_\mu - \lambda_\mu).$$
Both $\lambda_\mu$ and $h_\mu$ converge to finite vectors as $\alpha$ approaches 1 [3]. For an optimal policy $\mu^*$, $\lim_{\alpha \uparrow 1} \lambda_{\mu^*}(x)$ does not depend on $x$ (in our context of a communicating MDP). Since constant functions lie in the range of $\Phi$,
$$\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty \le \lim_{\alpha \uparrow 1} \|h_{\mu^*}\|_\infty < \infty.$$

The performance loss bound still exhibits an undesirable dependence on $\alpha$ through the coefficient $4\alpha/(1-\alpha)$. In most relevant contexts, $\alpha$ is close to 1, so $4\alpha/(1-\alpha)$ can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by $\mathbf{1}$ the vector with every component equal to 1.

**Theorem 3** For any $\delta > 0$, $\alpha \in (0,1)$, and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$ with $\pi = \mathbf{1}$,
$$(1-\alpha) \|J_{\mu_{\tilde r}} - J^*\|_\infty \ge \frac{4\alpha}{1-\alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty - \delta.$$

This theorem is established through an example in [22]. The choice of uniform weights ($\pi = \mathbf{1}$) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp.

Based on Theorems 2 and 3, one might expect that there exist MDP parameters $(S, U, g, p)$ and a partition such that, with $\pi = \mathbf{1}$,
$$(1-\alpha) \|J_{\mu_{\tilde r}} - J^*\|_\infty = \Theta\Big( \frac{1}{1-\alpha} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty \Big);$$
in other words, that the performance loss is both lower and upper bounded by $1/(1-\alpha)$ times the smallest possible approximation error. It turns out that this is not true, at least if we restrict attention to a finite state space. However, as the following theorem establishes, the coefficient multiplying $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty$ can grow arbitrarily large as $\alpha$ approaches 1, keeping all else fixed.
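The decomposition above is easy to check numerically. The following sketch (ours; the two-state chain and all numbers are arbitrary) verifies it for an irreducible chain, where $\lambda_\mu$ is constant across states:

```python
import numpy as np

# Check J_mu = lambda_mu / (1 - alpha) + h_mu on a toy irreducible chain.
P_mu = np.array([[0.5, 0.5], [0.3, 0.7]])
g_mu = np.array([1.0, 3.0])
alpha = 0.999

# Invariant distribution: the left eigenvector of P_mu for eigenvalue 1.
evals, evecs = np.linalg.eig(P_mu.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi /= pi.sum()

lam = np.full(2, pi @ g_mu)                                 # average cost, constant vector
h = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu - lam)   # discounted differential cost
J = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)         # cost-to-go
assert np.allclose(J, lam / (1 - alpha) + h)                # the decomposition holds exactly
```

The identity is exact here because $\lambda_\mu$ is a constant vector and $P_\mu \mathbf{1} = \mathbf{1}$, so $(I - \alpha P_\mu)^{-1} \lambda_\mu = \lambda_\mu / (1-\alpha)$.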

**Theorem 4** For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$ with $\pi = \mathbf{1}$,
$$\liminf_{\alpha \uparrow 1} (1-\alpha) \big( J_{\mu_{\tilde r}}(x) - J^*(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty,$$
for all $x \in S$.

This theorem is also established through an example [22]. For any $\mu$ and $x$,
$$\lim_{\alpha \uparrow 1} \big( (1-\alpha) J_\mu(x) - \lambda_\mu(x) \big) = \lim_{\alpha \uparrow 1} (1-\alpha) h_\mu(x) = 0.$$
Combined with Theorem 4, this yields the following corollary.

**Corollary 1** For any $L$ and $\Delta \ge 0$, there exist MDP parameters $(S, U, g, p)$ and a partition such that $\lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty = \Delta$ and, if $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$ with $\pi = \mathbf{1}$,
$$\liminf_{\alpha \uparrow 1} \big( \lambda_{\mu_{\tilde r}}(x) - \lambda_{\mu^*}(x) \big) \ge L \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty,$$
for all $x \in S$.

## 3 Using the Invariant Distribution

In the previous section, we considered an approximation $\Phi \tilde r$ that solves $\Pi_\pi T \Phi \tilde r = \Phi \tilde r$ for some arbitrary pre-selected weights $\pi$. We now turn to consider use of an invariant state distribution $\pi_{\tilde r}$ of $P_{\mu_{\tilde r}}$ as the weight vector. (By an invariant state distribution of a transition matrix $P$, we mean any probability distribution $\pi$ such that $\pi^T P = \pi^T$; in the event that $P_{\mu_{\tilde r}}$ has multiple invariant distributions, $\pi_{\tilde r}$ denotes an arbitrary choice.) This leads to a circular definition: the weights are used in defining $\tilde r$, and now we are defining the weights in terms of $\tilde r$. What we are really after here is a vector $\tilde r$ that satisfies $\Pi_{\pi_{\tilde r}} T \Phi \tilde r = \Phi \tilde r$. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full-length version of this paper [22].)

**Theorem 5** For any MDP and partition, if $\Phi \tilde r = \Pi_{\pi_{\tilde r}} T \Phi \tilde r$ and $\pi_{\tilde r}$ has support intersecting every partition, then
$$(1-\alpha) \pi_{\tilde r}^T (J_{\mu_{\tilde r}} - J^*) \le 2\alpha \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

When $\alpha$ is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is the omission of a factor of $1-\alpha$ from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that $\lim_{\alpha \uparrow 1} ((1-\alpha) J_\mu - \lambda_\mu) = 0$ for all $\mu$, as explained in Section 2. In particular, based on this, we have
$$\lim_{\alpha \uparrow 1} (1-\alpha) \|J_\mu - J^*\|_\infty = \|\lambda_\mu - \lambda^*\|_\infty = \lambda_\mu - \lambda^* = \lim_{\alpha \uparrow 1} (1-\alpha) \pi^T (J_\mu - J^*),$$
for all policies $\mu$ and probability distributions $\pi$. Hence, the left-hand-side expressions of the two performance loss bounds become directly comparable as $\alpha$ approaches 1.

Another interesting comparison can be made by contrasting Corollary 1 with the following immediate consequence of Theorem 5.

**Corollary 2** For all MDP parameters $(S, U, g, p)$ and partitions, if $\Phi \tilde r = \Pi_{\pi_{\tilde r}} T \Phi \tilde r$ and $\liminf_{\alpha \uparrow 1} \sum_{x \in S_k} \pi_{\tilde r}(x) > 0$ for all $k$, then
$$\limsup_{\alpha \uparrow 1} \big( \lambda_{\mu_{\tilde r}} - \lambda_{\mu^*} \big) \le 2 \lim_{\alpha \uparrow 1} \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty.$$

The comparison suggests that solving $\Phi \tilde r = \Pi_{\pi_{\tilde r}} T \Phi \tilde r$ is strongly preferable to solving $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$ with $\pi = \mathbf{1}$.
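The circular definition suggests a simple alternating heuristic, which we sketch below (ours, not an algorithm from the paper; Python/NumPy, assuming each greedy policy's chain is irreducible). As Section 4 explains, a solution need not exist, so the loop carries no convergence guarantee:

```python
import numpy as np

def invariant_distribution(P):
    """A distribution pi with pi^T P = pi^T (assumes the chain is irreducible)."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    return pi / pi.sum()

def self_consistent_weights(P, g, alpha, Phi, iters=200):
    """Heuristic search for r~ with Phi r~ = Pi_{pi_r~} T Phi r~: alternate
    between the greedy policy's invariant distribution and the re-weighted
    projection. May fail to converge when no solution exists."""
    n = Phi.shape[0]
    J = np.zeros(n)
    for _ in range(iters):
        Q = np.stack([g[u] + alpha * (P[u] @ J) for u in range(len(P))])
        mu = Q.argmin(axis=0)                            # greedy policy mu_r
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        pi = invariant_distribution(P_mu)                # candidate weights pi_r
        w, z = Phi.T @ (pi * Q.min(axis=0)), Phi.T @ pi
        J = Phi @ (w / np.maximum(z, 1e-12))             # Pi_{pi_r} T Phi r
    return J
```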

## 4 Exploration

If a vector $\tilde r$ solves $\Phi \tilde r = \Pi_{\pi_{\tilde r}} T \Phi \tilde r$ and the support of $\pi_{\tilde r}$ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we address in this section. First, in some cases, the equation $\Pi_{\pi_{\tilde r}} T \Phi \tilde r = \Phi \tilde r$ does not have a solution. It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of $\pi_{\tilde r}$ intersect every partition.

To address these shortcomings, we introduce stochastic policies. A stochastic policy $\mu$ maps state-action pairs to probabilities. For each $x \in S$ and $u \in U_x$, $\mu(x,u)$ is the probability of taking action $u$ when in state $x$. Hence, $\mu(x,u) \ge 0$ for all $x \in S$ and $u \in U_x$, and $\sum_{u \in U_x} \mu(x,u) = 1$ for all $x \in S$. Given a scalar $\epsilon > 0$ and a function $J$, the $\epsilon$-greedy Boltzmann exploration policy with respect to $J$ is defined by
$$\mu(x,u) = \frac{e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)}}{\sum_{\bar u \in U_x} e^{-(T_{\bar u} J)(x)(|U_x|-1)/(\epsilon e)}},$$
where $(T_u J)(x) = g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) J(y)$. For any $\epsilon > 0$ and $r$, let $\mu_r^\epsilon$ denote the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r$. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration:
$$(T_\epsilon J)(x) = \frac{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)} (T_u J)(x)}{\sum_{u \in U_x} e^{-(T_u J)(x)(|U_x|-1)/(\epsilon e)}}.$$
As $\epsilon$ approaches 0, $\epsilon$-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operator becomes the dynamic programming operator. More precisely, for all $r$, $x$, and $J$, $\lim_{\epsilon \downarrow 0} \mu_r^\epsilon(x, \mu_r(x)) = 1$ and $\lim_{\epsilon \downarrow 0} T_\epsilon J = T J$. These are immediate consequences of the following result (see [4] for a proof).

**Lemma 1** For any $n$ and $v \in \Re^n$,
$$\min_i v_i + \epsilon \;\ge\; \frac{\sum_i e^{-v_i (n-1)/(\epsilon e)} v_i}{\sum_i e^{-v_i (n-1)/(\epsilon e)}} \;\ge\; \min_i v_i.$$

Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each $\epsilon$-greedy Boltzmann exploration policy $\mu_r^\epsilon$, and the support of this distribution is $S$. Let $\pi_r^\epsilon$ denote this distribution. We consider a vector $\tilde r$ that solves $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} T_\epsilon \Phi \tilde r$. For any $\epsilon > 0$, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)

**Theorem 6** For any MDP, partition, and $\epsilon > 0$, if $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} T_\epsilon \Phi \tilde r$, then
$$(1-\alpha) (\pi_{\tilde r}^\epsilon)^T (J_{\mu_{\tilde r}^\epsilon} - J^*) \le 2\alpha \min_{r \in \Re^K} \|J^* - \Phi r\|_\infty + \epsilon.$$
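In code, the Boltzmann weights and the operator $T_\epsilon$ at a single state might be sketched as follows (ours; `TuJ_x` holds the values $(T_u J)(x)$ for the admissible actions at one state):

```python
import numpy as np

def boltzmann_weights(TuJ_x, eps):
    """Probabilities of the eps-greedy Boltzmann policy at one state:
    proportional to exp(-(T_u J)(x) (|U_x| - 1) / (eps * e))."""
    v = -np.asarray(TuJ_x) * (len(TuJ_x) - 1) / (eps * np.e)
    w = np.exp(v - v.max())     # shift by the max for numerical stability
    return w / w.sum()

def T_eps_at_state(TuJ_x, eps):
    """(T_eps J)(x): the Boltzmann-weighted average of the (T_u J)(x) values.
    By Lemma 1 it lies between min_u (T_u J)(x) and min_u (T_u J)(x) + eps."""
    return boltzmann_weights(TuJ_x, eps) @ np.asarray(TuJ_x)
```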

## 5 Computation: TD(0)

Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$. It requires an ability to sample a sequence of states $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$, each independent and identically distributed according to $\pi$. Also required is a way to efficiently compute
$$(T \Phi r)(x) = \min_{u \in U_x} \Big( g_u(x) + \alpha \sum_{y \in S} p_{xy}(u) (\Phi r)(y) \Big),$$
for any given $x$ and $r$. This is typically possible when the action set $U_x$ and the support of $p_x(u)$ (i.e., the set of states that can follow $x$ if action $u$ is selected) are not too large. The algorithm generates a sequence of vectors $r^{(\ell)}$ according to
$$r^{(\ell+1)} = r^{(\ell)} + \gamma_\ell \, \phi(x^{(\ell)}) \Big( (T \Phi r^{(\ell)})(x^{(\ell)}) - (\Phi r^{(\ell)})(x^{(\ell)}) \Big),$$
where $\gamma_\ell$ is a step size and $\phi(x)$ denotes the column vector made up of components from the $x$th row of $\Phi$. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, $r^{(\ell)}$ converges to a vector $\tilde r$ that solves $\Phi \tilde r = \Pi_\pi T \Phi \tilde r$.

The equation $\Phi \tilde r = \Pi_{\pi_{\tilde r}} T \Phi \tilde r$ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} T_\epsilon \Phi \tilde r$. The algorithm requires simulation of a trajectory $x_0, x_1, x_2, \ldots$ of the MDP, with each action $u_t \in U_{x_t}$ generated by the $\epsilon$-greedy Boltzmann exploration policy with respect to $\Phi r^{(t)}$. The sequence of vectors $r^{(t)}$ is generated according to
$$r^{(t+1)} = r^{(t)} + \gamma_t \, \phi(x_t) \Big( (T_\epsilon \Phi r^{(t)})(x_t) - (\Phi r^{(t)})(x_t) \Big).$$
Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} T_\epsilon \Phi \tilde r$. Whether such an algorithm converges, and whether there are other algorithms that can effectively solve $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} T_\epsilon \Phi \tilde r$ for broad classes of relevant problems, remain open issues.
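A sketch of the trajectory-based update above (ours; it reuses `boltzmann_weights` and `T_eps_at_state` from the Section 4 sketch and the `P`, `g` format of the Section 1 sketch, and the step-size schedule is an arbitrary choice):

```python
import numpy as np

def td0_trajectory(P, g, alpha, eps, Phi, steps=10000, seed=0):
    """TD(0) with aggregation and Boltzmann exploration (our sketch):
        r <- r + gamma_t * phi(x_t) * ((T_eps Phi r)(x_t) - (Phi r)(x_t)).
    If it converges, the limit satisfies Phi r~ = Pi_{pi^eps_r~} T_eps Phi r~."""
    rng = np.random.default_rng(seed)
    n_states, K = Phi.shape
    r, x = np.zeros(K), 0
    for t in range(steps):
        J = Phi @ r
        TuJ_x = np.array([g[u][x] + alpha * (P[u][x] @ J) for u in range(len(P))])
        gamma = 1.0 / (t / 100 + 10)                    # diminishing step size (our choice)
        r = r + gamma * Phi[x] * (T_eps_at_state(TuJ_x, eps) - J[x])
        u = rng.choice(len(P), p=boltzmann_weights(TuJ_x, eps))  # explore
        x = rng.choice(n_states, p=P[u][x])             # simulate the transition
    return r
```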
## 6 Extensions and Open Issues

Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound.

This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix $\Phi$), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14].

The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), which is a special case of TD($\lambda$), parameterized by $\lambda \in [0,1]$. It is not known whether the results of this paper can be extended to the general case of $\lambda \in [0,1]$. Prior research has suggested that larger values of $\lambda$ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of $1/(1-\alpha)$ as $\lambda$ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error.

Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of $1/(1-\alpha)$ can similarly be eliminated from these bounds is an open issue.

Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes
$$\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} \big( T_\epsilon \Phi \tilde r - \tilde\lambda \mathbf{1} \big).$$
The variables include an estimate $\tilde\lambda \in \Re$ of the minimal average cost $\lambda^* \in \Re$ and an approximation $\Phi \tilde r$ of the optimal differential cost function $h^*$. The discount factor $\alpha$ is set to 1 in computing an $\epsilon$-greedy Boltzmann exploration policy as well as $T_\epsilon$. There is an average-cost version of temporal-difference learning for which any limit of convergence $(\tilde\lambda, \tilde r)$ satisfies this equation [19, 20, 21].

Generalization of Theorem 2 does not lead to a useful result, because the right-hand side of the bound becomes infinite as $\alpha$ approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:

**Theorem 7** For any communicating MDP with an average-cost objective, partition, and $\epsilon > 0$, if $\Phi \tilde r = \Pi_{\pi_{\tilde r}^\epsilon} (T_\epsilon \Phi \tilde r - \tilde\lambda \mathbf{1})$, then
$$\lambda_{\mu_{\tilde r}^\epsilon} - \lambda^* \le 2 \min_{r \in \Re^K} \|h^* - \Phi r\|_\infty + \epsilon.$$

Here, $\lambda_{\mu_{\tilde r}^\epsilon} \in \Re$ denotes the average cost under policy $\mu_{\tilde r}^\epsilon$, which is well-defined because the process is irreducible under an $\epsilon$-greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left- and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is $\lambda_{\mu_{\tilde r}^\epsilon} - \lambda^*$. The limit of $\min_{r \in \Re^K} \|J^* - \Phi r\|_\infty$ on the right-hand side is $\min_{r \in \Re^K} \|h^* - \Phi r\|_\infty$. (This follows from the analysis of [3].)
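For a small model, one can probe the average-cost equation directly. The following is a rough, model-based iteration of our own construction, in the spirit of relative value iteration; it reuses `boltzmann_weights` and `invariant_distribution` from the earlier sketches, it is not the temporal-difference algorithm of [19, 20, 21], and we make no convergence claim:

```python
import numpy as np

def avg_cost_sketch(P, g, eps, Phi, iters=2000):
    """Heuristic iteration (ours) toward Phi r = Pi_{pi^eps_r}(T_eps Phi r - lam 1),
    with alpha set to 1 throughout, as in Section 6."""
    n = Phi.shape[0]
    J, lam = np.zeros(n), 0.0
    for _ in range(iters):
        Q = np.stack([g[u] + P[u] @ J for u in range(len(P))])       # (T_u J), alpha = 1
        W = np.stack([boltzmann_weights(Q[:, x], eps) for x in range(n)]).T
        TJ = (W * Q).sum(axis=0)                                     # T_eps J
        P_pol = sum(W[u][:, None] * P[u] for u in range(len(P)))     # exploration-policy chain
        pi = invariant_distribution(P_pol)
        lam = pi @ (TJ - J)                       # average-cost estimate (our choice)
        w, z = Phi.T @ (pi * (TJ - lam)), Phi.T @ pi
        J = Phi @ (w / z)                         # Pi_pi (T_eps J - lam 1)
    return J, lam
```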

## Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant ECS and by the Office of Naval Research under Grant MURI N. The author's understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full-length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers.

## References

[1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7.

[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

[3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33.

[4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3).

[5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.

[6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6).

[7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS, Carnegie Mellon University.

[8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA.

[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6.

[10] S. P. Singh and R. C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning.

[11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.

[12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.

[13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference.

[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.

[15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16.

[16] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94.

[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5).

[18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, MA.

[19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control.

[20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11).

[21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2–3).

[22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at bvr/psfiles/aggregation.pdf.

[23] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK.

[24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8, 1992.
