Master MVA: Apprentissage par renforcement — Lecture 5
Sample complexity in reinforcement learning
Professor: Rémi Munos — munos/master-mva/
Bibliographic references: [LGM10b, MS08, MMLG10, ASM08]

Outline:
1. Azuma's inequality
2. Sample complexity of LSTD
3. Other results

1 Azuma's inequality

Azuma's inequality extends the Chernoff-Hoeffding inequality to random variables that may be dependent but form a martingale.

Proposition 1. Let $X_i \in [a_i, b_i]$ be random variables such that $\mathbb{E}[X_i \mid X_1, \dots, X_{i-1}] = 0$. Then
$$\mathbb{P}\Big( \Big|\sum_{i=1}^n X_i\Big| \ge \epsilon \Big) \le 2\, e^{-2\epsilon^2 / \sum_{i=1}^n (b_i - a_i)^2}. \qquad (1)$$
In other words, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$,
$$\Big| \frac{1}{n} \sum_{i=1}^n X_i \Big| \le \sqrt{\frac{\log(2/\delta)}{2}}\; \frac{\sqrt{\frac{1}{n}\sum_{i=1}^n (b_i - a_i)^2}}{\sqrt{n}}.$$

Proof. We have:
$$\mathbb{P}\Big(\sum_{i=1}^n X_i \ge \epsilon\Big) = \mathbb{P}\big(e^{s \sum_{i=1}^n X_i} \ge e^{s\epsilon}\big) \le e^{-s\epsilon}\, \mathbb{E}\big[e^{s \sum_{i=1}^n X_i}\big], \quad \text{by Markov's inequality,}$$
$$= e^{-s\epsilon}\, \mathbb{E}_{X_1,\dots,X_{n-1}}\Big[ \mathbb{E}_{X_n}\big[ e^{s \sum_{i=1}^n X_i} \mid X_1, \dots, X_{n-1} \big] \Big]$$
$$= e^{-s\epsilon}\, \mathbb{E}_{X_1,\dots,X_{n-1}}\Big[ e^{s \sum_{i=1}^{n-1} X_i}\, \mathbb{E}_{X_n}\big[ e^{s X_n} \mid X_1, \dots, X_{n-1} \big] \Big]$$
$$\le e^{-s\epsilon + s^2 (b_n - a_n)^2/8}\, \mathbb{E}_{X_1,\dots,X_{n-1}}\big[ e^{s \sum_{i=1}^{n-1} X_i} \big], \quad \text{by Hoeffding's lemma,}$$
$$\le \dots \le e^{-s\epsilon + s^2 \sum_{i=1}^n (b_i - a_i)^2/8}.$$
Choosing $s = 4\epsilon / \sum_{i=1}^n (b_i - a_i)^2$, we deduce $\mathbb{P}\big(\sum_{i=1}^n X_i \ge \epsilon\big) \le e^{-2\epsilon^2/\sum_{i=1}^n (b_i - a_i)^2}$. Repeating the same computation for $\mathbb{P}\big(\sum_{i=1}^n X_i \le -\epsilon\big)$ yields (1).
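As a sanity check on Proposition 1, here is a small numerical sketch (not part of the original notes; all variable names and the particular martingale below are illustrative choices). It simulates a bounded martingale difference sequence whose increments depend on the past, estimates the tail probability $\mathbb{P}(|\sum_i X_i| \ge \epsilon)$ by Monte Carlo, and compares it with the bound $2 e^{-2\epsilon^2/\sum_i (b_i - a_i)^2}$ of Equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_runs, eps = 200, 100_000, 30.0

# Martingale differences X_i = sign_i * w_i: the Rademacher sign is independent of
# the past, while the weight w_i in (0, 1] depends on the past partial sum, so
# E[X_i | X_1, ..., X_{i-1}] = 0 and X_i lies in [a_i, b_i] = [-1, 1].
S = np.zeros(n_runs)                                   # one partial sum per run
for i in range(1, n + 1):
    w = 1.0 / (1.0 + np.abs(S) / i)
    S += rng.choice([-1.0, 1.0], size=n_runs) * w

empirical = np.mean(np.abs(S) >= eps)
azuma = 2 * np.exp(-2 * eps**2 / (4 * n))              # sum_i (b_i - a_i)^2 = 4n here
print(f"P(|S_n| >= {eps:.0f}) ~= {empirical:.5f} <= Azuma bound {azuma:.3f}")
```

The empirical tail comes out far below the bound, as expected: Azuma's inequality only uses the range of the increments, not their (smaller) conditional variance.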

2 Sample complexity of LSTD

2.1 Pathwise LSTD

We follow a fixed policy $\pi$. Our goal is to approximate the value function $V^\pi$ (written $V$, dropping the reference to $\pi$ to simplify notation). We use a linear approximation space $\mathcal{F}$ spanned by a set of $d$ basis functions $\varphi_i : \mathcal{X} \to \mathbb{R}$. We denote by $\phi : \mathcal{X} \to \mathbb{R}^d$, $\phi(\cdot) = \big(\varphi_1(\cdot), \dots, \varphi_d(\cdot)\big)^\top$ the feature vector. Thus $\mathcal{F} = \{ f_\alpha \mid \alpha \in \mathbb{R}^d \text{ and } f_\alpha(\cdot) = \phi(\cdot)^\top \alpha \}$.

Let $(X_1, \dots, X_n)$ be a sample path (trajectory) of size $n$ generated by following policy $\pi$. Let $v \in \mathbb{R}^n$ and $r \in \mathbb{R}^n$ be such that $v_t = V(X_t)$ and $r_t = R(X_t)$: the value vector and the reward vector, respectively. Also, let $\Phi = [\phi(X_1)^\top; \dots; \phi(X_n)^\top]$ be the feature matrix defined at the states, and $\mathcal{F}_n = \{\Phi\alpha,\ \alpha \in \mathbb{R}^d\} \subset \mathbb{R}^n$ the corresponding vector space. We denote by $\Pi : \mathbb{R}^n \to \mathcal{F}_n$ the empirical orthogonal projection onto $\mathcal{F}_n$, defined as $\Pi y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n$, where $\|y\|_n^2 = \frac{1}{n} \sum_{t=1}^n y_t^2$. Note that $\Pi$ is a non-expansive mapping w.r.t. the $\ell_2$-norm $\|\cdot\|_n$: $\|\Pi y - \Pi z\|_n \le \|y - z\|_n$.

Define the empirical Bellman operator $\hat{T} : \mathbb{R}^n \to \mathbb{R}^n$ as
$$(\hat{T} y)_t = \begin{cases} r_t + \gamma y_{t+1} & t < n, \\ r_t & t = n. \end{cases}$$

Proposition 2. The operator $\Pi \hat{T}$ is a contraction in $\ell_2$-norm, and thus possesses a unique fixed point $\hat{v}$.

Proof. Note that by defining the operator $\hat{P} : \mathbb{R}^n \to \mathbb{R}^n$ as $(\hat{P} y)_t = y_{t+1}$ for $t < n$ and $(\hat{P} y)_n = 0$, we have $\hat{T} y = r + \gamma \hat{P} y$. The empirical Bellman operator is a $\gamma$-contraction in $\ell_2$-norm since, for any $y, z \in \mathbb{R}^n$, we have
$$\|\hat{T} y - \hat{T} z\|_n = \gamma \|\hat{P}(y - z)\|_n \le \gamma \|y - z\|_n.$$
Now, since the orthogonal projection $\Pi$ is non-expansive w.r.t. the $\ell_2$-norm, by the Banach fixed-point theorem there exists a unique fixed point $\hat{v}$ of the mapping $\Pi \hat{T}$, i.e., $\hat{v} = \Pi \hat{T} \hat{v}$.

Since $\hat{v}$ is the unique fixed point of $\Pi \hat{T}$, the vector $\hat{v} - \hat{T}\hat{v}$ is perpendicular to the space $\mathcal{F}_n$, and thus $\Phi^\top(\hat{v} - \hat{T}\hat{v}) = 0$. Replacing $\hat{v}$ by $\Phi\alpha$, we obtain $\Phi^\top \Phi \alpha = \Phi^\top(r + \gamma \hat{P} \Phi \alpha)$, and then
$$\underbrace{\Phi^\top (I - \gamma \hat{P}) \Phi}_{A}\, \alpha = \underbrace{\Phi^\top r}_{b}.$$
Therefore, by setting
$$A_{i,j} = \sum_{t=1}^{n-1} \varphi_i(X_t)\big[\varphi_j(X_t) - \gamma \varphi_j(X_{t+1})\big] + \varphi_i(X_n)\varphi_j(X_n), \qquad b_i = \sum_{t=1}^{n} \varphi_i(X_t)\, r_t,$$
we have that the system $A\alpha = b$ always has at least one solution (since the fixed point $\hat{v}$ exists), and we call the solution with minimal norm, $\hat{\alpha} = A^+ b$, where $A^+$ is the Moore-Penrose pseudo-inverse of $A$, the pathwise LSTD solution.
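The construction above translates directly into a few lines of linear algebra. Below is a minimal NumPy sketch of the pathwise LSTD solution $\hat{\alpha} = A^+ b$ (not part of the original notes; the function and variable names, the toy chain and the radial-basis features are illustrative assumptions).

```python
import numpy as np

def pathwise_lstd(states, rewards, phi, gamma):
    """Pathwise LSTD on a single trajectory: build A and b as in Section 2.1
    and return the minimal-norm solution alpha_hat = A^+ b."""
    Phi = np.array([phi(x) for x in states])                  # n x d feature matrix
    # A = Phi^T (I - gamma * P_hat) Phi, written entry-wise:
    # sum_{t<n} phi(X_t)(phi(X_t) - gamma phi(X_{t+1}))^T + phi(X_n) phi(X_n)^T
    A = Phi[:-1].T @ (Phi[:-1] - gamma * Phi[1:]) + np.outer(Phi[-1], Phi[-1])
    b = Phi.T @ rewards
    alpha_hat = np.linalg.pinv(A) @ b                         # Moore-Penrose pseudo-inverse
    return alpha_hat, Phi @ alpha_hat                         # coefficients and fitted values

# Toy usage: a reflecting random walk on {0, ..., 19} with Gaussian-bump features.
rng = np.random.default_rng(0)
K, d, n, gamma = 20, 5, 1000, 0.9
centers = np.linspace(0, K - 1, d)
phi = lambda x: np.exp(-0.5 * ((x - centers) / 3.0) ** 2)
X = [K // 2]
for _ in range(n - 1):
    X.append(min(max(X[-1] + rng.choice([-1, 1]), 0), K - 1))
r = np.array([1.0 if x == K - 1 else 0.0 for x in X])
alpha_hat, v_hat = pathwise_lstd(X, r, phi, gamma)
```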

2.2 Performance Bound

Here we derive a bound for the performance of $\hat{v}$ evaluated on the states of the trajectory used by the pathwise LSTD algorithm.

Theorem 1. Let $X_1, \dots, X_n$ be a trajectory of the Markov chain, and $v, \hat{v} \in \mathbb{R}^n$ be the vectors whose components are the value function and the pathwise LSTD solution at $\{X_t\}_{t=1}^n$, respectively. Then with probability $1 - \delta$ (the probability is w.r.t. the random trajectory), we have
$$\|\hat{v} - v\|_n \le \frac{1}{\sqrt{1 - \gamma^2}}\, \|v - \Pi v\|_n + \frac{1}{1 - \gamma}\, \gamma V_{\max} L \sqrt{\frac{d}{\nu_n\, n}} \Big( \sqrt{8 \log(2d/\delta)} + \frac{1}{\sqrt{n}} \Big), \qquad (2)$$
where the random variable $\nu_n$ is the smallest strictly-positive eigenvalue of the sample-based Gram matrix $\frac{1}{n}\Phi^\top\Phi$, $V_{\max}$ is a bound on the value function, and $L$ is a bound on the basis functions ($|\varphi_i(x)| \le L$).

Remark 2. When the eigenvalues of the sample-based Gram matrix $\frac{1}{n}\Phi^\top\Phi$ are all non-zero, $\Phi^\top\Phi$ is invertible, and thus $\Pi = \Phi(\Phi^\top\Phi)^{-1}\Phi^\top$. In this case, the uniqueness of $\hat{v}$ implies the uniqueness of $\hat{\alpha}$ since
$$\hat{v} = \Phi\hat{\alpha} \;\Longrightarrow\; \Phi^\top\hat{v} = \Phi^\top\Phi\hat{\alpha} \;\Longrightarrow\; \hat{\alpha} = (\Phi^\top\Phi)^{-1}\Phi^\top\hat{v}.$$
On the other hand, when the sample-based Gram matrix $\frac{1}{n}\Phi^\top\Phi$ is not invertible, the system $A\alpha = b$ may have many solutions. Among all the possible solutions, one may choose the one with minimal norm: $\hat{\alpha} = A^+ b$.

Remark 3. Theorem 1 provides a bound without any reference to the stationary distribution of the Markov chain. In fact, the bound of Equation (2) holds even when the chain does not possess a stationary distribution. For example, consider a Markov chain on the real line where the transitions always move the states to the right, i.e., $p(X_{t+1} \in dy \mid X_t = x) = 0$ for $y \le x$. For simplicity, assume that the value function $V$ is bounded and belongs to $\mathcal{F}$. This Markov chain is not recurrent, and thus does not have a stationary distribution. We also assume that the feature vectors $\phi(X_1), \dots, \phi(X_n)$ are sufficiently independent, so that the eigenvalues of $\frac{1}{n}\Phi^\top\Phi$ are greater than some $\nu > 0$. Then, according to Theorem 1, pathwise LSTD is able to estimate the value function at the samples at a rate $O(1/\sqrt{n})$. This may seem surprising because at each state $X_t$ the algorithm is only provided with a noisy estimate of the expected value of the next state. However, the estimates are unbiased conditioned on the current state, and we will see in the proof that, using a concentration inequality for martingales, pathwise LSTD is able to learn a good estimate of the value function at a state $X_t$ using noisy pieces of information at other states that may be far away from $X_t$. In other words, learning the value function at a given state does not require making an average over many samples close to that state. This implies that LSTD does not require the Markov chain to possess a stationary distribution.

In order to prove Theorem 1, we first introduce the model of regression with Markov design, and then state and prove a lemma about this model.

Definition 1. The model of regression with Markov design is a regression problem where the data $(X_t, Y_t)_{1 \le t \le n}$ are generated according to the following model: $X_1, \dots, X_n$ is a sample path generated by a Markov chain, $Y_t = f(X_t) + \xi_t$, where $f$ is the target function, and the noise term $\xi_t$ is a random variable which is adapted to the filtration generated by $X_1, \dots, X_{t+1}$ and is such that
$$|\xi_t| \le C \quad \text{and} \quad \mathbb{E}[\xi_t \mid X_1, \dots, X_t] = 0. \qquad (3)$$

The next lemma reports a risk bound for the Markov design setting.

Lemma 1 (Regression bound for the Markov design setting). Let $\hat{w} \in \mathcal{F}_n$ be the least-squares estimate of the (noisy) values $Y = \{Y_t\}_{t=1}^n$, i.e., $\hat{w} = \Pi Y$, and $w \in \mathcal{F}_n$ be the least-squares estimate of the (noiseless) values $Z = \{Z_t = f(X_t)\}_{t=1}^n$, i.e., $w = \Pi Z$. Then for any $\delta > 0$, with probability at least $1 - \delta$ (the probability is w.r.t. the random sample path $X_1, \dots, X_n$), we have
$$\|\hat{w} - w\|_n \le CL \sqrt{\frac{2 d \log(2d/\delta)}{n\, \nu_n}}, \qquad (4)$$
where $\nu_n$ is the smallest strictly-positive eigenvalue of the sample-based Gram matrix $\frac{1}{n}\Phi^\top\Phi$.

[Figure 1 (not reproduced): the components used in Lemma 1 and its proof, such as $w$, $\hat{w}$, $\xi$, and $\hat{\xi}$, and the fact that $\langle \hat{\xi}, \xi \rangle_n = \|\hat{\xi}\|_n^2$.]

Proof. We define $\xi \in \mathbb{R}^n$ to be the vector with components $\xi_t$, and $\hat{\xi} = \hat{w} - w = \Pi(Y - Z) = \Pi\xi$. Since the projection is orthogonal, we have $\langle \hat{\xi}, \xi \rangle_n = \|\hat{\xi}\|_n^2$ (see Figure 1). Since $\hat{\xi} \in \mathcal{F}_n$, there exists at least one $\alpha \in \mathbb{R}^d$ such that $\hat{\xi} = \Phi\alpha$, so by the Cauchy-Schwarz inequality we have
$$\|\hat{\xi}\|_n^2 = \langle \hat{\xi}, \xi \rangle_n = \frac{1}{n} \sum_{i=1}^d \alpha_i \sum_{t=1}^n \xi_t \varphi_i(X_t) \le \frac{1}{n} \|\alpha\| \Big[ \sum_{i=1}^d \Big( \sum_{t=1}^n \xi_t \varphi_i(X_t) \Big)^2 \Big]^{1/2}. \qquad (5)$$
Now, among the vectors $\alpha$ such that $\hat{\xi} = \Phi\alpha$, we define $\hat{\alpha}$ to be the one with minimal $\ell_2$-norm, i.e., $\hat{\alpha} = \Phi^+ \hat{\xi}$. Let $K$ denote the null space of $\Phi$, which is also the null space of $\frac{1}{n}\Phi^\top\Phi$. Then $\hat{\alpha}$ can be decomposed as $\hat{\alpha} = \hat{\alpha}_K + \hat{\alpha}_{K^\perp}$, where $\hat{\alpha}_K \in K$ and $\hat{\alpha}_{K^\perp} \in K^\perp$, and because the decomposition is orthogonal, we have $\|\hat{\alpha}\|^2 = \|\hat{\alpha}_K\|^2 + \|\hat{\alpha}_{K^\perp}\|^2$. Since $\hat{\alpha}$ is of minimal norm among all the vectors $\alpha$ such that $\hat{\xi} = \Phi\alpha$, its component in $K$ must be zero, thus $\hat{\alpha} \in K^\perp$. The Gram matrix $\frac{1}{n}\Phi^\top\Phi$ is positive semidefinite, thus its eigenvectors corresponding to zero eigenvalues generate $K$ and the other eigenvectors generate its orthogonal complement $K^\perp$. Therefore, from the assumption that the smallest strictly-positive eigenvalue of $\frac{1}{n}\Phi^\top\Phi$ is $\nu_n$, we deduce that, since $\hat{\alpha} \in K^\perp$,
$$\|\hat{\xi}\|_n^2 = \frac{1}{n}\, \hat{\alpha}^\top \Phi^\top\Phi\, \hat{\alpha} \ge \nu_n\, \hat{\alpha}^\top\hat{\alpha} = \nu_n \|\hat{\alpha}\|^2. \qquad (6)$$
By using the result of Equation (6) in Equation (5), we obtain
$$\|\hat{\xi}\|_n \le \frac{1}{n \sqrt{\nu_n}} \Big[ \sum_{i=1}^d \Big( \sum_{t=1}^n \xi_t \varphi_i(X_t) \Big)^2 \Big]^{1/2}. \qquad (7)$$

Now, from Equation (3), we have that for any $i = 1, \dots, d$,
$$\mathbb{E}[\xi_t \varphi_i(X_t) \mid X_1, \dots, X_t] = \varphi_i(X_t)\, \mathbb{E}[\xi_t \mid X_1, \dots, X_t] = 0,$$
and since $\xi_t \varphi_i(X_t)$ is adapted to the filtration generated by $X_1, \dots, X_{t+1}$, it is a martingale difference sequence w.r.t. that filtration. Thus one may apply Azuma's inequality to deduce that, with probability $1 - \delta$,
$$\Big| \sum_{t=1}^n \xi_t \varphi_i(X_t) \Big| \le CL \sqrt{2 n \log(2/\delta)},$$
where we used that $|\xi_t \varphi_i(X_t)| \le CL$ for any $i$ and $t$. By a union bound over all features, we have that with probability $1 - \delta$, for all $1 \le i \le d$,
$$\Big| \sum_{t=1}^n \xi_t \varphi_i(X_t) \Big| \le CL \sqrt{2 n \log(2d/\delta)}. \qquad (8)$$
The result follows by combining Equations (8) and (7).

Remarks about this Lemma. In the Markov design model considered in this lemma, the states $\{X_t\}_{t=1}^n$ are random variables generated according to the Markov chain, and the noise term $\xi_t$ may depend on the next state $X_{t+1}$ (but should be centered conditioned on the past states $X_1, \dots, X_t$). This lemma will be used in order to prove Theorem 1, where we replace the target function $f$ with the value function $V$, and the noise term $\xi_t$ with the temporal difference $r(X_t) + \gamma V(X_{t+1}) - V(X_t)$. Note that this lemma is an extension of the bound for the model of regression with deterministic design, in which the states $\{X_t\}_{t=1}^n$ are fixed and the noise terms $\xi_t$ are independent. In deterministic design, usual concentration results provide high-probability bounds similar to Equation (4), but without the dependence on $\nu_n$. An open question is whether it is possible to remove $\nu_n$ in the bound for the Markov design regression setting.
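To make the Markov design setting concrete, here is a small simulation (not part of the original notes; the chain, features, target and noise below are illustrative choices, and the right-hand side is evaluated with the bound as reconstructed in Equation (4)). The noise $\xi_t$ depends on the next state $X_{t+1}$ but is centered given the past, as required by Equation (3).

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, n, delta = 50, 6, 5000, 0.05

# Random walk on the circle Z_K: the Markov chain of the "Markov design" model.
X = np.zeros(n, dtype=int)
for t in range(n - 1):
    X[t + 1] = (X[t] + rng.choice([-1, 1])) % K

# Bounded features (one Gaussian bump per center), target f, and a noise term
# xi_t that depends on X_{t+1} but satisfies E[xi_t | X_1, ..., X_t] = 0.
centers = np.linspace(0, K, d, endpoint=False)
dist = np.minimum(np.abs(X[:, None] - centers), K - np.abs(X[:, None] - centers))
Phi = np.exp(-0.5 * (dist / 5.0) ** 2)                      # n x d, entries in (0, 1]
L = 1.0                                                      # |phi_i(x)| <= L
f = lambda x: np.sin(2 * np.pi * x / K)
g = lambda x: np.cos(2 * np.pi * x / K)
Z = f(X)
xi = np.zeros(n)
xi[:-1] = g(X[1:]) - 0.5 * (g((X[:-1] + 1) % K) + g((X[:-1] - 1) % K))
C = 2.0                                                      # |xi_t| <= C
Y = Z + xi

# Empirical projections w_hat = Pi Y and w = Pi Z (least-squares fits in F_n).
w_hat = Phi @ np.linalg.lstsq(Phi, Y, rcond=None)[0]
w = Phi @ np.linalg.lstsq(Phi, Z, rcond=None)[0]
err = np.sqrt(np.mean((w_hat - w) ** 2))                     # ||w_hat - w||_n

# Lemma 1 bound: C L sqrt(2 d log(2d/delta) / (n nu_n)).
eigs = np.linalg.eigvalsh(Phi.T @ Phi / n)
nu_n = eigs[eigs > 1e-10].min()                              # smallest strictly-positive eigenvalue
bound = C * L * np.sqrt(2 * d * np.log(2 * d / delta) / (n * nu_n))
print(f"||w_hat - w||_n = {err:.4f}  <=  bound = {bound:.4f}")
```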

[Figure 2 (not reproduced): the space $\mathbb{R}^n$, the linear vector subspace $\mathcal{F}_n$, and some vectors used in the proof of Theorem 1, such as $v$, $\hat{T}v$, $\hat{T}\hat{v}$, $\Pi v$, $\Pi\hat{T}v$, and $\hat{v} = \Pi\hat{T}\hat{v}$.]

Proof of Theorem 1.

Step 1: Using the Pythagorean theorem and the triangle inequality, we have (see Figure 2)
$$\|\hat{v} - v\|_n^2 = \|v - \Pi v\|_n^2 + \|\hat{v} - \Pi v\|_n^2 \le \|v - \Pi v\|_n^2 + \big( \|\hat{v} - \Pi\hat{T}v\|_n + \|\Pi\hat{T}v - \Pi v\|_n \big)^2. \qquad (9)$$
From the $\gamma$-contraction of the operator $\Pi\hat{T}$ and the fact that $\hat{v}$ is its unique fixed point, we obtain
$$\|\hat{v} - \Pi\hat{T}v\|_n = \|\Pi\hat{T}\hat{v} - \Pi\hat{T}v\|_n \le \gamma \|\hat{v} - v\|_n. \qquad (10)$$
Thus from Equations (9) and (10), we have
$$\|\hat{v} - v\|_n^2 \le \|v - \Pi v\|_n^2 + \big( \gamma\|\hat{v} - v\|_n + \|\Pi\hat{T}v - \Pi v\|_n \big)^2. \qquad (11)$$

Step 2: We now provide a high-probability bound on $\|\Pi\hat{T}v - \Pi v\|_n$. This is a consequence of Lemma 1 applied to the vectors $Y = \hat{T}v$ and $Z = v$. Since $v$ is the value function at the points $\{X_t\}_{t=1}^n$, from the definition of the pathwise Bellman operator, we have that for $1 \le t \le n - 1$,
$$\xi_t = y_t - v_t = r(X_t) + \gamma V(X_{t+1}) - V(X_t) = \gamma \Big[ V(X_{t+1}) - \int P(dy \mid X_t)\, V(y) \Big],$$
and $\xi_n = y_n - v_n = -\gamma \int P(dy \mid X_n)\, V(y)$. Thus, Equation (3) holds for $1 \le t \le n$. Here we may choose $C = 2\gamma V_{\max}$ for a bound on $\xi_t$, $t < n$, and $C = \gamma V_{\max}$ for a bound on $\xi_n$. Azuma's inequality may only be applied to the sequence of the first $n - 1$ terms (the $n$-th term adds a contribution of $\gamma V_{\max} L$ to the bound), thus instead of Equation (8) we obtain
$$\Big| \sum_{t=1}^n \xi_t \varphi_i(X_t) \Big| \le \gamma V_{\max} L \big( 2\sqrt{2 n \log(2d/\delta)} + 1 \big),$$
with probability $1 - \delta$, for all $1 \le i \le d$. Combining with Equation (7), we deduce that with probability $1 - \delta$, we have
$$\|\Pi\hat{T}v - \Pi v\|_n \le \gamma V_{\max} L \sqrt{\frac{d}{\nu_n\, n}} \Big( \sqrt{8 \log(2d/\delta)} + \frac{1}{\sqrt{n}} \Big), \qquad (12)$$
where $\nu_n$ is the smallest strictly-positive eigenvalue of $\frac{1}{n}\Phi^\top\Phi$. The claim follows by combining Equations (11) and (12), and solving the resulting inequality for $\|\hat{v} - v\|_n$.

2.3 Generalization bound

When the Markov chain is ergodic (say $\beta$-mixing) and possesses a stationary distribution $\mu$, then it is possible to derive generalization bounds of the form: with probability $1 - \delta$,
$$\|\hat{V} - V\|_\mu \le \frac{c}{\sqrt{1 - \gamma^2}}\, \inf_{f \in \mathcal{F}} \|V - f\|_\mu + O\Big( \sqrt{\frac{d \log(d/\delta)}{n\, \nu}} \Big),$$
which provides a bound expressed in terms of:
- the best possible approximation of $V$ in $\mathcal{F}$ measured with $\mu$,
- the smallest eigenvalue $\nu$ of the Gram matrix $\big( \int \varphi_i \varphi_j\, d\mu \big)_{i,j}$,
- the $\beta$-mixing coefficients of the chain (hidden in the $O$ notation).
(See [Lazaric, Ghavamzadeh, Munos, Finite-sample analysis of LSTD, 2010].)
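As an end-to-end illustration of Section 2 (again an illustrative sketch, not part of the original notes: the chain, features and constants are assumptions, and the right-hand side is evaluated with the bound as reconstructed in Equation (2), which is typically very loose numerically), the snippet below computes the exact value function of a small chain, runs pathwise LSTD on a sampled trajectory, and evaluates both sides of the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
K, d, n, gamma, delta = 20, 5, 2000, 0.9, 0.05

# Reflecting random walk on {0, ..., K-1}; reward 1 at the right end.
P = np.zeros((K, K))
for x in range(K):
    P[x, max(x - 1, 0)] += 0.5
    P[x, min(x + 1, K - 1)] += 0.5
R = np.zeros(K)
R[K - 1] = 1.0
V = np.linalg.solve(np.eye(K) - gamma * P, R)        # true value function
V_max = np.abs(V).max()

# Features bounded by L = 1 and a sampled trajectory.
centers = np.linspace(0, K - 1, d)
feat = lambda x: np.exp(-0.5 * ((x - centers) / 3.0) ** 2)
L = 1.0
X = np.zeros(n, dtype=int)
X[0] = K // 2
for t in range(n - 1):
    X[t + 1] = rng.choice(K, p=P[X[t]])
Phi = np.array([feat(x) for x in X])
r = R[X]

# Pathwise LSTD solution (same construction as in Section 2.1).
A = Phi[:-1].T @ (Phi[:-1] - gamma * Phi[1:]) + np.outer(Phi[-1], Phi[-1])
alpha_hat = np.linalg.pinv(A) @ (Phi.T @ r)
v_hat = Phi @ alpha_hat

# Quantities appearing in the bound of Theorem 1.
v = V[X]
approx = np.sqrt(np.mean((v - Phi @ np.linalg.lstsq(Phi, v, rcond=None)[0]) ** 2))  # ||v - Pi v||_n
eigs = np.linalg.eigvalsh(Phi.T @ Phi / n)
nu_n = eigs[eigs > 1e-10].min()
estim = gamma * V_max * L * np.sqrt(d / (nu_n * n)) * (np.sqrt(8 * np.log(2 * d / delta)) + 1 / np.sqrt(n))
rhs = approx / np.sqrt(1 - gamma ** 2) + estim / (1 - gamma)
lhs = np.sqrt(np.mean((v_hat - v) ** 2))             # ||v_hat - v||_n
print(f"||v_hat - v||_n = {lhs:.3f}   bound = {rhs:.3f}")
```

The measured error is far below the bound, which mostly reflects the conservative constants and the $1/(1-\gamma)$ factor rather than the behavior of the algorithm.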

3 Other results

Similar results have been obtained for different algorithms:
- Approximate value iteration [MS08]
- Policy iteration with Bellman residual minimization [MMLG10]
- Policy iteration with modified Bellman residual minimization [ASM08]
- Classification-based policy iteration [LGM10a]

But many open problems remain...

References

[ASM08] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89-129, 2008.

[LGM10a] A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of a classification-based policy iteration algorithm. In International Conference on Machine Learning, 2010.

[LGM10b] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In International Conference on Machine Learning, pages 615-622, 2010.

[MMLG10] O. A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite-sample analysis of Bellman residual minimization. In Masashi Sugiyama and Qiang Yang, editors, Asian Conference on Machine Learning, JMLR: Workshop and Conference Proceedings, volume 13, 2010.

[MS08] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Journal of Machine Learning Research, 9:815-857, 2008.
