CHAPTER III - MARKOV CHAINS

Transcription

1 CHAPTER III - MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of this. We shall only consider in this Chapter Markov chains on a countable (finite or infinite) state space F. In the finite case we can identify F as the set of positive integers F = {1, 2,..., m} for some m = F, and in the infinite case as all positive integers F = {n Z : n 1}. The chain is determined by a set of transition probabilities p t (i, j), i, j F, t = 0, 1, 2,..., where t denotes discrete time and p t (i, j) is the probability of moving from state i to state j at time t. Thus (1.1) p t (i, j) 0, i, j, F ; p t (i, j) = 1, i F. To define a probability space associated with these transition probabilities we need to set a distribution function π : F R for the initial state of the chain. Thus π( ) satisfies (1.2) π(j) 0, j F ; π(j) = 1. The probability space Ω is given by Ω = F, which is the infinite product of the state space F, and the σ field F is the Borel field generated by finite dimensional rectangles as usual. Random variables X n : Ω F, n = 0, 1, 2..., are defined by X n (ω) = ω n, n = 0, 1,..., where ω = (ω 0, ω 1,..) Ω. The probability measure P depends on π( ), so we shall denote it by P π. The measure P π is determined once we know the p.d.f. for every set of variables (X 0,..., X m ), m 0. This is given by the formula m 1 (1.3) P (X 0 = j 0, X 1 = j 1,..., X m = j m ) = π(j 0 ) p t (j t, j t+1 ). Note the p.d.f for X 0 is π( ), and for any subset A F, (1.4) P (X t+1 A Xt, X t 1,.., X 0 ) = P (X t+1 A Xt ) with probability 1. The so called Markov property or no memory property (1.4) characterizes a Markov chain, and can be an alternative starting point for the theory of Markov chains. Note the similarity of (1.4) to the definition of a Martingale. One big advantage of studying Markov chains is a technique is available to compute many expectation values. This is the so called backward Kolmogorov equation. Thus let f : F R be a function and suppose we want to evaluate the expectation j F j F (1.5) E [ f(x T ) X0 = i ], i F. 1 t=0

2 2 JOSEPH G. CONLON Then (1.5) is given by u(i, 0) where u(i, t), i F, t = 0,.., T, is a solution to the terminal value problem, (1.6) u(i, t) = j F p t (i, j)u(j, t + 1), i F, t < T ; u(i, T ) = f(i), i F. Note the equation (1.6) is to be solved backwards in time from time T to time 0. We turn now to Markov chains where the transition probabilities are independent of time, whence p t (i, j) = p(i, j), t = 0, 1,.., i, j F. An important property of these Markov chains is the strong Markov property. Observe from (1.4) (1.7) P (X t+1 A Xt, X t 1,.., X 0 ) = P (X t+1 A Xt ) = P (X 1 A X0 ) with probability 1. The strong Markov property is a random version of (1.7). Thus let τ : Ω Z be a nonnegative measurable function which has the property (1.8) {ω Ω : τ(ω) = T } F(X 0, X 1,.., X T ), T = 0, 1,... Such a function is called a stopping time. Proposition 1.1 (Strong Markov Property). Consider the time independent Markov chain (X 0, X 1,..) with initial distribution π( ) and let τ( ) be a stopping time for this chain. Then the sequence of variables (X τ, X τ+1, X τ+2,...) has the same distribution as the sequence (X 0, X 1,..) with initial distribution π τ ( ), where π τ ( ) is the pdf of X τ under π( ). Proof. We illustrate the proof with the simplest case, which is then easy to generalize. Thus (1.9) P (X τ = j 0, X τ+1 = j 1, X τ+2 = j 2 ) = P (τ = m, X m = j 0, X m+1 = j 1, X m+2 = j 2 ). m=0 Now from (1.8) it follows (1.10) P (τ = m, X m = j 0, X m+1 = j 1, X m+2 = j 2 ) = P (τ = m, X m = j 0 )p(j 0, j 1 )p(j 1, j 2 ). Since we have (1.11) we conclude P (τ = m, X m = j 0 ) = π τ (j 0 ), m=0 (1.12) P (X τ = j 0, X τ+1 = j 1, X τ+2 = j 2 ) = π τ (j 0 )p(j 0, j 1 )p(j 1, j 2 ), whence the result holds. Definition 1. An initial distribution π( ) for the time independent Markov chain is said to be stationary if it satisfies the equation (1.13) π(i)p(i, j) = π(j), j F. i F

3 MATH Evidently (1.13) implies if X 0 has distribution π( ) then X 1 also has distribution π( ). One can easily see further the entire sequence (X 0, X 1,..) is stationary in the sense of Definition 4 of Chapter II page 18. An obvious question to ask then is does a Markov chain have a stationary distribution, and if so is it unique? We can give a satisfactory answer to the uniqueness question by introducing the notion of indecomposability of the chain. We say the chain is decomposable if there exists disjoint subsets A 0, A 1 of F such (1.14) p(i, j) = 1, i A k, k = 0, 1. j A k Thus (1.14) states the chain allows no communication between the subsets A 0 and A 1 of F. Hence we may reduce the original chain to independent chains on the reduced state spaces A 0, A 1. A chain which is not decomposable is called indecomposable. Proposition 1.2. Suppose the Markov chain is indecomposable and a stationary distribution π( ) for it exists. Then π( ) is unique and the associated stationary sequence (X 0, X 1,...) is ergodic. Proof. We first show for stationary π( ) the corresponding stationary sequence (X 0, X 1,...) is ergodic. Thus let T be the shift operator on Ω, P = P π be the probability measure on Ω, and C F be an invariant set. We define a function φ : F R by φ(j) = P (C X 0 = j), j F. Evidently φ(j) is only uniquely defined if π(j) > 0. Otherwise we can take φ(j) to be anything we want. Note now (1.15) φ(x 0 ) = P (C X 0 ) with probability 1, and (1.16) P (C X 0 ) = E[ P (C X 1, X 0 ) X 0 ] with probability 1. By the invariance of C we have C F(X 1, X 2,...), whence it follows from the Markov property (1.4) (1.17) P (C X 1, X 0 ) = P (C X 1 ) = φ(x 1 ) with probability 1. Evidently (1.16), (1.17) imply (1.18) p(i, j)φ(j) = φ(i), i F with π(i) > 0. j F Note from (1.13) if π(i) > 0 and π(j) = 0 then p(i, j) = 0, whence the values of φ(j) for π(j) = 0 do not enter on the LHS of (1.18). Observe next for any set C F(X 0, X 1,...) one has (1.19) lim n P (C X n, X n 1,..., X 0 ) = χ C with probability 1. This intuitively obvious result follows from the Martingale convergence theorem. In fact define a sequence of random variables Z n, n = 0, 1, 2,.., by Z n = P (C X n, X n 1,..., X 0 ). Then from the definition of conditional expectation one has (1.20) E[ Z n Xn 1,..., X 0 ] = Z n 1, n = 1, 2,..., with probability 1, which implies (1.21) E[ Z n Z n 1,..., Z 0 ] = Z n 1, n = 1, 2,..., with probability 1.

4 4 JOSEPH G. CONLON Hence lim n Z n exists with probability 1, and it is easy to see the limit must be χ C. Observe now for the invariant set C, one can see as in (1.17) P (C X n, X n 1,..., X 0 ) = φ(x n ), whence we have (1.22) lim n φ(x n) = χ C with probability 1. Since we may assume 0 φ(j) 1, j F, and we have from (1.22) (1.23) E[ φ(x 0 ){1 φ(x 0 )} ] = lim n E[ φ(x n){1 φ(x n )} ] = 0, we conclude φ(j) can only take the values 0 or 1 for j F with π(j) > 0. Now let us define sets A 1, A 0 by (1.24) A 1 = {i F : π(i) > 0, φ(i) = 1}, A 0 = {i F : π(i) > 0, φ(i) = 0}. It follows from (1.18) (1.25) j A 1 p(i, j) = 1, i A 1. Note in (1.25) we are again using the fact if π(i) > 0 and π(j) = 0 then p(i, j) = 0. Replacing the function φ( ) by 1 φ( ) in (1.18), we see also from (1.25) (1.26) p(i, j) = 1, i A 0. j A 0 Evidently (1.25), (1.26) and the indecomposability assumption imply either A 0 or A 1 is empty, so let us assume A 0 is empty. It follows then from (1.15) (1.27) P (C) = E[ φ(x 0 ) ] = 1. We similarly see P (C) = 0 if A 1 is empty. We have proved the ergodicity of the stationary sequence. We can prove the uniqueness of π( ) by using Proposition 4.3 of Chapter II. Thus let π 1 ( ) and π 2 ( ) be two invariant distributions with corresponding probability measures P 1, P 2. From Proposition 4.3 of Chapter II it follows P 1 and P 2 are mutually singular, so there exists E F such P 1 (E) = 1, P 2 (E) = 0. Now let C F be the set (1.28) C = n=0t n E, whence it follows T 1 C C. Evidently P 1 (C) = 1, and P 2 (C) = 0 by the measure preserving property of P 2. We consider now the distribution π( ) defined by π( ) = [π 1 ( ) + π 2 ( )]/2, which evidently also satisfies (1.13). The corresponding probability measure is P = [P 1 + P 2 ]/2. Hence T is measure preserving and ergodic under P. Since T 1 C C, it follows from the measure preserving property of T (1.29) P ([C T 1 C] [T 1 C C]) = 0. Lemma 3.2 of Chapter II implies then P (C) = 0 or P (C) = 1. However we easily see P (C) = [P 1 (C) + P 2 (C)]/2 = 1/2, which is a contradiction.

5 MATH Definition 2. A time independent Markov chain is said to be irreducible if all states communicate. That is for any i, j F, (1.30) P (X n = j X 0 = i) > 0 for some n 1 which can depend on (i, j). Note irreducibility implies indecomposability. Lemma 1.1. Suppose a Markov chain is time independent irreducible and a stationary distribution π( ) for it exists. Then π(j) > 0 for all j F. Proof. We use equation (1.13) and choose i F such π(i) > 0. Evidently if (1.30) holds for n = 1 then π(j) > 0. Observe now (1.13) implies (1.31) π(i 1 )p(i 1, i 2 )p(i 2, j) = π(j), j F. Since i 1,i 2 F (1.32) P (X 2 = j X 0 = i) = p(i, i 2 )p(i 2, j), we conclude if (1.30) holds for n = 2 then π(j) > 0. Evidently we can generalize this argument to any n 1. We can apply Proposition 5.2 of Chapter II to obtain a relation between recurrence for the Markov chain and the invariant measure. Proposition 1.3. Suppose a time independent Markov chain is irreducible and has stationary distribution π( ). Then for all i F one has P (X n = i infinitely often X 0 = i) = 1. Furthermore if τ i is the first recurrence time to i one has E[ τ i X 0 = i] = 1/π(i). Proof. The formula for the recurrence time is immediate from Proposition 5.2 since the stationary sequence is ergodic. This result evidently also implies i 2 F (1.33) P (X n = i for some finite n X 0 = i) = 1. Recurrence i.e. return to i infinitely often follows immediately from (1.33). 2. Markov Chains with Finite State Space We consider the case when the Markov chain is time independent and the state space is finite, F = m, in which case the transition probabilities p(i, j), i, j F, form an m m transition matrix T with non-negative entries which satisfy (2.1) T1 = 1, 1 = [1, 1,..., ]. Equation (1.13) can be written in matrix form as (2.2) πt = π or alternatively T π = π, where T is the adjoint of T. Now (2.1) implies T has eigenvalue 1, and hence T has eigenvalue 1. Thus a non-trivial solution to (2.2) exists. Proposition 2.1. There exists a non-trivial solution π to (2.2) with all nonnegative entries. If the chain is indecomposable -see (1.14)- it is unique.

6 6 JOSEPH G. CONLON Proof. Evidently (1.13) implies (2.3) π(i) p(i, j) π(j), j F. i F If strict inequality holds in (2.3) for some j F, then on summing (2.3) over j F we conclude (2.4) π(i) > π(j), i F j F which is an impossibility. We have shown therefore the vector with entries π(i), i F, is an eigenvector of T with eigenvalue 1. Assume now the chain is indecomposable and π 1 ( ) and π 2 ( ) are two distinct invariant measures. Setting π( ) = π 1 ( ) π 2 ( ) and defining sets A 0, A 1 by (2.5) A 0 = {i F : π(i) > 0}, A 1 = {i F : π(i) < 0}, we see from the fact i F π(i) = 0, both A 0 and A 1 are non-empty. Using the fact π( ) is a solution to (2.2), we conclude equality holds in (2.3) for all j F. It follows if j / A 0 A 1, then p(i, j) = 0 for all i A 0 A 1. Further if j A 1 then p(i, j) = 0 for i A 0. We conclude (2.6) p(i, j) = 1, i A 0. j A 0 Since we get a similar result for A 1 the chain is decomposable, a contradiction. We have already seen in Proposition 1.2 the stationary sequence (X 0, X 1,...) associated with the Markov chain is ergodic if the chain is indecomposable. Next we wish to show under certain conditions on the transition matrix T the sequence is strong mixing. In order to do this we first prove a classical theorem in the theory of positive matrices. Theorem 2.1 (Perron-Frobenius Theorem). Suppose T = (t i,j ) is an n n matrix with entries t i,j satisfying t i,j 0 and such for some positive integer k the entries of T k are all strictly positive. Then there exists an eigenvalue r of T such : (a) r is real and strictly positive. (b) r is a simple root of the characteristic polynomial of T. (c) If λ C is a root of the characteristic polynomial of T different from r then λ < r. (d) The eigenvector of T for r can be chosen so all its entries are strictly positive. Proof. We first construct an eigenvector for T with all positive entries. To do this let S be the set (2.7) S = {x = (x 1,.., x n ) R n : x i > 0, i = 1,.., n, x = 1}, so S is the part of the unit sphere which is strictly in the positive quadrant (if n = 2). Define now a function r : S R by (Tx) i (2.8) r(x) = min. 1 i n x i

7 MATH It is clear r( ) is a non-negative continuous function on S and is bounded above as n (2.9) r(x) max t i,j. It follows from (2.9) 1 i n j=1 (2.10) r = sup r(x) <, x S Hence for all x S, (2.11) (Tx) i rx i for some i = i(x), 1 i n, depending on x S. Let x N S, N = 1, 2,..., be a sequence such lim N r(x N ) = r. Then some subsequence of x N converges to a point x S, the closure of S in R n so x = 1. Furthermore (2.8) implies (2.12) z = [T r]x is a vector with all nonnegative entries. Since the matrix T k has all strictly positive entries, it follows the vector w k = T k z has all strictly positive entries if z 0. In case (2.12) implies (2.13) r(t k x / T k x ) > r, a contradiction to the definition of r, from which we conclude z = 0. This also implies T k x is an eigenvector of T with eigenvalue r which has all strictly positive entries. We have proven (a) and (d). Next we turn to (b). Let λ C be an eigenvalue of T in which case there exists a non-zero vector x C n such Tx = λx. Letting x R n be the vector x = ( x 1,.., x n ), we see (2.14) (T x) i λ x i, i = 1,.., n, whence it follows from (2.11) λ r. Supposing now λ = r, we can argue in the same way we established x is an eigenvector of T with eigenvalue r, x is also an eigenvector of T with eigenvalue r. Thus we have (2.15) T k x = λ k x, T k x = r k x. This implies (2.16) (T k x) i = (T k x) i, 1 i n. We conclude from (2.16) using the fact all entries of T k are strictly positive (2.17) x = e iθ x, for some θ [0, 2π). Thus λ = r implies λ = r, which completes the proof of (c). Note the argument of the previous paragraph rules out the possibility of having 2 independent eigenvectors with eigenvalue r. In fact if two such existed we could construct an eigenvector x = (x 1,..., x n ) all of whose entries do not have identical phase -e iθ in (2.17). This yields a contradiction. To complete the proof of (b) we need to exclude the possibility T has any generalized eigenvectors with eigenvalue r. Let us suppose a generalized eigenvector exists, whence there exists y C d such (2.18) [T r]y = x,

8 8 JOSEPH G. CONLON where x is the unique eigenvector of T with eigenvalue r and all positive entries. From what we have already proved, there also exists a unique eigenvector x of T with eigenvalue r and all positive entries. Applying x to the left side of the equation (2.18) we obtain (2.19) x [T r]y = x x > 0. Since x [T r] = 0 we have a contradiction. Corollary 2.1. Suppose a time independent Markov chain on a finite state space F is irreducible. Then it has a unique invariant distribution π( ) and π(j) > 0 for all j F. Proof. For N = 1, 2,..., let K N be the matrix (2.20) K N = 1 N N 1 n=0 T n. Irreducibility implies for large enough N the matrix K N has all strictly positive entries. Since πk N = π it follows from Theorem 2.1 π( ) is unique and all entries of π( ) are strictly positive. Definition 3. An n n matrix T has a spectral gap if there is exactly one simple root λ C of the characteristic polynomial for T which satisfies λ = σ(t) = max{ λ C : λ eigenvalue of T}. Note the Perron-Frobenius theorem implies a matrix T with nonnegative entries such T k has strictly positive entries for some k 1 has a spectral gap. Proposition 2.2. Suppose the transition matrix T for a finite state Markov chain has a spectral gap. Then the associated stationary sequence (X 0, X 1,...) is strong mixing. Proof. From (3.26) of Chapter II we need to show for any bounded Borel measurable functions f, g : R m+1 R one has (2.21) lim n E[ f(x 0,..., X m )g(x n, X n+1,..., X n+m ) ] = E[ f(x 0, X 1,..., X m ) ]E[ g(x 0,..., X m ) ]. Observe now for n m the LHS of (2.21) can be written as (2.22) E[ f(x 0,..., X m ) z F T n m X m,z E[ g(x 0, X 1,..., X m ) X 0 = z ] ], where T n m y,z, y, z F, denotes the entries of the matrix T n m. Hence (2.21) will follow from (2.22) if we can show (2.23) lim k Tk y,z = π(z), y, z F. Using the notation of (2.1), (2.2), consider the matrix A = T 1π, where we are taking π( ) to be a row vector and 1 a column vector. By the spectral gap condition all eigenvalues of A have absolute value strictly less than 1, whence lim k A k = 0. Since A k = T k 1π the limit (2.23) holds.

9 MATH We have used the Perron-Frobenius theorem to establish some properties of time independent Markov chains, assuming certain conditions on the transfer matrix. Another condition on the transfer matrix which can help us understand the corresponding Markov chain is so called reversibility. This condition is in effect stating the transfer matrix is self-adjoint with respect to some positive definite diagonal quadratic form. In case much more is known about the transfer matrix than in the general case, in particular all its eigenvalues are real and the matrix is diagonalizable. To see what this states about the entries (t i,j ) of the n n matrix T, suppose the quadratic form is given by n (2.24) [ξ, ζ] µ = µ(i)ξ i ζ i, ξ, ζ R n, i=1 where µ R n has all positive entries. Then the self-adjointness condition (2.25) [ξ, Tζ] µ = [Tξ, ζ] µ, is equivalent to (2.26) µ(i)t i,j = µ(j)t j,i, 1 i, j n. From (2.26) we see if we normalize the vector µ( ) so the sum of its elements is 1, then µ is the invariant measure for the Markov chain. Note for a reversible indecomposable Markov chain, if T does not have eigenvalue 1 there is a spectral gap and hence the chain is strong mixing. 3. Examples of Markov Chains We first note a trivial case of a time independent Markov chain, which is a sequence of i.i.d. random variables (X 0, X 1,..). Let p(i), i F, be the probability density function of X 0. Then the transition probabilities are p(i, j) = p(j), i, j F. If F is finite then all the rows of the matrix T are identical, so T has rank 1. In particular the pdf of X 0 is the invariant measure π and the matrix A = T 1π is the 0 matrix. Hence the sequence is strong mixing by Proposition 2.2. Another example of a time independent Markov chain is the sum S n = X X n, n = 0, 1, 2,.. of i.i.d. variables. In case p(i, j) = p(j i), where p( ) is the pdf of X 1. Thus sums of i.i.d. variables give rise to chains which are spatially translation invariant. It is therefore often useful to use the Fourier transform to analyse them. We essentially did this in our proof of the CLT in Chapter I. In the case of the Bernoulli variables X 0 = ±1 with probability 1/2, for which S n is the standard random walk on Z, we have p(i, j) = p(j, i) = 1/2 if and only if i j = 1. This chain is therefore reversible, and it is also easy to see it is irreducible. We have in Chapter 1 proved it is recurrent, but no invariant measure exists (compare Proposition 1.3). To see this note if π( ) is an invariant measure then trivially τ π( ) is also an invariant measure where τπ(i) = π(i + 1), i Z. The uniqueness theorem Proposition 1.2 implies then τπ( ) = π( ), in which case π : Z R is a constant function, contradicting the normalization condition for π( ). University of Michigan, Department of Mathematics, Ann Arbor, MI address: conlon@umich.edu