Markov Chain Fundamentals (Contd.)

Recap from last lecture:

- Any starting probability distribution π_0 can be written as a linear combination of the eigenvectors of the transition matrix, including v_0 (the stationary distribution), and hence its evolution π_t with t is given by

      π_0 = v_0 + Σ_{i≠0} a_i v_i ;      π_t = v_0 + Σ_{i≠0} a_i λ_i^t v_i ,

  and since |λ_{i≠0}| < 1, the components along the v_{i≠0} diminish as t grows larger, making π_t → v_0.

- The coefficient a_0 of v_0 in any such linear combination is always 1.

Facts on Markov Matrices

Fact 1: Any irreducible and ergodic Markov chain with a symmetric transition matrix has the uniform stationary distribution.

Proof. Using P(x, y) = P(y, x) in the condition for detailed balance shows that the uniform distribution satisfies π(x) P(x, y) = π(y) P(y, x), and the claim follows. Note that the dominant eigenvector (corresponding to λ_0 = 1) in this case is [1/N, 1/N, ..., 1/N], where N = |Ω|.

Fact 2: We can define another transition matrix P̃ by

      P̃_ij = π(i)^{1/2} P(i, j) π(j)^{-1/2},

where π is the stationary distribution corresponding to P. More compactly,

      P̃ = D_π^{1/2} P D_π^{-1/2},      where D_π = diag(π(1), π(2), ..., π(N)).

Note that:

- P̃ is not necessarily a stochastic matrix.

- P̃ is symmetric whenever P represents a reversible Markov chain.

  Proof. From the condition for detailed balance we have

      π(i) P(i, j) = π(j) P(j, i)
      ⟹ π(i)^{1/2} P(i, j) π(j)^{-1/2} = π(j)^{1/2} P(j, i) π(i)^{-1/2}
      ⟹ P̃_ij = P̃_ji.

- P̃ has the same eigenvalues as P and hence the same spectral gap.

  Proof. P̃ = D_π^{1/2} P D_π^{-1/2} implies D_π^{-1/2} P̃ D_π^{1/2} = P. Let λ be an eigenvalue of P̃ and x the corresponding eigenvector. Then

      P̃ x = λ x  ⟹  D_π^{1/2} P D_π^{-1/2} x = λ x  ⟹  P (D_π^{-1/2} x) = λ (D_π^{-1/2} x),

  which shows that P too has the eigenvalue λ, with D_π^{-1/2} x as the corresponding eigenvector.

- When P̃ is symmetric, it has an orthogonal basis of eigenvectors, and applying D_π^{-1/2} to these eigenvectors gives a corresponding basis of eigenvectors of P.

- Since (P̃)^t = D_π^{1/2} P^t D_π^{-1/2}, the convergence time of the P̃-Markov chain is essentially the same as that of the (possibly asymmetric) P-Markov chain.
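As a quick numerical sanity check of Fact 2, the following minimal NumPy sketch is an illustrative addition (the 4-state Metropolis chain and all names in it are made up for this purpose, not taken from the notes): it builds a small reversible chain, forms P̃ = D_π^{1/2} P D_π^{-1/2}, and confirms that P̃ is symmetric and has the same eigenvalues as P.

# Sketch (illustrative, not from the notes): verify Fact 2 numerically.
import numpy as np

# A reversible chain: Metropolis walk on the path 0-1-2-3 with target weights w.
w = np.array([1.0, 2.0, 3.0, 4.0])
pi = w / w.sum()                                    # stationary distribution
N = len(w)
P = np.zeros((N, N))
for i in range(N):
    for j in (i - 1, i + 1):
        if 0 <= j < N:
            P[i, j] = 0.5 * min(1.0, w[j] / w[i])   # propose a neighbor, accept w.p. min(1, w_j/w_i)
    P[i, i] = 1.0 - P[i].sum()                      # the self-loop takes the remaining probability

D_half = np.diag(np.sqrt(pi))
D_half_inv = np.diag(1.0 / np.sqrt(pi))
P_tilde = D_half @ P @ D_half_inv

print(np.allclose(P_tilde, P_tilde.T))              # True: reversibility makes P_tilde symmetric
print(np.allclose(np.sort(np.linalg.eigvals(P).real),
                  np.sort(np.linalg.eigvalsh(P_tilde))))   # same spectrum as P
print(np.allclose(pi @ P, pi))                      # pi is indeed stationary for P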

Mixing Time and Spectral Gap

Theorem: Let P be an ergodic, symmetric Markov chain with N states and spectral gap δ. Then its mixing time is bounded above by

      τ_ε < ln(N ε^{-1}) / δ.

Proof. Let us first write the initial distribution π_0 as a linear combination of P's eigenvectors:

      π_0 = π_eq + π_neq,      where π_neq = Σ_{k∈[N−1]} a_k v_k.

Then the distribution after t steps is given by

      π_t = π_0 P^t = π_eq + π_neq P^t = π_eq + Σ_{k∈[N−1]} a_k λ_k^t v_k.

Also, the total variation distance is given by

      ||π_t − π_eq||_TV = (1/2) ||π_neq P^t||_1,

where the subscript 1 denotes the L1 norm. For any N-dimensional vector v, the Cauchy-Schwarz inequality relates its L1 norm to its L2 norm by

      ||v||_2 ≤ ||v||_1 ≤ √N ||v||_2.

Thus

      ||π_t − π_eq||_TV ≤ (1/2) √N ||π_neq P^t||_2.

Since P is symmetric, its eigenvectors v_k are orthogonal. Then Pythagoras' theorem and |λ_k| ≤ 1 − δ give

      ||π_neq P^t||_2^2 = Σ_{k∈[N−1]} |a_k|^2 |λ_k|^{2t} ||v_k||_2^2
                        ≤ (1 − δ)^{2t} Σ_{k∈[N−1]} |a_k|^2 ||v_k||_2^2
                        = (1 − δ)^{2t} ||π_neq||_2^2
                        = (1 − δ)^{2t} (||π_0||_2^2 − ||π_eq||_2^2)
                        ≤ (1 − δ)^{2t},      since ||π_0||_2^2 ≤ 1,

which in turn gives

      ||π_t − π_eq||_TV ≤ (1/2) √N (1 − δ)^t ≤ (1/2) √N e^{−δt}.

Recall that

      τ_ε = max_{π_0} min { t : ||π_t − π_eq||_TV < ε },

so that setting (1/2) √N e^{−δt} = ε gives us an upper bound on the mixing time,

      τ_ε < ln(√N ε^{-1} / 2) / δ,

which implies the weaker bound stated in the theorem.
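To see the bound in action, here is a minimal illustrative sketch (the 16-state lazy walk on a cycle is just an example of a symmetric chain, not one from the lecture): it computes the spectral gap δ numerically and prints the actual total variation distance next to the (1/2) √N e^{−δt} bound from the proof.

# Sketch: compare the empirical TV distance with the (1/2) sqrt(N) e^{-delta*t}
# bound for a small symmetric chain (lazy walk on a cycle of N states).
import numpy as np

N = 16
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5                      # laziness gives aperiodicity
    P[i, (i - 1) % N] += 0.25
    P[i, (i + 1) % N] += 0.25          # symmetric, so the stationary distribution is uniform

eigs = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
delta = 1.0 - eigs[1]                  # spectral gap = 1 - second largest |eigenvalue|

pi_eq = np.full(N, 1.0 / N)
pi_t = np.zeros(N)
pi_t[0] = 1.0                          # start from a point mass
for t in range(1, 41):
    pi_t = pi_t @ P
    tv = 0.5 * np.abs(pi_t - pi_eq).sum()
    bound = 0.5 * np.sqrt(N) * np.exp(-delta * t)
    if t % 10 == 0:
        print(f"t={t:3d}  TV={tv:.4f}  bound={bound:.4f}")

For this slowly mixing chain the gap is small (δ ≈ 0.04), and the printed TV distances stay below the bound, as the theorem guarantees.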

Mixing Time from First Principles

We will learn some formal methods for bounding the mixing time of a Markov chain (canonical paths, coupling, etc.) in the following lecture. For now, let us try to bound it from first principles in the examples below.

Walking on the hypercube

A random walk on an n-dimensional hypercube can be used to generate random n-bit strings. As shown below, an n-dimensional hypercube has 2^n vertices, and every pair of vertices at Hamming distance 1 from each other forms a pair of neighbors along an edge of the hypercube.

Figure 1: The 3-dimensional hypercube

To sample a random n-bit string, we can define a Markov chain on the hypercube using the Metropolis method as follows. Choose an index-bit pair (i, b) uniformly at random from {1, 2, ..., n} × {0, 1} and set the i-th bit of the string at the current vertex σ to b, moving to a (possibly) new vertex τ. This defines the following Markov kernel over the vertices of the hypercube:

      P(σ, τ) = 1/(2n)   if δ_H(σ, τ) = 1,
                1/2      if σ = τ (δ_H(σ, τ) = 0),
                0        if δ_H(σ, τ) > 1.

Clearly, the state space is connected and every state has a self-loop (aperiodicity), which makes the resulting Markov chain ergodic. The same chain can also be understood in the following manner. At each step, with probability 1/2 we move to one of the n neighbors chosen uniformly at random (i.e. flip the bit σ_i for a uniformly random i), and with probability 1/2 we stay where we are. Because we leave σ_i alone or flip it with equal probability, its new value is equally likely to be 0 or 1 regardless of its previous value. This means that each bit of σ becomes random whenever we touch it, so the entire string becomes random once we have touched each bit at least once. This is nothing but the Coupon Collector's problem!

Thus the expected number of bits that we have not touched by the end of t steps can be treated as a measure of how far we then are from the stationary distribution. Since the probability that a particular bit is still untouched after t steps is (1 − 1/n)^t, we have

      ||π_t − π_eq||_TV ≤ E[# untouched bits] = n (1 − 1/n)^t ≤ n e^{−t/n},

using 1 − x ≤ e^{−x}. Setting this equal to ε gives the following upper bound on the mixing time:

      τ_ε < n ln(n ε^{-1}) = O(n ln n)      for constant ε.

Thus the Markov chain defined above mixes rapidly.
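A minimal simulation makes the coupon-collector connection concrete (this sketch, its function name, and the choices n = 64 and 2000 trials are illustrative additions, not from the notes): it runs the (index, bit) update until every index has been touched at least once and compares the average touch-all time with n ln n.

# Sketch: the lazy hypercube walk via random (index, bit) updates, and the
# coupon-collector time at which every index has been touched at least once.
import math
import random

def touch_all_time(n, rng):
    """Steps of the (index, bit) walk until every index has been chosen once.
    Once index i is touched, bit i is uniformly random, so the whole string is
    random from that time onward (coupon collector)."""
    sigma = [0] * n                     # current vertex; any starting string works
    untouched = set(range(n))
    t = 0
    while untouched:
        t += 1
        i = rng.randrange(n)            # choose index i uniformly
        b = rng.randrange(2)            # choose bit b uniformly; b == sigma[i] is the self-loop
        sigma[i] = b
        untouched.discard(i)
    return t

n, trials = 64, 2000
avg = sum(touch_all_time(n, random.Random(s)) for s in range(trials)) / trials
print(f"average touch-all time ~ {avg:.1f};  n ln n = {n * math.log(n):.1f}")

For n = 64 the empirical average comes out near n·H_n ≈ n ln n + 0.58 n, consistent with the coupon-collector estimate used above.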

Riffle-shuffling a deck of cards

Another interesting Markov chain is shuffling a deck of cards. We will analyze the riffle shuffle model due to Edgar N. Gilbert, Claude Shannon, and Jim Reeds (henceforth GSR), following the statistician David Aldous and the magician-turned-mathematician Persi Diaconis (who showed that 7 riffle shuffles get us reasonably close to a truly random deck of cards); but before that, let us take a look at a familiar analysis.

Imagine that the cards are numbered 1 to 52 (instead of carrying the colorful suits) and initially arranged in that order from top to bottom. Now take the card on the top and insert it uniformly at random into one of the 52 positions in the deck below it. If we do this repeatedly, then at any stage the cards that have been inserted below the original bottom card are in uniformly random relative order. Thus, as soon as the original bottom card rises to the top of the deck and is itself inserted (which, like the foregoing example of walking on the hypercube, should happen in O(n log n) steps), the whole deck becomes random. For a deck of 52 cards, this amounts to around 200 such top-in-at-random shuffles.

Figure 2: Riffle shuffle

In a riffle shuffle (illustrated in Figure 2), the deck is split into two roughly equal parts, which are then interleaved by dropping cards from the bottoms of the two parts in a random fashion. In the GSR model of the riffle shuffle, if at any point the two parts have x and y cards remaining, the next card is dropped from them with probabilities x/(x+y) and y/(x+y) respectively. This is better understood backwards: each card has an equal and independent chance of being pulled back into one of the two parts.
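Here is a minimal sketch of one GSR riffle shuffle as just described (the Binomial(n, 1/2) cut is the usual convention for "roughly equal parts", and the function name gsr_riffle is an illustrative choice, not from the notes): the next card is dropped from a packet with probability proportional to the number of cards it still holds.

# Sketch: one GSR riffle shuffle.  deck[0] is the top card of the deck.
import random

def gsr_riffle(deck, rng=random):
    n = len(deck)
    cut = sum(rng.randrange(2) for _ in range(n))    # Binomial(n, 1/2) cut point
    left, right = deck[:cut], deck[cut:]             # top packet, bottom packet
    out = []
    while left or right:
        x, y = len(left), len(right)
        # drop from 'left' w.p. x/(x+y), from 'right' w.p. y/(x+y)
        if rng.random() < x / (x + y):
            out.append(left.pop())                   # cards fall from the bottom of each packet
        else:
            out.append(right.pop())
    out.reverse()                                    # the shuffled deck was built bottom-up
    return out

deck = list(range(1, 53))
for _ in range(7):                                   # ~7 shuffles for 52 cards, per Diaconis
    deck = gsr_riffle(deck)
print(deck)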

The equivalent inverse riffle shuffle is illustrated in Figure 3 below.

Figure 3: Inverse riffle shuffle

In the inverse riffle shuffle, we assign a bit to each card in the deck uniformly at random (u.a.r.). After all the cards have been assigned a bit, the cards labeled 0 are moved to the top and those labeled 1 to the bottom, preserving the relative order within each group. We then again assign each card a bit u.a.r. and repeat the movement. In effect, if each card were assigned a t-bit string at the very beginning, it would end up at a position in the final deck determined by the lexicographic rank of its string among all the assigned strings, unless of course two cards are assigned the same string, which leads to a collision. Thus, if every card is assigned a different bit-string, the permutation of the deck is completely determined. Moreover, since the strings are generated via random coin flips (or, equivalently, a walk on the hypercube), the resulting permutation is a random permutation. But we are getting ahead of ourselves, because we do not yet know whether analyzing this seemingly easy-to-understand model will do us any good. It turns out that it will.
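A minimal sketch of the inverse riffle shuffle exactly as described (function and variable names are illustrative): assign a uniform random bit to each card and stably move the 0-labeled cards to the top.

# Sketch: one inverse riffle shuffle.  Cards labeled 0 go to the top, cards
# labeled 1 to the bottom, keeping relative order within each group.
import random

def inverse_riffle(deck, rng=random):
    bits = [rng.randrange(2) for _ in deck]
    zeros = [c for c, b in zip(deck, bits) if b == 0]
    ones = [c for c, b in zip(deck, bits) if b == 1]
    return zeros + ones                      # stable partition: 0-labeled cards on top

deck = list(range(1, 53))
for _ in range(9):                           # about (3/2) log2(52) rounds; see the analysis below
    deck = inverse_riffle(deck)
print(deck)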

We now take a little detour and claim that it is enough to analyze the inverse riffle shuffle. To justify this, observe that both the riffle shuffle and the inverse riffle shuffle (indeed, any card-shuffling scheme) are nothing but random walks on the same group (in this case, the symmetric group S_n of all possible permutations of the n cards). We want to argue that for a random walk on this group, ε_t = ε_t^inv, where ε^inv denotes the variation distance from the stationary distribution for the inverse walk.

Let the original random walk on the group (the one corresponding to the riffle shuffle) be specified by a set of generators {g_1, g_2, ..., g_k}, such that at each step a generator is chosen and applied according to a fixed probability distribution µ (so that each generator has non-zero probability of being chosen). The inverse random walk is specified in exactly the same way, but using instead the generators {g_1^{-1}, g_2^{-1}, ..., g_k^{-1}} with the same probabilities (µ(g_i) = µ(g_i^{-1})). Since each element of S_n has a unique inverse (permutation) under the composition operation ∘, there exists, for any given state x, a bijective mapping f between the set of t-step paths starting at x in the original walk and the set of t-step paths starting at x in the inverse walk, namely

      f(x ∘ σ_1 ∘ σ_2 ∘ ... ∘ σ_t) = x ∘ σ_t^{-1} ∘ σ_{t-1}^{-1} ∘ ... ∘ σ_1^{-1}.

This bijection preserves the probabilities of the paths, since the probability distribution over the two sets of generators is the same. Moreover, if two paths reach the same state, i.e. x ∘ σ = x ∘ τ, then by the group property we must have x ∘ σ^{-1} = x ∘ τ^{-1}, so the paths f(x ∘ σ) and f(x ∘ τ) also reach the same state. This implies that f induces another bijective mapping f̄ between the set of states reachable from x in t steps of the original walk and the set of states reachable from x in t steps of the inverse walk, with f̄(x ∘ σ) = x ∘ σ^{-1}. As a result,

      π_x^t(y) = (π^inv)_x^t(f̄(y))      for all y and t,

which is just another way of saying that the distributions π_x^t and (π^inv)_x^t are identical up to a relabeling of the states. And since the stationary distribution π_eq of both the original and the inverse walk is uniform (both transition matrices are doubly stochastic), we conclude that ||π_x^t − π_eq|| = ||(π^inv)_x^t − π_eq||, and hence ε_t = ε_t^inv, as claimed earlier.

Going back to the inverse riffle shuffle, the question that remains is how long each string needs to be so that, if the strings are generated at random, no two of them are the same. Why, that is just the birthday paradox, where we need to figure out the total number of days (here 2^t) from which n people (cards) choose their birthdays (bit-strings) so that there are no collisions. The chance that any particular one of the (n choose 2) potential pairs of cards is assigned the same string is 1/2^t. Thus, by the union bound, the deck has a collision with probability at most (n choose 2) 2^{-t} ≤ n^2 2^{-t}, and setting this equal to the desired distance ε from the random deck gives an upper bound on t, and hence on the number of (inverse) riffle shuffles required:

      ε = n^2 2^{-t}    ⟹    t = log_2(n^2 ε^{-1}) = 2 log_2(n) + log_2(ε^{-1}).

It turns out that the right factor in front of log_2(n) is 3/2, which for n = 52 suggests that we should shuffle a deck of cards about 9 times.
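Plugging numbers into the two estimates above for a 52-card deck (the tolerance ε = 0.05 is an arbitrary illustrative choice):

# Sketch: numbers for a 52-card deck.
import math

n, eps = 52, 0.05
t_union = 2 * math.log2(n) + math.log2(1 / eps)   # birthday/union-bound estimate
t_sharp = 1.5 * math.log2(n)                      # the (3/2) log2(n) rate quoted above
print(f"union-bound estimate : {t_union:.1f} inverse riffle shuffles")
print(f"(3/2) log2(n)        : {t_sharp:.1f} shuffles (about 9 for n = 52)")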