Markov chains (cont'd)

6.867 Machine learning, lecture 19 (Jaakkola)

Lecture topics: Markov chains (cont'd); Hidden Markov Models

Markov chains (cont'd)

In the context of spectral clustering (last lecture) we discussed a random walk over the nodes induced by a weighted graph. Let $W_{ij} \ge 0$ be symmetric weights associated with the edges in the graph; $W_{ij} = 0$ whenever the edge doesn't exist. We also assumed that $W_{ii} = 0$ for all $i$. The graph defines a random walk where the probability of transitioning from node (state) $i$ to node $j$ is given by

$$P(X(t+1) = j \mid X(t) = i) = P_{ij} = \frac{W_{ij}}{\sum_{j'} W_{ij'}} \tag{1}$$

Note that self-transitions (going from $i$ to $i$) are disallowed because $W_{ii} = 0$ for all $i$. We can understand the random walk as a homogeneous Markov chain: the probability of transitioning from $i$ to $j$ depends only on $i$, not on the path that took the process to $i$. In other words, the current state summarizes the past as far as future transitions are concerned. This is the Markov (conditional independence) property:

$$P(X(t+1) = j \mid X(t) = i, X(t-1) = i_{t-1}, \ldots, X(1) = i_1) = P(X(t+1) = j \mid X(t) = i) \tag{2}$$

The term homogeneous specifies that the transition probabilities are independent of time (the same probabilities are used whenever the random walk returns to $i$). We also defined ergodicity as follows: a Markov chain is ergodic if there exists a finite $m$ such that

$$P(X(t+m) = j \mid X(t) = i) > 0 \ \text{ for all } i \text{ and } j \tag{3}$$

Simple weighted graphs need not define ergodic chains. Consider, for example, a weighted graph between two nodes $1 - 2$ where $W_{12} > 0$. The resulting random walk is necessarily periodic, i.e., $121212\ldots$ A Markov chain is ergodic only when all the states are communicating and the chain is aperiodic, which is clearly not the case here. Similarly, even a graph $1 - 2 - 3$ with positive weights on the edges would not define an ergodic Markov chain: every other state would necessarily be 2, so the chain is periodic. The reason here is that $W_{ii} = 0$. By adding positive self-transitions, we can remove periodicity (the random walk would stay in the same state a variable number of steps). Any connected weighted graph with positive weights and positive self-transitions gives rise to an ergodic Markov chain.
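As a concrete illustration (not part of the original notes), here is a minimal sketch of constructing the transition matrix of equation (1) from a symmetric weight matrix and simulating the induced random walk; the weight matrix `W` below is an assumed toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy example: symmetric weights on a 3-node chain 1 - 2 - 3,
# with positive self-transitions added to make the chain aperiodic (ergodic).
W = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 2.0],
              [0.0, 2.0, 1.0]])

# Equation (1): normalize each row so that P[i, j] = W[i, j] / sum_j' W[i, j'].
P = W / W.sum(axis=1, keepdims=True)

# Simulate the homogeneous Markov chain: the next state depends only on the
# current state (the Markov property), via the fixed transition matrix P.
state = 0
walk = [state]
for _ in range(10):
    state = rng.choice(len(P), p=P[state])
    walk.append(state)
print(walk)
```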

Our definition of the random walk so far is a bit incomplete. We did not specify how the process started, i.e., we didn't specify the initial state distribution. Let $q(i)$ be the probability that the random walk is started from state $i$. We will use $q$ as a vector of probabilities across states (reserving $n$ for the number of training examples as usual).

There are two ways of describing Markov chains: through state transition diagrams or as simple graphical models. The descriptions are complementary. A transition diagram is a directed graph over the possible states where the arcs between states specify all allowed transitions (those occurring with non-zero probability). See Figure 1 for examples. We could also add the initial state distribution as transitions from a dedicated initial (null) state (not shown in the figure).

[Figure 1: Examples of transition diagrams defining non-ergodic Markov chains.]

In graphical models, on the other hand, we focus on explicating variables and their dependencies. At each time point the random walk is in a particular state $X(t)$. This is a random variable. Its value is affected only by the random variable $X(t-1)$ specifying the state of the random walk at the previous time point. Graphically, we can therefore write a sequence of random variables where arcs specify how the values of the variables are influenced by others (dependent on others). More precisely, $X(t-1) \to X(t)$ means that the value of $X(t)$ depends on $X(t-1)$. Put another way, in simulating the random walk, we would have to know the value of $X(t-1)$ in order to sample a value for $X(t)$. The graphical model is shown in Figure 2.

State prediction

We will cast the problem of calculating the predictive probabilities over states in a form that will be useful for Hidden Markov Models later on.

[Figure 2: Markov chain as a graphical model: $\cdots \to X(t-1) \to X(t) \to X(t+1) \to \cdots$]

Since

$$P(X(t+m) = j \mid X(t) = i) = [P^m]_{ij} \tag{4}$$

we can also write, for any $n$,

$$P(X(n) = j) = \sum_i q(i)\, P(X(n) = j \mid X(1) = i) = \sum_i q(i)\, [P^{n-1}]_{ij} \tag{5}$$

In vector form, $q^T P^{n-1}$ is a row vector whose $j$th component is $P(X(n) = j)$. Note that the matrix products involve summing over all the intermediate states until $X(n) = j$. More explicitly, the sum over all the states $x_1, \ldots, x_n$ can be evaluated in matrix form as

$$\sum_{x_1, \ldots, x_n} P(X(1) = x_1) \prod_{t=1}^{n-1} P(X(t+1) = x_{t+1} \mid X(t) = x_t) = q^T \underbrace{P P \cdots P}_{n-1 \text{ times}} \mathbf{1} = 1 \tag{6}$$

(where $\mathbf{1}$ is a column vector of ones). This is a sum over $k^n$ possible state configurations (settings of $x_1, \ldots, x_n$) but can be easily performed in terms of matrix products. We can understand this in terms of recursive evaluation of the $t$-step probabilities $\alpha_t(i) = P(X(t) = i)$. We will write $\alpha_t$ for the corresponding column vector, so that

$$q^T \underbrace{P \cdots P}_{t-1 \text{ times}} = \alpha_t^T \tag{7}$$

Clearly,

$$q^T = \alpha_1^T \tag{8}$$

$$\alpha_{t-1}^T P = \alpha_t^T, \quad t > 1, \tag{9}$$

or, component-wise,

$$\sum_i \alpha_{t-1}(i) P_{ij} = \alpha_t(j) \tag{10}$$
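A minimal sketch (added for illustration) of the recursion in equations (8)-(10): the state marginals $\alpha_t$ are propagated by repeated vector-matrix products, using an assumed toy transition matrix `P` and a uniform initial distribution `q`.

```python
import numpy as np

# Assumed toy example: 3 states, uniform initial distribution q.
P = np.array([[0.3, 0.7, 0.0],
              [0.4, 0.2, 0.4],
              [0.0, 0.7, 0.3]])
q = np.full(3, 1.0 / 3.0)

# Equations (8)-(9): alpha_1 = q, alpha_t = alpha_{t-1} P for t > 1.
alpha = q.copy()
for t in range(2, 6):
    alpha = alpha @ P  # row vector times transition matrix
    # alpha[j] now equals P(X(t) = j); it stays a valid distribution.
    assert np.isclose(alpha.sum(), 1.0)
print(alpha)  # P(X(5) = j) for each state j
```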

Estimation

Markov models can be estimated easily from observed sequences of states. Given $x_1, \ldots, x_n$ (e.g., 1212221), the log-likelihood of the observed sequence is given by

$$\log P(x_1, \ldots, x_n) = \log \left[ P(X(1) = x_1) \prod_{t=1}^{n-1} P(X(t+1) = x_{t+1} \mid X(t) = x_t) \right] \tag{11}$$

$$= \log q(x_1) + \sum_{t=1}^{n-1} \log P_{x_t, x_{t+1}} \tag{12}$$

$$= \log q(x_1) + \sum_{i,j} \hat{n}(i,j) \log P_{ij} \tag{13}$$

where $\hat{n}(i,j)$ is the number of observed transitions from $i$ to $j$ in the sequence $x_1, \ldots, x_n$. The resulting maximum likelihood setting of $P_{ij}$ is the empirical fraction

$$\hat{P}_{ij} = \frac{\hat{n}(i,j)}{\sum_{j'} \hat{n}(i,j')} \tag{14}$$

Note that $q(i)$ can only be reliably estimated from multiple observed sequences. For example, based on $x_1, \ldots, x_n$ alone, we would simply set $\hat{q}(i) = \delta(i, x_1)$, which is hardly accurate (sample size one). Regularization is useful here, as before.
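As an illustrative sketch (not from the original notes), the maximum likelihood estimate of equation (14) amounts to counting transitions and normalizing each row; the example sequence mirrors the 1212221 example above, with states relabeled 0 and 1.

```python
import numpy as np

def estimate_transitions(seq, k):
    """Maximum likelihood transition matrix (eq. 14) from one state sequence."""
    counts = np.zeros((k, k))
    for i, j in zip(seq[:-1], seq[1:]):
        counts[i, j] += 1.0  # n_hat(i, j): observed transitions from i to j
    # Normalize each row; rows with no outgoing transitions are left uniform.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / k)

# The 1212221 example from the text, 0-indexed.
seq = [0, 1, 0, 1, 1, 1, 0]
print(estimate_transitions(seq, k=2))
```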

Hidden Markov Models

Hidden Markov Models (HMMs) extend Markov models by assuming that the states of the Markov chain are not observed directly, i.e., the Markov chain itself remains hidden. We therefore also model how the states relate to the actual observations. This assumption of a simple Markov model underlying a sequence of observations is very useful in many practical contexts and has made HMMs very popular models of sequence data, from speech recognition to bio-sequences. For example, to a first approximation, we may view words in speech as Markov sequences of phonemes. Phonemes are not observed directly, however, but have to be related to acoustic observations. Similarly, in modeling protein sequences (sequences of amino acid residues), we may, again approximately, describe a protein molecule as a Markov sequence of structural characteristics. The structural features are typically not observable, only the actual residues.

We can understand HMMs by combining mixture models and Markov models. Consider the simple example in Figure 3 over four discrete time points $t = 1, 2, 3, 4$. The figure summarizes multiple sequences of observations $y_1, \ldots, y_4$, where each observation sequence corresponds to a single value $y_t$ per time point. Let's begin by ignoring the time information and instead collapse the observations across the time points. The observations form two clusters that are now well modeled by a two-component mixture:

$$P(y) = \sum_{j=1}^{2} P(j) P(y \mid j) \tag{15}$$

where, e.g., $P(y \mid j)$ could be a Gaussian $N(y; \mu_j, \sigma_j^2)$. By collapsing the observations we are effectively modeling the data at each time point with the same mixture model. If we generate data from the resulting mixture model, we select a mixture component at random at each time step and generate the observation from the corresponding component (cluster). There's nothing that ties the selection of mixture components in time, so samples from the mixture yield phantom clusters at successive time points (we select the wrong component/cluster with equal probability). By omitting the time information, we therefore place half of the probability mass in locations with no data. Figure 4 illustrates the mixture model as a graphical model.

[Figure 3: a) Example data over four time points; b) actual data and ranges of samples generated from a mixture model (red ovals) estimated without time information.]

The solution is to model the selection of the mixture components as a Markov model, i.e., the component at $t = 2$ is selected on the basis of the component used at $t = 1$. Put another way, each state in the Markov model now uses one of the components in the mixture model to generate the corresponding observation. As a graphical model, the HMM is a combination of the two, as shown in Figure 5.
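A small illustrative sketch (an assumption of this write-up, not part of the notes): sampling component labels independently at each time point, versus chaining them through a sticky transition matrix, shows why the time-collapsed mixture puts mass on "phantom" combinations that the data never exhibits.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([-2.0, 2.0])        # two assumed Gaussian components
P = np.array([[0.95, 0.05],          # sticky transitions keep the component
              [0.05, 0.95]])         # selection consistent across time

def sample_mixture(T):
    # Time-collapsed mixture: component drawn independently at each t.
    z = rng.integers(0, 2, size=T)
    return means[z] + 0.3 * rng.standard_normal(T)

def sample_hmm(T):
    # HMM: component at time t depends on the component at time t-1.
    z = [rng.integers(0, 2)]
    for _ in range(T - 1):
        z.append(rng.choice(2, p=P[z[-1]]))
    return means[np.array(z)] + 0.3 * rng.standard_normal(T)

print(sample_mixture(4))  # signs flip freely: "phantom" combinations
print(sample_hmm(4))      # signs tend to persist, matching clustered data
```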

[Figure 4: A graphical model view of the mixture model over the four time points: $X(1), X(2), X(3), X(4)$ above $Y(1), Y(2), Y(3), Y(4)$, with no arcs between the $X(t)$. The variables are indexed by time (different samples would be drawn at each time point) but the parameters are shared across the four time points. $X(t)$ refers to the selection of the mixture component while $Y(t)$ refers to the observations.]

Probability model

One advantage of representing the HMM as a graphical model is that we can easily write down the joint probability distribution over all the variables. The graph explicates how the variables depend on each other (who influences whom) and thus highlights which conditional probabilities we need to write down:

$$P(x_1, \ldots, x_n, y_1, \ldots, y_n) = P(x_1) P(y_1 \mid x_1) P(x_2 \mid x_1) P(y_2 \mid x_2) \cdots \tag{16}$$

$$= P(x_1) P(y_1 \mid x_1) \prod_{t=1}^{n-1} \left[ P(x_{t+1} \mid x_t) P(y_{t+1} \mid x_{t+1}) \right] \tag{17}$$

$$= q(x_1) P(y_1 \mid x_1) \prod_{t=1}^{n-1} \left[ P_{x_t, x_{t+1}} P(y_{t+1} \mid x_{t+1}) \right] \tag{18}$$

where we have used the same notation as before for the Markov chains.

[Figure 5: HMM as a graphical model: $X(1) \to X(2) \to X(3) \to X(4)$, with each $X(t) \to Y(t)$. It is a Markov model where each state is associated with a distribution over observations. Alternatively, we can view it as a mixture model where the mixture components are selected in a time-dependent manner.]
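A minimal sketch (illustrative, with assumed toy parameters) of evaluating the joint probability in equation (18) for one state/observation sequence: multiply the initial probability, the transition probabilities, and the per-state emission probabilities.

```python
import numpy as np

# Assumed toy HMM with 2 hidden states and 3 discrete observation symbols.
q = np.array([0.6, 0.4])              # initial state distribution q(i)
P = np.array([[0.7, 0.3],             # transition matrix P_ij
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],        # emission probabilities P(y | x)
              [0.1, 0.3, 0.6]])

def joint_prob(x, y):
    """Eq. (18): q(x1) P(y1|x1) * prod_t P_{x_t,x_{t+1}} P(y_{t+1}|x_{t+1})."""
    p = q[x[0]] * E[x[0], y[0]]
    for t in range(len(x) - 1):
        p *= P[x[t], x[t + 1]] * E[x[t + 1], y[t + 1]]
    return p

print(joint_prob(x=[0, 0, 1], y=[0, 1, 2]))
```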

Three problems to solve

We typically have to be able to solve the following three problems in order to use these models effectively:

1. Evaluate the probability of observed data, i.e.,

$$P(y_1, \ldots, y_n) = \sum_{x_1, \ldots, x_n} P(x_1, \ldots, x_n, y_1, \ldots, y_n) \tag{19}$$

2. Find the most likely hidden state sequence $x_1^*, \ldots, x_n^*$ given observations $y_1, \ldots, y_n$, i.e.,

$$\{x_1^*, \ldots, x_n^*\} = \arg\max_{x_1, \ldots, x_n} P(x_1, \ldots, x_n, y_1, \ldots, y_n) \tag{20}$$

3. Estimate the parameters of the model from multiple observation sequences $y_1^{(l)}, \ldots, y_n^{(l)}$, $l = 1, \ldots, L$.

Problem 1

As in the context of Markov chains, we can efficiently sum over the possible hidden state sequences. Here the summation means evaluating $P(y_1, \ldots, y_n)$. We will perform this in two ways depending on whether the recursion moves forward in time, computing $\alpha_t(j)$, or backward in time, evaluating $\beta_t(i)$. The only change from before is the fact that whatever state we happen to visit at time $t$, we will also have to generate the observation $y_t$ from that state. This additional requirement of generating the observations can be included via diagonal matrices

$$D_y = \begin{bmatrix} P(y \mid 1) & & 0 \\ & \ddots & \\ 0 & & P(y \mid k) \end{bmatrix} \tag{21}$$

So, for example,

$$q^T D_{y_1} \mathbf{1} = \sum_i q(i) P(y_1 \mid i) = P(y_1) \tag{22}$$

Similarly,

$$q^T D_{y_1} P D_{y_2} \mathbf{1} = \sum_i q(i) P(y_1 \mid i) \sum_{j=1}^{k} P_{ij} P(y_2 \mid j) = P(y_1, y_2) \tag{23}$$

We can therefore write the forward and backward algorithms as methods that perform the matrix multiplications in

$$q^T D_{y_1} P D_{y_2} P \cdots P D_{y_n} \mathbf{1} = P(y_1, \ldots, y_n) \tag{24}$$

either in the forward or backward direction.
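As an illustration (with the same assumed toy parameters as above), equation (24) can be evaluated directly as a chain of matrix products; the helper `D(y)` builds the diagonal emission matrix of equation (21).

```python
import numpy as np

q = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

def D(y):
    # Equation (21): diagonal matrix with P(y | state) on the diagonal.
    return np.diag(E[:, y])

def likelihood(ys):
    # Equation (24): q^T D_{y1} P D_{y2} ... P D_{yn} 1 = P(y_1, ..., y_n).
    v = q @ D(ys[0])
    for y in ys[1:]:
        v = v @ P @ D(y)
    return v.sum()  # multiplying by the all-ones vector sums over X(n)

print(likelihood([0, 1, 2]))
```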

In terms of the forward pass algorithm:

$$q^T D_{y_1} = \alpha_1^T \tag{25}$$

$$\alpha_{t-1}^T P D_{y_t} = \alpha_t^T, \quad \text{or equivalently} \tag{26}$$

$$\left[ \sum_i \alpha_{t-1}(i) P_{ij} \right] P(y_t \mid j) = \alpha_t(j) \tag{27}$$

These values are exactly $\alpha_t(j) = P(y_1, \ldots, y_t, X(t) = j)$ since we have generated all the observations up to and including $y_t$ and have summed over all the states except for the last one, $X(t)$. The backward pass algorithm is similarly defined:

$$\beta_n = \mathbf{1} \tag{28}$$

$$\beta_t = P D_{y_{t+1}} \beta_{t+1}, \quad \text{or equivalently} \tag{29}$$

$$\beta_t(i) = \sum_{j=1}^{k} P_{ij} P(y_{t+1} \mid j) \beta_{t+1}(j) \tag{30}$$

In this case $\beta_t(i) = P(y_{t+1}, \ldots, y_n \mid X(t) = i)$ since we have summed over all the possible values of the state variables $X(t+1), \ldots, X(n)$, starting from a fixed $X(t) = i$, and the first observation we have generated in the recursion is $y_{t+1}$. By combining the two recursions we can finally evaluate

$$P(y_1, \ldots, y_n) = \alpha_t^T \beta_t = \sum_i \alpha_t(i) \beta_t(i) \tag{31}$$

which holds for any $t = 1, \ldots, n$. You can understand this result in two ways: either in terms of performing the remaining matrix multiplication corresponding to the two parts

$$P(y_1, \ldots, y_n) = \underbrace{(q^T D_{y_1} P \cdots P D_{y_t})}_{\alpha_t^T} \underbrace{(P D_{y_{t+1}} \cdots P D_{y_n} \mathbf{1})}_{\beta_t} \tag{32}$$

or as an illustration of the Markov property:

$$P(y_1, \ldots, y_n) = \sum_i \underbrace{P(y_1, \ldots, y_t, X(t) = i)}_{\alpha_t(i)} \underbrace{P(y_{t+1}, \ldots, y_n \mid X(t) = i)}_{\beta_t(i)} \tag{33}$$

Also, since $\beta_n(i) = 1$ for all $i$, clearly

$$P(y_1, \ldots, y_n) = \sum_i \alpha_n(i) = \sum_i P(y_1, \ldots, y_n, X(n) = i) \tag{34}$$
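A compact sketch (illustrative, same assumed toy HMM as above) of the forward and backward recursions in equations (25)-(30), checking that $\sum_i \alpha_t(i)\beta_t(i)$ in equation (31) gives the same likelihood for every $t$.

```python
import numpy as np

q = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

def forward_backward(ys):
    n, k = len(ys), len(q)
    alpha = np.zeros((n, k))
    beta = np.ones((n, k))
    alpha[0] = q * E[:, ys[0]]                 # eq (25): alpha_1 = q^T D_{y1}
    for t in range(1, n):                      # eq (26): alpha_t = alpha_{t-1} P D_{yt}
        alpha[t] = (alpha[t - 1] @ P) * E[:, ys[t]]
    for t in range(n - 2, -1, -1):             # eq (29): beta_t = P D_{y_{t+1}} beta_{t+1}
        beta[t] = P @ (E[:, ys[t + 1]] * beta[t + 1])
    return alpha, beta

alpha, beta = forward_backward([0, 1, 2])
# Eq (31): the combined product gives the same likelihood for every t.
print([(alpha[t] * beta[t]).sum() for t in range(3)])
```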