Dimension Reduction. Wei-Ta Chu 2014/10/22. Multimedia Content Analysis, CSIE, CCU




1 Dimension Reduction Wei-Ta Chu 2014/10/22

2 1.1 Principal Component Analysis (PCA) Widely used for dimensionality reduction, lossy data compression, feature extraction, and data visualization. Also known as the Karhunen-Loève transform. Two commonly used definitions: the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized, or the linear projection that minimizes the average projection cost (the mean squared distance between the data points and their projections). C.M. Bishop, Chapter 12 of Pattern Recognition and Machine Learning, Springer, 2006.

Maximum Variance Formulation 3 Data set of observations {x_n} with dimensionality D. Goal: project the data onto a space of dimensionality M < D while maximizing the variance of the projected data. Assume the value of M is given. Begin with M=1. Data are projected onto a line in the D-dimensional space; the direction of the line is denoted by a D-dimensional unit vector u_1. Each data point x_n is then projected onto the scalar value u_1^T x_n.

LA Recap: Orthogonal Projection 4 proj_a u = ((u·a)/‖a‖²) a (vector component of u along a); u − proj_a u = u − ((u·a)/‖a‖²) a (vector component of u orthogonal to a); ‖proj_a u‖ = (u·a)/‖a‖ = ‖u‖ cos θ

Maximum Variance Formulation 5 The mean of the projected data is u_1^T x̄, where x̄ = (1/N) Σ_n x_n is the sample mean. The variance of the projected data is given by (1/N) Σ_n (u_1^T x_n − u_1^T x̄)² = u_1^T S u_1, where S is the covariance matrix defined by S = (1/N) Σ_n (x_n − x̄)(x_n − x̄)^T.

Maximum Variance Formulation 6 Maximize the projected variance u_1^T S u_1 with respect to u_1, subject to the normalization constraint u_1^T u_1 = 1. Introduce a Lagrange multiplier denoted by λ_1 and maximize u_1^T S u_1 + λ_1 (1 − u_1^T u_1). Setting the derivative with respect to u_1 equal to zero gives S u_1 = λ_1 u_1, so u_1 must be an eigenvector of S, and the projected variance equals u_1^T S u_1 = λ_1. The variance will therefore be a maximum when we set u_1 equal to the eigenvector having the largest eigenvalue λ_1.

Maximum Variance Formulation 7 The optimal linear projection for which the variance of the projected data is maximized is defined by the M eigenvectors u_1, …, u_M of the data covariance matrix S corresponding to the M largest eigenvalues λ_1, …, λ_M. Principal component analysis thus involves evaluating the mean and the covariance matrix of the data set and then finding the M eigenvectors of S corresponding to the M largest eigenvalues.
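As a concrete illustration of this procedure, here is a minimal NumPy sketch (not from the slides; function and variable names are my own) that centers the data, forms the covariance matrix S, and projects onto the eigenvectors with the M largest eigenvalues:

```python
import numpy as np

def pca(X, M):
    """Minimal PCA sketch: project X onto its M leading principal directions."""
    x_bar = X.mean(axis=0)                 # sample mean
    Xc = X - x_bar                         # centered data
    S = np.cov(Xc, rowvar=False)           # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest eigenvalues
    U = eigvecs[:, order]                  # principal directions u_1, ..., u_M
    return Xc @ U, U, eigvals[order]       # projected data, directions, variances

# Example with synthetic 5-D data reduced to M = 2 dimensions
X = np.random.randn(200, 5)
Z, U, lam = pca(X, 2)
print(Z.shape, lam)                        # (200, 2) and the two largest eigenvalues
```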

Covariance 8 High variance, low covariance: no inter-dimension dependency. High variance, high covariance: inter-dimension dependency.

Minimum Error Formulation 9 Each data point can be represented exactly by a linear combination of a complete set of D basis vectors. Our goal is to approximate each data point using a representation involving a restricted number M < D of variables, corresponding to a projection onto a lower-dimensional (M-dimensional) subspace.

Minimum Error Formulation 10 Minimize the approximation error J. The minimum value of J is obtained by choosing, as the discarded directions, the eigenvectors having the D−M smallest eigenvalues; hence the eigenvectors defining the principal subspace are those corresponding to the M largest eigenvalues. L.I. Smith, A tutorial on Principal Component Analysis, http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf; J. Shlens, A tutorial on Principal Component Analysis, http://www.cs.cmu.edu/~elaw/papers/pca.pdf

Applications of PCA 11 Mean vector and the first four PCA eigenvectors for the off-line digits data set. Eigenvalue spectrum and the sum of the discarded eigenvalues. An original example together with its PCA reconstructions obtained by retaining M principal components.

Eigenfaces 12 Eigenfaces for face recognition is a famous application of PCA Eigenfaces capture the majority of variance in face data Project a face on those eigenfaces to represent face features M. Turk and A.P. Pentland, Face recognition using eigenfaces, Proc. of CVPR, pp. 586-591, 1991.

13 1.2 Singular Value Decomposition (SVD) SVD works directly on the data, whereas PCA works on the covariance matrix of the data. The SVD technique examines the entire data set and rotates the axes to maximize variance along the first few dimensions. Problems: #1: find concepts in text; #2: reduce dimensionality. http://www.cs.cmu.edu/~guestrin/class/10701-s06/handouts/recitations/recitation-pca_svd.ppt

SVD - Definition 14 A[n x m] = U[n x r] L[r x r] (V[m x r])^T. A: n x m matrix (e.g., n documents, m terms); U: n x r matrix (n documents, r concepts); L: r x r diagonal matrix (strength of each concept); V: m x r matrix (m terms, r concepts); r: rank of the matrix.
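A minimal NumPy sketch of this decomposition, using the 7x5 document-term matrix from the example on the following slides (variable names are my own; signs of the columns of U and V may differ from the slides, since the SVD is only unique up to sign):

```python
import numpy as np

# Document-term matrix from the slides: 4 CS documents and 3 MD documents,
# terms = (data, inf., retrieval, brain, lung)
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ np.diag(s) @ Vt
print(np.round(s, 2))         # singular values: about [9.64, 5.29, 0, 0, 0]
print(np.round(U[:, :2], 2))  # document-to-concept similarities (first two concepts)
print(np.round(Vt[:2], 2))    # term-to-concept similarities
```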

SVD - Properties 15 Spectral decomposition of the matrix: the example matrix
[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]
factors as [u_1 u_2] x diag(l_1, l_2) x [v_1 v_2]^T, i.e., a weighted sum of rank-1 terms l_1 u_1 v_1^T + l_2 u_2 v_2^T.

SVD - Interpretation 16 Documents, terms, and concepts: U is the document-to-concept similarity matrix; V is the term-to-concept similarity matrix; the diagonal elements of L give the strength of each concept. Projection: the best axis to project on ("best" = minimum sum of squares of projection errors).

SVD - Example 17 A = U L V^T - example (terms: data, inf., retrieval, brain, lung; rows: CS documents, then MD documents):
A = [ 1 1 1 0 0
      2 2 2 0 0
      1 1 1 0 0
      5 5 5 0 0
      0 0 0 2 2
      0 0 0 3 3
      0 0 0 1 1 ]
U (doc-to-concept similarity matrix; columns = CS-concept, MD-concept) =
    [ 0.18 0
      0.36 0
      0.18 0
      0.90 0
      0    0.53
      0    0.80
      0    0.27 ]
L = [ 9.64 0
      0    5.29 ]
V^T = [ 0.58 0.58 0.58 0    0
        0    0    0    0.71 0.71 ]

SVD - Example 18 A = U L V^T - example (same decomposition as above): the diagonal entries of L give the strength of each concept, e.g., 9.64 is the strength of the CS-concept and 5.29 the strength of the MD-concept.

SVD - Example 19 A = U L V^T - example (same decomposition as above): the rows of V^T form the term-to-concept similarity matrix; the first row (0.58 0.58 0.58 0 0) is the CS-concept and the second row (0 0 0 0.71 0.71) is the MD-concept.

SVD - Dimensionality reduction 20 Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero, starting from the full decomposition A = U L V^T shown above.

SVD - Dimensionality reduction 21 With the smaller singular value set to zero, A ≈ [0.18 0.36 0.18 0.90 0 0 0]^T x 9.64 x [0.58 0.58 0.58 0 0].

SVD - Dimensionality reduction 22 The resulting rank-1 approximation keeps the CS block essentially unchanged and zeroes out the MD block:
A ≈ [ 1 1 1 0 0
      2 2 2 0 0
      1 1 1 0 0
      5 5 5 0 0
      0 0 0 0 0
      0 0 0 0 0
      0 0 0 0 0 ]
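Continuing the NumPy sketch from the SVD definition above (reusing A, U, s, Vt), keeping only the largest singular value reproduces this rank-1 approximation:

```python
k = 1                                            # number of singular values to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # smaller singular values set to zero
print(np.round(A_k, 1))                          # CS block preserved, MD block zeroed
```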

2.1 Multidimensional Scaling (MDS) 23 Goal: represent data points in some lower-dimensional space such that the distances between points in that space correspond to the distances between points in the original space. http://www.analytictech.com/networks/mds.htm

Multidimensional Scaling (MDS) 24 MDS finds a set of vectors in a p-dimensional space such that the matrix of Euclidean distances among them corresponds as closely as possible to some function of the input matrix, according to a criterion function called stress. Stress: the degree of correspondence between the distances among points implied by the MDS map and the input matrix; a common form is stress = sqrt( Σ_ij (d_ij − z_ij)² / Σ_ij d_ij² ). d_ij refers to the distance between points i and j in the original space; z_ij refers to the distance between points i and j on the map.

Multidimensional Scaling (MDS) 25 The true dimensionality of the data will be revealed by the rate of decline of stress as dimensionality increases.

Multidimensional Scaling (MDS) 26 Algorithm: (1) Assign points to arbitrary coordinates in p-dimensional space. (2) Compute Euclidean distances among all pairs of points to form a matrix. (3) Compare this matrix with the input matrix by evaluating the stress function; the smaller the value, the greater the correspondence between the two. (4) Adjust the coordinates of each point in the direction that most reduces the stress. (5) Repeat steps 2 through 4 until the stress won't get any lower. T.F. Cox and M.A.A. Cox, Multidimensional Scaling, Chapman & Hall/CRC, 2nd edition, 2000.
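A minimal NumPy sketch of this iterative procedure (my own simplified version: it reduces the raw squared stress Σ_ij (d_ij − z_ij)² by plain gradient descent; the function name, learning rate, and iteration count are illustrative):

```python
import numpy as np

def mds(D, p=2, lr=0.05, iters=1000, seed=0):
    """Iterative MDS sketch. D is an (n, n) symmetric matrix of input distances."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(size=(n, p))                 # step 1: arbitrary coordinates
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]    # pairwise coordinate differences
        Z = np.linalg.norm(diff, axis=-1)       # step 2: map distances z_ij
        np.fill_diagonal(Z, 1.0)                # avoid division by zero on the diagonal
        # steps 3-4: gradient of sum_ij (z_ij - d_ij)^2 with respect to each point
        grad = ((Z - D) / Z)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1) / n          # move each point to reduce the stress
    return Y
```

Production implementations (e.g., sklearn.manifold.MDS) use the more robust SMACOF stress-majorization iteration rather than plain gradient descent.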

27 2.2 Isometric Feature Mapping (Isomap) Examples J.B. Tenenbaum, V. de Silva, and J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, vol. 290, pp. 2319-2323, 2000.

Isometric Feature Mapping (Isomap) 28 Estimate the geodesic distance between far-away points, given only input-space distances, by adding up a sequence of short hops between neighboring points.

Isometric Feature Mapping (Isomap) 29 Algorithm. Step 1: construct the neighborhood graph: determine which points are neighbors on the manifold, connecting each point to all points within some fixed radius ε, or to its K nearest neighbors. Step 2: compute shortest paths: estimate the geodesic distance between all pairs of points on the manifold by computing their shortest paths in the graph. Step 3: construct the d-dimensional embedding: apply MDS to the matrix of graph distances to construct an embedding of the data.
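A sketch of the three steps, assuming the mds() helper from the MDS sketch above and a connected neighborhood graph (function and parameter names are my own; a complete implementation is available as sklearn.manifold.Isomap):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, d=2):
    # Step 1: K-nearest-neighbor graph, edges weighted by Euclidean distance
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: geodesic distances approximated by shortest paths in the graph
    # (if the graph is disconnected, some entries will be infinite)
    D_geo = shortest_path(G, method='D', directed=False)
    # Step 3: embed the geodesic distance matrix with MDS
    return mds(D_geo, p=d)
```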

30 Isometric Feature Mapping (Isomap)

2.3 Locally Linear Embedding (LLE) 31 Eliminate the need to estimate pairwise distances between widely separated data points. LLE recovers global nonlinear structure from locally linear fits. S.T. Roweis and L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, vol. 290, pp. 2323-2326, 2000. http://www.cs.toronto.edu/~roweis/lle/publications.html

Locally Linear Embedding (LLE) 32 Characterize the local geometry by linear coefficients that reconstruct each data point from its neighbors, minimizing the reconstruction error. Then choose d-dimensional coordinates Y_i to minimize the embedding cost function.
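As an illustration of the first step, the following sketch (my own; the regularization constant is illustrative) computes the reconstruction weights of one point from a given list of neighbor indices; sklearn.manifold.LocallyLinearEmbedding provides the complete algorithm including the final embedding step.

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """Weights w solving min_w ||x_i - sum_j w_j x_j||^2 subject to sum_j w_j = 1."""
    Z = X[neighbors] - X[i]                          # neighbors shifted so x_i is the origin
    C = Z @ Z.T                                      # local Gram (covariance) matrix
    C += reg * np.trace(C) * np.eye(len(neighbors))  # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(len(neighbors)))  # solve C w = 1
    return w / w.sum()                               # enforce the sum-to-one constraint
```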

Example 33 The bottom images correspond to points along the top-right path, illustrating one particular mode of variability in pose and expression.

34 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2014/10/22

Outline 35 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM)

Overview 36 Any computer program that can improve its performance at some task through experience (or training) can be called a learning program. In the early days, computer scientists developed learning algorithms based on heuristics and insights into human reasoning mechanisms, e.g., decision trees. Neuroscientists attempted to devise learning methods by imitating the structure of human brains. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis, Springer, 2007

Basic Statistical Learning Problems 37 Many learning tasks can be formulated as one of the following two problems. Regression: X is the input variable and Y the output variable. Infer a function f(x) so that, given a value x of the input variable X, ŷ = f(x) is a good prediction of the true value y of the output variable Y.

Basic Statistical Learning Problems 38 Classification: Assume that a random variable X can belong to one of a finite set of classes C={1,2,…,K}. Given the value x of variable X, infer its class label l=g(x), where l ∈ C. It is also of great interest to estimate the probability P(k|x) that X belongs to class k, for k = 1,…,K. In fact both the regression and classification problems can be formulated using the same framework.

39 Categorizations of Machine Learning Techniques Unsupervised vs. Supervised For inferring the functions f(x) and g(x), if pairs of training data (x_i, y_i) or (x_i, l_i), i = 1,…,N, are available, then the inference process is called supervised learning. Most regression methods are supervised learning. Unsupervised methods strive to automatically partition a given data set into a predefined number of clusters; this is also called clustering.

40 Categorizations of Machine Learning Techniques Generative Models vs. Discriminative Models Discriminative models strive to learn P(k|x) directly from the training set without attempting to model the observation x. Generative models compute P(k|x) by first modeling the class-conditional probabilities P(x|k) as well as the class probabilities P(k), via Bayes' theorem: P(k|x) ∝ P(x|k) P(k), i.e., posterior prob. ∝ likelihood × prior prob.

41 Categorizations of Machine Learning Techniques Generative models: Naïve Bayes, Bayesian Networks, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Discriminative models: Neural Networks, Support Vector Machines (SVM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF),

42 Categorizations of Machine Learning Techniques Models for Simple Data vs. Models for Complex Data Complex data consist of sub-entities that are strongly related to one another, e.g., a beach scene is usually composed of a blue sky on top, an ocean in the middle, and a sand beach at the bottom. For simple data: Naïve Bayes, GMM, NN, SVM. For complex data: BN, HMM, MEM, CRF, M3-net.

43 Categorizations of Machine Learning Techniques Model Identification vs. Model Prediction Model identification: to discover an existing Law of Nature. The model identification paradigm is an ill-posed problem and suffers from the curse of dimensionality. The goal of model prediction is to predict events well, but not necessarily through identification of the model of the events.

44 Gaussian Mixture Model Wei-Ta Chu 2014/10/22

Introduction 45 By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combinations, almost any continuous density can be approximated to arbitrary accuracy. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

Introduction 46 Consider a superposition of K Gaussian densities, p(x) = Σ_{k=1}^{K} π_k N(x|μ_k, Σ_k). Each Gaussian density N(x|μ_k, Σ_k) is called a component of the mixture and has its own mean μ_k and covariance Σ_k; the parameters π_k are the mixing coefficients.

Introduction 47 From the sum and product rules, the marginal density is given by p(x) = Σ_{k=1}^{K} p(k) p(x|k) = Σ_{k=1}^{K} π_k N(x|μ_k, Σ_k). We can view π_k = p(k) as the prior probability of picking the kth component, and the density N(x|μ_k, Σ_k) = p(x|k) as the probability of x conditioned on k.

Introduction 48 From Bayes' theorem, the posterior probability p(k|x) is given by p(k|x) = π_k N(x|μ_k, Σ_k) / Σ_j π_j N(x|μ_j, Σ_j). The Gaussian mixture distribution is governed by the parameters π, μ, and Σ. One way to set these parameters is to use maximum likelihood, i.e., maximize the likelihood p(X|π, μ, Σ), assuming that the data points are independent and identically distributed.

Introduction 49 In the case of a single variable x, the Gaussian distribution has the form N(x|μ, σ²) = (1/(2πσ²)^{1/2}) exp{−(x−μ)²/(2σ²)}. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form N(x|μ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2)(x−μ)^T Σ^{-1} (x−μ)}.
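For reference, the multivariate density can be evaluated directly with SciPy (the values of μ, Σ, and x below are purely illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                    # mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])               # covariance matrix
x = np.array([1.0, -0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # N(x | mu, Sigma)
```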

Maximizing Likelihood 50 Setting the derivative of the log likelihood ln p(X|π, μ, Σ) with respect to the means μ_k of the Gaussian components to zero gives μ_k = (1/N_k) Σ_n γ(z_nk) x_n, where N_k = Σ_n γ(z_nk). The mean of the kth Gaussian component is thus obtained by taking a weighted mean of all of the points in the data set, in which the weighting factor for data point x_n is given by the posterior probability (responsibility) γ(z_nk) = p(k|x_n).

Maximizing Likelihood 51 Setting the derivative of the log likelihood with respect to the covariance Σ_k of the Gaussian components to zero gives Σ_k = (1/N_k) Σ_n γ(z_nk)(x_n − μ_k)(x_n − μ_k)^T, i.e., each data point is weighted by the corresponding posterior probability.

Maximizing Likelihood 52 Maximize the log likelihood with respect to the mixing coefficients π_k, with the constraint that the mixing coefficients sum to one. Using a Lagrange multiplier λ, maximize the quantity ln p(X|π, μ, Σ) + λ(Σ_k π_k − 1). If we multiply both sides of the resulting stationarity condition by π_k and sum over k, making use of the constraint, we find λ = −N. Using this to eliminate λ and rearranging, we obtain π_k = N_k / N: the mixing coefficient of the kth component is given by the average responsibility which that component takes for explaining the data points.

53 Expectation-Maximization (EM) Algorithm We first choose some initial values for the means, covariances, and mixing coefficients. Expectation step (E step): use the current parameters to evaluate the posterior probabilities (responsibilities). Maximization step (M step): re-estimate the means, covariances, and mixing coefficients using these responsibilities. Each update to the parameters resulting from an E step followed by an M step is guaranteed to increase the log likelihood function.
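A compact NumPy sketch of these EM updates for a GMM (my own illustrative implementation with hypothetical names; it assumes the data matrix X has one observation per row and omits convergence checks and numerical safeguards beyond a small covariance regularizer):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initial values for means, covariances, and mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk) = p(k | x_n)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma
```

Usage is simply, e.g., pi, mu, Sigma, gamma = gmm_em(X, K=3).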

Example 54

EM for Gaussian Mixtures 55

Case Study 56 Jiang, et al. A new method to segment playfield and its applications in match analysis in sports video, In Proc. of ACM MM, pp. 292-295, 2004.

Case Study 57 The conditional density of a pixel belonging to the playfield region is modeled with a mixture of M Gaussian densities.

Related Resources 58 GMMBAYES - Bayesian Classifier and Gaussian Mixture Model ToolBox http://www.it.lut.fi/project/gmmbayes/downloads/src/gmmbayestb/ Netlab http://www.ncrg.aston.ac.uk/netlab/index.php Matlab toolboxes collection http://stommel.tamu.edu/~baum/toolboxes.html

59 Hidden Markov Model Wei-Ta Chu 2014/10/22

Markov Model 60 Chain rule: p(x_1, …, x_N) = Π_n p(x_n | x_1, …, x_{n−1}). If we assume that each of the conditional distributions is independent of all previous observations except the most recent, we obtain a first-order Markov chain: p(x_1, …, x_N) = p(x_1) Π_{n=2}^{N} p(x_n | x_{n−1}). A second-order Markov chain instead conditions on the two most recent observations. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

Example 61 What's the probability that the weather for eight consecutive days is sun-sun-sun-rain-rain-sun-cloudy-sun? [State-transition diagram of a three-state weather Markov model with states rain, sunny, and cloudy and their transition probabilities.] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993; L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

Coin-Toss Model 62 You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain is another person who is performing a coin-tossing experiment (using one or more coins). The person will not tell you which coin he selects at any time; he will only tell you the result of each coin flip. A typical observation sequence would be a string of heads and tails. The question is: how do we build a model to explain the observed sequence of heads and tails?

Coin-Toss Model 63 1-coin model (Observable Markov Model): two states, Head and Tail, with transition probabilities P(H) and 1−P(H). 2-coin model (Hidden Markov Model): two hidden states with transition probabilities a_11, 1−a_11, a_22, 1−a_22; state 1 emits P(H)=P_1, P(T)=1−P_1 and state 2 emits P(H)=P_2, P(T)=1−P_2.

Coin-Toss Model 64 3-coin model (Hidden Markov Model): three fully connected hidden states with transition probabilities a_ij (i, j = 1, 2, 3); state i emits P(H)=P_i and P(T)=1−P_i.

Elements of HMM 65 N: the number of states in the model. M: the number of distinct observation symbols per state. The state-transition probability distribution A={a_ij}. The observation symbol probability distribution B={b_j(k)}. The initial state distribution π={π_i}. To describe an HMM, we usually use the compact notation λ = (A, B, π).

Three Basic Problems of HMM 66 Problem 1: Probability Evaluation How do we compute the probability that the observed sequence was produced by the model? Scoring how well a given model matches a given observation sequence.

Three Basic Problems of HMM 67 Problem 2: Optimal State Sequence Attempt to uncover the hidden part of the model, that is, to find the correct state sequence. For practical situations, we usually use an optimality criterion to solve this problem as well as possible.

Three Basic Problems of HMM 68 Problem 3: Parameter Estimation Attempt to optimize the model parameters to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence because it is used to train the HMM.

Solution to Problem 1 69 There are N^T possible state sequences. Consider one fixed state sequence q = (q_1 q_2 … q_T). The probability of the observation sequence O = (o_1 o_2 … o_T) given this state sequence is P(O|q, λ) = Π_{t=1}^{T} P(o_t|q_t, λ), where we have assumed statistical independence of observations; thus we get P(O|q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) ··· b_{q_T}(o_T).

Solution to Problem 1 70 The probability of such a state sequence can be written as P(q|λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ··· a_{q_{T−1} q_T}. The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is simply the product of the above terms: P(O, q|λ) = P(O|q, λ) P(q|λ). The probability of O is obtained by summing this joint probability over all possible state sequences q, giving P(O|λ) = Σ_q π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) ··· a_{q_{T−1} q_T} b_{q_T}(o_T).

Solution to Problem 1 71 The Forward Procedure. Define α_t(i) = P(o_1, o_2, …, o_t, q_t = i | λ), the probability of the partial observation sequence o_1, o_2, …, o_t (until time t) and state i at time t, given the model. We solve for it inductively: 1. Initialization: α_1(i) = π_i b_i(o_1). 2. Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}). 3. Termination: P(O|λ) = Σ_{i=1}^{N} α_T(i).
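A minimal NumPy sketch of the forward procedure (my own; A is the N x N transition matrix, B the N x M emission matrix, pi the initial distribution, and obs a sequence of integer observation symbols):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination: sum_i alpha_T(i)
```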

Solution to Problem 1 72 The Forward Procedure requires on the order of N²T calculations, rather than the 2T·N^T required by the direct calculation.

73 The Backward Procedure. Define β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ), the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model. 1. Initialization: β_T(i) = 1. 2. Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j).
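A matching sketch of the backward procedure, using the same conventions as the forward() sketch above:

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: returns beta (T x N)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # induction
    return beta
```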

Solution to Problem 2 74 We define the quantity δ_t(i) = max_{q_1, …, q_{t−1}} P(q_1 q_2 … q_{t−1}, q_t = i, o_1 o_2 … o_t | λ), which is the best score (highest probability) along a single path, at time t, that accounts for the first t observations and ends in state i. By induction we have δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(o_{t+1}).

Solution to Problem 2 75 The Viterbi Algorithm. 1. Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0. 2. Recursion: δ_t(j) = max_i [δ_{t−1}(i) a_ij] b_j(o_t), ψ_t(j) = argmax_i [δ_{t−1}(i) a_ij]. 3. Termination: P* = max_i δ_T(i), q_T* = argmax_i δ_T(i). 4. Path (state sequence) backtracking: q_t* = ψ_{t+1}(q_{t+1}*), for t = T−1, …, 1. The major difference between the Viterbi algorithm and the forward procedure is the maximization over previous states in place of the summation.
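A minimal NumPy sketch of the Viterbi algorithm, using the same A, B, pi, obs conventions as the forward() sketch above (my own illustrative implementation):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: returns the best state path and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best previous state for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # termination
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]             # backtracking
    return path, delta[-1].max()
```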

Solution to Problem 3 76 Choose λ = (A, B, π) such that its likelihood P(O|λ) is locally maximized, using an iterative procedure such as the Baum-Welch algorithm (also known as the EM algorithm or the forward-backward algorithm). Define ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ), the probability of being in state i at time t and state j at time t+1, given the model and the observation sequence; it can be computed as ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ).

Solution to Problem 3 77 Define γ_t(i) = P(q_t = i | O, λ), the probability of being in state i at time t, given the observation sequence O and the model; it satisfies γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) = α_t(i) β_t(i) / P(O|λ).

Solution to Problem 3 78

Types of HMMs 79 Ergodic Left-right Parallel path left-right

Case Study 80 Features: field descriptor, edge descriptor, grass and sand, player height. Peng, et al. Extract highlights from baseball game video with hidden Markov models, In Proc. of ICIP, vol. 1, pp. 609-612, 2002.

Related Resources 81 Hidden Markov Model (HMM) Toolbox for Matlab http://www.cs.ubc.ca/~murphyk/software/hmm/hmm.html The General Hidden Markov Model library (GHMM) http://ghmm.sourceforge.net/ HTK Speech Recognition Toolkit http://htk.eng.cam.ac.uk