Dimension Reduction. Wei-Ta Chu 2014/10/22. Multimedia Content Analysis, CSIE, CCU


1 Dimension Reduction. Wei-Ta Chu, 2014/10/22

2 1.1 Principal Component Analysis (PCA). Widely used in dimensionality reduction, lossy data compression, feature extraction, and data visualization. Also known as the Karhunen-Loève transform. Two commonly used definitions: (1) the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized; (2) the linear projection that minimizes the average projection cost, i.e., the mean squared distance between the data points and their projections. C.M. Bishop, Chapter 12 of Pattern Recognition and Machine Learning, Springer, 2006.

3 Maximum Variance Formulation. Consider a data set of observations $\{x_n\}$, $n = 1, \dots, N$, with dimensionality $D$. Goal: project the data onto a space of dimensionality $M < D$ while maximizing the variance of the projected data. Assume the value of $M$ is given. Begin with $M = 1$: the data are projected onto a line in the $D$-dimensional space. The direction of the line is denoted by a $D$-dimensional unit vector $u_1$ (so $u_1^T u_1 = 1$). Each data point $x_n$ is then projected onto the scalar value $u_1^T x_n$.

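4 LA Recap: Orthogonal Projection. $\mathrm{proj}_a u = \frac{u \cdot a}{\|a\|^2}\, a$ (vector component of $u$ along $a$). $u - \mathrm{proj}_a u = u - \frac{u \cdot a}{\|a\|^2}\, a$ (vector component of $u$ orthogonal to $a$). $\|\mathrm{proj}_a u\| = \frac{|u \cdot a|}{\|a\|} = \|u\|\,|\cos\theta|$, where $\theta$ is the angle between $u$ and $a$.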

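5 Maximum Variance Formulation. The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$ is the sample mean. The variance of the projected data is given by $\frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$, where $S$ is the covariance matrix defined by $S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$.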

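6 Maximum Variance Formulation. Maximize the projected variance $u_1^T S u_1$ with respect to $u_1$, subject to the normalization constraint $u_1^T u_1 = 1$ (otherwise the variance could be made arbitrarily large by scaling $u_1$). Introduce a Lagrange multiplier $\lambda_1$ and maximize $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$. Setting the derivative with respect to $u_1$ to zero gives the stationary condition $S u_1 = \lambda_1 u_1$, so $u_1$ must be an eigenvector of $S$. Left-multiplying by $u_1^T$ gives $u_1^T S u_1 = \lambda_1$, so the variance is maximized when $u_1$ is the eigenvector with the largest eigenvalue $\lambda_1$.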

7 Maximum Variance Formulation. The optimal linear projection for which the variance of the projected data is maximized is defined by the $M$ eigenvectors $u_1, \dots, u_M$ of the data covariance matrix $S$ corresponding to the $M$ largest eigenvalues $\lambda_1, \dots, \lambda_M$. Principal component analysis involves evaluating the mean and the covariance matrix of the data set and then finding the $M$ eigenvectors of $S$ corresponding to the $M$ largest eigenvalues.
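
The procedure summarized on this slide (mean, covariance matrix, leading eigenvectors) can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `pca`, the synthetic data, and the 1/N covariance normalization are assumptions made for the example.

```python
import numpy as np

def pca(X, M):
    """Project the rows of X (N x D) onto the M leading principal directions."""
    x_bar = X.mean(axis=0)                 # sample mean
    Xc = X - x_bar                         # centered data
    S = Xc.T @ Xc / X.shape[0]             # covariance matrix with 1/N normalization
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest eigenvalues
    U = eigvecs[:, order]                  # D x M matrix whose columns are u_1..u_M
    Z = Xc @ U                             # projected (centered) data, N x M
    return Z, U, eigvals[order], x_bar

# Synthetic data (hypothetical): three axes with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])
Z, U, lam, x_bar = pca(X, M=2)
```

The variance of each projected coordinate in `Z` equals the corresponding eigenvalue in `lam`, which is the statement $u_i^T S u_i = \lambda_i$ from the derivation.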

8 Covariance. [Figures] High variance, low covariance: no inter-dimension dependency. High variance, high covariance: inter-dimension dependency.

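9 Minimum Error Formulation. Introduce a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, $i = 1, \dots, D$. Each data point can be represented exactly by a linear combination of the basis vectors: $x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$ with $\alpha_{ni} = x_n^T u_i$. Our goal is to approximate this data point using a representation involving a restricted number $M < D$ of variables, corresponding to a projection onto a lower-dimensional subspace ($M$-dimensional projection).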

10 Minimum Error Formulation. Minimize the approximation error $J = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$, where $\tilde{x}_n$ is the $M$-dimensional approximation of $x_n$. The minimum value of $J$ is obtained by discarding the eigenvectors of $S$ with the $D - M$ smallest eigenvalues, giving $J = \sum_{i=M+1}^{D} \lambda_i$; hence the eigenvectors defining the principal subspace are those corresponding to the $M$ largest eigenvalues. L.I. Smith, A tutorial on Principal Component Analysis. J. Shlens, A tutorial on Principal Component Analysis.

11 Applications of PCA. [Figures] Mean vector and the first four PCA eigenvectors for the off-line digits data set. Eigenvalue spectrum and the sum of the discarded eigenvalues. An original example together with its PCA reconstructions obtained by retaining $M$ principal components.

12 Eigenfaces. Eigenfaces for face recognition is a famous application of PCA. Eigenfaces capture the majority of the variance in face data. Project a face onto those eigenfaces to represent its features. M. Turk and A.P. Pentland, Face recognition using eigenfaces, Proc. of CVPR, 1991.

13 Singular Value Decomposition (SVD). SVD works directly on the data; PCA works on the covariance matrix of the data. The SVD technique examines the entire set of data and rotates the axes to maximize variance along the first few dimensions. Problems: #1: find concepts in text; #2: reduce dimensionality.

14 SVD - Definition. $A_{[n \times m]} = U_{[n \times r]} \, L_{[r \times r]} \, (V_{[m \times r]})^T$. A: n x m matrix (e.g., n documents, m terms). U: n x r matrix (n documents, r concepts). L: r x r diagonal matrix (strength of each concept; r is the rank of the matrix). V: m x r matrix (m terms, r concepts).

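15 SVD - Properties. Spectral decomposition of the matrix: $A = l_1 u_1 v_1^T + l_2 u_2 v_2^T + \dots$, where $l_i$ is the $i$-th diagonal element of $L$ and $u_i$, $v_i$ are the $i$-th columns of $U$ and $V$.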

16 SVD - Interpretation. Documents, terms, and concepts: U is the document-to-concept similarity matrix; V is the term-to-concept similarity matrix; the diagonal elements of L give the strength of each concept. Projection: the SVD gives the best axis to project on (best = minimum sum of squares of projection errors).

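17 SVD - Example. A = U L V^T. [Figure: a document-term matrix with terms data, inf., retrieval, brain, lung and document groups CS and MD, shown as the product U x L x V^T. U is the doc-to-concept similarity matrix, with one column for the CS-concept and one for the MD-concept. Matrix entries not recoverable.]

18 SVD - Example. A = U L V^T. [Figure: same example; the first diagonal element of L is the strength of the CS-concept.]

19 SVD - Example. A = U L V^T. [Figure: same example; V is the term-to-concept similarity matrix, whose first column corresponds to the CS-concept.]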

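20 SVD - Dimensionality reduction. Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero. [Figure: A = U x L x V^T with the smallest singular value zeroed.]

21 SVD - Dimensionality reduction. [Figure: A is approximated by the product that keeps only the largest singular value, 9.64; other matrix entries not recoverable.]

22 SVD - Dimensionality reduction. [Figure: the resulting approximation of A.]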
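
The truncation step ("set the smallest singular values to zero") can be tried on a small document-term matrix. The matrix below is hypothetical: the slide's own entries did not survive, so the values are chosen only so that the leading singular value comes out near the 9.64 that does survive. The columns are assumed to be the terms data, inf., retrieval, brain, lung; the first four rows are CS documents and the last three are MD documents.

```python
import numpy as np

# Hypothetical document-term counts (rows: documents, columns: terms).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(U, s, Vt, k):
    """Rank-k approximation: keep the k largest singular values, zero the rest."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = truncate(U, s, Vt, 2)  # both concepts kept: reproduces A (rank 2)
A1 = truncate(U, s, Vt, 1)  # only the strongest (CS) concept kept
```

With this matrix the two nonzero singular values are $\sqrt{93} \approx 9.64$ and $\sqrt{28} \approx 5.29$; keeping only the first reproduces the CS rows exactly and zeroes the MD rows.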

23 2.1 Multidimensional Scaling (MDS). Goal: represent data points in some lower-dimensional space such that the distances between points in that space correspond to the distances between points in the original space.

24 Multidimensional Scaling (MDS). MDS finds a set of vectors in p-dimensional space such that the matrix of Euclidean distances among them corresponds as closely as possible to some function of the input matrix, according to a criterion function called stress. Stress measures the mismatch between the distances among points implied by the MDS map and the input matrix: the smaller the stress, the better the correspondence. $d_{ij}$ refers to the distance between points $i$ and $j$ in the original space; $z_{ij}$ refers to the distance between points $i$ and $j$ on the map.

25 Multidimensional Scaling (MDS). The true dimensionality of the data will be revealed by the rate of decline of stress as dimensionality increases.

26 Multidimensional Scaling (MDS). Algorithm:
1. Assign points to arbitrary coordinates in p-dimensional space.
2. Compute Euclidean distances among all pairs of points to form a matrix.
3. Compare the matrix with the input matrix by evaluating the stress function. The smaller the value, the greater the correspondence between the two.
4. Adjust the coordinates of each point in the direction that maximally decreases stress.
5. Repeat steps 2 through 4 until stress won't get any lower.
T.F. Cox and M.A.A. Cox, Multidimensional Scaling, Chapman & Hall/CRC, 2nd edition, 2000.
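
The iterative algorithm can be sketched as plain gradient descent. Since the slide's stress formula did not survive, the sketch assumes the raw stress $\sum_{i<j} (z_{ij} - d_{ij})^2$; the function names, step size, and iteration count are likewise illustrative choices.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance matrix of the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(D, Z):
    """Sum over pairs i<j of (map distance - input distance)^2 (assumed stress)."""
    return ((pairwise_distances(Z) - D) ** 2).sum() / 2.0

def mds(D, p=2, lr=0.01, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Z = rng.normal(size=(n, p))             # step 1: arbitrary coordinates
    history = []
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # step 2: map distances
        history.append(((dist - D) ** 2).sum() / 2.0)  # step 3: stress
        np.fill_diagonal(dist, 1.0)         # avoid division by zero on the diagonal
        coeff = (dist - D) / dist
        np.fill_diagonal(coeff, 0.0)
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        Z = Z - lr * grad                   # step 4: move against the gradient
    return Z, history

# Hypothetical input: distances among 8 random points in the plane.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
D = pairwise_distances(X)
Z, history = mds(D, p=2)
```

The recovered map `Z` matches `X` only up to rotation, reflection, and translation, because stress depends on the coordinates only through their pairwise distances.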

27 Isometric Feature Mapping (Isomap). Examples. J.B. Tenenbaum, V. de Silva, and J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, vol. 290, 2000.

28 Isometric Feature Mapping (Isomap). Estimate the geodesic distance between faraway points, given only input-space distances, by adding up a sequence of short hops between neighboring points.

29 Isometric Feature Mapping (Isomap). Algorithm:
Step 1: construct the neighborhood graph. Determine which points are neighbors on the manifold: connect each point to all points within some fixed radius ε, or to its K nearest neighbors.
Step 2: compute shortest paths. Estimate the geodesic distance between all pairs of points on the manifold by computing their shortest-path distance in the graph.
Step 3: construct the d-dimensional embedding. Apply MDS to the matrix of graph distances to construct an embedding of the data.
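
The three steps can be sketched as follows. The sketch makes several assumptions beyond the slide: it uses the K-nearest-neighbor variant of step 1, Floyd-Warshall for step 2, and classical (eigendecomposition-based) MDS for step 3 rather than the iterative stress minimization described earlier. The function name and the half-circle test data are hypothetical.

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 1: symmetric K-nearest-neighbor graph (inf = no edge).
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # skip self at column 0
    for i in range(n):
        G[i, idx[i]] = dist[i, idx[i]]
        G[idx[i], i] = dist[i, idx[i]]

    # Step 2: all-pairs shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # Step 3: classical MDS on the graph distances.
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (G ** 2) @ J                  # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    Y = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
    return Y, G

# Hypothetical data: 30 points on a half circle (a 1-D manifold in 2-D).
t = np.linspace(0.0, np.pi, 30)
X = np.column_stack([np.cos(t), np.sin(t)])
Y, G = isomap(X, n_neighbors=2, d=1)
```

For the half circle, the graph distance between the two endpoints approximates the arc length π rather than the straight-line distance 2, and the one-dimensional embedding orders the points along the arc.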

30 Isometric Feature Mapping (Isomap)

31 2.3 Locally Linear Embedding (LLE). Eliminates the need to estimate pairwise distances between widely separated data points. LLE recovers global nonlinear structure from locally linear fits. S.T. Roweis and L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, vol. 290, pp , oweis/lle/publications.html

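32 Locally Linear Embedding (LLE). Characterize the local geometry by linear coefficients that reconstruct each data point from its neighbors. Minimize the reconstruction error $\varepsilon(W) = \sum_i \| X_i - \sum_j W_{ij} X_j \|^2$, where $W_{ij} = 0$ unless $X_j$ is a neighbor of $X_i$ and $\sum_j W_{ij} = 1$. Then, with the weights fixed, choose the $d$-dimensional coordinates $Y_i$ to minimize the embedding cost function $\Phi(Y) = \sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$.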
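
Both minimizations have closed-form solutions, which the sketch below uses: the weights come from a small linear system per point, and the embedding from the bottom eigenvectors of $(I - W)^T (I - W)$. The regularization term `reg`, the K-nearest-neighbor choice, and all names are assumptions made for the illustration.

```python
import numpy as np

def lle(X, n_neighbors=6, d=2, reg=1e-3):
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # neighbors, excluding self

    # Reconstruction weights: minimize |X_i - sum_j W_ij X_j|^2 with sum_j W_ij = 1.
    W = np.zeros((n, n))
    for i in range(n):
        Zi = X[idx[i]] - X[i]                  # neighbors relative to X_i
        C = Zi @ Zi.T                          # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)  # assumed regularizer for stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, idx[i]] = w / w.sum()             # enforce the sum-to-one constraint

    # Embedding: bottom eigenvectors of M = (I - W)^T (I - W),
    # skipping the first one (the constant vector with eigenvalue ~0).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    vals, vecs = np.linalg.eigh(M)             # ascending eigenvalues
    Y = vecs[:, 1:d + 1]
    return Y, W

# Hypothetical data: 40 random points in 3-D, embedded in 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Y, W = lle(X, n_neighbors=6, d=2)
```

Only each point's own neighbors enter its weight computation, which is how LLE avoids estimating distances between widely separated points.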

33 Example. [Figure] The bottom images correspond to points along the top-right path, illustrating one particular mode of variability in pose and expression.

34 Brief Introduction of Machine Learning Techniques for Content Analysis. Wei-Ta Chu, 2014/10/22

35 Outline. Overview; Gaussian Mixture Model (GMM); Hidden Markov Model (HMM); Support Vector Machine (SVM).

36 Overview. Any computer program that can improve its performance at some task through experience (or training) can be called a learning program. In the early days, computer scientists developed learning algorithms based on heuristics and insights into human reasoning mechanisms, e.g., decision trees. Neuroscientists attempted to devise learning methods by imitating the structure of human brains. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis, Springer, 2007.

37 Basic Statistical Learning Problems. Many learning tasks can be formulated as one of the following two problems. Regression: X is the input variable and Y is the output variable. Infer a function f(x) so that, given a value x of the input variable X, f(x) is a good prediction of the true value y of the output variable Y.

38 Basic Statistical Learning Problems. Classification: assume that a random variable X can belong to one of a finite set of classes C = {1, 2, ..., K}. Given the value x of variable X, infer its class label l = g(x), where l ∈ C. It is also of great interest to estimate the probability P(k|x) that X belongs to class k, k ∈ C. In fact, both the regression and classification problems can be formulated using the same framework.

39 Categorizations of Machine Learning Techniques. Unsupervised vs. supervised: for inferring the functions f(x) and g(x), if pairs of training data (x_i, y_i) or (x_i, l_i), i = 1, ..., N, are available, then the inference process is called supervised learning. Most regression methods are supervised learning. Unsupervised methods strive to automatically partition a given data set into a predefined number of clusters; this is also called clustering.

40 Categorizations of Machine Learning Techniques. Generative models vs. discriminative models: discriminative models strive to learn P(k|x) directly from the training set without attempting to model the observation x. Generative models compute P(k|x) by first modeling the class-conditional probabilities P(x|k) as well as the class probabilities P(k), and then applying Bayes' theorem: $P(k|x) = \frac{P(x|k)\,P(k)}{\sum_{k'} P(x|k')\,P(k')}$, where P(k|x) is the posterior probability, P(x|k) is the likelihood, and P(k) is the prior probability.

41 Categorizations of Machine Learning Techniques. Generative models: Naïve Bayes, Bayesian Networks, Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), ... Discriminative models: Neural Networks, Support Vector Machines (SVM), Maximum Entropy Models (MEM), Conditional Random Fields (CRF), ...

42 Categorizations of Machine Learning Techniques. Models for simple data vs. models for complex data: complex data consist of sub-entities that are strongly related to one another, e.g., a beach scene is usually composed of a blue sky on top, an ocean in the middle, and a sand beach at the bottom. For simple data: Naïve Bayes, GMM, NN, SVM. For complex data: BN, HMM, MEM, CRF, M^3-net.

43 Categorizations of Machine Learning Techniques. Model identification vs. model prediction: the goal of model identification is to discover an existing law of nature. Model identification is an ill-posed problem and suffers from the curse of dimensionality. The goal of model prediction is to predict events well, but not necessarily through the identification of the model of events.

44 Gaussian Mixture Model. Wei-Ta Chu, 2014/10/22

45 Introduction. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

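46 Introduction. Consider a superposition of K Gaussian densities, $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where the mixing coefficients satisfy $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. Each Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is called a component of the mixture and has its own mean $\mu_k$ and covariance $\Sigma_k$.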

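47 Introduction. From the sum and product rules, the marginal density is given by $p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k)$. We can view the mixing coefficient $\pi_k = p(k)$ as the prior probability of picking the kth component, and the density $\mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k)$ as the probability of x conditioned on k.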

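48 Introduction. From Bayes' theorem, the posterior probability is given by $p(k \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x \mid \mu_l, \Sigma_l)}$, where $\pi_k$ are the mixing coefficients. The Gaussian mixture distribution is governed by the parameters $\pi$, $\mu$, and $\Sigma$. One way to set these parameters is to use maximum likelihood. Assuming that the data points $x_1, \dots, x_N$ are independent and identically distributed, the log likelihood is $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$.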

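49 Introduction. In the case of a single variable x, the Gaussian distribution has the form $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$.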

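50 Maximizing Likelihood. Setting the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ with respect to the means $\mu_k$ of the Gaussian components to zero gives $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n$, where $\gamma_{nk} = p(k \mid x_n)$ and $N_k = \sum_{n=1}^{N} \gamma_{nk}$. The mean of the kth Gaussian component is obtained by taking a weighted mean of all of the points in the data set, in which the weighting factor for data point $x_n$ is given by the posterior probability $\gamma_{nk}$.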

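51 Maximizing Likelihood. Setting the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ with respect to the covariance $\Sigma_k$ of the Gaussian components to zero gives $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$, where $\gamma_{nk} = p(k \mid x_n)$ and $N_k = \sum_{n=1}^{N} \gamma_{nk}$. Each data point is weighted by the corresponding posterior probability.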

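52 Maximizing Likelihood. Maximize $\ln p(X \mid \pi, \mu, \Sigma)$ with respect to the mixing coefficients $\pi_k$. Constraint: the sum of the mixing coefficients is one. Using a Lagrange multiplier $\lambda$, maximize the quantity $\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$, which gives $0 = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda$. If we multiply both sides by $\pi_k$ and sum over k, making use of the constraint, we find $\lambda = -N$. Using this to eliminate $\lambda$ and rearranging, we obtain $\pi_k = \frac{N_k}{N}$, where $N_k = \sum_n p(k \mid x_n)$. The mixing coefficient of the kth component is given by the average responsibility which that component takes for explaining the data points.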

53 Expectation-Maximization (EM) Algorithm. We first choose some initial values for the means, covariances, and mixing coefficients, and then alternate between two steps. Expectation step (E step): use the current parameters to evaluate the posterior probabilities. Maximization step (M step): re-estimate the means, covariances, and mixing coefficients using those posterior probabilities. Each update to the parameters resulting from an E step followed by an M step is guaranteed not to decrease the log likelihood function.
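
A compact NumPy sketch of EM for a Gaussian mixture follows. It is an illustration under stated assumptions, not the lecture's implementation: means are initialized from randomly chosen data points, covariances from the overall data covariance, and a small `eps` is added to each covariance diagonal to keep it invertible. The data are synthetic.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mean
    sol = np.linalg.solve(cov, diff.T).T           # cov^{-1} (x - mean) per row
    expo = -0.5 * np.sum(diff * sol, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(expo) / norm

def em_gmm(X, K, n_iter=50, seed=0, eps=1e-6):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    means = X[rng.choice(N, size=K, replace=False)].copy()
    base_cov = np.cov(X.T).reshape(D, D) + eps * np.eye(D)
    covs = np.array([base_cov.copy() for _ in range(K)])
    weights = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iter):
        # E step: posterior probability (responsibility) of each component.
        dens = np.stack(
            [weights[k] * gaussian_pdf(X, means[k], covs[k]) for k in range(K)],
            axis=1,
        )
        total = dens.sum(axis=1, keepdims=True)
        log_liks.append(np.log(total).sum())       # log likelihood before the update
        resp = dens / total
        # M step: re-estimate means, covariances, and mixing coefficients.
        Nk = resp.sum(axis=0)
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + eps * np.eye(D)
        weights = Nk / N
    return weights, means, covs, log_liks

# Synthetic data (hypothetical): two 2-D clusters centered at (-3, -3) and (3, 3).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, size=(150, 2)),
               rng.normal(3.0, 1.0, size=(150, 2))])
weights, means, covs, log_liks = em_gmm(X, K=2)
```

Plotting `log_liks` against the iteration index is a simple way to watch the log likelihood climb and flatten as EM converges.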

54 Example

55 EM for Gaussian Mixtures

56 Case Study. Jiang, et al., A new method to segment playfield and its applications in match analysis in sports video, In Proc. of ACM MM, 2004.

57 Case Study. The conditional density of a pixel belonging to the playfield region is modeled as a mixture of M Gaussian densities.

58 Related Resources. GMMBAYES - Bayesian Classifier and Gaussian Mixture Model ToolBox, mmbayestb/. Netlab Matlab toolboxes collection.

59 Hidden Markov Model. Wei-Ta Chu, 2014/10/22

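60 Markov Model. Chain rule: $p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \dots, x_{n-1})$. If we assume that each of the conditional distributions is independent of all previous observations except the most recent, we obtain the first-order Markov chain. First-order Markov chain: $p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})$. Second-order Markov chain: $p(x_1, \dots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})$. C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.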

61 Example. What is the probability that the weather for eight consecutive days is sun-sun-sun-rain-rain-sun-cloudy-sun? [Figure: three-state weather model with states rain, cloudy, and sunny; transition probabilities not recoverable.] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993. L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, 1989.
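
The question can be answered by multiplying transition probabilities along the sequence. The slide's figure did not survive, so the transition matrix below is an assumption: it is the three-state weather matrix from the Rabiner tutorial cited on this slide, with the first day taken as known to be sunny. The function name is hypothetical.

```python
import numpy as np

states = ["rain", "cloudy", "sun"]

# Assumed transition matrix: A[i, j] = P(tomorrow is state j | today is state i).
A = np.array([
    [0.4, 0.3, 0.3],   # from rain
    [0.2, 0.6, 0.2],   # from cloudy
    [0.1, 0.1, 0.8],   # from sun
])

def sequence_probability(seq, A, states, initial=None):
    """Probability of a state sequence under a first-order Markov chain.

    If `initial` is None, the first state is treated as given (probability 1).
    """
    idx = [states.index(s) for s in seq]
    p = 1.0 if initial is None else initial[idx[0]]
    for i, j in zip(idx[:-1], idx[1:]):
        p *= A[i, j]          # one transition per consecutive pair of days
    return p

seq = ["sun", "sun", "sun", "rain", "rain", "sun", "cloudy", "sun"]
p = sequence_probability(seq, A, states)
print(p)  # approximately 1.536e-04
```

Under these assumed values the product is 0.8 · 0.8 · 0.1 · 0.4 · 0.3 · 0.1 · 0.2 ≈ 1.536 × 10⁻⁴.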

62 Coin-Toss Model. You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain is another person who is performing a coin-tossing experiment (using one or more coins). The person will not tell you which coin he selects at any time; he will only tell you the result of each coin flip. A typical observation sequence would be a series of heads (H) and tails (T). The question is: how do we build a model to explain the observed sequence of heads and tails?

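63 Coin-Toss Model. [Figure] 1-coin model (observable Markov model): two states, 1 (Head) and 2 (Tail); every transition into state 1 has probability P(H) and every transition into state 2 has probability 1 - P(H). 2-coins model (hidden Markov model): two states with transition probabilities a_11, 1 - a_11, a_22, 1 - a_22; in state 1, P(H) = P_1 and P(T) = 1 - P_1; in state 2, P(H) = P_2 and P(T) = 1 - P_2.

64 Coin-Toss Model. [Figure] 3-coins model (hidden Markov model): three states with transition probabilities a_ij, i, j = 1, 2, 3. State 1: P(H) = P_1, P(T) = 1 - P_1. State 2: P(H) = P_2, P(T) = 1 - P_2. State 3: P(H) = P_3, P(T) = 1 - P_3.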

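65 Elements of HMM. N: the number of states in the model. M: the number of distinct observation symbols per state. The state-transition probability distribution $A = \{a_{ij}\}$, with $a_{ij} = P(q_{t+1} = j \mid q_t = i)$. The observation symbol probability distribution $B = \{b_j(k)\}$, with $b_j(k) = P(o_t = v_k \mid q_t = j)$. The initial state distribution $\pi = \{\pi_i\}$, with $\pi_i = P(q_1 = i)$. To describe an HMM, we usually use the compact notation $\lambda = (A, B, \pi)$.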

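66 Three Basic Problems of HMM. Problem 1: Probability Evaluation. Given the observation sequence $O = (o_1, o_2, \dots, o_T)$ and the model $\lambda$, how do we compute $P(O \mid \lambda)$, the probability that the observed sequence was produced by the model? This amounts to scoring how well a given model matches a given observation sequence.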

67 Three Basic Problems of HMM. Problem 2: Optimal State Sequence. Attempt to uncover the hidden part of the model, that is, to find the correct state sequence. For practical situations, we usually use an optimality criterion to solve this problem as well as possible.

68 Three Basic Problems of HMM. Problem 3: Parameter Estimation. Attempt to optimize the model parameters to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence because it is used to train the HMM.

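69 Solution to Problem 1. There are $N^T$ possible state sequences. Consider one fixed state sequence $q = (q_1, q_2, \dots, q_T)$. The probability of the observation sequence given the state sequence is $P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda)$, where we have assumed statistical independence of observations; thus we get $P(O \mid q, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$.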

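70 Solution to Problem 1. The probability of a state sequence $q = (q_1, \dots, q_T)$ can be written as $P(q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$. The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is simply the product $P(O, q \mid \lambda) = P(O \mid q, \lambda)\, P(q \mid \lambda)$. The probability of O is obtained by summing this joint probability over all possible state sequences q, giving $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$.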

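71 Solution to Problem 1. The Forward Procedure. Define the forward variable $\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda)$, the probability of the partial observation sequence $o_1, o_2, \dots, o_t$ (until time t) and state i at time t, given the model. We solve for it inductively:
1. Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$.
2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$.
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.

72 Solution to Problem 1. The Forward Procedure requires on the order of $N^2 T$ calculations, rather than $2T \cdot N^T$ as required by the direct calculation.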
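
The forward procedure and the direct sum over all $N^T$ state sequences can be compared on a tiny model. The parameter values below are hypothetical (a two-coin model with observation 0 = head, 1 = tail), and the function names are chosen for the illustration.

```python
import itertools
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns the alpha table and P(O | model)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination

def brute_force(A, B, pi, obs):
    """Direct calculation: sum the joint probability over all N^T state sequences."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

# Hypothetical two-coin HMM.
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transitions a_ij
B = np.array([[0.5, 0.5], [0.9, 0.1]])   # B[j, k] = b_j(k): fair coin, biased coin
pi = np.array([0.6, 0.4])                # initial state distribution
obs = [0, 0, 1, 0, 1]                    # H H T H T

alpha, p_forward = forward(A, B, pi, obs)
p_direct = brute_force(A, B, pi, obs)
```

Both routes give the same probability; the difference is cost, which grows linearly in T for the forward procedure and exponentially for the direct sum.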

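73 The Backward Procedure. Define the backward variable $\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda)$, the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model.
1. Initialization: $\beta_T(i) = 1$, $1 \le i \le N$.
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, T-2, \dots, 1$, $1 \le i \le N$.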

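74 Solution to Problem 2. We define the quantity $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = i, o_1, \dots, o_t \mid \lambda)$, which is the best score (highest probability) along a single path, at time t, which accounts for the first t observations and ends in state i. By induction we have $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_j(o_{t+1})$.

75 Solution to Problem 2. The Viterbi Algorithm:
1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $\psi_1(i) = 0$.
2. Recursion: $\delta_t(j) = \max_i \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$, $\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i)\, a_{ij} \right]$, $2 \le t \le T$.
3. Termination: $P^* = \max_i \delta_T(i)$, $q_T^* = \arg\max_i \delta_T(i)$.
4. Path (state sequence) backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, T-2, \dots, 1$.
The major difference between Viterbi and the forward procedure is the maximization over previous states in place of the summation.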
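
The four Viterbi steps map directly onto array operations. The sketch below uses hypothetical parameters for a two-state model (observation 0 = head, 1 = tail); the names `viterbi`, `delta`, and `psi` are chosen for the illustration. For long sequences a practical version would work with log probabilities to avoid underflow, which is omitted here for brevity.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most probable state path and its joint probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)   # best predecessor for backtracking
    delta[0] = pi * B[:, obs[0]]                      # 1. initialization
    for t in range(1, T):                             # 2. recursion
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # 3. termination
    for t in range(T - 2, -1, -1):                    # 4. backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# Hypothetical two-state HMM.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[j, k] = b_j(k)
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1, 0]

path, best_prob = viterbi(A, B, pi, obs)
```

Replacing `max` and `argmax` with a sum over the previous state turns the recursion back into the forward procedure.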

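76 Solution to Problem 3. Choose $\lambda = (A, B, \pi)$ such that its likelihood, $P(O \mid \lambda)$, is locally maximized using an iterative procedure such as the Baum-Welch algorithm (also known as the EM algorithm or forward-backward algorithm). Define $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$, the probability of being in state i at time t and state j at time t+1, given the model and the observation sequence. In terms of the forward and backward variables, $\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$.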

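77 Solution to Problem 3. Define $\gamma_t(i) = P(q_t = i \mid O, \lambda)$, the probability of being in state i at time t, given the observation sequence O and the model. It satisfies $\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}$, and for $t < T$, $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$.

78 Solution to Problem 3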

79 Types of HMMs. [Figures] Ergodic; left-right; parallel-path left-right.

80 Case Study. Features: field descriptor, edge descriptor, grass and sand, player height. Peng, et al., Extract highlights from baseball game video with hidden Markov models, In Proc. of ICIP, vol. 1, 2002.

81 Related Resources. Hidden Markov Model (HMM) Toolbox for Matlab. The General Hidden Markov Model library (GHMM). HTK Speech Recognition Toolkit.