Statistical Machine Learning from Data

Transcription

1 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland bengio@idiap.ch January 23, 2006

2 Samy Bengio Statistical Machine Learning from Data

3 Samy Bengio Statistical Machine Learning from Data 3 Basics What is a GMM? Applications

4 Reminder: Basics on Probabilities Basics What is a GMM? Applications A few basic equalities that are often used: 1 (conditional probabilities) P(A, B) = P(A B) P(B) 2 (Bayes rule) P(A B) = P(B A) P(A) P(B) 3 If ( B i = Ω) and i, j i (B i Bj = ) then P(A) = i P(A, B i ) Samy Bengio Statistical Machine Learning from Data 4

5 Basics What is a GMM? Applications What is a Gaussian Mixture Model? A Gaussian Mixture Model (GMM) is a distribution The likelihood given a Gaussian distribution is ( 1 N (x µ, Σ) = exp 1 ) (2π) d 2 Σ 2 (x µ)t Σ 1 (x µ) where d is the dimension of x, µ is the mean and Σ is the covariance matrix of the Gaussian. Σ is often diagonal. The likelihood given a GMM is N p(x) = w i N (x µ i, Σ i ) where N is the number of Gaussians and w i is the weight of Gaussian i, with w i = 1 and i : w i 0 i Samy Bengio Statistical Machine Learning from Data 5

6 Characteristics of a GMM Basics What is a GMM? Applications While ANNs are universal approximators of functions, GMMs are universal approximators of densities. (as long as there are enough Gaussians of course) Even diagonal GMMs are universal approximators. Full rank GMMs are not easy to handle: number of parameters is the square of the number of dimensions. GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization. Samy Bengio Statistical Machine Learning from Data 6

7 Basics What is a GMM? Applications Practical Applications using GMMs Biometric person authentication (using voice, face, handwriting, etc): one GMM for the client one GMM for all the others Bayes decision = likelihood ratio Any highly imbalanced classification task one GMM per class, tuned by maximum likelihood Bayes decision = likelihood ratio Dimensionality reduction Quantization Samy Bengio Statistical Machine Learning from Data 7

8 Samy Bengio Statistical Machine Learning from Data 8 Graphical Interpretation More Formally

9 Graphical Interpretation More Formally Basics of Expectation-Maximization Objective: maximize the likelihood p(x θ) of the data X drawn from an unknown distribution, given the model parameterized by θ: θ = arg max θ p(x θ) = arg max θ n p(x p θ) p=1 Basic ideas of EM: Introduce a hidden variable such that its knowledge would simplify the maximization of p(x θ) At each iteration of the algorithm: E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable Samy Bengio Statistical Machine Learning from Data 9

10 EM for GMM (Graphical View, 1) Graphical Interpretation More Formally Hidden variable: for each point, which Gaussian generated it? A B Samy Bengio Statistical Machine Learning from Data 10

11 EM for GMM (Graphical View, 2) Graphical Interpretation More Formally E-Step: for each point, estimate the probability that each Gaussian generated it A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 11

12 EM for GMM (Graphical View, 3) Graphical Interpretation More Formally M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable) A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 12

13 EM: More Formally Graphical Interpretation More Formally Let us call the hidden variable Q. Let us consider the following auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] It can be shown that maximizing A θ s+1 = arg max A(θ, θ s ) θ always increases the likelihood of the data p(x θ s+1 ), and a maximum of A corresponds to a maximum of the likelihood. Samy Bengio Statistical Machine Learning from Data 13

14 EM: Proof of Convergence Graphical Interpretation More Formally First let us develop the auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] = q Q P(q X, θ s ) log p(x, q θ) = q Q P(q X, θ s ) log(p(q X, θ) p(x θ)) = P(q X, θ s ) log P(q X, θ) + log p(x θ) q Q Samy Bengio Statistical Machine Learning from Data 14

15 Samy Bengio Statistical Machine Learning from Data 15 EM: Proof of Convergence Graphical Interpretation More Formally then if we evaluate it at θ s A(θ s, θ s ) = P(q X, θ s ) log P(q X, θ s ) + log p(x θ s ) q Q the difference between two consecutive log likelihoods of the data can be written as log p(x θ) log p(x θ s ) = A(θ, θ s ) A(θ s, θ s ) + q Q P(q X, θ s ) log P(q X, θs ) P(q X, θ)

16 EM: Proof of Convergence Graphical Interpretation More Formally hence, since the last part of the equation is a Kullback-Leibler divergence which is always positive or null, if A increases, the log likelihood of the data also increases Moreover, one can show that when A is maximum, the likelihood of the data is also at a maximum. Samy Bengio Statistical Machine Learning from Data 16

17 Hidden Variable Auxiliary Function E-Step M-Step Samy Bengio Statistical Machine Learning from Data

18 Samy Bengio Statistical Machine Learning from Data 18 EM for GMM: Hidden Variable Hidden Variable Auxiliary Function E-Step M-Step For GMM, the hidden variable Q will describe which Gaussian generated each example. If Q was observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian Moreover, we will see that we can easily estimate Q Let us first write the mixture of Gaussian model for one x i : p(x i θ) = N P(j θ)p(x i j, θ) j=1 Let us now introduce the following indicator variable: { 1 if Gaussian j emitted xi q i,j = 0 otherwise

19 EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step We can now write the joint likelihood of all the X and q: n N p(x, Q θ) = P(j θ) q i,j p(x i j, θ) q i,j which in log gives j=1 log p(x, Q θ) = N q i,j log P(j θ) + q i,j log p(x i j, θ) j=1 Samy Bengio Statistical Machine Learning from Data 19

20 Samy Bengio Statistical Machine Learning from Data 20 EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step Let us now write the corresponding auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] N = E Q q i,j log P(j θ) + q i,j log p(x i j, θ) X, θ s j=1 j=1 N = E Q [q i,j X, θ s ] log P(j θ) + E Q [q i,j X, θ s ] log p(x i j, θ)

21 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: E-Step and M-Step A(θ, θ s ) = j=1 N E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) Hence, the E-Step estimates the posterior: E Q [q i,j X, θ s ] = 1 P(q i,j = 1 X, θ s ) + 0 P(q i,j = 0 X, θ s ) = P(j x i, θ s ) = p(x i j, θ s )P(j θ s ) p(x i θ s ) and the M-step finds the parameters θ that maximizes A, hence searching for A θ = 0 for each parameter (µ j, variances σ 2 j, and weights w j). Note however that w j should sum to 1. Samy Bengio Statistical Machine Learning from Data 21

22 Samy Bengio Statistical Machine Learning from Data 22 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A µ j = = = A log p(x i j, θ) log p(x i j, θ) µ j P(j x i, θ s ) log p(x i j, θ) µ j P(j x i, θ s ) (x i µ j ) σ 2 j = 0

23 Samy Bengio Statistical Machine Learning from Data 23 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step P(j x i, θ s ) (x i µ j ) σ 2 j = (removing constant terms in the sum) = 0 P(j x i, θ s ) x i P(j x i, θ s ) x i P(j x i, θ s ) P(j x i, θ s ) µ j = 0 = ˆµ j

24 Samy Bengio Statistical Machine Learning from Data 24 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A σ 2 j = = = A log p(x i j, θ) log p(x i j, θ) σj 2 P(j x i, θ s ) log p(x i j, θ) P(j x i, θ s ) σ 2 j ( (x i µ j ) 2 2σ 4 j ) 1 2σj 2 = 0

25 Samy Bengio Statistical Machine Learning from Data 25 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances P(j x i, θ s ) ( (x i µ j ) 2 P(j x i, θ s )(x i µ j ) 2 2σ 4 j 2σ 4 j P(j x i, θ s )(x i µ j ) 2 σ 2 j ) 1 2σj 2 = 0 P(j x i, θ s ) 2σj 2 = 0 P(j x i, θ s ) = 0 P(j x i, θ s )(x i ˆµ j ) 2 P(j x i, θ s ) = ˆσ 2 j

26 Samy Bengio Statistical Machine Learning from Data 26 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights We have the constraint that all weights w j should be positive and sum to 1: N w j = 1 j=1 Incorporating it into the system: J(θ, θ s ) = A(θ, θ s ) + (1 N w j ) λ j where λ j are Lagrange multipliers. So we need to derive J with respect to w j and to set it to 0. j=1

27 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights and incorporating the probabilistic constraint, we get J J A(θ, θ s ) = w j A(θ, θ s λ j ) w j ( ) = 1 P(j x i, θ s 1 ) λ j = 0 ŵ j = P(j x i, θ s ) ŵ j = λ j N P(j x i, θ s ) k=1 Samy Bengio Statistical Machine Learning from Data 27 w j = 1 n P(k x i, θ s ) P(j x i, θ s )

28 Samy Bengio Statistical Machine Learning from Data 28 EM for GMM: Update Rules Hidden Variable Auxiliary Function E-Step M-Step Means ˆµ j = Variances ˆσ j 2 = x i P(j x i, θ s ) Weights: ŵ j = 1 n P(j x i, θ s ) (x i ˆµ j ) 2 P(j x i, θ s ) P(j x i, θ s ) P(j x i, θ s )

29 Samy Bengio Statistical Machine Learning from Data 29 Initialization Capacity Control Adaptation

30 Samy Bengio Statistical Machine Learning from Data 30 Initialization Initialization Capacity Control Adaptation EM is an iterative procedure that is very sensitive to initial conditions! Start from trash end up with trash. Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting.

31 Samy Bengio Statistical Machine Learning from Data 31 Capacity Control Initialization Capacity Control Adaptation How to control the capacity with GMMs? selecting the number of Gaussians constraining the value of the variances to be far from 0 (small variances = large capacity) Use cross-validation on the desired criterion (Maximum Likelihood, classification...)

32 Adaptation Techniques Initialization Capacity Control Adaptation In some cases, you have access to only a few examples coming from the target distribution but many coming from a nearby distribution! How can we profit from the big nearby dataset??? Solution: use adaptation techniques. The most well known and used for GMMs: the Maximum A Posteriori adaptation. Samy Bengio Statistical Machine Learning from Data 32

33 Samy Bengio Statistical Machine Learning from Data 33 MAP Adaptation Initialization Capacity Control Adaptation Normal maximum likelihood training for a dataset X : θ = arg max p(x θ) θ Maximum A Posteriori (MAP) training: θ = arg max p(θ X ) θ p(x θ)p(θ) = arg max θ p(x ) = arg max p(x θ)p(θ) θ where p(θ) represents your prior belief about the distribution of the parameters θ.

34 Samy Bengio Statistical Machine Learning from Data 34 Implementation Initialization Capacity Control Adaptation Which kind of prior distribution for p(θ)? Two objectives: constraining θ to reasonable values keep the EM algorithm tractable Use conjugate priors: Dirichlet distribution for weights Gaussian densities for means and variances

35 Samy Bengio Statistical Machine Learning from Data 35 What is a Conjugate Prior? Initialization Capacity Control Adaptation A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior. So we would like that p(x θ)p(θ) is distributed according to the same family as p(θ) and tractable. Example: Likelihood is Gaussian: p(x θ) = K 1 exp ( (x 1 µ 1 ) 2 ) 2σ1 2 Prior is Gaussian: p(θ) = K 2 exp ( (x 2 µ 2 ) 2 ) 2σ 2 2 Posterior is Gaussian: p(x θ)p(θ) = K 1 K 2 exp ( (x 1 µ 1 ) 2 2σ 2 1 ) (x µ)2 = K 3 exp ( 2σ 2 (x 2 µ 2 ) 2 ) 2σ2 2

36 Samy Bengio Statistical Machine Learning from Data 36 Conjugate Prior of Multinomials Multinomial distribution: Initialization Capacity Control Adaptation P(X 1 = x 1,, X n = x n θ) = N! n x i! n where x i are nonnegative integers and n x i = N. Dirichlet distribution with parameter u: P(θ u) = 1 Z(u) n θ u i 1 i where θ 1,, θ n 0 and n θ i = 1 and u 1,, u n 0. Conjugate prior = dirichlet with parameter x + u: P(X, θ u) = 1 Z n θ x i +u i 1 i θ x i i

37 Samy Bengio Statistical Machine Learning from Data 37 Examples of Conjugate Priors Initialization Capacity Control Adaptation likelihood conjugate prior posterior p(x θ) p(θ) p(θ x) Gaussian(θ, σ) Gaussian(µ 0, σ 0 ) Gaussian(µ 1, σ 1 ) Binomial(N, θ) Beta(r, s) Beta (r + n, s + N n) Poisson(θ) Gamma(r, s) Gamma(r + n, s + 1) Multinomial(θ 1,, θ k ) Dirichlet Dirichlet (α (α 1,, α k ) 1 +n 1,, α k + n k )

38 Initialization Capacity Control Adaptation Simple Implementation for MAP-GMMs Samy Bengio Statistical Machine Learning from Data 38

39 Simple Implementation Initialization Capacity Control Adaptation Train a generic prior model p with large amount of available data = {w p j, µp j, σp j } One hyper-parameter: α [0, 1]: faith on prior model Weights: ŵ j = [ αw p j + (1 α) ] P(j x i ) γ where γ is a normalization factor (so that j w j = 1) Samy Bengio Statistical Machine Learning from Data 39

40 Samy Bengio Statistical Machine Learning from Data 40 Simple Implementation Initialization Capacity Control Adaptation Means: Variances: ˆµ j = αµ p j + (1 α) ( ) ˆσ j = α σ p j + µ p j µp j + (1 α) P(j x i )x i P(j x i ) P(j x i )x i x i P(j x i ) ˆµ j ˆµ j

41 Initialization Capacity Control Adaptation Adapted GMMs for Person Authentication Person authentication task: accept access if P(S i X) > P( S i X) with S i a client, S i all the other persons, and X an access attributed to S i. Using Bayes theorem, this becomes: p(x S i ) p(x S i ) > P( S i ) P(S i ) = S i p(x S i ) is trained on a large dataset p(x S i ) is MAP adapted from p(x S i ). is found on a separate validation set to optimize a given criterion. Samy Bengio Statistical Machine Learning from Data 41