Machine Learning: Expectation Maximization
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012
We've seen the update equations of the GMM, but:
- How are they derived?
- What is the general algorithm of EM?
General Settings for EM
Given data X = (x_1, ..., x_n), where x_i ~ p(x; θ), we want to maximize the log-likelihood log p(X; θ).
p(X; θ) is difficult to maximize because it involves some latent variables Z.
But maximizing the complete-data log-likelihood log p(X, Z; θ) would be easy, if we observed Z. We think of Z as missing data.
In this scenario, EM gives an efficient method to maximize the likelihood p(X; θ) to some local maximum.
Basic Idea of EM
Since maximizing the data log-likelihood log p(X; θ) is hard but maximizing the complete-data log-likelihood log p(X, Z; θ) is easy (if we observed Z), our best bet is to maximize the latter and hope that it also increases the former.
(E step) Since we did not observe Z, we cannot maximize log p(X, Z; θ) directly. Instead, we consider its expected value under the posterior distribution of Z, computed using the old parameters.
(M step) We then update the parameters θ to maximize this expected complete-data log-likelihood.
EM Algorithm in General
1. Initialize the parameters θ_old.
2. E step: evaluate the posterior distribution of the latent variables, p(Z | X; θ_old), using the old parameters. Then the expected complete-data log-likelihood under this distribution (a function of θ) is

   $Q(\theta; \theta_{\text{old}}) = \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \, \log p(X, Z; \theta)$

3. M step: update the parameters to maximize the expected complete-data log-likelihood:

   $\theta_{\text{new}} = \arg\max_{\theta} Q(\theta; \theta_{\text{old}})$

4. Check the convergence criterion. Return to step 2 if it is not satisfied.
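As a minimal sketch of this loop in code (not from the slides; the callables e_step, m_step, and log_likelihood are placeholders that a particular model would supply):

```python
import numpy as np

def run_em(X, theta_init, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the data log-likelihood stops improving.

    e_step(X, theta)         -> posterior distribution of the latent variables (responsibilities)
    m_step(X, posterior)     -> new parameters maximizing Q(theta; theta_old)
    log_likelihood(X, theta) -> log p(X; theta), used here only as a convergence check
    """
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(max_iters):
        posterior = e_step(X, theta)   # E step: uses the old parameters
        theta = m_step(X, posterior)   # M step: maximize the expected complete-data log-likelihood
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:         # the log-likelihood never decreases, so this terminates
            break
        prev_ll = ll
    return theta
```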
GMM Revisited: E step
Evaluate the posterior distribution of the latent variables, p(z_i = j | x_i; θ_old), using the old parameters. Since the data are i.i.d., we evaluate it for each i:

$q_i^{(j)} \triangleq p(z_i = j \mid x_i) = \frac{p(z_i = j,\, x_i)}{p(x_i)} = \frac{p(x_i \mid z_i = j)\, p(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, p(z_i = l)} = \frac{w_j \frac{1}{\sqrt{2\pi}\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}}{\sum_{l=1}^{K} w_l \frac{1}{\sqrt{2\pi}\sigma_l} e^{-\frac{(x_i - \mu_l)^2}{2\sigma_l^2}}}$

(K is the number of Gaussian components.)
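A minimal sketch of this E step for a 1-D GMM (not from the slides; the array names x, w, mu, sigma are assumptions, and each row of the returned array holds $q_i^{(j)}$ for one data point):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu, sigma):
    """Responsibilities q_i(j) = p(z_i = j | x_i) for a 1-D Gaussian mixture.

    x: (n,) data points; w, mu, sigma: (K,) component weights, means, standard deviations.
    """
    # Joint p(x_i, z_i = j) = w_j * N(x_i; mu_j, sigma_j^2), shape (n, K)
    joint = w * norm.pdf(x[:, None], loc=mu, scale=sigma)
    # Normalize over components to get the posterior p(z_i = j | x_i)
    return joint / joint.sum(axis=1, keepdims=True)
```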
GMM Revisited: E step
Then the expected complete-data log-likelihood under this distribution is

$Q(\theta; \theta_{\text{old}}) = \sum_{i=1}^{n} \sum_{j=1}^{K} p(z_i = j \mid x_i; \theta_{\text{old}}) \log p(x_i, z_i = j; \theta)$
$\quad = \sum_{i=1}^{n} \sum_{j=1}^{K} q_i^{(j)} \log p(x_i, z_i = j; \theta)$
$\quad = \sum_{i=1}^{n} \sum_{j=1}^{K} q_i^{(j)} \log \big[ p(z_i = j)\, p(x_i \mid z_i = j) \big]$
$\quad = \sum_{i=1}^{n} \sum_{j=1}^{K} q_i^{(j)} \log \Big[ w_j \frac{1}{\sqrt{2\pi}\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \Big]$
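If one wants to evaluate this Q function numerically (for example, to check that each M step increases it), a sketch under the same assumed array names could be:

```python
import numpy as np
from scipy.stats import norm

def expected_complete_ll(x, q, w, mu, sigma):
    """Q(theta; theta_old) = sum_i sum_j q_i(j) * log[ w_j * N(x_i; mu_j, sigma_j^2) ].

    q holds the responsibilities from the E step (computed with the old parameters);
    w, mu, sigma are the parameters theta at which Q is evaluated.
    """
    log_joint = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return float(np.sum(q * log_joint))
```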
GMM Revisited: M step (for μ_j)
Update the parameters to maximize the expected complete-data log-likelihood

$Q(\theta; \theta_{\text{old}}) = \sum_{i=1}^{n} \sum_{j=1}^{K} q_i^{(j)} \log \Big[ w_j \frac{1}{\sqrt{2\pi}\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}} \Big]$

For μ_j, set the derivative to 0:

$\frac{\partial Q(\theta; \theta_{\text{old}})}{\partial \mu_j} = \sum_{i=1}^{n} q_i^{(j)} \frac{x_i - \mu_j}{\sigma_j^2} = 0$

We get the update equation for the means:

$\mu_j = \frac{\sum_{i=1}^{n} q_i^{(j)} x_i}{\sum_{i=1}^{n} q_i^{(j)}}$
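In code, this mean update is a weighted average per component (a sketch that reuses the responsibilities q from the E-step sketch above):

```python
import numpy as np

def update_means(x, q):
    """mu_j = sum_i q_i(j) * x_i / sum_i q_i(j), with x of shape (n,) and q of shape (n, K)."""
    return (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```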
GMM Revisited: M step (for σ_j)
The derivation is similar to the one for μ_j; try it yourself. We get the update equation for the variances:

$\sigma_j^2 = \frac{\sum_{i=1}^{n} q_i^{(j)} (x_i - \mu_j)^2}{\sum_{i=1}^{n} q_i^{(j)}}$
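The corresponding variance update, under the same assumed array shapes:

```python
import numpy as np

def update_variances(x, q, mu):
    """sigma_j^2 = sum_i q_i(j) * (x_i - mu_j)^2 / sum_i q_i(j)."""
    return (q * (x[:, None] - mu) ** 2).sum(axis=0) / q.sum(axis=0)
```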
GMM Revisited: M step (for w_j)
For w_j, recall the constraint $\sum_{j=1}^{K} w_j = 1$. To maximize Q(θ; θ_old) with respect to w_j, we construct the Lagrangian

$\tilde{Q}(\theta; \theta_{\text{old}}) = Q(\theta; \theta_{\text{old}}) + \beta \Big( \sum_{j=1}^{K} w_j - 1 \Big)$

Set the derivative to 0:

$\frac{\partial \tilde{Q}(\theta; \theta_{\text{old}})}{\partial w_j} = \sum_{i=1}^{n} \frac{q_i^{(j)}}{w_j} + \beta = 0$
GMM Revisited: M step (for w_j)
We get

$w_j = -\frac{\sum_{i=1}^{n} q_i^{(j)}}{\beta}$

Summing over j, and using $\sum_{j=1}^{K} w_j = 1$ and $\sum_{j=1}^{K} q_i^{(j)} = 1$, we get $\beta = -n$. So the update equation for the weights is

$w_j = \frac{1}{n} \sum_{i=1}^{n} q_i^{(j)}$
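The weight update is simply the average responsibility per component; a one-line sketch under the same assumptions:

```python
import numpy as np

def update_weights(q):
    """w_j = (1/n) * sum_i q_i(j); the resulting weights sum to 1 automatically."""
    return q.mean(axis=0)
```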
More Theoretical Questions
We've seen how the update equations of the GMM are derived from the EM algorithm. But do these equations really work?
- Will the data likelihood be maximized (at least to some local maximum)?
- Will the algorithm converge?
Answers
We will show that the data log-likelihood never decreases from one iteration to the next. We also know that the log-likelihood (which is the log of a probability) is bounded above by 0. Therefore, the EM algorithm always converges!
Expected complete-data log-likelihood increases
Recall that in the M step, we maximize the expected complete-data log-likelihood:

$\theta_{\text{new}} = \arg\max_{\theta} Q(\theta; \theta_{\text{old}}) = \arg\max_{\theta} \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X, Z; \theta)$

Therefore

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X, Z; \theta_{\text{new}}) \;\ge\; \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X, Z; \theta_{\text{old}})$
From the previous slide,

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X, Z; \theta_{\text{new}}) \;\ge\; \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X, Z; \theta_{\text{old}})$

Subtracting $\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(Z \mid X; \theta_{\text{old}})$ from both sides, we also have

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} \;\ge\; \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{old}})}{p(Z \mid X; \theta_{\text{old}})}$

Since $p(X, Z; \theta_{\text{old}}) / p(Z \mid X; \theta_{\text{old}}) = p(X; \theta_{\text{old}})$, the right-hand side equals $\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log p(X; \theta_{\text{old}}) = \log p(X; \theta_{\text{old}})$, the data log-likelihood using the old parameters. Therefore

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} \;\ge\; \log p(X; \theta_{\text{old}}) \qquad (1)$
Jensen's Inequality
Let f be a convex function and X a random variable; then $E[f(X)] \ge f(E[X])$. Since log(·) is a concave function, the inequality flips, and we have

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} \;\le\; \log \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \, \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})}$

Here the expectation is taken under $p(Z \mid X; \theta_{\text{old}})$, and the random variable is the ratio $p(X, Z; \theta_{\text{new}}) / p(Z \mid X; \theta_{\text{old}})$.
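As a quick numeric sanity check of the concave case (not from the slides; any positive random variable will do), one can verify that E[log Y] ≤ log E[Y]:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100_000)
# Jensen's inequality for the concave log: E[log Y] <= log E[Y]
print(np.mean(np.log(y)), "<=", np.log(np.mean(y)))
```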
Continue

$\sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} \;\le\; \log \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \, \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} = \log \sum_{Z} p(X, Z; \theta_{\text{new}}) = \log p(X; \theta_{\text{new}}) \qquad (2)$

the data log-likelihood using the new parameters.
Finally
Putting (1) and (2) together, we get

$\log p(X; \theta_{\text{old}}) \;\le\; \sum_{Z} p(Z \mid X; \theta_{\text{old}}) \log \frac{p(X, Z; \theta_{\text{new}})}{p(Z \mid X; \theta_{\text{old}})} \;\le\; \log p(X; \theta_{\text{new}})$

The data log-likelihood monotonically increases!
You Should Now Know
- The EM algorithm is an efficient way to do maximum likelihood estimation when there are latent variables or missing data.
- The general EM algorithm:
  E step: calculate the posterior distribution of the latent variables.
  M step: update the parameters by maximizing the expected complete-data log-likelihood.
- How to derive the EM update equations for a GMM. (Can you derive the EM update equations for parameter estimation of a mixture of categorical distributions?)
- Why EM always converges (to some local optimum).