Machine Learning. Expectation Maximization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Size: px

Start display at page:

Download "Machine Learning. Expectation Maximization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall"

Edwin Cameron
9 years ago
Views:

1 Machine Learning Expectation Maximization hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

2 We ve seen the update eqs. of GMM, but How are they derived? What is the general algorithm of EM? hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

3 General Settings for EM Given data X = (x 1,, x n ), where x i ~p x; θ. Want to maximize log-likelihood log p X; θ. p X; θ is difficult to maximize because it involves some latent variables. But maximizing the complete data log-likelihood log p X, ; θ would be easy (if we observed ). We think of as missing data. In this scenario, EM gives an efficient method to maximize likelihood p X; θ to some local maximum. hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

4 Basic Idea of EM Since maximizing data log-likelihood log p X; θ is hard but maximizing complete data log-likelihood log p X, ; θ is easy (if we observed ), out bet is to maximize the latter and hopefully it also increases the former. (E step) Since we didn t observe, we cannot maximize log p X, ; θ directly. We will consider its expected value under the posterior dist. of, using old parameter. (M step) We then update parameter θ to maximize the expected complete data log-likelihood. hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

5 EM Algorithm in General 1. Initialize parameter θ old. 2. E step: evaluate posterior dist. of latent variables p X; θ old, using old parameter. Then the expected complete data log-likelihood, under this dist. would be A function of θ Q θ; θ old = p X; θ old Taking expectation log p X, ; θ Complete data log-likelihood 3. M step: update parameters to maximize the expected complete data log-likelihood. θ new = arg max Q θ; θold θ 4. Check convergence criterion. Return to step 2 if not satisfied. hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

6 GMM Revisited E step Evaluate posterior dist. of latent variables p X; θ old, using old parameters. Since data are i.i.d., we evaluate for each i. 1 e (x i μ j ) 2 2σ 2 j 2πσ j q (j) i p z i = j x i = p z i = j, x i p x i = p x i z i = j p z i = j l=1 p x i z i = l p z i = l w j hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

7 GMM Revisited E step Then the expected complete data log-likelihood, under this dist. would be Q θ; θ old = p z i = j x i log p x i, z i ; θ old j=1 = q i (j) log p x i, z i ; θ old j=1 = q i (j) log p z i = j p x i z i = j j=1 = q i (j) log w j j=1 1 e (x i μ j ) 2 2σ 2 j 2πσ j hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

8 GMM Revisited M step (for μ j ) Update parameters to maximize the expected complete data log-likelihood. Q θ; θ old = q i (j) log w j j=1 1 e (x i μ j ) 2 2σ 2 j 2πσ j For μ j, set derivative to 0. Q θ; θ old μ j = q i (j) x i μ j σ 2 j We get update equation for means: μ j = q (j) i q i (j) x i. = 0 hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

9 GMM Revisited M step (for σ j ) The derivation is similar to the derivation for μ j Try it yourself We get update equation for variances: σ 2 q (j) 2 i x i μ j j = q i (j) hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

10 GMM Revisited M step (for w j ) For w j, recall there is a constraint j=1 w j = 1. To maximize Q θ; θ old w.r.t. w j, we construct the Lagrangian Q θ; θ old = Q θ; θ old + β w j Set derivative to 0 Q θ; θ old = w j q i (j) w j j=1 + β = 0 1 hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

11 GMM Revisited M step (for w j ) We get w j = q (j) i β Sum over j, and using β = j=1 = q i j j=1 j=1 w j j q i = = 1, we get So we get update equation for weights: w j = 1 (j) q i hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

12 More Theoretical Questions We ve seen how the update equations of GMM are derived from the EM algorithm. But do these equations really work? Will the data likelihood be maximized (at least to some local maximum)? Will the algorithm converge? hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

13 Answers We will show that the data log-likelihood never decrease in each iteration. We also know that log-likelihood (which is log of probability) is bounded above by 0. Therefore, EM algorithm always converges! hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

14 Expected data log-likelihood increases Recall that in M step, we maximize the expected complete log-likelihood θ new = arg max Q θ; θold θ = arg max θ Therefore p X; θ old p X; θ old log p X, ; θ new log p X, ; θ p X; θ old log p X, ; θ old hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

15 From previous slide p X; θ old So we also have p X; θ old Therefore log p X, ; θ new log p X; θ old log p X; θ old log p X, ; θ old p X, ; θnew p X; θ old p X, ; θold p X; θ old = p X; θ old log p X; θ old = log p X; θ old Data log-likelihood using old parameter hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall (1)

16 Jensen s Inequality Let f be a convex function, X be a random variable, then E f X f EX Since log() is a concave function, we have Expectation Concave function Random variable p X; θ old p X, ; θnew log p X; θ old p X, ; θnew log p X; θold p X, ; θ old hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

17 p X; θ old Continue log p X, ; θnew p X; θ old p X, ; θnew log p X; θold p X, ; θ old (2) = log p X, ; θ new = log p X; θ new Data log-likelihood using new parameter hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

18 Finally Putting (1) and (2) together, we get log p X; θ old p X; θ old log log p X; θ new p X, ; θnew p X; θ old Data log-likelihood monotonically increases! hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

19 You Should now EM algorithm is an efficient way to do maximum likelihood estimation, when there are latent variables or missing data. The general algorithm of EM E step: calculate posterior dist. of latent variables. M step: update parameters by maximizing the expected complete data log-likelihood. How to derive EM update equations of GMM? Can you derive EM update equations for parameter estimation of a mixture of categorical distributions? Why does EM always converge (to some local optimum)? hiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation