A gentle introduction to Expectation Maximization




A gentle introduction to Expectation Maximization
Mark Johnson, Brown University
November 2009

Outline

- What is Expectation Maximization?
- Mixture models and clustering
- EM for sentence topic modeling

Why Expectation Maximization?

Expectation Maximization (EM) is a general approach to solving problems involving hidden or latent variables Y.

Goal: learn the parameter vector θ of a model P_θ(X, Y) from training data D = (x_1, ..., x_n) consisting of samples from P_θ(X), i.e., Y is hidden.

Maximum likelihood estimate using D:

    θ̂ = argmax_θ L_D(θ) = argmax_θ ∏_{i=1}^{n} Σ_{y∈Y} P_θ(x_i, y)

EM is useful when directly optimizing L_D(θ) is intractable, but computing the MLE from fully observed data D = ((x_1, y_1), ..., (x_n, y_n)) is easy.


Mixture models and clustering

A mixture model is a linear combination of models:

    P(X = x) = Σ_{y∈Y} P(Y = y) P(X = x | Y = y)

where:

- y ∈ Y identifies the mixture component,
- P(y) is the probability of generating mixture component y, and
- P(x | y) is the distribution associated with mixture component y.

In clustering, Y = {1, ..., m} are the cluster labels. After learning P(y) and P(x | y), compute cluster probabilities for a data item x_i as follows:

    P(Y = y | X = x_i) = P(Y = y) P(X = x_i | Y = y) / Σ_{y'∈Y} P(Y = y') P(X = x_i | Y = y')

Mixtures of multinomials (1)

Y = {1, ..., m}, i.e., m different clusters:

- Y is the coin identity in the coin-tossing game
- Y is the sentence topic in the sentence clustering application

X = U^l, i.e., each observation is a sequence x = (u_1, ..., u_l), where each u_k ∈ U:

- U = {H, T}, and x is one sequence of coin tosses from the same (unknown) coin
- U is the vocabulary, and x is a sentence (a sequence of words)

Assume each u_k is generated i.i.d. given y, so the model has parameters:

- P(Y = y) = π_y, i.e., the probability of picking cluster y
- P(U_k = u | Y = y) = φ_{u|y}, i.e., the probability of generating u in cluster y

Mixtures of multinomials (2)

    P(Y = y) = π_y
    P(U_k = u | Y = y) = φ_{u|y}

    P(X = x, Y = y) = π_y ∏_{k=1}^{l} φ_{u_k|y} = π_y ∏_{u∈U} φ_{u|y}^{c_u(x)}

where x = (u_1, ..., u_l) and c_u(x) is the number of times u appears in x.
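The two product forms above are equivalent: multiplying one factor per position gives the same number as grouping repeated symbols into counts. A minimal Python sketch checking this numerically (the helper names are illustrative, not from the slides):

```python
import math
from collections import Counter

def joint_sequence(x, pi_y, phi_y):
    """pi_y * product over positions k of phi_{u_k|y}."""
    p = pi_y
    for u in x:
        p *= phi_y[u]
    return p

def joint_counts(x, pi_y, phi_y):
    """Equivalent counts form: pi_y * product over u of phi_{u|y} ** c_u(x)."""
    c = Counter(x)  # c[u] = number of times u appears in x
    return pi_y * math.prod(phi_y[u] ** c[u] for u in c)
```

Both functions return P(X = x, Y = y) for a single cluster y; the counts form is what the estimation slides below work with.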

Coin-tossing example

    π_1 = π_2 = 0.5
    φ_{H|1} = 0.1;  φ_{T|1} = 0.9
    φ_{H|2} = 0.8;  φ_{T|2} = 0.2

    P(X = HTHH, Y = 1) = π_1 φ_{H|1}^3 φ_{T|1}^1 = 0.00045
    P(X = HTHH, Y = 2) = π_2 φ_{H|2}^3 φ_{T|2}^1 = 0.0512
    P(X = HTHH) = π_1 φ_{H|1}^3 φ_{T|1}^1 + π_2 φ_{H|2}^3 φ_{T|2}^1 = 0.05165

so:

    P(Y = 1 | X = HTHH) = P(X = HTHH, Y = 1) / P(X = HTHH) = 0.008712
    P(Y = 2 | X = HTHH) = 0.9913
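The example can be reproduced in a few lines of Python (variable names are illustrative):

```python
# Parameters from the coin-tossing example
pi = {1: 0.5, 2: 0.5}
phi = {1: {"H": 0.1, "T": 0.9}, 2: {"H": 0.8, "T": 0.2}}

def joint(x, y):
    """P(X = x, Y = y) = pi_y * product over tosses u in x of phi_{u|y}."""
    p = pi[y]
    for u in x:
        p *= phi[y][u]
    return p

x = "HTHH"
marginal = joint(x, 1) + joint(x, 2)   # P(X = HTHH) = 0.05165
posterior1 = joint(x, 1) / marginal    # P(Y = 1 | X = HTHH)
posterior2 = joint(x, 2) / marginal    # P(Y = 2 | X = HTHH)
```

The two posteriors sum to one, as they must.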

Estimation from visible data

Given visible data, how would we estimate π and φ?

Data D = ((x_1, y_1), ..., (x_n, y_n)), where each x_i = (u_{i,1}, ..., u_{i,l_i}).

Sufficient statistics for estimating the multinomial mixture:

- n_y = Σ_{i=1}^{n} I(y = y_i), i.e., the number of times cluster y is seen
- n_{u,y} = Σ_{i=1}^{n} c_u(x_i) I(y = y_i), i.e., the number of times u is seen in cluster y, where c_u(x) is the number of times u appears in x

Maximum likelihood estimates:

    π_y = n_y / n
    φ_{u|y} = n_{u,y} / Σ_{u'∈U} n_{u',y}
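The visible-data MLE is just counting and normalizing. A minimal sketch (function and variable names are illustrative; it assumes every cluster in `clusters` appears at least once in the data):

```python
from collections import Counter

def mle_visible(data, clusters, vocab):
    """MLE of pi and phi from fully observed (x_i, y_i) pairs.

    data: list of (sentence, label) pairs, each sentence a sequence over vocab.
    """
    n = len(data)
    n_y = Counter()    # n_y[y]      = number of items labelled y
    n_uy = Counter()   # n_uy[u, y]  = count of symbol u across items labelled y
    for x, y in data:
        n_y[y] += 1
        for u in x:
            n_uy[u, y] += 1
    pi = {y: n_y[y] / n for y in clusters}
    phi = {y: {u: n_uy[u, y] / sum(n_uy[v, y] for v in vocab) for u in vocab}
           for y in clusters}
    return pi, phi
```

For example, `mle_visible([("HT", 1), ("HH", 2)], {1, 2}, {"H", "T"})` gives π = {1: 0.5, 2: 0.5} and φ_{H|2} = 1.0.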

Estimation from hidden data (1)

Data D = (x_1, ..., x_n), where each x_i = (u_{i,1}, ..., u_{i,l_i}).

Log likelihood of the hidden data:

    log L_D(π, φ) = Σ_{i=1}^{n} log Σ_{y∈Y} π_y ∏_{u∈U} φ_{u|y}^{c_u(x_i)}

Imposing Lagrange multipliers and setting the derivatives to zero, we can show:

    π_y = E[n_y] / n
    φ_{u|y} = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]

where:

    E[n_y] = Σ_{i=1}^{n} P_{π,φ}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π,φ}(Y = y | X = x_i)

Estimation from hidden data (2)

    π_y = E[n_y] / n
    φ_{u|y} = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]

where:

    E[n_y] = Σ_{i=1}^{n} P_{π,φ}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π,φ}(Y = y | X = x_i)

Unlike in the visible-data case, these are not a closed-form solution for π or φ, since E[n_y] and E[n_{u,y}] themselves involve π and φ. But they do suggest a fixed-point calculation procedure.

EM for multinomial mixtures

Guess initial values π^(0) and φ^(0). For iterations t = 1, 2, 3, ... do:

E-step: calculate the expected values of the sufficient statistics:

    E[n_y] = Σ_{i=1}^{n} P_{π^(t-1),φ^(t-1)}(Y = y | X = x_i)
    E[n_{u,y}] = Σ_{i=1}^{n} c_u(x_i) P_{π^(t-1),φ^(t-1)}(Y = y | X = x_i)

M-step: update the model based on the expected sufficient statistics:

    π_y^(t) = E[n_y] / n
    φ_{u|y}^(t) = E[n_{u,y}] / Σ_{u'∈U} E[n_{u',y}]
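The E-step/M-step loop above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: names are made up, there is no smoothing, and the initialization follows the homework hints (uniform π, φ perturbed by small noise to break symmetry):

```python
import math
import random
from collections import Counter

def em_multinomial_mixture(data, m, vocab, iters=50, seed=0):
    """EM for a mixture of multinomials over sequences from vocab.

    data: list of sentences (sequences over vocab); m: number of clusters.
    Returns pi, phi, and the per-iteration log likelihoods.
    """
    rng = random.Random(seed)
    clusters = range(m)
    pi = {y: 1.0 / m for y in clusters}                    # uniform pi^(0)
    phi = {y: {u: 1.0 / len(vocab) + 1e-4 * rng.random()   # small noise
               for u in vocab} for y in clusters}          # breaks symmetry
    for y in clusters:                                     # renormalize phi^(0)
        z = sum(phi[y].values())
        phi[y] = {u: p / z for u, p in phi[y].items()}

    log_likelihoods = []
    for _ in range(iters):
        # E-step: expected sufficient statistics E[n_y] and E[n_{u,y}]
        E_ny, E_nuy, ll = Counter(), Counter(), 0.0
        for x in data:
            counts = Counter(x)                            # c_u(x)
            joint = {y: pi[y] * math.prod(phi[y][u] ** c
                                          for u, c in counts.items())
                     for y in clusters}
            z = sum(joint.values())                        # P(X = x)
            ll += math.log(z)
            for y in clusters:
                post = joint[y] / z                        # P(Y = y | X = x)
                E_ny[y] += post
                for u, c in counts.items():
                    E_nuy[u, y] += c * post
        log_likelihoods.append(ll)
        # M-step: re-estimate pi and phi from the expected counts
        n = len(data)
        pi = {y: E_ny[y] / n for y in clusters}
        phi = {y: {u: E_nuy[u, y] / sum(E_nuy[v, y] for v in vocab)
                   for u in vocab} for y in clusters}
    return pi, phi, log_likelihoods
```

Run on coin-toss sequences, the per-iteration log likelihoods should be non-decreasing, which is the sanity check the homework hints recommend.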

Summary of the model

    P_{π,φ}(X = x, Y = y) = π_y ∏_{u∈U} φ_{u|y}^{c_u(x)}

    P(Y = y | X = x) = P(Y = y, X = x) / Σ_{y'∈Y} P(Y = y', X = x)

where c_u(x) is the number of times u appears in x.


Homework hints

- The fact that different sentences have different lengths doesn't affect the calculation.
- c_u(x_i) is the number of times word u appears in sentence x_i.
- You can initialize π^(0) with a uniform distribution, but you'll need to initialize φ^(0) to break symmetry, e.g., by adding a random number of about 10^-4 to each entry.
- You should compute the log likelihood at each iteration (it's easy to do this as a by-product of the expectation calculations).
- There is a theorem that says the log likelihood never decreases on each EM step. If your log likelihood decreases, then you have a bug!
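The last hint is easy to automate. A small sketch (the helper name is made up) that raises if the log likelihood ever drops between iterations:

```python
def check_monotone(log_likelihoods, tol=1e-9):
    """Sanity check from the hints: EM's log likelihood never decreases.

    tol absorbs tiny floating-point wobble; anything larger is a real bug.
    """
    drops = [(t, a, b)
             for t, (a, b) in enumerate(zip(log_likelihoods,
                                            log_likelihoods[1:]), start=1)
             if b < a - tol]
    if drops:
        raise AssertionError(f"log likelihood decreased at iterations {drops}")
```

Call it on the list of per-iteration log likelihoods after your EM run finishes.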