Probabilistic clustering and the EM algorithm


1 Probabilistic clustering and the EM algorithm
Guillaume Obozinski, Ecole des Ponts - ParisTech, Master MVA

2 Outline
1. Clustering
2. The EM algorithm for the Gaussian mixture model

3 Clustering

4 Supervised, unsupervised and semi-supervised learning
Supervised learning: training set composed of pairs $\{(x_1, y_1), \dots, (x_n, y_n)\}$. Learn to classify new points into the classes.
Unsupervised learning: training set composed of points $\{x_1, \dots, x_n\}$. Partition the data into a number of classes; possibly produce a decision rule for new points.
Transductive learning: data available at train time composed of train data $\{(x_1, y_1), \dots, (x_n, y_n)\}$ + test data $\{x_{n+1}, \dots, x_{n+m}\}$. Classify all the test data.
Semi-supervised learning: data available at train time composed of labelled data $\{(x_1, y_1), \dots, (x_n, y_n)\}$ + unlabelled data $\{x_{n+1}, \dots, x_{n+m}\}$. Produce a classification rule for future points.

5 Clustering
Clustering is the word usually used for unsupervised classification. Clustering techniques can be useful to solve semi-supervised classification problems. Clustering is not a well-specified problem: classes might be impossible to infer from the distribution of $X$ alone. Several goals are possible:
- Find the modes of the distribution
- Find a set of denser connected regions supporting most of the density
- Find a set of denser convex regions supporting most of the density
- Find a set of denser ellipsoidal regions supporting most of the density
- Find a set of denser round regions supporting most of the density

6 K-means
Key assumption: the data are composed of $K$ roundish clusters of similar sizes with centroids $(\mu_1, \dots, \mu_K)$.
The problem can be formulated as
$$\min_{\mu_1, \dots, \mu_K} \; \frac{1}{n} \sum_{i=1}^n \min_k \|x_i - \mu_k\|^2,$$
a difficult (NP-hard) nonconvex problem.
K-means algorithm (a code sketch follows below):
1. Draw centroids at random.
2. Assign each point to the closest centroid: $C_k \leftarrow \{\, i : \|x_i - \mu_k\|^2 = \min_j \|x_i - \mu_j\|^2 \,\}$.
3. Recompute each centroid as the center of mass of its cluster: $\mu_k \leftarrow \frac{1}{|C_k|} \sum_{i \in C_k} x_i$.
4. Go to 2.
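The steps above translate directly into a few lines of NumPy; here is a minimal sketch (the `kmeans` helper name and its signature are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X is (n, d); returns centroids (K, d) and labels (n,)."""
    rng = np.random.default_rng(seed)
    # Step 1: draw initial centroids at random among the data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the closest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (n, K)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the center of mass of its cluster.
        mu_new = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(mu_new, mu):  # converged in a finite number of steps
            break
        mu = mu_new
    return mu, labels
```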

7 K-means properties
Three remarks:
- K-means is a greedy algorithm.
- It can be shown that K-means converges in a finite number of steps. However, the algorithm typically gets stuck in local minima, and in practice it is necessary to try several restarts with random initializations to have a chance of obtaining a better solution.
- It will fail if the clusters are not round.

8 K-means++ (Arthur and Vassilvitskii, 2007)
Algorithm (sketched in code below):
- Choose the first center $\mu_1$ uniformly among the data points.
- For $k = 2, \dots, K$: let $D_i^2 = \min_{j < k} \|x_i - \mu_j\|_2^2$, and choose the next center among $\{x_1, \dots, x_n\}$ with probability proportional to $D_i^2$.
The resulting solution is $O(\log K)$-competitive with the optimum in expectation.
See Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms.
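A sketch of the seeding step in NumPy, under the same conventions as the K-means sketch above (the `kmeans_pp_init` name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: returns K initial centroids chosen among the rows of X."""
    rng = np.random.default_rng(seed)
    # First center: uniform among the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # D_i^2: squared distance of each point to its closest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Next center drawn with probability proportional to D_i^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```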

9 The Gaussian mixture model and the EM algorithm

10 Jensen's inequality
Consider a function $f : \mathbb{R}^d \to \mathbb{R}$.
1. If $f$ is convex and $X$ is a random variable, then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$.
2. If $f$ is strictly convex, we have equality in the previous inequality if and only if $X$ is constant almost surely.

11 Entropy
Let $X$ be a r.v. with values in the finite set $\mathcal{X}$ and $p(x) = P(X = x)$.
Quantity of information of the observation $x$: $I(x) := \log \frac{1}{p(x)}$.
Definition of entropy: $H(X) := \mathbb{E}[I(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.
Remarks:
- Convention: $0 \log 0 = 0$.
- $H$ can be defined either with the natural log or the log in base 2 (i.e. $\log_2$); $\log_2$ is better for coding interpretations.
- In this course we will use the natural logarithm.
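As a quick numerical illustration of the definition (the `entropy` helper is ours, using the $0 \log 0 = 0$ convention and natural logs, as in the course):

```python
import numpy as np

def entropy(p):
    """Entropy in nats, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# A fair coin has entropy log 2 ~ 0.693 nats; a deterministic one has 0.
print(entropy([0.5, 0.5]))  # 0.6931...
print(entropy([1.0, 0.0]))  # 0.0
```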

12 Kullback-Leibler divergence
Definition: let $p$ and $q$ be two distributions on a finite set $\mathcal{X}$. The Kullback-Leibler divergence is defined by
$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in \mathcal{X}} \Big( \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} \Big)\, q(x) = \mathbb{E}_{X \sim p}\Big[ \log \frac{p(X)}{q(X)} \Big] = \mathbb{E}_{X \sim q}\Big[ \frac{p(X)}{q(X)} \log \frac{p(X)}{q(X)} \Big].$$
The KL divergence is not a distance: it is not symmetric. If there is an $x \in \mathcal{X}$ with $q(x) = 0$ and $p(x) \neq 0$, then $D(p \,\|\, q) = +\infty$.

13 Kullback-Leibler divergence
Proposition: $D(p \,\|\, q) \ge 0$, and equality holds if and only if $p = q$.
Proof. W.l.o.g. assume $q(x) > 0$ everywhere.
1. $y \mapsto y \log y$ is convex, so by Jensen's inequality
$$D(p \,\|\, q) = \mathbb{E}_q\Big[ \frac{p(X)}{q(X)} \log \frac{p(X)}{q(X)} \Big] \;\ge\; \mathbb{E}_q\Big[ \frac{p(X)}{q(X)} \Big] \log \mathbb{E}_q\Big[ \frac{p(X)}{q(X)} \Big] = 0,$$
since $\mathbb{E}_q\big[ \frac{p(X)}{q(X)} \big] = \sum_{x \in \mathcal{X}} \frac{p(x)}{q(x)}\, q(x) = \sum_{x \in \mathcal{X}} p(x) = 1$.
2. $D(p \,\|\, q) = 0$ iff there is equality in Jensen's inequality, i.e. $p(x) = c\, q(x)$ $q$-a.s.; summing this last equality over $x$ implies $c = 1$, which in turn implies $p = q$.
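A small numeric check of these properties, covering nonnegativity, asymmetry, and the $+\infty$ case (the `kl` helper is illustrative):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats for finite distributions (0 log 0 = 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return np.inf  # some x has q(x) = 0 while p(x) > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p, q = [0.8, 0.2], [0.5, 0.5]
print(kl(p, q), kl(q, p))  # both >= 0, and D(p||q) != D(q||p)
print(kl(p, p))            # 0 iff p = q
```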

14 Differential entropy and KL
Let $X$ be a r.v. with distribution $P$ and density $p$ w.r.t. a measure $\mu$.
Differential entropy: $H_{\mathrm{diff}}(p) = -\int_{\mathcal{X}} p(x) \log p(x)\, d\mu(x)$.
Differential Kullback-Leibler divergence:
$$D_{\mathrm{diff}}(p \,\|\, q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, d\mu(x) = \mathbb{E}_{X \sim p}\Big[ \log \frac{p(X)}{q(X)} \Big].$$
Remarks: unlike the discrete entropy, $H_{\mathrm{diff}}(p)$ is not necessarily nonnegative; it depends on the reference measure $\mu$, and so does not capture intrinsic properties of $P$. However, $D_{\mathrm{diff}}(p \,\|\, q)$ does not depend on $\mu$.

15 Gaussian mixture model
$K$ components; $z$ is the component indicator, $z = (z_1, \dots, z_K) \in \{0, 1\}^K$ with $z \sim \mathcal{M}(1, (\pi_1, \dots, \pi_K))$, so that
$$p(z) = \prod_{k=1}^K \pi_k^{z_k}, \qquad p(x \mid z; (\mu_k, \Sigma_k)_k) = \prod_{k=1}^K \mathcal{N}(x; \mu_k, \Sigma_k)^{z_k}, \qquad p(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k).$$
Estimation:
$$\operatorname*{argmax}_{\pi, \mu, \Sigma} \; \log \Big[ \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k) \Big].$$
(Figure (a): graphical model with plate over $n = 1, \dots, N$: $\pi \to z_n \to x_n \leftarrow (\mu, \Sigma)$.)
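A minimal sketch of ancestral sampling from this model and of evaluating the mixture density $p(x) = \sum_k \pi_k \mathcal{N}(x; \mu_k, \Sigma_k)$, assuming SciPy is available; all parameter values are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

# Ancestral sampling: draw the component indicator z, then x | z.
z = rng.choice(2, size=500, p=pi)
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])

def gmm_density(x):
    """Mixture density p(x) = sum_k pi_k N(x; mu_k, Sigma_k)."""
    return sum(pi[k] * multivariate_normal.pdf(x, mu[k], Sigma[k]) for k in range(2))

print(gmm_density(np.array([1.5, 1.5])))
```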

16 Applying maximum likelihood to the Gaussian mixture
Let $\mathcal{Z} = \{\, z \in \{0, 1\}^K : \sum_{k=1}^K z_k = 1 \,\}$. Then
$$p(x) = \sum_{z \in \mathcal{Z}} p(x, z) = \sum_{z \in \mathcal{Z}} \prod_{k=1}^K \big[ \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k) \big]^{z_k} = \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k).$$
Issue: the marginal log-likelihood $\ell(\theta) = \sum_i \log p(x^{(i)})$ with $\theta = (\pi, (\mu_k, \Sigma_k)_{1 \le k \le K})$ is now complicated; there is no hope of finding a simple solution to the maximum likelihood problem. By contrast, the complete log-likelihood $\bar{\ell}(\theta)$ has a rather simple form:
$$\bar{\ell}(\theta) = \sum_{i=1}^M \log p(x^{(i)}, z^{(i)}) = \sum_{i,k} z_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) + \sum_{i,k} z_k^{(i)} \log \pi_k.$$

17 Applying ML to the Gaussian mixture
$$\bar{\ell}(\theta) = \sum_{i=1}^M \log p(x^{(i)}, z^{(i)}) = \sum_{i,k} z_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) + \sum_{i,k} z_k^{(i)} \log \pi_k.$$
If we knew the $z^{(i)}$, we could maximize $\bar{\ell}(\theta)$. If we knew $\theta = (\pi, (\mu_k, \Sigma_k)_{1 \le k \le K})$, we could find the best $z^{(i)}$, since we could compute the true posterior on $z^{(i)}$ given $x^{(i)}$:
$$p(z_k^{(i)} = 1 \mid x^{(i)}; \theta) = \frac{\pi_k\, \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}.$$
This seems a chicken-and-egg problem... In addition, we want to solve
$$\max_\theta \; \sum_i \log \Big( \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}) \Big) \quad \text{and not} \quad \max_{\theta,\, z^{(1)}, \dots, z^{(M)}} \; \sum_i \log p(x^{(i)}, z^{(i)}).$$
Can we still use the intuitions above to construct an algorithm maximizing the marginal likelihood?

18 Principle of the Expectation-Maximization algorithm
For any distribution $q$ over $z$, Jensen's inequality gives
$$\log p(x; \theta) = \log \sum_z p(x, z; \theta) = \log \sum_z q(z)\, \frac{p(x, z; \theta)}{q(z)} \;\ge\; \sum_z q(z) \log \frac{p(x, z; \theta)}{q(z)} = \mathbb{E}_q[\log p(x, z; \theta)] + H(q) =: \mathcal{L}(q, \theta).$$
This shows that $\mathcal{L}(q, \theta) \le \log p(x; \theta)$. Moreover, $\theta \mapsto \mathcal{L}(q, \theta)$ is a concave function. Finally, it is possible to show that
$$\mathcal{L}(q, \theta) = \log p(x; \theta) - \mathrm{KL}\big(q \,\|\, p(\cdot \mid x; \theta)\big),$$
so that if we set $q(z) = p(z \mid x; \theta^{(t)})$, then $\mathcal{L}(q, \theta^{(t)}) = \log p(x; \theta^{(t)})$.
(Figure: the lower bound $\mathcal{L}(q, \theta)$ touching $\ln p(x \mid \theta)$ at $\theta^{\mathrm{old}}$ and maximized at $\theta^{\mathrm{new}}$.)

19 A graphical idea of the EM algorithm
(Figure: the curve $\ln p(x \mid \theta)$ and the lower bound $\mathcal{L}(q, \theta)$, tangent at $\theta^{\mathrm{old}}$; maximizing the bound yields $\theta^{\mathrm{new}}$.)

20 Expectation-Maximization algorithm
Initialize $\theta = \theta_0$.
While not converged:
- Expectation step: 1. set $q(z) = p(z \mid x; \theta^{(t-1)})$; 2. form $\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(x, z; \theta)] + H(q)$.
- Maximization step: $\theta^{(t)} = \operatorname*{argmax}_\theta\, \mathbb{E}_q[\log p(x, z; \theta)]$.
(In the figure, $\theta^{\mathrm{old}} = \theta^{(t-1)}$ and $\theta^{\mathrm{new}} = \theta^{(t)}$.)

21 Expected complete log-likelihood
With the notation $q_{ik}^{(t)} = P_{q_i^{(t)}}(z_k^{(i)} = 1) = \mathbb{E}_{q_i^{(t)}}\big[z_k^{(i)}\big]$, we have
$$\mathbb{E}_{q^{(t)}}\big[\bar{\ell}(\theta)\big] = \mathbb{E}_{q^{(t)}}\Big[ \sum_{i=1}^M \log p(x^{(i)}, z^{(i)}; \theta) \Big] = \sum_{i,k} \mathbb{E}_{q_i^{(t)}}\big[z_k^{(i)}\big] \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) + \sum_{i,k} \mathbb{E}_{q_i^{(t)}}\big[z_k^{(i)}\big] \log \pi_k = \sum_{i,k} q_{ik}^{(t)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) + \sum_{i,k} q_{ik}^{(t)} \log \pi_k.$$

22 Expectation step for the Gaussian mixture
We computed previously $q_i^{(t)}(z^{(i)})$, which is a multinomial distribution defined by $q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t-1)})$. Abusing notation, we will denote by $(q_{i1}^{(t)}, \dots, q_{iK}^{(t)})$ the corresponding vector of probabilities, defined by
$$q_{ik}^{(t)} = p(z_k^{(i)} = 1 \mid x^{(i)}; \theta^{(t-1)}) = \frac{\pi_k^{(t-1)}\, \mathcal{N}(x^{(i)}; \mu_k^{(t-1)}, \Sigma_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)}\, \mathcal{N}(x^{(i)}; \mu_j^{(t-1)}, \Sigma_j^{(t-1)})}.$$
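In code, the E-step is a normalized matrix of weighted component densities; a minimal sketch assuming SciPy's `multivariate_normal` (the `e_step` name and array conventions are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, Sigma_mu):
    pass  # placeholder removed below

def e_step(X, pi, mu, Sigma):
    """Responsibilities q[i, k] = p(z_k = 1 | x_i; theta) for a GMM.

    X: (n, d) data; pi: (K,) weights; mu: K mean vectors; Sigma: K covariances.
    """
    n, K = len(X), len(pi)
    q = np.zeros((n, K))
    for k in range(K):
        # Numerator: pi_k N(x_i; mu_k, Sigma_k) for all i at once.
        q[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
    # Denominator: sum over components j, per data point.
    q /= q.sum(axis=1, keepdims=True)
    return q
```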

23 Maximization step for the Gaussian mixture
$$\big( \pi^{(t)}, (\mu_k^{(t)}, \Sigma_k^{(t)})_{1 \le k \le K} \big) = \operatorname*{argmax}_\theta\; \mathbb{E}_{q^{(t)}}\big[\bar{\ell}(\theta)\big].$$
This yields the updates
$$\mu_k^{(t)} = \frac{\sum_i q_{ik}^{(t)}\, x^{(i)}}{\sum_i q_{ik}^{(t)}}, \qquad \Sigma_k^{(t)} = \frac{\sum_i q_{ik}^{(t)}\, \big(x^{(i)} - \mu_k^{(t)}\big)\big(x^{(i)} - \mu_k^{(t)}\big)^\top}{\sum_i q_{ik}^{(t)}}, \qquad \pi_k^{(t)} = \frac{\sum_i q_{ik}^{(t)}}{\sum_{i,k'} q_{ik'}^{(t)}}.$$
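A matching sketch of these closed-form updates (the `m_step` name is ours; the small ridge added to each covariance is an extra numerical safeguard, not part of the slides' updates):

```python
import numpy as np

def m_step(X, q):
    """Closed-form M-step updates for the GMM given responsibilities q (n, K)."""
    n, d = X.shape
    Nk = q.sum(axis=0)                      # effective counts sum_i q_ik, shape (K,)
    mu = (q.T @ X) / Nk[:, None]            # weighted means, shape (K, d)
    Sigma = []
    for k in range(q.shape[1]):
        diff = X - mu[k]                    # (n, d)
        # Weighted covariance; the ridge keeps it invertible.
        Sigma.append((q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d))
    pi = Nk / Nk.sum()                      # sum_i q_ik / sum_{i,k'} q_ik'
    return pi, mu, Sigma
```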

24 Final EM algorithm for the Gaussian mixture model
Initialize $\theta = \theta_0$.
While not converged:
- Expectation step: $q_{ik}^{(t)} = \dfrac{\pi_k^{(t-1)}\, \mathcal{N}(x^{(i)}; \mu_k^{(t-1)}, \Sigma_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)}\, \mathcal{N}(x^{(i)}; \mu_j^{(t-1)}, \Sigma_j^{(t-1)})}$.
- Maximization step: $\mu_k^{(t)} = \dfrac{\sum_i q_{ik}^{(t)}\, x^{(i)}}{\sum_i q_{ik}^{(t)}}$, $\quad \Sigma_k^{(t)} = \dfrac{\sum_i q_{ik}^{(t)}\, (x^{(i)} - \mu_k^{(t)})(x^{(i)} - \mu_k^{(t)})^\top}{\sum_i q_{ik}^{(t)}}$, $\quad \pi_k^{(t)} = \dfrac{\sum_i q_{ik}^{(t)}}{\sum_{i,k'} q_{ik'}^{(t)}}$.
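Chaining the `e_step` and `m_step` sketches above gives the full loop; a minimal version with a log-likelihood-based stopping rule (the initialization scheme and tolerance are our choices, and K-means++ seeding would be a reasonable alternative initializer):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for the Gaussian mixture, using the e_step/m_step sketches above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Simple initialization: uniform weights, random data points as means.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = [np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)]
    prev_ll = -np.inf
    for _ in range(n_iter):
        q = e_step(X, pi, mu, Sigma)        # E-step: responsibilities
        pi, mu, Sigma = m_step(X, q)        # M-step: closed-form updates
        # Monitor the marginal log-likelihood; EM never decreases it.
        ll = np.log(sum(pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                        for k in range(K))).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, q
```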

25 EM algorithm for the Gaussian mixture model: illustration
(Figure: successive EM iterations on data, alternating between the model $p(x \mid z)$ and the posterior $p(z \mid x)$.)
