Maximum-Likelihood and Bayesian Parameter Estimation


1 Maximum-Likelihood and Bayesian Parameter Estimation Expectation Maximization (EM)

2 Estimating a Missing Feature Value Estimating a missing variable when the parameters are known: in the absence of feature x, the most likely class is the ω that maximizes the posterior given the observed features; we then choose the value of x that maximizes the likelihood under that class. Choosing the mean of the missing feature (over all classes) would give worse performance. This is the case of estimating the hidden variable given the parameters. In EM the unknowns are both: the parameters and the hidden variables (the missing variables).
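As an illustration of this slide's point, a minimal sketch (all class parameters, names, and values below are hypothetical, not from the lecture): with known per-class Gaussians and one observed feature, the likelihood-maximizing imputation under the most likely class differs from the all-class mean.

import numpy as np

def gauss(x, mu, sigma):
    # Univariate normal density.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical class-conditional Gaussians over features (x1, x2),
# assumed independent within each class.
means = {"w1": (0.0, 2.0), "w2": (4.0, 5.0)}
priors = {"w1": 0.5, "w2": 0.5}
sigma = 1.0

x2_obs = 2.0  # x1 is missing

# In the absence of x1, the most likely class given the observed feature:
post = {w: priors[w] * gauss(x2_obs, means[w][1], sigma) for w in means}
best = max(post, key=post.get)

x1_ml = means[best][0]                             # likelihood-maximizing value
x1_avg = np.mean([m[0] for m in means.values()])   # all-class mean: a worse choice
print(best, x1_ml, x1_avg)                         # w1, 0.0, 2.0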

3 EM Task Estimate unknown parameters θ given measurement data U, when some variables J are missing and need to be integrated out. We want to maximize the posterior probability of θ given the data U, marginalizing over J: $\hat{\theta} = \arg\max_{\theta} \sum_{J \in \mathcal{J}} P(\theta, J \mid U)$ where θ is the parameter to be estimated, J the missing variables, and U the data.

4 EM Principle Estimate unknown parameters θ given measurement data U, but not the nuisance variables J, which need to be integrated out: $\hat{\theta} = \arg\max_{\theta} \sum_{J \in \mathcal{J}} P(\theta, J \mid U)$ Alternate between estimating the unknowns θ and the hidden variables J. At each iteration, instead of finding the single best $J \in \mathcal{J}$ given the current estimate of θ, EM computes a distribution over the space $\mathcal{J}$.

5 k-means Algorithm as EM Estimate the means of k classes when the class labels are unknown. Parameters: the means to be estimated. Hidden variables: the class labels.
begin initialize m_1, m_2, ..., m_k
   do (E-step) classify the n samples according to the nearest m_i
      (M-step) recompute each m_i
   until no change in m_i
   return m_1, m_2, ..., m_k
end
An iterative algorithm derivable from EM; a runnable sketch follows.
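A minimal runnable sketch of the pseudocode above (NumPy-based; the data and the value of k are hypothetical), with the hard assignment as the E-step and the mean recomputation as the M-step:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]   # initialize m_1..m_k
    for _ in range(n_iter):
        # E-step: assign each sample to the nearest mean.
        labels = np.argmin(((X[:, None, :] - m[None, :, :]) ** 2).sum(-1), axis=1)
        # M-step: recompute each mean from its assigned samples.
        new_m = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else m[j] for j in range(k)])
        if np.allclose(new_m, m):      # no change in the means: converged
            break
        m = new_m
    return m, labels

# Hypothetical data: two clusters in the plane.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
means, labels = kmeans(X, k=2)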

6 EM Importance The EM algorithm is widely used for learning in the presence of unobserved variables, e.g., missing features or class labels:
- used even for variables whose values are never directly observed, provided the general form of the pdf governing them is known
- has been used to train Radial Basis Function networks and Bayesian belief networks
- the basis for many unsupervised clustering algorithms
- the basis for the widely used Baum-Welch forward-backward algorithm for HMMs

7 EM Algorithm Learning in the presence of unobserved variables: only a subset of the relevant instance features might be observable. This includes the case of unsupervised learning or clustering: how many classes are there?

8 EM Principle The EM algorithm iteratively estimates the likelihood given the data that is present.

9 Likelihood Formulation Sample points are drawn from a single distribution: $W = \{x_1, \dots, x_n\}$. Any sample has good and missing (bad) features: $x_k = \{x_{kg}, x_{kb}\}$. The features are accordingly divided into two sets, $W = W_g \cup W_b$.

10 Central Equation in EM (Likelihood Formulation) $Q(\theta; \theta^i) = E_{D_b}\!\left[\ln p(D_g, D_b; \theta) \mid D_g; \theta^i\right]$ The expected value is over the missing features $D_b$; $\theta^i$ is the current best estimate for the full distribution; $\theta$ is the candidate vector for an improved estimate. The algorithm will select the best candidate $\theta$ and call it $\theta^{i+1}$.

11 Algorithm (EM)
begin initialize θ^0, T, i ← 0
   do i ← i + 1
      E step: compute Q(θ; θ^i)
      M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
   until Q(θ^(i+1); θ^i) - Q(θ^i; θ^(i-1)) < T
end
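A generic sketch of this loop in Python (with a slightly simplified stopping test that compares successive values of the same Q; the e_step and m_step callables and the threshold T are placeholders to be supplied by the model at hand):

def em(theta0, e_step, m_step, T=1e-6, max_iter=200):
    # Generic EM loop. e_step(theta) returns Q(.; theta) as a callable of a
    # candidate theta; m_step(Q) returns the candidate maximizing that Q.
    theta = theta0
    for _ in range(max_iter):
        Q = e_step(theta)                 # E step: compute Q(theta'; theta_i)
        new_theta = m_step(Q)             # M step: theta_{i+1} = argmax Q
        if Q(new_theta) - Q(theta) < T:   # improvement fell below threshold T
            return new_theta
        theta = new_theta
    return theta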

12 EM for a 2D Normal Model Suppose the data consist of 4 points in 2 dimensions, one of which is missing a feature: $D = \{x_1, x_2, x_3, x_4\} = \left\{ \binom{0}{2}, \binom{1}{0}, \binom{2}{2}, \binom{*}{4} \right\}$ where * represents the unknown value of the first feature of point $x_4$. Thus our bad data $D_b$ consist of the single feature $x_{41}$, and the good data $D_g$ consist of the rest.

13 EM for a 2D Normal Model Assuming that the model is a Gaussian with diagonal covariance and arbitrary mean, it can be described by the parameter vector $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)^T$.

14 EM for a 2D Normal Model We take our initial guess to be a Gaussian centered on the origin with $\Sigma = I$, that is, $\theta^0 = (0, 0, 1, 1)^T$.

15 EM for a 2D Normal Model To find an improved estimate, we must calculate $Q(\theta; \theta^0) = E_{x_{41}}\!\left[\ln p(x_g, x_b; \theta) \mid \theta^0, D_g\right] = \sum_{k=1}^{3} \ln p(x_k \mid \theta) + \int \ln p(x_4 \mid \theta)\, p(x_{41} \mid \theta^0, x_{42}=4)\, dx_{41}$

16 EM for a 2D Normal Model Simplifying the integral completes the E step and gives the next estimate, $\theta^1 = (0.75,\ 2.0,\ 0.938,\ 2.0)^T$. Iterating, the final solution is $\theta = (1.0,\ 2.0,\ 0.667,\ 2.0)^T$.
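This example is small enough to run directly. A sketch of the E and M updates for the missing first feature, using the expectations $E[x_{41}] = \mu_1$ and $E[x_{41}^2] = \sigma_1^2 + \mu_1^2$ under the current estimate (the loop count is arbitrary):

import numpy as np

# Good data: x1=(0,2), x2=(1,0), x3=(2,2); x4=(*,4) has feature 1 missing.
good_f1 = np.array([0.0, 1.0, 2.0])      # known first-feature values
f2 = np.array([2.0, 0.0, 2.0, 4.0])      # second feature, fully observed

mu1, var1 = 0.0, 1.0                     # initial guess theta0 = (0, 0, 1, 1)
mu2, var2 = f2.mean(), f2.var()          # feature 2 has no missing data

for _ in range(20):
    # E step: under theta_i, the missing x41 has mean mu1 and variance var1.
    e_x41, e_x41_sq = mu1, var1 + mu1 ** 2          # E[x41], E[x41^2]
    # M step: re-estimate mu1 and var1, including the expected missing value.
    new_mu1 = (good_f1.sum() + e_x41) / 4.0
    new_var1 = (((good_f1 - new_mu1) ** 2).sum()
                + e_x41_sq - 2 * new_mu1 * e_x41 + new_mu1 ** 2) / 4.0
    mu1, var1 = new_mu1, new_var1

print(mu1, mu2, var1, var2)              # ~1.0, 2.0, 0.667, 2.0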

17 EM for a 2D Normal Model [Figure] The four data points, one missing the value of $x_1$, are shown in red. The initial estimate is a circularly symmetric Gaussian centered on the origin (gray). (A better initial estimate could have been derived from the three known points.) Each iteration leads to an improved estimate, labeled by iteration number i; here, after three iterations, the algorithm has converged.

18 EM to Estimate the Means of k Gaussians The data D are drawn from a mixture of k distinct normal distributions. A two-step process generates each sample: (1) one of the k distributions is selected at random; (2) a single random instance $x_i$ is generated according to the selected distribution. A sketch of this generator follows.
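A sketch of this two-step generative process (the mixture parameters below are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
true_means = [0.0, 5.0]          # hypothetical means of k = 2 Gaussians
sigma = 1.0                      # same known variance for every component

def generate(n):
    # Step 1: pick one of the k distributions uniformly at random.
    z = rng.integers(len(true_means), size=n)
    # Step 2: draw the instance from the selected Gaussian.
    return rng.normal(np.take(true_means, z), sigma), z

X, z_true = generate(1000)       # z_true is hidden from the learner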

19 Instances Generated by a Mixture of Two Normal Distributions [Figure: histogram of instances drawn from a mixture of two normal distributions]

20 Example of EM to Estimate the Means of k Gaussians Each instance is generated by (1) choosing one of the k Gaussians with uniform probability and (2) generating an instance at random according to that Gaussian. Each of the k normal distributions has the same known variance. Learning task: output a hypothesis $h = \langle \mu_1, \dots, \mu_k \rangle$ that describes the means of the k distributions.

21 Estimating the Means of k Gaussians We would like to find a maximum likelihood hypothesis for these means: a hypothesis h that maximizes $p(D \mid h)$.

22 Maximum Likelihood Estimate of the Mean of a Single Gaussian Given observed data instances $x_1, x_2, \dots, x_m$ drawn from a single normally distributed source, the problem is to find the mean of that distribution.

23 Maximum Likelihood Estimate of the Mean of a Single Gaussian The maximum likelihood estimate of the mean of a normal distribution can be shown to be the one that minimizes the sum of squared errors: $\mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{m} (x_i - \mu)^2$ The right-hand side is minimized at $\mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} x_i$, which is the sample mean.
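The one-line derivation behind this claim, setting the derivative of the sum of squared errors to zero:

\[
\frac{d}{d\mu}\sum_{i=1}^{m}(x_i-\mu)^2 \;=\; -2\sum_{i=1}^{m}(x_i-\mu) \;=\; 0
\quad\Longrightarrow\quad
\mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} x_i .
\]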

24 Mixture of Two Normal Distributions We cannot observe which instances were generated by which distribution. The full description of an instance is $\langle x_i, z_{i1}, z_{i2} \rangle$, where $x_i$ is the observed value of the i-th instance and $z_{i1}$, $z_{i2}$ indicate which of the two normal distributions was used to generate $x_i$: $z_{ij} = 1$ if the j-th distribution was used to generate $x_i$, and 0 otherwise. $z_{i1}$ and $z_{i2}$ are hidden variables, which have probability distributions associated with them.

25 Hidden Variables Specify the Distribution If the first distribution generated $x_i$, then $z_{i1} = 1$ and $z_{i2} = 0$, giving the full description $\langle x_i, 1, 0 \rangle$; if the second, then $z_{i1} = 0$ and $z_{i2} = 1$.

26 Two-Means Problem The full description of an instance is $\langle x_i, z_{i1}, z_{i2} \rangle$: $x_i$ is the observed variable; $z_{i1}$ and $z_{i2}$ are hidden variables. If $z_{i1}$ and $z_{i2}$ were observed, we could use the maximum likelihood estimates for the means: $\mu_j = \frac{\sum_{i=1}^{m} z_{ij}\, x_i}{\sum_{i=1}^{m} z_{ij}}$ Since we do not know $z_{i1}$ and $z_{i2}$, we will use EM instead.

27 EM Algorithm Applied to the k-Means Problem Search for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden binary variables $z_{ij}$ given the current hypothesis $\langle \mu_1, \dots, \mu_k \rangle$, then recalculating the maximum likelihood hypothesis using these expected values for the hidden variables.

28 EM Algorithm for Two Means The hidden variables $z_{i1}, z_{i2}$ are now regarded as probabilities rather than observed 0/1 values. 1. Hypothesize the means, then determine the expected values of the hidden variables for all samples. 2. Use these hidden-variable values to recalculate the means.

29 EM Applied to the Two-Means Problem Initialize the hypothesis to $h = \langle \mu_1, \mu_2 \rangle$. Estimate the expected values of the hidden variables $z_{ij}$ given the current hypothesis. Recalculate the maximum likelihood hypothesis using these expected values for the hidden variables. Re-estimate h repeatedly until the procedure converges to a stationary value of h.

30 EM Algorithm for Two Means Step 1: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds. Step 2: Calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$, assuming each hidden variable $z_{ij}$ takes on its expected value $E[z_{ij}]$ calculated in Step 1. Then replace the hypothesis $h = \langle \mu_1, \mu_2 \rangle$ by the new hypothesis $h' = \langle \mu_1', \mu_2' \rangle$ and iterate.

31 EM First Step Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds: $E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$ This is the probability that instance $x_i$ was generated by the j-th Gaussian.

32 EM Second Step Calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$: $\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$ Observation: this is similar to the earlier sample-mean calculation for a single Gaussian, $\mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} x_i$, but with each instance weighted by the probability that it came from the j-th Gaussian. A runnable sketch of both steps follows.
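Putting the two steps together, a minimal runnable sketch of EM for the two-means problem (known shared variance; data generated as in the earlier mixture sketch; the initialization below is arbitrary):

import numpy as np

def em_two_means(X, sigma=1.0, n_iter=50):
    mu = np.array([X.min(), X.max()])             # crude initial hypothesis
    for _ in range(n_iter):
        # Step 1 (E): E[z_ij], the probability that x_i came from Gaussian j.
        w = np.exp(-((X[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)
        # Step 2 (M): weighted sample means.
        mu = (w * X[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu

X = np.concatenate([np.random.default_rng(0).normal(0, 1, 500),
                    np.random.default_rng(1).normal(5, 1, 500)])
print(em_two_means(X))                            # approx. [0, 5]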

33 Clustering using EM

34 Feature Extraction [Figure: an image $I_i$ is passed through feature extraction to produce a feature vector of 74 values]

35 Feature Extraction For an image $I_i$: 2 global features (aspect ratio and stroke ratio) and 72 local features $F(i,j) = \frac{s(i,j)}{N(i)\, S(j)}$ for $i = 1, \dots, 9$ and $j = 0, \dots, 7$, where $s(i,j)$ is the number of components with slope j in subimage i, $N(i)$ is the number of components in subimage i, and $S(j) = \max_i \frac{s(i,j)}{N(i)}$. Together these give a feature vector of 74 values.

36 For One Feature [Figure: over clustering cycles 1 through N, the probabilities $P(I_i \mid C_1), \dots, P(I_i \mid C_5)$ are re-estimated; the final cluster centers are the images closest to the means.]

37 Clustering [Figure: the initial mean for a cluster (initialized with the feature vector of one image) and the final mean for the cluster after the clustering cycles.]

38 Essence of the EM Algorithm The current hypothesis is used to estimate the unobserved variables. The expected values of these variables are then used to calculate an improved hypothesis. It can be shown that on each iteration through this loop, EM increases the likelihood $P(D \mid h)$ unless it is at a local maximum. The algorithm thus converges to a local-maximum-likelihood hypothesis for $\langle \mu_1, \mu_2 \rangle$.

39 General Statement of the EM Algorithm In the example above, the parameters of interest were $\theta = \langle \mu_1, \mu_2 \rangle$, and the full data were the triples $\langle x_i, z_{i1}, z_{i2} \rangle$, of which only $x_i$ is observed.

40 General Statement of the EM Algorithm (cont'd.) Let $X = \{x_1, x_2, \dots, x_m\}$ denote the observed data in a set of m independently drawn instances, and let $Z = \{z_1, z_2, \dots, z_m\}$ denote the unobserved data in these same instances. Let $Y = X \cup Z$ denote the full data.

41 General Statement of the EM Algorithm (cont'd.) The unobserved Z can be treated as a random variable whose p.d.f. depends on the unknown parameters θ and on the observed data X. Similarly, Y is a random variable because it is defined in terms of the random variable Z. h denotes the current hypothesized values of the parameters θ; h' denotes the revised hypothesis estimated on each iteration of the EM algorithm.

42 General Statement of the EM Algorithm (cont'd.) The EM algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes $E[\ln P(Y \mid h')]$.

43 General Statement of the EM Algorithm (cont'd.) Define the function $Q(h' \mid h)$ that gives $E[\ln P(Y \mid h')]$ as a function of h', under the assumption that θ = h and given the observed portion X of the full data Y: $Q(h' \mid h) = E[\ln P(Y \mid h') \mid h, X]$

44 General Statement of the EM Algorithm Repeat until convergence: Step 1, Estimation (E) step: Calculate $Q(h' \mid h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y: $Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$ Step 2, Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function: $h \leftarrow \arg\max_{h'} Q(h' \mid h)$

45 Derivation of the k-Means Algorithm from the General EM Algorithm We derive the previously seen algorithm for estimating the means of a mixture of k normal distributions; that is, we estimate the parameters $\theta = \langle \mu_1, \dots, \mu_k \rangle$. We are given the observed data $X = \{\langle x_i \rangle\}$. The hidden variables $Z = \{\langle z_{i1}, \dots, z_{ik} \rangle\}$ indicate which of the k normal distributions was used to generate each $x_i$.

46 Derivation of the k-Means Algorithm from the General EM Algorithm (cont'd.) We need to derive an expression for $Q(h' \mid h)$. First we derive an expression for $\ln P(Y \mid h')$.

47 Derivation of the k-Means Algorithm The probability $p(y_i \mid h')$ of a single instance $y_i = \langle x_i, z_{i1}, \dots, z_{ik} \rangle$ of the full data can be written $p(y_i \mid h') = p(x_i, z_{i1}, \dots, z_{ik} \mid h') = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu_j)^2}$

48 Derivation of the k-Means Algorithm (cont'd.) Given this probability $p(y_i \mid h')$ for a single instance, the logarithmic probability $\ln P(Y \mid h')$ for all m instances in the data is $\ln P(Y \mid h') = \ln \prod_{i=1}^{m} p(y_i \mid h') = \sum_{i=1}^{m} \ln p(y_i \mid h') = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu_j)^2 \right)$

49 Derivation of the k-Means Algorithm (cont'd.) In general, for any function f(z) that is a linear function of z, the following equality holds: $E[f(z)] = f(E[z])$. Therefore $E[\ln P(Y \mid h')] = E\!\left[ \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} (x_i - \mu_j)^2 \right) \right] = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu_j)^2 \right)$

50 Derivation of the k-Means Algorithm (cont'd.) To summarize, the Q function for the k-means problem is $Q(h' \mid h) = \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu_j)^2 \right)$ where $h = \langle \mu_1, \dots, \mu_k \rangle$ and $E[z_{ij}]$ is calculated based on the current hypothesis h and the observed data X: $E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$

51 Derivation of the k-Means Algorithm: Second (Maximization) Step To find the values $\langle \mu_1, \dots, \mu_k \rangle$ that maximize Q: $\arg\max_{h'} Q(h' \mid h) = \arg\max_{h'} \sum_{i=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu_j)^2 \right) = \arg\min_{h'} \sum_{i=1}^{m} \sum_{j=1}^{k} E[z_{ij}] (x_i - \mu_j)^2$ which is minimized by $\mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$

52 Summary In many parameter estimation tasks, some of the relevant instance variables may be unobservable. In this case, the EM algorithm is useful.
