Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall


1 Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

2 The Generative Model POV We think of the data as being generated from some process. We assume this process can be modeled statistically as an underlying distribution. We often assume a parametric distribution, like a Gaussian, because it is easier to represent. We infer the model parameters from the data. Then we can use the model to classify/cluster, or even to generate data.

3 Parametric Distribution
Represent the underlying probability distribution with a parametric density function.
[Figure: probability density over the heights (cm) of machine learning students]
Gaussian (normal) distribution, two parameters μ and σ²:
p(x; μ, σ²) = 1/(√(2π) σ) · exp(−(x − μ)² / (2σ²))
View each point as generated from p(x; μ, σ²).

4 Using Generative Models for Classification
[Figure: two Gaussians over height (cm), whose means and variances were learned from data: machine learning students and NBA players]
New point: which group does he belong to?
Answer: the group (class) that calls the new point most probable.

5 Maximum Likelihood Estimation
Our hypothesis space is Gaussian distributions, θ = {μ, σ²}.
Find the parameter(s) θ that make a Gaussian most likely to generate the data X = (x_1, …, x_n).
Likelihood function (valid only if X is i.i.d.):
l(θ | X) ≡ p(X; θ) = ∏_{i=1}^n p(x_i; θ)

6 Likelihood Function
l(θ | X) ≡ p(X; θ) = ∏_{i=1}^n p(x_i; θ)
In our Gaussian example, x_i is a continuous variable and p(x_i; θ) is the probability density function (pdf). It is meaningless to talk about probability mass here, as the probability mass at any single value of x_i is zero.
If x_i is a discrete variable (e.g. binary), p(x_i; θ) should be replaced by the probability mass function P(x_i; θ). It is meaningless to talk about probability density here, as the density would be infinite at the value of each data point.

7 Log-likelihood Function
Likelihood function: l(θ | X) ≡ p(X; θ) = ∏_{i=1}^n p(x_i; θ)
Log-likelihood function: L(θ | X) ≡ log l(θ | X) = ∑_{i=1}^n log p(x_i; θ)
Maximizing the log-likelihood is equivalent to maximizing the likelihood, because log is monotonic. The log-likelihood is easier to optimize and prevents underflow: what happens when you multiply 1000 probabilities?
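A quick numerical sketch of the underflow point (the probability values below are made up for illustration):

```python
import numpy as np

# Multiplying many small probabilities underflows in floating point,
# while summing their logarithms stays numerically stable.
rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=1000)  # 1000 made-up probabilities

product = np.prod(probs)            # underflows to 0.0 in float64
log_sum = np.sum(np.log(probs))     # a finite, large negative number

print(product, log_sum)             # 0.0 and roughly -5600
```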

8 Example: Gaussian Log-likelihood
Log-likelihood function: L(θ | X) ≡ log l(θ | X) = ∑_{i=1}^n log p(x_i; θ)
Recall the Gaussian probability density function:
p(x; μ, σ²) = 1/(√(2π) σ) · exp(−(x − μ)² / (2σ²))
So the log-likelihood of the Gaussian is:
L(μ, σ² | X) = −(n/2) log(2π) − n log σ − ∑_{i=1}^n (x_i − μ)² / (2σ²)
where −(n/2) log(2π) is a constant term.

9 Maximizing Log-likelihood
Log-likelihood of the Gaussian: L(μ, σ² | X) = C − n log σ − ∑_{i=1}^n (x_i − μ)² / (2σ²)
Take the partial derivatives w.r.t. μ and σ and set them to 0, i.e. let ∂L/∂μ = 0 and ∂L/∂σ = 0.
Then solve (try it yourself); we get
μ = (1/n) ∑_{i=1}^n x_i ;  σ² = (1/n) ∑_{i=1}^n (x_i − μ)²
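As a minimal sketch of these closed-form estimates (the height values below are made up):

```python
import numpy as np

# ML estimates of a single Gaussian: sample mean and (biased) sample variance.
x = np.array([165.0, 172.0, 158.0, 181.0, 169.0])  # hypothetical heights in cm

mu_hat = x.mean()                          # mu = (1/n) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # sigma^2 = (1/n) * sum_i (x_i - mu)^2

print(mu_hat, sigma2_hat)
```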

10 What if the data distribution can't be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians?

11 Gaussian Mixture Model (GMM)
[Figure: heights (cm) of machine learning students & NBA players, modeled with two Gaussian components]
Represent the distribution with a mixture of Gaussians. Let z be a membership random variable indicating which Gaussian x belongs to:
p(x) = ∑_{j=1}^K P(z = j) p(x | z = j)
P(z = j) is the weight of the j-th Gaussian, often notated as w_j; z is a discrete variable, so we use a probability mass P. p(x | z = j) is the j-th Gaussian, with parameters (μ_j, σ²_j).

12 Generative Process for GMM
[Figure: mixture density over height (cm) for machine learning students & NBA players]
p(x) = ∑_{j=1}^K P(z = j) p(x | z = j)
1. Randomly pick a component j, according to P(z = j);
2. Generate x according to p(x | z = j).
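A small sketch of this two-step sampling process (all parameter values are made up, not taken from the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.6, 0.4])        # P(z = j), must sum to 1
mu = np.array([170.0, 200.0])   # component means (cm)
sigma = np.array([7.0, 8.0])    # component standard deviations

def sample_gmm(n):
    z = rng.choice(len(w), size=n, p=w)   # 1. pick a component j ~ P(z = j)
    x = rng.normal(mu[z], sigma[z])       # 2. generate x ~ p(x | z = j)
    return x, z

x, z = sample_gmm(5)
print(x, z)
```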

13 What are we optimizing?
GMM distribution:
p(x) = ∑_{j=1}^K P(z = j) p(x | z = j) = ∑_{j=1}^K w_j · 1/(√(2π) σ_j) · exp(−(x − μ_j)² / (2σ²_j))
Three parameters per Gaussian in the mixture: w_j, μ_j, σ²_j, where ∑_{j=1}^K w_j = 1.
Find the parameters that maximize the data likelihood.

14 Maximum Likelihood Estimation of GMM
Given X = (x_1, …, x_n), x_i ~ p(x), the log-likelihood is
L(θ | X) = ∑_{i=1}^n log p(x_i)
         = ∑_{i=1}^n log ∑_{j=1}^K P(z_i = j) p(x_i | z_i = j)
         = ∑_{i=1}^n log ∑_{j=1}^K w_j · 1/(√(2π) σ_j) · exp(−(x_i − μ_j)² / (2σ²_j))
Try to solve for the parameters μ_j, σ²_j, w_j by setting their partial derivatives to 0? No closed-form solution. (Try it yourself.)
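For reference, this log-likelihood is easy to evaluate even though it has no closed-form maximizer; a hedged sketch (function and variable names are my own, with logsumexp used for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(x, w, mu, sigma):
    """L(theta|X) = sum_i log sum_j w_j * N(x_i; mu_j, sigma_j^2)."""
    # log of each weighted component density, shape (n, K)
    log_terms = np.log(w)[None, :] + norm.logpdf(x[:, None], mu[None, :], sigma[None, :])
    return logsumexp(log_terms, axis=1).sum()
```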

15 Why is ML hard for GMM? Each data point x_i has a membership random variable z_i, indicating which Gaussian it comes from. But unlike x_i, the value of z_i cannot be observed, i.e. we are uncertain about which Gaussian x_i comes from. z_i is a latent variable because we can't observe it. Latent variables can also be viewed as missing data: data that we didn't observe.

16 If we know what value z_i takes, ML is easy
[Figure: heights (cm) of machine learning students & NBA players, with known memberships]
w_j = (1/n) ∑_i 1{z_i = j}
μ_j = ∑_i 1{z_i = j} x_i / ∑_i 1{z_i = j}
σ²_j = ∑_i 1{z_i = j} (x_i − μ_j)² / ∑_i 1{z_i = j}
Indicator function: 1{z_i = j} = 1 if z_i = j, and 0 if z_i ≠ j. n = number of training examples.
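A small sketch of these indicator-based estimates, assuming the memberships z are observed (names are my own):

```python
import numpy as np

def fit_known_membership(x, z, K):
    """ML estimates when each point's component label z_i is known."""
    n = len(x)
    w, mu, sigma2 = np.zeros(K), np.zeros(K), np.zeros(K)
    for j in range(K):
        mask = (z == j)                                 # indicator 1{z_i = j}
        w[j] = mask.sum() / n                           # fraction of points in component j
        mu[j] = x[mask].mean()                          # mean of those points
        sigma2[j] = ((x[mask] - mu[j]) ** 2).mean()     # their variance
    return w, mu, sigma2
```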

17 Illustration of Soft Membership
Which component does point i come from? The probability that it comes from component j: q_i(j) ≡ P(z_i = j | x_i).
[Figure: two components (1st and 2nd) with example points and their soft memberships; e.g. one point has membership 0.99 from the 1st component, another has membership 0.9 from the 2nd]

18 Improving our posterior probability The posterior probability of a Gaussian is the probability that this Gaussian generated the data we observe. Let's find a way to use posterior probabilities to build an algorithm that automatically creates a set of Gaussians that would have been very likely to generate this data.

19 Expectation Maximization (EM) Instead of analytically solving the maximum likelihood parameter estimation problem of GMM, we take an alternative approach: the EM algorithm. The EM algorithm updates the parameters iteratively. In each iteration, the likelihood value increases (or at least does not decrease). The EM algorithm always converges (to some local optimum), i.e. the likelihood value and the parameters converge.

20 EM Algorithm Summary
Initialize parameters: w_j, μ_j, σ²_j for each Gaussian j in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated the data).
M step: update the parameters w_j, μ_j, σ²_j for each Gaussian j.
Repeat the E and M steps until convergence (go until the parameters don't change much). It converges to some local optimum.

21 EM for GMM - Initialization Start by choosing the number of Gaussian components K. Also, choose an initialization of the parameters of all components, w_j, μ_j, σ²_j for j = 1, …, K. Make sure ∑_{j=1}^K w_j = 1.

22 EM for GMM - Expectation step
For each x_i, calculate its soft membership, i.e. the posterior distribution of z_i, using the current parameters (Bayes rule):
q_i(j) ≡ P(z_i = j | x_i) = P(z_i = j, x_i) / p(x_i) = p(x_i | z_i = j) P(z_i = j) / ∑_{l=1}^K p(x_i | z_i = l) P(z_i = l)
P(z_i = j) is the prior distribution of component j.
Note: we are guessing the distribution (i.e. a soft membership) of z_i, instead of a hard membership.
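A sketch of this E step in code (variable names are assumptions; it reuses scipy's univariate normal density):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, w, mu, sigma):
    """q_i(j) = w_j * N(x_i; mu_j, sigma_j^2) / sum_l w_l * N(x_i; mu_l, sigma_l^2)."""
    joint = w[None, :] * norm.pdf(x[:, None], mu[None, :], sigma[None, :])  # shape (n, K)
    return joint / joint.sum(axis=1, keepdims=True)     # normalize over components
```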

23 EM for GMM - Maximization step
M step: update the parameters. Recall that q_i(j) is the soft membership of x_i in the j-th Gaussian.
w_j = (1/n) ∑_{i=1}^n q_i(j)   (average of the memberships)
μ_j = ∑_{i=1}^n q_i(j) x_i / ∑_{i=1}^n q_i(j)   (weighted average using the memberships)
σ²_j = ∑_{i=1}^n q_i(j) (x_i − μ_j)² / ∑_{i=1}^n q_i(j)   (weighted variance using the memberships)
Repeat the E step and M step until convergence. Convergence criterion in practice: compared with the previous iteration, the likelihood value doesn't increase much, or the parameters don't change much.
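And a matching sketch of the M step, plus how the two steps would typically be alternated (again, names are my own, not the slides' notation):

```python
import numpy as np

def m_step(x, q):
    """Weighted re-estimation from soft memberships q of shape (n, K)."""
    nk = q.sum(axis=0)                                       # effective count per component
    w = nk / len(x)                                          # w_j = (1/n) sum_i q_i(j)
    mu = (q * x[:, None]).sum(axis=0) / nk                   # weighted mean
    sigma2 = (q * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk  # weighted variance
    return w, mu, np.sqrt(sigma2)

# Typical loop: repeat E and M until the parameters stop changing much.
# for _ in range(100):
#     q = e_step(x, w, mu, sigma)
#     w, mu, sigma = m_step(x, q)
```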

24 EM Algorithm Summary
Initialize parameters: w_j, μ_j, σ²_j for each Gaussian j in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated the data).
M step: update the parameters w_j, μ_j, σ²_j for each Gaussian j.
Repeat the E and M steps until convergence (go until the parameters don't change much). It converges to some local optimum.

25 What if our data isn't just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions?

26 Multivariate Gaussian Mixture
[Figure: mixture components drawn as ellipses over two dimensions]
p(x) = ∑_{j=1}^K w_j · 1/((2π)^{d/2} |S_j|^{1/2}) · exp(−(1/2) (x − μ_j)ᵀ S_j⁻¹ (x − μ_j))
μ_j, a mean vector, marks the center of an ellipse. S_j, the covariance matrix (d × d), describes the shape and orientation of an ellipse. d is the dimensionality.
Parameters: μ_j, S_j, w_j for j = 1, …, K, with ∑_{j=1}^K w_j = 1.
How many parameters? dK (means) + K·d(d + 1)/2 (covariances) + K (weights).
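A brief sketch of evaluating this multivariate mixture density with scipy (the 2-D parameter values are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

w = [0.5, 0.5]                                              # mixture weights
mu = [np.array([170.0, 65.0]), np.array([200.0, 100.0])]   # mean vectors
S = [np.eye(2) * 50.0, np.eye(2) * 80.0]                    # covariance matrices

x = np.array([180.0, 75.0])
p_x = sum(w_j * multivariate_normal.pdf(x, mean=m_j, cov=S_j)
          for w_j, m_j, S_j in zip(w, mu, S))
print(p_x)

# Parameter count: d*K (means) + K*d*(d+1)/2 (covariances) + K (weights)
```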

27 Example: Initialization (Illustration from Andrew Moore's tutorial slides on GMM)

28 After Iteration #1 (Illustration from Andrew Moore's tutorial slides on GMM)

29 After Iteration #2 (Illustration from Andrew Moore's tutorial slides on GMM)

30 After Iteration #3 (Illustration from Andrew Moore's tutorial slides on GMM)

31 After Iteration #4 (Illustration from Andrew Moore's tutorial slides on GMM)

32 After Iteration #5 (Illustration from Andrew Moore's tutorial slides on GMM)

33 After Iteration #6 (Illustration from Andrew Moore's tutorial slides on GMM)

34 After Iteration #20 (Illustration from Andrew Moore's tutorial slides on GMM)

35 GMM Remarks GMM is powerful: any density function can be approximated arbitrarily well by a GMM with enough components. If the number of components K is too large, the data will be overfitted. The likelihood increases with K. Extreme case: one Gaussian per data point, with variances → 0, then the likelihood → ∞. How to choose K? Use domain knowledge. Validate through visualization.

36 GMM is a soft version of K-means
Similarities: K needs to be specified. Converges to some local optimum. Initialization affects the final result, so one would want to try different initializations.
Differences: GMM assigns soft labels to instances. GMM considers variances in addition to means.

37 GMM for Classification
1. Given D = {(x_i, y_i)}, where y_i ∈ {1, …, C}.
2. Model p(x | y = l) with a GMM, for each class l.
3. Calculate the class posterior probability (Bayes optimal classifier):
P(y = l | x) = p(x | y = l) P(y = l) / ∑_{k=1}^C p(x | y = k) P(y = k)
4. Classify x to the class having the largest posterior.
(Illustration from Leon Bottou's slides on EM)
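A hedged sketch of this recipe using scikit-learn's GaussianMixture to model each class-conditional density p(x | y = l); the function names and the choice of library are mine, not the slides':

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X, y, n_components=2):
    """Fit one GMM per class and record the class priors P(y = l)."""
    classes = np.unique(y)
    models = {l: GaussianMixture(n_components=n_components).fit(X[y == l]) for l in classes}
    priors = {l: np.mean(y == l) for l in classes}
    return models, priors

def classify(x, models, priors):
    """Return the class with the largest posterior; the shared denominator cancels."""
    scores = {l: models[l].score_samples(x[None, :])[0] + np.log(priors[l])
              for l in models}
    return max(scores, key=scores.get)
```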

38 GMM for Regression
Given D = {(x_i, y_i)}, where y_i ∈ ℝ. Model the joint density p(x, y) with a GMM. Compute f(x) = E[y | x], the conditional expectation.
(Illustration from Leon Bottou's slides on EM)

39 You Should Know
How to do maximum likelihood (ML) estimation.
GMM models the data distribution with a mixture of K Gaussians, with parameters μ_j, S_j, w_j for j = 1, …, K.
There is no closed-form solution for ML estimation of the GMM parameters.
How to estimate GMM parameters with the EM algorithm.
How GMM is related to K-means.
How to use GMM for clustering, classification, and regression.
