
Machine Learning: Gaussian Mixture Models
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012

The Generative Model POV
We think of the data as being generated from some process. We assume this process can be modeled statistically as an underlying distribution. We often assume a parametric distribution, like a Gaussian, because such distributions are easier to represent. We infer the model parameters from the data. Then we can use the model to classify/cluster or even generate data.

Parametric Distribution
Represent the underlying probability distribution with a parametric density function.
[Figure: histogram of machine learning students' heights (cm) with a fitted probability density curve]
Gaussian (normal) distribution, two parameters:
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
View each point as generated from $p(x; \mu, \sigma^2)$.

Using Generative Models for Classification
[Figure: two Gaussians over height (cm), whose means and variances were learned from data: one fit to machine learning students, one to NBA players]
A new point arrives. Which group does he belong to? Answer: the group (class) that calls the new point most probable.
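
To make the decision rule concrete, here is a minimal sketch (not from the slides; the means and standard deviations are made up for illustration): evaluate each learned Gaussian at the new height and pick the class whose density is larger.

```python
from scipy.stats import norm

# Hypothetical parameters, as if learned from each group's heights (cm).
ml_mu, ml_sigma = 172.0, 8.0       # machine learning students
nba_mu, nba_sigma = 200.0, 9.0     # NBA players

x_new = 192.0                      # height of the new person
p_ml = norm.pdf(x_new, ml_mu, ml_sigma)     # density under the ML-student Gaussian
p_nba = norm.pdf(x_new, nba_mu, nba_sigma)  # density under the NBA-player Gaussian
print("ML student" if p_ml > p_nba else "NBA player")
```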

Maximum Likelihood Estimation
Our hypothesis space is Gaussian distributions. Find parameter(s) $\theta = \{\mu, \sigma^2\}$ that make a Gaussian most likely to generate the data $X = (x_1, \ldots, x_n)$.
Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$ (the product form holds only if $X$ is i.i.d.).

Likelihood Function
$l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$
In our Gaussian example, $x_i$ is a continuous variable, so $p(x_i; \theta)$ is the probability density function (pdf). It is meaningless to talk about probability mass here, as the probability mass at any single value of $x_i$ is zero. If $x_i$ is a discrete variable (e.g., binary), $p(x_i; \theta)$ should be replaced by the probability mass function $P(x_i; \theta)$. It is meaningless to talk about probability density here, as the density would be infinite at the value of each data point.

Log-likelihood Function
Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$
Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$
Maximizing the log-likelihood $\Leftrightarrow$ maximizing the likelihood. The log-likelihood is easier to optimize, and it prevents underflow. (What happens when you multiply 1000 probabilities?)
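
As a quick numeric sanity check (a sketch, not part of the slides): multiplying 1000 probability-like values underflows double precision to 0, while summing their logs gives a perfectly usable number.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.15, size=1000)   # 1000 hypothetical likelihood terms

print(np.prod(p))          # underflows to 0.0 (the true product is around 1e-1000)
print(np.sum(np.log(p)))   # the log-likelihood: a finite number around -2300
```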

Example: Gaussian Log-likelihood
Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$
Recall the Gaussian distribution (probability density function):
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
So the log-likelihood of the Gaussian is:
$L(\mu, \sigma^2 \mid X) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$
where the first term is a constant (it does not depend on the parameters).

Maximizing the Log-likelihood
Log-likelihood of the Gaussian:
$L(\mu, \sigma^2 \mid X) = C - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$
Take the partial derivatives w.r.t. $\mu$ and $\sigma$ and set them to 0, i.e., let $\frac{\partial L}{\partial \mu} = 0$ and $\frac{\partial L}{\partial \sigma} = 0$. Solving (try it yourself) gives
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$
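
A minimal numpy sketch of these closed-form estimates (the data here are synthetic heights, not the course data); note that the ML estimate of the variance divides by n, not n-1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=175.0, scale=7.0, size=500)   # hypothetical heights in cm

mu_hat = x.mean()                          # (1/n) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # (1/n) * sum_i (x_i - mu_hat)^2
print(mu_hat, np.sqrt(sigma2_hat))         # should be close to 175 and 7
```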

What if the data distribution can't be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians?

Gaussian Mixture Model (GMM)
[Figure: heights (cm) of machine learning students & NBA players, with two Gaussian components]
Represent the distribution with a mixture of Gaussians:
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$
Here $z$ is a membership random variable indicating which Gaussian $x$ belongs to; $z$ is discrete, so we use the probability mass $P$. $P(z = j)$ is the weight of the $j$-th Gaussian, often notated $w_j$, and $p(x \mid z = j)$ is the $j$-th Gaussian, with parameters $(\mu_j, \sigma_j^2)$.

Generative Process for GMM
[Figure: heights (cm) of machine learning students & NBA players]
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$
1. Randomly pick a component $j$ according to $P(z = j)$;
2. Generate $x$ according to $p(x \mid z = j)$.
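
The two-step generative process translates directly into code. A small sketch with made-up component parameters (heights in cm):

```python
import numpy as np

rng = np.random.default_rng(0)
w     = np.array([0.7, 0.3])         # P(z = j); must sum to 1
mu    = np.array([172.0, 200.0])     # component means (hypothetical)
sigma = np.array([8.0, 9.0])         # component standard deviations (hypothetical)

n = 1000
z = rng.choice(len(w), size=n, p=w)  # step 1: pick a component for each point
x = rng.normal(mu[z], sigma[z])      # step 2: draw from that component's Gaussian
```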

What are we optimizing?
GMM distribution:
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j) = \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}$
Three parameters per Gaussian in the mixture: $w_j, \mu_j, \sigma_j^2$, where $\sum_{j=1}^{K} w_j = 1$. Find the parameters that maximize the data likelihood.

Maximum Likelihood Estimation of GMM
Given $X = (x_1, \ldots, x_n)$ with $x_i \sim p(x)$, the log-likelihood is
$L(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} P(z_i = j)\, p(x_i \mid z_i = j) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$
Try to solve for the parameters $\mu_j, \sigma_j^2, w_j$ by setting their partial derivatives to 0? No closed-form solution. (Try it yourself.)

Why is ML hard for GMM?
Each data point $x_i$ has a membership random variable $z_i$ indicating which Gaussian it comes from. But unlike $x_i$, the value of $z_i$ cannot be observed, i.e., we are uncertain about which Gaussian $x_i$ comes from. $z_i$ is a latent variable because we can't observe it. Latent variables can also be viewed as missing data: data that we didn't observe.

If we know what value $z_i$ takes, ML is easy
[Figure: heights (cm) of machine learning students & NBA players, with known group labels]
$w_j = \frac{1}{n}\sum_{i} 1\{z_i = j\}, \qquad \mu_j = \frac{\sum_{i} 1\{z_i = j\}\, x_i}{\sum_{i} 1\{z_i = j\}}, \qquad \sigma_j^2 = \frac{\sum_{i} 1\{z_i = j\}\,(x_i - \mu_j)^2}{\sum_{i} 1\{z_i = j\}}$
where $1\{z_i = j\}$ is the indicator function (1 if $z_i = j$, 0 otherwise) and $n$ is the number of training examples.
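
If the labels $z_i$ were observed, the estimates above are just per-group counts, means, and variances. A small sketch (the helper name is made up):

```python
import numpy as np

def fit_with_known_membership(x, z, K):
    """ML estimates of a 1-D GMM when each point's component label z_i is observed."""
    n = len(x)
    w, mu, var = np.zeros(K), np.zeros(K), np.zeros(K)
    for j in range(K):
        mask = (z == j)                           # indicator 1{z_i = j}
        w[j] = mask.sum() / n                     # fraction of points in component j
        mu[j] = x[mask].mean()                    # mean of the points in component j
        var[j] = ((x[mask] - mu[j]) ** 2).mean()  # ML variance (divides by the count)
    return w, mu, var
```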

Illustration of Soft Membership
Which component does point $i$ come from? The probability that it comes from component $j$: $q_i(j) \equiv P(z_i = j \mid x_i)$.
[Figure: two Gaussian components and three example points: one near the 1st component with memberships (0.99, 0.01), one between the components with (0.5, 0.5), and one near the 2nd component with (0.1, 0.9).]

Improving our posterior probability
The posterior probability of a Gaussian is the probability that this Gaussian generated the data we observe. Let's find a way to use posterior probabilities to make an algorithm that automatically creates a set of Gaussians that would have been very likely to generate this data.

Expectation Maximization (EM)
Instead of analytically solving the maximum likelihood parameter estimation problem for a GMM, we take an alternative route: the EM algorithm. The EM algorithm updates the parameters iteratively. In each iteration, the likelihood value increases (or at least does not decrease). The EM algorithm always converges (to some local optimum), i.e., the likelihood value and the parameters converge.

EM Algorithm Summary
Initialize the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$ in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated each data point).
M step: update the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$.
Repeat the E and M steps until convergence (until the parameters don't change much). The algorithm converges to some local optimum.

EM for GMM - Initialization
Start by choosing the number of Gaussian components $K$. Also choose an initialization of the parameters of all components: $w_j, \mu_j, \sigma_j^2$ for $j = 1, \ldots, K$. Make sure $\sum_{j=1}^{K} w_j = 1$.

EM for GMM - Expectation step
For each $x_i$, calculate its soft membership, i.e., the posterior distribution of $z_i$, using the current parameters (Bayes rule):
$q_i(j) \equiv P(z_i = j \mid x_i) = \frac{P(z_i = j, x_i)}{p(x_i)} = \frac{p(x_i \mid z_i = j)\, P(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, P(z_i = l)}$
where $P(z_i = j)$ is the prior distribution of component $j$. Note: we are estimating a distribution over $z_i$ (a soft membership), not a hard membership.
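
A minimal vectorized sketch of this Bayes-rule computation for a 1-D, 2-component mixture (the parameter values are made-up "current guesses"):

```python
import numpy as np
from scipy.stats import norm

# Current parameter guesses (hypothetical).
w     = np.array([0.5, 0.5])
mu    = np.array([170.0, 200.0])
sigma = np.array([10.0, 10.0])

x = np.array([165.0, 185.0, 210.0])           # a few data points
lik = w * norm.pdf(x[:, None], mu, sigma)     # w_j * p(x_i | z_i = j), shape (n, K)
q = lik / lik.sum(axis=1, keepdims=True)      # normalize over j: q_i(j) = P(z_i = j | x_i)
print(q.round(3))
```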

EM for GMM - Maximization step
M step: update the parameters. Recall that $q_i(j)$ is the soft membership of $x_i$ in the $j$-th Gaussian.
$w_j = \frac{1}{n}\sum_{i=1}^{n} q_i(j)$ (average of the memberships)
$\mu_j = \frac{\sum_{i=1}^{n} q_i(j)\, x_i}{\sum_{i=1}^{n} q_i(j)}$ (weighted average using the memberships)
$\sigma_j^2 = \frac{\sum_{i=1}^{n} q_i(j)\,(x_i - \mu_j)^2}{\sum_{i=1}^{n} q_i(j)}$ (weighted variance using the memberships)
Repeat the E step and M step until convergence. Convergence criterion in practice: compared with the previous iteration, the likelihood value doesn't increase much, or the parameters don't change much.
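
Putting the E and M steps together, here is a minimal EM loop for a 1-D GMM (a sketch under simple assumptions: random-data-point initialization, a fixed number of iterations rather than a convergence test, and no safeguards against empty components):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(K, 1.0 / K)                    # start with equal weights
    mu = rng.choice(x, size=K, replace=False)  # means: K random data points
    var = np.full(K, x.var())                  # variances: overall data variance
    for _ in range(n_iter):
        # E step: soft memberships q[i, j] = P(z_i = j | x_i)
        lik = w * norm.pdf(x[:, None], mu, np.sqrt(var))
        q = lik / lik.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the soft memberships
        Nj = q.sum(axis=0)
        w = Nj / n
        mu = (q * x[:, None]).sum(axis=0) / Nj
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, var

# Example on synthetic data: two well-separated groups of heights.
rng = np.random.default_rng(1)
heights = np.concatenate([rng.normal(172, 8, 300), rng.normal(200, 9, 150)])
print(em_gmm_1d(heights, K=2))
```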

EM Algorithm Summary
Initialize the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$ in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated each data point).
M step: update the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$.
Repeat the E and M steps until convergence (until the parameters don't change much). The algorithm converges to some local optimum.

What if our data isn't just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions?

Multivariate Gaussian Mixture
[Figure: 2-D data with mixture-component ellipses]
$p(x) = \sum_{j=1}^{K} w_j \frac{1}{(2\pi)^{d/2}\, |S_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^T S_j^{-1} (x - \mu_j)\right)$
where $d$ is the dimensionality. $\mu_j$, a mean vector, marks the center of an ellipse; $S_j$, the $d \times d$ covariance matrix, describes the shape and orientation of the ellipse.
Parameters: $\mu_j, S_j, w_j$ for $j = 1, \ldots, K$, with $\sum_{j=1}^{K} w_j = 1$.
How many parameters? $dK$ (means) $+\, K\frac{d(d+1)}{2}$ (covariances) $+\, K$ (weights).
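
In practice one rarely codes multivariate EM by hand; scikit-learn's GaussianMixture implements it. A short sketch on synthetic 2-D data (the clusters are made up), including the parameter count from the slide:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([172, 70], [8, 7], size=(300, 2)),     # cluster 1 (hypothetical)
               rng.normal([200, 100], [9, 10], size=(150, 2))])  # cluster 2 (hypothetical)

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gm.weights_)        # w_j
print(gm.means_)          # mu_j, shape (K, d)
print(gm.covariances_)    # S_j, shape (K, d, d)

d, K = X.shape[1], 2
print(d * K + K * d * (d + 1) // 2 + K)   # dK means + K*d(d+1)/2 covariances + K weights
```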

Example: Initialization (this and the following iteration snapshots are illustrations from Andrew Moore's tutorial slides on GMM)

After Iteration #1

After Iteration #2

After Iteration #3

After Iteration #4

After Iteration #5

After Iteration #6

After Iteration #20

GMM Remarks
GMM is powerful: any density function can be approximated arbitrarily well by a GMM with enough components. But if the number of components $K$ is too large, the data will be overfitted: the likelihood increases with $K$. Extreme case: one Gaussian per data point, with variances approaching 0; then the likelihood grows without bound.
How to choose $K$? Use domain knowledge, and validate through visualization.
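
A quick way to see the "likelihood increases with K" point (a sketch using scikit-learn on synthetic data; the training-set log-likelihood typically keeps rising as K grows, which is why it cannot by itself tell you when to stop):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(172, 8, 300), rng.normal(200, 9, 150)]).reshape(-1, 1)

for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(x)
    print(K, gm.score(x))   # average training log-likelihood per point; tends to rise with K
```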

GMM is a soft version of K-means
Similarities: $K$ needs to be specified; both converge to some local optimum; initialization affects the final results, so one would want to try different initializations.
Differences: GMM assigns soft labels to instances, and GMM considers variances in addition to means.

GMM for Classification
1. Given $D = \{(x_i, y_i)\}$, where $y_i \in \{1, \ldots, C\}$.
2. Model $p(x \mid y = l)$ with a GMM, for each class $l$.
3. Calculate the class posterior probability (Bayes optimal classifier):
$P(y = l \mid x) = \frac{p(x \mid y = l)\, P(y = l)}{\sum_{k=1}^{C} p(x \mid y = k)\, P(y = k)}$
4. Classify $x$ into the class with the largest posterior.
(Illustration from Leon Bottou's slides on EM.)
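
A minimal sketch of steps 1-4 using scikit-learn mixtures for the class-conditional densities (the class name GMMClassifier and the fixed number of components per class are my own choices, not part of the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per class for p(x | y = l), combined with priors P(y = l) via Bayes rule."""
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])   # P(y = l)
        self.gmms_ = [GaussianMixture(self.n_components, random_state=0).fit(X[y == c])
                      for c in self.classes_]                               # p(x | y = l)
        return self

    def predict(self, X):
        # log p(x | y = l) + log P(y = l); the shared denominator doesn't affect the argmax.
        log_post = np.column_stack([g.score_samples(X) + np.log(p)
                                    for g, p in zip(self.gmms_, self.priors_)])
        return self.classes_[np.argmax(log_post, axis=1)]
```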

GMM for Regression
Given $D = \{(x_i, y_i)\}$, where $y_i \in \mathbb{R}$. Model the joint density $p(x, y)$ with a GMM. Compute $f(x) = E[y \mid x]$, the conditional expectation.
(Illustration from Leon Bottou's slides on EM.)
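
One concrete way to compute $E[y \mid x]$ from a joint GMM over $(x, y)$, sketched for scalar $x$ and $y$: weight each component by its responsibility given $x$, and use each component's Gaussian conditional mean. The data and component count below are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = np.sin(x) + 0.1 * rng.standard_normal(500)    # hypothetical noisy target
gm = GaussianMixture(n_components=5, covariance_type="full",
                     random_state=0).fit(np.column_stack([x, y]))

def predict(x_new):
    """E[y | x] under the fitted joint GMM (scalar x)."""
    mu_x, mu_y = gm.means_[:, 0], gm.means_[:, 1]
    s_xx, s_xy = gm.covariances_[:, 0, 0], gm.covariances_[:, 0, 1]
    preds = []
    for xv in np.atleast_1d(x_new):
        # responsibility of each component for this x (marginalizing out y)
        log_px = -0.5 * ((xv - mu_x) ** 2 / s_xx + np.log(2 * np.pi * s_xx))
        w = gm.weights_ * np.exp(log_px)
        w /= w.sum()
        cond_mean = mu_y + s_xy / s_xx * (xv - mu_x)   # E[y | x, z = j] for a joint Gaussian
        preds.append(np.sum(w * cond_mean))
    return np.array(preds)

print(predict([0.0, 1.5]))   # roughly sin(0) = 0 and sin(1.5) ≈ 1
```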

You Should Know
How to do maximum likelihood (ML) estimation.
GMM models the data distribution with a mixture of $K$ Gaussians, with parameters $\mu_j, S_j, w_j$ for $j = 1, \ldots, K$.
There is no closed-form solution for ML estimation of GMM parameters.
How to estimate GMM parameters with the EM algorithm.
How GMM is related to K-means.
How to use GMM for clustering, classification, and regression.