
Machine Learning: Gaussian Mixture Models
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012

The Generative Model POV
We think of the data as being generated from some process. We assume this process can be modeled statistically as an underlying distribution. We often assume a parametric distribution, like a Gaussian, because such distributions are easier to represent. We infer the model parameters from the data. Then we can use the model to classify/cluster or even generate data.

Parametric Distribution
Represent the underlying probability distribution with a parametric density function.
[Figure: histogram of machine learning students' heights (cm) with a fitted probability density curve]
Gaussian (normal) distribution, two parameters:
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
View each point as generated from $p(x; \mu, \sigma^2)$.

Using Generative Models for Classification
[Figure: two Gaussians over height (cm), whose means and variances were learned from data: one fit to machine learning students, one to NBA players]
A new point arrives. Which group does he belong to? Answer: the group (class) that calls the new point most probable.
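
To make the decision rule concrete, here is a minimal sketch (not from the slides; the means and standard deviations are made up for illustration): evaluate each learned Gaussian at the new height and pick the class whose density is larger.

```python
from scipy.stats import norm

# Hypothetical parameters, as if learned from each group's heights (cm).
ml_mu, ml_sigma = 172.0, 8.0       # machine learning students
nba_mu, nba_sigma = 200.0, 9.0     # NBA players

x_new = 192.0                      # height of the new person
p_ml = norm.pdf(x_new, ml_mu, ml_sigma)     # density under the ML-student Gaussian
p_nba = norm.pdf(x_new, nba_mu, nba_sigma)  # density under the NBA-player Gaussian
print("ML student" if p_ml > p_nba else "NBA player")
```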

Maximum Likelihood Estimation
Our hypothesis space is Gaussian distributions. Find parameter(s) $\theta = \{\mu, \sigma^2\}$ that make a Gaussian most likely to generate the data $X = (x_1, \ldots, x_n)$.
Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$ (the product form holds only if $X$ is i.i.d.).

Likelihood Function
$l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$
In our Gaussian example, $x_i$ is a continuous variable, so $p(x_i; \theta)$ is the probability density function (pdf). It is meaningless to talk about probability mass here, as the probability mass at any single value of $x_i$ is zero. If $x_i$ is a discrete variable (e.g., binary), $p(x_i; \theta)$ should be replaced by the probability mass function $P(x_i; \theta)$. It is meaningless to talk about probability density here, as the density would be infinite at the value of each data point.

Log-likelihood Function
Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$
Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$
Maximizing the log-likelihood $\Leftrightarrow$ maximizing the likelihood. The log-likelihood is easier to optimize, and it prevents underflow. (What happens when you multiply 1000 probabilities?)
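
As a quick numeric sanity check (a sketch, not part of the slides): multiplying 1000 probability-like values underflows double precision to 0, while summing their logs gives a perfectly usable number.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.15, size=1000)   # 1000 hypothetical likelihood terms

print(np.prod(p))          # underflows to 0.0 (the true product is around 1e-1000)
print(np.sum(np.log(p)))   # the log-likelihood: a finite number around -2300
```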

Example: Gaussian Log-likelihood
Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$
Recall the Gaussian distribution (probability density function):
$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
So the log-likelihood of the Gaussian is:
$L(\mu, \sigma^2 \mid X) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$
where the first term is a constant (it does not depend on the parameters).

Maximizing the Log-likelihood
Log-likelihood of the Gaussian:
$L(\mu, \sigma^2 \mid X) = C - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$
Take the partial derivatives w.r.t. $\mu$ and $\sigma$ and set them to 0, i.e., let $\frac{\partial L}{\partial \mu} = 0$ and $\frac{\partial L}{\partial \sigma} = 0$. Solving (try it yourself) gives
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$
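
A minimal numpy sketch of these closed-form estimates (the data here are synthetic heights, not the course data); note that the ML estimate of the variance divides by n, not n-1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=175.0, scale=7.0, size=500)   # hypothetical heights in cm

mu_hat = x.mean()                          # (1/n) * sum_i x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()    # (1/n) * sum_i (x_i - mu_hat)^2
print(mu_hat, np.sqrt(sigma2_hat))         # should be close to 175 and 7
```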

What if the data distribution can't be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians?

Gaussian Mixture Model (GMM)
[Figure: heights (cm) of machine learning students & NBA players, with two Gaussian components]
Represent the distribution with a mixture of Gaussians:
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$
Here $z$ is a membership random variable indicating which Gaussian $x$ belongs to; $z$ is discrete, so we use the probability mass $P$. $P(z = j)$ is the weight of the $j$-th Gaussian, often notated $w_j$, and $p(x \mid z = j)$ is the $j$-th Gaussian, with parameters $(\mu_j, \sigma_j^2)$.

Generative Process for GMM
[Figure: heights (cm) of machine learning students & NBA players]
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$
1. Randomly pick a component $j$ according to $P(z = j)$;
2. Generate $x$ according to $p(x \mid z = j)$.
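
The two-step generative process translates directly into code. A small sketch with made-up component parameters (heights in cm):

```python
import numpy as np

rng = np.random.default_rng(0)
w     = np.array([0.7, 0.3])         # P(z = j); must sum to 1
mu    = np.array([172.0, 200.0])     # component means (hypothetical)
sigma = np.array([8.0, 9.0])         # component standard deviations (hypothetical)

n = 1000
z = rng.choice(len(w), size=n, p=w)  # step 1: pick a component for each point
x = rng.normal(mu[z], sigma[z])      # step 2: draw from that component's Gaussian
```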

What are we optimizing?
GMM distribution:
$p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j) = \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}$
Three parameters per Gaussian in the mixture: $w_j, \mu_j, \sigma_j^2$, where $\sum_{j=1}^{K} w_j = 1$. Find the parameters that maximize the data likelihood.

Maximum Likelihood Estimation of GMM
Given $X = (x_1, \ldots, x_n)$ with $x_i \sim p(x)$, the log-likelihood is
$L(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} P(z_i = j)\, p(x_i \mid z_i = j) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$
Try to solve for the parameters $\mu_j, \sigma_j^2, w_j$ by setting their partial derivatives to 0? No closed-form solution. (Try it yourself.)

Why is ML hard for GMM?
Each data point $x_i$ has a membership random variable $z_i$ indicating which Gaussian it comes from. But unlike $x_i$, the value of $z_i$ cannot be observed, i.e., we are uncertain about which Gaussian $x_i$ comes from. $z_i$ is a latent variable because we can't observe it. Latent variables can also be viewed as missing data: data that we didn't observe.

If we know what value $z_i$ takes, ML is easy
[Figure: heights (cm) of machine learning students & NBA players, with known group labels]
$w_j = \frac{1}{n}\sum_{i} 1\{z_i = j\}, \qquad \mu_j = \frac{\sum_{i} 1\{z_i = j\}\, x_i}{\sum_{i} 1\{z_i = j\}}, \qquad \sigma_j^2 = \frac{\sum_{i} 1\{z_i = j\}\,(x_i - \mu_j)^2}{\sum_{i} 1\{z_i = j\}}$
where $1\{z_i = j\}$ is the indicator function (1 if $z_i = j$, 0 otherwise) and $n$ is the number of training examples.
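
If the labels $z_i$ were observed, the estimates above are just per-group counts, means, and variances. A small sketch (the helper name is made up):

```python
import numpy as np

def fit_with_known_membership(x, z, K):
    """ML estimates of a 1-D GMM when each point's component label z_i is observed."""
    n = len(x)
    w, mu, var = np.zeros(K), np.zeros(K), np.zeros(K)
    for j in range(K):
        mask = (z == j)                           # indicator 1{z_i = j}
        w[j] = mask.sum() / n                     # fraction of points in component j
        mu[j] = x[mask].mean()                    # mean of the points in component j
        var[j] = ((x[mask] - mu[j]) ** 2).mean()  # ML variance (divides by the count)
    return w, mu, var
```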

Illustration of Soft Membership
Which component does point $i$ come from? The probability that it comes from component $j$: $q_i(j) \equiv P(z_i = j \mid x_i)$.
[Figure: two Gaussian components and three example points: one near the 1st component with memberships (0.99, 0.01), one between the components with (0.5, 0.5), and one near the 2nd component with (0.1, 0.9).]

Improving our posterior probability
The posterior probability of a Gaussian is the probability that this Gaussian generated the data we observe. Let's find a way to use posterior probabilities to make an algorithm that automatically creates a set of Gaussians that would have been very likely to generate this data.

Expectation Maximization (EM)
Instead of analytically solving the maximum likelihood parameter estimation problem for a GMM, we take an alternative route: the EM algorithm. The EM algorithm updates the parameters iteratively. In each iteration, the likelihood value increases (or at least does not decrease). The EM algorithm always converges (to some local optimum), i.e., the likelihood value and the parameters converge.

EM Algorithm Summary
Initialize the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$ in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated each data point).
M step: update the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$.
Repeat the E and M steps until convergence (until the parameters don't change much). The algorithm converges to some local optimum.

EM for GMM - Initialization
Start by choosing the number of Gaussian components $K$. Also choose an initialization of the parameters of all components: $w_j, \mu_j, \sigma_j^2$ for $j = 1, \ldots, K$. Make sure $\sum_{j=1}^{K} w_j = 1$.

EM for GMM - Expectation step
For each $x_i$, calculate its soft membership, i.e., the posterior distribution of $z_i$, using the current parameters (Bayes rule):
$q_i(j) \equiv P(z_i = j \mid x_i) = \frac{P(z_i = j, x_i)}{p(x_i)} = \frac{p(x_i \mid z_i = j)\, P(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, P(z_i = l)}$
where $P(z_i = j)$ is the prior distribution of component $j$. Note: we are estimating a distribution over $z_i$ (a soft membership), not a hard membership.
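
A minimal vectorized sketch of this Bayes-rule computation for a 1-D, 2-component mixture (the parameter values are made-up "current guesses"):

```python
import numpy as np
from scipy.stats import norm

# Current parameter guesses (hypothetical).
w     = np.array([0.5, 0.5])
mu    = np.array([170.0, 200.0])
sigma = np.array([10.0, 10.0])

x = np.array([165.0, 185.0, 210.0])           # a few data points
lik = w * norm.pdf(x[:, None], mu, sigma)     # w_j * p(x_i | z_i = j), shape (n, K)
q = lik / lik.sum(axis=1, keepdims=True)      # normalize over j: q_i(j) = P(z_i = j | x_i)
print(q.round(3))
```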

EM for GMM - Maximization step
M step: update the parameters. Recall that $q_i(j)$ is the soft membership of $x_i$ in the $j$-th Gaussian.
$w_j = \frac{1}{n}\sum_{i=1}^{n} q_i(j)$ (average of the memberships)
$\mu_j = \frac{\sum_{i=1}^{n} q_i(j)\, x_i}{\sum_{i=1}^{n} q_i(j)}$ (weighted average using the memberships)
$\sigma_j^2 = \frac{\sum_{i=1}^{n} q_i(j)\,(x_i - \mu_j)^2}{\sum_{i=1}^{n} q_i(j)}$ (weighted variance using the memberships)
Repeat the E step and M step until convergence. Convergence criterion in practice: compared with the previous iteration, the likelihood value doesn't increase much, or the parameters don't change much.
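
Putting the E and M steps together, here is a minimal EM loop for a 1-D GMM (a sketch under simple assumptions: random-data-point initialization, a fixed number of iterations rather than a convergence test, and no safeguards against empty components):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(K, 1.0 / K)                    # start with equal weights
    mu = rng.choice(x, size=K, replace=False)  # means: K random data points
    var = np.full(K, x.var())                  # variances: overall data variance
    for _ in range(n_iter):
        # E step: soft memberships q[i, j] = P(z_i = j | x_i)
        lik = w * norm.pdf(x[:, None], mu, np.sqrt(var))
        q = lik / lik.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the soft memberships
        Nj = q.sum(axis=0)
        w = Nj / n
        mu = (q * x[:, None]).sum(axis=0) / Nj
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, var

# Example on synthetic data: two well-separated groups of heights.
rng = np.random.default_rng(1)
heights = np.concatenate([rng.normal(172, 8, 300), rng.normal(200, 9, 150)])
print(em_gmm_1d(heights, K=2))
```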

EM Algorithm Summary
Initialize the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$ in our model.
E step: calculate the posterior distribution of the latent variables (the probability that each Gaussian generated each data point).
M step: update the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$.
Repeat the E and M steps until convergence (until the parameters don't change much). The algorithm converges to some local optimum.

What if our data isn't just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions?

Multivariate Gaussian Mixture
[Figure: 2-D data with mixture-component ellipses]
$p(x) = \sum_{j=1}^{K} w_j \frac{1}{(2\pi)^{d/2}\, |S_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^T S_j^{-1} (x - \mu_j)\right)$
where $d$ is the dimensionality. $\mu_j$, a mean vector, marks the center of an ellipse; $S_j$, the $d \times d$ covariance matrix, describes the shape and orientation of the ellipse.
Parameters: $\mu_j, S_j, w_j$ for $j = 1, \ldots, K$, with $\sum_{j=1}^{K} w_j = 1$.
How many parameters? $dK$ (means) $+\, K\frac{d(d+1)}{2}$ (covariances) $+\, K$ (weights).
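
In practice one rarely codes multivariate EM by hand; scikit-learn's GaussianMixture implements it. A short sketch on synthetic 2-D data (the clusters are made up), including the parameter count from the slide:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([172, 70], [8, 7], size=(300, 2)),     # cluster 1 (hypothetical)
               rng.normal([200, 100], [9, 10], size=(150, 2))])  # cluster 2 (hypothetical)

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gm.weights_)        # w_j
print(gm.means_)          # mu_j, shape (K, d)
print(gm.covariances_)    # S_j, shape (K, d, d)

d, K = X.shape[1], 2
print(d * K + K * d * (d + 1) // 2 + K)   # dK means + K*d(d+1)/2 covariances + K weights
```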

Example: Initialization (this and the following iteration snapshots are illustrations from Andrew Moore's tutorial slides on GMM)

After Iteration #1

After Iteration #2

After Iteration #3

After Iteration #4

After Iteration #5

After Iteration #6

After Iteration #20

GMM Remarks
GMM is powerful: any density function can be approximated arbitrarily well by a GMM with enough components. But if the number of components $K$ is too large, the data will be overfitted: the likelihood increases with $K$. Extreme case: one Gaussian per data point, with variances approaching 0; then the likelihood grows without bound.
How to choose $K$? Use domain knowledge, and validate through visualization.
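
A quick way to see the "likelihood increases with K" point (a sketch using scikit-learn on synthetic data; the training-set log-likelihood typically keeps rising as K grows, which is why it cannot by itself tell you when to stop):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(172, 8, 300), rng.normal(200, 9, 150)]).reshape(-1, 1)

for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(x)
    print(K, gm.score(x))   # average training log-likelihood per point; tends to rise with K
```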

GMM is a soft version of K-means
Similarities: $K$ needs to be specified; both converge to some local optimum; initialization affects the final results, so one would want to try different initializations.
Differences: GMM assigns soft labels to instances, and GMM considers variances in addition to means.

GMM for Classification
1. Given $D = \{(x_i, y_i)\}$, where $y_i \in \{1, \ldots, C\}$.
2. Model $p(x \mid y = l)$ with a GMM, for each class $l$.
3. Calculate the class posterior probability (Bayes optimal classifier):
$P(y = l \mid x) = \frac{p(x \mid y = l)\, P(y = l)}{\sum_{k=1}^{C} p(x \mid y = k)\, P(y = k)}$
4. Classify $x$ into the class with the largest posterior.
(Illustration from Leon Bottou's slides on EM.)
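
A minimal sketch of steps 1-4 using scikit-learn mixtures for the class-conditional densities (the class name GMMClassifier and the fixed number of components per class are my own choices, not part of the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per class for p(x | y = l), combined with priors P(y = l) via Bayes rule."""
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])   # P(y = l)
        self.gmms_ = [GaussianMixture(self.n_components, random_state=0).fit(X[y == c])
                      for c in self.classes_]                               # p(x | y = l)
        return self

    def predict(self, X):
        # log p(x | y = l) + log P(y = l); the shared denominator doesn't affect the argmax.
        log_post = np.column_stack([g.score_samples(X) + np.log(p)
                                    for g, p in zip(self.gmms_, self.priors_)])
        return self.classes_[np.argmax(log_post, axis=1)]
```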

GMM for Regression
Given $D = \{(x_i, y_i)\}$, where $y_i \in \mathbb{R}$. Model the joint density $p(x, y)$ with a GMM. Compute $f(x) = E[y \mid x]$, the conditional expectation.
(Illustration from Leon Bottou's slides on EM.)
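
One concrete way to compute $E[y \mid x]$ from a joint GMM over $(x, y)$, sketched for scalar $x$ and $y$: weight each component by its responsibility given $x$, and use each component's Gaussian conditional mean. The data and component count below are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
y = np.sin(x) + 0.1 * rng.standard_normal(500)    # hypothetical noisy target
gm = GaussianMixture(n_components=5, covariance_type="full",
                     random_state=0).fit(np.column_stack([x, y]))

def predict(x_new):
    """E[y | x] under the fitted joint GMM (scalar x)."""
    mu_x, mu_y = gm.means_[:, 0], gm.means_[:, 1]
    s_xx, s_xy = gm.covariances_[:, 0, 0], gm.covariances_[:, 0, 1]
    preds = []
    for xv in np.atleast_1d(x_new):
        # responsibility of each component for this x (marginalizing out y)
        log_px = -0.5 * ((xv - mu_x) ** 2 / s_xx + np.log(2 * np.pi * s_xx))
        w = gm.weights_ * np.exp(log_px)
        w /= w.sum()
        cond_mean = mu_y + s_xy / s_xx * (xv - mu_x)   # E[y | x, z = j] for a joint Gaussian
        preds.append(np.sum(w * cond_mean))
    return np.array(preds)

print(predict([0.0, 1.5]))   # roughly sin(0) = 0 and sin(1.5) ≈ 1
```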

You Should Know
How to do maximum likelihood (ML) estimation.
GMM models the data distribution with a mixture of $K$ Gaussians, with parameters $\mu_j, S_j, w_j$ for $j = 1, \ldots, K$.
There is no closed-form solution for ML estimation of GMM parameters.
How to estimate GMM parameters with the EM algorithm.
How GMM is related to K-means.
How to use GMM for clustering, classification, and regression.