Statistical Machine Learning from Data

Similar documents

Basics of Statistical Machine Learning

Statistical Machine Learning

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Linear Classification. Volker Tresp Summer 2015

Reject Inference in Credit Scoring. Jie-Men Mok

Notes on the Negative Binomial Distribution

Linear Threshold Units

Gaussian Conjugate Prior Cheat Sheet

1 Prior Probability and Posterior Probability

Bayesian Statistics in One Hour. Patrick Lam

Course: Model, Learning, and Inference: Lecture 5

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Bayesian Image Super-Resolution

1 Sufficient statistics

Package EstCRM. July 13, 2015

Lecture 3: Linear methods for classification

Maximum Likelihood Estimation

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

STA 4273H: Statistical Machine Learning

Christfried Webers. Canberra February June 2015

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Probabilistic user behavior models in online stores for recommender systems

An Introduction to Machine Learning

L4: Bayesian Decision Theory

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Practice problems for Homework 11 - Point Estimation

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Machine Learning and Pattern Recognition Logistic Regression

Lecture 9: Introduction to Pattern Analysis

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Principled Hybrids of Generative and Discriminative Models

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Pattern Analysis. Logistic Regression. 12. Mai Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Principle of Data Reduction

Introduction to General and Generalized Linear Models

CHAPTER 2 Estimating Probabilities

UW CSE Technical Report Probabilistic Bilinear Models for Appearance-Based Vision

Neural Networks Lesson 5 - Cluster Analysis

Message-passing sequential detection of multiple change points in networks

Towards running complex models on big data

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Least-Squares Intersection of Lines

Inference of Probability Distributions for Trust and Security applications

Classification Problems

Music Classification. Juan Pablo Bello MPATE-GE 2623 Music Information Retrieval New York University

Package MixGHD. June 26, 2015

Markov Chain Monte Carlo Simulation Made Simple

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

Designing a learning system

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Classification by Pairwise Coupling

Simple Linear Regression Inference

Lecture 4: BK inequality 27th August and 6th September, 2007

Methods of Data Analysis Working with probability distributions

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Efficiency and the Cramér-Rao Inequality

Factor Analysis. Factor Analysis

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean

CS229 Lecture notes. Andrew Ng

Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application

Predict Influencers in the Social Network

The Dirichlet-Multinomial and Dirichlet-Categorical models for Bayesian inference

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

A Learning Based Method for Super-Resolution of Low Resolution Images

Bayesian Statistics: Indian Buffet Process

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Detection of changes in variance using binary segmentation and optimal partitioning

Probabilistic Latent Semantic Analysis (plsa)

Computational Statistics for Big Data

Topic models for Sentiment analysis: A Literature Survey

R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models

One-Class Classifiers: A Review and Analysis of Suitability in the Context of Mobile-Masquerader Detection

HT2015: SC4 Statistical Data Mining and Machine Learning

Bayes and Naïve Bayes. cs534-machine Learning

Gamma Distribution Fitting

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Understanding the Impact of Weights Constraints in Portfolio Theory

Estimating an ARMA Process

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Exploratory Data Analysis Using Radial Basis Function Latent Variable Models

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Credit Risk Models: An Overview

CSCI567 Machine Learning (Fall 2014)

Transcription:

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland bengio@idiap.ch http://www.idiap.ch/~bengio January 23, 2006

Samy Bengio Statistical Machine Learning from Data 2 1 2 3 4

Samy Bengio Statistical Machine Learning from Data 3 Basics What is a GMM? Applications 1 2 3 4

Reminder: Basics on Probabilities Basics What is a GMM? Applications A few basic equalities that are often used: 1 (conditional probabilities) P(A, B) = P(A B) P(B) 2 (Bayes rule) P(A B) = P(B A) P(A) P(B) 3 If ( B i = Ω) and i, j i (B i Bj = ) then P(A) = i P(A, B i ) Samy Bengio Statistical Machine Learning from Data 4

Basics What is a GMM? Applications What is a Gaussian Mixture Model? A Gaussian Mixture Model (GMM) is a distribution The likelihood given a Gaussian distribution is ( 1 N (x µ, Σ) = exp 1 ) (2π) d 2 Σ 2 (x µ)t Σ 1 (x µ) where d is the dimension of x, µ is the mean and Σ is the covariance matrix of the Gaussian. Σ is often diagonal. The likelihood given a GMM is N p(x) = w i N (x µ i, Σ i ) where N is the number of Gaussians and w i is the weight of Gaussian i, with w i = 1 and i : w i 0 i Samy Bengio Statistical Machine Learning from Data 5

Characteristics of a GMM Basics What is a GMM? Applications While ANNs are universal approximators of functions, GMMs are universal approximators of densities. (as long as there are enough Gaussians of course) Even diagonal GMMs are universal approximators. Full rank GMMs are not easy to handle: number of parameters is the square of the number of dimensions. GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization. Samy Bengio Statistical Machine Learning from Data 6

Basics What is a GMM? Applications Practical Applications using GMMs Biometric person authentication (using voice, face, handwriting, etc): one GMM for the client one GMM for all the others Bayes decision = likelihood ratio Any highly imbalanced classification task one GMM per class, tuned by maximum likelihood Bayes decision = likelihood ratio Dimensionality reduction Quantization Samy Bengio Statistical Machine Learning from Data 7

Samy Bengio Statistical Machine Learning from Data 8 Graphical Interpretation More Formally 1 2 3 4

Graphical Interpretation More Formally Basics of Expectation-Maximization Objective: maximize the likelihood p(x θ) of the data X drawn from an unknown distribution, given the model parameterized by θ: θ = arg max θ p(x θ) = arg max θ n p(x p θ) p=1 Basic ideas of EM: Introduce a hidden variable such that its knowledge would simplify the maximization of p(x θ) At each iteration of the algorithm: E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable Samy Bengio Statistical Machine Learning from Data 9

EM for GMM (Graphical View, 1) Graphical Interpretation More Formally Hidden variable: for each point, which Gaussian generated it? A B Samy Bengio Statistical Machine Learning from Data 10

EM for GMM (Graphical View, 2) Graphical Interpretation More Formally E-Step: for each point, estimate the probability that each Gaussian generated it A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 11

EM for GMM (Graphical View, 3) Graphical Interpretation More Formally M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable) A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 12

EM: More Formally Graphical Interpretation More Formally Let us call the hidden variable Q. Let us consider the following auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] It can be shown that maximizing A θ s+1 = arg max A(θ, θ s ) θ always increases the likelihood of the data p(x θ s+1 ), and a maximum of A corresponds to a maximum of the likelihood. Samy Bengio Statistical Machine Learning from Data 13

EM: Proof of Convergence Graphical Interpretation More Formally First let us develop the auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] = q Q P(q X, θ s ) log p(x, q θ) = q Q P(q X, θ s ) log(p(q X, θ) p(x θ)) = P(q X, θ s ) log P(q X, θ) + log p(x θ) q Q Samy Bengio Statistical Machine Learning from Data 14

Samy Bengio Statistical Machine Learning from Data 15 EM: Proof of Convergence Graphical Interpretation More Formally then if we evaluate it at θ s A(θ s, θ s ) = P(q X, θ s ) log P(q X, θ s ) + log p(x θ s ) q Q the difference between two consecutive log likelihoods of the data can be written as log p(x θ) log p(x θ s ) = A(θ, θ s ) A(θ s, θ s ) + q Q P(q X, θ s ) log P(q X, θs ) P(q X, θ)

EM: Proof of Convergence Graphical Interpretation More Formally hence, since the last part of the equation is a Kullback-Leibler divergence which is always positive or null, if A increases, the log likelihood of the data also increases Moreover, one can show that when A is maximum, the likelihood of the data is also at a maximum. Samy Bengio Statistical Machine Learning from Data 16

Hidden Variable Auxiliary Function E-Step M-Step Samy Bengio Statistical Machine Learning from Data 17 1 2 3 4

Samy Bengio Statistical Machine Learning from Data 18 EM for GMM: Hidden Variable Hidden Variable Auxiliary Function E-Step M-Step For GMM, the hidden variable Q will describe which Gaussian generated each example. If Q was observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian Moreover, we will see that we can easily estimate Q Let us first write the mixture of Gaussian model for one x i : p(x i θ) = N P(j θ)p(x i j, θ) j=1 Let us now introduce the following indicator variable: { 1 if Gaussian j emitted xi q i,j = 0 otherwise

EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step We can now write the joint likelihood of all the X and q: n N p(x, Q θ) = P(j θ) q i,j p(x i j, θ) q i,j which in log gives j=1 log p(x, Q θ) = N q i,j log P(j θ) + q i,j log p(x i j, θ) j=1 Samy Bengio Statistical Machine Learning from Data 19

Samy Bengio Statistical Machine Learning from Data 20 EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step Let us now write the corresponding auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] N = E Q q i,j log P(j θ) + q i,j log p(x i j, θ) X, θ s j=1 j=1 N = E Q [q i,j X, θ s ] log P(j θ) + E Q [q i,j X, θ s ] log p(x i j, θ)

Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: E-Step and M-Step A(θ, θ s ) = j=1 N E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) Hence, the E-Step estimates the posterior: E Q [q i,j X, θ s ] = 1 P(q i,j = 1 X, θ s ) + 0 P(q i,j = 0 X, θ s ) = P(j x i, θ s ) = p(x i j, θ s )P(j θ s ) p(x i θ s ) and the M-step finds the parameters θ that maximizes A, hence searching for A θ = 0 for each parameter (µ j, variances σ 2 j, and weights w j). Note however that w j should sum to 1. Samy Bengio Statistical Machine Learning from Data 21

Samy Bengio Statistical Machine Learning from Data 22 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A µ j = = = A log p(x i j, θ) log p(x i j, θ) µ j P(j x i, θ s ) log p(x i j, θ) µ j P(j x i, θ s ) (x i µ j ) σ 2 j = 0

Samy Bengio Statistical Machine Learning from Data 23 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step P(j x i, θ s ) (x i µ j ) σ 2 j = (removing constant terms in the sum) = 0 P(j x i, θ s ) x i P(j x i, θ s ) x i P(j x i, θ s ) P(j x i, θ s ) µ j = 0 = ˆµ j

Samy Bengio Statistical Machine Learning from Data 24 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A σ 2 j = = = A log p(x i j, θ) log p(x i j, θ) σj 2 P(j x i, θ s ) log p(x i j, θ) P(j x i, θ s ) σ 2 j ( (x i µ j ) 2 2σ 4 j ) 1 2σj 2 = 0

Samy Bengio Statistical Machine Learning from Data 25 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances P(j x i, θ s ) ( (x i µ j ) 2 P(j x i, θ s )(x i µ j ) 2 2σ 4 j 2σ 4 j P(j x i, θ s )(x i µ j ) 2 σ 2 j ) 1 2σj 2 = 0 P(j x i, θ s ) 2σj 2 = 0 P(j x i, θ s ) = 0 P(j x i, θ s )(x i ˆµ j ) 2 P(j x i, θ s ) = ˆσ 2 j

Samy Bengio Statistical Machine Learning from Data 26 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights We have the constraint that all weights w j should be positive and sum to 1: N w j = 1 j=1 Incorporating it into the system: J(θ, θ s ) = A(θ, θ s ) + (1 N w j ) λ j where λ j are Lagrange multipliers. So we need to derive J with respect to w j and to set it to 0. j=1

Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights and incorporating the probabilistic constraint, we get J J A(θ, θ s ) = w j A(θ, θ s λ j ) w j ( ) = 1 P(j x i, θ s 1 ) λ j = 0 ŵ j = P(j x i, θ s ) ŵ j = λ j N P(j x i, θ s ) k=1 Samy Bengio Statistical Machine Learning from Data 27 w j = 1 n P(k x i, θ s ) P(j x i, θ s )

Samy Bengio Statistical Machine Learning from Data 28 EM for GMM: Update Rules Hidden Variable Auxiliary Function E-Step M-Step Means ˆµ j = Variances ˆσ j 2 = x i P(j x i, θ s ) Weights: ŵ j = 1 n P(j x i, θ s ) (x i ˆµ j ) 2 P(j x i, θ s ) P(j x i, θ s ) P(j x i, θ s )

Samy Bengio Statistical Machine Learning from Data 29 Initialization Capacity Control Adaptation 1 2 3 4

Samy Bengio Statistical Machine Learning from Data 30 Initialization Initialization Capacity Control Adaptation EM is an iterative procedure that is very sensitive to initial conditions! Start from trash end up with trash. Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting.

Samy Bengio Statistical Machine Learning from Data 31 Capacity Control Initialization Capacity Control Adaptation How to control the capacity with GMMs? selecting the number of Gaussians constraining the value of the variances to be far from 0 (small variances = large capacity) Use cross-validation on the desired criterion (Maximum Likelihood, classification...)

Adaptation Techniques Initialization Capacity Control Adaptation In some cases, you have access to only a few examples coming from the target distribution...... but many coming from a nearby distribution! How can we profit from the big nearby dataset??? Solution: use adaptation techniques. The most well known and used for GMMs: the Maximum A Posteriori adaptation. Samy Bengio Statistical Machine Learning from Data 32

Samy Bengio Statistical Machine Learning from Data 33 MAP Adaptation Initialization Capacity Control Adaptation Normal maximum likelihood training for a dataset X : θ = arg max p(x θ) θ Maximum A Posteriori (MAP) training: θ = arg max p(θ X ) θ p(x θ)p(θ) = arg max θ p(x ) = arg max p(x θ)p(θ) θ where p(θ) represents your prior belief about the distribution of the parameters θ.

Samy Bengio Statistical Machine Learning from Data 34 Implementation Initialization Capacity Control Adaptation Which kind of prior distribution for p(θ)? Two objectives: constraining θ to reasonable values keep the EM algorithm tractable Use conjugate priors: Dirichlet distribution for weights Gaussian densities for means and variances

Samy Bengio Statistical Machine Learning from Data 35 What is a Conjugate Prior? Initialization Capacity Control Adaptation A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior. So we would like that p(x θ)p(θ) is distributed according to the same family as p(θ) and tractable. Example: Likelihood is Gaussian: p(x θ) = K 1 exp ( (x 1 µ 1 ) 2 ) 2σ1 2 Prior is Gaussian: p(θ) = K 2 exp ( (x 2 µ 2 ) 2 ) 2σ 2 2 Posterior is Gaussian: p(x θ)p(θ) = K 1 K 2 exp ( (x 1 µ 1 ) 2 2σ 2 1 ) (x µ)2 = K 3 exp ( 2σ 2 (x 2 µ 2 ) 2 ) 2σ2 2

Samy Bengio Statistical Machine Learning from Data 36 Conjugate Prior of Multinomials Multinomial distribution: Initialization Capacity Control Adaptation P(X 1 = x 1,, X n = x n θ) = N! n x i! n where x i are nonnegative integers and n x i = N. Dirichlet distribution with parameter u: P(θ u) = 1 Z(u) n θ u i 1 i where θ 1,, θ n 0 and n θ i = 1 and u 1,, u n 0. Conjugate prior = dirichlet with parameter x + u: P(X, θ u) = 1 Z n θ x i +u i 1 i θ x i i

Samy Bengio Statistical Machine Learning from Data 37 Examples of Conjugate Priors Initialization Capacity Control Adaptation likelihood conjugate prior posterior p(x θ) p(θ) p(θ x) Gaussian(θ, σ) Gaussian(µ 0, σ 0 ) Gaussian(µ 1, σ 1 ) Binomial(N, θ) Beta(r, s) Beta (r + n, s + N n) Poisson(θ) Gamma(r, s) Gamma(r + n, s + 1) Multinomial(θ 1,, θ k ) Dirichlet Dirichlet (α (α 1,, α k ) 1 +n 1,, α k + n k )

Initialization Capacity Control Adaptation Simple Implementation for MAP-GMMs Samy Bengio Statistical Machine Learning from Data 38

Simple Implementation Initialization Capacity Control Adaptation Train a generic prior model p with large amount of available data = {w p j, µp j, σp j } One hyper-parameter: α [0, 1]: faith on prior model Weights: ŵ j = [ αw p j + (1 α) ] P(j x i ) γ where γ is a normalization factor (so that j w j = 1) Samy Bengio Statistical Machine Learning from Data 39

Samy Bengio Statistical Machine Learning from Data 40 Simple Implementation Initialization Capacity Control Adaptation Means: Variances: ˆµ j = αµ p j + (1 α) ( ) ˆσ j = α σ p j + µ p j µp j + (1 α) P(j x i )x i P(j x i ) P(j x i )x i x i P(j x i ) ˆµ j ˆµ j

Initialization Capacity Control Adaptation Adapted GMMs for Person Authentication Person authentication task: accept access if P(S i X) > P( S i X) with S i a client, S i all the other persons, and X an access attributed to S i. Using Bayes theorem, this becomes: p(x S i ) p(x S i ) > P( S i ) P(S i ) = S i p(x S i ) is trained on a large dataset p(x S i ) is MAP adapted from p(x S i ). is found on a separate validation set to optimize a given criterion. Samy Bengio Statistical Machine Learning from Data 41