Statistical Machine Learning from Data


1 Statistical Machine Learning from Data: Gaussian Mixture Models
Samy Bengio, IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
January 23, 2006


3 Basics; What is a GMM?; Applications

4 Reminder: Basics on Probabilities
A few basic equalities that are often used:
1. (conditional probabilities) $P(A, B) = P(A \mid B)\, P(B)$
2. (Bayes rule) $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
3. If $\bigcup_i B_i = \Omega$ and $B_i \cap B_j = \emptyset$ for all $i \neq j$, then $P(A) = \sum_i P(A, B_i)$
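As a small illustration (added here, not part of the original slides; the joint table values are made up), the following NumPy sketch checks the three equalities on a binary joint distribution:

```python
import numpy as np

# Hypothetical joint table P(A=a, B=b); rows index A, columns index B.
P = np.array([[0.10, 0.30],
              [0.20, 0.40]])

P_B = P.sum(axis=0)              # rule 3: P(B) = sum_a P(A=a, B)
P_A = P.sum(axis=1)              # marginal P(A)
P_A_given_B = P / P_B            # rule 1: P(A | B) = P(A, B) / P(B)
P_B_given_A = P / P_A[:, None]   # P(B | A) = P(A, B) / P(A)

# rule 2 (Bayes): P(A | B) = P(B | A) P(A) / P(B)
bayes = P_B_given_A * P_A[:, None] / P_B
print(np.allclose(P_A_given_B, bayes))   # True
```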

5 What is a Gaussian Mixture Model?
A Gaussian Mixture Model (GMM) is a distribution. The likelihood given a single Gaussian distribution is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
where $d$ is the dimension of $x$, $\mu$ is the mean and $\Sigma$ is the covariance matrix of the Gaussian. $\Sigma$ is often diagonal.
The likelihood given a GMM is
$$p(x) = \sum_{i=1}^{N} w_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$
where $N$ is the number of Gaussians and $w_i$ is the weight of Gaussian $i$, with $\sum_i w_i = 1$ and $w_i \geq 0$ for all $i$.
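As an illustration (ours, not from the original slides), here is a minimal NumPy sketch that evaluates the diagonal-covariance GMM density defined above; the function name and the example parameters are assumptions:

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at a single point x.

    x:         (d,) input vector
    weights:   (N,) mixture weights, summing to 1
    means:     (N, d) component means
    variances: (N, d) per-dimension variances of each component
    """
    d = x.shape[0]
    # Per-component Gaussian log-likelihoods log N(x | mu_i, Sigma_i)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # log-sum-exp over components for numerical stability
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Example: a 2-component GMM in one dimension
w = np.array([0.6, 0.4])
mu = np.array([[0.0], [3.0]])
var = np.array([[1.0], [0.5]])
print(gmm_log_density(np.array([1.0]), w, mu, var))
```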

6 Characteristics of a GMM
While ANNs are universal approximators of functions, GMMs are universal approximators of densities (as long as there are enough Gaussians, of course).
Even diagonal GMMs are universal approximators.
Full-covariance GMMs are not easy to handle: the number of parameters grows with the square of the number of dimensions.
GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization (EM).

7 Practical Applications using GMMs
Biometric person authentication (using voice, face, handwriting, etc.):
  one GMM for the client
  one GMM for all the others
  Bayes decision = likelihood ratio
Any highly imbalanced classification task:
  one GMM per class, tuned by maximum likelihood
  Bayes decision = likelihood ratio
Dimensionality reduction
Quantization

8 Graphical Interpretation; More Formally

9 Basics of Expectation-Maximization
Objective: maximize the likelihood $p(X \mid \theta)$ of the data $X$, drawn from an unknown distribution, given the model parameterized by $\theta$:
$$\theta^* = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \prod_{p=1}^{n} p(x_p \mid \theta)$$
Basic ideas of EM:
  Introduce a hidden variable such that its knowledge would simplify the maximization of $p(X \mid \theta)$.
  At each iteration of the algorithm:
    E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters.
    M-Step: modify the parameters in order to maximize the joint likelihood of the data and the hidden variable.

10 EM for GMM (Graphical View, 1)
Hidden variable: for each point, which Gaussian generated it?
(Figure: data points with two Gaussian components, labeled A and B.)

11 EM for GMM (Graphical View, 2)
E-Step: for each point, estimate the probability that each Gaussian generated it.
(Figure: two Gaussians A and B; one point has P(A) = 0.6, P(B) = 0.4, another has P(A) = 0.2, P(B) = 0.8.)

12 EM for GMM (Graphical View, 3)
M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).
(Figure: the Gaussians A and B are re-estimated from the points, weighted by the posteriors P(A) = 0.6, P(B) = 0.4 and P(A) = 0.2, P(B) = 0.8.)

13 EM: More Formally
Let us call the hidden variable $Q$. Let us consider the following auxiliary function:
$$A(\theta, \theta^s) = E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s]$$
It can be shown that maximizing $A$,
$$\theta^{s+1} = \arg\max_\theta A(\theta, \theta^s),$$
always increases the likelihood of the data $p(X \mid \theta^{s+1})$, and that a maximum of $A$ corresponds to a maximum of the likelihood.

14 EM: Proof of Convergence
First let us develop the auxiliary function:
$$\begin{aligned}
A(\theta, \theta^s) &= E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] \\
&= \sum_{q \in Q} P(q \mid X, \theta^s) \log p(X, q \mid \theta) \\
&= \sum_{q \in Q} P(q \mid X, \theta^s) \log\big(P(q \mid X, \theta)\, p(X \mid \theta)\big) \\
&= \sum_{q \in Q} P(q \mid X, \theta^s) \log P(q \mid X, \theta) + \log p(X \mid \theta)
\end{aligned}$$

15 EM: Proof of Convergence (continued)
Then, evaluating it at $\theta^s$:
$$A(\theta^s, \theta^s) = \sum_{q \in Q} P(q \mid X, \theta^s) \log P(q \mid X, \theta^s) + \log p(X \mid \theta^s)$$
The difference between two consecutive log-likelihoods of the data can then be written as
$$\log p(X \mid \theta) - \log p(X \mid \theta^s) = A(\theta, \theta^s) - A(\theta^s, \theta^s) + \sum_{q \in Q} P(q \mid X, \theta^s) \log \frac{P(q \mid X, \theta^s)}{P(q \mid X, \theta)}$$

16 EM: Proof of Convergence (continued)
Hence, since the last term of the equation is a Kullback-Leibler divergence, which is always non-negative, if $A$ increases then the log-likelihood of the data also increases.
Moreover, one can show that when $A$ is at a maximum, the likelihood of the data is also at a maximum.

17 Hidden Variable; Auxiliary Function; E-Step; M-Step

18 EM for GMM: Hidden Variable
For a GMM, the hidden variable $Q$ describes which Gaussian generated each example.
If $Q$ were observed, it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian.
Moreover, we will see that we can easily estimate $Q$.
Let us first write the mixture of Gaussians model for one $x_i$:
$$p(x_i \mid \theta) = \sum_{j=1}^{N} P(j \mid \theta)\, p(x_i \mid j, \theta)$$
Let us now introduce the following indicator variable:
$$q_{i,j} = \begin{cases} 1 & \text{if Gaussian } j \text{ emitted } x_i \\ 0 & \text{otherwise} \end{cases}$$

19 EM for GMM: Auxiliary Function
We can now write the joint likelihood of all the $X$ and $Q$:
$$p(X, Q \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{N} P(j \mid \theta)^{q_{i,j}}\, p(x_i \mid j, \theta)^{q_{i,j}}$$
which in log gives
$$\log p(X, Q \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{N} q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta)$$

20 EM for GMM: Auxiliary Function (continued)
Let us now write the corresponding auxiliary function:
$$\begin{aligned}
A(\theta, \theta^s) &= E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] \\
&= E_Q\left[\sum_{i=1}^{n} \sum_{j=1}^{N} q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta) \,\middle|\, X, \theta^s\right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)
\end{aligned}$$

21 EM for GMM: E-Step and M-Step
$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$
Hence, the E-Step estimates the posterior:
$$E_Q[q_{i,j} \mid X, \theta^s] = 1 \cdot P(q_{i,j} = 1 \mid X, \theta^s) + 0 \cdot P(q_{i,j} = 0 \mid X, \theta^s) = P(j \mid x_i, \theta^s) = \frac{p(x_i \mid j, \theta^s)\, P(j \mid \theta^s)}{p(x_i \mid \theta^s)}$$
and the M-Step finds the parameters $\theta$ that maximize $A$, hence searching for $\partial A / \partial \theta = 0$ for each parameter (means $\mu_j$, variances $\sigma_j^2$, and weights $w_j$). Note however that the $w_j$ should sum to 1.
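Below is a minimal sketch of this E-Step for a diagonal-covariance GMM (ours, not from the slides); it computes the full matrix of posteriors $P(j \mid x_i, \theta^s)$ with plain NumPy:

```python
import numpy as np

def e_step(X, weights, means, variances):
    """Return the (n, N) matrix of posteriors P(j | x_i, theta^s).

    X:         (n, d) data matrix
    weights:   (N,) mixture weights
    means:     (N, d) component means
    variances: (N, d) diagonal variances
    """
    n, d = X.shape
    # log w_j + log N(x_i | mu_j, sigma_j^2), shape (n, N)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    diff = X[:, None, :] - means[None, :, :]            # (n, N, d)
    log_exp = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_joint = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # Normalize over components: P(j | x_i) = p(x_i, j) / sum_k p(x_i, k)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)
```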

22 EM for GMM: M-Step for Means
$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$
$$\frac{\partial A}{\partial \mu_j} = \sum_{i=1}^{n} \frac{\partial A}{\partial \log p(x_i \mid j, \theta)} \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \mu_j} = \sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, \frac{\partial \log p(x_i \mid j, \theta)}{\partial \mu_j} = \sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, \frac{x_i - \mu_j}{\sigma_j^2} = 0$$

23 EM for GMM: M-Step for Means (continued)
$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, \frac{x_i - \mu_j}{\sigma_j^2} = 0$$
(removing constant terms in the sum)
$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, x_i - \mu_j \sum_{i=1}^{n} P(j \mid x_i, \theta^s) = 0
\qquad\Longrightarrow\qquad
\hat{\mu}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, x_i}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$

24 EM for GMM: M-Step for Variances
$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$
$$\frac{\partial A}{\partial \sigma_j^2} = \sum_{i=1}^{n} \frac{\partial A}{\partial \log p(x_i \mid j, \theta)} \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \sigma_j^2} = \sum_{i=1}^{n} P(j \mid x_i, \theta^s) \left(\frac{(x_i - \mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\right) = 0$$

25 EM for GMM: M-Step for Variances (continued)
$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s) \left(\frac{(x_i - \mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\right) = 0$$
$$\sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)(x_i - \mu_j)^2}{2\sigma_j^4} - \sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)}{2\sigma_j^2} = 0$$
$$\sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)(x_i - \mu_j)^2}{\sigma_j^2} - \sum_{i=1}^{n} P(j \mid x_i, \theta^s) = 0
\qquad\Longrightarrow\qquad
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)(x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$

26 EM for GMM: M-Step for Weights
We have the constraint that all weights $w_j$ should be positive and sum to 1:
$$\sum_{j=1}^{N} w_j = 1$$
Incorporating it into the system with a Lagrange multiplier $\lambda$:
$$J(\theta, \theta^s) = A(\theta, \theta^s) + \lambda \left(1 - \sum_{j=1}^{N} w_j\right)$$
So we need to take the derivative of $J$ with respect to $w_j$ and set it to 0.

27 EM for GMM: M-Step for Weights (continued)
Taking the derivative and incorporating the probabilistic constraint (with $P(j \mid \theta) = w_j$), we get
$$\frac{\partial J}{\partial w_j} = \frac{\partial A(\theta, \theta^s)}{\partial w_j} - \lambda = \sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, \frac{1}{w_j} - \lambda = 0
\qquad\Longrightarrow\qquad
\hat{w}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}{\lambda}$$
Summing over $j$ and using $\sum_j \hat{w}_j = 1$ gives $\lambda = \sum_{k=1}^{N} \sum_{i=1}^{n} P(k \mid x_i, \theta^s) = n$, hence
$$\hat{w}_j = \frac{1}{n} \sum_{i=1}^{n} P(j \mid x_i, \theta^s)$$

28 EM for GMM: Update Rules
Means:
$$\hat{\mu}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, x_i}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$
Variances:
$$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, (x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$
Weights:
$$\hat{w}_j = \frac{1}{n} \sum_{i=1}^{n} P(j \mid x_i, \theta^s)$$
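A minimal NumPy sketch of these M-Step updates follows (ours, not from the slides); it consumes the responsibilities produced by the e_step function sketched earlier, and the var_floor argument is an added safeguard rather than part of the update rules:

```python
import numpy as np

def m_step(X, resp, var_floor=1e-6):
    """Re-estimate weights, means and diagonal variances.

    X:    (n, d) data matrix
    resp: (n, N) responsibilities P(j | x_i, theta^s) from the E-Step
    """
    n, _ = X.shape
    nk = resp.sum(axis=0)                                   # sum_i P(j | x_i)
    weights = nk / n                                        # w_j = (1/n) sum_i P(j | x_i)
    means = (resp.T @ X) / nk[:, None]                      # responsibility-weighted means
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2        # (n, N, d)
    variances = np.einsum('ij,ijd->jd', resp, diff2) / nk[:, None]
    return weights, means, np.maximum(variances, var_floor)

# EM then simply alternates the two steps until convergence:
#   resp = e_step(X, weights, means, variances)
#   weights, means, variances = m_step(X, resp)
```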

29 Initialization; Capacity Control; Adaptation

30 Initialization
EM is an iterative procedure that is very sensitive to initial conditions: start from trash, end up with trash.
Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting. A K-Means-based initialization is sketched below.
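A possible K-Means-based initialization, sketched here with scikit-learn's KMeans (an assumption on our part; the slides do not prescribe a specific implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, n_components, var_floor=1e-6, seed=0):
    """Seed a diagonal GMM from a K-Means clustering: one Gaussian per cluster."""
    km = KMeans(n_clusters=n_components, n_init=10, random_state=seed).fit(X)
    labels = km.labels_
    weights = np.array([np.mean(labels == j) for j in range(n_components)])
    means = km.cluster_centers_
    variances = np.array([
        np.maximum(X[labels == j].var(axis=0), var_floor)   # per-cluster variances
        for j in range(n_components)
    ])
    return weights, means, variances
```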

31 Capacity Control
How to control the capacity of GMMs?
  By selecting the number of Gaussians.
  By constraining the variances to stay away from 0 (small variances = large capacity).
Use cross-validation on the desired criterion (maximum likelihood, classification error, ...), as in the sketch below.
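For instance, a simple held-out-likelihood selection of the number of Gaussians could look like the following sketch (ours; it uses scikit-learn's GaussianMixture as the trainer, and reg_covar plays the role of the variance constraint):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_components(X_train, X_valid, candidates=(1, 2, 4, 8, 16), reg_covar=1e-3):
    """Pick the number of Gaussians maximizing held-out log-likelihood."""
    best_k, best_ll = None, -np.inf
    for k in candidates:
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              reg_covar=reg_covar, random_state=0).fit(X_train)
        ll = gmm.score(X_valid)   # average held-out log-likelihood per sample
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k
```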

32 Adaptation Techniques
In some cases, you have access to only a few examples coming from the target distribution, but many coming from a nearby distribution.
How can we profit from the big nearby dataset? Solution: use adaptation techniques.
The most well-known and widely used one for GMMs is Maximum A Posteriori (MAP) adaptation.

33 MAP Adaptation
Normal maximum likelihood training for a dataset $X$:
$$\theta^* = \arg\max_\theta p(X \mid \theta)$$
Maximum A Posteriori (MAP) training:
$$\theta^* = \arg\max_\theta p(\theta \mid X) = \arg\max_\theta \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \arg\max_\theta p(X \mid \theta)\, p(\theta)$$
where $p(\theta)$ represents your prior belief about the distribution of the parameters $\theta$.

34 Implementation
Which kind of prior distribution for $p(\theta)$? Two objectives:
  constraining $\theta$ to reasonable values
  keeping the EM algorithm tractable
Use conjugate priors: a Dirichlet distribution for the weights, and Gaussian densities for the means and variances.

35 What is a Conjugate Prior?
A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior. So we would like $p(X \mid \theta)\, p(\theta)$ to be distributed according to the same family as $p(\theta)$, and to be tractable.
Example (Gaussian likelihood with a Gaussian prior on its mean $\theta$):
  Likelihood is Gaussian: $p(x \mid \theta) = K_1 \exp\left(-\dfrac{(x - \theta)^2}{2\sigma_1^2}\right)$
  Prior is Gaussian: $p(\theta) = K_2 \exp\left(-\dfrac{(\theta - \mu_2)^2}{2\sigma_2^2}\right)$
  Posterior is Gaussian: $p(x \mid \theta)\, p(\theta) = K_1 K_2 \exp\left(-\dfrac{(x - \theta)^2}{2\sigma_1^2} - \dfrac{(\theta - \mu_2)^2}{2\sigma_2^2}\right) = K_3 \exp\left(-\dfrac{(\theta - \mu)^2}{2\sigma^2}\right)$
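A small sketch of this conjugacy in action (ours, not from the slides): the posterior over a Gaussian mean with known likelihood variance and a Gaussian prior, using the standard closed-form update.

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_post, sigma_post^2) over the mean, given observations x,
    known likelihood variance sigma2, and prior N(mu0, sigma0_2)."""
    n = len(x)
    precision = 1.0 / sigma0_2 + n / sigma2          # posterior precision
    mu_post = (mu0 / sigma0_2 + np.sum(x) / sigma2) / precision
    return mu_post, 1.0 / precision

# Example: prior belief mu ~ N(0, 1), five observations with unit variance
print(gaussian_mean_posterior(np.array([1.0, 1.2, 0.8, 1.1, 0.9]), 1.0, 0.0, 1.0))
```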

36 Conjugate Prior of Multinomials
Multinomial distribution:
$$P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \frac{N!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} \theta_i^{x_i}$$
where the $x_i$ are nonnegative integers and $\sum_{i=1}^{n} x_i = N$.
Dirichlet distribution with parameter $u$:
$$P(\theta \mid u) = \frac{1}{Z(u)} \prod_{i=1}^{n} \theta_i^{u_i - 1}$$
where $\theta_1, \ldots, \theta_n \geq 0$, $\sum_{i=1}^{n} \theta_i = 1$ and $u_1, \ldots, u_n \geq 0$.
The Dirichlet is conjugate: the posterior is again a Dirichlet, with parameter $x + u$:
$$P(\theta \mid X, u) = \frac{1}{Z} \prod_{i=1}^{n} \theta_i^{x_i + u_i - 1}$$
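A tiny sketch of the Dirichlet-multinomial update (ours; the counts and prior values below are made up):

```python
import numpy as np

def dirichlet_posterior(counts, u):
    """Posterior Dirichlet parameters: observed counts simply add to the prior u."""
    return np.asarray(counts) + np.asarray(u)

def dirichlet_posterior_mean(counts, u):
    """Posterior mean E[theta_i] = (x_i + u_i) / sum_k (x_k + u_k)."""
    post = dirichlet_posterior(counts, u)
    return post / post.sum()

# Example: counts (3, 1, 0) with a uniform prior u = (1, 1, 1)
print(dirichlet_posterior_mean([3, 1, 0], [1, 1, 1]))
```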

37 Examples of Conjugate Priors

  likelihood p(x | theta)              | conjugate prior p(theta)           | posterior p(theta | x)
  Gaussian(theta, sigma)               | Gaussian(mu_0, sigma_0)            | Gaussian(mu_1, sigma_1)
  Binomial(N, theta)                   | Beta(r, s)                         | Beta(r + n, s + N - n)
  Poisson(theta)                       | Gamma(r, s)                        | Gamma(r + n, s + 1)
  Multinomial(theta_1, ..., theta_k)   | Dirichlet(alpha_1, ..., alpha_k)   | Dirichlet(alpha_1 + n_1, ..., alpha_k + n_k)

38 Simple Implementation for MAP-GMMs

39 Simple Implementation
Train a generic prior model on the large amount of available data: $\theta^p = \{w_j^p, \mu_j^p, \sigma_j^p\}$.
One hyper-parameter $\alpha \in [0, 1]$: the faith in the prior model.
Weights:
$$\hat{w}_j = \left[\alpha\, w_j^p + (1 - \alpha)\, \frac{1}{n} \sum_{i=1}^{n} P(j \mid x_i)\right] \gamma$$
where $\gamma$ is a normalization factor (so that $\sum_j \hat{w}_j = 1$).

40 Simple Implementation (continued)
Means:
$$\hat{\mu}_j = \alpha\, \mu_j^p + (1 - \alpha)\, \frac{\sum_{i=1}^{n} P(j \mid x_i)\, x_i}{\sum_{i=1}^{n} P(j \mid x_i)}$$
Variances:
$$\hat{\sigma}_j = \alpha \left(\sigma_j^p + \mu_j^p\, \mu_j^p\right) + (1 - \alpha)\, \frac{\sum_{i=1}^{n} P(j \mid x_i)\, x_i\, x_i}{\sum_{i=1}^{n} P(j \mid x_i)} - \hat{\mu}_j\, \hat{\mu}_j$$
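Putting the three interpolation formulas together, a minimal sketch of MAP adaptation for a diagonal GMM could look as follows (ours, not from the slides; it reuses the e_step function sketched earlier to obtain P(j | x_i) under the prior model, and var_floor is an added safeguard):

```python
import numpy as np

def map_adapt(X, prior_w, prior_mu, prior_var, alpha, var_floor=1e-6):
    """MAP-adapt a diagonal GMM towards a small adaptation set X.

    alpha in [0, 1] is the faith in the prior model.
    """
    n = X.shape[0]
    resp = e_step(X, prior_w, prior_mu, prior_var)   # P(j | x_i) under the prior model
    nk = resp.sum(axis=0) + 1e-12
    ml_w = nk / n                                    # ML weight estimates
    ml_mu = (resp.T @ X) / nk[:, None]               # ML mean estimates
    ml_x2 = (resp.T @ (X ** 2)) / nk[:, None]        # ML second moments E[x^2 | j]

    new_w = alpha * prior_w + (1 - alpha) * ml_w
    new_w /= new_w.sum()                             # normalization factor gamma
    new_mu = alpha * prior_mu + (1 - alpha) * ml_mu
    new_var = alpha * (prior_var + prior_mu ** 2) + (1 - alpha) * ml_x2 - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, var_floor)
```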

41 Adapted GMMs for Person Authentication
Person authentication task: accept the access if $P(S_i \mid X) > P(\bar{S}_i \mid X)$, with $S_i$ a client, $\bar{S}_i$ all the other persons, and $X$ an access attributed to $S_i$.
Using Bayes' theorem, this becomes:
$$\frac{p(X \mid S_i)}{p(X \mid \bar{S}_i)} > \frac{P(\bar{S}_i)}{P(S_i)} = \Delta_{S_i}$$
$p(X \mid \bar{S}_i)$ is trained on a large dataset.
$p(X \mid S_i)$ is MAP-adapted from $p(X \mid \bar{S}_i)$.
The threshold $\Delta_{S_i}$ is found on a separate validation set to optimize a given criterion.
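A minimal sketch of this decision rule (ours; it reuses the gmm_log_density function sketched earlier and compares against the log of the threshold Delta):

```python
import numpy as np

def accept_access(X, client_model, world_model, log_threshold):
    """Accept if the average log-likelihood ratio exceeds the threshold.

    X: (n, d) feature frames of the access attempt.
    Each model is a (weights, means, variances) tuple for a diagonal GMM.
    """
    llr = np.mean([gmm_log_density(x, *client_model) -
                   gmm_log_density(x, *world_model) for x in X])
    return llr > log_threshold
```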
