# Statistical Machine Learning from Data

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland January 23, 2006

2 Samy Bengio Statistical Machine Learning from Data

3 Samy Bengio Statistical Machine Learning from Data 3 Basics What is a GMM? Applications

4 Reminder: Basics on Probabilities Basics What is a GMM? Applications A few basic equalities that are often used: 1 (conditional probabilities) P(A, B) = P(A B) P(B) 2 (Bayes rule) P(A B) = P(B A) P(A) P(B) 3 If ( B i = Ω) and i, j i (B i Bj = ) then P(A) = i P(A, B i ) Samy Bengio Statistical Machine Learning from Data 4

5 Basics What is a GMM? Applications What is a Gaussian Mixture Model? A Gaussian Mixture Model (GMM) is a distribution The likelihood given a Gaussian distribution is ( 1 N (x µ, Σ) = exp 1 ) (2π) d 2 Σ 2 (x µ)t Σ 1 (x µ) where d is the dimension of x, µ is the mean and Σ is the covariance matrix of the Gaussian. Σ is often diagonal. The likelihood given a GMM is N p(x) = w i N (x µ i, Σ i ) where N is the number of Gaussians and w i is the weight of Gaussian i, with w i = 1 and i : w i 0 i Samy Bengio Statistical Machine Learning from Data 5

6 Characteristics of a GMM Basics What is a GMM? Applications While ANNs are universal approximators of functions, GMMs are universal approximators of densities. (as long as there are enough Gaussians of course) Even diagonal GMMs are universal approximators. Full rank GMMs are not easy to handle: number of parameters is the square of the number of dimensions. GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization. Samy Bengio Statistical Machine Learning from Data 6

7 Basics What is a GMM? Applications Practical Applications using GMMs Biometric person authentication (using voice, face, handwriting, etc): one GMM for the client one GMM for all the others Bayes decision = likelihood ratio Any highly imbalanced classification task one GMM per class, tuned by maximum likelihood Bayes decision = likelihood ratio Dimensionality reduction Quantization Samy Bengio Statistical Machine Learning from Data 7

8 Samy Bengio Statistical Machine Learning from Data 8 Graphical Interpretation More Formally

9 Graphical Interpretation More Formally Basics of Expectation-Maximization Objective: maximize the likelihood p(x θ) of the data X drawn from an unknown distribution, given the model parameterized by θ: θ = arg max θ p(x θ) = arg max θ n p(x p θ) p=1 Basic ideas of EM: Introduce a hidden variable such that its knowledge would simplify the maximization of p(x θ) At each iteration of the algorithm: E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable Samy Bengio Statistical Machine Learning from Data 9

10 EM for GMM (Graphical View, 1) Graphical Interpretation More Formally Hidden variable: for each point, which Gaussian generated it? A B Samy Bengio Statistical Machine Learning from Data 10

11 EM for GMM (Graphical View, 2) Graphical Interpretation More Formally E-Step: for each point, estimate the probability that each Gaussian generated it A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 11

12 EM for GMM (Graphical View, 3) Graphical Interpretation More Formally M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable) A B P(A) = 0.6 P(B) = 0.4 P(A) = 0.2 P(B) = 0.8 Samy Bengio Statistical Machine Learning from Data 12

13 EM: More Formally Graphical Interpretation More Formally Let us call the hidden variable Q. Let us consider the following auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] It can be shown that maximizing A θ s+1 = arg max A(θ, θ s ) θ always increases the likelihood of the data p(x θ s+1 ), and a maximum of A corresponds to a maximum of the likelihood. Samy Bengio Statistical Machine Learning from Data 13

14 EM: Proof of Convergence Graphical Interpretation More Formally First let us develop the auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] = q Q P(q X, θ s ) log p(x, q θ) = q Q P(q X, θ s ) log(p(q X, θ) p(x θ)) = P(q X, θ s ) log P(q X, θ) + log p(x θ) q Q Samy Bengio Statistical Machine Learning from Data 14

15 Samy Bengio Statistical Machine Learning from Data 15 EM: Proof of Convergence Graphical Interpretation More Formally then if we evaluate it at θ s A(θ s, θ s ) = P(q X, θ s ) log P(q X, θ s ) + log p(x θ s ) q Q the difference between two consecutive log likelihoods of the data can be written as log p(x θ) log p(x θ s ) = A(θ, θ s ) A(θ s, θ s ) + q Q P(q X, θ s ) log P(q X, θs ) P(q X, θ)

16 EM: Proof of Convergence Graphical Interpretation More Formally hence, since the last part of the equation is a Kullback-Leibler divergence which is always positive or null, if A increases, the log likelihood of the data also increases Moreover, one can show that when A is maximum, the likelihood of the data is also at a maximum. Samy Bengio Statistical Machine Learning from Data 16

17 Hidden Variable Auxiliary Function E-Step M-Step Samy Bengio Statistical Machine Learning from Data

18 Samy Bengio Statistical Machine Learning from Data 18 EM for GMM: Hidden Variable Hidden Variable Auxiliary Function E-Step M-Step For GMM, the hidden variable Q will describe which Gaussian generated each example. If Q was observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian Moreover, we will see that we can easily estimate Q Let us first write the mixture of Gaussian model for one x i : p(x i θ) = N P(j θ)p(x i j, θ) j=1 Let us now introduce the following indicator variable: { 1 if Gaussian j emitted xi q i,j = 0 otherwise

19 EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step We can now write the joint likelihood of all the X and q: n N p(x, Q θ) = P(j θ) q i,j p(x i j, θ) q i,j which in log gives j=1 log p(x, Q θ) = N q i,j log P(j θ) + q i,j log p(x i j, θ) j=1 Samy Bengio Statistical Machine Learning from Data 19

20 Samy Bengio Statistical Machine Learning from Data 20 EM for GMM: Auxiliary Function Hidden Variable Auxiliary Function E-Step M-Step Let us now write the corresponding auxiliary function: A(θ, θ s ) = E Q [log p(x, Q θ) X, θ s ] N = E Q q i,j log P(j θ) + q i,j log p(x i j, θ) X, θ s j=1 j=1 N = E Q [q i,j X, θ s ] log P(j θ) + E Q [q i,j X, θ s ] log p(x i j, θ)

21 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: E-Step and M-Step A(θ, θ s ) = j=1 N E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) Hence, the E-Step estimates the posterior: E Q [q i,j X, θ s ] = 1 P(q i,j = 1 X, θ s ) + 0 P(q i,j = 0 X, θ s ) = P(j x i, θ s ) = p(x i j, θ s )P(j θ s ) p(x i θ s ) and the M-step finds the parameters θ that maximizes A, hence searching for A θ = 0 for each parameter (µ j, variances σ 2 j, and weights w j). Note however that w j should sum to 1. Samy Bengio Statistical Machine Learning from Data 21

22 Samy Bengio Statistical Machine Learning from Data 22 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A µ j = = = A log p(x i j, θ) log p(x i j, θ) µ j P(j x i, θ s ) log p(x i j, θ) µ j P(j x i, θ s ) (x i µ j ) σ 2 j = 0

23 Samy Bengio Statistical Machine Learning from Data 23 EM for GMM: M-Step for Means Hidden Variable Auxiliary Function E-Step M-Step P(j x i, θ s ) (x i µ j ) σ 2 j = (removing constant terms in the sum) = 0 P(j x i, θ s ) x i P(j x i, θ s ) x i P(j x i, θ s ) P(j x i, θ s ) µ j = 0 = ˆµ j

24 Samy Bengio Statistical Machine Learning from Data 24 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances N A(θ, θ s ) = E Q [q i,j X, θ s ] log P(j θ)+e Q [q i,j X, θ s ] log p(x i j, θ) j=1 A σ 2 j = = = A log p(x i j, θ) log p(x i j, θ) σj 2 P(j x i, θ s ) log p(x i j, θ) P(j x i, θ s ) σ 2 j ( (x i µ j ) 2 2σ 4 j ) 1 2σj 2 = 0

25 Samy Bengio Statistical Machine Learning from Data 25 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Variances P(j x i, θ s ) ( (x i µ j ) 2 P(j x i, θ s )(x i µ j ) 2 2σ 4 j 2σ 4 j P(j x i, θ s )(x i µ j ) 2 σ 2 j ) 1 2σj 2 = 0 P(j x i, θ s ) 2σj 2 = 0 P(j x i, θ s ) = 0 P(j x i, θ s )(x i ˆµ j ) 2 P(j x i, θ s ) = ˆσ 2 j

26 Samy Bengio Statistical Machine Learning from Data 26 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights We have the constraint that all weights w j should be positive and sum to 1: N w j = 1 j=1 Incorporating it into the system: J(θ, θ s ) = A(θ, θ s ) + (1 N w j ) λ j where λ j are Lagrange multipliers. So we need to derive J with respect to w j and to set it to 0. j=1

27 Hidden Variable Auxiliary Function E-Step M-Step EM for GMM: M-Step for Weights and incorporating the probabilistic constraint, we get J J A(θ, θ s ) = w j A(θ, θ s λ j ) w j ( ) = 1 P(j x i, θ s 1 ) λ j = 0 ŵ j = P(j x i, θ s ) ŵ j = λ j N P(j x i, θ s ) k=1 Samy Bengio Statistical Machine Learning from Data 27 w j = 1 n P(k x i, θ s ) P(j x i, θ s )

28 Samy Bengio Statistical Machine Learning from Data 28 EM for GMM: Update Rules Hidden Variable Auxiliary Function E-Step M-Step Means ˆµ j = Variances ˆσ j 2 = x i P(j x i, θ s ) Weights: ŵ j = 1 n P(j x i, θ s ) (x i ˆµ j ) 2 P(j x i, θ s ) P(j x i, θ s ) P(j x i, θ s )

29 Samy Bengio Statistical Machine Learning from Data 29 Initialization Capacity Control Adaptation

30 Samy Bengio Statistical Machine Learning from Data 30 Initialization Initialization Capacity Control Adaptation EM is an iterative procedure that is very sensitive to initial conditions! Start from trash end up with trash. Hence, we need a good and fast initialization procedure. Often used: K-Means. Other options: hierarchical K-Means, Gaussian splitting.

31 Samy Bengio Statistical Machine Learning from Data 31 Capacity Control Initialization Capacity Control Adaptation How to control the capacity with GMMs? selecting the number of Gaussians constraining the value of the variances to be far from 0 (small variances = large capacity) Use cross-validation on the desired criterion (Maximum Likelihood, classification...)

32 Adaptation Techniques Initialization Capacity Control Adaptation In some cases, you have access to only a few examples coming from the target distribution but many coming from a nearby distribution! How can we profit from the big nearby dataset??? Solution: use adaptation techniques. The most well known and used for GMMs: the Maximum A Posteriori adaptation. Samy Bengio Statistical Machine Learning from Data 32

33 Samy Bengio Statistical Machine Learning from Data 33 MAP Adaptation Initialization Capacity Control Adaptation Normal maximum likelihood training for a dataset X : θ = arg max p(x θ) θ Maximum A Posteriori (MAP) training: θ = arg max p(θ X ) θ p(x θ)p(θ) = arg max θ p(x ) = arg max p(x θ)p(θ) θ where p(θ) represents your prior belief about the distribution of the parameters θ.

34 Samy Bengio Statistical Machine Learning from Data 34 Implementation Initialization Capacity Control Adaptation Which kind of prior distribution for p(θ)? Two objectives: constraining θ to reasonable values keep the EM algorithm tractable Use conjugate priors: Dirichlet distribution for weights Gaussian densities for means and variances

35 Samy Bengio Statistical Machine Learning from Data 35 What is a Conjugate Prior? Initialization Capacity Control Adaptation A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior. So we would like that p(x θ)p(θ) is distributed according to the same family as p(θ) and tractable. Example: Likelihood is Gaussian: p(x θ) = K 1 exp ( (x 1 µ 1 ) 2 ) 2σ1 2 Prior is Gaussian: p(θ) = K 2 exp ( (x 2 µ 2 ) 2 ) 2σ 2 2 Posterior is Gaussian: p(x θ)p(θ) = K 1 K 2 exp ( (x 1 µ 1 ) 2 2σ 2 1 ) (x µ)2 = K 3 exp ( 2σ 2 (x 2 µ 2 ) 2 ) 2σ2 2

36 Samy Bengio Statistical Machine Learning from Data 36 Conjugate Prior of Multinomials Multinomial distribution: Initialization Capacity Control Adaptation P(X 1 = x 1,, X n = x n θ) = N! n x i! n where x i are nonnegative integers and n x i = N. Dirichlet distribution with parameter u: P(θ u) = 1 Z(u) n θ u i 1 i where θ 1,, θ n 0 and n θ i = 1 and u 1,, u n 0. Conjugate prior = dirichlet with parameter x + u: P(X, θ u) = 1 Z n θ x i +u i 1 i θ x i i

37 Samy Bengio Statistical Machine Learning from Data 37 Examples of Conjugate Priors Initialization Capacity Control Adaptation likelihood conjugate prior posterior p(x θ) p(θ) p(θ x) Gaussian(θ, σ) Gaussian(µ 0, σ 0 ) Gaussian(µ 1, σ 1 ) Binomial(N, θ) Beta(r, s) Beta (r + n, s + N n) Poisson(θ) Gamma(r, s) Gamma(r + n, s + 1) Multinomial(θ 1,, θ k ) Dirichlet Dirichlet (α (α 1,, α k ) 1 +n 1,, α k + n k )

38 Initialization Capacity Control Adaptation Simple Implementation for MAP-GMMs Samy Bengio Statistical Machine Learning from Data 38

39 Simple Implementation Initialization Capacity Control Adaptation Train a generic prior model p with large amount of available data = {w p j, µp j, σp j } One hyper-parameter: α [0, 1]: faith on prior model Weights: ŵ j = [ αw p j + (1 α) ] P(j x i ) γ where γ is a normalization factor (so that j w j = 1) Samy Bengio Statistical Machine Learning from Data 39

40 Samy Bengio Statistical Machine Learning from Data 40 Simple Implementation Initialization Capacity Control Adaptation Means: Variances: ˆµ j = αµ p j + (1 α) ( ) ˆσ j = α σ p j + µ p j µp j + (1 α) P(j x i )x i P(j x i ) P(j x i )x i x i P(j x i ) ˆµ j ˆµ j

41 Initialization Capacity Control Adaptation Adapted GMMs for Person Authentication Person authentication task: accept access if P(S i X) > P( S i X) with S i a client, S i all the other persons, and X an access attributed to S i. Using Bayes theorem, this becomes: p(x S i ) p(x S i ) > P( S i ) P(S i ) = S i p(x S i ) is trained on a large dataset p(x S i ) is MAP adapted from p(x S i ). is found on a separate validation set to optimize a given criterion. Samy Bengio Statistical Machine Learning from Data 41

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Notes on the Negative Binomial Distribution

Notes on the Negative Binomial Distribution John D. Cook October 28, 2009 Abstract These notes give several properties of the negative binomial distribution. 1. Parameterizations 2. The connection between

### Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

### P (x) 0. Discrete random variables Expected value. The expected value, mean or average of a random variable x is: xp (x) = v i P (v i )

Discrete random variables Probability mass function Given a discrete random variable X taking values in X = {v 1,..., v m }, its probability mass function P : X [0, 1] is defined as: P (v i ) = Pr[X =

### 1 Prior Probability and Posterior Probability

Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### An Introduction to Statistical Machine Learning - Overview -

An Introduction to Statistical Machine Learning - Overview - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny, Switzerland

α α λ α = = λ λ α ψ = = α α α λ λ ψ α = + β = > θ θ β > β β θ θ θ β θ β γ θ β = γ θ > β > γ θ β γ = θ β = θ β = θ β = β θ = β β θ = = = β β θ = + α α α α α = = λ λ λ λ λ λ λ = λ λ α α α α λ ψ + α =

### Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

### 1 Sufficient statistics

1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Bayesian Image Super-Resolution

Bayesian Image Super-Resolution Michael E. Tipping and Christopher M. Bishop Microsoft Research, Cambridge, U.K..................................................................... Published as: Bayesian

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Lecture 6: The Bayesian Approach

Lecture 6: The Bayesian Approach What Did We Do Up to Now? We are given a model Log-linear model, Markov network, Bayesian network, etc. This model induces a distribution P(X) Learning: estimate a set

### Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

### Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### The Exponential Family

The Exponential Family David M. Blei Columbia University November 3, 2015 Definition A probability density in the exponential family has this form where p.x j / D h.x/ expf > t.x/ a./g; (1) is the natural

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Probability Theory. Elementary rules of probability Sum rule. Product rule. p. 23

Probability Theory Uncertainty is key concept in machine learning. Probability provides consistent framework for the quantification and manipulation of uncertainty. Probability of an event is the fraction

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Package EstCRM. July 13, 2015

Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### L4: Bayesian Decision Theory

L4: Bayesian Decision Theory Likelihood ratio test Probability of error Bayes risk Bayes, MAP and ML criteria Multi-class problems Discriminant functions CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna

### Probabilistic user behavior models in online stores for recommender systems

Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### Practice problems for Homework 11 - Point Estimation

Practice problems for Homework 11 - Point Estimation 1. (10 marks) Suppose we want to select a random sample of size 5 from the current CS 3341 students. Which of the following strategies is the best:

### BAYESIAN CLASSIFICATION USING GAUSSIAN MIXTURE MODEL AND EM ESTIMATION: IMPLEMENTATIONS AND COMPARISONS

LAPPEENRANTA UNIVERSITY OF TECHNOLOGY DEPARTMENT OF INFORMATION TECHNOLOGY BAYESIAN CLASSIFICATION USING GAUSSIAN MIXTURE MODEL AND ESTIMATION: IMPLENTATIONS AND COMPARISONS Information Technology Project

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Bayesian Methods. 1 The Joint Posterior Distribution

Bayesian Methods Every variable in a linear model is a random variable derived from a distribution function. A fixed factor becomes a random variable with possibly a uniform distribution going from a lower

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x θ) Likelihood: eskmate

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Principled Hybrids of Generative and Discriminative Models

Principled Hybrids of Generative and Discriminative Models Julia A. Lasserre University of Cambridge Cambridge, UK jal62@cam.ac.uk Christopher M. Bishop Microsoft Research Cambridge, UK cmbishop@microsoft.com

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Principle of Data Reduction

Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Introduction to Machine Learning

Introduction to Machine Learning Felix Brockherde 12 Kristof Schütt 1 1 Technische Universität Berlin 2 Max Planck Institute of Microstructure Physics IPAM Tutorial 2013 Felix Brockherde, Kristof Schütt

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Multidimensional data and factorial methods

Multidimensional data and factorial methods Bidimensional data x 5 4 3 4 X 3 6 X 3 5 4 3 3 3 4 5 6 x Cartesian plane Multidimensional data n X x x x n X x x x n X m x m x m x nm Factorial plane Interpretation

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### Automated Hierarchical Mixtures of Probabilistic Principal Component Analyzers

Automated Hierarchical Mixtures of Probabilistic Principal Component Analyzers Ting Su tsu@ece.neu.edu Jennifer G. Dy jdy@ece.neu.edu Department of Electrical and Computer Engineering, Northeastern University,

### UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision

UW CSE Technical Report 03-06-01 Probabilistic Bilinear Models for Appearance-Based Vision D.B. Grimes A.P. Shon R.P.N. Rao Dept. of Computer Science and Engineering University of Washington Seattle, WA

### Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

### Data a systematic approach

Pattern Discovery on Australian Medical Claims Data a systematic approach Ah Chung Tsoi Senior Member, IEEE, Shu Zhang, Markus Hagenbuchner Member, IEEE Abstract The national health insurance system in

### Towards running complex models on big data

Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

### Music Classification. Juan Pablo Bello MPATE-GE 2623 Music Information Retrieval New York University

Music Classification Juan Pablo Bello MPATE-GE 2623 Music Information Retrieval New York University 1 Classification It is the process by which we automatically assign an individual item to one of a number

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Message-passing sequential detection of multiple change points in networks

Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal

### Package MixGHD. June 26, 2015

Type Package Package MixGHD June 26, 2015 Title Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions Version 1.7 Date 2015-6-15 Author

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 5: Decision Theory & ROC Curves Gaussian ML Estimation Many figures courtesy Kevin Murphy s textbook,

### Methods of Data Analysis Working with probability distributions

Methods of Data Analysis Working with probability distributions Week 4 1 Motivation One of the key problems in non-parametric data analysis is to create a good model of a generating probability distribution,

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Natural Conjugate Gradient in Variational Inference

Natural Conjugate Gradient in Variational Inference Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen Adaptive Informatics Research Centre, Helsinki University of Technology P.O. Box 5400, FI-02015

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Designing a learning system

Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### An introduction to Hidden Markov Models

An introduction to Hidden Markov Models Christian Kohlschein Abstract Hidden Markov Models (HMM) are commonly defined as stochastic finite state machines. Formally a HMM can be described as a 5-tuple Ω

### 10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

### A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Revisiting k-means: New Algorithms via Bayesian Nonparametrics

Brian Kulis Department of CSE, Ohio State University, Columbus, OH Michael I. Jordan Departments of EECS and Statistics, University of California, Berkeley, CA KULIS@CSE.OHIO-STATE.EDU JORDAN@EECS.BERKELEY.EDU

### Health Status Monitoring Through Analysis of Behavioral Patterns

Health Status Monitoring Through Analysis of Behavioral Patterns Tracy Barger 1, Donald Brown 1, and Majd Alwan 2 1 University of Virginia, Systems and Information Engineering, Charlottesville, VA 2 University

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### 2. Mean-variance portfolio theory

2. Mean-variance portfolio theory (2.1) Markowitz s mean-variance formulation (2.2) Two-fund theorem (2.3) Inclusion of the riskfree asset 1 2.1 Markowitz mean-variance formulation Suppose there are N

### A Survey of Kernel Clustering Methods

A Survey of Kernel Clustering Methods Maurizio Filippone, Francesco Camastra, Francesco Masulli and Stefano Rovetta Presented by: Kedar Grama Outline Unsupervised Learning and Clustering Types of clustering

### Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

### Lecture 4: BK inequality 27th August and 6th September, 2007

CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound

### Bayesian Clustering for Email Campaign Detection

Peter Haider haider@cs.uni-potsdam.de Tobias Scheffer scheffer@cs.uni-potsdam.de University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany Abstract We discuss

### Factor Analysis. Factor Analysis

Factor Analysis Principal Components Analysis, e.g. of stock price movements, sometimes suggests that several variables may be responding to a small number of underlying forces. In the factor model, we

### Introduction to latent variable models

Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their

### Efficiency and the Cramér-Rao Inequality

Chapter Efficiency and the Cramér-Rao Inequality Clearly we would like an unbiased estimator ˆφ (X of φ (θ to produce, in the long run, estimates which are fairly concentrated i.e. have high precision.

### The Dirichlet-Multinomial and Dirichlet-Categorical models for Bayesian inference

The Dirichlet-Multinomial and Dirichlet-Categorical models for Bayesian inference Stephen Tu tu.stephenl@gmail.com 1 Introduction This document collects in one place various results for both the Dirichlet-multinomial

### Model for dynamic website optimization

Chapter 10 Model for dynamic website optimization In this chapter we develop a mathematical model for dynamic website optimization. The model is designed to optimize websites through autonomous management

### Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling

Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling Ioanna Manolopoulou, Cliburn Chan and Mike West December 29, 2009 Abstract One of the challenges of Markov chain Monte

### L15: statistical clustering

Similarity measures Criterion functions Cluster validity Flat clustering algorithms k-means ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern

### Computational Statistics for Big Data

Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application

Iraqi Journal of Statistical Science (18) 2010 p.p. [35-58] Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application ABSTRACT Nawzad. M. Ahmad * Bayesian decision theory