Statistical Machine Learning from Data


Statistical Machine Learning from Data: Gaussian Mixture Models

Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
January 23, 2006
Outline: Basics; What is a GMM?; Applications
Reminder: Basics on Probabilities

A few basic equalities that are often used:
1. (conditional probabilities) $P(A, B) = P(A \mid B) P(B)$
2. (Bayes rule) $P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$
3. If $\bigcup_i B_i = \Omega$ and $\forall i, j \neq i: B_i \cap B_j = \emptyset$, then $P(A) = \sum_i P(A, B_i)$
What is a Gaussian Mixture Model?

A Gaussian Mixture Model (GMM) is a distribution.

The likelihood given a Gaussian distribution is

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

where $d$ is the dimension of $x$, $\mu$ is the mean and $\Sigma$ is the covariance matrix of the Gaussian. $\Sigma$ is often diagonal.

The likelihood given a GMM is

$$p(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$$

where $N$ is the number of Gaussians and $w_i$ is the weight of Gaussian $i$, with $\sum_i w_i = 1$ and $\forall i: w_i \geq 0$.
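The two likelihoods above can be evaluated directly when $\Sigma$ is diagonal. The following is a minimal sketch under that assumption; the function names `gaussian_diag_pdf` and `gmm_pdf` and the example parameters are illustrative and do not come from the slides.

```python
import math

def gaussian_diag_pdf(x, mean, var):
    """Density N(x; mean, diag(var)) for lists x, mean, var of length d."""
    d = len(x)
    # log of the normalization term 1 / ((2*pi)^(d/2) * sqrt(|Sigma|))
    log_norm = -0.5 * (d * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # log of the exponential term, with Sigma^{-1} = diag(1 / var)
    log_exp = -0.5 * sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(log_norm + log_exp)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_i w_i N(x; mu_i, Sigma_i)."""
    return sum(w * gaussian_diag_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-Gaussian model in two dimensions.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
density = gmm_pdf([0.0, 0.0], weights, means, variances)
```

Working in the log domain inside `gaussian_diag_pdf` keeps the normalization term stable when $d$ is large.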
Characteristics of a GMM

- While ANNs are universal approximators of functions, GMMs are universal approximators of densities (as long as there are enough Gaussians, of course).
- Even diagonal GMMs are universal approximators.
- Full-rank GMMs are not easy to handle: the number of covariance parameters per Gaussian grows with the square of the number of dimensions.
- GMMs can be trained by maximum likelihood using an efficient algorithm: Expectation-Maximization (EM).
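To make the cost of full-rank covariances concrete, here is a small counting sketch. The counting convention is an assumption of this example: a symmetric covariance is counted as $d(d+1)/2$ free entries, and the weights as $N - 1$ free values because they sum to 1. The function name is hypothetical.

```python
def gmm_parameter_count(n_gaussians, dim, diagonal):
    """Number of free parameters of a GMM with n_gaussians components in dim dimensions."""
    # Covariance: dim variances if diagonal, else dim*(dim+1)/2 symmetric entries.
    cov_params = dim if diagonal else dim * (dim + 1) // 2
    per_gaussian = dim + cov_params  # mean + covariance
    return n_gaussians * per_gaussian + (n_gaussians - 1)  # plus free weights

diag_count = gmm_parameter_count(10, 40, diagonal=True)
full_count = gmm_parameter_count(10, 40, diagonal=False)
```

With 10 Gaussians in 40 dimensions, the full-rank model needs roughly ten times as many parameters as the diagonal one under this convention.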
Practical Applications using GMMs

- Biometric person authentication (using voice, face, handwriting, etc.):
  - one GMM for the client
  - one GMM for all the others
  - Bayes decision = likelihood ratio
- Any highly imbalanced classification task:
  - one GMM per class, tuned by maximum likelihood
  - Bayes decision = likelihood ratio
- Dimensionality reduction
- Quantization
Outline: Graphical Interpretation; More Formally
Basics of Expectation-Maximization

Objective: maximize the likelihood $p(X \mid \theta)$ of the data $X$ drawn from an unknown distribution, given the model parameterized by $\theta$:

$$\theta^* = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \prod_{p=1}^{n} p(x_p \mid \theta)$$

Basic ideas of EM:
- Introduce a hidden variable such that its knowledge would simplify the maximization of $p(X \mid \theta)$.
- At each iteration of the algorithm:
  - E-Step: estimate the distribution of the hidden variable given the data and the current value of the parameters.
  - M-Step: modify the parameters in order to maximize the joint distribution of the data and the hidden variable.
EM for GMM (Graphical View, 1)

Hidden variable: for each point, which Gaussian generated it?

[Figure: two Gaussians labeled A and B]
EM for GMM (Graphical View, 2)

E-Step: for each point, estimate the probability that each Gaussian generated it.

[Figure: two Gaussians labeled A and B; one point annotated P(A) = 0.6, P(B) = 0.4 and another annotated P(A) = 0.2, P(B) = 0.8]
EM for GMM (Graphical View, 3)

M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).

[Figure: two Gaussians labeled A and B; one point annotated P(A) = 0.6, P(B) = 0.4 and another annotated P(A) = 0.2, P(B) = 0.8]
EM: More Formally

Let us call the hidden variable $Q$. Let us consider the following auxiliary function:

$$A(\theta, \theta^s) = E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s]$$

It can be shown that maximizing $A$,

$$\theta^{s+1} = \arg\max_\theta A(\theta, \theta^s)$$

never decreases the likelihood of the data $p(X \mid \theta^{s+1})$, and a maximum of $A$ corresponds to a maximum of the likelihood.
EM: Proof of Convergence

First let us develop the auxiliary function:

$$\begin{aligned}
A(\theta, \theta^s) &= E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] \\
&= \sum_{q \in Q} P(q \mid X, \theta^s) \log p(X, q \mid \theta) \\
&= \sum_{q \in Q} P(q \mid X, \theta^s) \log\left(P(q \mid X, \theta)\, p(X \mid \theta)\right) \\
&= \left[\sum_{q \in Q} P(q \mid X, \theta^s) \log P(q \mid X, \theta)\right] + \log p(X \mid \theta)
\end{aligned}$$
EM: Proof of Convergence

Then, if we evaluate it at $\theta^s$:

$$A(\theta^s, \theta^s) = \left[\sum_{q \in Q} P(q \mid X, \theta^s) \log P(q \mid X, \theta^s)\right] + \log p(X \mid \theta^s)$$

the difference between two consecutive log likelihoods of the data can be written as

$$\log p(X \mid \theta) - \log p(X \mid \theta^s) = A(\theta, \theta^s) - A(\theta^s, \theta^s) + \sum_{q \in Q} P(q \mid X, \theta^s) \log \frac{P(q \mid X, \theta^s)}{P(q \mid X, \theta)}$$
EM: Proof of Convergence

Hence, since the last part of the equation is a Kullback-Leibler divergence, which is always non-negative, if $A$ increases, the log likelihood of the data also increases.

Moreover, one can show that when $A$ is maximum, the likelihood of the data is also at a maximum.
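The convergence argument relies on the Kullback-Leibler divergence never being negative. A minimal numerical sketch of that property for discrete distributions follows; the two distributions are arbitrary illustrative numbers, and `kl_divergence` is a hypothetical helper name.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k) for discrete distributions given as lists."""
    # Terms with p_k = 0 contribute 0 by convention; q_k must be > 0 wherever p_k > 0.
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [0.6, 0.4]
q = [0.2, 0.8]
kl_pq = kl_divergence(p, q)  # strictly positive because p differs from q
kl_pp = kl_divergence(p, p)  # zero because the distributions are identical
```

In the proof, $p$ plays the role of $P(q \mid X, \theta^s)$ and $q$ the role of $P(q \mid X, \theta)$.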
Outline: Hidden Variable; Auxiliary Function; E-Step; M-Step
EM for GMM: Hidden Variable

- For GMM, the hidden variable $Q$ will describe which Gaussian generated each example.
- If $Q$ were observed, then it would be simple to maximize the likelihood of the data: simply estimate the parameters Gaussian by Gaussian.
- Moreover, we will see that we can easily estimate $Q$.

Let us first write the mixture of Gaussians model for one $x_i$:

$$p(x_i \mid \theta) = \sum_{j=1}^{N} P(j \mid \theta)\, p(x_i \mid j, \theta)$$

Let us now introduce the following indicator variable:

$$q_{i,j} = \begin{cases} 1 & \text{if Gaussian } j \text{ emitted } x_i \\ 0 & \text{otherwise} \end{cases}$$
EM for GMM: Auxiliary Function

We can now write the joint likelihood of all the $X$ and $q$:

$$p(X, Q \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{N} P(j \mid \theta)^{q_{i,j}}\, p(x_i \mid j, \theta)^{q_{i,j}}$$

which in log gives

$$\log p(X, Q \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{N} q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta)$$
EM for GMM: Auxiliary Function

Let us now write the corresponding auxiliary function:

$$\begin{aligned}
A(\theta, \theta^s) &= E_Q[\log p(X, Q \mid \theta) \mid X, \theta^s] \\
&= E_Q\left[\sum_{i=1}^{n} \sum_{j=1}^{N} q_{i,j} \log P(j \mid \theta) + q_{i,j} \log p(x_i \mid j, \theta) \,\middle|\, X, \theta^s\right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)
\end{aligned}$$
EM for GMM: E-Step and M-Step

$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$

Hence, the E-Step estimates the posterior:

$$\begin{aligned}
E_Q[q_{i,j} \mid X, \theta^s] &= 1 \cdot P(q_{i,j} = 1 \mid X, \theta^s) + 0 \cdot P(q_{i,j} = 0 \mid X, \theta^s) \\
&= P(j \mid x_i, \theta^s) = \frac{p(x_i \mid j, \theta^s)\, P(j \mid \theta^s)}{p(x_i \mid \theta^s)}
\end{aligned}$$

and the M-Step finds the parameters $\theta$ that maximize $A$, hence searching for

$$\frac{\partial A}{\partial \theta} = 0$$

for each parameter (means $\mu_j$, variances $\sigma_j^2$, and weights $w_j$). Note however that the $w_j$ should sum to 1.
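The E-Step posterior $P(j \mid x_i, \theta^s)$ can be sketched as follows for one-dimensional data. This is a minimal illustration, assuming scalar observations and one variance per Gaussian; the names `normal_pdf` and `e_step` and the example values are not from the slides.

```python
import math

def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density p(x | j, theta)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, weights, means, variances):
    """Return posteriors[i][j] = P(j | x_i, theta^s) by Bayes rule."""
    posteriors = []
    for x in xs:
        # Numerator: p(x_i | j, theta^s) P(j | theta^s) for each Gaussian j.
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # Denominator: p(x_i | theta^s)
        posteriors.append([value / total for value in joint])
    return posteriors

# Hypothetical model: two equally weighted unit-variance Gaussians at -2 and +2.
posteriors = e_step([0.0, -2.0], [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

A point far from every Gaussian can make `total` underflow to zero; a production implementation would typically work with log densities instead.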
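EM for GMM: M-Step for Means

$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$

$$\begin{aligned}
\frac{\partial A}{\partial \mu_j} &= \sum_{i=1}^{n} \frac{\partial A}{\partial \log p(x_i \mid j, \theta)} \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \mu_j} \\
&= \sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \mu_j} \\
&= \sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \frac{(x_i - \mu_j)}{\sigma_j^2} = 0
\end{aligned}$$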
EM for GMM: M-Step for Means

$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \frac{(x_i - \mu_j)}{\sigma_j^2} = 0$$

(removing constant terms in the sum)

$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, x_i - \sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, \mu_j = 0$$

$$\hat{\mu}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)\, x_i}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$
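EM for GMM: M-Step for Variances

$$A(\theta, \theta^s) = \sum_{i=1}^{n} \sum_{j=1}^{N} E_Q[q_{i,j} \mid X, \theta^s] \log P(j \mid \theta) + E_Q[q_{i,j} \mid X, \theta^s] \log p(x_i \mid j, \theta)$$

$$\begin{aligned}
\frac{\partial A}{\partial \sigma_j^2} &= \sum_{i=1}^{n} \frac{\partial A}{\partial \log p(x_i \mid j, \theta)} \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \sigma_j^2} \\
&= \sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \frac{\partial \log p(x_i \mid j, \theta)}{\partial \sigma_j^2} \\
&= \sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \left(\frac{(x_i - \mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\right) = 0
\end{aligned}$$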
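EM for GMM: M-Step for Variances

$$\sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \left(\frac{(x_i - \mu_j)^2}{2\sigma_j^4} - \frac{1}{2\sigma_j^2}\right) = 0$$

$$\sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)(x_i - \mu_j)^2}{2\sigma_j^4} - \sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)}{2\sigma_j^2} = 0$$

$$\sum_{i=1}^{n} \frac{P(j \mid x_i, \theta^s)(x_i - \mu_j)^2}{\sigma_j^2} - \sum_{i=1}^{n} P(j \mid x_i, \theta^s) = 0$$

$$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)(x_i - \hat{\mu}_j)^2}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$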
EM for GMM: M-Step for Weights

We have the constraint that all weights $w_j$ should be positive and sum to 1:

$$\sum_{j=1}^{N} w_j = 1$$

Incorporating it into the system:

$$J(\theta, \theta^s) = A(\theta, \theta^s) + \left(1 - \sum_{j=1}^{N} w_j\right) \lambda_j$$

where $\lambda_j$ are Lagrange multipliers. So we need to differentiate $J$ with respect to $w_j$ and set the result to 0.
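EM for GMM: M-Step for Weights

$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial A(\theta, \theta^s)}{\partial w_j} - \lambda_j \\
&= \left(\sum_{i=1}^{n} P(j \mid x_i, \theta^s) \cdot \frac{1}{w_j}\right) - \lambda_j = 0
\end{aligned}$$

$$\hat{w}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}{\lambda_j}$$

and incorporating the probabilistic constraint, we get

$$\hat{w}_j = \frac{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}{\sum_{k=1}^{N} \sum_{i=1}^{n} P(k \mid x_i, \theta^s)} = \frac{1}{n} \sum_{i=1}^{n} P(j \mid x_i, \theta^s)$$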
EM for GMM: Update Rules

Means:

$$\hat{\mu}_j = \frac{\sum_{i=1}^{n} x_i \cdot P(j \mid x_i, \theta^s)}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$

Variances:

$$\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} (x_i - \hat{\mu}_j)^2 \cdot P(j \mid x_i, \theta^s)}{\sum_{i=1}^{n} P(j \mid x_i, \theta^s)}$$

Weights:

$$\hat{w}_j = \frac{1}{n} \sum_{i=1}^{n} P(j \mid x_i, \theta^s)$$
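The three update rules, combined with the E-Step posterior, give one complete EM iteration. The sketch below assumes one-dimensional data; the data set, the starting parameters and the names `em_step` and `log_liks` are illustrative choices, not part of the slides.

```python
import math

def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, weights, means, variances):
    """One EM iteration. Returns new weights, means, variances and the
    log likelihood of the data under the parameters passed in."""
    n, k = len(xs), len(weights)
    # E-Step: posteriors P(j | x_i, theta^s).
    post, log_lik = [], 0.0
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        log_lik += math.log(total)
        post.append([value / total for value in joint])
    # M-Step: apply the update rules Gaussian by Gaussian.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        resp = sum(post[i][j] for i in range(n))
        mean = sum(post[i][j] * xs[i] for i in range(n)) / resp
        var = sum(post[i][j] * (xs[i] - mean) ** 2 for i in range(n)) / resp
        new_w.append(resp / n)
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v, log_lik

# Hypothetical data with two well-separated groups and a rough starting point.
xs = [-2.1, -1.9, -2.3, -1.7, 2.0, 2.2, 1.8, 2.4, 1.9]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
log_liks = []
for _ in range(20):
    *params, log_lik = em_step(xs, *params)
    log_liks.append(log_lik)
```

The recorded `log_liks` sequence should never decrease, which is a convenient sanity check when implementing EM. This sketch does not guard against a Gaussian collapsing onto a single point.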
Outline: Initialization; Capacity Control; Adaptation
Initialization

- EM is an iterative procedure that is very sensitive to initial conditions!
- Start from trash, end up with trash.
- Hence, we need a good and fast initialization procedure.
- Often used: K-Means.
- Other options: hierarchical K-Means, Gaussian splitting.
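A K-Means initialization can be sketched as below for one-dimensional data. Several choices here are assumptions of this example rather than anything the slides specify: the deterministic seeding from evenly spaced sorted points, the fixed number of iterations, and the small floor on the initial variances. The name `kmeans_init` is hypothetical.

```python
def kmeans_init(xs, k, n_iter=10, min_var=1e-3):
    """Run a few K-Means iterations and turn the clusters into initial GMM parameters."""
    ordered = sorted(xs)
    # Assumption: seed the centers with evenly spaced points of the sorted data.
    centers = [ordered[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster (kept if empty).
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    weights = [len(c) / len(xs) for c in clusters]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), min_var) if c else 1.0
                 for c, m in zip(clusters, centers)]
    return weights, centers, variances

xs = [-2.1, -1.9, -2.3, -1.7, 2.0, 2.2, 1.8, 2.4, 1.9]
init_weights, init_means, init_variances = kmeans_init(xs, k=2)
```

The returned weights, means and variances can then be used as the starting parameters of EM.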
Capacity Control

How to control the capacity with GMMs?
- selecting the number of Gaussians
- constraining the value of the variances to be far from 0 (small variances = large capacity)

Use cross-validation on the desired criterion (maximum likelihood, classification, ...).
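One simple way to keep variances away from 0 is to clip them to a minimum value after each M-Step. The sketch below assumes that approach; the slides do not say how the constraint is enforced, and both the helper name and the floor value are hypothetical. In practice the floor would be one of the quantities chosen by cross-validation.

```python
def floor_variances(variances, floor):
    """Clip every variance to be at least `floor`."""
    return [max(v, floor) for v in variances]

# Hypothetical variances after an M-Step; the second one has nearly collapsed.
floored = floor_variances([0.5, 1e-8, 2.0], floor=1e-3)
```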
Adaptation Techniques

- In some cases, you have access to only a few examples coming from the target distribution...
- ... but many coming from a nearby distribution!
- How can we profit from the big nearby dataset?
- Solution: use adaptation techniques.
- The best known and most widely used for GMMs: Maximum A Posteriori (MAP) adaptation.
MAP Adaptation

Normal maximum likelihood training for a dataset $X$:

$$\theta^* = \arg\max_\theta p(X \mid \theta)$$

Maximum A Posteriori (MAP) training:

$$\begin{aligned}
\theta^* &= \arg\max_\theta p(\theta \mid X) \\
&= \arg\max_\theta \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \\
&= \arg\max_\theta p(X \mid \theta)\, p(\theta)
\end{aligned}$$

where $p(\theta)$ represents your prior belief about the distribution of the parameters $\theta$.
Implementation

Which kind of prior distribution for $p(\theta)$?

Two objectives:
- constraining $\theta$ to reasonable values
- keeping the EM algorithm tractable

Use conjugate priors:
- Dirichlet distribution for weights
- Gaussian densities for means and variances
What is a Conjugate Prior?

A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior. So we would like $p(X \mid \theta)\, p(\theta)$ to be distributed according to the same family as $p(\theta)$, and to be tractable.

Example:

Likelihood is Gaussian:

$$p(x \mid \theta) = K_1 \exp\left(-\frac{(x_1 - \mu_1)^2}{2\sigma_1^2}\right)$$

Prior is Gaussian:

$$p(\theta) = K_2 \exp\left(-\frac{(x_2 - \mu_2)^2}{2\sigma_2^2}\right)$$

Posterior is Gaussian:

$$\begin{aligned}
p(x \mid \theta)\, p(\theta) &= K_1 K_2 \exp\left(-\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} - \frac{(x_2 - \mu_2)^2}{2\sigma_2^2}\right) \\
&= K_3 \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\end{aligned}$$
Conjugate Prior of Multinomials

Multinomial distribution:

$$P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \frac{N!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} \theta_i^{x_i}$$

where the $x_i$ are non-negative integers and $\sum_{i=1}^{n} x_i = N$.

Dirichlet distribution with parameter $u$:

$$P(\theta \mid u) = \frac{1}{Z(u)} \prod_{i=1}^{n} \theta_i^{u_i - 1}$$

where $\theta_1, \ldots, \theta_n \geq 0$, $\sum_{i=1}^{n} \theta_i = 1$ and $u_1, \ldots, u_n \geq 0$.

The Dirichlet is the conjugate prior of the multinomial: combining the two gives a Dirichlet with parameter $x + u$:

$$P(X, \theta \mid u) = \frac{1}{Z} \prod_{i=1}^{n} \theta_i^{x_i + u_i - 1}$$
Examples of Conjugate Priors

| likelihood $p(x \mid \theta)$ | conjugate prior $p(\theta)$ | posterior $p(\theta \mid x)$ |
| --- | --- | --- |
| Gaussian($\theta$, $\sigma$) | Gaussian($\mu_0$, $\sigma_0$) | Gaussian($\mu_1$, $\sigma_1$) |
| Binomial($N$, $\theta$) | Beta($r$, $s$) | Beta($r + n$, $s + N - n$) |
| Poisson($\theta$) | Gamma($r$, $s$) | Gamma($r + n$, $s + 1$) |
| Multinomial($\theta_1, \ldots, \theta_k$) | Dirichlet($\alpha_1, \ldots, \alpha_k$) | Dirichlet($\alpha_1 + n_1, \ldots, \alpha_k + n_k$) |
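Two rows of the table translate into one-line posterior updates. The sketch below covers the Binomial/Beta and Multinomial/Dirichlet rows and compares the maximum likelihood and MAP estimates of $\theta$ on made-up counts. The function names and the numbers are illustrative, and the mode formula $(a - 1)/(a + b - 2)$ is valid only for $a, b > 1$.

```python
def beta_binomial_posterior(r, s, n_success, n_trials):
    """Beta(r, s) prior + n successes out of N trials -> Beta(r + n, s + N - n)."""
    return r + n_success, s + n_trials - n_success

def dirichlet_multinomial_posterior(alpha, counts):
    """Dirichlet(alpha) prior + observed counts -> Dirichlet(alpha + counts)."""
    return [a + c for a, c in zip(alpha, counts)]

def beta_mode(a, b):
    """Mode of Beta(a, b), i.e. the MAP estimate; valid only for a, b > 1."""
    return (a - 1.0) / (a + b - 2.0)

# Hypothetical experiment: 7 successes in 10 trials with a Beta(2, 2) prior.
post_a, post_b = beta_binomial_posterior(2, 2, n_success=7, n_trials=10)
ml_estimate = 7 / 10                        # maximum likelihood: n / N
map_estimate = beta_mode(post_a, post_b)    # pulled toward the prior mean 0.5
dirichlet_post = dirichlet_multinomial_posterior([1, 1, 1], [3, 0, 5])
```

The MAP estimate lies between the prior mean and the maximum likelihood estimate, which is the behavior MAP adaptation exploits when little target data is available.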
Simple Implementation for MAP-GMMs
Simple Implementation

- Train a generic prior model $p$ with a large amount of available data: $\{w_j^p, \mu_j^p, \sigma_j^p\}$
- One hyper-parameter: $\alpha \in [0, 1]$, the faith in the prior model
- Weights:

$$\hat{w}_j = \left[\alpha w_j^p + (1 - \alpha) \sum_{i=1}^{n} P(j \mid x_i)\right] \gamma$$

where $\gamma$ is a normalization factor (so that $\sum_j \hat{w}_j = 1$).
Simple Implementation

Means:

$$\hat{\mu}_j = \alpha \mu_j^p + (1 - \alpha) \frac{\sum_{i=1}^{n} P(j \mid x_i)\, x_i}{\sum_{i=1}^{n} P(j \mid x_i)}$$

Variances:

$$\hat{\sigma}_j = \alpha \left(\sigma_j^p + \mu_j^p \mu_j^{pT}\right) + (1 - \alpha) \frac{\sum_{i=1}^{n} P(j \mid x_i)\, x_i x_i^T}{\sum_{i=1}^{n} P(j \mid x_i)} - \hat{\mu}_j \hat{\mu}_j^T$$
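The adaptation formulas for weights, means and variances can be sketched for one-dimensional data as follows. This example assumes that the posteriors $P(j \mid x_i)$ have already been computed (for instance with the prior model), treats $\sigma_j^p$ as a variance, and falls back to the prior when a Gaussian receives no posterior mass. The name `map_adapt` and all the numbers are hypothetical.

```python
def map_adapt(xs, post, prior_w, prior_m, prior_v, alpha):
    """MAP-adapt a 1-D GMM; post[i][j] is P(j | x_i), alpha is the faith in the prior."""
    n, k = len(xs), len(prior_w)
    raw_w, new_m, new_v = [], [], []
    for j in range(k):
        resp = sum(post[i][j] for i in range(n))
        if resp == 0.0:
            # Assumption: a Gaussian with no posterior mass keeps its prior parameters.
            raw_w.append(alpha * prior_w[j])
            new_m.append(prior_m[j])
            new_v.append(prior_v[j])
            continue
        first = sum(post[i][j] * xs[i] for i in range(n)) / resp        # weighted mean
        second = sum(post[i][j] * xs[i] ** 2 for i in range(n)) / resp  # weighted second moment
        mean = alpha * prior_m[j] + (1.0 - alpha) * first
        var = alpha * (prior_v[j] + prior_m[j] ** 2) + (1.0 - alpha) * second - mean ** 2
        raw_w.append(alpha * prior_w[j] + (1.0 - alpha) * resp)
        new_m.append(mean)
        new_v.append(var)
    gamma = 1.0 / sum(raw_w)  # normalization factor so that the weights sum to 1
    return [w * gamma for w in raw_w], new_m, new_v

# Hypothetical prior model, data and hard posteriors.
xs = [0.0, 1.0, 4.0, 6.0]
post = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
prior = ([0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
adapted_w, adapted_m, adapted_v = map_adapt(xs, post, *prior, alpha=0.5)
```

Setting `alpha` to 1 returns the prior model unchanged, and setting it to 0 returns the maximum likelihood estimates for the given posteriors.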
Adapted GMMs for Person Authentication

Person authentication task: accept access if

$$P(S_i \mid X) > P(\bar{S}_i \mid X)$$

with $S_i$ a client, $\bar{S}_i$ all the other persons, and $X$ an access attributed to $S_i$.

Using Bayes theorem, this becomes:

$$\frac{p(X \mid S_i)}{p(X \mid \bar{S}_i)} > \frac{P(\bar{S}_i)}{P(S_i)} = \Delta_{S_i}$$

- $p(X \mid \bar{S}_i)$ is trained on a large dataset.
- $p(X \mid S_i)$ is MAP adapted from $p(X \mid \bar{S}_i)$.
- The threshold $\Delta_{S_i}$ is found on a separate validation set to optimize a given criterion.
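The decision rule is usually evaluated in the log domain, where the likelihood ratio becomes a difference of log likelihoods compared against the log of the threshold. The following sketch assumes one-dimensional features and independent frames within an access; the two models, the threshold value and the function names are made up for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_log_likelihood(xs, weights, means, variances):
    """log p(X | model), assuming the frames in xs are independent."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )

def accept_access(xs, client, world, log_threshold):
    """Accept if log p(X | client) - log p(X | world) exceeds log_threshold."""
    llr = gmm_log_likelihood(xs, *client) - gmm_log_likelihood(xs, *world)
    return llr > log_threshold

# Hypothetical models: a broad world model and a client model adapted from it.
world = ([0.5, 0.5], [-1.0, 1.0], [4.0, 4.0])
client = ([0.5, 0.5], [1.5, 2.5], [0.5, 0.5])
genuine = accept_access([2.0, 1.8, 2.2], client, world, log_threshold=0.0)
impostor = accept_access([-3.0, -2.5], client, world, log_threshold=0.0)
```

A `log_threshold` of 0 corresponds to equal priors for the client and everyone else; in practice the threshold would be tuned on a validation set as described above.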
More informationNatural Conjugate Gradient in Variational Inference
Natural Conjugate Gradient in Variational Inference Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen Adaptive Informatics Research Centre, Helsinki University of Technology P.O. Box 5400, FI02015
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationDesigning a learning system
Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x48845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationAn introduction to Hidden Markov Models
An introduction to Hidden Markov Models Christian Kohlschein Abstract Hidden Markov Models (HMM) are commonly defined as stochastic finite state machines. Formally a HMM can be described as a 5tuple Ω
More information10810 /02710 Computational Genomics. Clustering expression data
10810 /02710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intracluster similarity low intercluster similarity Informally,
More informationA LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA
REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationRevisiting kmeans: New Algorithms via Bayesian Nonparametrics
Brian Kulis Department of CSE, Ohio State University, Columbus, OH Michael I. Jordan Departments of EECS and Statistics, University of California, Berkeley, CA KULIS@CSE.OHIOSTATE.EDU JORDAN@EECS.BERKELEY.EDU
More informationHealth Status Monitoring Through Analysis of Behavioral Patterns
Health Status Monitoring Through Analysis of Behavioral Patterns Tracy Barger 1, Donald Brown 1, and Majd Alwan 2 1 University of Virginia, Systems and Information Engineering, Charlottesville, VA 2 University
More informationClassification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More information2. Meanvariance portfolio theory
2. Meanvariance portfolio theory (2.1) Markowitz s meanvariance formulation (2.2) Twofund theorem (2.3) Inclusion of the riskfree asset 1 2.1 Markowitz meanvariance formulation Suppose there are N
More informationA Survey of Kernel Clustering Methods
A Survey of Kernel Clustering Methods Maurizio Filippone, Francesco Camastra, Francesco Masulli and Stefano Rovetta Presented by: Kedar Grama Outline Unsupervised Learning and Clustering Types of clustering
More informationClustering. 15381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv BarJoseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationLecture 4: BK inequality 27th August and 6th September, 2007
CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound
More informationBayesian Clustering for Email Campaign Detection
Peter Haider haider@cs.unipotsdam.de Tobias Scheffer scheffer@cs.unipotsdam.de University of Potsdam, Department of Computer Science, AugustBebelStrasse 89, 14482 Potsdam, Germany Abstract We discuss
More informationFactor Analysis. Factor Analysis
Factor Analysis Principal Components Analysis, e.g. of stock price movements, sometimes suggests that several variables may be responding to a small number of underlying forces. In the factor model, we
More informationIntroduction to latent variable models
Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their
More informationEfficiency and the CramérRao Inequality
Chapter Efficiency and the CramérRao Inequality Clearly we would like an unbiased estimator ˆφ (X of φ (θ to produce, in the long run, estimates which are fairly concentrated i.e. have high precision.
More informationThe DirichletMultinomial and DirichletCategorical models for Bayesian inference
The DirichletMultinomial and DirichletCategorical models for Bayesian inference Stephen Tu tu.stephenl@gmail.com 1 Introduction This document collects in one place various results for both the Dirichletmultinomial
More informationModel for dynamic website optimization
Chapter 10 Model for dynamic website optimization In this chapter we develop a mathematical model for dynamic website optimization. The model is designed to optimize websites through autonomous management
More informationSelection Sampling from Large Data sets for Targeted Inference in Mixture Modeling
Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling Ioanna Manolopoulou, Cliburn Chan and Mike West December 29, 2009 Abstract One of the challenges of Markov chain Monte
More informationL15: statistical clustering
Similarity measures Criterion functions Cluster validity Flat clustering algorithms kmeans ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern
More informationComputational Statistics for Big Data
Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationBayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application
Iraqi Journal of Statistical Science (18) 2010 p.p. [3558] Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application ABSTRACT Nawzad. M. Ahmad * Bayesian decision theory
More informationA Scalable Formulation of Probabilistic Linear Discriminant Analysis: Applied to Face Recognition
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL 35, NO 7, JULY 2013 1 A Scalable Formulation of Probabilistic Linear Discriminant Analysis: Applied to Face Recognition Laurent El Shafey,
More informationCS229 Lecture notes. Andrew Ng
CS229 Lecture notes Andrew Ng Part X Factor analysis Whenwehavedatax (i) R n thatcomesfromamixtureofseveral Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting, we usually
More information