Christfried Webers, Canberra, February to June 2015

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop


Statistical Machine Learning Group and College of Engineering and Computer Science, Canberra, February to June 2015. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part VIII: Linear Classification 2

Three Models for Decision Problems

In increasing order of complexity:

Discriminant functions
1. Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.

Generative models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

Generative approach: model the class-conditional densities p(x | C_k) and the priors p(C_k) to calculate the posterior probability for class C_1:

\[
p(C_1 | x) = \frac{p(x | C_1)\, p(C_1)}{p(x | C_1)\, p(C_1) + p(x | C_2)\, p(C_2)}
           = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)),
\]

where a(x) and the logistic sigmoid function σ(a) are given by

\[
a(x) = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)},
\qquad
\sigma(a) = \frac{1}{1 + \exp(-a)}.
\]

Logistic Sigmoid

The logistic sigmoid function σ(a) = 1 / (1 + exp(-a)) is sometimes called a "squashing function" because it maps the whole real axis into the finite interval (0, 1).

Symmetry: σ(-a) = 1 - σ(a).

Derivative:
\[
\frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a)).
\]

Its inverse is called the logit function:
\[
a(\sigma) = \ln \frac{\sigma}{1 - \sigma}.
\]

[Figure: the logistic sigmoid σ(a) and the logit function a(σ).]
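As a quick numerical check of these identities, here is a minimal numpy sketch (the function names are illustrative, not from any course code):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    """d/da sigma(a) = sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

def logit(sigma):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return np.log(sigma / (1.0 - sigma))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)
assert np.allclose(sigmoid(-a), 1.0 - s)     # symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(logit(s), a)              # the logit inverts the sigmoid
h = 1e-6                                     # finite-difference check of the derivative
assert np.allclose(sigmoid_derivative(a), (sigmoid(a + h) - sigmoid(a - h)) / (2 * h))
```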

Probabilistic Generative Models - Multiclass

The normalised exponential is given by
\[
p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
a_k = \ln\left( p(x | C_k)\, p(C_k) \right).
\]

It is also called the softmax function, as it is a smoothed version of the max function.

Example: if a_k >> a_j for all j != k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
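A small sketch of the normalised exponential in numpy; subtracting the maximum activation before exponentiating is a standard numerical-stability trick assumed here, not something taken from the slides:

```python
import numpy as np

def softmax(a):
    """Normalised exponential: p(C_k|x) = exp(a_k) / sum_j exp(a_j).

    Subtracting max(a) leaves the result unchanged (it cancels in the ratio)
    but avoids overflow for large activations.
    """
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

# If a_k >> a_j for all j != k, the output approaches a one-hot vector.
print(softmax([10.0, 0.0, -2.0]))   # ~ [0.99995, 4.5e-05, 6.1e-06]
```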

Probabilistic Generative Models

Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?

\[
p(x | C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}
           = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \right\}
             \exp\left\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\},
\]

where we have separated the quadratic term in x from the term that is linear in x.

Probabilistic Generative Models

For two classes, p(C_1 | x) = σ(a(x)), and a(x) is
\[
a(x) = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)}
     = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}
                {\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}}
       + \ln \frac{p(C_1)}{p(C_2)}.
\]

Therefore
\[
p(C_1 | x) = \sigma(w^T x + w_0),
\]
where
\[
w = \Sigma^{-1} (\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.
\]
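A hedged sketch of this result: given class means, a shared covariance and a prior, compute w, w_0 and the posterior p(C_1 | x). The function name and the toy numbers are purely illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_two_class_posterior(x, mu1, mu2, Sigma, pi1):
    """Posterior p(C_1|x) = sigma(w^T x + w_0) for Gaussian class-conditionals
    with shared covariance Sigma and prior p(C_1) = pi1."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi1 / (1.0 - pi1)))
    return sigmoid(w @ x + w0)

mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.eye(2)
# At the midpoint between the means with equal priors the posterior is 0.5.
print(gaussian_two_class_posterior(np.array([0.0, 0.0]), mu1, mu2, Sigma, 0.5))
```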

Probabilistic Generative Models

[Figure: class-conditional densities for two classes (left) and the posterior probability p(C_1 | x) (right). Note that the posterior is a logistic sigmoid of a linear function of x.]

General Case - K Classes, Shared Covariance

Use the normalised exponential
\[
p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad
a_k = \ln\left( p(x | C_k)\, p(C_k) \right),
\]
to get a linear function of x,
\[
a_k(x) = w_k^T x + w_{k0},
\]
where
\[
w_k = \Sigma^{-1} \mu_k,
\qquad
w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).
\]
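The same construction for K classes, as a short sketch (the means, priors and the query point below are illustrative; the final step reuses the normalised exponential from above):

```python
import numpy as np

def k_class_posteriors(x, mus, Sigma, priors):
    """Posteriors p(C_k|x) from a_k(x) = w_k^T x + w_k0 with shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    a = np.array([
        mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(p)   # w_k^T x + w_k0
        for mu, p in zip(mus, priors)
    ])
    e = np.exp(a - a.max())      # normalised exponential (softmax) over the activations
    return e / e.sum()

mus = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([-2.0, -2.0])]
priors = [1 / 3, 1 / 3, 1 / 3]
print(k_class_posteriors(np.array([1.5, 0.5]), mus, np.eye(2), priors))
```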

General Case - K Classes, Different Covariance

If each class-conditional probability has its own covariance matrix, the quadratic terms -1/2 x^T Σ_k^{-1} x no longer cancel each other out. We get a quadratic discriminant.

[Figure: decision boundaries; with class-dependent covariances the discriminant is quadratic.]

Maximum Likelihood Solution

Given the functional form of the class-conditional densities p(x | C_k), can we determine the parameters µ and Σ?

Maximum Likelihood Solution

Given the functional form of the class-conditional densities p(x | C_k), can we determine the parameters µ and Σ? Not without data ;-)

Suppose we are also given a data set (x_n, t_n) for n = 1, ..., N, using the coding scheme where t_n = 1 corresponds to class C_1 and t_n = 0 to class C_2.

Assume the class-conditional densities to be Gaussian with the same covariance but different means. Denote the prior probability p(C_1) = π, and therefore p(C_2) = 1 - π. Then
\[
p(x_n, C_1) = p(C_1)\, p(x_n | C_1) = \pi\, \mathcal{N}(x_n | \mu_1, \Sigma),
\qquad
p(x_n, C_2) = p(C_2)\, p(x_n | C_2) = (1 - \pi)\, \mathcal{N}(x_n | \mu_2, \Sigma).
\]

Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by
\[
p(\mathbf{t}, X | \pi, \mu_1, \mu_2, \Sigma)
  = \prod_{n=1}^{N} \left[ \pi\, \mathcal{N}(x_n | \mu_1, \Sigma) \right]^{t_n}
                    \left[ (1 - \pi)\, \mathcal{N}(x_n | \mu_2, \Sigma) \right]^{1 - t_n}.
\]

Maximise the log likelihood. The term depending on π is
\[
\sum_{n=1}^{N} \left\{ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \right\},
\]
which is maximal for
\[
\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2},
\]
where N_1 is the number of data points in class C_1 (and N_2 the number in class C_2).

Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood p(t, X | π, µ_1, µ_2, Σ)) with respect to the means µ_1 and µ_2, and get
\[
\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n,
\qquad
\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n.
\]

For each class, these are the means of all input vectors assigned to that class.

Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ_1, µ_2, Σ) can be maximised with respect to the covariance Σ, resulting in
\[
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,
\qquad
S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.
\]
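Putting the maximum likelihood estimates for π, µ_1, µ_2 and Σ together, a minimal sketch under the 0/1 coding scheme above (the synthetic data at the bottom is only for illustration):

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """Maximum likelihood estimates for the two-class generative model.

    X : (N, D) inputs; t : (N,) labels with t_n = 1 for C_1 and t_n = 0 for C_2.
    Returns (pi, mu1, mu2, Sigma) with Sigma = N1/N * S1 + N2/N * S2.
    """
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 0.0], 1.0, (100, 2)),
               rng.normal([-2.0, 0.0], 1.0, (100, 2))])
t = np.concatenate([np.ones(100), np.zeros(100)])
print(fit_shared_covariance_gaussians(X, t))
```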

Discrete Features - Naive Bayes

Assume the input space consists of discrete features, in the simplest case x_i ∈ {0, 1}.

For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries. Together with the normalisation constraint, these are 2^D - 1 independent variables, a number that grows exponentially with the number of features.

The naive Bayes assumption is that all features, conditioned on the class C_k, are independent of each other:
\[
p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
\]

Discrete Features - Naive Bayes

With the naive Bayes assumption
\[
p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i},
\]
we can then again find the factors a_k in the normalised exponential
\[
p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]
as linear functions of the x_i:
\[
a_k(x) = \sum_{i=1}^{D} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k).
\]
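A small sketch of the naive Bayes activation for binary features; the parameter values µ_ki and the priors below are made up for illustration:

```python
import numpy as np

def naive_bayes_activation(x, mu_k, prior_k):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)
    for binary features x_i in {0, 1}."""
    x = np.asarray(x, dtype=float)
    mu_k = np.asarray(mu_k, dtype=float)
    return np.sum(x * np.log(mu_k) + (1 - x) * np.log(1 - mu_k)) + np.log(prior_k)

x = np.array([1, 0, 1, 1])
a1 = naive_bayes_activation(x, [0.8, 0.2, 0.7, 0.9], 0.5)
a2 = naive_bayes_activation(x, [0.3, 0.6, 0.4, 0.2], 0.5)
# Posterior for class C_1 via the normalised exponential over (a1, a2):
print(np.exp(a1) / (np.exp(a1) + np.exp(a2)))
```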

Three Models for Decision Problems

In increasing order of complexity:

Discriminant functions
1. Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative models
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.

Generative models
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

Probabilistic Discriminative Models

Maximise a likelihood function defined through the conditional distribution p(C_k | x) directly: discriminative training.

There are typically fewer parameters to be determined than in the generative approach.

As we learn the posterior p(C_k | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | C_k) poorly approximate the true distributions.

But: discriminative models cannot create synthetic data, as p(x) is not modelled.

Original Input versus Feature Space

So far we have used the direct input x. All the classification algorithms work equally well if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).

Example: use two Gaussian basis functions centred at the green crosses in the input space.

[Figure: input space (x_1, x_2) with the basis-function centres marked, and the corresponding feature space (φ_1, φ_2).]
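A minimal sketch of such a fixed feature map with two Gaussian basis functions; the centres and the width s are illustrative choices, not values read off the figure:

```python
import numpy as np

def gaussian_features(x, centres, s=0.5):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each basis centre c_j.
    The centres and the width s are illustrative choices."""
    x = np.asarray(x, dtype=float)
    return np.array([np.exp(-np.sum((x - c) ** 2) / (2 * s ** 2)) for c in centres])

# Two illustrative centres in a 2-D input space (standing in for the "green crosses").
centres = [np.array([-0.5, 0.0]), np.array([0.5, 0.0])]
print(gaussian_features(np.array([0.0, 0.0]), centres))   # a point midway between them
```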

Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.

Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.

BUT: if classes overlap in input space, they will also overlap in feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!

[Figure: nonlinear decision boundary in the input space (x_1, x_2) and the corresponding linear boundary in the feature space (φ_1, φ_2).]

Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the earlier discussion of linear models for regression).

Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.

Some applications use fixed features successfully by avoiding the limitations.

We will therefore use φ instead of x from now on.

Logistic Regression is Classification

Two classes, where the posterior of class C_1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ:
\[
p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi),
\qquad
p(C_2 | \phi) = 1 - p(C_1 | \phi).
\]

The model dimension is equal to the dimension M of the feature space.

Compare this to fitting two Gaussians with a shared covariance:
\[
\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2 \ \text{parameters}.
\]

For larger M, the logistic regression model has a clear advantage.

Logistic Regression is Classification

Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N, where φ_n = φ(x_n). The class membership is coded as t_n ∈ {0, 1}.

Likelihood function:
\[
p(\mathbf{t} | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},
\qquad
y_n = p(C_1 | \phi_n).
\]

Error function: the negative log likelihood, resulting in the cross-entropy error function
\[
E(w) = -\ln p(\mathbf{t} | w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}.
\]

Logistic Regression is Classification

Error function (cross-entropy error):
\[
E(w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\},
\qquad
y_n = p(C_1 | \phi_n) = \sigma(w^T \phi_n).
\]

Gradient of the error function (using dσ/da = σ(1 - σ)):
\[
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.
\]

The gradient does not contain any sigmoid function; for each data point the contribution is the product of the deviation y_n - t_n and the basis function vector φ_n.

BUT: the maximum likelihood solution can exhibit severe over-fitting, even for many data points, in particular when the data are linearly separable; one should then use a regularised error function or a MAP estimate.
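A hedged sketch of maximum likelihood training by plain batch gradient descent on the cross-entropy error, with an optional quadratic regulariser to address the over-fitting issue just mentioned. The learning rate, iteration count and synthetic data are arbitrary choices; the slides themselves do not prescribe a particular optimiser:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, lr=0.1, n_iter=1000, lam=0.0):
    """Gradient descent on E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)].

    Phi : (N, M) design matrix of features phi(x_n); t : (N,) labels in {0, 1}.
    lam adds an optional quadratic penalty lam/2 * ||w||^2 (a simple regulariser).
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w   # gradient of the (regularised) error
        w -= lr * grad / N
    return w

rng = np.random.default_rng(1)
Phi = np.column_stack([np.ones(200), rng.normal(size=200)])      # bias + one feature
t = (Phi[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)   # noisy threshold labels
print(fit_logistic_regression(Phi, t, lam=1e-2))
```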

Laplace Approximation

Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?

We need to find a mode of p(x), and then try to find a Gaussian with the same mode.

[Figure: the non-Gaussian density (yellow) and its Gaussian approximation (red); negative logarithm of the non-Gaussian (yellow) and of the Gaussian approximation (red).]

Laplace Approximation

Assume p(z) can be written as
\[
p(z) = \frac{1}{Z} f(z), \qquad Z = \int f(z)\, dz,
\]
where, furthermore, the normalisation constant Z is unknown.

A mode of p(z) is at a point z_0 where p'(z_0) = 0.

Taylor expansion of ln f(z) at z_0:
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2,
\qquad
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]

Laplace Approximation

Exponentiating
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2
\]
we get
\[
f(z) \simeq f(z_0) \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\},
\]
and after normalisation we get the Laplace approximation
\[
q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\}.
\]

This is only defined for precision A > 0, as only then does p(z) have a maximum at z_0.
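A one-dimensional sketch of the procedure: find the mode numerically on a grid and estimate A by a finite-difference second derivative. The target density below is an arbitrary non-Gaussian example, not the one shown in the figure:

```python
import numpy as np

def laplace_approx_1d(log_f, z_grid, h=1e-4):
    """Laplace approximation q(z) = N(z | z0, 1/A) for p(z) proportional to f(z).

    z0 is taken as the maximiser of log_f on a grid (a crude numerical mode search),
    and A = -d^2/dz^2 log f(z) at z0 via central differences.
    """
    z0 = z_grid[np.argmax(log_f(z_grid))]
    A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h ** 2
    return z0, A   # q(z) = sqrt(A / (2*pi)) * exp(-A/2 * (z - z0)^2)

# Illustrative unnormalised target: f(z) = exp(-z^2 / 2) * sigmoid(4 z).
log_f = lambda z: -0.5 * z ** 2 - np.log1p(np.exp(-4 * z))
z0, A = laplace_approx_1d(log_f, np.linspace(-3, 4, 2001))
print(z0, A, 1.0 / A)   # mode, precision, and variance of the Gaussian approximation
```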

Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where
\[
p(z) = \frac{1}{Z} f(z).
\]

We get the Taylor expansion
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} (z - z_0)^T A (z - z_0),
\]
where the Hessian A is defined as
\[
A = -\nabla \nabla \ln f(z) \big|_{z = z_0}.
\]

The Laplace approximation of p(z) is then
\[
q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\tfrac{1}{2} (z - z_0)^T A (z - z_0) \right\}
     = \mathcal{N}(z | z_0, A^{-1}).
\]

Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable. Why?

We would need to normalise a product of the prior and the likelihood, where the likelihood is itself a product of logistic sigmoid functions, one for each data point.

Evaluation of the predictive distribution is also intractable.

Therefore we will use the Laplace approximation.

Bayesian Logistic Regression

Assume a Gaussian prior, because we want a Gaussian posterior:
\[
p(w) = \mathcal{N}(w | m_0, S_0)
\]
for fixed hyperparameters m_0 and S_0. (Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.)

For a set of training data (x_n, t_n), n = 1, ..., N, the posterior is given by
\[
p(w | \mathbf{t}) \propto p(w)\, p(\mathbf{t} | w),
\]
where t = (t_1, ..., t_N)^T.

Bayesian Logistic Regression

Using our previous result for the cross-entropy error function
\[
E(w) = -\ln p(\mathbf{t} | w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\},
\]
we can now calculate the log of the posterior p(w | t) ∝ p(w) p(t | w), using the notation y_n = σ(w^T φ_n), as
\[
\ln p(w | \mathbf{t}) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
  + \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} + \text{const}.
\]

Bayesian Logistic Regression

To obtain a Gaussian approximation to ln p(w | t):

1. Find w_MAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear optimisation problem in w because y_n = σ(w^T φ_n).)

2. Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation:
\[
S_N^{-1} = -\nabla \nabla \ln p(w | \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]

Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is then
\[
q(w) = \mathcal{N}(w | w_{MAP}, S_N),
\]
where
\[
S_N^{-1} = -\nabla \nabla \ln p(w | \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]
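A hedged end-to-end sketch: find w_MAP by simple gradient ascent on the log posterior and then form S_N from the formula above. The prior, step size and synthetic data are illustrative; Newton/IRLS updates (not shown) would be the more standard choice for the inner optimisation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, lr=0.1, n_iter=2000):
    """Gaussian approximation q(w) = N(w | w_MAP, S_N) to p(w|t).

    w_MAP is found by plain gradient ascent on
        ln p(w|t) = -1/2 (w - m0)^T S0^{-1} (w - m0)
                    + sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)] + const,
    and S_N^{-1} = S0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T.
    """
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)   # gradient of ln p(w|t)
        w += lr * grad / len(t)
    y = sigmoid(Phi @ w)
    S_N_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(S_N_inv)

rng = np.random.default_rng(2)
Phi = np.column_stack([np.ones(100), rng.normal(size=100)])
t = (Phi[:, 1] + 0.5 * rng.normal(size=100) > 0).astype(float)
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2))
print(w_map)
print(S_N)
```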