These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Similar documents

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Linear Models for Classification

Christfried Webers. Canberra February June 2015

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

STA 4273H: Statistical Machine Learning

Statistical Machine Learning

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Lecture 3: Linear methods for classification

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Linear Classification. Volker Tresp Summer 2015

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Linear Threshold Units

Machine Learning and Pattern Recognition Logistic Regression

CSCI567 Machine Learning (Fall 2014)

Classification Problems

Pattern Analysis. Logistic Regression. 12. Mai Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Lecture 8 February 4

Machine Learning Logistic Regression

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

CSC 411: Lecture 07: Multiclass Classification

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CSI:FLORIDA. Section 4.4: Logistic Regression

Lecture 6: Logistic Regression

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

Logistic Regression (1/24/13)

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Local classification and local likelihoods

Wes, Delaram, and Emily MA751. Exercise p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

An Introduction to Machine Learning

IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...

Feed-Forward mapping networks KAIST 바이오및뇌공학과 정재승

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Support Vector Machines Explained

Reject Inference in Credit Scoring. Jie-Men Mok

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Support Vector Machine (SVM)

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

The equivalence of logistic regression and maximum entropy models

Lecture 2: The SVM classifier

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

A Simple Introduction to Support Vector Machines

Statistical Machine Learning from Data

LCs for Binary Classification

Making Sense of the Mayhem: Machine Learning and March Madness

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Classification by Pairwise Coupling

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Nonlinear Optimization: Algorithms 3: Interior-point methods

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Introduction to General and Generalized Linear Models

Supervised Learning (Big Data Analytics)

1 Maximum likelihood estimation

Machine Learning in Spam Filtering

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Predict Influencers in the Social Network

Efficient Streaming Classification Methods

3F3: Signal and Pattern Processing

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Lecture 9: Introduction to Pattern Analysis

Simple and efficient online algorithms for real world applications

Data Mining Practical Machine Learning Tools and Techniques

Least-Squares Intersection of Lines

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Penalized Logistic Regression and Classification of Microarray Data

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

One-Class Classifiers: A Review and Analysis of Suitability in the Context of Mobile-Masquerader Detection

Logistic Regression for Spam Filtering

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Support Vector Machines for Classification and Regression

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Introduction to Online Learning Theory

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Question 2 Naïve Bayes (16 points)

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Introduction to Logistic Regression

Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006

The zero-adjusted Inverse Gaussian distribution as a model for insurance claims

Classification Techniques for Remote Sensing

L4: Bayesian Decision Theory

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

Natural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression

4.1 Learning algorithms for neural networks

Basics of Statistical Machine Learning

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Logistic Regression for Data Mining and High-Dimensional Classification

Big Data Analytics CSCI 4030

Bayesian Classifier for a Gaussian Distribution, Decision Surface Equation, with Application

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

CHAPTER 2 Estimating Probabilities

Transcription:

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Linear Models for Classification Goal: take input vector x and map it onto one of K discrete classes Input space divided into decision regions divided by decision surfaces or decision boundaries Consider linear models: those separable by (D-1) dimensional hyperplanes in the D-dimensional input space. Perfect separation: linearly-separable Recall simplest linear regression model: y(x) = w T x + w 0 Use activation function f(. ) to map function onto discrete classes y(x) = f ( w T ) x + w 0 Due to f(. ) these models are no longer linear in the parameters 2

Discriminant Functions 2 class case, simplest: y(x) = w T x + w 0 where w is a weight vector and wo is the bias Decision boundary is 0. Consider 2 points xa and xa lying on decision surface. Because y(xa) = y(xb) = 0 we have w T (x A x B ) = 0 thus vector w is orthogonal to every vector lying within decision surface If x is on decision surface then y(x)=0 indicating that: w T x w = w 0 w We thus see that bias wo decides location of decision surface 3

Geometry of linear discriminant y > 0 y = 0 y < 0 x 2 R 2 R 1 w x y(x) w x x 1 w 0 w Decision surface is perpendicular to w. Displacement controlled by w0. 4

Multiple Classes Not generally good idea to use multiple 2-class classifiers to do K- class classification Leads to ambiguous regions. See figure:? C 1 C 3 R 1 R 1 R 2 C 1? R 3 C 1 R 3 not C 1 C 2 C 2 R 2 C 3 not C 2 C 2 Left: distinguish points in Ck from those not in Ck. Right: pairwise separation of classes Ck and Cj. Both lead to ambiguous regions. 5

Single K-class classifier Single discriminant comprising K linear functions of form y k (x) = wk T x + w k0 Point x belongs in class Ck if yk(x) > yj(x) for all j k Decision boundary between Ck and Cj is had by yk(x) = yj(x) and corresponds to (D-1)-dimensional hyperplane (w k w j ) T x + (w k0 w j0 ) = 0 Decision region singly connected and convex (due to linearity of discriminant functions. See Bishop pg 184. R j R i R k x A ˆx x B 6

Least Squares Classification Approximates conditional expectation E[t x] Each class Ck described by its own linear model y k (x) = w T k x + w k0 Minimize sum-of-squared errors on set of models Works very poorly. E.g. very sensitive compared to logistic regression to outliers. 4 4 2 2 0 0 2 2 4 4 6 6 8 4 2 0 2 4 6 8 8 4 2 0 2 4 6 8 7

Least Squares Classification Second example: even for easy dataset with clear decision boundaries one class is underrepresented (compare to logistic regression below) Behavior not surprising when we consider that it corresponds to ML under assumption of Gaussian conditional distribution. Binary target vectors are far from Gaussian. 6 6 4 4 2 2 0 0 2 2 4 4 6 6 4 2 0 2 4 6 6 6 4 2 0 2 4 6 8

Fisher s linear discriminant Can look at classification as dimensionality reduction: take D dimensional input vector and project it down to 1 dimension using: y = w T x Use threshold such that y w0 is classified as C0 (Yields standard linear classifier from previous slide). Adjust weight vector to project with maximum class separation. Consider 2-class version: m 1 = 1 N 1 n C 1 x n m 2 = 1 N 2 n C 2 x n Simplest measure of class separation is separation of class means. Choose w to maximize m 2 m 1 = w T (m 2 m 1 ) where m k = w T m k si the mean of the projected data from class Ck. 9

Fisher s linear discriminant Still problematic: when there is strongly non-diagonal covariance between class distributions, separation is poor: 4 2 0 2 2 2 6 Fishers: maximize separating between class means while minimizing variance within each class. Within class variance is: s 2 k = (y n m k ) 2 where y n = w T x n n C k Total within-class variance is s 2 1 + s 2 2 Fisher criterion is ratio of the between-class mean and the within-class J(w) = (m 2 m 1 ) 2 s 2 1 + s2 2 10

Geometry of Fisher s 4 4 2 2 0 0 2 2 2 2 6 2 2 6 Left: samples from two classes along with histograms resulting from projection onto line joining the class means (reproduced from previous slide). Right: corresponding projection based on Fisher linear discriminant. 11

Fisher s criterion: Fisher s linear discriminant J(w) = (m 2 m 1 ) 2 s 2 1 + s2 2 Can relate this to weights w by rewriting as: where SB is between-class covariance and where SW is within-class covariance S W = n C 1 (x n m 1 )(x n m 1 ) T + n C 2 (x n m 2 )(x n m 2 ) T J(w) = wt S B w w T S W w S B = (m 2 m 1 )(m 2 m 1 ) T Differentiating J(w) with respect to w reveals maximum when: (w T S B w)s W w = (w T S W w)s B w SBw is always in the direction of (m2 - m1). Don t care about magnitiude of w so drop scalar factors Multiply both sides by yields Fisher s linear discriminant S 1 W w S 1 W (m 2 m 1 ) 12 (w T S B w) and (w T S W w)

Fisher s linear discriminant Least squares approach to linear discriminant was to make model predictions as close as possible to target values Fisher was derived based on maximum separation of classes Fisher can still be derived as special case of least squares Trick is to recode target for C1 to be N/N1 where N1 is the number of patterns in class C1 and N is the total number of patterns. For C2 target is -N/N2. Approximates reciprocal of the prior for the class. Read 4.1.7 Fisher s criterion can be extended to >2 classes. Read 4.1.6 13

Rosenblatt (1962) Perceptron algorithm. Linear model with step activation function: y(x) = f(w T φ(x)) f(a) = { +1, a 0 1, a < 0. Train using perceptron criterion E P = n M w T φ n t n where M is the set of misclassified patterns. Note that direct misclassification using total number of miscalssified patterns will not work because of nonlinear f() (the gradient is ill-conditioned). Instead we seek a weight vector such that x n C 1, w T φ(x n ) > 0 and x n C 2, w T φ(x n ) < 0. 14

Perceptron algorithm. Total error function is piecewise linear (across all misclassified patterns). Stochastic gradient descent: w (τ+1) = w (τ) η E P (w) = w (τ) + ηφ n t n Update is not a function of w thus η can be equal to 1 Perceptron convergence theorem: If there exists an exact solution (i.e. if training set is indeed linearly separable) then PA will find a solution in finite number of steps. Attacked by Minsky and Papert in Perceptrons (1969). Attack valid only for single-layer perceptrons. Still, stopped research in neural computation for nearly a decade. 15

Geometry of Perceptron 1 1 0.5 0.5 0 0 0.5 0.5 1 1 0.5 0 0.5 1 1 1 1 0.5 0 0.5 1 1 0.5 0.5 0 0 0.5 0.5 1 1 0.5 0 0.5 1 1 1 0.5 0 0.5 1 Convergence for 2 classes (red and blue). Black arrow = w; black line is decision boundary. On top left, the point circled in green is misclassified thus its feature vector is added to w. This is repeated on the bottom 16

Probabilistic Generative Models Model class-conditional densities Posterior probability for class C1: p(c 1 x) = where we have defined σ is the logistic sigmoid; Note that and that the inverse of σ is the logit function a = ln = ( σ 1 σ representing ratio of probabilities p(x C k ) p(x C 1 )p(c 1 ) p(x C 1 )p(c 1 ) + p(x C 2 )p(c 2 ) 1 1 + exp( a) = σ(a) ) a = ln p(x C 1)p(C 1 ) p(x C 2 )p(c 2 ) σ( a) = 1 σ(a) a = ln[p(c 1 x)/p(c 2 x)] 17

Probabilistic Generative Models Generalization to multiple classes: normalized exponential function p(c k x) = where = p(x C k )p(c k ) j p(x C j)p(c j ) exp(a k ) j exp(a j) a k = ln p(x C k )p(c k ) Also known as softmax function because it is a smoothed version of max. Different representations for class-conditional densities yield different consequences in how classification is done [see following slides...] 18

Continuous inputs First assume all classes share same covariance matrix and only 2 classes. This yields: p(c 1 x) = σ(w T x + w 0 ) where: w = Σ 1 (µ 1 µ 2 ) w 0 = 1 2 µt 1 Σ 1 µ 1 + 1 2 µt 2 Σ 1 µ 2 + ln p(c 1) p(c 2 ) Quadratic term from Gaussians vanishes. priors p(ck) only enter via bias parameter. Thus make parallel shift in decision boundry. For general case of K classes we have: a k (x) = wk T x + w k0 where: w k = Σ 1 (µ k ) w k0 = 1 2 µt k Σ 1 µ k + ln p(c k ) 19

Continuous inputs Left: class-conditional densities for two classes, red and blue. Right, corresponding posterior probability p(c1 x) which is given by a logistic sigmoid of a linear function of x. On right: proportion of red yields p(c1 x), proportion of blue yields p(c2 x) = 1-p(C1 x) 20

Linear versus quadratic When covariance is shared by classes, decision boundary is linear. When covariances are unlinked, decision boundary is quadratic. See Bishop p198-199 for details 2.5 2 1.5 1 0.5 0 0.5 1 1.5 2 2.5 2 1 0 1 2 Left: class conditional densities for three Gaussians. Green and red have same covariance matrix. Right: Decision boundary is linear where covariances are the same, quadratic where covariances differ. 21

Maximum likelihood Now have a parametric form for class-conditional densities p(x Ck). Can now determine values of parameters and priors p(ck). p(x n, C 1 ) = p(c 1 )p(x n C 1 ) = πn (x n µ 1, Σ) p(x n, C 2 ) = p(c 2 )p(x n C 2 ) = (1 π)n (x n µ 2, Σ) Yields likelihood: p(t π, µ 1, µ 2, Σ) = N [πn (x n µ 1, Σ)] t n [(1 π)n (x n µ 2, Σ)] 1 t n n=1 Maximize log-likelihood. For π relevant terms are: N {t n ln π + (1 t n ) ln(1 π)} n=1 set derivative with respect to π to 0 yields: π = 1 N N n=1 t n = N 1 N = N 1 N 1 + N 2 22

Probabilistic Discriminative Models Generative approach: For two-class classification, posterior of C1 is logistic sigmoid acting on linear function of x for many distributions p(x Ck). Used ML to determine parameters of densities as well as class priors p(ck). Then use Bayes to determine posterior class probabilities Discriminiative approach: Learn parameters of general linear model explicitly using ML. Maximize likelihood of conditional p(ck x) directly. Advantages of discriminiative approach: fewer parameters, better performance when class-conditional densities do not correspond well to true distributions 23

Fixed Basis Functions 1 1 0 0.5 1 0 1 0 1 0 0.5 1 Left: original data is not linearly separable. Two Gaussian basis functions ϕ1 and ϕ1 are defined, as shown by green lines. Right: New linearly-separable space as created by the Gaussians. Line is nonlinear in the original variables. 24

Logistic regression Posterior probability of class C1written as a logistic sigmoid acting on linear function of feature vector ϕ p(c 1 φ) = y(φ) = σ(w T φ) where σ(. ) is the logistic sigmoid Called logistic regression but really a model of classification More compact than ML-fitting of Gaussians: for M parameters, Gaussian model uses 2M parameters for the means and M(M + 1) /2 parameters for shared covariance matrix. Grows quadratically. Use ML to determine parameters. Recall derivitive of logistic sigmoid is: p(t w) = dσ da N n=1 = σ(1 σ) y t n n { 1 yn } 1 tn 25

Logistic regression Negative log of likelihood yields cross entropy: N { E(w) = ln p(t, w) = tn ln y n + (1 t n ) ln(1 t n ) } n=1 Gradient with respect to w yields: E(w) = N (y n t n )φ n n=1 w ML = ( Φ T Φ ) 1 Φ T t Takes same form as the gradient for sum-of-squares error (eqn B3.13): N { ln p(t w, β) = tn w T φ(x n ) } φ(x n ) T n=1 However logistic sigmoid means no closed-form quadratic solution (here is sum of squared error for comparison:) w ML = ( Φ T Φ ) 1 Φ T t where Φ is the design matrix 26

Iterative reweighted least squares (IRLS) Efficient iterative optimization: Newton-Raphson w new = w old H 1 E(w) where H is the Hessian matrix comprising the second derivatives of E(w) with respect to w. For sum-of-squares error this can be done in one step because the error function is quadratic (see p207) For cross-entropy we get a similar set of normal equations for weighted least squares where the weighting depends on w This dependancy forces us to apply the update iteratively. (see p208) 27

Final notes on chapter 4 Logistic regression suffers from overfitting. ML solution corresponds to σ=0.5 which is where w T ϕ=0. Magnitude of w goes to infinity; slope of the sigmoid becomes infinitely steep. All positive examples take a posterior of p (Ck x)=1. Not resolved by having more data. But is resolved using (a)regularization or (b) a Bayesian approach involving an appropriate prior over w. (4.3.4) Multiclass logistic regression is an easy extension of the two-class formulation. Replace logistic sigmoid with softmax and use maximum likelihood. (4.3.5) Probit regression is of theoretical interest but is in practice similar to logistic regression. (4.3.6) Canonical link functions provide a principled framework for matching activation functions and error functions. Based on assumptions about the conditional distribution for the target variable. 4.3.4-6 are not discussed further. Please read them. 28