CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Size: px

Start display at page:

Download "CS 688 Pattern Recognition Lecture 4. Linear Models for Classification"

MargaretMargaret Harmon
10 years ago
Views:

1 CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1

2 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p( Ck ) Ck x = p( x) ( x) p( x C ) p( ) p p = k k C k Two class case: logistic sigmoid function 2

3 K >2 classes: softmax function Lets assume the class-conditional densities are Gaussian with the same covariance matrix: Two class case first. We can show the following result: 3

4 4

5 We have shown: Decision boundary: 5

6 K >2 classes: We can show the following result: The cancellation of the quadratic terms is due to the assumption of shared covariances. If we allow each class conditional density to have its own covariance matrix, the cancellations no longer occur, and we obtain a quadratic function of x, i.e. a quadratic discriminant. 6

If we allow each class conditional density to have its own covariance matrix,

7 Maximum likelihood solution We have a parametric functional form for the class-conditional densities: We can estimate the parameters and the prior class probabilities using maximum likelihood. Two class case with shared covariance matrix. Training data: 7

parameters and the prior class probabilities using maximum

8 Maximum likelihood solution For a data point from class we have and therefore For a data point from class we have and therefore Maximum likelihood solution Assuming observations are drawn independently, we can write the likelihood function as follows: N p( t π,µ 1,µ 2,Σ) = p( t n π,µ 1,µ 2,Σ) = N n=1 [ p( x n,c 1 )] t n p x n,c 2 n=1 N n=1 [ πn ( x n µ 1,Σ)] t n [ ( )] 1 t n = [( 1 π)n ( x n µ 2,Σ)] 1 t n t = ( t 1,,t N ) T 8

write the likelihood function as follows: N p( t π,µ 1,µ 2,Σ) = p( t n π,µ 1,µ 2,Σ) = N n=1 [ p( x n,c 1 )]

9 Maximum likelihood solution We want to find the values of the parameters that maximize the likelihood function, i.e., fit a model that best describes the observed data. p t π,µ 1,µ 2,Σ N n=1 ( ) = [ πn ( x n µ 1,Σ)] t n [( 1 π)n ( x n µ 2,Σ)] 1 t n As usual, we consider the log of the likelihood: Maximum likelihood solution We first maximize the log likelihood with respect to terms that depend on are. The 9

p t π,µ 1,µ 2,Σ N n=1 ( ) = [ πn ( x n µ 1,Σ)] t n [( 1 π)n ( x n µ 2,Σ)] 1 t n As usual, we consider

10 Maximum likelihood solution Thus, the maximum likelihood estimate of of points in class is the fraction The result can be generalized to the multiclass case: the maximum likelihood estimate of is given by the fraction of points in the training set that belong to Maximum likelihood solution We now maximize the log likelihood with respect to terms that depend on are. The 10

$estimate of is given by the fraction of points in the training set that belong to Maximum$

11 Maximum likelihood solution Thus, the maximum likelihood estimate of is the sample mean of all the input vectors assigned to class By maximizing the log likelihood with respect to obtain a similar result for we Maximum likelihood solution Maximizing the log likelihood with respect to the maximum likelihood estimate we obtain Thus: the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the classes. This results extend to K classes. 11

likelihood with respect to the maximum likelihood estimate we obtain Thus: the maximum likelihood estimate of the covariance is

12 Probabilistic Discriminative Models Two-class case: Multiclass case: Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood. Probabilistic Discriminative Models Advantages: Fewer parameters to be determined Improved predictive performance, especially when the class-conditional density assumptions give a poor approximation of the true distributions. 12

13 Probabilistic Discriminative Models Two-class case: In the terminology of statistics, this model is known as logistic regression. Assuming estimate? how many parameters do we need to Probabilistic Discriminative Models How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)? 13

how many parameters do we need to Probabilistic Discriminative Models How many

14 Logistic Regression We use maximum likelihood to determine the parameters of the logistic regression model. { x n,t n } n =1,,N t n =1 denotes class C 1 t n = 0 denotes class C 2 We want to find the values of w that maximize the posterior probabilities associated to the observed data Likelihood function : L w N ( ) = P( C 1 x n ) t n n=1 ( 1 P( C 1 x n )) 1 t n N ( ) t n = y x n n=1 ( 1 y( x n )) 1 t n Logistic Regression We consider the negative logarithm of the likelihood: 14

posterior probabilities associated to the observed data Likelihood function : L w N ( ) = P( C 1 x n ) t n n=1 ( 1 P( C

15 Logistic Regression We compute the derivative of the error function with respect to w (gradient): We need to compute the derivative of the logistic sigmoid function: a σ ( a ) = a 1 1+ e = e a a 1+ e a = σ a 1+ e a 1+ e a ( ) 2 = 1 ( ) ( ) 1 σ( a) e a 1+ e a 1+ e = a Logistic Regression 15

logistic sigmoid function: a σ ( a ) = a 1 1+ e = e a a 1+ e a 1 1 1 = σ

16 Logistic Regression The gradient of E at w gives the direction of the steepest increase of E at w. We need to minimize E. Thus we need to update w so that we move along the opposite direction of the gradient: This technique is called gradient descent It can be shown that E is a concave function of w. Thus, it has a unique minimum. An efficient iterative technique exists to find the optimal w parameters (Newton-Raphson optimization). Batch vs. on-line learning The computation of the above gradient requires the processing of the entire training set (batch technique) If the data set is large, the above technique can be costly; For real time applications in which data become available as continuous streams, we may want to update the parameters as data points are presented to us (on-line technique). 16

Thus, it has a unique minimum. An efficient iterative technique exists to find the optimal w parameters (Newton-Raphson optimization). Batch vs.

17 On-line learning After the presentation of each data point n, we compute the contribution of that data point to the gradient (stochastic gradient): The on-line updating rule for the parameters becomes: η > 0 is called learning rate. It's value needs to be chosen carefully to ensure convergence Multiclass Logistic Regression Multiclass case: We use maximum likelihood to determine the parameters of the logistic regression model. { x n,t n } n =1,, N ( ) denotes class C k t n = 0,, 1,,0 We want to find the values of w 1,,w K that maximize the posterior probabilities associated to the observed data Likelihood function : ( ) = P( C k x n ) t nk L w 1,,w K N K = y k x n n=1 k =1 n=1 k =1 N K ( ) t nk 17

It's value needs to be chosen carefully to ensure convergence Multiclass Logistic Regression Multiclass case: We use maximum likelihood to determine the parameters of the

18 Multiclass Logistic Regression We consider the negative logarithm of the likelihood: Multiclass Logistic Regression We compute the gradient of the error function with respect to one of the parameter vectors: 18

Logistic Regression We compute the gradient of the

19 Multiclass Logistic Regression Thus, we need to compute the derivatives of the softmax function: y k = a k a k e ak e a j = j e a k j e a j j e a j e a k ea k ( ) 2 = y k y 2 k = y k 1 y k Multiclass Logistic Regression Thus, we need to compute the derivatives of the softmax function: for j k, y k = a j a j e ak j e a j = e a k e a j = y 2 k y j e a j j 19

= y k 1 y k function: for j k, y k = a j a j e ak j e a j = e a k e a j = y 2 k y j e a

20 Multiclass Logistic Regression Compact expression: Multiclass Logistic Regression 20

21 Multiclass Logistic Regression It can be shown that E is a concave function of w. Thus, it has a unique minimum. For a batch solution, we can use the Newton-Raphson optimization technique. On-line solution (stochastic gradient descent): 21

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical