Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Size: px

Start display at page:

Download "Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on"

Hugh Cecil Potter
8 years ago
Views:

1 Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on

2 Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output that has target values of 1 for the posi've class and 0 (or some'mes - 1) for the other class For probabilis'c class labels the output of the model can represent the probability the model gives to the posi've class. If there are N classes we oden use a vector of N target values containing a single 1 for the correct class and zeros elsewhere. For probabilis'c labels we can then use a vector of class probabili'es as the target vector.

represent the probability the model gives to the posi've class.

3 Three approaches to classifica'on Use discriminant func'ons directly without probabili'es: Convert the input vector into one or more real values so that a simple opera'on (like threshholding) can be applied to get the class. Infer condi'onal class probabili'es: p ( class = C k x) Then make a decision that minimizes some loss func'on Compare the probability of the input under separate, class- specific, genera've models. E.g. fit a mul'variate Gaussian to the input vectors of each class and see which Gaussian makes a test data vector most probable.

Infer condi'onal class probabili'es: p ( class = C k x) Then make a decision that minimizes some loss func'on Compare the probability of

4 Discriminant func'ons The planar decision surface in data-space for the simple linear discriminant function: y = w T x + w 0 0

5 Discriminant func'ons for N> classes One possibility is to use N two- way discriminant func'ons. Each func'on discriminates one class from the rest. Another possibility is to use N(N- 1)/ two- way discriminant func'ons Each func'on discriminates between two par'cular classes. Both these methods have problems

Another possibility is to use N(N- 1)/ two- way discriminant func'ons Each

6 Problems with mul'- class discriminant func'ons

7 Use K discriminant functions, yi, y j, yk and pick the max. This is guaranteed to give consistent and convex decision regions if y is linear. y y k k ( x A ) implies A simple solu'on ( α x + (1 α) x ) > y ( α x + (1 α) x ) A > ( y j for ( x A... ) and positive α) B y k that j ( x B ) A > y j ( x B ) B

8 Using least squares for classifica'on This is not the right thing to do and it doesn t work as well as other methods (we will see later), but it is easy. It reduces classifica'on to least squares regression. We already know how to do regression. We can just solve for the op'mal weights with some matrix algebra.

It reduces classifica'on to least squares regression.

9 Least squares for classifica6on {x n, t n } The sum-of-squares error function: nth row of matrix T is t T n

10 Problems with using least squares for classifica'on logistic regression least squares regression If the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being too correct

11 Another example where least squares regression gives poor decision surfaces

12 Fisher s linear discriminant A simple linear discriminant func'on is a projec'on of the data down to 1- D. So choose the projec'on that gives the best separa'on of the classes. What do we mean by best separa'on? An obvious direc'on to choose is the direc'on of the line joining the class means. But if the main direc'on of variance in each class is not orthogonal to this line, this will not give good separa'on. Fisher s method chooses the direc'on that maximizes the ra'o of between class variance to within class variance. This is the direc'on in which the projected points contain the most informa'on about class membership (under Gaussian assump'ons)

An obvious direc'on to choose is the direc'on of the line joining the class means.

13 Fisher s linear discriminant The two- class linear discriminant acts as a projec'on: y = w T x followed by a threshold In which direc'on w should we project? One which separates classes well y w 0 Intui'on: We want the (projected) centers of the classes to be far apart, and each class (projec'on) to be clustered around its centre.

One which separates classes well y w 0 Intui'on: We want the (projected) centers

14 Fisher s linear discriminant A natural idea would be to project in the direc'on of the line connec'ng class means However, problema'c if classes have variance in this direc'on (ie are not clustered around the mean) Fisher criterion: maximize ra'o of inter- class separa'on (between) to intra- class variance (inside)

variance in this direc'on (ie are not clustered around the mean) Fisher

15 A picture showing the advantage of Fisher s linear discriminant. When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

16 Math of Fisher s linear discriminants What linear transforma'on is best for discrimina'on? The projec'on onto the vector separa'ng the class means seems sensible. But we also want small variance within each class: Fisher s objec've func'on is: y = w T x ) ( ) ( m y s m y s C n n C n n = = ε ε 1 1 ) ( ) ( s s m m J + = w between within

But we also want small variance within each class: Fisher s objec've func'on is: y =

17 ) ( : ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( m m S w m x m x m x m x S m m m m S w S w w S w w + = = = + = W C n T n n C n T n n W T B W T B T solution Optimal s s m m J More math of Fisher s linear discriminants

18 Probabilis6c Genera6ve Models Model the class-conditional densities and class prior Use Bayes theorem to compute posterior probabilities

19 Logistic Sigmoid Function

20 Probabilis6c Genera6ve Models (k class) Beyes theorem if a k a j for all j k, then p(ck x) 1, and p(cj x) 0.

21 Continuous inputs Model the class-conditional densities Here, all classes share the same covariance matrix. Two-classes Problem

22 Two-Classes Problem

23 The posterior when the covariance matrices are different for different classes. The decision surface is planar when the covariance matrices are the same and quadratic when they are not.

24 Maximum likelihood solu6on The likelihood func'on:

25 Maximum likelihood solu6on Note that the approach of fi`ng Gaussian distribu'ons to the classes is not robust to outliers, because the maximum likelihood es*ma*on of a Gaussian is not robust.

26 Discrete features When The class-conditional distributions: Linear functions of the input values:

27 Probabilistic Discriminative Models What is it? It means to use the functional form of the generalized linear probabilistic model explicitly, and to determine its parameters directly The difference from Discriminant Functions, and Probabilistic Generative Models 7

28 Fixed Basis Functions Space transformation by nonlinear function 8

29 Logistic Regression To reduce the complexity of probabilistic models, the logistic sigmoid is introduced For Gaussian class conditional densities: M(M+5)/+1 parameters Logistic Regression: M parameters 9

30 The Likelihood p(t w) = N n=1y t n n {1 y n } 1 t n y n = p(c 1 n) 30

31 The Gradient of the Log Likelihood Error function: Gradient of the Error function 31

32 Iterative Reweighted Least Squares Newton-Raphson H is the Hessian Matrix H = T R R is an NxN diagonal matrix with elements: R nn = y n (1 y n ) 3

33 Iterative Reweighted Least Squares The Newton-Raphson update takes this form: z = w (old) R 1 (y t) 33

34 Multiclass Logistic Regression Generative models (discussed in section 4.) where Using the maximum likelihood to determine the parameters directly 34

35 Multiclass Logistic Regression Maximum likelihood method Using IRLS algorithm to do the w updating work 35

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical