Steven J Zeil, Old Dominion Univ., Fall 200

Outline: Discriminant-Based Classification; Linearly Separable Systems; Pairwise Separation; Posteriors; Logistic Discrimination

Discriminant-Based Classification

- Likelihood-based: assume a model for $p(\mathbf{x} \mid C_i)$, use Bayes' rule to calculate $P(C_i \mid \mathbf{x})$, and take $g_i(\mathbf{x}) = \log P(C_i \mid \mathbf{x})$.
- Discriminant-based: assume a model for $g_i(\mathbf{x} \mid \phi_i)$ directly.
- Vapnik: estimating the class densities is a harder problem than estimating the class discriminants. It does not make sense to solve a hard problem in order to solve an easier one.

Linear discriminant:
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$

Advantages:
- Simple: $O(d)$ space and computation.
- Knowledge extraction: the weight sizes give an indication of the significance of each attribute's contribution.
- Optimal when the $p(\mathbf{x} \mid C_i)$ are Gaussian with a shared covariance matrix.
- Useful when classes are (almost) linearly separable.
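To make the $O(d)$ cost concrete, here is a minimal sketch of evaluating a two-class linear discriminant in Python/NumPy; the weight, bias, and input values are made-up placeholders, not values from the slides.

```python
# Minimal sketch: evaluate g(x) = w^T x + w0 and pick a class.
# All numeric values below are illustrative placeholders.
import numpy as np

def linear_discriminant(x, w, w0):
    """g(x) = w^T x + w0 -- a single dot product, O(d) time and space."""
    return np.dot(w, x) + w0

w = np.array([0.8, -0.5])   # hypothetical weights
w0 = 0.1                    # hypothetical bias
x = np.array([1.0, 2.0])    # a sample input
label = "C1" if linear_discriminant(x, w, w0) > 0 else "C2"
print(label)
```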
More General Linear Models

$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$

We can replace the $x_j$ on the right by any linearly independent set of basis functions.

For two classes,
$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$
Choose $C_1$ if $g(\mathbf{x}) > 0$, otherwise choose $C_2$.

Geometric Interpretation

Rewrite $\mathbf{x}$ as
$$\mathbf{x} = \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|}$$
where
- $\mathbf{x}_p$ is the projection of $\mathbf{x}$ onto the hyperplane $g(\mathbf{x}) = 0$,
- $\mathbf{w}$ is normal to the hyperplane,
- $r = g(\mathbf{x}) / \|\mathbf{w}\|$ is the (signed) distance from $\mathbf{x}$ to the hyperplane.

Linearly Separable Systems

For multiple classes with $g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$ and the $\mathbf{w}_i$ normalized,
choose $C_i$ if $g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$.

Pairwise Separation

If the classes are not linearly separable, compute discriminants between each pair of classes:
$$g_{ij}(\mathbf{x} \mid \mathbf{w}_{ij}, w_{ij0}) = \mathbf{w}_{ij}^T \mathbf{x} + w_{ij0}$$
Choose $C_i$ if $\forall j \neq i,\; g_{ij}(\mathbf{x}) > 0$.
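As a companion to the two decision rules above, here is a small sketch of the max-discriminant rule and of pairwise separation, assuming the weights have already been obtained by some training procedure; the function and variable names are hypothetical.

```python
# Sketches of the two decision rules above; weights are assumed to be
# already available (e.g., from training), and names are illustrative.
import numpy as np

def choose_class_linear(x, W, w0):
    """Linearly separable case: W is (K, d), w0 is (K,).
    Choose C_i with the maximum g_i(x) = w_i^T x + w_i0."""
    return int(np.argmax(W @ x + w0))

def choose_class_pairwise(x, W_pair, w0_pair):
    """Pairwise separation: W_pair[i][j], w0_pair[i][j] define g_ij for i != j.
    Choose C_i if g_ij(x) > 0 for every j != i; return None if no class
    wins all of its pairwise comparisons (a 'reject' region)."""
    K = len(W_pair)
    for i in range(K):
        if all(np.dot(W_pair[i][j], x) + w0_pair[i][j] > 0
               for j in range(K) if j != i):
            return i
    return None
```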
Revisiting Parametric Methods

When $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$,
$$g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\tfrac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i + \log P(C_i)$$

Let $y \equiv P(C_1 \mid \mathbf{x})$. Then $P(C_2 \mid \mathbf{x}) = 1 - y$.

We choose $C_1$ if $y > 0.5$, or alternatively if $\frac{y}{1-y} > 1$; equivalently, if $\log \frac{y}{1-y} > 0$. The latter quantity is called the log odds of $y$, or logit.

log odds

For 2 normal classes with a shared covariance matrix, the log odds is linear:
$$\mathrm{logit}(P(C_1 \mid \mathbf{x})) = \log \frac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})} = \log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} + \log \frac{P(C_1)}{P(C_2)} = \log p(\mathbf{x} \mid C_1) - \log p(\mathbf{x} \mid C_2) + \log \frac{P(C_1)}{P(C_2)}$$
The $p(\mathbf{x} \mid C)$ terms are exponential in $\mathbf{x}$ (Gaussian pdf), so their log is linear:
$$\mathrm{logit}(P(C_1 \mid \mathbf{x})) = \mathbf{w}^T \mathbf{x} + w_0$$
with
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \log \frac{P(C_1)}{P(C_2)}$$

logistic

The inverse of the logit function $\mathrm{logit}(P(C_1 \mid \mathbf{x})) = \mathbf{w}^T \mathbf{x} + w_0$ is called the logistic, a.k.a. the sigmoid:
$$P(C_1 \mid \mathbf{x}) = \mathrm{sigmoid}(\mathbf{w}^T \mathbf{x} + w_0) = \frac{1}{1 + \exp\left[-(\mathbf{w}^T \mathbf{x} + w_0)\right]}$$

Using the Sigmoid

- During training, estimate $\mathbf{m}_1$, $\mathbf{m}_2$, $\mathbf{S}$, then compute the $\mathbf{w}$.
- During testing, either
  - calculate $g(\mathbf{x} \mid \mathbf{w}, w_0) = \mathbf{w}^T \mathbf{x} + w_0$ and choose $C_1$ if $g(\mathbf{x}) > 0$, or
  - calculate $y = \mathrm{sigmoid}(\mathbf{w}^T \mathbf{x} + w_0)$ and choose $C_1$ if $y > 0.5$.
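A possible sketch of that recipe in NumPy: estimate $\mathbf{m}_1$, $\mathbf{m}_2$ and the shared covariance $\mathbf{S}$ from labeled samples, form $\mathbf{w}$ and $w_0$, and classify through the sigmoid. Class priors are estimated here from the class sample sizes; all names are illustrative.

```python
# Sketch of the parametric (Gaussian, shared-covariance) route to w and w0.
import numpy as np

def fit_shared_cov_discriminant(X1, X2):
    """X1: (n1, d) samples of C1; X2: (n2, d) samples of C2.
    Returns (w, w0) such that logit(P(C1|x)) ~= w^T x + w0."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled (shared) maximum-likelihood covariance estimate S.
    S = (n1 * np.cov(X1, rowvar=False, bias=True) +
         n2 * np.cov(X2, rowvar=False, bias=True)) / (n1 + n2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (m1 - m2)
    w0 = (-0.5 * (m1 + m2) @ S_inv @ (m1 - m2)
          + np.log(n1 / n2))          # log P(C1)/P(C2) from sample sizes
    return w, w0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Testing: choose C1 if w @ x + w0 > 0, equivalently if sigmoid(w @ x + w0) > 0.5.
```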
Logistic Discrimination

Estimating $\mathbf{w}$: for two classes, assume the log likelihood ratio is linear,
$$\log \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_2)} = \mathbf{w}^T \mathbf{x} + w_0$$
so that $\mathrm{logit}(P(C_1 \mid \mathbf{x})) = \mathbf{w}^T \mathbf{x} + w_0$ and
$$y = \hat{P}(C_1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left[-(\mathbf{w}^T \mathbf{x} + w_0)\right]}$$

Likelihood:
$$l(\mathbf{w}, w_0 \mid \mathcal{X}) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}$$

Error (cross-entropy):
$$E(\mathbf{w}, w_0 \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log (1 - y^t) \right]$$

Train by numerical optimization to minimize $E$.

Multiple classes

For $K$ classes, take $C_K$ as a reference class:
$$\log \frac{p(\mathbf{x} \mid C_i)}{p(\mathbf{x} \mid C_K)} = \mathbf{w}_i^T \mathbf{x} + w_{i0}
\quad \Rightarrow \quad
\frac{P(C_i \mid \mathbf{x})}{P(C_K \mid \mathbf{x})} = \exp\left[\mathbf{w}_i^T \mathbf{x} + w_{i0}\right]$$
$$y_i = \hat{P}(C_i \mid \mathbf{x}) = \frac{\exp\left[\mathbf{w}_i^T \mathbf{x} + w_{i0}\right]}{1 + \sum_{j=1}^{K-1} \exp\left[\mathbf{w}_j^T \mathbf{x} + w_{j0}\right]}$$
This is called the softmax function because exponentiation combined with normalization tends to exaggerate the weight of the maximum term.

Likelihood:
$$l(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}) = \prod_t \prod_i (y_i^t)^{r_i^t}$$

Multiple classes (cont.)

Error (cross-entropy):
$$E(\{\mathbf{w}_i, w_{i0}\}_i \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

Train by numerical optimization to minimize $E$.
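Both the two-class and the $K$-class versions are trained by numerically minimizing the cross-entropy. As one illustration, the sketch below trains the two-class case by batch gradient descent, assuming NumPy, 0/1 target labels, and arbitrary learning-rate and epoch settings; the $K$-class case is analogous with the softmax in place of the sigmoid.

```python
# Sketch: two-class logistic discrimination trained by batch gradient
# descent on the cross-entropy E = -sum_t [r^t log y^t + (1-r^t) log(1-y^t)].
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.1, epochs=1000):
    """X: (N, d) inputs; r: (N,) labels in {0, 1}, where 1 means C1.
    Returns (w, w0) approximately minimizing the cross-entropy."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        y = sigmoid(X @ w + w0)       # y^t = sigmoid(w^T x^t + w0)
        grad = y - r                  # dE^t/da^t for sigmoid + cross-entropy
        w -= eta * (X.T @ grad) / N   # averaged batch gradient step
        w0 -= eta * grad.mean()
    return w, w0
```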
Softmax Classification

[figure from the slides; not reproduced here]

Softmax Discriminants

[figure from the slides; not reproduced here]
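For reference, a minimal sketch of how the softmax posteriors above would be evaluated, with $C_K$ as the reference class and weights for $C_1, \ldots, C_{K-1}$ assumed to be already trained; the names and shapes are illustrative.

```python
# Sketch: softmax posteriors with C_K as the reference class.
# y_i = exp(w_i^T x + w_i0) / (1 + sum_j exp(w_j^T x + w_j0)) for i < K,
# y_K = 1 / (1 + sum_j exp(w_j^T x + w_j0)).
import numpy as np

def softmax_posteriors(x, W, w0):
    """W: (K-1, d) weights and w0: (K-1,) biases for classes C_1..C_{K-1}.
    Returns the K posterior estimates [y_1, ..., y_K]."""
    a = np.exp(W @ x + w0)            # exp[w_i^T x + w_i0], i = 1..K-1
    denom = 1.0 + a.sum()
    return np.append(a / denom, 1.0 / denom)

# Classify by choosing the class with the largest posterior:
# exponentiation plus normalization exaggerates the largest discriminant,
# hence the name "softmax".
```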