Statistical Machine Learning Group and College of Engineering and Computer Science, Canberra
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part VIII
Linear Classification 2: Logistic Regression
Three Approaches for Decision Problems

In increasing order of complexity:

Discriminant Functions
Find a discriminant function $f(x)$ which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities $p(C_k \mid x)$.
2. Use decision theory to assign each new $x$ to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities $p(x \mid C_k)$.
2. Also, infer the prior class probabilities $p(C_k)$.
3. Use Bayes' theorem to find the posterior $p(C_k \mid x)$.
4. Alternatively, model the joint distribution $p(x, C_k)$ directly.
5. Use decision theory to assign each new $x$ to one of the classes.
Generative approach: model class-conditional densities $p(x \mid C_k)$ and priors $p(C_k)$ to calculate the posterior probability for class $C_1$
\[
p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
             = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)),
\]
where $a$ and the logistic sigmoid function $\sigma(a)$ are given by
\[
a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)},
\qquad
\sigma(a) = \frac{1}{1 + \exp(-a)}.
\]
Logistic Sigmoid

The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is also called a "squashing function" because it maps the whole real axis into the finite interval $(0, 1)$.

Symmetry: $\sigma(-a) = 1 - \sigma(a)$.

Derivative: $\frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$.

The inverse is called the logit function: $a(\sigma) = \ln\left(\frac{\sigma}{1 - \sigma}\right)$.

[Figures: the logistic sigmoid $\sigma(a)$ and the logit function $a(\sigma)$.]
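A minimal sketch (not part of the lecture) that checks the stated sigmoid properties numerically; the function names are chosen for illustration only.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5, 5, 11)
s = sigmoid(a)

# Symmetry: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - s)

# Derivative: d/da sigma(a) = sigma(a) (1 - sigma(a)), checked by finite differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
assert np.allclose(numeric, s * (1.0 - s), atol=1e-6)

# Inverse: logit(sigma(a)) = a
assert np.allclose(logit(s), a)
```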
Probabilistic Generative Models - Multiclass

The normalised exponential is given by
\[
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\]
where
\[
a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big).
\]
It is also called the softmax function, as it is a smoothed version of the max function.
Example: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$ and $p(C_j \mid x) \approx 0$.
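A small illustrative sketch of the normalised exponential; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slide prescribes.

```python
import numpy as np

def softmax(a):
    """Normalised exponential (softmax) over a vector of activations a_k."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # shift by the max for numerical stability
    return e / e.sum()

# Smoothed max: if one a_k dominates, its probability approaches 1
print(softmax([1.0, 2.0, 3.0]))    # roughly [0.09, 0.24, 0.67]
print(softmax([1.0, 2.0, 10.0]))   # roughly [0.0001, 0.0003, 0.9996]
```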
Probabilistic Generative Model - Continuous Input

Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?
\[
p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}
= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \right\} \exp\left\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\},
\]
where we separated the quadratic term in $x$ from the linear term.
Probabilistic Generative Model - Continuous Input

For two classes, $p(C_1 \mid x) = \sigma(a(x))$ and $a(x)$ is
\[
a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
     = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}{\exp\left\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}} + \ln \frac{p(C_1)}{p(C_2)}.
\]
Therefore
\[
p(C_1 \mid x) = \sigma(w^T x + w_0),
\]
where
\[
w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad
w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}.
\]
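A sketch of the two-class posterior under the shared-covariance Gaussian model above; the numeric values at the end are illustrative assumptions, not data from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_class_posterior(x, mu1, mu2, Sigma, prior1, prior2):
    """Posterior p(C1 | x) = sigma(w^T x + w0) for Gaussian class-conditionals
    with shared covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return sigmoid(w @ x + w0)

# Illustrative example: two 2-D Gaussians with equal priors
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(two_class_posterior(np.array([0.5, 0.5]), mu1, mu2, Sigma, 0.5, 0.5))
```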
Probabilistic Generative Model - Continuous Input

[Figure: class-conditional densities for two classes (left); posterior probability $p(C_1 \mid x)$ (right). Note that the posterior is a logistic sigmoid of a linear function of $x$.]
General Case - K Classes, Shared Covariance

Use the normalised exponential
\[
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
\qquad a_k = \ln\big(p(x \mid C_k)\, p(C_k)\big),
\]
to get a linear function of $x$
\[
a_k(x) = w_k^T x + w_{k0},
\]
where
\[
w_k = \Sigma^{-1} \mu_k, \qquad
w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k).
\]
General Case - K Classes, Different Covariance

If each class-conditional probability has a different covariance, the quadratic terms $-\tfrac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel each other out. We get a quadratic discriminant.

[Figure: decision boundaries for Gaussian class-conditionals with different covariances, giving a quadratic decision boundary.]
Maximum Likelihood Solution

Given the functional form of the class-conditional densities $p(x \mid C_k)$, can we determine the parameters $\mu$ and $\Sigma$?
Maximum Likelihood Solution

Given the functional form of the class-conditional densities $p(x \mid C_k)$, can we determine the parameters $\mu$ and $\Sigma$? Not without data ;-)

Given also a data set $(x_n, t_n)$ for $n = 1, \dots, N$ (using the coding scheme where $t_n = 1$ corresponds to class $C_1$ and $t_n = 0$ denotes class $C_2$).
Assume the class-conditional densities to be Gaussian with the same covariance but different means.
Denote the prior probability $p(C_1) = \pi$, and therefore $p(C_2) = 1 - \pi$. Then
\[
p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma),
\]
\[
p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma).
\]
Maximum Likelihood Solution

Thus the likelihood for the whole data set $X$ and $t$ is given by
\[
p(t, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n} \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n}.
\]
Maximise the log likelihood. The term depending on $\pi$ is
\[
\sum_{n=1}^N \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big),
\]
which is maximal for
\[
\pi = \frac{1}{N} \sum_{n=1}^N t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2},
\]
where $N_1$ is the number of data points in class $C_1$ (and $N_2$ the number in class $C_2$).
Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood $p(t, X \mid \pi, \mu_1, \mu_2, \Sigma)$) with respect to the means $\mu_1$ and $\mu_2$, and get
\[
\mu_1 = \frac{1}{N_1} \sum_{n=1}^N t_n x_n, \qquad
\mu_2 = \frac{1}{N_2} \sum_{n=1}^N (1 - t_n) x_n.
\]
For each class, these are the means of all input vectors assigned to that class.
Maximum Likelihood Solution

Finally, the log likelihood $\ln p(t, X \mid \pi, \mu_1, \mu_2, \Sigma)$ can be maximised with respect to the covariance $\Sigma$, resulting in
\[
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad
S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T.
\]
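A compact sketch that collects the maximum likelihood estimates of $\pi$, $\mu_1$, $\mu_2$, and $\Sigma$ from the last three slides; the function name and argument conventions are illustrative.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """ML estimates for the two-class generative model with Gaussian
    class-conditionals sharing one covariance.
    X: (N, D) inputs; t: (N,) labels with t_n = 1 for C1 and t_n = 0 for C2."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N                                    # prior p(C1) = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1        # mean of inputs assigned to C1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2  # mean of inputs assigned to C2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2          # weighted average of class covariances
    return pi, mu1, mu2, Sigma
```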
Probabilistic Generative Models - Naive Bayes

Assume the input space consists of discrete features, in the simplest case $x_i \in \{0, 1\}$.
For a $D$-dimensional input space, a general distribution would be represented by a table with $2^D$ entries. Together with the normalisation constraint, these are $2^D - 1$ independent variables, which grows exponentially with the number of features.
The naive Bayes assumption is that all features, conditioned on the class $C_k$, are independent of each other:
\[
p(x \mid C_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}.
\]
Probabilistic Generative Models - Naive Bayes

With the naive Bayes assumption
\[
p(x \mid C_k) = \prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
\]
we can again find the factors $a_k$ in the normalised exponential
\[
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\]
as a linear function of the $x_i$:
\[
a_k(x) = \sum_{i=1}^D \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k).
\]
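A small sketch of the naive Bayes posterior for binary features, using the linear form of $a_k(x)$ above; the parameter values at the end are illustrative assumptions, not from the lecture.

```python
import numpy as np

def naive_bayes_posterior(x, mu, priors):
    """Posterior p(C_k | x) for binary features under the naive Bayes assumption.
    x: (D,) binary feature vector; mu: (K, D) Bernoulli parameters mu_ki;
    priors: (K,) class priors p(C_k)."""
    # a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(C_k)
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    e = np.exp(a - a.max())          # normalised exponential (softmax), stabilised
    return e / e.sum()

# Illustrative example: two classes, three binary features
mu = np.array([[0.9, 0.8, 0.1],
               [0.2, 0.3, 0.7]])
print(naive_bayes_posterior(np.array([1, 1, 0]), mu, np.array([0.5, 0.5])))
```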
Three Approaches for Decision Problems

In increasing order of complexity:

Discriminant Functions
Find a discriminant function $f(x)$ which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities $p(C_k \mid x)$.
2. Use decision theory to assign each new $x$ to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities $p(x \mid C_k)$.
2. Also, infer the prior class probabilities $p(C_k)$.
3. Use Bayes' theorem to find the posterior $p(C_k \mid x)$.
4. Alternatively, model the joint distribution $p(x, C_k)$ directly.
5. Use decision theory to assign each new $x$ to one of the classes.
Discriminative Models

Maximise a likelihood function defined through the conditional distribution $p(C_k \mid x)$ directly (discriminative training).
Typically fewer parameters need to be determined.
As we learn the posterior $p(C_k \mid x)$ directly, prediction may be better than with a generative model whose class-conditional density assumptions $p(x \mid C_k)$ poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as $p(x)$ is not modelled.
Original Input versus Feature Space

So far we have used the direct input $x$. All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\phi(x)$.
Example: use two Gaussian basis functions centered at the green crosses in the input space.

[Figure: input space $(x_1, x_2)$ with the two Gaussian basis-function centres (left); the corresponding feature space $(\phi_1, \phi_2)$ (right).]
Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in input space, they will also overlap in feature space. Nonlinear features $\phi(x)$ cannot remove the overlap; they may even increase it!

[Figure: nonlinear decision boundary in the input space $(x_1, x_2)$ (left) and the corresponding linear decision boundary in the feature space $(\phi_1, \phi_2)$ (right).]
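A sketch of the feature map idea: mapping 2-D inputs through two Gaussian basis functions, as in the example above. The basis-function width and centre values are assumed for illustration.

```python
import numpy as np

def gaussian_features(X, centres, s=0.5):
    """Map 2-D inputs to a feature space via Gaussian basis functions
    phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)).
    X: (N, 2) inputs; centres: (M, 2) basis-function centres; s: width (assumed)."""
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * s ** 2))

# Two basis functions give a 2-D feature space (phi_1, phi_2)
centres = np.array([[-0.5, 0.0], [0.5, 0.0]])
X = np.random.default_rng(0).uniform(-1, 1, size=(5, 2))
Phi = gaussian_features(X, centres)
print(Phi.shape)   # (5, 2)
```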
Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion of linear regression).
Understanding more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use $\phi$ instead of $x$ from now on.
Logistic Regression is Classification

Two classes, where the posterior of class $C_1$ is a logistic sigmoid $\sigma(\cdot)$ acting on a linear function of the feature vector $\phi$:
\[
p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad
p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi).
\]
The model dimension is equal to the dimension of the feature space, $M$.
Compare this to fitting two Gaussians:
\[
\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2.
\]
For larger $M$, the logistic regression model has a clear advantage.
Logistic Regression is Classification

Determine the parameters via maximum likelihood for data $(\phi_n, t_n)$, $n = 1, \dots, N$, where $\phi_n = \phi(x_n)$. The class membership is coded as $t_n \in \{0, 1\}$.
Likelihood function
\[
p(t \mid w) = \prod_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n},
\]
where $y_n = p(C_1 \mid \phi_n)$.
Error function: the negative log likelihood, resulting in the cross-entropy error function
\[
E(w) = -\ln p(t \mid w) = -\sum_{n=1}^N \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}.
\]
Logistic Regression is Classification

Error function (cross-entropy error)
\[
E(w) = -\sum_{n=1}^N \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}, \qquad
y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n).
\]
Gradient of the error function (using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$)
\[
\nabla E(w) = \sum_{n=1}^N (y_n - t_n)\, \phi_n.
\]
The gradient does not contain any sigmoid function; for each data point, the contribution is the product of the deviation $y_n - t_n$ and the basis function vector $\phi_n$.
BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error function or a MAP solution.
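A minimal sketch that minimises the cross-entropy error with plain gradient descent using the gradient above. The lecture does not prescribe an optimiser; the learning rate and iteration count are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, lr=0.1, n_iter=1000):
    """Minimise the cross-entropy error E(w) by simple gradient descent.
    Phi: (N, M) design matrix of features phi_n; t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)          # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)        # gradient: sum_n (y_n - t_n) phi_n
        w -= lr * grad / N
    return w
```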
Laplace Approximation

Given a continuous distribution $p(x)$ which is not Gaussian, can we approximate it by a Gaussian $q(x)$?
We need to find a mode of $p(x)$ and try to find a Gaussian with the same mode.

[Figure: a non-Gaussian distribution (yellow) and its Gaussian approximation (red); the negative log of the non-Gaussian (yellow) and of the Gaussian approximation (red).]
Laplace Approximation

Assume $p(z)$ can be written as
\[
p(z) = \frac{1}{Z} f(z)
\]
with normalisation $Z = \int f(z)\, dz$. Furthermore, assume $Z$ is unknown!
A mode of $p(z)$ is at a point $z_0$ where $p'(z_0) = 0$.
Taylor expansion of $\ln f(z)$ at $z_0$:
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2,
\]
where
\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]
Laplace Approximation

Exponentiating
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2
\]
we get
\[
f(z) \simeq f(z_0) \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\},
\]
and after normalisation we get the Laplace approximation
\[
q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\}.
\]
It is only defined for precision $A > 0$, as only then does $p(z)$ have a maximum at $z_0$.
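A numerical sketch of the one-dimensional Laplace approximation: find the mode of the unnormalised log density and estimate the precision $A$ by finite differences. The example density and the use of scipy's bounded scalar minimiser are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_approximation_1d(log_f, bounds=(-10.0, 10.0)):
    """Laplace approximation of p(z) = f(z)/Z given the unnormalised log density log_f.
    Returns the mode z0 and the precision A = -d^2/dz^2 ln f(z) at z0."""
    res = minimize_scalar(lambda z: -log_f(z), bounds=bounds, method="bounded")
    z0 = res.x
    eps = 1e-4
    # second derivative of ln f by central finite differences
    A = -(log_f(z0 + eps) - 2 * log_f(z0) + log_f(z0 - eps)) / eps**2
    return z0, A

# Illustrative unnormalised log density: a Gaussian multiplied by a sigmoid
log_f = lambda z: -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-4 * z)))
z0, A = laplace_approximation_1d(log_f)
print(z0, A)   # mode and precision; q(z) = N(z | z0, 1/A)
```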
Laplace Approximation - Vector Space

Approximate $p(z)$ for $z \in \mathbb{R}^M$, where
\[
p(z) = \frac{1}{Z} f(z).
\]
We get the Taylor expansion
\[
\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} (z - z_0)^T A (z - z_0),
\]
where the Hessian $A$ is defined as
\[
A = -\nabla \nabla \ln f(z) \big|_{z = z_0}.
\]
The Laplace approximation of $p(z)$ is then
\[
q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\tfrac{1}{2} (z - z_0)^T A (z - z_0) \right\} = \mathcal{N}(z \mid z_0, A^{-1}).
\]
Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable. Why?
We would need to normalise a product of prior probabilities and likelihoods, which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
Bayesian Logistic Regression

Assume a Gaussian prior (because we want a Gaussian posterior):
\[
p(w) = \mathcal{N}(w \mid m_0, S_0)
\]
for fixed hyperparameters $m_0$ and $S_0$. Hyperparameters are parameters of a prior distribution; in contrast to the model parameters $w$, they are not learned.
For a set of training data $(x_n, t_n)$, $n = 1, \dots, N$, the posterior is given by
\[
p(w \mid t) \propto p(w)\, p(t \mid w),
\]
where $t = (t_1, \dots, t_N)^T$.
Bayesian Logistic Regression

Using our previous result for the cross-entropy error function
\[
E(w) = -\ln p(t \mid w) = -\sum_{n=1}^N \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
\]
we can now calculate the log of the posterior $p(w \mid t) \propto p(w)\, p(t \mid w)$, using the notation $y_n = \sigma(w^T \phi_n)$, as
\[
\ln p(w \mid t) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^N \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}.
\]
Bayesian Logistic Regression

To obtain a Gaussian approximation to
\[
\ln p(w \mid t) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^N \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}:
\]
1. Find $w_{MAP}$ which maximises $\ln p(w \mid t)$. This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of $w$ because $y_n = \sigma(w^T \phi_n)$.)
2. Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation
\[
S_N^{-1} = -\nabla \nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^N y_n (1 - y_n)\, \phi_n \phi_n^T.
\]
Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now
\[
q(w) = \mathcal{N}(w \mid w_{MAP}, S_N),
\]
where
\[
S_N^{-1} = -\nabla \nabla \ln p(w \mid t) = S_0^{-1} + \sum_{n=1}^N y_n (1 - y_n)\, \phi_n \phi_n^T.
\]
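A sketch of the Laplace approximation of the logistic regression posterior: find $w_{MAP}$ and then form $S_N$ as above. Newton's method is used here as one way to locate $w_{MAP}$; the lecture does not prescribe a particular optimiser, and the function name is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=50):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) of the logistic regression posterior.
    Phi: (N, M) feature matrix; t: (N,) targets in {0, 1};
    m0, S0: mean and covariance of the Gaussian prior p(w) = N(w | m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        # gradient of the negative log posterior
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)
        # Hessian of the negative log posterior: S0^{-1} + sum_n y_n(1-y_n) phi_n phi_n^T
        H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
        w = w - np.linalg.solve(H, grad)   # Newton step towards w_MAP
    y = sigmoid(Phi @ w)
    S_N_inv = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
    return w, np.linalg.inv(S_N_inv)       # w_MAP and covariance S_N
```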