3F3: Signal and Pattern Processing
Lecture 3: Classification
Zoubin Ghahramani  zoubin@eng.cam.ac.uk
Department of Engineering, University of Cambridge
Lent Term
Classification

We will represent data by vectors in some vector space. Let x denote a data point with elements x = (x_1, x_2, ..., x_D). The elements of x, e.g. x_d, represent measured (observed) features of the data point; D denotes the number of measured features of each point.

The data set D consists of N pairs of data points and corresponding discrete class labels:
D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}
where y^(n) ∈ {1, ..., C} and C is the number of classes.

The goal is to classify new inputs correctly (i.e. to generalize).

Examples: spam vs non-spam; normal vs disease; 0 vs 1 vs 2 vs 3 ... vs 9
Classification Example: Iris Dataset

3 classes, 4 numeric attributes, 150 instances. A data set with 150 points and 3 classes. Each point is a random sample of measurements of flowers from one of three iris species (setosa, versicolor, and virginica), collected by Anderson (1935). Used by Fisher (1936) for his linear discriminant function technique. The measurements are sepal length, sepal width, petal length, and petal width in cm.

Data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
...
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica

[Scatter plot: sepal width (2 to 4.5) vs sepal length (4 to 8) for the three species]
Linear Classification

Data set D: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}. Assume y^(n) ∈ {0, 1} (i.e. two classes) and x^(n) ∈ R^D, and let x^(n) = (1, x^(n)) (a constant 1 is prepended so that β includes a bias term).

Linear Classification (Deterministic):
y^(n) = H(β^T x^(n))
where
H(z) = 1 if z ≥ 0, 0 if z < 0
is known as the Heaviside, threshold, or step function.
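A minimal sketch of the deterministic rule in pure Python (the weights below are illustrative, not from the lecture; the bias feature is prepended by hand):

```python
def heaviside(z):
    """H(z) = 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def classify(beta, x):
    """Deterministic linear classifier: y = H(beta . x_tilde),
    where x_tilde = (1, x) includes the constant-1 bias feature."""
    x_tilde = [1.0] + list(x)
    z = sum(b * xi for b, xi in zip(beta, x_tilde))
    return heaviside(z)

# Illustrative weights: bias term first, then one weight per feature
beta = [-1.0, 2.0, 0.5]
print(classify(beta, [1.0, 0.0]))  # z = -1 + 2 = 1 >= 0, so prints 1
print(classify(beta, [0.0, 1.0]))  # z = -1 + 0.5 = -0.5 < 0, so prints 0
```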
Linear Classification

Data set D: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}. Assume y^(n) ∈ {0, 1} (i.e. two classes) and x^(n) ∈ R^D, and let x^(n) = (1, x^(n)).

Linear Classification (Probabilistic):
P(y^(n) = 1 | x^(n), β) = σ(β^T x^(n))
where σ(·) is a monotonic increasing function σ : R → [0, 1]. For example, if σ(z) is the sigmoid or logistic function
σ(z) = 1 / (1 + exp(-z)),
this is called logistic regression.

Compare this to...
Linear Regression: y^(n) = β^T x^(n) + ε_n

[Plot of the logistic function σ(z) for z from -5 to 5]
Logistic Classification

Data set: D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}.
P(y^(n) = 1 | x^(n), β) = σ(β^T x^(n))
P(y^(n) = 0 | x^(n), β) = 1 - P(y^(n) = 1 | x^(n), β) = σ(-β^T x^(n))
Since σ(z) = 1/(1 + exp(-z)), note that 1 - σ(z) = σ(-z).

This defines a Bernoulli distribution over y at each point (cf. θ^y (1-θ)^(1-y)).

The likelihood for β is:
P(y | X, β) = ∏_n P(y^(n) | x^(n), β) = ∏_n σ(β^T x^(n))^{y^(n)} (1 - σ(β^T x^(n)))^{(1 - y^(n))}

Given the data, this is just some function of β which we can optimize...

This is usually called Logistic Regression, but it's really solving a classification problem, so I'm going to correct this misnomer.
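The Bernoulli likelihood above (in log form) can be sketched in a few lines of Python; the toy data points here are illustrative, not from the lecture, and each input is assumed to already carry the constant-1 feature:

```python
import math

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    """ln P(y | X, beta) = sum_n [ y^(n) ln sigma(z_n) + (1 - y^(n)) ln sigma(-z_n) ],
    with z_n = beta^T x^(n), using 1 - sigma(z) = sigma(-z)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        z = sum(b * xi for b, xi in zip(beta, x_n))
        total += y_n * math.log(sigmoid(z)) + (1 - y_n) * math.log(sigmoid(-z))
    return total

# Illustrative toy data: bias feature first, then one scalar input
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
print(log_likelihood([0.0, 0.0], X, y))  # beta = 0 gives sigma = 0.5 everywhere: 4 * ln(0.5)
```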
Logistic Classification

Maximum Likelihood learning:
L(β) = ln P(y | X, β) = ∑_n ln P(y^(n) | x^(n), β)
     = ∑_n [ y^(n) ln σ(β^T x^(n)) + (1 - y^(n)) ln σ(-β^T x^(n)) ]

Taking derivatives, using ∂ ln σ(z)/∂z = σ(-z), and defining z_n := β^T x^(n), we get:
∂L(β)/∂β = ∑_n [ y^(n) σ(-z_n) x^(n) - (1 - y^(n)) σ(z_n) x^(n) ]
         = ∑_n [ y^(n) σ(-z_n) x^(n) + y^(n) σ(z_n) x^(n) - σ(z_n) x^(n) ]
         = ∑_n (y^(n) - σ(z_n)) x^(n)

Intuition?
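One way to trust the derivation above is to check the analytic gradient against a finite-difference approximation of L(β). A sketch, on illustrative toy data (not from the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(beta, X, y):
    """L(beta) = sum_n [ y^(n) ln sigma(z_n) + (1 - y^(n)) ln sigma(-z_n) ]."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        z = sum(b * xi for b, xi in zip(beta, x_n))
        total += y_n * math.log(sigmoid(z)) + (1 - y_n) * math.log(sigmoid(-z))
    return total

def grad(beta, X, y):
    """Analytic gradient: dL/dbeta = sum_n (y^(n) - sigma(z_n)) x^(n)."""
    g = [0.0] * len(beta)
    for x_n, y_n in zip(X, y):
        z = sum(b * xi for b, xi in zip(beta, x_n))
        err = y_n - sigmoid(z)
        for d in range(len(beta)):
            g[d] += err * x_n[d]
    return g

# Central-difference check at an arbitrary beta
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
beta = [0.1, -0.3]
eps = 1e-6
for d in range(len(beta)):
    bp, bm = list(beta), list(beta)
    bp[d] += eps
    bm[d] -= eps
    fd = (log_lik(bp, X, y) - log_lik(bm, X, y)) / (2 * eps)
    assert abs(fd - grad(beta, X, y)[d]) < 1e-6   # matches the analytic form
```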
Logistic Classification

The gradient:
∂L(β)/∂β = ∑_n (y^(n) - σ(z_n)) x^(n)

A learning rule (steepest gradient ascent on the likelihood):
β^[t+1] = β^[t] + η ∂L(β^[t])/∂β^[t]
where η is a learning rate. This gives:
β^[t+1] = β^[t] + η ∑_n (y^(n) - σ(z_n)) x^(n)

This is a batch learning rule, since all N points are considered at once. An online learning rule can consider one data point at a time...
β^[t+1] = β^[t] + η (y^(t) - σ(z_t)) x^(t)
This is useful for real-time learning and adaptation applications. One can also use any standard optimizer (e.g. conjugate gradients, Newton's method).
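The online rule is essentially exercise 2 below; here is a sketch in Python rather than the Matlab/Octave suggested there, on illustrative separable toy data (learning rate and epoch count chosen arbitrarily):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_step(beta, x_t, y_t, eta):
    """One online update: beta <- beta + eta * (y_t - sigma(z_t)) * x_t."""
    z = sum(b * xi for b, xi in zip(beta, x_t))
    err = y_t - sigmoid(z)
    return [b + eta * err * xi for b, xi in zip(beta, x_t)]

# Toy 1-D data with a bias feature; class 1 iff the input is positive
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

beta = [0.0, 0.0]
for epoch in range(200):          # sweep through the data repeatedly
    for x_t, y_t in zip(X, y):
        beta = online_step(beta, x_t, y_t, eta=0.5)

preds = [1 if sum(b * xi for b, xi in zip(beta, x)) >= 0 else 0 for x in X]
print(preds)  # prints [0, 0, 1, 1]: all training points classified correctly
```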
Logistic Classification

Demo of the online logistic classification learning rule:
β^[t+1] = β^[t] + η (y^(t) - σ(z_t)) x^(t)
where z_t = β^[t]^T x^(t).

[lrdemo: 2-D demo plot, both axes from -3 to 3]
Maximum a Posteriori (MAP) and Bayesian Learning

MAP: Assume we define a prior on β, for example a Gaussian prior with variance λ:
P(β) = (2πλ)^{-(D+1)/2} exp{ -β^T β / (2λ) }
Then, instead of maximizing the likelihood, we can maximize the posterior probability of β:
ln P(β | D) = ln P(D | β) + ln P(β) + const = L(β) - (1/(2λ)) β^T β + const
The second term penalizes the length of β. Why would we want to do this?

Bayesian Learning: We can also do Bayesian learning of β by trying to compute or approximate P(β | D), rather than optimize it.
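The effect of the penalty can be seen numerically: on separable data the maximum-likelihood weights keep growing, while the Gaussian prior keeps them finite. A sketch (toy data, step size, and λ are all illustrative choices, not from the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_step(beta, X, y, eta, lam=None):
    """One step of gradient ascent on L(beta), or on the log posterior
    L(beta) - beta^T beta / (2*lam) when lam is given (Gaussian prior)."""
    g = [0.0] * len(beta)
    for x_n, y_n in zip(X, y):
        z = sum(b * xi for b, xi in zip(beta, x_n))
        err = y_n - sigmoid(z)
        for d in range(len(beta)):
            g[d] += err * x_n[d]
    if lam is not None:
        for d in range(len(beta)):
            g[d] -= beta[d] / lam          # gradient of the log prior
    return [b + eta * gd for b, gd in zip(beta, g)]

# Separable toy data: bias feature first
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
beta_ml, beta_map = [0.0, 0.0], [0.0, 0.0]
for t in range(2000):
    beta_ml = batch_step(beta_ml, X, y, eta=0.1)
    beta_map = batch_step(beta_map, X, y, eta=0.1, lam=1.0)

norm = lambda b: math.sqrt(sum(bi * bi for bi in b))
assert norm(beta_map) < norm(beta_ml)   # the prior penalizes the length of beta
```

This is one answer to "why would we want to do this?": without the penalty, separable classes drive ‖β‖ towards infinity (exercise 4 explores this).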
Nonlinear Logistic Classification

Easy: just map the inputs into a higher-dimensional space by computing some nonlinear functions of the inputs. For example:
(x_1, x_2) → (x_1, x_2, x_1^2, x_2^2, x_1 x_2)
Then do logistic classification using these new inputs.

[nlrdemo: 2-D demo plot, both axes from -3 to 3]
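The quadratic feature map above is a one-liner; logistic classification then runs unchanged on the expanded vector (with the constant-1 feature prepended as before):

```python
def expand(x):
    """Nonlinear feature map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]

print(expand([2.0, 3.0]))  # prints [2.0, 3.0, 4.0, 9.0, 6.0]
```

Because β now weights the squared and cross terms, the decision boundary in the original (x_1, x_2) plane can be any conic section rather than a straight line.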
Multinomial Classification

How do we do multiple classes? y ∈ {1, ..., C}
Answer: Use a softmax function, which generalizes the logistic function. Each class c has a vector of parameters β_c:
P(y = c | x, β) = exp{β_c^T x} / ∑_{c'=1}^{C} exp{β_{c'}^T x}
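A sketch of the softmax in Python (the parameter vectors are illustrative; subtracting the maximum score before exponentiating is a standard trick for numerical stability, not part of the definition):

```python
import math

def softmax_probs(betas, x):
    """P(y = c | x, beta) = exp(beta_c . x) / sum_c' exp(beta_c' . x)."""
    scores = [sum(b * xi for b, xi in zip(beta_c, x)) for beta_c in betas]
    m = max(scores)                       # shift scores to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative parameters for C = 3 classes, D + 1 = 2 features
betas = [[0.0, 1.0], [0.0, -1.0], [1.0, 0.0]]
p = softmax_probs(betas, [1.0, 2.0])
assert abs(sum(p) - 1.0) < 1e-12          # a valid distribution over classes
```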
Some exercises
1. Show how using the softmax function is equivalent to doing logistic classification when C = 2.
2. Implement the online logistic classification rule in Matlab/Octave.
3. Implement the nonlinear extension.
4. Investigate what happens when the classes are well separated, and what role the prior has in MAP learning.