CSCI567 Machine Learning (Fall 2014)


1 CSCI567 Machine Learning (Fall 2014). Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu). September 22, 2014.

2 Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification

3 Administration
A few announcements: Homework 1 is due 9/24 (see the homework sheets for detailed submission information). Revised lecture slides are on Blackboard and the course website.

4 Outline
1 Administration
2 Logistic Regression - continued: logistic regression, numerical optimization, gradient descent, gradient descent for logistic regression, Newton method
3 Multiclass classification

5 Logistic regression: setup for two classes
Input: x ∈ R^D. Output: y ∈ {0, 1}. Training data: D = {(x_n, y_n), n = 1, 2, ..., N}.
Model of the conditional distribution:
p(y = 1 | x; b, w) = σ[g(x)], where g(x) = b + Σ_d w_d x_d = b + w^T x
Linear decision boundary: g(x) = b + w^T x = 0

6 Logistic regression: maximum likelihood estimation
Cross-entropy error (negative log-likelihood):
E(b, w) = -Σ_n { y_n log σ(b + w^T x_n) + (1 - y_n) log[1 - σ(b + w^T x_n)] }
Numerical optimization: gradient descent is simple and scalable to large-scale problems; the Newton method is fast but not scalable.
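As a concrete reference, here is a minimal NumPy sketch of the sigmoid and this cross-entropy error (a sketch only; the function names are illustrative and not part of the course materials):

```python
import numpy as np

def sigmoid(a):
    # Logistic function: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(b, w, X, y):
    # Negative log-likelihood E(b, w); X holds one example x_n per row, y holds labels in {0, 1}
    p = sigmoid(b + X @ w)                       # p_n = sigma(b + w^T x_n)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```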

7 Logistic regression: shorthand notation
For convenience, append 1 to x: x ← [1 x_1 x_2 ... x_D]^T, and prepend b to w: w ← [b w_1 w_2 ... w_D]^T.
The cross-entropy error is then
E(w) = -Σ_n { y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)] }
NB: for cosmetic reasons, we do not use special symbols for the augmented x and w (as several textbooks do).

8 Numerical optimization: how do we find the optimal parameters for logistic regression?
We minimize the error function
E(w) = -Σ_n { y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)] }
However, this function is complex and does not admit the kind of simple closed-form solution we found for Naive Bayes, so we need numerical methods. Numerical methods are messier than clean analytic solutions; in practice, we often have to tune a few optimization parameters, so patience is necessary.

9 An overview of numerical methods
We describe two. Gradient descent (our focus in lecture) is simple and especially effective for large-scale problems. The Newton method is a classical and powerful method. Gradient descent is often referred to as a first-order method, as it only requires computing the gradients (i.e., the first-order derivatives) of the function. In contrast, the Newton method is often referred to as a second-order method.

10 Gradient descent: an example
Minimize f(θ) = 0.5 (θ_1^2 - θ_2)^2 + 0.5 (θ_1 - 1)^2. We compute the gradients
∂f/∂θ_1 = 2 (θ_1^2 - θ_2) θ_1 + θ_1 - 1   (1)
∂f/∂θ_2 = -(θ_1^2 - θ_2)   (2)
Use the following iterative procedure for gradient descent:
1 Initialize θ_1^(0) and θ_2^(0), and set t = 0.
2 do
θ_1^(t+1) ← θ_1^(t) - η [ 2 ((θ_1^(t))^2 - θ_2^(t)) θ_1^(t) + θ_1^(t) - 1 ]   (3)
θ_2^(t+1) ← θ_2^(t) + η ((θ_1^(t))^2 - θ_2^(t))   (4)
t ← t + 1   (5)
3 until f(θ^(t)) does not change much
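A short NumPy sketch of this toy example (the step size, starting point, and stopping tolerance below are illustrative choices, not values from the lecture):

```python
import numpy as np

def f(theta):
    t1, t2 = theta
    return 0.5 * (t1**2 - t2)**2 + 0.5 * (t1 - 1)**2

def grad_f(theta):
    # Gradients (1)-(2)
    t1, t2 = theta
    return np.array([2 * (t1**2 - t2) * t1 + t1 - 1,
                     -(t1**2 - t2)])

theta = np.array([0.0, 0.0])      # initialization theta^(0)
eta = 0.1                         # step size
prev = f(theta)
for t in range(10000):
    theta = theta - eta * grad_f(theta)     # updates (3)-(4)
    if abs(f(theta) - prev) < 1e-10:        # "until f does not change much"
        break
    prev = f(theta)
print(theta)   # approaches the minimizer (1, 1)
```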

11 Gradient descent: general form
For minimizing f(θ):
θ^(t+1) ← θ^(t) - η ∇f(θ^(t))
Remarks: η is often called the step size; literally, it controls how far the update moves along the direction of the negative gradient. Note that this is for minimizing a function, hence the subtraction (-η). With a suitable choice of η, the iterative procedure converges to a stationary point where ∇f(θ) = 0. Being a stationary point is only a necessary condition for being a minimum.

12-13 Seeing gradient descent in action
Choosing the right η is important: a small η is too slow, while a large η is too unstable.

14-16 Gradient descent for logistic regression: how do we do this for logistic regression?
Simple fact: the derivative of σ(a) is
dσ(a)/da = d/da [ 1 / (1 + e^(-a)) ] = e^(-a) / (1 + e^(-a))^2 = σ(a) [1 - σ(a)]
and therefore
d log σ(a)/da = 1 - σ(a)
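A quick numerical sanity check of these identities (a minimal sketch; the test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.7, 1e-6
finite_diff = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # numerical d sigma / da
analytic = sigmoid(a) * (1 - sigmoid(a))                    # sigma(a) [1 - sigma(a)]
assert abs(finite_diff - analytic) < 1e-8
```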

17 Gradients of the cross-entropy error function
∇E(w) = Σ_n { -y_n [1 - σ(w^T x_n)] x_n + (1 - y_n) σ(w^T x_n) x_n }   (6)
      = Σ_n { σ(w^T x_n) - y_n } x_n   (7)
Remarks: e_n = σ(w^T x_n) - y_n is called the error for the nth training sample. At a stationary point (in this case, the optimum):
Σ_n σ(w^T x_n) x_n = Σ_n y_n x_n
Intuition: on average, the error is zero.

18-19 Numerical optimization: gradient descent for logistic regression
Choose a proper step size η > 0, then iteratively update the parameters following the negative gradient to minimize the error function:
w^(t+1) ← w^(t) - η Σ_n { σ(w^T x_n) - y_n } x_n
Remarks: The step size needs to be chosen carefully to ensure convergence. The step size can also be adaptive (i.e., varying from iteration to iteration); for example, we can use techniques such as line search. There is a variant called stochastic gradient descent, also popularly used (covered later in this semester).
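Putting the pieces together, here is a minimal batch gradient descent sketch for logistic regression in NumPy, assuming the augmented notation with the bias folded into w (the step size and iteration count are illustrative; in practice one would monitor E(w) and stop when it no longer decreases):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.1, num_iters=1000):
    # X: N x D matrix whose first column is all ones (bias absorbed into w)
    # y: length-N vector of labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = sigmoid(X @ w) - y        # e_n = sigma(w^T x_n) - y_n
        w = w - eta * (X.T @ errors)       # w <- w - eta * sum_n e_n x_n
    return w
```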

20 Intuition for the Newton method
Approximate the true function with an easy-to-solve optimization problem.
[Figure: the true function f(x) and its quadratic approximation f_quad(x) around the current iterate x_k; minimizing the approximation gives the step to x_k + d_k.]

21 Newton method: approximation
Second-order Taylor expansion of the cross-entropy function:
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t))
where ∇E(w^(t)) is the gradient and H^(t) is the Hessian matrix, both evaluated at w^(t).
Example with a scalar function:
sin(θ) ≈ sin(0) + θ cos(0) + (1/2) θ^2 [-sin(0)] = θ
using d sin(θ)/dθ = cos(θ) and H = d cos(θ)/dθ = -sin(θ).

22 So what is the Hessian matrix?
It is the matrix of second-order derivatives:
H = ∂²E(w) / (∂w ∂w^T), in other words, H_ij = ∂/∂w_j ( ∂E(w)/∂w_i )
So the Hessian matrix is in R^(D×D), where w ∈ R^D.

23-24 Newton method: optimizing the approximation
Minimize the approximation
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t))
and use the solution as the new estimate of the parameters:
w^(t+1) ← argmin_w [ (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t)) ]
The quadratic minimization has a closed form, so we obtain
w^(t+1) ← w^(t) - (H^(t))^(-1) ∇E(w^(t))
i.e., the Newton method.
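A minimal sketch of this update for a generic smooth function, assuming callables for the gradient and Hessian are available (the names and iteration count are illustrative):

```python
import numpy as np

def newton_method(grad, hess, w0, num_iters=20):
    # Newton update: w <- w - H(w)^{-1} grad(w)
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        step = np.linalg.solve(hess(w), grad(w))   # solve H d = grad E rather than inverting H
        w = w - step
    return w
```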

25 Contrasting gradient descent and the Newton method
Similarity: both are iterative procedures.
Differences: the Newton method requires second-order derivatives, and the Newton method does not have the magic η to be set.

26-27 Other important facts about the Hessian
Our cross-entropy error function is convex.
∇E(w) = Σ_n { σ(w^T x_n) - y_n } x_n   (8)
H = ∂²E(w) / (∂w ∂w^T) = homework   (9)
For any vector v, v^T H v = homework ≥ 0. Thus H is positive definite, and the cross-entropy error function is convex, with only one global optimum.
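Without giving away the homework derivation, the claim can be checked numerically: build a finite-difference Hessian of E(w) on random data and test the quadratic form for a random v (a rough sketch; the data, step size h, and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def E(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
h, I = 1e-5, np.eye(3)
# Forward-difference Hessian: H_ij ~ [E(w + h e_i + h e_j) - E(w + h e_i) - E(w + h e_j) + E(w)] / h^2
H = np.array([[(E(w + h * I[i] + h * I[j], X, y) - E(w + h * I[i], X, y)
                - E(w + h * I[j], X, y) + E(w, X, y)) / h**2
               for j in range(3)] for i in range(3)])
v = rng.normal(size=3)
print(v @ H @ v >= -1e-4)   # True: the quadratic form is (numerically) nonnegative
```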

28 What is good about the Newton method?
It is fast! Suppose we want to minimize f(x) = x^2 + 2x and our current estimate is x^(t). What is the next estimate?
x^(t+1) ← x^(t) - [f''(x^(t))]^(-1) f'(x^(t)) = x^(t) - (1/2)(2 x^(t) + 2) = -1
Namely, the next step (of iteration) immediately gives us the global optimum! (In optimization, this is called a superlinear convergence rate.) In general, the better our approximation, the faster the Newton method is at solving our optimization problem.
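A tiny sketch illustrating this one-step behaviour on the quadratic above (values hard-coded from the example):

```python
def f_prime(x):
    return 2 * x + 2    # f'(x) for f(x) = x^2 + 2x

def f_double_prime(x):
    return 2.0          # f''(x) is constant for this quadratic

x = 5.0                                        # any starting point
x_next = x - f_prime(x) / f_double_prime(x)    # Newton step
print(x_next)                                  # -1.0, the global minimizer, after a single step
```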

29 What is bad about the Newton method?
It is not scalable! Computing and inverting the Hessian matrix can be very expensive for large-scale problems where the dimensionality D is very large. The Newton method also does not guarantee convergence if the starting point is far away from the optimum.
NB: there are fixes and alternatives, such as quasi-Newton (quasi-second-order) methods.

30 Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification: use binary classifiers as building blocks; multinomial logistic regression

31 Multiclass classification: setup
Suppose we need to predict multiple classes/outcomes C_1, C_2, ..., C_K. Examples: weather prediction (sunny, cloudy, raining, etc.); optical character recognition (10 digits + 26 characters in lower and upper cases + special characters, etc.).
Methods studied so far: nearest neighbor classifier, Naive Bayes, Gaussian discriminant analysis, logistic regression.

32-35 Use binary classifiers as building blocks: logistic regression for predicting multiple classes? Easy.
The approach of one versus the rest. For each class C_k, change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1").
2 Relabel all the rest of the data as negative (or "0").
This step is often called 1-of-K encoding: only one entry is nonzero and everything else is zero. Example: for class C_2, the data go through the following change:
(x_1, C_1) → (x_1, 0), (x_2, C_3) → (x_2, 0), ..., (x_n, C_2) → (x_n, 1), ...
Train K binary classifiers using logistic regression, each differentiating one class from the rest. When predicting on x, combine the outputs of all binary classifiers:
1 What if all the classifiers say negative?
2 What if multiple classifiers say positive?
Take-home exercise: there are different combination strategies. Can you think of any? (One concrete choice is sketched below.)
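As one concrete illustration (not necessarily the strategy the exercise has in mind), here is a sketch of one versus the rest built on the fit_logistic and sigmoid routines sketched earlier; ties are resolved by taking the class with the largest predicted probability:

```python
import numpy as np
# assumes sigmoid and fit_logistic from the earlier gradient descent sketch are in scope

def fit_one_vs_rest(X, y, num_classes, **kwargs):
    # One binary logistic regression per class: class k (positive) vs. the rest (negative)
    return [fit_logistic(X, (y == k).astype(float), **kwargs) for k in range(num_classes)]

def predict_one_vs_rest(ws, X):
    # Score each class with sigma(w_k^T x) and pick the largest
    scores = np.column_stack([sigmoid(X @ w) for w in ws])
    return np.argmax(scores, axis=1)
```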

36-38 Yet another easy approach: one versus one
For each pair of classes C_k and C_k', change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1").
2 Relabel training data with label C_k' as negative (or "0").
3 Disregard all other data.
Example: for classes C_1 and C_2,
(x_1, C_1), (x_2, C_3), (x_3, C_2), ... → (x_1, 1), (x_3, 0), ...
Train K(K-1)/2 binary classifiers using logistic regression, one per pair of classes. When predicting on x, combine the outputs of all binary classifiers; there are K(K-1)/2 votes!
Take-home exercise: can you think of any good combination strategies? (A simple majority-vote version is sketched below.)
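A matching sketch for one versus one, again reusing fit_logistic; the K(K-1)/2 outputs are combined here by simple majority voting, which is just one of several reasonable strategies:

```python
import numpy as np
from itertools import combinations
# assumes sigmoid and fit_logistic from the earlier gradient descent sketch are in scope

def fit_one_vs_one(X, y, num_classes, **kwargs):
    # One binary classifier per pair (k, l), trained only on data from those two classes
    models = {}
    for k, l in combinations(range(num_classes), 2):
        mask = (y == k) | (y == l)
        models[(k, l)] = fit_logistic(X[mask], (y[mask] == k).astype(float), **kwargs)
    return models

def predict_one_vs_one(models, X, num_classes):
    votes = np.zeros((X.shape[0], num_classes))
    for (k, l), w in models.items():
        picks_k = sigmoid(X @ w) > 0.5   # True where the pairwise classifier votes for class k
        votes[:, k] += picks_k
        votes[:, l] += ~picks_k
    return np.argmax(votes, axis=1)      # class with the most pairwise votes
```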

39-41 Contrasting the two approaches
Pros and cons of each approach:
One versus the rest only needs to train K classifiers, which makes a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes?
One versus one only needs to train each classifier on a smaller subset of the data (only the data labeled with the two classes involved), which makes a huge difference if you have a lot of data to go through.
What is bad about both of them: combining the classifiers' outputs seems to be a bit tricky. Any other good methods?

42 Multinomial logistic regression
Intuition: from the decision rule of our Naive Bayes classifier,
y = argmax_c p(y = c | x) = argmax_c log p(x | y = c) p(y = c)   (10)
  = argmax_c [ log π_c + Σ_k z_k log θ_ck ] = argmax_c w_c^T x   (11)
Essentially, we are comparing
w_1^T x, w_2^T x, ..., w_C^T x   (12)
one for each category.

43-44 First try
So, can we define the following conditional model?
p(y = c | x) = σ[w_c^T x]
This would not work, at least for the reason that
Σ_c p(y = c | x) = Σ_c σ[w_c^T x] ≠ 1
as each summand can (independently) be any number between 0 and 1. But we are close.

45-46 Definition of multinomial logistic regression
Model: for each class C_k, we have a parameter vector w_k and model the posterior probability as
p(C_k | x) = exp(w_k^T x) / Σ_k' exp(w_k'^T x)
This is called the softmax function.
Decision boundary: assign x the label with the maximum posterior probability,
argmax_k P(C_k | x), which is equivalent to argmax_k w_k^T x
Note: the notation has changed to denote the classes as C_k instead of just c.
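A minimal sketch of the softmax posterior and the corresponding decision rule (the max-subtraction is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax_posterior(W, x):
    # W: K x D matrix whose k-th row is w_k; returns p(C_k | x) for k = 1, ..., K
    scores = W @ x
    scores = scores - np.max(scores)    # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def predict(W, x):
    # argmax_k p(C_k | x), which is equivalent to argmax_k w_k^T x
    return int(np.argmax(W @ x))
```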
