Logistic Regression. Machine Learning CSEP546 Carlos Guestrin University of Washington
1 Logistic Regression Machine Learning CSEP546 Carlos Guestrin University of Washington January 27, 2014 Carlos Guestrin
2 Reading Your Brain, Simple Example Pairwise classification accuracy: 85% Person Animal [Mitchell et al.] Carlos Guestrin
3 Classification Learn h: X → Y. X: features, Y: target classes. Simplest case: Thresholding Carlos Guestrin
4 Linear (Hyperplane) Decision Boundaries Carlos Guestrin
5 Classification Learn h: X → Y. X: features, Y: target classes. Thus far: just a decision boundary. What if you want the probability of each class, P(Y|X)? Carlos Guestrin
6 Ad Placement Strategies Companies bid on ad prices Which ad wins? (many simplifications here) Naively: But: Instead: Carlos Guestrin
7 Link Functions Estimating P(Y|X): Why not use standard linear regression? Combining regression and probability? Need a mapping from real values to [0,1]: a link function! Carlos Guestrin
8 Logistic Regression Logistic function (or Sigmoid): σ(z) = 1 / (1 + e^{−z}). Learn P(Y|X) directly. Assume a particular functional form for the link function: the sigmoid applied to a linear function of the input features, P(Y=1 | x, w) = 1 / (1 + exp(−(w_0 + Σ_i w_i x_i))). Features can be discrete or continuous! Carlos Guestrin
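As an illustrative sketch (not from the slides), the model above can be written in a few lines of Python with numpy; the function and variable names are assumptions:

    import numpy as np

    def sigmoid(z):
        # Logistic (sigmoid) link: maps any real value into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(x, w, w0):
        # P(Y=1 | x, w) = sigmoid(w0 + w . x); features may be discrete or continuous.
        return sigmoid(w0 + np.dot(w, x))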
9 Understanding the sigmoid: plots of σ(w_0 + w_1 x) for w_0 = −2, w_1 = −1; w_0 = 0, w_1 = −1; and w_0 = 0, w_1 = … Carlos Guestrin
10 Logistic Regression: a linear classifier Carlos Guestrin
11 Very convenient! w_0 + Σ_i w_i x_i > 0 implies P(Y=1 | x, w) > 1/2, and w_0 + Σ_i w_i x_i < 0 implies P(Y=0 | x, w) > 1/2: the sign of the linear function gives a linear classification rule! Carlos Guestrin
12 Loss function: Conditional Likelihood Have a bunch of iid data of the form {(x^j, y^j)}_{j=1}^N. Discriminative (logistic regression) loss function: the conditional data likelihood, ∏_{j=1}^N P(y^j | x^j, w). Carlos Guestrin
13 Expressing Conditional Log Likelihood ℓ(w) = Σ_j [ y^j ln P(Y=1 | x^j, w) + (1 − y^j) ln P(Y=0 | x^j, w) ] Carlos Guestrin
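A minimal sketch of computing this conditional log likelihood with numpy, assuming X is an (N, k) feature matrix and y holds 0/1 labels (names are illustrative):

    import numpy as np

    def conditional_log_likelihood(X, y, w, w0):
        # X: (N, k) feature matrix, y: (N,) labels in {0, 1}.
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y=1 | x^j, w) for every example
        # l(w) = sum_j [ y^j ln P(Y=1|x^j,w) + (1 - y^j) ln P(Y=0|x^j,w) ]
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))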
14 Maximizing Conditional Log Likelihood Good news: l(w) is a concave function of w, so there are no local-optima problems. Bad news: no closed-form solution to maximize l(w). Good news: concave functions are easy to optimize. Carlos Guestrin
15 Optimizing a concave function: Gradient ascent. The conditional likelihood for logistic regression is concave, so find the optimum with gradient ascent. Gradient: ∇ℓ(w) = [∂ℓ(w)/∂w_0, …, ∂ℓ(w)/∂w_k]. Step size: η > 0. Update rule: w^{(t+1)} ← w^{(t)} + η ∇ℓ(w^{(t)}). Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent can be much better. Carlos Guestrin
16 Coordinate Descent v. Gradient Descent Illustration from Wikipedia Carlos Guestrin
17 Maximize Conditional Log Likelihood: Gradient ∂ℓ(w)/∂w_i = Σ_{j=1}^N x_i^j ( y^j − P(Y=1 | x^j, w) ) Carlos Guestrin
18 Gradient Descent for LR: Intuition
1. Encode data as numbers:
Gender | Age | Location | Income | Referrer | New or Returning | Clicked?
F | Young | US | High | Google | New | N
M | Middle | US | Low | Direct | New | N
F | Old | BR | Low | Google | Returning | Y
M | Young | BR | Low | Bing | Returning | N
Encodings: Gender (F=1, M=0), Age (Young=0, Middle=1, Old=2), Location (US=1, Abroad=0), Income (High=1, Low=0), Referrer (…), New or Returning (New=1, Returning=0), Clicked? (Click=1, NoClick=0)
2. Until convergence: for each feature, (a) compute the average gradient over data points, (b) update the parameter. Carlos Guestrin
19 Gradient Ascent for LR Gradient ascent algorithm: iterate until change < ε. For i = 0, …, k: w_i^{(t+1)} ← w_i^{(t)} + η Σ_j x_i^j ( y^j − P(Y=1 | x^j, w^{(t)}) ); repeat. Carlos Guestrin
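A hedged sketch of this batch gradient ascent loop, assuming numpy arrays and illustrative parameter names (the step size and stopping tolerance are placeholders, not values from the slides):

    import numpy as np

    def gradient_ascent_lr(X, y, eta=0.1, eps=1e-6, max_iter=1000):
        # Batch gradient ascent on the (unregularized) conditional log-likelihood.
        N, k = X.shape
        w, w0 = np.zeros(k), 0.0
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))      # P(Y=1 | x^j, w)
            grad_w = X.T @ (y - p)                       # dl/dw_i = sum_j x_i^j (y^j - p^j)
            grad_w0 = np.sum(y - p)
            w, w0 = w + eta * grad_w, w0 + eta * grad_w0
            if eta * max(np.max(np.abs(grad_w)), abs(grad_w0)) < eps:  # change below epsilon
                break
        return w, w0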
20 Regularization in linear regression Overfitting usually leads to very large parameter choices: e.g., a reasonable fit has small coefficients such as 0.30, while an overfit polynomial has huge ones such as …,700,910.7 and 8,585,638.4 on its X and X² terms. Regularized least-squares (a.k.a. ridge regression), for λ > 0: minimize Σ_j ( y^j − (w_0 + Σ_i w_i x_i^j) )² + λ Σ_i w_i². Carlos Guestrin
21 Linear Separability Carlos Guestrin
22 Large parameters → Overfitting If the data is linearly separable, the weights go to infinity. In general, large weights lead to overfitting; penalizing high weights can prevent overfitting. Carlos Guestrin
23 Regularized Conditional Log Likelihood Add a regularization penalty, e.g., L2: ℓ(w) = ln ∏_{j=1}^N P(y^j | x^j, w) − (λ/2) ||w||_2^2. Practical note about w_0: the bias term is typically not regularized. Gradient of the regularized likelihood: ∂ℓ(w)/∂w_i = Σ_j x_i^j ( y^j − P(Y=1 | x^j, w) ) − λ w_i. Carlos Guestrin
24 Standard v. Regularized Updates Maximum conditional likelihood estimate: w* = arg max_w ln ∏_{j=1}^N P(y^j | x^j, w), with update w_i^{(t+1)} ← w_i^{(t)} + η Σ_j x_i^j ( y^j − P(Y=1 | x^j, w^{(t)}) ). Regularized maximum conditional likelihood estimate: w* = arg max_w [ ln ∏_{j=1}^N P(y^j | x^j, w) − (λ/2) Σ_{i=1}^k w_i^2 ], with update w_i^{(t+1)} ← w_i^{(t)} + η [ −λ w_i^{(t)} + Σ_j x_i^j ( y^j − P(Y=1 | x^j, w^{(t)}) ) ]. Carlos Guestrin
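One regularized update step, sketched under the same assumptions as the earlier snippets (the bias w_0 is left unregularized, following the practical note above; names are illustrative):

    import numpy as np

    def regularized_gradient_step(X, y, w, w0, eta=0.1, lam=1.0):
        # One gradient-ascent step on the L2-regularized conditional log-likelihood.
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
        w_new = w + eta * (X.T @ (y - p) - lam * w)   # gradient minus lambda * w
        w0_new = w0 + eta * np.sum(y - p)             # bias term not penalized
        return w_new, w0_new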
25 Please Stop!! Stopping criterion: ℓ(w) = ln ∏_j P(y^j | x^j, w) − (λ/2) ||w||_2^2. When do we stop doing gradient descent? Because ℓ(w) is strongly concave (i.e., because of some technical condition), ℓ(w*) − ℓ(w) ≤ (1/(2λ)) ||∇ℓ(w)||_2^2. Thus, stop when ||∇ℓ(w)||_2^2 is small enough that this bound meets the accuracy we want. Carlos Guestrin
26 Digression: Logistic regression for more than 2 classes Logistic regression in the more general case (C classes), where Y ∈ {0, …, C−1} Carlos Guestrin
27 Digression: Logistic regression more generally Logistic regression in the more general case, where Y ∈ {0, …, C−1}. For c > 0: P(Y=c | x, w) = exp( w_{c0} + Σ_{i=1}^k w_{ci} x_i ) / ( 1 + Σ_{c'=1}^{C−1} exp( w_{c'0} + Σ_{i=1}^k w_{c'i} x_i ) ). For c = 0 (normalization, so no weights for this class): P(Y=0 | x, w) = 1 / ( 1 + Σ_{c'=1}^{C−1} exp( w_{c'0} + Σ_{i=1}^k w_{c'i} x_i ) ). The learning procedure is basically the same as what we derived! Carlos Guestrin
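A small sketch of these multiclass probabilities, assuming a weight matrix W of shape (C−1, k) and intercepts b for classes 1..C−1, with class 0 as the normalization class (names are illustrative):

    import numpy as np

    def multiclass_proba(x, W, b):
        # W: (C-1, k) weights for classes 1..C-1, b: (C-1,) intercepts w_c0.
        # Class 0 carries no weights; it is the normalization class.
        scores = np.exp(b + W @ x)            # exp(w_c0 + sum_i w_ci x_i), c = 1..C-1
        denom = 1.0 + np.sum(scores)
        p = np.empty(len(scores) + 1)
        p[0] = 1.0 / denom                    # P(Y=0 | x, w)
        p[1:] = scores / denom                # P(Y=c | x, w), c > 0
        return p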
28 Stochastic Gradient Descent Machine Learning CSEP546 Carlos Guestrin University of Washington January 27, 2014 Carlos Guestrin
29 The Cost, The Cost!!! Think about the cost. What's the cost of a gradient update step for LR? Each update w_i^{(t+1)} ← w_i^{(t)} + η Σ_j x_i^j ( y^j − P(Y=1 | x^j, w^{(t)}) ) requires a full pass over all N data points. Carlos Guestrin
30 Learning Problems as Expectations Minimizing loss in training data: given a dataset sampled iid from some distribution p(x) on features, and a loss function (e.g., squared error, logistic loss, …), we often minimize the loss on the training data: ℓ_D(w) = (1/N) Σ_{j=1}^N ℓ(w, x^j). However, we should really minimize the expected loss on all data: ℓ(w) = E_x[ ℓ(w, x) ] = ∫ p(x) ℓ(w, x) dx. So we are approximating the integral by the average on the training data. Carlos Guestrin
31 SGD: Stochastic Gradient Ascent (or Descent) True gradient: ∇ℓ(w) = E_x[ ∇ℓ(w, x) ]. Sample-based approximation: what if we estimate the gradient with just one sample? It is an unbiased estimate of the gradient, but very noisy! Called stochastic gradient ascent (or descent), among many other names. VERY useful in practice!!! Carlos Guestrin
32 Stochastic Gradient Ascent for Logistic Regression Logistic loss as a stochastic function: E_x[ ℓ(w, x) ] = E_x[ ln P(y | x, w) ] − (λ/2) ||w||_2^2. Batch gradient ascent updates: w_i^{(t+1)} ← w_i^{(t)} + η { −λ w_i^{(t)} + (1/N) Σ_{j=1}^N x_i^{(j)} [ y^{(j)} − P(Y=1 | x^{(j)}, w^{(t)}) ] }. Stochastic gradient ascent updates (online setting): w_i^{(t+1)} ← w_i^{(t)} + η_t { −λ w_i^{(t)} + x_i^{(t)} [ y^{(t)} − P(Y=1 | x^{(t)}, w^{(t)}) ] }. Carlos Guestrin
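A sketch of the stochastic update, assuming 0/1 labels, numpy arrays, and a decaying step size of the form η_t = η_0/√t (the schedule and hyperparameter values are illustrative choices, not prescribed by the slides):

    import numpy as np

    def sgd_logistic_regression(X, y, eta0=0.5, lam=0.01, epochs=5):
        # Stochastic gradient ascent on the L2-regularized conditional log-likelihood:
        # one randomly chosen example per update instead of a full pass over the data.
        N, k = X.shape
        w, w0 = np.zeros(k), 0.0
        t = 0
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for j in rng.permutation(N):
                t += 1
                eta = eta0 / np.sqrt(t)               # step size decays with iterations
                p = 1.0 / (1.0 + np.exp(-(w0 + X[j] @ w)))
                w += eta * (-lam * w + X[j] * (y[j] - p))
                w0 += eta * (y[j] - p)                # bias not regularized
        return w, w0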
33 Stochastic Gradient Descent for LR: Intuition
Gender | Age | Location | Income | Referrer | New or Returning | Clicked?
F | Young | US | High | Google | New | N
M | Middle | US | Low | Direct | New | N
F | Old | BR | Low | Google | Returning | Y
M | Young | BR | Low | Bing | Returning | N
Encodings: Gender (F=1, M=0), Age (Young=0, Middle=1, Old=2), Location (US=1, Abroad=0), Income (High=1, Low=0), Referrer (…), New or Returning (New=1, Returning=0), Clicked? (Click=1, NoClick=0)
1. Until convergence: get a data point, (a) encode the data as numbers, (b) for each feature: (i) compute the gradient for this data point, (ii) update the parameter: w_i^{(t+1)} ← w_i^{(t)} + η_t { −λ w_i^{(t)} + x_i^{(t)} [ y^{(t)} − P(Y=1 | x^{(t)}, w^{(t)}) ] }. Carlos Guestrin
34 Stochastic Gradient Ascent: general case Given a stochastic function of parameters ℓ(w, x), we want to find its maximum. Start from w^{(0)} and repeat until convergence: get a sample data point x_t and update the parameters, w^{(t+1)} ← w^{(t)} + η_t ∇_w ℓ(w^{(t)}, x_t). Works in the online learning setting! The complexity of each gradient step is constant in the number of examples! In general, the step size changes with iterations. Carlos Guestrin
35 What you should know Classification: predict discrete classes rather than real values Logistic regression model: Linear model Logistic function maps real values to [0,1] Optimize conditional likelihood Gradient computation Overfitting Regularization Regularized optimization Cost of gradient step is high, use stochastic gradient descent Carlos Guestrin
36 What's the Perceptron Optimizing? Machine Learning CSEP546 Carlos Guestrin University of Washington January 27, 2014 Carlos Guestrin
37 Remember our friend the Perceptron Algorithm At each time step: observe a data point (x^{(t)}, y^{(t)}); update the parameters if we make a mistake, i.e., if y^{(t)} (w^{(t)} · x^{(t)}) ≤ 0, set w^{(t+1)} ← w^{(t)} + y^{(t)} x^{(t)}. Carlos Guestrin
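A minimal sketch of the perceptron loop, assuming labels in {−1, +1} and numpy arrays (the epoch count is an illustrative choice):

    import numpy as np

    def perceptron(X, y, epochs=10):
        # Perceptron algorithm with labels y in {-1, +1}: update only on mistakes.
        N, k = X.shape
        w = np.zeros(k)
        for _ in range(epochs):
            for j in range(N):
                if y[j] * np.dot(w, X[j]) <= 0:   # mistake (or zero margin)
                    w = w + y[j] * X[j]
        return w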
38 What is the Perceptron Doing??? When we discussed logistic regression: Started from maximizing conditional log-likelihood When we discussed the Perceptron: Started from description of an algorithm What is the Perceptron optimizing???? Carlos Guestrin
39 Perceptron Prediction: Margin of Confidence Carlos Guestrin
40 Hinge Loss Perceptron prediction: sign(w · x). Makes a mistake when y (w · x) ≤ 0. Hinge loss: ℓ(w; x, y) = max(0, −y (w · x)) (same as maximizing the margin used by SVMs). Carlos Guestrin
41 Stochastic Gradient Descent for Hinge Loss SGD: observe data point x^{(t)}, update each parameter w_i^{(t+1)} ← w_i^{(t)} − η_t ∂ℓ(w^{(t)}, x^{(t)})/∂w_i. How do we compute the gradient for hinge loss? Carlos Guestrin
42 (Sub)gradient of Hinge Hinge loss: ℓ(w, x^{(t)}) = max(0, −y^{(t)} (w · x^{(t)})), with update w_i^{(t+1)} ← w_i^{(t)} − η_t ∂ℓ(w^{(t)}, x^{(t)})/∂w_i. Subgradient of hinge loss: if y^{(t)} (w · x^{(t)}) > 0: 0; if y^{(t)} (w · x^{(t)}) < 0: −y^{(t)} x_i^{(t)}; if y^{(t)} (w · x^{(t)}) = 0: anything in between works. In one line: ∂ℓ/∂w_i = −1[ y^{(t)} (w · x^{(t)}) ≤ 0 ] y^{(t)} x_i^{(t)}. Carlos Guestrin
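The one-line rule above, sketched as a small helper; x is assumed to be a numpy array and y a label in {−1, +1} (names are illustrative):

    import numpy as np

    def hinge_subgradient(w, x, y):
        # Subgradient (w.r.t. w) of the perceptron hinge loss max(0, -y (w . x)).
        # Returns 0 when the point has positive margin, -y x otherwise.
        if y * np.dot(w, x) > 0:
            return np.zeros_like(w)
        return -y * x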
43 Stochastic Gradient Descent for Hinge Loss SGD: observe data point x^{(t)}, update each parameter w_i^{(t+1)} ← w_i^{(t)} − η_t ∂ℓ(w^{(t)}, x^{(t)})/∂w_i. How do we compute the gradient for hinge loss? Carlos Guestrin
44 Perceptron Revisited SGD for hinge loss: w^{(t+1)} ← w^{(t)} + η_t 1[ y^{(t)} (w^{(t)} · x^{(t)}) ≤ 0 ] y^{(t)} x^{(t)}. Perceptron update: w^{(t+1)} ← w^{(t)} + 1[ y^{(t)} (w^{(t)} · x^{(t)}) ≤ 0 ] y^{(t)} x^{(t)}. Difference? The perceptron is SGD on the hinge loss with step size η_t = 1. Carlos Guestrin
45 What you need to know Perceptron is optimizing hinge loss. Subgradients and hinge loss. (Sub)gradient descent for the hinge objective. Carlos Guestrin
46 Support Vector Machines Machine Learning CSEP546 Carlos Guestrin University of Washington January 27, 2014 Carlos Guestrin
47 Support Vector Machines One of the most effective classifiers to date! Popularized kernels. There is a complicated derivation, but it is very simple based on what you've learned thus far! Carlos Guestrin
48 Linear classifiers Which line is better? Carlos Guestrin
49 Pick the one with the largest margin! Decision boundary: w · x + w_0 = 0. Confidence: y^j (w · x^j + w_0). Carlos Guestrin
50 Maximize the margin w · x + w_0 = 0 Carlos Guestrin
51 SVMs = Hinge Loss + L2 Regularization Maximizing the margin is the same as regularized hinge loss. But the SVM convention is that the confidence has to be at least 1: w · x + w_0 = +1, w · x + w_0 = 0, w · x + w_0 = −1. Carlos Guestrin
52 L2 Regularized Hinge Loss Final objective, adding regularization. But, again, in SVMs the convention is slightly different (but equivalent): min_w ||w||_2^2 + C Σ_{j=1}^N [ 1 − y^j (w · x^j + w_0) ]_+ Carlos Guestrin
53 SVMs for Non-Linearly Separable data, meet my friend the Perceptron. The Perceptron was minimizing the hinge loss: Σ_{j=1}^N [ −y^j (w · x^j + w_0) ]_+. SVMs minimize the regularized hinge loss!! min_w ||w||_2^2 + C Σ_{j=1}^N [ 1 − y^j (w · x^j + w_0) ]_+ Carlos Guestrin
54 Stochastic Gradient Descent for SVMs Perceptron minimization: Σ_{j=1}^N [ −y^j (w · x^j + w_0) ]_+, with SGD for the Perceptron: w^{(t+1)} ← w^{(t)} + η_t 1[ y^{(t)} (w^{(t)} · x^{(t)}) ≤ 0 ] y^{(t)} x^{(t)}. SVMs minimization: min_w ||w||_2^2 + C Σ_{j=1}^N [ 1 − y^j (w · x^j + w_0) ]_+, with SGD for SVMs: the same kind of per-example subgradient step, but with the indicator on 1 − y^{(t)} (w^{(t)} · x^{(t)} + w_0) > 0 and an extra term from the ||w||_2^2 regularizer. Carlos Guestrin
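A hedged sketch of SGD for the SVM objective. Note that it uses the equivalent λ-weighted form of the objective with a Pegasos-style decaying step size rather than the C-weighted convention on the slide, and it omits the bias term for brevity; labels are assumed to be in {−1, +1}:

    import numpy as np

    def sgd_svm(X, y, lam=0.01, epochs=5):
        # Stochastic subgradient descent on
        # (lam/2) ||w||^2 + (1/N) sum_j max(0, 1 - y_j (w . x_j)).
        N, k = X.shape
        w = np.zeros(k)
        t = 0
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for j in rng.permutation(N):
                t += 1
                eta = 1.0 / (lam * t)                  # decaying step size
                if y[j] * np.dot(w, X[j]) < 1:         # margin violated: hinge active
                    w = (1 - eta * lam) * w + eta * y[j] * X[j]
                else:                                  # only the regularizer contributes
                    w = (1 - eta * lam) * w
        return w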
55 What you need to know Maximizing margin Derivation of SVM formulation Non-linearly separable case Hinge loss A.K.A. adding slack variables SVMs = Perceptron + L2 regularization Can also use kernels with SVMs Can optimize SVMs with SGD Many other approaches possible Carlos Guestrin
56 Boosting Machine Learning CSEP546 Carlos Guestrin University of Washington January 27, 2014 Carlos Guestrin
57 Fighting the bias-variance tradeoff Simple (a.k.a. weak) learners are good: e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees) have low variance and don't usually overfit too badly. Simple (a.k.a. weak) learners are bad: high bias, can't solve hard learning problems. Can we make weak learners always good??? No!!! But often yes Carlos Guestrin
58 The Simplest Weak Learner: Thresholding, a.k.a. Decision Stumps Learn h: X → Y. X: features, Y: target classes. Simplest case: Thresholding Carlos Guestrin
59 Voting (Ensemble Methods) Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space Output class: (Weighted) vote of each classifier Classifiers that are most sure will vote with more conviction Classifiers will be most sure about a particular part of the space On average, do better than single classifier! But how do you??? force classifiers to learn about different parts of the input space? weigh the votes of different classifiers? Carlos Guestrin
60 Boosting [Schapire, 1989] Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. On each iteration t: weight each training example by how incorrectly it was classified, learn a hypothesis h_t and a strength for this hypothesis α_t. Final classifier: H(x) = sign( Σ_t α_t h_t(x) ). Practically useful, theoretically interesting. Carlos Guestrin
61 Learning from weighted data Sometimes not all data points are equal; some data points are more equal than others. Consider a weighted dataset: D(j) = weight of the j-th training example (x_j, y_j). Interpretations: the j-th training example counts as D(j) examples; if I were to resample data, I would get more samples of heavier data points. Now, in all calculations, whenever used, the j-th training example counts as D(j) examples. Carlos Guestrin
62 Boosting Cartoon Carlos Guestrin
63 AdaBoost Initialize weights to the uniform distribution: D_1(j) = 1/N. For t = 1 … T: train weak learner h_t on distribution D_t over the data, choose weight α_t, and update the weights: D_{t+1}(j) = D_t(j) exp( −α_t y_j h_t(x_j) ) / Z_t, where Z_t is the normalizer Z_t = Σ_{j=1}^N D_t(j) exp( −α_t y_j h_t(x_j) ). Output final classifier: H(x) = sign( Σ_{t=1}^T α_t h_t(x) ). Carlos Guestrin
64 Picking Weight of Weak Learner Weigh h_t higher if it did well on the training data (weighted by D_t): α_t = (1/2) ln( (1 − ε_t) / ε_t ), where ε_t is the weighted training error: ε_t = Σ_{j=1}^N D_t(j) 1[ h_t(x_j) ≠ y_j ]. Carlos Guestrin
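Putting the weight update and the choice of α_t together, a sketch of the AdaBoost loop. The weak_learner interface (taking X, y, and the weights D and returning an object with a predict method) is an assumption for illustration, and labels are assumed to be in {−1, +1}:

    import numpy as np

    def adaboost(X, y, weak_learner, T):
        # AdaBoost with labels y in {-1, +1}.
        N = len(y)
        D = np.full(N, 1.0 / N)                       # D_1(j) = 1/N
        hypotheses, alphas = [], []
        for t in range(T):
            h = weak_learner(X, y, D)                 # train on weighted data
            pred = h.predict(X)
            eps = np.sum(D * (pred != y))             # weighted training error
            eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against degenerate errors
            alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = 1/2 ln((1-eps_t)/eps_t)
            D = D * np.exp(-alpha * y * pred)
            D /= D.sum()                              # divide by the normalizer Z_t
            hypotheses.append(h)
            alphas.append(alpha)
        def H(Xnew):                                  # final classifier: weighted vote
            return np.sign(sum(a * h.predict(Xnew) for a, h in zip(alphas, hypotheses)))
        return H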
65 AdaBoost Cartoon D_{t+1}(j) = D_t(j) exp( −α_t y_j h_t(x_j) ) / Z_t, α_t = (1/2) ln( (1 − ε_t) / ε_t ) Carlos Guestrin
66 Why choose α_t for hypothesis h_t this way? [Schapire, 1989] Simple theoretical analysis: Z_t = Σ_{j=1}^N D_t(j) exp( −α_t y_j h_t(x_j) ). Training error is upper-bounded by the product of the normalizers: (1/N) Σ_{j=1}^N 1[ H(x_j) ≠ y_j ] ≤ ∏_{t=1}^T Z_t. Pick α_t to minimize the upper bound: take the derivative and set it to zero! Carlos Guestrin
67 Strong, weak classifiers If each classifier is (at least slightly) better than random, ε_t < 0.5, AdaBoost will achieve zero training error (exponentially fast): (1/N) Σ_{j=1}^N 1[ H(x_j) ≠ y_j ] ≤ ∏_{t=1}^T Z_t ≤ exp( −2 Σ_{t=1}^T (1/2 − ε_t)^2 ). Is it hard to achieve better than random training error? Carlos Guestrin
68 Boosting results Digit recognition [Schapire, 1989] Boosting is often robust to overfitting: test set error decreases even after training error is zero. Carlos Guestrin
69 Boosting: Experimental Results [Freund & Schapire, 1996] Comparison of C4.5, Boosting C4.5, and Boosting decision stumps (depth-1 trees) on 27 benchmark datasets (error comparison plots). Carlos Guestrin
71 What you need to know about Boosting Combine weak classifiers to obtain very strong classifier Weak classifier slightly better than random on training data Resulting very strong classifier can eventually provide zero training error AdaBoost algorithm Most popular application of Boosting: Boosted decision stumps! Very simple to implement, very effective classifier Carlos Guestrin