Learning Logistic Regressors by Gradient Descent

Size: px

Start display at page:

Download "Learning Logistic Regressors by Gradient Descent"

Claud Dawson
7 years ago
Views:

1 Learning Logistic Regressors by Gradient Descent Machine Learning CSE446 Carlos Guestrin University of Washington April 17, Classification Learn: h:x Y X features Y target classes Conditional probability: P(Y X) Suppose you know P(Y X) exactly, how should you classify? Bayes optimal classifier: How do we estimate P(Y X)? 2 1

2 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular functional form for link function Sigmoid applied to a linear function of the input features: Z Features can be discrete or continuous! 3 Logistic Regression a Linear classifier

3 Loss function: Conditional Likelihood Have a bunch of iid data of the form: Discriminative (logistic regression) loss function: Conditional Data Likelihood 5 Maximizing Conditional Log Likelihood Good news: l(w) is concave function of w, no local optima problems Bad news: no closed-form solution to maximize l(w) Good news: concave functions easy to optimize 6 3

4 Optimizing concave function Gradient ascent Conditional likelihood for Logistic Regression is concave. Find optimum with gradient ascent Gradient: Step size, η>0 Update rule: Gradient ascent is simplest of optimization approaches e.g., Conjugate gradient ascent can be much better 7 Maximize Conditional Log Likelihood: Gradient ascent 8 4

5 Gradient Ascent for LR Gradient ascent algorithm: iterate until change < ε (t) For i=1,,k, (t) repeat 9 Regularization in linear regression Overfitting usually leads to very large parameter choices, e.g.: X 0.30 X ,700,910.7 X 8,585,638.4 X 2 + Regularized least-squares (a.k.a. ridge regression), for λ>0: 10 5

6 Linear Separability 11 Large parameters Overfitting If data is linearly separable, weights go to infinity In general, leads to overfitting: Penalizing high weights can prevent overfitting 12 6

7 Regularized Conditional Log Likelihood Add regularization penalty, e.g., L 2 : `(w) =ln NY P (y j x j, w) j=1 2 w 2 2 Practical note about w 0 : Gradient of regularized likelihood: 13 Standard v. Regularized Updates Maximum conditional likelihood estimate NY w = arg max ln P (y j x j, w) w j=1 (t) Regularized maximum conditional likelihood estimate NY kx w = arg max ln P (y j x j, w) wi 2 w 2 j=1 i=1 (t) 14 7

8 Please Stop!! Stopping criterion `(w) =ln Y j P (y j x j, w)) w 2 2 When do we stop doing gradient descent? Because l(w) is strongly concave: i.e., because of some technical condition `(w ) `(w) apple 1 2 r`(w) 2 2 Thus, stop when: 15 Digression: Logistic regression for more than 2 classes Logistic regression in more general case (C classes), where Y in {1,,C} 16 8

9 Digression: Logistic regression more generally Logistic regression in more general case, where Y in {1,,C} for c<c P (Y = c x, w) = exp(w c0 + P k i=1 w cix i ) 1+ P C 1 c 0 =1 exp(w c P k i=1 w c 0 ix i ) for c=c (normalization, so no weights for this class) 1 P (Y = C x, w) = 1+ P C 1 c 0 =1 exp(w c P k i=1 w c 0 ix i ) Learning procedure is basically the same as what we derived! 17 Stochastic Gradient Descent Machine Learning CSE446 Carlos Guestrin University of Washington April 17,

10 The Cost, The Cost!!! Think about the cost What s the cost of a gradient update step for LR??? (t) 19 Learning Problems as Expectations Minimizing loss in training data: Given dataset: Sampled iid from some distribution p(x) on features: Loss function, e.g., hinge loss, logistic loss, We often minimize loss in training data: `D(w) = 1 N NX `(w, x j ) j=1 However, we should really minimize expected loss on all data: Z `(w) =E x [`(w, x)] = p(x)`(w, x)dx So, we are approximating the integral by the average on the training data 20 10

11 Gradient ascent in Terms of Expectations True objective function: `(w) =E x [`(w, x)] = Z p(x)`(w, x)dx Taking the gradient: True gradient ascent rule: How do we estimate expected gradient? 21 SGD: Stochastic Gradient Ascent (or Descent) True gradient: r`(w) =E x [r`(w, x)] Sample based approximation: What if we estimate gradient with just one sample??? Unbiased estimate of gradient Very noisy! Called stochastic gradient ascent (or descent) Among many other names VERY useful in practice!!! 22 11

12 Stochastic Gradient Ascent for Logistic Regression Logistic loss as a stochastic function: E x [`(w, x)] = E x ln P (y x, w) w 2 2 Batch gradient ascent updates: w (t+1) i w (t) i 8 < + : w(t) i + 1 N Stochastic gradient ascent updates: Online setting: n w (t+1) i w (t) i + t w (t) i NX j=1 9 = i [y (j) P (Y =1 x (j), w (t) )] ; x (j) o + x (t) i [y (t) P (Y =1 x (t), w (t) )] 23 Stochastic Gradient Ascent: general case Given a stochastic function of parameters: Want to find maximum Start from w (0) Repeat until convergence: Get a sample data point x t Update parameters: Works on the online learning setting! Complexity of each gradient step is constant in number of examples! In general, step size changes with iterations 24 12

13 What you should know Classification: predict discrete classes rather than real values Logistic regression model: Linear model Logistic function maps real values to [0,1] Optimize conditional likelihood Gradient computation Overfitting Regularization Regularized optimization Cost of gradient step is high, use stochastic gradient descent 25 13

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features