Regularization, Ridge Regression
Machine Learning CSEP546 — Carlos Guestrin, University of Washington — January 13

The regression problem
- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space: given basis functions h_1, ..., h_k, find coefficients w = {w_1, ..., w_k}
- Why is this called linear regression? Because the model is linear in the parameters w.
- Precisely, minimize the residual squared error:
  $\hat{w} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \sum_{i=1}^k w_i h_i(x_j) \Big)^2$
The regression problem in matrix notation
- Stack the data as $t \approx Hw$: H is the N × k design matrix (N data points, k basis functions), w is the k × 1 weight vector, and t is the N × 1 vector of observations.
- Regression solution = simple matrix operations:
  $\hat{w} = (H^T H)^{-1} H^T t$
  where $H^T H$ is a k × k matrix (for k basis functions) and $H^T t$ is a k × 1 vector.
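To make the matrix solution concrete, here is a minimal Python sketch (not from the slides); the 1-D dataset and the polynomial basis are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)                    # N = 50 data points
t = np.sin(3 * x) + 0.1 * rng.standard_normal(50)  # noisy targets t(x)

# Design matrix H: N rows, one column per basis function h_i(x) = x^i
K = 5
H = np.vander(x, K + 1, increasing=True)           # columns 1, x, x^2, ..., x^K

# w_hat = (H^T H)^{-1} H^T t  -- solved via lstsq for numerical stability
w_hat, *_ = np.linalg.lstsq(H, t, rcond=None)
print(w_hat)
```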
Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias.
- More complex class → less bias.
- More complex class → more variance.

[Figure: test set error as a function of model complexity]
Overfitting
- Overfitting: a learning algorithm overfits the training data if it outputs a solution w when there exists another solution w' such that:
  $error_{train}(w) < error_{train}(w')$ and $error_{true}(w') < error_{true}(w)$

Regularization in Linear Regression
- Overfitting usually leads to very large parameter choices, e.g. coefficients that blow up from values like 0.30 X to values like …,700,910.7 X − 8,585,638.4 X² + …
- Regularized or penalized regression aims to impose a complexity penalty by penalizing large weights — a "shrinkage" method.
Quadratic Penalty (regularization)
- What we thought we wanted to minimize: the residual squared error above.
- But the weights got too big — so penalize large weights.

Ridge Regression
- Ameliorates issues with overfitting. New objective:
  $\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
Ridge Regression in Matrix Notation
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
In matrix form, H is the N × (K+1) design matrix with columns h_0, ..., h_k (N data points, K+1 basis functions), w the weights, t the observations, and $I_{0+k}$ the (k+1) × (k+1) identity with a 0 in the first entry, so the intercept w_0 is not penalized.

Minimizing the Ridge Regression Objective
The first term is the MLE (least-squares) objective, the second the quadratic penalty:
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2 = (Hw - t)^T (Hw - t) + \lambda\, w^T I_{0+k}\, w$
Shrinkage Properties
$\hat{w}_{ridge} = (H^T H + \lambda I_{0+k})^{-1} H^T t$
- If orthonormal features/basis, $H^T H = I$, so each ridge weight is the least-squares weight shrunk by a constant factor: $\hat{w}_{ridge} = \hat{w}_{LS} / (1 + \lambda)$.

Ridge Regression: Effect of Regularization
$\hat{w}_{ridge} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k w_i^2$
- The solution is indexed by the regularization parameter λ: larger λ means more shrinkage, smaller λ less.
- As λ → 0, the solution approaches the unregularized least-squares fit; as λ → ∞, the penalized weights go to 0.
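A minimal sketch of the closed-form solution above, reusing the hypothetical H and t from the earlier sketch; $I_{0+k}$ is modeled as an identity with a zeroed first entry, assuming the first column of H is the constant basis h_0:

```python
import numpy as np

def ridge_fit(H, t, lam):
    d = H.shape[1]
    I0k = np.eye(d)
    I0k[0, 0] = 0.0                       # do not penalize the intercept w_0
    return np.linalg.solve(H.T @ H + lam * I0k, H.T @ t)

# lam -> 0 recovers least squares; large lam shrinks the weights toward 0.
```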
Ridge Coefficient Path
[Figure from the Kevin Murphy textbook]
Typical approach: select λ using cross-validation — more on this later in the quarter.

[Figure: error as a function of the regularization parameter for a fixed model complexity, from large λ at one end to λ = 0 at the other]
What you need to know
- Regularization penalizes complex models.
- Ridge regression: L2-penalized least-squares regression.
- The regularization parameter trades off model complexity with training error.

Cross-Validation — Machine Learning CSEP546, Carlos Guestrin, University of Washington, January 13
[Figure: test set error as a function of model complexity]

How… how… how???????
- How do we pick the regularization constant λ — and all the other constants in ML? One thing ML doesn't lack is constants to tune. :(
- We could use the test data, but…
(LOO) Leave-one-out cross validation
- Consider a validation set with 1 example: D = training data; D\j = training data with the j-th data point moved to the validation set.
- Learn classifier h_{D\j} on the dataset D\j.
- Estimate the true error as the squared error on predicting t(x_j): $\big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$ — an unbiased estimate of $error_{true}(h_{D \setminus j})$!
- Seems like a really bad estimator — but wait!
- LOO cross validation: average over all data points j. For each data point you leave out, learn a new classifier h_{D\j}, and estimate the error as:
  $error_{LOO} = \frac{1}{N} \sum_{j=1}^N \big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$

LOO cross validation is an (almost) unbiased estimate of the true error of h_D!
- When computing the LOOCV error, we only use N−1 data points, so it's not an estimate of the true error of learning with N data points.
- It is usually pessimistic: learning with less data typically gives a worse answer.
- LOO is almost unbiased! Great news — use the LOO error for model selection, e.g. picking λ.
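A minimal sketch of the LOO error for picking λ, building on the hypothetical ridge_fit, H, and t from the earlier sketches (the λ grid is also an assumption):

```python
import numpy as np

def loo_error(H, t, lam):
    # Average squared error over the N leave-one-out splits
    N = H.shape[0]
    err = 0.0
    for j in range(N):
        mask = np.arange(N) != j              # D\j: drop the j-th point
        w = ridge_fit(H[mask], t[mask], lam)
        err += (t[j] - H[j] @ w) ** 2         # squared error on held-out point
    return err / N

lams = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: loo_error(H, t, lam))
```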
Using LOO to Pick λ
$error_{LOO} = \frac{1}{N} \sum_{j=1}^N \big( t(x_j) - h_{D \setminus j}(x_j) \big)^2$
[Figure: LOO error as a function of λ, from large λ at one end to λ = 0 at the other]

Using LOO error for model selection
[Figure: LOO error plotted across candidate models]
Computational cost of LOO
- Suppose you have 100,000 data points, and you implemented a great version of your learning algorithm that learns in only 1 second.
- Computing LOO will take about 1 day!!! And if you have to do it for each choice of basis functions, it will take foooooreeeever!!!
- Solution 1 (preferred, but not usually possible): find a cool trick to compute LOO (e.g., see homework).

Solution 2 to the complexity of computing LOO (more typical): use k-fold cross validation
- Randomly divide the training data into k equal parts D_1, ..., D_k.
- For each i: learn the classifier h_{D\D_i} using the data points not in D_i, and estimate its error on the validation set D_i:
  $error_{D_i} = \frac{k}{N} \sum_{x_j \in D_i} \big( t(x_j) - h_{D \setminus D_i}(x_j) \big)^2$
- The k-fold cross-validation error is the average over the data splits.
- Properties: much faster to compute than LOO; more (pessimistically) biased, since each model uses much less data — only N(k−1)/k points. Usually, k = 10. :)
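A minimal k-fold sketch under the same assumptions as the earlier sketches; each fold's error is weighted by k/N as on the slide, and the final score averages the k splits:

```python
import numpy as np

def kfold_error(H, t, lam, k=10, seed=0):
    N = H.shape[0]
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, k)                 # D_1, ..., D_k
    total = 0.0
    for D_i in folds:
        train = np.setdiff1d(idx, D_i)             # points not in D_i
        w = ridge_fit(H[train], t[train], lam)
        total += (k / N) * np.sum((t[D_i] - H[D_i] @ w) ** 2)
    return total / k                               # average over the k splits
```

Each model sees only N(k−1)/k points, which is why the estimate is (pessimistically) biased.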
ML Pipeline
[Figure: pipeline diagram, starting from the data]

What you need to know
- Never ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever train on the test data.
- Use cross-validation to choose magic parameters such as λ.
- Leave-one-out is the best you can do, but it is sometimes too slow — in that case, use k-fold cross-validation.
Variable Selection; LASSO: Sparse Regression — Machine Learning CSEP546, Carlos Guestrin, University of Washington, January 13

Sparsity
- The vector w is sparse if many of its entries are zero.
- Very useful for many tasks, e.g.:
  - Efficiency: if size(w) = 100B, each prediction is expensive — if part of an online system, too slow. If w is sparse, the prediction computation depends only on the number of non-zeros (see the sketch below).
  - Interpretability: which dimensions are relevant for making a prediction? E.g., which parts of the brain are associated with particular words?
- But it is computationally intractable to perform all-subsets regression.

[Figure from Tom Mitchell: fMRI signatures for the words "Eat", "Push", "Run" — participant P1 and the mean of independently learned signatures over all nine participants; panels: pars opercularis (z=24mm), postcentral gyrus (z=30mm), superior temporal sulcus (posterior) (z=12mm)]
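A toy sketch of the efficiency point (not from the slides): storing w as (index, value) pairs makes the prediction cost proportional to the number of non-zeros, not to size(w):

```python
def sparse_predict(w_nonzeros, x):
    # w_nonzeros: list of (i, w_i) pairs for the non-zero entries of w
    return sum(w_i * x[i] for i, w_i in w_nonzeros)

# 3 non-zero weights -> 3 multiply-adds, regardless of the full dimension
w_nonzeros = [(7, 0.5), (42, -1.2), (99, 2.0)]
print(sparse_predict(w_nonzeros, x=list(range(100))))
```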
Simple greedy model selection algorithm
- Pick a dictionary of features, e.g. polynomials for linear regression.
- Greedy heuristic (sketched in code below):
  - Start from an empty (or simple) set of features: F_0 = ∅.
  - Run the learning algorithm for the current set of features F_t to obtain h_t.
  - Select the next best feature X_i*, e.g. the X_j that results in the lowest training error when learning with F_t + {X_j}.
  - F_{t+1} ← F_t + {X_i*}.
  - Recurse.

Greedy model selection
- Applicable in many settings:
  - Linear regression: selecting basis functions.
  - Naïve Bayes: selecting (independent) features P(X_i | Y).
  - Logistic regression: selecting features (basis functions).
  - Decision trees: selecting leaves to expand.
- Only a heuristic! But sometimes you can prove something cool about it, e.g. [Krause & Guestrin 05]: near-optimal in some settings that include Naïve Bayes.
- There are many more elaborate methods out there.
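A minimal sketch of the greedy heuristic for linear regression (illustrative names; H_all is a hypothetical matrix whose columns are the dictionary of candidate basis functions):

```python
import numpy as np

def training_error(H, t):
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return np.sum((t - H @ w) ** 2)

def greedy_select(H_all, t, n_features):
    F = []                                    # F_0 = empty feature set
    candidates = list(range(H_all.shape[1]))
    for _ in range(n_features):
        # X_i* = the candidate giving the lowest training error with F_t + {X_j}
        best = min(candidates,
                   key=lambda j: training_error(H_all[:, F + [j]], t))
        F.append(best)                        # F_{t+1} <- F_t + {X_i*}
        candidates.remove(best)
    return F
```

The stopping rule (n_features here) is exactly the open question the next slide raises.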
When do we stop???
- Greedy heuristic: select the next best feature X_i* (e.g. the X_j with the lowest training error when learning with F_t + {X_j}); F_{t+1} ← F_t + {X_i*}; recurse.
- When do you stop??? When the training error is low enough? When the test set error is low enough?

Regularization in Linear Regression (recap)
- Overfitting usually leads to very large parameter choices, e.g. coefficients that blow up from values like 0.30 X to values like …,700,910.7 X − 8,585,638.4 X² + …
- Regularized or penalized regression imposes a complexity penalty by penalizing large weights — a shrinkage method.
Variable Selection by Regularization
- Ridge regression penalizes large weights. But what if we want to perform feature selection? E.g., which regions of the brain are important for word prediction?
- We can't simply choose the features with the largest coefficients in the ridge solution.
- Try a new penalty: penalize non-zero weights.
  - The regularization penalty $\lambda \sum_{i=1}^k |w_i|$ leads to sparse solutions.
  - Just like ridge regression, the solution is indexed by a continuous parameter λ.
- This simple approach has changed statistics, machine learning & electrical engineering.

LASSO Regression
- LASSO: least absolute shrinkage and selection operator. New objective:
  $\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Geometric Intuition for Sparsity
[Figure from Rob Tibshirani's slides: constraint regions for ridge regression (L2 ball) vs. lasso (L1 diamond) in the (β_1, β_2) / (w_1, w_2) plane, with the MLE ŵ and the squared-error contours around it]

Optimizing the LASSO Objective
LASSO solution:
$\hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Coordinate Descent
- Given a function F, we want to find its minimum.
- Often it is hard to minimize over all coordinates at once, but easy over one coordinate at a time.
- Coordinate descent: repeatedly minimize F over a single coordinate, holding the others fixed. How do we pick the next coordinate? At random, or sequentially.
- A super useful approach for *many* problems; converges to the optimum in some cases, such as LASSO.

How do we find the minimum over each coordinate?
- Key step in coordinate descent: find the minimum over each coordinate. [Illustration from Wikipedia]
- Standard approach: set the (sub)derivative with respect to that coordinate to zero and solve.
Optimizing the LASSO Objective One Coordinate at a Time
$\sum_{j=1}^N \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)^2 + \lambda \sum_{i=1}^k |w_i|$
Taking the derivative of the residual sum of squares with respect to $w_\ell$:
$\frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \Big)$
The penalty term $\lambda |w_\ell|$ is not differentiable at 0, which is what produces exact zeros.

Coordinate Descent for LASSO (aka Shooting Algorithm)
- Repeat until convergence:
  - Pick a coordinate ℓ (at random or sequentially).
  - Set:
    $\hat{w}_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}$
  - Where:
    $a_\ell = 2 \sum_{j=1}^N h_\ell(x_j)^2$
    $c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \Big)$
- For convergence rates, see Shalev-Shwartz and Tewari 2009.
- Another common technique: LARS — least angle regression and shrinkage, Efron et al.
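A minimal sketch of the shooting algorithm using the update above; for brevity the intercept w_0 is folded into H as an all-ones column and penalized with the rest, a simplification relative to the slide:

```python
import numpy as np

def soft_update(c, a, lam):
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0                                # c in [-lam, lam] -> exact zero

def lasso_shooting(H, t, lam, w0=None, n_sweeps=100):
    N, k = H.shape
    w = np.zeros(k) if w0 is None else w0.copy()
    a = 2 * np.sum(H ** 2, axis=0)            # a_l = 2 * sum_j h_l(x_j)^2
    for _ in range(n_sweeps):                 # "repeat until convergence"
        for l in range(k):                    # sweep coordinates sequentially
            r = t - H @ w + H[:, l] * w[l]    # residual with coordinate l removed
            c = 2 * (H[:, l] @ r)             # c_l from the slide
            w[l] = soft_update(c, a[l], lam)
    return w
```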
Soft Thresholding
$\hat{w}_\ell = \begin{cases} (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\ 0 & c_\ell \in [-\lambda, \lambda] \\ (c_\ell - \lambda)/a_\ell & c_\ell > \lambda \end{cases}$
[Figure from the Kevin Murphy textbook: $\hat{w}_\ell$ as a function of $c_\ell$ — the soft-thresholding operator]

Recall: Ridge Coefficient Path
[Figure from the Kevin Murphy textbook]
Typical approach: select λ using cross-validation.
Now: LASSO Coefficient Path
[Figure from the Kevin Murphy textbook]

LASSO Example
Coefficient estimates on the prostate data — terms: Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45 — under least squares, ridge, and lasso. The numeric values did not survive transcription; the point of the example is that lasso sets several coefficients exactly to zero while least squares and ridge keep them all non-zero. From Rob Tibshirani's slides.
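Under the same assumptions as the earlier sketches, a coefficient path like the figures above can be traced by sweeping λ from strong to weak regularization, warm-starting each solve from the previous solution (the λ grid is an assumption):

```python
import numpy as np

lams = np.logspace(2, -3, 30)             # strong -> weak regularization
path, w = [], None
for lam in lams:
    w = lasso_shooting(H, t, lam, w0=w)   # warm start from previous solution
    path.append(w.copy())
# Each row of `path` is w(lambda); coefficients enter the model as lambda shrinks.
```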
Debiasing
[Figure from the Kevin Murphy textbook]

What you need to know
- Variable selection: find a sparse solution to the learning problem.
- L1 regularization is one way to do variable selection; it applies beyond regression. Hundreds of other approaches are out there.
- The LASSO objective is non-differentiable but convex → use the subgradient.
- There is no closed-form solution for the minimization → use coordinate descent.
- The shooting algorithm is a very simple approach for solving LASSO.