Ridge Regression. Patrick Breheny. September 1. Outline: Ridge regression; Selection of λ; Ridge regression in R/SAS.


Ridge regression: Definition. As mentioned in the previous lecture, ridge regression penalizes the size of the regression coefficients. Specifically, the ridge regression estimate $\hat\beta$ is defined as the value of $\beta$ that minimizes $\sum_i (y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2$.

Ridge regression: Solution. Theorem: The solution to the ridge regression problem is given by $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$. Note the similarity to the ordinary least squares solution, but with the addition of a "ridge" down the diagonal. Corollary: As $\lambda \to 0$, $\hat\beta_{\mathrm{ridge}} \to \hat\beta_{\mathrm{OLS}}$. Corollary: As $\lambda \to \infty$, $\hat\beta_{\mathrm{ridge}} \to 0$.
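
A minimal R sketch of this closed-form solution, using simulated data (the variable names here are illustrative, not from the lecture):

set.seed(1)
n <- 50; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -1, 0.5) + rnorm(n)
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X'X + lambda*I)^{-1} X'y
beta_ols   <- solve(crossprod(X), crossprod(X, y))                     # lambda = 0 recovers OLS
cbind(beta_ols, beta_ridge)                                            # compare the two fits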

Ridge regression: Solution (cont'd). Corollary: In the special case of an orthonormal design matrix, $\hat\beta_j^{\mathrm{ridge}} = \hat\beta_j^{\mathrm{OLS}}/(1 + \lambda)$. This illustrates the essential feature of ridge regression: shrinkage. Applying the ridge regression penalty has the effect of shrinking the estimates toward zero, introducing bias but reducing the variance of the estimate.
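
A quick numerical check of this corollary (a sketch on simulated data; an orthonormal design is constructed here via the QR decomposition):

set.seed(2)
n <- 100; p <- 3; lambda <- 4
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))   # orthonormal columns, so X'X = I
y <- X %*% c(2, -1, 0.5) + rnorm(n)
beta_ols   <- solve(crossprod(X), crossprod(X, y))
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
all.equal(as.numeric(beta_ridge), as.numeric(beta_ols) / (1 + lambda))  # TRUE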

Ridge vs. OLS in the presence of collinearity. The benefits of ridge regression are most striking in the presence of multicollinearity, as illustrated in the following example:

> x1 <- rnorm(20)
> x2 <- rnorm(20, mean=x1, sd=.01)
> y <- rnorm(20, mean=3+x1+x2)
> lm(y~x1+x2)$coef
(Intercept)        x1         x2
   2.582064 39.971344 -38.040040
> library(MASS)
> lm.ridge(y~x1+x2, lambda=1)
                  x1        x2
2.6214998  0.9906773 0.8973912

Invertibility. Recall from BST 760 that the ordinary least squares estimates do not always exist: if X is not full rank, $X^TX$ is not invertible and there is no unique solution for $\hat\beta_{\mathrm{OLS}}$. This problem does not occur with ridge regression, however. Theorem: For any design matrix X, the quantity $X^TX + \lambda I$ is always invertible for $\lambda > 0$; thus, there is always a unique solution $\hat\beta_{\mathrm{ridge}}$.

Bias and variance. Theorem: The variance of the ridge regression estimate is $\mathrm{Var}(\hat\beta) = \sigma^2 W X^T X W$, where $W = (X^TX + \lambda I)^{-1}$. Theorem: The bias of the ridge regression estimate is $\mathrm{Bias}(\hat\beta) = -\lambda W \beta$. It can be shown that the total variance, $\sum_j \mathrm{Var}(\hat\beta_j)$, is monotone decreasing in $\lambda$, while the total squared bias, $\sum_j \mathrm{Bias}^2(\hat\beta_j)$, is monotone increasing in $\lambda$.
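
A short R sketch illustrating these two formulas (simulated design; the true beta and sigma are treated as known so the exact bias and variance can be evaluated):

set.seed(3)
n <- 50; p <- 4; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5, 0)
total_var <- function(lambda) {
  W <- solve(crossprod(X) + lambda * diag(p))
  sum(diag(sigma^2 * W %*% crossprod(X) %*% W))   # sum_j Var(beta_hat_j)
}
total_bias2 <- function(lambda) {
  W <- solve(crossprod(X) + lambda * diag(p))
  sum((-lambda * W %*% beta)^2)                   # sum_j Bias^2(beta_hat_j)
}
lams <- c(0, 1, 5, 20, 100)
sapply(lams, total_var)     # decreasing in lambda
sapply(lams, total_bias2)   # increasing in lambda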

Existence theorem. Existence Theorem: There always exists a $\lambda$ such that the MSE of $\hat\beta_\lambda^{\mathrm{ridge}}$ is less than the MSE of $\hat\beta_{\mathrm{OLS}}$. This is a rather surprising result with somewhat radical implications: even if the model we fit is exactly correct and follows the exact distribution we specify, we can always obtain a better estimator by shrinking towards zero.

Bayesian interpretation. As mentioned in the previous lecture, penalized regression can be interpreted in a Bayesian context. Theorem: Suppose $\beta \sim N(0, \tau^2 I)$. Then the posterior mean of $\beta$ given the data is $\left(X^TX + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^T y$.
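
A one-step sketch of where this comes from (assuming the usual sampling model $y \mid \beta \sim N(X\beta, \sigma^2 I)$): the log posterior is, up to additive constants,
$$-\frac{1}{2\sigma^2}\|y - X\beta\|^2 - \frac{1}{2\tau^2}\|\beta\|^2,$$
which is a scaled version of the ridge objective with $\lambda = \sigma^2/\tau^2$. Since the posterior is Gaussian, its mode and mean coincide, giving $\left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1}X^Ty$.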

Degrees of freedom. Information criteria are a common way of choosing among models while balancing the competing goals of fit and parsimony. In order to apply AIC or BIC to the problem of choosing $\lambda$, we will need an estimate of the degrees of freedom. Recall that in linear regression, $\hat y = Hy$, where $H$ is the projection ("hat") matrix, and $\mathrm{tr}(H) = p$, the degrees of freedom.

Degrees of freedom (cont'd). Ridge regression is also a linear estimator ($\hat y = Hy$), with $H_{\mathrm{ridge}} = X(X^TX + \lambda I)^{-1}X^T$. Analogously, one may define its degrees of freedom to be $\mathrm{tr}(H_{\mathrm{ridge}})$. Furthermore, one can show that $\mathrm{df}_{\mathrm{ridge}} = \sum_i \frac{\lambda_i}{\lambda_i + \lambda}$, where $\{\lambda_i\}$ are the eigenvalues of $X^TX$. If you don't know what eigenvalues are, don't worry about it; the main point is that df is a decreasing function of $\lambda$, with df = p at $\lambda = 0$ and df = 0 at $\lambda = \infty$.
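
A small R check that the trace and eigenvalue expressions agree (a sketch on simulated data):

set.seed(4)
n <- 50; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
H <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
sum(diag(H))                                    # df via tr(H_ridge)
ev <- eigen(crossprod(X), symmetric = TRUE)$values
sum(ev / (ev + lambda))                         # same value via the eigenvalue formula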

AIC and BIC. Now that we have a way to quantify the degrees of freedom in a ridge regression model, we can calculate AIC or BIC and use them to guide the choice of $\lambda$: $\mathrm{AIC} = n\log(\mathrm{RSS}) + 2\,\mathrm{df}$ and $\mathrm{BIC} = n\log(\mathrm{RSS}) + \mathrm{df}\log(n)$.
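
A sketch of how these criteria could be computed over a grid of $\lambda$ values and minimized (simulated data; the formulas are the ones above, which omit constants that do not affect the choice of $\lambda$):

set.seed(5)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5, 0) + rnorm(n)
lams <- seq(0, 50, by = 0.5)
crit <- sapply(lams, function(lambda) {
  H   <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
  rss <- sum((y - H %*% y)^2)
  df  <- sum(diag(H))
  c(AIC = n * log(rss) + 2 * df, BIC = n * log(rss) + df * log(n))
})
lams[which.min(crit["AIC", ])]   # lambda minimizing AIC
lams[which.min(crit["BIC", ])]   # lambda minimizing BIC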

Introduction. An alternative way of choosing $\lambda$ is to see how well predictions based on $\hat\beta_\lambda$ do at predicting actual instances of $Y$. Now, it would not be fair to use the data twice (once to fit the model and then again to estimate the prediction accuracy), as this would reward overfitting. Ideally, we would have an external data set for validation, but data is expensive to come by, so this is rarely practical.

Cross-validation. One idea is to split the data set into two fractions, then use one portion to fit $\hat\beta$ and the other to evaluate how well $X\hat\beta$ predicts the observations in the second portion. The problem with this solution is that we rarely have so much data that we can freely part with half of it solely for the purpose of choosing $\lambda$. To finesse this problem, cross-validation splits the data into K folds, fits the model on K-1 of the folds, and evaluates risk on the fold that was left out.

Cross-validation figure. This process is repeated for each of the folds, and the risk is averaged across all of these results. [Figure: schematic of a data set split into five folds, numbered 1-5.] Common choices for K are 5, 10, and n (the latter also known as leave-one-out cross-validation).
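
A minimal K-fold cross-validation sketch for ridge regression (simulated data; within each fold the closed-form ridge solution is used, and the prediction error is averaged over folds):

set.seed(6)
n <- 100; p <- 4; K <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5, 0) + rnorm(n)
lams <- seq(0, 50, by = 0.5)
fold <- sample(rep(1:K, length.out = n))      # random fold assignments
cv_err <- sapply(lams, function(lambda) {
  mean(sapply(1:K, function(k) {
    train <- fold != k
    b <- solve(crossprod(X[train, ]) + lambda * diag(p),
               crossprod(X[train, ], y[train]))
    mean((y[!train] - X[!train, , drop = FALSE] %*% b)^2)   # risk on the held-out fold
  }))
})
lams[which.min(cv_err)]   # cross-validation choice of lambda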

Generalized cross-validation. You may recall from BST 760 that we do not actually have to refit the model to obtain the leave-one-out ("deleted") residuals: $y_i - \hat y_{i(-i)} = \frac{r_i}{1 - H_{ii}}$. Actually calculating $H$ turns out to be computationally inefficient for a number of reasons, so the following simplification (called generalized cross-validation) is often used instead: $\mathrm{GCV} = \frac{1}{n}\sum_i \left(\frac{y_i - \hat y_i}{1 - \mathrm{tr}(H)/n}\right)^2$.
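
A direct computation of the GCV criterion as written above (a sketch on simulated data; note that MASS::lm.ridge also reports a GCV score in fit$GCV, which may be scaled differently but serves the same purpose of ranking values of $\lambda$):

set.seed(7)
n <- 100; p <- 4; lambda <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5, 0) + rnorm(n)
H    <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
yhat <- H %*% y
gcv  <- mean(((y - yhat) / (1 - sum(diag(H)) / n))^2)   # GCV at this lambda
gcv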

Prostate cancer study. As an example, consider the data from a 1989 study examining the relationship between prostate-specific antigen (PSA) and a number of clinical measures in a sample of 97 men who were about to receive a radical prostatectomy. PSA is typically elevated in patients with prostate cancer, and serves as a biomarker for the early detection of the cancer. The explanatory variables: lcavol (log cancer volume); lweight (log prostate weight); age; lbph (log benign prostatic hyperplasia); svi (seminal vesicle invasion); lcp (log capsular penetration); gleason (Gleason score); pgg45 (% Gleason score 4 or 5).

SAS/R syntax. To fit a ridge regression model in SAS, we can use PROC REG:

PROC REG DATA=prostate ridge=0 to 50 by 0.1 OUTEST=fit;
  MODEL lpsa = pgg45 gleason lcp svi lbph age lweight lcavol;
RUN;

In R, we can use lm.ridge in the MASS package:

library(MASS)
fit <- lm.ridge(lpsa ~ ., prostate, lambda=seq(0, 50, by=0.1))

R (unlike SAS, unfortunately) also provides the GCV criterion for each λ:

fit$GCV
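
A sketch of how one might then pick $\lambda$ and inspect the coefficient paths from this fit (this assumes the prostate data frame from the study above is loaded; select() and plot() have methods for lm.ridge fits in MASS):

library(MASS)
fit <- lm.ridge(lpsa ~ ., prostate, lambda = seq(0, 50, by = 0.1))
select(fit)                                 # reports, among other criteria, the lambda minimizing GCV
lam_gcv <- fit$lambda[which.min(fit$GCV)]   # extract the GCV-optimal lambda directly
plot(fit)                                   # coefficient paths as a function of lambda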

Results. [Figure: ridge coefficient paths for the prostate data (svi, lcavol, lweight, lbph, gleason, pgg45, age, lcp), plotted against degrees of freedom, with the corresponding λ values shown on a secondary axis; vertical lines mark the models selected by AIC (red), GCV (black), and BIC (green).]

Additional plots. [Figure: left panel, degrees of freedom as a function of λ (log scale); right panel, RSS as a function of degrees of freedom.]

Model selection criteria. [Figure: AIC (red), GCV (black), and BIC (green), each minus its minimum, plotted against degrees of freedom.]

Ridge vs. OLS.

             Estimate           Std. Error          z-score
             OLS     Ridge      OLS     Ridge       OLS     Ridge
lcavol     0.587     0.519    0.088     0.075      6.68      6.96
lweight    0.454     0.444    0.170     0.153      2.67      2.89
age       -0.020    -0.016    0.011     0.010     -1.76     -1.54
lbph       0.107     0.096    0.058     0.053      1.83      1.83
svi        0.766     0.698    0.244     0.209      3.14      3.33
lcp       -0.105    -0.044    0.091     0.072     -1.16     -0.61
gleason    0.045     0.060    0.157     0.128      0.29      0.47
pgg45      0.005     0.004    0.004     0.003      1.02      1.02