Model selection in R featuring the lasso. Chris Franck, LISA Short Course, March 26, 2013


Goals: Overview of LISA. Classic data example: the prostate data (Stamey et al. 1989). Brief review of regression and model selection. Description of the lasso. Discussion/comparison of approaches.

Laboratory for Interdisciplinary Statistical Analysis. LISA helps VT researchers benefit from the use of statistics. Collaboration: visit our website to request personalized statistical advice and assistance with experimental design, data analysis, interpreting results, grant proposals, and software (R, SAS, JMP, SPSS...). LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: meet with LISA before collecting your data. LISA also offers: Educational Short Courses, designed to help graduate students apply statistics in their research. Walk-In Consulting: M-F 1-3 PM, GLC Video Conference Room, for questions requiring <30 mins; also 3-5 PM Port (Library/Torg Bridge) and 9-11 AM ICTAS Café X. All services are FREE for VT researchers. We assist with research, not class projects or homework. www.lisa.stat.vt.edu

The goal is to demonstrate the lasso technique using real-world data. Lasso stands for least absolute shrinkage and selection operator. It is a continuous subset selection algorithm that can shrink the effects of unimportant predictors, and can set effects exactly to zero. It requires more technical work to implement than other common methods. Note: the analysis closely follows Tibshirani (1996) and Friedman, Hastie, and Tibshirani (2009).

In addition to the lasso, the following statistical concepts will be discussed: exploratory data analysis and graphing; ordinary least squares regression; cross validation; and model selection, including forward, backward, and stepwise selection and information criteria (e.g., AIC, BIC).

The prostate data were originally described in Stamey et al. (1989). The study involved 97 men who were about to undergo radical prostatectomy. Research goal: measure the association between prostate specific antigen and 8 other clinical measures.

The clinical measures are:

Index  Variable  Label
1      lcavol    log(cancer volume)
2      lweight   log(prostate weight)
3      age       age
4      lbph      log(benign prostatic hyperplasia)
5      svi       seminal vesicle invasion
6      lcp       log(capsular penetration)
7      gleason   Gleason score
8      pgg45     percent Gleason scores 4 or 5
y      lpsa      log(prostate specific antigen)

Regression brief review. Simple case: we wish to use a single predictor variable x to predict some outcome y using ordinary least squares (OLS), e.g. x = lcavol, y = lpsa. Quiz question 1: What do you see in the plot?

Here is the same plot with the regression line included

What important property does the regression line have? Question 2: Why not use these lines?

The simple linear regression model is: $y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i$. Which values are known/unknown? Which are data, which are parameters? Which term is the slope? The intercept? Common assumption about error structure (Question 3: fill in the blanks): $\varepsilon_i \sim (\;,\;)$. Question 4: What is the difference between $\beta_1$ and $\hat{\beta}_1$?

Frequently there are many predictors that we want to use simultaneously. The multiple linear regression model is: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$. In this situation each $\beta_j$ represents the partial slope of predictor $j = 1, \ldots, p$. Question 5: Interpretation? In our case we have 8 candidate predictors (see the clinical measures table above). Which set should we use to model the response? A minimal multiple-regression sketch in R appears below.
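As a concrete illustration, here is a minimal R sketch of the multiple regression with all 8 candidate predictors. It assumes the prostate data have already been read into a data frame named prostate (as in code section 1 below); that name is an assumption for illustration.

```r
# OLS fit with all 8 candidate predictors; 'prostate' is assumed to be a
# data frame containing the variables from the clinical measures table.
fit.all <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45,
              data = prostate)
summary(fit.all)  # partial slopes, standard errors, and p-values
```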

Cross validation is used to determine whether a model has good predictive ability for a new data set Parameter estimates β 1are chosen on the basis of available data. We expect a good model to perform well on data used to fit (or train ) the model. Could your model perform well on new data (e.g. patients)? If, not, model may be overfit. Cross validation: hold out a portion of the data (called validation set), fit model to the rest of the data (training set), determine if model based on training set performs well in validation set. Metric to assess prediction error: Mean Square Error MMM = 1 n y 2 n i=1 i y i, y i is predicted value of y i based on model.
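To make the idea concrete, here is a hedged R sketch of a single training/validation split. The 2/3-1/3 split and the use of lcavol as the lone predictor are illustrative assumptions; 'prostate' is the data frame loaded in code section 1 below.

```r
# Single hold-out validation: train on 2/3 of the data, compute MSE on
# the remaining 1/3. The split proportion is an illustrative choice.
set.seed(1)
n     <- nrow(prostate)
train <- sample(n, size = round(2 * n / 3))
fit   <- lm(lpsa ~ lcavol, data = prostate[train, ])
pred  <- predict(fit, newdata = prostate[-train, ])
mean((prostate$lpsa[-train] - pred)^2)  # validation MSE
```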

Now complete code section 1: import the data into RStudio, view the data, and plot the data, adding regression lines. A sketch of what this section might look like appears below.
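For reference, a minimal sketch of code section 1. The file name prostate.txt is an assumption; the course materials may distribute the data in a different format.

```r
# Read the prostate data; file name and format are assumptions.
prostate <- read.table("prostate.txt", header = TRUE)
head(prostate)  # view the first few rows
str(prostate)   # check variable types

# Scatterplot of the outcome against a single predictor, with OLS line.
plot(lpsa ~ lcavol, data = prostate, xlab = "lcavol", ylab = "lpsa")
fit1 <- lm(lpsa ~ lcavol, data = prostate)
abline(fit1, col = "red")  # add the fitted regression line
```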

Variable subset selection uses statistical criteria to identify a set of predictors. Variable subset selection: among a set of candidate predictors, choose a subset to include in the model based on some statistical criterion, e.g. p-values. Forward selection: add variables one at a time, starting with the x most strongly associated with y. Stop when no other significant variables are identified.

Variable subset selection continued. Backward elimination: start with every candidate predictor in the model. Remove variables one at a time until all remaining variables are significantly associated with the response. Stepwise selection: as forward selection, but at each iteration remove variables that are made obsolete by new additions. Stop when nothing new is added, or when a term is removed immediately after it was added.

Full enumeration methods: given a set of candidate predictors, fit every possible model and use some statistical criterion to decide which is best, e.g. $\mathrm{AIC} = -2\log L + 2k$ and $\mathrm{BIC} = -2\log L + k\log n$, where $L$ represents the likelihood function, $k$ is the number of parameters, and $n$ is the sample size. Both of these criteria consider the likelihood of each model with a penalty for model complexity. A stepwise-search sketch using these criteria appears below.
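R's step() function performs stepwise searches guided by AIC (and by BIC when the penalty is set to k = log(n)); a hedged sketch, again assuming the prostate data frame from code section 1 contains only the response and the 8 predictors:

```r
# Stepwise searches guided by information criteria via step().
full <- lm(lpsa ~ ., data = prostate)  # all candidate predictors
null <- lm(lpsa ~ 1, data = prostate)  # intercept only

# Forward selection by AIC:
step(null, scope = formula(full), direction = "forward")

# Backward elimination by BIC (penalty k = log(n)):
step(full, direction = "backward", k = log(nrow(prostate)))
```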

MANY methods have been proposed to choose and use predictors: shrinkage methods (ridge regression, the nonnegative garrote, many recent lasso-related developments); tree-based methods; forward stagewise selection (different from forward stepwise regression); maximum adjusted or unadjusted $R^2$; Mallows' $C_p$; Bayes factors; likelihood ratio tests; AICc; the deviance information criterion (DIC); and many others!

The lasso algorithm performs variable selection by constraining the sum of the magnitudes of the coefficients. For the model $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$, the lasso estimate is $\hat{\beta}^{\,lasso} = \operatorname{argmin}_{\beta} \sum_{i=1}^{n} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$. That is, the lasso estimator minimizes the sum of squared differences between the observed outcome and the linear model, so long as the sum of the absolute values of the coefficients is below some value $t$.

Why constrain the sum of the absolute values of the coefficients? We want a parsimonious model: a model which describes the response well but is as simple as possible. The lasso aims for parsimony using the constraint explained on the previous slide. Since the overall magnitude of the coefficients is constrained, important predictors are included in the model, and less important predictors shrink, potentially to zero.

A few other important items. An equivalent Lagrangian form of the lasso: $\hat{\beta}^{\,lasso} = \operatorname{argmin}_{\beta} \big\{ \sum_{i=1}^{n} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \big\}$. Many software packages require specification of $\lambda$. Also, the shrinkage factor $s$ is defined by $s = t \big/ \sum_{j=1}^{p} |\hat{\beta}_j|$, where the $\hat{\beta}_j$ are the full OLS estimates, so $s$ lies between zero and one. Question: As $t$ (or $s$) increases, what happens to the coefficient estimates? Question: As $\lambda$ increases, what happens to the coefficient estimates?

Now complete code section 2: fit the lasso model to the prostate data using the lars package; plot the lasso path; observe how the coefficients change as s increases; obtain estimated coefficients and predicted values for given values of s. A sketch appears below.
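A hedged sketch of code section 2 with the lars package. The choice s = 0.5 below is purely illustrative, and the predictor/response objects are assumptions built from the prostate data frame of code section 1.

```r
library(lars)

# Predictor matrix and response vector from the prostate data frame.
x <- as.matrix(prostate[, c("lcavol", "lweight", "age", "lbph",
                            "svi", "lcp", "gleason", "pgg45")])
y <- prostate$lpsa

fit.lasso <- lars(x, y, type = "lasso")
plot(fit.lasso)  # lasso path: coefficients as the shrinkage factor grows

# Coefficients and fitted values at an illustrative shrinkage s = 0.5.
coef(fit.lasso, s = 0.5, mode = "fraction")
predict(fit.lasso, newx = x, s = 0.5, mode = "fraction")$fit
```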

The least angle regression algorithm is used to fit the lasso path efficiently. It is an extremely efficient way to obtain the lasso coefficient estimates. It identifies the variable most associated with the response (like forward selection), but then adds only part of that variable at a time, and can switch variables before adding all of the first variable. For more detail, see Efron et al. (2004) and Friedman et al. (2009).

The lasso path plot illustrates coefficient behavior for various values of s. Question: How should we decide which s to use?

Cross validation is used both to choose s and to assess the predictive accuracy of the model. Initial training and validation sets are established. The tuning parameter s is chosen based on the training set, and the model is fit on the training set. Performance of the chosen model is then assessed on the basis of the validation set: the training model is used to predict outcomes in the validation set, and the MSE is computed. If the training model produces a reasonable MSE on the validation set, the model is adopted.

K-fold cross validation splits the data into K pieces; here K = 10. The training set is broken into 10 pieces, and 10-fold cross validation is used to determine the value of the shrinkage factor s. The model is then fit on the entire training set at the chosen s, the coefficient estimates are stored, and the MSE is computed.

Now complete code section 3: make a 10-fold cross validation ID vector; make a vector of s values to use; perform 10-fold cross validation on the training set at the chosen values of s; determine which value of s minimizes the 10-fold cross validation error; determine how well the chosen model performs in the validation set; compare the performance of the lasso with AIC and BIC. A sketch using cv.lars appears below.
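The lars package automates most of these steps through cv.lars; a hedged sketch follows (K = 10 matches the slide, but the grid of s values and the object names are assumptions, and x, y are from code section 2):

```r
library(lars)

# 10-fold cross validation over a grid of shrinkage fractions s.
s.grid <- seq(0, 1, length.out = 100)
cv.out <- cv.lars(x, y, K = 10, index = s.grid,
                  type = "lasso", mode = "fraction")

# Value of s minimizing the cross-validated MSE.
s.best <- s.grid[which.min(cv.out$cv)]
s.best
```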

s is chosen to minimize MSE in the training set based on K-fold cross validation. The picture shows the average MSE based on the 10 holdout sets for various values of s; vertical bars depict one standard error. Typically, the value of s that is within 1 SE of the lowest value is chosen. Here, 10-fold cross validation suggests s = 0.4 is a good choice.
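The one-standard-error rule can be applied directly to the cv.lars output from code section 3; a sketch, under the assumption that smaller s corresponds to the sparser (more parsimonious) model:

```r
# One-standard-error rule: among values of s whose CV error is within
# one SE of the minimum, pick the smallest (most parsimonious) s.
limit <- min(cv.out$cv) + cv.out$cv.error[which.min(cv.out$cv)]
s.1se <- min(s.grid[cv.out$cv <= limit])
s.1se
```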

Other interesting notes. Ridge regression is an earlier method similar to the lasso, which invokes the constraint $\sum_{j=1}^{p} \beta_j^2 < t$. This is also a shrinkage or penalization method, but ridge regression will not set any predictor coefficients to exactly zero. The lasso is preferable when predictors may be highly correlated. For both ridge regression and the lasso, $\lambda$ cannot be estimated directly from the data using maximum likelihood due to an identifiability issue; this is why cross validation is used to fix $\lambda$ at a constant. A ridge sketch appears below.
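For comparison, a hedged ridge regression sketch using lm.ridge from the MASS package (the lambda grid is an illustrative assumption; 'prostate' is the data frame from code section 1):

```r
library(MASS)

# Ridge regression over a grid of penalties; coefficients shrink toward
# zero as lambda grows but are never set exactly to zero.
ridge.fit <- lm.ridge(lpsa ~ lcavol + lweight + age + lbph + svi +
                        lcp + gleason + pgg45,
                      data = prostate, lambda = seq(0, 20, by = 0.1))
plot(ridge.fit)    # coefficient paths across lambda
select(ridge.fit)  # GCV-based suggestions for lambda
```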

Acknowledgements. Thanks to the following: Dhruva Sharma, Scotland Leman, Andy Hoege.

References

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics 32(2): 407-499.

Friedman, J., Hastie, T., and Tibshirani, R. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer.

Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II: radical prostatectomy treated patients. Journal of Urology 141: 1076-1083.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1): 267-288.