Penalized regression: Introduction
Patrick Breheny
BST 764: Applied Statistical Modeling
August 30

Maximum likelihood

Much of 20th-century statistics dealt with maximum likelihood estimation:
- One begins by specifying a distribution f(x|θ) for the data
- This distribution in turn induces a likelihood function L(θ|x), which specifies how likely the observed data are for various values of the unknown parameters
- Estimates can be found by finding the most likely values of the unknown parameters
- Hypothesis tests and confidence intervals can be carried out by evaluating how the likelihood changes as we move away from these most likely values

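As an aside, here is a minimal sketch (in Python, not from the slides) of what "finding the most likely values" looks like numerically, assuming normally distributed data and that scipy is available:

```python
# Maximum likelihood estimation sketch: simulate normal data, then find the
# (mu, sigma) that maximize the log-likelihood by minimizing its negative.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # observed data (simulated)

def neg_loglik(theta):
    mu, log_sigma = theta                      # unconstrained parameterization
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

fit = optimize.minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and (ML) standard deviation
```
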
Problems with maximum likelihood in regression

Maximum likelihood estimation has many wonderful properties, but is often unsatisfactory in regression problems for two reasons:
- Large variability: when p is large with respect to n, or when columns of X are highly correlated, the variance of β̂ is large
- Lack of interpretability: if p is large, we often desire a smaller set of predictors in order to gain insight into the most relevant relationships between y and X

Subset selection

Probably the most common solution to this problem is to choose a subset of important explanatory variables based on ad hoc trial and error, using p-values to guide whether a variable should be in the model or not
However, this approach is impossible to replicate, which makes its statistical properties impossible to study
A more systematic approach is to exhaustively search all possible subsets and then use some criterion, such as AIC or BIC, to choose an optimal subset

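For illustration, a minimal sketch of exhaustive (best-subset) search with AIC in Python; the simulated data and variable names are made up, and statsmodels is assumed to be available:

```python
# Best-subset selection by AIC: fit OLS on every subset of predictors and
# keep the subset whose model has the smallest AIC.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)   # only variables 0 and 2 matter

best_aic, best_subset = np.inf, ()
for k in range(p + 1):
    for subset in combinations(range(p), k):
        design = sm.add_constant(X[:, list(subset)]) if subset else np.ones((n, 1))
        aic = sm.OLS(y, design).fit().aic
        if aic < best_aic:
            best_aic, best_subset = aic, subset

print(best_subset)   # ideally (0, 2)
```

Even for this toy example the loop visits 2^6 = 64 models, which hints at the computational problem discussed next.
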
Problem #1: Computational infeasibility

One problem with this approach is that the number of possible subsets is 2^p, which grows exponentially with p
With today's computers, searching all possible subsets is computationally infeasible for p much larger than 40 or 50
For this reason, modifications such as forward selection, backward selection, and forward/backward hybrids have been proposed

Problem #2: Instability

Another problem is the fact that subset selection is discontinuous, in the sense that an infinitesimally small change in the data can result in completely different estimates
As a result, subset selection is often unstable and highly variable, especially in high dimensions
As we will see, penalized regression allows us to accomplish the same goals as subset selection, but in a more stable, continuous, and computationally efficient fashion

Subset selection pitfalls

It should be pointed out that it is common practice to report inferential results from ordinary least squares models following subset selection as if the model had been prespecified from the beginning
This is unfortunate, as the resulting inferential procedures violate every principle of statistical estimation and hypothesis testing:
- Test statistics no longer follow t/F distributions
- Standard errors are biased low, and confidence intervals are falsely narrow
- p-values are falsely small
- Regression coefficients are biased away from zero

Penalized maximum likelihood

A different way of dealing with this problem is to introduce a penalty: instead of maximizing ℓ(θ|x) = log{L(θ|x)}, we maximize the function
M(θ) = ℓ(θ|x) − λP(θ)
where
- P is a function that penalizes what one would consider less realistic values of the unknown parameters
- λ controls the tradeoff between the two parts
The function M is called the objective function

Penalized maximum likelihood (cont'd)

In regression, we usually think about minimizing a loss function (usually squared error loss) rather than maximizing a likelihood; an equivalent formulation is that we estimate θ by minimizing
M(θ) = L(θ|x) + λP(θ)
where L here is a loss function (usually a quantity that is proportional to a negative log-likelihood, such as the residual sum of squares)

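A minimal numerical illustration (Python with simulated data, not from the slides) of minimizing a penalized least-squares objective M(β) = RSS(β) + λP(β), here with a ridge penalty plugged in for P:

```python
# Penalized least squares: minimize RSS(beta) + lambda * P(beta),
# with P taken to be the ridge penalty for this illustration.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

lam = 5.0

def objective(beta):
    rss = np.sum((y - X @ beta) ** 2)      # loss: residual sum of squares
    penalty = np.sum(beta ** 2)            # ridge penalty P(beta)
    return rss + lam * penalty

beta_hat = optimize.minimize(objective, x0=np.zeros(p)).x
print(beta_hat)   # shrunk toward zero relative to ordinary least squares
```
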
Penalty functions

What exactly do we mean by "less realistic" values? In the first section of this course, we mean that coefficient values around zero are more believable than those far away from zero, and P is therefore a function which penalizes coefficients as they get further away from zero
The two penalties that we will cover in depth in this section of the course are ridge regression and the lasso:
- Ridge: P(β) = Σⱼ₌₁ᵖ βⱼ²
- Lasso: P(β) = Σⱼ₌₁ᵖ |βⱼ|

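For reference, a quick sketch of fitting these two penalties with scikit-learn (assumed to be available; note that scikit-learn parameterizes the tradeoff with alpha and scales its lasso loss by 1/(2n), so alpha is not directly comparable to the λ above):

```python
# Ridge and lasso fits via scikit-learn, on simulated data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)   # all coefficients shrunk, none exactly zero
print(lasso.coef_)   # some coefficients set exactly to zero
```
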
Penalty functions (cont'd)

Later in the course, we will use the idea of penalization when fitting curves to data, to favor simple, smoother curves over wiggly, erratic ones; here, P is a measure of the wiggliness of the curve
There are also plenty of other varieties of penalization in existence that we will not address in this course
They all operate on the same basic principle, however: all other things being equal, we would tend to favor the simpler explanation over the more complicated one

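As a preview (this specific form is not given on the slide, but it is the standard choice used in smoothing splines), one common measure of wiggliness is the integrated squared second derivative of the fitted curve f:

```latex
% One common roughness ("wiggliness") penalty: the integrated squared second
% derivative of the fitted function f, so the penalized criterion trades off
% fit against curvature.
P(f) = \int \{f''(t)\}^2 \, dt,
\qquad
M(f) = \sum_{i=1}^n \{y_i - f(x_i)\}^2 + \lambda \int \{f''(t)\}^2 \, dt
```
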
The regularization parameter

As mentioned earlier, the parameter λ controls the tradeoff between the penalty and the fit (loss/likelihood):
- When λ is too small, we tend to overfit the data and end up with models that have high variance
- When λ is too large, we tend to underfit the data and end up with models that are too simplistic and thus potentially biased
The parameter λ is called the regularization parameter; changing the regularization parameter allows us to directly balance the bias-variance tradeoff
Obviously, selection of λ is a very important practical aspect of fitting these models (one common approach, cross-validation, is sketched below)

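A minimal sketch (numpy only, simulated data; not necessarily the course's prescribed method) of choosing λ by K-fold cross-validation over a grid, using the closed-form ridge solution β̂ = (XᵀX + λI)⁻¹Xᵀy for each candidate value:

```python
# Choose lambda by 5-fold cross-validation: refit ridge regression (closed form)
# on each training fold and score prediction error on the held-out fold.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X @ np.concatenate([np.array([3.0, -2.0]), np.zeros(p - 2)]) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(n), 5)
lambdas = np.logspace(-2, 3, 30)
cv_error = []
for lam in lambdas:
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    cv_error.append(np.mean(errs))

best_lam = lambdas[int(np.argmin(cv_error))]
print(best_lam)   # the lambda with the smallest cross-validated error
```
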
Bayesian interpretation

From a Bayesian perspective, one can think of the penalty as arising from a prior distribution on the parameters
The objective function is then (proportional to) the log of the posterior distribution of θ given the data
By optimizing this objective function, we are finding the mode of the posterior distribution of θ

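For concreteness (this derivation is standard but not spelled out on the slide), ridge regression corresponds to independent normal priors on the coefficients: with y | β ~ N(Xβ, σ²I) and βⱼ ~ N(0, τ²),

```latex
% Negative log posterior (up to constants) under a Gaussian likelihood and
% independent N(0, tau^2) priors on the coefficients:
-\log p(\beta \mid y)
  = \frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2
  + \frac{1}{2\tau^2}\lVert \beta \rVert^2 + \text{const},
% which is proportional to the ridge objective with
\lambda = \sigma^2 / \tau^2 .
```
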
Constrained regression

Yet another way to think about penalized regression is that it implies a constraint on the values of θ
Suppose we were trying to maximize the likelihood subject to the constraint that P(θ) ≤ t
A standard approach to solving such a problem is to introduce a Lagrange multiplier; when we do so, we arrive at the same objective function as earlier

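In symbols (a standard restatement, using the notation introduced above), the constrained and penalized problems look like this:

```latex
% Constrained form and its Lagrangian (penalized) form:
\max_{\theta} \; \ell(\theta \mid x)
  \quad \text{subject to} \quad P(\theta) \le t
% introducing a Lagrange multiplier lambda >= 0 gives
\quad \longleftrightarrow \quad
\max_{\theta} \; \ell(\theta \mid x) - \lambda P(\theta),
% where each value of t corresponds to some value of lambda, and vice versa.
```
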
Penalization and the intercept term

Earlier, we mentioned that penalized regression revolves around an assumption that coefficient values around zero are more believable than those far away from zero
Some care is needed, however, in the application of this idea
First of all, it does not make sense to apply this idea to the intercept, unless you happen to have some reason to think that the mean of y should be zero
Hence, the intercept is not included in the penalty; if it were, the estimates would not be invariant to changes of location (the objective function with an unpenalized intercept is written out below)

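Concretely (not written out on the slide), the penalized least-squares objective with the intercept left unpenalized is:

```latex
% Penalized residual sum of squares with an unpenalized intercept beta_0:
M(\beta_0, \beta)
  = \sum_{i=1}^n \bigl( y_i - \beta_0 - x_i^T \beta \bigr)^2
  + \lambda P(\beta),
% where P penalizes only beta_1, ..., beta_p, not beta_0.
```
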
Standardization

A separate consideration is how to make "far from zero" mean the same thing for all the variables
For example, suppose x₁ varied from 0 to 1, while x₂ varied from 0 to 1 million; clearly, a one-unit change does not mean the same thing for both of these variables
Thus, the explanatory variables are usually standardized prior to model fitting to have mean zero and standard deviation 1; i.e., for all j,
x̄ⱼ = 0 and xⱼᵀxⱼ = n

Standardization (cont'd)

This can be accomplished without any loss of generality:
- Any location shifts for X are absorbed into the intercept
- Scale changes can be reversed after the model has been fit:
xᵢⱼβⱼ = (xᵢⱼ/a)(aβⱼ) = x̃ᵢⱼβ̃ⱼ;
i.e., if we had to divide xⱼ by a to standardize it (x̃ⱼ = xⱼ/a), we simply divide the transformed solution β̃ⱼ by a to obtain βⱼ on the original scale

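A minimal numpy sketch (illustrative, not from the slides) of standardizing the columns of X so that x̄ⱼ = 0 and xⱼᵀxⱼ = n, centering y, and mapping a fitted coefficient vector back to the original scale:

```python
# Standardize X (mean 0, x_j'x_j = n per column), center y, fit ridge on the
# standardized scale, then transform the coefficients back to the original scale.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 100.0, 1000.0])  # very different scales
y = 5.0 + X @ np.array([1.0, 0.1, 0.0, 0.001]) + rng.normal(size=n)

center = X.mean(axis=0)
scale = np.sqrt(np.mean((X - center) ** 2, axis=0))  # so that x_j'x_j = n after scaling
X_std = (X - center) / scale
y_bar = y.mean()
y_c = y - y_bar

lam = 1.0
beta_std = np.linalg.solve(X_std.T @ X_std + lam * np.eye(p), X_std.T @ y_c)

beta_orig = beta_std / scale                 # undo the scaling
intercept = y_bar - center @ beta_orig       # recover the intercept
print(intercept, beta_orig)
```
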
Further benefits

Centering and scaling the explanatory variables has added benefits:
- The explanatory variables are now orthogonal to the intercept term, meaning that in the standardized covariate space, β̂₀ = ȳ regardless of what goes on in the rest of the model
- In other words, if we center y by subtracting off its mean, we don't even need to estimate β₀
- Also, standardization simplifies the solutions; recall from BST 760 that for simple linear regression
α̂ = ȳ − β̂x̄,  β̂ = Σᵢ(yᵢ − ȳ)(xᵢ − x̄) / Σᵢ(xᵢ − x̄)²
- If we center and scale x and center y, however, then we get the much simpler expression β̂ = xᵀy/n

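A quick numerical check (numpy, simulated data, illustrative only) that the two formulas agree once x is standardized and y is centered:

```python
# Verify that for standardized x (mean 0, x'x = n) and centered y,
# the least-squares slope reduces to x'y / n.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(loc=10.0, scale=3.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

# Classical simple-regression formula on the raw data:
beta_raw = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

# Standardize x, center y, then use the simplified formula:
s = np.sqrt(np.mean((x - x.mean()) ** 2))
x_std = (x - x.mean()) / s
y_c = y - y.mean()
beta_std = x_std @ y_c / n

# beta_std is the slope per standard deviation of x; dividing by the scale s
# recovers the raw-scale slope:
print(beta_raw, beta_std / s)   # should match
```
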
Standardization summary

To summarize, centering the response and centering and scaling each explanatory variable to have mean 0 and standard deviation 1 has the following benefits:
- Estimates of β are location-scale invariant
- Computational savings (we only need to estimate p parameters, not p + 1)
- Simplicity
- No loss of generality, as we can transform back to the original scale once we finish fitting the model