Parametric versus Semi/nonparametric Regression Models

Similar documents
Syntax Menu Description Options Remarks and examples Stored results Methods and formulas References Also see. Description

Regression III: Advanced Methods

Smoothing and Non-Parametric Regression

Examples. David Ruppert. April 25, Cornell University. Statistics for Financial Engineering: Some R. Examples. David Ruppert.

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Using kernel methods to visualise crime data

Examining a Fitted Logistic Model

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Fitting Subject-specific Curves to Grouped Longitudinal Data

Least Squares Estimation

Regression Modeling Strategies

Time Series Analysis

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

exspline That: Explaining Geographic Variation in Insurance Pricing

Automated Learning and Data Visualization

Nonnested model comparison of GLM and GAM count regression models for life insurance data

Simple linear regression

SAS Software to Fit the Generalized Linear Model

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Extending Linear Regression: Weighted Least Squares, Heteroskedasticity, Local Polynomial Regression

PS 271B: Quantitative Methods II. Lecture Notes

A short introduction to splines in least squares regression analysis

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Penalized Logistic Regression and Classification of Microarray Data

STATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER

Univariate Regression

Introduction to General and Generalized Linear Models

A.II. Kernel Estimation of Densities

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Bandwidth Selection for Nonparametric Distribution Estimation

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Time Series and Forecasting

Regression III: Advanced Methods

Elements of statistics (MATH0487-1)

Multiple Linear Regression in Data Mining

Model Diagnostics for Regression

Simple Predictive Analytics Curtis Seare

Local classification and local likelihoods

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Parallel Computing of Kernel Density Estimates with MPI

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Computer exercise 4 Poisson Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Statistics in Retail Finance. Chapter 6: Behavioural models

BayesX - Software for Bayesian Inference in Structured Additive Regression

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

The zero-adjusted Inverse Gaussian distribution as a model for insurance claims

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Local Quantile Regression

Data Mining: An Overview. David Madigan

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE. School of Mathematical Sciences

Penalized regression: Introduction

GLM I An Introduction to Generalized Linear Models

Chapter 7: Simple linear regression Learning Objectives

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Signal Detection. Outline. Detection Theory. Example Applications of Detection Theory

2. Linear regression with multiple regressors

Teaching Multivariate Analysis to Business-Major Students

Simple Linear Regression Inference

STOCHASTIC CLAIMS RESERVING IN GENERAL INSURANCE. By P. D. England and R. J. Verrall. abstract. keywords. contact address. ".

The Best of Both Worlds:

Exploratory Data Analysis with MATLAB

Multiple Regression: What Is It?

ON THE DEGREES OF FREEDOM IN RICHLY PARAMETERISED MODELS

Natural cubic splines

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

The term structure of Russian interest rates

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Penalized Splines - A statistical Idea with numerous Applications...

Section 3 Part 1. Relationships between two numerical variables

4. Simple regression. QBUS6840 Predictive Analytics.

GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE

Building a Smooth Yield Curve. University of Chicago. Jeff Greco

A Review of Statistical Outlier Methods

EECS 556 Image Processing W 09. Interpolation. Interpolation techniques B splines

Simple Methods and Procedures Used in Forecasting

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Automated Biosurveillance Data from England and Wales,

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

Effort Estimation of Web Based Applications

Nonparametric Statistics

Homework 8 Solutions

13. Poisson Regression Analysis

5. Multiple regression

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Imputing Missing Data using SAS

Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Efficient Curve Fitting Techniques

Introduction to Quantitative Methods

Transcription:

Parametric versus Semi/nonparametric Regression Models Hamdy F. F. Mahmoud Virginia Polytechnic Institute and State University Department of Statistics LISA short course series- July 23, 2014 Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 1/

Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 2/

Outline 1 What is semi/nonparametric regression? 2 When should we use semi/nonparametric regression? 3 semi/nonparametric regression estimation methods: a). Kernel Regression b). Smoothing Spline 4 Other methods in nonparametric regression models estimation. 5 Discussion and Recommendations Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 3/

What is semi/nonparametric regression? Nonparametric regression is a form of regression analysis in which NONE of the predictors take predetermined forms with the response but are constructed according to information derived from the data. Semiparametric regression is a form of regression analysis in which a PART of the predictors do not take predetermined forms and the other part takes known forms with the response. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 4/

What is semi/nonparametric regression? Example: Assume that we have a response variable Y and two explanatory variables, x 1 and x 2. In general the regression model that describes the relationship can be written as: Y = f 1 (x 1 ) + f 2 (x 2 ) + ɛ Some parametric regression models: Y = β 0 + β 1 x 1 + β 2 x 2 + ɛ (Multiple linear regression model) Y = β 0 + β 10 x 1 + β 11 x 2 1 + β 20x 2 + β 21 x 2 2 + ɛ (Polynomial regression model of second order) Y = β 0 + β 1 x 1 + β 2 e (β 3x 2 ) + ɛ (Nonlinear regression model) log(µ) = β 0 + β 1 x 1 + β 2 x 2 (Poisson regression when Y is count) Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 5/

What is semi/nonparametric regression? If we do not know f 1 and f 2 functions, we need to use a NONparametric regression model. If we do not know f 1 and know f 2, we need to use SEMIparametric regression model. Example: Y = β 0 + β 1 x 1 + f (x 2 ) + ɛ Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 6/

When we should use semi/nonparametric regression? The FIRST STEP in any analysis is GRAPHICAL ANALYSIS for the response (dependent) variable and the explanatory (independent) variables. Examples: Boxplots Area plots Scatterplots GO TO REAL DATA [Course R Code] Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 7/

When should we use semi/nonparametric regression? Four principal assumptions which justify the use of linear regression models for purposes of fitting and inferences: Linearity Independence Constant variance Normality Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 8/

When should we use semi/nonparametric regression? Violations of linearity are extremely serious, especially when you extrapolate beyond the range of the sample data. How to detect: Nonlinearity is usually most evident in a plot of residuals versus predicted values. Use Goodness of fit test. How to fix: Use a nonlinear transformation to the dependent and/or independent variables such as a log transformation, square root, or power transformation. Add another regressors which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, it may make sense to regress Y on both X and X 2 (i.e., X-squared). Use semi(non)parametric regression model. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 9/

When should we use semi/nonparametric regression? GO to R Code file to: Practice on identifying the relationship (linear or not linear) using different data sets. For wage data set, regress log(wage) on age using linear regression model. Check the linearity assumption. Try to fix the problem. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 10

Estimation methods Two of the most commonly used approaches to nonparametric regression are: 1 Kernel Regression: estimates the conditional expectation of Y at given value x using a weighted filter to the data. 2 Smoothing splines: minimize the sum of squared residuals plus a term which penalizes the roughness of the fit. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 11

Semi/nonparametric regression estimation methods. 1. Nadaraya-Watson Kernel Regression [local constant] Nadaraya and Watson 1964 proposed a method to estimate ˆf (x 0 ) at a given value x 0 as a locally weighted average of all y s associated to the values around x. The Nadaraya-Watson estimator is: f h (x) = n i=1 K( x x i )y h i n i=1 K( x x i ) h where K is a Kernel function (weight function) with a bandwidth h. Remark: K function should give us weights decline as one moves away from the target value. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 12

Semi/nonparametric regression estimation methods. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 13

Popular choices of weight function are: Epanechnikov: K( ) = 3 4 (1 d 2 ), d 2 < 1, 0 otherwise, Minimum var: K( ) = 3 8 (1 5d 2 ), d 2 < 1, 0 otherwise, Gaussian density: exp( x x i h ) Tricube function: W (z) = (1 z 3 ) 3 for z < 1 and 0 otherwise. Problem: local constant has one difficulty is that a kernel smoother still exhibits bias at the end points. Solution: Use local linear kernel regression Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 14

Semi/nonparametric regression estimation methods. How to choose the bandwidth? Rule of thumb: If we use Gaussian then it can be shown that the optimal choice for h is ) 1 h = 5 1.06ˆσn 1/5, ( 4ˆσ 5 3n where ˆσ is the standard deviation of the samples. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 15

Semi/nonparametric regression estimation methods. GO to R Code file to: 1 Apply Kernel regression on wage data 2 Study the effect of bandwidth h on estimation 3 Compare between Kernel regression (nonparametric) and second order polynomial regression (parametric) in terms of fitting and prediction. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 16

Semi/nonparametric regression estimation methods. 2. Spline Smoothing A spline is a piecewise polynomial with pieces defined by a sequence of knots θ 1 < θ 2 <... < θ K such that the pieces join smoothly at the knots. A spline of degree p can be represented as a power series: f (x) = β 0 + β 1 x + β 2 x 2 + β 3 x 3 +... + β p x p + K k=1 β 1k (x θ k ) p +, where (x θ k ) + = x θ k,x > θ k and 0 otherwise Example: f (x) = β 0 + β 1 x + K k=1 β 1k (x θ k ) + (linear spline) Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 17

Semi/nonparametric regression estimation methods. How many knots need to be used? Where those knots should be located? Number of parameters is 1 + p + K that we need big number of observations. Possible solution: use penalized spline smoothing Consider fitting a spline with knots of every data point, so it could fit perfectly, but estimate its parameters by minimizing the usual sum of squares plus a roughness penalty. A suitable penalty is to integrate the squared second derivative, leading to penalized sum of squares criterion: ni=1 [y i f (x i )] 2 + λ [f (x)] 2 dx where λ is a tunning parameter controls smoothness. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 18

Semi/nonparametric regression estimation methods. GO to R Code file to: 1 Apply spline regression on prestige data, and 2 Study the effect of λ on smoothing Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 19

Semi/nonparametric regression estimation methods. What if we have more than one explanatory (independent) variable? 1 Kernel Regression 2 Spline Regression GO to R Code file Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 20

Discussion and Recommendations Pros: It is flexible. Better in fitting the data than parametric regression models. Cons: Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the model estimates. limitations Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 21

Discussion and Recommendations Steps of modeling: Graphical Analysis If you have a nonlinear and unknown relationship between a response and an explanatory variable: use transformation add a new variable in the model to capture the relationship. If transformation does not work, use nonparametric regression. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 22

References 1 D. Ruppert, M. P. Wand, and R. J. Carrol (2003), Semiparametric Regression, Cambridge University Press, NY. 2 J. S. Simonoff (1996) Smoothing methods in statistics, New York : Springer. 3 Y. Wang (2011) Smoothing splines : methods and applications Boca Raton, FL: CRC Press. 4 M.P. Wand and M.C. Jones (1995) Kernel smoothing, London; New York: Chapman and Hall. Hamdy Mahmoud - Email: ehamdy@vt.edu Parametric versus Semi/nonparametric Regression Models 23