Statistical Models in R

Similar documents

Multiple Linear Regression

Statistical Models in R

Stat 5303 (Oehlert): Tukey One Degree of Freedom 1

Comparing Nested Models

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Correlation and Simple Linear Regression

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Week 5: Multiple Linear Regression

Psychology 205: Research Methods in Psychology

Using R for Linear Regression

EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

5. Linear Regression

Testing for Lack of Fit

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Lucky vs. Unlucky Teams in Sports

ANOVA. February 12, 2015

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Simple Linear Regression Inference

Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance

Part 2: Analysis of Relationship Between Two Variables

BIOL 933 Lab 6 Fall Data Transformation

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is

Chapter 7: Simple linear regression Learning Objectives

N-Way Analysis of Variance

2. Simple Linear Regression

Univariate Regression

Simple Linear Regression

Regression Analysis: A Complete Example

MIXED MODEL ANALYSIS USING R

Two-way ANOVA and ANCOVA

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

SPSS Guide: Regression Analysis

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

Final Exam Practice Problem Answers

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Getting Correct Results from PROC REG

Logistic Regression (a type of Generalized Linear Model)

MSwM examples. Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech.

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Regression step-by-step using Microsoft Excel

Exchange Rate Regime Analysis for the Chinese Yuan

International Statistical Institute, 56th Session, 2007: Phil Everson

Recall this chart that showed how most of our course would be organized:

17. SIMPLE LINEAR REGRESSION II

Outline. Dispersion Bush lupine survival Quasi-Binomial family

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Elementary Statistics Sample Exam #3

Factors affecting online sales

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

n + n log(2π) + n log(rss/n)

2013 MBA Jump Start Program. Statistics Module Part 3

Regression III: Advanced Methods

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Premaster Statistics Tutorial 4 Full solutions

Exercises on using R for Statistics and Hypothesis Testing Dr. Wenjia Wang

Q = ak L + bk L. 2. The properties of a short-run cubic production function ( Q = AL + BL )

Generalized Linear Models

Multiple Regression: What Is It?

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Module 5: Multiple Regression Analysis

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

GLM I An Introduction to Generalized Linear Models

SAS Software to Fit the Generalized Linear Model

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Introduction to Hierarchical Linear Modeling with R

Regression and Correlation

Time Series Analysis

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Predictor Coef StDev T P Constant X S = R-Sq = 0.0% R-Sq(adj) = 0.

MULTIPLE REGRESSION EXAMPLE

11. Analysis of Case-control Studies Logistic Regression

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Exercise 1.12 (Pg )

STAT 350 Practice Final Exam Solution (Spring 2015)

Chapter 3 Quantitative Demand Analysis

2. What is the general linear model to be used to model linear trend? (Write out the model) = or

The importance of graphing the data: Anscombe s regression examples

1 Simple Linear Regression I Least Squares Estimation

DATA INTERPRETATION AND STATISTICS

Lab 13: Logistic Regression

Multivariate Logistic Regression

R: A Free Software Project in Statistical Computing

Regression Analysis (Spring, 2000)

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

Multiple Linear Regression in Data Mining

Multi Factors Model. Daniel Herlemont. March 31, Estimating using Ordinary Least Square regression 3

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

Chapter 7 Section 1 Homework Set A

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

An Introduction to Spatial Regression Analysis in R. Luc Anselin University of Illinois, Urbana-Champaign

Transcription:

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007

Outline Statistical Models Linear Models in R

Regression Regression analysis is the appropriate statistical method when the response variable and all explanatory variables are continuous. Here, we only discuss linear regression, the simplest and most common form. Remember that a statistical model attempts to approximate the response variable Y as a mathematical function of the explanatory variables X 1,..., X n. This mathematical function may involve parameters. Regression analysis attempts to use sample data find the parameters that produce the best model

Linear Models The simplest such model is a linear model with a unique explanatory variable, which takes the following form. ŷ = a + bx. Here, y is the response variable vector, x the explanatory variable, ŷ is the vector of fitted values and a (intercept) and b (slope) are real numbers. Plotting y versus x, this model represents a line through the points. For a given index i, ŷ i = a + bx i approximates y i. Regression amounts to finding a and b that gives the best fit.

Linear Model with 1 Explanatory Variable y 0 5 10 y y hat x=2 0 1 2 3 4 5 x

Plotting Commands for the record The plot was generated with test data xr, yr with: > plot(xr, yr, xlab = "x", ylab = "y") > abline(v = 2, lty = 2) > abline(a = -2, b = 2, col = "blue") > points(c(2), yr[9], pch = 16, col = "red") > points(c(2), c(2), pch = 16, col = "red") > text(2.5, -4, "x=2", cex = 1.5) > text(1.8, 3.9, "y", cex = 1.5) > text(2.5, 1.9, "y-hat", cex = 1.5)

Linear Regression = Minimize RSS Least Squares Fit In linear regression the best fit is found by minimizing RSS = n (y i ŷ i ) 2 = i=1 n (y i (a + bx i )) 2. i=1 This is a Calculus I problem. There is a unique minimum and unique a and b achieving the minimum.

No Difference with Multiple Explanatory Variables Suppose there are k continuous explanatory variables, x 1,..., x k. A linear model of y in these variables is of the form ŷ = a + b 1 x 1 + b 2 x 2 + + b k x k. The multiple linear regression problem is to find a, b 1,..., b k that minimze RSS. With the mild assumption of independence in the x 1,..., x k, there is a again a unique solution. It s just a problem in matrix algebra rather than high school algebra.

Is the Model Predictive? No assumptions have yet been made about the distribution of y or any other statistical properties. In modeling we want to calculate a model (a and b 1,..., b k ) from the sample data and claim the same relationship holds for other data, within a certain error. Is it reasonable to assume the linear relationship generalizes? Given that the variables represent sample data there is some uncertainty in the coefficients. With other sample data we may get other coefficients. What is the error in estimating the coefficients? Both of these issues can be addressed with some additional assumptions.

Is the Model Predictive? No assumptions have yet been made about the distribution of y or any other statistical properties. In modeling we want to calculate a model (a and b 1,..., b k ) from the sample data and claim the same relationship holds for other data, within a certain error. Is it reasonable to assume the linear relationship generalizes? Given that the variables represent sample data there is some uncertainty in the coefficients. With other sample data we may get other coefficients. What is the error in estimating the coefficients? Both of these issues can be addressed with some additional assumptions.

Is the Model Predictive? No assumptions have yet been made about the distribution of y or any other statistical properties. In modeling we want to calculate a model (a and b 1,..., b k ) from the sample data and claim the same relationship holds for other data, within a certain error. Is it reasonable to assume the linear relationship generalizes? Given that the variables represent sample data there is some uncertainty in the coefficients. With other sample data we may get other coefficients. What is the error in estimating the coefficients? Both of these issues can be addressed with some additional assumptions.

Assumptions in Linear Models Given a linear model ŷ = a + b 1 x 1 + + b k x k of the response variable y, the validity of the model depends on the following assumptions. Recall: the residual vector is y ŷ. Homoscedasticity (Constant Variance) The variance of the residuals is constant across the indices. The points should be evenly distributed around the mean. Plotting residuals versus fitted values is a good test. Normality of Errors The residuals are normally distributed.

Assumptions in Linear Models Given a linear model ŷ = a + b 1 x 1 + + b k x k of the response variable y, the validity of the model depends on the following assumptions. Recall: the residual vector is y ŷ. Homoscedasticity (Constant Variance) The variance of the residuals is constant across the indices. The points should be evenly distributed around the mean. Plotting residuals versus fitted values is a good test. Normality of Errors The residuals are normally distributed.

Assumptions in Linear Models Given a linear model ŷ = a + b 1 x 1 + + b k x k of the response variable y, the validity of the model depends on the following assumptions. Recall: the residual vector is y ŷ. Homoscedasticity (Constant Variance) The variance of the residuals is constant across the indices. The points should be evenly distributed around the mean. Plotting residuals versus fitted values is a good test. Normality of Errors The residuals are normally distributed.

Assessment Methods These conditions are verified in R linear fit models with plots, illustrated later. If a plot of residuals versus fitted values shows a dependence pattern then a linear model is likely invalid. Try transforming the variables; e.g., fit log(y) instead of y, or include more complicated explanatory variables, like x 2 1 or x 1x 2. With normality of residuals, RSS satisfies a chi-squared distribution. This can be used as a measure of the model s quality and compare linear models with different sets of explanatory variables.

Assessment Methods These conditions are verified in R linear fit models with plots, illustrated later. If a plot of residuals versus fitted values shows a dependence pattern then a linear model is likely invalid. Try transforming the variables; e.g., fit log(y) instead of y, or include more complicated explanatory variables, like x 2 1 or x 1x 2. With normality of residuals, RSS satisfies a chi-squared distribution. This can be used as a measure of the model s quality and compare linear models with different sets of explanatory variables.

Assessment Methods These conditions are verified in R linear fit models with plots, illustrated later. If a plot of residuals versus fitted values shows a dependence pattern then a linear model is likely invalid. Try transforming the variables; e.g., fit log(y) instead of y, or include more complicated explanatory variables, like x 2 1 or x 1x 2. With normality of residuals, RSS satisfies a chi-squared distribution. This can be used as a measure of the model s quality and compare linear models with different sets of explanatory variables.

Linear Models in R Given: A response variable Y and explanatory variables X1, X2,...,Xk from continuous random variables. A linear regression of Y on X1, X2,..., Xk is executed by the following command. > lmfit <- lm(y ~ X1 +... + Xk) The values of the estimated coefficients and statistics measuring the goodness of fit are revealed through summary(lmfit)

Example Problem There is one response variable yy and five explanatory variables x1, x2, x3, x4, x5, all of length 20. The linear fit is executed by > lmfit1 <- lm(yy ~ x1 + x2 + x3 + x4 + + x5)

> summary(lmfit1) Results of the Linear Fit Call: lm(formula = yy ~ x1 + x2 + x3 + x4 + x5) Residuals: Min 1Q Median 3Q Max -1.176-0.403-0.106 0.524 1.154 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 4.660 1.098 4.24 0.00082 x1 3.235 1.207 2.68 0.01792 x2 3.147 0.688 4.57 0.00043 x3-6.486 1.881-3.45 0.00391 x4-1.117 0.596-1.87 0.08223 x5 1.931 0.241 8.03 1.3e-06

Results of the Linear Fit continued (Intercept) *** x1 * x2 *** x3 ** x4. x5 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.684 on 14 degrees of freedom Multiple R-Squared: 0.974, Adjusted R-squared: 0.965 F-statistic: 106 on 5 and 14 DF, p-value: 1.30e-10

What Class is lmfit1? > class(lmfit1) [1] "lm" > names(lmfit1) [1] "coefficients" "residuals" [3] "effects" "rank" [5] "fitted.values" "assign" [7] "qr" "df.residual" [9] "xlevels" "call" [11] "terms" "model" These can be used to extract individual components, e.g., lmfit1$fitted.values is the vector of fitted values, the hat vector.

Explanation of Coefficients The Estimate column gives the model s estimate of a (Intercept) and b1, b2, b3, b4, b5. The vector of fitted values is yy-hat = 4.660 + 3.235*x1 + 3.147*x2-6.486*x3 + -1.117*x4 + 1.931*x5 From the assumed normal distribution of the residuals it s possible to estimate the error in the coefficients (see the second column). The t test is a test of null hypothesis that the coefficient is 0. If the p-value in the fourth column is < 0.05 then the variable is significant enough to be included in the model.

Measures of Fit Quality Several parameters are given that measure the quality of the fit. The distribution of values of the residuals is given. The model degrees of freedom, df, is the length of yy minus the number of parameters calculated in the model. Here this is 20 6 = 14. By definition the residual standard error is RSS. df Clearly, it s good when this is small.

Measures of Fit Quality A quantity frequently reported in a model is R 2. Given the y values y 1,..., y n, the mean of y, ȳ, and the fitted values ŷ 1,..., ŷ n, n R 2 i=1 = (ŷ i ȳ) 2 n i=1 (y i ȳ) 2. This is a number between 0 and 1. The quality of fit increases with R 2. The adjusted R 2 does some adjustment for degrees of freedom. In our example R 2 is 0.974, which is very high.

Plots to Assess the Model Remember the assumptions on the residuals needed to consider the linear model valid. We need an even scatter of residuals when plotted versus the fitted values, and a normal distribution of residuals. R produces 4 plots we can use to judge the model. The following code generates the 4 plots in one figure, then resets the original graphic parameters. > oldpar <- par(mfrow = c(2, 2)) > plot(lmfit1) > par(oldpar)

Plots of lmfit1 Residuals vs Fitted Normal Q Q Residuals 1.0 0.0 1.0 7 5 3 Standardized residuals 2 1 0 1 2 7 5 3 12 8 6 4 2 2 1 0 1 2 Fitted values Theoretical Quantiles Standardized residuals 0.0 0.4 0.8 1.2 Scale Location 7 3 5 Standardized residuals 2 1 0 1 2 Residuals vs Leverage 7 1 0.5 19 0.5 Cook's distance 3 1 12 8 6 4 2 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Fitted values Leverage

Using R 2 to Compare Models? A problem with R 2, though, is that it doesn t follow a distribution. We can t compare the R 2 s in two models and know when one is meaningfully better. Just as an F statistic assessed the significance of an anova model, we use a statistic that follows an F distribution to compare two linear models, and to compare a single model to the null model.

Comparing Linear Models A typical concern in a linear modeling problem is whether leaving a variable out meaningfully diminishes the quality of the model. There is some disadvantage in that RSS may increase some in the smaller model, however using fewer variables is a simpler model, always a plus. We need to measure the trade-off.

The F Statistic Suppose the data contain N samples (N = length of y). Consider two linear models M 0, M 1. M 0 has p 0 variables and RSS value RSS 0. Model M 1 has p 1 > p 0 variables, the variables in M 0 are included in those used in M 1, and the RSS value is RSS 1. Let F be the number defined as F = (RSS 0 RSS 1 )/(p 1 p 0 ). RSS 1 /(N p 1 1) Under the assumption that the residuals are normally distributed, F satisfies an F distribution with p 1 p 0 and N p 1 1 degrees of freedom.

Tested with Anova There is a simple way to execute this test in R. If fit1 and fit2 are the objects returned by lm for the two nested models, the test is executed by > compmod <- anova(fit1, fit2) This is not aov, which models a continuous variable against a factor. The similarity is that both use the F distribution to measure the statistic; all such tests are an analysis of variance in some form.

Test One Model Against the Null In the summary(lmfit1) output the last line reports an F statistic. This is a comparison between the model and the null model, that sets all coefficients to 0 except the intercept. This statistic can be >.05 when y has no dependence on the explanatory variables.

Remove Variable from lmfit1 and Test the Result The variable judged least significant in lmfit1 is x4. For it, the p-value is.08, which is above the threshold. Generate another model without it. > lmfit2 <- lm(yy ~ x1 + x2 + x3 + x5)

Inspect lmfit2 > summary(lmfit2) Call: lm(formula = yy ~ x1 + x2 + x3 + x5) Residuals: Min 1Q Median 3Q Max -1.15346-0.33076 0.00698 0.29063 1.30315 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 3.622 1.024 3.54 0.00300 x1 1.013 0.237 4.27 0.00067 x2 2.137 0.461 4.63 0.00032 x3-2.975 0.152-19.63 4.1e-12 x5 1.935 0.260 7.44 2.1e-06

Inspect lmfit2 continued (Intercept) ** x1 *** x2 *** x3 *** x5 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.739 on 15 degrees of freedom Multiple R-Squared: 0.968, Adjusted R-squared: 0.959 F-statistic: 113 on 4 and 15 DF, p-value: 5.34e-11

Compare the Two Models with Anova > compfit1fit2 <- anova(lmfit2, lmfit1) > compfit1fit2 Analysis of Variance Table Model 1: yy ~ x1 + x2 + x3 + x5 Model 2: yy ~ x1 + x2 + x3 + x4 + x5 Res.Df RSS Df Sum of Sq F Pr(>F) 1 15 8.18 2 14 6.54 1 1.64 3.5 0.082. --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Use lmfit2 In Place of lmfit1 Since the p-value is 0.082, which is > 0.05, we accept the null hypothesis that the model using 5 variables (lmfit1) is not significantly better than the model using 4 variables (lmfit2). In this situation we use lmfit2 as a model preferred over lmfit1.

Variable Selection a simple but typical case The following steps yield a model with the fewest number of variables that is statistically as meaningful as a larger model. Generate an initial model using all reasonable explanatory variables. Identify the variable with the smallest p-value. Compute a linear model using the smaller set of variables. Compute an anova for the two models. If the p-value is < 0.05 then the larger model is significantly better than the smaller. We accept the larger model as optimal. Otherwise, repeat steps 2 4.

Variable Selection a simple but typical case The following steps yield a model with the fewest number of variables that is statistically as meaningful as a larger model. Generate an initial model using all reasonable explanatory variables. Identify the variable with the smallest p-value. Compute a linear model using the smaller set of variables. Compute an anova for the two models. If the p-value is < 0.05 then the larger model is significantly better than the smaller. We accept the larger model as optimal. Otherwise, repeat steps 2 4.

Variable Selection a simple but typical case The following steps yield a model with the fewest number of variables that is statistically as meaningful as a larger model. Generate an initial model using all reasonable explanatory variables. Identify the variable with the smallest p-value. Compute a linear model using the smaller set of variables. Compute an anova for the two models. If the p-value is < 0.05 then the larger model is significantly better than the smaller. We accept the larger model as optimal. Otherwise, repeat steps 2 4.

Variable Selection a simple but typical case The following steps yield a model with the fewest number of variables that is statistically as meaningful as a larger model. Generate an initial model using all reasonable explanatory variables. Identify the variable with the smallest p-value. Compute a linear model using the smaller set of variables. Compute an anova for the two models. If the p-value is < 0.05 then the larger model is significantly better than the smaller. We accept the larger model as optimal. Otherwise, repeat steps 2 4.

Linear Models are Broadly Applicable More complicated models can be generated by transforming a variable or including interactions between variables. Instead of fitting y to a + b 1 x1 + b2x 2 it may be more meaningful to fit log(y) to a + c 1 x 1 + c 2 x 2 1 + c 3 x 2 + c 4 x 1 x 2. This is still considered a linear model since it is linear in the parameters. R s handling of generalized lienar models is applicable here.