MULTIPLE REGRESSION EXAMPLE




For a sample of n = 166 college students, the following variables were measured:

Y = height
X1 = mother's height ("momheight")
X2 = father's height ("dadheight")
X3 = 1 if male, 0 if female ("male")

Our goal is to predict a student's height using the mother's and father's heights and the student's sex, where sex is coded by the indicator variable male = 1 if male, 0 if female.

The population model is

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + εi,

where the εi are independent N(0, σ²).

NOTE: The original data set has a text variable called sex with two values, "Male" and "Female". To create the indicator variable male in Stata, the commands are:

. generate male = 1
. replace male = 0 if sex=="Female"

First let's look at some plots of the original data to see whether there are outliers and whether the patterns look linear. See the plots in the extended handout on the website. The plots are:

G1. Separate histograms of male and female students' heights. Stata: histogram height, by(sex)
G2. Histogram of mothers' heights. Stata: histogram momheight
G3. Histogram of fathers' heights. Stata: histogram dadheight
G4. A scatterplot matrix (see p. 232 of the text), separately for males and females. Stata: graph matrix height momheight dadheight, by(sex), or Graphics -> Scatterplot matrix

Examining the plots, a few possible outliers are evident:

- A case with momheight = 80 inches. This is almost surely a mistake: it would be a female height of 6 ft, 8 in. So it is legitimate to remove it, since we can't recover what it should have been. The case is in row 129. We need to change the value to the missing-value code, which is a period in Stata:

. replace momheight = . in 129

- A case with height = 57 inches for a male. While unusual (4 ft, 9 in), it is possible. Do not remove it.
- A case with dadheight = 55 inches. Again, this is very unusual (4 ft, 7 in) but possible. Do not remove it. One option is to run the analysis with and without it and see what difference that makes. (Of course, you would always report that you had done that, not just choose the result you like best!)

From the scatterplot matrix, we see that the relationships between the response variable height and the explanatory variables momheight and dadheight look linear, at least as far as we can tell from such tiny pictures.

Now let's run the regress command:

. regress height momheight dadheight male

We will follow that up with some plots:

G5. A plot of the residuals versus fitted values. Stata: rvfplot, yline(0)
G6. A normal probability plot (we need to create the variable residuals first). Stata:
. predict residuals, residuals
. qnorm residuals
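As a side note, the same indicator can be created in a single step, and it is worth verifying the coding against the original text variable before moving on. This is a minimal sketch using the variable names above; male2 is a hypothetical scratch name used only for the comparison:

. * One-step alternative: the expression (sex == "Male") evaluates to 1 or 0.
. * Note that it would also code any missing values of sex as 0, so check for those.
. generate byte male2 = (sex == "Male")
. * Cross-tabulate the original variable against the indicator, including
. * missing categories, to confirm the coding matches:
. tabulate sex male2, missing
. drop male2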

Regress command results (remember that we now have n = 165 cases; we removed one outlier):

      Source |       SS       df       MS              Number of obs =     165
-------------+------------------------------           F(  3,   161) =  104.38
       Model |  1679.17741     3    559.7253           Prob > F      =  0.0000
    Residual |   863.31653   161  5.36221447           R-squared     =  0.6604
-------------+------------------------------           Adj R-squared =  0.6541
       Total |  2542.49394   164  15.5030118           Root MSE      =  2.3156

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   momheight |   .2996154   .0687614     4.36   0.000     .1638247     .435406
   dadheight |    .412135   .0510733     8.07   0.000      .311275     .512995
        male |   5.298218   .3637717    14.56   0.000     4.579839    6.016597
       _cons |   16.96746   4.658309     3.64   0.000     7.768196    26.16673
------------------------------------------------------------------------------

The regression equation (rounding coefficients to two decimal places) is:

Predicted height = 16.97 + 0.30 (momheight) + 0.41 (dadheight) + 5.30 (male)

The coefficient for the variable male has a specific interpretation: for a fixed combination of momheight and dadheight, males will on average be about 5.30 inches taller than females with that same combination of momheight and dadheight.

The coefficient of about 0.30 for momheight tells us that, for a given dadheight and sex, the predicted student's height increases by about 0.30 inches for every 1-inch increase in momheight. For example, among male students whose dads are some fixed height, those whose moms are 65 inches tall are predicted to be, on average, about 0.30 inches taller than those whose moms are 64 inches tall.

For each of the coefficients, a test of H0: β = 0 versus Ha: β ≠ 0 has a p-value of 0.000. (See the column headed P>|t|.) These are conditional hypotheses: each one tests whether an explanatory variable needs to be in the model, given that the others are already there. In this example, the tests tell us that all 3 explanatory variables are useful in the model, even after the others are already in it. In other words, even with (for example) mom's height and student's sex in the model, dad's height still adds a substantial contribution to explaining student's height.

R² = 66.04%, which is pretty good. Later we will learn about adjusted R², which can be more useful in multiple regression, especially when comparing models with different numbers of X variables.

Root MSE = s = our estimate of σ = 2.32 inches, indicating that within every combination of momheight, dadheight, and sex, the standard deviation of heights is about 2.32 inches. In other words, that is the estimated standard deviation for the population of all male students whose parents have one particular pair of heights (say, momheight = 64 inches and some fixed dadheight), for the population of all female students with those same parents' heights, and so on, for every combination of the three explanatory variables.

The F(3, 161) = 104.38 is F* for testing the full model against the reduced model E{Yi} = β0. In other words, it simultaneously tests H0: β1 = β2 = β3 = 0 versus Ha: at least one of them is not 0. The p-value is reported as 0.0000, so we clearly reject the null hypothesis and conclude that at least one of the explanatory variables is useful.

The residuals-versus-fitted plot and the normal probability plot both show the outlier (height = 57 inches) but otherwise look the way we would hope.
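To see how the pieces of this output fit together, the following sketch can be run immediately after the regress command above. The first command reproduces the overall F statistic; the second computes the estimated mean height, with a standard error and 95% confidence interval, for one illustrative combination of predictors. The parent heights of 64 and 70 inches are assumed values chosen only for illustration, not taken from the handout:

. * Joint test that all three slopes are zero; reproduces F(3, 161) = 104.38:
. test momheight dadheight male
. * Estimated mean height for a male student whose mother is 64 in. and
. * father is 70 in. tall (illustrative values), with SE and 95% CI:
. lincom _cons + 64*momheight + 70*dadheight + male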

[Plots G1-G2: histograms of student height in separate Female and Male panels, and a histogram of mothers' heights; density versus height in inches.]

[Plots G3-G4: histogram of fathers' heights, and the scatterplot matrix of height, momheight, and dadheight, in separate Female and Male panels.]

[Plot G5: residuals versus fitted values.]

Mom height outlier removed:

[Residuals versus fitted values, with the momheight outlier removed.]

The remaining outlier is a male 57 inches tall whose parents' heights are 61 and 66 inches (mom and dad, respectively). This could be a legitimate point, so it is not okay to remove it.

The normal probability plot shows the outlier too, but otherwise looks good:

[Plot G6: normal probability plot, residuals versus inverse normal.]
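If you prefer to flag the outlier numerically rather than only on the plots, one possibility (a sketch, run after the regression above; rstd is a hypothetical variable name) is to look at standardized residuals:

. * Standardized residuals make unusual cases easier to flag:
. predict rstd, rstandard
. * List any case whose standardized residual exceeds 3 in absolute value.
. * The !missing() guard matters because missing values compare as large in Stata:
. list height momheight dadheight male rstd if abs(rstd) > 3 & !missing(rstd)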