Soci708 Statistics for Sociologists


Module 11: Multiple Regression¹
François Nielsen
University of North Carolina, Chapel Hill
Fall 2009

¹ Adapted from slides for the course Quantitative Methods in Sociology (Sociology 6Z3) taught at McMaster University by Robert Andersen (now at University of Toronto)

Goals of This Module

- Review of least-squares regression analysis
  - Simple and multiple regression
  - Slope, intercept, and R²
  - Standard error of the regression
- Inference for regression
  - Confidence intervals and hypothesis tests for the slope
  - F-test for the entire regression model
- Assumptions of regression and how to check them

Inference in Multiple Regression (1)

Recall that least-squares regression fits the equation

    Y = A + B_1 X_1 + B_2 X_2 + … + B_k X_k

finding the values of A, B_1, B_2, …, B_k that minimize the sum of the squared residuals:

    Σ residual² = Σ E_i² = Σ (y − ŷ)²

The slopes are now partial slopes (versus marginal slopes in simple regression). The slope coefficient B_1 represents the average change in y associated with a one-unit increase in x_1, holding the other x's constant.

Inference in Multiple Regression (1)

- At this point, explain again the meaning of the B_k's in the context of an example
- Remember standardized coefficients
- Explain R² and how it relates to the correlation between Ŷ and Y

Inference in Multiple Regression (2)

The standard deviation of y around the regression surface (the standard error of the regression) is again simply estimated:

    S_E = √( Σ E_i² / (n − k − 1) )

There are n − k − 1 degrees of freedom, where n is the sample size and k is the number of slopes to estimate.

As in the simple regression case, S_E is used to find the standard errors for each slope, and confidence intervals and hypothesis tests are carried out in exactly the same way. We will not go into the details here because the formulas are complicated; a computer program will calculate the SEs for us. It is important to note, however, that we now use the t-distribution with n − k − 1 df.

Multiple Regression in R

> # in R
> ed <- c(12,13,12,14,12,15,12)
> age <- c(45,35,27,60,23,30,49)
> income <- c(20000,22000,23000,25000,18000,30000,26000)
> reg.model <- lm(income ~ ed + age)
> summary(reg.model)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -8374.74   14195.13  -0.590   0.5869
ed           2354.38    1103.60   2.133   0.0998 .
age            39.88     100.33   0.398   0.7113
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3248 on 4 degrees of freedom
Multiple R-squared: 0.5591,  Adjusted R-squared: 0.3386
F-statistic: 2.536 on 2 and 4 DF,  p-value: 0.1944
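The numbers in the R output above can be reproduced from first principles. The following sketch (plain Python, no statistics library; it simply mirrors the R example on this slide) builds the normal equations (X′X)b = X′y, solves them by Gaussian elimination, and recomputes the coefficients and the residual standard error S_E = √( Σ E_i² / (n − k − 1) ).

```python
# Reproduce lm(income ~ ed + age) by solving the least-squares
# normal equations (X'X) b = X'y directly.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):                     # back substitution
        x[i] = (M[i][m] - sum(M[i][j] * x[j] for j in range(i + 1, m))) / M[i][i]
    return x

ed     = [12, 13, 12, 14, 12, 15, 12]
age    = [45, 35, 27, 60, 23, 30, 49]
income = [20000, 22000, 23000, 25000, 18000, 30000, 26000]

# Design matrix with an intercept column, then X'X and X'y
X = [[1.0, e, a] for e, a in zip(ed, age)]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * y for row, y in zip(X, income)) for i in range(3)]
A, B_ed, B_age = solve(XtX, Xty)

# Residual standard error: S_E = sqrt( sum(E_i^2) / (n - k - 1) )
n, k = len(income), 2
resid = [y - (A + B_ed * e + B_age * a) for y, e, a in zip(income, ed, age)]
SSE = sum(r * r for r in resid)
S_E = (SSE / (n - k - 1)) ** 0.5

# Compare with the R output above: -8374.74, 2354.38, 39.88, 3248
print(A, B_ed, B_age, S_E)
```

In practice one would of course let `lm()` (or any regression routine) do this work; the point of the sketch is only that the slope estimates and S_E follow mechanically from the formulas on the preceding slides.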

Analysis of Variance and F Test (1)

Recall that for the (simple or multiple) regression model

    (y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i)

It is an algebraic fact (that can be demonstrated) that the equality still holds after one squares the deviations and sums them:

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²

This can be written

    SST = SSM + SSE

where

    SST = Σ (y_i − ȳ)²    (Sum of Squares Total)
    SSM = Σ (ŷ_i − ȳ)²    (Sum of Squares Model)
    SSE = Σ (y_i − ŷ_i)²  (Sum of Squares Error)

Analysis of Variance and F Test (2)

Recall that to calculate s_y², the sample variance of y, we divide SST by its degrees of freedom (n − 1); likewise, to calculate s², the error variance, we divide SSE by its degrees of freedom (n − k − 1).

Each sum of squares has its degrees of freedom DF, and these add up to the total degrees of freedom DFT = n − 1:

    DFT = DFM + DFE
    (n − 1) = k + (n − k − 1)

For each source of variation the mean square MS is calculated as

    MS = (sum of squares) / (degrees of freedom)

Analysis of Variance and F Test (3)

The three mean squares are thus:

    s_y² = MST = SST/DFT = Σ (y_i − ȳ)² / (n − 1)
    MSM = SSM/DFM = Σ (ŷ_i − ȳ)² / k
    s² = MSE = SSE/DFE = Σ (y_i − ŷ_i)² / (n − k − 1)

The null hypothesis that y is not linearly related to the x variables may be tested by comparing MSM and MSE using the F statistic

    F = MSM / MSE

Under the null hypothesis

    H_0: β_1 = β_2 = … = β_k = 0

F is distributed as an F(k, n − k − 1) distribution.

Analysis of Variance and F Test (4)

A small P-value favors the alternative hypothesis²

    H_a: at least one of the β_j is not 0

² From Moore & McCabe (2006, p. 689)

Analysis of Variance and F Test (5)

The Analysis of Variance (ANOVA) Table

The ANOVA table summarizes sums of squares, mean squares, and the F test:

    Source   Degrees of freedom   Sum of squares          Mean square   F
    Model    k                    SSM = Σ (ŷ_i − ȳ)²      SSM/DFM       MSM/MSE
    Error    n − k − 1            SSE = Σ (y_i − ŷ_i)²    SSE/DFE
    Total    n − 1                SST = Σ (y_i − ȳ)²      SST/DFT

Analysis of Variance and F Test (6)

Stata output with ANOVA table for regression of depression score:

. * in Stata
. reg total l10inc age female cath jewi none

      Source |       SS       df       MS              Number of obs =     256
-------------+------------------------------          F(  6,   249) =    6.35
       Model |  2686.95106     6  447.825177          Prob > F      =  0.0000
    Residual |  17554.7989   249  70.5012006          R-squared     =  0.1327
-------------+------------------------------          Adj R-squared =  0.1118
       Total |    20241.75   255  79.3794118          Root MSE      =  8.3965

------------------------------------------------------------------------------
       total |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      l10inc |  -6.946661    1.72227    -4.03   0.000    -10.33874   -3.554586
         age |   -.074917   .0334132    -2.24   0.026    -.1407254   -.0091085
      female |   2.559048   1.095151     2.34   0.020      .402108    4.715987
        cath |   .7527141   1.440845     0.52   0.602    -2.085084    3.590512
        jewi |   4.674959   1.917259     2.44   0.015     .8988472    8.451071
        none |   3.264667   1.400731     2.33   0.021     .5058747    6.023459
       _cons |   18.24958   2.971443     6.14   0.000     12.39721    24.10195
------------------------------------------------------------------------------

Analysis of Variance and F Test (7)

Stata output with ANOVA table for regression of depression score:

. * in Stata
. reg total l10inc age female cath jewi none

      Source |       SS       df       MS
-------------+------------------------------
       Model |  2686.95106     6  447.825177
    Residual |  17554.7989   249  70.5012006
-------------+------------------------------
       Total |    20241.75   255  79.3794118

Number of obs =    256
F(6, 249)     =   6.35
Prob > F      = 0.0000
R-squared     = 0.1327
Adj R-squared = 0.1118
Root MSE      = 8.3965

Reading the table:
- Model = SSM = Σ (ŷ_i − ȳ)²
- Residual = SSE = Σ (y_i − ŷ_i)²
- Total = SST = Σ (y_i − ȳ)²
- MS = SS/df … What is SST/DFT = 79.379?
- F(6, 249) = MSM/MSE
- R-squared = SSM/SST
- Root MSE = √MSE
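These identities are easy to verify numerically. A minimal Python check, with the figures copied from the Stata output above:

```python
# Verify the ANOVA identities using the sums of squares and df
# printed in the Stata output above.
SSM, dfm = 2686.95106, 6      # Model sum of squares and df (= k)
SSE, dfe = 17554.7989, 249    # Residual sum of squares and df (= n - k - 1)
SST, dft = 20241.75, 255      # Total sum of squares and df (= n - 1)

# SST = SSM + SSE and DFT = DFM + DFE
MSM = SSM / dfm               # 447.825177
MSE = SSE / dfe               # 70.5012006
MST = SST / dft               # 79.3794118 = s_y^2, the sample variance of y

F = MSM / MSE                 # F(6, 249) = 6.35
R2 = SSM / SST                # R-squared = 0.1327
adj_R2 = 1 - MSE / MST        # Adj R-squared = 0.1118
root_MSE = MSE ** 0.5         # Root MSE = 8.3965

print(round(F, 2), round(R2, 4), round(adj_R2, 4), round(root_MSE, 4))
```

Note in particular that the mysterious SST/DFT = 79.379 in the output is just s_y², and that adjusted R² is 1 minus the ratio of the error variance to the total variance.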

Analysis of Variance and F Test (8)

Exercise: Can you recover the redacted figures?

. * in Stata
. reg total l10inc age female cath jewi none

      Source |       SS       df       MS              Number of obs =     256
-------------+------------------------------          F(  6,   249) =    xxxx
       Model |  2686.95106   xxx  447.825177          Prob > F      =  0.0000
    Residual |  17554.7989   xxx  70.5012006          R-squared     =  xxxxxx
-------------+------------------------------          Adj R-squared =  0.1118
       Total |    20241.75   255  79.3794118          Root MSE      =  8.3965

------------------------------------------------------------------------------
       total |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      l10inc |  -6.946661    1.72227    xxxxx   xxxxx    -10.33874   -3.554586
         age |   -.074917   .0334132    xxxxx   xxxxx    -.1407254   -.0091085
      female |   2.559048   1.095151     2.34   0.020      .402108    4.715987
        cath |   .7527141   1.440845     0.52   0.602    -2.085084    3.590512
        jewi |   4.674959   1.917259     2.44   0.015     .8988472    8.451071
        none |   3.264667   1.400731     2.33   0.021     .5058747    6.023459
       _cons |   18.24958   2.971443     6.14   0.000     12.39721    24.10195
------------------------------------------------------------------------------
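One way to check your answers is sketched below in Python. The t ratios follow from t = Coef./Std. Err.; for the P-values the sketch substitutes a standard-normal approximation for Stata's t distribution (with 249 df the two are very close, so the approximate P-values may differ from Stata's in the third decimal).

```python
from math import erf, sqrt

# Redacted df: DFM = k = 6 predictors, DFE = n - k - 1 = 256 - 6 - 1 = 249
n, k = 256, 6
dfm, dfe = k, n - k - 1

# Redacted F and R-squared, from the sums of squares still shown
SSM, SSE, SST = 2686.95106, 17554.7989, 20241.75
F = (SSM / dfm) / (SSE / dfe)     # MSM/MSE = 6.35
R2 = SSM / SST                    # 0.1327

# Redacted t ratios: t = Coef. / Std. Err.
t_l10inc = -6.946661 / 1.72227    # -4.03
t_age = -0.074917 / 0.0334132     # -2.24

def p_two_sided(t):
    """Two-sided P-value via the standard normal (approximates t with 249 df)."""
    return 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))

print(round(F, 2), round(R2, 4), round(t_l10inc, 2), round(t_age, 2))
print(round(p_two_sided(t_l10inc), 3), round(p_two_sided(t_age), 3))
```

Comparing with the unredacted output on the previous slide confirms the recovered df, F, R², and t values.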

Extended Regression Analysis Example

Four regression models of CESD depression score (transformed) in Stata:

. * in Stata
. use "D:\soci708\data\survey2b.dta", clear
. * CESD depression scale score is called total
. * check distribution of total and transform (next 6 lines)
. histogram total, kdensity
. qnorm total
. ladder total
. generate sqrtdep=sqrt(total)
. histogram sqrtdep, kdensity
. qnorm sqrtdep
. * estimate 4 models and standardized coef. (next 5 lines)
. reg sqrtdep age female
. reg sqrtdep age female l10inc educatn
. reg sqrtdep age female l10inc educatn cath jewi none
. reg sqrtdep age female l10inc educatn cath jewi none drinks
. reg sqrtdep age female l10inc educatn cath jewi none drinks, beta
. * get descriptive stats and correlations for presentation
. su sqrtdep age female l10inc educatn cath jewi none drinks
. cor sqrtdep age female l10inc educatn cath jewi none drinks

Checking the Assumptions (1)

- Although regression can be an effective method to summarize the relationship between quantitative variables, some assumptions must be met
- As with the other methods of inference we have discussed, these assumptions pertain to the population we want to make inferences about
- Of course, we do not have data on the whole population, so we cannot assess the assumptions directly
- We can, however, check whether the assumptions appear reasonable for the sample

Checking the Assumptions (2)

The assumptions for linear regression are:

1. Linearity: Is there a linear relationship between y and x? In simple regression we can assess this assumption by looking at a scatterplot.
2. Constant spread: Is the spread of the y-values approximately the same regardless of the value of x? If the spread of y changes with x, we have a problem. A scatterplot of y against x, or of the residuals against x, allows us to assess this.
3. Normality: Are the residuals normally distributed (is there skew, or are there outliers)? If they are not normally distributed, we have a problem. We can check this assumption using a histogram or stemplot of the residuals.

Assessing Linearity

> # in R
> library(car)
> data(Prestige)
> attach(Prestige)
> par(pty="s")  # to make square plot
> plot(education, prestige)
> abline(lm(prestige ~ education), col="red", lwd=3)

[Scatterplot of prestige against education, with the fitted least-squares line]

Assessing Spread

> # in R
> library(car)
> data(Prestige)
> attach(Prestige)
> par(pty="s")  # to make square plot
> mod1 <- lm(prestige ~ education)
> plot(education, mod1$residuals, xlab="education", ylab="residuals")
> abline(h=0)

[Plot of the residuals against education, with a horizontal line at 0]

Assessing Normality

> # in R
> mod1 <- lm(prestige ~ education)
> hist(mod1$residuals, main="Histogram of residuals", xlab="Residual", col="red")

[Histogram of the residuals]

Choosing the Right Test