Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario




I have found that the best way to practice regression is by brute force. That is, given nothing but a dataset and your mind, compute everything there is to compute about a regression model! So let's pretend that is our task.

Part 1 - Simple Linear Regression

The simple linear regression model is, well, simple(r):

    y = β0 + β1 x

where y is the response variable and is known if the data is given to us. The above is the true, unobserved model. Some oracle out there knows what the βs are, but humans are not as wise. The best we can do is estimate the values of β0 and β1 using the estimators b0 and b1! The fit (or estimated) model is

    ŷ = b0 + b1 x

where ŷ is the predicted value of y based on our fitted model. Note that we can predict a y for any value of x that is in the domain of x. That is, if x goes from 0 to 100, then we can predict y for values of x from 0 to 100.

Suppose we have a table of (x, y) data. We need to fit a model, so we need to find b0 and b1. For simple linear regression,

    b1 = r (s_y / s_x),    b0 = ȳ − b1 x̄

where r is the correlation coefficient between x and y, s_x and s_y are the standard deviations of x and y respectively, and x̄ and ȳ are the means of x and y respectively.

Step 1: Find x̄, ȳ, s_x, s_y.

    x̄ = Σ x_i / n,    ȳ = Σ y_i / n
    s_x = sqrt( Σ (x_i − x̄)² / (n − 1) ),    s_y = sqrt( Σ (y_i − ȳ)² / (n − 1) )
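Step 1 can be checked by brute force in code as well. This is a minimal sketch using a made-up five-point dataset (the numbers are illustrative, not from the cheatsheet):

```python
import math

# Hypothetical dataset used only to illustrate Step 1.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n                                              # x̄ = Σx_i / n
y_bar = sum(y) / n                                              # ȳ = Σy_i / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))   # sample SD of x
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))   # sample SD of y

print(x_bar, y_bar, round(s_x, 4), round(s_y, 4))
```

Note the n − 1 in the denominators: these are the sample standard deviations, matching the formulas above.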

Step 2: Calculate r, the correlation coefficient.

    r = Σ (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y) = (1 / (n − 1)) Σ z_x z_y

Step 3: Calculate the regression coefficient and intercept. Given r,

    b1 = r (s_y / s_x) = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,    b0 = ȳ − b1 x̄

Step 4: Write out the linear model using these estimates:

    ŷ = b0 + b1 x

Because simple linear regression has only one regression coefficient, there is no F-test involved. Think about why this is true by looking at the null/alternate hypotheses for the F test as well as the null/alternate hypotheses for the t test. In this boring case where there is one predictor, F = t².

Step 5: Compute R², which is just, well, r raised to the power of 2!

Step 6: Determine whether or not the model and/or regression coefficient is statistically significant. Using a confidence interval:

    b1 ± t* SE(b1)

Since H0: β1 = 0, we check to see if 0 is in the confidence interval. If it is, fail to reject H0: x has no effect on y. Otherwise, reject H0: x does have an effect on y. Or, using a hypothesis test,

    t = (b1 − β1) / SE(b1) = (b1 − 0) / SE(b1)

then find the critical t from the t-table using n − 2 degrees of freedom. If the t-values or the p-values are given, USE THEM!
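Steps 2 and 3 can be sketched the same way. The dataset below is hypothetical; the point is that the two formulas for the slope, b1 = r (s_y / s_x) and b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², give the same answer:

```python
import math

# Hypothetical dataset used only to walk through Steps 2 and 3.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # Σ(x−x̄)(y−ȳ)
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # Σ(x−x̄)²
syy = sum((yi - y_bar) ** 2 for yi in y)                        # Σ(y−ȳ)²

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
b1 = sxy / sxx                   # slope; equals r * s_y / s_x
b0 = y_bar - b1 * x_bar          # intercept: b0 = ȳ − b1 x̄

print(round(r, 4), round(b1, 4), round(b0, 4))
```

Here R² = r², so a quick r² also gives the proportion of variance explained (Step 5).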

Step 7: Compute the standard error of the predicted values. If you are asked about prediction error, compute s:

    s = sqrt( Σ (y_i − ŷ_i)² / (n − 2) )

where y_i − ŷ_i is the residual, which is the error in predicting a particular value of y from a value of x.

Part 2 - Multiple Linear Regression

Simple linear regression was a special case of multiple regression, which is unfortunate. The principle is still the same, but how we account for variance is more tedious. The regression model is

    y = β0 + β1 x1 + β2 x2 + ... + βk xk

where y is the response variable and is known if the data is given to us, but we only know y for the observations in the dataset! The predictors x1, x2, ..., xk are also known because they are given to us with the data. The above is the true, unobserved model. Some oracle out there knows what the βs are, but humans are not as wise. The best we can do is estimate the values of the βi using the estimators bi! The fit (or estimated) model is

    ŷ = b0 + b1 x1 + b2 x2 + ... + bk xk

Suppose we are given a table of data with predictors x1, x2, ..., xk and response y. Let's compute a multiple regression entirely by hand.

IMPORTANT NOTE: You will NOT be asked to compute the regression coefficients for a multiple linear regression; it requires linear algebra. To compute by hand, you must be given either the x_i's or the ŷ_i's. Using these x_i's and a model given to you, you can then compute the predicted value of y, ŷ, if ŷ is not given to you.

Step 1: Compute the residuals:

    e_i = y_i − ŷ_i
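A sketch of the residual and prediction-error computations, assuming a hypothetical fitted model ŷ = 2.2 + 0.6x on a made-up dataset (neither the coefficients nor the data are from the cheatsheet):

```python
import math

# Hypothetical data and an assumed fitted model ŷ = 2.2 + 0.6x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

y_hat = [2.2 + 0.6 * xi for xi in x]           # predicted values ŷ_i
e = [yi - yhi for yi, yhi in zip(y, y_hat)]    # residuals e_i = y_i − ŷ_i

# Standard error of the predicted values: s = sqrt(Σe_i² / (n − 2))
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
print(round(s, 4))
```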

Step 2: Compute the sums of squares. The total variance in the response variable can be quantified using the sums of squares. The total sum of squares, SSTO, quantifies all variance in the response variable. It is a sum of two specific sources of variance: the variance that is explained by the regression model (good) and the variance that is not explained by the regression model (bad).

    SSTO = Σ (y_i − ȳ)²

The sum of squares regression, SSREG, is the amount of variance that is explained by the regression model. It can also be called the sum of squares explained.

    SSREG = Σ (ŷ_i − ȳ)²

The sum of squares residual, SSRES (or the sum of squares error), is the amount of variance that is not explained by the regression model. That is, it is the amount of variance that is described by factors other than the predictors in the model. Note that y_i − ŷ_i is the residual.

    SSRES = Σ (y_i − ŷ_i)² = Σ e_i²

Step 3: Compute the mean squares. In Step 2 we computed the sums of squares. Now we want to essentially average them by dividing by something. This something is the degrees of freedom. This yields the average squared error:

    MSREG = SSREG / k,    MSE = SSRES / (n − k − 1)

where k is the number of predictors in the model and n is the sample size.

Step 4: Compute the F value to perform the F test. F is sometimes called the signal-to-noise ratio. As in an audio system, noise is bad; it is erroneous sound. Signal is what our brains actually pay attention to; it is the part of the variance that our brains can explain! Thus, using this analogy,

    F = Signal / Noise = MSREG / MSE = (SSREG / k) / (SSRES / (n − k − 1)) = (R² / k) / ((1 − R²) / (n − k − 1))

Sometimes k is called the degrees of freedom in the numerator and n − k − 1 the degrees of freedom in the denominator. This is the terminology used by the F table.
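The decomposition SSTO = SSREG + SSRES and the F statistic can be checked numerically. The data and fitted values below are hypothetical, with k = 1 predictor:

```python
# Hypothetical response and fitted values (k = 1 predictor).
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
n, k = len(y), 1
y_bar = sum(y) / n

ssto = sum((yi - y_bar) ** 2 for yi in y)                   # total
ssreg = sum((yhi - y_bar) ** 2 for yhi in y_hat)            # explained
ssres = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))   # unexplained

ms_reg = ssreg / k          # mean square for regression
mse = ssres / (n - k - 1)   # mean squared error
F = ms_reg / mse            # signal-to-noise ratio

print(round(ssto, 4), round(ssreg, 4), round(ssres, 4), round(F, 4))
```

The first printed value equals the sum of the next two, which is exactly the partition of variance described above.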

Look up the degrees of freedom in the F table and find the critical F value, F*. If your calculated value of F > F*, we reject the null hypothesis of the F test:

    H0: β1 = β2 = ... = βk = 0
    Ha: βi ≠ 0 for some i

If we reject H0, this means that at least one of the βi's is not 0; that is, at least one of the explanatory variables helps to predict y. If we fail to reject H0, all βi's are 0, and none of the predictors have any effect on the response variable. If we fail to reject, we stop. We cannot continue.

Step 5: Compute R². Remember, R² is the proportion of the variance that is explained by the regression model. Recall that SSREG is the amount of variance that can be explained by the regression model. Thus,

    R² = SSREG / SSTO

On the other hand,

    1 − R² = 1 − SSREG / SSTO = SSRES / SSTO

which is the proportion of variance in y that is NOT explained by the regression model. That is, it is the proportion of the variance that is explained by factors other than the predictors in the model.

Step 6: Which regression coefficient(s) are significant? Which variables do help predict y? Perform a t-test for every coefficient, just as we did with simple linear regression. You need the standard error of each coefficient, SE(b_i), to use the hypothesis test or confidence interval. IF THE t-value OR p-value IS GIVEN TO YOU, USE THEM!

Step 7: Compute the standard error of the estimate. The standard error of the estimate (if needed) gives an overall value of how accurate the prediction is:

    s = sqrt( Σ (y_i − ŷ_i)² / (n − 2) ) = s_y sqrt( (1 − R²) (n − 1) / (n − 2) )
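A sketch tying Steps 5 and 7 together on hypothetical data: R² computed from the sums of squares, and a check that the two forms of the standard error of the estimate agree.

```python
import math

# Hypothetical response and fitted values, not from the cheatsheet.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
n = len(y)
y_bar = sum(y) / n

ssto = sum((yi - y_bar) ** 2 for yi in y)
ssres = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
r2 = 1 - ssres / ssto                  # R² = SSREG / SSTO = 1 − SSRES / SSTO

s = math.sqrt(ssres / (n - 2))         # standard error of the estimate
s_y = math.sqrt(ssto / (n - 1))        # sample SD of y
s_alt = s_y * math.sqrt((1 - r2) * (n - 1) / (n - 2))   # alternative form

print(round(r2, 4), round(s, 4), round(s_alt, 4))
```

The last two printed values match, confirming the identity in Step 7 on this dataset.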