Correlation and Simple Linear Regression

We are often interested in studying the relationship between variables to determine whether they are associated with one another. When we think that changes in a variable x explain, or perhaps even cause, changes in a second variable y, we call x an explanatory variable and y a response variable.

If a scatter plot of the two variables looks roughly like a straight line, there may be a linear relationship between them. The relationship is strong if the data points lie close to a straight line and weak if the points are widely scattered about the line. The covariance and correlation are measures of the strength and direction of a linear relationship between two quantitative variables. A regression line is a mathematical model for describing a linear relationship between an explanatory variable, x, and a response variable, y. It can be used to predict the value of y for a given value of x. Covariances, correlations and regression lines can all be computed using R.

A. Covariance and correlation

We can compute the covariance and correlation in R using the cov() and cor() functions.

Ex. A pediatrician wants to study the relationship between a child's height and head circumference (both measured in inches). She selects an SRS of 11 three-year-old children and obtains the following data. (See lecture notes for data.)

Begin by reading in the data:

> Height = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
> Circ = c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)
> Dat = data.frame(Height, Circ)
> attach(Dat)
> Dat
   Height Circ
1   27.75 17.5
2   24.50 17.1
3   25.50 17.1
4   26.00 17.3
5   25.00 16.9
6   27.75 17.6
7   26.50 17.3
8   27.00 17.5
9   26.75 17.3
10  26.75 17.5
11  27.50 17.5
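Before turning to the built-in functions, it may help to see the formulas they implement. As a quick check, the sample covariance, sum((xi - xbar)*(yi - ybar))/(n - 1), and the sample correlation, cov(x, y)/(sx*sy), can be computed directly:

> n = length(Height)
> cxy = sum((Height - mean(Height)) * (Circ - mean(Circ))) / (n - 1)
> cxy                               # sample covariance
[1] 0.2188636
> cxy / (sd(Height) * sd(Circ))     # sample correlation
[1] 0.9110727

These agree with the output of cov() and cor() below.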

To make a scatter plot of circumference against height, type:

> plot(Dat$Height, Dat$Circ)

[Figure: scatter plot of Dat$Circ (16.9 to 17.6) against Dat$Height (24.5 to 27.5)]

Studying the plot, there appears to be a linear relationship between the two variables. This relationship can be quantified by computing the covariance and correlation between the variables.

> cov(Dat)   # Covariance matrix
          Height       Circ
Height 1.1977273 0.21886364
Circ   0.2188636 0.04818182

From the output we see that the variances of Height and Circ are 1.198 and 0.048, respectively. The covariance between the two variables is 0.219, indicating a positive relationship.

> cor(Dat)   # Correlation matrix
          Height      Circ
Height 1.0000000 0.9110727
Circ   0.9110727 1.0000000

From the output we see that the correlation between Height and Circ is 0.911. Hence, the positive linear relationship between the variables is quite strong.
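Note that a correlation matrix is just a covariance matrix rescaled by the standard deviations; base R's cov2cor() performs this conversion directly, which provides a quick consistency check:

> cov2cor(cov(Dat))   # should reproduce cor(Dat)
          Height      Circ
Height 1.0000000 0.9110727
Circ   0.9110727 1.0000000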

B. Simple Linear Regression

If there exists a strong linear relationship between two variables, it is often of interest to model the relationship using a regression line. The main function for performing regression in R is lm(). It has many options that we will explore throughout the semester. To perform simple linear regression we can use the command:

lm(response ~ explanatory)

Here the terms response and explanatory should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.

Ex. Fit a regression line that describes the relationship between Height and Circumference.

> results = lm(Circ ~ Height)
> results

Call:
lm(formula = Circ ~ Height)

Coefficients:
(Intercept)       Height
    12.4932       0.1827

The results indicate that the least squares regression line takes the form yhat = 12.493 + 0.183x. Hence the model states that a one-inch increase in height is associated with a 0.183-inch increase in predicted head circumference.

To superimpose the regression line over the data, first make a scatter plot of Circ against Height, and then overlay the regression line using the command abline(results). Here results contains all relevant information about the regression line.

> plot(Height, Circ)
> abline(results)
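The fitted coefficients can also be extracted programmatically with coef(), which makes it easy to compute a predicted value by hand. A minimal sketch; the value for a 25-inch-tall child matches the predict() output shown later:

> b = coef(results)           # named vector: (Intercept), Height
> unname(b[1] + b[2] * 25)    # predicted circumference at Height = 25
[1] 17.06148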

[Figure: scatter plot of Circ against Height with the fitted regression line overlaid]

The next step in our analysis is to verify the model assumptions needed for the simple linear regression model: the residuals should be normally distributed with equal variance for every value of x. We can check these assumptions by making appropriate plots of the residuals. Note that after fitting the model, the residuals are saved in the variable results$res.

We begin by plotting the residuals against the explanatory variable. The residuals should be randomly scattered about 0, and their vertical spread should be roughly constant across the plot for the constant variance assumption to hold.

> plot(Height, results$res)
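A horizontal reference line at zero makes the residual plot easier to read, and the built-in plot method for lm objects produces a similar residuals-versus-fitted diagnostic. A brief sketch:

> abline(h = 0, lty = 2)      # dashed reference line on the residual plot
> plot(results, which = 1)    # built-in residuals vs. fitted values plot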

The residual plot shows no apparent pattern, indicating no clear violations of the model assumptions. To check the normality assumption, make a QQ-plot using the commands:

> qqnorm(results$res)
> qqline(results$res)

This makes a QQ-plot of the residuals and overlays a straight line for comparison purposes. To make a histogram of the residuals, type:

> hist(results$res)
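If a formal check is preferred to the graphical ones, the Shapiro-Wilk test in base R can be applied to the residuals; a large p-value is consistent with the normality assumption:

> shapiro.test(results$res)   # H0: the residuals are normally distributed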

After verifying the assumptions, the next step is to perform inference. We want to construct tests and confidence intervals for the slope and intercept, confidence intervals for the mean response, and prediction intervals for future observations.

To test whether the slope is significantly different from 0 we can use the function summary(results).

> summary(results)

Call:
lm(formula = Circ ~ Height)

Residuals:
     Min       1Q   Median       3Q      Max
-0.16148 -0.05842 -0.01831  0.06442  0.12989

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.49317    0.72968   17.12 3.56e-08 ***
Height       0.18273    0.02756    6.63 9.59e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09538 on 9 degrees of freedom
Multiple R-squared: 0.8301, Adjusted R-squared: 0.8112
F-statistic: 43.96 on 1 and 9 DF, p-value: 9.59e-05

From the output we see that we can clearly reject the null hypothesis of no linear relationship between height and head circumference (p = 9.59e-05). The "Residual standard error" gives us an estimate of the standard deviation around the regression line, s, which here equals 0.09538.

To construct 95% confidence intervals for the regression coefficients, in particular the slope β1, we can use the command confint(results).

> confint(results)
                 2.5 %     97.5 %
(Intercept) 10.8425070 14.1438307
Height       0.1203848  0.2450801

Finally, we may want to use our regression equation to predict future values of the response variable. The predicted value of head circumference for a child of a given height has two interpretations: it can represent either the mean circumference for all children whose height is x, or the predicted circumference for a randomly selected child whose height is x. The predicted value is the same in both cases, but the standard error is larger in the second case due to the additional variation of individual responses about the mean.
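The slope interval from confint() can be reproduced from the summary output using the usual formula, estimate ± t* × SE, where t* is the 0.975 quantile of the t distribution on n - 2 = 9 degrees of freedom:

> est = coef(summary(results))["Height", ]   # estimate, SE, t value, p value
> est["Estimate"] + c(-1, 1) * qt(0.975, 9) * est["Std. Error"]
[1] 0.1203848 0.2450801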

The function predict() can be used to construct both types of intervals. To make a confidence interval for the mean response use the option interval="confidence". To make a prediction interval use the option interval="prediction".

Ex. Obtain a 95% confidence interval for the mean head circumference of children who are 25 inches tall.

> predict(results, data.frame(Height = 25), interval = "confidence")
       fit      lwr      upr
[1,] 17.06148 16.94987 17.17309

The confidence interval is (16.95, 17.17).

Ex. Obtain a 95% prediction interval for a randomly selected child who is 25 inches tall.

> predict(results, data.frame(Height = 25), interval = "prediction")
       fit      lwr      upr
[1,] 17.06148 16.81855 17.30441

The prediction interval is (16.82, 17.30). Note that it is wider than the confidence interval, as expected.
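The prediction interval can likewise be reproduced from first principles. The sketch below assumes the standard formula for the standard error of a new observation, s * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx), and should agree with predict() up to rounding:

> s   = summary(results)$sigma                # residual standard error, 0.09538
> n   = length(Height); x0 = 25
> Sxx = sum((Height - mean(Height))^2)
> fit = sum(coef(results) * c(1, x0))         # point prediction at x0
> se  = s * sqrt(1 + 1/n + (x0 - mean(Height))^2 / Sxx)
> fit + c(-1, 1) * qt(0.975, n - 2) * se
[1] 16.81855 17.30441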