Quantifying association. Lecture 8: Correlation and Intro to Linear Regression. Correlation Coefficient I. Correlation analysis

Similar documents
Module 3: Correlation and Covariance

Chapter 7: Simple linear regression Learning Objectives

Elements of statistics (MATH0487-1)

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Correlation key concepts:

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Session 7 Bivariate Data and Analysis

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Scatter Plot, Correlation, and Regression on the TI-83/84

Section 3 Part 1. Relationships between two numerical variables

Exercise 1.12 (Pg )

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Correlation in Random Variables

Simple linear regression

Father s height (inches)

with functions, expressions and equations which follow in units 3 and 4.


Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

CORRELATION ANALYSIS

1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

The importance of graphing the data: Anscombe s regression examples

Pearson s Correlation

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Descriptive statistics; Correlation and regression

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b.

Simple Regression Theory II 2010 Samuel L. Baker

Introduction to Regression and Data Analysis

Pennsylvania System of School Assessment

CALCULATIONS & STATISTICS

MTH 140 Statistics Videos

Example: Boats and Manatees

2. Simple Linear Regression

Econometrics Simple Linear Regression

AP STATISTICS REVIEW (YMS Chapters 1-8)

PLOTTING DATA AND INTERPRETING GRAPHS

Sections 2.11 and 5.8

Descriptive Statistics

Linear Approximations ACADEMIC RESOURCE CENTER

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers

Linear Equations. Find the domain and the range of the following set. {(4,5), (7,8), (-1,3), (3,3), (2,-3)}

Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS.

Temperature Scales. The metric system that we are now using includes a unit that is specific for the representation of measured temperatures.

How To Run Statistical Tests in Excel

SPSS Guide: Regression Analysis

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared.

Module 5: Multiple Regression Analysis

Covariance and Correlation

Chapter 13 Introduction to Linear Regression and Correlation Analysis

II. DISTRIBUTIONS distribution normal distribution. standard scores

Linear functions Increasing Linear Functions. Decreasing Linear Functions

PS 271B: Quantitative Methods II. Lecture Notes

5 Systems of Equations

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

A full analysis example Multiple correlations Partial correlations

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Determine If An Equation Represents a Function

Regression and Correlation

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Univariate Regression

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Statistics E100 Fall 2013 Practice Midterm I - A Solutions

Georgia Standards of Excellence Curriculum Map. Mathematics. GSE 8 th Grade

CORRELATIONAL ANALYSIS: PEARSON S r Purpose of correlational analysis The purpose of performing a correlational analysis: To discover whether there

Multivariate Normal Distribution

17. SIMPLE LINEAR REGRESSION II

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables

WEB APPENDIX. Calculating Beta Coefficients. b Beta Rise Run Y X

Statistics 2014 Scoring Guidelines

Diagrams and Graphs of Statistical Data

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Mathematics Pre-Test Sample Questions A. { 11, 7} B. { 7,0,7} C. { 7, 7} D. { 11, 11}

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

For example, estimate the population of the United States as 3 times 10⁸ and the

Premaster Statistics Tutorial 4 Full solutions

Graphing Linear Equations

So, using the new notation, P X,Y (0,1) =.08 This is the value which the joint probability function for X and Y takes when X=0 and Y=1.

Correlation and Simple Linear Regression

Lecture 1: Review and Exploratory Data Analysis (EDA)

4. Multiple Regression in Practice

Describing Relationships between Two Variables

Name: Date: Use the following to answer questions 2-3:

Exploratory data analysis (Chapter 2) Fall 2011

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

GRADES 7, 8, AND 9 BIG IDEAS

x 2 + y 2 = 1 y 1 = x 2 + 2x y = x 2 + 2x + 1

Lecture 2. Summarizing the Sample

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Algebra 1 Course Title

Causal Infraction and Network Marketing - Trends in Data Science

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

Illustration (and the use of HLM)

Mathematics within the Psychology Curriculum

c. Construct a boxplot for the data. Write a one sentence interpretation of your graph.

Transcription:

Quantifying association Lecture 8: Correlation and Intro to Linear Regression Sandy Eckel seckel@jhsph.edu 5 May 2008 Goal: Express the strength of the relationship between two variables Metric depends on the nature of the variables For now, we ll focus on continuous variables (e.g. height, weight) Important! association does not imply causation To describe the relationship between two continuous variables, use: Correlation analysis Measures strength and direction of the linear relationship between two variables Regression analysis Concerns prediction or estimation of outcome variable, based on value of another variable (or variables) 1 / 40 2 / 40 Correlation analysis Correlation Coefficient I Plot the data (or have a computer to do so) Visually inspect the relationship between two continous variables Is there a linear relationship (correlation)? Are there outliers? Are the distributions skewed? Measures the strength and direction of the linear relationship between two variables X and Y Population correlation coefficient: ρ = cov(x, Y ) var(x) var(y ) = Sample correlation coefficient: (obtained by plugging in sample estimates) r = sample cov(x, Y ) s 2 x s 2 Y = E[(X µ X )(Y µ Y )] E[(X µx ) 2 ] E[(Y µ Y ) 2 ] n i=1 (X i X)(Y i Ȳ) n 1 n (X i X) 2 i=1 n 1 n (Y i Ȳ)2 i=1 n 1 3 / 40 4 / 40

Correlation Coefficient II Correlation Coefficient III The correlation coefficient, ρ, takes values between -1 and +1-1: Perfect negative linear relationship 0: No linear relationship +1: Perfect positive relationship Plot standardized Y versus standardized X Observe an ellipse (elongated circle) Correlation is the slope of the major axis 5 / 40 6 / 40 Correlation Notes Several levels of correlation Other names for r Pearson correlation coefficient Product moment of correlation Characteristics of r Measures *linear* association The value of r is independent of units used to measure the variables The value of r is sensitive to outliers r 2 tells us what proportion of variation in Y is explained by linear relationship with X 7 / 40 8 / 40

Examples of the Correlation Coefficient I Examples of the Correlation Coefficient II Perfect positive correlation, r 1 Perfect negative correlation, r -1 9 / 40 10 / 40 Examples of the Correlation Coefficient III Examples of the Correlation Coefficient IV Imperfect positive correlation, 0< r <1 Imperfect negative correlation, -1<r <0 11 / 40 12 / 40

Examples of the Correlation Coefficient V Examples of the Correlation Coefficient VI No relation, r 0 Some relation but little *linear* relationship, r 0 13 / 40 14 / 40 Association and Causality Sir Bradford Hill s Criteria for Causality In general, association between two variables means there is some form of relationship between them The relationship is not necessarily causal Association does not imply causation, no matter how much we would like it to Example: Hot days, ice cream, drowning Strength: magnitude of association Consistency of association: repeated observation of the association in different situations Specificity: uniqueness of the association Temporality: cause precedes effect Biologic gradient: dose-response relationship Biologic plausibility: known mechanisms Coherence: makes sense based on other known facts Experimental evidence: from designed (randomized) experiments Analogy: with other known associations 15 / 40 16 / 40

Simple Linear Regression (SLR): Main idea Example: Boxplot of % low income by education level Linear regression can be used to study a continuous outcome variable as a linear function of a predictor variable Education level is coded as a binary variable with values low and high Example: 60 cities in the US were evaluated for numerous characteristics, including: Outcome variable (y) the % of the population with low income Predictor variable (x) median education level Linear regression can help us to model the association between median education and % of the population with low income 17 / 40 18 / 40 Simple linear regression, t-tests and ANOVA Review: equation for a line Mean in low education group: 15.7% Mean in high education group: 13.2% The two means could be compared by a t-test or ANOVA, but regression provides a unified equation: = β 0 + β 1 x i = 15.7 2.5x i Recall that back in geometry class, you learned that a line could be represented by the equation where y = mx + b m = slope of the line (rise/run) b = y-intercept (value y when x=0) where x i = 1 for high education and 0 for low education (x is called a dummy variable or indicator variable that designates group) is our estimate of the mean % low income for the given the value of education what about the β s? 19 / 40 Y 0 5 10 15 20 y=mx+b m = 2 b = 5 0 1 2 3 4 5 X 20 / 40

Regression analysis represented by equation for a line Example: the model components In simple linear regression, we use the equation for a line y = mx + b but we write it slightly differently: = β 0 + β 1 x i = 15.7 + ( 2.5)x i β 0 β 1 ŷ = β 0 + β 1 x = y-intercept (value y when x=0) = slope of the line (rise/run) is the predicted mean of the outcome y i for x i β 0 is the intercept, or the value of when x i = 0 β 1 is the slope, or the change in for a 1 unit increase in x i = 0 x i is the indicator variable of low or high education for observation i 21 / 40 22 / 40 Example: fill in covariate value to help interpretation Interpretations = β 0 + β 1 x i = 15.7 2.5x i x i = 0 (low education) = 15.7 2.5 0 = 15.7 = β 0 x i = 1 (high education) = 15.7 2.5 1 = 13.2 = β 0 + β 1 Intercept Slope β 0 is the mean outcome for the reference group, or the group for which x i = 0. Here, β 0 is the average percent of the population that is low income for cities with low education. β 1 is the difference in the mean outcome between the two groups (when x i = 1 vs. when x i = 0) Here, β 1 is difference in the average percent of the population that is low income for cities with high education compared to cities with low education. 23 / 40 24 / 40

Why use linear regression? Regression analysis Linear regression is very powerful. It can be used for many things: Binary X Continuous X Categorical X Adjustment for confounding Interaction Curved relationships between X and Y A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable. Goal: prediction or estimation of the value of one variable, Y, based on the value of the other variable, X. A simple relationship between the two variables is a linear relationship (straight line relationship) Other names: linear, simple linear, least squares regression 25 / 40 26 / 40 Foundational example: Galton s study on height Example: Galton s data and resulting regression 1000 records of heights of family groups Really tall fathers tend on average to have tall sons but not quite as tall as the really tall fathers Really short fathers tend on average to have short sons but not quite as short as the really short fathers There is a regression of a sons height toward the mean height for sons 27 / 40 28 / 40

Regression analysis: population model Another way to write the model Probability model: Independent responses y 1,y 2,...,y n are sampled from y i N(µ i, σ 2 ) Systematic: y i = β 0 + β 1 x i + ɛ i Probability (random): ɛ i N(0, σ 2 ) Systematic model: µ i = E(y i x i ) = β 0 + β 1 x i where β 0 β 1 = intercept = slope The response y i is a linear function of x i plus some random, normally distributed error, ɛ i data = signal + noise 29 / 40 30 / 40 Geometric interpretation Remember: two (equivalent) ways to write the model Probability: y i N(µ i, σ 2 ) Systematic: µ i = E(y i x i ) = β 0 + β 1 x i where β 0 β 1 = intercept = slope OR Systematic: y i = β 0 + β 1 x i + ɛ i Probability: ɛ i N(0, σ 2 ) The response, y i, is a linear function of x i plus some random, normally distributed error, ɛ i 32 / 40 31 / 40

Interpretation of coefficients From Galton s example Intercept (β 0 ) Mean model: µ = E(y x) = β 0 + β 1 x β 0 = expected response when x = 0 Since E(y x = 0) = β 0 + β 1 (0) = β 0 Slope (β 1 ) β 1 = change in expected response per 1 unit increase in x E(y x) = β 0 + β 1 x E(y x) = 33.7 + 0.52x where: y = son s height (inches) x = father s height (inches) Expected son s height = 33.7 inches when father s height is 0 inches Expected difference in heights for sons whose fathers heights differ by one inch = 0.52 inches 33 / 40 34 / 40 City education/income example City education/income model When education is a continuous variable (not binary) Using the continuous variable for median education in city i (x i ): E(y i x i ) = β 0 + β 1 x i E(y i x i ) = 36.2 2.0x i When x i = 0 E(y i x i ) = 36.2 2.0(0) = 36.2 = β 0 When x i = 1 E(y i x i ) = 36.2 2.0(1) = 34.2 = β 0 + β 1 When x i = 2 35 / 40 E(y i x i ) = 36.2 2.0(2) = 32.2 = β 0 + β 1 2 36 / 40

City education/income model interpretation Finding β s from the graph Intercept (β 0 ) β 0 is the mean outcome for the reference group, or the group for which x i = 0. Here, β 0 is the average percent of the population that is low income for cities with median education level of 0. Slope (β 1 ) β 1 is the difference in the mean outcome for a one unit change in x. Here, β 1 is difference in the average percent of the population that is low income between two cities, when the first city has 1 unit higher median education level than the second city. β 0 is the y-intercept of the line, or the average value of y when x = 0. β 1 is the slope of the line, or the average change in y per unit change in x. y = mx + b ˆβ 1 = rise run = y 1 y 2 x 1 x 2 b = β 0, m = β 1 Note on notation: β 1 represents the true slope (in the population) ˆβ 1 (or b 1 ) is the sample estimate of the slope 37 / 40 38 / 40 Where is our intercept? Summary Today we ve discussed Correlation Linear regression with continuous y variables Simple linear regression (just one x variable) Binary x ( dummy or indicator variable for group) β 1 : mean difference in outcome between groups Continuous x β 1 : mean difference in outcome corresponding to a 1-unit increase in x Interpretation of regression coefficients (intercept and slope) The intercept isn t in the range of our observed data. This means: The intercept isn t very interpretable since the average of y when x = 0 was never observed Possible solution: we might want to center our x variable 39 / 40 How to write the regression model (2 ways) Next time we ll discuss multiple linear regression (more than one x variable) and confounding 40 / 40