Correlation & Regression, II. Residual Plots. What we like to see: no pattern. Steps in regression analysis (so far)



Similar documents
Father s height (inches)

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Descriptive statistics; Correlation and regression

Session 7 Bivariate Data and Analysis

Chapter 7: Simple linear regression Learning Objectives

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Chapter 11: r.m.s. error for regression

Section 3 Part 1. Relationships between two numerical variables

Module 3: Correlation and Covariance

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared.

Correlation key concepts:

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Descriptive Statistics

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Exercise 1.12 (Pg )

Statistics. Measurement. Scales of Measurement 7/18/2012

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

CORRELATIONAL ANALYSIS: PEARSON S r Purpose of correlational analysis The purpose of performing a correlational analysis: To discover whether there

CALCULATIONS & STATISTICS

Post-hoc comparisons & two-way analysis of variance. Two-way ANOVA, II. Post-hoc testing for main effects. Post-hoc testing 9.

Simple linear regression

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Describing Relationships between Two Variables

Simple Regression Theory II 2010 Samuel L. Baker

The Dummy s Guide to Data Analysis Using SPSS

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

1.6 The Order of Operations

Linear Models in STATA and ANOVA

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables

Two-sample inference: Continuous data

SPSS Explore procedure

Lesson 4 Measures of Central Tendency

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

AP STATISTICS REVIEW (YMS Chapters 1-8)

Z - Scores. Why is this Important?

Covariance and Correlation

Simple Linear Regression

Reliability Analysis

Module 5: Multiple Regression Analysis

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers

The correlation coefficient

2. Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Univariate Regression

Scatter Plot, Correlation, and Regression on the TI-83/84

Introduction to Regression and Data Analysis

Using Excel for Statistical Analysis

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Projects Involving Statistics (& SPSS)

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Correlation. The relationship between height and weight

Key Vocabulary: Sample size calculations


Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS.

Multiple Regression: What Is It?

Updates to Graphing with Excel

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

Linear functions Increasing Linear Functions. Decreasing Linear Functions

6.4 Normal Distribution

Chapter 6: Probability

January 26, 2009 The Faculty Center for Teaching and Learning

WRITING PROOFS. Christopher Heil Georgia Institute of Technology

Chapter 9 Descriptive Statistics for Bivariate Data

The Normal Distribution

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

11. Analysis of Case-control Studies Logistic Regression

Microeconomics Sept. 16, 2010 NOTES ON CALCULUS AND UTILITY FUNCTIONS

So, using the new notation, P X,Y (0,1) =.08 This is the value which the joint probability function for X and Y takes when X=0 and Y=1.

Polynomial and Rational Functions

Chapter 10 Acid-Base titrations Problems 1, 2, 5, 7, 13, 16, 18, 21, 25

table to see that the probability is (b) What is the probability that x is between 16 and 60? The z-scores for 16 and 60 are: = 1.

Simple Linear Regression, Scatterplots, and Bivariate Correlation

5/31/ Normal Distributions. Normal Distributions. Chapter 6. Distribution. The Normal Distribution. Outline. Objectives.

Independent samples t-test. Dr. Tom Pierce Radford University

Unit 7: Normal Curves

Chapter 4 Online Appendix: The Mathematics of Utility Functions

Experiment #1, Analyze Data using Excel, Calculator and Graphs.

CURVE FITTING LEAST SQUARES APPROXIMATION

AP Statistics Solutions to Packet 2

with functions, expressions and equations which follow in units 3 and 4.

Correlation. Alan T. Arnholt Department of Mathematical Sciences Appalachian State University

CS 147: Computer Systems Performance Analysis

Correlation and Regression

0.8 Rational Expressions and Equations

The Big Picture. Correlation. Scatter Plots. Data

Data Mining: Algorithms and Applications Matrix Math Review

Simple Predictive Analytics Curtis Seare

II. DISTRIBUTIONS distribution normal distribution. standard scores

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Correlation and Simple Linear Regression

Estimate a WAIS Full Scale IQ with a score on the International Contest 2009

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

To give it a definition, an implicit function of x and y is simply any relationship that takes the form:

Practice Test Answer and Alignment Document Mathematics: Algebra II Performance Based Assessment - Paper

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

DATA INTERPRETATION AND STATISTICS

Transcription:

Steps in regression analysis (so far) Correlation & Regression, II 9.07 4/6/2004 Plot a scatter plot Find the parameters of the best fit regression line, y =a+bx Plot the regression line on the scatter plot Plot the residuals, xi vs. (y i y i ), as a scatter plot, for diagnostic purposes Residual Plots Plotting the residuals (yi y i ) against x i can reveal how well the linear equation explains the data Can suggest that the relationship is significantly non-linear, or other oddities The best structure to see is no structure at all What we like to see: no pattern 40 20 0-20 -40 height ( inches) 1

If it looks like this, you did something wrong there s still a linear component! 40 20 0-20 -40 height ( inches) If there s a pattern, it was inappropriate to fit a line (instead of some other function) 40 20 0-20 -40 height (inches) What to do if a linear function isn t appropriate Often you can transform the data so that it is linear, and then fit the transformed data. This is equivalent to fitting the data with a model, y = M(x), then plotting y vs. y and fitting that with a linear model. There are other tricks people use. This is outside of the scope of this class. Coming up next Assumptions implicit in regression The regression fallacy Confidence intervals on the parameters of the regression line Confidence intervals on the predicted value y, given x Correlation 2

Assumption #1: your residual plot should not look like this: 40 20 0-20 -40 height (inches) Heteroscedastic data Data for which the amount of scatter depends upon the x- value (vs. homoscedastic, where it doesn t depend on x) Leads to residual plots like that on the previous slide Happens a lot in behavioral research because of Weber s law. Ask people how much of an increment in sound volume they can j ust distinguish from a standard volume How big a difference is required (and how much variability there is in the response) depends upon the standard volume Can often deal with this problem by transforming the data, or doing a modified, weighted regression ( Again, outside of the scope of this class.) Why we care about heteroscedasticity vs. homoscedasticity Along with the residual plots, we often want to look at the rms (root-mean-square) error for the regression: rms = sqrt( Σ(y i y i ) 2 /N) = s y This gives us a measure of the spread around the regression line For this measure to be meaningful and useful, we want the data to be homoscedastic, i.e. we want the data to be spread out to the same degree for every value of xi. 40 20 0-20 -40 Homoscedastic height ( inches) Here, rms error is a good measure of the amount of spread of the data about yi, for any value of xi. 40 20 0-20 -40 Heteroscedastic hei ght (i nches) Here, rms error is not such a good measure of the spread -- for some xi it will be an overestimate of spread, for some an underestimate. 3

Another assumption for regression analysis Assume the y scores at each x form an approximately normal distribution Because the rms error, sy is like a standard deviation, if the above assumptions hold, then we can say things like, approximately 68% of all y scores will be between ±1 s y from the regression line Assumptions Homoscedastic data y scores approximately normally distributed about the regression line The regression effect and the regression fallacy Your book s example: A preschool program is aimed at increasing children s IQ s. Children are given a pre-test prior to entering the program, and a post-test at the end of the program. On both tests, the mean score is about 100, and the SD is about 15. The program seems to have no effect. However, closer inspection shows that students who started out with lower IQ s had an average gain of 5 IQ pts. Students who started out with a higher IQ showed a drop in IQ of 5 pts, on average. Is something interesting going on, here? Does the program act to equalize intelligence? Does it make smart kids less smart, and less bright kids more bright? No, nothing much is going on. This is just the regression effect. In virtually all test-retest situations, the group that performs poorly on the first test will show some improvement (on average) on the second, and the group that performs better than average will do worse on the second test. The regression fallacy is assuming that this effect is due to something important (like the program equalizing intelligence). As we ll see, it s not. 4

Simple intuition about the regression effect Divide children into two groups, based upon their performance on the first test Poor group vs. Good group Prediction: 1 st <mean Poor group Children 1 st >mean Good group Simple intuition about the regression effect However, by chance, some of the children who belong in the poor group got a high enough score on test 1 to be categorized in the good group. And vice versa. If there s a real difference between the groups, on test 2 we d expect some of these mis-labeled children to score more like the group they really belong to. -> poor group scores, on average: better on test 2 than test 1, good group scores, on average: worse on test 2 than on test 1 2 nd score low 2 nd score high The regression effect: more involved explanation To explain the regression effect the way your book does, it will help first to talk some more about what kind of distribution we expect to see in our scatter plots, and more about some intuitions for what the least squares linear regression is telling us. Football-shaped scatter plots For many observational studies, the scatter plot will tend to be what your book calls footballshaped 200 150 100 height (i nches) 5

Why football-shaped? Just as weights and heights tend to have a (1 dimensional) normal distribution, a plot of bivariate ( height, weight) data will often tend to have a 2-dimensional normal distribution The two-dimensional form of the normal bellshaped distribution is a cloud of points, most of them falling within an elliptical football-shaped region about the mean(height, weight). This is equivalent to most of the points in a normal distribution falling within +/- one standard deviation from the mean. Wait, are football-shaped scatter plots homoscedastic? 200 150 100 hei ght (inches) If it s really football-shaped, isn t the spread of the data in the y-direction slightly smaller at the ends (for small and large x)? Isn t the spread in the data smaller for small and large x? No, this is just because, in an observational study, there are fewer data points in the ends of the football there just aren t that many people who are really short or really tall, so we have the illusion that there s less spread for those heights. In fact, for a controlled experiment, where an experimenter makes sure there are the same number of data points for each value of x i, the scatter plot will tend to look more like this: Typical scatter plot for a controlled experiment 200 150 100 60 65 70 x 75 same # of data points for each x collected data only for specific values of x 6

What happens in regression an alternate view Divide the scatter plot into little vertical strips. Take the mean in each vertical strip. Least squares regression attempts to fit a line through these points. 200 150 100 hei ght (inches) Note that the regression line is not the line through the axis of the ellipse The axis line, shown dashed, is what your book calls the SD line 200 150 It starts at the mean, 100 and as it moves 60 65 right one s x, it hei ght (inches) moves up one s y 70 75 Consider what our ellipse looks like, in the preschool program situation If score test1 = score test2 for each child, the data would fall exactly on a 45º line. However, this doesn t often happen. Instead we see spread about this line. Spread is in both pre- and post-test scores. So it looks like we ve centered our football about the 45º line = the SD line 120 100 80 45º (SD) line 80 100 120 pre-test score For each score range on test 1, look at the mean score expected on test 2 To do this, we again look at a vertical strip through our football, and find the mean in that strip. This will be the mean score on the 2nd test for the students whose 1st test score lies within the given range. 120 100 80 80 100 120 pre-test score 7

Predicted scores on test 2 are closer to the mean than corresponding scores on test 1 Solid line corresponds to score test1 = scoretest2 120 For score test1 < mean predict 100 score test2 > score test1 For score test1 > mean 80 predict 80 100 120 score test2 < score test1 pre-test score The regression effect We don t expect scores on the second test to be exactly the same as the scores on the first test (the SD line) The scatter about the SD line leads to the familiar football-shaped scatter plot The spread around the line makes the mean second score come up for the bottom group, and go down for the top group. There s nothing else to the regression effect the preschool program does not equalize IQ s. Regression to the mean This effect was first noticed by aristocrat Galton, in his study of family resemblances. Tall fathers tended to have sons with height closer to the average (i.e., shorter) Short fathers also tended to have sons with height closer to the average (i.e., taller) Galton referred to this effect as regression to mediocrity Regression = movement backward to a previous and especially worse or more primitive state or condition This is where the term regression analysis comes from, although now it means something quite different Important experimental design lesson! Do NOT choose the groups for your experiment based upon performance above or below a threshold on a pre-test The group that did worse on the pre-test will likely do better after the treatment, but this could just be the regression effect (and thus meaningless) 8

Example: does a school program improve performance for students with a learning disability? Poor design: Pre-test determines who has a learning disability, and who is a normal control Both groups go through treatment Post-test Post-test, in this situation, will likely show a meaningless regression to the mean, and we wont be able to tell if the school program helps Example: does a school program improve performance for students with a learning disability? Good design: Pre-test determines who has a learning disability Split learning disability group randomly into treatment and control groups Treatment group goes through school program of interest, control group goes through some other, control program Post-test If the post-test shows an improvement for the treatment group, we can be more confident this shows an actual effect of the treatment Correlation We ll come back to regression later, and talk about confidence intervals and so on. But first, if there is a linear predictive relationship between two variables (as found by regression), how strong is the relationship? This is a question for correlation analysis Correlation a relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone We ll be talking about a measure of correlation which incorporates both the sign of the relationship, and its strength 9

( Positive correlation: weighing more tends to go with being taller 80 Height and weight Negative correlation: sleeping less tends to go with drinking more coffee 10 Coffee and sleep 70 9 8 Height inches) 60 Hours of sleep 7 6 50 5 40 140 150 160 170 180 190 200 210 4 0 2 4 6 8 Weight (pounds) Cups of coffee Zero correlation: weight does not tend to vary with coffee consumption Correlation In our scatter plots, we can see positive correlation between the two variables, negative correlation, and zero correlation (no association between the two variables) Some scatter plots will have more scatter about the best fit line than others. 10

Strength of association A strong positive correlation When the points cluster closely about the best fit line, the association between the two variables is strong. Knowing information about one variable (e.g. a person s height is 65 ) tells you a lot about the person s weight (it s probably about 177 lbs) When the spread of points increases, the association weakens Knowing information about one variable doesn t help you much in pinning down the other variable Figures by MIT OCW. 11

A strong negative correlation A not-so-strong positive correlation A not-so-strong negative correlation A weak positive correlation Figures by MIT OCW. 12

Measuring strength of relationship The Pearson Product-Moment Correlation Coefficient, r Provides a numerical measure of the strength of the relationship between two variables -1 r 1 Sign indicates direction of relationship Magnitude indicates strength The formula for r Remember, r has to do with correlation not with regression 1 r = N N i = 1 ( x i m s x x ) ( y i m s y y (Use the 1/N form of sx and s y ) ) 1 = N N z x i = 1 z y Unpacking the formula for r r = 1.00 r is based on z-scores -- it is the average product of the standard scores of the two variables r will be positive if both variables tend to be on the same side of their means at the same time, negative if on opposite sides r will be 0 if there s no systematic relationship between the two variables Figure by MIT OCW. 13

r = -.54 r =.85 r = -.94 r =.42 Figures by MIT OCW. 14

r = -.33 r =.17 r =.39 Example Looking at the correlation between # of glasses of juice consumed per day (x), and doctor visits per year (y) x = [0 0 1 1 1 2 2 3 3 4] ; y = [8 7 7 6 5 4 4 4 2 0] ; zx = (x-mean(x))/std(x, 1); zy = (y-mean(y))/std(y, 1); r = mean(zx.*zy) = -0.95 Figures by MIT OCW. 15

Doctor's visits per year 9 8 7 6 5 4 3 2 1 0 Scatter plot r = -0.95 0 1 2 3 4 5 Glasses of orange juice per day Behavioral research ( Only rules of thumb:) Typical r within ±0.30 or ±0.50 ±0.20 typically considered weak Between ±0.60 and ±0.80 quite strong, impressive Greater than ±0.80 is extremely strong and unlikely in this sort of research Greater than ±1.0? Something s wrong. Limitations of r No correlation? r only tells you whether two variables tend to vary together -- nothing about the nature of the relationship E.G: correlation is not causation!! r only measures the strength of the linear relationship between two variables Other kinds of relationships can exist, so look at the data! y r = 0 x 16

An alternative computational formula for r 1 N ( x N = i mx ) ( y i my ) 1 r = zx z y N i = 1 s x s y N i = 1 Note we can write this as: r = E[(x-m x)/s x (y-m y )/s y ] = cov(x, y)/(s x s y ) Plugging in equations for cov(x, y), sx, and s y, you ll also see this version: N ( xy) ( x)( y) r = 2 2 2 2 [ N ( x ) ( x) ][ N( y ) ( y) ] Notes on computing r That last equation is intended for computing by hand/with a calculator. If you have to compute by hand, it s probably the most efficient version. These days, in MATLAB the easiest thing to do is just to compute cov(x, y), sx, and s y. Juice & doctor visits example, again x = [0 0 1 1 1 2 2 3 3 4] ; y = [8 7 7 6 5 4 4 4 2 0] ; tmpmx = cov(x, y, 1)/(std(x, 1)*(std(y, 1)) % Recall: cov returns the matrix % [var(x) cov(x,y); cov(x,y) var(y)] r = tmpmx(1, 2) = -0.95 17