Descriptive statistics; Correlation and regression




Descriptive statistics; correlation and regression
Patrick Breheny, September 16
STA 580: Biostatistics I

Tables and figures
Human beings are not good at sifting through large streams of data; we understand data much better when it is summarized for us. We often display summary statistics in one of two ways: tables and figures. Tables of summary statistics are very common (we have already seen several in this course); nearly all published studies in medicine and public health contain a table of basic summary statistics describing their sample. However, figures are usually better than tables at distilling clear trends from large amounts of information.

Types of data
The best way to summarize and present data depends on the type of data. There are two main types of data:
Categorical data: data that takes on distinct values (i.e., it falls into categories), such as sex (male/female), alive/dead, blood type (A/B/AB/O), or stage of cancer.
Continuous data: data that takes on a spectrum of fractional values, such as time, age, temperature, or cholesterol level.
The distinction between categorical (also called discrete) and continuous data is fundamental, and we will return to it throughout the course.

Categorical data
Summarizing categorical data is pretty straightforward: you just count how many times each category occurs. Instead of counts, we are often interested in percents. A percent is a special type of rate, a rate per hundred. Counts (also called frequencies), percents, and rates are the three basic summary statistics for categorical data, and are often displayed in tables or bar charts, as we saw in lab.

Continuous data
For continuous data, instead of a finite number of categories, observations can take on a potentially infinite number of values. Summarizing continuous data is therefore much less straightforward. To introduce concepts for describing and summarizing continuous data, we will look at data on infant mortality rates for 111 nations on three continents: Africa, Asia, and Europe.

Histograms
One very useful way of looking at continuous data is with histograms. To make a histogram, we divide a continuous axis into equally spaced intervals, then count and plot the number of observations that fall into each interval. This allows us to see how our data points are distributed.

Histogram of European infant mortality rates
[Figure: histograms of infant mortality rates in three panels (Europe, Asia, Africa). Horizontal axis: deaths per 1,000 births; vertical axis: count.]

Summarizing continuous data
As we can see, continuous data comes in a variety of shapes. Nothing can replace seeing the picture, but if we had to summarize our data using just one or two numbers, how should we go about doing it? The aspect of the histogram we are usually most interested in is, "Where is its center?" This is typically represented by the average.

The average and the histogram
The average represents the center of mass of the histogram.
[Figure: histograms of infant mortality rates for Europe, Asia, and Africa, with the average marked. Horizontal axis: deaths per 1,000 births; vertical axis: count.]

Spread
The second most important bit of information from the histogram to summarize is, "How spread out are the observations around the center?" This is most typically represented by the standard deviation. To understand how standard deviation works, let's return to our small example with the numbers {4, 5, 1, 9}, whose mean is 4.75. Each of these numbers deviates from the mean by some amount:
4 - 4.75 = -0.75
5 - 4.75 = 0.25
1 - 4.75 = -3.75
9 - 4.75 = 4.25
How should we measure the overall size of these deviations?

Root-mean-square
Taking their mean isn't going to tell us anything (why not?). We could take the average of their absolute values:
$$\frac{|-0.75| + |0.25| + |-3.75| + |4.25|}{4} = 2.25$$
But it turns out that, for a variety of reasons, the root-mean-square works better as a measure of overall size:
$$\sqrt{\frac{(-0.75)^2 + (0.25)^2 + (-3.75)^2 + (4.25)^2}{4}} \approx 2.86$$
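The arithmetic for the {4, 5, 1, 9} example can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and not part of the lecture.

```python
import math

data = [4, 5, 1, 9]
mean = sum(data) / len(data)                 # 4.75

# Deviation of each value from the mean
deviations = [x - mean for x in data]        # [-0.75, 0.25, -3.75, 4.25]

# The plain mean of the deviations: positive and negative values cancel
mean_deviation = sum(deviations) / len(deviations)

# Average of the absolute deviations
mean_abs = sum(abs(d) for d in deviations) / len(deviations)

# Root-mean-square: square, average, then take the square root
rms = math.sqrt(sum(d ** 2 for d in deviations) / len(deviations))

print(mean_deviation, mean_abs, round(rms, 2))   # 0.0 2.25 2.86
```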

The standard deviation
The formula for the standard deviation is
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Wait a minute; why n - 1? The reason (which we will discuss further in a few weeks) is that dividing by n turns out to underestimate the true standard deviation. Dividing by n - 1 instead of n corrects some of that bias. The standard deviation of {4, 5, 1, 9} is 3.30 (recall that we got 2.86 when we divided by n).
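As a quick check of the two divisors, Python's standard `statistics` module provides both versions: `stdev` divides by n - 1 and `pstdev` divides by n. The snippet below is an illustrative sketch, not part of the original slides.

```python
import statistics

data = [4, 5, 1, 9]

sd_n_minus_1 = statistics.stdev(data)    # divides by n - 1
sd_n = statistics.pstdev(data)           # divides by n

print(round(sd_n_minus_1, 2), round(sd_n, 2))   # 3.3 2.86
```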

Meaning of the standard deviation
The standard deviation (SD) describes how far away numbers in a list are from their average. The SD is often used as a "plus or minus" number, as in "adult women tend to be about 5'4", plus or minus 3 inches". Most numbers (roughly 68%) will be within 1 SD of the average. Very few entries (roughly 5%) will be more than 2 SDs away from the average. This rule of thumb works very well for a wide variety of data; we'll discuss where these numbers come from in a few weeks.
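One way to get a feel for this rule of thumb is to count how many values fall within one and two SDs of the mean. The sketch below uses simulated, roughly bell-shaped data (made up for illustration, not the infant mortality data); the helper name `fraction_within` is hypothetical.

```python
import random
import statistics

def fraction_within(data, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

random.seed(0)  # fixed seed so the run is repeatable
# Simulated data: an assumption for illustration only
sample = [random.gauss(100, 15) for _ in range(10_000)]

within_1 = fraction_within(sample, 1)
within_2 = fraction_within(sample, 2)
print(f"within 1 SD: {within_1:.2f}, within 2 SDs: {within_2:.2f}")
```

For skewed data the fractions can differ noticeably from 68% and 95%, which is why the rule is only a rule of thumb.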

Standard deviation and the histogram
Background areas within 1 SD of the mean are shaded.
[Figure: histograms of infant mortality rates for Europe, Asia, and Africa. Horizontal axis: deaths per 1,000 births; vertical axis: count.]

The 68%/95% rule in action
% of observations within:
Continent   One SD   Two SDs
Europe      78       97
Asia        67       97
Africa      63       95

Summaries can be misleading!
All of the following have the same mean and standard deviation:
[Figure: histograms of several differently shaped distributions. Horizontal axis: -4 to 4; vertical axis: frequency.]

Percentiles
The average and standard deviation are not the only ways to summarize continuous data. Another type of summary is the percentile. A number is the 25th percentile of a list of numbers if it is bigger than 25% of the numbers in the list. The 50th percentile is given a special name: the median. The median, like the mean, can be used to answer the question, "Where is the center of the histogram?"

Median vs. mean
The dotted line is the median; the solid line is the mean.
[Figure: histograms of infant mortality rates for Europe, Asia, and Africa. Horizontal axis: deaths per 1,000 births; vertical axis: count.]

Skew
Note that the histogram for Europe is not symmetric: the tail of the distribution extends further to the right than it does to the left. Such distributions are called skewed. The distribution of infant mortality rates in Europe is said to be right skewed, or skewed to the right. For asymmetric/skewed data, the mean and the median will be different.

Hypothetical example
Azerbaijan had the highest infant mortality rate in Europe, at 37. What if, instead of 37, it was 200?
               Mean   Median
Real           14.1   11
Hypothetical   19.2   11
The mean is now higher than 72% of the countries. Note that the average is sensitive to extreme values, while the median is not; statisticians say that the median is robust to the presence of outlying observations.
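The same effect is easy to reproduce on a small list. The values below are toy numbers chosen for illustration (they are not the European infant mortality rates): replacing the largest value with an extreme one moves the mean a great deal and leaves the median untouched.

```python
import statistics

values = [1, 4, 5, 9]            # toy data, not the infant mortality rates
with_outlier = [1, 4, 5, 200]    # largest value replaced by an extreme one

print(statistics.mean(values), statistics.median(values))               # 4.75 4.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))   # 52.5 4.5
```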

Box plots
Quantiles are used in a type of graphical summary called a box plot. Box plots are constructed as follows:
Calculate the three quartiles (the 25th, 50th, and 75th percentiles).
Draw a box bounded by the first and third quartiles, with a line in the middle for the median.
Call any observation that is extremely far from the box an "outlier" and plot these observations using a special symbol (this is somewhat arbitrary, and different rules exist for defining outliers).
Draw a line from the top of the box to the highest observation that is not an outlier; likewise for the lowest non-outlier.
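The construction above can be sketched numerically. Because the slide leaves the outlier rule open, the code below assumes one common convention (points more than 1.5 times the interquartile range beyond the box are outliers); that rule, the function name `box_plot_summary`, and the data are all assumptions for illustration. Quartile values also depend on the interpolation method; here the `inclusive` method of `statistics.quantiles` is used.

```python
import statistics

def box_plot_summary(data, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    # Whiskers extend to the most extreme observations that are not outliers
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data
print(summary)
```

With these made-up values, 50 is flagged as an outlier and the upper whisker stops at 9.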

Box plots of the infant mortality rate data
[Figure: box plots of infant mortality rates for Africa, Asia, and Europe, on a scale from 0 to 150.]

Box plots are a way to examine the relationship between a continuous variable and a categorical variable. In lab, we saw bar charts as a way of comparing two (or more) categorical variables. Now, we will discuss how to summarize and illustrate the relationship between two continuous variables.

Pearson's height data
Statisticians in Victorian England were fascinated by the idea of quantifying hereditary influences. Two of the pioneers of modern statistics, the Victorian Englishmen Francis Galton and Karl Pearson, were quite passionate about this topic. In pursuit of this goal, they measured the heights of 1,078 fathers and their (fully grown) sons.

The scatter plot
As we've mentioned, it is important to plot continuous data; this is especially true when you have two continuous variables and you're interested in the relationship between them. The most common way to plot the relationship between two continuous variables is the two-way scatter plot. Scatter plots are created by setting up two continuous axes, then creating a dot for every pair of observations.

Scatter plot of Pearson's height data
[Figure: scatter plot. Horizontal axis: father's height (inches), 60 to 80; vertical axis: son's height (inches), 60 to 80.]

Observations about the scatter plot
Taller fathers tend to have taller sons. The scatter plot shows how strong this association is: there is a tendency, but there are plenty of exceptions.

Standardizing a variable
Before we summarize this relationship numerically, we must discuss the idea of standardizing a variable. In Pearson's height data, one of the sons measured 63.2 inches tall. Because the average height of the sons in the sample was 68.7 inches, another way of describing his height is to say that he was 5.5 inches below average. Furthermore, because the standard deviation of the sons' heights was 2.8 inches, yet another way of describing his height is to say that he was 1.9 standard deviations below the average.

The standardization formula
Putting this into a formula, we standardize an observation $x_i$ by subtracting the average and dividing by the standard deviation:
$$z_i = \frac{x_i - \bar{x}}{SD_x}$$
where $\bar{x}$ and $SD_x$ are the mean and standard deviation of the variable x.
One virtue of standardizing a variable is interpretability. If someone tells you that the concentration of urea in your blood is 50 mg/dl, that likely means nothing to you. On the other hand, if you are told that the concentration of urea in your blood is 4 standard deviations above average, you can immediately recognize this as a very high value.

More benefits of standardization
If you standardize all of the observations in your sample, the resulting variable will be "standardized" in the sense of having mean 0 and standard deviation 1. Standardization therefore brings all variables onto a common scale: regardless of whether the heights were originally measured in inches, centimeters, or miles, the standardized heights will be identical. As we will see momentarily, this allows us to study the relationship between two continuous variables without worrying about the scale of measurement. The concept behind standardization (taking an observation, then subtracting the expected value and dividing by the variability) is fundamental to statistics, and we will see variations on this idea many times in this course.
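These properties can be checked directly. In the sketch below, the single-observation calculation uses the rounded figures from the height example (63.2, 68.7, and 2.8 inches), which give about -1.96, in line with the roughly 1.9 to 2 standard deviations below average described earlier. The list of heights and the helper name `standardize_all` are made up for illustration.

```python
import statistics

def standardize_all(values):
    """Convert each value to standard units: (x - mean) / SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Single observation from the height example: 63.2 in., mean 68.7, SD 2.8
z_son = (63.2 - 68.7) / 2.8
print(round(z_son, 2))   # -1.96

heights_in = [63.2, 66.0, 68.5, 70.1, 72.4]     # made-up heights in inches
heights_cm = [h * 2.54 for h in heights_in]     # the same heights in centimeters

z_in = standardize_all(heights_in)
z_cm = standardize_all(heights_cm)

# Standardized values have mean 0 and SD 1, whatever the original units
print(round(statistics.mean(z_in), 6), round(statistics.stdev(z_in), 6))
```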

The correlation coefficient
The summary statistic for describing the strength of association between two variables is the correlation coefficient, denoted by r (and sometimes called Pearson's correlation coefficient). The correlation coefficient is always between 1 (perfect positive correlation) and -1 (perfect negative correlation), and can take on any value in between. A positive correlation means that as one variable increases, the other one tends to increase as well. A negative correlation means that as one variable increases, the other one tends to decrease.

Calculating the correlation coefficient
The correlation coefficient is simply the average of the products of the standardized variables. In mathematical notation,
$$r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1}$$
where $z_{x_i}$ and $z_{y_i}$ are the standardized values of x and y.
Note: the n versus n - 1 issue has nothing to do with correlation; however, if n - 1 is used when standardizing, it must be used again here.
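The formula translates almost line for line into code. The following is a minimal sketch with made-up data; the function name `correlation` is illustrative, and `statistics.stdev` is used because it divides by n - 1, matching the divisor in the formula.

```python
import statistics

def correlation(xs, ys):
    """Pearson's r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)   # both use n - 1
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]          # made-up data
print(round(correlation(xs, ys), 2))   # 0.8
```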

Meaning behind the correlation coefficient formula
[Figure: scatter plot of Pearson's height data. Horizontal axis: father's height (inches), 60 to 80; vertical axis: son's height (inches), 60 to 80.]

The correlation coefficient and the scatter plot
[Figure: five scatter plots of y against x, with panels labeled 0.88, 0.34, 0.02, 0.29, and 0.91.]

More about the correlation coefficient
Because the correlation coefficient is based on standardized variables, it does not depend on the units of measurement. Thus, the correlation between fathers' and sons' heights would be 0.5 even if the father's height was measured in inches and the son's in centimeters. Furthermore, the correlation between x and y is the same as the correlation between y and x.
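Both properties can be confirmed numerically. The heights below are made up for illustration (they are not Pearson's data), and `correlation` is a small helper written for this sketch.

```python
import statistics

def correlation(xs, ys):
    """Pearson's r computed from standardized values (n - 1 divisor)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up father/son heights in inches (not Pearson's data)
fathers_in = [65.0, 67.5, 68.0, 70.2, 72.1, 66.3]
sons_in = [66.8, 67.0, 69.5, 69.9, 71.0, 68.2]
sons_cm = [h * 2.54 for h in sons_in]

r_inches = correlation(fathers_in, sons_in)
r_mixed = correlation(fathers_in, sons_cm)     # son's height in centimeters
r_swapped = correlation(sons_in, fathers_in)   # roles of x and y exchanged

print(round(r_inches, 4), round(r_mixed, 4), round(r_swapped, 4))
```

All three printed values agree, up to floating-point rounding.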

Interpreting the correlation coefficient
The correlation between heights of identical twins is around 0.95. The correlation between income and education in the United States is about 0.44. The correlation between a woman's education and the number of children she has is about -0.2. When concrete physical laws determine the relationship between two variables, their correlation can exceed 0.9. In the social sciences, this is rare; correlations of 0.3 to 0.7 are considered quite strong in these fields.

Numerical summaries can be misleading!
From Cook & Swayne's Interactive and Dynamic Graphics for Data Analysis:
[Figure: four scatter plots of Y against X.]
Fig. 6.1. Studying dependence between X and Y. All four pairs of variables have correlation approximately equal to 0.7, but they all have very different patterns. Only the top left plot shows two variables matching a dependence modeled by correlation.
The plot at bottom right shows two variables with some positive linear dependence, but the obvious non-linear dependence is more interesting.

Ecological correlations
Epidemiologists often look at the correlation between two variables at the ecological level: say, the correlation between cigarette consumption and lung cancer deaths per capita. However, people smoke and get cancer, not countries. These correlations have the potential to be misleading. The reason is that by replacing individual measurements with averages, you eliminate a lot of the variability that is present at the individual level and obtain a higher correlation than there really is.
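A small simulation illustrates how averaging inflates the correlation. Everything here is hypothetical: ten made-up "countries", each with 200 individuals whose two measurements share a country-level component but carry a lot of individual noise. The correlation is then computed once over individuals and once over country averages.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson's r computed from standardized values (n - 1 divisor)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

random.seed(1)  # fixed seed so the run is repeatable
xs, ys, group_x, group_y = [], [], [], []
for g in range(10):                          # 10 hypothetical countries
    gx = [g + random.gauss(0, 5) for _ in range(200)]
    gy = [g + random.gauss(0, 5) for _ in range(200)]
    xs.extend(gx)
    ys.extend(gy)
    group_x.append(statistics.mean(gx))      # country averages
    group_y.append(statistics.mean(gy))

r_individual = correlation(xs, ys)           # computed over 2,000 individuals
r_ecological = correlation(group_x, group_y) # computed over 10 averages
print(round(r_individual, 2), round(r_ecological, 2))
```

In this setup the ecological correlation comes out far higher than the individual-level one, even though the underlying data are the same.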

Fat in the diet and cancer
From an article by Carroll in Cancer Research (1975):
[Figure]

NHANES
Every few years, the CDC conducts a huge survey of randomly chosen Americans called the National Health and Nutrition Examination Survey (NHANES). Hundreds of variables are measured on these individuals:
Demographic variables like age, education, and income
Physiological variables like height, weight, blood pressure, and cholesterol levels
Dietary habits
Disease status
Lots more: everything from cavities to sexual behavior

Predicting weight from height
For the 2,649 adult women in the NHANES data set:
average height = 5 feet, 3.5 inches
average weight = 166 pounds
SD(height) = 2.75 inches
SD(weight) = 44.5 pounds
correlation between height and weight = 0.3
Suppose you were asked to predict a person's weight from their height. First, an easy case: suppose the woman was 5 feet, 3.5 inches. Since the woman is of average height, we have no reason to guess anything other than the average weight, 166 pounds.

Predicting weight from height (cont'd)
How about a woman who is 5'6"? She's a bit taller than average, so she probably weighs a bit more than average. But how much more? To put the question a different way, she is almost one standard deviation above the average height; how many standard deviations above the average weight should we expect her to be?

Using the correlation coefficient
The answer turns out to depend on the correlation coefficient. Since the correlation coefficient for this data is 0.3, and treating her height as one standard deviation above average, we would expect the woman to be 0.3 standard deviations above the mean weight, or 166 + 0.3(44.5) ≈ 179 pounds.
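The prediction rule can be written as a short function: convert the height to standard units, multiply by r, and convert back to pounds. This is a sketch using the rounded NHANES summary figures quoted above; the function and constant names are illustrative. Using the exact height of 66 inches (about 0.91 SDs above average, rather than a full SD) gives about 178 pounds, slightly below the rounded figure of 179.

```python
MEAN_HEIGHT, SD_HEIGHT = 63.5, 2.75    # inches (5 feet, 3.5 inches)
MEAN_WEIGHT, SD_WEIGHT = 166.0, 44.5   # pounds
R = 0.3                                # correlation between height and weight

def predict_weight(height_in):
    """Predicted weight: move r standard units in weight per standard unit of height."""
    z_height = (height_in - MEAN_HEIGHT) / SD_HEIGHT
    z_weight = R * z_height
    return MEAN_WEIGHT + z_weight * SD_WEIGHT

print(round(predict_weight(63.5)))   # 166
print(round(predict_weight(66.0)))   # 178
```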

Graphical interpretation
[Figure: scatter plot of weight (lbs), 100 to 400, against height (inches), 55 to 70, with a fitted line.]

The regression line
This line is called the regression line. It tells you, for any height, the average weight for women of that height. Here, we were trying to predict one variable based on one other variable; if we were trying to predict weight based on height, dietary habits, and cholesterol levels, or trying to study the relationship between cholesterol and weight while controlling for height, then this is called multiple regression. Multiple regression is beyond the scope of this course, but is a major topic in Biostatistics II.

The equation of the regression line
Like all lines, the regression line may be represented by the equation y = α + βx, where α is the intercept and β is the slope. For the height/weight NHANES data, the intercept is -137 pounds and the slope is 4.8 pounds/inch.

β vs. r
Note the similarity and the difference between the slope of the regression line (β) and the correlation coefficient (r). The correlation coefficient says that if you go up in height by one standard deviation, you can expect to go up in weight by r = 0.3 standard deviations. The slope of the regression line tells you that if you go up in height by one inch, you can expect to go up in weight by β = 4.8 pounds. Essentially, they tell you the same thing, one in terms of standard units, the other in terms of actual units. Therefore, if you know one, you can always figure out the other simply by changing units (which here involves multiplying by the ratio of the standard deviations).

β vs. r (cont'd)
Suppose a woman's height is increased by one inch; what do we expect to happen to her weight?
1 inch = 1/2.75 SDs
An increase of 1/2.75 SDs in height means an increase of 0.3/2.75 SDs in weight
0.3/2.75 SDs = 0.3(44.5/2.75) ≈ 4.85 pounds

β vs. r (cont'd)
Suppose a woman's height is increased by one SD; what do we expect to happen to her weight?
1 SD = 2.75 inches
An increase of 2.75 inches in height means an increase of 4.85(2.75) pounds in weight
4.85(2.75) pounds = 4.85(2.75)/44.5 SDs = 0.3 SDs
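The unit conversion in both directions can be sketched in a few lines. The code uses the rounded NHANES summary figures and relies on the standard fact that the least squares line passes through the point of averages. With these rounded inputs the intercept comes out near -142 rather than the -137 quoted earlier; the difference is presumably due to rounding in the summary statistics, which is an assumption rather than something stated in the lecture.

```python
MEAN_HEIGHT, SD_HEIGHT = 63.5, 2.75    # inches
MEAN_WEIGHT, SD_WEIGHT = 166.0, 44.5   # pounds
R = 0.3

# From standard units to actual units: multiply r by the ratio of the SDs
slope = R * SD_WEIGHT / SD_HEIGHT               # pounds per inch

# The regression line passes through (mean height, mean weight)
intercept = MEAN_WEIGHT - slope * MEAN_HEIGHT   # pounds

# From actual units back to standard units
r_back = slope * SD_HEIGHT / SD_WEIGHT

print(round(slope, 2), round(intercept), round(r_back, 2))   # 4.85 -142 0.3
```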

There are two regression lines
We said that the correlation between weight and height is the same as the correlation between height and weight. This is not true for regression. The regression of weight on height will give a different answer than the regression of height on weight.
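A quick calculation with the rounded NHANES summary figures shows the asymmetry. If there were only one line, the slope for predicting height from weight would simply be the reciprocal of the slope for predicting weight from height; it is not. The variable names are illustrative.

```python
SD_HEIGHT, SD_WEIGHT, R = 2.75, 44.5, 0.3

slope_weight_on_height = R * SD_WEIGHT / SD_HEIGHT   # pounds per inch
slope_height_on_weight = R * SD_HEIGHT / SD_WEIGHT   # inches per pound

# What the second slope would be if both regressions gave the same line
reciprocal = 1 / slope_weight_on_height

print(round(slope_height_on_weight, 4), round(reciprocal, 4))   # 0.0185 0.206
```

The two slopes would coincide only if r were exactly 1 or -1.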

The two regression lines
[Figure: scatter plot of weight (lbs), 100 to 400, against height (inches), 55 to 70, with both regression lines.]

Regression and root-mean-square error
The amount by which the regression prediction is off is called the residual. One way of looking at the quality of our predictions is by measuring the size of the residuals. Out of all possible lines that you could draw, which one has the lowest possible root-mean-square of the residuals? The regression line. Because of this, the regression line is also called the least squares fit.
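The least squares property can be checked on a small example: fit the line, then compare its root-mean-square residual against that of nearby lines. The data and function names below are made up for this sketch, and the slope is computed with the usual closed-form least squares formula.

```python
import math

def rms_residual(xs, ys, intercept, slope):
    """Root-mean-square of the residuals y - (intercept + slope * x)."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))

def least_squares_line(xs, ys):
    """Intercept and slope of the least squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]                       # made-up data
a, b = least_squares_line(xs, ys)          # intercept 0.6, slope 0.8
best = rms_residual(xs, ys, a, b)

# Every perturbed line has a larger root-mean-square residual
others = [rms_residual(xs, ys, a + da, b + db)
          for da in (-0.5, 0, 0.5) for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]
print(round(best, 3), round(min(others), 3))
```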

Why only r standard deviations?
Only moving r standard deviations away from the average may be counterintuitive; if height goes up by one SD, shouldn't weight too? Here's an example that I hope will help clarify this concept. A student is taking her first course in statistics, and we want to predict whether she will do well in the course or not. Suppose we know that last semester, she got an A in math. Now suppose instead that we know that last semester, she got an A in pottery. These two pieces of information are not equally informative for predicting how well she will do in her statistics class. We need to balance our baseline guess (that she will receive an average grade) with this new piece of information, and the correlation coefficient tells us how much weight the new information should carry.

Fathers and sons again
[Figure: scatter plot of Pearson's height data. Horizontal axis: father's height (inches), 60 to 80; vertical axis: son's height (inches), 60 to 80.]

How regression got its name
Because the correlation coefficient is, in practice, always less than 1, the regression line will always be flatter than the "x goes up by 1 SD, y goes up by 1 SD" rule. Galton called this phenomenon "regression to mediocrity", and this is where regression gets its name. People frequently read too much into the regression effect; this is called the regression fallacy.

The regression fallacy, example #1
A group of subjects is recruited into a study. Their initial blood pressure is taken, then they take an herbal supplement for a month, and their blood pressure is taken again. The mean blood pressure was the same, both before and after. However, subjects with high blood pressure tended to have lower blood pressure one month later, and subjects with low blood pressure tended to have higher blood pressure later. Does this supplement act to stabilize blood pressure?

Why does regression to the mean happen?
Not really; the same effect would occur if they took a placebo. Why? Consider a person with a measured blood pressure 2 SDs above average. It's possible that the person has a true blood pressure 1 SD above average, but happened to have a high first measurement; it's also possible that the person has a true blood pressure 3 SDs above average, but happened to have a low first measurement. However, the first explanation is much more likely.
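This argument can be illustrated with a simulation in which nothing at all happens between the two measurements. The setup is an assumption made for illustration: each subject has a true level, and each measurement adds independent noise of the same size. Subjects selected for a high first measurement then have, on average, a lower second measurement, although still above the overall average.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable
n = 20_000
true_bp = [random.gauss(0, 1) for _ in range(n)]       # true level (arbitrary units)
first = [t + random.gauss(0, 1) for t in true_bp]      # first measurement
second = [t + random.gauss(0, 1) for t in true_bp]     # second measurement, no treatment

# Select subjects whose first measurement was high
high = [i for i in range(n) if first[i] > 2]

mean_first_high = statistics.mean([first[i] for i in high])
mean_second_high = statistics.mean([second[i] for i in high])
mean_second_all = statistics.mean(second)

print(round(mean_first_high, 2), round(mean_second_high, 2), round(mean_second_all, 2))
```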

The regression fallacy, example #2
In professional sports, some first-year players have outstanding years and win Rookie of the Year awards. They often fail to live up to expectations in their second years. Writers call this the "sophomore slump", and come up with elaborate explanations for it.

The regression fallacy, example #3
An instructor standardizes her midterm and final so that the class average is 50 and the SD is 10 on both tests. She has taught this class many times, and the correlation between the tests is always around 0.5. This year, she decides to do something different: she takes the 10 students with the lowest scores on the midterm and gives them special tutoring. On the final, all ten students score above 50; can this be explained by the regression effect? No! The regression effect can only take these students closer to the average; the fact that they all score above average indicates that the tutoring really did work.
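The regression prediction for this example is simple because both tests have the same mean and SD: the expected final score lies halfway (r = 0.5) between the midterm score and 50. The sketch below uses hypothetical midterm scores; the function name is illustrative.

```python
MEAN, R = 50.0, 0.5   # both tests have mean 50 and SD 10; correlation 0.5

def expected_final(midterm):
    """Regression prediction of the final score from the midterm score."""
    return MEAN + R * (midterm - MEAN)

for midterm in (25, 35, 45):              # hypothetical low midterm scores
    print(midterm, expected_final(midterm))
# 25 37.5
# 35 42.5
# 45 47.5
```

Every predicted score is closer to 50 than the midterm score was, but none is above 50, so regression alone does not predict above-average finals for these students.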