Chapter 11: r.m.s. error for regression



Similar documents
Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

MEASURES OF VARIATION

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Exercise 1.12 (Pg )

Session 7 Bivariate Data and Analysis

Getting Correct Results from PROC REG

Univariate Regression

Simple Regression Theory II 2010 Samuel L. Baker

The Correlation Coefficient

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Unit 7: Normal Curves

FREE FALL. Introduction. Reference Young and Freedman, University Physics, 12 th Edition: Chapter 2, section 2.5

Father s height (inches)

Lecture 13/Chapter 10 Relationships between Measurement (Quantitative) Variables

Plot the following two points on a graph and draw the line that passes through those two points. Find the rise, run and slope of that line.

Week 4: Standard Error and Confidence Intervals

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Describing Relationships between Two Variables

Name Partners Date. Energy Diagrams I

2. Simple Linear Regression

The correlation coefficient

Graphing Rational Functions

Descriptive Statistics

Mathematics (Project Maths Phase 1)

MODERN APPLICATIONS OF PYTHAGORAS S THEOREM

EXPERIMENTAL ERROR AND DATA ANALYSIS

Graphing Quadratic Functions

Activity 5. Two Hot, Two Cold. Introduction. Equipment Required. Collecting the Data

Chapter 7: Simple linear regression Learning Objectives

Because the slope is, a slope of 5 would mean that for every 1cm increase in diameter, the circumference would increase by 5cm.

Polynomial Degree and Finite Differences

10.1. Solving Quadratic Equations. Investigation: Rocket Science CONDENSED

Straightening Data in a Scatterplot Selecting a Good Re-Expression Model

1.7 Graphs of Functions

AP Statistics Solutions to Packet 2

Means, standard deviations and. and standard errors

The Logistic Function

Diagrams and Graphs of Statistical Data

DATA INTERPRETATION AND STATISTICS

Determining the Acceleration Due to Gravity

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Updates to Graphing with Excel

Summary of important mathematical operations and formulas (from first tutorial):

Temperature Measure of KE At the same temperature, heavier molecules have less speed Absolute Zero -273 o C 0 K

Algebra II End of Course Exam Answer Key Segment I. Scientific Calculator Only

Foundations for Functions

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

AP * Statistics Review. Descriptive Statistics

Measurement with Ratios

Linear Equations. Find the domain and the range of the following set. {(4,5), (7,8), (-1,3), (3,3), (2,-3)}

2.2 Derivative as a Function

Simple linear regression

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Representing Vector Fields Using Field Line Diagrams

Coordinate Plane, Slope, and Lines Long-Term Memory Review Review 1

the Median-Medi Graphing bivariate data in a scatter plot

The Big Picture. Correlation. Scatter Plots. Data

Basic Graphing Functions for the TI-83 and TI-84

Homework 8 Solutions

DETERMINING SOLAR ALTITUDE USING THE GNOMON. How does the altitude change during the day or from day to day?

is the degree of the polynomial and is the leading coefficient.

Lesson 4 Measures of Central Tendency

Elements of a graph. Click on the links below to jump directly to the relevant section

Comparing Means in Two Populations

Performance. Power Plant Output in Terms of Thrust - General - Arbitrary Drag Polar

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

CURVE FITTING LEAST SQUARES APPROXIMATION

Confidence Intervals for One Standard Deviation Using Standard Deviation


Interpreting Data in Normal Distributions

Mathematical goals. Starting points. Materials required. Time needed

Scatter Plots with Error Bars

MATH 110 Landscape Horticulture Worksheet #4

Chapter 27: Taxation. 27.1: Introduction. 27.2: The Two Prices with a Tax. 27.2: The Pre-Tax Position

In the Herb Business, Part III Factoring and Quadratic Equations

The common ratio in (ii) is called the scaled-factor. An example of two similar triangles is shown in Figure Figure 47.1

State Newton's second law of motion for a particle, defining carefully each term used.

17. SIMPLE LINEAR REGRESSION II

Torque and Rotary Motion

Map Patterns and Finding the Strike and Dip from a Mapped Outcrop of a Planar Surface

-2- Reason: This is harder. I'll give an argument in an Addendum to this handout.

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Lecture 2: Discrete Distributions, Normal Distributions. Chapter 1

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Graphing Motion. Every Picture Tells A Story

Descriptive Statistics

Students summarize a data set using box plots, the median, and the interquartile range. Students use box plots to compare two data distributions.

25 Integers: Addition and Subtraction

Excel Tutorial. Bio 150B Excel Tutorial 1

Chapter 4: Average and standard deviation

What Does the Normal Distribution Sound Like?

a) Find the five point summary for the home runs of the National League teams. b) What is the mean number of home runs by the American League teams?

AP Physics 1 and 2 Lab Investigations

ACTIVITY 6: Falling Objects

Correlation key concepts:

Relationships Between Two Variables: Scatterplots and Correlation

Measurement & Data Analysis. On the importance of math & measurement. Steps Involved in Doing Scientific Research. Measurement

2. Select Point B and rotate it by 15 degrees. A new Point B' appears. 3. Drag each of the three points in turn.

Logo Symmetry Learning Task. Unit 5

Transcription:

Chapter 11: r.m.s. error for regression Context................................................................... 2 Prediction error 3 r.m.s. error for the regression line............................................... 4 68% 95% rule........................................................... 5 Computing the r.m.s. error.................................................... 6 r.m.s. error vs. SD of y...................................................... 7 When to use which one?..................................................... 8 Computing the r.m.s. error.................................................... 9 r.m.s. error vs. r........................................................... 10 Residual plots 11 Residual plot............................................................. 12 Example................................................................ 13 Example................................................................ 14 Looking at vertical strips 15 r.m.s. error............................................................... 16 Homoscedastic vs heteroscedastic............................................... 17 Use normal curve inside strip.................................................. 18 Example................................................................ 19 1

Context In Chapter 10 we looked at predicting the value for y from a given value of x. How precise are these estimates? To determine that, we will look at the r.m.s. error of the regression line Other new things: residual, residual plot, homoscedasticity, heteroscedasticity, normal approximation for points within narrow strip 2 / 19 Prediction error 3 / 19 r.m.s. error for the regression line See overhead prediction error = actual y value - predicted y value To determine the typical size of these errors, we take the r.m.s. size of them: (error#1) 2 + (error#2) 2 + + (error#n) 2 n This is the r.m.s. error of the regression line, which says how far typical points are above or below the regression line 4 / 19 68% 95% rule 68% 95% rule: About 68% of the points on a scatter diagram are within one r.m.s. error from the regression line About 95% of the points on a scatter diagram are within two r.m.s. errors from the regression line See overhead 5 / 19 Computing the r.m.s. error We know how to compute the r.m.s. error: For each data point, predict y using the regression method Compute the prediction errors: actual value - predicted value Take the r.m.s. size of the prediction errors That is a lot of work! You ll learn a shortcut soon. 6 / 19 2

r.m.s. error vs. SD of y Compare the following two lines for predicting y a horizontal line through the average of y the regression line What are the differences? The horizontal line ignores the value of x The regression line uses the value of x The prediction errors for the horizontal line are expected to be larger. Why? Because we don t use information on x. The r.m.s. size of the prediction errors for the horizontal line is just the SD of y. Why? Because now the prediction errors are the deviations from the average, and the r.m.s. size of those is the SD of y. 7 / 19 When to use which one? So when to use the r.m.s. error and when to use the SD of y? If there is no information on x, then: The best prediction for y is its average In other words: we use the horizontal line This prediction is likely to be off by the e r.m.s. error of this line, which is the SD of y If we know the value of x, and the scatter diagram is football shaped, then: The best prediction for y is the regression estimate In other words, we use the regression line This prediction is likely to be off by the r.m.s. error of the regression line 8 / 19 Computing the r.m.s. error So the r.m.s. error for the regression line is smaller than the SD of y. By how much? By a factor 1 r 2 : r.m.s. error of the regression line = 1 r 2 SD of y That is a nice shortcut! How does the r.m.s. error of the regression line compare to the r.m.s. error of other straight lines? For football shaped diagrams, the regression line is the line with the smallest r.m.s. error (see Chapter 12). 9 / 19 3

r.m.s. error vs. r Both the r.m.s. error of the regression line and r tell us about the spread/clustering of the points around the regression line r measures clustering. If r close to 1 or 1, then the points are tightly clustered. r.m.s. error measures distance of points to the regression line. If the r.m.s. error is small, the points are close to the line. What are the units of the r.m.s. error? The same as the original units of y Why need r.m.s. error when we already had r, which tells us about the clustering of the points? r measures clustering relative to the SD the r.m.s. error measures the spread in the original units of y 10 / 19 Residual plots 11 / 19 Residual plot Prediction errors are also called residuals In the scatter diagram we usually plot (x-value, y-value) We can make a new plot, by plotting (x-value, residual) This is called a residual plot See overhead The average of the residuals is always equal to zero. Hence, the SD of the residuals is the same as the r.m.s. error of the regression line. Why? Why do we make residual plots? To check if we can use the regression method: The residual plot should have no pattern, and should look like a horizontal football 12 / 19 Example James D. Forbes, Scottish scientist in 1840s - 1850s. He wanted to estimate the altitude above see level. He knew that altitude could be determined from atmospheric pressure, measured by a barometer. In the 1840s it was difficult to transport the fragile barometers. Therefore he wanted to use the boiling temperature of water to estimate the atmospheric pressure, and that would then tell him the altitude. 13 / 19 4

Example Scatter diagram Residual plot Pressure in inches of Hg 22 24 26 28 30 Residuals 0.2 0.2 0.6 Scatter diagram Residual plot Log(Pressure in inches of Hg) 3.1 3.2 3.3 3.4 Residuals 0.00 0.02 14 / 19 Looking at vertical strips 15 / 19 r.m.s. error See overhead (Exercise 3 on page 190) Take the points in a narrow vertical strip The SD of their y-coordinates is given by the r.m.s. error of the regression line, if the scatter diagram is homoskedastic (football shaped). 16 / 19 5

Homoscedastic vs heteroscedastic Some Greek: homos = same, heteros = different, skedastos = scatter homeskedastic (football shaped): Vertical strips in scatter diagram have similar spread Note that the range of the points is larger in the middle of the football, due to larger number of points there. But the SDs of the y-values in each strip are similar. This SD is given by the r.m.s. error of the regression line heteroskedastic (not football shaped): The vertical strips have different amounts of spread The r.m.s. error of the regression line gives a sort of average spread across all vertical strips, and cannot be used for any one vertical strip 17 / 19 Use normal curve inside strip Suppose we have a football shaped scatter diagram Consider points in a narrow vertical strip as new data set The new average of this data set is the regression estimate The new SD of this data set is the r.m.s. error of regression line The football shape means that we can use the normal approximation Hence, we can use the normal approximation as before, with the new average and the new SD. 18 / 19 6

Example Data on height and forearm length of 1000 men: height: average = 68 inches, SD = 2.7 inches forearm length: average = 18 inches, SD = 1 inch r = 0.80 the scatter diagram is football shaped What percentage of men have forearms longer than 20 inches? See overhead. Straightforward normal approximation Of the men who are 71 inches tall, what percentage have forearms longer than 20 inches? See overhead: Step 1: Draw picture Step 2: Find new average of strip: regression estimate Step 3: Find new SD of strip: r.m.s. error Step 4: Then do normal approximation in the strip 19 / 19 7