Simple Linear Regression



Similar documents
" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Univariate Regression

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

FINAL EXAM SECTIONS AND OBJECTIVES FOR COLLEGE ALGEBRA

Simple linear regression

Straightening Data in a Scatterplot Selecting a Good Re-Expression Model

Fairfield Public Schools

Chapter 7: Simple linear regression Learning Objectives

MTH 140 Statistics Videos

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

STAT 360 Probability and Statistics. Fall 2012

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

How to Win the Stock Market Game

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Simple Predictive Analytics Curtis Seare

Regression Analysis: A Complete Example

2013 MBA Jump Start Program. Statistics Module Part 3

Name: Date: Use the following to answer questions 3-4:

4. Simple regression. QBUS6840 Predictive Analytics.

11. Analysis of Case-control Studies Logistic Regression

Factors affecting online sales

Statistics 151 Practice Midterm 1 Mike Kowalski

Simple Regression Theory II 2010 Samuel L. Baker

Nominal and Real U.S. GDP

430 Statistics and Financial Mathematics for Business

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Introduction to Regression and Data Analysis

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Using Excel for Statistical Analysis

International Statistical Institute, 56th Session, 2007: Phil Everson

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance


RUTHERFORD HIGH SCHOOL Rutherford, New Jersey COURSE OUTLINE STATISTICS AND PROBABILITY

Correlation and Regression

AP Statistics: Syllabus 1

Statistics 104: Section 6!

Manhattan Center for Science and Math High School Mathematics Department Curriculum

How To Run Statistical Tests in Excel

Data Mining Introduction

The Assumption(s) of Normality

UNIT 1: COLLECTING DATA

Problem Solving and Data Analysis

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Constructing and Interpreting Confidence Intervals

What is the Purpose of College Algebra? Sheldon P. Gordon Farmingdale State College of New York

3. Data Analysis, Statistics, and Probability

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Simple Linear Regression Inference

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

Nonlinear Regression Functions. SW Ch 8 1/54/

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

Street Address: 1111 Franklin Street Oakland, CA Mailing Address: 1111 Franklin Street Oakland, CA 94607

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

2. Simple Linear Regression

Data Analysis Tools. Tools for Summarizing Data

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

Week TSX Index

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Elements of statistics (MATH0487-1)

UNIVERSITY of MASSACHUSETTS DARTMOUTH Charlton College of Business Decision and Information Sciences Fall 2010

Algebra 1 Course Information

Preparation of Two-Year College Mathematics Instructors to Teach Statistics with GAISE Session on Assessment

Some Essential Statistics The Lure of Statistics

PCHS ALGEBRA PLACEMENT TEST

Recall this chart that showed how most of our course would be organized:

Lecture 1: Review and Exploratory Data Analysis (EDA)

Exercise 1.12 (Pg )

Homework 8 Solutions

Multiple Regression: What Is It?

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

17. SIMPLE LINEAR REGRESSION II

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

Chapter 23. Inferences for Regression

COMMON CORE STATE STANDARDS FOR

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Local outlier detection in data forensics: data mining approach to flag unusual schools

Geostatistics Exploratory Analysis

Statistics Class Level Test Mu Alpha Theta State 2008

STT 200 LECTURE 1, SECTION 2,4 RECITATION 7 (10/16/2012)

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

table to see that the probability is (b) What is the probability that x is between 16 and 60? The z-scores for 16 and 60 are: = 1.

Case Study in Data Analysis Does a drug prevent cardiomegaly in heart failure?

COLLEGE ALGEBRA. Paul Dawkins

When to Use a Particular Statistical Test

USING A TI-83 OR TI-84 SERIES GRAPHING CALCULATOR IN AN INTRODUCTORY STATISTICS CLASS

ch12 practice test SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

Description. Textbook. Grading. Objective

Formula for linear models. Prediction, extrapolation, significance test against zero slope.

Common Core Unit Summary Grades 6 to 8

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Transcription:

STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze data, and want to learn more: take STAT 210 (Regression Analysis) Applied, focused on data analysis Recommended for any major involving data analysis Only prerequisite is STAT 101 If you like math and want to learn more of the mathematical theory behind what we ve learned: take STAT 230 (Probability) and then STAT 250 (Mathematical Statistics) Prerequisite: multivariable calculus Review Presidential Elections Which plot goes with the line y = x + 10? (a) (b) We can build a model using data from past elections to predict an incumbent s margin of victory based on approval rating (c) (d) Presidential Elections What was Obama s predicted margin of victory, based on his approval rating on the day of the election (50%)? Prediction We would like to use the regression equation to predict y for a certain value of x For useful predictions, we also want interval estimates We will predict the value of y at x = x 1

Point Estimate The point estimate for the average y value at x=x is simply the predicted value: ŷ ˆ ˆ x 0 1 Alternatively, you can think of it as the value on the line above the x value The uncertainty in this point estimate comes from the uncertainty in the coefficients Confidence Intervals We can calculate a confidence interval for the average y value for a certain x value We are 95% confident that the average y value for x=x lies in this interval Equivalently, the confidence interval is for the point estimate, or the predicted value This is the amount the line is free to wiggle, and the width of the interval decreases as the sample size increases Bootstrapping We need a way to assess the uncertainty in predicted y values for a certain x value any ideas? Take repeated samples, with replacement, from the original sample data (bootstrap) Each sample gives a slightly different fitted line If we do this repeatedly, take the middle P% of predicted y values at x for a confidence interval of the predicted y value at x Bootstrap CI Middle 95% of predicted values gives the confidence interval for average (predicted) margin of victory for an incumbent president with an approval rating of 50% Confidence Interval Confidence Interval For x = 50%: (1.07, 9.52) We are 95% confident that the average margin of victory for incumbent U.S. presidents with approval ratings of 50% is between 1.07 and 9.52 percentage points But wait, this still doesn t tell us about a particular incumbent! We don t care about the average, we care about an interval for one incumbent president with an approval rating of 50%! 2

Prediction Intervals We can also calculate a prediction interval for y values for a certain x value We are 95% confident that the y value for x = x lies in this interval This takes into account the variability in the line (in the predicted value) AND the uncertainty around the line (the random errors) Intervals Intervals A confidence interval has a given chance of capturing the mean y value at a specified x value A prediction interval has a given chance of capturing the y value for a particular case at a specified x value For a given x value, which will be wider? a) Confidence interval b) Prediction interval Intervals As the sample size increases: the standard errors of the coefficients decrease we are more sure of the equation of the line the widths of the confidence intervals decrease for a huge n, the width of the CI will be almost 0 The prediction interval may be wide, even for large n, and depends more on the correlation between x and y (how well y can be linearly predicted by x) Prediction Interval Based on the data and the simple linear model: The predicted margin of victory for an incumbent with an approval rating of 50% is 5.3 percentage points We are 95% confident that the margin of victory (or defeat) for an incumbent with an approval rating of 50% will be between -8.8 and 19.4 percentage points Formulas for Intervals NOTE: You will never need to use these formulas in this class you will just have RStudio do it for you. Confidence Interval: yˆ t s e 2 1 x x n ( n 1) s Prediction Interval: yˆ t s e 2 x x x 2 1 1 n ( n1) s 2 x s e : estimate for the standard deviation of the residuals 3

Conditions Inference based on the simple linear model is only valid if the following conditions hold: 1) Linearity 2) Constant Variability of Residuals 3) Normality of Residuals Linearity The relationship between x and y is linear (it makes sense to draw a line through the scatterplot) Charlie 1 dog year = 7 human years Linear: human age = 7 dog age LINEAR Dog Years From www.dogyears.com: The old rule-of-thumb that one dog year equals seven years of a human life is not accurate. The ratio is higher with youth and decreases a bit as the dog ages. ACTUAL All models are wrong, but some are useful -George Box Simple Linear Model y = β 0 + β 1 x + ε ε ~ N(0, σ ε ) Residuals (errors) Conditions for residuals: ~ 0, i N The errors are normally distributed The average of the errors is 0 The standard deviation of the errors is constant for all cases Check with a histogram (Always true for least squares regression) Constant vertical spread in the residual plot 4

Conditions not Met? Non-Constant Variability If the association isn t linear: Try to make it linear (transformation) If can t make linear, then simple linear regression isn t a good fit for the data If variability is not constant, or residuals are not normal: The model itself is still valid, but inference may not be accurate Non-Normal Residuals Simple Linear Regression 1) Plot your data! Association approximately linear? Outliers? Constant variability? 2) Fit the model (least squares) 3) Use the model Interpret coefficients Make predictions 4) Look at histogram of residuals (normal?) 5)Inference (extend to population) Inference on slope Confidence and prediction intervals 1. Plot the data: 1. Plot the data: Is the trend approximately linear? (a) Yes (b) No Are there obvious outliers? (a) Yes (b) No 5

1. Plot the data: 2. Fit the Model: Is there approximately constant variability? (a) Yes (b) No Margin = 36.5 + 0.84 Approval 3. Use the model: Margin = 36.5 + 0.84 Approval Which of the following is a correct interpretation? a) For every percentage point increase in margin of victory, approval increases by 0.84 percentage points b) For every percentage point increase in approval, predicted margin of victory increases by 0.84 percentage points c) For every 0.84 increase in approval, predicted margin of victory increases by 1 3. Use the model: Margin = 36.5 + 0.84 Approval The predicted margin of victory for an incumbent with an approval rating of 50%: 4. Look at histogram of residuals: Are the residuals approximately normally distributed? 5. Inference Should we do inference? (a) Yes (b) No (a) Yes (b) No 6

5. Inference Give a 95% confidence interval for the slope coefficient. Is it significantly different than 0? (a) Yes (b) No 5. Inference: We don t really care about the slope coefficient, we care about the margin of victory for a president with an approval rating of 50%. A 95% prediction interval for margin of victory for an incumbent with an approval rating of 50% is -8.8 to 19.4. Obama s margin of victory in 2012: 2.8 (50.6% Obama to 47.8% Romney) Conditions What if the conditions for inference aren t met??? Option 1 (best option): Take STAT 210 and learn more about modeling! Option 2: Try a transformation Transformations If the conditions are not satisfied, there are some common transformations you can apply to the response variable You can take any function of y and use it as the response, but the most common are log(y) (natural logarithm - ln) y (square root) y 2 (squared) e y (exponential)) Original Response, y: log(y) Original Response, y: y Logged Response, log(y): Square root of Response, y: 7

Original Response, y: y 2 Original Response, y: e y Squared response, y 2 : Exponentiated Response, e y : Transformations Interpretation becomes a bit more complicated if you transform the response it should only be done if it clearly helps the conditions to be met If you transform the response, be careful when interpreting coefficients and predictions The slope will now have different meaning, and predictions and confidence/prediction intervals will be for the transformed response Transformations You do NOT need to know which transformation would be appropriate for given data on the exam, but they may help if conditions are not met for Project 2 or for future data you may want to analyze Exam 2: In-Class In class Wednesday 4/2 Cumulative, but emphasis is on material since Exam 1 (Chapters 5-9, we skipped 8.2 and 9.2) Closed book, but allowed 2 double-sided pages of notes prepared by you You won t have technology, so won t have to compute p-values, but should be able to tell by looking at a distribution whether something is significant Key to Success WORK PRACTICE PROBLEMS! Recommended problems: Units C and D Essential Synthesis and review problems (solutions on course website under documents) In Unit D odd essential synthesis and review problems, skip D9, D17, D25, D47, D52-D58 (will cover after exam) Want more practice problems??? Full solutions to all odd problems in the book are on reserve in Perkins 8

Read Chapter 9 To Do Do Homework 7 (due Monday, 3/31) NO LATE HOMEWORK ACCEPTED SOLUTIONS WILL BE POSTED IMMEDIATELY AFTER CLASS Study for Exam 2 (Wednesday 4/2) 9