Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb 2 2015




Regression in your field
See website. You may complete this assignment in pairs.
Find a journal article in your field of study (or a field of interest) that uses multiple linear regression to answer its question of interest. Write a one-page report that includes:
- a citation to the article
- a brief summary of the background of the problem and the authors' question of interest
- a full specification of their regression model (i.e. full β form), defining all the variables included
- a statement about how their question of interest translates into questions about parameters in their regression model
- a summary of the results they report
Due Feb 27th

Case 11.01 Alcohol Metabolism
Women get drunk quicker than men. Women also develop alcohol-related liver disease more readily.
Theory: a particular enzyme responsible for alcohol metabolism in the stomach is more active in men.
"First-pass metabolism" = alcohol metabolized in the stomach, so it doesn't reach the bloodstream.
To determine first-pass metabolism, compare blood alcohol levels after drinking with levels after intravenous alcohol. Also measure enzyme activity.

μ{First pass metabolism | gast, female, alcoholic} = gast + female + alcoholic + female x alcoholic + gast x female + gast x alcoholic + gast x female x alcoholic

Outliers
Least squares estimates are not resistant to outliers. Identify outliers early on, so you don't end up tailoring the model to fit a few unusual observations.
An observation is said to be influential if the fitted model depends unduly on its value. For example, removing it:
- changes the parameter estimates greatly,
- changes conclusions, or
- changes which terms are included in the model.

With observations 31 & 32

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                          -1.6597     0.9996  -1.660   0.1099
Gastric                               2.5142     0.3434   7.322 1.46e-07 ***
SexFemale                             1.4657     1.3326   1.100   0.2823
AlcoholAlcoholic                      2.5521     1.9460   1.311   0.2021
Gastric:SexFemale                    -1.6734     0.6202  -2.698   0.0126 *
Gastric:AlcoholAlcoholic             -1.4587     1.0529  -1.386   0.1786
SexFemale:AlcoholAlcoholic           -2.2517     4.3937  -0.512   0.6130
Gastric:SexFemale:AlcoholAlcoholic    1.1987     2.9978   0.400   0.6928

Without observations 31 & 32

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                          -0.6797     1.3091  -0.519  0.60878
Gastric                               1.9212     0.6082   3.159  0.00455 **
SexFemale                             0.4858     1.4665   0.331  0.74360
AlcoholAlcoholic                      1.5722     1.8119   0.868  0.39493
Gastric:SexFemale                    -1.0805     0.7211  -1.498  0.14826
Gastric:AlcoholAlcoholic             -0.8658     0.9631  -0.899  0.37839
SexFemale:AlcoholAlcoholic           -1.2718     3.4669  -0.367  0.71725
Gastric:SexFemale:AlcoholAlcoholic    0.6058     2.3158   0.262  0.79608

Whether we have evidence that males and females have different slopes depends heavily on whether observations 31 and 32 are included.
A safe option: drop observations 31 & 32 and only make inferences for people with a gastric AD activity level < 3.
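One way to see this dependence is to turn each set of coefficients into a gastric slope per group. The sketch below is only an illustration: the numbers are the estimates reported in the two fits, and the function name `gastric_slope` is made up for this example.

```python
# Gastric-related estimates copied from the two fits reported in these notes.
with_31_32 = {
    "Gastric": 2.5142,
    "Gastric:SexFemale": -1.6734,
    "Gastric:AlcoholAlcoholic": -1.4587,
    "Gastric:SexFemale:AlcoholAlcoholic": 1.1987,
}
without_31_32 = {
    "Gastric": 1.9212,
    "Gastric:SexFemale": -1.0805,
    "Gastric:AlcoholAlcoholic": -0.8658,
    "Gastric:SexFemale:AlcoholAlcoholic": 0.6058,
}


def gastric_slope(coefs, female, alcoholic):
    """Slope of mean first-pass metabolism on gastric activity for one group.

    Hypothetical helper: adds the interaction terms that apply to the group.
    """
    slope = coefs["Gastric"]
    if female:
        slope += coefs["Gastric:SexFemale"]
    if alcoholic:
        slope += coefs["Gastric:AlcoholAlcoholic"]
    if female and alcoholic:
        slope += coefs["Gastric:SexFemale:AlcoholAlcoholic"]
    return slope


for label, coefs in [("with 31 & 32", with_31_32), ("without 31 & 32", without_31_32)]:
    male_slope = gastric_slope(coefs, female=False, alcoholic=False)
    female_slope = gastric_slope(coefs, female=True, alcoholic=False)
    print(f"{label}: male slope {male_slope:.4f}, female slope {female_slope:.4f}")
```

For non-alcoholics, the estimated female slope is about 0.84 in both fits, while the estimated male slope drops from 2.5142 to 1.9212 when the two observations are removed.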

Case influence statistics
Case influence statistics help identify observations that may be influential.
Sometimes influential observations won't show up in our usual plots.

Case influence statistics
Leverage
Measures the distance of the observation from the average explanatory values (taking correlation into account). High leverage = unusual combination of explanatory values = possibility to be influential.
Studentized residuals
The residual divided by its expected variation. High residual = observation far from fitted line.
Cook's distance
The effect on estimated parameters when the observation is dropped out. High Cook's distance = influential on parameter estimates.
- a number for every observation
- you don't need to know the formulas
- use them together to understand influential points

High Cook's distance means the point changes the regression estimates.
High leverage means the point has the potential to be influential.
High studentized residual means the point is unusual compared to the modelled mean.
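The formulas are not required for this course, but a small sketch can show how the three statistics relate. The code below is an illustration under stated assumptions: it handles only simple linear regression (one explanatory variable), uses internally studentized residuals, runs on made-up data rather than the alcohol metabolism data, and the function name `case_influence` is invented for this example.

```python
import math


def case_influence(x, y):
    """Return (leverage, studentized residual, Cook's distance) for each case
    in a simple linear regression of y on x (intercept and slope)."""
    n = len(x)
    p = 2  # number of regression parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma_hat = math.sqrt(sum(e ** 2 for e in resid) / (n - p))

    results = []
    for xi, e in zip(x, resid):
        # Leverage: how far this x is from the average x.
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        # Studentized residual: residual divided by its estimated SD.
        stud_res = e / (sigma_hat * math.sqrt(1 - leverage))
        # Cook's distance combines residual size and leverage.
        cooks_d = (stud_res ** 2 / p) * leverage / (1 - leverage)
        results.append((leverage, stud_res, cooks_d))
    return results


# Made-up data: the last case has an unusual x value and sits off the trend.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 6.0]
y = [1.1, 1.4, 2.1, 2.4, 3.1, 3.0]

for case, (h, r, d) in enumerate(case_influence(x, y), start=1):
    print(f"case {case}: leverage {h:.2f}, studentized residual {r:.2f}, Cook's D {d:.2f}")
```

In this toy data set the last case has both the largest leverage and the largest Cook's distance. A case with high leverage but a small residual would have a much smaller Cook's distance, which is why the statistics are read together.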

Your turn Which would you expect to be large for observations 31 & 32?

Sleuth fits a different model

Next step: find a simple, good-fitting model.
You can test whether terms are necessary with F-tests.
Generally, if you think something is important, leave it in whether it is significant or not. Some people say (and I agree): if it was important enough for you to think about including it, leave it in.
Be careful about interpreting individual t-tests, especially when the test involves a variable that also appears in another term of the model.

Case 11.01 Alcohol Metabolism
There are not many alcoholics, and an extra-sum-of-squares F-test comparing the full model with the reduced model
μ{First pass metabolism | gast, female, alcoholic} = gast + female + gast x female
has a p-value of 0.93. So use this reduced model, i.e. no effect of alcoholism.
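For reference, the extra-sum-of-squares F statistic compares the residual sums of squares of the two models. The sketch below shows only the arithmetic. The residual sums of squares are hypothetical placeholders (the notes do not report them), the function name `extra_ss_f` is made up, and the degrees of freedom assume 32 observations with 8 coefficients in the full model and 4 in the reduced model.

```python
def extra_ss_f(rss_reduced, df_reduced, rss_full, df_full):
    """Extra-sum-of-squares F statistic for a reduced model nested in a full model."""
    extra_ss = rss_reduced - rss_full   # extra residual variation in the reduced model
    extra_df = df_reduced - df_full     # number of terms dropped
    return (extra_ss / extra_df) / (rss_full / df_full)


# Hypothetical residual sums of squares, NOT values from the case study.
# Degrees of freedom assume n = 32: 32 - 8 = 24 (full) and 32 - 4 = 28 (reduced).
f_stat = extra_ss_f(rss_reduced=110.0, df_reduced=28, rss_full=100.0, df_full=24)
print(f"F = {f_stat:.2f} on 4 and 24 degrees of freedom")
```

The p-value comes from comparing this statistic to an F distribution with those degrees of freedom. A small F statistic (large p-value) means the dropped terms explain little extra variation.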

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        0.06952    0.80195   0.087 0.931580
Gastric            1.56543    0.40739   3.843 0.000704 ***
SexFemale         -0.26679    0.99324  -0.269 0.790352
Gastric:SexFemale -0.72849    0.53937  -1.351 0.188455

No evidence of different intercepts, if different slopes are in the model.
No evidence of different slopes, if different intercepts are in the model.
Here:
- a multiplicative rather than additive difference makes more sense,
- both having the same zero intercept makes sense,
μ{First pass metabolism | gast, female, alcoholic} = β1 gast + β2 gast x female

                  Estimate Std. Error t value Pr(>|t|)
Gastric             1.5989     0.1249  12.800 3.20e-13 ***
Gastric:SexFemale  -0.8732     0.1740  -5.019 2.63e-05 ***

"Males had a higher first-pass metabolism than females even after accounting for differences in gastric AD activity (two-sided p-value = 0.0003 for a t-test for equality of male and female slopes when both intercepts are zero.) For a given level of gastric AD activity the mean first-pass metabolism for men is estimated to be 2.20 times as large as the mean first-pass alcohol metabolism for women."

This is a different way to interpret a difference in slopes, but it only applies if there are no intercepts.
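The "2.20 times" figure can be recovered from the two estimates. With no intercepts, the mean for men is β1 gast and the mean for women is (β1 + β2) gast, so their ratio is β1 / (β1 + β2) at every level of gastric activity. A short check, using only the estimates and standard errors in the table (the variable names are made up for this example):

```python
# Estimates and standard errors copied from the two-term fit.
male_slope = 1.5989          # coefficient on Gastric
slope_difference = -0.8732   # coefficient on Gastric:SexFemale
se_male_slope = 0.1249
se_difference = 0.1740

female_slope = male_slope + slope_difference
ratio = male_slope / female_slope

print(f"female slope: {female_slope:.4f}")
print(f"male-to-female ratio: {ratio:.2f}")

# Each t value is the estimate divided by its standard error. Small differences
# from the table are expected because the table inputs are rounded.
print(f"t values: {male_slope / se_male_slope:.2f}, {slope_difference / se_difference:.2f}")
```

If the model had intercepts, the ratio of the two means would change with gastric activity, so this single-ratio reading would not be available.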