Bivariate Data Cleaning


Bivariate Outliers in Simple Correlation/Regression Analyses

Imagine we are interested in the correlation between two variables. Being schooled about outliers, we examine the distribution of each variable before beginning the analysis.

Statistics
                                 X           Y
  N: Valid                      50          50
  N: Missing                     0           0
  Mean                     45.7200     46.4400
  Std. Deviation          14.94349      8.3950
  Skewness                   -.277       -.384
  Std. Error of Skewness      .337        .337
  Minimum                    17.00       29.00
  Maximum                    69.00       64.00
  25th percentile          34.0000     42.0000
  50th percentile          47.5000     47.5000
  75th percentile          58.0000     51.0000

For X: 1.5*hinge spread = 36, giving outlier boundaries of -2 & 94 -- no outliers.
For Y: 1.5*hinge spread = 13.3, giving outlier boundaries of 28.7 & 64.3 -- no outliers (but close).

So, we proceed with getting the correlation between the variables.

Correlations: Pearson correlation = .19, Sig. (2-tailed) = .184, N = 50. Clearly nonsignificant.

But we also knew enough to get the corresponding scatterplot.

[Scatterplot of Y (20-70) against X (10-70)]

The plot looks more like a positive correlation, except for a few notable cases. None of these cases are "outliers" on either X or Y, but they do seem to be bivariate outliers. That is, they are outliers relative to the central envelope of the scatterplot. They might also be "influential cases" in a bivariate sense, because they might be pivoting the regression line and lowering the sample r -- notice that all of them are away from the means of X and Y!
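For those who prefer syntax to the pull-down menus, here is a minimal sketch of these two steps. The variable names x and y are assumptions -- the output above does not show the actual names:

    CORRELATIONS
      /VARIABLES=x y
      /PRINT=TWOTAIL.

    GRAPH
      /SCATTERPLOT(BIVAR)=x WITH y.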

One way to check whether these are such "bivariate outliers" is to examine the residuals of the cases in the analysis. To do this, we obtain the bivariate regression formula, apply it back to each case to obtain y', and then compute the residual as y - y'. Actually, SPSS will do this for us within a regression run.

Analyze --> Regression --> Linear

Highlight and move the criterion and predictor variables. Click "Save" and check the types of Residuals you'd like.

Then we "Examine" the residual values. To do this, we can apply the same outlier formulas we apply to any input variable -- but applying this approach to the residuals means we are looking for cases that are specifically "bivariate outliers". Leaving out a few portions of the output, we get

Statistics: Unstandardized Residual
  N: Valid               50
  N: Missing              0
  Minimum          -9.75953
  Maximum          20.54999
  25th percentile   -4.0542
  50th percentile   .034447
  75th percentile 5.0527943

Applying the same formulas to the residuals gives outlier boundaries of -13.5 & 14.5 -- clearly there are bivariate outliers.

If we trim these cases, we can take a look at the before- and after-trimming regressions, to examine the "influence" of these cases. To trim them:

Data --> Select Cases, with the formula (res_1 ge -13.5) and (res_1 le 14.5)
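The same run can be set up in a syntax window. A sketch, again assuming the variables are named x and y (SPSS names the saved unstandardized residual RES_1 by default):

    REGRESSION
      /DEPENDENT y
      /METHOD=ENTER x
      /SAVE RESID.

    * Keep only the cases whose residual falls inside the outlier boundaries.
    COMPUTE keep = (res_1 >= -13.5 AND res_1 <= 14.5).
    FILTER BY keep.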

Here's the "Before-" and "After-Trimming" regression results. Before ANOVA b Summary b R R Square.9 a.037 a. Predictors: (Constant), b. Dependent Variable: Regression Residual a. Predictors: (Constant), b. Dependent Variable: 8.596 8.596.820.84 a 327.724 48 65.6 3246.320 49 (Constant) a. Dependent Variable: Coefficients a Unstandardized Coefficients Standardi zed Coefficien ts B Std. Error Beta t Sig. 4.680 3.708.240.000.04.077.9.349.84 Things to notice: The change in df error tells us we trimmed 4 cases (upper-left case on scatterplot is 2 cases) There was a substantial drop in MSerror, and in the Std Error of the regression coefficient Both indicate the "error" in the model was reduced (since we trimmed cases with large residuals) The r and b values are substantially larger and statistically significant After Summary b ANOVA b (Constant) R R Square.47 a.22 a. Predictors: (Constant), b. Dependent Variable: a. Dependent Variable: Coefficients a Unstandardized Coefficients Standardi zed Coefficien ts B Std. Error Beta t Sig. 36.556 3.26.695.000 Regression 458.252 458.252 2.5.00 a Residual 6.683 44 36.629 2069.935 45 a. Predictors: (Constant), b. Dependent Variable:.23.065.47 3.537.00 So -- check those scatterplots and examine residuals to look for bivariate outliers. When doing multiple regression, examine each of the X-Y plots at a minimum. Examining all the interpredictor plots can be important as well!

"Bivariate Outliers" In Group Comparison Analyses This time we have a variable "Z" that tells which group each participant was in -- = control and 2 = treatment. We know that there are no outliers on, so we do the ANOVA (after taking out the data selection command). We get.00 2.00 Descriptives N Mean Std. Deviation 22 45.0455 9.7633 28 47.5357 6.74664 50 46.4400 8.2437 We get no effect (darn). Then we consider, the purpose of each group is to represent their respective population. So, maybe our preliminary outlier analyses should be conducted separately for each group! ANOVA Between Groups Within Groups 76.40 76.40.35.292 3229.99 48 67.290 3306.320 49 Data Split Files All subsequent analyses will be done for each group defined by this variable.

Analyze --> Descriptive Statistics --> Explore, and get the "usual stuff" (only partial results shown below).

Group Z = 1

Descriptive Statistics (Z = 1.00)
  N = 22, Minimum = 30.00, Maximum = 59.00, Mean = 45.0455, Std. Deviation = 9.7633
  25th percentile = 36.0000, 50th percentile = 47.5000, 75th percentile = 54.0000

Outlier boundaries would be 9 and 81 -- no outliers.

Group Z = 2

Descriptive Statistics (Z = 2.00)
  N = 28, Minimum = 28.00, Maximum = 56.00, Mean = 47.5357, Std. Deviation = 6.74664
  25th percentile = 45.0000, 50th percentile = 47.5000, 75th percentile = 51.0000

Outlier boundaries would be 36 and 60 -- at least one "too small" outlier.
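The Explore procedure in syntax, run once with the split still in effect so each group gets its own output. A sketch, again assuming the outcome is named y:

    EXAMINE VARIABLES=y
      /PLOT BOXPLOT
      /STATISTICS DESCRIPTIVES
      /PERCENTILES(25,50,75).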

We can trim the outlying values in the Z = 2 group, using the following in the Select Cases command. What we're doing is:
1) selecting everybody in Z = 1 (that group has no outliers)
2) selecting only those cases from Z = 2 with values that are not outliers

When we do this, the ANOVA results are as follows. Notice what changes and what doesn't.

Descriptives
           N      Mean   Std. Deviation
  1.00    22   45.0455           9.7633
  2.00    27   49.3343          5.65457

ANOVA
                  Sum of Squares   df   Mean Square       F    Sig.
  Between Groups        298.869    1       298.869    4.867   .032
  Within Groups        2947.451   46        61.405
  Total                3246.320   47

Nothing changes for the Z = 1 group. For the Z = 2 group:
  the sample size drops by 1
  the mean increases (since all the outliers were "too small" outliers)
  the std decreases (because extreme values were trimmed)

The combined result is a significant mean difference -- the previous results with the "full data set" were misleading because of a single extreme case!!!
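Select Cases accepts exactly this kind of compound condition. A sketch, again with the assumed names y and z, and the cutoff of 36 from the Explore output:

    * Keep all of group 1, and only the non-outlying cases from group 2.
    COMPUTE keep2 = (z = 1) OR (z = 2 AND y >= 36).
    FILTER BY keep2.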

We can Winsorize the outlying values in Z = 2 using the following commands. These build a new variable y_wind that:
  has the value of y for those in the Z = 1 group -- no cases there need to be Winsorized
  has "too small" outliers changed to a value of 36 for the Z = 2 group -- Winsorizing

File --> New --> Syntax -- this will open a syntax window into which we can type useful commands.

The first line assigns the y score of each person in group Z = 1 as their y_wind score. The second line assigns the y score of the "non-outliers" in group Z = 2 as their y_wind score. The third line assigns a y_wind score of 36 (the smallest non-outlying score) to the "outliers" in group Z = 2. Then click on the right-pointing arrow to run the assignments. (See the syntax sketch at the end of this handout.)

The results of the Winsorized ANOVA are

Descriptives
           N      Mean   Std. Deviation
  1.00    22    43.688           9.8440
  2.00    28    48.607           5.8522
  Total   50   46.4400           8.3950

ANOVA
                  Sum of Squares   df   Mean Square       F    Sig.
  Between Groups        295.870    1       295.870    4.732   .042
  Within Groups        2947.45    47        62.539
  Total                3246.320   48

The mean and std of the Z = 2 group change in the same direction, but not as much, as with the trimmed data. Similarly, the MSE doesn't drop as much, but the result still changes to a significant mean difference.

Applying group-specific outlier analysis:
Do outlier analyses and trimming/Winsorizing separately for each group. This gets lengthy when working with factorial designs -- remember the purpose of each condition!!! Some suggest taking an analogous approach when doing regression analyses that involve a mix of continuous and categorical predictors -- examining each group defined by each categorical variable for outliers on each quantitative variable (and then following up for bivariate outliers among pairs of quantitative variables).
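A sketch of the three assignment lines described above, assuming the variables are named y and z (the cutoff of 36 is the smallest non-outlying score from the Explore output):

    IF (z = 1) y_wind = y.
    IF (z = 2 AND y >= 36) y_wind = y.
    IF (z = 2 AND y < 36) y_wind = 36.
    EXECUTE.

    * Turn the group-by-group split off, then rerun the ANOVA on the Winsorized scores.
    SPLIT FILE OFF.
    ONEWAY y_wind BY z
      /STATISTICS=DESCRIPTIVES.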