SPSS Regressions
Social Science Research Lab, American University, Washington, D.C.
Web: www.american.edu/provost/ctrl/pclabs.cfm
Tel: x3862
Email: SSRL@American.edu

Course Objective

This course is designed to give you a basic understanding of how to run regressions in SPSS.

Learning Outcomes

Learn how to make a scatter plot
Learn how to run and interpret simple regressions
Learn how to run a multiple regression

Note: This tutorial uses the World95.sav dataset, which can be found at our website.

Simple Regressions and the Scatter Plot

A simple regression analyzes the relationship between two variables.

Scatter plot

The relationship between two variables can be portrayed visually using a scatter plot. To create a scatter plot, go to Graphs>Legacy Dialogs>Scatter/Dot, then choose Simple. Plug in your variables: for your dependent variable choose Infant mortality (babymort), and for your independent variable choose Women's literacy (lit_fema).

Tip: The dependent variable is displayed on the vertical (Y) axis and the independent variable on the horizontal (X) axis.
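If you prefer syntax to the menus, the same plot can be produced with a short GRAPH command. This is a minimal sketch using the World95.sav variable names above:

* Scatter plot of infant mortality against female literacy.
GRAPH
  /SCATTERPLOT(BIVAR)=lit_fema WITH babymort.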

An SPSS scatter plot displaying the relationship between infant mortality and women's literacy

The relationship between the two variables is estimated as a linear, or straight-line, relationship, defined by the equation:

Y = aX + b

Here b is the intercept (the constant) and a is the slope. The line is mathematically calculated such that the sum of distances from each observation to the line is minimized. By definition, the slope indicates the change in Y as a result of a unit change in X. The straight line is also called the regression line or the fit line, and a is referred to as the regression coefficient.

Tip: The method of calculating the regression coefficient (the slope) is called ordinary least squares, or OLS. OLS estimates the slope by minimizing the sum of squared differences between each predicted value (aX + b) and the actual value of Y. One reason for squaring these distances is to ensure that all distances are positive.
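Written out, the OLS estimates have the standard textbook closed forms (with X̄ and Ȳ denoting the sample means of X and Y):

a = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
b = Ȳ − aX̄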

A positive regression coefficient indicates a positive relationship between the variables; the fit line will be upward sloping (see the scatter plot above). A negative regression coefficient indicates a negative relationship between the variables; the fit line will be downward sloping.

Testing for Statistical Significance

The test of significance of the regression slope is a key hypothesis test in regression analysis: it tells us whether the slope a is statistically different from 0. To determine whether the slope equals zero, a t-test is performed (for more information on t-tests, see the SPSS bivariate statistics tutorial). As a general rule, when the observations on the scatter plot lie closely around the fit line, the regression slope is more likely to be statistically significant; in other words, it will be more likely that the two variables are positively or negatively related. Fortunately, SPSS calculates the slope, the intercept, the standard error of the slope, and the level at which the slope is statistically significant.

Let's look at an example: determining whether infant mortality and women's literacy are related. Go to Analyze>Regression>Linear.

SPSS Linear Regression Window

Infant Mortality (babymort) is our dependent variable and Female Literacy (lit_fema) our independent variable. Click OK.
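The same regression can be run from syntax; a minimal sketch:

* Simple regression of infant mortality on female literacy.
REGRESSION
  /DEPENDENT babymort
  /METHOD=ENTER lit_fema.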

SPSS will produce a series of tables which tell us about the nature of the relationship between these two variables.

In the first table, the R², also called the coefficient of determination, is very useful. It measures the proportion of the total variation in Y about its mean that is explained by the regression of Y on X. In this case, our regression explains 70.8% of the variation in infant mortality. Typically, values of R² below 0.2 are considered weak; between 0.2 and 0.4, moderate; and above 0.4, strong.

In the second table, we will focus on the F-statistic. By computing this statistic, we test the hypothesis that none of the explanatory variables helps explain variation in Y about its mean. The information to pay attention to here is the probability shown as Sig. in the table. If this probability is below 0.05, we conclude that the F-statistic is large enough to reject the hypothesis that none of the explanatory variables helps explain variation in Y. This test is like a test of significance of the R².
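For reference, both quantities have standard definitions that are not specific to SPSS. With Ŷi denoting the predicted values, n the number of observations, and k the number of explanatory variables:

R² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²
F = (R² / k) / ((1 − R²) / (n − k − 1))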

Finally, the last table will help us determine whether infant mortality and women's literacy are significantly related, and the direction and strength of their relationship. The first important thing to note is that the sign of the coefficient of Females who read (%) is negative. It confirms our assumption (infant mortality decreases as women's education increases) and our visual analysis of the scatter plot (see the scatter plot above). Furthermore, the probability reported in the right column is very low. This implies that the slope a is statistically significant.

To be less abstract, let us recall what those coefficients mean: they are the slope and the intercept of the regression line, i.e. Y = -1.129X + 127.203. What does this mean? It means that when literacy increases by one unit (i.e. 1 percentage point), infant mortality on average falls by 1.13 per thousand. For example, a country with 80% female literacy would have a predicted infant mortality of 127.203 - 1.129 × 80 ≈ 36.9 per thousand. In sum, R² is high and the probabilities are low: WE ARE HAPPY!

Multiple Regressions

A multiple regression takes what we've just done and adds several more variables to the mix. It is the right tool whenever you think that your dependent variable is explained by more than one independent variable. In our empirical work, we might reasonably assume that infant mortality (Y) is explained not only by women's literacy (X1) but also by GDP per capita (X2) and daily calorie intake (X3). Therefore, we will test the following equation:

Y = a0 + a1X1 + a2X2 + a3X3 + error

Y is the dependent variable, a0 is the constant, X1, X2, and X3 are the independent variables with their respective coefficients a1, a2, and a3, and the error term reflects all the other factors that are not in the model.

Scatter Plot Matrix

Before performing any statistical work, do not forget to draw scatter plots of every independent variable against the dependent variable. Go to Graphs>Legacy Dialogs>Scatter/Dot, then choose Matrix Scatter.

SPSS Scatter Plot Matrix Window

Let's plug in these variables: babymort, lit_fema, and gdp_cap. This will produce a matrix of scatter plots.
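Again, a minimal syntax sketch for the same plot:

* Matrix of scatter plots for the dependent variable and two of the predictors.
GRAPH
  /SCATTERPLOT(MATRIX)=babymort lit_fema gdp_cap.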

A scatter plot matrix like this is helpful for visually detecting relationships between your variables. For instance, you might observe a non-linear relationship between two variables, in which case you should use different techniques (GDP per capita in general exhibits non-linear relationships with other variables).

Correlation Matrix

Before starting our multiple regression analysis, it is important to compute the correlation matrix. Go to Analyze>Correlate>Bivariate, select the three independent variables (calories, lit_fema, and gdp_cap), then click OK. The following table will show up in the output window.
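In syntax form (a minimal sketch, assuming the three predictors used in the regression below):

* Correlation matrix of the three candidate independent variables.
CORRELATIONS
  /VARIABLES=calories lit_fema gdp_cap
  /PRINT=TWOTAIL NOSIG.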

This preliminary step is important because independent variables should NOT be correlated with one another. If independent variables are correlated, this might affect the robustness of our results. In the fascinating world of statistics, this is referred to as the issue of multicollinearity. When we use multiple regression analysis, we attach a weight to each of the independent variables in order to explain the variation in Y. If two independent variables are strongly correlated, it becomes very hard to attach a weight to those variables because they basically convey the same information. As a result, the validity of our empirical work will be greatly affected. In general, when two of the independent variables are correlated, the easiest solution is to drop one of them and simply use the other.

In this particular case, all variables are strongly correlated with one another. I would recommend using only women's literacy (which is what we did in the previous section) and not GDP per capita or daily calorie intake. Nevertheless, let's conduct a multiple regression to illustrate its basic philosophy, even though our results will be sloppy.

Running your Regression

To run this regression, go to Analyze>Regression>Linear. Select babymort as the dependent variable and calories, lit_fema, and gdp_cap as the independent variables. Click OK. The following tables show up in the output window.
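For reference, the menu steps above correspond to this syntax (minimal sketch):

* Multiple regression of infant mortality on three predictors.
REGRESSION
  /DEPENDENT babymort
  /METHOD=ENTER calories lit_fema gdp_cap.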

The R² for our new model is greater than the one we obtained earlier (.811 vs. .711). However, the increase is small considering that we added two new variables. Clearly, this is due to the fact that the new independent variables are strongly correlated with the first and ultimately do not bring much extra information. One of the flaws of the R² is that it is sensitive to the number of included independent variables: adding independent variables can only increase the R². In contrast, the Adjusted R² accounts for the number of independent variables; it may rise or fall as more variables are added. Here, the Adjusted R² is greater than the one obtained in the previous section. Therefore, the extra information brought by the new variables outweighs the penalty for adding variables (assuming that we did not encounter the issue of multicollinearity).

Finally, as expected from our correlation matrix, we find that our independent variables are negatively correlated with infant mortality. If we look at the right column, however, we find that the puzzling result for GDP per capita is not statistically significant. This result is an illustration of what happens when there is extensive multicollinearity amongst your independent variables.
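If you want to quantify multicollinearity directly, the REGRESSION command can also report tolerance and VIF via the COLLIN and TOL keywords of its /STATISTICS subcommand. These diagnostics are not covered in this tutorial, so treat this as an optional sketch:

* Rerun the multiple regression with collinearity diagnostics (tolerance and VIF).
REGRESSION
  /STATISTICS=COEFF R ANOVA COLLIN TOL
  /DEPENDENT babymort
  /METHOD=ENTER calories lit_fema gdp_cap.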