By Hui Bian, Office for Faculty Excellence
Email: bianh@ecu.edu Phone: 328-5428 Location: 2307 Old Cafeteria Complex
Use multiple regression when you want to predict one variable from a combination of several variables, when you want to determine which variables are better predictors than others, or when you want to compare models.
It is a model for the relationship between a dependent variable and a collection of independent variables. According to the IBM SPSS manual, linear regression is used to model the value of a dependent scale variable based on its linear (straight-line) relationship to one or more predictors.
Regression equation: Y_predicted = b0 + b1x1 + b2x2 + ... + bpxp + e
Y_predicted: predicted score on the dependent variable
b0: intercept
p: number of predictors
b1 to bp: weights (partial regression coefficients, or slopes) for the predictors
x1 to xp: scores on the predictors
e: error of prediction
Positive and negative regression weights reflect the nature of the correlation between each predictor and the dependent variable.
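The equation above can be sketched as a small function. This is a minimal illustration, not SPSS output; the intercept, weights, and predictor scores below are made-up values.

```python
# Hedged sketch of the regression equation:
# y_hat = b0 + b1*x1 + ... + bp*xp (for one case).
def predict(b0, coefs, xs):
    """Return the model-predicted score b0 + sum of b_j * x_j."""
    return b0 + sum(b * x for b, x in zip(coefs, xs))

# Two hypothetical predictors: weights 0.5 and -2.0, scores 4.0 and 1.5
y_hat = predict(1.0, [0.5, -2.0], [4.0, 1.5])
print(y_hat)  # 1.0 + 2.0 - 3.0 = 0.0
```

The positive weight raises the predicted score as its predictor increases; the negative weight lowers it, mirroring the sign of the underlying correlation.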
(Figure: regression line in the X-Y plane, showing the intercept at X = 0 and the slope.)
(Figure: scatterplots illustrating a positive relationship, a negative relationship, and no relationship.)
The model is linear because increasing the value of the pth predictor by 1 unit increases the value of the dependent variable by bp units. b0 is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.
We use the least squares criterion to estimate the parameters. Least squares means the sum of the squared errors of prediction is minimized. Residual (error) = observed score of y minus predicted score of y. The resulting line best fits the data; the vertical distance between each observed value of y and the line is the residual.
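For simple (one-predictor) regression, the least-squares slope and intercept have closed forms: b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). A minimal sketch with made-up data (chosen so the line fits perfectly and every residual is 0):

```python
# Least-squares fit for simple regression, pure Python.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so residuals are all 0

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx                    # b1 = Sxy / Sxx
intercept = mean_y - slope * mean_x  # b0 = ybar - b1 * xbar

# Residual = observed y minus predicted y, the vertical distance to the line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(slope, intercept)  # 2.0 0.0
```

SPSS solves the same minimization for multiple predictors using matrix algebra; the criterion (minimum sum of squared residuals) is identical.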
In the scatterplot, we have an independent (X) variable and a dependent (Y) variable. Each point in the plot represents one case (one subject). The goal of the linear regression procedure is to fit a line through the points. The SPSS program computes a line such that the squared deviations of the observed points from that line are minimized. This general procedure is sometimes also referred to as least squares estimation.
Assumptions: normality, linearity, equal variance.
For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.
The error term has a normal distribution with a mean of 0. The variance of the error term is constant across cases and independent of the variables in the model.
Multicollinearity: moderate to high inter-correlations among the independent variables. It limits the size of R, makes the model unstable in terms of prediction, and makes it hard to interpret the significance of predictors.
Checking assumptions. Histogram of the standardized or studentized residuals (normality assumption). Scatter plots of: the dependent variable, standardized predicted values, standardized residuals, deleted residuals, adjusted predicted values, studentized residuals, or studentized deleted residuals.
Scatter plots: plot the standardized residuals (*ZRESID) against the standardized predicted values (*ZPRED) to check for linearity and equality of variances. Available in SPSS: Dependent, Standardized predicted values (*ZPRED), Standardized residuals (*ZRESID), Deleted residuals (*DRESID), Adjusted predicted values (*ADJPRED), Studentized residuals (*SRESID), Studentized deleted residuals (*SDRESID).
Plots from SPSS
Regression coefficients determine the relative importance of the significant predictors when the effects of the other predictors are controlled. Unstandardized regression coefficients (B) reflect raw-score values (different metrics). Standardized regression coefficients (β) put all the variables on the same metric.
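The two kinds of coefficients are related by the standard deviations of the variables: β_j = B_j × (SD of x_j / SD of y). A small sketch with hypothetical data and a hypothetical B:

```python
# Converting an unstandardized coefficient B to a standardized beta.
# x, y, and B below are made-up illustration values.
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
B = 0.6  # hypothetical unstandardized slope for this predictor

beta = B * statistics.stdev(x) / statistics.stdev(y)
print(round(beta, 3))  # 0.775
```

Because β is on a common metric, it can be compared across predictors measured in different units, which B cannot.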
Squared multiple correlation (R²): the model accounts for a certain amount of the variance of the dependent variable; that amount is R². Residual (prediction error): the difference between the predicted value and the observed score on the dependent variable.
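R² can be computed directly from the residuals as 1 minus the ratio of residual to total sum of squares. A minimal sketch; the observed and predicted scores below are made up for illustration.

```python
# R^2 as the share of variance explained: R^2 = 1 - SS_res / SS_tot.
observed  = [3.0, 5.0, 7.0, 9.0]
predicted = [3.5, 4.5, 7.5, 8.5]  # hypothetical model predictions

mean_y = sum(observed) / len(observed)
ss_tot = sum((y - mean_y) ** 2 for y in observed)                  # total variation
ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))    # prediction error
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # 0.95
```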
Dependent variable: the criterion variable; a scale (interval or ratio), quantitative variable. Independent variables: predictors or control variables; continuous or categorical. Inclusion of variables in the model should be based on theory and on empirical studies done by other researchers.
Dummy coding: 1 means the presence of something; 0 means the absence of something (the reference). Number of dummy variables = p - 1, where p = number of levels of the nominal variable. Each dummy variable is dichotomous (0, 1). The reference level is the baseline against which the other levels are compared.
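The p - 1 rule can be sketched for the three-level race variable used in the exercise below (White as the reference), giving 3 - 1 = 2 dummy variables:

```python
# Dummy-coding sketch: a 3-level nominal variable becomes 2 dichotomous ones.
def dummy_code(race):
    """Return (Dummy1, Dummy2): Black vs. White, Others vs. White."""
    return (1 if race == "Black" else 0,
            1 if race == "Others" else 0)

print(dummy_code("White"))   # (0, 0) -> reference level
print(dummy_code("Black"))   # (1, 0)
print(dummy_code("Others"))  # (0, 1)
```

The reference level gets 0 on every dummy, so each dummy coefficient in the regression compares one level against the reference.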
Exercise. One variable: a03 (race). Recode a03 into three categories (White, Black, and Others) and create a new variable named a03r (1 = White, 2 = Black, 3 = Others). Then recode a03r into two dummy variables, with White as the reference category. The two new dummy variables are Dummy1 (Black vs. White) and Dummy2 (Others vs. White).
Recode a03 into a03r. Response options for a03r: 1 = White, 2 = Black, 3 = Others
Recode a03 into a03r: Transform > Recode into different variables > Highlight a03 and click > type a03r > Click
Click Old and New Values button
         Dummy1   Dummy2
White    0        0
Black    1        0
Others   0        1

Dummy1: participants who are Black are coded 1; the other categories are coded 0.
Dummy2: participants who are Others are coded 1; the other categories are coded 0.
Transform > Recode into different variables > Highlight a03r and click > type Dummy1
Click Old and New Values button
Follow the same process to create Dummy2. You should have this window:
Example: we want to determine whether several predictors have an effect on problems of drug use among drug users (those who used any of alcohol, cigarettes, or marijuana in the last 30 days: a28, a29, and a30), while controlling for race (the two dummy variables). Dependent variable: aalcohol_problem (total score: 0-17).
Independent variables, including the two dummy variables and:
Frequency of marijuana use (a30: During the past 30 days, on how many days did you use marijuana? 1 = 0 days, 2 = 1-3 days, ..., 11 = 28-30 days)
Self-efficacy (a80r: How sure are you that you can avoid using alcohol, if offered by friends? 0 = Very sure, 1 = Somewhat sure to not sure)
Self-control (During the past 30 days, which of the following have you used to help you avoid or limit your alcohol, cigarette, or marijuana use? Total score ranges from 0 to 18; higher score = more self-control)
Peer norms (a93a: My friends think that it's okay for me to drink too much alcohol. 1 = Agree a lot, 2 = Agree, 3 = Disagree, 4 = Disagree a lot)
(Figure: diagram of the regression model for our study, with marijuana use, self-efficacy, self-control, and peer norms, plus error, predicting problems related to drug use.)
Enter: enters all independent variables in a single step. Stepwise: enters one independent variable at a time. At each step, the program performs the following calculations: for each variable currently in the model, it computes an "F-to-remove" statistic; for each variable not in the model, it computes an "F-to-enter" statistic. At the next step, the program automatically enters the variable with the highest F-to-enter statistic, or removes the variable with the lowest F-to-remove statistic. Each predictor is continually reassessed.
Forward: enters one independent variable at each step, the variable with the largest simple correlation with the dependent variable. Once a variable is entered into the model, it remains in the model. Backward: enters all independent variables in the analysis, then removes non-significant variables from the model; at each step the variable removed is the one whose loss would least decrease R².
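The first step of forward selection, picking the candidate with the largest simple correlation with the dependent variable, can be sketched as below. The data are made-up illustration values; real software also checks an entry criterion (e.g., F-to-enter) before admitting the variable.

```python
# Forward-selection first step: enter the predictor with the largest
# absolute simple correlation with the dependent variable.
def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

y = [1.0, 2.0, 3.0, 4.0]
predictors = {
    "x1": [1.0, 2.0, 2.9, 4.1],   # nearly collinear with y
    "x2": [4.0, 1.0, 3.0, 2.0],   # weakly related to y
}
first_entered = max(predictors, key=lambda k: abs(corr(predictors[k], y)))
print(first_entered)  # x1
```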
Data screening. The purpose of data screening is to check the assumptions of the regression model. Residual plots: used to check the constant-variance assumption; plot the standardized residuals (Y axis) against the standardized predicted values (X axis). If no assumption is violated, the standardized residuals should scatter randomly around a horizontal line at 0. Histogram and normal P-P plot of the standardized or studentized residuals: used to check the normality assumption.
Run the multiple regression analysis. First select cases with the condition a28 > 1 & a29 > 1 & a30 > 1, then go to Analyze > Regression > Linear > put aalcohol_problem into Dependent > put a80r, a30, a93a, self-control, and the two dummy variables into Independent(s).
Click Statistics
Click Plots
Click Save
From the Descriptive Statistics table, we know that a total of 202 drug users were in the study. The average drug-problem score is 3.47 (SD = 3.25). The method used is Enter, meaning that we entered all independent variables into the model simultaneously.
1. Model Summary
a. R is the Pearson correlation between the predicted and actual values of the dependent variable.
b. R² is the squared multiple correlation, representing the amount of variance in the dependent variable explained by the combination of the six predictors. Fourteen percent of the variance in drug problems is explained by the six predictors.
c. Adjusted R² is more conservative than R².
2. ANOVA table
The significant F value, F(6, 195) = 5.18, p < .01, indicates that there is a significant relationship between drug problems and the six predictors.
1. The regression equation is: Y = 2.275 + .258(Marijuana) + 1.128(Self-efficacy) + .088(Self-control) - .457(Peer norms) + .410(Dummy1) + .035(Dummy2)
2. B is the unstandardized regression coefficient and Beta is the standardized regression coefficient.
3. The t test and its significance show the outcome of the test for each independent variable.
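The fitted equation from the slide can be evaluated for an individual case. The coefficients come from the output above; the predictor scores plugged in below are hypothetical.

```python
# Predicted drug-problem score from the fitted equation on the slide.
def predicted_problems(marijuana, self_eff, self_ctrl, norms, d1, d2):
    return (2.275 + 0.258 * marijuana + 1.128 * self_eff
            + 0.088 * self_ctrl - 0.457 * norms
            + 0.410 * d1 + 0.035 * d2)

# Hypothetical case: a Black participant (Dummy1 = 1, Dummy2 = 0),
# marijuana = 5, self-efficacy = 1, self-control = 4, peer norms = 2
y_hat = predicted_problems(5, 1, 4, 2, 1, 0)
print(round(y_hat, 3))  # 4.541
```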
Collinearity diagnostics. Tolerance is the proportion of the variance in a given predictor that cannot be explained by the other predictors. When tolerances are close to 0, there is high multicollinearity and the standard errors of the regression coefficients will be inflated. A Variance Inflation Factor (VIF) greater than 2 is usually considered problematic (per the SPSS manual).
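Tolerance and VIF are simple transforms of each other: tolerance_j = 1 - R²_j (the R² from regressing predictor j on the other predictors), and VIF_j = 1 / tolerance_j. A minimal sketch with hypothetical R²_j values:

```python
# VIF from the R^2 of regressing one predictor on the others.
def vif_from_rsq(r_squared_j):
    tolerance = 1 - r_squared_j   # variance NOT explained by other predictors
    return 1.0 / tolerance

print(vif_from_rsq(0.0))   # 1.0  -> no multicollinearity
print(vif_from_rsq(0.5))   # 2.0  -> at the cutoff cited above
print(vif_from_rsq(0.9))   # 10.0 -> severe multicollinearity
```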
Histogram of standardized residuals
Normal P-P plot
1. We want to know whether the distribution of the errors matches a normal distribution.
2. If the selected variable matches the test distribution, the points cluster around a straight line.
Residual plot
1. Our residuals scatter randomly around 0.
2. The constant-variance assumption is not violated.
3. The standardized residual for case ID 1090 is 3.06.
1. The first two residual plots suggest that the error variance changes with the independent variable.
2. Neither of those patterns shows constant variance, so they indicate a violation of the equal-error-variance assumption.
3. The last, horizontal-band pattern suggests that the variance of the residuals is constant.
Zero-order correlation: the simple bivariate correlation between an independent variable and the dependent variable. Partial correlation: the correlation between an independent variable and the dependent variable after all other independent variables are controlled. Part correlation: the correlation between an independent variable and the dependent variable when the other independent variables are partialled out of that independent variable only. When squared, the part correlation represents the unique contribution of the independent variable to the model.
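For the two-predictor case, the part and partial correlations of x1 with y can be computed from the three pairwise (zero-order) correlations. The r values below are hypothetical illustration numbers, not output from the example analysis.

```python
# Part (semipartial) and partial correlation of x1 with y,
# controlling for x2, from pairwise correlations.
import math

r_y1, r_y2, r_12 = 0.6, 0.4, 0.3   # y-x1, y-x2, x1-x2 (hypothetical)

part    = (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12 ** 2)
partial = part / math.sqrt(1 - r_y2 ** 2)

print(round(part, 3), round(partial, 3))  # 0.503 0.549
# part**2 is the unique contribution of x1 to R^2
```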
Newly created variable: the standardized residual, ZRE_1. Run descriptive statistics on ZRE_1, e.g., using the Explore function.
Explore results
Explore results. The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple and does not do a good job of discriminating whether or not your data were sampled from a Gaussian distribution. An expert on normality tests, R. B. D'Agostino, makes a very strong statement: "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." ("Tests for Normal Distribution," in Goodness-of-Fit Techniques, Marcel Dekker, 1986).
Run the previous analysis again using the stepwise method. Analyze > Regression > Linear
1. This table lists how many models were produced in the process and which variable was entered or removed at each step.
2. No variable was removed at any step.
1. The Model Summary shows R² for each model.
2. Sig F Change tells us what contribution an extra IV makes when it is added to the model.
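The Sig F Change statistic tests whether the increase in R² from adding predictors is significant: F = ((R²_new - R²_old) / k) / ((1 - R²_new) / (n - p_new - 1)). The sketch below uses the study's n = 202 and a two-predictor final model, but the R² values are hypothetical, not taken from the output.

```python
# F statistic for the change in R^2 between two nested models.
def f_change(r2_old, r2_new, n, p_new, k_added):
    """k_added: predictors added; p_new: predictors in the larger model."""
    num = (r2_new - r2_old) / k_added
    den = (1 - r2_new) / (n - p_new - 1)
    return num / den

print(round(f_change(0.10, 0.14, 202, 2, 1), 2))  # 9.26
```

SPSS compares this F against an F distribution with (k_added, n - p_new - 1) degrees of freedom to produce the Sig F Change p value.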
1. The ANOVA table shows the F value for each model.
2. Both models are significant (p < .05).
3. The final model contains two predictors.
For self-efficacy, a high score means lower self-efficacy. The results show that drug users who used more marijuana and had lower self-efficacy were more likely to have drug-use problems.
Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage Publications.
Stevens, J. P. (2002). Applied multivariate statistics for the social sciences. Mahwah, NJ: Lawrence Erlbaum Associates.