Regression with Categorical and Continuous Independent Variables
Lecture 12, November 19, 2008
ERSH 8320
Lecture #12-11/19/2008, Slide 1 of 28
Today's Lecture: How regression works with categorical and continuous variables (Chapter 14).
Continuous and Categorical Independent Variables: Previous techniques used either categorical independent variables or continuous independent variables. Now we will look at what happens when we combine both categorical and continuous independent variables in a single analysis.
Example Data: An experiment was designed to study the effects of incentives and study time on retention of classroom material in students.
Study design:
Groups of students: Incentive or No Incentive. This is a categorical variable.
Amount of study time: 5, 10, 15, or 20 hours. We will consider this a continuous variable.
The dependent variable was score on a test (retention).
The Wrong Way to Analyze the Data: One way to analyze these data is to compute two regression lines, one for the Incentive Group and one for the No Incentive Group, and then look to see how these two lines differ (if at all). This is not the right approach (the right one will be shown next).
Two Regression Analyses (scatterplots of retention on study time, one per group; figures omitted):
Incentive Group: Y = 7.33 + 0.21X (R Sq Linear = 0.459)
No Incentive Group: Y = 2.50 + 0.27X (R Sq Linear = 0.708)
Do these equations seem different?
The Eyeball Approach: The slopes do not seem that different; 0.21 is fairly close to 0.27. The increase in test score as a function of study time is very similar in both incentive groups. There is, however, a large difference in intercepts: the base score (the score with no study time) is almost 5 points greater in the incentive group when the groups are modeled separately. Is that difference significant? Statistics needs evidence, not just eyeballs.
A Better Way: A better way to answer the question is to use a single statistical model. We will refer to a regression equation with both study time and incentive group as IVs as the full model. To set up a comparison, we first need to calculate the regression equation using the full model (both variables together). The model will have both main effects (Incentive Group and Study Time) as well as the interaction between incentive group and study time. Incentive is coded as 1 for No Incentive and -1 for Incentive. This is effect coding.
The Full Model: The full model is the model where incentive, study time, and their interaction are all included to predict an examinee's retention:
Y = a + b1 X1 + b2 X2 + b3 X1 X2
Where:
X1 is the effect-coded variable for the incentive group of an examinee (either a -1 or a 1).
X2 is the amount of time studied for the test.
X1 X2 is the product of the two variables, representing the interaction. To use the regression package in SPSS, we have to create this variable manually using the Transform function.
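The same model can be sketched with plain NumPy. This is a hypothetical illustration, not the lecture's actual data set: the data below are generated without noise from the coefficient estimates reported later in the slides, so ordinary least squares recovers them exactly.

```python
import numpy as np

# Hypothetical data generated from the slide's coefficient estimates
# (not the real 24-observation data set used in the lecture).
a, b1, b2, b3 = 4.917, -2.417, 0.237, 0.030

x1 = np.repeat([-1.0, 1.0], 4)               # effect code: -1 = Incentive, 1 = No Incentive
x2 = np.tile([5.0, 10.0, 15.0, 20.0], 2)     # study time in hours
y = a + b1 * x1 + b2 * x2 + b3 * x1 * x2     # noiseless retention scores

# Design matrix with a manually built interaction column, the analogue
# of creating the "interact" variable with SPSS's Transform function.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(coefs, 3))  # recovers a, b1, b2, b3 (no noise was added)
```

Because the outcome was generated directly from the model, the fitted coefficients match the generating values; with real data they would differ by sampling error.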
Full Model Results (SPSS output):
Model Summary
R = .909, R Square = .827, Adjusted R Square = .801, Std. Error of the Estimate = 1.22270
Predictors: (Constant), interact, Study Time, Incentive
ANOVA (Dependent Variable: Y - Retention)
Regression: Sum of Squares = 142.725, df = 3, Mean Square = 47.575, F = 31.823, Sig. = .000
Residual: Sum of Squares = 29.900, df = 20, Mean Square = 1.495
Total: Sum of Squares = 172.625, df = 23
Coefficients (Dependent Variable: Y - Retention)
(Constant): B = 4.917, Std. Error = .611, t = 8.042, Sig. = .000
Incentive: B = -2.417, Std. Error = .611, Beta = -.901, t = -3.953, Sig. = .001
Study Time: B = .237, Std. Error = .045, Beta = .493, t = 5.301, Sig. = .000
interact: B = .030, Std. Error = .045, Beta = .153, t = .672, Sig. = .509
Full Model Results: Estimated regression equation:
Y = 4.917 - 2.417 X1 + 0.237 X2 + 0.030 X1 X2
From the SPSS output we can tell the following:
There is no significant interaction between incentive group and study time (b3 = 0.030, p = 0.509). We will come to see that no interaction means the slope of the line is the same across all levels of the categorical variable.
There is a significant main effect of study time (b2 = 0.237, p < 0.001): regardless of incentive group, retention increases by 0.237 points for every additional hour of study time.
There is a significant main effect of incentive group (b1 = -2.417, p = 0.001): there is a significant difference in the (adjusted) mean value of retention between the two groups.
Further Interpretation: Because the full model included a categorical independent variable, we can decompose that model into two separate models, one for each group:
Incentive Group (X1 = -1):
Y = 4.917 - 2.417(-1) + 0.237 X2 + 0.030(-1) X2
Y = (4.917 + 2.417) + (0.237 - 0.030) X2
Y = 7.334 + 0.207 X2
No Incentive Group (X1 = 1):
Y = 4.917 - 2.417(1) + 0.237 X2 + 0.030(1) X2
Y = (4.917 - 2.417) + (0.237 + 0.030) X2
Y = 2.5 + 0.267 X2
Recall from slide 6 the original results:
Incentive group: Y = 7.33 + 0.21 X2
No Incentive group: Y = 2.5 + 0.27 X2
We get the same numbers!
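The substitution above can be checked with a few lines of arithmetic, using the coefficient estimates from the slides:

```python
# Substituting the two effect codes into the full-model estimates
# reproduces the separate group regression lines.
a, b1, b2, b3 = 4.917, -2.417, 0.237, 0.030

def group_line(code):
    """Return (intercept, slope) for an effect code of -1 or +1."""
    return a + b1 * code, b2 + b3 * code

incentive = group_line(-1)    # approximately (7.334, 0.207)
no_incentive = group_line(1)  # approximately (2.500, 0.267)
print(incentive, no_incentive)
```

Effect coding makes this decomposition especially clean: the intercept a is the average of the two group intercepts, and b1 is half the intercept difference between groups.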
Continuing the Analysis: Because the full-model interaction term was not statistically significant, we should remove the term from the model and re-estimate the model. This is called the reduced model, and it looks like:
Y = a + b1 X1 + b2 X2
Where:
X1 is the effect-coded variable for the incentive group of an examinee (either a -1 or a 1).
X2 is the amount of time studied for the test.
Without the interaction, our model makes the assumption of equal slopes across incentive groups. We already tested this assumption and found no evidence that the slopes differed across groups.
Reduced Model Results (SPSS output):
Model Summary
R = .907, R Square = .823, Adjusted R Square = .806, Std. Error of the Estimate = 1.20663
Predictors: (Constant), Study Time, Incentive
ANOVA (Dependent Variable: Y - Retention)
Regression: Sum of Squares = 142.050, df = 2, Mean Square = 71.025, F = 48.783, Sig. = .000
Residual: Sum of Squares = 30.575, df = 21, Mean Square = 1.456
Total: Sum of Squares = 172.625, df = 23
Coefficients (Dependent Variable: Y - Retention)
(Constant): B = 4.917, Std. Error = .603, t = 8.149, Sig. = .000
Incentive: B = -2.042, Std. Error = .246, Beta = -.761, t = -8.289, Sig. = .000
Study Time: B = .237, Std. Error = .044, Beta = .493, t = 5.371, Sig. = .000
Reduced Model Results: Estimated regression equation:
Y = 4.917 - 2.042 X1 + 0.237 X2
From the SPSS output we can tell the following:
There is a significant main effect of study time (b2 = 0.237, p < 0.001): regardless of incentive group, retention increases by 0.237 points for every additional hour of study time.
There is a significant main effect of incentive group (b1 = -2.042, p < 0.001): there is a significant difference in the (adjusted) mean value of retention between the two groups.
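The decision to drop the interaction can also be framed as an incremental F test comparing the full and reduced models. The sketch below uses the regression sums of squares reported in the two SPSS ANOVA tables; for a single dropped term, this F equals the square of that term's t statistic from the full model.

```python
# Incremental F test for the interaction term, computed from the
# sums of squares in the two ANOVA tables (a sketch of the
# model-comparison logic, not SPSS output itself).
ss_reg_full, df_reg_full = 142.725, 3
ss_reg_reduced, df_reg_reduced = 142.050, 2
ss_res_full, df_res_full = 29.900, 20

num = (ss_reg_full - ss_reg_reduced) / (df_reg_full - df_reg_reduced)
den = ss_res_full / df_res_full            # full-model mean square error
f = num / den

print(round(f, 3))  # about 0.452, matching the interaction's t^2 (0.672^2)
```

A small F on 1 and 20 degrees of freedom confirms that the interaction adds essentially no explained variance, justifying the reduced model.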
Further Interpretation: Because the reduced model included a categorical independent variable, we can decompose that model into two separate models, one for each group:
Incentive Group (X1 = -1):
Y = 4.917 - 2.042(-1) + 0.237 X2
Y = (4.917 + 2.042) + 0.237 X2
Y = 6.959 + 0.237 X2
No Incentive Group (X1 = 1):
Y = 4.917 - 2.042(1) + 0.237 X2
Y = (4.917 - 2.042) + 0.237 X2
Y = 2.875 + 0.237 X2
Recall from slide 6 the original results:
Incentive group: Y = 7.33 + 0.21 X2
No Incentive group: Y = 2.5 + 0.27 X2
We do not get the same numbers, because the reduced model pools the two groups and forces a common slope.
Some researchers may find it beneficial to partition continuous variables into a number of categories. In our example, even though study time was continuous, we could also have thought of it as a categorical variable with 4 levels (5, 10, 15, or 20 hours). A 2 x 4 ANOVA could then have been computed.
Another kind of categorization of a continuous variable is often done in a treatment-by-levels design. For example, a researcher may be interested in the difference between two teaching methods. Prior to beginning treatment, all subjects have a different intelligence level. The experimenter may want to control for intelligence in the design in order to isolate the information regarding the treatment. The resulting ANOVA will partition out the variance related to the control variable.
Some studies categorize continuous variables in an attempt to study possible interactions between the independent variables. These are often called Aptitude-Treatment Interaction (ATI), Attribute-Treatment Interaction (ATI), or Trait-Treatment Interaction (TTI) designs. This differs from the previous categorization because the control variable is actually a factor of interest. In the same example, the researcher may want to see if the treatments change test scores differently for people with different intelligence.
You can also categorize continuous variables in a counterproductive way. This can occur if a researcher categorizes a continuous variable that has more than one attribute, for example, categorizing personality, attitudes, and so on. Generally, you lose statistical power when you categorize a continuous variable. Contrary to the book's overall advice, categorization is a dangerous endeavor.
Basis For Categorization: How do you categorize a continuous variable? Often, variables are cut in half at the median and then labeled low or high. You should be careful in your categorization, because not all "lows" are created equal. What effect does categorization have? Categorization leads to a loss of information and a less sensitive analysis.
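The loss of information from a median split can be demonstrated with a small simulation on made-up data: dichotomizing a continuous predictor visibly weakens its observed relationship with the outcome.

```python
import numpy as np

# Simulated data illustrating the cost of a median split
# (purely hypothetical; not related to the lecture's example data).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)              # true linear relationship

r_continuous = np.corrcoef(x, y)[0, 1]     # correlation with x intact
x_split = (x > np.median(x)).astype(float) # "low" vs. "high" median split
r_split = np.corrcoef(x_split, y)[0, 1]    # correlation after splitting

print(round(r_continuous, 2), round(r_split, 2))  # the split version is smaller
```

The dichotomized predictor always shows an attenuated correlation, which is the "less sensitive analysis" the slide warns about.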
In the case where there is one continuous variable and one categorical variable (as in today's example), the interaction answers the question of whether the regression lines of the dependent variable (Retention) on the continuous variable (Study Time) are parallel for all categories of the categorical variable (Incentive Group). In our example, Study Time was manipulated; however, that is not always the case (researchers may simply ask how long each individual studied, for example). The test of significance would be the same, but the interpretation of the interaction effect would differ. In the previous design, since we know Study Time was manipulated, the cause of any difference has to be related to the Incentive Group. If we do not manipulate Study Time, a significant interaction may be a result of both the Incentive Group AND the Study Time.
Types of Interaction Effects: There are two main types of interaction effects.
Ordinal Interaction: an independent variable seems to have more of an effect under one level of a second independent variable than under another level. If you graph an ordinal interaction, the lines will not be parallel, but they will not cross.
Disordinal Interaction: an independent variable has one kind of effect in the presence of one level of a second independent variable, but a different kind of effect in the presence of a different level of the second independent variable. This is called a crossover interaction because the lines in a graph will cross.
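The ordinal/disordinal distinction can be sketched as a simple geometric check: two non-parallel lines always cross somewhere, so what matters is whether the crossing point falls inside the observed range of the continuous variable.

```python
# Classify an interaction by checking whether the two group regression
# lines cross inside the observed range of the continuous variable.
def interaction_type(line1, line2, x_min, x_max):
    """Each line is an (intercept, slope) pair."""
    (a1, b1), (a2, b2) = line1, line2
    if b1 == b2:
        return "none (parallel lines)"
    x_cross = (a2 - a1) / (b1 - b2)   # x value where the two lines meet
    return "disordinal" if x_min <= x_cross <= x_max else "ordinal"

# Full-model group lines from the example, over study times of 5-20 hours.
# These lines would not cross until roughly 80 hours of study, so within
# the observed range the (nonsignificant) interaction is ordinal.
print(interaction_type((7.334, 0.207), (2.500, 0.267), 5, 20))  # ordinal
```

This is only a descriptive check of the fitted lines; whether the interaction is worth interpreting at all is still decided by its significance test.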
Types of Interaction Effects: graphs of an ordinal interaction and a disordinal interaction (figures omitted).
Comparing Regression Equations in Nonexperimental Designs: Nonexperimental designs are those in which neither the categorical variable nor the continuous variable is manipulated. The analytic approach in such designs is identical to that of experimental designs; it is the interpretation that differs. The interpretation is often more complex and ambiguous in terms of the findings.
The Study of Bias: One definition of test bias (Cleary, 1968): A test is biased for members of one subgroup of the population if, in the prediction of the criterion for which the test was designed, consistent nonzero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup. This is the regression model for test bias. Under this model, test bias shows up as an interaction when modeling two regression lines representing two categorical groups.
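The Cleary definition can be illustrated with simulated data: fit one common regression line while ignoring group membership, then check whether prediction errors are consistently nonzero within each subgroup. All data below are made up for the sketch.

```python
import numpy as np

# Simulated illustration of the Cleary regression definition of bias:
# the two groups truly differ in intercept, so a common line
# systematically mis-predicts both of them (hypothetical data).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)           # predictor (e.g., test score)
group = np.repeat([0, 1], 100)             # subgroup membership
y = 2.0 + 1.0 * x + 3.0 * group + rng.normal(0, 0.5, size=200)

# Common regression line fit while ignoring group membership
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

bias_0 = resid[group == 0].mean()   # consistently negative: over-predicted
bias_1 = resid[group == 1].mean()   # consistently positive: under-predicted
print(round(bias_0, 2), round(bias_1, 2))
```

Here the groups differ only in intercept, so the common line is biased but the group lines are parallel; under the slide's framing, an interaction (nonparallel group lines) would be a further, slope-based form of bias.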
Final Thought: Combining categorical and continuous variables provides powerful statistical tools that help provide evidence about the behavior of the phenomena under study. Such tools provide the basis for most practical models used in quantitative research. Most nonexperimental studies include both categorical and continuous variables. Next time we will see that this is called ANCOVA (ANalysis of COVAriance). We will also see how controlling for continuous variables adjusts the means of our experimental groups.
Next Time:
Lab: categorical and continuous independent variables.
Homework 8 is due next week at the start of class.
No class next week (Thanksgiving break).
December 3: Analysis of Covariance (Chapter 15), final preparation.