Logistic Regression.
Overview

Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy and the independents are of any type. Multinomial logistic regression exists to handle the case of dependents with more classes than two, though it is sometimes used for binary dependents also since it generates somewhat different output, described below. When multiple classes of a multinomial dependent variable can be ranked, then ordinal logistic regression is preferred to multinomial logistic regression. Continuous variables are not used as dependents in logistic regression. Unlike logit regression, there can be only one dependent variable. More recently, generalized linear modeling (GZLM) has appeared as a module in SPSS, SAS, and other packages. GZLM allows the researcher to create regression models with any distribution of the dependent (ex., binomial, multinomial, ordinal) and any link function (ex., log for loglinear analysis, logit for binary or multinomial logistic analysis, cumulative logit for ordinal logistic analysis). Similarly, generalized linear mixed modeling (GLMM) is now available to handle multilevel logistic modeling. Logistic regression can be used to predict a categorical dependent variable on the basis of continuous and/or categorical independents; to determine the effect size of the independent variables on the dependent; to rank the relative importance of independents; to assess interaction effects; and to understand the impact of covariate control variables. The impact of predictor variables is usually explained in terms of odds ratios. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable. A logit is the natural log of the odds of the dependent equaling a certain value or not (usually 1 in binary logistic models, the highest value in multinomial models).
In this way, logistic regression estimates the odds of a certain event (value) occurring. Note that logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does. Logistic regression has many analogies to OLS regression: logit coefficients correspond to b coefficients in the logistic regression equation, the standardized logit coefficients correspond to beta weights, and a pseudo-R2 statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the raw values of the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, require that observations be independent and that the independent variables be linearly
related to the logit of the dependent. The predictive success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent. Goodness-of-fit tests such as the likelihood ratio test are available as indicators of model appropriateness, as is the Wald statistic to test the significance of individual independent variables. In SPSS, binary logistic regression, sometimes called binomial logistic regression, is under Analyze - Regression - Binary Logistic, and the multinomial version is under Analyze - Regression - Multinomial Logistic. Logit regression, discussed separately, is another related option in SPSS for using loglinear methods to analyze one or more dependents. Where both are applicable, logit regression has numerically equivalent results to logistic regression, but with different output options. For the same class of problems, logistic regression has become more popular among social scientists.

Contents
o Key concepts and terms
o Significance tests
o Stepwise logistic regression
o Interpreting parameter estimates
o Information theory measures
o Effect size measures
o Contrast analysis
o Analysis of residuals
o Conditional logistic regression for matched pairs data
o Assumptions
o SPSS output
o Frequently asked questions
o Bibliography
Key Terms and Concepts

o The logistic model. The logistic curve, illustrated below, is better for modeling binary dependent variables coded 0 or 1 because it comes closer to hugging the y=0 and y=1 points on the y axis. Moreover, the logistic function is bounded by 0 and 1, whereas the OLS regression function may predict values above 1 and below 0. Logistic analysis can be extended to multinomial dependents by modeling a series of binary comparisons: the lowest value of the dependent compared to a reference category (by default the highest category), the next-lowest value compared to the reference category, and so on, creating k - 1 binary model equations for the k values of the multinomial dependent variable.

o The logistic equation. Logistic regression predicts the log odds of the dependent event (odds and odds ratios are explained further below). The "event" is a particular value of y, the dependent variable. By default the event is y = 1 for binary dependents coded 0,1, and the reference category is 0. For multinomial dependents, the event is y equaling the value of interest and the reference category is the highest value of y. (Note this means that for binary dependents the reference category is 0 in binomial logistic regression but 1 in multinomial logistic regression!) The natural log of the odds of an event equals the natural log of the probability of the event occurring divided by the probability of the event not occurring: ln(odds(event)) = ln(prob(event)/prob(nonevent)). The logistic regression equation itself is: z = b0 + b1X1 + b2X2 + ... + bkXk
where z is the log odds of the dependent variable = ln(odds(event)), where b0 is the constant, and where there are k independent (X) variables, some of which may be interaction terms. The "z" is the logit, also called the log odds. The "b" terms are the logistic regression coefficients, also called parameter estimates. Exp(b) is the odds ratio for an independent variable: the natural log base e raised to the power of b. The odds ratio is the factor by which a one-unit increase in the independent multiplies the odds of the dependent; equivalently, b is the amount added to the log odds of the dependent (see the discussion of interpreting b parameters below). Exp(z) is the odds ratio for the dependent variable, being the odds that the dependent equals the level of interest rather than the reference level. In binary logistic regression, this is usually the odds the dependent = 1 rather than 0. In multinomial logistic regression, this is usually the odds the dependent = a value of interest rather than the highest level. Thus for a one-independent model, z would equal the constant plus the b coefficient times the value of X1, when predicting odds(event) for persons with a particular value of X1, by default the value "1" for the binary case. If X1 is a binary (0,1) variable, then z = b0 (that is, the constant) for the "0" group on X1 and equals the constant plus the b coefficient for the "1" group. To convert the log odds (which is z, which is the logit) back into an odds ratio, the natural logarithmic base e is raised to the zth power: odds(event) = exp(z) = odds the binary dependent is 1 rather than 0. If X1 is a continuous variable, then z equals the constant plus the b coefficient times the value of X1. For models with additional independent variables, z is the constant plus the sum of the products of the b coefficients and the values of the X (independent) variables. Exp(z) is the odds of the dependent event, that is, the estimate of odds(event).

o Dependent variable.
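These conversions from logit to odds to probability can be sketched in a few lines of code. This is an illustrative sketch only: the constant and coefficient values are hypothetical, not from any SPSS output shown here.

```python
import math

def predict(b0, bs, xs):
    """Return (logit z, odds, probability) for one case.

    z = b0 + b1*X1 + ... + bk*Xk; odds(event) = exp(z); prob = odds/(1+odds).
    """
    z = b0 + sum(b * x for b, x in zip(bs, xs))
    odds = math.exp(z)
    return z, odds, odds / (1.0 + odds)

# Hypothetical one-independent model with a binary X1: b0 = -1.5, b1 = 0.8.
z0, odds0, p0 = predict(-1.5, [0.8], [0])  # "0" group: z is just the constant
z1, odds1, p1 = predict(-1.5, [0.8], [1])  # "1" group: z = b0 + b1
# exp(b1) is the odds ratio: the factor by which X1 = 1 multiplies the odds.
odds_ratio = odds1 / odds0                 # equals math.exp(0.8)
```

Note that the odds ratio for X1 falls out as the ratio of the two groups' odds, which is exactly exp(b1).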
Binomial and multinomial logistic regression support only a single dependent variable. For binary logistic regression, this response variable can have only two categories. For multinomial logistic regression, there may be two or more categories, usually more, but the dependent is never a continuous variable.
It is important to be careful to specify the desired reference category of the dependent variable, which should be meaningful. For example, the highest category (4 = None) in the example above would be the default reference category, but "None" could be chosen for a great variety of reasons and has no specific meaning as a reference. The researcher may wish to select another, more specific option as the reference category.

The dependent reference default in binary vs. multinomial logistic regression in SPSS: Binary logistic regression predicts the "1" value of the dependent, using the "0" level as the reference value. Multinomial logistic regression will compare each level of the dependent with the reference category, for each independent variable. For instance, given the multinomial dependent "Degree of interest in joining" with levels 0 = low interest, 1 = medium interest, and 2 = high interest; and given the independents region (a factor) and income (a covariate), 2 = high interest will be the reference category by default. For region and separately for income, multinomial logistic output will show (1) the comparison of low interest with high interest, and (2) the comparison of medium interest with high interest.

BINARY LOGISTIC REGRESSION. Dependents: a binary variable is entered as a dependent. Outcome: the highest value is predicted, the lowest is the reference. The reference level cannot be changed.

MULTINOMIAL LOGISTIC REGRESSION. Dependents: a binary or multinomial variable is entered as the dependent. Outcome: the highest value is the reference, and all others are compared to it by default. Click the "Reference Category" button to override the default.

Setting the reference category for the dependent. 1. In binary logistic regression, the reference category (the lower, usually '0' category) cannot be overridden (though, of course, the researcher could flip the values by recoding). 2. In multinomial logistic regression, as illustrated below, in SPSS from the main multinomial logistic regression dialog, enter a categorical dependent variable, then to override the default selection of the highest-coded category as the reference category, click the "Reference" button and select "First Category" or "Custom" to set the reference code manually. In SPSS syntax, set the reference category simply by entering a base parameter after the dependent in the variable list (ex., NOMREG depvar (base=2) WITH indvar1 indvar2 indvar3 / PRINT = PARAMETER SUMMARY.) When setting custom or base values, a numeral indicates the order number (ex., 3 = third value will be the reference, assuming ascending order); for actual numeric values, dates, or strings, put quotation marks around the value.
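Since binary logistic regression always treats the lower code as the reference, flipping the reference amounts to recoding the dependent before the analysis. A minimal sketch, with hypothetical data values:

```python
# Hypothetical binary dependent: 1 = gun in home, 0 = no gun.
gunown = [1, 0, 0, 1, 1]

# To make "gun in home" the reference (and so predict "no gun" instead),
# simply exchange the two codes.
flipped = [1 - y for y in gunown]
```

The same recoding trick works for making any desired category the last (reference) one in multinomial logistic regression.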
o Group membership. Particularly in multinomial logistic regression, it is common to label the levels of the dependent variable as "groups". A logistic prediction (regression) formula can be created from logistic coefficients to predict the probability of membership in any group. Any given observation will be assigned to the group with the highest probability. For binary logistic regression, this is the group with a probability higher than .50.

o Factors, sometimes called "design variables," are nominal or ordinal categorical predictors, whose levels are entered as dummy variables. In SPSS binary logistic regression but not multinomial logistic regression, categorical independent variables must be declared by clicking on the "Categorical" button in the Logistic Regression dialog box. Setting the reference category (contrast) for a factor (categorical predictor). Dichotomous and categorical variables may be entered as factors but are handled differently in binary vs. multinomial regression. In binary logistic regression, after declaring a variable
to be categorical, the user is given the opportunity to declare the first or last value as the reference category. Independent variable reference categories cannot be changed in multinomial logistic regression in SPSS. If the researcher prefers another category to be the reference category in multinomial logistic regression, the researcher must either (1) recode the variables beforehand to make the desired reference category the last one, or (2) create dummy variables manually prior to running the analysis. For more on the selection of dummy variables, click here. Multinomial logistic regression in SPSS handles reference categories differently for factors and covariates, as illustrated below. Binary logistic regression provides for the usual choice of contrast types for factors. In SPSS binary logistic regression, all variables are first entered as covariates, then the "Categorical" button is pressed to switch any of those
entered to be categorical factors. Under the Categorical button dialog the user may select the type of contrast. Warning: one must click the Change button after selecting the contrast from the drop-down menu.
0. Indicator: This is the default. The reference value is the lowest (0 for binary factors) category. However, this may be reversed by checking the Descending radio button (Ascending is the default), which makes the lowest value be considered the last.
1. Simple: The last category is the reference. This may be reversed by checking the Descending radio button.
2. Difference: The reference is the mean of previous categories (except for the first category).
3. Helmert: The reference is the mean of subsequent categories (except for the last category).
4. Repeated: The reference is the previous category (except for the first category).
5. Deviation: The reference is the grand mean.
6. Polynomial: Used for equally spaced numeric categories to check linearity. For factor "X", output will list X(1) for the linear effect of X, X(2) for the quadratic effect, etc.

Reference levels in SPSS binomial and multinomial regression. Which category serves as the reference depends on whether one is using binomial or multinomial regression, whether the variable is the dependent or an independent, and whether the variable is a factor or covariate. For factors, a coefficient will be printed for predicted categories but not for the reference category. For covariates, a single coefficient is printed and is interpreted as in OLS regression, where increasing the independent from 0 to 1 corresponds to a b increase in the dependent (except that in logistic regression, the increase is in the log odds of the dependent). The table below summarizes the options.
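Manually creating indicator (dummy) variables with a chosen reference category, as described above, can be sketched as follows. The region values are hypothetical:

```python
def dummy_code(values, reference):
    """One 0/1 indicator column per non-reference level, levels in sorted order.

    The reference level gets no column: it is the all-zeros pattern.
    """
    levels = sorted(set(values))
    levels.remove(reference)
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

region = ["East", "West", "North", "East", "North"]
dummies = dummy_code(region, reference="West")  # columns for East and North only
```

This is indicator coding; the other contrast types listed above (deviation, Helmert, etc.) would use different coding matrices.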
o Covariates are interval independents in most programs. However, in the SPSS binary logistic regression dialog all independents are entered as covariates, then one clicks the Categorical button to declare any of those entered as categorical. In SPSS multinomial regression, factors and covariates are entered in separate entry boxes.

o Interaction terms. As in OLS regression, one can add interaction terms to the model. As illustrated in the figure below, in multinomial logistic regression, interaction terms are entered under the Model button, where interactions of factors and covariates are constructed automatically. For binary logistic regression, interaction terms such as age*income must be created manually. For categorical variables, one also multiplies, but the two category codes shown in the Categorical Variables Codings table of SPSS output must be multiplied (ex., race(1)*religion(2)). The codes will be 1's and 0's so most of the products in the new variables will be 0's unless you recode some products (ex., setting 0*0 to 1). Warning: when interaction terms are added to the model, centering of the constituent independent variables is strongly recommended to reduce multicollinearity and aid interpretation of coefficients. See the full discussion in the section on OLS regression, here.
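The manual construction of an interaction term from continuous covariates, with the recommended centering, might look like this (all values hypothetical):

```python
def centered(xs):
    """Subtract the mean so the variable is centered at zero."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

age = [25, 40, 55, 30, 50]
income = [30, 60, 80, 45, 70]

# Center first to reduce multicollinearity with the main-effect terms,
# then multiply case by case to form the age*income interaction term.
age_c, income_c = centered(age), centered(income)
age_x_income = [a * i for a, i in zip(age_c, income_c)]
```

The centered variables each sum to zero, and their product becomes the new interaction column entered alongside the main effects.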
o Maximum likelihood estimation, ML, is the method used to calculate the logit coefficients. This contrasts with the use of ordinary least squares (OLS) estimation of coefficients in regression. OLS seeks to minimize the sum of squared distances of the data points to the regression line. ML methods seek to maximize the log likelihood, LL, which reflects how likely it is (the probability) that the observed values of the dependent may be predicted from the observed values of the independents. Put another way, OLS can be seen as a subtype of ML for the special case of a linear model characterized by normally distributed disturbances around the regression line, where the regression coefficients computed maximize the likelihood of obtaining the least sum of squared disturbances. When error is not normally distributed or when the dependent variable is not normally distributed, ML estimates are preferred to OLS estimates because they are unbiased beyond the special case handled by OLS.

o Ordinal and multinomial logistic regression are extensions of binary logistic regression that allow the simultaneous comparison of more than one contrast. That is, the log odds of three or more contrasts are estimated simultaneously (ex., the probability of A vs. B, A vs. C, B vs. C, etc.).

o SPSS and SAS. In SPSS, select Analyze, Regression, Binary (or Multinomial), Logistic; select the dependent and the covariates; use the Categorical button to declare categorical variables; use the Options button to select desired output; Continue; OK. SAS's PROC CATMOD computes both simple and multinomial logistic regression, whereas PROC LOGIST is for simple (dichotomous) logistic regression. CATMOD uses a conventional model command: ex., model wsat*supsat*qman=_response_ /nogls ml ;. Note that in the model command, nogls suppresses generalized least squares estimation and ml specifies maximum likelihood estimation.

Basic SPSS Output
Thus in the output above, we find that being white compared to other races reduces the odds of not owning a gun in the home the most. All three independent variables in the model (Sex, Race, Degree) are significant, but not all categories of each independent variable are significant. Being African-American, for instance, does not have a significant impact on the odds of not owning a gun in the home. No category of any of the independent variables modeled significantly increases the odds of not having a gun in the home compared to the respective reference categories when other variables in the model are controlled - that is, there are no significant odds ratios above 1.0. (The chart output is from Excel, not SPSS).

More about SPSS output. Most output options in SPSS binary logistic regression are found under the "Options" button, but for multinomial logistic regression under the "Statistics" button, as illustrated below.
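The interpretation of odds ratios in output like the above can be illustrated numerically. The coefficient value here is hypothetical, not taken from the table:

```python
import math

# A hypothetical logistic coefficient for "white" predicting NOT owning a gun.
b_white = -0.9
odds_ratio = math.exp(b_white)             # Exp(b) as reported in the output

# An odds ratio below 1.0 means the category reduces the odds of the event;
# (odds_ratio - 1) * 100 expresses the change in the odds as a percentage.
percent_change = (odds_ratio - 1.0) * 100  # negative: the odds are reduced
```

A negative b always yields Exp(b) below 1.0, which is why "no significant odds ratios above 1.0" is equivalent to saying no category significantly increases the odds.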
Significance Tests

o Significance tests for binary logistic regression
Hosmer and Lemeshow chi-square test of goodness of fit. The recommended test for overall fit of a binary logistic regression model is the Hosmer and Lemeshow test, also called the Hosmer-Lemeshow chi-square test. It is not available in multinomial logistic regression. This test is considered more robust than the traditional chi-square test, particularly if continuous covariates are in the model or sample size is small. A finding of non-significance, as in the illustration below, corresponds to the researcher concluding the model adequately fits the data. This test is preferred over classification tables when assessing model fit. This test appears in SPSS output in the "Hosmer and Lemeshow" table. It is requested by checking "Hosmer-Lemeshow goodness of fit" under the Options button in SPSS.

How it works. Hosmer and Lemeshow's goodness of fit test divides subjects into deciles based on predicted probabilities as illustrated above, then computes a chi-square from observed and expected frequencies. Then a probability (p) value is computed from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model. If the p value for the H-L goodness-of-fit test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show nonsignificance on the H-L goodness-of-fit test, indicating model prediction is not significantly different from observed values. This does not mean that the model necessarily explains much of the variance in the dependent, only that however much or little it does explain is significant. As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant. On the other hand, the H-L statistic assumes sampling adequacy, with a rule of thumb being enough cases so that no group has an expected value < 1 and 95% of cells (typically, 10 decile groups times 2 outcome categories = 20 cells) have an expected frequency > 5. Collapsing groups may not solve a sampling adequacy problem since when the number of groups is small, the H-L test will be biased toward nonsignificance (will overestimate model fit). This test is not to be confused with a similarly named, obsolete goodness of fit index discussed below.

Omnibus tests of model coefficients. This table in SPSS binary logistic regression reports significance levels by the traditional chi-square method and is an alternative to the Hosmer-Lemeshow test above. It tests if the model with the predictors is significantly different from the model with only the intercept. The omnibus test may be interpreted as a test of the capability of all predictors in the model jointly to predict the response (dependent) variable. A finding of significance, as in the illustration below, corresponds to the research conclusion that there is adequate fit of the data to the model, meaning that at least one of the predictors is significantly related to the response variable. In the illustration below, the Enter method is employed (all model terms are entered in one step), so there is no difference for step, block, or model, but in a stepwise procedure one would see results for each step.

Fit tests in stepwise or block-entry logistic regression.
When predictors are entered in steps or blocks, the Hosmer-Lemeshow and omnibus tests may also be interpreted as tests of the change in model significance when adding a variable (stepwise) or block of variables (block) to the model. This is illustrated below for a model predicting having a gun in the home from marital status, race, and attitude toward capital punishment. In stepwise forward logistic
regression, Marital is added first, then Race, then Cappun; Age is in the covariate list but is not added as the stepwise procedure ends at Step 3.

o Significance tests for multinomial logistic regression

Binary vs. multinomial logistic regression. In SPSS, selecting Analyze, Regression, Binary logistic invokes the LOGISTIC REGRESSION procedure. Selecting Analyze, Regression, Multinomial logistic invokes the NOMREG procedure. Although NOMREG is designed for the case where the dependent has more than two categories, a binary dependent may be entered. The NOMREG procedure provides a different set of output options, including different significance tests of the model. The figures below are for the NOMREG version of the same example, where the binary variable Gunown (having a gun in the home) is predicted from the categorical variables Marital (four categories of marital status), Race (three categories), the binary variable Cappun (attitude toward the death penalty), and the continuous covariate Age.

Pearson and deviance goodness of fit. The "Goodness of Fit" table in multinomial logistic regression in SPSS gives two similar overall model fit tests. Like the Hosmer-Lemeshow goodness of fit
test in binary logistic regression, adequate fit corresponds to a finding of nonsignificance for these tests, as in the illustration below. Both are chi-square methods, but the Pearson statistic is based on traditional chi-square and the deviance statistic is based on likelihood ratio chi-square. The deviance test is preferred over the Pearson (Menard, 2002: 47). Either test is preferred over classification tables when assessing model fit. Both tests usually yield the same substantive results. For more on computation of Pearson and likelihood ratio chi-square, click here. A well-fitting model is non-significant.

The likelihood ratio is a function of log likelihood and is used in significance testing in NOMREG. A "likelihood" is a probability, specifically the probability that the observed values of the dependent may be predicted from the observed values of the independents. Like any probability, the likelihood varies from 0 to 1. The log likelihood (LL) is its log and varies from 0 to minus infinity (it is negative because the log of any number less than 1 is negative). LL is calculated through iteration, using maximum likelihood estimation (ML). Log likelihood is the basis for tests of a logistic model. Because -2LL has approximately a chi-square distribution, -2LL can be used for assessing the significance of logistic regression, analogous to the use of the sum of squared errors in OLS regression. The -2LL statistic is the likelihood ratio. It is also called goodness of fit, deviance chi-square, scaled deviance, deviation chi-square, DM, or L-square. It reflects the significance of the unexplained variance in the dependent. In SPSS output, this statistic is found in the "-2 Log Likelihood" column of the "Model Fitting Information" table for the Final row.
The likelihood ratio is not used directly in significance testing, but it is the basis for the likelihood ratio test, which is the test of the difference between two likelihood ratios (two -2LL's), as discussed below. In general, as the model becomes better, -2LL will decrease in magnitude.
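The -2LL statistic and the model chi-square built from it can be computed directly from predicted probabilities. In this sketch the fitted probabilities are hypothetical:

```python
import math

def neg2ll(probs, ys):
    """-2 log likelihood for a binary dependent given predicted probabilities."""
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for p, y in zip(probs, ys))
    return -2.0 * ll

ys = [1, 0, 1, 1, 0]

# Intercept-only (null) model: every case is assigned the overall event rate.
p_null = [sum(ys) / len(ys)] * len(ys)
# Hypothetical probabilities from a fitted model with predictors.
p_model = [0.9, 0.2, 0.7, 0.8, 0.3]

# Model chi-square: -2LL(null) minus -2LL(final); smaller -2LL means better fit.
model_chi_square = neg2ll(p_null, ys) - neg2ll(p_model, ys)
```

The better-fitting model has the smaller -2LL, so the difference (model chi-square) is positive whenever the predictors improve on the intercept-only model.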
A well-fitting model is significant. The likelihood ratio test, also called the log-likelihood test, is based on -2LL (deviance). The likelihood ratio test is a test of the significance of the difference between the likelihood ratio (-2LL) for the researcher's model minus the likelihood ratio for a reduced model. This difference is called "model chi-square." The likelihood ratio test is generally preferred over its alternative, the Wald test, discussed below. There are four main forms of the likelihood ratio test:

0. Models. Two models are referenced in the "Model Fitting Information" table above: (1) the "Intercept Only" model, also called the null model; it reflects the net effect of all variables not in the model plus error; and (2) the "Final" model, also called the fitted model, which is the researcher's model comprised of the predictor variables; the logistic equation is the linear combination of predictor variables which maximizes the log likelihood that the dependent variable equals the predicted value/class/group. The difference in the -2 log likelihood (-2LL) measures how much the final model improves over the null model. The likelihood ratio test of the overall model, also called the "model chi-square test", is the test shown above in the "Final" row of the "Model Fitting Information" table in SPSS. When the reduced model is the baseline model with the constant only (a.k.a. initial model or model at step 0), the likelihood ratio test tests the significance of the researcher's model as a whole. A well-fitting model is significant at the .05 level or better, as in the figure above, meaning the researcher's model is significantly different from the one with the constant only. That is, a finding of significance (p <= .05 is the usual cutoff) leads to rejection of the null hypothesis that all of the predictor effects are zero.
When this likelihood test is significant, at least one of the predictors is significantly related to the dependent variable. Put yet another way, the likelihood ratio test tests
the null hypothesis that all population logistic regression coefficients except the constant are zero. And put a final way, the likelihood ratio test reflects the difference between error not knowing the independents (initial chi-square) and error when the independents are included in the model (deviance). When probability (model chi-square) <= .05, we reject the null hypothesis that knowing the independents makes no difference in predicting the dependent in logistic regression. The likelihood ratio test looks at model chi-square (chi-square difference) by subtracting deviance (-2LL) for the final (full) model from deviance for the intercept-only model. Degrees of freedom in this test equal the number of terms in the model minus 1 (for the constant). This is the same as the difference in the number of terms between the two models, since the null model has only one term. Model chi-square measures the improvement in fit that the explanatory variables make compared to the null model. Warning: If the log-likelihood test statistic shows a small p value (<= .05) for a model with a large effect size, ignore contrary findings based on the Wald statistic discussed below, as it is biased toward Type II errors in such instances; instead assume good model fit overall.

1. Testing for interaction effects. A common use of the likelihood ratio test is to test the difference between a full model and a reduced model dropping an interaction effect. If model chi-square (which is -2LL for the reduced model minus -2LL for the full model) is significant, then the interaction effect is contributing significantly to the full model and should be retained.

2. Test of individual model parameters. The likelihood ratio test assesses the overall logistic model but does not tell us if particular independents are more important than others. This can be done, however, by comparing the difference in -2LL for the overall model with a nested model which drops one of the independents.
We can use the likelihood ratio test to drop one variable from the model to create a nested reduced model. In this situation, the likelihood ratio test tests if the logistic regression coefficient for the dropped variable can be treated as 0, thereby justifying dropping the variable from the model. A nonsignificant likelihood ratio test indicates no difference between the full and the reduced models, hence justifying dropping the
given variable so as to have a more parsimonious model that works just as well. Note that the likelihood ratio test of individual parameters is a better criterion than the alternative Wald statistic when considering which variables to drop from the logistic regression model. In SPSS output, the "Likelihood Ratio Tests" table contains the likelihood ratio tests of individual model parameters, illustrated below. In the table above, the response variable is Gunown, a binary variable indicating whether or not there is a gun in the home. Predictors are marital status, race, attitude toward the death penalty (Cappun), and age. The likelihood ratio tests of individual parameters show that the model without Age is not significantly different from the final (full) model and therefore Age should be dropped based on preference for the more parsimonious reduced model. For the significant variables, the larger the chi-square value, the greater the loss of model fit if that term is dropped. In this example, dropping "marital" would result in the greatest loss of model fit.

3. Tests for model refinement. In general, the likelihood ratio test can be used to test the difference between a given model and any nested model which is a subset of the given model. It cannot be used to compare two non-nested models. Chi-square is the difference in likelihood ratios (-2LL) for the two models, and degrees of freedom is the difference in degrees of freedom for the two models. If the computed chi-square is equal to or greater than the critical
value of chi-square (in a chi-square table) for the given df, then the models are significantly different. If the difference is significant, then the researcher concludes that the variables dropped in the nested model do matter significantly in predicting the dependent. If the difference is below the critical value, there is a finding of nonsignificance and the researcher concludes that dropping the variables makes no difference in prediction, and for reasons of parsimony the variables are dropped from the model. That is, chi-square difference can be used to help decide which variables to drop from or add to the model. This is discussed further in the next section.

Block chi-square is a synonym for the likelihood ratio (chi-square difference) test, referring to the change in -2LL due to entering a block of variables. The main Logistic Regression dialog allows the researcher to enter independent variables in blocks. Blocks may contain one or more variables. Sequential logistic regression is analysis of nested models using block chi-square, where the researcher is testing the control effects of a set of covariates. The logistic regression model is run against the dependent for the full model with independents and covariates, then is run again with the block of independents dropped. If chi-square difference is not significant, then the researcher concludes that the independent variables are controlled by the covariates (that is, they have no effect once the effect of the covariates is taken into account). Alternatively, the nested model may be just the independents, with the covariates dropped. In that case a finding of non-significance implies that the covariates have no control effect.

Assessing dummy variables with sequential logistic regression. Similarly, running a full model and then a model with all the variables in a dummy set dropped (ex., East, West, North for the variable Region) allows assessment of dummy variables, using chi-square difference.
Note that even though SPSS computes log-likelihood ratio tests of individual parameters for each level of a dummy variable, the log-likelihood ratio tests of individual parameters (discussed below) should not be used; rather, use the likelihood ratio test (chi-square difference method) for the set of dummy variables pertaining to a given variable. Because all dummy variables associated with the categorical variable are entered as a block, this is sometimes called the "block chi-square" test, and its value is considered more reliable than the Wald test, which can be misleading for large effects in finite samples.
o Stepwise logistic regression: The forward or backward stepwise logistic regression methods, available in both binary and multinomial regression in SPSS, determine automatically which variables to add or drop from the model. As data-driven methods, stepwise procedures run the risk of modeling noise in the data and are considered useful only for exploratory purposes. Selecting model variables on a theoretic basis and using the "Enter" method is preferred. Both binomial and multinomial logistic regression offer both forward and backward stepwise logistic regression, though with slight differences illustrated below. Binary logistic regression in SPSS offers these variants in the Method area of the main binary logistic regression dialog: forward conditional, forward LR, forward Wald, backward conditional, backward LR, or backward Wald. The conditional options use a computationally faster version of the likelihood ratio test, the LR options use the likelihood ratio test (chi-square difference), and the Wald options use the Wald test. The LR option is most often preferred. The likelihood ratio test computes -2LL for the current model, then reestimates -2LL with the target variable removed. The conditional option is preferred when LR estimation proves too computationally time-consuming. The conditional statistic is considered not as accurate as the likelihood ratio test but more so than the third possible criterion, the Wald test. Stepwise procedures are selected in the Method drop-down list of the binary logistic regression dialog. Multinomial logistic regression offers these variants under the Model button if a Custom model is specified: forward stepwise, backward
stepwise, forward entry, and backward elimination. These four options are described in the FAQ section below. All are based on maximum likelihood estimation (ML), with forward methods using the likelihood ratio or score statistic and backward methods using the likelihood ratio or Wald's statistic. LR is the default, but score and Wald alternatives are available under the Options button. Forward entry adds terms to the model until no omitted variable would contribute significantly to the model. Forward stepwise determines the forward entry model and then alternates between backward elimination and forward entry until all variables not in the model fail to meet entry or removal criteria. Backward elimination and backward stepwise are similar, but begin with all terms in the model and work backward, with backward elimination stopping when the model contains only terms which are significant, and with backward stepwise taking this result and further alternating between forward entry and backward elimination until no omitted variable would contribute significantly to the model. Forward selection vs. backward elimination: Forward selection is the usual option, starting with the constant-only model and adding variables one at a time in the order they are best by some criterion (see below) until some cutoff level is reached (ex., until the step at which all variables not in the model have a significance higher than .05). Backward selection starts with all variables and deletes one at a time, in the order they are worst by some criterion. The illustration below shows forward stepwise modeling of a binary dependent, having a gun in the home or not (Gunown), as predicted by the categorical variable marital status (marital, with four categories), race (three categories), and attitude on the death penalty (binary). The forward stepwise procedure adds Marital first, then Race, then the death penalty attitude variable.
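Forward selection with a likelihood-ratio entry criterion can be sketched as a short loop. This is a simplified illustration in the spirit of SPSS's "Forward: LR" method, not a reproduction of its exact algorithm; the data and the .05 entry cutoff are assumptions.

```python
# Minimal forward-selection sketch using the likelihood ratio criterion:
# at each step, add the candidate whose LR test (vs. the current model)
# has the smallest significant p-value; stop when none qualifies.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def forward_lr(df, outcome, candidates, alpha=0.05):
    selected = []
    current = smf.logit(f"{outcome} ~ 1", data=df).fit(disp=0)
    while True:
        best = None
        for var in candidates:
            if var in selected:
                continue
            rhs = " + ".join(selected + [var])
            trial = smf.logit(f"{outcome} ~ {rhs}", data=df).fit(disp=0)
            lr = 2 * (trial.llf - current.llf)      # chi-square difference
            p = stats.chi2.sf(lr, 1)                # 1 df: one term added
            if p < alpha and (best is None or p < best[1]):
                best = (var, p, trial)
        if best is None:
            return selected, current
        selected.append(best[0])
        current = best[2]

# Simulated data: x1 truly predicts y, x2 is pure noise
rng = np.random.default_rng(1)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-2 * x1))).astype(int)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
vars_in, model = forward_lr(df, "y", ["x1", "x2"])
print("Selected:", vars_in)
```

With a strong simulated effect on x1, the loop should pick up x1 and (usually) leave the noise variable out, illustrating why the procedure is data-driven.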
Cross-validation. When using stepwise methods, problems of overfitting the model to noise in the current data may be mitigated by cross-validation: fitting the model to a training subset of the data and validating the model using a hold-out validation subset. Score as a variable entry criterion for forward selection. Rao's efficient score statistic has also been used as the forward selection criterion for adding variables to a model. It is similar but not identical to a likelihood ratio test of the coefficient for an individual explanatory variable. Score appears in the "Score" column of the "Variables not in the equation" table of SPSS output for binary logistic regression. Score may be selected as a variable entry criterion in multinomial regression, but there is no corresponding table (its use is just noted in a footnote in the "Step Summary" table). In the example below, having a gun in the home or not is predicted from marital status, race, age, and attitudes toward capital punishment. Race, with the highest Score in Step 1, gets added to the initial predictor, marital status, in Step 2. The variable with the highest Score in Step 2, attitude toward capital punishment (Cappun), gets added in Step 3. Age, with a nonsignificant Score, is never added as the stepwise procedure stops at Step 3.
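The hold-out validation idea mentioned above can be sketched as follows. The data, split fraction, and 0.5 classification cutoff are all assumptions made for illustration.

```python
# Hold-out validation sketch: fit the logistic model on a training
# split, then check predictive accuracy on unseen cases to guard
# against stepwise overfitting. Simulated data, illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 600
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-1.5 * x))).astype(int)
df = pd.DataFrame({"y": y, "x": x})

train = df.sample(frac=0.7, random_state=42)   # 70% training subset
test = df.drop(train.index)                    # 30% hold-out subset

model = smf.logit("y ~ x", data=train).fit(disp=0)
pred = (model.predict(test) >= 0.5).astype(int)
accuracy = (pred == test["y"]).mean()
print(f"Hold-out accuracy: {accuracy:.3f}")
```

If accuracy on the hold-out subset is much worse than on the training subset, the model has likely been fit to noise.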
Score statistic: Rao's efficient score, labeled simply "Score" in SPSS output, is a test for whether the logistic regression coefficient for a given explanatory variable is zero. It is mainly used as the criterion for variable inclusion in forward stepwise logistic regression (discussed above) because of its advantage of being a non-iterative and therefore computationally fast method of testing individual parameters compared to the likelihood ratio test. In essence, the score statistic is similar to the first iteration of the likelihood ratio method, where LR typically goes on to three or four more iterations to refine its estimate. In addition to testing the significance of each variable, the score procedure generates an "Overall statistics" significance test for the model as a whole. A finding of nonsignificance (ex., p > .05) on the score statistic leads to acceptance of the null hypothesis that coefficients are zero, and the variable may be dropped. SPSS continues by this method until no remaining predictor variables have a score statistic significance of .05 or better. Which step is the best model? Stepwise methods do not necessarily identify "best models" at all, as they work by fitting an automated model to the current dataset, raising the danger of overfitting to noise in the particular dataset at hand. However, there are three possible methods of selecting the "final model" that emerges from the stepwise procedure. Binomial stepwise logistic regression offers only the first, but multinomial stepwise logistic regression offers two more, based on the information theory measures AIC and BIC, discussed below. 1. Last step. The final model is the last step model, where adding another variable would not improve the model significantly. In the figure below, a general happiness survey variable (3 levels) is predicted from age, education,
and a categorical survey item on whether life is exciting or dull (3 levels). By the customary last step rule, the best model is the model at the last step. 2. Lowest AIC. The "Step Summary" table will print the Akaike Information Criterion (AIC) for each step. AIC is commonly used to compare models, where the lower the AIC, the better. The step with the lowest AIC thus becomes the "final model." 3. Lowest BIC. The "Step Summary" table will print the Bayesian Information Criterion (BIC) for each step. BIC is also used to compare models, again where the lower the BIC, the better. The step with the lowest BIC thus becomes the "final model." Often BIC will point to a more parsimonious model than will AIC, as its formula factors in degrees of freedom, which is related to number of variables. By the lowest BIC criterion, the best model would be model 1. o Click here for further discussion of stepwise methods. Wald statistic (test): The Wald statistic is an alternative test which is commonly used to test the significance of individual logistic regression coefficients for each independent variable (that is, to test the null hypothesis in logistic regression that a particular logit (effect) coefficient is zero). The Wald statistic is the squared ratio of the unstandardized logistic coefficient to its standard error. The Wald statistic and its corresponding p probability level are part of SPSS output in the "Variables in the Equation" table, illustrated above. The Wald test corresponds to significance testing of b coefficients in OLS regression. The researcher may well want to drop independents from the model when their effect is not significant by the Wald statistic. The Wald test appears in the "Parameter Estimates" table in SPSS multinomial logistic regression output. (If computing the Wald statistic manually by spreadsheet, note that rounding/precision errors will cause the computed result to vary a little from SPSS output.)
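The Wald computation just described, the squared ratio of a coefficient to its standard error, can be checked by hand and compared against the likelihood ratio test for the same parameter. The data below are simulated, so the numbers are illustrative only.

```python
# Wald statistic computed by hand, (b/SE)^2, and compared with the
# likelihood ratio test for the same single parameter. Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-0.8 * x))).astype(int)
df = pd.DataFrame({"y": y, "x": x})

full = smf.logit("y ~ x", data=df).fit(disp=0)
b, se = full.params["x"], full.bse["x"]
wald = (b / se) ** 2                       # Wald chi-square, df = 1
p_wald = stats.chi2.sf(wald, 1)

null = smf.logit("y ~ 1", data=df).fit(disp=0)
lr = 2 * (full.llf - null.llf)             # LR chi-square, df = 1
p_lr = stats.chi2.sf(lr, 1)

print(f"Wald = {wald:.2f} (p = {p_wald:.4f}); LR = {lr:.2f} (p = {p_lr:.4f})")
```

For well-behaved coefficients the two tests agree closely; the warning that follows explains when they diverge.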
Warning: Menard (p. 39) warns that for large logit coefficients, standard error is inflated, lowering the Wald statistic and leading to Type II errors (false negatives: thinking the effect is not significant when it is). That is, there is a flaw in the Wald statistic such that very large effects may lead to large standard errors and small Wald chi-square values. For models with large logit coefficients, or when dummy variables are involved, it is better to test the difference using the likelihood ratio test of the difference of models with and without the parameter. Also note that the Wald statistic is sensitive to violations of the large-sample assumption of logistic regression. Put another way, the likelihood ratio test is considered more reliable for small samples (Agresti, 1996). For these reasons, the likelihood ratio test of individual model parameters is generally preferred. o Logistic coefficients and correlation. Note that a logistic coefficient may be found to be significant when the corresponding correlation is found to be not significant, and vice versa. To make certain global statements about the significance of an independent variable, both the correlation and the parameter estimate (b) should be significant. Among the reasons why correlations and logistic coefficients may differ in significance are these: (1) logistic coefficients are partial coefficients, controlling for other variables in the model, whereas correlation coefficients are uncontrolled; (2) logistic coefficients reflect linear and nonlinear relationships, whereas correlation reflects only linear relationships; and (3) a significant parameter estimate (b) means there is a relation of the independent variable to the dependent variable for selected control groups, but not necessarily overall. o Confidence interval for the logistic regression coefficient.
The confidence interval around the logistic regression coefficient is plus or minus 1.96*ASE, where ASE is the asymptotic standard error of logistic b. "Asymptotic" in ASE indicates a large-sample estimate: it is the smallest possible value for the standard error when the data fit the model, and thus the highest possible precision. The real (enlarged) standard error is typically slightly larger than ASE. One typically uses the real SE if one hypothesizes that noise in the data is systematic, and ASE if one hypothesizes that noise in the data is random. As the latter is typical, ASE is used here. Goodness of Fit Index (obsolete), also known as Hosmer and Lemeshow's goodness of fit index or C-hat, is an alternative to model chi-square for assessing the significance of a logistic regression model. This is different from Hosmer and Lemeshow's (chi-square) goodness of fit test discussed above. Menard (p. 21) notes it may be better when the number of combinations of values of the independents is approximately equal to the number of cases under analysis. This measure was included in SPSS output as "Goodness of Fit" prior to Release 10. However, it was removed from the reformatted output for SPSS Release 10 because, as noted by
David Nichols, senior statistician for SPSS, it "is done on individual cases and does not follow a known distribution under the null hypothesis that the data were generated by the fitted model, so it's not of any real use" (SPSSX-L listserv message, 3 Dec. 1999). Interpreting Parameter Estimates o Probabilities, odds, and odds ratios are all important basic terms in logistic regression. See the more extensive coverage in the separate section on log-linear analysis.
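The relationships among logits, odds, and probabilities can be illustrated numerically. The constant and slope below are invented purely for illustration.

```python
# From logit to probability: the model predicts z = b0 + b1*x, the log
# odds; odds and probability follow by exponentiation. Hypothetical
# coefficient values, used only to show the arithmetic.
import math

b0, b1 = -1.0, 0.5           # hypothetical constant and slope
probs = []
for x in (0, 2, 4):
    z = b0 + b1 * x          # logit (log odds of the event)
    odds = math.exp(z)       # odds(event) = e^z
    p = odds / (1 + odds)    # probability = odds / (1 + odds)
    probs.append(p)
    print(f"x={x}: logit={z:+.2f}, odds={odds:.3f}, p={p:.3f}")
```

Note that a logit of 0 corresponds to odds of 1 and a probability of .5, and that probability rises monotonically with the logit.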
o Logistic coefficients, also called unstandardized logistic regression coefficients, logit coefficients, log odds-ratios, effect coefficients, or (in SPSS output) simply parameter estimates, correspond to b coefficients in OLS regression. Logistic coefficients, which are on the right-hand side of the logistic equation, are not to be confused with logits, which is the term on the left-hand side. Both logit and regression coefficients can be used to
construct prediction equations and generate predicted values, which in logistic regression are called logistic scores. The SPSS table which lists the b coefficients also lists the standard error of b, the Wald statistic and its significance (discussed below), and the odds ratio (labeled Exp(b)), as well as confidence limits on the odds ratio. In SPSS this is the "Variables in the Equation" table in binary logistic regression, shown below, and there is an almost identical table in multinomial logistic regression labeled "Parameter Estimates". o Model intercepts/constants. Labeled "constant" in binomial and "intercept" in SPSS multinomial regression output, this parameter is the log odds (logit estimate) of the dependent variable when model predictors are evaluated at zero. Often particular predictor variables can never be zero in real life. Note that the intercepts will be more interpretable if variables are centered around their means prior to analysis, because then the intercepts are the log odds of the dependent when predictors are at their average values. In the example illustrated above, there is only one intercept as the dependent variable, capital punishment (cappun), is binomial, even though multinomial regression is applied. In general, there will be (k - 1) logistic equations and hence (k - 1) intercepts, where k is the number of categories of the dependent variable. o Logistic models: parameter estimates. In SPSS and most statistical output for logistic regression, the "parameter estimate" is the b coefficient used to predict the log odds (logit) of the dependent variable, as discussed in the section on the logistic equation. o Logistic models: logit links. Where OLS regression has an identity link function, binary or multinomial logistic regression has a logit link function (that is, logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does).
Ordinal logistic regression has a cumulative logit link. In binary logistic regression, the logit link creates a logistic model whose distribution, illustrated at the top of this section, hugs the 0 and 1 values of the Y axis for a binary dependent and, moreover, does not extrapolate out of range (below 0 or above 1). Multinomial logistic regression also uses the logit link, setting up a series of binary comparisons between the given level of the dependent and the reference level (by default, the highest coded level). Ordinal logistic regression uses a cumulative logit link, setting up a series of binary comparisons between the given level or lower compared to higher levels. (Technically, the SPSS Analyze, Regression, Ordinal menu choice runs the PLUM procedure, which uses a logit link, but the logit link is used to fit a cumulative logit model. The SPSS Analyze, Generalized Linear Models, Generalized Linear Models menu choice for ordinal regression assumes an ordered multinomial distribution with a cumulative logit link.) o Logistic models: logits. As discussed above, logits are the log odds of the event occurring (usually, that the dependent = 1 rather than 0). Parameter estimates (b coefficients) associated with explanatory variables are
estimators of the change in the logit caused by a unit change in the independent. In SPSS output, the parameter estimates appear in the "B" column of the "Variables in the Equation" table. Logits do not appear but must be estimated using the logistic regression equation above, inserting appropriate values for the constant and X variable(s). The b coefficients vary between plus and minus infinity, with 0 indicating the given explanatory variable does not affect the logit (that is, makes no difference in the probability of the dependent value equaling the value of the event, usually 1); positive or negative b coefficients indicate the explanatory variable increases or decreases the logit of the dependent. Exp(b) is the odds ratio for the explanatory variable, discussed below. o Odds ratios. Odds ratios are effect size measures. Reporting odds ratios is the standard way to report the central results of logistic regression, as illustrated below in a U.S. Census Bureau analysis of voter registration. In this figure the lengths of horizontal bars are proportionate to the size of the odds ratios for the corresponding values of such predictors as citizenship, gender, age, race, and other variables. For factors like citizenship, there is one odds ratio per level. For covariates like age, there is a single odds ratio. See the warning below the table, however! Warning. Recall that an odds ratio of 1.0 corresponds to no effect. An effect increases the odds to the extent the ratio is above 1.0, and decreases the odds to the extent it is below 1.0. A chart like the one above may be misleading, making viewers think that the longer the line, the greater the effect. It would be much better to have a bidirectional graph, with 1.0 as the vertical center line, odds ratios greater than 1.0 shown going to the right of the center line, and odds ratios below 1.0 expressed as 1 minus the odds ratio and going to the left of the center line.
The graph would have a label such as "Effect sizes based on odds ratios" but should not be labeled "Odds Ratios" due to the subtraction operation for values below 1.0. A graph footnote should explain the difference between the right- and left-hand sides of the graph. o Odds ratios, continued. In contrast to exp(z), exp(b) is the odds ratio. The odds ratio is the natural log base, e, to the exponent, b, where b = the parameter estimate. "Exp(B)" in SPSS output refers to odds ratios. Note that when b = 0, Exp(b) = 1, so an odds ratio of 1 corresponds to an explanatory variable which does not affect the dependent variable. For continuous variables, the odds ratio represents the factor by which odds(event) changes for a one-unit change in the variable. Put another way, Exp(b) is the ratio of odds for two groups, where one group has values of Xj which are one unit apart from the values of Xj in the other group. An Exp(b) > 1 means the independent variable increases the logit and therefore increases odds(event). If Exp(b) = 1.0, the independent variable has no effect. If Exp(b) is less than 1.0, then the independent variable decreases the logit
and decreases odds(event). For instance, if b1 = 2.303, then the corresponding odds ratio (the exponential function, e^b) is 10, and we may say that when the independent variable increases one unit, the odds that the dependent = 1 increase by a factor of 10, when other variables are controlled. In SPSS, odds ratios appear as "Exp(B)" in the "Variables in the Equation" table. odds ratio = exp(b); b = ln(odds ratio). Warning: very high or low odds ratios. Extremely high odds ratios may occur when a factor cell associated with the odds ratio (a cell which would appear in a crosstabulation of the dependent variable with the given independent variable) is zero. The same situation can be associated with extremely low odds ratios. What has happened is that the algorithm estimating the logistic coefficient (and hence also exp(b), the odds ratio) is unstable, failing to converge while attempting to move iteratively toward positive infinity (or negative infinity). When extremely high odds ratios occur, the researcher can confirm this cause by looking for extremely large standard errors and/or running the appropriate crosstabulation and looking for zero cells. Large odds ratio estimates may also arise from sparse data (ex., cells < 5); from small sample size in relation to a large number of covariates used as controls (see Greenland, Schwartzbaum, & Finkle, 2000); and from employing a predictor variable which has low variance. The presence of sparse or zero cells may require collapsing levels of the variable or omitting the variable altogether. Parameter estimates and odds ratios with binary dependents (binary logistic regression). Parameter estimates (b) and odds ratios (Exp(b)) may be output in binary logistic regression for dichotomies, categorical variables, or continuous variables.
Their interpretation is similar in each case, though often researchers will not interpret odds ratios in terms of statements illustrated below, but instead will simply use odds ratios as effect size measures and comment on their relative sizes when comparing independent variable effects, or will comment on the change in the odds ratio for a particular explanatory variable between a model and some nested alternative model. Example. In the example below, the binary variable capital punishment (1=favor, 2=oppose) is the dependent. In binary logistic regression, the higher category of the dependent (2 = oppose) is predicted and the lower category (1 = favor) is the comparison reference by default (the researcher could override this). It does not matter if the dependent is coded 0, 1 or 1, 2. Capital punishment attitude is predicted from sex (a dichotomy), marital status (a categorical variable with 5 levels, with the reference category = 5 = never married), and age (a continuous variable).
Dichotomies. Odds ratios will differ depending on whether a dichotomy is entered as a dichotomous covariate or as a categorical variable. In this example, sex is a dichotomy, with 1=male and 2=female. In binomial regression, if a dichotomy is declared categorical, as it is in the upper portion of the figure below, then the prediction is for the lower category and the higher category (sex = 2 = female) is the reference. For categorical coding of sex, we can say that the odds of opposing capital punishment compared to favoring it are decreased by a factor of .571 by being male rather than female. Or we could say that the odds a male opposes capital punishment are .571 the odds a female respondent opposes it. If, on the other hand, sex is left as a dichotomous covariate, as in the lower portion of the figure below, then the reference category is the lower category (usually the 0 category, but here with (1,2) coding, the lower category is 1 = male). For covariate coding of sex, the odds ratio is 1.75. We can say that the odds of opposing the death penalty compared to favoring it are increased by a factor of 1.75 by being female rather than male, controlling for other variables in the model. Or we could say that the odds a female opposes the death penalty are 1.75 the odds a male opposes it, controlling for other variables in the model.
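The two codings of sex give reciprocal odds ratios, which can be verified numerically. The .571 odds ratio is taken from the example above; the standard error used for the confidence interval is invented for illustration.

```python
# Reversing which category of a dichotomy is the reference flips the
# sign of b, so the two odds ratios are reciprocals of one another.
import math

or_male = 0.571                     # odds ratio, male vs. female reference
b = math.log(or_male)               # b = ln(odds ratio)
assert math.isclose(math.exp(b), or_male)   # odds ratio = exp(b)

or_female = 1 / or_male             # female vs. male reference: ~1.751

# 95% CI for an odds ratio: exponentiate b +/- 1.96*SE
se = 0.15                           # hypothetical standard error of b
ci = (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))
print(f"reciprocal OR = {or_female:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note the confidence interval is computed on the b scale and then exponentiated, so it is asymmetric around the odds ratio.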
Categorical variables: Categorical variables must be interpreted in terms of the left-out reference category, as in OLS regression. In the example above, the categorical variable "marital" has five levels: 1=married, 2=widowed, 3=divorced, 4=separated, 5=never married. "Never married" is therefore the default reference category. If b is positive, then when the dummy = 1 (that category of the categorical variable is present), the log odds (logit) of the dependent also increase. For the first category of marital in the example above, the odds ratio is .587. Recall binary logistic regression by default predicts the higher category of the dependent, which is cappun = 2 = opposing capital punishment. We would therefore say that the odds of opposing capital punishment compared to favoring it are decreased by a factor of .587 when the respondent is married compared to being never married, controlling for other variables in the model. Similar statements would be made about the other levels of marital, all making comparison to the reference category, never married. Note, however, that the other three contrasts are not significant (marital 2, 3, or 4 versus 5). Continuous covariates: For continuous covariates, when the parameter estimate, b, is transformed into an odds ratio, it may simply be expressed as a percent increase in odds. Thus in the example above, the covariate age has an odds ratio very close to 1.0 (no effect for odds ratios), so it is not surprising that it is very nonsignificant. Were it significant, we would say that the odds of opposing capital punishment compared to favoring it change by the factor Exp(b) for each year age increases, controlling for other variables in the model. More examples. To take another example, consider the example of number of publications of professors (see Allison, 1999: 188). Let the logit coefficient for "number of articles published" be .0737, where the dependent variable is "being promoted".
The odds ratio which corresponds to .0737 is approximately 1.08 (e to the .0737 power). Therefore one may say, "One additional article published increases the odds of promotion by about 8%, controlling for other variables in the model." (Obviously, this is the same as saying the new odds are 108% of the original odds, or noting that one multiplies the original dependent odds by 1.08. By the same token, it is not the same as saying that the probability of promotion increases by 8%.)
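The promotion example's arithmetic can be checked in one line: exponentiating the logit coefficient .0737 gives the odds ratio.

```python
# Checking the publications example: exp(.0737) is the odds ratio for
# one additional article published.
import math

odds_ratio = math.exp(0.0737)
print(f"{odds_ratio:.4f}")   # about 1.08, i.e. roughly an 8% increase in odds
```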
To take a third example, let income be a continuous explanatory variable measured in ten thousands of dollars, with a parameter estimate of 1.5 in a model predicting home ownership = 1, no home ownership = 0. A 1-unit increase in income (one $10,000 unit) is then associated with a 1.5 increase in the log odds of home ownership. However, it is more intuitive to convert to an odds ratio: exp(1.5) = 4.48, allowing one to say that a unit ($10,000) change in income increases the odds of home ownership about 4.5 times. Parameter estimates and odds ratios with multinomial dependents are interpreted along the same lines as for binary logistic regression, illustrated in the following examples. Example 1.
Warning: The defaults in binary compared to multinomial logistic regression yield different coefficients. SPSS will yield substantively identical results for the same model and data, but the logistic
coefficients may differ because different reference categories may be used in binomial compared to multinomial models. Significance levels will not be affected. In the example above, the outcome variable is the binary variable "cappun", coded 1 = favors capital punishment and 2 = opposes. The predictor variable is "degree", coded ordinally from 0 = less than high school to 4 = graduate degree. Crosstabs would show that those with bachelor's and graduate degrees are more likely to oppose capital punishment. With "degree" entered as a covariate... In binomial regression, the b coefficient for degree is positive (.221) and the odds ratio is exp(.221), about 1.25 (the reciprocal of .801). A unit increase in degree increases by this factor the odds of being opposed to capital punishment rather than in favor. For the dependent, what is predicted is the highest value (2 = opposes). For the independent, there is no reference category as it is treated as a covariate. In multinomial regression, the b coefficient for degree is negative (-.221) and the odds ratio is .801. A unit increase in degree decreases by a factor of .801 the odds of being in favor of capital punishment rather than opposed. For the dependent, what is predicted is the lowest value (1 = favors). For the independent, there is no reference category as it is treated as a covariate. With "degree" entered as a factor... In binomial regression, there is no b coefficient for degree as a whole. Rather, there are b coefficients for each level of degree except the highest. Note that in binomial output for a factor (unlike multinomial output), SPSS will count levels from 1 to 5 even if the actual codes on file are from 0 to 4. Degree = 1 in the table is actually coded 0, for instance, and the highest coded value (4 = graduate) is the omitted reference category 5. Having less than a high school degree (being in the degree(1) level) rather than having a graduate degree reduces by a factor of .557 the odds of being opposed to capital punishment rather than in favor.
For the dependent, what is predicted is the highest value, cappun = 2 = opposes capital punishment. For the independent, degree, the highest value (4 = graduate) is the reference category. In multinomial regression, there are b coefficients and odds ratios for degree(0) through degree(4), corresponding to the original codes 0 through 4. Having less than a high school degree (degree(0)) rather than having a graduate degree (4) increases by a factor of about 1.8 (the reciprocal of .557) the odds of favoring capital punishment rather than opposing. For the dependent, what is predicted is the lowest value, cappun = 1 = favors capital punishment. For the independent, degree, the highest value is still the reference category. Example 2. In the example of multinomial logistic regression (NOMREG) output below, the multinomial variable work status (1=full time, 2=part time, 3 = other, 4 = unemployed) is predicted from sex (a dichotomy, with 1=male, 2=female), marital status (a categorical variable with 5 levels, with the reference category = 5 = never married), and age (a continuous variable).
Recall this chart: Dichotomies. Sex (1=male, 2=female) may be entered as a factor or as a covariate. If entered as a factor, the higher value is predicted and the lower category is the reference. If entered as a covariate, the lower value is predicted and the higher category is the reference. Here, sex is entered as a covariate, so male is predicted and female is the reference. The major difference from the binomial situation is that sex now has three b coefficients with three odds ratios (exp(b)), one for each level of the multinomial dependent (Work) except its reference category (4 = unemployed), which is assumed but does not show in the table above. In the figure above, for Work = 2 = part-time, the odds ratio for sex is 4.552, where sex is entered as a covariate dichotomy. We can therefore say that the odds of being part-time rather than unemployed are increased by a factor of about 4.6 by being male rather than female, controlling for other variables in the model. We cannot make a
corresponding statement about full-time work, as that odds ratio is non-significant. Categorical predictors (factors). Recall that the categorical variable "marital" has five levels: 1=married, 2=widowed, 3=divorced, 4=separated, 5=never married. The highest category, "never married," is therefore the default reference category. For instance, for full-time work, the odds ratio for 1=married is 4.7. Therefore we can say that the odds of being full-time rather than unemployed are increased by a factor of 4.7 by being married rather than never married, controlling for other variables in the model. Continuous predictors (covariates). For covariates, b is the amount, controlling for any other predictor variables in the model, that the log odds of the dependent variable changes when the continuous independent variable changes one unit. Most researchers report covariates in terms of odds ratios. Age is an example of a continuous covariate in the example above. Age is significant only for work status = other, where its odds ratio is 1.075. The closer the odds ratio is to 1.0, the closer the predictor comes to being independent of the dependent variable, so we can say age is not a strong predictor. Age is not significantly related to the odds of being in full- or part-time work compared to being unemployed. The only significant statement would be that a 1-year increase in age increases by 7.5% the odds of being in the "Other" work status category rather than "Unemployed", controlling for other variables in the model. (The "Other" category includes retired, housework, school, temporarily not working, and other.) Comparing the change in odds for different values of X. The odds ratio, which is Exp(b), is the factor by which odds(event) changes for a 1-unit change in X. But what if years_education was the X variable and one wanted to know the change factor for X = 12 years vs. X = 16 years? Here, the X difference is 4 units. The change factor is not Exp(b)*4.
Rather, odds(event) changes by a factor of Exp(b)^4. That is, odds(event) changes by a factor of Exp(b) raised to the power of the number of units change in X. Comparing the change in odds when interaction terms are in the model. In general, Exp(b) is the odds ratio and represents the factor by which odds(event) is multiplied for a unit increase in the X variable. However, the effect of the X variable is not properly gauged in this manner if X is also involved in interaction effects which are also in the model. Before exponentiating, the b coefficient must be adjusted to include the interaction b terms. Let X1 be years_education and let X2 be a dichotomous variable called
"school_type" coded 0=private school, 1=public school, and let the interaction term X1*X2 also be in the model. Let the b coefficients be .864 for X1, .280 for X2, and .010 for X1*X2. The adjusted b, which we shall label b*, for years_education is b* = .864 + .010*school_type. For private schools, b* = .864. For public schools, b* = .874. Exp(b*) is then the estimate of the odds ratio, which will be different for different values of the variable (here, school_type) with which the X variable (here, years_education) is interacting. Logit coefficients vs. logits. Although b coefficients are called logit or logistic coefficients, they are not logits. That is, b is the parameter estimate and z is the logit. Parameter estimates (b) have to do with changes in the logit, z, while the logit has to do with estimates of odds(event). To avoid confusion, many researchers refer to b as the "parameter estimate." Logit coefficients, and why they are preferred over odds ratios in modeling. Note that for the case of decrease the odds ratio can vary only from 0 to just under 1.0, while for the case of increase it can vary from just above 1.0 to infinity. This asymmetry is a drawback to using the odds ratio as a measure of strength of relationship. Odds ratios are preferred for interpretation, but logit coefficients are preferred in the actual mathematics of logistic models. Warning: The odds ratio is a different way of presenting the same information as the unstandardized logit (effect) coefficient discussed in the section on logistic regression, and like it, is not recommended when comparing the relative strengths of the independents. The standardized logit (effect) coefficient is used for this purpose. Effect size. The odds ratio is a measure of effect size. The ratio of odds ratios of the independents is the ratio of relative importance of the independent variables in terms of effect on the dependent variable's odds.
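The interaction adjustment described above can be sketched in a few lines; the b values .864, .280, and .010 are taken from the school_type example in the text, while the helper function is our own:

```python
import math

def adjusted_exp_b(b_x1: float, b_interaction: float, x2: float) -> float:
    """Odds ratio for X1 when an X1*X2 interaction is in the model.

    b* = b_x1 + b_interaction * x2, and the odds ratio is exp(b*).
    """
    b_star = b_x1 + b_interaction * x2
    return math.exp(b_star)

# Values from the text: b = .864 for years_education, .010 for the interaction
private = adjusted_exp_b(0.864, 0.010, x2=0)   # school_type = 0 (private)
public = adjusted_exp_b(0.864, 0.010, x2=1)    # school_type = 1 (public)
# b* is .864 for private schools and .874 for public schools,
# so the odds ratio for years_education differs by school type.
```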
(Note that standardized logit coefficients may also be used, as discussed below, but then one is discussing relative importance of the independent variables in terms of effect on the dependent variable's log odds, which is less intuitive.) Confidence interval on the odds ratio. SPSS labels the odds ratio "Exp(B)" and prints "Low" and "High" confidence limits for it. If the low-high range contains the value 1.0, then being in that variable value category makes no significant difference to the odds of the dependent, compared to being in the reference (usually highest) value for that variable. That is, when the 95% confidence interval around the odds ratio includes the value 1.0, indicating that a change in value of the independent variable is not associated with a change in the odds of the dependent variable assuming a given value, then that variable is not considered a useful predictor in the logistic model.
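A confidence interval for the odds ratio is conventionally obtained by computing the interval for b on the logit scale and then exponentiating. A minimal sketch (our own helpers, not SPSS output) that also applies the "contains 1.0" screening rule just described:

```python
import math

def odds_ratio_ci(b: float, se: float, z: float = 1.96):
    """95% CI for Exp(B): build the interval for b, then exponentiate."""
    low = math.exp(b - z * se)
    high = math.exp(b + z * se)
    return low, high

def useful_predictor(b: float, se: float) -> bool:
    """False when the CI for the odds ratio contains 1.0."""
    low, high = odds_ratio_ci(b, se)
    return not (low <= 1.0 <= high)

# Hypothetical coefficients and standard errors:
print(useful_predictor(0.8, 0.3))   # CI excludes 1.0 -> True
print(useful_predictor(0.2, 0.3))   # CI includes 1.0 -> False
```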
o Probability interpretations. While logistic coefficients are usually interpreted as odds, not probabilities, it is possible to use probabilities. Exp(z) is odds(event), and odds(nonevent) is its reciprocal, 1/Exp(z) = Exp(-z). Therefore P(event) = Exp(z)/(1 + Exp(z)). Recall z = the constant plus the sum of crossproducts of the b coefficients times the values of their respective X (independent) variables. For dichotomous independents assuming the values (0,1), the crossproduct term is null when X = 0 and is b when X = 1. For continuous independents, different probabilities will be computed depending on the value of X. That is, P(event) varies depending on the covariates. o Relative risk ratios (RRR). The STATA mlogit multinomial logit regression procedure outputs the relative risk ratio, RRR, sometimes also labeled relative risk reduction or risk response ratio. RRR is sometimes interpreted as equivalent to an odds ratio, but it is not mathematically equivalent. Nonetheless, similar substantive conclusions are apt to flow from interpreting RRR as a type of odds ratio. However, RRR has more of a probability interpretation than an odds interpretation. RRR is the probability that selecting a given level of the predictor increases or decreases the dependent (in binary models, increases or decreases the probability that the dependent = 1) relative to selecting the baseline level. For further discussion of RRR and odds ratios for the binary case, see the Statnotes section on 2x2 tables. Information Theory Measures of Model Fit o The Akaike Information Criterion, AIC, is a common information theory statistic used when comparing alternative models. Lower values indicate better model fit. In SPSS multinomial logistic regression, AIC is output in the Step Summary, Model Fitting Information, and Likelihood Ratio Tests tables, as illustrated below. AIC is not output by SPSS binomial logistic regression. It is, however, also output by SAS's PROC LOGISTIC.
AIC is discussed further in the section dealing with structural equation modeling, here. o The Bayesian Information Criterion, BIC, is a common information theory statistic used when comparing alternative models. Lower values indicate better model fit. In SPSS multinomial logistic regression, BIC is output in the Step Summary, Model Fitting Information, and Likelihood Ratio Tests tables, as illustrated below. BIC is not output by SPSS binomial logistic regression. It is, however, also output by SAS's PROC LOGISTIC. BIC is discussed further in the section dealing with structural equation modeling, here.
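Both criteria are computed from the model's log likelihood with a penalty for the number of estimated parameters, using the standard formulas AIC = -2LL + 2k and BIC = -2LL + k*ln(n). A sketch with illustrative numbers of our own (not from any table in this text):

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: -2LL plus 2 per parameter."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: the penalty grows with ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical competing models fit to the same n = 500 cases:
m1 = {"ll": -310.2, "k": 4}   # smaller model
m2 = {"ll": -305.9, "k": 7}   # larger model: better LL, 3 extra parameters
best_by_aic = min((m1, m2), key=lambda m: aic(m["ll"], m["k"]))
best_by_bic = min((m1, m2), key=lambda m: bic(m["ll"], m["k"], 500))
# Here AIC prefers the larger model while BIC, with its heavier
# penalty at this n, prefers the smaller one -- lower is better for both.
```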
o The Schwarz Information Criterion, SIC, is a modified version of AIC and is part of SAS's PROC LOGISTIC output. Compared to AIC, SIC penalizes overparameterization more heavily (that is, it rewards model parsimony). Lower values indicate better model fit. It is common to use both AIC and SIC when assessing alternative logistic models. Measures of Effect Size o R-squared. There is no widely accepted direct analog to OLS regression's R². This is because an R² measure seeks to make a statement about the "percent of variance explained," but the variance of a dichotomous or categorical dependent variable depends on the frequency distribution of that variable. For a dichotomous dependent variable, for instance, variance is at a maximum for a 50:50 split, and the more lopsided the split, the lower the variance. This means that R-squared measures for logistic regressions with differing marginal distributions of their respective
dependent variables cannot be compared directly, and comparison of logistic R-squared measures with R² from OLS regression is also problematic. Nonetheless, a number of logistic R-squared measures have been proposed, all of which may be reported as approximations to OLS R², not as actual percent of variance explained. Be aware, however, that many researchers consider these R-square substitutes to be of only marginal interest and that the classification rate, discussed below, is a preferable measure of effect size. Note that the R²-like measures below are not goodness-of-fit tests but rather attempt to measure strength of association. Unfortunately, the pseudo-R² measures confound effect strength with goodness of fit. For small samples, for instance, an R²-like measure might be high when goodness of fit was unacceptable by the likelihood ratio test. SPSS supports three R²-like measures: Cox and Snell's, Nagelkerke's, and McFadden's, as illustrated below. Output is identical for binomial and multinomial logistic regression and in SPSS appears in the "Pseudo R Square" table. 1. Cox and Snell's R² is an attempt to imitate the interpretation of multiple R-square based on the log likelihood of the final model vs. the log likelihood of the baseline model, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. It is part of SPSS output in the "Model Summary" table. 2. Nagelkerke's R² is a modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1. That is, Nagelkerke's R² divides Cox and Snell's R² by its maximum in order to achieve a measure that ranges from 0 to 1. Therefore Nagelkerke's R² will normally be higher than the Cox and Snell measure but will tend to run lower than the corresponding OLS R². Nagelkerke's R² is part of SPSS output in the "Model Summary" table and is the most-reported of the pseudo-R² estimates. See Nagelkerke (1991). 3.
McFadden's R² is a less common pseudo-R² variant, based on the log-likelihood kernels for the full versus the intercept-only models. See McFadden. 4. Aldrich and Nelson's Pseudo-R² is a coefficient which serves as an analog to the squared contingency coefficient, with an interpretation like R-square. Its maximum is less than 1. It may be used in either dichotomous or multinomial logistic regression.
5. Hagle and Mitchell's Pseudo-R² is an adjustment to Aldrich and Nelson's Pseudo-R² and generally gives higher values which compensate for the tendency of the latter to underestimate model strength. 6. R² is OLS R-square, which can be used in binary logistic regression (see Menard, p. 23) but not in multinomial logistic regression. To obtain R-square, save the predicted values from logistic regression and run a bivariate regression of the observed dependent values on them. Note that logistic regression can yield deceptively high R² values when you have many variables relative to the number of cases, keeping in mind that the number of variables includes k-1 dummy variables for every categorical independent variable having k categories. Classification tables are the 2 x 2 tables in the logistic regression output for dichotomous dependents, or the 2 x n tables for ordinal and polytomous logistic regression, which tally correct and incorrect estimates. Both binomial and multinomial logistic regression output classification tables. The columns are the predicted values of the dependent, while the rows are the observed (actual) values of the dependent. In a perfect model, all cases will be on the diagonal and the overall percent correct will be 100%. If the logistic model has homoscedasticity (not a logistic regression assumption), the percent correct will be approximately the same for both rows. Warning. Although classification rates as overall effect size measures are preferred over pseudo-R² measures, they too have some severe limitations for this purpose. Classification tables should not be used exclusively as goodness-of-fit measures because they ignore actual predicted probabilities and instead use dichotomized predictions based on a cutoff (ex., .5).
For instance, in binary logistic regression, predicting a 0-or-1 dependent, the classification table does not reveal how close to 1.0 the correct predictions were nor how close to 0.0 the errors were. A model in which the predictions, correct or not, were mostly close to the .50 cutoff does not have as good a fit as a model where the predicted scores cluster near either 1.0 or 0.0. Also, because the hit rate can vary markedly by sample for the same logistic model, use of the classification table to compare across samples is not recommended. Logistic classification. Using the logistic regression formula (discussed above with reference to the b parameters), logistic regression computes the predicted probabilities of being in the group of interest (for the illustration below, Dependent=2, since the binary dependent is coded 1 and 2). Any case whose predicted probability is .5 or greater is classified in the higher (2) class. For binary dependents, any residual greater than .5 in absolute value represents a classification error.
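The classification logic above can be sketched as a small routine that builds the 2 x 2 table from observed 0/1 values and predicted probabilities at a .5 cutoff (a generic sketch with made-up data, not SPSS's internal code):

```python
def classification_table(observed, probs, cutoff=0.5):
    """Cross-tabulate observed (0/1) vs. predicted (0/1) at the cutoff."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(observed, probs):
        pred = 1 if p >= cutoff else 0
        table[(y, pred)] += 1
    return table

def hit_rate(table):
    """Percent correct: the diagonal cells over the total."""
    correct = table[(0, 0)] + table[(1, 1)]
    return correct / sum(table.values())

obs = [0, 0, 0, 1, 1, 1]
p = [0.2, 0.4, 0.7, 0.6, 0.9, 0.3]   # one error in each direction
t = classification_table(obs, p)     # 4 of 6 cases fall on the diagonal
```

Note that, exactly as the text warns, the probabilities .4 and .6 contribute to the hit rate identically to .01 and .99, even though the latter reflect far better-differentiated predictions.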
What hit rate is "good"? The "hit rate" is simply the number of correct classifications divided by sample size, and is listed as "percentage correct" for the model. Researchers disagree about when the hit rate is "high enough" to consider the model "good". There is agreement, however, that the hit rate should be higher than the chance hit rate -- although there are two ways of defining "chance hit rate". Many researchers want the observed hit rate to be some percentage, say 25%, better than the chance rate to consider a model "good". That is, the researcher (1) picks one of the two definitions of "chance hit rate" and computes it for the given data; (2) multiplies the chance hit rate by the improvement factor (ex., by 1.25), giving the criterion hit rate; and (3) then, if the observed hit rate is as high or higher than the criterion hit rate, the model is considered "good".
0. Chance rate as proportional reduction in error. The figure above is actually two classification tables, with the upper section being for the baseline or intercept-only model, and the bottom section being for the researcher's final model. Such a double classification table is generated by SPSS when stepwise logistic regression is requested. The comparison is very useful. In the example above, in the bottom final model portion, the final model classifies 63.9% correctly, which may seem moderately good. However, looking at the upper intercept-only portion, we see that the null model actually did better, classifying 65% correctly. It did so simply by using the most numerous category (gunown = no) to classify all cases, thereby getting 1,213 correct. This is the proportional reduction in error (PRE) definition of "chance hit rate." PRE is actually negative for these data: (63.9 - 65)/65 = -1.1/65 = -.017. There is actually a 2% increase in error (rounded off) when employing the model compared to chance, by this definition of chance. The model percent correct of 63.9% is thus not as good as it might first appear. Note that while no particular split of the dependent variable is assumed, the split makes a difference. In the example above, where 65% of respondents say there is not a gun in the home, guessing "No" leads to being right by chance 65% of the time. If the dependent is split 99:1 instead of 65:35, then one could guess the value of the dependent correctly 99% of the time just by always selecting the more
common value. The more lopsided the split, the harder it is for the researcher's logistic model to do better than chance under a PRE definition of chance. This does not mean the model's predictor variables are non-significant, just that they do not move the estimates enough to make a difference compared to pure guessing. 1. Chance rate as proportional by chance. A second, more lenient definition of "chance" is the proportional by chance (PC) method. In this method, chance is defined as the sum of squared prior probabilities. That is, one squares the proportion of cases in each level of the dependent and sums. A third way of putting it is that PC is the sum of the squared dependent marginals when the marginals are expressed as proportions. (The "Case Summary Table" of SPSS output provides the dependent marginals.) For the table above, there are 653 gunown="yes" respondents, or 34.99%, and 1,213 (65.01%) gunown="no" respondents. The PC chance rate is thus .3499² + .6501² = .1224 + .4226 = .5450 (rounding error taken into account). For more on PC see Hair et al., 1987: 89-90; Hand, Mannila, & Smyth. 2. Improvement criteria. Thus the proportional by chance (PC) definition of chance gives a chance hit rate of 54.5%. The proportional reduction in error (PRE) definition of chance gave a chance hit rate of 65.0%. The observed hit rate, 63.9%, is in between. Is that "good enough"? A common rule of thumb (improvement criterion) is that a model should do 25% better than chance. By this rule, the PC criterion level would be .545*1.25 = .681. The PRE criterion level would be .650*1.25 = .813. By either improvement criterion, the observed hit rate (63.9%) does not indicate a "good model" for these data. However, it is up to the researcher to define what is "good". The most liberal criterion would be any observed hit rate above the PC baseline (.545), which would make the model above a "good" one. The most conservative criterion would be any observed hit rate above the improvement-criterion PRE rate (.813).
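The two chance baselines and the 25% improvement criterion can be reproduced directly from the gun-ownership marginals given above (653 "yes", 1,213 "no"); the function names are ours:

```python
def pre_chance_rate(counts):
    """PRE baseline: always guess the most numerous category."""
    return max(counts) / sum(counts)

def pc_chance_rate(counts):
    """Proportional-by-chance baseline: sum of squared marginal proportions."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

counts = [653, 1213]             # gunown = yes / no
pre = pre_chance_rate(counts)    # about .650
pc = pc_chance_rate(counts)      # about .545
criterion_pc = pc * 1.25         # about .681
criterion_pre = pre * 1.25       # about .813
# The observed hit rate of .639 falls below both improvement criteria,
# matching the text's conclusion for these data.
```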
"Good" is in the eye of the beholder and depends on how "chance" is defined. Nearly all researchers would agree, however, that an observed hit rate below the PC baseline indicates a poor model. Measures of monotone association. Since the classification table is a 2-by-2 table, tabular measures of association may be computed.
The multinomial but not the binary logistic regression procedure supports this in the "Measures of Monotone Association" table, but recall that one can enter binomial dependents in the multinomial procedure in SPSS to obtain this output. Measures include Somers' d, Goodman and Kruskal's gamma, Kendall's tau-a, and the Concordance Index, as illustrated below. For most measures, 1.0 indicates maximum effect and 0 indicates independence (no effect of the independents on the dependent). However, see the more detailed discussion of each measure of strength of association in the section on association. Terms associated with classification tables: 0. Hit rate: Number of correct predictions divided by sample size. The hit rate for the model should be compared to the hit rate for the classification table for the constant-only model (Block 0 in SPSS output). The Block 0 rate will be the percentage in the most numerous category (that is, the null model predicts the most numerous category for all cases). 1. Sensitivity: Percent of correct predictions among observed cases in the category of interest (ex., 1 for binary logistic regression). 2. Specificity: Percent of correct predictions among observed cases in the other category (ex., 0 for binary logistic regression). 3. False positive rate: In binary logistic regression, the number of errors where the dependent is predicted to be 1 but is in fact 0, as a percent of total cases which are observed 0's. In multinomial logistic regression, the number of errors where the predicted value of the dependent is
higher than the observed value, as a percent of all cases on or above the diagonal. 4. False negative rate: In binary logistic regression, the number of errors where the dependent is predicted to be 0 but is in fact 1, as a percent of total cases which are observed 1's. In multinomial logistic regression, the number of errors where the predicted value of the dependent is lower than the observed value, as a percent of all cases on or below the diagonal. The histogram of predicted probabilities, also called the "classplot" or the "plot of observed groups and predicted probabilities," is part of SPSS output when one chooses "Classification plots" under the Options button in the Logistic Regression dialog. It is an alternative way of assessing correct and incorrect predictions under logistic regression. The X axis is the predicted probability, from 0.0 to 1.0, of the dependent being classified "1". In the election example below, 0 corresponded to predicting Gore and 1 to Bush. The line of g's and b's under the X axis indicates that a prediction of 0 to .5 corresponded to the case being classified as Gore, and .5 to 1 corresponded to being classified as Bush. The Y axis is frequency: the number of cases classified. Inside the plot are columns of observed 1's and 0's (or equivalent symbols, such as g's for Gore and b's for Bush in the election study illustrated below). Thus a column with one "b" and two "g's" set at p = .25 would mean that these three cases were all predicted to be "1's" with only a probability of .25, and thus were classified as "0's" (g's in this example). Of these, two actually were "g's" but one (an error) was a "b" on the dependent variable.
o The researcher looks for two things: (1) a U-shaped rather than normal distribution is desirable. A U-shaped distribution indicates the predictions are well-differentiated. A normal distribution indicates many predictions are close to the cut point, which is not as good a model fit; and (2) there should be few errors. In the figure above, dealing with the 2000 election, the b's to the left are false predictions for Bush. The g's to the right are false predictions for Gore. Examining this plot will also tell such things as how well the model classifies difficult cases (ones near p = .5). Unstandardized logit coefficients are shown in the "B" column of the "Variables in the Equation" table in SPSS binary logistic regression output. There will be a B coefficient for each independent and for the constant. The logistic regression model is ln(odds) = log odds(dependent variable) = (B for var1)*var1 + (B for var2)*var2 + ... + (B for varN)*varN + (B for the constant). Note that logistic regression predicts the log odds of the dependent, not the dependent itself as in OLS regression. Standardized logit coefficients, also called standardized effect coefficients or beta weights, correspond to beta (standardized regression) coefficients and like them may be used to compare the relative strength of the independents. However, odds ratios are preferred for this purpose, since when using standardized logit coefficients one is discussing relative
importance of the independent variables in terms of effect on the dependent variable's logged odds, which is less intuitive than the actual odds of the dependent variable, which are the referent when odds ratios are used. SPSS does not output standardized logit coefficients, but note that if one standardizes one's input data first, then the parameter estimates will be standardized logit coefficients. (SPSS does output odds ratios, which are found in the "Exp(B)" column of the "Variables in the Equation" table in binomial regression output.) Alternatively, one may multiply the unstandardized logit coefficients by the standard deviations of the corresponding variables, giving a result which is not the standardized logit coefficient but can be used to rank the relative importance of the independent variables. Note: Menard (p. 48) warned that as of 1995, SAS's "standardized estimate" coefficients were really only partially standardized. Different authors have proposed different algorithms for "standardization," and these result in different values, though generally the same conclusions about the relative importance of the independent variables. Odds ratios, discussed above, are also used as effect size measures and in fact are the preferred effect size measure in logistic regression when comparing predictor variables. Partial contribution, R. Partial R is an alternative method of assessing the relative importance of the independent variables, similar to standardized partial regression coefficients (beta weights) in OLS regression. R is a function of the Wald statistic, the initial -2 log likelihood, D0 (discussed below), and the number of degrees of freedom for the variable. SPSS prints R in the "Variables in the Equation" section. Note, however, that there is a flaw in the Wald statistic such that very large effects may lead to large standard errors, small Wald chi-square values, and small or zero partial R's.
For this reason it is better to use odds ratios for comparing the importance of independent variables. Lambda-p is a PRE (proportional reduction in error) measure, which is the ratio of (errors without the model - errors with the model) to errors without the model. If lambda-p is .80, then using the logistic regression model will reduce our errors in classifying the dependent by 80% compared to classifying the dependent by always guessing a case is to be classed in the most frequent category of the dichotomous dependent. Lambda-p is an adjustment to classic lambda to assure that the coefficient will be positive when the model helps and negative when, as is possible, the model actually leads to worse predictions than simple guessing based on the most frequent class. Lambda-p varies from 1 to (1 - N), where N is the number of cases. Lambda-p = (f - e)/f, where f is the smallest row frequency (smallest row marginal in the classification table) and e is the number of errors (the 1,0 and 0,1 cells in the classification table). Tau-p is an alternative measure of association. When the classification table has equal marginal distributions, tau-p varies from -1 to +1, but
otherwise may be less than 1 in absolute value. Negative values mean the logistic model does worse than expected by chance. Tau-p can be lower than lambda-p because it penalizes proportional reduction in error for non-random distribution of errors (that is, it wants an equal number of errors in each of the error quadrants in the table). Phi-p is a third alternative discussed by Menard (pp ) but is not part of SPSS output. Phi-p varies from -1 to +1 for tables with equal marginal distributions. Binomial d is a significance test for any of these measures of association, though in each case the number of "errors" is defined differently (see Menard, pp ). Separation: Note that when the independents completely predict the dependent, the error quadrants in the classification table will contain 0's, which is called complete separation. When this is nearly the case, as when the error quadrants have only one case, this is called quasi-complete separation. When separation occurs, one will get very large logit coefficients with very high standard errors. While separation may indicate powerful and valid prediction, often it is a sign of a problem with the independents, such as definitional overlap between the indicators for the independent and dependent variables. The c statistic is a measure of the discriminative power of the logistic equation. It varies from .5 (the model's predictions are no better than chance) to 1.0 (the model always assigns a higher probability to the dependent=1 case than to the dependent=0 case for any pair involving dependent=0 and dependent=1). Thus c is the percent of all such possible pairs of cases in which the model assigns a higher probability to the case where the event occurred than to the case where it did not. The c statistic is not part of SPSS logistic output but may be calculated using the COMPUTE facility, as described in the SPSS manual's chapter on logistic regression. Alternatively, save the predicted probabilities and then get the area under the ROC curve.
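The pairwise definition of c can also be computed directly from saved predicted probabilities; the brute-force sketch below is equivalent to the area under the ROC curve, with ties counted as one half (a common convention, and our own code rather than any package's):

```python
def c_statistic(observed, probs):
    """Fraction of (event, nonevent) pairs in which the event case
    received the higher predicted probability; ties count 1/2."""
    events = [p for y, p in zip(observed, probs) if y == 1]
    nonevents = [p for y, p in zip(observed, probs) if y == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))

obs = [0, 0, 1, 1]
print(c_statistic(obs, [0.1, 0.4, 0.35, 0.8]))  # 0.75: 3 of 4 pairs concordant
```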
In SPSS, select Analyze, Regression, Binary (or Multinomial); select the dependent and covariates; click Save; check to save predicted values (pre_1); Continue; OK. Then select Graphs, ROC Curve; set pre_1 as the test variable; select standard error and confidence interval; OK. In the output, c is labeled "Area." It will vary from .5 to 1.0. Contrast Analysis o Repeated contrasts is an SPSS option (called profile contrasts in SAS) which computes the logit coefficient for each category of the independent (except the "reference" category, which is the last one by default). Contrasts are used when one has a categorical independent variable and wants to understand the effects of various levels of that variable. Specifically, a "contrast" is a set of coefficients that sum to 0 over the levels of the independent categorical variable. SPSS automatically creates
K-1 internal dummy variables when a covariate is declared to be categorical with K values (by default, SPSS leaves out the last category, making it the reference category). The user can choose various ways of assigning values to these internal variables, including indicator contrasts, deviation contrasts, or simple contrasts. In SPSS, indicator contrasts are now the default (old versions used deviation contrasts as the default). Indicator contrasts produce estimates comparing each other group to the reference group. David Nichols, senior statistician at SPSS, gives this example of indicator coding output:

Parameter codings for indicator contrasts
Parameter   Value   Freq   Coding (1)   Coding (2)
GROUP       1       -      1            0
            2       -      0            1
            3       -      0            0

This example shows a three-level categorical independent (labeled GROUP), with category values of 1, 2, and 3. The predictor here is called simply GROUP. It takes on the values 1-3, with frequencies listed in the "Freq" column. The two "Coding" columns are the internal values (parameter codings) assigned by SPSS under indicator coding. There are two columns of codings because two dummy variables are created for the three-level variable GROUP. For the first dummy variable, which is Coding (1), cases with a value of 1 for GROUP get a 1, while all other cases get a 0. For the second, cases with a 2 for GROUP get a 1, with all other cases getting a 0. Simple contrasts compare each group to a reference category (like indicator contrasts). The contrasts estimated for simple contrasts are the same as for indicator contrasts, but the intercept for simple contrasts is an unweighted average of all levels rather than the value for the reference group. That is, with one categorical independent in the model, simple contrast coding means that the intercept is the log odds of a response for an unweighted average over the categories.
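The K-1 indicator coding just described can be sketched in a few lines; given each case's category value, the last (reference) level gets all zeros (our own sketch, not SPSS internals):

```python
def indicator_coding(values, levels):
    """Return K-1 dummy columns per case; the last level is the reference."""
    return [[1 if v == lvl else 0 for lvl in levels[:-1]] for v in values]

# Three-level GROUP variable, as in the Nichols example above
print(indicator_coding([1, 2, 3], levels=[1, 2, 3]))
# [[1, 0], [0, 1], [0, 0]]  -- GROUP=3 (the reference) is coded all zeros
```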
Deviation contrasts compare each group other than the excluded group to the unweighted average of all groups. The value for the omitted group is then equal to the negative of the sum of the parameter estimates. Contrasts and ordinality: For nominal variables, the pattern of contrast coefficients for a given independent should be random and nonsystematic, indicating the nonlinear, nonmonotonic pattern characteristic of a true nominal variable. Contrasts can thus be used as a method of empirically differentiating categorical independents into nominal and ordinal classes. Analysis of residuals. Residuals may be plotted to detect outliers visually. Residual analysis may lead to development of separate models for different types of cases. For logistic regression, it is usual to use the standardized difference between the observed and expected probabilities. SPSS calls this the "standardized residual (ZResid)," while SAS calls it the "chi residual"; Menard (1995), and at other times SPSS itself (in the table of "Observed and Predicted Frequencies" in multinomial logistic output), call it the "Pearson residual." In a model which fits in every cell formed by the independents, no absolute standardized residual will exceed the criterion cutoff. Cells which do not meet this criterion signal combinations of independent variables for which the model is not working well. o Residual analysis in binary logistic regression. Outliers beyond a certain number of standard deviations (default = 2) can be requested in binary logistic regression, giving output illustrated below. (Note that SPSS multinomial logistic regression does not output these outlier-related coefficients, so the researcher must run a binary logistic regression for each dummy variable representing a level of the multinomial dependent.)
o The dbeta statistic, DBeta, is available under the Save button to indicate cases which are poorly fitted by the model. Called DfBeta in SPSS (whose algorithm approximates dbeta), it measures the change in the logit coefficients for a given variable when a case is dropped. There is a DfBeta statistic for each case for each explanatory variable and for the constant. An arbitrary cutoff criterion for cases to be considered outliers is those with dbeta > 1.0 on critical variables in the model. The DfBeta statistic can be saved as DFB0_1 for the constant, DFB1_1 for the first independent, DFB2_1 for the second independent, etc. The leverage statistic, h, is available under the Save button to identify cases which influence the logistic regression model more than others. The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model). The leverage of any given case may be compared to the average leverage, which equals (k+1)/n, where k = the number of independents and n = the sample size. Note that influential cases may nonetheless have small leverage values when predicted probabilities are < .1 or > .9. Leverage is an option in SPSS, and a plot of leverage by case id will quickly identify cases with unusual impact. Cook's distance, D, is a third measure of the influence of a case, also available under the Save button. Its value is a function of the case's leverage and of the magnitude of its standardized residual. It is a measure of how much deleting a given case affects residuals for all cases. An approximation to Cook's distance is an option in SPSS logistic regression. Other. The Save button in the SPSS binary logistic dialog will also save the standardized residual as ZRE_1. One can also save predictions as PRE_1. Residual analysis in multinomial logistic regression.
NOMREG has little support for residual analysis, but the Save button in multinomial logistic regression supports saving predicted and actual category probabilities. One can then compute the difference for purposes of residual analysis. Conditional logistic regression for matched pairs data Multinomial logistic regression can be used for analysis of matched case-control pairs, where for every subject (case) there is a matched partner (control). Data setup. In one popular data setup for conditional logistic regression on matched pairs, every id number has three rows: the row for the matched pair, the subject, and the difference. The
difference row is obtained by subtracting the matched/control row from the subject row. There is a variable indicating the type of row. In the figure below this variable is "Status". Analysis is done on the difference row by using Data, Select cases, and selecting for Status = "Diff" or similar coding. This reduces the effective data set to just the "Diff" difference rows. Any categorical variables like race=1,2,3 must be replaced with sets of dummies such that race1, race2, etc., are 0-1 dummy variables and the last category, ex. race3, is omitted as the reference category. By default the highest value becomes the reference category. Note the odds are usually set up as case:control, so that Diff = Subjectvar - Matchvar. SPSS dialogs. After selecting Analyze, Regression, Multinomial logistic, a dependent variable is entered which has the same constant value for all cases. This is what tells SPSS that a conditional logistic model is being requested. In this example the dependent is the variable "Outcome." (The real dependent, of course, is whatever was used to differentiate Subject and Match pairs (case/control pairs), such as Subject=Promoted, Match=Not promoted, in a personnel study.) All predictors are entered as covariates. There are no factors because all categorical variables have been transformed into 0,1 dummy variables. The variables used for matching (ex., age, gender) are not entered into the model and, indeed, may not even be in the data set. If the matching variables are in the dataset, they may be used as part of interaction effects but may not be used as main effects since by definition their "Diff" value is 0 for every subject. Under the Model button, the researcher must uncheck the "Include intercept" checkbox. Output. Although the usual multinomial regression output will be generated, the classification table and the goodness of fit table are invalid for a conditional logistic model.
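The difference-row construction described under "Data setup" can be sketched in a few lines of pure Python. The variable names and values here are hypothetical; a real dataset would carry whatever predictors the study uses:

```python
# Each pair contributes a subject (case) row and a matched control row;
# the "Diff" row used for analysis is subject minus control (case:control odds).
def make_diff_rows(pairs):
    diff_rows = []
    for pair_id, subject, match in pairs:
        diff = {var: subject[var] - match[var] for var in subject}
        diff_rows.append((pair_id, "Diff", diff))
    return diff_rows

# Hypothetical pair: race recoded into 0-1 dummies race1, race2 beforehand.
pairs = [(1, {"test": 82, "race1": 1, "race2": 0},
             {"test": 75, "race1": 0, "race2": 1})]
print(make_diff_rows(pairs))
```

Note that a matching variable would always difference to 0, which is why matching variables cannot appear as main effects.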
The tables illustrated below (Model Fitting Information, Likelihood Ratio Tests, and Parameter Estimates) are valid for conditional logistic models. The Pseudo-R-Square table is also valid, though not illustrated here.
In the example above, both the likelihood ratio tests table and the parameter estimates table show for these data that Test (test score) is significant in differentiating those promoted from those not (i.e., subjects from matches), controlling for the variables used for matching (age, gender). Rating (supervisor's rating) and Race, in contrast, are not significant. Assumptions Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of OLS regression: 1. Logistic regression does not assume a linear relationship between the dependent and the independents. It may handle nonlinear effects even when exponential and polynomial terms are not explicitly added as
additional independents because the logit link function on the left-hand side of the logistic regression equation is non-linear. However, it is also possible and permitted to add explicit interaction and power terms as variables on the right-hand side of the logistic equation, as in OLS regression. 2. The dependent variable need not be normally distributed (though logistic regression does assume its distribution is within the exponential family of distributions, such as normal, Poisson, binomial, or gamma). Solutions may be more stable if predictors have a multivariate normal distribution. 3. The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity of variance assumption: variances need not be the same within categories. 4. Normally distributed error terms are not assumed. 5. Logistic regression does not require that the independents be interval. 6. Logistic regression does not require that the independents be unbounded. However, other assumptions still apply: 1. Data level. A dichotomous or polytomous dependent variable is assumed for binary or multinomial logistic regression, respectively. Reducing a continuous variable to a binary or categorical one loses information and attenuates effect sizes, reducing the power of the logistic procedure compared to OLS regression. That is, OLS regression is preferred to logistic regression for continuous variables when OLS assumptions are met, and is superior to categorizing a continuous dependent for purposes of running a logistic regression. Likewise, discriminant function analysis is more powerful than binary logistic regression for a binary dependent when the assumptions of the former are met. If the categories of the dependent are ordinal, ordinal regression will be more powerful and is preferred. Independent variables may be interval or categorical, but if categorical, it is assumed that they are dummy or indicator coded.
SPSS gives an option to recode categorical variables automatically. If all predictor variables are categorical, loglinear analysis is an alternative procedure. 2. Meaningful coding. Logistic coefficients will be difficult to interpret if not coded meaningfully. The convention for binary logistic regression is to code the dependent class of greatest interest as 1 ("the event occurring") and the other class as 0 ("the event not occurring"), and to code its expected correlates also as +1 to assure positive correlation. For multinomial logistic regression, the class of greatest interest should be the last class. Logistic regression is predicting the log odds of being in the class of greatest interest. The "1" group or class of interest is sometimes called the target group or the response group. 3. Proper specification of the model is particularly crucial; parameters may change magnitude and even direction when variables are added to or removed from the model. Inclusion of all relevant variables in the regression model: If relevant variables are omitted, the common variance they share
with included variables may be wrongly attributed to those variables, or the error term may be inflated. Exclusion of all irrelevant variables: If causally irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to the irrelevant variables. The more the correlation of the irrelevant variable(s) with other independents, the greater the standard errors of the regression coefficients for these independents. 4. Independence of irrelevant alternatives. In multinomial logistic regression, where the dependent represents choices or alternatives, it is assumed that adding or removing alternatives does not affect the odds associated with the remaining alternatives. When the IIA assumption is violated, alternative-specific multinomial probit regression is recommended. 5. Error terms are assumed to be independent (independent sampling). Violations of this assumption can have serious effects. Violations will occur, for instance, in correlated samples and repeated measures designs, such as before-after or matched-pairs studies, cluster sampling, or time-series data. That is, subjects cannot provide multiple observations at different time points. Conditional logit models in Cox regression and logistic models for matched pairs in multinomial logistic regression are available to adapt logistic models to handle non-independent data. 6. Low error in the explanatory variables. Ideally, logistic regression assumes low measurement error and no missing cases. See the discussion of measurement error in GLM models. 7. Linearity. Logistic regression does not require linear relationships between the independent factors or covariates and the dependent, as does OLS regression, but it does assume a linear relationship between the continuous independents and the log odds (logit) of the dependent.
When the assumption of linearity in the logit is violated, logistic regression will underestimate the degree of relationship of the independents to the dependent and will lack power (generating Type II errors, thinking there is no relationship when there actually is). The Box-Tidwell test of linearity in the logit is used primarily with interval-level covariates. The logit step test is used primarily with ordinal-level covariates. When a continuous or ordinal covariate is shown to lack linearity in the logit, the researcher may divide the covariate into categories and use it as a factor, thereby getting separate parameter estimates for various levels of the variable. For ordinal covariates, sometimes linearity in the logit can be achieved by combining categories. Box-Tidwell Transformation (Test): Add to the logistic model interaction terms which are the crossproduct of each independent times its natural logarithm [(X)ln(X)]. If these terms are
significant, then there is nonlinearity in the logit. This method is not sensitive to small nonlinearities. Orthogonal polynomial contrasts: This option treats a categorical independent as if its categories were equally spaced on an interval scale. The logit (effect) coefficients for each category of the categorical explanatory variable should not change over the contrasts. This method is not appropriate when the independent has a large number of values, inflating the standard errors of the contrasts. Select polynomial from the contrast list after clicking the Categorical button in the SPSS logistic regression dialog. Logit step tests. Another simple method of checking for linearity between an ordinal or interval independent variable and the logit of the dependent variable is to (1) create a new variable which divides the existing independent variable into categories of equal intervals, then (2) run a logistic regression with the same dependent but using the newly categorized version of the independent as a categorical variable with the default indicator coding. If there is linearity with the logit, the b coefficients for each class of the newly categorized explanatory variable should increase (or decrease) in roughly linear steps. The SPSS Visual Bander can be used to create a categorical variable based on an existing interval or ordinal one. It is invoked from the SPSS menu by selecting Transform, Visual Bander. In the Visual Bander dialog, enter the variable to categorize and click Continue. In the next dialog, select the variable just entered and also give a name for the new variable to be created; click the Make Cutpoints button and select Equal Intervals (the default), then enter the starting point (ex., 0) and the number of categories (ex., 5) and tab to the Width box, which will be filled in with a default value. Click Apply to return to the Visual Bander dialog, then click OK.
A new categorized variable of the name provided will be created at the end of the Data Editor spreadsheet. A logit graph can be constructed to visually display linearity in the logit, or lack thereof, for a banded (categorical) version of an underlying continuous variable. After a banded version of the variable is created and substituted into the logistic model, separate b coefficients will be output in the "Parameter Estimates" table. One can go to this table in SPSS output, double-click it, highlight the b coefficients for each band, right-click, and select Create Graph, Line. An ordinal covariate that is consistent with linearity in the logit will display a consistent pattern of stepped increase or decrease without reversals in direction in the line.
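For readers working outside SPSS, the Box-Tidwell term and the equal-interval banding behind the logit step test are easy to construct by hand. The sketch below is pure Python with hypothetical values:

```python
import math

def box_tidwell_term(x):
    """Crossproduct of a covariate with its natural log, x*ln(x); a significant
    coefficient for this added term signals nonlinearity in the logit."""
    return x * math.log(x)

def band(x, start, width):
    """Equal-interval banding of a covariate (what SPSS's Visual Bander
    produces with Equal Intervals) for use in a logit step test."""
    return int((x - start) // width)

print(round(box_tidwell_term(10.0), 3))  # 23.026
print(band(37, start=0, width=20))       # band index 1
```

The banded variable would then be entered as a categorical factor, and its b coefficients inspected for a roughly linear stepped pattern.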
8. Additivity. Like OLS regression, logistic regression does not account for interaction effects except when interaction terms (usually products of standardized independents) are created as additional variables in the analysis. This is done by using the categorical covariates option in SPSS's logistic procedure. 9. Absence of perfect separation. If groups of the dependent are perfectly separated by an independent variable or set of variables, implausibly large b coefficients and effect sizes may be computed for the independent variable(s). 10. Absence of perfect multicollinearity. If one variable is a perfect linear function of another in the model, standard errors become infinite and the solution to the model becomes indeterminate. In many computer software programs a fatal error message will be issued. SPSS will warn "The parameter covariance matrix cannot be computed. Remaining statistics will be omitted," and will output a 'Model Summary Table' which will show a zero -2 log likelihood statistic with a note similar to "Estimation terminated at iteration number 20 because a perfect fit is detected. This solution is not unique." 11. Absence of high multicollinearity: To the extent that one independent is a near but not perfect linear function of another independent, the problem of multicollinearity will occur in logistic regression, as it does in OLS regression. As the independents increase in correlation with each other, the standard errors of the logit (effect) coefficients will become inflated. Multicollinearity does not change the estimates of the coefficients, only their reliability. High standard errors flag possible multicollinearity. Multicollinearity and its handling is discussed more extensively in the StatNotes section on multiple regression. 12. Centered variables. As in OLS regression, centering may be necessary either to reduce multicollinearity or to make interpretation of coefficients meaningful.
Centering is almost always recommended for independent variables which are components of interaction terms in a logistic model. See the full discussion in the section on OLS regression. 13. No outliers. As in OLS regression, outliers can affect results significantly. The researcher should analyze standardized residuals for outliers and consider removing them or modeling them separately. Standardized residuals > 2.58 are outliers at the .01 level, which is the customary level (standardized residuals > 1.96 are outliers at the less-used .05 level). Standardized residuals are requested under the "Save" button in the binary logistic regression dialog box in SPSS. For multinomial logistic regression, checking "Cell Probabilities" under the "Statistics" button will generate actual, observed, and residual values. 14. Sample size. Unlike OLS regression, logistic regression uses maximum likelihood estimation (ML) rather than ordinary least squares (OLS) to derive parameters. ML relies on large-sample asymptotic normality, which means that the reliability of estimates declines when there are few cases for each observed combination of independent variables. That
is, in small samples one may get high standard errors. In the extreme, if there are too few cases in relation to the number of variables, it may be impossible to converge on a solution. Very high parameter estimates (logistic coefficients) may signal inadequate sample size. In general, the sample size should be larger than for the corresponding OLS regression. One rule of thumb is that the number of cases in the smaller of the two binary outcomes in binary logistic regression divided by the number of predictor variables should be at least 20 (Harrell, 2001), where "number of predictor variables" refers to all variables modeled (ex., in stepwise procedures), not just variables in the final model. Peduzzi et al. (1996) recommended that the smaller of the classes of the dependent variable have at least 10 events per parameter in the model. Hosmer & Lemeshow (1989) recommended a minimum of 10 cases per independent variable. Pedhazur (1997) recommended sample size be at least 30 times the number of parameters. 15. Sampling adequacy. Goodness of fit measures like model chi-square assume that for cells formed by the categorical independents, all cell frequencies are >=1 and no more than 20% of cells are < 5. Researchers should run crosstabs to assure this requirement is met. Sometimes one can compensate for small samples by combining categories of categorical independents or by deleting independents altogether. The presence of small or empty cells may cause the logistic model to become unstable, reporting implausibly large b coefficients and odds ratios for independent variables. 16. Expected dispersion. In logistic regression the expected variance of the dependent can be compared to the observed variance, and discrepancies may be considered under- or overdispersion. If there is moderate discrepancy, standard errors will be over-optimistic and one should use adjusted standard error. Adjusted standard error will make the confidence intervals wider.
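The sample-size rules of thumb and the adjusted standard error just described reduce to one-line computations; the sketch below is pure Python with hypothetical numbers:

```python
import math

def events_per_variable(n_smaller_class, n_predictors):
    """Rule-of-thumb sample-size check: cases in the smaller outcome class
    per predictor modeled (at least 10 per Peduzzi et al., 1996;
    at least 20 per Harrell, 2001)."""
    return n_smaller_class / n_predictors

def adjusted_se(se, deviance, df):
    """Adjusted standard error for overdispersion: SE * SQRT(D/df),
    where D is the scaled deviance (-2LL) and df the model degrees of freedom."""
    return se * math.sqrt(deviance / df)

# Hypothetical study: 120 events in the smaller class, 6 predictors modeled.
print(events_per_variable(120, 6))           # 20.0, meets both rules of thumb
print(round(adjusted_se(0.25, 180.0, 100), 4))
```

Since D/df exceeds 1 under overdispersion, the adjusted standard error is larger than the raw one, widening the confidence intervals as described.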
However, if there are large discrepancies, this indicates a need to respecify the model, or that the sample was not random, or other serious design problems. The expected variance is ybar*(1 - ybar), where ybar is the mean of the fitted (estimated) y. This can be compared with the actual variance in observed y to assess under- or overdispersion. Adjusted SE equals SE * SQRT(D/df), where D is the scaled deviance, which for logistic regression is -2LL (-2 Log Likelihood in SPSS logistic regression output), and df = the degrees of freedom. Commented SPSS Output for Logistic Regression
Frequently Asked Questions How should logistic regression results be reported? Considering the increasing use of logistic regression, there is surprisingly little consensus on exactly what should be reported in journal articles based on this procedure. Peng, Lee, & Ingersoll (2002) recommend reporting in table form the b parameter, its standard error, the Wald statistic, degrees of freedom, p significance level, and the odds ratio (exp(b)) for the constant and each predictor in the model. In addition, the reference category for the dependent should always be displayed in a table footnote. A table should report all overall model fit tests (likelihood ratio, score, Wald, and Hosmer & Lemeshow tests) with their associated chi-square, p significance levels, and degrees of freedom. These authors also recommend that in a footnote to the table, all pseudo-R-squared and monotone association measures be reported; in an additional table, the classification table should be reproduced, and in a footnote to it, the researcher should list its associated terms (hit rate, sensitivity, specificity, false positive rate, false negative rate). Beyond this, some authors also always report the confidence limits on the odds ratio. Why not just use regression with dichotomous dependents? Use of a dichotomous dependent in OLS regression violates the assumptions of normality and homoscedasticity, as a normal distribution is impossible with only two values. Also, when the values can only be 0 or 1, residuals (error) will be low for the portions of the regression line near Y=0 and Y=1, but high in the middle -- hence the error term will violate the assumption of homoscedasticity (equal variances) when a dichotomy is used as a dependent. Even with large samples, standard errors and significance tests will be in error because of lack of homoscedasticity. Also, for a dependent which assumes values of 0 and 1, the regression model will allow estimates below 0 and above 1.
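A tiny worked example (pure Python, hypothetical data) shows the out-of-range problem: an OLS line fit to a 0/1 dependent readily predicts values above 1 or below 0 for extreme values of the predictor:

```python
# Simple (one-predictor) OLS fit via the closed-form slope/intercept formulas.
def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]          # dichotomous dependent
a, b = ols_fit(xs, ys)
print(a + b * 10)                # prediction beyond the data exceeds 1
```

A logistic model, by contrast, keeps predicted probabilities inside (0, 1) by construction of the logit link.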
Also, multiple linear regression does not handle non-linear relationships, whereas log-linear methods do. These objections to the use of regression with dichotomous dependents apply to polytomous dependents also. When is OLS regression preferred over logistic regression? With a multinomial dependent, when all assumptions of OLS regression are met, OLS regression usually will have more power than logistic regression. That is, there will be fewer Type II errors (thinking there is no relationship when there actually is). OLS assumptions cannot be met with a binary dependent. Also, the maximum number of independents should
be substantially less for logistic as compared to OLS regression as the categorization of logistic dependents means less information content. With a binary dependent, it is impossible to meet the normality assumptions of OLS regression, but if the split is not extreme (not 90:10 or worse), OLS regression will not return dramatically different substantive results. When uncertain about meeting the assumptions of OLS regression, logistic regression is preferred. When is discriminant analysis preferred over logistic regression? With a binary dependent, when all assumptions of OLS regression are met, such as multivariate normality and equal variance-covariance matrices, discriminant analysis usually will have more power than logistic regression as well as yield a quicker solution. That is, there will be fewer Type II errors (thinking there is no relationship when there actually is). Also, the maximum number of independents should be substantially less for logistic as compared to discriminant analysis as the categorization of logistic dependents means less information content. What is the SPSS syntax for logistic regression? With SPSS, logistic regression is found under Analyze - Regression - Binary Logistic or Multinomial Logistic.

LOGISTIC REGRESSION /VARIABLES income WITH age SES gender opinion1 opinion2 region
  /CATEGORICAL=gender, opinion1, opinion2, region
  /CONTRAST(region)=INDICATOR(4)
  /METHOD FSTEP(LR)
  /CLASSPLOT

Above is the SPSS syntax in simplified form. The dependent variable is the variable immediately after the VARIABLES term. The independent variables are those immediately after the WITH term. The CATEGORICAL command specifies any categorical variables; note these must also be listed in the VARIABLES statement. The CONTRAST command tells SPSS which category of a categorical variable is to be dropped when it automatically constructs dummy variables (here it is the 4th value of "region"; this value is the fourth one and is not necessarily
coded "4"). The METHOD subcommand sets the method of computation, here specified as FSTEP to indicate forward stepwise logistic regression. Alternatives are BSTEP (backward stepwise logistic regression) and ENTER (enter terms as listed, usually because their order is set by theories which the researcher is testing). ENTER is the default method. The (LR) term following FSTEP specifies that likelihood ratio criteria are to be used in the stepwise addition of variables to the model. The /CLASSPLOT option specifies that a histogram of predicted probabilities is to be output (see above). The full syntax is below:

LOGISTIC REGRESSION VARIABLES = dependent var [WITH independent varlist [BY var [BY var]... ]]
  [/CATEGORICAL = var1, var2,... ]
  [/CONTRAST (categorical var) = [{INDICATOR [(refcat)] | DEVIATION [(refcat)] | SIMPLE [(refcat)] | DIFFERENCE | HELMERT | REPEATED | POLYNOMIAL[({1,2,3...} | metric)] | SPECIAL (matrix)}]]
  [/METHOD = {ENTER** | BSTEP [{COND | LR | WALD}] | FSTEP [{COND | LR | WALD}]} [{ALL | varlist}]]
  [/SELECT = {ALL** | varname relation value}]
  [/{NOORIGIN** | ORIGIN}]
  [/ID = [variable]]
  [/PRINT = [DEFAULT**] [SUMMARY] [CORR] [ALL] [ITER [({1** | n})]] [GOODFIT] [CI(level)]]
  [/CRITERIA = [BCON({0.001** | value})] [ITERATE({20** | n})] [LCON({0** | value})] [PIN({0.05** | value})] [POUT({0.10** | value})] [EPS({ ** | value})] [CUT({0.5** | value})]]
  [/CLASSPLOT]
  [/MISSING = {EXCLUDE** | INCLUDE}]
  [/CASEWISE = [tempvarlist] [OUTLIER({2 | value})]]
  [/SAVE = tempvar[(newname)] tempvar[(newname)]...]
  [/OUTFILE = [{MODEL | PARAMETER}(filename)]]
  [/EXTERNAL]

**Default if the subcommand or keyword is omitted.

The syntax for multinomial logistic regression is:

NOMREG dependent varname [(BASE = {FIRST | LAST** | value} ORDER = {ASCENDING** | DATA | DESCENDING})] [BY factor list] [WITH covariate list]
  [/CRITERIA = [CIN({95** | n})] [DELTA({0** | n})] [MXITER({100** | n})] [MXSTEP({5** | n})] [LCONVERGE({0** | n})] [PCONVERGE({1.0E-6** | n})] [SINGULAR({1E-8** | n})] [BIAS({0** | n})] [CHKSEP({20** | n})]]
  [/FULLFACTORIAL]
  [/INTERCEPT = {EXCLUDE | INCLUDE**}]
  [/MISSING = {EXCLUDE** | INCLUDE}]
  [/MODEL = {[effect effect...]} [{BACKWARD | FORWARD | BSTEP | FSTEP} = {effect effect...}]]
  [/STEPWISE = [RULE({SINGLE** | SFACTOR | CONTAINMENT | NONE})] [MINEFFECT({0** | value})] [MAXEFFECT(n)] [PIN({0.05** | value})] [POUT({0.10** | value})] [ENTRYMETHOD({LR** | SCORE})] [REMOVALMETHOD({LR** | WALD})]]
  [/OUTFILE = [{MODEL | PARAMETER}(filename)]]
  [/PRINT = [CELLPROB] [CLASSTABLE] [CORB] [HISTORY({1** | n})] [IC] [SUMMARY] [PARAMETER] [COVB] [FIT] [LRT] [KERNEL] [ASSOCIATION] [CPS**] [STEP**] [MFI**] [NONE]]
  [/SAVE = [ACPROB[(newname)]] [ESTPROB[(rootname[:{25** | n}])]] [PCPROB[(newname)]] [PREDCAT[(newname)]]]
  [/SCALE = {1** | n | DEVIANCE | PEARSON}]
  [/SUBPOP = varlist]
  [/TEST[(valuelist)] = {['label'] effect valuelist effect valuelist...; | ['label'] ALL list; | ['label'] ALL list}]

** Default if the subcommand is omitted.

Can I create interaction terms in my logistic model, as with OLS regression? Yes. As in OLS regression, interaction terms are constructed as crossproducts of the two interacting variables. Will SPSS's logistic regression procedure handle my categorical variables automatically? No. You must declare your categorical variables categorical if they have more than two values. This is done by clicking on the "Categorical" button in the Logistic Regression dialog box. After this, SPSS will automatically create dummy variables based on the categorical variable. Can I handle missing cases the same in logistic regression as in OLS regression? No. In the linear model assumed by OLS regression, one may choose to estimate missing values based on OLS regression of the variable with missing cases, based on non-missing data. However, the nonlinear model assumed by logistic regression requires a full set of data. Therefore SPSS provides only for LISTWISE deletion of cases with missing data, using the remaining full dataset to calculate logistic parameters. Explain the error message I am getting in SPSS about cells with zero frequencies. You may get an error message that looks like this: "Warning. There are 742 (65.8%) cells (i.e., dependent variable levels by subpopulations) with zero frequencies." If there are only factors in the model, the researcher should strive to minimize 0-count cells by recoding to combine categories. However, if there are continuous covariates in the logistic model, it is
virtually impossible for the cells formed by values of the covariates by values of other predictors to be well populated, and large numbers of such cells (subpopulations) will be zero. This is normal when covariates are in the model. Is it true for logistic regression, as it is for OLS regression, that the beta weight (standardized logit coefficient) for a given independent reflects its explanatory power controlling for other variables in the equation, and that the betas will change if variables are added or dropped from the equation? Yes, the same basic logic applies. This is why it is best in either form of regression to compare two or more models for their relative fit to the data rather than simply to show the data are not inconsistent with a single model. The model, of course, dictates which variables are entered, and one uses the ENTER method in SPSS, which is the default method. What is the coefficient in logistic regression which corresponds to R-Square in multiple regression? There is no exactly analogous coefficient. See the discussion of RL-squared, above. Cox and Snell's R-Square is an attempt to imitate the interpretation of multiple R-Square, and Nagelkerke's R-Square is a further modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1. Is there a logistic regression analogy to adjusted R-square in OLS regression? Yes. RLA-squared is adjusted RL-squared, and is similar to adjusted R-square in OLS regression. RLA-squared penalizes RL-squared for the number of independents on the assumption that R-square will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. RLA-squared = (GM - 2k)/D0, where k = the number of independents. Is multicollinearity a problem for logistic regression the way it is for multiple linear regression? Absolutely. The discussion in "Statnotes" under the "Regression" topic is relevant to logistic regression.
What is the logistic equivalent to the VIF test for multicollinearity in OLS regression? Can odds ratios be used? Multicollinearity is a problem when high in either logistic or OLS regression because in either case standard errors of the b coefficients will
be high and interpretations of the relative importance of the independent variables will be unreliable. In an OLS regression context, recall that VIF is the reciprocal of tolerance, which is 1 - R-squared. When there is high multicollinearity, R-squared will be high also, so tolerance will be low, and thus VIF will be high. When VIF is high, the b and beta weights are unreliable and subject to misinterpretation. For typical social science research, multicollinearity is considered not a problem if VIF <= 4, a level which corresponds to doubling the standard error of the b coefficient. As there is no direct counterpart to R-squared in logistic regression, VIF cannot be computed -- though obviously one could apply the same logic to various pseudo-R-squared measures. Unfortunately, I am not aware of a VIF-type test for logistic regression, and I would think that the same obstacles would exist as for creating a true equivalent to OLS R-squared. A high odds ratio would not be evidence of multicollinearity in itself. To the extent that one independent is linearly or nonlinearly related to another independent, multicollinearity could be a problem in logistic regression since, unlike OLS regression, logistic regression does not assume linearity of relationship among independents. Some authors use the VIF test in OLS regression to screen for multicollinearity in logistic regression if nonlinearity is ruled out. In an OLS regression context, nonlinearity exists when eta-square is significantly higher than R-square. In a logistic regression context, the Box-Tidwell transformation and orthogonal polynomial contrasts are ways of testing linearity among the independents. How can one use estimated variance of residuals to test for model misspecification? o The misspecification problem may be assessed by comparing expected variance of residuals with observed variance.
Since logistic regression assumes binomial errors, the estimated variance of y is m(1 - m), where m = the estimated mean of y (the fitted probability). "Overdispersion" is when the observed variance of the residuals is greater than the expected variance. Overdispersion indicates misspecification of the model, non-random sampling, or an unexpected distribution of the variables. If misspecification is involved, one must respecify the model. If that is not the case, then the computed standard error will be over-optimistic (confidence intervals will be too narrow). One suggested remedy is to use adjusted SE = SE*SQRT(s), where s = D/df, where D = dispersion and df = degrees of freedom in the model. How are interaction effects handled in logistic regression? The same as in OLS regression. One must add interaction terms to the model as crossproducts of the standardized independents and/or dummy independents. Some computer programs will allow the researcher to
specify the pairs of interacting variables and will do all the computation automatically. In SPSS, use the categorical covariates option: highlight two variables, then click on the button that shows >a*b> to put them in the Covariates box. The significance of an interaction effect is assessed the same as for any other variable, except in the case of a set of dummy variables representing a single ordinal variable. When an ordinal variable has been entered as a set of dummy variables, the interaction of another variable with the ordinal variable will involve multiple interaction terms. In this case the significance of the interaction of the two variables is the significance of the change in R-square between the equation with the interaction terms and the equation without the set of terms associated with the ordinal variable. (See the StatNotes section on "Regression" for computing the significance of the difference of two R-squares.)

Does stepwise logistic regression exist, as it does for OLS regression? Yes, but it is not supported by all computer packages. It is supported by SPSS, in both binary and multinomial logistic regression as described above. Stepwise regression is used in the exploratory phase of research or for purposes of pure prediction, not theory testing. In the theory-testing stage the researcher should base selection of the variables on theory, not on a computer algorithm. Menard (1995: 54) writes, "there appears to be general agreement that the use of computer-controlled stepwise procedures to select variables is inappropriate for theory testing because it capitalizes on random variations in the data and produces results that tend to be idiosyncratic and difficult to replicate in any sample other than the sample in which they were originally obtained." Those who use this procedure often focus on the step chi-square output in SPSS, which represents the change in the likelihood ratio test (model chi-square test) at each step.
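In logistic regression, the block test for a set of terms is carried out on the change in the model chi-square (-2 log likelihood) between nested models, the quantity the step chi-square output reports. A minimal sketch with hypothetical -2LL values (illustrative numbers, not real output):

```python
# The difference in -2 log likelihood between nested models is itself
# chi-square distributed, with df = number of terms added.
# The -2LL values below are hypothetical, for illustration only.
neg2ll_without = 412.7   # model without the set of interaction terms
neg2ll_with = 401.2      # model with three interaction terms added
terms_added = 3

block_chi_square = neg2ll_without - neg2ll_with
# The .05 critical value of chi-square for df = 3 is 7.815, so the
# set of interaction terms is jointly significant in this example.
print(round(block_chi_square, 1), block_chi_square > 7.815)  # prints: 11.5 True
```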
See the discussion of stepwise logistic regression above.

What are the stepwise options in multinomial logistic regression in SPSS? As described in the SPSS manual, the four options are forward entry, forward stepwise, backward elimination, and backward stepwise:

FORWARD ENTRY
1. Estimate the parameters and likelihood function for the initial model and let it be the current model.
2. Based on the ML estimates of the current model, calculate the score or LR statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance. If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FORWARD.
4. Update the current model by adding the new variable. If there are no more eligible variables left, stop FORWARD; otherwise, go to step 2.

FORWARD STEPWISE
1. Estimate the parameters and likelihood function for the initial model and let it be the current model.
2. Based on the ML estimates of the current model, calculate the score statistic or likelihood ratio statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance (p-value). If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FSTEP.
4. Update the current model by adding the new variable. If this results in a model which has already been evaluated, stop FSTEP.
5. Calculate the significance for each variable in the current model using the LR or Wald test.
6. Choose the variable with the largest significance. If its significance is less than the probability for variable removal, then go back to step 2. If the current model with the variable deleted is the same as a previous model, stop FSTEP; otherwise go to the next step.
7. Modify the current model by removing the variable with the largest significance from the previous model. Estimate the parameters for the modified model and go back to step 5.

BACKWARD ELIMINATION
1. Estimate the parameters for the full model that includes all eligible variables. Let the current model be the full model.
2. Based on the ML estimates of the current model, calculate the LR or Wald statistic for all variables eligible for removal and find its significance.
3. Choose the variable with the largest significance. If that significance is less than the probability for variable removal, then stop BACKWARD; otherwise, go to the next step.
4. Modify the current model by removing the variable with the largest significance from the model. Estimate the parameters for the modified model.
If all the variables in the BACKWARD list are removed, then stop BACKWARD; otherwise, go back to step 2.

BACKWARD STEPWISE
1. Estimate the parameters for the full model that includes the final model from the previous method and all eligible variables. Only variables listed on the BSTEP variable list are eligible for entry and removal. Let the current model be the full model.
2. Based on the ML estimates of the current model, calculate the LR or Wald statistic for every variable in the BSTEP list and find its significance.
3. Choose the variable with the largest significance. If that significance is less than the probability for variable removal, then go to step 5. If the current model without the variable with the largest significance is the same as the previous model, stop BSTEP; otherwise go to the next step.
4. Modify the current model by removing the variable with the largest significance from the model. Estimate the parameters for the modified model and go back to step 2.
5. Check to see whether any eligible variable is not in the model. If there is none, stop BSTEP; otherwise, go to the next step.
6. Based on the ML estimates of the current model, calculate the LR statistic or score statistic for every variable not in the model and find its significance.
7. Choose the variable with the smallest significance. If that significance is less than the probability for variable entry, then go to the next step; otherwise, stop BSTEP.
8. Add the variable with the smallest significance to the current model. If the model is not the same as any previous model, estimate the parameters for the new model and go back to step 2; otherwise, stop BSTEP.

What if I use the multinomial logistic option when my dependent is binary? Binary dependents can be fitted in both the binary and multinomial logistic regression options of SPSS, with different options and output. This can be done, but the multinomial procedure will aggregate the data, yielding different goodness-of-fit tests.
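The FORWARD ENTRY steps listed above amount to a simple selection loop. A skeleton of that control flow in Python, with the score/LR significance computation abstracted into a caller-supplied function and fed hypothetical p-values (this sketches the logic of the procedure, not SPSS's actual implementation):

```python
def forward_entry(candidates, p_value, p_enter=0.05):
    """Skeleton of the FORWARD ENTRY loop described above.

    candidates: variables eligible for inclusion.
    p_value(var, model): significance of adding `var` to `model`
        (a score or LR statistic in the real procedure; any callable here).
    Returns the entered variables, in order of entry."""
    model = []
    eligible = list(candidates)
    while eligible:
        # Steps 2-3: find the candidate with the smallest significance.
        best = min(eligible, key=lambda v: p_value(v, model))
        if p_value(best, model) >= p_enter:
            break                      # no candidate qualifies: stop FORWARD
        model.append(best)             # Step 4: add it, then loop to step 2
        eligible.remove(best)
    return model

# Hypothetical p-values standing in for the score-test computation:
fake_p = {"age": 0.001, "income": 0.03, "region": 0.40}
print(forward_entry(fake_p, lambda v, m: fake_p[v]))  # prints: ['age', 'income']
```

FSTEP, BACKWARD, and BSTEP differ only in whether the loop also tests entered variables for removal and in which direction it starts.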
The SPSS 14 online help manual notes: "An important theoretical distinction is that the Logistic Regression procedure produces all predictions, residuals, influence statistics, and goodness-of-fit tests using data at the individual case level, regardless of how the data are entered and whether or not the number of covariate patterns is smaller than the total number of cases, while the Multinomial Logistic Regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors, producing predictions, residuals, and goodness-of-fit tests based on these subpopulations. If all predictors are categorical or any continuous predictors take on only a limited number of values (so that there are several cases at each distinct covariate pattern) the subpopulation approach can produce valid goodness-of-fit tests and informative residuals, while the individual case level approach cannot."

What is nonparametric logistic regression and how is it more nonlinear? In general, nonparametric regression as discussed in the section on OLS regression can be extended to the case of GLM regression models like logistic regression. See Fox (2000: 58-73).
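The internal aggregation described in the SPSS passage quoted above can be illustrated directly: cases sharing a covariate pattern are pooled into events/trials counts per subpopulation. A small sketch with hypothetical data (the variables and values are invented for illustration):

```python
from collections import defaultdict

# Aggregating individual cases into covariate patterns, as the SPSS
# multinomial procedure does internally. Each case is a hypothetical
# (gender, treated, y) triple; (gender, treated) is the covariate
# pattern and y the binary outcome.
cases = [
    ("F", 1, 1), ("F", 1, 0), ("F", 1, 1), ("F", 0, 0),
    ("M", 1, 1), ("M", 0, 0), ("M", 0, 0), ("M", 0, 1),
]

patterns = defaultdict(lambda: [0, 0])   # pattern -> [events, trials]
for gender, treated, y in cases:
    cell = patterns[(gender, treated)]
    cell[0] += y      # count events (y = 1)
    cell[1] += 1      # count trials (all cases with this pattern)

for pattern, (events, trials) in sorted(patterns.items()):
    print(pattern, f"{events}/{trials} events, p-hat = {events / trials:.2f}")
```

Goodness-of-fit statistics such as the Pearson and deviance chi-squares then compare the observed events in each subpopulation with the counts the fitted model predicts, which is informative only when most patterns contain several cases, as the quoted passage notes.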
GLM nonparametric regression allows the logit of the dependent variable to be a nonlinear function of the parameter estimates of the independent variables. While GLM techniques like logistic regression are nonlinear in that they employ a transform (for logistic regression, the natural log of the odds of the dependent variable) which is nonlinear, in traditional form the result of that transform (the logit of the dependent variable) is a linear function of the terms on the right-hand side of the equation. GLM nonparametric regression relaxes the linearity assumption to allow nonlinear relations over and beyond those of the link function (logit) transformation. Generalized nonparametric regression is the GLM equivalent to OLS local regression (local polynomial nonparametric regression), which makes the dependent variable a single nonlinear function of the independent variables. The same problems noted for OLS local regression still exist, notably difficulty of interpretation as the number of independent variables increases. Generalized additive regression is the GLM equivalent to OLS additive regression, which allows the dependent variable to be the additive sum of nonlinear functions which are different for each of the independent variables. Fox (2000: 74-77) argues that generalized additive regression can reveal nonlinear relationships under certain circumstances where they are obscured using partial residual plots alone, notably when a strong nonlinear relationship among independents exists alongside a strong nonlinear relationship between an independent and a dependent.

How many independents can I have? There is no precise answer to this question, but the more independents, the greater the likelihood of multicollinearity. In general, there should be significantly fewer independents than in OLS regression, as logistic dependents, being categorical, have lower information content.
Also, if you have 20 independents, at the .05 level of significance you would expect one to be found significant just by chance. A rule of thumb is that there should be no more than 1 independent for each 10 cases in the sample. In applying this rule of thumb, keep in mind that if there are categorical independents, such as dichotomies, the number of cases should be considered to be the size of the lesser of the groups (ex., in a dichotomy with 480 0's and 20 1's, effective size would be 20), and by the 1:10 rule of thumb, the number of independents should be the smaller group size divided by 10 (in the example, 20/10 = 2 independents maximum).

How do I express the logistic regression equation if one or more of my independents is categorical? When a covariate is categorical, SPSS will print out "parameter codings," which are the internal-to-SPSS values which SPSS assigns to the levels of
each categorical variable. These parameter codings are the X values which are multiplied by the logit (effect) coefficients to obtain the predicted values.

How do I compare logit coefficients across groups formed by a categorical independent variable? There are two strategies. The first strategy is to separate the sample into subgroups, then perform otherwise identical logistic regressions for each. One then computes the p value for a Wald chi-square test of the significance of the differences between the corresponding coefficients. The formula for this test, for the case of two subgroup parameter estimates, is Wald chi-square = (b1 - b2)^2 / {[se(b1)]^2 + [se(b2)]^2}, where the b's are the logit coefficients for groups 1 and 2 and the se terms are their corresponding standard errors. This chi-square value is read from a table of the chi-square distribution with 1 degree of freedom. The second strategy is to create an indicator (dummy) variable or set of variables which reflects membership/non-membership in the group, and also to have interaction terms between the indicator dummies and the other independent variables, such that significant interactions are interpreted as indicating significant differences across groups for the corresponding independent variables. When an indicator variable has been entered as a set of dummy variables, its interaction with another variable will involve multiple interaction terms. In this case the significance of the interaction of the indicator variable and another independent variable is the significance of the change in R-square between the equation with the interaction terms and the equation without the set of terms associated with the indicator variable. (See the StatNotes section on "Regression" for computing the significance of the difference of two R-squares.) Allison (1999: 186) has shown that "Both methods may lead to invalid conclusions if residual variation differs across groups."
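The first strategy's Wald test translates directly into code. A minimal sketch using hypothetical subgroup estimates (the b and se values below are invented for illustration):

```python
def wald_group_diff(b1, se1, b2, se2):
    """Wald chi-square (1 df) for the difference between the same logit
    coefficient estimated in two subgroups, per the formula above."""
    return (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)

# Hypothetical subgroup output: b = 0.90 (se 0.20) vs. b = 0.30 (se 0.25).
chi_square = wald_group_diff(b1=0.90, se1=0.20, b2=0.30, se2=0.25)
# Compare to 3.841, the .05 critical value of chi-square with 1 df.
print(round(chi_square, 2), chi_square > 3.841)  # prints: 3.51 False
```

In this invented example the difference falls just short of significance at the .05 level.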
Unequal residual variation across groups will occur, for instance, whenever an unobserved variable (whose effect is incorporated in the disturbance term) has different impacts on the dependent variable depending on the group. Allison suggests that, as a rule of thumb, if "one group has coefficients that are consistently higher or lower than those in another group, it is a good indication of a potential problem..." (p. 199). Allison explicated a new test to adjust for unequal residual variation, presenting code for computation of this test in SAS, LIMDEP, BMDP, and STATA. The test is not implemented directly by SPSS or SAS, at least as of this writing. Note that Allison's test is conservative in that it will always yield a chi-square which is smaller than the conventional test, making it harder to prove the existence of cross-group differences.
How do I compute the confidence interval for the unstandardized logit (effect) coefficients? Where b is the unstandardized logit coefficient and se is its standard error, the 95% confidence interval for b runs from b - 1.96*se to b + 1.96*se. To express the limits as odds ratios, take e (the base of the natural logarithms) to the power of each limit: e^(b + 1.96*se) gives the upper limit and e^(b - 1.96*se) the lower.

What is the STATA approach to multinomial logistic regression? In STATA, multinomial regression is invoked with the mlogit command. Multinomial logit regression is identical to multinomial logistic regression in STATA. SAS's PROC CATMOD for multinomial logistic regression is not user friendly. Where can I get some help?
o SAS CATMOD
o University of Idaho
o York University, on the CATPLOT module

Bibliography

Agresti, Alan (1996). An introduction to categorical data analysis. NY: John Wiley. An excellent, accessible introduction.
Allison, Paul D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods and Research, 28(2).
Cox, D. R. and E. J. Snell (1989). Analysis of binary data (2nd edition). London: Chapman & Hall.
DeMaris, Alfred (1992). Logit modeling: Practical applications. Thousand Oaks, CA: Sage Publications. Series: Quantitative Applications in the Social Sciences.
Estrella, A. (1998). A new measure of fit for equations with dichotomous dependent variables. Journal of Business and Economic Statistics 16(2). Discusses proposed measures for an analogy to R-squared.
Fox, John (2000). Multiple and generalized nonparametric regression. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 131. Covers nonparametric regression models for GLM techniques like logistic regression. Nonparametric regression allows the logit of the dependent to be a nonlinear function of the logits of the independent variables.
Greenland, Sander; Schwartzbaum, Judith A.; & Finkle, William D. (2000). Problems due to small samples and sparse data in conditional logistic regression.
American Journal of Epidemiology 151.
Hair, J. F.; Anderson Jr., R.; Tatham, R. L. (1987). Multivariate data analysis with readings. 2nd edition. NY: Macmillan Publishing Company.
Hand, D.; Mannila, H.; & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.
Harrell, Frank (2001). Regression modeling strategies. NY: Springer.
Hosmer, David and Stanley Lemeshow (1989, 2000). Applied logistic regression. 2nd ed. NY: Wiley & Sons. A much-cited treatment utilized in SPSS routines.
Jaccard, James (2001). Interaction effects in logistic regression. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences Series.
Jennings, D. E. (1986). Outliers and residual distributions in logistic regression. Journal of the American Statistical Association (81).
Kleinbaum, D. G. (1994). Logistic regression: A self-learning text. New York: Springer-Verlag. What it says.
McCullagh, P. & Nelder, J. A. (1989). Generalized linear models, 2nd ed. London: Chapman & Hall. Recommended by the SPSS multinomial logistic tutorial.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka, ed., Frontiers in econometrics. NY: Academic Press.
McKelvey, Richard and William Zavoina (1994). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4. Discusses polytomous and ordinal logits.
Menard, Scott (2002). Applied logistic regression analysis, 2nd edition. Thousand Oaks, CA: Sage Publications. Series: Quantitative Applications in the Social Sciences. First edition 1995.
Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, Vol. 78, No. 3. Covers the two measures of R-square for logistic regression which are found in SPSS output.
O'Connell, Ann A. (2005). Logistic regression models for ordinal response variables. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences, Volume 146.
Pampel, Fred C. (2000). Logistic regression: A primer. Sage Quantitative Applications in the Social Sciences Series #132.
Thousand Oaks, CA: Sage Publications. Provides an example with commented SPSS output.
Peduzzi, P., J. Concato, E. Kemper, T. R. Holford, and A. Feinstein (1996). A simulation of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 99.
Pedhazur, E. J. (1997). Multiple regression in behavioral research, 3rd ed. Orlando, FL: Harcourt Brace.
Peng, Chao-Ying Joann; Lee, Kuk Lida; & Ingersoll, Gary M. (2002). An introduction to logistic regression analysis and reporting. Journal of Educational Research 96(1).
Press, S. J. and S. Wilson (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, Vol. 73. The authors make the case for the superiority of logistic regression for situations where the assumptions of multivariate normality are not met (ex., when dummy variables are used), though discriminant analysis is held to be better when
they are. They conclude that logistic and discriminant analyses will usually yield the same conclusions, except in the case when there are independents which result in predictions very close to 0 and 1 in logistic analysis. This can be revealed by examining a plot of observed groups and predicted probabilities in the SPSS logistic regression output.
Raftery, A. E. (1995). Bayesian model selection in social research. In P. V. Marsden, ed., Sociological Methodology 1995. London: Tavistock. Presents the BIC criterion for evaluating logits.
Rice, J. C. (1994). "Logistic regression: An introduction". In B. Thompson, ed., Advances in social science methodology, Vol. 3. Greenwich, CT: JAI Press. Popular introduction.
Tabachnick, B. G., and L. S. Fidell (1996). Using multivariate statistics, 3rd ed. New York: Harper Collins. Has a clear chapter on logistic regression.
Wright, R. E. (1995). "Logistic regression". In L. G. Grimm & P. R. Yarnold, eds., Reading and understanding multivariate statistics. Washington, DC: American Psychological Association. A widely used recent treatment.

Copyright 1998, 2008, 2009, 2010, 2011 by G. David Garson. Do not place on any other server, even for educational use! Last update 4/27/
Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
1 Theory: The General Linear Model
QMIN GLM Theory - 1.1 1 Theory: The General Linear Model 1.1 Introduction Before digital computers, statistics textbooks spoke of three procedures regression, the analysis of variance (ANOVA), and the
Two Correlated Proportions (McNemar Test)
Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with
From the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node
Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous
How to Get More Value from Your Survey Data
Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2
SPSS Guide: Regression Analysis
SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
IBM SPSS Missing Values 22
IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,
Calculating the Probability of Returning a Loan with Binary Probability Models
Calculating the Probability of Returning a Loan with Binary Probability Models Associate Professor PhD Julian VASILEV (e-mail: [email protected]) Varna University of Economics, Bulgaria ABSTRACT The
Modeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
Data Analysis Tools. Tools for Summarizing Data
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
There are six different windows that can be opened when using SPSS. The following will give a description of each of them.
SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
Logit Models for Binary Data
Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response
Regression 3: Logistic Regression
Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing
CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS
Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships
IBM SPSS Direct Marketing 19
IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS
Moderation. Moderation
Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation
Εισαγωγή στην πολυεπίπεδη μοντελοποίηση δεδομένων με το HLM. Βασίλης Παυλόπουλος Τμήμα Ψυχολογίας, Πανεπιστήμιο Αθηνών
Εισαγωγή στην πολυεπίπεδη μοντελοποίηση δεδομένων με το HLM Βασίλης Παυλόπουλος Τμήμα Ψυχολογίας, Πανεπιστήμιο Αθηνών Το υλικό αυτό προέρχεται από workshop που οργανώθηκε σε θερινό σχολείο της Ευρωπαϊκής
IBM SPSS Data Preparation 22
IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release
MORE ON LOGISTIC REGRESSION
DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON LOGISTIC REGRESSION I. AGENDA: A. Logistic regression 1. Multiple independent variables 2. Example: The Bell Curve 3. Evaluation
Binary Diagnostic Tests Two Independent Samples
Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary
Basic Statistical and Modeling Procedures Using SAS
Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom
Data analysis process
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
Multiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
Chapter 29 The GENMOD Procedure. Chapter Table of Contents
Chapter 29 The GENMOD Procedure Chapter Table of Contents OVERVIEW...1365 WhatisaGeneralizedLinearModel?...1366 ExamplesofGeneralizedLinearModels...1367 TheGENMODProcedure...1368 GETTING STARTED...1370
Main Effects and Interactions
Main Effects & Interactions page 1 Main Effects and Interactions So far, we ve talked about studies in which there is just one independent variable, such as violence of television program. You might randomly
How To Check For Differences In The One Way Anova
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
Descriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
Logistic (RLOGIST) Example #1
Logistic (RLOGIST) Example #1 SUDAAN Statements and Results Illustrated EFFECTS RFORMAT, RLABEL REFLEVEL EXP option on MODEL statement Hosmer-Lemeshow Test Input Data Set(s): BRFWGT.SAS7bdat Example Using
Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest
Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t
Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents
Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
Chapter 23. Inferences for Regression
Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily
Instructions for SPSS 21
1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3
Weight of Evidence Module
Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Introduction to Quantitative Methods
Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................
2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group
Introduction to Multilevel Modeling Using HLM 6 By ATS Statistical Consulting Group Multilevel data structure Students nested within schools Children nested within families Respondents nested within interviewers
The Basic Two-Level Regression Model
2 The Basic Two-Level Regression Model The multilevel regression model has become known in the research literature under a variety of names, such as random coefficient model (de Leeuw & Kreft, 1986; Longford,
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
