Logistic Regression.


Overview

Binary (or binomial) logistic regression is a form of regression used when the dependent is a dichotomy and the independents are of any type. Multinomial logistic regression handles the case of dependents with more than two classes, though it is sometimes used for binary dependents as well since it generates somewhat different output, described below. When the multiple classes of a multinomial dependent variable can be ranked, ordinal logistic regression is preferred to multinomial logistic regression. Continuous variables are not used as dependents in logistic regression. Unlike logit regression, there can be only one dependent variable.

More recently, generalized linear modeling (GZLM) has appeared as a module in SPSS, SAS, and other packages. GZLM allows the researcher to create regression models with any distribution of the dependent (ex., binomial, multinomial, ordinal) and any link function (ex., log for loglinear analysis, logit for binary or multinomial logistic analysis, cumulative logit for ordinal logistic analysis). Similarly, generalized linear mixed modeling (GLMM) is now available to handle multilevel logistic modeling.

Logistic regression can be used to predict a categorical dependent variable on the basis of continuous and/or categorical independents; to determine the effect size of the independent variables on the dependent; to rank the relative importance of independents; to assess interaction effects; and to understand the impact of covariate control variables. The impact of predictor variables is usually explained in terms of odds ratios.

Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable. A logit is the natural log of the odds of the dependent equaling a certain value or not (usually 1 in binary logistic models, the highest value in multinomial models). In this way, logistic regression estimates the odds of a certain event (value) occurring. Note that logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does.

Logistic regression has many analogies to OLS regression: logit coefficients correspond to b coefficients in the logistic regression equation, standardized logit coefficients correspond to beta weights, and a pseudo R-squared statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the raw values of the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, require that observations be independent and that the independent variables be linearly related to the logit of the dependent.

The predictive success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent. Goodness-of-fit tests such as the likelihood ratio test are available as indicators of model appropriateness, as is the Wald statistic to test the significance of individual independent variables.

In SPSS, binary logistic regression, sometimes called binomial logistic regression, is under Analyze - Regression - Binary Logistic, and the multinomial version is under Analyze - Regression - Multinomial Logistic. Logit regression, discussed separately, is a related option in SPSS for using loglinear methods to analyze one or more dependents. Where both are applicable, logit regression gives numerically equivalent results to logistic regression, but with different output options. For the same class of problems, logistic regression has become more popular among social scientists.

Contents

Key concepts and terms
Significance tests
Stepwise logistic regression
Interpreting parameter estimates
Information theory measures
Effect size measures
Contrast analysis
Analysis of residuals
Conditional logistic regression for matched pairs data
Assumptions
SPSS output
Frequently asked questions
Bibliography

Key Terms and Concepts

o The logistic model. The logistic curve, illustrated below, is better for modeling binary dependent variables coded 0 or 1 because it comes closer to hugging the y=0 and y=1 points on the y axis. Moreover, the logistic function is bounded by 0 and 1, whereas the OLS regression function may predict values above 1 and below 0. Logistic analysis can be extended to multinomial dependents by modeling a series of binary comparisons: the lowest value of the dependent compared to a reference category (by default the highest category), the next-lowest value compared to the reference category, and so on, creating k - 1 binary model equations for the k values of the multinomial dependent variable.

o The logistic equation. Logistic regression predicts the log odds of the dependent event (odds and odds ratios are explained further below). The "event" is a particular value of y, the dependent variable. By default the event is y = 1 for binary dependents coded 0,1, and the reference category is 0. For multinomial dependents the event is y equal to the value of interest, and the reference category is the highest value of y. (Note this means that for binary dependents the reference category is 0 in binomial logistic regression but the highest value in multinomial logistic regression!)

The natural log of the odds of an event equals the natural log of the probability of the event occurring divided by the probability of the event not occurring:

ln(odds(event)) = ln(prob(event)/prob(nonevent))

The logistic regression equation itself is:

z = b0 + b1*X1 + b2*X2 + ... + bk*Xk

where z is the log odds of the dependent variable = ln(odds(event)), b0 is the constant, and there are k independent (X) variables, some of which may be interaction terms.

The "z" is the logit, also called the log odds. The "b" terms are the logistic regression coefficients, also called parameter estimates.

Exp(b) = the odds ratio for an independent variable = the natural log base e raised to the power of b. The odds ratio is the factor by which a one-unit increase in the independent multiplies the odds of the dependent event; if b is negative, Exp(b) is less than 1 and the odds decrease (see the discussion of interpreting b parameters below).

Exp(z) = the odds for the dependent variable, being the odds that the dependent equals the level of interest rather than the reference level. In binary logistic regression, this is usually the odds the dependent = 1 rather than 0. In multinomial logistic regression, this is usually the odds the dependent = a value of interest rather than the highest level.

Thus for a one-independent model, z equals the constant plus the b coefficient times the value of X1, when predicting odds(event) for persons with a particular value of X1, by default the value "1" for the binary case. If X1 is a binary (0,1) variable, then z = b0 (that is, the constant) for the "0" group on X1 and equals the constant plus the b coefficient for the "1" group. To convert the log odds (which is z, which is the logit) back into odds, the natural logarithmic base e is raised to the zth power: odds(event) = exp(z) = the odds the binary dependent is 1 rather than 0. If X1 is a continuous variable, then z equals the constant plus the b coefficient times the value of X1. For models with additional independent variables, z is the constant plus the sum of the products of the b coefficients times the values of the X (independent) variables. Exp(z) is thus the odds of the dependent event, the estimate of odds(event), while z itself is the log odds.

o Dependent variable. Binomial and multinomial logistic regression support only a single dependent variable. For binary logistic regression, this response variable can have only two categories. For multinomial logistic regression, there may be two or more categories, usually more, but the dependent is never a continuous variable.
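For illustration, here is a minimal sketch in Python of the arithmetic just described: converting a logit (z) into odds and a probability, and a b coefficient into an odds ratio. The coefficient values are hypothetical, chosen only to make the computation concrete.

    import numpy as np

    b0, b1 = -1.0, 0.8        # hypothetical constant and logit coefficient
    x1 = 1                    # value of the single independent variable

    z = b0 + b1 * x1          # the logit: log odds of the event
    odds = np.exp(z)          # odds(event) = exp(z)
    prob = odds / (1 + odds)  # probability of the event
    odds_ratio = np.exp(b1)   # factor by which the odds multiply per unit of X1

    print(z, odds, prob, odds_ratio)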

It is important to be careful to specify the desired reference category of the dependent variable, which should be meaningful. For example, the highest category (4 = None) in the example above would be the default reference category, but "None" could be chosen for a great variety of reasons and has no specific meaning as a reference. The researcher may wish to select another, more specific option as the reference category.

The dependent reference default in binary vs. multinomial logistic regression in SPSS: Binary logistic regression predicts the "1" value of the dependent, using the "0" level as the reference value. Multinomial logistic regression will compare each level of the dependent with the reference category, for each independent variable. For instance, given the multinomial dependent "Degree of interest in joining" with levels 0 = low interest, 1 = medium interest, and 2 = high interest, and given the independents region (a factor) and income (a covariate), 2 = high interest will be the reference category by default. For region and separately for income, multinomial logistic output will show (1) the comparison of low interest with high interest, and (2) the comparison of medium interest with high interest.

BINARY LOGISTIC REGRESSION: a binary variable is entered as the dependent. The highest value is predicted and the lowest is the reference. The reference level cannot be changed.

MULTINOMIAL LOGISTIC REGRESSION: a binary or multinomial variable is entered as the dependent. The highest value is the reference and all other levels are compared to it by default. Click the "Reference Category" button to override the default.

Setting the reference category for the dependent.

1. In binary logistic regression, the reference category (the lower, usually '0' category) cannot be overridden (though, of course, the researcher could flip the values by recoding).

2. In multinomial logistic regression, as illustrated below, in SPSS from the main multinomial logistic regression dialog, enter a categorical dependent variable, then to override the default selection of the highest-coded category as the reference category, click the "Reference" button and select "First Category" or "Custom" to set the reference code manually. In SPSS syntax, set the reference category simply by entering a base parameter after the dependent in the variable list (ex., NOMREG depvar (base=2) WITH indvar1 indvar2 indvar3 / PRINT = PARAMETER SUMMARY.). When setting custom or base values, a numeral indicates the order number (ex., 3 = the third value will be the reference, assuming ascending order); for actual numeric values, dates, or strings, put quotation marks around the value. (A sketch of the equivalent setup outside SPSS follows below.)
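Outside SPSS, the same choices arise. As a hedged illustration, the following Python sketch (statsmodels with patsy formulas, on a made-up data frame) mirrors the NOMREG setup above: region is a factor with an explicitly chosen reference level, income is a covariate. Note that statsmodels' MNLogit takes the lowest code of the dependent as its reference outcome, unlike SPSS's default of the highest, so recoding may be needed to reproduce SPSS output exactly.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "interest": rng.integers(0, 3, 300),                   # 0=low, 1=medium, 2=high
        "region": rng.choice(["east", "west", "north"], 300),  # factor
        "income": rng.normal(50, 10, 300),                     # covariate
    })

    # Treatment(reference=...) picks the factor's reference level, analogous
    # to choosing the omitted dummy category in SPSS
    model = smf.mnlogit("interest ~ C(region, Treatment(reference='north')) + income",
                        data=df)
    print(model.fit(disp=0).summary())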

o Group membership. Particularly in multinomial logistic regression, it is common to label the levels of the dependent variable as "groups". A logistic prediction (regression) formula can be created from logistic coefficients to predict the probability of membership in any group. Any given observation will be assigned to the group with the highest probability. For binary logistic regression, this is the group with a probability higher than .50.

o Factors, sometimes called "design variables," are nominal or ordinal categorical predictors, whose levels are entered as dummy variables. In SPSS binary logistic regression, but not multinomial logistic regression, categorical independent variables must be declared by clicking on the "Categorical" button in the Logistic Regression dialog box.

Setting the reference category (contrast) for a factor (categorical predictor). Dichotomous and categorical variables may be entered as factors but are handled differently in binary vs. multinomial regression. In binary logistic regression, after declaring a variable to be categorical, the user is given the opportunity to declare the first or last value as the reference category. Independent variable reference categories cannot be changed in multinomial logistic regression in SPSS. If the researcher prefers another category to be the reference category in multinomial logistic regression, the researcher must either (1) recode the variables beforehand to make the desired reference category the last one, or (2) create dummy variables manually prior to running the analysis. For more on the selection of dummy variables, see the separate discussion. Multinomial logistic regression in SPSS handles reference categories differently for factors and covariates, as illustrated below.

Binary logistic regression provides the usual choice of contrast types for factors. In SPSS binary logistic regression, all variables are first entered as covariates, then the "Categorical" button is pressed to switch any of those entered to be categorical factors. Under the Categorical button dialog the user may select the type of contrast. Warning: one must click the Change button after selecting the contrast from the drop-down menu. (A sketch of these contrast codings outside SPSS appears below.)

1. Indicator: This is the default. The reference value is the lowest (0 for binary factors) category. However, this may be reversed by checking the Descending radio button (Ascending is the default), which makes the lowest value be considered the last.
2. Simple: The last category is the reference. This may be reversed by checking the Descending radio button.
3. Difference: The reference is the mean of previous categories (except for the first category).
4. Helmert: The reference is the mean of subsequent categories (except for the last category).
5. Repeated: The reference is the previous category (except for the first category).
6. Deviation: The reference is the grand mean.
7. Polynomial: Used for equally spaced numeric categories to check linearity. For factor "X", output will list X(1) for the linear effect of X, X(2) for the quadratic effect, etc.

Reference levels in SPSS binomial and multinomial regression. What the reference category is depends on whether one is using binomial or multinomial regression, whether the variable is the dependent or an independent, and whether the variable is a factor or a covariate. For factors, a coefficient will be printed for predicted categories but not for the reference category. For covariates, a single coefficient is printed and is interpreted as in regression, where increasing the independent from 0 to 1 corresponds to a b increase in the dependent (except in logistic regression, the increase is in the log odds of the dependent). The table below summarizes the options.
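The contrast types above are not SPSS-specific. As an illustrative sketch, patsy (the formula engine used by statsmodels in Python) exposes several analogous codings; printing their contrast matrices for a four-level factor shows exactly which comparisons each sets up. Naming conventions differ across packages (for example, patsy's Helmert runs in the opposite direction from SPSS's), so inspect the matrices rather than trusting the labels.

    from patsy.contrasts import Treatment, Sum, Helmert, Diff, Poly

    levels = [1, 2, 3, 4]   # a hypothetical four-level factor

    print(Treatment(reference=0).code_without_intercept(levels).matrix)  # indicator
    print(Sum().code_without_intercept(levels).matrix)      # deviation (grand mean)
    print(Helmert().code_without_intercept(levels).matrix)  # Helmert-type comparisons
    print(Diff().code_without_intercept(levels).matrix)     # adjacent ("repeated") comparisons
    print(Poly().code_without_intercept(levels).matrix)     # polynomial (linear, quadratic, cubic)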

o Covariates are interval independents in most programs. However, in the SPSS binary logistic regression dialog all independents are entered as covariates, then one clicks the Categorical button to declare any of those entered as categorical. In SPSS multinomial regression, factors and covariates are entered in separate entry boxes.

o Interaction terms. As in OLS regression, one can add interaction terms to the model. As illustrated in the figure below, in multinomial logistic regression, interaction terms are entered under the Model button, where interactions of factors and covariates are constructed automatically. For binary logistic regression, interaction terms such as age*income must be created manually. For categorical variables, one also multiplies, but the products are of category codes as shown in the Categorical Variables Codings table of SPSS output (ex., race(1)*religion(2)). The codes will be 1's and 0's, so most of the products in the new variables will be 0's unless you recode some products (ex., setting 0*0 to 1). Warning: when interaction terms are added to the model, centering of the constituent independent variables is strongly recommended to reduce multicollinearity and aid interpretation of coefficients. See the full discussion in the section on OLS regression.
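The following Python sketch, on made-up data, shows the manual steps described above for a binary model: center each constituent variable, form the product term, and enter all three in the model. Variable names are hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"age": rng.normal(45, 12, 400),
                       "income": rng.normal(55, 15, 400)})
    df["gun"] = (rng.random(400) < 0.4).astype(int)   # hypothetical binary dependent

    # center the constituents before forming the interaction, to reduce
    # multicollinearity and make lower-order coefficients interpretable
    df["age_c"] = df["age"] - df["age"].mean()
    df["income_c"] = df["income"] - df["income"].mean()
    df["age_x_income"] = df["age_c"] * df["income_c"]

    result = smf.logit("gun ~ age_c + income_c + age_x_income", data=df).fit(disp=0)
    print(result.summary())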

o Maximum likelihood estimation, ML, is the method used to calculate the logit coefficients. This contrasts with the use of ordinary least squares (OLS) estimation of coefficients in regression. OLS seeks to minimize the sum of squared distances of the data points to the regression line. ML methods seek to maximize the log likelihood, LL, which reflects how likely it is (the odds) that the observed values of the dependent may be predicted from the observed values of the independents. Put another way, OLS can be seen as a subtype of ML for the special case of a linear model characterized by normally distributed disturbances around the regression line, where the regression coefficients computed maximize the likelihood of obtaining the least sum of squared disturbances. When error is not normally distributed or when the dependent variable is not normally distributed, ML estimates are preferred to OLS estimates because they are unbiased beyond the special case handled by OLS. (A sketch of ML estimation for the logistic model follows below.)

o Ordinal and multinomial logistic regression are extensions of binary logistic regression that allow the simultaneous comparison of more than one contrast. That is, the log odds of three or more contrasts are estimated simultaneously (ex., the probability of A vs. B, A vs. C, B vs. C, etc.).

o SPSS and SAS. In SPSS, select Analyze, Regression, Binary (or Multinomial), Logistic; select the dependent and the covariates; use the Categorical button to declare categorical variables; use the Options button to select desired output; Continue; OK. SAS's PROC CATMOD computes both simple and multinomial logistic regression, whereas PROC LOGIST is for simple (dichotomous) logistic regression. CATMOD uses a conventional model command: ex., model wsat*supsat*qman=_response_ /nogls ml;. Note that in the model command, nogls suppresses generalized least squares estimation and ml specifies maximum likelihood estimation.
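To make the ML idea concrete, here is a minimal sketch (simulated data, no claim to match any package's internals) that estimates a binary logistic model by directly maximizing the log likelihood with a general-purpose optimizer, which is essentially what the iterative routines in SPSS or SAS do more efficiently.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(b, X, y):
        """Negative LL of a binary logistic model; minimizing it is ML estimation."""
        z = X @ b                                # linear predictor (the logit)
        # LL = sum of y*z - log(1 + exp(z)); logaddexp(0, z) = log(1 + exp(z))
        return -np.sum(y * z - np.logaddexp(0, z))

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.normal(size=200)])   # constant + one X
    true_p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 1])))
    y = (rng.random(200) < true_p).astype(float)

    fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y), method="BFGS")
    print(fit.x)       # ML estimates of b0, b1
    print(-fit.fun)    # the maximized log likelihood, LL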

Basic SPSS Output

[Figure: SPSS binary logistic regression output tables for the gun-ownership example.]

Thus in the output above, we find that being white compared to other races reduces the odds of not owning a gun in the home the most. All three independent variables in the model (Sex, Race, Degree) are significant, but not all categories of each independent variable are significant. Being African-American, for instance, does not have a significant impact on the odds of not owning a gun in the home. No category of any of the independent variables modeled significantly increases the odds of not having a gun in the home compared to the respective reference categories when other variables in the model are controlled; that is, there are no significant odds ratios above 1.0. (The chart output is from Excel, not SPSS.)

More about SPSS output. Most output options in SPSS binary logistic regression are found under the "Options" button, but for multinomial logistic regression under the "Statistics" button, as illustrated below.

Significance Tests

o Significance tests for binary logistic regression

Hosmer and Lemeshow chi-square test of goodness of fit. The recommended test for overall fit of a binary logistic regression model is the Hosmer and Lemeshow test, also called the chi-square test. It is not available in multinomial logistic regression. This test is considered more robust than the traditional chi-square test, particularly if continuous covariates are in the model or sample size is small. A finding of non-significance, as in the illustration below, corresponds to the researcher concluding the model adequately fits the data. This test is preferred over classification tables when assessing model fit. This test appears in SPSS output in the "Hosmer and Lemeshow" table. It is requested by checking "Hosmer-Lemeshow goodness of fit" under the Options button in SPSS.

How it works. Hosmer and Lemeshow's goodness of fit test divides subjects into deciles based on predicted probabilities, as illustrated above, then computes a chi-square from observed and expected frequencies. A probability (p) value is then computed from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model. If the p value for the H-L goodness-of-fit test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show nonsignificance on the H-L goodness-of-fit test, indicating model prediction is not significantly different from observed values. This does not mean that the model necessarily explains much of the variance in the dependent, only that however much or little it does explain is significant.

As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant. On the other hand, the H-L statistic assumes sampling adequacy, with a rule of thumb being enough cases so that no group has an expected value < 1 and 95% of cells (typically, 10 decile groups times 2 outcome categories = 20 cells) have an expected frequency > 5. Collapsing groups may not solve a sampling adequacy problem since when the number of groups is small, the H-L test will be biased toward nonsignificance (it will overestimate model fit). This test is not to be confused with a similarly named, obsolete goodness of fit index discussed below.
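Because the H-L test is easy to misread from a formula, a small illustrative implementation may help; this sketch (plain Python, hypothetical function name) groups cases into deciles of risk by predicted probability and computes the chi-square exactly as described above.

    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y, p, groups=10):
        """H-L statistic from 0/1 outcomes y and predicted probabilities p."""
        order = np.argsort(p)
        y, p = np.asarray(y)[order], np.asarray(p)[order]
        stat = 0.0
        for g in np.array_split(np.arange(len(p)), groups):   # deciles of risk
            obs, exp, n = y[g].sum(), p[g].sum(), len(g)
            # (O-E)^2/E for events plus (O-E)^2/E for nonevents, combined:
            stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
        df = groups - 2                   # 8 df for the usual 10 groups
        return stat, df, chi2.sf(stat, df)

    # a p value above .05 (nonsignificance) indicates acceptable fit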

Omnibus tests of model coefficients. This table in SPSS binary logistic regression reports significance levels by the traditional chi-square method and is an alternative to the Hosmer-Lemeshow test above. It tests whether the model with the predictors is significantly different from the model with only the intercept. The omnibus test may be interpreted as a test of the capability of all predictors in the model jointly to predict the response (dependent) variable. A finding of significance, as in the illustration below, corresponds to a research conclusion that there is adequate fit of the data to the model, meaning that at least one of the predictors is significantly related to the response variable. In the illustration below, the Enter method is employed (all model terms are entered in one step), so there is no difference for step, block, or model, but in a stepwise procedure one would see results for each step.

Fit tests in stepwise or block-entry logistic regression. When predictors are entered in steps or blocks, the Hosmer-Lemeshow and omnibus tests may also be interpreted as tests of the change in model significance when adding a variable (stepwise) or block of variables (block) to the model. This is illustrated below for a model predicting having a gun in the home from marital status, race, and attitude toward capital punishment. In stepwise forward logistic regression, Marital is added first, then Race, then Cappun; Age is in the covariate list but is not added as the stepwise procedure ends at Step 3.
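For reference, the same omnibus (model chi-square) quantity can be computed outside SPSS. A hedged sketch with simulated data: statsmodels reports the -2LL difference between the intercept-only and fitted models directly.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    X = sm.add_constant(rng.normal(size=(300, 2)))
    y = (rng.random(300) < 1 / (1 + np.exp(-X @ [0.2, 0.9, -0.6]))).astype(int)

    result = sm.Logit(y, X).fit(disp=0)
    model_chi2 = 2 * (result.llf - result.llnull)   # -2LL(null) minus -2LL(model)
    print(model_chi2, result.df_model)              # chi-square and its df
    print(result.llr, result.llr_pvalue)            # same statistic, with p value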

o Significance tests for multinomial logistic regression

Binary vs. multinomial logistic regression. In SPSS, selecting Analyze, Regression, Binary logistic invokes the LOGISTIC REGRESSION procedure. Selecting Analyze, Regression, Multinomial logistic invokes the NOMREG procedure. Although NOMREG is designed for the case where the dependent has more than two categories, a binary dependent may be entered. The NOMREG procedure provides a different set of output options, including different significance tests of the model. The figures below are for the NOMREG version of the same example, where the binary variable Gunown (having a gun in the home) is predicted from the categorical variables Marital (four categories of marital status), Race (three categories), the binary variable Cappun (attitude toward the death penalty), and the continuous covariate Age.

Pearson and deviance goodness of fit. The "Goodness of Fit" table in multinomial logistic regression in SPSS gives two similar overall model fit tests. Like the Hosmer-Lemeshow goodness-of-fit test in binary logistic regression, adequate fit corresponds to a finding of nonsignificance for these tests, as in the illustration below.

Both are chi-square methods, but the Pearson statistic is based on traditional chi-square and the deviance statistic is based on likelihood ratio chi-square. The deviance test is preferred over the Pearson (Menard, 2002: 47). Either test is preferred over classification tables when assessing model fit. Both tests usually yield the same substantive results. For more on the computation of Pearson and likelihood ratio chi-square, see the separate discussion. A well-fitting model is non-significant on these tests.

The likelihood ratio is a function of log likelihood and is used in significance testing in NOMREG. A "likelihood" is a probability, specifically the probability that the observed values of the dependent may be predicted from the observed values of the independents. Like any probability, the likelihood varies from 0 to 1. The log likelihood (LL) is its log and varies from 0 to minus infinity (it is negative because the log of any number less than 1 is negative). LL is calculated through iteration, using maximum likelihood estimation (ML). Log likelihood is the basis for tests of a logistic model.

Because -2LL has approximately a chi-square distribution, -2LL can be used for assessing the significance of logistic regression, analogous to the use of the sum of squared errors in OLS regression. The -2LL statistic is the likelihood ratio. It is also called goodness of fit, deviance chi-square, scaled deviance, deviation chi-square, DM, or L-square. It reflects the significance of the unexplained variance in the dependent. In SPSS output, this statistic is found in the "-2 Log Likelihood" column of the "Model Fitting Information" table, in the Final row. The likelihood ratio is not used directly in significance testing, but it is the basis for the likelihood ratio test, which is the test of the difference between two likelihood ratios (two -2LL's), as discussed below. In general, as the model becomes better, -2LL will decrease in magnitude.
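As an illustration with simulated data, the quantities in the "Model Fitting Information" table (the -2LL of the intercept-only and final models, and the resulting model chi-square) can be reproduced for a multinomial model as follows; the data here are random noise, so the test should come out nonsignificant.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    X = sm.add_constant(rng.normal(size=(500, 2)))
    y = rng.integers(0, 3, 500)                 # three-category dependent

    result = sm.MNLogit(y, X).fit(disp=0)
    print(-2 * result.llf)                      # -2LL, "Final" row
    print(-2 * result.llnull)                   # -2LL, "Intercept Only" row
    print(result.llr, result.llr_pvalue)        # model chi-square and its p value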

The likelihood ratio test, also called the log-likelihood test, is based on -2LL (deviance); a well-fitting model is significant on this test. The likelihood ratio test is a test of the significance of the difference between the likelihood ratio (-2LL) for the researcher's model and the likelihood ratio for a reduced model. This difference is called "model chi-square." The likelihood ratio test is generally preferred over its alternative, the Wald test, discussed below. There are four main forms of the likelihood ratio test:

1. Models. Two models are referenced in the "Model Fitting Information" table above: (1) the "Intercept Only" model, also called the null model, which reflects the net effect of all variables not in the model plus error; and (2) the "Final" model, also called the fitted model, which is the researcher's model comprised of the predictor variables; the logistic equation is the linear combination of predictor variables which maximizes the log likelihood that the dependent variable equals the predicted value/class/group. The difference in the -2 log likelihood (-2LL) measures how much the final model improves over the null model.

The likelihood ratio test of the overall model, also called the "model chi-square test", is the test shown above in the "Final" row of the "Model Fitting Information" table in SPSS. When the reduced model is the baseline model with the constant only (a.k.a. initial model or model at step 0), the likelihood ratio test tests the significance of the researcher's model as a whole. A well-fitting model is significant at the .05 level or better, as in the figure above, meaning the researcher's model is significantly different from the one with the constant only. That is, a finding of significance (p <= .05 is the usual cutoff) leads to rejection of the null hypothesis that all of the predictor effects are zero. When this likelihood test is significant, at least one of the predictors is significantly related to the dependent variable. Put yet another way, the likelihood ratio test tests the null hypothesis that all population logistic regression coefficients except the constant are zero. And put a final way, the likelihood ratio test reflects the difference between error not knowing the independents (initial chi-square) and error when the independents are included in the model (deviance). When probability (model chi-square) <= .05, we reject the null hypothesis that knowing the independents makes no difference in predicting the dependent in logistic regression.

The likelihood ratio test computes model chi-square (the chi-square difference) by subtracting deviance (-2LL) for the final (full) model from deviance for the intercept-only model. Degrees of freedom in this test equal the number of terms in the model minus 1 (for the constant). This is the same as the difference in the number of terms between the two models, since the null model has only one term. Model chi-square measures the improvement in fit that the explanatory variables make compared to the null model. Warning: If the log-likelihood test statistic shows a small p value (<= .05) for a model with a large effect size, ignore contrary findings based on the Wald statistic discussed below, as it is biased toward Type II errors in such instances; instead assume good model fit overall.

2. Testing for interaction effects. A common use of the likelihood ratio test is to test the difference between a full model and a reduced model dropping an interaction effect. If model chi-square (which is -2LL for the reduced model minus -2LL for the full model) is significant, then the interaction effect is contributing significantly to the full model and should be retained.

3. Test of individual model parameters. The likelihood ratio test assesses the overall logistic model but does not tell us if particular independents are more important than others. This can be done, however, by comparing the difference in -2LL for the overall model with a nested model which drops one of the independents. We can use the likelihood ratio test to drop one variable from the model to create a nested reduced model. In this situation, the likelihood ratio test tests whether the logistic regression coefficient for the dropped variable can be treated as 0, thereby justifying dropping the variable from the model. A nonsignificant likelihood ratio test indicates no difference between the full and the reduced models, hence justifying dropping the given variable so as to have a more parsimonious model that works just as well. (A sketch of this nested-model comparison appears below.)
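A sketch of this nested-model comparison on simulated data (the last predictor has a true coefficient of zero, so the test should usually justify dropping it):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    X_full = sm.add_constant(rng.normal(size=(400, 3)))
    y = (rng.random(400) < 1 / (1 + np.exp(-X_full @ [0.1, 0.8, -0.5, 0.0]))).astype(int)

    full = sm.Logit(y, X_full).fit(disp=0)
    reduced = sm.Logit(y, X_full[:, :3]).fit(disp=0)   # drop the last predictor

    lr_stat = 2 * (full.llf - reduced.llf)     # -2LL(reduced) minus -2LL(full)
    df = full.df_model - reduced.df_model      # 1 here
    print(lr_stat, df, chi2.sf(lr_stat, df))   # nonsignificant p justifies dropping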

Note that the likelihood ratio test of individual parameters is a better criterion than the alternative Wald statistic when considering which variables to drop from the logistic regression model. In SPSS output, the "Likelihood Ratio Tests" table contains the likelihood ratio tests of individual model parameters, illustrated below.

In the table above, the response variable is Gunown, a binary variable indicating whether or not there is a gun in the home. Predictors are marital status, race, attitude toward the death penalty (Cappun), and age. The likelihood ratio tests of individual parameters show that the model without Age is not significantly different from the final (full) model, and therefore Age should be dropped, based on preference for the more parsimonious reduced model. For the significant variables, the larger the chi-square value, the greater the loss of model fit if that term is dropped. In this example, dropping "marital" would result in the greatest loss of model fit.

4. Tests for model refinement. In general, the likelihood ratio test can be used to test the difference between a given model and any nested model which is a subset of the given model. It cannot be used to compare two non-nested models. Chi-square is the difference in likelihood ratios (-2LL) for the two models, and degrees of freedom is the difference in degrees of freedom for the two models. If the computed chi-square is equal to or greater than the critical value of chi-square (in a chi-square table) for the given df, then the models are significantly different.

If the difference is significant, the researcher concludes that the variables dropped in the nested model do matter significantly in predicting the dependent. If the difference is below the critical value, there is a finding of nonsignificance and the researcher concludes that dropping the variables makes no difference in prediction, so for reasons of parsimony the variables are dropped from the model. That is, the chi-square difference can be used to help decide which variables to drop from or add to the model. This is discussed further in the next section.

Block chi-square is a synonym for the likelihood ratio (chi-square difference) test, referring to the change in -2LL due to entering a block of variables. The main Logistic Regression dialog allows the researcher to enter independent variables in blocks. Blocks may contain one or more variables.

Sequential logistic regression is analysis of nested models using block chi-square, where the researcher is testing the control effects of a set of covariates. The logistic regression model is run against the dependent for the full model with independents and covariates, then is run again with the block of independents dropped. If the chi-square difference is not significant, the researcher concludes that the independent variables are controlled by the covariates (that is, they have no effect once the effect of the covariates is taken into account). Alternatively, the nested model may be just the independents, with the covariates dropped. In that case a finding of non-significance implies that the covariates have no control effect.

Assessing dummy variables with sequential logistic regression. Similarly, running a full model and then a model with all the variables in a dummy set dropped (ex., East, West, North for the variable Region) allows assessment of dummy variables, using chi-square difference. Note that even though SPSS computes log-likelihood ratio tests of individual parameters for each level of a dummy variable, these individual-parameter tests (discussed below) should not be used; rather, use the likelihood ratio test (chi-square difference method) for the set of dummy variables pertaining to a given variable. Because all dummy variables associated with the categorical variable are entered as a block, this is sometimes called the "block chi-square" test, and its value is considered more reliable than the Wald test, which can be misleading for large effects in finite samples.

o Stepwise logistic regression. The forward or backward stepwise logistic regression methods, available in both binary and multinomial regression in SPSS, determine automatically which variables to add or drop from the model. As data-driven methods, stepwise procedures run the risk of modeling noise in the data and are considered useful only for exploratory purposes. Selecting model variables on a theoretic basis and using the "Enter" method is preferred. Both binomial and multinomial logistic regression offer forward and backward stepwise logistic regression, though with slight differences, illustrated below. (A toy forward-selection sketch follows this discussion.)

Binary logistic regression in SPSS offers these variants in the Method area of the main binary logistic regression dialog: forward conditional, forward LR, forward Wald, backward conditional, backward LR, or backward Wald. The conditional options use a computationally faster version of the likelihood ratio test, the LR options use the likelihood ratio test (chi-square difference), and the Wald options use the Wald test. The LR option is most often preferred. The likelihood ratio test computes -2LL for the current model, then re-estimates -2LL with the target variable removed. The conditional option is preferred when LR estimation proves too computationally time-consuming. The conditional statistic is considered not as accurate as the likelihood ratio test but more so than the third possible criterion, the Wald test. Stepwise procedures are selected in the Method drop-down list of the binary logistic regression dialog.

Multinomial logistic regression offers these variants under the Model button if a Custom model is specified: forward stepwise, backward stepwise, forward entry, and backward elimination. These four options are described in the FAQ section below. All are based on maximum likelihood estimation (ML), with forward methods using the likelihood ratio or score statistic and backward methods using the likelihood ratio or Wald statistic. LR is the default, but the score and Wald alternatives are available under the Options button. Forward entry adds terms to the model until no omitted variable would contribute significantly to the model. Forward stepwise determines the forward entry model and then alternates between backward elimination and forward entry until all variables not in the model fail to meet entry or removal criteria. Backward elimination and backward stepwise are similar but begin with all terms in the model and work backward, with backward elimination stopping when the model contains only terms which are significant, and with backward stepwise taking this result and further alternating between forward entry and backward elimination until no omitted variable would contribute significantly to the model.

Forward selection vs. backward elimination. Forward selection is the usual option, starting with the constant-only model and adding variables one at a time in the order in which they are best by some criterion (see below) until some cutoff level is reached (ex., until the step at which all variables not in the model have a significance higher than .05). Backward selection starts with all variables and deletes one at a time, in the order in which they are worst by some criterion. The illustration below shows forward stepwise modeling of a binary dependent, having a gun in the home or not (Gunown), predicted by the categorical variables marital status (Marital, with four categories) and race (three categories), and attitude on the death penalty (binary). The forward stepwise procedure adds Marital first, then Race, then Cappun.
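As a toy illustration only (this is not SPSS's exact Forward:LR algorithm, and the variable names are hypothetical), a forward selection loop driven by likelihood ratio p values can be sketched as follows:

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def forward_lr(y, X, names, alpha_in=0.05):
        """Add the predictor with the smallest LR p value until none passes alpha_in."""
        selected, remaining = [], list(range(X.shape[1]))
        base = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)   # constant-only model
        while remaining:
            trials = []
            for j in remaining:
                fit = sm.Logit(y, sm.add_constant(X[:, selected + [j]])).fit(disp=0)
                p = chi2.sf(2 * (fit.llf - base.llf), 1)       # LR test, 1 df per term
                trials.append((p, j, fit))
            p, j, fit = min(trials, key=lambda t: t[0])
            if p > alpha_in:
                break                          # no omitted variable would contribute
            selected.append(j); remaining.remove(j); base = fit
            print("Step: added", names[j], "LR p =", round(p, 4))
        return [names[j] for j in selected]

    rng = np.random.default_rng(6)
    X = rng.normal(size=(300, 4))
    y = (rng.random(300) < 1 / (1 + np.exp(-(0.3 + X[:, 0] - 0.7 * X[:, 2])))).astype(int)
    print(forward_lr(y, X, names=["marital", "race", "cappun", "age"]))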

Cross-validation. When using stepwise methods, problems of overfitting the model to noise in the current data may be mitigated by cross-validation: fitting the model to a test subset of the data and validating the model using a hold-out validation subset. (A sketch appears below.)

Score as a variable entry criterion for forward selection. Rao's efficient score statistic has also been used as the forward selection criterion for adding variables to a model. It is similar but not identical to a likelihood ratio test of the coefficient for an individual explanatory variable. Score appears in the "Score" column of the "Variables not in the equation" table of SPSS output for binary logistic regression. Score may be selected as a variable entry criterion in multinomial regression, but there is no corresponding table (its use is just noted in a footnote in the "Step Summary" table). In the example below, having a gun in the home or not is predicted from marital status, race, age, and attitudes toward capital punishment. Race, with the highest score in Step 1, gets added to the initial predictor, marital status, in Step 2. The variable with the highest score in Step 2, attitude toward capital punishment (Cappun), gets added in Step 3. Age, with a nonsignificant score, is never added, as the stepwise procedure stops at Step 3.
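A minimal hold-out validation sketch on simulated data (a single train/validation split rather than full k-fold cross-validation):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    X = sm.add_constant(rng.normal(size=(500, 3)))
    y = (rng.random(500) < 1 / (1 + np.exp(-X @ [0.2, 1.0, -0.6, 0.0]))).astype(int)

    idx = rng.permutation(len(y))                  # random split of cases
    train, test = idx[:350], idx[350:]

    fit = sm.Logit(y[train], X[train]).fit(disp=0)
    pred = (fit.predict(X[test]) > 0.5).astype(int)   # classify at the .50 cut
    print("hold-out classification accuracy:", (pred == y[test]).mean())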

Score statistic. Rao's efficient score, labeled simply "Score" in SPSS output, is a test of whether the logistic regression coefficient for a given explanatory variable is zero. It is mainly used as the criterion for variable inclusion in forward stepwise logistic regression (discussed above) because of its advantage of being a non-iterative and therefore computationally fast method of testing individual parameters compared to the likelihood ratio test. In essence, the score statistic is similar to the first iteration of the likelihood ratio method, where LR typically goes on to three or four more iterations to refine its estimate. In addition to testing the significance of each variable, the score procedure generates an "Overall statistics" significance test for the model as a whole. A finding of nonsignificance (ex., p > .05) on the score statistic leads to acceptance of the null hypothesis that the coefficient is zero, and the variable may be dropped. SPSS continues by this method until no remaining predictor variable has a score statistic significance of .05 or better.

Which step is the best model? Stepwise methods do not necessarily identify "best models" at all, as they work by fitting an automated model to the current dataset, raising the danger of overfitting to noise in the particular dataset at hand. However, there are three possible methods of selecting the "final model" that emerges from the stepwise procedure. Binomial stepwise logistic regression offers only the first, but multinomial stepwise logistic regression offers two more, based on the information theory measures AIC and BIC, discussed below.

1. Last step. The final model is the last step model, where adding another variable would not improve the model significantly. In the figure below, a general happiness survey variable (3 levels) is predicted from age, education, and a categorical survey item on whether life is exciting or dull (3 levels). By the customary last step rule, the best model is the model at the last step.

2. Lowest AIC. The "Step Summary" table will print the Akaike Information Criterion (AIC) for each step. AIC is commonly used to compare models, where the lower the AIC, the better. The step with the lowest AIC thus becomes the "final model."

3. Lowest BIC. The "Step Summary" table will print the Bayesian Information Criterion (BIC) for each step. BIC is also used to compare models, again where the lower the BIC, the better. The step with the lowest BIC thus becomes the "final model." Often BIC will point to a more parsimonious model than will AIC, as its formula factors in degrees of freedom, which is related to the number of variables. By the lowest BIC criterion, the best model would be model 1.

For further discussion of stepwise methods, see the separate section.

o Wald statistic (test). The Wald statistic is an alternative test which is commonly used to test the significance of individual logistic regression coefficients for each independent variable (that is, to test the null hypothesis in logistic regression that a particular logit (effect) coefficient is zero). The Wald statistic is the squared ratio of the unstandardized logistic coefficient to its standard error. The Wald statistic and its corresponding p probability level are part of SPSS output in the "Variables in the Equation" table, illustrated above. The Wald test corresponds to significance testing of b coefficients in OLS regression. The researcher may well want to drop independents from the model when their effect is not significant by the Wald statistic. In multinomial regression, the Wald test appears in the "Parameter Estimates" table in SPSS output. (If computing the Wald statistic manually by spreadsheet, note that rounding/precision errors will cause the computed result to vary a little from SPSS output.)
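The Wald computation is simple enough to verify by hand, and the AIC/BIC criteria above are reported by most packages. A sketch on simulated data (squared b/SE ratio, referred to chi-square with 1 df):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(8)
    X = sm.add_constant(rng.normal(size=(400, 2)))
    y = (rng.random(400) < 1 / (1 + np.exp(-X @ [0.1, 0.9, -0.4]))).astype(int)

    result = sm.Logit(y, X).fit(disp=0)

    wald = (result.params / result.bse) ** 2        # squared ratio of b to its SE
    p = chi2.sf(wald, 1)                            # one df per coefficient
    print(np.column_stack([result.params, result.bse, wald, p]))

    print(result.aic, result.bic)   # lower is better when comparing steps/models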

Warning: Menard (p. 39) warns that for large logit coefficients, standard error is inflated, lowering the Wald statistic and leading to Type II errors (false negatives: thinking the effect is not significant when it is). That is, there is a flaw in the Wald statistic such that very large effects may lead to large standard errors and small Wald chi-square values. For models with large logit coefficients, or when dummy variables are involved, it is better to use the likelihood ratio test of the difference between the models with and without the parameter. Also note that the Wald statistic is sensitive to violations of the large-sample assumption of logistic regression. Put another way, the likelihood ratio test is considered more reliable for small samples (Agresti, 1996). For these reasons, the likelihood ratio test of individual model parameters is generally preferred.

o Logistic coefficients and correlation. Note that a logistic coefficient may be found to be significant when the corresponding correlation is found to be not significant, and vice versa. To make certain global statements about the significance of an independent variable, both the correlation and the parameter estimate (b) should be significant. Among the reasons why correlations and logistic coefficients may differ in significance are these: (1) logistic coefficients are partial coefficients, controlling for other variables in the model, whereas correlation coefficients are uncontrolled; (2) logistic coefficients reflect linear and nonlinear relationships, whereas correlation reflects only linear relationships; and (3) a significant parameter estimate (b) means there is a relation of the independent variable to the dependent variable for selected control groups, but not necessarily overall.

o Confidence interval for the logistic regression coefficient. The confidence interval around the logistic regression coefficient is b plus or minus 1.96*ASE, where ASE is the asymptotic standard error of logistic b. "Asymptotic" in ASE means the smallest possible value for the standard error when the data fit the model. It is also the highest possible precision. The real (enlarged) standard error is typically slightly larger than ASE. One typically uses the real SE if one hypothesizes that noise in the data is systematic, and ASE if one hypothesizes that noise in the data is random. As the latter is typical, ASE is used here. (A sketch follows below.)

o Goodness of Fit Index (obsolete), also known as Hosmer and Lemeshow's goodness of fit index or C-hat, is an alternative to model chi-square for assessing the significance of a logistic regression model. This is different from Hosmer and Lemeshow's (chi-square) goodness of fit test discussed above. Menard (p. 21) notes it may be better when the number of combinations of values of the independents is approximately equal to the number of cases under analysis. This measure was included in SPSS output as "Goodness of Fit" prior to Release 10. However, it was removed from the reformatted output for SPSS Release 10 because, as noted by David Nichols, senior statistician for SPSS, it "is done on individual cases and does not follow a known distribution under the null hypothesis that the data were generated by the fitted model, so it's not of any real use" (SPSSX-L listserv message, 3 Dec. 1999).
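A sketch of the confidence interval computation on simulated data; exponentiating the endpoints gives the corresponding interval for the odds ratio, as SPSS reports alongside Exp(b):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    X = sm.add_constant(rng.normal(size=(300, 1)))
    y = (rng.random(300) < 1 / (1 + np.exp(-X @ [0.2, 0.8]))).astype(int)

    result = sm.Logit(y, X).fit(disp=0)

    lower = result.params - 1.96 * result.bse       # b - 1.96*SE
    upper = result.params + 1.96 * result.bse       # b + 1.96*SE
    print(np.column_stack([lower, upper]))          # CI for the logit coefficients
    print(np.exp(np.column_stack([lower, upper])))  # CI for the odds ratios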

Interpreting Parameter Estimates

o Probabilities, odds, and odds ratios are all important basic terms in logistic regression. See the more extensive coverage in the separate section on log-linear analysis.

[Figure: illustration of probabilities, odds, and odds ratios.]

o Logistic coefficients, also called unstandardized logistic regression coefficients, logit coefficients, log odds ratios, effect coefficients, or (in SPSS output) simply parameter estimates, correspond to b coefficients in OLS regression. Logistic coefficients, which are on the right-hand side of the logistic equation, are not to be confused with logits, the term on the left-hand side. Like OLS regression coefficients, logistic coefficients can be used to construct prediction equations and generate predicted values, which in logistic regression are called logistic scores. (A sketch follows below.) The SPSS table which lists the b coefficients also lists the standard error of b, the Wald statistic and its significance (discussed above), and the odds ratio (labeled Exp(b)) as well as confidence limits on the odds ratio. In SPSS this is the "Variables in the Equation" table in binary logistic regression, shown below; there is an almost identical table in multinomial logistic regression labeled "Parameter Estimates".

o Model intercepts/constants. Labeled "Constant" in binomial and "Intercept" in SPSS multinomial regression output, this parameter is the log odds (logit estimate) of the dependent variable when all model predictors are evaluated at zero. Often particular predictor variables can never be zero in real life. Note that the intercepts will be more interpretable if variables are centered around their means prior to analysis, because then the intercepts are the log odds of the dependent when predictors are at their average values. In the example illustrated above, there is only one intercept because the dependent variable, capital punishment (Cappun), is binomial, even though multinomial regression is applied. In general, there will be (k - 1) logistic equations and hence (k - 1) intercepts, where k is the number of categories of the dependent variable.

o Logistic models: parameter estimates. In SPSS and most statistical output for logistic regression, the "parameter estimate" is the b coefficient used to predict the log odds (logit) of the dependent variable, as discussed in the section on the logistic equation.

o Logistic models: logit links. Where OLS regression has an identity link function, binary or multinomial logistic regression has a logit link function (that is, logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does). Ordinal logistic regression has a cumulative logit link. In binary logistic regression, the logit link creates a logistic model whose distribution, illustrated at the top of this section, hugs the 0 and 1 values of the Y axis for a binary dependent and, moreover, does not extrapolate out of range (below 0 or above 1). Multinomial logistic regression also uses the logit link, setting up a series of binary comparisons between the given level of the dependent and the reference level (by default, the highest coded level). Ordinal logistic regression uses a cumulative logit link, setting up a series of binary comparisons between the given level or lower and the higher levels. (Technically, the SPSS Analyze, Regression, Ordinal menu choice runs the PLUM procedure, which uses a logit link, but the logit link is used to fit a cumulative logit model. The SPSS Analyze, Generalized Linear Models, Generalized Linear Models menu choice for ordinal regression assumes an ordered multinomial distribution with a cumulative logit link.)

o Logistic models: logits. As discussed above, logits are the log odds of the event occurring (usually, that the dependent = 1 rather than 0). Parameter estimates (b coefficients) associated with explanatory variables represent the change in the logit (log odds) of the dependent for a one-unit increase in the explanatory variable, other variables in the model held constant.


More information

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear. Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear. In the main dialog box, input the dependent variable and several predictors.

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

Credit Risk Analysis Using Logistic Regression Modeling

Credit Risk Analysis Using Logistic Regression Modeling Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,

More information

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

More information

1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

7 Generalized Estimating Equations

7 Generalized Estimating Equations Chapter 7 The procedure extends the generalized linear model to allow for analysis of repeated measurements or other correlated observations, such as clustered data. Example. Public health of cials can

More information

Indices of Model Fit STRUCTURAL EQUATION MODELING 2013

Indices of Model Fit STRUCTURAL EQUATION MODELING 2013 Indices of Model Fit STRUCTURAL EQUATION MODELING 2013 Indices of Model Fit A recommended minimal set of fit indices that should be reported and interpreted when reporting the results of SEM analyses:

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper

More information

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

More information

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data. Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.

ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

More information

CREDIT SCORING MODEL APPLICATIONS:

CREDIT SCORING MODEL APPLICATIONS: Örebro University Örebro University School of Business Master in Applied Statistics Thomas Laitila Sune Karlsson May, 2014 CREDIT SCORING MODEL APPLICATIONS: TESTING MULTINOMIAL TARGETS Gabriela De Rossi

More information

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

More information

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

1 Theory: The General Linear Model

1 Theory: The General Linear Model QMIN GLM Theory - 1.1 1 Theory: The General Linear Model 1.1 Introduction Before digital computers, statistics textbooks spoke of three procedures regression, the analysis of variance (ANOVA), and the

More information

Two Correlated Proportions (McNemar Test)

Two Correlated Proportions (McNemar Test) Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

More information

From the help desk: Bootstrapped standard errors

From the help desk: Bootstrapped standard errors The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

More information

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

IBM SPSS Missing Values 22

IBM SPSS Missing Values 22 IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,

More information

Calculating the Probability of Returning a Loan with Binary Probability Models

Calculating the Probability of Returning a Loan with Binary Probability Models Calculating the Probability of Returning a Loan with Binary Probability Models Associate Professor PhD Julian VASILEV (e-mail: vasilev@ue-varna.bg) Varna University of Economics, Bulgaria ABSTRACT The

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Automated Statistical Modeling for Data Mining David Stephenson 1

Automated Statistical Modeling for Data Mining David Stephenson 1 Automated Statistical Modeling for Data Mining David Stephenson 1 Abstract. We seek to bridge the gap between basic statistical data mining tools and advanced statistical analysis software that requires

More information

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

There are six different windows that can be opened when using SPSS. The following will give a description of each of them. SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

More information

Regression 3: Logistic Regression

Regression 3: Logistic Regression Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing

More information

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

Moderation. Moderation

Moderation. Moderation Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation

More information

Εισαγωγή στην πολυεπίπεδη μοντελοποίηση δεδομένων με το HLM. Βασίλης Παυλόπουλος Τμήμα Ψυχολογίας, Πανεπιστήμιο Αθηνών

Εισαγωγή στην πολυεπίπεδη μοντελοποίηση δεδομένων με το HLM. Βασίλης Παυλόπουλος Τμήμα Ψυχολογίας, Πανεπιστήμιο Αθηνών Εισαγωγή στην πολυεπίπεδη μοντελοποίηση δεδομένων με το HLM Βασίλης Παυλόπουλος Τμήμα Ψυχολογίας, Πανεπιστήμιο Αθηνών Το υλικό αυτό προέρχεται από workshop που οργανώθηκε σε θερινό σχολείο της Ευρωπαϊκής

More information

IBM SPSS Data Preparation 22

IBM SPSS Data Preparation 22 IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release

More information

MORE ON LOGISTIC REGRESSION

MORE ON LOGISTIC REGRESSION DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON LOGISTIC REGRESSION I. AGENDA: A. Logistic regression 1. Multiple independent variables 2. Example: The Bell Curve 3. Evaluation

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

More information

Data analysis process

Data analysis process Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

Chapter 29 The GENMOD Procedure. Chapter Table of Contents

Chapter 29 The GENMOD Procedure. Chapter Table of Contents Chapter 29 The GENMOD Procedure Chapter Table of Contents OVERVIEW...1365 WhatisaGeneralizedLinearModel?...1366 ExamplesofGeneralizedLinearModels...1367 TheGENMODProcedure...1368 GETTING STARTED...1370

More information

Main Effects and Interactions

Main Effects and Interactions Main Effects & Interactions page 1 Main Effects and Interactions So far, we ve talked about studies in which there is just one independent variable, such as violence of television program. You might randomly

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Logistic (RLOGIST) Example #1

Logistic (RLOGIST) Example #1 Logistic (RLOGIST) Example #1 SUDAAN Statements and Results Illustrated EFFECTS RFORMAT, RLABEL REFLEVEL EXP option on MODEL statement Hosmer-Lemeshow Test Input Data Set(s): BRFWGT.SAS7bdat Example Using

More information

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

More information

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

More information

Instructions for SPSS 21

Instructions for SPSS 21 1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3

More information

Weight of Evidence Module

Weight of Evidence Module Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group Introduction to Multilevel Modeling Using HLM 6 By ATS Statistical Consulting Group Multilevel data structure Students nested within schools Children nested within families Respondents nested within interviewers

More information

The Basic Two-Level Regression Model

The Basic Two-Level Regression Model 2 The Basic Two-Level Regression Model The multilevel regression model has become known in the research literature under a variety of names, such as random coefficient model (de Leeuw & Kreft, 1986; Longford,

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information