Binary Logistic Regression


Binary Logistic Regression: Describing Relationships, Classification Accuracy, Examples. Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or dichotomous independent variables. Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e., that a subject will be a member of one of the groups defined by the dichotomous dependent variable. In SPSS, the model is always constructed to predict the group with the higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event. Class 1

What logistic regression predicts. The variate or value produced by logistic regression is a probability value between 0.0 and 1.0. If the probability for group membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group. For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category. Level of measurement requirements. Logistic regression analysis requires that the dependent variable be dichotomous, and that the independent variables be metric or dichotomous. If an independent variable is nominal level and not dichotomous, the logistic regression procedure in SPSS has an option to dummy code the variable for you. Class 2
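As a minimal sketch (hypothetical helper name, not SPSS syntax), the cut-point rule above can be written:

```python
def predict_group(probability, cut_point=0.50):
    """Assign a case to the modeled group when its predicted
    probability meets or exceeds the cut point (SPSS default is .50)."""
    return "modeled" if probability >= cut_point else "other"

print(predict_group(0.72))  # modeled
print(predict_group(0.31))  # other
```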

Assumptions and sample size requirements. Logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Sample size requirements: the minimum number of cases per independent variable is 10. For preferred case-to-variable ratios, we will use 20 to 1 for simultaneous and hierarchical logistic regression and 50 to 1 for stepwise logistic regression. Methods for including variables. There are several methods available for including variables in the regression equation: the simultaneous method, in which all independents are included at the same time, and the stepwise method (forward conditional in SPSS), in which variables are selected in the order in which they maximize the statistically significant contribution to the model. For all methods, the contribution to the model is measured by model chi-square, a statistical measure of the fit between the dependent and independent variables, like R². Class 3

Logistic Regression with 1 Predictor. Response: presence/absence of a characteristic. Predictor: numeric variable observed for each case. Model: π(x) is the probability of presence at predictor level x:

π(x) = e^(α + βx) / (1 + e^(α + βx))

β = 0: P(presence) is the same at each level of x. β > 0: P(presence) increases as x increases. β < 0: P(presence) decreases as x increases. α and β are unknown parameters and must be estimated using statistical software such as SPSS, SAS, or STATA. Primary interest is in estimating and testing hypotheses regarding β. Large-sample test (Wald test):

H0: β = 0 vs. HA: β ≠ 0
T.S.: X²_obs = (β̂ / σ̂_β)²
R.R.: X²_obs ≥ χ²(α, 1)
P-value: P(χ² ≥ X²_obs)

Class 4
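The one-predictor model and the Wald statistic can be sketched directly in Python (a hand-rolled illustration; actual estimation would be done in SPSS, SAS, or STATA as noted):

```python
import math

def pi(x, alpha, beta):
    """pi(x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

def wald_chi_square(beta_hat, se_beta):
    """Wald test statistic for H0: beta = 0, referred to chi-square(1)."""
    return (beta_hat / se_beta) ** 2

# beta = 0: P(presence) is the same at every level of x
print(pi(1.0, 0.0, 0.0) == pi(7.0, 0.0, 0.0))  # True
```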

Example: Pain Relief. Dependent variable: complete pain relief at 2 hours (Yes/No). Independent variable: dose (mg): placebo (0), 2.5, 5, 10.

Variables in the Equation (Step 1; variable(s) entered on step 1: DOSE):
            B        S.E.   Wald     df   Sig.   Exp(B)
DOSE        .165     .037   19.819   1    .000   1.180
Constant   -2.490    .285   76.456   1    .000   .083

Fitted model: π(x) = e^(−2.490 + 0.165x) / (1 + e^(−2.490 + 0.165x))

H0: β = 0 vs. HA: β ≠ 0
T.S.: X²_obs = (0.165 / 0.037)² = 19.819
R.R.: X²_obs ≥ χ²(.05, 1) = 3.84
P-value: .000

Odds Ratio: interpretation of the regression coefficient (b). In linear regression, the slope coefficient is the change in the mean response as x increases by 1 unit. In logistic regression, we can show that odds(x) = π(x) / (1 − π(x)) and odds(x + 1) / odds(x) = e^β. Thus e^β represents the multiplicative change in the odds of the outcome as x increases by 1 unit. If β = 0, the odds and probability are the same at all x levels (e^β = 1). If β > 0, the odds and probability increase as x increases (e^β > 1). If β < 0, the odds and probability decrease as x increases (e^β < 1). Class 5
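Plugging the slide's estimates (b = 0.165, constant = −2.490, as reconstructed from the table) into the fitted equation gives the dose-specific probabilities, the odds ratio, and approximately the reported Wald statistic; a quick check:

```python
import math

a, b = -2.490, 0.165   # constant and DOSE coefficient from the slide

def p_relief(dose_mg):
    """Fitted probability of complete pain relief at a given dose."""
    z = a + b * dose_mg
    return math.exp(z) / (1 + math.exp(z))

odds_ratio = math.exp(b)       # Exp(B): change in odds per 1 mg of dose
wald = (b / 0.037) ** 2        # close to the reported 19.819 (rounding of b)

for dose in (0, 2.5, 5, 10):
    print(dose, round(p_relief(dose), 3))
print(round(odds_ratio, 2))    # 1.18
```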

95% Confidence Interval for the Odds Ratio. Step 1: construct a 95% CI for β: β̂ ± 1.96 σ̂_β, i.e., (β̂ − 1.96 σ̂_β, β̂ + 1.96 σ̂_β). Step 2: raise e = 2.718 to the lower and upper bounds of the CI: (e^(β̂ − 1.96 σ̂_β), e^(β̂ + 1.96 σ̂_β)). If the entire interval is above 1, conclude a positive association. If the entire interval is below 1, conclude a negative association. If the interval contains 1, we cannot conclude there is an association. Example (Pain Relief): β̂ = 0.165, σ̂_β = 0.037. 95% CI for β: 0.165 ± 1.96(0.037) = (0.0925, 0.2375). 95% CI for the population odds ratio: (e^0.0925, e^0.2375) = (1.10, 1.27). Conclude a positive association between dose and the probability of complete relief. Class 6
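The two steps can be verified numerically with the estimates above:

```python
import math

beta_hat, se = 0.165, 0.037   # estimates from the pain-relief example

# Step 1: 95% CI for beta
lo, hi = beta_hat - 1.96 * se, beta_hat + 1.96 * se

# Step 2: exponentiate the endpoints to get the CI for the odds ratio
or_lo, or_hi = math.exp(lo), math.exp(hi)

print(round(or_lo, 2), round(or_hi, 2))
if or_lo > 1:
    print("entire interval above 1: positive association")
```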

Example: Death Penalty for Crime.

Variables in the Equation (Step 1; variable(s) entered on step 1: BLACKD):
            B       S.E.   Wald     df   Sig.   Exp(B)
BLACKD      .386    .350   1.213    1    .271   1.471
Constant   -.860    .254   11.439   1    .001   .423

Which we interpret as: blacks are 1.47 times as likely to receive a death sentence as non-blacks; the risk of receiving a death sentence is 1.47 times greater for blacks than for non-blacks; the odds of a death sentence for blacks are 47% higher than the odds for non-blacks (1.47 − 1.00 = 0.47); the predicted odds for black defendants are 1.47 times the odds for non-black defendants; a one-unit change in the independent variable race (non-black to black) increases the odds of receiving the death penalty by a factor of 1.47.

Multiple Logistic Regression. Extension to more than one predictor variable (either numeric or dummy variables). With p predictors, the model is written:

π = e^(α + β₁x₁ + … + β_p x_p) / (1 + e^(α + β₁x₁ + … + β_p x_p))

Adjusted odds ratio for raising x_i by 1 unit, holding all other predictors constant: OR_i = e^(β_i). Inferences on β_i and OR_i are conducted as described above for the case with a single predictor. Class 7
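The equivalent interpretations all follow from Exp(B) = e^b; for the BLACKD coefficient:

```python
import math

b_blackd = 0.386                 # B for BLACKD from the slide
exp_b = math.exp(b_blackd)       # Exp(B)
pct_higher = (exp_b - 1) * 100   # odds about 47% higher

print(round(exp_b, 3))    # 1.471
print(round(pct_higher))  # 47
```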

Overall test of relationship. The overall test of the relationship between the independent variables and the groups defined by the dependent variable is based on the reduction in the likelihood values between a model which does not contain any independent variables and the model that contains the independent variables. This difference in likelihood follows a chi-square distribution and is referred to as the model chi-square. The significance test for the model chi-square is our statistical evidence of the presence of a relationship between the dependent variable and the combination of independent variables. Ending logistic regression model: in this problem, the model chi-square is 33.65, which is statistically significant at p < 0.001. Class 8

Strength of logistic regression relationship. While logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R² measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model. A more useful measure for assessing the utility of a logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value of the dependent variable. Evaluating usefulness for logistic models. The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy. The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group. Class 9
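The proportional by chance accuracy computation, and the 25% improvement benchmark, can be sketched as:

```python
def proportional_by_chance(group_counts):
    """Sum of squared group proportions: the accuracy expected if cases
    were assigned to groups at random in those proportions."""
    total = sum(group_counts)
    return sum((n / total) ** 2 for n in group_counts)

def usefulness_benchmark(group_counts):
    """Criterion: 25% improvement over by-chance accuracy."""
    return 1.25 * proportional_by_chance(group_counts)

# Two groups of 54 and 82 cases (39.7% and 60.3%)
print(round(proportional_by_chance([54, 82]), 3))  # 0.521
print(round(usefulness_benchmark([54, 82]), 3))    # 0.651
```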

Comparing accuracy rates. To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for logistic regression.)

Classification Table, Step 1 (the cut value is .500). Observed: EXPECT U.S. IN WORLD WAR IN 10 YEARS.
Observed             Predicted YES   Predicted NO   Percentage Correct
YES                  20              34             37.0
NO                   10              72             87.8
Overall Percentage                                  67.6

SPSS reports the overall accuracy rate in the footnotes to the Classification Table. The overall accuracy rate computed by SPSS was 67.6%.

Computing by chance accuracy. The number of cases in each group is found in the Classification Table at Step 0 (before any independent variables are included). The proportion of cases in the largest group is equal to the overall percentage (60.3%).

Classification Table, Step 0 (constant is included in the model; the cut value is .500):
Observed             Predicted YES   Predicted NO   Percentage Correct
YES                  0               54             .0
NO                   0               82             100.0
Overall Percentage                                  60.3

The proportional by chance accuracy rate was computed by calculating the proportion of cases in each group from the classification table at Step 0, and then squaring and summing the proportions (0.397² + 0.603² = 0.521). The proportional by chance accuracy criterion is 65.1% (1.25 × 52.1% = 65.1%). The criterion for classification accuracy is satisfied. Class 10
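Using the counts as reconstructed from the Step 1 classification table, the comparison works out as:

```python
# rows = observed (YES, NO), columns = predicted (YES, NO)
table = [[20, 34],
         [10, 72]]

correct = table[0][0] + table[1][1]
total = sum(sum(row) for row in table)
overall_accuracy = correct / total          # 92 / 136

chance = 0.397 ** 2 + 0.603 ** 2            # proportional by chance accuracy
criterion = 1.25 * chance                   # 25% improvement benchmark

print(round(overall_accuracy * 100, 1))     # 67.6
print(overall_accuracy > criterion)         # True -> model is useful
```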

Multinomial Logistic Regression: Describing Relationships, Classification Accuracy, Examples. Multinomial logistic regression is used to analyze relationships between a qualitative dependent variable and quantitative or dichotomous independent variables. Multinomial logistic regression compares multiple groups through a combination of binary logistic regressions. The group comparisons are equivalent to the comparisons for a dummy-coded dependent variable, with the group with the highest numeric score used as the reference group. For example, if we wanted to study differences among BS, MS, and PhD students using multinomial logistic regression, the analysis would compare BS students to PhD students and MS students to PhD students. For each independent variable, there would be two comparisons. Class 11

What multinomial logistic regression predicts. Multinomial logistic regression provides a set of coefficients for each of the two comparisons. The coefficients for the reference group are all zeros, similar to the coefficients for the reference group for a dummy-coded variable. Thus, there are three equations, one for each of the groups defined by the dependent variable. The three equations can be used to compute the probability that a subject is a member of each of the three groups. A case is predicted to belong to the group associated with the highest probability. Predicted group membership can be compared to actual group membership to obtain a measure of classification accuracy. Level of measurement requirements. Multinomial logistic regression analysis requires that the dependent variable be qualitative: dichotomous, nominal, and ordinal variables satisfy the level of measurement requirement. The independent variables must be quantitative or dichotomous. Since SPSS will automatically dummy-code nominal-level variables, they can be included, as they will be dichotomized in the analysis. In SPSS, qualitative independent variables are included as factors (SPSS dummy-codes them), and quantitative independent variables are included as covariates. If an independent variable is ordinal, we will attach the usual caution. Class 12
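The prediction rule described above can be sketched as follows (hypothetical coefficients for illustration; the reference group's coefficients are fixed at zero, as stated):

```python
import math

def group_probabilities(x, comparison_coefs):
    """One (intercept, slope1, slope2, ...) tuple per non-reference
    group; the reference group's linear predictor is 0."""
    scores = [c[0] + sum(b * v for b, v in zip(c[1:], x))
              for c in comparison_coefs] + [0.0]
    denom = sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores]

def predicted_group(probs):
    """A case is assigned to the group with the highest probability."""
    return probs.index(max(probs))

probs = group_probabilities([1.0, 2.0], [(0.5, 0.3, -0.2), (-1.0, 0.1, 0.4)])
print(round(sum(probs), 6))   # 1.0: the three probabilities sum to one
```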

Assumptions and sample size requirements. Multinomial logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. The minimum number of cases per independent variable is 10. For preferred case-to-variable ratios, we will use 20 to 1. The only method for selecting independent variables in SPSS is simultaneous or direct entry. Overall test of relationship. The overall test of the relationship between the independent variables and the groups defined by the dependent variable is based on the reduction in the likelihood values between a model which does not contain any independent variables and the model that contains the independent variables. Class 13

Strength of multinomial logistic regression relationship. While multinomial logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R² measures), these correlation measures do not really tell us much about the accuracy or errors associated with the model. A more useful measure for assessing the utility of a multinomial logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value of the dependent variable. Evaluating usefulness for logistic models. The benchmark that we will use to characterize a multinomial logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy. The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group. The only difference between by chance accuracy for binary logistic models and by chance accuracy for multinomial logistic models is the number of groups defined by the dependent variable. The clue that we have numerical problems and should not interpret the results is the presence of standard errors larger than 2.0 for some independent variables. Class 14

Relationship of individual independent variables and the dependent variable. There are two types of tests for individual independent variables: the likelihood ratio test evaluates the overall relationship between an independent variable and the dependent variable; the Wald test evaluates whether or not the independent variable is statistically significant in differentiating between the two groups in each of the embedded binary logistic comparisons. If an independent variable has an overall relationship to the dependent variable, it might or might not be statistically significant in differentiating between particular pairs of groups defined by the dependent variable. The interpretation for an independent variable focuses on its ability to distinguish between pairs of groups and the contribution it makes to changing the odds of being in one dependent variable group rather than the other. We should not interpret the significance of an independent variable's role in distinguishing between pairs of groups unless the independent variable also has an overall relationship to the dependent variable in the likelihood ratio test. The interpretation of an independent variable's role in differentiating dependent variable groups is the same as in binary logistic regression. The difference in multinomial logistic regression is that we can have multiple interpretations for an independent variable in relation to different pairs of groups. Class 15

Example 1. In the dataset Congress, the independent variables (IVs) are "age" [age], "highest year of school completed" [educ], and "confidence in Congress" [conlegis]. The dependent variable (DV) distinguishes between groups based on responses to "opinion about spending on highways and bridges" [natroad]. The responses were: 1 = Too little, 2 = About right, and 3 = Too much. Example 1, continued. These predictors differentiate survey respondents who thought we spend too little money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges, and survey respondents who thought we spend about the right amount of money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges. SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables. Class 16

Request multinomial logistic regression. Select the Regression > Multinomial Logistic command from the Analyze menu. Selecting the dependent variable. First, highlight the dependent variable natroad in the list of variables. Second, click on the right-arrow button to move the dependent variable to the Dependent text box. Class 17

Requesting the classification table. While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table: click on the Statistics button to make this request. Class 18

Sample size: ratio of cases to variables.

Case Processing Summary:
HIGHWAYS AND BRIDGES    N     Marginal Percentage
  1 (Too little)        62    37.1%
  2 (About right)       93    55.7%
  3 (Too much)          12    7.2%
Valid                   167   100.0%
Missing                 103
Total                   270
Subpopulation           153 (the dependent variable has only one value observed in 146 (95.4%) subpopulations)

Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (167) to number of independent variables (3) was 55.7 to 1, which was greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied. The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 55.7 to 1 was greater than the preferred ratio, so the preferred ratio was also satisfied.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

Model Fitting Information:
Model             -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only    84.429
Final             65.972              18.457       6    .005

The presence of a relationship between the dependent variable and the combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information". In this analysis, the probability of the model chi-square (18.457) was 0.005, less than the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported. Class 19
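The model chi-square is simply the drop in -2 log-likelihood between the two models; with the values as reconstructed from the garbled table (84.429 and 65.972), a quick check (including a small closed-form chi-square tail routine for even df) gives:

```python
import math

def chi_square_sf(x, df):
    """P(chi-square(df) >= x); closed form, valid for even df only."""
    assert df % 2 == 0
    k = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(k))

intercept_only, final = 84.429, 65.972      # -2 log-likelihoods
model_chi_square = intercept_only - final

print(round(model_chi_square, 3))                    # 18.457
print(round(chi_square_sf(model_chi_square, 6), 3))  # 0.005
```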

NUMERICAL PROBLEMS

Parameter Estimates (HIGHWAYS AND BRIDGES; the reference category is 3):
Group 1         B        Std. Error   Wald     df   Sig.   Exp(B)
  Intercept     3.240    2.478        1.709    1    .191
  AGE           .019     .020         .906     1    .341   1.019
  EDUC          .071     .108         .427     1    .514   1.073
  CONLEGIS     -1.373    .620         4.913    1    .027   .253
Group 2
  Intercept     3.639    2.456        2.195    1    .138
  AGE           .003     .020         .017     1    .897   1.003
  EDUC          .172     .110         2.463    1    .117   1.188
  CONLEGIS     -1.657    .613         7.298    1    .007   .191

Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, or zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable. None of the independent variables in this analysis had a standard error larger than 2.0. (We are not interested in the standard errors associated with the intercepts.)

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

Likelihood Ratio Tests:
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   68.322                               2.350        2    .309
AGE         68.625                               2.653        2    .265
EDUC        70.395                               4.423        2    .110
CONLEGIS    75.194                               9.222        2    .010

The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0. The statistical significance of the relationship between confidence in Congress and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". For this relationship, the probability of the chi-square statistic (9.222) was 0.010, less than the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with confidence in Congress were equal to zero was rejected.
The existence of a relationship between confidence in Congress and opinion about spending on highways and bridges was supported. Class 20

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

(Parameter Estimates table as above; the reference category is 3.)

In the comparison of survey respondents who thought we spend too little money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (4.913) for the variable confidence in Congress [conlegis] was 0.027. Since the probability was less than the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3

The value of Exp(B) was 0.253, which implies that for each unit increase in confidence in Congress the odds decreased by 74.7% (0.253 − 1.0 = −0.747). Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%. Class 21
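The percentage change in odds follows directly from the coefficient (b = −1.373, as reconstructed for CONLEGIS in the group 1 vs. group 3 comparison):

```python
import math

b_conlegis = -1.373              # CONLEGIS coefficient, group 1 vs. group 3
exp_b = math.exp(b_conlegis)     # Exp(B)
pct_change = (exp_b - 1) * 100   # percent change in odds per unit increase

print(round(exp_b, 3))       # 0.253
print(round(pct_change, 1))  # -74.7
```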

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4

(Parameter Estimates table as above; the reference category is 3.)

In the comparison of survey respondents who thought we spend about the right amount of money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (7.298) for the variable confidence in Congress [conlegis] was 0.007. Since the probability was less than the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5

The value of Exp(B) was 0.191, which implies that for each unit increase in confidence in Congress the odds decreased by 80.9% (0.191 − 1.0 = −0.809). The relationship stated in the problem is supported. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges rather than the group of survey respondents who thought we spend too much money on highways and bridges.
For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%. Class 22

CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: BY CHANCE ACCURACY RATE

The independent variables could be characterized as useful predictors distinguishing survey respondents who thought we spend too little money on highways and bridges, survey respondents who thought we spend about the right amount, and survey respondents who thought we spend too much if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

Case Processing Summary:
HIGHWAYS AND BRIDGES    N     Marginal Percentage
  1                     62    37.1%
  2                     93    55.7%
  3                     12    7.2%
Valid                   167   100.0%
Missing                 103
Total                   270
Subpopulation           153 (the dependent variable has only one value observed in 146 (95.4%) subpopulations)

The proportional by chance accuracy rate was computed by calculating the proportion of cases in each group from the 'Case Processing Summary', and then squaring and summing the proportions (0.371² + 0.557² + 0.072² = 0.453).

CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: CLASSIFICATION ACCURACY

Classification:
Observed    Predicted 1   Predicted 2   Predicted 3   Percent Correct
1           15            47            0             24.2%
2           7             86            0             92.5%
3           5             7             0             .0%
Overall     16.2%         83.8%         .0%           60.5%

The classification accuracy rate was 60.5%, which was greater than the proportional by chance accuracy criterion of 56.6% (1.25 × 45.3% = 56.6%). The criterion for classification accuracy is satisfied. Class 23
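Using the group counts as reconstructed (62, 93, and 12 of 167 valid cases), the three-group computation is:

```python
counts = [62, 93, 12]                            # cases per group, 167 valid
total = sum(counts)
chance = sum((n / total) ** 2 for n in counts)   # proportional by chance
criterion = 1.25 * chance                        # 25% improvement benchmark

overall_accuracy = (15 + 86 + 0) / total         # correct classifications

print(round(chance, 3))                # 0.453
print(round(overall_accuracy, 3))      # 0.605
print(overall_accuracy >= criterion)   # True
```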

Example 2

In the dataset Example1, the variables "highest year of school completed" [educ], "sex" [sex], and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" [natspac]. These predictors differentiate survey respondents who thought we spend too little money on space exploration from survey respondents who thought we spend too much money on space exploration, and survey respondents who thought we spend about the right amount of money on space exploration from survey respondents who thought we spend too much money on space exploration.

Selecting quantitative independent variables

Quantitative independent variables are specified as covariates in multinomial logistic regression. Quantitative variables can be either interval or, by convention, ordinal. Move the quantitative independent variables, educ and income98, to the Covariate(s) list box.

Sample size: ratio of cases to variables

Case Processing Summary
                                N     Marginal Percentage
SPACE EXPLORATION PROGRAM  1    33    15.9%
                           2    90    43.3%
                           3    85    40.9%
RESPONDENTS SEX            1    94    45.2%
                           2   114    54.8%
Valid                          208   100.0%
Missing                         62
Total                          270
Subpopulation                  138 a
a. The dependent variable has only one value observed in 112 (81.2%) subpopulations.

Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (208) to number of independent variables (3) was 69.3 to 1, which was greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied. The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 69.3 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

Model Fitting Information
Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   354.268
Final            334.967             19.301       6    .004

The presence of a relationship between the dependent variable and the combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information". In this analysis, the probability of the model chi-square (19.301) was 0.004, less than the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
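The model chi-square is just the difference between the two -2 log-likelihoods, and for an even number of degrees of freedom its p-value has a closed form that can be checked without any statistics library. A sketch using the figures from the Model Fitting Information table:

```python
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function for even df:
    P(X > x) = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!"""
    assert df % 2 == 0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k   # accumulates (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# -2 log-likelihoods from the Model Fitting Information table
ll_intercept_only = 354.268
ll_final = 334.967

chi_square = ll_intercept_only - ll_final
p_value = chi2_sf_even_df(chi_square, df=6)

print(round(chi_square, 3))  # 19.301, the model chi-square in the table
print(round(p_value, 3))     # 0.004, significant at the 0.05 level
```

The same helper reproduces any of the chi-square p-values in these tables whose degrees of freedom are even.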

NUMERICAL PROBLEMS

Parameter Estimates (dependent variable: SPACE EXPLORATION PROGRAM)

                 B        Std. Error   Wald      df   Sig.   Exp(B)
1  Intercept    -4.136    1.157        12.779    1    .000
   EDUC           .101     .089         1.276    1    .259   1.106
   INCOME98       .097     .050         3.701    1    .054   1.102
   [SEX=1]        .672     .426         2.488    1    .115   1.959
   [SEX=2]       0 b       .            .        0    .      .
2  Intercept    -2.487     .840         8.774    1    .003
   EDUC           .108     .068         2.521    1    .112   1.114
   INCOME98       .058     .034         2.923    1    .087   1.060
   [SEX=1]        .501     .317         2.492    1    .114   1.650
   [SEX=2]       0 b       .            .        0    .      .
a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.

None of the independent variables in this analysis had a standard error larger than 2.0.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   334.967 a                             .000        0    .
EDUC        337.788                              2.821        2    .244
INCOME98    340.154                              5.187        2    .075
SEX         338.511                              3.544        2    .170
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

The statistical significance of the relationship between total family income and opinion about spending on space exploration is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests". For this relationship, the probability of the chi-square statistic (5.187) was 0.075, greater than the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with total family income were equal to zero was not rejected. The existence of a relationship between total family income and opinion about spending on space exploration was not supported.
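The likelihood ratio test for INCOME98 can be verified by hand from the table: subtract the final model's -2 log-likelihood from the reduced model's, and note that for 2 degrees of freedom the chi-square survival function reduces to exp(-x/2). A minimal sketch:

```python
import math

# -2 log-likelihoods from the Likelihood Ratio Tests table
ll_final = 334.967           # final model (all effects included)
ll_without_income = 340.154  # reduced model omitting INCOME98

chi_square = ll_without_income - ll_final
# With df = 2, P(X > x) for a chi-square variable is simply exp(-x/2)
p_value = math.exp(-chi_square / 2.0)

print(round(chi_square, 3))  # 5.187, matching the table
print(round(p_value, 3))     # 0.075, not significant at the 0.05 level
```

This confirms the conclusion in the text: the null hypothesis for total family income is not rejected.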

Example 3

Several brands of similar products are on the market, and you want to study brand choices based on gender and age. The data set contains information on 735 subjects who were asked their preference among three brands of camera. Also included in the data set is information on the subjects' gender and age. The outcome variable is brand. The variable gender is coded as 1 for female and 2 for male. (file: digital camera)

Multinomial Logit Model Results 1

Since there are multiple categories, we will choose a base category as the comparison group. Here our choice is the first brand (brand = Kodak).

Case Processing Summary
                      N     Marginal Percentage
brand   Kodak        207    28.2%
        Canon        307    41.8%
        Sony         221    30.1%
gender  female       466    63.4%
        male         269    36.6%
Valid                735   100.0%
Missing                0
Total                735
Subpopulation         26 a
a. The dependent variable has only one value observed in 3 (11.5%) subpopulations.

Multinomial Logit Model Results 2

Model Fitting Information
Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   362.033
Final            176.183             185.850      4    .000

Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   176.183 a                             .000        0    .
age         353.964                              177.781      2    .000
gender      183.834                                7.651      2    .022
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.

Multinomial Logit Model Results 3

Parameter Estimates
                                                                          95% Confidence Interval for Exp(B)
brand               B         Std. Error   Wald      df   Sig.   Exp(B)   Lower Bound   Upper Bound
Canon  Intercept   -11.775    1.775         44.024   1    .000
       age            .368     .055         44.813   1    .000   1.445    1.297         1.610
       [gender=1]     .524     .194          7.272   1    .007   1.688    1.154         2.471
       [gender=2]    0 b       .             .       0    .      .        .             .
Sony   Intercept   -22.721    2.058        121.890   1    .000
       age            .686     .063        119.954   1    .000   1.986    1.756         2.245
       [gender=1]     .466     .226          4.247   1    .039   1.594    1.023         2.482
       [gender=2]    0 b       .             .       0    .      .        .             .
a. The reference category is: Kodak.
b. This parameter is set to zero because it is redundant.

Classification
Observed             Predicted Kodak   Predicted Canon   Predicted Sony   Percent Correct
Kodak                58                136               13               28.0%
Canon                18                238               51               77.5%
Sony                 10                101               110              49.8%
Overall Percentage   11.7%             64.6%             23.7%            55.2%
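The percentages in the Classification table all follow from the raw counts. A minimal sketch computing the overall and per-class accuracy from the confusion matrix (counts as read from the table):

```python
# Confusion matrix from the Classification table
# (rows = observed, columns = predicted; order: Kodak, Canon, Sony)
confusion = [
    [58, 136, 13],
    [18, 238, 51],
    [10, 101, 110],
]

total = sum(sum(row) for row in confusion)        # 735 cases
correct = sum(confusion[i][i] for i in range(3))  # diagonal: 58 + 238 + 110

overall_accuracy = correct / total
per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(round(overall_accuracy * 100, 1))         # 55.2 (%)
print([round(p * 100, 1) for p in per_class])   # [28.0, 77.5, 49.8] (%)
```

Note how the model almost never predicts Kodak for an observed Kodak case (28.0% correct) while doing well on Canon (77.5%), a pattern worth mentioning alongside the overall 55.2% rate.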

Conclusion

For example, we can say that for a one-unit change in the variable age, the log of the ratio of the two probabilities, P(brand=Canon)/P(brand=Kodak), will increase by 0.368, and the log of the ratio of the two probabilities, P(brand=Sony)/P(brand=Kodak), will increase by 0.686. Therefore, we can say that, in general, the older a person is, the more he/she will prefer brand Canon or Sony.

Sample Write-up of the Analysis

Below is one way of describing the results. Both female and age are statistically significant across the two models. Females are more likely to prefer brands Canon or Sony compared to brand Kodak. Also, the older a person is, the more likely he/she is to prefer brands Canon or Sony to brand Kodak. Both of these findings are statistically significant.

Steps in multinomial logistic regression: level of measurement and initial sample size

The following is a guide to the decision process for answering problems about the basic relationships in multinomial logistic regression:

- Is the dependent variable qualitative, and are the independent variables quantitative or dichotomous? If no: inappropriate application of a statistic. If yes, continue.
- Is the ratio of cases to independent variables at least 10 to 1? If no: inappropriate application of a statistic. If yes: run multinomial logistic regression.
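The log-odds interpretation above can be turned into predicted probabilities with the usual multinomial-logit (softmax) formula, taking Kodak as the baseline. A sketch using the coefficient estimates from the Parameter Estimates table; the intercept values here are reconstructed from a garbled printout and should be treated as approximate, so the exact probabilities are illustrative:

```python
import math

def scores(age, female):
    """Linear predictors for each brand; Kodak is the reference (score 0).
    Coefficients taken from the fitted model (intercepts approximate)."""
    kodak = 0.0
    canon = -11.775 + 0.368 * age + 0.524 * female
    sony = -22.721 + 0.686 * age + 0.466 * female
    return [kodak, canon, sony]

def predicted_probabilities(age, female):
    """Softmax over the category scores: P(brand) for one subject."""
    exps = [math.exp(s) for s in scores(age, female)]
    z = sum(exps)
    return [e / z for e in exps]

p30 = predicted_probabilities(age=30, female=1)
p40 = predicted_probabilities(age=40, female=1)

print([round(p, 3) for p in p30])  # probabilities for Kodak, Canon, Sony
# Older respondents shift away from Kodak, and toward Sony in particular,
# because both age coefficients are positive and Sony's is the largest.
```

This is the same conclusion as the write-up, expressed on the probability scale rather than the log-odds scale.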

Steps in multinomial logistic regression: overall relationship and numerical problems

- Is the overall relationship statistically significant (model chi-square test)? If no: false. If yes, continue.
- Do the standard errors of the coefficients indicate no numerical problems (s.e. <= 2.0)? If no: false. If yes, continue.

Steps in multinomial logistic regression: relationships between IVs and DV

- Is the overall relationship between the specific IV and the DV statistically significant (likelihood ratio test)? If no: false. If yes, continue.
- Is the role of the specific IV for the DV groups statistically significant and interpreted correctly (Wald test and Exp(B))? If no: false. If yes, continue.

Steps in multinomial logistic regression: classification accuracy and adding cautions

- Is the overall accuracy rate 25% or more higher than the proportional by chance accuracy rate? If no: false. If yes, continue.
- Does the analysis satisfy the preferred ratio of cases to IVs of 20 to 1? If no: true with caution. If yes, continue.
- Are one or more ordinal-level IVs treated as metric? If yes: true with caution. If no: true.
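The three flowcharts above can be collected into a single screening checklist. A minimal sketch; the function name and return strings are illustrative, not SPSS output:

```python
def check_multinomial_logit(n_cases, n_ivs, max_se, accuracy, chance_accuracy,
                            ordinal_iv_as_metric=False):
    """Apply the screening criteria from the decision flowcharts in order.
    Returns a verdict string plus any 'true with caution' notes."""
    notes = []
    if n_cases / n_ivs < 10:                 # minimum 10-to-1 cases per IV
        return "inappropriate application of a statistic", notes
    if max_se > 2.0:                         # numerical problems
        return "false (standard error above 2.0)", notes
    if accuracy < 1.25 * chance_accuracy:    # 25%-over-chance criterion
        return "false (accuracy below by-chance criterion)", notes
    if n_cases / n_ivs < 20:                 # preferred 20-to-1 ratio
        notes.append("caution: below preferred 20-to-1 ratio")
    if ordinal_iv_as_metric:
        notes.append("caution: ordinal IV treated as metric")
    return "true", notes

# Illustrative figures in the spirit of the highways-and-bridges example:
# 167 cases, 3 IVs, largest coefficient s.e. 0.620, 60.5% accuracy vs. 45.3% chance
print(check_multinomial_logit(167, 3, 0.620, 0.605, 0.453))
```

Encoding the checklist this way makes the ordering explicit: measurement and sample-size checks come first, and the cautions are only appended once every hard criterion has passed.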