Statistics and Data Analysis


NESUG 2007

PROC LOGISTIC: The Logistics Behind Interpreting Categorical Variable Effects
Taylor Lewis, U.S. Office of Personnel Management, Washington, DC

ABSTRACT

The goal of this paper is to demystify how SAS models (a.k.a. parameterizes) categorical variables in PROC LOGISTIC. Specifically, readers will become more familiar with the commonly used effect and reference parameterizations. In conjunction with these two parameterizations and associated options, this paper touches on issues such as why SAS needs to create dummy variables for the k distinct categories and why the output displays estimates for only k - 1 parameters. At the conclusion of the paper, readers should feel more confident interpreting a categorical variable's effect on the response as well as testing for significance, by way of the odds ratios computed from the output or via the CONTRAST statement. Discussion uses real-world data from the U.S. Office of Personnel Management, collected for a multiple logistic regression model project whereby the likelihood of a promotion for Federal civilian employees was modeled using personnel data.

BACKGROUND

PROC LOGISTIC is the SAS/STAT procedure which allows users to model and analyze factors affecting the outcome of a dichotomous response variable, one in which an event or nonevent can occur. After some initial derivations to linearize this modeling process (the details of which are not a concern of this paper), the end result involves computing the log-odds, or logits, and producing a logit function, $L(X)$, modeled as follows:

$$L(X) = \log\left(\frac{P(\text{event} \mid x)}{P(\text{nonevent} \mid x)}\right) = \beta_0 + \beta_1 x$$

In the instance of a continuous variable, $\beta_1$ has the interpretation of the increase in the log-odds, given a one-unit increase in the variable $x$. Exponentiate this model parameter estimate, $\exp(\beta_1)$, and you have the more readily interpretable change in the odds themselves (no more logarithms), given that one-unit increase in $x$.
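To see why exponentiating the slope yields a change in the odds, it may help to write the step out. The short derivation below uses only the logit function defined above; the shorthand $\text{odds}(x)$ is notation introduced here for illustration.

```latex
\begin{align*}
\text{odds}(x) &= \frac{P(\text{event} \mid x)}{P(\text{nonevent} \mid x)}
                = \exp(\beta_0 + \beta_1 x) \\[4pt]
\frac{\text{odds}(x+1)}{\text{odds}(x)}
               &= \frac{\exp\big(\beta_0 + \beta_1 (x+1)\big)}{\exp(\beta_0 + \beta_1 x)}
                = \exp(\beta_1)
\end{align*}
```

In other words, $\exp(\beta_1)$ is the factor by which the odds are multiplied for each one-unit increase in $x$.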
The plot thickens, however, when the predictor variable of interest is categorical in nature, rather than continuous. A series of design, or dummy, variables must be created for the different levels of the categorical variable, and interpretations and tests of significance can quickly become more involved. Lucky for us, PROC LOGISTIC performs a lot of the nitty-gritty modeling work behind the scenes, but it is imperative to first understand the varying SAS parameterization schemes available before utilizing the PROC's options and output to guide SAS in producing exactly what is desired.

EFFECT CODING: THE DEFAULT PARAMETERIZATION

Through the course of this paper, we will consider a personnel data extract of nearly 6,[digits illegible] Federal employees used to model the likelihood of promotion over a one-year period. The SAS data set PROM contains, for each employee, the variable PROMOTION, given as 1 if a promotion occurred and 0 if not. The predictor variable to be investigated is education level attainment, EDLEVEL, consisting of four groups of employees: A = high school diploma or equivalent; B = bachelor's degree; C = master's degree; and D = Ph.D. To initially model education, we invoke PROC LOGISTIC with the following syntax:

    PROC LOGISTIC data=prom descending;
       CLASS edlevel;
       MODEL promotion = edlevel;
    RUN;

A note about the descending option in the PROC LOGISTIC statement: SAS will first try to model the probability that the variable PROMOTION = 0. Recall that our data has a promotion indicated by a 1, and discussion makes more sense when talking about likelihood of promotion as opposed to likelihood of not being promoted. This option is a quick way to reverse the SAS default.

We immediately note from the Analysis of Maximum Likelihood Estimates section of the output that parameter estimates are given for EDLEVEL A, B, and C, but not D:

    Analysis of Maximum Likelihood Estimates

                                 Standard   Wald
    Parameter     DF   Estimate  Error      Chi-Square   Pr > ChiSq
    Intercept     ...  ...       ...        ...          <.0001
    EDLEVEL A     ...  ...       ...        ...          <.0001
    EDLEVEL B     ...  ...       ...        ...          <.0001
    EDLEVEL C     ...  ...       ...        ...          ...

    (... marks values that are not legible in the source.)

We also note there is a Class Level Information section with a curious matrix of 1s, 0s and -1s:

    Class Level Information

    Class      Value    Design Variables
    EDLEVEL    A          1    0    0
               B          0    1    0
               C          0    0    1
               D         -1   -1   -1

This parameterization scheme is PROC LOGISTIC's default effect coding of dummy variables. SAS sorts the class variable's value list and assigns dummy variables for one less than the number of distinct values, omitting the last category; the number of columns under the Design Variables heading indicates the count of dummy variables created. An initial roadblock with this scheme is that the parameter estimates of the dummy variables are not directly interpretable; they are a measure of the difference between the classification level's effect and the average effect across all levels. Notice, however, there is an Odds Ratio Estimates section in the output:

    Odds Ratio Estimates

                        Point       95%
    Effect              Estimate    Confidence Limits
    EDLEVEL A vs D      ...         ...      ...
    EDLEVEL B vs D      ...         ...      ...
    EDLEVEL C vs D      ...         ...      ...

For any logistic regression model without interaction terms, SAS computes a series of odds ratios and confidence limits for each class variable. It is important to review how these odds ratios are computed, since SAS will not output all possible comparisons of interest. From the Design Variables section of Class Level Information, the first, second, and third columns correspond to the dummy variables for groups A, B, and C, all such dummy variables in the model. Each row can be thought of as the sequence of coefficients to be placed in front of the dummy variable parameter estimates to arrive at a logit function estimate for that particular level. For instance, the row of -1s for the last group, D, corresponds to a logit function of $\beta_0 + (-1)\beta_A + (-1)\beta_B + (-1)\beta_C$, or $\beta_0 - \beta_A - \beta_B - \beta_C$. Assume we want to investigate the odds of promotion between groups A and D.
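Read row by row, the effect-coded design matrix described above implies one logit function per education level. The block below simply spells these out; the subscripted names $\beta_A$, $\beta_B$, $\beta_C$ for the three dummy variable parameters and $\beta_0$ for the intercept are labels assumed here for readability.

```latex
\begin{align*}
L(A) &= \beta_0 + \beta_A \\
L(B) &= \beta_0 + \beta_B \\
L(C) &= \beta_0 + \beta_C \\
L(D) &= \beta_0 - \beta_A - \beta_B - \beta_C
\end{align*}
```

Any comparison between two levels is then just the difference of the corresponding two lines, in which the intercept always cancels.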
Our log-odds difference of interest is

$$L(A) - L(D) = (\beta_0 + \beta_A) - \big(\beta_0 + (-\beta_A - \beta_B - \beta_C)\big) = 2\beta_A + \beta_B + \beta_C$$

Substituting the parameter estimates from the output gives 0.7568, and the odds ratio turns out to be exp(0.7568) = 2.13, exactly as seen in the first row of the Odds Ratio Estimates output. This says the odds of promotion for those educated at the high school level are more than double those of the Ph.D. level.

Knowing how the odds ratios are calculated gives us greater flexibility to compare, say, two levels within a classification variable that do not happen to be listed in the Odds Ratio Estimates output. For instance, we may wish to investigate a statistical difference between group A, high school graduates, and group B, bachelor's degrees. We note from the output how close the maximum likelihood parameter estimates for the two groups are and further reason the model could be simplified if we could collapse groups A and B into one group. For the two groups, we take coefficients from the first and second rows of the Class Level Information matrix to arrive at the following:

$$L(A) - L(B) = (\beta_0 + \beta_A) - (\beta_0 + \beta_B) = \beta_A - \beta_B$$

We observe this logit difference is approximately zero, and exp(0) = 1. With an odds ratio of 1, the probabilities of promotion between the two groups are roughly the same, so it is not necessary for the model to distinguish between them. It may prove easier to collapse groups A and B together into one category covering all employees who have attained a bachelor's degree or less.

REFERENCE CODING: AN ALTERNATIVE PARAMETERIZATION

While there are situations where effect coding is preferable, SAS allows users to change this setting to other parameterizations. A second useful coding scheme is called reference coding, where one level of the classification variable is designated as the reference level to which parameter estimates for the remaining levels are directly comparable. Under this coding scheme, the exponentiated parameter estimate of a level is interpreted as the odds ratio between that level and the reference level. Hence, it would make sense to assign to the reference level any particular level we wanted to pit against all others.

Suppose we were interested in reporting the effect of education level on promotion likelihood and wanted to compare, individually, those who had obtained a bachelor's, master's, and Ph.D. with those holding the high school diploma. We can use additional CLASS statement options to reference parameterize EDLEVEL with group A as the reference category:

    PROC LOGISTIC data=prom desc;
       CLASS edlevel (param=ref ref='A');
       MODEL promotion = edlevel;
    RUN;

In parentheses after the listed CLASS variable, param=ref overrides the default param=effect and ref='A' designates the high school level to be the reference.
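As a rough illustration of what reference coding amounts to, the same design variables can be built by hand in a DATA step and passed to PROC LOGISTIC without a CLASS statement. This is only a sketch under stated assumptions: it assumes EDLEVEL is a character variable holding the values 'A' through 'D', and the names PROM_REF, ED_B, ED_C, and ED_D are hypothetical, chosen here for illustration.

```sas
/* Sketch: reference coding built by hand, with group A as the reference. */
/* Assumes EDLEVEL is character with values 'A', 'B', 'C', 'D'.           */
/* PROM_REF, ED_B, ED_C and ED_D are hypothetical names.                  */
data prom_ref;
   set prom;
   /* Keep only the four known levels so that a missing or unexpected */
   /* value is not silently treated as the reference group            */
   where edlevel in ('A', 'B', 'C', 'D');
   /* A comparison evaluates to 1 when true and 0 when false, */
   /* so group A receives 0 on all three variables            */
   ed_b = (edlevel = 'B');
   ed_c = (edlevel = 'C');
   ed_d = (edlevel = 'D');
run;

proc logistic data=prom_ref descending;
   model promotion = ed_b ed_c ed_d;
run;
```

Under those assumptions, each hand-built variable plays the role of one design variable column, and its exponentiated estimate is read as the odds ratio of that level against group A.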
Other ref= options are LAST, the default, which sorts the distinct variable levels and sets the last level to the reference, and FIRST, which sorts and sets the first value in the list. Interestingly, the ref= option in the CLASS statement is also available under the effect parameterization; it determines what level gets the -1 row of dummy variable coefficients and, thus, what group is compared to all others in the Odds Ratio Estimates portion of the output.

Looking at the output, we note some differences in the Analysis of Maximum Likelihood Estimates and Class Level Information matrix from what we initially saw under the effect parameterization:

    Analysis of Maximum Likelihood Estimates

                                 Standard   Wald
    Parameter     DF   Estimate  Error      Chi-Square   Pr > ChiSq
    Intercept     ...  ...       ...        ...          <.0001
    EDLEVEL B     ...  ...       ...        ...          ...
    EDLEVEL C     ...  ...       ...        ...          <.0001
    EDLEVEL D     ...  ...       ...        ...          <.0001

    (... marks values that are not legible in the source.)

    Class Level Information

    Class      Value    Design Variables
    EDLEVEL    A          0    0    0
               B          1    0    0
               C          0    1    0
               D          0    0    1

In terms of the parameter estimates, notice how no dummy variable is created for the reference group A, as the three other groups' estimates are interpreted as the difference in the log-odds from that first group. The parameter estimate for EDLEVEL group B suggests a small, nearly zero increase in the log-odds compared to group A. This is precisely the conclusion we drew under the effect coding. This should serve as an affirmation that PROC LOGISTIC can take more than one path to arrive at a given conclusion. The ultimate path to be chosen can be what is most comfortable for the analyst. Rest assured, we are still able to compute odds ratios by hand from the Class Level Information matrix by plugging in the appropriate dummy variables:

$$L(A) - L(B) = (\beta_0) - (\beta_0 + \beta_B) = -\beta_B \approx 0$$

Recall that our model parameter estimates under the reference coding have a new interpretation involving odds ratios related to the reference level, but they are still reported in the output as log-odds differences. To quickly convert these to odds ratios sans logarithms, we have the EXPB option available in the MODEL statement:

    MODEL promotion = edlevel / expb;

This adds a column to the end of the parameter estimates output:

    Analysis of Maximum Likelihood Estimates

                                 Standard   Wald
    Parameter     DF   Estimate  Error      Chi-Square   Pr > ChiSq   Exp(Est)
    Intercept     ...  ...       ...        ...          <.0001       ...
    EDLEVEL B     ...  ...       ...        ...          ...          ...
    EDLEVEL C     ...  ...       ...        ...          <.0001       ...
    EDLEVEL D     ...  ...       ...        ...          <.0001       ...

Again, this last column is simply the Estimate column exponentiated for quick reference. We observe how this agrees with the Odds Ratio Estimates section of the output, which is still created:

    Odds Ratio Estimates

                        Point       95%
    Effect              Estimate    Confidence Limits
    EDLEVEL B vs A      ...         ...      ...
    EDLEVEL C vs A      ...         ...      ...
    EDLEVEL D vs A      ...         ...      ...

THE CONTRAST STATEMENT

We have seen how we can compute basic odds ratios by hand. The limitation of these is that they lack confidence intervals on the estimates. We often want to check that the odds ratio estimate's confidence interval does not contain 1, for example.
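For context on why a confidence interval needs more than the point estimate, a generic Wald-type interval for an odds ratio can be sketched as follows. The symbols are introduced here for illustration and are not taken from the paper: $\hat{\theta}$ denotes an estimated log-odds difference, $\widehat{\mathrm{SE}}(\hat{\theta})$ its standard error, and $z_{1-\alpha/2}$ the standard normal quantile (about 1.96 for 95% limits).

```latex
\[
\Big( \exp\!\big(\hat{\theta} - z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat{\theta})\big),\;
      \exp\!\big(\hat{\theta} + z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat{\theta})\big) \Big)
\]
```

When $\hat{\theta}$ combines several parameter estimates, as in $2\beta_A + \beta_B + \beta_C$, its standard error depends on the covariances among those estimates, so it cannot be read directly off the parameter estimates table.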
The Odds Ratio Estimates output will contain confidence intervals, but only for the levels of a categorical variable compared to one particular reference level. Though we could re-run PROC LOGISTIC with differing reference levels to get additional odds ratio estimates and confidence intervals, we are still restricted to a one-to-one comparison. It may be prudent to investigate a difference between the average of two EDLEVEL groups compared with a reference group, as we will explore momentarily, or any other relevant combination of levels. To solve this dilemma, we can make use of the CONTRAST statement. It is in constructing these statements that we need to be familiar with the Class Level Information matrix and the effect versus reference parameterizations. The general syntax of the CONTRAST statement is

    CONTRAST 'label' var-name dummy-coeff-1 <... dummy-coeff-n> </ options>;

After providing a label (required, since more than one CONTRAST statement is allowed), we define the variable name for which we are interested in constructing odds ratios. Immediately after that, we assign dummy coefficients by summoning the Class Level Information matrix.

Identically as we did by hand, we can use the CONTRAST statement in a simple, one-to-one comparison to test the logit function difference between EDLEVEL A and D. Recall that under effect coding we had

$$L(A) - L(D) = (\beta_0 + \beta_A) - \big(\beta_0 + (-\beta_A - \beta_B - \beta_C)\big) = 2\beta_A + \beta_B + \beta_C$$

The CONTRAST statement syntax would then be

    CONTRAST 'EDLEVEL A vs. D' EDLEVEL 2 1 1 / estimate=both;

    Contrast Test Results

                                Wald
    Contrast             DF     Chi-Square   Pr > ChiSq
    EDLEVEL A vs. D      ...    ...          <.0001

    Contrast Rows Estimation and Testing Results

                                                  Standard                                Wald
    Contrast           Type   Row   Estimate      Error     Alpha   Confidence Limits     Chi-Square
    EDLEVEL A vs. D    PARM   ...   ...           ...       ...     ...      ...          ...
    EDLEVEL A vs. D    EXP    ...   ...           ...       ...     ...      ...          ...

    (... marks values that are not legible in the source.)

With no options in the CONTRAST statement, the only output is the global test given the null hypothesis that the difference in the logit functions is zero. We see here that the test statistic is large and so we have a significant result, but we do not know in which direction the odds are favored. The estimate=both option in the CONTRAST statement adds the value of the logit function difference in both log-odds terms (Type=PARM line) and exponentiated odds ratio terms (Type=EXP line). The exponentiated estimate is the same odds ratio we have calculated twice earlier, and the 95% confidence interval, with its lower limit of 1.817, matches what was seen in the Odds Ratio Estimates section of the output.

Relating this to the reference parameterization with A as the reference level, we reason that the third dummy variable SAS created for EDLEVEL corresponds to the odds ratio of group D vs. group A. To invert this computation and make it comparable to the contrast above, testing -1 times this estimate produces the desired group A vs. group D odds ratio.

    CONTRAST 'EDLEVEL A vs. D' EDLEVEL 0 0 -1 / estimate=both;

Though we refrain from reprinting, the syntax above produces the exact same contrast output as does the syntax under effect parameterization of EDLEVEL.

We saw there was very little difference in the odds of promotion between EDLEVEL groups A and B, suggesting we could collapse the two groups to simplify the model. We could also employ the CONTRAST statement to jointly test whether groups A/B and C/D could be collapsed, respectively. One can separate by a comma two parts, or rows, of a contrast. Staying with reference coding and A as the reference level, to test A vs B you would have

$$L(A) - L(B) = \beta_0 - (\beta_0 + \beta_B) = -\beta_B$$

Furthermore, to test C vs D you would have

$$L(C) - L(D) = (\beta_0 + \beta_C) - (\beta_0 + \beta_D) = \beta_C - \beta_D$$

So we painlessly determined the dummy variable coefficients necessary for the CONTRAST statement. This time we apply a few more options. The first is the estimate=exp option, which outputs only the exponentiated logit function (odds ratio); the second is the e option, which outputs the vector of coefficients and corresponding dummy variables. This is good practice to double-check that the contrast being calculated is what the analyst intended. Needless to say, changes to the reference level or parameterization scheme can quickly change what a sequence of coefficients is actually testing.

    contrast 'Joint A/B & C/D' edlevel -1 0 0, edlevel 0 1 -1 / e estimate=exp;

This produces the following output:

    Coefficients of Contrast Joint A/B & C/D

    Parameter      Row1   Row2
    Intercept        0      0
    EDLEVEL B       -1      0
    EDLEVEL C        0      1
    EDLEVEL D        0     -1

    Contrast Test Results

                                Wald
    Contrast             DF     Chi-Square   Pr > ChiSq
    Joint A/B & C/D      ...    ...          <.0001

    Contrast Rows Estimation and Testing Results

                                                  Standard
    Contrast           Type   Row   Estimate      Error     Alpha   Confidence Limits
    Joint A/B & C/D    EXP    ...   ...           ...       ...     ...      ...
    Joint A/B & C/D    EXP    ...   ...           ...       ...     ...      ...

After acknowledging the Coefficients of Contrast as what we intended, we note that the Contrast Test Results section yields a test statistic which strongly suggests the contrast is not equal to zero. Virtually all of the deviation from zero is clearly coming from the second part of the contrast, between group C and group D, as the odds ratio for that comparison is significantly greater than 1 (1.6117), while the group A vs. group B odds ratio is not significantly different from 1. At this point, we conclude that we cannot jointly collapse groups A with B and C with D.

CONCLUSION

This paper outlined two parameterization schemes for a logistic regression model in which the predictor variable is categorical. There are other parameterizations available within SAS for this PROC, but practice and experience have dictated to the author that the effect and reference parameterizations are utilized most frequently.
At an initial glance of the unabridged output from a PROC LOGISTIC invocation, the sheer amount of output can make interpretation and analysis appear a daunting task. Yet after a little work picking out the relevant sections and tweaking the SAS code with a few added options, the task at hand can be quickly simplified, especially when one realizes how the various sections are interrelated.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Taylor Lewis
    U.S. Office of Personnel Management (OPM)
    1900 E St., NW, Room 7439
    Washington, DC 20415
    Taylor.Lewis@opm.gov

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.