# Some Issues in Using PROC LOGISTIC for Binary Logistic Regression

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Some Issues in Using PROC LOGISTIC for Binary Logistic Regression by David C. Schlotzhauer Contents Abstract 1. The Effect of Response Level Ordering on Parameter Estimate Interpretation 2. Odds Ratios 2.1 Binary Explanatory Variable Modeling the Event 2.2 Binary Explanatory Variable Modeling the Nonevent 2.3 Continuous Explanatory Variable 3. Predicted Probabilities 4. Predicted by Observed Classification Tables 4.1 Classification Using Predicted Probabilities 4.2 Classification Using Bias-adjusted Predicted Probabilities 5. Classifying New Observations Summary References Abstract For regression situations in which the response is binary (i.e. it has only two levels), logistic regression is often used. Proper interpretation of the estimated parameters of the logistic model depends on the ordering of the two response levels. Similarly, the construction and interpretation of odds ratio estimates (available in Release 6.07 TS301 and later) are affected by response level ordering. In addition to these issues, this article discusses using the fitted model to compute predicted probabilities. By selecting a cutoff value, these probabilities can be used to classify observations into one of the response levels. This leads to construction of a classification table summarizing the predicted and observed responses. The fitted model and cutoff value can also be used to classify future observations. 1. The Effect of Response Level Ordering on Parameter Estimate Interpretation For binary response data, it is important to remember that the default response function modeled by the LOGISTIC procedure is where p is the probability of the response level identified in the Response Profiles section as having Ordered Value 1. In the binary response case, the response levels can be ordered in two ways: the event of interest as Ordered Value 1 and the nonevent as Ordered Value 2, or the event of interest as Ordered Value 2 and the nonevent as Ordered Value 1. Either ordering can be used, but correct interpretation of the results must take this into account as we ll see below. Consider a model with a single explanatory variable, X, For positive β, as X increases, logit(p) increases. But it is also true that logit(p) monotonically increases with p, meaning that logit(p) never decreases as p increases as shown in Figure 1 below. Since p and logit(p) increase and decrease together, if β is positive, increasing X increases p. Similarly, a negative β indicates that increasing X decreases p. If the event of interest is Ordered Value 1, then p is the probability of an event. Consequently, by simply examining the sign of an explanatory variable s parameter estimate, you can easily determine the effect of the variable on the 1

2 done either through the LOGISTIC procedure with the LINK=NORMIT option or through the PROBIT procedure. Just as with logistic regression, the parameter estimates in a probit model may be interpreted more intuitively if the event of interest is the first ordered level. The following example uses the common convention of numerically coding the response levels as 0 and 1 with the value 1 corresponding to the event of interest. If the possible responses are DISEASE and NO DISEASE, with DISEASE being the event of interest, then a response variable is often defined as follows: Similarly, an explanatory variable, EXPOSURE, might be defined like this: Figure 1. Logit(p) increases with p Suppose the data to be modeled are summarized in the following table: probability of an event. If the event of interest is not Ordered Value 1, and you do not take this into account, then the parameter signs may seem backwards to you. It is very important, therefore, to know which of your response levels is Ordered Value 1 before looking at the remainder of the analysis output. For instance, if the response levels are ordered such that the event of interest is Ordered Value 2, then p becomes the probability of the nonevent (since it is Ordered Value 1), and, when β is positive, the probability of a nonevent increases as X increases. In this case, if you interpret the positive β as meaning that the probability of the event increases as X increases, you would be assuming, incorrectly, that the event of interest is Ordered Value 1. The importance of knowing the response level ordering also applies to binary probit analysis EXPOSURE Y Frequency Row Pct 0 1 Total Total Figure 2. Association of Exposure and Disease Notice in Figure 2 that the EXPOSED group (the last row) is much more likely to be diseased than the NOT EXPOSED group (the first row). The selected sections of LOGISTIC output shown in Output 1 are generated by 2

3 submitting the following statements 1 : data disease; input y exposure freq; cards; ; proc logistic data=disease; model y = exposure; order. The ordering scheme may be altered by the use of the ORDER= and DESCENDING options on the PROC LOGISTIC statement as needed. Because of the way that we ve coded our response, Y, the default ordering results in Y=0 as Ordered Value 1 and Y=1 as Ordered Value 2. Notice how this affects parameter estimate interpretation. The negative parameter estimate for EXPOSURE ( ) does not mean that as EXPOSURE increases from 0 to 1, Y tends to decrease from 1 (DISEASE) to 0 (NO DISEASE). Since Ordered Value 1 Response Profile Ordered Value Y Count Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT EXPOSURE Output 1. Model with p = Pr(No Disease) The first thing to notice is the level-ordering displayed in the Response Profile. Note that Ordered Value 1 corresponds to NO DISEASE (Y=0) rather than the event of interest, DISEASE. By default, LOGISTIC orders the response values (or the associated formatted values if defined) in increasing alphanumeric 1 The FREQ statement, DESCENDING option, odds ratios with optional confidence limits and multiple-cutoff classification table illustrated in this article are features available in Release 6.07 TS301 and later. These features are described in Technical Report P-229 SAS/STAT Software: Changes and Enhancements, Release All output in this article was generated using Release 6.07 TS301. corresponds to Y=0 (NO DISEASE), p is defined to be the probability of a nonevent. The negative parameter estimate for EXPOSURE actually indicates that as EXPOSURE increases from 0 to 1, logit(p) and p decrease, meaning the probability of NO DISEASE decreases. Since the probabilities of DISEASE and NO DISEASE must sum to one, the probability of DISEASE increases as the probability of NO DISEASE decreases. So, the negative EXPOSURE parameter means that as EXPOSURE increases, the probability of DISEASE increases. This is consistent with the data as seen in the Figure 2 above the probability of DISEASE increases from 18% in the EXPOSURE=0 group to 89% in the EXPOSURE=1 group. 3

4 Though we finally arrived at the correct interpretation by taking the level ordering into account, it was difficult simply because all information printed by LOGISTIC refers to Ordered Value 1 (NO DISEASE) but we wanted to make statements about Ordered Value 2 (DISEASE). If you ensure that the event of interest is Ordered Value 1, then interpretation is much easier. For instance, if you reverse the response level ordering so that p becomes the probability of DISEASE, the parameter estimates change sign, as seen below in Output 2. The positive EXPOSURE parameter now has the intuitive interpretation of an increase in EXPOSURE yielding an increase in the probability of DISEASE 2. There are several ways that you can reverse the response level ordering to yield Output 2 rather than Output 1: The simplest way, available in Release 6.07 TS301 and later, uses the option, DESCENDING. Specify the DESCENDING option on the PROC LOGISTIC statement to reverse the default ordering of Y from 0,1 to 1,0 making 1 (DISEASE) the level with Ordered Value 1: proc logistic data=disease descending; model y = exposure; Assign a format to Y such that the first formatted value (when the formatted values are put in sorted order) corresponds to DISEASE. For this example, Y = 0 could be assigned formatted value no disease and Y = 1 assigned formatted value 2 If you have used the Version 5 contributed procedure, LOGIST, for binary logistic regression, you noticed that it requires a 0,1 coding of the response variable (the response variable in LOGISTIC can have any numeric or character values) and it always defines p as the probability of level 1. For this example, the parameter estimates obtained by LOGIST would match those in Output 2. disease. proc format; value disfmt 1= disease 0= no disease ; proc logistic data=disease; model y = exposure; format y disfmt.; Sort the input data so that observations with Y = 1 occur before observations with Y = 0, then use the ORDER=DATA option on the PROC LOGISTIC statement. For this example, sort the data before running LOGISTIC with the following statements: proc sort data=disease; by descending y; proc logistic data=disease order=data; model y = exposure; Create a new variable to replace Y as the response variable in the MODEL statement of LOGISTIC such that observations with Y = 1 take on the first value of the new variable (when put in sorted order). data disease2; set disease; if y = 0 then y1 = no disease ; else y1 = disease ; proc logistic data=disease2; model y1 = exposure; Create a new variable (N, say) with constant value 1 for each observation. Use the events/trials MODEL statement syntax with Y as the event variable and N as the trial variable. (Note that this method depends on Y having values 0 and 1, where 1 represents the event of interest). When this is done, the ratio y/n for each observation is either 0/1, indicating no event occurred in this single trial, or 1/1 indicating an event did occur in this trial. 4

5 Response Profile Ordered Value Y Count Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT EXPOSURE Output 2. Model with p = Pr(Disease) data disease3; set disease; n=1; proc logistic data=disease3; model y/n = exposure; the odds ratio is 8/0.222 = 36 meaning that the odds of disease in the exposed group are 36 times the odds of disease in the unexposed group. Notice that logit(p) is just the log odds: 2. Odds Ratios The odds of an event is the ratio of the probability of an event (p) to the probability of a nonevent (1-p): Then, using Model 1 with estimated parameters a and b, The odds ratio comparing the odds for two values of X, x 1 and x 2 (where x 1 <x 2 ), is For instance, in Figure 2, the odds of disease in the exposed group are 88.89/11.11 = 8 meaning that disease is 8 times more likely than no disease in the exposed group. For the unexposed group, the odds are An explanatory variable s odds ratio, given in the last column of the LOGISTIC procedure output, tells you how the odds of an event change as you increase the variable by one 3. If you compute the odds at two different settings of the explanatory variable, the odds ratio is just the ratio of these two odds. In Figure 2, where u =x 2 -x 1. When X is a binary variable 3 You can also request a 100(1-α)% confidence interval for the odds ratio by specifying the RISKLIMITS and ALPHA=α options on the MODEL statement of the LOGISTIC procedure. 5

6 such that x 2 -x 1 = 1, then the odds ratio estimate is simply exp(b). In our example, the explanatory variable, EXPOSURE, meets this condition. The effect of response level ordering on the interpretation of odds ratios for binary explanatory variables is discussed in sections 2.1 and 2.2. Odds ratios for continuous explanatory variables are covered in section 2.3. Notice that the left-hand side is the odds ratio and 36 is the value shown in Output 2 above. Notice that an algebraically-equivalent statement is that the odds of disease in the unexposed group are 1/36 = times the odds of disease in the exposed group. 2.1 Binary Explanatory Variable Modeling the Event Just as in parameter estimate interpretation, the interpretation of odds ratios requires that you be aware of the response level ordering. In the disease-exposure example, when DISEASE is Ordered Value 1, so that p is the probability of DISEASE as in Output 2 above, the fitted Model 1 is 2.2 Binary Explanatory Variable Modeling the Nonevent Let s see what happens if we model the nonevent, NO DISEASE. When NO DISEASE is Ordered Value 1 as in Output 1 above, let p represent the probability of the nonevent, NO DISEASE. The fitted Model 1 becomes when EXPOSURE=1 and when EXPOSURE=1 and when EXPOSURE=0. The odds ratio (as shown in Output 1) is when EXPOSURE=0, where p x represents the probability of disease in the exposed group and p nx is the probability of disease in the unexposed group. Subtracting Equation 3 from Equation 2 gives Equation 4 shows that the parameter estimate for EXPOSURE is the difference in log odds for the two groups. Writing the difference in logs as the log of the ratio and exponentiating both sides of Equation 4 gives The odds of being undiseased in the exposed group are times those in the unexposed group. So, the exposed group has a much lower odds of being undiseased. Notice that this odds ratio is just the reciprocal of the previous odds ratio, 1/36, and it could have been found from Equation 5 by noting that p =1-p and using a little algebra. Generally, changing the order of the columns of Figure 2 (i.e., changing the response level ordering) changes the sign of the parameter estimate and inverts the odds ratio as we ve just shown. Inversion of the odds ratio also occurs if you 6

7 switch the rows of Figure 2, as shown at the end of Section Continuous Explanatory Variable For a continuous variable, X, exp(b) represents the change in odds for a unit increase in X. LOGISTIC always estimates the odds ratio with exp(b) since it treats all explanatory variables as continuous. To obtain an estimate of the odds ratio for a change of u units, simply raise the odds ratio reported by LOGISTIC to the power u. This will also allow you to obtain an estimate of the odds ratio if you have a binary variable whose values differ by u units. To demonstrate the continuous explanatory variable case, suppose that, in a separate group of individuals, age is used to predict disease. In this group, 30 of 100 individuals have the disease. A subset of the data is shown below in Figure 3. AGE Y The following statements: proc logistic data=age descending; model y = age; fit a logistic model with Y=1 (DISEASE) as Ordered Value 1 and produce Output 3 below. The odds ratio of indicates that the odds of disease increase by a factor of for each increase in age of one year. For an increase of 10 years the odds of disease increase by a factor of = Predicted Probabilities As in simple regression, predicted values can be obtained from a logistic regression model. However, the predicted value from a binary logistic model is the estimated probability of an observation being an event (which is Ordered Value 1 when you use LOGISTIC). In our disease-exposure example, Model 1 becomes Equation 2 when EXPOSURE=1, yielding a predicted logit of To obtain the estimate of p, the probability of an event, solve the estimated Model 1 for p: Figure 3. Subset of Age and Disease Data Multiplying both sides by 1-pˆ and isolating pˆ gives Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT AGE Output 3. Model for Disease and Age 7

8 shows the PROBS data set. Multiplying top and bottom by 1/exp(a + bx) gives When EXPOSURE=1, the predicted probability of disease is 1/(1+exp( )) = The predicted probability of no disease is pˆ = 1-pˆ = You can use LOGISTIC to compute and output predicted probabilities for each observation in your data set by using the PREDICTED= option on the OUTPUT statement. These statements proc logistic data=disease descending; model y = exposure; output out=probs predicted=phat; create an output data set called PROBS that is a copy of the input data set, but with two additional variables: _LEVEL_ and PHAT. For each observation, PHAT contains the predicted probability of that observation being in the response level shown by _LEVEL_. For binary logistic regression, this level will always be the level with Ordered Value 1. By using the DESCENDING option, we selected Y=1 (DISEASE) to be Ordered Value 1. Output 4 Y EXPOSURE FREQ _LEVEL_ PHAT Output 4. Predicted Probabilities 4. Predicted by Observed Classification Tables Using the predicted probabilities, you can use the fitted model to predict whether an observation is an event or a nonevent. It is then possible to construct a table summarizing the predicted and observed responses. The resulting 2 2 table is called a Classification Table, and is one tool that can be used to assess the predictive accuracy of the model. 4.1 Classification Using Predicted Probabilities To obtain a predicted response for each observation, you apply a decision rule to the predicted probabilities. For instance, you might use a rule that classifies an observation as diseased if pˆ>pˆ and as undiseased otherwise. Notice that this rule can be written as pˆ>0.5. Using the disease-age example, we can add a variable giving the predicted classification for each observation with a simple DATA step as follows: proc logistic data=age descending; model y = age; output out=probs predicted=phat; data probs; set probs; pred_dis=0; if phat>0.5 then pred_dis=1; Using this predicted classification variable, PRED_DIS, and the actual response variable, Y, we can create a table showing how many observations were correctly and incorrectly classified by the model: proc freq data=probs; tables y*pred_dis / norow nocol nopercent; Output 5 shows the table. 8

9 Y PRED_DIS Frequency 0 1 Total Total Output 5. Classification Summary Table Correctly-classified observations appear on the main diagonal of the table. Using our decision rule s cutoff value of 0.5, the model correctly classified ( )/100 = 0.87 or 87% of the observations. 4.2 Classification Using Bias-adjusted Predicted Probabilities LOGISTIC can create a classification table similar to that in Output 5 with the CTABLE and PPROB= options. CTABLE requests that a classification table be created and PPROB= allows you to specify a cutoff value (or more than one cutoff value in release 6.07 TS301 and later). However, there is an important difference in the two tables. Notice that the table in Output 5 resulted from using all observations to fit the model. Consequently, each observation influenced the model used to classify itself. This can bias the results. Ideally, for a given observation, you would fit a model excluding that observation from the data and then classify the observation using the resulting model. This method could obviously be very expensive if the data set were large. The method used by the CTABLE option is one that approximates this unbiased method and is less expensive. The parameters of the classifying model for each observation are adjusted using the method described in the "Calculation Method" Details section of the LOGISTIC documentation. The observations are then classified according to the cutoff(s) specified in the PPROB= option and the classification table is produced. The following code produces a classification table for ten different cutoffs: pˆ 0.05, pˆ 0.10,..., pˆ 0.50: proc logistic data=age descending; model y = age / ctable pprob=(0.05 to 0.5 by 0.05); The table is displayed in Output 6 below. Classification Table Correct Incorrect Percentages Prob Non- Non- Sensi- Speci- False False Level Event Event Event Event Correct tivity ficity POS NEG Output 6. Bias-adjusted Classification Table 9

10 Event in this table always refers to Ordered Value 1 DISEASE in this example. The two columns labeled Correct give the number of correctly classified events and nonevents. The Incorrect Event column gives the number of nonevents incorrectly classified as events and the Incorrect Nonevent column gives the number of events incorrectly classified as nonevents. The Percentage Correct column shows that a pˆ cutoff in the 0.20 to 0.25 range results in the maximum correct classification rate of 90%. For the 0.20 cutoff, the Correct Event column shows that all 30 diseased individuals were correctly classified. However, the 0.25 cutoff incorrectly classifies one diseased individual as undiseased for a false negative rate of 1/62 = 1.6%. When 0.20 is used as the cutoff, the Incorrect Event column shows that 10 of the 70 undiseased individuals are incorrectly classified as diseased for a false positive rate of 10/40 = 25%. When 0.25 is used as the cutoff, 9 undiseased individuals are incorrectly classified as diseased for a false positive rate of 9/38 = 23.7%. Your selection of a cutoff may not be merely a choice of the maximum correct classification rate. Depending on the relative costs of false positives and false negatives, you may opt to use a cutoff that gives a slightly lower correct classification rate in order to minimize cost. Notice when using the 0.5 cutoff, the percentage correct is 86% compared to the 87% found with the previous method that does not adjust for bias. 5. Classifying New Observations Bias is not a problem if the observations that you wish to classify are not those used to fit the model. You can use LOGISTIC to classify a separate set of observations using a previously constructed model. For instance, suppose you wish to know how well the disease-age model, using the pˆ>=0.25 decision rule, would classify a set of ten new individuals. You can do this by creating a new data set (called AGE2) containing these observations, and, optionally, their actual responses. Then use the following statements to set the response variable to missing. If you have the actual responses for these observations, create a separate variable (called YACTUAL) to hold that information. data unknown; set age2; yactual=y; y=.; Then, add these observations to the original disease-age data set (AGE) to create a new data set: data both; set unknown age; If you input data set BOTH to LOGISTIC, the procedure ignores the first ten observations, because it always ignores observations which contain missing values. As a result, with the following statements, LOGISTIC uses exactly the same observations as was used originally and produces exactly the same results as in Output 3 above. proc logistic data=both descending; model y = age; output out=probs predicted=phat; However, the output data set (PROBS) created by the OUTPUT statement contains predicted probabilities for all observations in the input data set including the first ten. Figure 4 shows our ten new observations as they appear in data sets AGE2, UNKNOWN and PROBS. Now, proceed exactly as in Section 4.1: data probs; set probs; pred_dis=0; if phat>0.25 then pred_dis=1; If you did not have the actual response values for these individuals, you could still print the first ten observations of data set PROBS to see the predicted classification for each individual 10

11 - AGE UNKNOWN PROBS AGE Y AGE YACTUAL Y AGE YACTUAL PHAT Figure 4. Data sets leading to predicted probabilities as shown at the right of Figure 4. Since we do have the actual response values, we can continue with these statements: proc freq data=probs; tables yactual*pred_dis / norow nocol nopercent; to produce a classification table for the ten new observations as shown in Output 7 below. YACTUAL PRED_DIS Frequency 0 1 Total Total Frequency Missing = 100 Output 7. Classification table of new observations observations in data sets BOTH and PROBS. Notice that all ten new observations were correctly classified by the model using 0.25 as the cutoff. Summary To properly interpret the results from the LOGISTIC procedure, it is important to pay close attention to response level ordering. Parameter estimates, odd ratios and predicted probabilities may all seem backward if the event of interest is not Ordered Value 1 in the Response Profile of the LOGISTIC output. There are several methods available for adjusting the response level ordering to your needs. With the response ordering set appropriately, predicted probabilities for existing or new observations may be obtained from the parameter estimates, or more easily, by using the OUTPUT statement. With the CTABLE and PPROB= options, you can select a cutoff value to apply to the predicted probabilities, allowing you to classify the observations and create an observed-by-predicted classification table. This classification table shows the classification results of only our ten new observations, since YACTUAL is missing for the original 11

12 References Agresti, A. (1990), Categorical Data Analysis, New York: John Wiley & Sons, Inc. Agresti, A. (1984), Analysis of Ordinal Categorical Data, New York: John Wiley & Sons, Inc. Freeman, D.H., Jr. (1987), Applied Categorical Data Analysis, New York: Marcel Dekker, Inc. Hosmer, D.W., Jr. and Lemeshow, S. (1989), Applied Logistic Regression, New York: John Wiley & Sons, Inc. SAS Institute, Inc. (1990), SAS/STAT User s Guide, Volume 2, GLM-VARCOMP, Version 6, Fourth Edition, Cary, NC: SAS Institute, Inc. SAS Institute, Inc. (1992), SAS Technical Report P-229, SAS/STAT Software: Changes and Enhancements, Release 6.07, Cary, NC: SAS Institute, Inc. 12

### A Tutorial on Logistic Regression

A Tutorial on Logistic Regression Ying So, SAS Institute Inc., Cary, NC ABSTRACT Many procedures in SAS/STAT can be used to perform logistic regression analysis: CATMOD, GENMOD,LOGISTIC, and PROBIT. Each

### PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY ABSTRACT Keywords: Logistic. INTRODUCTION This paper covers some gotchas in SAS R PROC LOGISTIC. A gotcha

### Application of Odds Ratio and Logistic Models in Epidemiology and Health Research

Application of Odds Ratio and Logistic Models in Epidemiology and Health Research By: Min Qi Wang, PhD; James M. Eddy, DEd; Eugene C. Fitzhugh, PhD Wang, M.Q., Eddy, J.M., & Fitzhugh, E.C.* (1995). Application

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

### Lecture 20: Logit Models for Multinomial Responses

Lecture 20: Logit Models for Multinomial Responses Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

### Logistic Regression With SAS

Logistic Regression With SAS Please read my introductory handout on logistic regression before reading this one. The introductory handout can be found at. Run the program LOGISTIC.SAS from my SAS programs

### USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique

### Lecture 12: Generalized Linear Models for Binary Data

Lecture 12: Generalized Linear Models for Binary Data Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of

### Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

### Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

### Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

### SUGI 29 Statistics and Data Analysis

Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,

### PROC FREQ IS MORE THAN JUST SIMPLY GENERATING A 2-BY-2 TABLE

PROC FREQ IS MORE THAN JUST SIMPLY GENERATING A 2-BY-2 TABLE Wuchen Zhao, University of Southern California, Los Angeles, CA ABSTRACT The FREQ procedure is one of the most commonly-used statistical procedures

### Logistic Regression. Introduction. The Purpose Of Logistic Regression

Logistic Regression...1 Introduction...1 The Purpose Of Logistic Regression...1 Assumptions Of Logistic Regression...2 The Logistic Regression Equation...3 Interpreting Log Odds And The Odds Ratio...4

### STAT 5817: Logistic Regression and Odds Ratio

A cohort of people is a group of people whose membership is clearly defined. A prospective study is one in which a cohort of people is followed for the occurrence or nonoccurrence of specified endpoints

### Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY ABSTRACT PROC FREQ is an essential procedure within BASE

### 1/2/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 When and why do we use logistic regression? Binary Multinomial Theory behind logistic regression Assessing the model Assessing predictors

### Statistics and Data Analysis

NESUG 27 PRO LOGISTI: The Logistics ehind Interpreting ategorical Variable Effects Taylor Lewis, U.S. Office of Personnel Management, Washington, D STRT The goal of this paper is to demystify how SS models

### Beginning Tutorials. PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI OVERVIEW.

Paper 69-25 PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI ABSTRACT The FREQ procedure can be used for more than just obtaining a simple frequency distribution

### Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

### FREQ-OUT: An Applied Presentation of the Options and Output of the FREQ Procedure. Pamela Landsman MPH, Merck & Co., Inc, West Point, PA

FREQ-OUT: An Applied Presentation of the Options and Output of the FREQ Procedure Pamela Landsman MPH, Merck & Co., Inc, West Point, PA Abstract: Have you ever been told compare the rate of death by gender,

Chapter 39 The LOGISTIC Procedure Chapter Table of Contents OVERVIEW...1903 GETTING STARTED...1906 SYNTAX...1910 PROCLOGISTICStatement...1910 BYStatement...1912 CLASSStatement...1913 CONTRAST Statement.....1916

### Two Correlated Proportions (McNemar Test)

Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

### Chapter 6: Answers. Omnibus Tests of Model Coefficients. Chi-square df Sig Step Block Model.

Task Chapter 6: Answers Recent research has shown that lecturers are among the most stressed workers. A researcher wanted to know exactly what it was about being a lecturer that created this stress and

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

### ANNOTATED OUTPUT--SPSS Logistic Regression

Logistic Regression Logistic regression is a variation of the regression model. It is used when the dependent response variable is binary in nature. Logistic regression predicts the probability of the

### Data Analysis for categorical variables and its application to happiness studies

Data Analysis for categorical variables and its application to happiness studies Thanawit Bunsit Department of Economics, University of Bath The economics of happiness and wellbeing workshop: building

### Logistic Regression: Use & Interpretation of Odds Ratio (OR)

Logistic Regression: Use & Interpretation of Odds Ratio (OR) Fu-Lin Wang, B.Med.,MPH, PhD Epidemiologist Adjunct Assistant Professor Fu-lin.wang@gov.ab.ca Tel. (780)422-1825 Surveillance & Assessment Branch,

### Linear Regression in SPSS

Linear Regression in SPSS Data: mangunkill.sav Goals: Examine relation between number of handguns registered (nhandgun) and number of man killed (mankill) checking Predict number of man killed using number

### Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc

ABSTRACT Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc Logistic regression may be useful when we are trying to model a

### Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, Last revised March 28, 2015

Using Stata 11 & higher for Logistic Regression Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised March 28, 2015 NOTE: The routines spost13, lrdrop1, and extremes are

### Lecture 19: Conditional Logistic Regression

Lecture 19: Conditional Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina

### 5. Ordinal regression: cumulative categories proportional odds. 6. Ordinal regression: comparison to single reference generalized logits

Lecture 23 1. Logistic regression with binary response 2. Proc Logistic and its surprises 3. quadratic model 4. Hosmer-Lemeshow test for lack of fit 5. Ordinal regression: cumulative categories proportional

### Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

Paper 264-26 Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc. Abstract: There are several procedures in the SAS System for statistical modeling. Most statisticians who use the SAS

### (Logistic Regression) 2 *! %\$%!&2.!0(%%##%#\$" 2 \$ (dichotomy) %%!0(%-"! #\$""\$1 \$%,#*"./'(0(6 *-16*7" 22% "1 / 4"2'(0(-17")* "(4(*:#5* \$" \$2

(Logistic Regression)!"#\$ %!&'()* *%!+,-*("./.\$ Multiple Regression +%!& %%!*"0(%-"! %\$..(/" + 2 *! %\$%!&2.!0(%%##%#\$" 2 \$ (dichotomy) %%!0(%-"! #\$""\$ '()*\$%#\$"#0( 2 \$(4( 54( \$#%)#*" \$"-\$(*( \$%,#*"./'(0(6

### LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

### SAS Programmer s check list-quick checks to be done before the statistical reports go off the SAS Programmer s table.

PharmaSUG2010 - Paper IB11 SAS Programmer s check list-quick checks to be done before the statistical reports go off the SAS Programmer s table. Thomas T. Joseph, Independent Consultant Babruvahan Hottengada,

### How to set the main menu of STATA to default factory settings standards

University of Pretoria Data analysis for evaluation studies Examples in STATA version 11 List of data sets b1.dta (To be created by students in class) fp1.xls (To be provided to students) fp1.txt (To be

### Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey

Using logistic regression for validating or invalidating initial statewide cut-off scores on basic skills placement tests at the community college level Abstract Charles Secolsky County College of Morris

### Rotation Matrices. Suppose that 2 R. We let

Suppose that R. We let Rotation Matrices R : R! R be the function defined as follows: Any vector in the plane can be written in polar coordinates as rcos, sin where r 0and R. For any such vector, we define

### Lecture 02: Statistical Inference for Binomial Parameters

Lecture 02: Statistical Inference for Binomial Parameters Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University

### Biostatistics 305. Multinomial logistic regression

B a s i c S t a t i s t i c s F o r D o c t o r s Singapore Med J 2005; 46(6) : 259 CME Article Biostatistics 305. Multinomial logistic regression Y H Chan Multinomial logistic regression is the extension

### Displaying and comparing correlated ROC curves with the SAS System

Displaying and comparing correlated ROC curves with the SAS System Barbara Schneider, University of Vienna, Austria ABSTRACT Diagnosis is an essential part of clinical practice. Much medical research is

### Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

### Lecture 22: Introduction to Log-linear Models

Lecture 22: Introduction to Log-linear Models Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina

### Paper Beyond Breslow-Day: Homogeneity Across R x C Tables ABSTRACT INTRODUCTION SAMPLE DATA K 2 2 TABLES

Paper 74949 Beyond Breslow-Day: Homogeneity Across R x C Tables Ginny P. Lai, David R. Mink, David J. Pasta, ICON Late Phase & Outcomes Research, San Francisco, CA ABSTRACT In the epidemiological world,

### Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

### Logistic Regression: Basics

Evidence is no evidence if based solely on p value Logistic Regression: Basics Prediction Model: Binary Outcomes Nemours Stats 101 Laurens Holmes, Jr. General Linear Model OUTCOME Continuous Counts Data

### Available online at Math. Finance Lett. 2014, 2014:2 ISSN A RULE OF THUMB FOR REJECT INFERENCE IN CREDIT SCORING

Available online at http://scik.org Math. Finance Lett. 2014, 2014:2 ISSN 2051-2929 A RULE OF THUMB FOR REJECT INFERENCE IN CREDIT SCORING GUOPING ZENG AND QI ZHAO Think Finance, 4150 International Plaza,

### Testing Research and Statistical Hypotheses

Testing Research and Statistical Hypotheses Introduction In the last lab we analyzed metric artifact attributes such as thickness or width/thickness ratio. Those were continuous variables, which as you

### Binary Logistic Regression with SPSS

Binary Logistic Regression with SPSS Logistic regression is used to predict a categorical (usually dichotomous) variable from a set of predictor variables. With a categorical dependent variable, discriminant

### In order to master the techniques explained here it is vital that you undertake plenty of practice exercises so that they become second nature.

Indices or Powers A knowledge of powers, or indices as they are often called, is essential for an understanding of most algebraic processes. In this section of text you will learn about powers and rules

### REGRESSION LINES IN STATA

REGRESSION LINES IN STATA THOMAS ELLIOTT 1. Introduction to Regression Regression analysis is about eploring linear relationships between a dependent variable and one or more independent variables. Regression

### Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

### ABSTRACT INTRODUCTION

Paper SP03-2009 Illustrative Logistic Regression Examples using PROC LOGISTIC: New Features in SAS/STAT 9.2 Robert G. Downer, Grand Valley State University, Allendale, MI Patrick J. Richardson, Van Andel

### Salary. Cumulative Frequency

HW01 Answering the Right Question with the Right PROC Carrie Mariner, Afton-Royal Training & Consulting, Richmond, VA ABSTRACT When your boss comes to you and says "I need this report by tomorrow!" do

### Logistic Regression. Introduction CHAPTER The Logistic Regression Model 14.2 Inference for Logistic Regression

Logistic Regression Introduction The simple and multiple linear regression methods we studied in Chapters 10 and 11 are used to model the relationship between a quantitative response variable and one or

### 1 Logit & Probit Models for Binary Response

ECON 370: Limited Dependent Variable 1 Limited Dependent Variable Econometric Methods, ECON 370 We had previously discussed the possibility of running regressions even when the dependent variable is dichotomous

### Loglinear Models. Angela Jeansonne. Brief History. Overview. This page last updated

Loglinear Models Angela Jeansonne This page last updated Brief History Until the late 1960 s, contingency tables - two-way tables formed by cross classifying categorical variables - were typically analyzed

### SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

### One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

### COMP9417: Machine Learning and Data Mining

COMP9417: Machine Learning and Data Mining A note on the two-class confusion matrix, lift charts and ROC curves Session 1, 2003 Acknowledgement The material in this supplementary note is based in part

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### Homework 5 Solutions

Homework 5 Solutions 4.2: 2: a. 321 = 256 + 64 + 1 = (01000001) 2 b. 1023 = 512 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = (1111111111) 2. Note that this is 1 less than the next power of 2, 1024, which

### Algebra 1A and 1B Summer Packet

Algebra 1A and 1B Summer Packet Name: Calculators are not allowed on the summer math packet. This packet is due the first week of school and will be counted as a grade. You will also be tested over the

### Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

### Categorical Data Analysis: Logistic Regression

Categorical Data Analysis: Logistic Regression Haitao Chu, M.D., Ph.D. University of Minnesota H. Chu (UM) PubH 7406: Advanced Regression 1 / 30 5.1 Model Interpretation 5.1.1 Model Interpretation The

### Sampling Distributions and the Central Limit Theorem

135 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics Chapter 10 Sampling Distributions and the Central Limit Theorem In the previous chapter we explained

### Selecting a Sub-set of Cases in SPSS: The Select Cases Command

Selecting a Sub-set of Cases in SPSS: The Select Cases Command When analyzing a data file in SPSS, all cases with valid values for the relevant variable(s) are used. If I opened the 1991 U.S. General Social

### Solutions of Linear Equations in One Variable

2. Solutions of Linear Equations in One Variable 2. OBJECTIVES. Identify a linear equation 2. Combine like terms to solve an equation We begin this chapter by considering one of the most important tools

### Instructions for carrying out statistical procedures and tests using SPSS

Instructions for carrying out statistical procedures and tests using SPSS These instructions are closely linked to the author s book: Essential Statistics for the Pharmaceutical Sciences John Wiley & Sons

### Module 10: Mixed model theory II: Tests and confidence intervals

St@tmaster 02429/MIXED LINEAR MODELS PREPARED BY THE STATISTICS GROUPS AT IMM, DTU AND KU-LIFE Module 10: Mixed model theory II: Tests and confidence intervals 10.1 Notes..................................

### 3.2 Matrix Multiplication

3.2 Matrix Multiplication Question : How do you multiply two matrices? Question 2: How do you interpret the entries in a product of two matrices? When you add or subtract two matrices, you add or subtract

### Dealing with Missing Data for Credit Scoring

ABSTRACT Paper SD-08 Dealing with Missing Data for Credit Scoring Steve Fleming, Clarity Services Inc. How well can statistical adjustments account for missing data in the development of credit scores?

### Raul Cruz-Cano, HLTH653 Spring 2013

Multilevel Modeling-Logistic Schedule 3/18/2013 = Spring Break 3/25/2013 = Longitudinal Analysis 4/1/2013 = Midterm (Exercises 1-5, not Longitudinal) Introduction Just as with linear regression, logistic

### Weight of Evidence Module

Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,

### Introduction to Stata

Introduction to Stata September 23, 2014 Stata is one of a few statistical analysis programs that social scientists use. Stata is in the mid-range of how easy it is to use. Other options include SPSS,

### Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.) Logistic regression generalizes methods for 2-way tables Adds capability studying several predictors, but Limited to

### SPSS Workbook 1 Data Entry : Questionnaire Data

TEESSIDE UNIVERSITY SCHOOL OF HEALTH & SOCIAL CARE SPSS Workbook 1 Data Entry : Questionnaire Data Prepared by: Sylvia Storey s.storey@tees.ac.uk SPSS data entry 1 This workbook is designed to introduce

### Lecture 14: GLM Estimation and Logistic Regression

Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

### Cool Tools for PROC LOGISTIC

Cool Tools for PROC LOGISTIC Paul D. Allison Statistical Horizons LLC and the University of Pennsylvania March 2013 www.statisticalhorizons.com 1 New Features in LOGISTIC ODDSRATIO statement EFFECTPLOT

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### The Distributive Property

Mastering algebra is an important requisite of understanding calculus. The motivation for this review is to refresh algebraic concepts that might not have been encountered for some time, and to try to

### The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

### ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

### A Brief Primer on Matrix Algebra

A Brief Primer on Matrix Algebra A matrix is a rectangular array of numbers whose individual entries are called elements. Each horizontal array of elements is called a row, while each vertical array is

### COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Binary numbers The reason humans represent numbers using decimal (the ten digits from 0,1,... 9) is that we have ten fingers. There is no other reason than that. There is nothing special otherwise about

### PASS Sample Size Software

Chapter 250 Introduction The Chi-square test is often used to test whether sets of frequencies or proportions follow certain patterns. The two most common instances are tests of goodness of fit using multinomial

### Wednesday PM. Multiple regression. Multiple regression in SPSS. Presentation of AM results Multiple linear regression. Logistic regression

Wednesday PM Presentation of AM results Multiple linear regression Simultaneous Stepwise Hierarchical Logistic regression Multiple regression Multiple regression extends simple linear regression to consider

### Generalized Linear Mixed Modeling and PROC GLIMMIX

Generalized Linear Mixed Modeling and PROC GLIMMIX Richard Charnigo Professor of Statistics and Biostatistics Director of Statistics and Psychometrics Core, CDART RJCharn2@aol.com Objectives First ~80

### 2. Perform elementary row operations to get zeros below the diagonal.

Gaussian Elimination We list the basic steps of Gaussian Elimination, a method to solve a system of linear equations. Except for certain special cases, Gaussian Elimination is still state of the art. After

### Equations and Inequalities

Rational Equations Overview of Objectives, students should be able to: 1. Solve rational equations with variables in the denominators.. Recognize identities, conditional equations, and inconsistent equations.

### The three statistics of interest comparing the exposed to the not exposed are:

Review: Common notation for a x table: Not Exposed Exposed Disease a b a+b No Disease c d c+d a+c b+d N Warning: Rosner presents the transposed table so b and c have their meanings reversed in the text.

### Generalized Linear Model Theory

Appendix B Generalized Linear Model Theory We describe the generalized linear model as formulated by Nelder and Wedderburn (1972), and discuss estimation of the parameters and tests of hypotheses. B.1