Statistics and Data Analysis


NESUG 2007

PROC LOGISTIC: The Logistics Behind Interpreting Categorical Variable Effects
Taylor Lewis, U.S. Office of Personnel Management, Washington, DC

ABSTRACT

The goal of this paper is to demystify how SAS models (a.k.a. parameterizes) categorical variables in PROC LOGISTIC. Specifically, readers will become more familiar with the commonly used effect and reference parameterizations. In conjunction with these two parameterizations and associated options, this paper touches on issues such as why SAS needs to create dummy variables for the k distinct categories and why the output displays estimates for only k - 1 parameters. At the conclusion of the paper, readers should feel more confident interpreting a categorical variable's effect on the response as well as testing for significance, by way of the odds ratios computed from the output or via the CONTRAST statement. Discussion uses real-world data from the U.S. Office of Personnel Management, collected for a multiple logistic regression model project whereby the likelihood of a promotion for Federal civilian employees was modeled using personnel data.

BACKGROUND

PROC LOGISTIC is the SAS/STAT procedure that allows users to model and analyze factors affecting the outcome of a dichotomous response variable, one in which an event or nonevent can occur. After some initial derivations to linearize this modeling process (the details of which are not a concern of this paper), the end result involves computing the log-odds, or logits, and producing a logit function, $L(X)$, modeled as follows:

\[ L(X) = \log\left(\frac{P(\text{event} \mid x)}{P(\text{nonevent} \mid x)}\right) = \beta_0 + \beta_1 x \]

In the instance of a continuous variable, $\beta_1$ has the interpretation of the increase in the log-odds, given a one-unit increase in the variable $x$. Exponentiate this model parameter estimate, $\exp(\beta_1)$, and you have the more readily interpretable multiplicative change in the odds themselves (no more logarithms), given that one-unit increase in $x$.
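As a quick numerical illustration of the exponentiation step, the DATA step below converts a slope on the log-odds scale into a multiplier on the odds. The slope value 0.25 and the variable names are made up for illustration; they are not estimates from this paper.

```sas
/* Sketch: turn a log-odds slope into a change in the odds.        */
/* beta1 = 0.25 is a hypothetical value, not taken from the paper. */
data _null_;
   beta1 = 0.25;
   odds_multiplier   = exp(beta1);       /* effect of a one-unit increase in x  */
   odds_multiplier_5 = exp(5 * beta1);   /* effect of a five-unit increase in x */
   put beta1= odds_multiplier= odds_multiplier_5=;
run;
```

Because the model is linear on the log-odds scale, a c-unit increase in x multiplies the odds by exp(c * beta1), which is why the five-unit case exponentiates 5 * beta1 rather than multiplying exp(beta1) by 5.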
The plot thickens, however, when the predictor variable of interest is categorical in nature, rather than continuous. A series of design, or dummy, variables must be created for the different levels of the categorical variable, and interpretations and tests of significance can quickly become more involved. Lucky for us, PROC LOGISTIC performs a lot of the nitty-gritty modeling work behind the scenes, but it is imperative to first understand the varying SAS parameterization schemes available before utilizing the PROC's options and output to guide SAS in producing exactly what is desired.

EFFECT CODING: THE DEFAULT PARAMETERIZATION

Through the course of this paper, we will consider a personnel data extract of nearly 6, Federal employees used to model the likelihood of promotion over a one-year period. The SAS data set PROM contains, for each employee, the variable PROMOTION, given as 1 if a promotion occurred and 0 if not. The predictor variable to be investigated is education level attainment, EDLEVEL, consisting of four groups of employees: A, high school diploma or equivalent; B, bachelor's degree; C, master's degree; and D, Ph.D. To initially model education, we invoke PROC LOGISTIC with the following syntax:

    PROC LOGISTIC data=prom descending;
       CLASS edlevel;
       MODEL promotion = edlevel;
    RUN;

A note about the descending option in the PROC LOGISTIC statement: by default, SAS will first try to model the probability that the variable PROMOTION = 0. Recall that our data has a promotion indicated by a 1, and discussion makes more sense when talking about likelihood of promotion as opposed to likelihood of not being promoted. This option is a quick way to reverse the SAS default.

We immediately note from the Analysis of Maximum Likelihood Estimates section of the output that parameter estimates are given for EDLEVEL A, B, and C, but not D:

    Analysis of Maximum Likelihood Estimates

    Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept                                                        <.0001
    EDLEVEL A                                                        <.0001
    EDLEVEL B                                                        <.0001
    EDLEVEL C

We also note there is a Class Level Information section with a curious matrix of 1s, 0s and -1s:

    Class Level Information

    Class      Value    Design Variables
    EDLEVEL    A         1    0    0
               B         0    1    0
               C         0    0    1
               D        -1   -1   -1

This parameterization scheme is PROC LOGISTIC's default effect coding of dummy variables. SAS sorts the class variable's value list and assigns dummy variables for one less than the number of distinct values, omitting the last category. The number of columns under the Design Variables heading indicates the count of dummy variables created. An initial roadblock with this scheme is that the parameter estimates of the dummy variables are not directly interpretable; they are a measure of the difference between the classification level's effect and the average effect across all levels. Notice, however, there is an Odds Ratio Estimates section in the output:

    Odds Ratio Estimates

    Effect            Point Estimate   95% Wald Confidence Limits
    EDLEVEL A vs D
    EDLEVEL B vs D
    EDLEVEL C vs D

For any logistic regression model without interaction terms, SAS computes a series of odds ratios and confidence limits for each class variable. It is important to review how these odds ratios are computed, since SAS will not output all possible comparisons of interest. From the Design Variables section of Class Level Information, the first, second, and third columns correspond to the dummy variables for groups A, B, and C, which are all of the dummy variables in the model. Each row can be thought of as the sequence of coefficients to be placed in front of the dummy variable parameter estimates to arrive at a logit function estimate for that particular level. For instance, the row of -1s for the last group, D, corresponds to a logit function of $\beta_0 + (-1)\beta_A + (-1)\beta_B + (-1)\beta_C$, or $\beta_0 - \beta_A - \beta_B - \beta_C$. Assume we want to investigate the odds of promotion between groups A and D.
Our log-odds difference of interest is

\[ L(A) - L(D) = (\beta_0 + \beta_A) - (\beta_0 - \beta_A - \beta_B - \beta_C) = 2\beta_A + \beta_B + \beta_C = 0.7568 \]

and the odds ratio turns out to be $\exp(0.7568) = 2.13$, exactly as seen in the first row of the Odds Ratio Estimates output. This says the odds of promotion for those educated at the high school level are more than double those of the Ph.D. level.

Knowing how the odds ratios are calculated gives us greater flexibility to compare, say, two levels within a classification variable that do not happen to be listed in the Odds Ratio Estimates output. For instance, we may wish to investigate a statistical difference between group A, high school graduates, and group B, bachelor's degrees. We note from the output how close the maximum likelihood parameter estimates for the two groups are and further reason the model could be simplified if we could collapse groups A and B into one group. For the two groups, we take coefficients from the first and second rows of the Class Level Information matrix to arrive at the following:

\[ L(A) - L(B) = (\beta_0 + \beta_A) - (\beta_0 + \beta_B) = \beta_A - \beta_B \]

We observe this logit difference is approximately zero, and $\exp(0) = 1$. With an odds ratio of 1, the probabilities of promotion between the two groups are roughly the same, so it is not necessary for the model to distinguish between them. It may prove easier to collapse groups A and B together into one category covering all employees who have attained a bachelor's degree or less.

REFERENCE CODING: AN ALTERNATIVE PARAMETERIZATION

While there are situations where effect coding is preferable, SAS allows users to change this setting to other parameterizations. A second useful coding scheme is called reference coding, where one level of the classification variable is designated as the reference level to which parameter estimates for the remaining levels are directly comparable. Under this coding scheme, the exponentiated parameter estimate of a level is interpreted as the odds ratio between that level and the reference level. Hence, it would make sense to assign as the reference level any particular level we wanted to pit against all others.

Suppose we were interested in reporting the effect of education level on promotion likelihood and wanted to compare, individually, those who had obtained a bachelor's, master's, and Ph.D. with those holding a high school diploma. We can use additional CLASS statement options to reference parameterize EDLEVEL with group A as the reference category:

    PROC LOGISTIC data=prom desc;
       CLASS edlevel(param=ref ref='A');
       MODEL promotion = edlevel;
    RUN;

In parentheses after the listed CLASS variable, param=ref overrides the default param=effect and ref='A' designates the high school level to be the reference.
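To make the two coding schemes concrete, the sketch below builds by hand the design variables that the CLASS statement generates behind the scenes. It assumes EDLEVEL is a character variable taking the values 'A' through 'D' with no missing values; the data set name PROM_CODED and the EFF_ and REF_ variable names are hypothetical.

```sas
/* Sketch: hand-built design variables for EDLEVEL.                     */
/* Assumes EDLEVEL is character, coded 'A'-'D', with no missing values. */
/* PROM_CODED, EFF_* and REF_* are hypothetical names.                  */
data prom_coded;
   set prom;
   /* Effect coding (the default): the last level, D, gets -1 in every column */
   eff_a = (edlevel = 'A') - (edlevel = 'D');
   eff_b = (edlevel = 'B') - (edlevel = 'D');
   eff_c = (edlevel = 'C') - (edlevel = 'D');
   /* Reference coding with A as the reference level: A is 0 in every column */
   ref_b = (edlevel = 'B');
   ref_c = (edlevel = 'C');
   ref_d = (edlevel = 'D');
run;

/* Fits the reference-coded model without a CLASS statement */
proc logistic data=prom_coded descending;
   model promotion = ref_b ref_c ref_d;
run;
```

Since the design variables match those produced by param=ref with ref='A', the estimates from this fit should agree with the CLASS-statement version; swapping in eff_a, eff_b and eff_c should likewise reproduce the effect-coded fit.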
Other ref options are LAST, the default, which sorts the distinct variable levels and sets the last level as the reference, and FIRST, which sorts and sets the first value in the list as the reference. Interestingly, the ref option in the CLASS statement is also available under the effect parameterization; it determines what level gets the -1 row of dummy variable coefficients and, thus, what group is compared to all others in the Odds Ratio Estimates portion of the output.

Looking at the output, we note some differences in the Analysis of Maximum Likelihood Estimates and Class Level Information matrix from what we initially saw under the effect parameterization:

    Analysis of Maximum Likelihood Estimates

    Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept                                                        <.0001
    EDLEVEL B
    EDLEVEL C                                                        <.0001
    EDLEVEL D                                                        <.0001

    Class Level Information

    Class      Value    Design Variables
    EDLEVEL    A         0    0    0
               B         1    0    0
               C         0    1    0
               D         0    0    1

In terms of the parameter estimates, notice how no dummy variable is created for the reference group, as the three other groups' estimates are interpreted as the difference in the log-odds from that first group. The parameter estimate for EDLEVEL group B suggests a small, nearly zero increase in the log-odds compared to group A. This is precisely the conclusion we drew under the effect coding. This should serve as an affirmation that PROC LOGISTIC can take more than one path to arrive at a given conclusion. The path chosen can be whichever is most comfortable for the analyst. Rest assured, we are still able to compute odds ratios by hand from the Class Level Information matrix by plugging in the appropriate dummy variables:

\[ L(A) - L(B) = (\beta_0) - (\beta_0 + \beta_B) = -\beta_B \approx 0 \]

Recall that our model parameter estimates under the reference coding have a new interpretation involving odds ratios related to the reference level, but they are still reported in the output as log-odds differences. To quickly convert these to odds ratios sans logarithms, we have the EXPB option available in the MODEL statement:

    MODEL promotion = edlevel / expb;

This adds a column to the end of the parameter estimates output:

    Analysis of Maximum Likelihood Estimates

    Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Exp(Est)
    Intercept                                                        <.0001
    EDLEVEL B
    EDLEVEL C                                                        <.0001
    EDLEVEL D                                                        <.0001

Again, this last column is simply the Estimate column exponentiated for quick reference. We observe how this agrees with the Odds Ratio Estimates section of the output, which is still created:

    Odds Ratio Estimates

    Effect            Point Estimate   95% Wald Confidence Limits
    EDLEVEL B vs A
    EDLEVEL C vs A
    EDLEVEL D vs A

THE CONTRAST STATEMENT

We have seen how we can compute basic odds ratios by hand. The limitation of these is that they lack confidence intervals on the estimates. We often want to check that the odds ratio estimate's confidence interval does not contain 1, for example.
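The reason a hand-computed odds ratio comes without a confidence interval is that the interval depends on the estimated covariance matrix of the parameter estimates, not only on the estimates themselves. As a sketch of the standard Wald calculation, let $c$ be the vector of coefficients applied to the parameters (for example, $c = (0, 2, 1, 1)$ for A vs. D under effect coding, intercept first) and let $\hat{V}$ be the estimated covariance matrix of $\hat{\beta}$, assumed here to have been requested with the COVB option of the MODEL statement:

\[ \widehat{OR} = \exp\left(c^{\top}\hat{\beta}\right), \qquad \text{95 percent limits: } \exp\left(c^{\top}\hat{\beta} \pm 1.96\sqrt{c^{\top}\hat{V}c}\right) \]

The interval excludes 1 exactly when the corresponding interval for the log-odds difference $c^{\top}\hat{\beta}$ excludes 0.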
The Odds Ratio Estimates output will contain confidence intervals, but only for the levels of a categorical variable compared to one particular reference level. Though we could re-run PROC LOGISTIC with differing reference levels to get additional odds ratio estimates and confidence intervals, we are still restricted to one-to-one comparisons. It may be prudent to investigate a difference between the average of two EDLEVEL groups compared with a reference group, or any other relevant combination of levels. To solve this dilemma, we can make use of the CONTRAST statement. It is in constructing these statements that familiarity with the Class Level Information matrix and with the effect versus reference parameterizations pays off. The general syntax of the CONTRAST statement is

    CONTRAST 'label' var-name dummy-coeff-1 <... dummy-coeff-n> </ options>;

After providing a label (required, since more than one CONTRAST statement is allowed), we define the variable name for which we are interested in constructing odds ratios. Immediately after that, we assign dummy coefficients by consulting the Class Level Information matrix.

Just as we did by hand, we can use the CONTRAST statement in a simple, one-to-one comparison to test the logit function difference between EDLEVEL A and D. Recall that under effect coding we had

\[ L(A) - L(D) = (\beta_0 + \beta_A) - (\beta_0 - \beta_A - \beta_B - \beta_C) = 2\beta_A + \beta_B + \beta_C \]

The CONTRAST statement syntax would then be

    CONTRAST 'EDLEVEL A vs. D' EDLEVEL 2 1 1 / estimate=both;

    Contrast Test Results

    Contrast           DF   Wald Chi-Square   Pr > ChiSq
    EDLEVEL A vs. D                           <.0001

    Contrast Rows Estimation and Testing Results

    Contrast           Type   Row   Estimate   Standard Error   Alpha   Confidence Limits   Wald Chi-Square
    EDLEVEL A vs. D    PARM
    EDLEVEL A vs. D    EXP

With no options in the CONTRAST statement, the only output is the global test of the null hypothesis that the difference in the logit functions is zero. We see here that the test statistic is large and so we have a significant result, but we do not know in which direction the odds are favored. The estimate=both option in the CONTRAST statement adds the value of the logit function difference in both log-odds terms (the Type=PARM line) and exponentiated odds ratio terms (the Type=EXP line). This is the same odds ratio we calculated earlier, and the 95% confidence interval, with a lower limit of 1.817, matches what was seen in the Odds Ratio Estimates section of the output.

Relating this to the reference parameterization with A as the reference level, we reason that the third dummy variable SAS created for EDLEVEL carries the log odds ratio of group D vs. group A. To invert this computation and make it comparable to the contrast above, testing -1 times this estimate produces the desired group A vs. group D odds ratio:

    CONTRAST 'EDLEVEL A vs. D' EDLEVEL 0 0 -1 / estimate=both;

Though we refrain from reprinting it, the syntax above produces exactly the same contrast output as the syntax under the effect parameterization of EDLEVEL.

We saw there was very little difference in the odds of promotion between EDLEVEL groups A and B, suggesting we could collapse the two groups to simplify the model. We could also employ the CONTRAST statement to jointly test whether groups A/B and C/D, respectively, could be collapsed. The two parts, or rows, of a contrast are separated by a comma. Staying with reference coding and A as the reference level, to test A vs. B you would have

\[ L(A) - L(B) = \beta_0 - (\beta_0 + \beta_B) = -\beta_B \]

Furthermore, to test C vs. D you would have

\[ L(C) - L(D) = (\beta_0 + \beta_C) - (\beta_0 + \beta_D) = \beta_C - \beta_D \]

So we painlessly determined the dummy variable coefficients necessary for the CONTRAST statement. This time we apply a few more options. The first is the estimate=exp option, which outputs only the exponentiated logit function difference (the odds ratio); the second is the e option, which outputs the vector of coefficients and corresponding dummy variables. Requesting it is good practice for double-checking that the contrast being calculated is what the analyst intended. Needless to say, changes to the reference level or parameterization scheme can quickly change what a sequence of coefficients is actually testing.

    contrast 'Joint A/B & C/D' edlevel -1 0 0, edlevel 0 1 -1 / e estimate=exp;

This produces the following output:

    Coefficients of Contrast Joint A/B & C/D

    Parameter    Row1   Row2
    Intercept      0      0
    EDLEVELB      -1      0
    EDLEVELC       0      1
    EDLEVELD       0     -1

    Contrast Test Results

    Contrast           DF   Wald Chi-Square   Pr > ChiSq
    Joint A/B & C/D                           <.0001

    Contrast Rows Estimation and Testing Results

    Contrast           Type   Row   Estimate   Standard Error   Alpha   Confidence Limits
    Joint A/B & C/D    EXP
    Joint A/B & C/D    EXP

After confirming that the Coefficients of Contrast are what we intended, we note that the Contrast Test Results section yields a test statistic which strongly suggests the contrast is not equal to zero. Virtually all of the deviation from zero is coming from the second part of the contrast, between group C and group D, as the odds ratio for that comparison is significantly greater than 1 (1.6117), while the group A vs. group B odds ratio is not significantly different from 1. At this point, we conclude that we cannot jointly collapse groups A with B and C with D.

CONCLUSION

This paper outlined two parameterization schemes for a logistic regression model in which the predictor variable is categorical. There are other parameterizations available within SAS for this PROC, but in the author's experience the effect and reference parameterizations are utilized most frequently.
At an initial glance at the unabridged output from a PROC LOGISTIC invocation, the sheer amount of output can make interpretation and analysis appear a daunting task. Yet after a little work picking out the relevant sections and tweaking the SAS code with a few added options, the task at hand can be quickly simplified, especially once one realizes how the various sections are interrelated.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Taylor Lewis
    U.S. Office of Personnel Management (OPM)
    1900 E St., NW, Room 7439
    Washington, DC 20415

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.


More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

More information


MORE ON LOGISTIC REGRESSION DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON LOGISTIC REGRESSION I. AGENDA: A. Logistic regression 1. Multiple independent variables 2. Example: The Bell Curve 3. Evaluation

More information


ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Regression step-by-step using Microsoft Excel

Regression step-by-step using Microsoft Excel Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression

More information

Logs Transformation in a Regression Equation

Logs Transformation in a Regression Equation Fall, 2001 1 Logs as the Predictor Logs Transformation in a Regression Equation The interpretation of the slope and intercept in a regression change when the predictor (X) is put on a log scale. In this

More information


LOGIT AND PROBIT ANALYSIS LOGIT AND PROBIT ANALYSIS A.K. Vasisht I.A.S.R.I., Library Avenue, New Delhi 110 012 In dummy regression variable models, it is assumed implicitly that the dependent variable Y

More information

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are

More information

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form. One-Degree-of-Freedom Tests Test for group occasion interactions has (number of groups 1) number of occasions 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.

More information

Pearson's Correlation Tests

Pearson's Correlation Tests Chapter 800 Pearson's Correlation Tests Introduction The correlation coefficient, ρ (rho), is a popular statistic for describing the strength of the relationship between two variables. The correlation

More information



More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Final Exam Practice Problem Answers

Final Exam Practice Problem Answers Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

Lecture 14: GLM Estimation and Logistic Regression

Lecture 14: GLM Estimation and Logistic Regression Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

More information


I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Tests for Two Survival Curves Using Cox s Proportional Hazards Model Chapter 730 Tests for Two Survival Curves Using Cox s Proportional Hazards Model Introduction A clinical trial is often employed to test the equality of survival distributions of two treatment groups.

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables

Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables Introduction In the summer of 2002, a research study commissioned by the Center for Student

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper October 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Generate a PL/SQL script for workflow deployment Denny Wong Oracle Data Mining Technologies 10 Van de Graff Drive Burlington,

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information


Taming the PROC TRANSPOSE Taming the PROC TRANSPOSE Matt Taylor, Carolina Analytical Consulting, LLC ABSTRACT The PROC TRANSPOSE is often misunderstood and seldom used. SAS users are unsure of the results it will give and curious

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information


INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

More information

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique

More information

Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015

Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015 Interaction effects and group comparisons Richard Williams, University of Notre Dame, Last revised February 20, 2015 Note: This handout assumes you understand factor variables,

More information

New SAS Procedures for Analysis of Sample Survey Data

New SAS Procedures for Analysis of Sample Survey Data New SAS Procedures for Analysis of Sample Survey Data Anthony An and Donna Watts, SAS Institute Inc, Cary, NC Abstract Researchers use sample surveys to obtain information on a wide variety of issues Many

More information

Automated Statistical Modeling for Data Mining David Stephenson 1

Automated Statistical Modeling for Data Mining David Stephenson 1 Automated Statistical Modeling for Data Mining David Stephenson 1 Abstract. We seek to bridge the gap between basic statistical data mining tools and advanced statistical analysis software that requires

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

Simple Linear Regression, Scatterplots, and Bivariate Correlation

Simple Linear Regression, Scatterplots, and Bivariate Correlation 1 Simple Linear Regression, Scatterplots, and Bivariate Correlation This section covers procedures for testing the association between two continuous variables using the SPSS Regression and Correlate analyses.

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

HLM software has been one of the leading statistical packages for hierarchical

HLM software has been one of the leading statistical packages for hierarchical Introductory Guide to HLM With HLM 7 Software 3 G. David Garson HLM software has been one of the leading statistical packages for hierarchical linear modeling due to the pioneering work of Stephen Raudenbush

More information


CREDIT SCORING MODEL APPLICATIONS: Örebro University Örebro University School of Business Master in Applied Statistics Thomas Laitila Sune Karlsson May, 2014 CREDIT SCORING MODEL APPLICATIONS: TESTING MULTINOMIAL TARGETS Gabriela De Rossi

More information


ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil. Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

More information

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

More information

One-Way Analysis of Variance

One-Way Analysis of Variance One-Way Analysis of Variance Note: Much of the math here is tedious but straightforward. We ll skim over it in class but you should be sure to ask questions if you don t understand it. I. Overview A. We

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

More information

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship

More information

XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables

XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables Contents Simon Cheng J. Scott Long

More information