# USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Size: px
Start display at page:

Download "USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA"

Transcription

1 USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique used to model the probability of discrete (i.e., binary or multinomial) outcomes. When properly applied, logistic regression analyses yield very powerful insights in to what attributes (i.e., variables) are more or less likely to predict event outcome in a population of interest. These models also show the extent to which changes in the values of the attributes may increase or decrease the predicted probability of event outcome. This approach to pattern recognition or data mining is particularly well suited to applied statistical analyses of consumer behavior. Logistic regression models are frequently employed to assess the chance that a customer will: a) re-purchase a product, b) remain a customer, or c) respond to a direct mail or other marketing stimulus. This paper discusses how logistic regression, and its implementation in the SAS/STAT module of the SAS System, can be employed in these situations. We will focus on concepts and implementation, rather than the statistical theory underlying logistic regression; the reader is encourage to consult the sources listed at the end of the paper for more information about the statistical theory and other details of logistic regression modeling. What is Logistic Regression? Logistic regression refers to statistical models where the dependent, or outcome variable, is categorical, rather than continuous. The logistic function maps or translates changes in the values of the continuous or dichotomous independent variables on the right-hand side of the equation to increasing or decreasing probability of the event modeled by the dependent, or left-hand-side, variable. These statements highlight the key difference between logistic regression and regular (more formally, ordinary least squares) regression: in logistic regression, the predicted value of the dependent variable being generated by operations on the right-hand-side variables is a probability. But, in ordinary least squares regression we are predicting the population mean value of the dependent variable at given levels (i.e., values) of the independent variable(s) in the model. Economists frequently call logistic regression a qualitative choice model, and for obvious reasons: a logistic regression model helps us assess probability which qualities or outcomes will be chosen (selected) by the population under analysis. When proper care is taken to create an appropriate dependent variable, logistic regression is often a superior (both substantively and statistically) alternative to other tools available to model event outcomes. The SAS System s implementation of logistic regression techniques include a wealth of tools for the analyst to use to first construct a model, and then test its ability to perform well in the population from which the data under analysis are assumed to be a random sample. Before proceeding further a clarification is warranted as to the nature of the dependent variable. In many analytic scenarios the customer (or other unit of analysis ) may have more than one choice available to them. For example, during open enrollment a health plan participant may be offered a chance to: maintain their current health care plan, choose one with a richer benefits mix, select one with fewer benefits, or select a plan from a competing provider. In this situation a multinomial logistic regression model might be employed, rather than a binary or dichotomous logistic regression model. While multinomial models can be implemented in both PROC LOGISTIC and PROC GENMOD, they are beyond the scope of this discussion, which will be limited to models where a binary (i.e., one outcome or the other ) outcome is to be analyzed. Why is Logistic Regression Often a Superior Alternative to Other Methods to Analyze Event Outcome? Other approaches besides logistic regression are often proposed to develop models of event outcome or qualitative choice. These include: using ordinary least squares regression with a binary (zero/one) dependent variable, discriminant analysis, and probit analysis. While each of these competing methods are applicable to the discrete choice modeling scenario, the logistic regression model is most often a superior approach for a number of statistical (as well as substantive reasons), including: The logistic regression equation limits generation of the predicted values of the dependent variable to lie in the interval between zero and one; whereas OLS regression often results in values of the dependent variable take on values of less than zero or greater than one,

2 which are substantively irrelevant and have no interpretative value. A simple transformation (exponentiation) of the logistic regression model s parameters leads to an easily interpretable and explainable quantity: the odds ratio. In more recent releases of SAS/STAT software the odds ratio is printed for each independent variable (parameter) in the model. A number of useful tests for assessing model adequacy and fit are available for logistic regression models. These include: measures similar to the coefficient of determination in OLS regression; a generalized test (Hosmer- Lemeshow) for lack of model fit, and the ability to develop tables which show the proportion of cases under analysis which have been correctly classified; false positive and negative rates; and the sensitivity and specificity of the model. Tests are also available for influential and other ill-fitting observations. Parameter estimates generated from a logistic regression model can be applied in a simple Data Step to the population of interest, this scoring, or creating a probability of event outcome for each member of the population. This score can then be used to select subsets of the population for various treatments as may be appropriate to the substantive issue under analysis. Implementation of logistic regression models in the SAS System in PROC LOGISTIC is very similar to how OLS regression models are implemented in PROCs REG and GLM. If you are already familiar with how to perform OLS regression in PROC REG then learning how to use PROC LOGISTIC for binary outcome modeling is a straightforward task. As with PROCs REG and GLM, separate output data sets containing predicted values and parameter estimates can be created for subsequent analysis or scoring. Construction of the Dependent Variable: a Critical Consideration In many analytic situations where logistic regression is the method of choice the analyst has several or more independent variables to use in the modeling process. These variables may be the result of some naturally occurring or observable aspect of the behavior of the units under analysis, or be constructed from other variables. For example, an analyst developing a model predicting re-enrollment in a health insurance plan may have data for each member s interaction with both the health plans administrative apparatus and health care utilization in the prior plan year. The analyst can, with appropriate prior knowledge of the substantive relevance of the data at hand, as well as knowledge of how to use the SAS System s Data Step to operate on variables in a SAS data set, obtain or construct variables which can be employed in the modeling process. These variables may include: number of times member called the health plan for information, number of physician office visits, whether or not the member changed primary care physicians during the previous plan year, and answers to a customer satisfaction survey. Using this example, it is clear that the analyst has an array of attributes (variables) from which to choose in developing models. What is not yet available is the variable which can employed as the outcome or dependent variable. In logistic regression analyses it is often the analyst s responsibility to construct the dependent variable based on an agreed-upon definition of what constitutes the event of interest which is being modeled. From a statistical standpoint, the dichotomous, nominal level dependent variable must be both discriminating and exhaustive. The variable has two levels (i.e., it is dichotomous), and the values of the variable can choose (or discriminate ) between observations in the population of interest. Finally, the dependent variable is considered exhaustive because each individual observation can be assigned to one, and only one, of its values. In most consumer behavior studies the dependent variable represents the analyst s operationalization of the construct of interest, rather than a phenomenon which naturally occurs in the environment from which the values of independent variables measured. Creation of the dependent variable therefore depends on the analyst s own understanding of the consumer phenomenon of interest, as well as any internal rules their organization follows when defining the outcome of interest. Again returning to the health care re-enrollment example from above, a health plan s management team may define attrition or failure to re-enroll as situations where a member fails to return the re-enrollment card within 30 days of its due date. Or, in a response modeling scenario, a direct mail firm may define non-response to an advertisement as failure to respond within 45 days of mailout. These rules are carefully implemented by the analyst during construction of the dependent variable in a Data Step. This often requires clear communication between the analyst and others in their organization as to how the rules will be translated in to SAS programming language statements that will construct the dependent

3 variable. In many applied situations construction of the dependent variable, once the rules for it have been agreedupon, requires extensive Data Step manipulation of variables and data sets. Failure to respond to a mail circular may, for example, be determined only by matchmerging two data sets, one with containing customer information for all persons to whom the mailing was sent, and another with just those who responded to it within the agreed-upon time frame. If a customer is found in the mailed to data set, but not found in the responses data set, the analyst may then use appropriate Data Step coding to classify the customer as a non respondent. Similarly, a customer who has been found on both data sets will be classified as a respondent to the mail campaign. From an applied standpoint, appropriate construction of the dependent variable cannot be overemphasized. Failure on the analyst s part to correctly understand both the construct (i.e., event) whose probability is to be modeled, as well as how to code the rules which operationalize the construct may result in models with limited substantive as well as statistical usefulness. Implementing and Interpreting a Logistic Regression Model We now turn to the implementation and interpretation of a logistic regression model. PROC LOGISTIC, in the SAS/STAT module, contains the tools necessary to apply a logistic regression model to a data set and assess its results. As with its ordinary least squares counterpart, PROC REG, the MODEL Statement in PROC LOGISTIC contains the dependent variable to the left of an equals sign (=) and the name(s) of the independent variable(s) on the right hand side of the equals sign. Options are placed to the right of a slash or stroke mark (/) following the last independent variable in the model. As with PROC REG, the optional OUTPUT statement in PROC LOGISTIC can be used to create an output SAS data set. This topic will be explored in detail below. Output from successful execution of a PROC LOGISTIC step in a SAS Software program yields a wealth of information about both the overall fit of the model, as well as the statistical significance of each of its parameters. The 2LOGL test is analogous to the global F test in OLS regression, and is used to determine the overall impact of the independent variables on predicting increasing or decreasing probability of event outcome. The local chi-square tests are the counterparts to the local t-tests in OLS regression, as the test whether or not the associated parameter is statistically significant. As mentioned previously, exponentiation of the parameter estimates yielded by PROC LOGISTIC yields a very useful and informative value: the odds ratio. This quantity represents the increasing or decreasing probability of event outcome per unit change in the value of the independent variable. The odds ratio is therefore a very valuable substantive, as well as statistical, outcome of applying a logistic regression model to a customer retention or other qualitative choice scenario. The odds ratio is: a) easily calculated (and printed by default by PROC LOGISTIC in recent releases of SAS/STAT software), and, b) easily interpreted by non-quantitative personnel who might use the results of a logistic regression model to plan a marketing or other campaign Since customer retention efforts also have an economic component (e.g., how much will it cost to keep a client? ), articulation of the results of a model in the context of the parameter s odds ratios makes it much easier for all decision makers in the customer retention effort to grasp the substantive impact of the parameters on the phenomenon under study. In my own experience, telling a financial institution s marketing manager if we add a credit card to a household s product portfolio, in addition to their checking and saving accounts, the probability we will keep them as a customer for another year increases by x percent, or being able to explain to an HMO s sales team that for every additional year the member has been with our HMO, the chance they will re-enroll increases by y percent permits the substantive relevance of a logistic regression model to be readily understood by those who will use its results to plan or modify strategy. Customized odds ratios, as well as confidence intervals for odds ratios are available from PROC LOGISTIC using the UNITS and RISKLIMITS options. By default, invocation of the RISKLIMITS option generates 95% confidence intervals around the odds ratios; users can specify other intervals using the ALPHA option in conjunction with the RISKLIMITS option. Selecting the Optimal Subset of Independent Variables In many modeling situations the analyst may have many variables to consider for inclusion in their model(s). In addition to relying on the substantive significance of the variables from which to choose, PROC LOGISTIC permits implementation of automated variable subset selection tools such as forward, backward and stepwise generation of optimal subsets. As with any other type of modeling effort, the results selected as best via an automated subset selection process implemented in PROC LOGISTIC should be carefully examined in both the statistical and substantive relevance of the variables chosen. In addition, both multicollinearity and influential observations are issues in logistic regression just as they are with ordinary least squares regression. Various measures for both conditions are available by specifying the INFLUENCE and IPLOTS options in PROC LOGISTIC. The reader is directed to pp

4 of SAS Institute publication SAS/STAT Software: Changes and Enhancements Through Release 6.12 (full citation below) for more information on diagnostic tests implemented via specification of these options. Choosing from Competing Models As with other forms of statistical modeling, the logistic regression modeling effort frequently yields more than one competing model of the phenomenon under study. Fortunately, PROC LOGISTIC provides several tests which help assess: a) how well the independent variables fit the model; and, b) the model s ability to correctly predict the outcome modeled by the dependent variable. Among the measures of model fit available in PROC LOGISTIC are the Hosmer-Lemeshow goodness of fit test (generated by the LACKFIT option in the MODEL statement) and several measures similar to the coefficient of determination obtained in ordinary least squares regression. These r-square like statistics are generated when the RSQUARE option is invoked. These measures generalize the coefficient of determination to the categorical modeling situation and help assess how well the independent variables fit the model. Another approach is to assess the sensitivity (proportion of observations with the condition of interest which are predicted by the model to have the condition of interest) and/or specificity (proportion of cases without the condition of interest which are predicted by the model to NOT have the condition of interest). In addition, the false positive and false negative rates of several competing models may also be of interest. These and other common measures of model adequacy are portrayed in classification tables generated by PROC LOGISTIC when the CTABLE option is specified. If the analyst has a specific (or range) of prior probabilities of event occurrence for which they want a classification table generated, the PPROB option results in limiting generation of the classification table to just those desired by the analyst. Implementing the Results of a Logistic Regression Model Once a model predicting customer retention has been selected (usually on the basis of a combination of statistical and substantive considerations), attention usually turns to the implementation of the model in the organization. In my experience, there are often two distinct aspects to how an organization applies the results of a model. The first aspect is dissemination of the model s results to others in the organization. Employees may be told, for example, about the results of the model (in a nonstatistical fashion) via internal communication channels. Managers may be tasked to implement the results of the model via changes in strategy or on going programs. For example, the credit card marketing department of a financial institution may decide to increase its offers of credit cards to (creditworthy) customers if the results of a model suggest that adding this product to a customer s portfolio increases the chance of retention. In one organization where I have consulted, management compensation plans were changed to provide financial incentives to those line managers who were successful in maintaining or improving customer retention rates. As part of the compensation plan s implementation managers attended training sessions where the retention model was discussed in the context of the drivers of customer retention and how these managers were responsible for employee performance on these drivers. A second, and very common, aspect to model implementation is to score or apply the selected model s parameters to every customer of the organization commissioning the model. Most customer retention modeling efforts take place with a (hopefully) random sample of a larger population of interest. Once a model has been chosen, its parameters are applied to the entire population of interest, thereby obtaining a probability to stay (or leave) score for each customer. Once generated, these scores can then be used to identify groups of customers who are more (or less) likely to stay (or leave), and marketing efforts tailored accordingly. The OUTEST option in the PROC LOGISTIC statement will generate an output SAS data set containing model parameters. The variables in this data set can be used in subsequent Data Steps to implement the scoring process. Summary and Conclusions Logistic regression is a very powerful tool in building models of customer retention. SAS System software implements this tool in both PROC LOGISTIC in the SAS/STAT module and the forthcoming Enterprise Miner product. When applied properly, logistic regression models can yield powerful insights in to why some customers leave and others stay. These insights can then be employed to modify organizational strategies and/or assess the impact of the implementation of these strategies. As with any other form of statistical modeling, care must be taken to carefully construct variables representing the phenomenon of interest and to select candidate independent variables that are substantively relevant to the issues under analysis. Finally, the dynamic nature of most systems of customer choice suggests that periodic re-assessment of the selected model is highly appropriate. As customer

5 demands and needs, as well as the environment in which the organization competes, change, so must the models the organization employs to predict customer retention. Note: SAS, SAS/STAT and Enterprise Miner are trademarks of SAS Institute, Cary, NC. USA. References Berry, Michael J.A., and Gordon Linoff, Data Mining Techniques For Marketing, Sales and Customer Support, Wiley, 1997 Groth. Robert, Data Mining: a Hands-On Approach for Business Professionals, Prentice-Hall, 1998 Hosmer and Lemeshow, Applied Logistic Regression, Wiley:, 1989 Kennedy, Ruby L, et,al, Solving Data Mining Problems Through Pattern Recognition, Prentice-Hall, 1997 SAS Institute, Inc: SAS/STAT Software, Volume 2: the LOGISTIC Procedure SAS Institute, Inc.: SAS/STAT Technical Report P-229, SAS/STAT Software: Changes and Enhancements, Release 6.07 SAS Institute, Inc.: SAS/STAT Software: Changes and Enhancements Release 6.10 SAS Institute, Inc.: Logistic Regression Examples Using the SAS System (1995) SAS Institute, Inc: SAS/STAT Software: Changes and Enhancements through Release 6.12 (1997) Weiss, Sholom M., and Nitin Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998 The author can be contacted at: Sierra Information Services, Inc Webster Street Suite 1308 San Francisco, California / AOL.COM

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Alex Vidras, David Tysinger. Merkle Inc.

Using PROC LOGISTIC, SAS MACROS and ODS Output to evaluate the consistency of independent variables during the development of logistic regression models. An example from the retail banking industry ABSTRACT

### Credit Risk Analysis Using Logistic Regression Modeling

Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,

### Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

Paper 264-26 Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc. Abstract: There are several procedures in the SAS System for statistical modeling. Most statisticians who use the SAS

### Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc

ABSTRACT Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc Logistic regression may be useful when we are trying to model a

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### A Property and Casualty Insurance Predictive Modeling Process in SAS

Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly

### Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry

Paper 12028 Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Junxiang Lu, Ph.D. Overland Park, Kansas ABSTRACT Increasingly, companies are viewing

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

### A Property & Casualty Insurance Predictive Modeling Process in SAS

Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

### ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

### Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

### Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona

Paper 1285-2014 Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona ABSTRACT In this new era of healthcare reform, health insurance companies have

### PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY ABSTRACT Keywords: Logistic. INTRODUCTION This paper covers some gotchas in SAS R PROC LOGISTIC. A gotcha

### USING ANALYTICS TO MEASURE THE VALUE OF EMPLOYEE REFERRAL PROGRAMS

USING ANALYTICS TO MEASURE THE VALUE OF EMPLOYEE REFERRAL PROGRAMS Evolv Study: The benefits of an employee referral program significantly outweigh the costs Using Analytics To Measure The Value Of Employee

### Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

### Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are

### SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT The purpose of this paper is to investigate several SAS procedures that are used in

### Get to Know the IBM SPSS Product Portfolio

IBM Software Business Analytics Product portfolio Get to Know the IBM SPSS Product Portfolio Offering integrated analytical capabilities that help organizations use data to drive improved outcomes 123

### Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

### A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

### SUGI 29 Statistics and Data Analysis

Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,

### Master of Science in Marketing Analytics (MSMA)

Master of Science in Marketing Analytics (MSMA) COURSE DESCRIPTION The Master of Science in Marketing Analytics program teaches students how to become more engaged with consumers, how to design and deliver

### Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

### Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

### The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

### Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

### INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER Mary-Elizabeth ( M-E ) Eddlestone Principal Systems Engineer, Analytics SAS Customer Loyalty, SAS Institute, Inc. AGENDA Overview/Introduction to Data Mining

### Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

### Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

### Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

### MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

### Easily Identify Your Best Customers

IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

### Sensitivity Analysis in Multiple Imputation for Missing Data

Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

### Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

### Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### 2015 Workshops for Professors

SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

### Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

### LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

### Prediction of Stock Performance Using Analytical Techniques

136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

### Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

### How to achieve excellent enterprise risk management Why risk assessments fail

How to achieve excellent enterprise risk management Why risk assessments fail Overview Risk assessments are a common tool for understanding business issues and potential consequences from uncertainties.

### ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

### Improving SAS Global Forum Papers

Paper 3343-2015 Improving SAS Global Forum Papers Vijay Singh, Pankush Kalgotra, Goutam Chakraborty, Oklahoma State University, OK, US ABSTRACT Just as research is built on existing research, the references

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### It is important to bear in mind that one of the first three subscripts is redundant since k = i -j +3.

IDENTIFICATION AND ESTIMATION OF AGE, PERIOD AND COHORT EFFECTS IN THE ANALYSIS OF DISCRETE ARCHIVAL DATA Stephen E. Fienberg, University of Minnesota William M. Mason, University of Michigan 1. INTRODUCTION

### Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study

Use Data Mining Techniques to Assist Institutions in Achieving Enrollment Goals: A Case Study Tongshan Chang The University of California Office of the President CAIR Conference in Pasadena 11/13/2008

### Building HR Capabilities. Through the Employee Survey Process

Building Capabilities Through the Employee Survey Process Survey results are only data unless you have the capabilities to analyze, interpret, understand and act on them. Your organization may conduct

### Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal ank of Scotland, ridgeport, CT ASTRACT The credit card industry is particular in its need for a wide variety

### TEXT ANALYTICS INTEGRATION

TEXT ANALYTICS INTEGRATION A TELECOMMUNICATIONS BEST PRACTICES CASE STUDY VISION COMMON ANALYTICAL ENVIRONMENT Structured Unstructured Analytical Mining Text Discovery Text Categorization Text Sentiment

### ABSTRACT INTRODUCTION

Paper SP03-2009 Illustrative Logistic Regression Examples using PROC LOGISTIC: New Features in SAS/STAT 9.2 Robert G. Downer, Grand Valley State University, Allendale, MI Patrick J. Richardson, Van Andel

### Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC

Paper AA08-2013 Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC Robert G. Downer, Grand Valley State University, Allendale, MI ABSTRACT

### Understanding Characteristics of Caravan Insurance Policy Buyer

Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul Executive Summary This report is intended

### Descriptive Statistics

Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Multinomial Logistic Regression

Multinomial Logistic Regression Dr. Jon Starkweather and Dr. Amanda Kay Moske Multinomial logistic regression is used to predict categorical placement in or the probability of category membership on a

### Chapter 1 Introduction. 1.1 Introduction

Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations

### Challenges in Developing a Small Business Tax payer Burden Model

Challenges in Developing a Small Business Tax payer Burden Model Authors: Don DeLuca Arnold Greenland Audrey Kindlon Michael Stavrianos Presented at the 2003 IRS Research Conference I. Introduction In

### IBM SPSS Direct Marketing

IBM Software IBM SPSS Statistics 19 IBM SPSS Direct Marketing Understand your customers and improve marketing campaigns Highlights With IBM SPSS Direct Marketing, you can: Understand your customers in

### USING LOGIT MODEL TO PREDICT CREDIT SCORE

USING LOGIT MODEL TO PREDICT CREDIT SCORE Taiwo Amoo, Associate Professor of Business Statistics and Operation Management, Brooklyn College, City University of New York, (718) 951-5219, Tamoo@brooklyn.cuny.edu

### EST.03. An Introduction to Parametric Estimating

EST.03 An Introduction to Parametric Estimating Mr. Larry R. Dysert, CCC A ACE International describes cost estimating as the predictive process used to quantify, cost, and price the resources required

### Joseph Twagilimana, University of Louisville, Louisville, KY

ST14 Comparing Time series, Generalized Linear Models and Artificial Neural Network Models for Transactional Data analysis Joseph Twagilimana, University of Louisville, Louisville, KY ABSTRACT The aim

### Sample Size and Power in Clinical Trials

Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance

### The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

### Data Mining Techniques for Mortality at Advanced Age

Data Mining Techniques for Mortality at Advanced Age Lijia Guo, Ph.D., A.S.A. and Morgan C. Wang, Ph.D. University of Central Florida Abstract This paper addresses issues and techniques for advanced age

### 269 Business Intelligence Technologies Data Mining Winter 2011. (See pages 8-9 for information about 469)

269 Business Intelligence Technologies Data Mining Winter 2011 (See pages 8-9 for information about 469) University of California, Davis Graduate School of Management Professor Yinghui (Catherine) Yang

### Automated Statistical Modeling for Data Mining David Stephenson 1

Automated Statistical Modeling for Data Mining David Stephenson 1 Abstract. We seek to bridge the gap between basic statistical data mining tools and advanced statistical analysis software that requires

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

### Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations

### 20 Best Practices for Customer Feedback Programs Building a Customer-Centric Company

Business growth through customer insight 20 Best Practices for Customer Feedback Programs Building a Customer-Centric Company The content of this paper is based on the book, Beyond the Ultimate Question,

### ETPL Extract, Transform, Predict and Load

ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements

### Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe

Performance Test Suite Results for SAS 9.1 Foundation on the IBM zseries Mainframe A SAS White Paper Table of Contents The SAS and IBM Relationship... 1 Introduction...1 Customer Jobs Test Suite... 1

### ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis

ElegantJ BI White Paper The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis Integrated Business Intelligence and Reporting for Performance Management, Operational

### USING SAS/STAT SOFTWARE'S REG PROCEDURE TO DEVELOP SALES TAX AUDIT SELECTION MODELS

USING SAS/STAT SOFTWARE'S REG PROCEDURE TO DEVELOP SALES TAX AUDIT SELECTION MODELS Kirk L. Johnson, Tennessee Department of Revenue Richard W. Kulp, David Lipscomb College INTRODUCTION The Tennessee Department

### Weight of Evidence Module

Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,

### Strategic HR Partner Assessment (SHRPA) Feedback Results

Strategic HR Partner Assessment (SHRPA) Feedback Results January 04 Copyright 997-04 Assessment Plus, Inc. Introduction This report is divided into four sections: Part I, The SHRPA TM Model, explains how

### Regression with a Binary Dependent Variable

Regression with a Binary Dependent Variable Chapter 9 Michael Ash CPPA Lecture 22 Course Notes Endgame Take-home final Distributed Friday 19 May Due Tuesday 23 May (Paper or emailed PDF ok; no Word, Excel,

Chapter 39 The LOGISTIC Procedure Chapter Table of Contents OVERVIEW...1903 GETTING STARTED...1906 SYNTAX...1910 PROCLOGISTICStatement...1910 BYStatement...1912 CLASSStatement...1913 CONTRAST Statement.....1916

### Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Strategies for Identifying Students at Risk for USMLE Step 1 Failure

Vol. 42, No. 2 105 Medical Student Education Strategies for Identifying Students at Risk for USMLE Step 1 Failure Jira Coumarbatch, MD; Leah Robinson, EdS; Ronald Thomas, PhD; Patrick D. Bridge, PhD Background

### Acquire with retention in mind.

White Paper A Driver of Long-Term Profitability for Personal Auto Carriers Acquire with retention in mind. Use data and analytics to help identify and attract prospects with the highest potential for long-term

### Missing data in randomized controlled trials (RCTs) can

EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Center for Effective Organizations

Center for Effective Organizations HR METRICS AND ANALYTICS USES AND IMPACTS CEO PUBLICATION G 04-8 (460) EDWARD E. LAWLER III ALEC LEVENSON JOHN BOUDREAU Center for Effective Organizations Marshall School

### Research Methods & Experimental Design

Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### T-test & factor analysis

Parametric tests T-test & factor analysis Better than non parametric tests Stringent assumptions More strings attached Assumes population distribution of sample is normal Major problem Alternatives Continue

### Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared