Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
|
|
|
- Sarah Craig
- 9 years ago
- Views:
Transcription
1 Paper SA Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations of variables ( interactions ) when building a predictive model. If the target is Y and there is an interaction between two predictors X1 and X2, it means that the relationship between X1 and Y differs depending on the value of X2. SAS provides several easy-to-use methods to detect interactions. One method is to consider all possible n-way interactions using multiplicative interaction terms. For example, this can be done in the MODEL statement of PROC LOGISTIC using between potentially interactive variables to specify the number of variables that can be involved in an interaction refers to 2-way interactions). Another approach that has been recommended is to use decision trees to identify interactions; SAS Enterprise Miner enables interaction detection using this method. This paper presents a case study illustrating the use of different methods for interaction detection (i.e., all possible n-way interactions, interactions detected via decision trees, and a combination of the two) when building a predictive model in SAS. INTRODUCTION When creating a predictive model, analysts typically consider combinations of variables ( interactions ) as potential predictors. If the target is Y and there is an interaction between two predictors X1 and X2, it means that the relationship between X1 and Y differs depending on the value of X2. There are several methods that can be used to detect interactions in SAS. One method is to consider all possible n- way interactions using multiplicative interaction terms. SAS/STAT provides easy-to-use tools for doing this. For example, this can be done in the MODEL statement of PROC LOGISTIC using between potentially interactive predictors to specify the number of predictors that can be involved in an interaction refers to 2- way interactions). In a logistic regression model, with a binary dependent variable Y and potential predictors X1, X2 and X3, the MODEL statement of PROC LOGISTIC might be: model Y = X1 X2 / selection=stepwise. Then interactions X1*X2, X1*X3 and X2*X3 would all be considered, and retained in the model if they met the fit criteria specified in the stepwise algorithm. Although this method is easy to execute in SAS/STAT, a downside of the method is that the analysis can be slow when many interaction terms are considered. A second approach to interaction detection that has been recommended is decision trees. This approach is used for interaction detection in the Rapid Predictive Modeler procedure of Enterprise Miner. A decision tree is a set of rules for dividing a set of observations into distinct subgroups. The first rule is the one that best splits the entire set of observations into subgroups, such that the resulting subgroups are as pure as possible with respect to the dependent variable (assuming a binary dependent variable coded 0 or 1, this means that the resulting subgroups are as different as possible from each other with respect to the proportion of 1 s). Then another rule is applied to the resulting subgroups, splitting them into further subgroups that are as pure as possible with respect to the dependent variable. This process continues until pre-specified stopping criteria are met, which are sometimes based on the minimum allowed number of observations in a subgroup. The final resulting subgroups are called leaves. A leaf that is defined by multiple variables may represent an interaction. For example, a leaf may be defined by the rule an observation is in Leaf 1 if X1=1 and X2<56. Because leaves are constructed by identifying subgroups of observations that are as pure as possible with respect to the dependent variable, this essentially means that Y depends on both X1 and X2. However, a leaf defined by two predictors X1 and X2 does not necessarily point to an interaction between X1 and X2. It means that X1 and X2 are both important for the prediction of Y, but this is consistent with either main effects of X1 and X2 and no X1*X2 interaction, or an interaction between X1 and X2 (meaning that the relationship between X1 and Y is different at different levels of X2). An example of this difference is presented below. The decision tree method of interaction detection requires SAS Enterprise Miner, unless the analyst is willing to program a decision tree algorithm in BASE/SAS, or use some other software that does decision tree analyses such 1
2 as R or Weka. An upside is that decision trees can detect complex non-linear associations; downsides are that the analyst needs to have Enterprise Miner available, program a decision tree algorithm or use various non-sas decision tree algorithms in the analysis. Given the high cost of Enterprise Miner, it is an important practical question whether decision trees offer a substantial improvement over n-way interaction detection methods available with the SAS/STAT regression procedures. This paper presents a case study of using different methods for interaction detection when building a predictive model in SAS. The case study is a logistic regression model that would be fairly typical in marketing analytics. SAS/STAT is the primary analytic package. All of the methods can be implemented in SAS/STAT, with the exception that decision tree interaction detection uses SAS Enterprise Miner. This is a case study that illustrates findings in a single analytic situation; it is not guaranteed that similar findings will be present in every predictive modeling scenario. METHOD ILLUSTRATIVE DATA AND ANALYSIS SITUATION The analyses are illustrated with a case study using data from the National Health and Nutrition Examination Survey (NHANES), and waves combined. Assume that a campaign is being devised to market a new microbrew variety. The objective is to advertise to potential customers who are likely to drink beer; there is little value in advertising to individuals who are unlikely to drink beer. A predictive model is created to target likely beer drinkers. The dependent variable is drank_beer, defined as drinking beer sometime during the past year (coded as 1=yes or 0=no, based on the NHANES food frequency questionnaire). Potential predictors included measures of demographics (race/ethnicity, age, gender, income, education, marriage status), energy intake, Body Mass Index (BMI) and pregnancy status. The predictors included a mix of continuous variables (age, energy intake, BMI) and binary variables (all others). The continuous variables differed in scale. Prior to building the model, the observations were randomly allocated to training and validation datasets (2/3 training sample, 1/3 validation or holdout sample). The variable splitwgt was defined as 1=training observation,.= validation observation. When splitwgt is included in a WEIGHT statement, PROC LOGISTIC builds the model using only the training observation but produces scores for all observations in the dataset (including the validation sample). Models were built using the training observations and evaluated using the validation sample. Prior to analysis, a small number of observations with missing data were excluded. The analysis is applicable to any marketing scenario where the objective is to distribute the advertising message to likely consumers, and avoid unlikely consumers, for example, likely purchasers of automobiles, insurance policies, or telecommunications services. Methods considered for interaction detection 1. Main effects only This method considers only main effects and no interactions. The code is shown below. proc logistic data=nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = energy RIDAGEYR bmi wave0506 female mexam othhisp white afam inc_lt25k inc_75kplus some_college_or_more married pregnant / selection=stepwise; output out=scores1 prob=prob predprobs=crossvalidate; ods output parameterestimates=parms1; 2. All possible 2-way interactions (no recode of continuous predictors) 2
3 This method considers all possible 2-way interactions among the predictors. Linear slopes are estimated for the continuous predictors (BMI, RIDAGEYR and energy). proc logistic data=nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = energy RIDAGEYR bmi wave0506 female mexam othhisp white afam inc_lt25k inc_75kplus some_college_or_more married / selection=stepwise; output out=scores2 prob=prob predprobs=crossvalidate; ods output parameterestimates=parms2; 3. All possible 2-way interactions (binary split of continuous predictors) Method 2 estimated linear slopes for the continuous predictors. A possible improvement is to split the predictors into categories prior to estimating the interactions. This enables non-linear patterns of association between a continuous predictor and the dependent variable to be detected. To gauge the possible lift from using continuous predictors split into categories, interactions after splitting the predictors into categories were tested. Two and ten splits were considered. Two-category splits provides a baseline, while 10-category splits enables a more sensitive representation of possible non-linear patterns of association between the continuous predictors and the dependent variable. The macro categorize_continuous uses PROC RANK to break the continuous predictors into n equal size categories, passed to the macro using levels=n. A set of dummy variables is created to represent the resulting dichotomous predictor variable and entered into the MODEL statement of PROC LOGISTIC. options mprint; %macro categorize_continuous(levels=2); %let levels_minus_1=%eval(&levels-1); %let ranked = r_energy r_ridageyr r_bmi; * Use class variables for continuous; * dummy coded ranks; proc rank data=nhanes_0306_extract out=rnk_nhanes_0306_extract groups=&levels; ranks r_energy r_ridageyr r_bmi; var energy RIDAGEYR bmi; data rnk_nhanes_0306_extract; set rnk_nhanes_0306_extract; %do i=1 %to 3; %let curr_var = %scan(&ranked,&i); %do j=0 %to &levels_minus_1; &curr_var.&j=(&curr_var=&j); %end; %end; proc logistic data=rnk_nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = %do k=0 %to &levels_minus_1; r_energy&k r_ridageyr&k r_bmi&k %end; wave0506 female mexam othhisp white afam inc_lt25k inc_75kplus some_college_or_more married / selection=stepwise; output out=scores prob=prob predprobs=crossvalidate; ods output parameterestimates=parms; %mend categorize_continuous; 3
4 %categorize_continuous(levels=2); 4. All possible 2-way interactions (10 category split of continuous predictors) This is the same as Method 3 except that continuous predictors are broken into 10 equal size categories instead of 2. The code is exactly the same as in Method 3, except for a different macro call: %categorize_continuous(levels=10); 5. Stepwise with decision tree leaves, no other interactions Method 5 used decision tree leaves to represent interactions. The leaves were terminal nodes from a set of decision tree analyses conducted using SAS Enterprise Miner (EM). To conduct decision tree analyses, the first step was to import the training sample data into EM. Then decision trees were built varying the following analytic options: Tree A. Maximum branches = 2, maximum depth 2, accept all other defaults (misclassification rate, training sample = 0.349). Tree B. Maximum branches = 5, maximum depth 2, accept all other defaults (misclassification rate, training sample = 0.349; Tree B turned out to be identical to Tree A). Tree C. Maximum branches = 2, maximum depth 3, accept all other defaults (misclassification rate, training sample = 0.341). Tree D. Maximum branches = 5, maximum depth 3, accept all other defaults (misclassification rate, training sample = 0.337). Tree E. Interactive tree, allowing only splits where log(p)>=3 (misclassification rate, training sample = 0.349). The five trees exhibited only minor differences in accuracy as indicated by misclassification rate on the training sample. Results for the best fitting tree, Tree D, are shown in Figure 1. Figure 1. Results for the best fitting decision tree (Tree D). 4
5 All terminal nodes (leaves) from all trees were retained. Table 1 below describes the entire set of leaves. Leaves that were already represented by binary split predictors (e.g., female=0 vs. other, which is the top right terminal node that appears in Figure 1) were not coded as a leaf because this would be redundant. Across all trees, there were 8 distinct leaves, labeled leaf1 through leaf8 subsequently in this paper. After conducting decision tree analyses in SAS/EM as described above, terminal nodes were coded in the following dataset and the resulting leaves were used as inputs to logistic regression analyses conducted in SAS/STAT as shown below. * Code nodes from decision tree analysis; data nhanes_0306_extract; set nhanes_0306_extract; leaf1=(female=0 and ridageyr<55.5); leaf2=(female=0 and ridageyr>=55.5 and bmi ne. and bmi< ); leaf3=(female=0 and ridageyr>=55.5 and (bmi=. or bmi>= )); leaf4=(female=0 and ridageyr>=55.5 and bmi ne. and bmi> ); leaf5=(female=0 and ridageyr>=55.5 and (bmi=. or <=bmi< )); leaf6=(female in(.,1) and ridageyr<55.5); leaf7=(female in(.,1) and ridageyr>=55.5 and some_college_or_more=1); leaf8=(female in(.,1) and ridageyr>=55.5 and some_college_or_more<1); proc logistic data=nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = leaf1 leaf2 leaf3 leaf4 leaf5 leaf6 leaf7 leaf8 energy RIDAGEYR bmi wave0506 female mexam othhisp white afam inc_75kplus some_college_or_more married pregnant / include=8 selection=stepwise; output out=scores5 prob=prob predprobs=crossvalidate; ods output parameterestimates=parms5; inc_lt25k 6. Stepwise with decision tree leaves, allow other 2-way interactions This method estimates all possible 2-way interactions plus the potential interactions identified using decision trees. proc logistic data=nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = leaf1 leaf2 leaf3 leaf4 leaf5 leaf6 leaf7 leaf8 energy RIDAGEYR bmi wave0506 female mexam othhisp white afam inc_lt25k inc_75kplus some_college_or_more married / selection=stepwise; output out=scores6 prob=prob predprobs=crossvalidate; ods output parameterestimates=parms6; 7. Stepwise with all possible 2-way interactions (no recode of continuous predictors), then allow decision tree leaves to enter This model takes the model selected in Method 2, forces that model in via the include= option in the MODEL statement of PROC LOGISTIC, and finally allows decision trees to enter the model. proc logistic data=nhanes_0306_extract descending namelen=100; weight splitwgt; model drank_beer = Energy RIDAGEYR Energy*RIDAGEYR bmi 5
6 female mexam afam inc_lt25k inc_75kplus afam*inc_75kplus some_college_or_more Energy*some_college_or_more afam*some_college_or_more married RIDAGEYR*married afam*married pregnant mexam*pregnant leaf1 leaf2 leaf3 leaf4 leaf5 leaf6 leaf7 leaf8 / include=18 selection=stepwise; output out=scores7 prob=prob predprobs=crossvalidate; ods output parameterestimates=parms7; FIT STATISTICS CONSIDERED To assess the fit of models resulting from Methods 1 through 7, fit statistics were computed using the observations in the validation sample. Two common fit statistics were used, the C-statistic (also known as area under the ROC curve or AUC) and percent of target cases in the top decile, i.e., the percent of all observations where drank_beer=1 that were caught in the top 10% of observations by model score. The C-statistic ranges from 0 to 1; the closer to 1, the better the model fit. A model where C=0.5 is no better than random guessing (in other words, a model that predicts well should have C>0.5). The percent of target cases in the top decile ranges from 0% to 100%; the closer to 100% the better, and 10% indicates a model that is no better than random guessing (in other words, a model that predicts well should have values of >10% for this fit statistic). RESULTS Descriptive statistics for the predictors considered are shown in Table 1. All percentages are column percentages (for example, 54.6% of the total sample was female, 39.3% of the beer drinkers were female and 68.0% of the non-beerdrinkers were female). 6
7 Table 1. Mean or % of potential predictor variables by beer consumption (drank beer sometime in the past year vs. did not drink beer in the past year) and in total. Drank beer (n=2,703) Did not drink beer (n=3,074) Total (n=5,777) Variable name Description Energy 24-hour energy intake (kcal) 2, , ,142.4 RIDAGEYR Age (years) bmi Body mass index (BMI) wave0506 NHANES wave = % 46.9% 47.7% female Female 39.3% 68.0% 54.6% mexam Mexican America 18.7% 17.6% 18.1% othhisp Other Hispanic 3.1% 2.6% 2.9% white Non-Hispanic Caucasian 57.7% 53.3% 55.3% afam African American 16.4% 22.1% 19.4% inc_lt25k Annual income < $25K 22.0% 32.8% 27.8% inc_75kplus Annual income $75K+ 27.0% 18.2% 22.3% some_college_or_more Some college education or more 56.5% 46.0% 50.9% married Married 57.0% 58.8% 58.0% pregnant Pregnant 4.1% 7.3% 5.8% leaf1 Male & age < % 14.9% 27.1% leaf2 Male & age >=56 & BMI < % 2.6% 2.1% leaf3 Male & age >=56 & BMI >= % 14.5% 16.2% leaf4 Male & age >=56 & BMI > % 0.7% 0.4% leaf5 Male & age >=56 & (26.6<=BMI<40.6) 18.1% 13.8% 15.8% leaf6 Female & age < % 38.6% 34.8% leaf7 Female & age >=56 & some college+ 4.6% 10.8% 7.9% leaf8 Female & age >=56 & <college 4.1% 18.6% 11.8% Fit statistics from models using Methods 1 through 7 are shown in Table 2. 7
8 Table 2. Fit statistics for model selection Methods 1 through 7 for the training sample and validation sample. Method Training sample Validation sample C-statistic % target in top decile C-statistic % target in top decile 1. Main effects only % % 2. All possible 2-way interactions (no recode of continuous predictors) 3. All possible 2-way interactions (binary split of continuous predictors) 4. All possible 2-way interactions (10 category split of continuous predictors) 5. Stepwise with decision tree leaves, no other interactions 6. Stepwise with decision tree leaves, allow other 2-way interactions 7. Stepwise with all possible 2-way interactions (no recode of continuous predictors); then allow decision tree leaves to enter % % % % % % % % % % % % All models fit reasonably well and were markedly better than random guessing (C>0.5, % of target observations in the top decile >10%). Based on the c-statistic, the best method was Method 2 for the validation sample (Method 7 was close). Based on % of target observations in the top decile, the best method was Method 6 for validation, although Methods 2 and 7 were close. Creating categorical predictors always resulted in a worse-fitting solution compared with estimating linear slopes of continuous predictors, regardless of whether 2- or 10-category splits where used. Allowing 2-way interactions in addition to tree leaves resulted in a better solution than using decision tree leaves alone. Predictors selected and parameter estimates for the solutions that tended to exhibit best predictive performance, i.e., Methods 2 and 7, are shown in Tables 3 through 5 below. 8
9 Table 3. Parameter estimates, standard errors and p-values (null hypothesis is parameter estimate =0) for the model selected by Method 2. Variable Estimate SE p-value Intercept < Energy < RIDAGEYR < Energy*RIDAGEYR < < bmi female < mexam afam inc_lt25k inc_75kplus afam*inc_75kplus some_college_or_more Energy*some_college_or_more afam*some_college_or_more married < RIDAGEYR*married afam*married pregnant mexam*pregnant Table 4. Parameter estimates, standard errors and p-values (null hypothesis is parameter estimate =0) for the model selected by Method 6. Variable Estimate SE p-value Intercept leaf < leaf < leaf leaf Energy < RIDAGEYR bmi afam inc_lt25k inc_75kplus afam*inc_75kplus married RIDAGEYR*married pregnant
10 Table 5. Parameter estimates, standard errors and p-values (null hypothesis is parameter estimate =0) for the model selected by Method 7. Variable Estimate SE p-value Intercept < Energy RIDAGEYR < Energy*RIDAGEYR bmi female < mexam afam inc_lt25k inc_75kplus afam*inc_75kplus some_college_or_more Energy*some_college_or_more afam*some_college_or_more married < RIDAGEYR*married afam*married pregnant mexam*pregnant leaf < leaf leaf Some 2-way interactions were selected by all three methods, specifically afam*inc_75kplus and RIDAGEYR*married. These interactions are graphed using validation sample observations in Figures 2 and 3 below. Although a linear slope was estimated for RIDAGEYR (age in years), Figure 3 uses a median split of RIDAGYR for graphing purposes. Among African Americans, beer drinking was unrelated to income, whereas for other ethnicities, beer drinking was much more frequent among higher-income individuals. For younger individuals, beer drinking was more frequent among the unmarried, whereas beer drinking was more frequent among the married for older individuals. Figure 2. Percent of individuals drinking beer by ethnicity and income. 70.0% 60.0% 62.6% Percent drinking beer 50.0% 40.0% 30.0% 20.0% 40.7% 39.5% 44.7% Income <= $75K Income > $75K 10.0% 0.0% African American Other race/ethnicity Race/ethnicity 10
11 Figure 3. Percent of individuals drinking beer by age and marriage status. Percent drinking beer 70.0% 60.0% 50.0% 40.0% 30.0% 20.0% 60.0% 51.9% 34.7% 40.4% Not Married Married 10.0% 0.0% Age<=48 Age>48 Age category (median split) As noted above, tree leaves may or may not represent interactions. For example, Leaf1 was uncovered in the decision tree analyses. It is a combination of gender and age<56. It entered as a significant predictor in the model selected by Method 6. However, Leaf1 represents main effects of gender and age<56, but there is no interaction. This is shown in the following analysis. data trainset; set nhanes_0306_extract(where=(role='t')); age_lt56=(ridageyr<55.5); age_gt56=(ridageyr>=55.5); bmi_lt23=(bmi< ); proc logistic data=trainset descending namelen=100; model drank_beer = female age_lt56; proc logistic data=trainset descending namelen=100; model drank_beer = female age_lt56 leaf1; In both logistic regressions, gender and age<56 are significant predictors of drinking beer (p<0.0001), but leaf1 does not add anything to the prediction (p=0.57 in the second logistic regression; note that leaf1 is equivalent to female*age_lt56). Why was there a tree leaf, but no interaction? The reason becomes clear based on the data in Table 6. Table 6. Percent of individuals drinking beer by age and gender. Gender <56 56 Female 40.2% 20.2% Male 71.3% 50.5% Age 11
12 First the tree splits by gender (rates of drinking beer appear much higher among males), then the tree splits by age among the males (there is a 20.8% difference between younger vs. older males in beer drinking rates, as opposed to a 20.0% difference between younger vs. older females). These two splits result in Leaf1, but these splits indicate main effects only, not interactions. However, Leaf2 (defined by gender, age>56 and BMI<22.6) is a decision tree leaf that represents a real interaction, as illustrated in the logistic regressions below. proc logistic data=trainset descending namelen=100; model drank_beer = female age_gt56 bmi_lt23; proc logistic data=trainset descending namelen=100; model drank_beer = female age_gt56 bmi_lt23 leaf2; In the second logistic regression, leaf2 (which is equivalent to female*age_gt56*bmi_lt23) was highly significant (p<0.0001), and the C-statistics differed between the two logistic regressions (C-statistics =0.70 vs. 0.71, respectively). When looking at means by group, it appears that the interaction is in part driven by a difference in beer drinking rates among the older respondents (age 56), by BMI and gender. Specifically, among the older respondents, males with lower BMI (<22.6) were much less likely to drink beer (28%, vs. 53% for older males with BMI 22.6), whereas females with lower BMI were slightly more likely to drink beer (21%, vs. 20% for older females with BMI 22.6). CONCLUSION This paper illustrated two possible approaches to detect interactions in predictive models: n-way multiplicative interactions and decisions trees. These approaches were examined separately and in combination in the scenarios illustrated in this paper. In the illustrative scenarios, n-way multiplicative interactions and decision trees both pointed to useful interactions. The n-way multiplicative interaction detection method performed best based on the holdout c-statistic (and it was nearly the best in the holdout decile analysis), raising questions as to whether decision trees -- and the tools required to create them -- are really necessary for effective interaction detection in predictive models. However, good models did result from combining the two methods of interaction detection -- a combination of the two approaches, Method 6, performed best in the holdout decile analysis. Therefore, possibly the optimal approach is to combine the two approaches for interaction detection in predictive models. A definitive answer would require further research. When using the decision tree method to detect interactions, it is worthwhile to keep in mind that tree leaves may point to either main effects or interactions. The results reported in this paper were a case study. It is unknown to what extent the results reported here will generalize to other situations. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Doug Thompson, PhD Senior Director, Analytics Consulting Blue Cross Blue Shield of IL, NM, OK & TX 300 E Randolph Chicago, IL (312) [email protected] 12
13 SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 13
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
Paper D10 2009. Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI
Paper D10 2009 Ranking Predictors in Logistic Regression Doug Thompson, Assurant Health, Milwaukee, WI ABSTRACT There is little consensus on how best to rank predictors in logistic regression. This paper
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Alex Vidras, David Tysinger. Merkle Inc.
Using PROC LOGISTIC, SAS MACROS and ODS Output to evaluate the consistency of independent variables during the development of logistic regression models. An example from the retail banking industry ABSTRACT
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
A Property and Casualty Insurance Predictive Modeling Process in SAS
Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly
Binary Logistic Regression
Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including
Smart Sell Re-quote project for an Insurance company.
SAS Analytics Day Smart Sell Re-quote project for an Insurance company. A project by Ajay Guyyala Naga Sudhir Lanka Narendra Babu Merla Kiran Reddy Samiullah Bramhanapalli Shaik Business Situation XYZ
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1
Paper 11682-2016 Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1 Raja Rajeswari Veggalam, Akansha Gupta; SAS and OSU Data Mining Certificate
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
IBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
Microsoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
Copyright 2006, SAS Institute Inc. All rights reserved. Predictive Modeling using SAS
Predictive Modeling using SAS Purpose of Predictive Modeling To Predict the Future x To identify statistically significant attributes or risk factors x To publish findings in Science, Nature, or the New
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Modeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC
Paper AA08-2013 Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC Robert G. Downer, Grand Valley State University, Allendale, MI ABSTRACT
Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry
Paper 12028 Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Junxiang Lu, Ph.D. Overland Park, Kansas ABSTRACT Increasingly, companies are viewing
Same-sex Couples Consistency in Reports of Marital Status. Housing and Household Economic Statistics Division
Same-sex Couples Consistency in Reports of Marital Status Author: Affiliation: Daphne Lofquist U.S. Census Bureau Housing and Household Economic Statistics Division Phone: 301-763-2416 Fax: 301-457-3500
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
Data Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
Mortality Assessment Technology: A New Tool for Life Insurance Underwriting
Mortality Assessment Technology: A New Tool for Life Insurance Underwriting Guizhou Hu, MD, PhD BioSignia, Inc, Durham, North Carolina Abstract The ability to more accurately predict chronic disease morbidity
ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking
Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into
Easily Identify Your Best Customers
IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do
Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data
Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data Kathy Welch Center for Statistical Consultation and Research The University of Michigan 1 Background ProcMixed can be used to fit Linear
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Analysis of Survey Data Using the SAS SURVEY Procedures: A Primer
Analysis of Survey Data Using the SAS SURVEY Procedures: A Primer Patricia A. Berglund, Institute for Social Research - University of Michigan Wisconsin and Illinois SAS User s Group June 25, 2014 1 Overview
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
The Youth Vote in 2012 CIRCLE Staff May 10, 2013
The Youth Vote in 2012 CIRCLE Staff May 10, 2013 In the 2012 elections, young voters (under age 30) chose Barack Obama over Mitt Romney by 60%- 37%, a 23-point margin, according to the National Exit Polls.
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS
Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Direct Marketing Profit Model. Bruce Lund, Marketing Associates, Detroit, Michigan and Wilmington, Delaware
Paper CI-04 Direct Marketing Profit Model Bruce Lund, Marketing Associates, Detroit, Michigan and Wilmington, Delaware ABSTRACT A net lift model gives the expected incrementality (the incremental rate)
Fleet and Marine Corps Health Risk Assessment, 1 January 31 December, 2014
Fleet and Marine Corps Health Risk Assessment, 1 January 31 December, 2014 Executive Summary The Fleet and Marine Corps Health Risk Appraisal is a 22-question anonymous self-assessment of many of the most
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
The Chi-Square Test. STAT E-50 Introduction to Statistics
STAT -50 Introduction to Statistics The Chi-Square Test The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we will be comparing observed
A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Constructing a Table of Survey Data with Percent and Confidence Intervals in every Direction
Constructing a Table of Survey Data with Percent and Confidence Intervals in every Direction David Izrael, Abt Associates Sarah W. Ball, Abt Associates Sara M.A. Donahue, Abt Associates ABSTRACT We examined
A Basic Guide to Modeling Techniques for All Direct Marketing Challenges
A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview
Data mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY
Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY ABSTRACT PROC FREQ is an essential procedure within BASE
Row vs. Column Percents. tab PRAYER DEGREE, row col
Bivariate Analysis - Crosstabulation One of most basic research tools shows how x varies with respect to y Interpretation of table depends upon direction of percentaging example Row vs. Column Percents.
CALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century
An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO
Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit
Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Duncan Cleary Revenue Irish Tax and Customs, Ireland [email protected] Abstract: Revenue, the Irish
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
Segmentation For Insurance Payments Michael Sherlock, Transcontinental Direct, Warminster, PA
Segmentation For Insurance Payments Michael Sherlock, Transcontinental Direct, Warminster, PA ABSTRACT An online insurance agency has built a base of names that responded to different offers from various
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,
Assignments Analysis of Longitudinal data: a multilevel approach
Assignments Analysis of Longitudinal data: a multilevel approach Frans E.S. Tan Department of Methodology and Statistics University of Maastricht The Netherlands Maastricht, Jan 2007 Correspondence: Frans
Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC
Paper CC 14 Counting the Ways to Count in SAS Imelda C. Go, South Carolina Department of Education, Columbia, SC ABSTRACT This paper first takes the reader through a progression of ways to count in SAS.
Chapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables
Paper 10961-2016 Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables Vinoth Kumar Raja, Vignesh Dhanabal and Dr. Goutam Chakraborty, Oklahoma State
Individual Growth Analysis Using PROC MIXED Maribeth Johnson, Medical College of Georgia, Augusta, GA
Paper P-702 Individual Growth Analysis Using PROC MIXED Maribeth Johnson, Medical College of Georgia, Augusta, GA ABSTRACT Individual growth models are designed for exploring longitudinal data on individuals
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC
Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis
Free Trial - BIRT Analytics - IAAs
Free Trial - BIRT Analytics - IAAs 11. Predict Customer Gender Once we log in to BIRT Analytics Free Trial we would see that we have some predefined advanced analysis ready to be used. Those saved analysis
Business Problems and Data Science Solutions
CSCI E-84 A Practical Approach to Data Science Ramon A. Mata-Toledo, Ph.D. Professor of Computer Science Harvard Extension School Unit 1 - Lecture 2 February, Wednesday 3, 2016 Business Problems and Data
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
Scatter Plots with Error Bars
Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each
Linear Models in STATA and ANOVA
Session 4 Linear Models in STATA and ANOVA Page Strengths of Linear Relationships 4-2 A Note on Non-Linear Relationships 4-4 Multiple Linear Regression 4-5 Removal of Variables 4-8 Independent Samples
Lesson 3: Calculating Conditional Probabilities and Evaluating Independence Using Two-Way Tables
Calculating Conditional Probabilities and Evaluating Independence Using Two-Way Tables Classwork Example 1 Students at Rufus King High School were discussing some of the challenges of finding space for
Predictive Modeling of Titanic Survivors: a Learning Competition
SAS Analytics Day Predictive Modeling of Titanic Survivors: a Learning Competition Linda Schumacher Problem Introduction On April 15, 1912, the RMS Titanic sank resulting in the loss of 1502 out of 2224
SUGI 29 Statistics and Data Analysis
Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,
SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES
SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR
Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona
Paper 1285-2014 Customer Profiling for Marketing Strategies in a Healthcare Environment MaryAnne DePesquo, Phoenix, Arizona ABSTRACT In this new era of healthcare reform, health insurance companies have
Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)
Business Intelligence Professor Chen NAME: Due Date: Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Tutorial Summary Objective: Richard would
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4
1 Paper 1680-2016 Using GENMOD to Analyze Correlated Data on Military System Beneficiaries Receiving Inpatient Behavioral Care in South Carolina Care Systems Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Paper PO06. Randomization in Clinical Trial Studies
Paper PO06 Randomization in Clinical Trial Studies David Shen, WCI, Inc. Zaizai Lu, AstraZeneca Pharmaceuticals ABSTRACT Randomization is of central importance in clinical trials. It prevents selection
Mind on Statistics. Chapter 12
Mind on Statistics Chapter 12 Sections 12.1 Questions 1 to 6: For each statement, determine if the statement is a typical null hypothesis (H 0 ) or alternative hypothesis (H a ). 1. There is no difference
Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.
Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,
Package ATE. R topics documented: February 19, 2015. Type Package Title Inference for Average Treatment Effects using Covariate. balancing.
Package ATE February 19, 2015 Type Package Title Inference for Average Treatment Effects using Covariate Balancing Version 0.2.0 Date 2015-02-16 Author Asad Haris and Gary Chan
Paper PO12 Pharmaceutical Programming: From CRFs to Tables, Listings and Graphs, a process overview with real world examples ABSTRACT INTRODUCTION
Paper PO12 Pharmaceutical Programming: From CRFs to Tables, Listings and Graphs, a process overview with real world examples Mark Penniston, Omnicare Clinical Research, King of Prussia, PA Shia Thomas,
ABSTRACT INTRODUCTION STUDY DESCRIPTION
ABSTRACT Paper 1675-2014 Validating Self-Reported Survey Measures Using SAS Sarah A. Lyons MS, Kimberly A. Kaphingst ScD, Melody S. Goodman PhD Washington University School of Medicine Researchers often
HOUSEHOLDS WITH HIGH LEVELS OF NET ASSETS
HOUSEHOLDS WITH HIGH LEVELS OF NET ASSETS Report to the Consumer Federation of America and Providian Financial Corp. Catherine P. Montalto, Ph.D. Associate Professor Consumer and Textile Sciences Department
Using Adaptive Random Trees (ART) for optimal scorecard segmentation
A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized
Text Analytics using High Performance SAS Text Miner
Text Analytics using High Performance SAS Text Miner Edward R. Jones, Ph.D. Exec. Vice Pres.; Texas A&M Statistical Services Abstract: The latest release of SAS Enterprise Miner, version 13.1, contains
Multinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
Statistics. Measurement. Scales of Measurement 7/18/2012
Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does
