Segmentation For Insurance Payments
Michael Sherlock, Transcontinental Direct, Warminster, PA

ABSTRACT

An online insurance agency has built a base of names that responded to different offers from various carriers; some of these individuals have purchased one or more insurance products. The goal of this project is to analyze response and sales data for a single carrier to understand the relationships among products, offers, and payments, and then to deploy predictive models and segmentation schemes that target the best prospects for acquisition or cross-sell through a multi-channel communication strategy. Individual information was overlaid with demographic and psychographic data. A logistic regression model was developed to score individuals on their propensity to convert (i.e., pay for the issued policy). These records were then segmented using a clustering technique, producing distinct, descriptive groups that can be used to tailor marketing language. In addition, a cross-sell matrix was developed to identify products that the same carrier may offer to existing customers.

INTRODUCTION

A single carrier's (referred to here as "the carrier") Accidental Death and Dismemberment (ADD) product is sold through two different online portals (URLs). Each portal is distinct, despite the fact that the product is identical. The sites differ aesthetically; more significantly, the offers vary. Portal A features a deviated premium offer ($1 for the first month of coverage), while Portal B features a bonus offer (a free $10,000 policy for a limited time).

The carrier is having a problem with nonpayment of premium. The portals are reaching their traffic and signup goals, but the carrier is disappointed with the proportion of individuals who make their first payment (i.e., conversion). The goal of this project is to identify those most likely to convert and compare them to those least likely to convert.
By utilizing this information, the carrier may adjust its creative and placement strategy as well as its offerings.

APPROACH

After data hygiene, application and purchase files were matched to yield a single master file of 16,456 records. Individuals who were denied coverage were removed from the analysis. A total of 155 demographic and psychographic variables were matched to the file, which was then loaded into SAS for analysis. The seven most predictive variables that may be gathered during the online application process were used.

Since all records visited a site and requested a policy, a logistic regression model was built to predict which records are most likely to pay for the policy after issue. A series of models was built with a variety of variables, and the best-performing one was selected. The most predictive fields were: face value of the policy, payment method, age, types of credit cards owned, household income, home ownership, and length of residence. Other variables did show some predictive power, but the model was reduced for parsimony.

MODEL CODE

The code below shows how the final logistic regression model was built. The various iterations of the model-building process are omitted here.

    ODS HTML;
    ODS GRAPHICS ON;

    /* Model payment (paid = 1) for issued policies that were not free */
    PROC LOGISTIC DATA = client.carrier_final;
        WHERE issue = 1 & free = "N";
        CLASS homeowner cc payment / PARAM = REF REF = FIRST MISSING;
        MODEL paid(EVENT = '1') = faceamount payment agecode cc
                                  medianhhincome lor*homeowner
            / RSQUARE IPLOTS CLPARM = BOTH LACKFIT CORRB COVB
              NODUMMYPRINT STB;
        /* Write the scored records, with predicted probabilities, to a dataset */
        OUTPUT OUT = client.carrier_logit1 PREDICTED = pred1
               PREDPROBS = individual;
        GRAPHICS DFBETAS ROC ESTPROB;
    RUN;

    ODS GRAPHICS OFF;
    ODS HTML CLOSE;

    ODS HTML FILE = 'c:\documents and Settings\msherlock\My Documents\My SAS Files\CLIENT\CLIENT_OUT\carrierlogit1.html';

    /* Print the scored records written by the OUTPUT statement */
    PROC PRINT DATA = client.carrier_logit1;
        VAR ID issue free paid faceamount payment agecode cc medianhhincome
            lor homeowner _FROM_ _INTO_ IP_0 IP_1 _LEVEL_ pred1;
    RUN;

    ODS HTML CLOSE;

This code builds the logistic regression model to predict paid = 1; that is, the issued policy was converted into a paid policy. By default, PROC LOGISTIC models the lowest ordered response value (in this case, 0), so the software must be explicitly told to model 1. Nominal variables (homeowner, credit card type, and payment method) are included in the CLASS statement so that SAS automatically produces dummy variables. An interaction term (lor*homeowner) was used so that length of residence is included only for homeowners. The OUT = option is used to produce a scored output dataset. An OUTMODEL = option was also included originally to score a hold-out dataset and confirm model validity. The variables used in this model were selected after multiple iterations of running the LOGISTIC procedure and selecting the best-performing model.

MODELING CONCLUSIONS

Method of payment is by far the most predictive element of the model. Relative to the reference payment category, the odds of completing the transaction are about four times higher for those paying by credit card (odds ratio 3.97) and about twice as high for those paying by bank draft (a.k.a. electronic funds transfer, or EFT; odds ratio 2.18). For those requesting a bill, the odds of paying are 53% lower (odds ratio 0.47).

Although method of payment is the most predictive factor, it is not the only one. Besides, the method of payment is not known until the applicant selects one. This model seeks to identify those most likely to pay before the payment option is selected, thereby identifying those who may need more incentive to pay.

By mining the data, it was discovered that 23% of those who requested a bill and subsequently did not pay it do possess a bank-issued credit card. Likewise, 26% of those who offered an EFT method of payment and did not pay their bill also have a bank-issued credit card.
78% of those who select the credit card payment option complete the transaction. Nearly one out of every four invoice and EFT non-payers could have paid by credit card. If these applicants were guided toward that payment method, the number of completed transactions would increase significantly.

Over one third (34%) of those selecting EFT do not pay, and 26% of those non-payers have a credit card. Since 78% of credit card users pay, encouraging EFT users to pay by credit card may increase paid transactions among EFT selectors by about 7% (34% * 26% * 78% = 7%).

Nearly three-quarters (71%) of invoices go unpaid, and 23% of these non-payers possess a credit card. Again, 78% of credit card users pay, so encouraging invoice requestors to use a credit card may increase paid transactions among invoice requestors by about 13% (71% * 23% * 78% = 13%).
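The uplift arithmetic above can be reproduced in a short DATA step. This is a minimal sketch rather than part of the original program: the variable names are hypothetical, and the rates are simply the rounded percentages quoted in the text. Each result is a share of the applicants who chose that payment method, not of the whole file.

```sas
/* Sketch only: variable names are hypothetical; the rates are the
   rounded figures quoted in the text. */
data _null_;
  cc_pay_rate = 0.78;   /* share of credit card selectors who pay */

  /* EFT: 34% do not pay; 26% of those hold a bank-issued credit card */
  eft_gain = 0.34 * 0.26 * cc_pay_rate;

  /* Invoice: 71% do not pay; 23% of those hold a bank-issued credit card */
  invoice_gain = 0.71 * 0.23 * cc_pay_rate;

  /* Write both estimates to the SAS log as percentages */
  put eft_gain= percent8.1 invoice_gain= percent8.1;
run;
```

The unrounded products are about 6.9% and 12.7%, which the text rounds to 7% and 13%. The calculation assumes that redirected applicants would pay at the same 78% rate as those who chose a credit card on their own.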
MODEL OUTPUT

Model Fit Statistics

    Criterion    Intercept Only    Intercept and Covariates
    AIC          19331.39          16975.56
    SC           19338.96          17096.68
    -2 Log L     19329.389         16943.555

    R-Square 0.1534    Max-rescaled R-Square 0.2071

Testing Global Null Hypothesis: BETA=0

    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio    2385.8338     15    <.0001
    Score               2335.4160     15    <.0001
    Wald                1956.3615     15    <.0001

Type 3 Analysis of Effects

    Effect            DF    Wald Chi-Square    Pr > ChiSq
    faceamount        1     27.5758            <.0001
    payment           3     1670.4991          <.0001
    agecode           1     16.1692            <.0001
    cc                7     54.8414            <.0001
    medianhhincome    1     41.8986            <.0001
    LOR*homeowner     2     6.1830             0.0454

Analysis of Maximum Likelihood Estimates

    Parameter              DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq    Standardized Estimate
    Intercept              1     -0.5291     0.1129            21.9639            <.0001
    faceamount             1     -0.00177    0.000337          27.5758            <.0001        -0.0548
    Payment (EFT) b        1      0.7804     0.1038            56.4937            <.0001
    Payment (Credit) c     1      1.3786     0.0969            202.4303           <.0001
    Payment (Invoice) i    1     -0.7514     0.0832            81.5481            <.0001
    agecode                1      0.0334     0.00830           16.1692            <.0001         0.0455
    cc 1                   1      0.1363     0.0568            5.7661             0.0163
    cc 2                   1      0.1803     0.0719            6.2935             0.0121
    cc 3                   1      0.2693     0.0573            22.0917            <.0001
    cc 4                   1     -0.1249     0.2672            0.2184             0.6402
    cc 5                   1      0.8069     0.1881            18.3977            <.0001
    cc 6                   1     -0.6402     0.3972            2.5975             0.1070
    cc 7                   1      1.0041     0.2750            13.3323            0.0003
    medianhhincome         1      0.0809     0.0125            41.8986            <.0001         0.0691
    LOR*homeowner 1        1     -0.0150     0.00843           3.1776             0.0747
    LOR*homeowner 2        1      0.00369    0.00324           1.2961             0.2549
Odds Ratio Estimates

    Effect            Point Estimate    95% Wald Confidence Limits
    faceamount        0.998             0.998    0.999
    payment b vs      2.182             1.780    2.675
    payment c vs      3.969             3.283    4.800
    payment i vs      0.472             0.401    0.555
    agecode           1.034             1.017    1.051
    cc 1 vs 0         1.146             1.025    1.281
    cc 2 vs 0         1.198             1.040    1.379
    cc 3 vs 0         1.309             1.170    1.465
    cc 4 vs 0         0.883             0.523    1.490
    cc 5 vs 0         2.241             1.550    3.240
    cc 6 vs 0         0.527             0.242    1.148
    cc 7 vs 0         2.729             1.592    4.679
    medianhhincome    1.084             1.058    1.111

Association of Predicted Probabilities and Observed Responses

    Percent Concordant    70.6        Somers' D    0.418
    Percent Discordant    28.8        Gamma        0.420
    Percent Tied          0.6         Tau-a        0.201
    Pairs                 49424012    c            0.709

The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate). On this curve, the rapid climb on the left-hand side shows that the model predicts policy conversion well. The estimated area under the curve (c) is approximately 0.71. If it were 0.5, the resulting curve would be a straight diagonal line, meaning the model would discriminate no better than chance.

[Figure: ROC curve]
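The odds ratios and association statistics above follow directly from the parameter estimates and the concordance percentages. The DATA step below is a minimal sketch of that arithmetic: the inputs are copied from the MODEL OUTPUT tables, the variable names are hypothetical, and the formulas are the standard ones (odds ratio = exp(estimate); c = concordant share plus half the tied share; Somers' D = concordant share minus discordant share).

```sas
/* Sketch only: inputs are copied from the output tables; variable
   names are hypothetical. */
data _null_;
  /* Odds ratio for each payment level = exp(parameter estimate) */
  or_eft     = exp(0.7804);
  or_credit  = exp(1.3786);
  or_invoice = exp(-0.7514);

  /* c statistic: concordant pairs plus half of the tied pairs */
  c_stat = (70.6 + 0.5 * 0.6) / 100;

  /* Somers' D: concordant minus discordant pairs (equals 2c - 1) */
  somers_d = (70.6 - 28.8) / 100;

  /* Write the results to the SAS log */
  put or_eft= 8.3 or_credit= 8.3 or_invoice= 8.3;
  put c_stat= 8.3 somers_d= 8.3;
run;
```

The values written to the log should agree with the Odds Ratio Estimates and association tables to rounding, which is a quick consistency check on the reported output.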
DECILE RESULTS

The scoring algorithm was applied to all records, and the file was then split into ten portions of equal size (deciles) to gauge the lift realized by applying the model.

                 Payment Rate             Cumulative
                 % paid    % of file      % paid    % of file
    Decile 1     20%       10%            20%       10%
    Decile 2     18%       10%            38%       20%
    Decile 3     12%       10%            50%       30%
    Decile 4     9%        10%            59%       40%
    Decile 5     8%        10%            67%       50%
    Decile 6     7%        10%            75%       60%
    Decile 7     7%        10%            82%       70%
    Decile 8     7%        10%            88%       80%
    Decile 9     6%        10%            95%       90%
    Decile 10    5%        10%            100%      100%

The graph below further illustrates how the predicted payment changes as one goes deeper into the file by decile:

[Figure: Predicted Payment by decile; vertical axis 0% to 90%, horizontal axis Decile 1 through Decile 10]

The model output and analysis above are included to illustrate the validity of the model. Based on demographic information as well as payment type, one may predict the likelihood of payment. Using this information alone, the carrier has a general idea of the prospects that will most likely result in payment. To further hone the carrier's strategy, a segmentation analysis was produced to group prospects into those most likely to pay and those least likely to pay. That analysis is presented in the SEGMENTATION section.
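The deciling step described in DECILE RESULTS can be sketched with PROC RANK. This is a minimal sketch, not the program used for the paper. It assumes the scored dataset client.carrier_logit1 written by the OUTPUT statement in MODEL CODE, with pred1 as the predicted probability and paid as a numeric 0/1 flag. The work.* dataset names and the rank0 and decile variables are hypothetical.

```sas
/* Sketch only: work.ranked, work.deciled, rank0 and decile are
   hypothetical names; pred1 and paid come from the scored dataset. */

/* Split the scored file into ten groups of roughly equal size by
   predicted probability. With DESCENDING, group 0 holds the highest
   scores and group 9 the lowest. */
proc rank data=client.carrier_logit1 out=work.ranked
          groups=10 descending;
  var pred1;
  ranks rank0;
run;

/* Renumber the groups 1-10 so that decile 1 is the best-scoring tenth */
data work.deciled;
  set work.ranked;
  if not missing(rank0) then decile = rank0 + 1;
run;

/* Record count, number of payers, and observed payment rate by decile */
proc means data=work.deciled n sum mean maxdec=3;
  class decile;
  var paid;
run;
```

The "% paid" column in the decile table sums to roughly 100%, so it appears to be each decile's share of all payers. Under that reading, dividing each decile's payer count by the total number of payers reproduces the column, and a running total gives the cumulative figures.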
SEGMENTATION

After the predictive model was built, the scored records were grouped to identify key clusters of good-pay and bad-pay individuals. The process resulted in twelve distinct clusters in which records perform similarly to one another within a cluster and dissimilarly to records in other clusters. Interestingly, only 24% of the records in the final file were homeowners; however, three of the four best-pay clusters were all 100% homeowners.

SEGMENTATION CODE

Segmentation is a multi-step process. A number of iterations were performed here to produce the best-fitting segments. The procedures that follow are the result of the best group of variables.

    /*Step 1 Approximate Covariance Estimation (ACE) for Clustering*/
    PROC ACECLUS DATA = client.carriersegment
                 OUT = client.carrierlogit1aceclus
                 OUTSTAT = client.carrierlogit1aceclusstat
                 PROPORTION = .03 PP;
        VAR pred1 medianhhincome agecode renter;
    RUN;

    /*Step 2 Clustering*/
    PROC FASTCLUS DATA = client.carrierlogit1aceclus
                  MAXCLUSTERS = 12 MAXITER = 10
                  OUT = client.carrierlogit1fastclus;
        VAR Can1 Can2 Can3 Can4;
    RUN;

    /*Step 4 Descriptives*/
    ODS HTML FILE = 'c:\documents and Settings\msherlock\My Documents\My SAS Files\CLIENT\CLIENT_OUT\Carrier_clusters1.html';
    PROC UNIVARIATE DATA = client.carrierlogit1fastclus PLOTS;
        VAR pred1 paid medianhhincome agecode renter bankcard retail;
        CLASS cluster (ORDER = INTERNAL);
    RUN;
    ODS HTML CLOSE;

    /*Additional Descriptives*/
    ODS HTML FILE = 'c:\documents and Settings\msherlock\My Documents\My SAS Files\CLIENT\CLIENT_OUT\Carrier_clusters2.html';
    PROC UNIVARIATE DATA = client.carrierlogit1fastclus PLOTS;
        VAR bankdraft creditcard invoice faceamount lor sqinsurl psurl;
        CLASS cluster (ORDER = INTERNAL);
    RUN;
    ODS HTML CLOSE;

The ACECLUS procedure produces a series of canonical variables based on the input variables. The resulting dataset is then run through the FASTCLUS procedure, which groups the records based on those canonical variables.
Once this task is complete, the clusters are profiled on the original input variables as well as other variables to produce a full view of what these records look like. The procedure output is omitted here; what follows is an analysis of the UNIVARIATE results by cluster, with each cluster compared against a similar cluster.
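One way to line up the clusters from best pay to worst pay before pairing them for comparison is to summarize the model score by cluster. The following is a minimal sketch under assumptions, not the analysis used for the paper. It reads the FASTCLUS output dataset named in SEGMENTATION CODE, relies on the CLUSTER variable that FASTCLUS writes to its OUT= dataset, and assumes pred1 and paid are numeric. The table work.cluster_rank and its summary column names are hypothetical.

```sas
/* Sketch only: work.cluster_rank, n_records, mean_pred and pay_rate
   are hypothetical names. */
proc sql;
  create table work.cluster_rank as
  select cluster,
         count(*)    as n_records,
         mean(pred1) as mean_pred format=percent8.1,  /* predicted payment */
         mean(paid)  as pay_rate  format=percent8.1   /* observed payment  */
  from client.carrierlogit1fastclus
  group by cluster
  order by mean_pred desc;   /* best-pay clusters first */
quit;

proc print data=work.cluster_rank noobs;
run;
```

Sorting by mean predicted payment puts the good-pay clusters at the top, which makes it easier to pick out demographically similar pairs with very different payment behavior.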
THE RENTERS

                               Cluster 2        Cluster 3        Overall
    Predicted Payment          71%              26%              40%
    Income                     $30-$39,999      $30-$39,999      $30-$39,999
    Age                        35-39            25-29            40-44
    Home Owner                 100% renters     100% renters     24% renters
    Gender                     53% male         38% male         54% male
    Marital Status             34% married      30% married      68% married
    Length of Residence        Under 5 years    Under 5 years    Under 12 years
    Face Value of Policy       $118,100         $136,400         $132,700
    Dev. Prem / Bonus Offer    16% / 67%        12% / 74%        13% / 66%
    EFT / Credit Card          35% / 62%        0% / 0%          7% / 14%

Cluster 2 was nearly three times more likely to pay than cluster 3 (71% versus 26%), despite the fact that the two had many commonalities. Both clusters had similar household incomes, rent their homes, and have lived in the same domicile for up to five years. The good-pay group, cluster 2, had more men than cluster 3, a proportion more on par with the whole sample. Cluster 2 was somewhat older than cluster 3 (35-39 versus 25-29). Also, cluster 2 tended to go for policies with lower face values than cluster 3, resulting in lower monthly premiums.

THE HOMEOWNERS

                               Cluster 5        Cluster 7        Overall
    Predicted Payment          80%              32%              40%
    Income                     $40-$49,999      $40-$49,999      $30-$39,999
    Age                        55-59            35-39            40-44
    Home Owner                 100% own         100% own         76% own
    Gender                     68% male         60% male         54% male
    Marital Status             76% married      53% married      68% married
    Length of Residence        6+ years         2-11 years       Under 12 years
    Face Value of Policy       $124,500         $139,300         $132,700
    Dev. Prem / Bonus Offer    18% / 60%        13% / 65%        13% / 66%
    EFT / Credit Card          24% / 73%        0% / 0%          7% / 14%

Both of these groups own their homes and have an annual household income of $40,000-$49,999. Both tend to be married males. Cluster 5, which is more than twice as likely to pay (80% versus 32%), skews older (55-59 versus 35-39), more male, and more married than cluster 7. Also, cluster 5 tends towards less expensive policies.
THE UPPER-MIDDLE

                               Cluster 12       Cluster 10       Overall
    Predicted Payment          77%              38%              40%
    Income                     $50-$74,999      $50-$74,999      $30-$39,999
    Age                        40-44            50-54            40-44
    Home Owner                 100% own         100% own         76% own
    Gender                     66% male         66% male         54% male
    Marital Status             62% married      68% married      68% married
    Length of Residence        3-13 years       5-18 years       Under 12 years
    Face Value of Policy       $140,000         $140,400         $132,700
    Dev. Prem / Bonus Offer    24% / 53%        13% / 63%        13% / 66%
    EFT / Credit Card          29% / 65%        0% / 0%          7% / 14%

One may expect the most affluent group to be the best-paying of all the clusters. However, there is a distinct difference between the two most affluent groups: one pays well, the other does not. Key differences are found in age, offer, and payment method. These two clusters contain the records with the highest household income in the sample. Both tend to contain married men who own their homes and seek policies around $140,000. These commonalities aside, cluster 12, with the 40-44 year-olds, is twice as likely to pay its bill as cluster 10, with the 50-54 year-olds.

THE POWER OF THE OFFER

                               Cluster 1        Cluster 8        Overall
    Predicted Payment          72%              30%              40%
    Income                     $20-$29,999      $20-$29,999      $30-$39,999
    Age                        45-49            45-49            40-44
    Home Owner                 100% own         100% own         76% own
    Gender                     63% male         59% male         54% male
    Marital Status             64% married      62% married      68% married
    Length of Residence        3-13 years       3-15 years       Under 12 years
    Face Value of Policy       $126,500         $128,800         $132,700
    Dev. Prem / Bonus Offer    19% / 61%        10% / 70%        13% / 66%
    EFT / Credit Card          34% / 58%        0% / 0%          7% / 14%

Clusters 1 and 8 are nearly identical demographically, yet they differ markedly in payment method and in the kind of offer they responded to. No one in cluster 8 volunteered a credit card or bank draft as a method of payment. This is the key difference and, arguably, the only one: those in cluster 8 simply do not trust the online channel with sensitive banking information. In addition, those in cluster 1, the better-pay group, are more likely to have come through a deviated premium offer than those in cluster 8, who tend towards a bonus offer.
OTHER LOW-PROBABILITY TO PAY CLUSTERS

                               Cluster 6      Cluster 9       Cluster 11       Cluster 4
    Predicted Payment          37%            31%             24%              26%
    Income                     $30-$39,999    $20-$29,999     $15-$19,999      $40-$49,999
    Age                        65-69          55-59           30-34            45-49
    Home Owner                 100% own       100% renters    100% renters     100% renters
    Gender                     59% male       46% male        37% male         46% male
    Marital Status             66% married    38% married     22% married      49% married
    Length of Residence        6+ years       1-9 years       Under 5 years    Under 5 years
    Face Value of Policy       $126,500       $128,100        $134,200         $139,400
    Dev. Prem / Bonus Offer    11% / 65%      11% / 72%       11% / 73%        10% / 76%
    EFT / Credit Card          1% / 0%        4% / 3%         2% / 0%          1% / 0%

Cluster 6 is the oldest. Most of the file is 40-44; this cluster is 65-69. Its predicted payment is 37%.

Clusters 9 and 11 are the poorest. Cluster 11's household income of $15,000-$19,999 is well below the $30,000-$39,999 seen for most of the file, and its predicted payment is 24%. Cluster 9's household income is $20,000-$29,999, and its predicted payment is 31%.

Cluster 4 is anomalous in that it contains relatively affluent renters, who are slightly older than the sample mean, and yet is considered unlikely to pay. Besides being highly likely to request a bill rather than use a credit card or EFT, cluster 4 is the most likely cluster to have come through a bonus offer rather than a deviated premium offer. The predicted payment of cluster 4 is 26%.

CROSS-SELLING RESULTS

Since multiple lists from a variety of providers were available, a cross-selling matrix was developed to identify commonalities among groups that investigate multiple providers. The greatest cross-over existed between the ADD policies (analyzed above) and the adult term-life product from the same carrier. Six clusters were produced to examine interactions among variables and interest in both products, or lack thereof.

Only homeowners were interested in both the ADD and adult term-life products. Women ages 40-44 were the primary group interested in both products; 65% of these women were married. The only group of men interested in both products was ages 45-49 and 84% married. These two groups combined represent 82% of all the records interested in both products.
CONCLUSION

By applying this segmentation scheme to the online portals, the carrier may identify the likelihood that a policy will be paid for. Since the scheme is not based solely on payment type, the carrier may intelligently steer those who most need it towards more immediate payment options. This model may be applied not only online but also in an offline re-contact strategy to ensure payment of policies. In addition, the cross-selling opportunities identified here may be used to increase the book of business. When cross-selling, one may choose to approach only those in the most-likely-to-pay clusters to create greater efficiencies.

Although the EFT method of payment is preferred in the industry, one must realize that, due to public concern about identity theft, it is quite difficult to get consumers to volunteer checking account routing numbers online. The industry is opposed to accepting credit cards, due primarily to the increased transaction cost, but credit cards are the universal currency of the online space.

There are also quite a few opportunities for further analysis, such as the following:

- Include marketing costs and upstream marketing source data in the model to use cost-per-lead and cost-per-policy analysis to identify the most valuable media type
- Include profitability data to quantify the earnings potential of switching policies to credit card payment after deducting the card transaction fee
- Include attrition data to derive lifetime value for use in acquisition and retention marketing
- With longitudinal data, a frequency model may be built to determine the proper timing of online and/or offline solicitations to cross-sell other products

ACKNOWLEDGEMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Michael Sherlock
    Transcontinental Direct
    75 Hawk Road
    Warminster, PA 18974
    267.960.3161
    mjsherlock@gmail.com