A Study of Car Insurance in the Netherlands
BUDT733: Spring 2011
Vijayakumar Ayyaswamy, Logan Baranowitz, Cyrus Havewala, Stephanie Romich
Car Insurance in Netherlands Page 1 of 7
Executive Summary

This project analyzes data for a car insurance firm in Holland. The firm will use the report to target the zip codes that best suit its business and to create marketing strategies that promote its insurance products in those areas. The data collected for the project include product usage and socio-demographic information for different zip codes in the country. Each record corresponds to a particular zip code and customer type, with details about the percentage of the population belonging to various demographic categories, the average contribution to other policies, and the average number of other policies held by the group.

The intent of the analysis is to profile zip code areas in order to create marketing strategies that increase buyer sensitivity toward car insurance and avoid investment in areas that do not suit the business. The analysis will help cut costs, direct advertising expenditures toward the right customers, and increase the return on investment.

It is common sense to focus on areas where car usage is higher. Contributions to other policies provide insight into residents' existing usage patterns and spending power. For example, higher contributions to bicycle or moped insurance in a zip code suggest that the population is more inclined to use bicycles or mopeds than cars. Such an area may be congested and densely populated, like a downtown, where smaller vehicles and easier accessibility are preferred. Our analysis showed exactly this pattern:

1. Areas with high contributions to bicycle, moped, and fire policies and to third party insurance for firms are less likely to hold car insurance. Such areas may be densely populated, with a preference for smaller vehicles. Contributions toward third party insurance for firms indicate that the area is a business center, likely with less parking and more crowds.
2. Areas with high contributions to social security insurance and tractor policies show potential for car insurance.
This suggests that such areas may be farmland or outskirts, where the need for a car is greater.
3. Areas where more than 50% of the population own at least one car and hold more than 2 policies other than car insurance are conducive to the business.

Based on the analysis, we recommend the following options for targeted marketing:

1. Print advertising and direct mail marketing: Advertise in local papers in the targeted zip codes and send out direct mail. The firm should target rural communities in which there is not a high concentration of alternative transportation such as mopeds or bicycles.
2. Joint marketing with car dealers in the area may prove profitable. Areas in which more than 50% of the population have at least one car seem to have a higher potential for car policies. By encouraging people to buy more cars, the company can increase the market for car policies.
3. Provide bundled products of car and tractor policies. As noted previously, rural areas appear to account for a significant portion of the company's customers. By providing a bundled tractor and car policy, the company can reach customers who may not have considered purchasing car insurance on its own.
Technical Summary

Goal Definition: The overarching goal of the project is to use a dataset containing demographic and insurance information for 9822 zip codes and to determine whether these data help explain the purchase of car insurance in those zip codes.

Data description: The insurance dataset contains 9822 records and 85 dimensions. Each record contains the common characteristics of households in a postal zip code. 42 dimensions represent different policy types: 21 indicate the average number of policies owned and the other 21 indicate the total monetary contribution to those policies for a particular zip code. The next 39 dimensions are categorical and show the percentage of total households within a zip code that fall under the dimension. For example, 5 dimensions represent household income level categories, and the records indicate what percentage of households in a zip code fall within each income level. The other major categories covered are Social Class (5), Profession (5), Religious Affiliation (4), Marital Status (4), Education (3), No. of Cars (3), Rent/Own Home (2), Children (2), Health Insurance (2), Average Income (1), Status (1), Average Age (1) and Purchasing Power (1). Finally, Number of Houses and Average Household Size are numerical dimensions. The dataset also classifies the zip codes into customer categories, indicated by the dimension Customer Type.

Data preprocessing, Exploratory Data Analysis and Choice of Variables: The raw data contained numeric codes that were linked to a dictionary with the actual bin definitions for each dimension. The first step was to convert the raw data into meaningful bins. For example, in the Average Age dimension, a value of 2 was converted to the bin 30-40 years. Next, dimensions representing the same major category were consolidated. For example, the 5 dimensions representing household income levels (< 30k, 30-40k, etc.)
with bins showing the average percentage of households falling in each category were consolidated into a single dimension, Household Income. The record was modified to reflect the majority value of the 5 dimensions; i.e., if 30-40k had the largest percentage of households, it became the representative value of Household Income for that record.

Initial data exploration in Spotfire indicated that there was no meaningful relationship between the demographic data and the number of car insurance policies owned by people in a zip code. In most cases, for any demographic variable, the share of zip codes with car insurance (and any other insurance, for that matter) was about 50%. However, there was a strong relationship between a zip code's average contribution to other kinds of policies and its car insurance policies. For example, zip codes with high contributions to bicycle or tractor insurance did not have car policies. Once the contributions to different policies and the total number of policies owned in a zip code (3 or more) were identified as the important dimensions for owning car insurance, the dataset was culled by
eliminating all the demographic information. At this point, the data was ready to be used in different classification models. The variables we identified for further analysis were the following:

Variable | Description | Example
Contribution private third party insurance | Third party insurance for personal insurance | See the note on contribution ranges below
Contribution third party insurance (firms) | Third party insurance for firms | See the note on contribution ranges below
Contribution tractor policies | Tractor policies | See the note on contribution ranges below
Contribution moped policies | Moped policies | See the note on contribution ranges below
Contribution fire policies | Fire policies | See the note on contribution ranges below
Contribution bicycle policies | Bicycle policies | See the note on contribution ranges below
Contribution social security insurance policies | Social security insurance policies | See the note on contribution ranges below
Total Number of Policies (not Car) > 2 | Dummy variable set to 0 if the total number of policies other than car insurance is less than or equal to 2, and 1 if the count is greater than 2 | Values are 0 or 1
No Car < 50% | Dummy variable set to 1 if the percentage of the population in the zip code with no car is less than or equal to 50%, and 0 if the percentage is greater than 50% | Values are 0 or 1

Note on contribution ranges: each contribution is denoted as a range (in Dutch currency): 0, 1-49, 50-99, 100-199, 200-499, 500-999, 1000-4999, 5000-9999, 10000-19999, >20000. If the average contribution to third party insurance for individuals in a zip code is 150, the value is recorded as 4.

Choice of methods and models used: Since the goal of the project was to profile the data, the following methods were deemed appropriate: the Naïve Rule, Classification Trees and Logistic Regression. The major characteristics of each model are displayed in the table below:

Model | Sensitivity | Specificity | False Positive | False Negative | Overall Error | Lift
Naïve Rule | 50.88% | 49.12% | 0.00% | 49.12% | 49.12% | -
Classification Tree | 54.63% | 45.37% | 24.18% | 10.33% | 34.51% | 29.74%
Logistic Regression | 64.07% | 35.93% | 27.65% | 8.28% | 35.93% | 26.86%
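The preprocessing described in this section (majority-value consolidation, contribution range codes, and the two dummy variables) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the report: all function names are hypothetical, the income labels other than "< 30k" and "30-40k" are made up, and the range codes are assumed to be 1-based positions in the list of ranges, which is what the report's example (150 recorded as 4) implies.

```python
from bisect import bisect_right

# Lower edges of the contribution ranges listed in the report:
# 0, 1-49, 50-99, 100-199, 200-499, 500-999, 1000-4999,
# 5000-9999, 10000-19999, >20000
LOWER_EDGES = [0, 1, 50, 100, 200, 500, 1000, 5000, 10000, 20000]


def contribution_code(amount):
    """Return the 1-based position of the range containing `amount`.

    Assumptions: codes are 1-based (so 150 -> 4, as in the report's
    example), and an amount of exactly 20000 falls in the top range.
    """
    if amount < 0:
        raise ValueError("contribution cannot be negative")
    return bisect_right(LOWER_EDGES, amount)


def policy_count_dummy(other_policies):
    """1 if the number of non-car policies is greater than 2, else 0."""
    return 1 if other_policies > 2 else 0


def no_car_dummy(pct_no_car):
    """1 if at most 50% of the population has no car, else 0."""
    return 1 if pct_no_car <= 50 else 0


def majority_category(percentages):
    """Return the category with the largest share of households.

    `percentages` maps a category label to its percentage; ties go to
    the label that appears first.
    """
    return max(percentages, key=percentages.get)


# Hypothetical record: only the first two labels come from the report.
income_shares = {"< 30k": 20, "30-40k": 45, "over 40k": 35}
print(majority_category(income_shares))  # 30-40k
print(contribution_code(150))            # 4
```

Using `bisect_right` on the lower edges keeps the range lookup to a single call and makes the boundary behavior (for example, 50 falling in the 50-99 range) explicit.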
The Classification Tree Model (Exhibit D) had the following characteristics:
- Used the log of all individual contribution amounts, the total policies > 2 variable, and the < 50% have no car variable
- Tree was pruned to use only 6 decision points
- Additional contribution variables had very little effect on the overall accuracy
- Error rate is 34.51%

The Logistic Regression Model (Exhibit E) had the following characteristics:
- Started with the same variables as the Classification Tree (including all contributions)
- Narrowed the best output to a model with nine variables
- Error rate is 35.93%
- Interesting note: four of the contribution variables had negative coefficients, meaning that zip codes with higher average contributions to these policies were less likely to purchase at least one car insurance policy

Based on the above results, we see that both the classification tree and the logistic regression model provide a significant lift over the Naïve Rule. Even though the classification tree is marginally better on overall error rate and specificity, the logistic regression model is the best fit for our overall goal, since the larger number of variables in its final model gives a more complete picture of the characteristics of zip codes with car policies.
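The report does not state how lift was calculated. The reported figures appear consistent with lift taken as the relative reduction in overall error compared with the Naïve Rule; the sketch below checks that reading against the reported error rates. The definition and the function name are assumptions, and the inputs are the rounded percentages from the model comparison, so the results agree with the reported lift only to within rounding.

```python
def lift_over_naive(model_error, naive_error):
    """Relative reduction in overall error versus the Naive Rule.

    Assumed definition; the report does not spell out its formula.
    """
    return (naive_error - model_error) / naive_error


# Overall error rates as reported (rounded to two decimals).
NAIVE_ERROR = 0.4912
TREE_ERROR = 0.3451
LOGISTIC_ERROR = 0.3593

tree_lift = lift_over_naive(TREE_ERROR, NAIVE_ERROR)
logistic_lift = lift_over_naive(LOGISTIC_ERROR, NAIVE_ERROR)

print(f"Classification tree lift: {tree_lift:.2%}")
print(f"Logistic regression lift: {logistic_lift:.2%}")
```

Under this reading, the tree's lower overall error translates directly into its slightly higher lift, which matches the ordering in the report.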
Exhibit A: Similar Demographics for Those with or without Car Policies

Exhibit B: Effect of Contribution to Moped Policies
Zip codes with contributions to moped policies have a lower % with car policy.

Exhibit C: Effect of Contribution to Fire Policies on Car Policy Insurance
Zip codes with increasing contributions to fire policies have a higher % with car policy.
Exhibit D: Classification Tree (Pruned Tree)

Split variable | Split value | Records (left branch) | Records (right branch)
3rdParty_prv | 3.7682 | 2532 | 1397
MopedPolicy_ | 2.1587 | 2281 | 251
FirePolicy_c | 4.6641 | 406 | 991
FirePolicy_c | 1.6094 | 1419 | 862
BicyclePolic | 1.6094 | 1374 | 45
Policy # > 2 | 0.5 | 746 | 116

Terminal node classes, in the order extracted: 0, 0, 1 and 1, 0, 0, 1.

Exhibit E: Logistic Regression Results

Prior class probabilities (according to relative occurrences in training data)

Class | Prob.
1 | 0.508755854 (Success Class)
0 | 0.491244146

The Regression Model

Input variables | Coefficient | Std. Error | p-value | Odds
Constant term | -0.36276466 | 0.09566187 | 0.00014935 | *
Contribution_3rdParty_prvt_t | 0.12484553 | 0.01263979 | 0 | 1.13297343
Contribution_3rdParty_firms_ | -0.13780972 | 0.03793787 | 0.00028068 | 0.87126446
Contribution_tractorpolicy_Tr | 0.11102166 | 0.02905416 | 0.00013281 | 1.11741912
Contribution_MopedPolicy_Tr | -0.38373208 | 0.02176546 | 0 | 0.68131393
Contribution_FirePolicy_Tran | -0.10721723 | 0.00966389 | 0 | 0.89833051
Contribution_Bicyclepolicy_t | -0.33676875 | 0.04441036 | 0 | 0.71407396
Contribution_ss_ins_policy_t | 0.20611629 | 0.04648391 | 0.00000924 | 1.22889614
Policy Count > 2 (not car) | 1.35116041 | 0.07966585 | 0 | 3.86190438
<50% No Car | 0.46907887 | 0.09527586 | 0.00000085 | 1.59852111

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

Actual Class | Predicted 1 | Predicted 0
1 | 4184 | 813
0 | 2716 | 2109

Error Report

Class | # Cases | # Errors | % Error
1 | 4997 | 813 | 16.27
0 | 4825 | 2716 | 56.29
Overall | 9822 | 3529 | 35.93
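The Odds column and the Error Report in Exhibit E can be reproduced from the coefficients and the confusion matrix. The sketch below is only a cross-check of the reported numbers, assuming the odds are the exponentiated coefficients and that the confusion matrix rows are actual classes; the short dictionary keys are our own labels, not the tool's variable names.

```python
import math

# Coefficients as reported in Exhibit E (keys are shortened labels).
coefficients = {
    "3rdParty_prvt": 0.12484553,
    "3rdParty_firms": -0.13780972,
    "tractor": 0.11102166,
    "moped": -0.38373208,
    "fire": -0.10721723,
    "bicycle": -0.33676875,
    "ss_ins": 0.20611629,
    "policy_count_gt_2": 1.35116041,
    "no_car_lt_50pct": 0.46907887,
}

# Odds ratio for each variable: exp(coefficient).
odds = {name: math.exp(beta) for name, beta in coefficients.items()}

# Confusion matrix from Exhibit E (rows = actual, columns = predicted).
tp, fn = 4184, 813    # actual class 1: predicted 1, predicted 0
fp, tn = 2716, 2109   # actual class 0: predicted 1, predicted 0
total = tp + fn + fp + tn

error_class_1 = fn / (tp + fn)     # share of class 1 records misclassified
error_class_0 = fp / (fp + tn)     # share of class 0 records misclassified
overall_error = (fn + fp) / total  # share of all records misclassified

for name, value in odds.items():
    print(f"{name:20s} odds = {value:.4f}")
print(f"Class 1 error: {error_class_1:.2%}")
print(f"Class 0 error: {error_class_0:.2%}")
print(f"Overall error: {overall_error:.2%}")
```

An odds value below 1 (moped, bicycle, fire, third party for firms) corresponds to a negative coefficient, which is why higher contributions to those policies lower the estimated odds of holding a car policy.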