A Study of Car Insurance in the Netherlands. BUDT733: Spring 2011




Vijayakumar Ayyaswamy, Logan Baranowitz, Cyrus Havewala, Stephanie Romich

Car Insurance in Netherlands Page 1 of 7

Executive Summary

This project analyzes data for a car insurance firm in the Netherlands. The firm will use the report to target the zip codes that best suit its business and to create marketing strategies that promote its insurance products in those areas.

The data collected for the project include product usage and socio-demographic information for zip codes across the country. Each record corresponds to a particular zip code and customer type, and gives the percentage of the population belonging to various demographic categories, the average contribution to other policies, and the average number of other policies held by the group.

The intent of the analysis is to profile zip code areas in order to create marketing strategies that make buyers more receptive to car insurance and to avoid investment in areas that do not suit the business. The analysis will help cut costs by directing advertising expenditure at the right customers, which in turn increases the return on investment.

It is natural to focus on areas where car usage is higher. Contributions to other policies give insight into residents' existing usage patterns and spending power. For example, higher contributions to bicycle or moped insurance in a zip code suggest that the population is more inclined to use bicycles or mopeds than cars. Such an area may be congested and densely populated, like a downtown, where smaller vehicles and easier accessibility are preferred. Our analysis showed the same pattern:

1. Areas with high contributions to bicycle, moped, fire, and third party (firms) policies are less likely to hold car insurance. These areas may be densely populated, with a preference for smaller vehicles. High contributions to third party insurance for firms suggest that the area is a business center, with less parking and more crowding.
2. Areas with high contributions to social security insurance and tractor policies show potential for car insurance. These areas may be farmland or outskirts where the need for a car is greater.
3. Areas where more than 50% of the population own at least one car and where more than 2 policies other than car insurance are held are conducive to the business.

Based on the analysis, we recommend the following options for targeted marketing:

1. Print advertising and direct mail marketing: advertise in local papers in the targeted zip codes and send direct mail by post. The firm should target rural communities that do not have a high concentration of alternative transportation such as mopeds or bicycles.
2. Joint marketing with car dealers in the area may prove profitable. Areas in which more than 50% of the population have at least one car appear more likely to hold car policies. By encouraging people to buy more cars, the company can enlarge the market for car policies.
3. Provide bundled products of car and tractor policies. As noted above, rural areas appear to account for a significant portion of the company's customers. By bundling tractor and car policies, the company can reach customers who may not have considered purchasing car insurance on its own.

Technical Summary

Goal Definition: The overarching goal of the project is to use a dataset containing demographic and insurance information for 9822 zip codes and to determine whether this data helps explain the purchase of car insurance in these zip codes.

Data description: The insurance dataset contains 9822 records and 85 dimensions. Each record contains the common characteristics of households in a postal zip code. 42 dimensions represent different policy types: 21 indicate the average number of policies owned and the other 21 indicate the total monetary contribution to those policies for a particular zip code. The next 39 dimensions are categorical and show the percentage of total households within a zip code that fall into the dimension. For example, 5 dimensions represent household income level categories, and the records indicate what percentage of households in a zip code fall within each income level. The other major categories covered are Social Class (5), Profession (5), Religious Affiliation (4), Marital Status (4), Education (3), No. of Cars (3), Rent/Own Home (2), Children (2), Health Insurance (2), Average Income (1), Status (1), Average Age (1) and Purchasing Power (1). Finally, Number of Houses and Average Household Size are numerical dimensions. The dataset also classifies the zip codes into customer categories, indicated by the dimension Customer Type.

Data preprocessing, Exploratory Data Analysis and Choice of Variables: The raw data contained numeric codes that were linked to a dictionary holding the actual bin definitions for each dimension. The first step was to convert the raw codes into meaningful bins. For example, in the Average Age dimension, a value of 2 was converted to the bin 30-40 years. Next, dimensions representing the same major category were consolidated. For example, the 5 dimensions representing household income levels (< 30k, 30-40k, etc.), each showing the average percentage of households falling in that level, were consolidated into a single dimension, Household Income. The record was modified to reflect the majority value of the 5 dimensions, i.e. if 30-40k had the largest percentage of households, it became the representative Household Income for that record.

Initial data exploration in Spotfire indicated that there was no meaningful relationship between the demographic data and the number of car insurance policies owned by people in a zip code. In most cases, for any demographic variable, about 50% of the zip codes had car insurance (and any other insurance, for that matter). However, there was a strong relationship between a zip code's average contribution to other kinds of policies and its car insurance policies. For example, zip codes with high contributions to bicycle or moped insurance were less likely to have car policies. Once the contributions to different policies and the total number of policies owned in a zip code (3 or more) were identified as the important dimensions for owning car insurance, the dataset was culled by eliminating all the demographic information. At this point, the data was ready to be used in different classification models.

The variables we identified to include in further analysis consisted of the following:

  Variable                                         Description
  Contribution private third party insurance       Third party insurance for individuals (personal insurance)
  Contribution third party insurance (firms)       Third party insurance for firms
  Contribution tractor policies                    Tractor policies
  Contribution moped policies                      Moped policies
  Contribution fire policies                       Fire policies
  Contribution bicycle policies                    Bicycle policies
  Contribution social security insurance policies  Social security insurance policies
  Total Number of Policies (not Car) > 2           Dummy variable: 1 if the total number of policies other than car insurance is greater than 2, 0 if it is less than or equal to 2
  No Car < 50%                                     Dummy variable: 1 if the percentage of the population in the zip code with no car is less than or equal to 50%, 0 if it is greater than 50%

Example for the contribution variables: the contribution is denoted as a range (Dutch currency): 0, 1-49, 50-99, 100-199, 200-499, 500-999, 1000-4999, 5000-9999, 10000-19999, >20000. If the average contribution to private third party insurance for a zip code is 150, the data is denoted as 4. Both dummy variables take the values 0 or 1.

Choice of methods and models used: Since the goal of the project was to profile the data, the following methods were deemed appropriate: the Naïve Rule, Classification Trees and Logistic Regression. The major characteristics of each model are displayed in the table below:

  Model                Sensitivity  Specificity  False Positive  False Negative  Overall Error  Lift
  Naïve Rule           50.88%       49.12%       0.00%           49.12%          49.12%         -
  Classification Tree  54.63%       45.37%       24.18%          10.33%          34.51%         29.74%
  Logistic Regression  64.07%       35.93%       27.65%          8.28%           35.93%         26.86%
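The encoding steps described in this summary (ordinal contribution codes, majority-bin consolidation, and the two dummy variables) can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the function names and the sample values are hypothetical, and the code numbering assumes that the zero range is coded 1, so that a contribution of 150 maps to 4 as in the example given with the variable list. The original data dictionary may number the ranges differently.

```python
from bisect import bisect_right

# Lower bounds of the contribution ranges listed with the variables:
# 0, 1-49, 50-99, 100-199, 200-499, 500-999, 1000-4999, 5000-9999,
# 10000-19999, >20000
BIN_LOWER_BOUNDS = [0, 1, 50, 100, 200, 500, 1000, 5000, 10000, 20000]


def contribution_code(amount):
    """Map an average contribution to its ordinal range code.

    Assumption: codes start at 1 for the zero range, so 150 -> 4.
    """
    return bisect_right(BIN_LOWER_BOUNDS, amount)


def majority_bin(shares):
    """Return the bin label that holds the largest share of households."""
    return max(shares, key=shares.get)


def policy_count_dummy(other_policies):
    """1 if more than 2 policies other than car insurance are held, else 0."""
    return 1 if other_policies > 2 else 0


def no_car_dummy(pct_no_car):
    """1 if at most 50% of the population has no car, else 0."""
    return 1 if pct_no_car <= 50 else 0


# Hypothetical zip-code record, for illustration only.
income_shares = {"<30k": 20, "30-40k": 45, "40-50k": 25, "50-60k": 7, ">60k": 3}
record = {
    "Household Income": majority_bin(income_shares),
    "Contribution private third party insurance": contribution_code(150),
    "Total Number of Policies (not Car) > 2": policy_count_dummy(3),
    "No Car < 50%": no_car_dummy(35),
}
print(record)
```

Keeping only the majority bin discards the rest of the distribution within each category, which is the trade-off the consolidation step accepts in exchange for a single readable value per record.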

The Classification Tree Model (Exhibit D) had the following characteristics:

- Used the log of all individual contribution amounts, the total policies > 2 variable and the <50% have no car variable
- The tree was pruned to use only 6 decision points
- Additional contribution variables had very little effect on the overall accuracy
- Error rate is 34.51%

The Logistic Regression Model (Exhibit E) had the following characteristics:

- Started with the same variables as the Classification Tree (including all contributions)
- Narrowed the best output to a model with nine variables
- Error rate is 35.93%
- Notably, four of the contribution variables had negative coefficients, meaning that zip codes with higher average contributions to these policies were less likely to purchase at least one car insurance policy

Based on the above results, both the classification tree and the logistic regression model provide a significant lift over the Naïve Rule. Even though the classification tree is marginally better on overall error rate and specificity, the logistic regression model is the best fit for our overall goal: because its final model includes more variables, it gives a more complete picture of the characteristics of zip codes with car policies.
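To make the reading of the coefficients concrete, the sketch below scores a zip code with the logistic regression coefficients reported in Exhibit E. It is an illustration under assumptions: the dictionary keys are shortened versions of the variable names in the output, the contribution inputs are assumed to be already transformed in the same way as the training data, and the example zip code is hypothetical.

```python
import math

# Coefficients as reported in Exhibit E (logistic regression output).
INTERCEPT = -0.36276466
COEFFICIENTS = {
    "Contribution_3rdParty_prvt": 0.12484553,
    "Contribution_3rdParty_firms": -0.13780972,
    "Contribution_tractorpolicy": 0.11102166,
    "Contribution_MopedPolicy": -0.38373208,
    "Contribution_FirePolicy": -0.10721723,
    "Contribution_Bicyclepolicy": -0.33676875,
    "Contribution_ss_ins_policy": 0.20611629,
    "PolicyCount_gt2_notcar": 1.35116041,
    "NoCar_lt50pct": 0.46907887,
}
CUTOFF = 0.5  # cutoff probability for the success class, as in Exhibit E


def predict_probability(features):
    """Probability of class 1 (has car policy); missing inputs count as 0."""
    z = INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(name):
    """Multiplicative change in the odds for a one-unit increase in a variable."""
    return math.exp(COEFFICIENTS[name])


# Hypothetical zip code: no other contributions, more than 2 other policies,
# and at most 50% of the population without a car.
example_zip = {"PolicyCount_gt2_notcar": 1, "NoCar_lt50pct": 1}
p = predict_probability(example_zip)
predicted_class = 1 if p >= CUTOFF else 0
print(round(p, 3), predicted_class)
print(round(odds_ratio("Contribution_MopedPolicy"), 4))
```

A negative coefficient corresponds to an odds ratio below 1, which is why higher moped, bicycle, fire and third party (firms) contributions lower the predicted chance of holding a car policy.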

Exhibit A: Similar Demographics for Those With or Without Car Policies

Exhibit B: Effect of Contribution to Moped Policies. Zip codes with contributions to moped policies have a lower % with car policy.

Exhibit C: Effect of Contribution to Fire Policies on Car Policy Insurance. Zip codes with increasing contributions to fire policies have a higher % with car policy.

Exhibit D: Classification Tree (Pruned Tree)

3rdParty_prv, split at 3.7682
  left (2532 records): MopedPolicy_, split at 2.1587
    left (2281 records): FirePolicy_c, split at 1.6094
      left (1419 records): BicyclePolic, split at 1.6094
        left (1374 records): class 1
        right (45 records): class 0
      right (862 records): Policy # > 2, split at 0.5
        left (746 records): class 0
        right (116 records): class 1
    right (251 records): class 0
  right (1397 records): FirePolicy_c, split at 4.6641
    left (406 records): class 0
    right (991 records): class 1

Exhibit E: Logistic Regression Results

Prior class probabilities (according to relative occurrences in training data):

  Class  Prob.
  1      0.508755854  <-- Success Class
  0      0.491244146

The Regression Model:

  Input variables                Coefficient  Std. Error  p-value     Odds
  Constant term                  -0.36276466  0.09566187  0.00014935  *
  Contribution_3rdParty_prvt_t    0.12484553  0.01263979  0           1.13297343
  Contribution_3rdParty_firms_   -0.13780972  0.03793787  0.00028068  0.87126446
  Contribution_tractorpolicy_Tr   0.11102166  0.02905416  0.00013281  1.11741912
  Contribution_MopedPolicy_Tr    -0.38373208  0.02176546  0           0.68131393
  Contribution_FirePolicy_Tran   -0.10721723  0.00966389  0           0.89833051
  Contribution_Bicyclepolicy_t   -0.33676875  0.04441036  0           0.71407396
  Contribution_ss_ins_policy_t    0.20611629  0.04648391  0.00000924  1.22889614
  Policy Count > 2 (not car)      1.35116041  0.07966585  0           3.86190438
  <50% No Car                     0.46907887  0.09527586  0.00000085  1.59852111

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix:

                 Predicted Class
  Actual Class   1      0
  1              4184   813
  0              2716   2109

Error Report:

  Class    # Cases  # Errors  % Error
  1        4997     813       16.27
  0        4825     2716      56.29
  Overall  9822     3529      35.93
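The error figures for the logistic regression model can be recomputed from the confusion matrix counts in Exhibit E, as in the sketch below. Two points are assumptions rather than statements from the report: the Naïve Rule is taken to predict the majority class for every record, and lift is taken to be the relative reduction in overall error compared with the Naïve Rule. The variable names are illustrative.

```python
# Counts from the classification confusion matrix in Exhibit E.
TP, FN = 4184, 813    # actual class 1: predicted 1, predicted 0
FP, TN = 2716, 2109   # actual class 0: predicted 1, predicted 0


def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the error report."""
    return round(100.0 * numerator / denominator, 2)


total = TP + FN + FP + TN
actual_1 = TP + FN
actual_0 = FP + TN

class_1_error = pct(FN, actual_1)       # share of class 1 records misclassified
class_0_error = pct(FP, actual_0)       # share of class 0 records misclassified
overall_error = pct(FP + FN, total)     # share of all records misclassified
false_positive_share = pct(FP, total)   # false positives as a share of all records
false_negative_share = pct(FN, total)   # false negatives as a share of all records

# Assumed Naive Rule: always predict the majority class, so its error is the
# share of the minority class.
naive_errors = min(actual_1, actual_0)
naive_error = pct(naive_errors, total)

# Assumed lift: relative reduction in overall error versus the Naive Rule.
lift = pct(naive_errors - (FP + FN), naive_errors)

print(class_1_error, class_0_error, overall_error)
print(false_positive_share, false_negative_share, naive_error, lift)
```

Under these assumptions the results agree with the error report and with the logistic regression row of the model comparison table.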