Understanding Characteristics of Caravan Insurance Policy Buyer



Similar documents
Prediction of Car Prices of Federal Auctions

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Determining Factors of a Quick Sale in Arlington's Condo Market. Team 2: Darik Gossa Roger Moncarz Jeff Robinson Chris Frohlich James Haas

Easily Identify Your Best Customers

Data Mining: Overview. What is Data Mining?

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Business Analytics using Data Mining

Predictive Data modeling for health care: Comparative performance study of different prediction models

JetBlue Airways Stock Price Analysis and Prediction

Chapter 7: Simple linear regression Learning Objectives

not possible or was possible at a high cost for collecting the data.

Data Mining for Fun and Profit

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Multiple Linear Regression in Data Mining

Generalized Linear Models

The Insurance Company (TIC) Benchmark Original Problem Task Description

Data Mining Algorithms Part 1. Dejan Sarka

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Data Mining Methods: Applications for Institutional Research

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Data Mining Applications in Higher Education

Data Mining - Evaluation of Classifiers

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Business Analytics using Data Mining Project Report. Optimizing Operation Room Utilization by Predicting Surgery Duration

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

A Property & Casualty Insurance Predictive Modeling Process in SAS

DATA MINING TECHNIQUES AND APPLICATIONS

Predictive modelling around the world

The primary goal of this thesis was to understand how the spatial dependence of

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT

Get to Know the IBM SPSS Product Portfolio

EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION

from Larson Text By Susan Miertschin

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)

Numerical Algorithms Group

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Why do statisticians "hate" us?

Azure Machine Learning, SQL Data Mining and R

Using Adaptive Random Trees (ART) for optimal scorecard segmentation

Data quality in Accounting Information Systems

EARLY VS. LATE ENROLLERS: DOES ENROLLMENT PROCRASTINATION AFFECT ACADEMIC SUCCESS?

August 2012 EXAMINATIONS Solution Part I

Mobile Money in Developing Countries: Common Factors. Khyati Malik

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Data Mining with SAS. Mathias Lanner Copyright 2010 SAS Institute Inc. All rights reserved.

Additional sources Compilation of sources:

Course Syllabus. Purposes of Course:

SAP Predictive Analysis Real Life Use Case Predicting Who Will Buy Additional Insurance

DHL Data Mining Project. Customer Segmentation with Clustering

Data mining and statistical models in marketing campaigns of BT Retail

Simple Predictive Analytics Curtis Seare

Moderation. Moderation

Predicting Defaults of Loans using Lending Club s Loan Data

KnowledgeSEEKER Marketing Edition

Easily Identify the Right Customers

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Chapter 4 and 5 solutions

Data Mining Techniques Chapter 7: Artificial Neural Networks

CREDIT SCORING MODEL APPLICATIONS:

What is Market Research? Why Conduct Market Research?

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Binary Logistic Regression

DIVIDEND POLICY, TRADING CHARACTERISTICS AND SHARE PRICES: EMPIRICAL EVIDENCE FROM EGYPTIAN FIRMS

USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Regression 3: Logistic Regression

Cleaned Data. Recommendations

Implied Volatility Skews in the Foreign Exchange Market. Empirical Evidence from JPY and GBP:

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

IBM SPSS Direct Marketing 23

C:\Users\<your_user_name>\AppData\Roaming\IEA\IDBAnalyzerV3

Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry

Benchmarking of different classes of models used for credit scoring

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Segmentation: Foundation of Marketing Strategy

Course Outcome Summary

IBM SPSS Direct Marketing 22

Data Mining mit der JMSL Numerical Library for Java Applications

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Alex Vidras, David Tysinger. Merkle Inc.

Chapter 6. The stacking ensemble approach

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

Transcription:

Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul

Executive Summary This report is intended to understand characteristics of a caravan insurance policy buyer. The dataset we used consists of 9,822 customer records and includes socio demographic data of the area where a customer lives and product ownership data of the customer. Our aim of this profiling analysis is to acquire managerial insights to create competitive strategies useful for making business decision. In our analysis, we found that caravan insurance policy buyers show several characteristics which are consistent with our common sense. Intuitively, a caravan policy buyer might own a caravan and a caravan owner might own a car to pull his/her caravan. Our research clearly supports this intuitive profiling. Furthermore, one might suppose that a caravan policy buyer should be rich, leisured or risk averse. Our research also shows correlation between the ownership of caravan policies and indicators of wealth or risk averseness. Overall, our model describes a typical caravan insurance policy buyer as a rich, risk averse person who lives in wealthy area. Based on the findings from our research, we suggest the following actionable plans to help the company gain deeper understanding of customer characteristics and develop marketing strategies that align with the findings. Area Focused Marketing: Marketers could focus on marketing strategies that are most suitable and most likely to induce customers in each specific area. They might place marketing campaigns in areas with high proportion of highly educated people and high proportion of high average incomers. Advertising Campaign: The insurance company could seek opportunities of cross selling by knowing who are potential customers (risk averse and rich). Bundled Products: The insurance company could offer potential customers a bundle package of car insurance and caravan insurance. Joint Marketing: Marketers could hold a joint marketing event with caravan makers to target potential customers and arouse their awareness of risks. 1

Technical Summary Our analysis is aimed at explaining characteristics of people who are likely to or not to buy caravan policy, based on sampled product usage data, such as contribution and number of insurance policies, and socio demographical data, such as average household size and average income. Our original dataset, which was originally used in a data mining contest 1, consists of 86 variables 2 and 9,822 customer records, of which are originally 5,822 training data records and 4,000 validation data records. The response variable is CARAVAN_POLICY whose value is either Buyer or Non buyer of a caravan policy. It is also important to note that the socio demographic data were derived from zip codes. All customers living in the same zip code are assumed to share the same socio demographic attributes. Preprocessing We preprocessed our dataset in order to understand characteristics of caravan policy buyers clearly. We oversampled the success class we are interested in to train our models since the success class, buyer of a caravan policy, is rare (/5,822 = 5.98%). We applied the models to the original (nonoversampled) validation set to avoid a biased estimate of model performance. See Exhibit 1 for the framework we used to preprocess our dataset. We explored the data to see the validity of each variable as a useful predictor. We mainly used bar charts since almost all of the variables in our dataset are categorical. See Exhibit 2 for examples of data visualization. Based on our data analysis, we eliminated 78 variables that do not significantly distinguish people who are likely to buy caravan policy from those who are unlikely to. We also dropped variables which, by its nature, seem to be correlated and kept one in each group of correlated variables in our model; for example, M_RENTED_HOUSE (% of house renters in a specific zip code area) and M_HOME_OWNERS (% of home owners) which are two sides of a coin. We transformed some remaining variables to make our analysis more practical based on domain knowledge and insight from data visualization. For example, we created a binary variable, PRIV_3RD, stating whether or not a person buy at least one private third party insurance while we dropped CON_PRIV_3RD (Contribution to private third party insurance) even though the proportion of success class in the records which have higher values of CON_PRIV_3RD tends to be higher. The underlying reason for these transformations is explanatory simplicity. The model with binary variables is better than that with numerical variables because it is easy to interpret as long as the performance of each model does not show significant difference. In this way, we created three derived variables: FIRE_INS (0 indicates that a person does not have any fire policies; 1 otherwise), PRIV_3RD (0 indicates that a person does not have private third party insurance, 1 otherwise), and CAR_INS (0 indicates that a person does not have any car policies; 1 otherwise). We also created another two dummy variables since some characteristics of zip code areas distinguish well people who buy caravan insurance from those who do not: M_MAIN=DRV_GRW (1 indicates a person whose customer main type is driven growers ; 0 otherwise), and M_SUB=MID_CLS (1 indicates a person whose customer subtype is middle class families ; 0 otherwise). Analysis We constructed models using the two algorithms, logistics regression and classification tree to analyze our data. We didnʹt use discriminant analysis because almost all of our variables are 1 The CoIL Challenge 2000, http://www.liacs.nl/~putten/library/cc2000/ 2 We revised names of variables used in the original contest to make them more understandable in this report 2

Understanding the characteristics of caravan insurance holders categorical and some of them are dummies which violate the assumption of discriminant analysis. See Exhibit 4 for the outputs from models. In our logistics regression analysis, we eliminated useless or insignificant variables based on our domain knowledge and p values for each variable. After all, our final model had four explanatory variables; M_EDU_HIGH (binned proportion of people who have high education in a specific area) M_AVG_INCOME (binned level of average income of a specific area) PRIV_3RD_INS CAR_INS Of these four variables, CAR_INS had the lowest p value, thus its significance contributed the most to the model. Based on our analysis, we found that all these four variables were attributable to two main customer characteristics, wealth and risk aversion. In our classification tree analysis, CAR_INS, M_EDU_HIGH, and M_AVG_INCOME played an important role in explaining characteristics of caravan insurance policy buyer. Although these two models yielded some similar results; that is, some variables in both models had explanatory power, the logistics regression model seemed to perform better than the classification tree. In the logistics regression, the percentage errors of Buyer (success class) in training and validation sets are (28.74% and 34.87% respectively) lower than those of the classification tree (37.36% and 39.50% respectively). We thus selected logistics regression model to explain behavior of caravan insurance policy buyers. The implications drawn from the logistics regression model are as follows: 1. We found that caravan insurance buyers are likely to live in wealthy area. The underlying logic of this finding might be as follows: Based on our common sense, caravan owners would be rich and wealthy people would live in wealthy area. 2. Residents living in area which has high proportion of highly educated people are more likely to be high incomers, thus they tend to be a caravan policy insurance buyers. 3. A caravan owner is likely to own a car. In addition, private third party insurance policy is required for car owners. The ownership of car insurance and private third party insurance is a good indicator of car ownership. 4. If a person is risk averse, he/she would buy insurances. The ownership of car insurance and private third party insurance is a good indicator of risk averseness. A risk averse caravan owner is likely to buy a caravan policy. Recommendations The implications of the four explanatory variables show that wealth and risk aversion are two critical attributes of a caravan buyer. These characteristics give us some hints for further actions: Area Focused Marketing: Based on socio demographic predictor such as M_EDU_HIGH and M_AVG_INCOME, marketers could place marketing campaigns in areas with high proportion of highly educated people and high proportion of high average incomers. Advertising Campaign: Based on customers past transactions, the insurance company could seek opportunities of cross selling by knowing who are likely to be potential customers. Bundled Products: The company could bundle a car insurance and a caravan insurance to target the potential customers. Joint Marketing: Marketers could join caravan makers in launching promotion campaigns to focus target potential customers and to arouse their awareness of risks. 3

Exhibit 1: Framework for data pre processing Original* Training Original* Training Original* Validation Original* Validation 5822 5822 Over-sampled Over-sampled Build Models Build Models Explanatory Explanatory Models Models Apply Models Apply Models Validation Validation * Original refers to the original name in the data mining contest. * Original refers to the original name in the data mining contest. Exhibit 2: Bar Charts for selected variables Bar Chart 100% 227 151 145 69 47 36 9 7 3 2 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 0 1 2 3 4 5 6 7 8 9 M_EDU_HIGH Bar Chart 100% 255 441 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 0 1 CAR_INS 4

Understanding the characteristics of caravan insurance holders Exhibit 3: Logistics Regression Model The Regression Model Input variables Constant term M_EDU_HIGH M_AVG_INCOME PRIV_3RD_INS CAR_INS Coefficient Std. Error p-value Odds -2.05754209 0.31352952 0 * Residual df 691 0.17362288 0.05885206 0.00317612 1.1896069 Residual Dev. 847.1951904 0.17481972 0.07740599 0.02391586 1.19103146 % Success in training data 50 0.4663696 0.17033134 0.00618115 1.59419608 # Iterations used 8 1.32716846 0.18058152 0 3.77035236 Multiple R-squared 0.12195091 Training Data scoring - Summary Report Buyer 248 100 Non-Buyer 131 217 Buyer 100 28.74 Non-Buyer 131 37.64 Overall 696 231 33.19 Validation Data scoring - Summary Report Buyer 155 83 Non-Buyer 1455 2307 Buyer 238 83 34.87 Non-Buyer 3762 1455 38.68 Overall 1538 38.45 Exhibit 4: Classification Tree Training Data scoring - Summary Report (Using Full Tree) Buyer 218 130 Non-Buyer 96 252 CAR_INS 1980 2020 Buyer 130 37.36 Non-Buyer 96 27.59 Overall 696 226 32.47 Validation Data scoring - Summary Report (Using Best Pruned Tree) Non-Buy 1.5 M_EDU_HIGH 1180 840 3.5 Buyer Buyer 144 94 Non-Buyer 1191 2571 M_AVG_INCOME 685 495 Buyer 238 94 39.50 Non-Buyer 3762 1191 31.66 Overall 1285 32.13 Non-Buy Buyer 5