Insurance Analytics - data analysis and predictive modelling in insurance
Pavel Kříž
Seminar on Actuarial Sciences, MFF
4 April 2014
Summary
1. Application areas of Insurance Analytics
2. Insurance Analytics process
3. Modelling techniques (CRT in more detail)
4. Case study
   - Identification of customers for an X-selling campaign
   - Business and data understanding
   - Predictive modelling
   - Cost-benefit analysis
2014 Deloitte Česká republika
Areas of application in insurance
Main application areas of Insurance Analytics
Cross-selling, up-selling: Which customers are most likely to buy more? => classification analysis
Retention: Who is most likely to leave? => classification analysis
Fraud detection: Which claims are likely to be fraudulent? => cluster analysis
Claim segmentation: Which claims are likely to become large? => regression analysis, cluster analysis
Main application areas of Insurance Analytics
Pricing: What are the expected claims/risks of an individual? => regression analysis
Identification of segments/factors relevant for pricing => cluster analysis
Underwriting efficiency: Through which channel should an individual be approached? => classification analysis, cluster analysis
Customer Lifetime Value (CLV): What value will the customer bring to the company? => CF projections, regression analysis, Markov chains
- A more complex problem
- Powerful in combination with other analyses
Insurance Analytics process
Insurance Analytics project
Phases of the CRISP-DM reference model
The process is not straightforward - phases are revisited iteratively
Phases overview - components of the phases of the CRISP-DM reference model
Business Understanding: business objectives; assess situation
Data Understanding: collect initial data; describe data; explore data (1-way ANOVA, scatter plots, box plots); verify data quality
Data Preparation: clean data; select relevant variables; construct new variables; identification of outliers; dimension reduction techniques; cluster analysis
Modelling: select modelling techniques; build model (GLM - e.g. logistic regression, CRT, naive Bayesian networks, neural networks)
Evaluation: evaluate results; assess models (cross-validation, cumulative gains, lift, cost-benefit analysis)
Deployment: application of results; next steps; optimization
Modelling techniques
GLM - overview
Based on a parametrized statistical model with explicit assumptions
- Observations: e.g. number of claims, severity of claims, lapse/non-lapse
- Link function: selected a priori
- Linear predictor: parameters (estimated) and factors
- Random error: distribution from the exponential family, selected a priori
Parameters estimated using statistical techniques, e.g. maximum likelihood => (asymptotic) properties known from general theory
Includes logistic regression (binary response variable), useful for classification
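The estimation step can be sketched in Python. This is a minimal illustration only - toy data and plain gradient ascent on the Bernoulli log-likelihood instead of the IRLS/Newton routines a statistical package would use - not the seminar's actual fitting code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit log(p/(1-p)) = b0 + b1*x by maximising the Bernoulli
    log-likelihood with plain (averaged) gradient ascent."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of log-likelihood
            g0 += err
            g1 += err * x
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1

# Toy data: propensity to buy rises with the explanatory variable
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_at_6 = sigmoid(b0 + b1 * 6)  # predicted probability at x = 6
```

The fitted slope is positive, so the predicted probability increases with x, mirroring the "more policies => higher propensity" pattern seen later in the case study.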
Classification and regression trees (CRT)
Tree: (X1, X2, ..., Xn) -> Y
[Diagram: example tree with splits X2 < 5, X1 < 1, X1 < 4, X2 < 2 and leaf predictions Y = 2, Y = 0, Y = 1.3, Y = 5, Y = 3.5]
A piece-wise constant function on segments of the factor space
Classification trees: Y = 0 or Y = 1; predictions in terms of distributions: y = P(Y = 1)
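The piece-wise constant prediction can be sketched as a walk through nested tests. The cut-points and leaf values below loosely mirror the slide's example and are otherwise arbitrary:

```python
# A fitted tree is just nested if/else tests; prediction walks from the
# root to a leaf and returns that leaf's constant value.
tree = {"var": "X2", "cut": 5,
        "lo": {"var": "X1", "cut": 1,       # taken when X2 < 5
               "lo": {"leaf": 2.0},         # X1 < 1
               "hi": {"leaf": 0.0}},        # X1 >= 1
        "hi": {"leaf": 3.5}}                # X2 >= 5

def predict(node, row):
    """Descend until a leaf is reached; the leaf's constant is the prediction."""
    while "leaf" not in node:
        node = node["lo"] if row[node["var"]] < node["cut"] else node["hi"]
    return node["leaf"]
```

For a classification tree the leaf constant would be the class share p = P(Y = 1) rather than a numeric mean.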
Regression tree for house prices in California
Source: http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
Growing (learning) CRT
Performed on training data
1. Find a good partitioning:
   Start with a single node. For each node: if the stopping criterion is met, go to the next node; otherwise find a variable and critical value that minimize within-node heterogeneity, split the node into 2 new nodes, and go to the next node.
2. Fit local models for the leaves: a constant function - the average over observations in the leaf
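A minimal sketch of the greedy growing loop above, using the sum of squared errors as the heterogeneity measure; the toy data and the `min_size` stopping rule are illustrative:

```python
def best_split(rows, ys):
    """Greedy search over every variable and every observed cut-point;
    returns (heterogeneity after split, variable index, cut) or None."""
    def sse(sub):
        m = sum(sub) / len(sub)
        return sum((y - m) ** 2 for y in sub)
    best = None
    for j in range(len(rows[0])):
        for cut in sorted({r[j] for r in rows}):
            lo = [y for r, y in zip(rows, ys) if r[j] < cut]
            hi = [y for r, y in zip(rows, ys) if r[j] >= cut]
            if not lo or not hi:
                continue
            score = sse(lo) + sse(hi)  # within-node heterogeneity
            if best is None or score < best[0]:
                best = (score, j, cut)
    return best

def grow(rows, ys, min_size=2):
    """Recursive growing; stopping criterion: node too small or no split."""
    if len(ys) <= min_size:
        return sum(ys) / len(ys)          # leaf: average of observations
    found = best_split(rows, ys)
    if found is None:
        return sum(ys) / len(ys)
    _, j, cut = found
    lo = [(r, y) for r, y in zip(rows, ys) if r[j] < cut]
    hi = [(r, y) for r, y in zip(rows, ys) if r[j] >= cut]
    return (j, cut,
            grow([r for r, _ in lo], [y for _, y in lo], min_size),
            grow([r for r, _ in hi], [y for _, y in hi], min_size))

def predict(node, row):
    while isinstance(node, tuple):
        j, cut, lo, hi = node
        node = lo if row[j] < cut else hi
    return node

# Toy data with an obvious split at x = 10
rows = [[0], [1], [2], [10], [11], [12]]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow(rows, ys)
```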
Growing CRT
Stopping criterion:
- minimum number of observations within each child
- threshold for the decrease in heterogeneity
Measuring heterogeneity - various measures => various algorithms:
1. Regression trees: sum of squared errors
   S = sum over leaves c of sum_{i in c} (y_i - m_c)^2, where m_c = (1/|c|) sum_{i in c} y_i
2. Classification trees:
   Gini impurity (a sum-of-squared-errors analogue): I_G = avg over leaves c of p_c (1 - p_c), where p_c = (1/|c|) sum_{i in c} y_i
   Entropy: I_E = avg over leaves c of [ -p_c log2 p_c - (1 - p_c) log2 (1 - p_c) ]
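The three per-leaf heterogeneity measures can be written directly from the formulas above (0/1 labels for the classification measures):

```python
import math

def sse(ys):
    """Regression trees: sum of squared errors around the leaf mean m_c."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def gini(ys):
    """Classification trees (0/1 labels): Gini impurity p_c (1 - p_c)."""
    p = sum(ys) / len(ys)
    return p * (1 - p)

def entropy(ys):
    """Classification trees: -p log2 p - (1-p) log2 (1-p), 0 for pure leaves."""
    p = sum(ys) / len(ys)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

A pure leaf scores 0 under all three measures; a 50/50 leaf maximises both Gini (0.25) and entropy (1 bit).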
Pruning CRT
Problem: how to set the stopping criterion
- Too strict => CRT cannot capture local dependencies
- Too mild => over-fitting
Way out: pruning
1. Grow a large tree with a mild stopping criterion
2. Prune the tree: re-join leaves whose split does not decrease heterogeneity
Combine with cross-validation: 2 sets of records, training and testing
- Use training data to grow the tree
- Use testing data to prune the tree
Uncertainty in CRT
How precise/reliable is a prediction from a CRT?
1. Error in the conditional mean estimate
   Prediction error assuming the tree is correct
   Estimate of the standard error of the mean (under i.i.d. errors):
   SE_c = s_c / sqrt(|c|), where s_c^2 = (1/(|c|-1)) sum_{i in c} (y_i - m_c)^2
2. Error in tree fitting
   How different would the tree be had we drawn a different sample?
   Non-parametric bootstrapping: draw samples from the data with replacement, grow trees on the samples => empirical distribution
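Both uncertainty estimates are easy to sketch. The leaf data below is illustrative, and the bootstrap here refits only the leaf mean rather than a whole tree:

```python
import math
import random

def leaf_se(ys):
    """Standard error of the leaf mean under i.i.d. errors:
    SE_c = s_c / sqrt(|c|), with the usual n-1 sample variance."""
    n = len(ys)
    m = sum(ys) / n
    s2 = sum((y - m) ** 2 for y in ys) / (n - 1)
    return math.sqrt(s2 / n)

def bootstrap_means(ys, reps=1000, seed=0):
    """Non-parametric bootstrap: resample with replacement and refit
    the local model (here just the mean) on each resample."""
    rng = random.Random(seed)
    return [sum(rng.choices(ys, k=len(ys))) / len(ys) for _ in range(reps)]

ys = [1.0, 2.0, 3.0, 4.0, 5.0]   # observations in one leaf
se = leaf_se(ys)
boot = bootstrap_means(ys)        # empirical distribution of the leaf mean
```

The spread of `boot` gives an empirical picture of how the leaf estimate would vary across samples, which is the second source of uncertainty on the slide.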
Why use CRT?
Pros:
- Captures local dependencies and interactions
- Making predictions is fast
- Identifies variables important for predictions => dimension reduction
- Easy to understand and interpret (white box)
- Able to handle numerical and categorical data
- Robust (not sensitive to assumptions)
- Performs well with large datasets
- Can make predictions even if some variables are missing
- Enables distributional predictions
Cons:
- Not smooth => no sensitivities; does not distinguish points within a segment
- Difficulties with global patterns
- Learning is NP-complete => only locally optimal trees
Naive Bayes Classifiers
Based on:
- Bayes' theorem
- Conditional independence of the factors (given the class variable) - a too strong, "naive" assumption
Probabilistic model:
P(Y = c | X1 = x1, ..., Xn = xn) = P(Y = c) P(X1 = x1, ..., Xn = xn | Y = c) / P(X1 = x1, ..., Xn = xn)
                                 = (1/Z) P(Y = c) P(X1 = x1 | Y = c) ... P(Xn = xn | Y = c)
Parameter estimation: maximum likelihood = relative frequencies
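A sketch of the model above for categorical factors. The toy data and the light add-one smoothing (which the slide's pure relative-frequency estimates do not include; the `2 * alpha` denominator assumes two values per factor, as in the toy data) are illustrative:

```python
from collections import defaultdict

def fit_nb(rows, ys, alpha=1.0):
    """Naive Bayes for categorical factors; maximum-likelihood estimates
    are relative frequencies, alpha adds light smoothing so unseen
    factor values do not zero out the product."""
    classes = sorted(set(ys))
    prior = {c: ys.count(c) / len(ys) for c in classes}
    counts = defaultdict(int)   # (class, feature index, value) -> count
    totals = defaultdict(int)   # (class, feature index) -> count
    for r, y in zip(rows, ys):
        for j, v in enumerate(r):
            counts[(y, j, v)] += 1
            totals[(y, j)] += 1

    def posterior(row):
        scores = {}
        for c in classes:
            p = prior[c]
            for j, v in enumerate(row):
                p *= (counts[(c, j, v)] + alpha) / (totals[(c, j)] + 2 * alpha)
            scores[c] = p
        z = sum(scores.values())          # the 1/Z normalisation above
        return {c: s / z for c, s in scores.items()}
    return posterior

# Toy data: both factors separate the classes perfectly
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
ys = [0, 0, 1, 1]
post = fit_nb(rows, ys)
p1 = post(("b", "y"))[1]   # posterior probability of class 1
```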
Case study: Cross-selling of Caravan insurance
Business and data understanding
Business goals
Determination of target customers for an X-selling campaign for Caravan insurance
Data Understanding
Source: public dataset from the CoIL 2000 data mining competition
Downloaded from http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
Characteristics:
- A table with customer data of an insurance company, 9822 records
- Target variable: Caravan insurance holder (binary)
- 86 attributes: 43 socio-demographic variables derived from ZIP code; 43 variables about premium paid and number of insurance policies
- Only about 6% of customers have Caravan insurance
Data Exploration
Single-factor analysis: prediction power w.r.t. the dependent variable
Variables categorized => prediction power tested by 1-way ANOVA
Summary of 1-way ANOVA results - selected variables with the highest F-statistic:

Variable  | Description                                       | F       | Sig. (p-value)
PPERSAUT  | Contribution, car policies                        | 145.393 | 4.36E-33
APERSAUT  | Number of car policies                            | 123.607 | 1.98E-28
ATOTAL    | Total number of insurance policies (ADJ)          | 104.262 | 2.84E-24
PTOTAL    | Total insurance contribution (ADJ)                | 78.093  | 1.28E-18
APLEZIER  | Number of boat policies                           | 65.758  | 6.16E-16
PWAPART   | Contribution, private third-party insurance       | 56.454  | 6.62E-14
MKOOPKLA  | Purchasing power class                            | 54.066  | 2.21E-13
MINKGEM   | Average income                                    | 47.725  | 5.43E-12
MOPLEDUC  | Education level (ADJ)                             | 46.977  | 7.92E-12
AWAPART   | Number of private third-party insurance policies  | 46.877  | 8.34E-12
MSKSOCIAL | Social class (ADJ)                                | 37.892  | 7.98E-10
MINCOME   | Average income (ADJ)                              | 31.339  | 2.27E-08
MRELGE    | Married                                           | 29.271  | 6.54E-08
PPLEZIER  | Contribution, boat policies                       | 28.946  | 7.73E-08

Factors are ordered by F-statistic: the higher the F-statistic, the better the prediction potential of the factor
- No information about the sign of the dependency; the dependency need not be monotonous
- Does not capture correlations
Data Exploration
More detailed analysis of selected individual factors, selected based on business feeling and 1-way ANOVA
Main factors expected to determine the propensity to buy Caravan insurance: ownership of car(s), wealth (purchasing power), risk aversion (propensity to buy insurance)
Example of single-factor analysis - car insurance policies:
[Chart: number of customers and share of Caravan insurance by number of car policies]
- More policies => higher propensity to buy Caravan insurance
- Classes with a low number of customers are irrelevant
Data Exploration
Boat insurance:
[Chart: number of customers and share of Caravan insurance by number of boat policies]
- Having boat insurance significantly increases the propensity to hold Caravan insurance
- Only a few customers have boat insurance
Purchasing power class:
[Chart: number of customers and share of Caravan insurance by purchasing power class]
- The prediction power of the factor purchasing power class appears rather limited (except for class 7)
Data Exploration
Family status:
[Chart: number of customers and share of Caravan insurance by marriage rate]
- A higher chance of being married (derived from ZIP code) increases the chance of buying Caravan insurance
- Current marital status (not available in the current data) might be a useful factor
Data Preparation
Can be very time-consuming, but important for successful modelling
Includes: data cleansing, derivation of new factors, use of external sources, etc.
Modelling
Cross-validation
Records divided into 2 subsets:
- Training data: used to estimate parameters
- Testing data: used to verify the model's prediction power
Logistic regression
Special type of GLM: Bernoulli 0-1 valued response variable, P(Y=1) = p, P(Y=0) = 1-p
Regression equation: log(p / (1-p)) = sum_{i=1..K} B_i x_i, where B_i = parameter, x_i = variable
Based on forward stepwise selection and the single-factor analysis, the following variables were selected for the model:

Variable | Description
PPERSAUT | Contribution, car policies
MKOOPKLA | Purchasing power class
MRELGE   | Married
MOPLEDUC | Education level (ADJ)
PTOTAL   | Total insurance contribution (ADJ)
APLEZIER | Number of boat policies
Logistic regression - dummy variables
Recoding of the categorical variable MKOOPKLA (purchasing power class) into several dummy (0-1) variables:
- Enables use of nominal/categorical variables
- Enables modelling of non-monotonous dependencies

Categorical variable coding - factor value -> dummy variables (1)..(7):
1 -> 1 0 0 0 0 0 0
2 -> 0 1 0 0 0 0 0
3 -> 0 0 1 0 0 0 0
4 -> 0 0 0 1 0 0 0
5 -> 0 0 0 0 1 0 0
6 -> 0 0 0 0 0 1 0
7 -> 0 0 0 0 0 0 1
8 -> 0 0 0 0 0 0 0 (base class)
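The dummy coding in the table can be reproduced in a few lines; treating class 8 as the base level (all zeros) follows the table:

```python
def dummy_code(value, levels, base):
    """Recode a categorical value into 0/1 dummies, one per non-base
    level; the base level is represented by all zeros."""
    return [1 if value == lv else 0 for lv in levels if lv != base]

levels = list(range(1, 9))            # purchasing power classes 1..8
row7 = dummy_code(7, levels, base=8)  # class 7 -> dummy (7) set to 1
row8 = dummy_code(8, levels, base=8)  # base class 8 -> all zeros
```

The base class carries no coefficient of its own: the other classes' parameters (and their Exp(B)) are interpreted relative to it.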
Logistic regression
Regression equation: log(p / (1-p)) = sum_{i=1..K} B_i x_i, where B_i = parameter, x_i = variable
Parameter interpretation:
- B: estimated parameter value
- Sig.: the p-value of the test of the hypothesis parameter = 0; low values indicate significant variables
- Exp(B): multiplicative change of the odds p/(1-p) when the corresponding variable increases by 1 unit
Example (see below): an increase in the number of boat policies by 1 multiplies the odds (ratio of those with Caravan insurance to those without) by 7.3 - compare to the single-factor analysis.

Parameter estimates:
Variable | Description             | B       | Sig.    | Exp(B)
APLEZIER | Number of boat policies | 1.99262 | 0.00000 | 7.33469
Logistic regression - parameter estimates (MLE)

Variable    | Description                        | B        | Sig.    | Exp(B)
MKOOPKLA    | Purchasing power class             |          | 0.00106 |
MKOOPKLA(1) |                                    | -0.44840 | 0.16598 | 0.63865
MKOOPKLA(2) |                                    | -0.30545 | 0.38822 | 0.73679
MKOOPKLA(3) |                                    | -0.24723 | 0.29789 | 0.78096
MKOOPKLA(4) |                                    | -0.13853 | 0.58047 | 0.87064
MKOOPKLA(5) |                                    | -0.09847 | 0.72481 | 0.90622
MKOOPKLA(6) |                                    | 0.06729  | 0.76930 | 1.06961
MKOOPKLA(7) |                                    | 0.60826  | 0.00710 | 1.83723
PPERSAUT    | Contribution, car policies         | 0.00046  | 0.00000 | 1.00046
MRELGE      | Married                            | 0.01052  | 0.00054 | 1.01057
MOPLEDUC    | Education level (ADJ)              | 0.36838  | 0.00036 | 1.44539
PTOTAL      | Total insurance contribution (ADJ) | -0.00008 | 0.06396 | 0.99992
APLEZIER    | Number of boat policies            | 1.99262  | 0.00000 | 7.33469
Constant    |                                    | -4.72476 | 0.00000 | 0.00887

Observations:
- Being in purchasing power class 7 significantly increases the probability of Caravan insurance compared to the base class 8; the other classes rather decrease the probability, but those estimates are not significant (compare to the single-factor analysis)
- An increase in contributions to car insurance increases the probability of Caravan insurance; the sensitivity to a change of 1 monetary unit is not high, but the parameter is significant and should stay in the model
- A higher probability of being married increases the probability of Caravan insurance
- A higher education level increases the probability of Caravan insurance
- Higher total insurance contributions slightly decrease the probability of Caravan insurance, which is counter-intuitive (compare to the single-factor analysis); however, the parameter is insignificant => should it be excluded from the model?
- Having a boat policy => high probability of having Caravan insurance
Logistic regression - diagnostics of classification
Predictions on the testing set:

              | Prediction 0 | Prediction 1 | Sum
Real value 0  | 3629         | 133          | 3762
Real value 1  | 212          | 26           | 238
Caravan share | 6%           | 16%          | 6%

- Correct predictions = 3629 + 26 = 3655
- If we always predicted 0 => 3762 correct predictions (but useless)
- Prediction = 1 => 26 correct, 133 misclassified
- A seemingly poor classifier due to unbalanced data (only 6% Caravan insurance holders)
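The diagnostics above follow directly from the confusion matrix; the figures are the slide's own. The point is that raw accuracy is beaten by the trivial all-zero classifier, while the 16% caravan share among predicted positives is well above the ~6% base rate:

```python
# Confusion-matrix figures from the testing set on the slide
tn, fp, fn, tp = 3629, 133, 212, 26

total = tn + fp + fn + tp
accuracy = (tn + tp) / total     # 3655 correct out of 4000
baseline = (tn + fp) / total     # accuracy of always predicting 0
precision = tp / (tp + fp)       # caravan share among predicted 1s (~16%)
base_rate = (tp + fn) / total    # caravan share in the data (~6%)
```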
Logistic regression - diagnostics of probabilities
[Charts: Cumulative Gains and Lift vs. percentage of observations with the highest probability provided by the model]
- Cumulative gains: the top 20% of all observations (by predicted probability) contain 45% of all Caravan insurance holders
- Corresponding lift: for the best 20% of observations the method is 2.5x better than random selection
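A sketch of how the cumulative-gain and lift figures are computed; the scores and labels below are made up, chosen so that the top 20% reproduces a lift of 2.5:

```python
def gains_and_lift(probs, ys, frac):
    """Cumulative gains: take the top `frac` of observations by predicted
    probability; gain = share of all positives captured, and
    lift = gain / frac (how much better than random selection)."""
    order = sorted(range(len(ys)), key=lambda i: -probs[i])
    k = max(1, int(round(frac * len(ys))))
    captured = sum(ys[i] for i in order[:k])
    gain = captured / sum(ys)
    return gain, gain / frac

# Hypothetical scores: the model ranks most positives near the top
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ys =    [1,   1,   0,   1,   0,   0,   0,   1,   0,   0]
gain, lift = gains_and_lift(probs, ys, 0.2)
```

Sweeping `frac` from 0 to 1 traces out the two curves shown on the slide.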
Logistic regression - diagnostics of probabilities (2)
[Charts: Cumulative Gains and Lift for the training and testing sets]
- Cumulative gains/lift slightly better for the training set => slight over-fitting
Classification tree - growing
- Pre-selection of variables not needed
[Graph: decrease in deviance (Gini impurity) as nodes are added]
Classification tree - pruned
- Based on the graph, 20 nodes were originally selected => over-fitting
- Number of nodes reduced to 6 => worse Lift for the training set, but better Lift for the testing set
Classification tree - selected factors

Variable | Description
PTOTAL   | Total insurance contribution (ADJ)
MOSHOOFD | Customer main type (L2)
PPLEZIER | Contribution, boat policies
MBERARBO | Unskilled labourers

Customer main type (L2) labels:
a - Successful hedonists; b - Driven Growers; c - Average Family; d - Career Loners; e - Living well; f - Cruising Seniors; g - Retired and Religious; h - Family with grown-ups; i - Conservative families; j - Farmers

- The selected variables to a certain extent correspond to the single-factor analysis
- The numbers of observations in the terminal nodes seem to be unbalanced
Classification tree - diagnostics
[Charts: Cumulative Gains and Lift for the training and testing sets]
- The tree is usable for up to 30% of the observations
- Identification of the remaining Caravan holders seems comparable to random selection (due to unbalanced leaves)
Naive Bayes Classifier
Selected the same factors as in the logistic regression
[Charts: Cumulative Gains and Lift for the training and testing sets]
Comparison of the methods on the testing set
[Charts: Cumulative Gains and Lift for logistic regression, classification tree and naive Bayes]
- Up to 30-40% of observations, all the methods give similar results
- For the remaining part, logistic regression and naive Bayes are better than the classification tree
Cost-benefit analysis
Costs - ex-post cost analysis
What are the costs to win the desired number of Caravan insurance policies? (Based on the logistic regression)
- Dependency of costs on the number of Caravan insurance buyers within the testing set
- Comparison of costs without the analysis (random customer selection) and with the analysis (customer selection according to the model)
Assumptions:
- Initial costs to start the campaign = 5 000
- Costs per customer = 4
- Success rate in the population = 6% (computed from the testing data set)
Example: suppose we want to sell 100 Caravan insurance policies
- Without the analysis, the costs would be 11 704
- With the analysis, the costs would be only 7 780
[Chart: total costs vs. number of Caravan insurance buyers, random vs. model-based selection]
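Under the stated assumptions the two cost curves reduce to simple formulas. Note that the 695 addressed customers needed for 100 buyers with the model is inferred from the slide's 7 780 figure ((7 780 - 5 000) / 4), not read from the data; with a flat 6% success rate the random-selection cost comes out near, but not exactly at, the slide's 11 704:

```python
# Ex-post cost comparison under the slide's assumptions
initial_cost = 5000.0       # campaign start-up cost
cost_per_customer = 4.0
success_rate = 0.06         # caravan share in the population (testing set)

def cost_random(buyers_wanted):
    """Random selection: to win n buyers, address n / success_rate customers."""
    return initial_cost + cost_per_customer * buyers_wanted / success_rate

def cost_with_model(customers_addressed):
    """Model-based selection: address only the top-ranked customers; how
    many are needed for a target follows from the cumulative-gains curve."""
    return initial_cost + cost_per_customer * customers_addressed

cost_100_random = cost_random(100)      # ~11 667, near the slide's 11 704
cost_100_model = cost_with_model(695)   # 695 addressed -> the slide's 7 780
```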
Profit analysis
What is the profit when addressing the desired number of customers based on the analysis?
- Dependency of profit on the number of addressed customers within the testing set
Assumptions:
- Initial costs to start the campaign = 5 000
- Costs per customer = 4
- Profit from 1 customer = 100
Results:
- Maximum profit is 5 460, reached at 2 085 addressed customers
[Chart: total costs, expected and real total gains, expected and real total profit vs. number of addressed customers]
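The profit curve is equally simple. The 188 buyers at the optimum is implied by the slide's figures (100 * 188 - 5 000 - 4 * 2 085 = 5 460), not read from the data:

```python
# Profit when addressing the top-n customers, under the slide's assumptions:
# start-up cost 5 000, cost 4 per contact, profit 100 per sale
def campaign_profit(n_addressed, buyers_won):
    return 100.0 * buyers_won - (5000.0 + 4.0 * n_addressed)

# The slide's optimum: 2 085 addressed customers, implying 188 buyers won
best = campaign_profit(2085, 188)
```

In practice `buyers_won` for a given `n_addressed` would be taken from the model's cumulative-gains curve, and the optimum found by scanning `n_addressed`.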
Thank you for your attention. Questions?