EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA




EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Andreas Christmann
Department of Mathematics
homepages.vub.ac.be/~achristm

Talk: ULB, Sciences Actuarielles, 17/NOV/2006

Contents
1. Project: Motor vehicle insurance
2. Classical Methods
3. Convex risk minimization
4. Robustness
5. Cont.: Motor vehicle insurance
6. Summary

1. PROJECT: MOTOR VEHICLE INSURANCE

Verband öffentlicher Versicherer, Düsseldorf, Germany

- non-aggregated data from 15 insurance companies
- 3 GB per year, > 4.5 million customers, > 70 explanatory variables
- primary response: pure premium [per year] → insurance tariffs
- secondary response: probability of having at least 1 claim per year
- complex dependencies, empty cells, missing values, some extremely high costs

Statistical Objectives

Y: claim amount for each customer; x ∈ R^p: explanatory variables.

Actual premium (charged to the customer):
pure premium + safety loading + administrative costs + desired profit

Primary response: pure premium, E(Y | X = x), where the observed pure premium is

    (sum of individual claim sizes) / (number of days under risk / 360)

Secondary response: probability of a claim, P(Y > 0 | X = x)

Precision criterion (MSE): E((Y - Ŷ)² | X = x) = min!
Fair criterion (bias): E(Y - Ŷ | X = x) ≈ 0,
in the whole population and in subpopulations.
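
Both criteria are easy to evaluate once predictions exist. As a minimal sketch (mine, not from the talk), the following Python snippet computes the MSE and bias criteria overall and per subpopulation; the simulated pure premia, the constant tariff, and the age bands are all hypothetical.

```python
import numpy as np
import pandas as pd

def mse_and_bias(y, y_hat, groups):
    """MSE E((Y - Yhat)^2) and bias E(Y - Yhat), overall and per subpopulation."""
    df = pd.DataFrame({"err": np.asarray(y) - np.asarray(y_hat), "group": groups})
    overall = pd.Series({"mse": (df["err"] ** 2).mean(), "bias": df["err"].mean()})
    per_group = df.groupby("group")["err"].agg(mse=lambda e: (e ** 2).mean(), bias="mean")
    return overall, per_group

# Hypothetical usage: simulated pure premia (mean ~ 360 EUR) and a constant tariff.
rng = np.random.default_rng(0)
y = rng.exponential(360.0, size=1000)
y_hat = np.full(1000, y.mean())
age_band = rng.choice(["18-30", "31-60", "61+"], size=1000)
overall, per_group = mse_and_bias(y, y_hat, age_band)
print(overall, per_group, sep="\n")
```

A constant tariff can be unbiased overall yet badly biased within age bands, which is exactly what the fairness criterion in subpopulations is meant to detect.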

Characteristic Properties

- Most policy holders have no claim within a year or a given period.
- Claim sizes are extremely skewed to the right, but there is an atom at zero.
- There is a complex, high-dimensional dependency structure between the variables.
- Only imprecise values are available for some explanatory variables.
- Some claim sizes are only estimates.
- The data sets to be analyzed are huge.
- Extremely high claim amounts are rare events, but they contribute enormously to the total sum of all claims.

Exploratory Data Analysis

Pure premium per policy holder per year: total mean ≈ 360 EUR

Pure premium (EUR)    % obs.   % of total sum     Mean   Median
0                      94.9         0                0        0
(0, 2000]               2.2         6.7           1097     1110
(2000, 10000]           2.4        27.1           4156     3496
(10000, 50000]          0.4        19.8          18443    15059
> 50000                 0.07       46.4         234621    96417
total                                              364        0

Max. individual claim size: some million EUR

[Figure: two panels against age of main user (20 to 90): pure premium (EUR, 0 to 2500) and claim probability (relative frequency, 0.04 to 0.12).]

[Figure: smoothed claim probability (0.0 to 0.6) as a function of age of main user (20 to 80) and SFKL.]

2. METHODS

Classical approach (in Germany): marginal sum model / Poisson regression.

Generalized linear models (GLIM): Y_i has a distribution from an exponential family with

    E(Y_i) = μ_i = g⁻¹(x_i'β)   and   Var(Y_i) = φ V(g⁻¹(x_i'β)) / w_i

where g is the link function and V the variance function.

Special cases:
- Poisson: E(Y_i) = Var(Y_i)
- Gamma: V(μ_i) = μ_i²
- Inverse Gaussian: V(μ_i) = μ_i³
- Negative binomial: V(μ_i) = μ_i + k μ_i²
- Tweedie's compound Poisson model (Smyth & Jørgensen, '02)
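
For concreteness, here is a minimal sketch (mine, not the talk's) of the classical GLM pairing for insurance data: Poisson regression for claim counts and Gamma regression with a log link for strictly positive claim sizes, via statsmodels. The data are simulated, and the simulated severities do not actually depend on the covariates; the point is only the model-fitting API.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
X = sm.add_constant(rng.normal(size=(n, 3)))               # intercept + 3 covariates

# Claim frequency: Poisson GLM with the canonical log link.
counts = rng.poisson(np.exp(X @ np.array([0.0, 0.3, -0.2, 0.1])))
freq_glm = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# Claim severity: Gamma GLM with log link, fitted only where a claim occurred.
pos = counts > 0
sizes = rng.gamma(shape=2.0, scale=500.0, size=pos.sum())  # strictly positive responses
sev_glm = sm.GLM(sizes, X[pos], family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(freq_glm.params)
print(sev_glm.params)
```

The Gamma and inverse Gaussian variance functions above correspond to V(μ) = μ² and V(μ) = μ³; Tweedie's compound Poisson model sits between the Poisson and Gamma cases with V(μ) = μ^p for 1 < p < 2.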

Check of GLIM Assumption

[Figure: empirical variance vs. average claim size per cell, with fitted curves Var = k·E² (Gamma) and Var = k·E (Poisson), indicating overdispersion (k ≫ 1); four QQ plots of log empirical vs. log theoretical quantiles for fitted generalized Pareto distributions GPD(0.8, 50035.42), GPD(0.63, 51214.8), GPD(0.89, 53357.66), GPD(0.85, 44742.58).]

PROPOSAL

Notation: Y = pure premium, x = vector of explanatory variables.

Step 1: Define an additional variable C to classify Y, e.g.

    C = 0  if Y = 0               no cost    94.9 %
    C = 1  if Y ∈ (0, 2000]       low         2.2 %
    C = 2  if Y ∈ (2000, 10000]   medium      2.4 %
    C = 3  if Y ∈ (10000, 50000]  high        0.4 %
    C = 4  if Y > 50000           extreme     0.07 %

Step 2: No claims: E(Y | X = x, C = 0) ≡ 0 for all x. Hence

    E(Y | X = x) = P(C > 0 | X = x) · E(Y | C > 0, X = x)
                 = P(C > 0 | X = x) · Σ_{c=1}^{k+1} P(C = c | C > 0, X = x) · E(Y | C = c, X = x)

    E(Y | X = x) = P(C > 0 | X = x) · Σ_{c=1}^{k+1} P(C = c | C > 0, X = x) · E(Y | C = c, X = x)

+ Companies have an interest in P(C = c) or E(Y | X = x, C = c) themselves.
+ Circumvents the problem that most y_i = 0 while a continuous claim-size model has P(Y = 0) = 0.
+ Possible reduction of computation time: regression is needed only for ~5 % of the observations!
+ Reduction of interactions: different x for different C = c.
+ Variable selection: different vectors x are possible for different C groups.
+ Different estimation techniques can be used for estimating P(C = c | X = x) and E(Y | X = x, C = c), e.g. (a sketch of the first pairing follows below):
  - multinomial logistic regression + Gamma regression
  - robust logistic regression + semi-parametric regression
  - classification trees + semi-parametric regression
  - kernel logistic regression + ε-support vector regression
  - combinations of the above pairs + extreme value theory (e.g. GPD)
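
A hedged sketch of the decomposition with the first pairing, multinomial logistic regression for P(C = c | X = x) plus a Gamma regression per class for E(Y | C = c, X = x); scikit-learn estimators stand in for the exact methods used in the project, and the class thresholds follow the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, GammaRegressor

def claim_class(y):
    """C = 0 (Y = 0), 1 (0, 2000], 2 (2000, 10000], 3 (10000, 50000], 4 (> 50000)."""
    return np.digitize(y, [0.0, 2000.0, 10000.0, 50000.0], right=True)

def fit_decomposition(X, y):
    """Fit P(C = c | X) and E(Y | C = c, X); regression only for the ~5% with a claim."""
    c = claim_class(y)
    clf = LogisticRegression(max_iter=1000).fit(X, c)        # multinomial P(C = c | X)
    regs = {k: GammaRegressor().fit(X[c == k], y[c == k])    # one severity model per class
            for k in np.unique(c) if k > 0}
    return clf, regs

def pure_premium(clf, regs, X):
    """E(Y | X = x) = sum over c > 0 of P(C = c | X = x) * E(Y | C = c, X = x)."""
    proba = clf.predict_proba(X)                  # column order follows clf.classes_
    ey = np.zeros(X.shape[0])
    for j, k in enumerate(clf.classes_):
        if k > 0:
            ey += proba[:, j] * regs[k].predict(X)
    return ey
```

Summing P(C = c | X) · E(Y | C = c, X) over c > 0 yields the same quantity as the factored form on the slide, since P(C = c | X) = P(C > 0 | X) · P(C = c | C > 0, X).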

3. EMPIRICAL RISK MINIMIZATION (Vapnik '98)

Data set (x_i, y_i) ∈ R^p × {-1, +1}. Assume (X_i, Y_i) i.i.d. ~ P, P unknown; predictor f̂(x), classifier sign(f̂(x) + b̂), loss function L(y, f(x) + b).

Goal: argmin_{f, b} E_P L(Y, f(X) + b).

Regularized empirical risk:

    (f̂_{n,λ}, b̂_{n,λ}) = argmin_{f ∈ H, b ∈ R}  (1/n) Σ_{i=1}^n L(y_i, f(x_i) + b) + λ ||f||²_H

where L is convex, λ > 0 is a regularization constant, and H is a reproducing kernel Hilbert space (RKHS) defined by a reproducing kernel k [k: R^p × R^p → R, k(x, ·) ∈ H, f(x) = ⟨f, k(x, ·)⟩].

Regularized theoretical risk:

    (f_{P,λ}, b_{P,λ}) = argmin_{f ∈ H, b ∈ R}  E_P L(Y, f(X) + b) + λ ||f||²_H

Vapnik '98, Zhang '01, Chr & Steinwart '05: L-risk consistency.

Advantages:
- Kernel trick: restricts the class of functions to a broad subset (an RKHS) and allows very flexible fits.
  RBF kernel: k(x_i, x_j) = exp(-γ ||x_i - x_j||²), γ > 0 (see the sketch below).
- Convex loss function: avoids computationally intractable NP-hard problems.
- Regularization: improves the generalization property, avoids overfitting.
- Software: many computational tricks. Overview: http://www.kernel-machines.org
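
A small numerical illustration (not from the slides) of the RBF kernel and the Gram matrix it induces; kernel methods access the data only through this matrix, which is symmetric and positive semi-definite.

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) for gamma > 0."""
    sq_dists = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

X = np.random.default_rng(2).normal(size=(5, 3))
K = rbf_gram(X, X, gamma=0.5)
print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # positive semi-definite
```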

Kernel Logistic Regression (KLR)

    L(x, y, f) = ln(1 + exp(-y [f(x) + b]))

The loss is smooth and decreasing; risk scoring is possible:

    P(Y = 1 | X = x) = 1 / (1 + exp(-[f(x) + b]))

ε-Support Vector Regression (ε-SVR) (Vapnik '98)

    L_ε(x, y, f) = max{0, |y - f(x)| - ε}

[Figure: the logistic loss as a function of y·[f(x) + b]; the ε-insensitive loss (ε = 0.5) with the 2ε-tube around the fit and slack variables ξ_i, ξ_j.]
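
Both loss functions are one-liners; this minimal sketch (mine) makes the definitions concrete and shows the KLR probability score.

```python
import numpy as np

def logistic_loss(y, margin):
    """KLR loss L = ln(1 + exp(-y * (f(x) + b))) for labels y in {-1, +1}."""
    return np.log1p(np.exp(-y * margin))

def klr_score(margin):
    """KLR risk scoring: P(Y = 1 | X = x) = 1 / (1 + exp(-(f(x) + b)))."""
    return 1.0 / (1.0 + np.exp(-margin))

def eps_insensitive_loss(y, fx, eps=0.5):
    """eps-SVR loss L_eps = max(0, |y - f(x)| - eps); errors inside the tube are free."""
    return np.maximum(0.0, np.abs(y - fx) - eps)

print(logistic_loss(np.array([1, -1]), np.array([2.0, 2.0])))  # small vs. large loss
print(klr_score(np.array([0.0, 2.0])))                          # 0.5, ~0.88
print(eps_insensitive_loss(np.array([1.0, 3.0]), np.array([1.2, 1.0])))  # 0.0, 1.5
```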

4. Robustness

Are the model assumptions valid?
- some extreme claim sizes are estimates
- imprecise data, e.g. driving distance during a year

Contamination neighborhood of P: (1-ε)P + εP̃, where P̃ ranges over the set of all distributions.

Goal: T(P); but is T((1-ε)P + εP̃) ≈ T(P)? Here

    T(P) = (f_{P,λ}, b_{P,λ}) = argmin_{f ∈ H, b ∈ R}  E_P L(Y, f(X) + b) + λ ||f||²_H

Robustness Approaches

Influence function (F. Hampel): Gâteaux derivative towards the Dirac distribution δ_z:

    IF(z; T, P) = lim_{ε↓0} [ T((1-ε)P + ε δ_z) - T(P) ] / ε

Sensitivity curve (J.W. Tukey): measures the impact of one additional point z:

    SC_n(z; T_n) = n ( T_n(z_1, ..., z_{n-1}, z) - T_{n-1}(z_1, ..., z_{n-1}) )
                 = [ T((1-ε_n)P_{n-1} + ε_n δ_z) - T(P_{n-1}) ] / ε_n,   ε_n = 1/n

Maxbias (P.J. Huber): measures the worst-case behavior in a neighborhood:

    maxbias(ε; T, P) = sup_{Q ∈ N_ε(P)} || T(Q) - T(P) ||,
    N_ε(P) = { Q = (1-ε)P + εP̃ : P̃ a distribution on X × Y },   0 ≤ ε < 1/2
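
As a toy illustration (mine, not the talk's), the sensitivity curve is easy to compute for simple estimators: the mean's curve grows without bound in the contaminating point z while the median's stays bounded, which is the kind of behavior the robustness results below establish for kernel methods.

```python
import numpy as np

def sensitivity_curve(estimator, sample, z_grid):
    """SC_n(z) = n * (T_n(z_1, ..., z_{n-1}, z) - T_{n-1}(z_1, ..., z_{n-1}))."""
    base = estimator(sample)
    n = len(sample) + 1                        # sample size after adding z
    return np.array([n * (estimator(np.append(sample, z)) - base) for z in z_grid])

rng = np.random.default_rng(3)
sample = rng.normal(size=99)
z_grid = np.linspace(-10.0, 10.0, 5)
print(sensitivity_curve(np.mean, sample, z_grid))    # roughly linear in z: unbounded
print(sensitivity_curve(np.median, sample, z_grid))  # bounded
```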

Robustness Results

Assume, for the classification or regression problem, that
- the loss function L is convex and continuous with a bounded first derivative L′, and
- H is the RKHS of a bounded, continuous kernel k.

Then these kernel methods have good robustness properties: the influence function, the sensitivity curve, the bias, and the maxbias are bounded, sometimes even uniformly bounded, and the methods are still able to learn the unknown distribution P.

Details and proofs: Chr & Steinwart ('04, '05, '06).

5. Cont.: Motor Vehicle Insurance

Y: pure premium [per year]
X: 8 explanatory variables: age of main user, gender of main user, car kept in garage?, driving distance, geographical information, population density, number of years without claims, strength of engine.

Estimation with (KLR, ε-SVR) with RBF kernel, using

    E(Y | X = x) = P(C > 0 | X = x) · Σ_{c=1}^{k+1} P(C = c | C > 0, X = x) · E(Y | C = c, X = x)

Split: 50 % training (n = 2.2e6), 25 % validation (n = 1.1e6), 25 % test (n = 1.1e6).
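
A hedged stand-in for this estimation step, with scikit-learn's RBF-kernel SVR in the role of ε-SVR and a 50/25/25 train/validation/test split as on the slide; the data, hyperparameter grid, and ε value are simulated and illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 8))                                     # 8 explanatory variables
y = np.exp(1.0 + 0.5 * X[:, 0]) + rng.gamma(2.0, 50.0, size=2000)  # skewed positive response

# 50 % training, 25 % validation, 25 % test.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def val_mse(params):
    pred = SVR(kernel="rbf", epsilon=10.0, **params).fit(X_tr, y_tr).predict(X_val)
    return ((pred - y_val) ** 2).mean()

grid = [{"C": C, "gamma": g} for C in (1.0, 10.0) for g in (0.1, 1.0)]
best = min(grid, key=val_mse)                             # select on the validation set
model = SVR(kernel="rbf", epsilon=10.0, **best).fit(X_tr, y_tr)
print(best, ((model.predict(X_te) - y_te) ** 2).mean())   # report test MSE
```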

[Figure: estimated pure premium E(Y | X = x) (EUR, 0 to 1500) vs. age of main user (20 to 90).]

[Figure: estimated pure premium E(Y | X = x) (EUR, 0 to 1500) vs. age of main user (20 to 90), shown separately for males and females.]

[Figure: six panels against age of main user (20 to 80): estimated pure premium E(Y | X = x) in EUR; the probabilities P(C > 0 | X = x) and P(C = c | C > 0, X = x) for c = 1, 2, 3, 4.]

6. Summary

Empirical risk minimization:
+ brings together statistics, mathematics, and computer science
+ relevant for companies (insurance, banking, IT, telecommunication): tariffs, risk scoring, data mining, CRM, churn analysis
+ supported by software companies (SAS Enterprise Miner)
+ good predictions, very flexible
+ applicable also in very high dimensions
+ good robustness properties

References

Christmann, A., Steinwart, I. (2006). Consistency of kernel-based quantile regression. Submitted.

Christmann, A., Steinwart, I., Hubert, M. (2006). Robust learning from bites for data mining. To appear in Computational Statistics & Data Analysis.

Christmann, A. (2005). On a strategy to develop robust and simple tariffs from motor vehicle insurance data. Acta Mathematicae Applicatae Sinica (English Series), 21, 193-208.

Christmann, A., Steinwart, I. (2005). Consistency and robustness of kernel-based regression. Submitted.

Christmann, A. (2004). On properties of support vector machines for pattern recognition in finite samples. In: M. Hubert, G. Pison, A. Struyf, S. Van Aelst (Eds.), Theory and Applications of Recent Robust Methods. Statistics for Industry and Technology. Birkhäuser, Basel.

Christmann, A., Steinwart, I. (2004). On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007-1034.

Schölkopf, B., Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Smyth, G., Jørgensen, B. (2002). Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin, 32, 143-157.

Vapnik, V. (1998). Statistical Learning Theory. Wiley.

homepages.vub.ac.be/~achristm