EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Andreas Christmann
Department of Mathematics
homepages.vub.ac.be/~achristm

Talk: ULB, Sciences Actuarielles, 17/NOV/2006

Contents
1. Project: Motor vehicle insurance
2. Classical methods
3. Convex risk minimization
4. Robustness
5. Cont.: Motor vehicle insurance
6. Summary
1. PROJECT: MOTOR VEHICLE INSURANCE

Verband öffentlicher Versicherer, Düsseldorf, Germany
- non-aggregated data from 15 insurance companies
- 3 GB per year, > 4.5 million customers, > 70 explanatory variables
- primary response: pure premium [per year], for insurance tariffs
- secondary response: probability of having at least 1 claim per year
- complex dependencies, empty cells, missing values, some extremely high costs
Data: claim amount Y for each customer, x in R^p the explanatory variables.

Statistical Objectives

Actual premium charged to the customer:
pure premium + safety loading + administrative costs + desired profit

Primary response: pure premium, E(Y | X = x)
observed pure premium = (sum of individual claim sizes) / (number of days under risk / 360)

Secondary response: probability of a claim, P(Y > 0 | X = x)

Precision criterion: MSE. E((Y - Yhat)^2 | X = x) = min!
Fairness criterion: bias. E(Y - Yhat | X = x) ~ 0, in the whole population and in subpopulations, for each customer.
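The "observed pure premium" definition above can be sketched directly; this is a minimal illustration with invented function names and toy numbers, not the project's actual code:

```python
def observed_pure_premium(claim_sizes, days_under_risk):
    """Annualized pure premium for one policy:
    sum of claim sizes divided by the exposure fraction
    (days under risk / 360, using an insurance year of 360 days)."""
    exposure = days_under_risk / 360.0
    return sum(claim_sizes) / exposure

# A policy with two claims, insured for half a (360-day) year:
premium = observed_pure_premium([1200.0, 300.0], days_under_risk=180)
# 1500 / 0.5 = 3000.0 EUR on an annualized basis
```

Dividing by the exposure fraction makes premiums comparable across policies that were under risk for different parts of the year.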
Characteristic Properties

- Most policy holders have no claim within a year or a given period.
- The claim sizes are extremely skewed to the right, and there is an atom at zero.
- There is a complex, high-dimensional dependency structure between the variables.
- Only imprecise values are available for some explanatory variables.
- Some claim sizes are only estimates.
- The data sets to be analyzed are huge.
- Extremely high claim amounts are rare events, but they contribute enormously to the total sum of all claims.
Exploratory Data Analysis

Pure premium per policy holder per year; total mean about 360 EUR.

Pure premium     % obs.   % of total sum   Mean     Median
total            100      100              364      0
0                94.9     0                0        0
(0, 2000]        2.2      6.7              1097     1110
(2000, 10000]    2.4      27.1             4156     3496
(10000, 50000]   0.4      19.8             18443    15059
> 50000          0.07     46.4             234621   96417

Maximum individual claim size: some million EUR.
[Figure: relative frequency of the pure premium (EUR, 0 to 2500) and the claim probability (0.04 to 0.12), each plotted against age of main user (20 to 90).]
[Figure: smoothed probability of a claim (0.0 to 0.6) as a function of age of main user (20 to 80) and SFKL.]
2. METHODS

Classical approach (in Germany): marginal sum model, Poisson regression.

Generalized linear models (GLIM): Y_i has a distribution from the exponential family,
E(Y_i) = mu_i = g^{-1}(x_i' beta)  and  Var(Y_i) = phi V(g^{-1}(x_i' beta)) / w_i,
g = link function, V = variance function.

Special cases:
- Poisson: E(Y_i) = Var(Y_i)
- Gamma: V(mu_i) = mu_i^2
- Inverse Gaussian: V(mu_i) = mu_i^3
- Negative binomial: V(mu_i) = mu_i + k mu_i^2
- Tweedie's compound Poisson model (Smyth & Jørgensen, 2002)
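As a concrete sketch of the GLIM setup with g = log, the following fits a Poisson regression for a single covariate by Newton's method. This is a toy pure-Python illustration with invented names and data; a real analysis would use a GLM library:

```python
import math

def fit_poisson(xs, ys, iters=25):
    """Poisson regression with log link: E(Y) = exp(a + b*x).
    Newton's method on the log-likelihood; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        mu = [math.exp(a + b * x) for x in xs]
        # score vector (gradient of the log-likelihood)
        ga = sum(y - m for y, m in zip(ys, mu))
        gb = sum(x * (y - m) for x, y, m in zip(xs, ys, mu))
        # negative Hessian (2x2 Fisher information)
        h11 = sum(mu)
        h12 = sum(x * m for x, m in zip(xs, mu))
        h22 = sum(x * x * m for x, m in zip(xs, mu))
        det = h11 * h22 - h12 * h12
        a += (h22 * ga - h12 * gb) / det   # Newton update
        b += (h11 * gb - h12 * ga) / det
    return a, b

# Toy data roughly following mu = exp(0.5 + 0.3 x):
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 2.0, 3.0, 4.0, 5.0]
a, b = fit_poisson(xs, ys)
```

At the maximum-likelihood solution with an intercept, the fitted means satisfy sum(mu_i) = sum(y_i), which is a useful convergence check.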
Check of GLIM assumption

[Figure: empirical variance (0 to 6e10) versus average (0 to 20000), with reference curves Gamma: Var = k E^2 and Poisson: Var = k E; overdispersion k >> 1.]

[Figure: log-log QQ plots of empirical versus theoretical quantiles for fitted generalized Pareto distributions: GPD(0.8, 50035.42), GPD(0.63, 51214.8), GPD(0.89, 53357.66), GPD(0.85, 44742.58).]
PROPOSAL

Denote: Y pure premium, x vector of explanatory variables.

Step 1: Define an additional variable C to classify Y, e.g.
C = 0 if Y = 0               (no cost,  94.9 %)
C = 1 if Y in (0, 2000]      (low,      2.2 %)
C = 2 if Y in (2000, 10000]  (medium,   2.4 %)
C = 3 if Y in (10000, 50000] (high,     0.4 %)
C = 4 if Y > 50000           (extreme,  0.07 %)

Step 2: No claims: E(Y | X = x, C = 0) = 0 for all x. Hence

E(Y | X = x) = P(C > 0 | X = x) E(Y | C > 0, X = x)
             = P(C > 0 | X = x) Sum_{c=1}^{k+1} P(C = c | C > 0, X = x) E(Y | C = c, X = x)
E(Y | X = x) = P(C > 0 | X = x) Sum_{c=1}^{k+1} P(C = c | C > 0, X = x) E(Y | C = c, X = x)

+ Companies have an interest in P(C = c) or E(Y | X = x, C = c)
+ Circumvents the problem that most y_i = 0 while continuous models imply P(Y = 0) = 0
+ Possible reduction of computation time: regression only for 5 % of the observations!
+ Reduction of interactions: different x for different C = c
+ Variable selection: different vectors x for different C groups are possible
+ Different estimation techniques can be used for estimating P(C = c | X = x) and E(Y | C = c, X = x), e.g.
  - multinomial logistic regression + Gamma regression
  - robust logistic regression + semi-parametric regression
  - classification trees + semi-parametric regression
  - kernel logistic regression + epsilon-support vector regression
  - a combination of the above pairs + extreme value theory (e.g. GPD)
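The decomposition above can be checked numerically. The probabilities and class means below are invented for illustration (loosely inspired by the exploratory table earlier, conditional class shares renormalized among claimants); they are not fitted values:

```python
# Toy check of E(Y|x) = P(C>0|x) * sum_c P(C=c|C>0,x) * E(Y|C=c,x)

p_claim = 0.051                                   # P(C > 0 | x), invented
p_class = [0.434, 0.473, 0.079, 0.014]            # P(C = c | C > 0, x), c = 1..4
mean_class = [1097.0, 4156.0, 18443.0, 234621.0]  # E(Y | C = c, x), invented

# Decomposed form:
expected = p_claim * sum(p * m for p, m in zip(p_class, mean_class))

# Direct form over all classes; the C = 0 term contributes nothing
# because E(Y | C = 0, x) = 0:
direct = (1.0 - p_claim) * 0.0 + sum(
    p_claim * p * m for p, m in zip(p_class, mean_class)
)
```

Both forms agree exactly; with these toy numbers the result lands near the overall mean pure premium reported in the exploratory table.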
3. EMPIRICAL RISK MINIMIZATION (Vapnik, 1998)

Data set (x_i, y_i) in R^p x {-1, +1}. Assume (X_i, Y_i) i.i.d. ~ P, P unknown.
Predictor fhat(x), classifier sign(fhat(x) + bhat), loss function L(y, f(x) + b).

Goal: argmin_{f, b} E_P L(Y, f(X) + b)

Regularized empirical risk:
(fhat_{n,lambda}, bhat_{n,lambda}) = argmin_{f in H, b in R} (1/n) Sum_{i=1}^n L(y_i, f(x_i) + b) + lambda ||f||_H^2,
where L is convex, lambda > 0 a regularization constant, and H a reproducing kernel Hilbert space (RKHS) defined by a reproducing kernel k
[k: R^p x R^p -> R, k(x, .) in H, f(x) = <f, k(x, .)>].

Regularized theoretical risk:
(f_{P,lambda}, b_{P,lambda}) = argmin_{f in H, b in R} E_P L(Y, f(X) + b) + lambda ||f||_H^2

Vapnik (1998), Zhang (2001), Christmann & Steinwart (2005): L-risk consistency.
Advantages:
- Kernel trick: restricts the class of functions to a broad subset (RKHS) and allows very flexible fits.
  RBF kernel: k(x_i, x_j) = exp(-gamma ||x_i - x_j||^2), gamma > 0
- Convex loss function: avoids computationally intractable NP-hard problems.
- Regularization: improves generalization, avoids overfitting.
- Software: many computational tricks. Overview: http://www.kernel-machines.org
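The RBF kernel named above is simple to state in code. A minimal pure-Python sketch (function names are mine):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2), gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma=1.0):
    """Kernel (Gram) matrix K with K[i][j] = k(x_i, x_j);
    symmetric, with ones on the diagonal for the RBF kernel."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

K = gram_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], gamma=0.5)
```

Kernel methods only ever touch the data through this matrix, which is what makes the "kernel trick" work for non-linear fits.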
Kernel Logistic Regression (KLR)

L(x, y, f) = ln(1 + exp(-y [f(x) + b]))
- smoothly decreasing loss
- risk scoring is possible: P(Y = 1 | X = x) = 1 / (1 + exp(-[f(x) + b]))

epsilon-Support Vector Regression (epsilon-SVR) (Vapnik, 1998)

L_epsilon(x, y, f) = max{0, |y - f(x)| - epsilon}

[Figure: the logistic loss as a function of y [f(x) + b], and the epsilon-insensitive loss with epsilon = 0.5 and slack variables xi_i, xi_j around a tube of width 2 epsilon.]
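The two loss functions and the KLR scoring rule above translate directly into code; a minimal sketch with my own naming:

```python
import math

def logistic_loss(y, score):
    """KLR loss ln(1 + exp(-y * score)), score = f(x) + b, y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))

def eps_insensitive_loss(y, fx, eps=0.5):
    """epsilon-SVR loss max(0, |y - f(x)| - eps); zero inside the eps-tube,
    then growing linearly, which makes it insensitive to small errors."""
    return max(0.0, abs(y - fx) - eps)

def claim_probability(score):
    """Risk scoring from a KLR fit: P(Y = 1 | x) = 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))
```

Note that both losses are convex in the score, which is exactly what the convex-risk-minimization framework of the previous slide requires.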
4. ROBUSTNESS

Are the model assumptions valid?
- some extreme claim sizes are estimates
- imprecise data, e.g. driving distance during a year

Neighborhood of P: mixtures (1 - epsilon) P + epsilon Ptilde, where Ptilde ranges over the set of all distributions.
Goal: T(P), but how far can T((1 - epsilon) P + epsilon Ptilde) be from T(P)?

Here T(P) = (f_{P,lambda}, b_{P,lambda}) = argmin_{f in H, b in R} E_P L(Y, f(X) + b) + lambda ||f||_H^2.
Robustness Approaches

Influence function (F. Hampel): Gateaux derivative towards a Dirac distribution delta_z:
IF(z; T, P) = lim_{epsilon -> 0} [T((1 - epsilon) P + epsilon delta_z) - T(P)] / epsilon

Sensitivity curve (J. W. Tukey): measures the impact of one point z:
SC_n(z; T_n) = n (T_n(z_1, ..., z_{n-1}, z) - T_{n-1}(z_1, ..., z_{n-1}))
             = [T((1 - epsilon_n) P_{n-1} + epsilon_n delta_z) - T(P_{n-1})] / epsilon_n,  epsilon_n = 1/n

Maxbias (P. J. Huber): measures worst-case behavior in a neighborhood:
maxbias(epsilon; T, P) = sup_{Q in N_epsilon(P)} ||T(Q) - T(P)||
N_epsilon(P) = {Q = (1 - epsilon) P + epsilon Ptilde : Ptilde a distribution on X x Y}, 0 <= epsilon < 1/2
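Tukey's sensitivity curve is easy to compute for simple estimators; the toy example below (invented data) contrasts the sample mean, whose sensitivity curve is unbounded in z, with the sample median, whose sensitivity curve stays bounded:

```python
import statistics

def sensitivity_curve(estimator, sample, z):
    """SC_n(z; T_n) = n * (T_n(z_1,...,z_{n-1}, z) - T_{n-1}(z_1,...,z_{n-1}))."""
    n = len(sample) + 1
    return n * (estimator(sample + [z]) - estimator(sample))

data = [1.0, 2.0, 3.0, 4.0, 5.0]

# An extreme outlier z drags the mean arbitrarily far ...
sc_mean = sensitivity_curve(statistics.mean, data, z=1e6)
# ... while the median barely reacts:
sc_median = sensitivity_curve(statistics.median, data, z=1e6)
```

The same qualitative contrast is what the robustness results on the next slide establish for kernel methods with a bounded kernel and a suitable convex loss.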
Robustness Results

Assume for the classification or regression problem:
- the loss function L is convex and continuous, with bounded derivative L',
- H is the RKHS of a bounded, continuous kernel k.

Then these kernel methods have good robustness properties: the
- influence function
- sensitivity curve
- bias and maxbias
are bounded, sometimes even uniformly bounded, and the methods are still able to learn the unknown distribution P.

Details and proofs: Christmann & Steinwart (2004, 2005, 2006).
5. CONT.: MOTOR VEHICLE INSURANCE

Y: pure premium [per year]
X: 8 explanatory variables: age of main user, gender of main user, car kept in a garage?, driving distance, geographical information, population density, number of years without a claim, strength of engine.

Estimation with (KLR, epsilon-SVR) with RBF kernel, using
E(Y | X = x) = P(C > 0 | X = x) Sum_{c=1}^{k+1} P(C = c | C > 0, X = x) E(Y | C = c, X = x)

50 % training (n = 2.2e6), 25 % validation (n = 1.1e6), 25 % test (n = 1.1e6).
[Figure: estimated pure premium E(Y | X = x) in EUR (0 to 1500) versus age of main user (20 to 90).]
[Figure: estimated pure premium E(Y | X = x) in EUR (0 to 1500) versus age of main user (20 to 90), separately for males and females.]
[Figure: six panels versus age of main user (20 to 80): E(Y | X = x) in EUR (0 to 1500), P(C > 0 | X = x) (0.04 to 0.12), and P(C = c | C > 0, X = x) for c = 1, 2, 3, 4.]
6. SUMMARY

+ Empirical risk minimization combines statistics, mathematics, and computer science.
+ Of interest to companies (insurance, banking, IT, telecommunications): tariffs, risk scoring, data mining, CRM, churn analysis.
+ Of interest to software companies (e.g. SAS Enterprise Miner).
+ Good predictions, very flexible.
+ Applicable also in very high dimensions.
+ Good robustness properties.
References

Christmann, A., Steinwart, I. (2006). Consistency of kernel-based quantile regression. Submitted.
Christmann, A., Steinwart, I., Hubert, M. (2006). Robust learning from bites for data mining. To appear in Computational Statistics & Data Analysis.
Christmann, A. (2005). On a strategy to develop robust and simple tariffs from motor vehicle insurance data. Acta Mathematicae Applicatae Sinica (English Series), 21, 193-208.
Christmann, A., Steinwart, I. (2005). Consistency and robustness of kernel-based regression. Submitted.
Christmann, A. (2004). On properties of support vector machines for pattern recognition in finite samples. In: M. Hubert, G. Pison, A. Struyf, S. Van Aelst (eds.), Theory and Applications of Recent Robust Methods. Statistics for Industry and Technology, Birkhäuser, Basel.
Christmann, A., Steinwart, I. (2004). On robustness properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, 1007-1034.
Schölkopf, B., Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Smyth, G., Jørgensen, B. (2002). Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin, 32, 143-157.
Vapnik, V. (1998). Statistical Learning Theory. Wiley.

homepages.vub.ac.be/~achristm