EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Transcription

1 EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA Andreas Christmann Department of Mathematics homepages.vub.ac.be/ achristm Talk: ULB, Sciences Actuarielles, 17/NOV/2006 Contents 1. Project: Motor vehicle insurance 2. Classical Methods 3. Convex risk minimization 4. Robustness 5. Cont.: Motor vehicle insurance 6. Summary 1

2 1. PROJECT: MOTOR VEHICLE INSURANCE Verband öffentlicher Versicherer, Düsseldorf, Germany non-aggregated data from 15 insurance companies 3GBperyear,> 4.5 million customers, > 70 explanatory variables primary response: pure premium [year] insurance tariffs secondary response: probability of having at least 1 claim per year complex dependencies, empty cells, missing values, some extreme high costs 2

3 claim amount for each customer x R p explanatory variables Statistical Objectives Actual premium. Charged to the customer pure premium + safety loading + administrative costs + desired profit Primary response: Pure premium. E(Y X = x) observed pure premium = sum of individual claim sizes number of days under risk / 360 Secondary response: Prob. of claim. P(Y >0 X = x) Precision criterion: MSE. E((Y Ŷ )2 X = x) =min! Fair criterion: Bias. E(Y Ŷ X = x) 0! in whole population and in subpopulations for each customer 3

4 Characteristic Properties Most of the policy holders have no claim within a year or a certain period. The claim sizes are extremely skewed to the right, but there is atom in zero. There is a complex, high dimensional dependency structure between variables. There are only imprecise values available for some explanatory variables. Some claim sizes are only estimates. The data sets to be analyzed are huge. Extreme high claim amounts are rare events, but contribute enormously to the total sum of all claims. 4

5 Exploratory Data Analysis Pure premium per policy holder per year total mean 360 EUR Pure premium % obs. % of total sum Mean Median total (0,2000] (2000,10000] (10000,50000] > Max. individual claim size: some million EUR 5

6 EUR Pure premium Relative frequency Age of main user Age of main user Probability

7 smoothed prob. for claim SFKL Age of main user 7

8 8

9 2. METHODS Classical approach (in Germany): Marginal Sum Model Poisson-Regression Generalized Linear Models (GLIM): Y i has distribution from exponential family E(Y i )=µ i =g 1 (x i β)andvar(y i)=φ V (g 1 (x i β))/w i g = link function, V = variance function Special cases: Poisson: E(Y i )=Var(Y i ) Gamma: V (µ i )=µ 2 i Inverse Gaussian: V (µ i )=µ 3 i Negative Binomial: V (µ i )=µ i + kµ 2 i Tweedie s compound Poisson Model (Smyth & Jørgensen, 02): 9

10 Check of GLIM assumption QQ plot for GPD Variance 0 e+00 3 e+10 6 e+10 Gamma:Var = ke 2 Poisson : Var = ke Average log(empirical quantiles) log(empirical quantiles) GPD( 0.8, ) log(theoretical quantiles) GPD( 0.63, ) log(empirical quantiles) log(empirical quantiles) GPD( 0.89, ) log(theoretical quantiles) GPD( 0.85, ) overdispersion k 1 log(theoretical quantiles) log(theoretical quantiles) 10

11 PROPOSAL Denote: Y pure premium, x vector of explanatory variables 1: Define additional variable C to classify Y,e.g. C =0 ify = 0 no cost 94.9 % =1 ify (0, 2000] low 2.2 % =2 ify (2000, 10000] medium 2.4 % =3 ify (10000, 50000] high 0.4 % =4 ify>50, 000 extreme 0.07 % 2: No claims: E(Y X = x, C =0) 0 for all x. E(Y X = x) =P(C>0 X = x) E(Y C >0,X = x) =P(C>0 X = x) k+1 c=1 P(C = c C >0,X = x) E(Y C = c, X = x) 11

12 E(Y X = x) =P(C>0 X = x) k+1 c=1 P(C = c C >0,X = x) E(Y C = c, X = x) + Companies have interest in P(C = c) ore(y X = x, C = c) + Circumvents the problem: most y i = 0, but P(Y =0)=0 + Reduction of computation time possible: regression only for 5% of obs.! + Reduction of interactions: different x for different C = c + Variable selection: different vectors x for different C groups are possible + Different estimation techniques can be used for estimating P(C = c X = x) ande(y X = x, C = c), e.g. Multinomial logistic regression + Gamma regression Robust logistic regr. + semi-parametr. regr. Classification trees + semiparametric regression Kernel logistic regression + ε support vector regression Combination of above pairs + extreme value theory (e.g. GPD) 12

13 3. EMPIRICAL RISK MINIMIZATION Vapnik 98 Data set (x i,y i ) R p { 1, +1}. Assume (X i,y i ) iid P, P unknown, predictor ˆf(x), classifier sign( ˆf(x)+ˆb), loss function L(y, f(x)+b) goal: arg min f,b E P L(Y,f(X)+b) reg. emp. ( ˆf n,λ, ˆb n,λ )=argmin f H, b R 1 n ni=1 L(y i,f(x i )+b)+λ f 2 H, risk: where L convex, λ > 0 reg. constant, H reproducing kernel Hilbert space (RKHS), defined by reproducing kernel k [k : R p R p R, k(x, ) H, f(x) = f,k(x, ) ] reg. theor. risk: (f P,λ,b P,λ )=argmin f H, b R E P L(Y,f(X)+b)+λ f 2 H Vapnik 98, Zhang 01, Chr & Steinwart 05: L-risk consistency 13

14 Advantages: Kernel trick restricts class of functions to broad subset (RKHS) allows very flexible fits RBF kernel: k(x i,x j ) =exp( γ x i x j 2 ),γ>0 Convex loss function avoids computational intractable NP-hard problems Regularization improves generalization property, avoids overfitting Software: many computational tricks Overview: 14

15 Kernel Logistic Regression (KLR) L(x, y, f) =ln(1+exp( y[f(x)+b])) smoothly decreasing risk scoring is possible: P(Y =1 X = x) = 1 1+e [f(x)+b] ε Support Vector Regression (SVR) (Vapnik 98) L ε (x, y, f) =max{0, y f(x) ε} Loss ε= y ξ i ξj 2ε y [f(x)+b] x 15

16 4. Robustness model assumptions valid? some extreme claim sizes are estimates imprecise data: driving distance during a year (1 ε)p +εp ~ P neighborhood Goal: T (P), but T ((1 ε)p + ε P) T (P)? set of all distributions T (P) =(f P,λ,b P,λ )=argmin f H, b R E P L(Y,f(X)+b)+λ f 2 H 16

17 Robustness Approaches Influence function (F. Hampel): Gâteaux-derivative towards Dirac-distr. z ( ) T (1 ε)p + ε z T (P) IF (z; T,P) = lim ε 0 ε Sensitivity curve (J.W. Tukey): measures impact of one point z SC n (z; T n )=n ( T n (z 1,...,z n 1, z) T n 1 (z 1,...,z n 1 ) ) SC n (z; T n )= T ( (1 ε n )P n 1 + ε n z ) T (Pn 1 ) ε n, ε n = 1 n Maxbias (P.J. Huber): measures worst case behavior in neighborhood maxbias(ε; T,P) = sup T (Q) T (P) Q N ε (P) N ε (P) ={Q =(1 ε)p + ε P; P is distr. on X Y, 0 ε< 1 2 } 17

18 Robustness Results Assume for classification or regression problem: loss function L is convex, continuous, L bounded H RKHS of a bounded, continuous kernel k. Then these kernel methods have good robustness properties: influence function sensitivity curve bias maxbias are bounded, sometimes even uniformly bounded, and they are able to learn the unknown distribution P. Details and Proofs: Chr & Steinwart ( 04, 05, 06) 18

19 Y : pure premium [year] 5. Cont.: motor vehicle insurance X: 8 explanatory variables: age of main user, gender of main user, car kept in garage?, driving distance, geographical information, population density, no. of years without claims, strength of engine Estimation with (KLR, ε SVR) with RBF kernel using E(Y X = x) =P(C>0 X = x) k+1 c=1 P(C = c C >0,X = x) E(Y C = c, X = x) 50% training (n=2.2e6), 25% validation (n=1.1e6), 25% test (n=1.1e6) 19

20 Estimated pure premium E(Y X=x) EUR Age of main user 20

21 Estimated pure premium E(Y X=x) EUR Males Females Age of main user 21

22 E(Y X=x) Age of main user P(C=2 C>0,X=x) Age of main user Probability EUR Probability Probability P(C>0 X=x) Age of main user P(C=3 C>0,X=x) Age of main user Probability Probability P(C=1 C>0,X=x) Age of main user P(C=4 C>0,X=x) Age of main user 22

23 6. Summary + Empirical Risk Minimization: + statistics + mathematics + computer sciences + companies (insurance, banking, IT, telecommunication) tariffs, risk scoring, data mining, CRM, CHURN + software companies (SAS/Enterprise Miner) + good predictions, very flexible + applicable also in very high dimensions + good robustness properties 23

24 References Christmann, A., Steinwart, I. (2006). Consistency of kernel based quantile regression. Submitted. Christmann, A., Steinwart, I., Hubert, M. (2006). Robust Learning from Bites for Data Mining. To appear in CSDA. Christmann, A. (2005). On a strategy to develop robust and simple tariffs from motor vehicle insurance data. Acta Mathematicae Applicatae Sinica (English Series, Springer), 21, Christmann, A., Steinwart, I. (2005) Consistency and robustness of kernel based regression. Submitted. Christmann, A. (2004). On Properties of Support Vector Machines for Pattern Recognition in Finite Samples. In: Theory and Applications of Recent Robust Methods. Eds.: M. Hubert, G. Pison, A. Struyf and S. Van Aelst. Series: Statistics for Industry and Technology, Birkhäuser, Basel. Christmann, A., Steinwart, I. (2004). On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5, Schölkopf, Smola (2002) Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. Smyth, Jørgensen (2002). Fitting Tweedie s Compound Poisson Model to Insurance Claims Data: Dispersion Modelling. ASTIN Bulletin, 32, Vapnik (1998). Statistical Learning Theory. Wiley. homepages.vub.ac.be/ achristm 24