The zero-adjusted Inverse Gaussian distribution as a model for insurance claims

Gillian Heller (1), Mikis Stasinopoulos (2) and Bob Rigby (2)

(1) Dept of Statistics, Macquarie University, Sydney, Australia. gheller@efs.mq.edu.au
(2) STORM, London Metropolitan University. E-mails: d.stasinopoulos@londonmet.ac.uk and r.rigby@londonmet.ac.uk

Abstract: We introduce a method for modelling insurance claim sizes, including zero claims. A mixed discrete-continuous model, with a probability mass at zero and an Inverse Gaussian continuous component, is used. The Inverse Gaussian distribution accommodates the extreme right skewness of the claim distribution. The model explicitly specifies a logit-linear model for the occurrence of a claim, and log-linear models for the mean claim size and for the dispersion of claim sizes (both given that a claim has occurred). The method is illustrated on an Australian motor vehicle insurance data set.

Keywords: Inverse Gaussian model; zero-adjusted; insurance claims; gamlss.

1 Introduction

The purpose of modelling claim sizes on insurance policies is to price premiums accurately, and to estimate the risk of extreme claim events. In a fixed period, a policy will either experience a claim, which is a non-negative amount typically having an extremely right-skewed distribution, or no claim, in which case the claim amount is identically zero. The distribution of the claim size is then mixed discrete-continuous: a continuous, right-skewed distribution mixed with a single probability mass at zero. In this respect the phenomenon is similar to rainfall, which is either identically zero on a dry day, or a continuous non-negative amount on a wet day.

1.1 Models for insurance claims

Much attention has been paid in the actuarial literature to alternative distributions for claim sizes (e.g. Hogg and Klugman (1984)) and some authors have developed regression models (usually generalized linear models) for explaining claim sizes as a function of risk factors (e.g.
Haberman and Renshaw (1996)). All of these are models for claim sizes in the subclass of policies which had a claim in the period of observation. Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) considered models for claim sizes, including the zero claims. These are based on the Tweedie distribution, which may be characterised as a Poisson sum of Gamma random variates. A problem with the Tweedie distribution model is that the probability at zero cannot be modelled explicitly as a function of explanatory variables; and, as we shall see in the example, the Gamma distribution is inadequate for modelling the extreme right skewness which is present in our data.

2 The zero-adjusted Inverse Gaussian model

Let \( y_i \) = size of claim on the \( i \)th policy, \( i = 1, \ldots, n \). We can write the distribution of \( y \) as a mixed discrete-continuous probability function:

\[ f(y) = \begin{cases} 1 - \pi & y = 0 \\ \pi\, g(y) & y > 0 \end{cases} \tag{1} \]

where \( g(y) \) is the density of a continuous, right-skewed distribution and \( \pi \) is the probability of a claim.

2.1 Continuous part of the model

The extreme right skewness of claims distributions has been well documented. Candidate distributions within the exponential family are the Gamma and Inverse Gaussian distributions.

Motor vehicle insurance example

We illustrate the method on a class of motor vehicle insurance policies from an Australian insurance company. There were 67,856 policies, of which 4,624 (6.8%) had at least one claim in the period of observation. Of these, 4,333 policies (6.4% of all policies) had one claim, and the remaining 291 policies (0.4%) had between 2 and 4 claims. The maximum claim size was $56,000. A histogram of the non-zero claims, and the pdfs of the fitted Gamma and Inverse Gaussian distributions, are shown in Figure 1. (For clarity of display the horizontal axis has been truncated at $15,000; sixty-five observations were omitted.) The Gamma clearly does not reproduce the shape of the observed claim size distribution; the Inverse Gaussian looks to be a far better fit, accommodating both the mode near zero and the extremely long tail of the distribution. The density of the Inverse Gaussian is:

\[ g(y) = \frac{1}{\sqrt{2\pi y^{3}}\,\sigma} \exp\left[ -\frac{1}{2y} \left( \frac{y - \mu}{\mu\sigma} \right)^{2} \right], \qquad y > 0 \]

which has \( E(y) = \mu \) and \( Var(y) = \sigma^{2}\mu^{3} \).
The use of the Inverse Gaussian distribution for modelling claim sizes has been recommended by, for example, Berg (1994).
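As a minimal sketch of the Inverse Gaussian density in the (µ, σ) parameterisation given above, the following Python code evaluates g(y) and checks numerically that it integrates to one with mean µ and variance σ²µ³. The function names `ig_pdf` and `integrate`, the parameter values and the integration range are illustrative assumptions, not part of the original analysis.

```python
import math

def ig_pdf(y, mu, sigma):
    """Inverse Gaussian density g(y) in the (mu, sigma) parameterisation."""
    if y <= 0:
        return 0.0
    norm = 1.0 / (math.sqrt(2.0 * math.pi * y ** 3) * sigma)
    z = (y - mu) / (mu * sigma)
    return norm * math.exp(-z * z / (2.0 * y))

def integrate(f, a, b, n=100_000):
    """Composite midpoint rule on [a, b] (hypothetical helper)."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

# Illustrative parameter values only; they are not fitted to any data.
mu, sigma = 2.0, 0.5
upper = 60.0  # the density is negligible beyond this point for these values

total = integrate(lambda y: ig_pdf(y, mu, sigma), 0.0, upper)
mean = integrate(lambda y: y * ig_pdf(y, mu, sigma), 0.0, upper)
second = integrate(lambda y: y * y * ig_pdf(y, mu, sigma), 0.0, upper)
var = second - mean ** 2

print(total, mean, var)  # compare with 1, mu and sigma**2 * mu**3
```

The numerical moments can be compared directly with the stated E(y) = µ and Var(y) = σ²µ³.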

FIGURE 1. Claim size distribution: motor vehicle insurance (histogram of non-zero claims with fitted Inverse Gaussian and Gamma densities; vertical axis f(y))

2.2 Discrete part of the model

The obvious model for the probability of a claim is the Bernoulli. Let \( w_i \) be a binary variable indicating the occurrence of at least one claim, and \( \pi_i \) be the probability of at least one claim, on policy \( i \). Note that the occurrence of more than one claim in the period of observation is rare. Then

\[ f(w_i) = \pi_i^{w_i} (1 - \pi_i)^{1 - w_i}, \qquad w_i = 0, 1 \]

However, we have to correct for the typical feature of policy-level data, that not all policies have been in force for the entire period of observation. Let \( t_i \) = exposure of policy \( i \), \( 0 < t_i \le 1 \). (Exposure is the proportion of the period of observation for which the policy has been in force.) We will be assuming that the \( t_i \) are known. If \( c_i \) is the number of claims in the period, and we assume a Poisson process with mean number of claims (per unit exposure time) \( \pi_i \), then

\[ c_i \mid t_i \sim \mathrm{Po}(t_i \pi_i), \qquad P(c_i = 0 \mid t_i = 1) = e^{-\pi_i} \approx 1 - \pi_i \]

and

\[ P(c_i = 0 \mid t_i) = e^{-t_i \pi_i} \approx 1 - t_i \pi_i, \]

provided \( t_i \pi_i \) is small. This gives

\[ f(w_i) = (\pi_i^{*})^{w_i} (1 - \pi_i^{*})^{1 - w_i}, \qquad w_i = 0, 1 \]

i.e. Bernoulli with \( \pi_i^{*} = t_i \pi_i \). We incorporate covariates through the logit link function on \( \pi_i \):

\[ \log \frac{\pi_i}{1 - \pi_i} = \eta_i \]

i.e.

\[ \log \frac{\pi_i^{*}/t_i}{1 - \pi_i^{*}/t_i} = \eta_i \tag{2} \]

and the correction for differing periods of exposure enters the model through the modified link function (2). The predictor \( \eta_i \) is defined in the next section.

2.3 The mixture model

The zero-adjusted Inverse Gaussian (ZAIG) model is then

\[ f(y_i) = \begin{cases} 1 - \pi_i & y_i = 0 \\ \pi_i \, \dfrac{1}{\sqrt{2\pi y_i^{3}}\,\sigma_i} \exp\left[ -\dfrac{1}{2 y_i} \left( \dfrac{y_i - \mu_i}{\mu_i \sigma_i} \right)^{2} \right] & y_i > 0 \end{cases} \]

which has \( E(y_i) = \pi_i \mu_i \) and \( Var(y_i) = \pi_i \mu_i^{2} \left( 1 - \pi_i + \mu_i \sigma_i^{2} \right) \).

Following Rigby and Stasinopoulos (2005), who specify generalized additive models for the location, scale and shape parameters of a variety of distributions, we specify the following models on the parameters \( \mu_i \), \( \sigma_i \) and \( \pi_i \):

\[ \log(\mu_i) = x_{1\mu i}^{\top} \beta_\mu + f_\mu(x_{2\mu i}) \]
\[ \log(\sigma_i) = x_{1\sigma i}^{\top} \beta_\sigma + f_\sigma(x_{2\sigma i}) \]
\[ \log \frac{\pi_i^{*}/t_i}{1 - \pi_i^{*}/t_i} = x_{1\pi i}^{\top} \beta_\pi + f_\pi(x_{2\pi i}) \]

where \( x_{1\mu i} \), \( x_{2\mu i} \), \( x_{1\sigma i} \), \( x_{2\sigma i} \), \( x_{1\pi i} \) and \( x_{2\pi i} \) are covariate vectors for \( \mu_i \), \( \sigma_i \) and \( \pi_i^{*} \), which may be different, the same, or may have some but not all elements in common; \( \beta_\mu \), \( \beta_\sigma \) and \( \beta_\pi \) are the corresponding parameter vectors; and \( f_\mu \), \( f_\sigma \) and \( f_\pi \) are nonparametric functions, typically smoothing splines.

In order to correct for multiple claims in the period, we use the fact that, if \( y_j \sim IG(\mu, \sigma) \), \( j = 1, \ldots, c \) independently, then the total \( t = \sum_j y_j \) has the distribution \( t \sim IG(\mu^{*}, \sigma^{*}) \) where \( \mu^{*} = c\mu \) and \( \sigma^{*} = \sigma/c \). As \( \log(\mu^{*}) = \log(\mu) + \log(c) \) and \( \log(\sigma^{*}) = \log(\sigma) - \log(c) \), we use \( \log(c_i) \) and \( -\log(c_i) \) as offsets in the models for \( \mu_i \) and \( \sigma_i \) respectively, where \( c_i \) is the number of claims on policy \( i \). (A doubtful assumption here is that multiple claim amounts on the same policy are independent.)
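The pieces of the model in Sections 2.2 and 2.3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the claim probability is taken to be the exposure-adjusted probability t·π obtained by inverting the modified link (2), and the numerical values are invented for the example.

```python
import math

def claim_probability(eta, exposure):
    """Invert the modified link (2): pi* = t * expit(eta), for exposure t in (0, 1]."""
    if not 0.0 < exposure <= 1.0:
        raise ValueError("exposure must lie in (0, 1]")
    return exposure / (1.0 + math.exp(-eta))

def zaig_mean(pi, mu):
    """E(y) = pi * mu for the ZAIG distribution."""
    return pi * mu

def zaig_variance(pi, mu, sigma):
    """Var(y) = pi * mu^2 * (1 - pi + mu * sigma^2) for the ZAIG distribution."""
    return pi * mu ** 2 * (1.0 - pi + mu * sigma ** 2)

def total_claim_parameters(mu, sigma, n_claims):
    """(mu*, sigma*) = (c * mu, sigma / c) for the total of c independent IG claims."""
    if n_claims < 1:
        raise ValueError("n_claims must be at least 1")
    return n_claims * mu, sigma / n_claims

# Invented values for illustration only.
pi_star = claim_probability(eta=-2.5, exposure=0.5)
mu, sigma = 1900.0, 0.03
print(pi_star, zaig_mean(pi_star, mu), zaig_variance(pi_star, mu, sigma))

# Offsets on the log scale for a policy with 3 claims: +log(3) for mu, -log(3) for sigma.
mu_star, sigma_star = total_claim_parameters(mu, sigma, 3)
print(math.log(mu_star) - math.log(mu), math.log(sigma_star) - math.log(sigma))
```

The variance formula follows from E(y²) = π(σ²µ³ + µ²) minus (πµ)², and the offsets are consistent with the variance of a sum of c independent claims, since (σ/c)²(cµ)³ = cσ²µ³.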

3 Estimation

The ZAIG has been incorporated into the gamlss package in R (Stasinopoulos et al. (2006)). Maximum (penalised) likelihood estimation is used. The penalised log likelihood function of the model is maximised iteratively using either the RS or CG algorithm of Rigby and Stasinopoulos (2005), which in turn uses a back-fitting algorithm to perform each step of the Fisher scoring procedure. Both RS and CG algorithms use the log likelihood of the data, and its first derivatives (and optionally expected second derivatives) with respect to the distributional parameters, which in this case are \( \mu \), \( \sigma \) and \( \nu = \pi \). The CG algorithm, a generalization of the algorithm used by Cole and Green (1992), additionally uses the expected cross derivatives.

3.1 Motor vehicle insurance

The following covariates were available:

Variable                Range
Characteristics of policy holder:
  Age band              1, 2, 3, 4, 5, 6 (1 is youngest)
  Gender                male, female
  Area of residence     A, B, C, D, E, F
Characteristics of vehicle:
  Value                 $0-$350,000
  Make                  A, B, C, D
  Age                   1, 2, 3, 4 (1 is recent)
  Body type             bus, convertible, coupe, hatchback, hardtop, motorised caravan/combi, minibus, panel van, roadster, sedan, station wagon, truck, utility

Using the GAIC as model selection criterion, the following final model was selected:

log(µ) = age band + gender + area + offset{log(claims)}
log(σ) = area + offset{-log(claims)}
log(π/(1 - π)) = age band + area + vehicle body + spline(vehicle value)

Comments on the model

Model for π: The model for the occurrence of a claim has terms for both policyholder and vehicle characteristics. Policyholder age, area and vehicle body are all categorical, so their form is not an issue; vehicle value is the only continuous covariate that we have, and it enters the model in a smoothing spline form. This is understood when we examine the scatterplot of claim/no claim, with a smoothing spline, in Figure 2.
The relationship is nonlinear; the probability of a claim is at a maximum for vehicle value around $40,000.
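To make the likelihood behind the estimation step concrete, here is a minimal Python sketch of the special case with no covariates, no exposure adjustment and one claim per policy, where the maximum likelihood estimates have closed forms. It is not the RS or CG algorithm and does not use gamlss; the function names and the toy data are assumptions made for illustration.

```python
import math

def fit_zaig_intercept_only(claims):
    """Closed-form ML estimates (pi, mu, sigma) for a ZAIG model without covariates."""
    if any(y < 0 for y in claims):
        raise ValueError("claim sizes must be non-negative")
    positive = [y for y in claims if y > 0]
    if not positive:
        raise ValueError("need at least one non-zero claim")
    pi_hat = len(positive) / len(claims)      # proportion of policies with a claim
    mu_hat = sum(positive) / len(positive)    # mean of the non-zero claims
    sigma2_hat = sum(1.0 / y - 1.0 / mu_hat for y in positive) / len(positive)
    return pi_hat, mu_hat, math.sqrt(max(sigma2_hat, 0.0))

def zaig_loglik(claims, pi, mu, sigma):
    """Log likelihood: Bernoulli part for claim occurrence plus IG part for sizes."""
    total = 0.0
    for y in claims:
        if y == 0:
            total += math.log(1.0 - pi)
        else:
            total += (math.log(pi)
                      - 0.5 * math.log(2.0 * math.pi * y ** 3)
                      - math.log(sigma)
                      - (y - mu) ** 2 / (2.0 * y * (mu * sigma) ** 2))
    return total

# Toy data, invented for illustration: seven policies without a claim, three with.
toy_claims = [0, 0, 0, 0, 0, 0, 0, 500.0, 1200.0, 3000.0]
pi_hat, mu_hat, sigma_hat = fit_zaig_intercept_only(toy_claims)
print(pi_hat, mu_hat, sigma_hat, zaig_loglik(toy_claims, pi_hat, mu_hat, sigma_hat))
```

Because the log likelihood splits into a term involving only π and a term involving only (µ, σ), the two parts can be maximised separately in this simple case; with covariates and smoothing terms an iterative procedure such as RS or CG is needed.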

FIGURE 2. Occurrence of a claim (0/1) plotted against vehicle value (in $10,000 units), with smoothing spline

Model for µ: This contains only policyholder characteristics, which is surprising. A more complicated model involving vehicle value, make and some interaction terms was a close second in the model selection. However, it was felt that this was too complex and difficult to interpret, so the simpler version was chosen.

Model for σ: Area is the only covariate for σ. The variation of the claim size distribution with area is shown in Figure 3: it can be seen that areas D, E and F have shapes which are different from A, B and C, reflected in lower values for σ. In fact areas D, E and F are rural whereas A, B and C are urban.

The explanatory variables age band and area appear in the model equations for both π and µ. It is of interest whether they affect the occurrence of a claim, and claim size, in the same way. Figure 4.a shows the effect of age band (\( e^{\beta} \)) on both \( \pi/(1 - \pi) \) and µ; Figure 4.b shows the effect of area on both \( \pi/(1 - \pi) \) and µ. Note that age band = 3 and area = A are the reference categories. Age band 1 (the youngest drivers) increases both the odds of a claim and the mean claim size, to a similar extent; age bands 2 and 4 have a similar effect to age band 3; and age bands 5 and 6 (older drivers) decrease both the odds of a claim and the mean claim size, their effect being greater on the odds of a claim. The effect of area on the odds of a claim, and mean claim size, is less clear: the only clear indication is that the mean claim size is increased in area F.
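The multiplicative interpretation of the coefficients can be illustrated with a short Python sketch: under the logit and log links, a coefficient β multiplies the odds of a claim, or the mean claim size, by exp(β). The function name and all numerical values below are hypothetical and are not the fitted estimates; the sketch also assumes full exposure and a single claim.

```python
import math

def apply_effects(base_odds, base_mu, beta_pi, beta_mu):
    """Apply exp(beta) multiplicatively to the reference-category odds and mean."""
    odds = base_odds * math.exp(beta_pi)   # logit link: effect acts on the odds
    mu = base_mu * math.exp(beta_mu)       # log link: effect acts on the mean
    pi = odds / (1.0 + odds)
    return pi, mu, pi * mu                 # last entry is E(y) = pi * mu

# Hypothetical reference category and coefficients, for illustration only.
base_pi, base_mu = 0.07, 1900.0
base_odds = base_pi / (1.0 - base_pi)
pi, mu, expected_claim = apply_effects(base_odds, base_mu, beta_pi=0.25, beta_mu=0.20)
print(pi, mu, expected_claim)
```

A covariate that raises both the odds of a claim and the mean claim size, as described for age band 1, therefore raises the expected claim πµ through both factors.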

FIGURE 3. Claim size distribution by area. Panels: A (µ̂ = 1909), B (µ̂ = 1860), C (µ̂ = 2030), D (µ̂ = 1837), E (µ̂ = 2251), F (µ̂ = 2864)

FIGURE 4. Effect of age category and area (exp(β̂)) on occurrence of claim and claim size. Panels: a. Age band; b. Area

4 Conclusion

We introduce a method for modelling insurance claim sizes using a zero-adjusted Inverse Gaussian (ZAIG) model, which explicitly specifies a logit-linear model for the occurrence of a claim, and log-linear models for the mean claim size and for the dispersion of claim sizes (both given that a claim has occurred). These three models may incorporate different covariates, or some of the same covariates, and may depend on common covariates in different ways. The Inverse Gaussian distribution accommodates the extreme right skewness of the claim distributions.

Given the risk factors for a potential new policyholder, the expected claim size may easily be computed as the expected value of the ZAIG distribution, conditional on the covariate values; and quartiles of the claim size distribution may be calculated for each combination of covariate values.

The ZAIG distribution introduced here is a useful distribution for modelling data where the total amount per unit of time is observed but where zero amounts are possible. Rainfall data and smoking/drinking habits data are possible candidates for modelling using the ZAIG distribution.

References

Berg, P.T. (1994). Deductibles and the inverse Gaussian distribution. ASTIN Bulletin, 24.

Cole, T. and Green, P. (1992). Smoothing reference centile curves: The LMS method and penalized likelihood. Statist. in Med., 11.

Haberman, S. and Renshaw, A.E. (1996). Generalized Linear Models and Actuarial Science. The Statistician, 45(4).

Hogg, R.V. and Klugman, S.A. (1984). Loss Distributions. New York: Wiley.

Jørgensen, B. and de Souza, M.C.P. (1994). Fitting Tweedie's compound Poisson model to insurance claims data. Scandinavian Actuarial Journal.

Rigby, R.A. and Stasinopoulos, D.M. (2005). Generalized Additive Models for Location, Scale and Shape (with discussion). Appl. Statist., 54, 1-38.

Smyth, G.K. and Jørgensen, B. (2002). Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin, 32(1).

Stasinopoulos, D.M., Rigby, R.A. and Akantziliotou, C. (2006). gamlss: A collection of functions to fit Generalized Additive Models for Location Scale and Shape. R package version 1.1-0, url = ac.uk/gamlss/.