GENERALIZED PARETO FIT TO THE SOCIETY OF ACTUARIES LARGE CLAIMS DATABASE
Ana C. Cebrián, Michel Denuit and Philippe Lambert

Dpto. Metodos Estadisticos, Ed. Matematicas, Universidad de Zaragoza, 50009 Zaragoza, Spain
Institut de Statistique, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium
Institut des Sciences Actuarielles, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium

March 17, 2003
Abstract

This paper discusses a statistical modeling strategy based on extreme value theory to describe the behavior of an insurance portfolio, with particular emphasis on large claims. The strategy is illustrated using the group medical claims database maintained by the Society of Actuaries. Using extreme value theory, the modeling strategy focuses on the excesses over threshold approach to fit Generalized Pareto distributions. The proposed strategy is compared to standard parametric modeling based on Gamma, LogNormal and LogGamma distributions. Extreme value theory outperforms the classical parametric fits and allows the actuary to easily estimate high quantiles and the probable maximum loss from the data.

Key words and phrases: large claims, extreme value theory, generalized Pareto distribution, generalized extreme value distribution, excesses over threshold method, probable maximum loss.
JEL classification: G22 (Insurance), C19 (Econometric and statistical methods).
1 Motivation

In nonlife insurance, a few large claims hitting a portfolio usually represent the largest part of the indemnities paid by the company. These extreme events are therefore of prime interest for actuaries. They also form the statistical material

- for the pricing of reinsurance agreements such as excess-of-loss reinsurance treaties (under which the reinsurer has to indemnify the direct insurer for all its expenditure associated with a given claim as soon as that expenditure exceeds a fixed limit named the deductible);
- for the estimation of high percentiles;
- for the derivation of the probable maximum loss (PML, in short), to be used as an upper bound for the claim size in the calculation of risk quantities.

The present paper aims to discuss a universal model for dealing with these extreme claims that is acceptable from both the theoretical and the practical viewpoints. To this end, we recall the basic features of Extreme Value Theory (EVT, in short), which provides the theory for describing extremes of random phenomena. It is the key to many risk management problems related to insurance, reinsurance and finance, as shown by Embrechts et al. (1999). Estimation of a high quantile is not an easy task, as one wants to make inference about the extremal behavior of claim severities, i.e. in an area of the sample where there is a very small amount of data. Moreover, extrapolation beyond the range of the data is often desirable, i.e. statements about areas where there are no observations at all. As in other disciplines sensitive to extreme observations, like hydrology and climatology, standard statistical techniques are mostly useless in an analysis concerned with the tail of the distribution. For this reason, alternative statistical methods have been developed. The latter are based only on that part of the sample which carries the information about the extremal behavior, i.e. only on the largest sample values.
EVT, and more precisely the excesses over threshold model together with the generalized Pareto distribution, offers a unified approach to the modelling of the tail of a loss severity distribution. This method is not solely based on the data at hand but includes a probabilistic argument concerning the behavior of the extreme sample values. EVT has already been successfully applied to actuarial problems, e.g. by Rootzén and Tajvidi (1996), McNeil (1997) and McNeil and Saladin (1997). The present paper considers the medical insurance claims from the SOA Group Medical Insurance Large Claims Database. A model is fitted to the amounts of the 1991 group claims; the 1992 claims are then used to validate the fitted model. Our work is organized as follows. In Section 2, we briefly present the data and provide the reader with a concise descriptive analysis of the claim data set. In Section 3 we present and develop the statistical models based on the excesses over threshold approach and the generalized Pareto distribution. As an application, the PML and high percentiles are calculated from the fitted models in Section 4. Section 5 studies the effect of including covariate information in the parameters of the model. A comparison in terms of predictive accuracy between the models with and without covariate information is performed in Section 6. Finally, Section 7 concludes. The technical aspects of the analysis and some graphical procedures are deferred to Appendices A and B.
2 Data

2.1 Data description

We consider the SOA Group Medical Insurance Large Claims Database, which records all the claim amounts exceeding $25,000 over the period 1991-92 and is available from the Society of Actuaries. The left censorship on the variable Total is not a problem since we are interested in the extreme behaviour, that is, in the largest losses. There is no truncation due to benefit maximums. The study conducted by Grazier et al. (1997) (where a thorough description of these data can be found) collects information from 26 insurers. The roughly 171,000 claims recorded are part of a database including about 3,000,000 claims over the years 1991 and 1992. The total amount paid by the insurer for each claim is split into hospital charges and other expenses. We only consider the total amount paid in our analysis (coded as Total in the remainder). In addition to this value, several covariates are available. For ease of explanation, we only consider the year of observation (1991 or 1992), the claimant's age (in years) and the claimant's gender ("M" for male and "F" for female). No exposure measures are available. Henceforth, we only deal with the 1991 data; those relating to 1992 will be used only in Section 6 to assess the quality of the fit.

2.2 Descriptive statistics

Total, Sex and Age distributions

There are 75,789 observations in the data set. Descriptive statistics for Total are provided in Table 2.1. Since the database only records claims above $25,000, the minimum equals this value, as expected. On average, the claim amount equals $58,410. The interquartile range is not too large (about $30,000) but the database contains a significant number of very high claims (the largest claim observed cost $4,518,420).

    Mean                58,410        Minimum                 25,000
    25% Quantile        30,540        Maximum              4,518,420
    50% Quantile        40,220        Range                4,493,420
    75% Quantile        61,320        Standard deviation      66,400
    Skewness              13.2

Table 2.1: Summary statistics for Total.

The data are considerably skewed to the right (skewness coefficient of 13.2 according to Table 2.1).
Therefore, we opt for the log-scale to represent the histogram of Total. A histogram of log(Total) is shown in Figure 1. Even on the log-scale, the histogram is markedly positively skewed. The data concern 36,110 women (47.7%) and 39,670 men (52.3%). Note that there are nine claims for which the value of the variable Sex has not been recorded (the values of Total for these missing data are all around the mean, so that the potential impact of these observations is probably small).
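The summary measures reported in Table 2.1 (quartiles, skewness, coefficient of variation) can be reproduced with a few lines of code. The sketch below is an illustration of ours, not part of the original analysis; it uses simple moment-based formulas and order-statistic quantiles.

```python
import math

def summary(data):
    """Summary statistics in the spirit of Table 2.1: mean, quartiles,
    skewness and coefficient of variation (population-style moments)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    std = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in data) / (n * std ** 3)
    s = sorted(data)

    def quantile(p):                      # simple order-statistic quantile
        return s[min(n - 1, int(p * n))]

    return {"mean": mean, "std": std, "cv": std / mean, "skew": skew,
            "q25": quantile(0.25), "q50": quantile(0.50), "q75": quantile(0.75)}
```

A strongly positive skewness, as in Table 2.1, signals the right-skewed shape discussed above.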
Figure 1: Histogram of log(Total).

The distribution of the age of the claimants is given in Figure 2, where the number of claims by age is shown. A noticeable feature is the high peak at age 0, representing all the perinatal medical problems. A second peak around age 16 could be attributed to the so-called accident hump. The number of claims then increases until 65, before decreasing until 110, the maximum observed age in the database. The main difference between the male and female age structures is observed in the period from age 30 to 65 (see Figure 3). Indeed, in that interval the number of claims is more or less constant for women (around 800 for each year of age) whereas for men, it increases from 500 to 1,000.

Relationship of Total to Sex and Age

To see the possible association between the amount of the claims and the sex of the claimants, some descriptive statistics of Total split by Sex are shown in Table 2.2. On average, claims for men are more expensive ($60,530 instead of $56,900). The spread is comparable for both sexes, except that the data for males appear even more skewed to the right than those relating to females (skewness coefficient equal to 14.1 for males and 11.8 for females). The effect of age on the total amount is illustrated in Figure 4, where box-plots of log(Total) are displayed for each year of age. The lower and upper whiskers indicate respectively the minimum and the 99% quantile, the central rectangle giving the quartiles. The more dispersed and very high claim amounts around ages 30 and 90 are the most remarkable features. Thus, according to this preliminary analysis, there seems to be some influence of Sex and Age on the total amount, but reliable conclusions cannot yet be drawn at this stage.
Figure 2: Histogram of Age.

Figure 3: Histograms of Age by Sex.
    Female
    Mean                56,900        Minimum                 25,000
    25% Quantile        29,890        Maximum              3,484,550
    50% Quantile        38,470        Range                3,459,550
    75% Quantile        58,230        Standard deviation      62,240
    Skewness              11.8        Coefficient of variation   1.1

    Male
    Mean                60,530        Minimum                 25,000
    25% Quantile        31,280        Maximum              4,518,420
    50% Quantile        41,960        Range                4,493,420
    75% Quantile        64,300        Standard deviation      69,000
    Skewness              14.1

Table 2.2: Summary statistics for Total by Sex.

3 Statistical modelling

3.1 Extreme Value Theory (EVT)

Gamma, LogNormal and LogGamma distributions (as well as other parametric models) have often been used by actuaries to fit claim sizes; see Klugman et al. (1998). However, when the main interest is in the tail of loss severity distributions, it is essential to have a good model for the largest claims. Distributions providing a good overall fit can be particularly bad at fitting the tails. EVT and the generalized Pareto distribution (GPD, in short) focus on the tails and are supported by strong theoretical arguments. The basic results and definitions of EVT are summarized in Section A.1 of Appendix A. For a deeper introduction to EVT, we refer the interested reader to the books by Embrechts et al. (1997) and Beirlant et al. (1996). We only give hereafter a short non-technical description of the fundamentals of EVT. Considering a sequence of independent and identically distributed random variables (claim severities, say) X_1, X_2, X_3, ..., most classical results from probability and statistics that are relevant for insurance are based on sums S_n = X_1 + ... + X_n. Let us mention the law of large numbers and the central limit theorem, for instance. Another interesting yet less standard statistic for the actuary is M_n = max(X_1, ..., X_n), the maximum of the n claims. EVT mainly addresses the following question: how does M_n behave in large samples (i.e. when n tends to infinity)? Of course, without further restriction, M_n obviously diverges to +∞.
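As a concrete illustration (ours, not the paper's): for i.i.d. unit-exponential claims the maximum does diverge, but the shifted maximum M_n - ln(n) settles down to a Gumbel limit, whose mean is the Euler-Mascheroni constant (about 0.5772). A quick Monte Carlo check:

```python
import math
import random

# Illustrative simulation (not from the paper): for i.i.d. Exp(1) variables,
# the shifted maximum M_n - ln(n) is approximately Gumbel distributed,
# with mean equal to the Euler-Mascheroni constant (about 0.5772).
random.seed(42)
n, reps = 1000, 2000
shifted_maxima = [
    max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
    for _ in range(reps)
]
avg = sum(shifted_maxima) / reps   # should be close to 0.5772
```

The same centering-and-scaling idea, with different constants, underlies the three limit types mentioned next.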
Once M_n is appropriately centered and normalized, however, it may converge to some specific limit distribution (of three different types, according to the fatness of the tails of the X_i's). In insurance applications, heavy-tailed distributions are most often encountered. Such distributions have survival functions that decay like a power function. A prominent example is the Pareto distribution, widely used by actuaries. To sum up, EVT discusses the asymptotic behavior of M_n and provides results analogous to the central limit theorem for maxima (rather than sums). Of course, in practice, we need more information to manage the portfolio than just the law governing M_n. This is why the approach described in the next section is particularly
useful for actuaries. Basically, it describes the behavior of those claims exceeding some sufficiently high threshold u.

Figure 4: Box-plots of log(Total) by Age.

3.2 Excess Over Threshold approach and Generalized Pareto distribution

Excess Over Threshold (EOT) approach

The traditional approach to EVT is based on extreme value limit distributions. Here, a model for extreme losses is based on the possible parametric form of the limit distribution of maxima. A more flexible model is known as the Excesses Over Threshold (EOT, in short) method. This approach appears as an alternative to maxima analysis for studying the extreme behavior of a variable X. Basically, given a series X_1, ..., X_n of realizations of that variable, EOT analyzes the series [X_i - u | X_i > u], i = 1, ..., n, of the excesses of the variable over a high threshold u. Mathematical theory supports the Poisson distribution for the number of exceedances, combined with independent excesses over the threshold. Let F_u stand for the common cumulative distribution function of the [X_i - u | X_i > u]'s; F_u thus represents the conditional distribution of the losses, given that they exceed the threshold u. The two-parameter Generalized Pareto Distribution (GPD, in short) provides a good approximation to the excess distribution F_u over large thresholds. Following Embrechts et al. (1997), the cumulative distribution function associated with the GPD, denoted as G_ξ, is defined as

    G_ξ(x) = 1 - (1 + ξx)^(-1/ξ)   if ξ ≠ 0,
    G_ξ(x) = 1 - exp(-x)           if ξ = 0,

where x ≥ 0 if ξ ≥ 0 and x ∈ [0, -1/ξ] if ξ < 0. The parameter ξ is named the Pareto index.
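The definition above translates directly into code. The sketch below (ours, not the paper's) implements G_ξ together with its scale family G_{ξ,β}:

```python
import math

def gpd_cdf(x, xi, beta=1.0):
    """Cumulative distribution function of the GPD G_{xi,beta}.

    xi > 0: heavy-tailed (Pareto) case; xi = 0: exponential limit;
    xi < 0: bounded support [0, -beta/xi] (type II Pareto).
    """
    if x <= 0.0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-x / beta)
    if xi < 0.0 and x >= -beta / xi:   # beyond the finite right endpoint
        return 1.0
    return 1.0 - (1.0 + xi * x / beta) ** (-1.0 / xi)
```

An easy sanity check is that for ξ close to 0 the first branch approaches the exponential case, matching the ξ = 0 limit of the definition.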
The related scale family is then defined as G_{ξ,β}(x) = G_ξ(x/β), β > 0. As particular cases of the GPD G_{ξ,β}, we find some classical distributions, namely the Pareto distribution when ξ > 0, the type II Pareto distribution when ξ < 0 and the exponential distribution when ξ = 0. Appendix A.2 gathers some technical results about the GPD that will be used in the remainder of the paper. For some appropriate function β(u) and some Pareto index ξ to be estimated from the data, the approximation

    F_u(x) ≈ G_{ξ,β(u)}(x) for x ∈ R+   (3.1)

holds for large u. The approximation (3.1) is justified by the following result of Balkema and de Haan (1974) and Pickands (1975): the formula

    lim_{u→+∞} sup_x |F_u(x) - G_{ξ,β(u)}(x)| = 0   (3.2)

is true provided that F satisfies some rather general technical conditions (see Appendix A.3 for details about these conditions). These conditions are satisfied by the heavy-tailed distributions. In view of (3.1), the excesses [X_i - u | X_i > u] can be treated as a random sample from the GPD provided the threshold u is large enough. It must be stressed that in order to estimate far in the tails (beyond or at the limit of the available data), one has to make mathematical assumptions on the tail model. Although acceptable, these assumptions are very difficult to verify in practice. Hence, there remains an intrinsic model risk.

Detection of heavy-tailedness

The statistical techniques used in this paper rely on the heavy-tailedness of the X_i's. Therefore, we need tools to explore this characteristic of the data, even if insurance loss data usually present heavy tails. Two simple and powerful graphical tools can be used to detect tail behaviour:

- Plot of the empirical mean excess function. Assuming the finiteness of the mean, i.e. E[X] < +∞, the mean excess function associated with X is defined as e(u) = E[X - u | X > u], that is, the expected excess over the threshold u given that an exceedance occurs.
It is easy to prove that if X is exponentially distributed with rate λ, its mean excess function is constant, e(u) = E[X] = 1/λ. Consequently, the plot of e(u) versus the threshold will be a horizontal line. Short-tailed distributions will show a downward trend. On the contrary, an upward trend is an indication of heavy-tailed behaviour. Usually the function e(u) is not known, but it can easily be estimated from a random sample, and this empirical estimator ê_n can be plotted (see Appendix B.1 for more details about estimation issues).
Figure 5: Empirical mean excess function plot (left) and exponential QQ-plot (right) of Total.

- Exponential QQ-plot. Its interpretation is easy: if the data are an independent and identically distributed sample from an exponential distribution, the points should lie approximately along a straight line. A convex departure from the ideal shape indicates a heavier-tailed distribution, in the sense that the empirical quantiles grow faster than the theoretical ones. On the contrary, concavity indicates a shorter-tailed distribution.

Application to the SOA database

In order to highlight the methodology briefly discussed above, let us apply it to the SOA database. As can be seen from Figure 1, the data are markedly right-skewed (even on the log-scale). This indicates the long-tailed behavior of the underlying data. Let us further examine it with the two graphical procedures described above. Figure 5 (left) displays the graph of ê_n for our data; the heavy-tailed character is obvious from the visible upward trend. The convex behaviour of the exponential QQ-plot is also apparent in the right panel of Figure 5, suggesting a heavy-tailed underlying distribution.

3.3 Choice of the threshold

Two conflicting goals

We have seen above that if the heavy-tailed character of the data is confirmed, a high enough threshold is selected and enough data are available above that threshold, the use of the GPD is justified to model large losses. The only practical problem in applying this result is to determine what a high enough threshold is; we deal with this problem in the present section. Two factors have to be taken into account in the choice of an optimal threshold u: A value of u too large yields few exceedances and consequently imprecise upper quantile estimates. We also lose the possibility to estimate smaller quantiles.
A value of u too small implies that the generalized Pareto character does not hold for the moderate observations, and it yields biased quantile estimates. This bias can be important, as moderate observations usually constitute the largest proportion of the sample. Thus, our aim is to determine the minimum value of the threshold beyond which the GPD becomes a reasonable approximation to the tail of the distribution.

Graphical procedures

To identify the optimal threshold value, we rely on three graphical tools:

- Empirical mean excess function plot. It is easily checked that when X follows a GPD with cumulative distribution function G_{ξ,β}, the mean excess function is linear in u:

    e(u) = β/(1-ξ) + ξu/(1-ξ),

provided β + uξ > 0. Hence, the idea is to determine, on the basis of the graph of the empirical estimator ê_n of the mean excess function, a region [u, +∞) where ê_n(t) becomes approximately linear for t ≥ u.

- GPD index plot. By virtue of the stability property (A.3) of the GPD, if X is GPD with cumulative distribution function G_{ξ,β}, the variable [X - u | X > u] is GPD with cumulative distribution function G_{ξ,β+ξu}, i.e. with the same index parameter ξ, for any u > 0. Consequently, in the plot of the maximum likelihood estimators ξ̂ of the index obtained with increasing thresholds, we will observe that the estimates stabilize once the smallest threshold for which the GPD behaviour holds is reached.

- Gertensgarbe plot. This procedure is very powerful and provides an estimate of the optimal threshold. However, it demands much more statistical background than the other two. Its presentation has been deferred to Appendix B.2.
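The stability property invoked for the index plot can be checked numerically. The snippet below (an illustration of ours, with arbitrary parameter values) verifies that, for ξ > 0, the conditional excess distribution of a GPD(ξ, β) over u coincides with a GPD(ξ, β + ξu):

```python
def gpd_sf(x, xi, beta):
    """Survival function 1 - G_{xi,beta}(x) of the GPD, for xi > 0 and x >= 0."""
    return (1.0 + xi * x / beta) ** (-1.0 / xi)

# The conditional survival of the excess over u equals the survival of a
# GPD with the same index xi and shifted scale beta + xi*u (property (A.3)).
xi, beta, u = 0.4, 2.0, 5.0
for x in (0.5, 1.0, 3.0, 10.0):
    conditional = gpd_sf(u + x, xi, beta) / gpd_sf(u, xi, beta)
    shifted = gpd_sf(x, xi, beta + xi * u)
    assert abs(conditional - shifted) < 1e-12
```

This identity is exactly why ξ̂ should stop drifting once the threshold enters the GPD regime.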
Since these techniques usually provide only approximate information about the threshold, their simultaneous application is highly recommended in order to get more reliable results.

Application to the SOA database

Concerning the large claim data set, the empirical mean excess function depicted in Figure 5 shows an approximately linear behavior starting somewhere around $200,000. A similar result is provided by the GPD index plot (Figure 6), where the index estimates are relatively stable after somewhere between $150,000 and a bit less than $200,000. As pointed out before, the information provided by these plots can be rather approximate. The Gertensgarbe plot detects a threshold of $199,620. It thus seems that we can trust the GPD behavior for excesses over $200,000.
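The two empirical diagnostics used throughout this section (the empirical mean excess function and the exponential QQ-plot coordinates) reduce to a few lines of code. The sketch below is an illustration of ours rather than the authors' implementation:

```python
import math

def mean_excess(sample, u):
    """Empirical mean excess e_n(u): the average excess over u among the
    observations exceeding u (None when no observation exceeds u)."""
    excesses = [x - u for x in sample if x > u]
    return sum(excesses) / len(excesses) if excesses else None

def exp_qq_points(sample):
    """(theoretical, empirical) pairs for an exponential QQ-plot, using
    -log(1 - i/(n+1)) as the plotting positions."""
    s = sorted(sample)
    n = len(s)
    return [(-math.log(1.0 - i / (n + 1.0)), s[i - 1]) for i in range(1, n + 1)]
```

Evaluating mean_excess over a grid of thresholds and looking for the region where the resulting curve becomes linear reproduces the reasoning behind the left panels of Figures 5 and 6.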
Figure 6: Plot of the GPD index ξ against the threshold.

    Threshold    β̂         s.e.(β̂)    ξ̂    s.e.(ξ̂)    K-S p-value
    25,000       18,...
    100,000      56,444     1,...
    200,000      93,910     3,...                       > .3

Table 3.1: GPD fit for different thresholds.

Check of the goodness of a given threshold value

According to the previous results, we perform the confirmatory analysis for two threshold values, u = $100,000 and u = $200,000. Moreover, for the sake of comparison, the same analysis was applied to the whole data set. Remember that since only claims greater than $25,000 were recorded, a threshold u = $25,000 applies to the whole database. Of the 75,789 claims in the record, 7,860 (10.4%) exceed $100,000 and 2,103 (2.7%) exceed $200,000. The estimates of the GPD parameters can be found in Table 3.1 together with their standard errors. Kolmogorov-Smirnov tests have been performed to check the compliance of the data with the GPD. The corresponding p-values are reported in Table 3.1. The GPD fit is clearly rejected for u = $25,000 and $100,000, but not for the threshold $200,000. In order to check for the generalized Pareto behavior of the sample of exceedances, we have plotted:

- the observed and theoretical frequencies on the log-scale (left panels of Figure 7);
- the GPD QQ-plot (right panels of Figure 7), in which the empirical quantiles are plotted against the estimated GPD quantiles; if the GPD model fits, a linear pattern has to appear.

We can conclude that only the threshold of $200,000 satisfies all the tests described above.
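The paper estimates ξ and β by maximum likelihood. As a simpler, self-contained illustration, the sketch below fits a GPD to simulated excesses by the method of moments (valid when ξ < 1/2, so that the variance is finite); all function names and parameter values are ours:

```python
import random

def gpd_sample(n, xi, beta, rng):
    """Simulate GPD(xi, beta) draws by inverting the cdf:
    x = beta * ((1 - U)**(-xi) - 1) / xi."""
    return [beta * ((1.0 - rng.random()) ** (-xi) - 1.0) / xi for _ in range(n)]

def gpd_fit_moments(excesses):
    """Method-of-moments estimates from the sample mean m and variance s2:
    xi = (1 - m**2/s2) / 2 and beta = m * (1 - xi)."""
    n = len(excesses)
    m = sum(excesses) / n
    s2 = sum((x - m) ** 2 for x in excesses) / (n - 1)
    xi = 0.5 * (1.0 - m * m / s2)
    return xi, m * (1.0 - xi)

rng = random.Random(0)
xi_hat, beta_hat = gpd_fit_moments(gpd_sample(50_000, 0.2, 1.0, rng))
```

Maximum likelihood, as used in Table 3.1, is more efficient for heavy tails, but the moment estimators above make the fitting step easy to reproduce.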
Figure 7: GPD fit, whole data set (top), data above $100,000 (middle) and data above $200,000 (bottom).
    Threshold    Gamma      LogNormal   LogGamma
    25,000       < 10^-5    < 10^-5     < 10^-5
    100,000      < 10^-5    < 10^-5     < 10^-5
    200,000      < 10^-5    < 10^-5     < 10^-5

Table 3.2: Kolmogorov-Smirnov p-values for parametric models above different thresholds.

3.4 Comparison with standard parametric fits

In order to assess the improvement achieved by using the GPD instead of a classical parametric model, we provide a brief summary of the Gamma, LogNormal and LogGamma fits for the three previous thresholds. The p-values of the Kolmogorov-Smirnov tests are shown in Table 3.2 and QQ-plots are displayed in Figure 8. From these results, we can conclude that no satisfactory fit for the large losses is obtained. Contrary to the GPD, the fit does not significantly improve when the threshold is increased. Briefly, the results strongly support the use of the GPD instead of the traditional parametric models and demonstrate the value of EVT when the interest lies in the right tail.

4 Some applications of the model

In this section we show how using the GPD to model the distribution of the exceedances over high thresholds can help in solving some actuarial problems.

4.1 Point estimation of high quantiles

The high quantiles of the distribution of the claim amounts provide useful information for insurers. Usually quantiles can be estimated by their empirical counterparts, but when we are interested in very high quantiles this approach is no longer valid, since estimation based on a low number of large observations would be strongly imprecise. Thus, another approach based on the previous GPD model is suggested. From (3.1), we see that (provided u is sufficiently large) a potential estimator for the excess distribution F_u(x) is G_{ξ̂,β̂}(x). So, G_{ξ̂,β̂}(x) approximates the conditional distribution of the losses, given that they exceed the threshold u.
Quantile estimators derived from this curve are conditional quantile estimators, which indicate the scale of losses that could be experienced if the threshold u were to be exceeded. When estimates of the unconditional quantiles are of interest, relating the unconditional cumulative distribution function F to G_{ξ̂,β̂} through F_u, we obtain the following estimator:

    q̂_ε = u + (β̂/ξ̂) { [ (n/N_u)(1-ε) ]^(-ξ̂) - 1 },   (4.1)

where N_u and n are the number of claims above the threshold u and the total number of claims, respectively. All the details about the derivation of (4.1) can be found in Appendix A.4. Figure 9 displays the GPD quantiles (4.1) against the 1991 empirical quantiles. Those quantities clearly agree.
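Estimator (4.1) is straightforward to evaluate. The sketch below uses illustrative inputs of ours (threshold, parameter values and counts roughly in the range of Table 3.1, not the paper's exact estimates):

```python
def gpd_quantile(eps, u, xi, beta, n, n_u):
    """Unconditional quantile estimator (4.1): returns q with F(q) = eps,
    i.e. tail probability 1 - eps, based on the N_u excesses over u."""
    return u + (beta / xi) * (((n / n_u) * (1.0 - eps)) ** (-xi) - 1.0)

# Illustrative values (ours, not the fitted 1991 estimates):
q = gpd_quantile(0.999, u=200_000, xi=0.3, beta=90_000.0, n=75_789, n_u=2_103)

# Sanity check: plugging q back into the GPD tail estimator recovers 1 - eps.
tail = (2_103 / 75_789) * (1.0 + 0.3 * (q - 200_000) / 90_000.0) ** (-1.0 / 0.3)
assert abs(tail - 0.001) < 1e-12
```

The round trip through the tail estimator is the same algebra that produces (4.1) in Appendix A.4.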
Figure 8: Gamma (top), LogNormal (middle) and LogGamma (bottom) QQ-plots for the losses with threshold $25,000 (left), $100,000 (middle) and $200,000 (right).
Figure 9: GPD quantiles against their 1991 empirical analogues.

4.2 Probable Maximum Loss (PML)

Another useful measure in insurance is the Probable Maximum Loss (PML, in short). Broadly speaking, we can say that the PML is the worst loss likely to happen. Wilkinson (1982) and Kremer (1990, 1994) proposed to set the PML equal to either (1+θ)E[M_n] or E[M_n] + θ Var[M_n], where θ is a safety loading coefficient. The PML can also be obtained by solving the equation

    P[M_n ≤ PML_ε] = 1 - ε,

for some small ε > 0. This means that the PML is a high quantile of the maximum of a random sample of size n. Since the maximum M_n will exceed the so-defined PML only in 100ε% of the cases, it is very unlikely that an individual claim amount assumes a value larger than the PML. We then get PML_ε = F_{M_n}^{-1}(1-ε), so that the PML is computed as the (1-ε) quantile of the distribution of the maximum loss over the time period. In order to apply such formulas, Wilkinson (1982) suggests an approach based on nonparametric techniques for order statistics. However, in the spirit of Kremer (1990, 1994), we will use an approach based on the GPD approximation of the exceedances. A result recalled in Appendix A.3.3 states that, under the conditions that lead to the GPD approximation G_{ξ,β} for the distribution of the exceedances over u (those conditions are detailed in Appendix A.3), the number N of exceedances over that threshold is roughly Poisson. In that situation it can be proved that the distribution of the maximum M_N of these N exceedances can be approximated by a generalized extreme value distribution H_{ξ,μ,ψ} where

    μ = (β/ξ)(λ^ξ - 1) and ψ = βλ^ξ, with λ = E[N].
See Appendix A.5 for details about the distribution of the maximum in a Poisson-size sample. Using this distribution, the previous PML definition results in the formula

    PML_ε = u + (β/ξ) { [ λ / (-ln(1-ε)) ]^ξ - 1 }.

Usually, one takes ε = 5% or 1%. Using the set of the 1991 claims, the PML for these values of ε equals $8,163,894 and $13,678,959, respectively.

5 Including covariate information in the GPD approach

In most insurance and real-world problems, it is desirable to be able to apply the GPD approach with thresholds as low as possible. The natural question that arises is whether there is any factor that influences the distribution of the exceedances. In such a case, it can be hoped that incorporating that information will improve the model. In particular, the GPD behavior is expected to appear at lower thresholds. To answer this question, we study whether the estimated parameters ξ and β depend on the covariates Age and Sex, both treated as categorical. In order to incorporate the information provided by Age and Sex, we allow one or both parameters, β and ξ, to differ across the categories defined by the covariates. Threshold selection is performed as described in Section 3.3 and the corresponding GPD parameters are fitted separately for each category. Additional tools such as the AIC criterion and the likelihood ratio test can be used to compare models with the same threshold but including different covariates. First, Sex is considered and the sample is divided into two groups: men and women. Hereafter, the nine claims without a sex record have been omitted from the analysis. As hoped, when the parameters are allowed to differ between the two group distributions, lower thresholds suffice to observe the GPD behaviour. The Gertensgarbe test suggests a threshold of about $100,000 for both categories: u = $105,587 for women and u = $106,161 for men. A GPD is fitted for each sex group with threshold u = $100,000 (see Table 5.1 for a brief summary of the results).
The confirmatory analysis for u = $100,000 is satisfactory: the Kolmogorov-Smirnov test does not lead to rejection of the GPD assumption (p-values of 19.9% for women and 14.2% for men). To include the Age information in the parameters, we created ten categories that can be considered homogeneous according to the earlier descriptive analysis (see Section 2.2): (0,3], (3,12], (12,20], (20,25], (25,35], (35,45], (45,55], (55,65], (65,75] and [75,+∞). We conclude again that the inclusion of this covariate allows us to lower the threshold. The Gertensgarbe test provides threshold values varying between about u = $54,000 and u = $64,000 for all the age categories except the first one, which requires a higher threshold u = $86,000. We fit separate GPDs and perform the confirmatory analysis for u = $60,000 and u = $100,000. As can be seen in Table 5.2, for u = $60,000 the fit is satisfactory for all the categories except the first one. For u = $100,000 the GPD behaviour is accepted in all the categories. To sum up, we can say that the inclusion of covariate information allows the GPD behaviour to be detected at lower thresholds. In particular, including Sex or Age in the model, a threshold
    Threshold   Cat.   β̂        s.e.(β̂)   ξ̂   s.e.(ξ̂)   K-S pv   L-lik.   AIC
    100,000     F      57,680    1,...                                      ...,245
                M      55,994    1,...

Table 5.1: GPD fit including Sex.

    Threshold   Cat.       β̂        s.e.(β̂)   K-S pv   L-lik.      AIC
    60,000      (0,3]      65,560    2,...              ...,38      468,655
                (3,12]     46,911    3,...
                (12,20]    39,656    2,...     > .3
                (20,25]    40,950    2,...
                (25,35]    38,177    1,...
                (35,45]    36,171    1,...
                (45,55]    31,...
                (55,65]    3...                > .3
                (65,75]    31,888    1,...     > .3
                (75,+∞)    30,212    2,...     > .3
    100,000     (0,3]      84,514    4,...     > .3     -96,54...   193,83...
                (3,12]     70,970    6,...     > .3
                (12,20]    55,736    4,...
                (20,25]    77,831    7,...     > .3
                (25,35]    71,772    3,...     > .3
                (35,45]    61,400    2,...
                (45,55]    58,100    2,...
                (55,65]    48,697    1,...
                (65,75]    49,473    3,...
                (75,+∞)    78,851    11,...    > .3

Table 5.2: GPD fit including Age.
    Mean                58,790        Minimum                 25,000
    25% Quantile        30,500        Maximum              7,104,810
    50% Quantile        40,100        Range                7,079,810
    75% Quantile        61,100        Standard deviation   72,790.83

Table 6.1: Summary statistics for the Total amount (1992 data).

u = $100,000 can be used. However, the combination of both factors does not improve the results.

6 Predictive accuracy

6.1 Distribution of Total

The aim of this section is to study the predictive accuracy of the fitted models in order to evaluate their forecasting capabilities. Since the model estimation was based on the medical claim database from 1991, it is natural to assess the quality of the predictions provided by the fitted models by comparing them to the values observed in the 1992 record. No adjustment for inflation has been made; in practice, this issue should be carefully addressed. The 1992 data set contains 95,447 observations, but 11 of them have been removed since they presented a negative value in the Age record; five observations with a missing Sex record have also been removed. Thus, the final size is 95,431. Descriptive statistics for this data set are available in Table 6.1. The most noticeable feature is that they show a wider range and spread than the data from 1991 (see Table 2.1); this is mainly caused by the presence of a very large claim, $7,104,810, corresponding to a 110-year-old man. Although the number of claims increased in 1992, their mean value is similar to the one observed in 1991. To check whether the GPD model fitted to the 1991 data remains valid for the 1992 data, we applied a Kolmogorov-Smirnov goodness-of-fit test. This yielded a p-value larger than 30%, so that the 1991 GPD model is not rejected.

6.2 High quantiles

Figure 10 displays the GPD quantiles (4.1) estimated on the basis of the 1991 data against the 1992 empirical quantiles. It is clear that those quantities closely agree, except at the very end. This is due to the very large claim appearing in the 1992 data.
We will come back to this particular feature later.

6.3 PML

Recall from Section 4 that for $\epsilon$ = 5% and 1% the PML were respectively $8,163,894 and $13,678,959. None of the observations in the 1992 record, whose maximum claim is $7,14,81, exceeds these estimates. This supports the claim that the method used in this paper for determining the PML can cope even with very large claims.
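The Kolmogorov-Smirnov check of Section 6.1 is easy to sketch. The paper's computations were done in S-Plus; the Python fragment below only illustrates the mechanics (the K-S distance between the empirical CDF of the excesses over the threshold and a fitted GPD), and the sample and parameter values are made up, not the paper's estimates:

```python
def gpd_cdf(x, xi, beta):
    """CDF of the generalized Pareto distribution G_{xi,beta}, x >= 0, xi > 0."""
    if x < 0.0:
        return 0.0
    return 1.0 - (1.0 + xi * x / beta) ** (-1.0 / xi)

def ks_statistic(excesses, xi, beta):
    """One-sample Kolmogorov-Smirnov distance between the empirical CDF of
    the excesses over the threshold and the fitted GPD CDF."""
    xs = sorted(excesses)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = gpd_cdf(x, xi, beta)
        # Compare F to the empirical CDF just before and just after x
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

# Hypothetical excesses over the threshold (claim minus u), not SOA data:
sample = [5_000.0, 12_000.0, 30_000.0, 70_000.0, 160_000.0, 400_000.0]
d = ks_statistic(sample, xi=0.4, beta=60_000.0)
```

The statistic `d` would then be compared with the usual K-S critical values (or a bootstrap calibration, since the parameters are estimated).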
[Figure 1 here: scatter of sample quantile against GP quantile.]
Figure 1: 1991 GPD quantiles against their 1992 empirical analogues.

6.4 Excess-of-loss payments

Let us now evaluate the predictive capabilities of our model on the basis of the variable $S_\omega$, the total burden over a high value $\omega$ (with $\omega > u$, so that the GPD model applies) in one year:

$$S_\omega = \sum_{j:\,X_j > \omega} (X_j - \omega).$$

The prediction of $S_\omega$ is based on its expected value. Bearing in mind that, since $\omega > u$, the number of excesses $N_\omega$ over the threshold $\omega$ approximately follows a Poisson distribution, and that the corresponding exceedances $[X - \omega \mid X > \omega]$ follow a GPD with parameters $\xi$ and $\beta + \xi(\omega - u)$ (by the threshold stability property (A.3)), the expected value of $S_\omega$ is given by

$$E[S_\omega] = E[N_\omega]\, E[X - \omega \mid X > \omega] = E[N_\omega]\,\frac{\beta + \xi(\omega - u)}{1 - \xi}.$$

Now, using (A.6) yields

$$E[N_\omega] = E[N_u]\left[1 + \xi(\omega - u)/\beta\right]^{-1/\xi}, \qquad (6.1)$$

where $E[N_u]$ can be estimated empirically. Combining the above expressions yields $E[S_\omega]$; the latter quantity is then estimated by substituting point estimates for the corresponding unknown quantities. If the model includes covariates, $N_\omega = \sum_{k=1}^{K} N_{\omega,k}$, where $N_{\omega,k}$ represents the number of
observations greater than $\omega$ in category $k$. The expected value of $S_\omega$ is then given by

$$E[S_\omega] = \sum_{k=1}^{K} E[N_{\omega,k}]\,\frac{\beta_k + \xi_k(\omega - u)}{1 - \xi_k},$$

where

$$E[N_{\omega,k}] = E[N_{u,k}]\left[1 + \xi_k(\omega - u)/\beta_k\right]^{-1/\xi_k}.$$

In order to make comparisons possible between models with different thresholds, we consider the global standardized differences between the observed and predicted values, $\Delta s_\omega = (s_\omega - \hat s_\omega)/s_\omega$, where $s_\omega$ stands for the observed value of $S_\omega$ and $\hat s_\omega$ for its prediction.

[Tables 6.2 and 6.3 here. Columns: Covariate, required threshold, considered threshold u, and $\Delta s_\omega$ for $\omega$ = $100,000, $150,000, $200,000 and $250,000; rows: no covariate (u = $200,000), Sex (u = $100,000), Age (u = $100,000).]
Table 6.2: $\Delta s_\omega$ of the extreme claim models.
Table 6.3: $\Delta s_\omega$ of the extreme claim models after removing the largest observation.

The $\Delta s_\omega$ values corresponding to each of the three fitted models and four $\omega$ values ($100,000, $150,000, $200,000 and $250,000) are summarized in Table 6.2. From them we conclude that the model including Sex information, but with a lower threshold, shows slightly worse predictive ability than the model without covariates. Similar conclusions can be drawn for the covariate Age; to be specific, compared with the model including Sex, slightly better accuracy is achieved in the prediction of the total burden. However, the model without covariates and threshold u = $200,000 still provides slightly better predictions for the highest $\omega$ values. In the previous descriptive analysis, a very large claim, $7,14,81, was detected in the 1992 record. To check the effect of that observation, we recalculated the $\Delta s_\omega$ values after removing it from the data set. The corresponding differences are shown in Table 6.3. Slightly better predictions are obtained, but our conclusions remain qualitatively similar.
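The prediction machinery of this section fits in a few lines of code. The sketch below is an illustration in Python (the paper used S-Plus); the counts and parameter values are placeholders, not the fitted estimates:

```python
def expected_burden(lam_u, xi, beta, u, omega):
    """E[S_omega] for omega >= u under the Poisson/GPD model of Section 6.4:
    E[N_omega] = lam_u * (1 + xi*(omega-u)/beta)**(-1/xi)   (equation (6.1)),
    E[X - omega | X > omega] = (beta + xi*(omega-u)) / (1 - xi)."""
    lam_om = lam_u * (1.0 + xi * (omega - u) / beta) ** (-1.0 / xi)
    return lam_om * (beta + xi * (omega - u)) / (1.0 - xi)

def expected_burden_cov(cats, u, omega):
    """Covariate version: sum E[S_omega] over categories k, each with its own
    expected excess count lam_k over u and GPD parameters (xi_k, beta_k)."""
    return sum(expected_burden(lam_k, xi_k, beta_k, u, omega)
               for lam_k, xi_k, beta_k in cats)

def standardized_difference(observed, predicted):
    """Delta s_omega = (s_omega - s_hat_omega) / s_omega."""
    return (observed - predicted) / observed

# Hypothetical figures: 200 excesses over u = 25,000, xi = 0.4, beta = 60,000
es = expected_burden(200.0, 0.4, 60_000.0, 25_000.0, 100_000.0)
```

At $\omega = u$ the formula reduces to $E[N_u]\,\beta/(1-\xi)$, and for $\xi < 1$ the expected burden decreases as the retention $\omega$ rises.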
To conclude this section, we could say that, to make inference on very extreme values, the use of high thresholds (about $200,000 in our example) is recommended, since it provides the most accurate predictions. However, when models with lower thresholds are of interest, the use of covariate information can allow us to reduce the required threshold while obtaining almost the same prediction accuracy. In particular, we have seen that the inclusion of Age information in the model provides satisfactory results, with predictive errors of less than 5% for claims over $100,000. Note that prediction accuracy is generally best just above the threshold u, i.e. for values of $\omega$ not too far from u. This is due to the dominating presence of observations close to the threshold (compared with more extreme and less frequent claims), which have a determining influence on the parameter estimates.

7 Discussion

The present study has shown the usefulness of EVT for actuaries, but also its limitations. Indeed, this theory undoubtedly allows the actuary to determine the distribution of the large losses, high quantiles and the PML. However, the Pareto behaviour of the data seems to appear only above very high thresholds. This makes the approach only moderately useful for reinsurers covering the first layers. In such cases, the inclusion of covariate information in the model proves useful for lowering the thresholds. Moreover, the GPD combined with the EOT models allowed us to accurately forecast the total claim amount above high thresholds for the year 1992, as well as the PML. All the analyses and graphics in this work have been performed using S-Plus 2000 (code can be obtained from the authors).

A Appendix A: Some EVT results

A.1 The generalized extreme value distribution

A.1.1 Sequence of maxima

Given a sequence of independent and identically distributed random variables $X_1, X_2, X_3, \ldots$ with parent cumulative distribution function F, let us define the sequence of maxima $M_n = \max\{X_1, X_2, \ldots, X_n\}$.
Then $P[M_n \le x] = \{F(x)\}^n$, and $M_n \to u_X$ almost surely as $n \to \infty$, where $u_X$ is the right endpoint of the support of the $X_i$'s, precisely defined as

$$u_X = \sup\{x \in \mathbb{R} : F(x) < 1\},$$

possibly infinite.

A.1.2 Limiting behavior of maxima

In most cases, $M_n$ can be normalized in such a way that it converges to a limit random variable, which together with the normalizing constants determines the asymptotic behavior of the sample maxima. If there exist sequences of real numbers $a_n > 0$ and $b_n \in \mathbb{R}$ such that
the normalized sequence $(M_n - b_n)/a_n$ converges in distribution to H, i.e.

$$\lim_{n \to +\infty} P\left[\frac{M_n - b_n}{a_n} \le x\right] = \lim_{n \to +\infty} F^n(a_n x + b_n) = H(x) \qquad (A.1)$$

for all points of continuity of H, then H is a Generalized Extreme Value Distribution (GEVD, in short). Note that (A.1) stands in contrast to the theory of averages, which is about $S_n = \sum_{i=1}^n X_i$.

A.1.3 GEVD

Let us now make precise the form of the cumulative distribution function H involved in (A.1). The GEVD is the only non-degenerate limit distribution for appropriately normalized sample maxima. The cumulative distribution function $H_\xi$ corresponding to the standard GEVD is defined as

$$H_\xi(x) = \begin{cases} \exp\left\{-(1 + \xi x)^{-1/\xi}\right\} & \text{if } \xi \ne 0, \\ \exp\{-\exp(-x)\} & \text{if } \xi = 0, \end{cases}$$

where $1 + \xi x > 0$. The support of $H_\xi$ is $(-1/\xi, +\infty)$ if $\xi > 0$, $(-\infty, -1/\xi)$ if $\xi < 0$, and the whole real line $\mathbb{R}$ if $\xi = 0$. The related location-scale family $H_{\xi;\mu,\psi}$ is then given by

$$H_{\xi;\mu,\psi}(x) = H_\xi\left(\frac{x - \mu}{\psi}\right), \quad \mu \in \mathbb{R},\ \psi > 0.$$

The three classical extreme value distributions are special cases of the GEVD: if $\xi > 0$ we have the Fréchet distribution, if $\xi < 0$ we have the Weibull distribution, and $\xi = 0$ gives the Gumbel distribution.

A.1.4 Domain of attraction of heavy-tailed distributions

If the appropriately normalized sample maxima converge in distribution to a non-degenerate limit H in (A.1), the cumulative distribution function F is said to be in the domain of attraction of the GEVD H. The class of distributions for which such an asymptotic behaviour holds is large: all commonly encountered continuous distributions fulfill (A.1). The relation (A.1) holds with $H = H_\xi$, $\xi > 0$, if, and only if, the representation

$$1 - F(x) = x^{-1/\xi} L(x) \qquad (A.2)$$

holds for some slowly varying function L (a slowly varying function L is such that $L(\lambda x)/L(x) \to 1$ as $x \to +\infty$ for any $\lambda > 0$). This result essentially says that if $1 - F$ decays like a power function, then the distribution is in the domain of attraction of the Fréchet distribution.
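The convergence (A.1) can be illustrated numerically. For the unit exponential distribution one may take $a_n = 1$ and $b_n = \log n$, so $F^n(x + \log n)$ should approach the Gumbel CDF $H_0(x)$. This small Python check is an illustration only, not part of the paper:

```python
import math

def gev_cdf(x, xi):
    """Standard GEVD H_xi(x); the xi = 0 branch is the Gumbel CDF."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        return 0.0 if xi > 0.0 else 1.0   # outside the support of H_xi
    return math.exp(-t ** (-1.0 / xi))

def maxima_cdf_exponential(x, n):
    """F^n(a_n * x + b_n) for F the unit exponential, a_n = 1, b_n = log n."""
    return (1.0 - math.exp(-(x + math.log(n)))) ** n

# The difference should shrink as n grows:
err = abs(maxima_cdf_exponential(1.0, 10_000) - gev_cdf(1.0, 0.0))
```

Since $(1 - e^{-x}/n)^n \to \exp(-e^{-x})$, the error behaves like $O(1/n)$, so already at $n = 10{,}000$ the two values agree to several decimals.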
The formula (A.2) is often taken as a definition of heavy-tailed distributions; the Pareto, Burr, LogGamma and Cauchy distributions, as well as various mixture models, present this characteristic. The appropriateness of heavy-tailed distributions of the form (A.2) for modelling catastrophe risks manifests itself in the asymptotic relation

$$\lim_{x \to +\infty} \frac{P[M_n > x]}{P[S_n > x]} = 1,$$
where $S_n = \sum_{i=1}^n X_i$. This relation describes a situation where the sum of n claims gets large if, and only if, its maximum gets large. Distributions satisfying this condition are usually referred to as subexponential in the actuarial literature (see e.g. Embrechts and Veraverbeke (1982) and Klüppelberg and Villasenor (1993)).

A.2 GPD

A.2.1 Threshold stability property

The GPD enjoys the convenient threshold stability property, which ensures that

$$X \text{ is GPD } G_{\xi,\beta} \;\Longrightarrow\; [X - u \mid X > u] \text{ is GPD } G_{\xi,\beta+\xi u} \text{ for any } u > 0. \qquad (A.3)$$

This basically says that, provided X is GPD, the exceedances of X over any threshold remain GPD.

A.2.2 Estimation of the parameters of the GPD model

There are several methods to fit a GPD and to estimate the corresponding parameters. It can be shown that for $\xi > -0.5$ (in particular for all heavy-tailed distributions) the maximum likelihood (ML, in short) regularity conditions are fulfilled. Hosking and Wallis (1987) proved that the ML estimators $(\hat\xi, \hat\beta)$ of the GPD parameters are asymptotically normally distributed with expected value $(\xi, \beta)$ and approximate covariance matrix

$$\frac{1}{n}\begin{pmatrix} (1+\xi)^2 & \beta(1+\xi) \\ \beta(1+\xi) & 2\beta^2(1+\xi) \end{pmatrix}.$$

This result allows us to obtain the standard errors of the ML estimators. For reasons of convergence of the algorithm, a reparametrization of the GPD model with parameters $\sigma = \log(\beta)$ and $\alpha = 1/\xi$ has been used in this paper to maximize the log-likelihood.

A.3 Technical conditions to ensure the GPD approximation

A.3.1 Relating GPD and GEVD

The technical conditions on F ensuring the validity of (3.2) are determined by the link between the GPD and the GEVD. It can be shown that (A.1) holds with $H = H_\xi$ if, and only if, there exists a positive measurable function $a(\cdot)$ such that, for $1 + \xi x > 0$,

$$\lim_{u \to u_X} P\left[\frac{X - u}{a(u)} > x \,\Big|\, X > u\right] = \bar G_\xi(x),$$

where $\bar G_\xi = 1 - G_\xi$. In other words, $G_\xi$ describes excesses over high thresholds if, and only if, $H_\xi$ governs the sample maximum behaviour, i.e. if F belongs to the domain of attraction of a GEVD.
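The threshold stability property (A.3) can be verified directly from the GPD survival function: for $\xi > 0$, $\bar G_{\xi,\beta}(u+x)/\bar G_{\xi,\beta}(u) = \bar G_{\xi,\beta+\xi u}(x)$. A quick numerical check in Python, with arbitrary illustrative parameter values:

```python
def gpd_sf(x, xi, beta):
    """GPD survival function 1 - G_{xi,beta}(x), x >= 0, heavy-tailed case xi > 0."""
    return (1.0 + xi * x / beta) ** (-1.0 / xi)

xi, beta, u = 0.4, 60_000.0, 25_000.0   # arbitrary illustrative values
x = 10_000.0
lhs = gpd_sf(u + x, xi, beta) / gpd_sf(u, xi, beta)   # P[X - u > x | X > u]
rhs = gpd_sf(x, xi, beta + xi * u)                    # survival of G_{xi, beta + xi*u}
```

The two expressions coincide because $(1 + \xi(u+x)/\beta)/(1 + \xi u/\beta) = 1 + \xi x/(\beta + \xi u)$.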
A.3.2 The case of heavy-tailed distributions

Thus we can conclude that, for heavy-tailed distributions and above sufficiently high thresholds, the distribution of the excesses can be approximated by a GPD with $\xi > 0$ (which is nothing else than a reparametrized version of the Pareto law).
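This GPD approximation is what underlies the tail estimator of Section A.4 and the quantile estimator (4.1) used in Section 6.2. A minimal Python sketch of both follows; the counts and parameter values are hypothetical, and the quantile formula is simply the algebraic inverse of the tail estimator:

```python
def tail_prob(y, u, n, n_u, xi, beta):
    """Tail estimate of P[X > y] for y > u: (N_u/n) * (1 - G_{xi,beta}(y - u))."""
    return (n_u / n) * (1.0 + xi * (y - u) / beta) ** (-1.0 / xi)

def high_quantile(p, u, n, n_u, xi, beta):
    """Quantile estimator obtained by inverting tail_prob: the y whose
    estimated exceedance probability equals 1 - p (the form behind (4.1))."""
    return u + (beta / xi) * (((n / n_u) * (1.0 - p)) ** (-xi) - 1.0)

# Hypothetical counts and parameters, not the SOA estimates:
q999 = high_quantile(0.999, u=25_000.0, n=75_000, n_u=200, xi=0.4, beta=60_000.0)
```

Plugging the estimated quantile back into `tail_prob` recovers the target exceedance probability exactly, which is a convenient sanity check on the inversion.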
A.3.3 Poisson approximation of the number of excesses over a high threshold

Let $N_n$ be the number of excesses over a threshold $u_n$. If the sequence of thresholds $(u_n)$ satisfies

$$\lim_{n \to +\infty} n\,(1 - F(u_n)) = \tau, \qquad (A.4)$$

then, for $k = 0, 1, 2, \ldots$,

$$\lim_{n \to +\infty} P[N_n \le k] = \sum_{s=0}^{k} e^{-\tau}\,\frac{\tau^s}{s!}.$$

A.4 Point estimation of high quantiles

In order to estimate the high unconditional quantiles, we want to relate the unconditional cumulative distribution function F to the conditional cumulative distribution function $F_u$. Denoting $\bar F(x) = 1 - F(x)$, we have

$$\bar F_u(x) = P[X - u > x \mid X > u] = \frac{\bar F(u + x)}{\bar F(u)}.$$

From this relationship, applying the GPD approximation to the distribution of the exceedances, we obtain

$$\bar F(u + x) \approx \bar F(u)\left(1 - G_{\hat\xi;\hat\beta}(x)\right).$$

Provided we have a large enough sample, we can estimate $\bar F(u)$ by its empirical counterpart to obtain an estimator of the tail at the higher value $u + x$; we get

$$\widehat{\bar F}(u + x) = \frac{N_u}{n}\left(1 - G_{\hat\xi;\hat\beta}(x)\right),$$

where $N_u$ and n are the number of claims above the threshold u and the total number of claims, respectively. To sum up, we can estimate the probability that the claim amount is larger than y, for any amount $y > u$, by

$$\widehat{\bar F}(y) = \frac{N_u}{n}\left[1 - G_{\hat\xi;\hat\beta}(y - u)\right].$$

The latter formula yields the quantile estimator (4.1) of the distribution F.

A.5 Distribution of the maximum of a GPD sample of Poisson size

Consider a random variable N distributed according to the Poisson law with mean $\lambda$, and let $X_1, \ldots, X_N$ be a sequence of N independent and identically distributed random variables with common cumulative distribution function $G_{\xi;\beta}$. Then, with $M_N = \max\{X_1, X_2, \ldots, X_N\}$, it is easily seen that

$$P[M_N \le x] = H_{\xi;\mu,\psi}(x), \qquad (A.5)$$

with $\mu = \beta\xi^{-1}(\lambda^\xi - 1)$ and $\psi = \beta\lambda^\xi$.
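The closed form (A.5), which is what makes the PML of Section 6.3 computable, can be checked numerically: the CDF of the maximum of a Poisson($\lambda$) number of GPD claims coincides with the location-scale GEVD with $\mu = \beta\xi^{-1}(\lambda^\xi - 1)$ and $\psi = \beta\lambda^\xi$. The values of $\lambda$, $\xi$, $\beta$ below are arbitrary:

```python
import math

def gpd_cdf(x, xi, beta):
    """G_{xi,beta}(x) for x >= 0, xi != 0."""
    return 1.0 - (1.0 + xi * x / beta) ** (-1.0 / xi)

def gev_cdf(x, xi, mu, psi):
    """Location-scale GEVD H_{xi;mu,psi}(x)."""
    t = 1.0 + xi * (x - mu) / psi
    return math.exp(-t ** (-1.0 / xi)) if t > 0.0 else 0.0

def max_poisson_gpd_cdf(x, lam, xi, beta):
    """P[M_N <= x] = exp(-lam * (1 - G_{xi,beta}(x))) for N ~ Poisson(lam)."""
    return math.exp(-lam * (1.0 - gpd_cdf(x, xi, beta)))

lam, xi, beta = 50.0, 0.4, 60_000.0        # arbitrary illustrative values
mu = beta / xi * (lam ** xi - 1.0)
psi = beta * lam ** xi
diff = max(abs(max_poisson_gpd_cdf(x, lam, xi, beta) - gev_cdf(x, xi, mu, psi))
           for x in (1e5, 5e5, 2e6))
```

The agreement is exact up to floating-point error, since $1 + \xi(x-\mu)/\psi = \lambda^{-\xi}(1 + \xi x/\beta)$ with this choice of $\mu$ and $\psi$.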
Relation (A.5) comes from the fact that, under the stated conditions,

$$P[M_N \le x] = E\left[(G_{\xi;\beta}(x))^N\right] = \exp\left[-\lambda\left(1 - G_{\xi;\beta}(x)\right)\right].$$

A.6 Stability property of the Poisson process

If the expected number of excesses over a threshold u is $\lambda_u$ and the corresponding exceedances follow a GPD$(\xi, \beta)$, the number of excesses over $\omega > u$ is Poisson with intensity

$$\lambda_\omega = \lambda_u\left[1 + \xi(\omega - u)/\beta\right]^{-1/\xi}.$$

See Leadbetter et al. (1983, p. 38) for the proof, based on the fact that random deletion of events in a Poisson process results again in a Poisson process for the remaining events. This expression yields

$$E[N_\omega] = E[N_u]\left[1 + \xi(\omega - u)/\beta\right]^{-1/\xi}. \qquad (A.6)$$

B Appendix B: graphical tools

B.1 The mean excess function

The mean excess function associated to X is defined as

$$e(u) = E[X - u \mid X > u] = \int_0^{+\infty} (1 - F_u(x))\,dx.$$

This function can be estimated from a random sample $\{x_1, x_2, \ldots, x_n\}$ by

$$\hat e_n(u) = \frac{\sum_{i=1}^n x_i\, I[x_i > u]}{\#\{x_i : x_i > u\}} - u = \frac{\sum_{i=1}^n (x_i - u)\, I[x_i > u]}{\#\{x_i : x_i > u\}},$$

where $I[A] = 1$ if the event A occurred and 0 otherwise, and where $\#B$ is the number of elements in the set B; that is, $e(u)$ is estimated by the sum of the exceedances over the threshold u divided by the number of data points exceeding the threshold u. Usually, the mean excess function is evaluated at the observations of the sample. Denoting the sample observations arranged in ascending order by $x_{[1]} \le x_{[2]} \le \ldots \le x_{[n]}$, we have in this case

$$\hat e_n(x_{[k]}) = \frac{1}{n - k}\sum_{j=1}^{n-k}\left(x_{[k+j]} - x_{[k]}\right).$$

B.2 Gertensgarbe and Werner plot

The test proposed by Gertensgarbe and Werner (1989) to select a proper threshold is based on the determination of the starting point of the extreme value region. More precisely, given the series of differences $\Delta_i = x_{[i]} - x_{[i-1]}$, $i = 2, 3, \ldots, n$, of a sorted sample $x_{[1]} \le x_{[2]} \le \ldots \le x_{[n]}$, the starting point of the extreme region will be detected as a change point of the series $\{\Delta_i,\ i = 2, 3, \ldots, n\}$. The key idea is that it may reasonably be expected that
the behaviour of the differences corresponding to the extreme observations will differ from that of the non-extreme observations. This change of behaviour will appear as a change point of the series of differences.

[Figure 11 here: series U and $U_p$ against i.]
Figure 11: Gertensgarbe plot.

To identify the change point in a series, a sequential version of the Mann-Kendall test is applied. In this test, the normalized series $U_i$ is defined as

$$U_i = \frac{U_i^* - \frac{i(i-1)}{4}}{\sqrt{\frac{i(i-1)(2i+5)}{72}}},$$

where $U_i^* = \sum_{k=1}^{i} n_k$, and $n_k$ is the number of values in $\Delta_1, \ldots, \Delta_k$ less than $\Delta_k$. Another series, denoted by $U_p$, is calculated by applying the same procedure to the series of differences taken from the end to the start, $\Delta_n, \ldots, \Delta_1$, instead of from the start to the end. The intersection point of the two series, U and $U_p$, determines a probable change point, which will be significant if it exceeds a high normal percentile. The Gertensgarbe plot for the claims of the SOA database is shown in Figure 11. The threshold corresponding to the crossing point of the graph corresponds to the observation numbered i = 2,19, equal to $199,62. The test of whether that point is a change point of the series is significant, with a p-value less than .1.

Acknowledgements

Ana Cebrián thanks the Université catholique de Louvain, Louvain-la-Neuve, Belgium, for financial support through an FSR research grant. The authors wish to thank the anonymous referees and an associate editor for many useful comments that significantly improved the paper.
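Both graphical tools of this appendix are straightforward to compute. The sketch below is an illustration in Python with a toy sample (the paper used S-Plus): the empirical mean excess function of Section B.1, and the normalized progressive series $U_i$ of the sequential Mann-Kendall test of Section B.2:

```python
import math

def mean_excess(sample, u):
    """Empirical mean excess e_n(u): average of (x_i - u) over the x_i > u."""
    exceed = [x - u for x in sample if x > u]
    if not exceed:
        raise ValueError("no observations above u")
    return sum(exceed) / len(exceed)

def sequential_mann_kendall(series):
    """Normalized progressive series U_i of the sequential Mann-Kendall test:
    U*_i = sum_{k<=i} n_k, where n_k counts the earlier values less than the
    k-th one; U_i = (U*_i - i(i-1)/4) / sqrt(i(i-1)(2i+5)/72)."""
    us, cum = [], 0
    for i in range(1, len(series) + 1):
        k = i - 1                       # 0-based index of the i-th difference
        cum += sum(1 for j in range(k) if series[j] < series[k])
        if i < 2:
            us.append(0.0)              # U_1 is degenerate (zero variance)
            continue
        mean = i * (i - 1) / 4.0
        var = i * (i - 1) * (2 * i + 5) / 72.0
        us.append((cum - mean) / math.sqrt(var))
    return us

claims = [30.0, 45.0, 60.0, 90.0, 150.0, 400.0]   # toy sample, not SOA data
e50 = mean_excess(claims, 50.0)
diffs = [claims[i] - claims[i - 1] for i in range(1, len(claims))]
u_fwd = sequential_mann_kendall(diffs)
```

The backward series $U_p$ is obtained by calling `sequential_mann_kendall` on the reversed differences; the crossing of the two series then locates the candidate change point.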
References

[1] Balkema, A., and de Haan, L. (1974). Residual life time at great age. Annals of Probability 2.
[2] Beirlant, J., Teugels, J.L., and Vynckier, P. (1996). Practical Analysis of Extreme Values. Leuven University Press, Leuven.
[3] Embrechts, P., Klüppelberg, C., and Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer-Verlag, Berlin.
[4] Embrechts, P., Resnick, S.I., and Samorodnitsky, G. (1999). Extreme value theory as a risk management tool. North American Actuarial Journal 3.
[5] Embrechts, P., and Veraverbeke, N. (1982). Estimates for the probability of ruin with special emphasis on the possibility of large claims. Insurance: Mathematics & Economics 1.
[6] Gertensgarbe, F.W., and Werner, P.C. (1989). A method for the statistical definition of extreme-value regions and their application to meteorological time series. Z. Meteorol. 39.
[7] Grazier, K.L., and G'Sell Associates (1997). Group Medical Insurance Large Claims Database and Collection. SOA Monograph M-HB97-1.
[8] Hosking, J., and Wallis, J. (1987). Parameter and quantile estimation for the generalized Pareto distribution. Technometrics 29.
[9] Klugman, S.A., Panjer, H.H., and Willmot, G.E. (1998). Loss Models: From Data to Decisions. Wiley, New York.
[10] Klüppelberg, C., and Villasenor, A. (1993). Estimation of distribution tails: a semiparametric approach. Blätter der Deutschen Gesellschaft für Versicherungsmathematik.
[11] Kremer, E. (1990). On the probable maximum loss. Blätter der Deutschen Gesellschaft für Versicherungsmathematik.
[12] Kremer, E. (1994). More on the probable maximum loss. Blätter der Deutschen Gesellschaft für Versicherungsmathematik.
[13] Leadbetter, M.R., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag, New York.
[14] McNeil, A.J. (1997). Estimating the tails of loss severity distributions using extreme value theory. ASTIN Bulletin 27.
[15] McNeil, A.J., and Saladin, T.
(1997). The peaks over thresholds method for estimating high quantiles of loss distributions. Proceedings of the 28th International ASTIN Colloquium.
[16] Pickands, J. (1975). Statistical inference using extreme order statistics. The Annals of Statistics 3.
[17] Rootzén, H., and Tajvidi, N. (1996). Extreme value statistics and wind storm losses: a case study. Scandinavian Actuarial Journal.
[18] Wilkinson, M.E. (1982). Estimating probable maximum loss with order statistics. Proceedings of the Casualty Actuarial Society LXIX.
E3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
Exploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
Descriptive Statistics
Descriptive Statistics Suppose following data have been collected (heights of 99 five-year-old boys) 117.9 11.2 112.9 115.9 18. 14.6 17.1 117.9 111.8 16.3 111. 1.4 112.1 19.2 11. 15.4 99.4 11.1 13.3 16.9
Module 3: Correlation and Covariance
Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis
Week 1. Exploratory Data Analysis
Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam
AP Statistics Solutions to Packet 2
AP Statistics Solutions to Packet 2 The Normal Distributions Density Curves and the Normal Distribution Standard Normal Calculations HW #9 1, 2, 4, 6-8 2.1 DENSITY CURVES (a) Sketch a density curve that
6.1. The Exponential Function. Introduction. Prerequisites. Learning Outcomes. Learning Style
The Exponential Function 6.1 Introduction In this block we revisit the use of exponents. We consider how the expression a x is defined when a is a positive number and x is irrational. Previously we have
A revisit of the hierarchical insurance claims modeling
A revisit of the hierarchical insurance claims modeling Emiliano A. Valdez Michigan State University joint work with E.W. Frees* * University of Wisconsin Madison Statistical Society of Canada (SSC) 2014
Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)
Probability and Statistics Vocabulary List (Definitions for Middle School Teachers) B Bar graph a diagram representing the frequency distribution for nominal or discrete data. It consists of a sequence
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete
Hypothesis testing. c 2014, Jeffrey S. Simonoff 1
Hypothesis testing So far, we ve talked about inference from the point of estimation. We ve tried to answer questions like What is a good estimate for a typical value? or How much variability is there
Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I
Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting
Time series Forecasting using Holt-Winters Exponential Smoothing
Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
Means, standard deviations and. and standard errors
CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard
A Simple Formula for Operational Risk Capital: A Proposal Based on the Similarity of Loss Severity Distributions Observed among 18 Japanese Banks
A Simple Formula for Operational Risk Capital: A Proposal Based on the Similarity of Loss Severity Distributions Observed among 18 Japanese Banks May 2011 Tsuyoshi Nagafuji Takayuki
The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].
Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Statistical Analysis of Life Insurance Policy Termination and Survivorship
Statistical Analysis of Life Insurance Policy Termination and Survivorship Emiliano A. Valdez, PhD, FSA Michigan State University joint work with J. Vadiveloo and U. Dias Session ES82 (Statistics in Actuarial
Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods
Lecture 2 ESTIMATING THE SURVIVAL FUNCTION One-sample nonparametric methods There are commonly three methods for estimating a survivorship function S(t) = P (T > t) without resorting to parametric models:
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
LOGISTIC REGRESSION ANALYSIS
LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September
Contributions to extreme-value analysis
Contributions to extreme-value analysis Stéphane Girard INRIA Rhône-Alpes & LJK (team MISTIS). 655, avenue de l Europe, Montbonnot. 38334 Saint-Ismier Cedex, France [email protected] Abstract: This
Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
Statistical estimation using confidence intervals
0894PP_ch06 15/3/02 11:02 am Page 135 6 Statistical estimation using confidence intervals In Chapter 2, the concept of the central nature and variability of data and the methods by which these two phenomena
Estimation and attribution of changes in extreme weather and climate events
IPCC workshop on extreme weather and climate events, 11-13 June 2002, Beijing. Estimation and attribution of changes in extreme weather and climate events Dr. David B. Stephenson Department of Meteorology
Practical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
LOGNORMAL MODEL FOR STOCK PRICES
LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as
