How To Predict Insurance Claim Size With A Tree-Based Gradient Boosting Algorithm

A Boosted Tweedie Compound Poisson Model for Insurance Premium

Yi Yang (McGill University), Wei Qian (Rochester Institute of Technology) and Hui Zou (University of Minnesota; corresponding author, zoux019@umn.edu)

July 27, 2015

Abstract

The Tweedie GLM is a widely used method for predicting insurance premiums. However, the linear model assumption can be too rigid for many applications. As a better alternative, a boosted Tweedie model is considered in this paper. We propose a TDboost estimator of pure premiums and use a profile likelihood approach to estimate the index and dispersion parameters. Our method is capable of fitting a flexible nonlinear Tweedie model and capturing complex interactions among predictors. A simulation study confirms the excellent prediction performance of our method. As an application, we apply our method to auto insurance claim data and show that the new method is superior to the existing methods in the sense that it generates more accurate premium predictions, thus helping solve the adverse selection issue. We have implemented our method in a user-friendly R package that also includes a nice visualization tool for interpreting the fitted model.

1 Introduction

One of the most important problems in the insurance business is to set the premium for the customers (policyholders). By insurance contracts, a large number of policyholders' individual losses are transformed into a more predictable, aggregate loss of the insurer. In a

competitive market, it is advantageous for the insurer to charge a fair premium according to the expected loss of the policyholder. In personal car insurance, for instance, if an insurance company charges too much for old drivers and too little for young drivers, then the old drivers will switch to its competitors, and the remaining policies for the young drivers will be underpriced. This results in the adverse selection issue (Chiappori and Salanie, 2000; Dionne et al., 2001): the insurer loses profitable policies and is left with bad risks, resulting in economic loss both ways.

To appropriately set the premiums for the insurer's customers, one crucial task is to predict the size of actual (currently unforeseeable) claims. In this paper, we will focus on modeling claim loss, although other ingredients such as safety loadings, administrative costs, cost of capital, and profit are also important factors for setting the premium. One difficulty in modeling the claims is that the distribution is usually highly right-skewed, mixed with a point mass at zero. Such data cannot be transformed to normality by power transformation, and special treatment of zero claims is often required. As an example, Figure 1 shows the histogram of auto insurance claim data (Yip and Yau, 2005), in which there are 6,290 policy records with zero claims and 4,006 policy records with positive losses. The need for predictive models emerges from the fact that the expected loss is highly dependent on the characteristics of an individual policy such as age and annual income of the policyholder, population density of the policyholder's residential area, and age and model of the vehicle. Traditional methods used generalized linear models (GLM; Nelder and Wedderburn, 1972) for modeling the claim size (e.g. Renshaw, 1994; Haberman and Renshaw, 1996). However, all of these works performed their analyses on a subset of the policies, namely those with at least one claim.
Alternative approaches have employed Tobit models by treating zero outcomes as censored below some cutoff points (Van de Ven and van Praag, 1981; Showers and Shotick, 1994), but these approaches rely on a normality assumption of the latent response. Alternatively, Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) used GLMs with a Tweedie distributed (Jørgensen, 1987, 1997) outcome to simultaneously model frequency and severity of insurance claims. They assume Poisson arrival of claims and gamma distributed amounts for individual claims so that the size of the total claim amount follows a Tweedie compound Poisson distribution. Due to its ability to simultaneously model the zeros and the continuous positive outcomes, the Tweedie GLM

[Figure 1: Histogram of the auto insurance claim data as analyzed in Yip and Yau (2005). Horizontal axis: total insurance claim amount (in $1000) per policy year; vertical axis: frequency. There are 6,290 policy records with zero total claims per policy year, while the remaining 4,006 policy records have positive losses.]

has been a widely used method in actuarial studies (Mildenhall, 1999; Murphy et al., 2000; Peters et al., 2008; Quijano Xacur et al., 2011).

Despite the popularity of the Tweedie GLM, a major limitation is that the structure of the link function is restricted to a linear form, which can be too rigid for real applications. In auto insurance, for example, it is known that the risk does not monotonically decrease as age increases (Owsley et al., 1991; McCartt et al., 2003; Anstey et al., 2005). Although nonlinearity may be modeled by adding splines (Zhang, 2011), low-degree splines are often inadequate to capture the nonlinearity in the data, while high-degree splines often overfit and produce unstable estimates. Generalized additive models (GAM; Hastie and Tibshirani, 1990; Wood, 2006) overcome the restrictive linear assumption of GLMs, and can model the continuous variables by smooth functions estimated from data. The structure of the model, however, has to be determined a priori. That is, one has to specify the main effects and interaction effects to be used in the model. As a result, misspecification of nonignorable effects is likely to adversely affect prediction accuracy.

In this paper, we aim to model the claim size by a nonparametric Tweedie compound Poisson model, and we propose a gradient tree-boosting algorithm (TDboost henceforth) to fit this model. Gradient boosting (Freund and Schapire, 1997, 1996) is one of the most successful machine learning algorithms for nonparametric regression and classification. Boosting adaptively combines a large number of relatively simple prediction models called base learners into an ensemble learner to achieve high prediction performance. The seminal work on the boosting algorithm called AdaBoost (Freund and Schapire, 1997, 1996) was originally proposed for classification problems. Later Breiman (1998) and Breiman (1999) pointed out an important connection between the AdaBoost algorithm and a functional gradient descent algorithm. Friedman et al.
(2000), Friedman (2001) and Hastie et al. (2009) developed a statistical view of boosting and proposed gradient boosting methods for both classification and regression. There is a large body of literature on boosting. We refer interested readers to Bühlmann and Hothorn (2007) for a comprehensive review of boosting algorithms.

The TDboost model is motivated by the proven success of boosting in machine learning for classification and regression problems (Friedman, 2001, 2002; Hastie et al., 2009). Its advantages are threefold. First, the model structure of TDboost is learned from data and not predetermined, thereby avoiding an explicit model specification. Nonlinearities,

discontinuities, complex and higher order interactions are naturally incorporated into the model to reduce the potential modeling bias and to produce high predictive performance, which enables TDboost to serve as a benchmark model in scoring insurance policies, guiding pricing practice, and facilitating marketing efforts. Second, in contrast to other nonparametric statistical learning methods, TDboost can provide interpretable results, by means of the partial dependence plots and the relative importance of the predictors. Feature selection is performed as an integral part of the procedure. In addition, TDboost handles predictor and response variables of any type without the need for transformation, and it is highly robust to outliers. Missing values in the predictors are managed almost without loss of information (Elith et al., 2008). All these properties make TDboost an attractive tool for insurance premium modeling.

The remainder of this paper is organized as follows. We briefly review the gradient boosting algorithm and the Tweedie compound Poisson model in Section 2 and Section 3, respectively. We present the main methodology development with implementation details in Section 4. In Section 5, we use simulation to show the high predictive accuracy of TDboost. As an application, we apply TDboost to analyze auto insurance claim data in Section 6.

2 Gradient Boosting

To keep the paper self-contained, we briefly explain the general procedures for gradient boosting. Let $x = (x_1, \ldots, x_p)$ be the $p$-dimensional predictor variables and $y$ be the one-dimensional response variable. The goal is to estimate the optimal prediction function $\tilde{F}(\cdot)$ that maps $x$ to $y$ by minimizing the expected value of a loss function $\Psi(\cdot, \cdot)$ over the function class $\mathcal{F}$:

$$\tilde{F}(\cdot) = \arg\min_{F(\cdot) \in \mathcal{F}} E_{y,x}\left[\Psi(y, F(x))\right],$$

where $\Psi$ is assumed to be differentiable with respect to $F$. Given the observed data $\{y_i, x_i\}_{i=1}^{n}$, estimation of $\tilde{F}(\cdot)$ can be done by minimizing the empirical risk function

$$\min_{F(\cdot) \in \mathcal{F}} R_n(F) := \min_{F(\cdot) \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \Psi(y_i, F(x_i)). \tag{1}$$

For gradient boosting, each candidate function $F \in \mathcal{F}$ is assumed to be an ensemble of $M$ base learners

$$F(x) = F^{[0]} + \sum_{m=1}^{M} \beta^{[m]} h(x; \xi^{[m]}), \tag{2}$$

where $h(x; \xi^{[m]})$ usually belongs to a class of simple functions of $x$ called base learners (e.g., regression/decision trees) with the parameter $\xi^{[m]}$ ($m = 1, 2, \ldots, M$). $F^{[0]}$ is a constant scalar and $\beta^{[m]}$ is the expansion coefficient. Note that, differing from the usual structure of an additive model, there is no restriction on the number of predictors to be included in each $h(\cdot)$, and consequently, high-order interactions can be easily considered using this setting.

A forward stagewise algorithm is adopted to approximate the minimizer of (1), which builds up the components $\beta^{[m]} h(x; \xi^{[m]})$ ($m = 1, 2, \ldots, M$) sequentially through a gradient-descent-like approach. At each iteration stage $m$ ($m = 1, 2, \ldots$), suppose the current estimate for $\tilde{F}(\cdot)$ is $\hat{F}^{[m-1]}(\cdot)$. In principle, we want to update $\hat{F}^{[m-1]}(\cdot)$ to $\hat{F}^{[m]}(\cdot)$ along the negative gradient direction of $R_n(F)$. However, we only know the negative gradient at the training data points. Indeed, the negative gradient vector $(u_1^{[m]}, \ldots, u_n^{[m]})$ of $R_n(F)$ with respect to $F$ at $\{F(x_i) = \hat{F}^{[m-1]}(x_i)\}_{i=1}^{n}$ can be written as

$$u_i^{[m]} = -\left.\frac{\partial R_n(F)}{\partial F(x_i)}\right|_{F(x_i) = \hat{F}^{[m-1]}(x_i)},$$

but we cannot directly update $\hat{F}^{[m-1]}(\cdot)$ using $(u_1^{[m]}, \ldots, u_n^{[m]})$, as $u^{[m]}$ is only defined at $x_i$ for $i = 1, \ldots, n$ and cannot make predictions based on new data not represented in the training set. To solve this problem, gradient boosting fits the negative gradient vector $(u_1^{[m]}, \ldots, u_n^{[m]})$ (as the working response) to $(x_1, \ldots, x_n)$ (as the predictor) to find a base learner $h(x; \xi^{[m]})$. Thus, the fitted $h(x; \xi^{[m]})$ can be viewed as an approximation of the negative gradient, and it can be evaluated on the entire space of $x$.
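As a small illustration of the working response described above, the following Python sketch evaluates the negative gradient of the empirical risk at the training points by central finite differences and compares it with the analytic value. The squared-error loss and the helper names (`squared_error`, `empirical_risk`, `negative_gradient_fd`) are assumptions made for this example only; they are not taken from the paper.

```python
def squared_error(y, f):
    """Illustrative loss Psi(y, F) = (y - F)^2 / 2 (an assumption for this sketch)."""
    return 0.5 * (y - f) ** 2


def empirical_risk(y, f_values, loss=squared_error):
    """R_n(F) = (1/n) * sum_i Psi(y_i, F(x_i)), with F given by its values at the x_i."""
    return sum(loss(yi, fi) for yi, fi in zip(y, f_values)) / len(y)


def negative_gradient_fd(y, f_values, loss=squared_error, h=1e-6):
    """Central-difference estimate of u_i = -dR_n/dF(x_i) at the current fit."""
    u = []
    for i in range(len(y)):
        up = list(f_values)
        down = list(f_values)
        up[i] += h
        down[i] -= h
        slope = (empirical_risk(y, up, loss) - empirical_risk(y, down, loss)) / (2 * h)
        u.append(-slope)
    return u


y = [1.0, 0.0, 3.0, 2.0]
current_fit = [0.5, 0.5, 2.0, 2.0]  # values of the current estimate at the training points
u = negative_gradient_fd(y, current_fit)

# For squared error, the working response is the scaled residual (y_i - F(x_i)) / n.
analytic = [(yi - fi) / len(y) for yi, fi in zip(y, current_fit)]
```

The vector `u` is only defined at the four training points, which is why a base learner is then fitted to it before the estimate can be updated for new values of x.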
The expansion coefficient $\beta^{[m]}$ can then be determined by a line search

$$\beta^{[m]} = \arg\min_{\beta} \sum_{i=1}^{n} \Psi\left(y_i, \hat{F}^{[m-1]}(x_i) + \beta h(x_i; \xi^{[m]})\right), \tag{3}$$

which is reminiscent of the steepest descent method. Consequently, the estimation of

$\tilde{F}(x)$ for the next stage is

$$\hat{F}^{[m]}(x) := \hat{F}^{[m-1]}(x) + \nu \beta^{[m]} h(x; \xi^{[m]}), \tag{4}$$

where $0 < \nu \le 1$ is the shrinkage factor (Friedman, 2001) that controls the update step size. A small $\nu$ imposes more shrinkage, while $\nu = 1$ gives complete negative gradient steps. Friedman (2001) found that the shrinkage factor reduces overfitting and improves the predictive accuracy.

3 Compound Poisson Distribution and Tweedie Model

In this section, we briefly introduce the compound Poisson distribution and the Tweedie model, which are necessary for our methodology development. Let $N$ be a Poisson random variable denoted by $\mathrm{Pois}(\lambda)$, and let the $\tilde{Z}_d$'s be i.i.d. gamma random variables denoted by $\mathrm{Gamma}(\alpha, \gamma)$ with mean $\alpha\gamma$ and variance $\alpha\gamma^2$. Assume $N$ is independent of the $\tilde{Z}_d$'s. Define a random variable $Z$ by

$$Z = \begin{cases} 0 & \text{if } N = 0, \\ \tilde{Z}_1 + \tilde{Z}_2 + \cdots + \tilde{Z}_N & \text{if } N = 1, 2, \ldots \end{cases} \tag{5}$$

Thus $Z$ is the Poisson sum of independent gamma random variables. The resulting distribution of $Z$ is referred to as the compound Poisson distribution (Feller, 1968; Bar-Lev and Stramer, 1987; Jørgensen and de Souza, 1994; Smyth and Jørgensen, 2002), which is known to be closely connected to exponential dispersion models (EDM) (Jørgensen, 1987, 1997). Note that the distribution of $Z$ has a probability mass at zero: $\Pr(Z = 0) = \exp(-\lambda)$. Then, since $Z$ conditional on $N = j$ is $\mathrm{Gamma}(j\alpha, \gamma)$, the distribution function of $Z$ can be written as

$$f_Z(z \mid \lambda, \alpha, \gamma) = \Pr(N = 0)\, d_0(z) + \sum_{j=1}^{\infty} \Pr(N = j)\, f_{Z \mid N=j}(z) = \exp(-\lambda)\, d_0(z) + \sum_{j=1}^{\infty} \frac{\lambda^j e^{-\lambda}}{j!} \frac{z^{j\alpha - 1} e^{-z/\gamma}}{\gamma^{j\alpha}\, \Gamma(j\alpha)},$$

where $d_0$ is the Dirac delta function at zero and $f_{Z \mid N=j}$ is the conditional density of $Z$ given $N = j$. This gives the cumulant generating function of $Z$:

$$\log M_Z(t) = \lambda\left\{(1 - \gamma t)^{-\alpha} - 1\right\}. \tag{6}$$

Smyth (1996) pointed out that the compound Poisson distribution belongs to a special class of EDMs known as Tweedie models (Tweedie, 1984). The EDMs are defined by the form

$$f_Z(z \mid \theta, \phi) = a(z, \phi) \exp\left\{\frac{z\theta - \kappa(\theta)}{\phi}\right\}, \tag{7}$$

where $a(\cdot)$ is a normalizing function, $\kappa(\cdot)$ is called the cumulant function, and both $a(\cdot)$ and $\kappa(\cdot)$ are known. The parameter $\theta$ is in $\mathbb{R}$ and the dispersion parameter $\phi$ is in $\mathbb{R}^{+}$. EDMs have the property that the mean is $E(Z) \equiv \mu = \dot{\kappa}(\theta)$ and the variance is $\mathrm{Var}(Z) = \phi\, \ddot{\kappa}(\theta)$, where $\dot{\kappa}(\theta)$ and $\ddot{\kappa}(\theta)$ are the first and second derivatives of $\kappa(\theta)$, respectively. The cumulant generating function of EDMs is

$$\log M_Z(t) = \frac{1}{\phi}\left\{\kappa(\theta + t\phi) - \kappa(\theta)\right\}. \tag{8}$$

Tweedie models are special cases of the EDMs characterized by the power mean-variance relationship $\mathrm{Var}(Z) = \phi\mu^{\rho}$ for some index parameter $\rho$. Such a mean-variance relation gives

$$\theta = \begin{cases} \dfrac{\mu^{1-\rho}}{1-\rho}, & \rho \ne 1, \\ \log\mu, & \rho = 1, \end{cases} \qquad \kappa(\theta) = \begin{cases} \dfrac{\mu^{2-\rho}}{2-\rho}, & \rho \ne 2, \\ \log\mu, & \rho = 2. \end{cases} \tag{9}$$

One can show that the compound Poisson distribution belongs to the class of Tweedie models. Indeed, if we replace the parameters $(\lambda, \alpha, \gamma)$ in the cumulant generating function (6) by

$$\lambda = \frac{1}{\phi} \frac{\mu^{2-\rho}}{2-\rho}, \qquad \alpha = \frac{2-\rho}{\rho-1}, \qquad \gamma = \phi(\rho - 1)\mu^{\rho-1}, \tag{10}$$

the cumulant function of the compound Poisson model has the form of a Tweedie model with $1 < \rho < 2$ and $\mu > 0$. As a result, for the rest of this paper, we only consider the model (5), and simply refer to (5) as the Tweedie model (or Tweedie compound Poisson model), denoted by $\mathrm{Tw}(\mu, \phi, \rho)$, where $1 < \rho < 2$ and $\mu > 0$.
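The reparameterization (10) can be checked numerically. The Python sketch below maps $(\mu, \phi, \rho)$ to $(\lambda, \alpha, \gamma)$, confirms that the compound Poisson mean $\lambda\alpha\gamma$ and variance $\lambda\alpha\gamma^2(1+\alpha)$ reproduce $\mu$ and $\phi\mu^{\rho}$, and draws a few values from model (5). The function names and the Poisson sampler are choices made for this illustration and are not part of the paper or of any package it mentions.

```python
import math
import random


def tweedie_to_compound_poisson(mu, phi, rho):
    """Map Tweedie (mu, phi, rho), 1 < rho < 2, to (lambda, alpha, gamma) as in (10)."""
    if not (1.0 < rho < 2.0) or mu <= 0 or phi <= 0:
        raise ValueError("requires 1 < rho < 2, mu > 0 and phi > 0")
    lam = mu ** (2 - rho) / (phi * (2 - rho))
    alpha = (2 - rho) / (rho - 1)
    gamma = phi * (rho - 1) * mu ** (rho - 1)
    return lam, alpha, gamma


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_tweedie(mu, phi, rho, rng):
    """Draw one value from the compound Poisson model (5)."""
    lam, alpha, gamma = tweedie_to_compound_poisson(mu, phi, rho)
    n_claims = sample_poisson(lam, rng)
    if n_claims == 0:
        return 0.0
    # A sum of n i.i.d. Gamma(alpha, gamma) variables is Gamma(n * alpha, gamma).
    return rng.gammavariate(n_claims * alpha, gamma)


lam, alpha, gamma = tweedie_to_compound_poisson(mu=2.0, phi=0.5, rho=1.5)
mean = lam * alpha * gamma                          # should equal mu
variance = lam * alpha * gamma ** 2 * (1 + alpha)   # should equal phi * mu ** rho
prob_zero = math.exp(-lam)                          # point mass at zero

rng = random.Random(0)
draws = [sample_tweedie(2.0, 0.5, 1.5, rng) for _ in range(5)]
```

The variance formula uses the standard compound Poisson identity $\mathrm{Var}(Z) = \lambda E(\tilde{Z}_d^2)$ together with $E(\tilde{Z}_d^2) = \alpha\gamma^2 + \alpha^2\gamma^2$.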

It is straightforward to show that the log-likelihood of the Tweedie model is

$$\log f_Z(z \mid \mu, \phi, \rho) = \frac{1}{\phi}\left(z\frac{\mu^{1-\rho}}{1-\rho} - \frac{\mu^{2-\rho}}{2-\rho}\right) + \log a(z, \phi, \rho), \tag{11}$$

where the normalizing function $a(\cdot)$ can be written as

$$a(z, \phi, \rho) = \begin{cases} \dfrac{1}{z}\displaystyle\sum_{t=1}^{\infty} W_t(z, \phi, \rho) = \dfrac{1}{z}\displaystyle\sum_{t=1}^{\infty} \dfrac{z^{t\alpha}}{(\rho-1)^{t\alpha}\, \phi^{t(1+\alpha)}\, (2-\rho)^{t}\, t!\, \Gamma(t\alpha)} & \text{for } z > 0, \\ 1 & \text{for } z = 0, \end{cases}$$

with $\alpha = (2-\rho)/(\rho-1)$, and $\sum_{t=1}^{\infty} W_t$ is an example of Wright's generalized Bessel function (Tweedie, 1984).

One of the desirable properties of Tweedie models is that they are the only EDMs that are scale invariant (Jørgensen, 1997, Section 4.1.1): if $Z$ is a Tweedie variable with mean $\mu$ and dispersion $\phi$, then $cZ$ follows the same distribution with mean $c\mu$ and dispersion $c^{2-\rho}\phi$. This property makes Tweedie distributions a good choice for modeling data with an arbitrary monetary unit.

4 Our proposal

In this section, we propose to integrate the Tweedie model into the tree-based gradient boosting algorithm to predict insurance claim size. Specifically, our discussion focuses on modeling personal car insurance as an illustrating example (see Section 6 for a real data analysis), since our modeling strategy is easily extended to other lines of non-life insurance business.

Given an auto insurance policy $i$, let $N_i$ be the number of claims (known as the claim frequency) and $\tilde{Z}_{d_i}$ be the size of each claim observed for $d_i = 1, \ldots, N_i$. Let $w_i$ be the policy duration, that is, the length of time that the policy remains in force. Then $Z_i = \sum_{d_i=1}^{N_i} \tilde{Z}_{d_i}$ is the total claim amount. In the following, we are interested in modeling the ratio between the total claim and the duration, $Y_i = Z_i / w_i$, a key quantity known as the pure premium (Ohlsson and Johansson, 2010).

Following the settings of the compound Poisson model, we assume $N_i$ is Poisson distributed, and its mean $\lambda_i w_i$ has a multiplicative relation with the duration $w_i$, where $\lambda_i$ is a policy-specific parameter representing the expected claim frequency under unit duration.
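To make the log-likelihood (11) and its normalizing series concrete, the following Python sketch evaluates the series in log space and cross-checks the resulting density against the Poisson mixture of gamma densities from Section 3. The truncation rule (stop once a term falls 40 log-units below the largest term) and all function names are assumptions of this sketch; it is not the algorithm of Dunn and Smyth, and it is only intended for moderate values of z.

```python
import math


def log_normalizer(z, phi, rho, max_terms=5000):
    """log a(z, phi, rho) from the series (1/z) * sum_t W_t, summed in log space."""
    if z == 0:
        return 0.0
    alpha = (2 - rho) / (rho - 1)
    log_terms, peak = [], -math.inf
    for t in range(1, max_terms + 1):
        log_w = (t * alpha * (math.log(z) - math.log(rho - 1))
                 - t * (1 + alpha) * math.log(phi)
                 - t * math.log(2 - rho)
                 - math.lgamma(t + 1) - math.lgamma(t * alpha))
        log_terms.append(log_w)
        peak = max(peak, log_w)
        if log_w < peak - 40:  # log W_t is concave in t, so later terms are smaller still
            break
    total = sum(math.exp(lw - peak) for lw in log_terms)
    return peak + math.log(total) - math.log(z)


def tweedie_log_density(z, mu, phi, rho):
    """Log-likelihood (11); for z = 0 this is the log of the point mass at zero."""
    kernel = z * mu ** (1 - rho) / (1 - rho) - mu ** (2 - rho) / (2 - rho)
    return kernel / phi + log_normalizer(z, phi, rho)


def mixture_density(z, mu, phi, rho, max_claims=500):
    """Density for z > 0 written as a Poisson mixture of gamma densities."""
    lam = mu ** (2 - rho) / (phi * (2 - rho))
    alpha = (2 - rho) / (rho - 1)
    gamma = phi * (rho - 1) * mu ** (rho - 1)
    total = 0.0
    for j in range(1, max_claims + 1):
        log_term = (j * math.log(lam) - lam - math.lgamma(j + 1)
                    + (j * alpha - 1) * math.log(z) - z / gamma
                    - j * alpha * math.log(gamma) - math.lgamma(j * alpha))
        total += math.exp(log_term)
    return total


series_value = math.exp(tweedie_log_density(1.3, mu=2.0, phi=0.5, rho=1.5))
mixture_value = mixture_density(1.3, mu=2.0, phi=0.5, rho=1.5)
prob_zero = math.exp(tweedie_log_density(0.0, mu=2.0, phi=0.5, rho=1.5))
```

Both routes should agree up to rounding, because substituting (10) into the mixture turns each mixture term into the corresponding $W_t$ times the exponential kernel of (11).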

Conditional on $N_i$, assume the $\tilde{Z}_{d_i}$'s ($d_i = 1, \ldots, N_i$) are i.i.d. $\mathrm{Gamma}(\alpha, \gamma_i)$, where $\gamma_i$ is a policy-specific parameter that determines claim severity, and $\alpha$ is a constant. Furthermore, we assume that under unit duration (i.e., $w_i = 1$), the mean-variance relation of a policy satisfies

$$\mathrm{Var}(Y_i^{*}) = \phi\left[E(Y_i^{*})\right]^{\rho} \tag{12}$$

for all policies, where $Y_i^{*}$ is the pure premium under unit duration, $\phi$ is a constant, and $\rho = (\alpha + 2)/(\alpha + 1)$. Note that

$$\mu_i^{*} := E(Y_i^{*}) = E\left(E(Y_i^{*} \mid N_i)\right) = \lambda_i \alpha \gamma_i,$$
$$\mathrm{Var}(Y_i^{*}) = E\left(\mathrm{Var}(Y_i^{*} \mid N_i)\right) + \mathrm{Var}\left(E(Y_i^{*} \mid N_i)\right) = \lambda_i \alpha \gamma_i^2 + \lambda_i \alpha^2 \gamma_i^2.$$

Similarly, under duration $w_i$,

$$\mu_i := E(Y_i) = \frac{1}{w_i} E(Z_i) = \lambda_i \alpha \gamma_i,$$
$$\mathrm{Var}(Y_i) = \frac{1}{w_i^2} \mathrm{Var}(Z_i) = \left(\lambda_i \alpha \gamma_i^2 + \lambda_i \alpha^2 \gamma_i^2\right) / w_i.$$

As a result, we obtain the mean-variance relation for the pure premium $Y_i$:

$$\mathrm{Var}(Y_i) = \frac{1}{w_i} \mathrm{Var}(Y_i^{*}) = \frac{\phi}{w_i} (\mu_i^{*})^{\rho} = \frac{\phi}{w_i} \mu_i^{\rho}, \tag{13}$$

where the second equality follows from (12). Consequently, the scale invariance property of the Tweedie distribution together with (13) implies that $Y_i \sim \mathrm{Tw}(\mu_i, \phi / w_i, \rho)$.

Under the aforementioned settings, consider a portfolio of policies $\{(y_i, x_i, w_i)\}_{i=1}^{n}$ from $n$ independent insurance contracts, where for the $i$th contract, $y_i$ is the policy pure premium, $x_i$ is a vector of explanatory variables that characterize the policyholder and the risk being insured (e.g. house, vehicle), and $w_i$ is the duration. We then assume that the expected

pure premium $\mu_i$ is determined by a predictor function $F: \mathbb{R}^p \to \mathbb{R}$ of $x_i$:

$$\log\{\mu_i\} = \log\{E(Y_i \mid x_i)\} = F(x_i). \tag{14}$$

In this paper, we do not impose a linear or other parametric form restriction on $F(\cdot)$. Given the flexibility of $F(\cdot)$, we call this setting the boosted Tweedie model (as opposed to the Tweedie GLM). Given $\{(y_i, x_i, w_i)\}_{i=1}^{n}$, the log-likelihood function can be written as

$$\ell\left(F(\cdot), \phi, \rho \mid \{y_i, x_i, w_i\}_{i=1}^{n}\right) = \sum_{i=1}^{n} \log f_Y(y_i \mid \mu_i, \phi / w_i, \rho) = \sum_{i=1}^{n} \left[\frac{w_i}{\phi}\left(y_i \frac{\mu_i^{1-\rho}}{1-\rho} - \frac{\mu_i^{2-\rho}}{2-\rho}\right) + \log a(y_i, \phi / w_i, \rho)\right]. \tag{15}$$

4.1 Estimating $F(\cdot)$ via TDboost

We estimate the predictor function $F(\cdot)$ by integrating the boosted Tweedie model into the tree-based gradient boosting algorithm. To develop the idea, we assume that $\phi$ and $\rho$ are given for the time being. The joint estimation of $F(\cdot)$, $\phi$ and $\rho$ will be studied in Section 4.2.

Given $\rho$ and $\phi$, we replace the general objective function in (1) by the negative log-likelihood derived in (15), and target the minimizer function $F^{*}(\cdot)$ over a class $\mathcal{F}$ of base learner functions in the form of (2). That is, we intend to estimate

$$F^{*}(x) = \arg\min_{F \in \mathcal{F}} \left\{-\ell\left(F(\cdot), \phi, \rho \mid \{y_i, x_i, w_i\}_{i=1}^{n}\right)\right\} = \arg\min_{F \in \mathcal{F}} \sum_{i=1}^{n} \Psi(y_i, F(x_i) \mid \rho), \tag{16}$$

where

$$\Psi(y_i, F(x_i) \mid \rho) = w_i \left\{-\frac{y_i \exp[(1-\rho)F(x_i)]}{1-\rho} + \frac{\exp[(2-\rho)F(x_i)]}{2-\rho}\right\}.$$

Note that in contrast to (16), the function class targeted by the Tweedie GLM (Smyth, 1996) is restricted to a collection of linear functions of $x$.

We propose to apply the forward stagewise algorithm described in Section 2 for solving (16). The initial estimate of $F^{*}(\cdot)$ is chosen as a constant function that minimizes the

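negative log-likelihood:

$$\hat{F}^{[0]} = \arg\min_{\eta} \sum_{i=1}^{n} \Psi(y_i, \eta \mid \rho) = \log\left(\frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}\right).$$

This corresponds to the best estimate of $F$ without any covariates. Let $\hat{F}^{[m-1]}$ be the current estimate before the $m$th iteration. At the $m$th step, we fit a base learner $h(x; \xi^{[m]})$ via

$$\hat{\xi}^{[m]} = \arg\min_{\xi^{[m]}} \sum_{i=1}^{n} \left[u_i^{[m]} - h(x_i; \xi^{[m]})\right]^2, \tag{17}$$

where $(u_1^{[m]}, \ldots, u_n^{[m]})$ is the current negative gradient of $\Psi(\cdot \mid \rho)$, i.e.,

$$u_i^{[m]} = -\left.\frac{\partial \Psi(y_i, F(x_i) \mid \rho)}{\partial F(x_i)}\right|_{F(x_i) = \hat{F}^{[m-1]}(x_i)} \tag{18}$$
$$= w_i \left\{y_i \exp[(1-\rho)\hat{F}^{[m-1]}(x_i)] - \exp[(2-\rho)\hat{F}^{[m-1]}(x_i)]\right\}, \tag{19}$$

and use an $L$-terminal node regression tree

$$h(x; \xi^{[m]}) = \sum_{l=1}^{L} u_l^{[m]} I(x \in R_l^{[m]}) \tag{20}$$

with parameters $\xi^{[m]} = \{R_l^{[m]}, u_l^{[m]}\}_{l=1}^{L}$ as the base learner. To find $R_l^{[m]}$ and $u_l^{[m]}$, we use a fast top-down "best-fit" algorithm with a least squares splitting criterion (Friedman et al., 2000) to find the splitting variables and corresponding split locations that determine the fitted terminal regions $\{\hat{R}_l^{[m]}\}_{l=1}^{L}$. Note that estimating the $R_l^{[m]}$ entails estimating the $u_l^{[m]}$ as the mean falling in each region:

$$\bar{u}_l^{[m]} = \mathrm{mean}_{i:\, x_i \in \hat{R}_l^{[m]}}\left(u_i^{[m]}\right), \qquad l = 1, \ldots, L.$$

Because the splitting criterion is least squares, the fitted regions are the same whether the tree is fitted to $u^{[m]}$ or to $-u^{[m]}$. Once the base learner $h(x; \xi^{[m]})$ has been estimated, the optimal value of the expansion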

coefficient $\beta^{[m]}$ is determined by a line search

$$\beta^{[m]} = \arg\min_{\beta} \sum_{i=1}^{n} \Psi\left(y_i, \hat{F}^{[m-1]}(x_i) + \beta h(x_i; \hat{\xi}^{[m]}) \mid \rho\right) \tag{21}$$
$$= \arg\min_{\beta} \sum_{i=1}^{n} \Psi\left(y_i, \hat{F}^{[m-1]}(x_i) + \beta \sum_{l=1}^{L} \bar{u}_l^{[m]} I(x_i \in \hat{R}_l^{[m]}) \,\Big|\, \rho\right).$$

The regression tree (20) predicts a constant value $\bar{u}_l^{[m]}$ within each region $\hat{R}_l^{[m]}$, so we can solve (21) by a separate line search performed within each respective region $\hat{R}_l^{[m]}$. The problem (21) reduces to finding a best constant $\eta_l^{[m]}$ to improve the current estimate in each region $\hat{R}_l^{[m]}$ based on the following criterion:

$$\hat{\eta}_l^{[m]} = \arg\min_{\eta} \sum_{i:\, x_i \in \hat{R}_l^{[m]}} \Psi\left(y_i, \hat{F}^{[m-1]}(x_i) + \eta \mid \rho\right), \qquad l = 1, \ldots, L, \tag{22}$$

where the solution is given by

$$\hat{\eta}_l^{[m]} = \log\left\{\frac{\sum_{i:\, x_i \in \hat{R}_l^{[m]}} w_i y_i \exp[(1-\rho)\hat{F}^{[m-1]}(x_i)]}{\sum_{i:\, x_i \in \hat{R}_l^{[m]}} w_i \exp[(2-\rho)\hat{F}^{[m-1]}(x_i)]}\right\}, \qquad l = 1, \ldots, L. \tag{23}$$

Having found the parameters $\{\hat{\eta}_l^{[m]}\}_{l=1}^{L}$, we then update the current estimate $\hat{F}^{[m-1]}(x)$ in each corresponding region:

$$\hat{F}^{[m]}(x) = \hat{F}^{[m-1]}(x) + \nu \hat{\eta}_l^{[m]} I(x \in \hat{R}_l^{[m]}), \qquad l = 1, \ldots, L, \tag{24}$$

where $0 < \nu \le 1$ is the shrinkage factor. Following Friedman (2001), we set $\nu =$ [value missing] in our implementation. More discussion of the choice of tuning parameters is in Section 4.4.

In summary, the complete TDboost algorithm is shown in Algorithm 1. The boosting step is repeated $M$ times and we report $\hat{F}^{[M]}(x)$ as the final estimate.

4.2 Estimating $(\rho, \phi)$ via profile likelihood

Following Smyth (1996) and Dunn and Smyth (2005), we use the profile likelihood to estimate the dispersion $\phi$ and the index parameter $\rho$, which jointly determine the mean-variance

Algorithm 1 TDboost

1. Initialize $\hat{F}^{[0]}$:
   $$\hat{F}^{[0]} = \log\left(\frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}\right).$$
2. For $m = 1, \ldots, M$ repeatedly do steps 2(a) to 2(d):
   2(a) Compute the negative gradient $\{u_i^{[m]}\}_{i=1}^{n}$:
   $$u_i^{[m]} = w_i \left\{y_i \exp[(1-\rho)\hat{F}^{[m-1]}(x_i)] - \exp[(2-\rho)\hat{F}^{[m-1]}(x_i)]\right\}, \qquad i = 1, \ldots, n.$$
   2(b) Fit the negative gradient vector $\{u_i^{[m]}\}_{i=1}^{n}$ to $x_1, \ldots, x_n$ by an $L$-terminal node regression tree, giving us the partitions $\{\hat{R}_l^{[m]}\}_{l=1}^{L}$.
   2(c) Compute the optimal terminal node predictions $\hat{\eta}_l^{[m]}$ for each region $\hat{R}_l^{[m]}$, $l = 1, 2, \ldots, L$:
   $$\hat{\eta}_l^{[m]} = \log\left\{\frac{\sum_{i:\, x_i \in \hat{R}_l^{[m]}} w_i y_i \exp[(1-\rho)\hat{F}^{[m-1]}(x_i)]}{\sum_{i:\, x_i \in \hat{R}_l^{[m]}} w_i \exp[(2-\rho)\hat{F}^{[m-1]}(x_i)]}\right\}.$$
   2(d) Update $\hat{F}^{[m]}(x)$ for each region $\hat{R}_l^{[m]}$, $l = 1, 2, \ldots, L$:
   $$\hat{F}^{[m]}(x) = \hat{F}^{[m-1]}(x) + \nu \hat{\eta}_l^{[m]} I(x \in \hat{R}_l^{[m]}).$$
3. Report $\hat{F}^{[M]}(x)$ as the final estimate.
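The steps of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions, not the TDboost package: it handles a single scalar predictor, uses two-node trees (stumps, L = 2), imposes a minimum leaf size, and guards the logarithm in step 2(c) against nodes whose responses are all zero. The function names, the default values of `n_steps`, `nu` and `min_leaf`, and the toy data are all chosen for this example only.

```python
import math


def tweedie_loss(y, f, w, rho):
    """Psi(y, F | rho): negative Tweedie log-likelihood up to terms free of F."""
    return w * (-y * math.exp((1 - rho) * f) / (1 - rho)
                + math.exp((2 - rho) * f) / (2 - rho))


def best_split(x, u, min_leaf):
    """Least-squares split point for the working response u on a scalar predictor."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n, total = len(x), sum(u)
    best_gain, threshold, left_sum = -math.inf, None, 0.0
    for k in range(n - 1):
        left_sum += u[order[k]]
        n_left = k + 1
        if n_left < min_leaf or n - n_left < min_leaf:
            continue
        if x[order[k]] == x[order[k + 1]]:
            continue
        gain = left_sum ** 2 / n_left + (total - left_sum) ** 2 / (n - n_left)
        if gain > best_gain:
            best_gain = gain
            threshold = 0.5 * (x[order[k]] + x[order[k + 1]])
    return threshold


def node_value(idx, y, w, f, rho):
    """Closed-form terminal node prediction of step 2(c)."""
    num = sum(w[i] * y[i] * math.exp((1 - rho) * f[i]) for i in idx)
    den = sum(w[i] * math.exp((2 - rho) * f[i]) for i in idx)
    return math.log(max(num, 1e-12) / den)  # guard (assumption): avoids log(0)


def tdboost_fit(x, y, w, rho=1.5, n_steps=100, nu=0.1, min_leaf=20):
    f0 = math.log(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))    # step 1
    f = [f0] * len(x)
    stumps = []
    for _ in range(n_steps):                                        # step 2
        u = [wi * (yi * math.exp((1 - rho) * fi) - math.exp((2 - rho) * fi))
             for yi, wi, fi in zip(y, w, f)]                        # 2(a)
        thr = best_split(x, u, min_leaf)                            # 2(b)
        if thr is None:
            break
        left = [i for i in range(len(x)) if x[i] <= thr]
        right = [i for i in range(len(x)) if x[i] > thr]
        eta_left = node_value(left, y, w, f, rho)                   # 2(c)
        eta_right = node_value(right, y, w, f, rho)
        for i in left:                                              # 2(d)
            f[i] += nu * eta_left
        for i in right:
            f[i] += nu * eta_right
        stumps.append((thr, nu * eta_left, nu * eta_right))
    return f0, stumps


def tdboost_predict(model, x_new):
    """Predicted pure premium exp(F(x)) for a scalar x."""
    f0, stumps = model
    return math.exp(f0 + sum(lo if x_new <= thr else hi for thr, lo, hi in stumps))


# Toy data: one third of the policies have zero loss; the mean jumps at x = 0.5.
n = 300
x = [(i + 0.5) / n for i in range(n)]
y = [0.0 if i % 3 == 0 else (1.0 if x[i] < 0.5 else 3.0) for i in range(n)]
w = [1.0] * n
model = tdboost_fit(x, y, w)
fitted = [tdboost_predict(model, xi) for xi in x]
```

Because each node update moves a fraction `nu` of the way toward the minimizer of a convex one-dimensional problem, the training loss cannot increase from one step to the next in this sketch.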

relation $\mathrm{Var}(Y_i) = \phi\mu_i^{\rho} / w_i$ of the pure premium. We exploit the fact that in Tweedie models the estimation of $\mu$ depends only on $\rho$: given a fixed $\rho$, the mean estimate $\mu^{*}(\rho)$ can be solved in (16) without knowing $\phi$. Then, conditional on this $\rho$ and the corresponding $\mu^{*}(\rho)$, we maximize the log-likelihood function with respect to $\phi$ by

$$\phi^{*}(\rho) = \arg\max_{\phi} \left\{\ell(\mu^{*}(\rho), \phi, \rho)\right\}, \tag{25}$$

which is a univariate optimization problem that can be solved using a combination of golden section search and successive parabolic interpolation (Brent, 2013). In this way, we determine the corresponding $(\mu^{*}(\rho), \phi^{*}(\rho))$ for each fixed $\rho$. Then we acquire the estimate of $\rho$ by maximizing the profile likelihood with respect to 50 equally spaced values $\{\rho_1, \ldots, \rho_{50}\}$ on $(1, 2)$, the range of the index parameter assumed throughout:

$$\rho^{*} = \arg\max_{\rho \in \{\rho_1, \ldots, \rho_{50}\}} \left\{\ell(\mu^{*}(\rho), \phi^{*}(\rho), \rho)\right\}. \tag{26}$$

Finally, we apply $\rho^{*}$ in (16) and (25) to obtain the corresponding estimates $\mu^{*}(\rho^{*})$ and $\phi^{*}(\rho^{*})$.

Some computational issues must be taken care of when evaluating the log-likelihood functions in (25) and (26): since in general there are no closed forms for Tweedie densities, in likelihood evaluation one must deal with an infinite summation in the normalizing function $a(y, \phi, \rho) = \frac{1}{y}\sum_{t=1}^{\infty} W_t$. For numerical evaluation of Tweedie densities, Dunn and Smyth (2005) proposed a series expansion approach, which sums an infinite series arising from a Taylor expansion of the characteristic function. Alternatively, Dunn and Smyth (2008) developed a Fourier inversion approach, which consists of an inversion of the characteristic function based on numerical integration methods for oscillating functions. These two numerical methods turn out to be complementary since each has advantages in certain situations: when only considering the case $1 < \rho < 2$, the series approach performs very well for small $y$ but gradually loses computational efficiency as $y$ increases, whereas the inversion approach performs very well for large $y$ but gradually fails to provide accurate results as $y$ decreases.
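The growing cost of the series approach for large y can be illustrated numerically. The Python sketch below counts how many terms $W_t$ of the normalizing series lie within a fixed number of log-units of the largest term, for several values of y. The cut-off of 37 log-units and the function name are assumptions made for this illustration; the sketch does not reproduce the summation rules used in the tweedie R package.

```python
import math


def series_terms_needed(y, phi, rho, drop=37.0, max_terms=100000):
    """Count the series terms W_t within `drop` log-units of the largest term."""
    alpha = (2 - rho) / (rho - 1)
    log_terms, peak = [], -math.inf
    for t in range(1, max_terms + 1):
        log_w = (t * alpha * (math.log(y) - math.log(rho - 1))
                 - t * (1 + alpha) * math.log(phi)
                 - t * math.log(2 - rho)
                 - math.lgamma(t + 1) - math.lgamma(t * alpha))
        log_terms.append(log_w)
        peak = max(peak, log_w)
        if log_w < peak - drop:  # past the peak; log W_t is concave in t
            break
    return sum(1 for lw in log_terms if lw >= peak - drop)


# Number of non-negligible terms for increasing claim sizes, with phi = 0.5, rho = 1.5.
counts = {y: series_terms_needed(y, phi=0.5, rho=1.5) for y in (1.0, 100.0, 1000.0)}
```

In this setting the number of non-negligible terms grows with y, which is consistent with the loss of efficiency of the series approach described above.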
Hence the inversion approach is preferred for large $y$ and the series approach for small $y$. Dunn and Smyth (2008) provided a simple guideline to choose between the two methods. In this paper we use their R package tweedie (available at r-project.org/web/packages/tweedie/index.html) for evaluating Tweedie densities in our profile likelihood computation. For further details regarding their algorithms, the reader

may refer to Dunn and Smyth (2005, 2008).

4.3 Model interpretation

Compared to other nonparametric statistical learning methods such as neural networks and kernel machines, our new estimator provides interpretable results. In this section, we discuss some ways to interpret the model after fitting the boosted Tweedie model.

4.3.1 Marginal effects of predictors

The main effects and interaction effects of the variables in the boosted Tweedie model can be extracted easily. In our estimate we can control the order of interactions by choosing the tree size $L$ (the number of terminal nodes) and the number $p$ of predictors. A tree with $L$ terminal nodes produces a function approximation of $p$ predictors with interaction order of at most $\min(L - 1, p)$. For example, a stump ($L = 2$) produces an additive TDboost model with only the main effects of the predictors, since it is a function based on a single splitting variable in each tree. Setting $L = 3$ allows both main effects and second order interactions.

Following Friedman (2001) we use the so-called partial dependence plots to visualize the main effects and interaction effects. Given the training data $\{y_i, x_i\}_{i=1}^{n}$, with a $p$-dimensional input vector $x = [x_1, x_2, \ldots, x_p]$, let $z_s$ be a subset of size $s$, such that $z_s = \{z_1, \ldots, z_s\} \subset \{x_1, \ldots, x_p\}$. For example, to study the main effect of the variable $j$, we set the subset $z_s = \{z_j\}$, and to study the second order interaction of variables $i$ and $j$, we set $z_s = \{z_i, z_j\}$. Let $z_{\setminus s}$ be the complement set of $z_s$, such that $z_{\setminus s} \cup z_s = \{x_1, \ldots, x_p\}$. Let the prediction $\hat{F}(z_s \mid z_{\setminus s})$ be a function of the subset $z_s$ conditioned on specific values of $z_{\setminus s}$. The partial dependence of $\hat{F}(x)$ on $z_s$ can then be formulated as $\hat{F}(z_s \mid z_{\setminus s})$ averaged over the marginal density of the complement subset $z_{\setminus s}$:

$$\hat{F}_s(z_s) = \int \hat{F}(z_s \mid z_{\setminus s})\, p_{\setminus s}(z_{\setminus s})\, dz_{\setminus s}, \tag{27}$$

where $p_{\setminus s}(z_{\setminus s}) = \int p(x)\, dz_s$ is the marginal density of $z_{\setminus s}$. We estimate (27) by

$$\bar{F}_s(z_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{F}(z_s \mid z_{\setminus s, i}). \tag{28}$$
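The estimator (28) amounts to overwriting the columns in $z_s$ with a fixed value and averaging the model's predictions over the training rows. A minimal Python sketch follows; the function `partial_dependence`, the stand-in model `toy_predict` and the three training rows are hypothetical and serve only to illustrate the averaging step, not the plotting function of the TDboost package.

```python
def partial_dependence(predict, rows, subset, grid):
    """Estimate (28): average predict() over training rows with z_s held fixed.

    predict : callable taking one row (list of floats) and returning a float
    rows    : training predictor rows
    subset  : column indices that form z_s
    grid    : list of value tuples for z_s at which the curve is evaluated
    """
    curve = []
    for values in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            for column, value in zip(subset, values):
                modified[column] = value   # overwrite z_s, keep the complement as observed
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical fitted model with an interaction: 2 * x1 + x1 * x2."""
    return 2.0 * row[0] + row[0] * row[1]


rows = [[0.1, 1.0], [0.4, 3.0], [0.9, 2.0]]
main_effect = partial_dependence(toy_predict, rows, subset=[0],
                                 grid=[(0.0,), (0.5,), (1.0,)])
```

For this toy model the partial dependence on the first predictor is $2v + v\,\bar{x}_2$, where $\bar{x}_2$ is the training mean of the second predictor, so the interaction is averaged rather than ignored.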

Here, $\{z_{\setminus s, i}\}_{i=1}^{n}$ are evaluated at the training data. We then plot $\bar{F}_s(z_s)$ against $z_s$. We have included the partial dependence plot function in our R package TDboost. We will demonstrate this functionality in Section [number missing].

4.3.2 Variable importance

In many applications, identifying relevant predictors of the model in the context of tree-based ensemble methods is of interest. The TDboost model defines a variable importance measure for each candidate predictor $X_j$ in the set $X = \{X_1, \ldots, X_p\}$ in terms of prediction/explanation of the response $Y$. The major advantage of this variable selection procedure, as compared to univariate screening methods, is that the approach considers the impact of each individual predictor as well as multivariate interactions among predictors simultaneously.

We start by defining the variable importance (VI henceforth) measure in the context of a single tree. First introduced by Breiman et al. (1984), the VI measure $I_{X_j}(T_m)$ of the variable $X_j$ in a single tree $T_m$ is defined as the total heterogeneity reduction of the response variable $Y$ produced by $X_j$, which can be estimated by adding up all the squared error reductions $\hat{\delta}_l$ obtained in all $L - 1$ internal nodes when $X_j$ is chosen as the splitting variable. Denote by $v(X_j) = l$ the event that $X_j$ is selected as the splitting variable in the internal node $l$, and let $I_{jl} = I(v(X_j) = l)$. Then

$$I_{X_j}(T_m) = \sum_{l=1}^{L-1} \hat{\delta}_l I_{jl}, \tag{29}$$

where $\hat{\delta}_l$ is defined as the squared error difference between the constant fit and the two sub-region fits (the sub-region fits are achieved by splitting the region associated with the internal node $l$ into the left and right regions). Friedman (2001) extended the VI measure $I_{X_j}$ to the boosting model with a combination of $M$ regression trees, by averaging (29) over $\{T_1, \ldots, T_M\}$:

$$I_{X_j} = \frac{1}{M} \sum_{m=1}^{M} I_{X_j}(T_m). \tag{30}$$

Despite the wide use of the VI measure, Breiman et al. (1984), White and Liu (1994) and Kononenko (1995), among others, have pointed out that the VI measures (29) and (30)

are biased: even if $X_j$ is a non-informative variable to $Y$ (not correlated to $Y$), $X_j$ may still be selected as a splitting variable, hence the VI measure of $X_j$ is nonzero by Equation (30). Following Sandri and Zuccolotto (2008) and Sandri and Zuccolotto (2010), to avoid the variable selection bias we compute an adjusted VI measure for each explanatory variable by permuting each $X_j$:

(1) For $s = 1, \ldots, S$, repeat steps (2) to (4).
(2) Generate a matrix $z^s$ by randomly permuting (without replacement) the $n$ rows of the design matrix $x$, while keeping the order of columns unchanged.
(3) Create an $n \times 2p$ matrix $x^s = [x, z^s]$ by binding $z^s$ with the matrix $x$ by column.
(4) Use the data $\{y, x^s\}$ to fit the model, and compute VI measures $I^s_{X_j}$ for $X_j$ and $I^s_{Z^s_j}$ for $Z^s_j$, where $Z^s_j$ (the $j$th column of $z^s$) is the pseudo-predictor corresponding to $X_j$.
(5) Compute the VI measure $\bar{I}_{X_j}$ as the average of $I^s_{X_j}$ and the baseline $\bar{I}_{Z_j}$ as the average of $I^s_{Z^s_j}$:
$$\bar{I}_{X_j} = \frac{1}{S} \sum_{s=1}^{S} I^s_{X_j}, \qquad \bar{I}_{Z_j} = \frac{1}{S} \sum_{s=1}^{S} I^s_{Z^s_j}. \tag{31}$$
(6) Report the adjusted VI measure as $I^{\mathrm{adj}}_{X_j} = \bar{I}_{X_j} - \bar{I}_{Z_j}$ for the variable $X_j$.

The basic idea of the above algorithm is the following: the permutation breaks the association between the response variable $Y$ and each pseudo-predictor $Z^s_j$, but still preserves the association between $Z^s_j$ and $Z^s_k$ ($k \ne j$); since $Z^s_j$ is reshuffled from $X_j$, $Z^s_j$ has the same number of possible splits as the corresponding predictor $X_j$ and has approximately the same probability of being selected in split nodes. Therefore, $\bar{I}_{Z_j}$ can be viewed as an approximation of the bias in the importance of $X_j$.

4.4 Implementation

We have implemented our proposed method in an R package TDboost, which is publicly available from the Comprehensive R Archive Network at web/packages/tdboost/index.html. Here, we discuss the choice of three meta parameters

in Algorithm 1: $L$ (the size of the trees), $\nu$ (the shrinkage factor) and $M$ (the number of boosting steps).

To avoid overfitting and improve out-of-sample predictions, the boosting procedure can be regularized by limiting the number of boosting iterations $M$ (early stopping; Zhang and Yu, 2005) and the shrinkage factor $\nu$. Empirical evidence (Friedman, 2001; Bühlmann and Hothorn, 2007; Ridgeway, 2007; Elith et al., 2008) showed that the predictive accuracy is almost always better with a smaller shrinkage factor, at the cost of more computing time. However, smaller values of $\nu$ usually require a larger number of boosting iterations $M$ and hence induce more computing time (Friedman, 2001). We choose a sufficiently small $\nu =$ [value missing] throughout and determine $M$ by the data.

The value $L$ should reflect the true interaction order in the underlying model, but we almost never have such prior knowledge. Therefore we choose the optimal $M$ and $L$ using $K$-fold cross validation, starting with a fixed value of $L$. The data are split into $K$ roughly equal-sized folds. Let an index function $\pi(i): \{1, \ldots, n\} \to \{1, \ldots, K\}$ indicate the fold to which observation $i$ is allocated. Each time, we remove the $k$th fold of the data ($k = 1, 2, \ldots, K$), and train the model using the remaining $K - 1$ folds. Denoting by $\hat{F}^{[M]}_{-k}(x)$ the resulting model, we compute the validation loss by predicting on each $k$th fold of the data removed:

$$\mathrm{CV}(M, L) = \frac{1}{n} \sum_{i=1}^{n} \Psi\left(y_i, \hat{F}^{[M]}_{-\pi(i)}(x_i; L) \mid \rho\right). \tag{32}$$

We select the optimal $M$ at which the minimum validation loss is reached:

$$\hat{M}_L = \arg\min_{M} \mathrm{CV}(M, L).$$

If we need to select $L$ too, then we repeat the whole process for several $L$ (e.g. $L = 2, 3, 4, 5$) and choose the one with the smallest minimum generalization error:

$$\hat{L} = \arg\min_{L} \mathrm{CV}(\hat{M}_L, L).$$

For a given $\nu$, fitting trees with higher $L$ leads to a smaller $M$ being required to reach the minimum error.
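The selection rule based on (32) can be sketched as follows in Python. The sketch assumes that the held-out predictions have already been computed, one value per observation and per number of boosting steps, by models trained without the observation's own fold; how those models are trained is outside its scope. The function names, the modulo fold assignment and the toy numbers are assumptions made for illustration.

```python
import math


def tweedie_loss(y, f, w, rho):
    """Psi(y, F | rho) as used in the validation loss (32)."""
    return w * (-y * math.exp((1 - rho) * f) / (1 - rho)
                + math.exp((2 - rho) * f) / (2 - rho))


def fold_index(n, n_folds):
    """A simple index function pi(i): observation i goes to fold i mod K."""
    return [i % n_folds for i in range(n)]


def cv_curve(y, w, held_out_fits, rho):
    """CV(M, L) for M = 1, ..., M_max at a fixed tree size L.

    held_out_fits[i][m - 1] is the value of F for observation i after m boosting
    steps, from the model trained without the fold that contains observation i.
    """
    n, m_max = len(y), len(held_out_fits[0])
    return [sum(tweedie_loss(y[i], held_out_fits[i][m], w[i], rho) for i in range(n)) / n
            for m in range(m_max)]


def select_steps(curve):
    """Return the number of steps M (1-based) with the smallest validation loss."""
    return min(range(len(curve)), key=lambda m: curve[m]) + 1


# Toy example: four observations, two folds, three candidate values of M.
y = [1.0, 1.0, 1.0, 1.0]
w = [1.0, 1.0, 1.0, 1.0]
folds = fold_index(len(y), n_folds=2)
held_out_fits = [[math.log(0.5), 0.0, math.log(2.0)] for _ in y]
curve = cv_curve(y, w, held_out_fits, rho=1.5)
best_m = select_steps(curve)
```

In the toy example the second candidate predicts exp(0) = 1, which matches every response, so it attains the smallest validation loss and `best_m` is 2. Repeating the computation for several tree sizes and comparing the minima gives the choice of L.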

5 Simulation Studies

In this section, we compare TDboost with the Tweedie GLM model (TGLM: Jørgensen and de Souza, 1994) and the Tweedie GAM model in terms of function estimation performance. The Tweedie GAM model was proposed by Wood (2001) and is based on a penalized regression spline approach with automatic smoothness selection. There is an R package MGCV accompanying the work, available at packages/mgcv/index.html. In all numerical examples below using the TDboost model, five-fold cross validation is adopted for selecting the optimal $(M, L)$ pair, while the shrinkage factor $\nu$ is set to its default value of [value missing].

5.1 Case I

In this simulation study, we demonstrate that TDboost is well suited to fit target functions that are nonlinear or involve complex interactions. We consider two true target functions:

Model 1 (Discontinuous function): The target function is discontinuous as defined by $F(x) = 0.5\, I(x > 0.5)$. We assume $x \sim \mathrm{Unif}(0, 1)$, and $y \sim \mathrm{Tw}(\mu, \phi, \rho)$ with $\rho = 1.5$ and $\phi = 0.5$.

Model 2 (Complex interaction): The target function has two hills and two valleys,

$$F(x_1, x_2) = e^{-5(1 - x_1)^2 + x_2^2} + e^{-5 x_1^2 + (1 - x_2)^2},$$

which corresponds to a common scenario where the effect of one variable changes depending on the effect of another. We assume $x_1, x_2 \sim \mathrm{Unif}(0, 1)$, and $y \sim \mathrm{Tw}(\mu, \phi, \rho)$ with $\rho = 1.5$ and $\phi = 0.5$.

We generate $n = 1000$ observations for training and $n = 1000$ for testing, and fit the training data using TDboost, MGCV, and TGLM. Since the true target functions are known, we consider the mean absolute deviation (MAD) as the performance criterion,

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left|F(x_i) - \hat{F}(x_i)\right|,$$