A Boosted Tweedie Compound Poisson Model for Insurance Premium

Yi Yang (McGill University), Wei Qian (Rochester Institute of Technology), and Hui Zou (University of Minnesota; corresponding author, zoux019@umn.edu)

July 27, 2015

Abstract

The Tweedie GLM is a widely used method for predicting insurance premiums. However, the linear model assumption can be too rigid for many applications. As a better alternative, a boosted Tweedie model is considered in this paper. We propose a TDboost estimator of pure premiums and use a profile likelihood approach to estimate the index and dispersion parameters. Our method is capable of fitting a flexible nonlinear Tweedie model and capturing complex interactions among predictors. A simulation study confirms the excellent prediction performance of our method. As an application, we apply our method to an auto insurance claim data set and show that the new method is superior to the existing methods in the sense that it generates more accurate premium predictions, thus helping solve the adverse selection issue. We have implemented our method in a user-friendly R package that also includes a visualization tool for interpreting the fitted model.

1 Introduction

One of the most important problems in the insurance business is to set the premium for the customers (policyholders). By insurance contracts, a large number of policyholders' individual losses are transformed into a more predictable, aggregate loss of the insurer. In a
competitive market, it is advantageous for the insurer to charge a fair premium according to the expected loss of the policyholder. In personal car insurance, for instance, if an insurance company charges too much for old drivers and too little for young drivers, then the old drivers will switch to its competitors, and the remaining policies for the young drivers will be underpriced. This results in the adverse selection issue (Chiappori and Salanie, 2000; Dionne et al., 2001): the insurer loses profitable policies and is left with bad risks, resulting in economic loss both ways.

To appropriately set the premiums for the insurer's customers, one crucial task is to predict the size of actual (currently unforeseeable) claims. In this paper, we focus on modeling claim loss, although other ingredients such as safety loadings, administrative costs, cost of capital, and profit are also important factors for setting the premium. One difficulty in modeling the claims is that the distribution is usually highly right-skewed and mixed with a point mass at zero. Such data cannot be transformed to normality by a power transformation, and special treatment of zero claims is often required. As an example, Figure 1 shows the histogram of an auto insurance claim data set (Yip and Yau, 2005), in which there are 6,290 policy records with zero claims and 4,006 policy records with positive losses. The need for predictive models emerges from the fact that the expected loss is highly dependent on the characteristics of an individual policy, such as the age and annual income of the policyholder, the population density of the policyholder's residential area, and the age and model of the vehicle. Traditional methods used generalized linear models (GLM; Nelder and Wedderburn, 1972) for modeling the claim size (e.g., Renshaw, 1994; Haberman and Renshaw, 1996). However, all of these works performed their analyses on a subset of the policies that have at least one claim.
Alternative approaches have employed Tobit models by treating zero outcomes as censored below some cutoff points (Van de Ven and van Praag, 1981; Showers and Shotick, 1994), but these approaches rely on a normality assumption for the latent response. Alternatively, Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) used GLMs with a Tweedie distributed (Jørgensen, 1987, 1997) outcome to simultaneously model the frequency and severity of insurance claims. They assume Poisson arrival of claims and gamma distributed amounts for individual claims, so that the total claim amount follows a Tweedie compound Poisson distribution. Due to its ability to simultaneously model the zeros and the continuous positive outcomes, the Tweedie GLM
[Figure 1: Histogram of the auto insurance claim data (total insurance claim amount, in $1000, per policy year versus frequency) as analyzed in Yip and Yau (2005). It shows that there are 6,290 policy records with zero total claims per policy year, while the remaining 4,006 policy records have positive losses.]
has been a widely used method in actuarial studies (Mildenhall, 1999; Murphy et al., 2000; Peters et al., 2008; Quijano Xacur et al., 2011).

Despite the popularity of the Tweedie GLM, a major limitation is that the structure of the link function is restricted to a linear form, which can be too rigid for real applications. In auto insurance, for example, it is known that the risk does not monotonically decrease as age increases (Owsley et al., 1991; McCartt et al., 2003; Anstey et al., 2005). Although nonlinearity may be modeled by adding splines (Zhang, 2011), low-degree splines are often inadequate to capture the nonlinearity in the data, while high-degree splines often result in overfitting that produces unstable estimates. Generalized additive models (GAM; Hastie and Tibshirani, 1990; Wood, 2006) overcome the restrictive linear assumption of GLMs and can model the continuous variables by smooth functions estimated from the data. The structure of the model, however, has to be determined a priori. That is, one has to specify the main effects and interaction effects to be used in the model. As a result, misspecification of nonignorable effects is likely to adversely affect prediction accuracy.

In this paper, we aim to model the claim size by a nonparametric Tweedie compound Poisson model, and we propose a gradient tree-boosting algorithm (TDboost henceforth) to fit this model. Gradient boosting (Freund and Schapire, 1996, 1997) is one of the most successful machine learning algorithms for nonparametric regression and classification. Boosting adaptively combines a large number of relatively simple prediction models called base learners into an ensemble learner to achieve high prediction performance. The seminal boosting algorithm AdaBoost (Freund and Schapire, 1996, 1997) was originally proposed for classification problems. Later, Breiman (1998) and Breiman (1999) pointed out an important connection between the AdaBoost algorithm and a functional gradient descent algorithm. Friedman et al.
(2000), Friedman (2001) and Hastie et al. (2009) developed a statistical view of boosting and proposed gradient boosting methods for both classification and regression. There is a large body of literature on boosting; we refer interested readers to Bühlmann and Hothorn (2007) for a comprehensive review of boosting algorithms.

The TDboost model is motivated by the proven success of boosting in machine learning for classification and regression problems (Friedman, 2001, 2002; Hastie et al., 2009). Its advantages are threefold. First, the model structure of TDboost is learned from data and not predetermined, thereby avoiding an explicit model specification. Nonlinearities,
discontinuities, complex and higher order interactions are naturally incorporated into the model to reduce the potential modeling bias and to produce high predictive performance, which enables TDboost to serve as a benchmark model in scoring insurance policies, guiding pricing practice, and facilitating marketing efforts. Second, in contrast to other nonparametric statistical learning methods, TDboost can provide interpretable results by means of partial dependence plots and the relative importance of the predictors. Feature selection is performed as an integral part of the procedure. In addition, TDboost handles predictor and response variables of any type without the need for transformation, and it is highly robust to outliers. Missing values in the predictors are managed almost without loss of information (Elith et al., 2008). All these properties make TDboost an attractive tool for insurance premium modeling.

The remainder of this paper is organized as follows. We briefly review the gradient boosting algorithm and the Tweedie compound Poisson model in Section 2 and Section 3, respectively. We present the main methodology development with implementation details in Section 4. In Section 5, we use simulation to show the high predictive accuracy of TDboost. As an application, we apply TDboost to analyze an auto insurance claim data set in Section 6.

2 Gradient Boosting

To keep the paper self-contained, we briefly explain the general procedure of gradient boosting. Let x = (x_1, ..., x_p) be the p-dimensional predictor variables and y be the one-dimensional response variable. The goal is to estimate the optimal prediction function F*(·) that maps x to y by minimizing the expected value of a loss function Ψ(·, ·) over the function class F:

    F*(·) = argmin_{F(·)∈F} E_{y,x}[Ψ(y, F(x))],

where Ψ is assumed to be differentiable with respect to F. Given the observed data {y_i, x_i}_{i=1}^n, estimation of F*(·) can be done by minimizing the empirical risk function

    min_{F(·)∈F} R_n(F) := min_{F(·)∈F} (1/n) Σ_{i=1}^n Ψ(y_i, F(x_i)).    (1)
For gradient boosting, each candidate function F ∈ F is assumed to be an ensemble of M base learners

    F(x) = F^[0] + Σ_{m=1}^M β^[m] h(x; ξ^[m]),    (2)

where h(x; ξ^[m]) usually belongs to a class of simple functions of x called base learners (e.g., a regression/decision tree) with parameter ξ^[m] (m = 1, 2, ..., M), F^[0] is a constant scalar, and β^[m] is the expansion coefficient. Note that, differing from the usual structure of an additive model, there is no restriction on the number of predictors to be included in each h(·); consequently, high-order interactions can be easily considered in this setting.

A forward stagewise algorithm is adopted to approximate the minimizer of (1), which builds up the components β^[m] h(x; ξ^[m]) (m = 1, 2, ..., M) sequentially through a gradient-descent-like approach. At each iteration stage m (m = 1, 2, ...), suppose the current estimate for F*(·) is F̂^[m−1](·). In principle, we want to update F̂^[m−1](·) to F̂^[m](·) along the negative gradient direction of R_n(F). However, we only know the negative gradient at the training data points. Indeed, the negative gradient vector (u_1^[m], ..., u_n^[m]) of R_n(F) with respect to F at {F(x_i) = F̂^[m−1](x_i)}_{i=1}^n can be written as

    u_i^[m] = −[∂R_n(F)/∂F(x_i)]_{F(x_i)=F̂^[m−1](x_i)},

but we cannot directly update F̂^[m−1](·) using (u_1^[m], ..., u_n^[m]), since u^[m] is only defined at x_i for i = 1, ..., n and cannot make predictions for new data not represented in the training set. To solve this problem, gradient boosting fits the negative gradient vector (u_1^[m], ..., u_n^[m]) (as the working response) to (x_1, ..., x_n) (as the predictors) to find a base learner h(x; ξ^[m]). The fitted h(x; ξ^[m]) can thus be viewed as an approximation of the negative gradient, and it can be evaluated on the entire space of x.
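To make the forward stagewise procedure concrete, the following is a minimal, self-contained Python sketch (illustrative only, not the authors' implementation) of gradient boosting with depth-one regression trees (stumps) under squared-error loss, for which the negative gradient u_i^[m] is simply the residual y_i − F̂^[m−1](x_i):

```python
import numpy as np

def fit_stump(x, u):
    """Least-squares stump: split x at the threshold minimizing the SSE of u."""
    order = np.argsort(x)
    xs, us = x[order], u[order]
    best = None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue
        left, right = us[:k].mean(), us[k:].mean()
        sse = ((us[:k] - left) ** 2).sum() + ((us[k:] - right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, (xs[k - 1] + xs[k]) / 2, left, right)
    _, thr, left, right = best
    return lambda z: np.where(z <= thr, left, right)

def boost(x, y, M=100, nu=0.1):
    """Forward stagewise boosting for squared-error loss with stump base learners."""
    F0 = y.mean()                      # constant initial fit
    Fx = np.full_like(y, F0)
    learners = []
    for _ in range(M):
        u = y - Fx                     # negative gradient = residual (squared loss)
        h = fit_stump(x, u)
        learners.append(h)
        Fx = Fx + nu * h(x)            # shrunken update, cf. eq. (4)
    return lambda z: F0 + nu * sum(h(z) for h in learners)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = 0.5 * (x > 0.5) + rng.normal(0, 0.1, 500)
F = boost(x, y)                        # approximates the step function 0.5*I(x > 0.5)
```

For squared-error loss the line search (3) is absorbed into the terminal node means, so only the shrinkage step remains; the Tweedie-specific loss requires the explicit node-wise line search described in Section 4.1.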
The expansion coefficient β^[m] can then be determined by a line search

    β^[m] = argmin_β Σ_{i=1}^n Ψ(y_i, F̂^[m−1](x_i) + β h(x_i; ξ^[m])),    (3)

which is reminiscent of the steepest descent method. Consequently, the estimation of
F(x) for the next stage is

    F̂^[m](x) := F̂^[m−1](x) + ν β^[m] h(x; ξ^[m]),    (4)

where 0 < ν ≤ 1 is the shrinkage factor (Friedman, 2001) that controls the update step size. A small ν imposes more shrinkage, while ν = 1 gives complete negative gradient steps. Friedman (2001) found that the shrinkage factor reduces overfitting and improves predictive accuracy.

3 Compound Poisson Distribution and Tweedie Model

In this section, we briefly introduce the compound Poisson distribution and the Tweedie model, which are necessary for our methodology development. Let N be a Poisson random variable denoted by Pois(λ), and let the Z_d's be i.i.d. gamma random variables denoted by Gamma(α, γ) with mean αγ and variance αγ². Assume N is independent of the Z_d's. Define a random variable Z by

    Z = 0 if N = 0;    Z = Z_1 + Z_2 + ··· + Z_N if N = 1, 2, ....    (5)

Thus Z is the Poisson sum of independent gamma random variables. The resulting distribution of Z is referred to as the compound Poisson distribution (Feller, 1968; Bar-Lev and Stramer, 1987; Jørgensen and de Souza, 1994; Smyth and Jørgensen, 2002), which is known to be closely connected to exponential dispersion models (EDM; Jørgensen, 1987, 1997). Note that the distribution of Z has a probability mass at zero: Pr(Z = 0) = exp(−λ). Then, using the fact that Z conditional on N = j is Gamma(jα, γ), the distribution function of Z can be written as

    f_Z(z | λ, α, γ) = Pr(N = 0) d_0(z) + Σ_{j=1}^∞ Pr(N = j) f_{Z|N=j}(z)
                     = exp(−λ) d_0(z) + Σ_{j=1}^∞ (λ^j e^{−λ}/j!) · z^{jα−1} e^{−z/γ} / (γ^{jα} Γ(jα)),
where d_0 is the Dirac delta function at zero and f_{Z|N=j} is the conditional density of Z given N = j. This gives the cumulant generating function of Z:

    log M_Z(t) = λ{(1 − γt)^{−α} − 1}.    (6)

Smyth (1996) pointed out that the compound Poisson distribution belongs to a special class of EDMs known as Tweedie models (Tweedie, 1984). The EDMs are defined by the form

    f_Z(z | θ, φ) = a(z, φ) exp{(zθ − κ(θ))/φ},    (7)

where a(·) is a normalizing function, κ(·) is called the cumulant function, and both a(·) and κ(·) are known. The parameter θ is in R and the dispersion parameter φ is in R_+. EDMs have the property that the mean E(Z) ≡ µ = κ̇(θ) and the variance Var(Z) = φ κ̈(θ), where κ̇(θ) and κ̈(θ) are the first and second derivatives of κ(θ), respectively. The cumulant generating function of EDMs is

    log M_Z(t) = {κ(θ + tφ) − κ(θ)}/φ.    (8)

Tweedie models are the special cases of EDMs characterized by the power mean-variance relationship Var(Z) = φµ^ρ for some index parameter ρ. This mean-variance relation gives

    θ = µ^{1−ρ}/(1 − ρ) if ρ ≠ 1,  θ = log µ if ρ = 1;
    κ(θ) = µ^{2−ρ}/(2 − ρ) if ρ ≠ 2,  κ(θ) = log µ if ρ = 2.    (9)

One can show that the compound Poisson distribution belongs to the class of Tweedie models. Indeed, if we replace the parameters (λ, α, γ) in the cumulant function (6) by

    λ = µ^{2−ρ}/{φ(2 − ρ)},    α = (2 − ρ)/(ρ − 1),    γ = φ(ρ − 1)µ^{ρ−1},    (10)

the cumulant function of the compound Poisson model has the form of a Tweedie model with 1 < ρ < 2 and µ > 0. As a result, for the rest of this paper we only consider the model (5), and simply refer to (5) as the Tweedie model (or Tweedie compound Poisson model), denoted by Tw(µ, φ, ρ), where 1 < ρ < 2 and µ > 0.
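To make the reparameterization (10) concrete, the short Python sketch below (illustrative only) simulates the compound Poisson model (5) with parameters obtained from (10) and checks empirically that the sample mean is close to µ, the sample variance close to φµ^ρ, and the fraction of zeros close to exp(−λ):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, phi, rho = 2.0, 0.5, 1.5

# Map Tweedie parameters (mu, phi, rho) to compound Poisson parameters, eq. (10)
lam = mu ** (2 - rho) / (phi * (2 - rho))
alpha = (2 - rho) / (rho - 1)
gamma = phi * (rho - 1) * mu ** (rho - 1)

# Simulate Z = Z_1 + ... + Z_N with N ~ Pois(lam), Z_d ~ Gamma(alpha, gamma), eq. (5)
n = 100_000
N = rng.poisson(lam, size=n)
Z = np.array([rng.gamma(alpha, gamma, size=k).sum() for k in N])

sample_mean, sample_var = Z.mean(), Z.var()
zero_frac = (Z == 0).mean()
# sample_mean is close to mu, sample_var is close to phi * mu**rho,
# and zero_frac is close to exp(-lam), consistent with Pr(Z = 0) = exp(-lam)
```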
It is straightforward to show that the log-likelihood of the Tweedie model is

    log f_Z(z | µ, φ, ρ) = (1/φ){z µ^{1−ρ}/(1 − ρ) − µ^{2−ρ}/(2 − ρ)} + log a(z, φ, ρ),    (11)

where the normalizing function a(·) can be written as

    a(z, φ, ρ) = (1/z) Σ_{t=1}^∞ W_t(z, φ, ρ)
               = (1/z) Σ_{t=1}^∞ z^{tα} (ρ − 1)^{−tα} φ^{−t(1+α)} (2 − ρ)^{−t} / (t! Γ(tα))  for z > 0,
    a(z, φ, ρ) = 1  for z = 0,

with α = (2 − ρ)/(ρ − 1); Σ_{t=1}^∞ W_t is an example of Wright's generalized Bessel function (Tweedie, 1984). One of the desirable properties of Tweedie models is that they are the only EDMs that are scale invariant (Jørgensen, 1997, Section 4.1.1): if Z is a Tweedie variable with mean µ and dispersion φ, then cZ follows the same distribution with mean cµ and dispersion c^{2−ρ}φ. This property makes Tweedie distributions a good choice for modeling data with an arbitrary monetary unit.

4 Our proposal

In this section, we propose to integrate the Tweedie model into the tree-based gradient boosting algorithm to predict insurance claim size. Specifically, our discussion focuses on modeling personal car insurance as an illustrating example (see Section 6 for a real data analysis), since our modeling strategy is easily extended to other lines of non-life insurance business. Given an auto insurance policy i, let N_i be the number of claims (known as the claim frequency) and Z_{d_i} be the size of each claim observed for d_i = 1, ..., N_i. Let w_i be the policy duration, that is, the length of time that the policy remains in force. Then Z_i = Σ_{d_i=1}^{N_i} Z_{d_i} is the total claim amount. In the following, we are interested in modeling the ratio between the total claim and the duration, Y_i = Z_i/w_i, a key quantity known as the pure premium (Ohlsson and Johansson, 2010).

Following the settings of the compound Poisson model, we assume N_i is Poisson distributed, and its mean λ_i w_i has a multiplicative relation with the duration w_i, where λ_i is a policy-specific parameter representing the expected claim frequency under unit duration.
Conditional on N_i, assume the Z_{d_i}'s (d_i = 1, ..., N_i) are i.i.d. Gamma(α, γ_i), where γ_i is a policy-specific parameter that determines claim severity, and α is a constant. Furthermore, we assume that under unit duration (i.e., w_i = 1), the mean-variance relation of a policy satisfies

    Var(Y_i*) = φ[E(Y_i*)]^ρ    (12)

for all policies, where Y_i* is the pure premium under unit duration, φ is a constant, and ρ = (α + 2)/(α + 1). Note that

    µ_i := E(Y_i*) = E(E(Y_i* | N_i)) = λ_i αγ_i,
    Var(Y_i*) = E(Var(Y_i* | N_i)) + Var(E(Y_i* | N_i)) = λ_i αγ_i² + λ_i α²γ_i².

Similarly, under duration w_i,

    µ_i := E(Y_i) = (1/w_i) E(Z_i) = λ_i αγ_i,
    Var(Y_i) = (1/w_i²) Var(Z_i) = (λ_i αγ_i² + λ_i α²γ_i²)/w_i.

As a result, we obtain the mean-variance relation for the pure premium Y_i:

    Var(Y_i) = (1/w_i) Var(Y_i*) = (φ/w_i)[E(Y_i*)]^ρ = (φ/w_i) µ_i^ρ,    (13)

where the second equation follows from (12). Consequently, the scale invariance property of the Tweedie distribution and (13) imply that Y_i ∼ Tw(µ_i, φ/w_i, ρ).

Under the aforementioned settings, consider a portfolio of policies {(y_i, x_i, w_i)}_{i=1}^n from n independent insurance contracts, where for the ith contract, y_i is the policy pure premium, x_i is a vector of explanatory variables that characterize the policyholder and the risk being insured (e.g., house, vehicle), and w_i is the duration. We then assume that the expected
pure premium µ_i is determined by a predictor function F: R^p → R of x_i:

    log{µ_i} = log{E(Y_i | x_i)} = F(x_i).    (14)

In this paper, we do not impose a linear or other parametric form restriction on F(·). Given the flexibility of F(·), we call this setting the boosted Tweedie model (as opposed to the Tweedie GLM). Given {(y_i, x_i, w_i)}_{i=1}^n, the log-likelihood function can be written as

    ℓ(F(·), φ, ρ | {y_i, x_i, w_i}_{i=1}^n) = Σ_{i=1}^n log f_Y(y_i | µ_i, φ/w_i, ρ)
        = Σ_{i=1}^n [ (w_i/φ){y_i µ_i^{1−ρ}/(1 − ρ) − µ_i^{2−ρ}/(2 − ρ)} + log a(y_i, φ/w_i, ρ) ].    (15)

4.1 Estimating F(·) via TDboost

We estimate the predictor function F(·) by integrating the boosted Tweedie model into the tree-based gradient boosting algorithm. To develop the idea, we assume that φ and ρ are given for the time being. The joint estimation of F(·), φ and ρ will be studied in Section 4.2. Given ρ and φ, we replace the general objective function in (1) by the negative log-likelihood derived from (15), and target the minimizer function F*(·) over a class F of base learner functions in the form of (2). That is, we intend to estimate

    F*(x) = argmin_{F∈F} {−ℓ(F(·), φ, ρ | {y_i, x_i, w_i}_{i=1}^n)} = argmin_{F∈F} Σ_{i=1}^n Ψ(y_i, F(x_i) | ρ),    (16)

where

    Ψ(y_i, F(x_i) | ρ) = w_i{ −y_i exp[(1 − ρ)F(x_i)]/(1 − ρ) + exp[(2 − ρ)F(x_i)]/(2 − ρ) }.

Note that, in contrast to (16), the function class targeted by the Tweedie GLM (Smyth, 1996) is restricted to a collection of linear functions of x.

We propose to apply the forward stagewise algorithm described in Section 2 to solve (16). The initial estimate of F*(·) is chosen as a constant function that minimizes the
negative log-likelihood:

    F̂^[0] = argmin_η Σ_{i=1}^n Ψ(y_i, η | ρ) = log( Σ_{i=1}^n w_i y_i / Σ_{i=1}^n w_i ).

This corresponds to the best estimate of F without any covariates. Let F̂^[m−1] be the current estimate before the mth iteration. At the mth step, we fit a base learner h(x; ξ^[m]) via

    ξ̂^[m] = argmin_{ξ^[m]} Σ_{i=1}^n [u_i^[m] − h(x_i; ξ^[m])]²,    (17)

where (u_1^[m], ..., u_n^[m]) is the current negative gradient of Ψ(· | ρ), i.e.,

    u_i^[m] = −[∂Ψ(y_i, F(x_i) | ρ)/∂F(x_i)]_{F(x_i)=F̂^[m−1](x_i)}    (18)
            = w_i{ y_i exp[(1 − ρ)F̂^[m−1](x_i)] − exp[(2 − ρ)F̂^[m−1](x_i)] },    (19)

and use an L-terminal node regression tree

    h(x; ξ^[m]) = Σ_{l=1}^L u_l^[m] I(x ∈ R_l^[m])    (20)

with parameters ξ^[m] = {R_l^[m], u_l^[m]}_{l=1}^L as the base learner. To find R_l^[m] and u_l^[m], we use a fast top-down best-fit algorithm with a least squares splitting criterion (Friedman et al., 2000) to find the splitting variables and corresponding split locations that determine the fitted terminal regions {R̂_l^[m]}_{l=1}^L. Note that estimating the R_l^[m] entails estimating the u_l^[m] as the mean falling in each region:

    ū_l^[m] = mean_{i: x_i ∈ R̂_l^[m]}(u_i^[m]),    l = 1, ..., L.

Once the base learner h(x; ξ^[m]) has been estimated, the optimal value of the expansion
coefficient β^[m] is determined by a line search

    β^[m] = argmin_β Σ_{i=1}^n Ψ(y_i, F̂^[m−1](x_i) + β h(x_i; ξ^[m]) | ρ)
          = argmin_β Σ_{i=1}^n Ψ(y_i, F̂^[m−1](x_i) + β Σ_{l=1}^L ū_l^[m] I(x_i ∈ R̂_l^[m]) | ρ).    (21)

Since the regression tree (20) predicts a constant value ū_l^[m] within each region R̂_l^[m], we can solve (21) by a separate line search performed within each respective region R̂_l^[m]. The problem (21) then reduces to finding a best constant η_l^[m] to improve the current estimate in each region R̂_l^[m] based on the following criterion:

    η̂_l^[m] = argmin_η Σ_{i: x_i ∈ R̂_l^[m]} Ψ(y_i, F̂^[m−1](x_i) + η | ρ),    l = 1, ..., L,    (22)

where the solution is given by

    η̂_l^[m] = log{ Σ_{i: x_i ∈ R̂_l^[m]} w_i y_i exp[(1 − ρ)F̂^[m−1](x_i)] / Σ_{i: x_i ∈ R̂_l^[m]} w_i exp[(2 − ρ)F̂^[m−1](x_i)] },    l = 1, ..., L.    (23)

Having found the parameters {η̂_l^[m]}_{l=1}^L, we then update the current estimate F̂^[m−1](x) in each corresponding region:

    F̂^[m](x) = F̂^[m−1](x) + ν η̂_l^[m] I(x ∈ R̂_l^[m]),    l = 1, ..., L,    (24)

where 0 < ν ≤ 1 is the shrinkage factor. Following Friedman (2001), we set ν = 0.005 in our implementation. More discussion on the choice of tuning parameters is in Section 4.4. In summary, the complete TDboost algorithm is shown in Algorithm 1. The boosting step is repeated M times and we report F̂^[M](x) as the final estimate.

4.2 Estimating (ρ, φ) via profile likelihood

Following Smyth (1996) and Dunn and Smyth (2005), we use the profile likelihood to estimate the dispersion φ and the index parameter ρ, which jointly determine the mean-variance
Algorithm 1 TDboost

1. Initialize F̂^[0]:
       F̂^[0] = log( Σ_{i=1}^n w_i y_i / Σ_{i=1}^n w_i ).
2. For m = 1, ..., M, repeat steps 2.(a)-2.(d):
   2.(a) Compute the negative gradient {u_i^[m]}_{i=1}^n:
       u_i^[m] = w_i{ y_i exp[(1 − ρ)F̂^[m−1](x_i)] − exp[(2 − ρ)F̂^[m−1](x_i)] },    i = 1, ..., n.
   2.(b) Fit the negative gradient vector {u_i^[m]}_{i=1}^n to x_1, ..., x_n by an L-terminal node regression tree, giving the partitions {R̂_l^[m]}_{l=1}^L.
   2.(c) Compute the optimal terminal node predictions η_l^[m] for each region R̂_l^[m], l = 1, 2, ..., L:
       η̂_l^[m] = log{ Σ_{i: x_i ∈ R̂_l^[m]} w_i y_i exp[(1 − ρ)F̂^[m−1](x_i)] / Σ_{i: x_i ∈ R̂_l^[m]} w_i exp[(2 − ρ)F̂^[m−1](x_i)] }.
   2.(d) Update F̂^[m](x) in each region R̂_l^[m], l = 1, 2, ..., L:
       F̂^[m](x) = F̂^[m−1](x) + ν η̂_l^[m] I(x ∈ R̂_l^[m]).
3. Report F̂^[M](x) as the final estimate.
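The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative re-implementation, not the authors' TDboost package: for simplicity it uses a depth-one tree (stump) on a single predictor as the base learner, sets all w_i = 1, and returns the in-sample fitted values rather than a predictor object (a full implementation would store the trees for prediction on new x):

```python
import numpy as np

def best_split(x, u):
    """Least-squares split point for a stump fit to the working response u."""
    order = np.argsort(x)
    xs, us = x[order], u[order]
    csum, tot = np.cumsum(us), us.sum()
    k = np.arange(1, len(us))
    # minimizing SSE = maximizing the between-group sum of squares
    score = csum[:-1] ** 2 / k + (tot - csum[:-1]) ** 2 / (len(us) - k)
    j = np.argmax(score)
    return (xs[j] + xs[j + 1]) / 2

def tweedie_loss(y, F, rho):
    """Negative Tweedie log-likelihood Psi(y, F | rho), w_i = 1, dropping a(.,.,.)."""
    return np.sum(-y * np.exp((1 - rho) * F) / (1 - rho)
                  + np.exp((2 - rho) * F) / (2 - rho))

def tdboost_sketch(x, y, rho=1.5, M=300, nu=0.05):
    # Step 1: constant initialization F[0] = log(mean(y)), all w_i = 1
    F = np.full(len(y), np.log(y.mean()))
    splits = []
    for _ in range(M):
        # Step 2(a): negative gradient, eq. (19)
        u = y * np.exp((1 - rho) * F) - np.exp((2 - rho) * F)
        # Step 2(b): fit a stump to u
        thr = best_split(x, u)
        left = x <= thr
        # Step 2(c): closed-form terminal node updates, eq. (23)
        etas = []
        for region in (left, ~left):
            num = np.sum(y[region] * np.exp((1 - rho) * F[region])) + 1e-12
            den = np.sum(np.exp((2 - rho) * F[region]))
            etas.append(np.log(num / den))
        # Step 2(d): shrunken update, eq. (24)
        F = F + nu * np.where(left, etas[0], etas[1])
        splits.append((thr, etas[0], etas[1]))
    return F, splits
```

Since Ψ is convex in F for 1 < ρ < 2, each node update with ν ≤ 1 cannot increase the training loss, mirroring the behavior of the full algorithm.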
reation V ar(y i ) = φµ ρ i /w i of the pure premium. We expoit the fact that in Tweedie modes the estimation of µ depends ony on ρ: given a fixed ρ, the mean estimate µ (ρ) can be soved in (16) without knowing φ. Then conditiona on this ρ and the corresponding µ (ρ), we maximize the ogikeihood function with respect to φ by { φ (ρ) = argmax (µ (ρ), φ, ρ) }, (25) φ which is a univariate optimization probem that can be soved using a combination of goden section search and successive paraboic interpoation (Brent, 2013). In such a way, we have determined the corresponding (µ (ρ), φ (ρ)) for each fixed ρ. Then we acquire the estimate of ρ by maximizing the profie ikeihood with respect to 50 equay spaced vaues {ρ 1,..., ρ 50 } on (0, 1): ρ = argmax { (µ (ρ), φ (ρ), ρ) }. (26) ρ {ρ 1,...,ρ 50 } Finay, we appy ρ in (16) and (25) to obtain the corresponding estimates µ (ρ ) and φ (ρ ). There are some computationa issues, which must be taken care of when evauating the ogikeihood functions in (25) and (26): since in genera there are no cosed forms for Tweedie densities, in ikeihood evauation one must dea with an infinite summation in the normaizing function a(y, φ, ρ) = 1 y t=1 W t. For numerica evauation of Tweedie densities, Dunn and Smyth (2005) proposed a series expansions approach, which sums an infinite series arising from a Tayor expansion of the characteristic function. Aternativey, Dunn and Smyth (2008) deveoped a Fourier inversion approach, which consists of an inversion of the characteristic function based on numerica integration methods for osciating functions. These two numerica methods turn out to be compementary since each has advantages under a certain situation: when ony considering the case 1 < ρ < 2, the series approach performs very we for sma y but graduay oses computationa efficiency as y increases, whereas the inversion approach performs very we for arge y but graduay fais to provide accurate resuts as y decreases. 
Hence the inversion approach is preferred for large y and the series approach for small y. Dunn and Smyth (2008) provided a simple guideline for choosing between the two methods. In this paper we use their R package tweedie (available at http://cran.r-project.org/web/packages/tweedie/index.html) for evaluating Tweedie densities in our profile likelihood computation. For further details regarding their algorithms, the reader
may refer to Dunn and Smyth (2005, 2008).

4.3 Model interpretation

Compared to other nonparametric statistical learning methods such as neural networks and kernel machines, our new estimator provides interpretable results. In this section, we discuss some ways to interpret the model after fitting the boosted Tweedie model.

4.3.1 Marginal effects of predictors

The main effects and interaction effects of the variables in the boosted Tweedie model can be extracted easily. In our estimate we can control the order of interactions by choosing the tree size L (the number of terminal nodes) and the number p of predictors. A tree with L terminal nodes produces a function approximation of p predictors with interaction order of at most min(L − 1, p). For example, a stump (L = 2) produces an additive TDboost model with only the main effects of the predictors, since it is a function based on a single splitting variable in each tree. Setting L = 3 allows both main effects and second order interactions.

Following Friedman (2001), we use the so-called partial dependence plots to visualize the main effects and interaction effects. Given the training data {y_i, x_i}_{i=1}^n, with a p-dimensional input vector x = [x_1, x_2, ..., x_p], let z_s be a subset of size s, such that z_s = {z_1, ..., z_s} ⊂ {x_1, ..., x_p}. For example, to study the main effect of variable j, we set the subset z_s = {z_j}, and to study the second order interaction of variables i and j, we set z_s = {z_i, z_j}. Let z_\s be the complement set of z_s, such that z_\s ∪ z_s = {x_1, ..., x_p}. Let the prediction F̂(z_s | z_\s) be a function of the subset z_s conditioned on specific values of z_\s. The partial dependence of F̂(x) on z_s can then be formulated as F̂(z_s | z_\s) averaged over the marginal density of the complement subset z_\s:

    F̂_s(z_s) = ∫ F̂(z_s | z_\s) p_\s(z_\s) dz_\s,    (27)

where p_\s(z_\s) = ∫ p(x) dz_s is the marginal density of z_\s. We estimate (27) by

    F̄_s(z_s) = (1/n) Σ_{i=1}^n F̂(z_s | z_{\s,i}),    (28)
where {z_{\s,i}}_{i=1}^n are evaluated at the training data. We then plot F̄_s(z_s) against z_s. We have included the partial dependence plot function in our R package TDboost. We will demonstrate this functionality in Section 6.

4.3.2 Variable importance

In many applications, identifying the relevant predictors of the model in the context of tree-based ensemble methods is of interest. The TDboost model defines a variable importance measure for each candidate predictor X_j in the set X = {X_1, ..., X_p} in terms of prediction/explanation of the response Y. The major advantage of this variable selection procedure, as compared to univariate screening methods, is that it considers the impact of each individual predictor as well as multivariate interactions among predictors simultaneously.

We start by defining the variable importance (VI henceforth) measure in the context of a single tree. First introduced by Breiman et al. (1984), the VI measure I_{X_j}(T_m) of the variable X_j in a single tree T_m is defined as the total heterogeneity reduction of the response variable Y produced by X_j, which can be estimated by adding up all the squared error reductions δ̂_l obtained in the L − 1 internal nodes when X_j is chosen as the splitting variable. Denote by v(X_j) = l the event that X_j is selected as the splitting variable in internal node l, and let I_{jl} = I(v(X_j) = l). Then

    I_{X_j}(T_m) = Σ_{l=1}^{L−1} δ̂_l I_{jl},    (29)

where δ̂_l is defined as the squared error difference between the constant fit and the two subregion fits (the subregion fits are achieved by splitting the region associated with internal node l into the left and right regions). Friedman (2001) extended the VI measure I_{X_j} to the boosting model with a combination of M regression trees by averaging (29) over {T_1, ..., T_M}:

    I_{X_j} = (1/M) Σ_{m=1}^M I_{X_j}(T_m).    (30)

Despite the wide use of the VI measure, Breiman et al. (1984), White and Liu (1994) and Kononenko (1995), among others, have pointed out that the VI measures (29) and (30)
are biased: even if X_j is a non-informative variable for Y (not correlated with Y), X_j may still be selected as a splitting variable, and hence the VI measure of X_j computed by Equation (30) is nonzero. Following Sandri and Zuccolotto (2008) and Sandri and Zuccolotto (2010), to avoid the variable selection bias we compute an adjusted VI measure for each explanatory variable by permuting each X_j:

(1) For s = 1, ..., S, repeat steps (2)-(4).
(2) Generate a matrix z^s by randomly permuting (without replacement) the n rows of the design matrix x, while keeping the order of the columns unchanged.
(3) Create an n × 2p matrix x^s = [x, z^s] by binding z^s with the x matrix by column.
(4) Use the data {y, x^s} to fit the model, and compute the VI measures I^s_{X_j} for X_j and I^s_{Z^s_j} for Z^s_j, where Z^s_j (the jth column of z^s) is the pseudo-predictor corresponding to X_j.
(5) Compute the VI measure I_{X_j} as the average of the I^s_{X_j} and the baseline I_{Z_j} as the average of the I^s_{Z^s_j}:

    I_{X_j} = (1/S) Σ_{s=1}^S I^s_{X_j},    I_{Z_j} = (1/S) Σ_{s=1}^S I^s_{Z^s_j}.    (31)

(6) Report the adjusted VI measure I^{adj}_{X_j} = I_{X_j} − I_{Z_j} for the variable X_j.

The basic idea of the above algorithm is the following: the permutation breaks the association between the response variable Y and each pseudo-predictor Z^s_j, but still preserves the association between Z^s_j and Z^s_k (k ≠ j); since Z^s_j is reshuffled from X_j, Z^s_j has the same number of possible splits as the corresponding predictor X_j and has approximately the same probability of being selected in split nodes. Therefore, I_{Z_j} can be viewed as an approximation of the bias in the importance of X_j.

4.4 Implementation

We have implemented our proposed method in an R package TDboost, which is publicly available from the Comprehensive R Archive Network at http://cran.r-project.org/web/packages/TDboost/index.html. Here, we discuss the choice of three meta parameters
in Algorithm 1: L (the size of the trees), ν (the shrinkage factor) and M (the number of boosting steps).

To avoid overfitting and improve out-of-sample predictions, the boosting procedure can be regularized by limiting the number of boosting iterations M (early stopping; Zhang and Yu, 2005) and by the shrinkage factor ν. Empirical evidence (Friedman, 2001; Bühlmann and Hothorn, 2007; Ridgeway, 2007; Elith et al., 2008) shows that the predictive accuracy is almost always better with a smaller shrinkage factor, at the cost of more computing time, since smaller values of ν usually require a larger number of boosting iterations M (Friedman, 2001). We choose a sufficiently small ν = 0.005 throughout and determine M from the data.

The value of L should reflect the true interaction order in the underlying model, but we almost never have such prior knowledge. Therefore we choose the optimal M and L using K-fold cross validation, starting with a fixed value of L. The data are split into K roughly equal-sized folds. Let an index function π(i): {1, ..., n} → {1, ..., K} indicate the fold to which observation i is allocated. Each time, we remove the kth fold of the data (k = 1, 2, ..., K) and train the model using the remaining K − 1 folds. Denoting the resulting model by F̂^[M]_{−k}(x), we compute the validation loss by predicting on the kth fold of the data that was removed:

    CV(M, L) = (1/n) Σ_{i=1}^n Ψ(y_i, F̂^[M]_{−π(i)}(x_i; L) | ρ).    (32)

We select the optimal M at which the minimum validation loss is reached:

    M_L* = argmin_M CV(M, L).

If we need to select L too, then we repeat the whole process for several L (e.g., L = 2, 3, 4, 5) and choose the one with the smallest minimum generalization error:

    L* = argmin_L CV(L, M_L*).

For a given ν, fitting trees with a higher L leads to a smaller M being required to reach the minimum error.
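The fold mechanics of (32) can be sketched as follows (a hypothetical Python illustration with all w_i = 1; the `fit` and `predict` arguments stand in for training and evaluating the boosted model at a given (M, L) pair):

```python
import numpy as np

def cv_loss(x, y, fit, predict, K=5, rho=1.5, seed=0):
    """K-fold cross-validation estimate of the Tweedie loss, as in eq. (32)."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(len(y)) % K          # fold assignment pi(i)
    losses = np.empty(len(y))
    for k in range(K):
        train, val = pi != k, pi == k
        model = fit(x[train], y[train])       # train on the other K - 1 folds
        F = predict(model, x[val])
        # Tweedie loss Psi(y, F | rho) with w_i = 1
        losses[val] = (-y[val] * np.exp((1 - rho) * F) / (1 - rho)
                       + np.exp((2 - rho) * F) / (2 - rho))
    return losses.mean()

# Example learner: the intercept-only fit of Algorithm 1, step 1 (w_i = 1)
const_fit = lambda x, y: np.log(y.mean())
const_pred = lambda model, x: np.full(len(x), model)
```

One would evaluate cv_loss over a grid of M for each L, take M_L* = argmin_M CV(M, L), and then L* = argmin_L CV(L, M_L*).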
5 Simulation Studies

In this section, we compare TDboost with the Tweedie GLM model (TGLM; Jørgensen and de Souza, 1994) and the Tweedie GAM model in terms of function estimation performance. The Tweedie GAM model, proposed by Wood (2001), is based on a penalized regression spline approach with automatic smoothness selection. There is an R package MGCV accompanying the work, available at http://cran.r-project.org/web/packages/mgcv/index.html. In all numerical examples below using the TDboost model, five-fold cross-validation is adopted for selecting the optimal (M, L) pair, while the shrinkage factor ν is set to its default value of 0.005.

5.1 Case I

In this simulation study, we demonstrate that TDboost is well suited to fit target functions that are nonlinear or involve complex interactions. We consider two true target functions:

Model 1 (Discontinuous function): The target function is discontinuous, defined by F(x) = 0.5 I(x > 0.5). We assume x ∼ Unif(0, 1) and y ∼ Tw(µ, φ, ρ) with ρ = 1.5 and φ = 0.5.

Model 2 (Complex interaction): The target function has two hills and two valleys,

\[ F(x_1, x_2) = e^{-5(1-x_1)^2 + x_2^2} + e^{-5x_1^2 + (1-x_2)^2}, \]

which corresponds to a common scenario where the effect of one variable changes depending on the value of another. We assume x_1, x_2 ∼ Unif(0, 1), and y ∼ Tw(µ, φ, ρ) with ρ = 1.5 and φ = 0.5.

We generate n = 1000 observations for training and n = 1000 for testing, and fit the training data using TDboost, MGCV, and TGLM. Since the true target functions are known, we use the mean absolute deviation (MAD) as the performance criterion:

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \big| F(x_i) - \hat F(x_i) \big|, \]
Model    TGLM              MGCV              TDboost
1        0.1102 (0.0006)   0.0752 (0.0016)   0.0595 (0.0021)
2        0.3516 (0.0009)   0.2511 (0.0004)   0.1034 (0.0008)

Table 1: The averaged MADs and the corresponding standard errors based on 100 independent replications.

where both the true predictor function F(x_i) and the predicted function F̂(x_i) are evaluated on the test set. The resulting MADs on the testing data, averaged over 100 independent replications, are reported in Table 1. The fitted functions from Model 2 are plotted in Figure 2. In both cases, we find that TDboost outperforms TGLM and MGCV in terms of the ability to recover the true functions and gives the smallest prediction errors.

5.2 Case II

The idea here is to assess the performance of the TDboost estimator and the MGCV estimator on a variety of very complicated, randomly generated predictor functions, and to study how the size of the training set, the distribution settings and other characteristics of the problem affect the final performance of the two methods. We use the random function generator (RFG) model of Friedman (2001) in our simulation. The true target function F is randomly generated as a linear expansion of functions {g_k}_{k=1}^{20}:

\[ F(x) = \sum_{k=1}^{20} b_k\, g_k(z_k). \qquad (33) \]

Here each coefficient b_k is a uniform random variable from Unif[−1, 1]. Each g_k(z_k) is a function of z_k, where z_k is a p_k-sized subset of the ten-dimensional variable x of the form

\[ z_k = \{x_{\psi_k(j)}\}_{j=1}^{p_k}, \qquad (34) \]

where each ψ_k is an independent permutation of the integers {1, ..., p}. The size p_k is randomly selected as min(⌊2.5 + r_k⌋, p), where r_k is generated from an exponential distribution with mean 2. Hence the expected order of interactions presented in each g_k(z_k) is between
Figure 2: Fitted functions recovering the target function defined in Model 2. Panel (a) shows the true target function F(x_1, x_2); panels (b), (c) and (d) show the predictions F̂(x_1, x_2) on the testing data from TDboost, TGLM and MGCV, respectively.
four and five. Each function g_k(z_k) is a p_k-dimensional Gaussian function,

\[ g_k(z_k) = \exp\Big\{ -\frac{1}{2} (z_k - u_k)^{\mathsf T} V_k (z_k - u_k) \Big\}, \qquad (35) \]

where each mean vector u_k is randomly generated from N(0, I_{p_k}). The p_k × p_k covariance matrix V_k is defined by

\[ V_k = U_k D_k U_k^{\mathsf T}, \qquad (36) \]

where U_k is a random orthonormal matrix, D_k = diag{d_k[1], ..., d_k[p_k]}, and the square root of each diagonal element d_k[j] is a uniform random variable from Unif[0.1, 2.0]. We generate data {y_i, x_i}_{i=1}^{n} according to

\[ y_i \sim \mathrm{Tw}(\mu_i, \phi, \rho), \qquad x_i \sim N(0, I_p), \qquad i = 1, \ldots, n, \qquad (37) \]

where µ_i = exp{F(x_i)}.

Setting I: when the index is known. First, we study the situation in which the true index parameter ρ is known when fitting the models. We generate data according to the RFG model with index parameter ρ = 1.5 and dispersion parameter φ = 1 in the true model. We set the number of predictors to p = 10 and generate n ∈ {1000, 2000, 5000} observations as training sets, on which both MGCV and TDboost are fitted with ρ specified at its true value 1.5. An additional test set of 5000 observations is generated for evaluating the performance of the final estimate.

Figure 3 shows simulation results comparing the estimation performance of MGCV and TDboost when varying the training sample size. The empirical distributions of the MADs, shown as boxplots, are based on 100 independent replications. We can see that in all of the cases TDboost outperforms MGCV in terms of prediction accuracy.

We also test the estimation performance on µ when the index parameter ρ is misspecified, that is, when we use a guess value differing from the true value ρ when fitting the TDboost model. Because µ is statistically orthogonal to φ and ρ, meaning that the off-diagonal elements of the Fisher information matrix are zero (Jørgensen, 1997), we expect µ̂ to vary very slowly as ρ changes. Indeed, using the previous simulation data with the true value ρ = 1.5 and
Figure 3: Simulation results for Setting I: comparison of the estimation performance of MGCV and TDboost when varying the training sample size (N = 1000, 2000, 5000) and the dispersion parameter in the true model. Boxplots display the empirical distributions of the MADs based on 100 independent replications.

φ = 1, we fitted TDboost models with nine guess values of ρ ∈ {1.1, 1.2, ..., 1.9}. The resulting MADs are displayed in Figure 4, which shows that the choice of the value of ρ has almost no significant effect on the estimation accuracy of µ.

Setting II: using the estimated index. Next we study the situation in which the true index parameter ρ is unknown, and we use the estimate of ρ obtained from the profile likelihood procedure discussed in Section 4.2 for fitting the model. The same data generation scheme is adopted as in Setting I, except that now both MGCV and TDboost are fitted with ρ estimated by maximizing the profile likelihood. Figure 5 shows simulation results comparing the estimation performance of MGCV and TDboost in this setting. We can see that the results are not significantly different from those of Setting I: TDboost still outperforms MGCV in terms of prediction accuracy when using the estimated ρ instead of the true value.

Lastly, we demonstrate the estimation of the dispersion φ and the index ρ by the profile likelihood. A total of 200 sets of training samples are randomly
Figure 4: Simulation results for Setting I when the index is misspecified: the estimation performance of TDboost when varying the value of the index parameter ρ ∈ {1.1, 1.2, ..., 1.9}. In the true model ρ = 1.5 and φ = 1. Boxplots show the empirical distributions of the MADs based on 200 independent replications.

Figure 5: Simulation results for Setting II: comparison of the estimation performance of MGCV and TDboost when varying the training sample size (N = 1000, 2000, 5000) and the dispersion parameter in the true model. Boxplots display the empirical distributions of the MADs based on 100 independent replications.
Figure 6: The curve represents the profile likelihood function of ρ from a single run. The dotted line shows the true value ρ = 1.7; the solid line shows the estimated value ρ̂ = 1.68 corresponding to the maximum likelihood. The associated estimated dispersion is φ̂ = 1.89.

generated from a true model according to setting (37) with φ = 2 and ρ = 1.7, each sample having 2000 observations. We fit the TDboost model on each sample and compute the estimate φ̂ at each of 50 equally spaced values {ρ_1, ..., ρ_50} on (1, 2). The pair (ρ_j, φ̂(ρ_j)) corresponding to the maximal profile likelihood is the estimate of (ρ, φ). The estimation process is repeated 200 times. The estimated indices have mean ρ̂ = 1.68 and standard error SE(ρ̂) = 0.026, so the true value ρ = 1.7 is within ρ̂ ± SE(ρ̂). The estimated dispersions have mean φ̂ = 1.82 and standard error SE(φ̂) = 0.12. Figure 6 shows the profile likelihood function of ρ for a single run.
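The simulated responses in these experiments follow y ∼ Tw(µ, φ, ρ) with 1 < ρ < 2, which can be drawn exactly through the compound Poisson–gamma representation of the Tweedie distribution (Jørgensen, 1997). The following is a minimal Python sketch for illustration only; the function name `rtweedie` is ours and is not part of the TDboost package.

```python
import numpy as np

def rtweedie(mu, phi, rho, rng=None):
    """Draw y ~ Tw(mu, phi, rho) for 1 < rho < 2 via the compound
    Poisson-gamma representation: y is a Poisson(lambda) sum of i.i.d.
    Gamma claims, so P(y = 0) = exp(-lambda) gives the point mass at zero."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    lam = mu ** (2 - rho) / (phi * (2 - rho))      # Poisson mean
    alpha = (2 - rho) / (rho - 1)                  # Gamma shape per claim
    scale = phi * (rho - 1) * mu ** (rho - 1)      # Gamma scale per claim
    n_claims = rng.poisson(lam)
    # a sum of n i.i.d. Gamma(alpha, scale) draws is Gamma(n * alpha, scale)
    y = rng.gamma(np.maximum(n_claims, 1) * alpha, scale)
    return np.where(n_claims > 0, y, 0.0)
```

One can check from this parameterization that E[y] = µ and Var(y) = φµ^ρ, matching the Tweedie mean–variance relation.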
Total Claim Amount   % obs.   % of total sum   Mean    Median
0                    61.1     0                0       0
(0, 10000]           29.6     36.0             4902    4842
(10000, 50000]       9.1      61.5             27144   27679
> 50000              0.2      2.5              52157   51986

Table 2: Description of the individual total claim amounts over the last five years.

6 Application: Automobile Claims

We consider an auto insurance claim dataset as analyzed in Yip and Yau (2005) and Zhang and Yu (2005). The data set contains 10,296 driver-vehicle records, each record including an individual driver's total claim amount (z_i) in the last five years (w_i = 5) and 17 characteristics x_i = (x_{i,1}, ..., x_{i,17}) of the driver and the insured vehicle. We want to predict the expected pure premium based on x_i. Tables 2 and 3 summarize the data set. The histogram of the total claim amounts in Figure 1 shows that the empirical distribution of these values is highly skewed. We find that approximately 61.1% of the policyholders had no claims, and approximately 29.6% had a positive claim amount up to 10,000 dollars. Note that only 9.3% of the policyholders had a high claim amount above 10,000 dollars, but the sum of their claim amounts made up 64% of the overall sum.

We separate the entire dataset into a training set and a testing set of equal size. The TDboost model is then fitted on the training set and tuned with five-fold cross-validation. For comparison, we also fit TGLM and MGCV, both using all the explanatory variables. In MGCV, the numerical variables AGE, BLUEBOOK, HOMEKIDS, INCOME, KIDSDRIV, MVR_PTS, NPOLICY, RETAINED and TRAVTIME are modeled by smooth terms represented using penalized regression splines. We find the appropriate smoothness for each applicable model term using generalized cross-validation (GCV) (Craven and Wahba, 1978; Wahba, 1990). For the TDboost model, it is not necessary to carry out data transformation, since the tree-based boosting method can automatically handle different types of data. For the other models, we apply logarithmic transformations to two variables, i.e.
log(BLUEBOOK) and log(INCOME + 10), and scale all the numerical variables except HOMEKIDS, KIDSDRIV, MVR_PTS and NPOLICY to have mean 0 and standard deviation 1. We also create dummy variables for the categorical variables with more
ID  Variable   Type  Description
1   AGE        N     Driver's age
2   BLUEBOOK   N     Value of vehicle
3   HOMEKIDS   N     Number of children
4   INCOME     N     Annual income
5   KIDSDRIV   N     Number of driving children
6   MVR_PTS    N     Motor vehicle record points
7   NPOLICY    N     Number of policies
8   RETAINED   N     Number of years as a customer
9   TRAVTIME   N     Distance to work
10  AREA       C     Home/work area: Rural, Urban
11  CAR_USE    C     Vehicle use: Commercial, Private
12  CAR_TYPE   C     Type of vehicle: Panel Truck, Pickup, Sedan, Sports Car, SUV, Van
13  GENDER     C     Driver's gender: F, M
14  JOBCLASS   C     Unknown, Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student
15  MAX_EDUC   C     Education level: High School or Below, Bachelors, High School, Masters, PhD
16  MARRIED    C     Married or not: Yes, No
17  REVOKED    C     Whether license revoked in past 7 years: Yes, No

Table 3: Explanatory variables in the claim history data set. Type N stands for numerical variable, Type C stands for categorical variable.

          AGE     INCOME   HOMEKIDS  BLUEBOOK  KIDSDRIV
Min.      16.00   0        0.0000    1500      0.0000
1st Qu.   39.00   27620    0.0000    9200      0.0000
Median    45.00   53564    0.0000    14405     0.0000
Mean      44.84   61610    0.7199    15666     0.1694
3rd Qu.   51.00   86214    1.0000    20900     0.0000
Max.      81.00   367030   5.0000    69740     4.0000

          NPOLICY  RETAINED  TRAVTIME  MVR_PTS
Min.      1.000    1.000     5.00      0.000
1st Qu.   1.000    1.000     22.00     0.000
Median    1.000    4.000     33.00     1.000
Mean      1.695    5.328     33.42     1.709
3rd Qu.   2.000    7.000     44.00     3.000
Max.      9.000    25.000    142.00    13.000

Table 4: Descriptive statistics for the continuous variables in the claim history data set in Section 6.
AREA                 MARRIED       REVOKED      GENDER
Rural: 20.2%         No: 39.9%     No: 87.8%    F: 53.8%
Urban: 79.8%         Yes: 60.1%    Yes: 12.2%   M: 46.2%

CAR_USE              MAX_EDUC              CAR_TYPE             JOBCLASS
Private: 63.2%       <High School: 14.6%   Panel Truck: 8.3%    Blue Collar: 22.2%
Commercial: 36.8%    Bachelors: 27.3%      Pickup: 17.3%        Clerical: 15.5%
                     High School: 28.7%    Sedan: 26.2%         Professional: 13.6%
                     Masters: 20.2%        Sports Car: 11.4%    Manager: 12.2%
                     PhD: 9.2%             SUV: 27.9%           Lawyer: 10.0%
                                           Van: 8.9%            Student: 8.7%
                                                                (Other): 17.8%

Table 5: Descriptive statistics for the categorical variables in the claim history data set in Section 6.

Model Parameter   TGLM   MGCV   TDboost
Index ρ           1.37   1.34   1.34
Dispersion φ      2.31   2.15   2.05

Table 6: The estimated ρ and φ of the models TGLM, MGCV and TDboost using the profile likelihood method.

than two levels (CAR_TYPE, JOBCLASS and MAX_EDUC). For all models, we use the profile likelihood method to estimate the dispersion φ and the index ρ, which are in turn used in fitting the final models. The estimated values of φ and ρ are reported in Table 6. We see that the estimated value of the dispersion parameter in TGLM is greater than those in MGCV and TDboost.

To examine the performance of TGLM, MGCV and TDboost, after fitting on the training set, we predict the pure premium P(x) = µ̂(x) by applying each model to the independent held-out testing set. However, care must be taken when measuring the differences between the predicted premiums P(x) and the realized losses y on the testing data. The mean squared loss or mean absolute loss is not appropriate here because the losses have a high proportion of zeros and are highly right-skewed. Therefore an alternative statistical measure, the ordered Lorenz curve and the associated Gini index proposed by Frees et al. (2011), is used for capturing the discrepancy between the premium and loss distributions. By calculating the
Gini index, the performance of different predictive models can be compared. Here we only briefly explain the idea of the ordered Lorenz curve (Frees et al., 2011, 2013). Let B(x) be the base premium, which is calculated using the existing premium prediction model, and let P(x) be the competing premium, calculated using an alternative premium prediction model. In the ordered Lorenz curve, the distribution of losses and the distribution of premiums are sorted based on the relative premium R(x) = P(x)/B(x). The ordered premium distribution is

\[ \hat D_P(s) = \frac{\sum_{i=1}^{n} B(x_i)\, I\{R(x_i) \le s\}}{\sum_{i=1}^{n} B(x_i)}, \]

and the ordered loss distribution is

\[ \hat D_L(s) = \frac{\sum_{i=1}^{n} y_i\, I\{R(x_i) \le s\}}{\sum_{i=1}^{n} y_i}. \]

The two empirical distributions are based on the same sort order, which makes it possible to compare the premium and loss distributions for the same policyholder group. The ordered Lorenz curve is the graph of (D̂_P(s), D̂_L(s)). When the percentage of losses equals the percentage of premiums for the insurer, the curve is a 45-degree line, known as the line of equality. Twice the area between the ordered Lorenz curve and the line of equality measures the discrepancy between the premium and loss distributions, and is defined as the Gini index. Curves below the line of equality indicate that, given knowledge of the relative premium, an insurer could identify the profitable contracts, whose premiums are greater than their losses. Therefore, a larger Gini index (hence a larger area between the line of equality and the curve below it) implies a more favorable model.

Following Frees et al. (2013), we successively specify the prediction from each model as the base premium B(x) and use the predictions from the remaining models as the competing premium P(x) to compute the Gini indices. The entire procedure of data splitting and Gini index computation is repeated 20 times, and a matrix of the averaged Gini indices and standard errors is reported in Table 7.
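The Gini index computation just described reduces to sorting policies by R(x) and accumulating premium and loss shares. The following Python sketch illustrates the idea under our reading of Frees et al. (2011); the function name `gini_index` and the trapezoid-rule approximation are our choices, not part of any package discussed here.

```python
import numpy as np

def gini_index(y, base, compete):
    """Gini index from the ordered Lorenz curve (Frees et al., 2011).

    Policies are sorted by the relative premium R = P/B; D_P and D_L are
    the premium and loss shares accumulated in that order, and the Gini
    index is twice the area between the 45-degree line of equality and
    the curve (D_P, D_L)."""
    y = np.asarray(y, dtype=float)
    base = np.asarray(base, dtype=float)
    compete = np.asarray(compete, dtype=float)
    order = np.argsort(compete / base)             # sort by R(x) = P(x)/B(x)
    dp = np.concatenate([[0.0], np.cumsum(base[order]) / base.sum()])
    dl = np.concatenate([[0.0], np.cumsum(y[order]) / y.sum()])
    # area under the ordered Lorenz curve, via the trapezoid rule
    area_under = np.sum((dl[1:] + dl[:-1]) / 2.0 * np.diff(dp))
    return 2.0 * (0.5 - area_under)                # twice area above the curve
```

A competing premium that ranks losses well pushes the curve below the line of equality and yields a large positive Gini index against a flat base premium.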
To pick the best model, we use a minimax strategy (Frees et al., 2013) to select the base premium model that is least vulnerable to competing premium models; that is, we select the model that provides the smallest of the maximal Gini indices, taken over the competing premiums. We find that the maximal Gini index
                         Competing Premium
Base Premium   TGLM            MGCV            TDboost
TGLM           0               6.432 (0.313)   14.324 (0.415)
MGCV           8.052 (0.551)   0               14.020 (0.497)
TDboost        4.677 (0.418)   2.818 (0.456)   0

Table 7: The averaged Gini indices and standard errors in the auto insurance claim data example based on 20 random splits.

is 14.324 when using B(x) = µ̂_TGLM(x) as the base premium, 14.020 when B(x) = µ̂_MGCV(x), and 4.677 when B(x) = µ̂_TDboost(x). Therefore, TDboost has the smallest maximal Gini index at 4.677, and hence is the least vulnerable to alternative scores. Figure 7 also shows that when TGLM (or MGCV) is selected as the base premium, the area between the line of equality and the ordered Lorenz curve is larger when choosing TDboost as the competing premium, indicating again that the TDboost model represents the most favorable choice.

Figure 7: The ordered Lorenz curves for the auto insurance claim data.

Next, we focus on the analysis using the TDboost model. Several explanatory variables are significantly related to the pure premium. The VI measure and the baseline value of each explanatory variable are shown in Figure 8. We find that REVOKED, MVR_PTS, AREA and INCOME have high VI measure scores (the vertical line), and their scores all surpass the corresponding baselines (the horizontal line length), indicating that the importance of those explanatory variables is real. We also find that the variables AGE, CAR_TYPE, JOBCLASS, NPOLICY, MARRIED, KIDSDRIV, MAX_EDUC and CAR_USE have larger-than-baseline VI measure scores, but their absolute scales are much smaller than those of the four variables mentioned above. On the other hand, although the VI measure of, e.g., BLUEBOOK is quite large, it does not significantly surpass the baseline importance.

Figure 8: The variable importance measures and baselines of the 17 explanatory variables for modeling the pure premium. The horizontal axis is the fraction of reduction in sum of squared error in gradient prediction.

We now use partial dependence plots to visualize the fitted model. Figure 9 shows the main effects of four important explanatory variables on the pure premium. We clearly see strong nonlinear effects for the predictors INCOME and MVR_PTS: for policyholders with annual income below 163 (in $1000), the pure premium is negatively associated with income; after the income passes 163, the pure premium gradually increases with income until the curve reaches a plateau when the income passes 237. Additionally, the pure premium is positively associated with the motor vehicle record points MVR_PTS, but the curve reaches a plateau when MVR_PTS exceeds five. On the other hand, the partial dependence plots suggest that a policyholder who lives in an urban area (AREA = "Urban") or whose driver's license has been revoked (REVOKED = "Yes")
typically has a relatively high pure premium.

Figure 9: Marginal effects of the four most significant explanatory variables (REVOKED, AREA, INCOME and MVR_PTS) on the pure premium (in $1000).

In our model, the data-driven choice for the tree size is L = 7, which means that our model includes higher-order interactions. In Figure 10, we visualize the effects of four important second-order interactions using joint partial dependence plots. These four interactions are AREA × MVR_PTS, AREA × REVOKED, AREA × NPOLICY and INCOME × MVR_PTS. The first three interactions all involve the variable AREA: we can see that the marginal effects of MVR_PTS, REVOKED and NPOLICY on the pure premium are greater for policyholders living in urban areas (AREA = "Urban") than for those living in rural areas (AREA = "Rural"). Also, a strong INCOME × MVR_PTS interaction suggests
Figure 10: Joint partial dependence plots of µ̂(x) for four strong pairwise interactions: (a) AREA × MVR_PTS, (b) AREA × REVOKED, (c) AREA × NPOLICY and (d) INCOME × MVR_PTS.
that when the MVR_PTS value is low, the income values of the policyholders do not have a strong marginal effect on the expected pure premium; but when the MVR_PTS value is high, the partial dependence of the pure premium on income becomes stronger.

7 Conclusions

The need for nonlinear risk factors as well as risk factor interactions in modeling insurance claim sizes is well recognized by actuarial practitioners, but practical tools to study them are very limited. In this paper, relying on neither the linear assumption nor a prespecified interaction structure, a flexible tree-based gradient boosting method is designed for the Tweedie model. We implement the proposed method in a user-friendly R package TDboost that can make accurate insurance premium predictions for complex data sets and serve as a convenient tool for actuarial practitioners to investigate the nonlinear and interaction effects.

In the context of personal auto insurance, we implicitly use the policy duration as a volume measure (or exposure), and demonstrate the favorable prediction performance of TDboost for the pure premium. In cases where exposure measures other than duration are used, which is common in commercial insurance, we can extend the TDboost method to the corresponding claim size by simply replacing the duration with any chosen exposure measure. We also want to point out that TDboost can be an important complement to the traditional GLM model in insurance rating. Even under the strict circumstances in which regulators demand that the final model have a GLM structure, our approach can still be quite helpful due to its ability to extract additional information such as nonmonotonicity/nonlinearity and important interactions. In Appendix A, we provide an additional real data analysis to demonstrate that our method can provide insights into the structure of interaction terms.
After integrating the obtained information about the interaction terms into the original GLM model, we can greatly enhance the overall accuracy of the insurance premium prediction while maintaining a GLM model structure.
References

Anstey, K. J., Wood, J., Lord, S., and Walker, J. G. (2005), "Cognitive, sensory and physical factors enabling driving safety in older adults," Clinical Psychology Review, 25, 45–65.

Bar-Lev, S. K. and Stramer, O. (1987), "Characterizations of natural exponential families with power variance functions by zero regression properties," Probability Theory and Related Fields, 76, 509–522.

Breiman, L. (1998), "Arcing classifiers (with discussion and a rejoinder by the author)," The Annals of Statistics, 26, 801–849.

— (1999), "Prediction games and arcing algorithms," Neural Computation, 11, 1493–1517.

Breiman, L., Friedman, J., Olshen, R., Stone, C., Steinberg, D., and Colla, P. (1984), CART: Classification and Regression Trees, Wadsworth.

Brent, R. P. (2013), Algorithms for Minimization Without Derivatives, Courier Dover Publications.

Bühlmann, P. and Hothorn, T. (2007), "Boosting algorithms: Regularization, prediction and model fitting," Statistical Science, 22, 477–505.

Chiappori, P.-A. and Salanie, B. (2000), "Testing for asymmetric information in insurance markets," The Journal of Political Economy, 108, 56–78.

Craven, P. and Wahba, G. (1978), "Smoothing noisy data with spline functions," Numerische Mathematik, 31, 377–403.

Dionne, G., Gouriéroux, C., and Vanasse, C. (2001), "Testing for evidence of adverse selection in the automobile insurance market: A comment," Journal of Political Economy, 109, 444–453.

Dunn, P. K. and Smyth, G. K. (2005), "Series evaluation of Tweedie exponential dispersion model densities," Statistics and Computing, 15, 267–280.

— (2008), "Evaluation of Tweedie exponential dispersion model densities by Fourier inversion," Statistics and Computing, 18, 73–86.
Elith, J., Leathwick, J. R., and Hastie, T. (2008), "A working guide to boosted regression trees," Journal of Animal Ecology, 77, 802–813.

Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Wiley & Sons, New York.

Frees, E. W., Meyers, G., and Cummings, A. D. (2011), "Summarizing insurance scores using a Gini index," Journal of the American Statistical Association, 106.

Frees, E. W. J., Meyers, G., and Cummings, A. D. (2013), "Insurance ratemaking and a Gini index," Journal of Risk and Insurance.

Freund, Y. and Schapire, R. (1996), "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann Publishers, Inc., pp. 148–156.

— (1997), "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55, 119–139.

Friedman, J. (2001), "Greedy function approximation: A gradient boosting machine," The Annals of Statistics, 29, 1189–1232.

Friedman, J., Hastie, T., and Tibshirani, R. (2000), "Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)," The Annals of Statistics, 28, 337–407.

Friedman, J. H. (2002), "Stochastic gradient boosting," Computational Statistics & Data Analysis, 38, 367–378.

Haberman, S. and Renshaw, A. E. (1996), "Generalized linear models and actuarial science," The Statistician, 45, 407–436.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer Series in Statistics, Springer.

Hastie, T. J. and Tibshirani, R. J. (1990), Generalized Additive Models, vol. 43, CRC Press.
Jørgensen, B. (1987), "Exponential dispersion models," Journal of the Royal Statistical Society, Series B (Methodological), 127–162.

— (1997), The Theory of Dispersion Models, vol. 76, CRC Press.

Jørgensen, B. and de Souza, M. C. (1994), "Fitting Tweedie's compound Poisson model to insurance claims data," Scandinavian Actuarial Journal, 1994, 69–93.

Kononenko, I. (1995), "On biases in estimating multi-valued attributes," in Proceedings of the 14th International Joint Conference on Artificial Intelligence — Volume 2, Morgan Kaufmann Publishers Inc., pp. 1034–1040.

McCartt, A. T., Shabanova, V. I., and Leaf, W. A. (2003), "Driving experience, crashes and traffic citations of teenage beginning drivers," Accident Analysis & Prevention, 35, 311–320.

Mildenhall, S. J. (1999), "A systematic relationship between minimum bias and generalized linear models," in Proceedings of the Casualty Actuarial Society, vol. 86, pp. 393–487.

Murphy, K. P., Brockman, M. J., and Lee, P. K. (2000), "Using generalized linear models to build dynamic pricing systems," in Casualty Actuarial Society Forum, Winter, pp. 107–139.

Nelder, J. and Wedderburn, R. (1972), "Generalized linear models," Journal of the Royal Statistical Society, Series A (General), 135, 370–384.

Ohlsson, E. and Johansson, B. (2010), Non-Life Insurance Pricing with Generalized Linear Models, Springer.

Owsley, C., Ball, K., Sloane, M. E., Roenker, D. L., and Bruni, J. R. (1991), "Visual/cognitive correlates of vehicle accidents in older drivers," Psychology and Aging, 6, 403.

Peters, G. W., Shevchenko, P. V., and Wüthrich, M. V. (2008), "Model risk in claims reserving within Tweedie's compound Poisson models," ASTIN Bulletin, to appear.
Quijano Xacur, O. A. et al. (2011), "Property and casualty premiums based on Tweedie families of generalized linear models," Ph.D. thesis, Concordia University.

Renshaw, A. E. (1994), "Modelling the claims process in the presence of covariates," ASTIN Bulletin, 24, 265–285.

Ridgeway, G. (2007), Generalized Boosted Regression Models, R package manual.

Sandri, M. and Zuccolotto, P. (2008), "A bias correction algorithm for the Gini variable importance measure in classification trees," Journal of Computational and Graphical Statistics, 17.

— (2010), "Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms," Statistics and Computing, 20, 393–407.

Showers, V. E. and Shotick, J. A. (1994), "The effects of household characteristics on demand for insurance: A tobit analysis," Journal of Risk and Insurance, 492–502.

Smyth, G. and Jørgensen, B. (2002), "Fitting Tweedie's compound Poisson model to insurance claims data: Dispersion modelling," ASTIN Bulletin, 32, 143–157.

Smyth, G. K. (1996), "Regression analysis of quantity data with exact zeros," in Proceedings of the Second Australia–Japan Workshop on Stochastic Models in Engineering, Technology and Management, Citeseer, pp. 572–580.

Tweedie, M. (1984), "An index which distinguishes between some important exponential families," in Statistics: Applications and New Directions: Proceedings of the Indian Statistical Institute Golden Jubilee International Conference, pp. 579–604.

Van de Ven, W. and van Praag, B. M. (1981), "Risk aversion and deductibles in private health insurance: Application of an adjusted tobit model to family health care expenditures," Health, Economics, and Health Economics, 125–148.

Wahba, G. (1990), Spline Models for Observational Data, vol. 59, SIAM.

White, A. P. and Liu, W. Z. (1994), "Technical note: Bias in information-based measures in decision tree induction," Machine Learning, 15, 321–329.
Wood, S. (2001), "mgcv: GAMs and generalized ridge regression for R," R News, 1, 20–25.

— (2006), Generalized Additive Models: An Introduction with R, CRC Press.

Yip, K. C. and Yau, K. K. (2005), "On modeling claim frequency data in general insurance with extra zeros," Insurance: Mathematics and Economics, 36, 153–163.

Zhang, T. and Yu, B. (2005), "Boosting with early stopping: Convergence and consistency," The Annals of Statistics, 1538–1579.

Zhang, W. (2011), cplm: Monte Carlo EM algorithms and Bayesian methods for fitting Tweedie compound Poisson linear models, R package, http://cran.r-project.org/web/packages/cplm/index.html.