A Practitioner's Guide to Generalized Linear Models


A CAS Study Note

Duncan Anderson, FIA
Sholom Feldblum, FCAS
Claudine Modlin, FCAS
Doris Schirmacher, FCAS
Ernesto Schirmacher, ASA
Neeza Thandi, FCAS

Third Edition, February 2007

The Practitioner's Guide to Generalized Linear Models is written for the practicing actuary who would like to understand generalized linear models (GLMs) and use them to analyze insurance data. The guide is divided into three sections.

Section 1 provides a foundation for the statistical theory and gives illustrative examples and intuitive explanations which clarify the theory. The intuitive explanations build upon more commonly understood actuarial methods such as linear models and the minimum bias procedures.

Section 2 provides practical insights and realistic model output for each stage of a GLM analysis, including data preparation and preliminary analyses, model selection and iteration, model refinement and model interpretation. This section is designed to boost the actuary's confidence in interpreting GLMs and applying them to solve business problems.

Section 3 discusses other topics of interest relating to GLMs such as retention modeling and scoring algorithms.

More technical material in the paper is set out in appendices.

Acknowledgements

The authors would like to thank James Tanser, FIA, for some helpful comments and contributions to some elements of this paper, Shaun Wang, FCAS, for reviewing the paper prior to inclusion on the CAS exam syllabus, and Volker Wilmsen for some helpful comments on the Second Edition of this paper.

Contents

Section
1 GLMs - theory and intuition
2 GLMs in practice
3 Other applications of GLMs

Bibliography

Appendix
A The design matrix when variates are used
B The exponential family of distributions
C The Tweedie distribution
D Canonical link functions
E Solving for maximum likelihood in the general case of an exponential distribution
F Example of solving for maximum likelihood with a gamma error and inverse link function
G Data required for a GLM claims analysis
H Automated approach for factor categorization
I Cramer's V
J Benefits of modeling frequency and severity separately rather than using Tweedie GLMs

1 GLMs - theory and intuition

1.1 Section 1 discusses how GLMs are formularized and solved. The following topics are covered in detail:
- background of GLMs: building upon traditional actuarial methods such as minimum bias procedures and linear models
- introduction to the statistical framework of GLMs
- formularization of GLMs: including the linear predictor, the link function, the offset term, the error term, the scale parameter and the prior weights
- typical model forms
- solving GLMs: maximum likelihood estimation and numerical techniques
- aliasing
- model diagnostics: standard errors and deviance tests.

Background

1.2 Traditional ratemaking methods in the United States are not statistically sophisticated. Claims experience for many lines of business is often analyzed using simple one-way and two-way analyses. Iterative methods known as minimum bias procedures, developed by actuaries in the 1960s, provide a significant improvement, but are still only part way toward a full statistical framework.

1.3 The classical linear model and many of the most common minimum bias procedures are, in fact, special cases of generalized linear models (GLMs). The statistical framework of GLMs allows explicit assumptions to be made about the nature of the insurance data and its relationship with predictive variables. The method of solving GLMs is more technically efficient than iteratively standardized methods, which is not only elegant in theory but valuable in practice. In addition, GLMs provide statistical diagnostics which aid in selecting only significant variables and in validating model assumptions.

1.4 Today GLMs are widely recognized as the industry standard method for pricing private passenger auto and other personal lines and small commercial lines insurance in the European Union and many other markets. Most British, Irish and French auto insurers use GLMs to analyze their portfolios and to the authors' knowledge GLMs are commonly used in Italy, the Netherlands, Scandinavia, Spain, Portugal, Belgium, Switzerland, South Africa, Israel and Australia. The method is gaining popularity in Canada, Japan, Korea, Brazil, Singapore, Malaysia and eastern European countries.

1.5 The primary applications of GLMs in insurance analysis are ratemaking and underwriting. Circumstances that limit the ability to change rates at will (e.g. regulation) have increased the use of GLMs for target marketing analysis.

The failings of one-way analysis

1.6 In the past, actuaries have relied heavily on one-way analyses for pricing and monitoring performance.

1.7 A one-way analysis summarizes insurance statistics, such as frequency or loss ratio, for each value of each explanatory variable, but without taking account of the effect of other variables. Explanatory variables can be discrete or continuous. Discrete variables are generally referred to as "factors", with values that each factor can take being referred to as "levels", and continuous variables are generally referred to as "variates". The use of variates is generally less common in insurance modeling.

1.8 One-way analyses can be distorted by correlations between rating factors. For example, young drivers may in general drive older cars. A one-way analysis of age of car may show high claims experience for older cars; however, this may result mainly from the fact that such older cars are in general driven more by high risk younger drivers. Relativities based on one-way analyses of age of vehicle and age of driver would double-count the effect of age of driver. Traditional actuarial techniques for addressing this problem usually attempt to standardize the data in such a way as to remove the distorting effect of uneven business mix, for example by focusing on loss ratios on a one-way basis, or by standardizing for the effect of one or more factors. These methods are, however, only approximations.

1.9 One-way analyses also do not consider interdependencies between factors in the way they affect claims experience. These interdependencies, or interactions, exist when the effect of one factor varies depending on the levels of another factor. For example, the pure premium differential between men and women may differ by levels of age.

1.10 Multivariate methods, such as generalized linear models, adjust for correlations and allow investigation into interaction effects.

The failings of minimum bias procedures

1.11 In the 1960s, actuaries developed a ratemaking technique known as minimum bias procedures. [1] These procedures impose a set of equations relating the observed data, the rating variables, and a set of parameters to be determined. An iterative procedure solves the system of equations by attempting to converge to the optimal solution. The reader seeking more information may reference "The Minimum Bias Procedure: A Practitioner's Guide" by Sholom Feldblum and Dr J. Eric Brosius. [2]

[1] Bailey, Robert A. and LeRoy J. Simon, "Two Studies in Automobile Insurance Ratemaking," Proceedings of the Casualty Actuarial Society, XLVII, 1960.
[2] Feldblum, Sholom and Brosius, J Eric, "The Minimum Bias Procedures: A Practitioner's Guide", Casualty Actuarial Society Forum, Vol: Fall Pages:

1.12 Once an optimal solution is calculated, however, the minimum bias procedures give no systematic way of testing whether a particular variable influences the result with statistical significance. There is also no credible range provided for the parameter estimates. The minimum bias procedures lack a statistical framework which would allow actuaries to assess better the quality of their modeling work.

The connection of minimum bias to GLM

1.13 Stephen Mildenhall has written a comprehensive paper showing that many minimum bias procedures do correspond to generalized linear models. [3] The following table summarizes the correspondence for many of the more common minimum bias procedures. The GLM terminology (link function and error function) is explained in depth later in this section. In brief, these functions are key components for specifying a generalized linear model.

| Minimum bias procedure | GLM link function | GLM error function |
|---|---|---|
| Multiplicative balance principle | Logarithmic | Poisson |
| Additive balance principle | Identity | Normal |
| Multiplicative least squares | Logarithmic | Normal |
| Multiplicative maximum likelihood with exponential density function | Logarithmic | Gamma |
| Multiplicative maximum likelihood with Normal density function | Logarithmic | Normal |
| Additive maximum likelihood with Normal density function | Identity | Normal |

1.14 Not all minimum bias procedures have a generalized linear model analog and vice versa. For example, the $\chi^2$ additive and multiplicative minimum bias models have no corresponding generalized linear model analog.

Linear models

1.15 A GLM is a generalized form of a linear model. To understand the structure of generalized linear models it is helpful, therefore, to review classic linear models.

1.16 The purpose of both linear models (LMs) and generalized linear models is to express the relationship between an observed response variable, $Y$, and a number of covariates (also called predictor variables), $X$. Both models view the observations, $Y_i$, as being realizations of the random variable $Y$.

[3] Mildenhall, Stephen, "A Systematic Relationship between Minimum Bias and Generalized Linear Models", Proceedings of the Casualty Actuarial Society, LXXXVI,

1.17 Linear models conceptualize $Y$ as the sum of its mean, $\mu$, and a random variable, $\varepsilon$:

$$Y = \mu + \varepsilon$$

1.18 They assume that
a. the expected value of $Y$, $\mu$, can be written as a linear combination of the covariates, $X$, and
b. the error term, $\varepsilon$, is Normally distributed with mean zero and variance $\sigma^2$.

1.19 For example, suppose a simple private passenger auto classification system has two categorical rating variables: territory (urban or rural) and gender (male or female). Suppose the observed average claim severities are:

| | Urban | Rural |
|---|---|---|
| Male | 800 | 500 |
| Female | 400 | 200 |

1.20 The response variable, $Y$, is the average claim severity. The two factors, territory and gender, each have two levels, resulting in the four covariates: male ($X_1$), female ($X_2$), urban ($X_3$), and rural ($X_4$). These indicator variables take the value 1 or 0. For example, the urban covariate, $X_3$, is equal to 1 if the territory is urban, and 0 otherwise.

1.21 The linear model seeks to express the observed item $Y$ (in this case average claim severity) as a linear combination of a specified selection of the four variables, plus a Normal random variable $\varepsilon$ with mean zero and variance $\sigma^2$, often written $\varepsilon \sim N(0, \sigma^2)$. One such model might be

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$$

1.22 However this model has as many parameters as it does combinations of rating factor levels being considered, and there is a linear dependency between the four covariates $X_1, X_2, X_3, X_4$. This means that the model in the above form is not uniquely defined: if any arbitrary value $k$ is added to both $\beta_1$ and $\beta_2$, and the same value $k$ is subtracted from $\beta_3$ and $\beta_4$, the resulting model is equivalent.
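The linear dependency among the four indicator covariates can be checked numerically. The following is a minimal sketch using NumPy; the parameter values and the shift `k` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Rows: urban male, rural male, urban female, rural female.
# Columns: male (X1), female (X2), urban (X3), rural (X4).
X_full = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)

# Four columns, but X1 + X2 = X3 + X4 (both are a column of ones),
# so the columns span only three dimensions.
rank = np.linalg.matrix_rank(X_full)

# Hypothetical parameter values (not taken from the paper).
beta = np.array([300.0, 100.0, 400.0, 150.0])
k = 42.0
# Add k to beta_1 and beta_2, subtract k from beta_3 and beta_4.
beta_shifted = beta + np.array([k, k, -k, -k])

fitted = X_full @ beta
fitted_shifted = X_full @ beta_shifted

print("rank of design matrix:", rank)
print("same fitted values:", np.allclose(fitted, fitted_shifted))
```

Because every observation has exactly one gender indicator and one territory indicator switched on, the shift cancels in each row, which is why the four-parameter form cannot be estimated uniquely.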

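1.23 To make the model uniquely defined in the parameters $\beta_j$ consider instead the model

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

1.24 This model is equivalent to assuming that there is an average response for men ($\beta_1$) and an average response for women ($\beta_2$), with the effect of being an urban policyholder ($\beta_3$), as opposed to being a rural one, having an additional additive effect which is the same regardless of gender.

1.25 Alternatively this could be thought of as a model which assumes an average response for the "base case" of women in rural areas ($\beta_2$) with additional additive effects for being male ($\beta_1 - \beta_2$) and for being in an urban area ($\beta_3$).

1.26 Thus the four observations can be expressed as the system of equations:

$$\begin{aligned}
Y_1 &= 800 = \beta_1 + 0 + \beta_3 + \varepsilon_1 \\
Y_2 &= 500 = \beta_1 + 0 + 0 + \varepsilon_2 \\
Y_3 &= 400 = 0 + \beta_2 + \beta_3 + \varepsilon_3 \\
Y_4 &= 200 = 0 + \beta_2 + 0 + \varepsilon_4
\end{aligned}$$

1.27 The parameters $\beta_1, \beta_2, \beta_3$ which best explain the observed data are then selected. For the classical linear model this is done by minimizing the sum of squared errors (SSE):

$$SSE = \varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 = (800 - \beta_1 - \beta_3)^2 + (500 - \beta_1)^2 + (400 - \beta_2 - \beta_3)^2 + (200 - \beta_2)^2$$

1.28 This expression can be minimized by taking derivatives with respect to $\beta_1$, $\beta_2$ and $\beta_3$ and setting each of them to zero. The resulting system of three equations in three unknowns is:

$$\begin{aligned}
\frac{\partial SSE}{\partial \beta_1} = 0 &\;\Rightarrow\; 2\beta_1 + \beta_3 = 800 + 500 = 1300 \\
\frac{\partial SSE}{\partial \beta_2} = 0 &\;\Rightarrow\; 2\beta_2 + \beta_3 = 400 + 200 = 600 \\
\frac{\partial SSE}{\partial \beta_3} = 0 &\;\Rightarrow\; \beta_1 + \beta_2 + 2\beta_3 = 800 + 400 = 1200
\end{aligned}$$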

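which can be solved to derive:

$$\beta_1 = 525, \quad \beta_2 = 175, \quad \beta_3 = 250$$

Vector and Matrix Notation

1.29 Formulating the system of equations above quickly becomes complex as both the number of observations and the number of covariates increase; consequently, vector notation is used to express these equations in compact form.

1.30 Let $Y$ be a column vector with components corresponding to the observed values for the response variable:

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{pmatrix} = \begin{pmatrix} 800 \\ 500 \\ 400 \\ 200 \end{pmatrix}$$

1.31 Let $X_1$, $X_2$, and $X_3$ denote the column vectors with components equal to the observed values for the respective indicator variables (e.g. the $i$th element of $X_1$ is 1 when the $i$th observation is male, and 0 if female):

$$X_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad X_2 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}, \quad X_3 = \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix}$$

1.32 Let $\beta$ denote a column vector of parameters, and for a given set of parameters let $\varepsilon$ be the vector of residuals:

$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{pmatrix}$$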

1.33 Then the system of equations takes the form:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$

1.34 To simplify this further the vectors $X_1$, $X_2$, and $X_3$ can be aggregated into a single matrix $X$. This matrix is called the design matrix and in the example above would be defined as:

$$X = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

1.35 Appendix A shows an example of the form of the design matrix $X$ when explanatory variables include continuous variables, or "variates".

1.36 The system of equations takes the form

$$Y = X\beta + \varepsilon$$

1.37 In the case of the linear model, the goal is to find values of the components of $\beta$ which minimize the sum of squares of the components of $\varepsilon$. If there are $n$ observations and $p$ parameters in the model, $\varepsilon$ will have $n$ components and $\beta$ will have $p$ components ($p < n$).

1.38 The basic ingredients for a linear model thus consist of two elements:
a. a set of assumptions about the relationship between $Y$ and the predictor variables, and
b. an objective function which is to be optimized in order to solve the problem.

Standard statistical theory defines the objective function to be the likelihood function. In the case of the classical linear model with an assumed Normal error it can be shown that the parameters which minimize sum of squared error also maximize likelihood.
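The matrix form makes the least squares solution straightforward to compute. The following is a minimal sketch, assuming the four severities of the running example are 800, 500, 400 and 200 in the order urban male, rural male, urban female, rural female; it uses NumPy's `lstsq`, which is one of several ways to minimize the sum of squared errors.

```python
import numpy as np

# Observed average severities: urban male, rural male, urban female, rural female.
Y = np.array([800.0, 500.0, 400.0, 200.0])

# Design matrix columns: male, female, urban.
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
], dtype=float)

# Minimize the sum of squared components of Y - X @ beta.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

fitted = X @ beta
residuals = Y - fitted
sse = float(residuals @ residuals)

print("beta:", beta)
print("fitted:", fitted)
print("SSE:", sse)
```

The same parameters could be obtained by solving the three derivative equations by hand; the numerical route scales to many observations and covariates.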

Classical linear model assumptions

1.39 Linear models assume all observations are independent and each comes from a Normal distribution.

1.40 This assumption does not relate to the aggregate of the observed item, but to each observation individually. An example may help illustrate this distinction.

[Figure: Distribution of individual observations (series: Women, Men)]

1.41 An examination of average claim amounts by gender may identify that average claim amounts for men are Normally distributed, as are average claim amounts for women, and that the mean of the distribution for men is twice the mean of the distribution for women. The total distribution of average claim amounts across all men and women is not Normally distributed. The only distribution of interest is the distribution of the two separate classes. (In this case there are only two classes being considered, but in a more complicated model there would be one such class for each combination of the rating factors being considered.)

1.42 Linear models assume that the mean is a linear combination of the covariates, and that each component of the random variable is assumed to have a common variance.

1.43 The linear model can be written as follows:

$$Y = E[Y] + \varepsilon, \qquad E[Y] = X\beta$$

1.44 McCullagh and Nelder outline the explicit assumptions as follows: [4]

(LM1) Random component: Each component of $Y$ is independent and is Normally distributed. The mean, $\mu_i$, of each component is allowed to differ, but they all have common variance $\sigma^2$.

(LM2) Systematic component: The $p$ covariates are combined to give the "linear predictor" $\eta$: $\eta = X\beta$.

(LM3) Link function: The relationship between the random and systematic components is specified via a link function. In the linear model the link function is equal to the identity function so that: $E[Y] \equiv \mu = \eta$.

1.45 The identity link function assumption in (LM3) may appear to be superfluous at this point, but it will become more meaningful when discussing the generalization to GLMs.

Limitations of Linear Models

1.46 Linear models pose quite tractable problems that can be easily solved with well-known linear algebra approaches. However it is easy to see that the required assumptions are not easy to guarantee in applications:
- It is difficult to assert Normality and constant variance for response variables. Classical linear regression attempts to transform data so that these conditions hold. For example, $Y$ may not satisfy the hypotheses but $\ln(Y)$ may. However there is no reason why such a transformation should exist.
- The values for the response variable may be restricted to be positive. The assumption of Normality violates this restriction.
- If the response variable is strictly non-negative then intuitively the variance of $Y$ tends to zero as the mean of $Y$ tends to zero. That is, the variance is a function of the mean.

[4] McCullagh, P. and J. A. Nelder, Generalized Linear Models, 2nd Ed., Chapman & Hall/CRC, 1989.

- The additivity of effects encapsulated in the second (LM2) and third (LM3) assumptions is not realistic for a variety of applications. For example, suppose the response variable is equal to the area of the wings of a butterfly and the predictor variables are the width and length of the wings. Clearly, these two predictor variables do not enter additively; rather, they enter multiplicatively. More relevantly, many insurance risks tend to vary multiplicatively with rating factors (this is discussed in more detail in Section 2).

Generalized linear model assumptions

1.47 GLMs consist of a wide range of models that include linear models as a special case. The LM restriction assumptions of Normality, constant variance and additivity of effects are removed. Instead, the response variable is assumed to be a member of the exponential family of distributions. [5] Also, the variance is permitted to vary with the mean of the distribution. Finally, the effect of the covariates on the response variable is assumed to be additive on a transformed scale. Thus the analogs to the linear model assumptions (LM1), (LM2), and (LM3) are as follows.

(GLM1) Random component: Each component of $Y$ is independent and is from one of the exponential family of distributions.

(GLM2) Systematic component: The $p$ covariates are combined to give the linear predictor $\eta$: $\eta = X\beta$.

(GLM3) Link function: The relationship between the random and systematic components is specified via a link function, $g$, that is differentiable and monotonic such that: $E[Y] \equiv \mu = g^{-1}(\eta)$.

1.48 Most statistical texts denote the first expression in (GLM3) with $g(x)$ written on the left side of the equation; therefore, the systematic element is generally expressed on the right side as the inverse function, $g^{-1}$.

[5] The exponential family is a broader class of distributions sharing the same density form and including Normal, Poisson, gamma, inverse Gaussian, binomial, exponential and other distributions.

Exponential Family of Distributions

1.49 Formally, the exponential family of distributions is a 2-parameter family defined as:

$$f_i(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right\}$$

where $a_i(\phi)$, $b(\theta_i)$, and $c(y_i, \phi)$ are functions specified in advance; $\theta_i$ is a parameter related to the mean; and $\phi$ is a scale parameter related to the variance. This formal definition is further explored in Appendix B. For practical purposes it is useful to know that a member of the exponential family has the following two properties:
a. the distribution is completely specified in terms of its mean and variance,
b. the variance of $Y_i$ is a function of its mean.

1.50 This second property is emphasized by expressing the variance as:

$$Var[Y_i] = \frac{\phi \, V(\mu_i)}{\omega_i}$$

where $V(x)$, called the variance function, is a specified function; the parameter $\phi$ scales the variance; and $\omega_i$ is a constant that assigns a weight, or credibility, to observation $i$.

1.51 A number of familiar distributions belong to the exponential family: the Normal, Poisson, binomial, gamma, and inverse Gaussian. [6] The corresponding value of the variance function is summarized in the table below:

| Distribution | $V(x)$ |
|---|---|
| Normal | $1$ |
| Poisson | $x$ |
| Gamma | $x^2$ |
| Binomial | $x(1 - x)$ (where the number of trials = 1) |
| Inverse Gaussian | $x^3$ |

1.52 A special member of the exponential family is the Tweedie distribution. The Tweedie distribution has a point mass at zero and a variance function proportional to $\mu^p$ (where $p < 0$ or $1 < p < 2$ or $p > 2$). This distribution is typically used to model pure premium data directly and is discussed further in Appendix C.

[6] A notable exception to this list is the lognormal distribution, which does not belong to the exponential family.
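The variance functions in the table translate directly into code. The following is a minimal sketch in plain Python; the dictionary name, the family labels and the helper `variance` are illustrative choices, not part of the paper.

```python
# Variance functions V(x) from the table (binomial shown for one trial).
VARIANCE_FUNCTIONS = {
    "normal": lambda mu: 1.0,
    "poisson": lambda mu: mu,
    "gamma": lambda mu: mu ** 2,
    "binomial": lambda mu: mu * (1.0 - mu),
    "inverse_gaussian": lambda mu: mu ** 3,
}


def variance(mu, family, phi=1.0, weight=1.0):
    """Var[Y_i] = phi * V(mu_i) / omega_i for the chosen family."""
    return phi * VARIANCE_FUNCTIONS[family](mu) / weight


# A Poisson observation with mean 2 and prior weight 4.
var_poisson = variance(2.0, "poisson", weight=4.0)
# A gamma observation with mean 3 and scale parameter 0.5.
var_gamma = variance(3.0, "gamma", phi=0.5)

print(var_poisson, var_gamma)
```

Increasing the prior weight lowers the assumed variance, while the choice of family controls how quickly the variance grows with the mean.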

1.53 The choice of the variance function affects the results of the GLM. For example, the graph below considers the result of fitting three different (and very simple) GLMs to three data points. In each case the model form selected is a two-parameter model (the intercept and slope of a line), and the three points represent the individual observations (with the observed value $Y_i$ shown on the y-axis for different values of a single continuous explanatory variable shown on the x-axis).

[Figure: Effect of varying the error term (simple example); series: Data, Normal, Poisson, Gamma]

1.54 The three GLMs considered have a Normal, Poisson and gamma variance function respectively. It can be seen that the GLM with a Normal variance function (which assumes that each observation has the same fixed variance) has produced fitted values which are attracted to the original data points with equal weight. By contrast the GLM with a Poisson error assumes that the variance increases with the expected value of each observation. Observations with smaller expected values have a smaller assumed variance, which results in greater credibility when estimating the parameters. The model thus has produced fitted values which are more influenced by the observation on the left (with smaller expected value) than the observation on the right (which has a higher expected value and hence a higher assumed variance).

1.55 It can be seen that the GLM with assumed gamma variance function is even more strongly influenced by the point on the left than the point on the right since that model assumes the variance increases with the square of the expected value.

1.56 A further, rather more realistic, example illustrates how selecting an appropriate variance function can improve the accuracy of a model. This example considers an artificially generated dataset which represents an insurance portfolio. This dataset contains several rating factors (some of which are correlated), and in each case the true effect of the rating factor is assumed to be known. Claims experience (in this case average claim size experience) is then randomly generated for each policy using a gamma distribution, with the mean in each case being that implied by the assumed effect of the rating factors. The claims experience is then analyzed using three models to see how closely the results of each model relate to the (in this case known) true factor effect.

1.57 The three methods considered are
- a one-way analysis
- a GLM with assumed Normal variance function
- a GLM with assumed gamma variance function.

1.58 The results for one of the several rating factors considered are shown on the graph below. It can be seen that owing to the correlations between the rating factors in the data, the one-way analysis is badly distorted. The GLM with an assumed Normal distribution is closer to the correct relativities, but it can be seen that it is the GLM with an assumed gamma variance function which yields results that are the closest to the true effect.

[Figure: Effect of varying the error term (insurance rating factor example); y-axis: log of multiplier; x-axis: factor levels A to E; series: True effect, One way, GLM / Normal, GLM / Gamma]

1.59 In addition to the variance function $V(x)$, two other parameters define the variance of each observation, the scale parameter $\phi$ and the prior weights $\omega_i$:

$$Var[Y_i] = \frac{\phi \, V(\mu_i)}{\omega_i}$$

Prior weights

1.60 The prior weights allow information about the known credibility of each observation to be incorporated in the model. For example, if modeling claims frequency, one observation might relate to one month's exposure, and another to one year's exposure. There is more information and less variability in the observation relating to the longer exposure period, and this can be incorporated in the model by defining $\omega_i$ to be the exposure of each observation. In this way observations with higher exposure are deemed to have lower variance, and the model will consequently be more influenced by these observations.

1.61 An example demonstrates the appropriateness of this more clearly. Consider a set of observations for personal auto claims under some classification system. Let cell $i$ denote some generic cell defined by this classification system. To analyze frequency let:
- $m_{ik}$ be the number of claims arising from the $k$th unit of exposure in cell $i$
- $\omega_i$ be the number of exposures in cell $i$
- $Y_i$ be the observed claim frequency in cell $i$:

$$Y_i = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} m_{ik}$$

1.62 If the random process generating $m_{ik}$ is Poisson with frequency $f_i$ for all exposures $k$ then

$$E[m_{ik}] = f_i = Var[m_{ik}]$$

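1.63 Assuming the exposures are independent then

$$\mu_i \equiv E[Y_i] = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} E[m_{ik}] = \frac{1}{\omega_i} \, \omega_i f_i = f_i$$

$$Var[Y_i] = \frac{1}{\omega_i^2} \sum_{k=1}^{\omega_i} Var[m_{ik}] = \frac{1}{\omega_i^2} \, \omega_i f_i = \frac{f_i}{\omega_i} = \frac{\mu_i}{\omega_i}$$

1.64 So in this case $V(\mu_i) = \mu_i$, $\phi = 1$, and the prior weights are the exposures in cell $i$.

1.65 An alternative example would be to consider claims severity. Let
- $z_{ik}$ be the claim size of the $k$th claim in cell $i$
- $\omega_i$ be the number of claims in cell $i$
- $Y_i$ be the observed mean claim size in cell $i$:

$$Y_i = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} z_{ik}$$

1.66 This time assume that the random process generating each individual claim is gamma distributed. Denoting $E[z_{ik}] = m_i$ and $Var[z_{ik}] = \sigma^2 m_i^2$, and assuming each claim is independent, then

$$\mu_i \equiv E[Y_i] = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} E[z_{ik}] = \frac{1}{\omega_i} \, \omega_i m_i = m_i$$

$$Var[Y_i] = \frac{1}{\omega_i^2} \sum_{k=1}^{\omega_i} Var[z_{ik}] = \frac{1}{\omega_i^2} \, \omega_i \sigma^2 m_i^2 = \frac{\sigma^2 m_i^2}{\omega_i} = \frac{\sigma^2 \mu_i^2}{\omega_i}$$

1.67 So for severity with a gamma distribution the variance of $Y_i$ follows the general form for all exponential distributions with $V(\mu_i) = \mu_i^2$, $\phi = \sigma^2$, and prior weight equal to the number of claims in cell $i$.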

1.68 Prior weights can also be used to attach a lower credibility to a part of the data which is known to be less reliable.

The scale parameter

1.69 In some cases (e.g. the Poisson distribution) the scale parameter $\phi$ is identically equal to 1 and falls out of the GLM analysis entirely. However in general and for the other familiar exponential distributions $\phi$ is not known in advance, and in these cases it must be estimated from the data.

1.70 Estimation of the scale parameter is not actually necessary in order to solve for the GLM parameters $\beta$; however, in order to determine certain statistics (such as standard errors, discussed below) it is necessary to estimate $\phi$.

1.71 $\phi$ can be treated as another parameter and estimated by maximum likelihood. The drawback of this approach is that it is not possible to derive an explicit formula for $\phi$, and the maximum likelihood estimation process can take considerably longer.

1.72 An alternative is to use an estimate of $\phi$, such as
a. the moment estimator (Pearson $\chi^2$ statistic) defined as

$$\hat{\phi} = \frac{1}{n - p} \sum_i \frac{\omega_i (Y_i - \mu_i)^2}{V(\mu_i)}$$

b. the total deviance estimator

$$\hat{\phi} = \frac{D}{n - p}$$

where $D$, the total deviance, is defined later in this paper.

Link Functions

1.73 In practice when using classical linear regression practitioners sometimes attempt to transform data to satisfy the requirements of Normality and constant variance of the response variable and additivity of effects. Generalized linear models, on the other hand, merely require that there be a link function that guarantees the last condition of additivity. Whereas (LM3) requires that $Y$ be additive in the covariates, the generalization (GLM3) instead requires that some transformation of $Y$, written as $g(Y)$, be additive in the covariates.
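The moment (Pearson) estimator of the scale parameter is a one-line calculation once fitted values are available. The following is a minimal sketch using NumPy; the function name, its arguments and the small data arrays are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def pearson_scale(y, mu, variance_function, n_params, weights=None):
    """Moment estimator: sum(w * (y - mu)^2 / V(mu)) / (n - p)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    chi_squared = np.sum(w * (y - mu) ** 2 / variance_function(mu))
    return chi_squared / (y.size - n_params)


# Hypothetical observations, fitted means and prior weights.
y = [2.0, 4.0, 9.0]
mu = [4.0, 4.0, 4.0]
weights = [2.0, 1.0, 4.0]

# Variance function V(x) = x, one fitted parameter.
phi_hat = pearson_scale(y, mu, lambda m: m, n_params=1, weights=weights)
print(phi_hat)
```

Here the weighted squared residuals divided by $V(\mu)$ are 2, 0 and 25, so the estimate is 27 divided by the 2 degrees of freedom.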

1.74 It is more helpful to consider $\mu$ as a function of the linear predictor, so typically it is the inverse of $g(x)$ which is considered:

$$\mu_i = g^{-1}(\eta_i)$$

1.75 In theory a different link function could be used for each observation $i$, but in practice this is rarely done.

1.76 The link function must satisfy the condition that it be differentiable and monotonic (either strictly increasing or strictly decreasing). Some typical choices for a link function include

| Link | $g(x)$ | $g^{-1}(x)$ |
|---|---|---|
| Identity | $x$ | $x$ |
| Log | $\ln(x)$ | $e^x$ |
| Logit | $\ln(x / (1 - x))$ | $e^x / (1 + e^x)$ |
| Reciprocal | $1/x$ | $1/x$ |

1.77 Each error structure has associated with it a "canonical" link function which simplifies the mathematics of solving GLMs analytically. These are discussed in Appendix D. When solving GLMs using modern computer software, however, the use of canonical link functions is not important and any pairing of link function and variance function which is deemed appropriate may be selected.

1.78 The log-link function has the appealing property that the effects of the covariates are multiplicative. Indeed, writing $g(x) = \ln(x)$ so that $g^{-1}(x) = e^x$ results in

$$\mu_i = g^{-1}(\beta_1 x_{i1} + \dots + \beta_p x_{ip}) = \exp(\beta_1 x_{i1}) \cdot \exp(\beta_2 x_{i2}) \cdots \exp(\beta_p x_{ip})$$

1.79 In other words, when a log link function is used, rather than estimating additive effects, the GLM estimates logs of multiplicative effects.

1.80 As mentioned previously, alternative choices of link functions and error structures can yield GLMs which are equivalent to a number of the minimum bias models as well as a simple linear model (see section "The Connection of Minimum Bias to GLM").
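The link functions and the multiplicative property of the log link can be illustrated in a few lines. The following is a minimal sketch using NumPy; the `LINKS` dictionary and the parameter values are hypothetical and only meant to show the mechanics.

```python
import numpy as np

# Each entry holds the pair (g, g_inverse) from the table of typical links.
LINKS = {
    "identity": (lambda x: x, lambda eta: eta),
    "log": (np.log, np.exp),
    "logit": (lambda x: np.log(x / (1.0 - x)),
              lambda eta: np.exp(eta) / (1.0 + np.exp(eta))),
    "reciprocal": (lambda x: 1.0 / x, lambda eta: 1.0 / eta),
}

# Hypothetical parameters and one observation's covariate values.
beta = np.array([5.0, 0.3, -0.2])
x_row = np.array([1.0, 1.0, 1.0])

g, g_inv = LINKS["log"]

# Mean from the linear predictor ...
mu = g_inv(x_row @ beta)
# ... equals the product of the individual multiplicative effects.
mu_as_product = np.prod(np.exp(x_row * beta))

print(mu, mu_as_product)
```

With a log link, each $\exp(\beta_j)$ can be read as a multiplicative relativity applied to the base level.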

The offset term

1.81 There are occasions when the effect of an explanatory variable is known, and rather than estimating parameters $\beta$ in respect of this variable it is appropriate to include information about this variable in the model as a known effect. This can be achieved by introducing an "offset term" $\xi$ into the definition of the linear predictor $\eta$:

$$\eta = X\beta + \xi$$

which gives

$$E[Y] = \mu = g^{-1}(\eta) = g^{-1}(X\beta + \xi)$$

1.82 A common example of the use of an offset term is when fitting a multiplicative GLM to the observed number, or count, of claims (as opposed to claim frequency). Each observation may relate to a different period of policy exposure. An observation relating to one month's exposure will obviously have a lower expected number of claims (all other factors being equal) than an observation relating to a year's exposure. To make appropriate allowance for this, the assumption that the expected count of claims increases in proportion to the exposure of an observation (all other factors being equal) can be introduced in a multiplicative GLM by setting the offset term $\xi$ to be equal to the log of the exposure of each observation, giving:

$$E[Y_i] = g^{-1}\Big(\sum_j X_{ij}\beta_j + \xi_i\Big) = \exp\Big(\sum_j X_{ij}\beta_j + \log(e_i)\Big) = \exp\Big(\sum_j X_{ij}\beta_j\Big) \cdot e_i$$

where $e_i$ = the exposure for observation $i$.

1.83 In the particular case of a Poisson multiplicative GLM it can be shown that modeling claim counts with an offset term equal to the log of the exposure (and prior weights set to 1) produces identical results to modeling claim frequencies with no offset term but with prior weights set to be equal to the exposure of each observation.
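The equivalence between the two Poisson formulations can be checked numerically. The following is a minimal sketch, assuming a hand-rolled Newton-Raphson fit written with NumPy rather than any particular GLM package; the function `fit_poisson_log` and the small claims dataset are hypothetical.

```python
import numpy as np


def fit_poisson_log(X, y, weights=None, offset=None, max_iter=100, tol=1e-12):
    """Newton-Raphson fit of a Poisson GLM with log link (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    xi = np.zeros(n) if offset is None else np.asarray(offset, dtype=float)
    # Crude starting point: least squares on the log scale.
    beta = np.linalg.lstsq(X, np.log(np.maximum(y, 0.1)) - xi, rcond=None)[0]
    for _ in range(max_iter):
        mu = np.exp(X @ beta + xi)
        score = X.T @ (w * (y - mu))                 # gradient of log-likelihood
        information = X.T @ ((w * mu)[:, None] * X)  # minus the Hessian
        step = np.linalg.solve(information, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Hypothetical data: exposure, claim counts, and a two-level factor.
exposure = np.array([10.0, 40.0, 25.0, 100.0, 20.0, 60.0])
claims = np.array([2.0, 5.0, 3.0, 12.0, 4.0, 7.0])
X = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]], dtype=float)

# Counts with offset log(exposure) and prior weights of 1.
beta_counts = fit_poisson_log(X, claims, offset=np.log(exposure))
# Frequencies with no offset and prior weights equal to exposure.
beta_freq = fit_poisson_log(X, claims / exposure, weights=exposure)

print(beta_counts, beta_freq)
```

Both fits solve the same likelihood equations, so the parameter estimates agree up to numerical tolerance; in this two-level example the fitted base frequency is simply total claims over total exposure for the first level.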

Structure of a generalized linear model

1.84 In summary, the assumed structure of a GLM can be specified as:

$$\mu_i = E[Y_i] = g^{-1}\Big(\sum_j X_{ij}\beta_j + \xi_i\Big)$$

$$Var[Y_i] = \frac{\phi \, V(\mu_i)}{\omega_i}$$

where
- $Y_i$ is the vector of responses
- $g(x)$ is the link function: a specified (invertible) function which relates the expected response to the linear combination of observed factors
- $X_{ij}$ is a matrix (the "design matrix") produced from the factors
- $\beta_j$ is a vector of model parameters, which is to be estimated
- $\xi_i$ is a vector of known effects or "offsets"
- $\phi$ is a parameter to scale the function $V(x)$
- $V(x)$ is the variance function
- $\omega_i$ is the prior weight that assigns a credibility or weight to each observation.

1.85 The vector of responses $Y_i$, the design matrix $X_{ij}$, the prior weights $\omega_i$, and the offset term $\xi_i$ are based on data in a manner determined by the practitioner. The assumptions which then further define the form of the model are the link function $g(x)$, the variance function $V(x)$, and whether $\phi$ is known or to be estimated.

Typical GLM model forms

1.86 The typical model form for modeling insurance claim counts or frequencies is a multiplicative Poisson. As well as being a commonly assumed distribution for claim numbers, the Poisson distribution also has a particular feature which makes it intuitively appropriate in that it is invariant to measures of time. In other words, measuring frequencies per month and measuring frequencies per year will yield the same results using a Poisson multiplicative GLM. This is not true of some other distributions such as gamma.

1.87 In the case of claim frequencies the prior weights are typically set to be the exposure of each record. In the case of claim counts the offset term is set to be the log of the exposure.

1.88 A common model form for modeling insurance severities is a multiplicative gamma. As well as often being appropriate because of its general form, the gamma distribution also has an intuitively attractive property for modeling claim amounts since it is invariant to measures of currency. In other words measuring severities in dollars and measuring severities in cents will yield the same results using a gamma multiplicative GLM. This is not true of some other distributions such as Poisson.

1.89 The typical model form for modeling retention and new business conversion is a logit link function and binomial error term (together referred to as a logistic model). The logit link function maps outcomes from the range of $(0, 1)$ to $(-\infty, +\infty)$ and is consequently invariant to measuring successes or failures. If the y-variate being modeled is generally close to zero, and if the results of a model are going to be used qualitatively rather than quantitatively, it may also be possible to use a multiplicative Poisson model form as an approximation given that the model output from a multiplicative GLM can be rather easier to explain to a non-technical audience.

1.90 The table below summarizes some typical model forms.

| $Y$ | Claim frequencies | Claim numbers or counts | Average claim amounts | Probability (e.g. of renewing) |
|---|---|---|---|---|
| Link function $g(x)$ | $\ln(x)$ | $\ln(x)$ | $\ln(x)$ | $\ln(x / (1 - x))$ |
| Error | Poisson | Poisson | Gamma | Binomial |
| Scale parameter $\phi$ | 1 | 1 | Estimated | 1 |
| Variance function $V(x)$ | $x$ | $x$ | $x^2$ | $x(1 - x)$ * |
| Prior weights $\omega$ | Exposure | 1 | # of claims | 1 |
| Offset $\xi$ | 0 | $\ln(\text{exposure})$ | 0 | 0 |

* where the number of trials = 1, or $x(t - x)/t$ where the number of trials = $t$

GLM maximum likelihood estimators

1.91 Having defined a model form in terms of $X$, $g(x)$, $\xi$, $V(x)$, $\phi$, and $\omega$, and given a set of observations $Y$, the components of $\beta$ are derived by maximizing the likelihood function (or equivalently, the logarithm of the likelihood function). In essence, this method seeks to find the parameters which, when applied to the assumed model form, produce the observed data with the highest probability.

1.92 The likelihood is defined to be the product of probabilities of observing each value of the y-variate. For continuous distributions such as the Normal and gamma distributions the probability density function is used in place of the probability. It is usual to consider the log of the likelihood since, being a summation across observations rather than a product, this yields more manageable calculations (and any maximum of the likelihood is also a maximum of the log-likelihood). Maximum likelihood estimation in practice, therefore, seeks to find the values of the parameters that maximize this log-likelihood.

1.93 In simple examples the procedure for maximizing likelihood involves finding the solution to a system of equations with linear algebra. In practice, the large number of observations typically being considered means that this is rarely done. Instead numerical techniques (and in particular multi-dimensional Newton-Raphson algorithms) are used. Appendix E shows the system of equations for maximizing the likelihood function in the general case of an exponential distribution.

1.94 An explicitly solved illustrative example and a discussion of numerical techniques used with large datasets are set out below.

Solving simple examples

1.95 To understand the mechanics involved in solving a GLM, a concrete example is presented. Consider the same four observations discussed in a previous section for average claim severity:

| | Urban | Rural |
|---|---|---|
| Male | 800 | 500 |
| Female | 400 | 200 |

1.96 The general procedure for solving a GLM involves the following steps:
a. Specify the design matrix $X$ and the vector of parameters $\beta$
b. Choose the error structure and link function
c. Identify the likelihood function
d. Take the logarithm to convert the product of many terms into a sum

25 e. Maximize the logarithm of the likelihood function by taking partial derivatives with respect to each parameter, setting them to zero and solving the resulting system of equations
f. Compute the predicted values.

1.97 Recall that the vector of observations, the design matrix, and the vector of parameters are as follows:

$$Y = \begin{pmatrix} 800 \\ 500 \\ 400 \\ 200 \end{pmatrix} \begin{matrix} \text{Male Urban} \\ \text{Male Rural} \\ \text{Female Urban} \\ \text{Female Rural} \end{matrix}, \quad X = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad \text{and} \quad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}$$

where the first column of X indicates if an observation is male or not, the second column indicates whether the observation is female, and the last column specifies if the observation is in an urban territory or not.

1.98 The following three alternative model structures are illustrated:
- Normal error structure with an identity link function
- Poisson error structure with a log link function
- Gamma error structure with an inverse link function.

1.99 These three model forms may not necessarily be appropriate models to use in practice - instead they illustrate the theory involved.

1.100 In each case the elements of ω (the prior weights) will be assumed to be 1, and the offset term ξ assumed to be zero, and therefore these terms will, in this example, be ignored.

Normal error structure with an identity link function

1.101 The classical linear model case assumes a Normal error structure and an identity link function. The predicted values in the example take the form:

$$E[Y] = g^{-1}(X\beta) = \begin{pmatrix} g^{-1}(\beta_1 + \beta_3) \\ g^{-1}(\beta_1) \\ g^{-1}(\beta_2 + \beta_3) \\ g^{-1}(\beta_2) \end{pmatrix} = \begin{pmatrix} \beta_1 + \beta_3 \\ \beta_1 \\ \beta_2 + \beta_3 \\ \beta_2 \end{pmatrix}$$

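26 1.102 The Normal distribution with mean μ and variance σ² has the following density function:

$$f(y; \mu, \sigma^2) = \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) \right\}$$

1.103 Its likelihood function is:

$$L(y; \mu, \sigma^2) = \prod_{i=1}^{n} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) \right\}$$

1.104 Maximizing the likelihood function is equivalent to maximizing the log-likelihood function:

$$l(y; \mu, \sigma^2) = \sum_{i=1}^{n} \left[ -\frac{(y_i - \mu_i)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) \right]$$

1.105 With the identity link function, $\mu_i = \sum_j X_{ij}\beta_j$ and the log-likelihood function becomes

$$l(y; \mu, \sigma^2) = \sum_{i=1}^{n} \left[ -\frac{\left(y_i - \sum_{j=1}^{p} X_{ij}\beta_j\right)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) \right]$$

1.106 In this example, up to a constant term of $2\ln(2\pi\sigma^2)$, the log-likelihood is

$$l^*(y; \mu, \sigma^2) = -\frac{(800 - (\beta_1 + \beta_3))^2}{2\sigma^2} - \frac{(500 - \beta_1)^2}{2\sigma^2} - \frac{(400 - (\beta_2 + \beta_3))^2}{2\sigma^2} - \frac{(200 - \beta_2)^2}{2\sigma^2}$$

1.107 To maximize l* take derivatives with respect to β1, β2 and β3 and set each of them to zero. The resulting system of three equations in three unknowns is:

$$\frac{\partial l^*}{\partial \beta_1} = 0 \;\Rightarrow\; 2\beta_1 + \beta_3 = 1300$$
$$\frac{\partial l^*}{\partial \beta_2} = 0 \;\Rightarrow\; 2\beta_2 + \beta_3 = 600$$
$$\frac{\partial l^*}{\partial \beta_3} = 0 \;\Rightarrow\; \beta_1 + \beta_2 + 2\beta_3 = 1200$$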
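The three first-order conditions for the Normal model with identity link are the usual least-squares normal equations, X'Xβ = X'y. As a minimal sketch, assuming the four observations of the worked example are 800, 500, 400 and 200, they can be solved directly. The helper `solve_linear` is a hypothetical function written for this illustration, not part of the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix columns: male, female, urban.
# Rows: male urban, male rural, female urban, female rural.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [800.0, 500.0, 400.0, 200.0]   # assumed observations from the example

p = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

beta = solve_linear(xtx, xty)                                  # [525.0, 175.0, 250.0]
fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]  # [775.0, 525.0, 425.0, 175.0]
```

The right-hand side `xty` is [1300, 600, 1200], matching the totals for men, for women and for urban policyholders.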

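27 1.108 It can be seen that these equations are identical to those derived when minimizing the sum of squared error for a simple linear model. Again, these can be solved to derive:

$$\beta_1 = 525, \quad \beta_2 = 175, \quad \beta_3 = 250$$

which produces the following predicted values:

          Urban   Rural
Male      775     525
Female    425     175

The Poisson error structure with a logarithm link function

1.109 For the Poisson model with a logarithm link function, the predicted values are given by

$$E[Y] = g^{-1}(X\beta) = \begin{pmatrix} g^{-1}(\beta_1 + \beta_3) \\ g^{-1}(\beta_1) \\ g^{-1}(\beta_2 + \beta_3) \\ g^{-1}(\beta_2) \end{pmatrix} = \begin{pmatrix} e^{\beta_1 + \beta_3} \\ e^{\beta_1} \\ e^{\beta_2 + \beta_3} \\ e^{\beta_2} \end{pmatrix}$$

1.110 A Poisson distribution has the following density function

$$f(y; \mu) = \frac{e^{-\mu}\mu^{y}}{y!}$$

1.111 Its log-likelihood function is therefore

$$l(y; \mu) = \sum_{i=1}^{n} \ln f(y_i; \mu_i) = \sum_{i=1}^{n} \left[ -\mu_i + y_i \ln \mu_i - \ln(y_i!) \right]$$

1.112 With the logarithm link function, $\mu_i = \exp\left(\sum_j X_{ij}\beta_j\right)$, and the log-likelihood function reduces to

$$l(y; \mu) = \sum_{i=1}^{n} \left[ -\exp\left(\sum_{j=1}^{p} X_{ij}\beta_j\right) + y_i \sum_{j=1}^{p} X_{ij}\beta_j - \ln(y_i!) \right]$$

1.113 In this example, the equation is

$$l(y; \mu) = -e^{\beta_1 + \beta_3} + 800(\beta_1 + \beta_3) - \ln 800! - e^{\beta_1} + 500\beta_1 - \ln 500! - e^{\beta_2 + \beta_3} + 400(\beta_2 + \beta_3) - \ln 400! - e^{\beta_2} + 200\beta_2 - \ln 200!$$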

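28 1.114 Ignoring the constant of ln 800! + ln 500! + ln 400! + ln 200!, the following function is to be maximized:

$$l^*(y; \mu) = -e^{\beta_1 + \beta_3} + 800(\beta_1 + \beta_3) - e^{\beta_1} + 500\beta_1 - e^{\beta_2 + \beta_3} + 400(\beta_2 + \beta_3) - e^{\beta_2} + 200\beta_2$$

1.115 To maximize l* the derivatives with respect to β1, β2 and β3 are set to zero and the following three equations are derived:

$$\frac{\partial l^*}{\partial \beta_1} = 0 \;\Rightarrow\; e^{\beta_1 + \beta_3} + e^{\beta_1} = 1300$$
$$\frac{\partial l^*}{\partial \beta_2} = 0 \;\Rightarrow\; e^{\beta_2 + \beta_3} + e^{\beta_2} = 600$$
$$\frac{\partial l^*}{\partial \beta_3} = 0 \;\Rightarrow\; e^{\beta_1 + \beta_3} + e^{\beta_2 + \beta_3} = 1200$$

1.116 These can be solved to derive the following parameter estimates:

$$\beta_1 = 6.1716, \quad \beta_2 = 5.3984, \quad \beta_3 = 0.5390$$

which produces the following predicted values:

          Urban   Rural
Male      821.1   478.9
Female    378.9   221.1

The gamma error structure with an inverse link function

1.117 This example is set out in Appendix F.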
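The three Poisson equations above say that the fitted totals for men, for women and for urban policyholders must equal the observed totals. For this small example they can be solved in closed form: dividing the urban equation by the sum of the two gender equations isolates exp(β3). The sketch below assumes the four observations are 800, 500, 400 and 200; the variable names are chosen for illustration only.

```python
import math

# Observations assumed from the worked example.
observed = {
    ("male", "urban"): 800.0, ("male", "rural"): 500.0,
    ("female", "urban"): 400.0, ("female", "rural"): 200.0,
}

male_total = observed[("male", "urban")] + observed[("male", "rural")]        # 1300
female_total = observed[("female", "urban")] + observed[("female", "rural")]  # 600
urban_total = observed[("male", "urban")] + observed[("female", "urban")]     # 1200
rural_total = male_total + female_total - urban_total                         # 700

# exp(beta3) / (1 + exp(beta3)) = urban_total / grand_total,
# so exp(beta3) is the ratio of urban to rural totals.
exp_beta3 = urban_total / rural_total
beta3 = math.log(exp_beta3)
beta1 = math.log(male_total / (1.0 + exp_beta3))
beta2 = math.log(female_total / (1.0 + exp_beta3))

fitted = {
    ("male", "urban"): math.exp(beta1 + beta3),
    ("male", "rural"): math.exp(beta1),
    ("female", "urban"): math.exp(beta2 + beta3),
    ("female", "rural"): math.exp(beta2),
}
```

A closed form of this kind is available only because the example has two factors with two levels each; larger models need the iterative techniques described next.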

29 Solving for large datasets using numerical techniques

1.118 The general case for solving for maximum likelihood in the case of a GLM with an assumed exponential distribution is set out in Appendix E. In insurance modeling there are typically many thousands if not millions of observations being modeled, and it is not practical to find values of β which maximize likelihood using the explicit techniques illustrated above and in Appendices E and F. Instead iterative numerical techniques are used.

1.119 As was the case in the simple examples above, the numerical techniques seek to optimize likelihood by seeking the values of β which set the first differential of the log-likelihood to zero. A number of standard methods can be applied to this problem. In practice, this is done using an iterative process, for example Newton-Raphson iteration which uses the formula:

$$\beta_{n+1} = \beta_n - H^{-1} s$$

where β_n is the n-th iterative estimate of the vector of the parameter estimates β (with p elements), s is the vector of the first derivatives of the log-likelihood and H is the (p by p) matrix containing the second derivatives of the log-likelihood. This is simply the generalized form of the one-dimensional Newton-Raphson equation,

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

which seeks to find a solution to f'(x) = 0.

1.120 The iterative process can be started using either values of zero for the elements of β_0 or alternatively the estimates implied by a one-way analysis of the data or of another previously fitted GLM.

1.121 Several generic commercial packages are available to fit generalized linear models in this way (such as SAS, S, R, etc), and packages specifically built for the insurance industry, which fit GLMs more quickly and with helpful interpretation of output, are also available.
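The Newton-Raphson update can be sketched for the Poisson model with log link and unit prior weights, for which the score vector is s = X'(y - μ) and the Hessian is H = -X' diag(μ) X. This is a minimal illustration under those assumptions, not the routine used by any particular package; the function names and the example data (800, 500, 400, 200) are assumptions. The iteration is started from one-way estimates (log of the average response by gender, no urban effect), since for this data a start at zero produces a first step large enough to overflow the exponential.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def newton_raphson_poisson(X, y, beta, tol=1e-10, max_iter=25):
    """Iterate beta <- beta - H^{-1} s for a Poisson GLM with log link."""
    p = len(beta)
    for _ in range(max_iter):
        mu = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
        # First derivatives of the log-likelihood: s = X'(y - mu).
        s = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
             for j in range(p)]
        # Second derivatives: H = -X' diag(mu) X.
        H = [[-sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
              for k in range(p)] for j in range(p)]
        step = solve_linear(H, s)                    # H^{-1} s
        beta = [b - d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta

X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [800.0, 500.0, 400.0, 200.0]                     # assumed example data
start = [math.log(650.0), math.log(300.0), 0.0]      # one-way starting values
beta_hat = newton_raphson_poisson(X, y, start)
fitted = [math.exp(sum(x * b for x, b in zip(row, beta_hat))) for row in X]
```

For a problem this small the iteration settles within a handful of steps; production software typically adds safeguards such as step halving.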

30 Base levels and the intercept term

1.122 The simple examples discussed above considered a three parameter model, where β1 corresponded to men, β2 to women and β3 to the effect of being in an urban area. In the case of an additive model (with identity link function) this could be thought of as either
- assuming that there is an average response for men, β1, and an average response for women, β2, with the effect of being an urban policyholder (as opposed to being a rural one) having an additional additive effect β3 which is the same regardless of gender
or
- assuming there is an average response for the "base case" of women in rural areas, β2, with additional additive effects for being male, β1 - β2, and for being in an urban area, β3.

1.123 In the case of a multiplicative model this three parameter form could be thought of as
- assuming that there is an average response for men, exp(β1), and an average response for women, exp(β2), with the effect of being an urban policyholder (as opposed to being a rural one) having a multiplicative effect exp(β3), which is the same regardless of gender
or
- assuming there is an average response for the "base case" of women in rural areas, exp(β2), with additional multiplicative effects for being male, exp(β1 - β2), and for being in an urban area, exp(β3).

1.124 In the example considered, some measure of the overall average response was incorporated in both the values of β1 and β2. The decision to incorporate this in the parameters relating to gender rather than area was arbitrary.

1.125 In practice when considering many factors each with many levels it is more helpful to parameterize the GLM by considering, in addition to observed factors, an "intercept term", which is a parameter that applies to all observations.
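The two readings of the three parameter model give identical predictions, which a short calculation makes concrete. The sketch below assumes additive estimates of 525, 175 and 250 (the least-squares values for observations of 800, 500, 400 and 200) and, for the multiplicative case, log-scale estimates implied by a Poisson fit to the same data; all names are chosen for illustration.

```python
import math

# Assumed additive estimates (Normal error, identity link).
beta1, beta2, beta3 = 525.0, 175.0, 250.0

# First reading: an average for men, an average for women, plus an urban effect.
by_gender = {
    ("male", "urban"): beta1 + beta3, ("male", "rural"): beta1,
    ("female", "urban"): beta2 + beta3, ("female", "rural"): beta2,
}

# Second reading: a base case of women in rural areas plus differentials.
base, male_effect, urban_effect = beta2, beta1 - beta2, beta3
by_base = {
    (gender, area): base
    + (male_effect if gender == "male" else 0.0)
    + (urban_effect if area == "urban" else 0.0)
    for gender in ("male", "female")
    for area in ("urban", "rural")
}

# Multiplicative analogue with assumed log-scale estimates b1 and b2:
# the male differential becomes the factor exp(b1 - b2).
b1, b2 = math.log(9100.0 / 19.0), math.log(4200.0 / 19.0)
male_factor = math.exp(b1 - b2)      # ratio of male to female base response
```

Both dictionaries hold the same four predicted values, so the choice between the readings is one of presentation rather than of model.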

31 1.126 In the above example, this would have been achieved by defining the design matrix X as

$$X = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

that is, by redefining β1 as the intercept term, and only having one parameter relating to the gender of the policyholder. It would not be appropriate to have an intercept term and a parameter for every value of gender since then the GLM would not be uniquely defined - any arbitrary constant k could be added to the intercept term and subtracted from each of the parameters relating to gender and the predicted values would remain the same.

1.127 In practice when considering categorical factors and an intercept term, one level of each factor should have no parameter associated with it, in order that the model remains uniquely defined.

1.128 For example consider a simple rating structure with three factors - age of driver (a factor with 9 levels), territory (a factor with 8 levels) and vehicle class (a factor with 5 levels). An appropriate parameterization might be represented as follows:

Factor           Factor levels              Level with no parameter
Age of driver    9 age bands                40-49
Territory        A, B, C, D, E, F, G, H     C
Vehicle class    A, B, C, D, E              A
Intercept term   applies to every policy

that is, an intercept term is defined for every policy, and each factor has a parameter associated with each level except one. If a multiplicative GLM were fitted to claims frequency (by selecting a log link function) the exponentials of the parameter estimates could be set out in tabular form also:

32
Age of driver              Territory                  Vehicle class
Factor level  Multiplier   Factor level  Multiplier   Factor level  Multiplier
                           A             .947         A             1
                           B             .9567        B
                           C             1            C
                           D             .955         D
                           E             .975         E
40-49         1            F
                           G
                           H
Intercept term  .4

1.129 In this example the claims frequency predicted by the model can be calculated for a given policy by taking the intercept term (.4) and multiplying it by the relevant factor relativities. For the factor levels for which no parameter was estimated (the "base levels"), no multiplier is relevant, and this is shown in the above table by displaying multipliers of 1. The intercept term relates to a policy with all factors at the base level (ie in this example the model predicts a claim frequency of .4 for a 40-49 year old in territory C and a vehicle in class A). This intercept term is not an average rate since its value is entirely dependent upon the arbitrary choice of which level of each factor is selected to be the base level.

1.130 If a model were structured with an intercept term but without each factor having a base level, then the GLM solving routine would remove as many parameters as necessary to make the model uniquely defined. This process is known as aliasing.

Aliasing

1.131 Aliasing occurs when there is a linear dependency among the observed covariates X1, ..., Xp. That is, one covariate may be identical to some combination of other covariates. For example, it may be observed that one covariate is always equal to a fixed linear combination of two of the others.

1.132 Equivalently, aliasing can be defined as a linear dependency among the columns of the design matrix X.
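The linear dependency that causes aliasing can be seen directly in a small design matrix. The sketch below takes the gender and area example and gives it an intercept as well as a parameter for both genders, which is one parameter too many. The parameter values and the constant `k` are hypothetical, chosen only to show that shifting a constant between the intercept and the gender parameters leaves the linear predictor unchanged.

```python
# Columns: intercept, male, female, urban.
# Rows: male urban, male rural, female urban, female rural.
X = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
]

def linear_predictor(design, beta):
    """Return X.beta, one value per row of the design matrix."""
    return [sum(x * b for x, b in zip(row, beta)) for row in design]

# The intercept column equals the male column plus the female column in every row.
dependent = all(row[0] == row[1] + row[2] for row in X)

beta = [100.0, 425.0, 75.0, 250.0]    # hypothetical parameter values
k = 37.5                              # any arbitrary constant
shifted = [beta[0] + k, beta[1] - k, beta[2] - k, beta[3]]

eta = linear_predictor(X, beta)
eta_shifted = linear_predictor(X, shifted)
```

Because `eta` and `eta_shifted` coincide, the data cannot distinguish between the two parameter vectors, and a fitting routine has to drop one of the dependent columns.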

33 1.133 There are two types of aliasing: intrinsic aliasing and extrinsic aliasing.

Intrinsic aliasing

1.134 Intrinsic aliasing occurs because of dependencies inherent in the definition of the covariates. These intrinsic dependencies arise most commonly whenever categorical factors are included in the model.

1.135 For example, suppose a private passenger automobile classification system includes the factor vehicle age which has the four levels: 0-3 years (X1), 4-7 years (X2), 8-9 years (X3), and 10+ years (X4). Clearly if any of X1, X2, X3 is equal to 1 then X4 is equal to 0; and if all of X1, X2, X3 are equal to 0 then X4 must be equal to 1. Thus X4 = 1 - X1 - X2 - X3.

1.136 The linear predictor

$$\eta = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$$

(ignoring any other factors) can be uniquely expressed in terms of the first three levels:

$$\eta = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 (1 - X_1 - X_2 - X_3) = (\beta_1 - \beta_4) X_1 + (\beta_2 - \beta_4) X_2 + (\beta_3 - \beta_4) X_3 + \beta_4$$

1.137 Upon renaming the coefficients this becomes:

$$\eta = \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_0$$

1.138 The result is a linear predictor with an intercept term (if one did not already exist) and three covariates.

1.139 GLM software will remove parameters which are aliased. Which parameter is selected for exclusion depends on the software. The choice of which parameter to alias does not affect the fitted values. For example in some cases the last level declared (ie the last alphabetically) is aliased. In other software the level with the maximum exposure is selected as the base level for each factor first, and then other levels are aliased dependent upon the order of declaration. This latter approach is helpful since it minimizes the standard errors associated with other parameter estimates - this subject is discussed later in this paper.
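The renaming of coefficients for the vehicle age factor can be checked numerically. The sketch below uses hypothetical coefficient values for the four levels and confirms that the four-parameter predictor and the reduced predictor (an intercept plus three covariates) agree for every level.

```python
# Hypothetical coefficients for the four vehicle-age levels.
beta = [0.30, 0.10, -0.05, -0.20]

# Aliasing the fourth level: it becomes the intercept, and the remaining
# coefficients are measured relative to it.
alpha0 = beta[3]
alpha = [b - beta[3] for b in beta[:3]]

def eta_full(x):
    """Predictor with one coefficient per level (four indicators)."""
    return sum(b * xi for b, xi in zip(beta, x))

def eta_reduced(x):
    """Predictor with an intercept and the first three indicators only."""
    return alpha0 + sum(a * xi for a, xi in zip(alpha, x[:3]))

# One indicator vector per level; exactly one indicator is 1 in each.
levels = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
differences = [abs(eta_full(x) - eta_reduced(x)) for x in levels]
```

Any other level could equally be chosen as the one without a parameter; only the interpretation of the remaining coefficients changes.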

34 Extrinsic Aliasing

1.140 This type of aliasing again arises from a dependency among the covariates, but when the dependency results from the nature of the data rather than inherent properties of the covariates themselves. This data characteristic arises if one level of a particular factor is perfectly correlated with a level of another factor.

1.141 For example, suppose a dataset is enriched with external data and two new factors are added to the dataset: the factors number of doors and color of vehicle. Suppose further that in a small number of cases the external data could not be linked with the existing data with the result that some records have an unknown color and an unknown number of doors.

Exposures       # Doors
Color           2        3        4        5        Unknown
Red             ,4       ,4       5,4      ,4
Green           4,54     4,54     ,4       ,45
Blue            6,544    5,44     5,654    4,565
Black           4,64     ,5       4,565    4,545
Unknown                                             ,4

Selected Base: # Doors = 4; Color = Red
Additional Aliasing: Color = Unknown

1.142 In this case because of the way the new factors were derived, the level unknown for the factor color happens to be perfectly correlated with the level unknown for the factor # doors. The covariate associated with unknown color is equal to 1 in every case for which the covariate for unknown # doors is equal to 1, and vice versa.

1.143 Elimination of the base levels through intrinsic aliasing reduces the linear predictor from 10 covariates to 8, plus the introduction of an intercept term. In addition, in this example, one further covariate needs to be removed as a result of extrinsic aliasing. This could either be the unknown color covariate or the unknown # doors covariate. Assuming in this case the GLM routine aliases on the basis of order of declaration, and assuming that the # doors factor is declared before color, the GLM routine would alias unknown color, reducing the linear predictor to just 7 covariates.
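Extrinsic aliasing of this kind can be found by comparing indicator columns across factors: two levels are aliased when their indicators agree on every record. The sketch below uses a small set of hypothetical records, not the exposures in the table, and the helper `indicator` is an assumption made for the example.

```python
# Hypothetical records: (number of doors, color of vehicle).
records = [
    ("2", "Red"), ("3", "Red"), ("4", "Red"), ("5", "Red"),
    ("2", "Green"), ("4", "Green"),
    ("3", "Blue"), ("5", "Blue"),
    ("4", "Black"), ("5", "Black"),
    ("Unknown", "Unknown"), ("Unknown", "Unknown"),
]

def indicator(rows, position, level):
    """0/1 covariate marking the records at the given level of a factor."""
    return [1 if row[position] == level else 0 for row in rows]

doors_levels = sorted({row[0] for row in records})
color_levels = sorted({row[1] for row in records})

# A pair of levels is extrinsically aliased when the two indicator
# columns are identical on every record.
aliased_pairs = [
    (doors, color)
    for doors in doors_levels
    for color in color_levels
    if indicator(records, 0, doors) == indicator(records, 1, color)
]
```

With these records the only aliased pair is unknown # doors with unknown color, mirroring the situation described above.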

35 "Near Aliasing"

1.144 When modeling in practice a common problem occurs when two or more factors contain levels that are almost, but not quite, perfectly correlated. For example, if the color of vehicle was known for a small number of policies for which the # doors was unknown, the two-way of exposure might appear as follows:

Exposures       # Doors
Color           2        3        4        5        Unknown
Red             ,4       ,4       5,4      ,4
Green           4,54     4,54     ,4       ,45
Blue            6,544    5,44     5,654    4,565
Black           4,64     ,5       4,565    4,545    5
Unknown                                             ,4

Selected Base: # Doors = 4; Color = Red

1.145 In this case the unknown level of the color factor is not perfectly correlated to the unknown level of the # doors factor, and so extrinsic aliasing will not occur.

1.146 When levels of two factors are "nearly aliased" in this way, convergence problems can occur. For example, if there were no claims for the 5 exposures indicated in black color level and unknown # doors level, and if a log link model were fitted to claims frequency, the model would attempt to estimate a very large and negative parameter for unknown # doors and a very large positive parameter for unknown color. The sum of the two parameters would be an appropriate reflection of the claims frequency for the exposures having unknown # doors and unknown color, while the value of the unknown # doors parameter would be driven by the experience of the 5 rogue exposures having color black with unknown # doors. This can either give rise to convergence problems, or to results which can appear very confusing.

1.147 In order to understand the problem in such circumstances it is helpful to examine two-way tables of exposure and claim counts for the factors which contain very large parameter estimates. From these it should be possible to identify those factor combinations which cause the near-aliasing. The issue can then be resolved either by deleting or excluding those rogue records, or by reclassifying the rogue records into another, more appropriate, factor level.
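The two-way exposure check suggested above can be sketched as follows. The records, the exposure figures and the 5% threshold are all hypothetical choices made for illustration; in practice the threshold would depend on the data, and the same tabulation would also be run on claim counts.

```python
from collections import defaultdict

# Hypothetical records: (number of doors, color, exposure).
records = [
    ("4", "Red", 1200.0), ("4", "Black", 900.0),
    ("5", "Red", 800.0), ("5", "Black", 700.0),
    ("Unknown", "Unknown", 300.0), ("Unknown", "Black", 5.0),
]

two_way = defaultdict(float)       # exposure by (doors, color) cell
doors_total = defaultdict(float)   # exposure by # doors level
for doors, color, exposure in records:
    two_way[(doors, color)] += exposure
    doors_total[doors] += exposure

# Flag cells that hold only a small share of the exposure in their
# # doors level; the threshold is an arbitrary illustrative choice.
threshold = 0.05
rogue_cells = sorted(
    cell for cell, exposure in two_way.items()
    if exposure / doors_total[cell[0]] < threshold
)
```

Here the only flagged cell is black vehicles with unknown # doors, which is the combination that would need to be excluded or reclassified.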

36 Model diagnostics

1.148 As well as deriving parameter estimates which maximize likelihood, a GLM can produce important additional information indicating the certainty of those parameter estimates (which themselves are estimates of some true underlying value).

Standard errors

1.149 Statistical theory can be used to give an estimate of the uncertainty. In particular, the multivariate version of the Cramer-Rao lower bound (which states that the variance of a parameter estimate is greater than or equal to minus one over the second derivative of the log likelihood) can define "standard errors" for each parameter estimate. Such standard errors are defined as being the square roots of the diagonal elements of the covariance matrix -H^{-1}, where H (the Hessian) is the second derivative matrix of the log likelihood.

1.150 Intuitively the standard errors can be thought of as being indicators of the speed with which log-likelihood falls from the maximum given a change in a parameter. For example consider the diagram below.

[Figure: Intuitive illustration of standard errors. Log likelihood plotted as a surface against Parameter 1 and Parameter 2.]
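As a sketch of the standard error calculation, consider the Poisson model with log link and unit prior weights, for which -H = X' diag(μ) X evaluated at the fitted values. The fitted values below are assumed to be those of the four-observation example (800, 500, 400, 200); the helper `solve_linear` is a hypothetical function written for this illustration.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix (male, female, urban) and assumed fitted values.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
mu = [15600.0 / 19.0, 9100.0 / 19.0, 7200.0 / 19.0, 4200.0 / 19.0]

p = len(X[0])
# Minus the Hessian of the log-likelihood: X' diag(mu) X.
neg_hessian = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                for k in range(p)] for j in range(p)]

# Covariance matrix -H^{-1}, built one column at a time.
covariance_columns = [
    solve_linear(neg_hessian, [1.0 if i == j else 0.0 for i in range(p)])
    for j in range(p)
]

# Standard errors: square roots of the diagonal of the covariance matrix.
standard_errors = [math.sqrt(covariance_columns[j][j]) for j in range(p)]
```

A sharply curved log-likelihood (large entries in `neg_hessian`) produces small standard errors, which is the intuition the diagram conveys.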

37 1.151 This diagram illustrates a simple case with two parameters (β1 and β2) and shows how log likelihood varies, for the dataset in question, for different values of the two parameters. It can be seen that movements in parameter 1 from the optimal position reduce log likelihood more quickly than similar movements in parameter 2, that is to say the log likelihood curve becomes steeper in the parameter 1 direction than in the parameter 2 direction. This can be thought of as the second partial differential of log likelihood with respect to parameter 1 being large and negative, with the result that the standard error for parameter 1 (which derives from minus one over the second partial differential) is small. Conversely the second partial differential of log likelihood with respect to parameter 2 is less large and negative, with the standard error for parameter 2 being larger (indicating greater uncertainty).

1.152 Generally it is assumed that the parameter estimates are asymptotically Normally distributed; consequently it is in theory possible to undertake a simple statistical test on individual parameter estimates, comparing each estimate with zero (ie testing whether the effect of each level of the factor is significantly different from the base level of that factor). This is usually performed using a χ² test, with the square of the parameter estimate divided by its variance being compared to a χ² distribution. This test in fact compares the parameter with the base level of the factor. This is not necessarily a fully useful test in isolation as the choice of base level is arbitrary. It is theoretically possible to change the base level repeatedly and so construct a triangle of χ² tests comparing every pair of parameter estimates. If none of these differences is significant then this is good evidence that the factor is not significant.

1.153 In practice graphical interpretation of the parameter estimates and standard errors is often more helpful, and this is discussed in Section 2.

Deviance tests

1.154 In addition to the parameter estimate standard errors, measures of deviance can be used to assess the theoretical significance of a particular factor. In broad terms, a deviance is a measure of how much the fitted values differ from the observations.

1.155 Consider a deviance function d(Y_i; μ_i) defined by

$$d(Y_i; \mu_i) = 2\omega_i \int_{\mu_i}^{Y_i} \frac{Y_i - \zeta}{V(\zeta)} \, d\zeta$$

Under the condition that V(x) is strictly positive, d(Y_i; μ_i) is also strictly positive whenever Y_i differs from μ_i, and satisfies the conditions for being a distance function. Indeed it should be interpreted as such.
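The deviance function can be evaluated numerically and compared with the closed forms obtained by carrying out the integral for particular variance functions: V(x) = x gives 2ω[y ln(y/μ) - (y - μ)] and V(x) = x² gives 2ω[-ln(y/μ) + (y - μ)/μ]. The sketch below uses a simple midpoint rule; the function names, the number of steps and the sample values are assumptions made for illustration.

```python
import math

def deviance(y, mu, variance, weight=1.0, steps=10_000):
    """2 * weight * integral from mu to y of (y - z) / V(z) dz (midpoint rule)."""
    h = (y - mu) / steps              # negative when y < mu, as required
    total = 0.0
    for i in range(steps):
        z = mu + (i + 0.5) * h
        total += (y - z) / variance(z)
    return 2.0 * weight * total * h

def poisson_deviance(y, mu, weight=1.0):
    """Closed form of the integral for V(x) = x."""
    return 2.0 * weight * (y * math.log(y / mu) - (y - mu))

def gamma_deviance(y, mu, weight=1.0):
    """Closed form of the integral for V(x) = x^2."""
    return 2.0 * weight * (-math.log(y / mu) + (y - mu) / mu)

# Illustrative observation and fitted value.
y_obs, mu_fit = 800.0, 821.05
d_poisson = deviance(y_obs, mu_fit, lambda z: z)
d_gamma = deviance(y_obs, mu_fit, lambda z: z * z)
```

Both results are positive even though the observation lies below the fitted value, consistent with reading the deviance as a distance.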