The zero-adjusted Inverse Gaussian distribution as a model for insurance claims


Gillian Heller¹, Mikis Stasinopoulos² and Bob Rigby²

¹ Dept of Statistics, Macquarie University, Sydney, Australia. gheller@efs.mq.edu.au
² STORM, London Metropolitan University. Emails: d.stasinopoulos@londonmet.ac.uk and r.rigby@londonmet.ac.uk

Abstract: We introduce a method for modelling insurance claim sizes, including zero claims. A mixed discrete-continuous model, with a probability mass at zero and an Inverse Gaussian continuous component, is used. The Inverse Gaussian distribution accommodates the extreme right skewness of the claim distribution. The model explicitly specifies a logit-linear model for the occurrence of a claim, and log-linear models for the mean claim size (given a claim has occurred) and the dispersion of claim sizes (given a claim has occurred). The method is illustrated on an Australian motor vehicle insurance data set.

Keywords: Inverse Gaussian model; zero-adjusted; insurance claims; gamlss.

1 Introduction

The purpose of modelling claim sizes on insurance policies is to price premiums accurately, and to estimate the risk of extreme claim events. In a fixed period, a policy will either experience a claim, a non-negative amount that typically has an extremely right-skewed distribution, or no claim, in which case the claim amount is identically zero. The distribution of the claim size is then mixed discrete-continuous: a continuous, right-skewed distribution mixed with a single probability mass at zero. In this respect the phenomenon is similar to rainfall, which is either identically zero on a dry day, or a continuous non-negative amount on a wet day.

1.1 Models for insurance claims

Much attention has been paid in the actuarial literature to alternative distributions for claim sizes (e.g. Hogg and Klugman (1984)), and some authors have developed regression models (usually generalized linear models) for explaining claim sizes as a function of risk factors (e.g. Haberman and Renshaw (1996)). All of these are models for claim sizes in the subclass of policies which had a claim in the period of observation. Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) considered models for claim sizes, including the zero claims. These are based on

the Tweedie distribution, which may be characterised as a Poisson sum of Gamma random variates. A problem with the Tweedie distribution model is that the probability at zero cannot be modelled explicitly as a function of explanatory variables; and, as we shall see in the example, the Gamma distribution is inadequate for modelling the extreme right skewness which is present in our data.

2 The zero-adjusted Inverse Gaussian model

Let \(y_i\) be the size of the claim on the \(i\)th policy, \(i = 1, \ldots, n\). We can write the distribution of \(y\) as a mixed discrete-continuous probability function:

\[ f(y) = \begin{cases} 1 - \pi, & y = 0 \\ \pi\, g(y), & y > 0 \end{cases} \qquad (1) \]

where \(g(y)\) is the density of a continuous, right-skewed distribution and \(\pi\) is the probability of a claim.

2.1 Continuous part of the model

The extreme right skewness of claims distributions has been well documented. Candidate distributions within the exponential family are the Gamma and Inverse Gaussian distributions.

Motor vehicle insurance example

We illustrate the method on a class of motor vehicle insurance policies from an Australian insurance company in [...]. There were 67,856 policies, of which 4,624 (6.8%) had at least one claim in the period of observation. Of these, 4,333 policies (6.4%) had one claim, and the remaining 291 policies (0.4%) had between 2 and 4 claims. The maximum claim size was $56,000. A histogram of the non-zero claims, and the pdfs of the fitted Gamma and Inverse Gaussian distributions, are shown in Figure 1. (For clarity of display the horizontal axis has been truncated at $15,000. Sixty-five observations were omitted.) The Gamma clearly does not reproduce the shape of the observed claim size distribution; the Inverse Gaussian looks to be a far better fit, accommodating both the mode near zero and the extremely long tail of the distribution.

The density of the Inverse Gaussian is:

\[ g(y) = \frac{1}{\sqrt{2\pi y^{3}}\,\sigma} \exp\left[ -\frac{1}{2y} \left( \frac{y-\mu}{\mu\sigma} \right)^{2} \right], \qquad y > 0, \]

which has \(E(y) = \mu\) and \(Var(y) = \sigma^{2}\mu^{3}\).
The use of the Inverse Gaussian distribution for modelling claim sizes has been recommended by, for example, Berg (1994).
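As a quick numerical check of the density and moments stated above, the following Python sketch integrates the Inverse Gaussian density on a grid and compares the results with \(\mu\) and \(\sigma^{2}\mu^{3}\). The parameter values, the integration range and the helper names (`ig_pdf`, `trapezoid_moments`) are illustrative assumptions, not part of the paper or of the gamlss package.

```python
import math

def ig_pdf(y, mu, sigma):
    """Inverse Gaussian density in the (mu, sigma) parameterisation of Section 2.1."""
    if y <= 0:
        return 0.0
    z = (y - mu) / (mu * sigma)
    return math.exp(-z * z / (2.0 * y)) / (sigma * math.sqrt(2.0 * math.pi * y ** 3))

def trapezoid_moments(mu, sigma, upper, n_steps):
    """Trapezoid-rule approximations of total mass, mean and variance on (0, upper]."""
    h = upper / n_steps
    m0 = m1 = m2 = 0.0
    prev = (0.0, 0.0, 0.0)  # the density tends to 0 as y -> 0+
    for k in range(1, n_steps + 1):
        y = k * h
        g = ig_pdf(y, mu, sigma)
        cur = (g, y * g, y * y * g)
        m0 += 0.5 * h * (prev[0] + cur[0])
        m1 += 0.5 * h * (prev[1] + cur[1])
        m2 += 0.5 * h * (prev[2] + cur[2])
        prev = cur
    return m0, m1, m2 - m1 * m1

# Hypothetical parameter values, chosen only so that the grid covers the tail.
mu, sigma = 2.0, 0.5
total, mean, var = trapezoid_moments(mu, sigma, upper=100.0, n_steps=200_000)

print(f"total mass: {total:.6f}")
print(f"mean:       {mean:.6f}  (theory: {mu})")
print(f"variance:   {var:.6f}  (theory: {sigma ** 2 * mu ** 3})")
```

In this parameterisation \(\sigma\) corresponds to \(1/\sqrt{\lambda}\) in the more common \((\mu, \lambda)\) form of the Inverse Gaussian, which is why the variance is \(\sigma^{2}\mu^{3}\) rather than \(\mu^{3}/\lambda\).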

[FIGURE 1. Distribution: motor vehicle insurance. Histogram of the non-zero claim sizes with the fitted Inverse Gaussian and Gamma densities; vertical axis f(y).]

2.2 Discrete part of the model

The obvious model for the probability of a claim is the Bernoulli. Let \(w_i\) be a binary variable indicating the occurrence of at least one claim, and \(\pi_i\) be the probability of at least one claim, on policy \(i\). Note that the occurrence of more than one claim in the period of observation is rare. Then

\[ f(w_i) = \pi_i^{w_i} (1 - \pi_i)^{1 - w_i}, \qquad w_i = 0, 1. \]

However, we have to correct for a typical feature of policy-level data: not all policies have been in force for the entire period of observation. Let \(t_i\) be the exposure of policy \(i\), \(0 < t_i \le 1\). (Exposure is the proportion of the period of observation for which the policy has been in force.) We will assume that the \(t_i\) are known. If \(c_i\) is the number of claims in the period, and we assume a Poisson process with mean number of claims (per unit exposure time) \(\pi_i\), then

\[ c_i \mid t_i \sim Po(t_i \pi_i), \qquad P(c_i = 0 \mid t_i = 1) = e^{-\pi_i} \approx 1 - \pi_i \]

and

\[ P(c_i = 0 \mid t_i) = e^{-t_i \pi_i} \approx 1 - t_i \pi_i, \]

provided \(t_i \pi_i\) is small. This gives

\[ f(w_i) = (\pi_i^{*})^{w_i} (1 - \pi_i^{*})^{1 - w_i}, \qquad w_i = 0, 1, \]

i.e. Bernoulli with \(\pi_i^{*} = t_i \pi_i\). We incorporate covariates through the logit link function on \(\pi_i\):

\[ \log \frac{\pi_i}{1 - \pi_i} = \eta_i \]

i.e.

\[ \log \frac{\pi_i^{*}/t_i}{1 - \pi_i^{*}/t_i} = \eta_i \qquad (2) \]

and the correction for differing periods of exposure enters the model through the modified link function (2). The predictor \(\eta_i\) is defined in the next section.

2.3 The mixture model

The zero-adjusted Inverse Gaussian (ZAIG) model is then

\[ f(y_i) = \begin{cases} 1 - \pi_i, & y_i = 0 \\ \pi_i \, \dfrac{1}{\sqrt{2\pi y_i^{3}}\,\sigma_i} \exp\left[ -\dfrac{1}{2 y_i} \left( \dfrac{y_i - \mu_i}{\mu_i \sigma_i} \right)^{2} \right], & y_i > 0 \end{cases} \]

which has \(E(y_i) = \pi_i \mu_i\) and \(Var(y_i) = \pi_i \mu_i^{2} \left( 1 - \pi_i + \mu_i \sigma_i^{2} \right)\).

Following Rigby and Stasinopoulos (2005), who specify generalized additive models for the location, scale and shape parameters of a variety of distributions, we specify the following models on the parameters \(\mu_i\), \(\sigma_i\) and \(\pi_i\):

\[ \log(\mu_i) = x_{1\mu i}^{\top} \beta_{\mu} + f_{\mu}(x_{2\mu i}) \]
\[ \log(\sigma_i) = x_{1\sigma i}^{\top} \beta_{\sigma} + f_{\sigma}(x_{2\sigma i}) \]
\[ \log \frac{\pi_i^{*}/t_i}{1 - \pi_i^{*}/t_i} = x_{1\pi i}^{\top} \beta_{\pi} + f_{\pi}(x_{2\pi i}) \]

where \(x_{1\mu i}\), \(x_{2\mu i}\), \(x_{1\sigma i}\), \(x_{2\sigma i}\), \(x_{1\pi i}\) and \(x_{2\pi i}\) are covariate vectors for \(\mu_i\), \(\sigma_i\) and \(\pi_i\), which may be different, the same, or may have some but not all elements in common; \(\beta_{\mu}\), \(\beta_{\sigma}\) and \(\beta_{\pi}\) are the corresponding parameter vectors; and \(f_{\mu}\), \(f_{\sigma}\) and \(f_{\pi}\) are nonparametric functions, typically smoothing splines.

In order to correct for multiple claims in the period, we use the fact that, if \(y_j \sim IG(\mu, \sigma)\), \(j = 1, \ldots, c\), independently, then the total \(t = \sum_j y_j\) has the distribution \(t \sim IG(\mu^{*}, \sigma^{*})\), where \(\mu^{*} = c\mu\) and \(\sigma^{*} = \sigma/c\). As \(\log(\mu^{*}) = \log(\mu) + \log(c)\) and \(\log(\sigma^{*}) = \log(\sigma) - \log(c)\), we use \(\log(c_i)\) and \(-\log(c_i)\) as offsets in the models for \(\mu_i\) and \(\sigma_i\) respectively, where \(c_i\) is the number of claims on policy \(i\). (A doubtful assumption here is that multiple claim amounts on the same policy are independent.)
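The pieces of Sections 2.2 and 2.3 can be put together in a short simulation. The Python sketch below computes an exposure-adjusted claim probability from the link function (2), compares the ZAIG mean and variance formulas with simulated data, and checks that the offsets for multiple claims preserve the variance of the total. Everything numeric here is hypothetical, the helper names are invented for illustration, and using the exposure-adjusted probability as the mixing probability of the ZAIG is an assumption of this sketch. The Inverse Gaussian sampler uses the Michael-Schucany-Haas transformation, which is not something the paper discusses.

```python
import math
import random

def claim_probability(eta, exposure):
    """Exposure-adjusted claim probability pi* = t * pi, with logit(pi) = eta (eq. 2)."""
    pi = 1.0 / (1.0 + math.exp(-eta))
    return exposure * pi

def zaig_mean_var(pi, mu, sigma):
    """Mean and variance of the ZAIG distribution as given in Section 2.3."""
    return pi * mu, pi * mu ** 2 * (1.0 - pi + mu * sigma ** 2)

def rinvgauss(rng, mu, sigma):
    """One Inverse Gaussian draw (Michael-Schucany-Haas transformation method)."""
    lam = 1.0 / sigma ** 2  # shape parameter of the usual (mu, lambda) form
    v = rng.gauss(0.0, 1.0) ** 2
    x = (mu + mu * mu * v / (2.0 * lam)
         - mu / (2.0 * lam) * math.sqrt(4.0 * mu * lam * v + (mu * v) ** 2))
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

# Hypothetical values: annual claim probability 0.2, policy in force half the period.
eta, exposure, mu, sigma = math.log(0.25), 0.5, 2.0, 0.5
pi_star = claim_probability(eta, exposure)

rng = random.Random(2024)
n = 200_000
draws = [rinvgauss(rng, mu, sigma) if rng.random() < pi_star else 0.0
         for _ in range(n)]
sample_mean = sum(draws) / n
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (n - 1)
theory_mean, theory_var = zaig_mean_var(pi_star, mu, sigma)

# Offsets for c claims: IG(c * mu, sigma / c) has the same variance as the
# sum of c independent IG(mu, sigma) amounts.
c = 3
var_of_sum = c * sigma ** 2 * mu ** 3
var_with_offsets = (sigma / c) ** 2 * (c * mu) ** 3

print(f"pi* = {pi_star:.4f}")
print(f"mean: simulated {sample_mean:.4f}, formula {theory_mean:.4f}")
print(f"variance: simulated {sample_var:.4f}, formula {theory_var:.4f}")
print(f"variance of total of {c} claims: {var_of_sum:.4f} vs {var_with_offsets:.4f}")
```

The final comparison only confirms that the first two moments agree under the offsets; it does not test the independence assumption that the text itself flags as doubtful.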

3 Estimation

The ZAIG has been incorporated into the gamlss package in R (Stasinopoulos et al. (2006)). Maximum (penalised) likelihood estimation is used. The penalised log likelihood function of the model is maximised iteratively using either the RS or CG algorithm of Rigby and Stasinopoulos (2005), which in turn uses a back-fitting algorithm to perform each step of the Fisher scoring procedure. Both the RS and CG algorithms use the log likelihood of the data and its first derivatives (and optionally expected second derivatives) with respect to the distributional parameters, which in this case are \(\mu\), \(\sigma\) and \(\nu = \pi\). The CG algorithm, a generalization of the algorithm used by Cole and Green (1992), additionally uses the expected cross derivatives.

3.1 Motor vehicle insurance

The following covariates were available:

Variable                          Range
Characteristics of policy holder:
  Age band                        1, 2, 3, 4, 5, 6 (1 is youngest)
  Gender                          male, female
  Area of residence               A, B, C, D, E, F
Characteristics of vehicle:
  Value                           $0-$350,000
  Make                            A, B, C, D
  Age                             1, 2, 3, 4 (1 is recent)
  Body type                       bus, convertible, coupe, hatchback, hardtop, motorised caravan/combi, minibus, panel van, roadster, sedan, station wagon, truck, utility

Using the GAIC as model selection criterion, the following final model was selected:

\[ \log(\mu) = \text{age band} + \text{gender} + \text{area} + \text{offset}\{\log(\text{claims})\} \]
\[ \log(\sigma) = \text{area} + \text{offset}\{-\log(\text{claims})\} \]
\[ \log\left( \frac{\pi}{1 - \pi} \right) = \text{age band} + \text{area} + \text{vehicle body} + \text{spline}(\text{vehicle value}) \]

Comments on the model

Model for \(\pi\): The model for the occurrence of a claim has terms for both policyholder and vehicle characteristics. Policyholder age, area and vehicle body are all categorical, so their form is not an issue; vehicle value is the only continuous covariate that we have, and it enters the model in smoothing spline form. The reason becomes clear when we examine the scatterplot of claim/no claim, with a smoothing spline, in Figure 2.
The relationship is nonlinear; the probability of a claim is at a maximum for vehicle value around $40,000.

[FIGURE 2. Occurrence of a claim (0/1) plotted against vehicle value, with smoothing spline. Vertical axis: claim, with smoothed data; horizontal axis: vehicle value in $10,000 units.]

Model for \(\mu\): This contains only policyholder characteristics, which is surprising. A more complicated model, involving vehicle value, make and some interaction terms, was a close second in the model selection. However, it was felt that this was too complex and difficult to interpret, so the simpler version was chosen.

Model for \(\sigma\): Area is the only covariate for \(\sigma\). The variation of the claim size distribution with area is shown in Figure 3: it can be seen that areas D, E and F have shapes which are different from A, B and C, reflected in lower values for \(\sigma\). In fact areas D, E and F are rural whereas A, B and C are urban.

The explanatory variables age band and area appear in the model equations for both \(\pi\) and \(\mu\). It is of interest whether they affect the occurrence of a claim, and claim size, in the same way. Figure 4a shows the effect of age band, \(\exp(\hat{\beta})\), on both \(\pi/(1 - \pi)\) and \(\mu\); Figure 4b shows the effect of area on both \(\pi/(1 - \pi)\) and \(\mu\). Note that age band = 3 and area = A are the reference categories. Age band 1 (the youngest drivers) increases both the odds of a claim and the mean claim size, to a similar extent; age bands 2 and 4 have a similar effect to age band 3; and age bands 5 and 6 (older drivers) decrease both the odds of a claim and the mean claim size, their effect being greater on the odds of a claim. The effect of area on the odds of a claim and on the mean claim size is less clear: the only clear indication is that the mean claim size is increased in area F.
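To make the likelihood behind Section 3 concrete, the Python sketch below treats the simplest special case: constant \(\pi\), \(\mu\) and \(\sigma\) with no covariates, offsets or smoothing terms. In that case the ZAIG log likelihood separates into a Bernoulli part and an Inverse Gaussian part, and the maximum likelihood estimates have closed forms. This is only an illustration under those assumptions; it is not the RS or CG algorithm, it is not the gamlss implementation, and the data are simulated from hypothetical parameter values.

```python
import math
import random

def rinvgauss(rng, mu, sigma):
    """One Inverse Gaussian draw (Michael-Schucany-Haas transformation method)."""
    lam = 1.0 / sigma ** 2
    v = rng.gauss(0.0, 1.0) ** 2
    x = (mu + mu * mu * v / (2.0 * lam)
         - mu / (2.0 * lam) * math.sqrt(4.0 * mu * lam * v + (mu * v) ** 2))
    return x if rng.random() <= mu / (mu + x) else mu * mu / x

def zaig_loglik(y, pi, mu, sigma):
    """ZAIG log likelihood for constant pi, mu and sigma."""
    ll = 0.0
    for v in y:
        if v == 0.0:
            ll += math.log(1.0 - pi)
        else:
            z = (v - mu) / (mu * sigma)
            ll += (math.log(pi) - math.log(sigma)
                   - 0.5 * math.log(2.0 * math.pi * v ** 3) - z * z / (2.0 * v))
    return ll

def zaig_mle(y):
    """Closed-form ML estimates (pi, mu, sigma) when there are no covariates."""
    pos = [v for v in y if v > 0.0]
    pi_hat = len(pos) / len(y)                 # proportion of policies with a claim
    mu_hat = sum(pos) / len(pos)               # mean of the non-zero claims
    sigma2_hat = sum(1.0 / v for v in pos) / len(pos) - 1.0 / mu_hat
    return pi_hat, mu_hat, math.sqrt(sigma2_hat)

# Simulated data from hypothetical "true" values (not estimates from the paper).
true_pi, true_mu, true_sigma = 0.1, 2.0, 0.5
rng = random.Random(7)
y = [rinvgauss(rng, true_mu, true_sigma) if rng.random() < true_pi else 0.0
     for _ in range(100_000)]

pi_hat, mu_hat, sigma_hat = zaig_mle(y)
ll_hat = zaig_loglik(y, pi_hat, mu_hat, sigma_hat)
ll_true = zaig_loglik(y, true_pi, true_mu, true_sigma)
ll_shifted = zaig_loglik(y, pi_hat, 1.05 * mu_hat, sigma_hat)

print(f"pi_hat = {pi_hat:.4f}, mu_hat = {mu_hat:.4f}, sigma_hat = {sigma_hat:.4f}")
print(f"log likelihood at the estimates: {ll_hat:.2f}")
print(f"log likelihood at the true values: {ll_true:.2f}")
```

With covariates in any of the three predictors no closed form is available, which is where an iterative scheme such as the RS or CG algorithm described in Section 3 comes in.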

[FIGURE 3. Distribution by area. Panel titles give the fitted mean for each area: A, \(\hat{\mu} = 1909\); B, \(\hat{\mu} = 1860\); C, \(\hat{\mu} = 2030\); D, \(\hat{\mu} = 1837\); E, \(\hat{\mu} = 2251\); F, \(\hat{\mu} = 2864\) (the \(\hat{\sigma}\) values are not recoverable).]

[FIGURE 4. Effect of age category and area (\(\exp(\hat{\beta})\)) on occurrence of claim and claim size. Panels: a. Age band; b. Area.]

4 Conclusion

We introduce a method for modelling insurance claim sizes using a zero-adjusted Inverse Gaussian (ZAIG) model, which explicitly specifies a logit-linear model for the occurrence of a claim, and log-linear models for the mean claim size (given a claim has occurred) and the dispersion of claim sizes (given a claim has occurred). These three models may incorporate different covariates, or some of the same covariates, and may depend on common covariates in different ways. The Inverse Gaussian distribution accommodates the extreme right skewness of the claim distributions.

Given the risk factors for a potential new policyholder, the expected claim size may easily be computed as the expected value of the ZAIG distribution, conditional on the covariate values; and quartiles of the claim size distribution may be calculated for each combination of covariate values.

The ZAIG distribution introduced here is a useful distribution for modelling data where the total amount per unit of time is observed but where zero amounts are possible. Rainfall data and smoking/drinking habits data are possible candidates for modelling using the ZAIG distribution.

References

Berg, P.T. (1994). Deductibles and the inverse Gaussian distribution. ASTIN Bulletin, 24.

Cole, T. and Green, P. (1992). Smoothing reference centile curves: The LMS method and penalized likelihood. Statist. in Med., 11.

Hogg, R.V. and Klugman, S.A. (1984). Loss Distributions. New York: Wiley.

Haberman, S. and Renshaw, A.E. (1996). Generalized Linear Models and Actuarial Science. The Statistician, 45(4).

Jørgensen, B. and de Souza, M.C.P. (1994). Fitting Tweedie's compound Poisson model to insurance claims data. Scandinavian Actuarial Journal.

Rigby, R.A. and Stasinopoulos, D.M. (2005). Generalized Additive Models for Location, Scale and Shape (with discussion). Appl. Statist., 54, 1-38.

Smyth, G.K. and Jørgensen, B. (2002). Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin, 32(1).

Stasinopoulos, D.M., Rigby, R.A. and Akantziliotou, C. (2006). gamlss: A collection of functions to fit Generalized Additive Models for Location Scale and Shape. R package version 1.1-0, url = ac.uk/gamlss/.


More information

Comparing return to work outcomes between vocational rehabilitation providers after adjusting for case mix using statistical models

Comparing return to work outcomes between vocational rehabilitation providers after adjusting for case mix using statistical models Comparing return to work outcomes between vocational rehabilitation providers after adjusting for case mix using statistical models Prepared by Jim Gaetjens Presented to the Institute of Actuaries of Australia

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

More information

7.1 The Hazard and Survival Functions

7.1 The Hazard and Survival Functions Chapter 7 Survival Models Our final chapter concerns models for the analysis of data which have three main characteristics: (1) the dependent variable or response is the waiting time until the occurrence

More information

Probability Calculator

Probability Calculator Chapter 95 Introduction Most statisticians have a set of probability tables that they refer to in doing their statistical wor. This procedure provides you with a set of electronic statistical tables that

More information

Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Poisson Regression or Regression of Counts (& Rates)

Poisson Regression or Regression of Counts (& Rates) Poisson Regression or Regression of (& Rates) Carolyn J. Anderson Department of Educational Psychology University of Illinois at Urbana-Champaign Generalized Linear Models Slide 1 of 51 Outline Outline

More information

Own Damage, Third Party Property Damage Claims and Malaysian Motor Insurance: An Empirical Examination

Own Damage, Third Party Property Damage Claims and Malaysian Motor Insurance: An Empirical Examination Australian Journal of Basic and Applied Sciences, 5(7): 1190-1198, 2011 ISSN 1991-8178 Own Damage, Third Party Property Damage Claims and Malaysian Motor Insurance: An Empirical Examination 1 Mohamed Amraja

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

INSURANCE RISK THEORY (Problems)

INSURANCE RISK THEORY (Problems) INSURANCE RISK THEORY (Problems) 1 Counting random variables 1. (Lack of memory property) Let X be a geometric distributed random variable with parameter p (, 1), (X Ge (p)). Show that for all n, m =,

More information

A Log-Robust Optimization Approach to Portfolio Management

A Log-Robust Optimization Approach to Portfolio Management A Log-Robust Optimization Approach to Portfolio Management Dr. Aurélie Thiele Lehigh University Joint work with Ban Kawas Research partially supported by the National Science Foundation Grant CMMI-0757983

More information

Name: Date: Use the following to answer questions 2-3:

Name: Date: Use the following to answer questions 2-3: Name: Date: 1. A study is conducted on students taking a statistics class. Several variables are recorded in the survey. Identify each variable as categorical or quantitative. A) Type of car the student

More information

Factorial experimental designs and generalized linear models

Factorial experimental designs and generalized linear models Statistics & Operations Research Transactions SORT 29 (2) July-December 2005, 249-268 ISSN: 1696-2281 www.idescat.net/sort Statistics & Operations Research c Institut d Estadística de Transactions Catalunya

More information

Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market

Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market by Tom Wright, Partner, English Wright & Brockman 1. Introduction This paper describes one way in which statistical modelling

More information

Parametric Survival Models

Parametric Survival Models Parametric Survival Models Germán Rodríguez grodri@princeton.edu Spring, 2001; revised Spring 2005, Summer 2010 We consider briefly the analysis of survival data when one is willing to assume a parametric

More information

Joint models for classification and comparison of mortality in different countries.

Joint models for classification and comparison of mortality in different countries. Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute

More information

Exam C, Fall 2006 PRELIMINARY ANSWER KEY

Exam C, Fall 2006 PRELIMINARY ANSWER KEY Exam C, Fall 2006 PRELIMINARY ANSWER KEY Question # Answer Question # Answer 1 E 19 B 2 D 20 D 3 B 21 A 4 C 22 A 5 A 23 E 6 D 24 E 7 B 25 D 8 C 26 A 9 E 27 C 10 D 28 C 11 E 29 C 12 B 30 B 13 C 31 C 14

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

More information

Underwriting risk control in non-life insurance via generalized linear models and stochastic programming

Underwriting risk control in non-life insurance via generalized linear models and stochastic programming Underwriting risk control in non-life insurance via generalized linear models and stochastic programming 1 Introduction Martin Branda 1 Abstract. We focus on rating of non-life insurance contracts. We

More information

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS

A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

Lecture 8: Gamma regression

Lecture 8: Gamma regression Lecture 8: Gamma regression Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Models with constant coefficient of variation Gamma regression: estimation and testing

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

2WB05 Simulation Lecture 8: Generating random variables

2WB05 Simulation Lecture 8: Generating random variables 2WB05 Simulation Lecture 8: Generating random variables Marko Boon http://www.win.tue.nl/courses/2wb05 January 7, 2013 Outline 2/36 1. How do we generate random variables? 2. Fitting distributions Generating

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

Probabilistic concepts of risk classification in insurance

Probabilistic concepts of risk classification in insurance Probabilistic concepts of risk classification in insurance Emiliano A. Valdez Michigan State University East Lansing, Michigan, USA joint work with Katrien Antonio* * K.U. Leuven 7th International Workshop

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 4: Transformations Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture The Ladder of Roots and Powers Changing the shape of distributions Transforming

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

Longitudinal Modeling of Singapore Motor Insurance

Longitudinal Modeling of Singapore Motor Insurance Longitudinal Modeling of Singapore Motor Insurance Emiliano A. Valdez University of New South Wales Edward W. (Jed) Frees University of Wisconsin 28-December-2005 Abstract This work describes longitudinal

More information

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

More information

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES Kivan Kaivanipour A thesis submitted for the degree of Master of Science in Engineering Physics Department

More information

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

More information

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators... MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Actuarial Applications of a Hierarchical Insurance Claims Model

Actuarial Applications of a Hierarchical Insurance Claims Model Actuarial Applications of a Hierarchical Insurance Claims Model Edward W. Frees Peng Shi University of Wisconsin University of Wisconsin Emiliano A. Valdez University of Connecticut February 17, 2008 Abstract

More information

Predictive Modeling for Life Insurers

Predictive Modeling for Life Insurers Predictive Modeling for Life Insurers Application of Predictive Modeling Techniques in Measuring Policyholder Behavior in Variable Annuity Contracts April 30, 2010 Guillaume Briere-Giroux, FSA, MAAA, CFA

More information

UNIT I: RANDOM VARIABLES PART- A -TWO MARKS

UNIT I: RANDOM VARIABLES PART- A -TWO MARKS UNIT I: RANDOM VARIABLES PART- A -TWO MARKS 1. Given the probability density function of a continuous random variable X as follows f(x) = 6x (1-x) 0

More information

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

More information