4. GLIM for data with constant coefficient of variation

Similar documents
Lecture 8: Gamma regression

Lecture 6: Poisson regression

SAS Software to Fit the Generalized Linear Model

Lecture Notes 1. Brief Review of Basic Probability

GLM I An Introduction to Generalized Linear Models

Logistic Regression (a type of Generalized Linear Model)

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Underwriting risk control in non-life insurance via generalized linear models and stochastic programming

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Simple linear regression

SUMAN DUVVURU STAT 567 PROJECT REPORT

Table 1 and the Standard Deviation of a Model

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NON-LIFE INSURANCE PRICING USING THE GENERALIZED ADDITIVE MODEL, SMOOTHING SPLINES AND L-CURVES

Java Modules for Time Series Analysis

Poisson Models for Count Data

Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

POLYNOMIAL AND MULTIPLE REGRESSION. Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model.

GLMs: Gompertz s Law. GLMs in R. Gompertz s famous graduation formula is. or log µ x is linear in age, x,

GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Logistic Regression (1/24/13)

Chapter 5 Analysis of variance SPSS Analysis of variance

Regression III: Advanced Methods

Introduction to General and Generalized Linear Models

Generalized Linear Models

Introduction to Predictive Modeling Using GLMs

Time Series Analysis

Elements of statistics (MATH0487-1)

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

A Model of Optimum Tariff in Vehicle Fleet Insurance

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Nonnested model comparison of GLM and GAM count regression models for life insurance data

Probabilistic concepts of risk classification in insurance

1. Let A, B and C are three events such that P(A) = 0.45, P(B) = 0.30, P(C) = 0.35,

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Simple Linear Regression Inference

Sections 2.11 and 5.8

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

Life Table Analysis using Weighted Survey Data

Regression III: Advanced Methods

Applying Statistics Recommended by Regulatory Documents

Chapter 7: Simple linear regression Learning Objectives

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March Due:-March 25, 2015.

Review of Random Variables

Exam C, Fall 2006 PRELIMINARY ANSWER KEY

Module 3: Correlation and Covariance

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

1.5 Oneway Analysis of Variance

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Analysis of Life Insurance Policy Termination and Survivorship

Logit Models for Binary Data

Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page

Lecture 3: Linear methods for classification

Geostatistics Exploratory Analysis

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Nominal and ordinal logistic regression

1 Teaching notes on GMM 1.

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

SUPPLEMENT TO USAGE-BASED PRICING AND DEMAND FOR RESIDENTIAL BROADBAND : APPENDIX (Econometrica, Vol. 84, No. 2, March 2016, )

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Least Squares Estimation

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Computer exercise 4 Poisson Regression

Ordinal Regression. Chapter

2. Linear regression with multiple regressors

Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic.

Modeling Count Data from Hawk Migrations

Multiple Linear Regression in Data Mining

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

Statistical Rules of Thumb

Lecture 1: Asset pricing and the equity premium puzzle

Stochastic programming approaches to pricing in non-life insurance

Constant Elasticity of Variance (CEV) Option Pricing Model:Integration and Detailed Derivation

Generalised linear models for aggregate claims; to Tweedie or not?

Week 5: Multiple Linear Regression

SYSTEMS OF REGRESSION EQUATIONS

Financial Risk Management Exam Sample Questions/Answers

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

ANOVA. February 12, 2015

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Joint models for classification and comparison of mortality in different countries.

QUADRATIC, EXPONENTIAL AND LOGARITHMIC FUNCTIONS

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Experiment #1, Analyze Data using Excel, Calculator and Graphs.

BIOL 933 Lab 6 Fall Data Transformation

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Section 3 Part 1. Relationships between two numerical variables

Chapter 3 RANDOM VARIATE GENERATION

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Lecture 14: GLM Estimation and Logistic Regression

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Linda Staub & Alexandros Gekenidis

Review of Fundamental Mathematics

Transcription:

4. GLIM for data with constant coefficient of variation 4.1. Data with constant coefficient of variation and some simple models Coefficient of variation The coefficient of variation of a random variable Y is defined as Var(Y ) c.v. =. EY If the coefficient of variation is a constant, than Var(Y )=σ 2 (EY ) 2, i.e. V (µ) =µ 2. Variance stablization and Log-normal model Note Y µ σ. Ifσ is small, µ ln Y =lnµ + 1 1 (Y µ) µ 2µ 2(Y µ)2 + Hence, E ln Y =lnµ σ2 2, Var(ln Y ) σ2. If a multiplicative model can be assumed for Y and the covariates X 1,...,X p, i.e., µ =exp{β 0 + p j=1 β jx j } =exp{x t β}, then Y can be treated as if it has a log-normal distribution, and LS regression can be applied to ln Y and x. 1

Weighted non-linear LS regression and Gamma model Assume the multiplicative model for Y and X and the variance function V (µ) = µ 2. The β can be estimated by iteratively weighted non-linear LS regression: minimizing n 1 (y i exp{x t iβ}) 2, i=1 µ 2 i while treating µ i as fixed at each iteration, which is equivalent to solving [ n ] 1 β µ 2 (y i exp{x t i β})2 i=1 i n 1 = (y i exp{x t i β}) exp{xt i β} =0, β i=1 µ 2 i which, in turn, is equivalent to solving = n β i=1 n ( y i µ 2 i i=1 ( y i µ i ln µ i ) 1 µ i ) exp{xt i β} β =0, Weighted non-linear LS regression is equivalent to the GLIM assuming Gamma distribution and the log link function. 2

Comparison of efficiency loss Assumed True model model Log-normal Gamma Log-normal 0 G-L Gamma L-G 0 The efficiency loss G-L is larger than the efficiency loss L-G. 4.2 Gamma distribution and its properties Two ways of paramerization: Conventional parmeterization G(α, β): 1 1 e y/β. Γ(α) β αyα 1 EY = αβ, Var(Y )=αβ 2. GLIM parmeterization in terms of mean and dispersion parameter µ and ν: ( ) ν 1 ν y ν 1 e νy/µ. Γ(ν) µ EY = µ, Var(Y )=µ 2 /ν. Moment and Cumulant generatiting funcitons M(t) = (1 µt/ν) ν, K(t) = ν ln(1 µt/ν). 3

Cumulants: κ 1 = E(Y )=µ, κ 2 = Var(Y )=µ 2 /ν, κ 3 = E(Y µ) 3 =2µ 3 /ν 2, κ 4 = 6µ 4 /ν 3, κ r = (r 1)!µ r /ν r 1. Special cases (i) ν = 1 Exponential distribution: 1 µ e y/µ. (ii) µ = n, ν = n/2: χ 2 -distribution with d.f n: 1 2 n/2 Γ(n/2) y n 2 2 e y/2. Some other properties (i) If Y Γ(µ, ν), then, for c>0, cy Γ(cµ, ν). (ii) If Y 1,...,Y n, i.i.d. Γ(µ, ν), then n Y i Γ(nµ, nν), Ȳ Γ(µ, nν). i=1 4

(iii) As ν, ν(y µ) N(0,µ 2 ). Shapes of the p.d.f. of Gamma distributions 5

4.3. Models for data with constant coefficient of variation Assumptions Data: (Y i, x i ),i=1,...,n. Y i Γ(µ i,ντ i ) η i = g(µ i ), η i = x t i β. Case of common coefficient of variation: τ i 1. Case of diffenent coefficients of variation: τ i are not all equal. Remark: The Gamma model assumed above is a prototype of the models for data with constant coefficient of variation. In application, the data needs not neccesarily follow a Gamma distribution. The Gamma model can be applied if the constant coefficient of variation can be justified. The role the Gamma distribution with multiplicative model plays on data with constant coefficient is to the role the Normal distribution with additive model plays on data with constant variance. Consideration on link function and linear predictor (i) Canonical link, η(µ) =1/µ, and inverse polynomial. The canonical link does not map the range of µ onto the whole real line. Caution must be taken in the estimation to make sure that ˆµ >0. 6

A example of canonical link in needs: x: plant density (number of plants in unit area). Since yield per plant is inversely proportional to density, the mean yield per plant can be modeled as 1 β 0 x + β 1 Thus, yield per unit area is given by x µ = β 0 x + β 1 This gives rise to η = µ 1 = β 1 /x + β 0. In practical problems, the mean response is subjected to certain restrictions. As a function of the covariates, it might be bounded, have maximum or minimum, or have asymptoes, etc.. The choice of the link function and linear predictor must reflect these natures. The inverse polynormials can be useful in this regards. 7

Inverse polynormial: Inverse linear: β 1 /x + β 0. Inverse quadratic: β 1 /x + β 0 + γ 1 x. The inverse polynormials have certain desirable properties as illustrated by the following figure: For more about inverse polynomials, see Nelder (1966), Inverse polynomials, a useful group of multi-factor response functions. Biometrics, 22, 128-141. 8

(ii) Log-link, η(µ) =lnµ, and certain particular scales of covariates. 9

(iii) Identity link, η(µ) =µ, and the modeling of variance in robust design. Scheme of Robust Design: Index induced Noise levels by covariates 1 k Mean Variance 1 Y 11 Y 1k Ȳ 1 s 2 1 2 Y 21 Y 2k Ȳ 2 s 2 2 i Y i1 Y ik Ȳ i s 2 i n Y n1 Y nk Ȳ n s 2 n Covariates affecting the mean: X 1,...,X p. Covariates affecting the variance: Z 1,...,Z q. The model: p EȲi = β j x ij, Note: Es 2 i = j=1 q γ l z ij. l=1 (k 1)s 2 i σ 2 i χ 2 k 1 =Γ(k 1, (k 1)/2). Es 2 i = σ2 i = µ i, Var(s 2 i )= σ 4 i (k 1)/2 = µ2 i /ν. 10

use of glm For the case of common coefficient of variance: glm(reponse ~ model formula, famility = Gamma(link), data=data.set) For the case of different coefficient of variance: glm(reponse ~ model formula, famility = Gamma(link), weight = w.vector,data=data.set) where w.vector= (τ 1,...,τ n ) t. Available link functions: log, inverse, identity Estimation of dispersion parameter σ 2 =1/ν ˆσ 2 = 1 n τ i ( y i ˆµ i ) 2. n q ˆµ i i=1 Remark: If the Gamma distribution can be assumed, ν can be estimated by maximum likelihood estimation. The MLE is obtained by solving 2n(ln ˆν Γ (ˆν) )=D(y; ˆµ), Γ(ˆν) where D(y; ˆµ) is the deviance. But (a) MLE is sensitive to rounding errors and (b) MLE is not robust if the assumption of Gamma distribution is violated. 11

4.4. Examples Example 1. Car insurance claims 12

The Variables: Response Y: average claims in pounds sterling, Covariates: Polocyholder age (PA): 17-20, 21-24, 25-29,30-34,35-39,40-49,50-59,60+; Car group (CG): A, B, C, D; Vehicle age (VA): 0-3, 4-7, 8-9,10+ Determination of Variance function From the original raw data, for each average, the corresponding sample variance can be computed. The relationship of the variance and the mean can be assessed, which can aid at the determination of the variance function. This can be done by checking the plot: s 2 ijk vs. y ijk. Since Var(y ˆ ijk )=s 2 ijk /m ijk, m ijk must be used as the weights in glm. Thus V (µ ijk )=σ 2 µ 2 ijk/m ijk. Determination of Link function An intuitive consideration: if the insurance company wants to determine the policy price for a given period length, it would like to know the rate at which instalments of 1 pound must be paid to service an average claim over the given period. This rate is given by 13

the inverse of the average claim. Hence the inverse link η ijk = µ 1 ijk can be considered. A more rigorous verification is provided below: The linear predictor η ijk = µ 0 + α i + β j + γ k. The linear predictor should also be subjected to checking. 14

Rcodes PA_factor(rep (rep( c(17,21,25,30,35,40,50,60), c(4,4,4,4,4,4,4,4)),4)) CG_factor(rep(rep(c("A","B","C","D"),8),4)) VA_factor(rep(c(0,4,8,10),c(32,32,32,32))) claim.amount <- c(289,372,189,763,302,420,268,407,268,275,334,383, 236,259,340,400,207,208,251,233,254,218,239,387, 251,196,268,391,264,224,269,385,282,249,288,850, 194,243,343,320,285,243,274,305,270,226,260,349, 129,214,232,325,213,209,250,299,227,229,250,228, 198,193,258,324,133,288,179, 0,135,196,293,205, 181,179,208,116,160,161,189,147,157,149,204,207, 149,172,174,325,172,164,175,346,167,178,227,192, 160, 11, 0, 0,166,135,104, 0,110,264,150,636, 110,107,104, 65,113,137,141, 0, 98,110,129,137, 98,132,152,167,114,101,119,123) claim.num <- c(8,10,9,3,18,59,44,24,56,125,163,72,43,179,197,104, 43,191,210,119,90,380,401,199,69,366,310,105,64,228, 183,62,8,28,13,2,31,96,39,18,55,172,129,50,53,211, 125,55,73,219,131,43,98,434,253,88,120,353,148,46, 100,233,103,22,4,1,1,0,10,13,7,2,17,36,18,6,15,39, 30,8,21,46,32,4,35,97,50,8,42,95,33,10,43,73,20,6, 1,1,0,0,4,3,2,0,12,10,8,1,12,19,9,2, 14,23,8,0,22, 59,15,9,35,45,13,1,53,44,6,6) car.dat<-data.frame(claim.amount,claim.num,pa,cg,va) options(contrasts=c("contr.treatment","contr.poly")) car.glm<-glm(claim.amount~pa+cg+va,family=gamma(inverse), weight=claim.num,data=car.dat) summary(car.glm) 15

Estimated Coefficents Value Std. Error t value (Intercept) 0.003411 0.000418 8.160993 PA21 0.000101 0.000436 0.232406 PA25 0.000350 0.000412 0.848662 PA30 0.000462 0.000411 1.125999 PA35 0.001370 0.000419 3.268478 PA40 0.000969 0.000405 2.395944 PA50 0.000916 0.000408 2.246508 PA60 0.000920 0.000416 2.213389 CGB 0.000038 0.000169 0.223233 CGC -0.000614 0.000170-3.611006 CGD -0.001421 0.000181-7.866716 VA4 0.000366 0.000101 3.631972 VA8 0.001651 0.000227 7.281134 VA10 0.004154 0.000442 9.390544 Estimation of dispersion parameter rp_resid(car.glm,type="pearson") dispersion_sum(rp^2)/109 > dispersion [1] 1.209015 16

Conclusion The largest average claims are made by polocyholders in the youngest four age groups, i.e., up to age 34, the smallest claims by those aged 35-39, and intermideiate claims by those aged 40 and over. The value of claims decreases with car age. There are marked differences between the car groups, group D being the most expensive and group C intermediate. No significance difference between groups A and B. Examle 2. Developmental rate of Drosophila melanogaster 17

Determination of variance function Plot average duration time against standard deviation: The plot seems linear, which implies Var(mean durition time) = σ 2 µ 2 /BS, BS: batch size. 18

Determination of link function and linear predictor Plot mean duration time against temprature: The right panel of the Figure above shows the Arrhenius s law: ln µ = c/t. The left panel can be well fitted by ln µ = β 0 + β 1 T + β 1 /(T δ). If β 1 < 0, then, as T δ, µ. 19

R codes and results melanogaster<-read.table("melanogaster.txt") attach(melanogaster) plot(d2,std) x<-seq(15,35,by=0.5) y<-1/x par(mfrow=c(1,2)) plot(temp,d2) plot(x,y*840) mela.glm<-glm(d2~temp+i(1/(temp-58.644)), family=gamma(log),weight=bs,data=melanogaster) summary(mela.glm) Coefficients: Value Std. Error t value (Intercept) 3.2007 0.0439 72.82830 Temp -0.2648 0.0036-72.35518 I(1/(Temp - 58.644)) -217.0802 4.3935-49.40890 (Dispersion Parameter for Gamma family taken to be 0.0159168 ) Minimum duration occurs at T = δ (β 1 /β 1 ) 1/2 =30.01. 20