Generalized Linear Models

Last time: definition of the exponential family, derivation of the mean and variance (memorize)

Today: definition of the GLM, maximum likelihood estimation
- Include predictors $x_i$ through a regression model for $\theta_i$
- Involves choice of a link function (systematic component)
- Examples for count and binomial data
- Algorithm for maximizing the likelihood

Systematic Component, Link Functions

Instead of modeling the mean, $\mu_i$, as a linear function of the predictors, $x_i$, we introduce a one-to-one, continuously differentiable transformation $g(\cdot)$ and focus on $\eta_i = g(\mu_i)$, where $g(\cdot)$ is called the link function and $\eta_i$ the linear predictor. We assume that the transformed mean follows a linear model, $\eta_i = x_i'\beta$. Since the link function is one-to-one, it is invertible, and we have
\[ \mu_i = g^{-1}(\eta_i) = g^{-1}(x_i'\beta). \]

Note that we are transforming the expected value, $\mu_i$, rather than the raw data, $y_i$. For classical linear models, the mean is itself the linear predictor. In that case the identity link is reasonable, since both $\mu_i$ and $\eta_i$ can take any value on the real line. This is not the case in general.

Link Functions for Poisson Data

For example, if $Y_i \sim \text{Poisson}(\mu_i)$ then we must have $\mu_i > 0$. A linear model for the mean is not reasonable here, since for some values of $x_i$ it would give $\mu_i \le 0$. By using the model
\[ \eta_i = \log(\mu_i) = x_i'\beta, \]
we are guaranteed $\mu_i > 0$ for all $\beta \in \mathbb{R}^p$ and all values of $x_i$. In general, a link function for count data should map $(0, \infty) \to \mathbb{R}$ (i.e., from the positive real numbers to the entire real line). The log link is a natural choice.
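As a quick illustration (my own simulated example, not data from the course), a Poisson GLM with the log link can be fit in R with glm(); all variable names here are illustrative:

## Minimal sketch: Poisson regression with a log link on simulated data.
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
mu <- exp(0.5 + 0.8*x)      # log link guarantees mu > 0 for any beta
y <- rpois(n, mu)
fit <- glm(y ~ x, family=poisson(link="log"))
coef(fit)                   # estimates should be near (0.5, 0.8)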

Link Functions for Binomial Data

For the binomial distribution, $0 < \mu_i < 1$ (the mean of $y_i$ is $n_i \mu_i$). Therefore, the link function should map $(0, 1) \to \mathbb{R}$. Standard choices:
1. logit: $\eta_i = \log\{\mu_i/(1 - \mu_i)\}$.
2. probit: $\eta_i = \Phi^{-1}(\mu_i)$, where $\Phi(\cdot)$ is the N(0, 1) cdf.
3. complementary log-log: $\eta_i = \log\{-\log(1 - \mu_i)\}$.
Each of these choices is important in applications and will be considered in detail later in the course; a sketch of the three links appears below.
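As a small sketch (not from the slides), the three links can be computed and compared in R; qlogis() and qnorm() are the quantile functions of the logistic and standard normal distributions:

## Each link maps probabilities in (0, 1) onto the whole real line.
p <- seq(0.01, 0.99, by=0.01)
logit   <- qlogis(p)          # log(p/(1-p))
probit  <- qnorm(p)           # Phi^{-1}(p)
cloglog <- log(-log(1 - p))   # complementary log-log
matplot(p, cbind(logit, probit, cloglog), type="l",
        xlab="mu", ylab="eta = g(mu)")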

Recall that the exponential family density has the form
\[ f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}, \]
where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are known functions. Specifying the GLM involves choosing $a(\cdot)$, $b(\cdot)$, $c(\cdot)$:
1. Specify $a(\cdot)$, $c(\cdot)$ to correspond to a particular distribution (e.g., binomial, Poisson).
2. Specify $b(\cdot)$ to correspond to a particular link function.
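For concreteness, here is a quick numerical check (my own, not from the slides) that the Poisson density has this form, with $\theta = \log\mu$, $b(\theta) = e^\theta$, $a(\phi) = 1$, and $c(y, \phi) = -\log y!$:

## Verify the Poisson density matches the exponential family form.
y <- 0:10; mu <- 2.5; theta <- log(mu)
expfam <- exp(y*theta - exp(theta) - lfactorial(y))   # b(theta) = e^theta
all.equal(expfam, dpois(y, mu))                       # TRUE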

Recall that the mean and variance are $\mu_i = b'(\theta_i)$ and $\sigma_i^2 = b''(\theta_i)\,a(\phi)$. Using $b'(\theta_i) = g^{-1}(x_i'\beta)$, we can express the density as $f(y_i; x_i, \beta, \phi)$, so that the conditional likelihood of $y_i$ given $x_i$ depends on the parameters $\beta$ and $\phi$. It would seem that a natural choice for $b(\cdot)$, and hence $g(\cdot)$, would correspond to $\theta_i = \eta_i = x_i'\beta$, so that $b'(\cdot)$ is the inverse link.

Canonical Links and Sufficient Statistics

Each of the distributions we have considered has a special, canonical, link function for which there exists a sufficient statistic equal in dimension to $\beta$. Canonical links occur when $\theta_i = \eta_i = x_i'\beta$, with $\theta_i$ the canonical parameter.

As a homework exercise, please show for next Thursday that the following distributions are in the exponential family and have the listed canonical links:
- Normal: $\eta_i = \mu_i$
- Poisson: $\eta_i = \log \mu_i$
- Binomial: $\eta_i = \log\{\mu_i/(1 - \mu_i)\}$
- Gamma: $\eta_i = \mu_i^{-1}$

For the canonical links, the sufficient statistic is $X'y$, with components $\sum_i x_{ij} y_i$, for $j = 1, \ldots, p$.
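These canonical links are the defaults in R's family objects, which gives a quick way to check the table above:

## Default (canonical) links of R's family objects.
gaussian()$link           # "identity"
poisson()$link            # "log"
binomial()$link           # "logit"
Gamma()$link              # "inverse"
binomial()$linkfun(0.25)  # log(0.25/0.75)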

Although canonical links often have nice properties, selection of the link function should be based on prior expectation and model fit.

Example: Logistic Regression

Suppose $y_i \sim \text{Bin}(1, p_i)$, for $i = 1, \ldots, n$, are independent 0/1 indicators of an adverse response (e.g., preterm birth) and let $x_i$ denote a $p \times 1$ vector of predictors for individual $i$ (e.g., dose of DDE exposure, race, age, etc.). The likelihood is
\[
f(y \mid \beta) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}
= \prod_{i=1}^n \left( \frac{p_i}{1 - p_i} \right)^{y_i} (1 - p_i)
= \exp\left[ \sum_{i=1}^n \left\{ y_i \log\left( \frac{p_i}{1 - p_i} \right) + \log(1 - p_i) \right\} \right]
= \exp\left[ \sum_{i=1}^n \{ y_i \theta_i - \log(1 + e^{\theta_i}) \} \right],
\]
where $\theta_i = \log\{p_i/(1 - p_i)\}$.

Choosing the canonical link, $\theta_i = \log\left( \frac{p_i}{1 - p_i} \right) = x_i'\beta$, the likelihood has the form
\[ f(y \mid \beta) = \exp\left[ \sum_{i=1}^n \{ y_i x_i'\beta - \log(1 + e^{x_i'\beta}) \} \right]. \]
This is logistic regression, which is widely used in epidemiology and other applications for modeling binary response data. In general, if $f(y_i; \theta_i, \phi)$ is in the exponential family and $\theta_i = \theta(\eta_i)$, $\eta_i = x_i'\beta$, then the model is called a generalized linear model (GLM).

Model fitting

Choosing a GLM results in a likelihood function,
\[ L(y; \beta, \phi, x) = \prod_{i=1}^n \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}, \]
where $\theta_i$ is a function of $\eta_i = x_i'\beta$. The maximum likelihood estimate is defined as
\[ \hat\beta = \arg\max_\beta L(y; \beta, \phi, x), \]
with $\phi$ initially assumed to be known.

Frequentist inferences for GLMs typically rely on $\hat\beta$ and asymptotic approximations. In the special case of the normal linear model, the MLE corresponds to the least squares estimator. In general, there is no closed form expression, so we need an algorithm to calculate $\hat\beta$; a direct numerical approach is sketched below.
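Before turning to the IWLS algorithm, note that for a concrete GLM the likelihood can also be maximized with a general-purpose optimizer. A minimal sketch for logistic regression on simulated data (my own example; variable names are illustrative):

## Direct numerical maximization of the logistic log-likelihood,
## l(beta) = sum{ y_i x_i'beta - log(1 + exp(x_i'beta)) }.
set.seed(2)
n <- 500
X <- cbind(1, rnorm(n))            # design matrix: intercept + one predictor
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y*eta - log(1 + exp(eta)))
}
opt <- optim(c(0, 0), negloglik, method="BFGS")
opt$par                            # agrees with glm()
coef(glm(y ~ X[, 2], family=binomial))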

Maximum Likelihood Estimation of GLMs

All GLMs can be fit using the same algorithm, a form of iteratively re-weighted least squares (IWLS):

1. Given an initial value for $\beta$, calculate the estimated linear predictor $\hat\eta_i = x_i'\hat\beta$ and use it to obtain the fitted values $\hat\mu_i = g^{-1}(\hat\eta_i)$. Calculate the adjusted dependent variable
\[ z_i = \hat\eta_i + (y_i - \hat\mu_i) \left( \frac{d\eta_i}{d\mu_i} \right)_0, \]
where the derivative is evaluated at $\hat\mu_i$.

2. Calculate the iterative weights
\[ W_i^{-1} = \left( \frac{d\eta_i}{d\mu_i} \right)_0^2 V_i, \]
where $V_i$ is the variance function evaluated at $\hat\mu_i$.

3. Regress $z_i$ on $x_i$ with weights $W_i$ to give new estimates of $\beta$. Repeat steps 1-3 until convergence.
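A minimal R implementation of this loop for logistic regression (my own sketch, not course code). For the logit link, $d\eta/d\mu = 1/\{\mu(1-\mu)\}$ and $V(\mu) = \mu(1-\mu)$, so $W = \mu(1-\mu)$ and $z = \eta + (y - \mu)/W$. It reuses the X and y from the optim sketch above:

## IWLS for logistic regression, following steps 1-3 above.
iwls_logit <- function(X, y, tol=1e-10, maxit=25) {
  beta <- rep(0, ncol(X))                 # initial value
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- plogis(eta)                    # mu = g^{-1}(eta)
    W   <- mu*(1 - mu)                    # W^{-1} = (deta/dmu)^2 V
    z   <- eta + (y - mu)/W               # adjusted dependent variable
    beta_old <- beta
    ## weighted least squares: beta = (X'WX)^{-1} X'Wz
    beta <- drop(solve(t(X) %*% (W*X), t(X) %*% (W*z)))
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}
iwls_logit(X, y)                          # matches glm/optim above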

Justification for the IWLS procedure

Note that the log-likelihood can be expressed as
\[ \ell = \sum_{i=1}^n \left[ \{ y_i \theta_i - b(\theta_i) \}/a(\phi) + c(y_i, \phi) \right]. \]
To maximize this log-likelihood we need $\partial\ell/\partial\beta_j$:
\[
\frac{\partial \ell}{\partial \beta_j}
= \sum_{i=1}^n \frac{\partial \ell_i}{\partial \theta_i} \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} \frac{\partial \eta_i}{\partial \beta_j}
= \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)} \frac{1}{V_i} \frac{d\mu_i}{d\eta_i} x_{ij}
= \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)} W_i \frac{d\eta_i}{d\mu_i} x_{ij},
\]
since $\mu_i = b'(\theta_i)$ and $b''(\theta_i) = V_i$ implies $d\mu_i/d\theta_i = V_i$. With constant dispersion ($a(\phi) = \phi$), the MLE equations for $\beta_j$ are
\[ \sum_{i=1}^n W_i (y_i - \mu_i) \frac{d\eta_i}{d\mu_i} x_{ij} = 0. \]

Fisher's scoring method uses the gradient vector $\partial\ell/\partial\beta = u$ and minus the expected value of the Hessian matrix,
\[ A = -E\left( \frac{\partial^2 \ell}{\partial \beta_r \, \partial \beta_s} \right). \]
Given the current estimate $b$ of $\beta$, choose the adjustment $\delta b$ so that $A\,\delta b = u$. Excluding $\phi$, the components of $u$ are
\[ u_r = \sum_{i=1}^n W_i (y_i - \mu_i) \frac{d\eta_i}{d\mu_i} x_{ir}, \]
so we have
\[
A_{rs} = -E\left( \frac{\partial u_r}{\partial \beta_s} \right)
= -E \sum_{i=1}^n \left[ (y_i - \mu_i) \frac{\partial}{\partial \beta_s}\left\{ W_i \frac{d\eta_i}{d\mu_i} x_{ir} \right\} + \frac{\partial (y_i - \mu_i)}{\partial \beta_s}\, W_i \frac{d\eta_i}{d\mu_i} x_{ir} \right].
\]
The expectation of the first term is 0, and the second term gives
\[
\sum_{i=1}^n W_i \frac{d\eta_i}{d\mu_i} x_{ir} \frac{\partial \mu_i}{\partial \beta_s}
= \sum_{i=1}^n W_i \frac{d\eta_i}{d\mu_i} x_{ir} \frac{d\mu_i}{d\eta_i} \frac{\partial \eta_i}{\partial \beta_s}
= \sum_{i=1}^n W_i x_{ir} x_{is}.
\]
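Since $A_{rs} = \sum_i W_i x_{ir} x_{is}$, i.e. $A = X'WX$, the inverse of $A$ at convergence gives the usual asymptotic covariance of $\hat\beta$. A quick check of this against R's vcov() for a logistic fit, continuing the simulated example above (for the logit link the dispersion is 1):

## Check that solve(X'WX) matches vcov() for a fitted logistic model.
fit <- glm(y ~ X[, 2], family=binomial)
Xm  <- model.matrix(fit)
mu  <- fitted(fit)
A   <- t(Xm) %*% ((mu*(1 - mu)) * Xm)   # A = X'WX with W = mu(1-mu)
all.equal(unname(solve(A)), unname(vcov(fit)), tolerance=1e-6)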

The new estimate $b^* = b + \delta b$ of $\beta$ thus satisfies
\[ A b^* = A b + A\,\delta b = A b + u, \]
where
\[ (A b)_r = \sum_s A_{rs} b_s = \sum_{i=1}^n W_i x_{ir} \eta_i. \]
Thus, the new estimate $b^*$ satisfies
\[ (A b^*)_r = \sum_{i=1}^n W_i x_{ir} \{ \eta_i + (y_i - \mu_i)\, d\eta_i/d\mu_i \} = \sum_{i=1}^n W_i x_{ir} z_i. \]
These equations have the form of weighted least squares equations with weights $W_i$ and dependent variable $z_i$.

Some Comments

The IWLS procedure is simple to implement and converges rapidly in most cases. Procedures are available to calculate MLEs and implement frequentist inferences for GLMs in most software packages:
- In R or S-PLUS, the glm() function can be used; try help(glm).
- In Matlab, the glmfit() function can be used.

Example: Smoking and Obesity

Let $y_i = 1$ if the child is obese and $y_i = 0$ otherwise, for $i = 1, \ldots, n$, with
\[ x_i = (1, \text{age}_i, \text{smoke}_i, \text{age}_i \times \text{smoke}_i)'. \]
The Bernoulli likelihood is
\[ L(y; \beta, x) = \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}, \]
where $\mu_i = \Pr(y_i = 1 \mid x_i, \beta)$. Choosing the canonical link, $\mu_i = 1/\{1 + \exp(-x_i'\beta)\}$, results in a logistic regression model:
\[ \Pr(y_i = 1 \mid x_i, \beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}. \]
Hence, the probability of obesity depends on age and smoking through a non-linear model.

Letting X = cbind(1, age, smoke, age*smoke) denote the design matrix and Y the 0/1 obesity outcome in R, we use

fit <- glm(y ~ age + smoke + age*smoke, family=binomial, data=obese)

to implement IWLS and fit the model. Note that the data are available on the web; try to replicate the results (children a year old or younger have been discarded). The command summary(fit) yields the results:

Coefficients:
                     Value  Std. Error     t value
(Intercept)  -2.365173738  0.50112688  -4.7197104
age          -0.066204429  0.08957593  -0.7390873
smoke        -0.043079741  0.22375895  -0.1925275
age:smoke    -0.008448488  0.04010827  -0.2106420

Null Deviance: 1580.905 on 3874 degrees of freedom
Residual Deviance: 1574.663 on 3871 degrees of freedom
Number of Fisher Scoring Iterations: 6

Correlation of Coefficients:
            (Intercept)        age      smoke
age          -0.9382877
smoke        -0.9067235  0.8520241
age:smoke     0.8496495 -0.9062117 -0.9391875

Thus, the IWLS algorithm converged in 6 iterations to the MLE
\[ \hat\beta = (-2.365, -0.066, -0.043, -0.008)'. \]
For any value of the covariates we can calculate the probability of obesity. For example, for non-smokers the age curves can be plotted using:

beta <- fit$coef
## introduce grid spanning range of observed ages
x <- seq(min(obese$age), max(obese$age), length=100)
## calculate fitted probability of obesity for non-smokers
py <- 1/(1 + exp(-(beta[1] + beta[2]*x)))
plot(x, py, xlab="age in years", ylab="Pr(obesity)")

The meaning of the rest of the R/S-PLUS output will be clear after the next class.
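An equivalent, and less error-prone, way to compute such fitted curves is predict() with type="response" (a sketch assuming smoke is coded 0/1 in the obese data frame):

## Fitted obesity probabilities for non-smokers via predict().
newdat <- data.frame(age=x, smoke=0)
py2 <- predict(fit, newdata=newdat, type="response")
lines(x, py2, lty=2)    # overlays the manually computed curve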

Next Class

Topic: Frequentist inference for GLMs.

Have the homework exercise completed and written up for next Thursday. Complete the following exercise:
1. Write down generalized linear models for the Caesarian data (grouping the two different infection types) and the cellular differentiation data.
2. Show the different components of the GLM, expressing the likelihood in exponential family form and using a canonical link function.
3. Fit the GLMs using maximum likelihood and report the parameter estimates.