Introduction to General and Generalized Linear Models
General Linear Models - part I

Henrik Madsen, Poul Thyregod
Informatics and Mathematical Modelling
Technical University of Denmark
DK-2800 Kgs. Lyngby
October 2010
Today

The general linear model - intro
The multivariate normal distribution
Deviance
Likelihood, score function and information matrix
The general linear model - definition
Estimation
Fitted values
Residuals
Partitioning of variation
Likelihood ratio tests
The coefficient of determination
The general linear model - intro

We will use the term classical GLM for the general linear model, to distinguish it from GLM, which is used for the generalized linear model.
The classical GLM provides a unified way of describing the variation in experiments with a continuous response variable.
Classical GLMs include regression analysis, analysis of variance (ANOVA), and analysis of covariance (ANCOVA).
In the classical GLM the residuals are assumed to follow a multivariate normal distribution.
The general linear model - intro

Classical GLMs are naturally studied in the framework of the multivariate normal distribution.
We will consider the set of $n$ observations as a sample from an $n$-dimensional normal distribution.
Under the normal distribution model, maximum likelihood estimation of the mean value parameters may be interpreted geometrically as a projection onto an appropriate subspace.
The likelihood ratio test statistics for model reduction may be expressed in terms of norms of these projections.
The multivariate normal distribution

Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be a random vector with $Y_1, Y_2, \ldots, Y_n$ independent identically distributed (iid) $N(0,1)$ random variables. Note that $\mathrm{E}[Y] = 0$ and the variance-covariance matrix $\mathrm{Var}[Y] = I$.

Definition (Multivariate normal distribution)
$Z$ has a $k$-dimensional multivariate normal distribution if $Z$ has the same distribution as $AY + b$ for some $n$, some $k \times n$ matrix $A$, and some $k$-dimensional vector $b$. We indicate the multivariate normal distribution by writing $Z \sim N(b, AA^T)$.

Since $A$ and $b$ are fixed, we have $\mathrm{E}[Z] = b$ and $\mathrm{Var}[Z] = AA^T$.
The multivariate normal distribution

Let us assume that the variance-covariance matrix is known apart from a constant factor, $\sigma^2$, i.e. $\mathrm{Var}[Z] = \sigma^2 \Sigma$. The density for the $k$-dimensional random vector $Z$ with mean $\mu$ and covariance $\sigma^2 \Sigma$ is:
$$ f_Z(z) = \frac{1}{(2\pi)^{k/2} \sigma^k \sqrt{\det \Sigma}} \exp\left[ -\frac{1}{2\sigma^2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right] $$
where $\Sigma$ is seen to be (a) symmetric and (b) positive definite (so that $\Sigma^{-1}$ exists). We write $Z \sim N_k(\mu, \sigma^2 \Sigma)$.
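A minimal numerical sketch of the density above (the toy numbers and the helper name mvn_density are illustrative assumptions, not part of the model itself); the hand-written formula is compared with scipy's multivariate normal density:

```python
# Illustrative sketch: toy numbers; checks the density formula against scipy.
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(z, mu, sigma2, Sigma):
    """Density of N_k(mu, sigma^2 * Sigma) evaluated at z, straight from the formula."""
    k = len(mu)
    diff = z - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (z - mu)^T Sigma^{-1} (z - mu)
    norm_const = (2 * np.pi) ** (k / 2) * sigma2 ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / (2 * sigma2)) / norm_const

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
sigma2 = 0.7
z = np.array([0.8, 0.1])

print(mvn_density(z, mu, sigma2, Sigma))
print(multivariate_normal(mean=mu, cov=sigma2 * Sigma).pdf(z))   # should agree
```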
The normal density as a statistical model

Consider now the $n$ observations $Y = (Y_1, Y_2, \ldots, Y_n)^T$, and assume that the statistical model is
$$ Y \sim N_n(\mu, \sigma^2 \Sigma), \quad y \in \mathbb{R}^n. $$
The variance-covariance matrix for the observations is called the dispersion matrix, denoted $\mathrm{D}[Y]$, i.e. the dispersion matrix for $Y$ is $\mathrm{D}[Y] = \sigma^2 \Sigma$.
Inner product and norm

Definition (Inner product and norm)
The bilinear form
$$ \delta_\Sigma(y_1, y_2) = y_1^T \Sigma^{-1} y_2 $$
defines an inner product in $\mathbb{R}^n$. Corresponding to this inner product we can define orthogonality, which is obtained when the inner product is zero. A norm is defined by $\|y\|_\Sigma = \sqrt{\delta_\Sigma(y, y)}$.
Deviance for normally distributed variables

Definition (Deviance for normally distributed variables)
Let us introduce the notation
$$ D(y; \mu) = \delta_\Sigma(y - \mu, y - \mu) = (y - \mu)^T \Sigma^{-1} (y - \mu) $$
to denote the quadratic norm of the vector $(y - \mu)$ corresponding to the inner product defined by $\Sigma^{-1}$.

For a normal distribution with $\Sigma = I$, the deviance is just the residual sum of squares (RSS).
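A small sketch of the deviance as a quadratic form (the helper name and the toy numbers are assumptions for illustration); with $\Sigma = I$ it reduces to the residual sum of squares:

```python
# Illustrative sketch: toy numbers for the deviance quadratic form.
import numpy as np

def deviance(y, mu, Sigma):
    """D(y; mu) = (y - mu)^T Sigma^{-1} (y - mu)."""
    r = y - mu
    return float(r @ np.linalg.solve(Sigma, r))

y = np.array([1.2, 0.7, -0.3])
mu = np.array([1.0, 0.5, 0.0])
print(deviance(y, mu, np.eye(3)))    # with Sigma = I ...
print(np.sum((y - mu) ** 2))         # ... this equals the RSS
```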
Deviance for normally distributed variables

Using this notation the normal density is expressed as a density defined on any finite-dimensional vector space equipped with the inner product $\delta_\Sigma$:
$$ f(y; \mu, \sigma^2) = \frac{1}{(\sqrt{2\pi})^n \sigma^n \sqrt{\det(\Sigma)}} \exp\left[ -\frac{1}{2\sigma^2} D(y; \mu) \right]. $$
The likelihood and log-likelihood function

The likelihood function is:
$$ L(\mu, \sigma^2; y) = \frac{1}{(\sqrt{2\pi})^n \sigma^n \sqrt{\det(\Sigma)}} \exp\left[ -\frac{1}{2\sigma^2} D(y; \mu) \right] $$
The log-likelihood function is (apart from an additive constant):
$$ \ell_{\mu,\sigma^2}(\mu, \sigma^2; y) = -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (y - \mu)^T \Sigma^{-1} (y - \mu) = -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} D(y; \mu). $$
The score function, observed and expected information for µ

The score function with respect to $\mu$ is
$$ \frac{\partial}{\partial \mu} \ell_{\mu,\sigma^2}(\mu, \sigma^2; y) = \frac{1}{\sigma^2} \left[ \Sigma^{-1} y - \Sigma^{-1} \mu \right] = \frac{1}{\sigma^2} \Sigma^{-1} (y - \mu) $$
The observed information (with respect to $\mu$) is
$$ j(\mu; y) = \frac{1}{\sigma^2} \Sigma^{-1}. $$
It is seen that the observed information does not depend on the observations $y$. Hence the expected information is
$$ i(\mu) = \frac{1}{\sigma^2} \Sigma^{-1}. $$
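A quick numerical check of the analytic score (the log-likelihood helper and the toy data are assumptions for illustration; the additive constant and the $\det(\Sigma)$ term are omitted since they do not depend on $\mu$):

```python
# Illustrative sketch: toy data; finite-difference check of the analytic score.
import numpy as np

def loglik(mu, sigma2, y, Sigma):
    """Log-likelihood apart from an additive constant."""
    n = len(y)
    r = y - mu
    return -0.5 * n * np.log(sigma2) - r @ np.linalg.solve(Sigma, r) / (2 * sigma2)

y = np.array([1.0, 2.0, 0.5])
mu = np.array([0.8, 1.5, 0.7])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.0, 0.1],
                  [0.0, 0.1, 1.0]])
sigma2 = 0.9

score = np.linalg.solve(Sigma, y - mu) / sigma2          # (1/sigma^2) Sigma^{-1} (y - mu)
eps = 1e-6
numeric = np.array([(loglik(mu + eps * e, sigma2, y, Sigma)
                     - loglik(mu - eps * e, sigma2, y, Sigma)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(score, numeric, atol=1e-5))            # the two gradients agree
```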
The general linear model

In the case of a normal density the observation $Y_i$ is most often written as
$$ Y_i = \mu_i + \epsilon_i $$
which for all $n$ observations $(Y_1, Y_2, \ldots, Y_n)$ can be written in matrix form as
$$ Y = \mu + \epsilon, \quad \text{where } Y \sim N_n(\mu, \sigma^2 \Sigma), \; y \in \mathbb{R}^n. $$
The general linear model

In the linear model it is assumed that $\mu$ belongs to a linear (or affine) subspace $\Omega_0$ of $\mathbb{R}^n$.
The full model is a model with $\Omega_{\text{full}} = \mathbb{R}^n$, and hence each observation fits the model perfectly, i.e. $\hat\mu = y$.
The most restricted model is the null model with $\Omega_{\text{null}} = \mathbb{R}$. It describes the variation of the observations only by a common mean value for all observations.
In practice, one often starts by formulating a rather comprehensive model with a $k$-dimensional subspace $\Omega$, where $k < n$. We will call such a model a sufficient model.
The General Linear Model

Definition (The general linear model)
Assume that $Y_1, Y_2, \ldots, Y_n$ are normally distributed as described before. A general linear model for $Y_1, Y_2, \ldots, Y_n$ is a model where an affine hypothesis is formulated for $\mu$. The hypothesis is of the form
$$ H_0: \mu - \mu_0 \in \Omega_0, $$
where $\Omega_0$ is a linear subspace of $\mathbb{R}^n$ of dimension $k$, and where $\mu_0$ denotes a vector of known offset values.

Definition (Dimension of the general linear model)
The dimension of the subspace $\Omega_0$ for the linear model is the dimension of the model.
The design matrix

Definition (Design matrix for the classical GLM)
Assume that the linear subspace $\Omega_0 = \mathrm{span}\{x_1, \ldots, x_k\}$, i.e. the subspace is spanned by $k$ vectors ($k < n$). Consider a general linear model where the hypothesis can be written as
$$ H_0: \mu - \mu_0 = X\beta \quad \text{with } \beta \in \mathbb{R}^k, $$
where $X$ has full rank. The $n \times k$ matrix $X$ of known deterministic coefficients is called the design matrix. The $i$'th row of the design matrix is given by the model vector for the $i$'th observation,
$$ x_i^T = (x_{i1}, x_{i2}, \ldots, x_{ik}). $$
Estimation of mean value parameters

Under the hypothesis $H_0: \mu \in \Omega_0$, the maximum likelihood estimate of $\mu$ is found as the orthogonal projection (with respect to $\delta_\Sigma$), $p_0(y)$, of $y$ onto the linear subspace $\Omega_0$.

Theorem (ML estimates of mean value parameters)
For hypotheses of the form $H_0: \mu(\beta) = X\beta$, the maximum likelihood estimate of $\beta$ is found as a solution to the normal equations
$$ X^T \Sigma^{-1} y = X^T \Sigma^{-1} X \hat\beta. $$
If $X$ has full rank, the solution is uniquely given by
$$ \hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y. $$
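A minimal sketch of the estimate above on simulated data (the data, the seed, and the choice $\Sigma = I$ are assumptions for illustration); the normal equations are solved directly rather than forming the inverse explicitly:

```python
# Illustrative sketch: simulated data, Sigma = I; solves the normal equations.
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # design matrix, full rank
beta_true = np.array([1.0, 2.0, -0.5])
Sigma = np.eye(n)                                                 # known dispersion structure
y = X @ beta_true + rng.normal(size=n)

Sigma_inv = np.linalg.inv(Sigma)
# Solve X^T Sigma^{-1} X beta_hat = X^T Sigma^{-1} y
beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(beta_hat)                                                   # should be close to beta_true
```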
Properties of the ML estimator

Theorem (Properties of the ML estimator)
For the ML estimator we have
$$ \hat\beta \sim N_k\big(\beta, \sigma^2 (X^T \Sigma^{-1} X)^{-1}\big). $$

Unknown Σ
Notice that it has been assumed that $\Sigma$ is known. If $\Sigma$ is unknown, one possibility is to use the relaxation algorithm described in Madsen (2008).

Madsen, H. (2008) Time Series Analysis. Chapman & Hall.
Fitted values

Fitted or predicted values
The fitted values $\hat\mu = X\hat\beta$ are found as the projection of $y$ (denoted $p_0(y)$) onto the subspace $\Omega_0$ spanned by the columns of $X$, and $\hat\beta$ denotes the local coordinates of the projection.

Definition (Projection matrix)
A matrix $H$ is a projection matrix if and only if (a) $H^T = H$ and (b) $H^2 = H$, i.e. the matrix is idempotent.
The hat matrix

The matrix
$$ H = X[X^T \Sigma^{-1} X]^{-1} X^T \Sigma^{-1} $$
is a projection matrix. The projection matrix provides the predicted values $\hat\mu$, since
$$ \hat\mu = p_0(y) = X\hat\beta = Hy. $$
It follows that the predicted values are normally distributed with
$$ \mathrm{D}[X\hat\beta] = \sigma^2 X[X^T \Sigma^{-1} X]^{-1} X^T = \sigma^2 H \Sigma. $$
The matrix $H$ is often termed the hat matrix, since it transforms the observations $y$ into their predicted values, symbolized by a hat on the $\mu$'s.
Residuals

The observed residuals are
$$ r = y - X\hat\beta = (I - H)y. $$

Orthogonality
The maximum likelihood estimate of $\beta$ is found as the value of $\beta$ which minimizes the distance $\|y - X\beta\|$. The normal equations show that
$$ X^T \Sigma^{-1} (y - X\hat\beta) = 0, $$
i.e. the residuals are orthogonal (with respect to $\Sigma^{-1}$) to the subspace $\Omega_0$. The residuals are thus orthogonal to the fitted or predicted values.
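A combined sketch of the hat matrix, the fitted values, and the residual orthogonality (simulated data with $\Sigma = I$; all names and numbers are illustrative assumptions):

```python
# Illustrative sketch: simulated data, Sigma = I.
import numpy as np

rng = np.random.default_rng(1)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)
Sigma_inv = np.eye(n)                                     # Sigma = I

beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
H = X @ np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv)    # hat matrix

print(np.allclose(H @ H, H))                              # idempotent: H^2 = H
print(np.allclose(H @ y, X @ beta_hat))                   # mu_hat = H y = X beta_hat
print(np.allclose(X.T @ Sigma_inv @ (y - X @ beta_hat), 0.0))    # residuals orthogonal to Omega_0
```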
Residuals

Figure: Orthogonality between the residual $(y - X\hat\beta)$ and the vector $X\hat\beta = p_0(y)$ in the subspace $\Omega_0$.
Residuals

The residuals $r = (I - H)Y$ are normally distributed with
$$ \mathrm{D}[r] = \sigma^2 (I - H)\Sigma. $$
The individual residuals do not have the same variance.
The residuals belong to a subspace of dimension $n - k$, which is orthogonal to $\Omega_0$.
It may be shown that the distribution of the residuals $r$ is independent of the fitted values $X\hat\beta$.
Cochran's theorem

Theorem (Cochran's theorem)
Suppose that $Y \sim N_n(0, I_n)$ (i.e. a standard multivariate Gaussian random variable) and that
$$ Y^T Y = Y^T H_1 Y + Y^T H_2 Y + \cdots + Y^T H_k Y, $$
where $H_i$ is a symmetric $n \times n$ matrix with rank $n_i$, $i = 1, 2, \ldots, k$. Then any one of the following conditions implies the other two:
i) The ranks of the $H_i$ add up to $n$, i.e. $\sum_{i=1}^{k} n_i = n$.
ii) Each quadratic form $Y^T H_i Y \sim \chi^2_{n_i}$ (thus the $H_i$ are positive semidefinite).
iii) All the quadratic forms $Y^T H_i Y$ are independent (necessary and sufficient condition).
Partitioning of variation

Partitioning of the variation:
$$ D(y; X\beta) = D(y; X\hat\beta) + D(X\hat\beta; X\beta) = (y - X\hat\beta)^T \Sigma^{-1} (y - X\hat\beta) + (\hat\beta - \beta)^T X^T \Sigma^{-1} X (\hat\beta - \beta), $$
where $D(y; X\beta) = (y - X\beta)^T \Sigma^{-1} (y - X\beta)$.
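A numerical check of the partitioning on simulated data (the data and the choice $\Sigma = I$ are assumptions for illustration; the identity holds exactly because the residual is $\Sigma^{-1}$-orthogonal to $\Omega_0$):

```python
# Illustrative sketch: simulated data, Sigma = I.
import numpy as np

def deviance(a, b, Sigma_inv):
    d = a - b
    return float(d @ Sigma_inv @ d)

rng = np.random.default_rng(2)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -2.0, 0.5])                        # "true" beta under H_0
Sigma_inv = np.eye(n)
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

lhs = deviance(y, X @ beta, Sigma_inv)                           # D(y; X beta)
rhs = deviance(y, X @ beta_hat, Sigma_inv) + deviance(X @ beta_hat, X @ beta, Sigma_inv)
print(np.isclose(lhs, rhs))                                      # the partitioning holds
```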
Partitioning of variation

χ²-distribution of the individual contributions
Under $H_0$ it follows from the normal distribution of $Y$ that
$$ D(y; X\beta) = (y - X\beta)^T \Sigma^{-1} (y - X\beta) \sim \sigma^2 \chi^2_n. $$
Furthermore, it follows from the normal distribution of $r$ and of $\hat\beta$ that
$$ D(y; X\hat\beta) = (y - X\hat\beta)^T \Sigma^{-1} (y - X\hat\beta) \sim \sigma^2 \chi^2_{n-k} $$
$$ D(X\hat\beta; X\beta) = (\hat\beta - \beta)^T X^T \Sigma^{-1} X (\hat\beta - \beta) \sim \sigma^2 \chi^2_k. $$
Moreover, the independence of $r$ and $X\hat\beta$ implies that $D(y; X\hat\beta)$ and $D(X\hat\beta; X\beta)$ are independent.
Thus, the $\sigma^2 \chi^2_n$-distribution on the left-hand side is partitioned into two independent $\chi^2$-distributed variables with $n - k$ and $k$ degrees of freedom, respectively.
Estimation of the residual variance σ²

Theorem (Estimation of the variance)
Under the hypothesis $H_0: \mu(\beta) = X\beta$, the maximum marginal likelihood estimator for the variance $\sigma^2$ is
$$ \hat\sigma^2 = \frac{D(y; X\hat\beta)}{n - k} = \frac{(y - X\hat\beta)^T \Sigma^{-1} (y - X\hat\beta)}{n - k}. $$
Under the hypothesis, $\hat\sigma^2 \sim \sigma^2 \chi^2_f / f$ with $f = n - k$.
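A short sketch of the variance estimate on simulated data (the data, the seed and $\Sigma = I$ are assumptions for illustration):

```python
# Illustrative sketch: simulated data, Sigma = I.
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.8, size=n)    # true sigma^2 = 0.64
Sigma_inv = np.eye(n)

beta_hat = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
r = y - X @ beta_hat
sigma2_hat = (r @ Sigma_inv @ r) / (n - k)                       # D(y; X beta_hat) / (n - k)
print(sigma2_hat)                                                # should be close to 0.64
```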
Likelihood ratio tests

In the classical GLM case the exact distribution of the likelihood ratio test statistic may be derived.
Consider the following model for the data: $Y \sim N_n(\mu, \sigma^2 \Sigma)$.
Let us assume that we have the sufficient model
$$ H_1: \mu \in \Omega_1 \subset \mathbb{R}^n $$
with $\dim(\Omega_1) = m_1$.
Now we want to test whether the model may be reduced to a model where $\mu$ is restricted to some subspace of $\Omega_1$, and hence we introduce $\Omega_0 \subset \Omega_1$ as a linear (affine) subspace with $\dim(\Omega_0) = m_0$.
Model reduction

Figure: Model reduction. The partitioning of the deviance corresponding to a test of the hypothesis $H_0: \mu \in \Omega_0$ under the assumption of $H_1: \mu \in \Omega_1$.
Test for model reduction

Theorem (A test for model reduction)
The likelihood ratio test statistic for testing
$$ H_0: \mu \in \Omega_0 \quad \text{against the alternative} \quad H_1: \mu \in \Omega_1 \setminus \Omega_0 $$
is a monotone function of
$$ F(y) = \frac{D(p_1(y); p_0(y)) / (m_1 - m_0)}{D(y; p_1(y)) / (n - m_1)}, $$
where $p_1(y)$ and $p_0(y)$ denote the projections of $y$ onto $\Omega_1$ and $\Omega_0$, respectively.
Under $H_0$ we have $F \sim F(m_1 - m_0, n - m_1)$, i.e. large values of $F$ reflect a conflict between the data and $H_0$, and hence lead to rejection of $H_0$. The p-value of the test is found as $p = P[F(m_1 - m_0, n - m_1) \geq F_{\text{obs}}]$, where $F_{\text{obs}}$ is the observed value of $F$ given the data.
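A minimal sketch of the F-test on simulated data (the data, the seed, and $\Sigma = I$ are assumptions for illustration; scipy's F distribution supplies the p-value):

```python
# Illustrative sketch: simulated data, Sigma = I.
import numpy as np
from scipy.stats import f as f_dist

def project(X, y):
    """Orthogonal projection of y onto span(X) (Sigma = I)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat

rng = np.random.default_rng(4)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)                   # x2 has no real effect

X1 = np.column_stack([np.ones(n), x1, x2])                # model under H_1, m1 = 3
X0 = np.column_stack([np.ones(n), x1])                    # sub-model under H_0, m0 = 2
m1, m0 = X1.shape[1], X0.shape[1]

p1, p0 = project(X1, y), project(X0, y)
F = (np.sum((p1 - p0) ** 2) / (m1 - m0)) / (np.sum((y - p1) ** 2) / (n - m1))
p_value = f_dist.sf(F, m1 - m0, n - m1)
print(F, p_value)           # a small p-value would be evidence against reducing to X0
```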
Test for model reduction

The partitioning of the variation is presented in a deviance table (or an ANalysis Of VAriance table, ANOVA).
The table reflects the partitioning used in the test for model reduction.
The deviance between the model and the hypothesis is measured using the deviance of the observations from the model as a reference.
Under $H_0$ the two deviances are both $\chi^2$-distributed and orthogonal, and thus independent. This means that their ratio is F-distributed.
If the test statistic is large, this is evidence against the model reduction specified by $H_0$.
Deviance table

Source | f | Deviance | Test statistic, F
Model versus hypothesis | $m_1 - m_0$ | $\|p_1(y) - p_0(y)\|^2$ | $\dfrac{\|p_1(y) - p_0(y)\|^2 / (m_1 - m_0)}{\|y - p_1(y)\|^2 / (n - m_1)}$
Residual under model | $n - m_1$ | $\|y - p_1(y)\|^2$ |
Residual under hypothesis | $n - m_0$ | $\|y - p_0(y)\|^2$ |

Table: Deviance table corresponding to a test for model reduction as specified by $H_0$. For $\Sigma = I$ this corresponds to an analysis of variance table, and the deviance is then equal to the sum of squared deviations (SS).
Test for model reduction

The test is a conditional test
It should be noted that the test has been derived as a conditional test. It is a test for the hypothesis $H_0: \mu \in \Omega_0$ under the assumption that $H_1: \mu \in \Omega_1$ is true. The test does not in any way assess whether $H_1$ is in agreement with the data. On the contrary, the residual variation under $H_1$ is used in the test to estimate $\sigma^2$, i.e. to assess $D(y; p_1(y))$.

The test does not depend on the particular parametrization of the hypotheses
Note that the test depends only on the two subspaces $\Omega_1$ and $\Omega_0$, and not on how the subspaces have been parametrized (the particular choice of basis, i.e. the design matrix). Therefore the test is sometimes said to be coordinate free.
Initial test for model sufficiency

In practice, one often starts by formulating a rather comprehensive model, a sufficient model, and then tests whether the model may be reduced to the null model with $\Omega_{\text{null}} = \mathbb{R}$, i.e. $\dim \Omega_{\text{null}} = 1$.
The hypotheses are
$$ H_{\text{null}}: \mu \in \mathbb{R} \qquad H_1: \mu \in \Omega_1 \setminus \mathbb{R}, $$
where $\dim \Omega_1 = k$.
The null hypothesis is a hypothesis of total homogeneity, namely that all observations are satisfactorily represented by their common mean.
Deviance table

Source | f | Deviance | Test statistic, F
Model versus $H_{\text{null}}$ | $k - 1$ | $\|p_1(y) - p_{\text{null}}(y)\|^2$ | $\dfrac{\|p_1(y) - p_{\text{null}}(y)\|^2 / (k - 1)}{\|y - p_1(y)\|^2 / (n - k)}$
Residual under $H_1$ | $n - k$ | $\|y - p_1(y)\|^2$ |
Total | $n - 1$ | $\|y - p_{\text{null}}(y)\|^2$ |

Table: Deviance table corresponding to the test for model reduction to the null model. Under $H_{\text{null}}$, $F \sim F(k-1, n-k)$, and hence large values of $F$ indicate rejection of the hypothesis $H_{\text{null}}$. The p-value of the test is $p = P[F(k-1, n-k) \geq F_{\text{obs}}]$.
Coefficient of determination, R²

The coefficient of determination, $R^2$, is defined as
$$ R^2 = \frac{D(p_1(y); p_{\text{null}}(y))}{D(y; p_{\text{null}}(y))} = 1 - \frac{D(y; p_1(y))}{D(y; p_{\text{null}}(y))}, \qquad 0 \leq R^2 \leq 1. $$
Suppose you want to predict $Y$. If you do not know the $x$'s, then the best prediction is $\bar{y}$. The variability corresponding to this prediction is expressed by the total variation.
If the model is utilized for the prediction, the prediction error is reduced to the residual variation.
$R^2$ expresses the fraction of the total variation that is explained by the model.
As more variables are added to the model, $D(y; p_1(y))$ will decrease, and $R^2$ will increase.
Adjusted coefficient of determination, R²_adj

The adjusted coefficient of determination aims to correct for the fact that $R^2$ increases as more variables are added to the model. It is defined as:
$$ R^2_{\text{adj}} = 1 - \frac{D(y; p_1(y)) / (n - k)}{D(y; p_{\text{null}}(y)) / (n - 1)}. $$
It charges a penalty for the number of variables in the model.
As more variables are added to the model, $D(y; p_1(y))$ decreases, but the corresponding degrees of freedom also decrease.
The numerator in the expression above may increase if the reduction in the residual deviance caused by the additional variables does not compensate for the loss of degrees of freedom.
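A short sketch computing both $R^2$ and $R^2_{\text{adj}}$ from the deviances (simulated data with $\Sigma = I$; all names and numbers are illustrative assumptions):

```python
# Illustrative sketch: simulated data, Sigma = I.
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, -0.4]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
D_model = np.sum((y - X @ beta_hat) ** 2)               # D(y; p_1(y))
D_null = np.sum((y - y.mean()) ** 2)                    # D(y; p_null(y))

R2 = 1 - D_model / D_null
R2_adj = 1 - (D_model / (n - k)) / (D_null / (n - 1))
print(R2, R2_adj)
```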