
Partitioning of Variance in Multilevel Models

Dr William J. Browne, University of Nottingham

with thanks to Harvey Goldstein (Institute of Education) & S.V. Subramanian (Harvard School of Public Health)

Outline

1. The ICC and the VPC
2. Partitioning variances in:
   a. General normal response multilevel models
   b. Binary response multilevel models
   c. General binomial response multilevel models with overdispersion
3. Example: Indian illiteracy dataset

Statistical Modelling

Often in statistics we have a response variable $y$ which is our main interest. We select a sample of size $n$ from our population of interest and observe values $y_i$, $i = 1, \dots, n$. We then wish to infer properties of the variable $y$ in terms of other observed predictor variables $X_i = (x_{1i}, \dots, x_{ki})$, $i = 1, \dots, n$. The main use of these predictor variables is to account for differences in the response variable, or, to put it another way, to explain the variation in $y$.

Linear Modelling

Consider the standard linear model

$y_i = X_i\beta + e_i, \quad e_i \sim N(0, \sigma^2)$

Here we explain variation in $y$ by allowing our predictors $X$ to describe differences in mean behaviour. The amount of variance explained can be calculated via the $R^2$ statistic for the model. Note that $X$ can contain both continuous and categorical predictor variables. $R^2$ is so called because, for simple linear regression, it equals the square of the correlation between $y$ and $x$.

Multilevel Modelling and the VPC

Consider a 2-level variance components model

$y_{ij} = X_{ij}\beta + u_j + e_{ij}, \quad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)$

A typical example data structure (from education) is pupils within schools. Here the variance is partitioned into two components: $\sigma_u^2$ is the between-school variance and $\sigma_e^2$ is the residual variation between pupils. We introduce the term Variance Partition Coefficient (VPC) for the proportion of variance attributable to the higher level (school). Hence

$VPC = \sigma_u^2 / (\sigma_e^2 + \sigma_u^2)$
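To make the definition concrete, here is a minimal Python sketch; the two variance values are hypothetical illustrations, not estimates from any model in this talk:

```python
# Minimal sketch: VPC for a 2-level variance components model.
sigma2_u = 0.15  # hypothetical between-school variance (level 2)
sigma2_e = 0.85  # hypothetical between-pupil residual variance (level 1)

vpc = sigma2_u / (sigma2_u + sigma2_e)
print(f"VPC = {vpc:.3f}")  # 0.150: 15% of the total variance lies between schools
```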

VPC and ICC

It should be noted that an alternative representation of a multilevel model is to consider the response vector as following a multivariate normal distribution, e.g. for 2 pupils in each of 2 schools:

$$\begin{pmatrix} y_{11} \\ y_{21} \\ y_{12} \\ y_{22} \end{pmatrix} \sim N\left( \begin{pmatrix} X_{11}\beta \\ X_{21}\beta \\ X_{12}\beta \\ X_{22}\beta \end{pmatrix}, \begin{pmatrix} \sigma_u^2 + \sigma_e^2 & \sigma_u^2 & 0 & 0 \\ \sigma_u^2 & \sigma_u^2 + \sigma_e^2 & 0 & 0 \\ 0 & 0 & \sigma_u^2 + \sigma_e^2 & \sigma_u^2 \\ 0 & 0 & \sigma_u^2 & \sigma_u^2 + \sigma_e^2 \end{pmatrix} \right)$$

The intra-class correlation (ICC) is then the correlation between 2 individuals who are in the same higher-level unit:

$ICC = \sigma_u^2 / (\sigma_e^2 + \sigma_u^2) = VPC$

Note that this equivalence is not always true for other models.
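A short numpy sketch (hypothetical variances again) builds this block covariance matrix and confirms that the implied correlation between two pupils in the same school equals the VPC:

```python
import numpy as np

sigma2_u, sigma2_e = 0.15, 0.85             # hypothetical variances
within = sigma2_u * np.ones((2, 2)) + sigma2_e * np.eye(2)  # one school's 2x2 block
V = np.kron(np.eye(2), within)              # block-diagonal: 2 pupils in each of 2 schools

icc = V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])  # correlation of two same-school pupils
print(icc, sigma2_u / (sigma2_u + sigma2_e))  # both 0.15: ICC = VPC here
```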

VPC for other normal response models

Consider a 2-level model

$y_{ij} = X_{ij}\beta + Z_{ij}u_j + e_{ij}, \quad u_j \sim MVN(0, \Sigma_u), \quad e_{ij} \sim N(0, \sigma_e^2)$

The formula for the VPC now depends on $Z_{ij}$ and so is itself a function of predictor variables:

$VPC_{ij} = Z_{ij}\Sigma_u Z_{ij}^T / (Z_{ij}\Sigma_u Z_{ij}^T + \sigma_e^2)$

To illustrate this we consider a 2-level random slopes model on the next slide:

$y_{ij} = \beta_0 + u_{0j} + (\beta_1 + u_{1j})x_{ij} + e_{ij}, \quad u_j \sim MVN(0, \Sigma_u), \quad e_{ij} \sim N(0, \sigma_e^2)$
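As a quick check of the formula, a sketch with a hypothetical $\Sigma_u$ and $\sigma_e^2$, and $Z_{ij} = (1, x_{ij})$, shows $VPC_{ij}$ rising with $x$:

```python
import numpy as np

Sigma_u = np.array([[0.20, 0.05],   # hypothetical covariance of (intercept, slope)
                    [0.05, 0.10]])
sigma2_e = 0.80                     # hypothetical level 1 variance

def vpc(x):
    z = np.array([1.0, x])          # Z_ij = (1, x_ij)
    lev2 = z @ Sigma_u @ z          # Z_ij Sigma_u Z_ij^T
    return lev2 / (lev2 + sigma2_e)

for x in (0.0, 1.0, 2.0):
    print(x, round(vpc(x), 3))      # 0.2, 0.333, 0.5: VPC increases with x
```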

Random slopes model

Here variability between the lines increases as $x$ increases, and hence the VPC increases.

VPC for binary response multilevel models

Consider the following multilevel Bernoulli model:

$y_{ij} \sim \text{Binomial}(1, p_{ij})$ where $\text{logit}(p_{ij}) = X_{ij}\beta + u_j, \quad u_j \sim N(0, \sigma_u^2)$

The variance of the Bernoulli distribution, and hence the level 1 variation, is $p_{ij}(1 - p_{ij})$. This means that even though we have a simple variance for the random effects, the VPC will be a function of the predictor variables.

Approaches for calculating the VPC

The VPC is more difficult to calculate here because, due to the link function, the level 1 variance and the higher-level variances are measured on different scales. We consider the following 3 approaches:

A. Model linearisation
B. Simulation
C. A latent variable approach

Method A: Model linearisation

Using a first-order Taylor series expansion around the mean of the distribution of the $u_j$, we have

$p_{ij} = \frac{\exp(X_{ij}\beta)}{1 + \exp(X_{ij}\beta)}$, with first derivative with respect to $X_{ij}\beta$ equal to $\frac{p_{ij}}{1 + \exp(X_{ij}\beta)}$,

and the expansion

$y_{ij} \approx p_{ij} + u_j \frac{p_{ij}}{1 + \exp(X_{ij}\beta)} + e_{ij}\sqrt{p_{ij}(1 - p_{ij})}$, where $\text{var}(e_{ij}) = 1$.

This results in an approximate formula for the VPC:

$VPC_{ij} = \frac{\sigma_u^2\, p_{ij}^2 / (1 + \exp(X_{ij}\beta))^2}{\sigma_u^2\, p_{ij}^2 / (1 + \exp(X_{ij}\beta))^2 + p_{ij}(1 - p_{ij})}$
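A minimal sketch of this formula in Python; the demo call borrows the intercept-only voting estimates reported later ($\beta_0 = 0.256$, $\sigma_u^2 = 0.142$) and gives a value close to the tabled 0.0340:

```python
import math

def vpc_linearisation(xb, sigma2_u):
    """Method A: first-order linearisation VPC for a 2-level logit model."""
    p = math.exp(xb) / (1 + math.exp(xb))
    lev2 = sigma2_u * p**2 / (1 + math.exp(xb))**2  # = sigma_u^2 [p(1-p)]^2
    return lev2 / (lev2 + p * (1 - p))

print(round(vpc_linearisation(0.256, 0.142), 4))    # ~0.034
```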

Method B: Simulation

This method simulates from the model to give an estimate of the VPC, which improves as the number of simulations increases. The steps are:

1. Generate a large number $m$ (say 50,000) of values $u_j$ from the distribution $N(0, \sigma_u^2)$.
2. For a particular fixed value of $X_{ij}$, compute the $m$ corresponding values of $p_{ij}$, denoted $p_k$, $k = 1, \dots, m$.
3. Compute the corresponding Bernoulli variances $v_k = p_k(1 - p_k)$, $k = 1, \dots, m$, and calculate their mean $\sigma_1^2 = \frac{1}{m}\sum_{k=1}^m v_k$.
4. Then $VPC_{ij} = \sigma_2^2 / (\sigma_2^2 + \sigma_1^2)$, where $\sigma_2^2 = \text{var}(p_k)$.
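These steps translate directly into a few lines of numpy; the demo again borrows the intercept-only voting estimates ($\beta_0 = 0.256$, $\sigma_u^2 = 0.142$):

```python
import numpy as np

def vpc_simulation(xb, sigma2_u, m=50_000, seed=1):
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, np.sqrt(sigma2_u), size=m)  # step 1: draw u_j
    p = 1.0 / (1.0 + np.exp(-(xb + u)))             # step 2: p_k for fixed X_ij
    sigma2_1 = np.mean(p * (1 - p))                 # step 3: mean Bernoulli variance
    sigma2_2 = np.var(p)                            # sigma_2^2 = var(p_k)
    return sigma2_2 / (sigma2_2 + sigma2_1)         # step 4

print(round(vpc_simulation(0.256, 0.142), 4))       # ~0.034, agreeing with Method A
```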

Method C: Latent variables

Here we consider our observed binary response to represent a thresholded continuous variable, where we observe 0 below the threshold and 1 above it. In a logit model the underlying variable has a (standard) logistic distribution, which has variance $\pi^2/3 \approx 3.29$. Taking this as the level 1 variance puts the level 1 and level 2 variances on the same scale, and we have the simple formula

$VPC_{ij} = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}$

Note this approach is used by Snijders and Bosker (1999) and results in a constant VPC for this model.

Example: Voting dataset

Here we have a two-level dataset (voters within constituencies) with a binary response indicating whether a voter will vote Conservative. The dataset has four attitudinal predictors designed so that lower values correspond to left-wing attitudes. We consider two models and two VPCs: firstly a model with just an intercept, which has $\beta_0 = 0.256$, $\sigma_u^2 = 0.142$, and secondly a model with all four predictors, evaluated for a left-wing individual with $X_{ij}\beta = -2.5$.

Example: Voting dataset

Scenario   Method A   Method B   Method C
1          0.0340     0.0340     0.043
2          0.0096     0.0111     0.047

Note: Method C gives the same VPC estimate for all $p_{ij}$, but the level 2 variance changes between the two models used in the scenarios. There is good agreement between methods A and B, particularly in scenario 1.

VPC in general binomial models with overdispersion

Our motivation for considering such models is a dataset on illiteracy in India studied by Subramanian and co-workers. Here we have a structured population with people living in districts which lie within states. We do not have individual data but instead have illiteracy proportions and counts for cells of people who live within a particular district. The cells are formed by individuals with the same values of 3 predictor variables: gender, caste and urban/rural. The size of a cell varies from 1 person to over 4 million! There are 442 districts within 29 states.

Indian illiteracy dataset continued

A simple logistic model would be

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$ where $\text{logit}(p_{ijk}) = X_{ijk}\beta + v_k + u_{jk}$,
$v_k \sim N(0, \sigma_v^2), \quad u_{jk} \sim N(0, \sigma_u^2)$

Here $i$ indexes cell, $j$ indexes district and $k$ indexes state. So we have accounted for variation (overdispersion) due to districts and states, but this model doesn't account for overdispersion due to cells.

Accounting for overdispersion

The general binomial distribution is one of many possible distributions for data where the response is a proportion. Datasets can be constructed that have more (overdispersion) or less (underdispersion) variability than a binomial distribution. There are two approaches that can be used for overdispersion:

1. Multiplicative overdispersion. Adds a scale factor to the model that multiplies the variance; note this approach can also cope with underdispersion, but the resulting model doesn't have a particular distributional form.
2. Additive overdispersion.

Additive overdispersion

We can add an overdispersion term to our general model as follows:

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$ where $\text{logit}(p_{ijk}) = X_{ijk}\beta + v_k + u_{jk} + e_{ijk}$,
$v_k \sim N(0, \sigma_v^2), \quad u_{jk} \sim N(0, \sigma_u^2), \quad e_{ijk} \sim N(0, \sigma_e^2)$

Here $e_{ijk}$ is a cell-level overdispersion effect and $\sigma_e^2$ is the variance of these effects. One advantage of the additive approach is that the model can be fitted as a standard random effects model by adding a pseudo level 2 to the model.

Additive overdispersion continued

With a change of notation we get the following:

$y_{ijkl} \sim \text{Binomial}(n_{ijkl}, p_{ijkl})$ where $\text{logit}(p_{ijkl}) = X_{ijkl}\beta + v_l^{(s)} + v_{kl}^{(d)} + u_{jkl}$,
$v_l^{(s)} \sim N(0, \sigma_s^2), \quad v_{kl}^{(d)} \sim N(0, \sigma_d^2), \quad u_{jkl} \sim N(0, \sigma_u^2)$

Here the second level, indexed by $j$, is pseudo in that each level 2 unit contains only 1 level 1 unit. One interpretation of the model is to consider level 2 as the cell level and level 1 as the individual level; in fact, expanding the dataset to include each individual response would give an equivalent model.

Fitting a simple model with districts only

For our first model to be fitted to the illiteracy dataset we ignore states and simply fit districts and cells with an intercept term:

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$ where $\text{logit}(p_{ijk}) = \beta_0 + v_k^{(d)} + u_{jk}$,
$v_k^{(d)} \sim N(0, \sigma_d^2), \quad u_{jk} \sim N(0, \sigma_u^2)$

We will consider both quasi-likelihood-based methods and MCMC methods (with diffuse priors) for fitting this model. MLwiN offers both marginal quasi-likelihood (MQL) and penalised quasi-likelihood (PQL) estimation methods, but for this model only 1st-order MQL will converge.

Fitting using MCMC

MLwiN also offers an MCMC algorithm that consists of a mixture of Metropolis and Gibbs sampling steps. This algorithm had incredibly poor mixing and the chains did not appear to converge. We switched to WinBUGS, as there we could reparameterise the model using hierarchical centering (Gelfand, Sahu & Carlin 1995):

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$, $\quad \text{logit}(p_{ijk}) = u^*_{jk}$,

where $u^*_{jk} = \beta_0 + v_k^{(d)} + u_{jk}$, so that $u^*_{jk} \sim N(v^*_k, \sigma_u^2)$ with $v^*_k = \beta_0 + v_k^{(d)}$, and $v^*_k \sim N(\beta_0, \sigma_d^2)$.
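The identity behind this reparameterisation can be checked by simulation; the sketch below (using the MQL point estimates purely as plug-in values) shows that the centred and uncentred constructions give the same distribution for the linear predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, s2_d, s2_u, n = -0.021, 0.199, 0.764, 200_000  # plug-in values only

# uncentred: u*_jk = beta_0 + v_k + u_jk
uncentred = beta0 + rng.normal(0, np.sqrt(s2_d), n) + rng.normal(0, np.sqrt(s2_u), n)

# centred: v*_k ~ N(beta_0, s2_d), then u*_jk ~ N(v*_k, s2_u)
v_star = rng.normal(beta0, np.sqrt(s2_d), n)
centred = rng.normal(v_star, np.sqrt(s2_u))

print(uncentred.mean(), uncentred.var())  # ~beta_0 and ~s2_d + s2_u
print(centred.mean(), centred.var())      # same moments
```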

Estimates for the model

Parameter      1st MQL (SE)      MCMC (95% CI)
$\beta_0$      -0.021 (0.025)    0.008 (-0.058, 0.076)
$\sigma_d^2$   0.199 (0.018)     0.383 (0.320, 0.454)
$\sigma_u^2$   0.764 (0.016)     1.349 (1.293, 1.408)

Note the underestimation by MQL (see e.g. Browne (1998) for details). Here the MCMC intervals do not even contain the MQL estimates for either variance.

VPC methods for overdispersed random effects models

All 3 methods of calculating the VPC can be extended to several levels.

A: Model linearisation. For our example the VPC associated with districts has the form

$VPC_{ijk} = \frac{\sigma_d^2\, p_{ijk}^2 / (1 + \exp(\beta_0))^2}{(\sigma_d^2 + \sigma_u^2)\, p_{ijk}^2 / (1 + \exp(\beta_0))^2 + p_{ijk}(1 - p_{ijk})}$

C: Latent variable approach. Here the formula is

$VPC_{ijk} = \sigma_d^2 / (\sigma_d^2 + \sigma_u^2 + \pi^2/3)$
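Both formulas are one-liners in code; plugging in the 1st-order MQL estimates from the table above ($\beta_0 = -0.021$, $\sigma_d^2 = 0.199$, $\sigma_u^2 = 0.764$) reproduces the 4.01% and 4.68% figures quoted two slides below:

```python
import math

def district_vpc_linearisation(b0, s2_d, s2_u):
    p = math.exp(b0) / (1 + math.exp(b0))
    scale = p**2 / (1 + math.exp(b0))**2             # = [p(1-p)]^2
    return s2_d * scale / ((s2_d + s2_u) * scale + p * (1 - p))

def district_vpc_latent(s2_d, s2_u):
    return s2_d / (s2_d + s2_u + math.pi**2 / 3)

print(round(district_vpc_linearisation(-0.021, 0.199, 0.764), 4))  # 0.0401
print(round(district_vpc_latent(0.199, 0.764), 4))                 # 0.0468
```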

Method B: Simulation

This method now involves the following steps (a sketch implementation follows below):

1. From the fitted model, simulate a large number ($M$) of values of the higher-level (district) random effects from the distribution $N(0, \sigma_d^2)$.
2. For each of the generated higher-level random effects, simulate a large number ($m$) of values of the overdispersion random effects from the distribution $N(0, \sigma_u^2)$.
3. Combine the generated residuals with $\beta_0$ and compute the $T = M \times m$ corresponding values $p^{(r)}_{ijk}$, $r = 1, \dots, T$, using the anti-logit function. For each of these values compute the level 1 binomial variance $\sigma^2_{r,ijk} = p^{(r)}_{ijk}(1 - p^{(r)}_{ijk})$.
4. For each of the $M$ higher-level random effect draws, calculate the mean of its $m$ generated values $p^{(r)}_{ijk}$; denote these $M$ means by $\bar{p}_{ijk}$.
5. The coefficient $VPC_{ijk}$ is now calculated as

$VPC_{ijk} = \sigma_3^2 / (\sigma_2^2 + \sigma_1^2)$, where $\sigma_3^2 = \text{var}(\bar{p}_{ijk})$, $\sigma_2^2 = \text{var}(p^{(r)}_{ijk})$ and $\sigma_1^2 = \sum_r \sigma^2_{r,ijk} / T$.
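A numpy sketch of these steps, using the MQL estimates as plug-in values; $M$ and $m$ are reduced from the 5000 used on the next slide to keep memory modest:

```python
import numpy as np

def district_vpc_simulation(b0, s2_d, s2_u, M=2000, m=2000, seed=1):
    rng = np.random.default_rng(seed)
    v = rng.normal(0.0, np.sqrt(s2_d), size=(M, 1))  # step 1: district effects
    u = rng.normal(0.0, np.sqrt(s2_u), size=(M, m))  # step 2: overdispersion effects
    p = 1.0 / (1.0 + np.exp(-(b0 + v + u)))          # step 3: anti-logit, T = M*m values
    sigma2_1 = np.mean(p * (1 - p))                  # mean binomial variance
    sigma2_2 = np.var(p)                             # var of all p^(r)
    sigma2_3 = np.var(p.mean(axis=1))                # step 4: var of district means
    return sigma2_3 / (sigma2_2 + sigma2_1)          # step 5

print(round(district_vpc_simulation(-0.021, 0.199, 0.764), 3))  # ~0.035, cf. 3.45%
```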

VPC estimates for this model

The following table contains the district-level VPC estimates using both the MQL and MCMC parameter estimates:

Method            Using 1st MQL estimates   Using MCMC estimates
Linearisation     4.01%                     6.68%
Simulation        3.45% (0.016%)            5.40% (0.033%)
Latent variable   4.68%                     7.62%

Note: the simulation estimates are based on 10 runs with $M = m = 5000$, and the number in brackets is the Monte Carlo standard error (MCSE).

Adding cell-level predictor variables

We can extend our model by adding predictors for each cell as follows:

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$,
$\text{logit}(p_{ijk}) = \beta_0 + X_{jk}^T\beta + v_k^{(d)} + u_{jk}, \quad v_k^{(d)} \sim N(0, \sigma_d^2), \quad u_{jk} \sim N(0, \sigma_u^2)$

In this example the variables in $\beta$ are gender, social class (caste, tribe or other) and urban/rural habitation, along with some interactions. This results in 12 ($2 \times 3 \times 2$) cell types and 12 VPC estimates.

Hierarchical centering again

Note that because all our predictors are at level 2, we can perform hierarchical centering as follows:

$y_{ijk} \sim \text{Binomial}(n_{ijk}, p_{ijk})$, $\quad \text{logit}(p_{ijk}) = u^*_{jk}$,

where $u^*_{jk} = \beta_0 + X_{jk}^T\beta + v_k^{(d)} + u_{jk}$, so that $u^*_{jk} \sim N(X_{jk}^T\beta + v^*_k, \sigma_u^2)$ with $v^*_k = \beta_0 + v_k^{(d)}$, and $v^*_k \sim N(\beta_0, \sigma_d^2)$.

This hierarchically centered formulation allows WinBUGS to use conjugate Gibbs sampling steps for all parameters apart from $u^*_{jk}$.

Adding cell-level predictor variables

The parameter estimates for the model are given below:

Parameter                              MCMC estimate (95% CI)
$\beta_0$ Intercept                    -0.713 (-0.789, -0.639)
$\beta_1$ Female                       1.437 (1.393, 1.481)
$\beta_2$ Caste                        0.696 (0.646, 0.746)
$\beta_3$ Tribe                        0.993 (0.954, 1.033)
$\beta_4$ Urban                        -1.034 (-1.084, -0.983)
$\beta_5$ Caste x Urban                0.350 (0.285, 0.414)
$\beta_6$ Female x Urban               -0.339 (-0.402, -0.276)
$\sigma_d^2$ District-level variance   0.467 (0.407, 0.539)
$\sigma_u^2$ Overdispersion variance   0.314 (0.301, 0.328)

VPC estimates

Cell type              Illiteracy probability   VPC Method A   VPC Method B (MCSE)
Male/Other/Rural       0.329                    8.79%          8.05% (0.03%)
Male/Other/Urban       0.148                    5.37%          5.89% (0.03%)
Female/Other/Rural     0.674                    8.77%          8.03% (0.04%)
Female/Other/Urban     0.343                    8.95%          8.13% (0.03%)
Male/Caste/Rural       0.496                    9.77%          8.57% (0.03%)
Male/Caste/Urban       0.332                    8.82%          8.06% (0.03%)
Female/Caste/Rural     0.805                    6.52%          6.69% (0.04%)
Female/Caste/Urban     0.598                    9.45%          8.41% (0.04%)
Male/Tribe/Rural       0.569                    9.61%          8.49% (0.03%)
Male/Tribe/Urban       0.320                    8.69%          7.99% (0.03%)
Female/Tribe/Rural     0.848                    5.48%          5.97% (0.04%)
Female/Tribe/Urban     0.585                    9.53%          8.45% (0.04%)

Note that method C is independent of the predictors and gives a VPC estimate of 11.47% for all cell types.

Adding the state level

We have thus far ignored the state level. When we add in this level we get two VPC estimates, one for district and one for state. Here we simply consider a model with no predictor variables, as shown below:

$y_{ijkl} \sim \text{Binomial}(n_{ijkl}, p_{ijkl})$,
$\text{logit}(p_{ijkl}) = \beta_0 + v_l^{(s)} + v_{kl}^{(d)} + u_{jkl}$,
$v_l^{(s)} \sim N(0, \sigma_s^2), \quad v_{kl}^{(d)} \sim N(0, \sigma_d^2), \quad u_{jkl} \sim N(0, \sigma_u^2)$

Again we can use hierarchical centering to improve the mixing of the MCMC estimation routine.

Adding the state level

The estimates for this model are:

Parameter                              MCMC estimate (95% CI)
$\beta_0$ Intercept                    -0.394 (-0.654, -0.135)
$\sigma_s^2$ State-level variance      0.455 (0.246, 0.811)
$\sigma_d^2$ District-level variance   0.042 (0.022, 0.066)
$\sigma_u^2$ Overdispersion variance   1.346 (1.290, 1.402)

From an initial inspection we can see that the variability between the 29 states is far greater than the variability between the districts within the states.

VPC estimates and further models

The VPC estimates are given below:

Method               State VPC estimate   District VPC estimate
A. Linearisation     7.59%                0.70%
B. Simulation        6.19% (0.08%)        0.59% (0.001%)
C. Latent variable   8.87%                0.81%

Browne et al. (2005) also consider models that include both the state level and predictor variables, along with heteroskedasticity at the cell level.

Interval estimation for the VPC

An advantage of using an MCMC algorithm to obtain parameter estimates for our random effects logistic regression model is that we get a sample from the posterior distribution of the parameters. We can therefore, at each iteration, work out the VPC using whichever method we prefer, and hence obtain a sample of VPC estimates from which we can produce interval estimates. In our first example, with no predictors and just district-level variation, the 95% credible intervals are (5.64%, 7.82%) for method A and (6.45%, 8.92%) for method C. Method B is too computationally burdensome here.
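In code this just means applying the chosen VPC formula to each stored draw of the variance parameters. The chains below are fabricated placeholders (loosely mimicking the MCMC summaries above) purely to show the mechanics; a real analysis would use the stored MCMC chains:

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder posterior draws for (sigma_d^2, sigma_u^2), not real chains.
draws_d = rng.normal(0.383, 0.034, size=5000)
draws_u = rng.normal(1.349, 0.029, size=5000)

vpc_chain = draws_d / (draws_d + draws_u + np.pi**2 / 3)  # Method C, draw by draw
lo, hi = np.percentile(vpc_chain, [2.5, 97.5])
print(f"95% credible interval: ({lo:.3%}, {hi:.3%})")
```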

Further work - 1

Looking at the form of the VPC: the figure shows curves representing the VPC plotted against $X_{ij}\beta$ for various values of $\sigma_u^2$.

Further work - 2

We can try to approximate these curves with normal, t or power family distributions. Similarly, for 3 levels, the figure shows an example with $\sigma_u^2 = \sigma_v^2 = 0.2$.

Further work - 3

Andrew Gelman and Iain Pardoe have been looking at extending the $R^2$ concept to multilevel models. They consider there to be an $R^2$ for each level, where a level is a set of random effects. This means that in a random slopes model both the intercepts and the slopes are levels and have an $R^2$ associated with them (here slopes and intercepts are assumed independent). The $R^2$ values then give the percentage of variance explained by the fixed predictors at each level. It would be interesting to compare and/or combine their idea with the VPC concept.

References

Goldstein, H., Browne, W.J., and Rasbash, J. (2002). Partitioning variation in multilevel models. Understanding Statistics 1, 223-231.

Browne, W.J., Subramanian, S.V., Jones, K. and Goldstein, H. (2005). Variance partitioning in multilevel logistic models that exhibit over-dispersion. JRSS A, to appear.

Browne, W.J. (1998). Applying MCMC Methods to Multilevel Models. PhD dissertation, Dept. of Maths, University of Bath.

Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1995). Efficient parameterizations for normal linear mixed models. Biometrika 82, 479-488.

Snijders, T. and Bosker, R. (1999). Multilevel Analysis. Sage.

Gelman, A. and Pardoe, I. (2004). Bayesian measures of explained variance and pooling in multilevel (hierarchical) models. Unpublished technical report.