Multilevel Modeling of Complex Survey Data



Similar documents
Prediction for Multilevel Models

A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models

Comparison of Estimation Methods for Complex Survey Data Analysis

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Handling attrition and non-response in longitudinal data

gllamm companion for Contents

A General Approach to Variance Estimation under Imputation for Missing Survey Data

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group

Approaches for Analyzing Survey Data: a Discussion

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Chapter 19 Statistical analysis of survey data. Abstract

Lecture 3: Linear methods for classification

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

The Probit Link Function in Generalized Linear Models for Data Mining Applications

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

Survey Data Analysis in Stata

10 Dichotomous or binary responses

The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data ABSTRACT INTRODUCTION SURVEY DESIGN 101 WHY STRATIFY?

Chapter 11 Introduction to Survey Sampling and Analysis Procedures

Electronic Theses and Dissertations UC Riverside

New SAS Procedures for Analysis of Sample Survey Data

Introduction to mixed model and missing data issues in longitudinal studies

A Basic Introduction to Missing Data

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

Imputing Missing Data using SAS

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Stat : Analysis of Complex Survey Data

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview of Methods for Analyzing Cluster-Correlated Data. Garrett M. Fitzmaurice

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Clarifying Some Issues in the Regression Analysis of Survey Data

Standard errors of marginal effects in the heteroskedastic probit model

Multiple Choice Models II

Poisson Models for Count Data

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

Analyzing Structural Equation Models With Missing Data

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Problem of Missing Data

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

HLM software has been one of the leading statistical packages for hierarchical

Statistical Machine Learning

Models for Longitudinal and Clustered Data

Descriptive Methods Ch. 6 and 7

Best Practices in Using Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Efficient and Practical Econometric Methods for the SLID, NLSCY, NPHS

Statistical Models in R

Power and sample size in multilevel modeling

Lecture 14: GLM Estimation and Logistic Regression

Statistics in Retail Finance. Chapter 2: Statistical models of default

Multinomial and Ordinal Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

2. Linear regression with multiple regressors

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

Logit Models for Binary Data

Simple Linear Regression Inference

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data

Multinomial Logistic Regression

Statistical and Methodological Issues in the Analysis of Complex Sample Survey Data: Practical Guidance for Trauma Researchers

Note on the EM Algorithm in Linear Regression Model

Parametric fractional imputation for missing data analysis

Module 5: Introduction to Multilevel Modelling SPSS Practicals Chris Charlton 1 Centre for Multilevel Modelling

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Visualization of Complex Survey Data: Regression Diagnostics

Methodological aspects of small area estimation from the National Electronic Health Records Survey (NEHRS). 1

Survey Data Analysis in Stata

Ordinal Regression. Chapter

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Hierarchical Logistic Regression Modeling with SAS GLIMMIX Jian Dai, Zhongmin Li, David Rocke University of California, Davis, CA

Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America.

Modelling the Scores of Premier League Football Matches

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Logit and Probit. Brad Jones 1. April 21, University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

From the help desk: hurdle models

Forecast. Forecast is the linear function with estimated coefficients. Compute with predict command

Additional sources Compilation of sources:

INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES

Module 4 - Multiple Logistic Regression

Example: Boats and Manatees

Introduction to Longitudinal Data Analysis

Least Squares Estimation

Econometrics Simple Linear Regression

From the help desk: Swamy s random-coefficients model

Introduction to Path Analysis

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

5. Linear Regression

Transcription:

Multilevel Modeling of Complex Survey Data Sophia Rabe-Hesketh, University of California, Berkeley and Institute of Education, University of London Joint work with Anders Skrondal, London School of Economics 2007 West Coast Stata Users Group Meeting Marina del Rey, October 2007 GLLAMM p.1

Outline Model-based and design based inference Multilevel models and pseudolikelihood Pseudo maximum likelihood estimation for U.S. PISA 2000 data Scaling of level-1 weights Simulation study Conclusions GLLAMM p.2

Multistage sampling: U.S. PISA 2000 data Program for International Student Assessment (PISA): Assess and compare 15 year old students reading, math, etc. Three-stage survey with different probabilities of selection Stage 1: Geographic areas k sampled Stage 2: Schools j =1,...,n (2) sampled with different probabilities π j (taking into account school non-response) Stage 3: Students i=1,...,n (1) j conditional probabilities π i j sampled from school j, with Probability that student i from school j is sampled: π ij = π i j π j GLLAMM p.3

Model-based and design-based inference Model-based inference: Target of inference is parameter β in infinite population (parameter of data generating mechanism or statistical model) called superpopulation parameter Consistent estimator (assuming simple random sampling) such as maximum likelihood estimator (MLE) yields estimate β Design-based inference: Target of inference is statistic in finite population (FP), e.g., mean score y FP of all 15-year olds in LA Student who had a π ij = 1/5 chance of being sampled represents w ij = 1/π ij = 5 similar students in finite population Estimate of finite population mean (Horvitz-Thompson): ŷ FP = Similar for proportions, totals, etc. 1 ij w w ij y ij ij ij GLLAMM p.4

Model-based inference for complex surveys Target of inference is superpopulation parameter β View finite population as simple random sample from superpopulation (or as realization from model) MLE β FP using finite population treated as target (consistent for β) Design-based estimator of β FP applied to complex survey data Replace usual log likelihood by weighted log likelihood, giving pseudo maximum likelihood estimator (PMLE) If PMLE is consistent for β FP, then it is consistent for β GLLAMM p.5

Multilevel modeling: Levels Levels of a multilevel model can correspond to stages of a multistage survey Level-1: Elementary units i (stage 3), here students Level-2: Units j sampled in previous stage (stage 2), here schools Top-level: Units k sampled at stage 1 (primary sampling units), here areas However, not all levels used in the survey will be of substantive interest & there could be clustering not due to the survey design In PISA data, top level is geographical areas details are undisclosed, so not represented as level in multilevel model GLLAMM p.6

Two-level linear random intercept model Linear random intercept model for continuous y ij : y ij = β 0 + β 1 x 1ij + + β p x pij + ζ j + ǫ ij x 1ij,...,x pij are student-level and/or school-level covariates β 0,...,β p are regression coefficients ζ j N(0, ψ) are school-specific random intercepts, uncorrelated across schools and uncorrelated with covariates ǫ ij N(0, θ) are student-specific residuals, uncorrelated across students and schools, uncorrelated with ζ j and with covariates GLLAMM p.7

Two-level logistic random intercept model Logistic random intercept model for dichotomous y ij As generalized linear model logit[pr(y ij = 1 x ij )] = β 0 + β 1 x 1ij + + β p x pij + ζ j As latent response model y ij = β 0 + β 1 x 1ij + + β p x pij + ζ j + ǫ ij y ij = 1 if y ij > 0, y ij = 0 if y ij 0 ζ j N(0, ψ) are school-specific random intercepts, uncorrelated across schools and uncorrelated with covariates ǫ ij Logistic are student-specific residuals, uncorrelated across students and schools, uncorrelated with ζ j and with covariates GLLAMM p.8

Illustration of two-level linear and logistic random intercept model E(y ij x ij, ζ j ) = β 0 + β 1 x ij + ζ j Pr(y ij = 1 x ij, ζ j ) = exp(β 0+β 1 x ij +ζ j ) 1+exp(β 0 +β 1 x ij +ζ j ) E(yij xij, ζj) 10 15 20 25 30 35 Pr(yij = 1 xij, ζj) 0.0 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 100 x ij 0 20 40 60 80 100 x ij GLLAMM p.9

Pseudolikelihood Usual marginal log likelihood (without weights) n (2) n (1) j n (2) n (1) j log f(y ij ζ j ) g(ζ j ) dζ j = log exp log f(y ij ζ j ) i=1 g(ζ j) dζ j j=1 } {{ } Pr(y j ζ j ) j=1 Log pseudolikelihood (with weights) n (2) n (1) j w j log exp w i j log f(y ij ζ j ) g(ζ j) dζ j j=1 i=1 i=1 Note: need w j = 1/π j, w i j = 1/π i j ; cannot use w ij = w i j w j Evaluate using adaptive quadrature, maximize using Newton-Raphson [Rabe-Hesketh et al., 2005] in gllamm GLLAMM p.10

Standard errors, taking into account survey design Conventional model-based standard errors not appropriate with sampling weights Sandwich estimator of standard errors (Taylor linearization) Cov( ϑ) = I 1 J I 1 J : Expectation of outer product of gradients, approximated using PSU contributions to gradients I: Expected information, approximated by observed information ( model-based standard errors obtained from I 1 ) Sandwich estimator accounts for Stratification at stage 1 Clustering at levels above highest level of multilevel model Implemented in gllamm with cluster() and robust options GLLAMM p.11

Analysis of U.S. PISA 2000 data Two-level (students nested in schools) logistic random intercept model for reading proficiency (dichotomous) PSUs are areas, sampling weights w i j for students and w j for schools provided Predictors: [Female]: Student is female (dummy) [ISEI]: International socioeconomic index [MnISEI]: School mean ISEI [Highschool]/ [College]: Highest education level by either parent is highschool/college (dummies) [English]: Test language (English) spoken at home (dummy) [Oneforeign]: One parent is foreign born (dummy) [Bothforeign]: Both parents are foreign born (dummy) GLLAMM p.12

Data structure and gllamm syntax in Stata Data strucure. list id_school wt2 wt1 mn_isei isei in 28/37, clean noobs id_school wt2 wt1 mn_isei isei 2 105.82.9855073 47.76471 30 2 105.82.9855073 47.76471 57 2 105.82.9855073 47.76471 50 2 105.82 1.108695 47.76471 71 2 105.82.9855073 47.76471 29 2 105.82.9855073 47.76471 29 3 296.95.9677663 42 56 3 296.95.9677663 42 67 3 296.95.9677663 42 38 3 296.95.9677663 42 40 gllamm syntax gllamm pass_read female isei mn_isei high_school college english one_for both_for, i(id_school) cluster(wvarstr) link(logit) family(binom) pweight(wt) adapt GLLAMM p.13

PISA 2000 estimates for multilevel regression model Unweighted Maximum likelihood Weighted Pseudo maximum likelihood Parameter Est (SE) Est (SE R ) (SE PSU R ) β 0 : [Constant] 6.034 (0.539) 5.878 (0.955) (0.738) β 1 : [Female] 0.555 (0.103) 0.622 (0.154) (0.161) β 2 : [ISEI] 0.014 (0.003) 0.018 (0.005) (0.004) β 3 : [MnISEI] 0.069 (0.001) 0.068 (0.016) (0.018) β 4 : [Highschool] 0.400 (0.256) 0.103 (0.477) (0.429) β 5 : [College] 0.721 (0.255) 0.453 (0.505) (0.543) β 6 : [English] 0.695 (0.283) 0.625 (0.382) (0.391) β 7 : [Oneforeign] 0.020 (0.224) 0.109 (0.274) (0.225) β 8 : [Bothforeign] 0.099 (0.236) 0.280 (0.326) (0.292) ψ 0.272 (0.086) 0.296 (0.124) (0.115) GLLAMM p.14

Problem with using weights in linear models Linear variance components model, constant cluster size n (1) j = n (1) y ij = β 0 + ζ j + ǫ ij, Var(ζ j ) = ψ, Var(ǫ ij ) = θ Assume sampling independent of ǫ ij, w i j = a > 1 for all i, j Get biased estimate of ψ: Weighted sum of squares due to clusters SSC w = j (y.j y.. ) 2 = j (ζ j ζ. ) 2 + j (ǫ ẉ j ǫ ẉ.) 2 = SSC Expectation of SSC w, same as expectation [ of unweighted SSC E(SSC w ) = (n (2) 1) ψ + θ ] n (1) Pseudo maximum likelihood estimator ψ PML = SSCw θ w > n (2) an ψ ML = SSC ML θ (1) n (2) n (1) GLLAMM p.15

Explanation for bias and anticipated results for logit/probit models Clusters appear bigger than they are (a times as big) Between-cluster variability in ǫ ẉ j an (1) This extra between-cluster variability in ǫ ẉ j greater than for clusters of size is attributed to ψ However, if sampling at level 1 stratified according to ǫ ij, e.g. π i j 0.25 if ǫ ij > 0 0.75 if ǫ ij 0 variance of ǫ ẉ j decreases, and upward bias of ψ PML decreases Bias decreases as n (1) increases In logit/probit models, anticipate that β PML increases when increases; therefore biased estimates of β ψ PML GLLAMM p.16

Solution: Scaling of weights? Scaling method 1 [Longford,1995, 1996; Pfeffermann et al., 1998] wi j = i w i j w i j i w2 i j so that i w i j = i w 2 i j In linear model example with sampling independent of ǫ ij, no bias egen sum_w = sum(w), by(id_school) egen sum_wsq = sum(wˆ2), by(id_school) generate wt1 = w*sum_w/sum_wsq Scaling method 2 [Pfeffermann et al., 1998] w i j = n(1) j i w w i j i j so that i w i j = n(1) j In line with intuition (clusters do not appear bigger than they are) egen nj = count(w), by(id_school) generate wt1 = w*nj/sum_w GLLAMM p.17

Simulations Dichotomous random intercept logistic regression (500 clusters, N j units per cluster in FP), with yij = }{{} 1 + }{{} 1 x 1j + }{{} 1 x 2ij + ζ j + ǫ ij, ψ = 1 β 0 β 1 β 2 Stage 1: Sample clusters with probabilities 0.25 if ζ j > 1 π j 0.75 if ζ j 1 Stage 2: Sample units with probabilities 0.25 if ǫ ij > 0 π i j 0.75 if ǫ ij 0 Vary N j from 5 to 100, 100 datasets per condition, 12-point adaptive quadrature GLLAMM p.18

Results for N j = 5 True Unweighted Weighted Pseudo maximum likelihood Parameter value ML Raw Method 1 Method 2 Model parameters: Conditional effects β 0 1 0.40 1.03 0.68 0.75 (0.11) (0.19) (0.16) (0.15) β 1 1 1.08 1.19 0.96 0.98 (0.18) (0.32) (0.26) (0.26) β 2 1 1.06 1.22 0.94 0.96 (0.22) (0.35) (0.25) (0.26) ψ 1 0.39 1.47 0.58 0.70 (0.37) (0.21) (0.31) (0.30) GLLAMM p.19

Effect of level-1 stratification method (N j = 10) (1) Strata based on sign of ǫ ij (2) Strata based on sign of ξ ij, Cor(ǫ ij, ξ ij ) = 0.5 (3) Strata based on sign of ξ ij, Cor(ǫ ij, ξ ij ) = 0 True Raw Method 1 Parameter value (1) (2) (3) (1) (2) (3) β 0 1 1.04 1.10 1.29 0.83 0.88 1.01 (0.16) (0.16) (0.21) (0.14) (0.13) (0.16) β 1 1 1.06 1.11 1.26 0.91 0.92 0.99 (0.23) (0.26) (0.30) (0.20) (0.23) (0.25) β 2 1 1.11 1.12 1.17 0.91 0.91 0.96 (0.20) (0.21) (0.25) (0.16) (0.17) (0.19) ψ 1 1.19 1.33 1.77 0.40 0.61 0.98 (0.13) (0.15) (0.15) (0.34) (0.24) (0.16) GLLAMM p.20

Simulation results for pseudo maximum likelihood estimation Little bias for ψ when N j 50 (cluster sizes in sample n (1) j 25) For smaller cluster sizes: Raw level-1 weights produce positive bias for ψ Scaling methods 1 and 2 overcorrect positive bias for ψ apparently due to stratification based on sign of ǫ ij Inflation of β estimates whenever positive bias for ψ Good coverage using sandwich estimator (1000 simulations) for N j = 50 GLLAMM p.21

Conclusions Pseudo maximum likelihood estimation allows for stratification, clustering, and weighting Three common methods for scaling level-1 weights: no scaling, scaling method 1, scaling method 2 Inappropriate scaling can lead to biased estimates If clusters are sufficiently large, little bias similar results with all three scaling methods If level-1 weights based on variables strongly associated with outcome, use no scaling If level-1 weights based on variables not associated with outcome, use method 1 For intermediate situations, use method 2? GLLAMM p.22

References Longford, N. T. (1995). Models for Uncertainty in Educational Testing. New York: Springer. Longford, N. T. (1996). Model-based variance estimation in surveys with stratified clustered designs. Australian Journal of Statistics, 38, 333 352. Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society, Series B, 60, 23 40. GLLAMM p.23

References: Our relevant work Rabe-Hesketh, S. & Skrondal, A. (2006). Multilevel modeling of complex survey data. Journal of the Royal Statistical Society (Series A) 169, 805 827. Rabe-Hesketh, S., Skrondal, A. and Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics 128, 301 323. Skrondal, A. & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal and structural equation models. Boca Raton, FL: Chapman & Hall/ CRC. gllamm and manual from http://www.gllamm.org GLLAMM p.24