A General Approach to Variance Estimation under Imputation for Missing Survey Data


J.N.K. Rao, Carleton University, Ottawa, Canada
Joint work with J.K. Kim at Iowa State University.
Workshop on Survey Sampling in honor of Jean-Claude Deville, Neuchâtel, Switzerland, June 24-26, 2009

Outline
Item nonresponse
Deterministic imputation: population model approach
  Imputed estimator
  Linearization variance estimator
  Examples: domain estimation, composite imputation
Stochastic imputation: variance estimation
  Examples: multiple imputation, binary response
Simulation results
Doubly robust approach
Extensions

Survey Data
Design features: clustering, stratification, unequal probability of selection
Sources of error:
1. Sampling errors
2. Non-sampling errors: nonresponse (missing data), noncoverage, measurement errors

Types of nonresponse
Unit (or total) nonresponse: refusal, not-at-home
  Remedy: weight adjustment within classes
Item nonresponse: sensitive item, answer not known, inconsistent answer
  Remedy: imputation (fill in missing data)

Advantages of imputation
Complete data file: standard complete-data methods can be applied
Different analyses are consistent with each other
Nonresponse bias is reduced
Observed auxiliary variables x can be used to obtain good imputed values
Same survey weight for all items

Commonly used imputation methods
Marginal imputation methods:
1. Business surveys: ratio, regression, nearest neighbor (NN)
2. Socio-economic surveys: random donor (within classes), stochastic ratio or regression, fractional imputation (FI), multiple imputation (MI)

Complete response set-up
Population total: $\theta_N = \sum_{i=1}^{N} y_i$
NHT estimator: $\hat\theta_n = \sum_{i \in s} d_i y_i$, where $d_i = \pi_i^{-1}$ is the design weight and $\pi_i = \Pr(i \in s)$ is the inclusion probability
Variance estimator: $\hat V_n = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} y_i y_j$, where $\Omega_{ij}$ depends on the joint inclusion probabilities $\pi_{ij} > 0$
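The complete-response estimator and its variance estimator can be sketched numerically. The slide only says that $\Omega_{ij}$ depends on the joint inclusion probabilities; the sketch below assumes the usual Horvitz-Thompson form $\Omega_{ij} = (\pi_{ij} - \pi_i \pi_j)/(\pi_{ij} \pi_i \pi_j)$ with $\pi_{ii} = \pi_i$. The function name and the simple-random-sampling example data are hypothetical.

```python
import numpy as np

def ht_total_and_variance(y, pi, pi_joint):
    """NHT total and variance estimator for one sample.

    y        : (n,) sampled values
    pi       : (n,) first-order inclusion probabilities
    pi_joint : (n, n) joint inclusion probabilities, pi_joint[i, i] = pi[i]
    """
    d = 1.0 / pi                                  # design weights d_i = 1 / pi_i
    theta_hat = np.sum(d * y)
    # Assumed form: Omega_ij = (pi_ij - pi_i pi_j) / (pi_ij pi_i pi_j)
    pp = np.outer(pi, pi)
    omega = (pi_joint - pp) / (pi_joint * pp)
    v_hat = y @ omega @ y
    return theta_hat, v_hat

# Hypothetical example: simple random sampling without replacement.
rng = np.random.default_rng(0)
N, n = 1000, 50
y = rng.normal(10.0, 2.0, size=n)
pi = np.full(n, n / N)
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_joint, n / N)
theta_hat, v_hat = ht_total_and_variance(y, pi, pi_joint)
```

Under simple random sampling the double sum collapses to the textbook expression $N^2 (1 - n/N) s_y^2 / n$, which gives a quick check on the implementation.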

Deterministic imputation
Population model approach (Deville and Särndal, 1994): $E_\zeta(y_i \mid x_i) = m(x_i, \beta_0)$
Response indicator, for $i \in U = \{1, 2, \ldots, N\}$: $a_i = 1$ if $y_i$ is observed when $i \in s$, and $a_i = 0$ otherwise
MAR: the distribution of $a_i$ depends only on $x_i$
Imputed value: $\hat y_i = m(x_i, \hat\beta)$
$\hat\beta$: unique solution of the estimating equation (EE)
$\hat U(\beta) = \sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$

Model specification
Further model specification: $Var_\zeta(y_i \mid x_i) = \sigma^2 q(x_i, \beta_0)$ and $h(x_i, \beta) = \dot m(x_i, \beta) / q(x_i, \beta) \equiv h_i$
Examples: commonly used imputations
1. Ratio imputation: $h_i = 1$; $E_\zeta(y_i \mid x_i) = \beta_0 x_i$, $Var_\zeta(y_i \mid x_i) = \sigma^2 x_i$
2. Linear regression imputation: $h_i = x_i$; $E_\zeta(y_i \mid x_i) = x_i' \beta_0$, $Var_\zeta(y_i \mid x_i) = \sigma^2$
3. Logistic regression imputation ($y_i = 0$ or $1$): $h_i = x_i$; $\log\{m_i / (1 - m_i)\} = x_i' \beta_0$, $Var_\zeta(y_i \mid x_i) = m_i (1 - m_i)$, where $m_i = E_\zeta(y_i \mid x_i)$

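Imputed estimator
Imputed estimator of the total $\theta_N$:
$\hat\theta_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \hat\beta)\} \equiv \sum_{i \in s} d_i \tilde y_i$
Examples
1. Ratio imputation: $m(x_i, \hat\beta) = x_i \hat\beta$, where $\hat\beta = (\sum_{i \in s} d_i a_i x_i)^{-1} \sum_{i \in s} d_i a_i y_i$
2. Linear regression imputation: $m(x_i, \hat\beta) = x_i' \hat\beta$, where $\hat\beta = (\sum_{i \in s} d_i a_i x_i x_i')^{-1} \sum_{i \in s} d_i a_i x_i y_i$
3. Logistic regression imputation: $\hat\beta$ is the solution to $\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$
Imputed estimator of the domain total $\theta_z = \sum_{i=1}^{N} z_i y_i$:
$\hat\theta_{I,z} = \sum_{i \in s} d_i z_i \tilde y_i$, where $z_i = 1$ if $i \in D$ and $z_i = 0$ otherwise.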
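As an illustration of the imputed estimator, here is a minimal sketch of deterministic ratio imputation, including the domain total. The function name and the toy data are hypothetical; missing values of y are stored as NaN and never enter the fit.

```python
import numpy as np

def ratio_imputed_total(d, a, x, y):
    """Deterministic ratio imputation: y_i is replaced by x_i * beta_hat when a_i = 0."""
    resp = a == 1
    # beta_hat = (sum d_i a_i x_i)^(-1) * sum d_i a_i y_i, respondents only
    beta_hat = np.sum(d[resp] * y[resp]) / np.sum(d[resp] * x[resp])
    y_tilde = np.where(resp, y, x * beta_hat)     # observed or imputed value
    return np.sum(d * y_tilde), beta_hat, y_tilde

# Toy data (hypothetical): five sampled units, two with y missing.
d = np.full(5, 10.0)
a = np.array([1, 1, 0, 1, 0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan])
z = np.array([1, 0, 1, 0, 0])                     # domain indicator

total, beta_hat, y_tilde = ratio_imputed_total(d, a, x, y)
domain_total = np.sum(d * z * y_tilde)            # uses the same imputed file
```

The domain total reuses the same completed data file, which is the practical appeal of imputation noted earlier: one file, one set of weights, many analyses.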

Variance estimation
Treating imputed values as if observed: underestimation if $\tilde y_i$ is used in $\hat V_n$ in place of $y_i$
Methods that account for imputation:
  Adjusted jackknife: Rao and Shao (1992)
  Linearization (population model): Deville and Särndal (1994)
  Fractional imputation method: Fuller and Kim (2005)
  Bootstrap: Shao and Sitter (1996)
  Reverse approach: Shao and Steel (1999)

Variance estimation (Cont'd)
Linearization method: Theorem 1 (Kim and Rao, 2009). Under regularity conditions,
$n^{1/2} N^{-1} (\hat\theta_{Id} - \tilde\theta_{Id}) = o_p(1)$,
where $\tilde\theta_{Id} = \sum_{i \in s} w_i \eta_i$ (here $w_i$ denotes the design weight $d_i$),
$\eta_i = m(x_i; \beta_0) + a_i \{1 + c' h_i\} \{y_i - m(x_i; \beta_0)\}$,
$c = \{\sum_{i=1}^{N} a_i \dot m(x_i; \beta_0) h_i'\}^{-1} \sum_{i=1}^{N} (1 - a_i) \dot m(x_i; \beta_0)$.
Reference distribution: joint distribution of the population model and the sampling mechanism, conditional on the realized $(x_i, a_i)$ in the population.

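Variance estimation (Cont'd)
Reverse approach:
1. $\hat V_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat\eta_i \hat\eta_j$, where $\hat\eta_i = \eta_i(\hat\beta)$.
2. $\hat V_{2d} = \sum_{i \in s} d_i a_i (1 + \hat c' \hat h_i)^2 \{y_i - m(x_i; \hat\beta)\}^2$
   $\hat V_{2d}$ is valid even if $V_\zeta(y_i \mid x_i)$ is misspecified.
3. Variance estimator of $\hat\theta_{Id}$ ($\cong \tilde\theta_{Id}$): $\hat V_d = \hat V_{1d} + \hat V_{2d}$
   $\hat V_d$ is approximately design-model unbiased.
   If the overall sampling rate is negligible: $\hat V_d \approx \hat V_{1d}$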
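To make the linearization concrete, the following sketch specializes the formulas to ratio imputation ($h_i = 1$, $\dot m = x_i$, so $\hat c$ is a scalar) under simple random sampling without replacement, where the double sum in $\hat V_{1d}$ reduces to $N^2 (1 - n/N) s_\eta^2 / n$. The SRS design, the function name and the simulated data are assumptions for illustration, and $\hat V_{2d}$ is coded with squared residuals as written above.

```python
import numpy as np

def ratio_linearization_variance(d, a, x, y, N):
    """eta_hat, V_1d and V_2d for ratio imputation, assuming SRS without replacement."""
    n = len(y)
    y0 = np.where(a == 1, y, 0.0)                 # missing y never enters a sum
    beta = np.sum(d * a * y0) / np.sum(d * a * x)
    e = a * (y0 - x * beta)                       # residuals, zero for nonrespondents
    c = np.sum(d * (1 - a) * x) / np.sum(d * a * x)   # c_hat with h_i = 1
    eta = x * beta + (1 + c) * e                  # eta_hat_i
    v1 = N**2 * (1 - n / N) * np.var(eta, ddof=1) / n  # SRS form of V_1d
    v2 = np.sum(d * (1 + c) ** 2 * e**2)          # V_2d (a_i is inside e)
    return np.sum(d * eta), beta, v1, v2

# Simulated sample (hypothetical values).
rng = np.random.default_rng(1)
N, n = 2000, 100
x = rng.uniform(1.0, 5.0, size=n)
y = 2.0 * x + rng.normal(0.0, np.sqrt(x))
a = rng.binomial(1, 0.7, size=n)
d = np.full(n, N / n)

theta_lin, beta_hat, v1, v2 = ratio_linearization_variance(d, a, x, y, N)
v_d = v1 + v2
```

For ratio imputation the weighted sum of the $\hat\eta_i$ reproduces the imputed estimator exactly, because the respondent residuals sum to zero under the estimating equation; this gives a useful sanity check.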

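Variance estimation (Cont'd)
Domain estimation:
1. $\hat\theta_{I,z}$: design-model unbiased for $\theta_z$
2. Use $\hat V_{1d} = \sum_{i \in s} \sum_{j \in s} \Omega_{ij} \hat\eta_{iz} \hat\eta_{jz}$, where
   $\hat\eta_{iz} = z_i m(x_i; \hat\beta) + a_i \{z_i + \hat c_z' \hat h_i\} \{y_i - m(x_i; \hat\beta)\}$,
   $\hat c_z = \{\sum_{i \in s} d_i a_i \dot m(x_i; \hat\beta) \hat h_i'\}^{-1} \sum_{i \in s} d_i z_i (1 - a_i) \dot m(x_i; \hat\beta)$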

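Composite imputation
Variables x, y, z, with z always observed
Parameter: $\theta_N = \sum_{i=1}^{N} y_i$
Sample partition by the response status of (x, y): $s = s_{RR} \cup s_{RM} \cup s_{MR} \cup s_{MM}$
  $s_{RM}$: x observed and y missing
  $s_{MM}$: x and y missing
Imputation model: $E_\zeta(y_i \mid x_i, z_i) = \beta_{y|x} x_i$ and $E_\zeta(x_i \mid z_i) = \beta_{x|z} z_i$
Imputed estimator:
$\hat\theta_{Id} = \sum_{i \in s_{+R}} d_i y_i + \sum_{i \in s_{RM}} d_i (\hat\beta_{y|x} x_i) + \sum_{i \in s_{MM}} d_i (\hat\beta_{y|x} \hat\beta_{x|z} z_i)$,
where $s_{+R} = s_{RR} \cup s_{MR}$ is the set of sampled units with y observed.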

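Composite imputation (Cont'd)
$\hat\beta_{y|x}$ and $\hat\beta_{x|z}$ are solutions of the estimating equations
$\hat U_1(\beta_{y|x}) = \sum_{i \in s_{RR}} d_i (y_i - \beta_{y|x} x_i) = 0$
$\hat U_2(\beta_{x|z}) = \sum_{i \in s_{R+}} d_i (x_i - \beta_{x|z} z_i) = 0$
Taylor linearization of the imputed estimator:
$\hat\theta_{Id}(\hat\beta) \cong \hat\theta_{Id}(\beta) - (\partial \hat\theta_{Id} / \partial \beta') (\partial \hat U / \partial \beta')^{-1} \hat U(\beta)$,
where $\hat U = (\hat U_1, \hat U_2)'$ and $\beta = (\beta_{y|x}, \beta_{x|z})'$.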

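Stochastic imputation
$y_i^*$ = imputed value of $y_i$ such that $E_I(y_i^*) = m(x_i, \hat\beta) \equiv \hat m_i$
Imputed estimator of $\theta_N$: $\hat\theta_I = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^*\}$, with $E_I(\hat\theta_I) = \hat\theta_{Id}$
Variance estimator of $\hat\theta_I$: $\hat V_I = \hat V_d + \hat V^*$, where $\hat V^* = \sum_{i \in s} d_i^2 (1 - a_i) (y_i^* - \hat m_i)^2$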
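The extra imputation-variance term is simple to compute once the imputed values and their conditional means are available. A minimal sketch, with a hypothetical function name and toy numbers; how $y_i^*$ is drawn is left to the imputation method.

```python
import numpy as np

def stochastic_imputed_total(d, a, y, y_star, m_hat):
    """theta_hat_I and the imputation-variance term V_hat^* for one imputed data set."""
    y_obs = np.where(a == 1, y, 0.0)              # placeholder 0 where y is missing
    theta_i = np.sum(d * (a * y_obs + (1 - a) * y_star))
    v_star = np.sum(d**2 * (1 - a) * (y_star - m_hat) ** 2)
    return theta_i, v_star

# Toy numbers (hypothetical): units 2 and 4 are nonrespondents.
d = np.full(4, 2.0)
a = np.array([1, 0, 1, 0])
y = np.array([1.0, np.nan, 3.0, np.nan])
m_hat = np.array([1.5, 2.0, 2.5, 3.0])            # E_I(y_i^*) for each unit
y_star = np.array([0.0, 2.5, 0.0, 2.0])           # used only where a_i = 0

theta_i, v_star = stochastic_imputed_total(d, a, y, y_star, m_hat)
```

The total variance estimate is then the deterministic-imputation estimate $\hat V_d$ plus `v_star`.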

Multiple imputation: Rubin
$y_i^{*(1)}, \ldots, y_i^{*(M)}$ = imputed values of $y_i$ ($M \ge 2$)
Imputed estimator: $\hat\theta_I^{(k)} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) y_i^{*(k)}\}$ and $\hat\theta_{MI} = M^{-1} \sum_{k=1}^{M} \hat\theta_I^{(k)}$
Rubin's variance estimator: $\hat V_R = W_M + \frac{M + 1}{M} B_M$,
where $W_M$ is the average of the M naive variance estimators and $B_M = (M - 1)^{-1} \sum_{k=1}^{M} (\hat\theta_I^{(k)} - \hat\theta_{MI})^2$

Multiple imputation (Cont'd)
$\hat V_R$ is theoretically justified when
$V(\hat\theta_{Id}) = V(\hat\theta_n) + V(\hat\theta_{Id} - \hat\theta_n)$   (A)   (congeniality assumption)
$\hat V_R$ is seriously biased if assumption (A) is violated.
(A) is not satisfied for domain estimation when the domains are not specified at the imputation stage.
Our proposal: $\hat V_{MI} = \hat V_d + M^{-1} B_M$
$\hat V_{MI}$ is valid for $\hat\theta_{Id}$ as well as $\hat\theta_{I,z}$ without (A).
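The two combining rules differ only in what is added to the between-imputation term. A small sketch, taking the M point estimates, the M naive variance estimates and a linearization estimate $\hat V_d$ as given inputs; the function name and the numbers are hypothetical.

```python
import numpy as np

def mi_variances(theta_k, naive_var_k, v_d):
    """Rubin's V_R and the proposed V_MI from M imputed-data estimates."""
    theta_k = np.asarray(theta_k, dtype=float)
    M = len(theta_k)
    theta_mi = theta_k.mean()
    b_m = np.sum((theta_k - theta_mi) ** 2) / (M - 1)   # between-imputation variance
    w_m = np.mean(naive_var_k)                          # average naive variance
    v_rubin = w_m + (M + 1) / M * b_m
    v_kr = v_d + b_m / M                                # V_MI = V_d + B_M / M
    return theta_mi, v_rubin, v_kr

# Hypothetical inputs with M = 3.
theta_mi, v_rubin, v_kr = mi_variances([10.0, 12.0, 14.0], [1.0, 2.0, 3.0], v_d=3.0)
```

Here $B_M = 4$ and $W_M = 2$, so Rubin's rule gives $2 + (4/3) \cdot 4$ while the proposed estimator gives $3 + 4/3$.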

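Binary response
Model: $y_i \mid x_i \sim$ Bernoulli$\{m_i = m(x_i, \beta_0)\}$, with logit$(m_i) = x_i' \beta_0$ and $q(x_i, \beta_0) = m_i (1 - m_i) \equiv q_i$
$\hat m_i = m(x_i, \hat\beta)$, where $\hat\beta$ is the solution to $\sum_{i \in s} d_i a_i \{y_i - m(x_i, \beta)\} x_i = 0$
Stochastic hot deck imputation: $y_i^* = 1$ with probability $\hat m_i$, and $y_i^* = 0$ with probability $1 - \hat m_i$
$\hat\eta_i = \hat m_i + a_i (1 + \hat c' x_i) (y_i - \hat m_i)$
$\hat c = \{\sum_{i \in s} d_i a_i \hat q_i x_i x_i'\}^{-1} \sum_{i \in s} d_i (1 - a_i) \hat q_i x_i$.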
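A possible implementation of these quantities is sketched below: the estimating equation is solved by Newton-Raphson on the respondents, and $\hat c$ and $\hat\eta_i$ follow from the formulas above. The solver, the function names, the fixed iteration count and the simulated data are assumptions made for illustration.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Solve sum_i w_i {y_i - m(x_i, beta)} x_i = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        m = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - m))
        info = (X * (w * m * (1 - m))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, score)
    return beta

def binary_eta(X, y, a, d):
    """beta_hat, m_hat and the linearized values eta_hat for logistic imputation."""
    y0 = np.where(a == 1, y, 0.0)                 # nonrespondents get weight 0 below
    beta = fit_weighted_logistic(X, y0, d * a)
    m_hat = 1.0 / (1.0 + np.exp(-X @ beta))
    q_hat = m_hat * (1 - m_hat)
    lhs = (X * (d * a * q_hat)[:, None]).T @ X
    rhs = X.T @ (d * (1 - a) * q_hat)
    c_hat = np.linalg.solve(lhs, rhs)
    eta = m_hat + a * (1 + X @ c_hat) * (y0 - m_hat)
    return beta, m_hat, eta

# Simulated sample (hypothetical): intercept plus one covariate.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(3.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * x - 2.0)))).astype(float)
a = rng.binomial(1, 0.7, size=n)
d = np.full(n, 50.0)

beta_hat, m_hat, eta_hat = binary_eta(X, y, a, d)
theta_id = np.sum(d * (a * y + (1 - a) * m_hat))  # deterministic imputed total
```

Because the respondent score is zero at $\hat\beta$, the weighted sum of the $\hat\eta_i$ equals the deterministic imputed estimator, which is a convenient check before plugging $\hat\eta_i$ into $\hat V_{1d}$.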

Binary response (Cont'd)
Fractional imputation (FI): eliminate the imputation variance $V^*$ by FI
M = 2 fractions: impute $y_i^* = 1$ with fractional weight $\hat m_i$ and $y_i^* = 0$ with fractional weight $1 - \hat m_i$
The data file reports real values 1 and 0 with associated fractions $\hat m_i$ and $1 - \hat m_i$.
$\hat\theta_{FI} = \hat\theta_{Id}$: $V^*$ is eliminated
Estimation of domain total and mean: $\hat\theta_{FI,z}$ and $(\sum_{i \in s} d_i z_i)^{-1} \hat\theta_{FI,z}$
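The fractionally imputed data file can be pictured as follows: each respondent contributes one row, and each nonrespondent contributes two rows with values 1 and 0 and weights split by $\hat m_i$. This is a minimal sketch with a hypothetical function name and toy inputs; the fitted $\hat m_i$ are taken as given.

```python
import numpy as np

def fractional_imputation_rows(d, a, y, m_hat):
    """Return (values, weights): one row per respondent, two per nonrespondent."""
    values, weights = [], []
    for di, ai, yi, mi in zip(d, a, y, m_hat):
        if ai == 1:
            values.append(yi)
            weights.append(di)
        else:
            values.extend([1.0, 0.0])             # real values 1 and 0
            weights.extend([di * mi, di * (1 - mi)])  # fractions m_hat and 1 - m_hat
    return np.array(values), np.array(weights)

# Toy inputs (hypothetical): the second unit is a nonrespondent.
d = np.full(3, 10.0)
a = np.array([1, 0, 1])
y = np.array([1.0, np.nan, 0.0])
m_hat = np.array([0.6, 0.3, 0.4])

values, weights = fractional_imputation_rows(d, a, y, m_hat)
theta_fi = np.sum(weights * values)
theta_id = np.sum(d * np.where(a == 1, y, m_hat))  # deterministic imputed total
```

The weighted total of the expanded file coincides with the deterministic imputed estimator, while every stored value is still a genuine 0 or 1.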

Binary response (Cont'd)
Multiple imputation (MI):
1. Generate $\beta^* \sim N(\hat\beta, \{\sum_{i \in s} a_i \hat q_i x_i x_i'\}^{-1})$
2. Generate $y_i^* \sim$ Bernoulli$(m_i^*)$ with $m_i^* = m(x_i, \beta^*)$
3. Repeat steps 1 and 2 independently M times.
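The three steps can be sketched as a short loop. The function name is hypothetical, and in the example the value passed as `beta_hat` is only a stand-in for a fitted coefficient vector; the covariance matrix follows step 1 as written, without design weights.

```python
import numpy as np

def multiply_impute(X, y, a, d, beta_hat, M, rng):
    """Return the M imputed-data totals theta_hat_I^(k) for a binary y."""
    y0 = np.where(a == 1, y, 0.0)
    m_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    q_hat = m_hat * (1 - m_hat)
    cov = np.linalg.inv((X * (a * q_hat)[:, None]).T @ X)
    cov = 0.5 * (cov + cov.T)                     # guard against rounding asymmetry
    thetas = np.empty(M)
    for k in range(M):
        beta_star = rng.multivariate_normal(beta_hat, cov)      # step 1
        m_star = 1.0 / (1.0 + np.exp(-X @ beta_star))
        y_star = rng.binomial(1, m_star)                        # step 2
        thetas[k] = np.sum(d * (a * y0 + (1 - a) * y_star))
    return thetas                                               # step 3: M repeats

# Simulated sample (hypothetical values).
rng = np.random.default_rng(4)
n = 100
x = rng.normal(3.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * x - 2.0)))).astype(float)
a = rng.binomial(1, 0.7, size=n)
d = np.full(n, 100.0)
beta_hat = np.array([-2.0, 0.5])                  # stand-in for the fitted beta_hat

thetas = multiply_impute(X, y, a, d, beta_hat, M=5, rng=rng)
```

The resulting array of totals is what the combining rules for multiple imputation take as input.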

Simulation Study: Binary response
Finite population of size N = 10,000 generated from
  $x_i \sim N(3, 1)$
  $y_i \mid x_i \sim$ Bernoulli$(m_i)$, where logit$(m_i) = 0.5 x_i - 2$
  $z_i \sim$ Bernoulli$(0.4)$ ($z_i$: domain indicator)
SRS of size n = 100
$x_i$ and $z_i$ are always observed; $y_i$ is subject to missingness.
Missing response mechanism: $a_i \sim$ Bernoulli$(\pi_i)$, with logit$(\pi_i) = \phi_0 + \phi_1 (x_i - 3) + \phi_2 |x_i - 3|$
  (a) $\phi_1 = 0$, $\phi_2 = 0$; (b) $\phi_1 = 1$, $\phi_2 = 0$; (c) $\phi_1 = 0$, $\phi_2 = 1$
  $\phi_0$ is determined to achieve a 70% response rate.
Two variance estimates for multiple imputation are computed.
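One way to set up this simulation is sketched below for response mechanisms (a) and (b). The slides do not say how $\phi_0$ is chosen; the sketch assumes it is the value that makes the population-average response probability equal to 0.70, found by bisection. The seed, the helper names and that calibration rule are assumptions.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def solve_phi0(lin, target=0.70, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection for phi_0 such that mean(expit(phi_0 + lin)) equals target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expit(mid + lin).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2009)                 # arbitrary seed
N, n = 10_000, 100
x = rng.normal(3.0, 1.0, size=N)
y = rng.binomial(1, expit(0.5 * x - 2.0))
z = rng.binomial(1, 0.4, size=N)                  # domain indicator

phi1 = 1.0                                        # case (b); phi1 = 0.0 gives case (a)
lin = phi1 * (x - 3.0)
phi0 = solve_phi0(lin)
p_resp = expit(phi0 + lin)
a = rng.binomial(1, p_resp)                       # response indicators, whole population

s = rng.choice(N, size=n, replace=False)          # SRS without replacement
d = np.full(n, N / n)                             # design weights for the sample
x_s, y_s, z_s, a_s = x[s], y[s], z[s], a[s]
```

Generating the response indicators for the whole population matches the reference distribution described earlier, which conditions on the realized $(x_i, a_i)$ in the population.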

Simulation Study (Cont'd)
Table: Relative bias (RB, in %) of Rubin's variance estimator (R) and the proposed variance estimator (KR) for multiple imputation

Parameter         Response mechanism      R        KR
Population mean   Case 1                  1.07     2.90
                  Case 2                 -0.29     1.42
                  Case 3                 -3.96    -2.09
Domain mean       Case 1                 34.25     2.37
                  Case 2                 31.08     2.28
                  Case 3                 27.55    -3.41

Conclusion:
1. KR has small RB in all cases.
2. R leads to large RB for the domain mean: 28% to 34%.

Doubly robust method
Case 1: $p_i$ known ($p_i$ = probability of response)
Let $\tilde\beta$ be the solution to
$\hat U(\beta) = \sum_{i \in s} d_i a_i (1/p_i - 1) \{y_i - m(x_i, \beta)\} h(x_i, \beta) = 0$
Imputed estimator: $\tilde\theta_{Id} = \sum_{i \in s} d_i \{a_i y_i + (1 - a_i) m(x_i, \tilde\beta)\}$
If 1 is an element of $h_i$, then
$\tilde\theta_{Id} = \sum_{i \in s} d_i \{(a_i / p_i) y_i + (1 - a_i / p_i) m(x_i, \tilde\beta)\}$
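The equivalence of the two forms can be checked numerically. The sketch below uses linear regression imputation with $h_i = (1, x_i)'$, so that 1 is an element of $h_i$; the function name and the simulated data, including the response probabilities treated as known, are hypothetical.

```python
import numpy as np

def doubly_robust_total(d, a, x, y, p):
    """Both forms of the estimator for linear m(x, beta) with h_i = (1, x_i)."""
    X = np.column_stack([np.ones_like(x), x])
    y0 = np.where(a == 1, y, 0.0)                 # nonrespondents have weight 0 below
    w = d * a * (1.0 / p - 1.0)                   # weights in the estimating equation
    beta = np.linalg.solve((X * w[:, None]).T @ X, X.T @ (w * y0))
    m = X @ beta
    imputed_form = np.sum(d * (a * y0 + (1 - a) * m))
    weighted_form = np.sum(d * (a / p * y0 + (1 - a / p) * m))
    return imputed_form, weighted_form

# Simulated sample (hypothetical values).
rng = np.random.default_rng(5)
n = 100
x = rng.normal(3.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.8 + 0.5 * (x - 3.0))))   # response probabilities
a = rng.binomial(1, p)
d = np.full(n, 100.0)

imputed_form, weighted_form = doubly_robust_total(d, a, x, y, p)
```

The two forms differ by $\sum_{i \in s} d_i a_i (1/p_i - 1)\{y_i - m(x_i, \tilde\beta)\}$, which is the component of the estimating equation corresponding to the constant in $h_i$ and is therefore zero.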

Doubly robust method (Cont'd)
Properties of $\tilde\theta_{Id}$:
1. Under the assumed response model, $E_R(\tilde\theta_{Id}) = \hat\theta_n$ regardless of the choice of $m(x_i, \beta)$.
2. Under the imputation model, $E_\zeta(\tilde\theta_{Id} - \hat\theta_n) = 0$.
(1) and (2) imply that $\tilde\theta_{Id}$ is doubly robust.

Doubly robust method (Cont'd)
Case 2: $p_i$ unknown ($p_i = p_i(\alpha)$)
Linearization variance estimator:
  Haziza and Rao (2006): linear regression imputation
  Deville (1999), Demnati and Rao (2004) approach: general case

Extensions
Calibration estimators
Davison and Sardy (2007): deterministic linear regression imputation, stratified SRS
Pseudo-empirical likelihood intervals
Other parameters