Survey Data Analysis in Stata




Jeff Pitblado
Associate Director, Statistical Software
StataCorp LP
Stata Conference DC 2009

Outline
1. Types of data
2. Survey data characteristics
3. Variance estimation
4. Estimation for subpopulations
5. Summary

Why survey data?

Collecting data can be expensive and time consuming.
Consider how you would collect the following data:
- Smoking habits of teenagers
- Birth weights for expectant mothers with high blood pressure
Using stages of clustered sampling can help cut down on the expense and time.

Types of data

Simple random sample (SRS) data
Observations are "independently" sampled from a data generating process.
- Typical assumption: independent and identically distributed (iid)
- Make inferences about the data generating process
- Sample variability is explained by the statistical model attributed to the data generating process

Standard data
We'll use this term to distinguish this data from survey data.

Types of data

Correlated data
Individuals are assumed not independent.
Cause:
- Observations are taken over time
- Random effects assumptions
- Cluster sampling
Treatment:
- Time-series models
- Longitudinal/panel data models
- cluster() option

Types of data

Survey data
Individuals are sampled from a fixed population according to a survey design.
Distinguishing characteristics:
- Complex nature under which individuals are sampled
- Make inferences about the fixed population
- Sample variability is attributed to the survey design

Types of data

Standard data
Estimation commands for standard data:
- proportion
- regress
We'll refer to these as standard estimation commands.

Survey data
Survey estimation commands are governed by the svy prefix.
- svy: proportion
- svy: regress
svy requires that the data is svyset.

Survey data characteristics

Single-stage syntax
    svyset [psu] [weight] [, strata(varname) fpc(varname)]
- Primary sampling units (PSU)
- Sampling weights: pweight
- Strata
- Finite population correction (FPC)

Survey data characteristics

Sampling unit
An individual or collection of individuals from the population that can be selected for observation.
- Sampling groups of individuals is synonymous with cluster sampling.
- Cluster sampling usually results in inflated variance estimates compared to SRS.

Survey data characteristics

Sampling weight
The reciprocal of the probability for an individual to be sampled.
- Probabilities are derived from the survey design:
  - Sampling units
  - Strata
- Typically considered to be the number of individuals in the population that a sampled individual represents.
- Reduces bias induced by the sampling design.
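To make the definition of a sampling weight concrete, here is a minimal Python sketch. The two strata and their counts are hypothetical, and it assumes simple random sampling within each stratum, so the selection probability is n_h / N_h and the weight is its reciprocal.

```python
# Hypothetical stratified design; the stratum names and counts are made up.
population_sizes = {"urban": 600, "rural": 400}  # N_h, individuals in the population
sample_sizes = {"urban": 30, "rural": 40}        # n_h, individuals sampled by SRS


def sampling_weight(stratum):
    """Reciprocal of the selection probability n_h / N_h, written as N_h / n_h."""
    return population_sizes[stratum] / sample_sizes[stratum]


weights = {h: sampling_weight(h) for h in population_sizes}

# Summing the weights over the sample recovers the population size.
estimated_population = sum(weights[h] * sample_sizes[h] for h in weights)

print(weights)               # weight per sampled individual, by stratum
print(estimated_population)  # total number of individuals represented
```

Each urban respondent stands for 20 people and each rural respondent for 10, which is the "number of individuals represented" reading of the weight.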

Survey data characteristics

Strata
In stratified designs, the population is partitioned into well-defined groups, called strata.
- Sampling units are independently sampled from within each stratum.
- Stratification usually results in smaller variance estimates compared to SRS.

Survey data characteristics

Finite population correction (FPC)
An adjustment applied to the variance due to sampling without replacement.
- Sampling without replacement from a finite population reduces sampling variability.
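The effect of the FPC can be seen in a small numerical sketch. It assumes a simple random sample and the textbook variance of an estimated total, N^2 (1 - n/N) s^2 / n; the data and the population size are hypothetical.

```python
from statistics import variance


def srs_total_variance(sample, population_size, use_fpc=True):
    """Estimated variance of the total N * mean(y) under simple random sampling."""
    n = len(sample)
    s2 = variance(sample)  # sample variance with the n - 1 divisor
    fpc = 1.0 - n / population_size if use_fpc else 1.0
    return population_size ** 2 * fpc * s2 / n


sample = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical observations
N = 20                                # hypothetical population size

with_replacement = srs_total_variance(sample, N, use_fpc=False)
without_replacement = srs_total_variance(sample, N, use_fpc=True)
print(with_replacement, without_replacement)
```

With 5 of 20 units sampled, the correction factor is 0.75, so the variance estimate shrinks by a quarter. The correction matters little when the sampling fraction is small.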

Survey data characteristics

Example: svyset for single-stage designs

Survey data characteristics

Figures (sample sizes by design):
- Population: 1000
- SRS sample: 200
- Cluster sample: 20 (208 obs)
- Stratified sample: 198
- Stratified-cluster sample: 20 (215 obs)

Survey data characteristics

Multistage syntax
    svyset psu [weight] [, strata(varname) fpc(varname)]
        [|| ssu [, strata(varname) fpc(varname)]]
        [|| ssu [, strata(varname) fpc(varname)]] ...
- Stages are delimited by ||
- SSU: secondary/subsequent sampling units
- FPC is required at stage s for stage s + 1 to play a role in the linearized variance estimator

Survey data characteristics

Poststratification
A method for adjusting sampling weights, usually to account for underrepresented groups in the population.
- Adjusts weights to sum to the poststratum sizes in the population
- Reduces bias due to nonresponse and underrepresented groups
- Can result in smaller variance estimates

Syntax
    svyset ... poststrata(varname) postweight(varname)
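The weight adjustment behind poststratification can be sketched in a few lines of Python. This is an illustration of the idea only, not Stata's implementation: each weight is rescaled so that the weights in a poststratum sum to that poststratum's known population size. The groups and sizes are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, groups, poststratum_sizes):
    """Rescale weights so they sum to the known size of each poststratum."""
    weight_totals = defaultdict(float)
    for w, g in zip(weights, groups):
        weight_totals[g] += w
    return [w * poststratum_sizes[g] / weight_totals[g]
            for w, g in zip(weights, groups)]


# Hypothetical sample: equal design weights, but group "f" is underrepresented.
weights = [10.0, 10.0, 10.0, 10.0, 10.0]
groups = ["f", "f", "m", "m", "m"]
poststratum_sizes = {"f": 30, "m": 20}  # hypothetical known population counts

adjusted = poststratify(weights, groups, poststratum_sizes)
print(adjusted)
```

The two "f" respondents are weighted up and the three "m" respondents are weighted down, while the adjusted weights still sum to the population size of 50.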

Survey data characteristics

Example: svyset for poststratification

Strata with a single sampling unit

Big problem for variance estimation
- Consider a sample with only 1 observation
- svy reports missing standard error estimates by default

Finding these lonely sampling units
Use svydes:
- Describes the strata and sampling units
- Helps find strata with a single sampling unit
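The check that svydes supports can be mimicked outside Stata. The sketch below is a hypothetical Python helper, not svydes itself: it counts distinct sampling units per stratum and reports the strata that have only one.

```python
from collections import defaultdict


def lonely_strata(design):
    """Return the strata that contain exactly one distinct sampling unit."""
    units_by_stratum = defaultdict(set)
    for stratum, psu in design:
        units_by_stratum[stratum].add(psu)
    return sorted(h for h, units in units_by_stratum.items() if len(units) == 1)


# Hypothetical (stratum, PSU) pairs, one per observation.
design = [(1, "a"), (1, "b"), (2, "c"), (2, "c"), (3, "d"), (3, "e")]

lonely = lonely_strata(design)
print(lonely)
```

Stratum 2 has two observations but only one PSU, so a between-PSU variance cannot be computed for it.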

Strata with a single sampling unit

Example: svydes

Strata with a single sampling unit

Handling lonely sampling units
1. Drop them from the estimation sample.
2. svyset one of the ad-hoc adjustments in the singleunit() option.
3. Somehow combine them with other strata.

Certainty units

Sampling units that are guaranteed to be chosen by the design.
- Certainty units are handled by treating each one as its own stratum with an FPC of 1.

Variance estimation

Stata has three variance estimation methods for survey data:
- Linearization
- Balanced repeated replication
- The jackknife

Variance estimation

Linearization
A method for deriving a variance estimator using a first order Taylor approximation of the point estimator of interest.
- Foundation: Variance of the total estimator
- Delta method
- Huber/White/robust/sandwich estimator

Syntax
    svyset ... [vce(linearized)]

Variance estimation

Total estimator, stratified two-stage design
$y_{hijk}$: observed value from a sampled individual
- Strata: $h = 1, \dots, L$
- PSU: $i = 1, \dots, n_h$
- SSU: $j = 1, \dots, m_{hi}$
- Individual: $k = 1, \dots, m_{hij}$

\[ \hat{Y} = \sum_{h,i,j,k} w_{hijk}\, y_{hijk} \]

\[ \hat{V}(\hat{Y}) = \sum_h (1 - f_h) \frac{n_h}{n_h - 1} \sum_i (y_{hi} - \bar{y}_h)^2 + \sum_h f_h \sum_i (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_j (y_{hij} - \bar{y}_{hi})^2 \]

Here $y_{hi}$ and $y_{hij}$ are the weighted totals for PSU $(h,i)$ and SSU $(h,i,j)$, $\bar{y}_h$ and $\bar{y}_{hi}$ are their means within stratum $h$ and within PSU $(h,i)$, and $f_h$ and $f_{hi}$ are the first-stage and second-stage sampling fractions.
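As a rough numerical illustration of the total estimator, the Python sketch below computes the weighted total and only the first-stage term of its linearized variance, that is, the stratified single-stage case. The function name, the record layout, and the data are hypothetical, and a missing sampling fraction is treated as zero (sampling with replacement).

```python
from collections import defaultdict


def linearized_total_variance(records, fpc=None):
    """Weighted total and its first-stage linearized variance.

    records: iterable of (stratum, psu, weight, y)
    fpc: optional dict mapping stratum -> sampling fraction f_h
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * y  # weighted PSU total y_hi

    total, var = 0.0, 0.0
    for stratum, totals in psu_totals.items():
        values = list(totals.values())
        n_h = len(values)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} has a single sampling unit")
        mean_h = sum(values) / n_h
        f_h = fpc.get(stratum, 0.0) if fpc else 0.0
        ss = sum((v - mean_h) ** 2 for v in values)
        var += (1.0 - f_h) * n_h / (n_h - 1) * ss
        total += sum(values)
    return total, var


# Hypothetical data: two strata, two PSUs each.
records = [
    ("A", 1, 2.0, 1.0), ("A", 1, 2.0, 2.0),
    ("A", 2, 2.0, 3.0), ("A", 2, 2.0, 4.0),
    ("B", 1, 5.0, 2.0), ("B", 2, 5.0, 4.0),
]

total, var = linearized_total_variance(records)
total_fpc, var_fpc = linearized_total_variance(records, fpc={"A": 0.5})
print(total, var, var_fpc)
```

The sketch raises an error for a stratum with one PSU because the n_h - 1 divisor is zero there, which is the same situation that makes svy report missing standard errors.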

Variance estimation

Example: svy: total

Variance estimation

Linearized variance for regression models
- Model is fit using estimating equations.
- $\hat{G}()$ is a total estimator; use a Taylor expansion to get $\hat{V}(\hat\beta)$.

\[ \hat{G}(\beta) = \sum_j w_j s_j x_j = 0 \]

\[ \hat{V}(\hat\beta) = D\, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat\beta}\, D' \]
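The simplest instance of this construction is the weighted mean, which solves the estimating equation sum_j w_j (y_j - mu) = 0. The Python sketch below works under stated assumptions: no strata, each observation is its own PSU, no FPC, and D is taken to be the inverse of the derivative of G (the usual sandwich form; the slide does not define D). The data are hypothetical.

```python
def linearized_mean_variance(y, w):
    """Weighted mean and its linearized (sandwich) variance estimate."""
    n = len(y)
    total_weight = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / total_weight  # solves G(mu) = 0

    # Weighted scores z_j = w_j * (y_j - mu); their sum is G(mu_hat) = 0.
    z = [wi * (yi - mu) for wi, yi in zip(w, y)]
    z_bar = sum(z) / n

    # Variance of the total estimator G, treating observations as PSUs.
    v_g = n / (n - 1) * sum((zj - z_bar) ** 2 for zj in z)

    # D is the inverse of dG/dmu = -sum(w); the sign cancels in D * V * D.
    d = 1.0 / total_weight
    return mu, d * v_g * d


y = [1.0, 2.0, 3.0, 6.0]  # hypothetical outcomes
w = [1.0, 1.0, 2.0, 2.0]  # hypothetical sampling weights

mu_hat, var_mu = linearized_mean_variance(y, w)
print(mu_hat, var_mu)
```

For a regression model the same three steps apply with vector-valued scores, so D becomes a matrix and the product D V D' is the familiar sandwich.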

Variance estimation

Example: svy: logit

Variance estimation

Balanced repeated replication
For designs with two PSUs in each of $L$ strata.
- Compute replicates by dropping a PSU from each stratum.
- Find a balanced subset of the $2^L$ replicates: $L \le r < L + 4$.
- The replicates are used to estimate the variance.

Syntax
    svyset ... vce(brr) [mse]

Variance estimation

BRR variance formulas
- $\hat\theta$: point estimates
- $\hat\theta_{(i)}$: $i$th replicate of the point estimates
- $\bar\theta_{(.)}$: average of the replicates

Default variance formula:
\[ \hat{V}(\hat\theta) = \frac{1}{r} \sum_{i=1}^{r} \{\hat\theta_{(i)} - \bar\theta_{(.)}\}\{\hat\theta_{(i)} - \bar\theta_{(.)}\}' \]

Mean squared error (MSE) formula:
\[ \hat{V}(\hat\theta) = \frac{1}{r} \sum_{i=1}^{r} \{\hat\theta_{(i)} - \hat\theta\}\{\hat\theta_{(i)} - \hat\theta\}' \]
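The BRR formulas can be traced by hand on a tiny design. The Python sketch below is an illustration under stated assumptions: three strata with two PSUs each, a 4 x 4 Hadamard matrix to choose the balanced half-samples, and the weight of each retained PSU doubled (the standard BRR adjustment, which the slides do not spell out). The PSU totals are hypothetical.

```python
# Hypothetical weighted PSU totals: two PSUs in each of L = 3 strata.
psu_totals = [(10.0, 14.0), (20.0, 26.0), (5.0, 3.0)]

# 4 x 4 Hadamard matrix; rows are replicates, columns 1..3 are assigned to strata.
HADAMARD = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]


def brr_replicates(psu_totals, hadamard):
    """Replicate totals: keep one PSU per stratum and double its weight."""
    replicates = []
    for row in hadamard:
        estimate = 0.0
        for h, (first, second) in enumerate(psu_totals):
            kept = first if row[h + 1] == 1 else second
            estimate += 2.0 * kept
        replicates.append(estimate)
    return replicates


def brr_variance(replicates, point_estimate=None):
    """Default formula (centered at the replicate mean) or MSE formula."""
    r = len(replicates)
    center = sum(replicates) / r if point_estimate is None else point_estimate
    return sum((t - center) ** 2 for t in replicates) / r


theta_hat = sum(a + b for a, b in psu_totals)
replicates = brr_replicates(psu_totals, HADAMARD)
v_default = brr_variance(replicates)
v_mse = brr_variance(replicates, theta_hat)
print(replicates, v_default, v_mse)
```

For a total, the two formulas agree because the replicate average equals the point estimate; for nonlinear estimators the MSE version is never smaller than the default.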

Variance estimation

Example: svy brr: logit

Variance estimation

The jackknife
A replication method for variance estimation.
- Not restricted to a specific survey design.
- Delete-1 jackknife: drop 1 PSU
- Delete-k jackknife: drop k PSUs within a stratum

Syntax
    svyset ... vce(jackknife) [mse]

Variance estimation

Jackknife variance formulas
- $\hat\theta_{(h,i)}$: replicate of the point estimates from stratum $h$, PSU $i$
- $\bar\theta_h$: average of the replicates from stratum $h$
- $m_h = (n_h - 1)/n_h$: delete-1 multiplier for stratum $h$

Default variance formula:
\[ \hat{V}(\hat\theta) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat\theta_{(h,i)} - \bar\theta_h\}\{\hat\theta_{(h,i)} - \bar\theta_h\}' \]

Mean squared error (MSE) formula:
\[ \hat{V}(\hat\theta) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat\theta_{(h,i)} - \hat\theta\}\{\hat\theta_{(h,i)} - \hat\theta\}' \]
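A delete-1 jackknife for a total can be sketched as follows. This Python illustration assumes that when a PSU is dropped, the remaining PSUs in its stratum are reweighted by n_h / (n_h - 1), and that a missing sampling fraction means no FPC. The function name and the PSU totals are hypothetical.

```python
def jackknife_total_variance(psu_totals, fpc=None, mse=False):
    """Delete-1 jackknife variance of a total.

    psu_totals: dict mapping stratum -> list of weighted PSU totals
    fpc: optional dict mapping stratum -> sampling fraction f_h
    """
    theta_hat = sum(sum(totals) for totals in psu_totals.values())
    var = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        stratum_total = sum(totals)
        # Drop one PSU and scale up the rest of the stratum to compensate.
        replicates = [
            theta_hat - stratum_total + n_h / (n_h - 1) * (stratum_total - y)
            for y in totals
        ]
        center = theta_hat if mse else sum(replicates) / n_h
        f_h = fpc.get(stratum, 0.0) if fpc else 0.0
        m_h = (n_h - 1) / n_h  # delete-1 multiplier
        var += (1.0 - f_h) * m_h * sum((t - center) ** 2 for t in replicates)
    return var


# Hypothetical weighted PSU totals; the strata need not have two PSUs each.
psu_totals = {"a": [10.0, 14.0], "b": [20.0, 26.0, 32.0]}

v_jackknife = jackknife_total_variance(psu_totals)
v_jackknife_mse = jackknife_total_variance(psu_totals, mse=True)
print(v_jackknife, v_jackknife_mse)
```

Unlike BRR, nothing here requires exactly two PSUs per stratum, which is the sense in which the jackknife is not restricted to a specific design.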

Variance estimation

Example: svy jackknife: logit

Variance estimation

Replicate weight variable
A variable in the dataset that contains sampling weight values that were adjusted for resampling the data using BRR or the jackknife.
- Typically used to protect the privacy of the survey participants.
- Eliminates the need to svyset the strata and PSU variables.

Syntax
    svyset ... brrweight(varlist)
    svyset ... jkrweight(varlist [, ... multiplier(#)])

Estimation for subpopulations

Focus on a subset of the population

Subpopulation variance estimation:
- Assumes the same survey design for subsequent data collection.
- The subpop() option.

Restricted-sample variance estimation:
- Assumes the identified subset for subsequent data collection.
- Ignores the fact that the sample size is a random quantity.
- The if and in restrictions.

Estimation for subpopulations

Total from SRS data
Data is $y_1, \dots, y_n$ and $S$ is the subset of observations.

\[ \delta_j(S) = \begin{cases} 1, & \text{if } j \in S \\ 0, & \text{otherwise} \end{cases} \]

Subpopulation (or restricted-sample) total:
\[ \hat{Y}_S = \sum_{j=1}^{n} \delta_j(S)\, w_j y_j \]

Sampling weight and subpopulation size:
\[ w_j = \frac{N}{n}, \qquad \hat{N}_S = \sum_{j=1}^{n} \delta_j(S)\, w_j = \frac{N}{n}\, n_S \]

Estimation for subpopulations

Variance of a subpopulation total
Sample $n$ without replacement from a population comprised of the $N_S$ subpopulation values with $N - N_S$ additional zeroes.

\[ \hat{V}(\hat{Y}_S) = \left(1 - \frac{n}{N}\right) \frac{n}{n - 1} \sum_{j=1}^{n} \left\{ \delta_j(S)\, w_j y_j - \frac{1}{n} \hat{Y}_S \right\}^2 \]

Variance of a restricted-sample total
Sample $n_S$ without replacement from the subpopulation of $N_S$ values.

\[ \tilde{V}(\hat{Y}_S) = \left(1 - \frac{n_S}{\hat{N}_S}\right) \frac{n_S}{n_S - 1} \sum_{j=1}^{n} \delta_j(S) \left\{ w_j y_j - \frac{1}{n_S} \hat{Y}_S \right\}^2 \]
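The difference between the two variance estimators shows up clearly on a small SRS example. The Python sketch below is an illustration with hypothetical data; it assumes the common weight N / n and applies that weight to each observation inside the squared deviations.

```python
def subpop_total_variance(y, in_subpop, N):
    """Subpopulation estimator: keep all n observations, zero out non-members."""
    n = len(y)
    w = N / n
    z = [w * yj if member else 0.0 for yj, member in zip(y, in_subpop)]
    total = sum(z)
    ss = sum((zj - total / n) ** 2 for zj in z)
    return total, (1.0 - n / N) * n / (n - 1) * ss


def restricted_total_variance(y, in_subpop, N):
    """Restricted-sample estimator: treat the n_S members as the whole sample."""
    n = len(y)
    w = N / n
    z = [w * yj for yj, member in zip(y, in_subpop) if member]
    n_s = len(z)
    n_hat_s = w * n_s  # estimated subpopulation size
    total = sum(z)
    ss = sum((zj - total / n_s) ** 2 for zj in z)
    return total, (1.0 - n_s / n_hat_s) * n_s / (n_s - 1) * ss


# Hypothetical SRS of n = 5 from N = 50; two observations fall in the subpopulation.
y = [2.0, 4.0, 6.0, 8.0, 10.0]
in_subpop = [True, False, True, False, False]

total_sub, v_sub = subpop_total_variance(y, in_subpop, 50)
total_res, v_res = restricted_total_variance(y, in_subpop, 50)
print(total_sub, v_sub, v_res)
```

Both approaches give the same point estimate, but the restricted-sample variance is much smaller here because it treats the subpopulation sample size as fixed rather than random.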

Estimation for subpopulations

Example: svy, subpop()

Summary
1. Use svyset to specify the survey design for your data.
2. Use svydes to find strata with a single PSU.
3. Choose your variance estimation method; you can svyset it.
4. Use the svy prefix with estimation commands.
5. Use subpop() instead of if and in.

References
Levy, P. and S. Lemeshow. 1999. Sampling of Populations. 3rd ed. New York: Wiley.
StataCorp. 2009. Survey Data Reference Manual: Release 11. College Station, TX: StataCorp LP.