Survey Analysis: Options for Missing Data


Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD

Abstract

A common situation researchers working with survey data face is the analysis of missing data, often due to nonresponse. In addition to missing values for analysis variables, SAS excludes observations if the weight or any of the design variables (strata, cluster, domain) has a missing value. This paper discusses two options available with the SAS survey procedures (e.g. SURVEYFREQ, SURVEYMEANS): the MISSING option and the NOMCAR option. The MISSING option is used with categorical variables to instruct SAS to treat missing values as a valid category. The NOMCAR option (new with version 9.2) is used when the default assumption that missing values for analysis variables are missing completely at random (i.e. the group of non-respondents does not differ in any relevant respect from the group of respondents) is not appropriate. Use of the NOMCAR option instructs SAS to perform a domain analysis of missing and non-missing values. Specific examples will be used to illustrate the effect of these two options on variance estimation and the computation of confidence limits.

Introduction

A useful starting point for the discussion of missing data in this paper is the following text from the SAS documentation section on Missing Values for PROC SURVEYMEANS:

(1) By default, when computing statistics for an analysis variable, PROC SURVEYMEANS omits observations with missing values for that variable. The procedure computes statistics for each variable based only on observations that have nonmissing values for that variable. This treatment is based on the assumption that the missing values are missing completely at random (MCAR). However, this assumption is sometimes not true. For example, evidence from other surveys might suggest that observations with missing values are systematically different from observations without missing values.
If you believe that missing values are not missing completely at random, then you can specify the NOMCAR option to let variance estimation include these observations with missing values in the analysis variables.

For the analysis of complex surveys another factor comes into play: the omission of observations with missing values potentially removes important information about the design properties of the survey, e.g. strata and cluster information. We will see, in the discussion of an example from the Medical Expenditure Panel Survey (MEPS) below, that the NOMCAR option can be an alternative to a DOMAIN analysis, which is often recommended instead of restricting analyses beforehand to target subpopulations. The effect of using the NOMCAR option is given in (2).

(2) When the NOMCAR option is used, the procedure treats observations with and without missing values for analysis variables as two different domains, and it performs a domain analysis in the domain of nonmissing observations.

Although SAS 9.2 includes options for replication methods of variance estimation (BRR, Jackknife), the NOMCAR option only applies to the default Taylor series method. Note also the reference in (2) to analysis variables. In contrast, the MISSING option affects categorical variables. The text in (3) is from the SAS 9.2 documentation for PROC SURVEYMEANS.

(3) [The MISSING option] treats missing values as a valid (nonmissing) category for all categorical variables, which includes CLASS, STRATA, CLUSTER, and DOMAIN variables.

By default, if you do not specify the MISSING option, an observation is excluded from the analysis if it has a missing value.

Note that SAS' characterization of a variable as categorical is based on its use on one of the listed statements (e.g. DOMAIN), and not on the variable's values or range of values.

The rest of this paper consists of three examples. Example 1 shows the effect of the NOMCAR option with a simple stratified sample with missing data for the analysis variable. Example 2 shows the effect of the MISSING option for a similar stratified sample with missing values for a categorical variable used in the DOMAIN statement. Example 3 uses a more complex example to compare the effect of using the NOMCAR option with a DOMAIN analysis based on missing values for an analysis variable.

The goal of this paper is to illustrate some of the effects you will observe when using the NOMCAR and MISSING options. This paper is by no means an exhaustive discussion of the topic, nor does it advise you when it is appropriate to use or not use these options. Often this is determined solely by the design properties of the survey data you are analyzing and/or your research goals. A discussion of all the different design and analytic factors to consider is beyond the scope of this paper. But the examples discussed below should give you a concrete sense of the use of these options, as well as specific questions to consider when weighing their use.

Example 1 (Spending on Ice Cream by Grade Level)

This example is taken directly from the SAS 9.2 documentation for PROC SURVEYMEANS (Example 85.4, Analyzing Survey Data with Missing Values). In this example students from three grades (7, 8, and 9) are sampled with respect to spending for ice cream (you can see a more user-friendly formatting of the ICECREAM data set, sorted by GRADE and SPENDING, in Appendix A). The value of WEIGHT is assigned as the inverse of the probability of selection (1/PROB).
For each grade, PROB is defined as the ratio of the number sampled to the total number of students. Not shown here is a separate data set (STUDENTTOTALS) which has the total number of students for each grade, i.e. the population totals for each stratum.

(4) DATA ICECREAM;
      INPUT GRADE SPENDING @@;
      IF GRADE = 7 THEN PROB = 20/1824;
      IF GRADE = 8 THEN PROB = 9/1025;
      IF GRADE = 9 THEN PROB = 11/1151;
      WEIGHT = 1/PROB;
      DATALINES;
    ;

For comparison purposes we will first show output for the SURVEYMEANS code shown in (5). Here the mean and sum are requested. Although not germane to the missing data issues discussed here, the STUDENTTOTALS data set is used to compute a finite population correction for variance estimation (it is included here to maintain consistency with the SAS documentation example). In the code in (5), GRADE is the stratification variable, SPENDING the analysis variable, and WEIGHT the weight variable. The LIST option on the STRATA statement requests a Stratum Information table as part of the procedure output.
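The STUDENTTOTALS data set is not reproduced in this paper. As a rough sketch, assuming it follows the layout that the TOTAL= option of PROC SURVEYMEANS expects (one row per stratum, identified by the STRATA variable, with the population count in a variable named _TOTAL_), it could be built from the grade totals that already appear in the PROB assignments in (4). The PROC MEANS step is an optional check added here for illustration and is not part of the SAS documentation example.

```sas
/* Sketch only: the layout is an assumption; the totals are the */
/* denominators of the PROB assignments in (4).                 */
DATA STUDENTTOTALS;
  INPUT GRADE _TOTAL_;
  DATALINES;
7 1824
8 1025
9 1151
;
RUN;

/* Optional check: because WEIGHT = 1/PROB, the weights within  */
/* each grade should sum to that grade's population total.      */
PROC MEANS DATA=ICECREAM N SUM;
  CLASS GRADE;
  VAR WEIGHT;
RUN;
```

The three totals add up to 1824 + 1025 + 1151 = 4000, which is the Sum of Weights reported in the Data Summary table for this example.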

(5) PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MEAN SUM;
      STRATA GRADE / LIST;
      VAR SPENDING;
      WEIGHT WEIGHT;
    RUN;

The Data Summary table in (6) below lists the number of strata (i.e. grades), the number of observations (cf. the PROC PRINT output in Appendix A), and the weighted sum (i.e. the sum of the population totals for all grades). The Stratum Information table lists descriptive information for each stratum (grade). The N Obs column shows the number sampled and the N column shows the number of observations with non-missing values for the analysis variable SPENDING. Subtracting N from N Obs shows that Grade 7 has 3 missing values and Grades 8 and 9 have 2 missing values each (see Appendix A). The Statistics table shows the requested MEAN and SUM, along with the variance estimate for each. For these estimates the observations with missing values were excluded. As stated, this is the default SAS behavior.

(6) Output for the SURVEYMEANS code in (5).

    Data Summary
    Number of Strata             3
    Number of Observations      40
    Sum of Weights            4000

    Stratum Information
    Stratum Index  GRADE  Population Total  Sampling Rate  N Obs  Variable   N
    1              7      1824              1.10%          20     SPENDING   17
    2              8      1025              0.88%           9     SPENDING    7
    3              9      1151              0.96%          11     SPENDING    9

    Statistics
    Variable   Mean   Std Error of Mean   Sum   Std Dev
    SPENDING

Keeping especially the estimates of variance in mind, we now modify the example by including the NOMCAR option to see its effect.

(7) PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS NOMCAR MEAN SUM;
      STRATA GRADE / LIST;
      VAR SPENDING;
      WEIGHT WEIGHT;
    RUN;

The Data Summary and Stratum Information tables are unchanged from the prior example so they will not be reproduced below. The output in (8) does show a new Variance Estimation table to reflect the inclusion of the NOMCAR option. As stated, this option is specific to the Taylor series method for variance estimation. The table lists this method, as well as the fact that observations with missing values for the analysis variable will be included.

(8) Output for the SURVEYMEANS code in (7)

    Variance Estimation
    Method            Taylor Series
    Missing Values    Included (NOMCAR)

    Statistics
    Variable   Mean   Std Error of Mean   Sum   Std Dev
    SPENDING

Of particular interest here is the difference in the standard error for the mean and the standard deviation for the sum. But first note that the point estimates (MEAN, SUM) are unaffected. It is only the variance estimation which is affected. This is particularly important when variance estimates are used to determine if two point estimates (e.g. the MEAN or SUM in different years) are significantly different. Standard errors and standard deviations tend to be larger when the NOMCAR option is used than when the assumption is made that missing values are missing completely at random. This is certainly the case with the example shown. Therefore the assumption that missing values are not missing completely at random is the more conservative assumption.

Example 2 (Spending on Ice Cream: Domain Analysis, Parent's Education)

This example modifies the input data used in Example 1 by adding a new binary variable (PARENT_ED) which indicates if the student's parent completed high school or college. In addition to values of COLLEGE or HIGHSCHOOL, in this data set the variable also has missing values. The code in (9) is similar to that in (5), except for the inclusion of the DOMAIN statement.
In addition to the Statistics table we saw in Example 1, the use of the DOMAIN statement will generate a Domain Analysis table in the output.

(9) PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MEAN SUM;
      STRATA GRADE / LIST;
      VAR SPENDING;
      DOMAIN PARENT_ED;
      WEIGHT WEIGHT;
    RUN;

In the Statistics table below the overall estimates (Mean, Sum, and their variance estimates) are identical to those we saw for Example 1 when the NOMCAR option was not used. The Domain Analysis table shows these estimates for the subpopulations of students with parents with either a college or high school education. Note that observations are not included if the value of PARENT_ED is missing.

(10) Output for the SURVEYMEANS code in (9)

    Statistics
    Variable   Mean   Std Error of Mean   Sum   Std Dev
    SPENDING

    Domain Analysis: PARENT_ED
    PARENT_ED    Variable   Mean   Std Error of Mean   Sum   Std Dev
    COLLEGE      SPENDING
    HIGHSCHOOL   SPENDING

Below we add the MISSING option in order to include all observations in the data set, including those for students where we don't have information about their parents' education.

(11) PROC SURVEYMEANS DATA= ICECREAM TOTAL=STUDENTTOTALS MISSING MEAN SUM;
       STRATA GRADE / LIST;
       VAR SPENDING;
       DOMAIN PARENT_ED;
       WEIGHT WEIGHT;
     RUN;

(12) Output for the SURVEYMEANS code in (11)

    Statistics
    Variable   Mean   Std Error of Mean   Sum   Std Dev
    SPENDING

    Domain Analysis: PARENT_ED
    PARENT_ED    Variable   Mean   Std Error of Mean   Sum   Std Dev
                 SPENDING
    COLLEGE      SPENDING
    HIGHSCHOOL   SPENDING
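Before choosing between the code in (9) and the code in (11), it can help to see how many observations the MISSING option would bring back into the analysis, and in which strata they fall. One possible check, sketched here with an ordinary PROC FREQ (unweighted counts; this step is an illustration and not part of the original example), tabulates PARENT_ED within GRADE with missing values shown as their own level:

```sas
/* Sketch only: unweighted counts of PARENT_ED by stratum.        */
/* The MISSING option on the TABLES statement shows missing       */
/* PARENT_ED values as a separate level instead of omitting them. */
PROC FREQ DATA=ICECREAM;
  TABLES GRADE*PARENT_ED / MISSING;
RUN;
```

The cell counts for the missing level are the observations that (9) drops and (11) keeps.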

In the Domain Analysis table in (12) we now see three rows for the PARENT_ED domain variable, the first of which is for students whose PARENT_ED value is missing. In addition to seeing the estimates for this subpopulation, we also see that the inclusion of these observations, by changing the total number of observations within each stratum, has also changed the variance estimates. For example, the Std Dev for students whose parents attended college is 3,000 when the observations with missing PARENT_ED values are excluded. But the Std Dev for this same group is 3,363 when those observations are included, i.e. including these observations with missing values yields more conservative estimates of reliability. This difference points to the importance of determining, for the survey analysis you are conducting, whether or not it is appropriate to exclude missing values for categorical variables in generating variances for your estimates. Next we turn to a real-world example using data from the Medical Expenditure Panel Survey.

Example 3 (Hospital Stay Expenses)

The Medical Expenditure Panel Survey (MEPS) is a complex national probability survey of the civilian noninstitutionalized population. Each year MEPS collects healthcare utilization, expenditure and other information for approximately 32,000 individuals. Public use files (PUFs) are released each year. The data in the example discussed below are from the 2006 MEPS Full-Year Consolidated Data file (HC-105), available for download from the Agency for Healthcare Research and Quality's Web site.

In order to use MEPS data for national estimates, person- and family-level weights are developed and released on the annual public-use files. In the example used here the 2006 person-level weight variable PERWT06F is used. In addition, the MEPS sample design includes stratification, clustering, multiple stages of selection, and disproportionate sampling. Because of these complex design properties, it is not appropriate to assume simple random sampling for variance estimation.
To obtain accurate variance estimates, an appropriate technique must be used to derive the standard errors associated with the weighted estimates. Several methods for estimating standard errors for estimates from complex surveys have been developed, including the Taylor-series linearization method, balanced repeated replication, and the jackknife method. The MEPS public use files include variables to obtain weighted estimates and to implement a Taylor-series approach to estimate standard errors for weighted survey estimates. These variables, which jointly reflect the MEPS survey design, include the estimation weight, sampling strata, and the cluster or primary sampling unit (PSU).

Standard errors for MEPS estimates normally require the analytic file to contain all of the MEPS sample persons (e.g., those with positive values for the person weight variable) in order for the analysis to correctly account for the MEPS strata and PSUs. Subsetting to a population of interest (e.g. persons with a particular condition, procedure, or utilization), although normally an efficient programming move, potentially removes important stratification and clustering information from the analysis procedure. Indeed this is often the reason to use a survey procedure such as SURVEYMEANS or SURVEYFREQ rather than their counterparts MEANS and FREQ.

In the examples discussed below the following design variables will be used: PERWT06F (person-level weight variable); VARSTR (stratum variable); VARPSU (PSU, i.e. cluster, variable). The analysis variable is IPFEXP06 (2006 inpatient hospital stay facility expenses).

Consider a situation where you are asked to generate the mean and total person-level expenditures for hospital stays in 2006, but only for persons with hospital stay expenses. This is a typical way to look at average expenditures because the majority of persons will have zero hospital-stay expenses in a given year.
You could remove persons with zero expenses from the analysis by deleting them from the input data set. But this conflicts with the recommendation not to subset in this way because it removes important strata and cluster (PSU) information from the variance estimation calculations. As Machlin et al. (2005) point out,

(12) Analyses are often limited to a subgroup of the population. However, creating a special analysis file that contains only observations for the subgroup of interest may yield incorrect standard errors because all of the observations corresponding to a stage of the MEPS sample design may be deleted. Therefore, it is advisable to preserve the entire survey design structure for the program by reading in the entire person-level file.
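To make the concern concrete, one rough check is to count the strata and PSUs that survive the subsetting and compare them with the counts for the full file. The sketch below is an illustration only and is not taken from the MEPS documentation: the data set name IP2006_SUB is hypothetical, and the CDATA libref is assumed to point at the HC-105 file, as in the code later in this example.

```sas
/* Discouraged approach, shown only for comparison:          */
/* keep just the persons with a hospital stay expense.       */
DATA IP2006_SUB;
  SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
  IF IPFEXP06 > 0;   /* subsetting IF */
RUN;

/* Count distinct strata and stratum-PSU combinations among  */
/* persons with a positive weight, full file vs. subset.     */
PROC SQL;
  SELECT COUNT(DISTINCT VARSTR) AS N_STRATA,
         COUNT(DISTINCT CATX('-', VARSTR, VARPSU)) AS N_PSU
    FROM CDATA.H105
    WHERE PERWT06F > 0;
  SELECT COUNT(DISTINCT VARSTR) AS N_STRATA,
         COUNT(DISTINCT CATX('-', VARSTR, VARPSU)) AS N_PSU
    FROM IP2006_SUB
    WHERE PERWT06F > 0;
QUIT;
```

If the second query returns fewer strata or PSUs, design information has been lost outright. Even when the counts match, the number of sample persons per PSU has changed, so variance estimates computed from the subset can still differ from those of a domain analysis on the full file.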

One apparent alternative is to recode the zero values to missing, as in (13) below, in order to exclude these observations from the analysis. This would indeed exclude those observations since the analysis procedure will omit observations with missing values for the analysis variable. But this would be equivalent to the prior subsetting already discussed.

(13) DATA IP2006M;
       SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
       IF IPFEXP06 = 0 THEN IPFEXP06 = .;
     RUN;

(14) PROC SURVEYMEANS DATA= IP2006M MEAN SUM;
       STRATA VARSTR;
       CLUSTER VARPSU;
       VAR IPFEXP06;
       WEIGHT PERWT06F;
     RUN;

(15) Output for the SURVEYMEANS code in (14)

    Data Summary
    Number of Strata                            203
    Number of Clusters                          451
    Number of Observations                    34145
    Number of Observations Used               32577
    Number of Obs with Nonpositive Weights     1568
    Sum of Weights

    Statistics
    Variable   Label                    Mean   Std Error of Mean   Sum   Std Dev
    IPFEXP06   HOSP FACILITY EXPENSES

As we saw with the previous examples, the Data Summary table contains the basic information for the number of strata, clusters (PSUs), etc. Note that the number of observations used in this table is the number of observations with a positive weight. The sum of the number of observations used and the number of observations with nonpositive weights is the number of observations (32,577 + 1,568 = 34,145).

The Statistics table above shows that, in 2006, the mean per-person hospital stay expense, for those with a stay, is $12,584. The standard error for this estimate is […]. The total expense is $264.9 billion, with a standard deviation of $13.3 billion.

Having seen in (15) the variance estimates when persons with zero expenses (recoded to missing) are excluded from the analysis, we modify the example to include the NOMCAR option. Note that the input data set here still has zero values recoded to missing.

(16) PROC SURVEYMEANS DATA= IP2006M NOMCAR MEAN SUM;
       STRATA VARSTR / LIST;
       CLUSTER VARPSU;
       VAR IPFEXP06;
       WEIGHT PERWT06F;
     RUN;

(17) Output for the SURVEYMEANS code in (16)

    Variance Estimation
    Method            Taylor Series
    Missing Values    Included (NOMCAR)

    Statistics
    Variable   Label                    Mean   Std Error of Mean   Sum   Std Dev
    IPFEXP06   HOSP FACILITY EXPENSES

Again, as we saw in the ice cream example, neither the mean nor the total estimate is affected by the use of the NOMCAR option. But the standard error for the mean and the standard deviation for the sum are both larger. When observations with zero hospital expenses are excluded from the analysis, the standard error is […], but it is […] when these observations are included. Similarly, when the zero-expense records are excluded, the standard deviation is $13.3 billion, but it is $14.2 billion when those observations are included.

As the SAS documentation says, when the NOMCAR option is used, the analysis procedure treats observations with and without missing values for analysis variables as two different domains, and it performs a domain analysis in the domain of nonmissing observations. We can see this explicitly if we consider that, prior to the introduction of the NOMCAR option with version 9.2, the only alternative was to create a domain variable and use the DOMAIN statement to instruct SAS to perform a domain analysis. Consider the domain variable SUBPOP created in (18) and used in (19). Here the zero values for IPFEXP06 have not been recoded to missing, but rather keep their original value.

(18) DATA IP2006;
       SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
       IF IPFEXP06 > 0 THEN SUBPOP = 'WITH EXP';
       ELSE SUBPOP = 'WITHOUT EXP';
     RUN;

(19) PROC SURVEYMEANS DATA= IP2006 MEAN SUM;
       STRATA VARSTR;
       CLUSTER VARPSU;
       VAR IPFEXP06;
       WEIGHT PERWT06F;
       DOMAIN SUBPOP;
     RUN;
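One detail of (18) is worth noting. SAS sets the length of a character variable from its first assignment in the DATA step, so SUBPOP gets a length of 8 (the length of 'WITH EXP') and the value 'WITHOUT EXP' is stored as 'WITHOUT'. This is why the second domain is labeled WITHOUT in the output in (20). The estimates are not affected, since the two domains remain distinct. If full labels are wanted, a LENGTH statement placed before the assignments is one way to get them. The data set name IP2006L below is hypothetical and is used only to keep this variant separate from (18).

```sas
/* Variant of (18) with an explicit length so that            */
/* 'WITHOUT EXP' (11 characters) is not truncated to 8.       */
DATA IP2006L;
  SET CDATA.H105 (KEEP= IPFEXP06 VARSTR VARPSU PERWT06F);
  LENGTH SUBPOP $ 11;
  IF IPFEXP06 > 0 THEN SUBPOP = 'WITH EXP';
  ELSE SUBPOP = 'WITHOUT EXP';
RUN;
```

Running the code in (19) against this data set should give the same estimates as before, with the domain rows labeled WITH EXP and WITHOUT EXP.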

(20) Output for the SURVEYMEANS code in (19)

    Statistics
    Variable   Label                    Mean   Std Error of Mean   Sum   Std Dev
    IPFEXP06   HOSP FACILITY EXPENSES

    Domain Analysis: SUBPOP
    SUBPOP     Variable   Label                    Mean   Std Error of Mean   Sum   Std Dev
    WITH EXP   IPFEXP06   HOSP FACILITY EXPENSES
    WITHOUT    IPFEXP06   HOSP FACILITY EXPENSES

Here the Statistics table gives the estimates for the full population, i.e. persons with and without a hospital stay expense. The Domain Analysis table shows the estimates of interest, i.e. those for persons with an expense (WITH EXP). What is important to note here is that both the standard error and the standard deviation are identical to those produced by use of the NOMCAR option. This follows from the fact that the NOMCAR option is, behind the scenes, performing the domain analysis explicitly coded in (18) and (19).

One potential advantage of using the explicit DOMAIN analysis here is that the output more accurately reflects the input data and the analysis performed. The NOMCAR option, although potentially a useful shortcut, masks both the properties of the input data and the fact that a domain analysis is being performed.

Summary

This paper has illustrated two options of potential use when working with survey data with missing values. The MISSING option overrides SAS' default behavior of excluding observations where the values of a categorical variable are missing. Instead it treats missing values as a valid analysis category. The NOMCAR option is intended for use when the default assumption, that observations where the analysis variable has missing values are missing completely at random, is not justified. This option instructs SAS to perform a domain analysis for observations with and without missing values for the analysis variable.

References

Machlin, S., Yu, W., and Zodet, M. Computing Standard Errors for MEPS Estimates. January 2005. Agency for Healthcare Research and Quality, Rockville, MD.

Acknowledgements

I would like to thank my colleagues at Social & Scientific Systems, Inc. for lots of help with SAS in general and survey analysis in particular.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Contact Information

Paul Gorrell
Social & Scientific Systems, Inc.
Georgia Avenue
Silver Spring, MD

APPENDIX A

ICECREAM DATA SET

    Obs   GRADE   SPENDING        Obs   GRADE   SPENDING