The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data

Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health

ABSTRACT

With recent releases, SAS has become increasingly capable of analyzing survey data: descriptive statistics, crosstabulations, and regression models can be computed using procedures that account for complex sampling schemes and survey designs. The SURVEYFREQ procedure, for instance, is a common method for computing population-based prevalence estimates of health indicators using data from large national and statewide surveys. Although this procedure may seem familiar to users of PROC FREQ, there are important differences between the analysis of survey data and simple random samples. This paper provides an overview of PROC SURVEYFREQ and discusses syntax important for producing weighted estimates of prevalence and standard errors.

INTRODUCTION

Surveys are widely used across the United States and in public health. Survey methods, such as stratification, clustering, and oversampling, allow researchers to gather data from a segment of the population (i.e. a sample) and make generalizations to a larger population (i.e. the target population) with efficiency and precision. At the same time, these aspects of a survey's design affect the computation of prevalence and variance estimates and, therefore, must be accounted for in the analysis of data.

With the release of version 9.2, SAS is more capable than ever before of analyzing survey data using procedures like PROC SURVEYFREQ. Although survey syntax may seem familiar to users of PROC FREQ, there are important differences between the analysis of survey data and simple random samples (SRS). The purpose of this paper is to provide an overview of PROC SURVEYFREQ, paying specific attention to syntax that is important for producing weighted estimates of prevalence and appropriate standard errors.
SURVEY DESIGN 101

Stratification, clustering, and oversampling are common characteristics of surveys, and understanding why these methods are used is helpful for understanding how to account for them in the analysis.

WHY STRATIFY?

A stratified random sample is a sample drawn from a population that has been divided into subgroups. Information on the stratification variable(s) is available beforehand and is used to create a sampling frame, dividing the population into segments. A random sample is then drawn from each subgroup independently of the other groups. Stratified samples are used for several reasons (Lohr, 1999):

• Protect from the possibility of obtaining a poor sample. African Americans, for instance, comprise about 5% of mothers giving birth each year in California. A SRS of 1,000 may very well yield no African American mothers in the sample or too few to obtain meaningful estimates of prevalence by race. In comparison, if a stratified sample is taken and African American mothers are sampled proportionate to their distribution in the population, about 50 African American mothers would be included in the sample.

• Obtain a known precision for subgroups. If only 50 African American mothers were sampled in the scenario above, prevalence estimates for African Americans would not be as precise as those for the other groups sampled. Therefore, instead of sampling African American women proportionate to their distribution in the population, they could be oversampled so that the level of precision for estimates among African Americans is comparable to that among White and Hispanic mothers, who comprise a greater percentage of births.

• Convenience and cost. Different sampling approaches may be used in different strata, which may increase the feasibility of the survey and decrease cost.

• Obtain more precise estimates for the entire population. People from the same group tend to have similar responses or characteristics, and therefore, the variance within strata may be smaller than the variance in the population as a whole.

California's Maternal and Infant Health Assessment (MIHA) is an example of a stratified sample. Conducted by the Maternal, Child and Adolescent Health Program at the California Department of Public Health in collaboration with researchers from the University of California, San Francisco, and modeled after the Centers for Disease Control and Prevention's (CDC) Pregnancy Risk Assessment Monitoring System (PRAMS), MIHA is an annual population-based survey of mothers with a recent live birth who are sampled from birth certificates. Designed to produce a representative sample of live births in the State, the sample is stratified by African American race, high school graduation, and region of California, all of which are available on the birth certificate, with oversampling of African Americans (California Department of Public Health, 2009). MIHA will be used as an example when demonstrating SAS survey procedures below.

WHY CLUSTER?

A stratified sample can only be taken if information about the population and sampling units is readily available, as is the case with birth certificates: all births occurring in California and in the United States are registered with vital statistics, and birth certificate data on maternal and infant characteristics are available to help construct a sampling frame. Often this type of detailed information is not available, making the construction of a sampling frame time consuming, costly, or not feasible. Dividing the population into clusters can address these issues, although there are some trade-offs. Whereas stratifying a sample can reduce the standard errors if the population in each stratum is homogeneous, clustering usually increases the variance if the members of a cluster are alike. Nevertheless, cluster samples are often used for the following reasons (Lohr, 1999):

• A sampling frame is difficult or impossible to construct for the entire population. Suppose a researcher wanted to sample high school students in California. It would be difficult to obtain a list of students from every school to compile a sampling frame, but it would be possible to obtain a list of schools, sample a proportion of them, and then proceed with sampling all students (i.e. one-stage cluster sampling) or a proportion of students (i.e. multi-stage cluster sampling) from each school that agrees to participate.
• The population is widely distributed geographically or clustered. It would be much cheaper to select schools and interview students within these schools than to select students using a SRS. A SRS would result in a sample that was geographically dispersed and that contained only a small number of students per school. This would require more resources for travel and interviewers, and increase the cost of the study.

The Youth Risk Behavior Surveillance System (YRBSS) is an example of a survey that uses cluster sampling. The survey was first conducted in California in the spring of 2009 by the California Department of Public Health, the California Department of Education, and the Public Health Institute in cooperation with the Centers for Disease Control and Prevention (Survey Research Group, 2009). Designed to produce a representative sample of students in 9th through 12th grade, the state-based YRBS employs a two-stage cluster sample in which a stratified sample of schools is taken first, proportionate to enrollment size, and classes are randomly selected second, from which all students are eligible to participate. The national YRBS employs a more complex, three-stage cluster sample in which a stratified sample of counties is selected first, according to enrollment size and other factors, such as urban or rural location. Schools and classrooms are selected next (Centers for Disease Control and Prevention, 2004). YRBS will also be used as an example below.

SPECIFYING YOUR SURVEY'S DESIGN IN SAS

The rules used to select units from a population constitute a survey's sample design and must be accounted for in SAS. Sample design information is provided in the WEIGHT, STRATA, and CLUSTER statements, and in the RATE= option in the PROC SURVEYFREQ statement. These portions of syntax play an important role in the estimation of prevalence, of variance, or of both (SAS Institute Inc., 2009).
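As a rough orientation, the sketch below shows where each piece of design information might sit in a PROC SURVEYFREQ step. The data set name MYSURVEY and the variable names STRATUM, PSU, WEIGHT, and SMOKE are placeholders rather than names taken from MIHA or YRBSS, and SAMPRATE is assumed to be a data set of first-stage sampling rates by stratum.

```sas
/* Sketch only: the data set and variable names below are placeholders. */
proc surveyfreq data = mysurvey     /* analysis data set (hypothetical)            */
                rate = samprate;    /* first-stage sampling rates, used for the FPC */
   strata  stratum;                 /* first-stage strata: used in variance estimation         */
   cluster psu;                     /* first-stage clusters (PSUs): used in variance estimation */
   weight  weight;                  /* analysis weight: produces weighted prevalence estimates  */
   tables  smoke;                   /* variable(s) to tabulate                      */
run;
```

Not every survey needs every statement. A stratified design without clustering would omit CLUSTER, and a design with a negligible sampling fraction would omit RATE=.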
ESTIMATING PREVALENCE: THE WEIGHT STATEMENT

Weights adjust for different components of a survey's design, and it is generally important to know about these aspects of your survey's methods even if the weights have already been calculated for you. In stratified samples, like MIHA, one component of the weight is the inverse of the sampling fraction in each stratum, where the sampling fraction is the ratio of the number of people sampled to the number of people in the target population or sampling frame (n_s / N_s). This weight, often called the sample design weight, can be adjusted for factors like survey non-response (i.e. the tendency of certain groups not to respond to the questionnaire) and noncoverage (i.e. the sampling frame may not be complete at the time the sample is drawn).

Below, PROC SURVEYFREQ is used to calculate the prevalence of smoking during the 3rd trimester of pregnancy in MIHA, weighted to represent the proportion of women with a recent live birth in California who smoked at the end of their pregnancy.

    proc surveyfreq data = miha rate = samprate nomcar;

ESTIMATING VARIANCE

Estimating variance is slightly more complicated than estimating prevalence in SAS: you must consider sampling rates and whether to use a finite population correction (FPC), strata and cluster information, domain analysis, and missing data.

If not otherwise specified, SAS uses the Taylor linearization method to estimate variance. In MIHA and YRBSS, the Taylor linearization method is appropriate and will be used in the rest of the examples. SAS also offers two resampling methods for estimating variance: balanced repeated replication (BRR) and the jackknife. These methods can be requested using the VARMETHOD= option in the PROC SURVEYFREQ statement and the REPWEIGHTS statement. If replicate weights are provided, the STRATA and CLUSTER statements are not necessary.

Variance estimates in SAS are calculated based on the first stage of the sampling process. In a multi-stage cluster sample, such as YRBS, schools might be selected first. In this case, all of the information provided about strata, clusters, and sampling rates should be at the level of the school.

FINITE POPULATION CORRECTION

By default, the Taylor linearization method assumes that the sampling fraction is small, or that the first-stage sample is drawn with replacement, such that the sampling fraction is negligible (i.e. the population is effectively infinite). Sometimes, particularly in stratified designs where the sample is drawn without replacement, the sampling fraction is not small. In strata where the sampling fraction (n_s / N_s) is large, the sample contains more information about the target population, reducing the variance. The finite population correction (FPC) accounts for the extra efficiency gained in these instances. As the sampling rate becomes large (i.e. approaches one), the FPC has a larger impact on the reduction in standard errors.
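To make the size of the correction concrete, consider the textbook case of a simple random sample drawn without replacement within stratum s. This is a simplification of the designs discussed in this paper and is offered only as an illustration. Writing s_s^2 for the sample variance within the stratum (a symbol introduced here, not used elsewhere in the paper), the estimated variance of the stratum mean is:

```latex
\widehat{\mathrm{Var}}(\bar{y}_s) \;=\; \left(1 - \frac{n_s}{N_s}\right)\frac{s_s^2}{n_s}
```

The factor 1 - n_s/N_s is the FPC. With illustrative values n_s = 50 and N_s = 10,000, the factor is 0.995 and the standard error shrinks by only about 0.25%. With n_s = 500 and N_s = 1,000, the factor is 0.5 and the standard error is multiplied by the square root of 0.5, or roughly 0.71.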
The correction is made in SAS using the RATE= option in the PROC SURVEYFREQ statement. In the example below, SAMPRATE is a data set that contains the sampling fraction in each stratum. The stratum number should be located in a variable with the same name as the variable specified in the STRATA statement. Data on the sampling fraction should be located in a variable called _RATE_. Note that you could accomplish the same thing using the TOTAL= option, naming a data set that contains the population totals by stratum (N_s) in a variable called _TOTAL_.

    proc surveyfreq data = miha rate = samprate nomcar;

Note that SAS does not include the FPC in the variance calculation if replicate weights are used.

THE STRATA STATEMENT

The STRATA statement should be used where the sample design is stratified at the first stage of sampling. The STRATA variable represents non-overlapping subgroups that were sampled independently.

    proc surveyfreq data = miha rate = samprate nomcar;
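A minimal sketch of how the pieces in this section might fit together follows. The stratum variable name STRATUM, the weight variable name WEIGHT, the three stratum codes, and the sampling rates shown are all assumptions made for illustration; they are not the MIHA values. MIHA's three stratification variables could be listed together in the STRATA statement or combined into a single stratum code; a single code is assumed here.

```sas
/* Hypothetical sampling rates by stratum; the values are illustrative only. */
/* The stratum variable must carry the same name as in the STRATA statement. */
data samprate;
   input stratum _rate_;
   datalines;
1 0.02
2 0.10
3 0.25
;
run;

proc surveyfreq data = miha rate = samprate nomcar;
   strata stratum;   /* assumed name of the first-stage stratum variable */
   weight weight;    /* assumed name of the analysis weight variable     */
   tables smktri3;   /* smoking during the 3rd trimester                 */
run;
```

If population totals were on hand instead of rates, the same step could use TOTAL= with a data set holding STRATUM and _TOTAL_.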

THE CLUSTER STATEMENT

The CLUSTER statement is used to identify variables that contain information on the first-stage clusters, or primary sampling units (PSUs), in a cluster-sample design such as the YRBSS.

    proc surveyfreq data = yrbss rate = samprate nomcar;
        cluster psu;
        weight weight;
        tables smoke;

DOMAINS AND SUBPOPULATIONS

Domain analysis refers to the computation of statistics for subpopulations (e.g. stratifying your analysis on race). Other SAS procedures (e.g. PROC SURVEYMEANS) and other software (e.g. SUDAAN, STATA) have specific SUBPOP or DOMAIN statements. However, in PROC SURVEYFREQ, to request a domain analysis you must put the domain variable (the variable on which you are stratifying your analysis) in the TABLES statement.

    proc surveyfreq data = miha rate = samprate nomcar;
        by race;

If you stratify your analysis using a BY statement, as in the example above, or subset your analysis using a WHERE statement, the standard errors produced will differ from those produced had you included the variable in the TABLES statement. This has to do with the way SAS processes data. When SAS sees a BY statement, it treats each BY group as a completely separate analysis, virtually on a separate data set. When SAS sees a WHERE statement, the data are subset to exclude all observations that do not meet the condition before the analysis begins. However, standard errors for stratified survey designs are calculated using the total number of individuals in each stratum in the entire sample, n_s, and the same is true of cluster designs. Using BY and WHERE statements therefore excludes individuals who should be contributing to these total counts. Using a data set in PROC SURVEYFREQ that has been subset in a DATA step through an IF or WHERE statement would also be inappropriate.
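One way to avoid subsetting, sketched here under assumptions, is to keep every observation and flag the group of interest with an indicator that is then crossed with the outcome in the TABLES statement. The variable AGE, the cutoff of 20 years, the data set name MIHA_FLAGGED, and the names STRATUM and WEIGHT are hypothetical and used only for illustration.

```sas
/* Flag the subpopulation instead of deleting the other observations. */
data miha_flagged;
   set miha;
   /* AGE is a hypothetical variable; TEEN stays missing when AGE is missing */
   if not missing(age) then teen = (age < 20);
run;

proc surveyfreq data = miha_flagged rate = samprate nomcar;
   strata stratum;             /* assumed first-stage stratum variable */
   weight weight;              /* assumed analysis weight variable     */
   tables teen*smktri3 / row;  /* ROW requests percentages within each level of TEEN */
run;
```

In this sketch the row percentages for TEEN = 1 would be the domain estimates of interest, while the full sample still contributes to the variance calculation.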
    proc surveyfreq data = miha rate = samprate nomcar;
        tables race*smktri3;

ACCOUNTING FOR MISSING DATA IN SAS 9.2

If a variable in the TABLES statement contains missing values, SAS by default assumes these values are missing completely at random and excludes these observations from the data set prior to performing the analysis. This is equivalent to excluding observations with missing values using a WHERE statement, as discussed above, and may result in different estimates of standard error. To request that SAS include observations with missing values in the variance estimation calculations, you must specify the NOMCAR option in the PROC SURVEYFREQ statement, available starting in SAS 9.2. Note that SUDAAN does not exclude missing values from variance estimation (Chen and Gorrell, 2004). Also note that NOMCAR is different from the MISSING option in the PROC SURVEYFREQ statement, which includes missing values both in the table and in the variance estimation. In contrast, NOMCAR includes missing values in the variance estimation, but not in the table, so that you can obtain percentages and corresponding standard errors for non-missing values.

    proc surveyfreq data = miha rate = samprate nomcar;

ACCOUNTING FOR MISSING DATA PRIOR TO SAS 9.2

If you have not upgraded to version 9.2, yet want to account for missing data when estimating the variance around percentages of non-missing values, there is a workaround (Chen and Gorrell, 2004). Unlike PROC SURVEYFREQ, PROC SURVEYMEANS has a DOMAIN statement, which can be used to simulate an analysis where missing values are not assumed to be missing completely at random.

The first step is recoding your data to dummy variables for the variable(s) of interest. In the example below, two dummy variables are created. Assume the original variable for smoking during the 3rd trimester of pregnancy in MIHA, SMKTRI3, has two categories, coded 1 for Yes and 2 for No. In the first dummy variable, values of 1 represent smokers and values of 0 represent everyone else in the sample. In the second dummy variable, values of 1 represent responses with missing data on smoking.

    data surveymeans;
        set miha;
        smktri3_yes = (smktri3 = 1);
        smktri3_missing = (smktri3 = .);
    run;

The mean of the dummy variable for smoking, coded 1 and 0, is the percentage of smokers. By including the dummy variable for missing values in the DOMAIN statement of PROC SURVEYMEANS, the percentage of smokers is calculated among the non-missing values of smoking status. SAS also produces output for the domain of missing values, even though the mean in that domain is not meaningful. Because the missing values are included as a category using the DOMAIN statement, they are not excluded from the variance estimation.

    proc surveymeans data = surveymeans rate = samprate missing;
        var smktri3_yes;
        domain smktri3_missing;

CONCLUSION

Surveys are common and useful: they allow us to study health behaviors and other phenomena with greater precision, and they make data collection more efficient, saving time and money.
Because surveys employ more complex sampling designs than a SRS, accounting for their methods in PROC SURVEYFREQ is more complicated than using PROC FREQ. Considerations when estimating prevalence and variance using PROC SURVEYFREQ have been outlined in this paper in order to help others avoid FREQuent mistakes in survey analysis.

REFERENCES

California Department of Public Health. Maternal and Infant Health Assessment (MIHA) Survey. Available at: http://www.cdph.ca.gov/data/surveys/pages/maternalandinfanthealthassessment(miha)survey.aspx Accessed July 14, 2009.

Centers for Disease Control and Prevention. Methodology of the Youth Risk Behavior Surveillance System. MMWR 2004;53(No. RR-12):[inclusive page numbers].

Chen X, Gorrell P. Variance Estimation With Complex Surveys: Some SAS-SUDAAN Comparisons. Proceedings of the 17th Annual Northeast SAS Users Group Conference, 2004.

Lohr SL. Sampling: Design and Analysis. Pacific Grove, California: Duxbury Press Publishing Company; 1999.

SAS Institute Inc. SAS/STAT 9.2 User's Guide: The SURVEYFREQ Procedure. Available at: http://support.sas.com/documentation/cdl/en/statug/59654/html/default/surveyfreq_toc.htm Accessed July 14, 2009.

Survey Research Group. The California Youth Risk Behavior Surveillance System. Available at: http://www.surveyresearchgroup.org/sub.php?page=projects_yrbs Accessed July 14, 2009.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Kathryn Martin
    Maternal, Child and Adolescent Health Program
    California Department of Public Health
    1615 Capitol Avenue, MS 8304, PO Box 997420
    Sacramento, CA 95899-7420
    E-mail: Katie.Martin@cdph.ca.gov

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.