Statistical and Methodological Issues in the Analysis of Complex Sample Survey Data: Practical Guidance for Trauma Researchers
Journal of Traumatic Stress, Vol. 21, No. 5, October 2008, pp. 440–447 (© 2008)

Brady T. West
Center for Statistical Consultation and Research, University of Michigan, Ann Arbor, MI

Standard methods for the analysis of survey data assume that the data arise from a simple random sample of the target population. In practice, analysts of survey data sets collected from nationally representative probability samples often pay little attention to important properties of the survey data, and standard statistical software procedures do not allow analysts to take these properties of survey data into account. A failure to use more specialized procedures designed for survey data analysis can impact both simple descriptive statistics and estimation of parameters in multivariate models. In this article, the author provides trauma researchers with a practical introduction to specialized methods that have been developed for the analysis of complex sample survey data.

Many analysts of data collected in sample surveys are comfortable with using standard procedures in statistical software packages to analyze the data: Descriptive statistical procedures are used to compute estimates of parameters such as means and proportions for selected variables in specific subgroups being studied, and regression models might be fitted to estimate the relationships between multiple variables. These standard procedures (e.g., the linear regression procedure in the SPSS statistical software program) assume that the data being analyzed arise from a simple random sample of some population of interest, where all sample respondents were randomly selected without replacement from the population, and all respondents had an equal probability of being included in the sample (Cochran, 1977, p. 18).
Simple random samples have many nice statistical properties: Most important, observations on a given variable are assumed to be independent and identically distributed. Assumptions of simple random sampling are often applied to convenience samples or snowball samples without formal probability designs (designs in which a known probability of inclusion in the sample can be assigned to every population unit of analysis). Unfortunately, in the real world, simple random samples are often quite expensive and difficult to collect. When probability samples from a population of interest are being designed, lists of individual people in a population of interest (or sampling frames) are difficult to locate or produce. Even if a sampling frame enumerating all people in a population of interest is available, the costs of transportation and administration required to collect data from each randomly selected person can be extremely high. As a result, survey statisticians responsible for designing probability samples look for easier, more cost-efficient ways to collect samples from populations. These sample designs, although more cost-efficient and administratively convenient, introduce complications for analysis of the survey data. These designs are often referred to as complex sample designs.
In this article, I will introduce readers to issues surrounding the development and analysis of complex samples, including (a) recognizing complex sample designs, (b) the use of sampling weights in analyses of survey data to ensure that computed estimates of desired population parameters are unbiased and representative of the population of interest, (c) the calculation of estimated standard errors for survey estimates that are robust and correctly reflect sampling variability due to the complex design of the sample from which the survey measures were collected, (d) alternative approaches to making inferences based on survey data, (e) the choice of appropriate statistical software to perform unbiased analyses of complex sample survey data, and (f) how to communicate the results of complex sample survey data analyses correctly. All of these points are introduced through an example of analyzing a real survey data set that includes several trauma-related measures: the National Comorbidity Survey-Replication (NCS-R; Kessler et al., 2004). The author wishes to thank the organizers of the 2007 Conference on Innovations in Trauma Research Methods (CITRM) for the opportunity to present the material in this paper at the conference, and two anonymous reviewers for detailed and thoughtful comments on earlier drafts. Correspondence concerning this article should be addressed to: Brady T. West, 30-C Rackham Building, University of Michigan, Ann Arbor, MI [email protected]. © 2008 International Society for Traumatic Stress Studies. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/jts
Complex Sample Designs

When designing complex samples, survey statisticians often stratify (or divide) a population of interest into groups that are similar within and different between (in terms of the features being measured in the survey). This process of stratification, which often relies on all possible geographical divisions of the population for administrative convenience, generally has the favorable statistical property of decreasing the standard errors (or increasing the precision) of statistical estimates computed using the survey data. Survey statisticians then assign codes to each respondent indicating his or her stratum, thus producing a stratum variable in the survey data set (for example, the variable containing these codes in the NCS-R data set analyzed in this article is STR). This stratum variable is used by specialized statistical software procedures to compute the proper (and presumably reduced) variance estimates. Readers should note that the stratum variable will include codes for all possible strata of the population; in other words, strata are not randomly selected from the population of interest, and all strata are included in the design. In addition to stratifying the population of interest, survey statisticians often turn to the sampling of clusters from within the strata; this adds to the complexity of these sample designs. Clusters are groups (or organizations) containing multiple sample elements (or units of analysis in the population, such as trauma patients, if those are the people from whom the survey data will be collected). Examples of clusters might include schools or hospitals, and clusters are nested within sampling strata. For example, after the Oklahoma City bombing, suppose that a survey researcher wished to collect a sample of trauma-related measures from school children in Oklahoma City.
The school districts in the city may have represented the sampling strata (ensuring that all possible districts would contribute to the sample), and individual schools may have represented the sampling clusters within the strata. The researcher could then sample schools from within each stratum, and children from within each sampled school. Clustered samples are often selected for two primary reasons: (a) sampling frames often do not list individual sample elements (e.g., people), but rather clusters of elements (e.g., schools, hospitals); and (b) cluster sampling dramatically reduces the costs of data collection, in that the sampling of clusters within strata (rather than individuals, who might be widely distributed throughout the stratum) facilitates subsequent data collection from samples of individuals within the clusters. There are some statistical trade-offs to cluster sampling, however. Most important, individuals from the same cluster will tend to be similar in terms of the survey measurements of interest, violating the assumption of independent observations for simple random samples. This dependence within a cluster is sometimes referred to as an intraclass correlation. This intraclass correlation complicates procedures for the estimation of population parameters and the calculation of standard errors for the estimates, and generally increases the standard errors of the estimates (in contrast to stratification). In some survey research projects, there are multiple stages of cluster sampling leading to the final selection of a sample element. For example, a complex sample design may feature the random selection of schools from within a stratum (such as a school district), followed by the random selection of classrooms within a school, and followed by the selection of children within the classrooms.
These multiple stages of selection also have an impact on the standard errors of survey estimates, in that the standard errors tend to be increased (meaning precision is decreased) due to the multistage cluster sampling. As a result, survey statisticians also assign codes to individuals identifying the sampling clusters to which they belong, for variance estimation purposes; this leads to the inclusion of a cluster variable in public-use survey data sets, which analysts need to identify in preparation for an analysis (the variable containing cluster codes in the NCS-R data set is SECU). In multistage cluster sample designs, variance in sample statistics between the primary sampling units (or the first-stage clusters that include multiple lower-stage clusters) is of most importance to research analysts, because this variance includes all variance in estimates due to the lower stages of cluster sampling. This is why survey statisticians generally only include a single cluster variable in a survey data set, identifying these first-stage cluster codes. When analyzing data collected from samples with complex designs, it is the responsibility of the data analyst to understand whether a complex design was used, and identify the important variables containing stratum and cluster codes in the survey data set. Analysts are not responsible for designing these complex samples; they are responsible for analyzing the data correctly given the complex sample design that was used, so that variances of survey statistics are computed properly. This makes it essential for analysts to understand whether or not a survey data set contains these design codes, and use the design codes properly when performing statistical analyses. A thorough review of the documentation accompanying a survey data set, with special attention to any sections concerning estimation of variances or estimation of sampling errors, can be quite helpful with this process.
Complex sample designs tend to result in (a) possible unequal probabilities of selection into the sample for individual units of analysis, due to the varying sizes of clusters and strata, which is unlike simple random sampling; (b) lack of independence of individual units within randomly sampled clusters, which is also unlike simple random sampling; and (c) increased precision of estimates due to stratification, but decreased precision of estimates due to the use of clustering (in general). These general properties of a complex design led to the development of a unit-less index known as the design effect (Kish, 1965), which quantifies the net loss in precision of a survey estimate derived from a complex sample relative to a simple random sample of the same size. For means and proportions, the design effect can be written as follows:

Design Effect = Var(estimate)_complex / Var(estimate)_SRS ≈ 1 + (b − 1) × ICC
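The approximation in the design effect formula above can be sketched in a few lines of code. This is a minimal illustration (not from the article); the function names are ours, b is the average cluster size, and ICC is the intraclass correlation within clusters:

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Approximate design effect for a mean or proportion:
    deff = 1 + (b - 1) * ICC."""
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of a simple random sample with the same precision as the
    complex sample of size n."""
    return n / deff


# Example: clusters of 20 respondents with a modest ICC of 0.05
deff = design_effect(20, 0.05)          # deff is approximately 1.95
n_eff = effective_sample_size(1000, deff)
```

Even a small within-cluster correlation nearly doubles the variance here, which is why clustered designs trade precision for cost savings.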
The general form of the design effect is a ratio of the variance of a survey estimate according to the complex design (where the variance is estimated using specialized methods that incorporate stratification and clustering) to the variance of a survey estimate had a simple random sample of the same size been collected. In the cases of means and proportions, b corresponds to the average size of the clusters in the population, whereas ICC corresponds to the intraclass correlation of observations within a cluster. Therefore, a higher within-cluster correlation will tend to result in a larger design effect, or increased variance of the survey estimates relative to simple random sampling. Because of nonzero intraclass correlations within clusters, design effects generally tend to be larger than 1 when cluster sampling defines an aspect of the complex sample design. So there are trade-offs: Complex samples are cost-efficient and administratively convenient, but the precision of survey estimates tends to decrease relative to simple random samples. We now turn to an example of a complex sample survey data set to further introduce these concepts.

The working example for this article considers sample data from the NCS-R, which were collected between the years of 2001 and 2003. These data were collected from a nationally representative sample of English-speaking household residents aged 18 years and older in the coterminous United States. The primary aim of the NCS-R was to build on the original National Comorbidity Survey (Kessler, 1990) by continuing to address psychiatric disorders among adults in the U.S. population with improved methods of measurement. The NCS-R sample design was complex in that it involved stratification of the population and multiple stages of cluster sampling leading to the selection of a sample element (a U.S. adult).
Careful assessment of the technical documentation accompanying large survey data sets like the NCS-R is essential for analysts of survey data. The following passage is taken directly from the NCS-R technical documentation (Collaborative Psychiatric Epidemiology Surveys [CPES] online documentation; Heeringa & Berglund, 2008): Regardless of whether the linearization method or a resampling approach is used, estimation of variances for complex sample survey estimates requires the specification of a sampling error computation model. CPES data analysts who are interested in performing sampling error computations should be aware that the estimation programs identified in the preceding section assume a specific sampling error computation model and will require special sampling error codes. Individual records in the analysis data set must be assigned sampling error codes that identify to the programs the complex structure of the sample (stratification, clustering) and are compatible with the computation algorithms of the various programs. To facilitate the computation of sampling error for statistics based on CPES data, design-specific sampling error codes will be routinely included in all versions of the data set. Although minor recoding may be required to conform to the input requirements of the individual programs, the sampling error codes that are provided should enable analysts to conduct either Taylor Series or Replicated estimation of sampling errors for survey statistics. (Using CPES, Weighting, Section VII.D) This passage stresses the need for analysts to use specialized statistical software procedures when analyzing the NCS-R data. Readers should note the comments about the complex design effects and the need to take the design structure of the NCS-R sample into account when using specialized software to perform analyses of the survey data. We will revisit the NCS-R example in all of the subsequent sections of this article. 
Sampling Weights

Complex sample designs often result in unequal probabilities of selection into the sample for population units of analysis (e.g., people). If certain individuals have a higher probability of being included in a sample than others, population estimates based on that sample design may be biased. In addition, sampled individuals may choose not to respond to a survey, even if incentives are used as a part of the research, and the distribution of a sample may not be in alignment with known population characteristics (e.g., a sample might be 30% men and 70% women, when the population of interest is known to be 50% men and 50% women). These three complications result in the need to use sampling weights when analyzing the survey data, so that these potential sources of bias in the survey estimates are reduced (or even eliminated). Here a very general description of sampling weights is provided; for additional technical information on the computation of sampling weights, interested readers can refer to Kish (1965), Cochran (1977), or Lohr (1999). Computed sampling weights for people who respond to a survey incorporate (a) unequal probability of selection (people with a lower probability of being included get more weight when estimates are computed); (b) nonresponse rates (people belonging to groups that are not as likely to respond to the survey get more weight when estimates are computed); and (c) poststratification factors, or adjustments to match population distributions across certain strata (e.g., men and women). The first two components of a sampling weight are generally computed by taking the inverse of an individual's probability of selection and the inverse of the response rate for a certain group to which an individual belongs, and multiplying the two components together. For example, consider a survey respondent who has a 1/100 chance of being included in a sample, and belongs to a demographic group where 50% of people who were sampled actually responded.
This individual would have a sampling weight of 100 × 2 = 200, meaning that they represent themselves and 199 other people. This weight might then be further adjusted by a poststratification factor to match known population distributions.
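The weight computation described above can be sketched as follows. This is a hedged illustration of the general logic, not code from any survey package; the function name and poststratification argument are ours:

```python
def sampling_weight(p_selection: float, response_rate: float,
                    poststrat_factor: float = 1.0) -> float:
    """Base sampling weight: inverse selection probability times inverse
    response rate, optionally scaled by a poststratification factor."""
    return (1.0 / p_selection) * (1.0 / response_rate) * poststrat_factor


# The respondent from the example: 1/100 selection chance, 50% response rate
w = sampling_weight(1 / 100, 0.50)  # 200.0: self plus 199 similar people
```

In practice the poststratification factor is derived by comparing the weighted sample distribution to known population totals, a step the survey organization performs before releasing the final weight variable.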
The final computed sampling weight is also generally included in a survey data set as a weight variable, in addition to the stratum and cluster variables (the appropriate weight variable in the NCS-R data set for the examples in this article is FINALP2W). The analyst has the responsibility of identifying this final weight variable so that specialized estimation routines in statistical software can incorporate the sampling weights when computing unbiased estimates of population parameters such as means and regression coefficients. A failure to incorporate sampling weights in analyses can result in severely biased estimates of important parameters for a population of interest. In the NCS-R example, a subsample of 5,692 of the originally sampled 9,282 respondents (those indicating a lifetime mental health disorder in addition to a sample of other respondents screening as completely healthy) answered a second set of questions in the NCS-R (Part II) measuring risk factors and additional disorders, including posttraumatic stress disorder. The original NCS-R sampling weights were adjusted for this additional subsampling, and the NCS-R technical documentation explains to analysts that the FINALP2W variable contains these adjusted weights that should be used for all analyses involving Part II respondents.

Variance Estimation

Survey statisticians compute sampling weights for the information provided by sample respondents to enable survey analysts to compute unbiased estimates of population parameters. What this means is that on average (or in expectation), the estimates (or statistics) will equal the population parameter of interest. This makes the estimates unbiased in theory; in practice, different samples from a population will result in different estimates of a population parameter that bounce around the true value of the parameter.
Generally speaking, the standard error of a survey estimate describes how far away an estimate from any one sample tends to be from the expected value of an estimate (the parameter it is estimating), on average. Methods for variance estimation in analyses of complex sample survey data are used to compute estimates of the standard errors of complex sample survey statistics. The need to compute estimates of standard errors often raises questions among researchers. Why not just compute the true value of a variance (and a standard error, which is the square root of the variance) for a given sample estimate of a population parameter? Unfortunately, when analyzing data collected from a sample with a complex design, true values of population parameters are needed to compute the true theoretical variances of sample statistics, and we will never know these true values. As a result, good approximations of the theoretical standard errors based on sample estimates of the required population parameters are used to estimate the standard errors in a robust way that accounts for the stratification, clustering, and sample weighting underlying a given complex design. Survey statisticians research variance estimators that are as close to being unbiased estimators of the true theoretical variances for sample statistics as possible. In general, the estimated variance of a given survey statistic is based on within-stratum variances in sample statistics (such as totals) between first-stage sampling clusters. As a result, at least two clusters are required per stratum for variance estimates to be calculated; otherwise, within-stratum variances (and therefore overall variance estimators) cannot be calculated. If only a single cluster is found in a particular stratum by the software processing the sample design variables, errors or warning messages may appear that could prevent the software from reporting estimated standard errors for survey statistics of interest. 
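The within-stratum, between-cluster computation described above can be sketched for the simplest case, the variance of a weighted total under the with-replacement assumption. This is an illustrative sketch with invented data, not the exact routine used by any particular package:

```python
def variance_of_total(strata):
    """Between-cluster variance estimator for a weighted total.

    strata: list of lists; each inner list holds the weighted totals of the
    first-stage clusters in one stratum. At least two clusters per stratum
    are required, mirroring the software requirement noted in the text.
    """
    var = 0.0
    for cluster_totals in strata:
        n_h = len(cluster_totals)
        if n_h < 2:
            raise ValueError("need at least two clusters per stratum")
        mean_h = sum(cluster_totals) / n_h
        # sum of squared deviations of cluster totals around the stratum mean
        ss = sum((t - mean_h) ** 2 for t in cluster_totals)
        var += n_h / (n_h - 1) * ss
    return var


# Two strata with two clusters each; all variance comes from stratum 1
v = variance_of_total([[10.0, 14.0], [8.0, 8.0]])  # 16.0
```

A stratum with a single cluster raises an error here, just as survey software warns when the design variables leave a "singleton" stratum.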
Survey sampling statisticians therefore do their best to provide users of survey data sets with variables identifying the strata and clusters to which each respondent belongs such that each stratum code has at least two associated cluster codes for variance estimation purposes. Three of the mathematical methods most frequently used to estimate the variances of complex sample survey statistics include Taylor series linearization, jackknife repeated replication, and balanced repeated replication. Taylor series linearization, which is the default variance estimation technique in many statistical software procedures designed for survey data analysis, first linearizes complex, nonlinear statistics (such as ratios of sample totals) into linear functions of sample totals, using results from calculus (Taylor series approximations). Variances of these linearized approximations of the original nonlinear statistics can then be calculated by using simpler known formulas for the variances and covariances of sample totals within strata. Each different survey statistic will have a unique variance estimator based on Taylor series linearization. In contrast, jackknife repeated replication and balanced repeated replication are very general techniques for variance estimation based on resampling, where the same methods are used to estimate variances regardless of the form of the survey statistic. Essentially, replication methods involve drawing repeated samples from the available survey data set, calculating the statistic of interest for each of the repeated samples, and then describing the variance of the repeated statistics around the overall statistic based on the overall sample. Software packages with procedures available for complex sample-survey data analysis are at various stages of incorporating these three techniques; however, Taylor series linearization is generally sufficient for most practical applications. 
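As a rough sketch of how a replication method works, the following delete-one-cluster jackknife estimates the variance of a weighted mean. It deliberately ignores stratification for brevity (production jackknife repeated replication deletes one cluster per stratum and reweights only within that stratum), and all names and data are illustrative:

```python
def weighted_mean(records):
    """records: list of (weight, y) pairs."""
    num = sum(w * y for w, y in records)
    den = sum(w for w, _ in records)
    return num / den


def jackknife_variance(clusters):
    """Delete-one-cluster jackknife for a weighted mean.

    clusters: list of clusters, each a list of (weight, y) pairs.
    """
    c = len(clusters)
    full = weighted_mean([r for cl in clusters for r in cl])
    replicates = []
    for drop in range(c):
        kept = [
            # inflate remaining weights by c / (c - 1) to preserve the
            # total weight when one cluster is removed
            (w * c / (c - 1), y)
            for i, cl in enumerate(clusters) if i != drop
            for w, y in cl
        ]
        replicates.append(weighted_mean(kept))
    # variability of the replicate estimates around the full-sample estimate
    return (c - 1) / c * sum((r - full) ** 2 for r in replicates)
```

Note that the same two functions would serve for any statistic expressible as a weighted mean, which is the sense in which replication methods are more general than statistic-specific linearization formulas.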
These variance estimation methods will tend to result in robust overestimates of what the true theoretical variances should be (keeping in mind that stratification will generally decrease variances while clustering and weighting will increase variances), and this is a better scenario in practice than underestimating the variances, or overstating the precision of the survey estimates. As a result, the methods tend to provide conservative estimates of variances for the survey statistics, protecting the survey analyst against the risk of overstating the precision of survey estimates. The choice of the variance estimation method generally depends on software availability; Taylor series linearization is currently more widely available than the replication techniques. Linearization and jackknife repeated replication are generally more accurate and more stable when the survey statistics represent functions of means, whereas balanced repeated replication has been shown to be better for the estimation of quartiles. As mentioned earlier, Taylor series linearization is generally a reasonable choice for most applications. For more details, interested readers can refer to a simulation study by Kovar, Rao, and Wu (1988) that compares the variance estimation techniques. Analysts have the important responsibility of identifying the appropriate complex design variables in a survey data set (e.g., FINALP2W, STR, and SECU in the NCS-R data set) and using them properly in statistical software procedures designed for these analyses, so that robust estimates of standard errors incorporating the complex design features will be calculated correctly.

Design-Based Versus Model-Based Inference

Generally speaking, statistical inference refers to the art of making statements about a population based on a sample of data collected from that population. Design-based inference refers to an approach to making inferences using survey data that utilizes confidence intervals determined by estimates of survey statistics and their standard errors. The estimates and standard errors are determined using nonparametric methods (without assuming probability distributions for the variables) that incorporate features of the complex sample design and rely on how representative the sample is of the population of interest. Researchers examine the computed confidence intervals (CIs) to determine whether they include the hypothesized value of a population parameter of interest. One would interpret a 95% CI (corresponding to a Type I error rate of .05) for a population parameter by stating that 95% of CIs constructed in the exact same way across repeated samples would include the true population parameter.
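The design-based confidence interval itself is mechanically simple once a robust standard error is in hand; the subtlety lies entirely in the standard error. A minimal sketch, using the normal critical value 1.96 as a simplification (survey software typically uses a t distribution with degrees of freedom based on the numbers of clusters and strata):

```python
def confidence_interval(estimate: float, se: float, z: float = 1.96):
    """Design-based CI: estimate plus or minus critical value times the
    robust (design-adjusted) standard error."""
    return (estimate - z * se, estimate + z * se)


# Illustrative values: a percentage of 4.9 with a robust SE of 0.4
lo, hi = confidence_interval(4.9, 0.4)  # roughly (4.12, 5.68)
```

A hypothesized parameter value lying outside this interval would be rejected at the corresponding Type I error rate; a value inside it would not.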
In contrast, model-based inference relies on assumptions of specific probability distributions for the variables being analyzed. This method of inference is extremely powerful if the assumed probability distributions for the variables are correct. One example of a model-based approach to the analysis of survey data is multilevel (or mixed-effects) modeling, where the effects of clustering are treated as random effects in linear models (e.g., Pfeffermann, Skinner, Holmes, Goldstein, & Rasbash, 1998). This article's primary focus is on methods for design-based inference, or robust, nonparametric inference that directly recognizes the complex design features of a representative probability sample. For further details on these two forms of inference for survey data, interested readers can refer to Hansen, Madow, and Tepping (1983).

Current Software Options

A wide variety of software procedures capable of analyzing survey data are currently available to survey researchers. Considering commercial statistical software packages first, the SAS software (Version 9.1.3; SAS Institute, Cary, NC) currently offers procedures for descriptive analysis of continuous and categorical variables (PROC SURVEYMEANS and PROC SURVEYFREQ), in addition to two procedures for regression modeling (PROC SURVEYREG and PROC SURVEYLOGISTIC). The SPSS software (Version 16; SPSS Inc., Chicago, IL) currently offers the complex samples add-on module, which needs to be purchased in addition to the base SPSS package; this module enables several descriptive analyses, in addition to regression modeling for continuous, categorical, count, and time-to-event outcomes. Here we will consider the complex samples module for the NCS-R analysis.
The Stata software package (Version 10; StataCorp LP, College Station, TX) and the SUDAAN software package (Version 9.0.1; Research Triangle Institute, Research Triangle Park, NC) currently offer the widest variety of procedures for survey data analysis, including survival analysis and estimation of percentiles. Other software packages, including WesVar and IVEware, are freely available for survey data analysis, and the IVEware package is especially useful for missing data problems (which are beyond the scope of this article). Returning to the NCS-R analysis example, we define two research objectives: (a) to estimate the percentage of the adult U.S. population that has ever participated in combat; and (b) to estimate the percentage of older (aged 50 or older) men in the United States who have ever participated in combat. The second analysis objective defines a subclass analysis, and more details on these types of analyses will be provided in the next section. We consider the SPSS software for this example, although this example analysis could be performed using any of the aforementioned statistical software packages. The following NCS-R variables will be considered in the analysis (in addition to the FINALP2W, STR, and SECU variables, representing the sampling weights and the stratum and cluster codes, respectively): PT (Has the respondent ever participated in combat, as a member of a military or organized nonmilitary group?), SC (respondent's age), and SC (respondent's gender). The first step for SPSS users looking to analyze a complex sample survey data set is to purchase the complex samples add-on module; readers can visit the SPSS Web site for more information. Other general-purpose statistical software packages like SAS, Stata, and SUDAAN have procedures for the analysis of complex sample survey data integrated into their base statistical packages. With the NCS-R data set open in SPSS, the next step is to define the complex design features of the NCS-R.
From the SPSS menus, one can select Analyze → Complex Samples → Prepare for Analysis... This allows users to define what is known in SPSS as a plan file, which has an extension of .csaplan. Users first choose the option to create a plan file, and then enter a name for the file before clicking Next to proceed. At this point, users select the variables containing information on the strata, clusters, and sampling weights that are critical for a design-based
analysis of the survey data, and click Next. The standard sampling error calculation model selected in the next screen is WR (or with replacement), which assumes that the first-stage clusters (not the individual sample elements) have been selected with replacement from within the first-stage strata (an assumption that primarily facilitates variance estimation; see Cochran, 1977, for more details). Users then click Finish to finish the definition of the sampling plan. Next, one defines the analysis to be performed. Users can select Analyze → Complex Samples → Frequencies... to request estimates of percentages for categorical variables. One first has to identify the sampling plan file to be used in the analysis, which was created in the previous steps. Next, the variable for which weighted percentages are desired (PT, or the indicator of participation in combat) is selected for frequency tables analysis, and selected statistics (e.g., standard errors, 95% CIs, design effects) can be requested for the output. After these options have been selected, the user can run the analysis procedure by clicking OK (or pasting and running the SPSS syntax). This generates the SPSS output displayed in Figure 1.

[Figure 1. SPSS output from the first combat example, showing that an estimated 4.9% of U.S. adults have ever participated in combat (variable PT = 1).]

The weighted estimate of the percentage of U.S. adults having participated in combat is 4.9% (linearized SE = 0.4%, 95% CI = 4.1%, 5.8%). A null hypothesis that 5% of U.S. adults have participated in combat would not be rejected, but a null hypothesis stating that 10% of adults have participated in combat is not supported by these results.
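The arithmetic behind a weighted percentage, and how it can diverge from the unweighted one, can be sketched as follows. The toy data are invented (not the NCS-R) and the function names are ours:

```python
def weighted_percent(records):
    """records: list of (weight, indicator) with indicator in {0, 1}.
    Returns the weighted percentage with indicator == 1."""
    total_w = sum(w for w, _ in records)
    return 100.0 * sum(w for w, y in records if y == 1) / total_w


def unweighted_percent(records):
    """Naive percentage that ignores the sampling weights."""
    return 100.0 * sum(y for _, y in records) / len(records)


# An under-covered respondent (weight 4) who reports the outcome pulls the
# weighted estimate well above the unweighted one
data = [(4.0, 1), (1.0, 0), (1.0, 0), (1.0, 0), (1.0, 1)]
weighted_percent(data)    # 62.5
unweighted_percent(data)  # 40.0
```

The direction and size of the gap depend on how the outcome correlates with the weights; in the NCS-R combat example above, the unweighted estimate was biased low.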
The 95% CI reflects expected sampling variability in the estimate given the complex design features of the NCS-R sample, and the estimated standard error for the estimate is computed using Taylor series linearization. The unweighted estimate of this percentage, ignoring the complex design features and performing a standard frequency table analysis in SPSS (Analyze → Descriptive Statistics → Frequencies), is 4.3%, which indicates that a more standard analysis would have resulted in an estimate that was biased low. Further, the estimated design effect (DEFF) is 2.1, suggesting that the complex design of the NCS-R sample increased the variance of this estimate by around 110% relative to a simple random sample of the same size (indicating a loss in the precision of the estimate due to the complex design). This finding is quite common for design effects, and is mainly driven by the effects of clustering and weighting, as discussed earlier.

Subclass Analyses

Analysts of survey data are often interested in restricting inferences to a specific subpopulation, or subclass, of the population of interest. For example, one may wish to restrict the estimation of a population parameter to older men. In practice, subclass analyses of complex sample survey data are dangerous because it is very easy for analysts to incorrectly (and permanently) delete the cases that do not fall into the subclass from the data set when performing the analysis. The two primary problems with this type of conditional approach to subclass analysis are (a) the subclass sample size is treated as fixed, when it should actually be treated as a random variable (i.e., the subclass sample size will vary in theory from one sample to another); and (b) deleting cases not falling into the subclass may result in entire first-stage sampling clusters being deleted from the analysis, if the subclass by random chance is not represented in a given cluster in a given sample.
Estimated standard errors of survey statistics should reflect theoretical sample-to-sample variance in the statistics based on the original complex sample design. In other words, across repeated samples, how far off (on average) will estimates be from the true population parameter of interest? If clusters from the original sample design are deleted in a conditional subclass analysis, when in theory the subclass might have appeared in those clusters, standard errors will be underestimated. The statistical software used to analyze the subclasses will not know that the deleted clusters were ever part of the original complex design. In addition, the software will treat the subclass sample size as being fixed from sample to sample when calculating the standard errors, when sample-to-sample variability in the subclass sample sizes should be incorporated into the standard errors. This also leads to underestimation of standard errors. Conditional subclass analyses can also result in the situation where the software only recognizes a single cluster in a given sampling stratum, and different software packages react differently to this problem because within-stratum variance between the clusters cannot be estimated.

To prevent these critical problems that can arise when performing subclass analyses, survey data analysts should perform unconditional subclass analyses. Specialized options in statistical software procedures allow users to define an indicator variable in the survey data set for the subclass of interest, equal to 1 for cases in the subclass, and 0 otherwise. By processing this indicator variable, the software can recognize the full complex design of the sample, treat the subclass sample size as random, and proceed with the estimation of the statistics and standard errors according to the full complex design (clusters containing no subclass elements are still recognized in the variance calculations); interested readers can refer to West, Berglund, and Heeringa (in press) or Cochran (1977) for more details. We consider this unconditional approach by analyzing older (age 50 or above) men in the NCS-R data set, and estimating what proportion of this subclass has ever participated in combat. First, we compute an indicator variable (OLDMALE), equal to 1 for respondents in this subclass, and 0 otherwise.

Figure 2. SPSS output from the application of the combat example to selected population subclasses (men aged 50 or older are represented by OLDMALE = 1), showing that an estimated 21.4% of men 50 years of age or older have ever participated in combat before (as opposed to an estimated 1.5% of other adults).
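The consequence of deleting non-subclass cases can be seen in a small sketch. Below, a with-replacement variance of an estimated subclass total is computed two ways for one stratum of hypothetical cluster totals: unconditionally (all sampled clusters kept, contributing zero where the subclass is absent) and conditionally (the empty cluster incorrectly deleted). The numbers are invented for illustration.

```python
def wr_variance(cluster_totals):
    """With-replacement variance for one stratum:
    n_h / (n_h - 1) * sum((t - tbar)**2) over the cluster totals t."""
    n_h = len(cluster_totals)
    tbar = sum(cluster_totals) / n_h
    return n_h / (n_h - 1) * sum((t - tbar) ** 2 for t in cluster_totals)

# Hypothetical weighted subclass totals for three sampled clusters in one
# stratum; the second cluster happens to contain no subclass members.
unconditional = [2.0, 0.0, 4.0]  # empty cluster kept, contributing 0
conditional = [2.0, 4.0]         # empty cluster (incorrectly) deleted

v_uncond = wr_variance(unconditional)  # 12.0
v_cond = wr_variance(conditional)      # 4.0
print(v_uncond, v_cond)
```

The conditional analysis reports a variance of 4.0 against the correct 12.0: deleting the empty cluster removes a legitimate source of between-cluster variability and treats the subclass sample size as fixed, so the standard error is understated.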
Then, we select Analyze → Complex Samples → Frequencies... to request estimates of the proportion for this specific subclass. We request frequency tables for PT (the indicator of participation in combat), and then move the indicator variable for older men into the subpopulations window. This requests an unconditional subclass analysis for the older men. Running this analysis produces the SPSS output shown in Figure 2.

Note in Figure 2 that an estimated 21.4% of men 50 years of age or older (Linearized SE = 2.0%, 95% CI = 17.6%–25.8%) have ever participated in combat, which is quite different from the estimate of 4.9% for the overall population. There is a larger margin of error for this smaller subclass, which is quite common in subclass analyses; the linearization-based standard error of the estimate in this case is 2.0%, compared to 0.4% for the overall analysis. The unweighted estimate for this subclass is 21.8%, suggesting that there is not a great deal of bias in the unweighted estimate. Further, the design effect for this estimate is 2.3, suggesting a loss of precision relative to a simple random sample of the same size from this subclass. Given statistics of primary interest, survey statisticians work to minimize this design effect when designing complex samples.

Analysts may be interested in statistically comparing selected subclasses in terms of important descriptive parameters (e.g., means or proportions) when analyzing a survey data set. For example, one may wish to compare respondents with combat experience with respondents not having combat experience in terms of the prevalence of PTSD. In this case, a design-based version of the chi-square test could be performed, comparing these two groups in terms of the prevalence of PTSD. The design-based analog of the χ² test is the Rao-Scott χ² test (Rao & Scott, 1984), which has been widely implemented in the different software packages mentioned in this article.
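The idea behind the design-based chi-square test can be sketched in a few lines: the ordinary Pearson statistic is computed from the estimated cell counts and then deflated by a design-effect correction. This is only the simplest first-order Rao-Scott correction with an assumed average design effect; the full procedure estimates design effects for the cell proportions themselves, and the packages mentioned in this article implement further refinements (e.g., Satterthwaite adjustments). The table counts and DEFF below are hypothetical.

```python
# Hypothetical 2 x 2 table: rows = combat experience (no/yes),
# columns = PTSD diagnosis (no/yes), as estimated counts.
observed = [[30.0, 20.0],
            [20.0, 30.0]]
n = sum(sum(row) for row in observed)

# Ordinary Pearson chi-square from observed and expected counts.
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
x2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / n
        x2 += (o - e) ** 2 / e

# First-order Rao-Scott correction: deflate by an average design effect
# (assumed value here, purely for illustration).
avg_deff = 2.0
x2_rs = x2 / avg_deff
print(x2, x2_rs)
```

With df = 1 the 5% critical value is 3.84, so for these illustrative numbers the naive statistic (4.0) would reject independence while the design-corrected statistic (2.0) would not: ignoring the complex design can overstate significance.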
In this case, no subclass indicator variable would be required because the two groups of interest would define the entire population being studied. If one were to adjust for covariates when comparing the groups in terms of this prevalence, a logistic regression model should be fitted, and specialized procedures capable of fitting logistic regression models to survey data are also widely implemented in the different statistical software packages. Regardless of the model to be fitted or the subclass comparison of interest, the important point for analysts is that a specialized procedure be used to perform the analysis that can correctly incorporate complex sample design features when generating estimates, standard errors, and test statistics used for making inferences about populations of research interest.

Conclusion

A broad and general introduction to the analysis of complex sample survey data has been presented in this article. Analysts of survey data have several issues to keep in mind before beginning the analysis of a complex sample survey data set:

1. What was the nature of the complex sample design, and are there codes for strata and clusters included in the survey data set?
2. What is the correct sampling weight variable to be used in the analysis for calculating unbiased estimates of population parameters?
3. Is a software procedure being used that can accommodate the complex sample design features in the analysis?
4. What method of variance estimation is the software procedure using to calculate design-based estimates of standard errors (e.g., Taylor series linearization)?
5. Is the analysis of interest focused on a subclass of the population from which the complex sample survey data were collected, and has an indicator variable been computed for this subclass enabling unconditional subclass analyses?
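Several items on the checklist above can be enforced mechanically before any analysis runs. The sketch below assumes a survey data set represented simply as a set of column names; the names used (SESTRAT, SECLUSTR, FINALWGT, OLDMALE) are hypothetical placeholders, not the actual NCS-R variable names.

```python
def check_design_variables(columns, strata, cluster, weight, subclass=None):
    """Verify that the design variables (and optional subclass indicator)
    needed for a design-based analysis are present in the data set."""
    required = [strata, cluster, weight]
    if subclass is not None:
        required.append(subclass)
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing design variables: {missing}")
    return True

# Hypothetical column listing for a survey data set.
cols = {"SESTRAT", "SECLUSTR", "FINALWGT", "PT", "OLDMALE"}
ok = check_design_variables(cols, "SESTRAT", "SECLUSTR", "FINALWGT",
                            subclass="OLDMALE")  # passes silently
```

A check like this catches checklist items 1, 2, and 5 before any estimates are produced, rather than after a misspecified analysis has silently defaulted to simple-random-sample assumptions.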
Example analyses in this article were presented using the complex samples module of the SPSS statistical software (Version 16), but the analyses can be easily repeated using specialized procedures for survey data analysis in several other statistical software packages (e.g., SAS, Stata, and SUDAAN). The author can be contacted for examples of code that can be used to perform the analyses in these other software packages.

The scientific design of cost-efficient probability samples and the collection of survey data from the resulting samples in survey research projects like the NCS-R cost granting agencies (and taxpayers) millions of dollars each year. A failure of data analysts to use appropriate statistical software procedures for unbiased analyses of the survey data essentially negates the substantial money and effort used to collect the data, and can result in scientific publications that present skewed pictures of populations of research interest. This presentation has attempted to clarify these issues for analysts of complex sample survey data in a relatively heuristic and practical manner. Several additional references can be consulted for additional technical details on these analysis approaches, including a practical handbook by Lee and Forthofer (2006), and theoretical articles on variance estimation by Rust (1985) and Binder (1983).

REFERENCES

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279–292.

Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.

Hansen, M., Madow, W., & Tepping, B. (1983). An evaluation of model-dependent and probability-sampling inferences in sample surveys. Journal of the American Statistical Association, 78, 776–793.

Heeringa, S. G., & Berglund, P. (2008). National Institutes of Mental Health (NIMH) Collaborative Psychiatric Epidemiology Survey Program (CPES) data set: Integrated weights and sampling error codes for design-based analysis. Retrieved June 2008, from cpes/using.xml?section=weighting#VII.+Procedures+for+Sampling+Error+Estimation+in+Design-based+Analysis+of+the+CPES+Data

Kessler, R. C. (1990). National Comorbidity Survey: Baseline (NCS-1) [Data file]. Conducted by University of Michigan, Survey Research Center (ICPSR06693-v4). Ann Arbor, MI: Inter-University Consortium for Political and Social Research.

Kessler, R. C., Berglund, P., Chiu, W. T., Demler, O., Heeringa, S., Hiripi, E., et al. (2004). The US National Comorbidity Survey Replication (NCS-R): Design and field procedures. International Journal of Methods in Psychiatric Research, 13, 69–92.

Kish, L. (1965). Survey sampling. New York: Wiley.

Kovar, J. G., Rao, J. N. K., & Wu, C. F. J. (1988). Bootstrap and other methods to measure errors in survey estimates. Canadian Journal of Statistics, 16, 25–45.

Lee, E. S., & Forthofer, R. N. (2006). Analyzing complex survey data (2nd ed.). Quantitative Applications in the Social Sciences, 71. Newbury Park, CA: Sage.

Lohr, S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Duxbury Press.

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society, Series B, 60, 23–40.

Rao, J. N. K., & Scott, A. J. (1984). On chi-squared tests for multi-way tables with cell proportions estimated from survey data. Annals of Statistics, 12, 46–60.

Rust, K. (1985). Variance estimation for complex estimators in sample surveys. Journal of Official Statistics, 1, 381–397.

West, B. T., Berglund, P., & Heeringa, S. G. (in press). A closer examination of subpopulation analysis of complex sample survey data. The Stata Journal.
Selecting Research Participants
C H A P T E R 6 Selecting Research Participants OBJECTIVES After studying this chapter, students should be able to Define the term sampling frame Describe the difference between random sampling and random
Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random
[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator
Biostatistics: Types of Data Analysis
Biostatistics: Types of Data Analysis Theresa A Scott, MS Vanderbilt University Department of Biostatistics [email protected] http://biostat.mc.vanderbilt.edu/theresascott Theresa A Scott, MS
Farm Business Survey - Statistical information
Farm Business Survey - Statistical information Sample representation and design The sample structure of the FBS was re-designed starting from the 2010/11 accounting year. The coverage of the survey is
Inclusion and Exclusion Criteria
Inclusion and Exclusion Criteria Inclusion criteria = attributes of subjects that are essential for their selection to participate. Inclusion criteria function remove the influence of specific confounding
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
ONS Methodology Working Paper Series No 4. Non-probability Survey Sampling in Official Statistics
ONS Methodology Working Paper Series No 4 Non-probability Survey Sampling in Official Statistics Debbie Cooper and Matt Greenaway June 2015 1. Introduction Non-probability sampling is generally avoided
Confidence Intervals for One Standard Deviation Using Standard Deviation
Chapter 640 Confidence Intervals for One Standard Deviation Using Standard Deviation Introduction This routine calculates the sample size necessary to achieve a specified interval width or distance from
Power and sample size in multilevel modeling
Snijders, Tom A.B. Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Volume 3, 1570 1573. Chicester (etc.): Wiley,
p ˆ (sample mean and sample
Chapter 6: Confidence Intervals and Hypothesis Testing When analyzing data, we can t just accept the sample mean or sample proportion as the official mean or proportion. When we estimate the statistics
Sampling. COUN 695 Experimental Design
Sampling COUN 695 Experimental Design Principles of Sampling Procedures are different for quantitative and qualitative research Sampling in quantitative research focuses on representativeness Sampling
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
