Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses
|
|
|
- Shanna Lang
- 9 years ago
- Views:
Transcription
1 Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC Abstract The researchers for the CAHPS of Medicare Fee- For-Service and Disenrollee/Assessment surveys have used a repeated measures approach in SUDAAN to create individual level composite scores on a number of constructs (e.g. the CAHPS composite Getting needed Health Care combines four survey questions that ask a beneficiary about experiences with getting care needed). This approach, applicable to any survey analysis with cluster-correlated responses, estimates models that assume each question in the composite is a repeated measure. Advantages include models that adjust for covariates, account for the survey design, and perform hypothesis testing while preserving the CAHPS composite design. Traditionally the CAHPS composites are analyzed using the CAHPS Macro; however, the CAHPS Macro does not possess the ability to correctly conduct the hypothesis tests we required. Our goals are to 1) disseminate this useful technique to researchers using the CAHPS methodology and 2) generalize the technique to other cluster-correlated survey responses. Keywords: Repeated Measure, CAHPS, SUDAAN 1. Introduction Analyzing data that come from a complex survey presents a number of challenges to the researcher. These challenges can be amplified significantly when the data come from a sample survey that has a longitudinal or repeated measures component. Software that is designed for sample surveys, such as SUDAAN (Research Triangle Institute, 2004), may not be specifically designed to address the longitudinal aspect of a repeated measures design. Software that does have the appropriate procedures (survey or repeated measures) may not have the capability of correctly adjusting the variances for the design of the data. The researcher is often left pondering the best course of action. In some of these situations, an optimal or near optimal solution can be found, which brings us to the focus of this paper: (1) produce variance estimates that are adjusted for the sample design and a longitudinal or repeated measures component of a survey and (2) provide a method to conduct complex hypothesis tests for data with these features. In a repeated measures or longitudinal data setting, there is often a high degree of correlation between the observations from a given experimental unit, which will affect the variance estimates. Failing to appropriately account for this correlation and to account for the sample design can lead to erroneous conclusions as both the repeated measures aspect of the study and the sample design affect the variances needed for hypothesis testing. Correctly accounting for all types of potential correlation is crucial for any form of statistical analysis. We hope to give the reader an appreciation of the methods needed to account for longitudinal data coming from a complex sample design when performing data analysis. Any set of survey questions or experimental measurements can be treated as repeated measures if they are, in some fashion, measuring the same outcome or if a combination of the questions or measurements is attempting to measure a single outcome. Often in a survey setting, such as the CAHPS 1 Medicare Fee-For-Service (MFFS) survey (Centers for Medicare & Medicaid Services [CMS], 2000), this single outcome is attempting to measure an experience based on a serious of three to four related survey questions. Although we have applied this technique to analyzing the CAHPS MFFS data, the models and concepts presented in this paper can easily be applied to any survey or other study with a repeated measures component. The motivation behind our method, which we call the Repeated Measures CAHPS (RM-CAHPS), is the need to perform trend analyses and conduct complex hypothesis tests when comparisons of more than two group-level means are desired. The specific analysis required on CAHPS MFFS survey data consists of analyzing combinations of questions commonly referred to as composites. The CAHPS Survey Users Network Web site contains a listing of all the composites (Agency for Healthcare Research and Quality [AHRQ], 2004). 1 Formerly know as the Consumer Assessment of Health Plans Study. 2799
2 The standard CAHPS analysis method treats these composites as a grouping of questions that together measure the same trait; similar to the repeated measure of a given trait. The ultimate goal is to obtain estimates, called composite scores, that are means for Medicare plans or means for the levels of the analysis variable of interest based on a function of the means of the questions that comprise the composite. These composite scores are then used to conduct various hypothesis tests. Through out this paper we discuss the advantages and assumptions of using the repeated measures approach to modeling the CAHPS composites, which we call RM-CAHPS. Although we use the CAHPS composites as an example, this technique can be applied to a variety of settings. For example in a panel study where the same people respond to a set of survey questions in multiple years, using the repeated measures approach described in this paper will allow the research to correctly account for the design of the survey as well as account for the within person correlation. 2. Accounting for Sample Design One of the key aspects of analyzing survey data is accounting for the weighting and sample design when estimating the variance. Most large surveys have analysis weights plus stratification and clustering in the sample design. All of these elements of a complex survey impact the variance and need to be accounted for when conducting data analysis. Weighting has an impact on both the point estimates and the variance estimates. For this paper, we treat the analysis weights as survey weights and not counts. The typical purpose of survey weights is to indicate the number of persons a given observation represents in the population. They do not indicate the number of observations in the sample, which implies that the sum of the weights represent the number of persons in the population, not the number of persons in the sample. Standard statistical software packages, not designed for survey analysis, usually treat the weights as counts. Consequently, the point estimates are correct but the variances can be grossly underestimated. Clustered sampling is very common in sample surveys and fairly common in experiments or observational data. In most sample surveys, the correlation within clusters is positive (overdispersion). Failing to account for this correlation will result in a variance estimate that is biased low. Low variance estimates will result in confidence intervals that are too narrow and hypothesis tests that have an inflated Type I error rate (too liberal). Infrequently, the intracluster correlation will be negative (underdispersion) and will produce the opposite effects of overdispersion. Stratification is often implemented as a method of improving the efficiency of drawing the actual sample. However, effect usages of stratification can lead to a decrease in the estimate of the sampling variance. If homogeneity of responses exists within strata, the estimates of variance can be less than if stratification did not occur. Typically, stratification has less impact on variance estimates than weighting or clustering. To obtain unbiased variance estimates that are adjusted for sample design features, it is important to use a software package specifically designed for surveys. For example, SUDAAN, SAS (PROC SURVEYREG and PROC SURVEYLOG), STATA, and WESTVAR are software packages that can correctly handle survey data. We used SUDAAN exclusively for our analyses. 3. Correlation within Repeated Measures and CAHPS Composites In most situations, the observations from repeated measures data within a person or experimental unit are positively correlated. Ignoring this correlation will lely result in variance estimates that are biased low and, consequently, confidence intervals based on these variance estimates that will be too narrow. Similarly, the potential for rejecting a true null hypothesis in favor of an unsupported alternative hypothesis (Type I error) increases. For obvious reasons, this will lead to poor inference and erroneous conclusions with regard to hypothesis tests. For the CAHPS composites, each respondent in a CAHPS survey has the potential to answer each question in a given composite. It is logical to assume that their responses to the questions that make up a given CAHPS composite will be positively correlated. 4. Missing Data Often, some responses within an experimental unit comprising the repeated measure or questions within a person comprising a CAHPS composite will be missing. For the following reasons, missing observations do not pose any difficulty to the proposed Repeated Measures Method (RM-CAHPS specifically for the CAHPS composites). 2800
3 Repeated measures designs have a correlation matrix for observations within the experimental unit or cluster where the structure depends on the ordering of the responses. In most analyses, this correlation matrix is considered a nuisance parameter; it needs to be determined to obtain meaningful inference from the analysis, but the actual values of the matrix are not important to the researcher. The use of Generalized Estimating Equations (GEE) (Binder 1983, Zeger and Liang 1986) or other robust variance estimation algorithms alleviates this problem; estimation of the exact correlation structure is unnecessary when using GEE to adjust variance estimates for the sample design. Each cluster is assumed to have a unique correlation structure that does not need to be estimated in order to adjust the estimates of variance. The end result is that the persons or experimental units with missing values do not need to be removed from the study or adjusted in any way. Another concern is the use of analysis weights with the missing observations. The analysis weights are almost always created for a given person; they are not created for given response items or questions from the survey. However, the repeated measures models that we propose assign weight by response items that contain nonmissing data, not by person. To analyze repeated measures data using the method purposed, a separate record must be created for each response item for each person. For example, if there are four repeated measures per person, there will be four records on the data set for every person. (Please see the section on setting-up the repeated measures for CAHPS MFFS data below.) Each record from a given person will use the analysis weight that is assigned to that person. As a result, a given person will have more impact on the final results when they provide nonmissing responses to more items within the repeated measure than one who provides nonmissing responses to fewer items within the measure. The missing item within the measure is not used in the modeling process; however, other valid items from the same person are used in the modeling process. This allows the use of partial data from a given individual in all of the modeling situations. The only time that a person s data would not be used in the modeling process is if that individual had missing responses to all items that comprise the repeated measure or composite. We must note that missing data for the independent variables in the model will result in the record being deleted during the modeling process. When using GEE, the degrees of freedom are not affected by missing data. The default degrees of freedom used in our models is the number of design strata minus the number of primary sampling units (PSU). All records for a given person must be in the same PSU to take advantage of the robust variance estimation methods. Therefore, the number of degrees of freedom will not depend on the number of items in a PSU or on missing values. This is an important property of our modeling scheme, as the creation of multiple records per person could artificially inflate the degrees of freedom and once again, provide test results that are too liberal. 5. Modeling Applications The method that we describe can be employed for a variety of model structures. We will present examples using SUDAAN s linear regression procedure (PROC REGRESS). However, the method is not limited to linear regression models. Other types of models are available in SUDAAN: logistic (PROC LOGISTIC), multinomial (PROC MULTILOG), count (PROC LOGLINK), and survival (PROC SURVIVAL). For all of the point and variance estimates reported, we used the GEE option and a with-replacement design. However, the methods presented here would work using other designs (e.g., without replacement), and other robust variance estimation methods le Replicate Weight Jackknife and Balanced Repeated Replication. 6. Test Statistics Another issue that we considered is which test statistic to use for multiple degrees-of-freedom hypothesis tests. The standard Wald chi-square test statistic is often too liberal for survey data for multiple degrees-of-freedom tests. For our analyses purposes, we chose the adjusted Wald F (Fellegi 1980), a more appropriate test statistic for survey data. Satterthwaite s adjusted chi-square (Rao and Scott 1981) and the Wald F (RTI 2004, pp ) are other test statistics that are recommended for surveys. These test statistics are available for all procedures in SUDAAN. 7. Methods The RM-CAHPS begins by fitting a general regression model to the data of interest and was first proposed by Chromy and McLeod (2000). The subgroup variable of interest, covariates, and an indicator for item (question number) are placed in the model. If desired, interactions between independent variables (subgroup and/or covariates) are also included in the model. Once the researcher has the 2801
4 desired terms in the model, to replicate the CAHPS Macro, an addition set of terms must be added. These additional terms are the interaction between all independent variables and the item variable. This interaction has to be included because the CAHPS Macro models each question separately. Note that if the researcher does not care to replicate the CAHPS Macro, the additional interactions are not needed. If the repeated measures is truly a repeated measure and not a set of survey questions measuring a similar trait, then the indicator for item is not necessary either. For example, in a panel study if the same question was answered by the same person once every year, then an indicator for year would not be required. A general form of the model is easily derived. The notation being used is consistent with notation for a complex survey. For the model below, we assume that there is one level of stratification and one primary sampling unit. y = g(α + subgroup + qa _ num + subgroup* qa _ num + β qa _ num* x ) + ε i = 1, 2, I represents the strata variable j = 1, 2, J i represents the PSU within strata k = 1,2, n ij represents person or experimental unit within PSU 2 Var( ε ) = σ I n The response vector y is composed of all of the responses to the questions that form the composites or repeated measure; in the case of the ratings on single response variable, the response is a scalar. The term x is a vector of all covariates that are used in the model, and β is the corresponding vector of regression coefficients. The vector ε comprises the residuals. Note that there is a residual for every element of y. The dimensions of the intracluster covariance matrix Var( ε ) depend on the number of items for a composite. For a ratings question, 2 Var( ε ) is the scalarσ. The functional form of g(*) depends on the type of regression. From linear regression, g(*) will be the identity link function, for logistic regression it is the logit link function, and for multinomial models it will be the generalized or cumulative logit link function (Agresti, 1990). 2 The above variance assumption, Var( ε ) = σ I n, is used for parameter estimation only. To account for the intracluster correlation, variance is estimated using generalized estimating equations (GEE) (Binder, 1983; Zeger and Liang 1986). When using GEE, the Var( ε ) matrix is unstructured. It is not necessary to actually estimate the correlation structure to obtain an estimate of variance because the parameters are estimated under the assumption of independence and the estimate of variance is robust to this assumption. This relieves the researcher of the responsibility for determining the exact form of the intracluster correlation matrix. Predicted margins (direct standardization) are used to produce an estimate of the composite scores from our model (Korn and Graubard, 1999). Assume that the variable subgroup has R levels. The predicted margins are calculated for r = 1, 2, R. For a given level of subgroup, r, the formula for the predicted margin is * * E( Y X, β, subgroup = r) = g( α + subgroup + qa _ num + + subgroup * qa _ num β qa _ num * x In this situation, the variable subgroup is set to a given level r. The subgroup * indicates that all observations are set to this value for subgroup; all other covariates remain unchanged. Typically, the predicted margins are calculated for every level of the subgroup variable. Hypothesis tests can then be conducted on the margins. For the unequal item weights, the predicted marginals produce the point estimates desired. For the equal weights, contrasts of the predicted marginals are needed to produce the desired estimates. ) 2802
5 Modification of the repeated measures model for the CAHPS MFFS data is relatively straightforward. For the example presented, we are interested in trends over years. Our subgroup variable will be year. The question numbers will be the items that compose a y = g(α + year + item + year * item + β item* ) + given CAHPS composite. For the CAHPS MFFS data, the individuals that comprise the sample are the primary sampling units. As a result, the index j is dropped from the model. The repeated measures model is i = 1, 2,...,1104 x ε, k = 1, 2,..., n ij where i is an index for the year*geounit interaction, which is the stratum variable for our models, and k is an index for person within a year*geounit stratum. We are primarily interested in estimating trends across years. This will be done using the predicted margins and calculating the appropriate contrast. 8. Setting Up the Repeated Measures for CAHPS MFFS Data Here we will detail the steps taken to get the CAHPS MFFS data set-up for the repeated measures analysis. We will also explain how the SUDAAN code should be set-up. First it is important to note that each survey respondent has the potential to provide a response for each question comprising a given composite. In Table 1 the lay out for the file is shown. Notice respondent ID 1 has responses to 4 survey questions. In this mock-up example, there are 4 questions comprising the repeated measure or composite in the CAHPS example. Also notice that for each survey question in the composite there is a record in the data set, thus the number of records in the dataset, for a composite with 4 questions, will be 4 times the number of respondents. Table 1. Example File Layout for Repeated Measures Respondent ID Question # for Response to Subgroup Variable Composite Question Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year Year 3 Once the data is set-up, we are ready to run the analysis in SUDAAN. Setting up the repeated measures aspect in SUDAAN is relatively simple. The primary sampling unit (PSU) identifies the repeated measure. In our example the respondent ID (called id in the code below) indicates the survey respondents and thus the repeated measure. In SUDAAN the PSU is generally the second variable in the nest statement following the stratification variable. If you have more than one stratification variable then the PSU variable can be the third, fourth, etc variable in the nest statement. If this is the case the code for identifying the PSU variable is psulev = position of PSU variable. In the example code, below, our stratification variable is the cross of year and geounit (yrgeo) and id is our PSU. We also include an indicator of the survey question (qanum) in the model as well as all the interactions with qanum. The researcher does not have to include these extra terms and interactions if they are not trying to mimic the CAHPS macro. And in cases of true repeated measures identifying the repeated measure with an indicator variable is not necessary. proc regress data = tcare_1 design = wr; nest yrgeo id; weight postwght; class qanum year/nofreq; model response = qanum year qanum*year female under65 qanum*(female under65 ) ; 0.25)*year=( )/NAME="Year = 1"; 2803
6 0.25)*year=( )/NAME="Year = 2"; 0.25)*year=( )/NAME="Year = 3"; 0.25)*year=( )/NAME="Year = 4"; predmarg year; Our subgroup variable is year and we use the pred_eff and predmarg statements to obtain the compsites scores by year. The pred_eff statements are used to calculate the composite scores with equal weighting of the items since we have four questions in the composite each equation is weighted by The researcher can chose to weight the questions differently, perhaps using the factor loadings from a principle components analysis to determine how each question should be weighted. The predmarg outputs the predicted marginals which are used for the unequal weighting of item scores. In the regression setting this is the same as the least squares means. 9. Results The results, found in Table 2, compare the CAHPS composite scores and standard errors for the Getting Needed Care Composite using the RM-CAHPS, a regression model based on a simple random sample (SRS) without any repeated measures, and the CAHPS Macro. These two tables also illustrate the difference between the two CAHPS Macro settings we have chosen to investigate: unequal weighting and equal weighting of the items comprising the composite. In both tables, the predicted marginals generated from linear regression models using SUDAAN are found in the column titled RM Marginals. (The marginals are the same regardless of survey design, so we present them only once). As already noted, the predicted marginals are very similar to the CAHPS Macro scores. The standardization of both independent and dependent variables performed in the CAHPS Macro explains the differences. The last three columns demonstrate the effect each method has on the standard errors. When the sample design and the clustering from the items within a composite are ignored, the standard errors are smaller, as shown by the standard errors from the SRS regression model ( RM SE-Naïve ) and the standard errors from the CAHPS Macro ( CAHPS Macro SE ). The difference in the standard errors between the CAHPS Marco and the RM-CAHPS range from to ; a 10% to 19% decrease in the standard error. This decrease in the standard error will result in an increase in the Type I error rate. Table 2. Comparison Estimates with Unequal Weighted Items CAHPS Macro RM RM CAHPS Survey Year RM Marginals Scores SE-Naive SE-Design Macro SE Unequal Weighted Items Year = Year = Year = Year = Equal Weighted Items Year = Year = Year = Year = Conclusions The flexibility of the modeling method makes the repeated measures approach applicable to a variety of different situations a researcher may encounter. For example, the models we presented can be modified by replacing year with any independent variable of interest. Additionally, all other hypothesis tests available in a regular regression setting are available to the researcher. As expected, these additional tests are based on the full model and account for covariates, weights, and the sample design. The 2804
7 repeated measures approach gives the researcher a great deal of flexibility when modeling clustered data often found in a survey setting. Two additional features of the GEE method are (1) a complicated correlation structure does not need to be specified for the repeated measures models being fit and (2) missing data do not create problems. This is not the case for other statistical packages, such as SAS s PROC MIXED, where the user is required to specify the correlation structure. Also, during our preliminary investigative work, we discovered that PROC MIXED was not capable of handling large numbers of missing values within the repeated measure (the respondent, in our case). As a result, we were never able to obtain a fitted model using PROC MIXED. Acknowledgements Performed under contract /T.O.#7. We would le to thank Edward S. Sekscenski, Larry Campbell, Lisa Carpenter, Jeffrey Laufenberg, and Shulamit Bernard. References Agency for Healthcare Research and Quality (AHRQ) Topics Covered by the Core Items in the Health Plan Survey [accessed on November 2, 2005]. Available at: Agresti, A Categorical Data Analysis. New York: Wiley. Binder, D. A On the Variances of Asymptotically Normal Estimators from Complex Surveys. International Statistical Review, 51, Chromy, J. R. and L. D. McLeod Considerations for the Development of G- CAHPS Composite Scoring Algorithms. Development and Testing of the Group-Level Consumer Assessment of Health Plans Study Instrument: Results from the National G- CAHPS Field Test. G.-C. R. Team. Research Triangle Park, NC, RTI International: A110-A129. Centers for Medicare & Medicaid Services (CMS) Implementation of CAHPS Fee-for- Service Survey Final Report for Year 1 [accessed on November 3, 2005]. Available at: onsumers/ffs_overview1.pdf Fellegi, I. P Approximate Tests of Independence and Goodness of Fit Based on Stratified Multistage Samples. American Statistical Association, Proceedings of the Section on Survey Research Methods, pp Baltimore MD: American Statistical Association. Korn, E. L., and B. I. Graubard Analysis of Health Surveys. New York: Wiley. Rao, J. N. K., and A. J. Scott The Analysis of Categorical Data from Complex Surveys: Chi- Squared Tests for Goodness of Fit and Independence in Two-Way Tables. Journal of the American Statistical Association, 76 (374): Research Triangle Institute SUDAAN Language Manual (release 9.0).Research Triangle Park, NC: RTI International. [accessed on November 3, 2005] Available at: Zeger, S. and K. Liang (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42,
New SAS Procedures for Analysis of Sample Survey Data
New SAS Procedures for Analysis of Sample Survey Data Anthony An and Donna Watts, SAS Institute Inc, Cary, NC Abstract Researchers use sample surveys to obtain information on a wide variety of issues Many
Sampling Error Estimation in Design-Based Analysis of the PSID Data
Technical Series Paper #11-05 Sampling Error Estimation in Design-Based Analysis of the PSID Data Steven G. Heeringa, Patricia A. Berglund, Azam Khan Survey Research Center, Institute for Social Research
Life Table Analysis using Weighted Survey Data
Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using
Youth Risk Behavior Survey (YRBS) Software for Analysis of YRBS Data
Youth Risk Behavior Survey (YRBS) Software for Analysis of YRBS Data CONTENTS Overview 1 Background 1 1. SUDAAN 2 1.1. Analysis capabilities 2 1.2. Data requirements 2 1.3. Variance estimation 2 1.4. Survey
Software for Analysis of YRBS Data
Youth Risk Behavior Surveillance System (YRBSS) Software for Analysis of YRBS Data June 2014 Where can I get more information? Visit www.cdc.gov/yrbss or call 800 CDC INFO (800 232 4636). CONTENTS Overview
ESTIMATION OF THE EFFECTIVE DEGREES OF FREEDOM IN T-TYPE TESTS FOR COMPLEX DATA
m ESTIMATION OF THE EFFECTIVE DEGREES OF FREEDOM IN T-TYPE TESTS FOR COMPLEX DATA Jiahe Qian, Educational Testing Service Rosedale Road, MS 02-T, Princeton, NJ 08541 Key Words" Complex sampling, NAEP data,
Logistic (RLOGIST) Example #3
Logistic (RLOGIST) Example #3 SUDAAN Statements and Results Illustrated PREDMARG (predicted marginal proportion) CONDMARG (conditional marginal proportion) PRED_EFF pairwise comparison COND_EFF pairwise
Chapter 11 Introduction to Survey Sampling and Analysis Procedures
Chapter 11 Introduction to Survey Sampling and Analysis Procedures Chapter Table of Contents OVERVIEW...149 SurveySampling...150 SurveyDataAnalysis...151 DESIGN INFORMATION FOR SURVEY PROCEDURES...152
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Comparison of Estimation Methods for Complex Survey Data Analysis
Comparison of Estimation Methods for Complex Survey Data Analysis Tihomir Asparouhov 1 Muthen & Muthen Bengt Muthen 2 UCLA 1 Tihomir Asparouhov, Muthen & Muthen, 3463 Stoner Ave. Los Angeles, CA 90066.
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)
Survey Data Analysis in Stata
Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44 Outline 1 Types of
Simple Second Order Chi-Square Correction
Simple Second Order Chi-Square Correction Tihomir Asparouhov and Bengt Muthén May 3, 2010 1 1 Introduction In this note we describe the second order correction for the chi-square statistic implemented
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
A General Approach to Variance Estimation under Imputation for Missing Survey Data
A General Approach to Variance Estimation under Imputation for Missing Survey Data J.N.K. Rao Carleton University Ottawa, Canada 1 2 1 Joint work with J.K. Kim at Iowa State University. 2 Workshop on Survey
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES
INTRODUCTION TO SURVEY DATA ANALYSIS THROUGH STATISTICAL PACKAGES Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-110012 1. INTRODUCTION A sample survey is a process for collecting
Approaches for Analyzing Survey Data: a Discussion
Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
International Statistical Institute, 56th Session, 2007: Phil Everson
Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction
From the help desk: Swamy s random-coefficients model
The Stata Journal (2003) 3, Number 3, pp. 302 308 From the help desk: Swamy s random-coefficients model Brian P. Poi Stata Corporation Abstract. This article discusses the Swamy (1970) random-coefficients
Logistic (RLOGIST) Example #1
Logistic (RLOGIST) Example #1 SUDAAN Statements and Results Illustrated EFFECTS RFORMAT, RLABEL REFLEVEL EXP option on MODEL statement Hosmer-Lemeshow Test Input Data Set(s): BRFWGT.SAS7bdat Example Using
Analysis of Survey Data Using the SAS SURVEY Procedures: A Primer
Analysis of Survey Data Using the SAS SURVEY Procedures: A Primer Patricia A. Berglund, Institute for Social Research - University of Michigan Wisconsin and Illinois SAS User s Group June 25, 2014 1 Overview
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA
Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or
Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities.
Consider a study in which How many subjects? The importance of sample size calculations Office of Research Protections Brown Bag Series KB Boomer, Ph.D. Director, [email protected] A researcher conducts
National Longitudinal Study of Adolescent Health. Strategies to Perform a Design-Based Analysis Using the Add Health Data
National Longitudinal Study of Adolescent Health Strategies to Perform a Design-Based Analysis Using the Add Health Data Kim Chantala Joyce Tabor Carolina Population Center University of North Carolina
MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
Multivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
Complex Survey Design Using Stata
Complex Survey Design Using Stata 2010 This document provides a basic overview of how to handle complex survey design using Stata. Probability weighting and compensating for clustered and stratified samples
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Visualization of Complex Survey Data: Regression Diagnostics
Visualization of Complex Survey Data: Regression Diagnostics Susan Hinkins 1, Edward Mulrow, Fritz Scheuren 3 1 NORC at the University of Chicago, 11 South 5th Ave, Bozeman MT 59715 NORC at the University
From the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)
Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared
Chapter 19 Statistical analysis of survey data. Abstract
Chapter 9 Statistical analysis of survey data James R. Chromy Research Triangle Institute Research Triangle Park, North Carolina, USA Savitri Abeyasekera The University of Reading Reading, UK Abstract
COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.
277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling
What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and
A COMPREHENSIVE SOFTWARE PACKAGE FOR SURVEY DATA ANALYSIS
SUDAAN: A COMPREHENSIVE SOFTWARE PACKAGE FOR SURVEY DATA ANALYSIS Lisa M. LaVange, Babubhai V. Shah, Beth G. Barnwell and Joyce F. Killinger Lisa M. LaVan~e. Research Triangle Institute KEYWORDS: variance
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data ABSTRACT INTRODUCTION SURVEY DESIGN 101 WHY STRATIFY?
The SURVEYFREQ Procedure in SAS 9.2: Avoiding FREQuent Mistakes When Analyzing Survey Data Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health, ABSTRACT
Imputing Missing Data using SAS
ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are
Introduction to Longitudinal Data Analysis
Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Survey Analysis: Options for Missing Data
Survey Analysis: Options for Missing Data Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD Abstract A common situation researchers working with survey data face is the analysis of missing
Logistic (RLOGIST) Example #7
Logistic (RLOGIST) Example #7 SUDAAN Statements and Results Illustrated EFFECTS UNITS option EXP option SUBPOPX REFLEVEL Input Data Set(s): SAMADULTED.SAS7bdat Example Using 2006 NHIS data, determine for
Comparing Alternate Designs For A Multi-Domain Cluster Sample
Comparing Alternate Designs For A Multi-Domain Cluster Sample Pedro J. Saavedra, Mareena McKinley Wright and Joseph P. Riley Mareena McKinley Wright, ORC Macro, 11785 Beltsville Dr., Calverton, MD 20705
Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America.
Chapter XXI Sampling error estimation for survey data* Donna Brogan Emory University Atlanta, Georgia United States of America Abstract Complex sample survey designs deviate from simple random sampling,
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)
Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
" Y. Notation and Equations for Regression Lecture 11/4. Notation:
Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
IBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
Statistical and Methodological Issues in the Analysis of Complex Sample Survey Data: Practical Guidance for Trauma Researchers
Journal of Traumatic Stress, Vol. 2, No., October 2008, pp. 440 447 ( C 2008) Statistical and Methodological Issues in the Analysis of Complex Sample Survey Data: Practical Guidance for Trauma Researchers
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.
Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing
Overview of Methods for Analyzing Cluster-Correlated Data. Garrett M. Fitzmaurice
Overview of Methods for Analyzing Cluster-Correlated Data Garrett M. Fitzmaurice Laboratory for Psychiatric Biostatistics, McLean Hospital Department of Biostatistics, Harvard School of Public Health Outline
Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes
Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, Discrete Changes JunXuJ.ScottLong Indiana University August 22, 2005 The paper provides technical details on
The Research Data Centres Information and Technical Bulletin
Catalogue no. 12-002 X No. 2014001 ISSN 1710-2197 The Research Data Centres Information and Technical Bulletin Winter 2014, vol. 6 no. 1 How to obtain more information For information about this product
Clustering in the Linear Model
Short Guides to Microeconometrics Fall 2014 Kurt Schmidheiny Universität Basel Clustering in the Linear Model 2 1 Introduction Clustering in the Linear Model This handout extends the handout on The Multiple
HLM software has been one of the leading statistical packages for hierarchical
Introductory Guide to HLM With HLM 7 Software 3 G. David Garson HLM software has been one of the leading statistical packages for hierarchical linear modeling due to the pioneering work of Stephen Raudenbush
An analysis method for a quantitative outcome and two categorical explanatory variables.
Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that
Multinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
Introduction to mixed model and missing data issues in longitudinal studies
Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models
MATHEMATICAL METHODS OF STATISTICS
MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS
Chapter 7. One-way ANOVA
Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks
Server Load Prediction
Server Load Prediction Suthee Chaidaroon ([email protected]) Joon Yeong Kim ([email protected]) Jonghan Seo ([email protected]) Abstract Estimating server load average is one of the methods that
Using Excel for Statistics Tips and Warnings
Using Excel for Statistics Tips and Warnings November 2000 University of Reading Statistical Services Centre Biometrics Advisory and Support Service to DFID Contents 1. Introduction 3 1.1 Data Entry and
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:
Lab 5 Linear Regression with Within-subject Correlation Goals: Data: Fit linear regression models that account for within-subject correlation using Stata. Compare weighted least square, GEE, and random
No. 92. R. P. Chakrabarty Bureau of the Census
THE SURVEY OF INCOME AND PROGRAM PARTICIPATION MULTIVARIATE ANALYSIS BY USERS OF SIPP MICRO-DATA FILES No. 92 R. P. Chakrabarty Bureau of the Census U. S. Department of Commerce BUREAU OF THE CENSUS Acknowledgments
A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models
A Composite Likelihood Approach to Analysis of Survey Data with Sampling Weights Incorporated under Two-Level Models Grace Y. Yi 13, JNK Rao 2 and Haocheng Li 1 1. University of Waterloo, Waterloo, Canada
Workshop on Using the National Survey of Children s s Health Dataset: Practical Applications
Workshop on Using the National Survey of Children s s Health Dataset: Practical Applications Julian Luke Stephen Blumberg Centers for Disease Control and Prevention National Center for Health Statistics
Multivariate Analysis of Variance (MANOVA)
Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various
Multilevel Modeling of Complex Survey Data
Multilevel Modeling of Complex Survey Data Sophia Rabe-Hesketh, University of California, Berkeley and Institute of Education, University of London Joint work with Anders Skrondal, London School of Economics
Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics
Indices of Model Fit STRUCTURAL EQUATION MODELING 2013
Indices of Model Fit STRUCTURAL EQUATION MODELING 2013 Indices of Model Fit A recommended minimal set of fit indices that should be reported and interpreted when reporting the results of SEM analyses:
SUGI 29 Statistics and Data Analysis
Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
Best Practices in Using Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
individualdifferences
1 Simple ANalysis Of Variance (ANOVA) Oftentimes we have more than two groups that we want to compare. The purpose of ANOVA is to allow us to compare group means from several independent samples. In general,
Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.
Analyzing Complex Survey Data: Some key issues to be aware of Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 24, 2015 Rather than repeat material that is
Windows-Based Meta-Analysis Software. Package. Version 2.0
1 Windows-Based Meta-Analysis Software Package Version 2.0 The Hunter-Schmidt Meta-Analysis Programs Package includes six programs that implement all basic types of Hunter-Schmidt psychometric meta-analysis
