Analyzing Longitudinal Data from Complex Surveys Using SUDAAN




Darryl Creel
Statistics and Epidemiology, RTI International, 312 Trotter Farm Drive, Rockville, MD, 20850

Abstract

SUDAAN: Software for the Statistical Analysis of Correlated Data (SUDAAN) can be used to analyze data from surveys with complex designs. A possible feature of a complex survey design is clustering. One way in which clustering can occur is to have the same information collected on a sampling unit at different points in time. This type of clustering creates data that may be referred to as longitudinal, panel, or repeated measures data. This paper provides an example of longitudinal data analysis using SUDAAN. The example covers the structure of the data and data set; analytic strategies and interpretation; and the implementation of the analytic strategies using SUDAAN.

Keywords: Longitudinal Data Analysis (LDA), SUDAAN.

1. Introduction

One of the major uses of longitudinal data is to analyze trends, or change, over time. The SUDAAN team at RTI International often receives questions about how to conduct longitudinal data analysis using SUDAAN. This paper provides an answer to this question. In Section 2, we discuss longitudinal data. In Section 3, we discuss various survey designs over time. In Section 4, we examine the variance of a difference of two means. In Section 5, we discuss the data structure SUDAAN requires for the data analysis. In Section 6, we examine the SUDAAN code to analyze longitudinal data and discuss a cautionary note. In Section 7, we provide an example to illustrate the differences that may occur when one does and does not account for the longitudinal structure of the data. Finally, in Section 8, we provide some recommendations and cautions.

2. Longitudinal Survey Data

Longitudinal data measures the same characteristics of the same sampling unit over time. For example, in a longitudinal health survey of children, measurements such as height and weight may be taken each time the survey is conducted to create the child's body mass index (BMI).
Some of the goals of collecting longitudinal data are to produce population estimates over time, study change over time, and/or study variables that affect change over time. Continuing the example of the longitudinal health survey of children, researchers may be interested in the population estimates of BMI for specific subgroups over time. Researchers may also be interested in studying the change in BMI over time and what variables are related to that change. Longitudinal data from a survey with a complex survey design has the added complication of accounting for this complex survey design in the analysis. SUDAAN can account for different aspects of the complex survey design, e.g., stratification, clustering, and differential weighting, while conducting longitudinal data analysis.

3. Survey Designs over Time

There are four common designs for surveys conducted over time: repeated surveys, panel surveys, rotating panel surveys, and split panel surveys. Duncan and Kalton (1987) explain these designs in detail; this section is a brief summary of some of their discussion.

In the repeated survey design, similar measurements are made on samples from an equivalent population at different points of time, but without attempting to ensure that any elements are included in more than one round of data collection (Duncan and Kalton 1987, p. 100). Its particular strength is that at each round of data collection it routinely selects a sample of the population existing at that time (Duncan and Kalton 1987, p. 101). The major limitation of a repeated survey is that it does not yield data to satisfy objectives [of estimating change at the element level between two time points and other components of individual change] and [aggregate data for individuals over time] (Duncan and Kalton 1987, p. 101).

A panel survey is one in which similar measurements are made on the same sample at different points in time (Duncan and Kalton 1987, p. 101). The major advantage of a panel survey over a repeated survey is its much greater analytic potential. It enables components of individual change to be measured and also the summation of a variable across time (Duncan and Kalton 1987, p. 102). It can be much more efficient than a repeated survey for measuring net change (Duncan and Kalton 1987, p. 102). "[T]wo major potential problems with panel surveys are panel losses through nonresponse and the introduction of new elements to the population as time passes" (Duncan and Kalton 1987, p. 103).

In a panel survey, sample elements are, in principle, kept in the panel for the duration of the survey. In a rotating panel survey, sample elements have a restricted panel life; as they leave the panel, new elements are added. The limited membership in a rotating panel acts to reduce the problems of panel conditioning and panel loss in comparison with a nonrotating panel survey, and the continual introduction of new sample helps to maintain an up-to-date sample of a changing population (Duncan and Kalton 1987, p. 103).

A split panel survey is a combination of a panel and a repeated or rotating panel survey, as advocated in Kish (1983, 1986) (Duncan and Kalton 1987, p. 104).

4. Variance of a Difference of Two Means

This section focuses on the repeated cross-sectional survey, the panel survey, and the rotating panel survey. The split panel survey is not discussed in this section, but recall that it is a combination of a fixed panel and new sample elements from either a repeated or rotating panel.

The repeated cross-sectional survey design uses the same survey design each year but samples a different group of members each year. This approach is conceptually straightforward, samples from the current population, and avoids the complexity of a panel survey, fixed or rotating. However, it is difficult to tell whether observed differences are simply due to the different samples or reflect a true difference in the outcome variable. Also, when analyzing the difference of two means between years, the repeated cross-sectional survey design is not the most efficient survey design. Because of the independent samples, the variance of the difference of two means is relatively large compared to other methods. Using the simplifying assumptions that the variances are equal for the two time periods, $S_1^2 = S_2^2 = S^2$, and that the sample sizes are equal for the two time periods, $n_1 = n_2 = n$, the variance of the difference of two means for repeated cross-sectional surveys, where $m_1$ is the mean for time period one and $m_2$ is the mean for time period two, is

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}.$$

Contrast this with the most efficient survey design for measuring differences between time periods, the fixed panel survey design, i.e., a single sample on which data is collected at different points in time. The efficiency of the fixed panel survey depends on the correlation between the outcome variable at the two time periods, $\rho_{12}$. Using the same assumptions that were used for repeated cross-sectional surveys, the variance of a difference of two means for the fixed panel survey is

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}\,(1 - \rho_{12}).$$

Comparing the two expressions, the variance of the difference of two means for the fixed panel survey is smaller by the factor $(1 - \rho_{12})$. Consequently, the higher the correlation between the two time periods, the smaller the variance of the difference of two means.

Although the fixed panel survey is the most efficient at measuring differences between years, it is not without its limitations. Generally, a fixed panel survey has three limitations: the panel is selected at one point in time, panel attrition, and panel conditioning. If the population is changing, then selecting the sample once and not every year may cause the sample to become less and less representative of the population and bias the survey estimates. Panel attrition can arise because of the added response burden for panel members to provide data every year.
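
As a numerical aside on the variance comparison made earlier in this section, the following Python sketch evaluates the two expressions side by side. It assumes the repeated cross-sectional variance has the form 2S²/n and the fixed panel variance the form 2S²(1 − ρ12)/n; the function names and the input values (S² = 1, n = 250) are hypothetical and chosen only for illustration.

```python
def var_diff_repeated(s2: float, n: int) -> float:
    """Variance of m2 - m1 for two independent samples of size n.

    Assumed form: 2 * S^2 / n (equal variances, equal sample sizes).
    """
    return 2.0 * s2 / n


def var_diff_panel(s2: float, n: int, rho12: float) -> float:
    """Variance of m2 - m1 when the same n units are measured twice.

    rho12 is the correlation of the outcome between the two time periods.
    Assumed form: 2 * S^2 * (1 - rho12) / n.
    """
    return 2.0 * s2 * (1.0 - rho12) / n


if __name__ == "__main__":
    s2, n = 1.0, 250  # hypothetical variance and sample size
    for rho12 in (0.0, 0.3, 0.6, 0.9):
        repeated = var_diff_repeated(s2, n)
        panel = var_diff_panel(s2, n, rho12)
        # The ratio panel / repeated is the factor (1 - rho12).
        print(f"rho12={rho12:.1f}  repeated={repeated:.5f}  "
              f"panel={panel:.5f}  ratio={panel / repeated:.2f}")
```

With a correlation of zero the two designs give the same variance; the advantage of the panel grows as the correlation between time periods rises.
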
Panel conditioning means that panel members' responses change in some way because they are part of the panel.

A rotating panel survey design can mitigate the problems associated with the fixed panel survey without losing all of the benefits of the reduction in the variance of the differences. In a rotating panel survey design, panel members are retained in the panel only for a set period of time, and new panel members are brought into the panel. This mitigates the panel attrition and panel conditioning that are a concern for a fixed panel. Also, because new groups rotate into the panel at each time period, the panel is not static and is updated with new panel members from the current population. This accounts for changes in the population over the life of the survey. Because there is not complete overlap, there will be some loss in the efficiency of the rotating panel that is proportional to the size of the panel that does not overlap from one time period to the next. That is, the formula for the variance of the difference of two means has an added term that represents the amount of overlap, $\lambda$:

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}\,(1 - \lambda\rho_{12}).$$

With $\lambda = 1/2$, the variance benefits from only half of the correlation of the outcome variable between the two time periods. If $\lambda = 1$, i.e., there is complete overlap, then the variance is equal to the fixed panel variance. If $\lambda = 0$, i.e., there is no overlap, then the variance is equal to the repeated cross-sectional survey variance.

5. Structure of the Data Sets

Let us assume that there are two data sets. One data set is from 2004 and the other is from 2005. The two data sets have some overlap. That is, there are some primary sampling units (PSUs) that are on both data sets. Also, each of the data sets has a common set of analytic variables that are not shown in the following tables. Table 1 shows the stratum, PSU, and year for the 2004 data set.

[Table 1: 2004 Data Set Showing the Stratum, Primary Sampling Unit, and Year]

Table 2 shows the same information for the 2005 data set. Note that the data in Table 1 are italicized and bolded; the data in Table 2 are not. This distinction is carried through in the other tables.

[Table 2: 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year]

In order to perform the longitudinal data analysis, SUDAAN requires that the two separate data sets be combined into one data set. The combined data set is shown in Table 3.

[Table 3: Combined 2004 and 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year]

SUDAAN also requires that the data set be sorted by the variables on the nest statement. The nest statement that will be used in our first set of example code contains year, stratum, and PSU. The data set sorted by these variables is shown in Table 3. This sorting uses year as a stratification variable. Consequently, the results using this data set are similar to results from a repeated cross-sectional survey. That is, there is no benefit from the correlation between responses over the two time periods.

The nest statement that will be used in our second set of example code contains stratum and PSU. The data set sorted by these variables is shown in Table 4. Sorting by stratum and PSU, and not using year, creates a data set that has year clustered within PSU. Consequently, the results using this data set are similar to those from the panel design, although we do not have complete overlap. We still have the advantage of the variance reduction because of the overlap that we do have and the correlation between the responses.

[Table 4: Combined 2004 and 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year Sorted by Stratum and PSU]

6. SUDAAN Code for Longitudinal Data Analysis and Cautionary Note

6.1 SUDAAN Code

The focus of the following SUDAAN code is to calculate the contrast, and associated information, between 2007 and 2006. Often we see examples of SUDAAN code that contain the year variable as a stratification variable, as shown in the following SUDAAN code:

    proc descript data = dataset design = wr;
    nest year stratum PSU / psulev = 3;
    weight aweight;
    class year / nofreqs;
    var avar.;
    contrast year = ( -1 1 ) / name = "2007 2006 Contrast";
    print nsum mean semean t_mean p_mean;
    run;

The psulev = 3 option on the nest statement tells SUDAAN that the third variable on the nest statement is the PSU, which implies that the first two variables on the nest statement are stratification variables. A full description of the SUDAAN language can be found in the SUDAAN Language Manual, Release 9.0.

Using the year variable as a stratification variable does not allow us to benefit from the longitudinal structure of the data. That is, the observations for a PSU are classified across multiple years and not clustered within PSU. One way to capture the multiple years of data collected for a PSU is not to use year as a stratification variable.
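
The effect of dropping year from the stratification variables can be illustrated outside SUDAAN. The following Python sketch is not SUDAAN's implementation. It assumes a single stratum, equal weights, complete overlap, and one observation per PSU per year; under those assumptions a with-replacement, between-PSU variance estimate of the year contrast should reduce to the two simple forms coded below. All function names, the seed, and the simulated values are hypothetical.

```python
import math
import random
import statistics


def simulate_panel(n_psu, rho, seed=0):
    """Return [(y_year1, y_year2), ...], one pair per PSU.

    Each year's value has variance 1 and the two years have correlation
    rho, induced by a PSU-level component shared across years.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_psu):
        shared = rng.gauss(0.0, 1.0)
        y1 = math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
        y2 = 0.1 + math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0)
        pairs.append((y1, y2))
    return pairs


def se_year_as_stratum(pairs):
    """Standard error of mean(y2) - mean(y1) when each year is its own stratum.

    The two years are treated as independent samples, so the
    between-year covariance is ignored.
    """
    n = len(pairs)
    v1 = statistics.variance([y1 for y1, _ in pairs])
    v2 = statistics.variance([y2 for _, y2 in pairs])
    return math.sqrt(v1 / n + v2 / n)


def se_year_within_psu(pairs):
    """Standard error of mean(y2) - mean(y1) when years are clustered within PSU.

    Variability is measured between PSU-level differences, so the
    between-year covariance is accounted for.
    """
    n = len(pairs)
    diffs = [y2 - y1 for y1, y2 in pairs]
    return math.sqrt(statistics.variance(diffs) / n)


if __name__ == "__main__":
    pairs = simulate_panel(n_psu=250, rho=0.66, seed=1)
    contrast = statistics.mean(y2 - y1 for y1, y2 in pairs)
    print("contrast mean        :", round(contrast, 4))
    print("SE, year as stratum  :", round(se_year_as_stratum(pairs), 4))
    print("SE, year within PSU  :", round(se_year_within_psu(pairs), 4))
```

The point estimate of the contrast is identical under both treatments; only the standard error changes, and the ratio of the two standard errors should be close to the square root of one minus the between-year correlation.
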
The following code only includes the stratum variable as the stratification variable:

    proc descript data = dataset design = wr;
    nest stratum PSU;
    weight aweight;
    class year / nofreqs;
    var avar.;
    contrast year = ( -1 1 ) / name = "2007 2006 Contrast";
    print nsum mean semean t_mean p_mean;
    run;

Consequently, this SUDAAN code treats the years as clustered within the PSU and allows us to take advantage of the longitudinal structure of the data.

6.2 Cautionary Note

The previous SUDAAN code uses a combined data set to produce contrasts between years. The number of degrees of freedom (d.f.) that SUDAAN uses for our simple example is correct for this purpose; it would use 4 d.f. Caution is needed when one analyzes a single year's data. Each single-year data set for our simple example would have 3 d.f., and this is what SUDAAN would use for the single-year data sets. For the combined data set, SUDAAN would use 4 d.f. even for the single-year analysis. Consequently, one should use the DDF = 3 option with the combined data set for single-year analysis, or use the single-year data sets.

7. Example

We have included one example using simulated data so that one can see the potential impact of taking, or not taking, the longitudinal data structure into account; accounting for it may yield smaller standard errors. The simulated data set had 500 observations in a single stratum, λ = 1, and ρ = 0.66. The results of analyzing the data as a repeated cross-sectional data structure and as a panel data structure are presented in Table 5.

Table 5: Results of Simulated Data Set Analyzed as a Repeated Cross-Section Survey and a Panel Survey

                            Repeated Cross-Section    Panel
    Contrast Mean (CM)      0.11                      0.11
    SE CM                   0.06                      0.04
    Lower Limit 95% CI CM   -0.01                     0.04
    Upper Limit 95% CI CM   0.23                      0.18
    T-test CM               1.78                      3.05
    P-Value T-test CM       0.0757                    0.0024

Note that the estimates of the contrast mean are the same, but the standard error estimate for the contrast mean is smaller for the panel survey than for the repeated cross-section survey. This difference carries through to the confidence intervals and testing, which results in a statistically significant difference at the α = 0.05 level for the panel but not for the repeated cross-section survey. Hence, the analytic approach can change the interpretation of the output.

8. Recommendations

The main point is to take advantage of the longitudinal data structure and possibly smaller standard errors. One can account for the longitudinal data structure easily using SUDAAN to produce contrasts. Finally, use a data set that combines years of information for contrasts, or for single-year analysis with the DDF option. One could also use the single-year data sets for the single-year analysis.

References

Duncan, Greg and Kalton, Graham (1987), Issues of Design and Analysis of Surveys Across Time, International Statistical Review, Vol. 55, No. 1, pp. 97-117.

Kish, Leslie (1983), Data Collection for Details over Space and Time, Statistical Methods and the Improvement of Data Quality, Ed. T. Wright, New York: Academic Press, pp. 73-84.

Kish, Leslie (1986), Timing of Surveys for Public Policy, The Australian Journal of Statistics, Vol. 28, pp. 1-12.

Research Triangle Institute (2004), SUDAAN Language Manual, Release 9.0, Research Triangle Park, NC: Research Triangle Institute.