Analyzing Longitudinal Data from Complex Surveys Using SUDAAN

Darryl Creel
Statistics and Epidemiology, RTI International, 312 Trotter Farm Drive, Rockville, MD, 20850

Abstract

SUDAAN: Software for the Statistical Analysis of Correlated Data (SUDAAN) can be used to analyze data from surveys with complex designs. A possible feature of a complex survey design is clustering. One way in which clustering can occur is to have the same information collected on a sampling unit at different points in time. This type of clustering creates data that may be referred to as longitudinal, panel, or repeated measures data. This paper provides an example of longitudinal data analysis using SUDAAN. The example covers the structure of the data and data set; analytic strategies and interpretation; and the implementation of the analytic strategies using SUDAAN.

Keywords: Longitudinal Data Analysis (LDA), SUDAAN.

1. Introduction

One of the major uses of longitudinal data is to analyze trends, or change, over time. The SUDAAN team at RTI International often receives questions about how to conduct longitudinal data analysis using SUDAAN. This paper provides an answer to this question. In Section 2, we discuss longitudinal data. In Section 3, we discuss various survey designs over time. In Section 4, we examine the variance of a difference of two means. In Section 5, we discuss the data structure SUDAAN requires for the data analysis. In Section 6, we examine the SUDAAN code to analyze longitudinal data and discuss a cautionary note. In Section 7, we provide an example to illustrate the differences that may occur when one does and does not account for the longitudinal structure of the data. Finally, in Section 8, we provide some recommendations and cautions.

2. Longitudinal Survey Data

Longitudinal data measure the same characteristics of the same sampling unit over time. For example, in a longitudinal health survey of children, measurements such as height and weight may be taken each time the survey is conducted to create the child's body mass index (BMI).
Some of the goals of collecting longitudinal data are to produce population estimates over time, study change over time, and/or study variables that affect change over time. Continuing the example of the longitudinal health survey of children, researchers may be interested in the population estimates of BMI for specific subgroups over time. Researchers may also be interested in studying the change in BMI over time and what variables are related to that change. Longitudinal data from a survey with a complex survey design have the added complication of accounting for this complex survey design in the analysis. SUDAAN can account for different aspects of the complex survey design, e.g., stratification, clustering, and differential weighting, while conducting longitudinal data analysis.

3. Survey Designs over Time

There are four common designs for surveys conducted over time: repeated survey, panel survey, rotating panel survey, and split panel survey.¹

In the repeated survey design, similar measurements are made on samples from an equivalent population at different points of time, but without attempting to ensure that any elements are included in more than one round of data collection.² Its particular strength is that at each round of data collection it routinely selects a sample of the population existing at that time.³ The major limitation of a repeated survey is that it does not yield data to satisfy objectives [of estimating change at the element level between two time points and other components of individual change] and [aggregate data for individuals over time].⁴

¹ There is a detailed explanation of these designs by Greg Duncan and Graham Kalton in Issues of Design and Analysis of Surveys Across Time, International Statistical Review, Vol. 55, No. 1, pp. 97-117. This section is a brief summary of some of their discussion.
² Duncan and Kalton 100.
³ Duncan and Kalton 101.
⁴ Duncan and Kalton 101.
A panel survey is one in which similar measurements are made on the same sample at different points in time.⁵ The major advantage of a panel survey over a repeated survey is its much greater analytic potential. It enables components of individual change to be measured and also the summation of a variable across time.⁶ It can be much more efficient than a repeated survey for measuring net change.⁷ [T]wo major potential problems with panel surveys are panel losses through nonresponse and the introduction of new elements to the population as time passes.⁸

In a panel survey, sample elements are, in principle, kept in the panel for the duration of the survey. In a rotating panel survey, sample elements have a restricted panel life; as they leave the panel, new elements are added. The limited membership in a rotating panel acts to reduce the problems of panel conditioning and panel loss in comparison with a nonrotating panel survey, and the continual introduction of new sample helps to maintain an up-to-date sample of a changing population.⁹

A split panel survey is a combination of a panel and a repeated or rotating panel survey, as advocated in Kish (1983, 1986).¹⁰

⁵ Duncan and Kalton 101.
⁶ Duncan and Kalton 102.
⁷ Duncan and Kalton 102.
⁸ Duncan and Kalton 103.
⁹ Duncan and Kalton 103.
¹⁰ Duncan and Kalton 104.

4. Variance of a Difference of Two Means

This section focuses on the repeated cross-sectional survey, the panel survey, and the rotating panel survey. The split panel survey is not discussed in this section, but recall that it is a combination of a fixed panel and new sample elements from either a repeated or rotating panel.

The repeated cross-sectional survey design uses the same survey design each year but samples a different group of members each year. This approach is conceptually straightforward, samples from the current population, and avoids the complexity of a panel survey, fixed or rotating. However, it is difficult to tell whether observed differences are simply due to the different samples or reflect a true difference in the outcome variable. Also, when analyzing the difference of two means between years, the repeated cross-sectional survey design is not the most efficient survey design. Because of the independent samples, the variance of the difference of two means is relatively large compared to other methods. Using the simplifying assumptions that the variances are equal for the two time periods, $S_1^2 = S_2^2 = S^2$, and that the sample sizes are equal for the two time periods, $n_1 = n_2 = n$, the variance for the difference of two means, where $m_1$ is the mean for time period one and $m_2$ is the mean for time period two, for repeated cross-sectional surveys is

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}.$$

Contrast this with the most efficient survey design for measuring differences between time periods, which is the fixed panel survey design, i.e., a single sample on which data are collected at different points in time. The efficiency of the fixed panel survey depends on the correlation between the outcome variable at the two time periods, $\rho_{12}$. Using the same assumptions that were used for the repeated cross-sectional surveys, the variance of a difference of two means for the fixed panel survey is

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}\,(1 - \rho_{12}).$$

Comparing the two expressions, the variance of the difference of two means for the fixed panel survey is smaller than that for the repeated cross-sectional survey by the factor $(1 - \rho_{12})$. Consequently, the higher the correlation between the two time periods, the smaller the variance of the difference of two means.

Although the fixed panel survey is the most efficient at measuring differences between years, it is not without its limitations. Generally, a fixed panel survey has three limitations: the panel is selected at one point in time, panel attrition, and panel conditioning. If the population is changing, then selecting the sample once and not every year may cause the sample to become less and less representative of the population and bias the survey estimates. Panel attrition can arise because of the added response burden for panel members to provide data every year. Panel conditioning means that panel members' responses change in some way because they are part of the panel.

A rotating panel survey design can mitigate the problems associated with the fixed panel survey without losing all of the benefits of the reduction in the variance of the differences. In a rotating panel survey design, panel members are only retained in the panel for a set period of time and new panel members are brought into the panel. This mitigates the panel attrition and panel conditioning that are a concern for a fixed panel. Also, because new groups are rotated into the panel at each time period, the panel is not static and is updated with new panel members from the current population. This accounts for changes in the population over the life of the survey. Because there is not complete overlap, there will be some loss in the efficiency of the rotating panel that is proportional to the size of the panel that does not overlap from one time period to the next. That is, the formula for the variance of the difference of two means has an added term that represents the amount of overlap, $\lambda$:

$$\operatorname{var}(m_2 - m_1) = \frac{2S^2}{n}\,(1 - \lambda\rho_{12}).$$

With $\lambda = 1/2$, the variance will only benefit by half of the correlation of the outcome variable between the two time periods. If $\lambda = 1$, i.e., there is complete overlap, then the variance is equal to the fixed panel variance. If $\lambda = 0$, i.e., there is no overlap, then the variance is equal to the repeated cross-sectional survey variance.

5. Structure of the Data Sets

Let us assume that there are two data sets. One data set is from 2004 and the other is from 2005. The two data sets do have some overlap. That is, there are some primary sampling units (PSUs) that are on both data sets. Also, each of the data sets has a common set of analytic variables that are not shown in the following tables. Table 1 shows the stratum, PSU, and year for the 2004 data set.

Table 1: 2004 Data Set Showing the Stratum, Primary Sampling Unit, and Year

Table 2 shows the same information for the 2005 data set. Note that the data in Table 1 are italicized and bolded; the data in Table 2 are not. This distinction is carried through in the other tables.
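To make this layout concrete, the following Python sketch builds two small data sets with the same three columns and lists the PSUs that appear in both years. The stratum and PSU codes are invented for illustration and are not the values in Tables 1 and 2; the helper names `psu_keys`, `overlapping_psus`, and `overlap_fraction` are likewise hypothetical. The overlap share it reports is only a crude analogue of the overlap term λ from Section 4.

```python
# Hypothetical illustration only: these stratum and PSU codes are invented
# and are NOT the values shown in Tables 1 and 2.
data_2004 = [
    {"stratum": 1, "psu": 1, "year": 2004},
    {"stratum": 1, "psu": 2, "year": 2004},
    {"stratum": 2, "psu": 3, "year": 2004},
    {"stratum": 2, "psu": 4, "year": 2004},
]
data_2005 = [
    {"stratum": 1, "psu": 2, "year": 2005},
    {"stratum": 1, "psu": 5, "year": 2005},
    {"stratum": 2, "psu": 4, "year": 2005},
    {"stratum": 2, "psu": 6, "year": 2005},
]


def psu_keys(records):
    """Return the set of (stratum, psu) pairs present in a data set."""
    return {(r["stratum"], r["psu"]) for r in records}


def overlapping_psus(first, second):
    """Return the sorted (stratum, psu) pairs that appear in both data sets."""
    return sorted(psu_keys(first) & psu_keys(second))


def overlap_fraction(first, second):
    """Share of the second data set's PSUs that also appear in the first."""
    return len(overlapping_psus(first, second)) / len(psu_keys(second))


if __name__ == "__main__":
    print("PSUs in both years:", overlapping_psus(data_2004, data_2005))
    print("Overlap share:", overlap_fraction(data_2004, data_2005))
```

In this toy layout, two of the four 2005 PSUs were also sampled in 2004, which is the kind of partial overlap the combined analysis can exploit.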
Table 2: 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year

In order to perform the longitudinal data analysis, SUDAAN requires that the two separate data sets be combined into one data set. The combined data set is shown in Table 3.

Table 3: Combined 2004 and 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year

SUDAAN also requires that the data set be sorted by the variables on the nest statement. The nest statement that will be used in our first set of example code contains year, stratum, and PSU. The data set sorted by these variables is shown in Table 3. This sorting uses year as a stratification variable. Consequently, the results using this data set are similar to results from a repeated cross-sectional survey. That is, there is no benefit from the correlation between responses over the two time periods.

The nest statement that will be used in our second set of example code contains stratum and PSU. The data set sorted by these variables is shown in Table 4. Sorting by stratum and PSU, and not using year, creates a data set that has year clustered within PSU. Consequently, the results using this data set are similar to those from a panel design, although we do not have complete overlap. We still have the advantage of the variance reduction because of the overlap that we do have and the correlation between the responses.

Table 4: Combined 2004 and 2005 Data Set Showing the Stratum, Primary Sampling Unit, and Year Sorted by Stratum and PSU

6. SUDAAN Code for Longitudinal Data Analysis and Cautionary Note

6.1 SUDAAN Code

The focus of the following SUDAAN code is to calculate the contrast, and associated information, between 2007 and 2006. Often we see examples of SUDAAN code that contain the year variable as a stratification variable, as shown in the following SUDAAN code:¹¹

    proc descript data = dataset design = wr;
    nest year stratum PSU / psulev = 3;
    weight aweight;
    class year / nofreqs;
    var avar;
    contrast year = ( -1 1 ) / name = "2007 2006 Contrast";
    print nsum mean semean t_mean p_mean;
    run;

¹¹ The psulev = 3 option on the nest statement tells SUDAAN that the third variable on the nest statement is the PSU, which implies that the first two variables on the nest statement are stratification variables. A full description of the SUDAAN language can be found in the SUDAAN Language Manual, Release 9.0.

Using the year variable as a stratification variable does not allow us to benefit from the longitudinal structure of the data. That is, the observations for a PSU are classified across multiple years and not clustered within PSU. One way to capture the multiple years of data collected for a PSU is not to use year as a stratification variable.
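Whichever nest statement is used, the combined file has to be sorted to match it, as noted in Section 5. The Python sketch below illustrates that preparation step under stated assumptions: the records are invented stand-ins for the 2004 and 2005 files, and `combine` and `sort_for_nest` are hypothetical helper names, not SUDAAN functions. In practice the stacking and sorting would be done in whatever environment feeds the data to SUDAAN.

```python
from operator import itemgetter

# Hypothetical records: invented stratum and PSU codes, not the paper's tables.
data_2004 = [
    {"stratum": 1, "psu": 1, "year": 2004},
    {"stratum": 1, "psu": 2, "year": 2004},
    {"stratum": 2, "psu": 3, "year": 2004},
    {"stratum": 2, "psu": 4, "year": 2004},
]
data_2005 = [
    {"stratum": 1, "psu": 2, "year": 2005},
    {"stratum": 1, "psu": 5, "year": 2005},
    {"stratum": 2, "psu": 4, "year": 2005},
    {"stratum": 2, "psu": 6, "year": 2005},
]


def combine(*data_sets):
    """Stack several yearly data sets into one list of records."""
    combined = []
    for data_set in data_sets:
        combined.extend(data_set)
    return combined


def sort_for_nest(records, nest_vars):
    """Sort records by the variables named on the nest statement, in order.

    Python's sort is stable, so records that tie on the nest variables keep
    their input order (here, 2004 before 2005).
    """
    return sorted(records, key=itemgetter(*nest_vars))


combined = combine(data_2004, data_2005)

# Year treated as a stratification variable (repeated cross-section analysis).
as_repeated = sort_for_nest(combined, ("year", "stratum", "psu"))

# Year left off the nest statement, so years are clustered within PSU.
as_panel = sort_for_nest(combined, ("stratum", "psu"))

if __name__ == "__main__":
    for record in as_panel:
        print(record["stratum"], record["psu"], record["year"])
```

In the second ordering, the two records for each overlapping PSU sit next to each other, which mirrors the idea of year being clustered within PSU.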
The following code only includes the stratum variable as the stratification variable:

    proc descript data = dataset design = wr;
    nest stratum PSU;
    weight aweight;
    class year / nofreqs;
    var avar;
    contrast year = ( -1 1 ) / name = "2007 2006 Contrast";
    print nsum mean semean t_mean p_mean;
    run;

Consequently, this SUDAAN code treats the years as clustered within the PSU and allows us to take advantage of the longitudinal structure of the data.

6.2 Cautionary Note

The focus of the previous SUDAAN code is using a combined data set to produce contrasts between years. The number of degrees of freedom (d.f.) that SUDAAN uses for our simple example is correct for this purpose; it would use 4 d.f. There is a caution when one analyzes a single year's data. Each single-year data set for our simple example would have 3 d.f., and this is what SUDAAN would use for the single-year data sets. For the combined data set, SUDAAN would use 4 d.f. even for the single-year analysis. Consequently, one should use the DDF = 3 option for the combined data set for single-year analysis or use the single-year data sets.

7. Example

We have included one example using simulated data so that one can see the potential impact of not taking the longitudinal data structure into account, and the possibility of getting smaller standard errors when one does. The simulated data set had 500 observations in a single stratum, λ = 1, and ρ = 0.66. The results of analyzing the data as a repeated cross-sectional data structure and as a panel data structure are presented in Table 5.

Table 5: Results of Simulated Data Set Analyzed as a Repeated Cross-Section Survey and a Panel Survey

                            Repeated Cross-Section    Panel
    Contrast Mean (CM)      0.11                      0.11
    SE CM                   0.06                      0.04
    Lower Limit 95% CI CM   -0.01                     0.04
    Upper Limit 95% CI CM   0.23                      0.18
    T-test CM               1.78                      3.05
    P-Value T-test CM       0.0757                    0.0024

Note that the estimates for the contrast mean are the same, but the standard error estimate for the contrast mean is smaller for the panel survey than for the repeated cross-section survey. This difference carries through to the confidence intervals and testing, which results in a statistically significant difference at the α = 0.05 level for the panel but not for the repeated cross-section survey. Hence, the analytic approach can make a difference in your interpretation of the output.

8. Recommendations

The main point is to take advantage of the longitudinal data structure and the possibly smaller standard errors. One can account for the longitudinal data structure easily using SUDAAN to produce contrasts. Finally, use a data set that combines years of information for contrasts, or for single-year analysis using the DDF option. One could also use the single-year data sets for the single-year analysis.

References

Duncan, Greg and Kalton, Graham (1987), Issues of Design and Analysis of Surveys Across Time, International Statistical Review, Vol. 55, No. 1, pp. 97-117.

Kish, Leslie (1983), Data Collection for Details over Space and Time, Statistical Methods and the Improvement of Data Quality, Ed. T. Wright, New York: Academic Press, pp. 73-84.

Kish, Leslie (1986), Timing of Surveys for Public Policy, The Australian Journal of Statistics, Vol. 28, pp. 1-12.

Research Triangle Institute (2004), SUDAAN Language Manual, Release 9.0, Research Triangle Park, NC: Research Triangle Institute.
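As a supplementary check on the Section 7 example, the simplified variance expressions of Section 4 can be evaluated numerically. The Python sketch below assumes the equal-variance, equal-sample-size form var(m2 - m1) = (2S²/n)(1 - λρ); the function names `var_diff` and `se_ratio` are hypothetical, and the comparison with Table 5 is only approximate because the tabulated values are rounded and come from a design-based estimate rather than from this formula.

```python
import math


def var_diff(s2, n, rho=0.0, overlap=0.0):
    """Variance of m2 - m1 under the simplifying assumptions of Section 4.

    s2      : common variance S^2 in both time periods
    n       : common sample size in both time periods
    rho     : correlation of the outcome between the two periods
    overlap : share of the sample common to both periods (lambda);
              0 gives the repeated cross-section case, 1 the fixed panel
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie between 0 and 1")
    return 2.0 * s2 / n * (1.0 - overlap * rho)


def se_ratio(rho, overlap):
    """Standard error of the difference relative to the no-overlap design."""
    return math.sqrt(1.0 - overlap * rho)


if __name__ == "__main__":
    # Values stated for the simulated example: lambda = 1, rho = 0.66.
    print("SE ratio implied by the formula:", round(se_ratio(0.66, 1.0), 3))
    # Both t statistics in Table 5 share the same contrast mean, so their
    # ratio approximates the ratio of the two standard errors.
    print("SE ratio implied by Table 5 t statistics:", round(1.78 / 3.05, 3))
```

With λ = 1 and ρ = 0.66 the formula implies a standard error ratio of roughly 0.58, which is close to the ratio of the two t statistics reported in Table 5.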