Assignments
Analysis of Longitudinal Data: a multilevel approach

Frans E.S. Tan
Department of Methodology and Statistics
University of Maastricht, The Netherlands
Maastricht, Jan 2007

Correspondence: Frans E.S. Tan, Methodology and Statistics, University of Maastricht, P.O. Box 616, 6200 MD Maastricht, The Netherlands. Tel: +31433882278, e-mail: frans.tan@stat.unimaas.nl
See the guidelines for performing a multilevel/longitudinal data analysis at the back.

1. Growth data (SPSS system file: growthdata.sav). (Potthoff & Roy)

Study design:
- Orthodontic growth measurements for 11 girls and 16 boys.
- For each subject the distance (in mm) between the pituitary and the pterygomaxillary fissure was recorded at ages 8, 10, 12 and 14. These two locations can be easily identified on x-ray.

Variables:
- Distance
- Sex: 0 = boy; 1 = girl
- Age: age in years

Design (distance measurements, all subjects):
  age   8    10   12   14
        X    X    X    X

Analyse the data by comparing (using the linear regression method) growth and growth velocity between boys and girls. Consult the following sub-questions:
a. Plot the distance versus age for boys and girls separately (in one plot, by means of set markers by sex). Also plot the fitted lines through the scatter plots (in the chart option).
b. Perform a standard linear regression analysis of the following model:
   $\text{distance} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Sex} + \beta_3\,\text{Age}\times\text{Sex} + \varepsilon$
   Save the unstandardised predicted values.
c. Plot the predicted values versus age for boys and girls separately in one plot. Which model describes the observed data best? Argue whether there is a difference between boys and girls with respect to the growth velocity of the measured distance. Show that the regression parameter $\beta_3$ can be interpreted as the difference between the regression slopes of both sexes.

2. Consider the study about the relationship between alcohol consumption and violent behaviour (SPSS system file: alc_violent.sav).

Study design:
- Random sample of five subjects.
- Each subject was measured five times between 1950 and 1958.
- Not all subjects were measured in the same year (unbalanced design).

Goal of the study: testing the hypothesis that alcohol is positively related to violent behaviour.

Design: X = violent/alcohol measurement.
[Design chart: five measurements (X) for each of subjects 1 to 5, placed on a time axis running from 1950 to 1958.]

Analyse the relationship between alcohol consumption and violent behaviour. Is there a difference between the relationship at the group level and at the subject level? Consult the following sub-questions:
a. Plot violent versus alcohol. Describe what you see.
b. Perform a standard linear regression analysis of violent on alcohol and save the unstandardised predicted values. Plot the predicted values versus alcohol to visualize your findings and interpret your results.
c. Plot violent versus alcohol for each subject (use set markers by subject).
d. Perform a standard linear regression analysis of the relationship between alcohol and violent behaviour, with subject as a discrete covariate (use dummy variables) in the regression model (also save your unstandardised predicted values). Compare your results with those of (b) (do not forget to plot to visualize your findings; set markers by subject). Can you explain the difference?

3. Growth data (Potthoff & Roy) (SPSS system file: growthdata.sav).
Consider the study about the orthodontic growth of boys and girls. Compare growth and growth velocity between boys and girls as in assignment 1, using the SPSS option Mixed (see guidelines). Analyse with OLS and with random effects and compare the two methods. Consult the following sub-questions:
a. Plot the subject-specific profiles (a plot of individual changes over time). See guidelines.
b. Plot the mean profiles (a plot of mean changes over time, separately for boys and girls).
   Question: Compare the growth velocity of boys and girls at the group level and at the subject level. Describe what you see based on these plots.
c. Perform an OLS regression analysis with Mixed Models of the following model:
   $\text{distance} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Sex} + \beta_3\,\text{Age}\times\text{Sex} + \varepsilon$
   Questions: Compare with your findings from assignment 1. What is the interpretation of $\beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Sex} + \beta_3\,\text{Age}\times\text{Sex}$? Indicate this in the plots.
d. Perform a random intercept model.
   Questions:
   - Is the interaction between age and sex significant? Compare the output with that of (c). Explain the discrepancy with respect to the standard errors of $b_3$.
   - What is the interpretation of $\beta_{0i} + \beta_1\,\text{Age} + \beta_2\,\text{Sex} + \beta_3\,\text{Age}\times\text{Sex}$ for a specific subject i? Indicate this in the plots.
   - What is the interpretation of the first-level variance (R cov-matrix)?
   - What is the interpretation of the second-level variance (G cov-matrix)?
   - What is the interpretation of the overall variances and covariances (correlations) (V cov-matrix)? Indicate these in the plots.
   - Determine the V cov-matrix and V corr-matrix and interpret the ICC.
e. Perform a marginal model with the (homogeneous) compound symmetry covariance structure.
   Question: Compare the variances and covariances of the output with those of (d).
f. Perform a random intercept model with an AR(1) serial correlation. Compare the results with those of (d).
g. Perform a random slope model (random slope for age).
   Question: Compare all the output and argue which model you would prefer. Determine the V cov-matrix of the models in (f) and (g).

4. Aggregation of longitudinal data; ecological fallacy
Consider the study about the relationship between alcohol consumption and violent behaviour (SPSS system file: alc_violent.sav). Analyse the relationship between alcohol and violent behaviour as in assignment 2, now using the SPSS option Mixed. Analyse with OLS and with random effects and compare the two methods. Consult the following sub-questions:
a. Plot violent vs. alcohol for each subject (use the Interactive scatter plot option).
b. What can you say about the subject-specific relationship between alcohol and violent behaviour?
c. Perform an OLS regression analysis with Mixed Models of the relationship between alcohol and violent behaviour.
d. Compare the output in (c) with what you expected considering the plot in (a).
e. Perform a mixed model analysis with a random intercept model to study the subject-specific relationship between alcohol and violent behaviour, and compare this output with the previous ones (set Estimation, Maximum scoring steps: 10).
f. Compare the results from (e) with the results from assignment (2.d).
g. Discuss the overall results of the analysis that you have performed and relate the results to the "ecological fallacy" phenomenon.

5. Interpersonal proximity
Description of the study (teacher.sav)
Brekelmans and Creton (1993) studied the development over time of evaluations of teachers by their pupils. Starting from the first year of their teaching career, teachers were evaluated on their interpersonal behaviour in the classroom. This happened repeatedly, at intervals of about one year. Results are presented about the proximity dimension, representing the degree of cooperation or closeness between a teacher and his or her students. The higher the proximity score of a teacher, the more cooperation is perceived by his or her students.
There are four measurement occasions: after 0, 1, 2, and 3 years of experience. Thus, the time variable assumes the values 0 through 3. A total of 51 teachers were studied. The number of observations for the 4 moments decreased from 46 at t=0 to 32 at t=3. Hence, we are dealing with an unbalanced design. Non-response at various moments may be considered to be random.
Another variable in the dataset is gender. Gender (0 = male; 1 = female) could possibly be a predictor of the proximity score of the teacher. It is also possible that gender has an influence on the relationship between the measurement occasion and the proximity score.
Note: In the data file there is also a variable occ_cat. This variable is identical to the variable occ.

Design (proximity measurements, all teachers; there are missing observations):
  occasion   0    1    2    3
             X    X    X    X

a. Discuss the multilevel design.
b. Plot the teacher-specific proximity score vs. occasion (use the variable occ) for each gender, and plot the gender-specific proximity score vs. occasion.
c. What can you say about the (teacher-specific/gender-specific) pattern of the proximity score across occasions?
d. Argue that the following model specification makes sense:
   $\text{Prox} = \beta_0 + \beta_1\,\text{occ}_0 + \beta_2\,\text{occ}_1 + \beta_3\,\text{occ}_2 + \beta_4\,\text{Sex} + \beta_5\,\text{occ}_0\times\text{Sex} + \beta_6\,\text{occ}_1\times\text{Sex} + \beta_7\,\text{occ}_2\times\text{Sex} + \text{random part} + R$
   The variable $\text{occ}_i$ denotes the dummy variable for occasion i, i = 0, 1, 2. Can you make an educated guess whether a random intercept or a random slope (with random intercept) would be most appropriate to describe the data?
e. Perform an OLS regression of the model specified in (d), with occasion as a categorical variable (factor) and the interaction between occasion and gender.
f. Fit the following models in succession, with the same fixed part as in (e) (set Estimation, Maximum scoring steps: 10):
   0. A random intercept model.
   1. A random intercept/slope model (take the random slope of the quantitative variable occasion; use Unstructured as covariance type for the random effects).
   2. A random intercept/slope model with an AR(1) serial correlation and homogeneous variances.
   3. A random intercept model with AR(1) and heterogeneous variances.
   4. A model with AR(1) and heterogeneous variances.
   Compare with (e).
g. Calculate the corresponding V cov-matrices following the calculations mentioned in the transparencies.
h. Which model would you consider most appropriate, and why?
i. Would you conclude that there is a difference in change of proximity score between male and female teachers (fit the model for male and female teachers separately)?

6. Growth data (Potthoff & Roy) (SPSS system file: growthdata.sav).
Consider the study about the orthodontic growth of boys and girls.
a. Run the OLS model
   $\text{distance} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Sex} + \beta_3\,\text{Age}\times\text{Sex} + \varepsilon$
   and plot the residuals against age.
   Go to Graphs > Interactive > Line...
   Click the reset button.
   Drag the variable resid_1 to the box for the y-variable.
   Drag the variable age to the box for the x-variable.
   Drag the variable Subj to the 'color' box.
   Right-click on the variable Subj and select categorical.
   Drag the variable sex to the 'style' box. Select convert.
   Click paste and run the syntax.
   Question: What can you say about the variance over time?
b. Plot a scatter-plot matrix of the residuals and determine the correlation matrix for the different time points.
   First, transpose the data: the values of Age should be transposed into columns.

   Before (one row per subject per age):
   respnr  sex  age  growth
   1       1    8    21
   1       1    10   20
   1       1    12   21.5
   1       1    14   23
   2       1    8    21
   2       1    10   21.5
   2       1    12   24
   2       1    14   25.5

   After (one row per subject):
   respnr  sex  dist_8  dist_10  dist_12  dist_14
   1       1    21      20       21.5     23
   2       1    21      21.5     24       25.5

   Transposing rows into columns:
   Rename the variable distance to dist and resid_1 to res with the following syntax:
     Rename variables distance = dist.
     Rename variables resid_1 = res.
   (Note: If you do not rename the variables, the final variable names will be too long after transposing the data.)
   Go to Data > Restructure. SPSS asks whether you want to save: do not save (it is not necessary to save the new dist and res variables).
   Choose 'Restructure selected cases into variables'.
   Click Next ('Volgende' in the Dutch version of SPSS).
   Subj is the Identifier variable, Age is the Index variable.
   Click Next 3 times.
   Choose 'Paste the syntax generated by the wizard into a syntax window'.
   Click Finish ('Voltooien').
   Run the syntax.
   Save your data file under a new name.

   Scatter-plot matrix of the residuals:
   Go to Graphs > Scatter.
   Choose matrix.
   Select res.8, res.10, res.12 and res.14 and put them in the 'Matrix Variables' box.
   Click paste and run the syntax.

   Correlation matrix of the responses:
   Go to Analyze > Correlate > Bivariate...
   Select the variables dist.8, dist.10, dist.12 and dist.14.
   Click paste and run the syntax.

   Questions: Do the scatter plots change with time? Which model will probably come out based on the exploratory analysis?
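The restructuring step above (one row per subject per age turned into one row per subject) can also be sketched outside SPSS. The following is a minimal Python illustration, not the SPSS wizard itself; it reuses the two example subjects from the table above, and the function name to_wide is a hypothetical helper introduced only for this sketch.

```python
from collections import defaultdict

# Long-format rows copied from the example table: (respnr, sex, age, growth).
long_rows = [
    (1, 1, 8, 21.0), (1, 1, 10, 20.0), (1, 1, 12, 21.5), (1, 1, 14, 23.0),
    (2, 1, 8, 21.0), (2, 1, 10, 21.5), (2, 1, 12, 24.0), (2, 1, 14, 25.5),
]


def to_wide(rows, prefix="dist"):
    """Turn one row per (subject, age) into one row per subject.

    The subject number plays the role of the Identifier variable and
    age the role of the Index variable in the Restructure wizard.
    """
    wide = defaultdict(dict)
    for respnr, sex, age, value in rows:
        wide[respnr]["respnr"] = respnr
        wide[respnr]["sex"] = sex
        wide[respnr][f"{prefix}_{age}"] = value
    # Return the subjects in ascending order of their identifier.
    return [wide[key] for key in sorted(wide)]


wide_rows = to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each resulting row holds dist_8, dist_10, dist_12 and dist_14 for one subject, which is the layout needed for the scatter-plot matrix and the correlation matrix.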
c. Consider some other (reasonable) alternative models with covariance structures different from the random intercept model, and determine the most adequate model using the basic guidelines mentioned in the course.
d. Check whether the proximity model in assignment 5 will also be obtained following the basic guidelines mentioned in the course.

7. Life event study (SPSS system file: lifesubset.sav). (Nieboer et al.)
a. Reconstruct the analysis of the Life event data. Follow the basic guidelines mentioned in the course. Use both the gain score and the ancova approach. For the gain score approach use the 'lifesubset.sav' file. For the ancova approach use the 'lifeancova.sav' file.
b. Is the difference between male and female at time point 12 significantly different from that at time point 3?
c. What are your conclusions concerning the difference between male and female on the one hand, and carers and widowers on the other hand, over time, for both the gain score analysis and the ancova analysis?
d. Explain why an interaction term between Gender and Time is specified in the gain score analysis and not in the ancova analysis.
e. Which model would you prefer in this case?

8. Alzheimer study (SPSS system files: Alzheimer.sav, Alzheimer_vert.sav, Alzheimer_hor.sav)
a. Investigate the missingness pattern of the Alzheimer data (see transparencies). Assume that missingness is due to monotone dropout.
   Open the file Alzheimer.sav.
   Go to Analyze > Descriptive Statistics > Frequencies.
   Click week into the 'Variable(s)' box.
   Deduce the missingness pattern from the frequency table and fill in the following table.

   Pattern     -1   0   1   2   4   6   8   10   counts
   1            1   1   1   1   1   1   1   1
   2            1   1   1   1   1   1   1   0
   3            1   1   1   1   1   1   0   0
   4            1   1   1   1   1   0   0   0
   5            1   1   1   1   0   0   0   0
   6            1   1   1   0   0   0   0   0
   7            1   1   0   0   0   0   0   0
   Frequency
   %

b. Plot the proportion of patients in the study vs. time for:
   - center
   - gender
   - treatment
   To plot the proportion of patients vs. time for each center we need:
   - the total number of patients per center
   - the number of patients in the study per center per time point

   Guidelines for a plot per center (general case when large data sets are involved):

   Number of patients per center
   Sort the data in ascending order. Start with an equal number of weeks per patient.
   Open the Alzheimer_vert.sav file.
   Go to Data > Sort cases.
   Click center into the 'Sort by' box.
   Create data with the number of patients per center:
   Go to Data > Aggregate.
   Click center into the 'Break variable' box.
   Select the box 'Save number of cases in break group variable'.
   Change the name of the variable (n_break) to nmeas.
   Click on the button 'File...' and change the directory to which the new file is written into...
   Change the name of the file into 'alz_aggr_center_npat.sav'.
   Open the new aggregate file.
   The variable nmeas has to be divided by 8 to get the number of patients per center instead of the number of measurements per center:
   Go to Transform > Compute.
   Click nmeas into the 'Numeric Expression' box. Add the expression '/8'.
   Type npat in the 'Target variable' box.
   Save the file.

   Number of patients in study per center per time point
   Sort the data in ascending order of center and week.
   Open the Alzheimer_vert.sav file.
   Go to Data > Sort cases.
   Click center into the 'Sort by' box.
   Click week into the 'Sort by' box.
   Go to Data > Aggregate.
   Click center into the 'Break variable' box.
   Click week into the 'Break variable' box.
   Click alz into the 'Aggregate variable' box.
   Click on the button 'Name&Label' and change the name to patinstu.
   Click on the button 'Function' and select the option unweighted.
   Click on the button 'File...' and change the directory to which the new file is written into...
   Change the name of the file into 'alz_aggr_center.sav'.

   Match the two aggregate files
   Open the file 'alz_aggr_center.sav'.
   Go to Data > Merge files > Add variables.
   Select the file 'alz_aggr_center_npat.sav'.
   Select the box 'Match cases on key variables in sorted files'.
   Select 'External file is keyed table'.
   Click center into the 'Key variables' box.

   Calculate the proportion of patients in the study per week
   Go to Transform > Compute.
   Type the expression 'patinstu/npat' in the 'Numeric Expression' box.
   The target variable is p_instud.

   Plot the proportion of patients in the study vs. time with separate lines for center
   Follow the guidelines mentioned under 'Mean profiles'.

   Follow the same procedure for gender and treatment.
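The aggregate, merge and compute steps above amount to three small calculations: count the rows per center and divide by 8 (npat), count the non-missing scores per center per week (patinstu), and divide the two (p_instud). A minimal Python sketch of that logic follows. The three patients and their scores are invented for illustration and are not taken from the Alzheimer files; make_patient is a hypothetical helper.

```python
from collections import defaultdict

WEEKS = [-1, 0, 1, 2, 4, 6, 8, 10]  # the 8 measurement occasions per patient


def make_patient(center, patient, last_week):
    """Hypothetical patient in vertical format: one record per week.

    The score is observed up to last_week and missing (None) afterwards,
    which mimics monotone dropout. The value 20.0 is arbitrary.
    """
    return [(center, patient, w, 20.0 if w <= last_week else None) for w in WEEKS]


# Center 1: one completer and one dropout after week 2. Center 2: one completer.
records = make_patient(1, 1, 10) + make_patient(1, 2, 2) + make_patient(2, 3, 10)

nmeas = defaultdict(int)     # number of records (measurement slots) per center
patinstu = defaultdict(int)  # number of non-missing scores per (center, week)
for center, _patient, week, alz in records:
    nmeas[center] += 1
    patinstu[(center, week)] += 1 if alz is not None else 0

# Every patient contributes len(WEEKS) records, so dividing gives patients per center.
npat = {center: n / len(WEEKS) for center, n in nmeas.items()}

# Proportion of patients still in the study per center per week.
p_instud = {(c, w): k / npat[c] for (c, w), k in patinstu.items()}

for (center, week), prop in sorted(p_instud.items()):
    print(center, week, prop)
```

The division by the number of weeks only works because every patient has the same number of rows in the vertical file, which is why the guidelines ask to start with an equal number of weeks per patient.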
c. Perform a logistic regression to evaluate whether the occurrence of missing values is predicted by treatment, gender, age, center and the 1st and 2nd measurement.
   Open the file 'Alzheimer_hor.sav'.
   The variable 'miss' indicates whether a patient has one or more missing values on the Alzheimer score.
   Go to Analyze > Regression > Binary Logistic.
   Click the variable miss into the 'Dependent:' box.
   Click the variables treatm, gender, center, age, alz.1 and alz.2 into the 'Covariates:' box.
   Click on the button Categorical:
   Click treatm and center into the 'Categorical covariates:' box.
   Select 'first' as the Reference category for both variables. Click on Change.
   Click Continue.
   Choose 'Backward LR' as method for the analysis.
   Click ok.
   Question: On which predictors does the missingness depend?
d. Is the underlying missing value mechanism MCAR or MAR (assume not MNAR)?

9. Test assignment longitudinal data analysis
Beating the blues (system file: data81.sav)
A clinical trial was designed to assess the effectiveness of an interactive program using multimedia techniques for the delivery of cognitive behavioral therapy for depressed patients, known as Beating the Blues (BtB). In a randomized controlled trial of the program, patients with depression recruited in primary care were randomized to either the BtB program or to Treatment as Usual (TAU). The variable Treat represents these two treatments (Treat = 1 if BtB, and Treat = 0 if TAU).
The outcome measure (Depress) used in the trial was the Beck Depression Inventory II, with higher values indicating more depression. Measurements of this variable were made on five occasions:
- prior to treatment (Bdipre)
- follow-up at 2, 3, 5, and 8 months after treatment (months)
Further remarks:
- There is a considerable number of missing values caused by patients dropping out of the study.
- There are repeated measurements of the outcome taken on each patient post treatment, along with a baseline pre-treatment measurement.

The question of most interest about these data is whether the BtB program does better than TAU in treating depression. Perform a longitudinal analysis. Write a report of your findings. Include the following considerations in your analysis:
- Determine the pattern of missing observations and investigate whether the underlying missing-data mechanism is MCAR or MAR.
- Determine, by the choice of the design and model selection, which mixed effects model is most suitable.
- Are there any interactions involved?
- Compare the approach based on ancova with that based on change scores. Which approach is preferable?
- Determine the corresponding V cov-matrix of the final model and compare it with the observed covariance matrix.
- Interpret your results. What is your final conclusion regarding the research question?
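Several assignments ask for the V cov-matrix, the V corr-matrix and the ICC. As a reminder of how these are typically assembled, here is a minimal Python sketch of the standard mixed-model identity V = Z G Z' + R for a single subject. It assumes R = sigma2_e * I (no serial correlation), and all variance values are hypothetical placeholders rather than estimates from any of the course data sets; the calculations in the transparencies remain the reference.

```python
import math


def marginal_cov(times, G, sigma2_e):
    """Return V = Z G Z' + sigma2_e * I for one subject.

    G is 1x1 for a random intercept model, or 2x2 (intercept, slope)
    for a random intercept/slope model. Assumption: R = sigma2_e * I.
    """
    q = len(G)
    # Design matrix of the random effects: a column of ones, plus time if q == 2.
    Z = [[1.0] if q == 1 else [1.0, float(t)] for t in times]
    n = len(times)
    V = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            V[i][j] = sum(Z[i][k] * G[k][l] * Z[j][l]
                          for k in range(q) for l in range(q))
            if i == j:
                V[i][j] += sigma2_e  # first-level (residual) variance on the diagonal
    return V


def to_corr(V):
    """Convert a covariance matrix into the corresponding correlation matrix."""
    n = len(V)
    return [[V[i][j] / math.sqrt(V[i][i] * V[j][j]) for j in range(n)]
            for i in range(n)]


ages = [8, 10, 12, 14]
sigma2_u, sigma2_e = 3.0, 2.0  # hypothetical second- and first-level variances

# Random intercept model: constant variance, constant covariance sigma2_u.
V_ri = marginal_cov(ages, [[sigma2_u]], sigma2_e)
icc = sigma2_u / (sigma2_u + sigma2_e)

# Random intercept/slope model with a hypothetical unstructured G matrix.
G_slope = [[4.0, -0.2], [-0.2, 0.05]]
V_rs = marginal_cov(ages, G_slope, sigma2_e)

print("ICC:", icc)
for row in to_corr(V_rs):
    print([round(value, 3) for value in row])
```

In the random intercept case every off-diagonal element of the V corr-matrix equals the ICC, which is the compound symmetry pattern; with a random slope the variances and correlations change with age.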
General guidelines for performing a multilevel and longitudinal data analysis with the SPSS option Mixed Linear

An independent variable should be specified as a covariate in SPSS if it is a quantitative variable or a qualitative variable with 2 categories. It should be specified as a factor if it is a qualitative variable with more than 2 categories. The Mixed Models option in SPSS will automatically compute dummy variables from the qualitative variable, with the highest category as the reference.

Plotting longitudinal data.

Subject-specific profiles.
Open your dataset.
Go to Graphs > Interactive > Line...
Click on reset.
Drag the dependent variable to the box for the y-variable.
Drag the (time-dependent) independent variable to the box for the x-variable.
Drag the identification variable to the 'colour' box.
Drag the grouping variable to the 'style' box. Select convert.
Right-click on the identification variable and select categorical.

Mean profiles.
Go to Graphs > Interactive > Line...
Click on reset.
Drag the dependent variable to the box for the y-variable.
Drag the (time-dependent) independent variable to the box for the x-variable.
Drag the grouping variable to the 'colour' box.
Right-click on the grouping variable and select categorical.
Go to the tab 'Dots and Lines' and select Dots.

Performing an OLS regression model.
Go to Analyze > Mixed models > Linear...
Click on the reset button.
Click the dependent variable into the 'Dependent variable:' box.
Click the quantitative (or dichotomous) independent variables into the 'Covariate(s):' box.
Click the qualitative independent variables into the 'Factor(s):' box.
Click the fixed button.
Select the independent variables. If the full model is required, then factorial should be selected, and the independent variables should be selected simultaneously.
Click the Add button.
Click the statistics button and select the following checkboxes: Parameter estimates and Covariances of residuals.

Performing a random effects model.
Go to Analyze > Mixed models > Linear...
Click on the reset button.
Click the identification variable into the 'Subjects:' box.
Click the dependent variable into the 'Dependent variable:' box.
Click the quantitative independent variables into the 'Covariate(s):' box.
Click the qualitative independent variables into the 'Factor(s):' box.
Click the fixed button.
Select the independent variables. If the full model is required, then factorial should be selected, and the independent variables should be selected simultaneously.
Click the Add button.
Click the random button.
Select the 'Include Intercept' checkbox if a random intercept is required.
If a random slope is required, select the relevant independent variable and put it in the Model box. Choose Unstructured as Covariance type.
Click the identification variable from the 'Subjects:' box into the 'Combinations:' box.
Click the statistics button and select the following checkboxes: Parameter estimates, Covariances of random effects and Covariances of residuals.

Specifying a serial correlation.
Go to Analyze > Mixed models > Linear...
Click on the reset button.
Click the identification variable into the 'Subjects:' box and age into the 'Repeated:' box.
Choose a covariance structure option as the Repeated Covariance type.
Note: there are several covariance structures, such as AR(1) (which is homogeneous), AR(1) heterogeneous, unstructured, Toeplitz, scaled identity, etc.
Click the dependent variable into the 'Dependent variable:' box, and age and sex into the 'Covariate(s):' box.
Click the statistics button and select the following checkboxes: Parameter estimates, Covariances of random effects and Covariances of residuals.
Click the fixed button.
Select the required independent variables and click the Add button.

Performing an LR test: calculation of the corresponding p-value.
Go to Transform > Compute...
Click the function CDF.CHISQ into the 'Numeric expression:' box.
Type '1- ' before the function.
Replace the first question mark by the difference in -2 restricted LL between the 2 models.
Replace the second question mark by the number of degrees of freedom.
Fill in a name for the target variable in the 'Target variable' box.
Click ok.
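The Compute step above evaluates 1 - CDF.CHISQ(difference, df). For readers who want to check the number by hand, the same upper-tail probability can be sketched in plain Python. The function chi2_sf below is a hypothetical helper that uses a series expansion of the regularized incomplete gamma function; the two -2 restricted LL values and the degrees of freedom are invented for illustration only.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square(df) variable, i.e. 1 - CDF.

    Uses the series for the regularized lower incomplete gamma function
    P(a, x) with a = df / 2 and x = stat / 2. Because the result is
    computed as 1 - P, very small p-values lose precision.
    """
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = total = 1.0
    n = 0
    # Sum x^n / ((a + 1) * ... * (a + n)) until the terms become negligible.
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= x / (a + n)
        total += term
    p_lower = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - p_lower)


# Hypothetical -2 restricted LL values of two nested models.
m2ll_simple = 437.5    # e.g. the model with fewer covariance parameters
m2ll_complex = 432.6   # e.g. the model with more covariance parameters
difference = m2ll_simple - m2ll_complex
df = 2                 # assumed difference in number of parameters

p_value = chi2_sf(difference, df)
print("LR statistic:", round(difference, 3), "df:", df, "p-value:", round(p_value, 4))
```

The first argument corresponds to the first question mark in CDF.CHISQ and the second argument to the second question mark.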