# Analysis of Longitudinal Data with Missing Values.

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Analysis of Longitudinal Data with Missing Values. Methods and Applications in Medical Statistics. Ingrid Garli Dragset Master of Science in Physics and Mathematics Submission date: June 2009 Supervisor: Jo Eidsvik, MATH Co-supervisor: Stian Lydersen, IKM Norwegian University of Science and Technology Department of Mathematical Sciences

2

3 Problem Description In medical research, data sets are seldom complete. That is, some of the values that should have been recorded, are missing for some of the persons in the study. Traditionally, complete case analysis has been used in such situations. That is, only cases (persons) with complete data are included in the analysis. In addition reducing the sample size, this method typically gives biased results. Better methods include multiple imputation (MI), full maximum likelihood (ML), mixed models, and generalized estimating equations (GEE). The candidate shall give a short description of alternative methods for handling missing values. The main part of the thesis consists of analyses of longitudinal data from a research project at the Faculty of Medicine, using alternative methods. Assignment given: 21. January 2009 Supervisor: Jo Eidsvik, MATH

4

6 ii

7 Summary Missing data is a concept used to describe the values that are, for some reason, not observed in datasets. Most standard analysis methods are not feasible for datasets with missing values. The methods handling missing data may result in biased and/or imprecise estimates if methods are not appropriate. It is therefore important to employ suitable methods when analyzing such data. Cardiac surgery is a procedure suitable for patients suffering from different types of heart diseases. It is a physical and psychical demanding surgical operation for the patients, although the mortality rate is low. Health-related quality of life (HRQOL) is a popular and widespread measurement tool to monitor the overall situation of patients undergoing cardiac surgery, especially in elderly patients with naturally limited life expectancies [Gjeilo, 2009]. There has been a growing attention to possible differences between men and women with respect to HRQOL after cardiac surgery. The literature is not consistent regarding this topic. Gjeilo et al. [2008] studied HRQOL in patients before and after cardiac surgery with emphasis on differences between men and women. In the period from September 2004 to September 2005, 534 patients undergoing cardiac surgery at St Olavs Hospital were included in the study. HRQOL were measured by the self-reported questionnaires Short-Form 36 (SF-36) and the Brief Pain Inventory (BPI) before surgery and at six and twelve months follow-up. The SF-36 reflects health-related quality of life measuring eight conceptual domains of health [Loge and Kaasa, 1998]. Some of the patients have not responded to all questions, and there are missing values in the records for about 41% of the patients. Women have more missing values than men at all time points. The statistical analyses performed in Gjeilo et al. [2008] employ the complete-case method, which is the most common method to handle missing data until recent years. The complete-case method discards all subjects with unobserved data prior to the analyses. It makes standard statistical analyses accessible and is the default method to handle missing data in several statistical software packages. The complete-case method gives correct estimates only if data are missing completely at random without any relation to other observed or unobserved measurements. This assumption is seldom met, and violations can result in incorrect estimates and decreased efficiency. The focus of this paper is on improved methods to handle missing values in longitudinal data, that is observations of the same subjects at multiple occasions. Multiple imputation and imputation by expectation maximization are general methods that can be applied with many standard analysis methods and several missing data situations. Regression models can also give correct estimates and are available for longitudinal data. In this paper we present the theory of these approaches and application to the dataset introduced above. The results are compared to the complete-case analyses published in Gjeilo et al. [2008], and the methods are discussed with respect to their properties of handling missing values in this setting. The data of patients undergoing cardiac surgery are analyzed in Gjeilo et al. [2008] with respect to gender differences at each of the measurement occasions; Presurgery, six months, and twelve months after the operation. This is done by a two-sample Student s t-test assuming unequal variances. All patients observed at the relevant occasion is included in the analyses. Repeated measures ANOVA are used to determine gender differences in the evolution of the iii

8 HRQOL-variables. Only patients with fully observed measurements at all three occasions are included in the ANOVA. The methods of expectation maximization (EM) and multiple imputation (MI) are used to obtain plausible complete datasets including all patients. EM gives a single imputed dataset that can be analyzed similar to the complete-case analysis. MI gives multiple imputed datasets where all dataset must be analyzed separately and their estimates combined according to a technique called Rubin s rules. Results of both Student s t-tests and repeated measures ANOVA can be performed by these imputation methods. The repeated measures ANOVA can be expressed as a regression equation that describes the HRQOL-score improvement in time and the variation between subjects. The mixed regression models (MRM) are known to model longitudinal data with non-responses, and can further be extended from the repeated measures ANOVA to fit data more sufficiently. Several MRM are fitted to the data of cardiac surgery patients to display their properties and advantages over ANOVA. These models are alternatives to the imputation analyses when the aim is to determine gender differences in improvement of HRQOL after surgery. The imputation methods and mixed regression models are assumed to handle missing data in an adequate way, and gives similar analysis results for all methods. These results differ from the complete-case method results for some of the HRQOL-variables when examining the gender differences in improvement of HRQOL after surgery. iv

9 Contents 1 Introduction 1 2 Theory Missing data Notation MCAR MAR MNAR Missing data patterns Criteria for methods to handle missing data Simple but deficient methods Complete-case analysis Single imputation Imputation methods Expectation maximization Multiple imputation Why use MI? Imputation model Properties Full model-based methods assuming MNAR Selection models Pattern-mixture models Longitudinal data analysis Analysis considerations General approaches Repeated measures ANOVA v

10 3.3.1 Assumptions of repeated measures ANOVA Repeated measures ANOVA table Mixed regression models Introduction Random intercepts mixed regression models Random slopes mixed regression model Random slopes mixed regression model, quadratic time trend Curvilinear mixed regression models Comparison of models Covariance pattern models Covariance patterns Analyses of discrete outcome variables Generalized linear mixed regression models Generalized estimating equations The data of patients undergoing cardiac surgery SF-36 questionnaire Missing-data structure Observed covariance and correlation matrices Complete-case analyses Student s t-test Implementation in SPSS and Stata Results, Student s t-test Repeated measures ANOVA Data layout Implementation in SPSS and Stata Results, repeated measures ANOVA Graphical presentation of subject samples Interpretation of profile plots Imputation analyses Expectation Maximization Implementation of EM in R Results, EM analyses Multiple imputation Imputation by MICE The imputation model Constraints of imputed values Implementation of ice in Stata T-test after multiple imputation Results, two-sample t-test assuming unequal variances Repeated measurements ANOVA after MI Implementation of mim in Stata vi

11 6.2.9 Results, random intercepts MRM after MI Features of mim Analyses by regression models Mixed regression models in Stata Random intercepts MRM Random slopes MRM Quadratic time trend MRM s Curvilinear trend MRM s Results of analyses by MRM Analyses by generalized estimating equations Discussion 82 9 Conclusion 87 References 89 Appendices 93 A Additional tables 93 B Stata and R scripts 102 vii

12 viii

13 1 Introduction The aim of this thesis is to study methods that handle missing values in longitudinal data. Longitudinal data come from repeated observations of a sample of units over multiple occasions, and are also denoted panel data, hierarchical or multilevel data, Such data have a correlated structure of observations measured on the same subjects that has to be taken into account when handling the missing values. The focus of my project thesis last semester [Dragset et al., 2008] was methods to handle missing data, with an application of the method of multiple imputation (MI) on a cross-sectional dataset. Cross-sectional data are measurements observed once on each unit without relation to the time difference. The data were previously analyzed in Klepstad et al. [2003], with focus on scores from the Brief Pain Inventory (BPI). The variables of this questionnaires were scaled to intervals, which resulted in challenges when imputing on these variables. This is explained in Dragset et al. [2008]. The method of multiple imputation is further examined in this thesis, with emphasize on application for longitudinal data. The longitudinal data studied in this paper are previously analyzed in Gjeilo et al. [2008], and consists of 534 patients undergoing cardiac surgery at St Olavs Hospital in Trondheim. Healthrelated quality of life (HRQOL) are measured by the self-reported Short-Form 36 (SF-36) and the Brief Pain Inventory (BPI) questionnaires. The patients are intended to be observed at baseline, that is before surgery, and then repeatedly at six and twelve months after surgery. These repeated measurements make us able to monitor the gender differences at each time point and the gender difference in evolution in time after surgery. The two statistical analyses employed to examine these differences are the Student s t-test for the differences at each time point and the repeated measures ANOVA to examine the differences in improvement pattern. Both these methods require balanced data, that is all subjects are measured at all time points. The opposite is denoted unbalanced design, and is characterized by an unequal number of measurements for the units in the study. As much as 40% of the subjects included in the study have missing values for one or more measurement occasions. In Gjeilo et al. [2008] the complete-case analysis is used to obtain balanced and complete data, that is all units that have one or more missing values in their records are excluded from the analysis. This leads to deletion of considerable amounts of information 1

14 CHAPTER 1. INTRODUCTION in the data. Further the results can be biased, that is skewed expectatins and erroneous standard deviations, and the statistical power are decreased. Comprehensive research is performed on the field of missing data, and several methods are developed during the last decades. The alternatives of methods for analyses of longitudinal data are numerous, including imputation methods of expectation maximization and multiple imputation, mixed regression models, and marginal models estimating the correlation in data externally. The focus on methods of multiple imputation introduced by Rubin [1976, 1987] and the expectation maximization algorithm by Dempster et al. [1977] are growing within most research communities exposed to missing data nowadays. Common statistical software implement these methods and make new features accessible. Especially the iterative algorithm of multiple imputation, denoted MICE, has been extensively used, and was first introduced by Schafer [1997]. Royston [2004, 2005] have implemented multiple imputation by MICE in Stata [2007]. Mixed regression models (MRM) are described under a variety of names, and some of the first to be published were variance components models [Dempster et al., 1981], random-effects models [Laird and Ware, 1982] and random regression models [Bock, 1983]. The correlation structure of the data are fitted by random effects, for example the random intercepts and slopes effects. These models are similar to the covariance pattern models (CPM) introduced by Jennrich and Schluchter [1986]. CPM reflect the covariance in data through estimation of a specified correlation structure. A third type of regression models for longitudinal data were introduced during the 80 s, denoted generalized estimating equations (GEE) [Liang and Zeger, 1986, Zeger and Liang, 1986, Zeger et al., 1988]. GEE are a quasi-likelihood estimating processes that are able to give unbiased results under some missing data assumptions, but are not as general as imputation by EM and MI, or the mixed regression models and covariance pattern models. The attractiveness of this method is rather focused on its ability to analyze both discrete and continuous outcome variables. Recent work on the regression models introduced above are found in Hedeker and Gibbons [2006], Fitzmaurice et al. [2009] and Diggle et al. [2002], and a more applied setting for Stata users are given in Rabe-Hesketh and Skrondal [2008]. We observe that there are a set of commands specially designed to analyze longitudinal datasets, and these are labeled with the prefix xt-. Some of these commands will be described through this paper. The structure of this paper is based on the methods described above. Chapter 2 introduces concepts and definitions of missing data and examples of situations where missing values typically occur. The complete-case method and other deficient methods are presented in Section 2.2, followed by the imputation methods and more complex methods. An overview of methods to handle longitudinal data are given in Chapter 3, including a description of repeated measures ANOVA, mixed regression models, covariance pattern models and methods for discrete outcome variables. The dataset analyzed in Gjeilo et al. [2008] are described in Chapter 4 with emphasize on the missing data structure and the observed correlation structure. The application of the complete-case method, imputation methods and the regression models are explained in Chapters 5, 6 and 7, including results of the respective methods. Discussion of methods and results are given in Chapter 8, followed by a conclusion that summarizes the most important results and features of the methods. 2

15 2 Theory To describe methods that handle missing values in dataset and to evaluate the properties of these methods we introduce some definitions and concepts that are helpful throughout the report. 2.1 Missing data Day [1999] defines missing data as follows Missing data refers to a data value that should have been recorded but, for some reason, was not. Missing data are also referred to as non-response or unobserved data, and occur in most types of studies. Missing values can occur due to failure of measurement, data loss, out-of-range data and data loading issues, units fail to answer all questions, loss of follow-up or other plausible reasons. It is important to handle the problem of missing data in a proper way to obtain unbiased results that can be used in research. We can characterize the missing values based on occurrence in time, for example do intermittent missing values correspond to unobserved values at some time points where the subject is measured at occasions both before and after the missing occasion. This may be due to panel waves (subjects missing the whole occasion) or subjects omitting the specific item at that time point. Another type of missing values is dropout that occur when subjects terminates all further observation through the study. This may happen when subjects relocate, decides to end the study, die or just due to random reasons. The issue of missing data is widespread within the field of clinical trials, especially when the study is of longitudinal structure. The objects that are measured are denoted units or subjects. These subjects can be humans, which is often the case in clinical studies, but may as well be animals, plants or groups of individuals, to mention some examples Notation The notation of missing data were introduced by Rubin [1976] and is still in use as the common notation. The data that are planned to be observed are denoted X i for the covariate matrix and 3

16 CHAPTER 2. THEORY Y i for the dependent variable vector, both for subject i and intentionally observed for n time points. Further we can partition the dependent variable vector Y i in the observed variable vector Yi O and the missing variable vector Yi M, thus we get Y i = (Yi O, Yi M ). R is the missing data distribution that describes the occurrence of missing data in the dataset. R ij is a dichotomous variable (takes two levels of values), equal to one if the value of the dependent variable is observed for subject i at time point j, and zero if Y ij is missing. This missing data distribution is important to describe the reasons for missing data, the missing data mechanism. There are three main missing data mechanisms described in the literature, these are denoted MCAR, MAR and MNAR. These concepts are explained slightly different by various authors, the notation described here are based on Hedeker and Gibbons [2006] MCAR Missing completely at random (MCAR) is the assumption that missing data occur totally random without any relation to the other observed or unobserved data. This is the most basic missing data mechanism and assumes missing data to occur for completely random reasons. The distribution of missing values R are thus assumed to be independent of both covariates and the dependent variable as Figure 2.1 displays graphically the principle of MCAR. P(R Y, X) = P(R). (2.1) Figure 2.1: Graphical display of the missing data mechanism MCAR. Y represents the dependent variable vector (Y O,Y M ), X are observed covariate variables, Z are variables representing reasons for missing data and R is the missing data distribution [Schafer and Graham, 2002]. A less stringent case of MCAR is the covariate-dependent MCAR. This missing data mechanism is dependent on the fully observed covariates X i, expressed as P(R Y, X) = P(R X). (2.2) This special case of MCAR allows the fraction of missing data to vary across variables. An example of a situation where the covariate-dependent MCAR is more appropriate than the strict MCAR mechanism is when the fraction of missing data varies in time. It is important to include predictors for the missing data in the given analysis. In the above example the time variable must be included in the covariate matrix of the analysis to yield unbiased results. Another example of a covariate-dependent MCAR mechanism is found in a study on patients aged 18 to 70, where all the patients are asked to answer a questionnaire. The elder patients may 4

17 2.1. MISSING DATA have higher proportions of missing data. This can be related to their age, not necessarily by the values of the unobserved data. Again, it is important to include the variable age as a covariate since the missing data may be explained by this variable. A graphical presentation of covariate-dependent MCAR is given in Figure 2.2. Figure 2.2: Graphical display of the missing data mechanism covariate-dependent MCAR. Y represents the dependent variable vector (Y O,Y M ), X are observed covariate variables, Z are variables representing reasons for missing data and R is the missing data distribution [Schafer and Graham, 2002] MAR MAR is the short form for missing at random and describes how the missing data distribution R is dependent on both observed covariates X i and the observed dependent variable vector Yi O, but not on the unobserved variable vector Yi M. This is expressed as P(R Y, X) = P(R Y O, X). (2.3) We illustrate the missing data mechanism by a simple example. Missing values due to subject dropout is a common issue in longitudinal data, and occur when subjects that have entered a study do not respond at a given time or any subsequent time points. This kind of missing values may be related to the observations measured prior to dropout. Thus the missing data distribution is assumed to be related to the observed dependent variable vector Yi O of the dropout subjects. This can be found for example in a study of mental health, where the dropout subjects have lower scores than the remaining subjects. The observed values of the dependent variable are related to the missing values through the correlation structure of the data. In the previous section we described covariate-dependent MCAR, and emphasized the importance of including covariates in the analysis model that explain possible reasons of missing values, and this is equally important for the MAR mechanism. In addition it is essential to specify the correct variance-covariance matrix of the distribution of the data. If these factors are not proper implemented, the analyses may not necessarily be consistent with the underlying MAR mechanism and we can get biased estimates although the method are expected to give unbiased results. An expression often used in the literature of missing data is ignorable missing. Ignorable missing is a collective term that requires the missing data mechanisms MCAR and MAR, and in addition the parameters of the data model and for the missing data mechanism must be distinct. This means that data can be analyzed without respect to the missing data distribution. The 5

18 CHAPTER 2. THEORY opposite situation of non-ignorable missing refers to the missing data mechanism explained below, and must be handled with caution MNAR Missing not at random (MNAR) is the missing data mechanism where missing values are assumed to be related to the unobserved dependent variable vector Yi M in addition to the remaining observed values. This is expressed as P(R Y, X) = P(R Y M, Y O, X). (2.4) An example of the MNAR mechanism is when special values of a dependent variable lead to non-response. When investigating smoking habits we may experience that missing data rely on the actual value of the unobserved measurements, the subjects that are heavy smokers often omit the questions about smoking while non-smokers gladly fulfill these questions. There are no ways to confirm or reject MAR versus MNAR, since the unobserved dependent variable vector Yi M is involved. There are developed methods that can handle data with missing data due to the unobserved values and some of these are presented in Section 2.4, and MNAR missing data is a topic of intense and emerging research [Hedeker and Gibbons, 2006] Missing data patterns In the literature different missing-data patterns are presented, and these are important for the types of methods that are recommended used for each specific dataset. Some of the single imputation methods explained in Schafer and Graham [2002] rely on missing data to occur only as dropout, while the method of expectation maximization described in Section handle complex missing data patterns. Some frequently used examples of missing-data patterns are the univariate, monotone and arbitrary patterns displayed in Figure 2.3. The univariate missing data pattern can be described as one single variable containing unobserved values due to dropout, while the monotone non-response pattern includes several variables with missing values. These variables can be sorted in a way that consecutive variables have equal or more missing values. In a longitudinal study with units lost to follow-up, a monotone missing-data pattern may occur. Arbitrary missing data patterns are more complex allowing missing values in all variables and at all time points Criteria for methods to handle missing data Appropriate methods to handle an analysis of data with missing values are characterized by three criteria to yield valid inferences about the data. The first two criteria are given by Rässler et al. [2008], while the criterion of consistency (3) are presented in Carpenter and Kenward [2007]. 1. Estimates of coefficients of the missing data analysis must be approximately equal to the population estimates (that includes for example means and variance estimates). 2. The confidence intervals for estimates (with a significance level α) should have the property of covering the true population mean at least (1-α) 100% of the time. 6

19 2.2. SIMPLE BUT DEFICIENT METHODS Figure 2.3: Graphical display of missing data patterns, (a) univariate pattern, (b) monotone pattern and (c) arbitrary pattern. Rows correspond to units and columns correspond to variables, X s are completely observed variables and Y s are partially observed variables [Schafer and Graham, 2002]. 3. The methods must lead to consistency, which means that the confidence intervals are expected to be narrower when the number of units increases in the data set. The methods described in Section 2.2 do not meet the criteria above, but, if data are assumed to be ignorable, the methods in the subsequent Sections 2.3, 3.4 and 3.5 are stated as proper methods to handle missing data analyses. 2.2 Simple but deficient methods There are two main categories of deficient methods explained in Schafer and Graham [2002]. The first is those excluding units with missing values, denoted complete-case analysis and explained in the following section. The second category is those replacing missing values with an alternative best guess. The latter is denoted single imputation methods and are described in Section Note: There exists a single imputation method that meets the criteria of methods to handle ignorable missing data outlined in Section This method will be explained in Section Complete-case analysis One of the more widespread and straight forward ways to handle missing data is the completecase analysis. Other names for this method are listwise deletion and case deletion. It excludes all units in the dataset with one or more unobserved values. This leads to a complete dataset that consists of units with completely observed variables. Many of the statistical software packages (for example SPSS and Stata) employ this as default, which, in addition to being simple makes this method well-known and often applied. The complete-case analysis requires data to be strict MCAR to be valid, that is the missing values are independent of all the covariates and observations at other measurement occasions. 7

20 CHAPTER 2. THEORY This is a rather strong assumption, especially for longitudinal data, and are seldom met. Violation of this assumption can typically lead to selection bias. For example, consider a group of patients that participates to a questionnaire prior to a medical treatment. The same group of patients are requested to answer the questionnaire again after this treatment. Suppose that only the patients that still fell ill return the questionnaire, while the cured patients don t. The results will be too pessimistic for these patients due to selection bias. The opposite situation is also possible, and will give more uplifting results than realistic. The degree of bias depends on the extent of violation of the assumption of MCAR, the proportion of excluded patients and the analysis being implemented [Rässler et al., 2008]. Even if the data can satisfy the MCAR assumptions, there are several problems arising. A small percentage of missing values can lead to a considerable proportion of excluded units, even when data are MCAR. An example is found in a dataset where a high number of variables are measured, and many of the units have a few missing values. The percentage of missing values will be low, but a considerable number of units with non-observed variables may be excluded. This decreases the amount of data to be analyzed and also the power of the statistical results. The number of units is limited in most studies due to economic, ethic or other reasons, and with the complete-case analyses these samples are further confined. The complete-case method might be inefficient, even in case of MCAR, since the number of units decrease. In multivariate cases this gets even worse. The degree of inefficiency depends strongly on the fraction of excluded units. In studies where data can be assumed MCAR and only a small proportion of units are excluded, this method can be a sensible choice. It is easy to implement and makes the way forward to the real analysis short. But it should be handled with caution, if the small group of deleted units keeps a large proportion of one feature, the analysis can still be biased. A special case of the complete case method is the available case analysis, sometimes referred to as pair-wise deletion or pair-wise inclusion. This method deals with one analysis at the time, and includes all units where all the variables are observed. For a longitudinal study where each of the measurement occasions are examined separately, only those units with unobserved values at that single time point are excluded from the analysis. As long as no comparison of the analyses at the different measurement occasions are performed this is a legal analysis method, and leads to a higher number of included units in each analysis. As stated above, the degree of bias and efficiency depends on the proportion of excluded units, thus this method can reduce some of the bias and lead to higher efficiency. However the issue of bias and inefficiency is still present, and an additional disadvantage is that different groups of units contribute to the analysis at different time points depending on the missing-data pattern Single imputation In a statistical setting the expression imputation is applied with replacement of missing values by values estimated from covariates or other observed values [Rosner, 2006]. The estimation function can be based on available information from the observed data of the unit itself and from other units with (similar) observed values. No units are excluded from the analysis, thus the original number of included units is maintained at all time points. There are many ways to estimate these imputations and some of these are described in Schafer and Graham [2002], including hot deck imputation, marginal mean imputation, conditional mean imputation and 8

21 2.3. IMPUTATION METHODS conditional distribution imputation. A special case of single imputation is the hierarchical scales imputation, outlined in Fayers and Machin [2007]. The extent of the bias of results are different for the above mentioned single imputation methods. Some of these imputation methods may give approximately unbiased estimates. The underestimation of variance in the estimates is common for all the single imputation methods, and is also the reason why these methods are unsuitable for data that are not missing completely at random (MCAR). Expectation maximization algorithm and multiple imputation are both extensions of the single imputation methods, and are more appropriate for data with missing values since they model the variation in a better way. This is explained further in Section 2.3. An imputation model is a useful tool to make imputations for unobserved values in a dataset. It describes the conditional distribution of the missing values dependent on other variables in the dataset, and also possibly on other variables important for the missing-data structure. The imputation model must not be confused with the analysis model, that is used to examine the dataset, for example to do inference about a sample. All variables with missing values and variables that can predict or explain the missing values are included in the imputation model, and can be thought of as multivariate responses. This way of structurizing the variables makes the imputation model a device to preserve important features of the joint distribution of nonobserved variables. The analysis model on the other hand, makes distinctions between dependent and independent variables, and keeps up this relationship of variables throughout the analysis. Last observation carried forward The method of last observation carried forward imputes values for unobserved data based on former observed values, thus it is one of the single imputation methods described above. This method applies especially to longitudinal data where the units are observed at several occasions, and some units are lost-to-follow up or have intermittent missing values. It imputes values equal to the last observed response for the variable for each unit. This method gives potentially bias for all types of plausible missing-data mechanisms, and is almost never good fitted to the expected behavior. It should practically never be used according to Carpenter and Kenward [2007]. 2.3 Imputation methods In the section above we have described the missing data mechanisms MCAR (including dependent MCAR), MAR and MNAR, and the properties of these mechanisms. Both MCAR and MAR mechanisms are referred to as ignorable missing, and can be analyzed without loss of information. It must be emphasized that this relies on the use of analysis methods based on the full likelihood function [Diggle et al., 2002]. The expectation maximization (EM) algorithm and multiple imputation (MI) are imputation methods that provide unbiased parameter estimates and their belonging standard errors as required by the criteria in Section Expectation maximization Expectation maximization (EM) is a method of obtaining estimates of coefficients from an analysis on base of Bayesian thinking, introduced by Dempster et al. [1977]. It applies maximum 9

22 CHAPTER 2. THEORY likelihood estimation (ML) to draw population inferences based on observed values from a full likelihood function. Schafer and Graham [2002] describes the EM algorithm as follows: The key idea of EM is to solve a difficult incomplete-data estimation problem by iteratively solving an easier complete-data problem. For monotone missing data patterns (results of dropout, see Figure 2.3) the full-distributional likelihood can be expressed in closed form, and the log-likelihood function may be expressed as a sum of functions, each function dependent of one parameter only. Maximum likelihood (ML) estimates are calculated by maximizing the log-likelihood function with respect to each parameter separately. For some models the full likelihood function exists, but the log-likelihood does not represent the parameters distinctly and thus maximizing the factors separately will not necessarily maximize the likelihood. Especially for complex missing data structures like the unstructured pattern in Figure 2.3 the direct calculation of ML estimates is not feasible. Iterative methods to approximate the full likelihood function have been developed for situations where explicit calculation of ML estimates are not available [Diggle et al., 2002]. The expectation maximization algorithm is a general iterative algorithm for ML estimation in incomplete-data problems. It applies the observed-data likelihood, also referred as the likelihood ignoring the missing data mechanism [Schafer and Graham, 2002]. The estimated parameter values ˆθ that maximizes the observed-data likelihood function have attractive properties. The estimated parameters tends to be approximately unbiased in large samples, and the estimated variances obtained are close to what is theoretically desirable. Thus the EM algorithm fulfill the criteria of methods handling data with missing values. The method assumes a large number of data so that the EM estimates can be approximately unbiased and normally distributed. In addition it assumes data to be ignorable, that is MCAR or MAR mechanism. The procedure of the EM algorithm builds on a relatively old ad hoc idea for handling missing data Diggle et al. [2002]. First we replace the missing values by initial values (for example estimated values). Then iteration over the following steps are carried out. 1. Estimate parameters of the full likelihood model by maximum likelihood estimation. 2. Re-estimate the missing values assuming the new estimated parameters from second step is correct. Iteration must be continued until convergence, that is until values that are re-estimated by the second step approximate the previous estimated values. The notation missing values is used to separate the EM algorithm from the ad hoc idea. In EM the missing data are not the actual missing data Y mis, rather the functions of Y mis appearing in the complete-data likelihood. The second step above is called the E step, which calculates the conditional expectation of the missing data given the observed data and the estimated parameters ˆθ. The first step is denoted the M step and estimates the new set of parameters from maximization of the observed completedata log-likelihood. The algorithm can be quite easily implemented, and each iteration consists of one run of the E step and one of the M step. EM can be shown to converge reliably if the observed-data likelihood function is bounded [Diggle et al., 2002]. One drawback for the expectation maximization algorithm is that with large fractions of missing data the convergence of the iterations can be very slow. In some problems the compu- 10

23 2.3. IMPUTATION METHODS tation of the M step can be difficult because no closed form of the likelihood function exists. Diggle et al. [2002] presents some extended algorithms of EM that gives solutions for the problems above. Some statistical analysis software available for EM estimation are described in Acock [2005] and Schafer and Graham [2002]. Another imputation method based upon the likelihood function is multiple imputation described in the following section Multiple imputation The method of multiple imputation (MI) is a continuation of the method of single imputation from a conditional distribution [Schafer and Graham, 2002]. It retains much of the attractiveness of the single imputation method and gives unbiased results with respect to the estimated variance. MI produces m plausible datasets where each could have been the complete dataset if all values were observed. The completed datasets can be combined by easy arithmetic [Rubin, 1987] to obtain estimates and standard errors that reflect uncertainty in the missing-data and the finitesample variation. This method makes use of complete-data techniques and software already possible which is very favorable. Figure 2.4: Schematic representation of multiple imputation with m imputed datasets [Schafer and Graham, 2002]. Multiple imputation is based on a Bayesian way of thinking, in the sense that the distribution in which the imputations are drawn from is a full-conditional posterior distribution. To perform MI successfully, the imputations need to be proper according to the criteria in Section This means that the uncertainty about the parameters in the imputation model must be taken into account when imputing unobserved values. Both the missing data and parameters of the imputation model have distributions, and this feature is important in the MI setting Why use MI? To perform proper multiple imputation we need a prior distribution for the parameters of the imputation model and a likelihood function for the variables that have unobserved values. If a large sample is obtained, the likelihood function can be approximated by the imputation model 11

24 CHAPTER 2. THEORY as done by EM algorithms, and the prior distribution for the parameters can be uninformative. Further a posterior distribution [θ, y mis y obs ] must be obtained, where θ represents the sampled parameters of the imputation model drawn from the prior distribution. The sampled values of y mis are used as imputations in the dataset. This sampling is done m times to obtain m imputed datasets. In this procedure the parameters of the imputation model are drawn for each imputation, so these parameters are unique for each imputation. This way to treat the model parameters as stochastic variables with uncertainty and not as fixed values (as in prediction) is the important property of MI that ensures the variance estimates to be unbiased. The total variance, that consists of within-imputation and between-imputation variance is therefore approximately unbiased for data assumed MCAR or MAR. The uncertainty is represented in both the parameters of the imputation model and the random sampling from the imputation model to draw imputations for the missing values. The EM algorithm on the other hand, estimates the parameters of the analysis model only once, thus they are considered as fixed parameters Imputation model As for single imputation from a conditional distribution we must create an imputation model to obtain the imputations. The imputation model must be at least as rich as the analysis models. This is to ensure that all variables that may contain information about other variables in the analysis model are represented. In addition some additional variables may be included in the imputation model, according to Buuren et al. [1999]. These variables are divided in two groups, variables that are known to have influenced the missing-data pattern, called U-variables, and variables that explain much of the variance in the unobserved variables, denoted V-variables. Last step in the determination of the imputation model is to exclude those of the U- and V- variables that have too many unobserved values within the group of incomplete cases Properties In Schafer and Graham [2002] an example of MI is presented, with a bivariate normal distribution as base for the full likelihood function. MI does not require all the variables to be normally distributed, but this assumption is used in many publications about MI. This is because the multivariate normal distribution is one of the easiest approach to the method. MI shares some of its properties with the EM algorithm, for example do both rely on large-sample approximations (though this assumption is stronger for the EM algorithms). MI and EM both use the observeddata likelihood function to approximate the likelihood function. MI implies uncertainty in the imputed values through both the parameters of the imputation model and the drawn imputation values themselves. This stands in contrast to EM that computes the joint likelihood function with fixed parameters, and imply uncertainty only in the imputed draws from this imputation model. Multiple imputation gives unbiased estimates and variance under MCAR and MAR, and satisfies the criteria for methods to handle missing data. Even if data are MNAR the method of MI is assumed to be a better method than the above described simple methods [Rässler et al., 2008]. Software available to perform MI is listed in Schafer and Graham [2002] and Acock [2005]. 12

25 2.3. IMPUTATION METHODS Rubin s rules Rubin s rules Rubin [1987] is a method to combine the results from m imputed datasets for a scalar parameter. For notation here we use Q as the parameter quantity (for example a regression coefficient), ˆQ is the estimated value for Q and U denotes the estimated standard error if the original data were complete. This method assumes a large enough sample to ensure that ˆQ is approximately normal distributed with mean Q and variance U. There are one estimated ˆQ and U for each of the m datasets, thus we have the notation [Q (j),u (j) ], where j = 1, 2,, m. The overall estimate is expressed as Q = m 1 m j=1 ˆQ (j) (2.5) The uncertainty in ˆQ has two parts, within-imputation variance and between-imputation variance Ū = m 1 m U (j) j=1 B = (m 1) 1 m j=1 [ ˆQ(j) Q] 2 The total variance is a sum of the two parts of variance ( T = Ū ) B m Now we have an estimated combined parameter Q, with appurtenant standard error T which is the unbiased estimator for the dataset if no values were missing. To obtain p-values, Rubin recommended using an approximation to the Student s t distribution with ν degrees of freedom Q Q t ν T 2 1 ν = (m 1) 1 + ( ) 1 + Ū m B When the degrees of freedom ν is large, total variance is well estimated and thus the number of imputed datasets m is large enough. The number of imputed datasets required varies from dataset to dataset, dependent on the quality of the imputation model and the structure in the data. Quantities like 5, 10, up to 20 and 50 imputations are advised, and this issue is well discussed in the literature [Schafer and Graham, 2002]. In recent years the recommended number of imputed datasets m has increased 13

26 CHAPTER 2. THEORY due to new methods to estimate the error in results because of a finite number of imputations. As a rule of thumb the number of imputations should be raised if p-values are to be calculated. The time to make 10 versus 20 imputed datasets is seldom of large consequence compared to the effort and time it takes to establish the imputation model. 2.4 Full model-based methods assuming MNAR All of the methods described above assume the missing data mechanism to be MCAR or MAR, but in certain situations the assumption that data are MNAR may be plausible. To handle this the missing data distribution R must be taken into account when unobserved values are to be imputed. Especially is MNAR potentially the issue in clinical studies, where the patients may have a high fraction of unobserved values, or possibly drop-out, closely related to the values that are not observed. Schafer and Graham [2002] presents two main types of model-based methods that assumes data to be MNAR, that is the selection models and the pattern-mixture models, both described below. Hedeker and Gibbons [2006] state that several researchers warn against reliance of a single MNAR model, because the assumptions about the missing data are impossible to access with the observed data. Thus we should use MNAR models with caution, and perhaps examine several models to conduct a sensitivity analysis of the missing data Selection models A selection model consists of a distribution for the complete data, and a distribution of the missing-data given the data itself. This means that a joint distribution of the complete data Y and the missing-data distribution R as P(Y, R θ, ξ) = P(Y θ) P(R Y, ξ) θ is the unknown parameters of the complete-data population, while ξ is the parameters of the conditional distribution of the missing-data distribution given the complete data. The likelihood function may be calculated by integrating the equation above over all values of the missing-data Y mis. The maximum of the likelihood is not always possible to find analytically, so iterative methods are needed to approximate ˆθ s. Another issue with the selection models is that they are extremely sensitive to the distributional shape that is chosen for the population Pattern-mixture models The other type of model-based models presented in Schafer and Graham [2002] is the patternmixture models, which groups the whole sample on basis of the missing-data distribution. A model of this form may be written as P(Y, R θ, ξ) = P(R η) P(Y R, ν) θ and ξ have the same interpretation as above, η represents the proportions of the population that end up in each of the missing-data groups and ν is the parameters of the complete-data 14

27 2.4. FULL MODEL-BASED METHODS ASSUMING MNAR distribution given the missing-data group. Pattern-mixture models are suffering from the fact that it is difficult to estimate ν for the groups containing unobserved values, so heavy restrictions or unverifiable assumptions must be made. On the other hand, these models are not as sensitive to the distribution of the population as the above described selection models. Development and incorporation of full model-based methods into suitable software are necessary to be able to apply these methods for scientific research. Schafer and Graham [2002] recommend the use of MI and EM algorithms for handling missing-data problems until the appropriate software is accessible. 15

28 3 Longitudinal data analysis Longitudinal studies have both advantages and disadvantages over cross-sectional studies. First, a longitudinal study can give more powerful results with the same number of subjects compared with a cross-sectional study. Another way to formulate this is that longitudinal studies need fewer subjects included in the study to obtain the same statistical power as a cross-sectional study. Second, in a longitudinal study each subject is measured more than once, and can serve as his/her own control. The within-subject variability is often smaller than the variance between subjects, and the between-subject variability can be separated from the measurement error. This results in more efficient estimators of treatment-related effects compared with cross-sectional studies. Third, a longitudinal study can measure individual trends, or evolution in time, of dependent variables. This determines growth and changes at individual level, which is not possible in cross-sectional studies. Finally, these kind of studies allow us to separate the effect of changes over time within subjects from the differences between subjects at baseline. Observations from a repeated measures study are assumed clustered dependent since more observations of the same variable are measured on each subject. This leads to a need of more sophisticated statistical methods than for ordinary cross-sectional studies. Very common issues of longitudinal studies are dropout or intermittent missing values at one or more occasions during the study. Simple statistical analysis methods do not handle this problem without inducing possibly biased results. One solution of this is to impute the missing values prior to statistical analysis. Other methods like mixed regression models and covariance pattern models use all available data in the analyses, which increases the statistical power and prevents biased estimates. 3.1 Analysis considerations When modeling longitudinal data there are several properties that must be considered. Quite simple statistical analyses are available for continuous normally distributed outcome variables in data without missing values, for example the repeated measures ANOVA. To employ large sample theory to approximate non-normally distributed variables, the number of subjects m 16

29 3.2. GENERAL APPROACHES allocated in each group should exceed 50. Also the number of observations per subject n i should be considered. If each subject is measured twice, a simple change score may be calculated. An example of a change score is to calculate the difference in observations for each subject. This makes analysis methods for cross-sectional data suitable, such as the analysis of covariance (ANCOVA). This approach relies on balanced design of data and is not very suitable for data with missing values. For datasets where the number of observations varies between subjects, more general methods are required, for example mixed regression models (MRM). For one-sample analyses, the only terms needed in the analysis model is the random factors. No between-subject covariates is required since all subjects are assumed to be randomly sampled from the same population. For two-sample situations the variables that describe the differences between groups must be included in the model. The terms used above are explained in the following sections. The last thing to consider for longitudinal data modeling is the variance/covariance structure of the panel data. This is further explained in the following sections. 3.2 General approaches A wide range of methods to model longitudinal data are available. The following methods are discussed in Hedeker and Gibbons [2006]. The simplest method, also denoted as the derived variable approach, implies combination of the repeated measurements into one summary variable. This causes a longitudinal study to be simplified to a cross-sectional study since each subject obtain one single combined measurement. A derived variable may for example be an averaged over time -variable, changed score -variable, area under a curve -score or linear trend across time -variable. A problem with this methods is the strong dependency of balanced design, which makes it unsuitable for data with missing values. The second approach to longitudinal data analyses are the univariate repeated measures analysis of variance (ANOVA). This method requires several assumptions to be met and is therefore limited in its application but is relatively easy to compute and implement. The most critical assumption for longitudinal data, besides the assumption of balanced data, is sphericity. Sphericity is defined in Rabe-Hesketh and Skrondal [2008] as the assumption that all pairwise differences between responses have the same variance. This is rarely met when analyzing longitudinal data, and violations can lead to skewed F-distributions. ANOVA takes into account that subjects can have individual baseline observations but no subject-specific evolution in time. Multivariate ANOVA (MANOVA) is an approach to analyze longitudinal datasets but holds similar restrictions as the univariate ANOVA described above. It does not handle missing values in data and in addition it assumes the variables to be measured at the same occasions. It is therefore not suitable for longitudinal datasets with non-responses. The fourth approach is the mixed regression models (MRM) that can be used for both categorical and continuous, and non-normally and normally distributed outcome variables. MRM give unbiased results if data are assumed ignorable (MCAR and MAR), and allow the measurement occasions to vary among the units. The method handle both time-invariant and timevarying variables and is therefore a suitable method to analyze longitudinal data with nonresponses. In this report the term mixed regression models refers to the linear mixed regression models that assumes dependent variables to be continuous, if not specified otherwise. The 17

30 CHAPTER 3. LONGITUDINAL DATA ANALYSIS generalized linear mixed regression models are presented in Section The method of generalized estimating equations (GEE) is an alternative to the generalized mixed regression models in the sense that both methods handle some missing values, take timevarying covariates and different types of outcome variables. The drawback for this method is the more strict requirement for missing values to be ignorable, missing data are assumed to be explained by covariates in the model and not by observed values at other measurement occasions. 3.3 Repeated measures ANOVA In longitudinal datasets some observations are dependent of other observations since more than one observation may be measured on the same subject. This must be taken into account when analyzing such types of data. Repeated measures ANOVA is a method to examine possible differences in means, that is between two or more groups, and from one, two or several samples. The method models the correlation structure of observations within subjects by the inclusion of a random subjects effect in the regression formulation. This random subject effect allows subjects to have individual baseline observations, thus we can partition the variance in explained variance by the random effect and unexplained variance, also known as the residual error Assumptions of repeated measures ANOVA As stated above the method of repeated measures ANOVA have many assumptions that must be met to obtain unbiased estimates from the analysis. All subjects must be measured at the same fixed time points. The data are assumed complete and balanced in terms of time points n, but the size of the groups h may be unequal. Thus data with missing values must be handled by the complete-case method or one of the imputation methods prior to the analysis. The dependent variables are assumed multivariate normally distributed, and the variance-covariance matrix of the data with respect to the occasions is assumed equal to the compound symmetry structure. The latter consists of two layers, that is the assumption of homoskedasticity and sphericity. Homoskedasticity is defined as the variance of the dependent variable to be equal at all measurement occasions, while sphericity refers to the assumption that all pairwise differences between responses have the same variance, as stated above. This implies that consecutive observations are equally correlated as observations more distant in time, and that the variance at all occasions are equal. The assumption of compound symmetry is seldom met for longitudinal data. Rabe- Hesketh and Skrondal [2008] write that a less strict assumption of sphericity alone is sufficient for the repeated measures ANOVA to be valid. Repeated measures ANOVA claims to be fairly robust to violations of normality and homoskedasticity. If sphericity cannot be assumed, the Greenhouse-Geisser or Huynh-Feldt correction methods can be applied to adjust the degrees of freedom for the F-test (explained in Section 3.3.2). The idea of these correction methods is to decrease the degree of freedom because more parameters are necessary to model the covariance matrix. In the next section some notation is introduced to further explain the method of repeated measures ANOVA. This notation is also used with mixed regression models, covariance pattern models and generalized estimating equations in subsequent sections. 18

31 3.3. REPEATED MEASURES ANOVA Notation All sample members in the longitudinal study are denoted subjects. These subjects are the units measured during the study, and can be patients treated at a hospital, football players on a team or plants on a field as examples. The dependent variable measured repeatedly on each subject is called the within-subjects factor. Examples of such within-subjects effects are outcome of quality of life-questionnaires, type of treatment for patients or quantity of irrigation for plants. Variables measured on independent groups of sample members are called between-subject variables, and examples are gender of subjects, the football team, the area of which the plant grows and so on. When performing repeated measures ANOVA we have to calculate sum of squares for both fixed and random terms in our analysis model. First we look at the analysis model expressed as y hij = β 0 + β 1 t j + β 2 x h + β 3 x h t j + v 0i + ɛ hij, (3.1) where h is the number of groups, h = 1, 2,, g, i is the number of subjects in group h, i = 1, 2,, N h, and j is the number of measurement occasions, j = 1, 2,, n. An indicator variable must be created for each of the groups except the chosen baseline category. There are N = g h=1 N h subjects included in the dataset. The measurement occasions are assumed equal for all subjects, and we also assume balanced design. y hij is the outcome variable, assumed normally distributed and continuous. The rest of the variables can be interpreted as follows: β 0 = grand mean, β 1 = effect of time t j, t j = time variable, β 2 = effect of group x h, x h = group variable, β 3 = interaction coefficient for the group by time interaction, v 0i = individual difference component for subject i, ɛ hij = measurement error for subject i in group h at time j. (3.2) Equation (3.1) may be expressed in matrix form as y i = X i β + Z i v i + ɛ i, (3.3) where y i is the response vector for subject i, X i is the covariate matrix for subject i, β is the vector of fixed regression parameters, Z i is the random effects design matrix for subject i, v i denotes the vector of random subjects effects and ɛ i is the error vector. Note: It is only one random term in a repeated measures ANOVA, the random subjects effect. In Section 3.4 we explore mixed regression models with more than one random factor, therefore the matrix representation for the random effects are introduced here. For the analysis model in Equation (3.1) the covariate matrix X i and random effects design matrix Z are represented as 19

32 CHAPTER 3. LONGITUDINAL DATA ANALYSIS 1 t i1 x i x i t i1 1 t i2 x i x i t i2 X i =. 1 t ini x i x i t ini 1 1 Z i =.. (3.4) 1 The random subjects variables v 0i are assumed normally distributed with mean zero and variation σ 2 v 0, and ɛ hij are similarly assumed normally distributed with mean zero and variation σ 2. We express the association between random terms in model (3.1) as E(y hij ) = β 0 + β 1 t j + β 2 x h + β 3 x h t j Var(y hij ) = V ar(v 0i + ɛ hij ) = σ 2 v 0 + σ 2 ɛ, Cov(y hij, y hi j) = 0 for i i, Cov(y hij, y hij ) = σ 2 v 0 for j j. The expectation of y hij is only related to the fixed terms of the model, that is the intercept, the covariate gender, the time variable and the gender by time interaction. In a simple linear regression model all the variance between observations (assumed independent) are due to the measurement error ɛ, and all observations are assumed to have equal variance. This variance is now partitioned into two parts, the subject-specific variance and the residual variance. The unexplained variance is decreased because some variance is explained by the variability in the population of subjects. The first covariance statement in Equation (3.5) indicates that measurements from different subjects are independent of each other, while the second covariance statement tells us that measurements of the same subject are correlated equally for any pair of measurements observed on the same subject. This correlation is denoted the intra-class correlation, and is given as (3.5) Corr(y hij, y hij ) = σ2 v 0 σv σɛ 2. The covariance matrix of the repeated measures ANOVA model in Equation (3.1) is presented as σv σ 2 ɛ σv 2 0 σv σ 2 ɛ σv 2 0 σv 2 0 σv σɛ 2 (3.6).... σv 2 0 σv σɛ 2 and we observe that it is equal to the above described compound symmetry structure. This is seldom realistic for longitudinal data because the variance tends to differ between groups and in time. In addition are consecutive measurements often more correlated than observations more distant in time. 20

33 3.3. REPEATED MEASURES ANOVA Repeated measures ANOVA table When analyzing a repeated measures ANOVA, we introduce the dot notation to represent the mean of groups, time points, and subjects. This notation is presented as follows: ȳ... = average across groups, time points and subjects, ȳ hi. = average for subject i in group h across time points, ȳ h.j = average for group h at time point j across all subjects in the group, ȳ h.. = average for group h across time points and subjects, ȳ..j = average for time point j across groups and subjects. Further, SS is the sum of squares, calculated as listed in Table 3.1. MS is the Mean of Squares and is calculated by dividing the sum of squares by the corresponding degrees of freedom. The sum of squares can be found for all terms in the analysis model in Equation (3.1), and are used in the calculation of F statistics as described below. Table 3.1: Repeated measures ANOVA table for the regression model in Equation (3.1). SS is the sum of squares, g is the number of groups, n is the number of measurement occasions and N h is the number of subjects in group h. Source SS df Group Time SS G = n g h=1 N h(ȳ h.. y... ) 2 g-1 SS T = N n j=1 (ȳ..j y... ) 2 n-1 Group Time Subjects in groups Residual SS GT = g n h=1 j=1 N h(ȳ h.j ȳ h.. ȳ..j + y... ) 2 SS S(G) = n g Nh h=1 i=1 (ȳ hi. y h.. ) 2 SS R = g Nh n h=1 i=1 j=1 (ȳ hij ȳ h.j ȳ hi. + y h.. ) 2 (g-1) (n-1) N-g (N-g) (n-1) Total SS T = g Nh n h=1 i=1 j=1 (ȳ hij y... ) 2 (Nn-1) (n-1) Group by time interaction The first term in model (3.1) that we want to investigate is the group by time interaction. The null hypothesis is formulated as 21

34 CHAPTER 3. LONGITUDINAL DATA ANALYSIS H GT : β 3 = 0 F GT = MS GT MS R = SS GT /(g 1)(n 1) SS R /(N g)(n 1). F is the statistic that is calculated and compared to the F-distribution, as given in Kvaløy and Tjelmeland [2000]. If this hypothesis is rejected there are three conclusions that can be drawn: 1. The between-group differences vary across time, 2. The between-group curves are not parallel across time, 3. The group and time effect are confounded by the interaction, and separate group effects or time effects cannot be estimated. The latter describes the importance of testing the interaction prior to the main effects. Tests of the main effects group and time can be carried out if the group by time interaction is insignificant. This is performed in a similar way as the above gender by time interaction. 3.4 Mixed regression models A repeated measures ANOVA can be thought of as one of the simplest mixed regression models. A repeated measures ANOVA consists of both random subjects effects and covariates, and in a similar way a mixed regression model consists of both fixed and random effects (hence the name mixed model). Fixed factors are the covariates of the model that we want to estimate in the analysis. Examples of fixed factors are age and gender, and the interaction of fixed factors can be included in the same way as in a simple regression model. Random factors are variables in which we are not explicitly interested in estimating coefficients. The reason for inclusion is to control for the variance related to the random variables. Examples of such variables are subjects in a repeated measures ANOVA model, schools and children in a multi center study of children and reading and mothers and children in a study of birth weight. If we can control for these between-subjects factors we are rewarded with increasing statistical power, which makes us able to discover correlations among variables that otherwise are hidden. The following sections are based on the theory in Hedeker and Gibbons [2006], Rabe- Hesketh and Skrondal [2008] and Fitzmaurice et al. [2009] Introduction We have previously explored the univariate and multivariate ANOVA models for repeated measures. The univariate ANOVA models assumes sphericity which is not a common feature of longitudinal datasets. Neither of the methods handle missing values in the data, thus we have to omit subjects with non-responses from the data prior to the analysis and the results may be biased. The time variable is restricted in these models, all subjects must be measured at the same fixed time points, and in a MANOVA the distances in time between all consecuting measuring 22

35 3.4. MIXED REGRESSION MODELS occasions must be equal. Both univariate and multivariate ANOVA models have possibilities to examine group trends across time, but subject-specific trends are not accessible. Mixed regression models (also denoted MRMs) are more general models to analyze longitudinal data in a flexible and correct way with respect to bias induced by missing values. First, the variance structure can be specified more generally than the compound symmetry, but less costly than the alternative unstructured covariance matrix where all variances and covariances are allowed to be unequal. The time variable can be continuous, and subjects can be measured at individual time points. The mixed models can handle ignorable non-responses in longitudinal datasets without inducing bias to the estimates, since all subjects with observed values are included in the study. The sample is assumed representative for the population of subjects. Compared to complete-case data the mixed models analysis has increased power, which is an important additional advantage in such analyses. Both time-variant and time-invariant covariates can be included in mixed models. Variants of mixed models are known as random-effects models, multilevel models, hierarchical linear models, variance component models, two-stage models and random regression models, to mention some. The common feature for all these models is the inclusion of random subjects effects into the regression models. The random subjects effects describe each units deviation from the population mean intercept, and take part in the explanation of the correlation structure in the longitudinal data. Alternatively, they describe the degree of variation between subjects in the population of subjects. Simple linear regression models A simple linear regression model with time variable t ij and dichotomous grouping variable x h can be written as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + ξ hij. (3.7) y hij is the response for subject i, i = 1, 2,, N h at occasion t ij, where the time variable can be continuous and individual for each subject. The model above assumes all observations to be independent. This can be expressed by the variance component ξ, that is assumed to be independent normally distributed with mean zero and common variance σ Random intercepts mixed regression models In longitudinal data the assumption of independence is violated since we cannot assume observations on the same subject to be uncorrelated. To control for this correlation we simply add a random subjects effect to the regression model, thus the variance due to subject-specific features is separated from the residual variance. A random intercepts mixed regression model for a two-sample analysis is presented as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + v 0i + ɛ hij. (3.8) Interpretation of the parameters are equal to the parameters in the repeated measures ANOVA model as given in Equation (3.3.1). The variance ξ hij in Equation (3.7) is partitioned into two 23

36 CHAPTER 3. LONGITUDINAL DATA ANALYSIS variance parts, the random subjects effect and the residual error, expressed as ξ hij = v 0i + ɛ hij. The random subjects effects v 0i have mean zero and common variance σ 2 v 0. The variation due to subject-specific features is thus separated from the residual and leads to a decrease in the unexplained variance. The variances σ 2 v 0 are assumed identical and independent between subjects, and the residuals ɛ hij are assumed identical and independent of each other and the σ 2 v 0 conditionally on y hij. This leads to a compound symmetry structure, which is one of the strict assumptions of repeated measures ANOVA. Thus the analysis by a random intercepts mixed regression model and the repeated measures ANOVA assuming compound symmetry gives similar results. The compound symmetry can be expressed by the following covariance matrix, here with three measure occasions, t=0, 1, 2: σ 2 + σ 2 v 0 σv 2 0 σ 2 + σv 2 0. (3.9) σv 2 0 σv 2 0 σ 2 + σv 2 0 (3.10) An alternative way to represent the model in Equation (3.8) is by a hierarchical, or multilevel structure. The equation is partitioned into a within-subjects model and a between-subjects model y hij = b 0i + b 1i t ij + ɛ hij (3.11) b 0i = β 0 + β 2 x h + v 0i, b 1i = β 1 + β 3 x h. The random subjects effects v 0i can be interpreted as each subjects deviation from the group intercept, thus it represents the individual intercepts and are denoted random subjects or random intercepts effects. A random intercepts MRM assumes the slopes of all subjects to be equal, so all lines on a profile plot are parallel for all group means. A profile plot displays the group means of the dependent variable over time, with lines connecting the time points Random slopes mixed regression model Now we have explored the random intercepts mixed regression model, where random subjects effects are included in the linear regression model. In many situations this model may be too simplistic. By using this model we assume that all women have the same progress for the response variable over time, and all men the same. The strict variance assumption of compound symmetry is often violated in longitudinal data. The next step in developing a MRM is to examine the extent of heterogeneity of subjects within groups with respect to their slopes. In 24

37 3.4. MIXED REGRESSION MODELS other words, we are examining whether a random slopes effect is statistically significant. A random slopes MRM, also denoted random-coefficient MRM, is expressed as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + v 0i + v 1i t ij + ɛ hij. (3.12) We now want to express Equation (3.12) as a two-level model similar to the two-level model in Section The within-subjects model in Equation (3.11) is unchanged, but the betweensubjects model is augmented as b 0i = β 0 + β 2 x h + v 0i, b 1i = β 1 + β 3 x h + v 1i. As before, b 0i is the intercept parameter for each subject i and b 1i is the subject-specific slope parameter indicating the development in response variable over time for subject i. The new term v 1i is the individual slope deviation from the population slope for subject i. We now have a model with subject-specific intercepts and time trends (v 0i and v 1i ), in addition to the population intercept and change in time (β 0 and β 1 ). The fixed effects of the random intercepts MRM are unaltered, but the variance-components of the random slopes MRM must be investigated further. The error ɛ hij is independendent conditional on v 0i and v 0i, and normally distributed with mean zero and variance σ 2. The random subjects and slopes effects are assumed to be bivariate normal with mean vector zero and covariance matrix given as [ ] σ 2 Σ v = v0 σ v0 v 1 σ v0 v 1 σv 2. 1 The interpretation of σv 2 0 and σv 2 1 is the heterogeneity of subject intercepts and slopes, respectively. The notation σ v0 v 1 denotes the covariance of v 0 and v 1, which may be interpreted as the relation between a subjects intercept and the subjects individual slope parameter. So, if σ v0 v 1 is positive we would interpret this as follows: A subject with an intercept above the population mean (higher initial values) have steeper slopes than subjects with smaller initial values. One of the important advantages of the random slopes mixed regression model compared with the random intercepts model is the more slack variance assumptions, compound symmetry is no longer required. A total of four variance components are estimated in the analysis, and model a more flexible variance-covariance relation of the data, see covariance matrix in Equation (3.13). This correlation structure is well explained in Rabe-Hesketh and Skrondal [2008]. In longitudinal data it is natural to expect observations that are measured subsequently to be more correlated than observations more distant in time. This model include this in the model assumptions, and is therefore more flexible and in many settings more realistic than the compound symmetry. Random coefficient structure are displayed in Equation (3.13) for data with three measure occasions, t=0,1,2: σv σ 2 σv σ v0 v 1 σv σ v0 v 1 + σv σ 2. (3.13) σv σ v0 v 1 σv σ v0 v 1 + 2σv 2 1 σv σ v0 v 1 + 4σv σ 2 25

38 CHAPTER 3. LONGITUDINAL DATA ANALYSIS Decomposing the time effect Till now we have assumed the time trend to be linear, but this may be a strict assumption that models the data poorly. Particularly this gets critical for studies with many measurement occasions. The next step in modeling longitudinal data can therefore be examination of quadratic (and maybe cubic) time trends. One alternative to introduce a quadratic trend is to simply include a quadratic time term in the regression model as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + β 4 t 2 ij + β 5 x h t 2 ij + v 0i + v 1i + ɛ hij. (3.14) An interaction group by time squared may also be included. An alternative approach to introduce polynomial time trends is to create a dichotomous variable for each measuring occasion. This requires equal fixed time points for all subjects, and makes comparison of progress in time between gender less informative. Several parameters must be estimated, thus we loose degrees of freedom. The overall progress of the dependent variable over time is not described. For data with measures at time t = 0, 1, 2 the quadratic time term is t 2 = 0, 1, 4, which is nearly collinear to t. To avoid this we can express time in centered form, for example t j = (t ij t). The interpretation of the main intercept must be adjusted for the shift in the time variable, and must be interpreted as the mean of all observations at the midpoint of time t ij. The centered time t ij denotes the original time notation, while t ij corresponds to the centered time. β 0 is the original intercept coefficient and β 0 is the centered time intercept coefficient displayed as β 0 + β 1 t ij = β 0 + β 1 t ij = β 0 + β 1 (t ij t). (3.15) The group by time interaction is also affected by the centering of time. Results for fixed effects are not altered by the centering of time, neither are the other effects in a random intercepts regression model (group effect, random subject effects and residual variance). Note: Models with polynomial time trends are still denoted linear regression models since they are linear in terms of regression covariates other than time [Hedeker and Gibbons, 2006]. When the number of measuring occasions increases, the method of centering time points gets rather complicated. A more general method to examine polynomial time effects is by using orthogonal polynomials. Comparisons of time-related effects such as change relative to baseline, consecutive time comparisons or contrasting each time point to the mean of subsequent time points are also possible to examine through orthogonal polynomials Random slopes mixed regression model, quadratic time trend We can introduce a squared time variable to model the response variable as a quadratic curve along the time axis. By including t 2 ij we obtain the model expressed as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + β 4 t 2 ij + β 5 x h t 2 ij + v 0i + v 1i t ij + ɛ hij. 26

39 3.4. MIXED REGRESSION MODELS Again this model can be represented as a two-level model. The within-subjects model is given as and the between-subjects model is presented as y hij = b 0i + b 1i t ij + b 2i t 2 ij + ɛ hij (3.16) b 0i = β 0 + β 2 x h + v 0i, b 1i = β 1 + β 3 x h + v 1i, b 2i = β 4 + β 5 x h. The squared time parameter β 4 represents the curvation of the group mean, while β 5 refers to the deviation in curvation for the group h when x h 0. The interpretation of the remaining parameters in this model are not modified compared to the simpler random slopes model. The mixed model can be extended even further by including higher-degree time variables or more random effects, but the principle of analyzing these models are basically the same as for the models described above Curvilinear mixed regression models The random slopes MRM with quadratic time trend induces a possible extension of the random effects model, that is a random squared time effect. This effect can be described as the subject-specific deviation in curvation from the population mean. This mixed regression model is denoted curvilinear MRM, and is displayed as y hij = β 0 + β 1 t ij + β 2 x h + β 3 x h t ij + β 4 t 2 ij + β 5 x h t 2 ij + v 0i + v 1i t j + v 2i t 2 j + ɛ hij. (3.17) The random curvilinear effect v 2i is multiplied by the squared time variable. In the same way as for the previous MRMs, the curvilinear model can be given as a two-level model, with withinsubjects model equal to the random slopes MRM with quadratic time term in (3.16) and betweensubjects model given as b 0i = β 0 + β 2 x h + v 0i, b 1i = β 1 + β 3 x h + v 1i, b 2i = β 4 + β 5 x h + v 2i. The fixed effects of the random slopes and curvilinear MRM are equal, and the random effects are further partioned into subject-specific intercept, slope and curvilinear variance and measurements error (unexplained variance). The subject-specific variances are correlated within subjects in a similar matter as in matrix (3.4.3), which leads to a total of six estimated variance and covariance parameters. The error variances are independent normally distributed with mean zero and variance σ 2, and conditionally independent on the other variance terms as before. The curvilinear MRM allows each subject to have an individual intercept, slope and curve parameter, and is the most general model accessible for longitudinal data restricted to three fixed measurement occasions. 27

40 CHAPTER 3. LONGITUDINAL DATA ANALYSIS Comparison of models The fixed effects in a mixed regression model can be tested by the Wald s test, which compare the test statistics to a standard normal frequency table. The null hypothesis is that the parameters are zero. For the variance and covariance parameters this test is not suitable since the variance parameters are bounded to positive values [Hedeker and Gibbons, 2006]. This is the reason why no p-values are calculated for the random terms in mixed models. Likelihood ratio tests are used to compare nested models, thus we can use this technique to examine if a random effect is necessary to model the data. Simple linear regression models and random intercepts and slopes models are hierarchically nested in each other, and we apply the likelihood ratio test to compare these models. This test uses the difference in deviance values for the models and compares them with a chi-square distribution given as 2ln ( L0 L 1 ) χ 2 ν. (3.18) The degree of freedom ν is determined by the difference in number of estimated parameters in the two models. L 0 is the maximum value of the likelihood for the data from the more specified model (for example the random intercepts model), while L 1 is the maximum value of the likelihood for the data from the general model (for example the random slopes model). For these two models there are estimated two extra parameters in the random slopes model, that is the slope variance and intercept-slope covariance. Thus the degrees of freedom ν for the likelihood ratio test equals two. Because the variance terms are restricted to be positive, the likelihood ratio test for random effects are too conservative. Hedeker and Gibbons [2006] refers to Berkhof and Snijders [2001] to state that more correct p-values are obtained by dividing the test values by two. This is also described in Rabe-Hesketh and Skrondal [2008]. Regression parameters are calculated by maximum likelihood (ML) estimates or restricted maximum likelihood (REML) estimates. REML estimates are often preferred over ML estimates because REML adjusts the likelihood for the number of covariates in a model. This is also why these estimates cannot be applied when models are compared with a likelihood ratio test. The parameters of all models must be estimated by maximum likelihood according to the above mentioned [Rabe-Hesketh and Skrondal, 2008]. The null hypothesis of likelihood ratio tests is formulated in the following matter; the more restricted model is sufficient to model the data with less parameters compared to the more general model. If the likelihood ratio test is statistically significant, the null hypothesis is rejected and the general model with more parameters is preferred. 3.5 Covariance pattern models In Section 3.4 we found that mixed regression models can be thought of as an extension of the univariate repeated measures ANOVA models, and similarly we can imagine the covariance pattern models (CPM) as an extension of the multivariate ANOVA models for repeated measures. CPM are formulated as regression models in the same way as MANOVA, but assumes 28

41 3.5. COVARIANCE PATTERN MODELS the variance-covariance matrix to be of a certain form. No random effects are included in the regression model, so these models do not distinguish the variance term in within-subjects and between-subjects variance. CPM treat time as a categorical variable, so the time is considered fixed for all subjects. Unlike MANOVA the CPM allow unbalanced design. The regression model for CPM on matrix form can be written as y i = X i β + e i. (3.19) The n i 1 vector y i contain the responses for subject i with j = 1, 2,, n i observations, i = 1, 2,, N. X i is the covariance matrix for subject i, β corresponds to the fixed effects parameter vector and e i is the n i 1 error vector. e i is the model specification that separates CPM from MANOVA models, where the error vector is assumed normally distributed with mean zero and variance σ 2 I ni n i. We assume that the error vector satisfies the assumptions e i N(0, Σ i ). (3.20) The variance-covariance matrix Σ i is estimated for the n i time points in which subject i is measured. Note that CPM estimates the regression parameters jointly, that is based on all covariates and observed values of the dependent variable simultaneously. Thus these models handle missing data structures of both MCAR and MAR mechanisms. Equation (3.20) leads to the model assumptions y i N(X i β, Σ i ) and Var(y i X i ) = Σ i Covariance patterns Different covariance patterns are feasible for CPM, and some of these are explained below. Independent covariance structure The independent covariance structure implies independent measurements and equal variance σ 2 at each time point, that is σ 2 I ni n i. This structure can be found in studies where different subjects are measured at the different time points, as a simple example, but this occurs rarely with longitudinal data. Exchangeable covariance structure The exchangeable covariance structure is previously referred to as compound symmetry, and is induced by the random intercepts MRM and repeated measures ANOVA models assuming homoskedasticity and sphericity. It is defined by two parameters, σ 2 and σ1 2, where the first is the residual variance and the latter being the estimated random intercepts variance. The covariance pattern is displayed for a repeated measures ANOVA in Equation (3.6), and expressed by the newly introduced parameters as 29

42 CHAPTER 3. LONGITUDINAL DATA ANALYSIS σ1 2 + σ2 σ1 2 σ1 2 + σ2 σ1 2 σ1 2 σ1 2 + σ2. (3.21) σ1 2 σ1 2 σ1 2 + σ2 First-order autoregressive structure, AR(1) This covariance pattern is often used with time series and requires estimation of the measurement error σ 2 and the autoregressive coefficient ρ. The covariance is assumed to diminish by lags, so consecutive measurements in time are more correlated than measurements more distant in time. Lags describe the distance in time between two fixed (equally distanced) measuring occasions, where lag-1 corresponds to two consecutive occasions, lag-2 corresponds to two fixed time intervals between measurement occasions, and so on. Estimated covariance between time points j and j are calculated as σ jj = σ 2 ρ j j. This covariance structure assumes at least two consecutive observations to be observed for each subject to be able to estimate the autoregressive parameter. It also assumes all intervals between measuring occasions to be equal, thus no gaps between measures are tolerated. These assumptions force constraints on the missing data structure, and subjects that do not fulfill the requirements are either omitted or should be imputed prior to analyses. The first-order autoregressive structure can be written as σ 2 1 ρ 1 ρ 2. ρ ρ n 1 ρ n 2 ρ 1. (3.22) Toeplitz structure The toeplitz structure, also referred to as the banded structure, holds a covariance parameter for each lag. This leads to a total of n estimated parameters, that is one for each time point. The assumptions implied by this covariance structure are equal variance at all time points and equal covariance for all measurements with equal lags. θ 1 corresponds to the variance of measurements, θ 2 is the lag1 covariance and so on. The measuring occasions are restricted to be fixed and with equal distances between all consecutive measurements. The first-order autoregressive structure is a special case of the toeplitz structure, the latter given as 30

43 3.6. ANALYSES OF DISCRETE OUTCOME VARIABLES θ 1 θ 2 θ 1 θ 3 θ 2 θ 1. (3.23) θ n θ 2 θ 1 Unstructured form The above covariance structures AR(1) and toeplitz hold restrictions for the time point intervals to be equally distanced and equal within lags. A more general structure, denoted the unstructured form, allows all variances and covariances to be different. This covariance structure allows all types of missing data structure among subjects, and is therefore often the preferred for longitudinal datasets with few measuring occasions. The total number of parameters that must be estimated is n(n + 1)/2, where n is the number of time points. For studies with many observational time points this structure demands estimation of a high number of parameters which may lead to imprecise estimates of variances and covariances. The unstructured correlation form is displayed as θ 11 θ 21 θ 22 θ 31 θ 32 θ θ n1 θ n(n 1) θ nn (3.24) Covariance pattern models are dependent of a correct fixed regression model and a suitable covariance pattern to obtain unbiased results. When selecting the model, the set of fixed covariates must include all possible variables that may affect the response variable, and this set of covariates must stay unchanged through the selection process of the covariance pattern. Once the covariance structure is determined, the fixed regression model selection can be performed as in a standard regression analysis. To select the most appropriate covariance structure Σ a likelihood ratio test is employed as explained in Section Analyses of discrete outcome variables We have now described methods of linear models to analyze continuous responses in longitudinal datasets. Both mixed regression models and covariance pattern models handle ignorable missing, and have possibilities to adjust the covariance matrix to the data. When the response variable is discrete (for example binary, ordinal or a count), these linear models do not model the changes in the mean response due to covariates in an adequate way. This section will give an introduction to modeling of ordinal variables in a longitudinal dataset, when the data contain unobserved values. Generalized linear models (GLM) is applied for analyzing univariate discrete outcome variables, via known variances and link functions. For longitudinal data the GLM is not sufficient 31

44 CHAPTER 3. LONGITUDINAL DATA ANALYSIS to model discrete responses because of the dependency between observations within subjects. There are two ways to extend GLMs to longitudinal data, either by generalized linear mixed regression models (GLMM) as described in Section 3.6.1, or by generalized estimating equations (GEE) described in Section 3.6.2, according to Hedeker and Gibbons [2006]. These sections focus on analysis with an ordinal outcome variable, but other discrete outcome variables can be analyzed in a similar way. An ordinal variable is a categorical variable where the categories are ordered. Examples of ordinal variables are agreement ratings with categories disagree, undecided and agree, rating of movies with categories one to six, and items of the SF-36 questionnaire with categories one to five where higher scores refers to higher quality of health for the given item Generalized linear mixed regression models Proportional odds models (POM) are a common choice for analysis of ordinal data, thus many of the mixed models for ordinal data are generalizations of this model. POM is based on logistic regression formulation and characterizes the ordinal responses in C categories in terms of C 1 cumulative category comparisons. The notation P hijc = P (Y hij c) = c k=1 p hijk is introduced, where p hijk is the probability of response in category k for subject i in group h at time j. POM can be extended to a random intercepts mixed effects proportional odds model as follows ( ) Phijc log = γ c (β 1 t ij + β 2 x h + β 3 x h t ij + v 0i + ɛ hij ), (3.25) 1 P hijc where c = 1, 2,, C 1. γ c are category-specific parameters denoted model thresholds and are assumed to be strictly increasing in c. The interpretation of the random intercepts MRM parameters is equivalent to the model described in Section The mixed effects POM is essentially a proportional odds model with random effects in the linear regression equation [Rabe-Hesketh and Skrondal, 2008]. The model in Equation (3.25) is a cumulative model for ordinal responses in terms of the linear regression model linked to a cumulative probability, not the mean of response as in a GLM. POM is a multiplicative model. To illustrate this, let us look at a continuous covariate that increase by 5 units from x i to x i. The odds ratio is exp(β)5 if β is the parameter of variable x i. An alternative modelrepresentation of Equation (3.25) is expressed by the latent-response formulation, also called the threshold model, in Rabe-Hesketh and Skrondal [2008] as y i = β 1 t ij + β 2 x h + β 3 x h t ij + v 0i + ɛ hij. The terms ɛ hij t ij, x h, v 0i have logistic distributions and are assumed to be independent across subjects and measurement occasions. The continuous latent variable yi are related to the ordinal variable via the threshold model y i = s if γ s 1 < yi γ s. Further, more complex mixed effects POM can be fitted in a similar way. An alternative approach to model ordinal variables is by probit regression formulation. Hedeker and Gibbons [2006] describes both approaches in detail, and Rabe-Hesketh and Skrondal [2008] explain the modeling of such methods in Stata. 32

45 3.6. ANALYSES OF DISCRETE OUTCOME VARIABLES Generalized estimating equations As a generalization of GLM, the generalized estimating equations (GEE) support many different types of dependent variables. The method was developed for non-continuous variables as dichotomous responses and counts, and is rarely used for continuous data [Rabe-Hesketh and Skrondal, 2008]. Marginal distribution of Y ij at each time point need to be specified (denoted quasi-likelihood models), in contrast to full likelihood-models where the joint distribution of a subject s response vector y i must be specified. The variance-covariance structure is treated as a nuisance, and this is an important feature of GEE. Since the unobserved variables are dependent only on the covariates, the missing data structure implied is the covariate-dependent MCAR. Thus the method of GEE does not handle ignorable missing data assuming MAR, which is a major disadvantage when working with missing values. GEE assumes all subjects to be measured at the same fixed occasions, but it assumes no restrictions on the missing data pattern (dependent on the chosen correlation structure). Fitzmaurice et al. [2009] introduced the term marginal models which refers to models for longitudinal data without random effects. This includes among others the covariance pattern models described in Section 3.5 and GEE. Both models assume the number of measurement occasions to be fixed, but missing values among the time points are allowed. The correlation structures induced by the CPM and GEE are similar, but there are some basic differences that separate the properties of the methods. CPM specify the joint distribution and likelihood of the dependent variable vector Y i and apply only for continuous normal distributed outcome. GEE, on the other hand, specifies the marginal distribution and likelihood of Y i for each time j, but can be applied for many types of outcome variables. Specification of a GEE is similar to a GLM, with a linear predictor, a link function and variance described as a function of the mean. An additional feature of GEE is the working correlation structure R, a n n correlation matrix common for all subjects. If subject i are observed at n i occasions he or she achieve a n i n i correlation matrix with the appropriate rows and columns for the observed time points from R. The choice of working correlation matrix should be consistent with the observed correlation matrix, and is often selected from the structures displayed in Section A common set of association parameters a are estimated, where the size of the vector a dependent on the working correlation structure chosen. Selection of the correlation structure for the repeated measurements is not as critical for GEE as for mixed regression models or covariance pattern models. This is because GEE provides estimated parameters and standard errors that are robust to misclassification of the variancecovariance structure, as long as the univariate analysis models at each time point are specified correctly. The statistical power decreases with misclassification of the correlation structure, but the loss of power is small when the number of subjects increases. GEE should be applied when the research interest is focused on estimates and inference of the regression parameters, but is not suitable when modeling variance-covariance structures of longitudinal data. Solving the GEE involves iterating between the quasi-likelihood solutions for estimating β s and a robust method for estimating a set of correlation parameters a as a function of β. Hedeker and Gibbons [2006] describes the iterating process as follows 33

46 CHAPTER 3. LONGITUDINAL DATA ANALYSIS 1. Given estimates of R i (a), calculate estimates of β using iteratively reweighted least squares. 2. Given estimates of β, calculate Pearson residuals and use these residuals to consistently estimate a. Iteration over the two steps continues until convergence, that is when the updated estimates approximates the rejected estimates. Hedeker and Gibbons [2006] describes in more detail the process of GEE, and examples where the method is applied. There are two general approaches for handling data assuming the MAR mechanism within the framework of GEE. The first approach is to analyze multiple imputed data by generalized estimating equations. The technique of GEE described above can be performed without further adjustment. The properties of multiple imputation leads to unbiased analysis results if the GEE is correctly specified. The second approach is the weighted estimating equations. The idea of weighted estimating equations is to account for subjects with missing responses by giving extra weight to the subjects with observed measurements and similar observed covariates and the same history of responses. The weights are calculated as the inverse probability of being observed. This approach is suitable when the missing data pattern is monotone, typically as a result of dropout, and is discussed in particular by Fitzmaurice et al. [2009]. In many longitudinal studies the subjects are missing at an arbitrary missing data pattern, as explained in Section For such settings approximative methods have been proposed for handling missing data that are assumed to be MAR. Further, if data are assumed MNAR, more complex methods must be applied. 34

47 4 The data of patients undergoing cardiac surgery Missing values or non-response occur frequently in datasets with measurements of persons involved, especially when the observations are self-reported and repeated at several timepoints. A dataset of this type is analyzed in the article The role of sex in health-related quality of life after cardiac surgery: a prospective study, written by Gjeilo et al. [2008]. Of all patients undergoing cardiac surgery at the Cardiothoracic Surgery at St Olavs Hospital, Norway, in the period from September 2004 to September 2005, a total of 534 patients were included in the study. Of these were 413 men and 121 women. Data were prospectively collected using the Norwegian verson of the Short-Form Health Survey (SF-36) version 1.2 [Loge and Kaasa, 1998], among others. The variables with main focus in the article by Gjeilo et al. [2008] are age, gender, marital status and health-related quality of life variables (denoted HRQOL and measured by SF-36 and health transition). We have selected four of the HRQOL-variables, that is general health, bodily pain, social functioning and role emotional based on fraction of missing data and results from the completecase analyses. Men and women were found to improve different for the variables role emotional and bodily pain, while the differences between genders presurgery and follow-up occasions differed between general health and social functioning. This selection is performed to decrease the amount of results and the size of tables in the report, and thus help the reader to focus on the important results. These four variables are ment to represent the features that exist in the data. Age and gender are variables that are measured presurgery, and do not change during the study (age increases by one year during the study, this increase is equal for all patients). Marital status is measured at all three time points, and have nine options of status; married, couple living together, widow/widower, single, separated/divorced and combinations of these. The research questions involve the status of living together or alone, thus the variable is implemented as a dichotomous variable where zero refers to living alone and one is living together. This variable may change during the study. 35

48 CHAPTER 4. THE DATA OF PATIENTS UNDERGOING CARDIAC SURGERY 4.1 SF-36 questionnaire The SF-36 [Loge and Kaasa, 1998] is a short-form health survey with 36 questions/items developed to access HRQOL. The 36 items yield eight scales of functional, physical and mental health, that is general health (GH), physical function (PF), bodily pain (BP), mental health (MH), role limitations owing to physical problems (role physical, RP), role limitations owing to emotional problems (role emotional, RE), vitality (VT) and social functioning (SF). These scales are made up from different number of items, as seen in Table 4.1. The SF-36 has been extensively applied in several countries, including Norway, and found satisfactory for evaluating HRQOL in cardiac surgery. In addition to the eight scales from SF-36 the item health transition (HT) is recorded. All nine variables are transformed into intervals from zero to 100, where higher scores reflect better health. Table 4.1: SF-36 scales and health transition (HT). Each scale is made up from several items, which leads to different number of possible responses for each scale. Scales Items Levels PF RP 4 5 BP 2 11 GH 5 21 VT 4 21 SF 2 9 RE 3 4 MH 5 26 HT - 5 The compound scales presented in Table 4.1 are computed as the mean of the items that form the scale. Such scales are sensitive to missing data, if one item is missing then the whole scale will be missing. Questionnaires often have guidelines to handle missing values to ensure that most of the composite scores are obtained. For SF-36 the mean of the available items replaces the missing items if at least half of the items of a score are observed [Ware et al., 2000]. 4.2 Missing-data structure A total of 311 patients (or 58%) of the 534 patients included in the study responded to all of the variables at the three time points. 470 of the 534 patients, that is 88%, have fully observed variables before surgery. The middle columns in Table 4.2 displays the missing data structure of the dataset for four of the HRQOL-variables general health, bodily pain, social functioning and role emotional, in addition to the variables age, gender and marital status. A similar table for all HRQOL-variables are found in Appendix, Table A.1. Each subject are given an identification number (denoted patkey), displayed in the first row of the table. Patkey, age, gender and the presurgery measurement of marital status have no missing values. All the four HRQOL-variables 36

49 4.3. OBSERVED COVARIANCE AND CORRELATION MATRICES have less missing data presurgery compared to the follow-up time points. Role emotional has most missing values, with more than 20% non-response at both follow-up occasions. In studies where the same subjects are observed repeatedly, the phenomena of missing forms may appear. When a subject included in the study have missing values for all variables at one time-point, we call this a missing form. In this study none of the subjects have missing forms presurgery, but at six months a total of 72 subjects have completely missing responses. At six months there are 69 missing forms, and 43 of these are found at both follow-up time points. The reason for missing forms can be many, for example because of forgetfulness, the patient is very ill, or so healthy that he or she sees no point in responding, patient moved or did not recieve the questionnaire due to other reasons, or death. When examining the missing-data structure after omitting subjects with missing forms, the pattern of missing data are more similar at all three timepoints, see the two right columns in Table 4.2. None of the variables at any of the timepoints have more than 9% missing values. The missing data due to missing items only seems to be similar for the subjects without missing forms. The variable with most missing values is role emotional. 521 patients were alive after 12 months from operation time, and approximately 90% of them responded at least one of the two observational timepoints after surgery. The patients that did not survive until 12 months after surgery are a special case of missing data in the dataset, the reason for missing is obvious but it may be difficult to interpret in the analyses. One solution to this is to exclude all patients that die during the first 12 months after surgery, and base the analyses on the patients that survive their first year postoperation. In this way the research question is altered, we only want to look at the surviving patients after heart surgery. Another approach is to evaluate the reason of death, and separate the patients on base of this. If the reason for death is a consequence of the heart surgery or in any other way connected to the operation or their heart condition, the missing values should be handled by care. In this situation the missing data may be cathegorized as either MAR or MNAR. In comparison, if the reason for death has no connection to the surgery, the missing values are arbitrary and thus may be treated as MCAR. This is discussed further in Section Observed covariance and correlation matrices Further we want to look at the relation between measurements within subjects to examine the correlation induced by repeated measures on the same subjects. The covariance matrix of the observed data for role emotional is displayed to the left in Table 4.3. The diagonal elements equals the variances at each time point, covariances below the diagonal are based upon listwise deletion and covariances above the diagonal are computed from pairwise observed data. The right matrix in Table 4.3 displays the correlation matrix of the same variable. As we can see the variances are not equal at the three measurement occasions, most variability is found presurgery while the variability is more similar and lower after the surgery. Covariances between the presurgery measurements and the follow-up measurements are approximately equal, but less than the covariance between six and twelve months follow-up after surgery. This covariance matrix indicates that the assumptions of homoskedasticity and sphericity may be violated. The correlation matrix for role emotional can be interpretted the same way. Only mi- 37

50 CHAPTER 4. THE DATA OF PATIENTS UNDERGOING CARDIAC SURGERY Table 4.2: Missing-data structure for the variables patkey, age, gender, marital status, general health (GH), bodily pain (BP), social functioning (SF) and role emotional (RE). The second and third columns are calculated based on the original dataset, while the two last columns describe the missing data structure in the data after exclusion of missing forms. The percentages are calculated based on all SF-36 items and HT at the three time points. Variable No. of missing % missing No. of missing % missing Patkey (id) Gender Age Marital status Before surgery months follow-up months follow-up General Health Before surgery months follow-up months follow-up Bodily Pain Before surgery months follow-up months follow-up Social Functioning Before surgery months follow-up months follow-up Role Emotional Before surgery months follow-up months follow-up nor differences are found when comparing the complete-case and available-case samples with respect to correlations. 38

51 4.3. OBSERVED COVARIANCE AND CORRELATION MATRICES Table 4.3: Variance-covariance matrix (left) and correlation matrix (right) for the variable role emotional (RE) at the three measurement occasions presurgery (RE1), six months follow-up (RE2) and 12 months follow-up (RE3). Variable quantities below diagonal are based on listwise deletion, and quantities above on pairwise deletion. RE1 RE2 RE3 RE1 RE2 RE3 RE1 RE2 RE RE RE RE

52 5 Complete-case analyses In the article of Gjeilo et al. [2008] the hypotheses of possible differences between genders at each measuring occasion and the impact of gender on the improvement of HRQOL-variables over time are explored. The first research question is examined by a Student s t-test assuming unequal variances and the latter by repeated measures ANOVA. Both of these analyses require completely observed data for all subjects, and complete-case analyses are performed to handle the missing values in the dataset, that is available-case analyses for the t-test and listwise deletion for the repeated measures ANOVA. The following sections describe the methods and results published by Gjeilo et al. [2008] by a redo of the analyses. 5.1 Student s t-test The first main goal of the study is to examine possible differences between men and women prior to surgery and during the restitution time when evaluating Quality-of-Life variables. The statistical analysis to examine this question is a two-sample Student s t-test, which test if the mean of men are significantly different from the mean of women. Normality is assumed by Gjeilo et al. [2008], and can be examined graphically by normal Q-Q plots, or by normality tests like the Kolmogorov-Smirnov test and the Shapiro-Wilks test. For interval-restricted variables we might expect some floor and ceiling effects, but according to Sullivan and Dagostino [1992] such violations of the normality assumption do not necessarily lead to biased results of the t-test. At each of the measuring points baseline, six and twelve months after surgery the subjects are assumed independent. A test of homogeneous variance (F-test) in the two groups male and female indicates that the variance are unequal for some of the variables at some occasions. Thus the two-sample Student s t-test for continuous variables assuming unequal variances (also known as Welch-Satterthwaite t-test) is suitable to examine differences in means for men and women at each time point. 40

53 5.1. STUDENT S T-TEST Implementation in SPSS and Stata The statistical calculations that are published in the article of Gjeilo et al. [2008] are performed using SPSS for Windows version 13.0 (SPSS Inc., Chicago, Illinois, USA). The redo of the complete-case analyses are performed using SPSS version 16.0 for Windows [2007] and Stata/SE version 10.1 for Windows [2007]. In SPSS the results are obtained by the command Compare Means -> Independent-Sample T Test, and p-values are found in the row GH-scale, Equal variances not assumed of the output. In Stata the script in Listing 5.1 provides results for the three measuring occasions for the variable general health. Listing 5.1: Student s t-test in Stata for general health (GH) 1 t t e s t gh1, by ( kjønn ) u n e q u a l 2 t t e s t gh2, by ( kjønn ) u n e q u a l 3 t t e s t gh3, by ( kjønn ) u n e q u a l gh1 corresponds to the presurgery measurements, gh2 is the six months follow-up measurements and gh3 is the observations at twelve months follow-up. The by() option specifies the grouping on gender, and unequal implies that the variances are assumed unequal in the groups. Note: The data are assumed to be stored in wide format. This will be explained in Section Results, Student s t-test The results for variables general health, bodily pain, social functioning and role emotional are displayed in Table 5.1. A table including all the variables analyzed in Gjeilo et al. [2008] can be found in Appendix Table A.2. All estimates for both men and women increase during the study, thus the HRQOL-variables seems to be positively affected by the heart surgery. The first impression of the differences between men and women is the high proportion of significant p-values that means there are differences in means for the genders. Men have higher estimated means for all variables at all time points. From Table 5.1 we see that general health is considered equal between men and women at baseline, but after surgery this variable turns out to be different among the genders. Men score significantly higher than women at six and twelve months after surgery. This may suggest that men have more effect of the operation and that they are in better shape after the surgery than women. The same pattern is found for the variables role emotional and bodily pain. For social functioning the pattern is different, the genders score different at baseline and six months after surgery, but at twelve months after surgery the difference is insignificant. It must be noticed that the p-values for role emotional at baseline of and for social functioning at twelve months follow-up of are slightly larger than the significance level of α = 0.05, and should be handled by care. Such p-values indicate a weak difference between genders at the given occasions. 41

54 CHAPTER 5. COMPLETE-CASE ANALYSES Table 5.1: Results from a two-sample Student s t-test assuming unequal variances for the variables general health (GH), bodily pain (BP), social functioning (SF) and role emotional (RE), grouping on gender. Significant p-values are marked as boldface. Baseline Six months Twelve months SF-36 Mean SD P-value Mean SD P-value Mean SD P-value GH Male Female BP Male Female SF Male Female RE Male Female Repeated measures ANOVA The second research question in the article of Gjeilo et al. [2008] is to examine a possible different improvement of mental and physical health during and after a heart surgery. The analyses of differences between genders at each measuring timepoint are already examined above, and the same grouping in gender is examined with respect to improvement of the HRQOL-variables over time. This improvement are analyzed by repeated measures ANOVA as explained in Section 3.3. The first step when performing a repeated measures ANOVA is to set up the analysis model. This must include main effects gender and time, and the gender by time interaction, in addition to the subject identification to ensure the correlation structure is maintained in the analysis. The analysis model is given as y hij = β 0 + β 1 t j + β 2 x h + β 3 x h t j + v 0i(h) + ɛ hij, (5.1) where the parameters are interpreted as described in Section Specially the parameter β 3 that explains the interaction effect is of interest in these analyses. Additional variables may be added as explanatory variables. In the article of Gjeilo et al. [2008] the variables age and marital status are explored as potential covariates, but are not considered significant and thus omitted in the final model. 42

55 5.2. REPEATED MEASURES ANOVA Data layout The model above leads to a structure of the data as displayed in Table 5.2. The subjects are nested within groups and crossed with the time factor. Each subject belong to one of the genders, and is observed once at each time point. Table 5.2: Design of data for the repeated measurements ANOVA, subjects grouped by gender and crossed with time points. Time point Gender Subject Presurgery Six months follow-up Twelve months follow-up 1 1 y 111 y 112 y y 121 y 122 y N 1 y 1N1 1 y 1N1 2 y 1N y 211 y 212 y y 221 y 222 y N 2 y 2N2 1 y 2N2 2 y 2N2 3 We are now familiar with the analysis model. Listwise deletion are performed to obtain a balanced design, that is all subjects have observations at the same measurement occasions. This leads to the same time intervals between measurements for all subjects. The assumption of multivariate normally distributed variables can be tested in Stata by the command omninorm [Baum and Cox, 2007]. Repeated measures ANOVA is robust to violations of the assumption of multivariate normally distributed variables [The University of Texas at Austin Statistical Services, 1997]. The assumption of compound symmetry can be examined by a F-test. As described in Section the repeated measures ANOVA is robust to violations of homoskedasticity, thus we can carry out the analyses although the F-test reveals heteroskedasticity of the data. Sphericity can be examined by Mauchly s test [Mauchly, 1940] implemented in SPSS, but not in Stata. The method of repeated measures ANOVA requires listwise deletion prior to the analysis. This excludes a higher fraction of the subjects than the t-test, since it disregards all information from patients who have one or two observed values, while the t-test applies available-case analyses that includes observations at the relevant measurement occasion independent of the response at the other time points Implementation in SPSS and Stata The repeated measures ANOVA requires data to be stored in long format, that means each variable is represented in one column each, and a variable time keeps track of the measurement occasions. This results in n rows for each subject in the data matrix, and each row for a subject corresponds to a measurement occasion. The corresponding wide format has one row for each subject, and the variables are kept in n columns, one for each time point. The dataset 43

56 CHAPTER 5. COMPLETE-CASE ANALYSES in Gjeilo et al. [2008] is structured in wide format, thus we have to transform it to long format prior to the repeated measures ANOVA. This is done in SPSS by the command Data -> Restructure. In Stata this is performed by the command reshape presented in Listing 5.2 below. The variable time starts counting at one, thus the time will be coded as t = 1, 2, 3. If we want the baseline time point to be zero to ease interpretation this can be done as displayed by the replace command on line two. Listing 5.2: Reshape of the data from wide to long format 1 r e s h a p e wide mh v t bp gh s f pf rp r e ht, i ( p a t k e y ) j ( time ) 2 r e p l a c e t ime = time 1 The command General Linear Models -> Repeated measures is used to perform the complete-case repeated measures ANOVA in SPSS. The variable patkey holds the identification of each subject and ensures the correlation structure of the data to be maintained. The output of this command gives many interesting results about the current variable, for example Levene s test of equality of error variances, Mauchly s test of sphericity and Box s test of equality of covariance matrices, in addition to the test results for between- and within-subjects effects. These results indicate that sphericity cannot be assumed, thus we have to use methods available to adjust the degrees of freedom for the tests of within-subjects effects. In SPSS the Greenhouse- Geisser and Huynh-Feldt adjustments are given. In Gjeilo et al. [2008] the Greenhouse-Geisser correction is preferred when testing the within-subject effects. The same analyses are performed in Stata. Two commands are available for repeated measures ANOVA, that is anova and wsanova. This two gives the same results, but are different in structure and the generality of the commands. anova is the applied command here because it is most general. The script to obtain results from the repeated measures ANOVA in Stata is given in Listing 5.3 for the variable general health. First line performs listwise deletion and second line transforms the data from wide to long format. The command anova is given on the last line. Listing 5.3: Repeated measures ANOVA for general health (GH) 1 drop i f r e 1 ==. r e 2 ==. r e 3 ==. 2 r e s h a p e wide mh v t bp gh s f pf rp r e ht, i ( p a t k e y ) j ( time ) 3 anova gh t ime kjønn p a t k e y time kjønn, r e p e a t e d ( t ime ) bse ( p a t k e y ) The option repeated() tells the program that we have correlated data and takes as input the within-subjects variable, here represented by the variable time. bse() takes the between-subjects effects as input, here as the identification number patkey. It automatically omits subjects without any measurements for the relevant HRQOL-variable when adjusting for lack of sphericity (Greenhouse-Geisser and Huynh-Feldt). The anova-command returns both original and adjusted p-values as default. Note: To perform valid analyses with the anova-command in terms of the assumptions given in Section 3.3.1, listwise deletion must be performed prior to analyses. 44

57 5.3. GRAPHICAL PRESENTATION OF SUBJECT SAMPLES Results, repeated measures ANOVA The anova command presented above gives exactly the same results with respect to p-values for the gender by time interaction as found in SPSS and the published article. Table presents the results of the repeated measures ANOVA for the variables general health, bodily pain, social functioning and role emotional. The results for all SF-36 variables and health transition are given in Appendix Table A. Table 5.3: Repeated measures ANOVA with Greenhouse-Geisser adjustment for non-sphericity for the SF-36-variables general health (GH), bodily pain (BP), social functioning (SF) and role emotional (RE). Analyses are based on complete-case subjects and grouped on gender. Significant p-values are marked as boldface. Number of fully Repeated measures SF-36 observed patients ANOVA GH P-value gender by time BP gender by time SF gender by time RE gender by time The gender by time interaction for bodily pain and role emotional are found statistically significant, which means that the improvement of these scores are indicated to be different for men and women. The intercepts for the two remaining variables general health and social functioning are insignificant, thus there are no proven difference in improvement between genders for these variables. 5.3 Graphical presentation of subject samples The complete-case subjects and available-case subjects at each time point form different samples of subjects from the original data. These samples are displayed as profile plots and tables of means, standard deviations and sample sizes for men and women separately. Profile plots display the means for men and women at each time point, with drawn lines between the occasions. These plots are informative when examining improvement pattern for the genders, and to visualize the differences in genders at each time point. Student s t-tests base the analyses on available-case samples which correspond to all observed values in the profile plots below. The repeated measures ANOVA base upon completecase samples displayed in the left profile plots. The omitted subjects are also displayed in profile plots for the four variables. This is to give information about the excluded information in 45

58 CHAPTER 5. COMPLETE-CASE ANALYSES complete-case analyses. Note: The profile plots for all observations and omitted subjects do not base the estimation of means at each time point upon the same set of subjects, since some of the subjects included only have one or two observations. 46

59 5.3. GRAPHICAL PRESENTATION OF SUBJECT SAMPLES Table 5.4: Mean of variable general health (GH) for men and women at the three measurement occasions presurgery, six and twelve months follow-up. CC refers to the completely observed subjects, All is all subjects included in the study and Omit denotes the subjects omitted by the complete-case method. GH Time Mean Standard deviation Number of subjects point CC All Omit CC All Omit CC All Omit Pre Men 6 mths mths Pre Women 6 mths mths Figure 5.1: Plot of mean for GH for men and women at the three time points before, 6months and 12 months after surgery. The left plot is based on complete-case subjects, the plot in the middle is based on all observations of all subjects and the right plot is based on excluded subjects from complete-case methods. 47

60 CHAPTER 5. COMPLETE-CASE ANALYSES Table 5.5: Mean of variable bodily pain (BP) for men and women at the three measurement occasions presurgery, six and twelve months follow-up. CC refers to the completely observed subjects, All is all subjects included in the study and Omit denotes the subjects omitted by the complete-case method. BP Time Mean Standard deviation Number of subjects point CC All Omit CC All Omit CC All Omit Pre Men 6 mths mths Pre Women 6 mths mths Figure 5.2: Plot of mean for BP for men and women at the three time points before, 6months and 12 months after surgery. The left plot is based on complete-case subjects, the plot in the middle is based on all observations of all subjects and the right plot is based on excluded subjects from complete-case methods. 48

61 5.3. GRAPHICAL PRESENTATION OF SUBJECT SAMPLES Table 5.6: Mean of variable social functioning (SF) for men and women at the three measurement occasions presurgery, six and twelve months follow-up. CC refers to the completely observed subjects, All is all subjects included in the study and Omit denotes the subjects omitted by the complete-case method. SF Time Mean Standard deviation Number of subjects point CC All Omit CC All Omit CC All Omit Pre Men 6 mths mths Pre Women 6 mths mths Figure 5.3: Plot of mean for SF for men and women at the three time points before, 6months and 12 months after surgery. The left plot is based on complete-case subjects, the plot in the middle is based on all observations of all subjects and the right plot is based on excluded subjects from complete-case methods. 49

62 CHAPTER 5. COMPLETE-CASE ANALYSES Table 5.7: Mean of variable role emotional (RE) for men and women at the three measurement occasions presurgery, six and twelve months follow-up. CC refers to the completely observed subjects, All is all subjects included in the study and Omit denotes the subjects omitted by the complete-case method. RE Time Mean Standard deviation Number of subjects point CC All Omit CC All Omit CC All Omit Pre Men 6 mths mths Pre Women 6 mths mths Figure 5.4: Plot of mean for RE for men and women at the three time points before, 6months and 12 months after surgery. The left plot is based on complete-case subjects, the plot in the middle is based on all observations of all subjects and the right plot is based on excluded subjects from complete-case methods. 50

63 5.4. INTERPRETATION OF PROFILE PLOTS 5.4 Interpretation of profile plots For variable general health the mean for both men and women presurgery are considerably affected by the listwise deletion of subjects, the proportion of omitted subjects is large for both groups and the mean of omitted subjects are smaller than for complete-case subjects. Thus we induce a higher intercept mean for both genders by applying listwise deletion. The mean for the three subject samples for men at 6 months and 12 months follow-up are approximately equal for the samples, but for women these means still remain much smaller for omitted females than for complete-case females. The number of women with non-responses is not of the same size as for baseline, thus the influence of the means calculated on base of all subjects are not of equal magnitude as for the presurgery occasion. The standard deviations for both genders at all three time points are similar in magnitude among the three samples of subjects. We have seen that the means for general health are altered when grouping the subjects in samples of complete-case and omitted subjects. The plots for the different samples are not as different for the variable bodily pain. The mean for men is slightly higher for omitted subjects compared to complete-case subjects presurgery, while the mean for women at 12 months followup is higher for the omitted subjects than for the complete-case subjects. The standard deviations are in general higher for omitted male subjects compared with complete-case subjects. For women this is not always the case, prior to surgery and at 12 months follow-up the omitted subjects give less variance than the complete-case subjects. The profile plots of social functioning in Figure 5.3 reveal that the omitted female subjects with observations at 6 months follow-up have an extremely high mean value compared with the complete-case subjects. The standard deviation for these subjects are also smaller than for the complete-case female subjects, but it is important to notice the size of this group, only 12 of the omitted subjects have observed values at this time point. The means for male subjects that are omitted are much smaller than the complete-case subjects at 6 and 12 months follow-up. The number of subjects observed at these occasions are very small, thus the means for all subjects are only slightly altered compared with the means for complete-case subjects. The last variable role emotional is the variable with most significant gender by time interaction from the complete-case repeated measures ANOVA, but with insignificant intercept effect from the imputation analyses and the analysis by curvilinear MRM. This variable is therefore of special interest when looking at observations of the omitted subjects and their affection on the analysis results based on all data. The plots in Figure 5.4 draws the attention to the female subjects, where the presurgery mean of omitted subjects are much smaller than for completecase subjects. Among the 59 omitted subjects as many as 45 subjects have observed responses at baseline. Compared to the sample of complete-case women that consists of 62 subjects we observe that a considerable portion of the information of female patients are excluded in the complete-case analysis. The mean for all observed women presurgery is remarkably altered by the inclusion of omitted subjects, from 58.1 to This is also commented in the article by Gjeilo et al. [2008]. The means at six and twelve months follow-up are more similar for the samples for both men and women. 51

64 6 Imputation analyses We have now presented the published results obtained by complete-case analyses. As proposed in Section 2.3 the methods of expectation maximization (EM) and multiple imputation (MI) are able to handle data with non-response if the missing data can be assumed ignorable. The method of EM is performed in SPSS [2007] and R [2008], while MI is performed mainly in Stata [2007]. 6.1 Expectation Maximization Expectation maximization (EM) is an general algorithm that is used for multiple purposes and is relatively easy to implement, and is therefore incorporated in most statistical software (for example SPSS, R, NORM and SAS). To demonstrate the use of this approach we explain how the analyses performed in Gjeilo et al. [2008] are obtained by EM in both SPSS and R. First, we must settle a joint multivariate model including all variables in the analyses. This model consists of an identification key for each subject, gender and the nine HRQOL-variables. The missing values analysis (MVA) routine in SPSS is found in the Analyze menu, and we choose EM as estimation method. The number of maximum iterations must be increased from the default number of 25 iterations, since the number of variables with missing values is large. von Hippel [2004] was critical to the MVA module in SPSS on the basis of an incorrect implementation of the residual variation for each imputed value. This random disturbance term should be included in imputation to reflect uncertainty associated with the imputation. The absence of the random error term results in a deterministic imputation process, so imputed datasets based on the same imputation model for the dataset are identical. This violates the assumptions of imputation by the EM algorithm. The methods of conditional single imputation are described as deficient in Schafer and Graham [2002], and is partitioned in conditional mean imputation and conditional distributional imputation. The difference of these two methods is the inclusion of residual variation in the latter. It must be emphasized that single imputation using these two methods can give biased standard errors of estimates under the MAR assumption, while the EM algorithm provides theoretically unbiased estimates if implemented properly. 52

65 6.1. EXPECTATION MAXIMIZATION Another software package that supports single imputation by the EM algorithm is R. The norm package [Novo and Schafer, 2006] is developed for this purpose, and yield many helpful tools when working with the EM algorithm. Prior to the analyses we must examine the data with respect to the assumption of normal distributed variables. This may be done in Stata without too much effort (explained in Section 6.2.2), before exporting the datafile to R. As we will see in Section the arc sine transformation [Box et al., 2005] is suitable to achieve a more normally distribution of variables, additional to the feature of restricting imputed values to the relevant intervals. This transformation of all HRQOL-variables are performed before the imputation process, and both original and imputed values are back-transformed after the imputation (the latter performed in R) Implementation of EM in R An extract of the script to perform EM imputation in R is displayed in Listing 6.1. Listing 6.1: R script to perform imputation by the EM algorithm 1 l i b r a r y ( norm ) 2 mat < r e a d. t a b l e ( o r i g i n a l f i l e. t x t, h e a d e r =TRUE, sep = ", ", na. s t r i n g s = ". " ) 3 s < p r e l i m. norm ( mat ) 4 t h e t a h a t < em. norm ( s ) 5 getparam. norm ( t h e t a h a t s, t h e t a h a t, c o r r =TRUE) 6 r n g s e e d ( ) 7 ximp < imp. norm ( s, t h e t a h a t, mat ) The first line specifies that the norm package must be read (only the basic packages of R are stored in memory when starting the software, and additional packages must be specified by the function library()). Further the datafile exported from Stata is read by read.table(), where observations are separated by a comma, and missing values are given as periods. The prelim.norm() function performs preliminary manipulations of the data matrix and is the input object of the functions on the consecutive lines. The EM algorithm itself is executed by calling em.norm(), and provides a vector of maximum likelihood-estimates of the normal parameters on a transformed scale and in packed storage. To achieve these estimates the function getparam.norm() is helpful. Before imputation we must initialize a random number generator seed (function rngseed()), and imputation is performed by the imp.norm() function on the last line of the script. For further explanation of the functions in the norm package we recommend the help file of R [Novo and Schafer, 2006]. After imputation we have obtained a balanced dataset including all the 534 patients with values of all variables at all occasions. The HRQOL-variables must be transformed to the original 0 to 100 interval to ease the interpretation of the results and to make comparisons with other analysis methods possible. The analyses described in Chapter 5 can now be repeated, but without listwise deletion Results, EM analyses The results of Student s t-test assuming unequal variances and the repeated measures ANOVA for the dataset containing observed and imputed values after EM are computed using the func- 53

66 CHAPTER 6. IMPUTATION ANALYSES tions t.test() and aov(), and the script for these analyses are found in Appendix Section B. The results are displayed in Table 6.1. Table 6.1: Estimates of means and p-values for the two-sample Student s t-test assuming unequal variances and p-values for the gender by time interaction for the repeated measures ANOVA after imputing missing values by the EM algorithm. Results displayed for the variables general health (GH), bodily pain (BP), social functioning (SF) and role emotional (RE), grouping on gender. Significant p-values are marked as boldface. Baseline Six months Twelve months ANOVA SF-36 Mean P-value Mean P-value Mean P-value P-value GH Male Female BP Male < Female SF Male Female RE Male < < Female The function t.test() in R gives no estimated standard deviations for the means, thus we must base out results on the estimated mean values and the corresponding p-values for these analyses. The first impression is that most the estimates are approximately equal to the original completecase estimates. The changes of largest magnitude are found in estimates for men and women at six months after surgery for role emotional, and for men at twelve months follow-up for the same variable. The means decrease when analyzing the data after EM imputation. The corresponding p-values for the differences between men and women at these two occasions are highly significant. The p-values for the baseline differences of social functioning and role emotional are altered from the complete-case p-values, but these changes are relatively unessential when interpreting the results. The same yields for the difference at twelve months follow-up for social functioning. Changes in p-values can be due to different estimates for the genders, and the fact that the sample sizes are increased by including all subjects. By imputing missing values we ensure that the analyses are based on more of the information from the dataset, thus the statistical power increases and p-values decreases. We now want to focus on the analysis of improvement for men and women represented by the gender by time interactions in repeated measures ANOVA. EM imputation leads to insignificant p-values for the interactions regarding all the variables. Especially the estimate for role emotional is altered from quite significant in the complete-case analysis (p-value of 0.025) to 54

67 6.2. MULTIPLE IMPUTATION clearly insignificant in the EM analysis (p-value of 0.503). The variable general health on the other hand obtains a decreased p-value compared to the complete-case analysis. Also the p-value for the interaction for bodily pain is in the grey area what significance concerns, modified from significant in the complete-case analysis. A p-value less than 0.10 should be handled by care, it is not necessarily one single way to interpret such p-values. Profile plots of the variables general health and role emotional are found in Figures 6.1 and 6.2 respectively. We see that the improvement patterns of role emotional for men and women are more similar in the right plot after EM imputation than for the complete-case subjects in the left plot. These profile plots are in accordance with the results found in Table 6.1. The profile plots for general health seems to be approximately equal, but the change in p-value for the interaction may be due to increased statistical power induced by the EM imputation. Profileplot of general health, CC Profileplot of general health, EM Mean of GH Gender Men Women Mean of GH Gender Men Women Time Time Figure 6.1: Profile plots for the variable general health (GH) based on the complete-case subjects (denoted CC) and expectation maximization imputed datasets (denoted EM). 6.2 Multiple imputation The theory of multiple imputation (MI) is presented in Chapter 2.3.2, and will be described in a more practical setting in this section. MI is performed on the dataset of Gjeilo et al. [2008]. MI makes use of all information available in the observed data, and meets the criteria for methods to analyze data with missing values assumed MCAR or MAR described in Section A multiple imputation analysis is performed in two steps, 1. Imputation of the missing values in the original dataset that leads to a total of m imputed datasets. Each dataset is a plausible version of the original dataset if all values were observed. 2. Analyze of each of the datasets and combination by Rubin s rules to obtain analysis results. 55

68 CHAPTER 6. IMPUTATION ANALYSES Profileplot of role emotional, CC Profileplot of role emotional, EM Mean of RE Gender Men Women Mean of RE Gender Men Women Time Time Figure 6.2: Profile plots for the variable role emotional (RE) based on the complete-case subjects (denoted CC) and expectation maximization imputed datasets (denoted EM). The first step of multiple imputation of the dataset is described in the Sections to The analysis step is presented in Sections to Imputation by MICE Multiple imputation is performed in Stata [2007] by the command ice. This command creates a user-specified number of datasets where the missing values are replaced with imputed values [Royston, 2005]. It performs multiple imputation by the MICE procedure, where MICE stands for Multiple Imputation by Chained Equations. The missing values are imputed by the method of switching regression, an iterative multivariate regression technique introduced by Buuren and Oudshoorn [2000] for S-plus, and ice implements MICE for Stata. In the Theory Section of MI we find that imputations are obtained as draws from a joint multivariate distribution including all the variables in the imputation model and the parameters of the imputation model. It is seldom possible to obtain this joint distribution in closed form, and in these situations iterative algorithms like MICE are useful. The switching regression makes use of Gibbs sampler iteration algorithm to perform the imputation. MICE assumes existence of a joint multivariate distribution, from which conditional distributions for each variable with missing values can be derived. Thus the complex multivariate problem can be parted into easier univariate problems. Let X 1, X 2,..., X k be the set of variables in the imputation model where some of the X s have missing values. The chained equations method for imputing data described in Carpenter and Kenward [2007] are similar to the the technique of MICE. First, initialize missing values, for example by the mean of the observed values or randomly drawn values of the observations. Then draw imputations 56

69 6.2. MULTIPLE IMPUTATION X1 t from f(xt 1 Xt 1 2, X3 t 1,..., X t 1 k ) X t 2 from f(xt 2 Xt 1, Xt 1 3,..., X t 1 k ) X t k from f(xt k Xt 1, Xt 2,..., Xt k 1 ) Repeat the above as a loop until convergence, that is typically ten to twenty times. Then repeat the loop a further m times to obtain the imputed datasets. Modern Markov Chain simulation methods often need up to thousands of iterations to obtain a sample of m= 20 imputed datasets, while the switching regressions need less. The reason is that switching regressions draws independent samples for each variable, while Markov Chain methods generates dependent samples and therefore need to iterate and reject many samples to achieve approximately statistical independent imputation draws. The number of iterations for switching regressions increases with the amount of unobserved data. van Buuren et al. [2006] write the following about their implementation of MICE "The theoretical weakness of this approach is that the specified conditional densities can be incompatible, and therefore the stationary distribution may not exist. It appears that, despite the theoretical weakness, the actual performance of conditional model specification of multivariate imputation can be quite good, and therefore deserves further study." Further the authors advise to perform MI with an extra caution to the imputations obtained. The method does not guarantee that all imputations are drawn from the joint multivariate normal distribution, rather from other distributions that the method may converge to in case the joint distribution don t exists. A technique to examine the outlier datasets are explained in Section Note: Carpenter and Kenward [2007] did not recommend the method presented above, based on the uncertainty in the convergence to the joint multivariate normal distribution The imputation model To perform multiple imputation we need to make a selection of variables as a base for the imputation model. This model includes variables with missing values, variables that can predict the missing values and variables that can explain the reason for non-response, in addition to the variables in the analysis models. First we look at these variables contained in the analysis models. All variables except the subject identification number patkey, gender and age have missing values, distributed at all three measurements occasions. MI assumes all variables in the imputation model to be multivariate normally distributed, and we want to examine whether the variables in the dataset meet this assumption. Graphical presentation of the variables can be useful to examine the behavior of the variables. Histograms of general health, bodily pain, social functioning and role emotional are displayed in Figures 6.3 to 6.6. General health has 21 levels of theoretical possible values, uniformly distributed on the interval 0 to 100. Histogram of the variable are displayed in Figure 6.3 for the three occasions. At baseline a Gaussian shape can be seen, and also at six and twelve months after surgery this bell shape is present, though more right-skewed than for the first measuring point. 57

70 CHAPTER 6. IMPUTATION ANALYSES Figure 6.3: Histograms of the observed values of the variable general health (GH) before surgery and at six and twelve months follow-up. Figure 6.4: Histograms of the observed values of the variable bodily pain (BP) before surgery and at six and twelve months follow-up. Histograms for the variable bodily pain is displayed in Figure 6.4. This variable has eleven levels of theoretical possible values on the interval 0 to 100. Before surgery the data looks normally distributed with a peak at the middle of the interval and a considerable ceiling-effect. At six and twelve months after surgery this ceiling-effect is still present and dominate the histogram, while the bell shape of an Gaussian variable is suppressed. Figure 6.5: Histograms of the observed values of the variable social functioning (SF) before surgery and at six and twelve months follow-up. Social functioning has nine ordered levels of values. By looking at the histograms in Figure 6.5 we see that the variable possibly violates the assumption of multivariate normally distributed data by the clear ceiling effect at all time points. Role emotional is distributed on four values, and does not seem convincingly bell shaped 58

71 6.2. MULTIPLE IMPUTATION Figure 6.6: Histograms of the observed values of the variable role emotional (RE) before surgery and at six and twelve months follow-up. from the histograms in Figure 6.6. For the histogram of the presurgery values and at twelve months follow-up a total of five levels of the variable can be found. This is due to the mean value imputation described in Section 4.1, and this feature is also found for the other variables Constraints of imputed values As we can see from the histograms above the variables are not convincingly normally distributed, thus the assumption of multivariate normally distributed variables is possibly violated. There are several ways to handle this in such a way that we still can apply the method of multiple imputation. Another aspect of the imputation process that must be considered is the possible imputation values for each variable. All SF-36 variables and health transition are restrained to intervals and we expect the unobserved data to take values on the same intervals. Dragset et al. [2008] examine three possible alternatives to restrain the imputed values to their intervals, and if possible influence the variables to behave more normally distributed. These methods can be summarized as follows Truncation, the method of restraining imputations after the imputation process, Ordinal logistic regression, which leads to restraints during the imputation process, and Transformation, the method of modifying data before the imputation process. Truncation is a method of replacing the imputed values that exceeds the interval of each variable. This means that the imputed values are manually altered after imputation which may induce biased analysis results. It may also lead to artificial enhanced floor and ceiling effects which is not desirable. Dragset et al. [2008] concluded that this method is not preferable. Variables with a handful possible levels may be treated as ordinal (or nominal) variables and with that avoid the assumption of normality. The last approach to restrict the imputed values is transformation prior to the imputation process. With a suitable transformation we may obtain data that behave more normally distributed, and at the same time ensures the imputed values to take values on the intervals. 59

72 CHAPTER 6. IMPUTATION ANALYSES Ordinal logistic regression The variable social functioning takes nine levels of possible values. The histograms in Figure 6.5 does not reveal a convincing bell shaped distribution for the variable. This may violate the assumption of multivariate normality of MI, thus imputation of this variable must be handled with caution. Variables that take only a few levels of values can be treated as nominal or ordinal variables without loosing too much statistical power. This affects the imputations to take values found among the observed values, which means that all imputed values are restricted to the respective intervals. This also applies for other variables with few levels of observations, that is role emotional, role physical and health transition. The implementation of ordinal logistic regression for the specified variables are explained in Section Transformation of variables Transformation of variables prior to multiple imputation is a technique to reduce violations of the normality assumption of the data. We can transform the data to obtain a distribution of the variables that are more Gaussian shaped, impute the missing values and then back-transform the variables after the imputation process. The back-transform is necessary to obtain comparable estimates for the analyses. The desired transformation formula should influence the data to behave approximately normal distributed. Arcsine transformation [Box et al., 2005] is a variance stabilizing transformation for binomial proportions of the form y/n, where y is the original variable and n is the upper limit of the interval of y. The transformation formula is given as x = arcsin ( ) y, (6.1) n where x is the transformed of the original variable y and that holds the characteristic of a more bell shaped distribution than y. The arcsine transformation is preferable when the variables take values close to the limits of the intervals. Applying the arc sine of the variables leads to stretching the tails of the distribution, thus we can transform the variables by the arcsine transformation to imply a more Gaussian shape of the distributions than for the original variables. An additional advantage of the arcsine transformation when applied to data that are limited to intervals is that it constrains imputed values to these interval. Thus we avoid the problem of imputing values outside the original intervals. Consider for example the variable general health that holds original values on the interval 0 to 100. Higher scores indicate better general health, and a score of zero is interpreted as death. If we get an imputed value below 0 it is not obvious how to interpret this value. Implementation of the transformation in ice can be summarized as follows generate transformed values of the variables prior to imputation, include only the transformed version of the variables in the imputation process, and 60

73 6.2. MULTIPLE IMPUTATION back-transform the imputed variables to obtain imputed values for the original format variables. This approach is presented in Section below. An alternative way to implement transformation with ice is to back-transform the variables directly in the imputation process: generate transformed values of the variables prior to imputation, include both original and transformed variables in the imputation process, and specify imputation equations for all variables that are to be imputed. Note: The original continuous variables (not transformed) should not be imputed by the imputation process, nor included in any of the imputation equations. Back-transform the imputed variables before finishing the imputation process. These two approaches to impute transformed variables by multiple imputation generates the same imputation process, but differs in the implementation in Stata. The effort and time spent may vary for different analyses, and the easiest of the approaches are recommended. Histograms of the transformed variables general health and bodily pain are displayed in Figures 6.7 and 6.8 respectively. We observe that the distributions of general health look more Gaussian shaped, and the skewed trend of the follow-up occasions are not as evident as for the original variable. Figure 6.7: Histograms of the transformed values of the variable General Health (GH) before surgery and at six and twelve months follow-up. The histograms for bodily pain are also ameliorated with respect to the bell shape, but the ceiling effect is still overwhelming for the follow-up measurement occasions Implementation of ice in Stata Multiple imputation is performed in Stata by the command ice, as stated above. This command imputes m datasets that are stored in a new file Impfile together with the original dataset. A variable _mi keeps track of the subjects in each dataset, and another variable _mj holds the information to separate the datasets. The script displayed in Listing 6.2 below is based on the imputation model including each patients identification number, gender and age at the time of 61

74 CHAPTER 6. IMPUTATION ANALYSES Figure 6.8: Histograms of the transformed values of the variable Bodily Pain (BP) before surgery and at six and twelve months follow-up. Listing 6.2: Multiple imputation of dataset from Gjeilo et al. [2008] 1 ge g h1t = a s i n ( s q r t ( gh1 / ) ) 2 ge g h2t = a s i n ( s q r t ( gh2 / ) ) 3 ge g h3t = a s i n ( s q r t ( gh3 / ) ) 4 5 i c e p a t k e y g h1t gh2t gh3t p f 1 t p f 2 t p f 3 t s f 1 s f 2 s f 3 rp1 rp2 rp3 r e 1 r e 2 r e 3 mh1t mh2t mh3t v t 1 t v t 2 t v t 3 t bp1t bp2t bp3t h t 1 h t 2 h t 3 ALDER kjønn s i v i l s t a n d 1 s i v i l s t a n d 2 s i v i l s t a n d 3 u s i n g I m p f i l e, m( 2 0 ) genmiss ( mis ) cmd ( s f 1 s f 2 s f 3 rp1 rp2 rp3 r e 1 r e 2 r e 3 h t 1 h t 2 h t 3 : o l o g i t ) 6 7 r e p l a c e gh1 =100 ( s i n ( g h1t ) ) ^2 i f misgh1== 1 8 r e p l a c e gh2 =100 ( s i n ( g h2t ) ) ^2 i f misgh2== 1 9 r e p l a c e gh3 =100 ( s i n ( g h3t ) ) ^2 i f misgh3== 1 surgery, and all HRQOL-variables, health transition and marital status at the three measurement occasions. Data must be arranged in wide format prior to imputation. The variables social function, role physical, role emotional and health transition are imputed by ordinal logistic regression specified by the option cmd(). Marital status is a dichotomous variable and thus handled by logistic regression as default. The remaining of the HRQOL-variables are transformed by the arcsine transformation formula prior to the imputation as shown for general health on line one to three. These transformed variables are included in the variable list instead of the original variables as explained in Section 6.2.3, together with the subject identification number patkey, gender, marital status and the four ordinal variables in original format. There are several more options specified in the script above. genmiss() creates an indicator variable for each variable with non-response in the imputation model, where one refers to an imputed value and zero is an observed value. m() declares the desired number of imputed datasets and dryrun displays the imputation equations without performing imputation (left out in this script). After the imputation process is finished, the back-transformation of the continuous variables are performed in the newly created Impfile, the datafile in which the original and imputed datasets are stored. This file are now ready for analyses and combination by Rubin s rules as described in Section An alternative procedure to obtain imputed datasets for the transformed variables is by in- 62

### Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

### A Basic Introduction to Missing Data

John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

### Missing Data Dr Eleni Matechou

1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

### Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

### Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

### Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

### Introduction to mixed model and missing data issues in longitudinal studies

Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models

### Statistical Analysis with Missing Data

Statistical Analysis with Missing Data Second Edition RODERICK J. A. LITTLE DONALD B. RUBIN WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents Preface PARTI OVERVIEW AND BASIC APPROACHES

### Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

### APPLIED MISSING DATA ANALYSIS

APPLIED MISSING DATA ANALYSIS Craig K. Enders Series Editor's Note by Todd D. little THE GUILFORD PRESS New York London Contents 1 An Introduction to Missing Data 1 1.1 Introduction 1 1.2 Chapter Overview

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

123 Kwantitatieve Methoden (1999), 62, 123-138. A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA Joop J. Hox 1 ABSTRACT. When we deal with a large data set with missing data, we have to undertake

### Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

### Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

### Dealing with Missing Data

Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

### Imputing Missing Data using SAS

ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are

### A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values

Methods Report A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values Hrishikesh Chakraborty and Hong Gu March 9 RTI Press About the Author Hrishikesh Chakraborty,

### Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

### Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear

### HANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS

HANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS Mike Kenward London School of Hygiene and Tropical Medicine Acknowledgements to James Carpenter (LSHTM) Geert Molenberghs (Universities of

### 2. Making example missing-value datasets: MCAR, MAR, and MNAR

Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data

### How to choose an analysis to handle missing data in longitudinal observational studies

How to choose an analysis to handle missing data in longitudinal observational studies ICH, 25 th February 2015 Ian White MRC Biostatistics Unit, Cambridge, UK Plan Why are missing data a problem? Methods:

### A Review of Methods for Missing Data

Educational Research and Evaluation 1380-3611/01/0704-353\$16.00 2001, Vol. 7, No. 4, pp. 353±383 # Swets & Zeitlinger A Review of Methods for Missing Data Therese D. Pigott Loyola University Chicago, Wilmette,

### MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,

### Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple

### Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

Missing Data & How to Deal: An overview of missing data Melissa Humphries Population Research Center Goals Discuss ways to evaluate and understand missing data Discuss common missing data methods Know

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: james.carpenter@lshtm.ac.uk

### PATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku,

PATTERN MIXTURE MODELS FOR MISSING DATA Mike Kenward London School of Hygiene and Tropical Medicine Talk at the University of Turku, April 10th 2012 1 / 90 CONTENTS 1 Examples 2 Modelling Incomplete Data

### Item Imputation Without Specifying Scale Structure

Original Article Item Imputation Without Specifying Scale Structure Stef van Buuren TNO Quality of Life, Leiden, The Netherlands University of Utrecht, The Netherlands Abstract. Imputation of incomplete

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### An introduction to modern missing data analyses

Journal of School Psychology 48 (2010) 5 37 An introduction to modern missing data analyses Amanda N. Baraldi, Craig K. Enders Arizona State University, United States Received 19 October 2009; accepted

### Imputation and Analysis. Peter Fayers

Missing Data in Palliative Care Research Imputation and Analysis Peter Fayers Department of Public Health University of Aberdeen NTNU Det medisinske fakultet Missing data Missing data is a major problem

### Missing Data. Paul D. Allison INTRODUCTION

4 Missing Data Paul D. Allison INTRODUCTION Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all)

### MISSING DATA IN NON-PARAMETRIC TESTS OF CORRELATED DATA

MISSING DATA IN NON-PARAMETRIC TESTS OF CORRELATED DATA Annie Green Howard A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

### On Treatment of the Multivariate Missing Data

On Treatment of the Multivariate Missing Data Peter J. Foster, Ahmed M. Mami & Ali M. Bala First version: 3 September 009 Research Report No. 3, 009, Probability and Statistics Group School of Mathematics,

### Dealing with missing data: Key assumptions and methods for applied analysis

Technical Report No. 4 May 6, 2013 Dealing with missing data: Key assumptions and methods for applied analysis Marina Soley-Bori msoley@bu.edu This paper was published in fulfillment of the requirements

### Dealing with Missing Data

Dealing with Missing Data Roch Giorgi email: roch.giorgi@univ-amu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January

### Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

### IBM SPSS Missing Values 22

IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,

### Bayesian Approaches to Handling Missing Data

Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to

### How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power

How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 391-9971 Fax: (310) 391-8971

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### Note on the EM Algorithm in Linear Regression Model

International Mathematical Forum 4 2009 no. 38 1883-1889 Note on the M Algorithm in Linear Regression Model Ji-Xia Wang and Yu Miao College of Mathematics and Information Science Henan Normal University

### Sensitivity Analysis in Multiple Imputation for Missing Data

Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

### Copyright 2010 The Guilford Press. Series Editor s Note

This is a chapter excerpt from Guilford Publications. Applied Missing Data Analysis, by Craig K. Enders. Copyright 2010. Series Editor s Note Missing data are a real bane to researchers across all social

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Combining Multiple Imputation and Inverse Probability Weighting

Combining Multiple Imputation and Inverse Probability Weighting Shaun Seaman 1, Ian White 1, Andrew Copas 2,3, Leah Li 4 1 MRC Biostatistics Unit, Cambridge 2 MRC Clinical Trials Unit, London 3 UCL Research

### Missing Data: Our View of the State of the Art

Psychological Methods Copyright 2002 by the American Psychological Association, Inc. 2002, Vol. 7, No. 2, 147 177 1082-989X/02/\$5.00 DOI: 10.1037//1082-989X.7.2.147 Missing Data: Our View of the State

### Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88)

Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88) Introduction The National Educational Longitudinal Survey (NELS:88) followed students from 8 th grade in 1988 to 10 th grade in

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Missing Data Sensitivity Analysis of a Continuous Endpoint An Example from a Recent Submission

Missing Data Sensitivity Analysis of a Continuous Endpoint An Example from a Recent Submission Arno Fritsch Clinical Statistics Europe, Bayer November 21, 2014 ASA NJ Chapter / Bayer Workshop, Whippany

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments

Brockmeier, Kromrey, & Hogarty Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten s Lantry L. Brockmeier Jeffrey D. Kromrey Kristine Y. Hogarty Florida A & M University

### SPSS: Descriptive and Inferential Statistics. For Windows

For Windows August 2012 Table of Contents Section 1: Summarizing Data...3 1.1 Descriptive Statistics...3 Section 2: Inferential Statistics... 10 2.1 Chi-Square Test... 10 2.2 T tests... 11 2.3 Correlation...

### Department of Epidemiology and Public Health Miller School of Medicine University of Miami

Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data Wednesday-Friday 9:00 10:15PM Course Location: CRB 995

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement

Review of Educational Research Winter 2004, Vol. 74, No. 4, pp. 525-556 Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement James L. Peugh University of

### 11/20/2014. Correlational research is used to describe the relationship between two or more naturally occurring variables.

Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

### UNDERSTANDING MULTIPLE REGRESSION

UNDERSTANDING Multiple regression analysis (MRA) is any of several related statistical methods for evaluating the effects of more than one independent (or predictor) variable on a dependent (or outcome)

### Assignments Analysis of Longitudinal data: a multilevel approach

Assignments Analysis of Longitudinal data: a multilevel approach Frans E.S. Tan Department of Methodology and Statistics University of Maastricht The Netherlands Maastricht, Jan 2007 Correspondence: Frans

### Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

### Health 2011 Survey: An overview of the design, missing data and statistical analyses examples

Health 2011 Survey: An overview of the design, missing data and statistical analyses examples Tommi Härkänen Department of Health, Functional Capacity and Welfare The National Institute for Health and

### arxiv:1301.2490v1 [stat.ap] 11 Jan 2013

The Annals of Applied Statistics 2012, Vol. 6, No. 4, 1814 1837 DOI: 10.1214/12-AOAS555 c Institute of Mathematical Statistics, 2012 arxiv:1301.2490v1 [stat.ap] 11 Jan 2013 ADDRESSING MISSING DATA MECHANISM

### Journal Article Reporting Standards (JARS)

APPENDIX Journal Article Reporting Standards (JARS), Meta-Analysis Reporting Standards (MARS), and Flow of Participants Through Each Stage of an Experiment or Quasi-Experiment 245 Journal Article Reporting

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set April 2015 SHORT REPORT Baby FACES 2009 This page is left blank for double-sided printing. Imputing Attendance Data in a Longitudinal

### Introduction to latent variable models

Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their

### Missing Data Techniques for Structural Equation Modeling

Journal of Abnormal Psychology Copyright 2003 by the American Psychological Association, Inc. 2003, Vol. 112, No. 4, 545 557 0021-843X/03/\$12.00 DOI: 10.1037/0021-843X.112.4.545 Missing Data Techniques

### Applied Missing Data Analysis in the Health Sciences. Statistics in Practice

Brochure More information from http://www.researchandmarkets.com/reports/2741464/ Applied Missing Data Analysis in the Health Sciences. Statistics in Practice Description: A modern and practical guide

### ANSWERS TO EXERCISES AND REVIEW QUESTIONS

ANSWERS TO EXERCISES AND REVIEW QUESTIONS PART FIVE: STATISTICAL TECHNIQUES TO COMPARE GROUPS Before attempting these questions read through the introduction to Part Five and Chapters 16-21 of the SPSS

ALLISON 1 TABLE OF CONTENTS 1. INTRODUCTION... 3 2. ASSUMPTIONS... 6 MISSING COMPLETELY AT RANDOM (MCAR)... 6 MISSING AT RANDOM (MAR)... 7 IGNORABLE... 8 NONIGNORABLE... 8 3. CONVENTIONAL METHODS... 10

### Regression in SPSS. Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology

Regression in SPSS Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology John P. Bentley Department of Pharmacy Administration University of

### Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

### Bivariate Analysis. Correlation. Correlation. Pearson's Correlation Coefficient. Variable 1. Variable 2

Bivariate Analysis Variable 2 LEVELS >2 LEVELS COTIUOUS Correlation Used when you measure two continuous variables. Variable 2 2 LEVELS X 2 >2 LEVELS X 2 COTIUOUS t-test X 2 X 2 AOVA (F-test) t-test AOVA

### BMJ Open. To condition or not condition? Analyzing change in longitudinal randomized controlled trials

To condition or not condition? Analyzing change in longitudinal randomized controlled trials Journal: BMJ Open Manuscript ID bmjopen-0-00 Article Type: Research Date Submitted by the Author: -Jun-0 Complete

### Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME:

Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME: Instructions: This exam is 30% of your course grade. The maximum number of points for the course is 1,000; hence this exam is worth 300

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

### How to Use a Monte Carlo Study to Decide on Sample Size and Determine Power

STRUCTURAL EQUATION MODELING, 9(4), 599 620 Copyright 2002, Lawrence Erlbaum Associates, Inc. TEACHER S CORNER How to Use a Monte Carlo Study to Decide on Sample Size and Determine Power Linda K. Muthén

### Modern Methods for Missing Data

Modern Methods for Missing Data Paul D. Allison, Ph.D. Statistical Horizons LLC www.statisticalhorizons.com 1 Introduction Missing data problems are nearly universal in statistical practice. Last 25 years

### Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Tips for surviving the analysis of survival data Philip Twumasi-Ankrah, PhD Big picture In medical research and many other areas of research, we often confront continuous, ordinal or dichotomous outcomes

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Simple Linear Regression Chapter 11

Simple Linear Regression Chapter 11 Rationale Frequently decision-making situations require modeling of relationships among business variables. For instance, the amount of sale of a product may be related

### Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Missing Data: Patterns, Mechanisms & Prevention. Edith de Leeuw

Missing Data: Patterns, Mechanisms & Prevention Edith de Leeuw Thema middag Nonresponse en Missing Data, Universiteit Groningen, 30 Maart 2006 Item-Nonresponse Pattern General pattern: various variables

### The Classical Linear Regression Model

The Classical Linear Regression Model 1 September 2004 A. A brief review of some basic concepts associated with vector random variables Let y denote an n x l vector of random variables, i.e., y = (y 1,

### Missing data and net survival analysis Bernard Rachet

Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,

### Using multilevel models to analyze couple and family treatment data: Basic and advanced issues. David C. Atkins. Travis Research Institute

Multilevel analyses 1 RUNNING HEAD: Multilevel analyses Using multilevel models to analyze couple and family treatment data: Basic and advanced issues David C. Atkins Travis Research Institute Fuller Graduate

### Introduction to Longitudinal Data Analysis

Introduction to Longitudinal Data Analysis Longitudinal Data Analysis Workshop Section 1 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development Section 1: Introduction

### Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

### Random-Effects Models for Multivariate Repeated Responses

Random-Effects Models for Multivariate Repeated Responses Geert Verbeke & Steffen Fieuws geert.verbeke@med.kuleuven.be Biostatistical Centre, K.U.Leuven, Belgium QMSS, Diepenbeek, September 2006 p. 1/24

### Introduction to Multivariate Models: Modeling Multivariate Outcomes with Mixed Model Repeated Measures Analyses

Introduction to Multivariate Models: Modeling Multivariate Outcomes with Mixed Model Repeated Measures Analyses Applied Multilevel Models for Cross Sectional Data Lecture 11 ICPSR Summer Workshop University

### A Guide to Imputing Missing Data with Stata Revision: 1.4

A Guide to Imputing Missing Data with Stata Revision: 1.4 Mark Lunt December 6, 2011 Contents 1 Introduction 3 2 Installing Packages 4 3 How big is the problem? 5 4 First steps in imputation 5 5 Imputation

### INTERPRETING THE REPEATED-MEASURES ANOVA

INTERPRETING THE REPEATED-MEASURES ANOVA USING THE SPSS GENERAL LINEAR MODEL PROGRAM RM ANOVA In this scenario (based on a RM ANOVA example from Leech, Barrett, and Morgan, 2005) each of 12 participants

### Checklists and Examples for Registering Statistical Analyses

Checklists and Examples for Registering Statistical Analyses For well-designed confirmatory research, all analysis decisions that could affect the confirmatory results should be planned and registered

### Re-analysis using Inverse Probability Weighting and Multiple Imputation of Data from the Southampton Women s Survey

Re-analysis using Inverse Probability Weighting and Multiple Imputation of Data from the Southampton Women s Survey MRC Biostatistics Unit Institute of Public Health Forvie Site Robinson Way Cambridge