Sensitivity analysis of longitudinal binary data with non-monotone missing values

Biostatistics (2004), 5, 4, pp doi: /biostatistics/kxh006

PASCAL MININI
Laboratoire GlaxoSmithKline, Unité Méthodologie et Biostatistique, 100 route de Versailles, Marly le Roi, France, and INSERM U472, 16 avenue Paul Vaillant-Couturier, Villejuif, France

MICHEL CHAVANCE
INSERM U472, 16 avenue Paul Vaillant-Couturier, Villejuif, France

SUMMARY

This paper highlights the consequences of incomplete observations in the analysis of longitudinal binary data, in particular non-monotone missing data patterns. Sensitivity analysis is advocated and a method is proposed based on a log linear model. A sensitivity parameter that represents the relationship between the response mechanism and the missing data mechanism is introduced. It is shown that although this parameter is identifiable, its estimation is highly questionable. A far better approach is to consider a range of plausible values and to estimate the parameters of interest conditionally upon each value of the sensitivity parameter. This allows us to assess the sensitivity of the study's conclusion to assumptions regarding the missing data mechanism. The method is applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma.

Keywords: Binary data; EM; Ignorance; Longitudinal study; Missing; Multiple imputation; Non-monotone; Sensitivity analysis; Uncertainty.

1. INTRODUCTION

We consider longitudinal studies designed to repeatedly observe a binary response at n prespecified occasions. In practice, successful completion of all planned measurements from all subjects is extremely rare. Two main sources of missing data can be distinguished. On the one hand, some subjects will drop out of the study, for example as a result of an adverse event, the lack of efficacy of the study treatment, or simply the refusal of the subject to continue the study.
This will result in a monotone pattern of missing data (Little and Rubin, 1987). On the other hand, some data will be missing intermittently, for example because of an illness, an invalid measurement or forgetfulness. This will result in a non-monotone pattern. Longitudinal studies generally suffer from both types of missingness, and the collected data are often incomplete with a non-monotone structure.

The classification proposed by Little and Rubin (1987) is based on the relationship between the mechanism leading to complete or incomplete data (the missing data process) and the mechanism controlling the actual value of the response of interest (the response process). Data are missing at random (MAR) when the missing data process depends only on observed responses, and missing not at random when it depends on unobserved responses. In the framework of likelihood-based inference, if the missing data are MAR and if the parameters of the missing data process and those of the response process are distinct, then the missing data process is said to be ignorable. Otherwise it is nonignorable.

[Footnote: To whom correspondence should be addressed. Biostatistics Vol. 5 No. 4 © Oxford University Press 2004; all rights reserved.]

Over the past few years, considerable attention has been given to the modelling of longitudinal binary data with nonignorable missing values, via generalized linear mixed models (e.g. Follmann and Wu, 1995; Ibrahim et al., 2001) or generalized estimating equations (e.g. Paik, 1997; Lipsitz et al., 2000; Fitzmaurice and Laird, 2000). However, a paradigm has emerged: handling incomplete observations necessarily requires assumptions that cannot be assessed from the observed data (Little, 1994a; Rubin, 1994; Verbeke and Molenberghs, 2000). In these circumstances, the need for sensitivity analyses has been clearly recognized.

Molenberghs et al. (2001), Kenward et al. (2001), Vansteelandt et al. (2000) and Vansteelandt and Goetghebeur (2001) have developed the concepts of ignorance and uncertainty. On the one hand, the usual imprecision is due to the finite random sampling, which is acknowledged via confidence intervals, the width of which approaches zero as the sample size grows. On the other hand, ignorance is due to the incompleteness of data and can be reflected by the interval of ignorance. Ignorance due to a given proportion of missing data would not disappear even with an infinite sample size. Imprecision and ignorance are combined into the concept of uncertainty, which acknowledges both sources.
In controlled clinical trials, it has been recommended by the Committee for Proprietary Medicinal Products (2001) to conduct a sensitivity analysis in order to assess the impact of different missing data assumptions regarding the conclusion of a study. With binary responses, a best-case/worst-case analysis can be performed assigning a positive response to all missing data in the control group and a negative response in the experimental group. Although the assumptions of this approach are unrealistic, this is the most convincing analysis if the conclusion of the study is not qualitatively modified. However, in most cases, the benefit of the new treatment would be annihilated by such an extreme analysis (Unnebrink and Windeler, 1999). In the case of a single binary measure, Hollis (2002) proposed a simple and attractive method that consists in examining all possible allocations of missing data. In another framework, Copas and Li (1997) used a first-order Taylor expansion to perform a sensitivity analysis around the MAR assumption. However, Skinner (1997) suggested that a better approach would be to estimate the parameter of interest conditionally on the sensitivity parameter. The strategy previously proposed by Little (1994b) will be used here. This consists of drawing inferences about the parameters of interest under a range of plausible values for a sensitivity parameter, i.e. under different assumptions regarding the missing data mechanism. This has been widely developed for sensitivity analyses (see for example Rotnitzky et al., 1998, 2001; Scharfstein et al., 1999; Birmingham et al., 2003). These methods deal mainly with quantitative data subject to dropout, the comparison being restricted to the value measured at the end of the study. Here, we will consider longitudinal binary responses with non-monotone missing data, all measurements being considered as equally valuable. 
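The best-case/worst-case analysis described earlier in this section can be sketched in a few lines for a single binary outcome in two arms. This is only an illustrative sketch: the function names and the counts in the usage lines are hypothetical and do not come from any study discussed here.

```python
def worst_case_difference(succ_exp, fail_exp, miss_exp,
                          succ_ctl, fail_ctl, miss_ctl):
    """Difference in response rates (experimental minus control) when every
    missing response is counted as negative in the experimental arm and as
    positive in the control arm."""
    n_exp = succ_exp + fail_exp + miss_exp
    n_ctl = succ_ctl + fail_ctl + miss_ctl
    p_exp = succ_exp / n_exp                 # missing -> negative response
    p_ctl = (succ_ctl + miss_ctl) / n_ctl    # missing -> positive response
    return p_exp - p_ctl


def best_case_difference(succ_exp, fail_exp, miss_exp,
                         succ_ctl, fail_ctl, miss_ctl):
    """Mirror image: missing responses are positive in the experimental arm
    and negative in the control arm."""
    n_exp = succ_exp + fail_exp + miss_exp
    n_ctl = succ_ctl + fail_ctl + miss_ctl
    p_exp = (succ_exp + miss_exp) / n_exp
    p_ctl = succ_ctl / n_ctl
    return p_exp - p_ctl


# Hypothetical counts (successes, failures, missing) in each arm.
counts = (60, 30, 10, 40, 50, 10)
worst = worst_case_difference(*counts)
best = best_case_difference(*counts)
```

Any allocation of the missing responses gives a difference between `worst` and `best`, which is why a conclusion that survives the worst case is convincing, and also why this interval is usually too wide to be informative.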
A joint modelling of the response process and the missing data process, based on a log linear model is proposed in Section 2. A sensitivity parameter is introduced that represents the relationship between the response process and the missing data process. An important feature of this modelling is that it does not require a monotone missing data structure. In Section 3, it is shown that although the sensitivity parameter is identifiable, its estimation is highly questionable. A far better approach is to consider a range of plausible values, and to estimate the parameters of interest conditionally upon these plausible values. When the objective of the study is to describe the association between explanatory variables and the response of interest, the log linear model introduced in Section 2 may not be satisfactory. In this case, it is proposed in Section 4 to perform multiple imputations of missing data, and to analyse the completed data using the multiple imputation estimator (Rubin, 1987).

In Section 5, the method is applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma.

2. NOTATION AND DISTRIBUTIONAL ASSUMPTIONS

2.1 Modelling the response and missing data processes

We assume that N subjects are to be observed at n different times. Let $Y = (Y_1, \ldots, Y_n)$ denote the $1 \times n$ vector of complete binary data for a given subject, i.e. data that would have been observed if no measurement was missing. Let $M_j$ denote the missing data indicator with $M_j = 1$ if the jth response is missing and $M_j = 0$ otherwise, and form the $1 \times n$ vector $M = (M_1, \ldots, M_n)$. The joint distribution of Y and M can be expressed using a log linear model (Bishop et al., 1975) as

$$\log P[Y = y, M = m] = \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \sum_{j,k} \psi_{jk} y_j m_k. \tag{1}$$

Equation (1) imposes the constraint that third- and higher-order terms equal zero, but it can be sensible in some cases to include additional terms, up to a saturated model. An equivalent model has been proposed by Baker et al. (1992) for contingency tables, and used by Molenberghs et al. (2001) for sensitivity analyses. In this formulation, the interaction terms between Y and M determine the nature of the missing data mechanism, and $\psi_{jk}$ represents the relationship between the (possibly unobserved) value $y_j$ and the missingness of $Y_k$. In this setting, the model is thus ignorable if all $\psi_{jk} = 0$, and nonignorable otherwise. Note that this model assumes that the association between two given responses $Y_j$ and $Y_k$ is independent of the missing data patterns.

Model (1) involves $n^2$ parameters $\psi_{jk}$. A possible way of reducing this number of parameters is to impose the following constraints:

(a) $\psi_{jk} = 0$ for all $j \neq k$
(b) $\psi_{jj} = \psi$ for all $j = 1, \ldots, n$.

Constraint (a) imposes that conditionally on $Y_j$, missingness at the jth measure is independent of all other $Y_k$.
Constraint (b) imposes that the relationship between the missing data indicator $M_j$ and the response $Y_j$ is constant over time. These two constraints lead to

$$\log P[Y = y, M = m] = \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \psi \sum_{j=1}^{n} y_j m_j. \tag{2}$$

An interesting characteristic of this approach is that it can be interpreted in terms of a pattern-mixture model or a selection model. Let us first consider the pattern-mixture formulation. From (2), it can be shown that

$$\log P[Y = y \mid M = m] = \mu_0^{(m)} + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \psi \sum_{j=1}^{n} y_j m_j \tag{3}$$

where $\mu_0^{(m)}$ constrains the $2^n$ probabilities to sum to 1 within each pattern. Model (3) is a typical pattern-mixture model, in which the parameter ψ models the shift between the distributions of Y under different missing data patterns. Furthermore, let us consider the conditional probability of a positive response at any measurement time j, given the other responses and the missingness of this response. We will let $Y_{[j]}$ denote the $1 \times (n-1)$ vector $(Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_n)$. It follows that

$$\operatorname{logit} P[Y_j = 1 \mid Y_{[j]}, M] = \lambda_j + \sum_{k \neq j} \lambda_{jk} y_k + \psi m_j, \tag{4}$$

where $\lambda_{jk}$ is read as $\lambda_{kj}$ when $j > k$. In words, the odds of a positive response at any measurement time j are multiplied by $e^{\psi}$ for subjects with missing $Y_j$, as compared to subjects with observed $Y_j$. In particular, if ψ < 0, the missing data are associated with poorer responses than the observed data, and conversely. The extreme cases where ψ tends toward −∞ or +∞ correspond respectively to the worst-case or best-case assumptions, in which the probability of a positive response tends toward 0 or 1.

Now, let us consider the selection model formulation. From (2), it can also be shown that

$$\log P[M = m \mid Y = y] = \nu_0^{(y)} + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \psi \sum_{j=1}^{n} y_j m_j \tag{5}$$

where $\nu_0^{(y)}$ is the normalizing constant for a given y, which in turn implies that

$$\operatorname{logit} P[M_j = 1 \mid Y, M_{[j]}] = \theta_j + \sum_{k \neq j} \theta_{jk} m_k + \psi y_j. \tag{6}$$

Thus, the parameter ψ can also be interpreted in the framework of selection models, and can be identified as the selection parameter relating the missing data probability to the (possibly unobserved) associated response.

2.2 Handling covariate information

In most studies, the objective is to assess the effect of explanatory variables X on the response Y, and sometimes also on the missing data process M. Let X denote the $1 \times p$ vector $(X_1, \ldots, X_p)$, each $X_j$ being coded as a dummy variable. A natural extension of model (2) is

$$\begin{aligned} \log P[Y = y, M = m, X = x] = {} & \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k \\ & + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j<k} \alpha_{jk} x_j x_k + \sum_{j,k} \beta_{jk} x_j y_k + \sum_{j,k} \gamma_{jk} x_j m_k + \psi \sum_{j=1}^{n} y_j m_j \end{aligned} \tag{7}$$

where the parameters $\beta_{jk}$ and $\gamma_{jk}$ represent the effect of the covariates on the response Y and the missing data process M, respectively. However, for purposes of comparison, it may be reasonable to assume different missing data processes between groups. For example, if the objective of the study is to compare the group $X_1 = 0$ to the group $X_1 = 1$, the following extension may be recommended:

$$\begin{aligned} \log P[Y = y, M = m, X = x] = {} & \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k \\ & + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j<k} \alpha_{jk} x_j x_k + \sum_{j,k} \beta_{jk} x_j y_k + \sum_{j,k} \gamma_{jk} x_j m_k \\ & + \psi_0 (1 - x_1) \sum_{j=1}^{n} y_j m_j + \psi_1 x_1 \sum_{j=1}^{n} y_j m_j. \end{aligned} \tag{8}$$

In (8), both elements of $\psi = (\psi_0, \psi_1)$ have the same interpretation as in Section 2.1, whilst allowing the association between the value $y_j$ and the missingness of $Y_j$ to differ between groups.

3. ESTIMATION PROCEDURE

Let φ denote the vector of all the parameters introduced in Section 2, excluding ψ. The full parameter vector is (φ, ψ). In Section 3.1, it will be shown that although the full parameter vector is identifiable, estimation of ψ is highly questionable. A sensitivity analysis approach, estimating φ with fixed ψ, is recommended and presented in Section 3.2.

3.1 Estimated versus chosen ψ

All the parameters in model (2) are identifiable, and thus are estimable using a maximum-likelihood (ML) approach based on a nonlinear optimization routine (as in Diggle and Kenward, 1994, who used a simplex algorithm). An alternative is the maximization of the log-likelihood via the EM algorithm (Dempster et al., 1977); see for example Baker et al. (1992). However, as with other models for incomplete longitudinal data, the estimation of ψ is highly sensitive to assumptions that cannot be assessed from the observed data (Rubin, 1994; Little, 1994a, 1995; Molenberghs et al., 2001). Furthermore, even when the assumptions of the model are correct, the estimation of ψ can be problematic. One of the main reasons is that the partial derivative of the log-likelihood with respect to ψ tends to 0 when ψ tends to infinity. Thus, the profile log-likelihood is flat for very large positive or negative values of ψ (similar problems were encountered by Copas and Li, 1997, in a different framework). In some cases, the likelihood will be maximized when ψ tends to ±∞.
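As a numerical check of model (2) and of the odds interpretation in (4), the following sketch enumerates the 16 cells of the joint distribution for n = 2 measurements without covariates, and evaluates the observed-data log-likelihood that an optimization routine would have to maximize. The function names, the parameter values and the observed counts are illustrative assumptions, not taken from the paper.

```python
import itertools
import math


def joint_probabilities(lam, lam12, theta, theta12, psi):
    """P[Y = y, M = m] under model (2) with n = 2.
    Keys are (y1, y2, m1, m2); mu is fixed implicitly by normalization."""
    weights = {}
    for y1, y2, m1, m2 in itertools.product((0, 1), repeat=4):
        log_w = (lam[0] * y1 + lam[1] * y2 + lam12 * y1 * y2
                 + theta[0] * m1 + theta[1] * m2 + theta12 * m1 * m2
                 + psi * (y1 * m1 + y2 * m2))
        weights[(y1, y2, m1, m2)] = math.exp(log_w)
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def observed_loglik(counts, probs):
    """Observed-data log-likelihood. `counts` is keyed by (obs1, obs2),
    each entry being 0, 1 or None (None marks a missing response)."""
    loglik = 0.0
    for (o1, o2), n in counts.items():
        m1, m2 = int(o1 is None), int(o2 is None)
        # Sum the joint probability over the unobserved components of y.
        p = sum(probs[(y1, y2, m1, m2)]
                for y1 in ((0, 1) if o1 is None else (o1,))
                for y2 in ((0, 1) if o2 is None else (o2,)))
        loglik += n * math.log(p)
    return loglik


# Illustrative parameter values (assumptions).
psi = -1.0
probs = joint_probabilities(lam=(-0.5, -0.5), lam12=1.0,
                            theta=(-1.0, -1.0), theta12=0.0, psi=psi)

# Odds of Y1 = 1 given y2 = 0, when Y1 is missing versus observed.
odds_missing = probs[(1, 0, 1, 0)] / probs[(0, 0, 1, 0)]
odds_observed = probs[(1, 0, 0, 0)] / probs[(0, 0, 0, 0)]
odds_multiplier = odds_missing / odds_observed   # exp(psi), as in (4)

# Hypothetical observed counts, for illustration only.
counts = {(1, 1): 30, (0, 0): 25, (1, None): 10, (None, 0): 5, (None, None): 2}
loglik = observed_loglik(counts, probs)
```

The ratio `odds_multiplier` equals exp(ψ) whatever the values of the other parameters, which is the content of equation (4).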
When the likelihood is maximized only as ψ tends to ±∞, the ML estimate of ψ is undefined; we will refer to this as an infinite estimate. In many other cases, even when a finite ML estimate for ψ exists, the associated ML confidence interval can be ]−∞, +∞[.

To illustrate this point, simulations were performed in the simple case of two measurements per subject with no covariate, as in model (2). The parameters of the simulations were chosen to obtain $P[Y_1 = 1] = P[Y_2 = 1] = 0.5$, with a log odds-ratio measuring the association between $Y_1$ and $Y_2$ of 1. For simplicity, an identical and independent missing data process was generated: $P[M_1, M_2] = P[M_1] P[M_2]$, with $P[M_1 = 1] = P[M_2 = 1]$. The proportion of missing data was fixed at 10%, 20%, 30%, 40% and 50%. Finally, ψ was set at 0 to investigate the properties of the estimation procedure when missing data are actually ignorable.

We were interested in the proportion of simulations that led to an infinite estimate of ψ, and in the proportion that led to an infinite or semi-infinite 95% ML confidence interval. A ML confidence interval was defined as infinite if both limits of the log-likelihood, when ψ tends either to −∞ or to +∞, were not significantly lower than the maximized log-likelihood. It was defined as semi-infinite if only one limit was not significantly lower than the maximized log-likelihood. This implied that when an infinite estimate was obtained, the associated ML confidence interval was at least semi-infinite. Various numbers of subjects were considered: 50, 100, 500 and [...]. In each situation, [...] simulated data sets were generated. Results of these simulations are presented in Table 1.

Table 1. Percentage of the simulations that led to an infinite estimate of ψ, or to an infinite or semi-infinite ML 95% confidence interval
Number of subjects | % missing data | Infinite estimate | Infinite 95% CI | Semi-infinite 95% CI

These simulations showed that although ψ was identifiable, its estimation using a ML approach was not reliable, even if the assumptions of the model were actually correct. For a small or moderate number of subjects (50 or 100), infinite estimates of ψ were frequent. Even when a finite estimate was obtained, the associated ML confidence interval was nearly always infinite or semi-infinite. In these cases, the observed data provided very little information about ψ. Thus, a sensitivity analysis appears a far better approach. A sensitivity analysis can be performed using different fixed values for ψ, and examining to what extent these values influence the estimation of the other parameters. Following Kenward et al. (2001), ψ will be termed the sensitivity parameter and φ the estimable parameter.

3.2 Estimation of φ for a fixed ψ

Given a fixed value for the sensitivity parameter, the estimable parameter can be estimated via an EM algorithm. This will result in an estimate $\hat{\phi}(\psi)$.
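For readers who wish to reproduce the design of the simulations in Section 3.1, here is a minimal sketch of the data-generating step only (not of the estimation). It rests on one derivation that is ours rather than the paper's: in model (2) with n = 2, taking $\lambda_1 = \lambda_2 = -\lambda_{12}/2$ gives $P[Y_1 = 1] = P[Y_2 = 1] = 0.5$ with log odds-ratio $\lambda_{12}$. The function names and the seed are assumptions.

```python
import math
import random


def response_distribution(log_or=1.0):
    """Joint distribution of (Y1, Y2) with both marginals equal to 0.5 and
    the requested log odds-ratio (lambda_1 = lambda_2 = -log_or / 2)."""
    lam = -log_or / 2.0
    weights = {(y1, y2): math.exp(lam * (y1 + y2) + log_or * y1 * y2)
               for y1 in (0, 1) for y2 in (0, 1)}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def simulate(n_subjects, missing_rate, seed=0):
    """One simulated data set with psi = 0: each response is deleted
    independently with probability `missing_rate` (None marks missing)."""
    rng = random.Random(seed)
    dist = response_distribution()
    cells = list(dist)
    probs = [dist[c] for c in cells]
    data = []
    for _ in range(n_subjects):
        y = rng.choices(cells, weights=probs)[0]
        data.append(tuple(None if rng.random() < missing_rate else yj
                          for yj in y))
    return data


dist = response_distribution()
marginal_y1 = dist[(1, 0)] + dist[(1, 1)]
log_odds_ratio = math.log(dist[(1, 1)] * dist[(0, 0)]
                          / (dist[(1, 0)] * dist[(0, 1)]))
sample = simulate(n_subjects=50, missing_rate=0.2, seed=1)
```

Each simulated data set would then be passed to the ML routine to record whether the estimate of ψ, or its confidence interval, is infinite.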

The minimal sufficient statistics of model (8) are the $2^{p+2n}$ numbers of subjects $n_{x,m,y}$ for each possible vector (x, m, y). Splitting y into its observed and missing components $(y_{obs}, y_{miss})$, the observed statistics are the $2^p 3^n$ numbers of subjects $n_{x,m,y_{obs}}$ (each $y_j$ being either 0, 1 or missing). The E-step provides the expectation of the minimal sufficient statistics, given the observed data, the current estimate $\phi^{[t]}$ and the fixed value ψ:

$$\begin{aligned} E\left[ n_{x,m,y} \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi \right] &= n_{x,m,y_{obs}} \, P\left[ Y_{miss} = y_{miss} \mid Y_{obs} = y_{obs}, M = m, X = x; \phi^{[t]}, \psi \right] \\ &= n_{x,m,y_{obs}} \, \frac{P\left[ Y = (y_{obs}, y_{miss}), M = m, X = x \mid \phi^{[t]}, \psi \right]}{\sum_{y_{miss}} P\left[ Y = (y_{obs}, y_{miss}), M = m, X = x \mid \phi^{[t]}, \psi \right]}. \end{aligned}$$

The M-step provides $\phi^{[t+1]}$, the updated estimate of the parameter. From model (7), the expected log-likelihood, to be maximized over φ, is

$$E\left[ L(\phi; n_{x,m,y}) \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi \right] = \sum_{x,m,y} E\left[ n_{x,m,y} \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi \right] \log P\left[ Y = y, M = m, X = x \mid \phi, \psi \right].$$

The M-step involves a numerical maximization, which can be performed using standard software for generalized linear models such as SAS proc genmod (SAS Institute Inc., 1999), assuming a Poisson distribution and a log link, with $\log(n) + \psi \sum_{j=1}^{n} y_j m_j$ as an offset variable.

After convergence of the EM algorithm, $\hat{\phi}(\psi)$, the ML estimate of φ conditional upon the value of ψ, is obtained. Its variance can be obtained by numerical differentiation methods (Jamshidian and Jennrich, 2000). Given a set of plausible values for ψ, denoted by Ψ, one can obtain the region of ignorance for φ, defined to be the set of all possible point estimates $\hat{\phi}(\psi)$ for ψ ∈ Ψ. The (1 − α)100% region of uncertainty can be constructed around the region of ignorance, in the spirit of a confidence region.

The estimated probabilities $P[Y = y, M = m, X = x \mid \hat{\phi}(\psi), \psi]$ can be derived from $\hat{\phi}(\psi)$ using (8), and the estimated probabilities $P[Y = y, X = x \mid \hat{\phi}(\psi), \psi]$ are then obtained by summation over the missing data patterns. Finally, the probabilities $P[Y = y \mid X = x; \hat{\phi}(\psi), \psi]$ are derived by conditioning. This conditional distribution represents the basis of the inferences of interest, as it describes how the probability of a positive response changes with different levels of the covariates. It may answer a wide range of questions addressed by the study, for example, how different levels of the covariates affect the time by time probabilities of success, the probability that all the measurements are successes, or the probability of observing at least k successes out of n measurements.

However, the log linear model described in Section 2 may not be the preferred approach for describing the association between the response Y and explanatory variables X. In particular, the parameters $\beta_{jk}$ introduced in models (7) and (8) describe the association between $X_j$ and $Y_k$, conditional on all the other variables of this model. A marginal model, whose parameter estimation is based on generalized estimating equations (Diggle et al., 2002), could be preferable in this context. In this case, an attractive approach is to use the estimated probabilities $P[Y = y, M = m, X = x \mid \hat{\phi}(\psi), \psi]$ to perform multiple imputation (Rubin, 1987).
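The E-step of Section 3.2 can be illustrated in the simplest setting of n = 2 measurements and no covariates, i.e. under model (2). The sketch below is not the authors' SAS macro; the function names, parameter values and observed counts are hypothetical. It spreads each observed count over the compatible complete vectors y in proportion to the joint probabilities, and the same conditional distribution is the one from which missing responses would be drawn for imputation.

```python
import itertools
import math


def joint_model2(lam, lam12, theta, theta12, psi):
    """P[Y = y, M = m] for n = 2 under model (2).
    Keys are ((y1, y2), (m1, m2))."""
    w = {}
    for y in itertools.product((0, 1), repeat=2):
        for m in itertools.product((0, 1), repeat=2):
            w[(y, m)] = math.exp(
                lam[0] * y[0] + lam[1] * y[1] + lam12 * y[0] * y[1]
                + theta[0] * m[0] + theta[1] * m[1] + theta12 * m[0] * m[1]
                + psi * (y[0] * m[0] + y[1] * m[1]))
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}


def completions(obs):
    """All complete vectors y compatible with obs (None = missing)."""
    options = [(0, 1) if o is None else (o,) for o in obs]
    return list(itertools.product(*options))


def conditional_of_missing(obs, joint):
    """P[Y = y | Y_obs = y_obs, M = m] for every completion y of obs."""
    m = tuple(int(o is None) for o in obs)
    numerators = {y: joint[(y, m)] for y in completions(obs)}
    denominator = sum(numerators.values())
    return {y: p / denominator for y, p in numerators.items()}


def e_step(observed_counts, joint):
    """Expected complete-data counts n_{m,y}, keyed by (y, m)."""
    expected = {}
    for obs, n in observed_counts.items():
        m = tuple(int(o is None) for o in obs)
        for y, p in conditional_of_missing(obs, joint).items():
            expected[(y, m)] = expected.get((y, m), 0.0) + n * p
    return expected


# Illustrative current parameter values and hypothetical observed counts.
joint = joint_model2(lam=(0.0, 0.0), lam12=0.0,
                     theta=(-1.0, -1.0), theta12=0.0, psi=1.0)
observed_counts = {(1, 1): 30, (1, None): 10, (None, None): 2}
cond = conditional_of_missing((1, None), joint)
expected = e_step(observed_counts, joint)
```

With these values, `cond[(1, 1)]` equals $1/(1 + e^{-1})$, in agreement with (4): the log odds of the missing $Y_2$ are $\lambda_2 + \lambda_{12} y_1 + \psi = 1$. The M-step (a Poisson log linear fit with offset) is not shown.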

4. MULTIPLE IMPUTATION

The procedure for generating proper multiple imputation is the following. First, the parameters of the model are drawn, then the missing data are drawn conditionally on the observed values and the drawn parameters (Rubin, 1987, Sections 4.3 and 4.4). Although most easily understood using Bayesian concepts, a likelihood-based treatment is equally possible (Verbeke and Molenberghs, 2000, Section 20.3). This would consist in first drawing a value of φ from the asymptotic distribution of $\hat{\phi}(\psi)$, say $\phi^*$, then drawing $Y_{miss}$ from the conditional distribution $P[Y_{miss} \mid Y_{obs}, X, M; \phi^*, \psi]$. The general procedure for conducting a sensitivity analysis can then be summarized as follows:

1. Determine a model for the complete data: $P[Y = y, M = m, X = x \mid \phi, \psi]$, as detailed in Section 2.
2. Choose a value for ψ.
3. Compute $\hat{\phi}(\psi)$, the ML estimate of φ given ψ, and its variance matrix $\mathrm{Var}(\hat{\phi}(\psi))$, as detailed in Section 3.2.
4. Draw a value of φ from the asymptotic distribution $N(\hat{\phi}(\psi), \mathrm{Var}(\hat{\phi}(\psi)))$, say $\phi^*$.
5. Draw $Y_{miss}$ from the conditional distribution $P[Y_{miss} \mid Y_{obs}, X, M; \phi^*, \psi]$.
6. Repeat steps 4 and 5 I ≥ 2 times to obtain I completed data sets.
7. Analyse the I completed data sets using the appropriate statistical method.
8. Combine the I estimations into a single estimation.
9. Repeat steps 2 to 8 with a different value of ψ.

One of the main advantages of this procedure is that we may use any method to analyse the completed data sets. The I complete data estimations are then easily combined into a single one. This inference is valid under the model described in Section 2. Considering several values for ψ allows us to display inference about the parameters of primary interest under a range of assumptions concerning the missing data process. A SAS macro and an example can be found on the web site at ~u472/equipes/biostatistique/savoir.html

5. EXAMPLE

The proposed method was applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma (Grosclaude and Desfougeres, 2003). Patients whose asthma was insufficiently controlled by inhaled corticosteroids alone were included. After a two-week run-in period, patients were randomized to receive either an inhaled corticosteroid / long-acting β2-agonist combination (referred to below as the study treatment) or an inhaled corticosteroid / leukotriene antagonist association (referred to below as the reference treatment). During the 12-week treatment period, the subjects filled in a diary record card. Asthma control was assessed on a weekly basis, and was defined according to the criteria given in Table 2. Weekly data were then aggregated into monthly data, each patient being considered as controlled for a given month if his/her asthma was controlled at least three weeks out of four.

A total of 246 subjects were randomized, 119 in the study treatment group and 127 in the reference treatment group. The distribution of missing data is presented in Table 3. Globally, 69% of subjects had available asthma control data at each of the three months, complete data being more frequent in the study treatment group (75%) than in the reference group (64%). It is noteworthy that missing data predominantly occurred in the reference treatment group, which ultimately was the less effective. Thus, data are not likely to be missing completely at random. Among incomplete observations, monotone missing data was the most common structure (24%) but some patients (7%) also presented a non-monotone missing data structure.
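The distinction between complete, monotone and non-monotone structures used in the summary above can be made explicit with a small helper operating on the vector of missing data indicators M. The function name is hypothetical, and counting a subject with no observed response at all as monotone is a convention we assume here.

```python
def classify_pattern(m):
    """Classify a vector of missing data indicators (1 = missing).

    'complete'     : nothing is missing
    'monotone'     : once a response is missing, all later ones are missing
    'non-monotone' : at least one observed response follows a missing one
    """
    m = list(m)
    if not any(m):
        return "complete"
    first_missing = m.index(1)
    if all(m[first_missing:]):
        return "monotone"
    return "non-monotone"


# The eight possible patterns (M1, M2, M3) for three monthly measurements.
patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = {p: classify_pattern(p) for p in patterns}
```

With three measurements, the monotone patterns are (0, 0, 1), (0, 1, 1) and (1, 1, 1), while intermittent patterns such as (0, 1, 0) or (1, 0, 0) are non-monotone.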

Table 2. Definition of asthma control
Presence of at least two of the three following criteria:
- morning peak expiratory flow rate ≥ 80% of the predicted value every day
- no more than 2 days of inhaled short-acting bronchodilator use, up to a maximum of 4 occasions per week
- no more than two days with asthma symptoms
Presence of all the following criteria:
- no night-time awakening due to asthma
- no treatment related adverse event leading to permanent discontinuation of treatment
- no unscheduled medical visit or hospitalization for asthma
- no exacerbation

Table 3. Summary of missing data: numbers of subjects in each missing data pattern
M1 | M2 | M3 | Study treatment (N = 119) | Reference treatment (N = 127) | Total (N = 246)

Model (8) was used for imputation. The nature of the missing data mechanism was allowed to differ between the two treatment groups by including two values $\psi_0$ and $\psi_1$ in the model. Besides treatment, two covariates were included in the model, sex and smoking status, which were a priori considered as potentially related to the asthma control. These covariates were also considered as having a potential effect on the missing data process (e.g. smokers having a higher rate of missing data than non-smokers), but in contrast to treatment, they were not assumed to modify the sensitivity parameter.

The first step of the sensitivity analysis consists in determining a set of plausible values for the sensitivity parameter ψ in each treatment group. As ψ can be interpreted as the logarithm of the odds ratio measuring the association between a response and its observation, this parameter has a statistical and a clinical interpretation and should be given reasonable values. However, in the scope of a sensitivity analysis, one can also consider unrealistic and even extreme assumptions, as long as their effect is interpreted cautiously.
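Because ψ is a log odds ratio, a candidate value can be translated into the probability it implies for a missing response, which may help in judging whether the value is reasonable. The sketch below applies the shift of equation (4), which holds conditionally on the other responses; the function name and the probability 0.3 are purely illustrative.

```python
import math


def shifted_probability(p_observed, psi):
    """Probability of a positive response when the response is missing,
    given that the corresponding probability is p_observed when it is
    observed and that the odds are multiplied by exp(psi)."""
    logit = math.log(p_observed / (1.0 - p_observed))
    return 1.0 / (1.0 + math.exp(-(logit + psi)))


p_obs = 0.3                                   # illustrative value only
p_low = shifted_probability(p_obs, -1.0)      # missing data fare worse
p_high = shifted_probability(p_obs, 1.0)      # missing data fare better
odds_ratio_low = (p_low / (1 - p_low)) / (p_obs / (1 - p_obs))
```

Setting ψ = 0 returns `p_observed` unchanged, and the odds ratio between the shifted and the observed probability is exp(ψ) by construction.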
It seems natural to assume that missing data were associated with a poor response. But it is also plausible that well-controlled subjects may more often forget to complete their diary record card. Thus, both negative and positive values for ψ should be considered. Finally, in this study, a range of values of ψ within ±1 could be considered as sufficiently wide, as they would allow the odds of asthma control to be up to 2.7 times larger or smaller for missing data than for observed data.

The aim of the study was to compare the mean evolution of the rate of asthma control between the two groups. This analysis included adjustments for sex and smoking status (smoker versus non-smoker). After estimation of the model as described in Section 3.2, multiple completed data sets were generated using the method described in Section 4. Each completed data set was then analysed via a marginal model, estimated by generalized estimating equations (see for example Xie and Paik, 1997). This marginal model could be written as

$$\operatorname{logit} P[Y_j = 1 \mid X = x] = \mu_j + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{9}$$

where $x_1$, $x_2$ and $x_3$ represent the treatment, sex and smoking status indicators respectively. Thus, the effect of time was modelled qualitatively and the effect of treatment was assumed to be the same at each time of measurement. This was justified by the absence of significant treatment by time interaction. Finally, an unstructured matrix was used for the modelling of the within-subject correlations.

For a better interpretation of the results of the sensitivity analysis, a preliminary approach is to estimate the probabilities $P[Y_j = 1 \mid X = x]$, i.e. the marginal probabilities of asthma control at each month, within each treatment group, after adjustment for covariates. These probabilities are estimated for different values of ψ and are displayed in Figure 1. As missing data were more frequent in the reference group, the effect of ψ was more important in this group than in the study treatment group, especially so for the third month. Finally, values −∞ and +∞ for ψ, corresponding to the worst and best case respectively, delimit the range of all possible values for the estimated probabilities of asthma control.

Fig. 1. Estimation of monthly probability of asthma control in each treatment group, according to different values of ψ. [Panels: Reference Treatment, Study Treatment; x-axis: Time (month); y-axis: Probability of asthma control; curves for ψ = ±∞, ψ = ±1 and ψ = 0.]

The results of this sensitivity analysis are presented in Figure 2. In the upper part of these graphs, each curve joins combinations of ψ in both treatment groups for which the same estimate of the treatment effect (in terms of the odds-ratio between treatment groups) is obtained.
In the lower part, pairs of values of ψ with the same level of statistical significance are displayed. Inside the dotted box are reasonable values for ψ in both treatment groups, i.e. ψ within ±1. Outside of this box are more extreme values. Assuming that missing data were ignorable (ψ = 0) in both groups, a better asthma control was observed in the study treatment group, with an odds-ratio of 2.15 (p < 0.001). In the range of values of ψ within ±1 in both groups, the treatment effect remained relatively stable, and a significant difference was still observed. The interval of ignorance was [1.56, 2.84]. Its associated 95% interval of uncertainty was [1.02, 4.37], which was entirely above 1. Only assumptions outside the range of plausible values could yield a non-significant result, in particular a strong nonignorable mechanism of opposed sign in the two groups, greatly disfavouring the study treatment group. For example, the best case/worst case analysis, which corresponds to ψ = −∞ in the study treatment group and ψ = +∞ in the reference group, led to near equality between the two groups (OR = 1.00) but did not reverse the conclusion of the study. It should be recalled that in this study, both groups received an active treatment, registered for asthma. Their safety profiles were comparable. Therefore, there seems to be no reason for assuming such a dramatic difference between their missing data processes.
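How an interval of ignorance and an interval of uncertainty can be read off a grid of sensitivity analyses is sketched below. The interval of ignorance is the range of the point estimates over the grid, as defined in Section 3.2. Taking the interval of uncertainty as the union of the individual confidence intervals is one simple construction that we assume here; the paper only states that it is built around the region of ignorance in the spirit of a confidence region. The function name and all numbers in the usage lines are hypothetical and are not the results of the asthma trial.

```python
def ignorance_and_uncertainty(results):
    """`results` maps each value of the sensitivity parameter, here a pair
    (psi_0, psi_1), to (point_estimate, ci_lower, ci_upper)."""
    estimates = [est for est, _, _ in results.values()]
    lowers = [lo for _, lo, _ in results.values()]
    uppers = [hi for _, _, hi in results.values()]
    interval_of_ignorance = (min(estimates), max(estimates))
    # Assumption: union of the confidence intervals over the grid.
    interval_of_uncertainty = (min(lowers), max(uppers))
    return interval_of_ignorance, interval_of_uncertainty


# Hypothetical odds-ratio estimates for three pairs (psi_0, psi_1).
grid_results = {
    (-1.0, 1.0): (1.8, 1.1, 2.9),
    (0.0, 0.0): (2.0, 1.3, 3.1),
    (1.0, -1.0): (2.3, 1.5, 3.6),
}
ignorance, uncertainty = ignorance_and_uncertainty(grid_results)
```

By construction the interval of uncertainty contains the interval of ignorance, and a conclusion can be called robust over the grid when the whole interval of uncertainty lies on one side of the null value.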

Fig. 2. Results of the sensitivity analysis: contour plot of the estimation of the treatment effect and statistical significance, for different values of ψ in each treatment group. [Upper panel: Estimation of Treatment Effect (Odds-ratio); lower panel: Statistical Significance of the Treatment Effect; axes: ψ in Reference Treatment and ψ in Study Treatment.]

In conclusion, the sensitivity analysis showed that the superiority of the study treatment was robust to slight to moderate departure from the ignorability assumption. In this study, missing data should therefore not be considered to be a serious source of concern.

6. DISCUSSION

The fundamental impossibility of performing any valid inference without strong and untestable assumptions justifies the choice of a sensitivity analysis: inferences under a range of plausible or extreme assumptions rather than a single inference. However, unless the treatment effect is very strong or the rate of missing data very low, it seems obvious that the most extreme assumptions will at least invalidate the statistical significance, and sometimes reverse the conclusion of a study (see examples provided by Rotnitzky et al., 1998, 2001; Scharfstein et al., 1999; Hollis, 2002; Birmingham et al., 2003). Our opinion is that one should not expect a conclusion to resist all assumptions about missing data, but only to remain stable under mild or moderate assumptions. In the example described in this paper, values of ψ within ±1 could be considered as a reasonable compromise between the risk of neglecting plausible assumptions and that of considering unrealistic assumptions. In this case, the use of intervals of ignorance and intervals of uncertainty is valuable.

The log linear model introduced in Section 2 may be debatable in some contexts, for example when the number of measurements is variable.
However, in clinical trials the number of planned observations is usually fixed, even if the number of responses actually observed may vary across subjects because of missing data. The conditional character of this model also limits its applicability, and a marginal model is often preferred for the analysis. But that does not preclude its usefulness for the imputation of the missing data. Indeed, in a marginal framework, the conditional distribution of the missing data, given the observed data and the missing data structure, is often very complex. This conditional distribution is fundamental for imputation. In the log linear model, on the other hand, this conditional distribution is easily derived.

Completing the data under various nonignorable assumptions, and then analysing them using a possibly different model, is in the spirit of multiple imputation (Schafer, 1999). It raises the classical issue of the difference between the imputer's model and the analyst's model (Rubin, 1996; Schafer, 1997). However, in this context, the imputer and the analyst would generally be the same person. Thus, the assumptions made for completing and analysing the data (concerning the time-dependence, within-subject correlations, effect of covariates) should be similar in the log linear model and the marginal model.

An alternative to multiple imputation would be to perform a weighted analysis using the estimated probabilities P[M | X, Y], for example weighted GEE (Robins et al., 1995). However, note that some widely used procedures such as SAS proc genmod allow the introduction of known weights, but do not take into account their imprecision when they are estimated, and thus underestimate the variance of the weighted estimator.

Imputations are generated under a nonignorable model, and the inferences performed are valid under the assumed mechanism for missing data. One may legitimately wonder to what extent the conclusions of the analysis depend on the model assumptions.
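For the multiple-imputation route discussed above, the completed-data analyses obtained for a given value of ψ have to be combined into one estimate and one variance. The sketch below assumes the standard combining rules of Rubin (1987) are used; this section does not spell out the pooling step, and the function name `pool_rubin` and the toy inputs are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimates (e.g. log odds-ratios), one per imputed data set.
    variances: their squared standard errors.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total

# Made-up inputs for illustration only.
pooled, total = pool_rubin([0.7, 0.8, 0.9], [0.04, 0.04, 0.04])
```

Repeating this pooling for each assumed ψ gives one estimate and standard error per point of the sensitivity grid. The between-imputation term is what a weighted analysis with weights treated as known would leave out.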
Clearly, the adequacy of the model to the observations has to be checked using diagnostic tools, but one cannot expect the examination of the available residuals or influence measures to shed light on the choice of ψ.

In this modelling, several assumptions were made. First, the association between any pair of responses was assumed not to depend on the missing data pattern. Secondly, it was assumed that, conditionally on Y_j, the missingness of Y_j was independent of all other Y_k. Finally, it was assumed that ψ was constant over time. These assumptions cannot be checked, and thus may raise doubts about the robustness of the method. The method for conducting a sensitivity analysis proposed in this paper can be extended by relaxing one or all of these assumptions, but at the cost of a higher-dimensional analysis and a loss of simplicity and interpretability. Conversely, a simple model with a parameter ψ fixed in each treatment group will generally cover all the situations of practical interest. If the conclusion of the study is robust over a range of values for ψ in each group, one can expect that it would also hold under a more complex model.

The flexibility of the log linear model allows us to consider alternative assumptions. For instance, in some contexts it may be more plausible for missingness to depend on other variables than on the variable subject to missingness itself. In this case, from model (1), one would assume that ψ_jj = 0 and derive a different model. Note that this would not necessarily lead to an ignorable model, since missingness could be assumed to depend on the previous response, which is also possibly missing.

Other possible extensions may include the handling of the cause of missing data. In clinical trials, these reasons are often investigated.
In some cases, the missingness is likely to be unrelated to the unobserved response, for example refusal to continue for personal convenience or the technical impossibility of performing the measurement. In other cases, one would not assume that there exists an unobserved value behind the missing values, especially when dropout corresponds to the death of the subject. In these cases, a simple extension of the proposed method would consist in imputing only the missing values considered as potentially related to an existing unobserved response.

A drawback of this method is the computational burden, which increases exponentially with the number of measurements per subject (n) and the number of explanatory variables (p). In particular, the EM algorithm involves 2^(2n+p) sufficient statistics, so that with our computer (a Pentium III 1.4 GHz), this method could not be applied with n 10.
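To make the growth in computation concrete, the short sketch below tabulates the number of sufficient statistics for a few values of n. It assumes the count quoted above reads as 2 raised to the power 2n + p (the exponent is flattened in the transcription) and the helper name `n_sufficient_statistics` is hypothetical.

```python
def n_sufficient_statistics(n, p):
    """Number of sufficient statistics, assuming the count is 2**(2n + p).

    n: number of planned measurements per subject.
    p: number of explanatory variables.
    """
    if n < 0 or p < 0:
        raise ValueError("n and p must be non-negative")
    return 2 ** (2 * n + p)

# Each extra measurement multiplies the count by 4.
counts = {n: n_sufficient_statistics(n, p=1) for n in (2, 4, 6, 8, 10)}
```

Under this reading, adding one measurement occasion quadruples the count, which is consistent with the method becoming impractical at around ten occasions on the hardware described.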

ACKNOWLEDGEMENTS

The authors are grateful to Florence Casset-Semanaz for her constant support in this research work. We are also grateful to two anonymous reviewers for their insightful comments that greatly contributed to improving the quality of this paper. Minini's research was partially supported by the French Association Nationale de la Recherche Technique, Convention CIFRE 707/2000.

REFERENCES

BAKER, S. G., ROSENBERGER, W. F. AND DERSIMONIAN, R. (1992). Closed-form estimates for missing counts in two-way contingency tables. Statistics in Medicine 11.
BIRMINGHAM, J., ROTNITZKY, A. AND FITZMAURICE, G. M. (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B 65.
BISHOP, Y. M. M., FIENBERG, S. E. AND HOLLAND, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (2001). Points to consider on missing data, CPMP/EWP/1776/99. European Agency for the Evaluation of Medicinal Products, London.
COPAS, J. B. AND LI, H. G. (1997). Inference for non-random samples (with discussion). Journal of the Royal Statistical Society, Series B 59.
DEMPSTER, A. P., LAIRD, N. M. AND RUBIN, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39.
DIGGLE, P. J. AND KENWARD, M. G. (1994). Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43.
DIGGLE, P. J., HEAGERTY, P., LIANG, K. Y. AND ZEGER, S. L. (2002). Analysis of Longitudinal Data, 2nd edition. Oxford: Oxford University Press.
FITZMAURICE, G. M. AND LAIRD, N. M. (2000). Generalized linear mixture models for handling nonignorable dropouts in longitudinal studies. Biostatistics 1.
FOLLMANN, D. AND WU, M. (1995). An approximate generalized linear model with random effects for informative missing data. Biometrics 51.
GROSCLAUDE, M. AND DESFOUGERES, J. L. (2003). Fluticasone/Salmeterol (FP/S) est plus efficace que l'association beclomethasone-montelukast (BDP-M). Revue de Maladies Respiratoires 20, 1S74–1S74.
HOLLIS, S. (2002). A graphical sensitivity analysis for clinical trials with non-ignorable missing binary outcome. Statistics in Medicine 21.
IBRAHIM, J., CHEN, M. H. AND LIPSITZ, S. R. (2001). Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika 88.
JAMSHIDIAN, M. AND JENNRICH, R. I. (2000). Standard errors for EM estimation. Journal of the Royal Statistical Society, Series B 62.
KENWARD, M. G., GOETGHEBEUR, J. T. AND MOLENBERGHS, G. (2001). Sensitivity analysis for incomplete categorical data. Statistical Modelling 1.
LIPSITZ, S. R., MOLENBERGHS, G., FITZMAURICE, G. M. AND IBRAHIM, J. (2000). GEE with Gaussian estimation of the correlations when data are incomplete. Biometrics 56.
LITTLE, R. J. A. (1994). Discussion to Diggle and Kenward: Informative drop-out in longitudinal data analysis. Applied Statistics 43, 78.
LITTLE, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika 81.
LITTLE, R. J. A. (1995). Modelling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90.
LITTLE, R. J. A. AND RUBIN, D. B. (1987). Statistical Analysis with Missing Data. New York: Wiley.
MOLENBERGHS, G., KENWARD, M. G. AND GOETGHEBEUR, E. (2001). Sensitivity analysis for incomplete contingency tables: the Slovenian plebiscite case. Applied Statistics 50.
PAIK, M. C. (1997). The generalized estimating equation approach when data are not missing completely at random. Journal of the American Statistical Association 92.
ROBINS, J. M., ROTNITZKY, A. AND ZHAO, L. P. (1995). Analysis of semi-parametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90.
ROTNITZKY, A., ROBINS, J. M. AND SCHARFSTEIN, D. (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association 93.
ROTNITZKY, A., SCHARFSTEIN, D., SU, T. L. AND ROBINS, J. M. (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics 57.
RUBIN, D. B. (1987). Multiple Imputations for Nonresponse in Surveys. New York: Wiley.
RUBIN, D. B. (1994). Discussion to Diggle and Kenward: Informative drop-out in longitudinal data analysis. Applied Statistics 43.
RUBIN, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association 91.
SAS INSTITUTE INC. (1999). SAS/STAT User's Guide, Version 8. Cary, NC.
SCHAFER, J. L. (1997). Analysis of Incomplete Multivariate Data. New York: Chapman and Hall.
SCHAFER, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research 8.
SCHARFSTEIN, D., ROTNITZKY, A. AND ROBINS, J. M. (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94.
SKINNER, C. J. (1997). Discussion to Copas and Li: Inference for non-random samples. Journal of the Royal Statistical Society, Series B 59.
UNNEBRINK, K. AND WINDELER, J. (1999). Sensitivity analysis by worst and best case assessment: is it really sensitive? Drug Information Journal 33.
VANSTEELANDT, S., GOETGHEBEUR, E., KENWARD, M. G. AND MOLENBERGHS, G. (2000). Ignorance and uncertainty regions as inferential tool in a sensitivity analysis. Technical Report 2000/2, Centrum voor Statistiek, Ghent University.
VANSTEELANDT, S. AND GOETGHEBEUR, E. (2001). Analyzing the sensitivity of generalized linear models to incomplete outcomes via the IDE algorithm. Journal of Computational and Graphical Statistics 10.
VERBEKE, G. AND MOLENBERGHS, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer.
XIE, F. AND PAIK, M. C. (1997). Multiple imputation methods for the missing covariates in generalized estimating equations. Biometrics 53.

[Received 16 June 2003; first revision 28 October 2003; second revision 5 February 2004; accepted for publication 12 February 2004]


More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...

MATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators... MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Portfolio Using Queuing Theory

Portfolio Using Queuing Theory Modeling the Number of Insured Households in an Insurance Portfolio Using Queuing Theory Jean-Philippe Boucher and Guillaume Couture-Piché December 8, 2015 Quantact / Département de mathématiques, UQAM.

More information

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Tests for Two Survival Curves Using Cox s Proportional Hazards Model Chapter 730 Tests for Two Survival Curves Using Cox s Proportional Hazards Model Introduction A clinical trial is often employed to test the equality of survival distributions of two treatment groups.

More information

Probability and Statistics

Probability and Statistics CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

More information

Monotonicity Hints. Abstract

Monotonicity Hints. Abstract Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

On efficiency of constrained longitudinal data analysis versus longitudinal. analysis of covariance. Supplemental materials

On efficiency of constrained longitudinal data analysis versus longitudinal. analysis of covariance. Supplemental materials Biometrics 000, 000 000 DOI: 000 000 0000 On efficiency of constrained longitudinal data analysis versus longitudinal analysis of covariance Supplemental materials Kaifeng Lu Clinical Biostatistics, Merck

More information

7 Hypothesis testing - one sample tests

7 Hypothesis testing - one sample tests 7 Hypothesis testing - one sample tests 7.1 Introduction Definition 7.1 A hypothesis is a statement about a population parameter. Example A hypothesis might be that the mean age of students taking MAS113X

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Missing data in randomized controlled trials (RCTs) can

Missing data in randomized controlled trials (RCTs) can EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Lecture - 32 Regression Modelling Using SPSS

Lecture - 32 Regression Modelling Using SPSS Applied Multivariate Statistical Modelling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 32 Regression Modelling Using SPSS (Refer

More information

Chris Slaughter, DrPH. GI Research Conference June 19, 2008

Chris Slaughter, DrPH. GI Research Conference June 19, 2008 Chris Slaughter, DrPH Assistant Professor, Department of Biostatistics Vanderbilt University School of Medicine GI Research Conference June 19, 2008 Outline 1 2 3 Factors that Impact Power 4 5 6 Conclusions

More information

, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients ( Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

More information

Sample Size Planning, Calculation, and Justification

Sample Size Planning, Calculation, and Justification Sample Size Planning, Calculation, and Justification Theresa A Scott, MS Vanderbilt University Department of Biostatistics theresa.scott@vanderbilt.edu http://biostat.mc.vanderbilt.edu/theresascott Theresa

More information

A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY. Workshop

A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY. Workshop A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY Ramon Alemany Montserrat Guillén Xavier Piulachs Lozada Riskcenter - IREA Universitat de Barcelona http://www.ub.edu/riskcenter

More information

Missing Data Dr Eleni Matechou

Missing Data Dr Eleni Matechou 1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring. Jie-Men Mok Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

A New Imputation Method for Incomplete Binary Data

A New Imputation Method for Incomplete Binary Data A New Imputation Method for Incomplete Binary Data Munevver Mine Subasi Department of Mathematical Sciences Florida Institute of Technology 150 W. University Blvd., Melbourne, FL 32901 USA Martin Anthony

More information

Guideline on missing data in confirmatory clinical trials

Guideline on missing data in confirmatory clinical trials 2 July 2010 EMA/CPMP/EWP/1776/99 Rev. 1 Committee for Medicinal Products for Human Use (CHMP) Guideline on missing data in confirmatory clinical trials Discussion in the Efficacy Working Party June 1999/

More information

Paper Beyond Breslow-Day: Homogeneity Across R x C Tables ABSTRACT INTRODUCTION SAMPLE DATA K 2 2 TABLES

Paper Beyond Breslow-Day: Homogeneity Across R x C Tables ABSTRACT INTRODUCTION SAMPLE DATA K 2 2 TABLES Paper 74949 Beyond Breslow-Day: Homogeneity Across R x C Tables Ginny P. Lai, David R. Mink, David J. Pasta, ICON Late Phase & Outcomes Research, San Francisco, CA ABSTRACT In the epidemiological world,

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

More information

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

More information

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based

More information

Item Imputation Without Specifying Scale Structure

Item Imputation Without Specifying Scale Structure Original Article Item Imputation Without Specifying Scale Structure Stef van Buuren TNO Quality of Life, Leiden, The Netherlands University of Utrecht, The Netherlands Abstract. Imputation of incomplete

More information

Combining Multiple Imputation and Inverse Probability Weighting

Combining Multiple Imputation and Inverse Probability Weighting Combining Multiple Imputation and Inverse Probability Weighting Shaun Seaman 1, Ian White 1, Andrew Copas 2,3, Leah Li 4 1 MRC Biostatistics Unit, Cambridge 2 MRC Clinical Trials Unit, London 3 UCL Research

More information