Sensitivity analysis of longitudinal binary data with nonmonotone missing values


Biostatistics (2004), 5, 4, pp doi: /biostatistics/kxh006

PASCAL MININI
Laboratoire GlaxoSmithKline, Unité Méthodologie et Biostatistique, 100 route de Versailles, Marly le Roi, France, and INSERM U472, 16 avenue Paul Vaillant-Couturier, Villejuif, France

MICHEL CHAVANCE
INSERM U472, 16 avenue Paul Vaillant-Couturier, Villejuif, France

SUMMARY

This paper highlights the consequences of incomplete observations in the analysis of longitudinal binary data, in particular those of nonmonotone missing data patterns. Sensitivity analysis is advocated, and a method based on a log linear model is proposed. A sensitivity parameter that represents the relationship between the response mechanism and the missing data mechanism is introduced. It is shown that although this parameter is identifiable, its estimation is highly questionable. A far better approach is to consider a range of plausible values and to estimate the parameters of interest conditionally upon each value of the sensitivity parameter. This allows us to assess the sensitivity of the study's conclusions to assumptions regarding the missing data mechanism. The method is applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma.

Keywords: Binary data; EM; Ignorance; Longitudinal study; Missing; Multiple imputation; Nonmonotone; Sensitivity analysis; Uncertainty.

1. INTRODUCTION

We consider longitudinal studies designed to repeatedly observe a binary response at n prespecified occasions. In practice, successful completion of all planned measurements from all subjects is extremely rare. Two main sources of missing data can be distinguished. On the one hand, some subjects will drop out of the study, for example as a result of an adverse event, a lack of efficacy of the study treatment, or simply the refusal of the subject to continue the study.
This will result in a monotone pattern of missing data (Little and Rubin, 1987). On the other hand, some data will be missing intermittently, for example because of an illness, an invalid measurement or forgetfulness. This will result in a nonmonotone pattern. Longitudinal studies generally suffer from both types of missingness, and the collected data are often incomplete with a nonmonotone structure.

The classification proposed by Little and Rubin (1987) is based on the relationship between the mechanism leading to complete or incomplete data (the missing data process) and the mechanism controlling the actual value of the response of interest (the response process). Data are missing at random (MAR) when the missing data process depends only on observed responses, and missing not at random when it depends on unobserved responses. In the framework of likelihood-based inference, if the missing data are MAR and if the parameters of the missing data process and those of the response process are distinct, then the missing data process is said to be ignorable. Otherwise it is nonignorable.

To whom correspondence should be addressed. Biostatistics Vol. 5 No. 4 © Oxford University Press 2004; all rights reserved.

Over the past few years, considerable attention has been given to the modelling of longitudinal binary data with nonignorable missing values, via generalized linear mixed models (e.g. Follmann and Wu, 1995; Ibrahim et al., 2001) or generalized estimating equations (e.g. Paik, 1997; Lipsitz et al., 2000; Fitzmaurice and Laird, 2000). However, a paradigm has emerged: handling incomplete observations necessarily requires assumptions that cannot be assessed from the observed data (Little, 1994a; Rubin, 1994; Verbeke and Molenberghs, 2000). In these circumstances, the need for sensitivity analyses has been clearly recognized.

Molenberghs et al. (2001), Kenward et al. (2001), Vansteelandt et al. (2000) and Vansteelandt and Goetghebeur (2001) have developed the concepts of ignorance and uncertainty. On the one hand, the usual imprecision is due to finite random sampling; it is acknowledged via confidence intervals, the width of which approaches zero as the sample size grows. On the other hand, ignorance is due to the incompleteness of the data and can be reflected by the interval of ignorance. Ignorance due to a given proportion of missing data would not disappear even with an infinite sample size. Imprecision and ignorance are combined into the concept of uncertainty, which acknowledges both sources.
In controlled clinical trials, the Committee for Proprietary Medicinal Products (2001) has recommended conducting a sensitivity analysis in order to assess the impact of different missing data assumptions on the conclusions of a study. With binary responses, a best-case/worst-case analysis can be performed by assigning a positive response to all missing data in the control group and a negative response to all missing data in the experimental group. Although the assumptions of this approach are unrealistic, it is the most convincing analysis if the conclusion of the study is not qualitatively modified. However, in most cases, the benefit of the new treatment would be annihilated by such an extreme analysis (Unnebrink and Windeler, 1999). In the case of a single binary measure, Hollis (2002) proposed a simple and attractive method that consists in examining all possible allocations of missing data. In another framework, Copas and Li (1997) used a first-order Taylor expansion to perform a sensitivity analysis around the MAR assumption. However, Skinner (1997) suggested that a better approach would be to estimate the parameter of interest conditionally on the sensitivity parameter.

The strategy previously proposed by Little (1994b) will be used here. It consists of drawing inferences about the parameters of interest under a range of plausible values for a sensitivity parameter, i.e. under different assumptions regarding the missing data mechanism. This strategy has been widely developed for sensitivity analyses (see for example Rotnitzky et al., 1998, 2001; Scharfstein et al., 1999; Birmingham et al., 2003). These methods deal mainly with quantitative data subject to dropout, the comparison being restricted to the value measured at the end of the study. Here, we will consider longitudinal binary responses with nonmonotone missing data, all measurements being considered as equally valuable.
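The best-case/worst-case analysis described above can be sketched in a few lines for a single binary endpoint. This is only an illustration: the function names and the counts are hypothetical and are not taken from the paper or from the asthma trial.

```python
def odds_ratio(succ_a, n_a, succ_b, n_b):
    """Odds-ratio of a positive response, arm A versus arm B."""
    return (succ_a / (n_a - succ_a)) / (succ_b / (n_b - succ_b))


def extreme_case_analysis(exp_arm, ctl_arm):
    """Each arm is a dict with counts of 'success', 'failure' and 'missing'.

    Worst case for the experimental arm: its missing responses are counted as
    failures and those of the control arm as successes. Best case: the reverse.
    'available' uses the observed responses only.
    """
    n_exp = exp_arm["success"] + exp_arm["failure"] + exp_arm["missing"]
    n_ctl = ctl_arm["success"] + ctl_arm["failure"] + ctl_arm["missing"]
    worst = odds_ratio(exp_arm["success"], n_exp,
                       ctl_arm["success"] + ctl_arm["missing"], n_ctl)
    best = odds_ratio(exp_arm["success"] + exp_arm["missing"], n_exp,
                      ctl_arm["success"], n_ctl)
    available = odds_ratio(exp_arm["success"], exp_arm["success"] + exp_arm["failure"],
                           ctl_arm["success"], ctl_arm["success"] + ctl_arm["failure"])
    return {"worst": worst, "available": available, "best": best}


# Hypothetical counts, for illustration only.
experimental = {"success": 60, "failure": 30, "missing": 10}
control = {"success": 45, "failure": 45, "missing": 10}
bounds = extreme_case_analysis(experimental, control)
print({name: round(value, 2) for name, value in bounds.items()})
```

The gap between the two extreme odds-ratios widens with the proportion of missing data in each arm, which is why such an analysis so often removes the apparent benefit of the new treatment.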
A joint model for the response process and the missing data process, based on a log linear model, is proposed in Section 2. A sensitivity parameter is introduced that represents the relationship between the response process and the missing data process. An important feature of this model is that it does not require a monotone missing data structure. In Section 3, it is shown that although the sensitivity parameter is identifiable, its estimation is highly questionable. A far better approach is to consider a range of plausible values, and to estimate the parameters of interest conditionally upon these plausible values. When the objective of the study is to describe the association between explanatory variables and the response of interest, the log linear model introduced in Section 2 may not be satisfactory. In this case, it is proposed in Section 4 to perform multiple imputations of the missing data, and to analyse the completed data using the multiple imputation estimator (Rubin, 1987).
In Section 5, the method is applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma.

2. NOTATION AND DISTRIBUTIONAL ASSUMPTIONS

2.1 Modelling the response and missing data processes

We assume that N subjects are to be observed at n different times. Let $Y = (Y_1, \ldots, Y_n)$ denote the $1 \times n$ vector of complete binary data for a given subject, i.e. the data that would have been observed if no measurement had been missing. Let $M_j$ denote the missing data indicator, with $M_j = 1$ if the jth response is missing and $M_j = 0$ otherwise, and form the $1 \times n$ vector $M = (M_1, \ldots, M_n)$. The joint distribution of Y and M can be expressed using a log linear model (Bishop et al., 1975) as

$$
\log P[Y = y, M = m] = \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \sum_{j,k} \psi_{jk} y_j m_k. \tag{1}
$$

Equation (1) imposes the constraint that third- and higher-order terms equal zero, but it can be sensible in some cases to include additional terms, up to a saturated model. An equivalent model has been proposed by Baker et al. (1992) for contingency tables, and used by Molenberghs et al. (2001) for sensitivity analyses. In this formulation, the interaction terms between Y and M determine the nature of the missing data mechanism, and $\psi_{jk}$ represents the relationship between the (possibly unobserved) value $y_j$ and the missingness of $Y_k$. In this setting, the model is thus ignorable if all $\psi_{jk} = 0$, and nonignorable otherwise. Note that this model assumes that the association between two given responses $Y_j$ and $Y_k$ is independent of the missing data patterns.

Model (1) involves $n^2$ parameters $\psi_{jk}$. A possible way of reducing this number of parameters is to impose the following constraints:

(a) $\psi_{jk} = 0$ for all $j \neq k$
(b) $\psi_{jj} = \psi$ for all $j = 1, \ldots, n$.

Constraint (a) imposes that, conditionally on $Y_j$, missingness at the jth measure is independent of all other $Y_k$.
Constraint (b) imposes that the relationship between the missing data indicator $M_j$ and the response $Y_j$ is constant over time. These two constraints lead to

$$
\log P[Y = y, M = m] = \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \psi \sum_{j=1}^{n} y_j m_j. \tag{2}
$$

An interesting characteristic of this approach is that it can be interpreted in terms of either a pattern-mixture model or a selection model. Let us first consider the pattern-mixture formulation. From (2), it can be shown that

$$
\log P[Y = y \mid M = m] = \mu_0^{(m)} + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \psi \sum_{j=1}^{n} y_j m_j \tag{3}
$$

where $\mu_0^{(m)}$ constrains the $2^n$ probabilities to sum to 1 within each pattern. Model (3) is a typical pattern-mixture model, in which the parameter ψ models the shift between the distributions of Y under different missing data patterns. Furthermore, let us consider the conditional probability of a positive response at any measurement time j, given the other responses and the missingness of this response. We will let $Y_{[j]}$ denote the $1 \times (n-1)$ vector $(Y_1, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_n)$, and define $M_{[j]}$ similarly. It follows that

$$
\operatorname{logit} P[Y_j = 1 \mid Y_{[j]}, M] = \lambda_j + \sum_{k \neq j} \lambda_{jk} y_k + \psi m_j, \tag{4}
$$

with the convention $\lambda_{kj} = \lambda_{jk}$. In words, the odds of a positive response at any measurement time j are multiplied by $e^{\psi}$ for subjects with missing $Y_j$, as compared to subjects with observed $Y_j$. In particular, if ψ < 0, the missing data are associated with poorer responses than the observed data, and conversely. The extreme cases where ψ tends toward $-\infty$ or $+\infty$ correspond respectively to the worst-case or best-case assumptions, in which the probability of a positive response tends toward 0 or 1.

Now, let us consider the selection model formulation. From (2), it can also be shown that

$$
\log P[M = m \mid Y = y] = \nu_0^{(y)} + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k + \psi \sum_{j=1}^{n} y_j m_j \tag{5}
$$

where $\nu_0^{(y)}$ is the corresponding normalizing constant, which in turn implies that

$$
\operatorname{logit} P[M_j = 1 \mid Y, M_{[j]}] = \theta_j + \sum_{k \neq j} \theta_{jk} m_k + \psi y_j, \tag{6}
$$

with the convention $\theta_{kj} = \theta_{jk}$. Thus, the parameter ψ can also be interpreted in the framework of selection models, and can be identified as the selection parameter relating the missing data probability to the (possibly unobserved) associated response.

2.2 Handling covariate information

In most studies, the objective is to assess the effect of explanatory variables X on the response Y, and sometimes also on the missing data process M. Let X denote the $1 \times p$ vector $(X_1, \ldots, X_p)$, each $X_j$ being coded as a dummy variable. A natural extension of model (2) is

$$
\begin{aligned}
\log P[Y = y, M = m, X = x] = {} & \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k \\
& + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j<k} \alpha_{jk} x_j x_k + \sum_{j,k} \beta_{jk} x_j y_k + \sum_{j,k} \gamma_{jk} x_j m_k + \psi \sum_{j=1}^{n} y_j m_j
\end{aligned} \tag{7}
$$

where the parameters $\beta_{jk}$ and $\gamma_{jk}$ represent the effect of the covariates on the response Y and on the missing data process M, respectively. However, for purposes of comparison, it may be reasonable to assume different missing data processes between groups. For example, if the objective of the study is to compare the group $X_1 = 0$ to the group $X_1 = 1$, the following extension may be recommended:

$$
\begin{aligned}
\log P[Y = y, M = m, X = x] = {} & \mu + \sum_{j=1}^{n} \lambda_j y_j + \sum_{j<k} \lambda_{jk} y_j y_k + \sum_{j=1}^{n} \theta_j m_j + \sum_{j<k} \theta_{jk} m_j m_k \\
& + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j<k} \alpha_{jk} x_j x_k + \sum_{j,k} \beta_{jk} x_j y_k + \sum_{j,k} \gamma_{jk} x_j m_k \\
& + \psi_0 (1 - x_1) \sum_{j=1}^{n} y_j m_j + \psi_1 x_1 \sum_{j=1}^{n} y_j m_j.
\end{aligned} \tag{8}
$$

In (8), both elements of $\psi = (\psi_0, \psi_1)$ have the same interpretation as in Section 2.1, whilst allowing the association between the value $y_j$ and the missingness of $Y_j$ to differ between groups.

3. ESTIMATION PROCEDURE

Let φ denote the vector of all the parameters introduced in Section 2, excluding ψ. The full parameter vector is (φ, ψ). In Section 3.1, it will be shown that although the full parameter vector is identifiable, estimation of ψ is highly questionable. A sensitivity analysis approach, estimating φ with fixed ψ, is recommended and presented in Section 3.2.

3.1 Estimated versus chosen ψ

All the parameters in model (2) are identifiable, and thus are estimable using a maximum-likelihood (ML) approach based on a nonlinear optimization routine (as in Diggle and Kenward, 1994, who used a simplex algorithm). An alternative is the maximization of the log-likelihood via the EM algorithm (Dempster et al., 1977); see for example Baker et al. (1992). However, as with other models for incomplete longitudinal data, the estimation of ψ is highly sensitive to assumptions that cannot be assessed from the observed data (Rubin, 1994; Little, 1994a, 1995; Molenberghs et al., 2001).

Furthermore, even when the assumptions of the model are correct, the estimation of ψ can be problematic. One of the main reasons is that the partial derivative of the log-likelihood with respect to ψ tends to 0 when ψ tends to infinity. Thus, the profile log-likelihood is flat for very large positive or negative values of ψ (similar problems were encountered by Copas and Li, 1997, in a different framework). In some cases, the likelihood will be maximized when ψ tends to $\pm\infty$.
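As a numerical check on the interpretation of ψ given in Section 2.1, the sketch below builds the cell probabilities of model (2) for a small number of occasions and verifies the shifts implied by equations (4) and (6). The function names and all parameter values are hypothetical; occasions are indexed from 0 in the code, whereas the text indexes them from 1.

```python
from itertools import product
from math import exp, log


def joint_model2(lam, lam_pair, theta, theta_pair, psi):
    """Cell probabilities P[Y = y, M = m] under model (2).

    lam, theta: main effects, one per occasion.
    lam_pair, theta_pair: pairwise terms keyed by (j, k) with j < k.
    """
    n = len(lam)
    weights = {}
    for y in product((0, 1), repeat=n):
        for m in product((0, 1), repeat=n):
            eta = sum(lam[j] * y[j] + theta[j] * m[j] + psi * y[j] * m[j]
                      for j in range(n))
            eta += sum(v * y[j] * y[k] for (j, k), v in lam_pair.items())
            eta += sum(v * m[j] * m[k] for (j, k), v in theta_pair.items())
            weights[y, m] = exp(eta)
    total = sum(weights.values())  # plays the role of exp(-mu)
    return {cell: w / total for cell, w in weights.items()}


def logit_y(p, j, y, m):
    """logit P[Y_j = 1 | other responses, M = m]; the value of y[j] is ignored."""
    y1 = y[:j] + (1,) + y[j + 1:]
    y0 = y[:j] + (0,) + y[j + 1:]
    return log(p[y1, m] / p[y0, m])


def logit_m(p, j, y, m):
    """logit P[M_j = 1 | Y = y, other indicators]; the value of m[j] is ignored."""
    m1 = m[:j] + (1,) + m[j + 1:]
    m0 = m[:j] + (0,) + m[j + 1:]
    return log(p[y, m1] / p[y, m0])


# Hypothetical parameter values for n = 3 occasions (not taken from the paper).
psi = -0.5
p = joint_model2(
    lam=[0.2, -0.1, 0.4],
    lam_pair={(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.8},
    theta=[-1.5, -1.2, -1.0],
    theta_pair={(0, 1): 0.7, (0, 2): 0.3, (1, 2): 0.9},
    psi=psi,
)

# Equation (4): switching m_j from 0 to 1 shifts the conditional logit of Y_j by psi.
shift_y = logit_y(p, 1, (1, 0, 0), (0, 1, 1)) - logit_y(p, 1, (1, 0, 0), (0, 0, 1))
# Equation (6): switching y_j from 0 to 1 shifts the conditional logit of M_j by psi.
shift_m = logit_m(p, 1, (1, 1, 0), (0, 0, 1)) - logit_m(p, 1, (1, 0, 0), (0, 0, 1))
print(shift_y, shift_m)  # both agree with psi up to rounding error
```

Because the same ψ multiplies every product $y_j m_j$, the two shifts coincide, which is the sense in which the parameter has both a pattern-mixture and a selection reading.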
When the likelihood is maximized only as ψ tends to $\pm\infty$, the ML estimate of ψ is undefined; we will refer to this as an infinite estimate. In many other cases, even when a finite ML estimate for ψ exists, the associated ML confidence interval can be $]-\infty, +\infty[$.

To illustrate this point, simulations were performed in the simple case of two measurements per subject with no covariate, as in model (2). The parameters of the simulations were chosen to obtain $P[Y_1 = 1] = P[Y_2 = 1] = 0.5$, with a log odds-ratio measuring the association between $Y_1$ and $Y_2$ of 1. For simplicity, an identical and independent missing data process was generated: $P[M_1, M_2] = P[M_1] P[M_2]$, with $P[M_1 = 1] = P[M_2 = 1]$. The proportion of missing data was fixed at 10%, 20%, 30%, 40% and 50%. Finally, ψ was set at 0 to investigate the properties of the estimation procedure when missing data are actually ignorable.

We were interested in the proportion of simulations that led to an infinite estimate of ψ, and in the proportion that led to an infinite or semi-infinite 95% ML confidence interval. A ML confidence interval was defined as infinite if neither limit of the log-likelihood, when ψ tends to $-\infty$ or to $+\infty$, was significantly lower than the maximized log-likelihood. It was defined as semi-infinite if only one limit was not significantly lower than the maximized log-likelihood. This implied that when an infinite estimate was obtained, the associated ML confidence interval was at least semi-infinite. Various numbers of subjects were considered, including 50, 100 and 500. In each situation, simulated data sets were generated. Results of these simulations are presented in Table 1.

Table 1. Percentage of the simulations that led to an infinite estimate of ψ, or to an infinite or semi-infinite ML 95% confidence interval
Number of subjects | % missing data | Infinite estimate | Infinite 95% CI | Semi-infinite 95% CI

These simulations showed that although ψ was identifiable, its estimation using a ML approach was not reliable, even if the assumptions of the model were actually correct. For a small or moderate number of subjects (50 or 100), infinite estimates of ψ were frequent. Even when a finite estimate was obtained, the associated ML confidence interval was nearly always infinite or semi-infinite. In these cases, the observed data provided very little information about ψ.

Thus, a sensitivity analysis appears to be a far better approach. A sensitivity analysis can be performed using different fixed values for ψ, and examining to what extent these values influence the estimation of the other parameters. Following Kenward et al. (2001), ψ will be termed the sensitivity parameter and φ the estimable parameter.

3.2 Estimation of φ for a fixed ψ

Given a fixed value for the sensitivity parameter, the estimable parameter can be estimated via an EM algorithm. This will result in an estimate $\hat{\phi}(\psi)$.
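The data-generating design of the simulations in Section 3.1 can be sketched as follows. Only the generation of one incomplete data set is shown, not the ML fit or the confidence interval classification; the function names, the seed and the sample size are arbitrary choices made for this illustration.

```python
import random
from math import exp, log


def response_distribution(log_or=1.0):
    """Law of (Y1, Y2) with P[Y1 = 1] = P[Y2 = 1] = 0.5 and the given log odds-ratio."""
    r = exp(log_or / 2.0)  # with equal margins of 0.5, p11 / p10 = sqrt(odds-ratio)
    p11 = 0.5 * r / (1.0 + r)
    p10 = 0.5 - p11
    return {(1, 1): p11, (1, 0): p10, (0, 1): p10, (0, 0): p11}


def simulate(n_subjects, p_missing, rng):
    """Draw responses, then blank each one independently with probability p_missing.

    Missingness ignores the response values, which corresponds to psi = 0.
    A missing response is coded as None.
    """
    dist = response_distribution()
    cells = list(dist)
    weights = [dist[c] for c in cells]
    data = []
    for _ in range(n_subjects):
        y = rng.choices(cells, weights=weights)[0]
        data.append(tuple(None if rng.random() < p_missing else yj for yj in y))
    return data


rng = random.Random(2004)  # arbitrary seed, for reproducibility only
sample = simulate(n_subjects=100, p_missing=0.3, rng=rng)
n_complete = sum(None not in row for row in sample)
print(n_complete, "subjects out of", len(sample), "have both responses observed")
```

Since each response is blanked independently, both monotone and nonmonotone patterns appear in the generated data, as in the setting studied in the paper.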
The minimal sufficient statistics of model (8) are the $2^{p+2n}$ numbers of subjects $n_{x,m,y}$ for each possible vector (x, m, y). Splitting y into its observed and missing components $(y_{obs}, y_{miss})$, the observed statistics are the $2^p 3^n$ numbers of subjects $n_{x,m,y_{obs}}$ (each $y_j$ being either 0, 1 or missing).

The E-step provides the expectation of the minimal sufficient statistics, given the observed data, the current estimate $\phi^{[t]}$ and the fixed value ψ:

$$
\begin{aligned}
E\left[n_{x,m,y} \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi\right]
&= n_{x,m,y_{obs}} \, P\left[Y_{miss} = y_{miss} \mid Y_{obs} = y_{obs}, M = m, X = x; \phi^{[t]}, \psi\right] \\
&= n_{x,m,y_{obs}} \, \frac{P\left[Y = (y_{obs}, y_{miss}), M = m, X = x \mid \phi^{[t]}, \psi\right]}{\sum_{y_{miss}} P\left[Y = (y_{obs}, y_{miss}), M = m, X = x \mid \phi^{[t]}, \psi\right]}.
\end{aligned}
$$

The M-step provides $\phi^{[t+1]}$, the updated estimate of the parameter. From model (7), the expected log-likelihood, viewed as a function of φ, is

$$
E\left[L(\phi; n_{x,m,y}) \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi\right] = \sum_{x,m,y} E\left[n_{x,m,y} \mid n_{x,m,y_{obs}}; \phi^{[t]}, \psi\right] \log P\left[Y = y, M = m, X = x \mid \phi, \psi\right].
$$

The M-step involves a numerical maximization, which can be performed using standard software for generalized linear models such as SAS proc genmod (SAS Institute Inc., 1999), assuming a Poisson distribution and a log link, with $\log(n) + \psi \sum_{j=1}^{n} y_j m_j$ as an offset variable.

After convergence of the EM algorithm, $\hat{\phi}(\psi)$, the ML estimate of φ conditional upon the value of ψ, is obtained. Its variance can be obtained by numerical differentiation methods (Jamshidian and Jennrich, 2000). Given a set of plausible values for ψ, denoted by $\Psi$, one can obtain the region of ignorance for φ, defined to be the set of all possible point estimates $\hat{\phi}(\psi)$ for $\psi \in \Psi$. The $(1 - \alpha)100\%$ region of uncertainty can be constructed around the region of ignorance, in the spirit of a confidence region.

The estimated probabilities $P[Y = y, M = m, X = x \mid \hat{\phi}(\psi), \psi]$ can be derived from $\hat{\phi}(\psi)$ using (8), and the estimated probabilities $P[Y = y, X = x \mid \hat{\phi}(\psi), \psi]$ are then obtained by summation over the missing data patterns.
Finally, the probabilities $P[Y = y \mid X = x; \hat{\phi}(\psi), \psi]$ are derived by conditioning. This conditional distribution represents the basis of the inferences of interest, as it describes how the probability of a positive response changes with different levels of the covariates. It may answer a wide range of questions addressed by the study, for example, how different levels of the covariates affect the time-by-time probabilities of success, the probability that all the measurements are successes, or the probability of observing at least k successes out of n measurements.

However, the log linear model described in Section 2 may not be the preferred approach for describing the association between the response Y and the explanatory variables X. In particular, the parameters $\beta_{jk}$ introduced in models (7) and (8) describe the association between $X_j$ and $Y_k$, conditional on all the other variables of the model. A marginal model, whose parameter estimation is based on generalized estimating equations (Diggle et al., 2002), could be preferable in this context. In this case, an attractive approach is to use the estimated probabilities $P[Y = y, M = m, X = x \mid \hat{\phi}(\psi), \psi]$ to perform multiple imputation (Rubin, 1987).
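The EM iteration of Section 3.2 can be sketched for the simplest setting: two occasions, no covariates, ψ held fixed. This is a minimal sketch under stated assumptions, not the authors' SAS implementation. For n = 2, model (2) has the form $P[y, m] \propto a(y)\, b(m) \exp(\psi \sum_j y_j m_j)$, so the M-step is carried out here by iterative proportional fitting (IPF) started from the table $\exp(\psi \sum_j y_j m_j)$, a technique swapped in for the Poisson regression with offset described in the text; the two should give the same fitted table. The counts are hypothetical, and the code assumes that every missing data pattern occurs at least once.

```python
from itertools import product
from math import exp, log

Y_CELLS = list(product((0, 1), repeat=2))
M_CELLS = list(product((0, 1), repeat=2))


def pattern(obs):
    """Missing data indicators: m_j = 1 when the j-th response is missing (None)."""
    return tuple(1 if o is None else 0 for o in obs)


def compatible(y, obs):
    """True when the complete vector y agrees with every observed component."""
    return all(o is None or o == yj for yj, o in zip(y, obs))


def e_step(counts, p):
    """Spread each observed count over the complete cells it is compatible with."""
    expected = {(y, m): 0.0 for y in Y_CELLS for m in M_CELLS}
    for obs, n_obs in counts.items():
        m = pattern(obs)
        ys = [y for y in Y_CELLS if compatible(y, obs)]
        denom = sum(p[y, m] for y in ys)
        for y in ys:
            expected[y, m] += n_obs * p[y, m] / denom
    return expected


def m_step(expected, psi, sweeps=50):
    """Fit P[y, m] proportional to a(y) b(m) exp(psi * sum_j y_j m_j) by IPF.

    The starting table carries the fixed psi term (the offset); rescaling to
    the expected Y and M margins leaves that term unchanged.
    """
    total = sum(expected.values())
    y_margin = {y: sum(expected[y, m] for m in M_CELLS) / total for y in Y_CELLS}
    m_margin = {m: sum(expected[y, m] for y in Y_CELLS) / total for m in M_CELLS}
    p = {(y, m): exp(psi * sum(a * b for a, b in zip(y, m)))
         for y in Y_CELLS for m in M_CELLS}
    for _ in range(sweeps):
        for y in Y_CELLS:
            s = sum(p[y, m] for m in M_CELLS)
            for m in M_CELLS:
                p[y, m] *= y_margin[y] / s if s > 0 else 0.0
        for m in M_CELLS:
            s = sum(p[y, m] for y in Y_CELLS)
            for y in Y_CELLS:
                p[y, m] *= m_margin[m] / s if s > 0 else 0.0
    return p


def observed_loglik(counts, p):
    """Log-likelihood of the observed counts, summing over the missing responses."""
    return sum(n_obs * log(sum(p[y, pattern(obs)] for y in Y_CELLS if compatible(y, obs)))
               for obs, n_obs in counts.items())


def em_fixed_psi(counts, psi, iterations=100):
    """Alternate E- and M-steps with psi held fixed; returns fit and log-likelihood trace."""
    p = {(y, m): 1.0 / 16 for y in Y_CELLS for m in M_CELLS}
    trace = []
    for _ in range(iterations):
        p = m_step(e_step(counts, p), psi)
        trace.append(observed_loglik(counts, p))
    return p, trace


# Hypothetical counts of subjects by observed response pair (None = missing).
counts = {(1, 1): 30, (1, 0): 12, (0, 1): 10, (0, 0): 25,
          (1, None): 6, (0, None): 7, (None, 1): 5, (None, 0): 6, (None, None): 4}

for psi in (-1.0, 0.0, 1.0):
    p, trace = em_fixed_psi(counts, psi)
    prob_y1 = sum(p[y, m] for y in Y_CELLS for m in M_CELLS if y[0] == 1)
    print(f"psi = {psi:+.1f}: estimated P[Y1 = 1] = {prob_y1:.3f}")
```

Summing the fitted table over the missing data patterns, as in the last loop, gives the estimated distribution of the responses for each fixed ψ; repeating the fit over a grid of ψ values gives the point estimates that make up a region of ignorance.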
4. MULTIPLE IMPUTATION

The procedure for generating proper multiple imputations is the following. First, the parameters of the model are drawn; then the missing data are drawn conditionally on the observed values and the drawn parameters (Rubin, 1987, Sections 4.3 and 4.4). Although most easily understood using Bayesian concepts, a likelihood-based treatment is equally possible (Verbeke and Molenberghs, 2000, Section 20.3). This would consist in first drawing a value, say $\phi^*$, from the asymptotic distribution of $\hat{\phi}(\psi)$, then drawing $Y_{miss}$ from the conditional distribution $P[Y_{miss} \mid Y_{obs}, X, M; \phi^*, \psi]$.

The general procedure for conducting a sensitivity analysis can then be summarized as follows:

1. Determine a model for the complete data, $P[Y = y, M = m, X = x \mid \phi, \psi]$, as detailed in Section 2.
2. Choose a value for ψ.
3. Compute $\hat{\phi}(\psi)$, the ML estimate of φ given ψ, and its variance matrix $\mathrm{Var}(\hat{\phi}(\psi))$, as detailed in Section 3.2.
4. Draw a value, say $\phi^*$, from the asymptotic distribution $N(\hat{\phi}(\psi), \mathrm{Var}(\hat{\phi}(\psi)))$.
5. Draw $Y_{miss}$ from the conditional distribution $P[Y_{miss} \mid Y_{obs}, X, M; \phi^*, \psi]$.
6. Repeat steps 4 and 5 $I \geq 2$ times to obtain I completed data sets.
7. Analyse the I completed data sets using the appropriate statistical method.
8. Combine the I estimates into a single estimate.
9. Repeat steps 2 to 8 with a different value of ψ.

One of the main advantages of this procedure is that any method may be used to analyse the completed data sets. The I completed-data estimates are then easily combined into a single one. This inference is valid under the model described in Section 2. Considering several values for ψ allows us to display inferences about the parameters of primary interest under a range of assumptions concerning the missing data process. A SAS macro and an example can be found on the web site at ~u472/equipes/biostatistique/savoir.html

5. EXAMPLE

The proposed method was applied to a randomized clinical trial comparing the efficacy of two treatment regimens in patients with persistent asthma (Grosclaude and Desfougeres, 2003). Patients whose asthma was insufficiently controlled by inhaled corticosteroids alone were included. After a two-week run-in period, patients were randomized to receive either an inhaled corticosteroid / long-acting β2-agonist combination (referred to below as the study treatment) or an inhaled corticosteroid / leukotriene-antagonist association (referred to below as the reference treatment). During the 12-week treatment period, the subjects filled in a diary record card. Asthma control was assessed on a weekly basis, and was defined according to the criteria given in Table 2. Weekly data were then aggregated into monthly data, each patient being considered as controlled for a given month if his/her asthma was controlled for at least three weeks out of four.

A total of 246 subjects were randomized, 119 in the study treatment group and 127 in the reference treatment group. The distribution of missing data is presented in Table 3. Overall, 69% of subjects had available asthma control data at each of the three months, complete data being more frequent in the study treatment group (75%) than in the reference group (64%). It is noteworthy that missing data predominantly occurred in the reference treatment group, which eventually proved to be the less effective. Thus, data are not likely to be missing completely at random. Among incomplete observations, monotone missing data was the most common structure (24%), but some patients (7%) also presented a nonmonotone missing data structure.
Table 2. Definition of asthma control

Presence of at least two of the three following criteria:
- morning peak expiratory flow rate ≥ 80% of the predicted value every day
- no more than 2 days of inhaled short-acting bronchodilator use, up to a maximum of 4 occasions per week
- no more than two days with asthma symptoms

Presence of all the following criteria:
- no night-time awakening due to asthma
- no treatment-related adverse event leading to permanent discontinuation of treatment
- no unscheduled medical visit or hospitalization for asthma
- no exacerbation

Table 3. Summary of missing data: numbers of subjects in each missing data pattern
M1 | M2 | M3 | Study treatment (N = 119) | Reference treatment (N = 127) | Total (N = 246)

Model (8) was used for imputation. The nature of the missing data mechanism was allowed to differ between the two treatment groups by including two values $\psi_0$ and $\psi_1$ in the model. Besides treatment, two covariates were included in the model, sex and smoking status, which were a priori considered as potentially related to asthma control. These covariates were also considered as having a potential effect on the missing data process (e.g. smokers having a higher rate of missing data than non-smokers), but in contrast to treatment, they were not assumed to modify the sensitivity parameter.

The first step of the sensitivity analysis consists in determining a set of plausible values for the sensitivity parameter ψ in each treatment group. As ψ can be interpreted as the logarithm of the odds-ratio measuring the association between a response and its observation, this parameter has a statistical and a clinical interpretation and should be given reasonable values. However, in the scope of a sensitivity analysis, one can also consider unrealistic and even extreme assumptions, as long as their effect is interpreted cautiously. It seems natural to assume that missing data were associated with a poor response.
But it is also plausible that well-controlled subjects may more often forget to complete their diary record card. Thus, both negative and positive values for ψ should be considered. Finally, in this study, a range of values of ψ within ±1 could be considered as sufficiently wide, as it would allow the odds of asthma control to be up to 2.7 times larger or smaller for missing data than for observed data.

The aim of the study was to compare the mean evolution of the rate of asthma control between the two groups. This analysis included adjustments for sex and smoking status (smoker versus non-smoker). After estimation of the model as described in Section 3.2, multiple completed data sets were generated using the method described in Section 4. Each completed data set was then analysed via a marginal model, estimated by generalized estimating equations (see for example Xie and Paik, 1997). This marginal model could be written as

$$
\operatorname{logit} P[Y_j = 1 \mid X = x] = \mu_j + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \tag{9}
$$

where $x_1$, $x_2$ and $x_3$ represent the treatment, sex and smoking status indicators respectively. Thus, the effect of time was modelled qualitatively and the effect of treatment was assumed to be the same at each time of measurement. This was justified by the absence of a significant treatment-by-time interaction. Finally, an unstructured matrix was used for the modelling of the within-subject correlations.

Fig. 1. Estimation of monthly probability of asthma control in each treatment group, according to different values of ψ. Panels: Reference Treatment, Study Treatment. Axes: Time (month) and Probability of asthma control. Curves: ψ = 0, ψ = ±1, ψ = ±∞.

For a better interpretation of the results of the sensitivity analysis, a preliminary approach is to estimate the probabilities $P[Y_j = 1 \mid X = x]$, i.e. the marginal probabilities of asthma control at each month, within each treatment group, after adjustment for covariates. These probabilities are estimated for different values of ψ and are displayed in Figure 1. As missing data were more frequent in the reference group, the effect of ψ was larger in this group than in the study treatment group, especially for the third month. Finally, the values $-\infty$ and $+\infty$ for ψ, corresponding to the worst and best case respectively, delimit the range of all possible values for the estimated probabilities of asthma control.

The results of this sensitivity analysis are presented in Figure 2. In the upper part of these graphs, each curve joins combinations of ψ in both treatment groups for which the same estimate of the treatment effect (in terms of the odds-ratio between treatment groups) is obtained.
In the lower part, pairs of values of ψ with the same level of statistical significance are displayed. Inside the dotted box are reasonable values for ψ in both treatment groups, i.e. ψ within ±1. Outside of this box are more extreme values.

Assuming that missing data were ignorable (ψ = 0) in both groups, a better asthma control was observed in the study treatment group, with an odds-ratio of 2.15 (p < 0.001). In the range of values of ψ within ±1 in both groups, the treatment effect remained relatively stable, and a significant difference was still observed. The interval of ignorance was [1.56, 2.84]. Its associated 95% interval of uncertainty was [1.02, 4.37], which was entirely above 1. Only assumptions outside the range of plausible values could yield a nonsignificant result, in particular a strong nonignorable mechanism of opposite sign in the two groups, greatly disfavouring the study treatment group. For example, the best-case/worst-case analysis, which corresponds to ψ = $-\infty$ in the study treatment group and ψ = $+\infty$ in the reference group, led to near equality between the two groups (OR = 1.00) but did not reverse the conclusion of the study.

It should be recalled that in this study, both groups received an active treatment, registered for asthma. Their safety profiles were comparable. Therefore, there seems to be no reason for assuming such a dramatic difference between their missing data processes.
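The way an interval of ignorance and an interval of uncertainty arise from a grid of ψ values can be sketched as follows. Each setting of ψ yields I completed-data estimates, which are pooled with Rubin's (1987) rules (step 8 of the procedure in Section 4). The construction of the interval of uncertainty as the envelope of the pointwise intervals, and the use of a normal quantile rather than Rubin's t reference distribution, are simplifying assumptions of this sketch and not necessarily the construction used for the results above. All numbers are hypothetical.

```python
from math import exp, sqrt


def rubin_combine(estimates, variances):
    """Pool I completed-data estimates and variances with Rubin's (1987) rules."""
    i = len(estimates)
    q_bar = sum(estimates) / i
    within = sum(variances) / i
    between = sum((q - q_bar) ** 2 for q in estimates) / (i - 1)
    return q_bar, within + (1.0 + 1.0 / i) * between


def ignorance_and_uncertainty(pooled, z=1.96):
    """pooled maps each psi setting to (pooled estimate, total variance).

    Interval of ignorance: range of the point estimates over the psi settings.
    Interval of uncertainty: here, the envelope of the pointwise intervals.
    """
    points = [est for est, _ in pooled.values()]
    lowers = [est - z * sqrt(var) for est, var in pooled.values()]
    uppers = [est + z * sqrt(var) for est, var in pooled.values()]
    return (min(points), max(points)), (min(lowers), max(uppers))


# Hypothetical log odds-ratios and variances from I = 3 imputations per setting.
# Keys are (psi in reference group, psi in study group); values are illustrative only.
completed = {
    (-1.0, -1.0): ([0.70, 0.78, 0.74], [0.060, 0.064, 0.062]),
    (0.0, 0.0): ([0.75, 0.80, 0.76], [0.050, 0.052, 0.051]),
    (1.0, -1.0): ([0.40, 0.48, 0.44], [0.058, 0.060, 0.062]),
}
pooled = {key: rubin_combine(est, var) for key, (est, var) in completed.items()}
ignorance, uncertainty = ignorance_and_uncertainty(pooled)
print("interval of ignorance (odds-ratio scale):", [round(exp(v), 2) for v in ignorance])
print("interval of uncertainty (odds-ratio scale):", [round(exp(v), 2) for v in uncertainty])
```

By construction the interval of uncertainty contains the interval of ignorance: the former reflects both sampling imprecision and the range of assumptions about ψ, the latter only the range of assumptions.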
Fig. 2. Results of the sensitivity analysis: contour plot of the estimation of the treatment effect and statistical significance, for different values of ψ in each treatment group. Upper panel: Estimation of Treatment Effect (Odds-ratio); lower panel: Statistical Significance of the Treatment Effect. Axes: ψ in Reference Treatment and ψ in Study Treatment.

In conclusion, the sensitivity analysis showed that the superiority of the study treatment was robust to slight to moderate departures from the ignorability assumption. In this study, missing data should therefore not be considered to be a serious source of concern.

6. DISCUSSION

The fundamental impossibility of performing any valid inference without strong and untestable assumptions justifies the choice of a sensitivity analysis: inferences under a range of plausible or extreme assumptions rather than a single inference. However, unless the treatment effect is very strong or the rate of missing data very low, it seems obvious that the most extreme assumptions will at least invalidate the statistical significance, and sometimes reverse the conclusion of a study (see examples provided by Rotnitzky et al., 1998, 2001; Scharfstein et al., 1999; Hollis, 2002; Birmingham et al., 2003). Our opinion is that one should not expect a conclusion to resist all assumptions about missing data, but only to remain stable under mild or moderate assumptions. In the example described in this paper, values of ψ within ±1 could be considered as a reasonable compromise between the risk of neglecting plausible assumptions and that of considering unrealistic assumptions. In this case, the use of intervals of ignorance and intervals of uncertainty is valuable.

The log linear model introduced in Section 2 may be debatable in some contexts, for example when the number of measurements is variable.
However, in clinical trials the number of planned observations is usually fixed, even if the number of responses actually observed may vary across subjects because of missing data. The conditional character of this model also limits its applicability, and a marginal model is often preferred for the analysis. But that does not preclude its usefulness for the imputation of the missing data. Indeed, in a marginal framework, the conditional distribution of missing data, given the observed data and the missing data structure, is often very complex. This conditional distribution is fundamental for imputation. In the log linear model, on the other hand, this conditional distribution is easily derived. Completing the data under various nonignorable assumptions, and then analysing them using a possibly different model, is in the spirit of multiple imputation (Schafer, 1999). It raises the classical issue of the difference between the imputer's model and the analyst's model (Rubin, 1996; Schafer, 1997). However, in this context, the imputer and the analyst would generally be the same person. Thus, the assumptions made for completing and analysing the data (concerning the time dependence, within-subject correlations, effect of covariates) should be similar in the log linear model and the marginal model. An alternative to multiple imputation would be to perform a weighted analysis using the estimated probabilities P[M | X, Y], for example weighted GEE (Robins et al., 1995). However, note that some widely used procedures such as SAS PROC GENMOD allow the introduction of known weights, but do not take into account their imprecision when they are estimated, and thus underestimate the variance of the weighted estimate. Imputations are generated under a nonignorable model, and the inferences performed are valid under the assumed mechanism for missing data. One may legitimately wonder to what extent the conclusions of the analysis depend on the model assumptions.
Clearly, the adequacy of the model to the observations has to be checked using diagnostic tools, but one cannot expect the examination of the available residuals or influence measures to shed light on the choice of ψ. This model rests on several assumptions. First, the association between any pair of responses was assumed not to depend on the missing data pattern. Second, it was assumed that, conditionally on $Y_j$, the missingness of $Y_j$ was independent of all other $Y_k$. Finally, it was assumed that ψ was constant over time. These assumptions cannot be checked, and thus may raise doubts about the robustness of the method. The method for conducting a sensitivity analysis proposed in this paper can be extended by relaxing one or all of these assumptions, but at the cost of a higher-dimensional analysis and a loss of simplicity and interpretability. Conversely, a simple model with a parameter ψ fixed in each treatment group will generally cover all the situations of practical interest. If the conclusion of the study is robust over a range of values for ψ in each group, one can expect that it would also resist a more complex model. The flexibility of the log linear model allows us to consider alternative assumptions. For instance, in some contexts it may be more plausible for missingness to depend on variables other than the variable subject to missingness itself. In this case, from model (1), one would assume that $\psi_{jj} = 0$ and derive a different model. Note that this would not necessarily lead to an ignorable model, since missingness could be assumed to depend on the previous response, which is also possibly missing. Other possible extensions may include the handling of the cause of missing data. In clinical trials, these reasons are often investigated.
In some cases, the missingness is likely to be unrelated to the unobserved response, for example refusal to continue for personal convenience or the technical impossibility of performing the measurement. In other cases, one would not assume that an unobserved value exists behind the missing values, especially when dropout corresponds to the death of the subject. In these cases, a simple extension of the proposed method would consist in imputing only those missing values considered as potentially related to an existing unobserved response. A drawback of this method is the computational burden, which increases exponentially with the number of measurements per subject ($n$) and the number of explanatory variables ($p$). In particular, the EM algorithm involves $2^{2n+p}$ sufficient statistics, so that with our computer (a Pentium III 1.4 GHz), this method could not be applied with $n \geq 10$.
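The growth in computational burden described above can be made concrete with a few lines of Python. This is a minimal sketch that reads the count of sufficient statistics as 2^(2n+p), i.e. one cell per combination of n binary responses, n binary missingness indicators and p explanatory variables assumed here to be binary; the function name `n_cells` and the choice p = 1 are illustrative assumptions, not taken from the paper.

```python
def n_cells(n_occasions: int, n_covariates: int) -> int:
    """Number of cells 2**(2n + p) in the full contingency table.

    Assumes n binary responses, n binary missingness indicators and
    p binary explanatory variables (an assumption made for this sketch).
    """
    if n_occasions < 1 or n_covariates < 0:
        raise ValueError("need n_occasions >= 1 and n_covariates >= 0")
    return 2 ** (2 * n_occasions + n_covariates)


# Illustrative table for a single binary covariate (p = 1).
for n in range(2, 11, 2):
    print(f"n = {n:2d}: {n_cells(n, 1):>9,d} cells")
```

Each additional measurement occasion multiplies the number of cells by four, because it adds both a response and a missingness indicator.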
ACKNOWLEDGEMENTS

The authors are grateful to Florence Casset-Semanaz for her constant support in this research work. We are also grateful to two anonymous reviewers for their insightful comments that greatly contributed to improving the quality of this paper. Minini's research was partially supported by the French Association Nationale de la Recherche Technique, Convention CIFRE 707/2000.

REFERENCES

BAKER, S. G., ROSENBERGER, W. F. AND DERSIMONIAN, R. (1992). Closed-form estimates for missing counts in two-way contingency tables. Statistics in Medicine 11.
BIRMINGHAM, J., ROTNITZKY, A. AND FITZMAURICE, G. M. (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B 65.
BISHOP, Y. M. M., FIENBERG, S. E. AND HOLLAND, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (2001). Points to consider on missing data, CPMP/EWP/1776/99. European Agency for the Evaluation of Medicinal Products, London.
COPAS, J. B. AND LI, H. G. (1997). Inference for nonrandom samples (with discussion). Journal of the Royal Statistical Society, Series B 59.
DEMPSTER, A. P., LAIRD, N. M. AND RUBIN, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39.
DIGGLE, P. J. AND KENWARD, M. G. (1994). Informative dropout in longitudinal data analysis (with discussion). Applied Statistics 43.
DIGGLE, P. J., HEAGERTY, P., LIANG, K. Y. AND ZEGER, S. L. (2002). Analysis of Longitudinal Data, 2nd edition. Oxford: Oxford University Press.
FITZMAURICE, G. M. AND LAIRD, N. M. (2000). Generalized linear mixture models for handling nonignorable dropouts in longitudinal studies. Biostatistics 1.
FOLLMANN, D. AND WU, M. (1995). An approximate generalized linear model with random effects for informative missing data. Biometrics 51.
GROSCLAUDE, M. AND DESFOUGERES, J. L. (2003). Fluticasone/Salmeterol (FP/S) est plus efficace que l'association beclomethasone-montelukast (BDPM). Revue de Maladies Respiratoires 20, 1S74–1S74.
HOLLIS, S. (2002). A graphical sensitivity analysis for clinical trials with nonignorable missing binary outcome. Statistics in Medicine 21.
IBRAHIM, J., CHEN, M. H. AND LIPSITZ, S. R. (2001). Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika 88.
JAMSHIDIAN, M. AND JENNRICH, R. I. (2000). Standard errors for EM estimation. Journal of the Royal Statistical Society, Series B 62.
KENWARD, M. G., GOETGHEBEUR, J. T. AND MOLENBERGHS, G. (2001). Sensitivity analysis for incomplete categorical data. Statistical Modelling 1.
LIPSITZ, S. R., MOLENBERGHS, G., FITZMAURICE, G. M. AND IBRAHIM, J. (2000). GEE with Gaussian estimation of the correlations when data are incomplete. Biometrics 56.
LITTLE, R. J. A. (1994). Discussion to Diggle and Kenward: Informative dropout in longitudinal data analysis. Applied Statistics 43, 78.
LITTLE, R. J. A. (1994). A class of pattern-mixture models for normal incomplete data. Biometrika 81.
LITTLE, R. J. A. (1995). Modelling the dropout mechanism in repeated measures studies. Journal of the American Statistical Association 90.
LITTLE, R. J. A. AND RUBIN, D. B. (1987). Statistical Analysis with Missing Data. New York: Wiley.
MOLENBERGHS, G., KENWARD, M. G. AND GOETGHEBEUR, E. (2001). Sensitivity analysis for incomplete contingency tables: the Slovenian plebiscite case. Applied Statistics 50.
PAIK, M. C. (1997). The generalized estimating equation approach when data are not missing completely at random. Journal of the American Statistical Association 92.
ROBINS, J. M., ROTNITZKY, A. AND ZHAO, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90.
ROTNITZKY, A., ROBINS, J. M. AND SCHARFSTEIN, D. (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association 93.
ROTNITZKY, A., SCHARFSTEIN, D., SU, T. L. AND ROBINS, J. M. (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics 57.
RUBIN, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
RUBIN, D. B. (1994). Discussion to Diggle and Kenward: Informative dropout in longitudinal data analysis. Applied Statistics 43.
RUBIN, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association 91.
SAS INSTITUTE INC. (1999). SAS/STAT User's Guide, Version 8. Cary, NC.
SCHAFER, J. L. (1997). Analysis of Incomplete Multivariate Data. New York: Chapman and Hall.
SCHAFER, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research 8.
SCHARFSTEIN, D., ROTNITZKY, A. AND ROBINS, J. M. (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94.
SKINNER, C. J. (1997). Discussion to Copas and Li: Inference for nonrandom samples. Journal of the Royal Statistical Society, Series B 59.
UNNEBRINK, K. AND WINDELER, J. (1999). Sensitivity analysis by worst and best case assessment: is it really sensitive? Drug Information Journal 33.
VANSTEELANDT, S., GOETGHEBEUR, E., KENWARD, M. G. AND MOLENBERGHS, G. (2000). Ignorance and uncertainty regions as inferential tool in a sensitivity analysis. Technical Report 2000/2, Centrum voor Statistiek, Ghent University.
VANSTEELANDT, S. AND GOETGHEBEUR, E. (2001). Analyzing the sensitivity of generalized linear models to incomplete outcomes via the IDE algorithm. Journal of Computational and Graphical Statistics 10.
VERBEKE, G. AND MOLENBERGHS, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer.
XIE, F. AND PAIK, M. C. (1997). Multiple imputation methods for the missing covariates in generalized estimating equations. Biometrics 53.

[Received 16 June 2003; first revision 28 October 2003; second revision 5 February 2004; accepted for publication 12 February 2004]
Introduction to mixed model and missing data issues in longitudinal studies
Introduction to mixed model and missing data issues in longitudinal studies Hélène JacqminGadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models
More informationA Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit nonresponse. In a survey, certain respondents may be unreachable or may refuse to participate. Item
More informationHANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS
HANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS Mike Kenward London School of Hygiene and Tropical Medicine Acknowledgements to James Carpenter (LSHTM) Geert Molenberghs (Universities of
More informationPATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku,
PATTERN MIXTURE MODELS FOR MISSING DATA Mike Kenward London School of Hygiene and Tropical Medicine Talk at the University of Turku, April 10th 2012 1 / 90 CONTENTS 1 Examples 2 Modelling Incomplete Data
More informationReview of the Methods for Handling Missing Data in. Longitudinal Data Analysis
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 113 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationOverview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models
Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear
More informationSensitivity Analysis in Multiple Imputation for Missing Data
Paper SAS2702014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes
More informationStatistical Analysis with Missing Data
Statistical Analysis with Missing Data Second Edition RODERICK J. A. LITTLE DONALD B. RUBIN WILEY INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents Preface PARTI OVERVIEW AND BASIC APPROACHES
More informationMissing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random
[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sageereference.com/survey/article_n298.html] Missing Data An important indicator
More informationProblem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VAaffiliated statisticians;
More informationMissing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
More informationPaper Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals
Paper 25528 Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals Gordon Johnston and Ying So SAS Institute Inc. Cary, North Carolina, USA Abstract Residuals have long been used
More informationLecture 27: Introduction to Correlated Binary Data
Lecture 27: Introduction to Correlated Binary Data Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South
More informationMultiple Imputation for Missing Data: A Cautionary Tale
Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust
More informationA Mixed Model Approach for IntenttoTreat Analysis in Longitudinal Clinical Trials with Missing Values
Methods Report A Mixed Model Approach for IntenttoTreat Analysis in Longitudinal Clinical Trials with Missing Values Hrishikesh Chakraborty and Hong Gu March 9 RTI Press About the Author Hrishikesh Chakraborty,
More informationModels for Count Data With Overdispersion
Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extrapoisson variation and the negative binomial model, with brief appearances
More informationarxiv: v1 [math.st] 5 Jan 2017
Sequential identification of nonignorable missing data mechanisms Mauricio Sadinle and Jerome P. Reiter Duke University arxiv:1701.01395v1 [math.st] 5 Jan 2017 January 6, 2017 Abstract With nonignorable
More informationHandling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
More informationStatistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation
Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: james.carpenter@lshtm.ac.uk
More informationMISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
More informationHandling attrition and nonresponse in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 6372 Handling attrition and nonresponse in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
More informationIntroduction to latent variable models
Introduction to latent variable models lecture 1 Francesco Bartolucci Department of Economics, Finance and Statistics University of Perugia, IT bart@stat.unipg.it Outline [2/24] Latent variables and their
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationMissing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple
More informationNote on the EM Algorithm in Linear Regression Model
International Mathematical Forum 4 2009 no. 38 18831889 Note on the M Algorithm in Linear Regression Model JiXia Wang and Yu Miao College of Mathematics and Information Science Henan Normal University
More informationAnalysis of Longitudinal Data with Missing Values.
Analysis of Longitudinal Data with Missing Values. Methods and Applications in Medical Statistics. Ingrid Garli Dragset Master of Science in Physics and Mathematics Submission date: June 2009 Supervisor:
More informationMISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)
MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,
More informationAn extension of the factoring likelihood approach for nonmonotone missing data
An extension of the factoring likelihood approach for nonmonotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
More informationStandard errors of marginal effects in the heteroskedastic probit model
Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In nonlinear regression models, such as the heteroskedastic
More informationAuxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationBMJ Open. To condition or not condition? Analyzing change in longitudinal randomized controlled trials
To condition or not condition? Analyzing change in longitudinal randomized controlled trials Journal: BMJ Open Manuscript ID bmjopen000 Article Type: Research Date Submitted by the Author: Jun0 Complete
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationStatistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Deardra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm3pm in SC 109, Thursday 5pm6pm in SC 705 Office Hours: Thursday 6pm7pm SC
More informationNonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments
Brockmeier, Kromrey, & Hogarty Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten s Lantry L. Brockmeier Jeffrey D. Kromrey Kristine Y. Hogarty Florida A & M University
More informationarxiv:1301.2490v1 [stat.ap] 11 Jan 2013
The Annals of Applied Statistics 2012, Vol. 6, No. 4, 1814 1837 DOI: 10.1214/12AOAS555 c Institute of Mathematical Statistics, 2012 arxiv:1301.2490v1 [stat.ap] 11 Jan 2013 ADDRESSING MISSING DATA MECHANISM
More informationA hidden Markov model for criminal behaviour classification
RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University
More informationAnalysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
More informationTABLE OF CONTENTS ALLISON 1 1. INTRODUCTION... 3
ALLISON 1 TABLE OF CONTENTS 1. INTRODUCTION... 3 2. ASSUMPTIONS... 6 MISSING COMPLETELY AT RANDOM (MCAR)... 6 MISSING AT RANDOM (MAR)... 7 IGNORABLE... 8 NONIGNORABLE... 8 3. CONVENTIONAL METHODS... 10
More informationAn EM algorithm for the estimation of a ne statespace systems with or without known inputs
An EM algorithm for the estimation of a ne statespace systems with or without known inputs Alexander W Blocker January 008 Abstract We derive an EM algorithm for the estimation of a ne Gaussian statespace
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationDealing with Missing Data
Dealing with Missing Data Roch Giorgi email: roch.giorgi@univamu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January
More informationMissing data and net survival analysis Bernard Rachet
Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 2729 July 2015 Missing data and net survival analysis Bernard Rachet General context Populationbased,
More information13. Poisson Regression Analysis
136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often
More informationDepartment of Epidemiology and Public Health Miller School of Medicine University of Miami
Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data WednesdayFriday 9:00 10:15PM Course Location: CRB 995
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationBayesian Approaches to Handling Missing Data
Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to
More informationDealing with Missing Data
Res. Lett. Inf. Math. Sci. (2002) 3, 153160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904
More informationHandling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
More informationBayesian Statistics in One Hour. Patrick Lam
Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical
More informationDr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.
American Statistical Association Biopharm Section Monthly Webinar Series: Sensitivity analyses that address missing data issues in Longitudinal studies for regulatory submission. Dr James Roger. GlaxoSmithKline
More informationLecture 7 Logistic Regression with Random Intercept
Lecture 7 Logistic Regression with Random Intercept Logistic Regression Odds: expected number of successes for each failure P(y log i x i ) = β 1 + β 2 x i 1 P(y i x i ) log{ Od(y i =1 x i =a +1) } log{
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002Topics in StatisticsBiological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationNonlinear Regression:
Zurich University of Applied Sciences School of Engineering IDP Institute of Data Analysis and Process Design Nonlinear Regression: A Powerful Tool With Considerable Complexity HalfDay : Improved Inference
More informationGSK Medicine: Study Number: Title: Rationale: Study Period: Objectives Indication: Study Investigators/Centers: Research Methods Data Source:
GSK Medicine: Study Number: 08257 Title: OCSIGEN study Longitudinal followup of a cohort of patients with asthma treated with inhaled corticosteroids in primary care Rationale: In the PostLicensing File
More informationSupplement to Call Centers with Delay Information: Models and Insights
Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290
More informationTesting on proportions
Testing on proportions Textbook Section 5.4 April 7, 2011 Example 1. X 1,, X n Bernolli(p). Wish to test H 0 : p p 0 H 1 : p > p 0 (1) Consider a related problem The likelihood ratio test is where c is
More informationEconometric Analysis of Cross Section and Panel Data Second Edition. Jeffrey M. Wooldridge. The MIT Press Cambridge, Massachusetts London, England
Econometric Analysis of Cross Section and Panel Data Second Edition Jeffrey M. Wooldridge The MIT Press Cambridge, Massachusetts London, England Preface Acknowledgments xxi xxix I INTRODUCTION AND BACKGROUND
More informationSTATISTICAL ANALYSIS OF SAFETY DATA IN LONGTERM CLINICAL TRIALS
STATISTICAL ANALYSIS OF SAFETY DATA IN LONGTERM CLINICAL TRIALS Tailiang Xie, Ping Zhao and Joel Waksman, Wyeth Consumer Healthcare Five Giralda Farms, Madison, NJ 794 KEY WORDS: Safety Data, Adverse
More informationA REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA
123 Kwantitatieve Methoden (1999), 62, 123138. A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA Joop J. Hox 1 ABSTRACT. When we deal with a large data set with missing data, we have to undertake
More informationOverview Classes. 123 Logistic regression (5) 193 Building and applying logistic regression (6) 263 Generalizations of logistic regression (7)
Overview Classes 123 Logistic regression (5) 193 Building and applying logistic regression (6) 263 Generalizations of logistic regression (7) 24 Loglinear models (8) 54 1517 hrs; 5B02 Building and
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationHow to choose an analysis to handle missing data in longitudinal observational studies
How to choose an analysis to handle missing data in longitudinal observational studies ICH, 25 th February 2015 Ian White MRC Biostatistics Unit, Cambridge, UK Plan Why are missing data a problem? Methods:
More informationImputing Missing Data using SAS
ABSTRACT Paper 32952015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are
More information11. Analysis of Casecontrol Studies Logistic Regression
Research methods II 113 11. Analysis of Casecontrol Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct