1170 M. M. SANCHEZ AND X. CHEN


STATISTICS IN MEDICINE
Statist. Med. 2006; 25
Published online 5 January 2006 in Wiley InterScience. DOI: /sim.2244

Choosing the analysis population in non-inferiority studies: Per protocol or intent-to-treat

M. Matilde Sanchez and Xun Chen

Clinical Biostatistics and Research Decision Sciences, RY34-A316, Merck Research Laboratories, P.O. Box 2000, Rahway, NJ 07065, U.S.A.

SUMMARY

For superiority trials, the intent-to-treat (ITT) population is considered the primary analysis population because it tends to avoid the over-optimistic estimates of efficacy that result from a per-protocol (PP) population. However, the roles of the ITT and PP populations in non-inferiority studies are not as clearly defined as in superiority trials. In this paper, a simulation study is conducted to systematically investigate the impact of different types of missingness and protocol violations on the conservatism or anticonservatism of analyses based on the ITT and the PP population in non-inferiority trials. We find that the conservatism or anticonservatism of the PP or ITT analysis depends on many factors, including the type of protocol deviation and missingness, the treatment trajectory (for longitudinal studies) and the method of handling missing data in the ITT population. The requirement that non-inferiority be shown for both the PP and ITT populations does not necessarily guarantee the validity of a non-inferiority conclusion, and a sufficiently powered PP analysis is not necessarily powered for the ITT analysis. It is important to assess the potential types and rates of protocol deviation and missingness that might occur in a non-inferiority trial and to obtain some prior knowledge regarding the treatment trajectory of the test treatment versus the active control at the design stage so that a proper analysis plan and appropriate power estimation can be carried out.
In general, for the types of protocol violations and missingness considered, we find that a hybrid ITT/PP analysis, which excludes non-compliant patients as in the PP analysis and properly addresses the impact of non-trivial missing data as in the MLE-based ITT analysis, is more promising for providing reliable non-inferiority tests. Copyright © 2006 John Wiley & Sons, Ltd.

KEY WORDS: drop-out due to lack of efficacy; hybrid ITT/PP analysis; LVCF method; MLE method; non-compliance; non-trivial missing

Correspondence to: M. Matilde Sanchez, Clinical Biostatistics and Research Decision Sciences, RY34-A316, Merck Research Laboratories, P.O. Box 2000, Rahway, NJ 07065, U.S.A. E-mail: Matilde Sanchez@merck.com
Received 29 September 2003; Accepted 8 March 2005

1. INTRODUCTION

With evolving ethical perspectives, there has been an increased emphasis on the use of active-control studies in drug development. For a medical condition with no known
effective treatment, there is usually no ethical dilemma in conducting a study comparing a new treatment with placebo. However, a placebo-controlled study may be considered unethical and unacceptable whenever a known effective treatment is available, especially for a serious disease. In an active-control trial, patients are randomly assigned to the new treatment or to a known effective treatment. An important assumption for a non-inferiority trial is the presence of assay sensitivity, that is, the ability to distinguish an effective treatment from a less effective or ineffective treatment.

Two possible objectives of an active-control trial in establishing the efficacy of the new treatment are either to show that the new treatment is at least as good as the active control (non-inferiority) or to show that the new treatment is superior to the active control (superiority). The majority of active-control trials are conducted to establish non-inferiority since it has become increasingly difficult to demonstrate superiority over an active control in clinical trials for certain diseases (e.g. some very effective antibiotics are on the market). Another reason is that a new treatment with comparable efficacy might have other advantages (e.g. superior safety profile, more convenient treatment regimen/route of administration) over the active control.

Some important issues related to the conduct of non-inferiority studies include (a) choice of active control; (b) selection of non-inferiority margin and (c) choice of analysis population. In this paper, the focus will be on the choice of the appropriate population to use for statistical analysis in non-inferiority studies. Non-inferiority studies are not conservative in nature since flaws in the design and conduct of the study will tend to bias the results towards a conclusion of similarity.
It is important to minimize protocol deviations, such as violations of the entry criteria and non-compliance (or, more precisely, partial compliance) with the randomized treatments, as well as the amount and type of missing data. Ideally, all patients randomized into the study should have satisfied all entry criteria, complied with all study procedures with no losses to follow-up and provided complete data, in which case the intent-to-treat (ITT) population and the per-protocol (PP) population are the same. However, this situation is impossible to achieve in practice. It is thus crucial to minimize the impact of protocol deviations or missingness on the statistical analysis of the data.

The ITT and PP populations play different roles in superiority trials and in equivalence or non-inferiority trials. For superiority trials, the ITT population is considered the primary analysis set because it tends to avoid the over-optimistic estimates of efficacy that result from a PP analysis: non-compliers included in the ITT analysis will generally diminish the estimated treatment effect [1]. The choice of the appropriate analysis population in non-inferiority studies is not well defined. Few publications have systematically evaluated the impact of different protocol violations and missingness on PP and ITT analyses for equivalence and non-inferiority studies. Intuitively, the ITT analysis seems to be liberal for non-inferiority trials, as its results tend to be biased towards making the two treatments (new treatment and active control) appear similar.

Ebbutt and Frith [2] compared the results of the ITT and PP analyses in 11 equivalence asthma trials. They observed that the PP analyses always had wider confidence intervals than the ITT analyses, due entirely to the smaller number of subjects included in the PP analyses. They also found no consistent bias in either direction when comparing estimates obtained from the ITT and PP populations.
Meanwhile, Röhmel [3] conducted a simple simulation study to investigate the degree of anticonservatism of the ITT population and to quantify the influence of non-compliers on the conclusion of a non-inferiority study. For a normally distributed variable, he studied the effect of varying the proportion of non-compliers, the true difference between the two treatment groups and the variability in the non-complier group. He concluded that in the presence of
non-compliers, the test for non-inferiority gives higher type I error rates that increase with the proportion of non-compliers, and the degree of anticonservatism of the ITT analysis is inversely related to the size of the treatment effect in the non-complier group.

Regulatory agencies have concerns about using the ITT population as the primary analysis set for non-inferiority trials. The ICH E9 guideline states: "..., in an equivalence or non-inferiority trial use of the full analysis set (i.e. the ITT population in this paper) is generally not conservative and its role should be considered very carefully." This statement has been interpreted as an indication that the PP population is the conservative choice for non-inferiority studies. On the other hand, Garrett [4] examined the potential impact of using PP populations in analysing binary outcome data. He concluded that the perceived conservative nature of the PP population appears to be much more a reflection of reduced patient numbers than of the presence of bias, while bias can be in either direction depending on the pattern of violations.

As a compromise, the current thinking of regulatory agencies is that the study objective of a non-inferiority trial should be achieved in both populations [5, 6]. Sample size computations are performed to ensure sufficient numbers of subjects in the PP population and then increased for the ITT population based on the projected protocol violation and withdrawal rate [7]. It is generally believed that non-inferiority will be more difficult to demonstrate using the PP population, so sample size calculations based on the PP population will ensure that the ITT analysis is adequately powered.

The aim of this paper is to provide a systematic investigation of the impact of various types of protocol violations and missingness on the appropriateness of the ITT population- and the PP population-based analyses for non-inferiority studies.
We did not study all possible types of protocol violations and missingness but rather concentrated on the most common types observed in our clinical trials. We further discuss the issues involved in the choice of analysis population in non-inferiority studies in Section 2. Section 3 describes the settings of the simulation study. Simulation results are presented and discussed in Section 4. Finally, in Section 5, conclusions and practical recommendations on the choice of analysis population for non-inferiority studies are provided.

2. APPROPRIATE ANALYSIS POPULATION: INTENT-TO-TREAT VERSUS PER PROTOCOL

The objective of any clinical trial is to provide an unbiased assessment of the treatment effect. Decisions regarding the appropriate analysis set to use for a given trial should be guided by the following principles: (1) to minimize bias and (2) to avoid inflation of the type I error [1]. The ITT population includes all randomized patients in the study regardless of compliance with the protocol or completion of the study. The PP population is defined as the subset of the ITT population who completed the study without any major protocol violations. In the ICH Guideline E9, a more formal definition of PP states that it is "the set of data generated by the subjects who complied with the protocol sufficiently to ensure that these data would be likely to exhibit the effects of the treatment, according to the underlying scientific model", wherein compliance refers to treatment exposure, availability of measurements and absence of major protocol violations. The document further states, "Subjects who withdraw or dropout of the treatment group or the comparator group will tend to have a lack of response, and hence the results of using the full analysis set may be biased toward demonstrating equivalence" [1].

For this reason, the ITT population is preferred over the PP population in a superiority study, as it tends to avoid over-optimistic estimates of efficacy resulting from a PP analysis and is thus generally more conservative for a superiority study. For the same reason, however, "in an equivalence or non-inferiority trial use of the full analysis set is generally not conservative and its role should be considered very carefully" [1]. This statement has been interpreted to indicate that the PP population is the conservative choice for non-inferiority studies. However, this seems counterintuitive for certain situations wherein there are treatment-related drop-outs (e.g. protocol-specified discontinuation for lack of efficacy). In this situation, an analysis based on the PP population will make the two treatments (test treatment and active control) appear similar, since patients remaining in the study in both treatment groups would most likely be responders. Due to this ambiguity, the current thinking of regulatory agencies is that the study objective for a non-inferiority trial should be achieved in both populations.

It is important to clearly differentiate the application of an ITT population from the pure ITT analysis in this paper. The ITT principle requires that all patients randomized in the trial be included in the analysis and that there is complete follow-up of all patients for study outcomes. Thus, withdrawal from treatment should not lead to withdrawal from study [8]. In the pure ITT analysis, patients are followed up completely and analysed according to their assigned treatment groups at randomization, irrespective of compliance and occurrence of adverse effects. The pure ITT analysis tests the entire treatment regimen, including factors such as compliance, tolerability and patient motivation, rather than just the therapy.
It provides the most realistic and unbiased answer to the question of clinical effectiveness of the treatment and attempts to anticipate therapeutic effectiveness as it might be realized in a public health setting. Testing for non-inferiority using the pure ITT analysis, wherein patients are followed up completely even after receiving rescue medications, will generally be anticonservative. Thus, this pure version of the ITT analysis is generally not of interest for a non-inferiority study and will not be studied in this paper.

The ITT analysis in this paper and the PP analysis are both aimed at closely estimating the treatment effect as it would be realized under perfect protocol implementation (i.e. no protocol violations and no drop-out) or, in other words, under the scientific model underlying the protocol. Unlike the PP analysis, the ITT analysis faces the difficulty that as soon as a patient drops out of the study, measurements are no longer available. Various methods can be applied to handle missing data. The most popular method for dealing with missing data in a clinical trial is the last value carried forward (LVCF) method, wherein the last observation obtained for a patient is used for all subsequent missing observations. It assumes that the last observation for a given patient is an unbiased estimate of what the missing value would have been had the patient stayed in the study. This method has been criticized as a simple attempt at imputation that underestimates the variability and provides a biased estimate of the treatment difference when a time-trend exists. In this paper, the main focus is to compare the performance of the LVCF-based ITT analysis and the PP analysis in non-inferiority studies. The application of an ITT analysis implemented with other, more sophisticated missing data methods, such as a maximum likelihood approach, is also investigated.

3. SIMULATION SETTING

A simulation study is conducted to investigate the impact of various types of protocol violations and missingness on the conservatism or anticonservatism of analysis based on the
ITT or PP population. Only the most commonly observed protocol violations and types of missingness were considered.

Consider a non-inferiority clinical trial which compares a test treatment, T, and an active control, C. In order to satisfy the "at least as good as" criterion and demonstrate clinical non-inferiority, a statistical test or confidence interval must rule out with high probability clinical inferiority of the test treatment. Using the confidence interval approach, non-inferiority will be declared if the upper limit of the 95 per cent confidence interval of the treatment effect (test treatment minus active control) is less than the pre-specified non-inferiority margin, $\delta$ (assume that smaller values of the response variable are better). This is often equivalent to using a one-sided $\alpha = 2.5$ per cent level test of the null hypothesis $H_0: \mu_T - \mu_C \ge \delta$ versus the alternative hypothesis $H_a: \mu_T - \mu_C < \delta$, where $\delta$ refers to the non-inferiority margin. The null hypothesis states that the test treatment is inferior to the active control by $\delta$ or more. The alternative hypothesis implies that a difference of less than $\delta$ (i.e. the test treatment is inferior to the active control by less than $\delta$) is considered to be clinically acceptable and that the test treatment is at least as good as the active control.

Assume that the primary endpoint, say Y, is measured multiple times in the trial for both T and C, say, at baseline and at months one through five (referred to as visits 1-5), and denoted as $Y = (Y_0, Y_1, \ldots, Y_5)$. The primary comparison between T and C is made in terms of the change from baseline in Y at month 5. Further, assume Y is normally distributed with means equal to $\mu_{tj}$ for $t = T, C$, $j = 0, 1, 2, \ldots, 5$, and with an AR(1)-type covariance matrix, i.e. $\mathrm{Var}(Y) = \sigma^2 \left(\rho^{|j-k|}\right)_{j,k=0,\ldots,5}$. Let the non-inferiority margin be $\delta = 0.3$.
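The equivalence between the confidence-interval rule and the one-sided test can be written out in a few lines. The block below is only a sketch in notation assumed here for illustration: $\mu_T$ and $\mu_C$ are the month-5 means, $\delta$ the margin, $\hat d$ the estimated difference (test minus control) and $\mathrm{SE}(\hat d)$ its standard error; a normal critical value is used in place of the exact t quantile.

```latex
\begin{align*}
H_0 &: \mu_T - \mu_C \ge \delta
  \qquad\text{versus}\qquad
  H_a : \mu_T - \mu_C < \delta, \\[4pt]
\text{declare non-inferiority}
  &\iff \hat d + z_{0.975}\,\mathrm{SE}(\hat d) < \delta \\
  &\iff \frac{\hat d - \delta}{\mathrm{SE}(\hat d)} < -z_{0.975},
  \qquad z_{0.975} \approx 1.96 .
\end{align*}
```

The second line is the upper limit of the two-sided 95 per cent interval; the third is the same inequality rearranged into a one-sided 2.5 per cent level test statistic.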
With $\sigma = 1$, $\rho = 0.95$, a one-sided type I error of 0.025 and a sample size of n = 100/group, the study has 90 per cent power to show non-inferiority if in fact there is no difference between T and C and if all patients comply fully with the protocol.

To evaluate the type I error rates of the ITT and PP analyses in the presence of various protocol violations and missingness, we simulated data such that the test treatment, T, is inferior to the active control, C, with a true mean difference of 0.3 (exactly the pre-specified non-inferiority margin) at month 5, i.e. $\mu_{T5} - \mu_{C5} = 0.3$. Two different treatment trajectories are considered for the means ($\mu_0$ to $\mu_5$) over time: trajectory 1 represents a late onset of treatment difference (in favour of the active control), where the difference between T and C becomes largest at month 5, and trajectory 2 represents an early onset of treatment difference (in favour of the active control), where the difference between T and C is larger in the middle of the trial than at month 5. Figure 1 illustrates these two different trajectories: trajectory 1 has $\mu_T = (7.5, 7.35, 7.2, 7.1, 7.0, 7)$ and $\mu_C = (7.5, 7.4, 7.25, 7.05, 6.85, 6.7)$
while trajectory 2 has $\mu_T = (7.5, 7.5, 7.45, 7.3, 7.12, 7)$ and $\mu_C = (7.5, 7.25, 7.0, 6.9, 6.8, 6.7)$.

Figure 1. An illustration of the simulated treatment trajectories.

The following different types of missingness and protocol violation are considered in the simulation (see Table I). The first type of missingness is imposed completely at random on each patient (e.g. the patient relocated and so had to discontinue from the study), with patterns 1 and 4 having the same missing rate for T and C (the pattern 4 missing rate is larger than that of pattern 1), pattern 2 corresponding to a larger missing rate for C, and pattern 3 to a larger missing rate for T. Specifically, a 10 per cent missing rate corresponds to approximately 1 per cent missing at visit 1, and approximately 1, 1, 3 and 4 per cent missing at visits 2, 3, 4 and 5, respectively; a 20 per cent (30 per cent) missing rate corresponds to approximately 2 per cent (4 per cent) missing at visit 1, and 3 per cent (4 per cent), 5 per cent (4 per cent), 5 per cent (8 per cent) and 5 per cent (10 per cent) missing at visits 2, 3, 4 and 5, respectively.

The second type of missingness investigated is baseline-dependent drop-out. In pattern 1, more severe patients (at baseline) are more likely to discontinue from both treatments; in pattern 2, more severe patients (at baseline) are more likely to discontinue from the test treatment, while less severe patients are more likely to discontinue from the active control. This usually happens when the test treatment is less efficacious and the control drug is associated with some adverse experiences (AEs). The reverse scenario is simulated in pattern 3. Specifically, the simulated 20 per cent missing rate corresponds to approximately 2 per cent missing at visit 1, and about 3, 5, 5 and 5 per cent missing at visits 2, 3, 4 and 5, respectively.
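The data-generating step described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' simulation code: the standard deviation of 1 and AR(1) correlation of 0.95 are an assumed reading of the design values, the function names are hypothetical, and the per-visit rates are the approximate ones quoted for the 10 per cent random drop-out scenario.

```python
import math
import random

# Month 0-5 means for trajectory 1, as listed in the text.
MU_T = [7.5, 7.35, 7.2, 7.1, 7.0, 7.0]
MU_C = [7.5, 7.4, 7.25, 7.05, 6.85, 6.7]

# Assumed values: standard deviation 1 and AR(1) correlation 0.95.
SIGMA, RHO = 1.0, 0.95

# Approximate share of patients first missing at visits 1..5 (10 per cent scenario).
RATES_10 = [0.01, 0.01, 0.01, 0.03, 0.04]


def simulate_profile(mu, rng, sigma=SIGMA, rho=RHO):
    """Draw one patient's Y_0..Y_5 with Var = sigma**2 and Corr(Y_j, Y_k) = rho**|j-k|."""
    y = [mu[0] + sigma * rng.gauss(0.0, 1.0)]
    innovation_sd = sigma * math.sqrt(1.0 - rho ** 2)
    for j in range(1, len(mu)):
        carried = rho * (y[-1] - mu[j - 1])  # deviation carried over from the previous visit
        y.append(mu[j] + carried + innovation_sd * rng.gauss(0.0, 1.0))
    return y


def random_dropout(profile, first_missing_rates, rng):
    """Monotone drop-out completely at random; values from the drop-out visit on become None."""
    u = rng.random()
    cumulative = 0.0
    for visit, rate in enumerate(first_missing_rates, start=1):
        cumulative += rate
        if u < cumulative:
            return profile[:visit] + [None] * (len(profile) - visit)
    return list(profile)


def simulate_arm(mu, n, first_missing_rates, seed):
    """Simulate n patients of one arm and apply random drop-out."""
    rng = random.Random(seed)
    return [random_dropout(simulate_profile(mu, rng), first_missing_rates, rng)
            for _ in range(n)]


if __name__ == "__main__":
    arm_t = simulate_arm(MU_T, 100, RATES_10, seed=1)
    print("test-arm patients missing month 5:", sum(p[5] is None for p in arm_t))
```

Generating each visit from the previous one is one simple way to obtain the AR(1) structure without a matrix library; baseline-dependent drop-out could be sketched the same way by making the rates depend on `profile[0]`.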
Table I. Different scenarios of missingness and non-compliance.

Reasons for exclusion of patients               Missingness or non-compliance (per cent)
from PP population                              Tested drug                         Active control

Randomly lost to follow-up
  Pattern 1                                     10                                  10
  Pattern 2                                     10                                  20
  Pattern 3                                     20                                  10
  Pattern 4                                     30                                  30
Baseline-dependent missing
  Pattern 1                                     20 in sub-pop, baseline above 7.5   20 in sub-pop, baseline above 7.5
  Pattern 2                                     20 in sub-pop, baseline above 7.5   20 in sub-pop, baseline below 7.5
  Pattern 3                                     20 in sub-pop, baseline below 7.5   20 in sub-pop, baseline above 7.5
Drop-out due to lack of efficacy (discontinue if response variable is greater than 8.25 at one visit and after)
  Pattern 1: visit 2 and after, trajectory 1    17                                  17
  Pattern 1: visit 2 and after, trajectory 2    23                                  12
  Pattern 2: visit 3 and after, trajectory 1    14                                  12
  Pattern 2: visit 3 and after, trajectory 2    18                                  10
  Pattern 3: visit 4 and after, trajectory 1    11                                  8
  Pattern 3: visit 4 and after, trajectory 2    13                                  7
Non-compliance
  Pattern 1: loss of 20 per cent efficacy       10                                  10
  Pattern 2: loss of 20 per cent efficacy       30                                  30
  Pattern 3: loss of 20 per cent efficacy       10                                  30
  Pattern 4: loss of 20 per cent efficacy       30                                  10

Drop-out rates are estimated based on distributional assumptions.

A review of previous regulatory submissions shows that a primary reason for drop-out in some non-inferiority trials is lack of efficacy: patients are forced to discontinue from a trial if the monitored efficacy variables are not satisfactory at certain visits, per pre-specified criteria, to ensure patients' safety. This kind of treatment-related drop-out contributes to the third type of missingness in Table I. In the simulation, the cut-off for discontinuation was fixed at the value of 8.25 but was applied at different visits.
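Because the lack-of-efficacy drop-out rates in Table I follow from the distributional assumptions rather than being fixed in advance, they can be approximated by simulation. The sketch below estimates the share of test-arm patients in trajectory 1 whose response exceeds 8.25 at some visit from a chosen starting visit up to visit 4. The function names are hypothetical, and the standard deviation of 1 and AR(1) correlation of 0.95 are assumed readings of the design values, so the estimates need not match the table exactly.

```python
import math
import random

MU_T_TRAJ1 = [7.5, 7.35, 7.2, 7.1, 7.0, 7.0]  # test-arm means from the text
CUTOFF = 8.25                                   # discontinuation threshold from the text


def simulate_profile(mu, rng, sigma=1.0, rho=0.95):
    """One AR(1) profile Y_0..Y_5 (sigma and rho are assumed values)."""
    y = [mu[0] + sigma * rng.gauss(0.0, 1.0)]
    innovation_sd = sigma * math.sqrt(1.0 - rho ** 2)
    for j in range(1, len(mu)):
        y.append(mu[j] + rho * (y[-1] - mu[j - 1]) + innovation_sd * rng.gauss(0.0, 1.0))
    return y


def first_failed_visit(profile, start_visit, cutoff=CUTOFF, last_visit=5):
    """First visit in start_visit..last_visit-1 whose value exceeds the cut-off, else None."""
    for visit in range(start_visit, last_visit):
        if profile[visit] > cutoff:
            return visit
    return None


def lack_of_efficacy_rate(mu, start_visit, n_sim=50000, seed=0):
    """Monte Carlo estimate of 1 - P(all monitored visits stay at or below the cut-off)."""
    rng = random.Random(seed)
    dropped = sum(
        first_failed_visit(simulate_profile(mu, rng), start_visit) is not None
        for _ in range(n_sim)
    )
    return dropped / n_sim


if __name__ == "__main__":
    for start in (2, 3, 4):
        print("rule applied from visit", start, "->", lack_of_efficacy_rate(MU_T_TRAJ1, start))
```

Visit 5 is left out of the check because a patient who reaches the final visit has already provided the month-5 value.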
Unlike the first two types of missingness (with the missing rates pre-specified), the missing rate for the third type of missingness varies with the size of the treatment effect and the discontinuation criteria used for lack of efficacy. For example, if discontinuation is required whenever the response variable is greater than 8.25 at visit 2 or after, the resulting discontinuation rate is estimated by $1 - P(Y_2 \le 8.25 \text{ and } Y_3 \le 8.25 \text{ and } Y_4 \le 8.25)$; for the active treatment group in trajectory 1 (with $(\mu_{T2}, \mu_{T3}, \mu_{T4}) = (7.2, 7.1, 7.0)$ and the variance-covariance matrix specified earlier), the total missing rate will be around 17 per cent.

Lastly, different scenarios of non-compliance are simulated for the two treatments, which are imposed randomly on patients according to the assumed proportions of non-compliance (e.g. 10 or 30 per cent). The degree of non-compliance is quantified by the percentage of efficacy loss due to not complying or
partially complying with the assigned treatment. For example, to lose 20 per cent efficacy means that patients who do not comply well with the assigned treatment are assumed to achieve 80 per cent of the efficacy of those who fully comply with the assigned treatment (at each visit after treatment starts). The different types of missingness and protocol violation listed in Table I are imposed on the generated full data set separately. Note that all of the missing data in this paper were generated in a monotone pattern.

4. ANALYSIS AND RESULTS

Assume the primary objective of the trial is to show that the mean change from baseline in Y at month 5 for the test treatment, T, is not clinically inferior to that for the active control, C. Let $\Delta = Y_5 - Y_0$ denote the change from baseline at month 5. An analysis of covariance (ANCOVA) will be used to compare the treatment groups for $\Delta$. The ANCOVA model will have treatment as a factor and the baseline value of the primary variable ($Y_0$) as covariate. Note that the number of patients included in the PP analysis is smaller than 100 patients/group and depends on the drop-out and/or protocol violator rates. Since all patients in the PP population have complete data, the corresponding ANCOVA analysis is straightforward. But to utilize the ITT population, which consists of all patients randomized to the trial whether or not they adhere to the protocol or provide data at the pre-specified last timepoint, one would have to handle the missing data (specifically, the missing $Y_5$) properly before applying the ANCOVA model. One conventional method which has been widely applied in most regulatory submissions is the LVCF method, wherein the last observation obtained for a patient is used for all subsequent missing observations. This method assumes that the last observation for a given patient is an unbiased estimate of what the missing value would have been had the patient stayed in the study.
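As a rough illustration of the LVCF rule just described, the sketch below fills missing visits (represented as `None`, an encoding chosen here) with the last observed value and then forms the month-5 change from baseline. The helper `carried_forward_mean_difference` is hypothetical and deliberately simplified: it assumes every patient in both arms is last seen at the same visit and that drop-out is completely at random.

```python
def lvcf(profile):
    """Replace each missing value (None) with the last observed value."""
    filled, last_seen = [], None
    for value in profile:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def change_from_baseline(profile):
    """Month-5 change from baseline after LVCF imputation."""
    filled = lvcf(profile)
    return filled[-1] - filled[0]


# Trajectory 1 means from the text (months 0-5).
MU_T = [7.5, 7.35, 7.2, 7.1, 7.0, 7.0]
MU_C = [7.5, 7.4, 7.25, 7.05, 6.85, 6.7]


def carried_forward_mean_difference(mu_t, mu_c, last_visit):
    """Expected T - C difference in carried-forward values when all patients
    are last observed at last_visit (illustrative simplification)."""
    return mu_t[last_visit] - mu_c[last_visit]


if __name__ == "__main__":
    print(lvcf([7.6, 7.4, 7.1, None, None, None]))
    print(round(carried_forward_mean_difference(MU_T, MU_C, 2), 2),
          "carried forward vs", round(MU_T[5] - MU_C[5], 2), "at month 5")
```

With the trajectory 1 means, values carried forward from visit 2 differ by 7.2 - 7.25 = -0.05 between arms, whereas the true month-5 difference is 0.3. This is the time-trend bias in question: under a late-onset difference, carrying early values forward makes the two treatments look more alike than they are.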
We first apply the conventional LVCF method to impute the missing value for $Y_5$ for the ITT analysis. Table II summarizes the simulation results comparing the PP analysis to the LVCF-based ITT analysis. The results presented in this paper are all based on 2000 simulations. The standard error of the estimated empirical type I error rate, say $\hat{\alpha}$, can thus be estimated by $\sqrt{\hat{\alpha}(1 - \hat{\alpha})/2000}$. Findings from the simulations are as follows:

- The PP analysis maintained the one-sided type I error rate near the nominal level of 0.025 except when missingness is due to lack of efficacy; it tends to be anticonservative in trajectory II, where significantly more drop-outs due to lack of efficacy occur in the test treatment group. In trajectory I, however, the PP analysis has a type I error less than or close to the nominal level when there are fewer or a comparable number of drop-outs due to lack of efficacy in the test treatment group. It once again tends to be anticonservative when noticeably more drop-outs due to lack of efficacy occur in the test treatment group;
- When there is a late onset of treatment difference (trajectory 1), the LVCF-based ITT analysis tends to be anticonservative throughout the simulation;
- When there is an early onset of treatment difference (trajectory 2), the LVCF-based ITT analysis is noticeably conservative in the case of missingness due to lack of efficacy;
- The LVCF-based ITT analysis is generally not reliable when non-compliant patients are included in the analysis: the inflation of the type I error increases with the proportion of non-compliant patients and the degree of non-compliance.
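For reading the simulation tables, the Monte Carlo precision quoted above can be made concrete. The worked arithmetic below is a sketch that plugs the nominal level 0.025 into the standard-error formula and uses a normal approximation for the range; the symbol $\hat\alpha$ for the empirical type I error rate is notation assumed here.

```latex
\[
\mathrm{SE}(\hat\alpha) = \sqrt{\frac{\hat\alpha\,(1-\hat\alpha)}{2000}},
\qquad
\hat\alpha = 0.025 \;\Rightarrow\;
\mathrm{SE} = \sqrt{\frac{0.025 \times 0.975}{2000}} \approx 0.0035,
\]
\[
0.025 \pm 1.96 \times 0.0035 \approx (0.018,\ 0.032).
\]
```

Under this approximation, an empirical error rate outside roughly 0.018 to 0.032 is unlikely to be Monte Carlo noise around the nominal level.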

Table II. Actual type I error in a non-inferiority trial comparing the PP analysis to the LVCF-based ITT analysis (based on 2000 simulations; the nominal error is 0.025).
Columns: Trajectory I (PP, ITT (LVCF)); Trajectory II (PP, ITT (LVCF)).
Rows (types of missingness and protocol deviation): Random drop-out, Patterns 1-4; Baseline-dependent, Patterns 1-3; Due to lack of efficacy, Patterns 1-3; Non-compliance, Patterns 1-4.

For baseline-dependent missing patterns 2 and 3, it might be surprising to see close to nominal level type I error rates for the PP approach, as these two patterns of missingness result in selection error at baseline. The good performance of the PP approach in such scenarios can be attributed to the baseline covariate adjustment in the ANCOVA model, which exactly reflects the underlying relationship between the response variable and the baseline measurements in the simulation. If the relationship between the response variable and the baseline measurement is not correctly modelled in the analysis, for example, if the response variable is non-linearly correlated with the baseline measurement, it is possible for the PP approach to lead to biased estimates when baseline-dependent missing pattern 2 or 3 occurs.

It is interesting to note that when missingness is primarily due to lack of efficacy, for treatment trajectory I, both the PP analysis and the ITT analysis could be anticonservative, whereas for treatment trajectory II, the PP analysis is anticonservative and the ITT analysis is noticeably conservative. These findings contradict the conventional belief in the anticonservativeness of the ITT and the conservativeness of the PP analysis for non-inferiority studies.
As a result, for treatment trajectory I, one cannot reliably conclude non-inferiority even if it is demonstrated in both the PP and ITT (LVCF-based) analyses; for treatment trajectory II, although the conservativeness of the ITT analysis does provide some protection for the validity of the non-inferiority conclusion, the power of the ITT analysis could be insufficient if the trial is designed based on power for the PP analysis (assuming that the ITT analysis will be automatically more powerful or, say, less conservative). For example, if the two treatments, T and C, follow trajectory II except that at month 5 they have 0 difference, say $\mu_{T5} = \mu_{C5} = 6.85$, then $n_T = n_C = 85$ will provide at least 85 per cent power to claim non-inferiority (based on the same covariance matrix and the same non-inferiority margin); if the drop-out rate due to lack of efficacy is projected around
15 per cent, 100 patients per group will be planned for recruitment to ensure that the PP population is adequately powered. Assuming patients will discontinue following pattern 1 for lack-of-efficacy missingness in Table I, our simulation shows that given 100 patients/group, the ITT (LVCF-based) analysis only has 64 per cent power to show non-inferiority at the end of the study.

By now it is clear that, unless appropriate adjustment is available, including non-compliers in the test for non-inferiority generally gives a higher than allowed false non-inferiority rate, and that certain types of missingness, e.g. missingness due to lack of efficacy, cannot simply be excluded from the analysis population (as in the PP analysis) for a valid non-inferiority test, whereas the conventional LVCF-based ITT analysis is not able to address the missingness validly. A hybrid PP/ITT analysis that excludes non-compliant patients from the analysis, as in the PP analysis, and also addresses the impact of missing data (especially those not missing completely at random) with a proper method other than LVCF, as in the ITT analysis, should result in a more reliable non-inferiority test.

We note that the drop-out due to lack of efficacy is a consequence of the protocol design and that the probability of the consequent missingness depends on the values observed at previous visits. This component of the missing data mechanism, therefore, is largely missing at random (MAR). Maximum likelihood estimation (MLE) is a statistically valid approach when data are MAR [9, 10]. In the simulation, we apply SAS Proc Mixed to implement the MLE approach to address the missingness due to lack of efficacy. Since this mixed model analysis is intended as an alternative approach to ANCOVA at month 5, time is treated discretely and an unstructured variance-covariance matrix for measurements from month 0 to month 5 is assumed.
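The design figures quoted in the text (90 per cent power with 100 patients/group, at least 85 per cent with 85 patients/group) can be checked approximately with a normal-approximation calculation. The sketch below is not the authors' power computation: it assumes a standard deviation of 1 and AR(1) correlation of 0.95, uses the residual variance of $Y_5$ given $Y_0$ implied by that structure, and ignores the t distribution and the degrees of freedom used by ANCOVA. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def residual_variance(sigma=1.0, rho=0.95, lag=5):
    """Var(Y_lag | Y_0) under AR(1): sigma^2 * (1 - rho^(2*lag))."""
    return sigma ** 2 * (1.0 - rho ** (2 * lag))


def noninferiority_power(n_per_group, true_diff, margin=0.3,
                         sigma=1.0, rho=0.95, alpha=0.025):
    """Approximate probability that the upper confidence limit falls below the margin."""
    se = math.sqrt(2.0 * residual_variance(sigma, rho) / n_per_group)
    z = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf((margin - true_diff) / se - z)


if __name__ == "__main__":
    print("n=100, no true difference:", round(noninferiority_power(100, 0.0), 3))
    print("n=85,  no true difference:", round(noninferiority_power(85, 0.0), 3))
    print("n=100, difference = margin:", round(noninferiority_power(100, 0.3), 3))
```

Setting the true difference equal to the margin returns the nominal 0.025, which is the boundary case the type I error simulations target; the other two calls can be compared against the power figures quoted in the text.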
Assuming a specific time-trend, say linear or quadratic, or assuming a special variance-covariance structure will generally increase the efficiency of tests, but it could also easily result in substantially biased tests due to mis-specification. General model assumptions are thus applied in the simulation. Readers can refer to the appendix for the applicable SAS code. Table III summarizes the simulation results. It can be seen that for the specific non-trivial missingness simulated in this paper, the drop-out due to lack of efficacy, the MLE approach is sufficient to replace the conventional LVCF method to provide a valid test.

Table III. Actual type I error in a non-inferiority trial for the mixed model approach based ITT analysis (based on 2000 simulations; the nominal error is 0.025).
Columns: ITT (mixed model approach), Trajectory I; ITT (mixed model approach), Trajectory II.
Rows (types of missingness and protocol deviation): Random drop-out, Patterns 1-4; Baseline-dependent, Patterns 1-3; Due to lack of efficacy, Patterns 1-3.
Note: ITT (mixed) is the same as ITT (LVCF) with respect to non-compliance.

5. CONCLUSIONS AND DISCUSSIONS

This paper provides a systematic investigation of the impact of various types of missingness and protocol violations on the appropriateness of the ITT population- and the PP population-based analyses for non-inferiority studies. We did not study all possible types of protocol violations and missingness but rather concentrated on the most common scenarios observed in our clinical trials. The impact of different types of missingness and protocol violations is assessed separately in the paper. It would be ideal to provide simulation results for the impact of all possible combinations of the investigated types of missingness and violations. This is, however, realistically impossible and also unnecessary: in many real-life clinical trials a single scenario of missingness or protocol violation often dominates the others and, more importantly, the essential conclusions of this paper would not change for a few uninvestigated missingness or violation patterns.

In summary, we find that the conservatism or anticonservatism of the PP or ITT analysis depends on many factors, including the type of protocol deviation and missingness, the treatment trajectory (for longitudinal studies) and the method of handling missing data in the ITT population. The test for non-inferiority is usually anticonservative in the presence of non-compliance. An analysis based on the PP population is generally deficient in the presence of non-trivial missingness, such as drop-out due to lack of efficacy, and leads to anticonservative results. The performance of the ITT analysis is sensitive to the imputation method employed: the LVCF-based ITT analysis is generally unable to correctly address the missingness issues in non-inferiority trials.
Consequently, the requirement that non-inferiority be shown for both the PP and ITT populations does not necessarily guarantee the validity of a non-inferiority conclusion, and a sufficiently powered PP analysis is not necessarily adequately powered for the ITT analysis. It is important to assess the potential types and rates of protocol deviation and missingness in a non-inferiority trial and to obtain as much prior information as possible regarding the treatment trajectory of the test treatment versus the active control at the design stage, so that a proper analysis plan and appropriate power estimation can be carried out. In general, for the types of protocol violations and missingness considered, we find that a hybrid ITT/PP analysis, which excludes non-compliant patients as in the PP analysis and properly addresses the impact of non-trivial missing data as in the MLE-based ITT analysis, is more promising for providing reliable non-inferiority tests.

We should also note that the MLE method can properly address the non-trivial missingness in our simulation because (1) the missingness due to lack of efficacy falls in the MAR category and (2) the predictor for the subsequent missing values (the Y value of the previous visit in our simulation) is included in the analysis model in a way that correctly reflects its true relationship with the missing value. This is also the reason for the good performance of the MLE method in the baseline-dependent missing patterns 2 and 3. This method thus needs to be carefully implemented in practice and supplemented with proper sensitivity analyses to ensure valid handling of missing data. When data are not missing at random, the MLE method is generally unable to correctly address the missingness and a much more sophisticated method is required; this topic is beyond the scope of this paper. In practice, however, the best solution is to proactively minimize the amount of non-random missing data in a trial.
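The contrast between LVCF and a likelihood-based analysis under MAR drop-out can be illustrated with a small simulation. The following is a minimal sketch and not the simulation used in this paper: it considers a single treatment arm with only two visits, and the sample size, means, slope, and drop-out probabilities are hypothetical values chosen for illustration. It uses the closed-form ML estimate for bivariate normal data with monotone missingness (a regression of visit 2 on visit 1 among completers) as a stand-in for the mixed-model fit, and the function names are our own.

```python
import random
from statistics import fmean


def simulate_arm(n=20000, seed=1):
    """Simulate one arm with two visits; visit 2 is lost under MAR drop-out.

    All numeric settings below are hypothetical illustration values.
    """
    rng = random.Random(seed)
    y1, y2, seen = [], [], []
    for _ in range(n):
        first = rng.gauss(0.0, 1.0)                       # visit-1 response
        second = 1.0 + 0.8 * first + rng.gauss(0.0, 0.6)  # visit-2 response, true mean 1.0
        # Drop-out depends only on the observed visit-1 value (MAR):
        # poor responders (first < 0) leave far more often.
        p_drop = 0.8 if first < 0 else 0.1
        y1.append(first)
        y2.append(second)
        seen.append(rng.random() >= p_drop)
    return y1, y2, seen


def visit2_mean_estimates(y1, y2, seen):
    """Return (completers-only, LVCF, ML) estimates of the visit-2 mean."""
    c1 = [a for a, s in zip(y1, seen) if s]
    c2 = [b for b, s in zip(y2, seen) if s]
    m1, m2 = fmean(c1), fmean(c2)

    # LVCF: a drop-out's visit-1 value stands in for the missing visit-2 value.
    lvcf = fmean([b if s else a for a, b, s in zip(y1, y2, seen)])

    # ML for bivariate normal data with monotone missingness: regress visit 2
    # on visit 1 among completers, then evaluate the fitted line at the
    # visit-1 mean of all subjects.
    sxy = sum((a - m1) * (b - m2) for a, b in zip(c1, c2))
    sxx = sum((a - m1) ** 2 for a in c1)
    ml = m2 + (sxy / sxx) * (fmean(y1) - m1)
    return m2, lvcf, ml


if __name__ == "__main__":
    completers, lvcf, ml = visit2_mean_estimates(*simulate_arm())
    print(f"true mean 1.000 | completers {completers:.3f} | LVCF {lvcf:.3f} | ML {ml:.3f}")
```

Under these assumed settings, the completers-only mean should lie well above the true value and the LVCF mean well below it, while the ML estimate should stay close to it. The ML estimate behaves well here only because the visit-1 value that drives drop-out enters the model in its true (linear) form, which mirrors condition (2) above.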

Our next step is to apply the knowledge obtained from the above investigation to guide practical non-inferiority trial design and analysis. We will report our practical experience and recommendations in a separate paper.

APPENDIX A: SAS CODE FOR MLE USING PROC MIXED

/*** A categorical variable trt_time is defined to summarize the full
     interaction between treatment and time of treatment, that is, no
     specific time trend/pattern is assumed for the longitudinal
     treatment effect. ***/
/*** For an RCT, the baseline mean is assumed to be the same for the two
     treatments: the same trt_time category at time = 0 for both
     treatments. ***/
/*** The statements below belong inside a DATA step. ***/
if time = 0 then trt_time = 0;
/*** Different trt_time category for different treatments at the same
     post-treatment visits ***/
else if time = 1 then do;
  if trt = 1 then trt_time = 11; else trt_time = 12;
end;
else if time = 2 then do;
  if trt = 1 then trt_time = 21; else trt_time = 22;
end;
else if time = 3 then do;
  if trt = 1 then trt_time = 31; else trt_time = 32;
end;
else if time = 4 then do;
  if trt = 1 then trt_time = 41; else trt_time = 42;
end;
else if time = 5 then do;
  if trt = 1 then trt_time = 51; else trt_time = 52;
end;

/*** y = (y0, y1, y2, y3, y4, y5) represents the repeated measurements of
     the response variable; an denotes the subject identification number;
     for each an, there are up to 6 different visits (denoted by time) and
     6 corresponding response values (y); the variance-covariance of the
     repeated measurements for each subject (an) is assumed unstructured
     (type = un). ***/

proc mixed alpha = 0.05;
  class an time trt_time;
  model y = trt_time;
  repeated time / type = un subject = an;
  /*** Based on the definition of trt_time, the eleven coefficients in the
       estimation function represent (baseline, trt 1 month 1, trt 2
       month 1, ..., trt 1 month 5, trt 2 month 5). The following
       exemplified contrast thus estimates the treatment difference at one
       post-treatment month (the month number and the eleven contrast
       coefficients are not legible in this copy). A 95 per cent
       confidence interval will also be provided per option /cl. ***/
  estimate 'trt diff' trt_time / cl;
run;

ACKNOWLEDGEMENTS

We would like to thank all the reviewers for their helpful comments and suggestions.

REFERENCES

1. ICH Harmonised Tripartite Guideline. E9: statistical principles for clinical trials,
2. Ebbutt AF, Firth L. Practical issues in equivalence trials. Statistics in Medicine 1998;
3. Röhmel J. Therapeutic equivalence investigations: statistical considerations. Statistics in Medicine 1998; 17:
4. Garrett AD. Therapeutic equivalence: fallacies and falsification. Statistics in Medicine 2003; 22:
5. D'Agostino RB, Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues - the encounters of academic consultants in statistics. Statistics in Medicine 2003; 22:
6. Committee on Proprietary Medical Products Point-to-Consider. Points to consider on switching between superiority and non-inferiority. CPMP,
7. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. British Medical Journal 1996; 313:
8. Lachin JM. Statistical considerations in the intent-to-treat principle. Controlled Clinical Trials 2000; 21:
9. Rubin DB. Inference and missing data. Biometrika 1976; 63:
10. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics 1982; 38:
11. ICH Harmonised Tripartite Guideline. E10: choice of control group and related issues in clinical trials,
12. Temple R. Difficulties in evaluating positive control trials. ASA Proceedings of the Biopharmaceutical Statistics 1983; 1-7.