An interim analysis is any assessment of data done during the patient enrollment or follow-up stages of a trial for the purpose of assessing center performance, the quality of the data collected, or treatment effects. [1] Stopping a trial on the basis of an interim analysis is also called data-dependent stopping or early stopping. Interim analyses are most often used to determine whether there is convincing evidence of a significantly large treatment difference, convincing enough to change or stop the trial earlier than originally planned. Ethical and economic reasons are also taken into consideration when stopping a trial early. [4] [10] The ethical reason is the most important: we want to make sure that the maximum number of patients receives the most effective treatment at the earliest stage. Since clinical trials are expensive, there are also economic reasons to include as few patients as possible. [10] Interim analysis can also reduce the expected number of patients and shorten the follow-up time needed to reach a conclusion. [11] There is no reason to spend extra money if there is already enough evidence.

Other Examples of Why a Trial May Be Terminated*

- Treatments are convincingly different
- Treatments are convincingly not different
- Unacceptable side effects or toxicity
- Accrual is so slow that the trial is no longer feasible
- Outside information surfaces that makes the trial unnecessary or unethical
- Poor execution compromises the ability of the study to meet its objectives
- Disastrous fraud or misconduct

*[3]

Special Statistical Problems

I. The Multiple Looks Problem

A problem that is often ignored in clinical trials arises when an experimenter repeatedly carries out interim analyses over the course of the study. This is known as the Multiple Looks Problem. [1] [2] When interim analyses are done repeatedly, a decision may be made to terminate the trial early because of apparently convincing evidence that the treatments are or are not different. The mistake is made when researchers treat each look at the data as if it were the only one. In fact, the false-positive rate increases with the number of interim analyses: the more often the data are reviewed, the greater the chance of mistaking a large statistical fluctuation for a real effect. For example, if we plan our clinical trial with an alpha of 5% and conduct 5 interim analyses, each tested at that level, the overall false-positive rate rises to 14%. [2] Because we look for a difference in treatments more often, there is a greater chance that at one of these looks we will find a statistical fluctuation, or what we call a False-Positive Look. The table below shows how the false-positive rate changes with the number of looks at the data for a planned alpha level of 0.05.

Multiple "Looks" at Data versus False-Positive Rate*

Number of Looks    False-Positive Rate
1                  0.05
2                  0.08
3                  0.11
4                  0.13
5                  0.14
10                 0.19
20                 0.25
50                 0.32
100                0.37
1000               0.53
∞                  1.0

*[9]

Every time we look at the data and consider stopping, we introduce a chance of falsely rejecting the null hypothesis. In other words, we increase the Type I error. If we look at the data multiple times and use an alpha of 0.05 as our criterion of significance at each look, then under the null hypothesis we have a 5% chance of stopping each time. [3]

II. The Multiple Outcomes Problem

The Multiple Outcomes Problem occurs when two or more outcome measures are used to analyze the study treatments. It can arise even in a trial that was designed to focus on a single outcome, when it becomes necessary to consider another outcome as well. [1] Among the three problems, this one is believed to be the most difficult to address, because the outcomes of interest are likely to be interdependent. The usual approach is to ignore the interdependence and to make comparisons involving the different outcome measures as if they were independent of one another. [1] However, unless the conclusions are drawn with extreme caution, this can lead to incorrect conclusions. An example from the book involves a heart study that compared a treatment with a placebo and measured death as an outcome. It was found that they did
not reject the null hypothesis of no difference between the treatment and the placebo for the outcome of death from all causes. However, there was a significant difference when the outcome was restricted to cardiovascular-related death, which is a subset of all causes of death. There is no real solution to this problem other than to report the conclusions and analyses for all outcomes and for subsets of these outcomes.

III. The Multiple Comparisons Problem

The multiple comparisons problem arises when an investigator chooses to make several different treatment comparisons, all involving the same outcome measure and all done at the same point. [1] There are two settings in which this occurs. The first is when it is of interest to determine whether subgroups of patients are benefited or harmed by the treatment. This kind of analysis is called data dredging [1], and the significance tests it produces should not be interpreted in the conventional way. The second is when the experimenter wants to compare more than one treatment, either with each other or with the control group. To conclude that one treatment is superior, the experimenter must carry out one of the multiple comparison tests.

Sequential and Group Sequential Methods

A sequential method is one in which the accumulating data are analyzed after every new observation. However, the more often the data are analyzed, the more serious the statistical problems discussed under the Multiple Looks Problem become. A group sequential method is one in which the data are analyzed at pre-determined intervals. Group sequential tests are convenient, and they also provide ample opportunity for early stopping.
Group sequential methods have several advantages:

- They achieve lower expected sample sizes and shorter average study lengths.
- They can provide a compromise between a trial that runs too long and one whose results are too biased.
- They are more practical than standard sequential methods. [8]

At each interim analysis, the treatments are compared and it is decided whether there is sufficient evidence to stop the trial. An Independent Data Monitoring Committee is used to review or conduct the interim analysis. [7]

Study Design with Interim Monitoring

A maximum of K analyses is planned, and a significance test is performed at each interim analysis j = 1, ..., K. If the P-value is below some pre-specified level α_j, the trial is stopped. If the P-value is greater than α_j, the trial continues until the next planned interim analysis. The Pocock design is a group sequential design with a fixed nominal level: the nominal significance level α_j is a constant α* chosen to give an overall probability α of a Type I error. Alternative designs, such as that of O'Brien and Fleming, use a different α_j for each test. [11]

We want a sample size large enough to achieve a pre-determined power for a particular alternative with a desired Type I error (usually 5%), where

Type I error = P(declare a treatment difference | H0).

The group sequential design declares a difference between groups if

- Z_1 > B_1 (stop at the first interim analysis),
- Z_1 < B_1 but Z_2 > B_2 (stop at the second interim analysis),
- ...
- Z_1 < B_1, Z_2 < B_2, ..., Z_(K-1) < B_(K-1), Z_K > B_K (last analysis). [6]

The design is specified by

- B_1, B_2, ..., B_K, the stopping boundaries,
- K, the maximum number of looks, and
- n, the number of patients responding since the previous look in each treatment group. [8]

Choosing Boundaries*

- Pocock (1977) Biometrika 64, 191-199: Pocock was the first to take this approach, using the same nominal significance level at every analysis so that the Type I error is spread evenly across the analyses. As a rough illustration of an even split, if K = 3 each α_k = 0.05/3 = 0.0167; the exact constant that gives an overall Type I error of 0.05 is somewhat larger, because the repeated tests are correlated. The design provides clear guidelines for group sequential tests with a given Type I error and power, and it has a greater chance of stopping the trial early.
- O'Brien-Fleming (1979) Biometrics 35, 549-556: This is an alternative to Pocock's repeated significance tests. It spends very little of the Type I error at the early interim analyses and gradually increases it. This makes it difficult to stop early, so the nominal p-value at the final analysis is close to the conventional 0.05 level.
- Fleming-Harrington-O'Brien (1984) Controlled Clinical Trials 5, 348-36: This is similar to O'Brien-Fleming above, but less conservative.

*[9], [3], and [8]
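
The stopping rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `first_crossing`, the example Z statistics, and the use of absolute values (a two-sided test) are assumptions made here and are not taken from the cited sources. The boundaries are passed in as a list, so the same function can be used with any of the designs listed.

```python
def first_crossing(z_stats, boundaries):
    """Return the 1-based look at which the trial stops, or None.

    z_stats    -- Z statistics observed at successive looks (hypothetical data)
    boundaries -- stopping boundaries B_1..B_K for the chosen design
    A two-sided rule |Z_j| > B_j is assumed here.
    """
    for look, (z, bound) in enumerate(zip(z_stats, boundaries), start=1):
        if abs(z) > bound:
            return look  # boundary crossed: stop and declare a difference
    return None  # no boundary crossed at any look supplied


# Hypothetical Z statistics from K = 4 looks (made up for illustration).
z_observed = [1.10, 2.05, 2.50, 2.70]

# A constant boundary of 2.36 at every look versus a naive 1.96 at every look.
constant_boundary = [2.36] * 4
naive_boundary = [1.96] * 4

print(first_crossing(z_observed, constant_boundary))
print(first_crossing(z_observed, naive_boundary))
```

With these made-up values the naive rule would stop at the second look, while the constant boundary of 2.36 would not stop until the third.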

Critical Values and Nominal P-values: K=4 analyses, Type I error=0.05

            Pocock             O'Brien-Fleming
Analysis    cv      P          cv      P
1           2.36    0.016      4.08    0.000005
2           2.36    0.016      3.22    0.0013
3           2.36    0.016      2.28    0.0228
4           2.36    0.016      2.04    0.0417

(cv = critical value for Z statistic) [6]

Problems with These Stopping Boundaries*

- Pocock: At the end of the study, the nominal p-value can be less than 0.05 and yet not small enough to achieve significance under the design. The design also requires the largest sample size to achieve a specified power.
- O'Brien-Fleming: Very conservative; the nominal significance levels may seem too small during the first analyses.
- Fleming-Harrington-O'Brien: Middle ground between the other two, but more similar to O'Brien-Fleming.

*[3]
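
The effect of a stopping boundary on the overall Type I error can be checked by simulation. The sketch below is illustrative only and rests on several assumptions that are not from the cited sources: a two-sided test, equally sized groups between looks, normally distributed data under the null hypothesis, and an arbitrary function name, trial count, and random seed. It compares a naive rule that uses 1.96 at every look with the constant critical value of 2.36 listed for the Pocock design with K = 4.

```python
import math
import random


def overall_false_positive_rate(boundaries, n_trials=20000, seed=0):
    """Estimate P(any |Z_j| > B_j) under the null hypothesis by simulation.

    Each look adds one equally sized group of data, modelled as a
    standard normal increment; Z_j is the standardized cumulative sum.
    """
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look, bound in enumerate(boundaries, start=1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > bound:
                crossings += 1
                break  # the trial stops at the first crossing
    return crossings / n_trials


# Naive rule: test at the unadjusted 1.96 level at every look.
naive_rates = {k: overall_false_positive_rate([1.96] * k) for k in (1, 2, 5)}

# Constant boundary of 2.36 at each of 4 looks.
pocock_rate = overall_false_positive_rate([2.36] * 4)

for k, rate in naive_rates.items():
    print(f"naive 1.96 rule, {k} look(s): {rate:.3f}")
print(f"constant 2.36 rule, 4 looks: {pocock_rate:.3f}")
```

Up to simulation noise, the naive estimates should land near the false-positive rates tabulated earlier for 1, 2, and 5 looks, and the constant boundary should bring the overall rate back to roughly 0.05.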

Alternative Approaches to Sequential and Group Sequential Designs*

Bayesian methods

The Bayesian approach focuses mainly on the interpretation of the estimate of the treatment difference rather than on significance levels or confidence intervals. [5]

Futility analyses

A futility analysis leads to ending a study early when either:

- a considerable number of patients experience serious side effects, or
- the evidence shows that the results at the end of the trial are going to be negative. [10]

Conclusion*

All interim analyses should be planned in advance and carefully laid out before any analysis is done. Avoid any unplanned interim analysis. The schedule of analyses (or the considerations that determine it), the stopping guidelines, and their properties should also be carefully planned before the first interim analysis. When there is a Data Monitoring Committee (DMC), it should approve the planned procedure. Any changes to the trial, and any resulting changes to the statistical procedures, should be specified in the protocol at the earliest opportunity. The main goal when selecting procedures should be to ensure that the overall probability of a Type I error is controlled.
There is a price to pay for an interim analysis: you have to live with a smaller alpha level at the end of your study. [10]

*[1], [4], [8]

References

[1] Meinert, Curtis L. Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.
[2] Everitt, B.S. Statistical Methods for Medical Investigations. Oxford University Press.
[3] astor.som.jhmi.edu/~esg/TALKS/interimanalyses.ppt
[4] www.bio.uu.nl/~biostat/seminar%20jc.ppt
[5] Spiegelhalter D.J., Freedman L.S., Parmar M.K.B. Bayesian approaches to randomized trials. J. R. Statist. Soc. A (1994) 157, 357-416.
[6] http://www.nurs.uoa.gr/emr-ibs/lecture2.pdf
[7] www.cde.org.tw/documents/ activities/download/sw/(3)%20.ppt
[8] http://psgmac43.ucsf.edu/ticr/syllabus/courses/26/2005/01/20/lecture/notes/lecture3_20jan05.pdf
[9] http://www.bath.ac.uk/~mascj/psislides.pdf
[10] http://www.cmh.edu/stats/plan/interim.asp
[11] Eva Skovland. Repeated Significance Test on Accumulating Survival Data. The Norwegian Cancer Society and Section of Medical Statistics, University of Oslo, Oslo, Norway.