Using Permutation Tests and Bootstrap Confidence Limits to Analyze Repeated Events Data from Clinical Trials

Laurence Freedman, MA
Biometry Branch, National Cancer Institute, Bethesda, Maryland

Richard Sylvester, ScD
EORTC Data Center, Bruxelles, Belgium

David P. Byar, MD
Chief, Biometry Branch, National Cancer Institute, Bethesda, Maryland

ABSTRACT

In clinical trials comparing treatments for superficial bladder cancer, patients are at risk of repeated recurrences of their disease. Statistical methods of analyzing such data are required. This article presents a nonparametric approach. A statistical test to compare the recurrence or tumor rates in two treatment groups, using the randomization distribution, is described. Confidence intervals for the rate ratio are determined from the bootstrap distribution. The implementation of both requires Monte Carlo methods. Computer simulations support the use of these nonparametric methods when there are more than 60 recurrences in each treatment group. An example illustrating their use is given. The strategy adopted for analysis of these data could be applied to other clinical trials where standard methodology is inappropriate.

KEY WORDS: repeated events, permutation test, randomization test, bootstrap confidence limits

Address reprint requests to: Laurence Freedman, Biometry Branch, National Cancer Institute, Executive Plaza North, Room 344, 9000 Rockville Pike, Bethesda, MD
Received April 25, 1988; revised August 26,
Controlled Clinical Trials 10: (1989)

INTRODUCTION

The analysis of data from clinical trials has received much attention from biostatisticians. Methods of analysis are well developed for many commonly encountered circumstances, including response data that are univariate, continuous, or discrete, and possibly subject to censorship. However, there are still some types of data for which methods are not well developed. One of these is where the response to treatment consists of a series of observations made over time. These observations may or may not be regularly spaced in time and may be continuous or categorical. This article discusses one such example arising from trials of the treatment of superficial bladder cancer.

The problems confronting the biostatisticians in these less-charted areas are (1) the definition of useful summary measures of response (data reduction) and (2) statistical estimation and testing of the defined measure or measures. In our example the definition of a summary measure arises fairly naturally and the article concentrates on the second aspect. Nonparametric methods are advocated because of doubts about the appropriateness of simple probability models. These methods may be of use in a wider context, as outlined in the Discussion section.

Background

In most cancer clinical trials, classical outcomes such as those based on the response rate, time to progression, and duration of survival are appropriate for the assessment and comparison of effects of treatment. In trials measuring the time to an event, it is the first occurrence of the event that is generally of interest, with the patient no longer being followed for subsequent occurrences of the same event. In trials comparing treatments for superficial bladder cancer, the situation is different, however, because patients are followed for multiple occurrences of the same event [1,2]. Thus special techniques are required for efficient analysis of the data.

We shall assume that on entry to the trial patients in two or more treatment groups with histologically confirmed Ta and T1 superficial bladder cancer undergo a transurethral resection (TUR) to remove all visible tumor in the bladder. Patients are then followed at regular intervals by cystoscopy for some minimum period of time, for example, every 3 months for a period of at least 1 year.
If during a cystoscopy a tumor recurrence is noted, a TUR is again performed to remove all visible lesions and the patient continues on his assigned treatment. Typical data are presented in Figure 1, where the line lengths represent the duration of follow-up for individual patients and the open circles represent cystoscopies at which recurrences were detected.

FIGURE 1. Diagram showing follow-up of patients treated for superficial bladder cancer. Each line represents the follow-up of one patient. The open circles indicate examinations at which a recurrent tumor was found. Horizontal axis: completed months of follow-up.

Sylvester [1] and Byar et al. [2] have noted that common outcomes such as the percent of patients with recurrence, the percent with recurrence at a given time, and the disease-free interval or time to first recurrence are inefficient, essentially because they make no use of data collected after the first recurrence. Other classical outcomes such as time to progression (increase in T category or appearance of distant metastases) and duration of survival are likewise inappropriate since these events can be expected to occur in a maximum of only 10%-15% of the patients under study. This article presents summary measures of treatment effectiveness that are appropriate for such trials and that may serve as a basis for treatment comparisons.

STATISTICAL METHOD

Notation

Assume that there are $K$ treatment groups with $n_k$ patients in the $k$th group ($k = 1, \ldots, K$). The $i$th patient in the $k$th group is followed for a period of length $t_{ik}$, and during this period there are $r_{ik}$ recurrences (i.e., examinations where tumor is found) and $s_{ik}$ tumors observed (since more than one tumor may be observed at an examination). We denote the total observation period in the $k$th group by $T_k = \sum_{i=1}^{n_k} t_{ik}$, the total number of recurrences by $R_k = \sum_{i=1}^{n_k} r_{ik}$, and the total number of tumors by $S_k = \sum_{i=1}^{n_k} s_{ik}$.

Measuring Treatment Effect

If the intervals between each recurrence of tumor are independently and identically distributed across patients and time with the exponential probability distribution function, then the recurrences may be considered to form a Poisson renewal process [3].

For the $k$th treatment group the maximum likelihood estimate of the recurrence rate $\lambda_k$ for that group is given by $\hat{\lambda}_k = R_k/T_k$. This estimate has the appeal of simplicity and may be regarded as an average of the individual rates $r_{ik}/t_{ik}$ weighted by $t_{ik}/T_k$, that is, the contribution of the $i$th patient in treatment group $k$ to the total follow-up time for that group.

It may be objected that this estimate is of little use, since the Poisson renewal process assumption is unlikely to hold true in practice. Departures from the Poisson model are indeed readily seen from clinical trial data. The Poisson model predicts, for example, that the recurrence rate will be constant throughout the follow-up period, whereas data from trials show a recurrence rate that decreases sharply after the first 3 months following the initial treatment [1,4]. However, even when the recurrence rate varies with time, the estimate $\hat{\lambda}_k$ will still represent a measure of the average recurrence rate over the period of observation.

The recurrence rate, $\lambda_k$, can be generalized to define a tumor rate, $\mu_k$, which describes the rate at which individual tumors recur in the bladder. Again, under a Poisson renewal process the maximum likelihood estimate of $\mu_k$ is $\hat{\mu}_k = S_k/T_k$. Thus instead of simply using the presence or absence of tumor as a criterion, this measure incorporates the number of tumors found at each cystoscopy.

Comparison of Treatment Effects

Parametric approach

Under the assumption that the recurrences form a Poisson renewal process, the total number of recurrences $R_k$ has a Poisson distribution, with mean dependent on the recurrence rate $\lambda_k$. Specifically the density function is

$$f(r_k) = \frac{e^{-\lambda_k T_k} (\lambda_k T_k)^{r_k}}{r_k!}.$$

Potthoff and Whittinghill [5] provide three tests that may be used to test whether the recurrence rates $\lambda_k$ in each group are equal.
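To make the two summary measures concrete, the following minimal sketch computes the recurrence rate (total recurrences divided by total follow-up time) and the tumor rate (total tumors divided by total follow-up time) for one treatment group, and checks numerically that the recurrence rate equals the follow-up-weighted average of the individual patient rates. The record layout, the function names, and the data are illustrative assumptions and do not come from the article.

```python
from typing import Sequence, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (follow-up time in years, number of recurrences, number of tumors).
Patient = Tuple[float, int, int]


def recurrence_rate(group: Sequence[Patient]) -> float:
    """Total recurrences in the group divided by total follow-up time."""
    total_time = sum(t for t, _, _ in group)
    return sum(r for _, r, _ in group) / total_time


def tumor_rate(group: Sequence[Patient]) -> float:
    """Total tumors in the group divided by total follow-up time."""
    total_time = sum(t for t, _, _ in group)
    return sum(s for _, _, s in group) / total_time


def weighted_individual_rate(group: Sequence[Patient]) -> float:
    """Average of individual rates r/t, each weighted by its share of follow-up."""
    total_time = sum(t for t, _, _ in group)
    return sum((r / t) * (t / total_time) for t, r, _ in group)


# Made-up data for a single treatment group.
group = [(2.0, 3, 5), (1.0, 0, 0), (1.5, 1, 1)]

print("recurrence rate per year:", recurrence_rate(group))
print("tumor rate per year:", tumor_rate(group))
print("weighted average of individual rates:", weighted_individual_rate(group))
```

The first and third printed values agree up to rounding, which is the weighted-average interpretation of the estimate described in the text.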
In the special case when there are just two treatment groups ($K = 2$), an estimate of the ratio of the two recurrence rates, $\hat{\lambda}_1/\hat{\lambda}_2$, may be calculated. One may then compare the treatments by testing the null hypothesis that $\lambda_1/\lambda_2 = 1$ against a one- or two-sided alternative by noting that $\hat{\lambda}_1/\hat{\lambda}_2$ has (under the null hypothesis) a central F distribution. Confidence limits for the ratio of the recurrence rates may also be determined [6-8].

As noted in the section on Measuring Treatment Effect, the Poisson assumption is, in practice, not always justified. The validity of the statistical test and confidence limits based on the F distribution is therefore questionable.

The next two sections develop distribution-free procedures for significance testing and interval estimation.

Nonparametric approach: statistical testing

When there are two treatment groups a randomization test may be used to test the null hypothesis that $\lambda_1 = \lambda_2$. This test makes no assumptions about the distribution of times to recurrence, either for individual patients or for groups. The validity of the test requires only that the two groups of patients be followed in a comparable manner. A longer interval of follow-up or an increased frequency of examination in one treatment group can bias its comparison with another group.

The idea behind the randomization test is as follows. If the total number of patients is denoted by $N$, where $N = n_1 + n_2$, then the number of possible ways of allocating the $N$ patients in the study with $n_1$ in treatment group 1 and $n_2$ in group 2 equals $\binom{N}{n_1}$. For each allocation one could compute $\hat{\lambda}_1$, $\hat{\lambda}_2$, and their ratio $\hat{\lambda}_1/\hat{\lambda}_2$. Assuming that the observed ratio from the actual experiment is greater than 1, then the one-tailed p value is simply the number of permutations giving ratios greater than the observed ratio, divided by the total number of possible permutations, that is, $\binom{N}{n_1}$. The two-tailed p value would be the number of permutations giving ratios greater than the observed ratio or less than its reciprocal, divided by the total number of possible permutations.

In practice the number $\binom{N}{n_1}$ is too large, except for very small values of $n_1$ and $n_2$, to allow calculation of the recurrence rate ratio for every possible permutation. One therefore adopts a Monte Carlo approach and selects randomly, a fixed number of times, from the $\binom{N}{n_1}$ permutations. A selection of 2000 samples provides a 95% confidence interval of about for a true p value of

The same approach may be used to compare the tumor rates $\mu_1$ and $\mu_2$ in two treatment groups.
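The Monte Carlo randomization test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the record layout (follow-up time, number of recurrences), the function names, the default of 2000 random reallocations, and the fixed seed are all choices made for this example. Following the description in the text, a reallocation counts as extreme when its ratio is strictly greater than the observed ratio (or, for the two-tailed test, also when it is strictly less than the reciprocal). When the observed ratio is below 1, the sketch mirrors the one-tailed direction, which is also an assumption.

```python
import math
import random
from typing import Sequence, Tuple

# Hypothetical record layout: (follow-up time, number of recurrences).
Patient = Tuple[float, int]


def rate(group: Sequence[Patient]) -> float:
    """Total recurrences divided by total follow-up time."""
    return sum(r for _, r in group) / sum(t for t, _ in group)


def rate_ratio(g1: Sequence[Patient], g2: Sequence[Patient]) -> float:
    """Ratio of group rates; infinite if only the denominator rate is zero."""
    r1, r2 = rate(g1), rate(g2)
    if r2 == 0:
        return math.inf if r1 > 0 else math.nan
    return r1 / r2


def randomization_test(g1, g2, n_perm: int = 2000, seed: int = 0):
    """Return (one-tailed, two-tailed) Monte Carlo p values for the rate ratio."""
    observed = rate_ratio(g1, g2)
    if not (0 < observed < math.inf):
        raise ValueError("sketch assumes both groups have at least one recurrence")
    upper = max(observed, 1 / observed)
    lower = 1 / upper
    rng = random.Random(seed)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    one_tail = two_tail = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reallocation of patients to the two groups
        ratio = rate_ratio(pooled[:n1], pooled[n1:])
        if (ratio > observed) if observed >= 1 else (ratio < observed):
            one_tail += 1
        if ratio > upper or ratio < lower:
            two_tail += 1
    return one_tail / n_perm, two_tail / n_perm


# Made-up data: group 1 recurs more often than group 2.
group1 = [(1.0, 3), (2.0, 5), (1.5, 4), (1.0, 2), (2.0, 6)]
group2 = [(1.0, 1), (2.0, 1), (1.5, 2), (1.0, 0), (2.0, 2)]
print(randomization_test(group1, group2))
```

Because whole patients are reallocated, each patient's follow-up time and recurrence count stay together, which is why the test needs no assumption about the distribution of times to recurrence.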
Nonparametric approach: interval estimation

The randomization test just described provides only a p value for testing the null hypothesis $\lambda_1 = \lambda_2$. Confidence limits for the ratio of the recurrence rates $\lambda_1/\lambda_2$ usefully complement p values when interpreting trial results. Confidence limits show us what range of estimates for $\lambda_1/\lambda_2$ are consistent with the data, a particularly important consideration when we wish to decide whether statistically significant results are clinically relevant or whether nonsignificant results simply reflect an inadequate sample size. Proper interpretation of clinical trial results, therefore, requires a knowledge of the confidence limits for the summary measure of the treatment effect [9,10].

The bootstrap is a general nonparametric method for estimation that may also be used to construct confidence limits [11]. The basic idea is quite simple. For the $k$th treatment group, the empirical distribution function of the number of recurrences and total observation time for one individual is given by the discrete distribution with probability of $1/n_k$ on each joint observation $(r_{ik}, t_{ik})$. Since the true distribution is not known, we use the empirical distribution function in its place. The sampling distribution of $\hat{\lambda}_k$ may be estimated by considering all possible samples of size $n_k$ made by drawing with replacement from the original $n_k$ observations. For each such sample, of which there are $\binom{2n_k - 1}{n_k}$ distinct ones, a value of $\hat{\lambda}_k$ may be calculated. Thus, in principle, the bootstrap sampling distribution of $\hat{\lambda}_k$ may be determined. As with the randomization test, this number is usually too large to enumerate the bootstrap distribution, so Monte Carlo methods are used.

For the purposes of this article we will concentrate on confidence limits for the ratio $\lambda_1/\lambda_2$ rather than for the rates $\lambda_1$ or $\lambda_2$ themselves. Conditioning on the observed sample sizes, we draw two samples with replacement, one of size $n_1$ from group 1 and one of size $n_2$ from group 2, and we calculate $\hat{\lambda}_1/\hat{\lambda}_2$ for each pair of samples. This is repeated a large number of times, so that a sampling distribution of $\hat{\lambda}_1/\hat{\lambda}_2$ is constructed. The confidence limits are then found by simple reference to the appropriate percentiles of the sampling distribution. This procedure is called the "percentile method."

For distributions that are median-biased, that is, $P(\hat{\lambda}_1/\hat{\lambda}_2 \le \lambda_1/\lambda_2) \ne 0.5$, Efron [11,12] has proposed a "bias-corrected (BC) percentile method." However, this procedure is not needed in the following simulations since our observed ratio $\hat{\lambda}_1/\hat{\lambda}_2$ is distributed as $(\lambda_1/\lambda_2) F(2r_1, 2r_2)$ and $P(F \le 1) = 0.5$.

The same approach may be used to obtain confidence limits for the ratio of tumor rates, $\mu_1/\mu_2$. In the next section we compare, by computer simulation, the results using the F distribution with the randomization test and bootstrap methods.
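The percentile method for the rate ratio can be sketched in a few lines. The code below is an illustration under assumptions and not the authors' program: the record layout (follow-up time, number of recurrences), the function names, the default of 1000 bootstrap samples, the fixed seed, and the particular rule used to pick percentile positions in the sorted list are all choices made for this example.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical record layout: (follow-up time, number of recurrences).
Patient = Tuple[float, int]


def rate(group: Sequence[Patient]) -> float:
    """Total recurrences divided by total follow-up time."""
    return sum(r for _, r in group) / sum(t for t, _ in group)


def bootstrap_ratio_limits(
    g1: Sequence[Patient],
    g2: Sequence[Patient],
    n_boot: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-method confidence limits for rate(g1) / rate(g2)."""
    rng = random.Random(seed)
    ratios: List[float] = []
    for _ in range(n_boot):
        # Resample whole patients with replacement, keeping group sizes fixed.
        s1 = rng.choices(g1, k=len(g1))
        s2 = rng.choices(g2, k=len(g2))
        r2 = rate(s2)
        # Assumption of this sketch: a zero denominator rate is recorded as +inf.
        ratios.append(rate(s1) / r2 if r2 > 0 else math.inf)
    ratios.sort()
    alpha = 1.0 - level
    # One simple percentile convention among several possible ones.
    lo_idx = math.floor((alpha / 2) * (n_boot - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (n_boot - 1))
    return ratios[lo_idx], ratios[hi_idx]


# Made-up data for two treatment groups.
group1 = [(1.0, 2), (2.0, 1), (1.5, 3), (1.0, 1)]
group2 = [(1.0, 4), (2.0, 6), (1.5, 5), (1.0, 3)]
print("point estimate:", rate(group1) / rate(group2))
print("95% percentile limits:", bootstrap_ratio_limits(group1, group2))
```

Resampling patients rather than individual recurrences keeps each patient's follow-up time paired with that patient's recurrence count, matching the joint observations on which the empirical distribution is defined.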
A FORTRAN computer program (available on request) has been written to perform the randomization tests and calculate the confidence intervals for the ratios of both recurrence and tumor rates. For trials involving more than two treatment groups, the groups are compared pairwise.

SIMULATION RESULTS FOR RANDOMIZATION TEST AND BOOTSTRAP CONFIDENCE LIMITS FOR EXPONENTIALLY DISTRIBUTED DATA

To test the performance of the nonparametric methods, data were simulated by computer to represent those arising from a Poisson renewal process. Specifically, times to recurrence were generated to follow an exponential distribution. Data for two treatment groups were generated and the ratios of the recurrence rates were set at 1.0, 1.5, 2.0, and 3.0. Equal numbers of patients in each group were included, ranging through 5, 10, 20, 60, and 100 per group. Each patient had exactly one recurrence. In addition two sets of simulations allowed exactly three recurrences per patient with 10 and 30 patients per group, respectively. In both situations, F test results are exact [13]. We sampled 500 and 1000 times from both the randomization and bootstrap distributions.

In a practical situation one is unlikely to know the exact distribution of the rate ratio under the null hypothesis. However, when, as in this simulation, the exact distribution is known, one would desire that the randomization test would lead to closely similar results. For selected rate ratios, comparison of the p values in Table 1 permits us to judge how closely the randomization test results reproduce those of the F test. In this table 500 experiments were performed for each row and the fractions of these significant by the F test at the four selected p values were compared to the fractions whose one-tailed p values for the randomization test were less than or equal to the same four selected p values. This approach effectively compares the average performances of the two tests.

Table 1. Comparison of F Distribution and Randomization Tests. Columns: recurrence rate ratio; number of patients per group; fraction significant by F test at each selected p value; fraction of one-tailed randomization test p values less than or equal to each selected p value.

By considering the case where $n_1 = n_2 = 1$ it becomes clear that the randomization distribution does not approximate the F distribution well for very small samples. The results in Table 1 show close agreement for the two tests even when the number of recurrences in each group is as small as 5. Even though the percentages of rejection when the recurrence rate ratio equals 1.0 exceed the nominal levels somewhat, it is the agreement of the two tests that is important. These randomization test results were based on sampling 500 times from the randomization distribution. Results for sampling 1000 times showed no systematic improvement and are therefore not shown.

The proportions of significant F tests in Table 1 for a recurrence rate ratio of 1.0 are somewhat higher than their nominal values. We repeated the simulation with another 500 experiments and this time obtained values close to those expected. We conclude that the discrepancy noted in Table 1 is a chance occurrence.

The results comparing the bootstrap and the F distribution confidence limits are shown in Table 2. The bootstrap confidence limits are generally narrower than the correct limits calculated according to the F distribution for simulations where the number of recurrences is 60 or fewer. For simulations with 120 or more recurrences the bootstrap confidence limits agree quite well with the F distribution and there is no sign of bias towards intervals that are too narrow or conversely too wide.

Table 2. Comparison of Confidence Limits from F Distribution and Bootstrap. Columns: recurrence rate ratio; number of patients per group; number of recurrences per patient; total number of recurrences; estimate of rate ratio; confidence limits from the F distribution and from the bootstrap with B = 500 and B = 1000.
(a) B = number of bootstrap samples.

EXAMPLE

We refer to the data set published in Byar et al. [2], in which 61 patients in 3 treatment groups had a total of 82 recurrences and 116 tumors. Table 3 contains a summary of the data with the estimated recurrence rates and tumor rates. Table 4 shows the results of the pairwise comparisons, including significance tests using the randomization distribution, recurrence rate and tumor rate ratio estimates, and their 95% bootstrap confidence limits. There is good agreement between the significance tests and the confidence limits: where a two-tailed test has a p value less than 0.05, the bootstrap 95% confidence limits exclude the value unity, and the converse is also true.

Table 3. Summary of Data Set in Byar et al. [2]. Columns: treatment group; total. Rows: number of patients; number of patients with recurrence; number of recurrences; number of tumors; total years of follow-up; recurrence rate (per year); tumor rate (per year).

Table 4. Pairwise Comparisons of Treatment Groups for Data Set in Byar et al. [2]

                                         Groups 1 vs 2   Groups 1 vs 3   Groups 2 vs 3
Recurrence rates
  Randomization test p value (a)
  Estimate of rate ratio
  95% bootstrap confidence limits (b)    (0.16, 1.19)    (0.11, 0.74)    (0.31, 1.33)
Tumor rates
  Randomization test p value (a)
  Estimate of rate ratio
  95% bootstrap confidence limits (b)    (0.18, 1.48)    (0.11, 0.92)    (0.37, 1.69)

(a) Two-tailed test: 500 random permutations.
(b) 1000 bootstrap samples.

DISCUSSION

Schenker [14] has expressed qualms about the general usefulness of bootstrap methods for setting confidence limits and illustrates his concern with the problem of placing 90% confidence limits on the variance estimate for a normal distribution if the sample size is too small. He found that the limits were too narrow for sample sizes under 100. The same point is made in a different way by Efron and Tibshirani [15]. These results agree with the intuitive notion that one cannot expect repeated bootstrap sampling to provide an accurate representation of the tails of a distribution if the sample is too small. This point can be illustrated by considering very small samples, say three to five observations, drawn from normal distributions because then the bootstrap distribution can be enumerated exhaustively and the expected values of the most extreme points of these distributions can be determined using the expected values and order statistics for the normal distribution. For example, it may be shown that for samples of three from a normal distribution with zero mean and unknown variance, the upper 92.6% confidence limit has an expected value 2.02 whereas the expected value for the same bootstrap upper limit is

Recently Efron [16] has proposed improvements on bias-corrected bootstrap confidence limits that are expected to work much better in small samples, although he comments that, "small sample non-parametric confidence intervals are far from well understood... and should be interpreted with some caution." Readers interested in more details will find a number of other articles in the same issue as this recent contribution by Efron as well as commentary on Efron's article from a number of noted statisticians.

The methods we have proposed for testing hypotheses and setting confidence limits for clinical trials having multiple outcomes have the appeal of great generality and absence of parametric assumptions. We have demonstrated that the methods behave satisfactorily for exponentially distributed data when compared to hypothesis testing or confidence limits based on the F distribution.
Although these results are extremely encouraging, they do not prove that these methods will work well in all situations for samples of moderate size, but it is certain, from asymptotic theory, that they would behave well for large samples.

Extensions of these techniques to allow for more sophisticated analyses are possible. For example, in analyzing data from a clinical trial one might choose to divide the follow-up period into several intervals and compare treatments separately within them to check for time-varying treatment effects. Another possible extension might involve adjustment for covariates. This could be accomplished by repeatedly applying a parametric or semiparametric adjustment procedure for each sample drawn from the permutation or bootstrap distributions. For instance, in our example one may use a log-linear model to adjust for imbalance in covariates such as initial tumor size. We have found it possible to program this repeated procedure within the statistical computer package for generalized linear modeling known as GLIM [17]. Another possible extension is to develop permutation tests for simultaneous comparison of three or more treatment groups, rather than relying solely on pairwise comparisons.

Although we have concentrated on the analysis of recurrence or tumor rates in superficial bladder cancer, the approaches we have suggested are quite general and could be applied to a variety of problems for which standard methods are not available. Possible applications might include cancer prevention trials where the object is to cause regression or disappearance of precancerous lesions that may reappear on more than one future occasion.

Such lesions might include colonic polyps, leukoplakia in the mouth, dysplasia in the esophagus, or skin tumors. Another possible area of application might include serial measurement of quality of life parameters or measures of toxicity following treatment of chronic diseases.

Following the lines of this article, the plan would be firstly to define useful summary measures of the data and then to use the nonparametric methods described above to make statistical comparisons and provide the appropriate confidence limits for such measures. Although in our example the choice of summary measures (recurrence rate ratio and tumor rate ratio) was a fairly natural one, in other circumstances the choice will require considerable thought. For example, summarizing quality of life data is a difficult problem that has been discussed by several authors [18,19]. Nevertheless once the summary measures have been chosen, the nonparametric methods illustrated in this article provide a simple and effective solution to the questions of statistical inference and interval estimation.

REFERENCES

1. Sylvester R: The analysis of results in prophylactic superficial bladder cancer studies. In: EORTC Genitourinary Group Monograph 2, Part B: Superficial Bladder Tumors, Schroeder FH, Richards B, Eds. New York: Alan R. Liss, 1985, pp
2. Byar D, Kaihara S, Sylvester R, Freedman L, Hannigan J, Koiso K, Oohashi Y, Tsugawa R: Statistical analysis techniques and sample size determination for clinical trials of treatments for bladder cancer. In: Developments in Bladder Cancer, Denis L, Niijima T, Prout G, Schröder F, Eds. New York: Alan R. Liss, 1986, pp
3. Cox DR, Miller HD: The Theory of Stochastic Processes. London: Methuen, 1965, pp
4. MRC Working Party on Urological Cancer: The effect of intravesical thiotepa on the recurrence rate of newly diagnosed superficial bladder cancer. Br J Urol 57: ,
5. Potthoff RF, Whittinghill M: Testing for homogeneity. II. The Poisson distribution. Biometrika 53: ,
6. Gehan EA: Statistical methods for survival time studies. In: Cancer Therapy: Prognostic Factors and Criteria of Response, Staquet MJ, Ed. New York: Raven Press, 1975, pp
7. Lee ET: Statistical Methods for Survival Data Analysis. Belmont, CA: Lifetime Learning Publications, 1980, pp
8. Gross AJ, Clark VA: Survival Distributions: Reliability Applications in the Biomedical Sciences. New York: Wiley, 1975, pp
9. Altman DG, Gore SM, Gardner MJ, Pocock SJ: Statistical guidelines for contributors to medical journals. Br Med J 286: ,
10. Pocock SJ: Current issues in the design and interpretation of clinical trials. Br Med J 290:3942,
11. Efron B: Nonparametric standard errors and confidence intervals. Can J Stat 9: ,
12. Efron B: The jackknife, the bootstrap, and other resampling plans. In: National Science Foundation-Conference Board of the Mathematical Sciences Monograph 38. Philadelphia: Society for Industrial and Applied Mathematics,
13. Cox DR: Some simple approximate tests for Poisson variates. Biometrika 40: , 1953
14. Schenker N: Qualms about bootstrap confidence intervals. J Am Stat Assoc 80: ,
15. Efron B, Tibshirani R: The bootstrap method for assessing statistical accuracy. Behaviormetrika 17:1-35 (section 5),
16. Efron B: Better bootstrap confidence intervals. J Am Stat Assoc 82: ,
17. McCullagh P, Nelder JA: Generalized Linear Models. London: Chapman and Hall, 1983, Appendix E, p
18. Nou E, Aberg T: Quality of survival in patients with surgically treated bronchial carcinoma. Thorax 35: ,
19. Fayers PM, Jones DR: Measuring and analyzing quality of life in cancer clinical trials: A review. Stat Med 2: , 1983