Sample size calculation for multiple testing in microarray data analysis


Biostatistics (2005), 6, 1, pp doi: /biostatistics/kxh026

SIN-HO JUNG
Department of Biostatistics and Bioinformatics, Duke University, Box 2716, Durham, NC 27705, USA

HEEJUNG BANG
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

STANLEY YOUNG
National Institute of Statistical Sciences, Research Triangle Park, NC 27709, USA

SUMMARY

Microarray technology is rapidly emerging for genome-wide screening of genes that are differentially expressed between clinical subtypes or different conditions of human diseases. Traditional statistical testing approaches, such as the two-sample t-test or the Wilcoxon test, are frequently used to evaluate the statistical significance of informative expressions, but they require adjustment for large-scale multiplicity. Because of its simplicity, the Bonferroni adjustment has been widely used to address this problem. It is well known, however, that the standard Bonferroni test is often very conservative. In the present paper, we compare three multiple testing procedures in the microarray context: the original Bonferroni method, a Bonferroni-type improved single-step method and a step-down method. The latter two methods are based on nonparametric resampling, by which the null distribution can be derived with the dependency structure among gene expressions preserved and the family-wise error rate accurately controlled at the desired level. We also present a sample size calculation method for designing microarray studies. Through simulations and data analyses, we find that the proposed methods for testing and sample size calculation are computationally fast and control error and power precisely.

Keywords: Adjusted p-value; Bonferroni; Multi-step; Permutation; Simulation; Single-step.

1. INTRODUCTION

The DNA microarray is a biotechnology for genome-wide screening and monitoring of expression levels in cells for thousands of genes simultaneously, and it has been applied extensively to a broad range of problems in biomedical fields (Golub et al., 1999; Alizadeh and Staudt, 2000; Sander, 2000). A primary aim is often to reveal the association between expression levels and an outcome or other risk factor of interest. Golub et al. (1999) explored about 7000 genes extracted from bone marrow in 38 patients, 27 with acute lymphoblastic leukemia (ALL) and 11 with acute myeloid leukemia (AML), in order to identify the susceptible genes with potential clinical heterogeneity in the two subclasses of leukemia. Genes useful for distinguishing ALL from AML may provide insight into cancer pathogenesis and patient treatment.

Biostatistics Vol. 6 No. 1 © Oxford University Press 2005; all rights reserved.

The authors concluded, relying on what they called neighborhood analysis, that roughly 1100 genes were more highly correlated with the AML-ALL class distinction; they then arbitrarily selected the top 50 genes for intensive research. This data set has been referred to and reanalyzed by many other researchers (Thomas et al., 2001; Pan, 2002; Dudoit et al., 2003; Ge et al., 2003). Because different methods and assumptions were adopted, the statistical inference obtained from the same data set has varied widely with respect to the observed significance and the number of genes declared significant (Pan, 2002; Dudoit et al., 2003). Traditional statistical testing procedures, such as two-sample t-tests or Wilcoxon rank sum tests, are frequently used to determine the statistical significance of differences in gene expression patterns. These approaches, however, face serious multiplicity, since a very large number of hypotheses are to be tested while the number of experimental units studied is relatively small, from tens to a few hundred (West et al., 2001). If we use a per-comparison type I error rate α in each test, the probability of rejecting any null hypothesis when all null hypotheses are true, which is called the family-wise error rate (FWER), will be greatly inflated. To avoid this pitfall, the Bonferroni test is the one most commonly used in this field, despite its well-known conservativeness. Although Holm (1979) and Hochberg (1998) reduced this conservativeness by devising multi-step testing procedures, they did not exploit the dependency of the test statistics, and consequently the resulting improvement is often minor. Later, Westfall and Young (1989, 1993) proposed adjusting p-values in a state-of-the-art step-down manner using a simulation or resampling method, by which dependency among the test statistics is effectively incorporated.
Westfall and Wolfinger (1997) derived exact adjusted p-values for a step-down method for discrete data. Recently, Westfall and Young's permutation-based test was introduced to microarray data analysis and strongly advocated by Dudoit and her colleagues. Troendle et al. (2004) favor the permutation test over bootstrap resampling because of the latter's slow convergence in high-dimensional data. Various multiple testing procedures and error control methods applicable to microarray experiments are well documented in Dudoit et al. (2003). Which test to use among this bewildering variety of choices should be judged by relevance to the research questions, validity (of the underlying assumptions), type of control (strong or weak) and computability. The Bonferroni-type single-step procedure, however, remains attractive because it is easy to calculate and interpret. Comparisons between single-step and multi-step testing procedures have been discussed briefly in several papers, but there has been little attempt to compare their theoretical and numerical properties, especially in the microarray framework. A stepwise procedure does not offer a critical value, while Bonferroni's critical value is fixed by the number of comparisons. Neither provides a simple way to calculate the minimal sample size for a designated power. Sample size estimation in this area is also an important problem, as indicated in Golub et al. (1999), where the authors called for larger studies because they were uncertain about the statistical power. In this article, we compare the Bonferroni, resampling-based single-step and step-down multiple testing procedures through simulation and a real data example. The null distribution of the test statistics is approximated by permutation, which is nonparametric in that it does not require specification of the joint distribution of the test statistics and hence of the p-values. Adjusted p-values are also derived as better-suited summaries of the evidence against the null.
Most importantly, we show that the single-step test provides a simple and accurate method for sample size determination that can also be used for multi-step tests.

2. MULTIPLE TESTING PROCEDURES: REVIEW

2.1 Single-step vs. multi-step

Suppose that there are $n_1$ subjects in group 1 and $n_2$ subjects in group 2. Gene expression data for $m$ genes are measured from each subject. We want to identify the informative genes, i.e. those that are differentially expressed between the two groups. Let $(X_{1i1},\ldots,X_{1im})$ denote the gene expression levels obtained from subject $i$ $(= 1,\ldots,n_1)$ in group 1 and $(X_{2i1},\ldots,X_{2im})$ similarly for subject $i$ $(= 1,\ldots,n_2)$ in group 2. Let $\mu_1 = (\mu_{11},\ldots,\mu_{1m})$ and $\mu_2 = (\mu_{21},\ldots,\mu_{2m})$ represent the respective mean vectors. In order to test whether gene $j$ $(= 1,\ldots,m)$ is not differentially expressed between the two conditions, i.e. $H_j: \mu_{1j} - \mu_{2j} = 0$, we may use the t-test statistic

$$T_j = \frac{\bar X_{1j} - \bar X_{2j}}{S_j\sqrt{n_1^{-1} + n_2^{-1}}}$$

where $\bar X_{kj}$ is the sample mean in group $k$ $(= 1, 2)$ and $S_j^2 = \{\sum_{i=1}^{n_1}(X_{1ij} - \bar X_{1j})^2 + \sum_{i=1}^{n_2}(X_{2ij} - \bar X_{2j})^2\}/(n_1 + n_2 - 2)$ is the pooled sample variance for the $j$th gene.

Suppose that our interest lies in identifying any genes overexpressed in group 1. This question can be stated as multiple one-sided tests of $H_j$ vs. $\bar H_j: \mu_{1j} > \mu_{2j}$ for $j = 1,\ldots,m$. Two-sided tests, as a simple extension, will be discussed briefly later and in Appendix 1.

A single-step procedure adopts a common critical value $c$ and rejects $H_j$, in favor of $\bar H_j$, when $T_j > c$. In this case, the FWER fixed at α is defined as

$$\alpha = P(T_1 > c \text{ or } T_2 > c, \ldots, \text{ or } T_m > c \mid H_0) = P(\max_{j=1,\ldots,m} T_j > c \mid H_0) \tag{1}$$

where $H_0: \mu_{1j} = \mu_{2j}$ for all $j = 1,\ldots,m$, or equivalently $H_0 = \cap_{j=1}^m H_j$, is the complete null hypothesis, and the relevant alternative hypothesis is $H_a = \cup_{j=1}^m \bar H_j$. In order to control the FWER below the nominal level α, Bonferroni uses $c = c_\alpha = t_{n_1+n_2-2,\alpha/m}$, the upper α/m-quantile of the t-distribution with $n_1 + n_2 - 2$ degrees of freedom, assuming normality of the expression data, or $c = z_{\alpha/m}$, the upper α/m-quantile of the standard normal distribution, based on asymptotic normality. If gene expression levels are not normally distributed, the assumption of a t-distribution may be violated. Furthermore, $n_1$ and $n_2$ are usually not large enough to warrant a normal approximation.
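As a small illustration of the quantities just defined, the following Python sketch computes the pooled-variance statistic $T_j$ for a single gene and the normal-approximation Bonferroni critical value $z_{\alpha/m}$. The function names `pooled_t` and `bonferroni_z` are hypothetical and chosen only for this example; the sketch uses the asymptotic normal quantile rather than the t quantile because the Python standard library offers the former directly.

```python
import math
from statistics import NormalDist, mean

def pooled_t(x1, x2):
    """Pooled-variance two-sample t-statistic T_j for one gene.

    x1, x2: expression values of that gene in group 1 and group 2.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Pooled sample variance S_j^2 with n1 + n2 - 2 degrees of freedom.
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))

def bonferroni_z(alpha, m):
    """Upper alpha/m quantile of N(0, 1): the asymptotic Bonferroni cut-off."""
    return NormalDist().inv_cdf(1 - alpha / m)
```

For instance, `bonferroni_z(0.05, 1000)` gives the one-sided cut-off that each of 1000 statistics would have to exceed under the normal approximation.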
Even if the assumed conditions are met, the Bonferroni procedure is conservative for correlated data. In fact, microarray data are collected from the same individuals and experience co-regulation, so they are expected to be correlated. Motivated by these limitations, together with the relationship in (1), we derive the distribution of $W = \max_{j=1,\ldots,m} T_j$ under $H_0$ using permutation. There are $B = \binom{n}{n_1}$ different ways of partitioning the pooled sample of size $n = n_1 + n_2$ into two groups of sizes $n_1$ and $n_2$. In order to maintain the dependence structure and distributional characteristics of the gene expression measures within each subject, the sampling unit is the subject, not the gene. Recently, this type of resampling has become popular in multiple testing because it avoids specifying the true distribution of the gene expression data (Dudoit et al., 2002, 2003; Mutter et al., 2001; Ge et al., 2003). Note that the number of possible permutations B can be very large even with a small sample size. For instance, with $n_1 = n_2 = 10$, there exist $\binom{20}{10} = 184\,756$ distinct permutations. A reasonable number B of random permutations can be chosen for feasible computation.

For the observed test statistic $t_j$ of $T_j$ from the original data, the unadjusted (or raw) p-values can be approximated by $p_j \approx B^{-1}\sum_{b=1}^{B} I(t_j^{(b)} \ge t_j)$, where $I(A)$ is the indicator function of event $A$. For gene-specific inference, an adjusted p-value quantifying the significance of each gene relative to the FWER is more realistic. Toward this end, we define the adjusted p-value for gene $j$ as the minimum FWER for which $H_j$ will be rejected, i.e. $\tilde p_j = P(\max_{j'=1,\ldots,m} T_{j'} \ge t_j \mid H_0)$. In what follows, this probability is estimated from the permutation distribution.

Algorithm 1 (Single-step procedure)
(A) Compute the test statistics $t_1,\ldots,t_m$ from the original data.
(B) For the $b$th permutation of the original data $(b = 1,\ldots,B)$, compute the test statistics $t_1^{(b)},\ldots,t_m^{(b)}$ and $w_b = \max_{j=1,\ldots,m} t_j^{(b)}$.
(C) Estimate the adjusted p-values by $\tilde p_j = \sum_{b=1}^{B} I(w_b \ge t_j)/B$ for $j = 1,\ldots,m$.
(D) Reject all hypotheses $H_j$ $(j = 1,\ldots,m)$ such that $\tilde p_j < \alpha$.

Alternatively, with steps (C) and (D) replaced, the cut-off value $c_\alpha$ can be determined:

Algorithm 1'
(C') Sort $w_1,\ldots,w_B$ to obtain the order statistics $w_{(1)} \le \cdots \le w_{(B)}$ and compute the critical value $c_\alpha = w_{([B(1-\alpha)+1])}$, where $[a]$ is the largest integer no greater than $a$. If there exist ties, $c_\alpha = w_{(k)}$, where $k$ is the smallest integer such that $w_{(k)} \ge w_{([B(1-\alpha)+1])}$.
(D') Reject all hypotheses $H_j$ $(j = 1,\ldots,m)$ for which $t_j > c_\alpha$.

Below is a step-down analog suggested by Dudoit et al. (2002, 2003), originally proposed by Westfall and Young (1989, 1993; see Algorithms 2.8 and 4.1 in their book):

Algorithm 2 (Step-down procedure)
(A) Compute the test statistics $t_1,\ldots,t_m$ from the original data.
(A1) Sort $t_1,\ldots,t_m$ to obtain the ordered test statistics $t_{r_1} \ge \cdots \ge t_{r_m}$, where $H_{r_1},\ldots,H_{r_m}$ are the corresponding hypotheses.
(B) For the $b$th permutation of the original data $(b = 1,\ldots,B)$, compute the test statistics $t_{r_1}^{(b)},\ldots,t_{r_m}^{(b)}$ and $u_{b,j} = \max_{j'=j,\ldots,m} t_{r_{j'}}^{(b)}$ for $j = 1,\ldots,m$.
(C) Estimate the adjusted p-values by $\tilde p_{r_j} = \sum_{b=1}^{B} I(u_{b,j} \ge t_{r_j})/B$ for $j = 1,\ldots,m$.
(C1) Enforce monotonicity by setting $\tilde p_{r_j} \leftarrow \max(\tilde p_{r_{j-1}}, \tilde p_{r_j})$ for $j = 2,\ldots,m$.
(D) Reject all hypotheses $H_{r_j}$ $(j = 1,\ldots,m)$ for which $\tilde p_{r_j} < \alpha$.

Note that two-sided tests can be carried out by replacing $t_j$ by $|t_j|$ in steps (B) and (C) of Algorithm 1. Finally, it can be shown that a single-step procedure, controlling the FWER weakly as in (1), also controls the FWER strongly under the condition of subset pivotality (see p. 42 in Westfall and Young, 1993).
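The two permutation algorithms of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `t_stats` and `adjusted_pvalues` are hypothetical, random (rather than exhaustive) permutations are drawn with a fixed seed, the tests are one-sided, and the data are assumed to be continuous so that no pooled variance is zero.

```python
import math
import random

def t_stats(x1, x2):
    """Pooled-variance t-statistic per gene; x1, x2 are lists of subjects (rows of m values)."""
    n1, n2, m = len(x1), len(x2), len(x1[0])
    out = []
    for j in range(m):
        a = [row[j] for row in x1]
        b = [row[j] for row in x2]
        ma, mb = sum(a) / n1, sum(b) / n2
        ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
        s = math.sqrt(ss / (n1 + n2 - 2))
        out.append((ma - mb) / (s * math.sqrt(1 / n1 + 1 / n2)))
    return out

def adjusted_pvalues(x1, x2, B=1000, seed=0):
    """Return (single-step, step-down) one-sided adjusted p-values, indexed by gene."""
    rng = random.Random(seed)
    n1, m = len(x1), len(x1[0])
    pooled = x1 + x2
    t = t_stats(x1, x2)                              # step (A)
    order = sorted(range(m), key=lambda j: -t[j])    # step (A1): r_1, ..., r_m
    ss_count, sd_count = [0] * m, [0] * m
    for _ in range(B):                               # step (B)
        perm = pooled[:]
        rng.shuffle(perm)                            # permute whole subjects, not genes
        tb = t_stats(perm[:n1], perm[n1:])
        w = max(tb)                                  # w_b of Algorithm 1
        for j in range(m):
            ss_count[j] += w >= t[j]
        u = -math.inf                                # u_{b,j}: successive maxima, Algorithm 2
        for pos in range(m - 1, -1, -1):
            j = order[pos]
            u = max(u, tb[j])
            sd_count[j] += u >= t[j]
    p_ss = [c / B for c in ss_count]                 # step (C), Algorithm 1
    p_sd = [c / B for c in sd_count]                 # step (C), Algorithm 2
    for pos in range(1, m):                          # step (C1): enforce monotonicity
        p_sd[order[pos]] = max(p_sd[order[pos]], p_sd[order[pos - 1]])
    return p_ss, p_sd
```

Hypotheses with an adjusted p-value below α would then be rejected, as in step (D). By construction the step-down p-value never exceeds the single-step one, and the two coincide for the gene with the largest statistic.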
2.2 A simulation study

We investigate, through a simulation study, how well the multiple testing procedures control the FWER and what power they achieve: the Bonferroni (BON), the single-step procedure (SSP) and the step-down procedure (SDP) presented in this section. To evaluate the FWER empirically, 1000-dimensional artificial gene expression profiles in each group were generated from a multivariate Gaussian distribution with zero means (i.e. $\mu_1 = \mu_2 = 0$) and unit marginal variances. A block exchangeable correlation structure was assumed with correlation coefficient ρ (= 0, 0.4 or 0.8) and block size 100, i.e. genes are correlated within blocks and uncorrelated between blocks. We used balanced allocation ($n_1 = n_2 = n/2$) with n = 20 or 50 subjects. With one-sided FWER α = 0.05, $c_\alpha$ was approximated from B = 1000 random permutations and the empirical FWER was estimated by the proportion of replications, out of N = 1000, in which $H_0$ was rejected. As Table 1(A) displays, BON is precise with mild correlation (ρ ≤ 0.4), but becomes highly conservative as correlation increases (ρ = 0.8). The conservatism becomes more prominent with a larger sample (n = 50). The estimates from both SSP and SDP are slightly anticonservative with n = 20 and ρ = 0, but accurate overall. Also reported are the averages of the $c_\alpha$ values for SSP over the simulations, along with the corresponding values for BON, $t_{n-2,\alpha/m}$. As expected, the estimated critical value $c_\alpha$ increases in m (result not shown), decreases in n, and is always smaller than the critical value of BON.

Table 1. Simulation results

(A) Average FWER (critical value)
Columns: n, ρ, BON, SSP, SDP
(4.966) 0.066(4.950) (4.898) (4.384) (4.244) 0.046(4.233) (4.177) (3.767)

(B) Average true rejection rate (global power)
Columns: n, D, ρ, then BON, SSP, SDP for δ = 1 and BON, SSP, SDP for δ = 1.5

          δ = 1                                       δ = 1.5
BON       SSP            SDP            BON            SSP            SDP
(0.237)   0.022(0.245)   0.022(0.245)   0.116(0.702)   0.117(0.706)   0.117(0.706)
(0.190)   0.024(0.208)   0.024(0.208)   0.106(0.536)   0.113(0.554)   0.113(0.554)
(0.097)   0.055(0.210)   0.055(0.210)   0.116(0.339)   0.215(0.517)   0.217(0.517)
(0.625)   0.020(0.627)   0.020(0.627)   0.115(0.999)   0.119(0.999)   0.119(0.999)
(0.395)   0.022(0.421)   0.022(0.421)   0.120(0.856)   0.127(0.866)   0.127(0.866)
(0.185)   0.042(0.314)   0.042(0.314)   0.117(0.507)   0.211(0.688)   0.214(0.688)
(0.949)   0.268(0.949)   0.268(0.949)   0.842(1.00)    0.844(1.00)    0.845(1.00)
(0.810)   0.286(0.834)   0.287(0.834)   0.840(0.997)   0.855(0.997)   0.855(0.997)
(0.516)   0.393(0.695)   0.394(0.695)   0.845(0.969)   0.929(0.990)   0.929(0.990)
(1.00)    0.267(1.00)    0.268(1.00)    0.842(1.00)    0.844(1.00)    0.846(1.00)
(0.947)   0.289(0.956)   0.291(0.956)   0.842(1.00)    0.859(1.00)    0.859(1.00)
(0.692)   0.426(0.836)   0.429(0.836)   0.841(0.984)   0.925(0.996)   0.925(0.996)

BON = Bonferroni, SSP = single-step procedure and SDP = step-down procedure. n denotes the sample size and D denotes the number of genes with non-zero effect size δ out of the m = 1000 genes tested. A block diagonal matrix with block size 100 and correlation ρ was used for the correlation structure. B = N = 1000 permutations and simulations were used. Average false rejection rates (among genes with zero effect size) are omitted from this table.

For the power analysis, the first D genes in group 1 have a non-zero effect size δ, i.e. $\mu_1 = (\delta 1_D, 0_{m-D})$, where $1_a$ and $0_a$ are a-dimensional row vectors with all components equal to 1 and 0, respectively. Effect size as well as correlation vary: δ = 1 or 1.5; ρ = 0, 0.4 or 0.8. Three different rejection rates were assessed: (1) global power (i.e. the probability of rejecting at least one null hypothesis); (2) false rejection rate (FRR) (i.e. the probability of declaring the genes with a null effect as predictive); and (3) true rejection rate (TRR) (i.e. the probability of declaring the predictive genes as predictive). The distinction is important because high global power does not mean a high rate of rejecting individual (true or false) hypotheses, as Table 1(B) makes clear. For different concepts of power in the multiple testing context, see Dudoit et al. (2003, p. 74). The FRRs are omitted from the table, being similarly very low (maximum 0.15%) for all entries. All three procedures show that the TRR and global power increase in n, δ or D. Interestingly, ρ is associated inversely with global power but positively with TRR for both SSP and SDP. For BON, however, the TRR is virtually constant in ρ. SSP and SDP exhibit almost the same performance, although SDP has slightly higher (by 0.5% at most) TRR than SSP, particularly with D = 50 and n = 50. SSP and SDP show identical global power (and FWER under the composite null) in all cases. This is to be expected because global power is governed by the smallest adjusted p-value, $\min_{j=1,\ldots,m} \tilde p_j$, which is common to the two procedures. We conclude that Algorithms 1 (SSP) and 2 (SDP) behave very similarly in situations typically arising in microarray experiments, where the number of genes is very large but the proportion of differentially expressed genes is small.

To examine possible differences between the two procedures, we simulated a typical multiple testing situation with a small number of tests and report our findings in Table 2. We set $n_1 = n_2 = 10$ and m = 5, among which D = 0,...,5 hypotheses have effect size δ = 1. Raw data are generated from a multivariate Gaussian distribution with a compound symmetry (CS) structure and a mild correlation coefficient (ρ = 0.3). For each D, B permutations were conducted within each simulation and this process was repeated N times. As D increases, the TRR and FRR are relatively constant in SSP but increase sharply in SDP. Both TRR and FRR are higher in SDP, and the difference becomes more pronounced as D increases.

Table 2. Average rejection rate and global power in a classical setting
(Columns: D; Procedure; average rejection rate, TRR and FRR; global power. Rows: SDP and SSP for each D = 0,...,5.)

SDP = step-down procedure and SSP = single-step procedure. TRR and FRR denote the true rejection rate (among genes that are differentially expressed) and the false rejection rate (among genes that are not differentially expressed), respectively. D is the number of genes with non-zero effect size δ. m = 5 genes were used. Compound symmetry with a correlation coefficient of 0.3 and a total sample size n of 20 ($n_1 = n_2 = 10$) were employed.

3. SAMPLE SIZE CALCULATION

In this section, we derive a sample size calculation method using the single-step procedure. The calculated sample size also applies to the step-down procedure, since the two procedures have the same global power.
Our discussion focuses on one-sided testing, but the two-sided case can be derived similarly. Recall that the multiple testing procedures discussed in this paper do not require a large sample assumption. However, we derive our sample size formula from a large sample approximation and then show through simulations that the formula also works well with moderate sample sizes.

3.1 Algorithms for sample size calculation

We wish to determine the sample size for a designated global power 1 − β. Suppose that the gene expression data $\{(X_{ki1},\ldots,X_{kim}),\ i = 1,\ldots,n_k,\ k = 1, 2\}$ are random samples from an unknown distribution with $E(X_{kij}) = \mu_{kj}$, $\mathrm{var}(X_{kij}) = \sigma_j^2$ and $\mathrm{corr}(X_{kij}, X_{kij'}) = \rho_{jj'}$. Let $R = (\rho_{jj'})_{j,j'=1,\ldots,m}$ be the $m \times m$ correlation matrix. Under $H_a$, we specify the effect size as $\delta_j = (\mu_{1j} - \mu_{2j})/\sigma_j$. In the design stage of a microarray study, we usually project the number of predictive genes D and set an equal effect size among them, i.e.

$$\delta_j = \delta \text{ for } j = 1,\ldots,D; \qquad \delta_j = 0 \text{ for } j = D+1,\ldots,m. \tag{2}$$

Appendix 2A shows that, for large $n_1$ and $n_2$, $(T_1,\ldots,T_m)$ has approximately the same distribution as $(e_1,\ldots,e_m) \sim N(0, R)$ under $H_0$ and as $(e_j + \delta_j\sqrt{npq},\ j = 1,\ldots,m)$ under $H_a$, where $p = n_1/n$ and $q = 1 - p$. Hence, at FWER = α, the common critical value $c_\alpha$ is given as the upper α quantile of $\max_{j=1,\ldots,m} e_j$ from (1). Similarly, the global power as a function of n is

$$h_a(n) = P\{\max_{j=1,\ldots,m}(e_j + \delta_j\sqrt{npq}) > c_\alpha\}.$$

Thus, given FWER = α, the sample size n required to detect the specified effect sizes $(\delta_1,\ldots,\delta_m)$ with a global power 1 − β is calculated as the solution to $h_a(n) = 1 - \beta$.

Analytic calculation of $c_\alpha$ and $h_a(n)$ is feasible only when the distributions of $\max_j e_j$ and $\max_j(e_j + \delta_j\sqrt{npq})$ are available in simple forms. With a large m, however, it is almost impossible to derive these distributions. We avoid the difficulty by using simulation. Our simulation method approximates $c_\alpha$ and $h_a(\cdot)$ by generating random vectors $(e_1,\ldots,e_m)$ from N(0, R). For easy generation of the random numbers, we have to assume a simple, but realistic, correlation structure for the gene expression data.
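One case in which the two maxima do have simple forms is that of independent genes, R = I. This is an assumption made here purely for illustration, not a setting the paper recommends. Then $P(\max_j e_j \le c) = \Phi(c)^m$, so $c_\alpha = \Phi^{-1}\{(1-\alpha)^{1/m}\}$, and under (2) the global power is $h_a(n) = 1 - \Phi(c_\alpha - \delta\sqrt{npq})^D\,\Phi(c_\alpha)^{m-D}$. The sketch below evaluates these expressions; the function names and the search cap `n_max` are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def critical_value(alpha, m):
    """Upper alpha quantile of max_j e_j when e_1, ..., e_m are independent N(0, 1)."""
    return PHI.inv_cdf((1 - alpha) ** (1 / m))

def global_power(n, delta, D, m, c, p=0.5):
    """h_a(n) for independent genes with effect size delta on the first D genes."""
    shift = delta * math.sqrt(n * p * (1 - p))
    return 1 - PHI.cdf(c - shift) ** D * PHI.cdf(c) ** (m - D)

def sample_size(alpha, power, delta, D, m, p=0.5, n_max=100000):
    """Smallest n with h_a(n) >= power (linear search, capped at n_max)."""
    c = critical_value(alpha, m)
    n = 2
    while n < n_max and global_power(n, delta, D, m, c, p) < power:
        n += 1
    return n
```

A call such as `sample_size(0.05, 0.8, 1.0, 10, 1000)` gives the total sample size for m = 1000, D = 10 and δ = 1 under balanced allocation. With correlated genes no such closed form is available, which is why simulation is used instead.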
Recall that R is the correlation matrix of the gene expression data $(X_{ki1},\ldots,X_{kim})$. A reasonable correlation structure would be block compound symmetry (BCS) or CS (i.e. with only one block). Suppose that the m genes are partitioned into L blocks, and that $B_l$ denotes the set of genes belonging to block $l$ $(l = 1,\ldots,L)$. We assume that $\rho_{jj'} = \rho$ if $j, j' \in B_l$ for some $l$, and $\rho_{jj'} = 0$ otherwise. Under the BCS structure, we can generate $(e_1,\ldots,e_m)$ as a function of i.i.d. standard normal random variates $u_1,\ldots,u_m, b_1,\ldots,b_L$:

$$e_j = u_j\sqrt{1-\rho} + b_l\sqrt{\rho} \quad \text{for } j \in B_l. \tag{3}$$

Finally, the entire procedure can be summarized as follows:

(a) Specify the FWER (α), global power (1 − β), effect sizes $(\delta_1,\ldots,\delta_m)$ and correlation structure (R).
(b) Generate K i.i.d. random vectors $\{(e_1^{(k)},\ldots,e_m^{(k)}),\ k = 1,\ldots,K\}$ from N(0, R). Let $\bar e_k = \max_{j=1,\ldots,m} e_j^{(k)}$.
(c) Approximate $c_\alpha$ by $\bar e_{([(1-\alpha)K+1])}$, the $[(1-\alpha)K+1]$th order statistic of $\bar e_1,\ldots,\bar e_K$.
(d) Calculate n by solving $\hat h_a(n) = 1 - \beta$ by the bisection method (Press et al., 1996), where $\hat h_a(n) = K^{-1}\sum_{k=1}^{K} I\{\max_{j=1,\ldots,m}(e_j^{(k)} + \delta_j\sqrt{npq}) > c_\alpha\}$.

Mathematically, step (d) is equivalent to finding $n^* = \min\{n : \hat h_a(n) \ge 1 - \beta\}$.

In Appendix 2A, the asymptotic distribution of $(T_1,\ldots,T_m)$ is derived without reference to the use of permutations in testing. In this sense, the above algorithm using (3) will be called the naive method. Appendix 2B shows that the permutation procedure alters the correlation structure among the test statistics under $H_a$. Suppose that there are $m_1$ genes in block 1, among which the first D are predictive. Then, under (2) and BCS, we have

$$\mathrm{corr}(T_j, T_{j'}) \approx \begin{cases} (\rho + pq\delta^2)/(1 + pq\delta^2) \equiv \rho_1 & \text{if } 1 \le j < j' \le D \\ \rho/\sqrt{1 + pq\delta^2} \equiv \rho_2 & \text{if } 1 \le j \le D < j' \le m_1 \\ \rho & \text{if } D < j < j' \le m_1 \text{ or } j, j' \in B_l \text{ for } l \ge 2 \end{cases} \tag{4}$$

where the approximation is with respect to large n. Let $\tilde R$ denote the correlation matrix with these correlation coefficients. Note that $\tilde R = R$ under $H_0: \delta = 0$, so the calculation of $c_\alpha$ is the same as in the naive method. However, $h_a(n)$ should be modified to

$$\tilde h_a(n) = P\{\max_{j=1,\ldots,m}(\tilde e_j + \delta_j\sqrt{npq}) > c_\alpha\}$$

where random samples of $(\tilde e_1,\ldots,\tilde e_m)$ can be generated using

$$\tilde e_j = \begin{cases} u_j\sqrt{1-\rho_1} + b_1\sqrt{\rho_2} + b_{-1}\sqrt{\rho_1 - \rho_2} & \text{if } 1 \le j \le D \\ u_j\sqrt{1-\rho} + b_1\sqrt{\rho_2} + b_0\sqrt{\rho - \rho_2} & \text{if } D < j \le m_1 \\ u_j\sqrt{1-\rho} + b_l\sqrt{\rho} & \text{if } j \in B_l \text{ for } l \ge 2 \end{cases}$$

with $u_1,\ldots,u_m, b_{-1}, b_0, b_1,\ldots,b_L$ drawn independently from N(0, 1). Then $\{(\tilde e_1^{(k)},\ldots,\tilde e_m^{(k)}),\ k = 1,\ldots,K\}$ are i.i.d. random vectors from $N(0, \tilde R)$, and

$$\hat{\tilde h}_a(n) = K^{-1}\sum_{k=1}^{K} I\{\max_{j=1,\ldots,m}(\tilde e_j^{(k)} + \delta_j\sqrt{npq}) > c_\alpha\}.$$

The sample size calculation that solves $\hat{\tilde h}_a(n) = 1 - \beta$ will be called the modified method.

Note that the methods discussed here differ from a pure simulation method in that they do not require generating the raw data and then calculating test statistics. Thus, the computing time is not of the order of n × m, but of m. Furthermore, we can share the random numbers $u_1,\ldots,u_m, b_{-1}, b_0, b_1,\ldots,b_L$ between the calculation of $c_\alpha$ and that of n. Nor do we need to generate a new set of random numbers at each iteration of the bisection procedure.

If the target n is not large, the large sample approximation may not perform well. In our simulation study, we examine how large n needs to be for an adequate approximation. If the target n is so small that the approximation is questionable, then we have to use a pure simulation method that generates raw data.

3.2 A simulation study

We conducted numerical experiments to investigate the accuracy of our sample size estimation.
First, the sample size was computed under one-sided FWER = 0.05; 80% global power; p = q = 0.5; δ = 0.5 or 1; ρ = 0.1, 0.4 or 0.8; with m, D and block size varied as shown in Table 3. A simulated sample of the calculated size was then generated from the same parameter setting as in the sample size calculation. B = N = 1000 samples were generated, and global power was calculated empirically. The sample size increases in ρ (assuming that no variable reduction technique is involved) and decreases in δ. Given D, a larger number of tests (m) intuitively demands a larger sample size. The sample sizes obtained by the naive method are underpowered, especially with δ = 1 and large m. The modified method remarkably improves the accuracy except when δ = 1 and m = . With large m and ρ, the large sample convergence will be slow, resulting in a poor approximation, especially with a large effect size, which yields a small n. These results show that power and sample size depend not only on the study design but also on the method proposed for analyzing the data.
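The naive method of Section 3.1, i.e. steps (a)-(d) with the generator in equation (3), can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names `bcs_draws` and `naive_sample_size`, the search cap `n_max` and the small tolerance used when indexing the order statistic are all assumptions of this example. The first D genes are taken to carry the common effect size δ, as in (2), and D is assumed to be at least 1.

```python
import math
import random

def bcs_draws(m, block_size, rho, K, rng):
    """Step (b): K draws of (e_1, ..., e_m) ~ N(0, R) under BCS, via equation (3)."""
    a, c = math.sqrt(1.0 - rho), math.sqrt(rho)
    draws = []
    for _ in range(K):
        e = []
        for start in range(0, m, block_size):
            b = rng.gauss(0.0, 1.0)                  # shared block variate b_l
            size = min(block_size, m - start)
            e.extend(a * rng.gauss(0.0, 1.0) + c * b for _ in range(size))
        draws.append(e)
    return draws

def naive_sample_size(draws, delta, D, alpha=0.05, power=0.8, p=0.5, n_max=100000):
    """Steps (c)-(d): critical value from the null maxima, then bisection on n."""
    K = len(draws)
    maxima = sorted(max(e) for e in draws)
    # 0-based index of the [(1 - alpha)K + 1]th order statistic.
    k = min(K - 1, int((1.0 - alpha) * K + 1e-9))
    c_alpha = maxima[k]
    # Only the first D genes are shifted, so two maxima per draw suffice.
    top = [max(e[:D]) for e in draws]
    rest = [max(e[D:], default=-math.inf) for e in draws]

    def h(n):
        # Estimated global power at total sample size n, reusing the same draws.
        shift = delta * math.sqrt(n * p * (1.0 - p))
        return sum(max(t + shift, r) > c_alpha for t, r in zip(top, rest)) / K

    lo, hi = 1, n_max                                # h is nondecreasing in n
    while lo < hi:
        mid = (lo + hi) // 2
        if h(mid) >= power:
            hi = mid
        else:
            lo = mid + 1
    return lo, c_alpha
```

The same random vectors are reused for the critical value and at every bisection step, which is what keeps the cost of the order of m rather than n × m. The modified method would replace `bcs_draws` with a generator for the permutation-adjusted correlation matrix when evaluating power.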

Table 3. Sample size (empirical power) for 80% global power

                       Correlation   δ = 0.5                                δ = 1
m (block size)   D     formula       ρ = 0.1     ρ = 0.4     ρ = 0.8       ρ = 0.1    ρ = 0.4    ρ = 0.8
     (10)        5     naive         119(0.79)   150(0.79)   179(0.82)     30(0.68)   38(0.75)   45(0.74)
                       modified      127(0.79)   152(0.82)   183(0.80)     35(0.79)   40(0.80)   47(0.76)
1000 (100)       10    naive         139(0.76)   168(0.78)   199(0.76)     35(0.65)   42(0.70)   51(0.75)
                       modified      145(0.81)   176(0.80)   204(0.81)     41(0.79)   48(0.81)   53(0.75)
     (100)       10    naive         183(0.70)   233(0.75)   284(0.79)     45(0.53)   59(0.70)   71(0.70)
                       modified      188(0.77)   239(0.79)   288(0.81)     53(0.74)   64(0.77)   74(0.75)
     (1000)      1000  naive         41(0.64)    86(0.82)    152(0.77)     10(0.21)   22(0.68)   39(0.71)
                       modified      57(0.83)    113(0.87)   185(0.82)     20(0.87)   34(0.85)   49(0.85)

m is the total number of genes tested and D is the number of genes with non-zero effect size δ. Naive and modified refer to the original correlation matrix and to the correlation matrix modified to account for permutation, respectively. The sample size n was estimated from K = 5000 simulated samples. B = N = 1000 permutations and simulations were used.

4. APPLICATION TO LEUKEMIA DATA

In this section, the leukemia data from Golub et al. (1999) are reanalyzed. There are $n_{ALL} = 27$ patients with ALL and $n_{AML} = 11$ patients with AML in the training set, and expression patterns in m = 6810 human genes are explored. Note that, in general, such expression measures are subject to preprocessing steps such as image analysis and normalization, and also to a priori quality control. Supplemental information and the data set are available on the authors' website ( mit.edu/mpr). Gene-specific significance was ascertained for the alternative hypotheses $H_{1,j}: \mu_{ALL,j} \ne \mu_{AML,j}$, $H_{2,j}: \mu_{ALL,j} < \mu_{AML,j}$ and $H_{3,j}: \mu_{ALL,j} > \mu_{AML,j}$ by SDP and SSP. We implemented our algorithm and also used PROC MULTTEST in SAS, with B permutations (Westfall et al., 2001). Because the results were essentially identical, we report those from SAS.
Table 4 lists the 41 genes whose two-sided adjusted p-values are smaller than 0.05. Although the adjusted p-values from SDP are slightly smaller than those from SSP, the results are extremely similar, confirming the findings of our simulation study. Note that Golub et al. and we identified 1100 and 1579 predictive genes, respectively, without accounting for multiplicity. A Bonferroni adjustment declared 37 genes significant. This is not so surprising, because relatively low correlations among genes were observed in these data. We do not show the results for $H_{3,j}$; only four hypotheses are rejected. Note that the two-sided p-value is smaller than twice the smaller one-sided p-value, as theory predicts (see Appendix 1), and that the difference is not often negligible (Shaffer, 2002).

Suppose that we want to design a prospective study to identify predictive genes overexpressed in AML, based on the observed parameter values. We therefore assume m = 6810, p = 0.3 (≈ 11/38), D = 10 or 100, δ = 0.5 or 1, and BCS with block size 100 or CS with a common correlation coefficient of ρ = 0.1 or 0.4. We calculated the sample size using the modified formula under each parameter setting for FWER α = 0.05 and a global power 1 − β = 0.8 with K = 5000 replications. For D = 10 and δ = 1, the minimal sample sizes required for BCS/CS are 59/59 and 74/63 for ρ = 0.1 and 0.4, respectively. If a larger number of genes, say D = 100, are anticipated to be overexpressed in AML with the same effect size, the respective sample sizes reduce to 34/34 and 49/41 while maintaining the same power. With δ = 0.5, the required sample size becomes nearly 3.5 to 4 times that for δ = 1. Note that, with the same ρ, BCS tends to require a larger sample size than CS.

One of the referees raised a question about the accuracy of our sample size formula when the gene expression data follow distributions other than the multivariate normal.
We considered the setting α = 0.05, 1 − β = 0.8, δ = 1, D = 100, ρ = 0.1 with CS structure, which results in the smallest sample size, n = 34, in the above sample size calculation. Gene expression data were generated from a correlated asymmetric distribution:

$$X_{kj} = \mu_{kj} + (e_{kj} - 2)\sqrt{(1-\rho)/4} + (e_{k0} - 2)\sqrt{\rho/4}$$

for $1 \le j \le m$ and $k = 1, 2$. Here, $\mu_{1j} = \delta_j$ and $\mu_{2j} = 0$, and $e_{k0}, e_{k1},\ldots,e_{km}$ are i.i.d. random variables from a $\chi^2$ distribution with two degrees of freedom. Because such a variable has mean 2 and variance 4, and only $e_{k0}$ is shared among genes, $(X_{k1},\ldots,X_{km})$ have means $(\mu_{k1},\ldots,\mu_{km})$, marginal variances 1, and a compound symmetry correlation structure with ρ = 0.1. In this case, the empirical FWER and the empirical global power obtained by simulation were close to the nominal α = 0.05 and 1 − β = 0.8, respectively.

Table 4. Reanalysis of the leukemia data from Golub et al. (1999)
(Columns: gene index (description); adjusted p-values by SDP and SSP for the alternative hypothesis $\mu_{ALL} \ne \mu_{AML}$; adjusted p-values by SDP and SSP for the alternative hypothesis $\mu_{ALL} < \mu_{AML}$.)

1701 (FAH Fumarylacetoacetate)
(Leukotriene C4 synthase)
(Zyxin)
(LYN V-yes-1 Yamaguchi)
(LEPR Leptin receptor)
(CD33 CD33 antigen)
(Liver mRNA for IGIF)
(PRG1 Proteoglycan 1)
(DF D component of complement)
(GB DEF)
(Induced Myeloid Leukemia Cell)
(IL8 Precursor)
(PEPTIDYL-PROLYL CIS-TRANS Isomerase)
(Phosphotyrosine independent ligand p62)
(CST3 Cystatin C)
(ATP6C Vacuolar H+ ATPase proton channel subunit)
(CTSD Cathepsin D)
(Interleukin 8)
(ITGAX Integrin)
(Epb72 gene exon 1)
(LGALS3 Lectin)
(Thrombospondin-p50)
(LYZ Lysozyme)
(FTL Ferritin)
(Azurocidin)
(Protein MAD3)
(PFC Properdin P factor)
(Lysophospholipase homolog)
(Lysozyme)
(PPGB Protective protein)
(LYZ Lysozyme)
(HOX 2.2)
(Catalase EC )
(FTH1 Ferritin heavy chain)
(CD36 CD36 antigen)
(ADM)
(CDC25A Cell division cycle)
(APLP2 Amyloid beta precursor-like protein)
(TIMP2 Tissue inhibitor of metalloproteinase)
(C-myb)
(NF-IL6-beta protein mRNA)

Genes with two-sided adjusted p-values less than 0.05 are listed in increasing order of p-value, among the total of m = 6810 genes investigated. The total number of subjects studied was n = 38 ($n_{ALL} = 27$ and $n_{AML} = 11$). A p-value for the C-myb gene against the hypothesis $\mu_{ALL} > \mu_{AML}$ was also reported. Although some gene descriptions are identical, the gene accession numbers are different.

5. DISCUSSION

Genomic scientists are using the DNA microarray as a major high-throughput assay to display DNA or RNA abundance for a large number of genes concurrently; this has rekindled interest in statistical issues such as multiple testing and has raised methodological and computational challenges. Endeavors to identify the informative genes should take multiplicity into account, but should also have enough power to discover important genes successfully. This problem differs from classical multiple testing situations in that the number of truly effective genes is often very small compared with the number of candidate genes under investigation. Moreover, often only a small sample is available, so large sample theory is not justified for standard statistical inference. An underpowered study is no service to the investigator or to science; results that are significant without assurance will often fail to replicate, and time will be wasted and resources needlessly expended. In this paper, we compared three popular testing procedures and developed a new, fast algorithm for determining sample size, with a particular emphasis on the microarray context.
We basically suggest using exact permutation-based tests, but we also argue for the utility of the single-step method, which is often undervalued. Permutation tests do not require specification of the joint distribution or true correlation structure of the gene expression data. In typical circumstances occurring in microarrays, we verified that the actual advantage of the step-down procedure is minimal and that the improvement is more relevant in classical testing situations dealing with a small number of hypotheses. The single-step method is fast, easy to understand, computes critical values as well as adjusted p-values and, most importantly, offers a simple way to calculate sample size. Generating high-dimensional (say, ) multivariate (normal) data many times (say, 5000) is not a simple undertaking even with a fast computer. To the best of our knowledge, there is no fast numerical algorithm to generate high-dimensional random vectors from a general correlation structure. Some simplifying assumptions (e.g. BCS or CS correlation structure, common effect size and normal test statistics) may be more realistic in the microarray analysis under such technical constraints. However, further simulation under more varied conditions would be extremely useful. Our method for sample size determination is efficiently implemented using a novel and fast algorithm, and is accurate, as reflected in the empirical evaluation. Although there have been several publications on sample size estimation in the microarray context, none have examined the accuracy of their estimates. 
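To make the single-step permutation procedure concrete, the sketch below computes two-sided single-step adjusted p-values by comparing each observed statistic with the permutation distribution of the maximum statistic over all genes. It is an illustration under stated assumptions rather than the authors' code: the pooled-variance t statistic, the function names `t_statistics` and `single_step_adjusted_p`, the toy data, and the use of random rather than exhaustive permutations are all choices made for the example.

```python
import numpy as np

def t_statistics(x, labels):
    """Pooled-variance two-sample t statistic for every gene (column of x)."""
    a, b = x[labels == 1], x[labels == 0]
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * a.var(axis=0, ddof=1) +
              (n2 - 1) * b.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / n1 + 1 / n2))

def single_step_adjusted_p(x, labels, n_perm, rng):
    """Single-step adjusted p-values from the permutation max-|t| distribution."""
    observed = np.abs(t_statistics(x, labels))
    exceed = np.zeros(x.shape[1])
    for _ in range(n_perm):
        # Permuting group labels keeps each subject's gene vector intact,
        # so the dependence among genes is preserved under the null.
        perm = rng.permutation(labels)
        max_t = np.abs(t_statistics(x, perm)).max()
        exceed += max_t >= observed
    return exceed / n_perm

# Toy data: 10 subjects per group, 200 genes, only gene 0 differentially expressed.
rng = np.random.default_rng(1)
labels = np.array([1] * 10 + [0] * 10)
x = rng.normal(size=(20, 200))
x[labels == 1, 0] += 5.0
adj_p = single_step_adjusted_p(x, labels, n_perm=500, rng=rng)
```

Because every gene is compared against the same maximum, the adjusted p-values are monotone in the observed |t|, and the same permutation distribution also yields a single critical value for all genes.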
Furthermore, all focused on exploratory and approximate relationships among statistical power, sample size (or the number of replicates) and effect size (often, in terms of fold-change), and used the most conservative Bonferroni adjustment without any attempt to incorporate underlying correlation structure (Witte et al., 2000; Wolfinger et al., 2001; Black and Doerge, 2002; Lee and Whitmore, 2002; Pan et al., 2002; Simon et al., 2002; Cui and Churchill, 2003). By comparing empirical power resulting from naive and modified methods, we show that an ostensibly similar but incorrect choice of sample size ascertainment could cause considerable underestimation of required sample size. We recommend that the assessment of bias in empirical power (compared to nominal power) be a conventional step in publication of all sample size papers.

Recently, some researchers proposed new concepts of error such as the false discovery rate (FDR) and the positive FDR (pFDR), which control the expected proportion of Type I errors among the rejected hypotheses (Benjamini and Hochberg, 1995; Storey, 2002). Controlling these quantities relaxes the multiple testing criteria compared to controlling FWER in general and increases the number of declared significant genes. In particular, pFDR is motivated by a Bayesian perspective and inherits the idea of single-step in constructing q-values, which are the counterpart of the adjusted p-values in this case (Ge et al., 2003). It would be useful to do a sample size comparison for FDR, pFDR and FWER. FWER is important as a benchmark because the reexamination of Golub et al.'s data tells us that classical FWER control (along with global power) may not necessarily be as exceedingly conservative as many researchers thought and carries clear conceptual and practical interpretations.

Appendices are available online at

ACKNOWLEDGEMENTS

The authors are grateful to the reviewers for their careful and speedy reviews of this paper. Their comments greatly improved this paper without a doubt.

REFERENCES

ALIZADEH, A. A. AND STAUDT, L. M. (2000). Genomic-scale gene expression profiling of normal and malignant immune cells. Current Opinions in Immunology 12,

BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57,

BLACK, M. A. AND DOERGE, R. W. (2002). Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics 18,

CUI, X. AND CHURCHILL, G. A. (2003).
How many mice and how many arrays? Replication in mouse cDNA microarray experiments. In Johnson, K. F. and Lin, S. M. (eds), Methods of Microarray Data Analysis II, Norwell, MA: Kluwer Academic Publishers, pp

DUDOIT, S., SHAFFER, J. P. AND BOLDRICK, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18,

DUDOIT, S., YANG, Y. H., CALLOW, M. J. AND SPEED, T. P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12,

GE, Y., DUDOIT, S. AND SPEED, T. P. (2003). Resampling-based multiple testing for microarray data analysis. TEST 12,

GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J. P., COLLER, H., LOH, M. L., DOWNING, J. R., CALIGIURI, M. A., BLOOMFIELD, C. D. AND LANDER, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286,

HOCHBERG, Y. (1998). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75,

HOLM, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6,

LEE, M. L. T. AND WHITMORE, G. A. (2002). Power and sample size for DNA microarray studies. Statistics in Medicine 21,

MUTTER, G. L., BAAK, J. P. A., FITZGERALD, J. T., GRAY, R., NEUBERG, D., KUST, G. A., GENTLEMAN, R., GALLANS, S. R., WEI, L. J. AND WILCOX, M. (2001). Global expression changes of constitutive and hormonally regulated genes during endometrial neoplastic transformation. Gynecologic Oncology 83,

PAN, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18,

PAN, W., LIN, J. AND LE, C. T. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3,

PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T. AND FLANNERY, B. P. (1996). Numerical Recipes in Fortran 90. New York: Cambridge University Press.

SANDER, C. (2000). Genomic medicine and the future of health care. Science 287,

SHAFFER, J. P. (2002). Multiplicity, directional (Type III) errors, and the null hypothesis. Psychological Methods 7,

SIMON, R., RADMACHER, M. D. AND DOBBIN, K. (2002). Design of studies with DNA microarrays. Genetic Epidemiology 23,

STOREY, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society B 64,

THOMAS, J. G., OLSON, J. M., TAPSCOTT, S. J. AND ZHAO, L. P. (2001). An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research 11,

TROENDLE, J. F., KORN, E. L. AND MCSHANE, L. M. (2004). An example of slow convergence of the bootstrap in high dimensions. American Statistician 58,

WEST, M., BLANCHETTE, C., DRESSMAN, H., HUANG, E., ISHIDA, S., SPRANG, R., ZUZAN, H., OLSON, J., MARKS, J. AND NEVINS, J. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences USA 98,

WESTFALL, P. H. AND YOUNG, S. S. (1989). P-value adjustments for multiple tests in multivariate binomial models. Journal of the American Statistical Association 84,

WESTFALL, P. H. AND YOUNG, S. S. (1993). Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. New York: Wiley.

WESTFALL, P. H. AND WOLFINGER, R. D. (1997). Multiple tests with discrete distributions. American Statistician 51, 3–8.

WESTFALL, P. H., ZAYKIN, D. V. AND YOUNG, S. S. (2001). Multiple tests for genetic effects in association studies: methods in molecular biology. In Looney, S. (ed.), Biostatistical Methods, Totowa, NJ: Humana Press, pp

WITTE, J. S., ELSTON, R. C. AND CARDON, L. R. (2000). On the relative sample size required for multiple comparisons. Statistics in Medicine 19,

WOLFINGER, R. D., GIBSON, G., WOLFINGER, E. D., BENNETT, L., HAMADEH, H., BUSHEL, P., AFSHARI, C. AND PAULES, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8,

[Received 27 April 2004; revised 4 August 2004; accepted for publication 23 September 2004]