Parametric and Nonparametric FDR Estimation Revisited


Baolin Wu,1 Zhong Guan,2 and Hongyu Zhao3
1 Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
2 Department of Mathematical Sciences, Indiana University South Bend, South Bend, IN 46634, USA
3 Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA

Summary. Nonparametric and parametric approaches have been proposed to estimate the False Discovery Rate under the assumption of independent hypothesis tests. The parametric approach has been shown to perform better than the nonparametric approaches. In this article, we study the nonparametric approaches and quantify the underlying relations between the parametric and nonparametric approaches. Our study reveals the conservative nature of the nonparametric approaches and establishes the connections between the empirical Bayes method and p-value based nonparametric methods. Based on our results, we advocate using the parametric approach, or directly modeling the test statistics using the empirical Bayes method.

Key words: Microarray; False discovery rate; Multiple hypothesis testing;

Multiple comparisons; Simultaneous inference; Empirical Bayes method

1. Introduction

In current large-scale genomic and proteomic datasets there are usually hundreds of thousands of variables but a limited sample size, which poses a unique challenge for statistical analysis. Variable selection serves two purposes in this context: biological interpretation and reducing the impact of noise. In microarray datasets we are often interested in identifying differentially expressed genes. This can be formulated as the following hypothesis testing problem:

    H_i : µ_i = 0   (i = 1, ..., m),

where m is the total number of genes and µ_i is the mean log ratio of the expression levels for the i-th gene. Here we test m genes simultaneously, which complicates error control. The possible outcomes of a multiple testing procedure are summarized in Table 1, where V is the number of false positives and S is the number of true positives.

[Table 1 about here.]

For the convenience of the following discussion, define

    h_k = I{k-th hypothesis is a true null},   r_k = I{k-th hypothesis is rejected},
    h = (h_1, ..., h_m),   r = (r_1, ..., r_m),   v = (r_1 h_1, ..., r_m h_m).

Here we treat the h_k as random variables. The L_1 norms of these vectors are ||h|| = m_0, ||r|| = R and ||v|| = V.
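The bookkeeping behind this notation can be sketched directly. The indicator vectors below are hypothetical illustrations of h, r and v, not quantities from the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
# h[k] = 1 if the k-th null hypothesis is true; r[k] = 1 if it is rejected.
h = rng.integers(0, 2, size=m)
r = rng.integers(0, 2, size=m)
v = r * h                      # false-positive indicators, v_k = r_k h_k

m0 = h.sum()                   # ||h|| = m0, number of true nulls
R = r.sum()                    # ||r|| = R, number of rejections
V = v.sum()                    # ||v|| = V, number of false positives
fdp = V / R if R > 0 else 0.0  # realized false discovery proportion, with 0/0 := 0
```

Since v_k ≤ r_k and v_k ≤ h_k elementwise, V can never exceed R or m_0, which is the constraint summarized in Table 1.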

In single hypothesis testing, the common approach is to control the Type I error at a pre-specified level α_0 while maximizing the power (equivalently, minimizing the Type II error β):

    α_0 = Pr(r_k = 1 | h_k = 1),   β = Pr(r_k = 0 | h_k = 0).

In multiple hypothesis testing we want to control an overall Type I error to be very small, and there are different definitions of the overall Type I error. A natural extension of the Type I error to multiple testing is the Family-Wise Error Rate (FWER), the probability of making any false positive:

    FWER = Pr(V > 0).   (1)

The most commonly used approach for FWER control is the Bonferroni correction, which adjusts the individual significance levels to α_0/m. The Bonferroni correction is generally conservative, especially in the context of genomic and proteomic datasets where m is very large. There have been some developments in using resampling methods to improve power while controlling the FWER (Westfall and Young, 1993; Ge et al., 2003).

The False Discovery Rate (FDR), a philosophically different approach, was first proposed by Benjamini and Hochberg (1995). It is defined as E(V/R), with the convention 0/0 = 0 since R = 0 means there is no discovery. We can also write the FDR as

    FDR = E(V/R | R > 0) Pr(R > 0) = E(V/R | V > 0) Pr(V > 0).   (2)

Storey (2002b) defined the pFDR as the conditional expectation

    pFDR = E(V/R | R > 0) = FDR / Pr(R > 0).   (3)
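The identity pFDR = FDR / Pr(R > 0) in (3) can be checked with a small Monte Carlo sketch. The Beta(0.1, 1) density used for the alternative p-values and all numeric settings are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
m, pi0, reps, tau = 20, 0.8, 20000, 0.05   # reject hypothesis k when p_k <= tau

vr, r_pos = [], []
for _ in range(reps):
    null = rng.random(m) < pi0            # true-null indicators h_k
    p = np.where(null, rng.random(m),     # U[0,1] p-values under the null
                 rng.beta(0.1, 1, m))     # p-values concentrated near 0 under the alternative
    R = (p <= tau).sum()
    V = ((p <= tau) & null).sum()
    vr.append(V / R if R > 0 else 0.0)    # V/R with 0/0 := 0
    r_pos.append(R > 0)

fdr = np.mean(vr)                                    # estimates E(V/R) = FDR
pr_rpos = np.mean(r_pos)                             # estimates Pr(R > 0)
pfdr = np.mean([x for x, k in zip(vr, r_pos) if k])  # estimates E(V/R | R > 0) = pFDR
```

Because V/R is set to 0 whenever R = 0, the empirical averages satisfy fdr = pfdr * pr_rpos exactly, mirroring identity (3).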

Clearly,

    FDR = E(V/R | V > 0) Pr(V > 0) ≤ Pr(V > 0) = FWER,

so FWER is always a stronger control than FDR. We can formally define the FDR estimation problem as follows.

DATA: m test statistics (T_1, ..., T_m), one for each hypothesis H_k, k = 1, ..., m.
GOAL: Develop a testing procedure and estimate the expectation E(V/R), where V and R are defined in Table 1.

Here we assume that (T_1, ..., T_m) are m i.i.d. random variables. First define

    π_0 = Pr(h_k = 1),   α_0 = Pr(r_k = 1 | h_k = 1),   α = Pr(r_k = 1),   (4)

where π_0 is the proportion of true null hypotheses, α_0 is the rejection probability under the true null hypothesis, and α is the marginal rejection probability. Under the i.i.d. assumption, we have the following intuitive formulas for the FDR and pFDR (Storey, 2002a; Storey et al., 2004; Benjamini et al., 2001):

    FDR = Pr(h_k = 1 | r_k = 1) = π_0 α_0 / α,   pFDR = (π_0 α_0 / α) {1 − (1 − α)^m}^{−1}.   (5)

So the pFDR and FDR estimation problems transform into the familiar framework of estimating the parameters π_0, α_0 and α. Previous research on FDR control includes the nonparametric method of Storey (2002a) and the parametric method of Guan et al. (2004). In this paper we further study the operating characteristics of general p-value based

nonparametric methods. Our study reveals the conservative nature of the nonparametric approaches, and we further theoretically quantify the relations between the parametric and nonparametric approaches.

The basic idea of the nonparametric approach in Storey (2002a) is to use the p-values (p_1, ..., p_m) as the test statistics. Note that, usually, under the true null hypothesis p_k ~ U[0, 1]. When the rejection region is chosen as Γ = [0, τ], we have

    α̂ = F_m(τ),   α̂_0 = τ,   π̂_0(λ) = {1 − F_m(λ)} / (1 − λ),   (6)

where F_m is the empirical distribution function of the observed p-values and λ ∈ [0, 1]. The optimal λ can be chosen by minimizing the MSE of π̂_0(λ).

In the parametric approach of Guan et al. (2004), two parametric families are introduced to model the distribution of the test statistic: F_0(·; θ_0) for the null distribution and F_1(·; θ_1) for the alternative distribution. The marginal distribution is F(·; π_0, θ_0, θ_1) = π_0 F_0(·; θ_0) + (1 − π_0) F_1(·; θ_1). The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to obtain the MLEs of the parameters π_0 and θ_1. Then for any given rejection region Γ, we have

    α̂ = F(Γ; π̂_0, θ_0, θ̂_1)   and   α̂_0 = F_0(Γ; θ_0).   (7)

For simplicity we use (F, F_0, F_1) to represent both the cumulative distribution functions and the corresponding probability measures.

2. Rejection Region Construction and FDR Modeling

For the convenience of the following discussion, we write f_0(·) for the test statistic density under the null hypothesis and f_1(·) for that of the alternative

hypothesis. In single hypothesis testing we focus on the Type I error and the power, α_0 = F_0(Γ) and 1 − β = F_1(Γ), where Γ is the rejection region. The central dogma of traditional single hypothesis testing is to control the Type I error α_0 under a pre-specified level while at the same time maximizing the power 1 − β, so in practice we try to construct rejection regions with maximum power. By the Neyman-Pearson Lemma (Neyman and Pearson, 1933), this is achieved with the likelihood ratio (LR) statistic LR(x) = f_1(x)/f_0(x) constructed from the observed data, from which we can construct the uniformly most powerful LR rejection region

    { x : f_1(x)/f_0(x) > η }.   (8)

2.1 P-value Calculation

The p-value is a well-accepted significance measure for rejecting/accepting a hypothesis, and in some papers discussing multiple comparisons (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Ge et al., 2003; Storey, 2002b) the p-value is used as a test statistic. The distribution of the p-values can be estimated using the empirical distribution function of the observed p-values. The p-value densities are closely related to the distributions of the test statistics and the construction of the rejection region Γ. For p-values we have the following results (see the Appendix for proofs; similar results appeared in Sackrowitz and Samuel-Cahn (1999)).

Lemma 1. For a hypothesis test of H_0 versus H_a with test statistic X, assume X has density f_0(x) under H_0 and f_1(x) under H_a, and let P_0 and P_1 be the corresponding measures. Suppose the rejection regions are constructed as {x : W(x) > η}, where W(·) is a measurable function. Let Q_k(x), q_k(x), k = 0, 1 be the distribution and density functions of W(X) under H_0 and H_a, respectively. Furthermore assume that Q_0(x) is continuous and strictly increasing. For an observed test statistic value x_0, the p-value can be calculated as

    p = P_0{x : W(x) > W(x_0)} = 1 − Q_0{W(x_0)}.   (9)

Under H_0, the p-value has a uniform density, g_0(p) = I{p ∈ [0, 1]}. Under H_a, the p-value has the following density and distribution functions:

    g_1(p) = q_1{Q_0^{−1}(1 − p)} / q_0{Q_0^{−1}(1 − p)},   G_1(p) = 1 − Q_1{Q_0^{−1}(1 − p)},   (10)

where

    q_1(η) / q_0(η) = lim_{η_1 → η} P_1{x : η < W(x) ≤ η_1} / P_0{x : η < W(x) ≤ η_1},   (11)

and hence g_1(p) ≥ inf_x {f_1(x)/f_0(x)}.

Theorem 1. For the uniformly most powerful LR test (8), where the rejection region is constructed by

    { x : LR(x) = f_1(x)/f_0(x) > η },

we have

    g_1(p) = Q_0^{−1}(1 − p).   (12)

Therefore g_1(p) is a non-increasing function on the interval [0, 1]. Furthermore we have

    min_{p ∈ [0,1]} g_1(p) = g_1(1) = Q_0^{−1}(0) = inf_x f_1(x)/f_0(x).   (13)
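Theorem 1 can be verified numerically in a concrete special case: a one-sided normal shift alternative (an illustrative choice, not a setting taken from the paper). There H_0: X ~ N(0, 1), H_a: X ~ N(µ, 1), so LR(x) = exp(µx − µ²/2) and the LR region {LR(x) > η} is {x > c}. The p-value density under H_a computed from the q_1/q_0 form of Lemma 1 coincides with Q_0^{−1}(1 − p) and is decreasing in p:

```python
import numpy as np
from scipy.stats import norm

mu = 2.0
p = np.linspace(0.01, 0.99, 99)
x = norm.ppf(1 - p)                          # x0 = Phi^{-1}(1 - p), since p = 1 - Phi(x0)

g1_direct = norm.pdf(x - mu) / norm.pdf(x)   # q1/q0 form from Lemma 1
g1_theorem = np.exp(mu * x - mu**2 / 2)      # Q0^{-1}(1 - p) = LR(x0), Theorem 1

# The two expressions agree, and g1 decreases in p (Theorem 1's monotonicity)
```

Here inf_x f_1(x)/f_0(x) = 0, so g_1(p) decreases toward 0 as p approaches the boundary 1, consistent with (13).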

This theorem reveals that the p-value based on the LR test region has a monotone decreasing density. In multiple hypothesis testing, if we assume the p-values from the individual tests follow one common distribution, nonparametric estimation of π_0 can be based on the p-value density (to be discussed in Section 2.2). Theorem 1 then justifies the common practice of using the p-value density at the boundary 1 to approximate π_0. For rejection regions not based on the LR test, it is possible to observe a non-monotone p-value density, and according to Lemma 1 the least conservative π_0 estimate is then the minimum of the p-value density, which is not necessarily attained at the boundary 1.

2.2 Smoothing Nonparametric Approach

Suppose we use the p-value as the test statistic. Its density is g(p) = π_0 + (1 − π_0) g_1(p), where π_0 is the proportion of true null hypotheses and g_1(p) is the density of the p-values under the alternative hypothesis. In the nonparametric approach, the key is the estimation of π_0. We propose the following least conservative estimate of π_0:

    min_p g(p) = π_0 + (1 − π_0) min_p g_1(p).   (14)

The simplest density estimation method is the histogram approach, ĝ(p) = {F_m(λ_2) − F_m(λ_1)} / (λ_2 − λ_1) for λ_1 ≤ p ≤ λ_2. The nonparametric estimator π̂_0(λ) in (6) is just the histogram density estimate over (λ, 1], and it implicitly assumes that g(1) attains the minimum value. We can also apply other smoothing methods, e.g. kernel density estimation. The poor performance of the nonparametric approach is mainly because π̂_0(λ) is based only on the p-values over (λ, 1]. Note that when λ is small, π̂_0(λ) as an estimator is itself very stable. In principle we could borrow

strength from small λ to extrapolate to π̂_0(1). This motivates us to smooth π̂_0(λ) or ĝ(λ) as functions of λ. As discussed previously, it is reasonable to assume g_1(p) is non-increasing. The theoretical value of π̂_0(λ) is

    π_0(λ) = {1 − F(λ)} / (1 − λ) = π_0 + (1 − π_0) ∫_λ^1 g_1(p) dp / (1 − λ).   (15)

We have

    dπ_0(λ)/dλ = (1 − π_0) { ∫_λ^1 g_1(p) dp − (1 − λ) g_1(λ) } / (1 − λ)^2 ≤ 0,

so π_0(λ) and g(λ) = π_0 + (1 − π_0) g_1(λ) are both non-increasing functions of λ. Hence monotone smoothing methods can be used for the extrapolation. Furthermore, we have

    π_0(1) = g(1) = π_0 + (1 − π_0) g_1(1).   (16)

In the following applications we used the constrained B-splines (He and Ng, 1999) for the monotone extrapolation.

2.3 Modeling Test Statistics vs. P-values

Although the p-value has a uniform distribution under the null hypothesis, its distribution under the alternative is often unknown. An empirical Bayes method (Efron et al., 2001; Efron and Tibshirani, 2002; Efron, 2003) proposed to use the posterior probability of being differentially expressed,

    π̂_1(x) = 1 − π_0 f_0(x) / f(x),   (17)

as a test statistic, and it was pointed out that π_0 is not identifiable in the nonparametric approach. In addition, Efron (2003) proposed the most conservative estimate of π_1 = 1 − π_0, namely π_{1,min} = 1 − inf_x {f(x)/f_0(x)}, and

hence the least conservative estimate of π_0: π_{0,max} = inf_x {f(x)/f_0(x)}. Under the i.i.d. assumption, we have

    π_{1,min} = π_1 − π_1 inf_x f_1(x)/f_0(x),   π_{0,max} = π_0 + π_1 inf_x f_1(x)/f_0(x).   (18)

According to (8), this empirical Bayes method is equivalent to the nonparametric version of the LR based test, where the densities f_0(x) and f(x) are estimated from the observed data. Furthermore, according to Lemma 1 and Theorem 1, it is equivalent to the p-value based nonparametric FDR estimation where the p-values are obtained from the LR statistics.

3. Simulation Studies

3.1 Finite Normal Mixture Example

Here we discuss the parametric and nonparametric approaches for finite normal mixture distributions. Suppose

    T_i | H_i = 1 ~ N(0, 1);   T_i | H_i = 0 ~ Σ_k π_k N(µ_k, 1),

where π_k ∈ (0, 1), Σ_k π_k = 1 and µ_k ≠ 0. We have

    LR(x) = f_1(x)/f_0(x) = Σ_k π_k exp(xµ_k − µ_k^2/2).

1. If all the µ_k are positive (negative), then inf_x LR(x) = 0, and the uniformly most powerful rejection region is {x ≥ x_0} ({x ≤ x_0}). Therefore the nonparametric π_0 estimate can approach the true value.

2. If there exist i, j with µ_i < 0 and µ_j > 0, then clearly inf_x LR(x) > 0, and f_1(0)/f_0(0) = Σ_k π_k exp(−µ_k^2/2) > 0. Under this setting, the LR test rejection region {LR(x) > η} is equivalent to {|x| > x_0} if and only if the π_k and µ_k satisfy the following condition (see the Appendix for

proof):

    for every i there exists j such that µ_i + µ_j = 0 and π_i = π_j.   (19)

Furthermore, arg min_x LR(x) = 0 if and only if

    Σ_k π_k µ_k exp(−µ_k^2/2) = 0.   (20)

This is because

    dLR(x)/dx = Σ_k π_k µ_k exp(xµ_k − µ_k^2/2),   d^2 LR(x)/dx^2 = Σ_k π_k µ_k^2 exp(xµ_k − µ_k^2/2) > 0,

so LR(x) is strictly convex. In particular, condition (19) is a special case of (20). Hence for the commonly used symmetric region the estimate of π_0 will approach π_0 + (1 − π_0) f_1(0)/f_0(0). This is larger than the estimate based on the LR test region, π_0 + (1 − π_0) min_x {f_1(x)/f_0(x)}, unless condition (20) is met.

3.2 Simulation

Consider the following setup for the finite normal mixture model: π_1 = 0.2, µ_1 = −2, π_2 = 0.8, µ_2 = 1, with f_1(x) = Σ_{k=1}^2 π_k N(µ_k, 1). Suppose we conduct m = 1000 hypothesis tests with π_0 = 0.2 and f_0(x) = N(0, 1). The parametric normal mixture model π_0 N(0, 1) + (1 − π_0){π_1 N(µ_1, 1) + π_2 N(µ_2, 1)} is fitted to obtain the MLE π̂_pm of π_0. P-values can be calculated as p = 2Φ(−|x|), from which we get the nonparametric estimate π̂_np of π_0 (Storey, 2002b). For the empirical Bayes method, we first estimate the density of the test statistic, f̂(x), then π̂_eb = inf_x f̂(x)/f_0(x), where f_0(x) = φ(x).

Figure 1 plots the LR and the symmetric rejection regions as functions of the rejection probability α_0 of (4). Also shown in the plot are the p-value

densities for the two rejection regions. For the symmetric rejection regions, the minimum p-value density is π_np = π_0 + (1 − π_0) LR(0) = 0.61, compared with π_eb = π_0 + (1 − π_0) min_x LR(x) = 0.48 for the LR rejection regions. Both over-estimate the true value π_0 = 0.2. In Figure 1, boxplots are used to summarize the simulation results. We can clearly see that the simulation results agree with the theoretical results very well.

[Figure 1 about here.]

4. Application to Microarray Data

4.1 Leukemia gene expression data

We apply the proposed FDR estimation procedure to the leukemia gene expression data reported in Golub et al. (1999), where the mRNA levels of 7129 genes were measured for n = 72 patients, among them n_1 = 47 patients with Acute Lymphoblastic Leukemia (ALL) and n_2 = 25 patients with Acute Myeloid Leukemia (AML). The goal is to identify differentially expressed genes between these two groups. The gene expression data can be summarized in a matrix X = (x_ij), where (x_{i,1}, ..., x_{i,n_1}) are for the ALL patients and (x_{i,n_1+1}, ..., x_{i,n}) for the AML patients. We follow the same preprocessing procedure as Dudoit et al. (2002). We first truncate the gene expression levels to lie between 100 and 16000, then keep the i-th gene if it satisfies two conditions: max_j x_ij / min_j x_ij > 5 and max_j x_ij − min_j x_ij > 500. After this filtering m = 3571 genes are left. We then take the logarithm of the measured intensities and calculate the two-sample t-test statistics

    T_i = (x̄_{i1} − x̄_{i2}) / sqrt(σ̂_1^2/n_1 + σ̂_2^2/n_2),

where x̄_{i1} = Σ_{j=1}^{n_1} x_ij / n_1, x̄_{i2} = Σ_{j=n_1+1}^{n} x_ij / n_2, σ̂_1^2 = Σ_{j=1}^{n_1} (x_ij − x̄_{i1})^2 / (n_1 − 1) and σ̂_2^2 = Σ_{j=n_1+1}^{n} (x_ij − x̄_{i2})^2 / (n_2 − 1).
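The per-gene statistics T_i are straightforward to compute in vectorized form. The data matrix below is synthetic, standing in only for the dimensions of the filtered expression matrix (3571 genes, 47 vs. 25 samples), not for the actual Golub et al. measurements:

```python
import numpy as np

def two_sample_t(X, n1):
    """Welch-type two-sample t-statistic for each row (gene) of X,
    where columns 0..n1-1 form group 1 and the remaining columns group 2."""
    X1, X2 = X[:, :n1], X[:, n1:]
    n2 = X.shape[1] - n1
    xbar1, xbar2 = X1.mean(axis=1), X2.mean(axis=1)
    s1 = X1.var(axis=1, ddof=1)   # unbiased variance, divisor n1 - 1
    s2 = X2.var(axis=1, ddof=1)   # unbiased variance, divisor n2 - 1
    return (xbar1 - xbar2) / np.sqrt(s1 / n1 + s2 / n2)

# Illustrative data mimicking the post-filtering dimensions of the leukemia set
rng = np.random.default_rng(2)
X = rng.normal(size=(3571, 72))
T = two_sample_t(X, n1=47)
```

Computing all m statistics at once, rather than gene by gene, keeps the preprocessing step fast even for tens of thousands of genes.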

For this relatively large sample size (n = 72), T_i asymptotically follows a normal distribution with variance 1. We use a normal mixture model to fit the t-statistics, proposing the following three-component model for the genes:

Without Difference: standard normal distribution, N(µ_0 = 0, 1);
Up-Regulated: normal mixture with positive means, N(µ_U > 0, σ_U^2 = 1);
Down-Regulated: normal mixture with negative means, N(µ_L < 0, σ_L^2 = 1).

The mixture distribution can be written as Σ_k π_k N(µ_k, 1), where Σ_k π_k = 1. We can use the Bayesian Information Criterion (BIC) to select the number of components,

    BIC(p) = 2 log Pr(Data | θ̂) − p log(m),

where θ̂ is the vector of maximum likelihood estimates of the parameters and p is the number of parameters in the model (Fraley and Raftery, 2002). In our model setup p = 2G − 2, where G is the number of normal components (we know the mean of the first component and there is one constraint on the proportions). For G = 1, 2, ..., 12, we use the EM algorithm to fit the mixture models and select G = arg max_G BIC(p). The maximum BIC was achieved at G = 8. The corresponding parameter estimates are π̂_0 = 0.35, with three positive components (π̂_U, θ̂_U) = {(0.214, 2.42), (0.045, 5.22), (0.003, 9.57)} and four negative components (π̂_L, θ̂_L) = {(0.306, −1.57), (0.068, −3.88), (0.012, −6.82), (0.002, −11.64)}.
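A minimal EM sketch for this constrained mixture (unit variances, null mean fixed at zero) might look as follows. The quantile-based initialization, the iteration count, and the demo data are arbitrary choices for illustration; the authors' implementation may differ:

```python
import numpy as np
from scipy.stats import norm

def em_fixed_null_mixture(t, G, n_iter=200):
    """EM for sum_k pi_k N(mu_k, 1) with the first mean fixed at 0 (null component)."""
    t = np.asarray(t)
    # Deterministic initialization: free means start at spread-out quantiles of t
    mu = np.concatenate(([0.0], np.quantile(t, np.linspace(0.05, 0.95, G - 1))))
    pi = np.full(G, 1.0 / G)
    for _ in range(n_iter):
        dens = pi * norm.pdf(t[:, None], loc=mu)      # (m, G) weighted component densities
        w = dens / dens.sum(axis=1, keepdims=True)    # E-step: posterior memberships
        pi = w.mean(axis=0)                           # M-step: mixing proportions
        if G > 1:                                     # M-step: free means (mu_0 stays 0)
            mu[1:] = (w[:, 1:] * t[:, None]).sum(axis=0) / w[:, 1:].sum(axis=0)
    loglik = np.log((pi * norm.pdf(t[:, None], loc=mu)).sum(axis=1)).sum()
    return pi, mu, loglik

def bic(loglik, G, m):
    # p = 2G - 2 free parameters: G - 1 free means and G - 1 free proportions
    return 2 * loglik - (2 * G - 2) * np.log(m)

# Demo on synthetic t-statistics (hypothetical, not the leukemia data):
# pi0 = 0.4 nulls, plus down- and up-regulated components at -3 and +3
rng = np.random.default_rng(0)
t_demo = np.concatenate([rng.normal(0, 1, 1200),
                         rng.normal(-3, 1, 900),
                         rng.normal(3, 1, 900)])
pi_hat, mu_hat, ll = em_fixed_null_mixture(t_demo, G=3)
```

In practice one would fit G = 1, ..., 12, keep the G maximizing bic(loglik, G, m), and restart the EM from several initial values to guard against local maxima, as the Discussion notes.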

Figure 2 compares the empirical cumulative distribution function (ECDF) with the fitted mixture model, and shows the quantile-quantile plot for the test statistics. Overall the mixture model provides a reasonable fit. Figure 2 also displays the FDR estimates for this dataset, where we choose the rejection region as {|T| > t_0}. The maximum value of the FDR is π̂_0 = 0.35 at t_0 = 0, where every gene is declared significant. Also shown in the figure is the number of significant genes versus the FDR estimates; when FDR = π̂_0, all genes are declared significant.

[Figure 2 about here.]

We can also apply the nonparametric approach to this leukemia gene expression data. We use permutation to get the p-values for the t-statistics, based on B = 1000 permutations. The histogram of the permutation p-values is plotted in Figure 3, along with the monotone smoothing estimate of π_0 based on the constrained B-splines (He and Ng, 1999). The extrapolated value at the boundary is π̂_0 = … .

[Figure 3 about here.]

There is a difference between the parametric and nonparametric estimates of π_0 (0.35 vs. …). Supposing the fitted mixture model is correct, the least conservative nonparametric estimate of π_0 is min_{p ∈ [0,1]} g(p) = g(1) = π_0 + (1 − π_0) LR(0) = 0.451, very close to the nonparametric estimate. If we use the empirical Bayes method, the least conservative estimate is π_0 + (1 − π_0) min_x LR(x) = π_0 + (1 − π_0) LR(0.41) = … . Figure 3 compares the permutation p-value density with the theoretical density from the fitted mixture model. They agree with each other very well.
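The nonparametric estimator π̂_0(λ) of (6), viewed as a function of λ as in (15), can be sketched as follows. The Beta(0.2, 5) alternative p-values and the true π_0 = 0.4 are illustrative choices, not values from the leukemia data:

```python
import numpy as np

def pi0_hat(pvals, lam):
    """Storey-type estimator (6): fraction of p-values above lam, rescaled by 1 - lam."""
    pvals = np.asarray(pvals)
    return np.mean(pvals > lam) / (1.0 - lam)

# Illustrative p-values: pi0 = 0.4 uniform nulls, alternatives concentrated near 0
rng = np.random.default_rng(3)
p = np.concatenate([rng.random(4000), rng.beta(0.2, 5, 6000)])

lams = np.arange(0.05, 0.96, 0.05)
est = np.array([pi0_hat(p, lam) for lam in lams])
# est traces pi0(lam) of (15): non-increasing in lam, biased upward for small lam,
# approaching pi0 + (1 - pi0) g1(1) as lam -> 1
```

Monotone smoothing of this curve (the paper uses constrained B-splines) then extrapolates the stable small-λ portion out to the boundary value π̂_0(1).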

4.2 Colon cancer gene expression data

The colon cancer gene expression data contain the expression values of 2000 genes from 40 tumor and 22 normal colon tissue samples, reported by Alon et al. (1999). We apply the normal mixture model to estimate the FDR for these data. With BIC we select 6 normal components, with probability and mean estimates π̂_0 = 0.408, (π̂_L, θ̂_L) = {(0.073, −3.72), (0.193, −1.81)} and (π̂_U, θ̂_U) = {(0.247, 1.37), (0.074, 3.36), (0.005, 6.38)}. Figure 4 shows some model fitting diagnostics and the FDR estimates for the colon cancer data.

Using permutations we can estimate the p-value for each gene, which can be compared with the parametric approach. Figure 5 shows the p-value densities from the permutations and from the normal mixture model. They agree with each other very well. We have the parametric estimate π̂_pm = 0.408; the limit value of the nonparametric estimate is π̂_pm + (1 − π̂_pm) f_1(0)/f_0(0) = … .

[Figure 4 about here.]

[Figure 5 about here.]

5. Impact of Dependence among Genes

The previous discussion was based on the assumption that the genes are independent, which enables us to pool information across all genes to obtain the estimates. In gene expression data it is more realistic to assume that genes are locally dependent; e.g., genes in a pathway are more likely to interact with each other and affect the system function in a synergistic way. Here we carry

out some simulation studies to evaluate the robustness of the proposed model for estimating the FDR in the presence of dependence among genes. Suppose we have m genes, divided into K blocks with each block consisting of m/K genes. We assume independence between blocks and a constant correlation ρ between genes within each block. A proportion π_0 of the genes are simulated from N(0, 1); the remaining 1 − π_0 are simulated from a mixture of equal proportions of up/down-regulated genes with distributions N(µ_1, 1) and N(µ_2, 1). We investigate the effects of K and ρ on the FDR estimates. For simplicity of the simulation, we assume that we know there are two underlying components for the differentially expressed genes.

To set reasonable values for the µ_j and ρ, we can use empirical values from the previous two gene expression datasets. The averages of the positive/negative means for the leukemia gene expression data are

    Σ_{θ_k>0} π_k θ_k / Σ_{θ_k>0} π_k = 2.98,   Σ_{θ_k<0} π_k θ_k / Σ_{θ_k<0} π_k = … ;

for the colon cancer gene expression data they are 1.91 and … . Therefore we choose µ_1 = 2, µ_2 = −2 in the simulation. To set values for ρ, we first cluster all the genes into groups of approximately 50 genes per group. For each gene we calculate the two-sample t-statistic. Within each group, 300 bootstrap samples are used to approximate the mean correlation of the t-statistics between genes. Finally the mean correlation is averaged over all the groups to get an average ρ. For the leukemia gene expression data ρ = 0.32, and ρ = 0.49 for the colon cancer data. We use ρ = 0.3, 0.5 in the simulations; ρ = 0.1, 0.9 are included as

17 two more extreme situations, and the indepndence with ρ = 0 is also included as a comparison reference. Figure 7 summarizes the simulation results for π 0 and FDR from m = 3500, π 0 = 0.35, K = 35, 70, 140 and ρ = 0.1, 0.3, 0.5, 0.9. Overall we can see that the estimate of π 0 has very small bias. And as expected the larger the dependence, the more variable the estimate. The cluster size has a negligible effect when ρ is relatively small. Overall the variation of the π 0 estimation is increased with increasing number of local gene clusters. The FDR estimation is mainly affected by the π 0 estimation, its pattern is very similar to π 0. [Figure 6 about here.] 6. Discussion [Figure 7 about here.] The proposed finite normal mixture model is not identifiable with respect to the ordering of the components and to overfitting. We can eliminate this identifiability problem simply by posing constraints on the ordering of the components (Yakowitz and Spragins, 1968). For finite normal mixture models, it is possible that EM algorithm may converge to a local maximum. We used multiple random starting points to select the best model fitting among all the starting points, and this procedure gave us reasonably good estimators in our simulations and microarray applications. We are in the process of developing an R package for the proposed methods. The R package and the documentations on the implementation details will be posted on the web very soon. 17

As the simulation and application examples illustrate, the parametric approach is preferred when possible, as it gives unbiased estimates and is more accurate and efficient. When using a nonparametric approach, the empirical Bayes approach models the test statistics directly and is equivalent to the likelihood ratio based method. As we do not assume a distributional form for the test statistics under the alternative hypothesis, the nonparametric approach often can only estimate an upper bound for π_0, the proportion of true null genes.

The proposed model essentially assumes independence among the genes. Through simulations we have found that the proposed model can still produce very good estimates under local dependence. But it is possible that there are more complicated settings in which ignoring the dependence among genes may seriously under/over-estimate the FDR. More research will be conducted in the future on FDR estimation incorporating the dependence among genes.

Acknowledgements

We are very grateful to the Associate Editor and the referee for their helpful suggestions. This research was supported in part by NIH grant GM and NSF grant DMS and a startup fund from the Division of Biostatistics, University of Minnesota.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad patterns of gene expression revealed by

clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57.

Benjamini, Y., Krieger, A. and Yekutieli, D. (2001). Adaptive linear step-up FDR controlling procedures. Technical Report, Tel Aviv University.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39.

Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97.

Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics 31.

Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23.

Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97.

Ge, Y., Dudoit, S. and Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis. Test 12.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.

Guan, Z., Wu, B. and Zhao, H. (2004). Model-based approach to FDR estimation. Technical Report.

He, X. and Ng, P. (1999). COBS: qualitatively constrained smoothing via linear programming. Computational Statistics 14.

Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A 231.

Sackrowitz, H. and Samuel-Cahn, E. (1999). P-values as random variables: expected p-values. American Statistician 53.

Storey, J., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B (Methodological) 66.

Storey, J. D. (2002a). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Methodological) 64.

Storey, J. D. (2002b). False Discovery Rates: Theory and Applications to DNA Microarrays. PhD thesis, Stanford University.

Westfall, P. and Young, S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. Wiley.

Yakowitz, S. J. and Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39.

Appendix

Proof of Lemma 1. Consider a p-value p_0 = 1 − Q_0{W(x_0)}, so that W(x_0) = Q_0^{−1}(1 − p_0). Under H_0,

    Pr(p ≤ p_0) = P_0[x : 1 − Q_0{W(x)} ≤ p_0] = P_0{x : W(x) ≥ W(x_0)} = 1 − Q_0{W(x_0)} = p_0.

Under H_a,

    Pr(p ≤ p_0) = P_1[x : 1 − Q_0{W(x)} ≤ p_0] = P_1{x : W(x) ≥ W(x_0)} = 1 − Q_1{Q_0^{−1}(1 − p_0)}.

Hence

    g_1(p) = dG_1(p)/dp = −dQ_1{Q_0^{−1}(1 − p)}/dp = q_1{Q_0^{−1}(1 − p)} / q_0{Q_0^{−1}(1 − p)}.

According to the definitions of q_0(·) and q_1(·), we have

    q_0(η) = lim_{η_1→η} P_0{x : η < W(x) ≤ η_1} / (η_1 − η),   q_1(η) = lim_{η_1→η} P_1{x : η < W(x) ≤ η_1} / (η_1 − η).

Therefore

    q_1(η) / q_0(η) = lim_{η_1→η} P_1{x : η < W(x) ≤ η_1} / P_0{x : η < W(x) ≤ η_1}.

Since W(x) is a measurable function, the region Γ = {x : η < W(x) ≤ η_1} is measurable. We have

    P_1(Γ) = ∫_Γ f_1(x) dx = ∫_Γ f_0(x) {f_1(x)/f_0(x)} dx ≥ ∫_Γ f_0(x) inf_x{f_1(x)/f_0(x)} dx = P_0(Γ) inf_x{f_1(x)/f_0(x)},

and hence g_1(p) ≥ inf_x {f_1(x)/f_0(x)}.

Proof of Theorem 1. By definition LR(x) = f_1(x)/f_0(x). Let Γ = {x : η < LR(x) ≤ η_1}. We have

    P_1(Γ) = ∫_Γ f_1(x) dx ≥ ∫_Γ η f_0(x) dx = η P_0(Γ),

and similarly P_1(Γ) ≤ η_1 P_0(Γ), so q_1(η)/q_0(η) = η and

    g_1(p) = q_1{Q_0^{−1}(1 − p)} / q_0{Q_0^{−1}(1 − p)} = Q_0^{−1}(1 − p).

Proof of (19). We have shown that LR(x) is a strictly convex function. If for every i there exists j such that µ_i + µ_j = 0 and π_i = π_j, then clearly LR(x) is symmetric about zero, and hence {LR(x) = f_1(x)/f_0(x) > η} = {|x| > x_0}, where η = LR(x_0).

Now suppose {LR(x) = f_1(x)/f_0(x) > η} = {|x| > x_0}; then LR(x) = LR(−x) for all x. Let max_j µ_j = µ_J > 0. Then LR(x) exp(−xµ_J) = LR(−x) exp(−xµ_J), i.e. L_1 = L_2, where

    L_1 = π_J exp(−µ_J^2/2) + Σ_{k≠J} π_k exp{x(µ_k − µ_J) − µ_k^2/2},
    L_2 = π_J exp(−2xµ_J − µ_J^2/2) + Σ_{k≠J} π_k exp{−x(µ_k + µ_J) − µ_k^2/2}.

We know that lim_{x→∞} L_1 = π_J exp(−µ_J^2/2). So there must exist a K such that π_K = π_J and µ_K + µ_J = 0, which makes lim_{x→∞} L_2 = lim_{x→∞} L_1. From

    LR(x) − π_J exp(xµ_J − µ_J^2/2) = LR(−x) − π_K exp(−xµ_K − µ_K^2/2),

we can show that the second largest µ_k satisfies the symmetric condition, and so sequentially we can prove that for every i there exists j such that µ_i + µ_j = 0 and π_i = π_j.

Figure 1. Simulation study: the top two plots compare the LR and the symmetric rejection regions; the bottom one compares the parametric (pm), empirical Bayes (eb) and nonparametric (np) estimates.

Figure 2. 3-Component Model Fitting for the Leukemia Data and FDR estimation. Panels: distribution function estimation (ECDF vs. the fitted mixture model), QQ plot of the test statistics, and FDR estimates and number of significant genes against the threshold for rejection regions {|T| ≥ Γ}.

Figure 3. Nonparametric vs. Parametric Estimation for the Leukemia data. Panels: nonparametric smoothing estimate of π̂_0(λ), and the permutation p-value density vs. the fitted mixture density.

Figure 4. 3-Component Model Fitting for the Colon cancer Data and FDR estimation. [Plots not reproduced: distribution function estimation (ECDF vs. 3-component model), QQ plot of the test statistics, and FDR, π0, and number of significant genes against the threshold Γ for the rejection region {|T| > Γ}.]

Figure 5. Nonparametric vs. Parametric Estimation for the Colon cancer data. [Plots not reproduced: π̂0(λ) (parametric and nonparametric estimates) against λ, and the permutation and mixture density estimates against the p-value.]

Figure 6. FDR estimation under local dependence: there are 13 simulations based on combinations of 5 different ρ values and 3 different K values, which are labeled at the bottom of each plot. The boxplots are based on 100 replicates, and the horizontal dashed black line represents the true value estimated from the 100 replicates. The pattern of the FDR estimates is very similar to that of π0: the larger the correlation ρ and the number of local clusters K, the more variable the estimates. Overall the proposed model gives very good estimates, even when the local correlation is as large as 0.5. [Boxplots not reproduced.]

Figure 7. Bias and variance analysis for FDR estimation under local dependence: there are 13 simulations based on combinations of 5 different ρ values and 3 different K values, which are labeled at the bottom of each plot. Shown in the plot are the ratios of the absolute bias and of the standard error to the true means. The pattern is consistent: larger ρ and K increase both the bias and the variance, but overall the bias is very small compared to the variance. Under local dependence, the proposed approach gives reasonable estimates even when the local correlation is as high as 0.5. [Plots not reproduced.]
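One simple way to generate locally dependent null statistics of the kind summarized in Figures 6 and 7 is an equicorrelated-block model; the design below (shared cluster effects, z-statistics) is an assumption for illustration, not the paper's exact simulation:

```python
# Assumed local-dependence generator: null z-statistics in K clusters,
# each pair within a cluster having correlation rho, marginally N(0,1).
import numpy as np

def block_corr_z(m, K, rho, rng):
    """z = sqrt(rho) * u_cluster + sqrt(1 - rho) * e, so Var(z) = 1."""
    block = rng.integers(0, K, size=m)   # cluster label of each test
    u = rng.normal(size=K)               # shared effect within a cluster
    e = rng.normal(size=m)               # independent noise
    return np.sqrt(rho) * u[block] + np.sqrt(1 - rho) * e

rng = np.random.default_rng(2)
z = block_corr_z(m=2000, K=20, rho=0.5, rng=rng)
print(z.mean(), z.std())   # marginally close to N(0,1) despite the dependence
```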

Table 1
Possible Outcomes of Multiple Hypothesis Testing

                     Accepted   Rejected   Total
True Null               U          V        m0
True Alternative        T          S        m1
Total                   N          R        m
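A small simulation makes Table 1's counts concrete. Rejecting at p ≤ 0.05 under assumed values m = 5000, m0 = 4000 and a N(2,1) alternative (illustrative choices, not from the paper) gives V, S, and R, and the empirical ratio V/R whose expectation the FDR controls:

```python
# One realization of Table 1 for the rule "reject when p <= 0.05",
# under an assumed mixture of N(0,1) nulls and N(2,1) alternatives.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
m, m0 = 5000, 4000
h = np.concatenate([np.ones(m0), np.zeros(m - m0)])    # h_k = 1 for true nulls
x = np.where(h == 1, rng.normal(0, 1, m), rng.normal(2, 1, m))
p = 1 - norm.cdf(x)

r = (p <= 0.05).astype(int)    # r_k = 1 if the k-th hypothesis is rejected
V = int((r * h).sum())         # false positives
R = int(r.sum())               # total rejections
S = R - V                      # true positives
print(V, S, R, V / R)          # empirical V/R; its expectation is the FDR
```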


More information

Basic concepts and introduction to statistical inference

Basic concepts and introduction to statistical inference Basic concepts and introduction to statistical inference Anna Helga Jonsdottir Gunnar Stefansson Sigrun Helga Lund University of Iceland (UI) Basic concepts 1 / 19 A review of concepts Basic concepts Confidence

More information

Null Hypothesis H 0. The null hypothesis (denoted by H 0

Null Hypothesis H 0. The null hypothesis (denoted by H 0 Hypothesis test In statistics, a hypothesis is a claim or statement about a property of a population. A hypothesis test (or test of significance) is a standard procedure for testing a claim about a property

More information

Hypothesis testing S2

Hypothesis testing S2 Basic medical statistics for clinical and experimental research Hypothesis testing S2 Katarzyna Jóźwiak k.jozwiak@nki.nl 2nd November 2015 1/43 Introduction Point estimation: use a sample statistic to

More information

Numerical methods for American options

Numerical methods for American options Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment

More information

Hypothesis Testing Level I Quantitative Methods. IFT Notes for the CFA exam

Hypothesis Testing Level I Quantitative Methods. IFT Notes for the CFA exam Hypothesis Testing 2014 Level I Quantitative Methods IFT Notes for the CFA exam Contents 1. Introduction... 3 2. Hypothesis Testing... 3 3. Hypothesis Tests Concerning the Mean... 10 4. Hypothesis Tests

More information

Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

More information