Parametric and Nonparametric FDR Estimation Revisited


Baolin Wu (1), Zhong Guan (2), and Hongyu Zhao (3)

(1) Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
(2) Department of Mathematical Sciences, Indiana University South Bend, South Bend, IN 46634, USA
(3) Department of Epidemiology and Public Health, Yale University, New Haven, CT 06520, USA

Summary. Nonparametric and parametric approaches have been proposed to estimate the False Discovery Rate (FDR) under the independent hypothesis testing assumption. The parametric approach has been shown to perform better than the nonparametric approaches. In this article, we study the nonparametric approaches and quantify the underlying relations between the parametric and nonparametric approaches. Our study reveals the conservative nature of the nonparametric approaches, and establishes the connections between the empirical Bayes method and p-value based nonparametric methods. Based on our results, we advocate using the parametric approach, or directly modeling the test statistics using the empirical Bayes method.

Key words: Microarray; False discovery rate; Multiple hypothesis testing;
Multiple comparisons; Simultaneous inference; Empirical Bayes method

1. Introduction

For current large-scale genomic and proteomic datasets, there are usually hundreds of thousands of variables but limited sample size, which poses a unique challenge for statistical analysis. Variable selection serves two purposes in this context: to aid biological interpretation and to reduce the impact of noise. In microarray datasets, we are often interested in identifying differentially expressed genes. This can be formulated as the following hypothesis testing problem

$$H_i : \mu_i = 0 \quad (i = 1, \ldots, m),$$

where $m$ is the total number of genes and $\mu_i$ is the mean log ratio of the expression levels for the $i$th gene. Here we are testing $m$ genes simultaneously, which complicates error control. The possible outcomes of a multiple testing procedure are summarized in Table 1, where $V$ is the number of false positives and $S$ is the number of true positives.

[Table 1 about here.]

For the convenience of the following discussion, define

$$h_k = I\{k\text{th hypothesis being true null}\}, \qquad r_k = I\{k\text{th hypothesis being rejected}\},$$

$$h = (h_1, \ldots, h_m), \qquad r = (r_1, \ldots, r_m), \qquad v = (r_1 h_1, \ldots, r_m h_m).$$

Here we treat the $h_k$ as random variables. The $L_1$ norms of these vectors are $\|h\|_1 = m_0$, $\|r\|_1 = R$ and $\|v\|_1 = V$.
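As a concrete reading of this notation, the counts in Table 1 follow directly from the indicator vectors $h$ and $r$. The sketch below is illustrative only: the function name `outcome_table` and the toy vectors are assumptions made for this example, not part of the paper.

```python
def outcome_table(h, r):
    """Tabulate the Table 1 counts from indicator vectors.

    h[k] = 1 if the k-th hypothesis is a true null, else 0.
    r[k] = 1 if the k-th hypothesis is rejected, else 0.
    """
    if len(h) != len(r):
        raise ValueError("h and r must have the same length")
    m = len(h)
    m0 = sum(h)                                # ||h||_1, number of true nulls
    R = sum(r)                                 # ||r||_1, number of rejections
    V = sum(hk * rk for hk, rk in zip(h, r))   # ||v||_1, false positives
    S = R - V                                  # true positives
    return {"U": m0 - V, "V": V, "T": (m - m0) - S, "S": S,
            "m0": m0, "m1": m - m0, "N": m - R, "R": R, "m": m}


# Toy example (hypothetical data): 5 hypotheses, 3 true nulls, 2 rejections.
table = outcome_table(h=[1, 1, 0, 0, 1], r=[0, 1, 1, 0, 0])
print(table)
```

Only the second hypothesis is both a true null and rejected, so this toy configuration has one false positive and one true positive.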
In single hypothesis testing, the commonly used approach is to control the Type I error at a prespecified level $\alpha_0$ while maximizing the power (equivalently, minimizing the Type II error $\beta$),

$$\alpha_0 = \Pr(r_k = 1 \mid h_k = 1), \qquad \beta = \Pr(r_k = 0 \mid h_k = 0).$$

In multiple hypothesis testing we want the overall Type I error to be very small. There are different definitions of the overall Type I error in this setting. A natural extension of the Type I error is the Family-Wise Error Rate (FWER), the probability of identifying any false positives,

$$\mathrm{FWER} = \Pr(V > 0). \tag{1}$$

The most commonly used approach for FWER control is the Bonferroni correction, which adjusts the individual significance levels to $\alpha_0/m$. Generally, the Bonferroni correction is conservative, especially for genomic and proteomic datasets where $m$ is very large. There have been some developments in using resampling methods to improve power while controlling FWER (Westfall and Young, 1993; Ge et al., 2003).

The False Discovery Rate (FDR), a philosophically different approach, was first proposed by Benjamini and Hochberg (1995). It is defined as $E(V/R)$. When $R = 0$ there is no discovery, and we define $0/0 = 0$. We can also write the FDR as

$$\mathrm{FDR} = E\left(\frac{V}{R} \,\Big|\, R > 0\right)\Pr(R > 0) = E\left(\frac{V}{R} \,\Big|\, V > 0\right)\Pr(V > 0). \tag{2}$$

Storey (2002b) defined the pFDR as the following conditional expectation

$$\mathrm{pFDR} = E\left(\frac{V}{R} \,\Big|\, R > 0\right) = \frac{\mathrm{FDR}}{\Pr(R > 0)}. \tag{3}$$
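The two error rates in (1) and (2) can be approximated by Monte Carlo. The following is a minimal sketch under stated assumptions: one-sided tests of N(0, 1) against N(3, 1), $m = 100$, $\pi_0 = 0.8$ and Bonferroni rejection at $\alpha_0 = 0.05$ are all hypothetical choices made for illustration, and the function names are not from the paper.

```python
import random
from statistics import NormalDist


def bonferroni_reject(pvalues, alpha0):
    """Reject H_k when p_k <= alpha0 / m (Bonferroni correction)."""
    m = len(pvalues)
    return [1 if p <= alpha0 / m else 0 for p in pvalues]


def false_discovery_proportion(h, r):
    """Realised V / R, with the convention 0/0 = 0."""
    R = sum(r)
    V = sum(hk * rk for hk, rk in zip(h, r))
    return V / R if R > 0 else 0.0


def simulate(n_rep=500, m=100, pi0=0.8, shift=3.0, alpha0=0.05, seed=0):
    """Monte Carlo estimates of FWER = Pr(V > 0) and FDR = E(V/R)."""
    rng = random.Random(seed)
    std = NormalDist()
    fwer_hits, fdp_sum = 0, 0.0
    for _ in range(n_rep):
        # h_k is random: each hypothesis is a true null with probability pi0.
        h = [1 if rng.random() < pi0 else 0 for _ in range(m)]
        stats = [rng.gauss(0.0 if hk else shift, 1.0) for hk in h]
        pvals = [1.0 - std.cdf(t) for t in stats]   # one-sided p-values
        r = bonferroni_reject(pvals, alpha0)
        V = sum(hk * rk for hk, rk in zip(h, r))
        fwer_hits += 1 if V > 0 else 0
        fdp_sum += false_discovery_proportion(h, r)
    return fwer_hits / n_rep, fdp_sum / n_rep


fwer_hat, fdr_hat = simulate()
print(f"FWER ~ {fwer_hat:.3f}, FDR ~ {fdr_hat:.3f}")
```

Because $V/R \le I(V > 0)$ in every replicate, the estimated FDR can never exceed the estimated FWER in this simulation.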
Clearly,

$$\mathrm{FDR} \le \mathrm{FWER} = \Pr(V > 0) = \frac{\mathrm{FDR}}{E(V/R \mid V > 0)},$$

so controlling the FWER is always more stringent than controlling the FDR.

We can formally define the FDR estimation problem as follows:

DATA: $m$ test statistics $(T_1, \ldots, T_m)$, one for each hypothesis $H_k$, where $k = 1, \ldots, m$.

GOAL: Develop a testing procedure and estimate the expectation $E(V/R)$, where $V$ and $R$ are defined in Table 1.

Here we assume that $(T_1, \ldots, T_m)$ are $m$ i.i.d. random variables. First define

$$\pi_0 = \Pr(h_k = 1), \qquad \alpha_0 = \Pr(r_k = 1 \mid h_k = 1), \qquad \alpha = \Pr(r_k = 1), \tag{4}$$

where $\pi_0$ is the proportion of true null hypotheses, $\alpha_0$ is the rejection probability of a true null hypothesis, and $\alpha$ is the marginal rejection probability. Under the i.i.d. assumption, we have the following intuitive formulas for pFDR and FDR (Storey, 2002a; Storey et al., 2004; Benjamini et al., 2001)

$$\mathrm{FDR} = \Pr(h_k = 1 \mid r_k = 1) = \frac{\pi_0 \alpha_0}{\alpha}, \qquad \mathrm{pFDR} = \frac{\pi_0 \alpha_0}{\alpha}\left\{1 - (1 - \alpha)^m\right\}^{-1}. \tag{5}$$

So the pFDR and FDR estimation problems reduce to the familiar framework of estimating the parameters $\pi_0$, $\alpha_0$, and $\alpha$.

Previous research on FDR control includes the nonparametric method of Storey (2002a) and the parametric method of Guan et al. (2004). In this paper we further study the operating characteristics of general p-value based
nonparametric methods. Our study reveals the conservative nature of the nonparametric approaches, and we further quantify theoretically the relations between the parametric and nonparametric approaches.

The basic idea of the nonparametric approach in Storey (2002a) is to use the p-values $(p_1, \ldots, p_m)$ as the test statistics. Note that, usually, under the true null hypothesis, $p_k \sim U[0, 1]$. When the rejection region is chosen as $\Gamma = [0, \tau]$, we have

$$\hat\alpha = F_m(\tau), \qquad \hat\alpha_0 = \tau, \qquad \hat\pi_0(\lambda) = \frac{1 - F_m(\lambda)}{1 - \lambda}, \tag{6}$$

where $F_m$ is the empirical distribution function of the observed p-values and $\lambda \in [0, 1)$. The optimal $\lambda$ can be chosen by minimizing $\mathrm{MSE}\{\hat\pi_0(\lambda)\}$.

In the parametric approach of Guan et al. (2004), two parametric functions are introduced to model the distribution of the test statistic: $F_0(\cdot, \theta_0)$ for the null distribution and $F_1(\cdot, \theta_1)$ for the alternative distribution. The marginal distribution is

$$F(\cdot, \pi_0, \theta_0, \theta_1) = \pi_0 F_0(\cdot, \theta_0) + (1 - \pi_0) F_1(\cdot, \theta_1).$$

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to obtain the MLEs of the parameters $\pi_0$ and $\theta_1$. Then for any given rejection region $\Gamma$, we have

$$\hat\alpha = F(\Gamma, \hat\pi_0, \theta_0, \hat\theta_1) \quad \text{and} \quad \hat\alpha_0 = F_0(\Gamma, \theta_0). \tag{7}$$

For simplicity we have used $(F, F_0, F_1)$ to represent both the cumulative distribution functions and the corresponding probability measures.

2. Rejection Region Construction and FDR Modeling

For the convenience of the following discussion, we write $f_0(\cdot)$ for the test statistic density under the null hypothesis and $f_1(\cdot)$ for that of the alternative
hypothesis. In single hypothesis testing, we focus on the Type I error and the power, $\alpha_0 = F_0(\Gamma)$ and $1 - \beta = F_1(\Gamma)$, where $\Gamma$ is the rejection region. The central principle of traditional single hypothesis testing is to control the Type I error $\alpha_0$ at a prespecified level while maximizing the power $1 - \beta$. In practice we try to construct rejection regions with maximum power. According to the Neyman-Pearson Lemma (Neyman and Pearson, 1933), this can be achieved using the likelihood ratio (LR) statistic $LR(x) = f_1(x)/f_0(x)$ constructed from the observed data, from which we can construct the following uniformly most powerful LR rejection region

$$\left\{x : \frac{f_1(x)}{f_0(x)} > \eta\right\}. \tag{8}$$

2.1 P-value Calculation

The p-value is a well-accepted significance measure for rejecting or accepting a hypothesis, and in some papers discussing multiple comparisons (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Ge et al., 2003; Storey, 2002b) the p-value is used as a test statistic. The distribution of the p-values can be estimated using the empirical distribution function of the observed p-values. The p-value densities are closely related to the distributions of the test statistics and the construction of the rejection region $\Gamma$. For p-values we have the following results (see the appendix for proofs; similar results appeared in Sackrowitz and Samuel-Cahn (1999)).

Lemma 1. For a hypothesis test of $H_0$ versus $H_a$ with test statistic $X$, assume $X$ has density $f_0(x)$ under $H_0$ and $f_1(x)$ under $H_a$, and let $P_0$ and $P_1$ be the corresponding measures. Suppose that the rejection regions are constructed as $\{x : W(x) > \eta\}$, where $W(\cdot)$ is a measurable function. Let $Q_k(x)$, $q_k(x)$, $k = 0, 1$, be the distribution and density functions of $W(X)$ under $H_0$ and $H_a$, respectively. Furthermore assume that $Q_0(x)$ is continuous and strictly increasing. For an observed test statistic value $x_0$, the p-value can be calculated as

$$p = P_0\{x : W(x) > W(x_0)\} = 1 - Q_0\{W(x_0)\}. \tag{9}$$

Under $H_0$, the p-value has a uniform density, $g_0(p) = I\{p \in [0, 1]\}$. Under $H_a$, the p-value has the following density and distribution functions:

$$g_1(p) = \frac{q_1\{Q_0^{-1}(1 - p)\}}{q_0\{Q_0^{-1}(1 - p)\}}, \qquad G_1(p) = 1 - Q_1\{Q_0^{-1}(1 - p)\}, \tag{10}$$

where

$$\frac{q_1(\eta)}{q_0(\eta)} = \lim_{\eta_1 \to \eta} \frac{P_1\{x : \eta < W(x) \le \eta_1\}}{P_0\{x : \eta < W(x) \le \eta_1\}}, \tag{11}$$

and hence $g_1(p) \ge \inf_x \{f_1(x)/f_0(x)\}$.

Theorem 1. For the uniformly most powerful LR test (8), where the rejection region is constructed by

$$\left\{x : LR(x) = \frac{f_1(x)}{f_0(x)} > \eta\right\},$$

we have

$$g_1(p) = Q_0^{-1}(1 - p). \tag{12}$$

Therefore $g_1(p)$ is a nonincreasing function on the interval $[0, 1]$. Furthermore we have

$$\min_{p \in [0,1]} g_1(p) = g_1(1) = Q_0^{-1}(0) = \inf_x \frac{f_1(x)}{f_0(x)}. \tag{13}$$
This theorem reveals that the p-value based on the LR test region has a monotone decreasing density. In multiple hypothesis testing, if we assume that the p-values from the individual tests follow one common distribution, nonparametric estimation of $\pi_0$ can be based on the p-value density (to be discussed in Section 2.2). Theorem 1 then justifies the common practice of using the p-value density at the boundary 1 to approximate $\pi_0$. For rejection regions not based on the LR test, it is possible to observe a nonmonotone p-value density, and according to Lemma 1 the least conservative $\pi_0$ estimate will be the minimum of the p-value density, which is not necessarily attained at the boundary.

2.2 Smoothing Nonparametric Approach

Suppose we use the p-value as the test statistic. Its density is $g(p) = \pi_0 + (1 - \pi_0) g_1(p)$, where $\pi_0$ is the proportion of true null hypotheses and $g_1(p)$ is the density of the p-values under the alternative hypothesis. In the nonparametric approach, the key is the estimation of $\pi_0$. We propose the following least conservative estimate of $\pi_0$:

$$\min_p g(p) = \pi_0 + (1 - \pi_0) \min_p g_1(p). \tag{14}$$

The simplest density estimation method is the histogram approach,

$$\hat g(p) = \frac{F_m(\lambda_2) - F_m(\lambda_1)}{\lambda_2 - \lambda_1}, \qquad \lambda_1 \le p \le \lambda_2.$$

The nonparametric estimator $\hat\pi_0(\lambda)$ in (6) is just the histogram density estimate over $(\lambda, 1]$, and it implicitly assumes that $g(1)$ achieves the minimum value. We can also apply other smoothing methods, e.g. kernel density estimation.

The poor performance of the nonparametric approach arises mainly because $\hat\pi_0(\lambda)$ is based only on those p-values in $[\lambda, 1)$. Note that when $\lambda$ is small, $\hat\pi_0(\lambda)$ as an estimator is itself very stable. In principle we could borrow
strength from small $\lambda$ to extrapolate $\hat\pi_0(1)$. This motivates us to smooth $\hat\pi_0(\lambda)$ or $\hat g(\lambda)$ as functions of $\lambda$. As discussed previously, it is reasonable to assume that $g_1(p)$ is nonincreasing. The theoretical value of $\hat\pi_0(\lambda)$ is

$$\pi_0(\lambda) = \frac{1 - F(\lambda)}{1 - \lambda} = \pi_0 + (1 - \pi_0)\,\frac{\int_\lambda^1 g_1(p)\,dp}{1 - \lambda}. \tag{15}$$

We have

$$\frac{d\pi_0(\lambda)}{d\lambda} = (1 - \pi_0)\,\frac{\int_\lambda^1 g_1(p)\,dp - (1 - \lambda)\,g_1(\lambda)}{(1 - \lambda)^2} \le 0,$$

so $\pi_0(\lambda)$ and $g(\lambda) = \pi_0 + (1 - \pi_0) g_1(\lambda)$ are both nonincreasing functions of $\lambda$. Hence, monotone smoothing methods can be used for extrapolation. Furthermore, we have

$$\pi_0(1) = g(1) = \pi_0 + (1 - \pi_0)\,g_1(1). \tag{16}$$

In the following applications, we used constrained B-splines (He and Ng, 1999) for monotone extrapolation.

2.3 Model Test Statistic vs. P-values

Although the p-value has a uniform distribution under the null hypothesis, its alternative distribution is often unknown. An empirical Bayes method (Efron et al., 2001; Efron and Tibshirani, 2002; Efron, 2003) proposed using the posterior probability of being different,

$$\hat\pi_1(x) = 1 - \frac{\pi_0 f_0(x)}{f(x)}, \tag{17}$$

as a test statistic, and it was pointed out that $\pi_0$ is not identifiable for the nonparametric approach. In addition, Efron (2003) proposed the most conservative estimate for $\pi_1 = 1 - \pi_0$: $\pi_{1,\min} = 1 - \inf_x \{f(x)/f_0(x)\}$, and
hence, the least conservative estimate for $\pi_0$: $\pi_{0,\max} = \inf_x \{f(x)/f_0(x)\}$. Under the i.i.d. assumption, we have

$$\pi_{1,\min} = \pi_1 - \pi_1 \inf_x \frac{f_1(x)}{f_0(x)}, \qquad \pi_{0,\max} = \pi_0 + \pi_1 \inf_x \frac{f_1(x)}{f_0(x)}. \tag{18}$$

According to (8), this empirical Bayes method is equivalent to the nonparametric version of the LR based test, where the densities $f_0(x)$ and $f(x)$ are estimated from the observed data. Furthermore, according to Lemma 1 and Theorem 1, this is equivalent to the p-value based nonparametric FDR estimation where the p-values are obtained using the LR statistics.

3. Simulation Studies

3.1 Finite Normal Mixture Example

Here we discuss the parametric and nonparametric approaches for finite normal mixture distributions. Suppose

$$T_i \mid H_i = 1 \sim N(0, 1); \qquad T_i \mid H_i = 0 \sim \sum_k \pi_k N(\mu_k, 1),$$

where $\pi_k \in (0, 1)$, $\sum_k \pi_k = 1$ and $\mu_k \ne 0$. We have

$$LR(x) = \frac{f_1(x)}{f_0(x)} = \sum_k \pi_k \exp\left(x\mu_k - \mu_k^2/2\right).$$

1. If all the $\mu_k$ are positive (negative), then $\inf_x LR(x) = 0$, and the uniformly most powerful rejection region is $\{x \ge x_0\}$ ($\{x \le x_0\}$). Therefore the nonparametric $\pi_0$ estimate can approach the true value.

2. If $\exists\, i, j$ with $\mu_i < 0$, $\mu_j > 0$, then it is obvious that $\inf_x LR(x) > 0$, and $f_1(0)/f_0(0) = \sum_k \pi_k \exp(-\mu_k^2/2) > 0$. Under this setting, the LR test rejection region $\{LR(x) > \eta\}$ is equivalent to $\{|x| > x_0\}$ if and only if all the $\pi_k$ and $\mu_k$ satisfy the following condition (see appendix for
proof)

$$\forall\, i,\ \exists\, j,\ \text{s.t.}\ \mu_i + \mu_j = 0 \ \text{and}\ \pi_i = \pi_j. \tag{19}$$

Furthermore $\arg\min_x LR(x) = 0$ if and only if

$$\sum_k \pi_k \mu_k \exp\left(-\mu_k^2/2\right) = 0. \tag{20}$$

This is because

$$\frac{dLR(x)}{dx} = \sum_k \pi_k \mu_k \exp\left(x\mu_k - \mu_k^2/2\right), \qquad \frac{d^2 LR(x)}{dx^2} = \sum_k \pi_k \mu_k^2 \exp\left(x\mu_k - \mu_k^2/2\right) > 0,$$

so $LR(x)$ is strictly convex. In particular, condition (19) is a special case of (20). Hence for the commonly used symmetric region the estimate of $\pi_0$ will approach $\pi_0 + (1 - \pi_0) f_1(0)/f_0(0)$. It will be larger than the estimate from the LR test region, $\pi_0 + (1 - \pi_0) \min_x \{f_1(x)/f_0(x)\}$, unless condition (20) is met.

3.2 Simulation

Consider the following setup for the finite normal mixture models: $\pi_1 = 0.2$, $|\mu_1| = 2$, $\pi_2 = 0.8$, $|\mu_2| = 1$, where $\mu_1$ and $\mu_2$ have opposite signs (case 2 above), with $f_1(x) = \sum_{k=1}^{2} \pi_k N(\mu_k, 1)$. Suppose we conduct $m = 1000$ hypothesis tests with $\pi_0 = 0.2$ and $f_0(x) = N(0, 1)$. The parametric normal mixture model, $\pi_0 N(0, 1) + (1 - \pi_0)\{\pi_1 N(\mu_1, 1) + \pi_2 N(\mu_2, 1)\}$, is fitted to obtain the MLE $\hat\pi_{pm}$ of $\pi_0$. P-values can be calculated as $p = 2\Phi(-|x|)$, from which we obtain the nonparametric estimate $\hat\pi_{np}$ of $\pi_0$ (Storey, 2002b). For the empirical Bayes method, we first estimate the density of the test statistic, $\hat f(x)$, and then $\hat\pi_{eb} = \inf_x \hat f(x)/f_0(x)$, where $f_0(x) = \phi(x)$.

Figure 1 plots the LR and the symmetric rejection regions as functions of the rejection probability $\alpha_0$ defined in (4). Also shown in the plot are the p-value
densities for the two rejection regions. For symmetric rejection regions, the minimum p-value density is $\pi_{np} = \pi_0 + (1 - \pi_0) LR(0) = 0.61$, compared to $\pi_{eb} = \pi_0 + (1 - \pi_0) \min_x LR(x) = 0.48$ for the LR rejection regions. Both overestimate the true value $\pi_0 = 0.2$. In Figure 1, boxplots are used to summarize the simulation results. The simulation results agree closely with the theoretical results.

[Figure 1 about here.]

4. Application to Microarray Data

4.1 Leukemia gene expression data

We apply the proposed FDR estimation procedure to the leukemia gene expression data reported in Golub et al. (1999), where mRNA levels of 7129 genes were measured for $n = 72$ patients, among whom $n_1 = 47$ patients had Acute Lymphoblastic Leukemia (ALL) and $n_2 = 25$ patients had Acute Myeloid Leukemia (AML). The goal is to identify genes that are differentially expressed between these two groups. The gene expression data can be summarized in a matrix $X = (x_{ij})$, where $(x_{i,1}, \ldots, x_{i,n_1})$ are for the ALL patients and $(x_{i,n_1+1}, \ldots, x_{i,n})$ are for the AML patients. We follow the same preprocessing procedure as Dudoit et al. (2002). We first truncate the gene expression levels to the range from 100 to 16000, then keep the $i$th gene if it satisfies two conditions: $\max_j x_{ij} / \min_j x_{ij} > 5$ and $\max_j x_{ij} - \min_j x_{ij} > 500$. After this filtering $m = 3571$ genes are left. We then take the logarithm of their measured intensities and calculate the two-sample t-test statistics

$$T_i = \frac{\bar x_{i1} - \bar x_{i2}}{\sqrt{\hat\sigma_1^2/n_1 + \hat\sigma_2^2/n_2}},$$

where $\bar x_{i1} = \sum_{j=1}^{n_1} x_{ij}/n_1$, $\bar x_{i2} = \sum_{j=n_1+1}^{n} x_{ij}/n_2$, $\hat\sigma_1^2 = \sum_{j=1}^{n_1} (x_{ij} - \bar x_{i1})^2/(n_1 - 1)$ and $\hat\sigma_2^2 = \sum_{j=n_1+1}^{n} (x_{ij} - \bar x_{i2})^2/(n_2 - 1)$.
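The filtering and test statistic described above can be sketched for a single gene as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the natural logarithm is used because the text does not state the base (the t-statistic is unchanged by the choice of base, since changing it only rescales the data).

```python
import math
from statistics import mean, variance


def keep_gene(intensities, floor=100.0, ceiling=16000.0):
    """Truncate intensities to [floor, ceiling] and apply the two filters.

    Returns (keep, truncated_values); a gene is kept when
    max/min > 5 and max - min > 500 after truncation.
    """
    clipped = [min(max(x, floor), ceiling) for x in intensities]
    hi, lo = max(clipped), min(clipped)
    return (hi / lo > 5 and hi - lo > 500), clipped


def two_sample_t(group1, group2):
    """Two-sample t-statistic with separate sample variances (n - 1 divisor)."""
    n1, n2 = len(group1), len(group2)
    se = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / se


def gene_statistic(intensities, n1):
    """T_i for one gene: the first n1 columns are group 1, the rest group 2.

    Returns None when the gene fails the filter.
    """
    keep, clipped = keep_gene(intensities)
    if not keep:
        return None
    logged = [math.log(x) for x in clipped]
    return two_sample_t(logged[:n1], logged[n1:])


# Hypothetical intensities for one gene: 3 samples in group 1, 3 in group 2.
print(gene_statistic([120.0, 90.0, 150.0, 2400.0, 1800.0, 3100.0], n1=3))
```

A gene with constant expression within both groups would give a zero standard error; this sketch does not guard against that case.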
For this relatively large sample size ($n = 72$), we know that $T_i$ asymptotically follows a normal distribution with variance 1. We use a normal mixture model to fit the t-statistics by proposing the following three-component model for the genes:

Without Difference: standard normal distribution $N(\mu_0 = 0, 1)$;
Up-Regulated: normal mixture with positive means, $N(\mu_U > 0, \sigma_U^2 = 1)$;
Down-Regulated: normal mixture with negative means, $N(\mu_L < 0, \sigma_L^2 = 1)$.

The mixture distribution can be written as $\sum_k \pi_k N(\mu_k, 1)$, where $\sum_k \pi_k = 1$. We can use the Bayesian Information Criterion (BIC) to select the number of components,

$$\mathrm{BIC}(p) = 2 \log \Pr(\text{Data} \mid \hat\theta) - p \log(m),$$

where $\hat\theta$ is a vector representing the maximum likelihood estimates of the parameters, and $p$ is the number of parameters in the model (Fraley and Raftery, 2002). In our model setup $p = 2G - 2$, where $G$ is the number of normal distributions (we know the mean of the first component and there is one constraint on the proportions). For $G = 1, 2, \ldots, 12$, we use the EM algorithm to fit the mixture models and select $G = \arg\max_G \mathrm{BIC}(p)$. The maximum of BIC was achieved at $G = 8$. The corresponding parameter estimates are $\hat\pi_0 = 0.35$, with three positive components

$$(\hat\pi_U, \hat\theta_U) = \{(0.214, 2.42), (0.045, 5.22), (0.003, 9.57)\},$$

and four negative components

$$(\hat\pi_L, \hat\theta_L) = \{(0.306, -1.57), (0.068, -3.88), (0.012, -6.82), (0.002, -11.64)\}.$$
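The fitting and selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes unit variances, fixes the first mean at 0, initialises the free means at randomly chosen data points, and keeps the best of a few random starts. All function names and default settings are assumptions made for this example.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 - LOG_SQRT_2PI)


def log_likelihood(xs, pis, mus):
    """Log-likelihood of the mixture sum_k pi_k N(mu_k, 1)."""
    return sum(math.log(sum(p * normal_pdf(x, mu) for p, mu in zip(pis, mus)))
               for x in xs)


def em_fit(xs, G, n_iter=100, seed=0):
    """EM for sum_k pi_k N(mu_k, 1) with mu_0 fixed at 0 (the null component)."""
    rng = random.Random(seed)
    mus = [0.0] + [rng.choice(xs) for _ in range(G - 1)]
    pis = [1.0 / G] * G
    for _ in range(n_iter):
        weight = [0.0] * G     # sum_i w_ik
        weight_x = [0.0] * G   # sum_i w_ik * x_i
        for x in xs:           # E-step: posterior membership probabilities
            dens = [p * normal_pdf(x, mu) for p, mu in zip(pis, mus)]
            total = sum(dens)
            for k in range(G):
                w = dens[k] / total
                weight[k] += w
                weight_x[k] += w * x
        # M-step: update the proportions and the free means (mu_0 stays at 0).
        pis = [wk / len(xs) for wk in weight]
        mus = [0.0] + [weight_x[k] / weight[k] if weight[k] > 1e-12 else mus[k]
                       for k in range(1, G)]
    return pis, mus, log_likelihood(xs, pis, mus)


def bic(loglik, G, m):
    """BIC(p) = 2 log L - p log(m), with p = 2G - 2 free parameters."""
    return 2.0 * loglik - (2 * G - 2) * math.log(m)


def select_model(xs, max_G=3, n_starts=2):
    """Return (BIC, G, pis, mus) for the G in 1..max_G with the largest BIC."""
    best = None
    for G in range(1, max_G + 1):
        fits = [em_fit(xs, G, seed=s) for s in range(n_starts)]
        pis, mus, ll = max(fits, key=lambda fit: fit[2])  # best random start
        score = bic(ll, G, len(xs))
        if best is None or score > best[0]:
            best = (score, G, pis, mus)
    return best
```

Calling `select_model(t_statistics, max_G=12)` on a list of t-statistics would mirror the search over $G = 1, \ldots, 12$ described above. The sketch works with raw densities rather than log-densities, so it would need a log-sum-exp step for statistics far from every component mean.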
Figure 2 compares the empirical distribution function (ECDF) with the fitted mixture model, and shows the quantile-quantile plot for the test statistics. Overall, the mixture model provides a reasonable fit. Figure 2 also displays the FDR estimates for this dataset, where we choose the rejection region as $\{|T| > t_0\}$. The maximum value of the FDR is $\hat\pi_0 = 0.35$, attained at $t_0 = 0$, where every gene is declared significant. Also shown in the figure is the number of significant genes versus the FDR estimate.

[Figure 2 about here.]

We can also apply the nonparametric approach to this leukemia gene expression data. We use $B = 1000$ permutations to obtain the p-values for the t-statistics. The histogram of the permutation p-values is plotted in Figure 3, together with the monotone smoothing estimate of $\pi_0$ based on constrained B-splines (He and Ng, 1999). The extrapolated value at the boundary is $\hat\pi_0 =$ [value missing in the source].

[Figure 3 about here.]

There is a difference between the parametric and nonparametric estimates of $\pi_0$ (0.35 vs. [value missing in the source]). Supposing that the fitted mixture model is correct, the least conservative nonparametric estimate of $\pi_0$ is $\min_{p \in [0,1]} g(p) = g(1) = \pi_0 + (1 - \pi_0) LR(0) = 0.451$, very close to [value missing in the source]. If we use the empirical Bayes method, the least conservative estimate is $\pi_0 + (1 - \pi_0) \min_x LR(x) = \pi_0 + (1 - \pi_0) LR(0.41) =$ [value missing in the source]. Figure 3 compares the permutation p-value density with the theoretical density from the fitted mixture model. They agree closely.
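The limits quoted above can be recomputed from the fitted mixture. The sketch below is an illustration under assumptions: it takes the reported leukemia estimates at face value, uses two-sided p-values $p = 2\Phi(-|T|)$ in place of the permutation p-values, and uses helper names that are not from the paper. It evaluates $f(x)/f_0(x) = \pi_0 + (1 - \pi_0) LR(x)$, its minimiser, and the population value $\pi_0(\lambda)$ from (15).

```python
import math
from statistics import NormalDist

STD = NormalDist()

# Fitted leukemia mixture as (proportion, mean) pairs, unit variances.
# The first pair is the null component; the proportions sum to one.
LEUKEMIA = [(0.35, 0.0),
            (0.214, 2.42), (0.045, 5.22), (0.003, 9.57),
            (0.306, -1.57), (0.068, -3.88), (0.012, -6.82), (0.002, -11.64)]


def density_ratio(x, mixture):
    """f(x)/f0(x) = sum_k pi_k exp(x mu_k - mu_k^2 / 2); the null term is pi_0."""
    return sum(p * math.exp(x * mu - mu * mu / 2.0) for p, mu in mixture)


def ratio_slope(x, mixture):
    """Derivative of f(x)/f0(x) with respect to x."""
    return sum(p * mu * math.exp(x * mu - mu * mu / 2.0) for p, mu in mixture)


def pi0_of_lambda(lam, mixture):
    """Population value {1 - F(lam)} / (1 - lam) for p = 2 Phi(-|T|)."""
    z = STD.inv_cdf(1.0 - lam / 2.0)   # p > lam  <=>  |T| < z
    mass = sum(p * (STD.cdf(z - mu) - STD.cdf(-z - mu)) for p, mu in mixture)
    return mass / (1.0 - lam)


def argmin_ratio(mixture, lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection on the derivative; valid because the ratio is convex in x."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio_slope(mid, mixture) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print("g(1) =", round(density_ratio(0.0, LEUKEMIA), 3))
print("argmin of f/f0 =", round(argmin_ratio(LEUKEMIA), 2))
for lam in (0.5, 0.9, 0.99):
    print("pi0(lambda =", lam, ") =", pi0_of_lambda(lam, LEUKEMIA))
```

Under these assumptions $g(1)$ reproduces the reported 0.451, the minimiser of $f/f_0$ falls near the reported $x = 0.41$, and $\pi_0(\lambda)$ decreases toward $g(1)$ as $\lambda \to 1$, which is the monotonicity that the smoothing approach of Section 2.2 relies on.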
4.2 Colon cancer gene expression data

The colon cancer gene expression data, reported by Alon et al. (1999), contain the expression values of 2000 genes from 40 tumor and 22 normal colon tissue samples. We apply the normal mixture model to estimate the FDR for these data. With BIC we select 6 normal components, with estimated proportions and means $\hat\pi_0 = 0.408$,

$$(\hat\pi_L, \hat\theta_L) = \{(0.073, -3.72), (0.193, -1.81)\}, \qquad (\hat\pi_U, \hat\theta_U) = \{(0.247, 1.37), (0.074, 3.36), (0.005, 6.38)\}.$$

Figure 4 shows some model fitting diagnostics and the FDR estimates for the colon cancer data. Using permutations we can estimate the p-value for each gene, which can be compared with the parametric approach. Figure 5 shows the p-value densities from the permutations and from the normal mixture model. They agree closely. The parametric estimate is $\hat\pi_{pm} = 0.408$, and the limiting value of the nonparametric estimate is $\hat\pi_{pm} + (1 - \hat\pi_{pm}) f_1(0)/f_0(0) =$ [value missing in the source].

[Figure 4 about here.]

[Figure 5 about here.]

5. Impact of Dependence among Genes

The previous discussion was based on the assumption that the genes are independent, which enables us to pool information across all genes to obtain the estimates. In gene expression data it is more realistic to assume that genes are locally dependent, e.g. genes in a pathway are more likely to interact with each other and affect the system function in a synergistic way. Here we carry
out some simulation studies to evaluate the robustness of the proposed model for estimating the FDR in the presence of dependence among genes.

Suppose we have $m$ genes, which are divided into $K$ blocks of $m/K$ genes each. We assume independence between blocks, and a constant correlation $\rho$ between genes within each block. A proportion $\pi_0$ of the genes are simulated from $N(0, 1)$; the remaining proportion $1 - \pi_0$ are simulated from an equal mixture of up- and down-regulated genes with distributions $N(\mu_1, 1)$ and $N(\mu_2, 1)$. We investigate the effects of $K$ and $\rho$ on the FDR estimates. For simplicity of simulation, we assume it is known that there are two underlying components for the differentially expressed genes.

To set reasonable values for $\mu_j$ and $\rho$, we use empirical values from the previous two gene expression datasets. The averages of the positive and negative means for the leukemia gene expression data are

$$\frac{\sum_{\theta_k > 0} \pi_k \theta_k}{\sum_{\theta_k > 0} \pi_k} = 2.98, \qquad \frac{\sum_{\theta_k < 0} \pi_k \theta_k}{\sum_{\theta_k < 0} \pi_k} = \text{[value missing in the source]}.$$

For the colon cancer gene expression data, they are 1.91 and [value missing in the source]. Therefore we choose $\mu_1 = 2$, $\mu_2 = -2$ in the simulation.

To set values for $\rho$, we first cluster all the genes into groups with approximately 50 genes per group. For each gene we can calculate the two-sample t-statistic. Within each group, 300 bootstrap samples are used to approximate the mean correlation of the t-statistics between genes. Finally the mean correlation is averaged over all the groups to get an average $\rho$. We obtain $\rho = 0.32$ for the leukemia gene expression data and $\rho = 0.49$ for the colon cancer data. We use $\rho = 0.3, 0.5$ in the simulations, and $\rho = 0.1, 0.9$ are included as
two more extreme situations; the independence case $\rho = 0$ is also included as a reference. Figure 7 summarizes the simulation results for $\pi_0$ and the FDR from $m = 3500$, $\pi_0 = 0.35$, $K = 35, 70, 140$ and $\rho = 0.1, 0.3, 0.5, 0.9$. Overall, the estimate of $\pi_0$ has very small bias and, as expected, the larger the dependence, the more variable the estimate. The cluster size has a negligible effect when $\rho$ is relatively small. Overall, the variability of the $\pi_0$ estimate increases with the number of local gene clusters. The FDR estimate is mainly affected by the $\pi_0$ estimate, and its pattern is very similar to that of $\pi_0$.

[Figure 6 about here.]

6. Discussion

[Figure 7 about here.]

The proposed finite normal mixture model is not identifiable with respect to the ordering of the components and to overfitting. We can eliminate this identifiability problem simply by imposing constraints on the ordering of the components (Yakowitz and Spragins, 1968).

For finite normal mixture models, the EM algorithm may converge to a local maximum. We used multiple random starting points and selected the best model fit among them, and this procedure gave us reasonably good estimators in our simulations and microarray applications. We are in the process of developing an R package for the proposed methods. The R package and the documentation on the implementation details will be posted on the web very soon.
As the simulation and application examples illustrate, the parametric approach is preferred when possible, as it gives unbiased estimates and is more accurate and efficient. Among the nonparametric approaches, the empirical Bayes approach models the test statistics directly and is equivalent to the likelihood ratio based method. Because no distributional form is assumed for the test statistics under the alternative hypothesis, the nonparametric approach can often only estimate an upper bound for $\pi_0$, the proportion of true null genes.

The proposed model essentially assumes independence among genes. Through simulations we have found that the proposed model can still produce very good estimates under local dependence. But it is possible that there are more complicated examples under which ignoring the dependence among genes may seriously under- or overestimate the FDR. More research will be conducted in the future on FDR estimation incorporating the dependence among genes.

Acknowledgements

We are very grateful to the Associate Editor and the referee for their helpful suggestions. This research was supported in part by NIH grant GM and NSF grant DMS and a start-up fund from the Division of Biostatistics, University of Minnesota.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad patterns of gene expression revealed by
clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 96.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57.
Benjamini, Y., Krieger, A. and Yekutieli, D. (2001). Adaptive linear step-up FDR controlling procedures. Technical Report, Tel Aviv University.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97.
Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics 31.
Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23.
Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97.
Ge, Y., Dudoit, S. and Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis. Test 12.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286.
Guan, Z., Wu, B. and Zhao, H. (2004). Model-based approach to FDR estimation. Technical Report.
He, X. and Ng, P. (1999). COBS: qualitatively constrained smoothing via linear programming. Computational Statistics 14.
Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231.
Sackrowitz, H. and Samuel-Cahn, E. (1999). P-values as random variables: expected p-values. American Statistician 53.
Storey, J., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society. Series B (Methodological) 66.
Storey, J. D. (2002a). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Methodological) 64.
Storey, J. D. (2002b). False Discovery Rates: Theory and Applications to DNA Microarrays. PhD thesis, Stanford University.
Westfall, P. and Young, S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley.
Yakowitz, S. J. and Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39.

Appendix

Proof of Lemma 1

Consider a p-value $p_0 = 1 - Q_0\{W(x_0)\}$; we have $W(x_0) = Q_0^{-1}(1 - p_0)$. Under $H_0$,

$$\Pr(p \le p_0) = P_0[x : 1 - Q_0\{W(x)\} \le p_0] = P_0\{x : W(x) \ge W(x_0)\} = 1 - Q_0\{W(x_0)\} = p_0.$$

Under $H_a$,

$$\Pr(p \le p_0) = P_1[x : 1 - Q_0\{W(x)\} \le p_0] = P_1\{x : W(x) \ge W(x_0)\} = 1 - Q_1\{Q_0^{-1}(1 - p_0)\}.$$

And hence

$$g_1(p) = \frac{dG_1(p)}{dp} = -\frac{dQ_1\{Q_0^{-1}(1 - p)\}}{dp} = \frac{q_1\{Q_0^{-1}(1 - p)\}}{q_0\{Q_0^{-1}(1 - p)\}}.$$

According to the definitions of $q_0(\cdot)$ and $q_1(\cdot)$, we have

$$q_0(\eta) = \lim_{\eta_1 \to \eta} \frac{P_0\{x : \eta < W(x) \le \eta_1\}}{\eta_1 - \eta}, \qquad q_1(\eta) = \lim_{\eta_1 \to \eta} \frac{P_1\{x : \eta < W(x) \le \eta_1\}}{\eta_1 - \eta}.$$

Therefore

$$\frac{q_1(\eta)}{q_0(\eta)} = \lim_{\eta_1 \to \eta} \frac{P_1\{x : \eta < W(x) \le \eta_1\}}{P_0\{x : \eta < W(x) \le \eta_1\}}.$$

Since $W(x)$ is a measurable function, the region $\Gamma = \{x : \eta < W(x) \le \eta_1\}$ is measurable. We have

$$P_1(\Gamma) = \int_\Gamma f_1(x)\,dx = \int_\Gamma f_0(x)\,\frac{f_1(x)}{f_0(x)}\,dx \ge \int_\Gamma f_0(x) \inf_x \frac{f_1(x)}{f_0(x)}\,dx = P_0(\Gamma) \inf_x \frac{f_1(x)}{f_0(x)},$$

and hence $g_1(p) \ge \inf_x \{f_1(x)/f_0(x)\}$.
Proof of Theorem 1

By definition $LR(x) = f_1(x)/f_0(x)$. Let $\Gamma = \{x : \eta < LR(x) \le \eta_1\}$; we have

$$P_1(\Gamma) = \int_\Gamma f_1(x)\,dx \ge \int_\Gamma \eta f_0(x)\,dx = \eta P_0(\Gamma),$$

and similarly $P_1(\Gamma) \le \eta_1 P_0(\Gamma)$, so $q_1(\eta)/q_0(\eta) = \eta$ and

$$g_1(p) = \frac{q_1\{Q_0^{-1}(1 - p)\}}{q_0\{Q_0^{-1}(1 - p)\}} = Q_0^{-1}(1 - p).$$

Proof of (19)

We have shown that $LR(x)$ is a strictly convex function. If $\forall\, i$, $\exists\, j$ s.t. $\mu_i + \mu_j = 0$ and $\pi_i = \pi_j$, it is obvious that $LR(x)$ is symmetric about zero. Hence $\{LR(x) = f_1(x)/f_0(x) > \eta\} = \{|x| > x_0\}$, where $\eta = LR(x_0)$.

Now suppose $\{LR(x) = f_1(x)/f_0(x) > \eta\} = \{|x| > x_0\}$; then $\forall\, x$, $LR(x) = LR(-x)$. Suppose $\max_j \mu_j = \mu_J > 0$; we have

$$LR(x)\exp(-x\mu_J) = LR(-x)\exp(-x\mu_J),$$

i.e. $L_1 = L_2$, where

$$L_1 = \pi_J \exp(-\mu_J^2/2) + \sum_{k \ne J} \pi_k \exp\left\{x(\mu_k - \mu_J) - \mu_k^2/2\right\},$$

$$L_2 = \pi_J \exp(-2x\mu_J - \mu_J^2/2) + \sum_{k \ne J} \pi_k \exp\left\{-x(\mu_k + \mu_J) - \mu_k^2/2\right\}.$$

We know that $\lim_{x \to \infty} L_1 = \pi_J \exp(-\mu_J^2/2)$. So there must exist a $K$ s.t. $\pi_K = \pi_J$ and $\mu_K + \mu_J = 0$, which makes $\lim_{x \to \infty} L_2 = \lim_{x \to \infty} L_1$. From

$$LR(x) - \pi_J \exp(x\mu_J - \mu_J^2/2) = LR(-x) - \pi_K \exp(-x\mu_K - \mu_K^2/2),$$

we can prove that the second largest $\mu_k$ satisfies the symmetry condition. So sequentially we can prove that $\forall\, i$, $\exists\, j$ s.t. $\mu_i + \mu_j = 0$ and $\pi_i = \pi_j$.
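Conditions (19) and (20) are easy to check numerically for a given mixture. The sketch below uses two hypothetical alternative mixtures chosen only for illustration: one satisfying the pairing condition (19) and one that does not. The function and variable names are assumptions made for this example.

```python
import math


def lr(x, pis, mus):
    """LR(x) = sum_k pi_k exp(x mu_k - mu_k^2 / 2) for a unit-variance normal mixture."""
    return sum(p * math.exp(x * mu - mu * mu / 2.0) for p, mu in zip(pis, mus))


def lr_slope_at_zero(pis, mus):
    """Left-hand side of (20): sum_k pi_k mu_k exp(-mu_k^2 / 2)."""
    return sum(p * mu * math.exp(-mu * mu / 2.0) for p, mu in zip(pis, mus))


# Hypothetical mixture satisfying (19): each mean is paired with its negative,
# and the paired components have equal proportions.
SYMMETRIC = ([0.3, 0.3, 0.2, 0.2], [-1.0, 1.0, -2.5, 2.5])

# Hypothetical mixture violating (19).
ASYMMETRIC = ([0.2, 0.8], [-2.0, 1.0])

for name, (pis, mus) in (("symmetric", SYMMETRIC), ("asymmetric", ASYMMETRIC)):
    print(name,
          "LR(1.5) =", lr(1.5, pis, mus),
          "LR(-1.5) =", lr(-1.5, pis, mus),
          "slope at 0 =", lr_slope_at_zero(pis, mus))
```

For the paired mixture $LR(x) = LR(-x)$ and the slope at zero vanishes, so the symmetric region $\{|x| > x_0\}$ coincides with the LR region. For the unpaired mixture the slope at zero is nonzero, so $LR$ is not minimised at $x = 0$.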
23 Figure 1. Simulation study: the top two plots compare the LR and symmetric rejection regions; the bottom one compares the parametric (pm), empirical bayes (eb) and nonparametric (np) estimations. Rejection Threshold LR Symmetric density Symmetric LR α p value π^pm π^eb π^np 0 π 0 = 0.2 π eb = 0.48 π np =
24 Figure 2. 3Component Model Fitting for the Leukemia Data and FDR estimation Distribution Function Estimation QQ Plot ECDF 3 Component Model Test Statistics Quantiles Test Statistics FDR π 0 Number of Significant Genes Threshold Γ rejection region { T Γ} π 0 FDR 24
25 Figure 3. Nonparametric vs. Parametric Estimation for the Leukemia data π^0(λ) Nonparametric Smoothing π 0 Esimation Density permutation density mixture density λ p value 25
26 Figure 4. 3Component Model Fitting for the Colon cancer Data and FDR estimation Distribution Function Estimation QQ Plot ECDF 3 Component Model Quantiles Test Statistics Test Statistics FDR π 0 Number of Significant Genes Threshold Γ rejection region { T Γ} π 0 FDR 26
Figure 5. Nonparametric vs. Parametric Estimation for the Colon cancer data.
[Plot not reproduced. Panels: π_0(λ) against λ (parametric estimation, nonparametric estimation); permutation density and mixture density against p value.]
Figure 6. FDR estimation under local dependence: there are 13 simulations based on the combination of 5 different ρs and 3 different Ks, which are labeled at the bottom of each plot. The boxplots are based on 100 replicates, and the horizontal dashed black line represents the true value estimated from 100 replicates. The pattern of the FDR estimates is very similar to that of π_0: the larger the correlation ρ and the number of local clusters K, the more variable the estimates. Overall, the proposed model gives very good estimates, even when the local correlation is as large as 0.5.
[Plot not reproduced. Panels: boxplots of π_0 and FDR estimates, labeled by ρ and K.]
Figure 7. Bias and variance analysis for FDR estimation under local dependence: there are 13 simulations based on the combination of 5 different ρs and 3 different Ks, which are labeled at the bottom of each plot. The plots show the ratios of the absolute bias and of the standard error to the true means. The pattern is consistent: larger ρ and K increase both the bias and the variance, and overall the bias is very small compared to the variance. Under local dependence, the proposed approach gives reasonable estimates even when the local correlation is as high as 0.5.
[Plot not reproduced. Panels: sd/mean and |bias|/mean for the π_0 and FDR estimates, labeled by ρ and K.]
Table 1. Possible Outcomes of Multiple Hypothesis Testing

                     Accepted   Rejected   Total
True Null               U          V        m_0
True Alternative        T          S        m_1
Total                   N          R        m
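The entries of Table 1 follow directly from the indicator vectors h (true nulls) and r (rejections) defined in the Introduction. The short Python sketch below is illustrative only: the function name and the example vectors are made up, and the counts simply apply V = Σ r_k h_k, R = Σ r_k and m_0 = Σ h_k.

```python
def outcome_table(h, r):
    """Return the Table 1 counts from 0/1 indicators.

    h[k] = 1 if the kth hypothesis is a true null, r[k] = 1 if it is rejected.
    """
    if len(h) != len(r):
        raise ValueError("h and r must have the same length")
    m = len(h)
    m0 = sum(h)                                # number of true nulls
    R = sum(r)                                 # number of rejections
    V = sum(hk * rk for hk, rk in zip(h, r))   # false positives
    S = R - V                                  # true positives
    U = m0 - V                                 # true nulls accepted
    T = (m - m0) - S                           # true alternatives accepted
    return {"U": U, "V": V, "T": T, "S": S,
            "m0": m0, "m1": m - m0, "N": m - R, "R": R, "m": m}


# Hypothetical example with m = 6 hypotheses.
h = [1, 1, 1, 0, 0, 1]
r = [0, 1, 0, 1, 1, 0]
table = outcome_table(h, r)
print(table)
```

In this example one of the three rejections is a true null, so V = 1 and S = 2, and the row and column totals agree (U + T = N, V + S = R).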