Review of Hypothesis Testing

Classic two-sample problem: X_1, X_2, ..., X_m ~ F and Y_1, Y_2, ..., Y_n ~ G, with H_0: F = G. E.g., F and G are Gaussian with different means.

The hypothesis testing framework: we compute a test statistic θ̂ that takes on a larger (or more extreme) value if H_0 is not true. Assume that the null distribution of θ̂ is known. Compute P_0(θ̂* ≥ θ̂), where θ̂ is the observed value and θ̂* follows the null distribution; we call this the achieved significance level (ASL).
Review of Hypothesis Testing

Achieved significance level: ASL = P_0(θ̂* ≥ θ̂). A small ASL is strong evidence against H_0, i.e., that H_0 is not true.

This framework is asymmetric: if H_0 and H_A are equally likely, we accept H_0. Evidence against H_0 does not count as evidence for H_A.
Review of Hypothesis Testing

X_1, X_2, ..., X_m ~ N(µ_T, σ²), Y_1, Y_2, ..., Y_n ~ N(µ_C, σ²), H_0: µ_T = µ_C. Let θ̂ = X̄ − Ȳ. Then under H_0, θ̂ ~ N(0, σ²(1/n + 1/m)), so

  ASL = P( Z > θ̂ / (σ √(1/n + 1/m)) ),  where Z ~ N(0, 1).

If σ² is estimated from the data,

  σ̂² = [ Σ_i (X_i − X̄)² + Σ_i (Y_i − Ȳ)² ] / (n + m − 2),

then use the t_{n+m−2} distribution instead.
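The z-test ASL above can be sketched in a few lines, assuming a known common σ (the function name `asl_two_sample_z` is illustrative, not from the slides):

```python
import math
from statistics import NormalDist

def asl_two_sample_z(x, y, sigma):
    """ASL for H0: mu_X = mu_Y with known common sigma (normal data).

    theta_hat = mean(x) - mean(y); under H0,
    theta_hat ~ N(0, sigma^2 * (1/n + 1/m)).
    """
    m, n = len(x), len(y)
    theta_hat = sum(x) / m - sum(y) / n
    z = theta_hat / (sigma * math.sqrt(1 / n + 1 / m))
    # One-sided ASL: probability a standard normal exceeds the z-score.
    return 1 - NormalDist().cdf(z)
```

With equal sample means the z-score is 0 and the ASL is exactly 0.5; a large separation between the groups drives the ASL toward 0.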
Where do Bootstrap and permutations come in?

The main practical difficulty with hypothesis tests comes in estimating the ASL. Often the null distribution of θ̂ is not known, because F, G, and/or the statistic θ̂ are not nice enough for a theoretical treatment. Thus, instead of using theoretical distributions, we estimate the ASL by Monte Carlo sampling from the Bootstrap or permutation distribution.
Fisher's permutation test

Combine the n + m measurements into one pool, and draw subsets of size n from the pool, each with probability 1 / (n+m choose n).

Permutation Lemma: under H_0 (F = G), each subset of n observations has the above probability of appearing as the treatment set.

  ASL_perm = #{θ̂* ≥ θ̂} / (n+m choose n).

This is different from the Bootstrap: it is based purely on random assignment, and there is no empirical distribution F̂ at play here. The permutation method works for any statistic θ̂.
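A Monte Carlo sketch of the permutation ASL, sampling random reassignments instead of enumerating all (n+m choose n) subsets (the name `perm_asl` and the mean-difference default statistic are illustrative choices, not from the slides):

```python
import random

def perm_asl(x, y, stat=lambda a, b: sum(a) / len(a) - sum(b) / len(b),
             n_perm=2000, seed=0):
    """Monte Carlo estimate of the permutation ASL.

    Pools x and y, randomly reassigns group labels by shuffling, and
    counts how often the permuted statistic is at least as large as the
    observed one.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    obs = stat(x, y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled data
        if stat(pooled[:m], pooled[m:]) >= obs:
            count += 1
    return count / n_perm
```

Because any statistic can be plugged in via `stat`, this sketch reflects the slide's point that the permutation method works for any θ̂.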
Bootstrapping the two-sample problem

1. Draw samples of size n + m from the combined pool, with replacement.
2. Assign the first m to control and the next n to treatment, giving X*_1, ..., X*_m, Y*_1, ..., Y*_n; compute θ̂* = θ̂(X*, Y*). Do this B times.
3. ASL_boot = #{θ̂* ≥ θ̂_obs} / B.

For the two-sample problem, the only difference from permutations is that samples are drawn with replacement. Not surprisingly, this gives similar results to permutations for the two-sample problem.
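The three steps above can be sketched as follows; the only change from the permutation sketch is drawing with replacement (`boot_asl` is an illustrative name, not from the slides):

```python
import random

def boot_asl(x, y, stat=lambda a, b: sum(a) / len(a) - sum(b) / len(b),
             B=2000, seed=0):
    """Bootstrap estimate of the ASL for the two-sample problem.

    Step 1: draw n + m values from the pooled data WITH replacement.
    Step 2: first m go to one group, next n to the other; compute stat.
    Step 3: ASL_boot = fraction of resampled statistics >= observed.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m, n = len(x), len(y)
    obs = stat(x, y)
    hits = 0
    for _ in range(B):
        resamp = [rng.choice(pooled) for _ in range(m + n)]
        if stat(resamp[:m], resamp[m:]) >= obs:
            hits += 1
    return hits / B
```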
When there is nothing to permute...

Consider the one-sample problem: X_1, ..., X_n, H_0: E[X_i] = µ_0. There is nothing to permute, so no permutation test applies. On the other hand, we can Bootstrap from an estimated null distribution F̂_0. Note: F̂_0 ≠ F̂_n!

Simple trick: assume F̂_n has the correct shape and variance, just not the correct mean. Let X̃_i = X_i − X̄ + µ_0 and Bootstrap from the empirical distribution of {X̃_i : i = 1, ..., n}.

The key to using the Bootstrap for hypothesis testing is to identify the null distribution and construct it from the observed data.
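The mean-shift trick can be sketched as below, using the sample mean as the test statistic (a one-sided test; `one_sample_boot_asl` is an illustrative name, not from the slides):

```python
import random

def one_sample_boot_asl(x, mu0, B=2000, seed=0):
    """Bootstrap ASL for H0: E[X] = mu0 (one-sided, statistic = mean).

    Shifts the data so its empirical distribution keeps its shape and
    variance but has mean mu0 (the estimated null F0-hat), then
    resamples from the shifted data.
    """
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    shifted = [xi - xbar + mu0 for xi in x]  # X~_i = X_i - Xbar + mu0
    hits = 0
    for _ in range(B):
        resamp = [rng.choice(shifted) for _ in range(n)]
        if sum(resamp) / n >= xbar:  # compare to observed mean
            hits += 1
    return hits / B
```

Note that resampling from the unshifted data (F̂_n rather than F̂_0) would center the resampled means at X̄ and always give an ASL near 0.5, which is exactly why the null distribution must be constructed first.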
Bootstrap phylogeny

References: Felsenstein (1985); Hillis & Bull (1993); Efron, Halloran & Holmes (1996).
Phylogeny

Evolutionary history of species. There are many different approaches to phylogenetic inference: distance-based, maximum parsimony, maximum likelihood, and Bayesian.

Gen245, Spring 2009
Data

Aligned sequences x_i → tree-building programs (e.g. phylip) → t = T(x_1, x_2, ..., x_k).
Bootstrap confidence level

What does confidence mean?

Real world: sequence multiple regions of the same length and build a phylogeny on each region.

Bootstrap world: resample nucleotide positions (with replacement) to get t* = T(x_1*, x_2*, ..., x_k*).

Confidence = proportion of the t* that include a particular clade.
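The "Bootstrap world" resampling step can be sketched as follows: draw alignment columns (nucleotide positions) with replacement to build a pseudo-alignment, which would then be fed to the tree-building program to get each t* (the name `bootstrap_alignment` is illustrative, not from the slides):

```python
import random

def bootstrap_alignment(seqs, seed=0):
    """Resample alignment columns with replacement.

    seqs: list of equal-length aligned sequences (strings).
    Returns a pseudo-alignment of the same length whose columns are a
    bootstrap sample of the original columns.
    """
    rng = random.Random(seed)
    length = len(seqs[0])
    # Sample column indices with replacement, one per original position.
    idx = [rng.randrange(length) for _ in range(length)]
    return ["".join(s[j] for j in idx) for s in seqs]
```

Repeating this B times and running the tree-builder on each pseudo-alignment gives the collection of t* used to compute the clade confidence.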
Bootstrap example

Aligned sequences x_i → tree-building programs (e.g. phylip) → t = T(x_1, x_2, ..., x_k).
The controversy

An empirical test (Hillis & Bull, 1993):
- Simulate a phylogeny ("tree A")
- Simulate sequence data (many data sets)
- Compute the bootstrap confidence level for each data set

Is the bootstrap confidence level accurate?
What went wrong?

Key idea of the bootstrap: D(θ̂* − θ̂) ≈ D(θ̂ − θ). It is NOT true that D(θ̂*) ≈ D(θ̂).

Ref: Efron, Halloran & Holmes, 1996
How should the p-values of fraction overlap be computed?

- Uniformly resample the start positions of each feature, keeping the positions of the other feature set fixed.
- Treat the features as 0/1 vectors and apply the classic Bootstrap.

What is wrong with these approaches?
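To make the first proposal concrete, here is a sketch assuming features are (start, length) intervals on a linear genome; the function names and the >= comparison direction are illustrative assumptions, not from the slides:

```python
import random

def overlap_fraction(a, b, genome_len):
    """Fraction of genome positions covered by BOTH feature sets.

    a, b: lists of (start, length) intervals on [0, genome_len).
    """
    cov_a = [False] * genome_len
    cov_b = [False] * genome_len
    for start, length in a:
        for p in range(start, min(start + length, genome_len)):
            cov_a[p] = True
    for start, length in b:
        for p in range(start, min(start + length, genome_len)):
            cov_b[p] = True
    both = sum(1 for pa, pb in zip(cov_a, cov_b) if pa and pb)
    return both / genome_len

def shift_null_asl(a, b, genome_len, n_samp=500, seed=0):
    """ASL under the first proposal: uniformly resample the start of
    each feature in a, keeping b fixed, and count how often the
    resampled overlap is at least the observed overlap."""
    rng = random.Random(seed)
    obs = overlap_fraction(a, b, genome_len)
    hits = 0
    for _ in range(n_samp):
        a_star = [(rng.randrange(genome_len), ln) for _, ln in a]
        if overlap_fraction(a_star, b, genome_len) >= obs:
            hits += 1
    return hits / n_samp
```

Writing the procedure out makes its implicit null model explicit: features are placed independently and uniformly, ignoring feature clustering and genome structure, which is the kind of assumption the question above invites us to scrutinize.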
Simulation Study