Signal detection and goodness-of-fit: the Berk-Jones statistics revisited
1 Signal detection and goodness-of-fit: the Berk-Jones statistics revisited
Jon A. Wellner (Seattle)
INET Big Data Conference
2 INET Big Data Conference, Cambridge, September 29-30, 2015
Based on joint work with:
Lutz Dümbgen (Bern)
Leah Jager (U.S. Naval Academy)
3 Outline
1: Introduction: some history
2: Testing for sparse normal means: optimal detection boundary
3: Consequences and tradeoffs
4: The LIL and strong LIL for Brownian motion and Brownian bridge
5: A new class of test statistics: LIL-adjusted higher criticism & Berk-Jones
6: Power properties and confidence bands
4 1. Introduction: some history
Setting: classical goodness-of-fit. $X_1, \ldots, X_n$ i.i.d. with distribution function $F$, and
$$F_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{[X_i \le x]}.$$
Test $H: F = F_0$ versus $K: F \ne F_0$, with $F_0$ continuous. Without loss of generality $F_0(x) = x$, the $U(0,1)$ distribution.
Break the hypotheses down into a family of pointwise hypotheses:
$$H_x: F(x) = F_0(x) \quad\text{versus}\quad K_x: F(x) \ne F_0(x), \qquad H = \bigcap_x H_x, \quad K = \bigcup_x K_x.$$
5 1. Introduction: some history
$nF_n(x) \sim \text{Binomial}(n, F(x))$. The likelihood ratio statistic for testing $H_x$ versus $K_x$ is
$$\lambda_n(x) = \frac{\sup_{F(x)} L_n(F(x))}{L_n(F_0(x))} = \frac{L_n(F_n(x))}{L_n(F_0(x))}
= \frac{F_n(x)^{nF_n(x)}\,(1-F_n(x))^{n(1-F_n(x))}}{F_0(x)^{nF_n(x)}\,(1-F_0(x))^{n(1-F_n(x))}}
= \left(\frac{F_n(x)}{F_0(x)}\right)^{nF_n(x)} \left(\frac{1-F_n(x)}{1-F_0(x)}\right)^{n(1-F_n(x))}.$$
6 1. Introduction: some history
Thus
$$\log\lambda_n(x) = nF_n(x)\log\left(\frac{F_n(x)}{F_0(x)}\right) + n(1-F_n(x))\log\left(\frac{1-F_n(x)}{1-F_0(x)}\right) = nK(F_n(x), F_0(x)),$$
where
$$K(u,v) \equiv u\log\left(\frac{u}{v}\right) + (1-u)\log\left(\frac{1-u}{1-v}\right)$$
is the Kullback-Leibler divergence between Bernoulli$(u)$ and Bernoulli$(v)$.
Berk-Jones (1979) test statistic, via S.N. Roy's union-intersection principle:
$$R_n \equiv \sup_x n^{-1}\log\lambda_n(x) = \sup_x K(F_n(x), F_0(x)).$$
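As a concrete illustration (not part of the slides), here is a minimal Python sketch of the Berk-Jones statistic under the uniform null. Since $F_n$ is a step function, the supremum over $x$ is found by checking both one-sided limits of $F_n$ at each order statistic; all function names are mine.

```python
import math

def bern_kl(u, v):
    """Kullback-Leibler divergence K(u, v) between Bernoulli(u) and Bernoulli(v)."""
    def part(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return part(u, v) + part(1.0 - u, 1.0 - v)

def berk_jones(sample):
    """R_n = sup_x K(F_n(x), x) for a sample from the null F_0 = U(0,1).

    F_n is piecewise constant, so it suffices to evaluate, at each order
    statistic x_(i), the two values F_n = (i-1)/n (left limit) and F_n = i/n.
    """
    xs = sorted(sample)
    n = len(xs)
    best = 0.0
    for i, xi in enumerate(xs, start=1):
        if 0.0 < xi < 1.0:
            best = max(best, bern_kl((i - 1) / n, xi), bern_kl(i / n, xi))
    return best
```

For a single observation at $0.5$ the statistic is $K(1, 1/2) = K(0, 1/2) = \log 2$, which the sketch reproduces.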
7 1. Introduction: some history
History:
Berk and Jones (1979)
Groeneboom and Shorack (1981)
Shorack and Wellner (1986, p. 786)
Owen (1995): inversion of $R_n$ to get confidence bands; finite-sample distribution via Noé's recursion
Einmahl and McKeague (2002): integral version of $R_n$
Donoho and Jin (2004): Tukey's Higher Criticism statistic for testing the sparse normal means model, with comparisons to the Berk-Jones statistic $R_n$
8 2. Testing for sparse normal means
Initial setting: multiple testing of normal means. For $i = 1, \ldots, n$ consider testing
$$H_{0,i}: X_i \sim N(0,1) \quad\text{versus}\quad H_{1,i}: X_i \sim N(\mu_i, 1) \text{ with } \mu_i > 0.$$
Sparsity: the proportion $\epsilon_n \equiv n^{-1}\#\{i \le n : \mu_i > 0\}$ is small; $\epsilon_n = n^{-\beta}$ with $0 < \beta < 1$.
Three questions (in increasing order of difficulty):
Q1: Can we tell if at least one null hypothesis is false?
Q2: What is the proportion of false null hypotheses?
Q3: Which null hypotheses are false?
Main focus here: Q1.
9 2. Testing for sparse normal means
Previous work on Q1 (is there any signal?):
Ingster (1997, 1999)
Jin (2004)
Donoho and Jin (2004)
Jager and Wellner (2007)
Hall and Jin (2007)
Cai and Wu (2014)
10 2. Testing for sparse normal means
Change of setting: the Ingster-Donoho-Jin testing problem. Suppose $Y_1, \ldots, Y_n$ are i.i.d. $G$ on $\mathbb{R}$; test
$$H: G = N(0,1) \quad\text{versus}\quad H_1: G = (1-\epsilon)N(0,1) + \epsilon N(\mu, 1),$$
and, in particular, against
$$H_1^{(n)}: G = (1-\epsilon_n)N(0,1) + \epsilon_n N(\mu_n, 1)$$
for $\epsilon_n = n^{-\beta}$, $\mu_n = \sqrt{2r\log n}$, $0 < \beta < 1$, $0 < r < 1$.
Let $\Phi(z) \equiv P(Z \le z) = \int_{-\infty}^z (2\pi)^{-1/2}\exp(-x^2/2)\,dx$ for $Z \sim N(0,1)$, and transform to $X_i \equiv 1 - \Phi(Y_i) \in [0,1]$, i.i.d. $F = 1 - G(\Phi^{-1}(1 - \cdot))$.
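The sampling model and the transformation to $[0,1]$ can be sketched in a few lines of Python (a sketch under the slide's parametrization; the helper names are mine):

```python
import math
import random

def phi_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_mixture(n, beta, r, rng):
    """Draw Y_1..Y_n from (1 - eps_n) N(0,1) + eps_n N(mu_n, 1) with
    eps_n = n^{-beta}, mu_n = sqrt(2 r log n), then return the
    transformed values X_i = 1 - Phi(Y_i), which are U(0,1) under the null."""
    eps = n ** (-beta)
    mu = math.sqrt(2.0 * r * math.log(n))
    ys = [rng.gauss(mu if rng.random() < eps else 0.0, 1.0) for _ in range(n)]
    return [1.0 - phi_cdf(y) for y in ys]
```

A signal observation (large positive $Y_i$) maps to an $X_i$ near $0$, which is why the tests below focus on the lower tail of the transformed sample.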
11 2. Testing for sparse normal means
Then the testing problem becomes: test $H_0: F = F_0 = U(0,1)$ versus
$$H_1^{(n)}: F(u) = u + \epsilon_n\{(1-u) - \Phi(\Phi^{-1}(1-u) - \mu_n)\} = (1-\epsilon_n)u + \epsilon_n\{1 - \Phi(\Phi^{-1}(1-u) - \mu_n)\}.$$
Test statistics: Donoho and Jin (2004) proposed Tukey's higher criticism statistic
$$HC_n \equiv \sup_{X_{(1)} \le u \le X_{([n/2])}} \frac{\sqrt{n}\,(F_n(u) - u)}{\sqrt{u(1-u)}},$$
where $F_n(u) \equiv n^{-1}\sum_{i=1}^n 1_{[0,u]}(X_i)$ is the empirical distribution function of the $X_i$'s.
Let $K_2(u,v) = \tfrac{1}{2}(u-v)^2/(v(1-v))$; then
$$\tfrac{1}{2}(HC_n)^2 = \sup_{X_{(1)} \le x \le X_{([n/2])}} K_2^+(F_n(x), x), \qquad\text{where } K_2^+(u,v) = K_2(u,v)\,1_{[v \le u]}.$$
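A minimal sketch of the higher criticism statistic (my implementation, evaluating the supremum at the order statistics of the smaller half of the sample, where $F_n(x_{(i)}) = i/n$):

```python
import math

def higher_criticism(sample):
    """HC_n = max over i <= n/2 of sqrt(n) (F_n(u) - u) / sqrt(u (1-u))
    evaluated at u = x_(i), where F_n(x_(i)) = i/n."""
    xs = sorted(sample)
    n = len(xs)
    hc = -math.inf
    for i in range(1, n // 2 + 1):
        u = xs[i - 1]
        if 0.0 < u < 1.0:
            hc = max(hc, math.sqrt(n) * (i / n - u) / math.sqrt(u * (1.0 - u)))
    return hc
```

Under sparse alternatives the small $X_i$ cluster near $0$, making $F_n(u) - u$ large relative to its null standard deviation $\sqrt{u(1-u)/n}$ there.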
12 2. Testing for sparse normal means
The optimal detection boundary $\rho^*(\beta)$ is defined by
$$\rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1-\beta})^2, & 3/4 < \beta < 1. \end{cases}$$
Theorem 1 (Donoho-Jin, 2004). For $r > \rho^*(\beta)$ the tests $T_n$ based on $\tfrac{1}{2}(HC_n)^2$ or $R_n^+$ are both size- and power-consistent for testing $H_0$ versus $H_1^{(n)}$: with $t_n(\alpha_n) = \log\log(n)(1+o(1))$,
$$P_{H_0}(T_n > t_n(\alpha_n)) = \alpha_n \to 0 \quad\text{and}\quad P_{H_1^{(n)}}(T_n > t_n(\alpha_n)) \to 1 \quad\text{as } n \to \infty.$$
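The detection boundary is a simple piecewise formula; here it is as a Python function (the two branches agree at $\beta = 3/4$, where both give $1/4$):

```python
import math

def rho_star(beta):
    """Optimal detection boundary rho*(beta) for the sparse normal means
    problem, defined for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5                       # moderately sparse regime
    return (1.0 - math.sqrt(1.0 - beta)) ** 2   # very sparse regime
```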
13 2. Testing for sparse normal means
[Figure: the detection boundary $\rho^*(\beta)$ in the $(\beta, r)$ plane; alternatives with $r$ above the curve are detectable, those below are undetectable.]
14 2. Testing for sparse normal means
A family of statistics via phi-divergences. For $s \in \mathbb{R}$ and $x \ge 0$ define
$$\phi_s(x) = \begin{cases} \dfrac{1 - s + sx - x^s}{s(1-s)}, & s \ne 0, 1, \\[1ex] x\log(x) - x + 1, & s = 1, \\[1ex] x - \log(x) - 1, & s = 0. \end{cases}$$
Then define
$$K_s(u,v) = v\,\phi_s(u/v) + (1-v)\,\phi_s\!\left(\frac{1-u}{1-v}\right).$$
15 2. Testing for sparse normal means
Special cases:
$$K_2(u,v) = \frac{1}{2}\,\frac{(u-v)^2}{v(1-v)},$$
$$K_1(u,v) = K(u,v) = u\log(u/v) + (1-u)\log((1-u)/(1-v)),$$
$$K_{1/2}(u,v) = 2\{(\sqrt{u} - \sqrt{v})^2 + (\sqrt{1-u} - \sqrt{1-v})^2\} = 4\{1 - \sqrt{uv} - \sqrt{(1-u)(1-v)}\},$$
$$K_0(u,v) = K(v,u),$$
$$K_{-1}(u,v) = K_2(v,u) = \frac{1}{2}\,\frac{(u-v)^2}{u(1-u)}.$$
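The general $\phi_s$ formula reproduces each special case, which makes a quick numerical cross-check possible (a sketch; names are mine):

```python
import math

def phi_s(x, s):
    """phi_s from the phi-divergence family, for x >= 0."""
    if s == 1:
        return x * math.log(x) - x + 1.0 if x > 0 else 1.0
    if s == 0:
        return x - math.log(x) - 1.0  # requires x > 0
    return (1.0 - s + s * x - x ** s) / (s * (1.0 - s))

def K_s(u, v, s):
    """K_s(u, v) = v phi_s(u/v) + (1 - v) phi_s((1-u)/(1-v))."""
    return v * phi_s(u / v, s) + (1.0 - v) * phi_s((1.0 - u) / (1.0 - v), s)
```

For instance, $K_2$ collapses to the chi-square-type form and $K_{-1}$ to its reversed version, as on the slide.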
16 2. Testing for sparse normal means
The phi-divergence family of statistics (Jager & W, 2007):
$$S_n(s) = \begin{cases} \sup_{x \in \mathbb{R}} K_s(F_n(x), F_0(x)), & s \ge 1, \\ \sup_{x \in [X_{(1)}, X_{(n)})} K_s(F_n(x), F_0(x)), & s < 1. \end{cases}$$
Thus, with $F_0(x) = x$:
$$S_n(2) = \frac{1}{2}\sup_{x \in \mathbb{R}} \frac{(F_n(x) - x)^2}{x(1-x)},$$
$$S_n(1) = R_n = \text{the Berk-Jones statistic},$$
$$S_n(1/2) = 4\sup_{x \in [X_{(1)}, X_{(n)})} \{1 - \sqrt{F_n(x)\,x} - \sqrt{(1-F_n(x))(1-x)}\},$$
$$S_n(0) = \text{the reversed Berk-Jones statistic},$$
$$S_n(-1) = \frac{1}{2}\sup_{x \in [X_{(1)}, X_{(n)})} \frac{(F_n(x) - x)^2}{F_n(x)(1-F_n(x))}.$$
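The whole family can be computed with one routine (a sketch of my own; $\phi_s$ and $K_s$ are repeated so the snippet is self-contained):

```python
import math

def phi_s(x, s):
    if s == 1:
        return x * math.log(x) - x + 1.0 if x > 0 else 1.0
    if s == 0:
        return x - math.log(x) - 1.0
    return (1.0 - s + s * x - x ** s) / (s * (1.0 - s))

def K_s(u, v, s):
    return v * phi_s(u / v, s) + (1.0 - v) * phi_s((1.0 - u) / (1.0 - v), s)

def S_n(sample, s):
    """sup_x K_s(F_n(x), x), over R for s >= 1 and over [X_(1), X_(n)) for s < 1.

    F_n is piecewise constant, so it suffices to check, at each order
    statistic x_(i), the two limits F_n = (i-1)/n and F_n = i/n; for s < 1
    the endpoint values F_n = 0 and F_n = 1 are excluded."""
    xs = sorted(sample)
    n = len(xs)
    best = 0.0
    for i, xi in enumerate(xs, start=1):
        if not 0.0 < xi < 1.0:
            continue
        for u in ((i - 1) / n, i / n):
            if s < 1 and not 0.0 < u < 1.0:
                continue
            best = max(best, K_s(u, xi, s))
    return best
```

With a single observation at $0.5$, $S_n(1)$ equals the Berk-Jones value $\log 2$ from the earlier slides.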
17 2. Testing for sparse normal means
Null hypothesis distribution theory:
Owen (1995) and Jager (2006): finite-sample critical points via Noé's recursion for $n \le 3000$.
For $n \ge 3000$, asymptotic theory via Jaeschke (1979) and Eicker (1979) (cf. SW), together with
$$K_s(u,v) \approx \frac{1}{2}\,\frac{(u-v)^2}{v(1-v)},$$
so that
$$nK_s(F_n(x), x) \approx \frac{1}{2}Z_n(x)^2, \qquad Z_n(x) \equiv \frac{\sqrt{n}(F_n(x) - x)}{\sqrt{x(1-x)}},$$
where
$$Z_n \to_{f.d.} Z, \qquad Z(x) \equiv \frac{U(x)}{\sqrt{x(1-x)}},$$
with $U$ a standard Brownian bridge process on $[0,1]$.
18 2. Testing for sparse normal means
Let
$$r_n \equiv \log\log(n) + \tfrac{1}{2}\log\log\log(n) - \tfrac{1}{2}\log(4\pi) = \log\log(n)(1+o(1)).$$
Theorem 1. If $F = F_0$, the uniform distribution on $[0,1]$, then for $-1 \le s \le 2$,
$$nS_n(s) - r_n \to_d Y_4, \qquad\text{where } P(Y_4 \le x) = \exp(-4\exp(-x)).$$
Theorem 2. If $F = F_n$, the Ingster-Donoho-Jin sparse normal means model, then for each $s \in [-1, 2]$,
$$P_{H_0}(nS_n(s) > t_n(\alpha_n)) \to 0 \quad\text{and}\quad P_{H_1^{(n)}}(nS_n(s) > t_n(\alpha_n)) \to 1 \quad\text{if } r > \rho^*(\beta).$$
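Theorem 1 yields asymptotic critical values: invert $P(Y_4 \le x) = \exp(-4e^{-x})$ and add the centering $r_n$. A small sketch (my function names; this is only the limiting approximation, not a finite-sample calibration):

```python
import math

def r_n(n):
    """Centering sequence r_n = loglog n + (1/2) logloglog n - (1/2) log(4 pi)."""
    ll = math.log(math.log(n))
    return ll + 0.5 * math.log(ll) - 0.5 * math.log(4.0 * math.pi)

def bj_critical_value(n, alpha):
    """Asymptotic (1 - alpha) critical value for n S_n(s): r_n plus the
    (1 - alpha) quantile q of Y_4, solving exp(-4 exp(-q)) = 1 - alpha."""
    q = -math.log(-math.log(1.0 - alpha) / 4.0)
    return r_n(n) + q
```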
19 3. Consequences and tradeoffs: trouble in the middle!
Although the test statistics $nS_n(s)$ (and their one-sided, one-tailed counterparts) have excellent power behavior against sparse normal means and other tail alternatives, we have lost something in the middle: if $\{F_n\}$ is a sequence of distribution functions satisfying
$$\sqrt{n}\left(f_n^{1/2} - f_0^{1/2}\right) \to \tfrac{1}{2}af_0^{1/2} \text{ in } L_2(\lambda), \qquad \sqrt{n}(F_n - F_0) \to A \text{ uniformly}, \quad A(x) = \int_{-\infty}^x a(y)\,dF_0(y),$$
then it is not hard to see that for any $\kappa > 0$,
$$P_{F_n}(nS_n(s) - r_n > \kappa) \to 0.$$
Can we find a new family of test statistics with good power properties for alternatives of both the tail and central type?
20 4. The LIL and Strong LIL for Brownian Motion and Bridge
Standard Brownian motion $W = (W(t))_{t \ge 0}$. LIL for BM:
$$\limsup_{t \to 0} \frac{\pm W(t)}{\sqrt{2t\log\log(t^{-1})}} = 1 \text{ almost surely}, \qquad \limsup_{t \to \infty} \frac{\pm W(t)}{\sqrt{2t\log\log(t)}} = 1 \text{ almost surely}.$$
Refined (upper) strong LIL for BM: for arbitrary constants $\nu > 3/2$,
$$\limsup_{t \to \{0, \infty\}} \left(\frac{W(t)^2}{2t} - \log\log(t + t^{-1}) - \nu\log\log\log(t + t^{-1})\right) < 0 \quad\text{almost surely}.$$
21 4. The LIL and Strong LIL for Brownian Motion and Bridge
Reformulation for the standard Brownian bridge $U = (U(t))_{t \in (0,1)}$. For $t \in (0,1)$ let
$$\mathrm{logit}(t) := \log\left(\frac{t}{1-t}\right) \in \mathbb{R}, \qquad \ell(x) := \frac{e^x}{1 + e^x} \in (0,1),$$
$$C(t) := \log\left(1 + \mathrm{logit}(t)^2/2\right) \sim \log\log(1/t) \text{ as } t \to 0,$$
$$D(t) := \log\left(1 + C(t)^2/2\right) \sim \log\log\log(1/t) \text{ as } t \to 0.$$
Then
$$\limsup_{t \to \{0,1\}} \left(\frac{U(t)^2}{2t(1-t)} - C(t) - \nu D(t)\right) < 0 \quad\text{almost surely}.$$
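The two penalty functions are elementary and symmetric about $t = 1/2$, where both vanish. A direct transcription into Python:

```python
import math

def logit(t):
    return math.log(t / (1.0 - t))

def C(t):
    """C(t) = log(1 + logit(t)^2 / 2); grows like an iterated log as t -> 0 or 1."""
    return math.log1p(logit(t) ** 2 / 2.0)

def D(t):
    """D(t) = log(1 + C(t)^2 / 2); a further iterated-log correction."""
    return math.log1p(C(t) ** 2 / 2.0)
```

`math.log1p` is used so the values stay accurate when `logit(t)` is tiny near $t = 1/2$.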
22 5. A new class of test statistics: LIL-adjusted higher criticism & Berk-Jones
This suggests a new class of test statistics, as follows. Fix $s \in [1,2]$ and $\nu > 3/2$. Define $X_{n,s}(t) \equiv nK_s(G_n(t), t)$ and set
$$T_n(s, \nu) \equiv \sup_{0 < t < 1} \{X_{n,s}(t) - C(t) - \nu D(t)\}.$$
If $s \in [-1, 1)$, replace the supremum over $(0,1)$ by the supremum over $[X_{(1)}, X_{(n)})$.
Theorem. For all $s \in [-1, 2]$ and $\nu > 3/2$, if $H_0$ holds then
$$T_n(s, \nu) \to_d T_\nu \equiv \sup_{0 < t < 1} \left\{\frac{U^2(t)}{2t(1-t)} - C(t) - \nu D(t)\right\},$$
where $T_\nu$ is finite almost surely.
Proof: careful use of strong approximation methods; Csörgő, Csörgő, Horváth, and Mason (1986).
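A sketch of the LIL-adjusted Berk-Jones statistic $T_n(1, \nu)$ (my implementation; since the penalty $C + \nu D$ is continuous, the supremum is only approximated here by evaluating at the order statistics with both one-sided limits of the empirical df $G_n$):

```python
import math

def bern_kl(u, v):
    def part(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return part(u, v) + part(1.0 - u, 1.0 - v)

def C(t):
    return math.log1p(math.log(t / (1.0 - t)) ** 2 / 2.0)

def D(t):
    return math.log1p(C(t) ** 2 / 2.0)

def T_n(sample, nu=3.0):
    """Approximate T_n(1, nu) = sup_t { n K(G_n(t), t) - C(t) - nu D(t) },
    checking each order statistic t = x_(i) with G_n = (i-1)/n and i/n."""
    xs = sorted(sample)
    n = len(xs)
    best = -math.inf
    for i, t in enumerate(xs, start=1):
        if not 0.0 < t < 1.0:
            continue
        pen = C(t) + nu * D(t)
        for u in ((i - 1) / n, i / n):
            best = max(best, n * bern_kl(u, t) - pen)
    return best
```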
23 In particular, for the LIL-adjusted higher criticism and Berk-Jones statistics:
$$T_n(2, \nu) \equiv \sup_{0 < t < 1} \left\{\frac{U_n^2(t)}{2t(1-t)} - C(t) - \nu D(t)\right\} \to_d T_\nu,$$
$$T_n(1, \nu) \equiv \sup_{0 < t < 1} \{nK_1(G_n(t), t) - C(t) - \nu D(t)\} \to_d T_\nu,$$
where $K_1(u,v) \equiv K(u,v) = u\log(u/v) + (1-u)\log((1-u)/(1-v))$.
What about power? Tail alternatives, e.g. sparse normal means? Central (or contiguous) alternatives? Widths of confidence bands?
24 6. Confidence bands and power properties
Let $U_1, U_2, \ldots, U_n$ be i.i.d. Unif$[0,1]$. Auxiliary function $K: [0,1] \times (0,1) \to [0, \infty]$,
$$K(x, p) := x\log\left(\frac{x}{p}\right) + (1-x)\log\left(\frac{1-x}{1-p}\right),$$
i.e. the Kullback-Leibler divergence between Bin$(1,x)$ and Bin$(1,p)$.
Two key properties:
$$K(x,p) = \frac{(x-p)^2}{2p(1-p)}\,(1 + o(1)) \text{ as } x \to p,$$
$$K(x,p) \le c \ \text{ implies } \ |x - p| \le \sqrt{2c\,p(1-p)} + c \ \text{ and } \ |x - p| \le \sqrt{2c\,x(1-x)} + c.$$
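The second property (the inversion bound) can be spot-checked numerically on a grid, which also illustrates why the additive $+c$ term matters when $p$ is near the boundary. A sketch of my own:

```python
import math

def bern_kl(x, p):
    def part(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return part(x, p) + part(1.0 - x, 1.0 - p)

def check_inversion_bound(grid=100):
    """Verify on an interior grid that K(x, p) <= c implies
    |x - p| <= sqrt(2 c p (1-p)) + c, using c = K(x, p) itself."""
    ok = True
    for i in range(1, grid):
        for j in range(1, grid):
            x, p = i / grid, j / grid
            c = bern_kl(x, p)
            ok &= abs(x - p) <= math.sqrt(2.0 * c * p * (1.0 - p)) + c + 1e-12
    return ok
```

For $p$ near $0$ or $1$ the square-root term is tiny, but then $c = K(x,p)$ itself is large, so the bound still holds.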
25 6. Confidence bands and power properties
Uniform order statistics $0 < U_{n:1} < U_{n:2} < \cdots < U_{n:n} < 1$. Let $\mathcal{T}_n := \{t_{n1}, t_{n2}, \ldots, t_{nn}\}$ with $t_{ni} := \mathbb{E}(U_{n:i}) = \frac{i}{n+1}$.
Theorem 2. For the process $\widetilde{X}_n = (\widetilde{X}_n(t))_{t \in \mathcal{T}_n}$ with
$$\widetilde{X}_n(t_{ni}) \equiv (n+1)K(t_{ni}, U_{n:i}),$$
and for $\nu > 3/2$, we have
$$\widetilde{T}_n(1, \nu) \equiv \sup_{\mathcal{T}_n} \{\widetilde{X}_n - C - \nu D\} \to_d T_\nu.$$
Some realizations of $\widetilde{X}_n - C - \nu D$ for $n = 5000$ and $\nu = 3$:
26-29 [Figures: four realizations of $\widetilde{X}_n(t) - C(t) - \nu D(t)$ plotted against $t$, for $n = 5000$ and $\nu = 3$.]
30 Distribution function of $\arg\max_t \widetilde{X}_n(t)$:
[Figure: the distribution function of the argmax, plotted against $t$.]
31 Let $X_1, X_2, \ldots, X_n$ be i.i.d. with unknown c.d.f. $F$ on $\mathbb{R}$. Empirical df: $F_n(x) := n^{-1}\sum_{i=1}^n 1_{[X_i \le x]}$.
Testing problem: $H_0: F = F_0$ versus $K: F \ne F_0$.
Berk-Jones statistic: $R_n(F_0) := \sup_{\mathbb{R}} nK(F_n, F_0)$.
Critical value:
$$\kappa_{n,\alpha}^{BJ} \equiv (1-\alpha) \text{ quantile of } \sup_{t \in (0,1)} nK(G_n(t), t) = \log\log(n) + O(\log\log\log(n)).$$
New proposal:
$$T_n(F_0) \equiv \sup_{\mathbb{R}} \left(nK(F_n, F_0) - C(F_0) - \nu D(F_0)\right).$$
32 ... with critical value
$$\kappa_{n,\alpha}^{new} \equiv (1-\alpha) \text{ quantile of } \sup_{t \in (0,1)} \left(nK(G_n(t), t) - C(t) - \nu D(t)\right)$$
$$\approx (1-\alpha) \text{ quantile of } \sup_{t \in (0,1)} \left(\frac{U(t)^2}{2t(1-t)} - C(t) - \nu D(t)\right).$$
33 Lemma. For any critical value $\kappa > 0$ there exists a constant $B_\kappa$ such that
$$\mathbb{P}_F\left(T_n(F_o) \le \kappa\right) \le B_\kappa\,\Delta_n(F, F_o)^{-4/5},$$
where
$$\Delta_n(F, F_o) := \sup_{\mathbb{R}} \frac{\sqrt{n}\,|F - F_o|}{\sqrt{\Gamma(F_o)\,F_o(1-F_o)} + \Gamma(F_o)/\sqrt{n}}$$
and $\Gamma(\cdot) := C(\cdot) + 1$.
Note: $\Gamma(t)\,t(1-t) \to 0$ as $t \to \{0,1\}$.
34 Special case: detecting heterogeneous Gaussian mixtures (Donoho-Jin 2004).
$$F_o := \Phi, \qquad F_n := (1 - \varepsilon_n)\,\Phi + \varepsilon_n\,\Phi(\cdot - \mu_n), \qquad \varepsilon_n \in (0,1), \ \mu_n > 0.$$
Setting 1 (Donoho-Jin, 2004): $\varepsilon_n = n^{-\beta + o(1)}$ for some $\beta \in (1/2, 1)$.
Setting 2: $\varepsilon_n = n^{-1/2 + o(1)}$ but $\pi_n := n^{1/2}\varepsilon_n \to 0$.
35 Theorem A. For any fixed $\kappa > 0$, $\mathbb{P}_{F_n}(T_n(F_o) > \kappa) \to 1$ provided that $\mu_n$ satisfies the following conditions:
Setting 1 ($\varepsilon_n = n^{-\beta+o(1)}$, $\beta \in (1/2,1)$): $\mu_n = \sqrt{2r\log(n)}$ with
$$r > \beta - 1/2 \ \text{ if } \beta \le 3/4, \qquad r > (1 - \sqrt{1-\beta})^2 \ \text{ if } \beta \ge 3/4.$$
Setting 2 ($\varepsilon_n = n^{-1/2+o(1)}$, $\pi_n = n^{1/2}\varepsilon_n \to 0$): $\mu_n = \sqrt{2s\log(1/\pi_n)}$ with $s > 1$.
36 Setting 2 (contiguous alternatives): for fixed $\pi, \mu > 0$, take $\varepsilon_n = \pi n^{-1/2}$ and $\mu_n = \mu$. The optimal test of $F$ versus $F_n$ has asymptotic power
$$\Phi\left(\Phi^{-1}(\alpha) + \sqrt{\pi^2(\exp(\mu^2) - 1)/4}\right).$$
Theorem B. As $\pi \to 0$ with $\mu = \sqrt{2s\log(1/\pi)}$ for fixed $s > 0$,
$$\Phi\left(\Phi^{-1}(\alpha) + \sqrt{\pi^2(\exp(\mu^2) - 1)/4}\right) \to \begin{cases} \alpha & \text{if } s < 1, \\ 1 & \text{if } s > 1, \end{cases}$$
while
$$\limsup_n \mathbb{P}_{F_n}\left(T_n(F_0) > \kappa_{n,\alpha}\right) = 1 \ \text{ if } s > 1.$$
37 Confidence Bands
Owen (1995) proposed the $(1-\alpha)$-confidence band
$$\left\{F : \sup_{\mathbb{R}} nK(F_n, F) \le \kappa_{n,\alpha}^{BJ}\right\}.$$
New proposal: with order statistics $X_{n:1} \le X_{n:2} \le \cdots \le X_{n:n}$,
$$\left\{F : \max_{1 \le i \le n}\left((n+1)K(t_{ni}, F(X_{n:i})) - C(t_{ni}) - \nu D(t_{ni})\right) \le \kappa_{n,\alpha}\right\}.$$
Resulting bounds for $F(x)$: with confidence $1-\alpha$, on $[X_{n:i}, X_{n:i+1})$, $0 \le i \le n$,
$$F \in [a_{ni}^{BJO}, b_{ni}^{BJO}] \ \text{with Owen's (1995) proposal}, \qquad F \in [a_{ni}^{new}, b_{ni}^{new}] \ \text{with the new proposal},$$
while $F_n(X_{n:i}) = i/n$.
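Given a critical value $\kappa$, the new band is obtained by inverting the constraint pointwise: at each $t_{ni} = i/(n+1)$, the admissible $p = F(X_{n:i})$ are those with $(n+1)K(t_{ni}, p) \le \kappa + C(t_{ni}) + \nu D(t_{ni})$. Since $p \mapsto K(x, p)$ decreases on $(0, x]$ and increases on $[x, 1)$, the two endpoints can be found by bisection. A sketch of my own (not the authors' code):

```python
import math

def bern_kl(x, p):
    def part(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return part(x, p) + part(1.0 - x, 1.0 - p)

def C(t):
    return math.log1p(math.log(t / (1.0 - t)) ** 2 / 2.0)

def D(t):
    return math.log1p(C(t) ** 2 / 2.0)

def band_at(i, n, kappa, nu=3.0, tol=1e-10):
    """Lower/upper bounds (a, b) for F(X_{n:i}): the set of p with
    (n+1) K(t_ni, p) - C(t_ni) - nu D(t_ni) <= kappa, found by bisection
    on the two monotone branches of p -> K(t_ni, p)."""
    t = i / (n + 1)
    level = (kappa + C(t) + nu * D(t)) / (n + 1)

    def solve(lo, hi):
        # bisect K(t, p) = level; K is monotone on [lo, hi] and
        # changes sign relative to `level` between the endpoints
        f_lo_pos = bern_kl(t, lo) - level > 0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if (bern_kl(t, mid) - level > 0) == f_lo_pos:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return solve(tol, t), solve(t, 1.0 - tol)
```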
38-40 [Figures: the band bounds $a_{ni}$ and $b_{ni}$ (centered at $s_{ni}$) plotted against $i$, for $n = 500$, $n = 2000$, and $n = 8000$.]
41 Theorem C. For any fixed $\alpha \in (0,1)$,
$$\max_{0 \le i \le n}\left(b_{ni}^{BJO} - a_{ni}^{BJO}\right) = (1 + o(1))\sqrt{\frac{2\log\log n}{n}},$$
while
$$\max_{0 \le i \le n}\left(b_{ni}^{new} - a_{ni}^{new}\right) = O(n^{-1/2}) \qquad\text{and}\qquad \max_{0 \le i \le n}\frac{b_{ni}^{new} - a_{ni}^{new}}{b_{ni}^{BJO} - a_{ni}^{BJO}} \to 1.$$
42 Final comments, extensions
One can replace $K(u,v) = K_1(u,v)$ with more general $\phi$-divergences $K_s(u,v)$, as in Jager-Wellner (2007), under the null hypothesis.
Power behavior of the family $T_n(s,\nu)$ for $s \notin \{1,2\}$ is still unknown.
Numerical experiments of Walther (2013) and Siegmund and Li (2015) indicate that $K = K_1$ has the best small/moderate-sample performance in the sparse normal means model of Donoho-Jin (2004).
Results for more general mixture models: Cai and Wu (2014).
43 The proof of the Jaeschke-Eicker theorem for $\sup_{0<t<1} |U_n(t)|/\sqrt{t(1-t)}$ uses $d_n = (\log n)^5/n$, which is $< 1/2$ for $n$ sufficiently large.
Number theory: Littlewood showed that $\mathrm{Li}(x) - \pi(x)$ changes sign infinitely often for $x$ large. Skewes (1933): the first sign change of $\mathrm{Li}(x) - \pi(x)$ occurs before $10^{10^{10^{34}}}$ if the Riemann hypothesis holds. Current estimate: the first sign change of $\mathrm{Li}(x) - \pi(x)$ occurs before $e^{727.95} \approx 1.4 \times 10^{316}$.
Section 7.1 Introduction to Hypothesis Testing Schrodinger s cat quantum mechanics thought experiment (1935) Statistical Hypotheses A statistical hypothesis is a claim about a population. Null hypothesis
More informationChapter 4 Statistical Inference in Quality Control and Improvement. Statistical Quality Control (D. C. Montgomery)
Chapter 4 Statistical Inference in Quality Control and Improvement 許 湘 伶 Statistical Quality Control (D. C. Montgomery) Sampling distribution I a random sample of size n: if it is selected so that the
More information1 Sufficient statistics
1 Sufficient statistics A statistic is a function T = rx 1, X 2,, X n of the random sample X 1, X 2,, X n. Examples are X n = 1 n s 2 = = X i, 1 n 1 the sample mean X i X n 2, the sample variance T 1 =
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More informationα α λ α = = λ λ α ψ = = α α α λ λ ψ α = + β = > θ θ β > β β θ θ θ β θ β γ θ β = γ θ > β > γ θ β γ = θ β = θ β = θ β = β θ = β β θ = = = β β θ = + α α α α α = = λ λ λ λ λ λ λ = λ λ α α α α λ ψ + α =
More informationGambling Systems and Multiplication-Invariant Measures
Gambling Systems and Multiplication-Invariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously
More informationCredit Risk Models: An Overview
Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich c 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:
More informationProbability Calculator
Chapter 95 Introduction Most statisticians have a set of probability tables that they refer to in doing their statistical wor. This procedure provides you with a set of electronic statistical tables that
More informationA LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA
REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São
More informationNonparametric adaptive age replacement with a one-cycle criterion
Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk
More informationPlanning And Scheduling Maintenance Resources In A Complex System
Planning And Scheduling Maintenance Resources In A Complex System Martin Newby School of Engineering and Mathematical Sciences City University of London London m.j.newbycity.ac.uk Colin Barker QuaRCQ Group
More information6.2 Permutations continued
6.2 Permutations continued Theorem A permutation on a finite set A is either a cycle or can be expressed as a product (composition of disjoint cycles. Proof is by (strong induction on the number, r, of
More informationT ( a i x i ) = a i T (x i ).
Chapter 2 Defn 1. (p. 65) Let V and W be vector spaces (over F ). We call a function T : V W a linear transformation form V to W if, for all x, y V and c F, we have (a) T (x + y) = T (x) + T (y) and (b)
More informationIs Infrastructure Capital Productive? A Dynamic Heterogeneous Approach.
Is Infrastructure Capital Productive? A Dynamic Heterogeneous Approach. César Calderón a, Enrique Moral-Benito b, Luis Servén a a The World Bank b CEMFI International conference on Infrastructure Economics
More informationInequalities of Analysis. Andrejs Treibergs. Fall 2014
USAC Colloquium Inequalities of Analysis Andrejs Treibergs University of Utah Fall 2014 2. USAC Lecture: Inequalities of Analysis The URL for these Beamer Slides: Inequalities of Analysis http://www.math.utah.edu/~treiberg/inequalitiesslides.pdf
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationPrime numbers and prime polynomials. Paul Pollack Dartmouth College
Prime numbers and prime polynomials Paul Pollack Dartmouth College May 1, 2008 Analogies everywhere! Analogies in elementary number theory (continued fractions, quadratic reciprocity, Fermat s last theorem)
More informationWHEN DOES A RANDOMLY WEIGHTED SELF NORMALIZED SUM CONVERGE IN DISTRIBUTION?
WHEN DOES A RANDOMLY WEIGHTED SELF NORMALIZED SUM CONVERGE IN DISTRIBUTION? DAVID M MASON 1 Statistics Program, University of Delaware Newark, DE 19717 email: davidm@udeledu JOEL ZINN 2 Department of Mathematics,
More informationproblem arises when only a non-random sample is available differs from censored regression model in that x i is also unobserved
4 Data Issues 4.1 Truncated Regression population model y i = x i β + ε i, ε i N(0, σ 2 ) given a random sample, {y i, x i } N i=1, then OLS is consistent and efficient problem arises when only a non-random
More informationMonitoring Structural Change in Dynamic Econometric Models
Monitoring Structural Change in Dynamic Econometric Models Achim Zeileis Friedrich Leisch Christian Kleiber Kurt Hornik http://www.ci.tuwien.ac.at/~zeileis/ Contents Model frame Generalized fluctuation
More informationThe one dimensional heat equation: Neumann and Robin boundary conditions
The one dimensional heat equation: Neumann and Robin boundary conditions Ryan C. Trinity University Partial Differential Equations February 28, 2012 with Neumann boundary conditions Our goal is to solve:
More informationUniversity of Maryland Fraternity & Sorority Life Spring 2015 Academic Report
University of Maryland Fraternity & Sorority Life Academic Report Academic and Population Statistics Population: # of Students: # of New Members: Avg. Size: Avg. GPA: % of the Undergraduate Population
More informationCombining Weak Statistical Evidence in Cyber Security
Combining Weak Statistical Evidence in Cyber Security Nick Heard Department of Mathematics, Imperial College London; Heilbronn Institute for Mathematical Research, University of Bristol Intelligent Data
More informationThe Math Circle, Spring 2004
The Math Circle, Spring 2004 (Talks by Gordon Ritter) What is Non-Euclidean Geometry? Most geometries on the plane R 2 are non-euclidean. Let s denote arc length. Then Euclidean geometry arises from the
More informationStirling s formula, n-spheres and the Gamma Function
Stirling s formula, n-spheres and the Gamma Function We start by noticing that and hence x n e x dx lim a 1 ( 1 n n a n n! e ax dx lim a 1 ( 1 n n a n a 1 x n e x dx (1 Let us make a remark in passing.
More informationA Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a
More informationChapter 7: Proportional Play and the Kelly Betting System
Chapter 7: Proportional Play and the Kelly Betting System Proportional Play and Kelly s criterion: Investing in the stock market is, in effect, making a series of bets. Contrary to bets in a casino though,
More informationSales forecasting # 1
Sales forecasting # 1 Arthur Charpentier arthur.charpentier@univ-rennes1.fr 1 Agenda Qualitative and quantitative methods, a very general introduction Series decomposition Short versus long term forecasting
More informationarxiv:1412.1183v1 [q-fin.rm] 3 Dec 2014
Regulatory Capital Modelling for Credit Risk arxiv:1412.1183v1 [q-fin.rm] 3 Dec 2014 Marek Rutkowski a, Silvio Tarca a, a School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia.
More informationI. Pointwise convergence
MATH 40 - NOTES Sequences of functions Pointwise and Uniform Convergence Fall 2005 Previously, we have studied sequences of real numbers. Now we discuss the topic of sequences of real valued functions.
More informationMetric Spaces. Chapter 7. 7.1. Metrics
Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some
More informationCounting Primes whose Sum of Digits is Prime
2 3 47 6 23 Journal of Integer Sequences, Vol. 5 (202), Article 2.2.2 Counting Primes whose Sum of Digits is Prime Glyn Harman Department of Mathematics Royal Holloway, University of London Egham Surrey
More information