# RESAMPLING FEWER THAN n OBSERVATIONS: GAINS, LOSSES, AND REMEDIES FOR LOSSES

Save this PDF as:

Size: px
Start display at page:

Download "RESAMPLING FEWER THAN n OBSERVATIONS: GAINS, LOSSES, AND REMEDIES FOR LOSSES"

## Transcription

1 Statistica Sinica 7(1997), 1-31 RESAMPLING FEWER THAN n OBSERVATIONS: GAINS, LOSSES, AND REMEDIES FOR LOSSES P. J. Bickel, F. Götze and W. R. van Zwet University of California, Berkeley, University of Bielefeld and University of Leiden Abstract: We discuss a number of resampling schemes in which m = o(n) observations are resampled. We review nonparametric bootstrap failure and give results old and new on how the m out of n with replacement and without replacement bootstraps work. We extend work of Bickel and Yahav (1988) to show that m out of n bootstraps can be made second order correct, if the usual nonparametric bootstrap is correct and study how these extrapolation techniques work when the nonparametric bootstrap does not. Key words and phrases: Asymptotic, bootstrap, nonparametric, parametric, testing. 1. Introduction Over the last years Efron s nonparametric bootstrap has become a general tool for setting confidence regions, prediction, estimating misclassification probabilities, and other standard exercises of inference when the methodology is complex. Its theoretical justification is based largely on asymptotic arguments for its consistency or optimality. A number of examples have been addressed over the years in which the bootstrap fails asymptotically. Practical anecdotal experience seems to support theory in the sense that the bootstrap generally gives reasonable answers but can bomb. In a recent paper Politis and Romano (1994), following Wu (1990), and independently Götze (1993) showed that what we call the m out of n without replacement bootstrap with m = o(n) typically works to first order both in the situations where the bootstrap works and where it does not. The m out of n with replacement bootstrap with m = o(n) has been known to work in all known realistic examples of bootstrap failure. In this paper, We show the large extent to which the Politis, Romano, Götze property is shared by the m out of n with replacement bootstrap and show that the latter has advantages. If the usual bootstrap works the m out of n bootstraps pay a price in efficiency. We show how, by the use of extrapolation the price can be avoided.

2 2 P. J. BICKEL, F. GÖTZE AND W. R. VAN ZWET We support some of our theory with simulations. The structure of our paper is as follows. In Section 2 we review a series of examples of success and failure to first order (consistency) of (Efron s) nonparametric bootstrap (nonparametric). We try to isolate at least heuristically some causes of nonparametric bootstrap failure. Our framework here is somewhat novel. In Section 3 we formally introduce the m out of n with and without replacement bootstrap as well as what we call sample splitting, and establish their first order properties restating the Politis-Romano-Götze result. We relate these approaches to smoothing methods. Section 4 establishes the deficiency of the m out of n bootstrap to higher order if the nonparametric bootstrap works to first order and Section 5 shows how to remedy this deficiency to second order by extrapolation. In Section 6 we study how the improvements of Section 5 behave when the nonparametric bootstrap doesn t work to first order. We present simulations in Section 7 and proofs of our new results in Section 8. The critical issue of choice of m and applications to testing will be addressed elsewhere. 2. Successes and Failure of the Bootstrap We will limit our work to the i.i.d. case because the issues we discuss are clearest in this context. Extension to the stationary mixing case, as done for the m out of n without replacement bootstrap in Politis and Romano (1994), are possible but the study of higher order properties as in Sections 4 and 5 of our paper is more complicated. We suppose throughout that we observe X 1,...,X n taking values in X = R p (or more generally a separable metric space). i.i.d. according to F F 0. We stress that F 0 need not be and usually isn t the set of all possible distributions. In hypothesis testing applications, F 0 is the hypothesized set, in looking at the distributions of extremes, F 0 is the set of populations for which extremes have limiting distributions. We are interested in the distribution of a symmetric function of X 1,...,X n ; T n (X 1,...,X n,f) T n ( ˆF n,f)where ˆF n is defined to be the empirical distribution of the data. More specifically we wish to estimate a parameter which we denote θ n (F ), of the distribution of T n ( ˆF n,f), which we denote by L n (F ). We will usually think of θ n as real valued, for instance, the variance of n median (X 1,...,X n ) or the 95% quantile of the distribution of n( X EF (X 1 )). Suppose T n (,F) and hence θ n is defined naturally not just on F 0 but on F which is large enough to contain all discrete distributions. It is then natural to estimate F by the nonparametric maximum likelihood estimate, (NPMLE), ˆFn, and hence θ n (F ) by the plug in θ n ( ˆF n ). This is Efron s (ideal) nonparametric bootstrap. Since θ n (F ) γ(l n (F )) and, in the cases we consider, computation of γ is straightforward the real issue is estimation of L n (F ). Efron s (ideal)

3 RESAMPLING FEWER THAN n OBSERVATIONS 3 bootstrap is to estimate L n (F ) by the distribution of T n (X1,...,X n, ˆF n )where, given X 1,...,X n the Xi are i.i.d. ˆFn, i.e. the bootstrap distribution of T n. In practice, the bootstrap distribution is itself estimated by Monte Carlo or more sophisticated resampling schemes, (see DeCiccio and Romano (1989) and Hikley (1988)). We will not enter into this question further. Theoretical analyses of the bootstrap and its properties necessarily rely on asymptotic theory, as n coupled with simulations. We restrict analysis to T n ( ˆF n,f) which are asymptotically stable and nondegenerate on F 0. That is, for all F F 0,atleastweakly L n (F ) L(F ) non degenerate θ n (F ) θ(f ) (2.1) as n. Using m out of n bootstraps or sample splitting implicitly changes our goal from estimating features of L n (F )tofeaturesofl m (F ). This is obviously nonsensical without assuming that the laws converge. Requiring non degeneracy of the limit law means that we have stabilized the scale of T n ( ˆF n,f). Any functional of L n (F ) is also a functional of the distribution of σ n T n ( ˆF n,f)whereσ n 0 which also converges in law to point mass at 0. Yet this degenerate limit has no functional θ(f ) of interest. Finally, requiring that stability need occur only on F 0 is also critical since failure to converge off F 0 in a reasonable way is the first indicator of potential bootstrap failure When does the nonparametric bootstrap fail? If θ n does not depend on n, the bootstrap works, (is consistent on F 0 ), if θ is continuous at all points of F 0 with respect to weak convergence on F. Conversely, the nonparametric bootstrap can fail if, 1. θ is not continuous on F 0. An example we explore later is θ n (F )=1(F discrete) for which θ n ( ˆF n )obviously fails if F is continuous. Dependence on n introduces new phenomena. In particular, here are two other reasons for failure we explore below. 2. θ n is well defined on all of F but θ is defined on F 0 only or exhibits wild discontinuities when viewed as a function on F. This is the main point of examples T n ( ˆF n,f) is not expressible as or approximable on F 0 by a continuous function of n( ˆF n F ) viewed as an object weakly converging to a Gaussian limit in a suitable function space. (See Giné and Zinn (1989).) Example 7 illustrate this failure. Again this condition is a diagnostic and not necessary for failure as Example 6 shows.

4 4 P. J. BICKEL, F. GÖTZE AND W. R. VAN ZWET We illustrate our framework and discuss prototypical examples of bootstrap success and failure Examples of bootstrap success Example 1. Confidence intervals: Suppose σ 2 (F ) Var F (X 1 ) < for all F F 0. (a) Let T n ( ˆF n,f) n( X E F X 1 ). For the percentile bootstrap we are interested in θ n (F ) P F [T n ( ˆF t n,f) t]. Evidently θ(f )=Φ( σ(f )). In fact, we want to estimate the quantiles of the distribution of T n ( ˆF n,f). If θ n (F )isthe1 α quantile then θ(f )=σ(f )z 1 α where z is the Gaussian quantile. (b) Let T n ( ˆF n,f) = n( X E F X 1 )/s where s 2 = 1 ni=1 n 1 (X i X) 2. If θ n (F ) P F (T n ( ˆF n,f) t] then, θ(f )=Φ(t), independent of F. It seems silly to be estimating a parameter whose value is known but, of course, interest now centers on θ (F ) the next higher order term in θ n (F )=Φ(t)+ θ (F ) n + O(n 1 ). Example 2. Estimation of variance: Suppose F has unique median m(f ), continuous density f(m(f )) > 0, E F X δ <, someδ>0 for all F F 0 and θ n (F )=Var F ( n median (X 1,...,X n )). Then θ(f )=[4f 2 (m(f ))] 1 on F 0. Note that, whereas θ n is defined for all empirical distributions F in both examples the limit θ(f )is0or for such distributions in the second. Nevertheless, it is well known (see Efron (1979)) that the nonparametric bootstrap is consistent in both examples in the sense that θ n ( ˆF n ) θ(f P )forf F Examples of bootstrap failure Example 3. Confidence bounds for an extremum: This is a variation on Bickel Freedman (1981). Suppose that all F F 0 have a density f continuous and positive at F 1 (0) >. It is natural to base confidence bounds for F 1 (0) on the bootstrap distribution of Let T n ( ˆF n,f)=n(min i X i F 1 (0)). θ n (F )=P F [T n ( ˆF n,f) >t]=(1 F ( t n + F 1 (0)) n. Evidently θ n (F ) θ(f )=exp( f(f 1 (0))t) onf 0. The nonparametric bootstrap fails. Let n Nn (t) = 1(Xi t n + X (1)),t >0, i=1 where X (1) min i X i and 1(A) is the indicator of A. Given X (1), n ˆF n ( t n +X (1)) is distributed as 1+ binomial (n 1, F ( t n +X (1)) F (X (1) ) (1 F (X (1) )) ) which converges weakly

5 RESAMPLING FEWER THAN n OBSERVATIONS 5 to a Poisson (f(f 1 (0))t) variable. More generally, n ˆF n ( + X (1)) converges ṅ weakly conditionally to 1 + N( ), where N is a homogeneous Poisson process with parameter f(f 1 (0)). It follows that Nn ( ) converges weakly (marginally) toaprocessm(1 + N( )) where M is a standard Poisson process independent of N( ). Thus if, in Efron s notation, we use P to denote conditional probability given ˆF n and let ˆF n, be the empirical d.f. of X1,...,X n then P [T n ( ˆF n) >t]= P [Nn (t) = 0] converges weakly to the random variable P [M(1+N(t)) = 0 N] = e (N(t)+1) rather than to the desired θ(f ). Example 4. Extrema for unbounded distributions: (Athreya and Fukuchi (1994), Deheuvels, Mason, Shorack (1993)) Suppose F F 0 are in the domain of attraction of an extreme value distribution. That is: for some constants A n (F ), B n (F ), n(1 F )(A n (F )+B n (F )x) H(x, F ), where H is necessarily one of the classical three types (David (1981), p.259): e βx 1(βx 0), αx β 1(x 0), α( x) β 1(x 0), for α, β 0. Let, θ n (F ) P [(max(x 1,...,X n ) A n (F ))/B n (F ) t] e H(t,F ) θ(f ). (2.2) Particular choices of A n (F ), for example, F 1 (1 1 n )andb n(f ) are of interest in inference. However, the bootstrap does not work. It is easy to see that n(1 ˆF n (A n (F )+tb n (F ))) w N(t), (2.3) where N is an inhomogeneous Poisson process with parameter H(t, F )and w denotes weak convergence. Hence if T n ( ˆF n,f)=(max(x 1,...,X n ) A n (F ))/B n (F ) then P [T n ( ˆF n,f) t] e w N(t). (2.4) It follows that the nonparametric bootstrap is inconsistent for this choice of A n,b n. If it were consistent, then P [T n ( ˆF n, ˆF n ) t] P e H(t,F ) (2.5) for all t and (2.5) would imply that it is possible to find random A real and B 0 such that N(Bt + A) = H(t, F ) with probability 1. But H(t, F ) is continuous except at 1 point. So (2.4) and (2.5) contradict each other. Again, θ(f ) is well defined for F F 0 but not otherwise. Furthermore, small perturbations in F can lead to drastic changes in the nature of H, sothatθ is not continuous if F 0 is as large as possible. Essentially the same bootstrap failure arises when we consider estimating the mean of distributions in the domain of attraction of stable laws of index 1 <α 2. (See Athreya (1987))

6 6 P. J. BICKEL, F. GÖTZE AND W. R. VAN ZWET Example 5. Testing and improperly centered U and V statistics: (Bretagnolle (1983)) Let F 0 = {F : F [ c, c]=1,e F X 1 =0} and let T n ( ˆF n )=n X 2 = n xyd ˆF n (x) d ˆF n (y). This is a natural test statistic for H : F F 0. Can one use the nonparametric bootstrap to find the critical value for this test statistic? Intuitively, ˆF n F 0 and this procedure is rightly suspect. Nevertheless, in more complicated contexts, it is a mistake made in practice. David Freedman pointed us to Freedman et al. (1994) where the Bureau of the Census appears to have fallen into such a trap. (see Hall and Wilson (1991) for other examples.) The nonparametric bootstrap may, in general, not be used for testing as will be shown in a forthcoming paper. In this example, due to Bretagnolle (1983), we focus on F 0 for which a general U or V statistic T is degenerate and show that the nonparametric bootstrap doesn t work. More generally, suppose ψ : R 2 R is bounded and symmetric and let F 0 = {F : ψ(x, y)df (x) = 0 for all y}. Then, it is easy to see that T n ( ˆF n )= ψ(x, y)dwn 0 (x)dw n 0 (y), (2.6) where Wn 0(x) n( ˆF n (x) F (x)) and well known that θ n (F ) P F [T n ( ˆF [ ] n ) t] P ψ(xy)dw 0 (F (x))dw 0 (F (y)) t θ(f ), where W 0 is a Brownian Bridge. On the other hand it is clear that, T n ( ˆF n)=n ψ(x, y)d ˆF n(x)d ˆF n (y) = ψ(x, y)dwn 0 (x)dwn (y)+2 ψ(x, y)dwn 0 0 (x)dwn (y) + ψ(x, y)dwn(x)dw 0 n(y), 0 (2.7) where W 0 n (x) n( ˆF n(x) ˆF n (x)). It readily follows that, P [T n ( ˆF n) t] w [ P ψ(x, y)dw 0 (F (x))dw 0 (F (y)) +2 ψ(x, y)dw 0 (F (x))d W 0 (F (y)) + ψ(x, y)d W 0 (F (x))d W 0 (F (y)) t W 0], (2.8) where W 0,W 0 are independent Brownian Bridges.

7 RESAMPLING FEWER THAN n OBSERVATIONS 7 This is again an instance where θ(f ) is well defined for F Fbut θ n (F ) does not converge for F F 0 Example 6. Nondifferentiable functions of the empirical: (Beran and Srivastava (1985) and Dümbgen (1993)) Let F 0 = {F : E F X1 2 < } and T n ( ˆF n,f)= n(h( X) h(µ(f ))) when µ(f )=E F X 1.Ifh is differentiable the bootstrap distribution of T n is, of course, consistent. But take h(x) = x, differentiable everywhere except at 0. It is easy to see then that if µ(f ) 0,L n (F ) N(0, Var F (X 1 )) but if µ(f )=0, L n (F ) N(0, Var F (X 1 )). The bootstrap is consistent if µ 0 but not if µ = 0. We can argue as follows. Under µ =0, n( X X), n X are asymptotically independent N (0,σ 2 (F )). Call these variables Z and Z. Then, n( X X ) Z w + Z Z,avariable whose distribution is not the same as that of Z. The bootstrap distribution, as usual, converges (weakly) to the (random) conditional distribution of Z + Z Z given Z. This phenomenon was first observed in a more realistic context by Beran and Srivastava (1985). Dümbgen (1993) constructs similar reasonable though more complicated examples where the bootstrap distribution never converges. If we represent T n ( ˆF n,f)= n(t ( ˆF n ) T (F )) in these cases then there is no linear T (F ) such that n(t ( ˆF n ) T (F )) nt (F )( ˆF n F ) which permits the argument of Bickel-Freedman (1981) Possible remedies Putter and van Zwet (1993) show that if θ n (F ) is continuous for every n on F and there is a consistent estimate F n of F then bootstrapping from F n will work, i.e. θ n ( F n ) will be consistent except possibly for F in a thin set. If we review our examples of bootstrap failure, we can see that constructing suitable F n F 0 and consistent is often a remedy that works for all F F 0 not simply the complement of a set of the second category. Thus in Example 3 taking F n to be ˆF n kernel smoothed with bandwidth h n 0ifnh 2 n 0works. In the first and simplest case of Example 4 it is easy to see, Freedman (1981), that taking F n as the empirical distribution of X i X, 1 i n which has mean 0 and thus belongs to F 0 will work. The appropriate choice of F n in the other examples of bootstrap failure is less clear. For instance, Example 4 calls for F n with estimated tails of the right order but how to achieve this is not immediate. A general approach which we believe is worth investigating is to approximate F 0 by a nested sequence of parametric models, (a sieve), {F 0,m }, and use the M.L.E. Fm(n) for F 0,m(n), for a suitable sequence m(n). See Shen and Wong (1994) for example.

8 8 P. J. BICKEL, F. GÖTZE AND W. R. VAN ZWET The alternative approach we study is to change θ n itself as well as possibly its argument. The changes we consider are the m out of n with replacement bootstrap, the (n m) outofn jackknife or ( n m) bootstrap discussed by Wu (1990) and Politis and Romano (1994), and what we call sample splitting. 3. The m Out of n Bootstraps Let h be a bounded real valued function defined on the range of T n,for instance, t 1(t t 0 ). We view as our goal estimation of θ n (F ) E F (h(t n ( ˆF n,f))). More complicated functionals such as quantiles are governed by the same heuristics and results as those we detail below. Here are the procedures we discuss. (i) The n/n bootstrap (The nonparametric bootstrap) Let, B n (F )=E h(t n ( ˆF n,f)) = n n h(t n (X i1,...,x in,f)). (i 1,...,i n) Then, B n B n ( ˆF n )=θ n ( ˆF )isthen/n bootstrap. (ii) The m/n bootstrap Let B m,n (F ) n m h(t m (X i1,...,x im,f)). (i 1,...,i m) Then, B m,n B m,n ( ˆF n )=θ m ( ˆF n )isthem/n bootstrap. (iii) The ( n m) bootstrap Let ( ) 1 n J m,n (F )= h(t m (X i1,...,x im,f)). m i 1 < <i m Then, J m,n J m,n ( ˆF n )isthe ( n m) bootstrap. (iv) Sample splitting Suppose n = mk. Define, k 1 N m,n (F ) k 1 h(t m (X jm+1,...,x (j+1)m,f)) j=0 and N m,n N m,n ( ˆF n ) as the sample splitting estimates. For safety in practice one should start with a random permutation of the X i. The motivation behind B m(n),n for m(n) is clear. Since, by (2.1), θ m(n) (F ) θ(f ), θ m(n) ( ˆF n ) has as good a rationale as θ n ( ˆF n ). To justify J m,n note that we can write θ m (F )=θ m (F F ) since it is a parameter of the } {{ } m

9 RESAMPLING FEWER THAN n OBSERVATIONS 9 law of T m (X 1,...,X m,f). We now approximate F F not by the m dimensional product measure ˆF n } {{ ˆF n but by sampling without replacement. Thus sample splitting is just k fold cross validation and represents a crude } m approximation to F F. } {{ } m The sample splitting method requires the least computation of any of the lot. Its obvious disadvantages are that it relies on an arbitrary partition of the sample and that since both m and k should be reasonably large, n has to be really substantial. This method and compromises between it and the ( n m) bootstrap are studied in Blom (1976) for instance. The ( n m) bootstrap differs from the m/n by o P (1) if m = o(n 1/2 ). Its advantage is that it never presents us with the ties which make resampling not look like sampling. As a consequence, as we note in Theorem 1, it is consistent under really minimal conditions. On the other hand it is somewhat harder to implement by simulation. We shall study both of these methods further, below, in terms of their accuracy. A simple and remarkable result on J m(n),n has been obtained by Politis and Romano (1994), generalizing Wu (1990). This result was also independently noted and generalized by Götze (1993). Here is a version of the Götze result and its easy proof. Write J m for J m,n,b m for B m,n, N m for N m,n. Theorem 1. Suppose m n 0, m. Then, J m (F )=θ m (F )+O P (( m n ) 1 2 ). (3.1) then If h is continuous and T m (X 1,...,X m,f)=t m (X 1,...,X m, ˆF n )+o p (1) (3.2) J m = θ m (F )+o p (1). (3.3) Proof. Suppose T m does not depend on F. Then, J m is a U statistic with kernel h(t m (x 1,...,x m )) and E F J m = θ m (F ) and (3.1) follows immediately. For (3.2) note that ( ) 1 n E F J m h(t m (X i1,...,x im,f)) m i 1 < <i m E F h(t m (X 1,...,X m, ˆF n )) h(t m (X 1,...,X m,f)) (3.4) and (3.2) follows by bounded convergence. These results follows in the same way and even more easily for N m. Note that if T m does not depend on F, E F N m = θ m (F ) and, Var F (N m )= m n Var F (h(t m (X 1,...,X m ))) > Var F (J m ). (3.5)

10 10 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET Note. It may be shown, more generally under (3.2), that, for example, distances between the ( n m) bootstrap distributions of Tm ( ˆF m,f)andl m (F )are also O P (m/n) 1/2. Let X (i) j =(X j,...,x j ) 1 i h i1,...,i r (X 1,...,X r )= 1 r! 1 j 1 j r r for vectors i =(i 1,...,i r ) in the index set Then h(t m (X (i 1) j 1,...,X (ir) j r,f)), (3.6) Λ r,m = {(i 1,...,i r ):1 i 1 i r m, i i r = m}. B m,n (F )= m ω m,n (i)( 1 n i Λ r,m r) r=1 1 j 1 j r m h i (X j1,...,x jr,f), (3.7) where Let ω m,n (i) = θ m,n (F )=E F B m,n (F )= ( )( ) n m /n m. r i 1,...,i r m r=1 i Λ r,m ω m,n (i)e F h i (X 1,...,X r ). (3.8) Finally, let δ m ( r m ) max{ E F h i (X 1,...,X r ) θ m (F ) : i Λ r,m } (3.9) and define δ m (x) by extrapolation on [0, 1]. Note that δ m (1) = 0. Theorem 2. Under the conditions of Theorem 1 If further, B m,n (F )=θ m,n (F )+O P ( m n ) 1 2. (3.10) δ m (1 xm 1/2 ) 0 (3.11) uniformly for 0 x M, allm<, andm = o(n), then Finally if, θ m,n (F )=θ m (F )+o(1). (3.12) T m (X (in) 1,...,X (ir) r,f)=t m (X (i 1) 1,...,X (ir) r, ˆF n )+o P (1) (3.13)

11 RESAMPLING FEWER THAN n OBSERVATIONS 11 whenever i Λ r,m,m and max{i 1,...,i r } = O(m 1/2 ) then, if m,m= o(n), B m = θ m (F )+o p (1). (3.14) The proof of Theorem 2 will be given in the Appendix. There too we will show briefly that, in the examples we have discussed and some others, J m(n), B m(n), N m(n) are consistent for m(n), m n 0. According to Theorem 2, if T n does not depend om F the m/n bootstrap worksaswellasthe ( n m) bootstrap if the value of Tm is not greatly affected by a number on the order of m ties in its argument. Some condition is needed. Consider T n (X 1,...,X n )=1(X i = X j for some i j) and suppose F is continuous. The ( n m) bootstrap gives Tm = 0 as it should. If m o( n)sothatthe ( n ) m and m/n bootstraps do not coincide asymptotically the m/n bootstrap gives T m = 1 with positive probability. Finally, (3.13) is the natural extension of (3.2) and is as easy to verify in all our examples. A number of other results are available for m out of n bootstraps. Giné and Zinn (1989) have shown quite generally that when n( ˆF n F )is viewed as a member of a suitable Banach space F and, (a) T n (X 1,...,X n,f)=t( n( ˆF n F )) for t continuous (b) F is not too big then B n and B m(n) are consistent. Praestgaard and Wellner (1993) extended these results to J m(n) with m = o(n). Finally, under the Giné-Zinn conditions, if m = o(n). Therefore, m( ˆF n F ) =( m n ) n( ˆF n F ) = O P ( m n )1/2 (3.15) t( m( ˆF m ˆF n )) = t( m( ˆF m F )) + o p (1) (3.16) and consistency of N m if m = o(n) follows from the original Giné-Zinn result. We close with a theorem on the parametric version of the m/n bootstrap which gives a stronger property than that of Theorem 1. Let F 0 = {F θ : θ Θ R p } where Θ is open and the model is regular. That is, θ is identifiable, the F θ have densities f θ with respect to a σ finite µ and the map θ f θ is continuously Hellinger differentiable with nonsingular derivative. By a result of LeCam (see Bickel, Klaassen, Ritov, Wellner (1993) for instance), there exists an estimate ˆθ n such that, for all θ, (f 1/2 (x) f 1/2 ˆθ n θ (x)) 2 dµ(x) =O Pθ ( 1 ). (3.17) n

12 12 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET Theorem 3. Suppose F 0 is as above. Let F m θ the variational norm. Then F θ F } {{ θ and denote } m F mˆθn F m θ = O p(( m n )1/2 ). (3.18) Proof. This is consequence of the relations (LeCam (1986)). F m θ 0 F m θ 1 ) H(F m θ 0,F m θ 1 )[(2 H 2 (F m θ 0,F m θ 1 )], (3.19) where H 2 (F, G) = 1 2 ( df dg) 2 (3.20) and H 2 (Fθ m 0,Fθ m 1 )=1 ( f θ0 f θ1 dµ) m =1 (1 H 2 (F θ0,f)) m. (3.21) Substituting (3.21) into (3.20) and using (3.17) we obtain F mˆθn F m θ = O P θ (1 exp O Pθ ( m n )) 1 2 (1 + exp O Pθ ( m n ) 1 2 )=O Pθ ( m n ) 1 2. (3.22) This result is weaker than Theorem 1 since it refers only to the parametric bootstrap. It is stronger since even for m = 1, when sampling with and without replacement coincide, ˆF n F θ =1foralln if F θ is continous. 4. Performance of B m, J m,andn m as Estimates of θ n (F ) As we have noted, if we take m(n) = o(n) then in all examples considered in which B n is inconsistent, J m(n), B m(n), N m(n) are consistent. Two obvious questions are, (1) How do we choose m(n)? (2) Is there a price to be paid for using J m(n), B m(n), or N m(n) when B n is consistent? We shall turn to the first very difficult question in a forthcoming paper on diagnostics. The answer to the second is, in general, yes. To make this precise we take the point of view of Beran (1982) and assume that at least on F 0, θ n (F )=θ(f )+θ (F )n 1/2 + O(n 1 ), (4.1) where θ(f ) and θ (F ) are regularly estimable on F 0 in the sense of Bickel, Klaassen, Ritov and Wellner (1993) and O(n 1 ) is uniform on Hellinger compacts. There are a number of general theorems which lead to such expansions. See, for example, Bentkus, Götze and van Zwet (1994).

13 RESAMPLING FEWER THAN n OBSERVATIONS 13 Somewhat more generally than Beran, we exhibit conditions under which B n = θ n ( ˆF n ) is fully efficient as an estimate of θ n (F ) and show that the m out n bootstrap with m n 0 has typically relative efficiency 0. We formally state a theorem which applies to fairly general parameters θ n. Suppose ρ is a metric on F 0 such that ρ( ˆF n,f 0 )=O PF0 (n 1/2 ) for all F 0 F 0. (4.2) Further suppose A. θ(f ),θ (F )areρ Fréchet differentiable in F at F 0 F 0.Thatis, θ(f )=θ(f 0 )+ ψ(x, F 0 )df (x)+o(ρ(f, F 0 )) (4.3) for ψ L 0 2 (F 0) {h : h 2 (x)df 0 (x) <, h(x)df 0 (x) = 0} and θ obeys a similar identity with ψ replaced by another function ψ L 0 2 (F 0). Suppose further B. The tangent space of F 0 at F 0 as defined in Bickel et al. (1993) is L 0 2 (F 0)so that ψ and ψ are the efficient influence functions of θ, θ. Essentially, we require that in estimating F there is no advantage in knowing F F 0. Finally, we assume, C. For all M<, sup{ θ m (F ) θ(f ) θ (F )m 1/2 : ρ(f, F 0 ) M 1/2 n,f F}= O(m 1 ) (4.4) a strengthened form of (4.1). Then, Theorem 4. Under regularity of θ, θ and A and C at F 0, θ m ( ˆF n ) θ(f 0 )+θ (F 0 )m 1/2 + 1 n (ψ(x i,f 0 )+ψ (X i,f 0 )m 1/2 ) n i=1 +O(m 1 )+o p (n 1/2 ). (4.5) If B also holds, θ n ( ˆF n ) is efficient. If in addition, θ (F 0 ) 0,and m n 0 the efficiency of θ m ( ˆF n ) is 0. Proof. The expansions of θ( ˆF n )θ ( ˆF n ) are immediate by Fréchet differentiability and (4.5) follows by plugging these into (4.1). Since θ, θ are assumed regular, ψ and ψ are their efficient influence functions. Full efficiency of θ n ( ˆF n ) follows by general theory as given in Beran (1983) for special cases or by extending Theorem 2, p.63 of Bickel et al. (1993) in an obvious way. On the other hand, if θ (F 0 ) 0, n(θ m ( ˆF n ) θ n (F 0 )) has asymptotic bias ( n m 1)θ (F 0 )+O( n m (1 + o(1))θ (F 0 ) ± and inefficiency follows. n m )=

14 14 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET Inefficiency results of the same type or worse may be proved about J m and N m but require going back to T m (X 1,...,X m,f) since J m and B n are not related in a simple way. We pursue this only by way of Example 1. If θ n (F )=Var F ( n( X µ(f )) = θ(f ), B m = B n but, J m = σ 2 ( ˆF n )(1 m 1 ). (4.6) n 1 Thus, since θ m (F ) = 0 here, B m is efficient but J m has efficiency 0 if n. N m evidently behaves in the same way. It is true that the bootstrap is often used not for estimation but for setting confidence bounds. This is clearly the case for Example (1b), the bootstrap of t where θ(f ) is known in advance. For example, Efron s percentile bootstrap uses the (1 α)th quantile of the bootstrap distribution of X as a level (1 α) approximate upper confidence bound for µ. As is well known by now (see Hall (1992)), for example, this estimate although, when suitably normalized, efficiently estimating the (1 α)th quantile of the distribution of n( X µ) does not improve to order n 1/2 over the coverage probability of the usual Gaussian based X s + z 1 α n. However, the confidence bounds based on the bootstrap distribution of the t statistic n( X µ(f ))/s get the coverage probability correct to order n 1/2. Unfortunately, this advantage is lost if one were to use the 1 α quantile of the bootstrap distribution of T m ( ˆF m,f)= m( X m µ(f ))/s m where X m and s 2 m are the mean and usual estimate of variance bsed on a sample of size m. The reason is that, in this case, the bootstrap distribution function is rather than the needed, Φ(t) m 1/2 c( ˆF n )ϕ(t)h 2 (t)+o P (m 1 ) (4.7) Φ(t) n 1/2 c( ˆF n )ϕ(t)h 2 (t)+o P (n 1). The error committed is of order m 1/2. More general formal results can be stated but we do not pursue this. The situation for J m(n) and N m(n) which function under minimal conditions, is even worse as we discuss in the next section. 5. Remedying the Deficiencies of B m(n) when B n is Correct: Extrapolation In Bickel and Yahav (1988), motivated by considerations of computational economy, situations were considered in which θ n has an expansion of the form (4.1) and it was proposed using B m at m = n 0 and m = n 1, n 0 <n 1 << n to produce estimates of θ n which behave like B n. We sketch the argument for a special case.

15 RESAMPLING FEWER THAN n OBSERVATIONS 15 Suppose that, as can be shown for a wide range of situations, if m, Then, if n 1 >n 0 B m = θ m ( ˆF n )=θ( ˆF n )+θ ( ˆF n )m 1/2 + O P (m 1 ). (5.1) θ ( ˆF n )=(B n0 B n1 )(n 1/2 0 n 1/2 1 ) 1 + O P (n 1/2 0 ) (5.2) θ( ˆF n )= n 1/2 0 B n1 n 1/2 1 B n0 n 1/2 0 n 1/2 1 and hence a reasonable estimate of B n is, B n0,n 1 n 1/2 0 B n1 n 1/2 1 B n0 n 1/2 0 n 1/2 1 More formally, + O P (n 1 0 ) (5.3) + (B n 0 B n1 ) n 1/2 0 n 1/2 1 n 1/2. Proposition. Suppose {θ m } obey C of Section 4 and n 0 n 1/2.Then, B n0,n 1 = B n + o p (n 1/2 ). (5.4) Hence, under the conditions of Theorem 3 B n0,n 1 is efficient for estimating θ n (F ). Proof. Under C, (5.4) holds. By construction, and (5.4) follows. B n0,n 1 = θ( ˆF n )+θ ( ˆF n )n 1/2 + O P (n 1 0 )+O P (n 1/2 0 n 1/2 ) = θ n ( ˆF n )+O P (n 1 0 )+O P (n 1/2 0 n 1/2 )+O P (n 1 ) = θ n ( ˆF n )+O P (n 1 0 ) (5.5) Assorted variations can be played on this theme depending on what we know or assume about θ n. If, as in the case where T n is a t statistic, the leading term θ(f ) in (4.1) is θ 0 independent of F, estimation of θ(f ) is unnecessary and we need only one value of m = n 0. We are led to a simple form of estimate, since ψ of Theorem 4 is 0, ˆθ n0 =(1 ( n 0 n )1/2 )θ 0 +( n 0 n )1/2 B n0. (5.6) This kind of interpolation is used to improve theoretically the behaviour of B m0 as an estimate of a parameter of a stable distribution by Hall and Jing (1993) though we argue below that the improvement is somewhat illusory. If we apply (5.4) to construct a bootstrap confidence bound we expect the coverage probability to be correct to order n 1/2 but the error is O P ((n 0 n) 1/2 ) rather than O P (n 1 )aswithb n. We do not pursue a formal statement.

16 16 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET 5.1. Extrapolation of J m and N m We discuss extrapolation for J m and N m only in the context of the simplest Example 1, where the essential difficulties become apparent and we omit general theorems. In work in progress, Götze and coworkers are developing expansions for general symmetric statistics under sampling from a finite population. These results will permit general statements of the same qualitative nature as in our discussion of Example 1. Consider θ m (F )=P F [ m( X m µ(f )) t]. If EX1 4 < and the X i obey Cramér s condition, then t θ m (F )=Φ( σ(f ) ) K ϕ 3(F ) 6 m ( t σ(f ) )H t 2( σ(f ) )+O(m 1 ), (5.7) where σ 2 (F )andk 3 (F ) are the second and third cumulants of F and H k (t) = ( 1) k dϕ k (t) ϕ(t). By Singh (1981), B dt k m = θ m ( ˆF n ) has the same expansion with F replaced by ˆF n. However, by an easy extension of results of Robinson (1978) and Babu and Singh (1985), where J m =Φ( t ˆK 2m ) ϕ( t ˆK 1/2 2m ) ˆK 3m 6m 1/2 H 2( t ˆK 1/2 2m )+O P (m 1 ), (5.8) ˆK 2m = σ 2 ( ˆF n )(1 m 1 n 1 ) (5.9) ˆK 3m = K 3 ( ˆF n )(1 m 1 n 1 2(m 1) )(1 ). (5.10) n 2 The essential character of expansion (5.8), if m/n = o(1), is J m = θ( ˆF n )+m 1/2 θ ( ˆF n )+ m n γ n + O P (m 1 +( m n )2 + m 1 2 ), (5.11) n where γ n is O P (1) and independent of m. Them/n terms essentially come from the finite population correction to the variance and highter order cumulants of means of samples from a finite population. They reflect the obvious fact that if m/n λ>0, J m is, in general, incorrect even to first order. For instance, thevarianceofthe ( n m) bootstrap distribution corresponding to m( X µ(f )) is 1/n (X i X) 2 (1 m 1 n 1 )) which converges to σ2 (F )(1 λ) ifm/n λ>0. What this means is that if expansions (4.1), (5.1) and (5.11) are valid, then using J m(n) again gives efficiency 0 compared to B n. Worse is that (5.2) with J n0, J n1 replacing B n0, B n1 will not work since the n 1 /n terms remain and make

17 RESAMPLING FEWER THAN n OBSERVATIONS 17 a contribution larger than n 1/2 if n 1 /n 1/2. Essentially it is necessary to estimate the coefficient of m/n and remove the contribution of this term at the same time while keeping the three required values of m: n 0 <n 1 <n 2 such that the error O( 1 n 0 +( n 2 n ) 2 )iso(n 1/2 ). This essentially means that n 0,n 1,n 2 have order larger than n 1/2 and smaller that n 3/4. This effect persists if we seek to use an extrapolation of J m for the t statistic. The coefficient of m/n as well as m 1/2 needs to be estimated. An alternative here and perhaps more generally is to modify the t statistic being bootstrapped and extrapolated. Thus T m (X 1,...,X m,f) m ( X m µ(f )) leads to an expansion ˆσ(1 m 1 n 1 )1/2 for J m of the form, J m =Φ(t)+θ ( ˆF n )m 1/2 + O P (m 1 + m/n), (5.12) and we again get correct coverage to order n 1/2 by fitting the m 1/2 term s coefficient, weighting it by n 1/2 m 1/2 and adding it to J m. If we know, as we sometimes at least suspect in symmetric cases, that θ(f )= 0, we should appropriately extrapolate linearly in m 1 rather than m 1/2. The sample splitting situation is less satisfactory in the same example. Under (5.1), the coefficient of 1/ m is asymptotically constant. Put another way, the asymptotic correlation of B m, B λm as m, n for fixed λ > 0is1. This is also true for J m under (5.11). However, consider N m and N 2m (say) if T m = m( X m µ(f )). Let h be continuously boundedly differentiable, n =2km. Then Cov (N m,n 2m) = 1 m (h(m k Cov 1/2 ( (X j X))),h((2m) 2m 1/2 (X j X)) ). j=1 (5.13) Thus, by the central limit theorem, Corr(N m,n 2m ) 1 Cov (h(z 1 ),h (Z 1 + Z 2 ) ), (5.14) 2 Var (Z 1 ) 2 where Z 1, Z 2 are independent Gaussian N (0,σ 2 (F )) and σ 2 (F )=Var F (X 1 ). More generally, viewed as a process in m for fixed n, N m centered and normalized is converging weakly to a non degenerate process. Thus, extrapolation does not make sense for N m. Two questions naturally present themselves. (a) How do these games play out in practice rather than theory? (b) If the expansions (5.1) and (5.11) are invalid beyond the 0th order, the usual situation when the nonparametric bootstrap is inconsistent, what price do we pay theoretically for extrapolation? j=1

18 18 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET Simulations giving limited encouragement in response to question (a) are given in Bickel and Yahav (1988). We give some further evidence in Section 7. We now turn to question (b) in the next section. 6. Behaviour of the Smaller Resample Schemes When B n is Inconsistent, and Presentation of Alternatives The class of situations in which B n does not work is too poorly defined for us to come to definitive conclusions. But consideration of the examples suggests the following, A. When, as in Example 6, θ(f ), θ (F ) are well defined and regularly estimable on F 0 we should still be able to use extrapolation (suitably applied) to B m and possibly to J m to produce better estimates of θ n (F ). B. When, as in all our other examples of inconsistency, θ(f ) is not regularly estimable on F 0 extrapolation should not improve over the behaviour of B n0, B n1. C. If n 0,n 1 are comparable extrapolation should not do particularly worse either. D. A closer analysis of T n and the goals of the bootstrap may, in these irregular cases, be used to obtain procedures which should do better than the m/n or ( n ) m or extrapolation bootstraps. The only one of these claims which can be made general is C. Proposition 1. Suppose B n1 θ n (F ) B n0 θ n (F ), (6.1) where indicates that the ratio tends to 1. Then,ifn 0 /n 1 1 B n0,n 1 θ n (F ) B n0 θ n (F ). (6.2) Proof. Evidently, Bn 0 +Bn 1 2 = θ n (F )+Ω(ɛ n )whereω(ɛ n ) means that the exact order of the remainder is ɛ n. On the other hand, B n0 B ( n1 1 n n 1/2 0 n 1/2 1 2 ( ) ( n ) 0 ) =Ω(ɛ n ) n0 n1 n +Ω(1) 1 and the proposition follows. We illustrate the other three claims in going through the examples. Example 3. Here, F 1 (0) = 0, θ n (F )=e f(0)t( 1+n 1 f (0) t2 2 ) + O(n 2 ) (6.3)

19 RESAMPLING FEWER THAN n OBSERVATIONS 19 which is of the form (5.1). But the functional θ(f ) is not regular and only estimable at rate n 1/3 if one puts a first order Lipschitz condition on F F 0. On the other hand, log B m = m log(1 ˆF n ( t m )) = m log(1 ( ˆF n ( t m ) ˆF n (0))) = m(f ( t m ) F (0)) m n( ˆFn ( t n m ) F ( t m ))+O P (m( ˆF n ( t m ) F ( t m ))2 ) = tf(0) + Ω( 1 m m )+Ω P ( n )+O P ( 1 ), (6.4) n where as before Ω, Ω p indicate exact order. As Politis and Romano (1994) point out, m =Ω(n 1/3 ) yields the optimal rate n 1/3 (under f Lipschitz). Extrapolation does not help because the m n n term is not of the form γ m n where γ n is independent of m. On the contrary, as a process in m, mn( ˆF n ( t m ) F ( t m )) behaves like the sample path of a stationary Gaussian process. So conclusion B holds in this case. Example 4. A major difficulty here is defining F 0 narrowly enough so that it is meaningful to talk about expansions of θ n (F ), B n (F )etc. IfF 0 in these examples is in the domain of attraction of stable laws or extreme value distributions it is easy to see that θ n (F ) can converge to θ(f ) arbitrarily slowly. This is even true in Example 1 if we remove the Lipschitz condition on f. By putting on conditions as in Example 1, it is possible to obtain rates. Hall and Jing (1993) specify a possible family for the stable law attraction domain estimation of the mean mentioned in Example 4 in which B n =Ω(n 1 α )whereα is the index of the stable law and α and the scales of the (assumed symmetric) stable distribution are not regularly estimable but for which rates such as n 2/5 or a little better are possible. The expansions for θ n (F )arenotinpowersofn 1/2 and the expansion for B n is even more complex. It seems evident that extrapolation does not help. Hall and Jing s (1993) theoretical results and simulations show that B m(n) though consistent, if m(n)/n 0, is a very poor estimate of θ n (F ). They obtain at least theoretically superior results by using interpolation between B m and the, known up to the value of the stable law index α, value of θ(f ). However, the conditions defining F 0 which permit them to deduce the order of B n are uncheckable so that this improvement appears illusory. Example 6. The discontinuity of θ(f )at µ(f ) = 0 underany reasonable specification of F 0 makes it clear that extrapolation cannot succeed. The discontinuity in θ(f ) persists even if we assume F 0 = {N (µ, 1) : µ R} and use the parametric bootstrap. In the parametric case it is possible to obtain constant level

20 20 P.J.BICKEL,F.GÖTZE AND W. R. VAN ZWET confidence bounds by inverting the tests for H : µ = µ 0 vs K : µ > µ 0 using the noncentral χ 2 1 distribution of ( n X) 2. Asymptotically conservative confidence bounds can be constructed in the nonparametric case by forming a bootstrap confidence interval for µ(f )using X and then taking the image of this interval into µ µ. So this example illustrates points B and D. We shall discuss claims A and D in the context of Example 5 or rather its simplest case with T n ( ˆF n,f)=n X 2.Webeginwith, Proposition 2. Suppose E F X1 4 <, E F X 1 =0,andF satisfies Cramer s condition. Then, B m P [ m X 2 t 2 ]=2Φ( ṱ σ ) 1 m X 2 ˆσ 3 tϕ( ṱ σ ) ˆK 3 X 3ˆσ 4 ϕh 3( ṱ σ ) +O P ( m n )3/2 + O P (m 1 ). (6.5) If m =Ω(n 1/2 ) then P [ m X 2 t 2 ]=P F [n X 2 t]+o P (n 1/4 ) (6.6) and no better choice of {m(n)} is possible. If n 0 <n 1, n 0 n 1/2, n 1 = o(n 3/4 ), B n 0,n 1 B n0 n 0 {(B n1 B n0 )/(n 1 n 0 )} = P F [n X 2 t]+o P (n 1/2 ). (6.7) Proof. We make a standard application of Singh (1981). If ˆσ 2 1 n (Xi X) 2, ˆK 3 1 n (Xi X) 3 we get, after some algebra and Edgeworth expansion, P [ m X ( t m X t]=φ ˆσ After Taylor expansion in m X ˆσ ) 1 m ϕ ( t m X ˆσ we conclude, ) ˆK3 6 H 2 ( t m X ˆσ ) +O p (m 1 ). P 2 [m X m t 2 ]=2Φ( ṱ σ ) 1+ϕ 2 ( ṱ σ )m X 2 ˆK 3 3ˆσ 4 [ϕh 3]( ṱ σ ) X+O P ( m n )3/2 +O P (m 1 ) (6.8) and (6.5) follows. Since m X 2 =Ω P (m/n), (6.6) follows. Finally, from (6.5), if n 0 n 1/2, n 1 n 1/2 B n0 n 0 {(B n1 B n0 )/(n 1 n 0 )} =2Φ( ṱ σ ) 1 K 3 6 ϕh 2( ṱ σ ) X + O P (n 3/4 ) +O P (n 1/2 )+O P (n 1/2 ). (6.9) Since X = O P (n 1/2 ), (6.7) follows.

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### The Variability of P-Values. Summary

The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### 5. Convergence of sequences of random variables

5. Convergence of sequences of random variables Throughout this chapter we assume that {X, X 2,...} is a sequence of r.v. and X is a r.v., and all of them are defined on the same probability space (Ω,

### The Delta Method and Applications

Chapter 5 The Delta Method and Applications 5.1 Linear approximations of functions In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence X 1, X,... of independent and

### Exact Nonparametric Tests for Comparing Means - A Personal Summary

Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola

### Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

### Maximum likelihood estimation of mean reverting processes

Maximum likelihood estimation of mean reverting processes José Carlos García Franco Onward, Inc. jcpollo@onwardinc.com Abstract Mean reverting processes are frequently used models in real options. For

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 5 9/17/2008 RANDOM VARIABLES Contents 1. Random variables and measurable functions 2. Cumulative distribution functions 3. Discrete

### LOGNORMAL MODEL FOR STOCK PRICES

LOGNORMAL MODEL FOR STOCK PRICES MICHAEL J. SHARPE MATHEMATICS DEPARTMENT, UCSD 1. INTRODUCTION What follows is a simple but important model that will be the basis for a later study of stock prices as

### Bootstrapping Multivariate Spectra

Bootstrapping Multivariate Spectra Jeremy Berkowitz Federal Reserve Board Francis X. Diebold University of Pennsylvania and NBER his Draft August 3, 1997 Address correspondence to: F.X. Diebold Department

### 1. R In this and the next section we are going to study the properties of sequences of real numbers.

+a 1. R In this and the next section we are going to study the properties of sequences of real numbers. Definition 1.1. (Sequence) A sequence is a function with domain N. Example 1.2. A sequence of real

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e.

CHAPTER II THE LIMIT OF A SEQUENCE OF NUMBERS DEFINITION OF THE NUMBER e. This chapter contains the beginnings of the most important, and probably the most subtle, notion in mathematical analysis, i.e.,

### A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a

### 1 The Brownian bridge construction

The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

### Computational Statistics and Data Analysis

Computational Statistics and Data Analysis 53 (2008) 17 26 Contents lists available at ScienceDirect Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda Coverage probability

### Non Parametric Inference

Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable

### MIN-MAX CONFIDENCE INTERVALS

MIN-MAX CONFIDENCE INTERVALS Johann Christoph Strelen Rheinische Friedrich Wilhelms Universität Bonn Römerstr. 164, 53117 Bonn, Germany E-mail: strelen@cs.uni-bonn.de July 2004 STOCHASTIC SIMULATION Random

### ON THE EXISTENCE AND LIMIT BEHAVIOR OF THE OPTIMAL BANDWIDTH FOR KERNEL DENSITY ESTIMATION

Statistica Sinica 17(27), 289-3 ON THE EXISTENCE AND LIMIT BEHAVIOR OF THE OPTIMAL BANDWIDTH FOR KERNEL DENSITY ESTIMATION J. E. Chacón, J. Montanero, A. G. Nogales and P. Pérez Universidad de Extremadura

### Tail inequalities for order statistics of log-concave vectors and applications

Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic

### THE CENTRAL LIMIT THEOREM TORONTO

THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................

### Gambling Systems and Multiplication-Invariant Measures

Gambling Systems and Multiplication-Invariant Measures by Jeffrey S. Rosenthal* and Peter O. Schwartz** (May 28, 997.. Introduction. This short paper describes a surprising connection between two previously

### An example of a computable

An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Multiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract

Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about

### The Exponential Distribution

21 The Exponential Distribution From Discrete-Time to Continuous-Time: In Chapter 6 of the text we will be considering Markov processes in continuous time. In a sense, we already have a very good understanding

### The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

### 1. Periodic Fourier series. The Fourier expansion of a 2π-periodic function f is:

CONVERGENCE OF FOURIER SERIES 1. Periodic Fourier series. The Fourier expansion of a 2π-periodic function f is: with coefficients given by: a n = 1 π f(x) a 0 2 + a n cos(nx) + b n sin(nx), n 1 f(x) cos(nx)dx

### Principle of Data Reduction

Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

### ON LIMIT LAWS FOR CENTRAL ORDER STATISTICS UNDER POWER NORMALIZATION. E. I. Pancheva, A. Gacovska-Barandovska

Pliska Stud. Math. Bulgar. 22 (2015), STUDIA MATHEMATICA BULGARICA ON LIMIT LAWS FOR CENTRAL ORDER STATISTICS UNDER POWER NORMALIZATION E. I. Pancheva, A. Gacovska-Barandovska Smirnov (1949) derived four

### 2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

### Life Table Analysis using Weighted Survey Data

Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

### THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS

THE FUNDAMENTAL THEOREM OF ALGEBRA VIA PROPER MAPS KEITH CONRAD 1. Introduction The Fundamental Theorem of Algebra says every nonconstant polynomial with complex coefficients can be factored into linear

### Two modified Wilcoxon tests for symmetry about an unknown location parameter

Biometril-a (1982). 69. 2. pp. 377-82 377 Printed in Great Britain Two modified Wilcoxon tests for symmetry about an unknown location parameter BY P. K. BHATTACHARYA Department of Statistics, University

### Lecture 1 Newton s method.

Lecture 1 Newton s method. 1 Square roots 2 Newton s method. The guts of the method. A vector version. Implementation. The existence theorem. Basins of attraction. The Babylonian algorithm for finding

### Section 3 Sequences and Limits

Section 3 Sequences and Limits Definition A sequence of real numbers is an infinite ordered list a, a 2, a 3, a 4,... where, for each n N, a n is a real number. We call a n the n-th term of the sequence.

### A direct approach to false discovery rates

J. R. Statist. Soc. B (2002) 64, Part 3, pp. 479 498 A direct approach to false discovery rates John D. Storey Stanford University, USA [Received June 2001. Revised December 2001] Summary. Multiple-hypothesis

### Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson

Mathematics for Computer Science/Software Engineering Notes for the course MSM1F3 Dr. R. A. Wilson October 1996 Chapter 1 Logic Lecture no. 1. We introduce the concept of a proposition, which is a statement

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Some stability results of parameter identification in a jump diffusion model

Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss

### Recent progress in additive prime number theory

Recent progress in additive prime number theory University of California, Los Angeles Mahler Lecture Series Additive prime number theory Additive prime number theory is the study of additive patterns in

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### Calculating Interval Forecasts

Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

### Monte Carlo testing with Big Data

Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:

### RESULTANT AND DISCRIMINANT OF POLYNOMIALS

RESULTANT AND DISCRIMINANT OF POLYNOMIALS SVANTE JANSON Abstract. This is a collection of classical results about resultants and discriminants for polynomials, compiled mainly for my own use. All results

### , for x = 0, 1, 2, 3,... (4.1) (1 + 1/n) n = 2.71828... b x /x! = e b, x=0

Chapter 4 The Poisson Distribution 4.1 The Fish Distribution? The Poisson distribution is named after Simeon-Denis Poisson (1781 1840). In addition, poisson is French for fish. In this chapter we will

### PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

### CONTRIBUTIONS TO ZERO SUM PROBLEMS

CONTRIBUTIONS TO ZERO SUM PROBLEMS S. D. ADHIKARI, Y. G. CHEN, J. B. FRIEDLANDER, S. V. KONYAGIN AND F. PAPPALARDI Abstract. A prototype of zero sum theorems, the well known theorem of Erdős, Ginzburg

### Chapter 5: Application: Fourier Series

321 28 9 Chapter 5: Application: Fourier Series For lack of time, this chapter is only an outline of some applications of Functional Analysis and some proofs are not complete. 5.1 Definition. If f L 1

### INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

### The Dirichlet Unit Theorem

Chapter 6 The Dirichlet Unit Theorem As usual, we will be working in the ring B of algebraic integers of a number field L. Two factorizations of an element of B are regarded as essentially the same if

### 3 Does the Simplex Algorithm Work?

Does the Simplex Algorithm Work? In this section we carefully examine the simplex algorithm introduced in the previous chapter. Our goal is to either prove that it works, or to determine those circumstances

### STAT 830 Convergence in Distribution

STAT 830 Convergence in Distribution Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Convergence in Distribution STAT 830 Fall 2011 1 / 31

### 1 Limiting distribution for a Markov chain

Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

### By W.E. Diewert. July, Linear programming problems are important for a number of reasons:

APPLIED ECONOMICS By W.E. Diewert. July, 3. Chapter : Linear Programming. Introduction The theory of linear programming provides a good introduction to the study of constrained maximization (and minimization)

### 1 if 1 x 0 1 if 0 x 1

Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

### Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### An Alternative Route to Performance Hypothesis Testing

EDHEC-Risk Institute 393-400 promenade des Anglais 06202 Nice Cedex 3 Tel.: +33 (0)4 93 18 32 53 E-mail: research@edhec-risk.com Web: www.edhec-risk.com An Alternative Route to Performance Hypothesis Testing

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Lectures 5-6: Taylor Series

Math 1d Instructor: Padraic Bartlett Lectures 5-: Taylor Series Weeks 5- Caltech 213 1 Taylor Polynomials and Series As we saw in week 4, power series are remarkably nice objects to work with. In particular,

### Research Article Stability Analysis for Higher-Order Adjacent Derivative in Parametrized Vector Optimization

Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 2010, Article ID 510838, 15 pages doi:10.1155/2010/510838 Research Article Stability Analysis for Higher-Order Adjacent Derivative

### Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

### Lecture 4: Random Variables

Lecture 4: Random Variables 1. Definition of Random variables 1.1 Measurable functions and random variables 1.2 Reduction of the measurability condition 1.3 Transformation of random variables 1.4 σ-algebra

### Department of Economics

Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional

### C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### Lecture 8: Random Walk vs. Brownian Motion, Binomial Model vs. Log-Normal Distribution

Lecture 8: Random Walk vs. Brownian Motion, Binomial Model vs. Log-ormal Distribution October 4, 200 Limiting Distribution of the Scaled Random Walk Recall that we defined a scaled simple random walk last

### The Bootstrap. 1 Introduction. The Bootstrap 2. Short Guides to Microeconometrics Fall Kurt Schmidheiny Unversität Basel

Short Guides to Microeconometrics Fall 2016 The Bootstrap Kurt Schmidheiny Unversität Basel The Bootstrap 2 1a) The asymptotic sampling distribution is very difficult to derive. 1b) The asymptotic sampling

### MATHEMATICAL METHODS OF STATISTICS

MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS

### Problem Set. Problem Set #2. Math 5322, Fall December 3, 2001 ANSWERS

Problem Set Problem Set #2 Math 5322, Fall 2001 December 3, 2001 ANSWERS i Problem 1. [Problem 18, page 32] Let A P(X) be an algebra, A σ the collection of countable unions of sets in A, and A σδ the collection

### Load Balancing and Switch Scheduling

EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01 Probability and Statistics for Engineers Winter 2010-2011 Contents 1 Joint Probability Distributions 1 1.1 Two Discrete

### 2.1 Complexity Classes

15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look

### Lecture 10: CPA Encryption, MACs, Hash Functions. 2 Recap of last lecture - PRGs for one time pads

CS 7880 Graduate Cryptography October 15, 2015 Lecture 10: CPA Encryption, MACs, Hash Functions Lecturer: Daniel Wichs Scribe: Matthew Dippel 1 Topic Covered Chosen plaintext attack model of security MACs

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Large induced subgraphs with all degrees odd

Large induced subgraphs with all degrees odd A.D. Scott Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, England Abstract: We prove that every connected graph of order

### CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

### From Binomial Trees to the Black-Scholes Option Pricing Formulas

Lecture 4 From Binomial Trees to the Black-Scholes Option Pricing Formulas In this lecture, we will extend the example in Lecture 2 to a general setting of binomial trees, as an important model for a single

### 2.5 Gaussian Elimination

page 150 150 CHAPTER 2 Matrices and Systems of Linear Equations 37 10 the linear algebra package of Maple, the three elementary 20 23 1 row operations are 12 1 swaprow(a,i,j): permute rows i and j 3 3

### Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

### Introduction to time series analysis

Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

### Convex Hull Probability Depth: first results

Conve Hull Probability Depth: first results Giovanni C. Porzio and Giancarlo Ragozini Abstract In this work, we present a new depth function, the conve hull probability depth, that is based on the conve

### International Journal of Information Technology, Modeling and Computing (IJITMC) Vol.1, No.3,August 2013

FACTORING CRYPTOSYSTEM MODULI WHEN THE CO-FACTORS DIFFERENCE IS BOUNDED Omar Akchiche 1 and Omar Khadir 2 1,2 Laboratory of Mathematics, Cryptography and Mechanics, Fstm, University of Hassan II Mohammedia-Casablanca,

### Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

### ALMOST COMMON PRIORS 1. INTRODUCTION

ALMOST COMMON PRIORS ZIV HELLMAN ABSTRACT. What happens when priors are not common? We introduce a measure for how far a type space is from having a common prior, which we term prior distance. If a type