On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

arXiv:1411.6314 [math.ST] 23 Nov 2014

Aaditya Ramdas (aramdas@cs.cmu.edu)
Aarti Singh (aarti@cs.cmu.edu)
Sashank J. Reddi (sjakkamr@cs.cmu.edu)
Larry Wasserman (larry@stat.cmu.edu)
Barnabás Póczos (bapoczos@cs.cmu.edu)

Department of Statistics and Machine Learning Department, Carnegie Mellon University

November 25, 2014

Both student authors had equal contribution.

Abstract

Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests: those which are consistent without any assumptions about how the distributions may differ (general alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (mean-shift alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones.

1 Introduction

The central topic of this paper is nonparametric two-sample testing, in which we try to detect a difference between two $d$-dimensional distributions $P$ and $Q$ based on $n$ samples from both, i.e. deciding whether two samples are drawn from the same distribution. We will be concerned with the following two settings, the first of which deals with general alternatives (GA), i.e.

$$H_0 : P = Q \quad \text{vs.} \quad H_1 : P \neq Q. \tag{GA}$$
It is called nonparametric two-sample testing because no parametric assumptions are made about the form of $P, Q$ (like Gaussianity or exponential families). We use the term general alternatives to mean that the difference between $P, Q$ need not have a simple form. In contrast, the second setting that we are concerned about deals with mean-shift alternatives (MSA), i.e.

$$H_0 : \mu_P = \mu_Q \quad \text{vs.} \quad H_1 : \mu_P \neq \mu_Q \tag{MSA}$$

where $\mu_P := \mathbb{E}_{X \sim P}[X]$ and $\mu_Q := \mathbb{E}_{Y \sim Q}[Y]$. It is still nonparametric two-sample testing, since we make no assumptions about $P, Q$, but it deals with easier alternatives, meaning that we specify the exact form in which $P$ and $Q$ differ, i.e. they differ in their means. Parametric two-sample testing (for example, when $P, Q$ are Gaussian) is also important, but will be out of the scope of our discussion; see Lopes et al. (2011) for a recent example. We assume an equal number $n$ of samples from each distribution for simplicity; our results would also go through if $n_1/(n_1+n_2) \to c \in (0,1)$ as $n_1, n_2 \to \infty$.

1.1 Hypothesis testing terminology

Let $X_n := \{x_1, \ldots, x_n\} \sim P$ and $Y_n := \{y_1, \ldots, y_n\} \sim Q$ be the two sets of samples, where $x_i, y_j \in \mathbb{R}^d$ for all $i, j \le n$. A test is any function or algorithm that takes $X_n, Y_n$ as input, and outputs an element of $\{0, 1\}$, where 1 is interpreted to mean that it rejects the null hypothesis $H_0$, and 0 is interpreted to mean that there is insufficient evidence to reject $H_0$. A test is characterized by its false positive rate (or type-1 error) $\alpha$ and its false negative rate (or type-2 error) $\beta$:

$$\alpha = P(\text{rejecting } H_0 \mid H_0 \text{ is true}), \qquad \beta = P(\text{not rejecting } H_0 \mid H_1 \text{ is true}).$$

There is usually a tradeoff involved: decreasing one error rate increases the other. Hence, one sometimes fixes $\alpha$ to some small value (say 0.01), and refers to $\phi = 1 - \beta$ as the power of the test at $\alpha = 0.01$. A test is classically called consistent if for any fixed $\alpha$, the power $\phi \to 1$ as $n \to \infty$ whenever $H_0$ is false. Many tests in the literature, including the ones we will consider, calculate a test statistic $T$ as a function of $X_n, Y_n$, and reject the null hypothesis if $T > c_\alpha$, where the threshold $c_\alpha$ depends on the distribution of $T$ under $H_0$ and on a pre-defined $\alpha$.
See Lehmann & Romano (2006) for a detailed introduction.

1.2 Motivation

Our first motivation comes from the fact that there is a big difference between the classical setting of fixing $d$ while letting $n \to \infty$, and the high-dimensional (HD) setting obtained when

$$(n, d) \to \infty. \tag{HD}$$

A test would be called consistent under HD if for any fixed $\alpha$, the power $\phi \to 1$ as $(n, d) \to \infty$ whenever $H_0$ is false. It is of vital importance, both theoretically and practically, to understand the power of tests in such settings, and to characterize the rate at which $n$ must grow as a function of $d$ so that the test is still consistent. While classical tests were proposed for the low-dimensional setting, over the past two decades several tests have been proposed specifically for MSA and studied in the HD setting; see Subsection 1.3. However, to the best of our knowledge there has been no formal and precise characterization of the power of tests designed for GA in high dimensions.

Our second motivation comes from the observation that there is no literature on how tests designed for GA perform under MSA. In other words, while it is expected that tests designed for MSA will not be consistent against the more general GA, it is unclear how exactly tests designed for general alternatives fare when faced with a mean-shift alternative.
1.3 Related Work (MSA)

It is well known (see Kariya, 1981; Simaika, 1941; Anderson, 1958; Salaevskii, 1971) that if $P, Q$ are Gaussians, then the uniformly most powerful test in the fixed-dimension setting, under fairly general conditions, is the $T^2$ test by Hotelling (1931):

$$T_H := (m_P - m_Q)^T S^{-1} (m_P - m_Q)$$

where $m_P$, $m_Q$ and $S$ are the usual empirical estimators of $\mu_P$, $\mu_Q$ and the joint covariance matrix $\Sigma$. In a seminal paper, Bai & Saranadasa (1996) showed that in the high-dimensional setting, the $T^2$ test performs quite poorly, specifically when $n, d \to \infty$ with $d/n \to 1 - \epsilon$ for small $\epsilon$. This is intuitively because of the difficulty of estimating the $O(d^2)$ parameters of $\Sigma$ with very few samples. Indeed, $S^{-1}$ is not even defined when $d > n$ and $S$ is poorly conditioned when $d$ is of similar order as $n$. To avoid this problem, they proposed to use the test statistic

$$T_{BS} := \|m_P - m_Q\|^2 - \mathrm{tr}(S)/n.$$

$T_{BS}$ has non-trivial power when $d/n \to c \in (0, \infty)$. Srivastava & Du (2008) proposed to use $\mathrm{diag}(S)$ instead of $S$ in $T_H$, and showed its advantages in certain settings over $T_{BS}$. More recently, Chen & Qin (2010), henceforth called CQ, proposed a slight variant of $T_{BS}$, which is a U-statistic of the form

$$T_{CQ} := \frac{1}{n(n-1)} \sum_{i \neq j} x_i^T x_j + \frac{1}{n(n-1)} \sum_{i \neq j} y_i^T y_j - \frac{2}{n^2} \sum_{i,j} x_i^T y_j$$

that achieves the same power without explicit restrictions on $d, n$, but rather under conditions stated in terms of $n$, $\mathrm{tr}(\Sigma)$, $\mu_P - \mu_Q$. The settings under which these various statistics are consistent, or achieve non-trivial power, are slightly complicated to describe, and the reader is referred to their papers for details.

1.4 Related Work (GA)

There are many nonparametric test statistics for two-sample testing. One of the most popular tests is the kernel Maximum Mean Discrepancy, henceforth called MMD, proposed in Gretton et al. (2012). While the technical details of the kernel literature are unnecessary for the purposes of this paper, it suffices to say that the population statistic is

$$\mathrm{MMD} := \max_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_P f(x) - \mathbb{E}_Q f(y)$$

where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space and $\|f\|_{\mathcal{H}} \le 1$ is its unit norm ball.
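There are two related sample statistics, both of which can be shown to be unbiased estimators of $\mathrm{MMD}^2$. The first is a U-statistic

$$\mathrm{MMD}_u^2 := \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j).$$

The second is a linear-time statistic

$$\mathrm{MMD}_l^2 := \frac{1}{n/2} \sum_{i=1}^{n/2} \left[ k(x_{2i-1}, x_{2i}) + k(y_{2i-1}, y_{2i}) - k(x_{2i-1}, y_{2i}) - k(y_{2i-1}, x_{2i}) \right].$$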
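The two sample statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, only the standard library is used, and the Gaussian kernel is assumed to be parameterized as $k(x, y) = \exp(-\|x - y\|^2/\gamma^2)$.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-||x - y||^2 / gamma^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)


def mmd_u(X, Y, gamma, kernel=gaussian_kernel):
    """Quadratic-time statistic: off-diagonal within-sample sums, full cross sum."""
    n = len(X)
    assert len(Y) == n and n >= 2
    xx = sum(kernel(X[i], X[j], gamma) for i in range(n) for j in range(n) if i != j)
    yy = sum(kernel(Y[i], Y[j], gamma) for i in range(n) for j in range(n) if i != j)
    xy = sum(kernel(X[i], Y[j], gamma) for i in range(n) for j in range(n))
    return xx / (n * (n - 1)) + yy / (n * (n - 1)) - 2.0 * xy / n ** 2


def mmd_l(X, Y, gamma, kernel=gaussian_kernel):
    """Linear-time statistic: average over n/2 disjoint consecutive pairs."""
    n = len(X)
    assert len(Y) == n and n >= 2
    total = 0.0
    for i in range(0, n - 1, 2):  # 0-based pairs (0, 1), (2, 3), ...
        total += (kernel(X[i], X[i + 1], gamma) + kernel(Y[i], Y[i + 1], gamma)
                  - kernel(X[i], Y[i + 1], gamma) - kernel(Y[i], X[i + 1], gamma))
    return total / (n // 2)
```

The linear-time version touches each sample once, which is what makes its cost linear in $n$, whereas the U-statistic evaluates the kernel on all pairs.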
Note that $T_{CQ}$ is just $\mathrm{MMD}_u^2$ under the linear kernel $k(x, y) = x^T y$. It is known that in the fixed $d$ setting, the power of both $\mathrm{MMD}_l^2$ and $\mathrm{MMD}_u^2$ approaches 1 at the rate of $\Phi(\sqrt{n})$, where $\Phi$ is the standard normal cdf; see Gretton et al. (2012). However, nothing is formally known when $d$ could be increasing with $n$. A recent related manuscript by Reddi et al. (2014) conducts detailed experiments that demonstrate that in the fixed $n$, increasing $d$ setting, the power of $\mathrm{MMD}_u$ and distance correlation decay polynomially in high dimensions against fair alternatives. While the authors provide some initial insights into this phenomenon for specific examples, there is still no theoretical analysis of the power of $\mathrm{MMD}_u$ (or any statistic designed for GA) against MSA or GA or any other set of alternatives, in the high dimensional setting.

Another statistic called Energy Distance by Székely & Rizzo (2004) is closely tied to the MMD: indeed it has the same form as the MMD with the Euclidean distance instead of a kernel. Lyons (2013) showed that one can also use other metrics instead of the Euclidean distance, and Sejdinovic et al. (2013) showed that there is a close tie between metrics and kernels for these problems. There has been an initial attempt to characterize some properties of distance correlation (which is a related statistic for the related problem of independence testing) in high dimensions in Székely & Rizzo (2013), but no analysis of power is available or easily derivable. There also exist many other tests under GA, like the cross-match test by Rosenbaum (2005), but none of them have been analyzed under HD.

2 Power of $\mathrm{MMD}_l^2$ (fixed dimension)

Let's first review the basic argument from Gretton et al. (2012) showing the power in the fixed dimensional setting. It will then become clear what the main difficulties are in establishing results in the high-dimensional setting. The main tool needed is a simple convergence result of the sample statistic to the population quantity.
It becomes convenient to introduce the notation $z_i := (x_i, y_i)$ and $h_{ij} := h(z_i, z_j)$, where

$$h_{ij} := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i).$$

Then we can rewrite our test statistic as

$$\mathrm{MMD}_l^2 = \frac{1}{n/2} \sum_{i=1}^{n/2} h(z_{2i-1}, z_{2i}).$$

Its expectation is $\mathbb{E}_{z,z'} h(z, z') = \mathrm{MMD}^2$, and since it is an average of $n/2$ i.i.d. terms, its variance is $2V/n$ where $V := \mathrm{Var}_{z,z'}\, h(z, z')$. Corollary 16 of Gretton et al. (2012) then states that under both $H_0$ and $H_1$, we have

$$F := \frac{\sqrt{n}\,(\mathrm{MMD}_l^2 - \mathrm{MMD}^2)}{\sqrt{2V}} \rightsquigarrow N(0, 1) \tag{3}$$

where $\rightsquigarrow$ means convergence in distribution as $n \to \infty$. Note that $V$ is a constant independent of $n$. Let $z_\alpha$ be the constant such that $P(Z > z_\alpha) = \alpha$ when $Z \sim N(0, 1)$. Then, the corresponding test rejects $H_0$ whenever

$$\text{Test-MMD}_l : \quad \frac{\sqrt{n}\,\mathrm{MMD}_l^2}{\sqrt{2v}} > z_\alpha \tag{4}$$

where $v$ is the empirical variance of $h(z, z')$, so that the normalizer $2v$ is twice the empirical variance. If $\Pr_1$ denotes the probability under $H_1$, the power of this test is given by

$$\Pr\nolimits_1\left( \frac{\sqrt{n}\,\mathrm{MMD}_l^2}{\sqrt{2v}} > z_\alpha \right) \tag{5}$$
$$= \Pr\nolimits_1\left( F > \sqrt{\frac{v}{V}}\, z_\alpha - \frac{\sqrt{n}\,\mathrm{MMD}^2}{\sqrt{2V}} \right) \tag{6}$$
$$\to \Pr\left( Z > z_\alpha - \frac{\sqrt{n}\,\mathrm{MMD}^2}{\sqrt{2V}} \right) \tag{7}$$
$$= 1 - \Phi\left( z_\alpha - \frac{\sqrt{n}\,\mathrm{MMD}^2}{\sqrt{2V}} \right) \tag{8}$$
$$= \Phi\left( \frac{\sqrt{n}\,\mathrm{MMD}^2}{\sqrt{2V}} - z_\alpha \right) \tag{9}$$

where $\Phi$ is the standard normal cdf. This behaves like $\Phi(\sqrt{n})$ since the population $\mathrm{MMD}^2$ and $V$ are constants that are both independent of $n$.

2.1 The challenges in high dimensions

There are several significant difficulties in lifting this argument to the high-dimensional setting.

(C1) The population $\mathrm{MMD}^2$ depends on dimension via the signal strength and bandwidth, as we later show, and one needs to explicitly account for this.

(C2) The variance $V$ also depends on dimension, the signal strength and the bandwidth, as we later show, and again one needs to explicitly track this, especially its dependence on dimension.

(C3) In the increasing $(d, n)$ setting, the limiting distribution is no longer trivially normal, and one needs to establish conditions under which it is indeed normal, the most important question being whether the rate of convergence to normality depends on $d$.

(C4) In the increasing $(d, n)$ setting, one needs to characterize the rate at which $v/V$ still tends to 1, so that $\sqrt{v/V}\, z_\alpha$ converges to $z_\alpha$. Since $v, V$ depend on $d$, the key question is again whether the rate of convergence depends on $d$ or not.

We will have to account for each of these challenges explicitly, as we shall see in later sections. Let's first summarize and discuss our assumptions and contributions before we delve into the technical details.

3 Assumptions and Contributions

We are now in a position to clearly state our contributions. We focus on analyzing the power of $\mathrm{MMD}_l^2$ in the high-dimensional setting when $(n, d) \to \infty$ for the Gaussian kernel with bandwidth $\gamma$, i.e. $k(x, y) = \exp(-\|x - y\|^2/\gamma^2)$, in the mean-shift setting when $P$ and $Q$ differ in their means. Let's first outline our assumptions below; note that we comment about these assumptions in the next subsection.

(A1) $x_i = U s_i + \mu_P$ and $y_i = U t_i + \mu_Q$, where $s_i, t_i$ are i.i.d. random vectors for $i \in \{1, \ldots, n\}$, each having $d$ i.i.d. zero-mean coordinates, and $U$ corresponds to a $d \times d$ orthogonal rotation, i.e. $U U^T = I$.
(A2) The $k$-th central moments of each i.i.d. coordinate of $s, t$ exist for $k \le 6$.

Note that the coordinates of $x, y$ need not be independent, and $\mathbb{E}_{x \sim P}[x] = \mu_P$, $\mathbb{E}_{y \sim Q}[y] = \mu_Q$. Denote $\delta := \mu_P - \mu_Q$. Denote the second, third and fourth central moments of each i.i.d. coordinate of $s, t$ by $\sigma^2, \mu_3, \mu_4$. Remember that $\mathbb{E}\, h(z_i, z_j) = \mathrm{MMD}^2$ (see Section 2). Denote the second, third and fourth central moments of $h(z_i, z_j)$ by $V, \tau_3, \tau_4$. Let $\|\cdot\|$ represent the Euclidean norm. Our main contribution is:

Theorem 1. For the Gaussian kernel with bandwidth chosen as $\gamma = \Omega(\sqrt{d})$, under assumptions (A1), (A2), with $n, d \to \infty$ at any rate, the Test-MMD$_l$ (Eq. (4)) has asymptotic type-1 error $\alpha$ and asymptotic power

$$1 - \beta = \Phi\left( \frac{\sqrt{n}\,\|\delta\|^2}{\sqrt{8 d \sigma^4 + 8 \sigma^2 \|\delta\|^2}} - z_\alpha \right)$$

where $\Phi$ is the cdf of a standard Normal distribution and $z_\alpha$ is the $1 - \alpha$ quantile of the standard Normal distribution. For finite samples, the type-1 error behaves like $\alpha \pm 20/\sqrt{n}$ and the power like $1 - \beta \pm 20/\sqrt{n}$.

The first remarkable point about this theorem is that the power is independent of the bandwidth $\gamma$, as long as $\gamma = \Omega(\sqrt{d})$. Such behavior has already been noted (but not explained) in the experiments of Reddi et al. (2014), and we will verify this carefully in our experiments section. While this may not hold true for other kernels, like the Laplace kernel $k(x, y) = \exp(-\|x - y\|/\gamma)$, or against more general alternatives, it is both surprising and interesting that this is the case for the Gaussian kernel under MSA. As discussed later, this theorem applies to the bandwidth chosen by the so-called median heuristic; see Schölkopf & Smola (2002). It implies that the median heuristic provides an arguably safe choice in the light of having no further information, and also explains why it works reasonably well in practice and in simulations.

If we consider the signal to noise ratio (henceforth called SNR) to be defined as $\Psi := \|\delta\|/\sigma$, then, focusing on the more important first term, the power behaves like

$$\Phi\left( \frac{\sqrt{n}\,\Psi^2}{\sqrt{8d + 8\Psi^2}} - z_\alpha \right).$$

From this, we get the following two corollaries. The first applies to the small SNR regime (which includes the fair alternative setting; see Reddi et al. (2014) for details), and the second applies when the SNR is large.

Corollary 1. When the signal to noise ratio $\Psi$ is small, specifically $\Psi = o(d^{1/2})$, the power goes to 1 at the rate of $\Phi(\sqrt{n}\,\Psi^2/\sqrt{d})$.

Corollary 2. When the signal to noise ratio $\Psi$ is large, specifically $\Psi = \omega(d^{1/2})$, then the power goes to 1 at the rate of $\Phi(\sqrt{n}\,\Psi)$, independent of $d$.

Note that the switch in behavior between the two corollaries occurs at $\Psi$ being on the order of $d^{1/2}$, and at this point the predictions of the two corollaries match. Hence one could use $O, \Omega$ instead of $o, \omega$ for describing the growth of $\Psi$ in both corollaries.

3.1 Remarks about assumptions

Assumptions (A1), (A2) are general enough for the predictions made by our theorem to be accurate and representative of observed behavior. We will verify the predictions of the theorem, corollaries and later lemmas in our simulations.
(A1) While the coordinates of $x, y$ need not be independent, the first assumption does restrict their covariances to be $\sigma^2 I$. We note that Székely & Rizzo (2013) make a more restrictive assumption of independent coordinates, while Assumption (a) in Bai & Saranadasa (1996) and Eq. (3.1) in Chen & Qin (2010) assume the same model as we do but don't require spherical covariance. However, our assumption is truly only for mathematical convenience; if we instead had $U D^{1/2}$ in (A1), where $D$ is a diagonal rescaling, all our calculations could still be carried out, but would be more tedious since the coordinates of $D^{1/2} s$ are still independent but not identically distributed, and we would need to track $\sigma_j^2, \mu_{3j}, \mu_{4j}$ in Appendix Sections 3-6.

(A2) The existence of third and fourth moments is needed for calculating the population MMD and variance terms, as well as for the Berry-Esseen lemma to control the deviation from normality, and the convergence of $v$ to $V$. The existence of the sixth moment is needed to bound the Taylor expansion residual term in all our calculations. Note that CQ needs the existence of eighth moments, and BS assume the existence of fourth moments (see Eq. (3.2) in Chen & Qin (2010) and Assumption (a) in Bai & Saranadasa (1996)).

3.2 Remark about bandwidth choice

Remember that the power is independent of the bandwidth $\gamma$, as long as $\gamma = \Omega(\sqrt{d})$. This restriction of $\gamma = \Omega(\sqrt{d})$ is to allow us to control the residual term in the Taylor expansion of the Gaussian kernel. However, it is not very restrictive, since smaller $\gamma$ typically leads to worse power. Specifically, we note that the experiments in Reddi et al. (2014) for mean-shift alternatives show convincingly that when $\gamma$ is chosen to be $d^\alpha$ for $\alpha < 0.5$ (including constant $\gamma$), then the power of $\mathrm{MMD}_u$ is poor, while the highest power occurs for values $\alpha \ge 0.5$. Hence our choice covers most reasonable choices of bandwidth.
Furthermore, one of the most popular methods for bandwidth selection is called the median heuristic (see Schölkopf & Smola (2002)), where one chooses the bandwidth as the median of distances between all pairs of points. A simple calculation shows $\mathbb{E}_{x \sim P, y \sim Q} \|x - y\|^2 = 2\sigma^2 d + \|\mu_P - \mu_Q\|^2$, so generally speaking the median heuristic chooses $\gamma$ of the same order as $\sigma\sqrt{d}$ (or larger if $\|\mu_P - \mu_Q\|$ is large).

3.3 Comparisons to CQ

The assumptions in CQ, BS, SD are stated slightly differently from our results here. However, their results can broadly be compared to ours. We can summarize the most recent results, those of CQ, under (A1) and (A2) in the following two observations. The first observation follows from Eq. (3.11) in Chen & Qin (2010), which applies to the small SNR regime dictated by their Eq. (3.4).

Observation 1. When the signal to noise ratio $\Psi$ is small, specifically $\Psi = o(\sqrt{d/n})$, the power goes to 1 at the rate of $\Phi(n \Psi^2/\sqrt{d})$.

We believe there is a mistake in the derivation of Eq. (3.12) in Chen & Qin (2010), which applies in the large SNR regime dictated by their Eq. (3.5). We describe this in more detail in Appendix Section 1, and just summarize the corrected resulting observation below.

Observation 2. When the signal to noise ratio $\Psi$ is large, specifically $\Psi = \omega(\sqrt{d/n})$, then the power goes to 1 at the rate of $\Phi(\sqrt{n}\,\Psi)$, independent of $d$.

Comparing these expressions with Corollaries 1 and 2, it is clear that CQ has an advantage over $\mathrm{MMD}_l$ in the low-SNR setting. For example, when $n = d$ and the SNR $\Psi$ is constant, the power of CQ can increase $\sqrt{n}$ times faster than that of $\mathrm{MMD}_l$, but when the SNR is $\omega(d^{1/2})$, the power of both methods scales in the same fashion. This advantage for low SNR might be wiped out by considering $\mathrm{MMD}_u^2$; ascertaining if this is the case is an important direction of future work. The main technical challenge is understanding the limiting distributions of general degenerate U-statistics in high dimensions (which in the fixed dimensional setting is an infinite sum of $\chi^2$s; see Serfling (2009), Section 5.5.2).

We now provide the proof of Theorem 1 and then verify all our claims in simulations, to convincingly show that these expressions are tight up to constant factors.

4 Proof of Theorem 1

We split the proof into four subsections, one for each of the challenges (C1)-(C4). For (C1) and (C2), we need to calculate the first two moments of $h$ (introduced in Section 2), for which the main tool we use is Taylor expansions (whose validity is explained in Appendix Section 2), following which the results follow after a sequence of tedious calculations and detailed book-keeping. For (C3) and (C4), we need to bound the third and fourth moments of $h$. The main tool used for (C3) is a Berry-Esseen theorem, which helps us track the deviation from normality at finite samples, and (C4) is tackled by Chebyshev's inequality once we have a handle on the variance of $v$. Most of the details will be deferred to the Appendix, but we will outline the main steps of the derivations here.

4.1 The Population MMD

The main takeaway point of the following lemma is the dependence of the population $\mathrm{MMD}^2$ on the bandwidth $\gamma$ and the signal strength $\|\delta\|$ (recall $\delta := \mu_P - \mu_Q$). If $p, q$ are the pdfs of $P, Q$, then note that the population $\mathrm{MMD}^2$ with the Gaussian kernel is given by

$$\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} e^{-\|x - y\|^2/\gamma^2} \left( p(x)p(y) + q(x)q(y) - 2p(x)q(y) \right) dx\, dy.$$

Lemma 1. Under (A1), (A2), and when $\gamma = \Omega(\sqrt{d})$, we have

$$\mathrm{MMD}^2 = \frac{2\|\delta\|^2}{\gamma^2}\,(1 + o(1)).$$

Proof. We defer details to Appendix Section 3. On using Taylor's expansion for the Gaussian kernel, the terms in the aforementioned $\mathrm{MMD}^2$ expression can be approximated by bounding higher order residual terms. We prove that the first $\mathrm{MMD}^2$ term is

$$\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} e^{-\|x - y\|^2/\gamma^2} p(x)p(y)\, dx\, dy \approx 1 - \frac{2 d \sigma^2}{\gamma^2}.$$

Using similar techniques we can also deduce:

$$\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} e^{-\|x - y\|^2/\gamma^2} p(x)q(y)\, dx\, dy \approx 1 - \frac{1}{\gamma^2} \sum_i \left( 2\sigma^2 + \delta_i^2 \right).$$

Combining these, again using Taylor expansions, gives us our expression.

4.2 The Variance

As argued earlier, the variance of $\mathrm{MMD}_l^2$ is given by $2V/n$ where $V = \mathrm{Var}_{z,z'}\, h(z, z')$. The takeaway points of the following lemma are the identical dependence that $\sqrt{V}$ has on the bandwidth $\gamma$ as the $\mathrm{MMD}^2$ (which then causes their ratio to be essentially independent of $\gamma$), and also the role played by dimension and the signal strength in determining the variance.
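Lemma 2. Under (A1), (A2), and when $\gamma = \Omega(\sqrt{d})$, we have

$$V = \frac{16 d \sigma^4 + 16 \sigma^2 \|\delta\|^2}{\gamma^4}\,(1 + o(1)).$$

Proof. Note that $V = \mathbb{E}_{z,z'} h^2(z, z') - \mathrm{MMD}^4$ since $\mathrm{MMD}^2 = \mathbb{E}_{z,z'} h(z, z')$. Let's focus on the first term:

$$\mathbb{E}_{z,z'}[h^2(z, z')] = \mathbb{E}_{x,x' \sim P}\, k^2(x, x') + \mathbb{E}_{y,y' \sim Q}\, k^2(y, y') + 2\,\mathbb{E}_{x \sim P, y \sim Q}\, k^2(x, y)$$
$$+\ 2\,\mathbb{E}_{x,x' \sim P,\, y,y' \sim Q}\, k(x, x') k(y, y') + 2\,\mathbb{E}_{x,x' \sim P,\, y,y' \sim Q}\, k(x, y') k(x', y)$$
$$-\ 4\,\mathbb{E}_{x,x' \sim P,\, y \sim Q}\, k(x, x') k(x, y) - 4\,\mathbb{E}_{x \sim P,\, y,y' \sim Q}\, k(x, y) k(y, y').$$

Hence, there are five different kinds of terms to calculate (the first and last two are similar). Combining these gives us our solution. The details are tedious and hence are given in Appendix Section 4.

4.3 The Berry-Esseen Bound

Lemma 3. Under (A1), (A2), and when $\gamma = \Omega(\sqrt{d})$, we have

$$\sup_t \left| P\left( \frac{\sqrt{n/2}\,(\mathrm{MMD}_l^2 - \mathrm{MMD}^2)}{\sqrt{V}} \le t \right) - \Phi(t) \right| \le \frac{20}{\sqrt{n}}.$$

Proof. The Berry-Esseen Lemma (see for example Theorem 3.6 or 3.7 in Chen et al. (2010)), when translated to our problem, essentially yields the above lemma, except that the right hand side is

$$\frac{10\, \xi_3}{V^{3/2} \sqrt{n}} \tag{10}$$

where $\xi_3 := \mathbb{E}\left[ |h(z, z') - \mathbb{E} h(z, z')|^3 \right]$, and the constant 10 is not optimal. Note that $\xi_3 \neq \tau_3$ (the third central moment of $h$) due to the absolute value sign. Given that we have the mean and second central moment of $h$ ($\mathrm{MMD}^2$ and $V$ respectively), one might imagine using similar techniques to calculate $\xi_3$. However, the absolute value poses a problem, and so we must take an alternate route. Specifically, tedious calculations in Appendix Section 5 prove that $\tau_4$ (the fourth central moment of $h$) is bounded as $\tau_4 \le 4V^2(1 + o(1))$, allowing us to bound $\xi_3$ as

$$\xi_3 \le \sqrt{\tau_4 V} \le 2 V^{3/2}$$

up to lower order terms, since $\mathbb{E}|X|^3 \le \sqrt{\mathbb{E} X^4\, \mathbb{E} X^2}$ by Cauchy-Schwarz. Substituting into Eq. (10) gives us our Lemma.

The main challenge involved is in proving that the ratio $\xi_3/V^{3/2}$ is independent of $d$. Note that the very crude bound $|h - \mathbb{E} h| \le 4$ (since $e^{-z} \le 1$) gives us $\xi_3 \le 4V$, which would yield a dimension dependence due to an extra factor of $1/\sqrt{V}$. But because $\tau_4$, and hence $\xi_3$, has exactly the right scaling with $V$, the dependence on $V$ (and hence, importantly, on the dimension) cancels out and our Lemma follows. This is only one of the reasons we needed a bound on $\tau_4$, the other appearing in the next lemma.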
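The dimension-independence of the ratio $\xi_3/V^{3/2}$ can be probed numerically. The following is a minimal Monte Carlo sketch under assumptions that are not taken from the paper's code: standard-library Python only, $P$ and $Q$ both standard normal by default (the `shift` parameter is a hypothetical knob for a mean shift), the kernel parameterized as $\exp(-\|x-y\|^2/\gamma^2)$, and plug-in sample moments in place of population ones.

```python
import math
import random


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-||x - y||^2 / gamma^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)


def sample_h(rng, d, gamma, shift):
    """Draw one value of h(z, z') with P = N(0, I) and Q = N(shift * 1, I)."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y1 = [rng.gauss(shift, 1.0) for _ in range(d)]
    y2 = [rng.gauss(shift, 1.0) for _ in range(d)]
    return (gaussian_kernel(x1, x2, gamma) + gaussian_kernel(y1, y2, gamma)
            - gaussian_kernel(x1, y2, gamma) - gaussian_kernel(x2, y1, gamma))


def berry_esseen_ratio(d, num_pairs, gamma, shift=0.0, seed=0):
    """Plug-in estimate of xi_3 / V^{3/2} from num_pairs independent draws of h."""
    rng = random.Random(seed)
    hs = [sample_h(rng, d, gamma, shift) for _ in range(num_pairs)]
    mean = sum(hs) / num_pairs
    second = sum((v - mean) ** 2 for v in hs) / num_pairs      # estimate of V
    third_abs = sum(abs(v - mean) ** 3 for v in hs) / num_pairs  # estimate of xi_3
    return third_abs / second ** 1.5
```

Calling `berry_esseen_ratio(d, 2000, d ** 0.75)` for a few values of `d` gives a rough sense of whether the ratio stays flat as the dimension grows; the estimate is noisy because each call uses a single batch of samples.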
4.4 Bounding $v/V$

Recall that $v$ is the empirical estimator of $V$: it is an empirical average of $n/2$ unidimensional terms. The subtlety is that $v$ depends on $d$ since $V$ depends on $d$. What matters is whether the rate of convergence of their ratio to 1 depends on $d$; fortunately it does not.

Lemma 4. Under (A1), (A2), and when $\gamma = \Omega(\sqrt{d})$, we have

$$v/V = 1 + O_P(1/n^{1/4}).$$

Proof. Using $k = 2$ in Theorem A of Section 2.2.3 in Serfling (2009), the bias of $v$ is given by

$$\mathbb{E}[v] - V = -\frac{V}{n}$$

and its variance is given by

$$\mathrm{var}(v) = \frac{\tau_4 - V^2}{n} \le \frac{3V^2}{n}$$

(both up to constant factors and smaller order terms), where the inequality follows from the previous lemma. Then, it is easy to see that $v/V - 1 = O_P(1/\sqrt{n})$, i.e. $v - V = O_P(V/\sqrt{n})$, which is stronger than the claimed rate. This is because for any $\epsilon > 0$,

$$P\left( \frac{|v - V|}{V/\sqrt{n}} > 2\sqrt{\frac{3}{\epsilon}} \right) \le P\left( |v - \mathbb{E}[v]| > 2\sqrt{\frac{3V^2}{n\epsilon}} - \frac{V}{n} \right) \le \frac{\mathrm{var}(v)}{\left( 2\sqrt{\frac{3V^2}{n\epsilon}} - \frac{V}{n} \right)^2} \le \epsilon$$

where the second inequality is Chebyshev's inequality, and the last inequality follows since $2\sqrt{\frac{3V^2}{n\epsilon}} - \frac{V}{n} \ge \sqrt{\frac{3V^2}{n\epsilon}}$.

At this point we have all the key elements of the proof of Theorem 1. Specifically, equations (5) to (9) follow exactly as written, with (7) now holding even as $n, d \to \infty$. Note that this step allows $n, d$ to grow at any relative rate precisely because the rate at which $F$ converges to the standard normal $Z$ (the Berry-Esseen bound) and the rate at which $v/V$ converges to 1 are both independent of $d$ and only need $n \to \infty$. The dependence on $d$ only enters through the $\mathrm{MMD}^2$ and its variance. This concludes the proof of Theorem 1.

One can also write down the finite sample type-1 error rate as being at most $\alpha + 20/\sqrt{n}$ and the finite sample power as being at least $1 - \beta - 20/\sqrt{n}$, where the additional error is introduced due to the Berry-Esseen bound, whose constants we don't optimize, but which could be tightened to about 5 instead of 20. We now confirm the tightness of all the predictions in this section by detailed simulations in the next section.
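As a heuristic cross-check (not part of the formal proof), a first-order Taylor expansion of the Gaussian kernel already recovers the leading terms of Lemmas 1 and 2 and the argument of $\Phi$ in Theorem 1. The sketch below assumes the kernel $k(x,y) = \exp(-\|x-y\|^2/\gamma^2)$, introduces the shorthand $u := x - y$ and $u' := x' - y'$ purely for illustration, and ignores the higher-order remainders that the Appendix is said to control.

```latex
\begin{align*}
h(z,z') &\approx -\frac{1}{\gamma^2}\Big(\|x-x'\|^2 + \|y-y'\|^2 - \|x-y'\|^2 - \|x'-y\|^2\Big)
          = \frac{2}{\gamma^2}\, u^T u', \\
\mathbb{E}[u] &= \delta, \qquad \mathrm{Cov}(u) = 2\sigma^2 I
          \quad \text{(under (A1), since $x$ and $y$ are independent)}, \\
\mathrm{MMD}^2 = \mathbb{E}\, h &\approx \frac{2}{\gamma^2}\, \|\delta\|^2, \\
V = \mathrm{Var}(h) &\approx \frac{4}{\gamma^4}\Big(\mathrm{tr}\big[(2\sigma^2 I + \delta\delta^T)^2\big] - \|\delta\|^4\Big)
          = \frac{16 d \sigma^4 + 16 \sigma^2 \|\delta\|^2}{\gamma^4}, \\
\frac{\sqrt{n}\,\mathrm{MMD}^2}{\sqrt{2V}} &\approx
          \frac{\sqrt{n}\,\|\delta\|^2}{\sqrt{8 d \sigma^4 + 8 \sigma^2 \|\delta\|^2}}.
\end{align*}
```

The factor $2V$ in the last line reflects that the linear-time statistic averages $n/2$ independent terms, so its variance is $2V/n$. The bandwidth cancels because the mean scales as $1/\gamma^2$ and the standard deviation scales the same way, which is one way to see why the power does not depend on $\gamma$ at this order.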
5 Experiments

Our aim in this section is to confirm the theoretical predictions made by our lemmas and theorems. The most important claims to address are that the Berry-Esseen bound is independent of $d$, that the null and alternate distributions are indeed normal even in the extreme case when $n$ is fixed and $d$ is increasing, that the ratio $\mathrm{MMD}^2/\sqrt{V}$ is essentially independent of the bandwidth, and finally that the final power expression is essentially independent of the bandwidth and has the exact predicted scaling as given by our expressions.

5.1 Berry-Esseen bound is independent of $d$

Since the calculations of $\tau_4$ are rather tedious, let's also verify the prediction made in Subsection 4.3 that $\xi_3/V^{3/2}$ is constant and independent of dimension (remember that the ratio involves population quantities). To verify this, we draw a large, fixed number $n$ of samples from $P, Q$, and calculate the empirical ratio over a regular grid of dimensions $d$ ranging from 40 to 1000. We make 3 sets of choices for $P, Q$: standard normals with $\gamma = d^{0.75}$, the $t_4$ distribution with $\gamma = d^{0.5}$, and the $t_4$ distribution with $\gamma = d$. The reason we use $t_4$ (the $t$ distribution with 4 degrees of freedom) is that it does not have a finite fourth moment $\mu_4$. We find that in all 3 cases, the ratio is a constant of about 1.65, showing that our prediction is extremely accurate. Also, while our proof proceeded via bounding $\tau_4$, the result seems to hold true even when moments higher than 3 don't exist, since it holds for the $t_4$ distribution. The spikes are because we calculate a single empirical ratio at each $d$.

[Figure 1 plot: B-E ratio (vertical axis) against $d$ (horizontal axis), with one curve for each of the two t-distribution settings and the normal setting.]

Figure 1: The empirical Berry-Esseen ratio $\xi_3/V^{3/2}$ vs dimension, at fixed $n$, for the distributions $t_4$, $t_4$ and normal, with bandwidths $d^{0.5}$, $d$, $d^{0.75}$ respectively.

5.2 Normality of null/alternate distributions

Let's now verify that the null and alternate distributions are indeed almost standard normal when $n$ is held constant and $d$ is increased.
We do this by fixing $n = 50$, choosing $d \in \{50, 100, 200\}$, and calculating our test statistic $\sqrt{n}\,\mathrm{MMD}_l^2/\sqrt{2v}$. We experimentally approximate the null and alternate distributions by repeating this process many times; the histogram obtained is compared to a normal by plotting a standard normal quantile-quantile plot. The overlapping straight lines indicate that each of the null and alternate distributions for the three different $d$ values is almost exactly standard normal, even at a small value of $n$ like 50. This agrees with our derivation that the Berry-Esseen constant is very small and normality is achieved soon.
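The test statistic used in this experiment, together with its normal threshold, can be sketched as follows. This is an illustrative reading of Test-MMD$_l$ rather than the authors' code: it assumes the kernel $\exp(-\|x-y\|^2/\gamma^2)$, normalizes by twice the empirical variance of $h$, and uses hypothetical names (`linear_time_mmd_test`, `demo`) and hypothetical demo parameters (Gaussian data, `shift=3.0`, `alpha=0.05`).

```python
import math
import random
from statistics import NormalDist


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel exp(-||x - y||^2 / gamma^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / gamma ** 2)


def linear_time_mmd_test(X, Y, gamma, alpha):
    """Return (statistic, threshold, reject) for the linear-time test sketch."""
    m = min(len(X), len(Y)) // 2  # number of disjoint pairs, needs m >= 2
    hs = []
    for i in range(m):
        x1, x2, y1, y2 = X[2 * i], X[2 * i + 1], Y[2 * i], Y[2 * i + 1]
        hs.append(gaussian_kernel(x1, x2, gamma) + gaussian_kernel(y1, y2, gamma)
                  - gaussian_kernel(x1, y2, gamma) - gaussian_kernel(x2, y1, gamma))
    mmd_l = sum(hs) / m
    v = sum((h - mmd_l) ** 2 for h in hs) / (m - 1)  # empirical variance of h
    n = 2 * m
    statistic = math.sqrt(n) * mmd_l / math.sqrt(2.0 * v)
    threshold = NormalDist().inv_cdf(1.0 - alpha)  # the 1 - alpha normal quantile
    return statistic, threshold, statistic > threshold


def demo(seed=0, n=200, d=5, shift=3.0, alpha=0.05):
    """Hypothetical demo: P = N(0, I), Q = N(shift * 1, I), gamma = sqrt(d)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    Y = [[rng.gauss(shift, 1.0) for _ in range(d)] for _ in range(n)]
    return linear_time_mmd_test(X, Y, gamma=math.sqrt(d), alpha=alpha)
```

Repeating `linear_time_mmd_test` on fresh samples and collecting the returned statistic is one way to build the empirical null and alternate distributions that the quantile-quantile plots compare against a standard normal.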
[Figure 2 plots: Quantiles of Input Sample (vertical axis) against Standard Normal Quantiles (horizontal axis), two panels.]

Figure 2: A normal quantile-quantile plot of the null (left) and alternate (right) distributions of our test statistic for $d = 50, 100, 200$ at fixed $n$, over repeated trials.

[Figure 3 plot: log(ratio) (vertical axis) against log($d$) (horizontal axis), with one curve for each of the bandwidths $\gamma = d^{0.5}$, $d^{0.75}$, $d$ and for the Median Heuristic.]

Figure 3: A log-log plot of $\mathrm{MMD}^2/\sqrt{V}$ vs dimension for different bandwidth choices when $\Psi$ is held fixed and $n$ is large. Note that the slope is $-0.5$, independent of $\gamma$.

5.3 $\mathrm{MMD}^2/\sqrt{V}$ is independent of bandwidth

Our first two lemmas together imply that the ratio $\mathrm{MMD}^2/\sqrt{V}$ is independent of $\gamma$ as long as $\gamma = \Omega(\sqrt{d})$. To test this, we actually calculate this ratio for $\gamma = d^{0.5}, d^{0.75}, d$. Remember that these are population quantities; we estimate the ratio using sample quantities with a large $n$, at a fixed $\Psi$. We plot the obtained log-ratio against log-dimension in Figure 3, showing that the ratio, and hence the power, scales as $1/\sqrt{d}$ as predicted.

5.4 The scaling of power with $n, d$

Here are a few testable predictions of Theorem 1:

1. When $n = 50$ and $\Psi$ is a fixed constant, the power should decrease as $1/\sqrt{d}$ (Corollary 1).
2. When $n = 50$ and $\Psi = d^{1/4}$, then the power should be a constant (Corollary 1).
3. When $n = d$ and $\Psi$ is a fixed constant, the power should stay constant (Corollary 1).
4. When $n = d$ and $\Psi = 0.3 d^{1/2}$, then the power should increase as $d$ (Corollary 2).

From Figure 4, we infer that the precise form of Theorem 1 and Corollaries 1, 2 is extremely accurate, even at small $n$ and significantly larger $d$, including that it is independent of the bandwidth $\gamma$ as predicted, as long as $\gamma = \Omega(\sqrt{d})$.
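The qualitative predictions listed above can be read off directly from the power expression $\Phi\big(\sqrt{n}\,\Psi^2/\sqrt{8d + 8\Psi^2} - z_\alpha\big)$. The short Python sketch below evaluates that expression; the function name `predicted_power`, the level `alpha = 0.05` and the constant value `psi = 2.0` are illustrative assumptions, not values taken from the experiments.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def predicted_power(n, d, psi, alpha):
    """Asymptotic power Phi(sqrt(n) psi^2 / sqrt(8 d + 8 psi^2) - z_alpha)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)  # the 1 - alpha normal quantile
    signal = math.sqrt(n) * psi ** 2 / math.sqrt(8.0 * d + 8.0 * psi ** 2)
    return _STD_NORMAL.cdf(signal - z_alpha)


if __name__ == "__main__":
    alpha = 0.05  # illustrative level
    for d in (40, 80, 160):
        print(
            d,
            round(predicted_power(50, d, 2.0, alpha), 3),                  # fixed n, fixed psi
            round(predicted_power(50, d, d ** 0.25, alpha), 3),            # fixed n, psi = d^(1/4)
            round(predicted_power(d, d, 2.0, alpha), 3),                   # n = d, fixed psi
            round(predicted_power(d, d, 0.3 * math.sqrt(d), alpha), 3),    # n = d, psi = 0.3 sqrt(d)
        )
```

With zero signal the expression reduces to the level $\alpha$, and for fixed $n$ and $\Psi$ it decreases monotonically in $d$, which matches the first prediction; the other columns can be compared against the remaining three settings in the same way.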
[Figure 4 plots: four panels showing Power (vertical axis) against $d$ (horizontal axis), each with curves for lMMD with $\gamma = d^{0.5}$, $d^{0.75}$, $d$ and lMMD with the Median heuristic.]

Figure 4: All plots show power vs $d$ for different $\gamma \in \{\text{median}, d^{0.5}, d^{0.75}, d\}$, for $d$ ranging from 40 to 200. From top left to bottom right are the settings 1-4, with $P, Q$ being Gaussians. The power is estimated over repeated trials at each $d$.

6 Conclusion

This paper has two main novelties. The first is to precisely characterize how a nonparametric two sample test, which is consistent in fixed dimensions against general alternatives, performs against a mean-shift alternative; the second is to perform the analysis in the significantly more difficult high-dimensional regime. Future work involves understanding $\mathrm{MMD}_u$, but the limiting distributions of general U-statistics can be difficult to ascertain in high dimensions. Another direction involves the study of sparse alternatives, where $\delta$ is sparse, as done by Cai et al. (2014). Lastly, minimax lower bounds are required to understand the tradeoffs involved between $\Psi, d, n$.

References

Anderson, Theodore W. An introduction to multivariate statistical analysis. 1958.

Bai, Zhidong D and Saranadasa, Hewa. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 6:311-329, 1996.

Cai, Tony, Liu, Weidong, and Xia, Yin. Two-sample test of high dimensional means under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):349-372, 2014.
Chen, Louis HY, Goldstein, Larry, and Shao, Qi-Man. Normal approximation by Stein's method. Springer, 2010.

Chen, Song Xi and Qin, Ying-Li. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38(2):808-835, apr 2010. doi: 10.1214/09-aos716. URL http://dx.doi.org/10.1214/09-aos716.

Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

Hotelling, Harold. The generalization of Student's ratio. Annals of Mathematical Statistics, 2(3):360-378, aug 1931. doi: 10.1214/aoms/1177732979. URL http://dx.doi.org/10.1214/aoms/1177732979.

Kariya, Takeaki. A robustness property of Hotelling's T2-test. The Annals of Statistics, pp. 211-214, 1981.

Lehmann, Erich L and Romano, Joseph P. Testing statistical hypotheses. Springer, 2006.

Lopes, M.E., Jacob, L., and Wainwright, M.J. A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems 24. MIT Press, 2011.

Lyons, R. Distance covariance in metric spaces. Annals of Probability, 41(5):3284-3305, 2013.

Reddi, Sashank J., Ramdas, Aaditya, Póczos, Barnabás, Singh, Aarti, and Wasserman, Larry A. Kernel MMD, the median heuristic and distance correlation in high dimensions. CoRR, abs/1406.2083, 2014. URL http://arxiv.org/abs/1406.2083.

Rosenbaum, Paul R. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):515-530, 2005.

Salaevskii, O.V. Minimax character of Hotelling's T2 test. I. In Investigations in Classical Problems of Probability Theory and Mathematical Statistics, pp. 74-101. Springer, 1971.

Schölkopf, Bernhard and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K., et al. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.

Serfling, Robert J. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009.

Simaika, JB. On an optimum property of two important statistical tests. Biometrika, pp. 70-80, 1941.

Srivastava, Muni S. and Du, Meng. A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99(3):386-402, mar 2008. doi: 10.1016/j.jmva.2006.11.002. URL http://dx.doi.org/10.1016/j.jmva.2006.11.002.

Székely, Gábor J and Rizzo, Maria L. Testing for equal distributions in high dimension. InterStat, 5, 2004.

Székely, G.J. and Rizzo, M.L. The distance correlation t-test of independence in high dimension. J. Multivariate Analysis, 117:193-213, 2013.
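A The Power of CQ for high SNR

Let us first briefly describe what we believe is an important mistake in Chen & Qin (2010); all notations, equation numbers and theorems in this paragraph refer to those in Chen & Qin (2010). Using the test statistic $T_n/\hat\sigma_{n1}$ defined below Theorem 2, we can derive the power under assumption (3.5) as

$$
\begin{aligned}
P\left(\frac{T_n}{\hat\sigma_{n1}} > \xi_\alpha\right)
&= P\left(\frac{T_n - \|\mu_1-\mu_2\|^2}{\hat\sigma_{n2}} > \frac{\hat\sigma_{n1}}{\hat\sigma_{n2}}\,\xi_\alpha - \frac{\|\mu_1-\mu_2\|^2}{\hat\sigma_{n2}}\right) \\
&\approx \Phi\left(\frac{\|\mu_1-\mu_2\|^2}{\hat\sigma_{n2}}\right) \qquad \text{(the denominator is not } \hat\sigma_{n1}\text{)} \\
&= \Phi\left(\frac{\sqrt{n}\,\|\mu_1-\mu_2\|^2}{\sqrt{(\mu_1-\mu_2)^T\Sigma(\mu_1-\mu_2)}}\right),
\end{aligned}
$$

which should be the expression for power that they derive in Eq. (3.12), the most important difference being the presence of $\sqrt{n}$ instead of $n$ in the numerator. They also do not have an explicit Berry-Esseen bound dealing with the deviation from normality.

B Remarks for this Appendix

B.1 Taylor Expansion

In all our calculations, we use the Taylor expansion for the function $e^{-x}$ around 0. More specifically, we have

$$
\int\!\!\int e^{-\frac{(u-v)^2}{\gamma^2}}\, p_i(u)\, q_i(v)\, du\, dv
= \int\!\!\int \left[1 - \frac{(u-v)^2}{\gamma^2} + e^{-\lambda\frac{(u-v)^2}{\gamma^2}}\,\frac{(u-v)^4}{2\gamma^4}\right] p_i(u)\, q_i(v)\, du\, dv,
$$

where $\lambda \in [0,1]$. The above equality follows from the exact formula for Taylor expansions having exact residuals. Note that $e^{-\lambda (u-v)^2/\gamma^2} \le 1$. When $\gamma = \Omega(\sqrt{d})$ and fourth moments of the distributions $p_i$ and $q_i$ exist, the above integral becomes

$$
\int\!\!\int e^{-\frac{(u-v)^2}{\gamma^2}}\, p_i(u)\, q_i(v)\, du\, dv
= \int\!\!\int \left[1 - \frac{(u-v)^2}{\gamma^2}\right] p_i(u)\, q_i(v)\, du\, dv + o(1).
$$

Similarly, a higher order expansion can also be obtained by assuming existence of sixth order moments. For ease of exposition, we drop $o(1)$ throughout our calculations. To emphasize this issue, we use the symbol $\approx$ in our calculations to indicate that the $o(1)$ term is ignored.

B.2 Independent Coordinates

In our calculations, we assume that the coordinates of $x, y$ are independent and that their central moments are $\sigma^2, \mu_3, \mu_4$. In other words, we use $U = I$ in the model $x = Us + \mu_P$, $y = Ut + \mu_Q$ to derive expressions in this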
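Appendix. However, this is only for ease of exposition and all our proofs hold even when $U \neq I$. This can be seen from the following argument.

$$
\|x-y\|^2 = \|(Us+\mu_P) - (Ut+\mu_Q)\|^2 = \|U(s-t) + (\mu_P-\mu_Q)\|^2 = \|(s + U^T\mu_P) - (t + U^T\mu_Q)\|^2 = \|s'-t'\|^2,
$$

where $s' = s + U^T\mu_P$ and $t' = t + U^T\mu_Q$. Since $U^T\mu_P$ and $U^T\mu_Q$ are just rotated mean vectors, the coordinates of $s'$ and $t'$ are independent (since the coordinates of $s, t$ are independent by assumption) and $s', t'$ still have the same central moments as $s, t$. Using the above relation, we can rewrite our calculations involving $e^{-\|x-y\|^2/\gamma^2}$ in terms of $e^{-\|s'-t'\|^2/\gamma^2}$. Note that the difference between the means of the distributions on $x, y$ is $\mu_P - \mu_Q$, and that the norm of the difference between the means of the distributions on $s', t'$ is also $\|U^T\mu_P - U^T\mu_Q\| = \|\mu_P - \mu_Q\|$ since $U$ is orthogonal. So all the problem parameters remain the same, except we shift from non-independent coordinates for $x, y$ to independent coordinates for $s', t'$.

C Proof of Lemma 1

First note that we can rewrite the population MMD as

$$
\begin{aligned}
\mathrm{MMD}^2 &= E_{x,x'\sim P}[k(x,x')] + E_{y,y'\sim Q}[k(y,y')] - 2E_{x\sim P,\,y\sim Q}[k(x,y)] \\
&= \int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} p(u)p(v)\,du\,dv + \int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} q(u)q(v)\,du\,dv - 2\int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} p(u)q(v)\,du\,dv.
\end{aligned}
$$

We calculate each of these integrals in the following manner. Since the coordinates of $P$ and $Q$ are independent, we have

$$
\begin{aligned}
\int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} p(u)p(v)\,du\,dv
&= \prod_i \int\!\!\int e^{-\frac{(u_i-v_i)^2}{\gamma^2}} p_i(u_i)p_i(v_i)\,du_i\,dv_i \\
&\approx \prod_i \int\!\!\int \left[1 - \frac{(u_i-v_i)^2}{\gamma^2}\right] p_i(u_i)p_i(v_i)\,du_i\,dv_i \\
&= \prod_i \left(1 - \frac{2\sigma^2}{\gamma^2}\right).
\end{aligned}
$$

These steps follow from the independence of the coordinates, the Taylor expansion of Section B.1, and the definition of the second moments of the distributions $p_i$ and $q_i$ (see Section F.1 of the Appendix). Similarly the corresponding term for distribution $Q$ is

$$
\int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} q(u)q(v)\,du\,dv \approx \prod_{i=1}^{d}\left(1 - \frac{2\sigma^2}{\gamma^2}\right).
$$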
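For the final term, we have

$$
\begin{aligned}
\int\!\!\int e^{-\frac{\|u-v\|^2}{\gamma^2}} p(u)q(v)\,du\,dv
&= \int\!\!\int \prod_i e^{-\frac{(u_i-v_i)^2}{\gamma^2}}\, p(u)q(v)\,du\,dv \\
&= \prod_i \int\!\!\int e^{-\frac{(u_i-v_i)^2}{\gamma^2}} p_i(u_i)q_i(v_i)\,du_i\,dv_i \\
&\approx \prod_i \int\!\!\int \left[1 - \frac{(u_i-v_i)^2}{\gamma^2}\right] p_i(u_i)q_i(v_i)\,du_i\,dv_i \\
&= \prod_i \int \left[1 - \frac{\sigma^2 + (\mu_{P,i}-v_i)^2}{\gamma^2}\right] q_i(v_i)\,dv_i \\
&= \prod_i \left[1 - \frac{2\sigma^2 + (\mu_{Q,i}-\mu_{P,i})^2}{\gamma^2}\right]
= \prod_i \left[1 - \frac{2\sigma^2 + \delta_i^2}{\gamma^2}\right].
\end{aligned}
$$

The second step follows since the exponential of a sum is a product of exponentials. The third step follows from independence of the coordinates. The fourth step follows from Taylor expansion. The final few steps follow from the definition of second moment of the distributions (see Section F.1 of the Appendix). Combining the above terms and keeping only the leading-order terms in $1/\gamma^2$, we have

$$
\mathrm{MMD}^2 \approx 2\prod_i\left(1 - \frac{2\sigma^2}{\gamma^2}\right) - 2\prod_i\left(1 - \frac{2\sigma^2+\delta_i^2}{\gamma^2}\right) \approx \frac{2\|\delta\|^2}{\gamma^2}.
$$

D Proof of Lemma 2

The variance for the linear time MMD is given by

$$
\mathrm{Var}_{z,z'}\big(h(z,z')\big) = E_{z,z'}[h^2(z,z')] - \big(E_{z,z'}[h(z,z')]\big)^2,
$$

where $h(z,z') = k(x,x') + k(y,y') - k(x,y') - k(x',y)$, with $x, x' \sim P$ and $y, y' \sim Q$, and $E_{z,z'}[h(z,z')] = \mathrm{MMD}^2$. Hence the second term is just $(E_{z,z'}[h(z,z')])^2 = \mathrm{MMD}^4$. Let us concentrate on the first term:

$$
\begin{aligned}
E_{z,z'}[h^2(z,z')] ={}& E_{x,x'\sim P}\,k^2(x,x') + E_{y,y'\sim Q}\,k^2(y,y') + 2E_{x\sim P,\,y\sim Q}\,k^2(x,y) \\
&+ 2E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,x')k(y,y') + 2E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,y')k(x',y) \\
&- 4E_{x,x'\sim P,\,y\sim Q}\,k(x,x')k(x,y) - 4E_{x\sim P,\,y,y'\sim Q}\,k(x,y)k(y,y').
\end{aligned}
$$

Hence, there are five kinds of terms to calculate:

1. $E_{x,x'\sim P}\,k^2(x,x')$, from which $E_{y,y'\sim Q}\,k^2(y,y')$ can follow.
2. $E_{x\sim P,\,y\sim Q}\,k^2(x,y)$
3. $E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,x')k(y,y')$
4. $E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,y')k(x',y)$
5. $E_{x,x'\sim P,\,y\sim Q}\,k(x,x')k(x,y)$, from which $E_{x\sim P,\,y,y'\sim Q}\,k(x,y)k(y,y')$ can follow.

Let us calculate these five terms in order.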
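The function $h$ and the statistic whose variance is being computed can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the kernel convention $k(a,b) = \exp(-\|a-b\|^2/\gamma^2)$ implied by the expansions in this appendix, and it assumes the usual linear-time construction that averages $h$ over disjoint consecutive pairs of samples. The function names are hypothetical.

```python
import math

def gaussian_kernel(a, b, gamma):
    """k(a, b) = exp(-||a - b||^2 / gamma^2) (assumed kernel convention)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / gamma ** 2)

def h(z, z_prime, gamma):
    """h(z, z') = k(x, x') + k(y, y') - k(x, y') - k(x', y), with z = (x, y)."""
    (x, y), (xp, yp) = z, z_prime
    return (gaussian_kernel(x, xp, gamma) + gaussian_kernel(y, yp, gamma)
            - gaussian_kernel(x, yp, gamma) - gaussian_kernel(xp, y, gamma))

def linear_time_mmd(xs, ys, gamma):
    """Average of h over disjoint pairs (z_1, z_2), (z_3, z_4), ... (assumed pairing)."""
    zs = list(zip(xs, ys))
    pairs = [(zs[i], zs[i + 1]) for i in range(0, len(zs) - 1, 2)]
    return sum(h(z, zp, gamma) for z, zp in pairs) / len(pairs)
```

Because the pairs are disjoint, the summands are independent, which is why the variance of the statistic reduces to the variance of a single $h(z,z')$ up to a factor depending only on the number of pairs.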
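D.1 Term 1: $E_{x,x'\sim P}\,k^2(x,x')$

$$
\begin{aligned}
E_{x,x'\sim P}\,k^2(x,x')
&= \int_{x,x'} e^{-\frac{2\|x-x'\|^2}{\gamma^2}}\, p(x)p(x')\,dx\,dx' \\
&= \prod_i \int e^{-\frac{2(x_i-x_i')^2}{\gamma^2}}\, p_i(x_i)p_i(x_i')\,dx_i\,dx_i' \\
&\approx \prod_i \left(1 - \frac{4\sigma^2}{\gamma^2} + \frac{4\mu_4 + 12\sigma^4}{\gamma^4}\right) \\
&\approx 1 - \frac{4d\sigma^2}{\gamma^2} + \frac{4d\mu_4 + 12d\sigma^4}{\gamma^4} + \frac{8d(d-1)\sigma^4}{\gamma^4}.
\end{aligned}
$$

The third step follows from our calculations in Section F.1 of the Appendix. Note that the extra terms arise from considering all cross terms with denominator $\gamma^4$.

D.2 Term 2: $E_{x\sim P,\,y\sim Q}\,k^2(x,y)$

$$
\begin{aligned}
E_{x\sim P,\,y\sim Q}\,k^2(x,y)
&= \int_{x}\int_{y} e^{-\frac{2\|x-y\|^2}{\gamma^2}}\, p(x)q(y)\,dx\,dy \\
&= \prod_i \int e^{-\frac{2(x_i-y_i)^2}{\gamma^2}}\, p_i(x_i)q_i(y_i)\,dx_i\,dy_i \\
&\approx \prod_i \left(1 - \frac{4\sigma^2 + 2\delta_i^2}{\gamma^2} + \frac{4\mu_4 + 24\sigma^2\delta_i^2 + 12\sigma^4 + 2\delta_i^4}{\gamma^4}\right) \\
&\approx 1 - \frac{4d\sigma^2 + 2\|\delta\|^2}{\gamma^2} + \frac{4d\mu_4 + 24\sigma^2\|\delta\|^2 + 12d\sigma^4 + 2\|\delta\|_4^4}{\gamma^4} \\
&\qquad + \frac{8d(d-1)\sigma^4 + 8(d-1)\sigma^2\|\delta\|^2 + 2\left(\|\delta\|^4 - \|\delta\|_4^4\right)}{\gamma^4}.
\end{aligned}
$$

The third step follows from our calculations in Section F.1 of the Appendix.

D.3 Term 3: $E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,x')k(y,y')$

$$
\begin{aligned}
E\,k(x,x')k(y,y')
&= \int_{x,x'}\int_{y,y'} e^{-\frac{\|x-x'\|^2}{\gamma^2}}\, e^{-\frac{\|y-y'\|^2}{\gamma^2}}\, p(x)p(x')q(y)q(y')\,dx\,dx'\,dy\,dy' \\
&= \prod_i \left[\int e^{-\frac{(x_i-x_i')^2}{\gamma^2}} p_i(x_i)p_i(x_i')\,dx_i\,dx_i'\right]\left[\int e^{-\frac{(y_i-y_i')^2}{\gamma^2}} q_i(y_i)q_i(y_i')\,dy_i\,dy_i'\right] \\
&\approx \prod_i \left(1 - \frac{2\sigma^2}{\gamma^2} + \frac{\mu_4 + 3\sigma^4}{\gamma^4}\right)\left(1 - \frac{2\sigma^2}{\gamma^2} + \frac{\mu_4 + 3\sigma^4}{\gamma^4}\right) \\
&\approx \prod_i \left(1 - \frac{4\sigma^2}{\gamma^2} + \frac{2\mu_4 + 10\sigma^4}{\gamma^4}\right) \\
&\approx 1 - \frac{4d\sigma^2}{\gamma^2} + \frac{2d\mu_4 + 10d\sigma^4}{\gamma^4} + \frac{8d(d-1)\sigma^4}{\gamma^4}.
\end{aligned}
$$

The third step follows from our calculations in Section F.1 of the Appendix.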
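The expansion of Term 1 can be checked numerically on a toy example. The sketch below is an illustration only: it assumes i.i.d. coordinates drawn from a small discrete law (a hypothetical choice made so that expectations can be enumerated exactly), computes $E\,k^2(x,x')$ exactly, and compares it with $1 - 4d\sigma^2/\gamma^2 + (4d\mu_4 + 12d\sigma^4 + 8d(d-1)\sigma^4)/\gamma^4$. The gap should shrink like $1/\gamma^6$.

```python
import math
from itertools import product

def central_moments(values, probs):
    """Return (sigma^2, mu_3, mu_4) of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    moment = lambda r: sum((v - mean) ** r * p for v, p in zip(values, probs))
    return moment(2), moment(3), moment(4)

def exact_term1(values, probs, d, gamma):
    """E k^2(x, x') for d i.i.d. coordinates, enumerating one coordinate pair."""
    atoms = list(zip(values, probs))
    per_coord = sum(pa * pb * math.exp(-2.0 * (a - b) ** 2 / gamma ** 2)
                    for (a, pa), (b, pb) in product(atoms, repeat=2))
    return per_coord ** d  # coordinates are independent and identically distributed

def approx_term1(values, probs, d, gamma):
    """Second-order expansion from Section D.1."""
    s2, _, m4 = central_moments(values, probs)
    return (1 - 4 * d * s2 / gamma ** 2
            + (4 * d * m4 + 12 * d * s2 ** 2 + 8 * d * (d - 1) * s2 ** 2) / gamma ** 4)
```

For example, with `values = [-1, 0, 2]`, `probs = [0.5, 0.25, 0.25]`, `d = 2` and `gamma = 20`, the two functions should agree to within a few multiples of $10^{-5}$, well below the size of the $1/\gamma^4$ term itself.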
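D.4 Term 4: $E_{x,x'\sim P,\,y,y'\sim Q}\,k(x,y')k(x',y)$

$$
\begin{aligned}
E\,k(x,y')k(x',y)
&= \int_{x,x'}\int_{y,y'} e^{-\frac{\|x-y'\|^2}{\gamma^2}}\, e^{-\frac{\|x'-y\|^2}{\gamma^2}}\, p(x)p(x')q(y)q(y')\,dx\,dx'\,dy\,dy' \\
&\approx \prod_i \left(1 - \frac{2\sigma^2 + \delta_i^2}{\gamma^2} + \frac{\mu_4 + 6\sigma^2\delta_i^2 + 3\sigma^4 + \delta_i^4/2}{\gamma^4}\right)^2 \\
&\approx \prod_i \left(1 - \frac{4\sigma^2 + 2\delta_i^2}{\gamma^2} + \frac{2\mu_4 + 16\sigma^2\delta_i^2 + 10\sigma^4 + 2\delta_i^4}{\gamma^4}\right) \\
&\approx 1 - \frac{4d\sigma^2 + 2\|\delta\|^2}{\gamma^2} + \frac{2d\mu_4 + 16\sigma^2\|\delta\|^2 + 10d\sigma^4 + 2\|\delta\|_4^4}{\gamma^4} \\
&\qquad + \frac{8d(d-1)\sigma^4 + 8(d-1)\sigma^2\|\delta\|^2 + 2\left(\|\delta\|^4 - \|\delta\|_4^4\right)}{\gamma^4}.
\end{aligned}
$$

The per-coordinate expansion follows from our calculations in Section F.1 of the Appendix.

D.5 Term 5: $E_{x,x'\sim P,\,y\sim Q}\,k(x,x')k(x,y)$

$$
\begin{aligned}
E\,k(x,x')k(x,y)
&= \int_{x,x'}\int_{y} e^{-\frac{\|x-x'\|^2}{\gamma^2}}\, e^{-\frac{\|x-y\|^2}{\gamma^2}}\, p(x)p(x')q(y)\,dx\,dx'\,dy \\
&= \prod_i \int e^{-\frac{(x_i-x_i')^2}{\gamma^2}}\, e^{-\frac{(x_i-y_i)^2}{\gamma^2}}\, p_i(x_i)p_i(x_i')q_i(y_i)\,dx_i\,dx_i'\,dy_i \\
&\approx \prod_i \left(1 - \frac{4\sigma^2 + \delta_i^2}{\gamma^2} + \frac{3\mu_4 + 8\sigma^2\delta_i^2 + 9\sigma^4 + \delta_i^4/2 + 2\mu_3\delta_i}{\gamma^4}\right) \\
&\approx 1 - \frac{4d\sigma^2 + \|\delta\|^2}{\gamma^2} + \frac{3d\mu_4 + 8\sigma^2\|\delta\|^2 + 9d\sigma^4 + \|\delta\|_4^4/2 + 2\mu_3\sum_i\delta_i}{\gamma^4} \\
&\qquad + \frac{8d(d-1)\sigma^4 + 4(d-1)\sigma^2\|\delta\|^2 + \left(\|\delta\|^4 - \|\delta\|_4^4\right)/2}{\gamma^4}.
\end{aligned}
$$

The per-coordinate expansion follows from our calculations in Section F.2 of the Appendix. Combining all the terms above, we get the following bound on the variance.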
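D.6 The bound on $E_{z,z'}[h^2(z,z')]$

$$
\begin{aligned}
E_{z,z'}[h^2(z,z')] ={}& E\,k^2(x,x') + E\,k^2(y,y') + 2E\,k^2(x,y) + 2E\,k(x,x')k(y,y') + 2E\,k(x,y')k(x',y) \\
&- 4E\,k(x,x')k(x,y) - 4E\,k(x,y)k(y,y') \\
\approx{}& \left[1 - \frac{4d\sigma^2}{\gamma^2} + \frac{4d\mu_4 + 12d\sigma^4 + 8d(d-1)\sigma^4}{\gamma^4}\right] \\
&+ \left[1 - \frac{4d\sigma^2}{\gamma^2} + \frac{4d\mu_4 + 12d\sigma^4 + 8d(d-1)\sigma^4}{\gamma^4}\right] \\
&+ 2\left[1 - \frac{4d\sigma^2 + 2\|\delta\|^2}{\gamma^2} + \frac{4d\mu_4 + 24\sigma^2\|\delta\|^2 + 12d\sigma^4 + 8d(d-1)\sigma^4 + 8(d-1)\sigma^2\|\delta\|^2 + 2\|\delta\|^4}{\gamma^4}\right] \\
&+ 2\left[1 - \frac{4d\sigma^2}{\gamma^2} + \frac{2d\mu_4 + 10d\sigma^4 + 8d(d-1)\sigma^4}{\gamma^4}\right] \\
&+ 2\left[1 - \frac{4d\sigma^2 + 2\|\delta\|^2}{\gamma^2} + \frac{2d\mu_4 + 16\sigma^2\|\delta\|^2 + 10d\sigma^4 + 8d(d-1)\sigma^4 + 8(d-1)\sigma^2\|\delta\|^2 + 2\|\delta\|^4}{\gamma^4}\right] \\
&- 4\left[1 - \frac{4d\sigma^2 + \|\delta\|^2}{\gamma^2} + \frac{3d\mu_4 + 8\sigma^2\|\delta\|^2 + 9d\sigma^4 + 2\mu_3\sum_i\delta_i + 8d(d-1)\sigma^4 + 4(d-1)\sigma^2\|\delta\|^2 + \|\delta\|^4/2}{\gamma^4}\right] \\
&- 4\left[1 - \frac{4d\sigma^2 + \|\delta\|^2}{\gamma^2} + \frac{3d\mu_4 + 8\sigma^2\|\delta\|^2 + 9d\sigma^4 - 2\mu_3\sum_i\delta_i + 8d(d-1)\sigma^4 + 4(d-1)\sigma^2\|\delta\|^2 + \|\delta\|^4/2}{\gamma^4}\right] \\
={}& \frac{4\|\delta\|^4}{\gamma^4} + \frac{16d\sigma^4 + 16\sigma^2\|\delta\|^2}{\gamma^4}.
\end{aligned}
$$

Finally, using the bound derived above on $E_{z,z'}[h^2(z,z')]$ and $\mathrm{MMD}^4 \approx 4\|\delta\|^4/\gamma^4$, the bound on the variance is

$$
\mathrm{Var}_{z,z'}\big(h(z,z')\big) = E_{z,z'}[h^2(z,z')] - \big(E_{z,z'}[h(z,z')]\big)^2 \approx \frac{16d\sigma^4 + 16\sigma^2\|\delta\|^2}{\gamma^4}.
$$

E Proof of Lemma 3

E.1 Upper bound on $\tau_4$

We derive the upper bound on $\tau_4$ in this section. An upper bound on $E_{z,z'}\big[(h(z,z') - E_{z,z'}[h(z,z')])^4\big]$ can be obtained in the following manner. First note that

$$
\begin{aligned}
E_{z,z'}\big[(h(z,z') - E_{z,z'}[h(z,z')])^4\big]
&= E[h^4(z,z')] - 3\,\mathrm{MMD}^8 - 4E[h^3(z,z')]\,\mathrm{MMD}^2 + 6E[h^2(z,z')]\,\mathrm{MMD}^4 \\
&\approx 16\kappa_4 - \frac{48\|\delta\|^8}{\gamma^8} - \frac{64\kappa_3\|\delta\|^2}{\gamma^2} + \frac{96\|\delta\|^8}{\gamma^8} + \frac{384d\sigma^4\|\delta\|^4}{\gamma^8} + \frac{384\sigma^2\|\delta\|^6}{\gamma^8},
\end{aligned}
$$

where $16\kappa_4$ and $8\kappa_3$ denote the leading-order approximations of $E[h^4(z,z')]$ and $E[h^3(z,z')]$ respectively.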
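The cancellations behind the variance bound of Section D.6 are easy to get wrong by hand, so a quick exact-arithmetic check can help. The sketch below is only a sanity check of the algebra, not part of the proof: it assumes the seven leading-order expansions of Terms 1 to 5 (with the $\|\delta\|_4^4$ contributions already merged into $\|\delta\|^4$), evaluates their signed sum with rational numbers, and lets one compare against $(16d\sigma^4 + 16\sigma^2\|\delta\|^2 + 4\|\delta\|^4)/\gamma^4$. The function name and argument names are hypothetical.

```python
from fractions import Fraction as F

def second_moment_of_h(d, sigma2, mu3, mu4, delta, gamma):
    """Leading-order E[h^2]: signed sum of the seven expansions of Section D.6."""
    sigma2, mu3, mu4, gamma = F(sigma2), F(mu3), F(mu4), F(gamma)
    delta = [F(x) for x in delta]
    n2 = sum(x * x for x in delta)          # ||delta||^2
    s1 = sum(delta)                         # sum_i delta_i
    g2, g4 = gamma ** 2, gamma ** 4
    cross = 8 * d * (d - 1) * sigma2 ** 2   # 8 d (d-1) sigma^4, shared by all terms

    t1 = 1 - 4 * d * sigma2 / g2 + (4 * d * mu4 + 12 * d * sigma2 ** 2 + cross) / g4
    t2 = (1 - (4 * d * sigma2 + 2 * n2) / g2
          + (4 * d * mu4 + 24 * sigma2 * n2 + 12 * d * sigma2 ** 2 + cross
             + 8 * (d - 1) * sigma2 * n2 + 2 * n2 ** 2) / g4)
    t3 = 1 - 4 * d * sigma2 / g2 + (2 * d * mu4 + 10 * d * sigma2 ** 2 + cross) / g4
    t4 = (1 - (4 * d * sigma2 + 2 * n2) / g2
          + (2 * d * mu4 + 16 * sigma2 * n2 + 10 * d * sigma2 ** 2 + cross
             + 8 * (d - 1) * sigma2 * n2 + 2 * n2 ** 2) / g4)

    def t5(sign):
        # sign = +1 when x is the shared argument, -1 when y is shared
        return (1 - (4 * d * sigma2 + n2) / g2
                + (3 * d * mu4 + 8 * sigma2 * n2 + 9 * d * sigma2 ** 2
                   + sign * 2 * mu3 * s1 + cross
                   + 4 * (d - 1) * sigma2 * n2 + n2 ** 2 / 2) / g4)

    # E k^2(x,x') and E k^2(y,y') share the expansion t1, hence the factor 2.
    return 2 * t1 + 2 * t2 + 2 * t3 + 2 * t4 - 4 * t5(+1) - 4 * t5(-1)
```

Subtracting $\mathrm{MMD}^4 \approx (2\|\delta\|^2/\gamma^2)^2$ from the returned value should leave exactly $(16d\sigma^4 + 16\sigma^2\|\delta\|^2)/\gamma^4$, and multiplying the returned value by $6\,\mathrm{MMD}^4$ reproduces the coefficients 96, 384 and 384 that appear in the fourth-moment expansion.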
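Calculations for $\kappa_4$

We now calculate an upper bound to $E_{z,z'}[h^4(z,z')]$ in the following manner. With slight abuse of notation, we use $x_i$ to denote the $i$-th coordinate of $x$. We first note that

$$
\begin{aligned}
E_{z,z'}[h^4(z,z')] &= E_{z,z'}\big[k(x,x') + k(y,y') - k(x,y') - k(x',y)\big]^4 \\
&\approx E_{z,z'}\left[\frac{\|x-x'\|^2 + \|y-y'\|^2 - \|x-y'\|^2 - \|x'-y\|^2}{\gamma^2}\right]^4 \\
&= \frac{16}{\gamma^8}\, E_{z,z'}\Big[\sum_{j=1}^{d} (x_j-y_j)(x_j'-y_j')\Big]^4 \\
&= \frac{16}{\gamma^8} \sum_{k_1+\dots+k_d=4} \binom{4}{k_1,\dots,k_d} \prod_{i=1}^{d} E_{z,z'}\big[(x_i-y_i)^{k_i}(x_i'-y_i')^{k_i}\big] \\
&= \frac{16}{\gamma^8} \sum_{k_1+\dots+k_d=4} \binom{4}{k_1,\dots,k_d} \prod_{i=1}^{d} \Big(E_z\big[(x_i-y_i)^{k_i}\big]\Big)^2,
\end{aligned}
$$

so that $E_{z,z'}[h^4(z,z')] \approx 16\kappa_4$. The above summation splits into five different sums, based on the different ways to write $k_1+\dots+k_d = 4$. We derive these terms using the calculations in Section F.1 and Section F.2, as well as some terms from the variance calculations in Section D, and explain in brackets which way to sum the $k_i$'s to 4 was used.

$$
\begin{aligned}
\kappa_4 ={}& \frac{1}{\gamma^8}\sum_i \big[2\mu_4 + 12\sigma^2\delta_i^2 + 6\sigma^4 + \delta_i^4\big]^2 && \text{(using } 4,0,0,\dots\text{)} \\
&+ \frac{4}{\gamma^8}\sum_{i\neq j} \big[\delta_i^3 + 6\sigma^2\delta_i\big]^2\,\delta_j^2 && \text{(using } 3,1,0,0,\dots\text{)} \\
&+ \frac{3}{\gamma^8}\sum_{i\neq j} \big(4\sigma^4 + \delta_i^4 + 4\sigma^2\delta_i^2\big)\big(4\sigma^4 + \delta_j^4 + 4\sigma^2\delta_j^2\big) && \text{(using } 2,2,0,0,\dots\text{)} \\
&+ \frac{6}{\gamma^8}\sum_{i\neq j\neq k} \big(4\sigma^4 + \delta_i^4 + 4\sigma^2\delta_i^2\big)\,\delta_j^2\delta_k^2 && \text{(using } 2,1,1,0,0,\dots\text{)} \\
&+ \frac{1}{\gamma^8}\sum_{i\neq j\neq k\neq l} \delta_i^2\delta_j^2\delta_k^2\delta_l^2 && \text{(using } 1,1,1,1,0,0,\dots\text{)}
\end{aligned}
$$

Expanding each of the above terms further, we get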
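$$
\begin{aligned}
\text{Term 1:}\quad & \frac{1}{\gamma^8}\Big[4d\mu_4^2 + 156\sigma^4\sum_i\delta_i^4 + 36\sigma^8 d + \sum_i\delta_i^8 + 48\mu_4\sigma^2\|\delta\|^2 + 24d\mu_4\sigma^4 + 4\mu_4\sum_i\delta_i^4 + 144\sigma^6\|\delta\|^2 + 24\sigma^2\sum_i\delta_i^6\Big] \\
\text{Term 2:}\quad & \frac{4}{\gamma^8}\Big[\sum_{i\neq j}\delta_i^6\delta_j^2 + 36\sigma^4\sum_{i\neq j}\delta_i^2\delta_j^2 + 12\sigma^2\sum_{i\neq j}\delta_i^4\delta_j^2\Big] \\
\text{Term 3:}\quad & \frac{3}{\gamma^8}\Big[8d(d-1)\sigma^8 + \sum_{i\neq j}\delta_i^4\delta_j^4 + 8\sigma^4(d-1)\sum_i\delta_i^4 + 32\sigma^6\|\delta\|^2(d-1) + 8\sigma^2\sum_{i\neq j}\delta_i^4\delta_j^2 + 16\sigma^4\sum_{i\neq j}\delta_i^2\delta_j^2\Big] \\
\text{Term 4:}\quad & \frac{6}{\gamma^8}\Big[4\sigma^4(d-2)\sum_{i\neq j}\delta_i^2\delta_j^2 + \sum_{i\neq j\neq k}\delta_i^4\delta_j^2\delta_k^2 + 4\sigma^2\sum_{i\neq j\neq k}\delta_i^2\delta_j^2\delta_k^2\Big] \\
\text{Term 5:}\quad & \frac{1}{\gamma^8}\Big[\sum_{i\neq j\neq k\neq l}\delta_i^2\delta_j^2\delta_k^2\delta_l^2\Big]
\end{aligned}
$$

Calculations for $\kappa_3$

Similar to the multinomial expansion for $\kappa_4$, we have

$$
\begin{aligned}
\kappa_3 ={}& \frac{1}{\gamma^6}\sum_i \big(\delta_i^6 + 36\sigma^4\delta_i^2 + 12\sigma^2\delta_i^4\big) && \text{(using } 3,0,0,0,\dots\text{)} \\
&+ \frac{3}{\gamma^6}\sum_{i\neq j} \big(4\sigma^4 + \delta_i^4 + 4\sigma^2\delta_i^2\big)\,\delta_j^2 && \text{(using } 2,1,0,0,\dots\text{)} \\
&+ \frac{1}{\gamma^6}\sum_{i\neq j\neq k} \delta_i^2\delta_j^2\delta_k^2 && \text{(using } 1,1,1,0,0,\dots\text{)}
\end{aligned}
$$

Using the above expansion, we get

$$
\begin{aligned}
\frac{\kappa_3\|\delta\|^2}{\gamma^2} ={}& \frac{1}{\gamma^8}\sum_{i\neq j}\big(\delta_i^6\delta_j^2 + 36\sigma^4\delta_i^2\delta_j^2 + 12\sigma^2\delta_i^4\delta_j^2\big) + \frac{1}{\gamma^8}\sum_i\big(\delta_i^8 + 36\sigma^4\delta_i^4 + 12\sigma^2\delta_i^6\big) \\
&+ \frac{3}{\gamma^8}\sum_{i\neq j\neq k}\big(4\sigma^4\delta_j^2\delta_k^2 + \delta_i^4\delta_j^2\delta_k^2 + 4\sigma^2\delta_i^2\delta_j^2\delta_k^2\big) \\
&+ \frac{3}{\gamma^8}\sum_{i\neq j}\big(4\sigma^4\delta_i^2\delta_j^2 + \delta_i^6\delta_j^2 + 4\sigma^2\delta_i^4\delta_j^2\big) \\
&+ \frac{3}{\gamma^8}\sum_{i\neq j}\big(4\sigma^4\delta_j^4 + \delta_i^4\delta_j^4 + 4\sigma^2\delta_i^2\delta_j^4\big) \\
&+ \frac{1}{\gamma^8}\sum_{i\neq j\neq k\neq l}\delta_i^2\delta_j^2\delta_k^2\delta_l^2 + \frac{3}{\gamma^8}\sum_{i\neq j\neq k}\delta_i^4\delta_j^2\delta_k^2.
\end{aligned}
$$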
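Both the $\kappa_4$ and $\kappa_3$ computations rest on the same multinomial identity: for independent coordinates and i.i.d. copies $c, c'$, $E[(\sum_j c_j c_j')^m] = \sum_{k_1+\dots+k_d=m} \binom{m}{k_1,\dots,k_d}\prod_i (E[c_i^{k_i}])^2$. The following sketch checks that identity by brute-force enumeration on small discrete distributions. The distributions are hypothetical test data, with $c_j$ standing in for $x_j - y_j$.

```python
from fractions import Fraction as F
from itertools import product
from math import factorial

def raw_moment(dist, r):
    """E[c^r] for a discrete distribution given as [(value, prob), ...]."""
    return sum(p * v ** r for v, p in dist)

def lhs_by_enumeration(dists, power):
    """E[(sum_j c_j c'_j)^power] with c, c' i.i.d. and independent coordinates."""
    total = F(0)
    for c in product(*dists):
        for cp in product(*dists):
            prob = F(1)
            for _, p in c + cp:
                prob *= p
            inner = sum(v * vp for (v, _), (vp, _) in zip(c, cp))
            total += prob * inner ** power
    return total

def compositions(n, d):
    """All tuples (k_1, ..., k_d) of non-negative integers summing to n."""
    if d == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, d - 1):
            yield (first,) + rest

def rhs_multinomial(dists, power):
    """Sum over compositions of multinomial coefficient times squared moments."""
    total = F(0)
    for ks in compositions(power, len(dists)):
        coef = factorial(power)
        for k in ks:
            coef //= factorial(k)
        term = F(coef)
        for dist, k in zip(dists, ks):
            term *= raw_moment(dist, k) ** 2
        total += term
    return total
```

Grouping the compositions by their pattern, for instance (4,0,0,...), (3,1,0,...), (2,2,0,...), (2,1,1,0,...) and (1,1,1,1,0,...) for the fourth power, gives the five sums listed for $\kappa_4$.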
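Also note the following expansions of $\|\delta\|^8$ and $\|\delta\|^6$.

$$
\begin{aligned}
\frac{\|\delta\|^8}{\gamma^8} &= \frac{1}{\gamma^8}\Big[\sum_i\delta_i^8 + 4\sum_{i\neq j}\delta_i^6\delta_j^2 + 3\sum_{i\neq j}\delta_i^4\delta_j^4 + 6\sum_{i\neq j\neq k}\delta_i^4\delta_j^2\delta_k^2 + \sum_{i\neq j\neq k\neq l}\delta_i^2\delta_j^2\delta_k^2\delta_l^2\Big] \\
\frac{\|\delta\|^6}{\gamma^6} &= \frac{1}{\gamma^6}\Big[\sum_i\delta_i^6 + 3\sum_{i\neq j}\delta_i^4\delta_j^2 + \sum_{i\neq j\neq k}\delta_i^2\delta_j^2\delta_k^2\Big]
\end{aligned}
$$

Putting all terms together

Using the above calculations for $\kappa_3$ and $\kappa_4$, we obtain the following bound on $E_{z,z'}\big[(h(z,z') - E_{z,z'}[h(z,z')])^4\big]$.

$$
\begin{aligned}
E_{z,z'}\big[(h(z,z') &- E_{z,z'}[h(z,z')])^4\big] \\
={}& E[h^4(z,z')] - 3\,\mathrm{MMD}^8 - 4E[h^3(z,z')]\,\mathrm{MMD}^2 + 6E[h^2(z,z')]\,\mathrm{MMD}^4 \\
\approx{}& 16\kappa_4 - \frac{48\|\delta\|^8}{\gamma^8} - \frac{64\kappa_3\|\delta\|^2}{\gamma^2} + \frac{96\|\delta\|^8}{\gamma^8} + \frac{384d\sigma^4\|\delta\|^4}{\gamma^8} + \frac{384\sigma^2\|\delta\|^6}{\gamma^8} \\
={}& \frac{16}{\gamma^8}\Big[4d\mu_4^2 + 36\sigma^8 d + 24d\mu_4\sigma^4 + 24d(d-1)\sigma^8 + \big(96d\sigma^6 + 48\sigma^6 + 48\mu_4\sigma^2\big)\sum_i\delta_i^2 \\
&\qquad + \big(132\sigma^4 + 4\mu_4\big)\sum_i\delta_i^4 + 144\sigma^4\sum_{i\neq j}\delta_i^2\delta_j^2\Big] - \frac{64}{\gamma^8}\Big[24\sigma^4\sum_i\delta_i^4 + 24\sigma^4\sum_{i\neq j}\delta_i^2\delta_j^2\Big] \\
={}& \frac{1}{\gamma^8}\Big[64d\mu_4^2 + 576\sigma^8 d + 384d\mu_4\sigma^4 + 384d(d-1)\sigma^8 + \big(1536d\sigma^6 + 768\sigma^6 + 768\mu_4\sigma^2\big)\sum_i\delta_i^2 \\
&\qquad + \big(576\sigma^4 + 64\mu_4\big)\sum_i\delta_i^4 + 768\sigma^4\sum_{i\neq j}\delta_i^2\delta_j^2\Big],
\end{aligned}
$$

where we substituted $\kappa_4, \kappa_3$ in the third equation, and the $\delta^6$ and $\delta^8$ terms perfectly cancel out.

F Helpful Calculations for Lemma 1, 2, 3, 4

F.1 Double Integrals

Here $f$ and $g$ denote one-dimensional densities with means $\mu_f$ and $\mu_g$ and common central moments $\sigma^2, \mu_3, \mu_4$, and $\delta = \mu_f - \mu_g$.

$$
\begin{aligned}
\int\!\!\int e^{-\frac{(u-v)^2}{\gamma^2}} f(u)g(v)\,du\,dv
&\approx \int\!\!\int\left[1 - \frac{(u-v)^2}{\gamma^2} + \frac{(u-v)^4}{2\gamma^4}\right] f(u)g(v)\,du\,dv \\
&= 1 - \frac{2\sigma^2 + \delta^2}{\gamma^2} + \frac{\mu_4 + 6\sigma^2\delta^2 + 3\sigma^4 + \delta^4/2}{\gamma^4}
\end{aligned}
$$

because

$$
\int\!\!\int (u-v)^2 f(u)g(v)\,du\,dv = \int\!\!\int (u-\mu_f+\mu_f-v)^2 f(u)g(v)\,du\,dv = \int\big[\sigma^2 + (\mu_f-v)^2\big]g(v)\,dv = 2\sigma^2 + \delta^2
$$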
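and

$$
\begin{aligned}
\int\!\!\int (u-v)^4 f(u)g(v)\,du\,dv
&= \int\!\!\int (u-\mu_f+\mu_f-v)^4 f(u)g(v)\,du\,dv \\
&= \int\!\!\int \big[(u-\mu_f)^4 + (\mu_f-v)^4 + 4(u-\mu_f)^3(\mu_f-v) + 4(u-\mu_f)(\mu_f-v)^3 + 6(u-\mu_f)^2(\mu_f-v)^2\big] f(u)g(v)\,du\,dv \\
&= \int \big[\mu_4 + (\mu_f-v)^4 + 4\mu_3(\mu_f-v) + 6\sigma^2(\mu_f-v)^2\big] g(v)\,dv \\
&= \mu_4 + \big[\mu_4 - 4\mu_3\delta + 6\sigma^2\delta^2 + \delta^4\big] + \big[4\mu_3\delta\big] + 6\sigma^2\big[\sigma^2 + \delta^2\big] \\
&= 2\mu_4 + 12\sigma^2\delta^2 + 6\sigma^4 + \delta^4,
\end{aligned}
$$

where $\delta = \mu_f - \mu_g$. Finally, we have

$$
\begin{aligned}
\int\!\!\int (u-v)^3 f(u)g(v)\,du\,dv
&= \int\!\!\int (u-\mu_f+\mu_f-v)^3 f(u)g(v)\,du\,dv \\
&= \int \big[\mu_3 + 3\sigma^2(\mu_f-v) + (\mu_f-v)^3\big] g(v)\,dv \\
&= \mu_3 + 3\sigma^2\delta + \big[\delta^3 + 3\sigma^2\delta - \mu_3\big] = \delta^3 + 6\sigma^2\delta.
\end{aligned}
$$

F.2 Triple Integral

With $u \sim f$ and $v, y \sim g$ (so that $v$ is the shared argument), we have

$$
\begin{aligned}
\int\!\!\int\!\!\int & e^{-\frac{(u-v)^2}{\gamma^2}}\, e^{-\frac{(v-y)^2}{\gamma^2}}\, f(u)g(v)g(y)\,du\,dv\,dy \\
&\approx \int\!\!\int\!\!\int \left[1 - \frac{(u-v)^2 + (v-y)^2}{\gamma^2} + \frac{\big((u-v)^2 + (v-y)^2\big)^2}{2\gamma^4}\right] f(u)g(v)g(y)\,du\,dv\,dy \\
&= 1 - \frac{(2\sigma^2 + \delta^2) + 2\sigma^2}{\gamma^2} + \frac{1}{2\gamma^4}\int\!\!\int\!\!\int \big[(u-v)^4 + (v-y)^4 + 2(u-v)^2(v-y)^2\big] f(u)g(v)g(y)\,du\,dv\,dy \\
&= 1 - \frac{4\sigma^2 + \delta^2}{\gamma^2} + \frac{1}{2\gamma^4}\Big(\big[2\mu_4 + 12\sigma^2\delta^2 + 6\sigma^4 + \delta^4\big] + \big[2\mu_4 + 6\sigma^4\big] + 2\big[3\sigma^4 + 2\sigma^2\delta^2 + \mu_4 - 2\mu_3\delta\big]\Big) \\
&= 1 - \frac{4\sigma^2 + \delta^2}{\gamma^2} + \frac{3\mu_4 + 8\sigma^2\delta^2 + 9\sigma^4 + \delta^4/2 - 2\mu_3\delta}{\gamma^4}.
\end{aligned}
$$

When the shared argument is drawn from $f$ instead of $g$, the sign of the $\mu_3\delta$ term flips. The value of the cross term is obtained from the following:

$$
\begin{aligned}
\int\!\!\int\!\!\int (u-v)^2(v-y)^2 f(u)g(v)g(y)\,du\,dv\,dy
&= \int \big[\sigma^2 + (v-\mu_f)^2\big]\big[\sigma^2 + (v-\mu_g)^2\big] g(v)\,dv \\
&= \sigma^4 + \sigma^2\big[\sigma^2 + (\mu_g-\mu_f)^2\big] + \sigma^4 + \int (v-\mu_f)^2(v-\mu_g)^2 g(v)\,dv \\
&= 3\sigma^4 + 2\sigma^2\delta^2 + \mu_4 - 2\mu_3\delta.
\end{aligned}
$$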
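The moment identities of Sections F.1 and F.2 can be verified exactly on a toy example. The sketch below is a check under stated assumptions rather than a general proof: $f$ and $g$ are taken to be shifts of one hypothetical centered three-point law (so they share $\sigma^2$, $\mu_3$, $\mu_4$), $\delta = \mu_f - \mu_g$, and in the triple integral the shared variable is drawn from $g$. All names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical centered three-point law: mean 0, sigma^2 = 3/2, mu_3 = 3/2, mu_4 = 9/2.
BASE = [(F(-1), F(1, 2)), (F(0), F(1, 4)), (F(2), F(1, 4))]
SIGMA2, MU3, MU4 = F(3, 2), F(3, 2), F(9, 2)

def shifted(mean):
    """The base law translated to have the given mean."""
    return [(v + mean, p) for v, p in BASE]

def expect(fn, *dists):
    """Exact expectation of fn over independent draws from the given laws."""
    total = F(0)
    for combo in product(*dists):
        prob = F(1)
        for _, p in combo:
            prob *= p
        total += prob * fn(*(v for v, _ in combo))
    return total

def check_identities(mu_f, mu_g):
    """Return {name: (enumerated value, closed form)} for the F.1 and F.2 moments."""
    f, g = shifted(F(mu_f)), shifted(F(mu_g))
    delta = F(mu_f) - F(mu_g)
    return {
        "second": (expect(lambda u, v: (u - v) ** 2, f, g),
                   2 * SIGMA2 + delta ** 2),
        "third": (expect(lambda u, v: (u - v) ** 3, f, g),
                  6 * SIGMA2 * delta + delta ** 3),
        "fourth": (expect(lambda u, v: (u - v) ** 4, f, g),
                   2 * MU4 + 12 * SIGMA2 * delta ** 2 + 6 * SIGMA2 ** 2 + delta ** 4),
        "triple": (expect(lambda u, v, y: (u - v) ** 2 * (v - y) ** 2, f, g, g),
                   MU4 - 2 * MU3 * delta + 2 * SIGMA2 * delta ** 2 + 3 * SIGMA2 ** 2),
    }
```

Calling `check_identities(F(1, 3), F(-1, 2))` should return four pairs whose entries match exactly, since everything is computed with rational arithmetic.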
G Additional Experiments

[Figure 5: two panels plotting log(MMD) and log(Variance) against log(d), with curves for log γ / log d = 0.5, 0.75, 1 and for the median heuristic.]

Figure 5: A plot for MMD$^2$ and Variance of the linear statistic, for a fixed $n$ and fixed $\Psi$, for the Normal distribution with identity covariance, for bandwidths $d^{0.5}$, $d$, $d^{0.75}$. Note that these plots provide empirical verification for Lemma 1 and Lemma 2.
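For comparison with plots of this kind, the leading-order expressions derived in Appendices C and D predict straight lines on a log-log scale. The sketch below is illustrative only: it assumes $\gamma = d^{\alpha}$, a fixed $\|\delta\|^2$, and the approximations $\mathrm{MMD}^2 \approx 2\|\delta\|^2/\gamma^2$ and $\mathrm{Var}(h) \approx (16d\sigma^4 + 16\sigma^2\|\delta\|^2)/\gamma^4$. The function names are hypothetical.

```python
import math

def log_mmd2(d, delta_sq_norm, alpha):
    """log of 2 ||delta||^2 / gamma^2 with gamma = d ** alpha."""
    return math.log(2.0 * delta_sq_norm) - 2.0 * alpha * math.log(d)

def log_var_h(d, sigma2, delta_sq_norm, alpha):
    """log of (16 d sigma^4 + 16 sigma^2 ||delta||^2) / gamma^4 with gamma = d ** alpha."""
    return (math.log(16.0 * d * sigma2 ** 2 + 16.0 * sigma2 * delta_sq_norm)
            - 4.0 * alpha * math.log(d))

def slope(curve, d_lo, d_hi):
    """Secant slope of curve(d) against log d between d_lo and d_hi."""
    return (curve(d_hi) - curve(d_lo)) / (math.log(d_hi) - math.log(d_lo))
```

Under these assumptions the predicted slope of log MMD$^2$ against log $d$ is $-2\alpha$, and the predicted slope of the log variance is close to $1 - 4\alpha$ once $d\sigma^4$ dominates $\sigma^2\|\delta\|^2$. The variance of the linear statistic itself differs from $\mathrm{Var}(h)$ only by a factor depending on $n$, which shifts the curve without changing its slope.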