A Kernel Two-Sample Test


Journal of Machine Learning Research 13 (2012)
Submitted 4/08; Revised 11/11; Published 3/12

Arthur Gretton* (ARTHUR.GRETTON@GMAIL.COM)
MPI for Intelligent Systems, Spemannstrasse, Tübingen, Germany

Karsten M. Borgwardt† (KARSTEN.BORGWARDT@TUEBINGEN.MPG.DE)
Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Spemannstrasse, Tübingen, Germany

Malte J. Rasch‡ (MALTE@MAIL.BNU.EDU.CN)
19 XinJieKouWai St., State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, 100875, P.R. China

Bernhard Schölkopf (BERNHARD.SCHOELKOPF@TUEBINGEN.MPG.DE)
MPI for Intelligent Systems, Spemannstrasse, Tübingen, Germany

Alexander Smola§ (ALEX@SMOLA.ORG)
Yahoo! Research, 2821 Mission College Blvd, Santa Clara, CA 95054, USA

Editor: Nicolas Vayatis

Abstract

We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

*. Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK.
†. This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.
‡. This work was carried out while M.J.R. was with the Graz University of Technology.
§. Also at The Australian National University, Canberra, ACT 0200, Australia.

© 2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola.

GRETTON, BORGWARDT, RASCH, SCHÖLKOPF AND SMOLA

Keywords: kernel methods, two-sample test, uniform convergence bounds, schema matching, integral probability metric, hypothesis testing

1. Introduction

We address the problem of comparing samples from two probability distributions, by proposing statistical tests of the null hypothesis that these distributions are equal against the alternative hypothesis that these distributions are different (this is called the two-sample problem). Such tests have application in a variety of areas. In bioinformatics, it is of interest to compare microarray data from identical tissue types as measured by different laboratories, to detect whether the data may be analysed jointly, or whether differences in experimental procedure have caused systematic differences in the data distributions. Equally of interest are comparisons between microarray data from different tissue types, either to determine whether two subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and cancerous tissue. In database attribute matching, it is desirable to merge databases containing multiple fields, where it is not known in advance which fields correspond: the fields are matched by maximising the similarity in the distributions of their entries.

We test whether distributions $p$ and $q$ are different on the basis of samples drawn from each of them, by finding a well behaved (e.g., smooth) function which is large on the points drawn from $p$, and small (as negative as possible) on the points from $q$. We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. We call this test statistic the Maximum Mean Discrepancy (MMD).

Clearly the quality of the MMD as a statistic depends on the class $\mathcal{F}$ of smooth functions that define it. On one hand, $\mathcal{F}$ must be rich enough so that the population MMD vanishes if and only if $p = q$. On the other hand, for the test to be consistent in power, $\mathcal{F}$ needs to be restrictive enough for the empirical estimate of the MMD to converge quickly to its expectation as the sample size increases. We will use the unit balls in characteristic reproducing kernel Hilbert spaces (Fukumizu et al., 2008; Sriperumbudur et al., 2010b) as our function classes, since these will be shown to satisfy both of the foregoing properties. We also review classical metrics on distributions, namely the Kolmogorov-Smirnov and Earth-Mover's distances, which are based on different function classes; collectively these are known as integral probability metrics (Müller, 1997). On a more practical note, the MMD has a reasonable computational cost, when compared with other two-sample tests: given $m$ points sampled from $p$ and $n$ from $q$, the cost is $O((m+n)^2)$ time. We also propose a test statistic with a computational cost of $O(m+n)$: the associated test can achieve a given Type II error at a lower overall computational cost than the quadratic-cost test, by looking at a larger volume of data.

We define three nonparametric statistical tests based on the MMD. The first two tests are distribution-free, meaning they make no assumptions regarding $p$ and $q$, albeit at the expense of being conservative in detecting differences between the distributions. The third test is based on the asymptotic distribution of the MMD, and is in practice more sensitive to differences in distribution at small sample sizes. The present work synthesizes and expands on results of Gretton et al. (2007a,b) and Smola et al. (2007),¹ who in turn build on the earlier work of Borgwardt et al. (2006). Note that the latter addresses only the third kind of test, and that the approach of Gretton et al. (2007a,b) is rigorous in its treatment of the asymptotic distribution of the test statistic under the null hypothesis.

1. In particular, most of the proofs here were not provided by Gretton et al. (2007a), but in an accompanying technical report (Gretton et al., 2008a), which this document replaces.

We begin our presentation in Section 2 with a formal definition of the MMD. We review the notion of a characteristic RKHS, and establish that when $\mathcal{F}$ is a unit ball in a characteristic RKHS, then the population MMD is zero if and only if $p = q$. We further show that universal RKHSs in the sense of Steinwart (2001) are characteristic. In Section 3, we give an overview of hypothesis testing as it applies to the two-sample problem, and review alternative test statistics, including the $L_2$ distance between kernel density estimates (Anderson et al., 1994), which is the prior approach closest to our work. We present our first two hypothesis tests in Section 4, based on two different bounds on the deviation between the population and empirical MMD. We take a different approach in Section 5, where we use the asymptotic distribution of the empirical MMD estimate as the basis for a third test. When large volumes of data are available, the cost of computing the MMD (quadratic in the sample size) may be excessive: we therefore propose in Section 6 a modified version of the MMD statistic that has a linear cost in the number of samples, and an associated asymptotic test. In Section 7, we provide an overview of methods related to the MMD in the statistics and machine learning literature. We also review alternative function classes for which the MMD defines a metric on probability distributions. Finally, in Section 8, we demonstrate the performance of MMD-based two-sample tests on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage method. Our approach performs well on high dimensional data with low sample size; in addition, we are able to successfully distinguish distributions on graph data, for which ours is the first proposed test. A Matlab implementation of the tests is at gretton/mmd/mmd.htm.
2. The Maximum Mean Discrepancy

In this section, we present the maximum mean discrepancy (MMD), and describe conditions under which it is a metric on the space of probability distributions. The MMD is defined in terms of particular function spaces that witness the difference in distributions: we therefore begin in Section 2.1 by introducing the MMD for an arbitrary function space. In Section 2.2, we compute both the population MMD and two empirical estimates when the associated function space is a reproducing kernel Hilbert space, and in Section 2.3 we derive the RKHS function that witnesses the MMD for a given pair of distributions.

2.1 Definition of the Maximum Mean Discrepancy

Our goal is to formulate a statistical test that answers the following question:

Problem 1 Let $x$ and $y$ be random variables defined on a topological space $\mathcal{X}$, with respective Borel probability measures $p$ and $q$. Given observations $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_n\}$, independently and identically distributed (i.i.d.) from $p$ and $q$, respectively, can we decide whether $p \neq q$?

Where there is no ambiguity, we use the shorthand notation $\mathbf{E}_x[f(x)] := \mathbf{E}_{x \sim p}[f(x)]$ and $\mathbf{E}_y[f(y)] := \mathbf{E}_{y \sim q}[f(y)]$ to denote expectations with respect to $p$ and $q$, respectively, where $x \sim p$ indicates $x$ has distribution $p$. To start with, we wish to determine a criterion that, in the population setting, takes on a unique and distinctive value only when $p = q$. It will be defined based on Lemma 9.3.2 of Dudley (2002).

Lemma 1 Let $(\mathcal{X}, d)$ be a metric space, and let $p, q$ be two Borel probability measures defined on $\mathcal{X}$. Then $p = q$ if and only if $\mathbf{E}_x(f(x)) = \mathbf{E}_y(f(y))$ for all $f \in C(\mathcal{X})$, where $C(\mathcal{X})$ is the space of bounded continuous functions on $\mathcal{X}$.

Although $C(\mathcal{X})$ in principle allows us to identify $p = q$ uniquely, it is not practical to work with such a rich function class in the finite sample setting. We thus define a more general class of statistic, for as yet unspecified function classes $\mathcal{F}$, to measure the disparity between $p$ and $q$ (Fortet and Mourier, 1953; Müller, 1997).

Definition 2 Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$ and let $p, q, x, y, X, Y$ be defined as above. We define the maximum mean discrepancy (MMD) as

$$\mathrm{MMD}[\mathcal{F}, p, q] := \sup_{f \in \mathcal{F}} \left( \mathbf{E}_x[f(x)] - \mathbf{E}_y[f(y)] \right). \qquad (1)$$

In the statistics literature, this is known as an integral probability metric (Müller, 1997). A biased² empirical estimate of the MMD is obtained by replacing the population expectations with empirical expectations computed on the samples $X$ and $Y$,

$$\mathrm{MMD}_b[\mathcal{F}, X, Y] := \sup_{f \in \mathcal{F}} \left( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{i=1}^{n} f(y_i) \right). \qquad (2)$$

We must therefore identify a function class that is rich enough to uniquely identify whether $p = q$, yet restrictive enough to provide useful finite sample estimates (the latter property will be established in subsequent sections).

2. The empirical MMD defined below has an upward bias; we will define an unbiased statistic in the following section.

2.2 The MMD in Reproducing Kernel Hilbert Spaces

In the present section, we propose as our MMD function class $\mathcal{F}$ the unit ball in a reproducing kernel Hilbert space $\mathcal{H}$. We will provide finite sample estimates of this quantity (both biased and unbiased), and establish conditions under which the MMD can be used to distinguish between probability measures. Other possible function classes $\mathcal{F}$ are discussed in Sections 7.1 and 7.2.

We first review some properties of $\mathcal{H}$ (Schölkopf and Smola, 2002). Since $\mathcal{H}$ is an RKHS, the operator of evaluation $\delta_x$ mapping $f \in \mathcal{H}$ to $f(x) \in \mathbb{R}$ is continuous. Thus, by the Riesz representation theorem (Reed and Simon, 1980, Theorem II.4), there is a feature mapping $\phi(x)$, itself a function from $\mathcal{X}$ to $\mathbb{R}$ belonging to $\mathcal{H}$, such that $f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}$. This feature mapping takes the canonical form $\phi(x) = k(x, \cdot)$ (Steinwart and Christmann, 2008, Lemma 4.19), where $k(x_1, x_2) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite, and the notation $k(x, \cdot)$ indicates the kernel has one argument fixed at $x$, and the second free. Note in particular that $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y)$. We will generally use the more concise notation $\phi(x)$ for the feature mapping, although in some cases it will be clearer to write $k(x, \cdot)$.

We next extend the notion of feature map to the embedding of a probability distribution: we will define an element $\mu_p \in \mathcal{H}$ such that $\mathbf{E}_x f = \langle f, \mu_p \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$, which we call the mean embedding of $p$. Embeddings of probability measures into reproducing kernel Hilbert spaces are well established in the statistics literature: see Berlinet and Thomas-Agnan (2004, Chapter 4) for further detail and references. We begin by establishing conditions under which the mean embedding $\mu_p$ exists (Fukumizu et al., 2004, p. 93; Sriperumbudur et al., 2010b, Theorem 1).

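Lemma 3 If $k(\cdot, \cdot)$ is measurable and $\mathbf{E}_x \sqrt{k(x, x)} < \infty$ then $\mu_p \in \mathcal{H}$.

Proof The linear operator $T_p f := \mathbf{E}_x f$ for all $f \in \mathcal{F}$ is bounded under the assumption, since

$$|T_p f| = |\mathbf{E}_x f| \le \mathbf{E}_x |f| = \mathbf{E}_x \left| \langle f, \phi(x) \rangle_{\mathcal{H}} \right| \le \mathbf{E}_x \left( \sqrt{k(x, x)} \, \|f\|_{\mathcal{H}} \right).$$

Hence by the Riesz representer theorem, there exists a $\mu_p \in \mathcal{H}$ such that $T_p f = \langle f, \mu_p \rangle_{\mathcal{H}}$. If we set $f = \phi(t) = k(t, \cdot)$, we obtain $\mu_p(t) = \langle \mu_p, k(t, \cdot) \rangle_{\mathcal{H}} = \mathbf{E}_x k(t, x)$: in other words, the mean embedding of the distribution $p$ is the expectation under $p$ of the canonical feature map.

We next show that the MMD may be expressed as the distance in $\mathcal{H}$ between mean embeddings (Borgwardt et al., 2006).

Lemma 4 Assume the condition in Lemma 3 for the existence of the mean embeddings $\mu_p$, $\mu_q$ is satisfied. Then

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \|\mu_p - \mu_q\|_{\mathcal{H}}^2.$$

Proof

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \left[ \sup_{\|f\|_{\mathcal{H}} \le 1} \left( \mathbf{E}_x[f(x)] - \mathbf{E}_y[f(y)] \right) \right]^2 = \left[ \sup_{\|f\|_{\mathcal{H}} \le 1} \langle \mu_p - \mu_q, f \rangle_{\mathcal{H}} \right]^2 = \|\mu_p - \mu_q\|_{\mathcal{H}}^2.$$

We now establish a condition on the RKHS $\mathcal{H}$ under which the mean embedding $\mu_p$ is injective, which indicates that $\mathrm{MMD}[\mathcal{F}, p, q]$ is a metric³ on the Borel probability measures on $\mathcal{X}$ (in particular, $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ only when $p = q$). Evidently, this property will not hold for all $\mathcal{H}$: for instance, a polynomial RKHS of degree two cannot distinguish between distributions with the same mean and variance, but different kurtosis (Sriperumbudur et al., 2010b, Example 3). The MMD is a metric, however, when $\mathcal{H}$ is a universal RKHS, defined on a compact metric space $\mathcal{X}$. Universality requires that $k(\cdot, \cdot)$ be continuous, and $\mathcal{H}$ be dense in $C(\mathcal{X})$ with respect to the $L_\infty$ norm. Steinwart (2001) proves that the Gaussian and Laplace RKHSs are universal.

Theorem 5 Let $\mathcal{F}$ be a unit ball in a universal RKHS $\mathcal{H}$, defined on the compact metric space $\mathcal{X}$, with associated continuous kernel $k(\cdot, \cdot)$. Then $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ if and only if $p = q$.

Proof The proof follows Cortes et al. (2008, Supplementary Appendix), whose approach is clearer than the original proof of Gretton et al. (2008a, p. 4).⁴ First, it is clear that $p = q$ implies $\mathrm{MMD}[\mathcal{F}, p, q]$ is zero. We now prove the converse. By the universality of $\mathcal{H}$, for any given $\varepsilon > 0$ and $f \in C(\mathcal{X})$ there exists a $g \in \mathcal{H}$ such that

$$\|f - g\|_\infty \le \varepsilon.$$

We next make the expansion

$$|\mathbf{E}_x f(x) - \mathbf{E}_y f(y)| \le |\mathbf{E}_x f(x) - \mathbf{E}_x g(x)| + |\mathbf{E}_x g(x) - \mathbf{E}_y g(y)| + |\mathbf{E}_y g(y) - \mathbf{E}_y f(y)|.$$

The first and third terms satisfy

$$|\mathbf{E}_x f(x) - \mathbf{E}_x g(x)| \le \mathbf{E}_x |f(x) - g(x)| \le \varepsilon.$$

Next, write

$$\mathbf{E}_x g(x) - \mathbf{E}_y g(y) = \langle g, \mu_p - \mu_q \rangle_{\mathcal{H}} = 0,$$

since $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ implies $\mu_p = \mu_q$. Hence

$$|\mathbf{E}_x f(x) - \mathbf{E}_y f(y)| \le 2\varepsilon$$

for all $f \in C(\mathcal{X})$ and $\varepsilon > 0$, which implies $p = q$ by Lemma 1.

3. According to Dudley (2002, p. 26) a metric $d(x, y)$ satisfies the following four properties: symmetry, triangle inequality, $d(x, x) = 0$, and $d(x, y) = 0 \implies x = y$. A pseudo-metric only satisfies the first three properties.
4. Note that the proof of Cortes et al. (2008) requires an application of the dominated convergence theorem, rather than using the Riesz representation theorem to show the existence of the mean embeddings $\mu_p$ and $\mu_q$ as we did in Lemma 3.

While our result establishes that the mapping $\mu_p$ is injective for universal kernels on compact domains, this result can also be shown in more general cases. Fukumizu et al. (2008) introduce the notion of characteristic kernels, these being kernels for which the mean map is injective. Fukumizu et al. establish that Gaussian and Laplace kernels are characteristic on $\mathbb{R}^d$, and thus that the associated MMD is a metric on distributions for this domain. Sriperumbudur et al. (2008, 2010b) and Sriperumbudur et al. (2011a) further explore the properties of characteristic kernels, providing a simple condition to determine whether translation invariant kernels are characteristic, and investigating the relation between universal and characteristic kernels on non-compact domains.

Given we are in an RKHS, we may easily obtain an expression for the squared MMD, $\|\mu_p - \mu_q\|_{\mathcal{H}}^2$, in terms of kernel functions, and a corresponding unbiased finite sample estimate.

Lemma 6 Given $x$ and $x'$ independent random variables with distribution $p$, and $y$ and $y'$ independent random variables with distribution $q$, the squared population MMD is

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \mathbf{E}_{x, x'}[k(x, x')] - 2 \mathbf{E}_{x, y}[k(x, y)] + \mathbf{E}_{y, y'}[k(y, y')],$$

where $x'$ is an independent copy of $x$ with the same distribution, and $y'$ is an independent copy of $y$. An unbiased empirical estimate is a sum of two U-statistics and a sample average,

$$\mathrm{MMD}_u^2[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j). \qquad (3)$$

When $m = n$, a slightly simpler empirical estimate may be used. Let $Z := (z_1, \ldots, z_m)$ be $m$ i.i.d. random variables, where $z := (x, y) \sim p \times q$ (i.e., $x$ and $y$ are independent). An unbiased estimate of $\mathrm{MMD}^2$ is

$$\mathrm{MMD}_u^2[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i \neq j}^{m} h(z_i, z_j), \qquad (4)$$

which is a one-sample U-statistic with

$$h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i).$$

Proof Starting from the expression for $\mathrm{MMD}^2[\mathcal{F}, p, q]$ in Lemma 4,

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \|\mu_p - \mu_q\|_{\mathcal{H}}^2 = \langle \mu_p, \mu_p \rangle_{\mathcal{H}} + \langle \mu_q, \mu_q \rangle_{\mathcal{H}} - 2 \langle \mu_p, \mu_q \rangle_{\mathcal{H}} = \mathbf{E}_{x, x'} \langle \phi(x), \phi(x') \rangle_{\mathcal{H}} + \mathbf{E}_{y, y'} \langle \phi(y), \phi(y') \rangle_{\mathcal{H}} - 2 \mathbf{E}_{x, y} \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}.$$

The proof is completed by applying $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = k(x, x')$; the empirical estimates follow straightforwardly, by replacing the population expectations with their corresponding U-statistics and sample averages. This statistic is unbiased following Serfling (1980, Chapter 5).

Note that $\mathrm{MMD}_u^2$ may be negative, since it is an unbiased estimator of $(\mathrm{MMD}[\mathcal{F}, p, q])^2$. The only terms missing to ensure nonnegativity, however, are $h(z_i, z_i)$, which were removed to remove spurious correlations between observations. Consequently we have the bound

$$\mathrm{MMD}_u^2 + \frac{1}{m(m-1)} \sum_{i=1}^{m} \left( k(x_i, x_i) + k(y_i, y_i) - 2 k(x_i, y_i) \right) \ge 0.$$

Moreover, while the empirical statistic for $m = n$ is an unbiased estimate of $\mathrm{MMD}^2$, it does not have minimum variance, since we ignore the cross-terms $k(x_i, y_i)$, of which there are $O(m)$. From (3), however, we see the minimum variance estimate is almost identical (Serfling, 1980, Section 5.1.4).

The biased statistic in (2) may also be easily computed following the above reasoning. Substituting the empirical estimates $\mu_X := \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$ and $\mu_Y := \frac{1}{n} \sum_{i=1}^{n} \phi(y_i)$ of the feature space means based on respective samples $X$ and $Y$, we obtain

$$\mathrm{MMD}_b[\mathcal{F}, X, Y] = \left[ \frac{1}{m^2} \sum_{i, j = 1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i, j = 1}^{m, n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i, j = 1}^{n} k(y_i, y_j) \right]^{1/2}. \qquad (5)$$

Note that the U-statistics of (3) have been replaced by V-statistics. Intuitively we expect the empirical test statistic $\mathrm{MMD}[\mathcal{F}, X, Y]$, whether biased or unbiased, to be small if $p = q$, and large if the distributions are far apart. It costs $O((m+n)^2)$ time to compute both statistics.

2.3 Witness Function of the MMD for RKHSs

We define the witness function $f^*$ to be the RKHS function attaining the supremum in (1), and its empirical estimate $\hat{f}^*$ to be the function attaining the supremum in (2). From the reasoning in Lemma 4, it is clear that

$$f^*(t) \propto \langle \phi(t), \mu_p - \mu_q \rangle_{\mathcal{H}} = \mathbf{E}_x[k(x, t)] - \mathbf{E}_y[k(y, t)],$$

$$\hat{f}^*(t) \propto \langle \phi(t), \mu_X - \mu_Y \rangle_{\mathcal{H}} = \frac{1}{m} \sum_{i=1}^{m} k(x_i, t) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, t).$$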
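As a concrete illustration of Equations (3) and (5) and of the empirical witness function, the following is a minimal sketch in plain Python, not the authors' Matlab implementation. It assumes one-dimensional real-valued data and a Gaussian kernel; the function names, the bandwidth `sigma=0.5`, the sample sizes, and the random seed are all choices made here for illustration.

```python
import math
import random

def gaussian_kernel(x, y, sigma=0.5):
    """Gaussian RBF kernel on real numbers (assumed here); bounded by K = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, k=gaussian_kernel):
    """Unbiased estimate of MMD^2, Equation (3): diagonal terms are excluded."""
    m, n = len(X), len(Y)
    xx = sum(k(X[i], X[j]) for i in range(m) for j in range(m) if i != j)
    yy = sum(k(Y[i], Y[j]) for i in range(n) for j in range(n) if i != j)
    xy = sum(k(x, y) for x in X for y in Y)
    return xx / (m * (m - 1)) + yy / (n * (n - 1)) - 2.0 * xy / (m * n)

def mmd_biased(X, Y, k=gaussian_kernel):
    """Biased estimate of MMD (not squared), Equation (5): V-statistics."""
    m, n = len(X), len(Y)
    xx = sum(k(a, b) for a in X for b in X)
    yy = sum(k(a, b) for a in Y for b in Y)
    xy = sum(k(x, y) for x in X for y in Y)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(xx / m**2 - 2.0 * xy / (m * n) + yy / n**2, 0.0))

def witness(t, X, Y, k=gaussian_kernel):
    """Empirical witness function at t, up to a positive scaling constant."""
    return sum(k(x, t) for x in X) / len(X) - sum(k(y, t) for y in Y) / len(Y)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
X = [rng.gauss(0.0, 1.0) for _ in range(200)]
Y = [rng.gauss(0.0, 1.0) for _ in range(200)]        # same distribution as X
Y_shift = [rng.gauss(1.0, 1.0) for _ in range(200)]  # mean shifted by 1

print("MMD_u^2, same vs shifted:", mmd2_unbiased(X, Y), mmd2_unbiased(X, Y_shift))
print("MMD_b,   same vs shifted:", mmd_biased(X, Y), mmd_biased(X, Y_shift))
print("witness at t = -1 and t = 2:", witness(-1.0, X, Y_shift), witness(2.0, X, Y_shift))
```

Both estimators loop over all pairs of points, which matches the quadratic cost stated above. The unbiased estimate can come out slightly negative when the two samples share a distribution, whereas the biased estimate is nonnegative by construction.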
Here $\mu_X = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$, and $\mu_Y$ is defined by analogy. The result follows since the unit vector $v$ maximizing $\langle v, x \rangle_{\mathcal{H}}$ in a Hilbert space is $v = x / \|x\|_{\mathcal{H}}$.

We illustrate the behavior of MMD in Figure 1 using a one-dimensional example. The data $X$ and $Y$ were generated from distributions $p$ and $q$ with equal means and variances, with $p$ Gaussian and $q$ Laplacian. We chose $\mathcal{F}$ to be the unit ball in a Gaussian RKHS. The empirical estimate $\hat{f}^*$ of the function $f^*$ that witnesses the MMD (in other words, the function maximizing the mean discrepancy in (1)) is smooth, negative where the Laplace density exceeds the Gaussian density (at the center and tails), and positive where the Gaussian density is larger. The magnitude of $\hat{f}^*$ is a direct reflection of the amount by which one density exceeds the other, insofar as the smoothness constraint permits it.

[Figure 1 (probability densities and $\hat{f}^*(t)$ plotted against $t$; curves: $\hat{f}^*$, Gauss ($p$), Laplace ($q$)): Illustration of the function maximizing the mean discrepancy in the case where a Gaussian is being compared with a Laplace distribution. Both distributions have zero mean and unit variance. The function $\hat{f}^*$ that witnesses the MMD has been scaled for plotting purposes, and was computed empirically on the basis of $2 \times 10^4$ samples, using a Gaussian kernel with $\sigma = 0.5$.]

3. Background Material

We now present three background results. First, we introduce the terminology used in statistical hypothesis testing. Second, we demonstrate via an example that even for tests which have asymptotically no error, we cannot guarantee performance at any fixed sample size without making assumptions about the distributions. Third, we review some alternative statistics used in comparing distributions, and the associated two-sample tests (see also Section 7 for an overview of additional integral probability metrics).

3.1 Statistical Hypothesis Testing

Having described a metric on probability distributions (the MMD) based on distances between their Hilbert space embeddings, and empirical estimates (biased and unbiased) of this metric, we address the problem of determining whether the empirical MMD shows a statistically significant difference between distributions. To this end, we briefly describe the framework of statistical hypothesis testing as it applies in the present context, following Casella and Berger (2002, Chapter 8). Given i.i.d. samples $X \sim p$ of size $m$ and $Y \sim q$ of size $n$, the statistical test, $\mathcal{T}(X, Y) : \mathcal{X}^m \times \mathcal{X}^n \to \{0, 1\}$, is used to distinguish between the null hypothesis $H_0 : p = q$ and the alternative hypothesis $H_A : p \neq q$. This is achieved by comparing the test statistic⁵ $\mathrm{MMD}[\mathcal{F}, X, Y]$ with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population MMD indicates $p = q$). The acceptance region of the test is thus defined as the set of real numbers below the threshold. Since the test is based on finite samples, it is possible that an incorrect answer will be returned. A Type I error is made when $p = q$ is rejected based on the observed samples, despite the null hypothesis having generated the data. Conversely, a Type II error occurs when $p = q$ is accepted despite the underlying distributions being different. The level $\alpha$ of a test is an upper bound on the probability of a Type I error: this is a design parameter of the test which must be set in advance, and is used to determine the threshold to which we compare the test statistic (finding the test threshold for a given $\alpha$ is the topic of Sections 4 and 5). The power of a test against a particular member of the alternative class $H_A$ (i.e., a specific $(p, q)$ such that $p \neq q$) is the probability of correctly rejecting $p = q$ in this instance, that is, one minus the probability of a Type II error. A consistent test achieves a level $\alpha$, and a Type II error of zero, in the large sample limit. We will see that the tests proposed in this paper are consistent.

3.2 A Negative Result

Even if a test is consistent, it is not possible to distinguish distributions with high probability at a given, fixed sample size (i.e., to provide guarantees on the Type II error), without prior assumptions as to the nature of the difference between $p$ and $q$. This is true regardless of the two-sample test used. There are several ways to illustrate this, which each give insight into the kinds of differences that might be undetectable for a given number of samples. The following example⁶ is one such illustration.
Example 1 Assume we have a distribution $p$ from which we have drawn $m$ i.i.d. observations. We construct a distribution $q$ by drawing $m^2$ i.i.d. observations from $p$, and defining a discrete distribution over these $m^2$ instances with probability $m^{-2}$ each. It is easy to check that if we now draw $m$ observations from $q$, there is at least a $\binom{m^2}{m} \frac{m!}{m^{2m}} > 1 - e^{-1} > 0.63$ probability that we thereby obtain an $m$-sample from $p$. Hence no test will be able to distinguish samples from $p$ and $q$ in this case. We could make the probability of detection arbitrarily small by increasing the size of the sample from which we construct $q$.

3.3 Previous Work

We next give a brief overview of some earlier approaches to the two sample problem for multivariate data. Since our later experimental comparison is with respect to certain of these methods, we give abbreviated algorithm names in italics where appropriate: these should be used as a key to the tables in Section 8.

5. This may be biased or unbiased.
6. This is a variation of a construction for independence tests, which was suggested in a private communication by John Langford.

3.3.1 $L_2$ DISTANCE BETWEEN PARZEN WINDOW ESTIMATES

The prior work closest to the current approach is the Parzen window-based statistic of Anderson et al. (1994). We begin with a short overview of the Parzen window estimate and its properties (Silverman, 1986), before proceeding to a comparison with the RKHS approach. We assume a distribution $p$ on $\mathbb{R}^d$, which has an associated density function $f_p$. The Parzen window estimate of this density from an i.i.d. sample $X$ of size $m$ is

$$\hat{f}_p(x) = \frac{1}{m} \sum_{i=1}^{m} \kappa(x_i - x), \quad \text{where } \kappa \text{ satisfies } \int_{\mathcal{X}} \kappa(x) \, dx = 1 \text{ and } \kappa(x) \ge 0.$$

We may rescale $\kappa$ according to $\frac{1}{h_m^d} \kappa\left(\frac{x}{h_m}\right)$ for a bandwidth parameter $h_m$. To simplify the discussion, we use a single bandwidth $h_{m+n}$ for both $\hat{f}_p$ and $\hat{f}_q$. Assuming $m/n$ is bounded away from zero and infinity, consistency of the Parzen window estimates for $f_p$ and $f_q$ requires

$$\lim_{m, n \to \infty} h_{m+n}^d = 0 \quad \text{and} \quad \lim_{m, n \to \infty} (m+n) h_{m+n}^d = \infty. \qquad (6)$$

We now show the $L_2$ distance between Parzen window density estimates is a special case of the biased MMD in Equation (5). Denote by $D_r(p, q) := \|f_p - f_q\|_r$ the $L_r$ distance between the densities $f_p$ and $f_q$ corresponding to the distributions $p$ and $q$, respectively. For $r = 1$ the distance $D_r(p, q)$ is known as the Lévy distance (Feller, 1971), and for $r = 2$ we encounter a distance measure derived from the Renyi entropy (Gokcay and Principe, 2002).

Assume that $\hat{f}_p$ and $\hat{f}_q$ are given as kernel density estimates with kernel $\kappa(x - x')$, that is, $\hat{f}_p(x) = \frac{1}{m} \sum_{i=1}^{m} \kappa(x_i - x)$ and $\hat{f}_q(y)$ is defined by analogy. In this case

$$D_2(\hat{f}_p, \hat{f}_q)^2 = \int \left[ \frac{1}{m} \sum_{i=1}^{m} \kappa(x_i - z) - \frac{1}{n} \sum_{i=1}^{n} \kappa(y_i - z) \right]^2 dz = \frac{1}{m^2} \sum_{i, j = 1}^{m} k(x_i - x_j) + \frac{1}{n^2} \sum_{i, j = 1}^{n} k(y_i - y_j) - \frac{2}{mn} \sum_{i, j = 1}^{m, n} k(x_i - y_j),$$

where $k(x - y) = \int \kappa(x - z) \kappa(y - z) \, dz$. By its definition $k(x - y)$ is an RKHS kernel, as it is an inner product between $\kappa(x - z)$ and $\kappa(y - z)$ on the domain $\mathcal{X}$.

We now describe the asymptotic performance of a two-sample test using the statistic $D_2(\hat{f}_p, \hat{f}_q)^2$. We consider the power of the test under local departures from the null hypothesis. Anderson et al. (1994) define these to take the form

$$f_q = f_p + \delta g, \qquad (7)$$

where $\delta \in \mathbb{R}$, and $g$ is a fixed, bounded, integrable function chosen to ensure that $f_q$ is a valid density for sufficiently small $|\delta|$. Anderson et al. consider two cases: the kernel bandwidth converging to zero with increasing sample size, ensuring consistency of the Parzen window estimates of $f_p$ and $f_q$; and the case of a fixed bandwidth. In the former case, the minimum distance with which the test can discriminate $f_p$ from $f_q$ is⁷ $\delta = (m+n)^{-1/2} h_{m+n}^{-d/2}$. In the latter case, this minimum distance is $\delta = (m+n)^{-1/2}$, under the assumption that the Fourier transform of the kernel $\kappa$ does not vanish on an interval (Anderson et al., 1994, Section 2.4), which implies the kernel $k$ is characteristic (Sriperumbudur et al., 2010b). The power of the $L_2$ test against local alternatives is greater when the kernel is held fixed, since for any rate of decrease of $h_{m+n}$ with increasing sample size, $\delta$ will decrease more slowly than for a fixed kernel.

7. Formally, define $s_\alpha$ as a threshold for the statistic $D_2(\hat{f}_p, \hat{f}_q)^2$, chosen to ensure the test has level $\alpha$, and let $\delta = (m+n)^{-1/2} h_{m+n}^{-d/2} c$ for some fixed $c \neq 0$. When $m, n \to \infty$ such that $m/n$ is bounded away from 0 and $\infty$, and assuming conditions (6), the limit $\pi(c) := \lim_{(m+n) \to \infty} \Pr_{H_A}\left( D_2(\hat{f}_p, \hat{f}_q)^2 > s_\alpha \right)$ is well-defined, and satisfies $\alpha < \pi(c) < 1$ for $0 < |c| < \infty$, and $\pi(c) \to 1$ as $c \to \infty$.

An RKHS-based approach generalizes the $L_2$ statistic in a number of important respects. First, we may employ a much larger class of characteristic kernels that cannot be written as inner products between Parzen windows: several examples are given by Steinwart (2001, Section 3) and Micchelli et al. (2006, Section 3) (these kernels are universal, hence characteristic). We may further generalize to kernels on structured objects such as strings and graphs (Schölkopf et al., 2004), as done in our experiments (Section 8). Second, even when the kernel may be written as an inner product of Parzen windows on $\mathbb{R}^d$, the $D_2^2$ statistic with fixed bandwidth no longer converges to an $L_2$ distance between probability density functions, hence it is more natural to define the statistic as an integral probability metric for a particular RKHS, as in Definition 2. Indeed, in our experiments, we obtain good performance in experimental settings where the dimensionality greatly exceeds the sample size, and density estimates would perform very poorly⁸ (for instance the Gaussian toy example in Figure 5B, for which performance actually improves when the dimensionality increases; and the microarray data sets in Table 1). This suggests it is not necessary to solve the more difficult problem of density estimation in high dimensions to do two-sample testing.

8. The $L_2$ error of a kernel density estimate converges as $O(n^{-4/(4+d)})$ when the optimal bandwidth is used (Wasserman, 2006, Section 6.5).

Finally, the kernel approach leads us to establish consistency against a larger class of local alternatives to the null hypothesis than that considered by Anderson et al. In Theorem 13, we prove consistency against a class of alternatives encoded in terms of the mean embeddings of $p$ and $q$, which applies to any domain on which RKHS kernels may be defined, and not only densities on $\mathbb{R}^d$. This more general approach also has interesting consequences for distributions on $\mathbb{R}^d$: for instance, a local departure from $H_0$ occurs when $p$ and $q$ differ at increasing frequencies in their respective characteristic functions. This class of local alternatives cannot be expressed in the form $\delta g$ for fixed $g$, as in (7). We discuss this issue further in Section 5.

3.3.2 MMD FOR MULTINOMIALS

Assume a finite domain $\mathcal{X} := \{1, \ldots, d\}$, and define the random variables $x$ and $y$ on $\mathcal{X}$ such that $p_i := P(x = i)$ and $q_j := P(y = j)$. We embed $x$ into an RKHS $\mathcal{H}$ via the feature mapping $\phi(x) := e_x$, where $e_s$ is the unit vector in $\mathbb{R}^d$ taking value 1 in dimension $s$, and zero in the remaining entries. The kernel is the usual inner product on $\mathbb{R}^d$. In this case,

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \|p - q\|_{\mathbb{R}^d}^2 = \sum_{i=1}^{d} (p_i - q_i)^2. \qquad (8)$$

Harchaoui et al. (2008, Section 1, long version) note that this $L_2$ statistic may not be the best choice for finite domains, citing a result of Lehmann and Romano (2005, Theorem 14.3.2) that Pearson's Chi-squared statistic is optimal for the problem of goodness of fit testing for multinomials.⁹ It would be of interest to establish whether an analogous result holds for two-sample testing in a wider class of RKHS feature spaces.

9. A goodness of fit test determines whether a sample from $p$ is drawn from a known target multinomial $q$. Pearson's Chi-squared statistic weights each term in the sum (8) by its corresponding $q_i^{-1}$.

3.3.3 FURTHER MULTIVARIATE TWO-SAMPLE TESTS

Biau and Gyorfi (2005) (Biau) use as their test statistic the $L_1$ distance between discretized estimates of the probabilities, where the partitioning is refined as the sample size increases. This space partitioning approach becomes difficult or impossible for high dimensional problems, since there are too few points per bin. For this reason, we use this test only for low-dimensional problems in our experiments.

A generalisation of the Wald-Wolfowitz runs test to the multivariate domain was proposed and analysed by Friedman and Rafsky (1979) and Henze and Penrose (1999) (FR Wolf), and involves counting the number of edges in the minimum spanning tree over the aggregated data that connect points in $X$ to points in $Y$. The resulting test relies on the asymptotic normality of the test statistic, and is not distribution-free under the null hypothesis for finite samples (the test threshold depends on $p$, as with our asymptotic test in Section 5; by contrast, our tests in Section 4 are distribution-free). The computational cost of this method using Kruskal's algorithm is $O((m+n)^2 \log(m+n))$, although more modern methods improve on the $\log(m+n)$ term: see Chazelle (2000) for details. Friedman and Rafsky (1979) claim that calculating the matrix of distances, which costs $O((m+n)^2)$, dominates their computing time; we return to this point in our experiments (Section 8). Two possible generalisations of the Kolmogorov-Smirnov test to the multivariate case were studied by Bickel (1969) and Friedman and Rafsky (1979). The approach of Friedman and Rafsky (FR Smirnov) in this case again requires a minimal spanning tree, and has a similar cost to their multivariate runs test.

A more recent multivariate test was introduced by Rosenbaum (2005). This entails computing the minimum distance non-bipartite matching over the aggregate data, and using the number of pairs containing a sample from both $X$ and $Y$ as a test statistic. The resulting statistic is distribution-free under the null hypothesis at finite sample sizes, in which respect it is superior to the Friedman-Rafsky test; on the other hand, it costs $O((m+n)^3)$ to compute. Another distribution-free test (Hall) was proposed by Hall and Tajvidi (2002): for each point from $p$, it requires computing the closest points in the aggregated data, and counting how many of these are from $q$ (the procedure is repeated for each point from $q$ with respect to points from $p$). As we shall see in our experimental comparisons, the test statistic is costly to compute; Hall and Tajvidi consider only tens of points in their experiments.

4. Tests Based on Uniform Convergence Bounds

In this section, we introduce two tests for the two-sample problem that have exact performance guarantees at finite sample sizes, based on uniform convergence bounds. The first, in Section 4.1, uses the McDiarmid (1989) bound on the biased MMD statistic, and the second, in Section 4.2, uses a Hoeffding (1963) bound for the unbiased statistic.

4.1 Bound on the Biased Statistic and Test

We establish two properties of the MMD, from which we derive a hypothesis test. First, we show that regardless of whether or not $p = q$, the empirical MMD converges in probability at rate $O((m+n)^{-1/2})$ to its population value. This shows the consistency of statistical tests based on the MMD. Second, we give probabilistic bounds for large deviations of the empirical MMD in the case $p = q$. These bounds lead directly to a threshold for our first hypothesis test. We begin by establishing the convergence of $\mathrm{MMD}_b[\mathcal{F}, X, Y]$ to $\mathrm{MMD}[\mathcal{F}, p, q]$. The following theorem is proved in A.2.

Theorem 7 Let $p, q, X, Y$ be defined as in Problem 1, and assume $0 \le k(x, y) \le K$. Then

$$\Pr_{X, Y} \left\{ \left| \mathrm{MMD}_b[\mathcal{F}, X, Y] - \mathrm{MMD}[\mathcal{F}, p, q] \right| > 2 \left( (K/m)^{1/2} + (K/n)^{1/2} \right) + \varepsilon \right\} \le 2 \exp\left( \frac{-\varepsilon^2 m n}{2 K (m+n)} \right),$$

where $\Pr_{X, Y}$ denotes the probability over the $m$-sample $X$ and $n$-sample $Y$.

Our next goal is to refine this result in a way that allows us to define a test threshold under the null hypothesis $p = q$. Under this circumstance, the constants in the exponent are slightly improved. The following theorem is proved in Appendix A.3.

Theorem 8 Under the conditions of Theorem 7 where additionally $p = q$ and $m = n$,

$$\mathrm{MMD}_b[\mathcal{F}, X, Y] \le \underbrace{m^{-1/2} \sqrt{2 \mathbf{E}_{x, x'}\left[ k(x, x) - k(x, x') \right]}}_{B_1(\mathcal{F}, p)} + \varepsilon \le \underbrace{(2K/m)^{1/2}}_{B_2(\mathcal{F}, p)} + \varepsilon,$$

both with probability at least $1 - \exp\left( -\frac{\varepsilon^2 m}{4K} \right)$.

In this theorem, we illustrate two possible bounds $B_1(\mathcal{F}, p)$ and $B_2(\mathcal{F}, p)$ on the bias in the empirical estimate (5). The first inequality is interesting inasmuch as it provides a link between the bias bound $B_1(\mathcal{F}, p)$ and kernel size (for instance, if we were to use a Gaussian kernel with large $\sigma$, then $k(x, x)$ and $k(x, x')$ would likely be close, and the bias small). In the context of testing, however, we would need to provide an additional bound to show convergence of an empirical estimate of $B_1(\mathcal{F}, p)$ to its population equivalent. Thus, in the following test for $p = q$ based on Theorem 8, we use $B_2(\mathcal{F}, p)$ to bound the bias.¹⁰

Corollary 9 A hypothesis test of level $\alpha$ for the null hypothesis $p = q$, that is, for $\mathrm{MMD}[\mathcal{F}, p, q] = 0$, has the acceptance region

$$\mathrm{MMD}_b[\mathcal{F}, X, Y] < \sqrt{2K/m} \left( 1 + \sqrt{2 \log \alpha^{-1}} \right).$$
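The threshold in Corollary 9 depends only on the sample size $m$, the level $\alpha$, and the kernel bound $K$, so it can be computed without looking at the data. The sketch below shows one way to evaluate it, assuming $m = n$ and a kernel with $0 \le k(x, y) \le K$ as in Theorem 8. The function names `mmd_b_threshold` and `reject_null` are hypothetical, and the default `K=1.0` corresponds to, for example, a Gaussian kernel.

```python
import math

def mmd_b_threshold(m, alpha, K=1.0):
    """Acceptance threshold of Corollary 9 for the biased statistic MMD_b.

    Assumes m = n and a kernel bounded as 0 <= k(x, y) <= K.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return math.sqrt(2.0 * K / m) * (1.0 + math.sqrt(2.0 * math.log(1.0 / alpha)))

def reject_null(mmd_b, m, alpha=0.05, K=1.0):
    """Reject p = q when the biased statistic falls outside the acceptance region."""
    return mmd_b >= mmd_b_threshold(m, alpha, K)

# The threshold shrinks like m^(-1/2) as the sample size grows.
for m in (100, 1000, 10000):
    print(m, mmd_b_threshold(m, alpha=0.05))
```

Setting $\varepsilon$ equal to the threshold minus the bias bound $(2K/m)^{1/2}$ and substituting into the probability in Theorem 8 recovers $\alpha$, which is how the corollary follows from the theorem.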
We emphasize that this test is distribution-free: the test threshold does not depend on the particular distribution that generated the sample. Theorem 7 guarantees the consistency of the test against fixed alternatives, and that the Type II error probability decreases to zero at rate $O(m^{-1/2})$, assuming $m = n$. To put this convergence rate in perspective, consider a test of whether two normal distributions have equal means, given they have unknown but equal variance (Casella and Berger, 2002, Exercise 8.4). In this case, the test statistic has a Student-t distribution with $n + m - 2$ degrees of freedom, and its Type II error probability converges at the same rate as our test.

10. Note that we use a tighter bias bound than Gretton et al. (2007a).

It is worth noting that bounds may be obtained for the deviation between population mean embeddings $\mu_p$ and the empirical embeddings $\mu_X$ in a completely analogous fashion. The proof

requires symmetrization by means of a ghost sample, that is, a second set of observations drawn from the same distribution. While not the focus of the present paper, such bounds can be used to perform inference based on moment matching (Altun and Smola, 2006; Dudík and Schapire, 2006; Dudík et al., 2004).

4.2 Bound on the Unbiased Statistic and Test

The previous bounds are of interest since the proof strategy can be used for general function classes with well behaved Rademacher averages (see Sriperumbudur et al., 2010a). When $\mathcal{F}$ is the unit ball in an RKHS, however, we may very easily define a test via a convergence bound on the unbiased statistic $\mathrm{MMD}_u^2$ in Lemma 4. We base our test on the following theorem, which is a straightforward application of the large deviation bound on U-statistics of Hoeffding (1963, p. 25).

Theorem 10 Assume $0 \le k(x_i,x_j) \le K$, from which it follows $-2K \le h(z_i,z_j) \le 2K$. Then

$$\Pr_{X,Y}\left\{ \mathrm{MMD}_u^2(\mathcal{F},X,Y) - \mathrm{MMD}^2(\mathcal{F},p,q) > t \right\} \le \exp\left( \frac{-t^2 m_2}{8K^2} \right),$$

where $m_2 := \lfloor m/2 \rfloor$ (the same bound applies for deviations of $-t$ and below).

A consistent statistical test for $p = q$ using $\mathrm{MMD}_u^2$ is then obtained.

Corollary 11 A hypothesis test of level $\alpha$ for the null hypothesis $p = q$ has the acceptance region $\mathrm{MMD}_u^2 < (4K/\sqrt{m})\sqrt{\log(\alpha^{-1})}$.

This test is distribution-free. We now compare the thresholds of the above test with that in Corollary 9. We note first that the threshold for the biased statistic applies to an estimate of $\mathrm{MMD}$, whereas that for the unbiased statistic is for an estimate of $\mathrm{MMD}^2$. Squaring the former threshold to make the two quantities comparable, the squared threshold in Corollary 9 decreases as $m^{-1}$, whereas the threshold in Corollary 11 decreases as $m^{-1/2}$. Thus for sufficiently large $m$,^11 the McDiarmid-based threshold will be lower (and the associated test statistic is in any case biased upwards), and its Type II error will be better for a given Type I bound. This is confirmed in our Section 8 experiments.
Note, however, that the rate of convergence of the squared, biased MMD estimate to its population value remains at $1/\sqrt{m}$ (bearing in mind we take the square of a biased estimate, where the bias term decays as $1/\sqrt{m}$).

Finally, we note that the bounds we obtained in this section and the last are rather conservative for a number of reasons: first, they do not take the actual distributions into account. In fact, they are finite sample size, distribution-free bounds that hold even in the worst case scenario. The bounds could be tightened using localization, moments of the distribution, etc.: see, for example, Bousquet et al. (2005) and de la Peña and Giné (1999). Any such improvements could be plugged straight into Theorem 19. Second, in computing bounds rather than trying to characterize the distribution of $\mathrm{MMD}(\mathcal{F},X,Y)$ explicitly, we force our test to be conservative by design. In the following we aim for an exact characterization of the asymptotic distribution of $\mathrm{MMD}(\mathcal{F},X,Y)$ instead of a bound. While this will not satisfy the uniform convergence requirements, it leads to superior tests in practice.

11. In the case of $\alpha = 0.05$, this is $m \ge 12$.
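The comparison between the two distribution-free thresholds can be checked numerically. The sketch below squares the Corollary 9 threshold, evaluates the Corollary 11 threshold, and searches for the smallest $m$ at which the former is lower. The function names and the search limit `m_max` are illustrative assumptions; since both quantities scale linearly in $K$, the crossover does not depend on $K$.

```python
import math


def biased_threshold_squared(m, alpha, K=1.0):
    """Square of the Corollary 9 threshold: (2K/m) * (1 + sqrt(2 log(1/alpha)))^2."""
    return (2.0 * K / m) * (1.0 + math.sqrt(2.0 * math.log(1.0 / alpha))) ** 2


def unbiased_threshold(m, alpha, K=1.0):
    """Corollary 11 threshold: (4K / sqrt(m)) * sqrt(log(1/alpha))."""
    return (4.0 * K / math.sqrt(m)) * math.sqrt(math.log(1.0 / alpha))


def crossover_sample_size(alpha, K=1.0, m_max=10_000):
    """Smallest m at which the squared biased threshold is the lower of the two."""
    for m in range(2, m_max + 1):
        if biased_threshold_squared(m, alpha, K) < unbiased_threshold(m, alpha, K):
            return m
    return None  # no crossover found within the search range


print(crossover_sample_size(0.05))  # prints 12
```

The squared biased threshold shrinks as $m^{-1}$ and the unbiased one as $m^{-1/2}$, so once the crossover is reached the ordering holds for all larger $m$.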

5. Test Based on the Asymptotic Distribution of the Unbiased Statistic

We propose a third test, which is based on the asymptotic distribution of the unbiased estimate of $\mathrm{MMD}^2$ in Lemma 6. This test uses the asymptotic distribution of $\mathrm{MMD}_u^2$ under $H_0$, which follows from results of Anderson et al. (1994, Appendix) and Serfling (1980, Section 5.5.2): see Appendix B.1 for the proof.

Theorem 12 Let $\tilde{k}(x_i,x_j)$ be the kernel between feature space mappings from which the mean embedding of $p$ has been subtracted,

$$\tilde{k}(x_i,x_j) := \left\langle \phi(x_i) - \mu_p,\, \phi(x_j) - \mu_p \right\rangle_{\mathcal{H}} = k(x_i,x_j) - \mathbf{E}_x k(x_i,x) - \mathbf{E}_x k(x,x_j) + \mathbf{E}_{x,x'} k(x,x'), \qquad (9)$$

where $x'$ is an independent copy of $x$ drawn from $p$. Assume $\tilde{k} \in L_2(\mathcal{X} \times \mathcal{X}, p \times p)$ (i.e., the centred kernel is square integrable, which is true for all $p$ when the kernel is bounded), and that for $t = m + n$, $\lim_{m,n\to\infty} m/t \to \rho_x$ and $\lim_{m,n\to\infty} n/t \to \rho_y := (1 - \rho_x)$ for fixed $0 < \rho_x < 1$. Then under $H_0$, $\mathrm{MMD}_u^2$ converges in distribution according to

$$t\,\mathrm{MMD}_u^2[\mathcal{F},X,Y] \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ \left( \rho_x^{-1/2} a_l - \rho_y^{-1/2} b_l \right)^2 - (\rho_x \rho_y)^{-1} \right], \qquad (10)$$

where $a_l \sim \mathcal{N}(0,1)$ and $b_l \sim \mathcal{N}(0,1)$ are infinite sequences of independent Gaussian random variables, and the $\lambda_i$ are eigenvalues of

$$\int_{\mathcal{X}} \tilde{k}(x,x')\,\psi_i(x)\,dp(x) = \lambda_i \psi_i(x').$$

We illustrate the MMD density under both the null and alternative hypotheses by approximating it empirically for $p = q$ and $p \neq q$. Results are plotted in Figure 2.

Our goal is to determine whether the empirical test statistic $\mathrm{MMD}_u^2$ is so large as to be outside the $1 - \alpha$ quantile of the null distribution in (10), which gives a level $\alpha$ test. Consistency of this test against local departures from the null hypothesis is provided by the following theorem, proved in Appendix B.2.

Theorem 13 Define $\rho_x$, $\rho_y$, and $t$ as in Theorem 12, and write $\mu_q = \mu_p + g_t$, where $g_t \in \mathcal{H}$ is chosen such that $\mu_p + g_t$ remains a valid mean embedding, and $\|g_t\|_{\mathcal{H}}$ is made to approach zero as $t \to \infty$ to describe local departures from the null hypothesis. Then $\|g_t\|_{\mathcal{H}} = c\,t^{-1/2}$ is the minimum distance between $\mu_p$ and $\mu_q$ distinguishable by the test.

An example of a local departure from the null hypothesis is described earlier in the discussion of the $L_2$ distance between Parzen window estimates (Section 3.3.1).
The class of local alternatives considered in Theorem 13 is more general, however: for instance, Sriperumbudur et al. (2010b, Section 4) and Harchaoui et al. (2008, Section 5, long version) give examples of classes of perturbations $g_t$ with decreasing RKHS norm. These perturbations have the property that $p$ differs from $q$ at increasing frequencies, rather than simply with decreasing amplitude.

One way to estimate the $1 - \alpha$ quantile of the null distribution is using the bootstrap on the aggregated data, following Arcones and Giné (1992). Alternatively, we may approximate the null

distribution by fitting Pearson curves to its first four moments (Johnson et al., 1994, Section 18.8). Taking advantage of the degeneracy of the U-statistic, we obtain for $m = n$

$$\mathbf{E}\left( \left[ \mathrm{MMD}_u^2 \right]^2 \right) = \frac{2}{m(m-1)}\,\mathbf{E}_{z,z'}\left[ h^2(z,z') \right]$$

and

$$\mathbf{E}\left( \left[ \mathrm{MMD}_u^2 \right]^3 \right) = \frac{8(m-2)}{m^2(m-1)^2}\,\mathbf{E}_{z,z'}\left[ h(z,z')\,\mathbf{E}_{z''}\left( h(z,z'')\,h(z',z'') \right) \right] + O(m^{-4}) \qquad (11)$$

(see Appendix B.3), where $h(z,z')$ is defined in Lemma 6, $z = (x,y) \sim p \times q$ where $x$ and $y$ are independent, and $z'$, $z''$ are independent copies of $z$. The fourth moment $\mathbf{E}\left( \left[ \mathrm{MMD}_u^2 \right]^4 \right)$ is not computed, since it is both very small, $O(m^{-4})$, and expensive to calculate, $O(m^4)$. Instead, we replace the kurtosis^12 with a lower bound due to Wilkins (1944), $\mathrm{kurt}\left( \mathrm{MMD}_u^2 \right) \ge \left( \mathrm{skew}\left( \mathrm{MMD}_u^2 \right) \right)^2 + 1$.

Figure 2 (two panels plotting probability density against $\mathrm{MMD}_u^2$): Left: Empirical distribution of the MMD under $H_0$, with $p$ and $q$ both Gaussians with unit standard deviation, using 50 samples from each. Right: Empirical distribution of the MMD under $H_A$, with $p$ a Laplace distribution with unit standard deviation, and $q$ a Laplace distribution with standard deviation 3, using 00 samples from each. In both cases, the histograms were obtained by computing 000 independent instances of the MMD.

In Figure 3, we illustrate the Pearson curve fit to the null distribution: the fit is good in the upper quantiles of the distribution, where the test threshold is computed. Finally, we note that two alternative empirical estimates of the null distribution have more recently been proposed by Gretton et al. (2009): a consistent estimate, based on an empirical computation of the eigenvalues $\lambda_l$ in (10); and an alternative Gamma approximation to the null distribution, which has a smaller computational cost but is generally less accurate. Further detail and experimental comparisons are given by Gretton et al.

12. The kurtosis is defined in terms of the fourth and second moments as $\mathrm{kurt}\left( \mathrm{MMD}_u^2 \right) = \mathbf{E}\left( \left[ \mathrm{MMD}_u^2 \right]^4 \right) / \left[ \mathbf{E}\left( \left[ \mathrm{MMD}_u^2 \right]^2 \right) \right]^2$.
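The empirical approximation of the null distribution of the unbiased statistic can be sketched as follows. The code assumes $m = n$, scalar data, a Gaussian kernel, and the usual form $h(z,z') = k(x,x') + k(y,y') - k(x,y') - k(x',y)$ with $z = (x,y)$, averaged over all pairs $i \neq j$. It estimates the $1 - \alpha$ quantile by plain Monte Carlo with $p = q$ known to be standard normal, which is a simplifying assumption for illustration only: it is neither the bootstrap on aggregated data nor the Pearson curve fit, and the function names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def h(z1, z2, kernel=gaussian_kernel):
    """h(z, z') = k(x, x') + k(y, y') - k(x, y') - k(x', y), with z = (x, y)."""
    (x1, y1), (x2, y2) = z1, z2
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased MMD^2 estimate for m = n: average of h over all pairs i != j."""
    m = len(xs)
    zs = list(zip(xs, ys))
    total = sum(h(zs[i], zs[j], kernel)
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


def simulated_null_quantile(m, alpha, reps=200, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile when p = q = N(0, 1)."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(m)]
        values.append(mmd2_unbiased(xs, ys))
    values.sort()
    index = min(reps - 1, int(math.ceil((1.0 - alpha) * reps)) - 1)
    return values[index], values


threshold, null_values = simulated_null_quantile(m=30, alpha=0.05)
print("estimated 95% null quantile of MMD_u^2:", threshold)
print("mean of simulated null values:", sum(null_values) / len(null_values))
```

Because the statistic is unbiased, the simulated null values should average close to zero, and individual values can be negative.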

Figure 3 (plot of $P(\mathrm{MMD}_u^2 < t)$ against $t$, showing the empirical CDF and the Pearson fit): Illustration of the empirical CDF of the MMD and a Pearson curve fit. Both $p$ and $q$ were Gaussian with zero mean and unit variance, and 50 samples were drawn from each. The empirical CDF was computed on the basis of 000 randomly generated MMD values. To ensure the quality of fit was determined only by the accuracy of the Pearson approximation, the moments used for the Pearson curves were also computed on the basis of these 000 samples. The MMD used a Gaussian kernel with σ=.

6. A Linear Time Statistic and Test

The MMD-based tests are already more efficient than the $O(m^2 \log m)$ and $O(m^3)$ tests described earlier (assuming $m = n$ for conciseness). It is still desirable, however, to obtain $O(m)$ tests which do not sacrifice too much statistical power. Moreover, we would like to obtain tests which have $O(1)$ storage requirements for computing the test statistic, in order to apply the test to data streams. We now describe how to achieve this by computing the test statistic using a subsampling of the terms in the sum. The empirical estimate in this case is obtained by drawing pairs from $X$ and $Y$ respectively without replacement.

Lemma 14 Define $m_2 := \lfloor m/2 \rfloor$, assume $m = n$, and define $h(z,z')$ as in Lemma 6. The estimator

$$\mathrm{MMD}_l^2[\mathcal{F},X,Y] := \frac{1}{m_2} \sum_{i=1}^{m_2} h\left( (x_{2i-1}, y_{2i-1}),\, (x_{2i}, y_{2i}) \right)$$

can be computed in linear time, and is an unbiased estimate of $\mathrm{MMD}^2[\mathcal{F},p,q]$.

While it is expected that $\mathrm{MMD}_l^2$ has higher variance than $\mathrm{MMD}_u^2$ (as we will see explicitly later), it is computationally much more appealing. In particular, the statistic can be used in stream computations with need for only $O(1)$ memory, whereas $\mathrm{MMD}_u^2$ requires $O(m)$ storage and $O(m^2)$ time to compute the kernel $h$ on all interacting pairs.

Since $\mathrm{MMD}_l^2$ is just the average over a set of random variables, Hoeffding's bound and the central limit theorem readily allow us to provide both uniform convergence and asymptotic statements with little effort. The first follows directly from Hoeffding (1963, Theorem 2).
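A minimal streaming sketch of the linear time estimator in Lemma 14 is given below. It assumes scalar data, a Gaussian kernel, and the usual form of $h$, namely $h(z,z') = k(x,x') + k(y,y') - k(x,y') - k(x',y)$ with $z = (x,y)$. The function names and the generator-based interface are illustrative assumptions. Only one pending pair and two running totals are held at any time, which is what gives the $O(1)$ memory footprint.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def h(z1, z2, kernel=gaussian_kernel):
    """h(z, z') = k(x, x') + k(y, y') - k(x, y') - k(x', y), with z = (x, y)."""
    (x1, y1), (x2, y2) = z1, z2
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def mmd2_linear(pairs, kernel=gaussian_kernel):
    """Linear time MMD^2 estimate from an iterable of (x_i, y_i) pairs.

    Consecutive observations are grouped two at a time, so h is evaluated on
    floor(m / 2) disjoint pairs; a trailing unpaired observation is ignored.
    """
    total, count, pending = 0.0, 0, None
    for z in pairs:
        if pending is None:
            pending = z
        else:
            total += h(pending, z, kernel)
            count += 1
            pending = None
    if count == 0:
        raise ValueError("need at least two (x, y) observations")
    return total / count


def null_stream(m, seed=0):
    """Hypothetical data stream with p = q = N(0, 1)."""
    rng = random.Random(seed)
    for _ in range(m):
        yield rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)


print("MMD_l^2 under the null:", mmd2_linear(null_stream(2000)))
```

Passing a generator rather than a list keeps the whole computation single-pass, so the same function works on data that never fits in memory.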

Theorem 15 Assume $0 \le k(x_i,x_j) \le K$. Then

$$\Pr_{X,Y}\left\{ \mathrm{MMD}_l^2(\mathcal{F},X,Y) - \mathrm{MMD}^2(\mathcal{F},p,q) > t \right\} \le \exp\left( \frac{-t^2 m_2}{8K^2} \right),$$

where $m_2 := \lfloor m/2 \rfloor$ (the same bound applies for deviations of $-t$ and below).

Note that the bound of Theorem 10 is identical to that of Theorem 15, which shows the former is rather loose. Next we invoke the central limit theorem (e.g., Serfling, 1980, Section 1.9).

Corollary 16 Assume $0 < \mathbf{E}\left( h^2 \right) < \infty$. Then $\mathrm{MMD}_l^2$ converges in distribution to a Gaussian according to

$$m^{1/2}\left( \mathrm{MMD}_l^2 - \mathrm{MMD}^2[\mathcal{F},p,q] \right) \xrightarrow{D} \mathcal{N}\left( 0, \sigma_l^2 \right),$$

where $\sigma_l^2 = 2\left[ \mathbf{E}_{z,z'} h^2(z,z') - \left[ \mathbf{E}_{z,z'} h(z,z') \right]^2 \right]$, where we use the shorthand $\mathbf{E}_{z,z'} := \mathbf{E}_{z,z' \sim p \times q}$.

The factor of 2 arises since we are averaging over only $\lfloor m/2 \rfloor$ observations. It is instructive to compare this asymptotic distribution with that of the quadratic time statistic $\mathrm{MMD}_u^2$ under $H_A$, when $m = n$. In this case, $\mathrm{MMD}_u^2$ converges in distribution to a Gaussian according to

$$m^{1/2}\left( \mathrm{MMD}_u^2 - \mathrm{MMD}^2[\mathcal{F},p,q] \right) \xrightarrow{D} \mathcal{N}\left( 0, \sigma_u^2 \right),$$

where $\sigma_u^2 = 4\left( \mathbf{E}_z\left[ \left( \mathbf{E}_{z'} h(z,z') \right)^2 \right] - \left[ \mathbf{E}_{z,z'}\left( h(z,z') \right) \right]^2 \right)$ (Serfling, 1980, Section 5.5). Thus for $\mathrm{MMD}_u^2$, the asymptotic variance is (up to scaling) the variance of $\mathbf{E}_{z'}[h(z,z')]$, whereas for $\mathrm{MMD}_l^2$ it is $\mathrm{Var}_{z,z'}[h(z,z')]$.

We end by noting another potential approach to reducing the cost of computing an empirical MMD estimate, by using a low rank approximation to the Gram matrix (Fine and Scheinberg, 2001; Williams and Seeger, 2001; Smola and Schölkopf, 2000). An incremental computation of the MMD based on such a low rank approximation would require $O(md)$ storage and $O(md)$ computation (where $d$ is the rank of the approximate Gram matrix which is used to factorize both matrices) rather than $O(m)$ storage and $O(m^2)$ operations. That said, it remains to be determined what effect this approximation would have on the distribution of the test statistic under $H_0$, and hence on the test threshold.

7. Related Metrics and Learning Problems

The present section discusses a number of topics related to the maximum mean discrepancy, including metrics on probability distributions using non-RKHS function classes (Sections 7.1
and 7.2), the relation with set kernels and kernels on probability measures (Section 7.3), an extension to kernel measures of independence (Section 7.4), a two-sample statistic using a distribution over witness functions (Section 7.5), and a connection to outlier detection (Section 7.6).

7.1 The MMD in Other Function Classes

The definition of the maximum mean discrepancy is by no means limited to RKHS. In fact, any function class $\mathcal{F}$ that comes with uniform convergence guarantees and is sufficiently rich will enjoy the above properties. Below, we consider the case where the scaled functions in $\mathcal{F}$ are dense in $C(\mathcal{X})$ (which is useful for instance when the functions in $\mathcal{F}$ are norm constrained).

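Definition 17 Let $\mathcal{F}$ be a subset of some vector space. The star $S[\mathcal{F}]$ of a set $\mathcal{F}$ is

$$S[\mathcal{F}] := \left\{ \alpha f \mid f \in \mathcal{F} \text{ and } \alpha \in [0, \infty) \right\}.$$

Theorem 18 Denote by $\mathcal{F}$ the subset of some vector space of functions from $\mathcal{X}$ to $\mathbb{R}$ for which $S[\mathcal{F}] \cap C(\mathcal{X})$ is dense in $C(\mathcal{X})$ with respect to the $L_\infty(\mathcal{X})$ norm. Then $\mathrm{MMD}[\mathcal{F},p,q] = 0$ if and only if $p = q$, and $\mathrm{MMD}[\mathcal{F},p,q]$ is a metric on the space of probability distributions. Whenever the star of $\mathcal{F}$ is not dense, the MMD defines a pseudo-metric space.

Proof It is clear that $p = q$ implies $\mathrm{MMD}[\mathcal{F},p,q] = 0$. The proof of the converse is very similar to that of Theorem 5. Define $\mathcal{H} := S(\mathcal{F}) \cap C(\mathcal{X})$. Since by assumption $\mathcal{H}$ is dense in $C(\mathcal{X})$, for any $f \in C(\mathcal{X})$ and $\varepsilon > 0$ there exists an $h^* \in \mathcal{H}$ satisfying $\|h^* - f\|_\infty < \varepsilon$. Write $h^* := \alpha^* g^*$, where $g^* \in \mathcal{F}$. By assumption, $\mathbf{E}_x g^* - \mathbf{E}_y g^* = 0$. Thus we have the bound

$$\left| \mathbf{E}_x f(x) - \mathbf{E}_y f(y) \right| \le \left| \mathbf{E}_x f(x) - \mathbf{E}_x h^*(x) \right| + \alpha^* \left| \mathbf{E}_x g^*(x) - \mathbf{E}_y g^*(y) \right| + \left| \mathbf{E}_y h^*(y) - \mathbf{E}_y f(y) \right| \le 2\varepsilon$$

for all $f \in C(\mathcal{X})$ and $\varepsilon > 0$, which implies $p = q$ by Lemma 1.

To show $\mathrm{MMD}[\mathcal{F},p,q]$ is a metric, it remains to prove the triangle inequality. We have

$$\sup_{f \in \mathcal{F}} \left| \mathbf{E}_p f - \mathbf{E}_q f \right| + \sup_{g \in \mathcal{F}} \left| \mathbf{E}_q g - \mathbf{E}_r g \right| \ge \sup_{f \in \mathcal{F}} \left[ \left| \mathbf{E}_p f - \mathbf{E}_q f \right| + \left| \mathbf{E}_q f - \mathbf{E}_r f \right| \right] \ge \sup_{f \in \mathcal{F}} \left| \mathbf{E}_p f - \mathbf{E}_r f \right|.$$

Note that any uniform convergence statements in terms of $\mathcal{F}$ allow us immediately to characterize an estimator of $\mathrm{MMD}(\mathcal{F},p,q)$ explicitly. The following result shows how (this reasoning is also the basis for the proofs in Section 4, although here we do not restrict ourselves to an RKHS).

Theorem 19 Let $\delta \in (0,1)$ be a confidence level and assume that for some $\varepsilon(\delta,m,\mathcal{F})$ the following holds for samples $\{x_1,\ldots,x_m\}$ drawn from $p$:

$$\Pr_X\left\{ \sup_{f \in \mathcal{F}} \left| \mathbf{E}_x[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right| > \varepsilon(\delta,m,\mathcal{F}) \right\} \le \delta.$$

In this case we have that

$$\Pr_{X,Y}\left\{ \left| \mathrm{MMD}[\mathcal{F},p,q] - \mathrm{MMD}_b[\mathcal{F},X,Y] \right| > 2\varepsilon(\delta/2,m,\mathcal{F}) \right\} \le \delta,$$

where $\mathrm{MMD}_b[\mathcal{F},X,Y]$ is taken from Definition 2.

Proof The proof works simply by using convexity and suprema as follows:

$$\left| \mathrm{MMD}[\mathcal{F},p,q] - \mathrm{MMD}_b[\mathcal{F},X,Y] \right| = \left| \sup_{f \in \mathcal{F}} \left| \mathbf{E}_x[f] - \mathbf{E}_y[f] \right| - \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{m} \sum_{i=1}^{m} f(y_i) \right| \right|$$

$$\le \sup_{f \in \mathcal{F}} \left| \mathbf{E}_x[f] - \mathbf{E}_y[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) + \frac{1}{m} \sum_{i=1}^{m} f(y_i) \right|$$

$$\le \sup_{f \in \mathcal{F}} \left| \mathbf{E}_x[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right| + \sup_{f \in \mathcal{F}} \left| \mathbf{E}_y[f] - \frac{1}{m} \sum_{i=1}^{m} f(y_i) \right|.$$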

Bounding each of the two terms via a uniform convergence bound proves the claim.

This shows that $\mathrm{MMD}_b[\mathcal{F},X,Y]$ can be used to estimate $\mathrm{MMD}[\mathcal{F},p,q]$, and that the quantity is asymptotically unbiased.

Remark 20 (Reduction to Binary Classification) As noted by Friedman (2003), any classifier which maps a set of observations $\{z_i, l_i\}$ with $z_i \in \mathcal{X}$ on some domain $\mathcal{X}$ and labels $l_i \in \{\pm 1\}$, for which uniform convergence bounds exist on the convergence of the empirical loss to the expected loss, can be used to obtain a similarity measure on distributions: simply assign $l_i = 1$ if $z_i \in X$ and $l_i = -1$ for $z_i \in Y$ and find a classifier which is able to separate the two sets. In this case maximization of $\mathbf{E}_x[f] - \mathbf{E}_y[f]$ is achieved by ensuring that as many $z \sim p(z)$ as possible correspond to $f(z) = 1$, whereas for as many $z \sim q(z)$ as possible we have $f(z) = -1$. Consequently neural networks, decision trees, boosted classifiers and other objects for which uniform convergence bounds can be obtained can be used for the purpose of distribution comparison. Metrics and divergences on distributions can also be defined explicitly starting from classifiers. For instance, Sriperumbudur et al. (2009) show the MMD minimizes the expected risk of a classifier with linear loss on the samples $X$ and $Y$, and Ben-David et al. (2007, Section 4) use the error of a hyperplane classifier to approximate the A-distance between distributions (Kifer et al., 2004). Reid and Williamson (2011) provide further discussion and examples.

7.2 Examples of Non-RKHS Function Classes

Other function spaces $\mathcal{F}$ inspired by the statistics literature can also be considered in defining the MMD. Indeed, Lemma 1 defines an MMD with $\mathcal{F}$ the space of bounded continuous real-valued functions, which is a Banach space with the supremum norm (Dudley, 2002, p. 58). We now describe two further metrics on the space of probability distributions, namely the Kolmogorov-Smirnov and Earth Mover's distances, and their associated function classes.

7.2.1 KOLMOGOROV-SMIRNOV STATISTIC

The Kolmogorov-Smirnov (K-S) test is probably one of the most famous two-sample tests in statistics.
It works for random variables $x \in \mathbb{R}$ (or any other set for which we can establish a total order). Denote by $F_p(x)$ the cumulative distribution function of $p$ and let $F_X(x)$ be its empirical counterpart,

$$F_p(z) := \Pr\{ x \le z \text{ for } x \sim p \} \quad \text{and} \quad F_X(z) := \frac{1}{|X|} \sum_{i=1}^{m} 1_{z \ge x_i}.$$

It is clear that $F_p$ captures the properties of $p$. The Kolmogorov metric is simply the $L_\infty$ distance $\|F_X - F_Y\|_\infty$ for two sets of observations $X$ and $Y$. Smirnov (1939) showed that for $p = q$ the limiting distribution of the empirical cumulative distribution functions satisfies

$$\lim_{m,n \to \infty} \Pr_{X,Y}\left\{ \left[ \frac{mn}{m+n} \right]^{1/2} \|F_X - F_Y\|_\infty > x \right\} = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2j^2x^2} \quad \text{for } x \ge 0, \qquad (12)$$

which is distribution independent. This allows for an efficient characterization of the distribution under the null hypothesis $H_0$. Efficient numerical approximations to (12) can be found in numerical analysis handbooks (Press et al., 1994). The distribution under the alternative $p \neq q$, however, is unknown.
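As a rough illustration of how the Kolmogorov-Smirnov statistic and Smirnov's limiting series might be evaluated, the sketch below computes $\|F_X - F_Y\|_\infty$ from two scalar samples and truncates the alternating series after a fixed number of terms. The truncation length, the clamping to $[0,1]$, the handling of $x \le 0$, and the function names are illustrative assumptions; this is not the handbook routine cited above.

```python
import bisect
import math


def ks_statistic(xs, ys):
    """Sup-norm distance between the two empirical CDFs.

    Both CDFs are right-continuous step functions, so the supremum is
    attained at one of the observed points.
    """
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    m, n = len(xs_sorted), len(ys_sorted)
    best = 0.0
    for v in xs_sorted + ys_sorted:
        fx = bisect.bisect_right(xs_sorted, v) / m  # fraction of X <= v
        fy = bisect.bisect_right(ys_sorted, v) / n  # fraction of Y <= v
        best = max(best, abs(fx - fy))
    return best


def smirnov_tail(x, terms=100):
    """Truncated series 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 x^2).

    The series converges slowly for very small x, so the result is clamped
    to [0, 1]; for x <= 0 the tail probability is taken to be 1.
    """
    if x <= 0.0:
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_asymptotic_pvalue(xs, ys):
    """Asymptotic p-value under the null, using the scaling sqrt(mn / (m + n))."""
    m, n = len(xs), len(ys)
    return smirnov_tail(math.sqrt(m * n / (m + n)) * ks_statistic(xs, ys))


print(ks_statistic([1, 2, 3], [4, 5, 6]))  # prints 1.0
print(smirnov_tail(1.36))
```

For moderate $x$ the first term dominates, which is why only a handful of terms are needed near typical test thresholds.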