Kernel Mean Estimation and Stein Effect


Krikamol Muandet, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Kenji Fukumizu, The Institute of Statistical Mathematics, Tokyo, Japan
Bharath Sriperumbudur, Statistical Laboratory, University of Cambridge, Cambridge, United Kingdom
Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom
Bernhard Schölkopf, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is an important part of many algorithms ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given a finite sample, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved due to a well-known phenomenon in statistics called Stein's phenomenon. After consideration, our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

1. Introduction

This paper aims to improve the estimation of the mean function in a reproducing kernel Hilbert space (RKHS) from a finite sample. A kernel mean of a probability distribution $P$ over a measurable space $\mathcal{X}$ is defined by

$$\mu_P \triangleq \int_{\mathcal{X}} k(x,\cdot)\, dP(x) \in \mathcal{H}, \tag{1}$$

where $\mathcal{H}$ is an RKHS associated with a reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Conditions ensuring that this expectation exists are given in Smola et al. (2007). Unfortunately, it is not practical to compute $\mu_P$ directly because the distribution $P$ is usually unknown. Instead, given an i.i.d. sample $x_1, x_2, \ldots, x_n$ from $P$, we can easily compute the empirical kernel mean by the average

$$\hat{\mu}_P \triangleq \frac{1}{n}\sum_{i=1}^{n} k(x_i,\cdot). \tag{2}$$

The estimate $\hat{\mu}_P$ is the most commonly used estimate of the true kernel mean. Our primary interest here is to investigate whether one can improve upon this standard estimator.

The kernel mean has recently gained attention in the machine learning community, thanks to the introduction of Hilbert space embedding for distributions (Berlinet and Agnan; Smola et al., 2007). Representing the distribution as a mean function in the RKHS has several advantages: 1) the representation with appropriate choice of kernel $k$ has been shown to preserve all information about the distribution (Fukumizu et al.; Sriperumbudur et al.); 2) basic operations on the distribution can be carried out by means of inner products in RKHS, e.g., $\mathbb{E}_P[f(x)] = \langle f, \mu_P \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$; 3) no intermediate density estimation is required, e.g., when testing for homogeneity from finite samples. As a result, many algorithms have benefited from the kernel mean representation, namely, maximum mean discrepancy (MMD) (Gretton et al., 2007), kernel dependency measure (Gretton et al.), kernel two-sample test (Gretton et al.), Hilbert space embedding of HMMs (Song et al.), and kernel Bayes rule (Fukumizu et al.). Their performances rely directly on the quality of the empirical estimate $\hat{\mu}_P$.

However, it is of great importance, especially for our readers who are not familiar with kernel methods, to realize a more fundamental role of the kernel mean. It basically serves as a foundation to most kernel-based learning algorithms. For instance, nonlinear component analyses, such as kernel PCA, kernel FDA, and kernel CCA, rely heavily on mean functions and covariance operators in RKHS (Schölkopf et al., 1998). The kernel k-means algorithm performs clustering in feature space using mean functions as the representatives of the clusters (Dhillon et al.). Moreover, it also serves as a basis in early development of algorithms for classification and anomaly detection (Shawe-Taylor and Cristianini). All of those employ (2) as the estimate of the true mean function. Thus, the fact that substantial improvement can be gained when estimating (1) may in fact raise a widespread suspicion on the traditional way of learning with kernels.

We show in this work that the standard estimator (2) is, in a certain sense, not optimal, i.e., there exist better estimators (more below). In addition, we propose shrinkage estimators that outperform the standard one. At first glance, it was definitely counter-intuitive and surprising for us, and will undoubtedly also be for some of our readers, that the empirical kernel mean could be improved, and, given the simplicity of the proposed estimators, that this has remained unnoticed until now. One of the reasons may be that there is a common belief that the estimator $\hat{\mu}_P$ already gives a good estimate of $\mu_P$, and, as sample size goes to infinity, the estimation error disappears (Shawe-Taylor and Cristianini). As a result, no need is felt to improve the kernel mean estimation. However, given a finite sample, substantial improvement is in fact possible and several factors may come into play, as will be seen later in this work.
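To make the empirical kernel mean and the inner-product property concrete, the following is a minimal sketch rather than code from the paper. It assumes a Gaussian RBF kernel with a hypothetical bandwidth `sigma`, and the helper names (`rbf_gram`, `empirical_kernel_mean`, `embedding_inner`) are illustrative only. Evaluating the empirical kernel mean at a point `z` is the same as averaging the function `k(z, .)` over the sample, and the squared RKHS distance between two empirical means is the (biased) MMD estimate.

```python
import numpy as np

def rbf_gram(X, Z, sigma=1.0):
    """Gram matrix of the Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def empirical_kernel_mean(X, Z, sigma=1.0):
    """Evaluate the empirical kernel mean (1/n) sum_i k(x_i, z) at every row z of Z."""
    return rbf_gram(X, Z, sigma).mean(axis=0)

def embedding_inner(X, Y, sigma=1.0):
    """RKHS inner product of two empirical kernel means (uniform weights)."""
    return rbf_gram(X, Y, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # sample from P (illustrative data)
Y = rng.normal(loc=1.0, size=(150, 2))   # sample from a shifted Q (illustrative data)

mu_hat_at_X = empirical_kernel_mean(X, X)
# Biased MMD^2 estimate: squared RKHS distance between the two empirical means.
mmd2 = embedding_inner(X, X) - 2.0 * embedding_inner(X, Y) + embedding_inner(Y, Y)
```

Everything here is computed from Gram matrices alone, which is why the quality of the weights placed on the sample points matters for all downstream quantities.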
This work was partly inspired by Stein's seminal work in 1955, which showed that a maximum likelihood estimator (MLE), i.e., the standard empirical mean, for the mean of the multivariate Gaussian distribution $\mathcal{N}(\theta, \sigma^2 I)$ is inadmissible (Stein, 1955). That is, there exists an estimator that always achieves smaller total mean squared error regardless of the true $\theta$, when the dimension is at least 3. Perhaps the best known estimator of such kind is the James-Stein estimator (James and Stein, 1961). Interestingly, the James-Stein estimator is itself inadmissible, and there exists a wide class of estimators that outperform the MLE, see e.g., Berger (1976).

However, our work differs fundamentally from Stein's seminal works and those along this line in two aspects. First, our setting is non-parametric in the sense that we do not assume any parametric form of the distribution, whereas most traditional works focus on some specific distributions, e.g., the Gaussian distribution. Second, our setting involves a non-linear feature map into a high-dimensional space, if not infinite. As a result, higher moments of the distribution may come into play. Thus, one cannot adopt Stein's setting straightforwardly. A direct generalization of the James-Stein estimator to infinite-dimensional Hilbert space has already been considered (Berger and Wolpert, 1983; Mandelbaum and Shepp, 1987; Privault and Réveillac). In those works, $\theta$, which is the parameter to be estimated, is assumed to be the mean of a Gaussian measure on the Hilbert space from which samples are drawn. In our case, on the other hand, the samples are drawn from $P$ and not from the Gaussian distribution whose mean is $\mu_P$.

The contribution of this paper can be summarized as follows: First, we show that the standard kernel mean estimator can be improved by providing an alternative estimator that achieves smaller risk (§2). The theoretical analysis reveals the existence of a wide class of estimators that are better than the standard.
To this end, we propose in §3 a kernel mean shrinkage estimator (KMSE), which is based on a novel motivation for regularization through the notion of shrinkage. Moreover, we propose an efficient leave-one-out cross-validation procedure to select the shrinkage parameter, which is novel in the context of kernel mean estimation. Lastly, we demonstrate the benefit of the proposed estimators in several applications (§4).

2. Motivation: Shrinkage Estimators

For an arbitrary distribution $P$, denote by $\mu$ and $\hat{\mu}$ the true kernel mean and its empirical estimate (2) from the i.i.d. sample $x_1, x_2, \ldots, x_n \sim P$ (we remove the subscript for ease of notation). The most natural loss function considered in this work is $\ell(\mu, \hat{\mu}) = \|\mu - \hat{\mu}\|^2_{\mathcal{H}}$. An estimator $\hat{\mu}$ is a mapping which is measurable w.r.t. the Borel $\sigma$-algebra of $\mathcal{H}$ and is evaluated by its risk function $R(\mu, \hat{\mu}) = \mathbb{E}_P[\ell(\mu, \hat{\mu})]$, where $\mathbb{E}_P$ indicates expectation over the choice of i.i.d. sample of size $n$ from $P$.

Let us consider an alternative kernel mean estimator: $\hat{\mu}_\alpha \triangleq \alpha f^* + (1-\alpha)\hat{\mu}$, where $0 \le \alpha < 1$ and $f^* \in \mathcal{H}$. It is essentially a shrinkage estimator that shrinks the standard estimator toward a function $f^*$ by an amount specified by $\alpha$. If $\alpha = 0$, $\hat{\mu}_\alpha$ reduces to the standard estimator $\hat{\mu}$. The following theorem asserts that the risk of the shrinkage estimator $\hat{\mu}_\alpha$ is smaller than that of the standard estimator $\hat{\mu}$ given an appropriate choice of $\alpha$, regardless of the function $f^*$ (more below).

Theorem 1. For all distributions $P$ and the kernel $k$, there exists $\alpha > 0$ for which $R(\mu, \hat{\mu}_\alpha) < R(\mu, \hat{\mu})$.

Proof. The risk of the standard kernel mean estimator satisfies

$$\mathbb{E}\|\hat{\mu} - \mu\|^2 = \frac{1}{n}\left(\mathbb{E}[k(x,x)] - \mathbb{E}[k(x,\tilde{x})]\right) =: \Delta,$$

where $\tilde{x}$ is an independent copy of $x$. Let us define the risk of the proposed shrinkage estimator by $\Delta_\alpha := \mathbb{E}\|\hat{\mu}_\alpha - \mu\|^2$ where $\alpha$ is a non-negative shrinkage parameter. We can then write this in terms of the standard risk as

$$\Delta_\alpha = \Delta - 2\alpha\, \mathbb{E}\langle \hat{\mu} - \mu,\ \hat{\mu} - \mu + \mu - f^* \rangle + \alpha^2\, \mathbb{E}\|f^*\|^2 - 2\alpha^2\, \mathbb{E}[f^*(x)] + \alpha^2\, \mathbb{E}\|\hat{\mu}\|^2.$$

It follows from the reproducing property of $\mathcal{H}$ that $\mathbb{E}[f^*(x)] = \langle f^*, \mu \rangle$. Moreover, using the fact that $\mathbb{E}\|\hat{\mu}\|^2 = \mathbb{E}\|\hat{\mu} - \mu + \mu\|^2 = \Delta + \mathbb{E}[k(x,\tilde{x})]$, we can simplify the shrinkage risk by

$$\Delta_\alpha = \alpha^2\left(\Delta + \|f^* - \mu\|^2\right) - 2\alpha\Delta + \Delta.$$

Thus, we have $\Delta_\alpha - \Delta = \alpha^2\left(\Delta + \|f^* - \mu\|^2\right) - 2\alpha\Delta$, which is non-positive where

$$\alpha \in \left[0,\ \frac{2\Delta}{\Delta + \|f^* - \mu\|^2}\right] \tag{3}$$

and minimized at $\alpha^* = \Delta / (\Delta + \|f^* - \mu\|^2)$.

As we can see in (3), there is a range of $\alpha$ for which a non-positive $\Delta_\alpha - \Delta$, i.e., $R(\mu, \hat{\mu}_\alpha) \le R(\mu, \hat{\mu})$, is guaranteed. However, Theorem 1 relies on the important assumption that the true kernel mean of the distribution $P$ is required to estimate $\alpha$. In spite of this, the theorem has an important implication suggesting that the shrinkage estimator $\hat{\mu}_\alpha$ can improve upon $\hat{\mu}$ if $\alpha$ is chosen appropriately. Later, we will exploit this result in order to construct more practical estimators.

Remark 1. The following observations follow immediately from Theorem 1:
(i) The shrinkage estimator always improves upon the standard one regardless of the direction of shrinkage, as specified by $f^*$. In other words, there exists a wide class of kernel mean estimators that are better than the standard one.
(ii) The value of $\alpha$ also depends on the choice of $f^*$. The further $f^*$ is from $\mu$, the smaller $\alpha$ becomes. Thus, the shrinkage gets smaller if $f^*$ is chosen such that it is far from the true kernel mean. This effect is akin to the James-Stein estimator.
(iii) The improvement can be viewed as a bias-variance trade-off: the shrinkage estimator reduces variance substantially at the expense of a little bias.

Remark 1 sheds light on how one can practically construct the shrinkage estimator: we can choose $f^*$ arbitrarily as long as the parameter $\alpha$ is chosen appropriately. Moreover, further improvement can be gained by incorporating prior knowledge as to the location of $\mu_P$, which can be straightforwardly integrated into the framework via $f^*$ (Berger and Wolpert, 1983).
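The risk comparison in the theorem above can be checked numerically in a case where every quantity is available in closed form. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses the linear kernel, so that the kernel mean of a Gaussian with identity covariance is simply its mean vector, takes the shrinkage target to be zero, and uses the oracle value of the shrinkage amount, which requires the true mean and is therefore unavailable in practice. The values of `d`, `n`, `trials`, and `theta` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 5, 10, 20000
theta = np.ones(d)              # true mean; with the linear kernel, mu = theta

# Oracle quantities from the proof (known here only because P is known).
delta = d / n                                   # risk of the empirical mean for N(theta, I)
alpha_star = delta / (delta + theta @ theta)    # minimizing shrinkage amount, target 0

samples = rng.normal(loc=theta, size=(trials, n, d))
mu_hat = samples.mean(axis=1)                   # standard estimator, one per trial
mu_alpha = (1.0 - alpha_star) * mu_hat          # shrinkage toward zero

risk_standard = np.mean(np.sum((mu_hat - theta) ** 2, axis=1))
risk_shrunk = np.mean(np.sum((mu_alpha - theta) ** 2, axis=1))

# Risk predicted by plugging alpha_star into the simplified shrinkage risk.
predicted = delta - delta**2 / (delta + theta @ theta)
```

Up to Monte Carlo error, `risk_standard` should be close to `delta` and `risk_shrunk` close to `predicted`, which is strictly smaller. The gain shrinks as `theta` moves away from the target, in line with observation (ii).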
Inspired by the James-Stein estimator, we focus on $f^* = 0$. We will investigate the effect of different prior $f^*$ in future works.

3. Kernel Mean Shrinkage Estimator

In this section we give a novel formulation of the kernel mean estimator that allows us to estimate the shrinkage parameter efficiently. In the following, let $\phi : \mathcal{X} \to \mathcal{H}$ be a feature map associated with the kernel $k$ and $\langle \cdot, \cdot \rangle$ be an inner product in the RKHS $\mathcal{H}$ such that $k(x,x') = \langle \phi(x), \phi(x') \rangle$. Unless stated otherwise, $\|\cdot\|$ denotes the RKHS norm. The kernel mean $\mu_P$ and its empirical estimate $\hat{\mu}_P$ can be obtained as a minimizer of the loss functionals

$$\mathcal{E}(g) \triangleq \mathbb{E}_{x \sim P}\|\phi(x) - g\|^2, \qquad \hat{\mathcal{E}}(g) \triangleq \frac{1}{n}\sum_{i=1}^{n}\|\phi(x_i) - g\|^2,$$

respectively. We will call the estimator minimizing the loss functional $\hat{\mathcal{E}}(g)$ a kernel mean estimator (KME). Note that the loss $\mathcal{E}(g)$ is different from the one considered in §2, i.e., $\ell(\mu, g) = \|\mu - g\|^2 = \|\mathbb{E}[\phi(x)] - g\|^2$. Nevertheless, we have $\ell(\mu, g) = \mathbb{E}_{x x'} k(x,x') - 2\mathbb{E}_x g(x) + \|g\|^2$. Since $\mathcal{E}(g) = \mathbb{E}_x k(x,x) - 2\mathbb{E}_x g(x) + \|g\|^2$, the loss $\ell(\mu, g)$ differs from $\mathcal{E}(g)$ only by $\mathbb{E}_x k(x,x) - \mathbb{E}_{x x'} k(x,x')$, which is not a function of $g$. We introduce the new form here because it will give a more tractable cross-validation computation (§3.1). In spite of this, the resulting estimators are always evaluated w.r.t. the loss in §2 (cf. §4.1).

From the formulation above, it is natural to ask if minimizing the regularized version of $\hat{\mathcal{E}}(g)$ will give a better estimator. On the one hand, one can argue that, unlike in the classical risk minimization, we do not really need a regularizer here. The standard estimator (2) is known to be, in a certain sense, optimal and can be estimated reliably (Shawe-Taylor and Cristianini). Moreover, the original formulation of $\hat{\mathcal{E}}(g)$ is a well-posed problem. On the other hand, since regularization may be viewed as shrinking the solution toward zero, it can actually improve the kernel mean estimation, as suggested by Theorem 1 (cf. discussions at the end of §2). Consequently, we minimize a modified loss functional

$$\hat{\mathcal{E}}_\lambda(g) \triangleq \hat{\mathcal{E}}(g) + \lambda\,\Omega(\|g\|) = \frac{1}{n}\sum_{i=1}^{n}\|\phi(x_i) - g\|^2 + \lambda\,\Omega(\|g\|), \tag{4}$$

where $\Omega(\cdot)$ denotes a monotonically-increasing regularization functional and $\lambda$ is a non-negative regularization parameter.
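The claim that the empirical kernel mean minimizes the unregularized loss, and the way a regularizer turns the minimizer into a shrunk version of it, can be seen from a short worked derivation. It assumes the particular choice $\Omega(\|g\|) = \|g\|^2$ and writes $\lambda$ for the regularization parameter; other regularizers would need a separate argument.

```latex
\begin{aligned}
\hat{\mathcal{E}}(g)
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl\|\phi(x_i)-\hat{\mu}\bigr\|^2
     + \frac{2}{n}\sum_{i=1}^{n}\bigl\langle \phi(x_i)-\hat{\mu},\, \hat{\mu}-g \bigr\rangle
     + \|\hat{\mu}-g\|^2 \\
  &= \hat{\mathcal{E}}(\hat{\mu}) + \|\hat{\mu}-g\|^2
     \qquad\text{since } \textstyle\sum_{i}\bigl(\phi(x_i)-\hat{\mu}\bigr)=0, \\[4pt]
\hat{\mathcal{E}}(g) + \lambda\|g\|^2
  &= \hat{\mathcal{E}}(\hat{\mu})
     + (1+\lambda)\Bigl\|\, g - \tfrac{1}{1+\lambda}\hat{\mu} \Bigr\|^2
     + \tfrac{\lambda}{1+\lambda}\|\hat{\mu}\|^2 .
\end{aligned}
```

The first identity shows that the unregularized loss is minimized uniquely at $g = \hat{\mu}$. The second shows that with the squared-norm regularizer the minimizer is $\hat{\mu}/(1+\lambda)$, i.e., the standard estimate shrunk toward zero.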
In what follows, we refer to the shrinkage estimator $\hat{\mu}_\lambda$ minimizing $\hat{\mathcal{E}}_\lambda(g)$ as a kernel mean shrinkage estimator (KMSE). The parameters $\alpha$ and $\lambda$ play a similar role as a shrinkage parameter. They specify the amount by which the standard estimator $\hat{\mu}$ is shrunk toward $f^* = 0$. Thus, the terms shrinkage parameter and regularization parameter will be used interchangeably.
It follows from the representer theorem that $g$ lies in a subspace spanned by the data, i.e., $g = \sum_{j=1}^{n} \beta_j \phi(x_j)$ for some $\beta \in \mathbb{R}^n$. By considering $\Omega(\|g\|) = \|g\|^2$, we can rewrite (4) as

$$\frac{1}{n}\sum_{i=1}^{n}\Bigl\|\phi(x_i) - \sum_{j=1}^{n}\beta_j\phi(x_j)\Bigr\|^2 + \lambda\Bigl\|\sum_{j=1}^{n}\beta_j\phi(x_j)\Bigr\|^2 = \beta^\top K \beta - 2\beta^\top K \mathbf{1}_n + \lambda\beta^\top K \beta + c, \tag{5}$$

where $c$ is a constant term, $K$ is a Gram matrix such that $K_{ij} = k(x_i, x_j)$, and $\mathbf{1}_n = [1/n, 1/n, \ldots, 1/n]^\top$. Taking a derivative of (5) w.r.t. $\beta$ and setting it to zero yield $\beta = (1/(1+\lambda))\mathbf{1}_n$. By setting $\alpha = \lambda/(1+\lambda)$ the shrinkage estimate can be written as $\hat{\mu}_\lambda = (1-\alpha)\hat{\mu}$. Since $0 < \alpha < 1$, the estimator $\hat{\mu}_\lambda$ corresponds to a shrinkage estimator discussed in §2 when $f^* = 0$. We call this estimator a simple kernel mean shrinkage estimator (S-KMSE).

Using the expansion $g = \sum_{j=1}^{n}\beta_j\phi(x_j)$, we may consider when the regularization functional is written in terms of $\beta$, e.g., $\beta^\top\beta$. This leads to a particularly interesting kernel mean estimator. In this case, the optimal weight vector is given by $\beta = (K + \lambda I)^{-1} K \mathbf{1}_n$ and the shrinkage estimate can be written accordingly as $\hat{\mu}_\lambda = \sum_{j=1}^{n}\beta_j\phi(x_j) = \Phi (K + \lambda I)^{-1} K \mathbf{1}_n$ where $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$. Unlike the S-KMSE, this estimator shrinks the usual estimate differently in each coordinate (cf. Theorem 2). Hence, we will call it a flexible kernel mean shrinkage estimator (F-KMSE).

The following theorem characterizes the F-KMSE as a shrinkage estimator.

Theorem 2. The F-KMSE can be written as

$$\hat{\mu}_\lambda = \sum_{i=1}^{n} \frac{\gamma_i}{\gamma_i + \lambda}\, \langle \hat{\mu}, v_i \rangle\, v_i,$$

where $\{\gamma_i, v_i\}$ are eigenvalue and eigenvector pairs of the empirical covariance operator $\hat{C}_{xx}$ in $\mathcal{H}$.

In words, the effect of F-KMSE is to reduce high frequency components of the expansion of $\hat{\mu}$, by expanding this in terms of the kernel PCA basis and shrinking the coefficients of the high order eigenfunctions, e.g., see Rasmussen and Williams. Note that the covariance operator $\hat{C}_{xx}$ itself does not depend on $\lambda$.

As we can see, the solution to the regularized version is indeed of the form of shrinkage estimators when $f^* = 0$. That is, both S-KMSE and F-KMSE shrink the standard kernel mean estimate towards zero.
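The two weight vectors can be computed directly from the Gram matrix. The sketch below is illustrative rather than the authors' implementation: the function names, the RBF bandwidth, and the value of the regularization parameter `lam` are assumptions. It also computes the flexible weights through an eigendecomposition of the Gram matrix, which makes the direction-wise shrinkage factor `d / (d + lam)` visible. The eigenvalues used here are those of the Gram matrix, which is a choice made for this sketch.

```python
import numpy as np

def skmse_weights(n, lam):
    """Simple shrinkage: uniform weights 1/n scaled by 1 / (1 + lam)."""
    return np.full(n, 1.0 / n) / (1.0 + lam)

def fkmse_weights(K, lam):
    """Flexible shrinkage: beta = (K + lam I)^{-1} K 1_n with 1_n = [1/n, ..., 1/n]."""
    n = K.shape[0]
    one_n = np.full(n, 1.0 / n)
    return np.linalg.solve(K + lam * np.eye(n), K @ one_n)

def fkmse_weights_spectral(K, lam):
    """Same weights via K = U D U^T: each eigen-direction is scaled by d / (d + lam)."""
    n = K.shape[0]
    d, U = np.linalg.eigh(K)
    one_n = np.full(n, 1.0 / n)
    return U @ ((d / (d + lam)) * (U.T @ one_n))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                       # illustrative data
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                              # RBF Gram matrix, bandwidth 1 (assumed)

lam = 0.1                                          # hypothetical regularization value
beta_s = skmse_weights(len(X), lam)
beta_f = fkmse_weights(K, lam)
beta_f_spec = fkmse_weights_spectral(K, lam)
```

Both weight vectors sum to less than one for a positive `lam`, which is the sense in which the estimate is shrunk toward zero. Once the eigendecomposition is available, weights for further values of `lam` cost only matrix-vector products.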
The difference between the two estimators is that the S-KMSE shrinks equally in all coordinates, whereas the F-KMSE also constrains the amount of shrinkage by the information contained in each coordinate. Moreover, the squared RKHS norm $\|\cdot\|^2_{\mathcal{H}}$ can be decomposed as a sum of squared loss weighted by the eigenvalues $\gamma_i$ (cf. Mandelbaum and Shepp (1987, appendix)). By the same reasoning as Stein's result in the finite-dimensional case, one would suspect that the improvement of shrinkage estimators in $\mathcal{H}$ should also depend on how fast the eigenvalues of $k$ decay. That is, one would expect greater improvement if the values of $\gamma_i$ decay very slowly. For example, the Gaussian RBF kernel with larger bandwidth gives smaller improvement when compared to one with smaller bandwidth. Similarly, we should expect to see more improvement when applying a Laplacian kernel than when using a Gaussian RBF kernel.

In some applications of kernel mean embedding, one may want to interpret the weight $\beta$ as a probability vector (Nishiyama et al.). However, the weight vector $\beta$ output by our estimators is in general not normalized. In fact, all elements will be smaller than $1/n$ as a result of shrinkage. However, one may impose a constraint that $\beta$ must sum to one and resort to quadratic programming (Song et al.). Unfortunately, this approach has the undesirable effect of sparsity, which is unlikely to improve upon the standard estimator. Post-normalizing the weights often deteriorates the estimation performance.

To the best of our knowledge, no previous attempt has been made to improve the kernel mean estimation. However, we discuss some closely related works here. For example, instead of the loss functional $\hat{\mathcal{E}}(g)$, Kim and Scott consider a robust loss function such as Huber's loss to reduce the effect of outliers. The authors consider kernel density estimators, which differ fundamentally from kernel mean estimators. They need to reduce the kernel bandwidth with increasing sample size for the estimators to be consistent. A regularized version of MMD was adopted by Danafar et al. in the context of kernel-based hypothesis testing.
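The resulting formulation of Danafar et al. resembles our S-KMSE. Furthermore, the F-KMSE is of a similar form as the conditional mean embedding used in Grünewälder et al., which can be viewed more generally as a regression problem in RKHS with smooth operators (Grünewälder et al.).

3.1. Choosing Shrinkage Parameter

As discussed earlier, the amount of shrinkage plays an important role in our estimators. In this work we propose to select the shrinkage parameter by an automatic leave-one-out cross-validation. For a given shrinkage parameter $\lambda$, let us consider the observation $x_i$ as being a new observation by omitting it from the dataset. Denote by $\hat{\mu}_\lambda^{(-i)} = \sum_{j \ne i} \beta_j^{(-i)} \phi(x_j)$ the kernel mean estimated from the remaining data, using the value $\lambda$ as a shrinkage parameter, so that $\beta^{(-i)}$ is the minimizer of $\hat{\mathcal{E}}_\lambda^{(-i)}(g)$. We will measure the quality of $\hat{\mu}_\lambda^{(-i)}$ by how well it approximates $\phi(x_i)$. The overall quality of the estimate is quantified by the cross-validation score

$$\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\bigl\|\phi(x_i) - \hat{\mu}_\lambda^{(-i)}\bigr\|^2_{\mathcal{H}}. \tag{6}$$

By simple algebra, it is not difficult to show that the optimal shrinkage parameter of S-KMSE can be calculated analytically, as stated by the following theorem.

Theorem 3. Let $\rho \triangleq \frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j)$ and $\varrho \triangleq \frac{1}{n}\sum_{i=1}^{n} k(x_i, x_i)$. The shrinkage parameter $\lambda^* = (\varrho - \rho) / \bigl((n-1)\rho + \varrho/(n-1)\bigr)$ of the S-KMSE is the minimizer of $\mathrm{LOOCV}(\lambda)$.

On the other hand, finding the optimal $\lambda$ for the F-KMSE is relatively more involved. Evaluating the score (6) naively requires one to solve for $\hat{\mu}_\lambda^{(-i)}$ explicitly for every $i$. Fortunately, we can simplify the score such that it can be evaluated efficiently, as stated in the following theorem.

Theorem 4. The LOOCV score of F-KMSE satisfies

$$\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n} (\beta^\top K - K_{\cdot i})^\top C_\lambda\, (\beta^\top K - K_{\cdot i}),$$

where $\beta$ is the weight vector calculated from the full dataset with the shrinkage parameter $\lambda$ and $C_\lambda = \bigl(K - \tfrac{1}{n}K(K+\lambda I)^{-1}K\bigr)^{-1} K \bigl(K - \tfrac{1}{n}K(K+\lambda I)^{-1}K\bigr)^{-1}$.

Proof of Theorem 4. For fixed $\lambda$ and $i$, let $\hat{\mu}_\lambda^{(-i)}$ be the leave-one-out kernel mean estimate of F-KMSE and let $A \triangleq (K + \lambda I)^{-1}$. Then, we can write an expression for the deleted residual as

$$\Delta_\lambda^{(-i)} := \hat{\mu}_\lambda^{(-i)} - \phi(x_i) = \hat{\mu}_\lambda - \phi(x_i) + \frac{1}{n}\sum_{j=1}^{n}\sum_{l=1}^{n} A_{jl}\,\bigl\langle \phi(x_l),\ \hat{\mu}_\lambda^{(-i)} - \phi(x_i) \bigr\rangle\, \phi(x_j).$$

Since $\Delta_\lambda^{(-i)}$ lies in a subspace spanned by the sample $\phi(x_1), \ldots, \phi(x_n)$, we have $\Delta_\lambda^{(-i)} = \sum_{k=1}^{n} \xi_k \phi(x_k)$ for some $\xi \in \mathbb{R}^n$. Substituting $\Delta_\lambda^{(-i)}$ back yields

$$\sum_{k=1}^{n}\xi_k\phi(x_k) = \hat{\mu}_\lambda - \phi(x_i) + \frac{1}{n}\sum_{j=1}^{n}\{AK\xi\}_j\, \phi(x_j).$$

By taking the inner product on both sides w.r.t. the sample $\phi(x_1), \ldots, \phi(x_n)$ and solving for $\xi$, we have $\xi = \bigl(K - \tfrac{1}{n}KAK\bigr)^{-1}(\beta^\top K - K_{\cdot i})$ where $K_{\cdot i}$ is the $i$th column of $K$. Consequently, the leave-one-out score of the sample $x_i$ can be computed by

$$\bigl\|\Delta_\lambda^{(-i)}\bigr\|^2 = \xi^\top K \xi = (\beta^\top K - K_{\cdot i})^\top \bigl(K - \tfrac{1}{n}KAK\bigr)^{-1} K \bigl(K - \tfrac{1}{n}KAK\bigr)^{-1} (\beta^\top K - K_{\cdot i}) = (\beta^\top K - K_{\cdot i})^\top C_\lambda\, (\beta^\top K - K_{\cdot i}).$$

Averaging $\|\Delta_\lambda^{(-i)}\|^2$ over all samples gives $\mathrm{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\|\Delta_\lambda^{(-i)}\|^2 = \frac{1}{n}\sum_{i=1}^{n}(\beta^\top K - K_{\cdot i})^\top C_\lambda\, (\beta^\top K - K_{\cdot i})$, as required.

It is interesting to see that the leave-one-out cross-validation score in Theorem 4 depends only on the non-leave-one-out solution $\beta_\lambda$, which can be obtained as a by-product of the algorithm.

Computational complexity. The S-KMSE requires $O(n^2)$ operations to select the shrinkage parameter. For the F-KMSE, cross-validation involves two steps: computing the weights $\beta_\lambda$ and evaluating the score (6).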
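The closed-form leave-one-out score of the flexible estimator can be sketched numerically as follows. This is an illustration under explicit assumptions rather than the authors' code: it takes the matrix inside the score to be `K - (1/n) K A K` with `A = (K + lam I)^{-1}`, requires `lam > 0` and an invertible Gram matrix, and uses hypothetical data, bandwidth, and search grid. Instead of forming the matrix `C` explicitly, it solves for the coefficient vector `xi` of each deleted residual and sums `xi^T K xi`.

```python
import numpy as np

def fkmse_loocv(K, lam):
    """Leave-one-out score of the flexible estimator from the full-data solution only."""
    n = K.shape[0]
    one_n = np.full(n, 1.0 / n)
    A = np.linalg.inv(K + lam * np.eye(n))    # A = (K + lam I)^{-1}
    beta = A @ (K @ one_n)                    # full-data weights
    M = K - (K @ A @ K) / n                   # K - (1/n) K A K (assumed form)
    R = (K @ beta)[:, None] - K               # column i is K beta - K[:, i]
    Xi = np.linalg.solve(M, R)                # column i holds xi for sample i
    return float(np.sum(Xi * (K @ Xi)) / n)   # (1/n) sum_i xi_i^T K xi_i

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 3))                  # illustrative data
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                         # RBF Gram matrix, bandwidth 1 (assumed)

lams = np.logspace(-3, 1, 20)                 # hypothetical search grid
scores = np.array([fkmse_loocv(K, lam) for lam in lams])
best_lam = lams[np.argmin(scores)]
```

A grid search is used here only for simplicity; any one-dimensional optimizer over `lam` could take its place.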
In the first step, we need to compute $(K + \lambda I)^{-1}$ repeatedly for different values of $\lambda$. Assume that we know the eigendecomposition $K = UDU^\top$ where $D$ is diagonal with $d_{ii} \ge 0$ and $UU^\top = I$. It follows that $(K + \lambda I)^{-1} = U(D + \lambda I)^{-1}U^\top$. Consequently, solving for $\beta_\lambda$ takes $O(n^2)$ operations. Since eigendecomposition requires $O(n^3)$ operations, finding $\beta_\lambda$ for many $\lambda$'s is essentially free. A low-rank approximation can also be adopted to reduce the computational cost further.

In the second step, we need to compute the cross-validation score (6). As shown in Theorem 4, we can compute it using only $\beta_\lambda$ obtained from the previous step. The calculation of $C_\lambda$ can be simplified further via the eigendecomposition of $K$ as

$$C_\lambda = U\bigl(D - \tfrac{1}{n}D(D + \lambda I)^{-1}D\bigr)^{-1} D\, \bigl(D - \tfrac{1}{n}D(D + \lambda I)^{-1}D\bigr)^{-1} U^\top.$$

Since it only involves the inverse of diagonal matrices, the inversion can be evaluated in $O(n)$ operations. The overall computational complexity of the cross-validation requires only $O(n^3)$ operations, as opposed to the naive approach that requires $O(n^4)$ operations. When performed as a by-product of the algorithm, the computational cost of the cross-validation procedure becomes negligible as the dataset becomes larger. In practice, we use the fminsearch and fminbnd routines of the MATLAB optimization toolbox to find the best shrinkage parameter.

3.2. Covariance Operators

The covariance operator from $\mathcal{H}_X$ to $\mathcal{H}_Y$ can be viewed as a mean function in a product space $\mathcal{H}_X \otimes \mathcal{H}_Y$. Hence, we can also construct a shrinkage estimator of the covariance operator in RKHS. Let $(\mathcal{H}_X, k_X)$ and $(\mathcal{H}_Y, k_Y)$ be the RKHS of functions on measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively, with p.d. kernels $k_X$ and $k_Y$ (with feature maps $\phi$ and $\varphi$). We will consider a random vector $(X, Y) : \Omega \to \mathcal{X} \times \mathcal{Y}$ with distribution $P_{XY}$, with $P_X$ and $P_Y$ as marginal distributions. Under some conditions, there exists a unique cross-covariance operator $\Sigma_{YX} : \mathcal{H}_X \to \mathcal{H}_Y$ such that

$$\langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = \mathbb{E}_{XY}\bigl[(f(X) - \mathbb{E}_X[f(X)])(g(Y) - \mathbb{E}_Y[g(Y)])\bigr] = \mathrm{Cov}(f(X), g(Y))$$

holds for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$ (Fukumizu et al.). If $X$ equals $Y$, we get the self-adjoint operator $\Sigma_{XX}$ called the covariance operator.
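Given an i.i.d. sample from $P_{XY}$ written as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we can write the empirical cross-covariance operator as

$$\hat{\Sigma}_{YX} := \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) \otimes \varphi(y_i) - \hat{\mu}_X \otimes \hat{\mu}_Y,$$

where $\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)$ and $\hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^{n}\varphi(y_i)$. Let $\tilde{\phi}$ and $\tilde{\varphi}$ be the centered feature maps of $\phi$ and $\varphi$, respectively. Then, it can be rewritten as $\hat{\Sigma}_{YX} := \frac{1}{n}\sum_{i=1}^{n} \tilde{\phi}(x_i) \otimes \tilde{\varphi}(y_i) \in \mathcal{H}_X \otimes \mathcal{H}_Y$. It follows from the inner product property in product space that

$$\langle \tilde{\phi}(x) \otimes \tilde{\varphi}(y),\ \tilde{\phi}(x') \otimes \tilde{\varphi}(y') \rangle_{\mathcal{H}_X \otimes \mathcal{H}_Y} = \langle \tilde{\phi}(x), \tilde{\phi}(x') \rangle_{\mathcal{H}_X}\, \langle \tilde{\varphi}(y), \tilde{\varphi}(y') \rangle_{\mathcal{H}_Y} = \tilde{k}_X(x,x')\, \tilde{k}_Y(y,y').$$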
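In Gram-matrix terms, the product-space identity above says that the Gram matrix of the product kernel is the elementwise product of the two centered Gram matrices. The following sketch checks this on illustrative data. Linear kernels are used only because their feature maps are explicit, so the product features can be formed as outer products; the helper name `center_gram` is hypothetical.

```python
import numpy as np

def center_gram(K):
    """Gram matrix of the empirically centered feature map: H K H, H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))     # illustrative paired sample (x_i, y_i)
Y = rng.normal(size=(20, 2))

# Centered Gram matrices of the two (linear) kernels.
Kx_c = center_gram(X @ X.T)
Ky_c = center_gram(Y @ Y.T)

# Gram matrix of the product kernel: elementwise (Hadamard) product.
K_prod = Kx_c * Ky_c

# Explicit check: centered features, outer products, then ordinary inner products.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
F = np.einsum("ip,iq->ipq", Xc, Yc).reshape(len(X), -1)
K_explicit = F @ F.T
```

Any Gram-matrix-based kernel mean estimator can then be applied to `K_prod` without ever forming the product features.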
With this identity, we can obtain the shrinkage estimators for the covariance operator by plugging the kernel $k((x,y),(x',y')) = \tilde{k}_X(x,x')\, \tilde{k}_Y(y,y')$ into our KMSEs. We will call this estimator a covariance-operator shrinkage estimator (COSE). The same trick can be easily generalized to tensors of higher order, which have been previously used, for example, in Song et al.

Figure 1 (panels: (a) LIN, (b) POLY2, (c) POLY3, (d) RBF). The average loss of KME (left), S-KMSE (middle), and F-KMSE (right) estimators with different values of the shrinkage parameter, given as multiples of $\gamma$. Inside boxes correspond to estimators. We repeat the experiments over different distributions with fixed $n$ and $d$.

4. Experiments

We focus on the comparison between our shrinkage estimators and the standard estimator of the kernel mean using both synthetic datasets and real-world datasets.

4.1. Synthetic Data

Given the true data-generating distribution $P$, we evaluate different estimators using the loss function

$$\ell(\beta) \triangleq \Bigl\|\sum_{i=1}^{n}\beta_i k(x_i, \cdot) - \mathbb{E}_P[k(x, \cdot)]\Bigr\|^2_{\mathcal{H}},$$

where $\beta$ is the weight vector associated with different estimators. To allow for an exact calculation of $\ell(\beta)$, we consider when $P$ is a mixture-of-Gaussians distribution and $k$ is one of the following kernel functions: 1) linear kernel $k(x,x') = x^\top x'$; 2) polynomial degree-2 kernel $k(x,x') = (x^\top x' + 1)^2$; 3) polynomial degree-3 kernel $k(x,x') = (x^\top x' + 1)^3$; and 4) Gaussian RBF kernel $k(x,x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$. We will refer to them as LIN, POLY2, POLY3, and RBF, respectively.

Experimental protocol. Data are generated from a $d$-dimensional mixture of Gaussians with additive Gaussian noise, $x \sim \sum_{i=1}^{4} \pi_i\, \mathcal{N}(\theta_i, \Sigma_i) + \varepsilon$, where the entries $\theta_{ij}$ of the component means are drawn from a uniform distribution $\mathcal{U}(a,b)$, the covariances $\Sigma_i$ are drawn from a Wishart distribution $\mathcal{W}(\Sigma, df)$ with $df = 7$ and scale matrix proportional to $I_d$, and the noise $\varepsilon$ is Gaussian with covariance proportional to $I_d$. The four mixing proportions $\pi$ are fixed. The choice of parameters here is quite arbitrary; we have experimented using various parameter settings and the results are similar to those presented here. For the Gaussian RBF kernel, we set the bandwidth parameter to the square root of the median squared Euclidean distance between samples in the dataset (i.e., $\sigma^2 = \mathrm{median}\{\|x_i - x_j\|^2\}$ throughout).
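The bandwidth rule can be sketched as follows. It assumes the reading in which the squared bandwidth equals the median of the squared pairwise Euclidean distances and in which the RBF kernel carries a factor of 2 in the denominator. Both are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def median_heuristic_sigma2(X):
    """sigma^2 = median of squared Euclidean distances over all pairs i < j (assumed rule)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(X), k=1)      # each unordered pair once, no diagonal
    return float(np.median(sq[upper]))

def rbf_gram_from_sigma2(X, sigma2):
    """Gaussian RBF Gram matrix exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma2))

# Three points with squared pairwise distances 4, 4 and 8.
X_small = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
sigma2_small = median_heuristic_sigma2(X_small)
K_small = rbf_gram_from_sigma2(X_small, sigma2_small)
```

Excluding the zero diagonal matters for small samples, since including it would pull the median toward zero.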
Figure 1 shows the average loss of different estimators using different kernels as we increase the value of the shrinkage parameter. Here we scale the shrinkage parameter by the minimum non-zero eigenvalue $\gamma$ of the kernel matrix $K$. In general, we find S-KMSE and F-KMSE tend to outperform KME. However, as $\lambda$ becomes large, there are some cases where shrinkage deteriorates the estimation performance, e.g., see the LIN kernel and some outliers in the figures. This suggests that it is very important to choose the parameter $\lambda$ appropriately (cf. the discussion above).

Similarly, Figure 2 depicts the average loss as we vary the sample size and dimension of the data. In this case, the shrinkage parameter is chosen by the proposed leave-one-out cross-validation score. As we can see, both S-KMSE and F-KMSE outperform the standard KME. The S-KMSE performs slightly better than the F-KMSE. Moreover, the improvement is more substantial in the "large $d$, small $n$" paradigm. In the worst cases, the S-KMSE and F-KMSE perform as well as the KME. Lastly, it is instructive to note that the improvement varies with the choice of kernel $k$. Briefly, the choice of kernel reflects the dimensionality of the feature space $\mathcal{H}$. One would expect more improvement in a high-dimensional space, e.g., the RBF kernel, than in a low-dimensional one, e.g., the linear kernel (cf. discussions at the end of §3). This phenomenon can be observed in both Figure 1 and 2.

Figure 2 (panels: LIN, POLY2, POLY3, RBF; average loss against sample size with $d$ fixed, and against dimension with $n$ fixed). The average loss over different distributions of KME, S-KMSE, and F-KMSE with varying sample size ($n$) and dimension ($d$). The shrinkage parameter $\lambda$ is chosen by LOOCV.

Table 1. Average negative log-likelihood of the model $Q$ on test points over randomizations. The boldface represents the result whose difference from the baseline, i.e., KME, is statistically significant.
Columns: KME, S-KMSE, and F-KMSE under each of the LIN, POLY2, POLY3, and RBF kernels.
Rows (datasets): ionosphere, sonar, australian, specft, wdbc, wine, satimage, segment, vehicle, svmguide, vowel, housing, bodyfat, abalone, glass.

4.2. Real Data

We consider three benchmark applications: density estimation via kernel mean matching (Song et al.), kernel PCA using shrinkage mean and covariance operator (Schölkopf et al., 1998), and discriminative learning on distributions (Muandet and Schölkopf; Muandet et al.). For the first two tasks we employ datasets from the UCI repositories. We use only real-valued features, each of which is normalized to have zero mean and unit variance.

Density estimation. We perform density estimation via kernel mean matching (Song et al.). That is, we fit the density $Q = \sum_{j=1}^{m} \pi_j\, \mathcal{N}(\theta_j, \sigma_j^2 I)$ to each dataset by minimizing $\|\hat{\mu} - \mu_Q\|^2_{\mathcal{H}}$ s.t. $\sum_{j=1}^{m}\pi_j = 1$. The kernel mean $\hat{\mu}$ is obtained from the samples using different estimators, whereas $\mu_Q$ is the kernel mean embedding of the density $Q$. Unlike experiments in Song et al., our goal is to compare different estimators of $\mu_P$ where $P$ is the true data distribution. That is, we replace $\hat{\mu}$ with a version obtained via shrinkage.
A better estimate of $\mu_P$ should lead to better density estimation, as measured by the negative log-likelihood of $Q$ on the test set. A held-out portion of each dataset serves as the test set, and the number of mixture components $m$ is the same for every dataset. The model is initialized by running several random initializations using the k-means algorithm and returning the best. We repeat the experiments several times and perform the paired sign test on the results at a fixed significance level. The average negative log-likelihood of the model $Q$, optimized via different estimators, is reported in Table 1. Clearly, both S-KMSE and F-KMSE consistently achieve smaller negative log-likelihood when compared to KME. There are however a few cases in which KME outperforms the proposed estimators, especially when the dataset is relatively large, e.g., satimage and abalone. We suspect that in those cases the standard KME already provides an accurate estimate of the kernel mean. To get a better estimate, more effort is required to optimize for the shrinkage parameter. Moreover, the improvement across different kernels is consistent with results on the synthetic datasets.

Kernel PCA. In this experiment, we perform the KPCA using different estimates of the mean and covariance operators. We compare the reconstruction error $E_{\mathrm{proj}}(z) = \|\phi(z) - P\phi(z)\|^2$ on test samples, where $P$ is the projection constructed from the leading principal components. We use a Gaussian RBF kernel for all datasets. We compare different scenarios: 1) standard KPCA; 2) shrinkage centering with S-KMSE; 3) shrinkage centering with F-KMSE; 4) KPCA with S-COSE; and 5) KPCA with F-COSE. To perform KPCA on the shrinkage covariance operator, we solve the generalized eigenvalue problem $K_c B K_c V = K_c V D$ where $B = \mathrm{diag}(\beta)$ and $K_c$ is the centered Gram matrix. The weight vector $\beta$ is obtained from shrinkage estimators using the kernel matrix $K_c \circ K_c$ where $\circ$ denotes the Hadamard product. A held-out portion of each dataset serves as the test set.

Note on the paired sign test: it is a nonparametric test that can be used to examine whether two paired samples have the same distribution. In our case, we compare S-KMSE and F-KMSE against KME.
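For readers unfamiliar with the paired sign test, a minimal exact version can be written with the standard library alone. This is a generic sketch of the test, not the procedure or data used in the experiments: ties are dropped, the p-value is two-sided, and the two lists of scores below are made-up placeholders.

```python
from math import comb

def paired_sign_test(a, b):
    """Two-sided exact sign test for paired samples a and b (ties are dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(1 for v in diffs if v > 0)
    k = min(positives, n - positives)
    # Under the null, the number of positive differences is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)

# Hypothetical paired scores over ten repetitions (placeholder numbers only).
baseline = [1.30, 1.25, 1.41, 1.38, 1.29, 1.33, 1.36, 1.40, 1.27, 1.35]
shrinkage = [1.22, 1.20, 1.35, 1.30, 1.25, 1.28, 1.31, 1.33, 1.21, 1.30]
p_value = paired_sign_test(baseline, shrinkage)
```

With all ten differences of the same sign, the two-sided p-value is 2 / 2^10. The test uses only the signs of the differences, so it makes no assumption about their magnitudes.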
Figure 3 (reconstruction error of KME, S-KMSE, F-KMSE, S-COSE, and F-COSE on the datasets ionosphere, sonar, australian, specft, wdbc, wine, satimage, segment, vehicle, svmguide, vowel, housing, bodyfat, abalone, glass). The average reconstruction error of KPCA on hold-out test samples over repetitions. The KME represents the standard approach, whereas S-KMSE and F-KMSE use shrinkage means to perform centering. The S-COSE and F-COSE directly use the shrinkage estimate of the covariance operator.

Figure 3 illustrates the results of KPCA. Clearly, the S-COSE and F-COSE consistently outperform all other estimators. Although we observe an improvement of S-KMSE and F-KMSE over KME, it is very small compared to that of S-COSE and F-COSE. This makes sense intuitively, since changing the mean point or shifting data does not change the covariance structure considerably, so it will not significantly affect the reconstruction error.

Table 2. The classification accuracy of SMM and the area under ROC curve (AUC) of OCSMM using different kernel mean estimators to construct the kernel on distributions.
Columns: SMM and OCSMM under the linear kernel, SMM and OCSMM under the nonlinear kernel.
Rows (estimators): KME, S-KMSE, F-KMSE.

Discriminative learning on distributions. A positive semi-definite kernel between distributions can be defined via their kernel mean embeddings. That is, given a training sample $(\hat{P}_1, y_1), \ldots, (\hat{P}_m, y_m) \in \mathcal{P} \times \{-1, +1\}$ where $\hat{P}_i := \frac{1}{n}\sum_{k=1}^{n}\delta_{x_k^i}$ and $x_k^i \sim P_i$, the linear kernel between two distributions is approximated by

$$\langle \hat{\mu}_{P_i}, \hat{\mu}_{P_j} \rangle = \Bigl\langle \sum_{k=1}^{n}\beta_k^i \phi(x_k^i),\ \sum_{l=1}^{n}\beta_l^j \phi(x_l^j) \Bigr\rangle = \sum_{k,l=1}^{n} \beta_k^i \beta_l^j\, k(x_k^i, x_l^j).$$

The weight vectors $\beta^i$ and $\beta^j$ come from the kernel mean estimates of $\mu_{P_i}$ and $\mu_{P_j}$, respectively. The nonlinear kernel can then be defined accordingly, e.g., $\kappa(P_i, P_j) = \exp(-\|\hat{\mu}_{P_i} - \hat{\mu}_{P_j}\|^2_{\mathcal{H}} / 2\sigma^2)$. Our goal in this experiment is to investigate if the shrinkage estimate of the kernel mean improves the performance of the discriminative learning on distributions. To this end, we conduct experiments on natural scene categorization using the support measure machine (SMM) (Muandet et al.) and group anomaly detection on a high-energy physics dataset using the one-class SMM (OCSMM) (Muandet and Schölkopf).
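The linear and Gaussian kernels between distributions defined above reduce to weighted sums over Gram matrices between the two samples. The sketch below is illustrative only. The function names, the bandwidth `sigma`, the factor of 2 in the exponent, and the sample data are assumptions. The weights may be uniform (standard estimator) or any shrunk weight vector, and a linear embedding kernel is used in the example so that the result can be checked against sample means.

```python
import numpy as np

def embedding_inner(beta_i, Xi, beta_j, Xj, gram):
    """<mu_i, mu_j> = sum_{k,l} beta_i[k] beta_j[l] k(x_k^i, x_l^j)."""
    return float(beta_i @ gram(Xi, Xj) @ beta_j)

def distribution_rbf_kernel(beta_i, Xi, beta_j, Xj, gram, sigma=1.0):
    """exp(-||mu_i - mu_j||^2 / (2 sigma^2)) computed from the weighted embeddings."""
    sq = (embedding_inner(beta_i, Xi, beta_i, Xi, gram)
          - 2.0 * embedding_inner(beta_i, Xi, beta_j, Xj, gram)
          + embedding_inner(beta_j, Xj, beta_j, Xj, gram))
    return float(np.exp(-max(sq, 0.0) / (2.0 * sigma**2)))

def linear_gram(A, B):
    """Linear embedding kernel, used here only because it is easy to verify."""
    return A @ B.T

rng = np.random.default_rng(4)
Xi = rng.normal(size=(40, 3))                 # sample representing P_i (illustrative)
Xj = rng.normal(loc=0.5, size=(60, 3))        # sample representing P_j (illustrative)
beta_i = np.full(len(Xi), 1.0 / len(Xi))      # uniform weights, i.e. the standard estimator
beta_j = np.full(len(Xj), 1.0 / len(Xj))

lin_ij = embedding_inner(beta_i, Xi, beta_j, Xj, linear_gram)
rbf_ij = distribution_rbf_kernel(beta_i, Xi, beta_j, Xj, linear_gram)
rbf_ii = distribution_rbf_kernel(beta_i, Xi, beta_i, Xi, linear_gram)
```

Swapping in shrunk weights changes only `beta_i` and `beta_j`; the rest of the pipeline is untouched.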
We use both linear and nonlinear kernels where the Gaussian RBF kernel is employed as an embedding kernel (Muandet et al.). All hyper-parameters are chosen by cross-validation. For our unsupervised problem, we repeat the experiments using several parameter settings and report the best results. Table 2 reports the classification accuracy of SMM and the area under ROC curve (AUC) of OCSMM using different kernel mean estimators. Both shrinkage estimators consistently lead to better performance on both SMM and OCSMM when compared to KME.

To summarize, we find sufficient evidence to conclude that both S-KMSE and F-KMSE outperform the standard KME. The performance of S-KMSE and F-KMSE is very competitive. The difference depends on the dataset and the kernel function.

5. Conclusions

To conclude, we show that the commonly used kernel mean estimator can be improved. Our theoretical result suggests that there exists a wide class of kernel mean estimators that are better than the standard one. To demonstrate this, we focus on two efficient shrinkage estimators, namely, simple and flexible kernel mean shrinkage estimators. Empirical study clearly shows that the proposed estimators outperform the standard one in various scenarios. Most importantly, the shrinkage estimates not only provide more accurate estimation, but also lead to superior performance on real-world applications.

Acknowledgments

The authors wish to thank David Hogg and Ross Fedely for reading the first draft and anonymous reviewers who gave valuable suggestions that have helped to improve the manuscript.