PLUG-IN BANDWIDTH SELECTOR FOR THE KERNEL RELATIVE DENSITY ESTIMATOR


ELISA MARÍA MOLANES-LÓPEZ AND RICARDO CAO
Departamento de Matemáticas, Facultade de Informática, Universidade da Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain

Abstract

This paper is focused on two kernel relative density estimators in a two-sample problem. An asymptotic expression for the mean integrated squared error of these estimators is found and, based on it, two solve-the-equation plug-in bandwidth selectors are proposed. In order to examine their practical performance, a simulation study and a practical application to a medical dataset are carried out.

Key words and phrases: kernel-type estimates, smoothing parameter, solve-the-equation rules, survival analysis, two-sample problem.

1 Introduction

The study of differences among groups or changes over time is a goal in fields such as medical research and social science research. The traditional method for this purpose is the usual parametric location and scale analysis. However, this is a very restrictive tool, since a lot of the information available in the data is inaccessible. In order to make a better use of this information it is convenient to focus on distributional analysis, i.e.,

on the general two-sample problem of comparing the cumulative distribution functions (cdf), $F_0$ and $F_1$, of two random variables, $X_0$ and $X_1$. Useful tools for this purpose are the relative distribution function, $R(t)$, and the relative density function, $r(t)$, of $X_1$ with respect to (wrt) $X_0$:

$$R(t) = P(F_0(X_1) \le t) = F_1(F_0^{-1}(t)), \quad 0 < t < 1,$$

where $F_0^{-1}(t) = \inf\{x \mid F_0(x) \ge t\}$ denotes the quantile function of $F_0$, and

$$r(t) = R^{(1)}(t) = \frac{f_1(F_0^{-1}(t))}{f_0(F_0^{-1}(t))}, \quad 0 < t < 1,$$

where $f_0$ and $f_1$ are the densities pertaining to $F_0$ and $F_1$, respectively. These two curves, as well as estimators for them, have been studied by Gastwirth (1968), Ćwik and Mielniczuk (1993), Hsieh (1995), Hsieh and Turnbull (1996), Cao et al. (2000, 2001) and Handcock and Janssen (2002).

These functions, $R$ and $r$, are closely related to other statistical methods. The ROC curve, used in the evaluation of the performance of medical tests for separating two groups, is related to $R$ through the relationship $ROC(t) = 1 - R(1 - t)$ (see, for instance, Holmgren (1996) and Li et al. (1996) for details), and the density ratio $f_1(x)/f_0(x)$, $x \in \mathbb{R}$, used by Silverman (1978), is linked to $r$ through $f_1(x)/f_0(x) = r(F_0(x))$.

Throughout the paper, we will focus on two kernel-type estimators of $r$, similar to the one already proposed by Ćwik and Mielniczuk (1993). In the following section we will give some notation and obtain an asymptotic representation for the MISE of the relative density estimators. This differs from Ćwik and Mielniczuk (1993), where an asymptotic expression for the MISE of only the dominant part of the estimator was found. Section 3 is concerned with automatic global bandwidth selection. Two solve-the-equation plug-in bandwidth selectors based on the ideas by Sheather and Jones

(1991) are proposed. A simulation study is shown in Section 4, where the performance of the data-driven selectors proposed in this paper is compared with the selector proposed in Ćwik and Mielniczuk (1993). A medical application is presented in Section 5. Finally, the proofs of the results presented in Sections 2 and 3 are included in Section 6.

2 Kernel relative density estimators

Consider the two-sample problem with completely observed data:

$$\{X_{01}, \ldots, X_{0n}\}, \quad \{X_{11}, \ldots, X_{1m}\},$$

where the $X_{0i}$'s are independent and identically distributed as $X_0$, and the $X_{1j}$'s are independent and identically distributed as $X_1$. These two sequences are independent of each other.

Throughout this paper all the asymptotic results are obtained when both sample sizes $m$ and $n$ tend to infinity in such a way that, for some constant $0 < \lambda < \infty$, $\lim_{m \to \infty} m/n = \lambda$.

We assume the following conditions on the underlying distributions, the kernels $K$ and $M$ and the bandwidths $h$ and $h_0$ to be used in the estimators (see (1), (2) and (3) below):

(F1) $F_0$ and $F_1$ have continuous density functions, $f_0$ and $f_1$, respectively.
(F2) $f_0$ is a three times differentiable density function with $f_0^{(3)}$ bounded.
(R1) $r$ is a differentiable density with compact support contained in $[0, 1]$, with $r^{(1)}$ bounded.

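(K1) $K$ is a symmetric four times differentiable density function with compact support $[-1, 1]$ and $K^{(4)}$ bounded.
(K2) $M$ is a symmetric density and continuous function except at a finite set of points.
(B1) $h \to 0$ and $m h^3 \to \infty$.
(B2) $h_0 \to 0$ and $n h_0^4 \to 0$.

Since $\frac{1}{h}\int K\left(\frac{t - z}{h}\right) dR(z)$ is close to $r(t)$ and, for smooth distributions, it is satisfied that

$$\frac{1}{h}\int K\left(\frac{t - z}{h}\right) dR(z) = \frac{1}{h}\int K\left(\frac{t - F_0(z)}{h}\right) dF_1(z),$$

a natural way to define a kernel-type estimator of $r(t)$ is to replace the unknown functions $F_0$ and $F_1$ by some appropriate estimators. We consider two proposals:

$$\hat r_h(t) = \int K_h(t - F_{0n}(z))\, dF_{1m}(z) = \frac{1}{m}\sum_{j=1}^{m} K_h(t - F_{0n}(X_{1j})) \tag{1}$$

and

$$\hat r_{h,h_0}(t) = \int K_h(t - \tilde F_{0n}(z))\, dF_{1m}(z) = \frac{1}{m}\sum_{j=1}^{m} K_h(t - \tilde F_{0n}(X_{1j})), \tag{2}$$

where $K_h(t) = \frac{1}{h}K\left(\frac{t}{h}\right)$, $K$ is a kernel function, $h$ is the bandwidth used to estimate $r$, $F_{0n}$ and $F_{1m}$ are the empirical distribution functions based on the $X_{0i}$'s and $X_{1j}$'s, respectively, and $\tilde F_{0n}$ is a kernel-type estimate of $F_0$ given by

$$\tilde F_{0n}(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{M}\left(\frac{x - X_{0i}}{h_0}\right), \tag{3}$$

where $\mathbb{M}$ denotes the cdf of the kernel $M$ and $h_0$ is the bandwidth used to estimate $F_0$.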
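The two estimators (1) and (2) can be illustrated with a minimal sketch. It assumes the Epanechnikov kernel for $K$ and the uniform density on $[-1, 1]$ for $M$ (the choices the paper names later for its implementation); the function names are hypothetical, and no boundary correction is applied.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def uniform_cdf(u):
    """cdf of the uniform density M on [-1, 1]."""
    return np.clip((np.asarray(u, dtype=float) + 1.0) / 2.0, 0.0, 1.0)

def ecdf(x0, z):
    """Empirical cdf F_{0n} of the sample x0, evaluated at the points z."""
    x0 = np.sort(np.asarray(x0, dtype=float))
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return np.searchsorted(x0, z, side="right") / x0.size

def smoothed_cdf(x0, z, h0):
    """Kernel-type cdf estimate (3): average of M-cdf values at (z - X_0i) / h0."""
    x0 = np.asarray(x0, dtype=float)
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return uniform_cdf((z[:, None] - x0[None, :]) / h0).mean(axis=1)

def relative_density(t, x0, x1, h, h0=None):
    """Estimator (1) when h0 is None, estimator (2) otherwise."""
    rel = ecdf(x0, x1) if h0 is None else smoothed_cdf(x0, x1, h0)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # (1/m) * sum_j K_h(t - rel_j), with K_h(u) = K(u / h) / h
    return epanechnikov((t[:, None] - rel[None, :]) / h).mean(axis=1) / h
```

Calling `relative_density(t, x0, x1, h)` evaluates (1) on the grid `t`, while passing `h0` switches to the smoothed relative data used in (2).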

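Using a Taylor expansion, $\hat r_h(t)$ can be written as follows:

$$\hat r_h(t) = \int K_h(t - F_0(z))\, dF_{1m}(z) + \int K_h^{(1)}(t - F_0(z))\,(F_0(z) - F_{0n}(z))\, dF_{1m}(z) + \int (F_0(z) - F_{0n}(z))^2 \int_0^1 (1 - s)\, K_h^{(2)}\big(t - F_0(z) - s(F_{0n}(z) - F_0(z))\big)\, ds\, dF_{1m}(z).$$

Let us define $\tilde U_n = F_{0n} \circ F_0^{-1}$ and $\tilde R_m = F_{1m} \circ F_0^{-1}$. Then, $\hat r_h(t)$ can be rewritten in a useful way for the study of its mean integrated squared error (MISE):

$$\hat r_h(t) = \tilde r_h(t) + A_1 + A_2 + B, \tag{4}$$

where

$$\tilde r_h(t) = \int K_h(t - F_0(z))\, dF_{1m}(z) = \frac{1}{m}\sum_{j=1}^{m} K_h(t - F_0(X_{1j})),$$

$$A_1 = \int (v - \tilde U_n(v))\, K_h^{(1)}(t - v)\, d(\tilde R_m - R)(v),$$

$$A_2 = \frac{1}{n}\sum_{i=1}^{n} \int \big(F_0(w) - 1\{X_{0i} \le w\}\big)\, K_h^{(1)}(t - F_0(w))\, dF_1(w),$$

$$B = \int (F_0(z) - F_{0n}(z))^2 \int_0^1 (1 - s)\, K_h^{(2)}\big(t - F_0(z) - s(F_{0n}(z) - F_0(z))\big)\, ds\, dF_{1m}(z).$$

Proceeding in a similar way, we can rewrite $\hat r_{h,h_0}$ as follows:

$$\hat r_{h,h_0}(t) = \tilde r_h(t) + A_1 + A_2 + \hat A + \hat B, \tag{5}$$

where

$$\hat A = \int (F_{0n}(w) - \tilde F_{0n}(w))\, K_h^{(1)}(t - F_0(w))\, dF_{1m}(w),$$

$$\hat B = \int (F_0(z) - \tilde F_{0n}(z))^2 \int_0^1 (1 - s)\, K_h^{(2)}\big(t - F_0(z) - s(\tilde F_{0n}(z) - F_0(z))\big)\, ds\, dF_{1m}(z).$$

Our main result is an asymptotic representation for the MISE of $\hat r_h(t)$ and $\hat r_{h,h_0}(t)$. To simplify the exposition of the results obtained, from here on we will denote $C(g) = \int g^2(x)\, dx$, for any square integrable function $g$.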

Theorem 1 (AMISE). Assume conditions (F1), (R1), (K1) and (B1). Then

$$MISE(\hat r_h) = AMISE(h) + o\left(\frac{1}{mh} + h^4\right) + o\left(\frac{1}{nh}\right),$$

with

$$AMISE(h) = \frac{1}{mh} C(K) + \frac{1}{4} h^4 d_K^2\, C(r^{(2)}) + \frac{1}{nh} C(r)\, C(K),$$

where $d_K = \int x^2 K(x)\, dx$. If the conditions (F2), (K2) and (B2) are assumed as well, then the same result is satisfied for $MISE(\hat r_{h,h_0})$.

Remark 1. From Theorem 1 it follows that the optimal bandwidth, minimizing the asymptotic mean integrated squared error of any of the estimators considered for $r$, is given by

$$h_{AMISE} = \left(\frac{C(K)(\lambda C(r) + 1)}{d_K^2\, C(r^{(2)})\, m}\right)^{1/5}. \tag{6}$$

Remark 2. Note that $AMISE(h)$ derived from Theorem 1 does not depend on the bandwidth $h_0$. A higher-order analysis should be considered to address simultaneously the bandwidth selection problem of $h$ and $h_0$.

3 Bandwidth selectors

3.1 Estimation of density functionals

It is very simple to show that, under sufficiently smooth conditions on $r$ ($r \in C^{2l}(\mathbb{R})$), the functionals

$$C(r^{(l)}) = \int \big(r^{(l)}(x)\big)^2\, dx \tag{7}$$

appearing in (6), are related to other general functionals of $r$, denoted by $\Psi_{2l}$:

$$C(r^{(l)}) = (-1)^l \int r^{(2l)}(x)\, r(x)\, dx = (-1)^l\, \Psi_{2l}, \tag{8}$$

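where $\Psi_l = \int r^{(l)}(x)\, r(x)\, dx = E\big[r^{(l)}(F_0(X_1))\big]$. The equation above suggests a natural kernel-type estimator for $\Psi_l$ as follows:

$$\hat\Psi_l(g) = \frac{1}{m}\sum_{j=1}^{m}\left[\frac{1}{m}\sum_{k=1}^{m} L_g^{(l)}\big(F_{0n}(X_{1j}) - F_{0n}(X_{1k})\big)\right], \tag{9}$$

where $L$ is a kernel function and $g$ is a smoothing parameter called pilot bandwidth. As in the previous section, this is not the only possibility and we could consider another estimator of $\Psi_l$,

$$\tilde\Psi_l(g) = \frac{1}{m}\sum_{j=1}^{m}\left[\frac{1}{m}\sum_{k=1}^{m} L_g^{(l)}\big(\tilde F_{0n}(X_{1j}) - \tilde F_{0n}(X_{1k})\big)\right], \tag{10}$$

where $F_{0n}$ in (9) is replaced by $\tilde F_{0n}$. Since the difference between both estimators decreases as $h_0$ tends to zero, it is expected to obtain the same theoretical results for both estimators. Therefore, we will only show theoretical results for $\hat\Psi_l(g)$.

We will obtain the asymptotic mean squared error of $\hat\Psi_l(g)$ under the following assumptions:

(R3) The relative density $r \in C^{l+6}(\mathbb{R})$.
(K3) The kernel $L$ is a symmetric kernel of order 2, $L \in C^{l+7}(\mathbb{R})$, and satisfies $(-1)^{\frac{l}{2}+2} L^{(l)}(0)\, d_L > 0$ and $L^{(l)}(1) = L^{(l+1)}(1) = 0$, with $d_L = \int x^2 L(x)\, dx$.
(B3) $g = g_m$ is a positive-valued sequence of bandwidths satisfying $\lim_{m \to \infty} g = 0$ and $\lim_{m \to \infty} m\, g^{\max\{\alpha, \beta\}} = \infty$, where $\alpha = \frac{2(l+7)}{5}$ and $\beta = \frac{1}{2}(2l+1)$.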
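As an illustration of the double sum in (9), the following sketch evaluates the estimator on a vector of relative data. It assumes a standard Gaussian kernel for $L$, which is a simplification for readability and not the compactly supported kernel required by (K3); the function names are hypothetical. The derivative of the Gaussian density is computed through the probabilists' Hermite polynomials, $\phi^{(l)}(x) = (-1)^l He_l(x)\phi(x)$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def gaussian_kernel_derivative(x, l):
    """l-th derivative of the standard normal density: (-1)^l He_l(x) phi(x)."""
    x = np.asarray(x, dtype=float)
    coeffs = np.zeros(l + 1)
    coeffs[l] = 1.0  # selects the Hermite polynomial He_l
    phi = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return (-1.0) ** l * hermeval(x, coeffs) * phi

def psi_hat(rel, g, l=0):
    """Kernel estimate of Psi_l from relative data rel_j = F_{0n}(X_{1j}).

    Uses L_g^{(l)}(u) = g^{-(l+1)} L^{(l)}(u / g) and keeps the diagonal
    terms j = k, as in the double sum of (9).
    """
    rel = np.asarray(rel, dtype=float)
    diffs = (rel[:, None] - rel[None, :]) / g
    total = gaussian_kernel_derivative(diffs, l).sum()
    return total / (rel.size**2 * g ** (l + 1))
```

Replacing `rel` by smoothed relative data $\tilde F_{0n}(X_{1j})$ gives the analogue of (10) under the same assumed kernel.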

Condition (R3) implies a smooth behaviour of $r$ at the boundary of its support, contained in $[0, 1]$. If this smoothness fails, the quantity $C(r^{(l)})$ could still be estimated through its definition, using a kernel estimate of $r^{(l)}$ (see Hall and Marron (1987) for the one-sample problem setting). Condition (K3) can only hold for even $l$. Observe that in condition (B3), for even $l$, $\max\{\alpha, \beta\} = \alpha$ for $l = 0, 2$ and $\max\{\alpha, \beta\} = \beta$ for $l = 4, 6, \ldots$

Theorem 2. Assume conditions (F1), (R3), (K3) and (B3). Then it follows that

$$MSE\big(\hat\Psi_l(g)\big) = \left[\frac{1}{m g^{l+1}} L^{(l)}(0)(1 + \lambda\Psi_0) + \frac{1}{2} d_L \Psi_{l+2}\, g^2 + O(g^4) + o\left(\frac{1}{n g^{l+1}}\right)\right]^2 + \frac{2}{m^2 g^{2l+1}} \Psi_0\, C(L^{(l)}) + o\left(\frac{1}{m^2 g^{2l+1}}\right) + O\left(\frac{1}{n}\right). \tag{11}$$

Remark 3. The first term in the right-hand side of (11) corresponds to the squared bias term of the MSE. Note that, using (K3) and (8), the main bias term can be made to vanish by choosing $g = g_l$, with

$$g_l = \left(\frac{2 L^{(l)}(0)(\lambda\Psi_0 + 1)}{-d_L \Psi_{l+2}\, m}\right)^{\frac{1}{l+3}} = \left(\frac{2 L^{(l)}(0)\, d_K^2\, \Psi_4}{-d_L \Psi_{l+2}\, C(K)}\right)^{\frac{1}{l+3}} h_{AMISE}^{\frac{5}{l+3}}.$$

3.2 STE rules based on Sheather & Jones ideas

As in the context of ordinary density estimation, the practical implementation of the kernel-type estimators proposed here (see (1) and (2)) requires the choice of the smoothing parameter $h$. Our two proposals, $h_{SJ_1}$ and $h_{SJ_2}$, as well as the selector $b_{3c}$ recommended by Ćwik and Mielniczuk (1993), are modifications of Sheather & Jones (1991). Since the Sheather & Jones selector is the solution of an equation in the bandwidth, it is also known as a solve-the-equation (STE) rule. Motivated by the formula (6)

for the AMISE-optimal bandwidth and the relation (8), solve-the-equation rules require that $h$ is chosen to satisfy the relationship

$$h = \left(\frac{C(K)\big(\lambda\, \tilde\Psi_0(\gamma_1(h)) + 1\big)}{d_K^2\, \tilde\Psi_4(\gamma_2(h))\, m}\right)^{1/5},$$

where the pilot bandwidths for the estimation of $\Psi_0$ and $\Psi_4$ are functions of $h$ ($\gamma_1(h)$ and $\gamma_2(h)$, respectively). Motivated by Remark 3, we suggest taking

$$\gamma_1(h) = \left(\frac{2 L(0)\, d_K^2\, \tilde\Psi_4(g_4)}{-d_L\, \tilde\Psi_2(g_2)\, C(K)}\right)^{1/3} h^{5/3}, \qquad \gamma_2(h) = \left(\frac{2 L^{(4)}(0)\, d_K^2\, \tilde\Psi_4(g_4)}{-d_L\, \tilde\Psi_6(g_6)\, C(K)}\right)^{1/7} h^{5/7},$$

where $\tilde\Psi_j(\cdot)$, $j = 0, 2, 4, 6$, are kernel estimates. Note that this way of proceeding leads to a never-ending process in which a bandwidth selection problem must be solved at every stage. To make this iterative process feasible in practice, one possibility is to propose a stopping stage in which the unknown quantities are estimated using a parametric scale for $r$. This strategy is known in the literature as the stage selection problem (see Wand and Jones 1995). While the selector $b_{3c}$ in Ćwik and Mielniczuk (1993) used a Gaussian scale, for the implementation of $h_{SJ_2}$ we will use a mixture of betas based on the Weierstrass approximation theorem and Bernstein polynomials associated with any continuous function on $[0, 1]$ (see Kakizawa (2004) and references therein for the motivation of this method). Later on we will show the formula for computing the above-mentioned reference scale, together with the selector $b_{3c}$ we used in Section 4.

In the following we denote the Epanechnikov kernel by $K$, the uniform density in $[-1, 1]$ by $M$, and we define $L$ as follows:

$$L(x) = \frac{\Gamma(18)}{2\,\Gamma(9)\Gamma(9)} \left(\frac{x+1}{2}\right)^{8} \left(1 - \frac{x+1}{2}\right)^{8} 1\{|x| \le 1\}.$$
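The constants $C(K)$ and $d_K$ that enter the bandwidth formula can be checked numerically. The sketch below is illustrative only: it evaluates them for the Epanechnikov kernel on $[-1, 1]$ with a trapezoidal rule and then applies the closed form $h = \big(C(K)(\lambda\Psi_0 + 1) / (d_K^2 \Psi_4 m)\big)^{1/5}$ for given values of the functionals. The function names and the numerical values passed to them are hypothetical placeholders.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_constants(kernel, lo=-1.0, hi=1.0, n_grid=200001):
    """Return (C(K), d_K) = (int K^2, int x^2 K) via the trapezoidal rule."""
    x = np.linspace(lo, hi, n_grid)
    k = kernel(x)
    dx = x[1] - x[0]

    def trapezoid(y):
        return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

    return trapezoid(k**2), trapezoid(x**2 * k)

def amise_bandwidth(c_k, d_k, psi0, psi4, m, lam):
    """Closed-form minimizer of the AMISE, with C(r) = Psi_0 and C(r'') = Psi_4."""
    return (c_k * (lam * psi0 + 1.0) / (d_k**2 * psi4 * m)) ** 0.2

c_k, d_k = kernel_constants(epanechnikov)  # approximately 3/5 and 1/5
```

In a solve-the-equation rule the arguments `psi0` and `psi4` are themselves kernel estimates that depend on $h$, so this closed form becomes the right-hand side of a fixed-point equation rather than a final answer.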

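Next, we detail the steps required in the implementation of $h_{SJ_2}$.

Step 1. Obtain $\hat\Psi_j^{PR}$ ($j = 0, 4, 6, 8$), parametric estimates for $\Psi_j$ ($j = 0, 4, 6, 8$), with the replacement of $r$ in $C(r^{(j/2)})$ (see (7)) by a mixture of betas, $b(x)$, as will be explained later on (see (12)).

Step 2. Compute kernel estimates for $\Psi_j$ ($j = 2, 4, 6$), by using $\tilde\Psi_j(g_j^{PR})$ ($j = 2, 4, 6$), with

$$g_j^{PR} = \left(\frac{2 L^{(j)}(0)\big(\lambda \hat\Psi_0^{PR} + 1\big)}{-d_L\, \hat\Psi_{j+2}^{PR}\, m}\right)^{\frac{1}{j+3}}, \quad j = 2, 4, 6.$$

Step 3. Estimate $\Psi_0$ and $\Psi_4$, using (10), by means of $\tilde\Psi_0(\hat\gamma_1(h))$ and $\tilde\Psi_4(\hat\gamma_2(h))$, where

$$\hat\gamma_1(h) = \left(\frac{2 L(0)\, d_K^2\, \tilde\Psi_4(g_4^{PR})}{-d_L\, \tilde\Psi_2(g_2^{PR})\, C(K)}\right)^{1/3} h^{5/3} \quad \text{and} \quad \hat\gamma_2(h) = \left(\frac{2 L^{(4)}(0)\, d_K^2\, \tilde\Psi_4(g_4^{PR})}{-d_L\, \tilde\Psi_6(g_6^{PR})\, C(K)}\right)^{1/7} h^{5/7}.$$

Step 4. Select the bandwidth $h_{SJ_2}$ as the one that solves the following equation in $h$:

$$h = \left(\frac{C(K)\big(\lambda\, \tilde\Psi_0(\hat\gamma_1(h)) + 1\big)}{d_K^2\, \tilde\Psi_4(\hat\gamma_2(h))\, m}\right)^{1/5}.$$

In order to solve the equation above, it will be necessary to use a numerical algorithm. In the simulation study we will use the false-position method. The main reason is that the false-position algorithm does not require the computation of derivatives, which considerably simplifies the implementation of the proposed bandwidth selectors. At the same time, this algorithm has some advantages over others because it tries to combine the speed of methods such as the secant method with the security afforded by the bisection method.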
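A generic version of the false-position (regula falsi) iteration mentioned above can be sketched as follows. This is a textbook implementation, not the authors' code; the bracketing interval, tolerance and iteration cap are assumptions chosen for illustration. To apply it to Step 4, one would pass `f(h) = h - rhs(h)`, where `rhs` is a user-supplied (hypothetical) function evaluating the right-hand side of the equation.

```python
def false_position(f, a, b, tol=1e-10, max_iter=200):
    """Find a root of f in [a, b] by the false-position method.

    Requires f(a) and f(b) to have opposite signs (or one of them to be 0).
    Only function values are used, no derivatives.
    """
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if fa * fb > 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = a
    for _ in range(max_iter):
        # Secant-type update that always stays inside the bracket [a, b]
        c = b - fb * (b - a) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:
            return c
        # Keep the sub-interval on which the sign change occurs (bisection-like safety)
        if fa * fc < 0.0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return c
```

For example, `false_position(lambda x: x**2 - 2.0, 1.0, 2.0)` approximates the square root of 2. The bracket must be chosen so that the function changes sign over it, which in the bandwidth setting means picking a very small and a very large candidate value of $h$.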

Unlike the Gaussian parametric reference, used to obtain $b_{3c}$, the selector $h_{SJ_2}$ uses in Step 1 a mixture of betas as follows:

$$b(x) = \sum_{j=1}^{N}\left(\tilde R_{n,m}\left(\frac{j}{N}\right) - \tilde R_{n,m}\left(\frac{j-1}{N}\right)\right)\beta(x, j, N - j + 1), \tag{12}$$

where

$$\tilde R_{n,m}(x) = \frac{1}{m}\sum_{j=1}^{m}\mathbb{M}\left(\frac{x - \tilde F_{0n}(X_{1j})}{\tilde g}\right), \tag{13}$$

$$\tilde g = \left(\frac{2\int x\, M(x)\, \mathbb{M}(x)\, dx}{n\, d_M^2\, C(r^{(1)})}\right)^{1/3}, \tag{14}$$

$\beta(x, a, b)$ stands for the beta density

$$\beta(x, a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1 - x)^{b-1}, \quad x \in [0, 1],$$

and $N$ is the number of betas in the mixture.

Since we are trying to estimate a density with support in $[0, 1]$, it seems more suitable to consider a parametric reference with this support. A mixture of betas is an appropriate option because it is flexible enough to model a large variety of relative densities, when derivatives of order 1, 3 and 4 are also required. Note that, for the sake of simplicity, we are using above the AMISE-optimal bandwidth $\tilde g$ for estimating a distribution function in the setting of a one-sample problem (see Polansky and Baker (2000) for more details on the kernel-type estimate of a distribution function). The use of this bandwidth requires the previous estimation of the unknown functional $C(r^{(1)})$. We will consider a quick and dirty method, the rule of thumb, that uses a parametric reference for $r$ to estimate the above-mentioned unknown quantity. More specifically, our reference scale will be a beta with parameters $(p, q)$ estimated from the smoothed relative sample $\{\tilde F_{0n}(X_{1j})\}_{j=1}^{m}$, using the method of moments.

Following the same ideas as for (13) and (14), the bandwidth selector used for the

kernel-type estimator $\tilde F_{0n}$ introduced in (3) is based on the AMISE-optimal bandwidth in the one-sample problem:

$$h_0 = \left(\frac{2\int x\, M(x)\, \mathbb{M}(x)\, dx}{n\, d_M^2\, C(f_0^{(1)})}\right)^{1/3}.$$

As was already mentioned above, in most cases this methodology will be applied to survival analysis, so it is natural to assume that our samples come from distributions with support on the positive real line. Therefore, a gamma reference distribution, Gamma$(\alpha, \beta)$, has been considered, where the parameters $(\alpha, \beta)$ are estimated from the smoothed relative sample $\{\tilde F_{0n}(X_{1j})\}_{j=1}^{m}$, using the method of moments.

For the implementation of $h_{SJ_1}$, we proceed analogously to that of $h_{SJ_2}$ above. The only difference now is that, throughout the previous discussion, $\tilde\Psi_j(\cdot)$ and $\tilde F_{0n}(\cdot)$ are replaced by, respectively, $\hat\Psi_j(\cdot)$ and $F_{0n}(\cdot)$.

As a variant of the selector that Ćwik and Mielniczuk (1993) proposed, $b_{3c}$ is obtained as the solution to the following equation:

$$b_{3c} = \left(\frac{C(K)\big(1 + \lambda \hat\Psi_0(a)\big)}{d_K^2\, \hat\Psi_4(\alpha_2(b_{3c}))\, m}\right)^{1/5},$$

where $a = 1.781\, \hat\sigma\, m^{-1/3}$, $\hat\sigma = \min\{s_m, \widehat{IQR}/1.349\}$, $s_m$ and $\widehat{IQR}$ denote, respectively, the empirical standard deviation and the sample interquartile range of the relative data $\{F_{0n}(X_{1j})\}_{j=1}^{m}$,

$$\alpha_2(b_{3c}) = 0.7694\left(\frac{\hat\Psi_4(g_4^{GS})}{-\hat\Psi_6(g_6^{GS})}\right)^{1/7} b_{3c}^{5/7},$$

where GS stands for standard Gaussian scale, $g_4^{GS} = 1.2407\, \hat\sigma\, m^{-1/7}$, $g_6^{GS} = 1.2304\, \hat\sigma\, m^{-1/9}$, and the estimates $\hat\Psi_j$ with $j = 0, 4, 6$ were obtained using (9), with $L$ replaced by the standard Gaussian kernel and with data-driven bandwidth selectors derived from

reducing the two-sample problem to a one-sample problem.

It is interesting to note that all the kernel-type estimators presented previously ($\hat r_h(t)$, $\hat r_{h,h_0}(t)$, $\tilde R_{n,m}(x)$ and $\tilde F_{0n}(x)$) were not corrected to take into account, respectively, the fact that $r$ and $R$ have support on $[0, 1]$ instead of on the whole real line, and the fact that $f_0$ is supported only on the positive real line. Therefore, in order to correct the boundary effect in practical applications, we will use the well-known reflection method to modify $\hat r_h(t)$, $\hat r_{h,h_0}(t)$, $\tilde R_{n,m}(x)$ and $\tilde F_{0n}(x)$ where needed.

4 Simulations

We compare, through a simulation study, the performance of the bandwidth selectors $h_{SJ_1}$ and $h_{SJ_2}$, proposed in Section 3, with the standard competitor $b_{3c}$ recommended by Ćwik and Mielniczuk (1993). Although we are aware that the smoothing parameter $N$ introduced in (12) should be selected in some optimal way based on the data, this issue goes beyond the scope of this article. Consequently, from here on, we will consider $N = 4$ components in the beta mixture reference scale model given by (12).

We will consider the first sample coming from the random variate $X_0 = W^{-1}(U)$ and the second sample coming from the random variate $X_1 = W^{-1}(S)$, where $U$ denotes a uniform distribution in the compact interval $[0, 1]$, $W$ is the distribution function of a Weibull distribution with parameters (2, 3) and $S$ is a random variate from one of the following five different populations (see Figure 1):

(a) A beta distribution with parameters 4 and 7 ($\beta(4, 7)$).
(b) A mixture consisting of $V_1$ with probability $\frac{4}{5}$ and $V_2$ with probability $\frac{1}{5}$, where $V_1 = \beta(4, 37)$ and $V_2 = \beta(4, 2)$.

(c) A mixture consisting of $V_1$ with probability $\frac{1}{3}$ and $V_2$ with probability $\frac{2}{3}$, where $V_1 = \beta(34, 5)$ and $V_2 = \beta(5, 3)$.

Put Figure 1 about here

Choosing different values for the pair of sample sizes $m$ and $n$ and under each of the models presented above, we start by drawing 5 pairs of random samples and, according to every method, we select the bandwidths $\hat h$. Then, in order to check their performance, we approximate by Monte Carlo the mean integrated squared error, EM, between the true relative density and the kernel-type estimate for $r$, given by (1) when $\hat h = b_{3c}, h_{SJ_1}$ or by (2) when $\hat h = h_{SJ_2}$. The computation of the kernel-type estimates can be very time consuming with a direct algorithm. Therefore, we will use linear binned approximations that, thanks to their discrete convolution structures, can be computed quickly using the fast Fourier transform (FFT) (see Wand and Jones (1995) for more details). For all the models, the values of this criterion for the three bandwidth selectors, $h_{SJ_1}$, $h_{SJ_2}$ and $b_{3c}$, can be found in Table 1.

Put Table 1 about here

A careful look at the table shows that the new selector $h_{SJ_2}$ behaves much better than the selector $b_{3c}$, especially when the sample sizes are equal or when $m$ is larger than $n$. The improvement is even larger for unimodal relative densities (models (a) and (b)). On the other hand, the other proposal, $h_{SJ_1}$, presents only a moderate improvement over $b_{3c}$ for unimodal relative densities (models (a) and (b)) and performs only slightly better or even worse than $b_{3c}$ for bimodal relative densities (model (c)).

The ratio $m/n$ produces an important effect on the behaviour of any of the

three selectors considered. For instance, an asymmetric behaviour of the selectors in terms of the sample sizes is clearly seen.

Other proposals for selecting $h$ have been investigated. For instance, versions of $h_{SJ_2}$ were considered in which either the unknown functionals $\Psi$ are estimated from the viewpoint of a one-sample problem, or the STE rule is modified in such a way that only the function $\hat\gamma_2$ is considered in the equation to be solved (see Step 4 in Section 3). After a simulation study similar to the one detailed here, but now carried out for these versions of $h_{SJ_2}$, a practical performance similar to that of $h_{SJ_2}$ was observed. However, a worse behaviour was observed when, in the implementation of these versions of $h_{SJ_2}$, the smooth estimate of $F_0$ is replaced by the empirical distribution function $F_{0n}$. Therefore, although $h_{SJ_2}$ requires the selection of two bandwidth parameters, a clearly better practical behaviour is observed when considering the smoothed relative data instead of the non-smoothed ones.

5 A medical application

In this section we apply the plug-in STE selector $h_{SJ_2}$ detailed above to estimate the relative density for a real data set concerned with prostate cancer (PC). The data consist of 599 patients suffering from PC (+) and 835 patients PC-free (-). For each patient the illness status has been determined through a prostate biopsy carried out for the first time in Hospital Juan Canalejo (Galicia, Spain) between January of 2002 and September of 2005.

In the literature, there is an increasing interest in finding a good diagnostic test that helps in the early detection of PC and avoids the need to undergo a prostate biopsy. Several studies have investigated, through ROC curves, the

performance of different diagnostic tests based on some analytic measurements such as the total prostate specific antigen (tPSA), the free PSA (fPSA) or the complexed PSA (cPSA). As was mentioned in Section 1, there is a close relation between the concepts of ROC curve and relative density. Relative density estimates can provide more detailed information about the performance of a diagnostic test, which can be useful not only in comparing different tests but also in designing an improved one. This issue goes beyond the scope of this article and therefore it will not be investigated here.

In this section we compare, from a distributional point of view, the above-mentioned measurements (tPSA, fPSA and cPSA) between the two groups in the data set (PC+ and PC-). To this end we start by computing the appropriate bandwidths using the data-driven bandwidth selector $h_{SJ_2}$, and then the corresponding relative density estimates are computed using (2). These estimates are shown in Figure 2.

Put Figure 2 about here

It is clear from Figure 2 that the relative density estimate is above one in a central interval accounting for a probability of about 4% of the PC- distribution for the variables tPSA and cPSA. This is also the case in the right tail of the PC- group for tPSA and cPSA, but not for fPSA. In the case of fPSA, the central interval with relative density above one is slightly shifted to the left of the PC- distribution: between percentiles 2% and 6%.

6 Proofs

The proof of Theorem 1 will be a direct consequence of some previous lemmas in which each one of the terms that result from expanding the expression for the MISE is studied.

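Some of them will produce dominant parts in the final expression for the MISE while others will yield negligible terms.

Lemma 1. Assume the hypotheses above. Then

(i) $\int E[(\tilde r_h(t) - r(t))^2]\, dt = \frac{1}{mh} C(K) + \frac{1}{4} h^4 d_K^2\, C(r^{(2)}) + o\left(\frac{1}{mh} + h^4\right)$
(ii) $\int E[A_2^2]\, dt = \frac{1}{nh} C(r)\, C(K) + o\left(\frac{1}{nh}\right)$
(iii) $\int E[A_1^2]\, dt = o\left(\frac{1}{nh}\right)$
(iv) $\int E[B^2]\, dt = o\left(\frac{1}{nh}\right)$
(v) $\int E[2 A_1 (\tilde r_h - r)(t)]\, dt = 0$
(vi) $\int E[2 A_2 (\tilde r_h - r)(t)]\, dt = 0$
(vii) $\int E[2 B (\tilde r_h - r)(t)]\, dt = o\left(\frac{1}{mh} + h^4\right)$
(viii) $\int E[2 A_1 A_2]\, dt = o\left(\frac{1}{mh} + h^4\right)$
(ix) $\int E[2 A_1 B]\, dt = o\left(\frac{1}{mh} + h^4\right)$
(x) $\int E[2 A_2 B]\, dt = o\left(\frac{1}{nh} + h^4\right)$

Lemma 2. Assume the hypotheses above. Then

(i) $\int E[\hat A^2]\, dt = o\left(\frac{1}{nh}\right)$
(ii) $\int E[\hat B^2]\, dt = o\left(\frac{1}{nh}\right)$

Proof of Lemma 1. The proof of (i) is not included here because it is a classical result in the setting of ordinary density estimation in a one-sample problem (see Wand and Jones (1995) for details).

We next prove (ii). Standard algebra gives

$$E[A_2^2] = \frac{1}{n^2 h^4}\sum_{i=1}^{n}\sum_{j=1}^{n}\int\!\!\int K^{(1)}\left(\frac{t - F_0(w_1)}{h}\right) E\big[(F_0(w_1) - 1\{X_{0i} \le w_1\})(F_0(w_2) - 1\{X_{0j} \le w_2\})\big]\, K^{(1)}\left(\frac{t - F_0(w_2)}{h}\right) dF_1(w_1)\, dF_1(w_2).$$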

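Due to the independence between $X_{0i}$ and $X_{0j}$ for $i \ne j$, and using the fact that

$$Cov\big(1\{F_0(X_{0i}) \le u_1\}, 1\{F_0(X_{0i}) \le u_2\}\big) = (1 - u_1)(1 - u_2)\, \omega(u_1 \wedge u_2),$$

where $\omega(t) = \frac{t}{1 - t}$, the previous expression can be rewritten as follows:

$$E[A_2^2] = \frac{2}{n h^4}\int_0^1\!\!\int_{u_2}^1 (1 - u_1)(1 - u_2)\, \omega(u_2)\, K^{(1)}\left(\frac{t - u_1}{h}\right) K^{(1)}\left(\frac{t - u_2}{h}\right) r(u_1)\, r(u_2)\, du_1\, du_2 = -\frac{1}{n h^4}\int_0^1 \omega(u_2)\, d\left[\left(\int_{u_2}^1 (1 - u_1)\, K^{(1)}\left(\frac{t - u_1}{h}\right) r(u_1)\, du_1\right)^2\right].$$

Now, using integration by parts, it follows that

$$E[A_2^2] = -\frac{1}{n h^4}\lim_{u_2 \to 1^-}\omega(u_2)\, G(u_2)^2 + \frac{1}{n h^4}\lim_{u_2 \to 0^+}\omega(u_2)\, G(u_2)^2 + \frac{1}{n h^4}\int_0^1 G(u_2)^2\, \omega^{(1)}(u_2)\, du_2, \tag{15}$$

where

$$G(u_2) = \int_{u_2}^1 (1 - u_1)\, K^{(1)}\left(\frac{t - u_1}{h}\right) r(u_1)\, du_1.$$

Since $G$ is a bounded function and $\omega(0) = 0$, the second term in the right-hand side of (15) vanishes. On the other hand, due to the boundedness of $K^{(1)}$ and $r$, it follows that $|G(u_2)| \le \|K^{(1)}\|_\infty \|r\|_\infty \frac{(1 - u_2)^2}{2}$, which lets us conclude that the first term in (15) is zero as well. Therefore,

$$E[A_2^2] = \frac{1}{n h^4}\int_0^1 G(u_2)^2\, \omega^{(1)}(u_2)\, du_2. \tag{16}$$

Now, using integration by parts, it follows that

$$G(u_2) = \left[-h (1 - u_1)\, r(u_1)\, K\left(\frac{t - u_1}{h}\right)\right]_{u_2}^{1} + h\int_{u_2}^1 K\left(\frac{t - u_1}{h}\right)\big[-r(u_1) + (1 - u_1)\, r^{(1)}(u_1)\big]\, du_1 = h (1 - u_2)\, r(u_2)\, K\left(\frac{t - u_2}{h}\right) + h\int_{u_2}^1 K\left(\frac{t - u_1}{h}\right)\big[(1 - u_1)\, r^{(1)}(u_1) - r(u_1)\big]\, du_1,$$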

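and plugging this last expression in (16), it is concluded that

$$E[A_2^2] = \frac{1}{n h^2}\big(I_{21} + 2 I_{22} + I_{23}\big),$$

where

$$I_{21} = \int_0^1 r^2(u_2)\, K^2\left(\frac{t - u_2}{h}\right) du_2,$$

$$I_{22} = \int_0^1 \frac{1}{1 - u_2}\, r(u_2)\, K\left(\frac{t - u_2}{h}\right)\int_{u_2}^1 K\left(\frac{t - u_1}{h}\right)\big[(1 - u_1)\, r^{(1)}(u_1) - r(u_1)\big]\, du_1\, du_2,$$

$$I_{23} = \int_0^1 \frac{1}{(1 - u_2)^2}\int_{u_2}^1\!\!\int_{u_2}^1 K\left(\frac{t - u_1}{h}\right)\big[(1 - u_1)\, r^{(1)}(u_1) - r(u_1)\big]\, K\left(\frac{t - u_1^*}{h}\right)\big[(1 - u_1^*)\, r^{(1)}(u_1^*) - r(u_1^*)\big]\, du_1\, du_1^*\, du_2.$$

Therefore,

$$\int E[A_2^2]\, dt = \frac{1}{n h^2}\int I_{21}\, dt + \frac{2}{n h^2}\int I_{22}\, dt + \frac{1}{n h^2}\int I_{23}\, dt. \tag{17}$$

Next, we will study each summand in (17) separately. The first term can be handled by using changes of variable and a Taylor expansion:

$$\frac{1}{n h^2}\int I_{21}\, dt = \frac{1}{nh}\int_0^1 r^2(u_2)\int_{-u_2/h}^{(1 - u_2)/h} K^2(s)\, ds\, du_2.$$

Let us define $\mathcal{K}_2(x) = \int_{-\infty}^{x} K^2(s)\, ds$ and rewrite the previous term as follows:

$$\frac{1}{n h^2}\int I_{21}\, dt = \frac{1}{nh}\int_0^1 r^2(u_2)\left[\mathcal{K}_2\left(\frac{1 - u_2}{h}\right) - \mathcal{K}_2\left(-\frac{u_2}{h}\right)\right] du_2.$$

Now, by splitting the integration interval into three subintervals, $[0, h]$, $[h, 1 - h]$ and $[1 - h, 1]$, using changes of variable and the fact that $\mathcal{K}_2(x) = C(K)$ for $x \ge 1$ and $\mathcal{K}_2(x) = 0$ for $x \le -1$,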

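it is easy to show that

$$\frac{1}{n h^2}\int I_{21}\, dt = \frac{1}{nh}\, C(K)\, C(r) + O\left(\frac{1}{n}\right).$$

Below, we will study the second term in the right-hand side of (17). By using changes of variable, the Cauchy-Schwarz inequality and the conditions $\|r\|_\infty < \infty$, $\|r^{(1)}\|_\infty < \infty$ and $C(K) < \infty$, straightforward calculations lead to

$$\frac{1}{n h^2}\int I_{22}\, dt = O\left(\frac{1}{n h^{1/2}}\right).$$

Similar arguments give

$$\frac{1}{n h^2}\int I_{23}\, dt = O\left(\frac{1}{n h^{1/2}}\right).$$

Therefore, it has been shown that

$$\int E[A_2^2]\, dt = \frac{1}{nh}\, C(r)\, C(K) + O\left(\frac{1}{n}\right) + O\left(\frac{1}{n h^{1/2}}\right).$$

Finally, the proof of (ii) concludes using condition (B1).

We now prove (iii). Direct calculations lead to

$$E[A_1^2] = \frac{1}{h^4}\, E[I_1], \tag{18}$$

where

$$I_1 = E\left[\int\!\!\int (v_1 - \tilde U_n(v_1))(v_2 - \tilde U_n(v_2))\, K^{(1)}\left(\frac{t - v_1}{h}\right) K^{(1)}\left(\frac{t - v_2}{h}\right) d(\tilde R_m - R)(v_1)\, d(\tilde R_m - R)(v_2)\ \Big|\ X_{01}, \ldots, X_{0n}\right]. \tag{19}$$

To tackle (18) we first study the conditional expectation (19). It is easy to see that $I_1 = Var[V \mid X_{01}, \ldots, X_{0n}]$, where

$$V = \frac{1}{m}\sum_{j=1}^{m}\big(F_0(X_{1j}) - \tilde U_n(F_0(X_{1j}))\big)\, K^{(1)}\left(\frac{t - F_0(X_{1j})}{h}\right).$$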

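Thus

$$I_1 = \frac{1}{m}\left\{\int\left[(v - \tilde U_n(v))\, K^{(1)}\left(\frac{t - v}{h}\right)\right]^2 dR(v) - \left[\int (v - \tilde U_n(v))\, K^{(1)}\left(\frac{t - v}{h}\right) dR(v)\right]^2\right\}$$

and

$$E[A_1^2] = \frac{1}{m h^4}\, E\left\{\int\left[(v - \tilde U_n(v))\, K^{(1)}\left(\frac{t - v}{h}\right)\right]^2 dR(v)\right\} - \frac{1}{m h^4}\, E\left[\int\!\!\int (v_1 - \tilde U_n(v_1))(v_2 - \tilde U_n(v_2))\, K^{(1)}\left(\frac{t - v_1}{h}\right) K^{(1)}\left(\frac{t - v_2}{h}\right) dR(v_1)\, dR(v_2)\right].$$

Taking into account that

$$E\left[\sup_v |\tilde U_n(v) - v|^2\right] = \int_0^\infty P\left(\sup_v |\tilde U_n(v) - v|^2 > c\right) dc,$$

we can use the Dvoretzky-Kiefer-Wolfowitz inequality to conclude that

$$E\left[\sup_v |\tilde U_n(v) - v|^2\right] \le \int_0^\infty 2 e^{-2nc}\, dc = \frac{2}{n}\int_0^\infty y\, e^{-y^2}\, dy = O\left(\frac{1}{n}\right). \tag{20}$$

Consequently, using (20) and the conditions $\|r\|_\infty < \infty$ and $\|K^{(1)}\|_\infty < \infty$, we obtain that $E[A_1^2] = O\left(\frac{1}{m n h^4}\right)$. The proof of (iii) is concluded using condition (B1).

The results appearing in items (iv)-(x) can be proved by first conditioning on some appropriate random variables and then handling the conditional moments using standard arguments. For this reason their proofs are not included here.

Proof of Lemma 2. We start by proving (i). Let us define $D_n(w) = \tilde F_{0n}(w) - F_{0n}(w)$. Then

$$E[\hat A^2] = E\big[E[\hat A^2 \mid X_{11}, \ldots, X_{1m}]\big] = E\left[\int\!\!\int E[D_n(w_1) D_n(w_2)]\, K_h^{(1)}(t - F_0(w_1))\, K_h^{(1)}(t - F_0(w_2))\, dF_{1m}(w_1)\, dF_{1m}(w_2)\right].$$

Based on the results set for $D_n(w)$ in Hjort and Walker (2001), the conditions (F2) and (K2), and since $E[D_n(w_1) D_n(w_2)] = Cov(D_n(w_1), D_n(w_2)) + E[D_n(w_1)]\, E[D_n(w_2)]$, it follows that

$$E[D_n(w_1) D_n(w_2)] = O\left(\frac{h_0}{n}\right) + O(h_0^4).$$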

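Therefore, for any $t \in [0, 1]$, we can bound $E[\hat A^2]$, using suitable constants $C_2$ and $C_3$, as follows:

$$E[\hat A^2] \le C_2\, \frac{h_0^4}{m h^4}\int K^{(1)}\left(\frac{t - F_0(z)}{h}\right)^2 f_1(z)\, dz + C_3\, \frac{h_0^4}{h^4}\, \frac{m - 1}{m}\int\!\!\int \left|K^{(1)}\left(\frac{t - F_0(z_1)}{h}\right)\right| \left|K^{(1)}\left(\frac{t - F_0(z_2)}{h}\right)\right| f_1(z_1)\, f_1(z_2)\, dz_1\, dz_2.$$

Besides, the condition (R1) allows us to conclude that

$$\int K^{(1)}\left(\frac{t - F_0(z)}{h}\right)^2 f_1(z)\, dz = O(h) \quad \text{and} \quad \int\!\!\int \left|K^{(1)}\left(\frac{t - F_0(z_1)}{h}\right)\right| \left|K^{(1)}\left(\frac{t - F_0(z_2)}{h}\right)\right| f_1(z_1)\, f_1(z_2)\, dz_1\, dz_2 = O(h^2)$$

for all $t \in [0, 1]$. Therefore, $\int E[\hat A^2]\, dt = O\left(\frac{h_0^4}{n h^3}\right) + O\left(\frac{h_0^4}{h^2}\right)$, which, taking into account conditions (B1) and (B2), implies (i).

We next prove (ii). The proof is parallel to that of item (iv) in Lemma 1. The only difference now is that, instead of requiring $E[\sup |F_{0n}(x) - F_0(x)|^p] = O(n^{-p/2})$, where $p$ is an integer larger than 1, it is required that

$$E\left[\sup |\tilde F_{0n}(x) - F_0(x)|^p\right] = O(n^{-p/2}). \tag{21}$$

To conclude the proof, below we show that (21) is satisfied. Define $H_n = \sup |\tilde F_{0n}(x) - F_0(x)|$. Then, as stated in Ahmad (2002), it follows that $H_n \le E_n + W_n$, where $E_n = \sup |F_{0n}(x) - F_0(x)|$ and $W_n = \sup |E[\tilde F_{0n}(x)] - F_0(x)| = O(h_0^2)$. Using the binomial formula it is easy to obtain that, for any integer $p \ge 1$,

$$H_n^p \le \sum_{j=0}^{p} C_j^p\, W_n^{p-j}\, E_n^j,$$

where the constants $C_j^p$, with $j \in \{0, 1, \ldots, p - 1, p\}$, are the binomial coefficients. Therefore, since $E[E_n^j] = O(n^{-j/2})$ and $W_n^{p-j} = O(h_0^{2(p-j)})$, condition (B2) leads to $W_n^{p-j}\, E[E_n^j] = O(n^{-p/2})$. As a straightforward consequence, (21) holds and the proof of (ii) is concluded.

Proof of Theorem 2. Below, we briefly detail the steps followed to study the asymptotic behaviour of the mean squared error of $\hat\Psi_l(g)$ defined in (9). First of all, let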

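us observe that

$$\hat\Psi_l(g) = \frac{1}{m} L_g^{(l)}(0) + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{k=1, k \ne j}^{m} L_g^{(l)}\big(F_{0n}(X_{1j}) - F_{0n}(X_{1k})\big),$$

which implies

$$E\big[\hat\Psi_l(g)\big] = \frac{1}{m g^{l+1}} L^{(l)}(0) + \left(1 - \frac{1}{m}\right) E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big].$$

Starting from the equation

$$E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big] = E\big[E[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12})) \mid X_{01}, \ldots, X_{0n}]\big] = E\left[\int\!\!\int L_g^{(l)}(F_{0n}(x_1) - F_{0n}(x_2))\, f_1(x_1) f_1(x_2)\, dx_1\, dx_2\right] = \int\!\!\int E\big[L_g^{(l)}(F_{0n}(x_1) - F_{0n}(x_2))\big]\, f_1(x_1) f_1(x_2)\, dx_1\, dx_2$$

and using a Taylor expansion, we have

$$E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big] = \sum_{i=0}^{7} I_i, \tag{22}$$

where

$$I_0 = \int\!\!\int \frac{1}{g^{l+1}} L^{(l)}\left(\frac{F_0(x_1) - F_0(x_2)}{g}\right) f_1(x_1) f_1(x_2)\, dx_1\, dx_2,$$

$$I_i = \int\!\!\int \frac{1}{i!\, g^{l+i+1}} L^{(l+i)}\left(\frac{F_0(x_1) - F_0(x_2)}{g}\right) E\big[(F_{0n}(x_1) - F_0(x_1) - F_{0n}(x_2) + F_0(x_2))^i\big]\, f_1(x_1) f_1(x_2)\, dx_1\, dx_2, \quad i = 1, \ldots, 6,$$

$$I_7 = \int\!\!\int \frac{1}{7!\, g^{l+7+1}} E\big[L^{(l+7)}(\xi_n)\, (F_{0n}(x_1) - F_0(x_1) - F_{0n}(x_2) + F_0(x_2))^7\big]\, f_1(x_1) f_1(x_2)\, dx_1\, dx_2,$$

and $\xi_n$ is a value between $\frac{F_0(x_1) - F_0(x_2)}{g}$ and $\frac{F_{0n}(x_1) - F_{0n}(x_2)}{g}$.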

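Now, consider the first term, $I_0$, in (22). It is easy to see that

$$I_0 = \int\!\!\int L_g^{(l)}(z_1 - z_2)\, r(z_1)\, r(z_2)\, dz_1\, dz_2 = \int\!\!\int L_g(z_1 - z_2)\, r(z_1)\, r^{(l)}(z_2)\, dz_1\, dz_2 = \int_0^1\!\!\int_{-z_2/g}^{(1 - z_2)/g} L(x)\, r(z_2 + gx)\, r^{(l)}(z_2)\, dx\, dz_2;$$

hence, using a Taylor expansion, we have

$$I_0 = \Psi_l + \frac{1}{2} d_L \Psi_{l+2}\, g^2 + O(g^4).$$

Assume $x_1 > x_2$ and define $Z = \sum_{i=1}^{n} 1\{x_2 < X_{0i} \le x_1\}$. Then the random variable $Z$ has a $Bin(n, p)$ distribution with $p = F_0(x_1) - F_0(x_2)$ and mean $\mu = np$. It is easy to show that, for $i = 1, \ldots, 6$,

$$I_i = 2\int\!\!\int_{x_1 > x_2} \frac{1}{i!\, g^{l+i+1}} L^{(l+i)}\left(\frac{F_0(x_1) - F_0(x_2)}{g}\right) f_1(x_1) f_1(x_2)\, n^{-i}\, \mu_i(Z)\, dx_1\, dx_2,$$

where

$$\mu_r(Z) = E[(Z - E[Z])^r] = \sum_{j=0}^{r} (-1)^j \binom{r}{j} m_{r-j}\, \mu^j, \qquad m_k = E[Z^k] = \sum_{j=0}^{k} S(k, j)\, \frac{n!\, p^j}{(n - j)!}, \qquad S(k, j) = \frac{1}{j!}\sum_{i=0}^{j} (-1)^i \binom{j}{i} (j - i)^k.$$

Noting that $\mu_1(Z) = 0$ and $\mu_2(Z) = n (F_0(x_1) - F_0(x_2))(1 - F_0(x_1) + F_0(x_2))$, we have $I_1 = 0$ and

$$I_2 = \int\!\!\int_{x_1 > x_2} \frac{1}{n g^{l+2+1}} L^{(l+2)}\left(\frac{F_0(x_1) - F_0(x_2)}{g}\right) f_1(x_1) f_1(x_2)\, (F_0(x_1) - F_0(x_2))(1 - F_0(x_1) + F_0(x_2))\, dx_1\, dx_2 = \int\!\!\int_{u > v} \frac{1}{n g^{l+2+1}} L^{(l+2)}\left(\frac{u - v}{g}\right)(u - v)(1 - u + v)\, r(u)\, r(v)\, du\, dv = \frac{1}{n g^{l+1}}\int_0^1\!\!\int_0^{(1 - v)/g} L^{(l+2)}(x)\, x\, (1 - gx)\, r(v + gx)\, r(v)\, dx\, dv. \tag{23}$$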
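The moment formulas used above (central moments of a binomial variable expressed through its raw moments, which in turn are written with Stirling numbers of the second kind) can be checked numerically. The following sketch is illustrative only, with hypothetical function names; it confirms that the first central moment is zero and that the second equals $np(1 - p)$.

```python
from math import comb, factorial

def stirling2(k, j):
    """Stirling number of the second kind S(k, j) via the alternating-sum formula."""
    total = sum((-1) ** i * comb(j, i) * (j - i) ** k for i in range(j + 1))
    return total // factorial(j)

def binomial_raw_moment(k, n, p):
    """E[Z^k] for Z ~ Bin(n, p): sum_j S(k, j) * n! / (n - j)! * p^j."""
    total = 0.0
    for j in range(min(k, n) + 1):
        falling = 1
        for i in range(j):
            falling *= n - i  # falling factorial n (n-1) ... (n-j+1)
        total += stirling2(k, j) * falling * p**j
    return total

def binomial_central_moment(r, n, p):
    """E[(Z - np)^r] expanded with the binomial formula in terms of raw moments."""
    mu = n * p
    return sum(
        (-1) ** j * comb(r, j) * binomial_raw_moment(r - j, n, p) * mu**j
        for j in range(r + 1)
    )
```

Higher-order central moments, such as those needed for $I_3, \ldots, I_6$, follow from the same functions by increasing `r`.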

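Using a Taylor expansion of $(1 - gx)\, r(v + gx)$ and noting that $\int_0^\infty x\, L^{(l+2)}(x)\, dx = L^{(l)}(0)$ (see condition (K3)), we have from (23)

$$I_2 = \frac{1}{n g^{l+1}}\, \Psi_0\, L^{(l)}(0) + O\left(\frac{1}{n g^l}\right).$$

Similar arguments can be used to handle $I_i = O\left(\frac{1}{n^2 g^{l+2}}\right)$ for $i = 3, 4$ and $I_i = O\left(\frac{1}{n^3 g^{l+3}}\right)$ for $i = 5, 6$. Coming back to the last term in (22) and using the Dvoretzky-Kiefer-Wolfowitz inequality and condition (K3), it is easy to show that $I_7 = O\left(\frac{1}{n^{7/2} g^{l+8}}\right)$. Therefore,

$$E\big[\hat\Psi_l(g)\big] = \Psi_l + \frac{1}{2} d_L \Psi_{l+2}\, g^2 + \frac{1}{m g^{l+1}} L^{(l)}(0) + \frac{1}{n g^{l+1}}\, \Psi_0\, L^{(l)}(0) + O(g^4) + o\left(\frac{1}{n g^{l+1}}\right).$$

In order to study the variance of $\hat\Psi_l(g)$, note that

$$Var\big[\hat\Psi_l(g)\big] = \sum_{i=1}^{3} c_{n,i}\, V_{l,i}, \tag{24}$$

where

$$c_{n,1} = \frac{2(m - 1)}{m^3}, \qquad c_{n,2} = \frac{4(m - 1)(m - 2)}{m^3}, \qquad c_{n,3} = \frac{(m - 1)(m - 2)(m - 3)}{m^3},$$

$$V_{l,1} = Var\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big], \tag{25}$$

$$V_{l,2} = Cov\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12})),\ L_g^{(l)}(F_{0n}(X_{12}) - F_{0n}(X_{13}))\big], \tag{26}$$

$$V_{l,3} = Cov\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12})),\ L_g^{(l)}(F_{0n}(X_{13}) - F_{0n}(X_{14}))\big]. \tag{27}$$

Therefore, in order to get an asymptotic expression for the variance of $\hat\Psi_l(g)$, we will start by getting asymptotic expressions for the terms (25), (26) and (27) in (24). To deal with the term (25), we will use

$$V_{l,1} = E\big[\big(L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big)^2\big] - E^2\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big] \tag{28}$$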

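and study separately each term in the right-hand side of (28). Note that the expectation of $L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))$ has already been studied when dealing with the expectation of $\hat\Psi_l(g)$. Next we study the first term in the right-hand side of (28). Using a Taylor expansion, the term

$$E\big[\big(L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big)^2\big] = E\big[E\big[\big(L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big)^2 \mid X_{01}, \ldots, X_{0n}\big]\big] = E\left[\int\!\!\int \big(L_g^{(l)}\big)^2 (F_{0n}(x) - F_{0n}(y))\, f_1(x) f_1(y)\, dx\, dy\right] = \int\!\!\int E\big[\big(L_g^{(l)}\big)^2 (F_{0n}(x) - F_{0n}(y))\big]\, f_1(x) f_1(y)\, dx\, dy$$

can be decomposed into a sum of six terms that can be bounded easily. The first term in that decomposition can be rewritten as $\frac{1}{g^{2l+1}}\, \Psi_0\, C(L^{(l)}) + o\left(\frac{1}{g^{2l+1}}\right)$ after applying some changes of variable and a Taylor expansion. The other terms can be easily bounded using the Dvoretzky-Kiefer-Wolfowitz inequality and standard changes of variable. These bounds and condition (B3) prove that the order of these terms is $o\left(\frac{1}{g^{2l+1}}\right)$. Consequently,

$$V_{l,1} = \frac{1}{g^{2l+1}}\, \Psi_0\, C(L^{(l)}) + o\left(\frac{1}{g^{2l+1}}\right) - \big(\Psi_l + o(1)\big)^2 = \frac{1}{g^{2l+1}}\, \Psi_0\, C(L^{(l)}) + o\left(\frac{1}{g^{2l+1}}\right).$$

The term (26) can be handled using

$$V_{l,2} = E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\, L_g^{(l)}(F_{0n}(X_{12}) - F_{0n}(X_{13}))\big] - E^2\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\big]. \tag{29}$$

As for (28), it is only necessary to study the first term in the right-hand side of (29).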

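Note that

$$E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\, L_g^{(l)}(F_{0n}(X_{12}) - F_{0n}(X_{13}))\big] = E\big[E\big[L_g^{(l)}(F_{0n}(X_{11}) - F_{0n}(X_{12}))\, L_g^{(l)}(F_{0n}(X_{12}) - F_{0n}(X_{13})) \mid X_{01}, \ldots, X_{0n}\big]\big] = \int\!\!\int\!\!\int E\big[L_g^{(l)}(F_{0n}(y) - F_{0n}(z))\, L_g^{(l)}(F_{0n}(z) - F_{0n}(t))\big]\, f_1(y) f_1(z) f_1(t)\, dy\, dz\, dt.$$

Taylor expansions, changes of variable, the Cauchy-Schwarz inequality and the Dvoretzky-Kiefer-Wolfowitz inequality show that this expectation equals $\int \big(r^{(l)}(z)\big)^2 r(z)\, dz$ plus remainder terms that remain bounded under (B3). Consequently, using (B3) and (29), $V_{l,2} = O(1)$.

To study the term $V_{l,3}$ in (27), let us define

$$A_l = \int\!\!\int \big[L_g^{(l)}(F_{0n}(y) - F_{0n}(z)) - L_g^{(l)}(F_0(y) - F_0(z))\big]\, f_1(y) f_1(z)\, dy\, dz.$$

It is easy to show that $V_{l,3} = Var(A_l)$. Now a Taylor expansion gives $A_l = \sum_{k=1}^{N} T_k$ and

$$Var(A_l) = \sum_{k=1}^{N} Var(T_k) + \sum_{k=1}^{N}\sum_{k' = 1, k' \ne k}^{N} Cov(T_k, T_{k'}), \tag{30}$$

where

$$T_k = \int\!\!\int \frac{1}{k!\, g^{l+1}} L^{(l+k)}\left(\frac{F_0(y) - F_0(z)}{g}\right)\left(\frac{F_{0n}(y) - F_{0n}(z) - (F_0(y) - F_0(z))}{g}\right)^k f_1(y) f_1(z)\, dy\, dz, \quad \text{for } k = 1, \ldots, N - 1,$$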

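$$T_N = \int\!\!\int \frac{1}{N!\, g^{l+1}} L^{(l+N)}(\xi_n)\left(\frac{F_{0n}(y) - F_{0n}(z) - (F_0(y) - F_0(z))}{g}\right)^N f_1(y) f_1(z)\, dy\, dz,$$

for some positive integer $N$. We will only study each one of the first $N - 1$ summands in (30). The rest of them will be easily bounded using the Cauchy-Schwarz inequality and the bounds obtained for the first $N - 1$ terms.

Now the variance of $T_k$ is studied. First of all, note that

$$Var(T_k) \le E[T_k^2] = \left(\frac{1}{k!\, g^{l+k+1}}\right)^2 \int\!\!\int\!\!\int\!\!\int L^{(l+k)}\left(\frac{F_0(y_1) - F_0(z_1)}{g}\right) L^{(l+k)}\left(\frac{F_0(y_2) - F_0(z_2)}{g}\right) h_k(y_1, z_1, y_2, z_2)\, f_1(y_1) f_1(z_1) f_1(y_2) f_1(z_2)\, dy_1\, dz_1\, dy_2\, dz_2,$$

where

$$h_k(y_1, z_1, y_2, z_2) = E\big\{[F_{0n}(y_1) - F_{0n}(z_1) - (F_0(y_1) - F_0(z_1))]^k\, [F_{0n}(y_2) - F_{0n}(z_2) - (F_0(y_2) - F_0(z_2))]^k\big\}.$$

Using changes of variable we can rewrite $E[T_k^2]$ as follows:

$$E[T_k^2] = \left(\frac{1}{k!}\right)^2 \int\!\!\int\!\!\int\!\!\int r(s_1) r(t_1) r(s_2) r(t_2)\, L_g^{(l+k)}(s_1 - t_1)\, L_g^{(l+k)}(s_2 - t_2)\, h_k\big(F_0^{-1}(s_1), F_0^{-1}(t_1), F_0^{-1}(s_2), F_0^{-1}(t_2)\big)\, ds_1\, dt_1\, ds_2\, dt_2$$

$$= \left(\frac{1}{k!}\right)^2 \int_0^1\!\!\int_{s_2 - 1}^{s_2}\!\!\int_0^1\!\!\int_{s_1 - 1}^{s_1} r(s_1) r(s_1 - u_1) r(s_2) r(s_2 - u_2)\, L_g^{(l+k)}(u_1)\, L_g^{(l+k)}(u_2)\, h_k\big(F_0^{-1}(s_1), F_0^{-1}(s_1 - u_1), F_0^{-1}(s_2), F_0^{-1}(s_2 - u_2)\big)\, du_1\, ds_1\, du_2\, ds_2.$$

Note that closed expressions for $h_k$ can be obtained using the expressions for the moments of order $r = (r_1, r_2, r_3, r_4, r_5)$ of $Z$, a random variable with multinomial distribution with parameters $(n; p_1, p_2, p_3, p_4, p_5)$. Based on these expressions, the condition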

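(R3) and the use of integration by parts, we can rewrite $E[T_k^2]$ as follows:

$$E[T_k^2] = \left(\frac{1}{k!}\right)^2 \int_0^1\!\!\int_{s_2 - 1}^{s_2}\!\!\int_0^1\!\!\int_{s_1 - 1}^{s_1} L_g(u_1)\, L_g(u_2)\, \frac{\partial^{2(l+k)}}{\partial u_1^{l+k}\, \partial u_2^{l+k}}\, \tilde h_k(u_1, s_1, u_2, s_2)\, du_1\, ds_1\, du_2\, ds_2,$$

where

$$\tilde h_k(u_1, s_1, u_2, s_2) = r(s_1)\, r(s_1 - u_1)\, r(s_2)\, r(s_2 - u_2)\, h_k\big(F_0^{-1}(s_1), F_0^{-1}(s_1 - u_1), F_0^{-1}(s_2), F_0^{-1}(s_2 - u_2)\big).$$

Besides, based on the multinomial moments we can show that $\sup_{z \in \mathbb{R}^4} |h_k(z)| = O\left(\frac{1}{n^k}\right)$. This result and condition (R3) allow us to conclude that $Var(T_k) \le E[T_k^2] = O\left(\frac{1}{n^k}\right)$, for $1 \le k < N$, which implies that $Var(T_k) = o\left(\frac{1}{n}\right)$, for $2 \le k < N$. A Taylor expansion of order $N = 6$ gives $Var(T_6) = O\left(\frac{1}{n^N g^{2(N+l+1)}}\right)$, which, using condition (B3), proves $Var(T_6) = o\left(\frac{1}{n}\right)$. Consequently,

$$Var\big[\hat\Psi_l(g)\big] = \frac{2}{m^2 g^{2l+1}}\, \Psi_0\, C(L^{(l)}) + o\left(\frac{1}{m^2 g^{2l+1}}\right) + O\left(\frac{1}{n}\right).$$

Remark 4. If equation (22) is replaced by a three-term Taylor expansion $\sum_{i=0}^{2} I_i + I_3^*$, where

$$I_3^* = \int\!\!\int \frac{1}{3!\, g^{l+4}}\, E\big[L^{(l+3)}(\zeta_n)\, (F_{0n}(x_1) - F_0(x_1) - F_{0n}(x_2) + F_0(x_2))^3\big]\, f_1(x_1) f_1(x_2)\, dx_1\, dx_2,$$

and $\zeta_n$ is a value between $\frac{F_0(x_1) - F_0(x_2)}{g}$ and $\frac{F_{0n}(x_1) - F_{0n}(x_2)}{g}$, then $I_3^* = O\left(\frac{1}{n^{3/2} g^{l+4}}\right)$ and we would have to ask for the condition $n g^6 \to \infty$ to conclude that $I_3^* = o\left(\frac{1}{n g^{l+1}}\right)$. However, this condition is very restrictive because it is not satisfied by the optimal bandwidth $g_l$ with $l = 0, 2$, which is $g_l \sim n^{-\frac{1}{l+3}}$. We could consider $\sum_{i=0}^{3} I_i + I_4^*$ and then we would need to ask for the condition $n g^4 \to \infty$. However, this condition is not satisfied by $g_l$ with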

$\ell = 0$. In fact, it follows that $n g_\ell^4 \to 0$ if $\ell = 0$ and $n g_\ell^4 \to \infty$ if $\ell = 2, 4, \ldots$ Something similar happens when we consider $\sum_{i=1}^{4} I_i + I_5$ or $\sum_{i=1}^{5} I_i + I_6$, i.e., the condition required on $g$ is not satisfied by the optimal bandwidth when $\ell = 0$. Only when we stop at $I_7$ is the required condition, $n g^{14/5} \to \infty$, satisfied for all even $\ell$.

If equation (3) is reconsidered by the mean-value theorem, and we then consider that $A_\ell = T_1$ with

$$T_1 = \frac{1}{g^{\ell+2}} \iint L^{(\ell+1)}(\zeta_n) \left[F_{0n}(y) - F_{0n}(z) - F_0(y) + F_0(z)\right] f_1(y) f_1(z)\,dy\,dz,$$

it follows that $\mathrm{Var}(A_\ell) = O\!\left(\frac{1}{n g^{2(\ell+2)}}\right)$. However, assuming that $g \to 0$, it is impossible to conclude from here that $\mathrm{Var}(A_\ell) = o\!\left(\frac{1}{n}\right)$.

Acknowledgements

Research supported by Grants BES-23-7 (EU ESF support included) for the first author, XUGA PGIDIT3PXIC55-PN for the second author and BFM and MTM (EU ERDF support included) for both authors. The authors would like to thank two anonymous referees and an associate editor whose comments have helped to improve the paper substantially. Thanks are also due to Sonia Pértega Díaz and Francisco Gómez Veiga from the Hospital Juan Canalejo in A Coruña for providing the prostate cancer data set.
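The bandwidth conditions discussed in Remark 4 can be checked with a short computation. The sketch below is illustrative only and rests on one assumption that is not stated in the paper: that the remainder of a $j$-term expansion satisfies $I_j = O(n^{-j/2} g^{-(\ell+j+1)})$, which agrees with the rate given for $I_3$ and is extrapolated here to larger $j$. Under that assumption, $I_j = o(1/(n g^{\ell+1}))$ is equivalent to $n g^{c_j} \to \infty$ with $c_j = 2j/(j-2)$, and for $g_\ell = n^{-1/(\ell+3)}$ this holds exactly when $1 - c_j/(\ell+3) > 0$. The function names `required_exponent` and `condition_holds` are hypothetical.

```python
from fractions import Fraction


def required_exponent(j):
    """Exponent c such that I_j = o(1/(n g^(l+1))) iff n * g^c -> infinity.

    Assumption of this sketch: I_j = O(n^(-j/2) * g^(-(l+j+1))).
    Then I_j / (n^-1 g^-(l+1)) = n^(-(j-2)/2) * g^(-j), which tends to zero
    exactly when n * g^(2j/(j-2)) tends to infinity.
    """
    return Fraction(2 * j, j - 2)


def condition_holds(j, l):
    """True if n * g_l^c -> infinity for the bandwidth g_l = n^(-1/(l+3))."""
    # n * g_l^c = n^(1 - c/(l+3)); it diverges iff the exponent of n is positive.
    return 1 - required_exponent(j) / (l + 3) > 0


if __name__ == "__main__":
    for j in range(3, 8):
        checks = {l: condition_holds(j, l) for l in (0, 2, 4)}
        print(f"j={j}: need n*g^({required_exponent(j)}) -> inf; holds for l: {checks}")
```

Under the stated assumption, the exponents for $j = 3, 4, 7$ are $6$, $4$ and $14/5$, matching the conditions quoted in Remark 4, and $j = 7$ is the first order at which the condition holds for $\ell = 0$.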

Table 1. Values of EM for SJ1, SJ2 and $b_{3c}$ for models (a)-(c).

EM        Model (a)              Model (b)              Model (c)
(n, m)    SJ1   SJ2   b_{3c}     SJ1   SJ2   b_{3c}     SJ1   SJ2   b_{3c}

Fig. 1. Plots of the relative densities (a)-(c).

Fig. 2. Relative density estimate of the PC+ group with respect to the PC- group for the variables tpsa (solid line, SJ2 = 84), cpsa (dotted line, SJ2 = 8) and fpsa (dashed line, SJ2 = 74).