Lecture Notes on Nonparametrics


Bruce E. Hansen, University of Wisconsin, Spring 2009

1 Introduction

Parametric means finite-dimensional. Non-parametric means infinite-dimensional. The differences are profound.

Typically, parametric estimates converge at a $n^{-1/2}$ rate. Non-parametric estimates typically converge at a rate slower than $n^{-1/2}$.

Typically, in parametric models there is no distinction between the true model and the fitted model. In contrast, non-parametric methods typically distinguish between the true and fitted models.

Non-parametric methods make the complexity of the fitted model depend upon the sample. The more information is in the sample (i.e., the larger the sample size), the greater the degree of complexity of the fitted model. Taking this seriously requires a distinct distribution theory.

Non-parametric theory acknowledges that fitted models are approximations, and therefore are inherently misspecified. Misspecification implies estimation bias. Typically, increasing the complexity of a fitted model decreases this bias but increases the estimation variance. Nonparametric methods acknowledge this trade-off and attempt to set model complexity to minimize an overall measure of fit, typically mean-squared error (MSE).

There are many nonparametric statistical objects of potential interest, including density functions (univariate and multivariate), density derivatives, conditional density functions, conditional distribution functions, regression functions, median functions, quantile functions, and variance functions. Sometimes these nonparametric objects are of direct interest. Sometimes they are of interest only as an input to a second-stage estimation problem. If this second-stage problem is described by a finite dimensional parameter we call the estimation problem semiparametric.

Nonparametric methods typically involve some sort of approximation or smoothing method. Some of the main methods are called kernels, series, and splines.

Nonparametric methods are typically indexed by a bandwidth or tuning parameter which controls the degree of complexity. The choice of bandwidth is often critical to implementation. Data-dependent rules for determination of the bandwidth are therefore essential for nonparametric methods. Nonparametric methods which require a bandwidth, but do not have an explicit data-dependent rule for selecting the bandwidth, are incomplete. Unfortunately this is quite common, due to the difficulty in developing rigorous rules for bandwidth selection. Often in these cases the bandwidth is selected based on a related statistical problem. This is a feasible yet worrisome compromise.

Many nonparametric problems are generalizations of univariate density estimation. We will start with this simple setting, and explore its theory in considerable detail.

2 Kernel Density Estimation

2.1 Discrete Estimator

Let $X$ be a random variable with continuous distribution $F(x)$ and density $f(x) = \frac{d}{dx}F(x)$. The goal is to estimate $f(x)$ from a random sample $\{X_1, \ldots, X_n\}$.

The distribution function $F(x)$ is naturally estimated by the EDF $\hat F(x) = n^{-1}\sum_{i=1}^n 1(X_i \le x)$. It might seem natural to estimate the density $f(x)$ as the derivative of $\hat F(x)$, $\frac{d}{dx}\hat F(x)$, but this estimator would be a set of mass points, not a density, and as such is not a useful estimate of $f(x)$.

Instead, consider a discrete derivative. For some small $h > 0$, let
$$\hat f(x) = \frac{\hat F(x+h) - \hat F(x-h)}{2h}.$$
We can write this as
$$\frac{1}{2nh}\sum_{i=1}^n 1(x - h < X_i \le x + h) = \frac{1}{2nh}\sum_{i=1}^n 1\left(\frac{|X_i - x|}{h} \le 1\right) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
where
$$k(u) = \begin{cases} \frac{1}{2}, & |u| \le 1 \\ 0, & |u| > 1 \end{cases}$$
is the uniform density function on $[-1, 1]$.

The estimator $\hat f(x)$ counts the percentage of observations which are close to the point $x$. If many observations are near $x$, then $\hat f(x)$ is large. Conversely, if only a few $X_i$ are near $x$, then $\hat f(x)$ is small. The bandwidth $h$ controls the degree of smoothing.

$\hat f(x)$ is a special case of what is called a kernel estimator. The general case is
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
where $k(u)$ is a kernel function.

2.2 Kernel Functions

A kernel function $k(u): \mathbb{R} \to \mathbb{R}$ is any function which satisfies $\int_{-\infty}^{\infty} k(u)\,du = 1$.

A non-negative kernel satisfies $k(u) \ge 0$ for all $u$. In this case, $k(u)$ is a probability density function.

The moments of a kernel are $\kappa_j(k) = \int_{-\infty}^{\infty} u^j k(u)\,du$.

A symmetric kernel function satisfies $k(u) = k(-u)$ for all $u$. In this case, all odd moments are zero. Most nonparametric estimation uses symmetric kernels, and we focus on this case.
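The kernel estimator of Section 2.1 can be sketched in a few lines. The following is a minimal illustration, not code from the notes: the function names (`kde`, `edf`, `discrete_derivative`) and the sample values are hypothetical, and only the Python standard library is assumed. It shows that the uniform-kernel estimator reproduces the discrete derivative of the EDF (the two can differ only when some $X_i$ equals $x - h$ exactly).

```python
import math

def uniform_kernel(u):
    # Uniform density on [-1, 1]
    return 0.5 if abs(u) <= 1 else 0.0

def epanechnikov_kernel(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    # f_hat(x) = (1 / (n h)) * sum_i k((X_i - x) / h)
    n = len(data)
    return sum(kernel((xi - x) / h) for xi in data) / (n * h)

def edf(x, data):
    # Empirical distribution function F_hat(x)
    return sum(1 for xi in data if xi <= x) / len(data)

def discrete_derivative(x, data, h):
    # (F_hat(x + h) - F_hat(x - h)) / (2 h)
    return (edf(x + h, data) - edf(x - h, data)) / (2 * h)

# Hypothetical sample, for illustration only
sample = [-1.3, -0.4, 0.1, 0.2, 0.9, 1.7, 2.5]
uniform_estimate = kde(0.0, sample, 0.5, uniform_kernel)
edf_estimate = discrete_derivative(0.0, sample, 0.5)
```

Passing `epanechnikov_kernel` or `gaussian_kernel` instead gives the general kernel estimator with the same bandwidth argument.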

The order of a kernel, $\nu$, is defined as the order of the first non-zero moment. For example, if $\kappa_1(k) = 0$ and $\kappa_2(k) > 0$ then $k$ is a second-order kernel and $\nu = 2$. If $\kappa_1(k) = \kappa_2(k) = \kappa_3(k) = 0$ but $\kappa_4(k) > 0$ then $k$ is a fourth-order kernel and $\nu = 4$. The order of a symmetric kernel is always even.

Symmetric non-negative kernels are second-order kernels.

A kernel is a higher-order kernel if $\nu > 2$. These kernels will have negative parts and are not probability densities. They are also referred to as bias-reducing kernels.

Common second-order kernels are listed in the following table.

Table 1: Common Second-Order Kernels

| Kernel | Equation | $R(k)$ | $\kappa_2(k)$ | $eff(k)$ |
|---|---|---|---|---|
| Uniform | $k_0(u) = \frac{1}{2}\,1(\lvert u\rvert \le 1)$ | $1/2$ | $1/3$ | 1.0758 |
| Epanechnikov | $k_1(u) = \frac{3}{4}(1 - u^2)\,1(\lvert u\rvert \le 1)$ | $3/5$ | $1/5$ | 1.0000 |
| Biweight | $k_2(u) = \frac{15}{16}(1 - u^2)^2\,1(\lvert u\rvert \le 1)$ | $5/7$ | $1/7$ | 1.0061 |
| Triweight | $k_3(u) = \frac{35}{32}(1 - u^2)^3\,1(\lvert u\rvert \le 1)$ | $350/429$ | $1/9$ | 1.0135 |
| Gaussian | $k_\phi(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)$ | $1/(2\sqrt{\pi})$ | $1$ | 1.0513 |

In addition to the kernel formula we have listed its roughness $R(k)$, second moment $\kappa_2(k)$, and its efficiency $eff(k)$, the last of which will be defined later. The roughness of a function is
$$R(g) = \int_{-\infty}^{\infty} g(u)^2\,du.$$

The most commonly used kernels are the Epanechnikov and the Gaussian.

The kernels in Table 1 are special cases of the polynomial family
$$k_s(u) = \frac{(2s+1)!!}{2^{s+1}s!}\left(1 - u^2\right)^s 1(|u| \le 1)$$
where the double factorial means $(2s+1)!! = (2s+1)(2s-1)\cdots 5 \cdot 3 \cdot 1$. The Gaussian kernel is obtained by taking the limit as $s \to \infty$ after rescaling. The kernels with higher $s$ are smoother, yielding estimates $\hat f(x)$ which are smoother and possess more derivatives. Estimates using the Gaussian kernel have derivatives of all orders.

For the purpose of nonparametric estimation the scale of the kernel is not uniquely defined. That is, for any kernel $k(u)$ we could have defined the alternative kernel $k^*(u) = b^{-1}k(u/b)$ for some constant $b > 0$. These two kernels are equivalent in the sense of producing the same density estimator, so long as the bandwidth is rescaled. That is, if $\hat f(x)$ is calculated with kernel $k$ and bandwidth $h$, it is numerically identical to a calculation with kernel $k^*$ and bandwidth $h^* = h/b$. Some authors use different definitions for the same kernels. This can cause confusion unless you are attentive.
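The roughness and second-moment columns for the polynomial family can be checked with exact rational arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes Python 3.8 or later for `math.comb`. It expands $k_s(u)$ as a polynomial in $u^2$ and integrates term by term over $[-1, 1]$.

```python
from fractions import Fraction
from math import comb, factorial

def double_factorial(m):
    # m!! = m (m - 2) (m - 4) ... down to 1 (or 2)
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def poly_kernel_coeffs(s):
    # Coefficients c[j] of u^(2j) in k_s(u) = scale * (1 - u^2)^s on |u| <= 1
    scale = Fraction(double_factorial(2 * s + 1), 2 ** (s + 1) * factorial(s))
    return [scale * comb(s, j) * (-1) ** j for j in range(s + 1)]

def integrate_even_poly(coeffs, extra_power=0):
    # Integral over [-1, 1] of u^extra_power * sum_j coeffs[j] u^(2j), extra_power even
    return sum(c * Fraction(2, 2 * j + extra_power + 1) for j, c in enumerate(coeffs))

def square_even_poly(coeffs):
    # Coefficients (in powers of u^2) of the squared polynomial
    out = [Fraction(0)] * (2 * len(coeffs) - 1)
    for i, a in enumerate(coeffs):
        for j, b in enumerate(coeffs):
            out[i + j] += a * b
    return out

def kernel_constants(s):
    # Returns (integral of k_s, kappa_2(k_s), R(k_s)) as exact fractions
    c = poly_kernel_coeffs(s)
    total = integrate_even_poly(c)
    kappa2 = integrate_even_poly(c, extra_power=2)
    roughness = integrate_even_poly(square_even_poly(c))
    return total, kappa2, roughness

# s = 0, 1, 2, 3 correspond to Uniform, Epanechnikov, Biweight, Triweight
constants = {s: kernel_constants(s) for s in range(4)}
```

Because the arithmetic is exact, the results can be compared directly with the fractions in Table 1.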

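Higher-order kernels are obtained by multiplying a second-order kernel by a $(\nu/2 - 1)$'th order polynomial in $u^2$. Explicit formulae for the general polynomial family can be found in B. Hansen (Econometric Theory, 2005), and for the Gaussian family in Wand and Schucany (Canadian Journal of Statistics, 1990). 4th and 6th order kernels of interest are given in Tables 2 and 3.

Table 2: Fourth-Order Kernels

| Kernel | Equation | $R(k)$ | $\kappa_4(k)$ | $eff(k)$ |
|---|---|---|---|---|
| Epanechnikov | $k_{4,1}(u) = \frac{15}{8}\left(1 - \frac{7}{3}u^2\right)k_1(u)$ | $5/4$ | $-1/21$ | 1.0000 |
| Biweight | $k_{4,2}(u) = \frac{7}{4}\left(1 - 3u^2\right)k_2(u)$ | $805/572$ | $-1/33$ | 1.0056 |
| Triweight | $k_{4,3}(u) = \frac{27}{16}\left(1 - \frac{11}{3}u^2\right)k_3(u)$ | $3780/2431$ | $-3/143$ | 1.0134 |
| Gaussian | $k_{4,\phi}(u) = \frac{1}{2}\left(3 - u^2\right)k_\phi(u)$ | $27/(32\sqrt{\pi})$ | $-3$ | 1.0729 |

Table 3: Sixth-Order Kernels

| Kernel | Equation | $R(k)$ | $\kappa_6(k)$ | $eff(k)$ |
|---|---|---|---|---|
| Epanechnikov | $k_{6,1}(u) = \frac{175}{64}\left(1 - 6u^2 + \frac{33}{5}u^4\right)k_1(u)$ | $1575/832$ | $5/429$ | 1.0000 |
| Biweight | $k_{6,2}(u) = \frac{315}{128}\left(1 - \frac{22}{3}u^2 + \frac{143}{15}u^4\right)k_2(u)$ | $29295/14144$ | $1/143$ | 1.0048 |
| Triweight | $k_{6,3}(u) = \frac{297}{128}\left(1 - \frac{26}{3}u^2 + 13u^4\right)k_3(u)$ | $301455/134368$ | $1/221$ | 1.0122 |
| Gaussian | $k_{6,\phi}(u) = \frac{1}{8}\left(15 - 10u^2 + u^4\right)k_\phi(u)$ | $2265/(2048\sqrt{\pi})$ | $15$ | 1.0871 |

2.3 Density Estimator

We now discuss some of the numerical properties of the kernel estimator
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)$$
viewed as a function of $x$.

First, if $k(u)$ is non-negative then it is easy to see that $\hat f(x) \ge 0$. However, this is not guaranteed if $k$ is a higher-order kernel. That is, in this case it is possible that $\hat f(x) < 0$ for some values of $x$. When this happens it is prudent to zero-out the negative bits and then rescale:
$$\tilde f(x) = \frac{\hat f(x)\,1\left(\hat f(x) \ge 0\right)}{\int_{-\infty}^{\infty} \hat f(x)\,1\left(\hat f(x) \ge 0\right)dx}.$$
$\tilde f(x)$ is non-negative yet has the same asymptotic properties as $\hat f(x)$. Since the integral in the denominator is not analytically available this needs to be calculated numerically.

Second, $\hat f(x)$ integrates to one. To see this, first note that by the change-of-variables $u = (X_i - x)/h$, which has Jacobian $h$,
$$\int_{-\infty}^{\infty} \frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \int_{-\infty}^{\infty} k(u)\,du = 1.$$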

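The change-of-variables $u = (X_i - x)/h$ will be used frequently, so it is useful to be familiar with this transformation. Thus
$$\int_{-\infty}^{\infty} \hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} \frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} k(u)\,du = 1$$
as claimed. Thus $\hat f(x)$ is a valid density function when $k$ is non-negative.

Third, we can also calculate the numerical moments of the density $\hat f(x)$. Again using the change-of-variables $u = (X_i - x)/h$ (and replacing $u$ by $-u$, which is allowed because $k$ is symmetric), the mean of the estimated density is
$$\int_{-\infty}^{\infty} x\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} x\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} (X_i + uh)\,k(u)\,du$$
$$= \frac{1}{n}\sum_{i=1}^n X_i \int_{-\infty}^{\infty} k(u)\,du + \frac{1}{n}\sum_{i=1}^n h\int_{-\infty}^{\infty} u\,k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i,$$
the sample mean of the $X_i$.

The second moment of the estimated density is
$$\int_{-\infty}^{\infty} x^2\hat f(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} x^2\frac{1}{h}k\left(\frac{X_i - x}{h}\right)dx = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty} (X_i + uh)^2 k(u)\,du$$
$$= \frac{1}{n}\sum_{i=1}^n X_i^2 + \frac{2}{n}\sum_{i=1}^n X_i h\int_{-\infty}^{\infty} u\,k(u)\,du + \frac{1}{n}\sum_{i=1}^n h^2\int_{-\infty}^{\infty} u^2 k(u)\,du = \frac{1}{n}\sum_{i=1}^n X_i^2 + h^2\kappa_2(k).$$
It follows that the variance of the density $\hat f(x)$ is
$$\int_{-\infty}^{\infty} x^2\hat f(x)\,dx - \left(\int_{-\infty}^{\infty} x\hat f(x)\,dx\right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 + h^2\kappa_2(k) - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 = \hat\sigma^2 + h^2\kappa_2(k)$$
where $\hat\sigma^2$ is the sample variance. Thus the density estimate inflates the sample variance by the factor $h^2\kappa_2(k)$.

These are the numerical mean and variance of the estimated density $\hat f(x)$, not its sampling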

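mean and variance.

2.4 Estimation Bias

It is useful to observe that expectations of kernel transformations can be written as integrals which take the form of a convolution of the kernel and the density function:
$$E\,\frac{1}{h}k\left(\frac{X_i - x}{h}\right) = \int_{-\infty}^{\infty} \frac{1}{h}k\left(\frac{z - x}{h}\right)f(z)\,dz.$$
Using the change-of-variables $u = (z - x)/h$, this equals
$$\int_{-\infty}^{\infty} k(u)\,f(x + hu)\,du.$$
By the linearity of the estimator we see
$$E\hat f(x) = \frac{1}{n}\sum_{i=1}^n E\,\frac{1}{h}k\left(\frac{X_i - x}{h}\right) = \int_{-\infty}^{\infty} k(u)\,f(x + hu)\,du.$$
The last expression shows that the expected value is an average of $f(z)$ locally about $x$.

This integral (typically) is not analytically solvable, so we approximate it using a Taylor expansion of $f(x + hu)$ in the argument $hu$, which is valid as $h \to 0$. For a $\nu$'th-order kernel we take the expansion out to the $\nu$'th term
$$f(x + hu) = f(x) + f^{(1)}(x)hu + \frac{1}{2}f^{(2)}(x)h^2u^2 + \frac{1}{3!}f^{(3)}(x)h^3u^3 + \cdots + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu u^\nu + o(h^\nu).$$
The remainder is of smaller order than $h^\nu$ as $h \to 0$, which is written as $o(h^\nu)$. (This expansion assumes $f^{(\nu+1)}(x)$ exists.)

Integrating term by term and using $\int_{-\infty}^{\infty} k(u)\,du = 1$ and the definition $\int_{-\infty}^{\infty} k(u)u^j\,du = \kappa_j(k)$,
$$\int_{-\infty}^{\infty} k(u)\,f(x + hu)\,du = f(x) + f^{(1)}(x)h\kappa_1(k) + \frac{1}{2}f^{(2)}(x)h^2\kappa_2(k) + \frac{1}{3!}f^{(3)}(x)h^3\kappa_3(k) + \cdots + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu)$$
$$= f(x) + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu)$$
where the second equality uses the assumption that $k$ is a $\nu$'th order kernel (so $\kappa_j(k) = 0$ for $j < \nu$).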

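This means that
$$E\hat f(x) = \frac{1}{n}\sum_{i=1}^n E\,\frac{1}{h}k\left(\frac{X_i - x}{h}\right) = f(x) + \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu).$$
The bias of $\hat f(x)$ is then
$$Bias(\hat f(x)) = E\hat f(x) - f(x) = \frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu).$$
For second-order kernels, this simplifies to
$$Bias(\hat f(x)) = \frac{1}{2}f^{(2)}(x)h^2\kappa_2(k) + O(h^4).$$
For second-order kernels, the bias is increasing in the square of the bandwidth. Smaller bandwidths imply reduced bias. The bias is also proportional to the second derivative of the density $f^{(2)}(x)$. Intuitively, the estimator $\hat f(x)$ smooths data local to $X_i = x$, so is estimating a smoothed version of $f(x)$. The bias results from this smoothing, and is larger the greater the curvature in $f(x)$.

When higher-order kernels are used (and the density has enough derivatives), the bias is proportional to $h^\nu$, which is of lower order than $h^2$. Thus the bias of estimates using higher-order kernels is of lower order than estimates from second-order kernels, and this is why they are called bias-reducing kernels. This is the advantage of higher-order kernels.

2.5 Estimation Variance

Since the kernel estimator is a linear estimator, and $k\left(\frac{X_i - x}{h}\right)$ is iid,
$$var\left(\hat f(x)\right) = \frac{1}{nh^2}var\left(k\left(\frac{X_i - x}{h}\right)\right) = \frac{1}{nh^2}E\,k\left(\frac{X_i - x}{h}\right)^2 - \frac{1}{n}\left(\frac{1}{h}E\,k\left(\frac{X_i - x}{h}\right)\right)^2.$$
From our analysis of bias we know that $\frac{1}{h}E\,k\left(\frac{X_i - x}{h}\right) = f(x) + o(1)$ so the second term is $O\left(\frac{1}{n}\right)$. For the first term, write the expectation as an integral, make a change-of-variables and a first-order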

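Taylor expansion:
$$\frac{1}{h}E\,k\left(\frac{X_i - x}{h}\right)^2 = \frac{1}{h}\int_{-\infty}^{\infty} k\left(\frac{z - x}{h}\right)^2 f(z)\,dz = \int_{-\infty}^{\infty} k(u)^2 f(x + hu)\,du$$
$$= \int_{-\infty}^{\infty} k(u)^2\left(f(x) + O(h)\right)du = f(x)R(k) + O(h)$$
where $R(k) = \int_{-\infty}^{\infty} k(u)^2\,du$ is the roughness of the kernel. Together, we see
$$var\left(\hat f(x)\right) = \frac{f(x)R(k)}{nh} + O\left(\frac{1}{n}\right).$$
The remainder $O\left(\frac{1}{n}\right)$ is of smaller order than the $O\left(\frac{1}{nh}\right)$ leading term, since $h^{-1} \to \infty$.

2.6 Mean-Squared Error

A common and convenient measure of estimation precision is the mean-squared error
$$MSE(\hat f(x)) = E\left(\hat f(x) - f(x)\right)^2 = Bias(\hat f(x))^2 + var\left(\hat f(x)\right)$$
$$\simeq \left(\frac{1}{\nu!}f^{(\nu)}(x)h^\nu\kappa_\nu(k)\right)^2 + \frac{f(x)R(k)}{nh} = \frac{\kappa_\nu^2(k)}{(\nu!)^2}f^{(\nu)}(x)^2h^{2\nu} + \frac{f(x)R(k)}{nh} = AMSE(\hat f(x)).$$
Since this approximation is based on asymptotic expansions this is called the asymptotic mean-squared-error (AMSE). Note that it is a function of the sample size $n$, the bandwidth $h$, the kernel function (through $\kappa_\nu$ and $R(k)$), and varies with $x$ as $f^{(\nu)}(x)$ and $f(x)$ vary.

Notice as well that the first term (the squared bias) is increasing in $h$ and the second term (the variance) is decreasing in $nh$. For $MSE(\hat f(x))$ to decline as $n \to \infty$ both of these terms must get small. Thus as $n \to \infty$ we must have $h \to 0$ and $nh \to \infty$. That is, the bandwidth must decrease, but not at a rate faster than sample size. This is sufficient to establish the pointwise consistency of the estimator. That is, for all $x$, $\hat f(x) \to_p f(x)$ as $n \to \infty$. We call this pointwise convergence as it is valid for each $x$ individually. We discuss uniform convergence later.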

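A global measure of precision is the asymptotic mean integrated squared error (AMISE)
$$AMISE = \int_{-\infty}^{\infty} AMSE(\hat f(x))\,dx = \frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right)h^{2\nu} + \frac{R(k)}{nh}$$
where $R(f^{(\nu)}) = \int_{-\infty}^{\infty}\left(f^{(\nu)}(x)\right)^2dx$ is the roughness of $f^{(\nu)}$.

2.7 Asymptotically Optimal Bandwidth

The AMISE formula expresses the MSE as a function of $h$. The value of $h$ which minimizes this expression is called the asymptotically optimal bandwidth. The solution is found by taking the derivative of the AMISE with respect to $h$ and setting it equal to zero:
$$\frac{d}{dh}AMISE = \frac{d}{dh}\left(\frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right)h^{2\nu} + \frac{R(k)}{nh}\right) = 2\nu h^{2\nu-1}\frac{\kappa_\nu^2(k)}{(\nu!)^2}R\left(f^{(\nu)}\right) - \frac{R(k)}{nh^2} = 0$$
with solution
$$h_0 = C_\nu(k, f)\,n^{-1/(2\nu+1)}$$
$$C_\nu(k, f) = R\left(f^{(\nu)}\right)^{-1/(2\nu+1)}A_\nu(k)$$
$$A_\nu(k) = \left(\frac{(\nu!)^2R(k)}{2\nu\,\kappa_\nu^2(k)}\right)^{1/(2\nu+1)}.$$
The optimal bandwidth is proportional to $n^{-1/(2\nu+1)}$. We say that the optimal bandwidth is of order $O\left(n^{-1/(2\nu+1)}\right)$. For second-order kernels the optimal rate is $O\left(n^{-1/5}\right)$. For higher-order kernels the rate is slower, suggesting that bandwidths are generally larger than for second-order kernels. The intuition is that since higher-order kernels have smaller bias, they can afford a larger bandwidth.

The constant of proportionality $C_\nu(k, f)$ depends on the kernel through the function $A_\nu(k)$ (which can be calculated from Table 1), and the density through $R(f^{(\nu)})$ (which is unknown).

If the bandwidth is set to $h_0$, then with some simplification the AMISE equals
$$AMISE_0(k) = (1 + 2\nu)\left(\frac{R\left(f^{(\nu)}\right)\kappa_\nu^2(k)R(k)^{2\nu}}{(\nu!)^2(2\nu)^{2\nu}}\right)^{1/(2\nu+1)}n^{-2\nu/(2\nu+1)}.$$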
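The bandwidth formulas above translate directly into code. The following is a minimal sketch, not part of the original notes: the function names are hypothetical, and the example assumes the density is standard normal, for which $R(\phi^{(2)}) = 3/(8\sqrt{\pi})$, purely so that $R(f^{(\nu)})$ is known. In practice $R(f^{(\nu)})$ is unknown, which is why data-dependent rules are needed.

```python
import math

def A_nu(nu, R_k, kappa_nu):
    # A_nu(k) = ((nu!)^2 R(k) / (2 nu kappa_nu(k)^2))^(1 / (2 nu + 1))
    return (math.factorial(nu) ** 2 * R_k / (2 * nu * kappa_nu ** 2)) ** (1 / (2 * nu + 1))

def optimal_bandwidth(n, nu, R_k, kappa_nu, R_f_nu):
    # h_0 = R(f^(nu))^(-1/(2 nu + 1)) * A_nu(k) * n^(-1/(2 nu + 1))
    power = 1 / (2 * nu + 1)
    return R_f_nu ** (-power) * A_nu(nu, R_k, kappa_nu) * n ** (-power)

def amise(h, n, nu, R_k, kappa_nu, R_f_nu):
    # Squared-bias term plus variance term
    bias_sq = kappa_nu ** 2 / math.factorial(nu) ** 2 * R_f_nu * h ** (2 * nu)
    variance = R_k / (n * h)
    return bias_sq + variance

# Illustrative inputs: Gaussian kernel (Table 1) and a standard normal density
n = 100
R_gauss = 1 / (2 * math.sqrt(math.pi))
kappa2_gauss = 1.0
R_f2_normal = 3 / (8 * math.sqrt(math.pi))

h0 = optimal_bandwidth(n, 2, R_gauss, kappa2_gauss, R_f2_normal)
amise_at_h0 = amise(h0, n, 2, R_gauss, kappa2_gauss, R_f2_normal)
```

Evaluating `amise` on either side of `h0` is a quick way to confirm that the first-order condition was solved correctly.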

For second-order kernels, this equals
$$AMISE_0(k) = \frac{5}{4}\left(\kappa_2^2(k)R(k)^4R\left(f^{(2)}\right)\right)^{1/5}n^{-4/5}.$$
As $\nu$ gets large, the convergence rate approaches the parametric rate $n^{-1}$. Thus, at least asymptotically, the slow convergence of nonparametric estimation can be mitigated through the use of higher-order kernels.

This seems a bit magical. What's the catch? For one, the improvement in convergence rate requires that the density is sufficiently smooth that derivatives exist up to the $(\nu + 1)$'th order. As the density becomes increasingly smooth, it is easier to approximate by a low-dimensional curve, and gets closer to a parametric-type problem. This is exploiting the smoothness of $f$, which is inherently unknown. The other catch is that there is some evidence that the benefits of higher-order kernels only develop when the sample size is fairly large. My sense is that in small samples, a second-order kernel would be the best choice, in moderate samples a 4th order kernel, and in larger samples a 6th order kernel could be used.

2.8 Asymptotically Optimal Kernel

Given that we have picked the kernel order, which kernel should we use? Examining the expression $AMISE_0$ we can see that for fixed $\nu$ the choice of kernel affects the asymptotic precision through the quantity $\kappa_\nu(k)R(k)^\nu$. All else equal, AMISE will be minimized by selecting the kernel which minimizes this quantity. As we discussed earlier, only the shape of the kernel is important, not its scale, so we can set $\kappa_\nu = 1$. Then the problem reduces to minimization of $R(k) = \int_{-\infty}^{\infty} k(u)^2\,du$ subject to the constraints $\int_{-\infty}^{\infty} k(u)\,du = 1$ and $\int_{-\infty}^{\infty} u^\nu k(u)\,du = 1$. This is a problem in the calculus of variations. It turns out that the solution is a scaled version of $k_{\nu,1}$ (see Muller (Annals of Statistics, 1984)). As the scale is irrelevant, this means that for estimation of the density function, the higher-order Epanechnikov kernel $k_{\nu,1}$ with optimal bandwidth yields the lowest possible AMISE. For this reason, the Epanechnikov kernel is often called the optimal kernel.

To compare kernels, the relative efficiency of a kernel $k$ is defined as
$$eff(k) = \left(\frac{AMISE_0(k)}{AMISE_0(k_{\nu,1})}\right)^{(1+2\nu)/2\nu} = \frac{\left(\kappa_\nu^2(k)\right)^{1/2\nu}R(k)}{\left(\kappa_\nu^2(k_{\nu,1})\right)^{1/2\nu}R(k_{\nu,1})}.$$
The ratio of the AMISEs is raised to the power $(1 + 2\nu)/2\nu$ because, for large $n$, the AMISE will be the same whether we use $n$ observations with kernel $k_{\nu,1}$ or $n\,eff(k)$ observations with kernel $k$. Thus the penalty $eff(k)$ is expressed as a percentage of observations.

The efficiencies of the various kernels are given in Tables 1-3. Examining the second-order kernels, we see that relative to the Epanechnikov kernel, the uniform kernel pays a penalty of about 7%, the Gaussian kernel a penalty of about 5%, the Triweight kernel about 1.4%, and the Biweight

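kernel less than 1%. Examining the 4th and 6th-order kernels, we see that the relative efficiency of the Gaussian kernel deteriorates, while that of the Biweight and Triweight slightly improves.

The differences are not big. Still, the calculation suggests that the Epanechnikov and Biweight kernel classes are good choices for density estimation.

2.9 Rule-of-Thumb Bandwidth

The optimal bandwidth depends on the unknown quantity $R(f^{(\nu)})$. Silverman proposed that we try the bandwidth computed by replacing $R(f^{(\nu)})$ in the optimal formula by $R(g_{\hat\sigma}^{(\nu)})$ where $g_\sigma$ is a reference density (a plausible candidate for $f$) and $\hat\sigma^2$ is the sample variance. The standard choice is to set $g_{\hat\sigma} = \phi_{\hat\sigma}$, the $N(0, \hat\sigma^2)$ density. The idea is that if the true density is normal, then the computed bandwidth will be optimal. If the true density is reasonably close to the normal, then the bandwidth will be close to optimal. While not a perfect solution, it is a good place to start looking.

For any density $g$, if we set $g_\sigma(x) = \sigma^{-1}g(x/\sigma)$, then $g_\sigma^{(\nu)}(x) = \sigma^{-1-\nu}g^{(\nu)}(x/\sigma)$. Thus
$$R\left(g_\sigma^{(\nu)}\right)^{-1/(2\nu+1)} = \left(\int g_\sigma^{(\nu)}(x)^2dx\right)^{-1/(2\nu+1)} = \left(\sigma^{-2-2\nu}\int g^{(\nu)}(x/\sigma)^2dx\right)^{-1/(2\nu+1)}$$
$$= \left(\sigma^{-1-2\nu}\int g^{(\nu)}(x)^2dx\right)^{-1/(2\nu+1)} = \sigma R\left(g^{(\nu)}\right)^{-1/(2\nu+1)}.$$
Furthermore,
$$R\left(\phi^{(\nu)}\right)^{-1/(2\nu+1)} = 2\left(\frac{\pi^{1/2}\nu!}{(2\nu)!}\right)^{1/(2\nu+1)}.$$
Thus
$$R\left(\phi_{\hat\sigma}^{(\nu)}\right)^{-1/(2\nu+1)} = 2\hat\sigma\left(\frac{\pi^{1/2}\nu!}{(2\nu)!}\right)^{1/(2\nu+1)}.$$
The rule-of-thumb bandwidth is then $h = \hat\sigma C_\nu(k)\,n^{-1/(2\nu+1)}$ where
$$C_\nu(k) = R\left(\phi^{(\nu)}\right)^{-1/(2\nu+1)}A_\nu(k) = 2\left(\frac{\pi^{1/2}(\nu!)^3R(k)}{2\nu(2\nu)!\,\kappa_\nu^2(k)}\right)^{1/(2\nu+1)}.$$
We collect these constants in Table 4.

Table 4: Rule of Thumb Constants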

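| Kernel | $\nu = 2$ | $\nu = 4$ | $\nu = 6$ |
|---|---|---|---|
| Epanechnikov | 2.34 | 3.03 | 3.53 |
| Biweight | 2.78 | 3.39 | 3.84 |
| Triweight | 3.15 | 3.72 | 4.13 |
| Gaussian | 1.06 | 1.08 | 1.08 |

Silverman Rule-of-Thumb: $h = \hat\sigma C_\nu(k)\,n^{-1/(2\nu+1)}$ where $\hat\sigma$ is the sample standard deviation, $\nu$ is the order of the kernel, and $C_\nu(k)$ is the constant from Table 4.

If a Gaussian kernel is used, this is often simplified to $h = \hat\sigma n^{-1/(2\nu+1)}$. In particular, for the standard second-order normal kernel, $h = \hat\sigma n^{-1/5}$.

2.10 Density Derivatives

Consider the problem of estimating the $r$'th derivative of the density:
$$f^{(r)}(x) = \frac{d^r}{dx^r}f(x).$$
A natural estimator is found by taking derivatives of the kernel density estimator. This takes the form
$$\hat f^{(r)}(x) = \frac{d^r}{dx^r}\hat f(x) = \frac{(-1)^r}{nh^{1+r}}\sum_{i=1}^n k^{(r)}\left(\frac{X_i - x}{h}\right)$$
where
$$k^{(r)}(x) = \frac{d^r}{dx^r}k(x).$$
The factor $(-1)^r$ arises because each derivative of $k((X_i - x)/h)$ with respect to $x$ contributes a factor $-1/h$. This estimator only makes sense if $k^{(r)}(x)$ exists and is non-zero. Since the Gaussian kernel has derivatives of all orders this is a common choice for derivative estimation.

The asymptotic analysis of this estimator is similar to that of the density, but with a couple of extra wrinkles and noticeably different results. First, to calculate the bias we observe that
$$E\hat f^{(r)}(x) = (-1)^r\int_{-\infty}^{\infty}\frac{1}{h^{1+r}}k^{(r)}\left(\frac{z - x}{h}\right)f(z)\,dz.$$
To simplify this expression we use integration by parts. As the integral of $h^{-1}k^{(r)}\left(\frac{z - x}{h}\right)$ is $k^{(r-1)}\left(\frac{z - x}{h}\right)$, and the boundary terms vanish, we find that the above expression equals
$$(-1)^{r-1}\int_{-\infty}^{\infty}\frac{1}{h^r}k^{(r-1)}\left(\frac{z - x}{h}\right)f^{(1)}(z)\,dz.$$
Repeating this a total of $r$ times, we obtain
$$\int_{-\infty}^{\infty}\frac{1}{h}k\left(\frac{z - x}{h}\right)f^{(r)}(z)\,dz.$$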

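Next, apply the change of variables $u = (z - x)/h$ to obtain
$$\int_{-\infty}^{\infty} k(u)\,f^{(r)}(x + hu)\,du.$$
Now expand $f^{(r)}(x + hu)$ in a $\nu$'th-order Taylor expansion about $x$, and integrate the terms to find that the above equals
$$f^{(r)}(x) + \frac{1}{\nu!}f^{(r+\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu)$$
where $\nu$ is the order of the kernel. Hence the asymptotic bias is
$$Bias(\hat f^{(r)}(x)) = E\hat f^{(r)}(x) - f^{(r)}(x) = \frac{1}{\nu!}f^{(r+\nu)}(x)h^\nu\kappa_\nu(k) + o(h^\nu).$$
This of course presumes that $f$ is differentiable of order at least $r + \nu + 1$.

For the variance, we find
$$var\left(\hat f^{(r)}(x)\right) = \frac{1}{nh^{2+2r}}var\left(k^{(r)}\left(\frac{X_i - x}{h}\right)\right) = \frac{1}{nh^{2+2r}}E\,k^{(r)}\left(\frac{X_i - x}{h}\right)^2 - \frac{1}{n}\left(\frac{1}{h^{1+r}}E\,k^{(r)}\left(\frac{X_i - x}{h}\right)\right)^2$$
$$= \frac{1}{nh^{2+2r}}\int_{-\infty}^{\infty} k^{(r)}\left(\frac{z - x}{h}\right)^2f(z)\,dz - \frac{1}{n}f^{(r)}(x)^2 + O\left(\frac{1}{n}\right)$$
$$= \frac{1}{nh^{1+2r}}\int_{-\infty}^{\infty} k^{(r)}(u)^2f(x + hu)\,du + O\left(\frac{1}{n}\right)$$
$$= \frac{f(x)}{nh^{1+2r}}\int_{-\infty}^{\infty} k^{(r)}(u)^2\,du + O\left(\frac{1}{n}\right) = \frac{f(x)R(k^{(r)})}{nh^{1+2r}} + O\left(\frac{1}{n}\right).$$
The AMSE and AMISE are
$$AMSE(\hat f^{(r)}(x)) = \frac{f^{(r+\nu)}(x)^2h^{2\nu}\kappa_\nu^2(k)}{(\nu!)^2} + \frac{f(x)R(k^{(r)})}{nh^{1+2r}}$$
and
$$AMISE(\hat f^{(r)}(x)) = \frac{R\left(f^{(r+\nu)}\right)h^{2\nu}\kappa_\nu^2(k)}{(\nu!)^2} + \frac{R(k^{(r)})}{nh^{1+2r}}.$$
Note that the order of the bias is the same as for estimation of the density. But the variance is now of order $O\left(\frac{1}{nh^{1+2r}}\right)$ which is much larger than the $O\left(\frac{1}{nh}\right)$ found earlier.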
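As an illustration of the derivative estimator for $r = 1$ with the Gaussian kernel, here is a minimal sketch. The function names and the sample are hypothetical. Differentiating $k((X_i - x)/h)$ with respect to $x$ brings out a factor $-1/h$, so the sketch uses $\hat f^{(1)}(x) = -\frac{1}{nh^2}\sum_i k^{(1)}\left(\frac{X_i - x}{h}\right)$ with $k^{(1)}(u) = -u\phi(u)$.

```python
import math

def phi(u):
    # Standard normal density (second-order Gaussian kernel)
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def phi_prime(u):
    # First derivative of the Gaussian kernel
    return -u * phi(u)

def kde(x, data, h):
    return sum(phi((xi - x) / h) for xi in data) / (len(data) * h)

def kde_first_derivative(x, data, h):
    # d/dx of kde: the chain rule on (X_i - x) / h gives the leading minus sign
    n = len(data)
    return -sum(phi_prime((xi - x) / h) for xi in data) / (n * h ** 2)

# Hypothetical sample, for illustration only
sample = [-1.2, -0.3, 0.0, 0.4, 1.1, 2.0]
slope = kde_first_derivative(0.3, sample, 0.7)

# Central finite difference of the density estimate, as a sanity check
eps = 1e-5
finite_difference = (kde(0.3 + eps, sample, 0.7) - kde(0.3 - eps, sample, 0.7)) / (2 * eps)
```

The finite-difference comparison is only a numerical check of the sign and scaling; it says nothing about the choice of bandwidth, which for derivatives differs from the density case as discussed next.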

The asymptotically optimal bandwidth is
$$h_r = C_{r,\nu}(k, f)\,n^{-1/(1+2r+2\nu)}$$
$$C_{r,\nu}(k, f) = R\left(f^{(r+\nu)}\right)^{-1/(1+2r+2\nu)}A_{r,\nu}(k)$$
$$A_{r,\nu}(k) = \left(\frac{(1 + 2r)(\nu!)^2R(k^{(r)})}{2\nu\,\kappa_\nu^2(k)}\right)^{1/(1+2r+2\nu)}.$$
Thus the optimal bandwidth converges at a slower rate than for density estimation. Given this bandwidth, the rate of convergence for the AMISE is $O\left(n^{-2\nu/(2r+2\nu+1)}\right)$, which is slower than the $O\left(n^{-2\nu/(2\nu+1)}\right)$ rate when $r = 0$ (for example, $n^{-4/5}$ when $\nu = 2$).

We see that we need a different bandwidth for estimation of derivatives than for estimation of the density. This is a common situation which arises in nonparametric analysis. The optimal amount of smoothing depends upon the object being estimated, and the goal of the analysis.

The AMISE with the optimal bandwidth is
$$AMISE(\hat f^{(r)}(x)) = (1 + 2r + 2\nu)\left(\frac{\kappa_\nu^2(k)R\left(f^{(r+\nu)}\right)}{(\nu!)^2(1 + 2r)}\right)^{(2r+1)/(1+2r+2\nu)}\left(\frac{R\left(k^{(r)}\right)}{2\nu}\right)^{2\nu/(1+2r+2\nu)}n^{-2\nu/(1+2r+2\nu)}.$$
We can also ask the question of which kernel function is optimal, and this is addressed by Muller (1984). The problem amounts to minimizing $R(k^{(r)})$ subject to a moment condition, and the solution is to set $k$ equal to $k_{\nu,r+1}$, the polynomial kernel of $\nu$'th order and exponent $r + 1$. Thus to estimate a first derivative it is optimal to use a member of the Biweight class and for a second derivative a member of the Triweight class.

The relative efficiency of a kernel $k$ is then
$$eff(k) = \left(\frac{AMISE_0(k)}{AMISE_0(k_{\nu,r+1})}\right)^{(1+2\nu+2r)/2\nu} = \left(\frac{\kappa_\nu^2(k)}{\kappa_\nu^2(k_{\nu,r+1})}\right)^{(1+2r)/2\nu}\frac{R\left(k^{(r)}\right)}{R\left(k_{\nu,r+1}^{(r)}\right)}.$$
The relative efficiencies of the various kernels are presented in Table 5. (The Epanechnikov kernel is not considered as it is inappropriate for derivative estimation, and similarly the Biweight kernel for $r = 2$.) In contrast to the case $r = 0$, we see that the Gaussian kernel is highly inefficient, with the efficiency loss increasing with $r$ and $\nu$. These calculations suggest that when estimating density derivatives it is important to use the appropriate kernel.

Table 5: Relative Efficiency $eff(k)$

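| $r$ | $\nu$ | Biweight | Triweight | Gaussian |
|---|---|---|---|---|
| 1 | 2 | 1.0000 | 1.0185 | 1.2191 |
| 1 | 4 | 1.0000 | 1.0159 | 1.2753 |
| 1 | 6 | 1.0000 | 1.0136 | 1.3156 |
| 2 | 2 | n/a | 1.0000 | 1.4689 |
| 2 | 4 | n/a | 1.0000 | 1.5592 |
| 2 | 6 | n/a | 1.0000 | 1.6275 |

The Silverman Rule-of-Thumb may also be applied to density derivative estimation. Again using the reference density $g = \phi$, we find the rule-of-thumb bandwidth is $h = C_{r,\nu}(k)\,\hat\sigma\,n^{-1/(2r+2\nu+1)}$ where
$$C_{r,\nu}(k) = 2\left(\frac{\pi^{1/2}(1 + 2r)(\nu!)^2(r + \nu)!\,R\left(k^{(r)}\right)}{2\nu\,\kappa_\nu^2(k)\,(2r + 2\nu)!}\right)^{1/(2r+2\nu+1)}.$$
The constants $C_{r,\nu}$ are collected in Table 6. For all kernels, the constants $C_{r,\nu}$ are similar but slightly decreasing as $r$ increases.

Table 6: Rule of Thumb Constants

| $r$ | $\nu$ | Biweight | Triweight | Gaussian |
|---|---|---|---|---|
| 1 | 2 | 2.49 | 2.83 | 0.97 |
| 1 | 4 | 3.18 | 3.49 | 1.03 |
| 1 | 6 | 3.44 | 3.96 | 1.04 |
| 2 | 2 | n/a | 2.70 | 0.94 |
| 2 | 4 | n/a | 3.35 | 1.00 |
| 2 | 6 | n/a | 3.84 | 1.02 |

2.11 Multivariate Density Estimation

Now suppose that $X_i$ is a $q$-vector and we want to estimate its density $f(x) = f(x_1, \ldots, x_q)$. A multivariate kernel estimator takes the form
$$\hat f(x) = \frac{1}{n|H|}\sum_{i=1}^n K\left(H^{-1}(X_i - x)\right)$$
where $K(u)$ is a multivariate kernel function, $H = (h_1, \ldots, h_q)'$ is a bandwidth vector, $H^{-1}(X_i - x)$ denotes the vector whose $j$'th element is $(X_{ji} - x_j)/h_j$, and $|H| = h_1h_2\cdots h_q$. A multivariate kernel satisfies
$$\int K(u)\,(du) = \int K(u)\,du_1\cdots du_q = 1.$$
Typically, $K(u)$ takes the product form:
$$K(u) = k(u_1)\,k(u_2)\cdots k(u_q).$$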

As in the univariate case, $\hat f(x)$ has the property that it integrates to one, and is non-negative if $K(u) \ge 0$. When $K(u)$ is a product kernel then the marginal densities of $\hat f(x)$ equal univariate kernel density estimators with kernel functions $k$ and bandwidths $h_j$.

With some work, you can show that when $K(u)$ takes the product form, the bias of the estimator is
$$Bias(\hat f(x)) = \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu + o\left(h_1^\nu + \cdots + h_q^\nu\right)$$
and the variance is
$$var\left(\hat f(x)\right) = \frac{f(x)R(K)}{n|H|} + O\left(\frac{1}{n}\right) = \frac{f(x)R(k)^q}{nh_1h_2\cdots h_q} + O\left(\frac{1}{n}\right).$$
Hence the AMISE is
$$AMISE\left(\hat f(x)\right) = \frac{\kappa_\nu^2(k)}{(\nu!)^2}\int\left(\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu\right)^2(dx) + \frac{R(k)^q}{nh_1h_2\cdots h_q}.$$
There is no closed-form solution for the bandwidth vector which minimizes this expression. However, even without doing so, we can make a couple of observations.

First, the AMISE depends on the kernel function only through $R(k)$ and $\kappa_\nu^2(k)$, so it is clear that for any given $\nu$, the optimal kernel minimizes $R(k)$, which is the same as in the univariate case.

Second, the optimal bandwidths will all be of order $n^{-1/(2\nu+q)}$ and the optimal AMISE of order $n^{-2\nu/(2\nu+q)}$. These rates are slower than the univariate ($q = 1$) case. The fact that dimension has an adverse effect on convergence rates is called the curse of dimensionality. Many theoretical papers circumvent this problem through the following trick. Suppose you need the AMISE of the estimator to converge at a rate $O\left(n^{-1/2}\right)$ or faster. This requires $2\nu/(2\nu + q) > 1/2$, or $q < 2\nu$. For second-order kernels ($\nu = 2$) this restricts the dimension to be 3 or less. What some authors will do is slip in an assumption of the form: "Assume $f(x)$ is differentiable of order $\nu + 1$ where $\nu > q/2$," and then claim that their results hold for all $q$. The trouble is that what the author is doing is imposing greater smoothness as the dimension increases. This doesn't really avoid the curse of dimensionality, rather it hides it behind what appears to be a technical assumption. The bottom line is that nonparametric objects are much harder to estimate in higher dimensions, and that is why it is called a curse.

To derive a rule-of-thumb, suppose that $h_1 = h_2 = \cdots = h_q = h$. Then
$$AMISE\left(\hat f(x)\right) = \frac{\kappa_\nu^2(k)R\left(\nabla^\nu f\right)}{(\nu!)^2}h^{2\nu} + \frac{R(k)^q}{nh^q}$$

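where
$$\nabla^\nu f(x) = \sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x).$$
We find that the optimal bandwidth is
$$h_0 = \left(\frac{(\nu!)^2\,q\,R(k)^q}{2\nu\,\kappa_\nu^2(k)\,R\left(\nabla^\nu f\right)}\right)^{1/(2\nu+q)}n^{-1/(2\nu+q)}.$$
For a rule-of-thumb bandwidth, we replace $f$ by the multivariate normal density $\phi$. We can calculate that
$$R\left(\nabla^\nu\phi\right) = \frac{q}{\pi^{q/2}2^{q+\nu}}\left((2\nu - 1)!! + (q - 1)\left((\nu - 1)!!\right)^2\right).$$
Making this substitution, we obtain $h_0 = C_\nu(k, q)\,n^{-1/(2\nu+q)}$ where
$$C_\nu(k, q) = \left(\frac{\pi^{q/2}2^{q+\nu-1}(\nu!)^2R(k)^q}{\nu\,\kappa_\nu^2(k)\left((2\nu - 1)!! + (q - 1)\left((\nu - 1)!!\right)^2\right)}\right)^{1/(2\nu+q)}.$$
Now this assumed that all variables had unit variance. Rescaling the bandwidths by the standard deviation of each variable, we obtain the rule-of-thumb bandwidth for the $j$'th variable:
$$h_j = \hat\sigma_jC_\nu(k, q)\,n^{-1/(2\nu+q)}.$$
Numerical values for the constants $C_\nu(k, q)$ are given in Table 7 for $q = 2, 3, 4$.

Table 7: Rule of Thumb Constants

| $\nu$ | Kernel | $q = 2$ | $q = 3$ | $q = 4$ |
|---|---|---|---|---|
| 2 | Epanechnikov | 2.20 | 2.12 | 2.07 |
| 2 | Biweight | 2.61 | 2.52 | 2.46 |
| 2 | Triweight | 2.96 | 2.86 | 2.80 |
| 2 | Gaussian | 1.00 | 0.97 | 0.95 |
| 4 | Epanechnikov | 3.12 | 3.20 | 3.27 |
| 4 | Biweight | 3.50 | 3.59 | 3.67 |
| 4 | Triweight | 3.84 | 3.94 | 4.03 |
| 4 | Gaussian | 1.12 | 1.16 | 1.19 |
| 6 | Epanechnikov | 3.69 | 3.83 | 3.96 |
| 6 | Biweight | 4.02 | 4.18 | 4.32 |
| 6 | Triweight | 4.33 | 4.50 | 4.66 |
| 6 | Gaussian | 1.13 | 1.18 | 1.23 |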
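The rule-of-thumb constant and bandwidths can be computed directly from the formulas above. This is a sketch under stated assumptions: the function names are hypothetical, the kernel is passed in through its roughness $R(k)$ and moment $\kappa_\nu(k)$, and $\hat\sigma_j$ is taken to be the usual sample standard deviation with divisor $n - 1$. With $q = 1$ the same expression reduces algebraically to the univariate constant $C_\nu(k)$ of Section 2.9.

```python
import math

def double_factorial(m):
    # m!! = m (m - 2) (m - 4) ... down to 1 (or 2)
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def rule_of_thumb_constant(nu, q, R_k, kappa_nu):
    # C_nu(k, q) for a normal reference density and a product kernel
    numerator = math.pi ** (q / 2) * 2 ** (q + nu - 1) * math.factorial(nu) ** 2 * R_k ** q
    denominator = nu * kappa_nu ** 2 * (
        double_factorial(2 * nu - 1) + (q - 1) * double_factorial(nu - 1) ** 2
    )
    return (numerator / denominator) ** (1 / (2 * nu + q))

def rule_of_thumb_bandwidths(columns, nu, R_k, kappa_nu):
    # columns: list of q lists, one per variable, each holding n observations
    q = len(columns)
    n = len(columns[0])
    constant = rule_of_thumb_constant(nu, q, R_k, kappa_nu)
    rate = n ** (-1 / (2 * nu + q))
    bandwidths = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        bandwidths.append(sd * constant * rate)  # h_j = sigma_j * C * n^(-1/(2 nu + q))
    return bandwidths

# Second-order Gaussian and Epanechnikov kernels (R(k) and kappa_2 from Table 1)
R_gauss, kappa_gauss = 1 / (2 * math.sqrt(math.pi)), 1.0
R_epan, kappa_epan = 3 / 5, 1 / 5

c_gauss_q2 = rule_of_thumb_constant(2, 2, R_gauss, kappa_gauss)
c_epan_q2 = rule_of_thumb_constant(2, 2, R_epan, kappa_epan)
c_epan_q1 = rule_of_thumb_constant(2, 1, R_epan, kappa_epan)
```

Each variable gets its own bandwidth only through $\hat\sigma_j$; the constant and the rate are shared across variables.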

2.12 Least-Squares Cross-Validation

Rule-of-thumb bandwidths are a useful starting point, but they are inflexible and can be far from optimal.

Plug-in methods take the formula for the optimal bandwidth, and replace the unknowns by estimates, e.g. $R(\hat f^{(\nu)})$. But these initial estimates themselves depend on bandwidths. And each situation needs to be individually studied. Plug-in methods have been thoroughly studied for univariate density estimation, but are less well developed for multivariate density estimation and other contexts.

A flexible and generally applicable data-dependent method is cross-validation. This method attempts to make a direct estimate of the squared error, and pick the bandwidth which minimizes this estimate. In many senses the idea is quite close to model selection based on an information criterion, such as Mallows or AIC.

Given a bandwidth $h$ and density estimate $\hat f(x)$ of $f(x)$, define the mean integrated squared error (MISE)
$$MISE(h) = \int\left(\hat f(x) - f(x)\right)^2(dx) = \int\hat f(x)^2(dx) - 2\int\hat f(x)f(x)\,(dx) + \int f(x)^2(dx).$$
Optimally, we want $\hat f(x)$ to be as close to $f(x)$ as possible, and thus for $MISE(h)$ to be as small as possible.

As $MISE(h)$ is unknown, cross-validation replaces it with an estimate. The goal is to find an estimate of $MISE(h)$, and find the $h$ which minimizes this estimate.

As the third term in the above expression does not depend on the bandwidth $h$, it can be ignored.

The first term can be directly calculated. For the univariate case
$$\int\hat f(x)^2dx = \int\left(\frac{1}{nh}\sum_{i=1}^n k\left(\frac{X_i - x}{h}\right)\right)^2dx = \frac{1}{n^2h^2}\sum_{i=1}^n\sum_{j=1}^n\int k\left(\frac{X_i - x}{h}\right)k\left(\frac{X_j - x}{h}\right)dx.$$
The convolution of $k$ with itself is $\bar k(x) = \int k(u)\,k(x - u)\,du = \int k(u)\,k(u - x)\,du$ (by symmetry of $k$). Then making the change of variables $u = (X_i - x)/h$,
$$\frac{1}{h}\int k\left(\frac{X_i - x}{h}\right)k\left(\frac{X_j - x}{h}\right)dx = \int k(u)\,k\left(u - \frac{X_i - X_j}{h}\right)du = \bar k\left(\frac{X_i - X_j}{h}\right).$$

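Hence
$$\int\hat f(x)^2dx = \frac{1}{n^2h}\sum_{i=1}^n\sum_{j=1}^n\bar k\left(\frac{X_i - X_j}{h}\right).$$
Discussion of $\bar k(x)$ can be found in Section 2.13.

In the multivariate case,
$$\int\hat f(x)^2(dx) = \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j=1}^n\bar K\left(H^{-1}(X_i - X_j)\right)$$
where $\bar K(u) = \bar k(u_1)\cdots\bar k(u_q)$.

The second term in the expression for $MISE(h)$ depends on $f(x)$ so is unknown and must be estimated. An integral with respect to $f(x)$ is an expectation with respect to the random variable $X_i$. While we don't know the true expectation, we have the sample, so can estimate this expectation by taking the sample average. In general, a reasonable estimate of the integral $\int g(x)f(x)\,dx$ is $\frac{1}{n}\sum_{i=1}^n g(X_i)$, suggesting the estimate $\frac{1}{n}\sum_{i=1}^n\hat f(X_i)$. In this case, however, the function $\hat f(x)$ is itself a function of the data. In particular, it is a function of the observation $X_i$. A way to clean this up is to replace $\hat f(X_i)$ with the leave-one-out estimate $\hat f_{-i}(X_i)$, where
$$\hat f_{-i}(x) = \frac{1}{(n-1)|H|}\sum_{j \ne i}K\left(H^{-1}(X_j - x)\right)$$
is the density estimate computed without observation $X_i$, and thus
$$\hat f_{-i}(X_i) = \frac{1}{(n-1)|H|}\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
That is, $\hat f_{-i}(X_i)$ is the density estimate at $x = X_i$, computed with the observations except $X_i$. We end up suggesting to estimate $\int\hat f(x)f(x)\,dx$ with
$$\frac{1}{n}\sum_{i=1}^n\hat f_{-i}(X_i) = \frac{1}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
It turns out that this is an unbiased estimate, in the sense that
$$E\left(\frac{1}{n}\sum_{i=1}^n\hat f_{-i}(X_i)\right) = E\left(\int\hat f(x)f(x)\,(dx)\right).$$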

To see this, the LHS is
$$E\hat f_{-n}(X_n) = E\left(E\left(\hat f_{-n}(X_n) \mid X_1, \ldots, X_{n-1}\right)\right) = E\left(\int\hat f_{-n}(x)f(x)\,(dx)\right) = \int E\left(\hat f(x)\right)f(x)\,(dx) = E\left(\int\hat f(x)f(x)\,(dx)\right),$$
the second-to-last equality exchanging expectation and integration, and since $E(\hat f(x))$ depends only on the bandwidth, not the sample size.

Together, the least-squares cross-validation criterion is
$$CV(h_1, \ldots, h_q) = \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j=1}^n\bar K\left(H^{-1}(X_i - X_j)\right) - \frac{2}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right).$$
Another way to write this is
$$CV(h_1, \ldots, h_q) = \frac{\bar K(0)}{n|H|} + \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j \ne i}\bar K\left(H^{-1}(X_i - X_j)\right) - \frac{2}{n(n-1)|H|}\sum_{i=1}^n\sum_{j \ne i}K\left(H^{-1}(X_j - X_i)\right)$$
$$\simeq \frac{R(k)^q}{n|H|} + \frac{1}{n^2|H|}\sum_{i=1}^n\sum_{j \ne i}\left(\bar K\left(H^{-1}(X_i - X_j)\right) - 2K\left(H^{-1}(X_j - X_i)\right)\right)$$
using $\bar K(0) = \bar k(0)^q$ and $\bar k(0) = \int k(u)^2du = R(k)$, and the approximation is by replacing $n - 1$ by $n$.

The cross-validation bandwidth vector is the value $\hat h_1, \ldots, \hat h_q$ which minimizes $CV(h_1, \ldots, h_q)$. The cross-validation function is a complicated function of the bandwidths, so this needs to be done numerically.

In the univariate case, $h$ is one-dimensional and this is typically done by plotting (a grid search). Pick a lower and upper value $[h_1, h_2]$, define a grid on this set, and compute $CV(h)$ for each $h$ in the grid. A plot of $CV(h)$ against $h$ is a useful diagnostic tool.

The $CV(h)$ function can be misleading for small values of $h$. This arises when there is data rounding. Some authors define the cross-validation bandwidth as the largest local minimizer of $CV(h)$ (rather than the global minimizer). This can also be avoided by picking a sensible initial range $[h_1, h_2]$. The rule-of-thumb bandwidth can be useful here. If $h_0$ is the rule-of-thumb bandwidth, then use $h_1 = h_0/3$ and $h_2 = 3h_0$ or similar.

As we discussed above, $CV(h_1, \ldots, h_q) + \int f(x)^2(dx)$ is an unbiased estimate of $MISE(h)$. This by itself does not mean that $\hat h$ is a good estimate of $h_0$, the minimizer of $MISE(h)$, but it

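turns out that this is indeed the case. That is,
$$\frac{\hat h - h_0}{h_0} \to_p 0.$$
Thus, $\hat h$ is asymptotically close to $h_0$, but the rate of convergence is very slow.

The CV method is quite flexible, as it can be applied for any kernel function.

If the goal, however, is estimation of density derivatives, then the CV bandwidth $\hat h$ is not appropriate. A practical solution is the following. Recall that the asymptotically optimal bandwidth for estimation of the density takes the form $h_0 = C_\nu(k, f)\,n^{-1/(2\nu+1)}$ and that for the $r$'th derivative is $h_r = C_{r,\nu}(k, f)\,n^{-1/(1+2r+2\nu)}$. Thus if the CV bandwidth $\hat h$ is an estimate of $h_0$, we can estimate $C_\nu(k, f)$ by $\hat C = \hat h\,n^{1/(2\nu+1)}$. We also saw (at least for the normal reference family) that $C_{r,\nu}(k, f)$ was relatively constant across $r$. Thus we can replace $C_{r,\nu}(k, f)$ with $\hat C$ to find
$$\hat h_r = \hat C\,n^{-1/(1+2r+2\nu)} = \hat h\,n^{1/(2\nu+1) - 1/(1+2r+2\nu)} = \hat h\,n^{\left((1+2r+2\nu) - (2\nu+1)\right)/\left((2\nu+1)(1+2r+2\nu)\right)} = \hat h\,n^{2r/\left((2\nu+1)(1+2r+2\nu)\right)}.$$
Alternatively, some authors use the rescaling
$$\hat h_r = \hat h^{(1+2\nu)/(1+2r+2\nu)}.$$

2.13 Convolution Kernels

If $k(x) = \phi(x)$ then $\bar k(x) = \exp(-x^2/4)/\sqrt{4\pi}$.

When $k(x)$ is a higher-order Gaussian kernel, Wand and Schucany (Canadian Journal of Statistics, 1990, p. 201) give an expression for $\bar k(x)$.

For the polynomial class, because the kernel $k(u)$ has support on $[-1, 1]$, it follows that $\bar k(x)$ has support on $[-2, 2]$ and for $x \ge 0$ equals $\bar k(x) = \int_{x-1}^{1}k(u)\,k(x - u)\,du$. This integral can be easily solved using algebraic software (Maple, Mathematica), but the expression can be rather cumbersome.

For the 2nd order Epanechnikov, Biweight and Triweight kernels, for $0 \le x \le 2$,
$$\bar k_1(x) = \frac{3}{160}(2 - x)^3\left(x^2 + 6x + 4\right)$$
$$\bar k_2(x) = \frac{5}{3584}(2 - x)^5\left(x^4 + 10x^3 + 36x^2 + 40x + 16\right)$$
$$\bar k_3(x) = \frac{35}{1757184}(2 - x)^7\left(5x^6 + 70x^5 + 404x^4 + 1176x^3 + 1616x^2 + 1120x + 320\right)$$
These functions are symmetric, so the values for $x < 0$ are found by $\bar k(x) = \bar k(-x)$.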
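Putting Sections 2.12 and 2.13 together for the univariate Gaussian kernel, the cross-validation criterion and a grid search over $h$ can be sketched as follows. This is an illustration rather than a reference implementation: the function names, the sample and the grid size are hypothetical choices, and the double loop costs $O(n^2)$ per bandwidth.

```python
import math

def phi(u):
    # Second-order Gaussian kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def phi_bar(u):
    # Convolution of the Gaussian kernel with itself: exp(-u^2 / 4) / sqrt(4 pi)
    return math.exp(-0.25 * u * u) / math.sqrt(4 * math.pi)

def integrated_square(h, data):
    # Integral of f_hat(x)^2: (1 / (n^2 h)) sum_i sum_j k_bar((X_i - X_j) / h)
    n = len(data)
    return sum(phi_bar((xi - xj) / h) for xi in data for xj in data) / (n * n * h)

def leave_one_out_mean(h, data):
    # (1 / n) sum_i f_hat_{-i}(X_i)
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        for j, xj in enumerate(data):
            if i != j:
                total += phi((xj - xi) / h)
    return total / (n * (n - 1) * h)

def cv_criterion(h, data):
    return integrated_square(h, data) - 2.0 * leave_one_out_mean(h, data)

def cv_bandwidth(data, h_low, h_high, grid_size=50):
    # Grid search; returning the grid and values allows CV(h) to be plotted
    step = (h_high - h_low) / (grid_size - 1)
    grid = [h_low + step * g for g in range(grid_size)]
    values = [cv_criterion(h, data) for h in grid]
    best = min(range(grid_size), key=values.__getitem__)
    return grid[best], grid, values

# Hypothetical sample; the search range follows the h0 / 3 to 3 h0 suggestion
sample = [-1.2, -0.3, 0.0, 0.4, 1.1, 2.0]
n = len(sample)
mean = sum(sample) / n
sigma_hat = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
h_rot = 1.06 * sigma_hat * n ** (-0.2)
h_cv, h_grid, cv_values = cv_bandwidth(sample, h_rot / 3, 3 * h_rot)
```

The grid search returns the global minimizer on the grid; as noted above, some authors prefer the largest local minimizer, which would require inspecting `cv_values` rather than taking the minimum.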

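For the 4th and 6th order Epanechnikov kernels, for $0 \le x \le 2$,
$$\bar k_{4,1}(x) = \frac{5}{2048}(2 - x)^3\left(7x^6 + 42x^5 + 48x^4 - 160x^3 - 144x^2 + 96x + 64\right)$$
$$\bar k_{6,1}(x) = \frac{105}{3407872}(2 - x)^3\left(495x^{10} + 2970x^9 + 2052x^8 - 19368x^7 - 32624x^6 + 53088x^5 + 68352x^4 - 48640x^3 - 46720x^2 + 11520x + 7680\right)$$

2.14 Asymptotic Normality

The kernel estimator is the sample average
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^n\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right).$$
We can therefore apply the central limit theorem.

But the convergence rate is not $\sqrt{n}$. We know that
$$var\left(\hat f(x)\right) = \frac{f(x)R(k)^q}{nh_1h_2\cdots h_q} + O\left(\frac{1}{n}\right),$$
so the convergence rate is $\sqrt{nh_1h_2\cdots h_q}$. When we apply the CLT we scale by this, rather than the conventional $\sqrt{n}$.

As the estimator is biased, we also center at its expectation, rather than the true value. Thus
$$\sqrt{nh_1h_2\cdots h_q}\left(\hat f(x) - E\hat f(x)\right) = \frac{\sqrt{nh_1h_2\cdots h_q}}{n}\sum_{i=1}^n\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right) - E\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right)\right)\right)$$
$$= \frac{\sqrt{h_1h_2\cdots h_q}}{\sqrt{n}}\sum_{i=1}^n\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right) - E\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right)\right)\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_{ni}$$
where
$$Z_{ni} = \sqrt{h_1h_2\cdots h_q}\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right) - E\left(\frac{1}{|H|}K\left(H^{-1}(X_i - x)\right)\right)\right).$$
We see that
$$var(Z_{ni}) \simeq f(x)R(k)^q.$$
Hence by the CLT,
$$\sqrt{nh_1h_2\cdots h_q}\left(\hat f(x) - E\hat f(x)\right) \to_d N\left(0, f(x)R(k)^q\right).$$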

We also know that
$$E\left(\hat f(x)\right) = f(x) + \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu + o\left(h_1^\nu + \cdots + h_q^\nu\right).$$
So another way of writing this is
$$\sqrt{nh_1h_2\cdots h_q}\left(\hat f(x) - f(x) - \frac{\kappa_\nu(k)}{\nu!}\sum_{j=1}^q\frac{\partial^\nu}{\partial x_j^\nu}f(x)\,h_j^\nu\right) \to_d N\left(0, f(x)R(k)^q\right).$$
In the univariate case this is
$$\sqrt{nh}\left(\hat f(x) - f(x) - \frac{\kappa_\nu(k)}{\nu!}f^{(\nu)}(x)h^\nu\right) \to_d N\left(0, f(x)R(k)\right).$$
This expression is most useful when the bandwidth is selected to be of optimal order, that is $h = Cn^{-1/(2\nu+1)}$, for then $\sqrt{nh}\,h^\nu = C^{\nu+1/2}$ and we have the equivalent statement
$$\sqrt{nh}\left(\hat f(x) - f(x)\right) \to_d N\left(C^{\nu+1/2}\frac{\kappa_\nu(k)}{\nu!}f^{(\nu)}(x),\ f(x)R(k)\right).$$
This says that the density estimator is asymptotically normal, with a non-zero asymptotic bias and variance.

Some authors play a dirty trick, by using the assumption that $h$ is of smaller order than the optimal rate, e.g. $h = o\left(n^{-1/(2\nu+1)}\right)$. For then they obtain the result
$$\sqrt{nh}\left(\hat f(x) - f(x)\right) \to_d N\left(0, f(x)R(k)\right).$$
This appears much nicer. The estimator is asymptotically normal, with mean zero! There are several costs. One, if the bandwidth is really selected to be sub-optimal, the estimator is simply less precise. A sub-optimal bandwidth results in a slower convergence rate. This is not a good thing. The reduction in bias is obtained at an increase in variance. Another cost is that the asymptotic distribution is misleading. It suggests that the estimator is unbiased, which is not honest. Finally, it is unclear how to pick this sub-optimal bandwidth. I call this assumption a dirty trick, because it is slipped in by authors to make their results cleaner and derivations easier. This type of assumption should be avoided.

2.15 Pointwise Confidence Intervals

The asymptotic distribution may be used to construct pointwise confidence intervals for $f(x)$. In the univariate case conventional confidence intervals take the form
$$\hat f(x) \pm 2\left(\frac{\hat f(x)R(k)}{nh}\right)^{1/2}.$$

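These are not necessarily the best choice, since the variance is proportional to the mean. This set has the unfortunate property that it can contain negative values, for example.

Instead, consider constructing the confidence interval by inverting a test statistic. To test $H_0: f(x) = f_0$, a t-ratio is
$$t(f_0) = \frac{\hat f(x) - f_0}{\sqrt{f_0R(k)/(nh)}}.$$
We reject $H_0$ if $|t(f_0)| > 2$. By the no-rejection rule, an asymptotic 95% confidence interval for $f$ is the set of $f_0$ which we do not reject, i.e. the set of $f$ such that $|t(f)| \le 2$. This is
$$C(x) = \left\{f : \left|\frac{\hat f(x) - f}{\sqrt{fR(k)/(nh)}}\right| \le 2\right\}.$$
This set can be found numerically (squaring the inequality gives a quadratic in $f$, so the endpoints are also available in closed form).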
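The inverted-test interval can be computed without a numerical search. The sketch below is an assumption-laden illustration, not the notes' own procedure: it squares $|t(f)| \le 2$ to get $(\hat f - f)^2 \le c\,f$ with $c = 4R(k)/(nh)$, and solves the resulting quadratic, which gives endpoints $\hat f + c/2 \pm \sqrt{c\hat f + c^2/4}$. It assumes $\hat f(x) \ge 0$; the function names and the numerical inputs are hypothetical.

```python
import math

def inverted_t_interval(f_hat, n, h, R_k, crit=2.0):
    # Set of f >= 0 with (f_hat - f)^2 <= crit^2 * f * R(k) / (n h); assumes f_hat >= 0
    c = crit ** 2 * R_k / (n * h)
    center = f_hat + c / 2
    half_width = math.sqrt(c * f_hat + c * c / 4)
    return center - half_width, center + half_width

def conventional_interval(f_hat, n, h, R_k, crit=2.0):
    # f_hat +/- crit * sqrt(f_hat R(k) / (n h))
    se = math.sqrt(f_hat * R_k / (n * h))
    return f_hat - crit * se, f_hat + crit * se

def t_ratio(f_hat, f0, n, h, R_k):
    return (f_hat - f0) / math.sqrt(f0 * R_k / (n * h))

# Illustrative values: a small density estimate with a Gaussian kernel
R_gauss = 1 / (2 * math.sqrt(math.pi))
f_hat, n, h = 0.02, 100, 0.5
lower, upper = inverted_t_interval(f_hat, n, h, R_gauss)
conv_lower, conv_upper = conventional_interval(f_hat, n, h, R_gauss)
```

In this example the conventional interval dips below zero while the inverted interval does not, and the inverted interval is asymmetric around $\hat f(x)$.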