Nonparametric Estimation: Smoothing and Data Visualization


Ronaldo Dias
Universidade Estadual de Campinas

1º Colóquio da Região Sudeste
April 2011


Preface

In recent years more and more data have been collected in order to extract information or to learn valuable characteristics about experiments, phenomena, observational facts, etc. This is what has been called learning from data. Due to their complexity, several datasets have been analyzed by nonparametric approaches. This field of Statistics imposes minimal assumptions in order to get useful information from data. In fact, nonparametric procedures usually let the data speak for themselves.

This work is a brief introduction to a few of the most useful procedures in nonparametric estimation for smoothing and data visualization. In particular, it describes the theory and the applications of nonparametric curve estimation (density and regression) problems, with emphasis on kernel, nearest neighbor, orthogonal series and smoothing spline methods. The text is designed for undergraduate students in mathematical sciences, engineering and economics. It requires at least one semester of calculus, probability and mathematical statistics.


Contents

Preface
List of Figures
Introduction
Kernel estimation
The Histogram
Kernel Density Estimation
The Nearest Neighbor Method
Some Statistical Results for Kernel Density Estimation
Bandwidth Selection
Reference to a Standard Distribution
Maximum Likelihood Cross-Validation
Least-Squares Cross-Validation
Orthogonal series estimators
Kernel nonparametric Regression Method
k-nearest Neighbor (k-nn)
Local Polynomial Regression: LOWESS
Penalized Maximum Likelihood Estimation
Computing Penalized Log-Likelihood Density Estimates
Spline Functions
Acquiring the Taste
Logspline Density Estimation
Splines Density Estimation: A Dimensionless Approach
The thin-plate spline on $R^d$
Additive Models
Generalized Cross-Validation Method for Splines nonparametric Regression
Regression splines, P-splines and H-splines
Sequentially Adaptive H-splines
P-splines
Adaptive Regression via H-Splines Method
A Bayesian Approach to H-splines
Final Comments
Bibliography


List of Figures

2.1 Naive estimate constructed from Old Faithful geyser data with h = 0.1
2.2 Kernel density estimate constructed from Old Faithful geyser data with Gaussian kernel and h = 0.25
2.3 Bandwidth effect on kernel density estimates. The data set income was rescaled to have mean 1
2.4 Effect of the smoothing parameter k on the estimates
2.5 Comparison of two bandwidths, $\hat{\sigma}$ (the sample standard deviation) and $\hat{R}$ (the sample interquartile range), for the mixture $0.7\,N(-2, 1) + 0.3\,N(1, 1)$
2.6 Effect of the smoothing parameter K on the orthogonal series method for density estimation
3.1 Effect of bandwidths on Nadaraya-Watson kernel
3.2 Effect of the smoothing parameter k on the k-nn regression estimates
3.3 Effect of the smoothing parameter using LOWESS method
Basis functions with 6 knots placed at
Basis functions with 6 knots placed at
Logspline density estimation for $.5\,N(0,1) + .5\,N(5,1)$
Histogram, SSDE, Kernel and Logspline density estimates
True, tensor product, gam non-adaptive and gam adaptive surfaces
Smoothing spline fitting with smoothing parameter obtained by GCV method
Spline least square fittings for different values of K
Five thousand replicates of $y(x) = \exp(-x)\sin(\pi x/2)\cos(\pi x) + \epsilon$
Five thousand replicates of the affinity and the partial affinity for adaptive nonparametric regression using H-splines with the true curve
Density estimates of the affinity based on five thousand replicates of the curve $y_i = x_i^3 + \epsilon_i$ with $\epsilon_i \sim N(0, .5)$. Solid line is a density estimate using beta model and dotted line is a nonparametric density estimate
A comparison between smoothing splines (S-splines) and hybrid splines (H-splines) methods
Smooth spline and P-spline
H-spline fitting for airmiles data
6.8 Estimation results: a) Bayesian estimate with a = 17 and $\psi(K) = K^3$ (dotted line); b) (SS) smoothing splines estimate (dashed line). The true regression function is also plotted (solid line). The SS estimate was computed using the R function smooth.spline, from which 4 degrees of freedom were obtained and $\lambda$ was computed by GCV
One hundred estimates of the curve 6.8 and a Bayesian confidence interval for the regression curve $g(t) = \exp(-t^2/2)\cos(4\pi t)$ with $t \in [0, \pi]$

Chapter 1 Introduction

Probably the most widely used procedure for describing a possible relationship among variables is the statistical technique known as regression analysis. It is always useful to begin the study of regression analysis with simple models. For this, assume that we have collected observations of a continuous variable Y at n values of a predictor variable T. Let $(t_j, y_j)$ be such that

$$y_j = g(t_j) + \varepsilon_j, \quad j = 1, \ldots, n, \tag{1.1}$$

where the random variables $\varepsilon_j$ are uncorrelated with mean zero and variance $\sigma^2$. Moreover, $g(t_j)$ are the values of some unknown function g computed at the points $t_1, \ldots, t_n$. In general, the function g is called the regression function or regression curve.

A parametric regression model assumes that the form of g is known up to a finite number of parameters. That is, we can write a parametric regression model as

$$y_j = g(t_j, \beta) + \varepsilon_j, \quad j = 1, \ldots, n, \tag{1.2}$$

where $\beta = (\beta_1, \ldots, \beta_p)^T \in R^p$. Thus, determining a curve g from the data is equivalent to determining the vector of parameters $\beta$. One may notice that if g has a linear form, i.e., $g(t, \beta) = \sum_{j=1}^{p} \beta_j x_j(t)$, where $\{x_j(t)\}_{j=1}^{p}$ are the explanatory variables, e.g., $x_j(t) = t^{j-1}$ as in polynomial regression, then we are dealing with a linear parametric regression model.

Certainly, there are other methods of fitting curves to data. A collection of techniques known as nonparametric regression, for example, allows great flexibility in the possible form of the regression curve. In particular, it assumes no parametric form for g. In fact, a nonparametric regression model makes the assumption that the regression curve belongs to some infinite collection of curves. For example, g can be in the class of functions that are differentiable with square integrable second derivatives, etc. Consequently, in order to propose a nonparametric model one may just need to choose an appropriate space of functions where one believes the regression curve lies. This choice is usually motivated by the degree of smoothness of g. Then one uses the data to determine an element of this function space that can represent the unknown regression curve. Consequently, nonparametric techniques rely more heavily on the data for information about g than their parametric counterparts.

Unfortunately, nonparametric estimators have some disadvantages. In general, they are less efficient than parametric estimators when the parametric model is correctly specified. For most parametric estimators the risk decays to zero at a rate of $n^{-1}$, while the risk of nonparametric estimators decays at a rate of $n^{-\alpha}$, where the parameter $\alpha \in (0, 1)$ depends on the smoothness of g. For example, when g is twice differentiable the rate is usually $n^{-4/5}$. However, in the case where the parametric model is incorrectly specified, ad hoc, the rate $n^{-1}$ cannot be achieved. In fact, the parametric estimator does not even converge to the true regression curve.

Chapter 2 Kernel estimation

Suppose we have n independent measurements $\{(t_i, y_i)\}_{i=1}^{n}$; the regression equation is, in general, described as in (1.1). Note that the regression curve g is the conditional expectation of the response variable Y given the predictor variable T, that is, $g(t) = E[Y \mid T = t]$. When we try to approximate the mean response function g, we concentrate on the average dependence of Y on T = t. This means that we try to estimate the conditional mean curve

$$g(t) = E[Y \mid T = t] = \int y \, \frac{f_{TY}(t, y)}{f_T(t)} \, dy, \tag{2.1}$$

where $f_{TY}(t, y)$ denotes the joint density of (T, Y) and $f_T(t)$ the marginal density of T. In order to provide an estimate $\hat{g}(t)$ of g we need to obtain estimates of $f_{TY}(t, y)$ and $f_T(t)$. Consequently, density estimation methodologies will be described.

2.1 The Histogram

The histogram is one of the first, and one of the most common, methods of density estimation. It is important to bear in mind that the histogram is a smoothing technique used to estimate the unknown density and hence it deserves some consideration. Let us try to combine the data by counting how many data points fall into a small interval of length h. This kind of interval is called a bin. Observe that the well known dot plot of Box, Hunter and Hunter (1978) is a particular type of histogram where h = 0.

Without loss of generality, we consider a bin centered at 0, namely the interval $[-h/2, h/2)$, and let $F_X$ be the distribution function of X such that $F_X$ is absolutely continuous with respect to Lebesgue measure on R. Consequently the probability that an observation of X will fall into the interval $[-h/2, h/2)$ is given by

$$P(X \in [-h/2, h/2)) = \int_{-h/2}^{h/2} f_X(x) \, dx,$$

where $f_X$ is the density of X. A natural estimate of this probability is the relative frequency of the observations in this interval, that is, we count the number of observations falling into the interval and divide it by the total number of observations. In other words, given the data $X_1, \ldots, X_n$, we have

$$P(X \in [-h/2, h/2)) \approx \frac{1}{n} \#\{X_i \in [-h/2, h/2)\}.$$

Now, applying the mean value theorem for continuous bounded functions, we obtain

$$P(X \in [-h/2, h/2)) = \int_{-h/2}^{h/2} f(x) \, dx = f(\xi) h, \quad \text{with } \xi \in [-h/2, h/2).$$

Thus, we arrive at the following density estimate:

$$\hat{f}_h(x) = \frac{1}{nh} \#\{X_i \in [-h/2, h/2)\}, \quad \text{for all } x \in [-h/2, h/2).$$

Formally, suppose we observe random variables $X_1, \ldots, X_n$ whose unknown common density is f. Let k be the number of bins, and define $C_j = [x_0 + (j-1)h, \, x_0 + jh)$, $j = 1, \ldots, k$. Now take $n_j = \sum_{i=1}^{n} I(X_i \in C_j)$, where the function $I(x \in A)$ is defined to be

$$I(x \in A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise,} \end{cases}$$

and $\sum_{j=1}^{k} n_j = n$. Then

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{j=1}^{k} n_j \, I(x \in C_j), \quad \text{for all } x.$$

Here, note that the density estimate $\hat{f}_h$ depends upon the histogram bandwidth h. By varying h we can have different shapes of $\hat{f}_h$. For example, if one increases h, one is averaging over more data and the histogram appears smoother. When $h \to 0$, the histogram becomes a very noisy representation of the data (needle-plot, Härdle (1990)). In the opposite situation, when $h \to \infty$, the histogram becomes overly smooth (box-shaped). Thus, h is the smoothing parameter of this type of density estimate, and the question of how to choose the histogram bandwidth h turns out to be an important one in representing the data via the histogram. For details on how to estimate h see Härdle (1990).

2.2 Kernel Density Estimation

The motivation behind the histogram can be expanded quite naturally. For this, consider a weight function

$$K(x) = \begin{cases} \frac{1}{2}, & \text{if } |x| < 1 \\ 0, & \text{otherwise,} \end{cases}$$

and define the estimator

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \tag{2.2}$$
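As a rough illustration of the two estimators defined so far, the sketch below evaluates the histogram estimate and the estimator (2.2) with the box weight function at a single point. The function names, the bin origin `x0` and the five data values are made up for this example; the bin index here starts at 0, whereas the bins $C_j$ in the text start at j = 1.

```python
import math

def histogram_density(x, data, h, x0=0.0):
    """Histogram estimate: number of points in the bin containing x, divided by n*h."""
    n = len(data)
    j = math.floor((x - x0) / h)  # zero-based index of the bin [x0 + j*h, x0 + (j+1)*h)
    count = sum(1 for xi in data if math.floor((xi - x0) / h) == j)
    return count / (n * h)

def naive_density(x, data, h):
    """Estimator (2.2) with the weight function K(u) = 1/2 for |u| < 1, 0 otherwise."""
    n = len(data)
    total = sum(0.5 for xi in data if abs((x - xi) / h) < 1)
    return total / (n * h)

# Hypothetical sample, chosen only to make the arithmetic easy to follow.
data = [0.1, 0.2, 0.25, 0.6, 0.9]
print(histogram_density(0.22, data, h=0.5))  # three points share the bin [0, 0.5)
print(naive_density(0.22, data, h=0.5))      # four points lie within 0.5 of x = 0.22
```

The histogram value depends on where the bin edges happen to fall, while the estimate (2.2) centers a box on the evaluation point, which is one way to see why it extends the histogram idea.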

We can see that $\hat{f}$ extends the idea of the histogram. Notice that this estimate just places a box of width 2h and height $(2nh)^{-1}$ on each observation and then sums to obtain $\hat{f}$. See Silverman (1986) for a discussion of this kind of estimator. It is not difficult to verify that $\hat{f}$ is not a continuous function and has zero derivative everywhere except at the jump points $X_i \pm h$. Besides having the undesirable character of nonsmoothness (Silverman (1986)), it could give a misleading impression to an untrained observer, since its somewhat ragged character might suggest several different bumps. Figure 2.1 shows the nonsmooth character of the naive estimate. The data seem to have two major modes. However, the naive estimator suggests several different small bumps.

Figure 2.1: Naive estimate constructed from Old Faithful geyser data with h = 0.1. (Axes: eruptions length, density estimate.)

To overcome some of these difficulties, assumptions have been introduced on the function K. That is, K must be a nonnegative kernel function that satisfies the following property:

$$\int_{-\infty}^{\infty} K(x) \, dx = 1.$$

In other words, K is a probability density function, as for instance the Gaussian density. It follows from the definition that $\hat{f}$ will itself be a probability density. In addition, $\hat{f}$ will inherit all the continuity and differentiability properties of the kernel K.
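A minimal sketch of the kernel estimator (2.2) with a Gaussian kernel is given below, together with a crude numerical check that the estimate integrates to about one. The data values, the bandwidth and the integration grid are illustrative assumptions, not taken from the geyser data.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a kernel that is a probability density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate (2.2): average of rescaled kernels centered at the data."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical sample with two clusters.
data = [1.8, 2.0, 2.1, 4.3, 4.5, 4.6]
h = 0.25

# Riemann sum over [0, 7]; the grid comfortably covers the mass of every kernel.
step = 0.01
grid = [i * step for i in range(701)]
area = sum(kde(x, data, h) for x in grid) * step
print(area)
```

Because the kernel is itself a density, the printed area should be very close to one, in line with the remark that the estimate inherits the properties of K.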

For example, if K is a Gaussian density then $\hat{f}$ will be a smooth curve with derivatives of all orders. Figure 2.2 exhibits the smoothness of $\hat{f}$ when a Gaussian kernel is used.

Figure 2.2: Kernel density estimate constructed from Old Faithful geyser data with Gaussian kernel and h = 0.25. (Axes: eruptions length, density estimate.)

Note that an estimate based on the kernel function places bumps on the observations, and the shape of those bumps is determined by the kernel function K. The bandwidth h sets the width around each observation, and this bandwidth controls the degree of smoothness of a density estimate. It is possible to verify that as $h \to 0$ the estimate becomes a sum of Dirac delta functions at the observations, while as $h \to \infty$ it eliminates all the local roughness and possibly important details are missed.

The data for Figure 2.3, labelled income, were provided by Charles Kooperberg. This data set consists of 7125 random samples of yearly net income in the United Kingdom (Family Expenditure Survey). The income data set is considerably large, so it is more of a challenge to computing resources, and there are severe outliers. The peak at 0.24 is due to the UK old age pension, which caused many people to have nearly identical incomes. The width of the peak is about 0.02, compared to the range 11.5 of the data. The rise of the density to the left of the peak is very steep.

Figure 2.3: Bandwidth effect on kernel density estimates. The data set income was rescaled to have mean 1. (Panel title: Histogram of income data; legend includes h = R default, h = .12, h = .25; axes: transformed data, relative frequency.)

There is a vast literature (Silverman (1986)) on kernel density estimation studying its mathematical properties and proposing several algorithms to obtain estimates based on it. This method of density estimation became, apart from the histogram, the most commonly used estimator. However, it has drawbacks when the underlying density has long tails (Silverman (1986)). What causes this problem is the fact that the bandwidth is fixed for all observations, not considering any local characteristic of the data. In order to solve this problem several other kernel density estimation methods were proposed, such as the nearest neighbor and the variable kernel. A detailed discussion and illustration of these methods can be found in Silverman (1986).

The Nearest Neighbor Method

The idea behind the nearest neighbor method is to adapt the amount of smoothing to local characteristics of the data. The degree of smoothing is then controlled by an integer k. Essentially, the nearest neighbor density estimator uses distances from x in f(x) to the data points. For example, let $d(x_1, x)$ be the distance of data point $x_1$ from the point x, and for each x denote by $d_k(x)$ the distance from x to its kth nearest neighbor among the data points $x_1, \ldots, x_n$. The kth nearest neighbor density estimate is defined as

$$\hat{f}(x) = \frac{k}{2 n d_k(x)},$$

where n is the sample size and, typically, k is chosen to be proportional to $n^{1/2}$. In order to understand this definition, suppose that the density at x is f(x). Then one would expect about $2 r n f(x)$ observations to fall in the interval $[x - r, x + r]$ for each r > 0. Since, by definition, exactly k observations fall in the interval $[x - d_k(x), x + d_k(x)]$, an estimate of the density at x may be obtained by putting $k = 2 d_k(x) \, n \hat{f}(x)$.

Note that while estimators like the histogram are based on the number of observations falling in a box of fixed width centered at the point of interest, the nearest neighbor estimate is inversely proportional to the size of the box needed to contain a given number of observations. In the tail of the distribution, the distance $d_k(x)$ will be larger than in the main part of the distribution, and so the problem of under-smoothing in the tails should be reduced. Like the histogram, the nearest neighbor estimate is not a smooth curve. Moreover, the nearest neighbor estimate does not integrate to one and the tails of $\hat{f}(x)$ die away at rate $x^{-1}$, in other words extremely slowly. Hence, this estimate is not appropriate if one is required to estimate the entire density.

However, it is possible to generalize the nearest neighbor estimator in a manner related to the kernel estimate. The generalized kth nearest neighbor estimate is defined by

$$\hat{f}(x) = \frac{1}{n d_k(x)} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{d_k(x)}\right).$$

Observe that the overall amount of smoothing is governed by the choice of k, but the bandwidth used at any particular point depends on the density of observations near that point. Again, we face the problem of discontinuity of $\hat{f}$ at all the points where the function $d_k(x)$ has a discontinuous derivative. The precise integrability and tail properties will depend on the exact form of the kernel. Figure 2.4 shows the effect of the smoothing parameter k on the density estimate. Observe that as k decreases, the density estimate becomes rougher. This effect is equivalent to h approaching zero in the kernel density estimator.

2.3 Some Statistical Results for Kernel Density Estimation

As a starting point one might want to compute the expected value of $\hat{f}$. For this, suppose we have $X_1, \ldots, X_n$ i.i.d. random variables with common density f and let $K(\cdot)$ be a probability density function defined on the real line. Then we have, for a nonstochastic h,

$$\begin{aligned} E[\hat{f}(x)] &= \frac{1}{nh} \sum_{i=1}^{n} E\!\left[K\!\left(\frac{x - X_i}{h}\right)\right] = \frac{1}{h} E\!\left[K\!\left(\frac{x - X_1}{h}\right)\right] \\ &= \frac{1}{h} \int K\!\left(\frac{x - u}{h}\right) f(u) \, du = \int K(y) f(x + yh) \, dy, \end{aligned} \tag{2.3}$$

where the last equality uses the change of variable $u = x + yh$ together with the symmetry of K around zero (Condition 2 below). Now let $h \to 0$. We see that $E[\hat{f}(x)] \to f(x) \int K(y) \, dy = f(x)$. Thus, $\hat{f}$ is an asymptotically unbiased estimator of f.
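The limit in (2.3) can be checked numerically. The sketch below assumes, purely for illustration, that both the true density f and the kernel K are standard normal, and approximates the integral $\int K(y) f(x + yh)\,dy$ by a midpoint rule. The integration limits and the number of steps are arbitrary choices.

```python
import math

def phi(u):
    """Standard normal density, used here both as f and as K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def expected_kde(x, h, f=phi, kernel=phi, lim=8.0, steps=4000):
    """Midpoint-rule approximation of the integral of K(y) f(x + y h) dy in (2.3)."""
    dy = 2.0 * lim / steps
    total = 0.0
    for m in range(steps):
        y = -lim + (m + 0.5) * dy
        total += kernel(y) * f(x + y * h)
    return total * dy

x = 0.5
for h in (1.0, 0.5, 0.1, 0.01):
    # The gap between the two printed values shrinks as h decreases.
    print(h, expected_kde(x, h), phi(x))
```

For this particular pair of Gaussian functions the expectation is the $N(0, 1 + h^2)$ density evaluated at x, so the shrinking gap is the bias discussed next.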

Figure 2.4: Effect of the smoothing parameter k on the estimates. (Panel title: obs. from N(0.5, 0.1); legend includes True, k = 40, k = 30; axes: data, density.)

To compute the bias of this estimator we have to make the assumption that the underlying density is twice differentiable and that the kernel satisfies the following conditions (Prakasa-Rao (1983)):

Condition 1. $\sup_x K(x) \le M < \infty$; $|x| K(x) \to 0$ as $|x| \to \infty$.

Condition 2. $K(x) = K(-x)$, $x \in (-\infty, \infty)$, with $\int x^2 K(x) \, dx < \infty$.

Then, by using a Taylor expansion of $f(x + yh)$, the bias of $\hat{f}$ in estimating f is

$$b_f[\hat{f}(x)] = \frac{h^2}{2} f''(x) \int y^2 K(y) \, dy + o(h^2).$$

We observe that, since we have assumed the kernel K is symmetric around zero, we have $\int y K(y) \, h f'(x) \, dy = 0$, and the bias is quadratic in h (Parzen (1962)). Using a similar approach we obtain

$$\mathrm{Var}_f[\hat{f}(x)] = \frac{1}{nh} \|K\|_2^2 \, f(x) + o\!\left(\frac{1}{nh}\right), \quad \text{where } \|K\|_2^2 = \int K(x)^2 \, dx,$$

and

$$\mathrm{MSE}_f[\hat{f}(x)] = \frac{1}{nh} f(x) \|K\|_2^2 + \frac{h^4}{4} \left( f''(x) \int y^2 K(y) \, dy \right)^2 + o\!\left(\frac{1}{nh}\right) + o(h^4),$$

where $\mathrm{MSE}_f[\hat{f}]$ stands for the mean squared error of the estimator $\hat{f}$ of f. Hence, when the conditions $h \to 0$ and $nh \to \infty$ are assumed, $\mathrm{MSE}_f[\hat{f}] \to 0$, which means that the kernel density estimate is a consistent estimator of the underlying density f. Moreover, the MSE balances the variance and the squared bias of the estimator in such a way that the variance term controls under-smoothing and the bias term controls over-smoothing. In other words, an attempt to reduce the bias increases the variance, making the estimate too noisy (under-smoothed). On the contrary, minimizing the variance leads to a very smooth estimate (over-smoothed) with high bias.

2.4 Bandwidth Selection

It is natural to think of finding an optimal bandwidth by minimizing $\mathrm{MSE}_f[\hat{f}]$ in h > 0. Härdle (1990) shows that the asymptotic approximation, say $h^*$, for the optimal bandwidth is

$$h^* = \left( \frac{f(x) \|K\|_2^2}{(f''(x))^2 \left( \int y^2 K(y) \, dy \right)^2 n} \right)^{1/5} \propto n^{-1/5}. \tag{2.4}$$

The problem with this approach is that $h^*$ depends on two unknown functions, $f(\cdot)$ and $f''(\cdot)$. An approach to overcome this problem uses a global measure that can be defined as

$$\mathrm{IMSE}[\hat{f}] = \int \mathrm{MSE}_f[\hat{f}(x)] \, dx = \frac{1}{nh} \|K\|_2^2 + \frac{h^4}{4} \left( \int y^2 K(y) \, dy \right)^2 \|f''\|_2^2 + o\!\left(\frac{1}{nh}\right) + o(h^4). \tag{2.5}$$

IMSE is the well known integrated mean squared error of a density estimate. The optimal value of h considering the IMSE is defined as

$$h_{opt} = \arg\min_{h > 0} \mathrm{IMSE}[\hat{f}].$$

It can be shown that

$$h_{opt} = c_2^{-2/5} \left( \int K^2(x) \, dx \right)^{1/5} \left( \|f''\|_2^2 \right)^{-1/5} n^{-1/5}, \tag{2.6}$$

where $c_2 = \int y^2 K(y) \, dy$. Unfortunately, (2.6) still depends on the second derivative of f, which measures the speed of fluctuations in the density f.

Reference to a Standard Distribution

A very natural way to get around the problem of not knowing $f''$ is to use a standard family of distributions to assign a value to the term $\|f''\|_2^2$ in expression (2.6). For example, assume that a density f belongs to the Gaussian family with mean $\mu$ and variance $\sigma^2$; then

$$\int (f''(x))^2 \, dx = \sigma^{-5} \int (\varphi''(x))^2 \, dx = \frac{3}{8} \pi^{-1/2} \sigma^{-5} \approx 0.212 \, \sigma^{-5}, \tag{2.7}$$

where $\varphi(x)$ is the standard normal density. If one uses a Gaussian kernel, then

$$h_{opt} = (4\pi)^{-1/10} \left( \frac{3}{8} \pi^{-1/2} \right)^{-1/5} \sigma \, n^{-1/5} = \left( \frac{4}{3} \right)^{1/5} \sigma \, n^{-1/5} = 1.06 \, \sigma \, n^{-1/5}. \tag{2.8}$$

Hence, in practice a possible choice for $h_{opt}$ is $1.06 \, \hat{\sigma} \, n^{-1/5}$, where $\hat{\sigma}$ is the sample standard deviation.
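The constant in (2.8) can be recomputed from (2.6) and (2.7), as in the short sketch below. It plugs in the Gaussian-kernel values $\int K^2 = (4\pi)^{-1/2}$ and $c_2 = 1$; the function names and the five-point sample are hypothetical.

```python
import math
import statistics

def normal_reference_constant():
    """Evaluate (2.6) for a Gaussian kernel and a standard normal reference density."""
    k_norm_sq = 1.0 / (2.0 * math.sqrt(math.pi))    # integral of K^2, equal to (4*pi)**(-1/2)
    c2 = 1.0                                        # integral of y^2 K(y) for the Gaussian kernel
    fpp_norm_sq = 3.0 / (8.0 * math.sqrt(math.pi))  # value in (2.7) with sigma = 1
    return c2 ** (-2 / 5) * k_norm_sq ** (1 / 5) * fpp_norm_sq ** (-1 / 5)

def rule_of_thumb_bandwidth(data):
    """Bandwidth 1.06 * sigma_hat * n**(-1/5), as suggested after (2.8)."""
    n = len(data)
    sigma_hat = statistics.stdev(data)  # sample standard deviation
    return 1.06 * sigma_hat * n ** (-1 / 5)

print(normal_reference_constant())  # agrees with (4/3)**(1/5), which rounds to 1.06
print(rule_of_thumb_bandwidth([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The rule is only as good as the Gaussian reference: for strongly skewed or multimodal data it tends to over-smooth.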

If we want to make this estimate less sensitive to outliers, we have to use a more robust estimate for the scale parameter of the distribution. Let $\hat{R}$ be the sample interquartile range; then one possible choice for h is

$$\hat{h}_{opt} = 1.06 \, \min\!\left( \hat{\sigma}, \frac{\hat{R}}{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)} \right) n^{-1/5} = 1.06 \, \min\!\left( \hat{\sigma}, \frac{\hat{R}}{1.349} \right) n^{-1/5}, \tag{2.9}$$

where $\Phi$ is the standard normal distribution function. Figure 2.5 exhibits how a robust estimate of the scale can help in choosing the bandwidth. Note that by using $\hat{R}$ we have strong evidence that the underlying density has two modes.

Figure 2.5: Comparison of two bandwidths, $\hat{\sigma}$ (the sample standard deviation) and $\hat{R}$ (the sample interquartile range), for the mixture $0.7\,N(-2, 1) + 0.3\,N(1, 1)$. (Panel title: Histogram of a mixture of two normal densities; legend: True, sigmahat, interquartile; axes: data, relative frequency.)

Maximum Likelihood Cross-Validation

Consider kernel density estimates $\hat{f}$ and suppose we want to test, for a specific h, the hypothesis

$$\hat{f}(x) = f(x) \quad \text{vs.} \quad \hat{f}(x) \ne f(x), \quad \text{for a fixed } x.$$

The likelihood ratio test would be based on the test statistic $f(x)/\hat{f}(x)$. For a good bandwidth this statistic should thus be close to 1. Alternatively, we would expect $E[\log(f(X)/\hat{f}(X))]$ to be close to 0. Thus, a good bandwidth, which minimizes this measure of accuracy, is in effect optimizing the Kullback-Leibler distance:

$$d_{KL}(f, \hat{f}) = \int \log\!\left( \frac{f(x)}{\hat{f}(x)} \right) f(x) \, dx. \tag{2.10}$$

Of course, we are not able to compute $d_{KL}(f, \hat{f})$ from the data, since we do not know f. But from a theoretical point of view, we can investigate this distance for the choice of an appropriate bandwidth h. When $d_{KL}(f, \hat{f})$ is close to 0, this would give the best agreement with the hypothesis $\hat{f} = f$. Hence, we are looking for a bandwidth h which minimizes $d_{KL}(f, \hat{f})$.

Suppose we are given a set of additional observations $X_i$, independent of the others. The likelihood for these observations is $\prod_i f(X_i)$. Substituting $\hat{f}$ in the likelihood equation we have $\prod_i \hat{f}(X_i)$, and the value of this statistic for different h would indicate which value of h is preferable, since the normalized logarithm of this statistic approximates $-d_{KL}(f, \hat{f})$ up to a term that does not depend on h. Usually, we do not have additional observations. A way out of this dilemma is to base the estimate $\hat{f}$ on the subset $\{X_j\}_{j \ne i}$ and to calculate the likelihood for $X_i$. Denoting the leave-one-out estimate by

$$\hat{f}_{h,i}(X_i) = (n-1)^{-1} h^{-1} \sum_{j \ne i} K\!\left( \frac{X_i - X_j}{h} \right),$$

we obtain

$$\prod_{i=1}^{n} \hat{f}_{h,i}(X_i) = (n-1)^{-n} h^{-n} \prod_{i=1}^{n} \sum_{j \ne i} K\!\left( \frac{X_i - X_j}{h} \right). \tag{2.11}$$

However, it is convenient to consider the logarithm of this statistic, normalized with the factor $n^{-1}$, to get the following procedure:

$$CV_{KL}(h) = \frac{1}{n} \sum_{i=1}^{n} \log[\hat{f}_{h,i}(X_i)] = \frac{1}{n} \sum_{i=1}^{n} \log\!\left[ \sum_{j \ne i} K\!\left( \frac{X_i - X_j}{h} \right) \right] - \log[(n-1)h]. \tag{2.12}$$

Naturally, we choose $h_{KL}$ such that

$$h_{KL} = \arg\max_{h} CV_{KL}(h). \tag{2.13}$$

Since we assumed that the $X_i$ are i.i.d., the scores $\log \hat{f}_{h,i}(X_i)$ are identically distributed and so $E[CV_{KL}(h)] = E[\log \hat{f}_{h,i}(X_i)]$. Disregarding the leave-one-out effect, we can write

$$E[CV_{KL}(h)] \approx E\!\left[ \int \log \hat{f}(x) \, f(x) \, dx \right] = -E[d_{KL}(f, \hat{f})] + \int \log[f(x)] \, f(x) \, dx. \tag{2.14}$$

The second term on the right-hand side of (2.14) does not depend on h. Then, we can expect that we approximate the optimal bandwidth that minimizes $d_{KL}(f, \hat{f})$. Maximum likelihood cross-validation has two shortcomings:

(i) When we have identical observations at one point, we may obtain an infinite value of $CV_{KL}(h)$ and hence we cannot define an optimal bandwidth.

(ii) Suppose we use a kernel function with finite support, e.g., the interval $[-1, 1]$. If an observation $X_i$ is more separated from the other observations than the bandwidth h, the likelihood $\hat{f}_{h,i}(X_i)$ becomes 0. Hence the score function reaches the value $-\infty$. Maximizing $CV_{KL}(h)$ forces us to use a large bandwidth to prevent this degenerate case. This might lead to slight over-smoothing for the other observations.

Least-Squares Cross-Validation

Consider an alternative distance between $\hat{f}_h$ and f, the integrated squared error (ISE):

$$d_{ISE}(h) = \int (\hat{f}_h - f)^2(x) \, dx = \int \hat{f}_h^2(x) \, dx - 2 \int (\hat{f}_h f)(x) \, dx + \int f^2(x) \, dx,$$

so that

$$d_{ISE}(h) - \int f^2(x) \, dx = \int \hat{f}_h^2(x) \, dx - 2 \int (\hat{f}_h f)(x) \, dx. \tag{2.15}$$

For the last term, observe that $\int (\hat{f}_h f)(x) \, dx = E_X[\hat{f}_h(X)]$, where the expectation is understood to be computed with respect to an additional and independent observation X. For estimation of this term define the leave-one-out estimate

$$\hat{E}_X[\hat{f}_h(X)] = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i). \tag{2.16}$$

This leads to the least-squares cross-validation:

$$CV_{LS}(h) = \int \hat{f}_h^2(x) \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i). \tag{2.17}$$

The bandwidth minimizing this function is

$$h_{LS} = \arg\min_{h} CV_{LS}(h).$$

This cross-validation function is called an unbiased cross-validation criterion, since

$$E[CV_{LS}(h)] = E\!\left[ d_{ISE}(h) + 2 \left( E_X[\hat{f}_h(X)] - \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(X_i) \right) \right] - \|f\|_2^2 = \mathrm{IMSE}[\hat{f}_h] - \|f\|_2^2. \tag{2.18}$$

An interesting question is how good the approximation of $d_{ISE}$ by $CV_{LS}$ is. To investigate this, define a sequence of bandwidths $h_n = h(X_1, \ldots, X_n)$ to be asymptotically optimal if

$$\frac{d_{ISE}(h_n)}{\inf_{h > 0} d_{ISE}(h)} \to 1 \quad \text{a.s. when } n \to \infty.$$
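Both cross-validation criteria, (2.12) and (2.17), can be sketched with a simple grid search. The sketch assumes a Gaussian kernel, for which $\int \hat{f}_h^2$ has a closed form because the convolution of two $N(0, h^2)$ densities is an $N(0, 2h^2)$ density. The simulated sample, the seed and the bandwidth grid are arbitrary choices for illustration, and this is not the algorithm of Härdle (1990).

```python
import math
import random

def phi(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Kernel density estimate with a Gaussian kernel."""
    return sum(phi((x - xj) / h) for xj in data) / (len(data) * h)

def leave_one_out(i, data, h):
    """Leave-one-out estimate at X_i, built from all observations except X_i."""
    n = len(data)
    s = sum(phi((data[i] - data[j]) / h) for j in range(n) if j != i)
    return s / ((n - 1) * h)

def cv_kl(data, h):
    """Likelihood cross-validation score (2.12); larger is better."""
    n = len(data)
    return sum(math.log(leave_one_out(i, data, h)) for i in range(n)) / n

def integral_of_square(data, h):
    """Closed form of the integral of the squared estimate (Gaussian kernel only)."""
    n = len(data)
    s = sum(math.exp(-((xi - xj) / h) ** 2 / 4.0) for xi in data for xj in data)
    return s / (math.sqrt(4.0 * math.pi) * n * n * h)

def cv_ls(data, h):
    """Least-squares cross-validation score (2.17); smaller is better."""
    n = len(data)
    loo_mean = sum(leave_one_out(i, data, h) for i in range(n)) / n
    return integral_of_square(data, h) - 2.0 * loo_mean

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(60)]  # simulated N(0, 1) sample
grid = [0.2 + 0.05 * m for m in range(27)]          # candidate bandwidths 0.20, ..., 1.50
h_kl = max(grid, key=lambda h: cv_kl(data, h))
h_ls = min(grid, key=lambda h: cv_ls(data, h))
print("h_KL =", round(h_kl, 2), " h_LS =", round(h_ls, 2))
```

With a compactly supported kernel, `leave_one_out` can return 0 for an isolated observation and the logarithm in `cv_kl` would then fail, which is the second shortcoming listed above; the Gaussian kernel avoids this for moderate bandwidths.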

It can be shown that if the density f is bounded then $h_{LS}$ is asymptotically optimal. As for maximum likelihood cross-validation, one can find in Härdle (1990) an algorithm to compute the least-squares cross-validation.

2.5 Orthogonal series estimators

Orthogonal series estimators approach the density estimation problem from a quite different point of view. While kernel estimators are closely related to statistical thinking, orthogonal series rely on the ideas of approximation theory. Without loss of generality, let us assume that we are trying to estimate a density f on the interval [0, 1]. The idea is to use the theory of orthogonal series and then to reduce the estimation procedure to estimating the coefficients of its Fourier expansion. Define the sequence $\phi_v(x)$ by

$$\phi_0(x) = 1, \qquad \phi_{2r-1}(x) = \sqrt{2} \cos 2\pi r x, \qquad \phi_{2r}(x) = \sqrt{2} \sin 2\pi r x, \qquad r = 1, 2, \ldots$$

It is well known that f can be represented as the Fourier series $\sum_{i=0}^{\infty} a_i \phi_i$, where, for each $i \ge 0$,

$$a_i = \int f(x) \phi_i(x) \, dx. \tag{2.19}$$

Now, suppose that X is a random variable with density f. Then (2.19) can be written $a_i = E\phi_i(X)$, and so an unbiased estimator of $a_i$ based on $X_1, \ldots, X_n$ is

$$\hat{a}_i = \frac{1}{n} \sum_{j=1}^{n} \phi_i(X_j).$$

Note that $\sum_{i=0}^{\infty} \hat{a}_i \phi_i$ converges to a sum of delta functions at the observations. To see this, let

$$\omega(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - X_i), \tag{2.20}$$

where $\delta$ is the Dirac delta function. Then, for each i,

$$\hat{a}_i = \int_0^1 \omega(x) \phi_i(x) \, dx,$$

and hence the $\hat{a}_i$ are exactly the Fourier coefficients of the function $\omega$. The easiest way to smooth $\omega$ is to truncate the expansion $\sum \hat{a}_i \phi_i$ at some point. That is, choose K and define a density estimate $\hat{f}$ by

$$\hat{f}(x) = \sum_{i=0}^{K} \hat{a}_i \phi_i(x). \tag{2.21}$$

Note that the amount of smoothing is determined by K. A small value of K implies over-smoothing, a large value of K under-smoothing.

Figure 2.6: Effect of the smoothing parameter K on the orthogonal series method for density estimation. (Panel title: obs. from N(.5, .1); legend includes True, K = 3, K = 10; axes: data, density.)

A more general approach would be to choose a sequence of weights $\lambda_i$ such that $\lambda_i \to 0$ as $i \to \infty$. Then

$$\hat{f}(x) = \sum_{i=0}^{\infty} \lambda_i \hat{a}_i \phi_i(x).$$

The rate at which the weights $\lambda_i$ converge to zero will determine the amount of smoothing. For non-finite intervals we can have weight functions $a(x) = e^{-x^2/2}$ and orthogonal functions $\phi(x)$ proportional to Hermite polynomials.

The data in Figure 2.6 were provided to me by Francisco Cribari-Neto and consist of the variation rate of the ICMS (imposto sobre circulação de mercadorias e serviços) tax for the city of Brasilia, D.F., from August 1994 to July 1999.
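The truncated Fourier series estimate (2.21) on [0, 1] can be sketched as follows, with the sum started at i = 0 so that the constant term is included. The sample values and the truncation point K = 4 are made up for illustration.

```python
import math

def phi_basis(i, x):
    """Orthonormal Fourier basis on [0, 1]: 1, sqrt(2) cos(2 pi r x), sqrt(2) sin(2 pi r x)."""
    if i == 0:
        return 1.0
    r = (i + 1) // 2  # i = 2r - 1 (cosine term) or i = 2r (sine term)
    if i % 2 == 1:
        return math.sqrt(2.0) * math.cos(2.0 * math.pi * r * x)
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * r * x)

def fourier_coefficients(data, K):
    """Sample means of phi_i(X_j) for i = 0, ..., K, estimating the coefficients a_i."""
    n = len(data)
    return [sum(phi_basis(i, xj) for xj in data) / n for i in range(K + 1)]

def series_density(x, coeffs):
    """Truncated expansion (2.21) evaluated at x."""
    return sum(a * phi_basis(i, x) for i, a in enumerate(coeffs))

# Hypothetical observations in [0, 1].
data = [0.35, 0.42, 0.48, 0.5, 0.53, 0.61, 0.66]
coeffs = fourier_coefficients(data, K=4)

# The estimate integrates to one over [0, 1] because the i = 0 coefficient is always 1.
m = 1000
area = sum(series_density(j / m, coeffs) for j in range(m)) / m
print(coeffs[0], area)
```

Unlike a kernel estimate built from a density, this estimate is not guaranteed to be nonnegative, so values slightly below zero can appear where the data are sparse.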


Chapter 3 Kernel nonparametric Regression Method

Suppose we have i.i.d. observations $\{(X_i, Y_i)\}_{i=1}^{n}$ and the nonparametric regression model given in equation (1.1). By equation (2.1), we know how to estimate the denominator by using the kernel density estimation method. For the numerator one can estimate the joint density using the multiplicative kernel

$$\hat{f}_{h_1, h_2}(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_{h_1}(x - X_i) K_{h_2}(y - Y_i),$$

where $K_{h_1}(x - X_i) = h_1^{-1} K((x - X_i)/h_1)$ and $K_{h_2}(y - Y_i) = h_2^{-1} K((y - Y_i)/h_2)$. It is not difficult to show that

$$\int y \, \hat{f}_{h_1, h_2}(x, y) \, dy = \frac{1}{n} \sum_{i=1}^{n} K_{h_1}(x - X_i) Y_i.$$

Based on the methodology of kernel density estimation, Nadaraya (1964) and Watson (1964) suggested the following estimator $\hat{g}_h$ for g:

$$\hat{g}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i) Y_i}{\sum_{j=1}^{n} K_h(x - X_j)}. \tag{3.1}$$

In general, the kernel function $K_h(u) = h^{-1} K(u/h)$ is built from a probability density function K symmetric around zero, and the parameter h is called the smoothing parameter or bandwidth.

Now, consider the model (1.1) and let $X_1, \ldots, X_n$ be i.i.d. random variables with density $f_X$ such that $X_i$ is independent of $\varepsilon_i$ for all $i = 1, \ldots, n$. Assume the conditions given in Section 2.3 and suppose that f and g are twice continuously differentiable in a neighborhood of the point x. Then, if $h \to 0$ and $nh \to \infty$ as $n \to \infty$, we have $\hat{g}_h \to g$ in probability. Moreover, suppose $E[|\varepsilon_i|^{2+\delta}] < \infty$ and $\int |K(x)|^{2+\delta} \, dx < \infty$ for some $\delta > 0$; then

$$\sqrt{nh} \, (\hat{g}_h - E[\hat{g}_h]) \to N\!\left( 0, \; (f_X(x))^{-1} \sigma^2 \int (K(x))^2 \, dx \right)$$

in distribution, where $N(\cdot, \cdot)$ stands for a Gaussian distribution (see details in Pagan and Ullah (1999)).

As an example, Figure 3.1 shows the effect of choosing h on the Nadaraya-Watson procedure. The data consist of the speed of cars and the distances taken to stop. It is important to notice that the data were recorded in the 1920s. (These data sets can be found in the software R.) The Nadaraya-Watson kernel method can be extended to the multivariate regression problem by considering the multidimensional kernel density estimation method (see details in Scott (1992)).

Figure 3.1: Effect of bandwidths on Nadaraya-Watson kernel. (Legend includes h = 2; axes: speed, dist.)

3.1 k-nearest Neighbor (k-nn)

One may notice that regression by kernels is based on local averaging of the observations $Y_i$ in a fixed neighborhood of x. Instead of this fixed neighborhood, k-nn employs varying neighborhoods in the X variable support. That is,

$$\hat{g}_k(x) = \frac{1}{n} \sum_{i=1}^{n} W_{ki}(x) Y_i, \tag{3.2}$$

where

$$W_{ki}(x) = \begin{cases} n/k & \text{if } i \in J_x \\ 0 & \text{otherwise,} \end{cases} \tag{3.3}$$

with $J_x = \{i : X_i \text{ is one of the } k \text{ nearest observations to } x\}$. It can be shown that the bias and variance of the k-nn estimator $\hat{g}_k$ with weights (3.3) are given, for a fixed x, by

$$E[\hat{g}_k(x)] - g(x) \approx \frac{1}{24 (f(x))^3} \left[ g''(x) f(x) + 2 g'(x) f'(x) \right] (k/n)^2 \tag{3.4}$$

and

$$\mathrm{Var}[\hat{g}_k(x)] \approx \frac{\sigma^2}{k}. \tag{3.5}$$
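The Nadaraya-Watson estimator (3.1) and the k-nn estimator (3.2) with weights (3.3) can be sketched side by side. The Gaussian kernel, the tie-breaking by input order and the five data pairs are assumptions made for this example only.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, a kernel symmetric around zero."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Estimator (3.1): kernel-weighted average of the responses.

    The factor 1/h in K_h cancels between numerator and denominator.
    """
    weights = [kernel((x - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights are zero at this point; increase h")
    return sum(w * y for w, y in zip(weights, ys)) / total

def knn_regression(x, xs, ys, k):
    """Estimator (3.2)-(3.3): plain average of the responses of the k nearest X_i."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))  # ties keep input order
    nearest = order[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical data lying on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
print(nadaraya_watson(3.0, xs, ys, h=1.0))
print(knn_regression(3.0, xs, ys, k=3))
print(knn_regression(2.2, xs, ys, k=1))
```

The kernel estimator averages over a window of fixed width h, whereas the k-nn estimator widens or narrows its window so that it always contains k observations.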

From (3.4) and (3.5), we observe that the bias is increasing and the variance is decreasing in the smoothing parameter k. To balance this trade-off one should choose $k \sim n^{4/5}$. For details, see Härdle (1990). Figure 3.2 shows the effect of the parameter k on the regression curve estimates. Note that the curve estimate with k = 1 is less smooth than the curve estimate with k = 2. The data set consists of the revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960 and is available in R.

Figure 3.2: Effect of the smoothing parameter k on the k-nn regression estimates. (Panel title: airmiles data; legend includes Data, K = 1; axis: Passenger miles flown by U.S. commercial airlines.)

3.2 Local Polynomial Regression: LOWESS

Cleveland (1979) proposed the algorithm LOWESS, locally weighted scatter plot smoothing, as an outlier-resistant method based on local polynomial fits. The basic idea is to start with a local polynomial (a k-nn type fitting) least squares fit and then to use robust methods to obtain the final fit. Specifically, one can first fit a polynomial regression in a neighborhood of x, that is, find $\beta \in R^{p+1}$ which minimizes

$$n^{-1} \sum_{i=1}^{n} W_{ki}(x) \left( y_i - \sum_{j=0}^{p} \beta_j x_i^j \right)^2, \tag{3.6}$$

where $W_{ki}$ denote k-nn weights. Compute the residuals $\hat{\epsilon}_i$ and the scale parameter $\hat{\sigma} = \mathrm{median}(|\hat{\epsilon}_i|)$. Define robustness weights $\delta_i = K(\hat{\epsilon}_i / 6\hat{\sigma})$, where $K(u) = (15/16)(1 - u^2)^2$ if $|u| \le 1$ and $K(u) = 0$ otherwise. Then, fit a polynomial regression as in (3.6) but with weights $\delta_i W_{ki}(x)$. Cleveland suggests that p = 1 provides a good balance between computational ease and the need for flexibility to reproduce patterns in the data. In addition, the smoothing parameter can be determined by cross-validation as in (2.13). Note that when using the R functions lowess or loess, f acts as the smoothing parameter (the argument is named f in lowess and span in loess). Its relation to the k-nn nearest neighbor is given by

$$k = n f, \quad f \in (0, 1),$$

where n is the sample size.

Figure 3.3: Effect of the smoothing parameter using LOWESS method. (Panel title: lowess(cars); legend includes f = 2/3; axes: speed, dist.)

3.3 Penalized Maximum Likelihood Estimation

The method of penalized maximum likelihood in the context of density estimation consists of estimating a density f by minimizing a penalized likelihood score $L(f) + \lambda J(f)$, where $L(f)$ is a goodness-of-fit measure and $J(f)$ is a roughness penalty. This section is developed considering historical results, beginning with Good and Gaskins (1971) and ending with the most recent result given by Gu (1993).

The maximum likelihood (M.L.) method has been used as a standard statistical procedure in the case where the underlying density f is known except for a finite number of parameters. It is well known that M.L. has optimal properties (asymptotically unbiased and asymptotically normally distributed) for estimating the unknown parameters. Thus, it would be interesting if such a standard technique could be applied in a more general scheme, where there is no assumption on the form of the underlying density by assuming f to belong to a pre-specified family of density functions.

Let $X_1, \dots, X_n$ be i.i.d. random variables with unknown density f. The likelihood function is given by:

$$L(f \mid X_1, \dots, X_n) = \prod_{i=1}^{n} f(X_i).$$

The problem with this approach can be described by the following example. Recall the kernel estimate $\hat f_h(x)$, that is,

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big),$$

and replace the bandwidth h by h/c, where c is a constant greater than 0; i.e., for the moment the bandwidth is h/c. Let h be small enough such that $|X_i - X_j| / (h/c) > M > 0$ for all $i \ne j$, and assume K has been chosen so that $K(u) = 0$ if $|u| > M$. Then, $\hat f_h(X_i) = \frac{c}{nh} K(0)$. If $c > 1/K(0)$ then $\hat f_h(X_i) > \frac{1}{nh}$. For fixed n, we can do this for all $X_i$ simultaneously. Thus, $L > (\frac{1}{nh})^n$. Letting $h \to 0$, we have $L \to \infty$. That is, $L(f \mid X_1, \dots, X_n)$ does not have a finite maximum over the class of all densities. Hence, the likelihood function can be made as large as one wants just by taking densities with the smoothing parameter approaching zero. Densities having this characteristic, e.g., bandwidth $h \to 0$, approximate delta functions, and the estimate ends up being a sum of spikes (delta functions). Therefore, without putting constraints on the class of all densities, the maximum likelihood procedure cannot be used properly.

One possible way to overcome the problem described above is to consider a penalized log-likelihood function. The idea is to introduce a penalty term on the log-likelihood function such that this penalty term quantifies the smoothness of $g = \log f$. Let us take, for instance, the functional $J(g) = \int (g'')^2$ as a penalty term. Then define the penalized log-likelihood function by

$$L_\lambda(g) = \frac{1}{n} \sum_{i=1}^{n} g(X_i) - \lambda J(g), \qquad (3.7)$$

where λ is the smoothing parameter which controls two conflicting goals, the fidelity to the data given by $\sum_{i=1}^{n} g(X_i)$ and the smoothness, given by the penalty term $J(g)$. The pioneering work on the penalized log-likelihood method is due to Good and Gaskins (1971), who suggested a Bayesian scheme in which the penalized log-likelihood (using their notation) becomes:

$$\omega = \omega(f) = L(f) - \Phi(f),$$

where $L = \sum_{i=1}^{n} g(x_i)$ and Φ is the smoothness penalty.
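The unboundedness argument above can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a triangular kernel (so that K(0) = 1 and K vanishes outside [-1, 1]) and a small hypothetical sample, and it evaluates the log-likelihood of the sample under its own kernel estimate for shrinking bandwidths.

```python
import math

def triangular_kernel(u):
    """Compactly supported kernel: K(u) = 1 - |u| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(u))

def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h."""
    n = len(data)
    return sum(triangular_kernel((x - xi) / h) for xi in data) / (n * h)

def log_likelihood(data, h):
    """Log-likelihood of the sample evaluated at its own kernel estimate."""
    return sum(math.log(kde(xi, data, h)) for xi in data)

data = [0.3, 1.1, 1.9, 2.4, 3.8]            # hypothetical sample, minimum gap 0.5
bandwidths = [0.4, 0.2, 0.1, 0.05, 0.025]   # all smaller than the minimum gap
log_liks = [log_likelihood(data, h) for h in bandwidths]

for h, ll in zip(bandwidths, log_liks):
    print(f"h = {h:6.3f}   log-likelihood = {ll:8.3f}")
```

Once h is smaller than the smallest gap between observations, each observation only sees its own kernel, so the log-likelihood equals $-n \log(nh)$ and grows without bound as h decreases.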
In order to simplify the notation, let $\int h$ have the same meaning as $\int h(x)\,dx$. Now, consider the number of bumps in the density as the measure of roughness or

smoothness. The first approach was to take the penalty term proportional to Fisher's information, that is, $\Phi(f) = \int (f')^2 / f$. Now by setting $f = \gamma^2$, $\Phi(f)$ becomes $4\int (\gamma')^2$, and f is then replaced by γ in the penalized likelihood equation. Doing that, the constraint $f \ge 0$ is eliminated and the other constraint, $\int f = 1$, turns out to be equivalent to $\int \gamma^2 = 1$, with $\gamma \in L^2(-\infty, \infty)$. Good and Gaskins (1971) verified that the penalty $4\alpha \int (\gamma')^2$ yielded density curves having portions that looked too straight. This fact can be explained by noting that the curvature depends also on the second derivatives. Thus $\int (\gamma'')^2$ should be included in the penalty term. The final roughness functional proposed was:

$$\Phi(f) = 4\alpha \int (\gamma')^2 + \beta \int (\gamma'')^2,$$

with α, β satisfying

$$2\alpha\sigma^2 + \frac{3}{4}\beta = \sigma^4, \qquad (3.8)$$

where $\sigma^2$ is either an initially guessed value of the variance or the sample variance estimated from the data. According to Good and Gaskins (1971), the basis for this constraint is the feeling that the class of normal distributions forms the smoothest class of distributions, the improper uniform distribution being a limiting form. Moreover, they pointed out that some justification for this feeling is that a normal distribution is the distribution of maximum entropy for a given mean and variance. The integral $\int (\gamma')^2$ is also minimized for a given variance when f is normal (Good and Gaskins, 1971). They thought it was reasonable to give the normal distribution special consideration and decided to choose α, β such that $\omega(\alpha, \beta; f)$ is maximized by taking the mean equal to $\bar x$ and the variance equal to $\sum_{i=1}^{N} (x_i - \bar x)^2 / (N - 1)$. That is, if $f(x) \sim N(\mu, \sigma^2)$ then $\int (\gamma')^2 = \frac{1}{4\sigma^2}$, $\int (\gamma'')^2 = \frac{3}{16\sigma^4}$ and hence we have

$$\omega(\alpha, \beta; f) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{\alpha}{\sigma^2} - \frac{3\beta}{16\sigma^4}.$$

The score function $\omega(\alpha, \beta; f)$ is maximized when $\mu = \bar x$ and σ is such that

$$-N + \frac{\sum_{i=1}^{N} (x_i - \bar x)^2}{\sigma^2} + \frac{2\alpha}{\sigma^2} + \frac{3\beta}{4\sigma^4} = 0. \qquad (3.9)$$

If we put $\sigma^2 = \sum_{i=1}^{N} (x_i - \bar x)^2 / (N - 1)$, the equation (3.9), multiplied by $\sigma^4$, becomes

$$\sigma^4 (N - 1) + 2\alpha\sigma^2 + \frac{3}{4}\beta = \sigma^4 N,$$

and so we have the constraint (3.8).
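The two integrals used in this derivation, and the way constraint (3.8) makes (3.9) hold, can be verified numerically. The sketch below is only an illustration under stated assumptions: the values of µ, σ, N and α are hypothetical, the derivatives of $\gamma = \sqrt{f}$ are coded by hand for the normal density, and the integrals are approximated with a plain trapezoidal rule.

```python
import math

def sqrt_density(x, mu, sigma):
    """gamma(x): square root of the N(mu, sigma^2) density."""
    return (2.0 * math.pi * sigma**2) ** -0.25 * math.exp(-(x - mu) ** 2 / (4.0 * sigma**2))

def d1(x, mu, sigma):
    """First derivative of gamma."""
    return -(x - mu) / (2.0 * sigma**2) * sqrt_density(x, mu, sigma)

def d2(x, mu, sigma):
    """Second derivative of gamma."""
    z = x - mu
    return (z**2 / (4.0 * sigma**4) - 1.0 / (2.0 * sigma**2)) * sqrt_density(x, mu, sigma)

def integrate_square(fn, mu, sigma, width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of fn^2 over mu +/- width*sigma."""
    a, b = mu - width * sigma, mu + width * sigma
    step = (b - a) / steps
    total = 0.5 * (fn(a, mu, sigma) ** 2 + fn(b, mu, sigma) ** 2)
    for i in range(1, steps):
        total += fn(a + i * step, mu, sigma) ** 2
    return total * step

mu, sigma = 1.0, 1.5                       # hypothetical normal parameters
int_d1 = integrate_square(d1, mu, sigma)   # should be close to 1 / (4 sigma^2)
int_d2 = integrate_square(d2, mu, sigma)   # should be close to 3 / (16 sigma^4)

# Constraint (3.8): pick alpha freely, solve for beta, then evaluate the
# left-hand side of (3.9) with sum (x_i - xbar)^2 = (N - 1) sigma^2.
N = 50
s2 = sigma**2
alpha = 0.4 * s2                           # hypothetical choice of alpha
beta = (4.0 / 3.0) * (s2**2 - 2.0 * alpha * s2)
lhs_39 = -N + (N - 1) + 2.0 * alpha / s2 + 3.0 * beta / (4.0 * s2**2)

print(int_d1, 1.0 / (4.0 * s2))
print(int_d2, 3.0 / (16.0 * s2**2))
print(lhs_39)
```

Any pair (α, β) on the line (3.8) gives a vanishing left-hand side in (3.9), so the constraint leaves one degree of freedom for the amount of smoothing.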
Pursuing the idea of Good and Gaskins, Silverman (1982) proposed a similar method where the log density is estimated instead of the density itself. An advantage of Silverman's approach is that, by using the logarithm of the density and the augmented penalized likelihood functional, any density estimate obtained will automatically be positive and integrate to one. Specifically,

let $(m_1, \dots, m_k)$ be a sequence of natural numbers so that $1 \le \sum_{i=1}^{k} m_i \le m$, where $m > 0$ is such that $g^{(m-1)}$ exists and is continuous. Define a linear differential operator D as:

$$D(g) = \sum c(m_1, \dots, m_k) \Big(\frac{\partial}{\partial x_1}\Big)^{m_1} \cdots \Big(\frac{\partial}{\partial x_k}\Big)^{m_k} (g).$$

Now assume that at least one of the coefficients $c(m_1, \dots, m_k) \ne 0$ for $\sum m_i = m$. Using this linear differential operator define a bilinear functional $\langle \cdot, \cdot \rangle$ by

$$\langle g_1, g_2 \rangle = \int D(g_1) D(g_2),$$

where the integral is taken over an open set Ω with respect to Lebesgue measure. Let S be the set of real functions g on Ω for which:
- the (m − 1)th derivatives of g exist everywhere and are piecewise differentiable,
- $\langle g, g \rangle < \infty$,
- $\int e^g < \infty$.

Given the data $X_1, \dots, X_n$ i.i.d. with common density f, such that $g = \log f$, $\hat g$ is the solution, if it exists, of the optimization problem

$$\max \Big\{ \frac{1}{n} \sum_{i=1}^{n} g(X_i) - \frac{\lambda}{2} \langle g, g \rangle \Big\},$$

subject to $\int e^g = 1$. The density estimate is $\hat f = e^{\hat g}$, and the null space of the penalty term is the set $\{ g \in S : \langle g, g \rangle = 0 \}$. Note that the null space of $\langle g, g \rangle$ is an exponential family with at most (m − 1) parameters; for example, if $\langle g, g \rangle = \int (g^{(3)})^2$ then $g = \log f$ is in an exponential family with 2 parameters. See Silverman (1982).

Silverman presented an important result which reduces the constrained optimization problem to the relatively easy computational task of finding the minimum of an unconstrained variational problem. Precisely, for any g in a class of smooth functions (see details in Silverman (1982)) and for any fixed positive λ, let

$$\omega_0(g) = -\frac{1}{n} \sum_{i=1}^{n} g(X_i) + \frac{\lambda}{2} \int (g'')^2$$

and

$$\omega(g) = -\frac{1}{n} \sum_{i=1}^{n} g(X_i) + \int e^g + \frac{\lambda}{2} \int (g'')^2.$$

Silverman proved that the unconstrained minimum of $\omega(g)$ is identical with the constrained minimum of $\omega_0$, if such a minimizer exists.

Computing Penalized Log-Likelihood Density Estimates

Based on Silverman's approach, O'Sullivan (1988) developed an algorithm which is a fully automatic, data-driven version of Silverman's estimator. Furthermore, the estimators obtained by O'Sullivan's algorithm are approximated by linear combinations of basis functions. As in Good and Gaskins (1971), the estimator is expanded in a basis; O'Sullivan proposed that cubic B-splines with knots at the data points should be used as the basis functions. A summary of definitions and properties of B-splines is given in Chapter 4.

The basic idea of computing a density estimate provided by the penalized likelihood method is to construct approximations to it. Given $x_1, \dots, x_n$, the realizations of random variables $X_1, \dots, X_n$ with common log density g, we are to solve finite-dimensional versions of (3.7) which are reasonable approximations to the infinite-dimensional problem (Thompson and Tapia, 1990). Good and Gaskins (1971) based their computational scheme on the fact that, since $\gamma \in L^2(-\infty, \infty)$, for a given orthonormal system of functions $\{\phi_n\}$ we have $\sum_{n=0}^{\infty} a_n \phi_n \to \gamma$ in mean square, with $\sum_{n=0}^{\infty} a_n^2 < \infty$ and $\{a_n\} \subset \mathbb{R}$. That is, γ in $L^2$ can be arbitrarily well approximated by a linear combination of basis functions. In their paper, Hermite polynomials were used as basis functions. Specifically:

$$\phi_n(x) = e^{-x^2/2} H_n(x)\, 2^{-n/2} \pi^{-1/4} (n!)^{-1/2},$$

where

$$H_n(x) = (-1)^n e^{x^2} \Big(\frac{d^n}{dx^n} e^{-x^2}\Big).$$

The log density estimator proposed by O'Sullivan (1988) is defined as the minimizer of

$$-\frac{1}{n} \sum_{i=1}^{n} g(x_i) + \int_a^b e^{g(s)}\,ds + \lambda \int_a^b (g^{(m)})^2\,ds, \qquad (3.10)$$

for fixed λ > 0 and data points $x_1, \dots, x_n$. The minimization is over a class of absolutely continuous functions on [a, b] whose mth derivative is square integrable. Computational advantages of this log density estimator using approximations by cubic B-splines are:
- It is a fully automatic procedure for selecting an appropriate value of the smoothing parameter λ, based on AIC-type criteria.
- The banded structures induced by B-splines lead to an algorithm where the computational cost is linear in the number of observations (data points).
- It provides approximate pointwise Bayesian confidence intervals for the estimator.

A disadvantage of O'Sullivan's work is that it does not provide any comparison of performance with other available techniques. We see that the previous computational framework is unidimensional, although Silverman's approach can be extended to higher dimensions.

Chapter 4

Spline Functions

4.1 Acquiring the Taste

Due to their simple structure and good approximation properties, polynomials are widely used in practice for approximating functions. For this purpose, one usually divides the interval [a, b] in the function support into sufficiently small subintervals of the form $[x_0, x_1], \dots, [x_k, x_{k+1}]$ and then uses a low degree polynomial $p_i$ for approximation over each interval $[x_i, x_{i+1}]$, i = 0, ..., k. This procedure produces a piecewise polynomial approximating function $s(\cdot)$;

$$s(x) = p_i(x) \text{ on } [x_i, x_{i+1}], \quad i = 0, \dots, k.$$

In the general case, the polynomial pieces $p_i(x)$ are constructed independently of each other and therefore do not constitute a continuous function s(x) on [a, b]. This is not desirable if the interest is in approximating a smooth function. Naturally, it is necessary to require the polynomial pieces $p_i(x)$ to join smoothly at the knots $x_1, \dots, x_k$, and to have all derivatives up to a certain order coincide at the knots. As a result, we get a smooth piecewise polynomial function, called a spline function.

Definition 4.1 The function s(x) is called a spline function (or simply "spline") of degree r with knots at $\{x_i\}_{i=1}^{k}$ if $-\infty =: x_0 < x_1 < \dots < x_k < x_{k+1} := \infty$, where $-\infty =: x_0$ and $x_{k+1} := \infty$ are set by definition, and
- for each i = 0, ..., k, s(x) coincides on $[x_i, x_{i+1}]$ with a polynomial of degree not greater than r;
- $s(x), s'(x), \dots, s^{(r-1)}(x)$ are continuous functions on $(-\infty, \infty)$.

The set $S_r(x_1, \dots, x_k)$ of spline functions is called spline space. Moreover, the spline space is a linear space with dimension r + k + 1 (Schumaker (1981)).

Definition 4.2 For a given point $x \in (a, b)$ the function

$$(t - x)_+^r = \begin{cases} (t - x)^r & \text{if } t > x \\ 0 & \text{if } t \le x \end{cases}$$

is called the truncated power function of degree r with knot x.
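Definition 4.2 translates directly into code. The snippet below is a minimal sketch; the function name `truncated_power`, the knot and the evaluation grid are hypothetical choices made for illustration only.

```python
def truncated_power(t, knot, r):
    """Return (t - knot)_+^r: (t - knot)**r when t > knot, and 0 otherwise."""
    return float((t - knot) ** r) if t > knot else 0.0

knot, r = 1.0, 3                      # hypothetical knot and degree
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
values = [truncated_power(t, knot, r) for t in grid]

# The function is identically zero up to the knot and a cubic after it.
for t, v in zip(grid, values):
    print(f"t = {t:3.1f}   (t - {knot})_+^{r} = {v}")
```

For r = 3 the function and its first two derivatives vanish at the knot, which is the smoothness that the second condition of Definition 4.1 asks of a cubic spline.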

Hence, we can express any spline function as a linear combination of r + k + 1 basis functions. For this, consider a set of interior knots $\{x_1, \dots, x_k\}$ and the basis functions $\{1, t, t^2, \dots, t^r, (t - x_1)_+^r, \dots, (t - x_k)_+^r\}$. Thus, a spline function is given by

$$s(t) = \sum_{i=0}^{r} \theta_i t^i + \sum_{j=r+1}^{r+k} \theta_j (t - x_{j-r})_+^r.$$

It would be interesting if we could have basis functions that make it easy to compute the spline functions. It can be shown that B-splines form a basis of spline spaces (Schumaker, 1981). Also, B-splines have an important computational property: they are the splines which have the smallest possible support. In other words, B-splines are zero on a large set. Furthermore, a stable evaluation of B-splines with the aid of a recurrence relation is possible.

Definition 4.3 Let $\Omega = \{x_j\}_{j \in \mathbb{Z}}$ be a nondecreasing sequence of knots. The j-th B-spline of order k for the knot sequence Ω is defined by

$$B_j^k(t) = (x_{k+j} - x_j)\,[x_j, \dots, x_{k+j}](\cdot - t)_+^{k-1} \quad \text{for all } t \in \mathbb{R},$$

where $[x_j, \dots, x_{k+j}](\cdot - t)_+^{k-1}$ is the kth divided difference of the function $x \mapsto (x - t)_+^{k-1}$ evaluated at the points $x_j, \dots, x_{k+j}$.

From Definition 4.3 we notice that $B_j^k(t) = 0$ for all $t \notin [x_j, x_{j+k}]$. It follows that only k B-splines have any particular interval $[x_j, x_{j+1}]$ in their support. That is, of all the B-splines of order k for the knot sequence Ω, only the k B-splines $B_{j-k+1}^k, B_{j-k+2}^k, \dots, B_j^k$ might be nonzero on the interval $[x_j, x_{j+1}]$. (See de Boor (1978) for details.) Moreover, $B_j^k(t) > 0$ for all $t \in (x_j, x_{j+k})$ and $\sum_{j \in \mathbb{Z}} B_j^k(t) = 1$; that is, the B-spline sequence $B_j^k$ consists of nonnegative functions which sum up to 1 and provides a partition of unity. Thus, a spline function can be written as a linear combination of B-splines,

$$s(t) = \sum_{j \in \mathbb{Z}} \beta_j B_j^k(t).$$

The value of the function s at a point t is simply the value of the sum $\sum_{j \in \mathbb{Z}} \beta_j B_j^k(t)$, which makes good sense since the latter sum has at most k nonzero terms. Figure 4.1 shows an example of a B-spline basis and its compact support property. This property makes the computation of B-splines easier and numerically stable.
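The properties listed above (local support, at most k nonzero B-splines at any point, partition of unity) can be illustrated with the standard Cox-de Boor recurrence, which is the recurrence relation usually meant by "stable evaluation". This is a sketch rather than the divided-difference form of Definition 4.3; the knot sequence, the evaluation point and the function name `bspline` are hypothetical.

```python
def bspline(j, k, t, knots):
    """Value at t of the j-th B-spline of order k (degree k - 1) on `knots`,
    computed with the Cox-de Boor recurrence."""
    if k == 1:
        # Order 1: indicator function of the half-open knot interval.
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    left = right = 0.0
    if knots[j + k - 1] > knots[j]:
        left = ((t - knots[j]) / (knots[j + k - 1] - knots[j])
                * bspline(j, k - 1, t, knots))
    if knots[j + k] > knots[j + 1]:
        right = ((knots[j + k] - t) / (knots[j + k] - knots[j + 1])
                 * bspline(j + 1, k - 1, t, knots))
    return left + right

knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # hypothetical knot sequence
order = 4                                               # cubic B-splines
t = 4.3                                                 # point inside [x_4, x_5)

# One B-spline per admissible index j = 0, ..., len(knots) - order - 1.
values = [bspline(j, order, t, knots) for j in range(len(knots) - order)]
nonzero = sum(1 for v in values if v > 0.0)

print(values)
print("number of nonzero B-splines:", nonzero)
print("sum of B-spline values:", sum(values))
```

On a finite knot sequence the values sum to one only where k B-splines overlap, here on [3, 5]; closer to the ends of the sequence fewer basis functions are available.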
Of special interest is the set of natural splines of order 2m, $m \in \mathbb{N}$, with k knots at $x_j$. A spline function is a natural spline of order 2m with knots at $x_1, \dots, x_k$ if, in addition to the properties implied by Definition 4.1, it satisfies an extra condition: s is a polynomial of order m outside of $[x_1, x_k]$. Consider the interval $[a, b] \subset \mathbb{R}$ and the knot sequence $a := x_0 < x_1 < \dots < x_k < x_{k+1} := b$. Then,

$$NS^{2m} = \{ s \in S(P_{2m}) : s_0 = s|_{[a, x_1)} \text{ and } s_k = s|_{[x_k, b)} \in P_m \}$$

is the natural polynomial spline space of order 2m with knots at $x_1, \dots, x_k$. The name "natural spline" stems from the fact that, as a result of this extra condition, s satisfies the so-called natural boundary conditions $s^{(j)}(a) = s^{(j)}(b) = 0$, j = m, ..., 2m − 1. Now, since the dimension of $S(P_{2m})$ is 2m + k and we have enforced 2m extra conditions to define $NS^{2m}$, it is natural to expect the dimension of $NS^{2m}$ to be k.

[Figure 4.1: B-splines. Basis functions with 6 knots.]

Actually, it is well known that $NS^{2m}$ is a linear space of dimension k. See details in Schumaker (1981). In some applications it may be possible to deal with natural splines by using a basis for $S(P_{2m})$ and enforcing the end conditions. For other applications it is desirable to have a basis for $NS^{2m}$ itself. To construct such a basis consisting of splines with small supports we just need functions based on the usual B-splines. Particularly, when m = 2, we will be constructing basis functions for the natural cubic spline space, $NS^4$. Figure 4.2 shows an example of the natural splines basis.

4.2 Logspline Density Estimation

Kooperberg and Stone (1991) introduced another type of algorithm to estimate a univariate density. This algorithm was based on the work of Stone (1990) and Stone and Koo (1985), where the theory of the logspline family of functions was developed. Consider an increasing sequence of knots $\{t_j\}_{j=1}^{K}$, $K \ge 4$, in $\mathbb{R}$. Denote by $S_0$ the set of real functions s such that s is a cubic polynomial in each interval of the form $(-\infty, t_1], [t_1, t_2], \dots, [t_K, \infty)$. Elements in $S_0$ are the well-known cubic splines with knots at $\{t_j\}_{j=1}^{K}$. Notice that $S_0$ is a (K + 4)-dimensional linear space. Now, let $S \subset S_0$ be such that the dimension of S is K, with functions $s \in S$ linear on $(-\infty, t_1]$ and on $[t_K, \infty)$. Thus, S has a basis of the form $1, B_1, \dots, B_{K-1}$, such that $B_1$ is a linear function with negative slope on $(-\infty, t_1]$ and $B_2, \dots, B_{K-1}$ are constant functions on the same interval. Similarly, $B_{K-1}$ is a linear function with positive slope on $[t_K, \infty)$ and $B_1, \dots, B_{K-2}$ are constant on the interval $[t_K, \infty)$. Let Θ be the parametric space of dimension p = K − 1, such that $\theta_1 < 0$ and $\theta_p > 0$

for $\theta = (\theta_1, \dots, \theta_p) \in \mathbb{R}^p$. Then, define

$$c(\theta) = \log\Big( \int_{\mathbb{R}} \exp\Big( \sum_{j=1}^{K-1} \theta_j B_j(x) \Big)\,dx \Big)$$

and

$$f(x; \theta) = \exp\Big\{ \sum_{j=1}^{K-1} \theta_j B_j(x) - c(\theta) \Big\}.$$

The p-parametric exponential family $f(\cdot, \theta)$, $\theta \in \Theta \subset \mathbb{R}^p$, of positive twice differentiable density functions on $\mathbb{R}$ is called the logspline family and the corresponding log-likelihood function is given by

$$L(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta); \quad \theta \in \Theta.$$

The log-likelihood function $L(\theta)$ is strictly concave and hence the maximum likelihood estimator $\hat\theta$ of θ is unique, if it exists. We refer to $\hat f = f(\cdot, \hat\theta)$ as the logspline density estimate.

[Figure 4.2: Natural splines. Basis functions with 6 knots.]

Note that the estimation of θ makes the logspline procedure not essentially nonparametric. Thus, estimation of θ by Newton-Raphson, together with the small number of basis functions necessary to estimate a density, makes the logspline algorithm extremely fast when it is compared with the Gu (1993) algorithm for smoothing spline density estimation. In the logspline approach the number of knots is the smoothing parameter. That is, too many knots lead to a noisy estimate while too few knots give a very smooth curve. Based on their experience of fitting logspline models, Kooperberg and Stone provide a table with the number of knots based on the number of observations. No indication was found that the number of knots takes into consideration the structure of the data (number of modes, bumps, asymmetry, etc.). However, an objective criterion

for the choice of the number of knots, Stepwise Knot Deletion and Stepwise Knot Addition, is included in the logspline procedure. For $1 \le j \le p$, let $B_j$ be a linear combination of the truncated power basis $(x - t_k)_+^3$ for the knot sequence $t_1, \dots, t_p$, that is,

$$B_j(x) = \beta_j + \beta_{j0} x + \sum_k \beta_{jk} (x - t_k)_+^3.$$

Then

$$\sum_j \theta_j B_j(x) = \sum_j \theta_j \beta_j + \sum_j \theta_j \beta_{j0}\, x + \sum_k \sum_j \beta_{jk} \theta_j (x - t_k)_+^3.$$

Let $\sum_j \hat\theta_j \beta_{jk} = \beta_k^T \hat\theta$. Then, for $1 \le k \le K$, Kooperberg and Stone (1991) show that

$$SE(\beta_k^T \hat\theta) = \sqrt{\beta_k^T (I(\hat\theta))^{-1} \beta_k},$$

where $I(\theta)$ is the Fisher information matrix obtained from the log-likelihood function. The knots $t_1$ and $t_K$ are considered permanent knots, and $t_k$, $2 \le k \le K - 1$, are nonpermanent knots. Then at any step delete (similarly for the addition step) that knot which has the smallest value of $|\beta_k^T \hat\theta| / SE(\beta_k^T \hat\theta)$. In this manner, we have a sequence of models which ranges from 2 to p − 1 knots. Now, denote by $\hat L_m$ the log-likelihood function of the mth model ($2 \le m + 2 \le p - 1$) evaluated at the maximum likelihood estimate for that model. To specify a stopping criterion, Kooperberg and Stone make use of the Akaike Information Criterion (AIC), that is, $AIC_{\alpha,m} = -2\hat L_m + \alpha(p - m)$, and choose the $\hat m$ that minimizes $AIC_{3,m}$. There is no theoretical justification for choosing α = 3. The choice was made, according to them, because this value of α makes the probability that $\hat f$ is bimodal when f is Gamma(5) about 0.1. Figure 4.3 shows an example of logspline density estimation for a mixture of two normal densities.

It would be interesting to have an algorithm which combines the low computational cost of logsplines (due to B-splines and the estimation of their coefficients) and the performance of the automatic smoothing parameter selection developed by Gu (1993).

4.3 Splines Density Estimation: A Dimensionless Approach

Let $X_1, \dots, X_n$ be a random sample from a probability density f on a finite domain $\mathcal{X}$. Assuming that f > 0 on $\mathcal{X}$, one can make a logistic transformation $f = e^g / (\int e^g)$.
We know that this transformation is not one-to-one, and Gu and Qiu (1993) proposed side conditions on g such as $g(x_0) = 0$ for some $x_0 \in \mathcal{X}$, or $\int_{\mathcal{X}} g = 0$. Given those conditions we have to find the minimizer of the penalized log-likelihood

$$-\frac{1}{n} \sum_{i=1}^{n} g(X_i) + \log \int_{\mathcal{X}} e^g + \frac{\lambda}{2} J(g) \qquad (4.1)$$

in a Hilbert space H, where J is a roughness penalty and λ is the smoothing parameter. The space H is such that the evaluation is continuous so that the first term in (4.1) is continuous. The penalty term J is a seminorm in H with a null space $J_\perp$ of finite dimension $M \ge 1$. By taking a finite dimensional $J_\perp$ one prevents interpolation (i.e. the empirical distribution) and a quadratic J makes easier the numerical solution of the

variational problem (4.1). Since H is an infinite dimensional space, the minimizer of (4.1) is, in general, not computable. Thus, Gu and Qiu (1993) propose calculating the solution of the variational problem in a finite dimensional space, say, $H_n$, where n is the sample size. The performance of the smoothing spline estimator depends upon the choice of the smoothing parameter λ. Gu (1993) suggested a performance-oriented iteration procedure (a GCV-like procedure) which updates g and λ jointly according to a performance estimate. The performance is measured by a loss function which was taken as a symmetrized Kullback-Leibler distance between $e^g / \int e^g$ and $e^{g_0} / \int e^{g_0}$. Specifically, if one solves the variational problem (4.1) in $H_n$ by a standard Newton-Raphson procedure, then by starting from a current iterate $\tilde g$, instead of calculating the next iterate with a fixed λ, one may choose a λ that minimizes the loss function. Figure 4.4 exhibits the performance of SSDE for the Buffalo snow data. (This data set can be found in R.)

Under this approach, one might ask the following questions: Is it possible to estimate a density using $K \le n$ basis functions instead of the original n such that it reduces the computational cost of getting the solution of (4.1) significantly? How good would such an approximation be? Dias (1998) provided some answers to those questions by using the basis functions $B_i(x)$ given in Definition 4.3, which can be easily extended to a multivariate case by a tensor product.

[Figure 4.3: Logspline density estimation for 0.5N(0,1) + 0.5N(5,1). Panel title: Mixture of Normals; legend: True, logspline.]

[Figure 4.4: Histogram, SSDE, kernel and logspline density estimates. Panel title: Buffalo snow data; legend: SSDE, Logspline(d), Kernel.]


Chapter 5

The thin-plate spline on $\mathbb{R}^d$

There are many applications where an unknown function g of one or more variables and a set of measurements are given such that:

$$y_i = L_i g + \epsilon_i \qquad (5.1)$$

where $L_1, \dots, L_n$ are linear functionals defined on some linear space H containing g, and $\epsilon_1, \dots, \epsilon_n$ are measurement errors usually assumed to be independently, identically and normally distributed with mean zero and unknown variance $\sigma^2$. Typically, the $L_i$ will be point evaluations of the function g.

Straightforward least squares fitting is often appropriate but it produces a function which is not sufficiently smooth for some data fitting problems. In such cases, it may be better to look for a function which minimizes a criterion that involves a combination of goodness of fit and an appropriate measure of smoothness. Let $t = (x_1, \dots, x_d)$, $t_i = (x_1(i), \dots, x_d(i))$ for i = 1, ..., n and take the evaluation functionals $L_i g = g(t_i)$; then the regression model (5.1) becomes

$$y_i = g(x_1(i), \dots, x_d(i)) + \epsilon_i. \qquad (5.2)$$

The thin-plate smoothing spline is the solution to the following variational problem. Find $g \in H$ to minimize

$$L_\lambda(g) = \frac{1}{n} \sum_{i=1}^{n} (y_i - g(t_i))^2 + \lambda J_m^d(g) \qquad (5.3)$$

where λ is the smoothing parameter which controls the trade-off between fidelity to the data and smoothness with penalty term $J_m^d$. Note that when λ is large a premium is being placed on smoothness and functions with large mth derivatives are penalized. In fact, $\lambda \to \infty$ gives a polynomial regression fit of degree at most m − 1 to the data. Conversely, for small values of λ more emphasis is put on goodness-of-fit, and in the limit case $\lambda \to 0$ we have interpolation. In general, in smoothing spline nonparametric regression the penalty term $J_m^d$ is given by

$$J_m^d(g) = \sum_{\alpha_1 + \cdots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Big( \frac{\partial^m g}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \Big)^2 \prod_j dx_j.$$

The condition 2m − d > 0 is necessary and sufficient in order to have bounded evaluation functionals in H, i.e., for H to be a reproducing kernel Hilbert space. Moreover, the

null space of the penalty term $J_m^d$ is the M-dimensional space spanned by polynomials $\phi_1, \dots, \phi_M$ of degree less than or equal to m − 1, e.g., for d = 1, $\phi_j(t) = t^{j-1}/(j-1)!$, for j = 1, ..., m. Wahba (1990) has shown that, if $t_1, \dots, t_n$ are such that least squares regression on $\phi_1, \dots, \phi_M$ is unique, then (5.3) has a unique minimizer $g_\lambda$, with representation

$$g_\lambda(t) = \sum_{i=1}^{n} c_i E_m(t, t_i) + \sum_{j=1}^{M} b_j \phi_j(t), \quad \text{i.e., at the design points,} \quad g_\lambda = Qc + Tb, \qquad (5.4)$$

where T is an n × M matrix with entries $\phi_j(t_l)$ for j = 1, ..., M, l = 1, ..., n and Q is an n × n matrix with entries $E_m(t_l, t_i)$, for i, l = 1, ..., n. The function $E_m$ is a Green's function for the m-iterated Laplacian (Wahba (1990), Silverman and Green (1994)). For example, when d = 1, $E_m(t, t_i) = (t - t_i)_+^{m-1}/(m-1)!$.

The coefficients c and b can be determined by substituting (5.4) into (5.3). Thus, the optimization problem (5.3), subject to $T^T c = 0$, is reduced to a linear system of equations which is solved by standard matrix decompositions such as the QR decomposition. The constraint $T^T c = 0$ is necessary to guarantee that when computing the penalty term at $g_\lambda$, $J_m^d(g_\lambda)$ is conditionally positive definite (see Wahba (1990)). Efforts have been made to reduce substantially the computational cost of smoothing spline fitting by introducing the concept of H-splines (Luo and Wahba (1997) and Dias (1999)), where the number of basis functions and λ act as the smoothing parameters.

A major conceptual problem with spline smoothing is that it is defined implicitly as the solution to a variational problem rather than as an explicit formula involving the data values. This difficulty can be resolved, at least approximately, by considering how the estimate behaves on large data sets. It can be shown from the quadratic nature of (5.3) that $g_\lambda$ is linear in the observations $y_i$, in the sense that there exists a weight function $H_\lambda(s, t)$ such that

$$g_\lambda(s) = \sum_{i=1}^{n} y_i H_\lambda(s, t_i). \qquad (5.5)$$

It is possible to obtain the asymptotic form of the weight function, and hence an approximate explicit form of the estimate. For the sake of simplicity consider d = 1, m = 2 and suppose that the design points have local density f(t) with respect to Lebesgue measure on $\mathbb{R}$. Assume the following conditions (Silverman (1984)):
1. $g \in H[a, b]$.
2. There exists an absolutely continuous distribution function F on [a, b] such that $F_n \to F$ uniformly as $n \to \infty$.
3. $f = F'$, $0 < \inf_{[a,b]} f \le \sup_{[a,b]} f < \infty$.
4. The density has bounded first derivative on [a, b].
5. With $a(n) = \sup_{[a,b]} |F_n - F|$, the smoothing parameter λ depends on n in such a way that $\lambda \to 0$ and $\lambda^{-1/4} a(n) \to 0$ as $n \to \infty$.

In particular, one can assume that the design points are regularly distributed with density f; that is, $t_i = F^{-1}((i - 1/2)/n)$. Then, $\sup |F_n - F| = (1/2) n^{-1}$, so that $n^4 \lambda \to \infty$ and $\lambda \to 0$ are needed for condition 5 to hold. Thus, as $n \to \infty$,

$$H_\lambda(s, t) \approx \frac{1}{f(t)} \frac{1}{h(t)} K\Big(\frac{s - t}{h(t)}\Big),$$

where the kernel function K is given by

$$K(u) = \frac{1}{2} \exp(-|u|/\sqrt{2}) \sin(|u|/\sqrt{2} + \pi/4),$$

and the bandwidth h(t) satisfies

$$h(t) = \lambda^{1/4} n^{-1/4} f(t)^{-1/4}.$$

Based on these formulas, we can see that the spline smoother is approximately a convolution smoothing method, but the data are not convolved with a kernel with fixed bandwidth; in fact, h varies across the sample.

5.1 Additive Models

The additive model is a generalization of the usual linear regression model. What has made the linear model so popular for statistical inference is that it is linear in the predictor variables (explanatory variables): once we have fitted the linear model we can examine the predictor variables separately, in the absence of interactions. Additive models retain this feature, since they are additive in the predictor effects. An additive model is defined by

$$y_i = \alpha + \sum_{j=1}^{p} g_j(t_j) + \epsilon_i \qquad (5.6)$$

where $t_j$ are the predictor variables and, as defined at the beginning of this chapter, $\epsilon_i$ are uncorrelated measurement errors with $E[\epsilon_i] = 0$ and $Var[\epsilon_i] = \sigma^2$. The functions $g_j$ are unknown but assumed to be smooth functions lying in some metric space. The beginning of this chapter describes a general framework for defining and estimating general nonparametric regression models which includes additive models as a special case. For this, suppose that Ω is the space of the vector predictor t and assume that H is a reproducing kernel Hilbert space. Hence H has the decomposition

$$H = H_0 + \sum_{k=1}^{p} H_k \qquad (5.7)$$

where $H_0$ is spanned by $\phi_1, \dots, \phi_M$ and $H_k$ has the reproducing kernel $E_k(\cdot, \cdot)$, defined earlier in this chapter. The space $H_0$ is the space of functions that are not to be penalized in the optimization. For example, recall equation (5.3) and let m = 2; then $H_0$ is the space of linear functions in t.
The optimization problem becomes: for a given set of predictors $t_1, \dots, t_n$, find the minimizer of

$$\sum_{i=1}^{n} \Big\{ y_i - \sum_{k=0}^{p} g_k(t_i) \Big\}^2 + \sum_{k=1}^{p} \lambda_k \| g_k \|_{H_k}^2, \qquad (5.8)$$

with $g_k \in H_k$. Then, the theory of reproducing kernels guarantees that a minimizer exists and has the form

$$\hat g = \sum_{k=1}^{p} Q_k c_k + Tb, \qquad (5.9)$$

where $Q_k$ and T are given in equation (5.4) and the vectors $c_k$ and b are found by minimizing the finite dimensional penalized least squares criterion

$$\Big\| y - Tb - \sum_{k=1}^{p} Q_k c_k \Big\|^2 + \sum_{k=1}^{p} \lambda_k c_k^T Q_k c_k. \qquad (5.10)$$

This general problem (5.9) can potentially be solved by a backfitting type algorithm as in Hastie and Tibshirani (1990).

Algorithm
1. Initialize $g_j = g_j^{(0)}$ for j = 0, ..., p.
2. Cycle j = 0, ..., p, ..., j = 0, ..., p, ...
   $$\hat g_j = S_j \Big( y - \sum_{k \ne j} g_k(t_k) \Big)$$
3. Continue step 2 until the individual functions do not change.

Here $y = (y_1, \dots, y_n)$, $S_j = Q_j (Q_j + \lambda_j I)^{-1}$ for j = 1, ..., p, and $S_0 = T (T^T T)^{-1} T^T$. One may observe that omitting the constant term α in (5.6) does not change the resulting estimates. An example of the gam method is given in Figure 5.1.

5.2 Generalized Cross-Validation Method for Splines Nonparametric Regression

Without loss of generality, take d = 1 and m = 2. The solution of (5.3) depends strongly on the smoothing parameter. Craven and Wahba (1979) provide an automatic data-driven procedure to estimate λ. For this, let $g_\lambda^{[k]}$ be the minimizer of

$$\frac{1}{n} \sum_{i \ne k} (y_i - g(t_i))^2 + \lambda \int (g''(u))^2\,du,$$

the optimization problem with the kth data point left out. Then, following Wahba's notation, the ordinary cross-validation function $V_0(\lambda)$ is defined as

$$V_0(\lambda) = \frac{1}{n} \sum_{k=1}^{n} (y_k - g_\lambda^{[k]}(t_k))^2, \qquad (5.11)$$

and the leave-one-out estimate of λ is the minimizer of $V_0(\lambda)$. To proceed, we need to describe the influence matrix. It is not difficult to show (see Wahba (1990)) that,

for fixed λ we have by (5.5) that $g_\lambda$ is linear in the observations $y_i$, that is, in matrix notation $g_\lambda = H_\lambda y$.

[Figure 5.1: True, tensor product, gam non-adaptive and gam adaptive surfaces.]

At this stage, one may think that the computation of this problem is prohibitive, but Craven and Wahba (1979) give us a very useful mathematical identity, which will not be proved here:

$$(y_k - g_\lambda^{[k]}(t_k)) = (y_k - g_\lambda(t_k)) / (1 - h_{kk}(\lambda)), \qquad (5.12)$$

where $h_{kk}(\lambda)$ is the kth diagonal entry of $H_\lambda$. By substituting (5.12) into (5.11) we obtain a simplified form of $V_0$, that is,

$$V_0(\lambda) = \frac{1}{n} \sum_{k=1}^{n} (y_k - g_\lambda(t_k))^2 / (1 - h_{kk}(\lambda))^2. \qquad (5.13)$$

The right hand side of (5.13) is easier to compute than (5.11); however, the GCV is even easier. Generalized cross-validation (GCV) is a method for choosing the smoothing parameter λ which is based on leaving-one-out, but it has two advantages. It is easy to compute and it possesses some important theoretical properties that would be impossible to prove for leaving-one-out, although, as pointed out by Wahba, in many cases the GCV and leaving-one-out estimates will give similar answers. The GCV function is defined by

$$V(\lambda) = \frac{1}{n} \sum_{k=1}^{n} (y_k - g_\lambda(t_k))^2 / (1 - \bar h_{kk}(\lambda))^2 = \frac{\frac{1}{n} \| (I - H_\lambda) y \|^2}{\big[ \frac{1}{n} \mathrm{tr}(I - H_\lambda) \big]^2}, \qquad (5.14)$$

where $\bar h_{kk}(\lambda) = (1/n)\,\mathrm{tr}(H_\lambda)$, with $\mathrm{tr}(H_\lambda)$ standing for the trace of $H_\lambda$. Note that $V(\lambda)$ is a weighted version of $V_0(\lambda)$. In addition, if $h_{kk}(\lambda)$ does not depend on k, then $V_0(\lambda) = V(\lambda)$ for all λ > 0.

It is important to observe that GCV is a predictive mean square error criterion. Define the predictive mean square error $T(\lambda)$ as

$$T(\lambda) = \frac{1}{n} \sum_{i=1}^{n} (L_i g_\lambda - L_i g)^2 \qquad (5.15)$$

where $L_i$ is the evaluation functional $L_i g = g(t_i)$ introduced with (5.2); the GCV estimate of λ is then an estimate of the minimizer of (5.15). Consider the expected value of $T(\lambda)$,

$$E[T(\lambda)] = \frac{1}{n} \sum_{i=1}^{n} E[(L_i g_\lambda - L_i g)^2]. \qquad (5.16)$$

The GCV theorem (Wahba (1990)) says that if g is in a reproducing kernel Hilbert space then there is a sequence of minimizers $\tilde\lambda(n)$ of $EV(\lambda)$ that comes close to achieving the minimum possible value of the expected mean square error, $E[T(\lambda)]$, as $n \to \infty$. That is, let the expectation inefficiency $I_n^*$ be defined as

$$I_n^* = \frac{E[T(\tilde\lambda(n))]}{E[T(\lambda^*)]},$$

where $\lambda^*$ is the minimizer of $E[T(\lambda)]$. Then, under mild conditions such as the ones described and discussed by Golub, Heath and Wahba (1979) and Craven and Wahba (1979), we have $I_n^* \downarrow 1$ as $n \to \infty$.

Figure 5.2 shows the scatter plot of the revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960. (This data set can be found in the software R.) The smoothing parameter λ was computed by the GCV method through the R function smooth.spline().
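The leave-one-out identity (5.12) and the scores (5.13) and (5.14) hold for any penalized least squares smoother that is linear in the observations, so they can be checked on a much simpler smoother than a spline. The sketch below swaps in a one-parameter ridge regression through the origin (an assumption made purely to keep the linear algebra scalar); the data and the grid of λ values are hypothetical.

```python
def ridge_fit(x, y, lam):
    """Slope b minimizing sum (y_i - b x_i)^2 + lam * b^2."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

def hat_diagonal(x, lam):
    """Diagonal entries h_kk of the influence matrix H = x (x'x + lam)^{-1} x'."""
    s = sum(xi * xi for xi in x) + lam
    return [xi * xi / s for xi in x]

def ordinary_cv(x, y, lam):
    """V_0 as in (5.11): refit n times, each time leaving one point out."""
    total = 0.0
    for k in range(len(x)):
        b_k = ridge_fit(x[:k] + x[k + 1:], y[:k] + y[k + 1:], lam)
        total += (y[k] - b_k * x[k]) ** 2
    return total / len(x)

def shortcut_cv(x, y, lam):
    """V_0 as in (5.13): a single fit, residuals inflated by 1 / (1 - h_kk)."""
    b = ridge_fit(x, y, lam)
    h = hat_diagonal(x, lam)
    return sum(((yi - b * xi) / (1.0 - hk)) ** 2
               for xi, yi, hk in zip(x, y, h)) / len(x)

def gcv(x, y, lam):
    """V as in (5.14): every h_kk is replaced by the average tr(H) / n."""
    n = len(x)
    b = ridge_fit(x, y, lam)
    h_bar = sum(hat_diagonal(x, lam)) / n
    rss = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    return (rss / n) / (1.0 - h_bar) ** 2

x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]           # hypothetical design points
y = [0.4, 1.3, 1.2, 2.4, 2.2, 3.3]           # hypothetical responses
lambdas = [0.01, 0.1, 1.0, 10.0]             # hypothetical candidate values

for lam in lambdas:
    print(lam, ordinary_cv(x, y, lam), shortcut_cv(x, y, lam), gcv(x, y, lam))

lam_gcv = min(lambdas, key=lambda lam: gcv(x, y, lam))
```

The first two columns agree up to rounding, which is the content of (5.12); the GCV column differs because the individual leverages $h_{kk}$ are replaced by their average.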

[Figure 5.2: Smoothing spline fitting with smoothing parameter obtained by the GCV method. Panel title: airmiles data; legend: Data, SS; vertical axis: passenger miles flown by U.S. commercial airlines.]


Chapter 6

Regression splines, P-splines and H-splines

6.1 Sequentially Adaptive H-splines

In regression splines, the idea is to approximate g by a finite dimensional subspace of W spanned by basis functions $B_1, \dots, B_K$, $K \le n$. That is,

$$g \approx g_K = \sum_{j=1}^{K} c_j B_j,$$

where the parameter K controls the flexibility of the fitting. A very common choice for basis functions is the set of cubic B-splines (de Boor, 1978). The B-spline basis functions provide a numerically superior scheme of computation and have the main feature that each $B_j$ has compact support. In practice, this means that we obtain a stable evaluation and that the resulting matrix with entries $B_{i,j} = B_j(t_i)$, for j = 1, ..., K and i = 1, ..., n, is banded. Unfortunately, the main difficulty when working with regression splines is to select the number and the positions of a sequence of breakpoints called knots, where the piecewise cubic polynomials are tied to enforce continuity and lower order continuous derivatives. (See Schumaker (1972) for details.) Regression splines are attractive because of their computational scheme, where standard linear model techniques can be applied. But the smoothness of the estimate cannot easily be varied continuously as a function of a single smoothing parameter (Hastie and Tibshirani, 1990). In particular, when λ = 0 we have the regression spline case, where K is the parameter that controls the flexibility of the fitting.

To exemplify the action of K on the estimated curve, let us consider an example by simulation with $y(t) = \exp(-t) \sin(\pi t/2) \cos(\pi t) + \epsilon$ with $\epsilon \sim N(0, .05)$. The curve estimates were obtained by the least squares method with four different numbers of basis functions, which are the cubic B-splines. Figure 6.1 shows the effect of varying the number of basis functions on the estimation of the true curve. Note that the number of basis functions is the same as the number of knots since it is assumed that we are dealing with the natural cubic splines space. Observe that small values of K make the estimate smoother and hence oversmoothing may occur. Large values of K may cause undersmoothing.
In smoothing techniques, the number of basis functions is chosen to be as large as the number of observations, and the smoothing parameter is then chosen to control the amount of smoothing (Bates and Wahba, 1982). Here a different approach is taken. The H-splines method, introduced by Dias (1994) in the case of nonparametric density estimation, combines ideas from regression splines and smoothing splines by finding the number of basis functions and the smoothing parameter iteratively, according to a criterion that is described below.

[Figure 6.1: Spline least square fittings for different values of K. Plot title: 100 obs. from y(x) = -exp(-x) sin(pi x/2) cos(pi x) + N(0, .025). Legend: True, K=4, K=12, K= (value lost). Vertical axis: y.]

With the point evaluation functionals $L_i g = g(t_i)$, equation (6.4) becomes

\[ A_\lambda(g) = \sum_{i=1}^{n} (y_i - g(t_i))^2 + \lambda \int (g'')^2. \tag{6.1} \]

Assume that $g \approx g_K = \sum_{i=1}^{K} c_i B_i = Xc$, so that $g_K \in H_K$, where $H_K$ denotes the space of natural cubic splines (NCS) spanned by the basis functions $\{B_i\}_{i=1}^{K}$ and X is an $n \times K$ matrix with entries $X_{ji} = B_i(t_j)$, for $i = 1, \ldots, K$ and $j = 1, \ldots, n$. Then the numerical problem is to find a vector $c = (c_1, \ldots, c_K)^T$ that minimizes

\[ A_\lambda(c) = \|y - Xc\|_2^2 + \lambda c^T \Omega c, \]

where Ω is a $K \times K$ matrix with entries $\Omega_{ij} = \int B_i''(t) B_j''(t)\,dt$ and $y = (y_1, \ldots, y_n)^T$. Standard calculations (de Boor, 1978) provide c as the solution of the linear system

\[ (X^T X + \lambda \Omega)\, c_\lambda = X^T y. \]

Note that the linear system now involves $K \times K$ matrices instead of the $n \times n$ matrices that arise in the case of smoothing splines. Both K and λ control the trade-off between smoothness and fidelity to the data. Figure 6.2 shows, for λ > 0, an example of the relationship between K and λ. Note that when the number of basis functions increases, the smoothing parameter decreases to a point and then increases with K. That is, for large values of K, the smoothing parameter λ becomes larger in order to enforce smoothness.
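For readers who want the "standard calculations" behind the linear system for $c_\lambda$ spelled out, the step is short. It uses only the symbols already defined (X, Ω, y, c, λ) and the symmetry of Ω, which follows from its definition as a matrix of integrals of products:

```latex
\begin{aligned}
A_\lambda(c) &= \|y - Xc\|_2^2 + \lambda\, c^T \Omega c, \\
\nabla_c A_\lambda(c) &= -2 X^T (y - Xc) + 2 \lambda \Omega c = 0, \\
\Longrightarrow \quad (X^T X + \lambda \Omega)\, c_\lambda &= X^T y .
\end{aligned}
```

Since $X^T X + \lambda\Omega$ is $K \times K$, the cost of the solve depends on the number of basis functions and not on the sample size n.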

[Figure 6.2: Five thousand replicates of y(x) = exp(−x) sin(πx/2) cos(πx) + ε. Axes: knots, smoothing parameter.]

Based on the facts described previously, the idea is to provide a procedure that estimates the smoothing parameter and the number of basis functions iteratively. Consider the following algorithm.

Algorithm 6.1
(1) Let $K_0$ be the initial number of basis functions and fix $\lambda_0$.
(2) Compute $c_{\lambda_0}$ by solving $(X^T X + \lambda_0 \Omega)\, c_{\lambda_0} = X^T y$.
(3) Find $\hat\lambda$ which minimizes
\[ GCV(\lambda) = \frac{n^{-1} \sum_{i=1}^{n} (y_i - g_{K_0}(t_i))^2}{\left[ 1 - n^{-1}\,\mathrm{tr}(A(\lambda)) \right]^2}, \]
where $A(\lambda) = X (X^T X + \lambda \Omega)^{-1} X^T$.
(4) Compute $g_{K_0,\hat\lambda} = A(\hat\lambda)\, y$.
(5) Increment the number of basis functions by one and repeat steps (2) to (4) in order to get $g_{K_0+1,\hat\lambda}$.
(6) For a real number δ > 0, if a distance $d(g_{K_0,\hat\lambda}, g_{K_0+1,\hat\lambda}) < \delta$, stop the procedure.

The number δ can be determined empirically according to the particular distance $d(\cdot,\cdot)$. Note that each time the number of basis functions K is incremented by one, the numerator of GCV changes, and hence this procedure provides an optimal smoothing parameter λ for the estimate $g_K$ based on K basis functions.
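The loop of Algorithm 6.1 can be sketched as follows. This is an illustration under stated assumptions and not the author's implementation: clamped cubic B-splines stand in for the natural cubic spline basis, Ω is approximated by finite differences on a fine grid, $\hat\lambda$ is found by a grid search (which absorbs step (2)), the integrals in the distance are replaced by sums over the design points, and the names `bspline_basis`, `roughness_penalty`, `gcv_fit`, `distance` and `h_spline`, as well as the defaults `K_max` and `delta`, are hypothetical.

```python
import numpy as np

def bspline_basis(t, K, lo, hi, degree=3):
    """n x K matrix of clamped B-splines with equally spaced interior knots."""
    t = np.asarray(t, dtype=float)
    inner = np.linspace(lo, hi, K - degree + 1)[1:-1]
    knots = np.r_[[lo] * (degree + 1), inner, [hi] * (degree + 1)]
    B = np.zeros((t.size, knots.size - 1))
    for j in range(knots.size - 1):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    B[t == hi, K - 1] = 1.0
    for d in range(1, degree + 1):
        nxt = np.zeros((t.size, knots.size - 1 - d))
        for j in range(knots.size - 1 - d):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                nxt[:, j] += (t - knots[j]) / left * B[:, j]
            if right > 0:
                nxt[:, j] += (knots[j + d + 1] - t) / right * B[:, j + 1]
        B = nxt
    return B

def roughness_penalty(K, lo, hi, m=400):
    """Approximate Omega_ij = int B_i'' B_j'' by second differences on a grid."""
    grid = np.linspace(lo, hi, m)
    h = grid[1] - grid[0]
    second = np.diff(bspline_basis(grid, K, lo, hi), n=2, axis=0) / h**2
    return h * second.T @ second

def gcv_fit(t, y, K, lambdas):
    """Steps (2)-(4): for fixed K, pick lambda by GCV and return the fitted values."""
    lo, hi = float(t.min()), float(t.max())
    X = bspline_basis(t, K, lo, hi)
    Omega = roughness_penalty(K, lo, hi)
    best = None
    for lam in lambdas:
        A = X @ np.linalg.solve(X.T @ X + lam * Omega, X.T)   # hat matrix A(lambda)
        fit = A @ y
        gcv = np.mean((y - fit) ** 2) / (1.0 - np.trace(A) / y.size) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, fit)
    return best[1], best[2]

def distance(f, g):
    """Pseudo Hellinger distance sqrt(2 (1 - rho)), integrals replaced by sums."""
    rho = np.sum(np.abs(f * g)) / np.sqrt(np.sum(f**2) * np.sum(g**2))
    return np.sqrt(max(2.0 * (1.0 - rho), 0.0))

def h_spline(t, y, K0=4, K_max=30, delta=0.01, lambdas=np.logspace(-8, 2, 21)):
    """Steps (1), (5), (6): add one basis function at a time until the fits agree."""
    K = K0
    lam, fit = gcv_fit(t, y, K, lambdas)
    while K < K_max:
        lam_next, fit_next = gcv_fit(t, y, K + 1, lambdas)
        d = distance(fit, fit_next)
        K, lam, fit = K + 1, lam_next, fit_next
        if d < delta:
            break
    return K, lam, fit

rng = np.random.default_rng(0)
t = np.linspace(0.0, 3.0, 100)
y = np.exp(-t) * np.sin(np.pi * t / 2) * np.cos(np.pi * t) + rng.normal(0.0, 0.05, t.size)
K_hat, lam_hat, g_hat = h_spline(t, y)
```

The sketch returns the last fit computed when the stopping rule fires; returning the fit with K basis functions instead of K + 1 would be an equally defensible reading of step (6).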

The aim is to find a criterion able to tell when to stop increasing the number of basis functions, that is, to find the dimension of the natural cubic spline space in which one is looking for the approximation of the solution of (6.4). For this, let us define the following transformation. Given any function g in $W_2^2[a, b]$, take

\[ t_g = \frac{g^2}{\int g^2}, \]

then $t_g \ge 0$ and $\int t_g = 1$. For any functions $f, g \in W_2^2[a, b]$, define a pseudo distance closely related to the square of the Hellinger distance,

\[ d^2(f, g) = \int \left( \sqrt{t_f} - \sqrt{t_g} \right)^2 = 2\,(1 - \rho(f, g)), \]

where

\[ \rho(f, g) = \int \sqrt{t_f\, t_g} = \int \sqrt{\frac{f^2 g^2}{\int f^2 \int g^2}} = \frac{\int |f g|}{\sqrt{\int f^2 \int g^2}} \]

is the affinity between f and g. It is not difficult to see that $0 \le \rho(f, g) \le 1$ for all $f, g \in W_2^2[a, b]$. Note that $d^2(f, g)$ is minimum when $\rho(f, g) = 1$, i.e., $(\int f^2 \int g^2)^{1/2} = \int |f g|$, only if $\alpha |f| + |g| = 0$ for some α. Increasing the number of basis functions K by one, the procedure will stop when $g_{K,\hat\lambda} \approx g_{K+1,\hat\lambda}$ in the sense of the partial affinity,

\[ \rho(g_{K,\hat\lambda}, g_{K+1,\hat\lambda}) = \frac{\int |g_{K,\hat\lambda}\, g_{K+1,\hat\lambda}|}{\sqrt{\int g_{K,\hat\lambda}^2 \int g_{K+1,\hat\lambda}^2}} \approx 1, \]

where the dependence of λ on K is omitted for the sake of simplicity. Simulations were performed in order to verify the behavior of the affinity and the partial affinity. Figure 6.3 shows a typical example given by the underlying function $y(x) = \exp(-x)\sin(\pi x/2)\cos(\pi x) + \epsilon$. One may notice that the affinity is a concave function of the number of basis functions (knots) and that the partial affinity approaches one quickly. Moreover, numerical experiments have shown that the maximum of the affinity and the stabilization of the partial affinity coincide. This means that increasing K arbitrarily not only increases the computational cost but also does not provide the best fitted curve (in the pseudo Hellinger norm). It would be useful to have the distribution of the affinity between the true curve and the estimate produced by the adaptive H-splines method. A previous study (Dias, 1996) showed an empirical unimodal density with support on [0, 1], skewed to the left, suggesting a beta model.
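The affinity and the pseudo distance defined above are easy to evaluate numerically. The sketch below uses a trapezoidal rule on a grid; the grid, the two test curves and the function names `integrate`, `affinity` and `pseudo_distance_sq` are assumptions chosen only for illustration.

```python
import numpy as np

def integrate(values, x):
    """Trapezoidal rule for values sampled on the grid x."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(x)) / 2.0)

def affinity(f, g, x):
    """rho(f, g) = int |f g| / sqrt(int f^2 * int g^2)."""
    return integrate(np.abs(f * g), x) / np.sqrt(integrate(f**2, x) * integrate(g**2, x))

def pseudo_distance_sq(f, g, x):
    """d^2(f, g) = 2 (1 - rho(f, g))."""
    return 2.0 * (1.0 - affinity(f, g, x))

x = np.linspace(0.0, 3.0, 2001)
f = np.exp(-x) * np.sin(np.pi * x / 2) * np.cos(np.pi * x)
g = f + 0.05 * np.sin(5.0 * x)          # a perturbed version of f

rho_fg = affinity(f, g, x)
d2_fg = pseudo_distance_sq(f, g, x)
```

Because of the absolute value in the numerator, the affinity of f with any nonzero multiple of itself equals one, which is why d is called a pseudo distance.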
To illustrate, five thousand replicates with sample sizes 20, 100, 200 and 500 were taken from a test function $y_i = x_i^3 + \epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. N(0, 5). Figure 6.4 shows the empirical affinity distribution (unimodal, skewed to the left, with range between 0 and 1), a nonparametric density estimate using the kernel method and a parametric one using a beta model whose parameters were estimated by the method of moments. Similar results were obtained for several other test functions, and some of them are exhibited in Figure 2.5, which brings more evidence to support a beta model.
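The method of moments fit of a beta model mentioned above has a closed form: with sample mean m and sample variance v, $\hat\alpha = m\,(m(1-m)/v - 1)$ and $\hat\beta = (1-m)\,(m(1-m)/v - 1)$. A minimal sketch follows; the replicated affinities are not available here, so draws from a Beta(8, 2) distribution are used as stand-in data, and the function name is hypothetical.

```python
import numpy as np

def beta_method_of_moments(sample):
    """Return (alpha_hat, beta_hat) matching the sample mean and variance."""
    m = float(np.mean(sample))
    v = float(np.var(sample, ddof=1))
    common = m * (1.0 - m) / v - 1.0      # estimates alpha + beta
    return m * common, (1.0 - m) * common

# Stand-in for five thousand simulated affinities (left-skewed on [0, 1]).
rng = np.random.default_rng(1)
affinities = rng.beta(8.0, 2.0, size=5000)
alpha_hat, beta_hat = beta_method_of_moments(affinities)
```

The estimates are valid only when v < m(1 − m), which always holds for data strictly inside (0, 1) that are not concentrated on the two end points.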

[Figure 6.3: Five thousand replicates of the affinity and the partial affinity for adaptive nonparametric regression using H-splines with the true curve. Panels: partial affinity against knots; affinity against knots.]

[Figure 6.4: Density estimates of the affinity based on five thousand replicates of the curve $y_i = x_i^3 + \epsilon_i$ with $\epsilon_i \sim N(0, .5)$. Solid line is a density estimate using a beta model and dotted line is a nonparametric density estimate. Panels: affinity, one panel per sample size n.]

Figure 6.5 shows that, in general, the H-splines method has similar performance to smoothing splines. But, as mentioned before, the H-splines approach solves a linear system of order K, while smoothing splines have to solve a linear system of order $n \ge K$.

[Figure 6.5: A comparison between smoothing splines (S-splines) and hybrid splines (H-splines) methods. Plot title: 100 obs. from y(x) = exp(-2x) sin(2 pi x) + N(0, .1). Legend: TRUE, H-splines, S-splines. Vertical axis: y.]

6.2 P-splines

The basic idea of P-splines, proposed by Eilers and Marx (1996), is to use a considerable number of knots and to control the smoothness through a difference penalty on the coefficients of adjacent B-splines. For this, let us consider a simple regression model $y(x) = g(x) + \varepsilon$, where ε is a random variable with a symmetric distribution with mean zero and finite variance. Assume that the regression curve g can be well approximated by a linear combination of, without loss of generality, cubic B-splines, denoted by $B(x) = B(x; 3)$. Specifically, given n data points $(x_i, y_i)$ on a set of K B-splines $B_j(\cdot)$, we take $g(x_i) = \sum_{j=1}^{K} a_j B_j(x_i)$. Now the penalized least squares problem is to find a vector of coefficients $a = (a_1, \ldots, a_K)$ that minimizes

\[ PLS(a) = \sum_{i=1}^{n} \Big\{ Y_i - \sum_{j=1}^{K} a_j B_j(x_i) \Big\}^2 + \lambda \int \Big\{ \sum_{j=1}^{K} a_j B_j''(x) \Big\}^2 dx. \]
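The basic idea stated at the start of this section, a rich B-spline basis with a difference penalty on adjacent coefficients, can be sketched in a few lines. In the sketch the integral penalty in PLS(a) is replaced by the sum of squared m-th order differences of the coefficients, so that a solves $(B^T B + \lambda D^T D)\, a = B^T y$ with D the difference matrix. The equally spaced knots extended beyond the data range, the sine test curve, the noise level and the names `uniform_bspline_basis` and `pspline_fit` are assumptions for illustration, not the code behind the figures.

```python
import numpy as np

def uniform_bspline_basis(x, n_seg, lo, hi, degree=3):
    """B-splines on equally spaced knots extended beyond [lo, hi]; K = n_seg + degree columns."""
    x = np.asarray(x, dtype=float)
    h = (hi - lo) / n_seg
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    B = np.zeros((x.size, knots.size - 1))
    for j in range(knots.size - 1):            # degree 0: indicators of knot intervals
        B[:, j] = (knots[j] <= x) & (x < knots[j + 1])
    for d in range(1, degree + 1):             # Cox-de Boor recursion, uniform knots
        B = ((x[:, None] - knots[None, :-d - 1]) * B[:, :-1]
             + (knots[None, d + 1:] - x[:, None]) * B[:, 1:]) / (d * h)
    return B

def pspline_fit(x, y, n_seg=20, lam=1.0, m=2):
    """Penalized least squares with an m-th order difference penalty on the coefficients."""
    B = uniform_bspline_basis(x, n_seg, float(x.min()), float(x.max()))
    D = np.diff(np.eye(B.shape[1]), n=m, axis=0)      # (K - m) x K difference matrix
    a = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    return a, B @ a

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 200)
y = np.sin(2.0 * np.pi * x / 10.0) + 0.5 + rng.normal(0.0, 0.3, x.size)

a_smooth, fit_smooth = pspline_fit(x, y, lam=1.0)
a_limit, fit_limit = pspline_fit(x, y, lam=1e10)     # very heavy penalty
```

With m = 2 and a very large λ the second differences of the coefficients are forced towards zero, and on equally spaced knots the fit then approaches the ordinary least squares line, an instance of polynomial regression arising as the limit for large λ.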

Following de Boor (1978), we have that the second derivative satisfies

\[ \sum_{j=1}^{K} a_j B_j''(x_i; 3) = h^{-2} \sum_{j=1}^{K} \Delta^2 a_j\, B_j(x_i; 1), \]

where h is the distance between knots and $\Delta^2 a_j = \Delta(\Delta a_j) = \Delta(a_j - a_{j-1})$. The P-splines method penalizes higher-order finite differences of the coefficients of adjacent B-splines. That is, it minimizes

\[ \sum_{i=1}^{n} \Big\{ Y_i - \sum_{j=1}^{K} a_j B_j(x_i) \Big\}^2 + \lambda \sum_{j=m+1}^{K} \left( \Delta^m a_j \right)^2. \]

Eilers and Marx (1996) show that the difference penalty is a good discrete approximation to the integrated square of the kth derivative, and that with this penalty moments of the data are conserved and polynomial regression models occur as limits for large values of λ. Figure 6.6 shows a comparison of smooth.spline and P-spline estimates on a simulated example.

[Figure 6.6: smooth spline and P-spline. Plot title: sin(2 pi x/10) + .5 + N(0, .71). Legend: True, smooth.spline, P-spline. Vertical axis: y.]

6.3 Adaptive Regression via H-Splines Method

In smoothing techniques, the number of basis functions is chosen to be as large as the number of observations, and then the smoothing parameter is chosen to control the flexibility of the fitting (Bates and Wahba, 1982). The H-splines method for nonparametric regression (Luo and Wahba (1997), Dias (1998) and Dias (1999)) combines some features of regression splines and of traditional smoothing splines to obtain a hybrid smoothing procedure, which is usually implemented with large data sets and displays a desirable form of spatial adaptability when the underlying function is spatially inhomogeneous in its degree of complexity. Basically, choosing the number of basis functions, for instance by the GCV criterion, will do most of the work of balancing bias and variance. But there is a more important reason why we want to do a penalized regression, namely numerical stability. It is well known that as the number of basis functions increases, the regression problem becomes more ill-conditioned, which makes the numerical computation less stable. The basis functions used are, in general, the cubic spline basis functions, which have larger correlations among them than the linear spline basis, hence the ill-conditioning problem is more serious. The penalized regression step acts as a remedy for this.

Similarly to smoothing splines, take the penalty term J(g) as $\int (g'')^2$, the point evaluation functionals $L_i g = g(t_i)$, $y = (y_1, \ldots, y_n)^T$, $g = (g(t_1), \ldots, g(t_n))^T$, and assume that $g \approx g_{K,\theta} = \sum_{i=1}^{K} \theta_i B_i = X_K \theta$, so that $g_{K,\theta} \in H_K$, where $H_K$ denotes the space of natural cubic splines (NCS) spanned by the basis functions $\{B_i\}_{i=1}^{K}$ and $X_K$ is an $n \times K$ matrix with entries $(X_K)_{ji} = B_i(t_j)$, for $i = 1, \ldots, K$ and $j = 1, \ldots, n$. Then the numerical problem is to find a vector $\theta = (\theta_1, \ldots, \theta_K)^T$ that minimizes the equation (5.3),

\[ L_\lambda(\theta) = \|y - X_K \theta\|_2^2 + \lambda\, \theta^T \Omega\, \theta, \tag{6.2} \]

where now Ω is a $K \times K$ matrix with entries $\Omega_{ij} = \int B_i''(t) B_j''(t)\,dt$. Standard calculations (de Boor, 1978) provide θ as the solution of the linear system

\[ (X_K^T X_K + \lambda \Omega)\, \theta_\lambda = X_K^T y. \]

Note that the linear system now involves $K \times K$ matrices instead of the $n \times n$ matrices that arise in the case of smoothing splines.
Both K and λ control the trade-off between smoothness and fidelity to the data. By construction, H-splines is more adaptive than the regular smoothing splines method. Simulations (see Dias (1999)) show that the H-splines method has better performance even for small data sets (50 observations) and relatively large variance in the measurement errors.

6.4 A Bayesian Approach to H-splines

We have seen that there are several methods to estimate non-parametrically an unknown regression curve g by using splines since the pioneering work of Craven and Wahba (1979). Kimeldorf and Wahba (1970) and Wahba (1983) gave an attractive Bayesian interpretation for an estimate $\hat g$ of the unknown curve g. They showed that $\hat g$ can be viewed as a Bayes estimate of g with respect to a certain prior on the class of all smooth functions. The Bayesian approach allows one not only to estimate the unknown function, but also to provide error bounds by constructing the corresponding Bayesian confidence intervals (Wahba, 1983). In this section we allow two smoothing parameters, as Luo and Wahba (1997) and Dias (1999) did. However, instead of going through the difficulties of specifying them precisely in an ad hoc manner, they are allowed to vary according to prior information. In this way, the procedure becomes more capable of providing an adequate fit.

[Figure 6.7: H-spline fitting for airmiles data. Plot title: airmiles data. Legend: airmiles data, H-splines. Axis label: Passenger-miles flown by U.S. commercial airlines.]

Suppose we have the following regression model,

\[ y_i = g(t_i) + \varepsilon_i, \quad i = 1, \ldots, n, \]

where the $\varepsilon_i$ are uncorrelated with a $N(0, \sigma^2)$ distribution. Moreover, assume that the parametric form of the regression curve g is unknown. Then the likelihood of g given the observations y is

\[ l_y(g) \propto (\sigma^2)^{-n/2} \exp\Big\{ -\frac{1}{2\sigma^2} \|y - g\|^2 \Big\}. \tag{6.3} \]

The Bayesian justification of penalized maximum likelihood is to place a prior density proportional to $\exp\{ -\frac{\lambda}{2} \int (g'')^2 \}$ over the space of all smooth functions (see details in Silverman and Green (1994) and Kimeldorf and Wahba (1970)). However, the infinite dimensional case has a paradox alluded to by Wahba (1983). Silverman (1985) proposed a finite dimensional Bayesian formulation to avoid the paradoxes and difficulties involved in the infinite dimensional case. For this, let $g_{K,\theta} = \sum_{i=1}^{K} \theta_i B_i = X_K \theta_K$, with a knot sequence placed at order statistics. A complete Bayesian approach would assign prior distributions to the coefficients of the expansion, to the knot positions, to the number of knots and to $\sigma^2$. A Bayesian approach to hybrid splines non-parametric regression assigns priors for $g_K$, K, λ and $\sigma^2$. Given a realization of K, the interior knots are placed at order statistics. This well known procedure in non-parametric regression reduces the computational cost substantially and avoids trying to solve the difficult problem of optimizing the knot positions. Any other procedure has to take into account the fact that changes in the knot positions might cause considerable change in the function g (see details in Wahba (1982) for ill-posed problems in splines non-parametric regression). Moreover, in theory the number of basis functions (which is a linear function of the number of knots) can be as large as the sample size. But then one has to solve a system of n equations instead of K. In an attempt to keep the computational cost down, one might want to have K as small as possible, and hence over-smoothing may occur. For any K large enough, λ keeps the balance between over-smoothing and under-smoothing. Thus the penalized likelihood becomes, with $g = g_K$,

\[ l_p \propto (\sigma^2)^{-n/2} \exp\Big\{ -\frac{1}{2\sigma^2} \|y - g_K\|^2 \Big\} \times \exp\Big\{ -\frac{\lambda}{2} \int (g_K'')^2 \Big\}, \tag{6.4} \]

where $g_K = g_{K,\theta} = X_K \theta_K$. The reason why we suppress the subindex θ in $g_{K,\theta}$ will be explained later in this section. Note that maximization of the penalized likelihood is equivalent to minimization of (5.3). For this proposed Bayesian set up we have a prior for $(g_K, K, \lambda, \sigma^2)$ of the form

\[ p(g_K, K, \lambda, \sigma^2) = p(g_K \mid K, \lambda)\, p(K, \lambda)\, p(\sigma^2) = p(g_K \mid K, \lambda)\, p(\lambda \mid K)\, p(K)\, p(\sigma^2), \tag{6.5} \]

where

\[ p(g_K \mid K, \lambda) \propto \exp\Big\{ -\frac{\lambda}{2} \int (g_K'')^2 \Big\}, \]

for $u > 0$, $v > 0$,

\[ p(\sigma^2) \propto (\sigma^2)^{-(u+1)} \exp\{ -v \sigma^{-2} \}, \]

\[ p(K) = \frac{\exp\{-a\}\, a^K / K!}{1 - \exp\{-a\}\,(1 + q)}, \quad K = 1, \ldots, K^*, \]

where $q = \sum_{j=K^*+1}^{\infty} a^j / j!$, and

\[ p(\lambda \mid K) = \psi(K) \exp\{ -\psi(K) \lambda \}, \]

with ψ any smooth function of K. It is well known that, for λ > 0, when the number of basis functions increases the smoothing parameter decreases to a point and then increases with K. That is, for large values of K the smoothing parameter λ becomes larger to enforce smoothness (see details in Dias (1999)). Therefore, functions ψ that satisfy these requirements are recommended. In particular, a flexible class is given by $\psi(K) = K^b \exp(-cK)$ for suitably chosen hyperparameters b and c. However, undesirably large values of K can be excluded by fixing $K^*$ appropriately, or be made unlikely by fixing a accordingly. These large values of K can be controlled by the hyperparameters a and $K^*$ of the prior p(K). The choice of a is governed by the prior expectation of the structure of the underlying curve, such as maxima, minima, inflection points etc. We suggest that the reader follow some of the rules recommended by Wegman and Wright (1983).
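The priors p(K) and p(λ | K) given above are simple to evaluate. The sketch below computes the truncated Poisson prior for K exactly as written, with the tail sum q obtained from $e^a$ minus the first $K^* + 1$ terms of its series, and draws λ from the exponential prior with rate ψ(K). The values a = 17 and ψ(K) = K³ (b = 3, c = 0) are the ones quoted in the caption of Figure 6.8; $K^* = 40$ and the function names are assumptions for illustration.

```python
import math
import numpy as np

def prior_K(a, K_star):
    """p(K) for K = 1, ..., K_star: a Poisson(a) distribution truncated to that range."""
    poisson = np.array([math.exp(-a) * a**k / math.factorial(k)
                        for k in range(1, K_star + 1)])
    # q = sum_{j > K_star} a^j / j!  =  e^a - sum_{j = 0}^{K_star} a^j / j!
    q = math.exp(a) - sum(a**j / math.factorial(j) for j in range(K_star + 1))
    return poisson / (1.0 - math.exp(-a) * (1.0 + q))

def psi(K, b=3.0, c=0.0):
    """psi(K) = K^b exp(-c K); b = 3 and c = 0 give psi(K) = K^3."""
    return K**b * math.exp(-c * K)

def sample_lambda(K, rng, b=3.0, c=0.0):
    """Draw lambda from p(lambda | K) = psi(K) exp(-psi(K) lambda)."""
    return rng.exponential(scale=1.0 / psi(K, b, c))

p_K = prior_K(17.0, 40)
rng = np.random.default_rng(4)
lam_draw = sample_lambda(10, rng)
```

The normalizing constant $1 - e^{-a}(1 + q)$ is the Poisson probability of the range 1 to $K^*$, so the values returned by `prior_K` sum to one.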
These recommendations are based on the assumption of fitting a cubic spline, the most popular case, and are summarized below.

1. Extrema should be centered in intervals and inflection points should be located near knot points.
2. No more than one extremum and one inflection point should fall between knots (because a cubic could not fit more).
3. Knot points should be located at data points.

Note that $g_K = X_K \theta_K$ is completely determined if we know the coefficients $\theta_K$. Hence the overall parameter space Ξ can be written as a countable union of subspaces $\Xi_K = \{ \xi : \xi = (\theta_K, \phi) \in (\mathbb{R}^K \times \{1, \ldots, K^*\} \times [0, \infty)^2) \}$ with $\phi = (K, \lambda, \sigma^2)$. Thus the posterior is given by

\[ \pi(\xi \mid y) \propto l_p(\xi \mid y)\, p(\xi). \tag{6.6} \]

In order to sample from the posterior $\pi(\xi \mid y)$ we have to consider the variation of dimensionality of this problem. Hence one has to design move types between subspaces $\Xi_K$. However, assigning a prior to $g_K$, equivalently to the coefficients $\theta_K$, leads to a serious computational difficulty pointed out by Denison, Mallick and Smith (1998), where a comparative study was developed. They suggest that using the least squares estimates for the vector $\theta_K$ leads to a non-significant deterioration in performance for overall curve estimation. Similarly, given a realization of $(K, \lambda, \sigma^2)$, we solve the penalized least squares objective function (6.2) to obtain the estimates $\hat\theta_K = \hat\theta_K(y)$ for the vector $\theta_K$, and consequently we have an estimate $\hat g_K = X_K \hat\theta_K$. Thus there is no prior assigned for this vector of parameters, and so we write $g_K = g_{K,\theta}$. Having got $\hat\theta_K$, we approximate the marginal posterior $\pi(\phi \mid y)$ by the conditional posterior

\[ \pi(\phi \mid y, \hat\theta_K) \propto l_p(\phi \mid y, \hat\theta_K)\, p(\phi), \tag{6.7} \]

with

\[ p(\phi) = p(K, \lambda, \sigma^2) = p(\lambda \mid K)\, p(K)\, p(\sigma^2). \]

Note that if one assigns independent normal distributions to the parameters $\theta_K$ it will not be difficult to obtain the marginal posterior $\pi(\phi \mid y)$, and approximations will not be necessary. However, the results will be very similar. To solve the problem of sampling from the posterior $\pi(\phi \mid y, \hat\theta_K)$, Dias and Gamerman (2002) used reversible jump methodology (Green (1995)). This technique is beyond the level of this book and will not be explained here, but the interested reader will find the details of the algorithm in Dias and Gamerman (2002).
In Figure 6.8 we present a simulated example to verify the estimates provided by this approach. The final estimate is

\[ \hat y_{+}(t_i) = \frac{1}{100} \sum_{j=1}^{100} \hat y_j(t_i). \]

Figure 6.9 exhibits approximate Bayesian confidence intervals for the true regression curve, computed as follows. Let $y(t_i)$ and $\hat y(t_i)$ be a particular model and its estimate provided by this proposed method, with $i = 1, \ldots, n$, where n is the sample size. For each $i = 1, \ldots, n$, the fitted values $(\hat y_1(t_i), \hat y_2(t_i), \ldots, \hat y_{100}(t_i))^T$ form a random sample, and from those vectors the lower and upper quantiles were computed in order to obtain the confidence intervals. Figure 6.8 exhibits an example of how useful a Bayesian approach to hybrid splines non-parametric regression can be. It describes a situation where prior information tells us that the underlying curve has large curvature and the variance of the measurement errors is not too small, so that traditional smoothing methods, e.g. smoothing splines, might not be able to capture all the structure of the true regression curve. By using vague but proper priors for the smoothing parameters K and λ and for the variance $\sigma^2$, this Bayesian approach provides a much better fit than the traditional smoothing splines approach does.

[Figure 6.8: Estimation results: a) Bayesian estimate with a = 17 and $\psi(K) = K^3$ (dotted line); b) (SS) smoothing splines estimate (dashed line). The true regression function is also plotted (solid line). The SS estimate was computed using the R function smooth.spline, from which 4 degrees of freedom were obtained, and λ was computed by GCV. Plot title: 50 observations from y = exp(-t^2/2) cos(4 pi t) + N(0, .36). Legend: True, SS estimate, Bayesian estimate. Vertical axis: y.]

Figure 6.9 shows one hundred curves sampled from the posterior (after burn-in) and an approximate 95% Bayesian confidence interval for the regression curve $g(t) = \exp(-t^2/2)\cos(4\pi t)$ with $t \in [0, \pi]$. On the right panel of this figure we see the curve estimate, which is an approximation of the posterior mean, and the 2.5% and 97.5% percentile curves.

[Figure 6.9: One hundred estimates of the curve of Figure 6.8 and a Bayesian confidence interval for the regression curve $g(t) = \exp(-t^2/2)\cos(4\pi t)$ with $t \in [0, \pi]$. Panel titles: The last 100 curves sampled; 95% "Confidence" Interval. Legend: Average, .025 quantile, .975 quantile.]
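The pointwise summaries shown in Figure 6.9 (average curve, 2.5% and 97.5% quantile curves) are simple to compute once the sampled curves are stored row by row in a matrix. The sketch below does not run the reversible jump sampler: as a stand-in, the one hundred "sampled" curves are the true curve plus Gaussian noise, and the array names, the grid of 50 points and the noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, np.pi, 50)
truth = np.exp(-t**2 / 2.0) * np.cos(4.0 * np.pi * t)

# Stand-in for the last 100 curves sampled from the posterior (one curve per row).
curves = truth + rng.normal(0.0, 0.1, size=(100, t.size))

# Pointwise posterior mean approximation and 95% band.
estimate = curves.mean(axis=0)
lower, upper = np.quantile(curves, [0.025, 0.975], axis=0)
```

Each band limit is computed separately at every $t_i$, so the result is a pointwise interval and not a simultaneous band for the whole curve.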


Chapter 7

Final Comments

Compared to parametric techniques, nonparametric modeling has more flexibility, since it allows one to choose from an infinite dimensional class of functions to which the underlying regression curve is assumed to belong. In general, this type of choice depends on the unknown smoothness of the true curve. But in most cases one can assume mild restrictions, such as that the regression curve has an absolutely continuous first derivative and a square integrable second derivative. Nevertheless, nonparametric estimators are less efficient than parametric ones when the parametric model is valid. For many parametric estimators the mean square error goes to zero at rate $n^{-1}$, while nonparametric estimators have rate $n^{-\alpha}$, $\alpha \in [0, 1]$, where α depends on the smoothness of the underlying curve. When the postulated parametric model is not valid, many parametric estimators cannot have, ad hoc, rate $n^{-1}$. In fact, those estimators will not converge to the true curve. One of the advantages of adaptive basis function procedures, e.g., H-splines methods, is the ability to vary the amount of smoothing in response to the inhomogeneous curvature of the true function at different locations. Those methods have been very successful in capturing the structure of the unknown function. In general, nonparametric estimators are good candidates when one does not know the form of the underlying curve.


Bibliography

Bates, D. and Wahba, G. (1982). Computational Methods for Generalized Cross-Validation with large data sets, Academic Press, London.

Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, John Wiley and Sons (New York, Chichester).

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots, J. Amer. Statist. Assoc. 74(368):

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions, Numerische Mathematik 31:

de Boor, C. (1978). A Practical Guide to Splines, Springer Verlag, New York.

Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998). Automatic Bayesian curve fitting, Journal of the Royal Statistical Society B 60:

Dias, R. (1994). Density estimation via H-splines, University of Wisconsin-Madison. Ph.D. dissertation.

Dias, R. (1996). Sequential adaptive nonparametric regression via H-splines. Technical Report RP 43/96, University of Campinas, June. Submitted.

Dias, R. (1998). Density estimation via hybrid splines, Journal of Statistical Computation and Simulation 60:

Dias, R. (1999). Sequential adaptive non parametric regression via H-splines, Communications in Statistics: Computations and Simulations 28:

Dias, R. and Gamerman, D. (2002). A Bayesian approach to hybrid splines nonparametric regression, Journal of Statistical Computation and Simulation 72(4):

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties, Statist. Sci. 11(2): With comments and a rejoinder by the authors.

Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21(2):

Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities, Biometrika 58:

Green, P. J. (1995). Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination, Biometrika 82:

Gu, C. (1993). Smoothing spline density estimation: A dimensionless automatic algorithm, J. of the Amer. Stat'l. Assn. 88:

Gu, C. and Qiu, C. (1993). Smoothing spline density estimation: Theory, Ann. of Statistics 21:

Härdle, W. (1990). Smoothing Techniques With Implementation in S, Springer-Verlag (Berlin, New York).

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models, Chapman and Hall.

Kimeldorf, G. S. and Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, The Annals of Mathematical Statistics 41:

Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation, Computational Statistics and Data Analysis 12:

Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines, Journal of the American Statistical Association 92:

Nadaraya, E. A. (1964). On estimating regression, Theory of Probability and its Applications 10:

O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators, SIAM J. on Scientific and Stat'l. Computing 9:

Pagan, A. and Ullah, A. (1999). Nonparametric Econometrics, Cambridge University Press, Cambridge.

Parzen, E. (1962). On estimation of a probability density function and mode, Ann. of Mathematical Stat. 33:

Prakasa-Rao, B. L. S. (1983). Nonparametric Functional Estimation, Academic Press (Duluth, London).

Schumaker, L. L. (1972). Spline Functions and Approximation Theory, Birkhauser.

Schumaker, L. L. (1981). Spline Functions: Basic Theory, WileyISci:NJ.

Scott, D. W. (1992). Multivariate Density Estimation. Theory, Practice, and Visualization, John Wiley and Sons (New York, Chichester).

Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method, Ann. of Statistics 10:

Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel method, Ann. of Statistics 12:

Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting, Journal of the Royal Statistical Society, Series B, Methodological 47:

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman and Hall (London).

Silverman, B. W. and Green, P. J. (1994). Nonparametric Regression and Generalized Linear Models, Chapman and Hall (London).

Stone, C. J. (1990). Large-sample inference for log-spline models, Ann. of Statistics 18:

Stone, C. J. and Koo, C.-Y. (1985). Logspline density estimation, Contemporary Mathematics pp

Thompson, J. R. and Tapia, R. A. (1990). Nonparametric Function Estimation, Modeling and Simulation, SIAM:PA.

Wahba, G. (1982). Constrained regularization for ill posed linear operator equations, with applications in meteorology and medicine, in S. S. Gupta and J. O. Berger (eds), Statistical Decision Theory and Related Topics III, in two volumes, Vol. 2, Academic:NY:Lond, pp

Wahba, G. (1983). Bayesian confidence intervals for the cross-validated smoothing spline, JRSS-B, Methodological 45:

Wahba, G. (1990). Spline Models for Observational Data, SIAM:PA.

Watson, G. S. (1964). Smooth regression analysis, Sankhya A 26:

Wegman, E. J. and Wright, I. W. (1983). Splines in statistics, Journal of the American Statistical Association 78: