A Survey on Nonparametric Time Series Analysis

by Siegfried Heiler

Contents
1 Introduction
2 Nonparametric regression
3 Kernel estimation in time series
4 Problems of simple kernel estimation and restricted approaches
5 Locally weighted regression
6 Application of locally weighted regression to time series
7 Parameter selection
8 Time series decomposition with locally weighted regression
References
1 Introduction

In this survey we discuss the application of some nonparametric techniques to time series. There is indeed a long tradition of applying nonparametric methods in time series analysis, and this holds true not only for certain test situations, such as runs tests for randomness of a stochastic sequence, permutation tests or certain rank tests. An old and established technique in time series analysis is periodogram analysis. Although the periodogram is an asymptotically unbiased estimate of the spectral density of an underlying stationary process, it is well known that it is not consistent. Therefore, already in the early fifties, smoothing the periodogram directly with a so-called spectral window, or using a system of weights according to a lag window with which the empirical autocovariances are multiplied in the calculation of the Fourier transform, was introduced. Quite a number of different windows were proposed, and with respect to the window width similar rules hold for achieving consistent estimates as the ones we will shortly discuss in the context of nonparametric regression later in this text. Nonparametric spectral estimation is extensively treated in many textbooks on time series analysis, to which the interested reader is referred. Hence it will not be treated further in this survey.

Another area where nonparametric ideas have been applied for a long time is the smoothing and decomposition of seasonal time series. Local polynomial regression can be traced back to 1931 (R.R. Macaulay). A. Fisher (1937) and H.L. Jones (1943) discussed a local least squares fit under the side condition that a locally constant periodic function (for modelling seasonal fluctuations) be annihilated, and already in 1960 J. Bongard developed a unified principle for treating the interior and the boundary part (with and without seasonal variations) of a time series, derived from a local regression approach. These ideas will be taken up again later in Section 8, since they represent an attractive alternative to smoothing and seasonal decomposition procedures based on linear time series models.

The aim of this survey is to present some basic concepts of nonparametric regression, including locally weighted regression, with special emphasis on their application to time series. Nonparametric regression has become an area with an abundance of new methodological proposals and developments in recent years. It is not the intention of this paper to give a comprehensive overview of the subject. We rather want to concentrate on the basic ideas only. The reader interested in different aspects may be referred to the survey paper by Härdle, Lütkepohl and Chen (1997), where more specific areas, proposals and further references can be found.

The ARMA model is a typical linear time series model. Threshold autoregression (TAR) models and their variants are specific types of nonlinear models. ARCH and GARCH type models are also of a very specific nonlinear type, designed to capture volatility phenomena. In contrast to that, in nonparametric regression no assumption is made about the form of the regression function. Only some smoothness conditions are required. The complexity of the model is determined completely by the data. One lets the data speak for themselves.
Thereby one avoids subjectivity in selecting a specific parametric model. But the gain in flexibility has a price. One has to choose bandwidths. We come back to this later. Besides this, a higher complexity in the mathematical argumentation is involved. However, asymptotic considerations will not be discussed in detail in this survey. Because of their flexibility, nonparametric regression techniques may serve as a first step in the process of finding an adequate parametric model. If no such model can be found which describes the underlying structure adequately, then the results of nonparametric estimation may be used directly for forecasting or for describing the characteristics of the time series.

2 Nonparametric regression

Since forecasting is an important objective of many time series analyses, estimating the conditional distribution, or some of its characteristics, plays a considerable role. For point prediction the conditional mean or median is of particular interest. In order to obtain confidence or prediction intervals, estimates of conditional variances or conditional quantiles are also needed. The latter are also of interest in studying volatility in financial time series. The first step is therefore to look at nonparametric estimation of densities and conditional densities.

Let $x \in \mathbb{R}$ be a random variable whose distribution has a density $f$ and let $x_1, \ldots, x_n$ be a random sample from $x$. Then a kernel density estimator for $f$ is given by

\[ f_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\left(\frac{x_i - x}{h_n}\right). \qquad (2.1) \]

Here $K$ is a so-called kernel function, i.e. a symmetric density assigning weights to the observations $x_i$ which decrease with the distance between $x$ and $x_i$. Some popular kernel functions are listed in Table 2.1 and exhibited in Figure 2.1. The first five have the interval $[-1, 1]$ as support, whereas the Gaussian kernel has infinite support. $h_n$ is the bandwidth, which drives the size of the local neighbourhood included in the estimation of $f$ at $x$. The bandwidth depends on the sample size $n$ and has to fulfil $h_n \to 0$ and $n h_n \to \infty$ for $n \to \infty$ as necessary conditions for consistency. But for practical applications this asymptotic condition is not very helpful. A very small bandwidth will lead to a wiggly course of the estimated density, whereas a large bandwidth yields a smooth course but will possibly flatten out interesting details. Bandwidth selection will be dealt with in Section 7.
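To make the estimator (2.1) concrete, the following minimal Python sketch implements it with the Epanechnikov kernel from Table 2.1; the function names and the choice $h = 0.5$ in the usage lines are illustrative, not part of the original text.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde(x_eval, x_obs, h):
    """Kernel density estimator f_n of (2.1) at the points in x_eval."""
    u = (x_obs[None, :] - x_eval[:, None]) / h   # arguments (x_i - x) / h_n
    return epanechnikov(u).sum(axis=1) / (len(x_obs) * h)

# Usage: estimate a standard normal density from 500 draws.
rng = np.random.default_rng(0)
sample = rng.standard_normal(500)
grid = np.linspace(-3.0, 3.0, 61)
f_hat = kde(grid, sample, h=0.5)
```

Varying `h` in the last line reproduces the trade-off just described: a small bandwidth gives a wiggly estimate, a large one a smooth but flattened estimate.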
A $k_n$-nearest neighbour ($k_n$-NN) estimator of $f$ is obtained by substituting the fixed bandwidth $h_n$ in (2.1) by the random variable $H_{n,k_n}(x)$, measuring the distance between $x$ and the $k_n$-nearest observation among the $x_i$, $i = 1, \ldots, n$. Nearest neighbour estimators have the property that the number of observations used for the local approach is fixed. This is an advantage if the $x$-space shows a greatly unbalanced design. On the other hand, the bias varies from point to point due to the variable local bandwidth.

Table 2.1: Selected kernel functions

  Uniform:       $\frac{1}{2}\, 1_{[-1,1]}(u)$
  Triangle:      $(1 - |u|)\, 1_{[-1,1]}(u)$
  Epanechnikov:  $\frac{3}{4}(1 - u^2)\, 1_{[-1,1]}(u)$
  Bisquare:      $\frac{15}{16}(1 - 2u^2 + u^4)\, 1_{[-1,1]}(u)$
  Triweight:     $\frac{35}{32}(1 - 3u^2 + 3u^4 - u^6)\, 1_{[-1,1]}(u)$
  Gaussian:      $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} u^2\right)$

For $x \in \mathbb{R}^p$ a kernel $K: \mathbb{R}^p \to \mathbb{R}$ is needed in (2.1). In this case either product kernels

\[ K(u) = \prod_{j=1}^{p} K_j(u_j) \]

with univariate kernels $K_j: \mathbb{R} \to \mathbb{R}$ and bandwidth $h_j$ in coordinate $j$, $j = 1, \ldots, p$, or norm kernels $K(u) = K^*(\|u\|)$ with a suitable norm on $\mathbb{R}^p$ are used. In connection with time series applications, product kernels are frequently applied,

\[ f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{p} \frac{1}{h_j} K_j\left(\frac{x_{ij} - x_j}{h_j}\right), \qquad (2.2) \]

and $h_j = h\, \hat{\sigma}_j$ with an estimated standard deviation $\hat{\sigma}_j$ in the $j$-th coordinate is a popular choice for the bandwidths.
[Figure 2.1: Some popular kernel functions in practice]

Let now $(y, x)$ with $y \in \mathbb{R}$, $x \in \mathbb{R}^p$ be a random vector with joint density $f(y, x)$ and let $f_X(x)$ be the marginal density of $x$. Then the conditional density $g(y|x) = f(y, x)/f_X(x)$ can be estimated by inserting a kernel density estimator or a corresponding nearest neighbourhood estimator in the numerator and denominator of $g(y|x)$. With the choice of a kernel function $K^*: \mathbb{R}^{p+1} \to \mathbb{R}$, $K^*(y, x) = K_1(y) K(x)$, and bandwidths $h_1$ resp. $h$, we obtain the kernel estimator for the conditional density

\[ g_n(y|x) = \frac{h_1^{-1} \sum_{i=1}^{n} K_1\left(\frac{y_i - y}{h_1}\right) K\left(\frac{x_i - x}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}. \qquad (2.3) \]

An estimator for the conditional mean $m(x) = \int_{-\infty}^{\infty} y\, g(y|x)\, dy$ is obtained when we replace $g$ in the integral by its estimator $g_n$.
For $K_1$ being a symmetric density this immediately yields

\[ m_n(x) = \frac{\sum_{i=1}^{n} y_i K\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}. \qquad (2.4) \]

This is the well-known Nadaraya-Watson nonparametric regression estimator (NW estimator; Nadaraya, 1964; Watson, 1964). We see that it can be written as a weighted mean

\[ m_n(x) = \sum_{i=1}^{n} y_i\, w_{n,i}(x; x_1, \ldots, x_n), \qquad (2.5) \]

where the random weights depend on the point $x$ and the random variables $x_1, \ldots, x_n$.

Apart from conditional means, conditional quantiles are also of interest in various time series applications. Let

\[ F(y|x) = \int_{-\infty}^{y} g(t|x)\, dt \qquad (2.6) \]

denote the conditional distribution function of $y$ given $x$. Then the conditional $\alpha$-quantile at $x$, $q_\alpha(x)$, is defined as

\[ q_\alpha(x) = \inf\{y \in \mathbb{R} \mid F(y|x) \geq \alpha\}, \quad 0 < \alpha < 1. \qquad (2.7) \]

If $g(\cdot|x)$ is strictly positive, then of course $q_\alpha(x)$ is the unique solution of $F(y|x) = \alpha$, i.e. $q_\alpha(x) = F^{-1}(\alpha|x)$. One possible procedure for estimating $q_\alpha$ is to take the empirical $\alpha$-quantile of an estimator $F_n(\cdot|x)$ according to (2.7). Let $F_1(z) = \int_{-\infty}^{z} K_1(u)\, du$ be the distribution function pertaining to the kernel $K_1$. Then the estimated conditional distribution, obtained by integrating $g_n(\cdot|x)$ from $-\infty$ to $y$, is given by

\[ F_n(y|x) = \frac{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right) F_1\left(\frac{y - y_i}{h_1}\right)}{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}. \qquad (2.8) \]
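As a sketch of how (2.4) and (2.5) translate into code, the following fragment computes the NW estimate as a weighted mean; it assumes a kernel function such as the `epanechnikov` helper from the sketch above, and all names are again illustrative.

```python
import numpy as np

def nadaraya_watson(x_eval, x_obs, y_obs, h, kernel):
    """NW estimator (2.4): a kernel-weighted mean of the y_i, cf. (2.5)."""
    w = kernel((x_obs[None, :] - x_eval[:, None]) / h)
    w = w / w.sum(axis=1, keepdims=True)   # random weights w_{n,i}(x)
    return w @ y_obs
```

For compactly supported kernels the denominator can vanish far away from the data, so in practice one restricts the evaluation points to the range of the observations.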
Let us assume that $K_1$ has support $[-1, 1]$. Then we have

\[ F_1\left(\frac{y - y_i}{h_1}\right) = \begin{cases} 1 & \text{for } y_i \leq y - h_1 \\ 0 & \text{for } y_i \geq y + h_1, \end{cases} \]

so that in this case

\[ F_n(y|x) = \frac{\sum_{i=1}^{n} 1_{(-\infty,\, y - h_1]}(y_i)\, K\left(\frac{x_i - x}{h}\right) + \sum_{i=1}^{n} 1_{(y - h_1,\, y + h_1)}(y_i)\, F_1\left(\frac{y - y_i}{h_1}\right) K\left(\frac{x_i - x}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}. \qquad (2.9) \]

One can see that the estimation contains only observations in the regressor space lying in a band around $x$. The first sum on the right hand side includes observations whose $y$-values are less than or equal to $y - h_1$. The second sum contains observations with $y_i$-values in a neighbourhood of $y$. In contrast to a usual empirical distribution function, here also observations greater than $y$ obtain a positive weight.

Of particular interest may be the median regression function $q_{1/2}$ for asymmetric distributions, as an alternative to ordinary regression based on the mean. Another interesting application may be the estimation of $q_{\alpha/2}$ and $q_{1-\alpha/2}$ in order to get predictive intervals. These can be compared with intervals obtained from parametric models, which lack the possibility to evaluate the bias due to mis-specification of the model.

Taking some boundary corrections into account, for a not too unbalanced design the second sum in (2.9) can be approximated by $\sum_{i=1}^{n} 1_{(y - h_1,\, y]}(y_i)\, K\left(\frac{x_i - x}{h}\right)$, so that the conditional distribution function is estimated by

\[ \tilde{F}_n(y|x) = \frac{\sum_{i=1}^{n} 1_{(-\infty,\, y]}(y_i)\, K\left(\frac{x_i - x}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)}. \qquad (2.10) \]

This estimator was considered for $x \in \mathbb{R}$ by Horvath and Yandell (1988), who proved asymptotic results for the i.i.d. case. Abberger (1996) derives from (2.10) the empirical quantile function

\[ q_{n,\alpha}(x) = \inf\{y \in \mathbb{R} \mid \tilde{F}_n(y|x) \geq \alpha\}, \quad 0 < \alpha < 1, \qquad (2.11) \]

and investigates the behaviour of $\tilde{F}_n$ and $q_{n,\alpha}$ in applications to stationary time series.
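A direct translation of (2.10) and (2.11) might look as follows. This is a sketch assuming a scalar regressor; the linear search over the sorted $y_i$ simply exploits that $\tilde{F}_n(\cdot|x)$ jumps only at observed $y$-values.

```python
import numpy as np

def cond_dist(y, x, x_obs, y_obs, h, kernel):
    """F~_n(y|x) of (2.10): a kernel-weighted empirical distribution."""
    w = kernel((x_obs - x) / h)
    return np.sum((y_obs <= y) * w) / np.sum(w)

def cond_quantile(alpha, x, x_obs, y_obs, h, kernel):
    """q_{n,alpha}(x) of (2.11): smallest y with F~_n(y|x) >= alpha."""
    for y in np.sort(y_obs):
        if cond_dist(y, x, x_obs, y_obs, h, kernel) >= alpha:
            return y
    return np.max(y_obs)
```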
3 Kernel estimation in time series

When a kernel or NN estimator is applied to dependent data, as is the case in time series, then it is affected only by the dependence among the observations in a small window and not by that between all data. This fact reduces the dependence between the estimates, so that many of the techniques developed for independent data can be applied in these cases as well. This was called the "whitening by windowing" principle by Hart (1996).

A typical situation for an application to a time series $\{z_t\}$ is that the regressor vector $x$ consists of past time series values,

\[ x_t = (z_{t-1}, \ldots, z_{t-p})', \qquad (3.1) \]

which leads to the very general nonparametric autoregression model

\[ z_t = m(z_{t-1}, \ldots, z_{t-p}) + a_t, \quad t = p+1, p+2, \ldots, \qquad (3.2) \]

with $\{a_t\}$ a white noise sequence. Of course $x_t$ might also include time series values of other predictive variables like leading indicators.

An indispensable requirement for proving asymptotic properties of kernel estimates in this and related situations is that the underlying processes are stationary. Another condition is that the memory of these underlying processes decreases with the distance between events and that the rate of decay can be bounded from above by so-called mixing conditions. So-called strong mixing conditions are used by Robinson (1983, 1986). Collomb (1984, 1985) worked with so-called $\phi$- or uniform mixing conditions. We will not present these fairly complicated asymptotic considerations here. But we would like to remark that these mixing conditions are hard to check in practice.

In contrast to linear autoregressive models of the form $z_t = \phi_1 z_{t-1} + \ldots + \phi_p z_{t-p} + a_t$, and in a certain sense also to threshold autoregression, where the autoregressive parameters vary according to some threshold variable, the model (3.2) is more general and flexible, and its estimation may lead to insights which can be helpful in choosing an appropriate parametric (possibly nonlinear) model afterwards.
For $x \in \mathbb{R}^p$, $x_t$ as in (3.1) and weights

\[ w_{n,t} = K\left(\frac{x_t - x}{h}\right) \Big/ \sum_{s=p+1}^{n} K\left(\frac{x_s - x}{h}\right), \]

the Nadaraya-Watson estimator in model (3.2) is given by

\[ m_n(x) = \sum_{t=p+1}^{n} z_t\, w_{n,t}(x). \qquad (3.3) \]

For $x$ equal to the last observed pattern, $x = (z_n, z_{n-1}, \ldots, z_{n-p+1})'$, this provides a one-step ahead predictor for $z_{n+1}$ which allows a very intuitive interpretation. Given the course of the time series observed over the last $p$ instants, the predictor is a weighted mean of all those time series values in the past which followed a course pattern similar to the last observed one. The weights depend on how close the pattern observed in the past comes to the pattern given by $(z_n, \ldots, z_{n-p+1})'$.

A k-step ahead predictor is given if $z_t$ in (3.3) is replaced by $z_{t+k-1}$:

\[ m_{n,k} = \sum_{t=p+1}^{n-k+1} z_{t+k-1}\, w_{n,t}(x), \quad k = 1, 2, \ldots. \qquad (3.4) \]

This predictor does not use the variables $z_{n+1}, \ldots, z_{n+k-1}$, which are unknown but may contain information about the conditional expectation $E(z_{n+k} \mid (z_n, \ldots, z_{n-p+1})')$. They might be replaced by estimates in a multistep procedure which consists of a succession of one-step ahead forecasts. This procedure can lead to a smaller mean squared error than the direct predictor (3.4). For a different proposal see Chen (1996).

Up to now we have only considered the autoregressive case where the regressor vector contains past time series values. The case of vector autoregression, where for each individual (scalar) time series also past values of related time series or leading indicators are included in the regression vector, can be treated in a similar way as nonparametric autoregression, although the number of components in $x$ is restricted due to the "curse of dimensionality", to which we come back later.

If the regressor vector $x_t = (z_{t-1}, \ldots, z_{t-p})'$ is used in estimating conditional distribution functions and conditional quantiles, as e.g. in (2.10) and (2.11), then we arrive at quantile autoregression. The median autoregression $q_{n,1/2}$ may serve as an alternative to the mean autoregression (3.3). In financial data one is often interested in the behaviour of quantiles in the tails. For instance, the value at risk of a certain asset is measured by looking at low quantiles ($\alpha = 0.01$ or $\alpha = 0.05$) of the conditional distribution of the corresponding series of returns.
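The pattern interpretation of (3.3) is easy to mirror in code. The sketch below uses a norm kernel $K^*(\|u\|)$ as mentioned in Section 2; the Euclidean norm, helper names and loop structure are our own illustrative choices.

```python
import numpy as np

def nw_predict(z, p, h, kernel):
    """One-step-ahead NW prediction in model (3.2), cf. (3.3).

    Compares the last observed pattern (z_n, ..., z_{n-p+1}) with all
    patterns x_t = (z_{t-1}, ..., z_{t-p}) seen in the past and returns
    the weighted mean of the values that followed similar patterns.
    """
    n = len(z)
    patterns = np.array([z[t - p:t][::-1] for t in range(p, n)])  # x_t
    targets = z[p:]                                               # z_t
    x_last = z[-1:-p - 1:-1]                  # last observed pattern
    dist = np.linalg.norm(patterns - x_last, axis=1) / h
    w = kernel(dist)
    return np.sum(w * targets) / np.sum(w)
```

With a compactly supported kernel all weights can vanish if `h` is too small, in which case a larger bandwidth is needed.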
Abberger (1996) applied quantile autoregression to time series of daily stock returns. In order to assess such models, the forecast error cannot serve as a criterion, since quantiles are not observable. Abberger proposed the criterion

\[ \pi_\alpha = 1 - \sum_{t=1}^{n} \rho_\alpha\big(z_t - q_\alpha(x_t)\big) \Big/ \sum_{t=1}^{n} \rho_\alpha(z_t - q_\alpha), \qquad (3.5) \]

where

\[ \rho_\alpha(u) = \alpha\, 1_{[0,\infty)}(u)\, u + (\alpha - 1)\, 1_{(-\infty,0)}(u)\, u \qquad (3.6) \]

is the loss function introduced by Koenker and Bassett (1978) in their seminal paper on quantile regression, and $q_\alpha$ is the unconditional $\alpha$-quantile of the corresponding distribution. $\pi_\alpha$ is constructed in analogy to the $R^2$ criterion in ordinary regression. It assumes values between zero and one, where $\pi_\alpha = 0$ if $q_\alpha(x_t) = q_\alpha$ for all $x_t$, and $\pi_\alpha = 1$ if $z_t = q_\alpha(x_t)$ for all $t$ and all $\alpha$, i.e. if the distribution of $\{z|x\}$ is a one-point distribution.

Figure 3.1 and Table 3.1 illustrate the behaviour of $\pi_\alpha$ with a simulated conical data set of 500 observations. The observations are heteroscedastic and have mean zero. The correlation between $x$ and $y$ is $-0.002$. In Table 3.1 empirical $\pi_\alpha$-values for different $\alpha$ are exhibited. They are calculated by replacing in (3.5) $q_\alpha(x_t)$ by its kernel estimator $q_{n,\alpha}(x_t)$ and $q_\alpha$ by the empirical unconditional quantile of the first $t-1$ data values $z_1, \ldots, z_{t-1}$. The latter can be interpreted as a naive forecast of $q_\alpha(x_t)$.

Table 3.1: $\pi_\alpha$-values for the data in Figure 3.1

  $\alpha$:      0.01  0.05  0.10  0.25  0.50  0.75  0.90  0.95  0.99
  $\pi_\alpha$:  0.43  0.36  0.27  0.10  0.01  0.11  0.26  0.34  0.41

The findings of Abberger (1996, 1997) for several German stock returns were $\pi_\alpha$-values close to zero for the median, increasing in a U-shaped form towards the boundary areas around $\alpha = 0.01$ respectively $\alpha = 0.99$.
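For illustration, the loss (3.6) and the criterion (3.5) are only a few lines of code; a minimal sketch, assuming the conditional and unconditional quantile estimates are supplied as arrays:

```python
import numpy as np

def rho(u, alpha):
    """Check function rho_alpha of (3.6)."""
    return np.where(u >= 0, alpha * u, (alpha - 1.0) * u)

def pi_alpha(z, q_cond, q_uncond, alpha):
    """Goodness-of-fit criterion pi_alpha of (3.5).

    z        : observed series values z_t
    q_cond   : estimated conditional quantiles q_alpha(x_t)
    q_uncond : unconditional alpha-quantiles, e.g. of z_1, ..., z_{t-1}
    """
    return 1.0 - rho(z - q_cond, alpha).sum() / rho(z - q_uncond, alpha).sum()
```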
[Figure 3.1: Simulated heteroskedastic data, n = 500]

ARCH and GARCH models represent a very specific kind of parametric modelling for studying the phenomenon of volatility. A flexible alternative to the combination of an ARMA model with ARCH or GARCH residuals is given by the conditional heteroscedastic autoregressive nonlinear (CHARN) model

\[ z_t = m(x_t) + \sigma(x_t)\, \varepsilon_t \qquad (3.7) \]

studied by Härdle and Yang (1996) and Härdle, Tsybakov and Yang (1997). Here $x_t = (z_{t-1}, \ldots, z_{t-p})'$ is again the autoregressive vector (3.1), and $\varepsilon_t$ is a random variable with mean zero and variance one. $\sigma^2(x)$ is called the volatility function. Given an estimator for $m$, e.g. the NW estimator $m_n$ according to (3.3), it was suggested that $\sigma^2(x)$ can be estimated by

\[ \sigma_n^2(x_t) = g_n(x_t) - m_n^2(x_t), \qquad (3.8) \]

where

\[ g_n(x) = \frac{\sum_{t=1}^{n} K\left(\frac{x_t - x}{h}\right) z_t^2}{\sum_{t=1}^{n} K\left(\frac{x_t - x}{h}\right)} = \sum_{t=1}^{n} z_t^2\, w_{n,t}(x). \qquad (3.9) \]

Since the estimator (3.8) is based on a difference, it can happen that from time to time a negative variance estimate results. This can be avoided if the volatility function is estimated on the basis of residuals. See (7.10), the discussion there and Feng and Heiler (1998a).
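A sketch of the plug-in estimator (3.8)-(3.9), assuming a scalar regressor for readability; it makes visible why negative values can occur, since the result is a difference of two smoothed quantities.

```python
import numpy as np

def volatility_nw(x_eval, x_obs, z_obs, h, kernel):
    """sigma^2_n of (3.8) via the NW smoothers m_n and g_n of (3.3)/(3.9)."""
    w = kernel((x_obs[None, :] - x_eval[:, None]) / h)
    w = w / w.sum(axis=1, keepdims=True)
    m_n = w @ z_obs              # conditional mean estimate
    g_n = w @ z_obs**2           # conditional second moment estimate
    return g_n - m_n**2          # may be negative in finite samples
```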
In the context of time series analysis, not only past values of the time series itself or of related series may occur as regressor variables, but also the time index itself, in which case $x_t = t$, or some functions of the time index like polynomials or trigonometric functions. This leads to smoothing approaches. In the case $m(x_t) = m(t)$ the NW estimator at $t$ consists of a weighted mean of the time series values in a neighbourhood $[t-h, t+h]$ of $z_t$ with nonrandom weights. Polynomials and trigonometric functions in $t$ are used in decomposing a seasonal time series into trend-cyclical and seasonal components according to an unobserved components model. This application will be studied in Section 8 after the discussion of locally weighted regression.

In the area of quantile estimation the regressor $x_t = t$ leads to quantile smoothing. This technique was used by Abberger (1996, 1997) in order to compare the results of a nonparametric procedure for stock returns with those of a GARCH model, evaluated with an S-Plus package under the standard assumption of an underlying Gaussian distribution. As an example we take daily discrete DAX returns, defined as $z_t = (\text{price}_t - \text{price}_{t-1})/\text{price}_{t-1}$, exhibited in Figure 3.2. Since the Gaussian distribution is completely determined by mean and variance, conditional quantiles can easily be calculated from the outcomes of the GARCH model estimation. The results are depicted in Figures 3.3 and 3.4 for the lower and upper quartiles and for the 0.1 and 0.9 quantiles, respectively.

Two messages can be learned from the results. The first is that the asymmetric behaviour of volatility, which is revealed by the nonparametric approach, will remain completely hidden by the choice of a wrong parametric model which is offered as the default option by the package. In the presented example, which is not untypical for stock returns, volatility is a phenomenon which has mainly to do with movements in the lower tails of the conditional distributions. The second finding in the figures is that kernel smoothing is very robust towards aberrant and erratic observations in the course of the time series, whereas GARCH models react very sensitively to them.
[Figure 3.2: Time series of daily DAX returns from Jan. 2, 1986 to Aug. 13, 1991]
[Figure 3.3: Estimation of 0.25- and 0.75-quantiles of daily DAX returns]
[Figure 3.4: Estimation of 0.10- and 0.90-quantiles of daily DAX returns]

4 Problems of simple kernel estimation and restricted approaches

The nonparametric approaches we have treated so far suffer from two drawbacks. One is the so-called "curse of dimensionality", the other is increased bias in cases of a highly clustered design density and particularly at the boundaries of the $x$-space.

Curse of dimensionality describes the fact that in higher dimensional regression problems the subspace of $\mathbb{R}^{p+1}$ spanned by the data is rather empty, i.e., there are only few observations in the neighbourhood of a point $x \in \mathbb{R}^p$. In practice this happens to be the case already for $p > 2$. Several proposals have been made to cope with the curse of dimensionality problem. We will describe only two of them very shortly.

The first consists of decomposing $\mathbb{R}^p$ into a class of $J$ disjoint course patterns $A_j$, $j = 1, \ldots, J$, with the aid of a non-hierarchical cluster analysis. These $J$ disjoint sets then serve as the states of a homogeneous Markov chain. In the model

\[ m(x_t) = E[z_t \mid x_t \in A_j] \quad \text{for } x_t \in A_j,\ j = 1, \ldots, J, \]
with $x_t$ being the autoregressive vector (3.1), $m$ is estimated by

\[ m_n(x_t) = N_j^{-1} \sum_{s=1}^{n} z_s\, 1_{A_j}(x_s), \]

where $N_j$ is the number of course patterns of length $p$ from the time series in $A_j$. Here the estimator is an unweighted mean of all values following courses in pattern class $A_j$. Markov chain models of this type were first used by S. Yakowitz (1979b) for analysing time series of water runoff in rivers. Asymptotic properties for this type of model are discussed by Collomb (1980, 1983). Gouriéroux and Monfort (1992) examined a corresponding model for economic time series by incorporating volatility. They called their model

\[ z_t = \sum_{j=1}^{J} \mu_j\, 1_{A_j}(x_t) + \sum_{j=1}^{J} \sigma_j\, 1_{A_j}(x_t)\, \varepsilon_t \]

a qualitative threshold ARCH model.

Another proposal to cope with the curse of dimensionality is given by the so-called generalized additive models, studied by Hastie and Tibshirani (1990), which are defined
as

\[ z_t = m_0 + \sum_{j=1}^{p} m_j(z_{t-i_j}) + a_t. \]

The components $m_j$ are again of a general form. For estimation, so-called backfitting algorithms such as the alternating conditional expectation algorithm (ACE) of Breiman and Friedman (1985) or the BRUTO algorithm of Hastie and Tibshirani (1990) may be used. The main idea of backfitting goes as follows. In the above model

\[ E\Big[ z_t - m_0 - \sum_{j \neq k} m_j(z_{t-i_j}) \,\Big|\, z_{t-i_k} \Big] = m_k(z_{t-i_k}). \]

Hence the variable in square brackets can be used to obtain a nonparametric estimate for $m_k(z_{t-i_k})$. But of course the other $m_j$ are unknown as well, so that the estimation procedure has to be iterated until all the $m_{n,j}$ converge, as in the sketch below. For a more detailed study of generalized additive models the reader is referred to the book of Hastie and Tibshirani as well as to the two interesting papers by Chen and Tsay in JASA (1993). For further discussion and other approaches see also Härdle, Lütkepohl and Chen (1997).
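The following sketch shows the backfitting iteration just described for the autoregressive additive model; `smooth` stands for any univariate smoother (e.g. a NW fit returning fitted values at the sample points), and the centring step and fixed iteration count are illustrative choices, not prescriptions from the literature cited above.

```python
import numpy as np

def backfit(z, lags, smooth, n_iter=20):
    """Backfitting for z_t = m_0 + sum_j m_j(z_{t-i_j}) + a_t."""
    p_max = max(lags)
    y = z[p_max:]                                    # responses z_t
    X = np.column_stack([z[p_max - i:len(z) - i] for i in lags])  # z_{t-i_j}
    m0 = y.mean()
    fits = np.zeros_like(X)                          # current m_j(z_{t-i_j})
    for _ in range(n_iter):
        for k in range(len(lags)):
            # partial residual: z_t minus all other fitted components
            partial = y - m0 - fits.sum(axis=1) + fits[:, k]
            fits[:, k] = smooth(X[:, k], partial)
            fits[:, k] -= fits[:, k].mean()          # keep components centred
    return m0, fits
```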
Quite a few proposals can be found in the literature dealing with the bias problem of NW estimators close to the boundary and in cases of an unbalanced design in the $x$-space. Gasser and Müller (1979, 1984) suggested for the case $p = 1$ a system of variable weights, Gasser, Müller and Mammitzsch (1985) developed asymmetric boundary kernels, and Messer and Goldstein (1993) suggested variable kernels which automatically get deformed and thus reduce the bias in the boundary area. Yang (1981) and Stute (1984) suggested a symmetrized $k$-NN estimator, and Michels (1992) proposed boundary kernels for bias reduction which can be carried over to the case $p > 1$. We do not discuss the above mentioned proposals in more detail, since the mentioned disadvantages can be repaired by using locally weighted regression.

5 Locally weighted regression

Locally weighted, respectively local polynomial, regression was introduced into the statistical literature by Stone (1977) and Cleveland (1979). The statistical properties have since been investigated in papers by Tsybakov (1986), Fan (1993), Fan and Gijbels (1992, 1995), Ruppert and Wand (1994) and many others. A detailed description may be found in the book of Fan and Gijbels (1996).

For the sake of simplicity we start with the assumption that the regressor $x$ is a scalar. For a better understanding we regard the data as being generated by a location-scale model

\[ y = m(x) + \sigma(x)\, \varepsilon \qquad (5.1) \]

akin to the one considered in (3.7), where the $\varepsilon_i$ are independent with $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = 1$, and $m(x_0) = E(y \mid x = x_0)$. $m$ is assumed to be smooth in the sense that the $(r+1)$-th derivative exists at $x_0$, so that it can be expanded in a Taylor series around $x_0$,

\[ m(x) = m(x_0) + (x - x_0)\, m'(x_0) + \ldots + (x - x_0)^r \frac{m^{(r)}(x_0)}{r!} + R_r(x), \qquad (5.2) \]

with the remainder term

\[ R_r(x) = (x - x_0)^{r+1} \frac{m^{(r+1)}\big(x_0 + \theta(x - x_0)\big)}{(r+1)!}, \quad 0 < \theta < 1. \qquad (5.3) \]

With

\[ \beta_j(x_0) = \frac{m^{(j)}(x_0)}{j!}, \quad j = 0, 1, \ldots, r, \qquad (5.4) \]

we arrive at a local polynomial representation for $m$,

\[ m(x) \approx \sum_{j=0}^{r} \beta_j(x_0)(x - x_0)^j. \qquad (5.5) \]

This approach motivates the nonparametric estimation of $m$ as a local polynomial by solving the least squares problem

\[ \min_{\beta \in \mathbb{R}^{r+1}} \sum_{i=1}^{n} \Big[ y_i - \sum_{j=0}^{r} \beta_j (x_i - x)^j \Big]^2 K\left(\frac{x_i - x}{h}\right). \]

With the design matrix $X_x$ having the $n$ rows $[1,\ x_i - x,\ \ldots,\ (x_i - x)^r]$, the diagonal weight matrix $W_x = \text{diag}\big(K(\frac{x_i - x}{h})\big)$ and the vector $y = (y_1, \ldots, y_n)'$, the solution at $x$ is given by

\[ \hat{\beta}(x) = (X_x' W_x X_x)^{-1} X_x' W_x y, \qquad (5.6) \]

and, with $e_j$ being the $j$-th unit vector in $\mathbb{R}^{r+1}$, we see immediately that
\[ \hat{m}(x) = \hat{\beta}_0 = e_1' (X_x' W_x X_x)^{-1} X_x' W_x y \qquad (5.7) \]

and that with

\[ \hat{m}^{(j)}(x) = \hat{\beta}_j(x)\, j! = j!\, e_{j+1}' (X_x' W_x X_x)^{-1} X_x' W_x y, \quad j = 1, \ldots, r, \qquad (5.8) \]

an estimator for the $j$-th derivative of $m$ is given. The case $r = 0$ yields the Nadaraya-Watson estimator (3.3).
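In code, the weighted least squares solution (5.6) with the derived estimators (5.7) and (5.8) amounts to one small linear solve per evaluation point; a minimal sketch for scalar $x$, with illustrative names:

```python
import numpy as np

def local_poly(x0, x_obs, y_obs, h, r, kernel):
    """beta_hat(x0) of (5.6); beta[0] is m_hat(x0) as in (5.7),
    and j! * beta[j] estimates m^(j)(x0) as in (5.8)."""
    d = x_obs - x0
    X = np.vander(d, N=r + 1, increasing=True)  # rows [1, d_i, ..., d_i^r]
    w = kernel(d / h)                           # diagonal of W_x
    XtW = X.T * w                               # X_x' W_x
    return np.linalg.solve(XtW @ X, XtW @ y_obs)
```

Setting `r=0` collapses this to the NW estimator, in line with the remark above.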
Let $u = (R_r(x_1), \ldots, R_r(x_n))'$ be the residual vector containing the remainder terms according to (5.3) at the data points. Then the conditional bias of $\hat{\beta}(x)$ is given by

\[ B\big(\hat{\beta}(x)\big) = (X_x' W_x X_x)^{-1} X_x' W_x u, \]

and with $\Sigma_x = W_x^2\, \text{diag}\big(\sigma^2(x_i)\big)$ its conditional covariance matrix is

\[ \text{Var}\big(\hat{\beta}(x)\big) = (X_x' W_x X_x)^{-1} (X_x' \Sigma_x X_x)(X_x' W_x X_x)^{-1}. \]

The above two expressions cannot be used directly, since they contain the unknown vector $u$ of remainder terms and the unknown diagonal matrix $\Sigma_x$. A first order asymptotic expansion of the variance and the bias term uses the moments of $K$ and $K^2$, denoted by

\[ \mu_j = \int u^j K(u)\, du \quad \text{and} \quad \nu_j = \int u^j K^2(u)\, du, \]

which are contained in the matrices

\[ S = (\mu_{j+l})_{0 \leq j,l \leq r}, \qquad \tilde{S} = (\mu_{j+l+1})_{0 \leq j,l \leq r}, \qquad S^* = (\nu_{j+l})_{0 \leq j,l \leq r}, \]

and the vectors $c_r = (\mu_{r+1}, \ldots, \mu_{2r+1})'$, $\tilde{c}_r = (\mu_{r+2}, \ldots, \mu_{2r+2})'$. For an i.i.d. sample $(y_1, x_1), \ldots, (y_n, x_n)$ with the marginal density $f(x) > 0$ and with $f$, $m^{(r+1)}$ and $\sigma^2$ continuous in a neighbourhood of $x$, we obtain for $h \to 0$ and $nh \to \infty$ the asymptotic conditional variance

\[ \text{Var}\big(\hat{m}^{(j)}(x)\big) = e_{j+1}' S^{-1} S^* S^{-1} e_{j+1} \frac{(j!)^2\, \sigma^2(x)}{f(x)\, n h^{1+2j}} + o_p\left(\frac{1}{n h^{1+2j}}\right). \qquad (5.9) \]

For the asymptotic conditional bias we have to distinguish between the cases where $r - j$ is odd and where $r - j$ is even. For $r - j$ odd we have

\[ \text{Bias}\big(\hat{m}^{(j)}(x)\big) = e_{j+1}' S^{-1} c_r \frac{j!}{(r+1)!}\, m^{(r+1)}(x)\, h^{r+1-j} + o_p(h^{r+1-j}). \qquad (5.10) \]

For $r - j$ even the asymptotic bias is

\[ \text{Bias}\big(\hat{m}^{(j)}(x)\big) = e_{j+1}' S^{-1} \tilde{c}_r \frac{j!}{(r+2)!} \left\{ m^{(r+2)}(x) + (r+2)\, m^{(r+1)}(x) \frac{f'(x)}{f(x)} \right\} h^{r+2-j} + o_p(h^{r+2-j}), \qquad (5.11) \]

provided that $f'$ and $m^{(r+2)}$ are continuous in a neighbourhood of $x$ and $n h^3 \to \infty$.

As a very interesting fact we notice the difference in asymptotic bias between $r - j$ odd and $r - j$ even. For instance, we have for the NW estimator ($r = 0$, $j = 0$)

\[ B\big(m_n(x)\big) = h^2 \big[ m''(x)/2 + m'(x) f'(x)/f(x) \big] \mu_2 + o_p(h^2), \]

whereas for the local linear approach we obtain

\[ B\big(\hat{m}(x)\big) = h^2 m''(x)\, \mu_2/2 + o_p(h^2). \]

We see that the bias of the local linear estimator has a simpler structure. The linear term in the bias expansion vanishes, whereas the expression for the variance is the same in both cases, given by $\nu_0\, \sigma^2(x)/(f(x)\, n h)$. The bias of the NW estimator depends not only on $m'$, but also on the score function $-f'/f$. This is the reason why an unbalanced design leads to an increased bias. Similar considerations hold for higher order polynomials. In practice this means that for
estimating $m$ it is sufficient to consider $r = 1$ or $r = 3$, and for $m'$ only $r = 2$ or $r = 4$ should be considered. In many applications $r = j + 1$ suffices. Fitting a higher order polynomial will possibly reduce the bias, but on the other hand the variance will increase, since more parameters have to be estimated locally. If the regressor $x$ is a vector rather than a scalar, in most cases a local linear approach is chosen, since in this case the step from $r = 1$ to $r = 3$ leads to a strong increase in the number of parameters to be estimated locally, which entails an unacceptable increase in variance.

Since

\[ \hat{\beta}_j(x) = e_{j+1}' \hat{\beta} = e_{j+1}' (X_x' W_x X_x)^{-1} X_x' W_x y = \sum_{i=1}^{n} w_{n,i}^j\left(\frac{x_i - x}{h}\right) y_i, \qquad (5.12) \]

for estimating $\beta_j(x) = m^{(j)}(x)/j!$ we have a similar expression as a weighted mean, like for the NW estimator (3.3). The weights depend on the observations $x_i$ and on the location of $x$ in the design space. It can be seen easily that the weights $w_{n,i}^j(u_t) = w_{n,i}^j\big(\frac{x_i - x}{h}\big)$ satisfy the discrete moment conditions

\[ \sum_{i=1}^{n} \left(\frac{x_i - x}{h}\right)^q w_{n,i}^j\left(\frac{x_i - x}{h}\right) = \delta_{jq} \quad \text{with } 0 \leq j, q \leq r. \]

As a consequence of this, the sample bias for estimating a polynomial with degree less than or equal to $r$ is zero. The variance of $\hat{m}^{(j)}(x)$ is given by

\[ \text{Var}\big(\hat{m}^{(j)}(x)\big) = (j!)^2 \sum_{i=1}^{n} \left[ w_{n,i}^j\left(\frac{x_i - x}{h}\right) \right]^2 \sigma^2(x_i). \]

The kernel with the weights $w_{n,i}^j(u_t)$ is called the active kernel. A first order approximation to the $w_{n,i}^j$ is given if $(X_x' W_x X_x)$ is replaced by the moment matrix $S$. The according kernel

\[ \tilde{K}_{(j)}(u) = e_{j+1}' S^{-1} (1, u, \ldots, u^r)'\, K(u) \qquad (5.13) \]
is called the equivalent kernel. It satisfies the corresponding moment conditions

\[ \int u^q \tilde{K}_{(j)}(u)\, du = \delta_{jq}, \quad 0 \leq j, q \leq r. \qquad (5.14) \]

For instance, for the case $r = 1$, $j = 0$ we have $\tilde{K}(u) = K(u)$, and for $r = 2$, $j = 1$ (estimation of $m'$), $\tilde{K}_{(1)}(u) = \mu_2^{-1} u\, K(u)$. This means that for estimating $m$ itself in the interior of the $x$-space the effective kernel is equal to the chosen symmetric kernel function itself, whereas for estimating the first derivative $\tilde{K}_{(1)}$ is a skew function. As a general result, $\tilde{K}_{(j)}$ is symmetric for $j$ even and skew for $j$ odd. In terms of equivalent kernels the asymptotic conditional variance and the asymptotic conditional bias (for $r - j$ odd) are

\[ \text{Var}\big(\hat{m}^{(j)}(x)\big) = \frac{(j!)^2\, \sigma^2(x)}{f(x)\, n h^{1+2j}} \int \tilde{K}_{(j)}^2(u)\, du + o_p\big((n h^{1+2j})^{-1}\big) \qquad (5.15) \]

\[ \text{Bias}\big(\hat{m}^{(j)}(x)\big) = \frac{j!}{(r+1)!}\, m^{(r+1)}(x)\, h^{r+1-j} \int u^{r+1} \tilde{K}_{(j)}(u)\, du + o_p(h^{r+1-j}). \qquad (5.16) \]

The big advantage of local polynomial regression over other smoothing methods consists in the automatic adaptation of the active resp. equivalent kernel to the estimation situation in the boundary area. If $x$ is scalar and $x_* = \min(x_i)$, $x^* = \max(x_i)$, then for a given bandwidth $h$ the interior of the $x$-space is given by all observations in the interval $[x_* + h,\ x^* - h]$. For all $x$ in this interval the equivalent kernels $\tilde{K}_{(j)}$ have the above mentioned symmetry resp. asymmetry property. In the left boundary part $[x_*,\ x_* + h)$ the number of left neighbours in a local neighbourhood of a point $x$ will be small compared to the number of right neighbours, and for $x = x_*$ we have only right neighbours. Corresponding considerations hold for the right boundary part $(x^* - h,\ x^*]$. For $x \in \mathbb{R}^p$ ($p > 1$) the boundary area will often cover an important part of the whole design space. For $r - j$ odd the active resp. equivalent kernels automatically adapt to the skew data situation in the boundary area. The situation in the right boundary area is illustrated in Figure 5.1 for the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)_+$ for a local linear estimation of $m$ ($r = 1$, $j = 0$) and a local quadratic estimation of $m'$ ($r = 2$, $j = 1$). We see how the weighting systems get deformed towards the boundary. The pictures for the left boundary area are symmetric to those in Figure 5.1. Since the size of the local neighbourhood shrinks towards the boundary, the bias part of the mean squared error (MSE) will be lower in the boundary area than in the interior. On the other hand, the variance part will increase, since fewer observations are included in the local estimation and
also due to the increasing deformation of the weighting system towards the boundary.

[Figure 5.1: Active kernels derived from the Epanechnikov kernel with h_n = 30 at the right boundary for (a) r = 1, j = 0 and (b) r = 2, j = 1. Estimation at interior points (short dashes), at x = x* - 15 (dashes and points), at x* - 6 (long dashes) and at the boundary point x* (solid line).]

Usually, the increase in variance overcompensates the reduction of the bias, particularly if $m''$ remains roughly the same in the boundary area. As a consequence, the MSE will increase towards the boundary. The increase will be even more pronounced for higher order polynomials.

For $x \in \mathbb{R}^p$ the local linear fit is given as the solution of the least squares criterion

\[ \sum_{i=1}^{n} \big[ y_i - \beta_0 - \beta'(x_i - x) \big]^2 K\left(\frac{x_i - x}{h}\right), \]

where $K$ is a $p$-variate kernel. With the design matrix $X_x$ with rows $[1,\ (x_{i1} - x_1),\ \ldots,\ (x_{ip} - x_p)]$, the solution has the same form as in (5.7). Let $K$ be a product kernel composed of the same univariate kernel and bandwidth in each coordinate and let $H_m(x)$ be the Hessian matrix of the second derivatives of $m$. Then we get asymptotic expressions for the variance and the bias in the interior (see Ruppert and Wand, 1994),

\[ \text{Var}\big(\hat{m}(x)\big) = \frac{\nu_0\, \sigma^2(x)}{f(x)\, n h^p} + o_p\big((n h^p)^{-1}\big) \qquad (5.17) \]
and

\[ \text{Bias}\big(\hat{m}(x)\big) = \frac{h^2}{2}\, \mu_2\, \text{tr}\{H_m(x)\} + o_p(h^2). \qquad (5.18) \]

The above considerations about the advantage of a local linear approach compared to local constant estimation, about its design adaptation property and its automatic boundary adaptation hold for the multivariate case in a similar way.

Up to now we considered local least squares regression to estimate the mean function $m$. But the idea of locally weighted regression turns out to be a very versatile tool for estimation in a variety of situations. Yu and Jones (1998) consider the estimation of the conditional distribution function $F(y|x)$. Let $F_1(u) = \int_{-\infty}^{u} K_1(v)\, dv$ be the distribution function pertaining to a symmetric kernel density $K_1$ and let $h_2$ be a bandwidth. Yu and Jones consider a local linear approach for $F(y|x)$ which is motivated by the approximations

\[ E\left[ F_1\left(\frac{y_0 - y_i}{h_2}\right) \Big|\, x_0 \right] \approx F(y_0 | x_0) \]

and

\[ F(y_0 | x) \approx F(y_0 | x_0) + \dot{F}(y_0 | x_0)(x - x_0) = \beta_0 + \beta_1 (x - x_0), \]

where $\dot{F}(y_0|x) = \partial F(y_0|x)/\partial x$. This suggests the least squares approach

\[ \sum_{i=1}^{n} \left[ F_1\left(\frac{y - y_i}{h_2}\right) - \beta_0 - \beta_1 (x_i - x) \right]^2 K\left(\frac{x_i - x}{h_1}\right), \]

where $K$ is a second kernel with bandwidth $h_1$. The solution

\[ \tilde{F}_{h_1, h_2}(y|x) = \hat{\beta}_0 = e_1' (X_x' W_x X_x)^{-1} X_x' W_x \tilde{y} \qquad (5.19) \]

with $\tilde{y} = \left( F_1\left(\frac{y - y_1}{h_2}\right), \ldots, F_1\left(\frac{y - y_n}{h_2}\right) \right)'$ is called a local linear double-kernel smoother by the authors. The estimator is continuous and has zero as left boundary value (for $y \to -\infty$) and 1 as right boundary value. It can happen that the estimator ranges outside $[0, 1]$. But this does not, as the authors say, give problems estimating $q_\alpha$ by
\[ \tilde{q}_\alpha(x) = \tilde{F}_{h_1, h_2}^{-1}(\alpha | x). \]

This estimator involves the problem that two bandwidths $h_1$ and $h_2$ have to be chosen. For a possible procedure with $h_2 < h_1$ we refer to the paper.

Fan, Yao and Tong (1996) considered a related idea for estimating the conditional density itself. Here

\[ E\left[ \frac{1}{h_2} K_1\left(\frac{y_i - y_0}{h_2}\right) \Big|\, x \right] \approx g(y_0 | x) \approx g(y_0 | x_0) + \dot{g}(y_0 | x_0)(x - x_0) = \beta_0 + \beta_1 (x - x_0) \]

with $\dot{g}(y|x) = \partial g(y|x)/\partial x$ leads to the least squares criterion

\[ \sum_{i=1}^{n} \left[ \frac{1}{h_2} K_1\left(\frac{y_i - y}{h_2}\right) - \beta_0 - \beta_1 (x_i - x) \right]^2 K_2\left(\frac{x_i - x}{h_1}\right) \qquad (5.20) \]

with the solution $\hat{g}(y|x) = \hat{\beta}_0$ as in (5.19), where now the vector $\tilde{y}$ is

\[ \tilde{y} = \left( \frac{1}{h_2} K_1\left(\frac{y_1 - y}{h_2}\right), \ldots, \frac{1}{h_2} K_1\left(\frac{y_n - y}{h_2}\right) \right)'. \]

The local constant approach leads to the traditional estimator (2.3). Fan, Yao and Tong also consider the case of a local quadratic approach for estimating the first derivative. We will not pursue this case further here, since for the quadratic term $p(p+1)/2$ more parameters have to be estimated.

In all local regression approaches so far we used the least squares criterion. Let us now look at cases where, instead of the square function, another convex loss function $\rho: \mathbb{R} \to \mathbb{R}$ is used which has a unique minimum at zero, and let $m_\rho(x) = \text{argmin}_{\theta_0} E[\rho(y - \theta_0) \mid x]$. $\rho(u) = u^2$ yields the conditional expectation which we analyzed mostly so far. $\rho(u) = |u|$ yields the conditional median. This is just a special case, for $\alpha = 1/2$, of the loss function $|u| + (2\alpha - 1)u = 2\rho_\alpha(u)$, with $\rho_\alpha$ already mentioned in (3.6). $\rho_\alpha$ was introduced by Koenker and Bassett for parametric quantile estimation. The function $2\rho_\alpha(u)$ for various $\alpha$ is exhibited in Figure 5.2.
[Figure 5.2: $2\rho_\alpha(u)$ according to Koenker and Bassett for several $\alpha$]

In robustness considerations, $\rho$-functions were introduced which increase less rapidly than the square function and for which $\rho' = \psi$ is the so-called $\psi$-function. See Huber (1981) or Hampel et al. (1986). A local constant estimator for $m_\rho$ is

\[ \hat{m}_\rho(x) = \text{argmin}_{\theta_0} \sum_{i=1}^{n} \rho(y_i - \theta_0)\, K\left(\frac{x_i - x}{h}\right). \]

The known drawbacks of a local constant approach are that it cannot adapt to unbalanced design situations and that it has adverse boundary effects which require boundary corrections. This again suggests a local linear approach, leading to the estimator $\hat{m}_\rho(x) = \hat{\theta}_0$, where

\[ (\hat{\theta}_0, \hat{\theta}_1) = \text{argmin}_{\theta_0, \theta_1} \sum_{i=1}^{n} \rho\big(y_i - \theta_0 - \theta_1 (x_i - x)\big)\, K\left(\frac{x_i - x}{h}\right). \qquad (5.21) \]
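As a sketch, (5.21) with $\rho = \rho_\alpha$ can be handed to any generic optimizer for convex problems. The direct minimization below (Nelder-Mead, a deliberately simple illustrative choice) is less efficient than the linear programming route of Koenker and Bassett mentioned next, but it shows the structure of the problem.

```python
import numpy as np
from scipy.optimize import minimize

def rho(u, alpha):
    """Check function rho_alpha from (3.6)."""
    return np.where(u >= 0, alpha * u, (alpha - 1.0) * u)

def local_linear_quantile(x0, x_obs, y_obs, h, alpha, kernel):
    """Approximate solution of (5.21); returns q_hat_alpha(x0) = theta_0."""
    w = kernel((x_obs - x0) / h)

    def objective(theta):
        resid = y_obs - theta[0] - theta[1] * (x_obs - x0)
        return np.sum(rho(resid, alpha) * w)

    theta0 = np.array([np.quantile(y_obs, alpha), 0.0])
    res = minimize(objective, theta0, method="Nelder-Mead")
    return res.x[0]
```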
For a $\rho$-function belonging to a robustness class, such as Huber's M-type estimators, known methods for robust estimation can be applied in order to solve the minimum problem (5.21). We would like to remark that the use of kernels automatically safeguards against large deviations in the design space. For nonparametric robust M-, L- and R-estimation in a time series setting see Michels (1992). For a local $\alpha$-quantile regression with the function (3.6), the local solution in (5.21) can be evaluated by solving a linear programming problem, as was shown in the paper of Koenker and Bassett (1978). An algorithm for evaluating this can be found in Koenker and d'Orey (1987). For the case of a general convex $\rho$-function and i.i.d. observations, asymptotic normality is proved in Fan, Hu and Truong (1994).

The $\alpha$-quantile estimation according to (5.21) is also considered by Yu and Jones (1998) and compared with the estimator (5.19). For reasons of practical performance the authors prefer the double smoothing approach (5.19). They also give an asymptotic expression for the mean squared error for $x$ scalar, which for the solution of (5.21) is given by

\[ \text{MSE}\big(\hat{q}_\alpha(x)\big) = \text{Bias}^2\big(\hat{q}_\alpha(x)\big) + \text{Var}\big(\hat{q}_\alpha(x)\big) = \frac{1}{4} h^4 \mu_2^2\, q_\alpha''(x)^2 + \frac{\nu_0\, \alpha(1 - \alpha)}{n h\, f(x)\, f(q_\alpha(x)|x)^2}. \]

These expressions are used for suggestions of bandwidth choice. The cases of robust locally linear regression and of quantile regression are also considered in Fan and Gijbels (1996).

6 Applications of locally weighted regression to time series

Local linear or higher order polynomial regression, originally mainly considered for independent data, can be applied in the same way to stationary processes with certain memory restrictions. The reasons are the same as those mentioned at the beginning of Section 3. Given two (dependent) random variables $x_s$ and $x_t$ and a point $x$ in the design space, the random variables $\frac{1}{h} K\left(\frac{x_s - x}{h}\right)$ and $\frac{1}{h} K\left(\frac{x_t - x}{h}\right)$ are nearly uncorrelated as $h \to 0$. This is the whitening by windowing principle, and it is worthwhile mentioning that this property is not shared by parametric estimators. To handle memory restrictions in the proofs of consistency and asymptotic normality, mixing conditions (strong mixing, uniform mixing or $\beta$-mixing) are used. They give a bound to the maximal dependence between events being at least $k$ instants apart from each other. Short term dependence does not have
much effect on local regression. But local polynomial techniques are also applicable under weak dependence in the medium or long term. If suitable mixing conditions are fulfilled, local polynomial estimators for dependent data have the same asymptotic properties as for independent data. Of course the bias is not influenced by dependence, whereas the variance terms are affected. In proving asymptotic equivalence, the task then consists of showing that the additional terms due to nonvanishing covariances between the variables are of smaller order asymptotically.

For a local linear estimation of $m(x) = m(x_1, \ldots, x_p)$ in the autoregressive model (3.2), the design matrix and the vector $y$ have the form

\[ X_x = \begin{pmatrix} z_p - x_1 & \ldots & z_1 - x_p \\ \vdots & & \vdots \\ z_{n-1} - x_1 & \ldots & z_{n-p} - x_p \end{pmatrix}, \qquad y = \begin{pmatrix} z_{p+1} \\ \vdots \\ z_n \end{pmatrix}, \]

and with $(x_t - x)' = (z_{t-1} - x_1, \ldots, z_{t-p} - x_p)$ the estimator can be evaluated as in (5.7). For $x = x_{n+1} = (z_n, \ldots, z_{n-p+1})'$,

\[ \hat{m}(x_{n+1}) = \hat{\beta}_0 \]

yields the one-step ahead predictor. A direct k-step ahead predictor is given if $y = (z_{p+k}, \ldots, z_n)'$ and if the last row of the $X_x$-matrix is $(z_{n-k} - z_n, \ldots, z_{n-k-p+1} - z_{n-p+1})$. But in this case a succession of one-step ahead predictions seems preferable, as already mentioned in Section 3. Asymptotic normality results for locally linear autoregression can be found in Härdle, Tsybakov and Yang (1997) and in Fan and Gijbels (1996).

For the CHARN model $z_t = m(x_t) + \sigma(x_t)\varepsilon_t$, the function $g(x_t)$ according to (3.9) can be estimated in a similar way as above, where only in the vector $y$ the time series values are replaced by their squares. Asymptotic normality for this case is shown in Härdle and Tsybakov (1997). For a residual based estimator of $\sigma^2(x)$ see (7.10) or Feng and Heiler (1998a). The local linear estimation of a conditional density in a time series setting with the before mentioned double smoothing procedure as in (5.19) is considered in Fan, Yao and Tong (1996) and in Fan and Gijbels (1996), where also asymptotic results can be found.

For the estimation of the conditional distribution function according to the proposal of Yu and Jones (1998) as in (5.19), and for a general solution of (5.21), asymptotic
results are known for independent data. See the papers of Yu and Jones (1998), Härdle and Gasser (1984) and Tsybakov (1986). For dependent data we have not yet found formally published proofs. But considering the whitening by windowing effect makes it clear that for these cases consistency results will hold under suitable mixing conditions.

7 Parameter selection

One of the first questions to be answered in the application of kernel smoothing is which type of kernel to use for different choices of $r$ and $j$. It is well known that for $r - j$ odd, in the interior of the $x$-space, the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)_+$ is the one which minimizes the mean squared error in the class of all nonnegative, symmetric and Lipschitz continuous functions, and that for the endpoints $x_*$ and $x^*$ the triangular kernels $(1 - u) 1_{[0,1]}(u)$ resp. $(1 + u) 1_{[-1,0]}(u)$ are optimal. For other points in the boundary area optimal solutions are not known. It is easy to see that, when looking at variance only, the uniform kernel $\frac{1}{2} 1_{[-1,1]}(u)$ is the one minimizing the variance. It is well known that in practice the choice of the kernel is not very important compared to the choice of the bandwidth. The Epanechnikov kernel will therefore be a good choice in many cases. Nonetheless, in practice smoother kernels like the bisquare or the triweight are often preferred. This has to do with the degree of smoothness, since the kernel estimates inherit the smoothness properties of the kernel. According to the degree of smoothness as introduced by Müller (1984), the uniform kernel has degree zero (not continuous), the triangle and the Epanechnikov kernel have degree 1 (continuous, but first derivative not continuous), the bisquare and the triweight have degrees 2 and 3, respectively, and the Gaussian kernel has degree $\infty$.

The most crucial task in kernel smoothing is bandwidth selection. Much ink has been spilled on papers concerning this problem. It is hence impossible to give a comprehensive survey here. Instead we will discuss only a few basic ideas. The aim is to choose bandwidths such that the conditional mean squared error, given by

\[ \text{MSE}\big(\hat{m}^{(j)}(x)\big) = \text{Bias}^2\big(\hat{m}^{(j)}(x)\big) + \text{Var}\big(\hat{m}^{(j)}(x)\big), \qquad (7.1) \]

becomes minimal. We have to distinguish between a locally optimal bandwidth and a globally optimal, constant bandwidth. It is clear that a large bandwidth will lead to a low variance but a high bias. Decreasing the bandwidth will increase the variance but reduce the bias. An optimal bandwidth is
achieved when the changes in bias and variance balance. Using the asymptotic expressions (5.15) and (5.16) for the conditional variance and bias, minimizing (7.1) with respect to $h$ yields for the (asymptotically) optimal bandwidth at $x$, for a scalar $x$,

\[ h_n = C_{r,j}(K) \left[ \frac{\sigma^2(x)}{\big(m^{(r+1)}(x)\big)^2 f(x)} \cdot \frac{1}{n} \right]^{1/(2r+3)}, \qquad (7.2) \]

where the constant

\[ C_{r,j}(K) = \left[ \frac{\big((r+1)!\big)^2 (2j+1) \int \tilde{K}_{(j)}(u)^2\, du}{2(r+1-j) \left\{ \int u^{r+1} \tilde{K}_{(j)}(u)\, du \right\}^2} \right]^{1/(2r+3)} \qquad (7.3) \]

depends only on $r$, $j$ and the used kernel and can be calculated beforehand.

In time series applications we are mainly interested in a constant, global bandwidth, for which the integrated mean squared error (IMSE)

\[ \int \left[ \text{Bias}^2\big(\hat{m}^{(j)}(x)\big) + \text{Var}\big(\hat{m}^{(j)}(x)\big) \right] w(x)\, dx \]

is chosen as criterion, where $w$ is a weight function going to zero at the boundaries to avoid boundary effects. Minimizing the IMSE with respect to $h$ yields the optimal global bandwidth

\[ h_n = C_{r,j}(K) \left[ \frac{\int \frac{\sigma^2(x)}{f(x)} w(x)\, dx}{\int \big\{m^{(r+1)}(x)\big\}^2 w(x)\, dx} \cdot \frac{1}{n} \right]^{1/(2r+3)}. \qquad (7.4) \]

For local linear estimation of $m$, when $x$ is a $p$-vector and the same bandwidth is chosen in each coordinate, a similar expression can be derived (see Feng and Heiler, 1998a). Here $h_n = c_0\, n^{-1/(p+4)}$, where
\[ c_0 = \left[ \frac{p\, \nu_0\, \sigma^2(x)}{\mu_2^2\, f(x)\, \big\{\text{tr}\, H_m(x)\big\}^2} \right]^{1/(p+4)} \]

and $H_m(x)$ is the matrix of second derivatives of $m$.

All these expressions contain quantities which are unknown and are therefore not accessible in practice. So-called plug-in techniques substitute these quantities by pilot estimates. For more details see Ruppert, Sheather and Wand (1995).

A simple procedure of bandwidth selection for independent data, originally developed to find the smoothing parameter in spline smoothing, is cross validation. Let $\hat{m}_{-i}(x_i)$ be the so-called leave-one-out estimator of $m$ at $x_i$, where the observation $(y_i, x_i)$ is not used in the estimation procedure. Then the criterion is

\[ CV(h) = n^{-1} \sum_{i=1}^{n} \big[ y_i - \hat{m}_{-i}(x_i) \big]^2 \qquad (7.5) \]

and $h_{CV} = \text{argmin}\, CV(h)$ is the cross validation bandwidth selector. The idea can also be used for $x \in \mathbb{R}^p$ and for estimating derivatives. See Härdle (1990) for details. It can be shown that it converges almost surely to the IMSE optimal bandwidth, but the convergence rate of $n^{-1/10}$ is very low. The cross validation idea was developed for independent data. In a time series setting it is suggested to replace the leave-one-out estimator by a "leave block out" estimator, where for estimating at $x_i$ not only the $i$-th observation is omitted, but a whole block of data around $(y_i, x_i)$. This idea was used by Abberger (1995, 1996) in smoothing the conditional $\alpha$-quantile, where the square function is replaced by the $\rho_\alpha$-function (3.6).
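A minimal sketch of the criterion (7.5) for a NW smoother, assuming a kernel with full support (e.g. Gaussian) so that the leave-one-out denominators stay positive; the $O(n^2)$ loop is kept for transparency rather than speed.

```python
import numpy as np

def cv_score(h, x_obs, y_obs, kernel):
    """Leave-one-out criterion CV(h) of (7.5) for a NW smoother."""
    n = len(x_obs)
    err = np.empty(n)
    for i in range(n):
        w = kernel((np.delete(x_obs, i) - x_obs[i]) / h)
        err[i] = y_obs[i] - np.sum(w * np.delete(y_obs, i)) / np.sum(w)
    return np.mean(err**2)

def select_bandwidth(grid, x_obs, y_obs, kernel):
    """h_CV = argmin CV(h) over a grid of candidate bandwidths."""
    scores = [cv_score(h, x_obs, y_obs, kernel) for h in grid]
    return grid[int(np.argmin(scores))]
```

The "leave block out" variant for time series replaces the two `np.delete` calls by deletion of a whole index block around `i`.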
Let $\sigma^2$ be the variance of the residuals in an i.i.d. sample and, in the time series case, the unconditional variance of the stationary process. Rice (1983, 1984) proposed a criterion $R$ which for a general linear smoother is given by

\[ R(h) = RSS(h) - \hat{\sigma}^2 + 2 \hat{\sigma}^2 n^{-1} \sum_{i=1}^{n} w_{n,i}(x_i), \qquad (7.6) \]

where the $w_{n,i}$ are the actual weights for estimating $m(x_i)$, $\hat{\sigma}^2$ is an estimate for $\sigma^2$ and

\[ RSS(h) = n^{-1} \sum_{i=1}^{n} \big[ y_i - \hat{m}_h(x_i) \big]^2 \qquad (7.7) \]

is the mean residual sum of squares. Under the assumption that $\hat{\sigma}^2$ is a consistent estimator, Rice (1984) showed that the proposed selector $h_R = \text{argmin}\, R(h)$ is asymptotically optimal in the sense that $(h_R - h_0)/h_0 \to 0$ in probability, where $h_0$ is the minimizer of the mean averaged squared error

\[ MASE(h) = n^{-1} E\left\{ \sum_{i=1}^{n} \big[ \hat{m}_h(x_i) - m(x_i) \big]^2 \right\}. \]

The rate of convergence of $h_R$ is the same low rate $n^{-1/10}$ as for the cross validation solution $h_{CV}$. The main difference between the two is that $h_R$ involves an estimate of $\sigma^2$, whereas $h_{CV}$ does not. For $\hat{\sigma}^2$ Rice proposed an estimator based on first differences, whereas Gasser et al. (1986) suggested taking second differences (since they annihilate a local linear mean value function),

\[ \hat{\sigma}_G^2 = \frac{2}{3(n-2)} \sum_{i=1}^{n-2} \left[ y_{i+1} - \frac{1}{2}(y_i + y_{i+2}) \right]^2. \qquad (7.8) \]

An estimator based on a general difference sequence $D_m = \{d_0, d_1, \ldots, d_m\}$ such that $\sum_{j=0}^{m} d_j = 0$ and $\sum_{j=0}^{m} d_j^2 = 1$ was considered by Hall et al. (1990). The variance estimator based on $D_m$ is then

\[ \hat{\sigma}_m^2 = (n - m)^{-1} \sum_{i=1}^{n-m} \left( \sum_{j=0}^{m} d_j\, y_{j+i} \right)^2. \qquad (7.9) \]
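Both difference-based variance estimators are one-liners in spirit; the sketch below implements (7.8) and the general form (7.9), where the caller must supply a difference sequence `d` with sum zero and sum of squares one.

```python
import numpy as np

def sigma2_gasser(y):
    """Second-difference estimator (7.8) of Gasser et al. (1986)."""
    r = y[1:-1] - 0.5 * (y[:-2] + y[2:])   # y_{i+1} - (y_i + y_{i+2}) / 2
    return 2.0 / 3.0 * np.mean(r**2)

def sigma2_diff(y, d):
    """General difference-sequence estimator (7.9) of Hall et al. (1990)."""
    m = len(d) - 1
    n = len(y)
    r = np.array([np.dot(d, y[i:i + m + 1]) for i in range(n - m)])
    return np.mean(r**2)
```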
Fan and Gijbels (1995) suggest the residual sum of squares criterion (RSC), which is based on a local estimator of the conditional variance derived under a local homogeneity assumption,

\[ \hat{\sigma}^2(x) = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 K\left(\frac{x_i - x}{h}\right)}{\text{tr}\left[ W_x - W_x X_x (X_x' W_x X_x)^{-1} X_x' W_x \right]}. \qquad (7.10) \]

With this the RSC is defined as

\[ RSC(x, h) = \hat{\sigma}^2(x)\, [1 + (r+1) V], \qquad (7.11) \]

where $V$ is the first diagonal element of the matrix $(X_x' W_x X_x)^{-1} (X_x' W_x^2 X_x)(X_x' W_x X_x)^{-1}$. $V^{-1}$ reflects the effective number of local data points. RSC admits the following interpretation. If $h$ is too large, then the bias is large and hence also $\hat{\sigma}^2(x)$. When the bandwidth is too small, then $V$ will be large. Therefore RSC protects against extreme choices of $h$. The minimizer of $E[RSC(x, h)]$ can be approximated by

\[ h_{n,0}(x) = \left[ \frac{a_0\, \sigma^2(x)}{2 C_r\, \beta_{r+1}^2\, n f(x)} \right]^{1/(2r+3)}, \qquad (7.12) \]

where $a_0$ denotes the first diagonal element of the matrix $S^{-1} S^* S^{-1}$, i.e. $a_0 = \int \tilde{K}^2(u)\, du$, and $C_r = \mu_{2r+2} - c_r' S^{-1} c_r$, with the definitions given in Section 5 and $\beta_{r+1} = m^{(r+1)}(x)/(r+1)!$. $h_{n,0}(x)$ differs from the optimal bandwidth in (7.2) by an adjusting constant which only depends on $r$, $j$ and the kernel used. Hence the latter one can be evaluated as

\[ h_n(x) = Adj_{r,j}\, h_{n,0}(x), \qquad (7.13) \]

where

\[ Adj_{r,j} = \left[ \frac{(2j+1)\, C_r \int \big(\tilde{K}_{(j)}(u)\big)^2\, du}{(r+1-j) \left\{ \int u^{r+1} \tilde{K}_{(j)}(u)\, du \right\}^2 \int \tilde{K}(u)^2\, du} \right]^{1/(2r+3)}. \]

For the Epanechnikov and the Gaussian kernel these constants are tabulated for various $r$ and $j$ in Fan and Gijbels (1996).

For a global bandwidth the minimizer $\hat{h}$ of the integrated RSC,

\[ IRSC(h) = \int RSC(x, h)\, dx, \]

is taken, which in practice breaks down to evaluating a mean over certain grid points $x_{i_1}, \ldots, x_{i_m}$. $\hat{h}$ is also selected from among a number of grid points in an interval $[h_{\min}, h_{\max}]$. The global bandwidth is then given by
\[ \hat{h}_{j,r} = Adj_{j,r}\, \hat{h}. \qquad (7.14) \]

The RSC criterion also suffers from having a low convergence rate. Therefore the following refined bandwidth selection procedure is suggested. It is a double smoothing (DS) procedure. The pilot smoothing consists of fitting a polynomial of order $r + 2$ and selecting $\hat{h}_{j,r}$ as above. With the bandwidth $\hat{h}_{r+1, r+2}$, estimates $\hat{\beta}_{r+1}$, $\hat{\beta}_{r+2}$ and $\hat{\sigma}^2(x)$ are evaluated. With these pilot estimates, in a second stage the

\[ \widehat{MSE}_{j,r}(x, h) = \widehat{\text{Bias}}_{j,r}^2(x) + \widehat{\text{Var}}_{j,r}(x) \]

is evaluated, where $\widehat{\text{Bias}}_{j,r}(x)$ denotes the $(j+1)$-th element of the estimated bias vector and $\widehat{\text{Var}}_{j,r}(x)$ is the $(j+1)$-th diagonal element of the matrix $(X_x' W_x X_x)^{-1}(X_x' W_x^2 X_x)(X_x' W_x X_x)^{-1}\, \hat{\sigma}^2(x)$. With $S_{n,l} = \sum_{i=1}^{n} K\left(\frac{x_i - x}{h}\right)(x_i - x)^l$, the bias vector is estimated by

\[ \hat{b}_r(x) = (X_x' W_x X_x)^{-1} \begin{pmatrix} \hat{\beta}_{r+1} S_{n,r+1} + \hat{\beta}_{r+2} S_{n,r+2} \\ \vdots \\ \hat{\beta}_{r+1} S_{n,2r+1} + \hat{\beta}_{r+2} S_{n,2r+2} \end{pmatrix}. \]

In order to avoid collinearity effects it is suggested to modify the vector on the right side by putting $S_{n,r+3} = \ldots = S_{n,2r+2} = 0$, which yields

\[ \hat{b}_r(x) = (X_x' W_x X_x)^{-1} \begin{pmatrix} \hat{\beta}_{r+1} S_{n,r+1} + \hat{\beta}_{r+2} S_{n,r+2} \\ \hat{\beta}_{r+1} S_{n,r+2} \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \]

The global refined bandwidth selector is then given by the minimizer $\hat{h}_{j,r}^R$ of

\[ \int \widehat{MSE}_{j,r}(x, h)\, dx. \qquad (7.15) \]

This refined technique leads to an important improvement over the RSC bandwidth selector. For a balanced design, i.e. for equally spaced $x$ values, Heiler and Feng (1998) propose a simple double smoothing procedure, where in the pilot estimation step the R-criterion
is used. In Feng and Heiler (1998b) a further improvement of this proposal can be found, where a variance estimator based on the bootstrap idea is used. Equally spaced $x$ values are for instance given in a time series setting where the regressor is the time index or a function of the time index. This kind of smoothing will be discussed in the next section.

For order selection in a time series autoregression model with $x_t = (z_{t-1}, \ldots, z_{t-p})$ and $\hat{m}_{-t}(x)$ being the leave-one-out estimator according to (5.7), Cheng and Tong (1992) use the cross validation criterion

\[ CV(p) = (n - r + 1)^{-1} \sum_{t} \big[ z_t - \hat{m}_{-t}(x_t) \big]^2 w(x_t), \qquad (7.16) \]

where $w$ is a weight function to avoid boundary effects. Due to the curse of dimensionality problem it may be advisable not to take all lagged values $z_{t-1}, \ldots, z_{t-p}$ into account, but to look for a subset of lagged values which yields the best forecasts. For a lag constellation $x_t(i) = (z_{t-i_1}, \ldots, z_{t-i_p})'$, Tjøstheim and Auestad (1994) propose to use the final prediction error

\[ FPE(x_t(i)) = n^{-1} \sum_{t} \big[ z_t - \hat{m}(x_t(i)) \big]^2 f(i), \qquad (7.17) \]

where the factor

\[ f(i) = \frac{1 + (n h^p)^{-1}\, \nu_0^p\, b_p(i)}{1 - (n h^p)^{-1}\, [2 K^p(0) - \nu_0^p]\, b_p(i)}, \qquad \nu_0 = \int K^2(u)\, du, \]

and

\[ b_p(i) = n^{-1} \sum_{t} w^2(x_t(i)) \big/ \hat{f}(x_t(i)), \]

with $\hat{f}(x_t(i))$ being a multivariate kernel density estimator. The FPE in (7.17) is essentially a sum of squares of one-step ahead prediction errors, multiplied by a factor that penalizes small bandwidths and a large order $p$.
8 Time series decomposition with locally weighted regression

As already mentioned in Section 3, if $x_t$ is the time index itself or a polynomial in $t$, then we arrive at trend smoothing. In a simple trend model $z_t = m(t) + a_t$, the considerations at the beginning of Section 5 deliver an estimator of the smooth trend function or its derivatives. Now the matrix $X_t$ has the rows $(1,\ s - t,\ \ldots,\ (s-t)^r)$ for $s = 1, \ldots, n$, and $W_t = \text{diag}\big(K(\frac{s-t}{h})\big)$. As an interesting fact one can easily see that in the interior of the time series, i.e. for $h \leq t \leq n - h$, the weights given in (5.8),

\[ w_{n,t}^j(s) = e_{j+1}' (X_t' W_t X_t)^{-1} \big(1,\ s - t,\ \ldots,\ (s-t)^r\big)' K\left(\frac{s-t}{h}\right), \]

are shift invariant in the sense $w_{n,t+1}^j(s+1) = w_{n,t}^j(s)$. This means that in the interior of the time series the local polynomial fit works like a moving average. But the big advantage over other trend smoothing techniques lies in the automatic boundary adaptation of the procedure.

This property makes the idea of extending the local regression approach to so-called unobserved components models very appealing. Nonparametric estimation of trend-cyclical movements and of seasonal variations and their separation by local regression represents an interesting alternative to procedures based on parametric models like X-12 or TRAMO-SEATS. These involve extrapolation methods at either end of the time series in order to be able to estimate the components also in the boundary parts of a time series. This can lead to serious problems if unusual observations in the end parts of time series yield grossly erroneous forecasts. The latter problem will not appear with a local regression approach. Note also that with a data driven parameter selection the procedure works in a fully automatic way. The decomposition of a time series into trend-cyclical and seasonal components by LOcally WEighted Scatterplot Smoothing (LOWESS) was suggested by Cleveland et al. (1990). The procedure discussed here differs from their procedure in essential features.

We consider the additive (unobserved) components model

\[ z_t = T(t) + S(t) + a_t, \quad t = 1, 2, \ldots \qquad (8.1) \]
For the sake of simplicity we assume that $\{a_t\}$ is a white noise sequence with mean zero and constant variance $\sigma^2$. $T(t)$ represents the trend-cyclical and $S(t)$ the seasonal component. The usual assumption with respect to $T$ is that it has certain smoothness properties, so that the considerations at the beginning of Section 5 apply, leading to a local polynomial representation of order $r$. With respect to the seasonal variations the usual assumption is that they show a similar pattern from one seasonal period to the next, but they are allowed to vary slowly in the course of time. Hence a natural assumption is that they can locally be approximated by a Fourier series containing the seasonal frequency and its harmonics,

\[ S(s) = \sum_{j=1}^{q} \big[ \alpha_j(t) \cos 2\pi j \lambda (s - t) + \beta_j(t) \sin 2\pi j \lambda (s - t) \big], \qquad (8.2) \]

where $\lambda$ is the seasonal frequency, $\lambda = 1/P$, and $P$ is the period of the season. Of course $q \leq P/2$ (and for $q = P/2$ the last sine term has to be omitted). Let

\[ u_t(s) = \big( \cos 2\pi\lambda(s-t),\ \sin 2\pi\lambda(s-t),\ \ldots,\ \cos 2\pi q\lambda(s-t),\ \sin 2\pi q\lambda(s-t) \big)', \]
\[ \gamma(t) = \big( \alpha_1(t), \beta_1(t), \ldots, \alpha_q(t), \beta_q(t) \big)'. \]

Then $S(s) = \gamma(t)' u_t(s)$. With the local polynomial representation for the trend-cyclical part,

\[ T(s) = \sum_{j=0}^{r} \beta_j(t)(s-t)^j = \beta(t)' x_t(s), \]

where $\beta(t) = (\beta_0(t), \ldots, \beta_r(t))'$ and $x_t(s) = (1,\ s-t,\ \ldots,\ (s-t)^r)'$, the local least squares criterion is

\[ \sum_{s=1}^{n} \big[ z_s - \beta(t)' x_t(s) - \gamma(t)' u_t(s) \big]^2 K\left(\frac{s-t}{h}\right). \qquad (8.3) \]

With the design matrices $X_{1t}$ with rows $x_t(s)'$, $X_{2t}$ with rows $u_t(s)'$, $X_t = (X_{1t} \,\vdots\, X_{2t})$, the composed vector $\delta(t)' = (\beta(t)', \gamma(t)')$ and the weight matrix $W_t = \text{diag}\big(K(\frac{s-t}{h})\big)$, the solution is
\[ \hat{\delta}(t) = (X_t' W_t X_t)^{-1} X_t' W_t y \qquad (8.4) \]
\[ \hat{T}(t) = e_1' (X_t' W_t X_t)^{-1} X_t' W_t y \qquad (8.5) \]
\[ \hat{S}(t) = (o',\ 1_s') (X_t' W_t X_t)^{-1} X_t' W_t y, \qquad (8.6) \]

where $y = (z_1, \ldots, z_n)'$, $o'$ is a row of zeroes of length $r + 1$ and $1_s'$ is a row vector of length $2q$ with entries $1_s' = (1, 0, 1, 0, \ldots, 1, 0)$. It picks out the $\hat{\alpha}_j(t)$ pertaining to the cosine terms in $\hat{S}(t)$. The estimator for the $j$-th derivative $T^{(j)}$ of $T$ is

\[ \hat{T}^{(j)}(t) = j!\, e_{j+1}' (X_t' W_t X_t)^{-1} X_t' W_t y. \qquad (8.7) \]

All the above estimators work as moving averages in the interior part of the time series and have, for $r - j$ odd, the simple boundary adaptation property discussed in Section 5. The decomposition $\hat{m}(t) = \hat{T}(t) + \hat{S}(t)$ is not unique, since the matrix $X_t' W_t X_t$ is not block diagonal. This could of course be achieved by an orthogonalization procedure, but this seems not to be compelling for practical purposes. We call the above decomposition a natural decomposition. (A small sketch of this joint fit is given below.)

For parameter selection, first a decision has to be made about the degree of the trend polynomial $T$ and the trigonometric polynomial $S$. Since the seasonal variations are involved in the local approach, the bandwidths should be such that at least three to five periods of the season are included. In order to achieve this, the modelization of $T$ should be rather flexible. Hence for the interior part of the time series the polynomial degree $r = 3$ may be preferable to the choice $r = 1$. A data driven choice for a joint selection of $r$ and bandwidth is a very difficult task, since the two parameters are highly correlated. A higher $r$ allows a larger bandwidth and vice versa. In our experience collected so far, a data driven procedure for the interior part always opted for the highest allowed degree $r_{\max}$ that was put beforehand, even if the MSE criterion included a penalty term for overparameterization. As far as the trigonometric polynomial is concerned, all harmonic terms should be included, unless an inspection of the periodogram or the estimated spectrum reveals that one or even more of the seasonal frequencies can be omitted.

After this preselection of parameters a procedure for bandwidth selection is needed. Since for an equidistant time series the "design density" $f$ is a constant, the procedure is somewhat simpler than in the general situation discussed in Section 7. A variant of a double smoothing procedure is recommended.
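To make the joint fit (8.3)-(8.6) concrete, the sketch below assembles the combined design matrix from polynomial and Fourier columns and reads off $\hat{T}(t)$ and $\hat{S}(t)$; names and argument conventions are illustrative.

```python
import numpy as np

def trend_seasonal_fit(z, t, h, r, q, P, kernel):
    """Local fit of the components model (8.1) at time t, cf. (8.3)-(8.6).

    Returns (T_hat(t), S_hat(t)) from one weighted least squares fit with
    polynomial columns for the trend and Fourier columns for the season.
    """
    n = len(z)
    s = np.arange(1, n + 1)
    lam = 1.0 / P                                   # seasonal frequency
    X1 = np.vander(s - t, N=r + 1, increasing=True)           # trend part
    X2 = np.column_stack(
        [f(2 * np.pi * j * lam * (s - t)) for j in range(1, q + 1)
         for f in (np.cos, np.sin)])                           # seasonal part
    X = np.hstack([X1, X2])
    w = kernel((s - t) / h)
    XtW = X.T * w
    delta = np.linalg.solve(XtW @ X, XtW @ z)       # delta_hat(t) as in (8.4)
    T_hat = delta[0]                                # (8.5)
    S_hat = delta[r + 1::2].sum()                   # cosine coefficients, (8.6)
    return T_hat, S_hat
```

As noted above, `h` should cover several seasonal periods so that the Fourier columns are identified and the local matrix stays well conditioned.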
For parameter selection, a decision first has to be made about the degree of the trend polynomial $T$ and the trigonometric polynomial $S$. Since the seasonal variations are involved in the local approach, the bandwidths should be such that at least three to five periods of the season are included. In order to achieve this, the modelization of $T$ should be rather flexible. Hence for the interior part of the time series the polynomial degree $r = 3$ may be preferable to the choice $r = 1$. A data-driven joint selection of $r$ and the bandwidth is a very difficult task, since the two parameters are highly correlated: a higher $r$ allows a larger bandwidth and vice versa. In our experience collected so far, a data-driven procedure for the interior part always opted for the highest allowed degree $r_{\max}$ that was put beforehand, even if the MSE criterion included a penalty term for overparameterization. As far as the trigonometric polynomial is concerned, all harmonic terms should be included, unless an inspection of the periodogramme or of the estimated spectrum reveals that one or more of the seasonal frequencies can be omitted.

After this preselection of parameters, a procedure for bandwidth selection is needed. Since for an equidistant time series the "design density" $f$ is constant, the procedure is somewhat simpler than in the general situation discussed in section 7. A variant of a double smoothing procedure is recommended. In the pilot stage a polynomial of degree $r + 2$ is fitted and the bandwidth is selected with the Rice criterion with respect to $\hat m = \hat T + \hat S$. But due to the seasonal variations the difference-based variance estimator (7.8) has to be altered. Heiler and Feng (1996) and Feng (1998) propose a seasonal difference-based variance estimator of the form in (7.9), where not only a local linear function but also a local periodic function is allowed for. An example for monthly data ($P = 12$) is

$$D_{12}^{26} = c^{-1}\{-1, 2, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, -4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 2, -1\},$$

where $c$ is determined such that $\sum_{j=0}^{26} d_j^2 = 1$. $D_{12}^{26}$ annihilates a local linear trend and a local periodic function with periodicity $P = 12$. Similar sequences can easily be constructed.
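For concreteness, the following sketch builds $D_{12}^{26}$ and turns it into a variance estimate. That the estimator simply averages the squared filtered values over the series, as in ordinary difference-based variance estimation, is an assumption made here for illustration; the exact form used by Heiler and Feng (1996) and Feng (1998) may differ in details.

import numpy as np

d = np.zeros(27)                      # unnormalized weights of D_12^26:
d[[0, 1, 2]] = [-1.0, 2.0, -1.0]      # second differences at lags 0,
d[[12, 13, 14]] = [2.0, -4.0, 2.0]    # 12 and
d[[24, 25, 26]] = [-1.0, 2.0, -1.0]   # 24
d /= np.sqrt(np.sum(d**2))            # the factor c^{-1}: sum_j d_j^2 = 1

t = np.arange(27)  # the filter annihilates a local linear trend and any period-12 function:
assert abs(np.dot(d, 2.0 + 0.5 * t)) < 1e-12
assert abs(np.dot(d, np.cos(2.0 * np.pi * t / 12.0))) < 1e-12

def sigma2_seasonal_diff(z):
    # sum_j d_j z_{t+j} for every admissible t, then average the squares
    filtered = np.convolve(z, d[::-1], mode="valid")
    return np.mean(filtered**2)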
Let $\hat\sigma_G^2$ denote the resulting variance estimator and let $g$ be the minimizer of the $R$-criterion (7.6); the corresponding estimator is denoted by $\hat m_g = \hat T_g + \hat S_g$. For an arbitrary bandwidth $h$ the weights $w_t(s)$ for estimating $\hat T_h(t) + \hat S_h(t)$ are the components of the vector $(1, 0, \ldots, 0, s')(X_t'W_tX_t)^{-1}X_t'W_t$, where for $W_t$ a kernel with bandwidth $h$ is taken. Using the pilot estimates $\hat m_g(t)$, the bias part of the MSE at $t$ for an estimator with bandwidth $h$ is estimated by

$$\widehat{\operatorname{Bias}}(\hat m_h(t)) = \sum_{s=1}^{n} w_t(s)\hat m_g(s) - \hat m_g(t),$$

which yields for the bias part of the mean averaged squared error $\operatorname{MASE}(h)$

$$B(h) = n^{-1}\sum_{t=1}^{n}\widehat{\operatorname{Bias}}^2(\hat m_h(t)) = n^{-1}\sum_{t=1}^{n}\left(\sum_{s=1}^{n} w_t(s)\hat m_g(s) - \hat m_g(t)\right)^2. \qquad (8.8)$$

The variance part is estimated by

$$V(h) = n^{-1}\hat\sigma^2\sum_{t=1}^{n}\sum_{s=1}^{n} w_t(s)^2, \qquad (8.9)$$

where $\hat\sigma^2$ should be a suitable root-$n$ consistent estimator of $\sigma^2$. After the first pilot step, a minimizer $\tilde h$ of the criterion

$$\operatorname{MASE}(h) = B(h) + V(h) \qquad (8.10)$$

is evaluated over a grid, where in this second step the estimator $\hat\sigma_G^2$ is used in $V(h)$. The second step already leads to a considerable improvement over the simple $R$-criterion, but the estimator $\hat\sigma_G^2$ is still not very good. Hence an improved estimation with a lower polynomial degree and a bandwidth $g_v$ larger than $g$ is proposed; for details see Feng and Heiler (1998). According to the considerations therein, an estimator for $g_v$ can easily be found by multiplying the minimizer $\tilde h$ of (8.10) by a correction factor that depends only on the kernel used and on the polynomial degree $r$: $\hat g_v = \operatorname{CF}_r\,\tilde h$. For instance, we get for the Epanechnikov kernel $\operatorname{CF}_1 = 1.431$ and $\operatorname{CF}_3 = 1.291$, for the bisquare kernel $\operatorname{CF}_1 = 1.451$ and $\operatorname{CF}_3 = 1.300$, and for the Gaussian kernel $\operatorname{CF}_1 = 1.489$ and $\operatorname{CF}_3 = 1.305$; see Table 5.1 in Müller (1988) or Table 1 in Feng and Heiler (1998). Let now $\hat m_{g_v} = \hat T_{g_v} + \hat S_{g_v}$ be an estimator with bandwidth $g_v$. Then an improved variance estimator is obtained by taking the mean squared residuals,

$$\hat\sigma_B^2 = n^{-1}\sum_{t=1}^{n}\left[z_t - \hat m_{g_v}(t)\right]^2. \qquad (8.11)$$

In a third step this variance estimator is plugged into (8.9) for $\hat\sigma^2$, and with it a minimizer of the MASE (8.10) is evaluated again. In principle this procedure can be iterated several times, where in the next step a new bias estimator is evaluated with a polynomial of degree $r + 2$.
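The three-step selection just described can be summarized in a few lines. The sketch below is again only illustrative: hat_weights() rebuilds the weight vector $(1, 0, \ldots, 0, s')(X_t'W_tX_t)^{-1}X_t'W_t$ with an Epanechnikov kernel, m_pilot stands for the pilot estimates $\hat m_g(t)$ from the degree-$(r+2)$ pilot fit, the grid and the rounding of $\hat g_v$ are our choices, and for brevity the $g_v$-fit reuses the same polynomial degree although the text proposes a lower one for that step.

import numpy as np

def hat_weights(n, t, h, r=3, P=12, q=6):
    # Weights w_t(s): the row (1, 0, ..., 0, s')(X_t' W_t X_t)^{-1} X_t' W_t
    d = np.arange(n) - t
    cols = []
    for j in range(1, q + 1):
        cols.append(np.cos(2.0 * np.pi * j * d / P))
        cols.append(np.sin(2.0 * np.pi * j * d / P))
    X2 = np.column_stack(cols)
    if 2 * q == P:
        X2 = X2[:, :-1]                        # omit the vanishing sine term
    X = np.hstack([np.vander(d, r + 1, increasing=True), X2])
    w = np.where(np.abs(d) <= h, 0.75 * (1.0 - (d / h)**2), 0.0)
    c = np.zeros(X.shape[1])
    c[0] = 1.0                                 # picks T(t)
    c[r + 1::2] = 1.0                          # picks the cosine terms of S(t)
    return c @ np.linalg.solve((X.T * w) @ X, X.T * w)

def mase(n, h, m_pilot, sigma2):
    # MASE(h) = B(h) + V(h) of (8.8)-(8.10), given pilot estimates m_pilot
    B = V = 0.0
    for t in range(n):
        w = hat_weights(n, t, h)
        B += (w @ m_pilot - m_pilot[t])**2     # squared bias estimate (8.8)
        V += np.sum(w**2)                      # variance factor of (8.9)
    return (B + sigma2 * V) / n

def select_bandwidth(z, m_pilot, sigma2_G, grid, CF_r=1.291):
    n = len(z)
    # Second step: minimize MASE over the grid with sigma2_G in V(h)
    h_tilde = min(grid, key=lambda h: mase(n, h, m_pilot, sigma2_G))
    # Improved pilot bandwidth g_v = CF_r * h_tilde; residual variance (8.11)
    g_v = int(round(CF_r * h_tilde))
    m_gv = np.array([hat_weights(n, t, g_v) @ z for t in range(n)])
    sigma2_B = np.mean((z - m_gv)**2)
    # Third step: minimize MASE again with the improved variance estimate
    return min(grid, key=lambda h: mase(n, h, m_pilot, sigma2_B))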
The above described procedure yields a bandwidth $h$ for the interior part of the time series, where after the selection the interior is given by $[h + 1,\ n - h]$. As described in section 5, the procedure automatically adapts towards the boundaries. But, as also described there, due to the increasing variance the MSE will increase as well, particularly if $r = 3$ is chosen, as was recommended at the beginning of this section. One possibility to at least partly compensate for that is to switch to a nearest neighbour estimator in the boundary area, that is, to keep the total bandwidth $h_T = 2h + 1$ constant at both ends of the time series. This means that for estimating from $t = n - h + 1$ to $t = n$ the same local neighbourhood is used (and similarly for the left boundary). Instead of, or in addition to, that a switch from a local polynomial of order 3 to a local linear approach (for $T$) may be recommended whenever the MSE for $r = 1$ becomes smaller than that for $r = 3$. In order to do that, for the given bandwidth and the asymmetric neighbourhood situation at each time point in the boundary area, the MSEs for $r = 3$ and $r = 1$ have to be evaluated with the corresponding active weighting systems according to the procedure described above. As soon as $\operatorname{MSE}_1 < \operatorname{MSE}_3$, a local linear approach is chosen for $T$ and maintained to the end point. According to the practical experience collected so far, such a switch happened to come into effect close to the end points in almost all cases.

In Figures 8.1 and 8.2 we present two examples where the discussed decomposition procedure is applied. The first time series is the quarterly series of the German GDP from 1968 to 1994. In the top panel of Figure 8.1 the time series itself and the estimated trend-cyclical component are exhibited, in the middle panel the estimated seasonal component is shown, and in the bottom panel the first derivative of the trend-cyclical component is exhibited. This latter picture clearly shows the temporary boom after the German reunification. The double smoothing procedure with bootstrap variance estimator selected the bandwidth $h = 11$. The polynomial degree was two for estimating the first derivative and three for the other estimations. The second example, presented in Figure 8.2, shows corresponding results for the monthly series of the German unemployment rates (in per cent) from January 1977 to April 1995. Here the selected bandwidth is $h = 21$; the polynomial degrees are the same as in the previous example.
Figure 8.1. Decomposition results for the time series of the German GDP from 1968 to 1994: (a) the data and $\hat T$, (b) $\hat S$ and (c) $\hat T'$.
Figure 8.2. Decomposition results for the time series of the German unemployment rates (in %) from January 1977 to April 1995: (a) the data and $\hat T$, (b) $\hat S$ and (c) $\hat T'$.

Cleveland (1979) proposed an iterative robust locally weighted regression in a general regression context, and in Cleveland et al. (1990) this idea is also used in time series decomposition. It can easily be adapted to the procedure discussed here, although in their proposal the subseries of equal weeks, months, quarters etc. are treated separately. The idea consists in looking at the residuals $r_t = z_t - \hat m(t)$ of a first, nonrobust procedure and evaluating a robust scale measure for the residuals; Cleveland suggests taking the median of the $|r_t|$. Since in many time series the variability differs between the periods within the season, depending on the size of the seasonal component, it seems reasonable to evaluate different scale measures $\tilde\sigma_i$ for the different periods of the season. For $t = 1, \ldots, n$ let $j = \left[\frac{t-1}{P}\right] + 1$ be the year index, $j = 1, \ldots, J = \left[\frac{n-1}{P}\right] + 1$, where $[\,\cdot\,]$ denotes the integer part, and let $i = t - P(j-1)$ be the season index, i.e. $z_t \rightarrow z_{ij}$. Then for all $i = 1, \ldots, P$ a robust scale measure $\tilde\sigma_i = \operatorname{median}_j(|r_{ij}|)$ is evaluated. From this, so-called robustness weights are derived, which according to Cleveland's proposal are given by

$$\delta_{ij} = K\!\left(\frac{r_{ij}}{6\tilde\sigma_i}\right),$$

where $K$ is a kernel function (the bisquare kernel is suggested). In a second step the local estimation procedure is repeated, where the neighbourhood weights $k_{st} = K\!\left(\frac{s-t}{h}\right)$ in the diagonal weight matrices $W_t$ are multiplied by the corresponding robustness weights $\delta_{ij}$, with $i$ and $j$ the season and year index corresponding to $s$. Of course, with the time dependent robustness weights the procedure is no longer shift invariant, so that the least squares solution has to be evaluated explicitly for each $t$. Starting from the new residuals, the procedure can be iterated until the estimates stabilize. Since the robustness weights change the active kernels, different bandwidths should be used in each iteration step. Cleveland (1979) claimed that two robust iterations should be adequate for almost all situations; in Feng (1998), with a stability criterion, a higher number of iteration steps occurred in most cases.
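A sketch of this robust iteration, with one scale measure per period of the season, might look as follows. The callable fit(z, rho), which must recompute the natural decomposition with the neighbourhood weights multiplied by the robustness weights rho, is a hypothetical stand-in for the estimator of this section, and the fixed two passes follow Cleveland's recommendation.

import numpy as np

def bisquare(u):
    # Bisquare kernel, suggested by Cleveland for the robustness weights
    return np.where(np.abs(u) < 1.0, (1.0 - u**2)**2, 0.0)

def robust_decomposition(z, fit, P=12, n_iter=2):
    # fit(z, rho) must return m_hat = T_hat + S_hat computed with the
    # neighbourhood weights k_st multiplied by the robustness weights rho
    n = len(z)
    season = np.arange(n) % P         # season index i (0-based here)
    rho = np.ones(n)
    for _ in range(n_iter):
        r = z - fit(z, rho)           # residuals of the current fit
        # one robust scale measure per period of the season:
        scale = np.array([np.median(np.abs(r[season == i])) for i in range(P)])
        scale = np.maximum(scale, 1e-12)           # guard against zero scale
        rho = bisquare(r / (6.0 * scale[season]))  # delta_ij = K(r_ij / (6 scale_i))
    return fit(z, rho)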
References

[1] ABBERGER, K. (1996). Nichtparametrische Schätzung bedingter Quantile in Zeitreihen. Mit Anwendungen auf Finanzmarktdaten. Hartung-Gorre Verlag, Konstanz.

[2] ABBERGER, K. (1997). Quantile Smoothing in Financial Time Series. Statistical Papers, 38, 125-148.

[3] BONGARD, J. (1960). Some Remarks on Moving Averages. In: O.E.C.D. (editor), Seasonal Adjustment on Electronic Computers. Proceedings of an international conference held in Paris, 361-387.

[4] CHEN, R. (1996). A Nonparametric Multi-step Prediction Estimator in Markovian Structures. Statistica Sinica, 6, 603-615.

[5] CHEN, R. and TSAY, R.S. (1993). Functional-coefficient Autoregressive Models. J. Amer. Statist. Assoc., 88, 298-308.

[6] CHEN, R. and TSAY, R.S. (1993). Nonlinear Additive ARX Models. J. Amer. Statist. Assoc., 88, 955-967.

[7] CHENG, B. and TONG, H. (1992). On Consistent Non-parametric Order Determination and Chaos (with discussion). Journal of the Royal Statistical Society, Series B, 54, 427-474.
[8] CLEVELAND, R.B., CLEVELAND, W.S., McRAE, J.E. and TERPENNING, I. (1990). STL: A Seasonal-trend Decomposition Procedure Based on Loess (with discussion). Journal of Official Statistics, 6, 3-73.

[9] CLEVELAND, W.S. (1979). Robust Locally Weighted Regression and Smoothing Scatterplots. J. Amer. Statist. Assoc., 74, 829-836.

[10] COLLOMB, G. (1980). Estimation Nonparamétrique de Probabilités Conditionnelles. Comptes Rendus de l'Académie des Sciences de Paris, 291, Série A, 427-430.

[11] COLLOMB, G. (1983). From Nonparametric Regression to Nonparametric Prediction: Survey of the Mean Square Error and Original Results on the Predictogram. Lecture Notes in Statistics, 16, 182-204.

[12] COLLOMB, G. (1984). Propriétés de Convergence Presque Complète du Prédicteur à Noyau. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66, 441-460.

[13] COLLOMB, G. (1985). Nonparametric Time Series Analysis and Prediction: Uniform Almost Sure Convergence of the k-NN Autoregression Estimates. Statistics, 16, 297-307.

[14] EUBANK, R.L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.

[15] FAN, J. (1993). Local Linear Regression Smoothers and Their Minimax Efficiencies. Annals of Statistics, 21, 196-216.

[16] FAN, J. and GIJBELS, I. (1992). Variable Bandwidth and Local Linear Regression Smoothers. Annals of Statistics, 20, 2008-2036.

[17] FAN, J. and GIJBELS, I. (1995). Data-driven Bandwidth Selection in Local Polynomial Fitting: Variable Bandwidth and Spatial Adaptation. Journal of the Royal Statistical Society, Series B, 57, 371-394.

[18] FAN, J. and GIJBELS, I. (1996). Local Polynomial Modelling and its Applications. Chapman & Hall, London.

[19] FAN, J., HU, T.-C. and TRUONG, Y.K. (1994). Robust Non-parametric Function Estimation. Scandinavian Journal of Statistics, 21, 433-446.

[20] FAN, J., YAO, Q. and TONG, H. (1996). Estimation of Conditional Densities and Sensitivity Measures in Nonlinear Dynamic Systems. Biometrika, 83, 189-216.

[21] FENG, Y. (1998). Kernel- and Locally Weighted Regression with Application to Time Series Decomposition. Ph.D. Thesis, University of Konstanz.
[22] FENG, Y. and HEILER, S. (1998a). Locally Weighted Autoregression. In: R. Galata and H. Küchenhoff (editors), Econometrics in Theory and Practice. Festschrift for Hans Schneeweiß, 101-117.

[23] FENG, Y. and HEILER, S. (1998b). Bandwidth Selection Based on Bootstrap. Discussion Paper, University of Konstanz.

[24] FISHER, A. (1937). A Brief Note on Seasonal Variations. Journal of Accountancy, 64, 174.

[25] FRIEDMAN, J.H. (1991). Multivariate Adaptive Regression Splines (with discussion). Annals of Statistics, 19, 1-141.

[26] GASSER, T., KNEIP, A. and KÖHLER, W. (1991). A Flexible and Fast Method for Automatic Smoothing. J. Amer. Statist. Assoc., 86, 643-652.

[27] GASSER, T. and MÜLLER, H.G. (1979). Kernel Estimation of Regression Functions. In: Gasser and Rosenblatt (editors), Smoothing Techniques for Curve Estimation. Springer-Verlag, Heidelberg, 23-68.

[28] GASSER, T. and MÜLLER, H.G. (1984). Estimating Regression Functions and Their Derivatives by the Kernel Method. Scandinavian Journal of Statistics, 11, 171-185.

[29] GASSER, T., MÜLLER, H.G. and MAMMITZSCH, V. (1985). Kernels for Nonparametric Curve Estimation. Journal of the Royal Statistical Society, Series B, 47, 238-252.

[30] GASSER, T., SROKA, L. and JENNEN-STEINMETZ, C. (1986). Residual Variance and Residual Pattern in Nonlinear Regression. Biometrika, 73, 625-633.

[31] GOURIEROUX, CH. and MONFORT, A. (1992). Qualitative Threshold ARCH Models. Journal of Econometrics, 52, 159-199.

[32] HÄRDLE, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.

[33] HÄRDLE, W., HALL, P. and MARRON, J.S. (1992). Regression Smoothing Parameters That Are Not Far from Their Optimum. J. Amer. Statist. Assoc., 87, 227-233.

[34] HÄRDLE, W. and GASSER, T. (1984). Robust Non-parametric Function Fitting. Journal of the Royal Statistical Society, Series B, 46, 42-51.

[35] HÄRDLE, W., LÜTKEPOHL, H. and CHEN, R. (1997). A Review of Nonparametric Time Series Analysis. International Statistical Review, 65, 49-72.
[36] HÄRDLE, W. and TSYBAKOV, A.B. (1988). Robust Nonparametric Regression with Simultaneous Scale Curve Estimation. Annals of Statistics, 16, 120-135.

[37] HÄRDLE, W. and TSYBAKOV, A.B. (1998). Local Polynomial Estimators of the Volatility Function. To appear in Journal of Econometrics.

[38] HÄRDLE, W., TSYBAKOV, A.B. and YANG, L. (1997). Nonparametric Vector Autoregression. To appear in Journal of Statistical Planning and Inference.

[39] HÄRDLE, W. and YANG, L. (1996). Nonparametric Time Series Model Selection. Discussion Paper, Humboldt-Universität zu Berlin.

[40] HALL, P., KAY, J.W. and TITTERINGTON, D.M. (1990). Asymptotically Optimal Difference-based Estimation of Variance in Nonparametric Regression. Biometrika, 77, 521-528.

[41] HAMPEL, F.R., RONCHETTI, E.M., ROUSSEEUW, P.J. and STAHEL, W.A. (1986). Robust Statistics: The Approach Based on the Influence Function. Wiley, New York.

[42] HART, J.D. (1996). Some Automated Methods of Smoothing Time-dependent Data. Journal of Nonparametric Statistics, 6, 115-142.

[43] HASTIE, T.J. and TIBSHIRANI, R.J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability, 43, Chapman and Hall, London.

[44] HEILER, S. (1995). Zur Glättung saisonaler Zeitreihen. In: Rinne, H., Rüger, B. and Strecker, H. (editors), Grundlagen der Statistik und ihre Anwendungen. Festschrift für Kurt Weichselberger. Physica-Verlag, Heidelberg, 128-148.

[45] HEILER, S. and FENG, Y. (1996). Datengesteuerte Zerlegung saisonaler Zeitreihen. ifo Studien, 41-73.

[46] HEILER, S. and FENG, Y. (1998). A Simple Root n Bandwidth Selector for Nonparametric Regression. Journal of Nonparametric Statistics, 9, 1-21.

[47] HEILER, S. and FENG, Y. (1997). A Bootstrap Bandwidth Selector for Local Polynomial Fitting. Discussion Paper SFB 178, II-344, University of Konstanz.

[48] HEILER, S. and MICHELS, P. (1994). Deskriptive und Explorative Datenanalyse. Oldenbourg-Verlag, München.

[49] HORVÁTH, L. and YANDELL, B.S. (1988). Asymptotics of Conditional Empirical Processes. Journal of Multivariate Analysis, 26, 184-206.

[50] HUBER, P.J. (1981). Robust Statistics. Wiley, New York.
[51] JONES, H.L. (1943). Fitting of Polynomial Trends to Seasonal Data by the Method of Least Squares. J. Amer. Statist. Assoc., 38, 453.

[52] JONES, M.C. and HALL, P. (1990). Mean Squared Error Properties of Kernel Estimates of Regression Quantiles. Statistics & Probability Letters, 10, 283-289.

[53] KOENKER, R. and BASSETT, G. (1978). Regression Quantiles. Econometrica, 46, 33-50.

[54] KOENKER, R. and D'OREY, V. (1987). Computing Regression Quantiles. Applied Statistics, 36, 383-393.

[55] KOENKER, R., PORTNOY, S. and NG, P. (1992). Nonparametric Estimation of Conditional Quantile Functions. In: $L_1$-Statistical Analysis and Related Methods (ed. Y. Dodge), North-Holland, New York.

[56] MACAULAY, R.R. (1931). The Smoothing of Time Series. National Bureau of Economic Research, New York.

[57] MESSER, K. and GOLDSTEIN, L. (1993). A New Class of Kernels for Nonparametric Curve Estimation. Annals of Statistics, 21, 179-195.

[58] MICHELS, P. (1992). Nichtparametrische Analyse und Prognose von Zeitreihen. Physica-Verlag, Heidelberg.

[59] MÜLLER, H.-G. (1985). Empirical Bandwidth Choice for Nonparametric Kernel Regression by Means of Pilot Estimators. Statist. Decisions, Suppl. Issue 2, 193-206.

[60] MÜLLER, H.-G. (1988). Nonparametric Regression Analysis of Longitudinal Data. Springer-Verlag, Berlin.

[61] NADARAYA, E.A. (1964). On Estimating Regression. Theory of Probability and Its Applications, 9, 141-142.

[62] PRIESTLEY, M.B. and CHAO, M.T. (1972). Nonparametric Function Fitting. Journal of the Royal Statistical Society, Series B, 34, 385-392.

[63] RICE, J. (1983). Methods for Bandwidth Choice in Nonparametric Kernel Regression. In: J.E. Gentle (editor), Computer Science and Statistics: The Interface. North-Holland, Amsterdam, 186-190.

[64] RICE, J. (1984). Bandwidth Choice for Nonparametric Regression. Annals of Statistics, 12, 1215-1230.

[65] ROBINSON, P.M. (1983). Nonparametric Estimators for Time Series. Journal of Time Series Analysis, 4, 185-207.
[66] ROBINSON, P.M. (1986). On the Consistency and Finite-sample Properties of Nonparametric Kernel Time Series Regression, Autoregression and Density Estimators. Annals of the Institute of Statistical Mathematics, 38, A, 539-549.

[67] RUPPERT, D., SHEATHER, S.J. and WAND, M.P. (1995). An Effective Bandwidth Selector for Local Least Squares Regression. J. Amer. Statist. Assoc., 90, 1257-1270.

[68] RUPPERT, D. and WAND, M.P. (1994). Multivariate Locally Weighted Least Squares Regression. Annals of Statistics, 22, 1346-1370.

[69] SILVERMAN, B.W. (1984). Spline Smoothing: The Equivalent Variable Kernel Method. Annals of Statistics, 12, 898-916.

[70] SILVERMAN, B.W. (1985). Some Aspects of the Spline Smoothing Approach to Nonparametric Regression Curve Fitting (with discussion). Journal of the Royal Statistical Society, Series B, 47, 1-52.

[71] STONE, C.J. (1977). Consistent Nonparametric Regression (with discussion). Annals of Statistics, 5, 595-620.

[72] STUTE, W. (1984). Asymptotic Normality of Nearest Neighbor Regression Function Estimates. Annals of Statistics, 12, 917-926.

[73] STUTE, W. (1986). Conditional Empirical Processes. Annals of Statistics, 14, 638-647.

[74] TJØSTHEIM, D. and AUESTAD, B. (1994a). Nonparametric Identification of Nonlinear Time Series: Projections. J. Amer. Statist. Assoc., 89, 1398-1409.

[75] TJØSTHEIM, D. and AUESTAD, B. (1994b). Nonparametric Identification of Nonlinear Time Series: Selecting Significant Lags. J. Amer. Statist. Assoc., 89, 1410-1419.

[76] TSYBAKOV, A.B. (1986). Robust Reconstruction of Functions by the Local Approximation Method. Problems of Information Transmission, 22, 133-146.

[77] WAHBA, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.

[78] WAND, M.P. and JONES, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.

[79] WATSON, G.S. (1964). Smooth Regression Analysis. Sankhya, Ser. A, 26, 359-372.

[80] YAKOWITZ, S. (1979a). Nonparametric Estimation of Markov Transition Functions. Annals of Statistics, 7, 671-679.

[81] YAKOWITZ, S. (1979b). A Nonparametric Markov Model for Daily River Flow. Water Resources Research, 15, 1035-1043.
[82] YAKOWITZ, S. (1985). Markov Flow Models and the Flood Warning Problem. Water Resources Research, 21, 81-88.

[83] YANG, L. and HÄRDLE, W. (1996). Nonparametric Autoregression with Multiplicative Volatility and Additive Mean. Submitted to Journal of Time Series Analysis.

[84] YANG, S. (1981). Linear Functions of Concomitants of Order Statistics with Application to Nonparametric Estimation of a Regression Function. J. Amer. Statist. Assoc., 76, 658-662.

[85] YU, K. and JONES, M.C. (1998). Local Linear Quantile Regression. J. Amer. Statist. Assoc., 93, 228-237.